arXiv 2512.17126

DNAMotifTokenizer: Towards Biologically Informed Tokenization of Genomic Sequences

By Xiaoxiao Zhou, Zihan Wang, et al.

Published 2025-12-18

DNA language models have advanced genomics, but their downstream performance varies widely due to differences in tokenization, pretraining data, and architecture. We argue that a major bottleneck lies in tokenizing sparse and unevenly distributed DNA sequence motifs, which are critical for accurate and interpretable models. To investigate, we systematically benchmark k-mer and Byte-Pair Encoding (BPE) tokenizers und…
