arXiv 2512.17126
DNAMotifTokenizer: Towards Biologically Informed Tokenization of Genomic Sequences
By Xiaoxiao Zhou, Zihan Wang, et al.
Published 2025-12-18
DNA language models have advanced genomics, but their downstream performance varies widely due to differences in tokenization, pretraining data, and architecture. We argue that a major bottleneck lies in tokenizing sparse and unevenly distributed DNA sequence motifs, which are critical for accurate and interpretable models. To investigate, we systematically benchmark k-mer and Byte-Pair Encoding (BPE) tokenizers und…