arXiv 2512.17126
DNAMotifTokenizer: Towards Biologically Informed Tokenization of Genomic Sequences
By Xiaoxiao Zhou, Zihan Wang, et al.
Published 2025-12-18
DNA language models have advanced genomics, but their downstream performance varies widely due to differences in tokenization, pretraining data, and architecture. We argue that a major bottleneck lies in tokenizing sparse and unevenly distributed DNA sequence motifs, which are critical for accurate and interpretable models. To investigate, we systematically benchmark k-mer and Byte-Pair Encoding (BPE) tokenizers und…