arXiv 2512.17126

DNAMotifTokenizer: Towards Biologically Informed Tokenization of Genomic Sequences

By Xiaoxiao Zhou, Zihan Wang, et al.

Published 2025-12-18

DNA language models have advanced genomics, but their downstream performance varies widely due to differences in tokenization, pretraining data, and architecture. We argue that a major bottleneck lies in tokenizing sparse and unevenly distributed DNA sequence motifs, which are critical for accurate and interpretable models. To investigate, we systematically benchmark k-mer and Byte-Pair Encoding (BPE) tokenizers und…
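
To make the motif-fragmentation concern concrete, here is a minimal sketch of plain k-mer tokenization. It is not the paper's tokenizer or benchmark code. The function name `kmer_tokenize`, the `stride` parameter, and the toy sequence are assumptions chosen for illustration only.

```python
def kmer_tokenize(seq: str, k: int = 3, stride: int = 1) -> list[str]:
    """Split a DNA sequence into k-mers.

    stride=1 gives overlapping k-mers; stride=k gives non-overlapping ones.
    Trailing bases that do not fill a whole k-mer are dropped.
    """
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


# Hypothetical toy input: a TATA-box-like motif embedded in flanking bases.
motif = "TATAAA"
seq = "GC" + motif + "GG"  # "GCTATAAAGG"

overlapping = kmer_tokenize(seq, k=3, stride=1)
non_overlapping = kmer_tokenize(seq, k=3, stride=3)

print(overlapping)      # ['GCT', 'CTA', 'TAT', 'ATA', 'TAA', 'AAA', 'AAG', 'AGG']
print(non_overlapping)  # ['GCT', 'ATA', 'AAG']
```

In the non-overlapping case the six-base motif is split across three tokens, none of which aligns with the motif boundaries, and the final base is dropped. This is one simple way a fixed tokenization scheme can obscure a motif; a learned vocabulary such as BPE merges frequent substrings instead, so whether a motif survives as a unit depends on the training corpus.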
