arXiv 2512.17126

DNAMotifTokenizer: Towards Biologically Informed Tokenization of Genomic Sequences

By Xiaoxiao Zhou, Zihan Wang, et al.

Published 2025-12-18

Wiki summary

DNA language models have advanced genomics, but their downstream performance varies widely due to differences in tokenization, pretraining data, and architecture. We argue that a major bottleneck lies in tokenizing sparse and unevenly distributed DNA sequence motifs, which are critical for accurate and interpretable models. To investigate, we systematically benchmark k-mer and Byte-Pair Encoding (BPE) tokenizers und…
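As a rough illustration of the k-mer tokenization mentioned above, the sketch below splits a DNA string into fixed-length windows. The function name `kmer_tokenize`, the choice of `k`, and the `stride` parameter are assumptions made for this example; the summary does not state which k-mer settings the paper benchmarks.

```python
def kmer_tokenize(seq: str, k: int = 6, stride: int = 1) -> list[str]:
    """Split a DNA sequence into k-mers, starting a new window every `stride` bases.

    stride=1 gives overlapping k-mers; stride=k gives non-overlapping k-mers.
    Trailing bases that do not fill a whole window are dropped (an illustrative
    choice, not necessarily what any particular model does).
    """
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


# Hypothetical toy sequence, chosen only to show the two windowing modes.
overlapping = kmer_tokenize("ACGTACGT", k=3, stride=1)
non_overlapping = kmer_tokenize("ACGTACGT", k=3, stride=3)
```

With a fixed window like this, a motif that starts off the window grid in the non-overlapping mode is split across two tokens, which is one way a fixed-length scheme can fragment the motifs the abstract highlights.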