arXiv 1910.01108
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
By Victor Sanh, Lysandre Debut, et al.
Published 2019-10-02
As Transfer Learning from large-scale pre-trained models becomes more prevalent in Natural Language Processing (NLP), operating these large models on the edge and/or under constrained computational training or inference budgets remains challenging. In this work, we propose a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can then be fine-tuned with good perfo…