arXiv:1910.01108

DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

By Victor Sanh, Lysandre Debut, et al.

Published 2019-10-02

As transfer learning from large-scale pre-trained models becomes more prevalent in Natural Language Processing (NLP), operating these large models on the edge and/or under constrained computational training or inference budgets remains challenging. In this work, we propose a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can then be fine-tuned with good perfo…
