This is a PyTorch pre-trained model obtained by converting the TensorFlow checkpoint found in the official Google BERT repository. This model is one of the smaller pre-trained BERT variants, along with bert-tiny, bert-mini, and bert-medium. They were introduced in the study "Well-Read Students Learn Better: On the Importance of Pre-training Compact Models" and ported to Hugging Face for the study "Generalization in NLI: Ways (Not) To Go Beyond Simple Heuristics". These models are intended to be fine-tuned on a downstream task.
If you use the model, please consider citing both papers:
```bibtex
@misc{bhargava2021generalization,
  title={Generalization in NLI: Ways (Not) To Go Beyond Simple Heuristics},
  author={Prajjwal Bhargava and Aleksandr Drozd and Anna Rogers},
  year={2021},
  eprint={2110.01518},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

@article{DBLP:journals/corr/abs-1908-08962,
  author     = {Iulia Turc and Ming{-}Wei Chang and Kenton Lee and Kristina Toutanova},
  title      = {Well-Read Students Learn Better: The Impact of Student Initialization on Knowledge Distillation},
  journal    = {CoRR},
  volume     = {abs/1908.08962},
  year       = {2019},
  url        = {http://arxiv.org/abs/1908.08962},
  eprinttype = {arXiv},
  eprint     = {1908.08962},
  timestamp  = {Thu, 29 Aug 2019 16:32:34 +0200},
  biburl     = {https://dblp.org/rec/journals/corr/abs-1908-08962.bib},
  bibsource  = {dblp computer science bibliography, https://dblp.org}
}
```
The BERT Small model is based on the transformer architecture. Specifically, it is a smaller variant of the original BERT model, with 4 layers (L=4) and 512 hidden units (H=512). This compact design allows for efficient training and inference while maintaining good performance on various natural language processing tasks. BERT models are encoder-only models suited to capturing contextual meaning in text.
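As a sketch, the L=4, H=512 shape above can be instantiated with the Transformers library's `BertConfig`. The head count (8) and feed-forward size (2048) below follow the standard BERT scaling (H/64 heads, 4×H inner dimension) and are assumptions, not values stated in this card:

```python
# Sketch: build an encoder with the bert-small shape (L=4, H=512).
# num_attention_heads=8 and intermediate_size=2048 are assumed from the
# usual BERT scaling; the weights here are random, not the checkpoint.
from transformers import BertConfig, BertModel

config = BertConfig(
    num_hidden_layers=4,     # L = 4 transformer encoder layers
    hidden_size=512,         # H = 512 hidden units
    num_attention_heads=8,   # hidden_size must divide evenly by heads
    intermediate_size=2048,  # feed-forward inner dimension
)
model = BertModel(config)    # randomly initialized; load the checkpoint for real use
print(model.config.num_hidden_layers, model.config.hidden_size)  # → 4 512
```

For real use you would load the ported checkpoint with `BertModel.from_pretrained(...)` instead of random initialization.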
BERT Small models are optimized for various natural language understanding tasks. Their compact size makes them suitable for environments with limited compute and for fast fine-tuning on downstream tasks.
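A minimal fine-tuning sketch for one such downstream task, binary sequence classification, might look like the following. To stay self-contained it builds the bert-small shape from a config with dummy inputs; in practice you would load the pre-trained weights (the Hub id `prajjwal1/bert-small` in the comment is an assumption):

```python
# Sketch: one gradient step of fine-tuning a bert-small-shaped encoder on a
# binary classification task. Shapes and the Hub id are assumptions; in real
# use, load the pre-trained checkpoint instead of random initialization.
import torch
from transformers import BertConfig, BertForSequenceClassification

config = BertConfig(
    num_hidden_layers=4, hidden_size=512,
    num_attention_heads=8, intermediate_size=2048,
    num_labels=2,
)
model = BertForSequenceClassification(config)
# In practice (assumed Hub id):
# model = BertForSequenceClassification.from_pretrained("prajjwal1/bert-small", num_labels=2)

input_ids = torch.randint(0, config.vocab_size, (2, 16))  # dummy token ids
labels = torch.tensor([1, 0])
outputs = model(input_ids=input_ids, labels=labels)
outputs.loss.backward()     # gradients for one optimizer step
print(outputs.logits.shape)  # → torch.Size([2, 2])
```

From here a standard training loop (optimizer step, zero_grad, next batch) completes the fine-tuning.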
For more information, visit the GitHub repository. Follow @prajjwal_1 on Twitter for updates.