Fill-mask models

55 models · ranked by HuggingFace downloads

bert-base-uncased

Google's original BERT base model in uncased form, pre-trained on BookCorpus and English Wikipedia via masked language modeling. Text is lowercased before tokenization, making the model insensitive to capitalization. It remains a standard fine-tuning base for classification, NER, and extractive QA, though newer encoders outperform it on most benchmarks.

60,271,662 ↓ · 2,690 ♡

xlm-roberta-base

XLM-RoBERTa base from Facebook AI, pre-trained on 2.5TB of filtered CommonCrawl text across 100 languages using the RoBERTa training procedure. It enables cross-lingual transfer: a model fine-tuned on labeled English data can be applied to other languages without annotations in those languages. The standard starting point for multilingual classification and token-level tasks.

20,459,644 ↓ · 855 ♡

roberta-large

RoBERTa large, the 355M-parameter version of Facebook AI's robustly optimized BERT variant. Relative to RoBERTa base it has 24 layers instead of 12, a hidden size of 1024 instead of 768, and 16 attention heads instead of 12. It provides stronger NLU accuracy at roughly 4x the inference compute cost of the base variant. Used where task accuracy on complex English language understanding outweighs latency constraints.

12,075,442 ↓ · 301 ♡
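
The "roughly 4x" compute figure for roberta-large can be sanity-checked with a back-of-envelope weight count. The sketch below assumes the common approximation of 12·L·H² weights per encoder stack (attention projections plus a feed-forward block with a 4·H intermediate size) and ignores embeddings, biases, and layer norms. The function name is illustrative, and the layer counts and hidden sizes are the published base and large configurations.

```python
def encoder_params(num_layers: int, hidden_size: int) -> int:
    """Approximate weight count of a Transformer encoder stack.

    Per layer: 4*H^2 for the attention projections (Q, K, V, output)
    plus 8*H^2 for a feed-forward block with a 4*H intermediate size.
    Embeddings, biases, and layer norms are ignored (an assumption).
    """
    return 12 * num_layers * hidden_size ** 2


base = encoder_params(num_layers=12, hidden_size=768)
large = encoder_params(num_layers=24, hidden_size=1024)
ratio = large / base

print(f"base  ~ {base / 1e6:.0f}M encoder weights")
print(f"large ~ {large / 1e6:.0f}M encoder weights")
print(f"ratio ~ {ratio:.2f}x")
```

Under these assumptions the ratio is 32/9, about 3.6x, which is in line with the rounded 4x figure. Real latency also depends on sequence length, batch size, and hardware.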

roberta-base

RoBERTa base from Facebook AI, trained with the same architecture as BERT base but significantly more data, longer training schedules, larger batch sizes, and dynamic masking. Pre-trained on BookCorpus, Wikipedia, CC-News, OpenWebText, and Stories — substantially more data than the original BERT. MIT licensed with multi-framework support.

11,648,834 ↓ · 617 ♡
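
Dynamic masking, one of the RoBERTa training changes listed for roberta-base, can be illustrated with a toy sketch. It assumes a 15% masking rate and always substitutes the mask token (the usual 80/10/10 replacement split is omitted for brevity). The helper name and the example sentence are made up for illustration.

```python
import random

MASK_TOKEN = "<mask>"  # RoBERTa-style mask token


def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Return a copy of tokens with about mask_prob of the positions masked.

    Simplification: chosen positions are always replaced by the mask token.
    """
    rng = rng if rng is not None else random.Random()
    num_to_mask = max(1, round(len(tokens) * mask_prob))
    positions = set(rng.sample(range(len(tokens)), num_to_mask))
    return [MASK_TOKEN if i in positions else tok for i, tok in enumerate(tokens)]


tokens = "the quick brown fox jumps over the lazy dog again".split()

# Static masking: one pattern is drawn during preprocessing and reused.
static_view = mask_tokens(tokens, rng=random.Random(0))
static_epochs = [static_view for _ in range(3)]

# Dynamic masking: a fresh pattern is drawn each time the sequence is served.
rng = random.Random(0)
dynamic_epochs = [mask_tokens(tokens, rng=rng) for _ in range(3)]

for epoch, view in enumerate(dynamic_epochs):
    print(epoch, " ".join(view))
```

With dynamic masking the model can see different masked positions for the same sentence across epochs, which is the point of the technique.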

ModernBERT-base

ModernBERT-base is the base-size open-weight encoder in the ModernBERT family, trained for masked language modeling. Its Apache 2.0 license permits commercial reuse. Before relying on it, reproduce its key numbers on representative inputs.

9,894,988 ↓ · 1,060 ♡

distilbert-base-uncased

DistilBERT-base-uncased is a distilled version of BERT-base-uncased, 40% smaller and 60% faster while retaining approximately 97% of BERT's language understanding performance on the GLUE benchmark. Trained via knowledge distillation from BERT using BookCorpus and Wikipedia. Commonly used when BERT's performance is needed but inference speed or resource constraints are limiting factors.

8,864,724 ↓ · 903 ♡
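
The knowledge-distillation idea behind DistilBERT can be sketched as a soft-target loss: the student is trained to match the teacher's temperature-softened output distribution. The code below is a minimal illustration of that one term only. The function names, the temperature value, and the logits are assumptions chosen for the example, and the full DistilBERT objective combines this with other loss terms.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        pt * (math.log(pt) - math.log(ps))
        for pt, ps in zip(p_teacher, p_student)
    )
    return kl * temperature ** 2


teacher = [4.0, 1.0, 0.5]        # hypothetical teacher logits
close_student = [3.5, 1.2, 0.4]  # student that roughly agrees
far_student = [0.5, 1.0, 4.0]    # student that disagrees

print(distillation_loss(close_student, teacher))
print(distillation_loss(far_student, teacher))
```

A student whose distribution is closer to the teacher's gets a lower loss, and a student identical to the teacher gets zero.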

xlm-roberta-large

XLM-RoBERTa Large, the 560-million-parameter multilingual encoder from Facebook AI, trained on 2.5TB of CommonCrawl data across 100 languages. It offers stronger multilingual language understanding than the base variant across classification, NER, and cross-lingual tasks, at roughly 4x the compute cost. MIT licensed with multi-framework support.

7,045,619 ↓ · 519 ♡

Bio_ClinicalBERT

Bio_ClinicalBERT is a BERT-base model that was further pre-trained first on biomedical literature (PubMed) and then on MIMIC-III clinical notes. It produces contextual representations tuned for both biomedical and clinical language.

4,494,619 ↓ · 432 ♡

bert-base-multilingual-uncased

BERT-base-multilingual-uncased is Google's multilingual BERT trained on Wikipedia text from 102 languages with all text lowercased before tokenization. Lowercasing simplifies processing but removes capitalization signals that help named entity recognition. It produces 768-dimensional embeddings from a single model shared across all supported languages.

4,136,130 ↓ · 157 ♡

bert-large-uncased

bert-large-uncased targets masked language modeling and is shipped as an open-weight, self-hostable checkpoint. Permissive Apache 2.0 terms let bert-large-uncased go straight into commercial pipelines. Evaluate bert-large-uncased on your own data before trusting it in production.

3,796,700 ↓ · 147 ♡

bert-base-multilingual-cased

BERT-base-multilingual-cased is Google's multilingual BERT trained on 104-language Wikipedia data with case preserved, making it better suited than the uncased variant for named entity recognition and tasks where capitalization carries semantic meaning. It shares the same 12-layer Transformer architecture and 768-dimensional hidden size as BERT-base-uncased, but uses its own multilingual vocabulary. Despite its age, it remains a common transfer learning starting point for multilingual tasks.

3,663,787 ↓ · 593 ♡

bert-base-cased

Google's BERT base model in cased form, pre-trained on BookCorpus and English Wikipedia with original case preserved. Unlike bert-base-uncased, this model maintains distinctions between 'bert' and 'BERT' — essential for tasks where capitalization carries semantic information, such as named entity recognition. Same architecture as bert-base-uncased but with case-sensitive tokenization.

3,298,881 ↓ · 362 ♡
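
The practical difference between cased and uncased checkpoints comes down to text normalization before tokenization. The sketch below approximates uncased preprocessing as lowercasing plus accent stripping. The function name is hypothetical and real tokenizers apply further cleanup steps, so treat it as an illustration rather than the tokenizer's actual code.

```python
import unicodedata


def uncased_normalize(text: str) -> str:
    """Lowercase and strip accents, approximating uncased preprocessing."""
    lowered = text.lower()
    decomposed = unicodedata.normalize("NFD", lowered)
    # Drop combining marks (Unicode category Mn) left by the decomposition.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


for word in ["BERT", "bert", "Apple", "apple", "Café"]:
    print(f"{word!r:10} -> {uncased_normalize(word)!r}")
```

After this normalization "BERT" and "bert" are indistinguishable, which is why a cased checkpoint is the safer choice when capitalization marks entities.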

esm2_t33_650M_UR50D

esm2_t33_650M_UR50D is an open-weight ESM-2 protein language model checkpoint (33 layers, roughly 650M parameters, trained on UniRef50) for masked language modeling over amino-acid sequences, distributed on the HuggingFace Hub. The MIT license keeps esm2_t33_650M_UR50D unrestricted for commercial reuse. Treat its published metrics as a starting point and validate against your workload.

2,808,334 ↓ · 82 ♡

deberta-v3-base

DeBERTa-v3-base uses disentangled attention and ELECTRA-style pretraining (replaced token detection) on English text, achieving state-of-the-art NLU results for a BERT-base-scale model at time of release. It consistently outperforms RoBERTa-base on GLUE benchmarks. The multilingual counterpart is mdeberta-v3-base.

2,587,671 ↓ · 429 ♡
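
ELECTRA-style pretraining, as used by deberta-v3-base, trains the main model to detect which tokens were replaced rather than to predict masked ones. The toy sketch below shows how training labels for that task are formed. The random choice from a small vocabulary stands in for a learned generator network, and the names, corruption rate, and vocabulary are assumptions for illustration.

```python
import random


def replaced_token_detection_example(tokens, vocab, corrupt_prob=0.15, rng=None):
    """Corrupt some positions and label each token as original (0) or replaced (1)."""
    rng = rng if rng is not None else random.Random()
    num_to_corrupt = max(1, round(len(tokens) * corrupt_prob))
    positions = rng.sample(range(len(tokens)), num_to_corrupt)

    corrupted = list(tokens)
    for i in positions:
        # Stand-in for sampling from a small generator model.
        corrupted[i] = rng.choice(vocab)

    # If the sampled token happens to equal the original, it counts as original.
    labels = [int(c != t) for c, t in zip(corrupted, tokens)]
    return corrupted, labels


tokens = "the chef cooked the meal in the kitchen".split()
vocab = ["the", "chef", "ate", "meal", "garden", "cooked", "a"]

corrupted, labels = replaced_token_detection_example(
    tokens, vocab, rng=random.Random(0)
)
print(corrupted)
print(labels)
```

Every position yields a training signal here, not only the masked ones, which is the usual argument for the sample efficiency of this objective.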

ModernBERT-large

ModernBERT-large is a 395M-parameter encoder-only model from Answer.AI that updates BERT's architecture with flash attention, rotary position embeddings, and an extended context window of 8192 tokens. It aims to be a drop-in improvement over BERT-large for masked language modeling and downstream encoder tasks. Apache-2.0 licensed.

2,022,389 ↓ · 473 ♡
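
Rotary position embeddings, one of the architectural updates listed for ModernBERT-large, encode position by rotating pairs of vector components. The sketch below is a plain-Python illustration assuming the common formulation with base 10000 and adjacent-pair rotation. The function names are illustrative and the memory layout in any particular implementation may differ.

```python
import math


def rotary_embed(vector, position, base=10000.0):
    """Rotate consecutive pairs (x[i], x[i+1]) by angle position * base**(-i/d)."""
    d = len(vector)
    if d % 2 != 0:
        raise ValueError("vector length must be even")
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vector[i], vector[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [1.0, 0.5, -0.3, 2.0]
k = [0.2, -1.0, 0.7, 0.4]

# Same relative offset (2) at different absolute positions.
score_near = dot(rotary_embed(q, 5), rotary_embed(k, 3))
score_far = dot(rotary_embed(q, 12), rotary_embed(k, 10))
print(score_near, score_far)
```

The two scores agree up to floating-point error, because the query-key dot product after rotation depends only on the relative offset between positions.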

mdeberta-v3-base

mdeberta-v3-base fills in [MASK] positions in a sentence by attending to both left and right context. The internal representations are used for classification, tagging, and semantic search via fine-tuning.

1,904,731 ↓ · 226 ♡

distilroberta-base

distilroberta-base fills in <mask> positions in a sentence by attending to both left and right context (RoBERTa-family tokenizers use <mask> rather than BERT's [MASK]). The internal representations are used for classification, tagging, and semantic search via fine-tuning.

1,829,252 ↓ · 177 ♡
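
Mask-token spelling differs between tokenizer families: BERT-style tokenizers use [MASK], while RoBERTa-style tokenizers such as the one used by distilroberta-base use <mask>. The helper below is a hypothetical sketch for building prompts from a neutral placeholder. The mapping is illustrative and not exhaustive.

```python
# Mask-token conventions by tokenizer family (illustrative, not exhaustive).
MASK_TOKEN_BY_FAMILY = {
    "bert": "[MASK]",
    "distilbert": "[MASK]",
    "roberta": "<mask>",
    "xlm-roberta": "<mask>",
}


def build_prompt(template: str, family: str) -> str:
    """Replace the neutral placeholder {mask} with the family's mask token."""
    try:
        mask_token = MASK_TOKEN_BY_FAMILY[family]
    except KeyError:
        raise ValueError(f"unknown family: {family!r}") from None
    return template.replace("{mask}", mask_token)


template = "Paris is the {mask} of France."
print(build_prompt(template, "bert"))
print(build_prompt(template, "roberta"))
```

When a tokenizer is loaded through the transformers library, reading its `mask_token` attribute is more reliable than a hand-maintained table like this one.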

bert-large-portuguese-cased

BERTimbau-large is a Portuguese BERT-large model pre-trained on brWaC, a Brazilian Portuguese web corpus of roughly 2.7B words. It provides strong contextual representations for Portuguese NLP tasks, with Brazilian Portuguese being the variety it was trained on.

1,551,150 ↓ · 73 ♡

japanese-roberta-base

japanese-roberta-base is rinna's Japanese RoBERTa-base, pre-trained on Japanese Common Crawl and Wikipedia using the masked language modeling objective. Unlike multilingual models, it uses a SentencePiece tokenizer whose vocabulary was learned from Japanese text, improving token efficiency on Japanese input. It is intended as a foundation for fine-tuning on Japanese NLP classification and NER tasks.

1,455,907 ↓ · 39 ♡

ESMC-6B

ESMC-6B is EvolutionaryScale's 6B-parameter protein language model, pre-trained on diverse protein sequences with masked-language-modeling objectives. It generates high-quality residue-level embeddings suitable for variant effect prediction, protein engineering, and transfer learning to downstream structure or function tasks. The ESM C architecture focuses on sequence understanding rather than structure prediction.

1,443,617 ↓ · 18 ♡

esm2_t12_35M_UR50D

ESM2-t12-35M is Meta's 35M parameter protein language model from the ESM2 family, pre-trained on the UniRef50 database of protein sequences. It generates protein residue embeddings for downstream structure prediction, function annotation, and variant effect prediction tasks. MIT-licensed.

1,229,417 ↓ · 23 ♡

deberta-v3-large

Built for masked language modeling, deberta-v3-large is a deberta-based model with publicly available weights. deberta-v3-large is MIT-licensed, clearing it for closed-source and paid products. deberta-v3-large ships without a hosted SLA, so budget for self-managed deployment and monitoring.

1,083,649 ↓ · 281 ♡

camembert-base

camembert-base is an open-weight French model in the roberta family, trained for masked language modeling. The MIT license keeps camembert-base unrestricted for commercial reuse. Before relying on camembert-base, reproduce its key numbers on representative inputs.

1,016,040 ↓ · 102 ♡

BiomedNLP-BiomedBERT-base-uncased-abstract

BiomedNLP-BiomedBERT-base-uncased-abstract is a bert-based open-weight model aimed at masked language modeling. Permissive MIT terms let BiomedNLP-BiomedBERT-base-uncased-abstract go straight into commercial pipelines. Before relying on BiomedNLP-BiomedBERT-base-uncased-abstract, reproduce its key numbers on representative inputs.

898,450 ↓ · 95 ♡

esm2_t6_8M_UR50D

As an open-weight model, esm2_t6_8M_UR50D focuses on masked language modeling. The MIT license keeps esm2_t6_8M_UR50D unrestricted for commercial reuse. Check the esm2_t6_8M_UR50D model card for benchmarks and intended use before adopting it.

840,110 ↓ · 35 ♡

graphcodebert-base

graphcodebert-base is Microsoft Research's code-aware BERT variant that incorporates data-flow graphs from source code alongside token sequences during pre-training. Unlike CodeBERT, which treats code as flat text, GraphCodeBERT explicitly models where each variable's value comes from, improving performance on code search and clone detection tasks. It supports six programming languages from the CodeSearchNet benchmark.

806,623 ↓ · 90 ♡
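
The data-flow graphs that graphcodebert-base consumes record, for each variable, which other variables its value comes from. The toy sketch below extracts such edges from simple Python assignments with the standard `ast` module. It is a simplified stand-in and not the extraction pipeline used for the model: it handles only plain `name = expression` statements, and the function name is hypothetical.

```python
import ast


def data_flow_edges(source: str):
    """Return (target, source_variable) pairs for simple assignments.

    Simplification: every name read on the right-hand side counts as a
    source, including function names in calls.
    """
    edges = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Assign):
            used = sorted(
                {n.id for n in ast.walk(node.value) if isinstance(n, ast.Name)}
            )
            for target in node.targets:
                if isinstance(target, ast.Name):
                    edges.extend((target.id, name) for name in used)
    return edges


code = "a = 1\nb = a + 2\nc = a * b\n"
for target, origin in data_flow_edges(code):
    print(f"{target} <- {origin}")
```

Edges like these tell the model that `c` depends on both `a` and `b` regardless of how the variables are named, which flat token sequences do not make explicit.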

ESMC-600M

ESMC-600M is a 600-million-parameter protein language model from the Chan Zuckerberg Biohub, trained using masked language modeling on protein sequences to produce contextual residue-level embeddings. It belongs to the ESM (Evolutionary Scale Modeling) family and is specifically designed for variant effect prediction, protein engineering, and transfer learning to downstream structural or functional tasks. Dual licensing (MIT and an additional 'other' license) means users should review the model card carefully before commercial use.

711,205 ↓ · 9 ♡

bert-base-arabertv02

bert-base-arabertv02 fills in [MASK] positions in a sentence by attending to both left and right context. The internal representations are used for classification, tagging, and semantic search via fine-tuning.

662,417 ↓ · 47 ♡

deberta-v3-small

As a deberta-based open-weight model, deberta-v3-small focuses on masked language modeling. The MIT license keeps deberta-v3-small unrestricted for commercial reuse. Read deberta-v3-small's card for hardware requirements and licensing fine print before deploying.

645,705 ↓ · 77 ♡

esm2_t36_3B_UR50D

esm2_t36_3B_UR50D targets masked language modeling and is shipped as an open-weight, self-hostable checkpoint. Permissive MIT terms let esm2_t36_3B_UR50D go straight into commercial pipelines. esm2_t36_3B_UR50D is community-maintained, so track upstream changes and pin a known-good revision.

636,134 ↓ · 32 ♡

distilbert-base-multilingual-cased

As a distilbert-based open-weight model, distilbert-base-multilingual-cased focuses on masked language modeling. Training spans multiple languages, so distilbert-base-multilingual-cased covers cross-lingual masked language modeling from one checkpoint. The Apache 2.0 license keeps distilbert-base-multilingual-cased unrestricted for commercial reuse. Read distilbert-base-multilingual-cased's card for hardware requirements and licensing fine print before deploying.

599,609 ↓ · 244 ♡

bert-base-chinese

bert-base-chinese is an open-weight checkpoint for masked language modeling, distributed on the HuggingFace Hub. The Apache 2.0 license keeps bert-base-chinese unrestricted for commercial reuse. Like most open checkpoints, bert-base-chinese rewards a quick in-domain eval before commitment.

540,582 ↓ · 1,439 ♡

bert-base-german-cased

bert-base-german-cased is an open-weight checkpoint for masked language modeling, distributed on the HuggingFace Hub. The MIT license keeps bert-base-german-cased unrestricted for commercial reuse. Evaluate bert-base-german-cased on your own data before trusting it in production.

516,168 ↓ · 82 ♡

Clinical-Longformer

Clinical-Longformer fills in <mask> positions in long clinical documents by attending to both left and right context, using Longformer's sparse attention to handle inputs well beyond BERT's 512-token limit. The internal representations are used for classification, tagging, and semantic search via fine-tuning.

507,319 ↓ · 69 ♡

esm2_t30_150M_UR50D

ESM-2 is Meta's protein language model trained on UniRef50, treating amino acid sequences analogously to text tokens. The t30_150M variant has 30 transformer layers at 150M total parameters, offering a practical balance between representation quality and inference speed. ESM-2 embeddings are widely used as features for protein function prediction, structure-adjacent tasks, and zero-shot fitness scoring.

483,383 ↓ · 10 ♡
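
Treating amino acids as tokens means a protein can be masked and scored the same way as a sentence. The sketch below shows residue-level masking and a common form of zero-shot mutation scoring: the log-odds of the mutant residue against the wild-type residue at the masked position. The probabilities are made-up numbers standing in for a model's output, and the function names and the `<mask>` token string are assumptions for illustration.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def mask_residue(sequence: str, position: int, mask_token: str = "<mask>"):
    """Tokenize a protein one residue per token and mask one position (0-based)."""
    tokens = list(sequence)
    tokens[position] = mask_token
    return tokens


def mutation_score(probs_at_position: dict, wild_type: str, mutant: str) -> float:
    """Log-odds of the mutant vs the wild-type residue at a masked position."""
    return math.log(probs_at_position[mutant]) - math.log(probs_at_position[wild_type])


# Hypothetical model output for one masked position (made-up numbers).
toy_probs = {aa: 0.01 for aa in AMINO_ACIDS}
toy_probs.update({"L": 0.60, "I": 0.22})

print(mask_residue("MKTLLV", 3))
print("L->I:", mutation_score(toy_probs, "L", "I"))
print("L->A:", mutation_score(toy_probs, "L", "A"))
```

A score near zero suggests the model finds the substitution about as plausible as the wild type, while a strongly negative score flags a substitution the model considers unlikely.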

Bio_Discharge_Summary_BERT

Bio_Discharge_Summary_BERT is a BERT model pre-trained on clinical discharge summaries from MIMIC-III, providing biomedical domain adaptation specifically for clinical documentation language. It captures the informal, fragmented style of clinical notes better than PubMedBERT trained on abstracts. MIT-licensed.

459,418 ↓ · 38 ♡

juribert-base

JuriBERT-base is a BERT-base model pre-trained from scratch on French legal text, offering a French-language masked LM aimed specifically at legal NLP tasks. French BERT models trained on general web text tend to handle legal vocabulary and sentence structures less well; JuriBERT addresses this by training exclusively on French legal corpora including legislation, jurisprudence, and legal commentary.

422,829 ↓ · 0 ♡

bert-base-spanish-wwm-uncased

bert-base-spanish-wwm-uncased is an open-weight masked language modeling model in the bert family. Like most open checkpoints, bert-base-spanish-wwm-uncased rewards a quick in-domain eval before commitment.

422,456 ↓ · 75 ♡

albert-base-v2

albert-base-v2 is an albert-based open-weight model aimed at masked language modeling. Permissive Apache 2.0 terms let albert-base-v2 go straight into commercial pipelines. Read albert-base-v2's card for hardware requirements and licensing fine print before deploying.

414,707 ↓ · 142 ♡

BiomedVLP-CXR-BERT-specialized

BiomedVLP-CXR-BERT-specialized is a BERT-based model from Microsoft Research, pre-trained and specialized on chest X-ray radiology reports for biomedical vision-language tasks. It is designed for joint image-text learning in the clinical radiology domain, grounded in the BioViL line of work (arxiv:2204.09817). The MIT license makes it freely usable for research and commercial applications.

413,220 ↓ · 36 ♡

bert-base-japanese-whole-word-masking

Tohoku NLP Lab's Japanese BERT-base trained with whole-word masking on Japanese Wikipedia. Text is segmented into words with MeCab before WordPiece tokenization, and all subword tokens of a selected word are masked together rather than independently. A foundational Japanese NLP model.

389,133 ↓ · 76 ♡
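
Whole-word masking can be sketched in a few lines: subword tokens are grouped back into words, and masking is applied to a whole group at once. The example below uses English WordPiece tokens with the "##" continuation prefix for readability. For Japanese the word boundaries would come from a morphological analyzer, which this toy code does not model, and the function names are illustrative.

```python
import random


def group_wordpieces(tokens):
    """Group token indices into words: a '##' prefix continues the previous word."""
    groups = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and groups:
            groups[-1].append(i)
        else:
            groups.append([i])
    return groups


def whole_word_mask(tokens, num_words=1, rng=None, mask_token="[MASK]"):
    """Mask every subword of num_words randomly chosen words."""
    rng = rng if rng is not None else random.Random()
    chosen = rng.sample(group_wordpieces(tokens), num_words)
    masked = list(tokens)
    for group in chosen:
        for i in group:
            masked[i] = mask_token
    return masked


tokens = ["token", "##ization", "matters", "for", "mask", "##ing"]
print(group_wordpieces(tokens))
print(whole_word_mask(tokens, rng=random.Random(1)))
```

With subword-level masking the model could often recover a masked piece from its unmasked neighbors in the same word. Masking the whole word removes that shortcut.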

PetBERT

PetBERT is an open-weight masked language modeling model in the bert family. Distribution of PetBERT is under OpenRAIL, which is worth reading before you ship. It is a fine-tune of bert-base-uncased, inheriting that base model's general competence. Treat PetBERT's published metrics as a starting point and validate against your workload.

381,689 ↓ · 5 ♡

prot_bert

prot_bert targets masked language modeling and is shipped as an open-weight, self-hostable checkpoint. Treat prot_bert's published metrics as a starting point and validate against your workload.

379,354 ↓ · 134 ♡

dummy-unknown

dummy-unknown is an open-weight masked language modeling model in the roberta family. Treat dummy-unknown's published metrics as a starting point and validate against your workload.

375,656 ↓ · 1 ♡

distilbert-base-german-cased

distilbert-base-german-cased targets masked language modeling and is shipped as an open-weight, self-hostable checkpoint. Permissive Apache 2.0 terms let distilbert-base-german-cased go straight into commercial pipelines. Treat distilbert-base-german-cased's published metrics as a starting point and validate against your workload.

367,966 ↓ · 25 ♡

roberta-base

roberta-base targets masked language modeling and is shipped as an open-weight, self-hostable checkpoint. Evaluate roberta-base on your own data before trusting it in production.

363,527 ↓ · 48 ♡

legal-bert-base-cased-ptbr

legal-bert-base-cased-ptbr is a BERT-base model pre-trained on Brazilian Portuguese legal text — legislation, court decisions, and official government publications. It addresses the gap in Brazilian legal NLP where standard Portuguese BERT models (BERTimbau) lack the specialised legal vocabulary of the Brazilian judiciary. Downstream tasks require fine-tuning on labelled Brazilian legal datasets.

353,779 ↓ · 15 ♡

mmBERT-base

mmBERT-base is an open-weight multilingual encoder in the ModernBERT family, trained for masked language modeling. The MIT license keeps mmBERT-base unrestricted for commercial reuse. Read mmBERT-base's card for hardware requirements and licensing fine print before deploying.

348,244 ↓ · 217 ♡

bert-base-japanese

bert-base-japanese targets masked language modeling and is shipped as an open-weight, self-hostable checkpoint. Because bert-base-japanese uses CC BY-SA 4.0, vet the conditions against your deployment plan. bert-base-japanese is community-maintained, so track upstream changes and pin a known-good revision.

347,886 ↓ · 41 ♡

deberta-v2-large-japanese-char-wwm

deberta-v2-large-japanese-char-wwm is an openly licensed masked language modeling model in the deberta family. Distribution of deberta-v2-large-japanese-char-wwm is under CC BY-SA 4.0, which is worth reading before you ship. Like most open checkpoints, deberta-v2-large-japanese-char-wwm rewards a quick in-domain eval before commitment.

344,750 ↓ · 9 ♡

ChemBERTa-77M-MLM

As a roberta-based compact model, ChemBERTa-77M-MLM focuses on masked language modeling. Weighing in near 77M parameters, ChemBERTa-77M-MLM trades some ceiling for cheaper, faster inference. Check the ChemBERTa-77M-MLM model card for benchmarks and intended use before adopting it.

342,965 ↓ · 26 ♡

twitter-xlm-roberta-base

XLM-RoBERTa-base further pre-trained by Cardiff NLP with masked language modeling on roughly 198M multilingual tweets. It serves as the base for sentiment, topic, and other social-media classification fine-tunes, with task-specific checkpoints available in the Cardiff NLP organization. One of the most-cited multilingual Twitter models.

301,734 ↓ · 19 ♡

bert-base-portuguese-cased

Built for masked language modeling, bert-base-portuguese-cased is a bert-based model with publicly available weights. bert-base-portuguese-cased is MIT-licensed, clearing it for closed-source and paid products. Check the bert-base-portuguese-cased model card for benchmarks and intended use before adopting it.

300,876 ↓ · 229 ♡

chinese-bert-wwm-ext

As a bert-based open-weight model, chinese-bert-wwm-ext focuses on masked language modeling. The Apache 2.0 license keeps chinese-bert-wwm-ext unrestricted for commercial reuse. Before relying on chinese-bert-wwm-ext, reproduce its key numbers on representative inputs.

298,910 ↓ · 193 ♡

kcbert-base

KcBERT-base is a BERT-base model pre-trained on Korean news comments (Naver댓글, Daum댓글) collected in 2019-2020, giving it strong coverage of informal Korean internet language, slang, and emoticons. Unlike KoBERT, which was trained on formal Korean text, KcBERT targets social media and user-generated content NLP tasks where colloquial Korean is predominant.

244,434 ↓ · 31 ♡