sentence similarity models

89 models · ranked by HuggingFace downloads

all-MiniLM-L6-v2

Distilled BERT model that encodes sentences into 384-dimensional vectors for measuring semantic similarity. Trained on over a billion sentence pairs spanning scientific papers, web QA, NLI datasets, and community forums. At 22M parameters and 6 transformer layers, it is fast enough for CPU inference while remaining competitive on standard sentence similarity benchmarks.

245,742,847 ↓ · 5,015 ♡
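How the vectors from all-MiniLM-L6-v2 are typically compared can be sketched with plain cosine similarity. This is a minimal illustration under stated assumptions: the 4-dimensional vectors are made-up stand-ins for real 384-dimensional embeddings, and `cosine_similarity` is a helper written for this example, not part of any library.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Made-up 4-dimensional stand-ins for real sentence embeddings.
cat = [0.9, 0.1, 0.0, 0.2]
kitten = [0.8, 0.2, 0.1, 0.3]
invoice = [0.0, 0.9, 0.8, 0.1]

# Related sentences should score higher than unrelated ones.
print(cosine_similarity(cat, kitten) > cosine_similarity(cat, invoice))
```

In practice the vectors would come from the model itself; only the comparison step is shown here.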

paraphrase-multilingual-MiniLM-L12-v2

Multilingual sentence embedding model covering 50+ languages, built on a 12-layer distilled MiniLM architecture. Produces 384-dimensional vectors designed for semantic similarity and paraphrase detection across language boundaries. Trained on multilingual paraphrase data to align semantically equivalent sentences even when expressed in different languages.

50,349,812 ↓ · 1,290 ♡

all-mpnet-base-v2

Sentence embedding model based on the MPNet architecture, producing 768-dimensional vectors. Trained on over a billion sentence pairs from MS MARCO, NLI datasets, and community QA forums, it is a frequent choice among English embedding models when accuracy matters more than inference speed. MPNet is pre-trained with a combination of masked and permuted language modeling, which gives the backbone stronger representations.

33,515,916 ↓ · 1,313 ♡

bge-m3

BAAI's BGE-M3 embedding model supports over 100 languages with a unified architecture capable of dense, sparse (lexical), and late-interaction (ColBERT-style) retrieval modes from a single checkpoint. Built on XLM-RoBERTa with large-scale multilingual training, it targets multilingual and cross-lingual retrieval where a single model must handle diverse language inputs.

31,360,936 ↓ · 3,158 ♡
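The late-interaction (ColBERT-style) mode mentioned for bge-m3 scores a query against a document token by token. The sketch below shows the general MaxSim idea only, assuming toy 2-dimensional token vectors; `maxsim` is a name chosen for this example, and real implementations may normalise the sum differently (for instance by averaging over query tokens).

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def maxsim(query_tokens, doc_tokens):
    """For each query token, take its best match among the document
    tokens, then sum those best-match scores."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)

# Toy per-token vectors (one vector per token, not one per sentence).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # matches both query tokens well
doc_b = [[0.5, 0.5]]                          # only a partial match for each

print(maxsim(query, doc_a), maxsim(query, doc_b))
```

Because every query token is matched separately, a document that covers all query terms outscores one that is only vaguely related overall.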

nomic-embed-text-v1.5

Nomic Embed Text v1.5 is a matryoshka-capable English embedding model from Nomic AI, built on a custom nomic-BERT architecture trained with contrastive learning on large-scale text pairs. Matryoshka Representation Learning allows truncating embeddings to shorter dimensions (e.g. 64, 128, 256) without retraining, enabling flexible precision-cost tradeoffs. The model is transformers.js-compatible for browser-side inference.

18,124,658 ↓ · 857 ♡
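The Matryoshka truncation described for nomic-embed-text-v1.5 can be sketched as keeping a prefix of the vector and rescaling it to unit length. This is an illustration under assumptions: the 8-dimensional vector is made up, `truncate_embedding` is a hypothetical helper, and the model card may recommend extra preprocessing before truncation, so check it before relying on this recipe.

```python
import math

def truncate_embedding(vec, dim):
    """Keep the first `dim` components and rescale to unit length."""
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("truncated vector has zero norm")
    return [x / norm for x in head]

# Made-up 8-dimensional stand-in for a full-size embedding.
full = [0.5, -0.5, 0.5, 0.5, 0.1, -0.1, 0.05, 0.0]
short = truncate_embedding(full, 4)

print(len(short))
```

Shorter vectors cut storage and comparison cost; how much accuracy is lost at each size is something to measure on your own data.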

multilingual-e5-small

Multilingual-E5-Small is a compact multilingual embedding model from Microsoft Research supporting 100+ languages on a BERT-based backbone, smaller and faster than the E5-large variant. It uses the same instruction-prefix training approach as E5-large ('query:'/'passage:') for asymmetric retrieval. MIT licensed with ONNX and OpenVINO export.

9,936,733 ↓ · 348 ♡

paraphrase-multilingual-mpnet-base-v2

Multilingual MPNet embedding model from the sentence-transformers library, producing 768-dimensional vectors across 50+ languages. Uses an MPNet backbone extended to multilingual training for higher-quality multilingual embeddings than the lighter MiniLM multilingual variant. Suitable when the 384-dim paraphrase-multilingual-MiniLM-L12-v2 is insufficient in accuracy.

7,410,185 ↓ · 465 ♡

multilingual-e5-base

multilingual-e5-base is a multilingual text embedding model from Microsoft using an XLM-RoBERTa backbone, trained with E5's text-pair ranking objective across 94 languages. It produces 768-dimensional sentence embeddings for semantic search, clustering, and cross-lingual retrieval. The base variant balances embedding quality and inference cost between the small and large tiers.

6,636,747 ↓ · 368 ♡

all-MiniLM-L12-v2

A 12-layer sentence encoder producing 384-dimensional embeddings, offering a quality step up from all-MiniLM-L6-v2 at roughly 2x the inference cost. Fine-tuned on a billion sentence pairs using contrastive objectives for semantic similarity and retrieval.

5,062,135 ↓ · 320 ♡

nomic-embed-text-v1

Nomic Embed Text v1 is the original version of Nomic AI's English text embedding model based on nomic-BERT, preceding the v1.5 matryoshka update. It produces 768-dimensional embeddings via contrastive learning and is fully open — model weights, training code, and data are publicly available. Apache 2.0 licensed.

4,376,017 ↓ · 575 ♡

e5-large-v2

e5-large-v2 is an open-weight, self-hostable checkpoint for semantic similarity and embeddings. Its permissive MIT license allows direct use in commercial pipelines. As with most open checkpoints, run a quick in-domain evaluation before committing to it.

3,586,167 ↓ · 279 ♡

paraphrase-MiniLM-L6-v2

A lightweight 22M-parameter sentence encoder fine-tuned for paraphrase detection and semantic similarity, producing 384-dimensional embeddings. One of the earliest widely adopted sentence-transformers models, optimized for speed over state-of-the-art accuracy.

3,190,018 ↓ · 148 ♡

multi-qa-mpnet-base-dot-v1

MPNet-base fine-tuned on 215M question-answer pairs for asymmetric dense retrieval using dot-product similarity. Designed specifically for the query-document retrieval case rather than symmetric sentence similarity.

2,590,303 ↓ · 193 ♡
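Why the similarity function matters for a model such as multi-qa-mpnet-base-dot-v1 can be shown with a toy ranking. Dot product rewards vector magnitude while cosine ignores it, so the two can order the same documents differently. The vectors and the `rank` helper below are illustrative assumptions, not output of the model.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def rank(query, docs, score):
    """Return document indices ordered from best to worst score."""
    return sorted(range(len(docs)), key=lambda i: score(query, docs[i]), reverse=True)

query = [1.0, 0.0]
docs = [
    [0.9, 0.1],  # short vector pointing almost exactly along the query
    [3.0, 3.0],  # long vector pointing 45 degrees away from the query
]

print(rank(query, docs, dot))     # magnitude wins
print(rank(query, docs, cosine))  # direction wins
```

A model trained for dot-product scoring should be queried with dot product; switching to cosine discards the magnitude signal it was trained to use.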

all-distilroberta-v1

DistilRoBERTa fine-tuned as a sentence encoder on over 1 billion sentence pairs, producing 768-dimensional embeddings. Offers a balance between the speed of DistilBERT and the richer representations of full RoBERTa.

2,490,342 ↓ · 43 ♡

e5-base-v2

e5-base-v2 is an openly licensed semantic similarity and embeddings model in the sentence transformers family. e5-base-v2 is MIT-licensed, clearing it for closed-source and paid products. Evaluate e5-base-v2 on your own data before trusting it in production.

2,218,419 ↓ · 156 ♡

paraphrase-mpnet-base-v2

Built for semantic similarity and embeddings, paraphrase-mpnet-base-v2 is a sentence transformers-based model with publicly available weights. paraphrase-mpnet-base-v2 is Apache 2.0-licensed, clearing it for closed-source and paid products. paraphrase-mpnet-base-v2 ships without a hosted SLA, so budget for self-managed deployment and monitoring.

1,779,013 ↓ · 49 ♡

text2vec-base-chinese

text2vec-base-chinese is a sentence transformers-based open-weight model aimed at semantic similarity and embeddings. Permissive Apache 2.0 terms let text2vec-base-chinese go straight into commercial pipelines. Read text2vec-base-chinese's card for hardware requirements and licensing fine print before deploying.

1,625,149 ↓ · 796 ♡

embeddinggemma-300m

embeddinggemma-300m is a compact checkpoint for semantic similarity and embeddings, distributed on the HuggingFace Hub. embeddinggemma-300m is subject to Gemma terms, so confirm licensing before commercial use. Weighing in near 300M parameters, embeddinggemma-300m trades some ceiling for cheaper, faster inference. Treat embeddinggemma-300m's published metrics as a starting point and validate against your workload.

1,609,629 ↓ · 1,750 ♡

multi-qa-MiniLM-L6-cos-v1

multi-qa-MiniLM-L6-cos-v1 is an open-weight checkpoint for semantic similarity and embeddings, distributed on the HuggingFace Hub. multi-qa-MiniLM-L6-cos-v1 is community-maintained, so track upstream changes and pin a known-good revision.

1,509,372 ↓ · 137 ♡

Qwen3-VL-Embedding-8B

Qwen3-VL-Embedding-8B is a sentence transformers-based open-weight model aimed at semantic similarity and embeddings. Permissive Apache 2.0 terms let Qwen3-VL-Embedding-8B go straight into commercial pipelines. At 8B parameters, Qwen3-VL-Embedding-8B keeps hosting requirements modest relative to frontier models. Qwen3-VL-Embedding-8B ships without a hosted SLA, so budget for self-managed deployment and monitoring.

1,379,325 ↓ · 450 ♡

stsb-bert-tiny-safetensors

stsb-bert-tiny-safetensors is an open-weight checkpoint for semantic similarity and embeddings, distributed on the HuggingFace Hub. Like most open checkpoints, stsb-bert-tiny-safetensors rewards a quick in-domain eval before commitment.

1,348,807 ↓ · 4 ♡

gte-multilingual-base

GTE-multilingual-base is Alibaba's 305M-parameter embedding model covering 70+ languages, designed for multilingual dense retrieval and semantic similarity. It uses a modified transformer backbone with improved positional encoding for cross-lingual transfer.

1,232,122 ↓ · 365 ♡

all-MiniLM-L6-v2-onnx

all-MiniLM-L6-v2-onnx is an open-weight checkpoint for semantic similarity and embeddings, distributed on the HuggingFace Hub. The Apache 2.0 license keeps all-MiniLM-L6-v2-onnx unrestricted for commercial reuse. Evaluate all-MiniLM-L6-v2-onnx on your own data before trusting it in production.

1,209,645 ↓ · 7 ♡

distiluse-base-multilingual-cased-v2

distiluse-base-multilingual-cased-v2 is an openly licensed semantic similarity and embeddings model in the sentence transformers family. distiluse-base-multilingual-cased-v2 is multilingual by design rather than English-only. distiluse-base-multilingual-cased-v2 is Apache 2.0-licensed, clearing it for closed-source and paid products. Like most open checkpoints, distiluse-base-multilingual-cased-v2 rewards a quick in-domain eval before commitment.

1,202,375 ↓ · 209 ♡

Qwen3-VL-Embedding-2B

Qwen3-VL-Embedding-2B is a 2B multimodal embedding model that encodes both images and text into a shared vector space. Designed for multimodal retrieval tasks where visual and textual queries need to be compared against mixed corpora.

1,190,245 ↓ · 425 ♡

gte-large-en-v1.5

Built for semantic similarity and embeddings, gte-large-en-v1.5 is a sentence transformers-based model with publicly available weights. gte-large-en-v1.5 is Apache 2.0-licensed, clearing it for closed-source and paid products. Read gte-large-en-v1.5's card for hardware requirements and licensing fine print before deploying.

1,186,041 ↓ · 238 ♡

distiluse-base-multilingual-cased-v1

distiluse-base-multilingual-cased-v1 is a sentence transformers-based open-weight model aimed at semantic similarity and embeddings. Training spans multiple languages, so distiluse-base-multilingual-cased-v1 covers cross-lingual semantic similarity from a single checkpoint. Permissive Apache 2.0 terms let distiluse-base-multilingual-cased-v1 go straight into commercial pipelines. Before relying on distiluse-base-multilingual-cased-v1, reproduce its key numbers on representative inputs.

1,178,559 ↓ · 132 ♡

LaBSE

As a sentence transformers-based open-weight model, LaBSE focuses on semantic similarity and embeddings. Training spans multiple languages, so LaBSE covers cross-lingual semantic similarity from a single checkpoint. The Apache 2.0 license keeps LaBSE unrestricted for commercial reuse. Read LaBSE's card for hardware requirements and licensing fine print before deploying.

1,119,688 ↓ · 343 ♡

ko-sroberta-multitask

ko-sroberta-multitask is an open-weight checkpoint for semantic similarity and embeddings, distributed on the HuggingFace Hub. Treat ko-sroberta-multitask's published metrics as a starting point and validate against your workload.

995,230 ↓ · 149 ♡

snowflake-arctic-embed-l-v2.0

snowflake-arctic-embed-l-v2.0 is an open-weight checkpoint for semantic similarity and embeddings, distributed on the HuggingFace Hub. snowflake-arctic-embed-l-v2.0 is multilingual by design rather than English-only. The Apache 2.0 license keeps snowflake-arctic-embed-l-v2.0 unrestricted for commercial reuse. Like most open checkpoints, snowflake-arctic-embed-l-v2.0 rewards a quick in-domain eval before commitment.

908,233 ↓ · 248 ♡

bge-small-en-v1.5-onnx-Q

bge-small-en-v1.5-onnx-Q is an openly licensed semantic similarity and embeddings model in the bert family. bge-small-en-v1.5-onnx-Q is Apache 2.0-licensed, clearing it for closed-source and paid products. bge-small-en-v1.5-onnx-Q is community-maintained, so track upstream changes and pin a known-good revision.

830,776 ↓ · 2 ♡

paraphrase-MiniLM-L3-v2

As a sentence transformers-based open-weight model, paraphrase-MiniLM-L3-v2 focuses on semantic similarity and embeddings. The Apache 2.0 license keeps paraphrase-MiniLM-L3-v2 unrestricted for commercial reuse. paraphrase-MiniLM-L3-v2 ships without a hosted SLA, so budget for self-managed deployment and monitoring.

824,793 ↓ · 30 ♡

e5-base

E5-base is a 109M-parameter English text embedding model from Microsoft, trained with weakly supervised contrastive pre-training on large-scale web text pairs followed by supervised fine-tuning. It requires prepending 'query: ' or 'passage: ' prefixes to inputs for optimal retrieval performance. E5-base sits between the small and large variants in the series, balancing embedding quality and inference speed.

813,108 ↓ · 25 ♡
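The prefix requirement described for e5-base amounts to a small preprocessing step before encoding. The helpers below are a sketch with hypothetical names (`as_query`, `as_passage`); only the 'query: ' and 'passage: ' prefixes themselves come from the description above.

```python
def as_query(text):
    """Mark a string as a search query for an E5-style model."""
    return "query: " + text.strip()

def as_passage(text):
    """Mark a string as a document passage for an E5-style model."""
    return "passage: " + text.strip()

queries = [as_query("how do vaccines work")]
passages = [
    as_passage("Vaccines train the immune system to recognise a pathogen."),
    as_passage("The stock market closed higher on Tuesday."),
]

# These prefixed strings are what would be passed to the encoder.
print(queries[0])
print(passages[0])
```

Applying the prefixes in one place, right before encoding, keeps indexing and query code from drifting apart.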

nomic-embed-text-v2-moe

nomic-embed-text-v2-moe is an open-weight checkpoint for semantic similarity and embeddings, distributed on the HuggingFace Hub. nomic-embed-text-v2-moe is multilingual by design rather than English-only. It is a fine-tune of nomic-embed-text-v2-moe-unsupervised, inheriting that base model's general competence. Evaluate nomic-embed-text-v2-moe on your own data before trusting it in production.

794,165 ↓ · 484 ♡

pubmedbert-base-embeddings

pubmedbert-base-embeddings is an open-weight checkpoint for semantic similarity and embeddings, distributed on the HuggingFace Hub. It is a fine-tune of biomednlp-biomedbert-base-uncased-abstract-fulltext, inheriting that base model's general competence. The Apache 2.0 license keeps pubmedbert-base-embeddings unrestricted for commercial reuse. Like most open checkpoints, pubmedbert-base-embeddings rewards a quick in-domain eval before commitment.

754,815 ↓ · 190 ♡

bm25

bm25 is an openly licensed semantic similarity and embeddings model. bm25 is Apache 2.0-licensed, clearing it for closed-source and paid products. Treat bm25's published metrics as a starting point and validate against your workload.

725,147 ↓ · 32 ♡

bge-micro-v2

BGE-Micro-v2 is a heavily distilled BERT embedding model targeting near-zero latency sentence encoding with acceptable MTEB scores. Extremely small footprint allows embedding generation in CPU-only or mobile environments. MIT-licensed with ONNX and transformers.js support.

699,842 ↓ · 64 ♡

BGE-m3-ko

BGE-m3-ko is a Korean-specialized fine-tune of BAAI's BGE-M3 multilingual embedding model, trained with additional Korean-Korean and Korean-English parallel data to improve retrieval performance in Korean. It retains the XLM-RoBERTa backbone and supports up to 8192 tokens, making it suitable for long Korean document retrieval and cross-lingual search.

683,940 ↓ · 76 ♡

msmarco-bert-base-dot-v5

msmarco-bert-base-dot-v5 targets semantic similarity and embeddings and is shipped as an open-weight, self-hostable checkpoint. Treat msmarco-bert-base-dot-v5's published metrics as a starting point and validate against your workload.

652,864 ↓ · 21 ♡

e5-small-v2

e5-small-v2 targets semantic similarity and embeddings and is shipped as an open-weight, self-hostable checkpoint. Permissive MIT terms let e5-small-v2 go straight into commercial pipelines. Like most open checkpoints, e5-small-v2 rewards a quick in-domain eval before commitment.

638,486 ↓ · 120 ♡

e5-large

E5-large is a 335M-parameter embedding model fine-tuned with contrastive learning on a mixture of web-scale text pairs. It consistently ranks near the top of the MTEB leaderboard for English text retrieval and similarity tasks.

632,198 ↓ · 80 ♡

gte-large

gte-large is an openly licensed semantic similarity and embeddings model in the sentence transformers family. gte-large is MIT-licensed, clearing it for closed-source and paid products. gte-large is community-maintained, so track upstream changes and pin a known-good revision.

624,382 ↓ · 304 ♡

gte-Qwen2-1.5B-instruct

GTE-Qwen2-1.5B-instruct is Alibaba's embedding model built on a 1.5B Qwen2 decoder backbone with instruction fine-tuning for text retrieval. It significantly outperforms encoder-only models of its size on MTEB by leveraging the Qwen2 language model's broader world knowledge.

615,862 ↓ · 236 ♡

all-roberta-large-v1

all-roberta-large-v1 is a sentence transformers-based open-weight model aimed at semantic similarity and embeddings. Permissive Apache 2.0 terms let all-roberta-large-v1 go straight into commercial pipelines. Read all-roberta-large-v1's card for hardware requirements and licensing fine print before deploying.

602,921 ↓ · 66 ♡

ruri-v3-310m

Ruri v3 (310M) is Nagoya University's Japanese text embedding model built on the ModernBERT architecture, optimised for semantic similarity and retrieval in Japanese. It is part of the Ruri series, which targets Japanese-specific sentence embedding quality. The v3 310M variant balances embedding dimension, retrieval quality, and inference speed for production Japanese NLP pipelines.

601,675 ↓ · 80 ♡

gte-base-en-v1.5

As a sentence transformers-based open-weight model, gte-base-en-v1.5 focuses on semantic similarity and embeddings. The Apache 2.0 license keeps gte-base-en-v1.5 unrestricted for commercial reuse. Check the gte-base-en-v1.5 model card for benchmarks and intended use before adopting it.

540,579 ↓ · 71 ♡

vietnamese-bi-encoder

Vietnamese Bi-Encoder is BKAI's Vietnamese-language sentence embedding model based on PhoBERT/RoBERTa, trained with sentence-transformers for semantic similarity and retrieval in Vietnamese. Apache-2.0 licensed, it fills a gap in Vietnamese NLP tooling.

499,933 ↓ · 76 ♡

sup-SimCSE-VietNamese-phobert-base

sup-SimCSE-VietNamese-phobert-base applies the supervised SimCSE contrastive learning objective to PhoBERT-base, producing dense sentence embeddings optimized for Vietnamese semantic similarity tasks. The approach, detailed in arXiv:2104.08821, trains on natural language inference pairs to produce embeddings that align semantically related sentences. This is one of the few purpose-built Vietnamese sentence embedding models publicly available.

487,248 ↓ · 30 ♡

rubert-tiny2

rubert-tiny2 is an openly licensed semantic similarity and embeddings model in the sentence transformers family. rubert-tiny2 is MIT-licensed, clearing it for closed-source and paid products. Like most open checkpoints, rubert-tiny2 rewards a quick in-domain eval before commitment.

481,242 ↓ · 171 ♡

nomic-embed-code

Nomic Embed Code is Nomic AI's code-specialized embedding model built on a Qwen2 backbone, designed for code retrieval, documentation search, and code similarity tasks. Apache-2.0 licensed with text-embeddings-inference compatibility.

479,908 ↓ · 121 ♡

snowflake-arctic-embed-m

As a sentence transformers-based open-weight model, snowflake-arctic-embed-m focuses on semantic similarity and embeddings. The Apache 2.0 license keeps snowflake-arctic-embed-m unrestricted for commercial reuse. Before relying on snowflake-arctic-embed-m, reproduce its key numbers on representative inputs.

479,857 ↓ · 166 ♡

BioLORD-2023

BioLORD-2023 is a sentence embedding model trained for biomedical concept representation, using a knowledge-grounded contrastive approach that anchors concept embeddings to formal ontology definitions. It produces embeddings where semantically related biomedical terms (e.g., synonymous disease names across different coding systems) cluster tightly. The model is designed for medical NLP tasks where concept normalisation and synonym matching are important.

456,729 ↓ · 53 ♡

snowflake-arctic-embed-xs

snowflake-arctic-embed-xs is a compact BERT-based text embedding model from Snowflake, optimized for retrieval tasks and evaluated on the MTEB benchmark suite. It ships with ONNX and safetensors exports and supports deployment via transformers.js, making it usable in both server-side and browser environments. The XS size tier targets latency-sensitive pipelines where embedding throughput matters more than peak accuracy.

446,771 ↓ · 43 ♡

finance-embeddings-investopedia

Sentence embeddings fine-tuned on Investopedia financial content, intended to improve semantic similarity for financial terminology and concepts compared to general-purpose embedding models.

441,086 ↓ · 65 ♡

gte-base

GTE-base (General Text Embeddings) is Alibaba's 110M-parameter BERT-based embedding model trained on a large multi-task text similarity dataset. It became a popular baseline embedding model due to its strong MTEB scores relative to its size before larger models like GTE-large and e5-mistral gained traction.

430,625 ↓ · 131 ♡

telugu-sentence-bert-nli

Telugu-sentence-bert-nli is a sentence-transformers model producing fixed-length semantic embeddings for Telugu text, trained using natural language inference (NLI) methodology. Published by L3Cube Pune as part of their Indic NLP research, it fills a practical gap for semantic similarity and retrieval in Telugu, a Dravidian language with limited off-the-shelf NLP tooling. The CC-BY-4.0 license permits broad reuse with attribution.

407,797 ↓ · 1 ♡

paraphrase-albert-small-v2

paraphrase-albert-small-v2 is a compact sentence-transformer model based on the ALBERT architecture, trained on a diverse mix of paraphrase and NLI datasets including MS MARCO, SNLI, MultiNLI, and Stack Exchange. It maps sentences to a fixed-length dense embedding space suitable for similarity computation. At roughly 22M parameters, it prioritizes low memory footprint over embedding quality.

403,059 ↓ · 11 ♡

klue-sroberta-base-continue-learning-by-mnr

A Korean sentence embedding model built on KLUE-RoBERTa-base, fine-tuned with Multiple Negatives Ranking (MNR) loss for continued learning after the initial sentence-transformers training. It is designed for Korean semantic similarity and retrieval tasks, extending the KLUE benchmark-trained base with better sentence-level representations. Bespin Global targets Korean enterprise NLP applications with this checkpoint.

400,999 ↓ · 31 ♡

bge-m3-spa-law-qa

bge-m3-spa-law-qa is an openly licensed semantic similarity and embeddings model in the sentence transformers family. bge-m3-spa-law-qa is Apache 2.0-licensed, clearing it for closed-source and paid products. It is a fine-tune of bge-m3, inheriting that base model's general competence. bge-m3-spa-law-qa is community-maintained, so track upstream changes and pin a known-good revision.

397,035 ↓ · 20 ♡

USER-bge-m3

USER-bge-m3 is DeepVK's Russian-enhanced version of BGE-M3, fine-tuned to improve text embedding quality on Russian-language documents and search tasks. It inherits BGE-M3's hybrid retrieval capabilities (dense + sparse + ColBERT) while boosting Slavic text representation.

395,724 ↓ · 80 ♡

all-indo-e5-small-v4

all-indo-e5-small-v4 is LazarusNLP's Indonesian fine-tune of a small e5 embedding model, designed to improve semantic search and sentence similarity quality on Bahasa Indonesia text. The v4 release reflects iterative improvements over previous Indonesian embedding baselines.

387,255 ↓ · 13 ♡

gte-modernbert-base

GTE-ModernBERT-base is Alibaba's text embedding model built on the ModernBERT architecture, which extends the classic BERT design with rotary position encodings and improved attention kernels for better long-context handling. It achieves strong scores on MTEB benchmarks at the 149M-parameter base scale. The Transformers.js export makes it deployable in browser environments alongside Python serving.

383,239 ↓ · 197 ♡

msmarco-MiniLM-L12-cos-v5

Built for semantic similarity and embeddings, msmarco-MiniLM-L12-cos-v5 is a sentence transformers-based model with publicly available weights. Check the msmarco-MiniLM-L12-cos-v5 model card for benchmarks and intended use before adopting it.

381,899 ↓ · 10 ♡

snowflake-arctic-embed-m-v1.5

Snowflake Arctic Embed M v1.5 is Snowflake's medium-scale English embedding model, optimized for retrieval tasks with MTEB benchmark focus. Available in ONNX, GGUF, and safetensors formats with transformers.js compatibility, making it unusually portable across inference environments. Apache-2.0 licensed.

381,345 ↓ · 72 ♡

french-bge-m3

french-bge-m3 targets semantic similarity and embeddings and is shipped as an open-weight, self-hostable checkpoint. Permissive MIT terms let french-bge-m3 go straight into commercial pipelines. Like most open checkpoints, french-bge-m3 rewards a quick in-domain eval before commitment.

380,386 ↓ · 0 ♡

KR-SBERT-V40K-klueNLI-augSTS

As a sentence transformers-based open-weight model, KR-SBERT-V40K-klueNLI-augSTS focuses on semantic similarity and embeddings. Before relying on KR-SBERT-V40K-klueNLI-augSTS, reproduce its key numbers on representative inputs.

352,551 ↓ · 83 ♡

SecureBERT2.0-cross_encoder

The cross-encoder companion to SecureBERT2.0-biencoder, designed for reranking in cybersecurity retrieval pipelines. Cross-encoders jointly encode query and document pairs, making them more accurate but slower than biencoder retrieval for re-scoring top candidates.

350,197 ↓ · 3 ♡
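The division of labour between a biencoder and a cross-encoder such as SecureBERT2.0-cross_encoder can be sketched as a two-stage pipeline: a cheap vector search builds a shortlist, then a slower pairwise scorer reorders it. Everything below is illustrative: the vectors are made up, and `toy_pair_score` is a word-overlap stand-in for a real cross-encoder, not the model's actual scoring.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def toy_pair_score(query, document):
    """Hypothetical stand-in for a cross-encoder: the fraction of
    query words that also appear in the document."""
    query_words = set(query.lower().split())
    if not query_words:
        return 0.0
    doc_words = set(document.lower().split())
    return len(query_words & doc_words) / len(query_words)

def retrieve_then_rerank(query, query_vec, corpus, pair_score, shortlist_size=2):
    """Stage 1: shortlist by vector dot product (biencoder role).
    Stage 2: reorder the shortlist with the pairwise scorer."""
    shortlist = sorted(corpus, key=lambda item: dot(query_vec, item[1]), reverse=True)
    texts = [text for text, _ in shortlist[:shortlist_size]]
    return sorted(texts, key=lambda text: pair_score(query, text), reverse=True)

query = "patch for openssl heartbleed"
query_vec = [1.0, 0.5]  # made-up query embedding
corpus = [  # (text, made-up document embedding)
    ("heartbleed openssl patch released", [0.8, 0.6]),
    ("openssl release notes", [0.9, 0.5]),
    ("phishing awareness training", [0.1, 0.2]),
]

ranked = retrieve_then_rerank(query, query_vec, corpus, toy_pair_score)
print(ranked)
```

The pairwise scorer only ever sees the shortlist, which is what keeps the slower model affordable.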

paraphrase-MiniLM-L12-v2

Built for semantic similarity and embeddings, paraphrase-MiniLM-L12-v2 is a sentence transformers-based model with publicly available weights. paraphrase-MiniLM-L12-v2 is Apache 2.0-licensed, clearing it for closed-source and paid products. paraphrase-MiniLM-L12-v2 ships without a hosted SLA, so budget for self-managed deployment and monitoring.

350,149 ↓ · 7 ♡

SecureBERT2.0-biencoder

SecureBERT 2.0 biencoder is a ModernBERT-based dense retrieval model trained on cybersecurity corpora for semantic search over security documents. It uses MultipleNegativesRankingLoss fine-tuning on ~35k pairs, making it well-suited for threat intelligence retrieval.

345,921 ↓ · 5 ♡
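The Multiple Negatives Ranking objective named for SecureBERT2.0-biencoder treats every other positive in a batch as a negative for a given anchor. A minimal sketch, assuming the similarity matrix has already been computed and scaled: `sim[i][j]` is the score of anchor `i` against positive `j`, with true pairs on the diagonal. The function name and the toy matrices are choices made for this example.

```python
import math

def mnr_loss(sim):
    """Mean cross-entropy over rows, where the correct 'class' for
    row i is column i (the anchor's own positive)."""
    total = 0.0
    for i, row in enumerate(sim):
        # Log-sum-exp with the row maximum subtracted for stability.
        m = max(row)
        log_denominator = m + math.log(sum(math.exp(s - m) for s in row))
        total += log_denominator - row[i]
    return total / len(sim)

well_separated = [[5.0, 0.0], [0.0, 5.0]]  # true pairs score highest
confused = [[0.0, 5.0], [5.0, 0.0]]        # wrong pairs score highest

print(mnr_loss(well_separated) < mnr_loss(confused))
```

Larger batches supply more in-batch negatives per anchor, which is one reason batch size tends to matter for this kind of training.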

S-PubMedBert-MS-MARCO

S-PubMedBert-MS-MARCO targets semantic similarity and embeddings and is shipped as an open-weight, self-hostable checkpoint. Licensing for S-PubMedBert-MS-MARCO is unspecified or custom — clear it before commercial use. Like most open checkpoints, S-PubMedBert-MS-MARCO rewards a quick in-domain eval before commitment.

343,659 ↓ · 43 ♡

bengali-sentence-similarity-sbert

An SBERT-style Bengali sentence embedding model from L3Cube Pune for semantic similarity tasks on Bengali text. Part of L3Cube's series of Indian language NLP models, targeting a language with limited NLP tooling.

340,343 ↓ · 6 ♡

gte-small

gte-small is an openly licensed semantic similarity and embeddings model in the sentence transformers family. gte-small is MIT-licensed, clearing it for closed-source and paid products. gte-small is community-maintained, so track upstream changes and pin a known-good revision.

338,565 ↓ · 188 ♡

multilingual-e5-large-onnx

multilingual-e5-large-onnx targets semantic similarity and embeddings and is shipped as an open-weight, self-hostable checkpoint. Permissive Apache 2.0 terms let multilingual-e5-large-onnx go straight into commercial pipelines. Like most open checkpoints, multilingual-e5-large-onnx rewards a quick in-domain eval before commitment.

337,311 ↓ · 3 ♡

embedic-base

embedic-base targets semantic similarity and embeddings and is shipped as an open-weight, self-hostable checkpoint. Permissive MIT terms let embedic-base go straight into commercial pipelines. embedic-base is multilingual by design rather than English-only. Evaluate embedic-base on your own data before trusting it in production.

330,069 ↓ · 2 ♡

LLM2Vec-Meta-Llama-3-8B-Instruct-mntp

LLM2Vec converts Llama 3 8B Instruct into a text embedding model using masked next-token prediction (MNTP) fine-tuning, enabling decoder-only LLMs to produce high-quality pooled sentence embeddings. Developed by McGill NLP, the approach demonstrates that decoder LLMs can match or exceed encoder embedding models.

328,072 ↓ · 22 ♡
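The pooling step that turns per-token vectors into one sentence embedding can be sketched with masked mean pooling. This is one common choice, shown as an assumption: the description above does not say which pooling this checkpoint uses, and the token vectors and `mean_pool` helper are made up for illustration.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1, skipping padding."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[k] for vec in kept) / len(kept) for k in range(dim)]

# Three token vectors; the last one is padding and must not affect the result.
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]

sentence_vector = mean_pool(tokens, mask)
print(sentence_vector)
```

Ignoring padded positions matters in batched inference, where short inputs are padded to the length of the longest one.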

stella_en_400M_v5

Built for semantic similarity and embeddings, stella_en_400M_v5 is a sentence transformers-based model with publicly available weights. stella_en_400M_v5 is MIT-licensed, clearing it for closed-source and paid products. Before relying on stella_en_400M_v5, reproduce its key numbers on representative inputs.

326,526 ↓ · 233 ♡

S-PubMedBert-MedQuAD

S-PubMedBert-MedQuAD is a sentence-transformers fine-tune of PubMedBERT trained on the MedQuAD question-answer dataset. It produces embeddings specialised for matching consumer-style medical questions to relevant answers, making it useful for FAQ retrieval in health information systems. The underlying PubMedBERT base already incorporates biomedical vocabulary, giving it an advantage over general-purpose sentence transformers on clinical text.

318,330 ↓ · 8 ♡

serafim-335m-portuguese-pt-sentence-encoder-ir

As a sentence transformers-based compact model, serafim-335m-portuguese-pt-sentence-encoder-ir focuses on semantic similarity and embeddings. Weighing in near 335M parameters, serafim-335m-portuguese-pt-sentence-encoder-ir trades some ceiling for cheaper, faster inference. The MIT license keeps serafim-335m-portuguese-pt-sentence-encoder-ir unrestricted for commercial reuse. Read serafim-335m-portuguese-pt-sentence-encoder-ir's card for hardware requirements and licensing fine print before deploying.

301,574 ↓ · 0 ♡

gte-base-en-v1.5

As a sentence transformers-based open-weight model, gte-base-en-v1.5 focuses on semantic similarity and embeddings. The Apache 2.0 license keeps gte-base-en-v1.5 unrestricted for commercial reuse. gte-base-en-v1.5 ships without a hosted SLA, so budget for self-managed deployment and monitoring.

298,070 ↓ · 0 ♡

stsb-roberta-base

stsb-roberta-base is an open-weight checkpoint for semantic similarity and embeddings, distributed on the HuggingFace Hub. The Apache 2.0 license keeps stsb-roberta-base unrestricted for commercial reuse. stsb-roberta-base is community-maintained, so track upstream changes and pin a known-good revision.

296,697 ↓ · 1 ♡

all_miniLM_L6_v2_with_attentions

Built for semantic similarity and embeddings, all_miniLM_L6_v2_with_attentions is a bert-based model with publicly available weights. all_miniLM_L6_v2_with_attentions is Apache 2.0-licensed, clearing it for closed-source and paid products. all_miniLM_L6_v2_with_attentions ships without a hosted SLA, so budget for self-managed deployment and monitoring.

295,936 ↓ · 14 ♡

gte-Qwen2-7B-instruct

gte-Qwen2-7B-instruct targets semantic similarity and embeddings and is shipped as a mid-sized, self-hostable checkpoint. Permissive Apache 2.0 terms let gte-Qwen2-7B-instruct go straight into commercial pipelines. At 7B parameters, gte-Qwen2-7B-instruct keeps hosting requirements modest relative to frontier models. Like most open checkpoints, gte-Qwen2-7B-instruct rewards a quick in-domain eval before commitment.

295,540 ↓ · 482 ♡

Vietnamese_Embedding

Vietnamese_Embedding targets semantic similarity and embeddings and is shipped as an open-weight, self-hostable checkpoint. Permissive Apache 2.0 terms let Vietnamese_Embedding go straight into commercial pipelines. Treat Vietnamese_Embedding's published metrics as a starting point and validate against your workload.

294,221 ↓ · 61 ♡

langcache-embed-v1

Built for semantic similarity and embeddings, langcache-embed-v1 is a sentence transformers-based model with publicly available weights. Check the langcache-embed-v1 model card for benchmarks and intended use before adopting it.

293,682 ↓ · 14 ♡

distilbert-multilingual-nli-stsb-quora-ranking

Built for semantic similarity and embeddings, distilbert-multilingual-nli-stsb-quora-ranking is a sentence transformers-based model with publicly available weights. distilbert-multilingual-nli-stsb-quora-ranking is Apache 2.0-licensed, clearing it for closed-source and paid products. Check the distilbert-multilingual-nli-stsb-quora-ranking model card for benchmarks and intended use before adopting it.

288,656 ↓ · 10 ♡

instructor-large

As a sentence transformers-based open-weight model, instructor-large focuses on semantic similarity and embeddings. The Apache 2.0 license keeps instructor-large unrestricted for commercial reuse. Check the instructor-large model card for benchmarks and intended use before adopting it.

286,640 ↓ · 524 ♡

msmarco-MiniLM-L6-v3

msmarco-MiniLM-L6-v3 is a compact 6-layer MiniLM sentence embedding model fine-tuned on the MS MARCO passage retrieval dataset. It produces query and passage embeddings optimised for asymmetric retrieval — finding relevant web passages for short natural language queries. The model is well-suited for latency-sensitive applications where a full BERT encoder is too slow.

237,614 ↓ · 15 ♡

bge-m3-korean

As a sentence transformers-based open-weight model, bge-m3-korean focuses on semantic similarity and embeddings. Training spans multiple languages, so bge-m3-korean covers cross-lingual semantic similarity from a single checkpoint. The weights start from bge-m3 and specialize it for the target task. bge-m3-korean ships without a hosted SLA, so budget for self-managed deployment and monitoring.

232,861 ↓ · 64 ♡

GIST-Embedding-v0

GIST-Embedding-v0 (Guided In-sample Selection of Training Negatives) is a BERT-based sentence embedding model trained with guided negative sampling to improve contrastive learning quality. It targets MTEB retrieval and similarity tasks for English. MIT-licensed and compatible with sentence-transformers and text-embeddings-inference.

229,434 ↓ · 30 ♡