feature extraction models

82 models · ranked by HuggingFace downloads

bge-small-en-v1.5

Small English dense embedding model from BAAI's BGE (BAAI General Embedding) series, producing 384-dimensional vectors. Trained with a retrieval-focused strategy, it achieves competitive MTEB retrieval scores relative to its parameter count. Suited for embedding workflows where throughput and cost matter more than peak accuracy. MIT licensed.

61,803,330 ↓ · 497 ♡

bge-large-en-v1.5

BGE-Large-EN-v1.5 is BAAI's highest-capacity English embedding model in the v1.5 series, producing 1024-dimensional vectors. It achieves top MTEB retrieval scores among its generation of English-only embedding models, at the cost of higher compute and storage than BGE-small or BGE-base. MIT licensed with ONNX export support.

14,764,689 ↓ · 689 ♡

Qwen3-Embedding-0.6B

Qwen3-Embedding-0.6B is Alibaba Cloud's compact embedding model from the Qwen3 series, fine-tuned from Qwen3-0.6B-Base for text embedding tasks. At 0.6B parameters it provides instruction-following embedding capability at a size deployable without dedicated GPU infrastructure. Apache 2.0 licensed.

10,301,022 ↓ · 1,085 ♡

multilingual-e5-large

Multilingual-E5-Large is a 560-million-parameter multilingual embedding model from Microsoft Research, supporting 100+ languages via an XLM-RoBERTa backbone. It follows E5's prefix convention (prepending 'query:' or 'passage:' to each input) and achieves strong MTEB multilingual retrieval scores. MIT licensed with ONNX and OpenVINO export.

8,580,779 ↓ · 1,208 ♡
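
The prefix convention mentioned for multilingual-e5-large can be sketched as a small preprocessing step. This is a minimal illustration, not the model's own tooling: the helper name `add_e5_prefix` is hypothetical, and the exact prefix strings should be checked against the model card before use.

```python
def add_e5_prefix(texts, role):
    """Prepend an E5-style role prefix to each input text.

    `role` is 'query' for search queries and 'passage' for documents.
    The helper name and signature are illustrative assumptions.
    """
    if role not in ("query", "passage"):
        raise ValueError("role must be 'query' or 'passage'")
    return [f"{role}: {text}" for text in texts]


queries = add_e5_prefix(["how do vaccines work"], "query")
passages = add_e5_prefix(["Vaccines train the immune system."], "passage")

print(queries[0])   # query: how do vaccines work
print(passages[0])  # passage: Vaccines train the immune system.
```

The prefixed strings are what would be passed to the tokenizer; queries and documents get different prefixes so the model can treat the two sides of a retrieval pair asymmetrically.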

bge-base-en-v1.5

BGE-Base-EN-v1.5 is BAAI's mid-tier English embedding model in the v1.5 series, producing 768-dimensional vectors. It balances accuracy and compute cost between the small (384d) and large (1024d) variants, making it a practical default for English retrieval tasks where storage and inference overhead of the large model are undesirable. MIT licensed with ONNX export.

8,435,495 ↓ · 441 ♡
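
To make the storage trade-off between the small (384d), base (768d), and large (1024d) BGE variants concrete, here is a back-of-the-envelope sketch. It assumes vectors are stored as uncompressed float32 values and ignores any index overhead; the corpus size of one million vectors is an arbitrary example.

```python
BYTES_PER_FLOAT32 = 4  # assumption: embeddings stored as raw float32


def index_size_gb(num_vectors, dim):
    """Raw storage for `num_vectors` embeddings of width `dim`, in GB (10^9 bytes)."""
    return num_vectors * dim * BYTES_PER_FLOAT32 / 1e9


variants = [("small", 384), ("base", 768), ("large", 1024)]
for name, dim in variants:
    size = index_size_gb(1_000_000, dim)
    print(f"bge-{name}-en-v1.5 ({dim}d): {size:.3f} GB per million vectors")
```

Under these assumptions the base model needs twice the storage of the small one, and the large model roughly 2.7 times as much.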

mxbai-embed-large-v1

mxbai-embed-large-v1 is Mixedbread AI's English embedding model producing 1024-dimensional vectors, trained for retrieval and ranking tasks using angle-optimized contrastive learning (AnglE). It achieves strong MTEB retrieval scores among English embedding models. Apache 2.0 licensed.

5,906,405 ↓ · 811 ♡

bge-small-zh-v1.5

BGE-small-zh-v1.5 is a compact Chinese text embedding model from BAAI, producing 512-dimensional sentence vectors optimized for Chinese semantic search and retrieval tasks. Part of the BGE series that also covers multilingual and English variants.

4,730,163 ↓ · 120 ♡

w2v-bert-2.0

Meta's wav2vec-BERT 2.0 is a self-supervised speech encoder that combines contrastive learning with masked language modeling objectives. It serves as the backbone for Seamless and other Meta speech recognition and translation systems.

3,911,445 ↓ · 221 ♡

all-MiniLM-L6-v2

ONNX-converted port of sentence-transformers/all-MiniLM-L6-v2, optimized for Transformers.js to run embedding inference directly in a browser or Node.js without a Python backend. Produces 384-dimensional sentence embeddings.

2,877,488 ↓ · 126 ♡

jina-embeddings-v3

Jina Embeddings v3 is a 570M-parameter text embedding model supporting 89 languages with an 8192-token context window. It uses LoRA adapters to switch between task-specific embedding modes (retrieval, similarity, classification) without separate models.

2,846,646 ↓ · 1,147 ♡

Qwen3-Embedding-8B

Qwen3-Embedding-8B is a large checkpoint for embedding and feature extraction, distributed on the HuggingFace Hub. The Apache 2.0 license keeps Qwen3-Embedding-8B unrestricted for commercial reuse. It is a fine-tune of qwen3-8b-base, inheriting that base model's general competence. Evaluate Qwen3-Embedding-8B on your own data before trusting it in production.

2,400,273 ↓ · 723 ♡

Qwen3-Embedding-4B

Qwen3-Embedding-4B is a mid-sized checkpoint for embedding and feature extraction, distributed on the HuggingFace Hub. It is a fine-tune of qwen3-4b-base, inheriting that base model's general competence. At roughly 4B parameters, Qwen3-Embedding-4B trades some ceiling for cheaper, faster inference. Qwen3-Embedding-4B is community-maintained, so track upstream changes and pin a known-good revision.

2,380,001 ↓ · 291 ♡

granite-embedding-small-english-r2

granite-embedding-small-english-r2 is a sentence transformers-based open-weight model aimed at embedding and feature extraction. Permissive Apache 2.0 terms let granite-embedding-small-english-r2 go straight into commercial pipelines. Read granite-embedding-small-english-r2's card for hardware requirements and licensing fine print before deploying.

2,143,224 ↓ · 73 ♡

UAE-Large-V1

UAE-Large-V1 is a BERT encoder with English support. It produces token- and sequence-level vectors that capture syntactic and semantic information, serving as a base for transfer learning.

2,054,778 ↓ · 237 ♡

bge-reranker-large

bge-reranker-large is an openly licensed embedding and feature extraction model in the XLM-RoBERTa family. bge-reranker-large is MIT-licensed, clearing it for closed-source and paid products. Like most open checkpoints, bge-reranker-large rewards a quick in-domain eval before commitment.

1,977,008 ↓ · 465 ♡

bge-base-en-v1.5

bge-base-en-v1.5 is a bert-based open-weight model aimed at embedding and feature extraction. Permissive MIT terms let bge-base-en-v1.5 go straight into commercial pipelines. Check the bge-base-en-v1.5 model card for benchmarks and intended use before adopting it.

1,796,382 ↓ · 9 ♡

SapBERT-from-PubMedBERT-fulltext

SapBERT-from-PubMedBERT-fulltext is a BERT encoder with English support. It produces token- and sequence-level vectors that capture syntactic and semantic information, serving as a base for transfer learning.

1,709,727 ↓ · 71 ♡

multilingual-e5-small

Transformers.js-compatible ONNX conversion of multilingual-e5-small, enabling browser and Node.js inference of a 118M-parameter multilingual embedding model covering 100+ languages.

1,637,615 ↓ · 11 ♡

multilingual-e5-large-instruct

multilingual-e5-large-instruct is an open-weight checkpoint for embedding and feature extraction, distributed on the HuggingFace Hub. multilingual-e5-large-instruct is multilingual by design rather than English-only. The MIT license keeps multilingual-e5-large-instruct unrestricted for commercial reuse. Treat multilingual-e5-large-instruct's published metrics as a starting point and validate against your workload.

1,579,225 ↓ · 628 ♡

wavlm-large

As an open-weight model, wavlm-large focuses on embedding and feature extraction. Read wavlm-large's card for hardware requirements and licensing fine print before deploying.

1,461,141 ↓ · 110 ♡

bge-large-zh-v1.5

bge-large-zh-v1.5 targets embedding and feature extraction and is shipped as an open-weight, self-hostable checkpoint. Permissive MIT terms let bge-large-zh-v1.5 go straight into commercial pipelines. bge-large-zh-v1.5 is community-maintained, so track upstream changes and pin a known-good revision.

1,422,029 ↓ · 637 ♡

bge-multilingual-gemma2

bge-multilingual-gemma2 is a Gemma encoder. It produces token- and sequence-level vectors that capture syntactic and semantic information, serving as a base for transfer learning.

1,403,940 ↓ · 202 ♡

conv-bert-base

conv-bert-base is a BERT encoder. It produces token- and sequence-level vectors that capture syntactic and semantic information, serving as a base for transfer learning.

1,264,309 ↓ · 10 ♡

1

1 is an open-weight embedding and feature extraction model in the llama family. Treat 1's published metrics as a starting point and validate against your workload.

1,248,725 ↓ · 1 ♡

bge-base-zh-v1.5

BGE-Base-ZH-v1.5 is BAAI's Chinese sentence embedding model in the BGE family, trained for Chinese semantic similarity and retrieval tasks. MIT-licensed and compatible with sentence-transformers and text-embeddings-inference. Optimized for Chinese-language RAG and search.

1,204,901 ↓ · 107 ♡

jina-embeddings-v2-small-en

jina-embeddings-v2-small-en targets embedding and feature extraction and is shipped as an open-weight, self-hostable checkpoint. Permissive Apache 2.0 terms let jina-embeddings-v2-small-en go straight into commercial pipelines. Evaluate jina-embeddings-v2-small-en on your own data before trusting it in production.

1,126,960 ↓ · 141 ♡

repeat

repeat is an open-weight embedding and feature extraction model in the llama family. Evaluate repeat on your own data before trusting it in production.

1,061,224 ↓ · 0 ♡

SFR-Embedding-2_R

SFR-Embedding-2_R is Salesforce's Mistral-7B-based text embedding model, trained for retrieval-centric tasks and evaluated on the MTEB benchmark suite. The '_R' suffix indicates retrieval optimization. Used as a bi-encoder, it achieves strong performance on passage retrieval, semantic search, and reranking, with a 4096-token context window.

953,835 ↓ · 94 ♡

jina-embeddings-v5-text-nano

jina-embeddings-v5-text-nano is Jina AI's smallest text embedding model in the v5 family, built on EuroBERT-210m with multimodal capability for both text and image feature extraction. Despite the 'nano' designation, it supports multilingual inputs and is optimized for edge and latency-sensitive retrieval scenarios where model size matters more than peak accuracy.

867,063 ↓ · 82 ♡

canine-c

canine-c is Google's CANINE-C, a character-level pre-trained encoder that operates directly on Unicode codepoints, with no tokenization step. Unlike WordPiece or BPE models, it accepts raw character sequences, which makes it robust to spelling variation, rich morphology, and out-of-vocabulary words. It supports over 100 languages by design, with no language-specific tokenizer required.

834,650 ↓ · 35 ♡
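
The tokenization-free input described for canine-c amounts to mapping each character to its Unicode codepoint. The sketch below shows only that mapping; the function name `to_codepoints` is hypothetical, and the model's actual preprocessing (special codepoints, padding, truncation) is not reproduced here.

```python
def to_codepoints(text):
    """Map each character of `text` to its Unicode codepoint (an integer)."""
    return [ord(ch) for ch in text]


# "\u00e9" is the precomposed character e-acute, written as an escape to avoid
# any ambiguity about normalization in the source file.
ids = to_codepoints("h\u00e9llo")
print(ids)  # [104, 233, 108, 108, 111]

# A misspelling changes only the affected positions, with no unknown-token fallback.
print(to_codepoints("helo"))  # [104, 101, 108, 111]
```

Because every character has a codepoint, there is no vocabulary to fall outside of, which is the property the description credits for robustness to unseen words.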

bge-base-en

BGE-base-en is a 109M-parameter BERT-based English text embedding model from the Beijing Academy of Artificial Intelligence, designed for dense retrieval and semantic similarity tasks. It was evaluated on the MTEB benchmark and supports ONNX export alongside native PyTorch, making it suitable for inference-optimized deployments via Text Embeddings Inference. Two associated arXiv papers (2310.07554, 2309.07597) document training methodology and benchmark results.

802,360 ↓ · 61 ♡

wavlm-base-plus

wavlm-base-plus targets embedding and feature extraction and is shipped as an open-weight, self-hostable checkpoint. Treat wavlm-base-plus's published metrics as a starting point and validate against your workload.

775,286 ↓ · 40 ♡

mimi

As an open-weight model, mimi focuses on embedding and feature extraction. mimi is subject to CC BY 4.0 terms, so confirm licensing before commercial use. Read mimi's card for hardware requirements and licensing fine print before deploying.

731,108 ↓ · 310 ♡

indobert-base-p1

indobert-base-p1 is an openly licensed embedding and feature extraction model in the bert family. indobert-base-p1 is MIT-licensed, clearing it for closed-source and paid products. Like most open checkpoints, indobert-base-p1 rewards a quick in-domain eval before commitment.

654,862 ↓ · 50 ♡

llama-nemotron-embed-1b-v2

llama-nemotron-embed-1b-v2 is a Llama encoder with multilingual coverage. It produces token- and sequence-level vectors that capture syntactic and semantic information, serving as a base for transfer learning.

643,486 ↓ · 57 ♡

e5-base-sts-en-de

e5-base-sts-en-de is an open-weight checkpoint for embedding and feature extraction, distributed on the HuggingFace Hub. The MIT license keeps e5-base-sts-en-de unrestricted for commercial reuse. e5-base-sts-en-de is community-maintained, so track upstream changes and pin a known-good revision.

581,322 ↓ · 17 ♡

Qwen3-Embedding-4B-W4A16-G128

Qwen3-Embedding-4B-W4A16-G128 is a W4A16 group-wise quantized version of Alibaba's Qwen3-Embedding-4B, designed for dense text embedding and sentence similarity tasks. The W4A16 scheme quantizes weights to 4-bit while keeping activations at 16-bit, targeting efficient inference on accelerators. It integrates with both sentence-transformers and text-embeddings-inference.

562,203 ↓ · 5 ♡
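
As a rough illustration of what group-wise 4-bit weight quantization involves, the sketch below quantizes a flat list of weights in groups, each with its own scale and offset. It assumes an asymmetric min/max scheme and reads the 'G128' suffix as a group size of 128; both are assumptions for illustration, and the checkpoint's actual quantization recipe may differ.

```python
def quantize_group(weights, bits=4):
    """Quantize one group of floats to integer codes in [0, 2**bits - 1]."""
    levels = 2 ** bits - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels or 1.0  # avoid a zero scale for constant groups
    codes = [min(levels, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo


def dequantize_group(codes, scale, lo):
    """Recover approximate float weights from codes, scale, and offset."""
    return [c * scale + lo for c in codes]


def quantize_groupwise(weights, group_size=128):
    """Split `weights` into consecutive groups and quantize each independently."""
    return [
        quantize_group(weights[i:i + group_size])
        for i in range(0, len(weights), group_size)
    ]


weights = [i / 255 - 0.5 for i in range(256)]  # toy weights, not real model data
groups = quantize_groupwise(weights, group_size=128)
print(len(groups), "groups;", "first codes:", groups[0][0][:4])
```

Only the weights are quantized in this sketch; in a W4A16 setup the activations stay in 16-bit floating point, so the weights are dequantized (or handled by a fused kernel) at matmul time.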

specter2_base

specter2_base is an openly licensed embedding and feature extraction model in the bert family. specter2_base is Apache 2.0-licensed, clearing it for closed-source and paid products. Treat specter2_base's published metrics as a starting point and validate against your workload.

558,161 ↓ · 46 ♡

ru-en-RoSBERTa

RoSBERTa is a bilingual Russian-English sentence embedding model from ai-forever, built on RoBERTa with MTEB-style training for semantic similarity. It targets retrieval and semantic search use cases in Russian-language NLP pipelines. MIT-licensed and available with text-embeddings-inference compatibility.

511,825 ↓ · 82 ♡

lambda

A LLaMA-architecture model packaged by Unsloth for feature extraction, likely used internally as a base for fine-tuning experiments. The safetensors format and Unsloth branding suggest it serves as a reference checkpoint rather than a production embedding model.

511,231 ↓ · 0 ♡

vram-16

As a llama-based open-weight model, vram-16 focuses on embedding and feature extraction. Before relying on vram-16, reproduce its key numbers on representative inputs.

510,444 ↓ · 0 ♡

paraphrase-albert-small-v2

paraphrase-albert-small-v2 is an ALBERT-small-v2 model fine-tuned for paraphrase detection and sentence similarity, distributed by GPTCache as a lightweight semantic cache key encoder. It encodes queries into sentence embeddings for detecting semantically equivalent user inputs, enabling cache hits in LLM serving pipelines. At ALBERT-small scale it is significantly faster than BERT-base alternatives.

504,440 ↓ · 2 ♡
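
The semantic-cache use case described for paraphrase-albert-small-v2 can be sketched as a nearest-neighbour lookup with a similarity threshold. This is a toy illustration, not GPTCache's API: the `SemanticCache` class, the 0.9 threshold, and the hand-written vectors standing in for model embeddings are all assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


class SemanticCache:
    """Return a stored response when a new query embedding is close enough."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response) pairs

    def put(self, embedding, response):
        self.entries.append((embedding, response))

    def get(self, embedding):
        best = max(self.entries, key=lambda e: cosine(embedding, e[0]), default=None)
        if best is not None and cosine(embedding, best[0]) >= self.threshold:
            return best[1]
        return None  # cache miss: the caller would fall through to the LLM


cache = SemanticCache(threshold=0.9)
cache.put([1.0, 0.0, 0.0], "cached answer A")
print(cache.get([0.99, 0.05, 0.0]))  # near-duplicate query: cache hit
print(cache.get([0.0, 1.0, 0.0]))    # unrelated query: None
```

A linear scan is fine for a sketch; a production cache would typically use an approximate nearest-neighbour index and tune the threshold on real paraphrase pairs.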

TinyBERT_L-4_H-312_v2

As a bert-based open-weight model, TinyBERT_L-4_H-312_v2 focuses on embedding and feature extraction. Read TinyBERT_L-4_H-312_v2's card for hardware requirements and licensing fine print before deploying.

503,958 ↓ · 1 ♡

bge-small-en

BGE-Small-EN is a 33M-parameter English text embedding model from BAAI, the smallest in the BGE (BAAI General Embedding) series. Despite its size it achieves competitive MTEB scores for retrieval tasks relative to larger BERT-based models. It is designed for high-throughput, memory-efficient embedding generation where larger models are too slow or expensive.

472,477 ↓ · 93 ♡

bart-base

bart-base is a bart-based open-weight model aimed at embedding and feature extraction. Permissive Apache 2.0 terms let bart-base go straight into commercial pipelines. Before relying on bart-base, reproduce its key numbers on representative inputs.

434,193 ↓ · 205 ♡

opensearch-neural-sparse-encoding-doc-v2-distill

opensearch-neural-sparse-encoding-doc-v2-distill is a sentence transformers-based open-weight model aimed at embedding and feature extraction. Permissive Apache 2.0 terms let opensearch-neural-sparse-encoding-doc-v2-distill go straight into commercial pipelines. Read opensearch-neural-sparse-encoding-doc-v2-distill's card for hardware requirements and licensing fine print before deploying.

432,402 ↓ · 19 ♡

other

other is a llama-based open-weight model aimed at embedding and feature extraction. Before relying on other, reproduce its key numbers on representative inputs.

427,698 ↓ · 0 ♡

splade-cocondenser-ensembledistil

splade-cocondenser-ensembledistil is an open-weight embedding and feature extraction model in the sentence transformers family. Distribution of splade-cocondenser-ensembledistil is under CC BY-NC-SA 4.0, which is worth reading before you ship. Evaluate splade-cocondenser-ensembledistil on your own data before trusting it in production.

425,599 ↓ · 62 ♡

MedCPT-Query-Encoder

MedCPT is NCBI's biomedical retrieval model trained on PubMed citation data using a contrastive learning objective. The query encoder maps clinical and biomedical questions into a shared embedding space with MedCPT's article encoder for dense biomedical literature retrieval.

396,620 ↓ · 62 ♡

SapBERT-from-PubMedBERT-fulltext-mean-token

SapBERT-from-PubMedBERT-fulltext-mean-token targets embedding and feature extraction and is shipped as an open-weight, self-hostable checkpoint. Treat SapBERT-from-PubMedBERT-fulltext-mean-token's published metrics as a starting point and validate against your workload.

391,878 ↓ · 2 ♡

e5-mistral-7b-instruct

E5-Mistral-7B-Instruct is an embedding model built on Mistral 7B that derives text embeddings from the representations of a decoder-only LLM. It uses instruction prompts at inference time to orient embeddings for retrieval, clustering, or classification tasks. At release it achieved state-of-the-art MTEB scores for dense retrieval, outperforming BERT-family embedding models by a significant margin on hard retrieval tasks.

388,569 ↓ · 566 ♡
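
Two ideas from the e5-mistral-7b-instruct description can be sketched without loading the model: wrapping a query in a task instruction, and taking the embedding from the last real token of a decoder-only model. The template string and both helper names are assumptions to verify against the model card, and the hidden states are toy numbers rather than model output.

```python
def build_instructed_query(task, query):
    """Wrap a query with a task instruction (template is an assumption)."""
    return f"Instruct: {task}\nQuery: {query}"


def last_token_pool(hidden_states, attention_mask):
    """Pick the hidden state of the last non-padding token.

    `hidden_states` is a list of per-token vectors and `attention_mask` a list
    of 1 (real token) or 0 (padding), assuming right-padded input.
    """
    last = max(i for i, m in enumerate(attention_mask) if m == 1)
    return hidden_states[last]


text = build_instructed_query(
    "Given a web search query, retrieve relevant passages", "what is dense retrieval"
)
print(text)

toy_states = [[0.1, 0.2], [0.3, 0.4], [0.0, 0.0]]  # third position is padding
print(last_token_pool(toy_states, [1, 1, 0]))  # [0.3, 0.4]
```

In a causal model only the final token has attended to the whole input, which is why last-token pooling is a common choice for decoder-only embedders, whereas BERT-style encoders usually pool the first token or the mean.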

clap-htsat-unfused

Built for embedding and feature extraction, clap-htsat-unfused is a model with publicly available weights. clap-htsat-unfused is Apache 2.0-licensed, clearing it for closed-source and paid products. Check the clap-htsat-unfused model card for benchmarks and intended use before adopting it.

383,372 ↓ · 76 ♡

Qwen3-VL-Embedding-2B-AWQ-4bit

This is an AWQ 4-bit quantization of Qwen/Qwen3-VL-Embedding-2B, a 2-billion-parameter multimodal embedding model that produces joint image-text representations. Quantization reduces memory footprint substantially, enabling deployment on GPUs with limited VRAM. The base model supports English and Chinese and is described in arXiv:2601.04720.

382,011 ↓ · 2 ♡

jina-embeddings-v5-text-small

Built for embedding and feature extraction, jina-embeddings-v5-text-small is a sentence transformers-based model with publicly available weights. Distribution of jina-embeddings-v5-text-small is under CC BY-NC 4.0, which is worth reading before you ship. Training spans multiple languages, so jina-embeddings-v5-text-small covers cross-lingual embedding and feature extraction from one checkpoint. Check the jina-embeddings-v5-text-small model card for benchmarks and intended use before adopting it.

375,221 ↓ · 182 ♡

granite-embedding-311m-multilingual-r2

Built for embedding and feature extraction, granite-embedding-311m-multilingual-r2 is a sentence transformers-based model with publicly available weights. granite-embedding-311m-multilingual-r2 is Apache 2.0-licensed, clearing it for closed-source and paid products. Training spans multiple languages, so granite-embedding-311m-multilingual-r2 covers cross-lingual embedding and feature extraction from one checkpoint. Check the granite-embedding-311m-multilingual-r2 model card for benchmarks and intended use before adopting it.

366,282 ↓ · 106 ♡

OTel-Embedding-33M

A 33M-parameter text embedding model from farbodtavakkoli specialized for OpenTelemetry (OTel) log and trace data. Designed to embed observability signals (log lines, span names, error messages) for semantic search and anomaly clustering in monitoring pipelines.

364,194 ↓ · 0 ♡

codebert-base

codebert-base targets embedding and feature extraction and is shipped as an open-weight, self-hostable checkpoint. Like most open checkpoints, codebert-base rewards a quick in-domain eval before commitment.

363,256 ↓ · 288 ♡

rubert-base-cased

RuBERT-base-cased is DeepPavlov's BERT base model pre-trained on Russian text from Wikipedia and news corpora, with a case-sensitive vocabulary. It provides Russian-specific contextualized representations for downstream NLP tasks. PyTorch and JAX checkpoints are available.

361,482 ↓ · 131 ♡

MoLFormer-XL-both-10pct

MoLFormer-XL-both-10pct is a checkpoint of IBM Research's MoLFormer-XL, a BERT-style molecular language model with linear attention and rotary embeddings. It produces molecular fingerprint-like embeddings from SMILES notation for property prediction tasks. The full MoLFormer-XL corpus is 1.1B SMILES strings from PubChem and ZINC; the 'both-10pct' variant was trained on 10% of both datasets.

355,714 ↓ · 35 ♡

hubert-base-ls960

As a hubert-based open-weight model, hubert-base-ls960 focuses on embedding and feature extraction. The Apache 2.0 license keeps hubert-base-ls960 unrestricted for commercial reuse. hubert-base-ls960 ships without a hosted SLA, so budget for self-managed deployment and monitoring.

350,788 ↓ · 74 ♡

harrier-oss-v1-0.6b

As a sentence transformers-based compact model, harrier-oss-v1-0.6b focuses on embedding and feature extraction. Training spans multiple languages, so harrier-oss-v1-0.6b covers cross-lingual embedding and feature extraction from one checkpoint. The MIT license keeps harrier-oss-v1-0.6b unrestricted for commercial reuse. Before relying on harrier-oss-v1-0.6b, reproduce its key numbers on representative inputs.

350,511 ↓ · 260 ♡

signal-jepa_without-chans

signal-jepa_without-chans is a self-supervised EEG foundation model from the braindecode project, using a joint-embedding predictive architecture (JEPA) trained on unlabeled EEG recordings. It generates channel-agnostic temporal representations suitable for downstream BCI or clinical EEG classification tasks. The 'without-chans' variant drops channel position encoding, making it compatible with variable electrode montages.

345,016 ↓ · 0 ♡

opensearch-neural-sparse-encoding-v2-distill

A distilled neural sparse encoding model from the OpenSearch project, designed for SPLADE-style learned sparse retrieval. It generates sparse token weight vectors from text, enabling neural relevance ranking within inverted index infrastructure without dense vector ANN.

342,318 ↓ · 10 ♡
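
Learned sparse retrieval of the kind described for opensearch-neural-sparse-encoding-v2-distill scores a document by summing the products of token weights that the query and document share, which is what lets it run on an inverted index. The sketch below uses hand-written token weights as stand-ins for model output; the function name and all values are illustrative assumptions.

```python
def sparse_dot(query_weights, doc_weights):
    """Dot product of two sparse vectors stored as {token: weight} dicts."""
    return sum(
        weight * doc_weights[token]
        for token, weight in query_weights.items()
        if token in doc_weights
    )


query = {"neural": 1.2, "search": 0.8}
docs = {
    "d1": {"neural": 0.9, "retrieval": 1.1, "search": 0.5},
    "d2": {"cooking": 1.4, "search": 0.2},
}

scores = {doc_id: sparse_dot(query, weights) for doc_id, weights in docs.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)  # ['d1', 'd2']
```

Only tokens present in both vectors contribute, so the lookup touches just the posting lists for the query's tokens rather than comparing against every dense vector.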

jina-embeddings-v2-base-de

jina-embeddings-v2-base-de is an openly licensed embedding and feature extraction model in the sentence transformers family. jina-embeddings-v2-base-de is Apache 2.0-licensed, clearing it for closed-source and paid products. Evaluate jina-embeddings-v2-base-de on your own data before trusting it in production.

342,090 ↓ · 84 ♡

bart-large

BART-large is Meta's denoising autoencoder, pretrained as a sequence-to-sequence model and commonly fine-tuned for abstractive summarization, translation, and text generation. The large (400M) variant is the higher-capacity checkpoint in the original BART family, released as a pretrained base before any downstream fine-tuning.

339,176 ↓ · 201 ♡

jina-embeddings-v2-base-code

jina-embeddings-v2-base-code generates dense embeddings for mixed code-text inputs, supporting 8192-token context windows. It was trained to handle docstrings, function bodies, and natural language queries together, making it well-suited for semantic code search. The model ships ONNX and Transformers.js versions alongside the standard PyTorch weights.

338,862 ↓ · 139 ♡

deepset-mxbai-embed-de-large-v1

deepset-mxbai-embed-de-large-v1 targets embedding and feature extraction and is shipped as an open-weight, self-hostable checkpoint. Permissive Apache 2.0 terms let deepset-mxbai-embed-de-large-v1 go straight into commercial pipelines. Treat deepset-mxbai-embed-de-large-v1's published metrics as a starting point and validate against your workload.

336,805 ↓ · 60 ♡

OTel-Embedding-34M

A 34M-parameter OTel-domain text embedding model from farbodtavakkoli, nearly identical in scope to the 33M variant but potentially a slightly different architecture or training iteration. Designed for embedding OpenTelemetry observability signals.

333,330 ↓ · 0 ♡

Solon-embeddings-large-0.1

Built for embedding and feature extraction, Solon-embeddings-large-0.1 is an XLM-RoBERTa-based model with publicly available weights. Solon-embeddings-large-0.1 is MIT-licensed, clearing it for closed-source and paid products. Before relying on Solon-embeddings-large-0.1, reproduce its key numbers on representative inputs.

333,035 ↓ · 53 ♡

OTel-Embedding-109M

The largest of farbodtavakkoli's OTel embedding series at 109M parameters, offering the best embedding quality among the OTel-Embedding models for OpenTelemetry log, span, and metric text.

330,494 ↓ · 1 ♡

biobert-v1.1

biobert-v1.1 is an open-weight checkpoint for embedding and feature extraction, distributed on the HuggingFace Hub. Evaluate biobert-v1.1 on your own data before trusting it in production.

319,163 ↓ · 112 ♡

sentence-bert-base-ja-mean-tokens

sentence-bert-base-ja-mean-tokens is an open-weight checkpoint for embedding and feature extraction, distributed on the HuggingFace Hub. sentence-bert-base-ja-mean-tokens is subject to CC BY-SA 4.0 terms, so confirm licensing before commercial use. Treat sentence-bert-base-ja-mean-tokens's published metrics as a starting point and validate against your workload.

317,956 ↓ · 11 ♡

OTel-Embedding-300M

Built for embedding and feature extraction, OTel-Embedding-300M is a model with publicly available weights. OTel-Embedding-300M is Apache 2.0-licensed, clearing it for closed-source and paid products. At about 300M parameters, OTel-Embedding-300M sits in the compact tier, which sets its memory and latency budget. Before relying on OTel-Embedding-300M, reproduce its key numbers on representative inputs.

317,491 ↓ · 0 ♡

distilhubert

As a hubert-based open-weight model, distilhubert focuses on embedding and feature extraction. The Apache 2.0 license keeps distilhubert unrestricted for commercial reuse. Before relying on distilhubert, reproduce its key numbers on representative inputs.

316,274 ↓ · 38 ♡

distilbert-base-nli-mean-tokens

Built for embedding and feature extraction, distilbert-base-nli-mean-tokens is a sentence transformers-based model with publicly available weights. distilbert-base-nli-mean-tokens is Apache 2.0-licensed, clearing it for closed-source and paid products. distilbert-base-nli-mean-tokens ships without a hosted SLA, so budget for self-managed deployment and monitoring.

313,194 ↓ · 13 ♡

bge-small-en-v1.5

Xenova's Transformers.js ONNX conversion of BGE-Small-EN-v1.5 for browser and Node.js inference. BGE-Small-EN-v1.5 is BAAI's small English embedding model; this version targets client-side semantic search without server infrastructure. The ONNX format enables cross-platform deployment.

312,243 ↓ · 16 ♡

FRIDA

FRIDA is an open-weight checkpoint for embedding and feature extraction, distributed on the HuggingFace Hub. It is a fine-tune of fred-t5-1.7b, inheriting that base model's general competence. The MIT license keeps FRIDA unrestricted for commercial reuse. FRIDA is community-maintained, so track upstream changes and pin a known-good revision.

309,397 ↓ · 137 ♡

zembed-1-embedding

zembed-1-embedding is an open-weight checkpoint for embedding and feature extraction, distributed on the HuggingFace Hub. zembed-1-embedding is multilingual by design rather than English-only. zembed-1-embedding is subject to CC BY-NC 4.0 terms, so confirm licensing before commercial use. Treat zembed-1-embedding's published metrics as a starting point and validate against your workload.

309,279 ↓ · 110 ♡

bge-base-zh

bge-base-zh is an openly licensed embedding and feature extraction model in the bert family. bge-base-zh is MIT-licensed, clearing it for closed-source and paid products. Like most open checkpoints, bge-base-zh rewards a quick in-domain eval before commitment.

299,140 ↓ · 58 ♡

OTel-Embedding-22M

OTel-Embedding-22M is a compact checkpoint for embedding and feature extraction, distributed on the HuggingFace Hub. Weighing in near 22M parameters, OTel-Embedding-22M trades some ceiling for cheaper, faster inference. The Apache 2.0 license keeps OTel-Embedding-22M unrestricted for commercial reuse. Like most open checkpoints, OTel-Embedding-22M rewards a quick in-domain eval before commitment.

299,119 ↓ · 0 ♡

dac_44khz

dac_44khz targets embedding and feature extraction and is shipped as an open-weight, self-hostable checkpoint. Treat dac_44khz's published metrics as a starting point and validate against your workload.

292,706 ↓ · 11 ♡

llama2-embedding-1b-8k

llama2-embedding-1b-8k is a llama-based open-weight model aimed at embedding and feature extraction. llama2-embedding-1b-8k's 1B-parameter size keeps hosting requirements modest relative to frontier models. Before relying on llama2-embedding-1b-8k, reproduce its key numbers on representative inputs.

291,822 ↓ · 2 ♡