image feature extraction models

20 models · ranked by HuggingFace downloads

dinov2-small

DINOv2 ViT-S is the smallest variant in Meta's DINOv2 series, offering a 21M-parameter self-supervised vision transformer suitable for resource-constrained feature extraction applications.

2,907,120 ↓ · 67 ♡

vit-base-patch16-224-in21k

ViT-B/16 pretrained with supervised training on ImageNet-21K, which contains about 14 million images across 21,843 classes. A standard vision transformer backbone widely used as a starting point for fine-tuning on downstream vision classification and feature extraction tasks.

1,726,565 ↓ · 411 ♡

vit_small_patch14_reg4_dinov2.lvd142m

vit_small_patch14_reg4_dinov2.lvd142m encodes images into fixed-dimension feature vectors for downstream visual similarity and classification tasks.

1,364,449 ↓ · 7 ♡
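
Visual similarity between two fixed-dimension feature vectors is usually scored with cosine similarity. The sketch below is a minimal, standard-library illustration; the 4-dimensional vectors are made-up stand-ins for real embeddings (which have hundreds of dimensions), and `cosine_similarity` is a hypothetical helper, not part of any model's API.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cannot compare a zero vector")
    return dot / (norm_a * norm_b)


# Made-up embeddings (assumption): two similar images and one unrelated image.
cat_a = [0.9, 0.1, 0.0, 0.2]
cat_b = [0.8, 0.2, 0.1, 0.1]
truck = [0.0, 0.1, 0.9, 0.3]

similar_score = cosine_similarity(cat_a, cat_b)
unrelated_score = cosine_similarity(cat_a, truck)
print(f"cat_a vs cat_b: {similar_score:.3f}")
print(f"cat_a vs truck: {unrelated_score:.3f}")
```

With these toy vectors the two "cat" embeddings score much higher against each other than against the "truck" embedding, which is the property a similarity search relies on.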

nomic-embed-vision-v1.5

nomic-embed-vision-v1.5 is a vision encoder from Nomic AI that produces embeddings aligned with their nomic-embed-text embedding space. Because images and text land in the same vector space, text queries can retrieve images (and vice versa) from a single index. The vision encoder is trained to match the existing BERT-style text encoder rather than being trained jointly as a CLIP image-text pair; check the model card for backbone details.

1,294,387 ↓ · 220 ♡
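
When image and text embeddings share one vector space, cross-modal retrieval reduces to a nearest-neighbor search over the image index using a text embedding as the query. The sketch below illustrates that idea with the standard library only; the file names, the 3-dimensional vectors, and the `top_k` helper are all hypothetical, and no real encoder is called.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def top_k(query, index, k=2):
    """Return (name, score) pairs for the k most similar entries, best first."""
    q = normalize(query)
    scored = [
        (name, sum(a * b for a, b in zip(q, normalize(vec))))
        for name, vec in index.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up image embeddings (assumption); a real index would come from the vision encoder.
image_index = {
    "beach.jpg": [0.9, 0.1, 0.0],
    "forest.jpg": [0.1, 0.9, 0.1],
    "city.jpg": [0.0, 0.2, 0.9],
}

# Made-up text embedding standing in for a query such as "a path through trees".
text_query = [0.2, 0.8, 0.0]

results = top_k(text_query, image_index, k=2)
for name, score in results:
    print(f"{name}: {score:.3f}")
```

In practice the brute-force scan would be replaced by a vector database or approximate nearest-neighbor index once the collection grows.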

dinov2-base

DINOv2 ViT-B is Meta's self-supervised vision transformer trained on 142M curated images using a combination of DINO and iBOT objectives. It produces strong visual features for dense prediction tasks without any labels during pretraining.

1,274,651 ↓ · 181 ♡

dinov2-large

DINOv2 ViT-L is Meta's large self-supervised vision transformer, offering stronger visual representations than the base variant at roughly 3.5x the parameter count (about 300M versus 86M). Its frozen features reach ImageNet linear-probe accuracy close to that of supervised models.

957,180 ↓ · 113 ♡
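
The relative sizes of the small, base, and large variants can be sanity-checked with a back-of-envelope estimate. The sketch below counts only the attention and MLP weight matrices of each transformer block and assumes the standard ViT-S/B/L depths and widths with an MLP ratio of 4; it ignores biases, norms, and embeddings, so treat the output as approximate rather than as official parameter counts.

```python
def approx_vit_params(depth, width, mlp_ratio=4):
    """Rough transformer-block weight count (no biases, norms, or embeddings)."""
    attention = 4 * width * width        # Q, K, V, and output projections
    mlp = 2 * mlp_ratio * width * width  # two linear layers, hidden size mlp_ratio * width
    return depth * (attention + mlp)


# Standard ViT configurations (assumed here for the checkpoints listed on this page).
VARIANTS = {
    "ViT-S": (12, 384),
    "ViT-B": (12, 768),
    "ViT-L": (24, 1024),
}

estimates = {
    name: approx_vit_params(depth, width)
    for name, (depth, width) in VARIANTS.items()
}

for name, count in estimates.items():
    print(f"{name}: ~{count / 1e6:.0f}M parameters")

print(f"ViT-L / ViT-B ratio: {estimates['ViT-L'] / estimates['ViT-B']:.2f}")
```

Under these assumptions the estimate gives about 21M, 85M, and 302M weights for the three variants, with the large model a little over 3.5 times the size of the base model.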

vit_small_patch14_dinov2.lvd142m

A ViT-Small backbone pre-trained with DINOv2 self-supervised learning on the curated LVD-142M dataset. DINOv2 models learn dense visual features without labels, producing representations that transfer well to segmentation, depth estimation, and retrieval tasks. The small patch14 variant offers a balance between spatial resolution and inference speed.

850,258 ↓ · 6 ♡
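
The trade-off between spatial resolution and speed follows from how a ViT tokenizes an image: the number of patch tokens is the image area divided by the patch area, and attention cost grows with the token count. The sketch below works through that arithmetic; `token_count` is a hypothetical helper, and the image sizes are example inputs rather than values taken from any model card.

```python
def token_count(height, width, patch_size, num_registers=0, has_cls=True):
    """Sequence length a ViT sees for one image: patches + registers + optional CLS."""
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be a multiple of the patch size")
    patches = (height // patch_size) * (width // patch_size)
    return patches + num_registers + (1 if has_cls else 0)


# Example inputs (assumptions): the same 224x224 image at two patch sizes,
# a larger 518x518 image, and a variant with 4 register tokens ("reg4").
examples = {
    "224px, patch 14": token_count(224, 224, 14),
    "224px, patch 16": token_count(224, 224, 16),
    "518px, patch 14": token_count(518, 518, 14),
    "224px, patch 14, 4 registers": token_count(224, 224, 14, num_registers=4),
}

for label, tokens in examples.items():
    print(f"{label}: {tokens} tokens")
```

A smaller patch size yields more tokens per image, which gives finer-grained features for dense tasks at the cost of more compute per forward pass.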

convnext_base.clip_laion2b

convnext_base.clip_laion2b is a ConvNeXt-Base image encoder whose weights, per its name, come from CLIP training on LAION-2B, distributed in safetensors format for local or server inference. The model card metadata does not specify a pipeline type, but the architecture is an image backbone for feature extraction, not a text or multimodal generation model. Check the source model card for specific capability and benchmark details.

647,396 ↓ · 0 ♡

vit_base_patch14_dinov2.lvd142m

vit_base_patch14_dinov2.lvd142m extracts compact visual representations from images, enabling content-based search and fine-tuning on top of frozen features.

563,495 ↓ · 10 ♡
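
Training on top of frozen features usually means fitting a small classifier (a "linear probe") while the backbone stays fixed. The sketch below trains a binary logistic-regression probe with plain stochastic gradient descent on made-up 2-dimensional features; the data, learning rate, and helper names are assumptions for illustration, and a real setup would use embeddings from the model and a library such as scikit-learn or PyTorch.

```python
import math


def train_linear_probe(features, labels, lr=0.5, epochs=200):
    """Fit binary logistic regression on frozen features with per-sample SGD."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            prob = 1.0 / (1.0 + math.exp(-z))
            grad = prob - y  # derivative of the log loss with respect to z
            weights = [w - lr * grad * xi for w, xi in zip(weights, x)]
            bias -= lr * grad
    return weights, bias


def predict(weights, bias, x):
    """Return class 1 if the linear score is positive, else class 0."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score > 0 else 0


# Made-up frozen features (assumption): class 0 clusters near (1, 0), class 1 near (0, 1).
train_x = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.7]]
train_y = [0, 0, 1, 1]

weights, bias = train_linear_probe(train_x, train_y)
train_preds = [predict(weights, bias, x) for x in train_x]
print("training predictions:", train_preds)
print("held-out prediction:", predict(weights, bias, [0.15, 0.8]))
```

Only the probe's weights and bias are updated; the feature vectors themselves never change, which is what keeps this approach cheap compared with full fine-tuning.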

dinov3-vitl16-pretrain-lvd1689m

dinov3-vitl16-pretrain-lvd1689m produces dense visual embeddings from image inputs without a task-specific head. It is used for image retrieval, clustering, and transfer learning.

529,195 ↓ · 364 ♡

dinov3-vitb16-pretrain-lvd1689m

dinov3-vitb16-pretrain-lvd1689m encodes images into fixed-dimension feature vectors for downstream visual similarity and classification tasks.

516,810 ↓ · 169 ♡

vit_small_patch16_dinov3.lvd1689m

vit_small_patch16_dinov3.lvd1689m is a Vision Transformer small model trained with the DINOv3 self-supervised learning method on the LVD-1689M large-scale image dataset. It is distributed through the timm library and targets dense image feature extraction without task-specific fine-tuning. The arxiv:2508.10104 reference points to the DINOv3 methodology paper.

474,713 ↓ · 6 ♡
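
Names of timm-distributed checkpoints on this page follow an `architecture.pretrained_tag` pattern, where the part after the dot describes the pretraining source (for example a dataset) rather than the model size. The sketch below splits such a name into its pieces; `parse_timm_name` is a hypothetical helper based on simple string heuristics, not a timm function, and it may misread names that do not follow this pattern.

```python
import re


def parse_timm_name(name):
    """Split 'architecture.pretrained_tag' and pull out patch size and register count."""
    architecture, _, tag = name.partition(".")
    patch = re.search(r"patch(\d+)", architecture)
    registers = re.search(r"reg(\d+)", architecture)
    return {
        "architecture": architecture,
        "pretrained_tag": tag or None,
        "patch_size": int(patch.group(1)) if patch else None,
        "registers": int(registers.group(1)) if registers else 0,
    }


# Names taken from the listing on this page.
names = [
    "vit_small_patch14_reg4_dinov2.lvd142m",
    "vit_small_patch16_dinov3.lvd1689m",
    "convnext_base.clip_laion2b",
]

parsed = {name: parse_timm_name(name) for name in names}
for name, info in parsed.items():
    print(name, "->", info)
```

Read this way, `lvd1689m` identifies the LVD-1689M pretraining data and `patch16` the patch size; neither is a parameter count.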

dino-vitb16

dino-vitb16 produces dense visual embeddings from image inputs without a task-specific head. It is used for image retrieval, clustering, and transfer learning.

464,134 ↓ · 112 ♡

dinov3-vits16-pretrain-lvd1689m

DINOv3-ViTS16 is the small ViT variant from Facebook's third-generation DINO self-supervised visual pre-training series, trained on the LVD-1689M dataset of 1.689 billion curated images. It is a distilled version derived from the 7B-parameter DINOv3-ViT7B16 teacher model. The model produces general-purpose image features without relying on labeled data during pre-training.

401,908 ↓ · 119 ♡

ultraVAD

ultraVAD is a model with publicly available weights, listed here under image embedding. Its name suggests voice activity detection, so check the ultraVAD model card for modality, benchmarks, and intended use before adopting it.

389,233 ↓ · 38 ♡

rad-dino

RAD-DINO is Microsoft's radiology-focused vision transformer, trained with the DINOv2 self-supervised method on chest X-ray images to produce visual features suited to medical imaging tasks. Because pretraining uses no labels, its frozen features can be adapted to radiological tasks with small labelled sets and lightweight task heads. Microsoft published this model alongside a paper demonstrating its utility for X-ray report generation and pathology classification.

382,232 ↓ · 76 ♡

dinov3-vitl16-pretrain-lvd1689m

dinov3-vitl16-pretrain-lvd1689m is an open-weight image embedding model in the ViT family. The lvd1689m suffix names the LVD-1689M pretraining dataset (about 1.689 billion images), not the parameter count; as a ViT-L/16, the model has roughly 300M parameters, which sets its memory and latency budget. Licensing is unspecified or custom, so clear it before commercial use, and evaluate the model on your own data before trusting it in production.

375,537 ↓ · 11 ♡

EAT-base_epoch30_finetune_AS2M

EAT-base_epoch30_finetune_AS2M is an embedding model with publicly available weights, fine-tuned from eat-base_epoch30_pretrain for its target task. The AS2M suffix most likely refers to the AudioSet-2M fine-tuning dataset rather than a 2M parameter count, and EAT (Efficient Audio Transformer) operates on audio spectrograms, so confirm that it fits an image embedding pipeline before adopting it. Read the model card for hardware requirements and licensing fine print before deploying.

345,829 ↓ · 3 ♡

dinov2-with-registers-base

dinov2-with-registers-base is an open-weight image embedding model released under Apache 2.0, which permits use in closed-source and paid products. As with most open checkpoints, run a quick in-domain evaluation before committing to it.

309,213 ↓ · 10 ♡

vit_base_patch16_clip_224.openai

vit_base_patch16_clip_224.openai is an open-weight CLIP-based model aimed at image embedding. Its permissive Apache 2.0 terms allow use in commercial pipelines. It ships without a hosted SLA, so budget for self-managed deployment and monitoring.

297,382 ↓ · 11 ♡
