Zero-shot image classification models

27 models · ranked by HuggingFace downloads

clip-vit-base-patch32

OpenAI's CLIP model with a ViT-B/32 image encoder, the cheapest to run of OpenAI's ViT-based CLIP checkpoints. Trained contrastively on 400 million image-text pairs, it aligns image and text representations in a shared embedding space for zero-shot classification and retrieval. The B/32 variant trades some accuracy relative to ViT-L/14 for faster inference.

23,159,737 ↓ · 964 ♡

clip-vit-large-patch14

OpenAI's CLIP model using a ViT-L/14 image encoder, trained contrastively on 400 million image-text pairs from the internet. It aligns image and text in a shared embedding space, enabling zero-shot image classification by comparing image embeddings against text label embeddings. The ViT-L/14 variant offers higher accuracy than the smaller ViT-B/32 at greater compute cost.

11,807,851 ↓ · 2,040 ♡
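
The comparison of image embeddings against text label embeddings described for the CLIP entries above can be sketched with toy vectors. This is a minimal illustration, not the model's actual code: the three-dimensional embeddings are made up, and `logit_scale=100.0` is an assumed temperature in the range CLIP-style models typically use.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_probs(image_emb, label_embs, logit_scale=100.0):
    """Softmax over scaled cosine similarities, one probability per label."""
    logits = [logit_scale * cosine(image_emb, e) for e in label_embs]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical embeddings; a real model would produce these from its encoders.
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
label_embs = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
image_emb = [0.8, 0.3, 0.1]

probs = zero_shot_probs(image_emb, label_embs)
best = labels[probs.index(max(probs))]
print(best)
```

Because the labels are only text prompts, swapping in a new label set requires no retraining, which is what makes the classification zero-shot.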

CLIP-ViT-B-32-laion2B-s34B-b79K

OpenCLIP ViT-B/32 trained by LAION on 2 billion image-text pairs from the LAION-2B dataset. It provides open-source CLIP features comparable to OpenAI's original ViT-B/32 while being trained on a fully public dataset.

3,724,719 ↓ · 141 ♡
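
OpenCLIP checkpoint names such as the one above encode training details as suffix tags. The sketch below assumes the common convention that `sNN` is the number of samples seen and `bNN` is the global batch size; the helper `parse_openclip_tags` is hypothetical and not part of any library.

```python
import re

# Multipliers for the K/M/B suffixes used in the tags.
_SUFFIX = {"K": 10**3, "M": 10**6, "B": 10**9}

def parse_openclip_tags(name):
    """Extract samples-seen and batch-size tags from an OpenCLIP-style name."""
    match = re.search(r"-s(\d+)([KMB])-b(\d+)([KMB])", name)
    if match is None:
        raise ValueError(f"no sNN/bNN tags found in {name!r}")
    samples = int(match.group(1)) * _SUFFIX[match.group(2)]
    batch = int(match.group(3)) * _SUFFIX[match.group(4)]
    return {"samples_seen": samples, "batch_size": batch}

info = parse_openclip_tags("CLIP-ViT-B-32-laion2B-s34B-b79K")
print(info)
```

Under that reading, 34B samples seen over a roughly 2B-pair dataset corresponds to about 17 passes over the data.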

PickScore_v1

PickScore_v1 is a CLIP-based human preference scorer trained on the Pick-a-Pic dataset of text-image pairs with human preference labels. Given a text prompt and a set of generated images, it predicts which image humans would prefer. It is typically used as a reward model in reinforcement-learning-from-human-feedback (RLHF) pipelines for image generation, not as a standalone image generator.

3,208,423 ↓ · 52 ♡

fashion-clip

CLIP fine-tuned on a large fashion product dataset to improve image-text alignment for apparel, accessories, and retail imagery. Standard CLIP models underperform on fashion-specific queries due to distribution shift from generic web data.

2,941,942 ↓ · 284 ♡

clip-vit-large-patch14-336

OpenAI CLIP ViT-L/14 at 336×336px input resolution, a higher-resolution variant of the standard ViT-L/14 CLIP model. The patch size stays at 14px, so the larger input yields more patches per image and preserves more fine-grained visual detail, at the cost of a longer token sequence and slower inference. Otherwise it shares the same contrastive training on 400M image-text pairs as the base ViT-L/14.

2,548,335 ↓ · 307 ♡
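
The effect of input resolution and patch size on sequence length is simple arithmetic. The sketch below counts image patches only (it ignores any class token), and the helper name `num_patches` is made up for illustration.

```python
def num_patches(image_size, patch_size):
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    per_side = image_size // patch_size
    return per_side * per_side

configs = [
    ("ViT-B/32 @ 224px", 224, 32),
    ("ViT-L/14 @ 224px", 224, 14),
    ("ViT-L/14 @ 336px", 336, 14),
]
for name, size, patch in configs:
    print(name, num_patches(size, patch))
# ViT-B/32 @ 224px 49
# ViT-L/14 @ 224px 256
# ViT-L/14 @ 336px 576
```

Going from 224px to 336px at the same patch size multiplies the patch count by 2.25, which is where both the extra detail and the extra compute come from.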

siglip-so400m-patch14-384

SigLIP (Sigmoid Loss for Language-Image Pre-training) So400m at 384px resolution is Google's vision-language model with a shape-optimized, roughly 400M-parameter vision encoder. It is trained with a pairwise sigmoid binary cross-entropy loss instead of CLIP's softmax contrastive loss, and is reported to achieve stronger zero-shot classification than CLIP ViT-L at comparable scale.

1,605,718 ↓ · 678 ♡

clip-vit-base-patch16

clip-vit-base-patch16 uses a joint image-text embedding space to score unseen label categories against input images.

1,529,849 ↓ · 163 ♡

siglip-base-patch16-224

SigLIP base/patch16 at 224px resolution is the lightweight tier of Google's sigmoid-loss vision-language pretraining model. It serves as a vision encoder for multimodal pipelines and as a standalone zero-shot classifier.

1,433,507 ↓ · 84 ♡

siglip2-giant-opt-patch16-384

siglip2-giant-opt-patch16-384 is the giant variant of Google's SigLIP 2, a vision-language encoder that takes 384px inputs split into 16×16 patches. SigLIP 2 keeps the sigmoid loss introduced by the original SigLIP and adds captioning-based pretraining, self-supervised losses (self-distillation and masked prediction), and online data curation, improving zero-shot classification accuracy over the original SigLIP. As the largest checkpoint in the family, it targets the highest zero-shot classification quality at the greatest compute cost.

885,045 ↓ · 43 ♡

siglip2-so400m-patch16-naflex

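SigLIP2 So400m with NaFlex encoding (native aspect ratio, flexible resolution): the larger, roughly 400M-parameter counterpart of siglip2-base-patch16-naflex. NaFlex keeps each image's native aspect ratio and supports variable sequence lengths, resizing only as far as needed to fit the patch grid and token budget rather than forcing every image to a fixed square. This preserves spatial detail, especially for non-square inputs. It is among the stronger SigLIP2 checkpoints both for CLIP-style tasks and as a vision encoder in multimodal LLMs.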

857,999 ↓ · 74 ♡

marqo-fashionSigLIP

marqo-fashionSigLIP classifies images into arbitrary label sets without task-specific fine-tuning. It compares image embeddings to text descriptions of candidate categories.

837,897 ↓ · 83 ♡

siglip2-base-patch16-naflex

SigLIP2-Base with NaFlex encoding (native aspect ratio, flexible resolution), which keeps each image's aspect ratio and adjusts the patch sequence length per image rather than resizing everything to a fixed square. This improves accuracy on images where spatial details matter. The base variant offers a smaller memory footprint than the so400m variant.

791,616 ↓ · 36 ♡
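
One way to picture aspect-preserving patching is to choose the largest patch grid that keeps the image's aspect ratio and stays within a token budget. The sketch below is an illustrative approximation only, not the actual SigLIP2 preprocessing: the function `naflex_grid`, the binary search, and the default budget of 256 patches are all assumptions.

```python
import math

def naflex_grid(height, width, patch_size=16, max_num_patches=256):
    """Return (rows, cols) of the largest aspect-preserving patch grid
    whose patch count stays within max_num_patches."""
    def grid(scale):
        rows = max(1, math.ceil(height * scale / patch_size))
        cols = max(1, math.ceil(width * scale / patch_size))
        return rows, cols

    lo, hi = 1e-3, 100.0  # lo is assumed feasible, hi infeasible
    for _ in range(60):   # binary search on the resize scale
        mid = (lo + hi) / 2
        rows, cols = grid(mid)
        if rows * cols <= max_num_patches:
            lo = mid
        else:
            hi = mid
    return grid(lo)

print(naflex_grid(512, 512))  # square image
print(naflex_grid(480, 640))  # 3:4 image keeps a wider-than-tall grid
```

A fixed-size pipeline would map both images to the same square grid, distorting the second one; here the 480×640 input keeps more columns than rows.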

BiomedCLIP-PubMedBERT_256-vit_base_patch16_224

BiomedCLIP-PubMedBERT_256-vit_base_patch16_224 uses a joint image-text embedding space to score unseen label categories against input images.

724,616 ↓ · 411 ♡

siglip2-so400m-patch16-256

SigLIP2 is Google's second-generation sigmoid-loss vision-language model; this checkpoint uses the shape-optimized, roughly 400M-parameter vision encoder with a 16px patch size and 256px input resolution. The sigmoid loss (versus the softmax loss in CLIP) scores every image-text pair independently as a positive or negative, so it needs no softmax normalization across the whole batch. It is often used as the vision encoder in multimodal LLMs.

707,908 ↓ · 5 ♡
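
The difference between the pairwise sigmoid loss and a CLIP-style softmax loss can be sketched on a toy similarity matrix, where entry `sim[i][j]` is the similarity of image `i` and text `j` and matching pairs sit on the diagonal. This is a simplified illustration; the temperature `t=10.0` and bias `b=-10.0` are assumed values, and the matrices are made up.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def sigmoid_loss(sim, t=10.0, b=-10.0):
    """Each pair is an independent binary decision: match (+1) or not (-1)."""
    n = len(sim)
    total = 0.0
    for i in range(n):
        for j in range(n):
            z = 1.0 if i == j else -1.0
            total += log_sigmoid(z * (t * sim[i][j] + b))
    return -total / n

def softmax_contrastive_loss(sim, t=10.0):
    """Symmetric cross-entropy; every row and column is normalized jointly."""
    n = len(sim)

    def cross_entropy(rows):
        loss = 0.0
        for i, row in enumerate(rows):
            logits = [t * s for s in row]
            top = max(logits)
            lse = top + math.log(sum(math.exp(l - top) for l in logits))
            loss += lse - logits[i]
        return loss / n

    cols = [list(c) for c in zip(*sim)]
    return 0.5 * (cross_entropy(sim) + cross_entropy(cols))

aligned = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.2], [0.0, 0.2, 0.9]]
swapped = [[0.1, 0.8, 0.2], [0.9, 0.1, 0.0], [0.0, 0.2, 0.9]]  # rows 0 and 1 swapped

print(sigmoid_loss(aligned), sigmoid_loss(swapped))
print(softmax_contrastive_loss(aligned), softmax_contrastive_loss(swapped))
```

In the sigmoid version each term depends on a single matrix entry, while in the softmax version every term depends on a whole row or column through the normalizer.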

siglip2-so400m-patch14-384

siglip2-so400m-patch14-384 performs zero-shot classification by measuring similarity between the image representation and natural-language class descriptions.

655,683 ↓ · 91 ♡

CLIP-convnext_base_w-laion2B-s13B-b82K-augreg

CLIP-convnext_base_w-laion2B-s13B-b82K-augreg classifies images into arbitrary label sets without task-specific fine-tuning. It compares image embeddings to text descriptions of candidate categories.

550,820 ↓ · 8 ♡

PE-Core-S16-384

PE-Core-S16-384 is Meta's Perception Encoder model at the Small/16-patch/384px configuration, designed for zero-shot image classification and visual representation learning. It is described in arxiv:2504.13181 as a general-purpose vision encoder trained for broad perceptual tasks.

549,991 ↓ · 0 ♡

CLIP-ViT-H-14-laion2B-s32B-b79K

CLIP-ViT-H-14-laion2B-s32B-b79K classifies images into arbitrary label sets without task-specific fine-tuning. It compares image embeddings to text descriptions of candidate categories.

394,450 ↓ · 462 ♡

CLIP-ViT-B-16-laion2B-s34B-b88K

OpenCLIP ViT-B/16 trained on LAION-2B with 34B samples seen during training. The ViT-B/16 architecture splits 224px inputs into 16×16 pixel patches, producing four times as many patches as ViT-B/32 and better feature quality at moderate additional cost.

391,863 ↓ · 39 ♡

CLIP-ViT-L-14-laion2B-s32B-b82K

CLIP-ViT-L-14-laion2B-s32B-b82K classifies images into arbitrary label sets without task-specific fine-tuning. It compares image embeddings to text descriptions of candidate categories.

380,383 ↓ · 64 ♡

siglip2-base-patch16-224

siglip2-base-patch16-224 performs zero-shot classification by measuring similarity between the image representation and natural-language class descriptions.

362,327 ↓ · 109 ♡

PE-Core-L14-336

PE-Core-L14-336 is an open-weight checkpoint for zero-shot image classification, distributed on the HuggingFace Hub. Its Apache 2.0 license permits commercial reuse. Track upstream changes and pin a known-good revision before depending on it.

316,732 ↓ · 52 ♡

vit_base_patch16_plus_clip_240.laion400m_e31

vit_base_patch16_plus_clip_240.laion400m_e31 is an openly licensed zero-shot image classification model in the CLIP family. Its MIT license permits use in closed-source and paid products. Evaluate it on your own data before trusting it in production.

314,216 ↓ · 1 ♡

siglip2-base-patch16-512

siglip2-base-patch16-512 is an open-weight checkpoint for zero-shot image classification, distributed on the HuggingFace Hub. Its Apache 2.0 license permits commercial reuse. As with most open checkpoints, run a quick in-domain evaluation before committing to it.

294,208 ↓ · 42 ♡

one-align

One-Align is a unified image and video quality assessment model from the Q-Future group, trained to score perceptual quality and alignment with human aesthetic preferences. It unifies image quality assessment (IQA) and video quality assessment (VQA) into a single model.

267,437 ↓ · 43 ♡

TinyCLIP-ViT-8M-16-Text-3M-YFCC15M

TinyCLIP-ViT-8M-16-Text-3M-YFCC15M is a compact CLIP-based model for zero-shot image classification. At roughly 8M parameters, it trades some peak accuracy for cheaper, faster inference. Its MIT license permits commercial reuse. Before relying on it, reproduce its key numbers on representative inputs.

232,353 ↓ · 12 ♡