AI Tools.

Image-text-to-text models

60 models · ranked by HuggingFace downloads

Qwen3-VL-2B-Instruct

Qwen3-VL-2B-Instruct is an open-source image-text-to-text model available on HuggingFace. Details are sourced from the public model registry.

185,757,734 ↓ · 382 ♡

Qwen2.5-VL-7B-Instruct

Qwen2.5-VL-7B-Instruct is an open-source image-text-to-text model available on HuggingFace. Details are sourced from the public model registry.

9,001,887 ↓ · 1,513 ♡

Qwen3.5-9B

Qwen3.5-9B is an open-source image-text-to-text model available on HuggingFace. Details are sourced from the public model registry.

7,358,849 ↓ · 1,369 ♡

gemma-4-31B-it

gemma-4-31B-it is an open-source image-text-to-text model available on HuggingFace. Details are sourced from the public model registry.

7,111,084 ↓ · 2,454 ♡

gemma-4-26B-A4B-it

gemma-4-26B-A4B-it is an open-source image-text-to-text model available on HuggingFace. Details are sourced from the public model registry.

5,629,669 ↓ · 854 ♡

Qwen3.5-4B

Qwen3.5-4B is an open-source image-text-to-text model available on HuggingFace. Details are sourced from the public model registry.

4,410,470 ↓ · 507 ♡

Kimi-K2.5

Kimi-K2.5 is an open-source image-text-to-text model available on HuggingFace. Details are sourced from the public model registry.

4,390,611 ↓ · 2,773 ♡

Qwen3-VL-8B-Instruct

Qwen3-VL-8B-Instruct is an open-source image-text-to-text model available on HuggingFace. Details are sourced from the public model registry.

4,199,512 ↓ · 886 ♡

Qwen2-VL-2B-Instruct

Qwen2-VL-2B-Instruct is an open-source image-text-to-text model available on HuggingFace. Details are sourced from the public model registry.

3,987,867 ↓ · 500 ♡

Qwen3.5-35B-A3B

Qwen3.5-35B-A3B is an open-source image-text-to-text model available on HuggingFace. Details are sourced from the public model registry.

3,742,960 ↓ · 1,411 ♡

Qwen2.5-VL-3B-Instruct

Qwen2.5-VL-3B-Instruct is an open-source image-text-to-text model available on HuggingFace. Details are sourced from the public model registry.

3,511,989 ↓ · 641 ♡

gemma-4-26B-A4B-it-GGUF

gemma-4-26B-A4B-it-GGUF is an open-source image-text-to-text model available on HuggingFace. Details are sourced from the public model registry.

3,440,778 ↓ · 640 ♡

Qwen3.5-27B

Qwen3.5-27B is an open-source image-text-to-text model available on HuggingFace. Details are sourced from the public model registry.

3,414,010 ↓ · 964 ♡

Qwen3.5-0.8B

Qwen3.5-0.8B is an open-source image-text-to-text model available on HuggingFace. Details are sourced from the public model registry.

3,056,675 ↓ · 512 ♡

llava-1.5-7b-hf

llava-1.5-7b-hf is an open-source image-text-to-text model available on HuggingFace. Details are sourced from the public model registry.

2,815,186 ↓ · 358 ♡

moondream2

moondream2 is an open-source image-text-to-text model available on HuggingFace. Details are sourced from the public model registry.

2,741,713 ↓ · 1,408 ♡

gemma-3-12b-it

gemma-3-12b-it is an open-source image-text-to-text model available on HuggingFace. Details are sourced from the public model registry.

2,706,983 ↓ · 712 ♡

Qwen3-VL-4B-Instruct

Qwen3-VL-4B-Instruct is an open-source image-text-to-text model available on HuggingFace. Details are sourced from the public model registry.

2,444,944 ↓ · 378 ♡

DeepSeek-OCR

DeepSeek-OCR is an open-source image-text-to-text model available on HuggingFace. Details are sourced from the public model registry.

2,386,954 ↓ · 3,221 ♡

gemma-3-4b-it

gemma-3-4b-it is an open-source image-text-to-text model available on HuggingFace. Details are sourced from the public model registry.

2,264,642 ↓ · 1,316 ♡

Qwen2-VL-7B-Instruct

Qwen2-VL-7B-Instruct is an open-source image-text-to-text model available on HuggingFace. Details are sourced from the public model registry.

2,238,989 ↓ · 1,274 ♡

gemma-4-31B-it-GGUF

gemma-4-31B-it-GGUF is an open-source image-text-to-text model available on HuggingFace. Details are sourced from the public model registry.

2,062,774 ↓ · 380 ♡

Qwen3.6-35B-A3B

Qwen3.6-35B-A3B is an open-source image-text-to-text model available on HuggingFace. Details are sourced from the public model registry.

1,977,187 ↓ · 1,539 ♡

gemma-4-E4B-it-GGUF

gemma-4-E4B-it-GGUF is an open-source image-text-to-text model available on HuggingFace. Details are sourced from the public model registry.

1,875,587 ↓ · 352 ♡

Qwen3.6-35B-A3B-GGUF

Qwen3.6-35B-A3B-GGUF is an open-source image-text-to-text model available on HuggingFace. Details are sourced from the public model registry.

1,850,307 ↓ · 874 ♡

Qwen3.6-35B-A3B-FP8

Qwen3.6-35B-A3B-FP8 is an open-source image-text-to-text model available on HuggingFace. Details are sourced from the public model registry.

1,849,134 ↓ · 196 ♡

Qwen3.5-2B

Qwen3.5-2B is an open-source image-text-to-text model available on HuggingFace. Details are sourced from the public model registry.

1,745,269 ↓ · 266 ♡

Qwen2-VL-7B-Instruct-AWQ

Qwen2-VL-7B-Instruct-AWQ is an open-source image-text-to-text model available on HuggingFace. Details are sourced from the public model registry.

1,720,055 ↓ · 49 ♡

Phi-3.5-vision-instruct

Phi-3.5-vision-instruct is an open-source image-text-to-text model available on HuggingFace. Details are sourced from the public model registry.

1,649,669 ↓ · 733 ♡

MinerU2.5-2509-1.2B

MinerU2.5-2509-1.2B is an open-source image-text-to-text model available on HuggingFace. Details are sourced from the public model registry.

1,566,196 ↓ · 356 ♡

Qwen3.5-35B-A3B-FP8

Qwen3.5-35B-A3B-FP8 is an open-source image-text-to-text model available on HuggingFace. Details are sourced from the public model registry.

1,535,973 ↓ · 147 ♡

DeepSeek-OCR-2

DeepSeek-OCR-2 is an open-source image-text-to-text model available on HuggingFace. Details are sourced from the public model registry.

1,495,688 ↓ · 934 ♡

Qwen3.5-397B-A17B-FP8

Qwen3.5-397B-A17B-FP8 is an open-source image-text-to-text model available on HuggingFace. Details are sourced from the public model registry.

1,419,110 ↓ · 165 ♡

Qwen3.5-35B-A3B-Uncensored-HauhauCS-Aggressive

Qwen3.5-35B-A3B-Uncensored-HauhauCS-Aggressive is an open-source image-text-to-text model available on HuggingFace. Details are sourced from the public model registry.

1,391,154 ↓ · 1,369 ♡

Qwen3.5-27B-FP8

Qwen3.5-27B-FP8 is an open-source image-text-to-text model available on HuggingFace. Details are sourced from the public model registry.

1,380,486 ↓ · 132 ♡

Qwen3-VL-235B-A22B-Instruct

Qwen3-VL-235B-A22B-Instruct is an open-source image-text-to-text model available on HuggingFace. Details are sourced from the public model registry.

1,352,076 ↓ · 383 ♡

Qwen3-VL-32B-Instruct

Qwen3-VL-32B-Instruct is an open-source image-text-to-text model available on HuggingFace. Details are sourced from the public model registry.

1,327,760 ↓ · 198 ♡

InternVL2-2B

InternVL2-2B is a compact vision-language model that combines a 300M-parameter vision encoder (InternViT) with a 1.8B-parameter language model (InternLM2), enabling multimodal understanding at roughly 2B total parameters. It is designed for efficient deployment while maintaining strong performance on vision-language tasks across multiple languages.

1,197,090 ↓ · 80 ♡

Qwen3.5-35B-A3B-GGUF

Qwen3.5-35B-A3B-GGUF is an open-source image-text-to-text model available on HuggingFace. Details are sourced from the public model registry.

1,061,848 ↓ · 833 ♡