Any-to-any models

24 models · ranked by Hugging Face downloads

gemma-4-E4B-it

gemma-4-E4B-it is Google DeepMind's edge-optimized, 4-billion-parameter any-to-any multimodal model from the Gemma 4 family, designed for deployment on mobile and edge devices rather than servers. The 'any-to-any' pipeline_tag indicates multimodal input and output capability beyond standard image-text-to-text. It is Apache 2.0 licensed.

5,793,827 ↓ · 1,298 ♡

gemma-4-12B-it

gemma-4-12B-it is Google's Gemma 4 multimodal (text + image) instruction-tuned model. It accepts both text and image inputs and produces text, making it suitable for document analysis, visual Q&A, and structured data extraction. Released under Apache-2.0, it targets users who need a capable VLM without access restrictions.

2,420,240 ↓ · 1,202 ♡

gemma-4-E2B-it

Gemma 4 E2B is Google's efficient 2B-parameter multimodal model, instruction-tuned for both image-text and text-only prompts. It targets edge and on-device deployment where a sub-3B footprint is necessary.

2,289,861 ↓ · 781 ♡

Qwen3-Omni-30B-A3B-Instruct

Qwen3-Omni-30B-A3B-Instruct handles multiple input and output modalities including text, images, and audio within a single unified architecture.

1,982,858 ↓ · 948 ♡

Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4

Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4 processes and generates across multiple modalities, enabling cross-modal reasoning in a single model call.

1,865,704 ↓ · 145 ♡

gemma-4-12B-it-qat-w4a16-ct

gemma-4-12B-it-qat-w4a16-ct is a quantization-aware trained (QAT) version of Google's Gemma 4 multimodal (text + image) instruction-tuned model, packaged for W4A16 deployment (4-bit weights, 16-bit activations). Its 12B parameters are stored as lower-precision weights for deployment on memory-constrained hardware or Apple Silicon, with quality degradation typically small for general chat tasks. The base model is Apache-2.0 licensed.

1,832,968 ↓ · 33 ♡

Qwen2.5-Omni-3B

Qwen2.5-Omni-3B handles multiple input and output modalities including text, images, and audio within a single unified architecture.

1,686,854 ↓ · 336 ♡

gemma-4-E4B-it-MLX-4bit

A 4-bit MLX quantization of Google's Gemma 4 E4B instruct model (an efficient 4B-equivalent MoE variant) for Apple Silicon. Targets developers who want Gemma 4 running locally on MacBook-class hardware.

1,489,758 ↓ · 12 ♡

gemma-4-E4B-it-MLX-8bit

An 8-bit MLX quantization of Google's Gemma 4 E4B instruct model for Apple Silicon. Higher quality than the 4-bit variant at the cost of roughly double the memory, targeting M2/M3 Pro or Max class machines.

1,446,141 ↓ · 7 ♡
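
The "roughly double the memory" figure for the 8-bit variant follows from weight storage scaling linearly with bits per weight. A minimal sketch of that arithmetic, assuming a nominal 4 billion parameters (an illustrative figure, not the repository's exact count) and ignoring quantization scales, the KV cache, and activations:

```python
# Rough weight-only memory estimate for different quantization widths.
# NOMINAL_PARAMS is an assumption for illustration; real checkpoints also
# store quantization scales and may keep some tensors at higher precision.
NOMINAL_PARAMS = 4e9


def weight_bytes(num_params: float, bits_per_weight: float) -> float:
    """Approximate bytes needed to store the weights alone."""
    return num_params * bits_per_weight / 8


def weight_gib(num_params: float, bits_per_weight: float) -> float:
    """Same estimate expressed in GiB (2**30 bytes)."""
    return weight_bytes(num_params, bits_per_weight) / 2**30


if __name__ == "__main__":
    for bits in (4, 5, 6, 8, 16):
        print(f"{bits:>2}-bit: ~{weight_gib(NOMINAL_PARAMS, bits):.2f} GiB")
```

Actual memory use at inference time will be higher than these weight-only numbers, so treat them as a lower bound when choosing between the 4-, 5-, 6-, and 8-bit builds.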

gemma-4-E4B-it-MLX-5bit

gemma-4-E4B-it-MLX-5bit is an MLX 5-bit quantization, optimized for Apple Silicon inference, of Google's Gemma 4 MoE-based multimodal (text + image) instruction-tuned model. Its parameters are stored as lower-precision weights to fit memory-constrained hardware, with quality degradation typically small for general chat tasks. The base model is Apache-2.0 licensed.

1,435,569 ↓ · 0 ♡

gemma-4-E4B-it-MLX-6bit

gemma-4-E4B-it-MLX-6bit is an MLX 6-bit quantization, optimized for Apple Silicon inference, of Google's Gemma 4 MoE-based multimodal (text + image) instruction-tuned model. Its parameters are stored as lower-precision weights to fit memory-constrained hardware, with quality degradation typically small for general chat tasks. The base model is Apache-2.0 licensed.

1,435,125 ↓ · 3 ♡

Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16

Nemotron-3 Nano Omni is NVIDIA's multimodal reasoning model — 30B total parameters with 3B active per token — that extends the Nemotron-H architecture to support any-to-any input and output modalities including audio, image, and text. The Reasoning variant includes a thinking mode for extended chain-of-thought. It runs in BF16 full precision, targeting multi-GPU H100/H200 deployments.

752,479 ↓ · 362 ♡
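
The "30B total, 3B active per token" split above has two separate consequences: weight memory scales with the total count, while per-token compute scales with the active count. A back-of-the-envelope sketch, assuming 2 bytes per weight for BF16 and the rounded parameter counts from the description (the helper name is hypothetical):

```python
# Back-of-the-envelope numbers for a mixture-of-experts model.
# Assumptions: BF16 stores 2 bytes per weight; parameter counts are the
# rounded figures from the description, not exact checkpoint sizes.


def moe_summary(total_params: float, active_params: float,
                bytes_per_weight: int = 2) -> dict:
    """Return weight memory (GB, 1e9 bytes) and the fraction of weights used per token."""
    return {
        "weight_gb": total_params * bytes_per_weight / 1e9,
        "active_fraction": active_params / total_params,
    }


if __name__ == "__main__":
    summary = moe_summary(total_params=30e9, active_params=3e9)
    print(f"weights: ~{summary['weight_gb']:.0f} GB in BF16")
    print(f"active per token: ~{summary['active_fraction']:.0%} of parameters")
```

All weights still have to be resident in memory even though only a fraction is used for any one token, which is why the BF16 build is sized by its total parameter count.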

Qwen2.5-Omni-7B

Qwen2.5-Omni-7B is a multimodal model accepting diverse input types and producing outputs across text, vision, and audio modalities.

682,523 ↓ · 1,913 ♡

gemma-4-31B-it-assistant

gemma-4-31B-it-assistant is Google's 31-billion-parameter instruction-tuned Gemma 4 model configured for assistant-style interactions. Listed under the any-to-any pipeline tag, it is designed to handle flexible input-output modality combinations within the Transformers ecosystem.

548,069 ↓ · 308 ♡

gemma-4-12B-it-qat-q4_0-gguf

This is the official Google release of Gemma 4 12B instruction-tuned in GGUF format, quantized to q4_0 using Quantization-Aware Training. Unlike community repacks, this comes directly from Google, providing clearer provenance for production pipelines that require verified model sources.

534,189 ↓ · 189 ♡

gemma-4-E4B

gemma-4-E4B is a multimodal model accepting diverse input types and producing outputs across text, vision, and audio modalities.

524,961 ↓ · 333 ♡

gemma-4-12B-it-qat-GGUF

gemma-4-12B-it-qat-GGUF is Unsloth's GGUF repack of Google's Gemma 4 12B instruction-tuned model, which was originally quantization-aware trained (QAT) at q4_0. GGUF packaging enables CPU and hybrid CPU/GPU inference via llama.cpp-compatible runtimes without requiring a full PyTorch stack. The Apache 2.0 license and Unsloth's optimization focus make this a practical option for local inference on consumer hardware.

463,428 ↓ · 289 ♡

MiniCPM-o-2_6

MiniCPM-o 2.6 is an omnimodal 8B model from OpenBMB supporting speech, image, and text inputs with real-time audio output. It targets on-device multimodal scenarios, particularly mobile and edge deployments, with end-to-end speech conversation capability.

439,740 ↓ · 1,292 ♡

OneThinker-SFT-Qwen3-8B

OneThinker-SFT is a Qwen3-8B model from OneThink, trained with supervised fine-tuning (SFT) on a vision-language task mixture. It uses the Qwen3-VL architecture for any-to-any multimodal output and is Apache-2.0 licensed.

431,837 ↓ · 4 ♡

MiniCPM-o-4_5

MiniCPM-o-4_5 is an open-weight model in the MiniCPM family focused on multimodal any-to-any generation. Its Apache 2.0 license permits commercial reuse. Check the model card for benchmarks and intended use before adopting it.

359,690 ↓ · 1,404 ♡

gemma-4-E4B-it-assistant

gemma-4-E4B-it-assistant is an openly licensed multimodal any-to-any generation model in the Gemma family. At about 4B parameters, it sits in the mid-sized tier, which sets its memory and latency budget. It is Apache 2.0-licensed, which permits use in closed-source and paid products. Treat its published metrics as a starting point and validate against your workload.

351,896 ↓ · 113 ♡

gemma-4-12B

Built for multimodal any-to-any generation, gemma-4-12B is a Gemma-based model with publicly available weights. At about 12B parameters, it sits in the large tier, which sets its memory and latency budget. It is Apache 2.0-licensed, which permits use in closed-source and paid products. It ships without a hosted SLA, so budget for self-managed deployment and monitoring.

351,287 ↓ · 617 ♡

Qwen3-Omni-30B-A3B-Thinking

Qwen3-Omni-30B-A3B-Thinking is a large Qwen3-based model focused on multimodal any-to-any generation. It has about 30B total parameters; the A3B suffix denotes roughly 3B active per token, which trades some ceiling for cheaper, faster inference. It lists a non-standard license, so read the model card for hardware requirements and licensing terms before deploying.

341,994 ↓ · 308 ♡

gemma-4-E2B

gemma-4-E2B is Google's 2B edge model from the Gemma 4 family, designed for on-device deployment with multimodal any-to-any capability. The 'E' prefix indicates edge-optimized: smaller memory footprint and lower latency are prioritized over raw capability. It supports image and text input/output in a single model.

336,368 ↓ · 362 ♡