text generation models

306 models · ranked by HuggingFace downloads

Qwen3-0.6B

Qwen3-0.6B is the 0.6-billion-parameter instruction-tuned model from Alibaba Cloud's Qwen3 series, fine-tuned from the Qwen3-0.6B-Base for conversational and task-following use. It targets deployment in environments where even a 1B model is too large — edge hardware, mobile devices, or ultra-low-latency services. Apache 2.0 licensed.

27,739,500 ↓ · 1,362 ♡

Qwen3-4B

Qwen3-4B is Alibaba's 4B parameter model from the Qwen3 series, which introduced a hybrid thinking mode allowing the model to switch between fast direct answering and extended chain-of-thought reasoning. It is a compact model capable of running on consumer hardware while outperforming many 7B predecessors on reasoning benchmarks. Apache 2.0 licensed.

15,932,949 ↓ · 641 ♡
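
The hybrid thinking mode described for Qwen3-4B means a response can carry a reasoning segment ahead of the final answer. The sketch below shows one way to separate the two in post-processing. It assumes the reasoning is wrapped in a single `<think>...</think>` block; the helper name `split_thinking` is hypothetical and not part of any library.

```python
import re

# Assumption: reasoning is wrapped in one <think>...</think> block.
# Adjust the tags if a given model or serving stack formats output differently.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(text: str) -> tuple[str, str]:
    """Return (reasoning, answer); reasoning is "" when no block is present."""
    match = THINK_BLOCK.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    # Everything outside the think block is treated as the visible answer.
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer


if __name__ == "__main__":
    sample = "<think>2 + 2 is 4.</think>\nThe answer is 4."
    print(split_thinking(sample))
```

Keeping the reasoning separate lets an application log or discard it while showing only the answer to end users.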

Qwen3-8B

Qwen3-8B is the 8-billion-parameter instruction-tuned model from Alibaba Cloud's Qwen3 family, positioned at the competitive midpoint between 4B and 14B+ tiers. It targets deployment on single consumer or workstation GPUs while providing strong reasoning and multilingual capabilities. Apache 2.0 licensed with text-generation-inference compatibility.

13,501,708 ↓ · 1,166 ♡

gpt2

OpenAI's original GPT-2 at 124M parameters, an autoregressive language model trained on WebText (over 8 million web documents filtered from Reddit outlinks). It generates English text continuation given a prompt using next-token prediction, trained without any instruction tuning or RLHF. MIT licensed and runnable on commodity CPU hardware.

12,980,059 ↓ · 3,316 ♡

opt-125m

OPT-125M is the smallest model in Meta's Open Pretrained Transformer series, a 125-million-parameter decoder-only LLM trained on a dataset comparable to GPT-3's training mix. Released as part of Meta's effort to make large language model weights accessible for research. At 125M parameters it is primarily used for prototyping, educational purposes, and compute-constrained environments.

12,733,324 ↓ · 267 ♡

Qwen2.5-7B-Instruct

Qwen2.5-7B-Instruct is Alibaba Cloud's 7-billion-parameter instruction-tuned language model from the Qwen2.5 series, supporting English and a range of other languages. It targets applications requiring more reasoning and knowledge than sub-3B models, while remaining deployable on a single consumer GPU. Apache 2.0 licensed with text-generation-inference compatibility.

12,715,875 ↓ · 1,387 ♡

Qwen2.5-1.5B-Instruct

Qwen2.5-1.5B-Instruct is a 1.5-billion-parameter instruction-tuned model from Alibaba Cloud's Qwen2.5 series, targeting edge and embedded deployment scenarios where even a 3B model is too large. Apache 2.0 licensed, it focuses on basic instruction following and short-context tasks at minimal compute cost.

11,692,294 ↓ · 749 ♡

Llama-3.1-8B-Instruct

Llama 3.1-8B-Instruct is Meta's 8-billion-parameter instruction-tuned model, supporting 8 languages including English, German, French, Spanish, Italian, Portuguese, Hindi, and Thai. Released under the Llama 3.1 license (permissive with restrictions for products over 700M users), it was a leading open-weight model at its scale at release. Context window extends to 128K tokens.

10,147,881 ↓ · 6,161 ♡

tiny-Qwen2ForCausalLM-2.5

A minimal Qwen2-architecture causal LM created by the TRL (Transformer Reinforcement Learning) team for internal testing purposes. It is not intended for any production use or meaningful text generation — it exists to provide a tiny, fast-loading model compatible with Qwen2 tokenization for unit testing TRL training scripts.

8,876,437 ↓ · 7 ♡

Llama-3.2-1B-Instruct

Llama 3.2-1B-Instruct is Meta's 1-billion-parameter instruction-tuned model from the Llama 3.2 family, the smallest Llama release targeting ultra-low-resource inference scenarios. It is designed for edge deployment on devices that cannot accommodate even 3B models. The Llama 3.2 license restricts use by products/services with over 700M monthly users.

8,395,796 ↓ · 1,502 ♡

Qwen2.5-3B-Instruct

Qwen2.5-3B-Instruct is a 3-billion-parameter instruction-tuned language model from Alibaba Cloud's Qwen2.5 series, positioned between the 1.5B and 7B tiers. It targets lightweight server deployments and on-device inference scenarios where 7B is too large. The license is listed as 'other', so review the specific Qwen 2.5 license terms before commercial deployment.

8,182,332 ↓ · 513 ♡

DeepSeek-R1

DeepSeek-R1 is a 671B parameter mixture-of-experts reasoning model from DeepSeek AI, trained with reinforcement learning to produce explicit chain-of-thought reasoning before answering. It achieves GPT-4-class performance on math, coding, and logical inference benchmarks and is released under an MIT license. Active parameters per forward pass are a subset of the 671B total, reducing compute per generated token.

7,292,805 ↓ · 13,417 ♡

gpt-oss-20b

GPT-OSS-20B is a 20-billion-parameter open-weight language model released by OpenAI under Apache 2.0, notable as OpenAI's first substantial open-weight release after years of closed-weights policy. Based on the gpt_oss architecture, it targets high-quality text generation at a scale deployable on research and enterprise GPU infrastructure. FP8 and MXFP4 quantized variants reduce memory requirements.

7,005,630 ↓ · 4,733 ♡

gemma-3-270m

As a gemma3-based compact model, gemma-3-270m focuses on text generation and chat. gemma-3-270m is subject to Gemma terms, so confirm licensing before commercial use. Weighing in near 270M parameters, gemma-3-270m trades some ceiling for cheaper, faster inference. Read gemma-3-270m's card for hardware requirements and licensing fine print before deploying.

6,644,391 ↓ · 1,040 ♡

deepseek-v4-gguf

A GGUF conversion of DeepSeek V4 by antirez (Salvatore Sanfilippo, creator of Redis), packaged for local inference via llama.cpp. The model represents antirez's personal interest in local AI and has gathered community attention partly due to the author's reputation.

6,514,958 ↓ · 278 ♡

Qwen3-1.7B

Qwen3-1.7B is a 1.7-billion-parameter instruction-tuned language model from Alibaba Cloud's Qwen3 series, filling the gap between the 0.6B and 4B tiers. It targets constrained deployment scenarios where sub-1B quality is insufficient but 4B VRAM requirements are too high. Apache 2.0 licensed.

5,772,336 ↓ · 490 ♡

Qwen3-4B-Instruct-2507

Qwen3-4B-Instruct-2507 is a 4-billion-parameter instruction-tuned model from Alibaba Cloud's Qwen3 series, updated in July 2025. It targets the mid-range deployment tier between ultra-compact sub-2B models and the 7-8B tier requiring heavier hardware. Apache 2.0 licensed with text-generation-inference compatibility.

5,455,321 ↓ · 886 ♡

Qwen3.6-35B-A3B-NVFP4

Qwen3.6-35B-A3B-NVFP4 is an NVIDIA-optimized FP4 quantization of Qwen3.6-35B-A3B, produced with the ModelOpt toolkit for deployment on NVIDIA H100/H200 GPUs. FP4 weights reduce the weight memory footprint by roughly 3.5x to 4x compared to BF16 (4 bits per weight versus 16, less a small overhead for scale factors) while maintaining most of the original accuracy for conversational tasks. It is intended for inference on NVIDIA TensorRT-LLM or vLLM backends, not for further fine-tuning.

5,022,254 ↓ · 367 ♡

Qwen2.5-0.5B-Instruct

Qwen2.5-0.5B-Instruct is Alibaba Cloud's 0.5-billion-parameter instruction-tuned model, the smallest in the Qwen2.5 family. It targets the most resource-constrained deployment scenarios, prioritizing the ability to run on any hardware over output quality. Apache 2.0 licensed and English-focused.

5,015,816 ↓ · 543 ♡

DeepSeek-R1-0528

Built for text generation and chat, DeepSeek-R1-0528 is a deepseek-based model with publicly available weights. FP8 builds of DeepSeek-R1-0528 are published alongside the full checkpoint for low-memory serving. DeepSeek-R1-0528 is MIT-licensed, clearing it for closed-source and paid products. Read DeepSeek-R1-0528's card for hardware requirements and licensing fine print before deploying.

4,760,502 ↓ · 2,454 ♡

dolphin-2.9.1-yi-1.5-34b

Dolphin 2.9.1 is a community fine-tune of Yi-1.5-34B intended to remove safety filtering and produce an 'uncensored' instruction-tuned model that follows user requests without refusal. Trained by Cognitive Computations on OpenHermes, DolphinCoder, and similar datasets. Licensing follows the Yi-1.5-34B base model, so check the model card for the exact terms.

4,635,649 ↓ · 65 ♡

tiny-random-LlamaForCausalLM

A randomly initialized minimal LlamaForCausalLM instance with a tiny vocabulary and hidden dimension, used exclusively for fast unit testing and CI pipelines that need a real model interface without meaningful weights.

4,613,391 ↓ · 0 ♡

Qwen2.5-Coder-14B-Instruct

Qwen2.5-Coder-14B-Instruct is a qwen2-based, code-focused instruction model. At about 14B parameters, it sits in the large tier, which sets its memory and latency budget. The Apache 2.0 license keeps Qwen2.5-Coder-14B-Instruct unrestricted for commercial reuse. Before relying on Qwen2.5-Coder-14B-Instruct, reproduce its key numbers on representative inputs.

4,489,725 ↓ · 167 ♡

Qwen3-32B

Qwen3-32B is Alibaba Cloud's 32-billion-parameter instruction-tuned model from the Qwen3 series, targeting deployments requiring stronger reasoning, coding, and instruction following than 7-8B models while remaining lighter than 70B+ alternatives. Apache 2.0 licensed with text-generation-inference compatibility for production serving.

4,208,350 ↓ · 708 ♡

gpt-oss-120b

OpenAI's 120B parameter open-weight language model released under Apache 2.0 in 2025. Supports MXFP4 and 8-bit quantization for multi-GPU deployment via vLLM. Competitive on reasoning and instruction-following benchmarks within the open-weight tier.

4,051,310 ↓ · 4,923 ♡

Qwen2-1.5B-Instruct

Qwen2-1.5B-Instruct is Alibaba's 1.5B parameter instruction-tuned chat model from the Qwen2 series. Designed to run efficiently on CPU or low-VRAM hardware, it handles short-context instruction-following, summarization, and Q&A tasks in English. It is the practical choice when memory constraints prevent running larger Qwen2 variants.

3,850,031 ↓ · 162 ♡

gemma-3-1b-it

gemma-3-1b-it is an open-weight text generation and chat model in the gemma family. It is a fine-tune of gemma-3-1b-pt, inheriting that base model's general competence. At about 1B parameters, gemma-3-1b-it sits in the compact tier, which sets its memory and latency budget. Evaluate gemma-3-1b-it on your own data before trusting it in production.

3,594,680 ↓ · 1,024 ♡

Qwen2.5-7B-Instruct-AWQ

Qwen2.5-7B-Instruct-AWQ is the AWQ-quantized build of Qwen2.5-7B-Instruct, a qwen2-based mid-sized model for text generation and chat. The quantized weights lower serving memory compared with the full-precision checkpoint. The Apache 2.0 license keeps Qwen2.5-7B-Instruct-AWQ unrestricted for commercial reuse. Check the Qwen2.5-7B-Instruct-AWQ model card for benchmarks and intended use before adopting it.

3,592,071 ↓ · 47 ♡

Qwen3-14B

Qwen 3 14B is Alibaba's 14-billion-parameter text generation model, offering a significant capacity step up from the 7B class with competitive performance on reasoning, math, and multilingual tasks.

3,588,509 ↓ · 417 ♡

Gemma-3-1B-it-GLM-4.7-Flash-Heretic-Uncensored-Thinking_GGUF

A community GGUF-quantized finetune that merges Gemma-3-1B-it with elements from GLM-4.7-Flash-Thinking, configured to remove default safety refusals. Primarily targeting users who want a small, locally-runnable model with reduced content restrictions.

3,457,428 ↓ · 83 ♡

distilgpt2

DistilGPT2 is a knowledge-distilled version of GPT-2 small, with 82M parameters (vs GPT-2 small's 124M) and approximately 2x faster inference. Its model card reports a WikiText-103 perplexity of 21.1 versus 16.3 for GPT-2 small, so it trades some language modeling quality for being lighter to serve.

3,430,053 ↓ · 630 ♡

pythia-160m

Pythia-160M is one of the smallest models in EleutherAI's Pythia suite, trained on the Pile with 154 intermediate checkpoints published per model. It is designed for mechanistic interpretability and scaling-law research rather than production use.

3,032,035 ↓ · 42 ♡

Qwen3-30B-A3B

Qwen3-30B-A3B is an openly licensed text generation and chat model in the qwen3 family. Qwen3-30B-A3B is Apache 2.0-licensed, clearing it for closed-source and paid products. At about 30B total parameters, Qwen3-30B-A3B sits in the large tier for memory, while the A3B suffix indicates roughly 3B parameters active per token, which lowers per-token compute. Treat Qwen3-30B-A3B's published metrics as a starting point and validate against your workload.

2,714,629 ↓ · 903 ♡

DeepSeek-V3.2

DeepSeek-V3.2 is a Mixture-of-Experts (MoE) large language model from DeepSeek AI, fine-tuned from DeepSeek-V3.2-Exp-Base. It activates a subset of expert parameters per token rather than the full model, enabling high effective parameter counts at lower per-token compute cost. MIT licensed, making it freely deployable commercially despite its scale.

2,682,508 ↓ · 1,451 ♡

Qwen2.5-32B-Instruct

Built for text generation and chat, Qwen2.5-32B-Instruct is a qwen2-based model with publicly available weights. Qwen2.5-32B-Instruct is Apache 2.0-licensed, clearing it for closed-source and paid products. At about 32B parameters, Qwen2.5-32B-Instruct sits in the large tier, which sets its memory and latency budget. Before relying on Qwen2.5-32B-Instruct, reproduce its key numbers on representative inputs.

2,499,364 ↓ · 352 ♡

Qwen2.5-0.5B

Qwen 2.5 0.5B is the smallest base model in Alibaba's Qwen 2.5 family, designed for on-device scenarios requiring minimal memory. It shares the Qwen 2.5 tokenizer with larger models, enabling consistent prompt formatting across the family.

2,342,677 ↓ · 424 ♡

GLM-5-FP8

GLM-5-FP8 targets text generation and chat and is shipped as an open-weight, self-hostable checkpoint. Permissive MIT terms let GLM-5-FP8 go straight into commercial pipelines. Prebuilt FP8 weights roughly halve weight memory relative to a BF16 checkpoint on FP8-capable hardware. Like most open checkpoints, GLM-5-FP8 rewards a quick in-domain eval before commitment.

2,289,032 ↓ · 181 ♡

TinyLlama-1.1B-Chat-v1.0

TinyLlama 1.1B Chat is a compact instruction-tuned language model trained on 3 trillion tokens with the Llama 2 architecture. It targets deployment on devices with limited RAM while retaining basic instruction-following capability.

2,223,782 ↓ · 1,648 ♡

GLM-4.7-Flash

Built for text generation and chat, GLM-4.7-Flash is a glm-based model with publicly available weights. GLM-4.7-Flash is MIT-licensed, clearing it for closed-source and paid products. GLM-4.7-Flash ships without a hosted SLA, so budget for self-managed deployment and monitoring.

2,220,568 ↓ · 1,757 ♡

Qwen2.5-14B-Instruct

Qwen 2.5 14B Instruct is Alibaba's mid-tier instruction model with strong multilingual, coding, and math capabilities. It fills the gap between 7B-class models and the more expensive 32B/72B variants for production deployments.

2,217,753 ↓ · 351 ♡

MiniMax-M2.7

MiniMax-M2.7 targets text generation and chat and is shipped as an open-weight, self-hostable checkpoint. Prebuilt FP8 weights roughly halve weight memory relative to a BF16 checkpoint on FP8-capable hardware. Licensing for MiniMax-M2.7 is unspecified or custom, so clear it before commercial use. Evaluate MiniMax-M2.7 on your own data before trusting it in production.

2,183,457 ↓ · 1,221 ♡

SmolLM2-135M-Instruct

SmolLM2-135M-Instruct targets text generation and chat and is shipped as a compact, self-hostable checkpoint. Permissive Apache 2.0 terms let SmolLM2-135M-Instruct go straight into commercial pipelines. SmolLM2-135M-Instruct's 135M-parameter size keeps hosting requirements modest relative to frontier models. Treat SmolLM2-135M-Instruct's published metrics as a starting point and validate against your workload.

2,151,855 ↓ · 361 ♡

Qwen3-0.6B-FP8

FP8-quantized Qwen3 0.6B, the smallest model in the Qwen3 series. At 0.6B parameters and FP8 precision, it is primarily useful for ultra-low-latency classification or extraction tasks where quality requirements are minimal.

2,079,903 ↓ · 62 ♡

Gemma-4-26B-A4B-NVFP4

NVIDIA's NVFP4-quantized version of Google's Gemma-4-26B-A4B mixture-of-experts model, optimized for Blackwell-generation GPUs using Model Optimizer (ModelOpt). NVFP4 is a 4-bit floating-point format with native hardware support on Blackwell, providing better accuracy retention than INT4 at similar memory savings. Requires NIM or TensorRT-LLM for deployment.

2,036,481 ↓ · 99 ♡
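
To make the memory savings of a 4-bit block-scaled format concrete, the sketch below amortizes the shared scale factor over each block of weights. The layout used here (4-bit values, one 8-bit scale per block of 16 weights) is an assumption for illustration; per-tensor scales and unquantized layers are ignored, and the function names are hypothetical.

```python
def effective_bits(value_bits: int, scale_bits: int, block_size: int) -> float:
    """Bits stored per weight once the shared per-block scale is amortized."""
    return value_bits + scale_bits / block_size


def compression_vs_bf16(bits_per_weight: float) -> float:
    """How many times smaller the weights are than 16-bit BF16 weights."""
    return 16 / bits_per_weight


# Assumed layout: 4-bit values, one 8-bit scale per block of 16 weights.
nvfp4_bits = effective_bits(value_bits=4, scale_bits=8, block_size=16)
print(nvfp4_bits)                                  # 4.5
print(round(compression_vs_bf16(nvfp4_bits), 2))   # 3.56
```

Under these assumptions the weights shrink by a little over 3.5x relative to BF16, not a full 4x, because the scales are stored too.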

DeepSeek-V4-Flash

As a deepseek-based open-weight model, DeepSeek-V4-Flash focuses on text generation and chat. FP8 builds of DeepSeek-V4-Flash are published alongside the full checkpoint for low-memory serving. The MIT license keeps DeepSeek-V4-Flash unrestricted for commercial reuse. DeepSeek-V4-Flash ships without a hosted SLA, so budget for self-managed deployment and monitoring.

2,033,311 ↓ · 1,618 ♡

Llama-3.2-3B-Instruct

Llama 3.2 3B Instruct is Meta's compact instruction-tuned model designed for on-device and edge inference, with strong performance for its size on reasoning and instruction following benchmarks.

2,002,144 ↓ · 2,276 ♡

Qwen2.5-Coder-7B-Instruct

Qwen2.5-Coder 7B is a code-specialized instruction model trained on 5.5 trillion code tokens, covering 92 programming languages. It achieves competitive performance against much larger code models on pass@1 benchmarks.

1,980,174 ↓ · 743 ♡

Qwen3-Coder-30B-A3B-Instruct

Qwen3-Coder 30B is a code-specialized Mixture-of-Experts model with 30B total and 3B active parameters, instruction-tuned for programming tasks. It targets agentic coding workflows including multi-file editing, tool use, and repository-level understanding.

1,964,182 ↓ · 1,130 ♡
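
The gap between total and active parameters in a Mixture-of-Experts model can be put into rough numbers. The sketch below uses the 30B total and 3B active figures from the description and the common rule of thumb of about 2 FLOPs per active parameter per generated token. That rule is an approximation that ignores attention over the context, and the function names are hypothetical.

```python
def active_fraction(total_params: float, active_params: float) -> float:
    """Share of the model's parameters used for any single token."""
    return active_params / total_params


def gflops_per_token(active_params: float) -> float:
    """Approximate forward cost: ~2 FLOPs per active parameter per token."""
    return 2 * active_params / 1e9


total, active = 30e9, 3e9  # 30B total, 3B active
print(active_fraction(total, active))  # 0.1
print(gflops_per_token(active))        # 6.0 GFLOPs per token for the MoE
print(gflops_per_token(total))         # 60.0 for a dense model of equal size
```

All experts still have to be held in memory, so weight storage scales with the total parameter count even though compute scales with the active count.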

NVIDIA-Nemotron-3-Nano-4B-BF16

NVIDIA-Nemotron-3-Nano-4B-BF16 is NVIDIA's Nemotron Nano 4B, an instruction-tuned LLM derived from a larger Nemotron-H backbone via Neural Architecture Search. Despite the 4B parameter count, it is trained with NVIDIA's Nemotron post-training dataset stack covering math, coding, instruction following, and agentic tool use. BF16 weights are provided for direct inference on A100/H100 GPUs.

1,884,844 ↓ · 96 ♡

Llama-3.2-1B

Llama-3.2-1B is the llama-based compact base model of the Llama 3.2 family, aimed at text generation. At about 1B parameters, Llama-3.2-1B trades some ceiling for cheaper, faster inference. Training spans multiple languages, so Llama-3.2-1B covers cross-lingual text generation from one checkpoint. Llama-3.2-1B ships without a hosted SLA, so budget for self-managed deployment and monitoring.

1,883,705 ↓ · 2,461 ♡

Llama-3.2-1B-Instruct-FP8-dynamic

Red Hat's dynamically FP8-quantized version of Llama 3.2 1B Instruct, produced using llm-compressor for deployment on FP8-capable GPUs. Reduces memory and increases throughput while maintaining close-to-full-precision instruction following quality.

1,870,541 ↓ · 4 ♡

gpt2-large

GPT-2 Large is OpenAI's 774M-parameter version of the original GPT-2 autoregressive language model from 2019. It produces more coherent text than GPT-2 medium but is significantly outdated compared to modern LLMs.

1,823,125 ↓ · 353 ♡

Qwen3-Coder-Next-FP8

Qwen3-Coder-Next in FP8 precision, targeting high-throughput code generation on FP8-capable hardware (H100 SXM, H200). The FP8 format halves memory requirements vs BF16 while using tensor-core FP8 instructions for near-BF16 throughput. 'Next' in the name indicates this is a more capable successor to the base Qwen3-Coder, with improved instruction following for agentic coding tasks.

1,807,134 ↓ · 154 ♡

pythia-70m-deduped

pythia-70m-deduped targets text generation and chat and is shipped as a compact, self-hostable checkpoint. Permissive Apache 2.0 terms let pythia-70m-deduped go straight into commercial pipelines. pythia-70m-deduped's 70M-parameter size keeps hosting requirements modest relative to frontier models. pythia-70m-deduped is community-maintained, so track upstream changes and pin a known-good revision.

1,805,578 ↓ · 28 ♡

Qwen3-14B-AWQ

Qwen3-14B-AWQ is the AWQ-quantized build of Qwen3-14B, a qwen3-based large model for text generation and chat. The underlying model has about 14B parameters, and the quantized weights lower serving memory compared with the full-precision checkpoint. Read Qwen3-14B-AWQ's card for hardware requirements and licensing fine print before deploying.

1,710,547 ↓ · 69 ♡

Kimi-K2-Instruct-0905

Kimi-K2-Instruct-0905 is an open-weight model aimed at text generation and chat. Kimi-K2-Instruct-0905 lists a non-standard license, so confirm permissions before deployment. FP8 builds of Kimi-K2-Instruct-0905 are published alongside the full checkpoint for low-memory serving. Before relying on Kimi-K2-Instruct-0905, reproduce its key numbers on representative inputs.

1,648,993 ↓ · 751 ♡

Qwen2.5-Coder-32B-Instruct-AWQ

Qwen2.5-Coder-32B-Instruct-AWQ is the AWQ-quantized build of Qwen2.5-Coder-32B-Instruct, a qwen2-based, code-focused model with publicly available weights. The underlying model has about 32B parameters and sits in the large tier; the quantized weights lower its serving memory compared with the full-precision checkpoint. Qwen2.5-Coder-32B-Instruct-AWQ ships without a hosted SLA, so budget for self-managed deployment and monitoring.

1,646,473 ↓ · 37 ♡

OpenELM-1_1B-Instruct

As a compact model, OpenELM-1_1B-Instruct focuses on text generation and chat. At about 1.1B parameters, OpenELM-1_1B-Instruct trades some ceiling for cheaper, faster inference. OpenELM-1_1B-Instruct lists a non-standard license, so confirm permissions before deployment. Read OpenELM-1_1B-Instruct's card for hardware requirements before deploying.

1,574,276 ↓ · 75 ♡

PowerMoE-3b

PowerMoE-3b is an openly licensed text generation and chat model. At about 3B parameters, PowerMoE-3b sits in the compact-to-mid-sized tier, which sets its memory and latency budget. PowerMoE-3b is Apache 2.0-licensed, clearing it for closed-source and paid products. Like most open checkpoints, PowerMoE-3b rewards a quick in-domain eval before commitment.

1,565,342 ↓ · 21 ♡

Llama-3.1-8B

Llama-3.1-8B is a llama-based base model with publicly available weights, built for text generation. At about 8B parameters, Llama-3.1-8B sits in the mid-sized tier, which sets its memory and latency budget. Llama-3.1-8B is distributed under the Llama 3.1 Community License, which is worth reading before you ship. Before relying on Llama-3.1-8B, reproduce its key numbers on representative inputs.

1,524,097 ↓ · 2,286 ♡

Qwen2.5-Coder-32B-Instruct

Qwen2.5-Coder-32B-Instruct is a qwen2-based, code-focused instruction model with publicly available weights. At about 32B parameters, Qwen2.5-Coder-32B-Instruct sits in the large tier, which sets its memory and latency budget. The weights start from qwen2.5-coder-32b and add instruction tuning on top of it. Before relying on Qwen2.5-Coder-32B-Instruct, reproduce its key numbers on representative inputs.

1,512,155 ↓ · 2,056 ♡

Qwen2.5-1.5B

Qwen2.5-1.5B is an openly licensed base model for text generation in the qwen2 family. At about 1.5B parameters, Qwen2.5-1.5B sits in the compact tier, which sets its memory and latency budget. Qwen2.5-1.5B is Apache 2.0-licensed, clearing it for closed-source and paid products. Track upstream changes and pin a known-good revision when depending on Qwen2.5-1.5B.

1,510,253 ↓ · 189 ♡

Rio-3.0-Open-Mini

Rio-3.0-Open-Mini is an open-weight checkpoint for text generation and chat, distributed on the HuggingFace Hub. The MIT license keeps Rio-3.0-Open-Mini unrestricted for commercial reuse. It is a fine-tune of qwen3-4b-thinking-2507, inheriting that base model's general competence. Evaluate Rio-3.0-Open-Mini on your own data before trusting it in production.

1,456,219 ↓ · 9 ♡

Qwen3-VL-30B-A3B-Instruct-AWQ

Qwen3-VL-30B-A3B-Instruct-AWQ is an AWQ-quantized build of Qwen3-VL-30B-A3B-Instruct, a large checkpoint distributed on the HuggingFace Hub. The underlying model has about 30B total parameters, with the A3B suffix indicating roughly 3B active per token; the prebuilt AWQ weights lower serving memory compared with the full-precision checkpoint. Qwen3-VL-30B-A3B-Instruct-AWQ is community-maintained, so track upstream changes and pin a known-good revision.

1,406,682 ↓ · 43 ♡

SmolLM2-135M

SmolLM2-135M is a llama-based open-weight model aimed at text generation and chat. SmolLM2-135M's 135M-parameter size keeps hosting requirements modest relative to frontier models. Permissive Apache 2.0 terms let SmolLM2-135M go straight into commercial pipelines. SmolLM2-135M ships without a hosted SLA, so budget for self-managed deployment and monitoring.

1,390,928 ↓ · 209 ♡

Qwen2.5-14B-Instruct-AWQ

AWQ 4-bit quantized version of Qwen2.5-14B-Instruct, reducing memory requirements from ~28GB to ~8–10GB while maintaining most of the original model's instruction-following quality through activation-aware quantization.

1,347,300 ↓ · 37 ♡
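
The memory figures quoted for Qwen2.5-14B-Instruct-AWQ follow from simple bits-per-weight arithmetic, sketched below. It assumes a nominal 14e9 parameters and counts weights only, so quantization scales, any layers left unquantized, the KV cache, and runtime overhead come on top. The function name is hypothetical.

```python
def weight_gb(num_params: float, bits_per_weight: float) -> float:
    """Weight storage in gigabytes (1 GB = 1e9 bytes), weights only."""
    return num_params * bits_per_weight / 8 / 1e9


params = 14e9  # nominal parameter count, an assumption for illustration

print(weight_gb(params, 16))  # 28.0 GB at 16 bits per weight
print(weight_gb(params, 4))   # 7.0 GB at 4 bits per weight
```

The 7 GB floor for pure 4-bit weights is consistent with the ~8–10GB range above once the additional overheads are included.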

Meta-Llama-3-8B-Instruct

Meta-Llama-3-8B-Instruct is a mid-sized checkpoint for text generation and chat, distributed on the HuggingFace Hub. At about 8B parameters, Meta-Llama-3-8B-Instruct fits on a single GPU at a lower cost than larger Llama 3 variants. Meta-Llama-3-8B-Instruct is subject to Llama 3 Community terms, so confirm licensing before commercial use. Pin a known-good revision when depending on Meta-Llama-3-8B-Instruct.

1,302,688 ↓ · 4,643 ♡

LLaMA-1B-dj-refine-150B

LLaMA-1B fine-tuned on 150B tokens of RedPajama data filtered and refined by Data-Juicer, a data-cleaning toolkit from Alibaba DAMO. The training corpus was pruned using quality heuristics across Wikipedia, arXiv, Books, and Common Crawl slices. At 1B parameters it trades capability for low inference cost.

1,297,632 ↓ · 3 ♡

Gemma-4-31B-IT-NVFP4

NVIDIA's FP4-quantized version of Gemma 4 31B instruction-tuned, optimized for deployment on Blackwell GPU architecture (B100/B200). Represents the current extreme of low-precision quantization for LLM serving.

1,275,534 ↓ · 516 ♡

Kimi-K2.5-NVFP4

Kimi-K2.5-NVFP4 is an open-weight checkpoint for text generation and chat, distributed on the HuggingFace Hub. Licensing for Kimi-K2.5-NVFP4 is unspecified or custom — clear it before commercial use. Treat Kimi-K2.5-NVFP4's published metrics as a starting point and validate against your workload.

1,266,699 ↓ · 86 ♡

Meta-Llama-3-8B

Meta's Llama 3 8B base model, pretrained on over 15 trillion tokens with an expanded 128K token vocabulary. It serves as the foundation for instruction-tuned and task-specific finetunes in the Llama 3 ecosystem.

1,249,942 ↓ · 6,586 ♡

Qwen3-Coder-30B-A3B-Instruct-FP8

Qwen3-Coder-30B-A3B-Instruct-FP8 is the FP8 build of Qwen3-Coder-30B-A3B-Instruct, a code-focused Mixture-of-Experts model shipped as a large, self-hostable checkpoint. Prebuilt FP8 weights roughly halve weight memory relative to BF16 on FP8-capable hardware. With about 30B total parameters and roughly 3B active per token, its per-token compute stays well below that of a dense model of the same size. Evaluate Qwen3-Coder-30B-A3B-Instruct-FP8 on your own data before trusting it in production.

1,233,303 ↓ · 185 ♡

Qwen3-Coder-Next

As a qwen3-based open-weight model, Qwen3-Coder-Next focuses on text generation and chat. The Apache 2.0 license keeps Qwen3-Coder-Next unrestricted for commercial reuse. Before relying on Qwen3-Coder-Next, reproduce its key numbers on representative inputs.

1,216,820 ↓ · 1,488 ♡

NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4

NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 is a nemotron-based open-weight model aimed at text generation and chat. NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 lists a non-standard license, so confirm permissions before deployment. Training spans multiple languages, so NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 covers cross-lingual text generation and chat from one checkpoint. NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 ships without a hosted SLA, so budget for self-managed deployment and monitoring.

1,188,722 ↓ · 366 ♡

NVIDIA-Nemotron-3-Super-120B-A12B-BF16

NVIDIA-Nemotron-3-Super-120B-A12B-BF16 targets text generation and chat and is shipped as a frontier-scale, self-hostable checkpoint. Licensing for NVIDIA-Nemotron-3-Super-120B-A12B-BF16 is unspecified or custom, so clear it before commercial use. At about 120B total parameters in BF16, it needs multi-GPU hosting, although the A12B suffix indicates roughly 12B parameters active per token. Like most open checkpoints, NVIDIA-Nemotron-3-Super-120B-A12B-BF16 rewards a quick in-domain eval before commitment.

1,180,958 ↓ · 390 ♡

DeepSeek-V3-0324

DeepSeek-V3-0324 is a deepseek-based open-weight model aimed at text generation and chat. Permissive MIT terms let DeepSeek-V3-0324 go straight into commercial pipelines. FP8 builds of DeepSeek-V3-0324 are published alongside the full checkpoint for low-memory serving. DeepSeek-V3-0324 ships without a hosted SLA, so budget for self-managed deployment and monitoring.

1,177,756 ↓ · 3,132 ♡

DeepSeek-V4-Pro

DeepSeek-V4-Pro targets text generation and chat and is shipped as an open-weight, self-hostable checkpoint. Prebuilt FP8 weights roughly halve weight memory relative to a BF16 checkpoint on FP8-capable hardware. Permissive MIT terms let DeepSeek-V4-Pro go straight into commercial pipelines. Evaluate DeepSeek-V4-Pro on your own data before trusting it in production.

1,168,421 ↓ · 5,086 ♡

Phi-3.5-mini-instruct

Phi-3.5-mini-instruct targets text generation and chat and is shipped as an open-weight, self-hostable checkpoint. Permissive MIT terms let Phi-3.5-mini-instruct go straight into commercial pipelines. Phi-3.5-mini-instruct is multilingual by design rather than English-only. Like most open checkpoints, Phi-3.5-mini-instruct rewards a quick in-domain eval before commitment.

1,154,917 ↓ · 993 ♡

DeepSeek-Coder-V2-Lite-Instruct

DeepSeek-Coder-V2-Lite-Instruct is an open-weight checkpoint for text generation and chat, distributed on the HuggingFace Hub. Licensing for DeepSeek-Coder-V2-Lite-Instruct is unspecified or custom — clear it before commercial use. Evaluate DeepSeek-Coder-V2-Lite-Instruct on your own data before trusting it in production.

1,147,460 ↓ · 615 ♡

Qwen2.5-Coder-14B-Instruct-AWQ

Qwen2.5-Coder-14B-Instruct-AWQ is the AWQ-quantized build of Qwen2.5-Coder-14B-Instruct, a qwen2-based, code-focused model with publicly available weights. The underlying model has about 14B parameters and sits in the large tier; the quantized weights lower its serving memory compared with the full-precision checkpoint. Qwen2.5-Coder-14B-Instruct-AWQ is Apache 2.0-licensed, clearing it for closed-source and paid products. Qwen2.5-Coder-14B-Instruct-AWQ ships without a hosted SLA, so budget for self-managed deployment and monitoring.

1,145,245 ↓ · 21 ♡

h2ovl-mississippi-800m

h2ovl-mississippi-800m is a compact checkpoint for text generation and chat, distributed on the HuggingFace Hub. Weighing in near 800M parameters, h2ovl-mississippi-800m trades some ceiling for cheaper, faster inference. The Apache 2.0 license keeps h2ovl-mississippi-800m unrestricted for commercial reuse. Evaluate h2ovl-mississippi-800m on your own data before trusting it in production.

1,140,588 ↓ · 40 ♡

Qwen2.5-72B-Instruct-AWQ

Qwen2.5-72B-Instruct-AWQ is the AWQ-quantized build of Qwen2.5-72B-Instruct, a qwen2-based open-weight model aimed at text generation and chat. Qwen2.5-72B-Instruct-AWQ lists a non-standard license, so confirm permissions before deployment. The quantized weights lower serving memory compared with the full-precision checkpoint. Read Qwen2.5-72B-Instruct-AWQ's card for hardware requirements before deploying.

1,125,305 ↓ · 78 ♡

SmolLM-1.7B-Instruct-quantized.w4a16

SmolLM-1.7B-Instruct-quantized.w4a16 is a quantized build of the llama-based SmolLM-1.7B-Instruct, focused on text generation and chat. The w4a16 suffix denotes 4-bit weights with 16-bit activations, which shrinks the roughly 1.7B-parameter model's weight memory further. The Apache 2.0 license keeps SmolLM-1.7B-Instruct-quantized.w4a16 unrestricted for commercial reuse. Read SmolLM-1.7B-Instruct-quantized.w4a16's card for hardware requirements before deploying.

1,123,466 ↓ · 0 ♡

GLM-5.1-FP8

GLM-5.1-FP8 targets text generation and chat and is shipped as an open-weight, self-hostable checkpoint. Prebuilt FP8 weights roughly halve weight memory relative to a BF16 checkpoint on FP8-capable hardware. Permissive MIT terms let GLM-5.1-FP8 go straight into commercial pipelines. Evaluate GLM-5.1-FP8 on your own data before trusting it in production.

1,120,566 ↓ · 119 ♡

Qwen2-0.5B

Qwen2-0.5B is the smallest base model in Alibaba's Qwen2 family, with 0.5B parameters and a 32K token context window. As a base (non-instruct) model it requires fine-tuning or custom prompting for task-specific behavior. Despite its size, it outperforms several older models of similar scale on standard benchmarks.

1,114,502 ↓ · 169 ♡

h2ovl-mississippi-2b

h2ovl-mississippi-2b is a compact model for text generation and chat. At about 2B parameters it trades some ceiling for cheaper, faster inference. The Apache 2.0 license keeps it unrestricted for commercial reuse. Check the h2ovl-mississippi-2b model card for benchmarks and intended use before adopting it.

1,109,060 ↓ · 42 ♡

NVIDIA-Nemotron-3-Nano-30B-A3B-BF16

NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 is a nemotron-based large model for text generation and chat. Going by its name, it is a mixture-of-experts checkpoint with about 30B total parameters and roughly 3B active per forward pass, stored in BF16, so memory scales with the full 30B while per-token compute is closer to a 3B model. It lists a non-standard license, so confirm permissions before deployment. Check the model card for benchmarks and intended use before adopting it.

1,078,433 ↓ · 777 ♡

DeepSeek-V3

DeepSeek-V3 is a deepseek-based open-weight model for text generation and chat. FP8 weights are published for it, which lowers memory use relative to a 16-bit copy, although the model remains large. Before relying on DeepSeek-V3, reproduce its key numbers on representative inputs.

1,067,717 ↓ · 4,091 ♡

Qwen2.5-1.5B-quantized.w8a8

Qwen2.5-1.5B-quantized.w8a8 is an openly licensed text generation and chat model in the qwen2 family. At about 1.5B parameters it sits in the compact tier, which sets its memory and latency budget; the w8a8 suffix indicates 8-bit weights and 8-bit activations. It is Apache 2.0-licensed, which permits closed-source and paid products. The repository is community-maintained, so track upstream changes and pin a known-good revision.

1,066,320 ↓ · 4 ♡

DeepSeek-V2-Lite-Chat

DeepSeek-V2-Lite-Chat targets text generation and chat and is shipped as an open-weight, self-hostable checkpoint. Licensing for DeepSeek-V2-Lite-Chat is unspecified or custom — clear it before commercial use. DeepSeek-V2-Lite-Chat is community-maintained, so track upstream changes and pin a known-good revision.

1,064,427 ↓ · 141 ♡

Mistral-7B-Instruct-v0.2

Mistral 7B Instruct v0.2 improved on v0.1 with a 32K context window (up from 8K, and without v0.1's sliding-window attention) and better instruction following. It was among the strongest 7B open-weight instruction models at release and remains usable for text tasks, although later versions have since raised the bar.

1,045,054 ↓ · 3,169 ♡

tiny-gpt2

tiny-gpt2 is a gpt2-based model with publicly available weights, listed under text generation and chat. Its name suggests a scaled-down GPT-2 meant for testing rather than quality text generation; confirm this on the model card before relying on its outputs.

1,035,297 ↓ · 36 ♡

TinyLlama-1.1B-Chat-v0.3-GPTQ

TinyLlama-1.1B-Chat-v0.3-GPTQ is an openly licensed text generation and chat model in the llama family. At about 1.1B parameters it sits in the compact tier, which sets its memory and latency budget. Prebuilt GPTQ weights make local and edge inference straightforward. Like most open checkpoints, it rewards a quick in-domain eval before commitment.

1,023,018 ↓ · 10 ♡

DeepSeek-R1-0528-NVFP4-v2

DeepSeek-R1-0528-NVFP4-v2 is a deepseek-based open-weight model aimed at text generation and chat. Permissive MIT terms let it go straight into commercial pipelines. The weights derive from deepseek-r1-0528; the NVFP4 suffix indicates a 4-bit floating-point quantization of that model rather than a task-specific fine-tune. Before relying on DeepSeek-R1-0528-NVFP4-v2, reproduce its key numbers on representative inputs.

1,012,472 ↓ · 23 ♡

bart-large-emojilm

bart-large-emojilm is an open-weight checkpoint for text generation and chat, distributed on the HuggingFace Hub. Treat bart-large-emojilm's published metrics as a starting point and validate against your workload.

1,008,191 ↓ · 0 ♡

diffusiongemma-26B-A4B-it-NVFP4

This is NVIDIA's NVFP4-quantized version of Google's DiffusionGemma-26B-A4B instruction-tuned model, produced using NVIDIA's ModelOpt toolkit. DiffusionGemma applies a diffusion-based decoding approach to a Gemma architecture, and the A4B mixture-of-experts configuration activates roughly 4B parameters per forward pass from a 26B total. The NVFP4 quantization targets NVIDIA Blackwell-generation hardware for reduced memory footprint.

973,005 ↓ · 87 ♡

Qwen2.5-1.5B-Instruct-AWQ

Qwen2.5-1.5B-Instruct-AWQ is a qwen2-based model with publicly available weights, built for text generation and chat. At about 1.5B parameters it sits in the compact tier, which sets its memory and latency budget. As the name indicates, it is the AWQ-quantized build of Qwen2.5-1.5B-Instruct, intended for lower-memory serving than the full-precision checkpoint. Before relying on it, reproduce its key numbers on representative inputs.

916,179 ↓ · 7 ♡

Llama-3.2-1B-Instruct-FP8

Llama-3.2-1B-Instruct-FP8 is a compact checkpoint for text generation and chat, distributed on the HuggingFace Hub. It is subject to Llama 3.2 Community terms, so confirm licensing before commercial use. It is multilingual by design rather than English-only. Like most open checkpoints, it rewards a quick in-domain eval before commitment.

893,315 ↓ · 4 ♡

Mistral-7B-v0.1

Mistral-7B-v0.1 is a mistral-based mid-sized base model for text generation; it is not instruction-tuned, so chat use calls for an instruct variant or careful prompting. At about 7B parameters it trades some ceiling for cheaper, faster inference. The Apache 2.0 license keeps it unrestricted for commercial reuse. There is no hosted SLA, so budget for self-managed deployment and monitoring.

891,809 ↓ · 4,114 ♡

phi-4

phi-4 is an openly licensed text generation and chat model in the phi3 family. phi-4 is MIT-licensed, clearing it for closed-source and paid products. phi-4 is community-maintained, so track upstream changes and pin a known-good revision.

891,481 ↓ · 2,263 ♡

Qwen3-4B-Instruct-2507-FP8

Qwen3-4B-Instruct-2507-FP8 is a qwen3-based mid-sized model for text generation and chat. At about 4B parameters, stored in FP8, it trades some ceiling for cheaper, faster inference. The Apache 2.0 license keeps it unrestricted for commercial reuse. Check the model card for benchmarks and intended use before adopting it.

868,793 ↓ · 78 ♡

NVIDIA-Nemotron-Nano-9B-v2

NVIDIA-Nemotron-Nano-9B-v2 is a nemotron-based model for text generation and chat. At about 9B parameters it trades some ceiling for cheaper, faster inference than larger checkpoints. Training spans multiple languages, so it covers cross-lingual text generation and chat from one checkpoint. Before relying on it, reproduce its key numbers on representative inputs.

866,395 ↓ · 497 ♡

Phi-tiny-MoE-instruct

Built for text generation and chat, Phi-tiny-MoE-instruct is a phi-based model with publicly available weights. Phi-tiny-MoE-instruct is MIT-licensed, clearing it for closed-source and paid products. Phi-tiny-MoE-instruct ships without a hosted SLA, so budget for self-managed deployment and monitoring.

833,928 ↓ · 38 ♡

tiny-random-Llama-3

tiny-random-Llama-3 is a llama-based model with publicly available weights under the Apache 2.0 license. Its name suggests a small, randomly initialized checkpoint meant for testing pipelines rather than producing useful text. Check the model card for intended use before adopting it.

804,759 ↓ · 3 ♡

Qwen2.5-Coder-1.5B-Instruct

Qwen2.5-Coder-1.5B-Instruct is a compact instruction-tuned code model from Alibaba designed to handle code generation, explanation, and debugging tasks at 1.5B parameters. Despite its small size it scores competitively on HumanEval for its parameter class, making it a practical choice for on-device code assistants or latency-sensitive completion tools.

791,830 ↓ · 131 ♡

Qwen3-235B-A22B

Qwen3-235B-A22B is a frontier-scale checkpoint for text generation and chat, distributed on the HuggingFace Hub. It has about 235B total parameters, and the A22B suffix indicates roughly 22B active per forward pass, so hosting still requires memory for the full parameter set. The Apache 2.0 license keeps it unrestricted for commercial reuse. Like most open checkpoints, it rewards a quick in-domain eval before commitment.

789,246 ↓ · 1,099 ♡

DeepSeek-R1-Distill-Qwen-32B

DeepSeek-R1-Distill-Qwen-32B is a qwen2-based model with publicly available weights, built for text generation and chat. At about 32B parameters it sits in the large tier, which sets its memory and latency budget. It is MIT-licensed, which permits closed-source and paid products. Read the model card for hardware requirements and licensing fine print before deploying.

788,920 ↓ · 1,571 ♡

mamba-130m-hf

Mamba-130M is a selective state space model (SSM) from the Mamba architecture, offering linear-time inference complexity versus transformer quadratic attention. At 130M parameters it's a research checkpoint used to study SSM behavior, not a production text generator. The HF suffix indicates it's adapted for the Transformers interface.
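The linear-versus-quadratic contrast can be illustrated with a rough operation count. This is a minimal sketch only: the function names are hypothetical, the counts ignore constant factors and hidden dimensions, and the scalar recurrence uses fixed coefficients, whereas a selective SSM such as Mamba makes its parameters input-dependent.

```python
def causal_attention_pairs(seq_len: int) -> int:
    """Token pairs a causal attention layer scores: each token attends to itself and all earlier tokens."""
    return seq_len * (seq_len + 1) // 2


def recurrent_state_updates(seq_len: int) -> int:
    """State updates a recurrent (SSM-style) layer performs: one per token."""
    return seq_len


def scalar_recurrence(inputs, decay=0.9, gain=0.1):
    """Toy linear recurrence h_t = decay * h_{t-1} + gain * x_t with a single fixed-size state."""
    state = 0.0
    outputs = []
    for x in inputs:
        state = decay * state + gain * x
        outputs.append(state)
    return outputs


if __name__ == "__main__":
    for n in (1_000, 10_000, 100_000):
        print(n, causal_attention_pairs(n), recurrent_state_updates(n))
```

The recurrence keeps one state value regardless of sequence length, which is the property that makes per-token inference cost constant in this simplified picture.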

774,556 ↓ · 73 ♡

Qwen3-30B-A3B-Instruct-2507

Qwen3-30B-A3B-Instruct-2507 is a large checkpoint for text generation and chat, distributed on the HuggingFace Hub. The Apache 2.0 license keeps it unrestricted for commercial reuse. It has about 30B total parameters, and the A3B suffix indicates roughly 3B active per forward pass, which lowers per-token compute while memory still scales with the full 30B. Like most open checkpoints, it rewards a quick in-domain eval before commitment.

765,333 ↓ · 816 ♡

DeepSeek-R1-0528-Qwen3-8B

DeepSeek-R1-0528-Qwen3-8B is an 8B-parameter reasoning-focused language model built on the Qwen3 architecture, released under the MIT license. It is a distilled variant of the DeepSeek-R1 series, designed to bring chain-of-thought reasoning capabilities to a smaller, more deployable footprint. The model supports text-generation-inference and HuggingFace endpoints out of the box.

754,308 ↓ · 1,079 ♡

NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4

Nemotron-3 Nano is NVIDIA's 30B-parameter Mixture-of-Experts model with only 3B active parameters per forward pass, quantised here to NVFP4. The model supports six languages and was trained on NVIDIA's Nemotron dataset family spanning code, math, and instruction following. NVFP4 is a 4-bit floating-point format with native tensor-core support on NVIDIA Blackwell-generation GPUs; confirm supported hardware on the model card before deploying.

751,876 ↓ · 161 ♡

Llama-2-7b-hf

Llama-2-7b-hf is an open-weight text generation and chat model in the llama family. At about 7B parameters it sits in the mid-sized tier, which sets its memory and latency budget. It is distributed under the Llama 2 Community license, which is worth reading before you ship. Evaluate Llama-2-7b-hf on your own data before trusting it in production.

747,020 ↓ · 2,328 ♡

Qwen2.5-32B-Instruct-AWQ

Qwen2.5-32B-Instruct-AWQ is a qwen2-based large model for text generation and chat. At about 32B parameters, its AWQ quantization lowers memory requirements relative to the full-precision checkpoint. The Apache 2.0 license keeps it unrestricted for commercial reuse. There is no hosted SLA, so budget for self-managed deployment and monitoring.

746,303 ↓ · 101 ♡

Llama-3.2-3B

Llama-3.2-3B is a llama-based compact model for text generation. At about 3B parameters it trades some ceiling for cheaper, faster inference. It is subject to Llama 3.2 Community terms, so confirm licensing before commercial use. Before relying on Llama-3.2-3B, reproduce its key numbers on representative inputs.

740,225 ↓ · 843 ♡

Qwen3-Coder-Next-GGUF

Unsloth's GGUF quantizations of Qwen3-Coder-Next, a code-focused model from the Qwen3 family with extended training on programming datasets. Unsloth applies imatrix calibration during quantization, which improves accuracy at lower bit-widths compared to naive GGUF conversion. Available in multiple quant levels (Q4_K_M, Q8_0, etc.).

737,229 ↓ · 707 ♡

Qwen3-1.7B-Base

Qwen3-1.7B-Base is a qwen3-based base model with publicly available weights; as a base checkpoint it is better suited to fine-tuning or few-shot prompting than to chat out of the box. It is Apache 2.0-licensed, which permits closed-source and paid products. At about 1.7B parameters it sits in the compact tier, which sets its memory and latency budget. There is no hosted SLA, so budget for self-managed deployment and monitoring.

717,501 ↓ · 75 ♡

Qwythos-9B-Claude-Mythos-5-1M-GGUF

Qwythos-9B-Claude-Mythos-5-1M-GGUF is a GGUF-quantized 9B text generation model derived from Qwen3.5, packaged for local inference via llama.cpp. It advertises a 1-million-token context window and is positioned for agentic, reasoning, and long-document tasks including cybersecurity and biomedical domains. The base model carries an Apache 2.0 license.

712,627 ↓ · 682 ♡

Phi-4-mini-instruct

As a phi3-based open-weight model, Phi-4-mini-instruct focuses on text generation and chat. Training spans multiple languages, so Phi-4-mini-instruct covers cross-lingual text generation and chat from one checkpoint. The MIT license keeps Phi-4-mini-instruct unrestricted for commercial reuse. Phi-4-mini-instruct ships without a hosted SLA, so budget for self-managed deployment and monitoring.

707,072 ↓ · 783 ♡

GLM-4.5-Air-AWQ-4bit

GLM-4.5-Air-AWQ-4bit is a 4-bit AWQ quantization of ZAI's GLM-4.5-Air, a MoE language model optimized for bilingual Chinese-English use. The Air variant is a smaller, lower-compute member of the GLM-4.5 line designed for efficient serving. AWQ (Activation-aware Weight Quantization) further reduces VRAM requirements while aiming to preserve output quality.

706,202 ↓ · 29 ♡

tiny-GptOssForCausalLM

tiny-GptOssForCausalLM is an open-weight checkpoint listed under text generation and chat. Its name suggests a minimal test fixture for the GptOssForCausalLM model class rather than a production model; confirm this on the model card before using it beyond testing.

704,757 ↓ · 4 ♡

Qwen3-4B-Base

Qwen3-4B-Base is a mid-sized, self-hostable base checkpoint for text generation; as a base model it is best treated as a starting point for fine-tuning or few-shot prompting. Permissive Apache 2.0 terms let it go straight into commercial pipelines. Its size of about 4B parameters keeps hosting requirements modest relative to frontier models. Treat its published metrics as a starting point and validate against your workload.

679,326 ↓ · 95 ♡

Llama-3.3-70B-Instruct

Llama-3.3-70B-Instruct is a llama-based model with publicly available weights, built for text generation and chat. At about 70B parameters it sits in the large tier, which sets its memory and latency budget. It is distributed under the Llama 3.3 Community license, which is worth reading before you ship. Check the model card for benchmarks and intended use before adopting it.

675,237 ↓ · 2,860 ♡

Qwen2-7B-Instruct

Qwen2-7B-Instruct is a qwen2-based model with publicly available weights, built for text generation and chat. It is Apache 2.0-licensed, which permits closed-source and paid products. The weights start from qwen2-7b and are instruction-tuned for conversational use. Check the model card for benchmarks and intended use before adopting it.

672,229 ↓ · 687 ♡

Qwen2.5-72B-Instruct

Qwen2.5-72B-Instruct is Alibaba's 72B instruction-tuned model from the Qwen 2.5 series, trained on over 18 trillion tokens with improvements in math, coding, and long-context handling up to 128K tokens. It supports 29 languages. The 72B variant is released under the Qwen license rather than Apache 2.0, so check its terms before commercial deployment.

669,038 ↓ · 959 ♡

Kimi-K2.6-NVFP4

Kimi-K2.6-NVFP4 is an NVIDIA-optimized FP4 quantization of Kimi-K2.6, produced with the ModelOpt toolkit for deployment on NVIDIA GPUs with FP4 support; NVFP4 is native to the Blackwell generation, so confirm hardware requirements on the model card. At 4 bits per weight, the quantized layers take roughly a quarter of their BF16 size before scale-factor overhead, while maintaining most of the original accuracy for conversational tasks. It is intended for inference on NVIDIA TensorRT-LLM or vLLM backends, not for further fine-tuning.

662,521 ↓ · 39 ♡

Llama-3.1-70B-Instruct

As a llama-based frontier-scale model, Llama-3.1-70B-Instruct focuses on text generation and chat. Llama-3.1-70B-Instruct is subject to Llama 3.1 Community terms, so confirm licensing before commercial use. Training spans multiple languages, so Llama-3.1-70B-Instruct covers cross-lingual text generation and chat from one checkpoint. Llama-3.1-70B-Instruct ships without a hosted SLA, so budget for self-managed deployment and monitoring.

658,407 ↓ · 926 ♡

tiny-random-OPTForCausalLM

tiny-random-OPTForCausalLM is an open-weight checkpoint listed under text generation and chat. Its name suggests a small, randomly initialized OPT model meant for testing code paths rather than generating useful text. Read the model card for intended use and licensing before relying on it.

653,604 ↓ · 0 ♡

Llama-3_3-Nemotron-Super-49B-v1_5

Llama-3_3-Nemotron-Super-49B-v1_5 is a 49B model from NVIDIA's Nemotron line, derived from Llama 3.3 70B using Neural Architecture Search to shrink the network while retaining most of its quality. It targets reasoning, coding, and instruction following with near-70B quality at reduced inference cost.

652,970 ↓ · 234 ♡

DeepSeek-R1-Distill-Qwen-1.5B

DeepSeek-R1-Distill-Qwen-1.5B distills DeepSeek-R1's chain-of-thought reasoning traces into a 1.5B Qwen2 model. The distillation process transfers structured thinking patterns rather than raw capability, producing a model that generates explicit reasoning steps before answers. MIT license makes it broadly usable.

649,284 ↓ · 1,529 ♡

SmolLM3-3B

SmolLM3-3B is HuggingFace's 3B instruction-tuned language model, the third generation of the SmolLM family targeting on-device and resource-constrained deployment. It is multilingual (English, French, Spanish, Italian, Portuguese, Chinese, Arabic, Russian) and achieves competitive instruction-following quality at the 3B parameter scale. Apache-2.0 licensing makes it a viable base for commercial on-device AI applications.

648,978 ↓ · 981 ♡

Qwen2-0.5B-Instruct

Qwen2-0.5B-Instruct is a compact checkpoint for text generation and chat, distributed on the HuggingFace Hub. It is a fine-tune of qwen2-0.5b, inheriting that base model's general competence. The Apache 2.0 license keeps Qwen2-0.5B-Instruct unrestricted for commercial reuse. Qwen2-0.5B-Instruct is community-maintained, so track upstream changes and pin a known-good revision.

639,035 ↓ · 201 ♡

MiniMax-M2.5

MiniMax-M2.5 is an open-weight checkpoint for text generation and chat, distributed on the HuggingFace Hub. Prebuilt FP8 weights reduce memory use relative to a 16-bit checkpoint on FP8-capable hardware. Licensing for MiniMax-M2.5 is unspecified or custom, so clear it before commercial use. Like most open checkpoints, it rewards a quick in-domain eval before commitment.

637,212 ↓ · 1,497 ♡

Llama-3.2-1B-Instruct-Q8_0-GGUF

Llama-3.2-1B-Instruct-Q8_0-GGUF is a compact checkpoint for text generation and chat, distributed on the HuggingFace Hub. At about 1B parameters it trades some ceiling for cheaper, faster inference. Prebuilt GGUF weights at the Q8_0 (8-bit) level make local and edge inference straightforward. Treat its published metrics as a starting point and validate against your workload.

617,321 ↓ · 48 ♡

GLM-5.2-FP8

GLM-5.2-FP8 is an FP8-quantized text generation model from ZhipuAI based on a Mixture-of-Experts architecture (glm_moe_dsa), supporting both English and Chinese. Two associated arxiv papers (2602.15763 and 2603.12201) provide technical grounding for the GLM-5 series design.

613,600 ↓ · 173 ♡

Qwen2.5-7B-Instruct-bnb-4bit

Qwen2.5-7B-Instruct-bnb-4bit is Unsloth's bitsandbytes 4-bit quantized version of Qwen2.5-7B-Instruct, packaged for efficient fine-tuning and inference via the Unsloth framework. The bnb-4bit format enables QLoRA fine-tuning on a single consumer GPU (12-16GB VRAM), making Qwen2.5-7B accessible for custom instruction tuning without requiring multi-GPU setups.

612,137 ↓ · 23 ♡

Hermes-4-14B-AWQ-4bit

Hermes-4-14B-AWQ-4bit is a 4-bit AWQ quantization of NousResearch's Hermes-4-14B, a Qwen3-14B fine-tune optimized for instruction following, function calling, and structured output generation. The quantization targets reduced VRAM consumption while retaining the model's tool-use and JSON-mode capabilities. It is distributed under the Apache-2.0 license.

607,029 ↓ · 4 ♡

Rio-3.0-Open

Rio-3.0-Open is an open-weights LLM released by the Prefeitura do Rio de Janeiro (Rio de Janeiro city government), fine-tuned from Qwen3-235B-A22B on Portuguese and English data for civic and administrative use cases. It is a MoE architecture fine-tune targeting Brazilian Portuguese language understanding and public service applications. MIT licensed for open use.

606,651 ↓ · 5 ♡

TinyLLama-v0

TinyLLama-v0 is an early community repackage of the TinyLlama 1.1B base model, offering PyTorch and ONNX checkpoints for fast local experimentation. This is the pre-instruction-tuned base variant; it generates continuations rather than following instructions. The primary value is quick prototyping on hardware too constrained for larger models.

604,555 ↓ · 43 ♡

Phi-3-mini-4k-instruct

Phi-3-mini-4k-instruct is an openly licensed text generation and chat model in the phi3 family. Phi-3-mini-4k-instruct is MIT-licensed, clearing it for closed-source and paid products. Evaluate Phi-3-mini-4k-instruct on your own data before trusting it in production.

597,867 ↓ · 1,436 ♡

Qwen3-4B-Thinking-2507

Qwen3-4B-Thinking-2507 is an updated thinking-mode variant of Qwen3-4B, fine-tuned to generate extended chain-of-thought reasoning before producing answers. The 2507 suffix indicates a July 2025 update. Thinking mode generates explicit reasoning traces, which increase token count but improve accuracy on structured tasks.

596,422 ↓ · 600 ♡

tiny-random-BambaForCausalLM

A randomly initialized, architecturally minimal Bamba model used for unit-testing the BambaForCausalLM implementation in Hugging Face Transformers. Bamba is a hybrid SSM-attention architecture. This model has no trained weights — it exists purely for pipeline and shape verification in CI environments.

593,098 ↓ · 0 ♡

Qwen3-8B-AWQ

Qwen3-8B-AWQ is a mid-sized checkpoint for text generation and chat, distributed on the HuggingFace Hub. At about 8B parameters, with AWQ quantization, it trades some ceiling for cheaper, faster inference. The Apache 2.0 license keeps it unrestricted for commercial reuse. Like most open checkpoints, it rewards a quick in-domain eval before commitment.

585,607 ↓ · 49 ♡

Nemotron-Labs-Diffusion-8B-Base

Nemotron-Labs-Diffusion-8B-Base is NVIDIA's diffusion language model base, applying discrete diffusion to text generation instead of autoregressive decoding. At 8B parameters, it generates text by iteratively denoising token sequences rather than predicting them left-to-right. This enables parallel token generation but requires different inference tooling than standard transformer LLMs.

584,926 ↓ · 6 ♡

deepseek-coder-7b-instruct-v1.5

deepseek-coder-7b-instruct-v1.5 is DeepSeek AI's 7B-parameter instruction-tuned code model, the v1.5 release built on a Llama architecture. It is optimized for code generation, completion, and debugging across common programming languages, with a 16K token context window. Version 1.5 improves over earlier DeepSeek-Coder releases on fill-in-the-middle tasks and instruction following for coding-specific prompts.

580,794 ↓ · 157 ♡

Qwen3.5-397B-A17B-NVFP4

Qwen3.5-397B-A17B-NVFP4 is a frontier-scale checkpoint for text generation and chat, distributed on the HuggingFace Hub. It has about 397B total parameters, and the A17B suffix indicates roughly 17B active per forward pass; the NVFP4 quantization reduces weight memory, but hosting still requires capacity for the full parameter set. The Apache 2.0 license keeps it unrestricted for commercial reuse. The repository is community-maintained, so track upstream changes and pin a known-good revision.

575,765 ↓ · 101 ♡

Qwen3-8B-FP8

Qwen3-8B in FP8 precision from Alibaba, targeting high-throughput serving on FP8-capable GPUs such as Hopper-generation H100/H200. FP8 roughly halves the weight memory of the BF16 checkpoint and can raise throughput on FP8 tensor cores. Qwen3-8B is instruction-tuned with a hybrid reasoning mode that toggles between chain-of-thought and direct answers via a flag.
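The FP8-versus-BF16 memory comparison is arithmetic on bits per weight. A minimal sketch follows, assuming a nominal 8 billion parameters and counting weights only; it ignores quantization scale factors, the KV cache, and activations, and the function name is illustrative rather than part of any library.

```python
# Bits used to store one weight in each format.
BITS_PER_WEIGHT = {"bf16": 16, "fp8": 8, "fp4": 4}


def weight_memory_gb(num_params: float, fmt: str) -> float:
    """Approximate weight storage in GB (10**9 bytes) for a given parameter count and format."""
    return num_params * BITS_PER_WEIGHT[fmt] / 8 / 1e9


if __name__ == "__main__":
    nominal_params = 8e9  # assumption: a nominal "8B" model, not an exact count
    for fmt in BITS_PER_WEIGHT:
        print(f"{fmt}: {weight_memory_gb(nominal_params, fmt):.1f} GB")
```

Real checkpoints land somewhat above these figures, because some layers are usually kept at higher precision and quantized formats store per-block scales.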

567,893 ↓ · 61 ♡

tiny-Qwen3ForCausalLM

A tiny Qwen3 causal LM checkpoint used for TRL (Transformer Reinforcement Learning) library internal testing. Not a functional AI model; exists to provide a minimal forward-pass target for unit tests and CI pipelines in the Hugging Face TRL codebase.

567,366 ↓ · 1 ♡

phi-2

phi-2 is an open-weight checkpoint for text generation and chat, distributed on the HuggingFace Hub. The MIT license keeps phi-2 unrestricted for commercial reuse. Like most open checkpoints, phi-2 rewards a quick in-domain eval before commitment.

554,984 ↓ · 3,471 ♡

tiny-random-LlamaForCausalLM

tiny-random-LlamaForCausalLM is an open-weight, self-hostable checkpoint listed under text generation and chat. Its name suggests a small, randomly initialized Llama model meant for testing code paths rather than generating useful text; confirm this on the model card before using it beyond testing.

553,006 ↓ · 8 ♡

Qwen2.5-Coder-7B-Instruct-GPTQ-Int4

Qwen2.5-Coder-7B-Instruct-GPTQ-Int4 targets text generation and chat and is shipped as a mid-sized, self-hostable checkpoint. Permissive Apache 2.0 terms let Qwen2.5-Coder-7B-Instruct-GPTQ-Int4 go straight into commercial pipelines. Prebuilt GPTQ/INT4 weights make local and edge inference of Qwen2.5-Coder-7B-Instruct-GPTQ-Int4 straightforward. Qwen2.5-Coder-7B-Instruct-GPTQ-Int4 is community-maintained, so track upstream changes and pin a known-good revision.

537,380 ↓ · 14 ♡

DeepSeek-R1-Distill-Qwen-14B

DeepSeek-R1-Distill-Qwen-14B is a 14B model that distills DeepSeek-R1's extended chain-of-thought reasoning into a Qwen2 backbone. It strikes a better capability/size balance than the 1.5B or 7B distillations, handling moderately complex math and coding problems with explicit reasoning traces. MIT license allows unrestricted use.

536,682 ↓ · 657 ♡

gemma-4-12B-coder-fable5-composer2.5-v1-GGUF

gemma-4-12B-coder-fable5-composer2.5-v1-GGUF is a GGUF-quantized, code-and-reasoning-focused fine-tune of Google's gemma-4-12B-it. The merge combines coding specialization with chain-of-thought reasoning enhancements, packaged for local llama.cpp inference.

536,130 ↓ · 2,431 ♡

Qwen3-8B-Base

Qwen3-8B-Base is a mid-sized base checkpoint for text generation, distributed on the HuggingFace Hub. At about 8B parameters it trades some ceiling for cheaper, faster inference than larger models; as a base model it is best suited to fine-tuning or few-shot prompting. The Apache 2.0 license keeps it unrestricted for commercial reuse. Like most open checkpoints, it rewards a quick in-domain eval before commitment.

533,651 ↓ · 108 ♡

RnJ-1-Instruct-FP8

RnJ-1-Instruct in FP8 precision from Doradus AI, a reasoning-focused instruct model targeting code and logical problem-solving. FP8 quantization reduces memory footprint while preserving most of the original model's task accuracy.

517,285 ↓ · 4 ♡

Qwen2.5-3B

Qwen2.5-3B is the 3B base (non-instruct) model from Alibaba's Qwen2.5 series, with a 32K token context window. Base models in this series are primarily useful as fine-tuning starting points. The instruct variant is recommended for most direct applications.

517,227 ↓ · 192 ♡

gemma-3-270m-it

gemma-3-270m-it is Google's 270M-parameter instruction-tuned variant of the Gemma 3 family, fine-tuned from the gemma-3-270m base checkpoint. At this scale it is one of the smallest instruction-following models in the Gemma 3 lineup, intended for resource-constrained inference, on-device deployment, and rapid experimentation. It uses the Gemma license, which permits use subject to Google's terms of service rather than a standard OSI license.

514,347 ↓ · 601 ♡

Qwen3-30B-A3B-FP8

Qwen3-30B-A3B-FP8 is an FP8-quantized variant of Qwen's 30B mixture-of-experts language model, where approximately 3B parameters are active per forward pass. The FP8 format reduces memory requirements while preserving most of the BF16 model's numerical range, targeting high-throughput text generation on FP8-capable hardware.
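Several names in this listing encode total and active parameter counts as `<total>B-A<active>B`. As a rough sketch, assuming that naming convention holds (it is a convention, not a guarantee, and the helper names are hypothetical), the active fraction can be read from the name:

```python
import re

# Matches e.g. "30B-A3B" or "235B-A22B"; sizes are in billions of parameters.
_MOE_PATTERN = re.compile(r"(\d+(?:\.\d+)?)B-A(\d+(?:\.\d+)?)B")


def parse_moe_name(name: str):
    """Return (total_billions, active_billions) parsed from a model name, or None if absent."""
    match = _MOE_PATTERN.search(name)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


def active_fraction(name: str):
    """Fraction of parameters active per forward pass, or None when the name has no MoE suffix."""
    parsed = parse_moe_name(name)
    if parsed is None:
        return None
    total, active = parsed
    return active / total


if __name__ == "__main__":
    for model in ("Qwen3-30B-A3B-FP8", "Qwen3-235B-A22B", "Qwen3-8B-FP8"):
        print(model, parse_moe_name(model), active_fraction(model))
```

Per-token compute tracks the active count, but serving memory still tracks the total, since every expert must be resident.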

510,015 ↓ · 84 ♡

tiny-random-LlamaForCausalLM

tiny-random-LlamaForCausalLM is an open-weight, self-hostable checkpoint listed under text generation and chat. Its name suggests a small, randomly initialized Llama model meant for testing rather than useful generation. The repository is community-maintained, so track upstream changes and pin a known-good revision.

504,565 ↓ · 20 ♡

bloom-560m

BLOOM-560M is the smallest model in the BLOOM family, a collaborative multilingual language model trained under the BigScience initiative on 46 natural languages and 13 programming languages. At 560M parameters it's primarily useful for multilingual research and teaching rather than competitive NLP tasks. The RAIL license restricts certain harmful use cases.

502,304 ↓ · 374 ♡

Meta-Llama-3.1-8B-Instruct-FP8

Meta-Llama-3.1-8B-Instruct-FP8 is a large checkpoint for text generation and chat, distributed on the HuggingFace Hub. Meta-Llama-3.1-8B-Instruct-FP8 is multilingual by design rather than English-only. Meta-Llama-3.1-8B-Instruct-FP8 is subject to Llama 3.1 Community terms, so confirm licensing before commercial use. Meta-Llama-3.1-8B-Instruct-FP8 is community-maintained, so track upstream changes and pin a known-good revision.

501,303 ↓ · 44 ♡

Qwen2.5-7B

Qwen 2.5 7B is Alibaba's base (non-instruction-tuned) language model at the 7B scale, pretrained on 18 trillion tokens. It serves as the foundation for Qwen 2.5 7B Instruct and downstream fine-tunes requiring a strong base without chat formatting.

500,320 ↓ · 294 ♡

Kimi-K2-Instruct

Kimi-K2-Instruct is an open-weight checkpoint for text generation and chat, distributed on the HuggingFace Hub. Licensing for Kimi-K2-Instruct is unspecified or custom, so clear it before commercial use. Prebuilt FP8 weights reduce memory use relative to a 16-bit checkpoint on FP8-capable hardware. Evaluate Kimi-K2-Instruct on your own data before trusting it in production.

496,191 ↓ · 2,366 ♡

Qwen3-0.6B-Base

Qwen3-0.6B-Base is Alibaba's smallest Qwen3 model, a base (non-instruct) LLM at 0.6 billion parameters. It targets on-device, edge, and resource-constrained deployments where even 1.5B models are too large. As a base model it requires instruction tuning or few-shot prompting for task-specific use; the primary value is as a fine-tuning starting point.

495,889 ↓ · 174 ♡

Bielik-11B-v3.0-Instruct-awq

Bielik-11B-v3.0-Instruct-awq is an AWQ-quantized version of the Bielik 11B instruction-tuned model, developed by the SpeakLeash initiative with a focus on Polish and broad European multilingual coverage spanning over 30 languages. The base model is LLaMA-architecture fine-tuned for instruction following. AWQ quantization reduces memory footprint while targeting minimal accuracy degradation compared to full precision.

493,821 ↓ · 1 ♡

tiny-random-Gemma2ForCausalLM

tiny-random-Gemma2ForCausalLM is a minimal Gemma 2 architecture stub used for unit testing HuggingFace transformers code. Its weights are randomly initialised and it produces meaningless outputs — it exists solely to provide a fast-loading Gemma2 model class for CI and integration tests without requiring the full multi-GB production checkpoint.

490,225 ↓ · 0 ♡

tiny-Qwen3MoeForCausalLM

tiny-Qwen3MoeForCausalLM is a minimal-scale text generation model built on the Qwen3 MoE (Mixture of Experts) architecture, created by the TRL team for internal testing purposes. It is not intended as a production model but serves as a lightweight fixture for validating TRL training pipelines and integration tests. The model is conversational and endpoints-compatible.

488,534 ↓ · 1 ♡

Qwen2.5-Math-1.5B-Instruct

Qwen2.5-Math-1.5B-Instruct is a compact math-specialized language model from Alibaba, trained on mathematical corpora and fine-tuned for step-by-step problem solving. Despite its small size, it competes with much larger general-purpose models on the MATH and GSM8K benchmarks.

487,712 ↓ · 58 ♡

DeepSeek-V3.2-AWQ

An AWQ 4-bit quantisation of DeepSeek V3.2, packaged for vLLM inference. AWQ (Activation-aware Weight Quantisation) identifies and preserves the most salient weights at higher precision, typically losing less perplexity than naive 4-bit approaches. This checkpoint lets teams run the large DeepSeek V3.2 on fewer GPUs than the BF16 original while retaining most benchmark performance.

469,286 ↓ · 11 ♡
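For contrast with AWQ, the "naive 4-bit" baseline mentioned in the DeepSeek-V3.2-AWQ entry can be sketched as plain round-to-nearest quantisation with one shared scale per weight group. This is a minimal illustration, not AWQ itself and not the procedure used for this checkpoint; the function names and the toy weights are hypothetical.

```python
def quantize_rtn_int4(weights):
    """Naive symmetric round-to-nearest 4-bit quantisation of one weight group.

    Illustrative baseline only: AWQ additionally uses activation statistics
    to protect salient weights, which this sketch does not attempt.
    """
    max_abs = max(abs(w) for w in weights)
    # Signed 4-bit integers cover -8..7; map the largest magnitude to 7.
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map 4-bit integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


# Hypothetical weight group, chosen only to show the rounding error.
group = [0.70, -0.35, 0.10, 0.02]
codes, scale = quantize_rtn_int4(group)
restored = dequantize(codes, scale)
errors = [abs(w - r) for w, r in zip(group, restored)]
print(codes, scale, max(errors))
```

Every weight in the group shares one scale, so small weights next to a large one lose most of their resolution. That loss is what activation-aware schemes try to reduce for the weights that matter most.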

Qwen2.5-32B-Instruct-GPTQ-Int4

Qwen2.5-32B-Instruct-GPTQ-Int4 is an openly licensed text generation and chat model in the qwen2 family. It is Apache 2.0-licensed, clearing it for closed-source and paid products. It is a GPTQ 4-bit quantization of the roughly 32B-parameter Qwen2.5-32B-Instruct, so its weight memory is well below that of the full-precision checkpoint, though it remains a large model to host. Treat its published metrics as a starting point and validate against your workload.

468,473 ↓ · 40 ♡

Llama-3.3-70B-Instruct-AWQ

Llama-3.3-70B-Instruct-AWQ is an open-weight text generation and chat model in the llama family. Prebuilt AWQ weights make local and edge inference of Llama-3.3-70B-Instruct-AWQ straightforward. Llama-3.3-70B-Instruct-AWQ is multilingual by design rather than English-only. Treat Llama-3.3-70B-Instruct-AWQ's published metrics as a starting point and validate against your workload.

460,034 ↓ · 11 ♡

VLM2Vec-Full

VLM2Vec-Full is TIGER Lab's vision-language embedding model that adapts a multimodal LLM (based on Phi-3.5-V) into a dual-encoder for multimodal retrieval. It enables text-image retrieval and text-text retrieval in a single embedding space.

451,037 ↓ · 29 ♡
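Retrieval in a single embedding space, as described for VLM2Vec-Full, comes down to comparing vectors regardless of which modality produced them. The sketch below ranks candidates by cosine similarity; the three-dimensional vectors and item names are hypothetical stand-ins for real model outputs, and nothing here calls the actual model.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, candidates):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine(query, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical embeddings: text and image items live in the same space,
# so one query vector can be scored against both kinds of candidate.
query_text = [0.9, 0.1, 0.0]
candidates = {
    "image:cat_photo": [0.8, 0.2, 0.1],
    "text:cat_caption": [0.7, 0.0, 0.3],
    "image:invoice_scan": [0.0, 0.1, 0.9],
}
for name, score in rank_by_similarity(query_text, candidates):
    print(f"{name}: {score:.3f}")
```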

Qwen3-Next-80B-A3B-Instruct-FP8

Qwen3-Next-80B-A3B-Instruct-FP8 is an FP8-quantized mixture-of-experts instruction model from Alibaba's Qwen team, with 80B total parameters but only ~3B active per forward pass. The FP8 format reduces VRAM requirements compared to BF16 while preserving most of the base model's generation quality. It is released under Apache 2.0 and is compatible with standard Transformers serving stacks.

448,472 ↓ · 90 ♡
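Several names in this listing encode total and active parameter counts in a pattern such as 80B-A3B. As a rough sketch, assuming that pattern is only a naming convention and not a formal specification, the two figures can be pulled out with a regular expression; `parse_moe_name` is a hypothetical helper.

```python
import re

# Assumed convention: "<total>B-A<active>B", e.g. "80B-A3B".
_MOE_NAME = re.compile(r"(\d+(?:\.\d+)?)B-A(\d+(?:\.\d+)?)B")


def parse_moe_name(name):
    """Return (total_billion, active_billion), or None if the pattern is absent."""
    match = _MOE_NAME.search(name)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


total, active = parse_moe_name("Qwen3-Next-80B-A3B-Instruct-FP8")
print(f"total={total}B active={active}B active_share={active / total:.2%}")
```

Names without the pattern return `None`, so a caller can fall back to the model card rather than guess.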

Qwen3.6-27B-Text-NVFP4-MTP

Text-only NVFP4-quantized Qwen3.6-27B with multi-token prediction (MTP) for speculative decoding, optimized for Blackwell and Hopper GPUs via NVIDIA ModelOpt. Stripping vision components reduces memory footprint and inference latency when only text output is needed. Supports 13 languages including Chinese, Japanese, and Korean.

442,987 ↓ · 77 ♡

gpt-neo-125m

GPT-Neo-125M is EleutherAI's open recreation of the GPT-2 class of models, pre-trained on the Pile dataset as part of their open language model initiative. At 125M parameters it's a pedagogical and baseline research model rather than a practical text generator. MIT-licensed and available in multiple frameworks.

440,032 ↓ · 228 ♡

Meta-Llama-3.1-70B-Instruct-AWQ-INT4

Meta-Llama-3.1-70B-Instruct-AWQ-INT4 is an INT4 AWQ quantization of Meta's Llama 3.1 70B instruction model, packaged by Hugging Quants. AWQ (Activation-aware Weight Quantization) selectively quantizes weights based on activation magnitudes, preserving quality better than naive INT4 approaches. It supports eight languages and is compatible with Text Generation Inference and Azure deployment.

439,565 ↓ · 109 ♡

Qwen3-Coder-Next-8bit

MLX 8-bit quantization of Qwen3-Coder-Next for Apple Silicon inference, targeting macOS M-series hardware via the MLX framework. 8-bit quantization preserves more model quality than 4-bit at the cost of higher memory use. Apache-2.0 licensed.

438,855 ↓ · 3 ♡

Phi-3-mini-4k-instruct-gptq-4bit

Phi-3-mini-4k-instruct-gptq-4bit is an open-weight text generation and chat model in the phi3 family. Prebuilt GPTQ/4BIT weights make local and edge inference of Phi-3-mini-4k-instruct-gptq-4bit straightforward. Like most open checkpoints, Phi-3-mini-4k-instruct-gptq-4bit rewards a quick in-domain eval before commitment.

438,727 ↓ · 2 ♡

Qwen2.5-Math-1.5B

Built for text generation and chat, Qwen2.5-Math-1.5B is a qwen2-based model with publicly available weights. The weights start from qwen2.5-1.5b and specialize it for mathematical text. At about 1.5B parameters, Qwen2.5-Math-1.5B has a small memory and latency budget. Before relying on Qwen2.5-Math-1.5B, reproduce its key numbers on representative inputs.

436,603 ↓ · 109 ♡

DeepSeek-V2-Lite

DeepSeek-V2-Lite is a lightweight variant of DeepSeek's MoE architecture, designed to bring V2's Multi-head Latent Attention (MLA) and DeepSeekMoE designs to a smaller footprint. It activates fewer experts per token than V2-full while sharing the same architectural innovations. Uses custom model code, requiring trust_remote_code=True.

433,139 ↓ · 180 ♡

lynx-instruct-30b

lynx-instruct-30b targets text generation and chat and is shipped as a large, self-hostable checkpoint. lynx-instruct-30b is multilingual by design rather than English-only. It is a fine-tune of qwen3-30b-a3b-instruct-2507, inheriting that base model's general competence. Evaluate lynx-instruct-30b on your own data before trusting it in production.

431,704 ↓ · 3 ♡

Bielik-11B-v3.0-Instruct

Bielik-11B v3.0 is the SpeakLeash community's Polish-focused 11B instruct model, trained on a large Polish text corpus. The third major version targets comprehensive Polish language tasks including complex reasoning, summarization, and instruction following.

429,169 ↓ · 65 ♡

Qwen3-235B-A22B-Instruct-2507-FP8

Qwen3-235B-A22B-Instruct-2507-FP8 is a qwen3-based frontier-scale model for text generation and chat. It is the FP8 build of the checkpoint, published for lower-memory serving than the full-precision weights. The name indicates a mixture-of-experts design with about 235B total parameters and about 22B active per token, so per-token compute is far below that of a dense 235B model while memory must still hold every expert. It ships without a hosted SLA, so budget for self-managed deployment and monitoring.

428,035 ↓ · 147 ♡

Meta-Llama-3.1-8B-Instruct

This is Unsloth's repackaged version of Meta's Llama-3.1-8B-Instruct, optimized for faster fine-tuning workflows using the Unsloth library's memory-efficient training kernels. The weights are functionally identical to the Meta original but are formatted for drop-in use with Unsloth's training stack. It operates under the Llama 3.1 community license, which restricts certain large-scale commercial deployments.

427,282 ↓ · 97 ♡

granite-4.0-h-small

Granite 4.0-H-Small is IBM's latest Granite generation using a hybrid SSM-Transformer architecture (GraniteMoEHybrid), combining state space models with attention layers for improved long-context efficiency. The small variant targets edge and on-premise deployments where the compute budget is constrained. It belongs to the first Granite generation to move away from a pure-Transformer design.

426,398 ↓ · 308 ♡

granite-3.3-8b-instruct

Granite 3.3-8B Instruct is IBM's latest iteration in the Granite 3.x series, an 8B instruction-tuned model trained on IBM's curated dataset blend emphasising enterprise tasks like code, retrieval-augmented generation, and document understanding. The 3.3 update improves on 3.1 and 3.2 in function calling reliability and structured output generation, both critical for agentic enterprise workflows.

419,279 ↓ · 155 ♡

Qwen3-30B-A3B-abliterated

Qwen3-30B-A3B-abliterated is a fine-tuned derivative of Alibaba's Qwen3-30B-A3B mixture-of-experts model, processed with abliteration to remove refusal behaviors trained into the base model. Abliteration works by identifying and suppressing the refusal direction in the model's residual stream, which can degrade instruction-following on unrelated tasks. It carries the same Apache 2.0 license as the Qwen3 base.

416,918 ↓ · 38 ♡
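The "refusal direction" idea in the Qwen3-30B-A3B-abliterated entry can be illustrated as a vector projection: subtract from each activation its component along one chosen direction. The sketch below shows only that arithmetic on toy vectors. How a direction is found, and which layers or weights are edited in a real abliteration, is not covered, and both vectors are hypothetical.

```python
import math


def ablate_direction(hidden, direction):
    """Remove the component of `hidden` that lies along `direction`."""
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    unit = [d / norm for d in direction]
    # Scalar projection of the activation onto the unit direction.
    coeff = sum(h * u for h, u in zip(hidden, unit))
    return [h - coeff * u for h, u in zip(hidden, unit)]


# Toy 3-dimensional example; real residual streams have thousands of dimensions.
refusal_direction = [0.0, 2.0, 0.0]  # hypothetical direction
activation = [1.0, 5.0, -3.0]        # hypothetical activation
print(ablate_direction(activation, refusal_direction))  # [1.0, 0.0, -3.0]
```

Because the projection removes everything along that axis, any unrelated information that happens to share the direction is lost as well, which is one way to read the entry's note about degraded instruction-following.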

Llama-3.1-8B-Instruct-FP8

Llama-3.1-8B-Instruct-FP8 is a llama-based open-weight model aimed at text generation and chat. FP8 builds of Llama-3.1-8B-Instruct-FP8 are published alongside the full checkpoint for low-memory serving. The weights start from llama-3.1-8b-instruct and specialize it for the target task. Llama-3.1-8B-Instruct-FP8 ships without a hosted SLA, so budget for self-managed deployment and monitoring.

415,998 ↓ · 37 ♡

t5gemma-s-s-prefixlm

t5gemma-s-s-prefixlm is a gemma-based open-weight model for text generation. The listing names the model itself as its base, so no separate upstream checkpoint is identified. It is subject to Gemma terms, so confirm licensing before commercial use. It ships without a hosted SLA, so budget for self-managed deployment and monitoring.

415,214 ↓ · 4 ♡

LLaDA-8B-Instruct

LLaDA-8B-Instruct is an openly licensed text generation and chat model. LLaDA-8B-Instruct is MIT-licensed, clearing it for closed-source and paid products. At about 8000M parameters, LLaDA-8B-Instruct sits in the large tier, which sets its memory and latency budget. Treat LLaDA-8B-Instruct's published metrics as a starting point and validate against your workload.

414,473 ↓ · 358 ♡

NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4

NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4 is a 550B-parameter sparse Mixture-of-Experts language model using NVIDIA's latent-MoE architecture, quantized to NVFP4 precision for reduced memory footprint. It activates approximately 55B parameters per forward pass and is optimized for deployment on NVIDIA hardware via ModelOpt tooling. The model is trained on NVIDIA's proprietary pre- and post-training datasets targeting multilingual instruction following.

411,418 ↓ · 218 ♡

Qwen2.5-Math-7B

Qwen2.5-Math-7B is a qwen2-based open-weight model aimed at text generation and chat. Qwen2.5-Math-7B's 7000M-parameter size keeps hosting requirements modest relative to frontier models. Permissive Apache 2.0 terms let Qwen2.5-Math-7B go straight into commercial pipelines. Check the Qwen2.5-Math-7B model card for benchmarks and intended use before adopting it.

409,520 ↓ · 111 ♡

SmolLM-135M

SmolLM-135M is HuggingFace's 135M-parameter LLM trained from scratch on HuggingFace's curated SmolLM-Corpus, designed to push the boundary of what is achievable in extremely compact language models. At 135M it outperforms many prior sub-500M models on standard benchmarks. The model uses the Llama architecture for easy ecosystem integration and is English-focused.

408,004 ↓ · 257 ♡

falcon-7b

Falcon-7B is the 7B autoregressive language model that TII (UAE) released in 2023, trained on the RefinedWeb dataset, which is derived from Common Crawl with aggressive deduplication and filtering. At release it outperformed comparable open models such as MPT-7B on several benchmarks while being fully open-weight. Falcon-7B is a base model without instruction tuning; it is notable historically as an early high-quality, openly licensed 7B LLM.

404,961 ↓ · 1,104 ♡

saiga_llama3_8b

Saiga Llama3 8B is a Russian instruction-tuned model based on Llama 3 8B, fine-tuned by Ilya Gusev using the Saiga dataset collection of Russian dialogues and instructions. It is among the most capable open Russian chat models at this parameter count, offering idiomatic Russian language understanding and generation beyond what vanilla Llama 3 provides on Russian prompts.

396,642 ↓ · 141 ♡

Qwen2.5-72B-Instruct-abliterated

An abliterated (safety-removed) version of Qwen2.5-72B-Instruct by huihui-ai, where refusal mechanisms have been removed using directional activation manipulation. This allows the model to respond to requests the original would decline. The abliteration technique is reversible but the resulting model lacks safety guardrails.

392,488 ↓ · 49 ♡

Nemotron-Mini-4B-Instruct

Nemotron-Mini-4B-Instruct is a 4B-parameter instruction-tuned language model from NVIDIA, fine-tuned from Minitron-4B-Base, which was pruned and distilled from Nemotron-4 15B. It targets on-device and edge deployment scenarios where larger models are impractical. The compression method is described in arXiv:2407.14679, and the model was trained using NVIDIA's NeMo framework with alignment techniques from arXiv:2402.16819. The model is English-only and optimized for chat and assistant-style tasks.

392,354 ↓ · 184 ♡

LFM2.5-1.2B-Instruct

LFM2.5-1.2B-Instruct is Liquid AI's 1.2B instruction-tuned model using their Liquid Foundation Model architecture, which combines recurrent and attention mechanisms for improved long-context efficiency. Supports 9 languages and is positioned as an edge-friendly model from a non-transformer architecture lineage. License is listed as 'other' — check Liquid AI's terms.

391,492 ↓ · 600 ♡

gemma-2-2b-it

Gemma 2 2B Instruct is Google's smallest instruction-tuned model in the Gemma 2 family, using the same alternation of sliding-window and full attention and the same logit soft-capping as the 9B variant, but at 2.6 billion parameters. At release it set a new bar for sub-3B instruction models on standard benchmarks. It is distributed under the Gemma Terms of Use rather than Apache 2.0, so review those terms before commercial use, and it runs on consumer hardware.

390,181 ↓ · 1,404 ♡
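The sliding-window and full-attention hybrid mentioned for Gemma 2 can be pictured as two kinds of causal mask. This is a minimal sketch with toy sizes: the window length and the choice of which layers are local are assumptions for illustration, not values taken from the Gemma 2 configuration.

```python
def attention_mask(seq_len, window=None):
    """mask[i][j] is True when query position i may attend to key position j.

    window=None gives full causal attention; an integer limits each query
    to the most recent `window` positions, itself included.
    """
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            causal = j <= i
            in_window = window is None or i - j < window
            row.append(causal and in_window)
        mask.append(row)
    return mask


def layer_windows(num_layers, window):
    """Alternate local and global layers.

    Assumption for illustration: even-indexed layers are local,
    odd-indexed layers are global.
    """
    return [window if layer % 2 == 0 else None for layer in range(num_layers)]


for layer, win in enumerate(layer_windows(4, window=2)):
    last_row = attention_mask(4, win)[-1]
    print(layer, "local" if win else "global", last_row)
```

Local layers keep attention cost bounded by the window size, while the interleaved global layers still let information travel across the whole sequence.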

DeepSeek-Coder-V2-Lite-Instruct-AWQ

DeepSeek-Coder-V2-Lite-Instruct-AWQ is a deepseek-based open-weight model aimed at text generation and chat. AWQ builds of DeepSeek-Coder-V2-Lite-Instruct-AWQ are published alongside the full checkpoint for low-memory serving. DeepSeek-Coder-V2-Lite-Instruct-AWQ lists a non-standard license, so confirm permissions before deployment. Read DeepSeek-Coder-V2-Lite-Instruct-AWQ's card for hardware requirements and licensing fine print before deploying.

389,440 ↓ · 9 ♡

Qwen3-32B-FP8

Qwen3-32B-FP8 is Alibaba's official FP8-quantized checkpoint of the Qwen3-32B instruction-tuned model, targeting Hopper (H100) GPU inference with FP8 tensor core support. FP8 quantization reduces memory by ~50% vs bf16 while preserving most of the model's accuracy. Apache-2.0 licensed.

389,347 ↓ · 83 ♡

gpt-oss-20b-MXFP4-Q8

gpt-oss-20b-MXFP4-Q8 is a quantized build of the roughly 20B-parameter gpt-oss-20b for text generation and chat. The Apache 2.0 license keeps it unrestricted for commercial reuse. The MXFP4 and Q8 tags in the name indicate reduced-precision weights, which lower memory use and speed up inference at some cost in accuracy. Check the gpt-oss-20b-MXFP4-Q8 model card for benchmarks and intended use before adopting it.

386,562 ↓ · 67 ♡

Olmo-3-7B-Instruct

Olmo-3-7B-Instruct is an openly licensed text generation and chat model in the olmo family. It is a fine-tune of olmo-3-7b-instruct-dpo, inheriting that base model's general competence. At about 7000M parameters, Olmo-3-7B-Instruct sits in the mid-sized tier, which sets its memory and latency budget. Olmo-3-7B-Instruct is community-maintained, so track upstream changes and pin a known-good revision.

385,034 ↓ · 128 ♡

Qwen3-4B-GGUF

Qwen3-4B-GGUF is a mid-sized checkpoint for text generation and chat, distributed on the HuggingFace Hub. Prebuilt GGUF weights make local and edge inference of Qwen3-4B-GGUF straightforward. The Apache 2.0 license keeps Qwen3-4B-GGUF unrestricted for commercial reuse. Treat Qwen3-4B-GGUF's published metrics as a starting point and validate against your workload.

384,446 ↓ · 111 ♡

mistral-7b-v0.3-bnb-4bit

Mistral 7B v0.3 in BitsAndBytes 4-bit quantisation, packaged by Unsloth for memory-efficient fine-tuning and inference. The bnb-4bit format uses NF4 quantisation with double quantisation, reducing VRAM for fine-tuning from ~16GB to ~5-6GB for a 7B model. Unsloth applies custom kernels to accelerate LoRA training further on top of the bitsandbytes quantisation.

380,628 ↓ · 22 ♡

GLM-4.5-Air

GLM-4.5-Air is Zhipu AI's lightweight MoE variant of their GLM-4.5 series, designed for fast Chinese-English bilingual inference at reduced serving cost. The 'Air' designation indicates a trimmed serving configuration balancing capability and speed. It uses the GLM4_MOE architecture and targets cloud API and enterprise deployments requiring GLM's strong Chinese language performance.

380,323 ↓ · 609 ♡

OTel-LLM-1.2B-IT

OTel-LLM-1.2B-IT is an open-weight model aimed at text generation and chat. OTel-LLM-1.2B-IT's 1200M-parameter size keeps hosting requirements modest relative to frontier models. The weights start from lfm2.5-1.2b-instruct and specialize it for the target task. Check the OTel-LLM-1.2B-IT model card for benchmarks and intended use before adopting it.

379,356 ↓ · 1 ♡

Qwen2.5-Coder-3B

Qwen2.5-Coder-3B is the 3B base (non-instruct) model from Alibaba's code-specialized Qwen2.5 Coder series, trained on a large corpus of code and programming-related text. As a base model it lacks instruction following and requires fine-tuning or prompting strategies to use for code generation tasks. The instruct variant is better suited for direct use.

376,315 ↓ · 52 ♡

Qwen3-Next-80B-A3B-Instruct

Qwen3 80B MoE instruct model activating 3B parameters per token, offering a high-capacity but compute-efficient inference profile. Positioned as a next-generation step-up from the Qwen3-30B-A3B series with additional pretraining compute.

375,814 ↓ · 1,024 ♡

chatgpt_paraphraser_on_T5_base

T5-base fine-tuned to paraphrase text in a ChatGPT-style manner, using T5's text-to-text framework. The model was trained on paraphrase datasets with the goal of rewording inputs while preserving meaning. OpenRAIL license applies — includes usage restrictions on harmful applications.

375,311 ↓ · 193 ♡

mini-coder-1.7b

mini-coder-1.7b is a qwen3-based open-weight model aimed at text generation and chat. mini-coder-1.7b's 1700M-parameter size keeps hosting requirements modest relative to frontier models. The weights start from qwen3-1.7b and specialize it for the target task. Before relying on mini-coder-1.7b, reproduce its key numbers on representative inputs.

373,536 ↓ · 5 ♡

ReaderLM-v2

Built for text generation and chat, ReaderLM-v2 is a qwen2-based model with publicly available weights. Distribution of ReaderLM-v2 is under CC BY-NC 4.0, which is worth reading before you ship. Training spans multiple languages, so ReaderLM-v2 covers cross-lingual text generation and chat from one checkpoint. Read ReaderLM-v2's card for hardware requirements and licensing fine print before deploying.

372,442 ↓ · 791 ♡

gpt-neox-20b

gpt-neox-20b is EleutherAI's 20B autoregressive language model, trained on the Pile dataset and released in 2022 as the largest fully open-weights English LLM at the time. It uses the GPT-NeoX architecture with rotary position embeddings and was trained in fp16 on NVIDIA A100 GPUs. While now superseded by much larger models, it remains historically significant and is a baseline for open LLM research.

365,882 ↓ · 584 ♡
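Rotary position embeddings, mentioned in the gpt-neox-20b entry, encode position by rotating pairs of vector components through position-dependent angles. The sketch below uses the interleaved-pair form with the common base of 10000. That layout is an assumption for illustration and is not claimed to match the GPT-NeoX implementation, which has its own pairing convention and rotates only part of each head.

```python
import math


def apply_rope(vec, position, base=10000.0):
    """Rotate consecutive pairs (vec[i], vec[i+1]) by position-dependent angles."""
    dim = len(vec)
    if dim % 2 != 0:
        raise ValueError("vector length must be even")
    out = []
    for i in range(0, dim, 2):
        # Lower-indexed pairs rotate faster; the frequency decays with i.
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Hypothetical query and key vectors.
q = [1.0, 0.5, -0.3, 0.8]
k = [0.2, -0.7, 0.4, 0.1]
# The attention score depends only on the offset between positions (2 here).
print(dot(apply_rope(q, 5), apply_rope(k, 3)))
print(dot(apply_rope(q, 2), apply_rope(k, 0)))
```

The two printed scores agree up to floating-point error, which is the property that makes the scheme a relative position encoding.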

Zamba2-1.2B-instruct

Built for text generation and chat, Zamba2-1.2B-instruct is a model with publicly available weights. The weights start from zamba2-1.2b and specialize it for the target task. At about 1200M parameters, Zamba2-1.2B-instruct sits in the mid-sized tier, which sets its memory and latency budget. Before relying on Zamba2-1.2B-instruct, reproduce its key numbers on representative inputs.

364,970 ↓ · 30 ♡

llama-3.3-70b-instruct-awq

llama-3.3-70b-instruct-awq is a llama-based open-weight model aimed at text generation and chat. It is an AWQ-quantized build of the roughly 70B-parameter Llama 3.3 70B Instruct, published for lower-memory serving than the full checkpoint. Even quantized, a model of this size remains demanding to host. Before relying on llama-3.3-70b-instruct-awq, reproduce its key numbers on representative inputs.

362,661 ↓ · 46 ♡

tiny-Glm4MoeForCausalLM

tiny-Glm4MoeForCausalLM is an open-weight checkpoint for text generation and chat, distributed on the HuggingFace Hub. Like most open checkpoints, tiny-Glm4MoeForCausalLM rewards a quick in-domain eval before commitment.

362,389 ↓ · 0 ♡

Meta-Llama-3.1-8B-Instruct-AWQ-INT4

Hugging Quants' AWQ INT4 quantization of Meta's Llama-3.1-8B-Instruct model. Llama 3.1 8B Instruct is a well-characterized instruction-following model with solid multilingual coverage across 8 languages. The AWQ quantization uses autoawq and is calibrated for minimal accuracy regression on instruction tasks.

362,315 ↓ · 90 ♡

japanese-gpt-neox-small

japanese-gpt-neox-small is an openly licensed text generation and chat model in the gpt neox family. japanese-gpt-neox-small is MIT-licensed, clearing it for closed-source and paid products. Evaluate japanese-gpt-neox-small on your own data before trusting it in production.

361,625 ↓ · 15 ♡

NVIDIA-Nemotron-3-Super-120B-A12B-FP8

Nemotron-3 Super is NVIDIA's 120B MoE model with 12B active parameters per token, quantised to FP8 for Hopper GPU deployment. It uses a Latent MoE architecture with Multi-Token Prediction and is trained on NVIDIA's full Nemotron dataset suite including code, math, and multilingual instruction data. At 120B total capacity it targets tasks that require deep knowledge without the cost of dense 120B inference.

361,286 ↓ · 262 ♡

VertaLily-1.2-1B-GGUF

VertaLily-1.2-1B-GGUF is an open-weight model aimed at text generation and chat. VertaLily-1.2-1B-GGUF's 1000M-parameter size keeps hosting requirements modest relative to frontier models. Permissive Apache 2.0 terms let VertaLily-1.2-1B-GGUF go straight into commercial pipelines. Check the VertaLily-1.2-1B-GGUF model card for benchmarks and intended use before adopting it.

360,670 ↓ · 6 ♡

Llama-2-7b-chat-hf

LLaMA 2 7B Chat is Meta's 7B RLHF-aligned conversational model from 2023. While superseded by LLaMA 3 and later releases, it remains a well-understood reference model used for fine-tuning experiments, benchmarking, and educational purposes.

358,965 ↓ · 4,760 ♡

gemma-2-2b-it-GGUF

bartowski's GGUF conversion of Google's Gemma 2 2B Instruct, providing multiple quantisation levels for llama.cpp and similar runtimes. Gemma 2 2B Instruct is Google's smallest instruction model in the Gemma 2 family; at 2B parameters it runs on very limited hardware. bartowski maintains a well-regarded GGUF quantisation pipeline with imatrix calibration for quality retention at lower bit depths.

357,498 ↓ · 97 ♡

DeepSeek-R1-Distill-Qwen-7B

DeepSeek-R1-Distill-Qwen-7B is a qwen2-based open-weight model aimed at text generation and chat. Permissive MIT terms let DeepSeek-R1-Distill-Qwen-7B go straight into commercial pipelines. DeepSeek-R1-Distill-Qwen-7B's 7000M-parameter size keeps hosting requirements modest relative to frontier models. DeepSeek-R1-Distill-Qwen-7B ships without a hosted SLA, so budget for self-managed deployment and monitoring.

355,653 ↓ · 849 ♡

mistral-nemo-instruct-2407-awq

mistral-nemo-instruct-2407-awq is an open-weight text generation and chat model in the mistral family. Prebuilt AWQ weights make local and edge inference of mistral-nemo-instruct-2407-awq straightforward. Like most open checkpoints, mistral-nemo-instruct-2407-awq rewards a quick in-domain eval before commitment.

354,030 ↓ · 12 ♡

Llama-3.2-1B-Instruct-GGUF

Built for text generation and chat, Llama-3.2-1B-Instruct-GGUF is a llama-based model with publicly available weights. It packages Llama-3.2-1B-Instruct as GGUF files for low-memory serving with llama.cpp-style runtimes. At about 1B parameters, it has a small memory and latency budget. Check the Llama-3.2-1B-Instruct-GGUF model card for benchmarks and intended use before adopting it.

353,142 ↓ · 167 ♡

opt-1.3b

OPT-1.3B is Meta's Open Pre-trained Transformer at 1.3 billion parameters, released in 2022 as part of a suite ranging from 125M to 175B. The model was trained on a curated mix of publicly available datasets and released with full weights and training logs to enable reproducibility research. It has largely been superseded by later open LLMs but remains a useful controlled baseline.

352,352 ↓ · 184 ♡

Qwen3-Coder-30B-A3B-Instruct-AWQ

Qwen3-Coder-30B-A3B-Instruct-AWQ is a qwen3-based open-weight model aimed at text generation and chat. Permissive Apache 2.0 terms let Qwen3-Coder-30B-A3B-Instruct-AWQ go straight into commercial pipelines. The name indicates about 30B total parameters with about 3B active per token, and the AWQ quantization further reduces weight memory, which keeps hosting requirements modest relative to frontier models. Read Qwen3-Coder-30B-A3B-Instruct-AWQ's card for hardware requirements and licensing fine print before deploying.

350,956 ↓ · 8 ♡

deepseek-coder-6.7b-instruct

Built for text generation and chat, deepseek-coder-6.7b-instruct is a llama-based model with publicly available weights. deepseek-coder-6.7b-instruct lists a non-standard license, so confirm permissions before deployment. At about 6700M parameters, deepseek-coder-6.7b-instruct sits in the mid-sized tier, which sets its memory and latency budget. deepseek-coder-6.7b-instruct ships without a hosted SLA, so budget for self-managed deployment and monitoring.

349,951 ↓ · 500 ♡

gpt-neo-2.7B

GPT-Neo 2.7B is EleutherAI's 2021 open replication of the GPT-3 architecture, trained on the Pile dataset. At release it was one of the largest freely available autoregressive LLMs. By current standards it is a historical baseline, useful for studying early large-scale open LM behaviour and for running ablation experiments where reproducibility of older results matters.

349,472 ↓ · 503 ♡

GLM-4.7-Flash-AWQ-4bit

As a glm-based open-weight model, GLM-4.7-Flash-AWQ-4bit focuses on text generation and chat. AWQ builds of GLM-4.7-Flash-AWQ-4bit are published alongside the full checkpoint for low-memory serving. The MIT license keeps GLM-4.7-Flash-AWQ-4bit unrestricted for commercial reuse. Before relying on GLM-4.7-Flash-AWQ-4bit, reproduce its key numbers on representative inputs.

349,472 ↓ · 54 ♡

OLMo-2-0425-1B

OLMo-2-0425-1B targets text generation and chat and is shipped as a mid-sized, self-hostable checkpoint. Permissive Apache 2.0 terms let OLMo-2-0425-1B go straight into commercial pipelines. OLMo-2-0425-1B's 1000M-parameter size keeps hosting requirements modest relative to frontier models. Treat OLMo-2-0425-1B's published metrics as a starting point and validate against your workload.

349,329 ↓ · 79 ♡

Llama-3.2-1B-Instruct

Llama-3.2-1B-Instruct is an open-weight text generation and chat model in the llama family, distributed under the Llama 3.2 Community License, which is worth reading before you ship. The listing names llama-3.2-1b-instruct as its base, which suggests a repackaged copy of that checkpoint rather than a new fine-tune. It is community-maintained, so track upstream changes and pin a known-good revision.

345,477 ↓ · 98 ♡

Qwen2.5-Coder-1.5B

Qwen2.5-Coder-1.5B is a mid-sized checkpoint for text generation and chat, distributed on the HuggingFace Hub. Weighing in near 1500M parameters, Qwen2.5-Coder-1.5B trades some ceiling for cheaper, faster inference. The Apache 2.0 license keeps Qwen2.5-Coder-1.5B unrestricted for commercial reuse. Qwen2.5-Coder-1.5B is community-maintained, so track upstream changes and pin a known-good revision.

345,326 ↓ · 91 ♡

Qwen2.5-3B-Instruct-AWQ

Qwen2.5-3B-Instruct-AWQ is a qwen2-based open-weight model aimed at text generation and chat. Qwen2.5-3B-Instruct-AWQ's 3000M-parameter size keeps hosting requirements modest relative to frontier models. AWQ builds of Qwen2.5-3B-Instruct-AWQ are published alongside the full checkpoint for low-memory serving. Before relying on Qwen2.5-3B-Instruct-AWQ, reproduce its key numbers on representative inputs.

344,701 ↓ · 16 ♡

MiniMax-M2.7-NVFP4

As an open-weight model, MiniMax-M2.7-NVFP4 focuses on text generation and chat. MiniMax-M2.7-NVFP4 lists a non-standard license, so confirm permissions before deployment. Before relying on MiniMax-M2.7-NVFP4, reproduce its key numbers on representative inputs.

344,398 ↓ · 60 ♡

gemma-2-9b-it

Gemma 2 9B Instruct is Google's instruction-tuned 9B model from the Gemma 2 family, which introduced alternating sliding-window and full-attention layers, plus logit soft-capping for improved training stability. At release it outperformed Llama 3 8B on multiple benchmarks at a comparable size, making it one of the most downloaded open instruction models in its size class. It is English-focused with some multilingual capability.

344,279 ↓ · 832 ♡
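Logit soft-capping, named in the gemma-2-9b-it entry, squashes logits smoothly into a bounded range instead of clipping them. A minimal sketch of the usual form, cap * tanh(x / cap), follows; the cap value and the sample logits are illustrative assumptions and are not read from a Gemma 2 configuration.

```python
import math


def soft_cap(logits, cap):
    """Smoothly bound each logit to the interval [-cap, cap]."""
    return [cap * math.tanh(x / cap) for x in logits]


# Illustrative cap; small logits pass through almost unchanged,
# while large ones saturate near the cap.
raw = [-1000.0, -5.0, 0.0, 5.0, 1000.0]
print(soft_cap(raw, cap=30.0))
```

Unlike a hard clip, the function keeps a non-zero gradient away from the extremes, which is the property usually credited for the stability benefit.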

Llama-3.1-8B-Instruct

Llama-3.1-8B-Instruct targets text generation and chat and is shipped as a self-hostable checkpoint of about 8B parameters. The listing names llama-3.1-8b-instruct as its base, which suggests a repackaged copy of that checkpoint rather than a new fine-tune. Its size keeps hosting requirements modest relative to frontier models. Like most open checkpoints, Llama-3.1-8B-Instruct rewards a quick in-domain eval before commitment.

344,157 ↓ · 12 ♡

LLaMmlein_1B_prerelease

LLaMmlein 1B is a German-centric small language model from the University of Würzburg's LSX group, trained from scratch on German text. The 'prerelease' indicates this is a preliminary checkpoint shared before the final publication.

343,431 ↓ · 14 ♡

granite-4.1-3b

granite-4.1-3b is a mid-sized checkpoint for text generation and chat, distributed on the HuggingFace Hub. Weighing in near 3000M parameters, granite-4.1-3b trades some ceiling for cheaper, faster inference. The Apache 2.0 license keeps granite-4.1-3b unrestricted for commercial reuse. Evaluate granite-4.1-3b on your own data before trusting it in production.

343,381 ↓ · 81 ♡

GLM-4.7-Flash

GLM-4.7-Flash is a glm-based open-weight model aimed at text generation and chat. The listing names glm-4.7-flash as its base, which suggests a repackaged copy of that checkpoint rather than a new fine-tune. Permissive MIT terms let GLM-4.7-Flash go straight into commercial pipelines. Check the GLM-4.7-Flash model card for benchmarks and intended use before adopting it.

343,270 ↓ · 15 ♡

DeepSeek-R1-Distill-Llama-8B

DeepSeek-R1-Distill-Llama-8B is an openly licensed text generation and chat model in the llama family. At about 8000M parameters, DeepSeek-R1-Distill-Llama-8B sits in the large tier, which sets its memory and latency budget. DeepSeek-R1-Distill-Llama-8B is MIT-licensed, clearing it for closed-source and paid products. Evaluate DeepSeek-R1-Distill-Llama-8B on your own data before trusting it in production.

342,882 ↓ · 866 ♡

NVIDIA-Nemotron-3-Nano-30B-A3B-FP8

NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 targets text generation and chat and is shipped as a large, self-hostable checkpoint. Licensing for NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 is unspecified or custom — clear it before commercial use. Prebuilt FP8 weights make local and edge inference of NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 straightforward. Treat NVIDIA-Nemotron-3-Nano-30B-A3B-FP8's published metrics as a starting point and validate against your workload.

342,851 ↓ · 352 ♡

DeepSeek-R1-0528-Qwen3-8B-MLX-4bit

LM Studio Community's MLX 4-bit quantization of DeepSeek-R1-0528 based on a Qwen3-8B backbone. MLX format targets Apple Silicon (M-series) inference via the MLX framework. The 0528 suffix denotes a May 2025 update to the R1 series. 4-bit quantization reduces memory use to approximately 5-6 GB unified memory.

342,717 ↓ · 12 ♡

bloomz-560m

bloomz-560m is a compact checkpoint for text generation and chat, distributed on the HuggingFace Hub. Weighing in near 560M parameters, bloomz-560m trades some ceiling for cheaper, faster inference. Licensing for bloomz-560m is unspecified or custom — clear it before commercial use. Like most open checkpoints, bloomz-560m rewards a quick in-domain eval before commitment.

342,075 ↓ · 137 ♡

Qwen3-32B-AWQ

Qwen3-32B-AWQ is a qwen3-based open-weight model aimed at text generation and chat. It is an AWQ-quantized build of the roughly 32B-parameter Qwen3-32B, published for lower-memory serving than the full checkpoint. Quantization brings hosting requirements well below those of the full-precision model, though it is still a large model to serve. Qwen3-32B-AWQ ships without a hosted SLA, so budget for self-managed deployment and monitoring.

341,654 ↓ · 136 ♡

gpt2_zinc_87m

An 87M-parameter GPT-2-style model published by the Hub user entropy, fine-tuned on SMILES strings from the ZINC chemical compound database for molecular generation. It generates novel SMILES strings representing drug-like small molecules.

341,273 ↓ · 4 ♡

gpt2-medium

Built for text generation and chat, gpt2-medium is a gpt2-based model with publicly available weights. gpt2-medium is MIT-licensed, clearing it for closed-source and paid products. Read gpt2-medium's card for hardware requirements and licensing fine print before deploying.

341,204 ↓ · 205 ♡

Qwen3-1.7B-GPTQ-Int8

Qwen3-1.7B-GPTQ-Int8 is a mid-sized checkpoint for text generation and chat, distributed on the HuggingFace Hub. Prebuilt GPTQ INT8 weights make local and edge inference straightforward. At about 1.7B parameters, it trades some capability ceiling for cheaper, faster inference. It is community-maintained, so track upstream changes and pin a known-good revision.

340,103 ↓ · 7 ♡
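
Several entries on this page advise pinning a known-good revision. One way to make that concrete is to reference files by full commit SHA instead of a moving branch such as `main`. The sketch below builds a Hub file URL of the form `https://huggingface.co/{repo_id}/resolve/{revision}/{filename}`; the helper name, the repository id, and the SHA are hypothetical placeholders, not values taken from any listing here.

```python
import re

# A full Git commit SHA is 40 lowercase hexadecimal characters.
_COMMIT_SHA = re.compile(r"[0-9a-f]{40}")


def pinned_file_url(repo_id: str, commit_sha: str, filename: str) -> str:
    """Build a Hub file URL that is pinned to one exact commit."""
    if not _COMMIT_SHA.fullmatch(commit_sha):
        # Branch names and tags can move; a commit SHA cannot.
        raise ValueError("pin to a full 40-character commit SHA, not a branch or tag")
    return f"https://huggingface.co/{repo_id}/resolve/{commit_sha}/{filename}"


# Hypothetical repository and SHA, for illustration only.
url = pinned_file_url(
    "example-org/example-model",
    "0123456789abcdef0123456789abcdef01234567",
    "config.json",
)
print(url)
```

The download helpers in the `huggingface_hub` Python library accept a `revision` argument that serves the same purpose when fetching through the client instead of by URL.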

Qwen3-Next-80B-A3B-Instruct-AWQ-4bit

AWQ 4-bit quantization of Qwen3-Next-80B-A3B-Instruct, an 80B mixture-of-experts model activating approximately 3B parameters per token. All 80B parameters must stay resident even after AWQ compression, so loading requires substantial memory although per-token compute is closer to that of a 3B model. The compressed-tensors format targets vLLM.

340,071 ↓ · 66 ♡
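
The split between memory and compute described for the Qwen3-Next-80B-A3B entry above follows from how a mixture-of-experts model is sized: memory tracks total parameters, compute tracks active parameters. The sketch below is a rough illustration under stated assumptions: 80e9 total and 3e9 active parameters taken from the model name, 4 bits per weight, and the common rule of thumb of about 2 FLOPs per active parameter per generated token. The class name is made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoEFootprint:
    total_params: float     # every expert must be loaded
    active_params: float    # parameters actually used per token
    bits_per_weight: float  # after quantization

    def weight_memory_gb(self) -> float:
        """Weights-only memory in decimal GB; scales with TOTAL parameters."""
        return self.total_params * self.bits_per_weight / 8 / 1e9

    def flops_per_token(self) -> float:
        """Rule-of-thumb forward cost; scales with ACTIVE parameters."""
        return 2 * self.active_params


# Assumed values read off the name "80B-A3B" with 4-bit weights.
model = MoEFootprint(total_params=80e9, active_params=3e9, bits_per_weight=4)
print(f"{model.weight_memory_gb():.0f} GB of weights")
print(f"{model.flops_per_token() / 1e9:.0f} GFLOPs per token")
```

Under these assumptions the weights alone come to about 40 GB, while the per-token cost matches a dense 3B model.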

Qwen2.5-Coder-7B-Instruct-GGUF

Qwen2.5-Coder-7B-Instruct-GGUF is an openly licensed text generation and chat model in the Qwen family. Prebuilt GGUF weights make local and edge inference straightforward. At about 7B parameters, it sits in the mid-sized tier, which sets its memory and latency budget. It is community-maintained, so track upstream changes and pin a known-good revision.

340,002 ↓ · 49 ♡

Jan-v3.5-4B-gguf

Jan v3.5-4B is Homebrew (Jan.ai)'s 4B instruction-tuned model in GGUF format, designed for local deployment via the Jan desktop application and llama.cpp. It is fine-tuned for general assistant tasks including math, coding, and identity-aware conversation. Jan.ai positions this as a private, on-device alternative to cloud AI assistants for consumer use.

336,972 ↓ · 21 ♡

gemma-4-31B-it-NVFP4-turbo

Built for text generation and chat, gemma-4-31B-it-NVFP4-turbo is a Gemma-based model with publicly available weights. At about 31B parameters, it sits in the large tier, which sets its memory and latency budget. It is Apache 2.0-licensed, which permits use in closed-source and paid products. Check its model card for benchmarks and intended use before adopting it.

336,377 ↓ · 284 ♡

Gemma-4-26B-A4B-it-NVFP4

NVFP4 quantization of a Gemma 4 26B MoE instruct model (4B active parameters) from bg-digitalservices, targeting H100/H200 GPU inference. Because only 4B of the 26B parameters are active per token, per-token compute stays close to that of a 4B dense model while the model draws on the larger parameter pool.

333,951 ↓ · 30 ♡

Mistral-Small-24B-Instruct-2501-AWQ

An AWQ (Activation-aware Weight Quantization) conversion of Mistral Small 24B Instruct (January 2025), offering 4-bit quantized inference at reduced memory while preserving most of the original model's instruction-following quality.

333,422 ↓ · 29 ♡

Qwen3Guard-Gen-0.6B

Qwen3Guard-Gen is a 0.6B generative content safety model from Alibaba, designed to classify and explain potential policy violations in model outputs. It can generate natural language explanations of why content may be unsafe, unlike binary classifiers.

328,260 ↓ · 73 ♡

Kimi-K2.5

An MLX-format conversion of Moonshot AI's Kimi K2.5 for local inference on Apple Silicon. Kimi K2 models are large MoE language models from Moonshot AI with strong reasoning.

324,555 ↓ · 38 ♡

L3.3-GeneticLemonade-Final-v2-70B

L3.3-GeneticLemonade-Final-v2-70B targets text generation and chat and is shipped as a frontier-scale, self-hostable checkpoint. It is a fine-tune of l3.3-geneticlemonade-final-70b, inheriting that base model's general competence. At about 70B parameters, its 16-bit weights alone occupy roughly 140 GB, so plan for multi-GPU serving or a quantized build. It is community-maintained, so track upstream changes and pin a known-good revision.

323,585 ↓ · 11 ♡

CodeLlama-7b-hf

Code Llama 7B is Meta's code-specialized 7B model, initialized from Llama 2 7B and further trained on code data. This base checkpoint supports code completion and infilling at a practical 7B parameter budget; instruction following is the focus of the separate Instruct variant.

321,306 ↓ · 377 ♡

Qwen2.5-7B-Instruct

Unsloth's repackaging of Qwen2.5-7B-Instruct for use with its fine-tuning and inference optimizations. Unsloth reports VRAM reductions of up to 60% during training through custom GPU kernels and gradient checkpointing optimizations. Apache-2.0 licensed.

321,264 ↓ · 27 ♡

LFM2-24B-A2B-MLX-4bit

A 4-bit MLX quantization of LFM2-24B-A2B, Liquid AI's second-generation 24B MoE model activating ~2B parameters per token, prepared for Apple Silicon local inference by the LM Studio community.

320,702 ↓ · 4 ♡

Qwen2.5-Coder-3B-Instruct

Built for text generation and chat, Qwen2.5-Coder-3B-Instruct is a Qwen2-based model with publicly available weights. The weights start from qwen2.5-coder-3b and specialize it for instruction following. At about 3B parameters, it sits in the mid-sized tier, which sets its memory and latency budget. It ships without a hosted SLA, so budget for self-managed deployment and monitoring.

320,309 ↓ · 105 ♡

OTel-LLM-0.6B-IT

OTel-LLM-0.6B-IT is an openly licensed text generation and chat model in the qwen3 family. It is a fine-tune of qwen3-0.6b, inheriting that base model's general competence. OTel-LLM-0.6B-IT is Apache 2.0-licensed, clearing it for closed-source and paid products. OTel-LLM-0.6B-IT is community-maintained, so track upstream changes and pin a known-good revision.

318,539 ↓ · 0 ♡

Qwen3-4B-AWQ

Qwen3-4B-AWQ is an openly licensed text generation and chat model in the qwen3 family. Qwen3-4B-AWQ is Apache 2.0-licensed, clearing it for closed-source and paid products. Prebuilt AWQ weights make local and edge inference of Qwen3-4B-AWQ straightforward. Treat Qwen3-4B-AWQ's published metrics as a starting point and validate against your workload.

318,220 ↓ · 29 ♡

Qwen2.5-3B-Instruct-GGUF

Qwen2.5-3B-Instruct-GGUF is a Qwen2-based open-weight model aimed at text generation and chat. At about 3B parameters, its hosting requirements stay modest relative to frontier models. It lists a non-standard license, so confirm permissions before deployment. It ships without a hosted SLA, so budget for self-managed deployment and monitoring.

317,263 ↓ · 128 ♡

LFM2-24B-A2B-MLX-8bit

An 8-bit MLX quantization of LFM2-24B-A2B for Apple Silicon, offering higher accuracy than the 4-bit variant at the cost of roughly double the memory requirement. Targets M2/M3 Ultra or M3 Max Macs with sufficient unified memory.

317,211 ↓ · 2 ♡

LFM2-24B-A2B-MLX-5bit

A 5-bit MLX quantization of LFM2-24B-A2B, sitting between the 4-bit and 8-bit variants in the accuracy/memory tradeoff space. Useful for Apple Silicon users who want more quality than 4-bit but less memory usage than 8-bit.

317,034 ↓ · 1 ♡

granite-4.0-micro

granite-4.0-micro targets text generation and chat and is shipped as an open-weight, self-hostable checkpoint. Permissive Apache 2.0 terms let granite-4.0-micro go straight into commercial pipelines. granite-4.0-micro is community-maintained, so track upstream changes and pin a known-good revision.

316,985 ↓ · 271 ♡

LFM2-24B-A2B-MLX-6bit

A 6-bit MLX quantization of Liquid AI's LFM2-24B-A2B for Apple Silicon, targeting the sweet spot between memory efficiency and output quality. 6-bit quantization typically preserves instruction-following quality well while using less memory than 8-bit.

316,848 ↓ · 3 ♡
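
The LFM2-24B-A2B MLX builds listed on this page come in 4-, 5-, 6-, and 8-bit variants, and the choice between them mostly comes down to available unified memory. The helper below is a hypothetical sketch of that decision: it assumes 24e9 parameters (from the name), counts weights only, and reserves an arbitrary 20% of the budget for everything else. Both the function name and the headroom fraction are assumptions, not recommendations from Liquid AI or LM Studio.

```python
from typing import Iterable, Optional


def pick_bit_width(
    num_params: float,
    budget_gb: float,
    options: Iterable[int] = (4, 5, 6, 8),
    headroom: float = 0.8,
) -> Optional[int]:
    """Return the largest bit width whose weights fit in headroom * budget, else None."""
    usable_gb = budget_gb * headroom  # leave room for cache and runtime buffers
    for bits in sorted(options, reverse=True):
        weights_gb = num_params * bits / 8 / 1e9
        if weights_gb <= usable_gb:
            return bits
    return None


# Assumed 24e9 parameters; budgets are example unified-memory sizes in GB.
for budget in (8, 16, 24, 32):
    print(budget, "GB ->", pick_bit_width(24e9, budget))
```

With these assumptions a 16 GB machine lands on the 4-bit build, 24 GB on 6-bit, and 32 GB on 8-bit, while 8 GB fits none of them.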

Mistral-7B-Instruct-v0.1

Mistral-7B-Instruct-v0.1 is a mid-sized checkpoint for text generation and chat, distributed on the HuggingFace Hub. The Apache 2.0 license keeps Mistral-7B-Instruct-v0.1 unrestricted for commercial reuse. It is a fine-tune of mistral-7b-v0.1, inheriting that base model's general competence. Like most open checkpoints, Mistral-7B-Instruct-v0.1 rewards a quick in-domain eval before commitment.

316,680 ↓ · 1,833 ♡

kogpt2-base-v2

kogpt2-base-v2 is an open-weight checkpoint for text generation and chat, distributed on the HuggingFace Hub. kogpt2-base-v2 is subject to CC BY-NC-SA 4.0 terms, so confirm licensing before commercial use. Like most open checkpoints, kogpt2-base-v2 rewards a quick in-domain eval before commitment.

312,018 ↓ · 61 ♡

Qwen3-30B-A3B-Instruct-2507-FP8

Qwen3-30B-A3B-Instruct-2507-FP8 is a qwen3-based open-weight model aimed at text generation and chat. FP8 builds of Qwen3-30B-A3B-Instruct-2507-FP8 are published alongside the full checkpoint for low-memory serving. Permissive Apache 2.0 terms let Qwen3-30B-A3B-Instruct-2507-FP8 go straight into commercial pipelines. Read Qwen3-30B-A3B-Instruct-2507-FP8's card for hardware requirements and licensing fine print before deploying.

311,787 ↓ · 127 ♡

Qwen3-Coder-30B-A3B-Instruct-AWQ-4bit

AWQ 4-bit quantization of Qwen3-Coder-30B-A3B, a MoE code model with 30B total and 3B active parameters per token. The AWQ quantization reduces memory requirements while the MoE architecture keeps compute low per token, making this a practical option for running a frontier-class code model on a single high-VRAM GPU. The model is instruction-tuned for agentic coding tasks.

310,634 ↓ · 55 ♡

NVIDIA-Nemotron-Nano-9B-v2-FP8

NVIDIA-Nemotron-Nano-9B-v2-FP8 targets text generation and chat and is shipped as a large, self-hostable checkpoint. Its licensing is unspecified or custom, so clear it before commercial use. At about 9B parameters, its hosting requirements stay modest relative to frontier models. It is community-maintained, so track upstream changes and pin a known-good revision.

310,539 ↓ · 9 ♡

gemma-2-2b

gemma-2-2b targets text generation and chat and is shipped as a mid-sized, self-hostable checkpoint. Because gemma-2-2b is released under the Gemma license terms, vet the conditions against your deployment plan. At about 2B parameters, its hosting requirements stay modest relative to frontier models. It is community-maintained, so track upstream changes and pin a known-good revision.

309,987 ↓ · 641 ♡

gemma-4-E4B-it-OBLITERATED

An abliterated (safety-filter-removed) version of Gemma 4 E4B instruct from the OBLITERATUS project. The abliteration technique modifies weight directions associated with refusal behavior, allowing unconstrained generation at the cost of losing safety alignment.

309,858 ↓ · 715 ♡

OTel-LLM-1.7B-IT

OTel-LLM-1.7B-IT is an openly licensed text generation and chat model in the Qwen3 family. It is Apache 2.0-licensed, which permits use in closed-source and paid products. At about 1.7B parameters, it sits in the mid-sized tier, which sets its memory and latency budget. Evaluate it on your own data before trusting it in production.

309,478 ↓ · 1 ♡

MiniMax-M2.7-GGUF

MiniMax-M2.7-GGUF is an open-weight checkpoint for text generation and chat, distributed on the HuggingFace Hub. Prebuilt GGUF weights make local and edge inference of MiniMax-M2.7-GGUF straightforward. Licensing for MiniMax-M2.7-GGUF is unspecified or custom — clear it before commercial use. Treat MiniMax-M2.7-GGUF's published metrics as a starting point and validate against your workload.

309,470 ↓ · 174 ♡

OTel-LLM-1B-IT

OTel-LLM-1B-IT is an openly licensed text generation and chat model. It is a fine-tune of gemma-3-1b-it, inheriting that base model's general competence. At about 1B parameters, it sits in the mid-sized tier, which sets its memory and latency budget. Evaluate it on your own data before trusting it in production.

309,136 ↓ · 1 ♡

HyperCLOVAX-SEED-Think-14B-GPTQ

Built for text generation and chat, HyperCLOVAX-SEED-Think-14B-GPTQ is a GPTQ-quantized model with publicly available weights. It lists a non-standard license, so confirm permissions before deployment. At about 14B parameters, it sits in the large tier, which sets its memory and latency budget. Before relying on it, reproduce its key numbers on representative inputs.

309,011 ↓ · 0 ♡

GLM-5

GLM-5 is a glm-based open-weight model aimed at text generation and chat. Permissive MIT terms let GLM-5 go straight into commercial pipelines. GLM-5 ships without a hosted SLA, so budget for self-managed deployment and monitoring.

308,513 ↓ · 2,085 ♡

gemma-2b

Gemma 2B is Google's 2B-parameter open language model from early 2024, trained on 2T tokens of web, code, and math data. It was notable at release for punching above its weight class on benchmarks vs other 2B models available at the time.

307,263 ↓ · 1,185 ♡

OTel-LLM-270M-IT

OTel-LLM-270M-IT is an open-weight model aimed at text generation and chat. OTel-LLM-270M-IT's 270M-parameter size keeps hosting requirements modest relative to frontier models. The weights start from gemma-3-270m-it and specialize it for the target task. Read OTel-LLM-270M-IT's card for hardware requirements and licensing fine print before deploying.

305,687 ↓ · 0 ♡

OTel-LLM-8.3B-IT

As a large model, OTel-LLM-8.3B-IT focuses on text generation and chat. At about 8.3B parameters, it still fits single-GPU serving while costing more per token than the smaller OTel-LLM variants. The weights start from rnj-1-instruct and specialize it for the target task. Before relying on it, reproduce its key numbers on representative inputs.

305,168 ↓ · 2 ♡

Qwen3.5-397B-A17B-Opus-4.6-Reasoning-Uncensored-GGUF

As a Qwen-based frontier-scale model, Qwen3.5-397B-A17B-Opus-4.6-Reasoning-Uncensored-GGUF focuses on text generation and chat. The Apache 2.0 license keeps it unrestricted for commercial reuse. GGUF builds are published alongside the full checkpoint for lower-memory serving. Read its card for hardware requirements and licensing fine print before deploying.

304,096 ↓ · 22 ♡

gpt-oss-20b-GGUF

gpt-oss-20b-GGUF is an open-weight model aimed at text generation and chat. At about 20B parameters, its hosting requirements stay modest relative to frontier models. GGUF builds are published alongside the full checkpoint for low-memory serving. It ships without a hosted SLA, so budget for self-managed deployment and monitoring.

303,945 ↓ · 684 ♡

GLM-4.7-FP8

As a glm-based open-weight model, GLM-4.7-FP8 focuses on text generation and chat. The MIT license keeps GLM-4.7-FP8 unrestricted for commercial reuse. FP8 builds of GLM-4.7-FP8 are published alongside the full checkpoint for low-memory serving. GLM-4.7-FP8 ships without a hosted SLA, so budget for self-managed deployment and monitoring.

301,918 ↓ · 123 ♡

gpt-j-6b

gpt-j-6b is a GPT-J-based open-weight model aimed at text generation and chat. At about 6B parameters, its hosting requirements stay modest relative to frontier models. Permissive Apache 2.0 terms let it go straight into commercial pipelines. Before relying on it, reproduce its key numbers on representative inputs.

299,770 ↓ · 1,524 ♡

GLM-4.7-Flash-MLX-8bit

GLM-4.7-Flash-MLX-8bit is a glm-based open-weight model aimed at text generation and chat. Permissive MIT terms let GLM-4.7-Flash-MLX-8bit go straight into commercial pipelines. MLX builds of GLM-4.7-Flash-MLX-8bit are published alongside the full checkpoint for low-memory serving. GLM-4.7-Flash-MLX-8bit ships without a hosted SLA, so budget for self-managed deployment and monitoring.

299,295 ↓ · 11 ♡

DialoGPT-medium

DialoGPT-medium targets text generation and chat and is shipped as an open-weight, self-hostable checkpoint. Permissive MIT terms let DialoGPT-medium go straight into commercial pipelines. Like most open checkpoints, DialoGPT-medium rewards a quick in-domain eval before commitment.

298,332 ↓ · 436 ♡

Solar-Open-100B

Solar-Open-100B is an open-weight model aimed at text generation and chat. It lists a non-standard license, so confirm permissions before deployment. At about 100B parameters, its 16-bit weights alone occupy roughly 200 GB, so plan for multi-GPU serving or a quantized build. Check its model card for benchmarks and intended use before adopting it.

297,968 ↓ · 475 ♡

Qwen2.5-1.5B-Instruct-GGUF

As a qwen2-based mid-sized model, Qwen2.5-1.5B-Instruct-GGUF focuses on text generation and chat. The Apache 2.0 license keeps Qwen2.5-1.5B-Instruct-GGUF unrestricted for commercial reuse. GGUF builds of Qwen2.5-1.5B-Instruct-GGUF are published alongside the full checkpoint for low-memory serving. Qwen2.5-1.5B-Instruct-GGUF ships without a hosted SLA, so budget for self-managed deployment and monitoring.

297,188 ↓ · 95 ♡

GLM-5.1

As a glm-based open-weight model, GLM-5.1 focuses on text generation and chat. The MIT license keeps GLM-5.1 unrestricted for commercial reuse. Check the GLM-5.1 model card for benchmarks and intended use before adopting it.

296,811 ↓ · 1,612 ♡

Meta-Llama-3.3-70B-Instruct-AWQ-INT4

As a Llama-based frontier-scale model, Meta-Llama-3.3-70B-Instruct-AWQ-INT4 focuses on text generation and chat. It is subject to Llama 3.3 Community terms, so confirm licensing before commercial use. At about 70B parameters, even the INT4 AWQ weights occupy roughly 35 GB, so size GPU memory accordingly. Read its card for hardware requirements and licensing fine print before deploying.

296,047 ↓ · 31 ♡

EXAONE-Deep-7.8B

EXAONE-Deep-7.8B is an open-weight text generation and chat model. Its licensing is unspecified or custom, so clear it before commercial use. At about 7.8B parameters, it sits in the mid-sized tier, which sets its memory and latency budget. Like most open checkpoints, it rewards a quick in-domain eval before commitment.

295,722 ↓ · 102 ♡

DeepSeek-R1-0528-Qwen3-8B-MLX-8bit

An 8-bit MLX quantization of DeepSeek-R1-0528 built on the Qwen3-8B backbone, packaged by the LM Studio community for native Apple Silicon inference. DeepSeek R1 is a reasoning model that generates extended chain-of-thought traces before answers; this variant applies the R1-0528 update's improved distillation from the larger R1 model. The MLX format enables Metal GPU acceleration on M-series Macs.

295,466 ↓ · 18 ♡

Qwen2.5-Coder-7B-Instruct-AWQ

Qwen2.5-Coder-7B-Instruct in AWQ 4-bit quantization, the official Alibaba release for memory-efficient code generation serving. AWQ uses activation statistics to protect the most salient weight channels during quantization, enabling deployment of the 7B code model on a single GPU with ~8 GB VRAM. It is reported to stay competitive with the BF16 original on HumanEval and MBPP while cutting weight memory to roughly a quarter.

294,709 ↓ · 25 ♡

GLM-4.7-Flash-MLX-6bit

As a glm-based open-weight model, GLM-4.7-Flash-MLX-6bit focuses on text generation and chat. MLX builds of GLM-4.7-Flash-MLX-6bit are published alongside the full checkpoint for low-memory serving. The MIT license keeps GLM-4.7-Flash-MLX-6bit unrestricted for commercial reuse. Check the GLM-4.7-Flash-MLX-6bit model card for benchmarks and intended use before adopting it.

294,338 ↓ · 8 ♡

HyperCLOVAX-SEED-Omni-8B

HyperCLOVAX-SEED-Omni-8B is an open-weight text generation and chat model. Its licensing is unspecified or custom, so clear it before commercial use. At about 8B parameters, it sits in the large tier, which sets its memory and latency budget. Treat its published metrics as a starting point and validate against your workload.

293,359 ↓ · 186 ♡

Llama-3.2-3B-Instruct-GGUF

Llama-3.2-3B-Instruct-GGUF is a Llama-based open-weight model aimed at text generation and chat. GGUF builds are published alongside the full checkpoint for low-memory serving. At about 3B parameters, its hosting requirements stay modest relative to frontier models. It ships without a hosted SLA, so budget for self-managed deployment and monitoring.

293,029 ↓ · 206 ♡

Meta-Llama-3.1-8B-Instruct-bnb-4bit

As a Llama-based large model, Meta-Llama-3.1-8B-Instruct-bnb-4bit focuses on text generation and chat. Its bitsandbytes 4-bit weights are published alongside the full checkpoint for low-memory serving. At about 8B parameters, it trades some capability ceiling for cheaper, faster inference. It ships without a hosted SLA, so budget for self-managed deployment and monitoring.

291,341 ↓ · 99 ♡

Qwen3.6-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled-GGUF

Built for text generation and chat, Qwen3.6-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled-GGUF is a Qwen-based model with publicly available weights. It is Apache 2.0-licensed, which permits use in closed-source and paid products. At about 35B total parameters, with the A3B suffix indicating roughly 3B active per token, its memory budget follows the total count while per-token latency follows the active count. Check its model card for benchmarks and intended use before adopting it.

289,276 ↓ · 256 ♡

dummy-GPT2-correct-vocab

dummy-GPT2-correct-vocab is an open-weight text generation and chat model in the gpt2 family. Evaluate dummy-GPT2-correct-vocab on your own data before trusting it in production.

286,982 ↓ · 0 ♡

Dream-v0-Instruct-7B

Dream-v0-Instruct-7B is a diffusion-based language model — distinct from autoregressive LLMs — that generates text by iteratively denoising a masked sequence rather than left-to-right token prediction. It is instruction-tuned and supports bidirectional context at inference time, which enables flexible text infilling without explicit prompting tricks. This is an early research release exploring the diffusion LM paradigm.

271,115 ↓ · 157 ♡
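
The iterative denoising described for Dream-v0-Instruct-7B above can be pictured with a toy loop: start from a partly masked sequence and reveal a few positions per step, most confident first. The sketch below is not Dream's algorithm. The `toy_predictor` is a stand-in that simply looks up a known target and uses a made-up confidence score (how many neighbors are already revealed); a real diffusion LM would predict tokens and confidences from the current sequence. All names here are hypothetical.

```python
MASK = None  # sentinel for a position that has not been revealed yet


def toy_predictor(tokens, target):
    """Stand-in for a model: propose (position, token, confidence) per masked slot."""
    proposals = []
    for i, tok in enumerate(tokens):
        if tok is not MASK:
            continue
        neighbors = [tokens[j] for j in (i - 1, i + 1) if 0 <= j < len(tokens)]
        confidence = sum(n is not MASK for n in neighbors)  # made-up score
        proposals.append((i, target[i], confidence))
    return proposals


def denoise(target, prompt_positions=(), per_step=2):
    """Reveal up to per_step masked positions each step until none remain."""
    tokens = [tok if i in prompt_positions else MASK for i, tok in enumerate(target)]
    history = [list(tokens)]
    while any(tok is MASK for tok in tokens):
        proposals = toy_predictor(tokens, target)
        # Most confident first; ties broken by position so the run is deterministic.
        proposals.sort(key=lambda p: (-p[2], p[0]))
        for pos, tok, _ in proposals[:per_step]:
            tokens[pos] = tok
        history.append(list(tokens))
    return tokens, history


# Infilling: both ends are given, the middle is filled in from the outside in.
target = "the cat sat on the mat".split()
final, history = denoise(target, prompt_positions={0, 5})
for step, state in enumerate(history):
    print(step, " ".join(tok if tok is not MASK else "[MASK]" for tok in state))
```

Because every step sees the whole sequence, fixed tokens on both sides of a gap can guide what fills it, which is the property the entry refers to as bidirectional context.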

Qwen2.5-Coder-7B

Qwen2.5-Coder-7B is a mid-sized checkpoint for text generation and chat, distributed on the HuggingFace Hub. At about 7B parameters, it trades some capability ceiling for cheaper, faster inference. The Apache 2.0 license keeps it unrestricted for commercial reuse. Evaluate it on your own data before trusting it in production.

265,736 ↓ · 148 ♡

Step-3.5-Flash

As an open-weight model, Step-3.5-Flash focuses on text generation and chat. The Apache 2.0 license keeps Step-3.5-Flash unrestricted for commercial reuse. Check the Step-3.5-Flash model card for benchmarks and intended use before adopting it.

248,316 ↓ · 819 ♡

Darwin-9B-NEG

Darwin-9B-NEG is a 9B model from ansulev, likely a negation-aware variant trained to improve understanding of negative statements in text. The NEG suffix suggests specialization toward negation handling, which remains a known weakness in many transformer language models.

231,963 ↓ · 15 ♡

SmolLM2-360M-Instruct

SmolLM2-360M-Instruct is an openly licensed text generation and chat model in the llama family. SmolLM2-360M-Instruct is Apache 2.0-licensed, clearing it for closed-source and paid products. At about 360M parameters, SmolLM2-360M-Instruct sits in the compact tier, which sets its memory and latency budget. Treat SmolLM2-360M-Instruct's published metrics as a starting point and validate against your workload.

229,139 ↓ · 196 ♡