automatic speech recognition models

116 models · ranked by HuggingFace downloads

whisperkit-coreml

WhisperKit CoreML is a collection of Whisper speech recognition models exported to Apple's CoreML format by Argmax, enabling on-device ASR on Apple Silicon (iPhone, iPad, Mac) without network calls. The models run via the WhisperKit framework, which handles chunking, VAD, and decoding on-device. Designed for iOS/macOS applications requiring offline transcription.

8,390,579 ↓ · 193 ♡

speaker-diarization-3.1

Pyannote speaker-diarization-3.1 is a complete speaker diarization pipeline from pyannote.audio that answers 'who spoke when' in an audio recording. It segments audio into speaker-homogeneous regions, clusters them by speaker identity using embedding models, and outputs timestamped speaker labels. Used in meeting transcription, podcast editing, and call center analytics.

8,329,860 ↓ · 2,509 ♡
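
The "who spoke when" output described for speaker-diarization-3.1 is, in practice, a list of timestamped speaker segments. A minimal post-processing sketch, assuming segments are already available as plain (start, end, speaker) records: the `Segment` class, the `merge_turns` helper, and the 0.5-second gap are illustrative assumptions, not part of pyannote.audio.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    start: float   # seconds
    end: float     # seconds
    speaker: str   # label such as "SPEAKER_00"


def merge_turns(segments: List[Segment], max_gap: float = 0.5) -> List[Segment]:
    """Merge consecutive same-speaker segments separated by at most max_gap seconds."""
    turns: List[Segment] = []
    for seg in sorted(segments, key=lambda s: s.start):
        last = turns[-1] if turns else None
        if last and last.speaker == seg.speaker and seg.start - last.end <= max_gap:
            # Same speaker resumes after a short pause: extend the current turn.
            last.end = max(last.end, seg.end)
        else:
            # Copy the segment so the caller's input list is not mutated.
            turns.append(Segment(seg.start, seg.end, seg.speaker))
    return turns


if __name__ == "__main__":
    demo = [
        Segment(0.0, 1.0, "SPEAKER_00"),
        Segment(1.2, 2.0, "SPEAKER_00"),
        Segment(2.1, 3.0, "SPEAKER_01"),
    ]
    for turn in merge_turns(demo):
        print(f"{turn.start:.1f}-{turn.end:.1f} {turn.speaker}")
```

Merging short pauses like this tends to give more readable meeting transcripts; the right gap depends on the recording.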

whisper-large-v3-turbo

Whisper Large-v3-Turbo is a pruned and fine-tuned version of Whisper Large-v3 that keeps most of the large model's transcription accuracy at substantially lower inference cost. It supports 99 languages and cuts the decoder from 32 layers to 4, trading a small quality loss for speed. MIT licensed and directly compatible with HuggingFace's Whisper inference pipeline.

7,410,768 ↓ · 3,118 ♡

whisper-base

whisper-base is an openly licensed speech-to-text transcription model in the whisper family. whisper-base is multilingual by design rather than English-only. whisper-base is Apache 2.0-licensed, clearing it for closed-source and paid products. Like most open checkpoints, whisper-base rewards a quick in-domain eval before commitment.

6,337,973 ↓ · 273 ♡
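
The "quick in-domain eval" recommended for whisper-base and many other entries in this list usually means computing word error rate (WER) on a handful of your own transcribed clips. A minimal sketch, assuming you already have reference and hypothesis strings; the `wer` helper and its lowercase-and-split normalization are illustrative choices, not a standard taken from any model card.

```python
from typing import List


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """WER = word edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    print(wer("the cat sat on the mat", "the cat sat on mat"))
```

Real evaluations typically also strip punctuation and normalize numbers before scoring, so compare models only under the same normalization.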

wav2vec2-large-xlsr-53-japanese

wav2vec2-large-xlsr-53-japanese is an open-weight checkpoint for speech-to-text transcription, distributed on the HuggingFace Hub. The Apache 2.0 license keeps wav2vec2-large-xlsr-53-japanese unrestricted for commercial reuse. Like most open checkpoints, wav2vec2-large-xlsr-53-japanese rewards a quick in-domain eval before commitment.

6,122,215 ↓ · 60 ♡

whisper-large-v3

Whisper Large-v3 is OpenAI's full-size ASR model supporting 99 languages, trained on about 1 million hours of weakly labeled audio plus 4 million hours of pseudo-labeled audio. It delivers state-of-the-art transcription accuracy across languages at the cost of significant inference compute. Apache 2.0 licensed. The Large-v3-Turbo variant (a pruned, fine-tuned version) provides similar quality at lower cost for most use cases.

5,743,363 ↓ · 5,883 ♡

wav2vec2-large-xlsr-53-polish

wav2vec2-large-xlsr-53-polish is an open-weight checkpoint for speech-to-text transcription, distributed on the HuggingFace Hub. The Apache 2.0 license keeps wav2vec2-large-xlsr-53-polish unrestricted for commercial reuse. Evaluate wav2vec2-large-xlsr-53-polish on your own data before trusting it in production.

4,736,303 ↓ · 12 ♡

wav2vec2-indonesian-javanese-sundanese

As a wav2vec2-based open-weight model, wav2vec2-indonesian-javanese-sundanese focuses on speech-to-text transcription. The Apache 2.0 license keeps wav2vec2-indonesian-javanese-sundanese unrestricted for commercial reuse. Check the wav2vec2-indonesian-javanese-sundanese model card for benchmarks and intended use before adopting it.

4,253,819 ↓ · 15 ♡

wav2vec2-large-xlsr-53-dutch

Wav2Vec2 XLSR-53 Large fine-tuned on Mozilla Common Voice 6 Dutch data for Dutch automatic speech recognition. Part of Jonatas Grosman's systematic XLSR fine-tuning series covering multiple languages. Apache-2.0 licensed with published evaluation results.

4,168,344 ↓ · 15 ♡

wav2vec2-large-xlsr-53-greek

An XLSR-53 wav2vec2 large model fine-tuned for Greek ASR by Jonatas Grosman, part of their extensive series of language-specific ASR models. Provides a practical open Greek speech recognition model fine-tuned from a strong multilingual backbone.

3,886,058 ↓ · 4 ♡

wav2vec2-large-xlsr-53-arabic

As a wav2vec2-based open-weight model, wav2vec2-large-xlsr-53-arabic focuses on speech-to-text transcription. The Apache 2.0 license keeps wav2vec2-large-xlsr-53-arabic unrestricted for commercial reuse. Read wav2vec2-large-xlsr-53-arabic's card for hardware requirements and licensing fine print before deploying.

3,555,299 ↓ · 54 ♡

mms-300m-1130-forced-aligner

MMS-300M-1130-forced-aligner is Meta's 300M parameter wav2vec2-based model fine-tuned for forced phoneme-level alignment across 1,130 languages. It takes audio and a text transcript as input and outputs word- or phoneme-level timestamps, enabling subtitle synchronization and linguistic documentation at scale. The CC-BY-NC-4.0 license restricts commercial deployment.

3,493,790 ↓ · 92 ♡
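
The subtitle-synchronization use case mentioned for mms-300m-1130-forced-aligner comes down to turning word-level timestamps into timed cues. A minimal sketch, assuming the aligner's output has already been converted to (text, start, end) tuples; the tuple layout, the `words_to_srt` helper, and the seven-words-per-cue limit are hypothetical and not the aligner's actual API.

```python
from typing import List, Tuple

Word = Tuple[str, float, float]  # (text, start_seconds, end_seconds)


def srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def words_to_srt(words: List[Word], max_words: int = 7) -> str:
    """Group aligned words into fixed-size cues and render them as SRT text."""
    cues = []
    for n, i in enumerate(range(0, len(words), max_words), start=1):
        chunk = words[i:i + max_words]
        text = " ".join(w[0] for w in chunk)
        # A cue spans from its first word's start to its last word's end.
        cues.append(f"{n}\n{srt_time(chunk[0][1])} --> {srt_time(chunk[-1][2])}\n{text}\n")
    return "\n".join(cues)


if __name__ == "__main__":
    aligned = [("hello", 0.0, 0.4), ("world", 0.5, 1.0), ("again", 1.2, 1.8)]
    print(words_to_srt(aligned, max_words=2))
```

Production subtitle tools usually split cues on pauses and punctuation rather than a fixed word count.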

wav2vec2-large-xlsr-53-hungarian

wav2vec2-large-xlsr-53-hungarian targets speech-to-text transcription and is shipped as an open-weight, self-hostable checkpoint. Permissive Apache 2.0 terms let wav2vec2-large-xlsr-53-hungarian go straight into commercial pipelines. Like most open checkpoints, wav2vec2-large-xlsr-53-hungarian rewards a quick in-domain eval before commitment.

3,439,390 ↓ · 10 ♡

voice-activity-detection

A pretrained voice activity detection pipeline from pyannote.audio, identifying speech segments in audio streams. It is trained on AMI, DIHARD, and VoxConverse corpora and outputs timestamped speech/non-speech labels.

3,361,700 ↓ · 237 ♡
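
Timestamped speech/non-speech labels, as produced by the voice-activity-detection pipeline, are typically derived from per-frame speech scores. A minimal sketch of that last step, assuming a list of scores in [0, 1]; the 20 ms frame length, the 0.5 threshold, and the `frames_to_segments` helper are illustrative assumptions and do not reproduce pyannote.audio's own post-processing.

```python
from typing import List, Optional, Tuple


def frames_to_segments(
    scores: List[float],
    frame_sec: float = 0.02,
    threshold: float = 0.5,
) -> List[Tuple[float, float]]:
    """Turn per-frame speech scores into (start, end) speech segments in seconds."""
    segments: List[Tuple[float, float]] = []
    start: Optional[int] = None  # index of the first frame of the open segment
    for i, score in enumerate(scores):
        is_speech = score >= threshold
        if is_speech and start is None:
            start = i                                       # speech begins
        elif not is_speech and start is not None:
            segments.append((start * frame_sec, i * frame_sec))  # speech ends
            start = None
    if start is not None:
        # Audio ended while speech was still active.
        segments.append((start * frame_sec, len(scores) * frame_sec))
    return segments


if __name__ == "__main__":
    print(frames_to_segments([0.1, 0.9, 0.8, 0.2, 0.7], frame_sec=1.0))
```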

speaker-diarization-community-1

A community-supported speaker diarization pipeline from pyannote.audio that segments multi-speaker audio into per-speaker turns. It combines voice activity detection, speaker embedding, and clustering steps into a single callable pipeline.

3,361,507 ↓ · 629 ♡

wav2vec2-large-xlsr-53-portuguese

wav2vec2-large-xlsr-53-portuguese is an XLSR-53 model fine-tuned on Portuguese Common Voice data for automatic speech recognition using CTC decoding on 16 kHz mono audio. It achieves competitive word error rates on both European and Brazilian Portuguese test sets. Part of the community XLSR fine-tuning effort from the 2021 HuggingFace strong speech event.

3,235,345 ↓ · 55 ♡
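
The CTC decoding mentioned for wav2vec2-large-xlsr-53-portuguese, and used by most wav2vec2 fine-tunes in this list, can be illustrated with greedy decoding: take the most likely token per audio frame, collapse repeats, then drop the blank symbol. A minimal sketch, assuming the per-frame argmax IDs are already computed; the toy vocabulary and the blank ID of 0 are hypothetical.

```python
from typing import List, Optional, Sequence


def ctc_greedy_decode(frame_ids: Sequence[int], vocab: List[str], blank_id: int = 0) -> str:
    """Collapse repeated frame predictions and remove CTC blanks."""
    out: List[str] = []
    prev: Optional[int] = None
    for idx in frame_ids:
        # Emit a token only when it differs from the previous frame and is not blank.
        if idx != prev and idx != blank_id:
            out.append(vocab[idx])
        prev = idx
    return "".join(out)


if __name__ == "__main__":
    toy_vocab = ["_", "h", "e", "l", "o"]   # index 0 plays the role of the blank
    frames = [1, 1, 0, 2, 3, 3, 0, 3, 4, 4]
    print(ctc_greedy_decode(frames, toy_vocab))
```

The blank between the two runs of `l` is what lets CTC emit a doubled letter; without it the repeats would collapse into one.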

wav2vec2-large-xlsr-53-russian

A Russian-language ASR model fine-tuned from Facebook's wav2vec2-large-xlsr-53 (cross-lingual 53-language pre-training) on Mozilla Common Voice and Common Voice 6.0 Russian datasets. Produces Russian text transcriptions from audio using CTC decoding. Community-contributed under Apache 2.0.

3,191,676 ↓ · 75 ♡

whisper-small

Whisper-small is OpenAI's 244M-parameter multilingual speech recognition model, covering 99 languages with reasonable accuracy. It balances quality and inference speed, performing significantly better than tiny/base while running on modest hardware.

2,941,498 ↓ · 570 ♡

wav2vec2-large-xlsr-53-telugu

Built for speech-to-text transcription, wav2vec2-large-xlsr-53-telugu is a wav2vec2-based model with publicly available weights. wav2vec2-large-xlsr-53-telugu is Apache 2.0-licensed, clearing it for closed-source and paid products. Read wav2vec2-large-xlsr-53-telugu's card for hardware requirements and licensing fine print before deploying.

2,804,148 ↓ · 5 ♡

romanian-wav2vec2

romanian-wav2vec2 is an open-weight checkpoint for speech-to-text transcription, distributed on the HuggingFace Hub. It is a fine-tune of wav2vec2-xls-r-300m, inheriting that base model's general competence. The Apache 2.0 license keeps romanian-wav2vec2 unrestricted for commercial reuse. Evaluate romanian-wav2vec2 on your own data before trusting it in production.

2,803,352 ↓ · 7 ♡

wav2vec2-large-voxrex-swedish

As a wav2vec2-based open-weight model, wav2vec2-large-voxrex-swedish focuses on speech-to-text transcription. wav2vec2-large-voxrex-swedish lists a non-standard license, so confirm permissions before deployment. Check the wav2vec2-large-voxrex-swedish model card for benchmarks and intended use before adopting it.

2,552,134 ↓ · 13 ♡

wav2vec2-large-xlsr-53-persian

Built for speech-to-text transcription, wav2vec2-large-xlsr-53-persian is a wav2vec2-based model with publicly available weights. wav2vec2-large-xlsr-53-persian is Apache 2.0-licensed, clearing it for closed-source and paid products. Check the wav2vec2-large-xlsr-53-persian model card for benchmarks and intended use before adopting it.

2,550,079 ↓ · 26 ♡

filipino-wav2vec2-l-xls-r-300m-official

A wav2vec2 300M model fine-tuned for Filipino (Tagalog) ASR using the XLS-R multilingual pretrained backbone. One of the few open Filipino speech recognition models available.

2,332,152 ↓ · 2 ♡

wav2vec2-large-xls-r-300m-Urdu

wav2vec2-large-xls-r-300m-Urdu is an openly licensed speech-to-text transcription model in the wav2vec2 family. It is a fine-tune of wav2vec2-xls-r-300m, inheriting that base model's general competence. wav2vec2-large-xls-r-300m-Urdu is Apache 2.0-licensed, clearing it for closed-source and paid products. Evaluate wav2vec2-large-xls-r-300m-Urdu on your own data before trusting it in production.

2,301,607 ↓ · 13 ♡

vakyansh-wav2vec2-tamil-tam-250

vakyansh-wav2vec2-tamil-tam-250 is a wav2vec2-based open-weight model aimed at speech-to-text transcription. Permissive MIT terms let vakyansh-wav2vec2-tamil-tam-250 go straight into commercial pipelines. vakyansh-wav2vec2-tamil-tam-250 ships without a hosted SLA, so budget for self-managed deployment and monitoring.

2,229,067 ↓ · 4 ♡

Wav2Vec2-large-xlsr-hindi

As a wav2vec2-based open-weight model, Wav2Vec2-large-xlsr-hindi focuses on speech-to-text transcription. The weights start from wav2vec2-large-xlsr-53 and specialize it for the target task. Read Wav2Vec2-large-xlsr-hindi's card for hardware requirements and licensing fine print before deploying.

2,152,770 ↓ · 12 ♡

whisper-tiny

As a whisper-based open-weight model, whisper-tiny focuses on speech-to-text transcription. Training spans multiple languages, so whisper-tiny covers cross-lingual speech-to-text transcription from one checkpoint. The Apache 2.0 license keeps whisper-tiny unrestricted for commercial reuse. Read whisper-tiny's card for hardware requirements and licensing fine print before deploying.

1,931,635 ↓ · 434 ♡

wav2vec2-large-xlsr-53-th

wav2vec2-large-xlsr-53-th is a wav2vec2-based open-weight model aimed at speech-to-text transcription. Because wav2vec2-large-xlsr-53-th uses CC BY-SA 4.0, vet the conditions against your deployment plan. wav2vec2-large-xlsr-53-th ships without a hosted SLA, so budget for self-managed deployment and monitoring.

1,891,631 ↓ · 28 ♡

wav2vec2-large-xlsr-53-finnish

wav2vec2-large-xlsr-53-finnish targets speech-to-text transcription and is shipped as an open-weight, self-hostable checkpoint. Permissive Apache 2.0 terms let wav2vec2-large-xlsr-53-finnish go straight into commercial pipelines. Evaluate wav2vec2-large-xlsr-53-finnish on your own data before trusting it in production.

1,877,904 ↓ · 1 ♡

wav2vec2-xls-r-300m-cs-250

As a wav2vec2-based compact model, wav2vec2-xls-r-300m-cs-250 focuses on speech-to-text transcription. The Apache 2.0 license keeps wav2vec2-xls-r-300m-cs-250 unrestricted for commercial reuse. Weighing in near 300M parameters, wav2vec2-xls-r-300m-cs-250 trades some ceiling for cheaper, faster inference. wav2vec2-xls-r-300m-cs-250 ships without a hosted SLA, so budget for self-managed deployment and monitoring.

1,870,319 ↓ · 3 ♡

Voxtral-Mini-4B-Realtime-2602

Voxtral-Mini-4B-Realtime-2602 is a mid-sized checkpoint for speech-to-text transcription, distributed on the HuggingFace Hub. Weighing in near 4B parameters, Voxtral-Mini-4B-Realtime-2602 trades some ceiling for cheaper, faster inference. It is a fine-tune of ministral-3-3b-base-2512, inheriting that base model's general competence. Voxtral-Mini-4B-Realtime-2602 is community-maintained, so track upstream changes and pin a known-good revision.

1,868,460 ↓ · 895 ♡

wav2vec2-xls-r-300m-hebrew

wav2vec2-xls-r-300m-hebrew targets speech-to-text transcription and is shipped as a compact, self-hostable checkpoint. wav2vec2-xls-r-300m-hebrew's 300M-parameter size keeps hosting requirements modest relative to frontier models. It is a fine-tune of wav2vec2-xls-r-300m, inheriting that base model's general competence. Like most open checkpoints, wav2vec2-xls-r-300m-hebrew rewards a quick in-domain eval before commitment.

1,866,816 ↓ · 6 ♡

wav2vec2-xls-r-300m-mixed

wav2vec2-xls-r-300m-mixed is a compact checkpoint for speech-to-text transcription, distributed on the HuggingFace Hub. Weighing in near 300M parameters, wav2vec2-xls-r-300m-mixed trades some ceiling for cheaper, faster inference. Evaluate wav2vec2-xls-r-300m-mixed on your own data before trusting it in production.

1,814,222 ↓ · 5 ♡

wav2vec2-base-vi-vlsp2020

Built for speech-to-text transcription, wav2vec2-base-vi-vlsp2020 is a wav2vec2-based model with publicly available weights. Distribution of wav2vec2-base-vi-vlsp2020 is under CC BY-NC 4.0, which is worth reading before you ship. wav2vec2-base-vi-vlsp2020 ships without a hosted SLA, so budget for self-managed deployment and monitoring.

1,784,051 ↓ · 2 ♡

nb-wav2vec2-1b-bokmaal-v2

As a wav2vec2-based mid-sized model, nb-wav2vec2-1b-bokmaal-v2 focuses on speech-to-text transcription. Weighing in near 1B parameters, nb-wav2vec2-1b-bokmaal-v2 trades some ceiling for cheaper, faster inference. The Apache 2.0 license keeps nb-wav2vec2-1b-bokmaal-v2 unrestricted for commercial reuse. nb-wav2vec2-1b-bokmaal-v2 ships without a hosted SLA, so budget for self-managed deployment and monitoring.

1,644,793 ↓ · 0 ♡

wav2vec2-large-xlsr-53-chinese-zh-cn

Built for speech-to-text transcription, wav2vec2-large-xlsr-53-chinese-zh-cn is a wav2vec2-based model with publicly available weights. wav2vec2-large-xlsr-53-chinese-zh-cn is Apache 2.0-licensed, clearing it for closed-source and paid products. wav2vec2-large-xlsr-53-chinese-zh-cn ships without a hosted SLA, so budget for self-managed deployment and monitoring.

1,553,703 ↓ · 134 ♡

Qwen3-ASR-1.7B

Qwen3-ASR 1.7B is Alibaba's 1.7B parameter automatic speech recognition model supporting multiple languages. It is designed as a production-grade ASR model with strong multilingual performance at a compact size.

1,543,783 ↓ · 905 ♡

wav2vec2-xls-r-300m-ftspeech

wav2vec2-xls-r-300m-ftspeech targets speech-to-text transcription and is shipped as a compact, self-hostable checkpoint. wav2vec2-xls-r-300m-ftspeech's 300M-parameter size keeps hosting requirements modest relative to frontier models. Licensing for wav2vec2-xls-r-300m-ftspeech is unspecified or custom — clear it before commercial use. wav2vec2-xls-r-300m-ftspeech is community-maintained, so track upstream changes and pin a known-good revision.

1,467,007 ↓ · 0 ♡

wav2vec2-xls-r-300m-bengali

wav2vec2-xls-r-300m-bengali targets speech-to-text transcription and is shipped as a compact, self-hostable checkpoint. wav2vec2-xls-r-300m-bengali's 300M-parameter size keeps hosting requirements modest relative to frontier models. Permissive Apache 2.0 terms let wav2vec2-xls-r-300m-bengali go straight into commercial pipelines. Evaluate wav2vec2-xls-r-300m-bengali on your own data before trusting it in production.

1,452,198 ↓ · 10 ♡

faster-whisper-base

faster-whisper-base is an openly licensed speech-to-text transcription model in the whisper family. faster-whisper-base is MIT-licensed, clearing it for closed-source and paid products. faster-whisper-base is multilingual by design rather than English-only. Treat faster-whisper-base's published metrics as a starting point and validate against your workload.

1,450,748 ↓ · 30 ♡

w2v-xls-r-uk

w2v-xls-r-uk is an open-weight checkpoint for speech-to-text transcription, distributed on the HuggingFace Hub. The Apache 2.0 license keeps w2v-xls-r-uk unrestricted for commercial reuse. It is a fine-tune of wav2vec2-xls-r-300m, inheriting that base model's general competence. Like most open checkpoints, w2v-xls-r-uk rewards a quick in-domain eval before commitment.

1,431,997 ↓ · 8 ♡

wav2vec2-lv-60-espeak-cv-ft

A Wav2Vec2 Large checkpoint pretrained on LV-60 (roughly 60,000 hours of unlabeled LibriVox audio) and fine-tuned on multilingual Common Voice for phoneme recognition using eSpeak phoneme labels. Produces phoneme-level output rather than word transcription, making it useful for phonetics research and pronunciation assessment rather than standard ASR.

1,349,007 ↓ · 69 ♡

wav2vec2-xls-r-parlaspeech-hr

wav2vec2-xls-r-parlaspeech-hr is an open-weight speech-to-text transcription model in the wav2vec2 family. Evaluate wav2vec2-xls-r-parlaspeech-hr on your own data before trusting it in production.

1,336,731 ↓ · 3 ♡

wav2vec2-base-960h

wav2vec2-base-960h targets speech-to-text transcription and is shipped as an open-weight, self-hostable checkpoint. Permissive Apache 2.0 terms let wav2vec2-base-960h go straight into commercial pipelines. Like most open checkpoints, wav2vec2-base-960h rewards a quick in-domain eval before commitment.

1,323,309 ↓ · 398 ♡

wav2vec2-xlsr-nepali

wav2vec2-xlsr-nepali is an openly licensed speech-to-text transcription model in the wav2vec2 family. wav2vec2-xlsr-nepali is Apache 2.0-licensed, clearing it for closed-source and paid products. wav2vec2-xlsr-nepali is community-maintained, so track upstream changes and pin a known-good revision.

1,321,614 ↓ · 8 ♡

nb-wav2vec2-1b-nynorsk

NB-Wav2Vec2 1B for Nynorsk is the Norwegian National Library's 1B-parameter wav2vec2 model fine-tuned for automatic speech recognition in Nynorsk (New Norwegian). One of very few dedicated Nynorsk ASR models publicly available.

1,314,590 ↓ · 0 ♡

parakeet-tdt-0.6b-v3

Built for speech-to-text transcription, parakeet-tdt-0.6b-v3 is a model with publicly available weights. Its listed base model is parakeet-tdt-0.6b-v3 itself, which suggests a converted or re-hosted copy of the upstream checkpoint rather than a new fine-tune; confirm provenance on the card. At about 600M parameters, parakeet-tdt-0.6b-v3 sits in the compact tier, which sets its memory and latency budget. Read parakeet-tdt-0.6b-v3's card for hardware requirements and licensing fine print before deploying.

1,262,884 ↓ · 46 ♡

faster-whisper-small

faster-whisper-small is an open-weight checkpoint for speech-to-text transcription, distributed on the HuggingFace Hub. faster-whisper-small is multilingual by design rather than English-only. The MIT license keeps faster-whisper-small unrestricted for commercial reuse. Like most open checkpoints, faster-whisper-small rewards a quick in-domain eval before commitment.

1,212,698 ↓ · 36 ♡

wav2vec2-large-xlsr-mvc-swahili

wav2vec2-large-xlsr-mvc-swahili targets speech-to-text transcription and is shipped as an open-weight, self-hostable checkpoint. Permissive Apache 2.0 terms let wav2vec2-large-xlsr-mvc-swahili go straight into commercial pipelines. It is a fine-tune of wav2vec2-large-xlsr-53, inheriting that base model's general competence. Evaluate wav2vec2-large-xlsr-mvc-swahili on your own data before trusting it in production.

1,195,851 ↓ · 3 ♡

wav2vec2-large-xlsr-malayalam

wav2vec2-large-xlsr-malayalam is an openly licensed speech-to-text transcription model in the wav2vec2 family. wav2vec2-large-xlsr-malayalam is Apache 2.0-licensed, clearing it for closed-source and paid products. Like most open checkpoints, wav2vec2-large-xlsr-malayalam rewards a quick in-domain eval before commitment.

1,186,824 ↓ · 7 ♡

faster-whisper-large-v3

faster-whisper-large-v3 is an open-weight checkpoint for speech-to-text transcription, distributed on the HuggingFace Hub. The MIT license keeps faster-whisper-large-v3 unrestricted for commercial reuse. faster-whisper-large-v3 is multilingual by design rather than English-only. Like most open checkpoints, faster-whisper-large-v3 rewards a quick in-domain eval before commitment.

1,144,890 ↓ · 609 ♡

wav2vec2-large-xlsr-catala

wav2vec2-large-xlsr-catala is a wav2vec2-based open-weight model aimed at speech-to-text transcription. Permissive Apache 2.0 terms let wav2vec2-large-xlsr-catala go straight into commercial pipelines. Before relying on wav2vec2-large-xlsr-catala, reproduce its key numbers on representative inputs.

1,131,965 ↓ · 1 ♡

parakeet-tdt-0.6b-v2

MLX-format conversion of NVIDIA's Parakeet-TDT 0.6B ASR model, optimized for on-device inference on Apple Silicon. Parakeet-TDT is a FastConformer-based model trained on 64k hours of English audio and achieves competitive WER on LibriSpeech.

1,130,861 ↓ · 43 ♡

faster-whisper-tiny.en

Built for speech-to-text transcription, faster-whisper-tiny.en is a whisper-based model with publicly available weights. faster-whisper-tiny.en is MIT-licensed, clearing it for closed-source and paid products. Check the faster-whisper-tiny.en model card for benchmarks and intended use before adopting it.

1,008,580 ↓ · 10 ♡

wav2vec2-large-xlsr-53-estonian

wav2vec2-large-xlsr-53-estonian targets speech-to-text transcription and is shipped as an open-weight, self-hostable checkpoint. Permissive Apache 2.0 terms let wav2vec2-large-xlsr-53-estonian go straight into commercial pipelines. Treat wav2vec2-large-xlsr-53-estonian's published metrics as a starting point and validate against your workload.

989,124 ↓ · 1 ♡

wav2vec2-large-xlsr-lithuanian

wav2vec2-large-xlsr-lithuanian targets speech-to-text transcription and is shipped as an open-weight, self-hostable checkpoint. Permissive Apache 2.0 terms let wav2vec2-large-xlsr-lithuanian go straight into commercial pipelines. wav2vec2-large-xlsr-lithuanian is community-maintained, so track upstream changes and pin a known-good revision.

947,714 ↓ · 2 ♡

parakeet-ctc-1.1b

Built for speech-to-text transcription, parakeet-ctc-1.1b is a model with publicly available weights. At about 1.1B parameters, parakeet-ctc-1.1b sits in the mid-sized tier, which sets its memory and latency budget. Distribution of parakeet-ctc-1.1b is under CC BY 4.0, which is worth reading before you ship. Check the parakeet-ctc-1.1b model card for benchmarks and intended use before adopting it.

938,969 ↓ · 50 ♡

Qwen3-ASR-0.6B

Qwen3-ASR-0.6B is a compact checkpoint for speech-to-text transcription, distributed on the HuggingFace Hub. The Apache 2.0 license keeps Qwen3-ASR-0.6B unrestricted for commercial reuse. Weighing in near 600M parameters, Qwen3-ASR-0.6B trades some ceiling for cheaper, faster inference. Qwen3-ASR-0.6B is community-maintained, so track upstream changes and pin a known-good revision.

901,763 ↓ · 310 ♡

wav2vec2-xls-r-300m-sk-cv8

Built for speech-to-text transcription, wav2vec2-xls-r-300m-sk-cv8 is a wav2vec2-based model with publicly available weights. wav2vec2-xls-r-300m-sk-cv8 is Apache 2.0-licensed, clearing it for closed-source and paid products. At about 300M parameters, wav2vec2-xls-r-300m-sk-cv8 sits in the compact tier, which sets its memory and latency budget. wav2vec2-xls-r-300m-sk-cv8 ships without a hosted SLA, so budget for self-managed deployment and monitoring.

893,426 ↓ · 0 ♡

vakyansh-wav2vec2-sanskrit-sam-60

As a wav2vec2-based open-weight model, vakyansh-wav2vec2-sanskrit-sam-60 focuses on speech-to-text transcription. Read vakyansh-wav2vec2-sanskrit-sam-60's card for hardware requirements and licensing fine print before deploying.

889,392 ↓ · 4 ♡

faster-whisper-tiny

As a whisper-based open-weight model, faster-whisper-tiny focuses on speech-to-text transcription. Training spans multiple languages, so faster-whisper-tiny covers cross-lingual speech-to-text transcription from one checkpoint. The MIT license keeps faster-whisper-tiny unrestricted for commercial reuse. faster-whisper-tiny ships without a hosted SLA, so budget for self-managed deployment and monitoring.

880,271 ↓ · 23 ♡

wav2vec2-large-xlsr-53-basque

wav2vec2-large-xlsr-53-basque is an open-weight checkpoint for speech-to-text transcription, distributed on the HuggingFace Hub. The Apache 2.0 license keeps wav2vec2-large-xlsr-53-basque unrestricted for commercial reuse. Evaluate wav2vec2-large-xlsr-53-basque on your own data before trusting it in production.

866,478 ↓ · 1 ♡

wav2vec2-large-xls-r-300m-welsh

wav2vec2-large-xls-r-300m-welsh targets speech-to-text transcription and is shipped as a compact, self-hostable checkpoint. wav2vec2-large-xls-r-300m-welsh's 300M-parameter size keeps hosting requirements modest relative to frontier models. Permissive Apache 2.0 terms let wav2vec2-large-xls-r-300m-welsh go straight into commercial pipelines. Like most open checkpoints, wav2vec2-large-xls-r-300m-welsh rewards a quick in-domain eval before commitment.

849,567 ↓ · 0 ♡

wav2vec2-large-xlsr-korean

wav2vec2-large-xlsr-korean is an openly licensed speech-to-text transcription model in the wav2vec2 family. wav2vec2-large-xlsr-korean is Apache 2.0-licensed, clearing it for closed-source and paid products. Treat wav2vec2-large-xlsr-korean's published metrics as a starting point and validate against your workload.

829,941 ↓ · 56 ♡

wav2vec2-xlsr-khmer

wav2vec2-xlsr-khmer targets speech-to-text transcription and is shipped as an open-weight, self-hostable checkpoint. Permissive Apache 2.0 terms let wav2vec2-xlsr-khmer go straight into commercial pipelines. Like most open checkpoints, wav2vec2-xlsr-khmer rewards a quick in-domain eval before commitment.

824,405 ↓ · 2 ♡

speaker-diarization

speaker-diarization targets speaker diarization, labeling who spoke when, rather than transcription, and is shipped as an open-weight, self-hostable pipeline. Permissive MIT terms let speaker-diarization go straight into commercial pipelines. speaker-diarization is community-maintained, so track upstream changes and pin a known-good revision.

822,320 ↓ · 1,289 ♡

distil-large-v3

As a whisper-based open-weight model, distil-large-v3 focuses on speech-to-text transcription. The MIT license keeps distil-large-v3 unrestricted for commercial reuse. Before relying on distil-large-v3, reproduce its key numbers on representative inputs.

763,344 ↓ · 376 ♡

cohere-transcribe-03-2026

As an open-weight model, cohere-transcribe-03-2026 focuses on speech-to-text transcription. The Apache 2.0 license keeps cohere-transcribe-03-2026 unrestricted for commercial reuse. Training spans multiple languages, so cohere-transcribe-03-2026 covers cross-lingual speech-to-text transcription from one checkpoint. Before relying on cohere-transcribe-03-2026, reproduce its key numbers on representative inputs.

750,397 ↓ · 1,019 ♡

VibeVoice-ASR

Built for speech-to-text transcription, VibeVoice-ASR is a model with publicly available weights. Training spans multiple languages, so VibeVoice-ASR covers cross-lingual speech-to-text transcription from one checkpoint. VibeVoice-ASR is MIT-licensed, clearing it for closed-source and paid products. Check the VibeVoice-ASR model card for benchmarks and intended use before adopting it.

732,630 ↓ · 1,193 ♡

wav2vec2-large-xls-r-300m-bg-d2

wav2vec2-large-xls-r-300m-bg-d2 is a compact checkpoint for speech-to-text transcription, distributed on the HuggingFace Hub. Weighing in near 300M parameters, wav2vec2-large-xls-r-300m-bg-d2 trades some ceiling for cheaper, faster inference. The Apache 2.0 license keeps wav2vec2-large-xls-r-300m-bg-d2 unrestricted for commercial reuse. Evaluate wav2vec2-large-xls-r-300m-bg-d2 on your own data before trusting it in production.

708,191 ↓ · 1 ♡

wav2vec2-xls-r-300m-cv7-turkish

Wav2Vec2 XLS-R 300M fine-tuned on Mozilla Common Voice 7 Turkish data for Turkish automatic speech recognition. XLS-R is Meta's cross-lingual speech representation model; this checkpoint adapts it to Turkish via CTC fine-tuning. CC-BY-4.0 licensed.

697,569 ↓ · 15 ♡

wav2vec2-large-xlsr-53-punjabi

wav2vec2-large-xlsr-53-punjabi is an open-weight checkpoint for speech-to-text transcription, distributed on the HuggingFace Hub. It is a fine-tune of vakyansh-wav2vec2-punjabi-pam-10, inheriting that base model's general competence. The Apache 2.0 license keeps wav2vec2-large-xlsr-53-punjabi unrestricted for commercial reuse. Treat wav2vec2-large-xlsr-53-punjabi's published metrics as a starting point and validate against your workload.

693,961 ↓ · 4 ♡

wav2vec2-large-xlsr-53-slovenian

This is a fine-tuned Slovenian ASR model built on Facebook's wav2vec2-large-xlsr-53, adapted during the XLSR Fine-Tuning Week community event using Mozilla Common Voice Slovenian data. The XLSR-53 base was pre-trained on 53 languages, and this checkpoint adds a CTC head tuned specifically for Slovenian transcription. Both PyTorch and JAX weight formats are provided.

666,711 ↓ · 0 ♡

wav2vec2-large-xlsr-marathi

As a wav2vec2-based open-weight model, wav2vec2-large-xlsr-marathi focuses on speech-to-text transcription. The Apache 2.0 license keeps wav2vec2-large-xlsr-marathi unrestricted for commercial reuse. The weights start from wav2vec2-large-xlsr-53 and specialize it for the target task. Read wav2vec2-large-xlsr-marathi's card for hardware requirements and licensing fine print before deploying.

642,726 ↓ · 2 ♡

wav2vec2-large-xls-r-300m-sinhala-low-LR-part1

As a wav2vec2-based compact model, wav2vec2-large-xls-r-300m-sinhala-low-LR-part1 focuses on speech-to-text transcription. Weighing in near 300M parameters, wav2vec2-large-xls-r-300m-sinhala-low-LR-part1 trades some ceiling for cheaper, faster inference. wav2vec2-large-xls-r-300m-sinhala-low-LR-part1 ships without a hosted SLA, so budget for self-managed deployment and monitoring.

634,144 ↓ · 0 ♡

wav2vec2-large-xlsr-kn

wav2vec2-large-xlsr-kn is a fine-tuned variant of Facebook's XLSR-53 large wav2vec2 model, adapted for Kannada (kn) automatic speech recognition using the OpenSLR dataset. It was produced during the XLSR fine-tuning week community event and supports both PyTorch and JAX backends. The model is Apache-2.0 licensed.

589,795 ↓ · 1 ♡

wav2vec2-BERT-cantonese

wav2vec2-BERT-cantonese is a wav2vec2-BERT model fine-tuned for Cantonese automatic speech recognition using Mozilla Common Voice 16.0 data. It targets the distinct phonology of Cantonese Chinese (zh) rather than Mandarin, addressing a gap in widely available open ASR systems. The model is Apache-2.0 licensed.

550,063 ↓ · 6 ♡

hubert-large-ls960-ft

HuBERT-Large fine-tuned on LibriSpeech 960h for English automatic speech recognition. HuBERT uses offline clustering of audio features as pseudo-labels during pretraining, achieving strong ASR quality. Apache-2.0 licensed, it's a foundational ASR model from Meta.

537,788 ↓ · 76 ♡

whisper-bemba-stt

A Whisper-based automatic speech recognition model fine-tuned for Bemba, a Bantu language spoken primarily in Zambia. The fine-tune adapts OpenAI's Whisper architecture to Bemba phonology and vocabulary, a language with very limited prior ASR coverage. Evaluation data and training details are sparse, so users should benchmark on their own domain audio before production use.

519,625 ↓ · 0 ♡

wav2vec2-large-xlsr-kazakh

wav2vec2-large-xlsr-kazakh is an automatic speech recognition model fine-tuned from Facebook's wav2vec2-large-xlsr-53 on the Kazakh Speech Corpus. It extends cross-lingual speech representation to Kazakh, a low-resource Turkic language with limited existing ASR tooling.

509,739 ↓ · 19 ♡

Phi-4-multimodal-instruct

Phi-4-Multimodal-Instruct is Microsoft's compact multimodal model handling text, audio, and images in a single instruction-tuned model. Based on Phi-4-Mini, it covers 23 languages and supports speech recognition, speech translation, and visual QA. MIT-licensed, so fully permissive for commercial use.

509,720 ↓ · 1,606 ♡

granite-speech-3.3-2b

Granite Speech 3.3-2B is IBM's 2B ASR model supporting 5 languages (English, French, German, Spanish, Portuguese), using a Granite encoder-decoder architecture. It's positioned for multilingual transcription in enterprise settings. Apache-2.0 licensed with evaluation results published.

503,631 ↓ · 55 ♡

wav2vec2-large-xlsr-galician

wav2vec2-large-xlsr-galician fine-tunes Facebook's XLSR-53 multilingual wav2vec2 model on Galician speech data, enabling automatic speech recognition for a low-resource Iberian language. The model follows the cross-lingual speech representation learning approach that transfers acoustic representations across languages. It is deployable via the Transformers library with PyTorch and supports Azure-hosted inference endpoints.

499,301 ↓ · 2 ♡

wav2vec2-large-xlsr-latvian-cv

wav2vec2-large-xlsr-latvian-cv fine-tunes Facebook's wav2vec2-large-xlsr-53 on Latvian Common Voice data, producing an ASR model for a morphologically rich Baltic language with limited speech resource availability. Training followed the XLSR fine-tuning week methodology. The model carries an Apache 2.0 license and weights are available in both PyTorch and JAX/Flax formats.

494,647 ↓ · 3 ♡

wav2vec2-xls-r-300m-pashto

A fine-tune of Facebook's wav2vec2-xls-r-300m base model, adapted for Pashto automatic speech recognition using the FLEURS dataset. The model inherits the 300M-parameter cross-lingual architecture and is released under Apache 2.0, which permits use in commercial pipelines. It addresses a significant gap in ASR tooling for low-resource South-Central Asian languages.

472,903 ↓ · 0 ♡

Qwen3-ForcedAligner-0.6B

Qwen3-ForcedAligner-0.6B is a forced alignment model from the Qwen3 ASR family, designed to align audio segments to text transcripts at the phoneme or word level. At 0.6B parameters it's compact for deployment in audio processing pipelines. Apache-2.0 licensed.

467,323 ↓ · 145 ♡

seamless-m4t-v2-large

SeamlessM4T-v2-large is Meta's second-generation unified model for speech and text translation, supporting automatic speech recognition, speech-to-text translation, text-to-speech, and speech-to-speech across roughly 100 languages. The architecture consolidates multiple translation pipelines into a single model, as detailed in arxiv:2312.05187. With nearly 1,000 likes and high download volume, it is one of the more widely validated multilingual translation models available openly.

459,007 ↓ · 989 ♡

wav2vec2-large-xlsr-53-german

wav2vec2-large-xlsr-53-german is a German ASR model fine-tuned from Facebook's XLSR-53 large checkpoint on Mozilla Common Voice 6.0. It was contributed during Hugging Face's XLSR fine-tuning week and is listed on the hf-asr-leaderboard, providing a traceable benchmark reference. The Apache 2.0 license and Azure deployment tag make it accessible for both research and production German transcription pipelines.

457,416 ↓ · 8 ♡

reverb-diarization-v1

reverb-diarization-v1 is an open-weight speaker diarization model: it labels who spoke when rather than transcribing speech. Its license is unspecified or custom, so clear it before commercial use, and evaluate the model on your own data before trusting it in production.

448,543 ↓ · 13 ♡

wav2vec2-large-xlsr-georgian

wav2vec2-large-xlsr-georgian is a fine-tuned checkpoint of Facebook's XLSR-53 wav2vec2-large model, adapted for Georgian (ka) automatic speech recognition using Mozilla Common Voice data during the XLSR Fine-Tuning Week. It is among the few publicly available ASR models targeting the Georgian language. The model is released under Apache 2.0.

447,542 ↓ · 1 ♡

faster-whisper-medium

Faster-Whisper is SYSTRAN's CTranslate2-optimized conversion of OpenAI Whisper, enabling up to 4× faster inference with lower memory use. The medium variant (769M parameters) balances multilingual ASR accuracy with throughput.

445,041 ↓ · 53 ♡

wav2vec2-large-xls-r-300m-albanian-colab

This model fine-tunes Facebook's wav2vec2-xls-r-300m on Albanian speech from the Common Voice dataset, targeting automatic speech recognition for a low-resource language. It was trained in a Colab environment, indicating limited compute, and serves as a community-contributed baseline for Albanian ASR. The Apache 2.0 license permits open commercial use.

431,735 ↓ · 1 ♡

wav2vec2-large-xls-r-300m-armenian

Fine-tuned from Facebook's wav2vec2-xls-r-300m on the Mozilla Common Voice 7.0 Armenian dataset, this model provides automatic speech recognition for Armenian. It was submitted to the Hugging Face Robust Speech Event and appears on the ASR leaderboard, providing at least one externally validated evaluation context. The Apache 2.0 license allows commercial use.

428,817 ↓ · 0 ♡

whisper-medium

whisper-medium is an openly licensed, multilingual speech-to-text transcription model in the Whisper family. Its Apache 2.0 license clears it for closed-source and paid products. This checkpoint is community-maintained, so track upstream changes and pin a known-good revision.

423,057 ↓ · 286 ♡

wav2vec2-large-xlsr-gu

Fine-tuned on Gujarati speech data from OpenSLR during the XLSR fine-tuning week, this model adapts the multilingual wav2vec2-large-xlsr-53 checkpoint for Gujarati ASR. It supports both PyTorch and JAX inference via the Transformers library. Gujarati is a low-resource language, making this one of few publicly available ASR checkpoints for it.

420,946 ↓ · 0 ♡

wav2vec2-large-xlsr-53-icelandic-ep30-967h

Trained for 30 epochs on approximately 967 hours of Icelandic speech from the Samromur Milljon corpus, this model fine-tunes the XLSR-53 base for Icelandic ASR. Produced by the Language and Voice Lab, it represents one of the largest training runs publicly available for Icelandic speech recognition. The CC-BY-4.0 license permits broad reuse with attribution.

413,813 ↓ · 3 ♡

granite-speech-4.1-2b

Granite Speech 4.1 2B is IBM's compact speech-language model combining an ASR encoder with a 2B language model decoder. It handles transcription and speech-grounded question answering within a single architecture, targeting enterprise speech analytics use cases.

405,752 ↓ · 145 ♡

parakeet-tdt-0.6b-v2

Parakeet-TDT-0.6B-v2 is NVIDIA's 600M-parameter English ASR model built on the FastConformer architecture with a Token-and-Duration Transducer (TDT) decoder. It was trained on the Granary and NeMo ASR 3.0 datasets and is listed on the Hugging Face ASR leaderboard. The NeMo framework is required for inference, which differs from the standard Transformers pipeline.

392,641 ↓ · 1,508 ♡

wav2vec2-cv-be

wav2vec2-cv-be is a wav2vec2-based open-weight model aimed at speech-to-text transcription. Because wav2vec2-cv-be uses GPL-3.0, vet the conditions against your deployment plan. Check the wav2vec2-cv-be model card for benchmarks and intended use before adopting it.

371,502 ↓ · 1 ♡

whisper-tiny

whisper-tiny targets speech-to-text transcription and is shipped as an open-weight, self-hostable checkpoint. whisper-tiny is community-maintained, so track upstream changes and pin a known-good revision.

371,095 ↓ · 0 ♡

whisper-small-cantonese

Built for speech-to-text transcription, whisper-small-cantonese is a whisper-based model with publicly available weights. The weights start from whisper-small and specialize it for the target task. whisper-small-cantonese is Apache 2.0-licensed, clearing it for closed-source and paid products. Check the whisper-small-cantonese model card for benchmarks and intended use before adopting it.

367,501 ↓ · 118 ♡

parakeetkit-pro

Parakeetkit-Pro is Argmax's optimised packaging of NVIDIA's Parakeet ASR model in CoreML format for Apple Silicon, distributed via the WhisperKit framework. It delivers high-accuracy English transcription on-device with Metal acceleration, positioning itself as a pro-tier local ASR option for macOS applications. The Parakeet architecture is a FastConformer model from NVIDIA trained on 64k+ hours of English speech.

359,906 ↓ · 4 ♡

wav2vec2-xls-r-juznevesti-sr

wav2vec2-xls-r-juznevesti-sr is an open-weight checkpoint for speech-to-text transcription, distributed on the HuggingFace Hub. wav2vec2-xls-r-juznevesti-sr is community-maintained, so track upstream changes and pin a known-good revision.

352,504 ↓ · 1 ♡

wav2vec2-large-mms-1b-azerbaijani-common_voice15.0

wav2vec2-large-mms-1b-azerbaijani-common_voice15.0 is a wav2vec2-based speech-to-text transcription model of roughly 1B parameters. It is distributed under CC BY-NC 4.0, which restricts commercial use, so confirm licensing before deploying it in a product. Read the model card for hardware requirements and licensing fine print.

344,633 ↓ · 5 ♡

wav2vec2-conformer-rope-large-960h-ft

wav2vec2-conformer-rope-large-960h-ft targets speech-to-text transcription and is shipped as an open-weight, self-hostable checkpoint. Permissive Apache 2.0 terms let wav2vec2-conformer-rope-large-960h-ft go straight into commercial pipelines. wav2vec2-conformer-rope-large-960h-ft is community-maintained, so track upstream changes and pin a known-good revision.

341,356 ↓ · 10 ♡

wav2vec2-xlsr-53-espeak-cv-ft

Wav2Vec2 XLSR-53 fine-tuned on Common Voice for 53-language phoneme recognition using eSpeak labels, producing phoneme sequences rather than word transcriptions. Useful for linguistic and phonetics applications requiring language-agnostic phoneme extraction. Apache-2.0 licensed.

337,484 ↓ · 49 ♡

speaker-diarization-3.0

pyannote/speaker-diarization-3.0 is the third major release of the popular pyannote audio diarization pipeline, combining a speaker segmentation model with a speaker embedding model for 'who spoke when' labeling of audio recordings.

322,363 ↓ · 218 ♡
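
A diarization pipeline of this kind yields timestamped speaker turns, which are often combined with word timestamps from a separate ASR model. The sketch below shows one simple way to do that by assigning each word to the turn it overlaps most. The tuple layouts, the function name, and the "unknown" fallback are hypothetical and are not pyannote's API.

```python
from typing import List, Tuple

Turn = Tuple[float, float, str]  # (start_sec, end_sec, speaker_label)
Word = Tuple[float, float, str]  # (start_sec, end_sec, text)


def label_words(turns: List[Turn], words: List[Word]) -> List[Tuple[str, str]]:
    """Pair each word with the speaker whose turn overlaps it the most."""
    labelled = []
    for w_start, w_end, text in words:
        best_speaker, best_overlap = "unknown", 0.0
        for t_start, t_end, speaker in turns:
            # Length of the intersection of the two intervals (negative if disjoint).
            overlap = min(w_end, t_end) - max(w_start, t_start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        labelled.append((best_speaker, text))
    return labelled


# Hypothetical diarization turns and ASR word timestamps.
turns = [(0.0, 2.0, "SPEAKER_00"), (2.0, 4.5, "SPEAKER_01")]
words = [(0.2, 0.6, "hello"), (1.8, 2.3, "there"), (3.0, 3.4, "hi")]
result = label_words(turns, words)
```

Words that straddle a speaker change go to whichever turn covers more of them; words that fall outside every turn keep the "unknown" label so they can be reviewed rather than silently misattributed.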

parakeet-tdt-0.6b-v3

Built for speech-to-text transcription, parakeet-tdt-0.6b-v3 is a model with publicly available weights. Distribution of parakeet-tdt-0.6b-v3 is under CC BY 4.0, which is worth reading before you ship. Training spans multiple languages, so parakeet-tdt-0.6b-v3 covers cross-lingual speech-to-text transcription from one checkpoint. Before relying on parakeet-tdt-0.6b-v3, reproduce its key numbers on representative inputs.

317,246 ↓ · 855 ♡

speakerkit-pro

speakerkit-pro is a whisper-based open-weight model aimed at speech-to-text transcription. speakerkit-pro lists a non-standard license, so confirm permissions before deployment. speakerkit-pro ships without a hosted SLA, so budget for self-managed deployment and monitoring.

314,854 ↓ · 20 ♡

parakeet-tdt_ctc-110m

An MLX-format conversion of NVIDIA's Parakeet TDT-CTC 110M, an English ASR model built on the FastConformer architecture and trained with the NeMo framework. The MLX conversion enables native Apple Silicon inference. The model is a hybrid that pairs a Token-and-Duration Transducer (TDT) decoder with a CTC decoder, and it supports fast greedy decoding without beam search overhead.

310,064 ↓ · 1 ♡
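
The greedy CTC decoding mentioned in this entry can be sketched in a few lines: take the best-scoring token per frame, collapse consecutive repeats, then drop blanks. This is a minimal illustration under stated assumptions, not NeMo's or MLX's code; the toy vocabulary, the per-frame scores, and the choice of index 0 as the blank token are hypothetical, and the TDT decoder is not covered.

```python
from itertools import groupby
from typing import Sequence


def ctc_greedy_decode(
    frame_scores: Sequence[Sequence[float]],
    vocab: Sequence[str],
    blank_id: int = 0,
) -> str:
    """Greedy CTC decoding: per-frame argmax, collapse repeats, drop blanks."""
    best_ids = [
        max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores
    ]
    # groupby merges runs of identical consecutive token ids into one.
    collapsed = [token_id for token_id, _ in groupby(best_ids)]
    return "".join(vocab[i] for i in collapsed if i != blank_id)


vocab = ["<blank>", "c", "a", "t"]
frame_scores = [
    [0.1, 0.7, 0.1, 0.1],    # c
    [0.2, 0.6, 0.1, 0.1],    # c again (collapsed into the previous frame)
    [0.8, 0.1, 0.05, 0.05],  # blank
    [0.1, 0.1, 0.7, 0.1],    # a
    [0.1, 0.1, 0.1, 0.7],    # t
]
decoded = ctc_greedy_decode(frame_scores, vocab)
```

Because each frame is decided independently, the cost is linear in the number of frames, which is where the speed advantage over beam search comes from. A blank between two identical tokens keeps them distinct, so doubled letters survive the collapse step.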

whisper-large-v3-turbo-german

whisper-large-v3-turbo-german is a whisper-based open-weight model aimed at speech-to-text transcription. Permissive Apache 2.0 terms let whisper-large-v3-turbo-german go straight into commercial pipelines. The weights start from whisper-large-v3-german and specialize it for the target task. whisper-large-v3-turbo-german ships without a hosted SLA, so budget for self-managed deployment and monitoring.

305,152 ↓ · 57 ♡

mms-1b-all

mms-1b-all is a wav2vec2-based open-weight model aimed at speech-to-text transcription. Training spans multiple languages, so one checkpoint covers cross-lingual transcription. Its 1B-parameter size keeps hosting requirements modest relative to frontier models. Check the model card for benchmarks and intended use before adopting it.

303,978 ↓ · 199 ♡

parakeet-tdt-0.6b-v3-coreml

parakeet-tdt-0.6b-v3-coreml is a compact checkpoint for speech-to-text transcription, distributed on the HuggingFace Hub. Weighing in near 600M parameters, parakeet-tdt-0.6b-v3-coreml trades some ceiling for cheaper, faster inference. parakeet-tdt-0.6b-v3-coreml is multilingual by design rather than English-only. Evaluate parakeet-tdt-0.6b-v3-coreml on your own data before trusting it in production.

302,950 ↓ · 42 ♡

canary-1b-flash

canary-1b-flash is a mid-sized checkpoint for speech-to-text transcription, distributed on the HuggingFace Hub. canary-1b-flash is subject to CC BY 4.0 terms, so confirm licensing before commercial use. canary-1b-flash is multilingual by design rather than English-only. Evaluate canary-1b-flash on your own data before trusting it in production.

299,525 ↓ · 272 ♡

overlapped-speech-detection

overlapped-speech-detection is an open-weight model for detecting regions of audio where two or more speakers talk at once, rather than a speech-to-text transcriber. It is MIT-licensed, clearing it for closed-source and paid products. Check the model card for benchmarks and intended use before adopting it.

294,754 ↓ · 56 ♡

T-one

T-one is an open-weight checkpoint for speech-to-text transcription, distributed on the HuggingFace Hub. The Apache 2.0 license keeps T-one unrestricted for commercial reuse. Like most open checkpoints, T-one rewards a quick in-domain eval before commitment.

293,020 ↓ · 90 ♡