audio classification models

14 models · ranked by HuggingFace downloads

clap-htsat-fused

clap-htsat-fused is LAION's CLAP (Contrastive Language-Audio Pretraining) model. It pairs the HTSAT (Hierarchical Token-Semantic Audio Transformer) audio encoder with a text encoder and trains them contrastively to align audio and text in a shared embedding space; the "fused" variant adds feature fusion to handle variable-length audio. Analogous to CLIP for images, it enables zero-shot audio classification and retrieval from natural-language descriptions, without task-specific labeled audio data.

15,828,115 ↓ · 107 ♡
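
The zero-shot classification that clap-htsat-fused enables comes down to comparing one audio embedding against the embeddings of candidate text prompts. The sketch below shows only that scoring step in plain Python. The 3-dimensional vectors, the prompt strings, and the temperature of 0.07 are made-up illustration values, not outputs or settings of the real model, which produces much larger embeddings and uses its own learned scale.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_scores(audio_emb, text_embs, temperature=0.07):
    """Softmax over temperature-scaled cosine similarities.

    `temperature` is a hypothetical value chosen for illustration.
    """
    sims = {label: cosine(audio_emb, emb) / temperature
            for label, emb in text_embs.items()}
    peak = max(sims.values())  # subtract the max for numerical stability
    exps = {label: math.exp(s - peak) for label, s in sims.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


# Toy embeddings standing in for real audio and text encoder outputs.
audio_embedding = [0.9, 0.1, 0.2]
prompt_embeddings = {
    "a dog barking": [1.0, 0.0, 0.1],
    "rain falling": [0.0, 1.0, 0.2],
    "a car engine": [0.1, 0.2, 1.0],
}

scores = zero_shot_scores(audio_embedding, prompt_embeddings)
best_label = max(scores, key=scores.get)
print(best_label)
```

Because the labels are just text prompts, changing the label set means changing the prompt list; no retraining is involved in this step.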

wav2vec2-large-robust-24-ft-age-gender

wav2vec2-large-robust-24-ft-age-gender is a wav2vec2-large-robust model fine-tuned to estimate a speaker's age and gender from raw speech audio.

663,262 ↓ · 55 ♡

wav2vec2-large-robust-12-ft-emotion-msp-dim

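wav2vec2-large-robust-12-ft-emotion-msp-dim is a wav2vec2-large-robust model, pruned to 12 transformer layers and fine-tuned on MSP-Podcast for dimensional speech emotion recognition. Instead of discrete class labels, it predicts continuous arousal, dominance, and valence scores.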

648,603 ↓ · 170 ♡

wav2vec2-large-xlsr-53-gender-recognition-librispeech

wav2vec2-large-xlsr-53-gender-recognition-librispeech is a binary audio classifier fine-tuned from Facebook's wav2vec2-xls-r-300m to predict speaker gender. It was trained on LibriSpeech ASR data and adds a classification head on top of the XLSR cross-lingual speech representation backbone.

518,737 ↓ · 47 ♡

ast-finetuned-audioset-10-10-0.4593

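ast-finetuned-audioset-10-10-0.4593 is an Audio Spectrogram Transformer (AST) fine-tuned on AudioSet. It applies a Vision Transformer-style encoder to spectrogram patches and tags clips with AudioSet sound-event classes such as speech, music, or animal sounds.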

510,550 ↓ · 359 ♡
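
AudioSet clips can carry several sound-event labels at once, so the outputs of a tagger like ast-finetuned-audioset-10-10-0.4593 are often read as independent per-class probabilities and not as a single winner. The sketch below shows one common way to do that, a sigmoid per class followed by a threshold. The logit values, the four label names, and the 0.5 threshold are assumptions for illustration, not values taken from the model.

```python
import math


def sigmoid(x):
    """Map a raw logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def tag_events(logits, threshold=0.5):
    """Keep every label whose sigmoid probability meets the threshold.

    Returns (label, probability) pairs, most confident first.
    `threshold` is a hypothetical default; tune it per application.
    """
    probs = {label: sigmoid(z) for label, z in logits.items()}
    kept = [(label, p) for label, p in probs.items() if p >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Made-up logits for one clip; a real model emits one logit per class.
clip_logits = {"Speech": 2.0, "Music": 0.4, "Dog": -3.0, "Siren": -1.2}

tags = tag_events(clip_logits)
tag_names = [label for label, _ in tags]
print(tag_names)
```

Raising the threshold trades recall for precision; a clip with no label above the threshold simply returns an empty list.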

MERT-v1-330M

MERT-v1-330M is a 330M-parameter self-supervised music representation model from the m-a-p lab, pre-trained on large music audio corpora using masked acoustic modeling objectives. It is designed as a foundation model for music understanding tasks including genre classification, instrument recognition, and emotion tagging. The model is described in arXiv:2306.00107 and is licensed under CC BY-NC 4.0.

439,324 ↓ · 89 ♡

mms-lid-256

mms-lid-256 is Meta's Massively Multilingual Speech language identification model covering 256 languages, built on the wav2vec2 architecture and trained on the MMS dataset described in arXiv:2305.13516. It classifies spoken audio into one of 256 language classes and is evaluated on the FLEURS benchmark. The CC BY-NC 4.0 license restricts commercial use.

391,401 ↓ · 18 ♡

MuQ-large-msd-iter

MuQ-large-msd-iter is an open-weight, self-hostable checkpoint for audio classification. It is licensed under CC BY-NC 4.0, which restricts commercial use, so check the terms against your deployment plan. Treat the published metrics as a starting point and validate them on your own workload.

358,731 ↓ · 24 ♡

WeSpeaker-ResNet34-LM-MLX

An MLX conversion of WeSpeaker's ResNet34 speaker embedding model for Apple Silicon. WeSpeaker-ResNet34 generates d-vector speaker embeddings used for speaker verification and diarization tasks.

344,789 ↓ · 2 ♡
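
Speaker verification with embeddings such as those from WeSpeaker-ResNet34 usually reduces to comparing two vectors: one from an enrolled speaker and one from a new utterance. The sketch below shows that comparison with cosine similarity and a fixed threshold. The 4-dimensional vectors and the 0.6 threshold are hypothetical; real embeddings have far more dimensions, and the threshold is normally tuned on held-out trials.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embeddings."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def same_speaker(emb_a, emb_b, threshold=0.6):
    """Accept the pair as one speaker if similarity meets the threshold.

    `threshold` is an illustrative value, not a recommended setting.
    """
    return cosine_similarity(emb_a, emb_b) >= threshold


# Toy embeddings standing in for real model outputs.
enrolled = [0.2, 0.9, 0.1, 0.4]
probe_same = [0.25, 0.85, 0.05, 0.45]   # close to the enrolled vector
probe_other = [0.9, -0.1, 0.4, -0.2]    # points in a different direction

print(same_speaker(enrolled, probe_same))
print(same_speaker(enrolled, probe_other))
```

Diarization builds on the same similarity measure, clustering per-segment embeddings instead of comparing a single pair.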

emotion-recognition-wav2vec2-IEMOCAP

emotion-recognition-wav2vec2-IEMOCAP is a wav2vec2-based speech emotion recognition model fine-tuned on the IEMOCAP dataset. It assigns a discrete emotion label to each spoken utterance.

343,948 ↓ · 188 ♡

open-vakgyata

open-vakgyata is an open-weight, wav2vec2-based model for audio classification. Its training data spans multiple languages, so a single checkpoint covers cross-lingual audio classification. It is licensed under CC BY-NC 4.0, so check the terms against your deployment plan, and reproduce its key numbers on representative inputs before relying on it.

336,385 ↓ · 3 ♡

hubert-large-speech-emotion-recognition-russian-dusha-finetuned

hubert-large-speech-emotion-recognition-russian-dusha-finetuned is an open-weight HuBERT model for audio classification. It starts from the hubert-large-ls960-ft weights and, as the name indicates, is fine-tuned for Russian speech emotion recognition on the Dusha dataset. The Apache 2.0 license permits commercial reuse. Check the model card for benchmarks and intended use before adopting it.

327,156 ↓ · 15 ♡

wav2vec-vm-finetune

wav2vec-vm-finetune is a fine-tuned wav2vec checkpoint that maps audio waveforms to class labels. Models of this kind are trained on labeled audio for tasks such as language identification and speaker recognition.

322,931 ↓ · 12 ♡

music_genres_classification

music_genres_classification classifies music audio by genre, encoding spectral and temporal features of a clip to predict a genre label.

308,873 ↓ · 39 ♡