LAION's CLAP (Contrastive Language-Audio Pretraining) model pairs an HTSAT (Hierarchical Token-Semantic Audio Transformer) audio encoder with a text encoder to align audio and text in a shared embedding space. Analogous to CLIP for images, it enables zero-shot audio classification and retrieval from natural-language descriptions, without task-specific labeled audio data.
15,828,115 ↓ · 107 ♡
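Zero-shot classification in a shared audio-text embedding space, as CLAP enables, comes down to comparing one audio embedding against the embeddings of candidate text labels. The sketch below shows only that scoring step. The 3-dimensional vectors, the label strings, the `zero_shot_classify` helper, and the temperature value are all hypothetical placeholders; a real setup would obtain the embeddings from the model's audio and text encoders.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_classify(audio_emb, label_embs, temperature=0.1):
    """Return {label: probability} from a softmax over scaled cosine similarities.

    `temperature` is an illustrative assumption, not a value taken from CLAP.
    """
    labels = list(label_embs)
    logits = [cosine(audio_emb, label_embs[label]) / temperature for label in labels]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return {label: e / total for label, e in zip(labels, exps)}

# Hypothetical embeddings standing in for text-encoder outputs.
label_embs = {
    "a dog barking": [0.9, 0.1, 0.0],
    "rain falling": [0.0, 0.2, 0.9],
    "a person speaking": [0.1, 0.9, 0.1],
}
# Hypothetical embedding standing in for the audio-encoder output.
audio_emb = [0.8, 0.2, 0.1]

probs = zero_shot_classify(audio_emb, label_embs)
best = max(probs, key=probs.get)
print(best, round(probs[best], 3))
```

Because the labels are plain text, changing the label set requires no retraining, only new text embeddings.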
wav2vec2-large-robust-24-ft-age-gender is a wav2vec2-based speech model that, as its name indicates, is fine-tuned to predict speaker age and gender from the raw waveform.
663,262 ↓ · 55 ♡
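wav2vec2-large-robust-12-ft-emotion-msp-dim is a wav2vec2-based speech emotion model. As its name indicates, it is fine-tuned on MSP data for dimensional emotion recognition, predicting continuous emotion dimensions (arousal, dominance, and valence) rather than discrete class labels.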
648,603 ↓ · 170 ♡
wav2vec2-large-xlsr-53-gender-recognition-librispeech is a binary audio classifier fine-tuned from Facebook's wav2vec2-xls-r-300m to predict speaker gender. It was trained on LibriSpeech ASR data and adds a classification head on top of the XLSR cross-lingual speech representation backbone.
518,737 ↓ · 47 ♡
ast-finetuned-audioset-10-10-0.4593 is an Audio Spectrogram Transformer (AST) checkpoint fine-tuned on AudioSet, as its name indicates. It classifies audio spectrograms into sound-event categories.
510,550 ↓ · 359 ♡
MERT-v1-330M is a 330M-parameter self-supervised music representation model from the m-a-p lab, pre-trained on large music audio corpora using masked acoustic modeling objectives. It is designed as a foundation model for music understanding tasks including genre classification, instrument recognition, and emotion tagging. The model is described in arXiv:2306.00107 and is licensed under CC BY-NC 4.0.
439,324 ↓ · 89 ♡
mms-lid-256 is Meta's Massively Multilingual Speech language identification model covering 256 languages, built on the wav2vec2 architecture and trained on the MMS dataset described in arXiv:2305.13516. It classifies spoken audio into one of 256 language classes and is evaluated on the FLEURS benchmark. The CC-BY-NC 4.0 license restricts commercial use.
391,401 ↓ · 18 ♡
MuQ-large-msd-iter is an open-weight, self-hostable checkpoint for audio classification. It is released under CC BY-NC 4.0, which restricts commercial use, so check the license terms against your deployment plan. Treat its published metrics as a starting point and validate them on your own workload.
358,731 ↓ · 24 ♡
An MLX conversion of WeSpeaker's ResNet34 speaker embedding model for Apple Silicon. WeSpeaker-ResNet34 generates d-vector speaker embeddings used for speaker verification and diarization tasks.
344,789 ↓ · 2 ♡
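Speaker verification with embeddings of this kind usually reduces to scoring two embeddings against each other and applying a decision threshold. The sketch below illustrates that step with cosine scoring. The short vectors, the `same_speaker` helper, and the 0.5 threshold are hypothetical; a real system would use embeddings produced by the model and tune the threshold on held-out trials.

```python
import math

def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def same_speaker(emb_a, emb_b, threshold=0.5):
    """Return (decision, score) using cosine similarity of the two embeddings.

    The default threshold is an illustrative assumption, not a recommended value.
    """
    a, b = l2_normalize(emb_a), l2_normalize(emb_b)
    score = sum(x * y for x, y in zip(a, b))
    return score >= threshold, score

# Hypothetical embeddings: one enrolled speaker and two test utterances.
enrolled = [0.2, 0.9, 0.4]
test_same = [0.25, 0.85, 0.35]
test_other = [0.9, -0.1, 0.1]

accept_same, score_same = same_speaker(enrolled, test_same)
accept_other, score_other = same_speaker(enrolled, test_other)
print(accept_same, accept_other)
```

Diarization builds on the same scoring idea by clustering per-segment embeddings instead of comparing a single pair.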
emotion-recognition-wav2vec2-IEMOCAP is a wav2vec2-based speech emotion classifier. As its name indicates, it is trained on the IEMOCAP dataset and predicts a discrete emotion label for each utterance.
343,948 ↓ · 188 ♡
open-vakgyata is a wav2vec2-based open-weight model for audio classification. Its training data spans multiple languages, so a single checkpoint covers cross-lingual audio classification. It is released under CC BY-NC 4.0, which restricts commercial use, so check the license terms against your deployment plan. Before relying on it, reproduce its key numbers on representative inputs.
336,385 ↓ · 3 ♡
hubert-large-speech-emotion-recognition-russian-dusha-finetuned is a HuBERT-based open-weight model for audio classification. It is fine-tuned from hubert-large-ls960-ft and, as its name indicates, targets Russian speech emotion recognition on the Dusha dataset. The Apache 2.0 license permits commercial reuse. Check the model card for benchmarks and intended use before adopting it.
327,156 ↓ · 15 ♡
wav2vec-vm-finetune is a fine-tuned wav2vec checkpoint that maps audio waveforms to class labels. Models of this kind are trained on labeled audio datasets for tasks such as language identification and speaker recognition.
322,931 ↓ · 12 ♡
music_genres_classification is an audio classifier that, as its name indicates, assigns music clips to genre labels.
308,873 ↓ · 39 ♡