audio text to text models

4 models · ranked by HuggingFace downloads

ultravox-v0_5-llama-3_2-1b

Built for audio text to text, ultravox-v0_5-llama-3_2-1b is a llama-based model with publicly available weights. Training spans multiple languages, so ultravox-v0_5-llama-3_2-1b covers cross-lingual audio text to text from one checkpoint. At about 1000M parameters, ultravox-v0_5-llama-3_2-1b sits in the mid-sized tier, which sets its memory and latency budget. ultravox-v0_5-llama-3_2-1b ships without a hosted SLA, so budget for self-managed deployment and monitoring.

1,095,694 ↓ · 88 ♡

Qwen2-Audio-7B-Instruct

Qwen2-Audio-7B-Instruct is Alibaba's multimodal model handling audio and text inputs, capable of audio analysis, speech-to-text transcription, and audio-grounded Q&A. It's instruction-tuned for dialog about audio content. Apache-2.0 licensed and compatible with the Transformers qwen2_audio model type.

684,698 ↓ · 540 ♡

VibeVoice-ASR-HF

VibeVoice-ASR is Microsoft's HuggingFace-packaged automatic speech recognition model, likely a Whisper-style or custom encoder-decoder ASR system targeting informal or conversational speech. The 'Vibe' branding suggests orientation toward natural conversational audio.

671,724 ↓ · 156 ♡

ultravox-v0_6-llama-3_1-8b

Ultravox v0.6 combines a speech/audio encoder with a Llama 3.1 8B language model backbone, enabling direct audio-to-text generation without a separate ASR transcription step. The model supports over 40 languages drawn from its tag list and processes audio input end-to-end, making it suited for voice assistant and audio understanding tasks. Custom model code is required for inference, as it is not yet a standard Transformers architecture class.

657,054 ↓ · 6 ♡