NVIDIA and the University of Maryland Researchers Released Audio Flamingo Next (AF-Next): A Super Powerful and Open Large Audio-Language Model

Understanding audio has always been the multimodal frontier that lags behind vision. While image-language models have rapidly scaled toward real-world deployment, building open models that robustly reason over speech, environmental sounds, and music — especially at length — has remained quite hard. NVIDIA and the University of Maryland researchers are now taking a direct swing at that gap.

The research team has released Audio Flamingo Next (AF-Next), the most capable model in the Audio Flamingo series and a fully open Large Audio-Language Model (LALM) trained on internet-scale audio data.

Audio Flamingo Next (AF-Next) comes in three specialized variants for different use cases. The release includes AF-Next-Instruct for general question answering, AF-Next-Think for advanced multi-step reasoning, and AF-Next-Captioner for detailed audio captioning.

What is a Large Audio-Language Model (LALM)?

A Large Audio-Language Model (LALM) pairs an audio encoder with a decoder-only language model to enable question answering, captioning, transcription, and reasoning directly over audio inputs. Think of it as the audio equivalent of a vision-language model like LLaVA or GPT-4V, but designed to handle speech, environmental sounds, and music simultaneously — within a single unified model.

https://arxiv.org/pdf/2604.10905

The Architecture: Four Components Working in a Pipeline

AF-Next is built around four main components. First is the AF-Whisper audio encoder, a custom Whisper-based encoder further pre-trained on a larger and more diverse corpus, including multilingual speech and multi-talker ASR data. Given an audio input, the model resamples it to 16 kHz mono and converts the waveform into a 128-channel log mel-spectrogram using a 25 ms window and a 10 ms hop size. The spectrogram is processed in non-overlapping 30-second chunks through AF-Whisper, which outputs 1280-dimensional features at 50 Hz; a stride-2 pooling layer then halves the rate to 25 Hz, i.e., one audio token every 40 ms.
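The shape arithmetic of this front end is easy to sanity-check. The sketch below is a back-of-envelope token-accounting function based on the figures above, not the released preprocessing code:

```python
import numpy as np

SR = 16_000        # resample target: 16 kHz mono
HOP_S = 0.010      # 10 ms hop -> 100 mel frames per second
N_MELS = 128       # 128-channel log mel-spectrogram
CHUNK_S = 30.0     # AF-Whisper consumes non-overlapping 30 s chunks
ENC_HZ = 50        # encoder output feature rate
POOL_STRIDE = 2    # stride-2 pooling after the encoder

def frontend_shapes(duration_s: float):
    """Return (#30s chunks, #mel frames, #encoder features, #audio tokens)."""
    n_chunks = int(np.ceil(duration_s / CHUNK_S))
    mel_frames = int(round(duration_s / HOP_S))   # 100 Hz mel frames
    enc_feats = int(round(duration_s * ENC_HZ))   # 50 Hz encoder features
    tokens = enc_feats // POOL_STRIDE             # 25 Hz, one token per 40 ms
    return n_chunks, mel_frames, enc_feats, tokens
```

A 30-minute recording, the model's long-audio ceiling, therefore yields 45,000 audio tokens before a single word of text, which is why the extended context window discussed below matters.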

Second is the audio adaptor, a 2-layer MLP that maps AF-Whisper’s audio representations into the language model’s embedding space. Third is the LLM backbone: Qwen-2.5-7B, a decoder-only causal model with 7B parameters, 36 transformer layers, and 16 attention heads, with context length extended from 32k to 128k tokens through additional long-context training.
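A minimal numpy sketch of such an adaptor is below. The 1280-dimensional input width comes from the paper; the Qwen-2.5-7B embedding width of 3584 and the 4x intermediate width are assumptions, since the article does not spell them out:

```python
import numpy as np

D_AUDIO = 1280            # AF-Whisper hidden dimension (from the paper)
D_LLM = 3584              # assumed Qwen-2.5-7B embedding width
D_HIDDEN = 4 * D_AUDIO    # assumed intermediate width of the 2-layer MLP

rng = np.random.default_rng(0)
W1 = rng.normal(0.0, 0.02, (D_AUDIO, D_HIDDEN))
W2 = rng.normal(0.0, 0.02, (D_HIDDEN, D_LLM))

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def adapt(audio_feats):
    """Project pooled audio features into the LLM's token-embedding space."""
    return gelu(audio_feats @ W1) @ W2

# One 30-second chunk: 750 pooled audio features -> 750 LLM-space tokens
llm_tokens = adapt(rng.normal(size=(750, D_AUDIO)))
```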

A subtle but important architectural detail is Rotary Time Embeddings (RoTE). Standard positional encodings in transformers index a token by its discrete sequence position i. RoTE replaces this: instead of the standard RoPE rotation angle θ ← −i · 2π, RoTE uses θ ← −τi · 2π, where τi is each token's absolute timestamp. For audio tokens produced at a fixed 40 ms stride, discrete time positions are interpolated before being fed into the RoTE module. This grounds positional representations in actual time rather than sequence order, a core design choice behind the model's temporal reasoning, particularly on long audio. The fourth component is a streaming TTS module, which enables voice-to-voice interaction.
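The substitution is easy to see in code. The numpy sketch below applies a rotary rotation keyed on each token's timestamp rather than its index; the per-dimension frequency bands are the usual RoPE ones and are an assumption here, since the formula above only shows the base angle:

```python
import numpy as np

STRIDE_S = 0.040  # audio tokens arrive every 40 ms after stride-2 pooling

def rote(x, times_s):
    """Rotary embedding keyed on absolute time in seconds (RoTE sketch).

    x: (T, D) features with D even; times_s: (T,) float timestamps.
    Standard RoPE would pass np.arange(T) here instead of timestamps.
    """
    d = x.shape[-1]
    inv_freq = 1.0 / (10000.0 ** (np.arange(0, d, 2) / d))     # usual RoPE bands
    ang = -2.0 * np.pi * times_s[:, None] * inv_freq[None, :]  # theta = -tau * 2*pi
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# Timestamps for one 30-second chunk of audio tokens
times = np.arange(750) * STRIDE_S
```

Because the rotation depends only on τ, two tokens 40 ms apart get the same relative phase whether they sit at the start of a clip or 25 minutes in.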

Temporal Audio Chain-of-Thought: The Key Reasoning Recipe

Chain-of-Thought (CoT) prompting has improved reasoning across text and vision models, but prior audio CoT work showed only small gains because training datasets were limited to short clips with simple questions. AF-Next addresses this with Temporal Audio Chain-of-Thought, where the model explicitly anchors each intermediate reasoning step to a timestamp in the audio before producing an answer, encouraging faithful evidence aggregation and reducing hallucination over long recordings.

To train this capability, the research team created AF-Think-Time, a dataset of question–answer–thinking-chain triplets curated from challenging audio sources including trailers, movie recaps, mystery stories, and long-form multi-party conversations. AF-Think-Time consists of approximately 43K training samples, with an average of 446.3 words per thinking chain.
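To make the data format concrete, a hypothetical triplet in this style might look like the following. This is an illustration of the timestamp-anchored structure only, not an actual sample from AF-Think-Time:

```python
# Hypothetical question-answer-thinking-chain triplet (illustrative only)
sample = {
    "question": "Who first raises the alarm in this recording, and why?",
    "thinking": [
        {"t_s": 42.0,  "step": "A door slams; two speakers are present."},
        {"t_s": 195.0, "step": "Speaker B mentions smelling smoke in the hallway."},
        {"t_s": 200.0, "step": "Speaker B tells the others to leave the building."},
    ],
    "answer": "Speaker B, after noticing smoke at around the 3-minute mark.",
}

# Every reasoning step is anchored to a timestamp before the answer is produced
assert all("t_s" in step for step in sample["thinking"])
```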

Training at Scale: 1 Million Hours, Four Stages

The final training dataset comprises approximately 108 million samples and roughly 1 million hours of audio, drawn from both existing publicly released datasets and raw audio collected from the open internet and labeled synthetically. New data categories include: over 200K long videos spanning 5 to 30 minutes for long-form captioning and QA; multi-talker speech understanding data covering speaker identification, interruption identification, and target-speaker ASR; approximately 1 million samples for multi-audio reasoning across multiple simultaneous audio inputs; and approximately 386K safety and instruction-following samples.

Training follows a four-stage curriculum, each stage with its own data mixture and context length.

Pre-training has two sub-stages: Stage 1 trains only the audio adaptor while keeping both AF-Whisper and the LLM frozen (max audio 30 seconds, 8K-token context); Stage 2 additionally fine-tunes the audio encoder while still keeping the LLM frozen (max audio 1 minute, 8K-token context).

Mid-training also has two sub-stages: Stage 1 performs full fine-tuning of the entire model, adding AudioSkills-XL and newly curated data (max audio 10 minutes, 24K-token context); Stage 2 introduces long-audio captioning and QA, down-sampling the Stage 1 mixture to half its original blend weights while expanding the context to 128K tokens and audio to 30 minutes. The model produced by mid-training is released as AF-Next-Captioner.

Post-training applies GRPO-based reinforcement learning focused on multi-turn chat, safety, instruction following, and selected skill-specific datasets, producing AF-Next-Instruct. Finally, CoT-training starts from AF-Next-Instruct, applies SFT on AF-Think-Time, and then GRPO using the post-training data mixture, producing AF-Next-Think.
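The curriculum can be summarized as a small config table. Values are transcribed from the description above; the field names are my own:

```python
# Four-stage training curriculum; ctx_k = context length in thousands of tokens
STAGES = [
    {"phase": "pre-train 1", "trainable": ("adaptor",),                  "max_audio_s": 30,   "ctx_k": 8},
    {"phase": "pre-train 2", "trainable": ("adaptor", "encoder"),        "max_audio_s": 60,   "ctx_k": 8},
    {"phase": "mid-train 1", "trainable": ("adaptor", "encoder", "llm"), "max_audio_s": 600,  "ctx_k": 24},
    {"phase": "mid-train 2", "trainable": ("adaptor", "encoder", "llm"), "max_audio_s": 1800, "ctx_k": 128},
]

# Post-training (GRPO) then yields AF-Next-Instruct; CoT-training (SFT on
# AF-Think-Time followed by GRPO) yields AF-Next-Think; the mid-train 2
# checkpoint is released as AF-Next-Captioner.
```

The pattern is the familiar one for multimodal training: unfreeze progressively, and grow audio length and context together so each stage only introduces one new difficulty.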

One notable contribution from the research team is hybrid sequence parallelism, which makes 128K-context training feasible on long audio. Without it, audio token expansion blows past standard context windows and the quadratic memory cost of self-attention becomes infeasible. The solution combines Ulysses attention — which uses all-to-all collectives to distribute sequence and head dimensions within nodes where high-bandwidth interconnects are available — with Ring attention, which circulates key-value blocks across nodes via point-to-point transfers. Ulysses handles intra-node communication efficiently; Ring scales across nodes.
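The Ring half of this scheme is the easier one to illustrate. The toy numpy sketch below simulates KV blocks hopping around a ring of "devices" while each device accumulates softmax statistics online, so no device ever materializes the full T x T attention matrix. It is a single-process illustration of the idea, not the distributed implementation:

```python
import numpy as np

def ring_attention(q, k, v, n_dev=4):
    """Toy Ring attention: device i owns one Q block; KV blocks circulate
    one hop per step, and softmax is accumulated online (flash-style)."""
    d = q.shape[-1]
    qs, ks, vs = (np.array_split(a, n_dev) for a in (q, k, v))
    outs = []
    for i in range(n_dev):                          # each device's Q block
        m = np.full(qs[i].shape[0], -np.inf)        # running row max
        l = np.zeros(qs[i].shape[0])                # running normalizer
        acc = np.zeros_like(qs[i])                  # running weighted sum
        for step in range(n_dev):                   # KV block arriving this hop
            j = (i + step) % n_dev
            s = qs[i] @ ks[j].T / np.sqrt(d)
            m_new = np.maximum(m, s.max(axis=1))
            scale = np.exp(m - m_new)               # rescale old accumulators
            p = np.exp(s - m_new[:, None])
            l = l * scale + p.sum(axis=1)
            acc = acc * scale[:, None] + p @ vs[j]
            m = m_new
        outs.append(acc / l[:, None])
    return np.vstack(outs)
```

Ulysses attention is the complementary piece: within a node it reshards the sequence and head dimensions with all-to-all collectives, so each GPU computes full-sequence attention for a subset of heads, while the ring above handles the slower cross-node hops.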


Benchmark Results: Strong Across the Board

On MMAU-v05.15.25, the most widely used audio reasoning benchmark, AF-Next-Instruct achieves an average accuracy of 74.20 vs. Audio Flamingo 3’s 72.42, with AF-Next-Think reaching 75.01 and AF-Next-Captioner pushing to 75.76 — with gains across all three subcategories: sound (79.87), music (75.3), and speech (72.13). On the more challenging MMAU-Pro benchmark, AF-Next-Think (58.7) surpasses the closed-source Gemini-2.5-Pro (57.4).

Music understanding sees particularly strong gains. On Medley-Solos-DB instrument recognition, AF-Next reaches 92.13 vs. Audio Flamingo 2's 85.80. On SongCaps music captioning, GPT-5-judged coverage and correctness scores jump from 6.7 and 6.2 (Audio Flamingo 3) to 8.8 and 8.9, respectively.

Long-audio understanding is where AF-Next most clearly separates itself. On LongAudioBench, AF-Next-Instruct achieves 73.9, outperforming both Audio Flamingo 3 (68.6) and the closed-source Gemini 2.5 Pro (60.4). On the speech-inclusive variant (+Speech), AF-Next reaches 81.2 vs. Gemini 2.5 Pro’s 66.2. On ASR, AF-Next-Instruct sets new lows among LALMs with a Word Error Rate of 1.54 on LibriSpeech test-clean and 2.76 on test-other. On VoiceBench, AF-Next-Instruct achieves the highest scores on AlpacaEval (4.43), CommonEval (3.96), and OpenBookQA (80.9), surpassing Audio Flamingo 3 by over 14 points on OpenBookQA. On CoVoST2 speech translation, AF-Next shows a particularly notable 12-point improvement over Phi-4-mm on Arabic EN→X translation (21.9 vs. 9.9).


Key Takeaways

Here are the key takeaways:

A Fully Open Audio-Language Model at Internet Scale: AF-Next is considered the first LALM to scale audio understanding to internet-scale data — approximately 108 million samples and 1 million hours of audio.

Temporal Audio Chain-of-Thought Solves Long-Audio Reasoning: Unlike prior audio CoT approaches, which reason without temporal grounding, AF-Next explicitly anchors each intermediate reasoning step to a timestamp in the audio before producing an answer. This makes the model significantly more faithful and interpretable on long recordings of up to 30 minutes, a problem prior models largely sidestepped.

Three Specialized Variants for Different Use Cases: The release includes AF-Next-Instruct for general question answering, AF-Next-Think for advanced multi-step reasoning, and AF-Next-Captioner for detailed audio captioning — allowing practitioners to select the right model based on their task rather than using a one-size-fits-all checkpoint.

Beats Closed Models on Long Audio Despite Being Smaller: On LongAudioBench, AF-Next-Instruct scores 73.9, outperforming the closed-source Gemini 2.5 Pro (60.4) and Audio Flamingo 3 (68.6). On the more challenging speech-inclusive variant, the gap widens further, with AF-Next reaching 81.2 vs. Gemini 2.5 Pro's 66.2.

Check out the Paper, Project Page and Model Weights.
