IBM Releases Two Granite Speech 4.1 2B Models: Autoregressive ASR with Translation and Non-Autoregressive Editing for Fast Inference

IBM released two new open speech recognition models, Granite Speech 4.1 2B and Granite Speech 4.1 2B-NAR, and they make a compelling case for what a ~2B-parameter speech model can do. Both are available on Hugging Face under the Apache 2.0 license.

The pair targets a specific problem that enterprise AI teams know well: most production-grade automatic speech recognition (ASR) systems either demand massive compute or sacrifice accuracy to stay within budget. IBM’s bet is that careful architecture decisions can let you have it both ways.

What These Models Actually Do

Granite Speech 4.1 2B is a compact and efficient speech-language model designed for multilingual automatic speech recognition (ASR) and bidirectional automatic speech translation (AST) covering English, French, German, Spanish, Portuguese, and Japanese. Its non-autoregressive counterpart, Granite Speech 4.1 2B-NAR, focuses exclusively on ASR — specifically targeting latency-sensitive deployments — and supports English, French, German, Spanish, and Portuguese, but not Japanese. That’s a meaningful distinction: teams that need Japanese transcription or any speech translation capability should reach for the standard autoregressive model.

IBM also quietly released a third variant alongside these two. Granite Speech 4.1 2B-Plus adds speaker-attributed ASR and word-level timestamps for applications where knowing who said what — and exactly when — is a requirement.


Word Error Rate (WER) is the primary metric for measuring transcription quality. Lower is better. A WER of 5% means roughly 5 out of every 100 words are wrong. On the Open ASR Leaderboard (as of April 2026), Granite Speech 4.1 2B scores a mean WER of 5.33. Drilling into benchmark detail, the model achieves a WER of 1.33 on LibriSpeech clean and 2.5 on LibriSpeech other.
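To make the metric concrete, here is a minimal WER computation using the open-source jiwer library. The sample strings are invented for illustration; this is not IBM's evaluation harness:

```python
# Illustrative WER check using the open-source jiwer library.
# The strings are invented examples, not from IBM's benchmarks.
import jiwer

reference = "the quick brown fox jumps over the lazy dog"   # ground truth
hypothesis = "the quick brown fox jumped over a lazy dog"   # ASR output

# WER = (substitutions + insertions + deletions) / reference word count
print(f"WER: {jiwer.wer(reference, hypothesis):.2%}")  # 2 subs / 9 words ≈ 22.22%
```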

The Architecture, Explained

Both models share the same three-component design at a high level — a speech encoder, a modality adapter, and a language model — though the decoding mechanism diverges significantly.

The first component is the speech encoder. The architecture uses 16 conformer blocks trained with Connectionist Temporal Classification (CTC) through two classification heads, one for graphemic (character-level) outputs and one for byte-pair encoding (BPE) units, and applies frame importance sampling to focus training on the informative parts of the audio. A Conformer is a neural network layer that combines convolutional layers (good at capturing local acoustic patterns) with attention mechanisms (good at capturing long-range dependencies). CTC is a training technique that lets the model learn from audio-text pairs without needing exact frame-level alignment.
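For intuition, here is a minimal sketch of wiring up a CTC objective in PyTorch. The toy shapes and vocabulary are invented, and IBM's dual-head grapheme/BPE setup and frame importance sampling are not reproduced:

```python
import torch
import torch.nn as nn

# Toy dimensions: 100 encoder frames, batch of 2, 32-symbol vocabulary
# (index 0 reserved for the CTC blank symbol).
T, B, V = 100, 2, 32
log_probs = torch.randn(T, B, V).log_softmax(dim=-1)   # encoder outputs

targets = torch.randint(1, V, (B, 20))    # unaligned label sequences
input_lengths = torch.full((B,), T)       # encoder frames per utterance
target_lengths = torch.full((B,), 20)     # labels per utterance

# CTC marginalizes over all valid frame-to-label alignments, which is
# why no frame-level alignment is needed in the training data.
loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lengths, target_lengths)
print(loss)
```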

The second component is a speech-text modality adapter: a 2-layer window query transformer (Q-Former) that operates on blocks of 15 1024-dimensional acoustic embeddings from the last conformer block, downsampling by a factor of 5 via 3 trainable queries per block and per layer, for a total temporal downsampling factor of 10 and a 10Hz acoustic embedding rate at the LLM input. This adapter bridges the gap between continuous acoustic features and discrete text tokens, compressing the audio representation so the language model can process it efficiently. In the NAR model, the Q-Former has 160M parameters and downsamples the concatenated hidden representations from four encoder layers (layers 4, 8, 12, and 16).
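A toy sketch of the windowed-query idea helps here: every window of 15 encoder frames is summarized by 3 learned queries via cross-attention. This is a minimal illustration, not IBM's implementation; a real Q-Former layer adds feed-forward sublayers, normalization, and stacking:

```python
import torch
import torch.nn as nn

# Toy windowed-query block: each window of 15 frames is compressed to
# 3 learned query vectors via cross-attention, a 5x temporal reduction.
# Assumes the frame count is divisible by the window size.
class WindowQFormerBlock(nn.Module):
    def __init__(self, dim=1024, window=15, n_queries=3, n_heads=8):
        super().__init__()
        self.window, self.n_queries = window, n_queries
        self.queries = nn.Parameter(torch.randn(n_queries, dim))
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, frames):                       # (batch, T, dim)
        b, t, d = frames.shape
        windows = frames.reshape(b * t // self.window, self.window, d)
        q = self.queries.unsqueeze(0).expand(windows.size(0), -1, -1)
        out, _ = self.attn(q, windows, windows)      # queries attend to frames
        return out.reshape(b, -1, d)                 # (batch, T/5, dim)

x = torch.randn(1, 150, 1024)          # 150 encoder frames
print(WindowQFormerBlock()(x).shape)   # torch.Size([1, 30, 1024])
```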

The third component is the language model. Granite Speech 4.1 2B uses an intermediate checkpoint of granite-4.0-1b-base with 128k context length, fine-tuned on all training corpora. In the NAR variant, this becomes a 1B-parameter bidirectional LLM editor — granite-4.0-1b-base with its causal attention mask removed to enable bidirectional context — adapted with LoRA at rank 128 applied to both attention and MLP layers.
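For readers who want to picture that adaptation setup, here is a hedged sketch of an equivalent LoRA configuration using Hugging Face's peft library. The rank and the attention-plus-MLP scope come from the model description; the specific module names are typical for Granite/Llama-style blocks and are assumptions, not confirmed details:

```python
# Hedged sketch of the NAR editor's LoRA setup using Hugging Face peft.
# The rank (128) and attention + MLP scope come from the article; the
# module names below are typical for Granite/Llama-style blocks and are
# assumptions, not confirmed by IBM.
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=128,                     # LoRA rank reported for the NAR editor
    lora_alpha=256,            # assumed; a common choice is 2 * r
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention projections
        "gate_proj", "up_proj", "down_proj",      # MLP projections
    ],
)
# model = get_peft_model(base_llm, lora_config)   # base_llm: the 1B editor
```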

The Autoregressive vs. Non-Autoregressive Tradeoff

This is where the two models diverge most sharply, and it has direct consequences for production deployment.

In the standard Granite Speech 4.1 2B, text is generated autoregressively — one token at a time, each depending on every token before it. This produces accurate, stable transcripts with full support for AST, keyword-biased recognition, and punctuation, but is inherently sequential and slower at scale.

Granite Speech 4.1 2B-NAR takes a fundamentally different approach. Rather than decoding tokens one at a time, it edits a CTC hypothesis in a single forward pass using a bidirectional LLM, achieving competitive accuracy with faster inference than autoregressive alternatives. This is the NLE (Non-autoregressive LLM-based Editing) architecture. Concretely: the CTC encoder produces a rough initial transcript, that hypothesis is interleaved with insertion slots, and then a bidirectional LLM predicts edits — copy, insert, delete, or replace — at all positions simultaneously in one pass.
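To make the editing step concrete, here is a runnable toy that applies per-position edit decisions to a slotted CTC hypothesis. The edit vocabulary mirrors the copy/insert/delete/replace operations described above; everything else, including the hard-coded edits, is invented for illustration (in the real model, a bidirectional LLM predicts all edits in one forward pass):

```python
# Toy illustration of NLE-style editing. In the real model a
# bidirectional LLM predicts every edit simultaneously; here the edit
# decisions are hard-coded to show how they rewrite the hypothesis.
SLOT = "<slot>"

def interleave_slots(hypothesis):
    """['the', 'cat'] -> ['<slot>', 'the', '<slot>', 'cat', '<slot>']"""
    out = [SLOT]
    for tok in hypothesis:
        out += [tok, SLOT]
    return out

def apply_edits(slotted, edits):
    out = []
    for tok, op in zip(slotted, edits):
        if op == "copy":
            out.append(tok)                      # keep the CTC token
        elif op.startswith(("insert:", "replace:")):
            out.append(op.split(":", 1)[1])      # new token from the editor
        # "delete" drops a token; on a slot it just leaves the slot empty
    return out

hyp = ["the", "cat", "sit", "on", "mat"]         # rough CTC draft
edits = ["delete", "copy", "delete", "copy", "delete",
         "replace:sat", "delete", "copy", "insert:the",
         "copy", "delete"]
print(apply_edits(interleave_slots(hyp), edits))
# -> ['the', 'cat', 'sat', 'on', 'the', 'mat']
```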

The NAR model was measured at an RTFx of approximately 1820 on a single H100 GPU using batched inference at batch size 128. RTFx (real-time factor multiplier) measures how many times faster than real time a model can process audio; an RTFx of 1820 means a one-hour audio file can be transcribed in under two seconds on that hardware. One practical constraint engineers should note: the NAR model requires flash_attention_2 for inference, since this backend supports sequence packing and respects the is_causal=False flag.
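In practice, loading the model with that backend looks roughly like the sketch below. The model ID and the Auto* classes follow the pattern of earlier Granite Speech model cards and are assumptions here; consult the 4.1 model card for the exact recipe:

```python
# Hedged loading sketch for the NAR model with the flash_attention_2
# backend the article says it requires. The model ID and Auto* classes
# are assumptions based on earlier Granite Speech releases.
import torch
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

model_id = "ibm-granite/granite-speech-4.1-2b-nar"   # assumed model ID
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,               # flash attention needs fp16/bf16
    attn_implementation="flash_attention_2",  # honors is_causal=False + packing
).to("cuda")
```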

Training Data and Infrastructure

The two models were trained on different datasets. The standard model was trained on 174,000 hours of audio from public corpora for ASR and AST, as well as synthetic datasets tailored to support Japanese ASR, keyword-biased ASR, and speech translation. The NAR model was trained on approximately 130,000 hours of speech across five languages using publicly available datasets including CommonVoice 15, MLS, LibriSpeech, LibriHeavy, AMI, Granary VoxPopuli, Granary YODAS, Earnings-22, Fisher, CallHome, and SwitchBoard.

The infrastructure gap between the two is equally telling. The standard model’s training was completed in 30 days — 26 days for the encoder and 4 days for the projector — on 8 H100 GPUs. The NAR model trained in just 3 days on 16 H100 GPUs (2 nodes) for 5 epochs — a much lighter training run, which reflects the architectural simplicity of editing over full autoregressive generation.

Key Takeaways

Five takeaways stand out:

IBM released two open ASR models — Granite Speech 4.1 2B (autoregressive) and Granite Speech 4.1 2B-NAR (non-autoregressive) — both around 2B parameters and Apache 2.0 licensed.

The standard model achieves a mean WER of 5.33 on the Open ASR Leaderboard, supports 6 languages for ASR (including Japanese), bidirectional speech translation, keyword biasing, and punctuation/truecasing — competitive with models several times its size.

The NAR model trades capabilities for speed — it drops Japanese, AST, and keyword biasing, but delivers an RTFx of ~1820 on a single H100 GPU by editing a CTC hypothesis in a single forward pass rather than generating tokens one at a time.

The architecture has three core components — a 16-layer Conformer encoder trained with dual-head CTC, a 2-layer window Q-Former projector that downsamples audio to a 10Hz embedding rate, and a fine-tuned granite-4.0-1b-base language model.

A third variant, Granite Speech 4.1 2B-Plus, also exists — extending the standard model with speaker-attributed ASR and word-level timestamps for applications where speaker identity and precise timing are required.
