How Nvidia Made Its ASR Models 3x Faster Than the Competition
Open the Hugging Face Open ASR Leaderboard and sort by RTFx, the inverse real-time factor: seconds of audio transcribed per second of wall-clock time. Among models with competitive word error rate (WER), the top of the table is dominated by one family: Nvidia’s Parakeet TDT checkpoints. They process more than 3x as much audio per second of wall-clock time as the nearest competitor, and their WER is comparable to the rest of the top ten.
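To make the metric concrete, here is a minimal sketch of how RTFx can be computed from a benchmark run. The function name `rtfx` and the numbers in the usage line are illustrative assumptions, not figures from the leaderboard.

```python
def rtfx(audio_seconds: float, wall_clock_seconds: float) -> float:
    """Inverse real-time factor: seconds of audio processed per second of wall-clock time."""
    if wall_clock_seconds <= 0:
        raise ValueError("wall_clock_seconds must be positive")
    return audio_seconds / wall_clock_seconds


# Hypothetical run: one hour of audio transcribed in two seconds of wall-clock time.
example = rtfx(audio_seconds=3600.0, wall_clock_seconds=2.0)
print(example)
```

Higher is faster: a model with three times the RTFx of another gets through the same audio in a third of the time.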
A gap that wide is rarely just kernel engineering. The mechanism here is architectural. Nvidia’s models use a modification to the RNN-Transducer called the Token-and-Duration Transducer, or TDT (Xu et al., 2023).
It changes the decoder loop in a small but consequential way. Instead of stepping through encoder frames one at a time, the model jointly predicts a token and the number of frames that token covers, then jumps.
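The predict-then-jump loop described above can be sketched as a toy greedy decoder. This is an illustrative sketch under stated assumptions, not Nvidia's or NeMo's implementation: `toy_joint` is a hypothetical stand-in for the real joint network, the token ids and frame counts are made up, and the rule that a zero-duration prediction is forced forward by one frame (for blanks, or after too many symbols at one frame) is a safety assumption of this sketch.

```python
from typing import Callable, List, Tuple

BLANK = 0        # assumed id of the blank symbol
NUM_FRAMES = 20  # hypothetical number of encoder frames

# Hypothetical joint: (frame index, previous token) -> (token id, frames to skip)
JointFn = Callable[[int, int], Tuple[int, int]]


def tdt_greedy_decode(
    num_frames: int, joint: JointFn, max_symbols_per_frame: int = 10
) -> Tuple[List[int], int]:
    """Greedy TDT-style decoding: predict (token, duration), then jump by duration."""
    t = 0                  # current encoder frame
    prev = BLANK           # last emitted non-blank token
    tokens: List[int] = []
    joint_calls = 0
    stalled = 0            # consecutive zero-duration predictions at frame t
    while t < num_frames:
        token, duration = joint(t, prev)
        joint_calls += 1
        if token != BLANK:
            tokens.append(token)
            prev = token
        if duration == 0:
            stalled += 1
            # Assumption: force progress so the loop cannot spin on one frame.
            if token == BLANK or stalled >= max_symbols_per_frame:
                duration = 1
        if duration > 0:
            t += duration
            stalled = 0
    return tokens, joint_calls


def toy_joint(t: int, prev_token: int) -> Tuple[int, int]:
    """Scripted stand-in for a trained model: two tokens separated by silence."""
    speech = {0: (7, 2), 10: (8, 2)}  # frame -> (token id, frames covered)
    if t in speech:
        return speech[t]
    # Silence: skip up to 4 frames, but never past the next speech onset.
    next_onset = min((s for s in speech if s > t), default=NUM_FRAMES)
    return BLANK, min(4, next_onset - t)


tokens, joint_calls = tdt_greedy_decode(NUM_FRAMES, toy_joint)
print(tokens, joint_calls)
```

In this toy run the decoder covers 20 frames with 6 joint calls, whereas a loop that advances one frame at a time needs at least one call per frame.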
On long utterances with stretches of silence or steady-state audio, that turns out...
Copyright of this story solely belongs to hackernoon.com.