
NVIDIA Breaks Language Barriers: Unleashing Granary, a Landmark Initiative in Multilingual Speech AI

Breaking the Monolingual Mold

Even in 2025, artificial intelligence remains frustratingly monolingual. While AI tools have made remarkable progress in areas such as generative text and computer vision, multilingual speech processing still lags behind. Most models excel only in English and a handful of major world languages; for the thousands of others spoken around the globe, reliable AI tools remain elusive. NVIDIA is now taking a significant step toward correcting this imbalance with the introduction of Granary, a massive open-source multilingual speech dataset that could dramatically expand the reach of speech AI technologies.

A Dataset Born for Inclusion

Granary is a vast and meticulously curated collection of multilingual speech data. It includes approximately one million hours of audio, divided into two primary categories: around 650,000 hours dedicated to automatic speech recognition (ASR) and roughly 350,000 hours aimed at automatic speech translation (AST). What makes Granary especially noteworthy is its focus on linguistic diversity: it covers 25 European languages, spanning nearly all 24 official European Union languages plus Russian and Ukrainian. This is not merely a token nod to inclusivity but a bold move to ensure that AI tools can be developed for speakers of underrepresented languages like Croatian, Estonian, and Maltese.
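For developers who want to explore the corpus directly, here is a minimal sketch that streams a slice of the dataset from Hugging Face. The configuration name ("hr" for Croatian), split name, and field names are assumptions made for illustration; the actual layout is documented on the nvidia/Granary dataset card.

```python
# Minimal sketch: stream a slice of the Granary dataset from Hugging Face.
# NOTE: the configuration name ("hr") and split ("train") are hypothetical;
# consult the nvidia/Granary dataset card for the real configuration names.
from datasets import load_dataset

granary_hr = load_dataset(
    "nvidia/Granary",   # dataset repository on Hugging Face
    "hr",               # hypothetical per-language configuration
    split="train",
    streaming=True,     # avoid downloading hundreds of thousands of hours up front
)

# Inspect a few examples without materializing the whole split.
for i, example in enumerate(granary_hr):
    print(example.keys())  # e.g. audio, text, duration (field names may differ)
    if i >= 2:
        break
```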

The dataset was developed in collaboration with Carnegie Mellon University and Italy’s Fondazione Bruno Kessler. The effort leverages NVIDIA’s NeMo Speech Data Processor toolkit, which transforms massive volumes of raw, unlabeled audio into usable training data through an automated, scalable pipeline. The toolkit significantly reduces the need for manual annotation, making it possible to build high-quality training datasets more efficiently and at lower cost.

Engineering Granary: An Intelligent Pipeline

The creation of Granary is as impressive as its scale. The pipeline that powers the project is divided into two distinct phases: ASR and AST. In the ASR phase, audio is segmented, the language of each segment is verified through language identification, and content is rigorously filtered to remove low-quality samples and hallucinated text. Large language models (LLMs) are then used to restore punctuation, a small but crucial step that enhances the dataset’s utility in real-world applications.
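The actual processors live in NVIDIA’s NeMo Speech Data Processor; the sketch below merely illustrates the kind of heuristic quality filtering the ASR phase applies to manifest entries. The field names and thresholds are assumptions chosen for readability, not Granary’s real configuration.

```python
# Conceptual sketch of ASR-phase quality filtering over manifest entries.
# Thresholds and field names are illustrative assumptions, not Granary's settings.
from dataclasses import dataclass

@dataclass
class Utterance:
    audio_path: str
    text: str
    duration_s: float      # segment length in seconds
    predicted_lang: str    # output of a language-ID model
    expected_lang: str     # language the source was collected for

def keep(utt: Utterance) -> bool:
    """Drop segments that are too short/long, mislabeled, or likely hallucinated."""
    if not (1.0 <= utt.duration_s <= 40.0):           # discard extreme durations
        return False
    if utt.predicted_lang != utt.expected_lang:       # language-ID mismatch
        return False
    chars_per_second = len(utt.text) / utt.duration_s
    if chars_per_second < 1.0 or chars_per_second > 40.0:
        return False                                  # implausible rate, often hallucinated text
    return True

manifest = [
    Utterance("a.wav", "Dobar dan, kako ste?", 2.4, "hr", "hr"),
    Utterance("b.wav", "x" * 500, 2.0, "hr", "hr"),   # absurd character rate, dropped
]
clean = [u for u in manifest if keep(u)]
print(f"kept {len(clean)} of {len(manifest)} segments")
```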

In the AST phase, the focus shifts from recognition to translation. Using open language models such as EuroLLM, the team generates translation pairs from the segmented audio and its accompanying transcriptions. These pairs are then subjected to further filtering and validation to ensure they meet the required quality standards. What stands out is the efficiency: models trained on Granary can reach benchmark performance with roughly 50% less data than comparable datasets require, which means faster training times, reduced resource consumption, and ultimately more sustainable AI development.
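As a rough illustration of the translation step, the snippet below asks an instruction-tuned EuroLLM checkpoint to translate a transcribed segment, producing a candidate translation pair. The checkpoint name, prompt wording, and generation settings are assumptions for this sketch, not Granary’s actual pipeline.

```python
# Sketch: generate an English translation for a transcribed segment with EuroLLM.
# The checkpoint, prompt, and generation settings are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "utter-project/EuroLLM-1.7B-Instruct"  # small instruct EuroLLM checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

transcription = "Dobar dan, dobrodošli na današnji sastanak."
messages = [
    {"role": "user",
     "content": f"Translate the following Croatian sentence into English:\n{transcription}"},
]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(inputs, max_new_tokens=64)
translation = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)

# The (transcription, translation) pair would then go through additional
# quality filtering before being added to the AST portion of the dataset.
print(translation)
```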

The Models: Canary and Parakeet

To showcase what Granary makes possible, NVIDIA has also released two new models built using the dataset. The first, Canary-1b-v2, is a one-billion-parameter model optimized for transcription and translation across the full spectrum of supported languages. Despite its relatively modest size compared to some of the more bloated models in the AI ecosystem, Canary-1b-v2 delivers performance that rivals much larger systems. It ranks highly on Hugging Face’s open-model ASR leaderboard and is designed to run efficiently, making it a practical choice for production environments.
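Below is a minimal usage sketch, assuming Canary-1b-v2 loads through NVIDIA’s NeMo toolkit the same way earlier Canary checkpoints do; the keyword arguments for selecting source and target languages are borrowed from those earlier releases and may differ for v2, so the Hugging Face model card is authoritative.

```python
# Sketch: transcription and speech translation with Canary-1b-v2 via NeMo.
# Assumes the loading/inference pattern of earlier Canary checkpoints;
# argument names may differ for v2 -- check the model card.
from nemo.collections.asr.models import EncDecMultiTaskModel

canary = EncDecMultiTaskModel.from_pretrained("nvidia/canary-1b-v2")

# Plain transcription of a local audio file (hypothetical path).
transcript = canary.transcribe(["meeting_hr.wav"])
print(transcript[0])

# Speech translation: Croatian audio to English text (kwargs assumed from earlier releases).
translation = canary.transcribe(
    ["meeting_hr.wav"],
    source_lang="hr",
    target_lang="en",
)
print(translation[0])
```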

The second model, Parakeet-tdt-0.6b-v3, is smaller—just 600 million parameters—but engineered for speed. This model is ideal for real-time or high-volume applications. It boasts the ability to transcribe entire 24-minute audio clips in a single pass, a feat that positions it as one of the fastest multilingual speech recognition tools currently available. Both models include features such as automatic punctuation, capitalization, and even word-level timestamps for both original and translated outputs. These features make them not only powerful but also highly usable in commercial settings.
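A similar hedged sketch for Parakeet, assuming the checkpoint loads through NeMo’s generic ASRModel interface and supports the timestamps flag the way its v2 predecessor does:

```python
# Sketch: long-form transcription with word-level timestamps using Parakeet.
# Assumes the same interface as the v2 release; verify against the v3 model card.
from nemo.collections.asr.models import ASRModel

parakeet = ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v3")

# A long recording (e.g. ~24 minutes) can be transcribed in a single call.
output = parakeet.transcribe(["podcast_episode.wav"], timestamps=True)

print(output[0].text)                         # full transcript
for word in output[0].timestamp["word"][:5]:  # first few word-level timestamps
    print(word["word"], word["start"], word["end"])
```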

Why This Matters

The release of Granary and its accompanying models is more than a technical achievement. It represents a paradigm shift in how we think about language inclusion in AI. By providing high-quality, open-source tools for less commonly represented languages, NVIDIA is removing one of the biggest barriers to entry in speech AI development: the cost and scarcity of training data.

For developers, this opens new doors. Imagine creating a customer support bot that can fluently interact with users in Slovak or Lithuanian, or building a transcription service that serves the legal sector in Estonia or the healthcare sector in Romania. With Granary, these are no longer far-fetched possibilities but viable projects that can be executed with significantly fewer resources and in much less time.

The initiative also serves an important cultural purpose. Language is a key part of identity, and the exclusion of certain languages from AI tools can contribute to their marginalization in the digital age. By supporting a broad swath of Europe’s linguistic landscape, NVIDIA is helping ensure that AI becomes a tool of inclusion rather than exclusion.

The Road Ahead

The release of Granary is not the end of the road but a foundation for future work. NVIDIA plans to continue refining the dataset and expanding its reach. Meanwhile, the models developed using Granary will be presented at the 2025 Interspeech conference, where the team will delve into the technical details behind their approach and share insights into its real-world impact.

What’s more, both the dataset and the models are freely available on Hugging Face under permissive licenses. This means that anyone—from independent developers to large enterprises—can use and build upon this work without navigating restrictive commercial agreements or usage limitations.

In an AI landscape often dominated by proprietary systems and paywalled data, Granary stands out as a beacon of openness and collaboration. It’s a project that not only advances the state of the art but does so in a way that invites others to participate and innovate.

Final Thoughts

NVIDIA’s Granary is a milestone in multilingual speech AI. It sets a new benchmark for the scale, diversity, and quality of open-source language data. Coupled with high-performance models like Canary and Parakeet, it represents a leap forward for developers aiming to build more inclusive, efficient, and powerful AI applications. In a world where language diversity often hinders technological equity, Granary offers a roadmap to a more linguistically inclusive digital future.
