nvidia

✓ Verified Enterprise

NVIDIA AI, GPU technology leader and model developers

500 models • 86 total models in database


parakeet-tdt-0.6b-v2

> 🎉 NEW: Multilingual Parakeet TDT 0.6B V3 is now available!
> 🌍 25 European Languages | 🚀 Enhanced Performance | 🔗 Try it here: nvidia/parakeet-tdt-0.6b-v3

`parakeet-tdt-0.6b-v2` is a 600-million-parameter automatic speech recognition (ASR) model designed for high-quality English transcription, with support for punctuation, capitalization, and accurate timestamp prediction. Try the demo here: https://huggingface.co/spaces/nvidia/parakeet-tdt-0.6b-v2

This XL variant of the FastConformer [1] architecture integrates the TDT [2] decoder and is trained with full attention, enabling efficient transcription of audio segments up to 24 minutes long in a single pass. The model achieves an RTFx of 3380 on the HF-Open-ASR leaderboard with a batch size of 128. Note: RTFx performance may vary depending on dataset audio duration and batch size.

Key Features

- Accurate word-level timestamp predictions
- Automatic punctuation and capitalization
- Robust performance on spoken numbers and song lyrics transcription

For more information, refer to the Model Architecture section and the NeMo documentation.

This model is ready for commercial/non-commercial use.

GOVERNING TERMS: Use of this model is governed by the CC-BY-4.0 license.

Discover more from NVIDIA: For documentation, deployment guides, enterprise-ready APIs, and the latest open models (including Nemotron and other cutting-edge speech, translation, and generative AI), visit the NVIDIA Developer Portal at developer.nvidia.com. Join the community to access tools, support, and resources to accelerate your development with NVIDIA's NeMo, Riva, NIM, and foundation models.

This model serves developers, researchers, academics, and industries building applications that require speech-to-text capabilities, including but not limited to: conversational AI, voice assistants, transcription services, subtitle generation, and voice analytics platforms.

This model is based on the FastConformer encoder architecture [1] and the TDT decoder [2]. It has 600 million parameters.

Input:

- Input Type(s): 16kHz Audio
- Input Format(s): `.wav` and `.flac` audio formats
- Input Parameters: 1D (audio signal)
- Other Properties Related to Input: Monochannel audio

Output:

- Output Type(s): Text
- Output Format: String
- Output Parameters: 1D (text)
- Other Properties Related to Output: Punctuation and capitalization included.

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

To train, fine-tune, or play with the model, you will need to install NVIDIA NeMo. We recommend installing it after you have installed the latest PyTorch version.

The model is available for use in the NeMo toolkit [3], and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.

Transcribing using Python

First, let's get a sample.

Supported Hardware Microarchitecture Compatibility:

- NVIDIA Ampere
- NVIDIA Blackwell
- NVIDIA Hopper
- NVIDIA Volta

At least 2 GB of RAM is required to load the model. The more RAM is available, the longer the audio input the model can handle.

Current version: parakeet-tdt-0.6b-v2. Previous versions can be accessed here.

This model was trained using the NeMo toolkit [3], following the strategies below:

- Initialized from a FastConformer SSL checkpoint that was pretrained with a wav2vec method on the LibriLight dataset [7].
- Trained for 150,000 steps on 64 A100 GPUs.
- Dataset corpora were balanced using a temperature sampling value of 0.5.
- Stage 2 fine-tuning was performed for 2,500 steps on 4 A100 GPUs using approximately 500 hours of high-quality, human-transcribed data from NeMo ASR Set 3.0.

Training was conducted using this example script and TDT configuration. The tokenizer was constructed from the training set transcripts using this script.

Training Dataset

The model was trained on the Granary dataset [8], consisting of approximately 120,000 hours of English speech data:

- 10,000 hours from human-transcribed NeMo ASR Set 3.0, including:
  - LibriSpeech (960 hours)
  - Fisher Corpus
  - National Speech Corpus Part 1
  - VCTK
  - VoxPopuli (English)
  - Europarl-ASR (English)
  - Multilingual LibriSpeech (MLS English), 2,000-hour subset
  - Mozilla Common Voice (v7.0)
  - AMI
- 110,000 hours of pseudo-labeled data from:
  - YTC (YouTube-Commons) dataset [4]
  - YODAS dataset [5]
  - LibriLight [7]

All transcriptions preserve punctuation and capitalization. The Granary dataset [8] will be made publicly available after presentation at Interspeech 2025.

- Noise-robust data from various sources
- Single-channel, 16kHz sampled data

Hugging Face Open ASR Leaderboard datasets are used to evaluate the performance of this model. All are commonly used for benchmarking English ASR systems. Audio data is typically processed into a 16kHz mono channel format for ASR evaluation, consistent with benchmarks like the Open ASR Leaderboard.

Hugging Face Open-ASR-Leaderboard Performance

The performance of Automatic Speech Recognition (ASR) models is measured using Word Error Rate (WER). Because this model is trained on a large and diverse dataset spanning multiple domains, it is generally more robust and accurate across various types of audio.
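As a rough illustration of the WER metric mentioned above, the sketch below computes it as the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name `word_error_rate` and the sample sentences are hypothetical, and the sketch applies no text normalization, so it is not the leaderboard's scoring pipeline.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = word-level edit distance / number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous_row[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous_row[j - 1] + (ref_word != hyp_word)
            deletion = previous_row[j] + 1
            insertion = current_row[j - 1] + 1
            current_row.append(min(substitution, deletion, insertion))
        previous_row = current_row
    return previous_row[-1] / len(ref_words)


if __name__ == "__main__":
    # Hypothetical example: one reference word ("the") is missing.
    reference = "the cat sat on the mat"
    hypothesis = "the cat sat on mat"
    print(f"WER: {100 * word_error_rate(reference, hypothesis):.2f}%")  # WER: 16.67%
```

WER figures are usually reported as percentages over a whole test set, so a full evaluation would sum the edit distances and reference word counts across all utterances rather than averaging per-utterance scores.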
Base Performance

The table below summarizes the WER (%) using a Transducer decoder with greedy decoding (without an external language model):

| Model | Avg WER | AMI | Earnings-22 | GigaSpeech | LS test-clean | LS test-other | SPGI Speech | TEDLIUM-v3 | VoxPopuli |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| parakeet-tdt-0.6b-v2 | 6.05 | 11.16 | 11.15 | 9.74 | 1.69 | 3.19 | 2.17 | 3.38 | 5.95 |

Noise Robustness

Performance across different Signal-to-Noise Ratios (SNR) using MUSAN music and noise samples:

| SNR Level | Avg WER | AMI | Earnings | GigaSpeech | LS test-clean | LS test-other | SPGI | Tedlium | VoxPopuli | Relative Change |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| Clean | 6.05 | 11.16 | 11.15 | 9.74 | 1.69 | 3.19 | 2.17 | 3.38 | 5.95 | - |
| SNR 10 | 6.95 | 14.38 | 12.04 | 10.24 | 1.92 | 4.13 | 2.84 | 3.63 | 6.38 | -14.75% |
| SNR 5 | 8.23 | 18.07 | 13.82 | 11.18 | 2.33 | 5.58 | 3.81 | 4.24 | 6.81 | -35.97% |
| SNR 0 | 11.88 | 25.43 | 18.59 | 14.32 | 4.40 | 10.07 | 7.27 | 6.42 | 8.54 | -96.28% |
| SNR -5 | 20.26 | 36.57 | 28.06 | 22.27 | 11.82 | 19.91 | 16.14 | 13.07 | 14.23 | -234.66% |

Telephony Audio Performance

Performance comparison between standard 16kHz audio and telephony-style audio (using μ-law encoding with 16kHz→8kHz→16kHz conversion):

| Audio Format | Avg WER | AMI | Earnings | GigaSpeech | LS test-clean | LS test-other | SPGI | Tedlium | VoxPopuli | Relative Change |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| Standard 16kHz | 6.05 | 11.16 | 11.15 | 9.74 | 1.69 | 3.19 | 2.17 | 3.38 | 5.95 | - |
| μ-law 8kHz | 6.32 | 11.98 | 11.16 | 10.02 | 1.78 | 3.52 | 2.20 | 3.38 | 6.52 | -4.10% |

These WER scores were obtained using greedy decoding without an external language model. Additional evaluation details are available on the Hugging Face ASR Leaderboard. [6]

[1] Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition
[2] Efficient Sequence Transduction by Jointly Predicting Tokens and Durations
[4] Youtube-commons: A massive open corpus for conversational and multimodal data
[5] Yodas: Youtube-oriented dataset for audio and speech
[7] MOSEL: 950,000 Hours of Speech Data for Open-Source Speech Foundation Model Training on EU Languages
[8] Granary: Speech Recognition and Translation Dataset in 25 European Languages

Test Hardware:

- NVIDIA A10
- NVIDIA A100
- NVIDIA A30
- NVIDIA H100
- NVIDIA L4
- NVIDIA L40
- NVIDIA Turing T4
- NVIDIA Volta V100

Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their supporting model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards here.

Please report security vulnerabilities or NVIDIA AI Concerns here.
Field | Response
:---|:---
Participation considerations from adversely impacted groups (protected classes) in model design and testing | None
Measures taken to mitigate against unwanted bias | None

Field | Response
:---|:---
Intended Domain | Speech to Text Transcription
Model Type | FastConformer
Intended Users | This model is intended for developers, researchers, academics, and industries building conversational applications.
Output | Text
Describe how the model works | Speech input is encoded into embeddings and passed into a conformer-based model, which outputs a text response.
Name the adversely impacted groups this has been tested to deliver comparable outcomes regardless of | Not Applicable
Technical Limitations & Mitigation | Transcripts may not be 100% accurate. Accuracy varies based on language and characteristics of input audio (Domain, Use Case, Accent, Noise, Speech Type, Context of speech, etc.)
Verified to have met prescribed NVIDIA quality standards | Yes
Performance Metrics | Word Error Rate
Potential Known Risks | If a word was not seen in training and is not present in the vocabulary, it is not likely to be recognized. Not recommended for word-for-word/incomplete sentences, as accuracy varies based on the context of the input text.
Licensing | GOVERNING TERMS: Use of this model is governed by the CC-BY-4.0 license.

Field | Response
:---|:---
Generatable or reverse engineerable personal data? | None
Personal data used to create this model? | None
Is there provenance for all datasets used in training? | Yes
Does data labeling (annotation, metadata) comply with privacy laws? | Yes
Is data compliant with data subject requests for data correction or removal, if such a request was made? | No, not possible with externally-sourced data.
Applicable Privacy Policy | https://www.nvidia.com/en-us/about-nvidia/privacy-policy/

Field | Response
:---|:---
Model Application(s) | Speech to Text Transcription
Describe the life critical impact | None
Use Case Restrictions | Abide by CC-BY-4.0 License
Model and dataset restrictions | The Principle of Least Privilege (PoLP) is applied, limiting access for dataset generation and model development. Restrictions enforce dataset access during training, and dataset license constraints are adhered to.

—

3,643,785

1,368

parakeet-rnnt-0.6b

`parakeet-rnnt-0.6b` is an ASR model that transcribes speech into the lowercase English alphabet. This model was jointly developed by the NVIDIA NeMo and Suno.ai teams. It is an XL version of the FastConformer Transducer [1] model (around 600M parameters). See the model architecture section and NeMo documentation for complete architecture details.

Use of this model is covered by the CC-BY-4.0 license. By downloading the public and release version of the model, you accept the terms and conditions of the CC-BY-4.0 license.

To train, fine-tune, or play with the model, you will need to install NVIDIA NeMo. We recommend installing it after you have installed the latest PyTorch version.

The model is available for use in the NeMo toolkit [3], and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.

Transcribing using Python

First, let's get a sample.

This model accepts 16000 Hz mono-channel audio (wav files) as input.

This model provides transcribed speech as a string for a given audio sample.

FastConformer [1] is an optimized version of the Conformer model with 8x depthwise-separable convolutional downsampling. The model is trained in a multitask setup with a Transducer decoder (RNNT) loss. You may find more information on the details of FastConformer here: Fast-Conformer Model.

The NeMo toolkit [3] was used to train the models for several hundred epochs.
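Since the card above states that the model expects 16000 Hz mono-channel wav input, a quick pre-flight check can catch mismatched files before transcription. The following is a minimal sketch using only Python's standard `wave` module. The helper name `check_wav_format` and the constants are illustrative assumptions, not part of NeMo.

```python
import wave

EXPECTED_SAMPLE_RATE = 16000  # Hz, as stated in the model card
EXPECTED_CHANNELS = 1         # mono-channel audio


def check_wav_format(path: str) -> list:
    """Return a list of format problems; an empty list means the file matches."""
    problems = []
    with wave.open(path, "rb") as wav_file:
        sample_rate = wav_file.getframerate()
        channels = wav_file.getnchannels()
    if sample_rate != EXPECTED_SAMPLE_RATE:
        problems.append(
            f"sample rate is {sample_rate} Hz, expected {EXPECTED_SAMPLE_RATE} Hz"
        )
    if channels != EXPECTED_CHANNELS:
        problems.append(f"found {channels} channels, expected mono")
    return problems
```

Files that fail the check would need to be resampled or downmixed with an audio tool of your choice before being passed to the model. The `wave` module only reads uncompressed PCM wav files, so other formats would need a different reader.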
These models were trained with this example script and this base config.

The tokenizers for these models were built using the text transcripts of the train set with this script.

The model was trained on 64K hours of English speech collected and prepared by the NVIDIA NeMo and Suno teams.

The training dataset consists of a private subset with 40K hours of English speech plus 24K hours from the following public datasets:

- Librispeech 960 hours of English speech
- Fisher Corpus
- Switchboard-1 Dataset
- WSJ-0 and WSJ-1
- National Speech Corpus (Part 1, Part 6)
- VCTK
- VoxPopuli (EN)
- Europarl-ASR (EN)
- Multilingual Librispeech (MLS EN), 2,000-hour subset
- Mozilla Common Voice (v7.0)
- People's Speech, 12,000-hour subset

The performance of Automatic Speech Recognition models is measured using Word Error Rate. Since this model is trained on multiple domains and a much larger corpus, it will generally perform better at transcribing audio in general.

The following table summarizes the performance of the available models in this collection with the Transducer decoder. Performance of the ASR models is reported in terms of Word Error Rate (WER%) with greedy decoding.

|Version|Tokenizer|Vocabulary Size|AMI|Earnings-22|Giga Speech|LS test-clean|LS test-other|SPGI Speech|TEDLIUM-v3|Vox Populi|Common Voice|
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1.22.0 | SentencePiece Unigram | 1024 | 17.55 | 14.78 | 10.07 | 1.63 | 3.06 | 3.47 | 3.86 | 6.05 | 8.07 |

These are greedy WER numbers without an external LM. More details on evaluation can be found at the HuggingFace ASR Leaderboard.

NVIDIA Riva is an accelerated speech AI SDK deployable on-prem, in all clouds, multi-cloud, hybrid, on edge, and embedded. Additionally, Riva provides:

- World-class out-of-the-box accuracy for the most common languages with model checkpoints trained on proprietary data with hundreds of thousands of GPU-compute hours
- Best-in-class accuracy with run-time word boosting (e.g., brand and product names) and customization of acoustic model, language model, and inverse text normalization
- Streaming speech recognition, Kubernetes-compatible scaling, and enterprise-grade support

Although this model isn't supported yet by Riva, the list of supported models is here. Check out the Riva live demo.

References

[1] Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition

—

3,110,585

bigvgan_v2_22khz_80band_256x

---
license: mit
license_link: https://huggingface.co/nvidia/BigVGAN/blob/main/LICENSE
tags:
- neural-vocoder
- audio-generation
library_name: PyTorch
pipeline_tag: audio-to-audio
---

nvidia

parakeet-tdt-0.6b-v2

parakeet-rnnt-0.6b

bigvgan_v2_22khz_80band_256x

Llama-3.1-Nemotron-Nano-VL-8B-V1

bigvgan_v2_44khz_128band_512x

Cosmos-Reason1-7B

segformer-b0-finetuned-ade-512-512

speakerverification_en_titanet_large

prompt-task-and-complexity-classifier

mit-b2

mit-b3

Llama-4-Scout-17B-16E-Instruct-FP8

NV-Embed-v2

segformer-b1-finetuned-ade-512-512

segformer-b5-finetuned-ade-640-640

parakeet-ctc-1.1b

canary-1b-flash

parakeet-tdt_ctc-110m

segformer-b5-finetuned-cityscapes-1024-1024

Llama-3_3-Nemotron-Super-49B-v1_5

DeepSeek-R1-0528-FP4

llama-embed-nemotron-8b

C-RADIOv3-B

DeepSeek-R1-0528-NVFP4

NVIDIA-Nemotron-Nano-9B-v2

NVIDIA-Nemotron-Nano-9B-v2-Base

DeepSeek-R1-0528-NVFP4-v2

canary-1b-v2

mit-b0

DeepSeek-R1-0528-FP4-v2

difix_ref

parakeet-tdt-0.6b-v3

NVLM-D-72B

dragon-multiturn-query-encoder

dragon-multiturn-context-encoder

segformer-b4-finetuned-cityscapes-1024-1024

segformer-b2-finetuned-ade-512-512

MambaVision-S-1K

Cosmos-Transfer2.5-2B

Llama-3.3-70B-Instruct-FP8

omni-embed-nemotron-3b

Llama-3.3-70B-Instruct-NVFP4

Llama-3.3-70B-Instruct-FP4

Llama-3.1-Nemotron-Nano-4B-v1.1

Llama-3.1-8B-Instruct-FP8

RADIO-L

gpt-oss-120b-Eagle3

Aegis-AI-Content-Safety-LlamaGuard-Defensive-1.0

DeepSeek-V3-0324-NVFP4

OpenReasoning-Nemotron-7B

DeepSeek-V3-0324-FP4

parakeet-tdt_ctc-1.1b

Llama-4-Scout-17B-16E-Instruct-NVFP4

Llama-3.1-Nemotron-70B-Instruct-HF

Llama-4-Scout-17B-16E-Instruct-FP4

Cosmos-Guardrail1

stt_it_fastconformer_hybrid_large_pc

Cosmos-Predict2.5-2B

llama-nemotron-embed-1b-v2

segformer-b0-finetuned-cityscapes-1024-1024

diar_streaming_sortformer_4spk-v2

NVIDIA-Nemotron-Nano-12B-v2

Llama-3.1-8B-Instruct-NVFP4

NVIDIA-Nemotron-Nano-9B-v2-FP8

bigvgan_v2_24khz_100band_256x

Llama-3.1-8B-Instruct-FP4

NVIDIA-Nemotron-Nano-12B-v2-VL-FP8

low-frame-rate-speech-codec-22khz

llama-nemoretriever-colembed-3b-v1

C-RADIO

stt_en_conformer_ctc_large

Nemotron-H-8B-Base-8K

canary-qwen-2.5b

omnivinci

Llama-3_3-Nemotron-Super-49B-v1_5-FP8

Llama-3_3-Nemotron-Super-49B-v1

Llama-3.1-Nemotron-Nano-8B-v1

Nemotron-H-56B-Base-8K

NVIDIA-Nemotron-Nano-12B-v2-VL-BF16