distil-whisper

11 models • 1 total models in database


distil-large-v3

Language: en. License: MIT. Library: transformers. Tags: audio, automatic-speech-recognition, transformers.js. Pipeline tag: automatic-speech-recognition. Widget samples: LibriSpeech sample 1 (https://cdn-media.huggingface.co/speech_samples/sample1.flac), LibriSpeech sample 2 (https://cdn-media.huggingface.co/speech_samples/sample2.flac).

license:mit

1,286,613

348

distil-large-v2

Distil-Whisper was proposed in the paper Robust Knowledge Distillation via Large-Scale Pseudo Labelling. It is a distilled version of the Whisper model that is 6 times faster, 49% smaller, and performs within 1% WER on out-of-distribution evaluation sets. This is the repository for distil-large-v2, a distilled variant of Whisper large-v2.

| Model | Params / M | Rel. Latency ↑ | Short-Form WER ↓ | Long-Form WER ↓ |
|------------------|------------|----------------|------------------|-----------------|
| large-v3 | 1550 | 1.0 | 8.4 | 11.0 |
| large-v2 | 1550 | 1.0 | 9.1 | 11.7 |
| | | | | |
| distil-large-v3 | 756 | 6.3 | 9.7 | 10.8 |
| distil-large-v2 | 756 | 5.8 | 10.1 | 11.6 |
| distil-medium.en | 394 | 6.8 | 11.1 | 12.4 |
| distil-small.en | 166 | 5.6 | 12.1 | 12.8 |

Update: following the release of OpenAI's Whisper large-v3, an updated distil-large-v3 model was published. This distil-large-v3 model surpasses the performance of the distil-large-v2 model, with no architecture changes and better support for sequential long-form generation. It is therefore recommended to use distil-large-v3 in place of distil-large-v2.

Note: Distil-Whisper is currently only available for English speech recognition. We are working with the community to distill Whisper on other languages. If you are interested in distilling Whisper in your language, check out the provided training code. We will update the Distil-Whisper repository with multilingual checkpoints when ready!

Distil-Whisper is supported in Hugging Face 🤗 Transformers from version 4.35 onwards. To run the model, first install the latest version of the Transformers library. For this example, we'll also install 🤗 Datasets to load a toy audio dataset from the Hugging Face Hub.

The model can be used with the `pipeline` class to transcribe short-form audio files (< 30 seconds).
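
As a rough illustration of the `pipeline` usage described above, here is a minimal sketch. The helper names `asr_pipeline_kwargs` and `transcribe` are hypothetical and only exist for this example, the model id `distil-whisper/distil-large-v2` is assumed from the repository name, and the `torch_dtype` keyword may be spelled differently depending on your Transformers version. The heavy imports are deferred so the argument-building part can be inspected without downloading anything.

```python
def asr_pipeline_kwargs(model_id="distil-whisper/distil-large-v2",
                        chunk_length_s=None, batch_size=None):
    """Collect keyword arguments for transformers.pipeline (no downloads here)."""
    kwargs = {
        "task": "automatic-speech-recognition",
        "model": model_id,  # assumed Hub id for this checkpoint
    }
    # Optional settings for chunked, batched long-form transcription.
    if chunk_length_s is not None:
        kwargs["chunk_length_s"] = chunk_length_s
    if batch_size is not None:
        kwargs["batch_size"] = batch_size
    return kwargs


def transcribe(audio, **overrides):
    """Transcribe a file path or audio dict; downloads the model on first use."""
    # Imported lazily so the helper above works without torch/transformers.
    import torch
    from transformers import pipeline

    use_cuda = torch.cuda.is_available()
    pipe = pipeline(
        **asr_pipeline_kwargs(**overrides),
        torch_dtype=torch.float16 if use_cuda else torch.float32,
        device="cuda:0" if use_cuda else "cpu",
    )
    return pipe(audio)["text"]
```

For a short clip, `transcribe("audio.mp3")` would be enough (the file name is a placeholder). For longer recordings, something like `transcribe("audio.mp3", chunk_length_s=15, batch_size=16)` switches on chunking and batching; the batch size of 16 is an arbitrary choice for illustration.
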
In practice, the chunked long-form algorithm is 9x faster than the sequential algorithm proposed by OpenAI in the Whisper paper (see Table 7 of the Distil-Whisper paper). To enable chunking, pass the `chunk_length_s` parameter to the `pipeline`. For Distil-Whisper, a chunk length of 15 seconds is optimal. To activate batching, pass the argument `batch_size`.

Distil-Whisper can be used as an assistant model to Whisper for speculative decoding. Speculative decoding mathematically ensures that exactly the same outputs as Whisper are obtained while being 2 times faster. This makes it the perfect drop-in replacement for existing Whisper pipelines, since the same outputs are guaranteed. To use it, load the assistant Distil-Whisper model separately from the main Whisper pipeline, then specify it as the "assistant model" for generation.

You can apply additional speed and memory improvements to Distil-Whisper, which we cover in the following.

We recommend using Flash-Attention 2 if your GPU allows for it. To do so, first install Flash Attention, then pass `use_flash_attention_2=True` to `from_pretrained`.

If your GPU does not support Flash Attention, we recommend making use of BetterTransformer. To do so, first install optimum, then convert your model to a "BetterTransformer" model before using it.

To use the model in the original Whisper format, first ensure you have the `openai-whisper` package installed. You can then transcribe a sample file from the LibriSpeech dataset loaded using 🤗 Datasets. To transcribe a local audio file, simply pass the path to the audio file as the `audio` argument to `transcribe`.

Distil-Whisper can be run from the Whisper.cpp repository with the original sequential long-form transcription algorithm. In a provisional benchmark on Mac M1, `distil-large-v2` is 2x faster than `large-v2`, while performing to within 0.1% WER over long-form audio.
Note that future releases of Distil-Whisper will place more emphasis on faster CPU inference. By distilling smaller encoders, we aim to achieve similar speed-ups to what we obtain on GPU.

Steps for getting started:

1. Clone the Whisper.cpp repository.
2. Download the ggml weights for `distil-medium.en` from the Hugging Face Hub.

Note that if you do not have the `huggingface_hub` package installed, you can also download the weights with `wget`.

Note: Due to the large model size, we recommend running this model server-side with Node.js (instead of in-browser).

Through an integration with Hugging Face Candle 🕯️, Distil-Whisper is now available in the Rust library 🦀

Benefit from:

* Optimised CPU backend with optional MKL support for x86 and Accelerate for Macs
* CUDA backend for efficiently running on GPUs, multiple GPU distribution via NCCL
* WASM support: run Distil-Whisper in a browser

Steps for getting started:

1. Install `candle-core` as explained here
2. Clone the `candle` repository locally.
5. To specify your own audio file, add the `--input` flag.

Distil-Whisper inherits the encoder-decoder architecture from Whisper. The encoder maps a sequence of speech vector inputs to a sequence of hidden-state vectors. The decoder auto-regressively predicts text tokens, conditional on all previous tokens and the encoder hidden-states. Consequently, the encoder is only run forward once, whereas the decoder is run as many times as the number of tokens generated. In practice, this means the decoder accounts for over 90% of total inference time. Thus, to optimise for latency, the focus should be on minimising the inference time of the decoder.

To distill the Whisper model, we reduce the number of decoder layers while keeping the encoder fixed. The encoder (shown in green) is entirely copied from the teacher to the student and frozen during training.
The student's decoder consists of only two decoder layers, which are initialised from the first and last decoder layer of the teacher (shown in red). All other decoder layers of the teacher are discarded. The model is then trained on a weighted sum of the KL divergence and pseudo-label loss terms.

Distil-Whisper can be evaluated on the LibriSpeech validation.clean dataset in streaming mode, meaning no audio data has to be downloaded to your local device. First, install the required packages, including 🤗 Datasets to stream and load the audio data, and 🤗 Evaluate to perform the WER calculation. Evaluation can then be run end-to-end.

Distil-Whisper is intended to be a drop-in replacement for Whisper on English speech recognition. In particular, it achieves comparable WER results over out-of-distribution test data, while being 6x faster over both short and long-form audio.

Distil-Whisper is trained on 22,000 hours of audio data from 9 open-source, permissively licensed speech datasets on the Hugging Face Hub:

| Dataset | Size / h | Speakers | Domain | Licence |
|-----------------|----------|----------|-----------------------------|-----------------|
| People's Speech | 12,000 | unknown | Internet Archive | CC-BY-SA-4.0 |
| Common Voice 13 | 3,000 | unknown | Narrated Wikipedia | CC0-1.0 |
| GigaSpeech | 2,500 | unknown | Audiobook, podcast, YouTube | apache-2.0 |
| Fisher | 1,960 | 11,900 | Telephone conversations | LDC |
| LibriSpeech | 960 | 2,480 | Audiobooks | CC-BY-4.0 |
| VoxPopuli | 540 | 1,310 | European Parliament | CC0 |
| TED-LIUM | 450 | 2,030 | TED talks | CC-BY-NC-ND 3.0 |
| SwitchBoard | 260 | 540 | Telephone conversations | LDC |
| AMI | 100 | unknown | Meetings | CC-BY-4.0 |
| | | | | |
| Total | 21,770 | 18,260+ | | |

The combined dataset spans 10 distinct domains and over 50k speakers.
The diversity of this dataset is crucial to ensuring the distilled model is robust to audio distributions and noise.

The audio data is then pseudo-labelled using the Whisper large-v2 model: we use Whisper to generate predictions for all the audio in our training set and use these as the target labels during training. Using pseudo-labels ensures that the transcriptions are consistently formatted across datasets and provides sequence-level distillation signal during training.

The Whisper pseudo-label predictions are subject to mis-transcriptions and hallucinations. To ensure we only train on accurate pseudo-labels, we employ a simple WER heuristic during training. First, we normalise the Whisper pseudo-labels and the ground truth labels provided by each dataset. We then compute the WER between these labels. If the WER exceeds a specified threshold, we discard the training example. Otherwise, we keep it for training.

Section 9.2 of the Distil-Whisper paper demonstrates the effectiveness of this filter for improving downstream performance of the distilled model. We also partially attribute Distil-Whisper's robustness to hallucinations to this filter.

The model was trained for 80,000 optimisation steps (or eight epochs). The Tensorboard training logs can be found under: https://huggingface.co/distil-whisper/distil-large-v2/tensorboard?params=scalars#frame

The distilled model performs to within 1% WER of Whisper on out-of-distribution (OOD) short-form audio, and outperforms Whisper by 0.1% on OOD long-form audio. This performance gain is attributed to lower hallucinations.

For a detailed per-dataset breakdown of the evaluation results, refer to Tables 16 and 17 of the Distil-Whisper paper.

Distil-Whisper is also evaluated on the ESB benchmark datasets as part of the OpenASR leaderboard, where it performs to within 0.2% WER of Whisper.
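
The WER filtering heuristic described earlier (normalise both transcripts, compute the WER, discard the example if it exceeds a threshold) can be sketched in plain Python. This is only an illustration under stated assumptions: the `normalise` function is a simplified stand-in for the normaliser actually used, the function names are hypothetical, and the default threshold of 0.1 is a placeholder rather than a value taken from this card.

```python
import re


def normalise(text):
    """Lowercase, replace punctuation with spaces, and split into words.

    Simplified stand-in for a real English text normaliser (assumption).
    """
    return re.sub(r"[^\w\s']", " ", text.lower()).split()


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = normalise(reference), normalise(hypothesis)
    if not ref:
        raise ValueError("reference transcript is empty after normalisation")
    # Levenshtein distance over words, keeping one row of the DP table.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


def keep_example(ground_truth, pseudo_label, threshold=0.1):
    """Keep a training example only if the pseudo-label WER is within threshold."""
    return word_error_rate(ground_truth, pseudo_label) <= threshold
```

With these definitions, a pseudo-label that differs from the ground truth only in casing and punctuation is kept, while one with a substituted word in a six-word utterance (WER of 1/6) is discarded at the placeholder threshold.
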
Training and evaluation code to reproduce Distil-Whisper is available under the Distil-Whisper repository: https://github.com/huggingface/distil-whisper/tree/main/training

Distil-Whisper inherits the MIT license from OpenAI's Whisper model.

If you use this model, please consider citing the Distil-Whisper paper.

Acknowledgements

* OpenAI for the Whisper model and original codebase
* Hugging Face 🤗 Transformers for the model integration
* Google's TPU Research Cloud (TRC) programme for Cloud TPU v4s
* `@rsonavane` for releasing an early iteration of Distil-Whisper on the LibriSpeech dataset

license:mit

6,997

512

distil-large-v3.5-ct2

license:mit

655

distil-large-v3-ct2

license:mit

distil-large-v3.5-ONNX

license:mit

distil-large-v3-ggml

license:mit

distil-large-v3.5-ggml

license:mit

distil-large-v3-openai

license:mit