bond005

17 models

wav2vec2-large-ru-golos

license:apache-2.0
1,428
16

whisper-podlodka-turbo

Whisper-Podlodka-Turbo is a fine-tuned version of Whisper-Large-V3-Turbo, optimized for high-quality Russian speech recognition with proper punctuation and capitalization. The main goals of the fine-tuning were to improve the quality of speech recognition and speech translation for Russian and English, to enhance resistance to background noise, and to reduce the occurrence of hallucinations when processing non-speech audio signals.

Key features:

- 🎯 Improved Russian speech recognition quality compared to the base Whisper-Large-V3-Turbo model
- ✍️ Correct Russian punctuation and capitalization
- 🎧 Enhanced background noise resistance
- 🚫 Reduced number of hallucinations, especially in non-speech segments

Supported tasks:

- Automatic Speech Recognition (ASR):
  - 🇷🇺 Russian (primary focus)
  - 🇬🇧 English
- Speech Translation:
  - Russian ↔️ English
- Speech Language Detection (including non-speech detection)

Whisper-Podlodka-Turbo is supported in Hugging Face 🤗 Transformers. To run the model, first install the Transformers library. For this example, we'll also install 🤗 Datasets to load a toy audio dataset from the Hugging Face Hub, and 🤗 Accelerate to reduce the model loading time. I also recommend installing `whisper-lid` for initial spoken language detection.

The model can be used with the `pipeline` class to transcribe audio in an arbitrary language. In addition to plain transcription, the model can also provide timestamps for recognized speech fragments.

While accurate transcription is straightforward for audio segments under thirty seconds, practical applications often require processing extensive recordings ranging from several minutes to multiple hours. This necessitates specialized techniques, such as a sliding-window approach, to overcome memory constraints and preserve contextual coherence across the entire signal.
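The installation and `pipeline` steps described above can be sketched as follows. The model id comes from this page; the audio path `sample.wav` is a placeholder, and the output format is that of the standard 🤗 Transformers ASR pipeline:

```python
# First: pip install transformers datasets accelerate whisper-lid
from transformers import pipeline

MODEL_ID = "bond005/whisper-podlodka-turbo"

def build_asr():
    """Create the ASR pipeline (downloads the model weights on first use)."""
    return pipeline("automatic-speech-recognition", model=MODEL_ID)

if __name__ == "__main__":
    asr = build_asr()
    # "sample.wav" is a placeholder for an audio file in an arbitrary language.
    result = asr("sample.wav")
    print(result["text"])
    # Timestamps for recognized speech fragments:
    result = asr("sample.wav", return_timestamps=True)
    for chunk in result["chunks"]:
        print(chunk["timestamp"], chunk["text"])
```

For translation instead of transcription, the same pipeline accepts `generate_kwargs={"task": "translate"}`, following Whisper's built-in task tokens.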
This long-form capability enables accurate transcription of lectures, interviews, and meetings. Along with the special language tokens, the model can also return the special `<|nospeech|>` token if the input audio signal does not contain any speech (for details, see Section 2.3 of the Whisper paper). This ability forms the basis of a speech/non-speech classification algorithm.

In addition to transcription, the model also performs speech translation, although it translates better from Russian into English than from English into Russian: translations in both directions contain some errors, but in the English-to-Russian direction these errors are more significant.

Known limitations:

- While improvements are observed for English and translation tasks, statistically significant advantages are confirmed only for Russian ASR
- The model's performance on code-switching speech (where speakers alternate between Russian and English within the same utterance) has not been specifically evaluated
- The model inherits the basic limitations of the Whisper architecture

The model was fine-tuned on a composite dataset including:

- Common Voice (Ru, En)
- Podlodka Speech (Ru)
- Taiga Speech (Ru, synthetic)
- Golos Farfield and Golos Crowd (Ru)
- Sova Rudevices (Ru)
- Audioset (non-speech audio)

1. Data Augmentation:
   - Dynamic mixing of speech with background noise and music
   - Gradual reduction of the signal-to-noise ratio during training
2. Text Data Processing:
   - Russian punctuation and capitalization restoration using bond005/ruT5-ASR-large (for speech sub-corpora without punctuated annotations)
   - Parallel Russian-English text generation using Qwen/Qwen2.5-14B-Instruct
   - Multi-stage validation of generated texts using bond005/xlm-roberta-xl-hallucination-detector to minimize hallucinations
3. Training Strategy:
   - Progressive increase in training example complexity
   - Balanced sampling between speech and non-speech data
   - Special handling of language tokens and no-speech detection (`<|nospeech|>`)

The experimental evaluation focused on two main tasks:

1. Russian speech recognition
2. Speech activity detection (binary classification "speech/non-speech")

Testing was performed on publicly available Russian speech corpora. Speech recognition was conducted using the standard pipeline from the Hugging Face 🤗 Transformers library. Because a known bug limits this pipeline's language identification and non-speech detection, the whisper-lid library was used to detect the presence or absence of speech in the signal.

The quality of Russian speech recognition was tested on the test sub-sets of six different datasets:

- Common Voice 11 Ru
- Podlodka Speech
- Golos Farfield
- Golos Crowd
- Sova Rudevices
- Russian Librispeech

The quality of long-form Russian speech recognition was tested on the dangrebenkin/longaudioyoutubelectures dataset, developed by Daniel Grebenkin. This dataset contains seven long-form (20-40 minute) Russian audio recordings that were manually annotated. The recordings are excerpts from Russian scientific lectures on various subjects (philology, mathematics, history, etc.) and cover a variety of topics and speaking styles. All recordings were made in relatively quiet, lecture-hall-like acoustic environments, although some natural background noises, such as the sound of chalk on a blackboard, are present.
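To make the recognition-quality evaluation concrete, here is a toy sketch of a WER computation. The normalization below only lowercases and strips punctuation; it is a simplified stand-in for the fuller normalization (numeral unification, transliteration variants) used in the actual experiments, which is not published on this page:

```python
import re

def normalize(text: str) -> str:
    """Toy normalizer: lowercase and strip punctuation.
    The real evaluation additionally unifies numeral representations,
    foreign-word scripts, and transliteration variants (not reproduced here)."""
    return re.sub(r"[^\w\s]", "", text.lower()).strip()

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via word-level Levenshtein distance."""
    ref = normalize(reference).split()
    hyp = normalize(hypothesis).split()
    # Dynamic-programming edit-distance table over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

With this sketch, `wer("Привет, мир!", "привет мир")` is 0.0 after normalization, while dropping one of two reference words yields 0.5.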
The quality of the voice activity detection task was tested on the test sub-sets of two different datasets:

- a noised version of Golos Crowd as a source of speech samples
- a filtered sub-set of the Audioset corpus as a source of non-speech samples

Noise was added using a special augmenter capable of simulating the superposition of five different types of acoustic noise (reverberation, speech-like sounds, music, household sounds, and pet sounds) at a given signal-to-noise ratio; in this case, a signal-to-noise ratio of 2 dB was used. The quality of robust Russian speech recognition was tested on the test sub-set of the above-mentioned noised Golos Crowd.

Two evaluation metrics were used:

1. Modified WER (Word Error Rate) for Russian speech recognition quality:
   - Text normalization before WER calculation:
     - Unification of numeral representations (digits/words)
     - Standardization of foreign words (Cyrillic/Latin scripts)
     - Accounting for valid transliteration variants
   - Enables a more accurate assessment of semantic recognition accuracy
   - The lower the WER, the better the speech recognition quality
2. F1-score for speech activity detection:
   - Binary classification "speech/non-speech"
   - Evaluation of non-speech segment detection accuracy using the `<|nospeech|>` token
   - The higher the F1-score, the better the voice activity detection quality

For experiments with short audio signals (under 30 seconds), we used standard greedy decoding (`num_beams=1`). For long-form audio, two approaches were tested: simple 30-second chunking and the sequential long-form algorithm. For the sequential long-form mode, the implementation followed the strategy from Section 4.5 of the Whisper paper, with two key hyperparameter differences:

1. Beam Search: The paper implies the use of beam search for optimal performance, while our initial experiments for this task used greedy decoding (`num_beams=1`).
2. Compression Ratio Threshold: A key deviation was the use of a more conservative `compression_ratio_threshold` of 1.35 (compared to 2.4 in the paper).
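Collected as 🤗 Transformers `generate` keyword arguments, the sequential long-form configuration used in these experiments might look as follows; the surrounding model and processor setup is omitted, and `model` and `inputs` in the usage comment are assumed to exist:

```python
# Decoding settings for the sequential long-form experiments described here.
# All keys are standard Whisper generation kwargs in Hugging Face Transformers.
LONG_FORM_GENERATE_KWARGS = {
    "num_beams": 1,                       # greedy decoding (no beam search)
    "compression_ratio_threshold": 1.35,  # more aggressive than the paper's 2.4
    "no_speech_threshold": 0.6,           # voice activity detection (paper default)
    "logprob_threshold": -1.0,            # low-confidence detection (paper default)
    "condition_on_prev_tokens": False,    # context conditioning disabled
    "return_timestamps": True,            # required for sequential long-form decoding
}

# Usage sketch (assuming `model`, `processor`, and prepared `inputs`):
# predicted_ids = model.generate(**inputs, **LONG_FORM_GENERATE_KWARGS)
```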
This lower threshold makes the repetition-detection algorithm significantly more aggressive, triggering fallback mechanisms (e.g., temperature rescoring) sooner to suppress repetitive outputs. The parameters for voice activity detection (`no_speech_threshold=0.6`) and low-confidence detection (`logprob_threshold=-1.0`) were kept aligned with the paper's recommendations. Context conditioning between segments (`condition_on_prev_tokens`) was disabled for this experimental run.

Short-form Russian ASR (WER, %; lower is better):

| Dataset                  | bond005/whisper-podlodka-turbo | openai/whisper-large-v3-turbo |
|--------------------------|--------------------------------|-------------------------------|
| bond005/podlodkaspeech   | 8.17                           | 8.33                          |
| rulibrispeech            | 9.76                           | 10.25                         |
| sberdevicesgolosfarfield | 11.61                          | 20.12                         |
| sberdevicesgoloscrowd    | 11.85                          | 14.55                         |
| sovarudevices            | 15.35                          | 17.70                         |
| commonvoice110           | 5.22                           | 6.63                          |

Long-form Russian ASR (WER, %; lower is better):

| Approach                       | bond005/whisper-podlodka-turbo | openai/whisper-large-v3-turbo |
|--------------------------------|--------------------------------|-------------------------------|
| simple chunking                | 11.66                          | 15.98                         |
| sequential long-form algorithm | 7.84                           | 9.59                          |

Voice activity detection (F1-score; higher is better):

| bond005/whisper-podlodka-turbo | openai/whisper-large-v3-turbo |
|--------------------------------|-------------------------------|
| 0.9235                         | 0.8484                        |

Robust ASR (SNR = 2 dB; speech-like noise, music, etc.; WER, %):

| Dataset                        | bond005/whisper-podlodka-turbo | openai/whisper-large-v3-turbo |
|--------------------------------|--------------------------------|-------------------------------|
| sberdevicesgoloscrowd (noised) | 46.58                          | 75.20                         |

If you use this model in your work, please cite it as:

license:apache-2.0
990
29

whisper-large-v3-ru-podlodka

license:apache-2.0
472
10

wav2vec2-large-ru-golos-with-lm

license:apache-2.0
340
16

wav2vec2-base-ru

license:apache-2.0
320
3

meno-lite-0.1

license:apache-2.0
153
3

wav2vec2-base-ru-birm

license:apache-2.0
87
0

whisper-large-v2-ru-podlodka

license:apache-2.0
44
2

xlm-roberta-xl-hallucination-detector

license:apache-2.0
38
2

ruT5-ASR-large

license:apache-2.0
21
6

FRED-T5-large-instruct-v0.1

license:apache-2.0
18
16

meno-tiny-0.1-gguf

license:apache-2.0
16
0

ruT5-ASR

license:apache-2.0
14
6

wav2vec2-mbart50-ru

license:apache-2.0
8
6

FRED-T5-large-ods-ner-2023

license:apache-2.0
3
2

rubert-entity-embedder

license:apache-2.0
3
0

rubert-multiconer

license:apache-2.0
2
0