SiRoZaRuPa
japanese-wav2vec2-base-turntaking-CSJ
This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.

- Developed by: [More Information Needed]
- Funded by [optional]: [More Information Needed]
- Shared by [optional]: [More Information Needed]
- Model type: [More Information Needed]
- Language(s) (NLP): [More Information Needed]
- License: [More Information Needed]
- Finetuned from model [optional]: [More Information Needed]
- Repository: [More Information Needed]
- Paper [optional]: [More Information Needed]
- Demo [optional]: [More Information Needed]

Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. More information is needed for further recommendations.

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

- Hardware Type: [More Information Needed]
- Hours used: [More Information Needed]
- Cloud Provider: [More Information Needed]
- Compute Region: [More Information Needed]
- Carbon Emitted: [More Information Needed]
japanese-wav2vec2-base-backchannel-CSJ
This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.

- Developed by: [More Information Needed]
- Funded by [optional]: [More Information Needed]
- Shared by [optional]: [More Information Needed]
- Model type: [More Information Needed]
- Language(s) (NLP): [More Information Needed]
- License: [More Information Needed]
- Finetuned from model [optional]: [More Information Needed]
- Repository: [More Information Needed]
- Paper [optional]: [More Information Needed]
- Demo [optional]: [More Information Needed]

Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. More information is needed for further recommendations.

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

- Hardware Type: [More Information Needed]
- Hours used: [More Information Needed]
- Cloud Provider: [More Information Needed]
- Compute Region: [More Information Needed]
- Carbon Emitted: [More Information Needed]
Japanese HuBERT Base VADLess ASR RSm
VAD-less Japanese ASR Model

This is a Japanese speech recognition model, fine-tuned from `rinna/japanese-hubert-base` for real-time automatic speech recognition (ASR) in noisy environments. Its main feature is a "VAD-less" architecture, which does not require a separate voice activity detection (VAD) step: non-speech segments (noise or silence) contained in the audio input are explicitly recognized and emitted as the special tokens `[雑音]` (noise) and `[無音]` (silence). By omitting the upstream VAD stage, the model aims to enable highly responsive real-time speech recognition systems.

The model was fine-tuned on the medium set (approx. 1,000 hours) of the ReazonSpeech v2.0 corpus. The training data was created using a distinctive method: concatenating two audio clips, intentionally inserting non-speech segments between them and at the end, and then overlaying noise (babble and pink noise) from the NOISEX-92 dataset.

Model Details

- Base Model: `rinna/japanese-hubert-base`
- Fine-tuning Strategy: Connectionist Temporal Classification (CTC)
- Framework: Transformers
- Sampling Rate: 16,000 Hz
- Output Vocabulary: 3,200 unigram subword units generated with SentencePiece, including 3 special tokens (`[雑音]`, `[無音]`, `[PAD]`)

Evaluation Results (CER %)

| Test set | Clean | 50 dB | 20 dB | 15 dB | 10 dB | 5 dB | 0 dB |
|:--- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| JNAS | 10.61 | 10.57 | 10.16 | 10.10 | 10.43 | 11.64 | 15.95 |
| ReazonSpeech | 12.54 | 12.51 | 12.69 | 12.83 | 13.43 | 14.81 | 19.87 |
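The training-data construction described above (concatenating two clips with inserted non-speech gaps, then overlaying noise at a target SNR) can be sketched as follows. This is a minimal illustration, not the authors' actual pipeline: the function names, the 0.5 s gap length, and the use of plain power-based SNR scaling are assumptions.

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db`,
    then add it to `speech`. Both are 1-D float arrays at the same rate."""
    # Loop or trim the noise to cover the full speech length.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[: len(speech)]

    speech_power = np.mean(speech**2)
    noise_power = np.mean(noise**2)
    # Gain that places the noise at the desired SNR relative to the speech.
    gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + gain * noise

def build_training_sample(clip_a, clip_b, noise, snr_db, gap_sec=0.5, sr=16000):
    """Concatenate two clips with a non-speech gap between them and at the
    end (as the model card describes), then overlay noise at `snr_db`."""
    gap = np.zeros(int(gap_sec * sr))
    clean = np.concatenate([clip_a, gap, clip_b, gap])
    return mix_at_snr(clean, noise, snr_db)
```

The inserted silent gaps are what give the model supervised examples of the `[無音]` region, while the overlaid noise (babble/pink in the original work) provides targets for `[雑音]`.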
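Since the model is trained with CTC, its raw frame-level predictions must be collapsed (repeats merged, blanks dropped) before the text and the non-speech markers appear. The sketch below shows how `[雑音]` and `[無音]` would surface in greedy CTC decoding; the tiny vocabulary and token ids here are invented for illustration, with `[PAD]` assumed to act as the CTC blank (the real model uses a 3,200-entry SentencePiece vocabulary).

```python
PAD = "[PAD]"       # assumed CTC blank token
NOISE = "[雑音]"     # non-speech: noise
SILENCE = "[無音]"   # non-speech: silence

def ctc_greedy_decode(token_ids, id_to_token, blank=PAD):
    """Collapse consecutive repeats and drop blanks; non-speech markers
    are kept in the output just like ordinary text tokens."""
    out = []
    prev = None
    for i in token_ids:
        tok = id_to_token[i]
        if tok != prev and tok != blank:
            out.append(tok)
        prev = tok
    return out

# Toy vocabulary and a toy frame-level argmax sequence.
vocab = {0: PAD, 1: "こ", 2: "ん", 3: NOISE, 4: SILENCE}
frames = [0, 1, 1, 0, 2, 3, 3, 0, 4]
print(ctc_greedy_decode(frames, vocab))  # → ['こ', 'ん', '[雑音]', '[無音]']
```

A downstream application would typically strip or act on the `[雑音]`/`[無音]` tokens (e.g., for endpointing) instead of running a separate VAD pass, which is the point of the VAD-less design.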