espnet
voxcelebs12_ecapa_wavlm_joint
owsm_ctc_v4_1B
voxcelebs12_rawnet3
fastspeech2_conformer
hubert_dummy
kamo-naoyuki-mini_an4_asr_train_raw_bpe_valid.acc.best
fastspeech2_conformer_with_hifigan
fastspeech2_conformer_hifigan
owsm_v3.1_ebf
owsm_v3.2
kan-bayashi_ljspeech_vits
ESPnet2 TTS pretrained model `kan-bayashi/ljspeech_vits` ♻️ Imported from https://zenodo.org/record/5443814/. This model was trained by kan-bayashi using the ljspeech/tts1 recipe in ESPnet. Demo: How to use in ESPnet2
owsm_v4_medium_1B
powsm
🐁 POWSM is the first phonetic foundation model that can perform four phone-related tasks: Phone Recognition (PR), Automatic Speech Recognition (ASR), audio-guided grapheme-to-phoneme conversion (G2P), and audio-guided phoneme-to-grapheme conversion (P2G). Based on the Open Whisper-style Speech Model (OWSM) and trained with IPAPack++, POWSM outperforms or matches specialized PR models of similar size while jointly supporting G2P, P2G, and ASR.

To use the pre-trained model, please install `espnet` and `espnet_model_zoo`. The requirements are:

The recipe can be found in ESPnet: https://github.com/espnet/espnet/tree/master/egs2/powsm/s2t1

Our models are trained on 16 kHz audio with a fixed duration of 20 s. When using the pre-trained model, please ensure the input speech is 16 kHz, and pad or truncate it to 20 s.

To distinguish phone entries from BPE tokens that share the same Unicode, we enclose every phone in slashes and treat them as special tokens. For example, /pʰɔsəm/ would be tokenized as /pʰ//ɔ//s//ə//m/.

See `forcealign.py` in the ESPnet recipe to try out CTC forced alignment with POWSM's encoder! LID is learned implicitly during training, and you may run it with the script below:
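The 16 kHz / 20 s input constraint mentioned above can be handled with a few lines of NumPy. This is an illustrative sketch, not part of the POWSM recipe; `fit_to_20s` is a hypothetical helper name, and the input is assumed to be a mono waveform already sampled at 16 kHz.

```python
import numpy as np

TARGET_SR = 16000             # POWSM expects 16 kHz input
TARGET_LEN = 20 * TARGET_SR   # fixed 20 s window -> 320,000 samples

def fit_to_20s(wav: np.ndarray) -> np.ndarray:
    """Zero-pad or truncate a mono 16 kHz waveform to exactly 20 s."""
    if len(wav) >= TARGET_LEN:
        return wav[:TARGET_LEN]
    return np.pad(wav, (0, TARGET_LEN - len(wav)))

# Example: a 5 s utterance is zero-padded up to the fixed 20 s window.
short = np.zeros(5 * TARGET_SR, dtype=np.float32)
assert fit_to_20s(short).shape == (TARGET_LEN,)
```

If the source audio is not 16 kHz, resample it first (e.g. with `librosa` or `torchaudio`) before applying the pad-or-truncate step.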
owsm_ctc_v3.1_1B
owsm_v4_base_102M
yoshiki_wsj0_2mix_spatialized_enh_tfgridnet_waspaa2023_raw
Wangyou_Zhang_chime4_enh_train_enh_conv_tasnet_raw
powsm_ctc
Xeus
XEUS - A Cross-lingual Encoder for Universal Speech

XEUS is a large-scale multilingual speech encoder from Carnegie Mellon University's WAVLab that covers over 4,000 languages. It is pre-trained on over 1 million hours of publicly available speech data. It requires fine-tuning for downstream tasks such as speech recognition or translation; its hidden states can also be used with k-means for semantic speech tokenization. XEUS uses the E-Branchformer architecture and is trained with HuBERT-style masked prediction of discrete speech tokens extracted from WavLabLM. During training, the input speech is also augmented with acoustic noise and reverberation, making XEUS more robust. The total model size is 577M parameters.

XEUS tops the ML-SUPERB multilingual speech recognition leaderboard, outperforming MMS, w2v-BERT 2.0, and XLS-R, and sets a new state of the art on 4 tasks in the monolingual SUPERB benchmark. More information about XEUS, including download links for our crawled 4,000-language dataset, can be found on the project page and in the paper.

The code for XEUS is still being merged into the main ESPnet repo; in the meantime, it can be used from the following fork:

XEUS supports Flash Attention, which can be installed as follows:
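The k-means speech tokenization of hidden states mentioned above can be illustrated with a NumPy-only sketch. This is not the ESPnet/XEUS code; `kmeans_fit` and `quantize` are hypothetical helpers, and a real pipeline would fit the centroids on encoder outputs from a large corpus.

```python
import numpy as np

def kmeans_fit(feats: np.ndarray, k: int, iters: int = 20) -> np.ndarray:
    """Fit k centroids to frame-level hidden states of shape (n_frames, dim)."""
    # Deterministic init: k frames spread evenly across the sequence.
    idx = np.linspace(0, len(feats) - 1, k).astype(int)
    centroids = feats[idx].copy()
    for _ in range(iters):
        # Assign each frame to its nearest centroid.
        dists = np.linalg.norm(feats[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Recompute centroids; keep the old one if a cluster goes empty.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = feats[labels == j].mean(axis=0)
    return centroids

def quantize(feats: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """Map each frame to the index of its nearest centroid (a discrete token)."""
    dists = np.linalg.norm(feats[:, None, :] - centroids[None, :, :], axis=-1)
    return dists.argmin(axis=1)

# Two well-separated fake "hidden state" clusters -> two distinct tokens.
feats = np.vstack([np.zeros((10, 4)), np.full((10, 4), 5.0)])
cents = kmeans_fit(feats, k=2)
tokens = quantize(feats, cents)
assert tokens[0] != tokens[-1]
```

The resulting integer token sequence is what downstream semantic-token consumers (e.g. discrete-unit language models) would operate on.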
owsm_ctc_v3.2_ft_1B
simpleoier_librispeech_asr_train_asr_conformer7_wavlm_large_raw_en_bpe5000_sp
owsm_v4_small_370M
Turn_taking_prediction_SWBD
jiyang_tang_cvss-c_es-en_discrete_unit
aceopencpop_svs_visinger2_40singer_pretrain
kan-bayashi_jsut_vits_prosody
espnet_tts_vctk_espnet_spk_voxceleb12_rawnet
arecho_base_v0
This model was trained by ftshijt using the universa_unite recipe in espnet. Follow the ESPnet installation instructions if you haven't done that already.
chenda-li-wsj0_2mix_enh_train_enh_conv_tasnet_raw_valid.si_snr.ave
OpusLM_1.7B_Anneal
kan-bayashi_ljspeech_tacotron2
wanchichen_fleurs_multilingual_asr_hubert_frontend
kan-bayashi_vctk_multi_spk_vits
owsm_v2_ebranchformer
voxcelebs12_ecapa_frozen
owls_9B_180K_intermediates
kan-bayashi_csmsc_tts_train_tacotron2_raw_phn_pypinyin_g2p_phone_train.loss.best
kan-bayashi_jsut_fastspeech2
kan-bayashi_ljspeech_fastspeech2
universa-wavlm_base_urgent24_multi-metric_noref
kan-bayashi_jsut_full_band_vits_prosody
dns_icassp21_enh_train_enh_tcn_tf_raw
owsm_v3.1_ebf_base
OpusLM_7B_Anneal
Wangyou_Zhang_chime4_enh_train_enh_beamformer_mvdr_raw
kan-bayashi_csj_asr_train_asr_conformer
yen-ju-lu-dns_ins20_enh_train_enh_blstm_tf_raw_valid.loss.best
english_male_ryanspeech_fastspeech
Wangyou_Zhang_librimix_train_enh_tse_td_speakerbeam_raw
owsmdata_soundstream_16k_200epoch
mixdata_svs_visinger2_spkembed_lang_pretrained
Yushi_Ueda_mini_librispeech_diar_train_diar_raw_max_epoch20_valid.acc.best
myst_wavlm_aed_transformer
arecho_scale_v0
This model was trained by ftshijt using the universa_unite recipe in espnet. Follow the ESPnet installation instructions if you haven't done that already.
kan-bayashi_jsut_vits_accent_with_pause
kan-bayashi_ljspeech_tts_train_joint_conformer_fastspeech2_hifigan_raw-truncated-af8fe0
visinger2-zh-jp-multisinger-svs
Hoon_Chung_zeroth_korean_asr_train_asr_transformer5_raw_bpe_valid.acc.ave
chenda-li-wsj0_2mix_enh_train_enh_rnn_tf_raw_valid.si_snr.ave
kan-bayashi_csmsc_vits
mls-english_soundstream_16k_360epoch
owls_9B_180K
kan-bayashi_ljspeech_joint_finetune_conformer_fastspeech2_hifigan
iam_handwriting_ocr
english_male_ryanspeech_fastspeech2
owsm_v3.1_ebf_small
Shinji_Watanabe_laborotv_asr_train_asr_conformer2_latest33_raw_char_sp_valid.acc.ave
amuse_soundstream_44.1k
mixdata_svs_visinger2_spkemb_lang_pretrained_avg
kan-bayashi_jsut_tts_train_conformer_fastspeech2_tacotron2_teacher_raw-truncated-15ef5f
kan-bayashi_jvs_tts_finetune_jvs010_jsut_vits_raw_phn_jaconv_pyopenjtalk_prosody_latest
kan-bayashi_ljspeech_tts_train_conformer_fastspeech2_raw_phn_tacotron_-truncated-ec9e34
turkish_commonvoice_blstm
fsc_challenge_slu_2pass_transformer
shihlun_asr_whisper_medium_finetuned_librispeech100
voxcelebs12_ska_wavlm_joint
amuse_dac_16k
dac_16k_audio_single_survey
dac_44k_speech_survey
This model was trained by ftshijt using amuse recipe in espnet. Follow the ESPnet installation instructions if you haven't done that already.
myst_ogi_cmu_kids_aed_upsample
This model was trained by eric102004 using the myst_ogi_cmu_kids recipe in espnet. Follow the ESPnet installation instructions if you haven't done that already.

RESULTS

Environments
- date: `Wed Feb 19 19:12:32 CST 2025`
- python version: `3.12.3 | packaged by Anaconda, Inc. | (main, May 6 2024, 19:46:43) [GCC 11.2.0]`
- espnet version: `espnet 202412`
- pytorch version: `pytorch 2.4.0`
- Git hash: `6f722aee1f9593572d5eddfd8cac7075b07cf9ca`
- Commit date: `Thu Feb 6 22:32:07 2025 -0600`

exp/asrtrainasrlr002rawenchardur05filterbah2l4/decodeasrasrmodelvalid.cer.ave10best

WER

|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
|datacmu/dev|237|2170|89.1|8.7|2.2|2.4|13.3|37.1|
|datacmu/test|475|4287|88.3|9.0|2.6|2.4|14.0|40.0|
|datajibo/dev|853|853|19.1|80.9|0.0|219.0|299.9|97.4|
|datajibo/test|1044|1044|20.4|79.6|0.0|306.0|385.6|94.1|
|datamyst/dev|9037|153273|90.4|7.7|1.8|2.9|12.5|66.8|
|datamyst/test|10311|182712|89.9|7.8|2.3|3.1|13.1|65.0|
|dataogiscripted/dev|5426|15375|98.7|1.1|0.2|0.2|1.5|2.2|
|dataogiscripted/test|15945|45419|98.5|1.2|0.3|0.3|1.9|2.7|
|dataogispon/dev|349|13561|81.0|15.2|3.8|3.2|22.2|96.6|
|dataogispon/test|1095|38811|81.8|14.9|3.3|3.8|22.0|95.3|

|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
|datacmu/dev|237|11449|94.9|2.5|2.6|2.4|7.4|37.1|
|datacmu/test|475|22664|94.4|2.3|3.3|2.2|7.8|40.0|
|datajibo/dev|853|2558|55.8|33.3|10.9|362.2|406.4|97.4|
|datajibo/test|1044|3259|62.1|29.5|8.4|498.7|536.6|94.1|
|datamyst/dev|9037|763728|96.2|1.8|2.0|2.9|6.7|66.8|
|datamyst/test|10311|911898|95.8|1.8|2.4|3.1|7.3|65.0|
|dataogiscripted/dev|5426|83141|99.0|0.5|0.4|0.3|1.2|2.2|
|dataogiscripted/test|15945|244467|98.8|0.6|0.5|0.4|1.5|2.7|
|dataogispon/dev|349|58255|90.3|4.8|4.9|3.7|13.4|96.6|
|dataogispon/test|1095|165977|90.9|4.6|4.5|4.3|13.4|95.3|

|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
myst_ogi_cmu_kids_wavlm_aed
This model was trained by eric102004 using the myst_ogi_cmu_kids recipe in espnet. Follow the ESPnet installation instructions if you haven't done that already.

RESULTS

Environments
- date: `Tue Feb 18 10:11:23 CST 2025`
- python version: `3.12.3 | packaged by Anaconda, Inc. | (main, May 6 2024, 19:46:43) [GCC 11.2.0]`
- espnet version: `espnet 202412`
- pytorch version: `pytorch 2.4.0`
- Git hash: `6f722aee1f9593572d5eddfd8cac7075b07cf9ca`
- Commit date: `Thu Feb 6 22:32:07 2025 -0600`

exp/asrtrainasrwavlmtransformerlr03rawenchardur05filter/decodeasrasrmodelvalid.cer.ave10best

WER

|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
|datacmu/dev|237|2170|91.7|6.5|1.8|2.5|10.9|45.1|
|datacmu/test|475|4287|90.2|7.5|2.3|2.0|11.8|47.8|
|datajibo/dev|853|853|29.8|70.2|0.0|189.4|259.7|88.0|
|datajibo/test|1044|1043|29.4|70.6|0.0|259.6|330.2|86.6|
|datamyst/dev|9037|153273|90.8|6.1|3.1|2.5|11.7|62.7|
|datamyst/test|10311|182712|88.7|6.8|4.5|3.0|14.2|61.9|
|dataogiscripted/dev|5426|15375|93.1|5.8|1.1|0.5|7.4|13.2|
|dataogiscripted/test|15945|45419|90.8|7.9|1.3|1.2|10.4|17.7|
|dataogispon/dev|349|13561|82.3|10.7|7.1|3.5|21.3|95.4|
|dataogispon/test|1095|38811|83.4|10.5|6.1|4.3|20.9|95.5|

|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
|datacmu/dev|237|11449|96.8|1.2|2.0|2.4|5.6|45.1|
|datacmu/test|475|22664|95.7|1.3|3.0|2.0|6.4|47.8|
|datajibo/dev|853|2014|65.7|31.6|2.7|423.9|458.2|88.0|
|datajibo/test|1044|2767|69.1|27.9|3.0|516.7|547.6|86.6|
|datamyst/dev|9037|763728|95.7|1.3|3.0|2.4|6.8|62.7|
|datamyst/test|10311|911898|94.0|1.6|4.4|3.0|8.9|61.9|
|dataogiscripted/dev|5426|83141|96.6|1.8|1.7|0.9|4.3|13.2|
|dataogiscripted/test|15945|244467|95.4|2.4|2.2|1.6|6.2|17.7|
|dataogispon/dev|349|58255|89.1|3.1|7.8|4.4|15.3|95.4|
|dataogispon/test|1095|165977|90.6|2.9|6.6|5.0|14.5|95.5|

|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
exp/asrtrainasrwavlmtransformerlr03rawenchardur05filter/decodeasrjiboasrmodelvalid.cer.ave10best

WER

|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
|datajibo/dev|853|853|12.3|87.7|0.0|1.1|88.7|88.3|
|datajibo/test|1044|1043|10.6|89.4|0.0|1.7|91.1|90.3|

|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
|datajibo/dev|853|2014|22.1|22.7|55.2|1.4|79.3|88.3|
|datajibo/test|1044|2767|21.3|20.6|58.1|2.1|80.8|90.3|

|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
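As a reading aid for these RESULTS tables: `Corr + Sub + Del` accounts for all reference words (or characters), and the reported `Err` is `Sub + Del + Ins`, so it can exceed 100% when the model inserts many spurious tokens. A quick Python sanity check (the helper name is mine, not ESPnet's):

```python
def err_rate(sub: float, dele: float, ins: float) -> float:
    """ESPnet's Err column: substitution + deletion + insertion rates (%)."""
    return round(sub + dele + ins, 1)

# Insertions are counted on top of the reference length, so Err can exceed
# 100% -- as on the jibo sets, where the model over-generates heavily.
assert err_rate(8.7, 2.2, 2.4) == 13.3      # a typical in-domain row
assert err_rate(79.6, 0.0, 306.0) == 385.6  # an insertion-heavy row
```

`S.Err` is the sentence error rate: the fraction of utterances with at least one word error.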
owls_4B_180K
farsi_commonvoice_blstm
owls_1B_180K
kan-bayashi_csmsc_fastspeech2
owsm_v3.1_ebf_small_lowrestriction
brianyan918_iwslt22_dialect_train_asr_conformer_ctc0.3_lr2e-3_warmup15k_newspecaug
kan-bayashi_ljspeech_conformer_fastspeech2
kan-bayashi_vctk_tts_train_full_band_multi_spk_vits_raw_phn_tacotron_g-truncated-50b003
aishell2_att_ctc_espnet2
wanchichen_fleurs_english_asr_wav2vec_frontend
geolid_combined_shared_trainable
Karthik_sinhala_asr_train_asr_transformer
anogkongda-librimix_enh_train_raw_valid.si_snr.ave
kan-bayashi_csj_asr_train_asr_transformer_raw_char_sp_valid.acc.ave
kan-bayashi_csmsc_tts_train_conformer_fastspeech2_raw_phn_pypinyin_g2p_phone_train.loss.ave
kan-bayashi_csmsc_tts_train_fastspeech2_raw_phn_pypinyin_g2p_phone_train.loss.ave
kan-bayashi_jsut_conformer_fastspeech2
kan-bayashi_jsut_fastspeech
kan-bayashi_jsut_fastspeech2_accent
kan-bayashi_jsut_fastspeech2_accent_with_pause
kan-bayashi_jsut_tacotron2
kan-bayashi_jsut_transformer_accent_with_pause
kan-bayashi_jsut_tts_train_fastspeech2_transformer_teacher_raw_phn_jac-truncated-60fc24
kan-bayashi_jsut_tts_train_full_band_vits_raw_phn_jaconv_pyopenjtalk_a-truncated-d7d5d0
kan-bayashi_jsut_tts_train_vits_raw_phn_jaconv_pyopenjtalk_prosody_train.total_count.ave
kan-bayashi_ljspeech_tts_train_tacotron2_raw_phn_tacotron_g2p_en_no_space_train.loss.best
kan-bayashi_vctk_gst_fastspeech2
kan-bayashi_vctk_tts_train_xvector_conformer_fastspeech2_transformer_t-truncated-69a657
german_commonvoice_blstm
french_commonvoice_blstm
id_commonvoice_blstm
greek_commonvoice_blstm
tamil_commonvoice_blstm
kyrgyz_commonvoice_blstm
shihlun-asr-commonvoice-zh-TW
simpleoier_librispeech_hubert_iter1_train_ssl_torchaudiohubert_base_960h_pretrain_it1_raw
WavLabLM-MK-40k
voxcelebs12_ska_wavlm_frozen
voxcelebs12_ecapa_mel
akreal_lh_small_asr2_e_branchformer_wavlm_mistral02_aed
amuse_encodec_16k
dac_16k_music_survey
dac_44k_music_single_survey
This model was trained by ftshijt using amuse recipe in espnet. Follow the ESPnet installation instructions if you haven't done that already.
universa-base_urgent24_multi-metric
asr_mlsuperb2_mms_finetune_baseline
owls_05B_180K
lid_voxlingua107_mms_ecapa
myst_ogi_cmu_kids_owsm_v3.1
This model was trained by eric102004 using the myst_ogi_cmu_kids recipe in espnet. Follow the ESPnet installation instructions if you haven't done that already.

RESULTS

Environments
- date: `Tue Feb 18 14:16:58 CST 2025`
- python version: `3.12.3 | packaged by Anaconda, Inc. | (main, May 6 2024, 19:46:43) [GCC 11.2.0]`
- espnet version: `espnet 202412`
- pytorch version: `pytorch 2.4.0`
- Git hash: `6f722aee1f9593572d5eddfd8cac7075b07cf9ca`
- Commit date: `Thu Feb 6 22:32:07 2025 -0600`

exp/s2towsmv3.1lr000303rawenbpe50000/decodeasrbeam1ctc03jibos2tmodelvalid.cer.ave4best

WER

|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
|datajibo/dev|853|853|6.8|84.3|8.9|0.0|93.2|93.2|
|datajibo/test|1044|1044|8.6|82.9|8.5|0.0|91.4|91.4|

|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
|datajibo/dev|853|2558|18.8|11.8|69.4|0.0|81.2|93.2|
|datajibo/test|1044|3259|18.7|11.1|70.2|0.0|81.3|91.4|

|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
|datajibo/dev|853|2453|18.9|12.8|68.3|0.0|81.1|93.2|
|datajibo/test|1044|3092|18.3|12.6|69.1|0.0|81.7|91.4|

exp/s2towsmv3.1lr000303rawenbpe50000/decodeasrbeam1ctc03s2tmodelvalid.cer.ave4best

WER

|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
|datacmu/dev|237|2170|92.9|4.8|2.2|2.0|9.0|26.2|
|datacmu/test|475|4287|93.4|4.8|1.8|0.9|7.5|28.4|
|datajibo/dev|853|853|22.7|77.3|0.0|203.4|280.7|96.5|
|datajibo/test|1044|1044|27.6|72.4|0.0|282.6|355.0|89.0|
|datamyst/dev|9037|153273|93.2|5.2|1.6|2.3|9.1|56.7|
|datamyst/test|10311|182712|91.5|5.5|3.0|2.5|10.9|57.0|
|dataogiscripted/dev|5426|15375|98.4|1.4|0.2|0.3|1.9|2.7|
|dataogiscripted/test|15945|45419|98.3|1.4|0.3|0.3|2.0|3.1|
|dataogispon/dev|349|13561|88.4|8.7|2.9|2.0|13.7|93.4|
|dataogispon/test|1095|38811|88.5|8.8|2.7|2.9|14.4|92.6|
|highage/test|11196|56799|1.4|34.8|63.8|63.4|162.0|99.9|
|lowage/test|5147|24262|2.2|37.7|60.1|61.2|159.0|98.7|
|midage/test|26532|374547|11.5|44.9|43.6|43.6|132.1|98.2|

|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
|datacmu/dev|237|11449|96.0|1.7|2.3|1.7|5.7|26.2|
|datacmu/test|475|22664|96.6|1.1|2.3|1.2|4.5|28.4|
|datajibo/dev|853|2558|60.3|25.7|14.0|314.9|354.6|96.5|
|datajibo/test|1044|3259|69.0|19.7|11.2|433.4|464.3|89.0|
|datamyst/dev|9037|763728|97.3|1.1|1.6|2.2|4.9|56.7|
|datamyst/test|10311|911898|95.9|1.2|2.9|2.4|6.5|57.0|
|dataogiscripted/dev|5426|83141|98.9|0.7|0.4|0.4|1.5|2.7|
|dataogiscripted/test|15945|244467|98.7|0.7|0.6|0.4|1.7|3.1|
|dataogispon/dev|349|58255|94.5|2.5|3.0|2.8|8.2|93.4|
|dataogispon/test|1095|165977|94.7|2.3|3.0|3.4|8.7|92.6|
|highage/test|11196|278522|18.8|20.2|61.0|60.8|142.0|99.9|
|lowage/test|5147|117778|20.8|22.0|57.1|57.9|137.1|98.7|
|midage/test|26532|1865279|34.5|20.0|45.5|45.5|111.0|98.2|

|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
|datacmu/dev|237|8415|95.2|2.6|2.3|1.7|6.5|26.2|
|datacmu/test|475|16575|95.7|2.2|2.2|1.2|5.5|28.4|
|datajibo/dev|853|2453|54.4|32.5|13.1|235.8|281.3|96.5|
|datajibo/test|1044|3092|61.7|27.7|10.7|331.4|369.7|89.0|
|datamyst/dev|9037|552703|96.5|2.0|1.6|2.3|5.8|56.7|
|datamyst/test|10311|660431|95.0|2.1|2.9|2.5|7.5|57.0|
|dataogiscripted/dev|5426|64772|98.6|1.0|0.4|0.4|1.7|2.7|
|dataogiscripted/test|15945|190001|98.5|1.0|0.5|0.4|1.9|3.1|
|dataogispon/dev|349|40668|92.3|4.6|3.0|2.8|10.5|93.4|
|dataogispon/test|1095|116027|92.6|4.4|3.1|3.6|11.0|92.6|
|highage/test|11196|207835|11.9|31.3|56.8|56.7|144.8|99.9|
|lowage/test|5147|87805|13.2|34.0|52.8|53.6|140.4|98.7|
|midage/test|26532|1353952|23.4|32.6|44.0|44.0|120.6|98.2|
owsm_v3
pengcheng_guo_wenetspeech_asr_train_asr_raw_zh_char
kan-bayashi_tsukuyomi_tts_finetune_full_band_jsut_vits_raw_phn_jaconv_pyopenjtalk_prosody_latest
simpleoier_librispeech_asr_train_asr_conformer7_hubert_ll60k_large_raw_en_bpe5000_sp
vectominist_seame_asr_conformer_bpe5626
wanchichen_fleurs_asr_conformer_hier_lid_utt
Shinji_Watanabe_spgispeech_asr_train_asr_conformer6_n_fft512_hop_lengt-truncated-f1ac86
kamo-naoyuki_librispeech_asr_train_asr_conformer6_n_fft512_hop_length2-truncated-a63357
kan_bayashi_jsut_tts_train_conformer_fastspeech2_raw_phn_jaconv_pyopenjtalk_train.loss.ave
arabic_commonvoice_blstm
WavLabLM-EK-40k
m4singer_svs_xiaoice
slueted_whisper_summ
Chenda_Li_wsj0_2mix_enh_train_enh_conv_tasnet_raw_valid.si_snr.ave
Dan_Berrebbi_aishell4_asr
Shinji_Watanabe_librispeech_asr_train_asr_transformer_e18_raw_bpe_sp_valid.acc.best
YushiUeda_mini_librispeech_diar_train_diar_raw_valid.acc.best
kamo-naoyuki_aishell_conformer
kamo-naoyuki_librispeech_asr_train_asr_conformer5_raw_bpe5000_frontend-truncated-55c091
kan-bayashi_csmsc_tacotron2
kan-bayashi_csmsc_tts_train_fastspeech_raw_phn_pypinyin_g2p_phone_train.loss.best
kan-bayashi_csmsc_tts_train_full_band_vits_raw_phn_pypinyin_g2p_phone_train.total_count.ave
kan-bayashi_csmsc_tts_train_vits_raw_phn_pypinyin_g2p_phone_train.total_count.ave
kan-bayashi_jsut_conformer_fastspeech2_accent
kan-bayashi_jsut_conformer_fastspeech2_accent_with_pause
kan-bayashi_jsut_tts_train_conformer_fastspeech2_tacotron2_teacher_raw-truncated-a7f080
kan-bayashi_jsut_tts_train_conformer_fastspeech2_tacotron2_teacher_raw-truncated-569e81
kan-bayashi_jsut_tts_train_conformer_fastspeech2_transformer_teacher_r-truncated-35ef5a
kan-bayashi_jsut_tts_train_fastspeech2_tacotron2_teacher_raw_phn_jacon-truncated-f45dcb
kan-bayashi_jsut_tts_train_fastspeech2_tacotron2_teacher_raw_phn_jacon-truncated-e5d906
kan-bayashi_jsut_tts_train_full_band_vits_raw_phn_jaconv_pyopenjtalk_p-truncated-66d5fc
kan-bayashi_jsut_tts_train_transformer_raw_phn_jaconv_pyopenjtalk_train.loss.ave
kan-bayashi_jvs_jvs010_vits_prosody
kan-bayashi_ljspeech_fastspeech
kan-bayashi_ljspeech_tts_train_fastspeech_raw_phn_tacotron_g2p_en_no_space_train.loss.best
kan-bayashi_ljspeech_tts_train_transformer_raw_phn_tacotron_g2p_en_no_space_train.loss.ave
kan-bayashi_vctk_gst_tacotron2
kan-bayashi_vctk_gst_xvector_conformer_fastspeech2
kan-bayashi_vctk_tts_train_gst_conformer_fastspeech2_raw_phn_tacotron_-truncated-69081b
kan-bayashi_vctk_tts_train_gst_fastspeech2_raw_phn_tacotron_g2p_en_no_space_train.loss.ave
kan-bayashi_vctk_tts_train_xvector_transformer_raw_phn_tacotron_g2p_en_no_space_train.loss.ave
roshansh_how2_asr_raw_ft_sum_valid.acc
xuankai_chang_librispeech_asr_train_asr_conformer7_wav2vec2_960hr_larg-truncated-5b94d9
ml_openslr63
bn_openslr53
Wangyou_Zhang_chime4_enh_train_enh_dc_crn_mapping_snr_raw
GunnarThor_talromur_d_tacotron2
thai_commonvoice_blstm
Wangyou_Zhang_wsj0_2mix_enh_train_enh_dptnet_raw
slurp_slu_2pass
simpleoier_librispeech_hubert_iter0_train_ssl_torchaudiohubert_base_960h_pretrain_it0_raw
pengcheng_librimix_asr_train_sot_asr_conformer_wavlm_raw_en_char_sp
akreal_ls100_asr2_e_branchformer1_1gpu_raw_wavlm_large_21_km2k_bpe_rm6k_bpe_ts5k_sp
kohei0209_ted3_asr2_e_branchformer1_raw_wavlm_large_21_km1000_bpe_rm2000_bpe_ts500_sp
jinchuat_aishell_brctc
iwslt24_indic_en_ta_bpe_tc4000
sluevoxceleb_wavlm_finetune_asr
mls-multi_soundstream_16k_360epoch
dac_16k_all_survey
dac_16k_audio_survey
dac_16k_music_single_survey
dac_16k_speech_single_survey
dac_44k_audio_survey
This model was trained by ftshijt using amuse recipe in espnet. Follow the ESPnet installation instructions if you haven't done that already.
dac_44k_all_survey
This model was trained by ftshijt using amuse recipe in espnet. Follow the ESPnet installation instructions if you haven't done that already.
owls_2B_180K
owls_025B_180K
universa-wavlm_base_urgent24_multi-metric_audioref
`espnet/universa-wavlm_base_urgent24_multi-metric_audioref` This model was trained by ftshijt using the urgent24 recipe in espnet. Please check the Colab link for a simple demo of how to use UniVERSA.
audioset_dac_16k
arecho_base_v0.1-large-decoder
This model was trained by ftshijt using the universa_unite recipe in espnet. Follow the ESPnet installation instructions if you haven't done that already.
opencpop_svs_train_toksing_300epoch-multi_hl6_wl6_wl23
geolid_vl107only_shared_frozen
This geolocation-aware language identification (LID) model is developed with the ESPnet toolkit. It uses the pretrained MMS-1B model as the encoder and an ECAPA-TDNN embedding extractor to achieve robust spoken language identification. The main innovations of this model are:

1. Incorporating geolocation prediction as an auxiliary task during training.
2. Conditioning the intermediate representations of the self-supervised learning (SSL) encoder on intermediate-layer information.

This geolocation-aware strategy greatly improves robustness, especially for dialects and accented variations. For further details on the methodology, please refer to our paper: Geolocation-Aware Robust Spoken Language Identification (arXiv).

Prerequisites

First, ensure you have ESPnet installed. If not, follow the ESPnet installation instructions.

Quick Start

Run the following commands to set up and use the pre-trained model:

This will download the pre-trained model and run inference using the VoxLingua107 test data. Training used only the VoxLingua107 dataset, comprising 6,628 hours of speech across 107 languages from YouTube.

| Dataset | Domain | #Langs. Train/Test | Dialect | Training Setup (VL107-only) |
| ------------- | ----------- | ------------------ | ------- | --------------------------- |
| VoxLingua107 | YouTube | 107/33 | No | Seen |
| Babel | Telephone | 25/25 | No | Unseen |
| FLEURS | Read speech | 102/102 | No | Unseen |
| ML-SUPERB 2.0 | Mixed | 137/(137, 8) | Yes | Unseen |
| VoxPopuli | Parliament | 16/16 | No | Unseen |

Accuracy (%) on In-domain and Out-of-domain Test Sets

| ESPnet Recipe | Config | VoxLingua107 | Babel | FLEURS | ML-SUPERB2.0 Dev | ML-SUPERB2.0 Dialect | VoxPopuli | Macro Avg. |
| ------------------------- | ----------- | ------------ | ----- | ------ | ---------------- | -------------------- | --------- | ---------- |
| egs2/geolid/lid1 | `conf/voxlingua107only/mmsecapaupcon3244it0.4sharedfrozen.yaml` | 94.3 | 85.9 | 94.3 | 88.8 | 80.7 | 89.2 | 88.8 |

For more detailed inference results, please refer to the `expvoxlingua107only/lidmmsecapaupcon3244it0.4sharedfrozenraw/inference` directory in this repository.

> Note (2025-08-18):
> The corresponding GitHub recipe egs2/geolid/lid1 has not yet been merged into the ESPnet master branch.
> See TODO: add PR link for the latest updates.
geolid_vl107only_independent_frozen
This geolocation-aware language identification (LID) model is developed with the ESPnet toolkit. It uses the pretrained MMS-1B model as the encoder and an ECAPA-TDNN embedding extractor to achieve robust spoken language identification. The main innovations of this model are:

1. Incorporating geolocation prediction as an auxiliary task during training.
2. Conditioning the intermediate representations of the self-supervised learning (SSL) encoder on intermediate-layer information.

This geolocation-aware strategy greatly improves robustness, especially for dialects and accented variations. For further details on the methodology, please refer to our paper: Geolocation-Aware Robust Spoken Language Identification (arXiv).

Prerequisites

First, ensure you have ESPnet installed. If not, follow the ESPnet installation instructions.

Quick Start

Run the following commands to set up and use the pre-trained model:

This will download the pre-trained model and run inference using the VoxLingua107 test data. Training used only the VoxLingua107 dataset, comprising 6,628 hours of speech across 107 languages from YouTube.

| Dataset | Domain | #Langs. Train/Test | Dialect | Training Setup (VL107-only) |
| ------------- | ----------- | ------------------ | ------- | --------------------------- |
| VoxLingua107 | YouTube | 107/33 | No | Seen |
| Babel | Telephone | 25/25 | No | Unseen |
| FLEURS | Read speech | 102/102 | No | Unseen |
| ML-SUPERB 2.0 | Mixed | 137/(137, 8) | Yes | Unseen |
| VoxPopuli | Parliament | 16/16 | No | Unseen |

Accuracy (%) on In-domain and Out-of-domain Test Sets

| ESPnet Recipe | Config | VoxLingua107 | Babel | FLEURS | ML-SUPERB2.0 Dev | ML-SUPERB2.0 Dialect | VoxPopuli | Macro Avg. |
| ------------------------- | ----------- | ------------ | ----- | ------ | ---------------- | -------------------- | --------- | ---------- |
| egs2/geolid/lid1 | `conf/voxlingua107only/mmsecapaupcon3244it0.4independentfrozen.yaml` | 94.2 | 87.1 | 95.0 | 89.0 | 77.2 | 90.4 | 88.8 |

For more detailed inference results, please refer to the `expvoxlingua107only/lidmmsecapaupcon3244it0.4independentfrozenraw/inference` directory in this repository.

> Note (2025-08-18):
> The corresponding GitHub recipe egs2/geolid/lid1 has not yet been merged into the ESPnet master branch.
> See TODO: add PR link for the latest updates.
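The geolocation auxiliary task described in these model cards amounts to adding a weighted location-prediction term to the LID classification loss. A minimal NumPy sketch of such a joint objective (the weight `lam`, the MSE form of the geolocation term, and the helper names are illustrative assumptions, not the recipe's actual code):

```python
import numpy as np

def cross_entropy(logits: np.ndarray, label: int) -> float:
    """Softmax cross-entropy for the LID classification head."""
    logits = logits - logits.max()  # shift for numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return float(-log_probs[label])

def joint_loss(lid_logits, lid_label, geo_pred, geo_target, lam=0.1):
    """LID loss plus a weighted MSE on predicted (lat, lon) coordinates."""
    geo_mse = float(np.mean((geo_pred - geo_target) ** 2))
    return cross_entropy(lid_logits, lid_label) + lam * geo_mse

loss = joint_loss(
    np.array([2.0, 0.5, -1.0]), 0,   # 3-way LID logits, true class 0
    np.array([48.9, 2.4]),           # predicted (lat, lon)
    np.array([48.8, 2.3]),           # target (lat, lon)
)
```

With `lam = 0`, this reduces to a plain LID classifier; the auxiliary term nudges the shared encoder toward geography-aware representations, which is what the paper credits for the dialect robustness gains.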
myst_ogi_cmu_kids_rnnt
This model was trained by eric102004 using the myst_ogi_cmu_kids recipe in espnet. Follow the ESPnet installation instructions if you haven't done that already.

RESULTS

Environments
- date: `Tue Feb 18 14:11:50 CST 2025`
- python version: `3.12.3 | packaged by Anaconda, Inc. | (main, May 6 2024, 19:46:43) [GCC 11.2.0]`
- espnet version: `espnet 202412`
- pytorch version: `pytorch 2.4.0`
- Git hash: `6f722aee1f9593572d5eddfd8cac7075b07cf9ca`
- Commit date: `Thu Feb 6 22:32:07 2025 -0600`

exp/asrtrainasrtransducerebranchformere12mlp1024linear1024lr005rawenchardur05filter/decodetransducerasrmodelvalid.cerctc.ave10best

WER

|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
|datacmu/dev|237|2170|87.7|9.7|2.6|3.4|15.7|46.8|
|datacmu/test|475|4287|87.3|9.3|3.4|2.5|15.1|47.6|
|datajibo/dev|853|853|36.6|59.0|4.5|186.5|249.9|81.1|
|datajibo/test|1044|1043|38.4|58.3|3.3|253.0|314.6|77.6|
|datamyst/dev|9037|153273|90.1|7.5|2.4|2.2|12.2|66.5|
|datamyst/test|10311|182712|89.8|7.6|2.6|2.4|12.6|65.4|
|dataogiscripted/dev|5426|15375|97.9|1.4|0.7|0.3|2.3|3.9|
|dataogiscripted/test|15945|45419|97.7|1.3|0.9|0.5|2.7|4.6|
|dataogispon/dev|349|13561|80.1|15.2|4.7|2.5|22.4|95.7|
|dataogispon/test|1095|38811|81.1|14.7|4.2|3.1|22.1|95.1|
|highage/test|11196|56799|1.4|34.8|63.8|62.5|161.1|99.9|
|lowage/test|5147|24262|2.2|37.2|60.6|60.3|158.1|98.7|
|midage/test|26532|374547|11.4|44.9|43.8|43.5|132.2|98.5|

|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
|datacmu/dev|237|11449|94.0|2.6|3.4|2.9|8.9|46.8|
|datacmu/test|475|22664|93.4|2.4|4.2|2.3|8.9|47.6|
|datajibo/dev|853|2014|66.5|25.8|7.7|400.9|434.5|81.1|
|datajibo/test|1044|2767|74.8|20.7|4.6|490.0|515.3|77.6|
|datamyst/dev|9037|763728|95.8|1.7|2.5|2.3|6.5|66.5|
|datamyst/test|10311|911898|95.6|1.8|2.6|2.5|6.8|65.4|
|dataogiscripted/dev|5426|83141|98.5|0.4|1.1|0.3|1.8|3.9|
|dataogiscripted/test|15945|244467|98.3|0.5|1.2|0.5|2.2|4.6|
|dataogispon/dev|349|58255|89.5|4.7|5.9|3.0|13.6|95.7|
|dataogispon/test|1095|165977|90.1|4.5|5.4|3.7|13.5|95.1|
|highage/test|11196|278522|18.7|20.1|61.1|59.6|140.9|99.9|
|lowage/test|5147|117778|20.5|21.8|57.7|56.5|136.0|98.7|
|midage/test|26532|1865279|34.4|19.9|45.7|45.5|111.1|98.5|

|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|

exp/asrtrainasrtransducerebranchformere12mlp1024linear1024lr005rawenchardur05filter/decodetransducerjiboasrmodelvalid.cerctc.ave10best

WER

|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
|datajibo/dev|853|853|36.5|59.1|4.5|148.4|212.0|81.1|
|datajibo/test|1044|1043|38.4|58.3|3.3|198.2|259.7|77.6|

|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
|datajibo/dev|853|2014|65.7|26.0|8.3|316.5|350.8|81.1|
|datajibo/test|1044|2767|74.3|20.7|5.0|379.9|405.6|77.6|

|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
WavLabLM-MS-40k
Shinji_Watanabe_open_li52_asr_train_asr_raw_bpe7000_valid.acc.ave
kan-bayashi_tsukuyomi_full_band_vits_prosody
Wangyou_Zhang_universal_train_enh_uses_refch0_2mem_raw
Hoon_Chung_jsut_asr_train_asr_conformer8_raw_char_sp_valid.acc.ave
brianyan918_iwslt22_dialect_train_st_conformer_ctc0.3_lr2e-3_warmup15k_newspecaug
kan-bayashi_jvs_tts_finetune_jvs001_jsut_vits_raw_phn_jaconv_pyopenjta-truncated-178804
kan-bayashi_ljspeech_tts_finetune_joint_conformer_fastspeech2_hifigan_-truncated-737899
kan-bayashi_ljspeech_tts_train_fastspeech2_raw_phn_tacotron_g2p_en_no_space_train.loss.ave
chai_librispeech_asr_train_conformer-rnn_transducer_raw_en_bpe5000_sp
pt_commonvoice_blstm
english_male_ryanspeech_conformer_fastspeech2
guangzhisun_librispeech100_asr_train_conformer_transducer_tcpgen500_deep_sche30_GCN6L_rep_suffix
acesinger_opencpop_visinger2_44khz
m4singer_svs_naive_rnn_dp
voxcelebs12_xvector_mel
mms_1b_mlsuperb
owls_18B_360K
geolid_vl107only_shared_trainable
Chenda_Li_wsj0_2mix_enh_train_enh_rnn_tf_raw_valid.si_snr.ave
Karthik_DSTC2_asr_train_asr_transformer
Shinji_Watanabe_spgispeech_asr_train_asr_conformer6_n_fft512_hop_lengt-truncated-a013d0
YushiUeda_iemocap_sentiment_asr_train_asr_conformer
ftshijt_espnet2_asr_totonac_transformer
kamo-naoyuki_hkust_asr_train_asr_transformer2_raw_zh_char_batch_bins20-truncated-934e17
kamo-naoyuki_librispeech_asr_train_asr_conformer5_raw_bpe5000_frontend-truncated-b76af5
kamo-naoyuki_librispeech_asr_train_asr_conformer5_raw_bpe5000_schedule-truncated-c8e5f9
kamo-naoyuki_reverb_asr_train_asr_transformer4_raw_char_batch_bins1600-truncated-1b72bb
kamo-naoyuki_wsj_transformer2
kan-bayashi_csmsc_fastspeech
kan-bayashi_jsut_conformer_fastspeech2_tacotron2_prosody
kan-bayashi_jsut_conformer_fastspeech2_transformer_prosody
kan-bayashi_jsut_full_band_vits_accent_with_pause
kan-bayashi_jsut_tacotron2_accent
kan-bayashi_jsut_transformer
kan-bayashi_jsut_transformer_prosody
kan-bayashi_jsut_tts_train_conformer_fastspeech2_raw_phn_jaconv_pyopenjtalk_train.loss.ave
kan-bayashi_jsut_tts_train_conformer_fastspeech2_transformer_teacher_r-truncated-74c1b4
kan-bayashi_jsut_tts_train_conformer_fastspeech2_transformer_teacher_r-truncated-f43d8f
kan-bayashi_jsut_tts_train_fastspeech2_raw_phn_jaconv_pyopenjtalk_train.loss.ave
kan-bayashi_jsut_tts_train_fastspeech2_transformer_teacher_raw_phn_jac-truncated-6f4cf5
kan-bayashi_jsut_tts_train_fastspeech_raw_phn_jaconv_pyopenjtalk_train.loss.best
kan-bayashi_jsut_tts_train_tacotron2_raw_phn_jaconv_pyopenjtalk_accent_train.loss.ave
kan-bayashi_jsut_tts_train_tacotron2_raw_phn_jaconv_pyopenjtalk_train.loss.best
kan-bayashi_jsut_tts_train_transformer_raw_phn_jaconv_pyopenjtalk_prosody_train.loss.ave
kan-bayashi_jvs_jvs010_vits_accent_with_pause
kan-bayashi_jvs_tts_finetune_jvs010_jsut_vits_raw_phn_jaconv_pyopenjta-truncated-d57a28
kan-bayashi_libritts_gst_xvector_conformer_fastspeech2
kan-bayashi_libritts_gst_xvector_trasnformer
kan-bayashi_libritts_tts_train_gst_xvector_trasnformer_raw_phn_tacotro-truncated-250027
kan-bayashi_libritts_tts_train_xvector_conformer_fastspeech2_transform-truncated-42b443
kan-bayashi_libritts_tts_train_xvector_vits_raw_phn_tacotron_g2p_en_no-truncated-09d645
kan-bayashi_libritts_xvector_conformer_fastspeech2
kan-bayashi_libritts_xvector_trasnformer
kan-bayashi_ljspeech_transformer
kan-bayashi_ljspeech_tts_train_vits_raw_phn_tacotron_g2p_en_no_space_train.total_count.ave
kan-bayashi_vctk_gst_conformer_fastspeech2
kan-bayashi_vctk_gst_fastspeech
kan-bayashi_vctk_tts_train_gst_fastspeech_raw_phn_tacotron_g2p_en_no_space_train.loss.best
kan-bayashi_vctk_tts_train_gst_transformer_raw_phn_tacotron_g2p_en_no_space_train.loss.ave
kan-bayashi_vctk_tts_train_gst_xvector_conformer_fastspeech2_transform-truncated-e051a9
kan-bayashi_vctk_tts_train_gst_xvector_tacotron2_raw_phn_tacotron_g2p_en_no_space_train.loss.ave
kan-bayashi_vctk_tts_train_xvector_tacotron2_raw_phn_tacotron_g2p_en_no_space_train.loss.ave
shinji-watanabe-librispeech_asr_train_asr_transformer_e18_raw_bpe_sp_valid.acc.best
simpleoier_librispeech_asr_train_asr_conformer7_wav2vec2_960hr_large_raw_en_bpe5000_sp
su_openslr36
YushiUeda_swbd_sentiment_asr_train_asr_conformer_wav2vec2
ftshijt_espnet2_asr_dsing_hubert_conformer
russian_commonvoice_blstm
chai_librispeech_asr_train_rnnt_conformer_raw_en_bpe5000_sp
Wangyou_Zhang_wsj0_2mix_enh_dc_crn_mapping_snr_raw
YushiUeda_librimix_diar_enh_2_3_spk_lmf
accented_french_openslr57_ASR_transformer
GunnarThor_talromur_f_fastspeech2
GunnarThor_talromur_b_tacotron2
GunnarThor_talromur_c_fastspeech2
GunnarThor_talromur_d_fastspeech2
GunnarThor_talromur_h_fastspeech2
YushiUeda_harpervalley_train_asr_hubert_raw_en_word
simpleoier_chime4_enh_asr_convtasnet_init_noenhloss_wavlm_transformer_init_raw_en_char
zh-CN_commonvoice_blstm
simpleoier_chime6_asr_transformer_wavlm_lr1e-3
english_male_ryanspeech_tacotron
slurp_slu_2pass_gt
fsc_challenge_slu_2pass_conformer
fsc_challenge_slu_2pass_transformer_gt
jiyangtang_magicdata_asr_conformer_lm_transformer
talromur2_xvector_tacotron2
simpleoier_librimix_asr_train_asr_transformer_multispkr_raw_en_char_sp
realzza-meld-asr-hubert-transformer
stop_hubert_slu_raw_en_bpe500
librispeech_multiblank_transducer_8421
jiyang_tang_aphsiabank_english_asr_ebranchformer_small_wavlm_large1
simpleoier_ls960_asr2_train_e_branchformer1_raw_wavlm_large_21_km2000_bpe_rm6000_bpe_ts5000_sp
simpleoier_ls960_asr2_train_e_branchformer1_1gpu_raw_wavlm_large_21_km1k_bpe_rm5k_bpe_ts5k_sp
simpleoier_ls960_asr2_e_branchformer1_conv1d3_1gpu_raw_wavlm_large_21_km1k_bpe_rm5k_bpe_ts5k_sp
Wangyou_Zhang_wsj0_2mix_train_enh_tse_td_speakerbeam_raw
chendali_librimix_asr_train_sot_asr_whisper_small_raw_en_whisper_multilingual
yoshiki_wsj_asr_conformer_s3prlfrontend_wavlm_raw_en_char
eason_gigaspeech_train_asr2_e_branchformer12_lr_raw_wavlm_large_21_km1000
msk_lrs3_train_avsr_avhubert_large_extracted_en_bpe1000
owsm_v1
juice500ml_mls_10h_asr_ssl
juice500ml_mls_10h_discrete_asr
akreal_lh_small_asr2_e_branchformer_wavlm_large_21_km2k_bpe_rm6k_bpe_ts3k_sp
akreal_lh_medium_asr2_e_branchformer_wavlm_large_21_km1k_bpe_rm6k_bpe_ts3k
vctk_tts_train_espnet_rawnet_vits
opencpop_visinger
opencpop_visinger_transfer_acesinger
opencpop_naive_rnn_dp
opencpop_xiaoice
oniku_kurumi_utagoe_svs_db_naive_rnn_dp
oniku_kurumi_utagoe_db_xiaoice
kiritan_svs_rnn
kiritan_svs_xiaoice
kiritan_svs_visinger
oniku_kurumi_utagoe_db_svs_visinger
oniku_kurumi_utagoe_db_svs_visinger2
voxblinkclean_rawnet3
voxcelebs12_mfaconformer_mel
akreal_lh_small_asr2_e_branchformer_wavlm_mistral02_ctc
voxcelebs12devs_librispeech_cv16fa_rawnet3
interspeech2024_dsuchallenge_wavlm_large_21_km2000_bpe_rm3000_bpe_ts6500_baseline
sluevoxceleb_whisper_lightweight_sa
sluevoxceleb_owsm_lightweight_sa
sluevoxceleb_whisper_finetune_sa
sluevoxceleb_wavlm_lightweight_asr
sluevoxceleb_whisper_lightweight_asr
sluevoxceleb_whisper_finetune_asr
sluevoxceleb_owsm_finetune_asr
sluevoxceleb_whisper_complex_slu
libritts_soundstream16k
libritts_soundstream24k
libritts_encodec_24k
libritts_dac_24k
amuse_speech_soundstream_16k
speechlm_tts_ls_giga_mlsen_amuse_speech_delay
speechlm_tts_ls_giga_mlsen_amuse_speech_multiscale
mls-english_soundstream_16k
mls-multi_encodec_16k
mls-multi_soundstream_16k
mls-audioset_soundstream_16k
mls-audioset_encodec_16k
mls-multi_encodec_16k_360epoch
mls-audioset_soundstream_16k_360epoch
dac_16k_speech_survey
dac_16k_all_single_survey
BEATs-AS20K
dac_44k_audio_single_survey
This model was trained by ftshijt using the amuse recipe in ESPnet. Follow the ESPnet installation instructions if you haven't done so already.
dac_44k_all_single_survey
This model was trained by ftshijt using the amuse recipe in ESPnet. Follow the ESPnet installation instructions if you haven't done so already.
dac_44k_speech_single_survey
This model was trained by ftshijt using the amuse recipe in ESPnet. Follow the ESPnet installation instructions if you haven't done so already.
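The notes above point to the generic ESPnet setup. As a minimal sketch of fetching one of these pretrained models programmatically (assumptions: `espnet` and `espnet_model_zoo` are installed, e.g. via `pip install espnet espnet_model_zoo`, and the listed names resolve to Hugging Face tags under the `espnet/` namespace, as the `espnet/universa-...` entry further down suggests):

```python
# Illustrative tag taken from this listing; swap in any other model name.
MODEL_TAG = "espnet/dac_44k_speech_single_survey"


def fetch(tag: str) -> dict:
    """Download and unpack a pretrained ESPnet model.

    Returns a dict mapping file roles (config, model parameters, ...)
    to local paths. The import is deferred so this sketch can be read
    and imported without espnet_model_zoo installed.
    """
    from espnet_model_zoo.downloader import ModelDownloader

    return ModelDownloader().download_and_unpack(tag)


# Requires network access on first use; results are cached locally:
# files = fetch(MODEL_TAG)
```

The unpacked paths can then be handed to the matching ESPnet inference class for the task (TTS, ASR, codec, etc.).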
BEATs-BEAN.Watkins
BEATs-BEAN.Dogs
BEATs-BEAN.Bats
wanchichen_xeus_fleurs_finetune
universa-wavlm_base_urgent24_multi-metric
asr_mlsuperb2_mshubert_freeze_baseline
asr_mlsuperb2_mshubert_finetune_baseline
asr_mlsuperb2_xlsr_finetune_baseline
universa-wavlm_base_urgent24_multi-metric_fullref
`espnet/universa-wavlmbaseurgent24multi-metricfullref` This model was trained by ftshijt using the urgent24 recipe in ESPnet. Please check the Colab link for a simple demo of how to use UniVERSA.
owls_05B_180K_intermediates
owsm_dac_v2_16k
This model was trained by ftshijt using the amuse recipe in ESPnet. Follow the ESPnet installation instructions if you haven't done so already.
cnceleb_resnet34
cnceleb_resnet221
owsm_v2
belarusian_commonvoice_blstm
transformer_tts_cmu_indic_hin_ab
mixdata_svs_visinger2_spkemb_lang_pretrained
Yushi_Ueda_ksponspeech_asr_train_asr_conformer8_n_fft512_hop_length256-truncated-eb42e5
kan-bayashi_csmsc_conformer_fastspeech2
kan-bayashi_vctk_full_band_multi_spk_vits
YushiUeda_swbd_sentiment_asr_train_asr_conformer
YushiUeda_swbd_sentiment_asr_train_asr_conformer_wav2vec2_2
mediaspeech-spanish-hubert
wanchichen_fleurs_asr_conformer_scctc
khassan_KSC_transformer
pengcheng_aishell_asr_train_asr_whisper_medium_finetune_raw_zh_whisper_multilingual_sp
voxcelebs12devs_voxblinkfull_rawnet3
libriheavy_small_ebranchformer
voxcelebs12_ebranchformer_base
opencpop_svs2_toksing_pretrain
This model was trained by TangRain using the opencpop recipe in ESPnet. The model was trained for 300 epochs, and the checkpoint with the best validation loss was selected.