espnet
voxcelebs12_ecapa_wavlm_joint
owsm_ctc_v4_1B
voxcelebs12_rawnet3
fastspeech2_conformer
hubert_dummy
kamo-naoyuki-mini_an4_asr_train_raw_bpe_valid.acc.best
fastspeech2_conformer_with_hifigan
fastspeech2_conformer_hifigan
owsm_v3.1_ebf
owsm_v3.2
kan-bayashi_ljspeech_vits
ESPnet2 TTS pretrained model `kan-bayashi/ljspeech_vits` ♻️ Imported from https://zenodo.org/record/5443814/. This model was trained by kan-bayashi using the ljspeech/tts1 recipe in ESPnet. Demo: How to use in ESPnet2
owsm_v4_medium_1B
powsm
🐁 POWSM is the first phonetic foundation model that can perform four phone-related tasks: Phone Recognition (PR), Automatic Speech Recognition (ASR), audio-guided grapheme-to-phoneme conversion (G2P), and audio-guided phoneme-to-grapheme conversion (P2G). Based on the Open Whisper-style Speech Model (OWSM) and trained with IPAPack++, POWSM outperforms or matches specialized PR models of similar size while jointly supporting G2P, P2G, and ASR.

To use the pre-trained model, please install `espnet` and `espnet_model_zoo`. The requirements are:

The recipe can be found in ESPnet: https://github.com/espnet/espnet/tree/master/egs2/powsm/s2t1

Our models are trained on 16 kHz audio with a fixed duration of 20 s. When using the pre-trained model, please ensure the input speech is 16 kHz, and pad or truncate it to 20 s.

To distinguish phone entries from BPE tokens that share the same Unicode, we enclose every phone in slashes and treat them as special tokens. For example, /pʰɔsəm/ would be tokenized as /pʰ//ɔ//s//ə//m/.

See `forcealign.py` in the ESPnet recipe to try out CTC forced alignment with POWSM's encoder! LID is learned implicitly during training, and you may run it with the script below:
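The 16 kHz / 20 s input constraint mentioned above can be handled with a few lines of NumPy. This is an illustrative sketch, not part of the POWSM recipe; `fit_to_20s` is a hypothetical helper name, and the input is assumed to be a mono waveform already sampled at 16 kHz.

```python
import numpy as np

TARGET_SR = 16000             # POWSM expects 16 kHz input
TARGET_LEN = 20 * TARGET_SR   # fixed 20 s window -> 320,000 samples

def fit_to_20s(wav: np.ndarray) -> np.ndarray:
    """Zero-pad or truncate a mono 16 kHz waveform to exactly 20 s."""
    if len(wav) >= TARGET_LEN:
        return wav[:TARGET_LEN]
    return np.pad(wav, (0, TARGET_LEN - len(wav)))

# Example: a 5 s utterance is zero-padded up to the fixed 20 s window.
short = np.zeros(5 * TARGET_SR, dtype=np.float32)
assert fit_to_20s(short).shape == (TARGET_LEN,)
```

If the source audio is not 16 kHz, resample it first (e.g. with `librosa` or `torchaudio`) before applying the pad-or-truncate step.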
owsm_ctc_v3.1_1B
owsm_v4_base_102M
yoshiki_wsj0_2mix_spatialized_enh_tfgridnet_waspaa2023_raw
Wangyou_Zhang_chime4_enh_train_enh_conv_tasnet_raw
powsm_ctc
Xeus
XEUS - A Cross-lingual Encoder for Universal Speech

XEUS is a large-scale multilingual speech encoder from Carnegie Mellon University's WAVLab that covers over 4,000 languages. It is pre-trained on over 1 million hours of publicly available speech data. It requires fine-tuning for downstream tasks such as speech recognition or translation; its hidden states can also be used with k-means for semantic speech tokenization. XEUS uses the E-Branchformer architecture and is trained with HuBERT-style masked prediction of discrete speech tokens extracted from WavLabLM. During training, the input speech is also augmented with acoustic noise and reverberation, making XEUS more robust. The total model size is 577M parameters.

XEUS tops the ML-SUPERB multilingual speech recognition leaderboard, outperforming MMS, w2v-BERT 2.0, and XLS-R, and sets a new state of the art on 4 tasks in the monolingual SUPERB benchmark. More information about XEUS, including download links for our crawled 4,000-language dataset, can be found on the project page and in the paper.

The code for XEUS is still being merged into the main ESPnet repo; in the meantime, it can be used from the following fork:

XEUS supports Flash Attention, which can be installed as follows:
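The k-means speech tokenization of hidden states mentioned above can be illustrated with a NumPy-only sketch. This is not the ESPnet/XEUS code; `kmeans_fit` and `quantize` are hypothetical helpers, and a real pipeline would fit the centroids on encoder outputs from a large corpus.

```python
import numpy as np

def kmeans_fit(feats: np.ndarray, k: int, iters: int = 20) -> np.ndarray:
    """Fit k centroids to frame-level hidden states of shape (n_frames, dim)."""
    # Deterministic init: k frames spread evenly across the sequence.
    idx = np.linspace(0, len(feats) - 1, k).astype(int)
    centroids = feats[idx].copy()
    for _ in range(iters):
        # Assign each frame to its nearest centroid.
        dists = np.linalg.norm(feats[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Recompute centroids; keep the old one if a cluster goes empty.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = feats[labels == j].mean(axis=0)
    return centroids

def quantize(feats: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """Map each frame to the index of its nearest centroid (a discrete token)."""
    dists = np.linalg.norm(feats[:, None, :] - centroids[None, :, :], axis=-1)
    return dists.argmin(axis=1)

# Two well-separated fake "hidden state" clusters -> two distinct tokens.
feats = np.vstack([np.zeros((10, 4)), np.full((10, 4), 5.0)])
cents = kmeans_fit(feats, k=2)
tokens = quantize(feats, cents)
assert tokens[0] != tokens[-1]
```

The resulting integer token sequence is what downstream semantic-token consumers (e.g. discrete-unit language models) would operate on.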
owsm_ctc_v3.2_ft_1B
simpleoier_librispeech_asr_train_asr_conformer7_wavlm_large_raw_en_bpe5000_sp
owsm_v4_small_370M
Turn_taking_prediction_SWBD
jiyang_tang_cvss-c_es-en_discrete_unit
aceopencpop_svs_visinger2_40singer_pretrain
kan-bayashi_jsut_vits_prosody
espnet_tts_vctk_espnet_spk_voxceleb12_rawnet
arecho_base_v0
This model was trained by ftshijt using the universa_unite recipe in espnet. Follow the ESPnet installation instructions if you haven't done that already.
chenda-li-wsj0_2mix_enh_train_enh_conv_tasnet_raw_valid.si_snr.ave
OpusLM_1.7B_Anneal
kan-bayashi_ljspeech_tacotron2
wanchichen_fleurs_multilingual_asr_hubert_frontend
kan-bayashi_vctk_multi_spk_vits
owsm_v2_ebranchformer
voxcelebs12_ecapa_frozen
owls_9B_180K_intermediates
kan-bayashi_csmsc_tts_train_tacotron2_raw_phn_pypinyin_g2p_phone_train.loss.best
kan-bayashi_jsut_fastspeech2
kan-bayashi_ljspeech_fastspeech2
universa-wavlm_base_urgent24_multi-metric_noref
kan-bayashi_jsut_full_band_vits_prosody
dns_icassp21_enh_train_enh_tcn_tf_raw
owsm_v3.1_ebf_base
OpusLM_7B_Anneal
Wangyou_Zhang_chime4_enh_train_enh_beamformer_mvdr_raw
kan-bayashi_csj_asr_train_asr_conformer
yen-ju-lu-dns_ins20_enh_train_enh_blstm_tf_raw_valid.loss.best
english_male_ryanspeech_fastspeech
Wangyou_Zhang_librimix_train_enh_tse_td_speakerbeam_raw
owsmdata_soundstream_16k_200epoch
mixdata_svs_visinger2_spkembed_lang_pretrained
Yushi_Ueda_mini_librispeech_diar_train_diar_raw_max_epoch20_valid.acc.best
myst_wavlm_aed_transformer
arecho_scale_v0
This model was trained by ftshijt using the universa_unite recipe in espnet. Follow the ESPnet installation instructions if you haven't done that already.
kan-bayashi_jsut_vits_accent_with_pause
kan-bayashi_ljspeech_tts_train_joint_conformer_fastspeech2_hifigan_raw-truncated-af8fe0
visinger2-zh-jp-multisinger-svs
Hoon_Chung_zeroth_korean_asr_train_asr_transformer5_raw_bpe_valid.acc.ave
chenda-li-wsj0_2mix_enh_train_enh_rnn_tf_raw_valid.si_snr.ave
kan-bayashi_csmsc_vits
mls-english_soundstream_16k_360epoch
owls_9B_180K
kan-bayashi_ljspeech_joint_finetune_conformer_fastspeech2_hifigan
iam_handwriting_ocr
english_male_ryanspeech_fastspeech2
owsm_v3.1_ebf_small
Shinji_Watanabe_laborotv_asr_train_asr_conformer2_latest33_raw_char_sp_valid.acc.ave
amuse_soundstream_44.1k
mixdata_svs_visinger2_spkemb_lang_pretrained_avg
kan-bayashi_jsut_tts_train_conformer_fastspeech2_tacotron2_teacher_raw-truncated-15ef5f
kan-bayashi_jvs_tts_finetune_jvs010_jsut_vits_raw_phn_jaconv_pyopenjtalk_prosody_latest
kan-bayashi_ljspeech_tts_train_conformer_fastspeech2_raw_phn_tacotron_-truncated-ec9e34
turkish_commonvoice_blstm
fsc_challenge_slu_2pass_transformer
shihlun_asr_whisper_medium_finetuned_librispeech100
voxcelebs12_ska_wavlm_joint
amuse_dac_16k
dac_16k_audio_single_survey
dac_44k_speech_survey
This model was trained by ftshijt using amuse recipe in espnet. Follow the ESPnet installation instructions if you haven't done that already.
myst_ogi_cmu_kids_aed_upsample
This model was trained by eric102004 using the myst_ogi_cmu_kids recipe in espnet. Follow the ESPnet installation instructions if you haven't done that already.

RESULTS

Environments
- date: `Wed Feb 19 19:12:32 CST 2025`
- python version: `3.12.3 | packaged by Anaconda, Inc. | (main, May 6 2024, 19:46:43) [GCC 11.2.0]`
- espnet version: `espnet 202412`
- pytorch version: `pytorch 2.4.0`
- Git hash: `6f722aee1f9593572d5eddfd8cac7075b07cf9ca`
- Commit date: `Thu Feb 6 22:32:07 2025 -0600`

exp/asrtrainasrlr002rawenchardur05filterbah2l4/decodeasrasrmodelvalid.cer.ave10best

WER

|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
|datacmu/dev|237|2170|89.1|8.7|2.2|2.4|13.3|37.1|
|datacmu/test|475|4287|88.3|9.0|2.6|2.4|14.0|40.0|
|datajibo/dev|853|853|19.1|80.9|0.0|219.0|299.9|97.4|
|datajibo/test|1044|1044|20.4|79.6|0.0|306.0|385.6|94.1|
|datamyst/dev|9037|153273|90.4|7.7|1.8|2.9|12.5|66.8|
|datamyst/test|10311|182712|89.9|7.8|2.3|3.1|13.1|65.0|
|dataogiscripted/dev|5426|15375|98.7|1.1|0.2|0.2|1.5|2.2|
|dataogiscripted/test|15945|45419|98.5|1.2|0.3|0.3|1.9|2.7|
|dataogispon/dev|349|13561|81.0|15.2|3.8|3.2|22.2|96.6|
|dataogispon/test|1095|38811|81.8|14.9|3.3|3.8|22.0|95.3|

|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
|datacmu/dev|237|11449|94.9|2.5|2.6|2.4|7.4|37.1|
|datacmu/test|475|22664|94.4|2.3|3.3|2.2|7.8|40.0|
|datajibo/dev|853|2558|55.8|33.3|10.9|362.2|406.4|97.4|
|datajibo/test|1044|3259|62.1|29.5|8.4|498.7|536.6|94.1|
|datamyst/dev|9037|763728|96.2|1.8|2.0|2.9|6.7|66.8|
|datamyst/test|10311|911898|95.8|1.8|2.4|3.1|7.3|65.0|
|dataogiscripted/dev|5426|83141|99.0|0.5|0.4|0.3|1.2|2.2|
|dataogiscripted/test|15945|244467|98.8|0.6|0.5|0.4|1.5|2.7|
|dataogispon/dev|349|58255|90.3|4.8|4.9|3.7|13.4|96.6|
|dataogispon/test|1095|165977|90.9|4.6|4.5|4.3|13.4|95.3|

|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
myst_ogi_cmu_kids_wavlm_aed
This model was trained by eric102004 using the myst_ogi_cmu_kids recipe in espnet. Follow the ESPnet installation instructions if you haven't done that already.

RESULTS

Environments
- date: `Tue Feb 18 10:11:23 CST 2025`
- python version: `3.12.3 | packaged by Anaconda, Inc. | (main, May 6 2024, 19:46:43) [GCC 11.2.0]`
- espnet version: `espnet 202412`
- pytorch version: `pytorch 2.4.0`
- Git hash: `6f722aee1f9593572d5eddfd8cac7075b07cf9ca`
- Commit date: `Thu Feb 6 22:32:07 2025 -0600`

exp/asrtrainasrwavlmtransformerlr03rawenchardur05filter/decodeasrasrmodelvalid.cer.ave10best

WER

|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
|datacmu/dev|237|2170|91.7|6.5|1.8|2.5|10.9|45.1|
|datacmu/test|475|4287|90.2|7.5|2.3|2.0|11.8|47.8|
|datajibo/dev|853|853|29.8|70.2|0.0|189.4|259.7|88.0|
|datajibo/test|1044|1043|29.4|70.6|0.0|259.6|330.2|86.6|
|datamyst/dev|9037|153273|90.8|6.1|3.1|2.5|11.7|62.7|
|datamyst/test|10311|182712|88.7|6.8|4.5|3.0|14.2|61.9|
|dataogiscripted/dev|5426|15375|93.1|5.8|1.1|0.5|7.4|13.2|
|dataogiscripted/test|15945|45419|90.8|7.9|1.3|1.2|10.4|17.7|
|dataogispon/dev|349|13561|82.3|10.7|7.1|3.5|21.3|95.4|
|dataogispon/test|1095|38811|83.4|10.5|6.1|4.3|20.9|95.5|

|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
|datacmu/dev|237|11449|96.8|1.2|2.0|2.4|5.6|45.1|
|datacmu/test|475|22664|95.7|1.3|3.0|2.0|6.4|47.8|
|datajibo/dev|853|2014|65.7|31.6|2.7|423.9|458.2|88.0|
|datajibo/test|1044|2767|69.1|27.9|3.0|516.7|547.6|86.6|
|datamyst/dev|9037|763728|95.7|1.3|3.0|2.4|6.8|62.7|
|datamyst/test|10311|911898|94.0|1.6|4.4|3.0|8.9|61.9|
|dataogiscripted/dev|5426|83141|96.6|1.8|1.7|0.9|4.3|13.2|
|dataogiscripted/test|15945|244467|95.4|2.4|2.2|1.6|6.2|17.7|
|dataogispon/dev|349|58255|89.1|3.1|7.8|4.4|15.3|95.4|
|dataogispon/test|1095|165977|90.6|2.9|6.6|5.0|14.5|95.5|

|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
exp/asrtrainasrwavlmtransformerlr03rawenchardur05filter/decodeasrjiboasrmodelvalid.cer.ave10best

WER

|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
|datajibo/dev|853|853|12.3|87.7|0.0|1.1|88.7|88.3|
|datajibo/test|1044|1043|10.6|89.4|0.0|1.7|91.1|90.3|

|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
|datajibo/dev|853|2014|22.1|22.7|55.2|1.4|79.3|88.3|
|datajibo/test|1044|2767|21.3|20.6|58.1|2.1|80.8|90.3|

|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
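As a reading aid for these RESULTS tables: `Corr + Sub + Del` accounts for all reference words (or characters), and the reported `Err` is `Sub + Del + Ins`, so it can exceed 100% when the model inserts many spurious tokens. A quick Python sanity check (the helper name is mine, not ESPnet's):

```python
def err_rate(sub: float, dele: float, ins: float) -> float:
    """ESPnet's Err column: substitution + deletion + insertion rates (%)."""
    return round(sub + dele + ins, 1)

# Insertions are counted on top of the reference length, so Err can exceed
# 100% -- as on the jibo sets, where the model over-generates heavily.
assert err_rate(8.7, 2.2, 2.4) == 13.3      # a typical in-domain row
assert err_rate(79.6, 0.0, 306.0) == 385.6  # an insertion-heavy row
```

`S.Err` is the sentence error rate: the fraction of utterances with at least one word error.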
owls_4B_180K
farsi_commonvoice_blstm
owls_1B_180K
kan-bayashi_csmsc_fastspeech2
owsm_v3.1_ebf_small_lowrestriction
brianyan918_iwslt22_dialect_train_asr_conformer_ctc0.3_lr2e-3_warmup15k_newspecaug
kan-bayashi_ljspeech_conformer_fastspeech2
kan-bayashi_vctk_tts_train_full_band_multi_spk_vits_raw_phn_tacotron_g-truncated-50b003
aishell2_att_ctc_espnet2
wanchichen_fleurs_english_asr_wav2vec_frontend
geolid_combined_shared_trainable
Karthik_sinhala_asr_train_asr_transformer
anogkongda-librimix_enh_train_raw_valid.si_snr.ave
kan-bayashi_csj_asr_train_asr_transformer_raw_char_sp_valid.acc.ave
kan-bayashi_csmsc_tts_train_conformer_fastspeech2_raw_phn_pypinyin_g2p_phone_train.loss.ave
kan-bayashi_csmsc_tts_train_fastspeech2_raw_phn_pypinyin_g2p_phone_train.loss.ave
kan-bayashi_jsut_conformer_fastspeech2
kan-bayashi_jsut_fastspeech
kan-bayashi_jsut_fastspeech2_accent
kan-bayashi_jsut_fastspeech2_accent_with_pause
kan-bayashi_jsut_tacotron2
kan-bayashi_jsut_transformer_accent_with_pause
kan-bayashi_jsut_tts_train_fastspeech2_transformer_teacher_raw_phn_jac-truncated-60fc24
kan-bayashi_jsut_tts_train_full_band_vits_raw_phn_jaconv_pyopenjtalk_a-truncated-d7d5d0
kan-bayashi_jsut_tts_train_vits_raw_phn_jaconv_pyopenjtalk_prosody_train.total_count.ave
kan-bayashi_ljspeech_tts_train_tacotron2_raw_phn_tacotron_g2p_en_no_space_train.loss.best
kan-bayashi_vctk_gst_fastspeech2
kan-bayashi_vctk_tts_train_xvector_conformer_fastspeech2_transformer_t-truncated-69a657
german_commonvoice_blstm
french_commonvoice_blstm
id_commonvoice_blstm
greek_commonvoice_blstm
tamil_commonvoice_blstm
kyrgyz_commonvoice_blstm
shihlun-asr-commonvoice-zh-TW
simpleoier_librispeech_hubert_iter1_train_ssl_torchaudiohubert_base_960h_pretrain_it1_raw
WavLabLM-MK-40k
voxcelebs12_ska_wavlm_frozen
voxcelebs12_ecapa_mel
akreal_lh_small_asr2_e_branchformer_wavlm_mistral02_aed
amuse_encodec_16k
dac_16k_music_survey
dac_44k_music_single_survey
This model was trained by ftshijt using amuse recipe in espnet. Follow the ESPnet installation instructions if you haven't done that already.
universa-base_urgent24_multi-metric
asr_mlsuperb2_mms_finetune_baseline
owls_05B_180K
lid_voxlingua107_mms_ecapa
myst_ogi_cmu_kids_owsm_v3.1
This model was trained by eric102004 using the myst_ogi_cmu_kids recipe in espnet. Follow the ESPnet installation instructions if you haven't done that already.

RESULTS

Environments
- date: `Tue Feb 18 14:16:58 CST 2025`
- python version: `3.12.3 | packaged by Anaconda, Inc. | (main, May 6 2024, 19:46:43) [GCC 11.2.0]`
- espnet version: `espnet 202412`
- pytorch version: `pytorch 2.4.0`
- Git hash: `6f722aee1f9593572d5eddfd8cac7075b07cf9ca`
- Commit date: `Thu Feb 6 22:32:07 2025 -0600`

exp/s2towsmv3.1lr000303rawenbpe50000/decodeasrbeam1ctc03jibos2tmodelvalid.cer.ave4best

WER

|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
|datajibo/dev|853|853|6.8|84.3|8.9|0.0|93.2|93.2|
|datajibo/test|1044|1044|8.6|82.9|8.5|0.0|91.4|91.4|

|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
|datajibo/dev|853|2558|18.8|11.8|69.4|0.0|81.2|93.2|
|datajibo/test|1044|3259|18.7|11.1|70.2|0.0|81.3|91.4|

|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
|datajibo/dev|853|2453|18.9|12.8|68.3|0.0|81.1|93.2|
|datajibo/test|1044|3092|18.3|12.6|69.1|0.0|81.7|91.4|

exp/s2towsmv3.1lr000303rawenbpe50000/decodeasrbeam1ctc03s2tmodelvalid.cer.ave4best

WER

|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
|datacmu/dev|237|2170|92.9|4.8|2.2|2.0|9.0|26.2|
|datacmu/test|475|4287|93.4|4.8|1.8|0.9|7.5|28.4|
|datajibo/dev|853|853|22.7|77.3|0.0|203.4|280.7|96.5|
|datajibo/test|1044|1044|27.6|72.4|0.0|282.6|355.0|89.0|
|datamyst/dev|9037|153273|93.2|5.2|1.6|2.3|9.1|56.7|
|datamyst/test|10311|182712|91.5|5.5|3.0|2.5|10.9|57.0|
|dataogiscripted/dev|5426|15375|98.4|1.4|0.2|0.3|1.9|2.7|
|dataogiscripted/test|15945|45419|98.3|1.4|0.3|0.3|2.0|3.1|
|dataogispon/dev|349|13561|88.4|8.7|2.9|2.0|13.7|93.4|
|dataogispon/test|1095|38811|88.5|8.8|2.7|2.9|14.4|92.6|
|highage/test|11196|56799|1.4|34.8|63.8|63.4|162.0|99.9|
|lowage/test|5147|24262|2.2|37.7|60.1|61.2|159.0|98.7|
|midage/test|26532|374547|11.5|44.9|43.6|43.6|132.1|98.2|

|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
|datacmu/dev|237|11449|96.0|1.7|2.3|1.7|5.7|26.2|
|datacmu/test|475|22664|96.6|1.1|2.3|1.2|4.5|28.4|
|datajibo/dev|853|2558|60.3|25.7|14.0|314.9|354.6|96.5|
|datajibo/test|1044|3259|69.0|19.7|11.2|433.4|464.3|89.0|
|datamyst/dev|9037|763728|97.3|1.1|1.6|2.2|4.9|56.7|
|datamyst/test|10311|911898|95.9|1.2|2.9|2.4|6.5|57.0|
|dataogiscripted/dev|5426|83141|98.9|0.7|0.4|0.4|1.5|2.7|
|dataogiscripted/test|15945|244467|98.7|0.7|0.6|0.4|1.7|3.1|
|dataogispon/dev|349|58255|94.5|2.5|3.0|2.8|8.2|93.4|
|dataogispon/test|1095|165977|94.7|2.3|3.0|3.4|8.7|92.6|
|highage/test|11196|278522|18.8|20.2|61.0|60.8|142.0|99.9|
|lowage/test|5147|117778|20.8|22.0|57.1|57.9|137.1|98.7|
|midage/test|26532|1865279|34.5|20.0|45.5|45.5|111.0|98.2|

|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
|datacmu/dev|237|8415|95.2|2.6|2.3|1.7|6.5|26.2|
|datacmu/test|475|16575|95.7|2.2|2.2|1.2|5.5|28.4|
|datajibo/dev|853|2453|54.4|32.5|13.1|235.8|281.3|96.5|
|datajibo/test|1044|3092|61.7|27.7|10.7|331.4|369.7|89.0|
|datamyst/dev|9037|552703|96.5|2.0|1.6|2.3|5.8|56.7|
|datamyst/test|10311|660431|95.0|2.1|2.9|2.5|7.5|57.0|
|dataogiscripted/dev|5426|64772|98.6|1.0|0.4|0.4|1.7|2.7|
|dataogiscripted/test|15945|190001|98.5|1.0|0.5|0.4|1.9|3.1|
|dataogispon/dev|349|40668|92.3|4.6|3.0|2.8|10.5|93.4|
|dataogispon/test|1095|116027|92.6|4.4|3.1|3.6|11.0|92.6|
|highage/test|11196|207835|11.9|31.3|56.8|56.7|144.8|99.9|
|lowage/test|5147|87805|13.2|34.0|52.8|53.6|140.4|98.7|
|midage/test|26532|1353952|23.4|32.6|44.0|44.0|120.6|98.2|
owsm_v3
pengcheng_guo_wenetspeech_asr_train_asr_raw_zh_char
kan-bayashi_tsukuyomi_tts_finetune_full_band_jsut_vits_raw_phn_jaconv_pyopenjtalk_prosody_latest
simpleoier_librispeech_asr_train_asr_conformer7_hubert_ll60k_large_raw_en_bpe5000_sp
vectominist_seame_asr_conformer_bpe5626
wanchichen_fleurs_asr_conformer_hier_lid_utt
Shinji_Watanabe_spgispeech_asr_train_asr_conformer6_n_fft512_hop_lengt-truncated-f1ac86
kamo-naoyuki_librispeech_asr_train_asr_conformer6_n_fft512_hop_length2-truncated-a63357
kan_bayashi_jsut_tts_train_conformer_fastspeech2_raw_phn_jaconv_pyopenjtalk_train.loss.ave
arabic_commonvoice_blstm
WavLabLM-EK-40k
m4singer_svs_xiaoice
slueted_whisper_summ
Chenda_Li_wsj0_2mix_enh_train_enh_conv_tasnet_raw_valid.si_snr.ave
Dan_Berrebbi_aishell4_asr
Shinji_Watanabe_librispeech_asr_train_asr_transformer_e18_raw_bpe_sp_valid.acc.best
YushiUeda_mini_librispeech_diar_train_diar_raw_valid.acc.best
kamo-naoyuki_aishell_conformer
kamo-naoyuki_librispeech_asr_train_asr_conformer5_raw_bpe5000_frontend-truncated-55c091
kan-bayashi_csmsc_tacotron2
kan-bayashi_csmsc_tts_train_fastspeech_raw_phn_pypinyin_g2p_phone_train.loss.best
kan-bayashi_csmsc_tts_train_full_band_vits_raw_phn_pypinyin_g2p_phone_train.total_count.ave
kan-bayashi_csmsc_tts_train_vits_raw_phn_pypinyin_g2p_phone_train.total_count.ave
kan-bayashi_jsut_conformer_fastspeech2_accent
kan-bayashi_jsut_conformer_fastspeech2_accent_with_pause
kan-bayashi_jsut_tts_train_conformer_fastspeech2_tacotron2_teacher_raw-truncated-a7f080
kan-bayashi_jsut_tts_train_conformer_fastspeech2_tacotron2_teacher_raw-truncated-569e81
kan-bayashi_jsut_tts_train_conformer_fastspeech2_transformer_teacher_r-truncated-35ef5a
kan-bayashi_jsut_tts_train_fastspeech2_tacotron2_teacher_raw_phn_jacon-truncated-f45dcb
kan-bayashi_jsut_tts_train_fastspeech2_tacotron2_teacher_raw_phn_jacon-truncated-e5d906
kan-bayashi_jsut_tts_train_full_band_vits_raw_phn_jaconv_pyopenjtalk_p-truncated-66d5fc
kan-bayashi_jsut_tts_train_transformer_raw_phn_jaconv_pyopenjtalk_train.loss.ave
kan-bayashi_jvs_jvs010_vits_prosody
kan-bayashi_ljspeech_fastspeech
kan-bayashi_ljspeech_tts_train_fastspeech_raw_phn_tacotron_g2p_en_no_space_train.loss.best
kan-bayashi_ljspeech_tts_train_transformer_raw_phn_tacotron_g2p_en_no_space_train.loss.ave
kan-bayashi_vctk_gst_tacotron2
kan-bayashi_vctk_gst_xvector_conformer_fastspeech2
kan-bayashi_vctk_tts_train_gst_conformer_fastspeech2_raw_phn_tacotron_-truncated-69081b
kan-bayashi_vctk_tts_train_gst_fastspeech2_raw_phn_tacotron_g2p_en_no_space_train.loss.ave
kan-bayashi_vctk_tts_train_xvector_transformer_raw_phn_tacotron_g2p_en_no_space_train.loss.ave
roshansh_how2_asr_raw_ft_sum_valid.acc
xuankai_chang_librispeech_asr_train_asr_conformer7_wav2vec2_960hr_larg-truncated-5b94d9
ml_openslr63
bn_openslr53
Wangyou_Zhang_chime4_enh_train_enh_dc_crn_mapping_snr_raw
GunnarThor_talromur_d_tacotron2
thai_commonvoice_blstm
Wangyou_Zhang_wsj0_2mix_enh_train_enh_dptnet_raw
slurp_slu_2pass
simpleoier_librispeech_hubert_iter0_train_ssl_torchaudiohubert_base_960h_pretrain_it0_raw
pengcheng_librimix_asr_train_sot_asr_conformer_wavlm_raw_en_char_sp
akreal_ls100_asr2_e_branchformer1_1gpu_raw_wavlm_large_21_km2k_bpe_rm6k_bpe_ts5k_sp
kohei0209_ted3_asr2_e_branchformer1_raw_wavlm_large_21_km1000_bpe_rm2000_bpe_ts500_sp
jinchuat_aishell_brctc
iwslt24_indic_en_ta_bpe_tc4000
sluevoxceleb_wavlm_finetune_asr
mls-multi_soundstream_16k_360epoch
dac_16k_all_survey
dac_16k_audio_survey
dac_16k_music_single_survey
dac_16k_speech_single_survey
dac_44k_audio_survey
This model was trained by ftshijt using amuse recipe in espnet. Follow the ESPnet installation instructions if you haven't done that already.
dac_44k_all_survey
This model was trained by ftshijt using amuse recipe in espnet. Follow the ESPnet installation instructions if you haven't done that already.
owls_2B_180K
owls_025B_180K
universa-wavlm_base_urgent24_multi-metric_audioref
`espnet/universa-wavlm_base_urgent24_multi-metric_audioref` This model was trained by ftshijt using the urgent24 recipe in espnet. Please check the Colab link for a simple demo of how to use UniVERSA.
audioset_dac_16k
arecho_base_v0.1-large-decoder
This model was trained by ftshijt using the universa_unite recipe in espnet. Follow the ESPnet installation instructions if you haven't done that already.
opencpop_svs_train_toksing_300epoch-multi_hl6_wl6_wl23
geolid_vl107only_shared_frozen
This geolocation-aware language identification (LID) model is developed with the ESPnet toolkit. It uses the pretrained MMS-1B model as the encoder and an ECAPA-TDNN embedding extractor to achieve robust spoken language identification. The main innovations of this model are:

1. Incorporating geolocation prediction as an auxiliary task during training.
2. Conditioning the intermediate representations of the self-supervised learning (SSL) encoder on intermediate-layer information.

This geolocation-aware strategy greatly improves robustness, especially for dialects and accented variations. For further details on the methodology, please refer to our paper: Geolocation-Aware Robust Spoken Language Identification (arXiv).

Prerequisites

First, ensure you have ESPnet installed. If not, follow the ESPnet installation instructions.

Quick Start

Run the following commands to set up and use the pre-trained model:

This will download the pre-trained model and run inference using the VoxLingua107 test data. Training used only the VoxLingua107 dataset, comprising 6,628 hours of speech across 107 languages from YouTube.

| Dataset | Domain | #Langs. Train/Test | Dialect | Training Setup (VL107-only) |
| ------------- | ----------- | ------------------ | ------- | --------------------------- |
| VoxLingua107 | YouTube | 107/33 | No | Seen |
| Babel | Telephone | 25/25 | No | Unseen |
| FLEURS | Read speech | 102/102 | No | Unseen |
| ML-SUPERB 2.0 | Mixed | 137/(137, 8) | Yes | Unseen |
| VoxPopuli | Parliament | 16/16 | No | Unseen |

Accuracy (%) on In-domain and Out-of-domain Test Sets

| ESPnet Recipe | Config | VoxLingua107 | Babel | FLEURS | ML-SUPERB2.0 Dev | ML-SUPERB2.0 Dialect | VoxPopuli | Macro Avg. |
| ------------------------- | ----------- | ------------ | ----- | ------ | ---------------- | -------------------- | --------- | ---------- |
| egs2/geolid/lid1 | `conf/voxlingua107only/mmsecapaupcon3244it0.4sharedfrozen.yaml` | 94.3 | 85.9 | 94.3 | 88.8 | 80.7 | 89.2 | 88.8 |

For more detailed inference results, please refer to the `expvoxlingua107only/lidmmsecapaupcon3244it0.4sharedfrozenraw/inference` directory in this repository.

> Note (2025-08-18):
> The corresponding GitHub recipe egs2/geolid/lid1 has not yet been merged into the ESPnet master branch.
> See TODO: add PR link for the latest updates.
geolid_vl107only_independent_frozen
This geolocation-aware language identification (LID) model is developed with the ESPnet toolkit. It uses the pretrained MMS-1B model as the encoder and an ECAPA-TDNN embedding extractor to achieve robust spoken language identification. The main innovations of this model are:

1. Incorporating geolocation prediction as an auxiliary task during training.
2. Conditioning the intermediate representations of the self-supervised learning (SSL) encoder on intermediate-layer information.

This geolocation-aware strategy greatly improves robustness, especially for dialects and accented variations. For further details on the methodology, please refer to our paper: Geolocation-Aware Robust Spoken Language Identification (arXiv).

Prerequisites

First, ensure you have ESPnet installed. If not, follow the ESPnet installation instructions.

Quick Start

Run the following commands to set up and use the pre-trained model:

This will download the pre-trained model and run inference using the VoxLingua107 test data. Training used only the VoxLingua107 dataset, comprising 6,628 hours of speech across 107 languages from YouTube.

| Dataset | Domain | #Langs. Train/Test | Dialect | Training Setup (VL107-only) |
| ------------- | ----------- | ------------------ | ------- | --------------------------- |
| VoxLingua107 | YouTube | 107/33 | No | Seen |
| Babel | Telephone | 25/25 | No | Unseen |
| FLEURS | Read speech | 102/102 | No | Unseen |
| ML-SUPERB 2.0 | Mixed | 137/(137, 8) | Yes | Unseen |
| VoxPopuli | Parliament | 16/16 | No | Unseen |

Accuracy (%) on In-domain and Out-of-domain Test Sets

| ESPnet Recipe | Config | VoxLingua107 | Babel | FLEURS | ML-SUPERB2.0 Dev | ML-SUPERB2.0 Dialect | VoxPopuli | Macro Avg. |
| ------------------------- | ----------- | ------------ | ----- | ------ | ---------------- | -------------------- | --------- | ---------- |
| egs2/geolid/lid1 | `conf/voxlingua107only/mmsecapaupcon3244it0.4independentfrozen.yaml` | 94.2 | 87.1 | 95.0 | 89.0 | 77.2 | 90.4 | 88.8 |

For more detailed inference results, please refer to the `expvoxlingua107only/lidmmsecapaupcon3244it0.4independentfrozenraw/inference` directory in this repository.

> Note (2025-08-18):
> The corresponding GitHub recipe egs2/geolid/lid1 has not yet been merged into the ESPnet master branch.
> See TODO: add PR link for the latest updates.
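The geolocation auxiliary task described in these model cards amounts to adding a weighted location-prediction term to the LID classification loss. A minimal NumPy sketch of such a joint objective (the weight `lam`, the MSE form of the geolocation term, and the helper names are illustrative assumptions, not the recipe's actual code):

```python
import numpy as np

def cross_entropy(logits: np.ndarray, label: int) -> float:
    """Softmax cross-entropy for the LID classification head."""
    logits = logits - logits.max()  # shift for numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return float(-log_probs[label])

def joint_loss(lid_logits, lid_label, geo_pred, geo_target, lam=0.1):
    """LID loss plus a weighted MSE on predicted (lat, lon) coordinates."""
    geo_mse = float(np.mean((geo_pred - geo_target) ** 2))
    return cross_entropy(lid_logits, lid_label) + lam * geo_mse

loss = joint_loss(
    np.array([2.0, 0.5, -1.0]), 0,   # 3-way LID logits, true class 0
    np.array([48.9, 2.4]),           # predicted (lat, lon)
    np.array([48.8, 2.3]),           # target (lat, lon)
)
```

With `lam = 0`, this reduces to a plain LID classifier; the auxiliary term nudges the shared encoder toward geography-aware representations, which is what the paper credits for the dialect robustness gains.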
myst_ogi_cmu_kids_rnnt
This model was trained by eric102004 using the myst_ogi_cmu_kids recipe in espnet. Follow the ESPnet installation instructions if you haven't done that already.

RESULTS

Environments
- date: `Tue Feb 18 14:11:50 CST 2025`
- python version: `3.12.3 | packaged by Anaconda, Inc. | (main, May 6 2024, 19:46:43) [GCC 11.2.0]`
- espnet version: `espnet 202412`
- pytorch version: `pytorch 2.4.0`
- Git hash: `6f722aee1f9593572d5eddfd8cac7075b07cf9ca`
- Commit date: `Thu Feb 6 22:32:07 2025 -0600`

exp/asrtrainasrtransducerebranchformere12mlp1024linear1024lr005rawenchardur05filter/decodetransducerasrmodelvalid.cerctc.ave10best

WER

|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
|datacmu/dev|237|2170|87.7|9.7|2.6|3.4|15.7|46.8|
|datacmu/test|475|4287|87.3|9.3|3.4|2.5|15.1|47.6|
|datajibo/dev|853|853|36.6|59.0|4.5|186.5|249.9|81.1|
|datajibo/test|1044|1043|38.4|58.3|3.3|253.0|314.6|77.6|
|datamyst/dev|9037|153273|90.1|7.5|2.4|2.2|12.2|66.5|
|datamyst/test|10311|182712|89.8|7.6|2.6|2.4|12.6|65.4|
|dataogiscripted/dev|5426|15375|97.9|1.4|0.7|0.3|2.3|3.9|
|dataogiscripted/test|15945|45419|97.7|1.3|0.9|0.5|2.7|4.6|
|dataogispon/dev|349|13561|80.1|15.2|4.7|2.5|22.4|95.7|
|dataogispon/test|1095|38811|81.1|14.7|4.2|3.1|22.1|95.1|
|highage/test|11196|56799|1.4|34.8|63.8|62.5|161.1|99.9|
|lowage/test|5147|24262|2.2|37.2|60.6|60.3|158.1|98.7|
|midage/test|26532|374547|11.4|44.9|43.8|43.5|132.2|98.5|

|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
|datacmu/dev|237|11449|94.0|2.6|3.4|2.9|8.9|46.8|
|datacmu/test|475|22664|93.4|2.4|4.2|2.3|8.9|47.6|
|datajibo/dev|853|2014|66.5|25.8|7.7|400.9|434.5|81.1|
|datajibo/test|1044|2767|74.8|20.7|4.6|490.0|515.3|77.6|
|datamyst/dev|9037|763728|95.8|1.7|2.5|2.3|6.5|66.5|
|datamyst/test|10311|911898|95.6|1.8|2.6|2.5|6.8|65.4|
|dataogiscripted/dev|5426|83141|98.5|0.4|1.1|0.3|1.8|3.9|
|dataogiscripted/test|15945|244467|98.3|0.5|1.2|0.5|2.2|4.6|
|dataogispon/dev|349|58255|89.5|4.7|5.9|3.0|13.6|95.7|
|dataogispon/test|1095|165977|90.1|4.5|5.4|3.7|13.5|95.1|
|highage/test|11196|278522|18.7|20.1|61.1|59.6|140.9|99.9|
|lowage/test|5147|117778|20.5|21.8|57.7|56.5|136.0|98.7|
|midage/test|26532|1865279|34.4|19.9|45.7|45.5|111.1|98.5|

|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|

exp/asrtrainasrtransducerebranchformere12mlp1024linear1024lr005rawenchardur05filter/decodetransducerjiboasrmodelvalid.cerctc.ave10best

WER

|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
|datajibo/dev|853|853|36.5|59.1|4.5|148.4|212.0|81.1|
|datajibo/test|1044|1043|38.4|58.3|3.3|198.2|259.7|77.6|

|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
|datajibo/dev|853|2014|65.7|26.0|8.3|316.5|350.8|81.1|
|datajibo/test|1044|2767|74.3|20.7|5.0|379.9|405.6|77.6|

|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
WavLabLM-MS-40k
Shinji_Watanabe_open_li52_asr_train_asr_raw_bpe7000_valid.acc.ave
kan-bayashi_tsukuyomi_full_band_vits_prosody
Wangyou_Zhang_universal_train_enh_uses_refch0_2mem_raw
Hoon_Chung_jsut_asr_train_asr_conformer8_raw_char_sp_valid.acc.ave
brianyan918_iwslt22_dialect_train_st_conformer_ctc0.3_lr2e-3_warmup15k_newspecaug
kan-bayashi_jvs_tts_finetune_jvs001_jsut_vits_raw_phn_jaconv_pyopenjta-truncated-178804
kan-bayashi_ljspeech_tts_finetune_joint_conformer_fastspeech2_hifigan_-truncated-737899
kan-bayashi_ljspeech_tts_train_fastspeech2_raw_phn_tacotron_g2p_en_no_space_train.loss.ave
chai_librispeech_asr_train_conformer-rnn_transducer_raw_en_bpe5000_sp
pt_commonvoice_blstm
english_male_ryanspeech_conformer_fastspeech2
guangzhisun_librispeech100_asr_train_conformer_transducer_tcpgen500_deep_sche30_GCN6L_rep_suffix
acesinger_opencpop_visinger2_44khz
m4singer_svs_naive_rnn_dp
voxcelebs12_xvector_mel
mms_1b_mlsuperb
owls_18B_360K
geolid_vl107only_shared_trainable
Chenda_Li_wsj0_2mix_enh_train_enh_rnn_tf_raw_valid.si_snr.ave
Karthik_DSTC2_asr_train_asr_transformer
Shinji_Watanabe_spgispeech_asr_train_asr_conformer6_n_fft512_hop_lengt-truncated-a013d0
YushiUeda_iemocap_sentiment_asr_train_asr_conformer
ftshijt_espnet2_asr_totonac_transformer
kamo-naoyuki_hkust_asr_train_asr_transformer2_raw_zh_char_batch_bins20-truncated-934e17
kamo-naoyuki_librispeech_asr_train_asr_conformer5_raw_bpe5000_frontend-truncated-b76af5
kamo-naoyuki_librispeech_asr_train_asr_conformer5_raw_bpe5000_schedule-truncated-c8e5f9
kamo-naoyuki_reverb_asr_train_asr_transformer4_raw_char_batch_bins1600-truncated-1b72bb
kamo-naoyuki_wsj_transformer2
kan-bayashi_csmsc_fastspeech
kan-bayashi_jsut_conformer_fastspeech2_tacotron2_prosody
kan-bayashi_jsut_conformer_fastspeech2_transformer_prosody
kan-bayashi_jsut_full_band_vits_accent_with_pause
kan-bayashi_jsut_tacotron2_accent
kan-bayashi_jsut_transformer
kan-bayashi_jsut_transformer_prosody
kan-bayashi_jsut_tts_train_conformer_fastspeech2_raw_phn_jaconv_pyopenjtalk_train.loss.ave
kan-bayashi_jsut_tts_train_conformer_fastspeech2_transformer_teacher_r-truncated-74c1b4
kan-bayashi_jsut_tts_train_conformer_fastspeech2_transformer_teacher_r-truncated-f43d8f
kan-bayashi_jsut_tts_train_fastspeech2_raw_phn_jaconv_pyopenjtalk_train.loss.ave
kan-bayashi_jsut_tts_train_fastspeech2_transformer_teacher_raw_phn_jac-truncated-6f4cf5
kan-bayashi_jsut_tts_train_fastspeech_raw_phn_jaconv_pyopenjtalk_train.loss.best
kan-bayashi_jsut_tts_train_tacotron2_raw_phn_jaconv_pyopenjtalk_accent_train.loss.ave
kan-bayashi_jsut_tts_train_tacotron2_raw_phn_jaconv_pyopenjtalk_train.loss.best
kan-bayashi_jsut_tts_train_transformer_raw_phn_jaconv_pyopenjtalk_prosody_train.loss.ave
kan-bayashi_jvs_jvs010_vits_accent_with_pause
kan-bayashi_jvs_tts_finetune_jvs010_jsut_vits_raw_phn_jaconv_pyopenjta-truncated-d57a28
kan-bayashi_libritts_gst_xvector_conformer_fastspeech2
kan-bayashi_libritts_gst_xvector_trasnformer
kan-bayashi_libritts_tts_train_gst_xvector_trasnformer_raw_phn_tacotro-truncated-250027
kan-bayashi_libritts_tts_train_xvector_conformer_fastspeech2_transform-truncated-42b443
kan-bayashi_libritts_tts_train_xvector_vits_raw_phn_tacotron_g2p_en_no-truncated-09d645
kan-bayashi_libritts_xvector_conformer_fastspeech2
kan-bayashi_libritts_xvector_trasnformer
kan-bayashi_ljspeech_transformer
kan-bayashi_ljspeech_tts_train_vits_raw_phn_tacotron_g2p_en_no_space_train.total_count.ave
kan-bayashi_vctk_gst_conformer_fastspeech2
kan-bayashi_vctk_gst_fastspeech
kan-bayashi_vctk_tts_train_gst_fastspeech_raw_phn_tacotron_g2p_en_no_space_train.loss.best
kan-bayashi_vctk_tts_train_gst_transformer_raw_phn_tacotron_g2p_en_no_space_train.loss.ave
kan-bayashi_vctk_tts_train_gst_xvector_conformer_fastspeech2_transform-truncated-e051a9
kan-bayashi_vctk_tts_train_gst_xvector_tacotron2_raw_phn_tacotron_g2p_en_no_space_train.loss.ave
kan-bayashi_vctk_tts_train_xvector_tacotron2_raw_phn_tacotron_g2p_en_no_space_train.loss.ave
shinji-watanabe-librispeech_asr_train_asr_transformer_e18_raw_bpe_sp_valid.acc.best
simpleoier_librispeech_asr_train_asr_conformer7_wav2vec2_960hr_large_raw_en_bpe5000_sp
su_openslr36
YushiUeda_swbd_sentiment_asr_train_asr_conformer_wav2vec2
ftshijt_espnet2_asr_dsing_hubert_conformer
russian_commonvoice_blstm
chai_librispeech_asr_train_rnnt_conformer_raw_en_bpe5000_sp
Wangyou_Zhang_wsj0_2mix_enh_dc_crn_mapping_snr_raw
YushiUeda_librimix_diar_enh_2_3_spk_lmf
accented_french_openslr57_ASR_transformer
GunnarThor_talromur_f_fastspeech2
GunnarThor_talromur_b_tacotron2
GunnarThor_talromur_c_fastspeech2
GunnarThor_talromur_d_fastspeech2
GunnarThor_talromur_h_fastspeech2
YushiUeda_harpervalley_train_asr_hubert_raw_en_word
simpleoier_chime4_enh_asr_convtasnet_init_noenhloss_wavlm_transformer_init_raw_en_char
zh-CN_commonvoice_blstm
simpleoier_chime6_asr_transformer_wavlm_lr1e-3
english_male_ryanspeech_tacotron
slurp_slu_2pass_gt
fsc_challenge_slu_2pass_conformer
fsc_challenge_slu_2pass_transformer_gt
jiyangtang_magicdata_asr_conformer_lm_transformer
talromur2_xvector_tacotron2
simpleoier_librimix_asr_train_asr_transformer_multispkr_raw_en_char_sp
realzza-meld-asr-hubert-transformer
stop_hubert_slu_raw_en_bpe500
librispeech_multiblank_transducer_8421
jiyang_tang_aphsiabank_english_asr_ebranchformer_small_wavlm_large1
simpleoier_ls960_asr2_train_e_branchformer1_raw_wavlm_large_21_km2000_bpe_rm6000_bpe_ts5000_sp
simpleoier_ls960_asr2_train_e_branchformer1_1gpu_raw_wavlm_large_21_km1k_bpe_rm5k_bpe_ts5k_sp
simpleoier_ls960_asr2_e_branchformer1_conv1d3_1gpu_raw_wavlm_large_21_km1k_bpe_rm5k_bpe_ts5k_sp
Wangyou_Zhang_wsj0_2mix_train_enh_tse_td_speakerbeam_raw
chendali_librimix_asr_train_sot_asr_whisper_small_raw_en_whisper_multilingual
yoshiki_wsj_asr_conformer_s3prlfrontend_wavlm_raw_en_char
eason_gigaspeech_train_asr2_e_branchformer12_lr_raw_wavlm_large_21_km1000
msk_lrs3_train_avsr_avhubert_large_extracted_en_bpe1000
owsm_v1
juice500ml_mls_10h_asr_ssl
juice500ml_mls_10h_discrete_asr
akreal_lh_small_asr2_e_branchformer_wavlm_large_21_km2k_bpe_rm6k_bpe_ts3k_sp
akreal_lh_medium_asr2_e_branchformer_wavlm_large_21_km1k_bpe_rm6k_bpe_ts3k
vctk_tts_train_espnet_rawnet_vits
opencpop_visinger
opencpop_visinger_transfer_acesinger
opencpop_naive_rnn_dp
opencpop_xiaoice
oniku_kurumi_utagoe_svs_db_naive_rnn_dp
oniku_kurumi_utagoe_db_xiaoice
kiritan_svs_rnn
kiritan_svs_xiaoice
kiritan_svs_visinger
oniku_kurumi_utagoe_db_svs_visinger
oniku_kurumi_utagoe_db_svs_visinger2
voxblinkclean_rawnet3
voxcelebs12_mfaconformer_mel
akreal_lh_small_asr2_e_branchformer_wavlm_mistral02_ctc
voxcelebs12devs_librispeech_cv16fa_rawnet3
interspeech2024_dsuchallenge_wavlm_large_21_km2000_bpe_rm3000_bpe_ts6500_baseline
sluevoxceleb_whisper_lightweight_sa
sluevoxceleb_owsm_lightweight_sa
sluevoxceleb_whisper_finetune_sa
sluevoxceleb_wavlm_lightweight_asr
sluevoxceleb_whisper_lightweight_asr
sluevoxceleb_whisper_finetune_asr
sluevoxceleb_owsm_finetune_asr
sluevoxceleb_whisper_complex_slu
libritts_soundstream16k
libritts_soundstream24k
libritts_encodec_24k
libritts_dac_24k
amuse_speech_soundstream_16k
speechlm_tts_ls_giga_mlsen_amuse_speech_delay
speechlm_tts_ls_giga_mlsen_amuse_speech_multiscale
mls-english_soundstream_16k
mls-multi_encodec_16k
mls-multi_soundstream_16k
mls-audioset_soundstream_16k
mls-audioset_encodec_16k
mls-multi_encodec_16k_360epoch
mls-audioset_soundstream_16k_360epoch
dac_16k_speech_survey
dac_16k_all_single_survey
BEATs-AS20K
dac_44k_audio_single_survey
This model was trained by ftshijt using the amuse recipe in ESPnet. Follow the ESPnet installation instructions if you haven't done so already.
dac_44k_all_single_survey
This model was trained by ftshijt using the amuse recipe in ESPnet. Follow the ESPnet installation instructions if you haven't done so already.
dac_44k_speech_single_survey
This model was trained by ftshijt using the amuse recipe in ESPnet. Follow the ESPnet installation instructions if you haven't done so already.
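The notes above point to the generic ESPnet setup. As a minimal sketch of fetching one of these pretrained models programmatically (assumptions: `espnet` and `espnet_model_zoo` are installed, e.g. via `pip install espnet espnet_model_zoo`, and the listed names resolve to Hugging Face tags under the `espnet/` namespace, as the `espnet/universa-...` entry further down suggests):

```python
# Illustrative tag taken from this listing; swap in any other model name.
MODEL_TAG = "espnet/dac_44k_speech_single_survey"


def fetch(tag: str) -> dict:
    """Download and unpack a pretrained ESPnet model.

    Returns a dict mapping file roles (config, model parameters, ...)
    to local paths. The import is deferred so this sketch can be read
    and imported without espnet_model_zoo installed.
    """
    from espnet_model_zoo.downloader import ModelDownloader

    return ModelDownloader().download_and_unpack(tag)


# Requires network access on first use; results are cached locally:
# files = fetch(MODEL_TAG)
```

The unpacked paths can then be handed to the matching ESPnet inference class for the task (TTS, ASR, codec, etc.).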
BEATs-BEAN.Watkins
BEATs-BEAN.Dogs
BEATs-BEAN.Bats
wanchichen_xeus_fleurs_finetune
universa-wavlm_base_urgent24_multi-metric
asr_mlsuperb2_mshubert_freeze_baseline
asr_mlsuperb2_mshubert_finetune_baseline
asr_mlsuperb2_xlsr_finetune_baseline
universa-wavlm_base_urgent24_multi-metric_fullref
`espnet/universa-wavlmbaseurgent24multi-metricfullref` This model was trained by ftshijt using the urgent24 recipe in ESPnet. Please check the Colab link for a simple demo of how to use UniVERSA.
owls_05B_180K_intermediates
owsm_dac_v2_16k
This model was trained by ftshijt using the amuse recipe in ESPnet. Follow the ESPnet installation instructions if you haven't done so already.
cnceleb_resnet34
cnceleb_resnet221
owsm_v2
belarusian_commonvoice_blstm
transformer_tts_cmu_indic_hin_ab
mixdata_svs_visinger2_spkemb_lang_pretrained
Yushi_Ueda_ksponspeech_asr_train_asr_conformer8_n_fft512_hop_length256-truncated-eb42e5
kan-bayashi_csmsc_conformer_fastspeech2
kan-bayashi_vctk_full_band_multi_spk_vits
YushiUeda_swbd_sentiment_asr_train_asr_conformer
YushiUeda_swbd_sentiment_asr_train_asr_conformer_wav2vec2_2
mediaspeech-spanish-hubert
wanchichen_fleurs_asr_conformer_scctc
khassan_KSC_transformer
pengcheng_aishell_asr_train_asr_whisper_medium_finetune_raw_zh_whisper_multilingual_sp
voxcelebs12devs_voxblinkfull_rawnet3
libriheavy_small_ebranchformer
voxcelebs12_ebranchformer_base
opencpop_svs2_toksing_pretrain
This model was trained by TangRain using the opencpop recipe in ESPnet. The model was trained for 300 epochs, and the checkpoint with the best validation loss was selected.