Jzuluaga
accent-id-commonaccent_xlsr-en-english
CommonAccent: Exploring Large Acoustic Pretrained Models for Accent Classification Based on CommonVoice

Abstract: Despite recent advancements in Automatic Speech Recognition (ASR), the recognition of accented speech remains a prominent problem. To build more inclusive ASR systems, research has shown that integrating accent information into a larger ASR framework can mitigate accented speech errors. We address multilingual accent classification through the ECAPA-TDNN and Wav2Vec 2.0/XLSR architectures, which have been proven to perform well on a variety of speech-related downstream tasks. We introduce a simple-to-follow recipe aligned with the SpeechBrain toolkit for accent classification based on Common Voice 7.0 (English) and Common Voice 11.0 (Italian, German, and Spanish). Furthermore, we establish a new state of the art for English accent classification with up to 95% accuracy. We also study the internal categorization of the Wav2Vec 2.0 embeddings through t-SNE, noting that there is a level of clustering based on phonological similarity.

This repository provides all the necessary tools to perform accent identification from speech recordings with SpeechBrain. The system uses a model pretrained on the CommonAccent dataset in English (16 accents). This system is based on the CommonLanguage recipe located here: https://github.com/speechbrain/speechbrain/tree/develop/recipes/CommonLanguage

The provided system can recognize the following 16 accents from short speech recordings in English (EN):

Github repository link: https://github.com/JuanPZuluaga/accent-recog-slt2022

NOTE: due to an incompatibility between the model and the current SpeechBrain interfaces, we cannot offer the Inference API. Please follow the steps in "Perform Accent Identification from Speech Recordings" to use this English Accent ID model. For a better experience, we encourage you to learn more about SpeechBrain.
Pipeline description

This system is composed of a fine-tuned XLSR model coupled with statistical pooling. A classifier, trained with NLL loss, is applied on top of that. The system is trained with recordings sampled at 16 kHz (single channel). The code will automatically normalize your audio (i.e., resampling + mono channel selection) when calling `classify_file`, if needed. Make sure your input tensor is compliant with the expected sampling rate if you use `encode_batch` or `classify_batch`.

First of all, please install SpeechBrain with the following command:

Note that we encourage you to read our tutorials and learn more about SpeechBrain.

Perform Accent Identification from Speech Recordings

Inference on GPU

To perform inference on the GPU, add `run_opts={"device":"cuda"}` when calling the `from_hparams` method.

3. Clone our repository at https://github.com/JuanPZuluaga/accent-recog-slt2022: you can find our training results (models, logs, etc.) on this repository's `Files and versions` page.

Limitations

The SpeechBrain team does not provide any warranty on the performance achieved by this model when used on other datasets.

If you find this work useful, please cite our work as:

Cite SpeechBrain

Please cite SpeechBrain if you use it for your research or business.
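The statistical pooling step mentioned in the pipeline description can be sketched in plain Python. This is an illustrative toy version only (the real system pools XLSR frame embeddings inside SpeechBrain), not the recipe's actual code:

```python
from statistics import mean, pstdev

def statistical_pooling(frames):
    """Collapse a variable-length sequence of frame embeddings
    (a list of equal-length vectors) into one fixed-size utterance
    vector: per-dimension mean concatenated with per-dimension
    (population) standard deviation."""
    dims = range(len(frames[0]))
    means = [mean(f[d] for f in frames) for d in dims]
    stds = [pstdev(f[d] for f in frames) for d in dims]
    return means + stds

# Three 2-dimensional "frame embeddings" -> one 4-dimensional utterance vector,
# regardless of how many frames the recording contains.
pooled = statistical_pooling([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
```

Because the pooled vector has a fixed size, the classifier on top can be a plain feed-forward layer even though utterances vary in length.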
bert-base-ner-atc-en-atco2-1h
This model performs named-entity recognition (NER) on air traffic control (ATC) communications data. We solve this challenge by performing token classification (NER) with a BERT model: we fine-tune a pretrained BERT model on the NER task. For instance, if you have the following transcript/gold annotation:

- Utterance: lufthansa three two five cleared to land runway three four left

Could you tell what the main entities in the communication are? The desired output is shown below:

- Named-entity module output: [call] lufthansa three two five [/call] [cmd] cleared to land [/cmd] [val] runway three four left [/val]

This model is a fine-tuned version of bert-base-uncased on the ATCO2 corpus (1-hour test set). It achieves the following results on the development set:

- Loss: 1.4282
- Precision: 0.6195
- Recall: 0.7071
- F1: 0.6604
- Accuracy: 0.8182

Paper: ATCO2 corpus: A Large-Scale Dataset for Research on Automatic Speech Recognition and Natural Language Understanding of Air Traffic Control Communications

Authors: Juan Zuluaga-Gomez, Karel Veselý, Igor Szöke, Petr Motlicek, Martin Kocour, Mickael Rigault, Khalid Choukri, Amrutha Prasad and others

Abstract: Personal assistants, automatic speech recognizers and dialogue understanding systems are becoming more critical in our interconnected digital world. A clear example is air traffic control (ATC) communications. ATC aims at guiding aircraft and controlling the airspace in a safe and optimal manner. These voice-based dialogues are carried out between an air traffic controller (ATCO) and pilots via very-high-frequency radio channels. In order to incorporate these novel technologies into ATC (a low-resource domain), large-scale annotated datasets are required to develop data-driven AI systems. Two examples are automatic speech recognition (ASR) and natural language understanding (NLU).
In this paper, we introduce the ATCO2 corpus, a dataset that aims at fostering research on the challenging ATC field, which has lagged behind due to lack of annotated data. The ATCO2 corpus covers 1) data collection and pre-processing, 2) pseudo-annotation of speech data, and 3) extraction of ATC-related named entities. The ATCO2 corpus is split into three subsets. 1) The ATCO2-test-set corpus contains 4 hours of ATC speech with manual transcripts and a subset with gold annotations for named-entity recognition (callsign, command, value). 2) The ATCO2-PL-set corpus consists of 5281 hours of unlabeled ATC data enriched with automatic transcripts from an in-domain speech recognizer, contextual information, speaker turn information, a signal-to-noise ratio estimate and an English language detection score per sample. Both are available for purchase through ELDA. 3) The ATCO2-test-set-1h corpus is a one-hour subset of the original test set corpus, which we offer for free at this URL: https://www.atco2.org/data. We expect the ATCO2 corpus will foster research on robust ASR and NLU, not only in the field of ATC communications but also in the general research community.

Code — GitHub repository: https://github.com/idiap/atco2-corpus

This model was fine-tuned on air traffic control data. We do not expect it to keep the same performance on other datasets where BERT was pre-trained or fine-tuned. See Table 6 (page 18) in our paper: ATCO2 corpus: A Large-Scale Dataset for Research on Automatic Speech Recognition and Natural Language Understanding of Air Traffic Control Communications. We describe there the data used to fine-tune our NER model.

- We use the ATCO2 corpus to fine-tune this model.
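The bracketed output format shown earlier can be produced from per-token predictions. The following sketch is not the repository's code, and the BIO-style tag names (`B-call`, `I-cmd`, …) are assumed for illustration; it renders word-level tags as `[call] ... [/call]` spans:

```python
def tags_to_brackets(words, tags):
    """Render per-word BIO-style tags as bracketed entity spans,
    e.g. [call] ... [/call]. Opens a bracket when the entity type
    changes and closes it when it ends (a simplification: adjacent
    same-type entities would be merged)."""
    out, open_ent = [], None
    for word, tag in zip(words, tags):
        # "B-call" / "I-call" -> "call"; the outside tag "O" -> None.
        ent = tag.split("-", 1)[1] if "-" in tag else None
        if ent != open_ent:
            if open_ent:
                out.append(f"[/{open_ent}]")
            if ent:
                out.append(f"[{ent}]")
            open_ent = ent
        out.append(word)
    if open_ent:
        out.append(f"[/{open_ent}]")
    return " ".join(out)

# The utterance from the example above, with hypothetical gold tags.
words = "lufthansa three two five cleared to land runway three four left".split()
tags = ["B-call", "I-call", "I-call", "I-call",
        "B-cmd", "I-cmd", "I-cmd",
        "B-val", "I-val", "I-val", "I-val"]
annotated = tags_to_brackets(words, tags)
```

This reproduces the desired module output shown above for that utterance.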
You can download a free sample here: https://www.atco2.org/data

- However, do not worry, we have prepared a script in our repository for preparing this database:
- Dataset preparation folder: https://github.com/idiap/atco2-corpus/tree/main/data/databases/atco2_test_set_1h/data_prepare_atco2_corpus_other.sh
- Get the data in the format required by HuggingFace: speaker_role/data_preparation/prepare_spkid_atco2_corpus_test_set_1h.sh

If you use this code for your research, please cite our paper with:

The following hyperparameters were used during training:

- learning_rate: 5e-05
- train_batch_size: 32
- eval_batch_size: 16
- seed: 42
- gradient_accumulation_steps: 2
- total_train_batch_size: 64
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 500
- training_steps: 3000

| Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1     | Accuracy |
|:-------------:|:-----:|:----:|:---------------:|:---------:|:------:|:------:|:--------:|
| No log        | 125.0 | 500  | 0.8692          | 0.6396    | 0.7172 | 0.6762 | 0.8307   |
| 0.2158        | 250.0 | 1000 | 1.0074          | 0.5702    | 0.6970 | 0.6273 | 0.8245   |
| 0.2158        | 375.0 | 1500 | 1.3560          | 0.6577    | 0.7374 | 0.6952 | 0.8119   |
| 0.0184        | 500.0 | 2000 | 1.3393          | 0.6182    | 0.6869 | 0.6507 | 0.8056   |
| 0.0184        | 625.0 | 2500 | 1.3528          | 0.6087    | 0.7071 | 0.6542 | 0.8213   |
| 0.0175        | 750.0 | 3000 | 1.4282          | 0.6195    | 0.7071 | 0.6604 | 0.8182   |

- Transformers 4.24.0
- Pytorch 1.13.0+cu117
- Datasets 2.7.0
- Tokenizers 0.13.2
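As a sanity check on the table above, F1 is the harmonic mean of precision and recall; for the final checkpoint (Precision 0.6195, Recall 0.7071) this reproduces the reported F1 of 0.6604. Likewise, the effective batch size is the per-device batch size times the gradient accumulation steps:

```python
def f1_score(precision, recall):
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)

# Final row of the training table: Precision 0.6195, Recall 0.7071.
f1 = f1_score(0.6195, 0.7071)

# train_batch_size (32) x gradient_accumulation_steps (2) = total_train_batch_size.
effective_batch = 32 * 2
```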
bert-base-token-classification-for-atc-en-uwb-atcc
This model detects speaker roles and speaker changes based on text. Normally, this task is done at the acoustic level; however, we propose to perform it at the text level. We solve this challenge by performing speaker role and change detection with a BERT model, fine-tuning it on the chunking task (token classification).

- Speaker 1: lufthansa six two nine charlie tango report when established
- Speaker 2: report when established lufthansa six two nine charlie tango

Based on that, could you tell the speaker roles? Is speaker 1 the air traffic controller or the pilot? Also, if you have a recording with 2 or more speakers, like this:

- Recording with 2 or more segments: report when established lufthansa six two nine charlie tango lufthansa six two nine charlie tango report when established

could you tell when the first speaker ends and when the second starts? This is basically diarization plus speaker role detection.

This model is a fine-tuned version of bert-base-uncased on the UWB-ATCC corpus. It achieves the following results on the evaluation set:

- Loss: 0.0098
- Precision: 0.9760
- Recall: 0.9741
- F1: 0.9750
- Accuracy: 0.9965

Paper: BERTraffic: BERT-based Joint Speaker Role and Speaker Change Detection for Air Traffic Control Communications

Authors: Juan Zuluaga-Gomez, Seyyed Saeed Sarfjoo, Amrutha Prasad, Iuliia Nigmatulina, Petr Motlicek, Karel Ondrej, Oliver Ohneiser, Hartmut Helmke

Abstract: Automatic speech recognition (ASR) allows transcribing the communications between air traffic controllers (ATCOs) and aircraft pilots. The transcriptions are used later to extract ATC named entities, e.g., aircraft callsigns. One common challenge is speech activity detection (SAD) and speaker diarization (SD). In the failure condition, two or more segments remain in the same recording, jeopardizing the overall performance.
We propose a system that combines SAD and a BERT model to perform speaker change detection and speaker role detection (SRD) by chunking ASR transcripts, i.e., SD with a defined number of speakers together with SRD. The proposed model is evaluated on real-life public ATC databases. Our BERT SD model baseline reaches up to 10% and 20% token-based Jaccard error rate (JER) on public and private ATC databases. We also achieved relative improvements of 32% and 7.7% in JER and diarization error rate (DER), respectively, compared to VBx, a well-known SD system.

Code — GitHub repository: https://github.com/idiap/bert-text-diarization-atc

This model was fine-tuned on air traffic control data. We do not expect it to keep the same performance on other datasets where BERT was pre-trained or fine-tuned. See Table 3 (page 5) in our paper: BERTraffic: BERT-based Joint Speaker Role and Speaker Change Detection for Air Traffic Control Communications. We describe there the data used to fine-tune our model for speaker role and speaker change detection.

- We use the UWB-ATCC corpus to fine-tune this model.
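Text-level diarization as described above amounts to grouping consecutive tokens that share a predicted speaker role. A minimal sketch, in which the role labels and per-token predictions are hypothetical (the actual model predicts them with BERT):

```python
def split_by_role(words, roles):
    """Group consecutive words with the same predicted speaker role
    (e.g. 'pilot' / 'atco') into segments: diarization plus speaker
    role detection performed on the text level."""
    segments = []
    for word, role in zip(words, roles):
        if segments and segments[-1][0] == role:
            segments[-1][1].append(word)   # same speaker continues
        else:
            segments.append((role, [word]))  # speaker change detected
    return [(role, " ".join(ws)) for role, ws in segments]

# The two-speaker recording from the example above, with hypothetical
# per-token role predictions: pilot readback first, then the ATCO.
utterance = ("report when established lufthansa six two nine charlie tango "
             "lufthansa six two nine charlie tango report when established")
words = utterance.split()
roles = ["pilot"] * 9 + ["atco"] * 9
segments = split_by_role(words, roles)
```

The output gives both the speaker-change boundary and each segment's role in one pass.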
You can download the raw data here: https://lindat.mff.cuni.cz/repository/xmlui/handle/11858/00-097C-0000-0001-CCA1-0

- However, do not worry, we have prepared a script in our repository for preparing this database:
- Dataset preparation folder: https://github.com/idiap/bert-text-diarization-atc/tree/main/data/databases/uwb_atcc
- Prepare the data: https://github.com/idiap/bert-text-diarization-atc/blob/main/data/databases/uwb_atcc/data_prepare_uwb_atcc_corpus.sh
- Get the data in the format required by HuggingFace: https://github.com/idiap/bert-text-diarization-atc/blob/main/data/databases/uwb_atcc/exp_prepare_uwb_atcc_corpus.sh

If you use this code for your research, please cite our paper with:

The following hyperparameters were used during training:

- learning_rate: 5e-05
- train_batch_size: 64
- eval_batch_size: 16
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 1000
- training_steps: 10000

| Training Loss | Epoch | Step  | Validation Loss | Precision | Recall | F1     | Accuracy |
|:-------------:|:-----:|:-----:|:---------------:|:---------:|:------:|:------:|:--------:|
| No log        | 0.03  | 500   | 0.2282          | 0.6818    | 0.7001 | 0.6908 | 0.9246   |
| 0.3487        | 0.06  | 1000  | 0.1214          | 0.8163    | 0.8024 | 0.8093 | 0.9631   |
| 0.3487        | 0.1   | 1500  | 0.0933          | 0.8496    | 0.8544 | 0.8520 | 0.9722   |
| 0.1124        | 0.13  | 2000  | 0.0693          | 0.8845    | 0.8739 | 0.8791 | 0.9786   |
| 0.1124        | 0.16  | 2500  | 0.0540          | 0.8993    | 0.8911 | 0.8952 | 0.9817   |
| 0.0667        | 0.19  | 3000  | 0.0474          | 0.9058    | 0.8929 | 0.8993 | 0.9857   |
| 0.0667        | 0.23  | 3500  | 0.0418          | 0.9221    | 0.9245 | 0.9233 | 0.9865   |
| 0.0492        | 0.26  | 4000  | 0.0294          | 0.9369    | 0.9415 | 0.9392 | 0.9903   |
| 0.0492        | 0.29  | 4500  | 0.0263          | 0.9512    | 0.9446 | 0.9479 | 0.9911   |
| 0.0372        | 0.32  | 5000  | 0.0223          | 0.9495    | 0.9497 | 0.9496 | 0.9915   |
| 0.0372        | 0.35  | 5500  | 0.0212          | 0.9530    | 0.9514 | 0.9522 | 0.9923   |
| 0.0308        | 0.39  | 6000  | 0.0177          | 0.9585    | 0.9560 | 0.9572 | 0.9933   |
| 0.0308        | 0.42  | 6500  | 0.0169          | 0.9619    | 0.9613 | 0.9616 | 0.9936   |
| 0.0261        | 0.45  | 7000  | 0.0140          | 0.9689    | 0.9662 | 0.9676 | 0.9951   |
| 0.0261        | 0.48  | 7500  | 0.0130          | 0.9652    | 0.9629 | 0.9641 | 0.9945   |
| 0.0214        | 0.51  | 8000  | 0.0127          | 0.9676    | 0.9635 | 0.9656 | 0.9953   |
| 0.0214        | 0.55  | 8500  | 0.0109          | 0.9714    | 0.9708 | 0.9711 | 0.9959   |
| 0.0177        | 0.58  | 9000  | 0.0103          | 0.9740    | 0.9727 | 0.9734 | 0.9961   |
| 0.0177        | 0.61  | 9500  | 0.0101          | 0.9768    | 0.9744 | 0.9756 | 0.9963   |
| 0.0159        | 0.64  | 10000 | 0.0098          | 0.9760    | 0.9741 | 0.9750 | 0.9965   |

- Transformers 4.24.0
- Pytorch 1.13.0+cu117
- Datasets 2.7.0
- Tokenizers 0.13.2
accent-id-commonaccent_ecapa
wav2vec2-large-960h-lv60-self-en-atc-atcosim
This model is a fine-tuned version of facebook/wav2vec2-large-960h-lv60-self on the ATCOSIM corpus. It achieves the following results on the evaluation set:

- Loss: 0.0850
- WER: 0.0167 (1.67% WER)

Paper: How Does Pre-trained Wav2Vec 2.0 Perform on Domain Shifted ASR? An Extensive Benchmark on Air Traffic Control Communications

Authors: Juan Zuluaga-Gomez, Amrutha Prasad, Iuliia Nigmatulina, Saeed Sarfjoo, Petr Motlicek, Matthias Kleinert, Hartmut Helmke, Oliver Ohneiser, Qingran Zhan

Abstract: Recent work on self-supervised pre-training focuses on leveraging large-scale unlabeled speech data to build robust end-to-end (E2E) acoustic models (AM) that can be later fine-tuned on downstream tasks, e.g., automatic speech recognition (ASR). Yet, few works have investigated the impact on performance when the data properties substantially differ between the pre-training and fine-tuning phases, termed domain shift. We target this scenario by analyzing the robustness of Wav2Vec 2.0 and XLS-R models on downstream ASR for a completely unseen domain, air traffic control (ATC) communications. We benchmark these two models on several open-source and challenging ATC databases with signal-to-noise ratios between 5 and 20 dB. Relative word error rate (WER) reductions between 20% and 40% are obtained in comparison to hybrid-based ASR baselines by only fine-tuning E2E acoustic models with a smaller fraction of labeled data. We analyze WERs on the low-resource scenario and the gender bias carried by one ATC dataset.

Code — GitHub repository: https://github.com/idiap/w2v2-air-traffic

You can use our Google Colab notebook to run and evaluate our model: https://github.com/idiap/w2v2-air-traffic/blob/master/src/eval_xlsr_atc_model.ipynb (you need to change the `MODEL_ID` param to `MODEL_ID=Jzuluaga/wav2vec2-large-960h-lv60-self-en-atc-atcosim`)

Intended uses & limitations

This model was fine-tuned on air traffic control data.
We do not expect it to keep the same performance on other datasets, e.g., LibriSpeech or CommonVoice. See Table 1 (page 3) in our paper: How Does Pre-trained Wav2Vec 2.0 Perform on Domain Shifted ASR? An Extensive Benchmark on Air Traffic Control Communications. We describe there the partitions used with our model.

- We use the ATCOSIM dataset for fine-tuning this model. You can download the raw data here: https://www.spsc.tugraz.at/databases-and-tools/atcosim-air-traffic-control-simulation-speech-corpus.html
- However, do not worry, we have prepared the database in `Datasets` format. Here: ATCOSIM CORPUS on HuggingFace. You can scroll and check the train/test partitions, and even listen to some audio.
- If you want to prepare a database in HuggingFace format, you can follow the data loader script in: data_loader_atc.py.

If you use a language model, you need to install the KenLM bindings with:

If you use this code for your research, please cite our paper with:

The following hyperparameters were used during training:

- learning_rate: 0.0005
- train_batch_size: 24
- eval_batch_size: 24
- seed: 42
- gradient_accumulation_steps: 4
- total_train_batch_size: 96
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 500
- training_steps: 20000
- mixed_precision_training: Native AMP

| Training Loss | Epoch  | Step  | Validation Loss | Wer    |
|:-------------:|:------:|:-----:|:---------------:|:------:|
| 1.4757        | 6.41   | 500   | 0.0614          | 0.0347 |
| 0.0624        | 12.82  | 1000  | 0.0525          | 0.0277 |
| 0.0388        | 19.23  | 1500  | 0.0693          | 0.0241 |
| 0.03          | 25.64  | 2000  | 0.0666          | 0.0244 |
| 0.0235        | 32.05  | 2500  | 0.0604          | 0.0260 |
| 0.0226        | 38.46  | 3000  | 0.0625          | 0.0230 |
| 0.0163        | 44.87  | 3500  | 0.0603          | 0.0195 |
| 0.0157        | 51.28  | 4000  | 0.0628          | 0.0209 |
| 0.0152        | 57.69  | 4500  | 0.0692          | 0.0238 |
| 0.0122        | 64.1   | 5000  | 0.0607          | 0.0210 |
| 0.011         | 70.51  | 5500  | 0.0608          | 0.0213 |
| 0.0114        | 76.92  | 6000  | 0.0681          | 0.0211 |
| 0.0106        | 83.33  | 6500  | 0.0613          | 0.0210 |
| 0.0081        | 89.74  | 7000  | 0.0654          | 0.0196 |
| 0.0078        | 96.15  | 7500  | 0.0612          | 0.0191 |
| 0.0082        | 102.56 | 8000  | 0.0758          | 0.0237 |
| 0.0078        | 108.97 | 8500  | 0.0664          | 0.0206 |
| 0.0075        | 115.38 | 9000  | 0.0658          | 0.0197 |
| 0.0052        | 121.79 | 9500  | 0.0669          | 0.0218 |
| 0.0054        | 128.21 | 10000 | 0.0695          | 0.0211 |
| 0.0053        | 134.62 | 10500 | 0.0726          | 0.0227 |
| 0.0046        | 141.03 | 11000 | 0.0702          | 0.0212 |
| 0.0043        | 147.44 | 11500 | 0.0846          | 0.0200 |
| 0.0041        | 153.85 | 12000 | 0.0764          | 0.0200 |
| 0.0032        | 160.26 | 12500 | 0.0785          | 0.0201 |
| 0.0028        | 166.67 | 13000 | 0.0839          | 0.0197 |
| 0.0035        | 173.08 | 13500 | 0.0785          | 0.0210 |
| 0.0027        | 179.49 | 14000 | 0.0730          | 0.0188 |
| 0.002         | 185.9  | 14500 | 0.0794          | 0.0193 |
| 0.002         | 192.31 | 15000 | 0.0859          | 0.0211 |
| 0.0019        | 198.72 | 15500 | 0.0727          | 0.0183 |
| 0.0017        | 205.13 | 16000 | 0.0784          | 0.0187 |
| 0.0016        | 211.54 | 16500 | 0.0801          | 0.0196 |
| 0.0014        | 217.95 | 17000 | 0.0821          | 0.0185 |
| 0.0011        | 224.36 | 17500 | 0.0822          | 0.0176 |
| 0.001         | 230.77 | 18000 | 0.0856          | 0.0171 |
| 0.001         | 237.18 | 18500 | 0.0792          | 0.0176 |
| 0.001         | 243.59 | 19000 | 0.0826          | 0.0173 |
| 0.0006        | 250.0  | 19500 | 0.0854          | 0.0170 |
| 0.0007        | 256.41 | 20000 | 0.0850          | 0.0167 |

- Transformers 4.24.0
- Pytorch 1.13.0+cu117
- Datasets 2.6.1
- Tokenizers 0.13.2
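The WER metric reported throughout this card is the word-level edit distance (substitutions, insertions, deletions) normalized by the number of reference words. A minimal reference implementation, for illustration only (not the evaluation script used in the paper):

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance between reference and
    hypothesis, divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Single-row dynamic programming over the edit-distance matrix.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            cur = min(d[j] + 1,           # deletion
                      d[j - 1] + 1,       # insertion
                      prev + (r != h))    # substitution (or match)
            prev, d[j] = d[j], cur
    return d[-1] / len(ref)

# One substituted word out of seven reference words -> WER = 1/7.
score = wer("cleared to land runway three four left",
            "cleared to land runway tree four left")
```

For example, the reported 1.67% WER means roughly one word error per sixty reference words.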