HIYACCENT: An Improved Nigerian-Accented Speech Recognition System Based on Contrastive Learning
The global objective of this research was to develop a more robust model for Nigerian English speakers, whose English pronunciations are heavily influenced by their mother tongue. To this end, the Wav2Vec-HIYACCENT model was proposed, which introduces a new layer into Facebook's Wav2Vec2 to capture the disparity between the baseline model and Nigerian English speech. A CTC loss was also inserted on top of the model, which adds flexibility to the speech-text alignment. This resulted in over 20% improvement in performance for Nigerian-Accented English (NAE).
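For context, a CTC loss scores frame-level log-probabilities against a target transcript without requiring an explicit frame-to-character alignment. A minimal sketch in PyTorch follows; the sequence length, batch size, and vocabulary size are illustrative placeholders, not this model's actual configuration:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative shapes: T time frames, N utterances, C vocabulary entries (index 0 = blank).
T, N, C = 50, 2, 32
log_probs = torch.randn(T, N, C).log_softmax(dim=-1)  # frame-level log-probabilities

# Dummy targets: each utterance has 10 non-blank labels.
targets = torch.randint(1, C, (N, 10), dtype=torch.long)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 10, dtype=torch.long)

# CTC marginalizes over all alignments of targets onto the T frames.
ctc = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
```

Because the loss is a negative log-likelihood summed over alignments, it is always non-negative and differentiable, which is what lets the speech-text alignment stay flexible during fine-tuning.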
This model fine-tunes facebook/wav2vec2-large on English using the UISpeech Corpus. When using this model, make sure that your speech input is sampled at 16 kHz.
The script used for training can be found here: https://github.com/amceejay/HIYACCENT-NE-Speech-Recognition-System
## Usage

The model can be used directly (without a language model) as follows.
Using the ASRecognition library:

```python
from asrecognition import ASREngine

asr = ASREngine("en", model_path="codeceejay/HIYACCENTWav2Vec2")

audio_paths = ["/path/to/file.mp3", "/path/to/anotherfile.wav"]
transcriptions = asr.transcribe(audio_paths)
```
Writing your own inference script:

```python
import torch
import librosa
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

LANG_ID = "en"
MODEL_ID = "codeceejay/HIYACCENTWav2Vec2"
SAMPLES = 10

# You can use Common Voice or TIMIT; Nigerian-accented speeches can also be found here:
# https://openslr.org/70/
test_dataset = load_dataset("common_voice", LANG_ID, split=f"test[:{SAMPLES}]")

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)

# Preprocessing the datasets.
# We need to read the audio files as arrays.
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
    batch["speech"] = speech_array
    batch["sentence"] = batch["sentence"].upper()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)
predicted_sentences = processor.batch_decode(predicted_ids)

for i, predicted_sentence in enumerate(predicted_sentences):
    print("-" * 100)
    print("Reference:", test_dataset[i]["sentence"])
    print("Prediction:", predicted_sentence)
```