HIYACCENT: An Improved Nigerian-Accented Speech Recognition System Based on Contrastive Learning
The global objective of this research was to develop a more robust model for Nigerian English speakers, whose English pronunciations are heavily influenced by their mother tongue. To this end, the Wav2Vec-HIYACCENT model was proposed, which introduces a new layer into Facebook's Wav2Vec2 to capture the disparity between the baseline model and Nigerian English speech. A CTC loss was also inserted on top of the model, which adds flexibility to the speech-text alignment. This resulted in over 20% improvement in performance for Nigerian-Accented English (NAE).
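For context, a CTC loss scores frame-level log-probabilities against a target transcript without requiring an explicit frame-to-character alignment. A minimal sketch in PyTorch follows; the sequence length, batch size, and vocabulary size are illustrative placeholders, not this model's actual configuration:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative shapes: T time frames, N utterances, C vocabulary entries (index 0 = blank).
T, N, C = 50, 2, 32
log_probs = torch.randn(T, N, C).log_softmax(dim=-1)  # frame-level log-probabilities

# Dummy targets: each utterance has 10 non-blank labels.
targets = torch.randint(1, C, (N, 10), dtype=torch.long)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 10, dtype=torch.long)

# CTC marginalizes over all alignments of targets onto the T frames.
ctc = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
```

Because the loss is a negative log-likelihood summed over alignments, it is always non-negative and differentiable, which is what lets the speech-text alignment stay flexible during fine-tuning.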
This model fine-tunes facebook/wav2vec2-large on English using the UISpeech Corpus. When using this model, make sure that your speech input is sampled at 16 kHz.
The script used for training can be found here: https://github.com/amceejay/HIYACCENT-NE-Speech-Recognition-System
## Usage

The model can be used directly (without a language model) as follows.
Using the ASRecognition library:

```python
from asrecognition import ASREngine

asr = ASREngine("en", model_path="codeceejay/HIYACCENTWav2Vec2")

audio_paths = ["/path/to/file.mp3", "/path/to/anotherfile.wav"]
transcriptions = asr.transcribe(audio_paths)
```
Writing your own inference script:

```python
import torch
import librosa
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

LANG_ID = "en"
MODEL_ID = "codeceejay/HIYACCENTWav2Vec2"
SAMPLES = 10

# You can use Common Voice or TIMIT; Nigerian-accented speeches can also be found here:
# https://openslr.org/70/
test_dataset = load_dataset("common_voice", LANG_ID, split=f"test[:{SAMPLES}]")

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)

# Preprocessing the datasets.
# We need to read the audio files as arrays.
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
    batch["speech"] = speech_array
    batch["sentence"] = batch["sentence"].upper()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)
predicted_sentences = processor.batch_decode(predicted_ids)

for i, predicted_sentence in enumerate(predicted_sentences):
    print("-" * 100)
    print("Reference:", test_dataset[i]["sentence"])
    print("Prediction:", predicted_sentence)
```