LLMYourWay

jordand

3 models

whisper-d-v1a

WhisperD is a fine-tuned version of whisper-large-v2 that can transcribe multi-speaker, conversational speech. It was used to generate synthetic transcriptions for training Parakeet. Diarization is performed implicitly by the model: "[S1]", "[S2]", etc. denote speaker identity. WhisperD is often able to transcribe non-speech events, e.g. "(coughs)" or "(laughs)", and its outputs include disfluencies. More details can be found in the WhisperD blog post.

Caution: this model has only been tested on segments up to 30 seconds in length. It may be unable to handle conditioning on previous text, as this was not included during fine-tuning. If a pipeline or codebase relies on that feature to transcribe audio longer than 30 seconds, generation quality may be poor.

license:mit
345
36
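Because WhisperD marks speaker turns inline with "[S1]"-style tags, its transcripts are easy to post-process. A minimal sketch of splitting such a transcript into per-speaker turns — the `parse_whisperd` helper and the sample transcript below are illustrative assumptions, not part of WhisperD's own tooling:

```python
import re

def parse_whisperd(text):
    """Split a WhisperD-style transcript into (speaker, utterance) pairs.

    Speaker tags like "[S1]" and "[S2]" delimit turns; non-speech
    events such as "(laughs)" stay attached to the surrounding text.
    """
    # Split on speaker tags, keeping the tags themselves as delimiters.
    parts = re.split(r"(\[S\d+\])", text)
    turns = []
    speaker = None
    for part in parts:
        tag = re.fullmatch(r"\[S(\d+)\]", part)
        if tag:
            # A tag updates the current speaker for subsequent text.
            speaker = f"S{tag.group(1)}"
        elif part.strip() and speaker:
            turns.append((speaker, part.strip()))
    return turns

turns = parse_whisperd("[S1] Hello there. (laughs) [S2] Hi! [S1] How are you?")
# → [('S1', 'Hello there. (laughs)'), ('S2', 'Hi!'), ('S1', 'How are you?')]
```

A chunker that feeds the model segments no longer than 30 seconds would pair naturally with this, given the caution above.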

fish-s1-dac-min

license:cc-by-nc-sa-4.0
5
2

echo-tts-no-speaker

license:cc-by-nc-sa-4.0
2
3
© 2026 LLMYourWay. All rights reserved.