kyutai
moshiko-pytorch-bf16
Moshi is a speech-text foundation model and full-duplex spoken dialogue framework. It casts spoken dialogue as speech-to-speech generation: starting from a text language model backbone, Moshi generates speech as tokens from the residual quantizer of a neural audio codec, while modeling its own speech and that of the user in separate parallel streams. This removes the need for explicit speaker turns and allows the modeling of arbitrary conversational dynamics. Moshi also predicts time-aligned text tokens as a prefix to audio tokens. This “Inner Monologue” method significantly improves the linguistic quality of generated speech and provides streaming speech recognition and text-to-speech. As a result, Moshi is the first real-time full-duplex spoken large language model, with a theoretical latency of 160 ms (200 ms in practice).
- Developed by: Kyutai
- Model type: Multimodal speech-text foundation model
- Language(s) (NLP): English
- License: CC-BY

The model can be used as a conversational agent for casual conversations, basic facts and advice (e.g. recipes, trivia), roleplay, etc. However, the model has limited abilities for complex tasks and cannot access tools; it focuses instead on natural, low-latency interactions. Some components of the model can be used independently or repurposed relatively easily. For instance, the Mimi codec is a state-of-the-art neural audio codec that combines semantic and acoustic information into audio tokens running at 12.5 Hz and a bitrate of 1.1 kbps, which makes it particularly well suited to training speech language models or text-to-speech systems. As for the main Moshi architecture, other downstream use cases would require some finetuning / domain adaptation. The model is not intended to be used to impersonate other people or for any malicious use of any kind. This model is for research only, and we do not recommend it for providing advice or performing any professional duty.
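As a back-of-the-envelope check of the Mimi figures above, a sketch assuming the configuration described in the Moshi paper (8 residual codebooks with 2048 entries each, i.e. 11 bits per token):

```python
# Mimi bitrate sanity check: frames/s x codebooks x bits per codebook token.
frame_rate_hz = 12.5   # Mimi frame rate
codebooks = 8          # residual quantizer levels used by Moshi (assumed from the paper)
bits_per_token = 11    # log2(2048) codebook entries

bitrate_bps = frame_rate_hz * codebooks * bits_per_token
print(bitrate_bps)  # 1100.0 bits/s, i.e. the 1.1 kbps quoted above
```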
The model has been trained with a few safeguards to try to limit potential toxic usage; however, our toxicity analysis shows that it sits in the middle of existing models with respect to textual generation. It has some bias towards certain domains and topics that are over-represented in the training data. Its capabilities are relatively limited so far, and it is trained to produce only one voice to avoid impersonation. Still, more time and perspective will be needed to establish its sociotechnical limitations.
- Textual data: The underlying Helium model is trained on a mix of data, more precisely:
  - 12.5% is high-quality data from the following curated sources: Wikipedia, Wikibooks, Wikisource, Wikinews, StackExchange and the collection of scientific articles pes2o. For Wikipedia, we use five different dumps from 2017, 2018, 2019, 2021 and 2022.
  - 87.5% is filtered web data from CommonCrawl, using the following crawls: 2018-30, 2019-04, 2019-30, 2020-05, 2020-34, 2021-04, 2021-31, 2022-05, 2022-33, 2023-40.
- Unsupervised audio dataset: used for pre-training, this is a collection of 7 million hours of readily available audio content, consisting mostly of English speech. This training set is transcribed with Whisper (large v3 model).
- The Fisher dataset: used to enable multi-stream. It consists of 2000 hours of phone conversations at 8kHz from Fisher, which we upsample to 24kHz using AudioSR.
- Supervised multi-stream dataset: a dataset of 170 hours of natural and scripted conversations between multiple pairs of participants, collected by Kyutai. This dataset is used to train the TTS system used to create synthetic data.
- Synthetic data: 20,000 hours of synthetic data generated by our TTS system, simulating a dialogue between Moshi and a user.

The different stages of the training procedure are detailed in the paper, along with the hyper-parameters. The training was performed on 127 DGX nodes provided by Scaleway, accounting for 1016 H100 Nvidia GPUs.
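The node and GPU counts above are consistent, assuming the standard 8 GPUs per Nvidia DGX H100 node:

```python
# 127 DGX nodes x 8 H100 GPUs per node = the 1016 GPUs quoted above.
nodes = 127
gpus_per_node = 8  # standard DGX H100 configuration (assumption)
print(nodes * gpus_per_node)  # 1016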
Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave, Neil Zeghidour
mimi
---
license: cc-by-4.0
library_name: transformers
tags:
- mimi
- audio
---
helium-1-2b
tts-1.6b-en_fr
See also the pre-print research paper, the project page, the Colab example, the GitHub repository, and the repository of voices. This is a model for streaming text-to-speech (TTS). Unlike offline text-to-speech, where the model needs the entire text to produce the audio, our model starts to output audio as soon as the first few words from the text have been given as input. This model actually has 1.8B parameters, not 1.6B as the name might suggest. The model architecture is a hierarchical Transformer that consumes tokenized text and generates audio tokenized by Mimi; see the Moshi paper. The frame rate is 12.5 Hz and each audio frame is represented by 32 audio tokens, although you can use fewer tokens at inference time for faster generation. The backbone model has 1B parameters, and the depth transformer has 600M parameters, using partial weight sharing similar to Hibiki. The audio is shifted by 16 steps (1.28 sec.) with respect to the text, and the model uses an acoustic/semantic delay of 2. Kyutai TTS is a decoder-only model for streaming text-to-speech. It leverages the multistream architecture of Moshi to model the speech stream based on the text stream. The audio stream is shifted w.r.t. the text stream, allowing the model to predict audio tokens based on the input text. Developed by: Kyutai. Model type: Streaming Text-To-Speech. Language(s) (NLP): English and French. License: Model weights are licensed under CC-BY 4.0. Repository: GitHub. This model is able to perform streaming text-to-speech generation, including dialogs. The model supports voice conditioning through pre-computed cross-attention embeddings, which are provided for a number of voices in our tts-voices repository. This model does not support Classifier Free Guidance (CFG) directly, but was trained with CFG distillation for improved speed (no need to double the batch size). It is easy to batch and can reach a throughput of 75x generated audio per unit of compute time.
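The 1.28-second shift quoted above follows directly from the frame counts, as this quick sketch shows:

```python
# Audio-vs-text shift in seconds: 16 frames at the 12.5 Hz Mimi frame rate.
shift_frames = 16
frame_rate_hz = 12.5
print(shift_frames / frame_rate_hz)  # 1.28 seconds
```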
This model does not perform watermarking, for two reasons:
- watermarking can easily be deactivated for open source models,
- our early experiments show that the watermarks used by existing TTS systems are removed by simply encoding and decoding the audio with Mimi.

Instead, we preferred to restrict the voice cloning ability to the use of pre-computed voice embeddings. The model was trained for 750k steps, with a batch size of 64 and a segment duration of 120 seconds. Then, CFG distillation was performed for 24k updates. Pretraining stage: we use an audio collection of 2.5 million hours of publicly available audio content. For this dataset, we obtained synthetic transcripts by running whisper-timestamped with `whisper-medium`. Pretraining was done with 32 H100 Nvidia GPUs; CFG distillation was done on 8 such GPUs. Neil Zeghidour, Eugene Kharitonov, Manu Orsini, Václav Volhejn, Gabriel de Marmiesse, Edouard Grave, Patrick Perez, Laurent Mazaré, Alexandre Défossez
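The training hyper-parameters above imply a rough total audio volume seen during pretraining; a sketch of the arithmetic (ignoring any padding or packing details, which the card does not specify):

```python
# Audio seconds processed = steps x batch size x segment duration.
steps = 750_000
batch_size = 64
segment_seconds = 120
hours = steps * batch_size * segment_seconds / 3600
print(hours)  # 1600000.0, i.e. ~1.6M hours, under one pass over the 2.5M-hour set
```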
pocket-tts-without-voice-cloning
pocket-tts
tts-0.75b-en-public
See also the pre-print research paper, the project page, the GitHub repository, and the evaluation pipeline. This is a model for streaming text-to-speech (TTS). Unlike offline text-to-speech, where the model needs the entire text to produce the audio, our model starts to output audio as soon as the first few words from the text have been given as input. This model was trained on a mix of public TTS datasets, allowing for fair comparisons with other methods. The model architecture is a hierarchical Transformer that consumes tokenized text and generates audio tokenized by Mimi; see the Moshi paper. The frame rate is 12.5 Hz and each audio frame is represented by 16 audio tokens. You cannot use fewer tokens at inference time. The backbone model has 300M parameters, and the depth transformer has 450M parameters, using partial weight sharing similar to Hibiki. The audio is shifted by 16 steps (1.28 sec.) with respect to the text, and the model uses an acoustic/semantic delay of 2. Kyutai TTS is a decoder-only model for streaming text-to-speech. It leverages the multistream architecture of Moshi to model the speech stream based on the text stream. The audio stream is shifted w.r.t. the text stream, allowing the model to predict audio tokens based on the input text. Developed by: Kyutai. Model type: Streaming Text-To-Speech. Language(s) (NLP): English. License: Model weights are licensed under CC-BY 4.0. Repository: GitHub. This model is able to perform streaming text-to-speech generation. It allows for voice cloning through prefixing, although it achieves a speaker similarity score of 74.9%, against 80.9% for our main model. This level of speaker similarity is in line with some existing baselines like CSM, and it thus seems safe to open source it.
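As a quick sanity check on the figures above, a sketch of the token-rate and parameter arithmetic:

```python
# Audio token rate: 16 tokens per frame at 12.5 Hz.
frame_rate_hz = 12.5
tokens_per_frame = 16
print(frame_rate_hz * tokens_per_frame)  # 200.0 audio tokens per second

# Backbone + depth transformer, matching the 0.75b in the model name:
print(300e6 + 450e6)  # 750000000.0
```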
This model does not perform watermarking, for two reasons:
- watermarking can easily be deactivated for open source models,
- our early experiments show that the watermarks used by existing TTS systems are removed by simply encoding and decoding the audio with Mimi.

This model is provided primarily for the purpose of scientific comparisons on public benchmarks. In particular, please check our pipeline for running TTS model evaluations on a number of benchmarks: ttslongeval. Here is an example: first install `moshi`, for instance with `pip install moshi`. Neil Zeghidour, Eugene Kharitonov, Manu Orsini, Václav Volhejn, Gabriel de Marmiesse, Edouard Grave, Patrick Perez, Laurent Mazaré, Alexandre Défossez License is CC-BY 4.0. For citations please use the following.
moshiko-candle-q8
Moshi is a speech-text foundation model and full-duplex spoken dialogue framework. It casts spoken dialogue as speech-to-speech generation: starting from a text language model backbone, Moshi generates speech as tokens from the residual quantizer of a neural audio codec, while modeling its own speech and that of the user in separate parallel streams. This removes the need for explicit speaker turns and allows the modeling of arbitrary conversational dynamics. Moshi also predicts time-aligned text tokens as a prefix to audio tokens. This “Inner Monologue” method significantly improves the linguistic quality of generated speech and provides streaming speech recognition and text-to-speech. As a result, Moshi is the first real-time full-duplex spoken large language model, with a theoretical latency of 160 ms (200 ms in practice).
- Developed by: Kyutai
- Model type: Multimodal speech-text foundation model
- Language(s) (NLP): English
- License: CC-BY

The model can be used as a conversational agent for casual conversations, basic facts and advice (e.g. recipes, trivia), roleplay, etc. However, the model has limited abilities for complex tasks and cannot access tools; it focuses instead on natural, low-latency interactions. Some components of the model can be used independently or repurposed relatively easily. For instance, the Mimi codec is a state-of-the-art neural audio codec that combines semantic and acoustic information into audio tokens running at 12.5 Hz and a bitrate of 1.1 kbps, which makes it particularly well suited to training speech language models or text-to-speech systems. As for the main Moshi architecture, other downstream use cases would require some finetuning / domain adaptation. The model is not intended to be used to impersonate other people or for any malicious use of any kind. This model is for research only, and we do not recommend it for providing advice or performing any professional duty.
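The 160 ms theoretical latency figure follows from the frame-rate arithmetic, assuming (as in the Moshi paper's accounting) one 80 ms frame of audio plus one frame of acoustic delay:

```python
# At 12.5 frames per second, one Mimi frame spans 80 ms.
frame_ms = 1000 / 12.5
print(frame_ms)      # 80.0
# Two frame durations give the 160 ms theoretical latency quoted above.
print(2 * frame_ms)  # 160.0
```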
The model has been trained with a few safeguards to try to limit potential toxic usage; however, our toxicity analysis shows that it sits in the middle of existing models with respect to textual generation. It has some bias towards certain domains and topics that are over-represented in the training data. Its capabilities are relatively limited so far, and it is trained to produce only one voice to avoid impersonation. Still, more time and perspective will be needed to establish its sociotechnical limitations.
- Textual data: The underlying Helium model is trained on a mix of data, more precisely:
  - 12.5% is high-quality data from the following curated sources: Wikipedia, Wikibooks, Wikisource, Wikinews, StackExchange and the collection of scientific articles pes2o. For Wikipedia, we use five different dumps from 2017, 2018, 2019, 2021 and 2022.
  - 87.5% is filtered web data from CommonCrawl, using the following crawls: 2018-30, 2019-04, 2019-30, 2020-05, 2020-34, 2021-04, 2021-31, 2022-05, 2022-33, 2023-40.
- Unsupervised audio dataset: used for pre-training, this is a collection of 7 million hours of readily available audio content, consisting mostly of English speech. This training set is transcribed with Whisper (large v3 model).
- The Fisher dataset: used to enable multi-stream. It consists of 2000 hours of phone conversations at 8kHz from Fisher, which we upsample to 24kHz using AudioSR.
- Supervised multi-stream dataset: a dataset of 170 hours of natural and scripted conversations between multiple pairs of participants, collected by Kyutai. This dataset is used to train the TTS system used to create synthetic data.
- Synthetic data: 20,000 hours of synthetic data generated by our TTS system, simulating a dialogue between Moshi and a user.

The different stages of the training procedure are detailed in the paper, along with the hyper-parameters. The training was performed on 127 DGX nodes provided by Scaleway, accounting for 1016 H100 Nvidia GPUs.
Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave, Neil Zeghidour
stt-2.6b-en-trfs
stt-1b-en_fr-trfs
CASA-Qwen2_5-VL-3B-LiveCC
moshika-pytorch-bf16
Moshi is a speech-text foundation model and full-duplex spoken dialogue framework. It casts spoken dialogue as speech-to-speech generation: starting from a text language model backbone, Moshi generates speech as tokens from the residual quantizer of a neural audio codec, while modeling its own speech and that of the user in separate parallel streams. This removes the need for explicit speaker turns and allows the modeling of arbitrary conversational dynamics. Moshi also predicts time-aligned text tokens as a prefix to audio tokens. This “Inner Monologue” method significantly improves the linguistic quality of generated speech and provides streaming speech recognition and text-to-speech. As a result, Moshi is the first real-time full-duplex spoken large language model, with a theoretical latency of 160 ms (200 ms in practice).
- Developed by: Kyutai
- Model type: Multimodal speech-text foundation model
- Language(s) (NLP): English
- License: CC-BY

The model can be used as a conversational agent for casual conversations, basic facts and advice (e.g. recipes, trivia), roleplay, etc. However, the model has limited abilities for complex tasks and cannot access tools; it focuses instead on natural, low-latency interactions. Some components of the model can be used independently or repurposed relatively easily. For instance, the Mimi codec is a state-of-the-art neural audio codec that combines semantic and acoustic information into audio tokens running at 12.5 Hz and a bitrate of 1.1 kbps, which makes it particularly well suited to training speech language models or text-to-speech systems. As for the main Moshi architecture, other downstream use cases would require some finetuning / domain adaptation. The model is not intended to be used to impersonate other people or for any malicious use of any kind. This model is for research only, and we do not recommend it for providing advice or performing any professional duty.
The model has been trained with a few safeguards to try to limit potential toxic usage; however, our toxicity analysis shows that it sits in the middle of existing models with respect to textual generation. It has some bias towards certain domains and topics that are over-represented in the training data. Its capabilities are relatively limited so far, and it is trained to produce only one voice to avoid impersonation. Still, more time and perspective will be needed to establish its sociotechnical limitations.
- Textual data: The underlying Helium model is trained on a mix of data, more precisely:
  - 12.5% is high-quality data from the following curated sources: Wikipedia, Wikibooks, Wikisource, Wikinews, StackExchange and the collection of scientific articles pes2o. For Wikipedia, we use five different dumps from 2017, 2018, 2019, 2021 and 2022.
  - 87.5% is filtered web data from CommonCrawl, using the following crawls: 2018-30, 2019-04, 2019-30, 2020-05, 2020-34, 2021-04, 2021-31, 2022-05, 2022-33, 2023-40.
- Unsupervised audio dataset: used for pre-training, this is a collection of 7 million hours of readily available audio content, consisting mostly of English speech. This training set is transcribed with Whisper (large v3 model).
- The Fisher dataset: used to enable multi-stream. It consists of 2000 hours of phone conversations at 8kHz from Fisher, which we upsample to 24kHz using AudioSR.
- Supervised multi-stream dataset: a dataset of 170 hours of natural and scripted conversations between multiple pairs of participants, collected by Kyutai. This dataset is used to train the TTS system used to create synthetic data.
- Synthetic data: 20,000 hours of synthetic data generated by our TTS system, simulating a dialogue between Moshi and a user.

The different stages of the training procedure are detailed in the paper, along with the hyper-parameters. The training was performed on 127 DGX nodes provided by Scaleway, accounting for 1016 H100 Nvidia GPUs.
Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave, Neil Zeghidour
helium-1-preview-2b
moshiko-mlx-q8
moshiko-candle-bf16
hibiki-1b-pytorch-bf16
hibiki-zero-3b-pytorch-bf16
moshiko-mlx-bf16
moshiko-mlx-q4
helium-1-2b-science
helium-1-2b-wiki
helium-1-2b-books
helium-1-2b-pop
ARC8 Encoder Multi
This page hosts `ARC8-Encodermulti`, one of three pretrained ARC-Encoder versions. The architectures and training methods are described in the paper ARC-Encoder: learning compressed text representations for large language models, available here. Code: ARC-Encoder repository. All the encoders released here are trained on web crawl filtered using Dactory, with a Llama3.2-3B base backbone. The release consists of two ARC-Encoders each trained for a single decoder, and one trained for two decoders at the same time:
- `ARC8-EncoderLlama`, trained on 2.6B tokens specifically for Llama3.1-8B base, with a pooling factor of 8.
- `ARC8-EncoderMistral`, trained on 2.6B tokens specifically for Mistral-7B base, with a pooling factor of 8.
- `ARC8-Encodermulti`, trained by sampling among the two decoders, with a pooling factor of 8.

As described in the paper, the pretrained ARC-Encoders can be fine-tuned to perform various downstream tasks. You can also adapt an ARC-Encoder to a new pooling factor (PF) by fine-tuning it on the desired PF. For optimal results, we recommend fine-tuning toward a lower PF than the one used during pretraining. To reproduce the results presented in the paper, you can use our released fine-tuning dataset, ARCfinetuning. ARC-Encoders are licensed under the CC-BY 4.0 license. Terms of use: as the released models are pretrained from a Llama3.2 3B backbone, ARC-Encoders are subject to the Llama Terms of Use found at Llama license. To load the pre-trained ARC-Encoders, use the code snippet from the ARC-Encoder repository. Remark: this code snippet loads the model from Hugging Face and then creates appropriate folders at ` ` containing the checkpoint and additional files needed for fine-tuning or evaluation with the `ARC-Encoder` codebase. To reduce occupied memory space, you can then delete the model from your Hugging Face cache.
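To illustrate what the pooling factor means in practice, a minimal sketch with hypothetical lengths (the ceiling on the last, possibly partial, group is an assumption, not a detail from the card):

```python
import math

def compressed_length(n_tokens: int, pooling_factor: int = 8) -> int:
    """Approximate number of continuous embeddings an ARC-Encoder with the
    given pooling factor (PF) produces from n_tokens context tokens."""
    # Ceiling division: assume the last group may be partially filled.
    return math.ceil(n_tokens / pooling_factor)

print(compressed_length(1024))  # 128
print(compressed_length(1000))  # 125
```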
helium-1-2b-life
helium-1-2b-hum
helium-1-2b-main
helium-1-2b-stem
hibiki-2b-pytorch-bf16
moshika-candle-q8
ARC8_Encoder_Llama
This page hosts `ARC8-EncoderLlama`, one of three pretrained ARC-Encoder versions. The architectures and training methods are described in the paper ARC-Encoder: learning compressed text representations for large language models, available here. Code: ARC-Encoder repository. All the encoders released here are trained on web crawl filtered using Dactory, with a Llama3.2-3B base backbone. The release consists of two ARC-Encoders each trained for a single decoder, and one trained for two decoders at the same time:
- `ARC8-EncoderLlama`, trained on 2.6B tokens specifically for Llama3.1-8B base, with a pooling factor of 8.
- `ARC8-EncoderMistral`, trained on 2.6B tokens specifically for Mistral-7B base, with a pooling factor of 8.
- `ARC8-Encodermulti`, trained by sampling among the two decoders, with a pooling factor of 8.

As described in the paper, the pretrained ARC-Encoders can be fine-tuned to perform various downstream tasks. You can also adapt an ARC-Encoder to a new pooling factor (PF) by fine-tuning it on the desired PF. For optimal results, we recommend fine-tuning toward a lower PF than the one used during pretraining. To reproduce the results presented in the paper, you can use our released fine-tuning dataset, ARCfinetuning. ARC-Encoders are licensed under the CC-BY 4.0 license. Terms of use: as the released models are pretrained from a Llama3.2 3B backbone, ARC-Encoders are subject to the Llama Terms of Use found at Llama license. To load the pre-trained ARC-Encoders, use the code snippet from the ARC-Encoder repository. Remark: this code snippet loads the model from Hugging Face and then creates appropriate folders at ` ` containing the checkpoint and additional files needed for fine-tuning or evaluation with the `ARC-Encoder` codebase. To reduce occupied memory space, you can then delete the model from your Hugging Face cache.
moshika-mlx-q4
ARC8_Encoder_Mistral
This page hosts `ARC8-EncoderMistral`, one of three pretrained ARC-Encoder versions. The architectures and training methods are described in the paper ARC-Encoder: learning compressed text representations for large language models, available here. Code: ARC-Encoder repository. All the encoders released here are trained on web crawl filtered using Dactory, with a Llama3.2-3B base backbone. The release consists of two ARC-Encoders each trained for a single decoder, and one trained for two decoders at the same time:
- `ARC8-EncoderLlama`, trained on 2.6B tokens specifically for Llama3.1-8B base, with a pooling factor of 8.
- `ARC8-EncoderMistral`, trained on 2.6B tokens specifically for Mistral-7B base, with a pooling factor of 8.
- `ARC8-Encodermulti`, trained by sampling among the two decoders, with a pooling factor of 8.

As described in the paper, the pretrained ARC-Encoders can be fine-tuned to perform various downstream tasks. You can also adapt an ARC-Encoder to a new pooling factor (PF) by fine-tuning it on the desired PF. For optimal results, we recommend fine-tuning toward a lower PF than the one used during pretraining. To reproduce the results presented in the paper, you can use our released fine-tuning dataset, ARCfinetuning. ARC-Encoders are licensed under the CC-BY 4.0 license. Terms of use: as the released models are pretrained from a Llama3.2 3B backbone, ARC-Encoders are subject to the Llama Terms of Use found at Llama license. To load the pre-trained ARC-Encoders, use the code snippet from the ARC-Encoder repository. Remark: this code snippet loads the model from Hugging Face and then creates appropriate folders at ` ` containing the checkpoint and additional files needed for fine-tuning or evaluation with the `ARC-Encoder` codebase. To reduce occupied memory space, you can then delete the model from your Hugging Face cache.
moshika-pytorch-q8
moshiko-pytorch-q8
hibiki-1b-mlx-bf16
Helium1-VL-2B
moshika-rag-candle-bf16
CASA-Qwen2_5-VL-3B
hibiki-2b-mlx-bf16
moshika-mlx-bf16
moshika-mlx-q8
moshika-candle-bf16
CASA-Helium1-VL-2B
moshika-vis-candle-q8
stt-1b-en_fr-mlx
hibiki-1b-rs-q6k
hibiki-1b-rs-q8
tts-voices
Voices available for Kyutai TTS. To find voices you like, use the interactive widget on the TTS project page. Do you want more voices? Help us by donating your voice, or open an issue in the TTS repo to suggest permissively-licensed datasets of voices we could add here. From the Voice Cloning Toolkit dataset, licensed under the Creative Commons License: Attribution 4.0 International. Each recording was done with two mics; here we used the `mic1` recordings. We chose sentence 23 for every speaker because it's generally the longest one to pronounce. From the Expresso dataset, licensed under the Creative Commons License: Attribution-NonCommercial 4.0 International. Non-commercial use only. We select clips from the "conversational" files. For each pair of "kind" and channel (`ex04-ex01_laughing`, channel 1), we find one segment with at least 10 consecutive seconds of speech using `VAD_segments.txt`. We don't include more segments per (kind, channel) to keep the number of voices manageable. The name of the file indicates how it was selected. For instance, `ex03-ex02_narration_001_channel1_674s.wav` comes from the first audio channel of `audio_48khz/conversational/ex03-ex02/narration/ex03-ex02_narration_001.wav`, meaning it's speaker `ex03`. It's a 10-second clip starting at 674 seconds into the original file.
- `degaulle-2.wav`: comes from the Appeal of 18 June, recording here. The exact licensing is unclear, but it seems safe to assume this recording is in the public domain since it's from 1940.
- `ex04_narration_longform_00001.wav`: comes from the Expresso dataset, so CC-NC.
- `p329_022.wav`: comes from VCTK, so CC BY 4.0.

The others are our own recordings and you may use them as CC0. French voices selected from the CML-TTS Dataset, licensed under the Creative Commons License: Attribution 4.0 International. From the EARS dataset, licensed under the Creative Commons License: Attribution-NonCommercial 4.0 International. Non-commercial use only.
For each of the 107 speakers, we use the middle 10 seconds of the `freeform_speech_01.wav` file. Additionally, we select two speakers, p003 (female) and p031 (male), and provide speaker embeddings for each of their `emo_freeform.wav` files. This allows users to experiment with a single speaker's voice across multiple emotions. Voices of volunteers submitted through our Voice Donation project, licensed as CC0. Thank you ❤️
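The Expresso clip names encode how each voice sample was selected; a small parser sketch, assuming the underscore-separated scheme `<pair>_<kind>_<index>_channel<c>_<start>s.wav` described in the naming explanation:

```python
import re

def parse_clip_name(name: str):
    """Recover speaker, kind and start offset from an Expresso clip name
    (naming scheme is inferred from the examples in this card)."""
    m = re.match(r"(ex\d+-ex\d+)_(\w+?)_(\d+)_channel(\d)_(\d+)s\.wav$", name)
    if m is None:
        return None
    pair, kind, index, channel, start = m.groups()
    # Assumption: channel 1 carries the first speaker of the pair, channel 2 the second.
    speaker = pair.split("-")[int(channel) - 1]
    return {"speaker": speaker, "kind": kind, "start_seconds": int(start)}

print(parse_clip_name("ex03-ex02_narration_001_channel1_674s.wav"))
# {'speaker': 'ex03', 'kind': 'narration', 'start_seconds': 674}
```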
stt-2.6b-en
stt-1b-en_fr
moshika-vis-pytorch-bf16
MoshiVis (Project Page | arXiv) is a perceptually augmented version of Moshi, giving it the ability to freely discuss images whilst maintaining its natural conversation style and low latency. To achieve this, Moshi has been extended with a visual backbone and a cross-attention mechanism to infuse the visual information into the language model. To train MoshiVis, we add a few parameters (~200M) on top of a frozen Moshi backbone (for the text/speech modeling aspect, ~7B params) and a PaliGemma2 vision encoder (for the image encoding part, ~400M parameters). This model page contains the `Moshika` (female voice) model weights for the `Pytorch` backend of the MoshiVis repo, in `bfloat16`. We provide the same model weights for other backends and quantization formats in the associated model collection.
- Developed by: Kyutai
- Model type: Multimodal speech+vision+text foundation model
- Language(s) (NLP): English
- License: CC-BY-4.0
- Uses frozen components from: Moshika and PaliGemma2
- Terms of use: As the released models include frozen weights of the SigLIP image encoder from PaliGemma-2, MoshiVis is subject to the Gemma Terms of Use found at ai.google.dev/gemma/terms
- Project Page: kyutai.org/moshivis
- Preprint: arXiv/abs/2503.15633
- Repository: Github kyutai-labs/moshivis
- Demo: Talk to Moshi

Similar to Moshi itself, MoshiVis can be used as a conversational agent for casual conversations, basic facts and advice (e.g. recipes, trivia), roleplay, etc. In addition, MoshiVis is able to recognize and discuss images in a natural way, whilst still allowing for low-latency interactions. Since MoshiVis was designed to infuse the visual signal into a frozen Moshi backbone with only a few trainable parameters, the model could be adapted to different downstream scenarios by further finetuning these parameters: for instance, adapting MoshiVis to a different off-the-shelf image encoder or to different visual domains.
The model is not intended to be used to impersonate other people or for any malicious use of any kind. This model is for research only, and we do not recommend it for providing advice or performing any professional duty. MoshiVis has been designed to perceptually augment the original Moshi model with vision capabilities and is expected to inherit similar biases and limitations. Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations. Stay tuned for our technical report, in which we will describe the training procedure in detail as well as report evaluation results. For information on the training data used for the base models, see PaliGemma2 and Moshi respectively. To train the cross-attention and gating mechanism that MoshiVis uses for processing images, we rely on a collection of publicly available datasets, namely:
- DOCCI
- PixMo
- Pixelprose
- TallyQA
- OCR-VQA
- RenderedText
- DocVQA

MoshiVis was designed as a relatively low-cost adaptation of Moshi (~200M extra trainable parameters) and was trained on a single DGX node with 8 H100 GPUs. Our training code was implemented in Pytorch. Our inference code is available for Pytorch, Rust and MLX.
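The "low-cost adaptation" framing can be made concrete with the rounded parameter figures quoted above (all three counts are approximate):

```python
# Approximate MoshiVis parameter budget (sketch from the card's ~ figures).
moshi_params = 7.0e9     # frozen Moshi text/speech backbone
vision_params = 0.4e9    # frozen PaliGemma2 image encoder
adapter_params = 0.2e9   # trainable cross-attention + gating

total = moshi_params + vision_params + adapter_params
print(total)              # 7600000000.0 params overall
print(adapter_params / total)  # ~0.026: under 3% of parameters are trained
```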
hibiki-1b-rs-bf16
stt-2.6b-en-mlx
See also the project page and the GitHub repository. This is a model for streaming speech-to-text (STT, also known as automatic speech recognition, ASR). Unlike offline speech-to-text, where the model needs the entire audio to produce the transcript, our model starts to output the transcript as soon as a few seconds of audio become available. The model architecture is a Transformer that consumes audio tokenized by Mimi (see the Moshi paper) and outputs text tokens. The frame rate is 12.5 Hz and each audio frame is represented by 32 audio tokens. We release two models:
- `kyutai/stt-1b-en_fr`, an English and French model with ~1B parameters, a 0.5 second delay, and a semantic VAD.
- `kyutai/stt-2.6b-en`, an English-only model with ~2.6B parameters and a 2.5 second delay.

Kyutai STT is a decoder-only model for streaming speech-to-text. It leverages the multistream architecture of Moshi to model the text stream based on the speech stream. The text stream is shifted w.r.t. the audio stream to allow the model to predict text tokens based on the input audio. Developed by: Kyutai. Model type: Streaming Speech-to-Text transcription. Language(s) (NLP): English and French for `kyutai/stt-1b-en_fr`, English for `kyutai/stt-2.6b-en`. License: Model weights are licensed under CC-BY 4.0. Repository: GitHub. The model can be used for streaming speech-to-text. It is robust to noisy conditions and was found to perform well on audio as long as 2 hours with no additional changes. The model produces transcripts with capitalization and punctuation. The predicted text token timestamps can be recovered by subtracting the model's text stream offset (0.5 or 2.5 seconds) from the frame's offset. Pretraining stage: for both `kyutai/stt-2.6b-en` and `kyutai/stt-1b-en_fr`, we use an audio collection of 2.5 million hours of publicly available audio content. For this dataset, we obtained synthetic transcripts by running whisper-timestamped.
- Finetuning stage: we then finetune the model on a collection of public datasets with ground-truth transcripts. This dataset contains 24000 hours of audio.
- Long-form finetuning stage: finally, we finetune the model on a combination of data from the previous stage and long-form audio. The long-form audio is obtained from two sources: (a) concatenating LibriSpeech examples (1000 hours), (b) synthesizing dialogs (22000 hours).
- Finetuning stage: we finetune on the Fisher dataset of 2000 hours of English audio, plus proprietary data (1000 hours in English, 600 hours in French).

Pretraining and finetuning were done with 48 and 16 H100 Nvidia GPUs, respectively. Neil Zeghidour, Eugene Kharitonov, Manu Orsini, Václav Volhejn, Gabriel de Marmiesse, Edouard Grave, Patrick Perez, Laurent Mazaré, Alexandre Défossez
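The timestamp recovery rule described above (subtract the model's text-stream offset from the frame's position on the 12.5 Hz grid) can be sketched as:

```python
def token_timestamp(frame_index: int, delay_seconds: float = 2.5,
                    frame_rate_hz: float = 12.5) -> float:
    """Audio position of a text token emitted at the given frame index."""
    return frame_index / frame_rate_hz - delay_seconds

# A token emitted at frame 100 by the 2.6B model (2.5 s delay):
print(token_timestamp(100))       # 5.5 seconds into the audio
# The same frame for the 1B model's 0.5 s delay:
print(token_timestamp(100, 0.5))  # 7.5
```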
hibiki-2b-rs-bf16
stt-1b-en_fr-candle
helium-1-preview-2b-mlx
moshika-vis-mlx
dactory-models
stt-2.6b-en-candle