Zyphra

21 models

Zamba2-1.2B-instruct

Zamba2-1.2B-Instruct is obtained from Zamba2-1.2B by fine-tuning on instruction-following and chat datasets. Specifically:

1. SFT of the base Zamba2-1.2B model on ultrachat_200k and Infinity-Instruct
2. DPO of the SFT checkpoint on ultrafeedback_binarized, orca_dpo_pairs, and OpenHermesPreferences

Zamba2-1.2B-Instruct is a hybrid model composed of state-space (Mamba2) and transformer blocks.

To download Zamba2-1.2B-Instruct, install `transformers` from source:

1. `git clone https://github.com/huggingface/transformers.git`
2. `cd transformers && pip install .`

To install the dependencies necessary to run Mamba2 kernels, install `mamba-ssm` from source (due to compatibility issues with PyTorch), as well as `causal-conv1d`:

1. `git clone https://github.com/state-spaces/mamba.git`
2. `cd mamba && git checkout v2.1.0 && pip install .`
3. `pip install causal-conv1d`

You can run the model without the optimized Mamba2 kernels, but this is not recommended, as it results in significantly higher latency and memory usage.

Zamba2-1.2B-Instruct achieves leading instruction-following and multi-turn chat performance for a model of its size, and matches strong models that are significantly larger. For instance, Zamba2-1.2B-Instruct outperforms Gemma2-2B-Instruct, a very strong model over 2x its size.

| Model | Size | Aggregate MT-Bench | IFEval |
|:--------------------:|:----:|:------------------:|:------:|
| Zamba2-1.2B-Instruct | 1.2B | 59.53 | 41.45 |
| Gemma2-2B-Instruct | 2.7B | 51.69 | 42.20 |
| H2O-Danube-1.8B-Chat | 1.6B | 49.78 | 27.95 |
| StableLM-1.6B-Chat | 1.6B | 49.87 | 33.77 |
| SmolLM-1.7B-Instruct | 1.7B | 43.37 | 16.53 |
| Qwen2-1.5B-Instruct | 1.5B | N/A | 34.68 |

Moreover, due to its unique hybrid SSM architecture, Zamba2-1.2B-Instruct achieves extremely low inference latency and rapid generation with a significantly smaller memory footprint than comparable transformer-based models.
(Figure: Time to First Token (TTFT) and Output Generation latency comparisons.)

Zamba2-1.2B utilizes and extends our original Zamba hybrid SSM-attention architecture. The core Zamba architecture consists of a backbone of Mamba2 layers interleaved with one or more shared attention layers. This attention block has shared weights to minimize the parameter cost of the model. We find that concatenating the original model embeddings to the input of this attention block improves performance, likely due to better maintenance of information across depth. The Zamba2 architecture also applies LoRA projection matrices to the shared transformer blocks to gain some additional expressivity in each block and to allow each shared block to specialize slightly to its own unique position, while keeping the additional parameter overhead small.

Note: this is a temporary HuggingFace implementation of Zamba2-1.2B. It may not yet be fully compatible with all frameworks and tools intended to interface with HuggingFace models. A standalone PyTorch implementation of Zamba2-1.2B may be found here.
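The embedding-concatenation idea described above can be pictured with a toy sketch (pure Python, illustrative shapes and names only; this is not Zyphra's implementation):

```python
# Toy sketch: the shared attention block receives the current hidden state
# concatenated feature-wise with the original token embeddings, so deeper
# invocations retain direct access to the input representation.
def shared_attention_input(hidden, embeddings):
    # both are lists of per-token feature vectors; output width doubles
    return [h + e for h, e in zip(hidden, embeddings)]

hidden = [[0.1, 0.2], [0.3, 0.4]]  # hidden states after some Mamba2 layers
emb = [[1.0, 0.0], [0.0, 1.0]]     # original input embeddings
print(shared_attention_input(hidden, emb))
# → [[0.1, 0.2, 1.0, 0.0], [0.3, 0.4, 0.0, 1.0]]
```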

license:apache-2.0
53,503
28

Zonos-v0.1-hybrid

Zonos-v0.1 is a leading open-weight text-to-speech model trained on more than 200k hours of varied multilingual speech, delivering expressiveness and quality on par with, or even surpassing, top TTS providers.

Our model enables highly natural speech generation from text prompts when given a speaker embedding or audio prefix, and can accurately perform voice cloning when given a reference clip spanning just a few seconds. The conditioning setup also allows fine control over speaking rate, pitch variation, audio quality, and emotions such as happiness, fear, sadness, and anger. The model outputs speech natively at 44kHz. For more details and speech samples, check out our blog here. We also have a hosted version available at playground.zyphra.com/audio.

Zonos follows a straightforward architecture: text normalization and phonemization via eSpeak, followed by DAC token prediction through a transformer or hybrid backbone. An overview of the architecture can be seen below.

Running the minimal example should produce a `sample.wav` file in your project root directory. For repeated sampling we highly recommend using the gradio interface instead, as the minimal example needs to load the model every time it is run.

- Zero-shot TTS with voice cloning: input desired text and a 10-30s speaker sample to generate high-quality TTS output.
- Audio prefix inputs: add text plus an audio prefix for even richer speaker matching. Audio prefixes can be used to elicit behaviours such as whispering, which can otherwise be challenging to replicate when cloning from speaker embeddings alone.
- Multilingual support: Zonos-v0.1 supports English, Japanese, Chinese, French, and German.
- Audio quality and emotion control: Zonos offers fine-grained control of many aspects of the generated audio, including speaking rate, pitch, maximum frequency, audio quality, and various emotions such as happiness, anger, sadness, and fear.
- Fast: our model runs with a real-time factor of ~2x on an RTX 4090.
- Gradio WebUI: Zonos comes packaged with an easy-to-use gradio interface to generate speech.
- Simple installation and deployment: Zonos can be installed and deployed simply using the docker file packaged with our repository.

At the moment this repository only supports Linux systems (preferably Ubuntu 22.04/24.04) with recent NVIDIA GPUs (3000-series or newer, 6GB+ VRAM).

Zonos depends on the eSpeak library for phonemization. You can install it on Ubuntu with the following command:

We highly recommend using a recent version of uv for installation. If you don't have uv installed, you can install it via pip: `pip install -U uv`.

- Installing into a new uv virtual environment (recommended)
- Installing into the system/active environment using uv
- Installing into the system/active environment using pip

For convenience we provide a minimal example to check that the installation works.

Citation

If you find this model useful in an academic context please cite as:

```bibtex
@misc{zyphra2025zonos,
  title  = {Zonos-v0.1: An Expressive, Open-Source TTS Model},
  author = {Dario Sucic and Mohamed Osman and Gabriel Clark and Chris Warner and Beren Millidge},
  year   = {2025},
}
```
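The conditioning controls listed above can be pictured as a single dictionary passed to the model at generation time. The key names below are assumptions for illustration only, not Zonos's actual API:

```python
# Hypothetical conditioning dictionary (key names are illustrative only,
# covering the controls the card describes).
cond = {
    "text": "Hello from Zonos.",
    "language": "en-us",        # one of the five supported languages
    "speaker_embedding": None,  # or an embedding from a 10-30s reference clip
    "speaking_rate": 1.0,       # relative rate
    "pitch_variation": 1.0,
    "emotion": {"happiness": 0.8, "sadness": 0.0, "fear": 0.0, "anger": 0.0},
    "max_frequency_hz": 22050,  # the model outputs 44kHz audio natively
}
print(sorted(cond))
```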

license:apache-2.0
34,379
1,099

Zonos-v0.1-transformer

Zonos-v0.1 is a leading open-weight text-to-speech model trained on more than 200k hours of varied multilingual speech, delivering expressiveness and quality on par with, or even surpassing, top TTS providers.

Our model enables highly natural speech generation from text prompts when given a speaker embedding or audio prefix, and can accurately perform voice cloning when given a reference clip spanning just a few seconds. The conditioning setup also allows fine control over speaking rate, pitch variation, audio quality, and emotions such as happiness, fear, sadness, and anger. The model outputs speech natively at 44kHz. For more details and speech samples, check out our blog here. We also have a hosted version available at playground.zyphra.com/audio.

Zonos follows a straightforward architecture: text normalization and phonemization via eSpeak, followed by DAC token prediction through a transformer or hybrid backbone. An overview of the architecture can be seen below.

Running the minimal example should produce a `sample.wav` file in your project root directory. For repeated sampling we highly recommend using the gradio interface instead, as the minimal example needs to load the model every time it is run.

- Zero-shot TTS with voice cloning: input desired text and a 10-30s speaker sample to generate high-quality TTS output.
- Audio prefix inputs: add text plus an audio prefix for even richer speaker matching. Audio prefixes can be used to elicit behaviours such as whispering, which can otherwise be challenging to replicate when cloning from speaker embeddings alone.
- Multilingual support: Zonos-v0.1 supports English, Japanese, Chinese, French, and German.
- Audio quality and emotion control: Zonos offers fine-grained control of many aspects of the generated audio, including speaking rate, pitch, maximum frequency, audio quality, and various emotions such as happiness, anger, sadness, and fear.
- Fast: our model runs with a real-time factor of ~2x on an RTX 4090.
- Gradio WebUI: Zonos comes packaged with an easy-to-use gradio interface to generate speech.
- Simple installation and deployment: Zonos can be installed and deployed simply using the docker file packaged with our repository.

At the moment this repository only supports Linux systems (preferably Ubuntu 22.04/24.04) with recent NVIDIA GPUs (3000-series or newer, 6GB+ VRAM).

Zonos depends on the eSpeak library for phonemization. You can install it on Ubuntu with the following command:

We highly recommend using a recent version of uv for installation. If you don't have uv installed, you can install it via pip: `pip install -U uv`.

- Installing into a new uv virtual environment (recommended)
- Installing into the system/active environment using uv
- Installing into the system/active environment using pip

For convenience we provide a minimal example to check that the installation works.

Citation

If you find this model useful in an academic context please cite as:

```bibtex
@misc{zyphra2025zonos,
  title  = {Zonos-v0.1: An Expressive, Open-Source TTS Model},
  author = {Dario Sucic and Mohamed Osman and Gabriel Clark and Chris Warner and Beren Millidge},
  year   = {2025},
}
```

license:apache-2.0
16,991
418

Zamba2-7B-Instruct

Zamba2-7B-Instruct is obtained from Zamba2-7B by fine-tuning on instruction-following and chat datasets. Zamba2-7B-Instruct is a hybrid model composed of state-space (Mamba2) and transformer blocks. The long-context version of Zamba2-7B-Instruct has been extended from 4k to 16k context by adjusting the rope frequency in the attention blocks.

To use Zamba2-7B-Instruct, install `transformers` from source:

1. `git clone https://github.com/huggingface/transformers.git`
2. `cd transformers && pip install .`

To install the dependencies necessary to run Mamba2 kernels, install `mamba-ssm` from source (due to compatibility issues with PyTorch), as well as `causal-conv1d`:

1. `git clone https://github.com/state-spaces/mamba.git`
2. `cd mamba && git checkout v2.1.0 && pip install .`
3. `pip install causal-conv1d`

You can run the model without the optimized Mamba2 kernels, but this is not recommended, as it results in significantly higher latency and memory usage.

To use the context-extended version of Zamba, please load the model with `use_long_context=True`.

Zamba2-7B-Instruct punches dramatically above its weight, achieving extremely strong instruction-following benchmark scores.

| Task | Score |
|:----------:|:-----:|
| IFEval | 69.95 |
| BBH | 33.33 |
| MATH Lvl 5 | 13.57 |
| GPQA | 10.28 |
| MUSR | 8.21 |
| MMLU-PRO | 32.43 |
| Average | 27.96 |

Moreover, due to its unique hybrid SSM architecture, Zamba2-7B-Instruct achieves extremely low inference latency and rapid generation with a significantly smaller memory footprint than comparable transformer-based models.

(Figure: Time to First Token (TTFT) and Output Generation latency comparisons.)

Zamba2-7B-Instruct's high performance and strong instruction-following and reasoning capabilities for its size make it an ideal generalist small model for a wide range of applications.

Zamba2-7B-Instruct utilizes and extends our original Zamba hybrid SSM-attention architecture. The core Zamba architecture consists of a backbone of Mamba2 layers interleaved with one or more shared attention layers. This attention block has shared weights to minimize the parameter cost of the model. We find that concatenating the original model embeddings to the input of this attention block improves performance, likely due to better maintenance of information across depth. The Zamba2 architecture also applies LoRA projection matrices to the shared MLP to gain some additional expressivity in each block and to allow each shared block to specialize slightly to its own unique position, while keeping the additional parameter overhead small.

Zamba2-7B-Instruct features an experimental long-context mode which extends the context from 4k to 16k. This was achieved by adjusting the rotation frequency of the rotary position embeddings. In Needle-In-A-Haystack tests, we observe that Zamba2-7B-Instruct finds the needle with an extremely high success rate up to and slightly beyond 16k context, with performance falling off sharply at about 18k context. In future versions we aim to extend this context length significantly.

Note: this is a temporary HuggingFace implementation of Zamba2-7B. It may not yet be fully compatible with all frameworks and tools intended to interface with HuggingFace models. A standalone PyTorch implementation of Zamba2-7B may be found here.
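One common way to realize the rope-frequency adjustment described above is to scale the rotary base so positions rotate more slowly. The sketch below illustrates that idea under an assumed scaling scheme; the card does not specify Zyphra's exact method, and the factor here is illustrative:

```python
# Rotary inverse frequencies: inv_freq[i] = base ** (-2*i / dim).
# Raising the base slows rotation, letting positions beyond the original
# 4k window remain distinguishable out to 16k. The factor is illustrative.
def rope_inv_freq(dim, base=10000.0):
    return [base ** (-2.0 * i / dim) for i in range(dim // 2)]

orig = rope_inv_freq(8)
extended = rope_inv_freq(8, base=10000.0 * (16384 / 4096))
# every extended frequency is no faster than the original one
print(all(e <= o for e, o in zip(extended, orig)))
# → True
```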

license:apache-2.0
6,021
92

ZUNA

license:apache-2.0
1,774
135

Zamba2-1.2B

Zamba2-1.2B is a hybrid model composed of state-space (Mamba) and transformer blocks. It broadly follows the Zamba architecture, which consists of a Mamba backbone alternating with shared transformer blocks (see diagram in Model Details). Zamba2-1.2B possesses three major improvements over Zamba1:

1. Mamba1 blocks have been replaced with Mamba2 blocks.
2. We apply a LoRA projector to each shared MLP and attention block, which allows the network to specialize at each invocation of the shared transformer layer across depth. LoRA enables us to add depth-specialization for only a minimal increase in total parameter count.
3. We utilize rotary position embeddings in the shared attention layer.

Zamba2-1.2B differs from our 2.7B model in three ways:

1. Rotary position embeddings in the shared attention layer.
2. A single shared transformer block (instead of two that we alternate between).
3. Added LoRA projectors to attention blocks (instead of just a LoRA on the MLP block).

We found that while hybrid SSM-transformer models are perfectly capable of performing well without position embeddings, adding rotary embeddings to the shared attention block slightly improved performance. Secondly, we utilize a single attention block (instead of alternating between two independent transformer blocks) because this enables a higher flop count for the model at a given parameter budget, and at smaller scales this becomes more important than the slightly faster latency.

Zamba2-1.2B uses the Mistral v0.1 tokenizer and was pre-trained on 3T tokens of text and code data sourced from open web-datasets, including Zyda. Subsequently, in a second phase, Zamba2-1.2B was annealed on a mixture of 100B high-quality tokens.

Note: this is a temporary HuggingFace implementation of Zamba2-1.2B. It may not yet be fully compatible with all frameworks and tools intended to interface with HuggingFace models. A standalone PyTorch implementation of Zamba2-1.2B may be found here.

To download Zamba2-1.2B, install `transformers` from source:

1. `git clone https://github.com/huggingface/transformers.git`
2. `cd transformers && pip install .`

To install the dependencies necessary to run Mamba2 kernels, install `mamba-ssm` from source (due to compatibility issues with PyTorch), as well as `causal-conv1d`:

1. `git clone https://github.com/state-spaces/mamba.git`
2. `cd mamba && git checkout v2.1.0 && pip install .`
3. `pip install causal-conv1d`

You can run the model without the optimized Mamba2 kernels, but this is not recommended, as it results in significantly higher latency and memory usage.

Zamba2-1.2B utilizes and extends our original Zamba hybrid SSM-attention architecture. The core Zamba architecture consists of a backbone of Mamba layers interleaved with one or more shared attention layers. This attention block has shared weights to minimize the parameter cost of the model. We find that concatenating the original model embeddings to the input of this attention block improves performance, likely due to better maintenance of information across depth. The Zamba2 architecture also applies LoRA projection matrices to the shared transformer blocks to gain some additional expressivity in each block and to allow each shared block to specialize slightly to its own unique position, while keeping the additional parameter overhead small.

Zamba2-1.2B achieves leading and state-of-the-art performance among models of its size.

(Figure: Time to First Token (TTFT) and Output Generation latency comparisons.)

Zamba2-1.2B is a pretrained base model and therefore does not have any moderation mechanism and may output toxic or otherwise harmful language. In addition, one should not expect good instruct or chat performance, as this model was not fine-tuned for instruction following or chat.
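The LoRA-on-shared-blocks idea above can be sketched numerically: the shared weight is reused at every depth, and each depth adds its own low-rank update. This is a toy illustration with tiny shapes, not the model's actual code:

```python
# W is the shared weight (d x d); each depth position owns a rank-r pair
# (B: d x r, A: r x d) whose product perturbs W, costing 2*d*r extra
# parameters per position instead of a full d*d copy.
def matmul(X, Y):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def effective_weight(W, B, A):
    delta = matmul(B, A)
    return [[w + d for w, d in zip(wr, dr)] for wr, dr in zip(W, delta)]

W = [[1.0, 0.0], [0.0, 1.0]]  # shared across all depths
B = [[0.5], [0.0]]            # depth-specific, rank 1
A = [[0.0, 1.0]]
print(effective_weight(W, B, A))
# → [[1.0, 0.5], [0.0, 1.0]]
```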

license:apache-2.0
956
74

ZR1 1.5B

ZR1-1.5B is a small reasoning model trained extensively on both verified coding and mathematics problems with reinforcement learning. The model outperforms Llama-3.1-70B-Instruct on hard coding tasks and improves upon the base R1-Distill-1.5B model by over 50%, while achieving strong scores on math evaluations and a 37.91% pass@1 accuracy on GPQA-Diamond with just 1.5B parameters.

For training we utilized the PRIME Eurus-2-RL dataset, which combines the following math and code datasets:

- NuminaMath-CoT
- APPS, CodeContests, TACO, and Codeforces train sets

We filtered the math data by validating that questions are correctly graded when calling the evaluator with the reference ground truth, and we removed all code examples with an empty list of test cases. Our final dataset comprised roughly 400k math and 25k code samples.

We employ PRIME (Process Reinforcement through IMplicit rEwards), an online RL algorithm with process rewards, motivated by the improvement over GRPO demonstrated in the paper, as well as by potentially more accurate token-level rewards due to the learned process reward model. We used the training-batch accuracy filtering method from PRIME for training stability, and the iterative context lengthening technique demonstrated in DeepScaleR for faster training, which has also been shown to improve token efficiency. After a warmup period with the maximum generation length set to 12k tokens, we sequentially increased the maximum generation length during training, starting at 8k tokens before increasing to 16k and 24k.

We trained on a single 8xH100 node with the following specific algorithmic details:

- PRIME + RLOO with token-level granularity
- No `<think>` token prefill; 0.1 format reward/penalty
- Main train batch size 256 with n=4 samples per prompt; veRL dynamic batch size, with the max batch size per GPU set to support training at large generation lengths
- Max prompt length 1536; generation length increased over training. We started with 12k, intended to ease the model into shorter generation-length training: 12384 -> 8192 -> 16384 -> 24448
- Start with 1 PPO epoch, increase to 4 during the 24k stage
- Accuracy filtering 0.2-0.8, relaxed to 0.01-0.99 during the 24k stage
- Oversample batches 2x for accuracy filtering
- KL coefficient 0 (no KL divergence term)
- Entropy coefficient 0.001
- Actor LR 5e-7
- Reward beta train 0.05
- Reward LR 1e-6
- Reward grad clip 10
- Reward RM coefficient 5

Coding

| Model | Leetcode | LCB Generation |
|:----|:----|:----|
| ZR1-1.5B | 40% | 39.74% |
| R1-Distill-Qwen-1.5B | 12.22% | 24.36% |
| DeepCoder-1.5B | 21.11% | 35.90% |
| OpenHands-LM-1.5B | 18.88% | 29.49% |
| Qwen2.5-1.5B-Instruct | 20.56% | 24.36% |
| Qwen2.5-Coder-3B-Instruct | 35.55% | 39.74% |
| Llama-3.1-8B-Instruct | 14.44% | 23.08% |
| Llama-3.1-70B-Instruct | 37.22% | 34.62% |
| Eurus-2-7B-PRIME | 34.44% | 32.05% |
| Mistral-Small-2503 | - | 38.46% |
| Gemma-3-27b-it | - | 39.74% |
| Claude-3-Opus | - | 37.18% |

LiveBench

| Model | AMPS Hard | Math Comp | LCB Generation | Coding Completion |
|:----|:----|:----|:----|:----|
| ZR1-1.5B | 74% | 60.42% | 39.74% | 12% |
| DeepCoder-1.5B | 69% | 61.46% | 35.90% | 12% |
| DeepScaleR-1.5B | 64% | 50% | 24.36% | 6% |
| OpenHands-LM-1.5B | 24% | 29.48% | 29.49% | 8% |
| R1-Distill-1.5B | 54% | 37.50% | 24.36% | 6% |
| Qwen2.5-1.5B-Instruct | 38% | 20.83% | 24.36% | 4% |
| Qwen2.5-Math-1.5B-Instruct | 49% | 36.46% | 0% | 0% |
| Qwen2.5-3B-Instruct | 41% | 17.71% | 28.21% | 10% |
| R1-Distill-7B | 74% | 61.46% | 44.87% | 14% |
| Qwen2.5-7B-Instruct | 56% | 29.17% | 38.46% | 40% |
| Qwen2.5-Math-7B-Instruct | 62% | 45.83% | 16.67% | 4% |
| R1-Distill-14B | 77% | 69.79% | 64.10% | 18% |
| Qwen2.5-14B-Instruct | 59% | 43.75% | 46.15% | 54% |
| R1-Distill-32B | 74% | 75% | 60.26% | 26% |
| QwQ-32B-Preview | 78% | 67.71% | 52.56% | 22% |
| QwQ-32B | 83% | 87.5% | 87.18% | 46% |
| Qwen2.5-32B-Instruct | 62% | 54.17% | 51.23% | 54% |
| Qwen2.5-Coder-32B-Instruct | 48% | 53.13% | 55.13% | 58% |
| R1-Distill-Llama-70B | 65% | 78.13% | 69.23% | 34% |
| Qwen2.5-72B-Instruct | 66% | 52.08% | 50% | 62% |
| Qwen2.5-Math-72B-Instruct | 56% | 59.38% | 42.31% | 42% |
| DeepSeek-R1 | 88% | 88.54% | 79.48% | 54% |

General Math

| Model | AIME24 | AIME25 | AMC22/23 | AMC24 | GPQA-D | MATH500 | Minerva | Olympiad |
|:----|:----|:----|:----|:----|:----|:----|:----|:----|
| ZR1-1.5B | 33.75% | 27.29% | 72.06% | 59.17% | 37.91% | 88.34% | 33.52% | 56.87% |
| ZR1-1.5B (greedy) | 40% | 26.67% | 71.08% | 53.33% | 37.88% | 89.40% | 32.72% | 57.93% |
| DeepScaleR-1.5B | 42.92% | 27.71% | 74.40% | 60.69% | 34.66% | 89.36% | 35.50% | 59.37% |
| DeepScaleR-1.5B (greedy) | 33.33% | 33.33% | 67.47% | 57.77% | 29.29% | 84.60% | 31.62% | 52.44% |
| DeepCoder-1.5B | 41.88% | 24.79% | 75.30% | 59.72% | 36.46% | 83.60% | 32.01% | 56.39% |
| Still-3-1.5B | 31.04% | 23.54% | 65.51% | 56.94% | 34.56% | 86.55% | 33.50% | 53.55% |
| Open-RS3-1.5B | 31.67% | 23.75% | 64.08% | 51.67% | 35.61% | 84.65% | 29.46% | 52.13% |
| R1-Distill-1.5B | 28.96% | 22.50% | 63.59% | 50.83% | 33.87% | 84.65% | 31.39% | 51.11% |
| R1-Distill-1.5B (greedy) | 26.67% | 13.33% | 51.81% | 24.44% | 30.81% | 73.40% | 25.74% | 40% |
| Qwen2.5-Math-1.5B-Instruct (greedy) | 10% | 6.67% | 42.17% | 26.67% | 28.28% | 75.20% | 28.31% | 40.74% |
| Qwen2.5-Math-7B-Instruct (greedy) | 20% | 3.33% | 46.99% | 31.11% | 32.32% | 83% | 37.13% | 42.22% |
| Qwen2.5-Math-72B-Instruct (greedy) | 26.67% | 6.67% | 59.04% | 46.67% | 43.94% | 85.40% | 42.65% | 50.37% |
| Eurus-2-7B-PRIME (greedy) | 20% | 13.33% | 56.62% | 40% | 36.36% | 81.20% | 36.76% | 44.15% |
| DeepHermes-3-Llama-3-3B (think prompt, greedy) | 0% | 3.33% | 12.05% | 11.11% | 30.30% | 34.40% | 10.66% | 10.52% |
| OpenHands-LM-1.5B (greedy) | 0% | 0% | 10.84% | 4.44% | 23.74% | 36.80% | 12.50% | 10.22% |

Our direct-answer system prompt was: "Give a direct answer without thinking first." The table below reports the average greedy pass@1 score across the following math evals: AIME24, AIME25, AMC22/23, AMC24, GPQA-Diamond, MATH-500, MinervaMath, OlympiadBench.

| Model | avg pass@1 | max_tokens |
|:----|:----|:----|
| ZR1-1.5B | 51.13% | 32768 |
| ZR1-1.5B (truncated) | 46.83% | 4096 |
| ZR1-1.5B (direct answer prompt) | 45.38% | 4096 |
| ZR1-1.5B (truncated) | 40.39% | 2048 |
| ZR1-1.5B (direct answer prompt) | 37% | 2048 |
| Qwen-2.5-Math-1.5B-Instruct | 32.25% | 2048 |
| Qwen-2.5-Math-7B-Instruct | 37.01% | 2048 |

For Leetcode and LiveBench, we report pass@1 accuracy with greedy sampling. For the rest of the evaluations we report pass@1 accuracy averaged over 16 samples per question, with temperature 0.6 and top_p 0.95. For vllm we disable prefix caching and chunked prefill.
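The iterative context-lengthening and accuracy-filtering settings above can be sketched as two small helpers. Stage boundaries and names are hypothetical; the card gives only the length sequence and the filtering bounds:

```python
# Stage-wise maximum generation length, following the quoted sequence
# 12384 -> 8192 -> 16384 -> 24448 (stage indices are illustrative).
def max_generation_length(stage):
    schedule = [12384, 8192, 16384, 24448]
    return schedule[min(stage, len(schedule) - 1)]

# PRIME-style accuracy filtering: keep only prompts whose batch accuracy
# is neither trivially easy nor impossibly hard.
def keep_prompt(batch_accuracy, lo=0.2, hi=0.8):
    return lo <= batch_accuracy <= hi

print(max_generation_length(0), max_generation_length(3))
# prints: 12384 24448
print([a for a in (0.0, 0.25, 0.5, 0.9) if keep_prompt(a)])
# prints: [0.25, 0.5]
```

Relaxing the bounds during the 24k stage corresponds to calling `keep_prompt(acc, lo=0.01, hi=0.99)`.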

license:mit
362
69

Zamba-7B-v1

Zamba-7B-v1 is a hybrid model between Mamba, a state-space model, and transformers. It uses a Mamba backbone with a shared transformer layer every 6 blocks. Zamba was trained using next-token prediction. It uses the Mistral v0.1 tokenizer. We arrived at this architecture after a series of ablations at small scales. Zamba-7B-v1 was pre-trained on 1T tokens of text and code data sourced from open web-datasets. Subsequently, in a second phase, Zamba was annealed on a mixture of 50B high-quality tokens.

Note: the current HuggingFace implementation of Zamba performs slower than our internal implementation. We are working to fix this with the HuggingFace team.

Our technical report describing the training of Zamba is available here.

To download Zamba, clone Zyphra's fork of transformers:

1. `git clone https://github.com/Zyphra/transformers_zamba`
2. `cd transformers_zamba`
3. Install the repository: `pip install -e .`

In order to run optimized Mamba implementations on a CUDA device, you need to install `mamba-ssm` and `causal-conv1d`. You can run the model without the optimized Mamba kernels, but this is not recommended, as it results in significantly higher latency. To run on CPU, please specify `use_mamba_kernels=False` when loading the model using `AutoModelForCausalLM.from_pretrained`.

You can also load a different checkpoint, e.g. for iteration 2500. The default iteration is the fully trained model, corresponding to iteration 25156. This is the number of training iterations done starting from the Zamba phase-1 checkpoint, Zyphra/Zamba-7B-v1-phase1. See arXiv:2405.16712 for more details on training.

Zamba utilizes a unique hybrid SSM architecture. This architecture consists of a backbone of Mamba layers interspersed with a shared attention layer. This attention block has shared weights to minimize the parameter cost of the model. We find that concatenating the original model embeddings to the input of this attention block improves performance, likely due to better maintenance of information across depth.

We find that Zamba performs significantly better than existing open models (with open datasets and training details) at this scale. However, it performs slightly worse than the leading open-weight models at the 7B scale. Most of this difference derives from MMLU and reasoning evaluations. Zamba, however, is trained on significantly fewer tokens than these models and is the most sample-efficient model in terms of performance per training token. Due to its SSM architecture, Zamba is extremely efficient in inference, substantially outperforming comparable 7B and 8B models in inference latency as well as in the memory cost of generation, thanks to its substantially diminished KV cache.

If you find Zamba useful in your work please cite it as:

Zamba is a pretrained base model and therefore does not have any moderation mechanism. In addition, one should not expect good chat performance, as this model was not fine-tuned for chat.
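The "shared transformer layer every 6 blocks" layout can be sketched as a simple schedule builder (names and counts are illustrative; this is not the model's code):

```python
# Build a layer schedule: Mamba blocks, with the single weight-shared
# transformer layer inserted after every `period` Mamba blocks.
def zamba_schedule(n_mamba, period=6):
    layers = []
    for i in range(n_mamba):
        layers.append(f"mamba_{i}")
        if (i + 1) % period == 0:
            layers.append("shared_transformer")  # same weights at every occurrence
    return layers

sched = zamba_schedule(12)
print(sched.count("shared_transformer"))
# → 2
```

Because the transformer layer is weight-shared, its parameters are counted once no matter how many times it appears in the schedule.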

license:apache-2.0
309
28

Zamba2-2.7B

Zamba2-2.7B is a hybrid model composed of state-space and transformer blocks. It broadly follows the Zamba architecture, which consists of a Mamba backbone alternating with shared transformer blocks (see diagram in Model Details). Zamba2-2.7B possesses three major improvements over Zamba1:

1. Mamba1 blocks have been replaced with Mamba2 blocks.
2. Instead of a single shared attention block, we utilize two shared attention blocks which are interleaved in an ABAB pattern throughout the network.
3. We apply a LoRA projector to each shared MLP block, which allows the network to specialize the MLPs at each invocation of the shared layer across depth. LoRA enables us to add depth-specialization for only a minimal increase in total parameter count.

Zamba2-2.7B uses the Mistral v0.1 tokenizer and was pre-trained on 3T tokens of text and code data sourced from open web-datasets, including Zyda. Subsequently, in a second phase, Zamba2-2.7B was annealed on a mixture of 100B high-quality tokens.

Note: this is a temporary HuggingFace implementation of Zamba2-2.7B. It may not yet be fully compatible with all frameworks and tools intended to interface with HuggingFace models. A standalone PyTorch implementation of Zamba2-2.7B may be found here.

To use Zamba2-2.7B, install `transformers` from source:

1. `git clone https://github.com/huggingface/transformers.git`
2. `cd transformers && pip install .`

To install the dependencies necessary to run Mamba2 kernels, install `mamba-ssm` from source (due to compatibility issues with PyTorch), as well as `causal-conv1d`:

1. `git clone https://github.com/state-spaces/mamba.git`
2. `cd mamba && git checkout v2.1.0 && pip install .`
3. `pip install causal-conv1d`

You can run the model without the optimized Mamba2 kernels, but this is not recommended, as it results in significantly higher latency and memory usage.

Zamba2-2.7B utilizes and extends our original Zamba hybrid SSM-attention architecture. The core Zamba architecture consists of a backbone of Mamba layers interleaved with one or more shared attention layers (one shared attention block in Zamba1, two in Zamba2). This attention block has shared weights to minimize the parameter cost of the model. We find that concatenating the original model embeddings to the input of this attention block improves performance, likely due to better maintenance of information across depth. The Zamba2 architecture also applies LoRA projection matrices to the shared MLP to gain some additional expressivity in each block and to allow each shared block to specialize slightly to its own unique position, while keeping the additional parameter overhead small.

Zamba2-2.7B achieves leading and state-of-the-art performance among models of its size.

(Figure: Time to First Token (TTFT) and Output Generation latency comparisons.)

Zamba2-2.7B is a pretrained base model and therefore does not have any moderation mechanism and may output toxic or otherwise harmful language. In addition, one should not expect good instruct or chat performance, as this model was not fine-tuned for instruction following or chat.
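The ABAB interleaving of the two shared attention blocks can be pictured with a one-line schedule (illustrative only; slot counts and names are not the model's actual configuration):

```python
# Toy sketch of the ABAB pattern: shared slots alternate between the two
# weight-shared attention blocks A and B.
def abab_slots(n_slots):
    return ["shared_A" if i % 2 == 0 else "shared_B" for i in range(n_slots)]

print(abab_slots(4))
# → ['shared_A', 'shared_B', 'shared_A', 'shared_B']
```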

license:apache-2.0
176
78

Zamba2-7B

license:apache-2.0
36
113

Zamba2-2.7B-instruct

Zamba2-2.7B-Instruct is obtained from Zamba2-2.7B by fine-tuning on instruction-following and chat datasets. Specifically:

1. SFT of the base Zamba2-2.7B model on ultrachat_200k and Infinity-Instruct
2. DPO of the SFT checkpoint on ultrafeedback_binarized, orca_dpo_pairs, and OpenHermesPreferences

Zamba2-2.7B-Instruct is a hybrid model composed of state-space (Mamba2) and transformer blocks.

To use Zamba2-2.7B-Instruct, install `transformers` from source:

1. `git clone https://github.com/huggingface/transformers.git`
2. `cd transformers && pip install .`

To install the dependencies necessary to run Mamba2 kernels, install `mamba-ssm` from source (due to compatibility issues with PyTorch), as well as `causal-conv1d`:

1. `git clone https://github.com/state-spaces/mamba.git`
2. `cd mamba && git checkout v2.1.0 && pip install .`
3. `pip install causal-conv1d`

You can run the model without the optimized Mamba2 kernels, but this is not recommended, as it results in significantly higher latency and memory usage.

Zamba2-2.7B-Instruct punches dramatically above its weight, achieving extremely strong instruction-following benchmark scores, significantly outperforming Gemma2-2B-Instruct of the same size and outperforming Mistral-7B-Instruct on most metrics.

| Model | Size | Aggregate MT-Bench | IFEval |
|:--------------------:|:----:|:------------------:|:------:|
| Zamba2-2.7B-Instruct | 2.7B | 72.40 | 48.02 |
| Mistral-7B-Instruct | 7B | 66.4 | 45.3 |
| Gemma2-2B-Instruct | 2.7B | 51.69 | 42.20 |
| H2O-Danube-4B-Chat | 4B | 52.57 | 37.96 |
| StableLM-Zephyr-3B | 3B | 66.43 | 38.27 |

Moreover, due to its unique hybrid SSM architecture, Zamba2-2.7B-Instruct achieves extremely low inference latency and rapid generation with a significantly smaller memory footprint than comparable transformer-based models.

(Figure: Time to First Token (TTFT) and Output Generation latency comparisons.)

Zamba2-2.7B-Instruct's high performance, strong instruction-following and reasoning capabilities, and small inference compute and memory footprint render it an ideal generalist model for on-device applications.

Zamba2-2.7B-Instruct utilizes and extends our original Zamba hybrid SSM-attention architecture. The core Zamba architecture consists of a backbone of Mamba2 layers interleaved with one or more shared attention layers (one shared attention block in Zamba1, two in Zamba2). This attention block has shared weights to minimize the parameter cost of the model. We find that concatenating the original model embeddings to the input of this attention block improves performance, likely due to better maintenance of information across depth. The Zamba2 architecture also applies LoRA projection matrices to the shared MLP to gain some additional expressivity in each block and to allow each shared block to specialize slightly to its own unique position, while keeping the additional parameter overhead small.

Note: this is a temporary HuggingFace implementation of Zamba2-2.7B. It may not yet be fully compatible with all frameworks and tools intended to interface with HuggingFace models. A standalone PyTorch implementation of Zamba2-2.7B may be found here.

license:apache-2.0
24
83

ZAYA1-base

license:apache-2.0
9
19

Zamba-7B-v1-phase1

license:apache-2.0
8
5

BlackMamba-2.8B

> BlackMamba: Mixture of Experts for State-space models
> Quentin Anthony, Yury Tokpanov, Paolo Glorioso, Beren Millidge
> Paper: https://arxiv.org/abs/2402.01771

About

We provide inference code for our BlackMamba model in our github repository: https://github.com/Zyphra/BlackMamba

BlackMamba is a novel architecture which combines state-space models (SSMs) with mixture-of-experts (MoE). It uses Mamba as its SSM block and a switch transformer as its MoE block base. BlackMamba has extremely low latency for generation and inference, providing significant speedups over classical transformers, MoE models, and Mamba SSM models alike. Additionally, due to its SSM sequence mixer, BlackMamba retains linear computational complexity in the sequence length.
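A switch-transformer MoE block routes each token to a single expert. The toy router below sketches that top-1 selection in pure Python (illustrative only, not BlackMamba's implementation):

```python
import math

# Top-1 (switch-style) routing: each token goes to the expert with the
# highest router score; the gate value is that expert's softmax probability.
def route_top1(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    expert = max(range(len(probs)), key=probs.__getitem__)
    return expert, probs[expert]

expert, gate = route_top1([0.1, 2.0, -1.0])
print(expert)
# → 1
```

Because only one expert runs per token, compute per token stays roughly constant as experts are added, which is the source of the MoE latency advantage the card describes.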

license:apache-2.0
7
30

ZAYA1-reasoning-base

license:apache-2.0
7
9

BlackMamba-1.5B

license:apache-2.0
5
9

Zamba2-7B-Instruct-v2

license:apache-2.0
5
4

Zamba2-1.2B-Instruct-v2

license:apache-2.0
3
3

Zamba2-2.7B-Instruct-v2

license:apache-2.0
1
2

Zonos-v0.1-speaker-embedding

This repository contains the speaker embedding models for our Zonos-v0.1 transformer and hybrid models. The speaker embedding models are based on the ResNet293-SimAM-ASP models from VoxBlink2. We use the pretrained models, as we found that the fine-tuned versions performed worse. The output of the speaker embedding model is passed through an LDA layer and compressed from 256 to 128 dimensions to remove further spurious information about the reference clip before being fed into the Zonos models.
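The LDA compression step is just a fixed linear projection. The sketch below illustrates it with tiny dimensions; the real layer maps 256 to 128 dimensions, and the names and values here are purely illustrative:

```python
# A fixed linear projection y = W @ x reduces the embedding dimension;
# a tiny 2x4 matrix stands in for the real 128x256 LDA projection.
def project(x, W):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

x = [1.0, 2.0, 3.0, 4.0]          # stand-in for a 256-d speaker embedding
W = [[1, 0, 0, 0], [0, 0, 1, 0]]  # stand-in for the 128x256 projection
print(project(x, W))
# → [1.0, 3.0]
```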

license:apache-2.0
0
28

Mamba-370M

license:apache-2.0
0
8