jncraton

48 models

m2m100_418M-ct2-int8

license:mit
228
3

m2m100_1.2B-ct2-int8

license:mit
151
0

LaMini-Flan-T5-248M-ct2-int8

license:cc-by-nc-4.0
22
0

gemma-3-270m-ct2-int8

12
0

EuroLLM-9B-Instruct-ct2-int8

license:apache-2.0
9
0

Llama-3.2-3B-Instruct-ct2-int8

llama
8
2

dialogstudio-t5-base-v1.0-ct2-int8

license:apache-2.0
6
0

Qwen2.5-0.5B-Instruct-ct2-int8

Qwen2.5 is the latest series of Qwen large language models. For Qwen2.5, we release a number of base language models and instruction-tuned language models ranging from 0.5 to 72 billion parameters. Qwen2.5 brings the following improvements upon Qwen2:

- Significantly more knowledge and greatly improved capabilities in coding and mathematics, thanks to our specialized expert models in these domains.
- Significant improvements in instruction following, generating long texts (over 8K tokens), understanding structured data (e.g., tables), and generating structured outputs, especially JSON. More resilient to the diversity of system prompts, enhancing role-play implementation and condition-setting for chatbots.
- Long-context support up to 128K tokens, with generation of up to 8K tokens.
- Multilingual support for over 29 languages, including Chinese, English, French, Spanish, Portuguese, German, Italian, Russian, Japanese, Korean, Vietnamese, Thai, Arabic, and more.

This repo contains the instruction-tuned 0.5B Qwen2.5 model, which has the following features:

- Type: Causal Language Model
- Training Stage: Pretraining & Post-training
- Architecture: transformers with RoPE, SwiGLU, RMSNorm, attention QKV bias, and tied word embeddings
- Number of Parameters: 0.49B
- Number of Parameters (Non-Embedding): 0.36B
- Number of Layers: 24
- Number of Attention Heads (GQA): 14 for Q and 2 for KV
- Context Length: Full 32,768 tokens; generation up to 8,192 tokens

For more details, please refer to our blog, GitHub, and Documentation.

The code for Qwen2.5 is in the latest Hugging Face `transformers`, and we advise you to use the latest version of `transformers`. With `transformers<4.37.0`, you will encounter the following error: `KeyError: 'qwen2'`.

Below is a code snippet with `apply_chat_template` showing how to load the tokenizer and model and how to generate content. Detailed evaluation results are reported in this 📑 blog. For requirements on GPU memory and the respective throughput, see results here. If you find our work helpful, feel free to give us a cite.
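The snippet the card refers to is not reproduced on this page. A minimal sketch of the `apply_chat_template` flow with Hugging Face `transformers` (the model id `Qwen/Qwen2.5-0.5B-Instruct`, the system prompt, and the generation settings are illustrative assumptions, not taken from this card):

```python
def build_messages(prompt, system="You are a helpful assistant."):
    """Construct the chat message list expected by apply_chat_template."""
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": prompt},
    ]

def chat(prompt, model_name="Qwen/Qwen2.5-0.5B-Instruct", max_new_tokens=512):
    """Load tokenizer/model and generate a reply (downloads weights on first use)."""
    from transformers import AutoModelForCausalLM, AutoTokenizer  # requires transformers>=4.37.0

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype="auto", device_map="auto"
    )
    # Render the chat into the model's prompt format, leaving room for the reply.
    text = tokenizer.apply_chat_template(
        build_messages(prompt), tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer([text], return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Drop the prompt tokens so only the newly generated reply is decoded.
    new_tokens = output_ids[0][inputs.input_ids.shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Calling `chat("Give me a short introduction to large language models.")` would then return the decoded reply; the heavy model load happens inside the function so importing this sketch stays cheap.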

license:apache-2.0
6
0

dialogstudio-t5-large-v1.0-ct2-int8

license:apache-2.0
5
0

Llama-3.1-Nemotron-Nano-4B-v1.1-ct2-int8

Llama-3.1-Nemotron-Nano-4B-v1.1 is a large language model (LLM) derived from nvidia/Llama-3.1-Minitron-4B-Width-Base, which is created from Llama 3.1 8B using our LLM compression technique and offers improvements in model accuracy and efficiency. It is a reasoning model that is post-trained for reasoning, human chat preferences, and tasks such as RAG and tool calling. Llama-3.1-Nemotron-Nano-4B-v1.1 offers a great tradeoff between model accuracy and efficiency. The model fits on a single RTX GPU and can be used locally, and it supports a context length of 128K.

This model underwent a multi-phase post-training process to enhance both its reasoning and non-reasoning capabilities. This includes a supervised fine-tuning stage for Math, Code, Reasoning, and Tool Calling, as well as multiple reinforcement learning (RL) stages using Reward-aware Preference Optimization (RPO) algorithms for both chat and instruction-following. The final model checkpoint is obtained after merging the final SFT and RPO checkpoints.

This model is part of the Llama Nemotron Collection. You can find the other models in this family here:

- Llama-3.3-Nemotron-Ultra-253B-v1
- Llama-3.3-Nemotron-Super-49B-v1
- Llama-3.1-Nemotron-Nano-8B-v1

GOVERNING TERMS: Your use of this model is governed by the NVIDIA Open Model License. Additional Information: Llama 3.1 Community License Agreement. Built with Llama.

Model Dates: Trained between August 2024 and May 2025

Data Freshness: The pretraining data has a cutoff of June 2023.

Use Case: Developers designing AI agent systems, chatbots, RAG systems, and other AI-powered applications. Also suitable for typical instruction-following tasks. Balance of model accuracy and compute efficiency (the model fits on a single RTX GPU and can be used locally).

References:

- [\[2408.11796\] LLM Pruning and Distillation in Practice: The Minitron Approach](https://arxiv.org/abs/2408.11796)
- [\[2502.00203\] Reward-aware Preference Optimization: A Unified Mathematical Framework for Model Alignment](https://arxiv.org/abs/2502.00203)
- [\[2505.00949\] Llama-Nemotron: Efficient Reasoning Models](https://arxiv.org/abs/2505.00949)

Architecture Type: Dense decoder-only Transformer model
Network Architecture: Llama 3.1 Minitron Width 4B Base

Llama-3.1-Nemotron-Nano-4B-v1.1 is a general-purpose reasoning and chat model intended to be used in English and coding languages. Other non-English languages (German, French, Italian, Portuguese, Hindi, Spanish, and Thai) are also supported.

Input:
- Input Type: Text
- Input Format: String
- Input Parameters: One-Dimensional (1D)
- Other Properties Related to Input: Context length up to 131,072 tokens

Output:
- Output Type: Text
- Output Format: String
- Output Parameters: One-Dimensional (1D)
- Other Properties Related to Output: Context length up to 131,072 tokens

Software Integration:
- Runtime Engine: NeMo 24.12
- Recommended Hardware Microarchitecture Compatibility:
  - NVIDIA Hopper
  - NVIDIA Ampere

Usage recommendations:

1. Reasoning mode (ON/OFF) is controlled via the system prompt, which must be set as shown in the example below. All instructions should be contained within the user prompt.
2. We recommend setting temperature to `0.6` and Top P to `0.95` for Reasoning ON mode.
3. We recommend using greedy decoding for Reasoning OFF mode.
4. We have provided a list of prompts to use for evaluation for each benchmark where a specific template is required.

See the snippet below for usage with the Hugging Face Transformers library. Reasoning mode (ON/OFF) is controlled via the system prompt. Our code requires the `transformers` package version to be `4.44.2` or higher. For some prompts, even though thinking is disabled, the model emergently prefers to think before responding; if desired, users can prevent this by pre-filling the assistant response.

Llama-3.1-Nemotron-Nano-4B-v1.1 supports tool calling. This HF repo hosts a tool-calling parser as well as a chat template in Jinja, which can be used to launch a vLLM server. `vllm/vllm-openai:v0.6.6` or newer should support the model. Alternatively, you can use a virtual environment to launch a vLLM server. After launching a vLLM server, you can call the server with tool-call support using a Python script.

Supported hardware (BF16):
- 1x RTX 50 Series GPU
- 1x RTX 40 Series GPU
- 1x RTX 30 Series GPU
- 1x H100-80GB GPU
- 1x A100-80GB GPU

A large variety of training data was used for the post-training pipeline, including manually annotated data and synthetic data. The data for the multi-stage post-training phases for improvements in Code, Math, and Reasoning is a compilation of SFT and RL data that supports improvements of the math, code, general reasoning, and instruction-following capabilities of the original Llama instruct model. Prompts were sourced from public open corpora or synthetically generated. Responses were synthetically generated by a variety of models, with some prompts containing responses for both Reasoning On and Off modes, to train the model to distinguish between the two modes.

Data Collection for Training Datasets: Hybrid: Automated, Human, Synthetic

We used the datasets listed below to evaluate Llama-3.1-Nemotron-Nano-4B-v1.1.

Data Collection for Evaluation Datasets: Hybrid: Human/Synthetic
Data Labeling for Evaluation Datasets: Hybrid: Human/Synthetic/Automatic

These results contain both "Reasoning On" and "Reasoning Off" modes. We recommend using temperature=`0.6`, top_p=`0.95` for "Reasoning On" mode and greedy decoding for "Reasoning Off" mode. All evaluations are done with 32k sequence length. We run the benchmarks up to 16 times and average the scores to be more accurate.

> NOTE: Where applicable, a Prompt Template will be provided. While completing benchmarks, please ensure that you are parsing for the correct output format as per the provided prompt in order to reproduce the benchmarks seen below.

| Reasoning Mode | Score |
|--------------|------------|
| Reasoning Off | 7.4 |
| Reasoning On | 8.0 |

| Reasoning Mode | pass@1 |
|--------------|------------|
| Reasoning Off | 71.8% |
| Reasoning On | 96.2% |

| Reasoning Mode | pass@1 |
|--------------|------------|
| Reasoning Off | 13.3% |
| Reasoning On | 46.3% |

| Reasoning Mode | pass@1 |
|--------------|------------|
| Reasoning Off | 33.8% |
| Reasoning On | 55.1% |

| Reasoning Mode | Strict:Prompt | Strict:Instruction |
|--------------|------------|------------|
| Reasoning Off | 70.1% | 78.5% |
| Reasoning On | 75.5% | 82.6% |

| Reasoning Mode | Score |
|--------------|------------|
| Reasoning Off | 63.6% |
| Reasoning On | 67.9% |

| Reasoning Mode | pass@1 |
|--------------|------------|
| Reasoning Off | 61.9% |
| Reasoning On | 85.8% |

NVIDIA believes Trustworthy AI is a shared responsibility, and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards. Please report security vulnerabilities or NVIDIA AI Concerns here.
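The usage snippet the card points to is not included on this page. As a hedged sketch of the reasoning toggle: published Nemotron cards switch modes with a short system prompt, and the exact strings below (`"detailed thinking on"` / `"detailed thinking off"`), the model id, and the `pipeline` settings are assumptions based on the recommendations in the text, not copied from this page:

```python
def reasoning_messages(user_prompt, reasoning_on=True):
    """Build a chat where the system prompt toggles reasoning mode.
    All task instructions live in the user prompt, per the card's guidance.
    NOTE: the system-prompt strings here are assumed, not quoted from this card."""
    system = "detailed thinking on" if reasoning_on else "detailed thinking off"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user_prompt},
    ]

def sampling_params(reasoning_on=True):
    """Recommended decoding from the card: temperature 0.6 / top_p 0.95 for
    Reasoning ON, greedy decoding for Reasoning OFF."""
    if reasoning_on:
        return {"do_sample": True, "temperature": 0.6, "top_p": 0.95}
    return {"do_sample": False}

def generate(user_prompt, reasoning_on=True,
             model_id="nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1"):
    """Run the model via the transformers pipeline (requires transformers>=4.44.2)."""
    from transformers import pipeline

    pipe = pipeline("text-generation", model=model_id,
                    device_map="auto", torch_dtype="auto")
    out = pipe(reasoning_messages(user_prompt, reasoning_on),
               max_new_tokens=1024, **sampling_params(reasoning_on))
    # Chat-format input returns the full message list; take the assistant turn.
    return out[0]["generated_text"][-1]["content"]
```

Keeping the mode toggle and decoding parameters in small helpers makes it easy to run the same prompt in both modes when reproducing the benchmark pairs above.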

llama-3
5
0

all-MiniLM-L6-v2-ct2-int8

license:apache-2.0
4
0

Qwen2.5-Coder-7B-Instruct-ct2-int8

license:apache-2.0
4
0

codet5p-220m-py-ct2-int8

license:bsd-3-clause
3
1

LaMini-GPT-124M-ct2-int8

license:cc-by-nc-4.0
3
0

Starling-LM-7B-beta-ct2-int8

license:apache-2.0
3
0

Qwen2.5-Coder-1.5B-Instruct-ct2-int8

license:apache-2.0
3
0

teapotllm-ct2-int8

license:mit
3
0

flan-t5-large-ct2-int8

license:apache-2.0
2
3

flan-alpaca-gpt4-xl-ct2-int8

license:apache-2.0
2
0

gte-small-ct2-int8

license:mit
2
0

TinyLlama-1.1B-step-50K-105b-ct2-int8

license:apache-2.0
2
0

TinyLlama-1.1B-Chat-v0.3-ct2-int8

license:apache-2.0
2
0

GIST-small-Embedding-v0-ct2-int8

license:mit
2
0

multilingual-e5-small-ct2-int8

license:mit
2
0

granite-embedding-107m-multilingual-ct2-int8

license:apache-2.0
2
0

DeepSeek-R1-Distill-Qwen-7B-ct2-int8

license:mit
2
0

Qwen2.5-3B-Instruct-ct2-int8

Qwen2.5 is the latest series of Qwen large language models. For Qwen2.5, we release a number of base language models and instruction-tuned language models ranging from 0.5 to 72 billion parameters. Qwen2.5 brings the following improvements upon Qwen2:

- Significantly more knowledge and greatly improved capabilities in coding and mathematics, thanks to our specialized expert models in these domains.
- Significant improvements in instruction following, generating long texts (over 8K tokens), understanding structured data (e.g., tables), and generating structured outputs, especially JSON. More resilient to the diversity of system prompts, enhancing role-play implementation and condition-setting for chatbots.
- Long-context support up to 128K tokens, with generation of up to 8K tokens.
- Multilingual support for over 29 languages, including Chinese, English, French, Spanish, Portuguese, German, Italian, Russian, Japanese, Korean, Vietnamese, Thai, Arabic, and more.

This repo contains the instruction-tuned 3B Qwen2.5 model, which has the following features:

- Type: Causal Language Model
- Training Stage: Pretraining & Post-training
- Architecture: transformers with RoPE, SwiGLU, RMSNorm, attention QKV bias, and tied word embeddings
- Number of Parameters: 3.09B
- Number of Parameters (Non-Embedding): 2.77B
- Number of Layers: 36
- Number of Attention Heads (GQA): 16 for Q and 2 for KV
- Context Length: Full 32,768 tokens; generation up to 8,192 tokens

For more details, please refer to our blog, GitHub, and Documentation.

The code for Qwen2.5 is in the latest Hugging Face `transformers`, and we advise you to use the latest version of `transformers`. With `transformers<4.37.0`, you will encounter the following error: `KeyError: 'qwen2'`.

Below is a code snippet with `apply_chat_template` showing how to load the tokenizer and model and how to generate content. Detailed evaluation results are reported in this 📑 blog. For requirements on GPU memory and the respective throughput, see results here. If you find our work helpful, feel free to give us a cite.
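Note that every repo on this page is a CTranslate2 int8 conversion (the `-ct2-int8` suffix), so loading goes through the `ctranslate2` runtime rather than the `transformers` model classes shown in the upstream cards. A minimal sketch, assuming a locally downloaded conversion directory and the matching Hugging Face tokenizer (the paths, sampling temperature, and lengths are illustrative):

```python
def ids_to_prompt_tokens(tokenizer, text):
    """CTranslate2 generators consume token *strings*, not token ids."""
    return tokenizer.convert_ids_to_tokens(tokenizer.encode(text))

def generate_ct2(model_dir, tokenizer_name, prompt, max_length=256):
    """Generate text with a CTranslate2 int8 conversion (CPU-friendly)."""
    import ctranslate2
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    generator = ctranslate2.Generator(model_dir, compute_type="int8")
    results = generator.generate_batch(
        [ids_to_prompt_tokens(tokenizer, prompt)],
        max_length=max_length,
        sampling_temperature=0.7,
        include_prompt_in_result=False,  # return only the continuation
    )
    # sequences_ids gives token ids directly, ready for decoding.
    return tokenizer.decode(results[0].sequences_ids[0], skip_special_tokens=True)
```

For an instruction-tuned model like this one, the prompt passed in would first be rendered with the tokenizer's `apply_chat_template` so the chat markers match what the model was trained on.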

2
0

Qwen2.5-Math-1.5B-Instruct-ct2-int8

license:apache-2.0
2
0

gemma-2b-it-ct2-int8

1
1

LaMini-Flan-T5-77M-ct2-int8

license:cc-by-nc-4.0
1
0

flan-alpaca-xl-ct2-int8

license:apache-2.0
1
0

e5-small-v2-ct2-int8

license:mit
1
0

TinyLlama-1.1B-Chat-v1.0-ct2-int8

license:apache-2.0
1
0

openchat-3.5-0106-ct2-int8

license:apache-2.0
1
0

SmolLM-135M-Instruct-ct2-int8

dataset:HuggingFaceTB/everyday-conversations-llama3.1-2k
1
0

SmolLM-360M-Instruct-ct2-int8

dataset:HuggingFaceTB/everyday-conversations-llama3.1-2k
1
0

SmolLM-1.7B-Instruct-ct2-int8

dataset:HuggingFaceTB/everyday-conversations-llama3.1-2k
1
0

Llama-3.2-1B-Instruct-ct2-int8

llama
1
0

Falcon3-1B-Instruct-ct2-int8

1
0

granite-embedding-30m-english-ct2-int8

license:apache-2.0
1
0

Dolphin3.0-Llama3.1-8B-ct2-int8

base_model:meta-llama/Llama-3.1-8B
1
0

Qwen2.5-7B-Instruct-ct2-int8

license:apache-2.0
1
0

OpenReasoning-Nemotron-1.5B-ct2-int8

Description: OpenReasoning-Nemotron-1.5B is a large language model (LLM) derived from Qwen2.5-1.5B-Instruct (AKA the reference model). It is a reasoning model that is post-trained for reasoning about math, code, and science solution generation. We evaluated this model with up to 64K output tokens. The OpenReasoning model is available in the following sizes: 1.5B, 7B, 14B, and 32B. This model is ready for commercial/non-commercial research use.

License/Terms of Use: GOVERNING TERMS: Use of the models listed above is governed by the Creative Commons Attribution 4.0 International License (CC-BY-4.0). ADDITIONAL INFORMATION: Apache 2.0 License.

Our models demonstrate exceptional performance across a suite of challenging reasoning benchmarks. The 7B, 14B, and 32B models consistently set new state-of-the-art records for their size classes.

| Model | ArtificialAnalysisIndex\* | GPQA | MMLU-PRO | HLE | LiveCodeBench | SciCode | AIME24 | AIME25 | HMMT FEB 25 |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 1.5B | 31.0 | 31.6 | 47.5 | 5.5 | 28.6 | 2.2 | 55.5 | 45.6 | 31.5 |
| 7B | 54.7 | 61.1 | 71.9 | 8.3 | 63.3 | 16.2 | 84.7 | 78.2 | 63.5 |
| 14B | 60.9 | 71.6 | 77.5 | 10.1 | 67.8 | 23.5 | 87.8 | 82.0 | 71.2 |
| 32B | 64.3 | 73.1 | 80.0 | 11.9 | 70.2 | 28.5 | 89.2 | 84.0 | 73.8 |

\* This is our estimation of the Artificial Analysis Intelligence Index, not an official score.

Combining the work of multiple agents: OpenReasoning-Nemotron models can be used in a "heavy" mode by starting multiple parallel generations and combining them via generative solution selection (GenSelect). To add this "skill" we follow the original GenSelect training pipeline, except we do not train on the selection summary but use the full reasoning trace of DeepSeek R1 0528 671B instead. We only train models to select the best solution for math problems but surprisingly find that this capability directly generalizes to code and science questions! With this "heavy" GenSelect inference mode, the OpenReasoning-Nemotron-32B model surpasses O3 (High) on math and coding benchmarks.

| Model | Pass@1 (Avg@64) | Majority@64 | GenSelect |
| :--- | :--- | :--- | :--- |
| 1.5B | | | |
| AIME24 | 55.5 | 76.7 | 76.7 |
| AIME25 | 45.6 | 70.0 | 70.0 |
| HMMT Feb 25 | 31.5 | 46.7 | 53.3 |
| 7B | | | |
| AIME24 | 84.7 | 93.3 | 93.3 |
| AIME25 | 78.2 | 86.7 | 93.3 |
| HMMT Feb 25 | 63.5 | 83.3 | 90.0 |
| LCB v6 2408-2505 | 63.4 | n/a | 67.7 |
| 14B | | | |
| AIME24 | 87.8 | 93.3 | 93.3 |
| AIME25 | 82.0 | 90.0 | 90.0 |
| HMMT Feb 25 | 71.2 | 86.7 | 93.3 |
| LCB v6 2408-2505 | 67.9 | n/a | 69.1 |
| 32B | | | |
| AIME24 | 89.2 | 93.3 | 93.3 |
| AIME25 | 84.0 | 90.0 | 93.3 |
| HMMT Feb 25 | 73.8 | 86.7 | 96.7 |
| LCB v6 2408-2505 | 70.2 | n/a | 75.3 |
| HLE | 11.8 | 13.4 | 15.5 |

The code generation prompt (truncated on this page) instructs the model to use python for just the final solution code block with the following format.

Math generation prompt:

```python
prompt = """Solve the following math problem. Make sure to put the answer (and only answer) inside \\boxed{}.

{user}
"""
```

Science generation prompt: you can refer to the prompts here:

- https://github.com/NVIDIA/NeMo-Skills/blob/main/nemo_skills/prompt/config/generic/hle.yaml (HLE)
- https://github.com/NVIDIA/NeMo-Skills/blob/main/nemo_skills/prompt/config/eval/aai/mcq-4choices-boxed.yaml (GPQA)
- https://github.com/NVIDIA/NeMo-Skills/blob/main/nemo_skills/prompt/config/eval/aai/mcq-10choices-boxed.yaml (MMLU-Pro)

```python
messages = [
    {
        "role": "user",
        "content": prompt.format(
            user="Write a program to calculate the sum of the first $N$ fibonacci numbers"
        ),
    },
]

outputs = pipeline(
    messages,
    max_new_tokens=64000,
)
print(outputs[0]["generated_text"][-1]["content"])
```

```bibtex
@article{ahmad2025opencodereasoning,
  title={{OpenCodeReasoning: Advancing Data Distillation for Competitive Coding}},
  author={Wasi Uddin Ahmad, Sean Narenthiran, Somshubra Majumdar, Aleksander Ficek, Siddhartha Jain, Jocelyn Huang, Vahid Noroozi, Boris Ginsburg},
  year={2025},
  eprint={2504.01943},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2504.01943},
}

@misc{ahmad2025opencodereasoningiisimpletesttime,
  title={{OpenCodeReasoning-II: A Simple Test Time Scaling Approach via Self-Critique}},
  author={Wasi Uddin Ahmad and Somshubra Majumdar and Aleksander Ficek and Sean Narenthiran and Mehrzad Samadi and Jocelyn Huang and Siddhartha Jain and Vahid Noroozi and Boris Ginsburg},
  year={2025},
  eprint={2507.09075},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2507.09075},
}

@misc{moshkov2025aimo2winningsolutionbuilding,
  title={{AIMO-2 Winning Solution: Building State-of-the-Art Mathematical Reasoning Models with OpenMathReasoning dataset}},
  author={Ivan Moshkov and Darragh Hanley and Ivan Sorokin and Shubham Toshniwal and Christof Henkel and Benedikt Schifferer and Wei Du and Igor Gitman},
  year={2025},
  eprint={2504.16891},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2504.16891},
}

@inproceedings{toshniwal2025genselect,
  title={{GenSelect: A Generative Approach to Best-of-N}},
  author={Shubham Toshniwal and Ivan Sorokin and Aleksander Ficek and Ivan Moshkov and Igor Gitman},
  booktitle={2nd AI for Math Workshop @ ICML 2025},
  year={2025},
  url={https://openreview.net/forum?id=8LhnmNmUDb}
}
```

Use Case: This model is intended for developers and researchers who work on competitive math, code, and science problems. It has been trained via only supervised fine-tuning to achieve strong scores on benchmarks.

Release Date: Hugging Face [07/16/2025] via https://huggingface.co/nvidia/OpenReasoning-Nemotron-1.5B/

Reference(s):
- [2504.01943] OpenCodeReasoning: Advancing Data Distillation for Competitive Coding
- [2504.16891] AIMO-2 Winning Solution: Building State-of-the-Art Mathematical Reasoning Models with OpenMathReasoning dataset

Model Architecture:
- Architecture Type: Dense decoder-only Transformer model
- Network Architecture: Qwen-1.5B-Instruct

OpenReasoning-Nemotron-1.5B was developed based on Qwen2.5-1.5B-Instruct and has 1.5B model parameters. OpenReasoning-Nemotron-7B was developed based on Qwen2.5-7B-Instruct and has 7B model parameters. OpenReasoning-Nemotron-14B was developed based on Qwen2.5-14B-Instruct and has 14B model parameters. OpenReasoning-Nemotron-32B was developed based on Qwen2.5-32B-Instruct and has 32B model parameters.

Input:
- Input Type(s): Text
- Input Format(s): String
- Input Parameters: One-Dimensional (1D)
- Other Properties Related to Input: Trained for up to 64,000 output tokens

Output:
- Output Type(s): Text
- Output Format: String
- Output Parameters: One-Dimensional (1D)
- Other Properties Related to Output: Trained for up to 64,000 output tokens

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

Software Integration:
- Runtime Engine: NeMo 2.3.0
- Recommended Hardware Microarchitecture Compatibility: NVIDIA Ampere, NVIDIA Hopper
- Preferred/Supported Operating System(s): Linux

Model Version(s): 1.0 (7/16/2025)
- OpenReasoning-Nemotron-32B
- OpenReasoning-Nemotron-14B
- OpenReasoning-Nemotron-7B
- OpenReasoning-Nemotron-1.5B

The training corpus for OpenReasoning-Nemotron-1.5B is comprised of questions from the OpenCodeReasoning dataset, OpenCodeReasoning-II, OpenMathReasoning, and the synthetic science questions from the Llama-Nemotron-Post-Training-Dataset. All responses are generated using DeepSeek-R1-0528. We also include the instruction-following and tool-calling data from the Llama-Nemotron-Post-Training-Dataset without modification.

Data Collection Method: Hybrid: Automated, Human, Synthetic
Labeling Method: Hybrid: Automated, Human, Synthetic
Properties: 5M DeepSeek-R1-0528 generated responses from OpenCodeReasoning questions (https://huggingface.co/datasets/nvidia/OpenCodeReasoning), OpenMathReasoning, and the synthetic science questions from the Llama-Nemotron-Post-Training-Dataset, plus the instruction-following and tool-calling data from the Llama-Nemotron-Post-Training-Dataset without modification.

Evaluation Dataset: We used the benchmarks listed above to evaluate the model holistically.
- Data Collection Method: Hybrid: Automated, Human, Synthetic
- Labeling Method: Hybrid: Automated, Human, Synthetic

Ethical Considerations: NVIDIA believes Trustworthy AI is a shared responsibility, and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards. Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns here.

license:cc-by-4.0
1
0

fastchat-t5-3b-v1.0-ct2-int8

license:apache-2.0
0
2

flan-t5-xl-ct2-int8

license:apache-2.0
0
1

TinyLlama-1.1B-intermediate-step-955k-token-2T-guanaco

llama
0
1

Phi-3-mini-4k-instruct-ct2-int8

license:mit
0
1

Phi-3-mini-4k-instruct-20240701-ct2-int8

license:mit
0
1