ServiceNow-AI
Apriel-1.5-15b-Thinker
Apriel-1.6-15b-Thinker
Apriel-5B-Instruct
Apriel-Nemotron-15b-Thinker
1. Summary 2. Evaluation 3. Training Details 4. How to Use 5. Intended Use 6. Limitations 7. Security and Responsible Use 8. Software 9. License 10. Acknowledgements 11. Citation

Apriel-Nemotron-15b-Thinker is a 15-billion-parameter reasoning model in ServiceNow's Apriel SLM series. It achieves competitive performance against similarly sized state-of-the-art models such as o1-mini, QWQ-32b, and EXAONE-Deep-32b, while maintaining roughly half the memory footprint of those alternatives. It builds upon the Apriel-15b-base checkpoint through a three-stage training pipeline (CPT, SFT, and GRPO).

Highlights
- Half the size of SOTA models like QWQ-32b and EXAONE-32b, and therefore memory efficient.
- Consumes roughly 40% fewer tokens than QWQ-32b, making it highly efficient in production.
- Matches or outperforms similarly sized models on tasks such as MBPP, BFCL, Enterprise RAG, MT-Bench, MixEval, IFEval, and Multi-Challenge, making it well suited to agentic and enterprise workloads.
- Competitive performance for its size on academic benchmarks such as AIME-24, AIME-25, AMC-23, MATH-500, and GPQA.

Evaluations were conducted using lm-eval-harness and evalchemy. Benchmarks that are indicative of enterprise capability

Training Details

Mid training / Continual Pre-training: In this stage, the model is trained on 100+ billion tokens of carefully curated examples drawn from mathematical reasoning, coding challenges, scientific discourse, and logical puzzles. The objective is to strengthen the model's foundational reasoning capabilities. This stage is critical for the model to function as a reasoner and provides significant lifts on reasoning benchmarks.

Supervised Fine-Tuning (SFT): Next, we fine-tune the model on 200,000 high-quality demonstrations covering mathematical and scientific problem solving, coding tasks, generic instruction-following scenarios, and API/function-invocation use cases.
Reinforcement Learning: Although the SFT checkpoint delivers strong performance on core competencies such as mathematics and general knowledge, it exhibits weaknesses in instruction following and coding. To address these gaps, we apply GRPO (with minor modifications to the objective). The result is a significant improvement on benchmarks such as IFEval, Multi-Challenge, Enterprise RAG, MBPP, and BFCL, while preserving scores on competition-level math exams like AIME and AMC. GRPO also yields modest gains on GPQA and MixEval. Throughout training, intermediate snapshots from both the SFT and GRPO stages are periodically merged, which improves generalization and mitigates catastrophic forgetting. See our technical report for full details: arXiv:2508.10948.

How to Use

The model can be run with the transformers library's `generate` function. It will first generate its thinking process and then generate its final response between `[BEGIN FINAL RESPONSE]` and `[END FINAL RESPONSE]`, which the chat template applies automatically.

Usage Guidelines
1. Use the model's default chat template, which already includes a system prompt. We recommend placing all other instructions in the user message.
2. We recommend setting the temperature to `0.6`.
3. During all our evaluations, we ensure the model starts with `Here are my reasoning steps:\n`. This is implemented in the default chat template.

Intended Use

The Apriel family of models is designed for a variety of general-purpose instruction tasks, including:
- Code assistance and generation
- Logical reasoning and multi-step tasks
- Question answering and information retrieval
- Function calling, complex instruction following, and agent use cases

They are not intended for use in safety-critical applications without human oversight, or in scenarios requiring guaranteed factual accuracy.

Limitations
- Factual accuracy: May produce incorrect, misleading, or outdated content.
Outputs should be verified before use in critical contexts.
- Bias: May reflect societal, cultural, or systemic biases present in training data.
- Ethics: Do not use the model to produce harmful, unlawful, or unethical content.
- Language: Performance is strongest in English. Output quality may degrade in underrepresented languages.
- Critical use: Not suitable for medical, legal, financial, or other high-risk applications without safeguards.

Security and Responsible Use

Security Responsibilities: Deployers and users are strongly encouraged to align their security practices with established frameworks and regulatory guidelines such as the EU AI Act and the NIST AI Risk Management Framework (RMF).
- Regularly conduct robustness assessments to identify and mitigate adversarial inputs.
- Implement validation and filtering processes to prevent harmful or biased outputs.
- Continuously perform data privacy checks to guard against unintended data leaks.
- Document and communicate the model's limitations, intended usage, and known security risks to all end users.
- Schedule periodic security reviews and updates to address emerging threats and vulnerabilities.
- Follow established security policies and usage guidelines provided by deployers.
- Protect and manage sensitive information when interacting with the model.
- Report anomalies, suspicious behavior, or unsafe outputs to deployers or developers.
- Maintain human oversight and apply judgment to mitigate potential security or ethical risks during interactions.

Disclaimer: Users accept responsibility for securely deploying, managing, and using this open-source LLM. The model is provided "as is," without explicit or implied warranty regarding security or fitness for any specific application or environment.

Acknowledgements

> We thank researchers at Nvidia for sharing detailed insights and data from their work in building reasoners! This greatly accelerated our research, and we recognize it in our model's naming convention.
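The `generate` and chat-template snippets referenced in the card above did not survive extraction. The following is a minimal sketch of what such usage typically looks like; the checkpoint name `ServiceNow-AI/Apriel-Nemotron-15b-Thinker` and the `extract_final_response` helper are illustrative assumptions, not the card's original code.

```python
import re


# Illustrative helper (not from the original card): pull the text between the
# final-response markers described above, falling back to the full output.
def extract_final_response(output_text: str) -> str:
    match = re.search(
        r"\[BEGIN FINAL RESPONSE\](.*?)\[END FINAL RESPONSE\]",
        output_text,
        flags=re.DOTALL,
    )
    return match.group(1).strip() if match else output_text.strip()


if __name__ == "__main__":
    # Assumed checkpoint name; adjust to the checkpoint you are using.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "ServiceNow-AI/Apriel-Nemotron-15b-Thinker"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

    # The default chat template already includes the system prompt, so only
    # the user message is supplied (per the Usage Guidelines above).
    messages = [{"role": "user", "content": "What is 17 * 24?"}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)

    outputs = model.generate(
        inputs, max_new_tokens=2048, temperature=0.6, do_sample=True
    )
    # Decode only the newly generated tokens, then strip the reasoning trace.
    text = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)
    print(extract_final_response(text))
```

The helper simply post-processes the decoded string; any equivalent marker-splitting logic works.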
AprielGuard
Apriel 5B Base
Apriel-H1-15b-Thinker-SFT
A 15B-parameter hybrid reasoning model combining Transformer attention and Mamba state-space layers for high efficiency and scalability. Derived from Apriel-Nemotron-15B-Thinker through progressive distillation, Apriel-H1 replaces less critical attention layers with linear Mamba blocks, achieving over 2× higher inference throughput in vLLM with minimal loss in reasoning, math, and coding performance.

- Model Size: 15B parameters
- Context Length: 65K (target; runtime dependent)
- Languages: English (best)
- Hybrid Transformer–SSM architecture
- ~2× throughput improvement over the base Thinker model
- Retains strong reasoning, math, and coding capabilities
- Built via efficient distillation; no training from scratch required

Model Overview

Apriel-H1-15b-Thinker is designed for agentic tasks, code assistance, and multi-step reasoning. It follows Apriel's "think then answer" style: the model first produces a hidden chain-of-thought and then a concise final response. Where reasoning traces are undesired, configure prompts to favor concise outputs.

All models were evaluated with vLLM server endpoints using FlashInfer (except AI21-Jamba-Reasoning-3B, which used FlashAttention2). The Mamba cache was set to fp32 for NVIDIA-Nemotron-Nano-9B-v2 and AI21-Jamba-Reasoning-3B.

Recommended settings: temperature 0.6; increase `max_new_tokens` for complex reasoning.

1. Create and activate a Python environment. You can use any environment manager; the example below uses `uv`. Find our plugin at https://github.com/ServiceNow/apriel. You may need to install a version of vLLM compatible with your CUDA version. In this example, we use the default CUDA version and let vLLM automatically select the correct backend.
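The environment-setup snippet referenced above did not survive extraction. A minimal sketch with `uv` might look like the following; the exact install commands (vLLM from PyPI, the Apriel plugin from the repository linked above) are assumptions based on the surrounding text, not the card's original code.

```shell
# Create and activate a Python environment with uv (names are illustrative).
uv venv apriel-env
source apriel-env/bin/activate

# Install vLLM; choose a build matching your CUDA version if the default
# backend selection does not suit your system.
uv pip install vllm

# Install the Apriel plugin from the repository linked above (assumed install path).
uv pip install git+https://github.com/ServiceNow/apriel.git
```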
Once installed, you can launch a vLLM OpenAI-compatible API server with your Apriel model, or run the server directly using the prebuilt container.

The model will first generate its thinking process and then generate its final response between `[BEGIN FINAL RESPONSE]` and `[END FINAL RESPONSE]`, which the chat template applies automatically.

Usage Guidelines
1. Use the model's default chat template, which already includes a system prompt. We recommend placing all other instructions in the user message.
2. We recommend setting the temperature to `0.6`.
3. During all our evaluations, we ensure the model starts with `Here are my reasoning steps:\n`. This is implemented in the default chat template.

The Apriel family of models is designed for a variety of general-purpose instruction tasks, including:
- Code assistance and generation
- Logical reasoning and multi-step tasks
- Question answering and information retrieval
- Function calling, complex instruction following, and agent use cases

They are not intended for use in safety-critical applications without human oversight, or in scenarios requiring guaranteed factual accuracy.

- Factual accuracy: May produce incorrect, misleading, or outdated content. Outputs should be verified before use in critical contexts.
- Bias: May reflect societal, cultural, or systemic biases present in training data.
- Ethics: Do not use the model to produce harmful, unlawful, or unethical content.
- Language: Performance is strongest in English. Output quality may degrade in underrepresented languages.
- Critical use: Not suitable for medical, legal, financial, or other high-risk applications without safeguards.

Security Responsibilities: Deployers and users are strongly encouraged to align their security practices with established frameworks and regulatory guidelines such as the EU AI Act and the NIST AI Risk Management Framework (RMF).
- Regularly conduct robustness assessments to identify and mitigate adversarial inputs.
- Implement validation and filtering processes to prevent harmful or biased outputs.
- Continuously perform data privacy checks to guard against unintended data leaks.
- Document and communicate the model's limitations, intended usage, and known security risks to all end users.
- Schedule periodic security reviews and updates to address emerging threats and vulnerabilities.
- Follow established security policies and usage guidelines provided by deployers.
- Protect and manage sensitive information when interacting with the model.
- Report anomalies, suspicious behavior, or unsafe outputs to deployers or developers.
- Maintain human oversight and apply judgment to mitigate potential security or ethical risks during interactions.

Disclaimer: Users accept responsibility for securely deploying, managing, and using this open-source LLM. The model is provided "as is," without explicit or implied warranty regarding security or fitness for any specific application or environment.
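The server-launch snippet mentioned in the card above is missing from this copy. A hedged sketch of launching a vLLM OpenAI-compatible server and querying it follows; the checkpoint name, host, and port are illustrative assumptions.

```shell
# Serve an Apriel checkpoint via vLLM's OpenAI-compatible API
# (checkpoint name assumed for illustration).
vllm serve ServiceNow-AI/Apriel-H1-15b-Thinker-SFT \
  --host 0.0.0.0 --port 8000

# In another terminal, query it with the OpenAI-style chat completions API.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "ServiceNow-AI/Apriel-H1-15b-Thinker-SFT",
        "messages": [{"role": "user", "content": "What is 17 * 24?"}],
        "temperature": 0.6
      }'
```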
Apriel-H1-40_50-15b-Thinker
Apriel-H1-25_50-15b-Thinker
Apriel-H1-30_50-15b-Thinker
Apriel-H1-34_50-15b-Thinker
Apriel-H1-27_50-15b-Thinker
A 15B-parameter hybrid reasoning model combining Transformer attention and Mamba State Space layers for high efficiency and scalability. Derived from Apriel-Nemotron-15B-Thinker through progressive distillation, Apriel-H1 replaces less critical attention layers with linear Mamba blocks—achieving over 2× higher inference throughput in vLLM with minimal loss in reasoning, math, and coding performance. - Model Size: 15B parameters - Context Length: 65K (target; runtime dependent) - Languages: English (best) - Hybrid Transformer–SSM architecture - ~2× throughput improvement over the base Thinker model - Retains strong reasoning, math, and coding capabilities - Built via efficient distillation—no training from scratch required Model Overview Apriel-H1-15b-Thinker is designed for agentic tasks, code assistance, and multi-step reasoning. It follows Apriel’s “think then answer” style: the model first produces a hidden chain-of-thought and then a concise final response. Where reasoning traces are undesired, configure prompts to favor concise outputs. All models were evaluated with vllm server endpoints using FlashInfer (except for AI21-Jamba-Reasoning-3B which used FlashAttention2), mambacache was set to fp32 for models: NVIDIA-Nemotron-Nano-9B-v2 and AI21-Jamba-Reasoning-3B. Recommended settings: temperature 0.6; increase `maxnewtokens` for complex reasoning. 1. Create and activate a Python environment You can use any environment manager. The example below uses `uv`: Find our plugin at https://github.com/ServiceNow/apriel. You may need to install a version of vLLM compatible with your CUDA version. In this example, we use the default CUDA version and let vLLM automatically select the correct backend. 
Once installed, you can launch a vLLM OpenAI-compatible API server with your Apriel model: You can run the server directly using the prebuilt container: The model will first generate its thinking process and then generate its final response between `[BEGIN FINAL RESPONSE]` and `[END FINAL RESPONSE]`. Here is a code snippet demonstrating the application of the chat template: Usage Guidelines 1. Use the model’s default chat template, which already includes a system prompt. We recommend adding all other instructions within the user message. 2. We recommend setting temperature to `0.6`. 3. We ensure the model starts with `Here are my reasoning steps:\n` during all our evaluations. This is implemented in the default chat template. The Apriel family of models are designed for a variety of general-purpose instruction tasks, including: - Code assistance and generation - Logical reasoning and multi-step tasks - Question answering and information retrieval - Function calling, complex instruction following and agent use cases They are not intended for use in safety-critical applications without human oversight or in scenarios requiring guaranteed factual accuracy. - Factual accuracy: May produce incorrect, misleading, or outdated content. Outputs should be verified before use in critical contexts. - Bias: May reflect societal, cultural, or systemic biases present in training data. - Ethics: Do not use the model to produce harmful, unlawful, or unethical content. - Language: Strongest performance is in English. Output quality may degrade in underrepresented languages. - Critical use: Not suitable for medical, legal, financial, or other high-risk applications without safeguards. Security Responsibilities: Deployers and users are strongly encouraged to align their security practices with established frameworks and regulatory guidelines such as the EU AI Act and the NIST AI Risk Management Framework (RMF). 
- Regularly conduct robustness assessments to identify and mitigate adversarial inputs.
- Implement validation and filtering processes to prevent harmful or biased outputs.
- Continuously perform data privacy checks to guard against unintended data leaks.
- Document and communicate the model's limitations, intended usage, and known security risks to all end-users.
- Schedule periodic security reviews and updates to address emerging threats and vulnerabilities.
- Follow established security policies and usage guidelines provided by deployers.
- Protect and manage sensitive information when interacting with the model.
- Report anomalies, suspicious behavior, or unsafe outputs to deployers or developers.
- Maintain human oversight and apply judgment to mitigate potential security or ethical risks during interactions.

Disclaimer: Users accept responsibility for securely deploying, managing, and using this open-source LLM. The model is provided "as-is," without explicit or implied warranty regarding security or fitness for any specific application or environment.
Apriel-H1-37_50-15b-Thinker