Mungert
Qwen3.5-122B-A10B-GGUF
Qwen3-Coder-Next-GGUF
C2S-Scale-Gemma-2-27B-GGUF
This model was generated using llama.cpp at commit `03792ad93`.

I've been experimenting with a new quantization approach that selectively raises the precision of key layers beyond what the default IMatrix configuration provides. In my testing, standard IMatrix quantization underperforms at lower bit depths, especially with Mixture of Experts (MoE) models. To address this, I use the `--tensor-type` option in `llama.cpp` to manually "bump" important layers to higher precision. You can see the implementation here: 👉 Layer bumping with llama.cpp. While this does increase model file size, it significantly improves precision for a given quantization level. I'd love your feedback: have you tried this? How does it perform for you?

Click here to get info on choosing the right GGUF model format.

- C2S-Scale Paper: Scaling Large Language Models for Next-Generation Single-Cell Analysis
- HuggingFace C2S Collection: C2S-Scale Models
- GitHub Repository: vandijklab/cell2sentence (code, tutorials, and discussions)
- Google Research Blog Post: Teaching machines the language of biology
- Authors: van Dijk Lab (Yale), Google Research, Google DeepMind

This section describes the C2S-Scale model and how to use it. C2S-Scale-Gemma-27B is a state-of-the-art open language model built on the Gemma-2 27B architecture and fine-tuned for single-cell biology. Developed through the Cell2Sentence (C2S) framework, the model processes and understands single-cell RNA sequencing (scRNA-seq) data by treating it as a language: it converts high-dimensional scRNA-seq expression data into "cell sentences" (ordered sequences of gene names), enabling a wide range of biological analyses. This work is the result of a collaboration between Yale University, Google Research, and Google DeepMind to scale up C2S models. The C2S-Scale models were trained on Google's TPU v5s, which allowed for a significant increase in model size and capability.
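As a concrete illustration of the layer-bumping approach described above, a `llama-quantize` invocation might hold selected tensors at higher precision while quantizing the rest at a low bit depth. The tensor patterns and type choices below are illustrative assumptions, not the exact recipe used for these files; check `llama-quantize --help` in your build for the supported syntax.

```shell
# Sketch: quantize to Q3_K_M overall, but "bump" selected tensors to
# higher-precision types. Patterns and types here are examples only.
./llama-quantize \
    --imatrix imatrix.dat \
    --tensor-type attn_v=q6_k \
    --tensor-type ffn_down=q5_k \
    model-f16.gguf model-Q3_K_M-bumped.gguf Q3_K_M
```

The trade-off is exactly as described above: the bumped tensors add a few percent to file size in exchange for noticeably better fidelity at the same nominal quantization level.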
These models excel at tasks such as cell type prediction, tissue classification, and generating biologically meaningful cell representations.

- Versatility: Demonstrates strong performance across a diverse set of single-cell and multi-cell tasks.
- Scalability: Trained on a massive dataset of over 57 million cells, showcasing the power of scaling LLMs for biological data.
- Generative Power: Capable of generating realistic single-cell gene expression profiles.
- Foundation for Fine-tuning: Can serve as a powerful pretrained foundation for specialized, domain-specific single-cell analysis tasks.

C2S-Scale can be a valuable tool for researchers in the following areas:

- In Silico Experiments: Generate cells under specific conditions or predict perturbational changes to form and test new biological hypotheses.
- Cell Atlas Annotation: Streamline the annotation of large-scale single-cell datasets by predicting cell types and tissues.
- Biomarker Discovery: Analyze gene patterns within cell sentences to identify potential markers for specific cell states or diseases.

Below are code snippets to help you get started running the model locally on a GPU. The model can be used for various tasks, further described in the C2S-Scale paper. To perform cell type prediction, the model expects a prompt containing the cell sentence followed by a query; the resulting prompt is in the format expected by the model for this task. See the Colab notebooks in our GitHub repository (C2S Tutorials) for examples of how to use C2S-Scale models for tasks like cell type prediction and generation.

C2S-Scale is based on the Gemma-2 family of lightweight, state-of-the-art open LLMs, which use a decoder-only transformer architecture. Base model: Gemma-2 27B. Fine-tuning data: a comprehensive collection of over 800 datasets from CellxGene and the Human Cell Atlas, totaling over 57 million human and mouse cells.
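To make the prompt structure concrete, here is a minimal sketch of turning an expression profile into a cell sentence (genes ordered by descending expression) and wrapping it in a cell-type query. The query wording and the `top_k` value are placeholders, not the official C2S template; consult the C2S tutorials for the exact format.

```python
# Sketch: build a "cell sentence" and a cell-type-prediction prompt.
# The rank-ordering convention follows the C2S description above; the
# exact prompt wording is a placeholder, not the official template.

def cell_sentence(expression, top_k=100):
    """Order genes by descending expression and keep the top_k names."""
    ranked = sorted(expression.items(), key=lambda kv: kv[1], reverse=True)
    return " ".join(gene for gene, _ in ranked[:top_k])

def cell_type_prompt(sentence):
    # Placeholder query; see the C2S tutorials for the real format.
    return f"{sentence}\nQuestion: What cell type is this?"

expr = {"CD3D": 9.1, "MALAT1": 12.4, "CD8A": 7.5, "GNLY": 0.2}
sentence = cell_sentence(expr, top_k=3)
print(sentence)  # prints "MALAT1 CD3D CD8A"
print(cell_type_prompt(sentence))
```

The resulting string is what you would pass to the model's tokenizer; the key invariant is the descending-expression ordering of gene names.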
Training approach: instruction fine-tuning using the Cell2Sentence framework, which converts scRNA-seq expression data into sequences of gene tokens. Model type: decoder-only transformer (based on Gemma-2). Key publication: Scaling Large Language Models for Next-Generation Single-Cell Analysis.

The performance of C2S-Scale models was validated on a wide range of single-cell and multi-cell tasks, including advanced downstream tasks such as cluster captioning, question answering, and perturbation prediction. C2S-Scale models demonstrated significant improvements over other open and closed-source models, establishing new state-of-the-art benchmarks for LLMs in single-cell biology. Please see our preprint for a full breakdown of performance metrics.

- Input: Text. For best performance, prompts should be structured according to the specific task (e.g., cell type prediction, conditioned generation). Inputs are "cell sentences": ordered, space-separated lists of gene names.
- Output: Text. The model generates text as a response, which can be a predicted label (such as a cell type or tissue), a full cell sentence, or a natural-language abstract.

CellxGene and Human Cell Atlas: the model was trained on a curated collection of over 800 public scRNA-seq datasets, encompassing more than 57 million cells. This data covers a broad range of tissues, cell types, and experimental conditions from both human and mouse, ensuring the model learns a robust and generalizable representation of cellular states. Evaluation was performed using held-out datasets and standardized benchmarks designed to test the model's capabilities on the tasks listed above; all evaluation methodologies followed established best practices for splitting data to ensure robust, unbiased assessment. The model weights shared on Hugging Face are licensed CC BY 4.0. The model was trained using JAX on Google TPU v5 hardware for efficient, large-scale training. Intended uses include research in single-cell genomics and computational biology.
Further uses: as a foundational model for fine-tuning on specific biological domains or datasets, and to aid in the annotation and interpretation of large-scale scRNA-seq experiments. C2S-Scale provides a powerful, versatile, and scalable tool for single-cell analysis. It offers:

- State-of-the-art performance on a wide range of scRNA-seq tasks.
- A unified framework for handling diverse single-cell analysis challenges.
- A foundation for building more specialized models from private or proprietary data.
- The ability to perform in silico generation of cellular data to explore biological hypotheses.

The model is trained on public data, and its knowledge is limited to the genes, cell types, and conditions present in that data. Performance on out-of-distribution data (e.g., completely novel cell types or technologies) is not guaranteed and requires validation. Performance on input prompt formats that deviate greatly from the training prompt formatting is also not guaranteed.

C2S-Scale links:
- Paper: Scaling Large Language Models for Next-Generation Single-Cell Analysis
- Google Research Blog Post: Teaching machines the language of biology: Scaling large language models for next-generation single-cell analysis
- GitHub: https://github.com/vandijklab/cell2sentence (Note: the codebase has a CC BY-NC-ND 4.0 license; only the weights shared on Hugging Face are CC BY 4.0)

Gemma-2 links:
- HuggingFace: https://huggingface.co/google/gemma-2-27b
- Gemma-2 Blog Post: Gemma explained: What's new in Gemma 2
- Technical report: https://storage.googleapis.com/deepmind-media/gemma/gemma-2-report.pdf

Help me test my AI-Powered Quantum Network Monitor Assistant with quantum-ready security checks. The full open-source code for the Quantum Network Monitor Service is available in my GitHub repos (repos with NetworkMonitor in the name): Source Code Quantum Network Monitor.
You will also find the code I use to quantize the models, if you want to do it yourself: GGUFModelBuilder.

💬 How to test: choose an AI assistant type:
- `TurboLLM` (GPT-4.1-mini)
- `HugLLM` (Hugging Face open-source models)
- `TestLLM` (experimental CPU-only)

What I'm testing: I'm pushing the limits of small open-source models for AI network monitoring, specifically:
- Function calling against live network services
- How small a model can go while still handling:
  - Automated Nmap security scans
  - Quantum-readiness checks
  - Network monitoring tasks

🟡 TestLLM – current experimental model (llama.cpp on 2 CPU threads on a Hugging Face Docker space):
- ✅ Zero-configuration setup
- ⏳ 30s load time (slow inference, but no API costs). No token limit, as the cost is low.
- 🔧 Help wanted! If you're into edge-device AI, let's collaborate!

Other assistants:

🟢 TurboLLM – uses GPT-4.1-mini:
- It performs very well, but unfortunately OpenAI charges per token, so token usage is limited.
- Create custom cmd processors to run .NET code on Quantum Network Monitor Agents
- Real-time network diagnostics and monitoring
- Security audits
- Penetration testing (Nmap/Metasploit)

🔵 HugLLM – latest open-source models:
- 🌐 Runs on the Hugging Face Inference API. Performs pretty well using the latest models hosted on Novita.

💡 Example commands you could test:
1. `"Give me info on my website's SSL certificate"`
2. `"Check if my server is using quantum-safe encryption for communication"`
3. `"Run a comprehensive security audit on my server"`
4. `"Create a cmd processor to .. (whatever you want)"` (Note: you need to install a Quantum Network Monitor Agent to run the .NET code on. This is a very flexible and powerful feature; use with caution!)

I fund the servers used to create these model files, run the Quantum Network Monitor service, and pay for inference from Novita and OpenAI, all out of my own pocket. All the code behind the model creation and the Quantum Network Monitor project is open source.
Feel free to use whatever you find helpful. If you appreciate the work, please consider buying me a coffee ☕. Your support helps cover service costs and allows me to raise token limits for everyone.
NVIDIA-Nemotron-3-Super-120B-A12B-BF16-GGUF
Tongyi-DeepResearch-30B-A3B-GGUF
This model was generated using llama.cpp at commit `a2054e3a8`, using the same selective layer-precision ("layer bumping") quantization approach described above.

We present Tongyi DeepResearch, an agentic large language model featuring 30 billion total parameters, with only 3 billion activated per token. Developed by Tongyi Lab, the model is specifically designed for long-horizon, deep information-seeking tasks. Tongyi-DeepResearch demonstrates state-of-the-art performance across a range of agentic search benchmarks, including Humanity's Last Exam, BrowserComp, BrowserComp-ZH, WebWalkerQA, GAIA, xbench-DeepSearch, and FRAMES.

- ⚙️ Fully automated synthetic data generation pipeline: We design a highly scalable data synthesis pipeline that is fully automatic and powers agentic pre-training, supervised fine-tuning, and reinforcement learning.
- 🔄 Large-scale continual pre-training on agentic data: Leveraging diverse, high-quality agentic interaction data to extend model capabilities, maintain freshness, and strengthen reasoning performance.
- 🔁 End-to-end reinforcement learning: We employ a strictly on-policy RL approach based on a customized Group Relative Policy Optimization framework, with token-level policy gradients, leave-one-out advantage estimation, and selective filtering of negative samples to stabilize training in a non-stationary environment.
- 🤖 Agent inference paradigm compatibility: At inference time, Tongyi-DeepResearch supports two paradigms: ReAct, for rigorously evaluating the model's core intrinsic abilities, and an IterResearch-based "Heavy" mode, which uses a test-time scaling strategy to unlock the model's maximum performance ceiling.

You can download the model and then run the inference scripts in https://github.com/Alibaba-NLP/DeepResearch.
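The leave-one-out advantage estimation mentioned in the RL bullet above can be sketched in a few lines: each sampled trajectory's baseline is the mean reward of the *other* samples in its group. This is an illustrative sketch of the general technique, not the authors' exact implementation.

```python
# Leave-one-out (LOO) advantage estimation for a group of sampled
# rewards: advantage_i = reward_i - mean(rewards excluding reward_i).
# Requires at least two samples per group.

def loo_advantages(rewards):
    n = len(rewards)
    total = sum(rewards)
    # (total - r) / (n - 1) is the mean of all rewards except r.
    return [r - (total - r) / (n - 1) for r in rewards]

advs = loo_advantages([1.0, 2.0, 3.0])
print(advs)  # [-1.5, 0.0, 1.5]
```

Compared with using the full-group mean, the leave-one-out baseline keeps each sample's baseline independent of its own reward, which reduces bias in the per-sample gradient estimate.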
Glyph-GGUF
This model was generated using llama.cpp at commit `16724b5b6`, using the same selective layer-precision ("layer bumping") quantization approach described above.

Glyph: Scaling Context Windows via Visual-Text Compression
- Repository: https://github.com/thu-coai/Glyph
- Paper: https://arxiv.org/abs/2510.17800

Glyph is a framework for scaling context length through visual-text compression. Instead of extending token-based context windows, Glyph renders long textual sequences into images and processes them with vision-language models (VLMs). This design transforms the challenge of long-context modeling into a multimodal problem, substantially reducing computational and memory costs while preserving semantic information. A simple example of single-image inference uses the `transformers` library; first, install `transformers`.

Known limitations:
- Sensitivity to rendering parameters: Glyph's performance can vary with rendering settings such as resolution, font, and spacing. Since our search procedure adopts a fixed rendering configuration during post-training, the model may not generalize well to unseen or substantially different rendering styles.
- OCR-related challenges: Recognizing fine-grained or rare alphanumeric strings (e.g., UUIDs) remains difficult for vision-language models, especially with ultra-long inputs, sometimes leading to minor character misclassification.
- Limited generalization: Training of Glyph mainly targets long-context understanding, and its capability on broader tasks has yet to be studied.

Citation: If you find our model useful in your work, please cite it.
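The rendering step itself needs a rasterizer, but Glyph's core compression idea (pack many text tokens into each image "page" before handing pages to a VLM) can be sketched with plain string slicing. The page-size numbers below are arbitrary examples, not Glyph's tuned rendering configuration.

```python
# Minimal sketch of visual-text compression pre-processing: split a
# long text into fixed-size "pages", each of which would be rendered
# to one image for the VLM. Font/DPI/spacing rendering is omitted.

def paginate(text, chars_per_line=80, lines_per_page=40):
    lines = [text[i:i + chars_per_line] for i in range(0, len(text), chars_per_line)]
    pages = [lines[i:i + lines_per_page] for i in range(0, len(lines), lines_per_page)]
    return ["\n".join(p) for p in pages]

doc = "x" * 10_000
pages = paginate(doc)
print(len(pages))  # 10000 chars / (80 * 40 chars per page) -> 4 pages
```

Each page would then be rendered to an image, so the VLM's visual tokens per page (rather than one text token per word) determine the effective context compression ratio.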
MiroThinker-v1.5-30B-GGUF
agentflow-planner-7b-GGUF
This model was generated using llama.cpp at commit `03792ad93`, using the same selective layer-precision ("layer bumping") quantization approach described above.

AgentFlow Planner Agent 7B checkpoint (built upon Qwen2.5-7B-Instruct):
- HF Paper: https://huggingface.co/papers/date/2025-10-08
- Code: https://github.com/lupantech/AgentFlow
- Demo: https://huggingface.co/spaces/AgentFlow/agentflow
- Website: https://agentflow.stanford.edu/
- YouTube: https://www.youtube.com/watch?v=kIQbCQIH1SI
- X (Twitter): https://x.com/lupantech/status/1976016000345919803
aquif-3.5-Max-42B-A3B-GGUF
This model was generated using llama.cpp at commit `7f09a680a`, using the same selective layer-precision ("layer bumping") quantization approach described above.

The pinnacle of the aquif-3.5 series, released November 3rd, 2025. These models bring advanced reasoning capabilities and unprecedented context windows, achieving state-of-the-art performance for their respective categories. aquif-3.5-Plus combines hybrid reasoning with interchangeable thinking modes, offering flexibility for both speed-optimized and reasoning-intensive applications. aquif-3.5-Max represents frontier model capabilities with a reasoning-only architecture, delivering exceptional performance across all benchmark categories.

| Model | HuggingFace Repository |
|-------|------------------------|
| aquif-3.5-Plus | aquiffoo/aquif-3.5-Plus |
| aquif-3.5-Max | aquiffoo/aquif-3.5-Max |

| Model | Total (B) | Active Params (B) | Reasoning | Context Window | Thinking Modes |
|-------|-----------|-------------------|-----------|----------------|----------------|
| aquif-3.5-Plus | 30.5 | 3.3 | ✅ Hybrid | 1M | ✅ Interchangeable |
| aquif-3.5-Max | 42.4 | 3.3 | ✅ Reasoning-Only | 1M | Reasoning-Only |

aquif-3.5-Plus (hybrid reasoning with interchangeable modes): a breakthrough hybrid reasoning model offering unprecedented flexibility.
Toggle between thinking and non-thinking modes to optimize for your specific use case: maintain reasoning capabilities when needed, or prioritize speed for time-sensitive applications.

Artificial Analysis Intelligence Index (AAII) benchmarks:

| Benchmark | aquif-3.5-Plus (Non-Reasoning) | aquif-3.5-Plus (Reasoning) | aquif-3.5-Max |
|-----------|--------------------------------|----------------------------|---------------|
| MMLU-Pro | 80.2 | 82.8 | 85.4 |
| GPQA Diamond | 72.1 | 79.7 | 83.2 |
| AIME 2025 | 64.7 | 90.3 | 94.6 |
| LiveCodeBench | 50.5 | 76.4 | 81.6 |
| Humanity's Last Exam | 4.3 | 12.1 | 15.6 |
| TAU2-Telecom | 34.2 | 41.5 | 51.3 |
| IFBench | 39.3 | 54.3 | 65.4 |
| TerminalBench-Hard | 10.1 | 15.2 | 23.9 |
| AA-LCR | 30.4 | 59.9 | 61.2 |
| SciCode | 29.5 | 35.7 | 40.9 |
| AAII Composite Score | 42 (41.53) | 55 (54.79) | 60 (60.31) |

| Model | AAII Score |
|-------|------------|
| GPT-5 mini | 42 |
| Claude Haiku 4.5 | 42 |
| Gemini 2.5 Flash Lite 2509 | 42 |
| aquif-3.5-Plus (Non-Reasoning) | 42 |
| DeepSeek V3 0324 | 41 |
| Qwen3 VL 32B Instruct | 41 |
| Qwen3 Coder 480B A35B | 42 |

| Model | AAII Score |
|-------|------------|
| GLM-4.6 | 56 |
| Gemini 2.5 Flash 2509 | 54 |
| Claude Haiku 4.5 | 55 |
| aquif-3.5-Plus (Reasoning) | 55 |
| Qwen3 Next 80B A3B | 54 |

| Model | AAII Score |
|-------|------------|
| Gemini 2.5 Pro | 60 |
| Grok 4 Fast | 60 |
| aquif-3.5-Max | 60 |
| MiniMax-M2 | 61 |
| gpt-oss-120B high | 61 |
| GPT-5 mini | 61 |
| DeepSeek-V3.1-Terminus | 58 |
| Claude Opus 4.1 | 59 |

Massive context windows: both models support up to 1M tokens, enabling analysis of entire codebases, research papers, and extensive conversation histories without truncation. Efficient architecture: despite offering frontier-level performance, both models maintain exceptional efficiency through an optimized mixture-of-experts design and an active parameter count of just 3.3B.
Flexible reasoning (Plus only): aquif-3.5-Plus provides interchangeable thinking modes; enable reasoning for complex problems, or disable it for faster inference on straightforward tasks. Multilingual support: native support across English, German, Italian, Portuguese, French, Hindi, Spanish, Thai, Chinese, and Japanese.

aquif-3.5-Plus:
- Complex reasoning requiring flexibility between speed and depth
- Scientific analysis and mathematical problem-solving with thinking enabled
- Rapid-response applications with thinking disabled
- Code generation and review
- Multilingual applications up to 1M-token contexts

aquif-3.5-Max:
- Frontier-level problem-solving without compromise
- Advanced research and scientific computing
- Competition mathematics and algorithmic challenges
- Comprehensive code analysis and generation
- Complex multilingual tasks requiring maximum reasoning capability

Toggle between thinking and non-thinking modes by modifying the chat template: simply set the variable in your chat template before inference to switch modes. No model reloading is required.

Both models support:
- BF16 and FP16 precision
- Mixture-of-experts architecture optimizations
- Efficient attention mechanisms with optimized KV caching
- Up to 1M-token context window
- Multi-head attention with sparse routing

aquif-3.5-Plus achieves 82.3% average benchmark performance in thinking mode, surpassing models with 2-4x more total parameters; non-thinking mode maintains a competitive 66.9% for latency-sensitive applications. aquif-3.5-Max reaches 86.2% average performance, matching or exceeding frontier models while keeping just 42.4B total parameters, an extraordinary efficiency result.

Acknowledgements:
- Qwen Team: base architecture contributions
- Meta Llama Team: core model foundations
- Hugging Face: model hosting and training infrastructure

This project is released under the Apache 2.0 License. See the LICENSE file for details.
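As a toy illustration of a template-level mode switch: the real mechanism lives in the model's Jinja chat template, and both the variable name and the marker strings below are assumptions for illustration, not the model's actual special tokens. Check the bundled chat template for the real switch.

```python
# Hypothetical sketch of a thinking-mode toggle applied at prompt
# render time. The "/think" and "/no_think" markers and the
# enable_thinking flag are illustrative placeholders only.

def render_prompt(messages, enable_thinking=True):
    tag = "/think" if enable_thinking else "/no_think"
    system = f"<|system|>{tag}"
    turns = "".join(f"<|{m['role']}|>{m['content']}" for m in messages)
    return system + turns

msgs = [{"role": "user", "content": "Solve 17 * 24."}]
print(render_prompt(msgs, enable_thinking=False))
```

The point is that the switch happens entirely in the rendered prompt, which is why no model reloading is required.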
Olmo-3-1125-32B-GGUF
Qwen3.5-27B-GGUF
gpt-oss-safeguard-20b-GGUF
pokee_research_7b-GGUF
Qwen2.5-VL-7B-Instruct-GGUF
Qwen2.5-VL-3B-Instruct-GGUF
SWE-agent-LM-32B-GGUF
This model was generated using llama.cpp at commit `064cc596`. SWE-agent-LM-32B is a language model for software engineering, trained using the SWE-smith toolkit. We introduce this model as part of our work SWE-smith: Scaling Data for Software Engineering Agents. SWE-agent-LM-32B is 100% open source. Training this model was simple: we fine-tuned Qwen 2.5 Coder Instruct on 5k trajectories generated by SWE-agent + Claude 3.7 Sonnet. The dataset can be found here. SWE-agent-LM-32B is compatible with SWE-agent, and running this model locally only takes a few steps! Check [here]() for more instructions on how to do so. If you found this work exciting and want to push SWE-agents further, please feel free to connect with us (the SWE-bench team)!
Fathom-Search-4B-GGUF
Orchestrator-8B-GGUF
Nanonets-OCR2-3B-GGUF
This model was generated using llama.cpp at commit `3cfa9c3f1`.

Nanonets-OCR2: a model for transforming documents into structured markdown with intelligent content recognition and semantic tagging. Nanonets-OCR2 by Nanonets is a family of powerful, state-of-the-art image-to-markdown OCR models that go far beyond traditional text extraction, making the output ideal for downstream processing by Large Language Models (LLMs). Nanonets-OCR2 is packed with features designed to handle complex documents with ease:

- LaTeX Equation Recognition: automatically converts mathematical equations and formulas into properly formatted LaTeX syntax, distinguishing between inline (`$...$`) and display (`$$...$$`) equations.
- Intelligent Image Description: describes images within documents using structured ` ` tags, making them digestible for LLM processing. It can describe various image types, including logos, charts, and graphs, detailing their content, style, and context.
- Signature Detection & Isolation: identifies and isolates signatures from other text, outputting them within a ` ` tag. This is crucial for processing legal and business documents.
- Watermark Extraction: detects and extracts watermark text from documents, placing it within a ` ` tag.
- Smart Checkbox Handling: converts form checkboxes and radio buttons into standardized Unicode symbols (`☐`, `☑`, `☒`) for consistent and reliable processing.
- Complex Table Extraction: accurately extracts complex tables from documents and converts them into both markdown and HTML table formats.
- Flow Charts & Organisational Charts: extracts flow charts and organisational charts as Mermaid code.
- Handwritten Documents: the model is trained on handwritten documents across multiple languages.
- Multilingual: the model is trained on documents in multiple languages, including English, Chinese, French, Spanish, Portuguese, German, Italian, Russian, Japanese, Korean, Arabic, and many more.
- Visual Question Answering (VQA): the model is designed to provide the answer directly if it is present in the document; otherwise, it responds with "Not mentioned."

Nanonets-OCR2 Family

| Model | Access Link |
| ----- | ----- |
| Nanonets-OCR2-Plus | Docstrange link |
| Nanonets-OCR2-3B | 🤗 link |
| Nanonets-OCR2-1.5B-exp | 🤗 link |

(Benchmark comparison: win/lose/both-correct rates vs Nanonets-OCR2-Plus and Nanonets-OCR2-3B across datasets, compared with Qwen2.5-VL-72B-Instruct and Gemini 2.5 Flash; the table data did not survive extraction.)

Tips to improve accuracy:
1. Increasing the image resolution will improve the model's performance.
2. For complex tables (e.g. financial documents), using `repetition_penalty=1` gives better results. You can also try this prompt, which generally works better for financial documents.
3. This is already implemented in Docstrange; please use the `Markdown (Financial Docs)` option for processing table-heavy financial documents.
4. The model might work best at a certain resolution for specific document types. Please check the cookbooks for details.
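The standardized checkbox symbols described above make downstream parsing trivial. As a minimal illustration (not part of Nanonets' tooling), here is a sketch that turns checkbox lines in the OCR markdown into booleans:

```python
# Map the standardized checkbox symbols Nanonets-OCR2 emits to booleans.
# ☐ = unchecked; ☑ and ☒ both indicate a marked box.
CHECKBOX_STATES = {"☐": False, "☑": True, "☒": True}

def extract_checkboxes(markdown: str) -> list[tuple[str, bool]]:
    """Return (label, checked) pairs for lines that start with a checkbox symbol."""
    results = []
    for line in markdown.splitlines():
        line = line.strip()
        if line and line[0] in CHECKBOX_STATES:
            results.append((line[1:].strip(), CHECKBOX_STATES[line[0]]))
    return results

doc = """☑ Applicant signed
☐ Co-signer required
☒ Terms accepted"""
print(extract_checkboxes(doc))
# [('Applicant signed', True), ('Co-signer required', False), ('Terms accepted', True)]
```

Because the symbols are consistent Unicode codepoints, this kind of post-processing needs no OCR-specific heuristics.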
olmOCR-2-7B-1025-GGUF
Dolphin-Mistral-24B-Venice-Edition-GGUF
This model was generated using llama.cpp at commit `221c0e0c`.

Discord: https://discord.gg/h3K4XGj2RH Website: https://dphn.ai Twitter: https://x.com/dphnAI

Dolphin Mistral 24B Venice Edition is a collaborative project we undertook with Venice.ai with the goal of creating the most uncensored version of Mistral 24B for use within the Venice ecosystem. Dolphin Mistral 24B Venice Edition is now live on https://venice.ai/ as "Venice Uncensored," the new default model for all Venice users.

Dolphin aims to be a general-purpose model, similar to the models behind ChatGPT, Claude, and Gemini. But those models present problems for businesses seeking to include AI in their products:
1) They maintain control of the system prompt, deprecating and changing things as they wish, often causing software to break.
2) They maintain control of the model versions, sometimes changing things silently or deprecating older models that your business relies on.
3) They maintain control of the alignment, and in particular the alignment is one-size-fits-all, not tailored to the application.
4) They can see all your queries and can potentially use that data in ways you wouldn't want.
Dolphin, in contrast, is steerable and gives control to the system owner. You set the system prompt. You decide the alignment. You have control of your data. Dolphin does not impose its ethics or guidelines on you; you are the one who decides the guidelines. Dolphin belongs to YOU; it is your tool, an extension of your will. Just as you are personally responsible for what you do with a knife, gun, fire, car, or the internet, you are the creator and originator of any content you generate with Dolphin.

We maintained Mistral's default chat template for this model. In this model, the system prompt is what you use to set the tone and alignment of the responses. You can set a character, a mood, or rules for its behavior, and it will try its best to follow them. Make sure to set the system prompt in order to set the tone and guidelines for the responses; otherwise, it will act in a default way that might not be what you want. Example of the system prompt we used to get the model as uncensored as possible:

Note: we recommend using a relatively low temperature, such as `temperature=0.15`.

There are many ways to use a Hugging Face model, including:
- ollama
- LM Studio
- Hugging Face Transformers library
- vllm
- sglang
- tgi

The model can be used with the following frameworks:
- `vllm`: see here
- `transformers`: see here

We recommend using this model with the vLLM library to implement production-ready inference pipelines. Also make sure you have `mistral_common >= 1.5.2` installed. You can also use the ready-to-go Docker image on Docker Hub.
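Since vLLM exposes an OpenAI-compatible endpoint, setting the system prompt and the recommended low temperature comes down to the request payload. A minimal sketch (the model name and system prompt below are illustrative placeholders, not official values):

```python
import json

def build_chat_request(system_prompt: str, user_message: str) -> dict:
    """Assemble an OpenAI-compatible chat payload for a local vLLM server.

    The system prompt is where you set the tone and alignment, per the
    model card; temperature=0.15 follows the card's recommendation.
    """
    return {
        "model": "dolphin-mistral-24b-venice-edition",  # placeholder name
        "temperature": 0.15,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message},
        ],
    }

req = build_chat_request("You are Dolphin, a helpful assistant.", "Hello!")
print(json.dumps(req, indent=2))
```

The same payload shape works against any of the OpenAI-compatible servers in the framework list above.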
Qwen3-VL-8B-Instruct-GGUF
Foundation-Sec-8B-Instruct-GGUF
Hunyuan-MT-7B-GGUF
This model was generated using llama.cpp at commit `c8dedc99`.

🤗 Hugging Face | 🕹️ Demo | 🤖 ModelScope | 🖥️ Official Website | GitHub | Technical Report

The Hunyuan Translation Model comprises a translation model, Hunyuan-MT-7B, and an ensemble model, Hunyuan-MT-Chimera. The translation model translates source text into the target language, while the ensemble model integrates multiple translation outputs to produce a higher-quality result. It primarily supports mutual translation among 33 languages, including five ethnic minority languages of China.

- In the WMT25 competition, the model achieved first place in 30 of the 31 language categories it participated in.
- Hunyuan-MT-7B achieves industry-leading performance among models of comparable scale.
- Hunyuan-MT-Chimera-7B is the industry's first open-source translation ensemble model, elevating translation quality to a new level.
- A comprehensive training framework for translation models has been proposed, spanning pretraining → cross-lingual pretraining (CPT) → supervised fine-tuning (SFT) → translation enhancement → ensemble refinement, achieving state-of-the-art (SOTA) results for models of similar size.

Related News
2025.9.1: We have open-sourced Hunyuan-MT-7B and Hunyuan-MT-Chimera-7B on Hugging Face.

Model Links

| Model Name | Description | Download |
| ----------- | ----------- | ----------- |
| Hunyuan-MT-7B | Hunyuan 7B translation model | 🤗 Model |
| Hunyuan-MT-7B-fp8 | Hunyuan 7B translation model, FP8 quant | 🤗 Model |
| Hunyuan-MT-Chimera | Hunyuan 7B translation ensemble model | 🤗 Model |
| Hunyuan-MT-Chimera-fp8 | Hunyuan 7B translation ensemble model, FP8 quant | 🤗 Model |

Prompt template for XX↔XX translation, excluding ZH↔XX.

Use with transformers: first, please install transformers (v4.56.0 recommended). The following code snippet shows how to use the transformers library to load and apply the model. Note: if you want to load the FP8 model with transformers, you need to change the name `ignored_layers` in config.json to `ignore` and upgrade compressed-tensors to compressed-tensors-0.11.0. We recommend using the following set of parameters for inference. Note that our model does not have a default system prompt.

Supported languages:

| Languages | Abbr. | Chinese Names |
| ------- | ----- | ------- |
| Chinese | zh | 中文 |
| English | en | 英语 |
| French | fr | 法语 |
| Portuguese | pt | 葡萄牙语 |
| Spanish | es | 西班牙语 |
| Japanese | ja | 日语 |
| Turkish | tr | 土耳其语 |
| Russian | ru | 俄语 |
| Arabic | ar | 阿拉伯语 |
| Korean | ko | 韩语 |
| Thai | th | 泰语 |
| Italian | it | 意大利语 |
| German | de | 德语 |
| Vietnamese | vi | 越南语 |
| Malay | ms | 马来语 |
| Indonesian | id | 印尼语 |
| Filipino | tl | 菲律宾语 |
| Hindi | hi | 印地语 |
| Traditional Chinese | zh-Hant | 繁体中文 |
| Polish | pl | 波兰语 |
| Czech | cs | 捷克语 |
| Dutch | nl | 荷兰语 |
| Khmer | km | 高棉语 |
| Burmese | my | 缅甸语 |
| Persian | fa | 波斯语 |
| Gujarati | gu | 古吉拉特语 |
| Urdu | ur | 乌尔都语 |
| Telugu | te | 泰卢固语 |
| Marathi | mr | 马拉地语 |
| Hebrew | he | 希伯来语 |
| Bengali | bn | 孟加拉语 |
| Tamil | ta | 泰米尔语 |
| Ukrainian | uk | 乌克兰语 |
| Tibetan | bo | 藏语 |
| Kazakh | kk | 哈萨克语 |
| Mongolian | mn | 蒙古语 |
| Uyghur | ug | 维吾尔语 |
| Cantonese | yue | 粤语 |
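Before building a translation prompt, it is worth validating the language pair against the supported-language table. A small sketch (the dictionary below is a hand-copied subset of the table, not an official API; extend it as needed):

```python
# Subset of the supported-language table: abbreviation -> English name.
SUPPORTED = {
    "zh": "Chinese", "en": "English", "fr": "French", "pt": "Portuguese",
    "es": "Spanish", "ja": "Japanese", "de": "German", "ko": "Korean",
    "yue": "Cantonese", "bo": "Tibetan", "ug": "Uyghur", "zh-Hant": "Traditional Chinese",
}

def check_pair(src: str, tgt: str) -> str:
    """Validate a translation pair against the supported-language list."""
    for code in (src, tgt):
        if code not in SUPPORTED:
            raise ValueError(f"unsupported language code: {code}")
    return f"{SUPPORTED[src]} -> {SUPPORTED[tgt]}"

print(check_pair("en", "zh"))  # English -> Chinese
```

The actual prompt template differs for ZH↔XX versus other pairs, so the validated pair determines which template to apply (see the original card for the templates themselves).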
GELab-Zero-4B-preview-GGUF
gemma-3-1b-it-gguf
granite-4.0-micro-base-GGUF
This model was generated using llama.cpp at commit `03792ad93`.

Model Summary: Granite-4.0-Micro-Base is a decoder-only, long-context language model designed for a wide range of text-to-text generation tasks. It also supports Fill-in-the-Middle (FIM) code completion through the use of specialized prefix and suffix tokens. The model is trained from scratch on approximately 15 trillion tokens following a four-stage training strategy: 10 trillion tokens in the first stage, 2 trillion in the second, another 2 trillion in the third, and 0.5 trillion in the final stage.

- Developers: Granite Team, IBM
- HF Collection: Granite 4.0 Language Models HF Collection
- GitHub Repository: ibm-granite/granite-4.0-language-models
- Website: Granite Docs
- Release Date: October 2nd, 2025
- License: Apache 2.0

Supported Languages: English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese. Users may fine-tune Granite 4.0 models for languages beyond these.

Intended Use: prominent use cases of LLMs in text-to-text generation include summarization, text classification, extraction, question answering, code completion (including FIM), and long-context generation tasks.
All Granite Base models can handle these tasks, as they were trained on a large amount of data from various domains. Moreover, they can serve as baselines for creating specialized models for specific application scenarios.

Generation: this is a simple example of how to use the Granite-4.0-Micro-Base model. Then, copy the code snippet below to run the example.

Benchmarks

| Metric | Micro Dense | H Micro Dense | H Tiny MoE | H Small MoE |
| ------ | ------ | ------ | ------ | ------ |
| HumanEval pass@1 [StarCoder Prompt] | 76.19 | 73.72 | 77.59 | 83.66 |

Multilingual benchmarks and the included languages:

| Benchmark | # Langs | Languages |
| ------ | ------ | ------ |
| MMMLU | 11 | ar, de, en, es, fr, ja, ko, pt, zh, bn, hi |
| INCLUDE | 14 | hi, bn, ta, te, ar, de, es, fr, it, ja, ko, nl, pt, zh |

Granite-4.0-Micro-Base is based on a decoder-only dense transformer architecture. Core components of this architecture are: GQA, RoPE, MLP with SwiGLU, RMSNorm, and shared input/output embeddings.

| Model | Micro Dense | H Micro Dense | H Tiny MoE | H Small MoE |
| ------ | ------ | ------ | ------ | ------ |
| Number of layers | 40 attention | 4 attention / 36 Mamba2 | 4 attention / 36 Mamba2 | 4 attention / 36 Mamba2 |
| MLP / Shared expert hidden size | 8192 | 8192 | 1024 | 1536 |

Training Data: this model is trained on a mix of open-source and proprietary data following a four-stage training strategy (token counts in trillions):

| Stage | Characteristics | Micro Dense | H Micro Dense | H Tiny MoE | H Small MoE |
| ------ | ------ | ------ | ------ | ------ | ------ |
| I | General mixture of training data, warmup, and power scheduler for learning rate. | 10 | 10 | 15 | 15 |
| II | General mixture of training data with higher percentages of code and math, with power scheduler for learning rate. | 2 | 5 | 5 | 5 |
| III | High-quality training data, exponential decay of learning rate. | 2 | 2 | 2 | 2 |
| IV | High-quality training data, linear decay to zero for learning rate. | 0.5 | 0.5 | 0.5 | 0.5 |

Infrastructure: We trained the Granite 4.0 Language Models utilizing an NVIDIA GB200 NVL72 cluster hosted in CoreWeave.
Intra-rack communication occurs via the 72-GPU NVLink domain, and a non-blocking, full Fat-Tree NDR 400 Gb/s InfiniBand network provides inter-rack communication. This cluster provides a scalable and efficient infrastructure for training our models over thousands of GPUs.

Ethical Considerations and Limitations: the use of Large Language Models involves risks and ethical considerations people must be aware of, including but not limited to: bias and fairness, misinformation, and autonomous decision-making. The Granite-4.0-Micro-Base model is no exception in this regard. Even though this model is suited for multiple generative AI tasks, it has not undergone any safety alignment, so it may produce problematic outputs. Additionally, it remains uncertain whether smaller models might exhibit increased susceptibility to hallucination in generation scenarios by copying text verbatim from the training dataset, due to their reduced sizes and memorization capacities. This aspect is currently an active area of research, and we anticipate more rigorous exploration, comprehension, and mitigation in this domain. Regarding ethics, a latent risk associated with all Large Language Models is their malicious utilization. We urge the community to use the Granite-4.0-Micro-Base model with ethical intentions and in a responsible way.

Resources
- ⭐️ Learn about the latest updates with Granite: https://www.ibm.com/granite
- 📄 Get started with tutorials, best practices, and prompt engineering advice: https://www.ibm.com/granite/docs/
- 💡 Learn about the latest Granite learning resources: https://github.com/ibm-granite-community/
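The card's summary mentions FIM code completion via specialized prefix and suffix tokens. As a minimal sketch of how such a prompt is assembled (the token names below are assumptions modeled on common FIM conventions, not confirmed Granite tokens; check the model's tokenizer config for the exact special tokens):

```python
# Assumed FIM sentinel tokens -- verify against the actual tokenizer config.
FIM_PREFIX, FIM_SUFFIX, FIM_MIDDLE = "<fim_prefix>", "<fim_suffix>", "<fim_middle>"

def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Assemble a Fill-in-the-Middle prompt; the model generates the middle
    span that belongs between `prefix` and `suffix`."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"

prompt = build_fim_prompt(
    "def add(a, b):\n    return ",      # code before the hole
    "\n\nprint(add(2, 3))",             # code after the hole
)
print(prompt)
```

Generation then continues after the middle sentinel, so the decoded continuation is exactly the missing span.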
Qwen2.5-Omni-7B-GGUF
Dolphin3.0-Llama3.2-3B-GGUF
This model was generated using llama.cpp at commit `3f4fc97f`.

Dolphin 3.0 Llama 3.2 3B 🐬
Part of the Dolphin 3.0 Collection. Curated and trained by Eric Hartford, Ben Gitter, BlouseJury, and Cognitive Computations.

Discord: https://discord.gg/cognitivecomputations

Sponsors
Our appreciation for the generous sponsors of Dolphin 3.0:
- Crusoe Cloud - provided 16x L40S for training and evals
- Akash - provided on-demand 8x H100 for training
- Lazarus - provided 16x H100 for training
- Cerebras - provided excellent and fast inference services for data labeling
- Andreessen Horowitz - provided a grant that made Dolphin 1.0 possible and enabled me to bootstrap my homelab

Dolphin 3.0 is the next generation of the Dolphin series of instruct-tuned models, designed to be the ultimate general-purpose local model, enabling coding, math, agentic, function-calling, and general use cases. Dolphin aims to be a general-purpose model, similar to the models behind ChatGPT, Claude, and Gemini. But those models present problems for businesses seeking to include AI in their products:
1) They maintain control of the system prompt, deprecating and changing things as they wish, often causing software to break.
2) They maintain control of the model versions, sometimes changing things silently or deprecating older models that your business relies on.
3) They maintain control of the alignment, and in particular the alignment is one-size-fits-all, not tailored to the application.
4) They can see all your queries and can potentially use that data in ways you wouldn't want.

Dolphin, in contrast, is steerable and gives control to the system owner. You set the system prompt. You decide the alignment. You have control of your data. Dolphin does not impose its ethics or guidelines on you; you are the one who decides the guidelines. Dolphin belongs to YOU; it is your tool, an extension of your will. Just as you are personally responsible for what you do with a knife, gun, fire, car, or the internet, you are the creator and originator of any content you generate with Dolphin.

In Dolphin, the system prompt is what you use to set the tone and alignment of the responses. You can set a character, a mood, or rules for its behavior, and it will try its best to follow them. Make sure to set the system prompt in order to set the tone and guidelines for the responses; otherwise, it will act in a default way that might not be what you want.

There are many ways to use a Hugging Face model, including:
- ollama
- LM Studio
- Hugging Face Transformers library
- vllm
- sglang
- tgi

Respect and thanks to the creators of the open-source datasets that were used:
- OpenCoder-LLM (opc-sft-stage1, opc-sft-stage2)
- microsoft (orca-agentinstruct-1M-v1, orca-math-word-problems-200k)
- NousResearch (hermes-function-calling-v1)
- AI-MO (NuminaMath-CoT, NuminaMath-TIR)
- allenai (tulu-3-sft-mixture)
- HuggingFaceTB (smoltalk)
- m-a-p (CodeFeedback-Filtered-Instruction, Code-Feedback)

Special thanks to:
- Meta, Qwen, and OpenCoder, who wrote papers and published models that were instrumental in creating Dolphin 3.0.
- RLHFlow, for the excellent reward model used to filter the datasets
- DeepSeek, for the ridiculously fast DeepSeek-V3 that we used to augment the data
(what ever you want)" Note you need to install a Quantum Network Monitor Agent to run the .net code on. This is a very flexible and powerful feature. Use with caution! I fund the servers used to create these model files, run the Quantum Network Monitor service, and pay for inference from Novita and OpenAI—all out of my own pocket. All the code behind the model creation and the Quantum Network Monitor project is open source. Feel free to use whatever you find helpful. If you appreciate the work, please consider buying me a coffee ☕. Your support helps cover service costs and allows me to raise token limits for everyone.
AI21-Jamba-Reasoning-3B-GGUF
UI-TARS-1.5-7B-GGUF
LightOnOCR-1B-1025-GGUF
This model was generated using llama.cpp at commit `16724b5b6`. Click here to get info on choosing the right GGUF model format Full BF16 version of the model. We recommend this variant for inference and further fine-tuning. LightOnOCR-1B is a compact, end-to-end vision–language model for Optical Character Recognition (OCR) and document understanding. It achieves state-of-the-art accuracy in its weight class while being several times faster and cheaper than larger general-purpose VLMs. [](https://colab.research.google.com/#fileId=https%3A//huggingface.co/lightonai/LightOnOCR-1B-1025/blob/main/notebook.ipynb) ⚡ Speed: 5× faster than dots.ocr, 2× faster than PaddleOCR-VL-0.9B, 1.73× faster than DeepSeekOCR 💸 Efficiency: Processes 5.71 pages/s on a single H100 (~493k pages/day)
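The two LightOnOCR efficiency figures are consistent with each other, which is worth a quick arithmetic check:

```python
# Sanity check on the quoted throughput: pages per second -> pages per day.
pages_per_second = 5.71
pages_per_day = pages_per_second * 60 * 60 * 24
print(round(pages_per_day))  # 493344, matching the "~493k pages/day" claim
```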
Dans-PersonalityEngine-V1.3.0-24b-GGUF
GLM-4.7-GGUF
bu-30b-a3b-preview-GGUF
Llama Joycaption Beta One Hf Llava GGUF
Nanonets-OCR2-1.5B-exp-GGUF
This model was generated using llama.cpp at commit `03792ad93`. Click here to get info on choosing the right GGUF model format Nanonets-OCR2: A model for transforming documents into structured markdown with intelligent content recognition and semantic tagging Nanonets-OCR2 by Nanonets is a family of powerful, state-of-the-art image-to-markdown OCR models that go far beyond traditional text extraction. It transforms documents into structured markdown with intelligent content recognition and semantic tagging, making it ideal for downstream processing by Large Language Models (LLMs). Nanonets-OCR2 is packed with features designed to handle complex documents with ease: LaTeX Equation Recognition: Automatically converts mathematical equations and formulas into properly formatted LaTeX syntax. It distinguishes between inline (`$...$`) and display (`$$...$$`) equations. Intelligent Image Description: Describes images within documents using structured ` ` tags, making them digestible for LLM processing. It can describe various image types, including logos, charts, graphs, and so on, detailing their content, style, and context. Signature Detection & Isolation: Identifies and isolates signatures from other text, outputting them within a ` ` tag. This is crucial for processing legal and business documents. Watermark Extraction: Detects and extracts watermark text from documents, placing it within a ` ` tag. Smart Checkbox Handling: Converts form checkboxes and radio buttons into standardized Unicode symbols (`☐`, `☑`, `☒`) for consistent and reliable processing. Complex Table Extraction: Accurately extracts complex tables from documents and converts them into both markdown and HTML table formats. Flow Charts & Organisational Charts: Extracts flow charts and organisational charts as Mermaid code. Handwritten Documents: The model is trained on handwritten documents across multiple languages. 
Multilingual: The model is trained on documents in multiple languages, including English, Chinese, French, Spanish, Portuguese, German, Italian, Russian, Japanese, Korean, Arabic, and many more. Visual Question Answering (VQA): The model is designed to provide the answer directly if it is present in the document; otherwise, it responds with "Not mentioned." Nanonets-OCR2 Family

| Model | Access Link |
| ----- | ----- |
| Nanonets-OCR2-Plus | Docstrange link |
| Nanonets-OCR2-3B | 🤗 link |
| Nanonets-OCR2-1.5B-exp | 🤗 link |

(The original card also includes win/lose-rate benchmark tables against Nanonets OCR2 Plus and Nanonets OCR2 3B, and per-dataset results for Nanonets OCR2 Plus, Nanonets OCR2 3B, Qwen2.5-VL-72B-Instruct, and Gemini 2.5 Flash; the numeric values are not recoverable here.)

Tips to improve accuracy 1. Increasing the image resolution will improve the model's performance. 2. For complex tables (e.g. financial documents), using `repetition_penalty=1` gives better results. You can also try this prompt, which generally works better for financial documents. 3. This is already implemented in Docstrange; please use the `Markdown (Financial Docs)` option for processing table-heavy financial documents.
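Nanonets-OCR2's checkbox convention maps form widgets onto three fixed Unicode symbols; a tiny post-processing helper in the same spirit (my own sketch, not Nanonets code):

```python
# Illustrative helper mirroring the Nanonets-OCR2 checkbox convention:
# unchecked, checked, and crossed boxes become standardized Unicode symbols.
CHECKBOX_SYMBOLS = {
    "unchecked": "\u2610",  # ☐
    "checked": "\u2611",    # ☑
    "crossed": "\u2612",    # ☒
}

def normalize_checkbox(state: str) -> str:
    """Return the standard symbol for a checkbox state."""
    return CHECKBOX_SYMBOLS[state]

print(normalize_checkbox("checked"))  # ☑
```

Using fixed code points keeps downstream LLM prompts and diffs stable, regardless of how the checkbox was drawn in the source document.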
KAT-Dev-72B-Exp-GGUF
Fin-R1-GGUF
SoulX-Podcast-1.7B-GGUF
This model was generated using llama.cpp at commit `16724b5b6`. Click here to get info on choosing the right GGUF model format Official inference code for SoulX-Podcast: Towards Realistic Long-form Podcasts with Dialectal and Paralinguistic Diversity SoulX-Podcast: Towards Realistic Long-form Podcasts with Dialectal and Paralinguistic Diversity Overview SoulX-Podcast is designed for podcast-style multi-turn, multi-speaker dialogic speech generation, while also achieving superior performance in the conventional monologue TTS task. To meet the higher naturalness demands of multi-turn spoken dialogue, SoulX-Podcast integrates a range of paralinguistic controls and supports both Mandarin and English, as well as several Chinese dialects, including Sichuanese, Henanese, and Cantonese, enabling more personalized podcast-style speech generation. - Long-form, multi-turn, multi-speaker dialogic speech generation: SoulX-Podcast excels in generating high-quality, natural-sounding dialogic speech for multi-turn, multi-speaker scenarios. - Cross-dialectal, zero-shot voice cloning: SoulX-Podcast supports zero-shot voice cloning across different Chinese dialects, enabling the generation of high-quality, personalized speech in any of the supported dialects. - Paralinguistic controls: SoulX-Podcast supports a variety of paralinguistic events, such as laughter and sighs, to enhance the realism of synthesized results. Clone and Install Here are instructions for installing on Linux. - Clone the repo - Install Conda: please see https://docs.conda.io/en/latest/miniconda.html - Create Conda env: You can simply run the demo with the following commands: TODOs - [ ] Add example scripts for monologue TTS. - [x] Publish the technical report. - [ ] Develop a WebUI for easy inference. - [ ] Deploy an online demo on Hugging Face Spaces. - [ ] Dockerize the project with vLLM support. - [ ] Add support for streaming inference. We use the Apache 2.0 license. 
Researchers and developers are free to use the code and model weights of SoulX-Podcast. Check the license at LICENSE for more details. Acknowledgement - This repo benefits from FlashCosyVoice Usage Disclaimer This project provides a speech synthesis model for podcast generation capable of zero-shot voice cloning, intended for academic research, educational purposes, and legitimate applications such as personalized speech synthesis, assistive technologies, and linguistic research. Do not use this model for unauthorized voice cloning, impersonation, fraud, scams, deepfakes, or any illegal activities. Ensure compliance with local laws and regulations when using this model and uphold ethical standards. The developers assume no liability for any misuse of this model. We advocate for the responsible development and use of AI and encourage the community to uphold safety and ethical principles in AI research and applications. If you have any concerns regarding ethics or misuse, please contact us. Contact Us If you are interested in our work, feel free to email [email protected] You're welcome to join our WeChat group for technical discussions and updates. Due to group limits, if you can't scan the QR code, please add my WeChat for group access.
UIGEN-X-4B-0729-GGUF
This model was generated using llama.cpp at commit `1e15bfd4`. Click here to get info on choosing the right GGUF model format > Tesslate's Reasoning Only UI generation model built on Qwen3-4B architecture. Trained to systematically plan, architect, and implement complete user interfaces across modern development stacks. Live Examples: https://uigenoutput.tesslate.com Discord Community: https://discord.gg/EcCpcTv93U Website: https://tesslate.com UIGEN-X-4B-0729 implements Reasoning Only from the Qwen3 family - combining systematic planning with direct implementation. The model follows a structured thinking process: 1. Problem Analysis — Understanding requirements and constraints 2. Architecture Planning — Component structure and technology decisions 3. Design System Definition — Color schemes, typography, and styling approach 4. Implementation Strategy — Step-by-step code generation with reasoning This approach enables both thoughtful planning and efficient code generation, making it suitable for complex UI development tasks. 
UIGEN-X-4B-0729 supports 26 major categories spanning frameworks and libraries across 7 platforms: Web Frameworks - React: Next.js, Remix, Gatsby, Create React App, Vite - Vue: Nuxt.js, Quasar, Gridsome - Angular: Angular CLI, Ionic Angular - Svelte: SvelteKit, Astro - Modern: Solid.js, Qwik, Alpine.js - Static: Astro, 11ty, Jekyll, Hugo Styling Systems - Utility-First: Tailwind CSS, UnoCSS, Windi CSS - CSS-in-JS: Styled Components, Emotion, Stitches - Component Systems: Material-UI, Chakra UI, Mantine - Traditional: Bootstrap, Bulma, Foundation - Design Systems: Carbon Design, IBM Design Language - Framework-Specific: Angular Material, Vuetify, Quasar UI Component Libraries - React: shadcn/ui, Material-UI, Ant Design, Chakra UI, Mantine, PrimeReact, Headless UI, NextUI, DaisyUI - Vue: Vuetify, PrimeVue, Quasar, Element Plus, Naive UI - Angular: Angular Material, PrimeNG, ng-bootstrap, Clarity Design - Svelte: Svelte Material UI, Carbon Components Svelte - Headless: Radix UI, Reach UI, Ariakit, React Aria State Management - React: Redux Toolkit, Zustand, Jotai, Valtio, Context API - Vue: Pinia, Vuex, Composables - Angular: NgRx, Akita, Services - Universal: MobX, XState, Recoil Animation Libraries - React: Framer Motion, React Spring, React Transition Group - Vue: Vue Transition, Vueuse Motion - Universal: GSAP, Lottie, CSS Animations, Web Animations API - Mobile: React Native Reanimated, Expo Animations Icon Systems Lucide, Heroicons, Material Icons, Font Awesome, Ant Design Icons, Bootstrap Icons, Ionicons, Tabler Icons, Feather, Phosphor, React Icons, Vue Icons Web Development Complete coverage of modern web development from simple HTML/CSS to complex enterprise applications. 
Mobile Development - React Native: Expo, CLI, with navigation and state management - Flutter: Cross-platform mobile with Material and Cupertino designs - Ionic: Angular, React, and Vue-based hybrid applications Desktop Applications - Electron: Cross-platform desktop apps (Slack, VSCode-style) - Tauri: Rust-based lightweight desktop applications - Flutter Desktop: Native desktop performance Python Applications - Web UI: Streamlit, Gradio, Flask, FastAPI - Desktop GUI: Tkinter, PyQt5/6, Kivy, wxPython, Dear PyGui Development Tools Build tools, bundlers, testing frameworks, and development environments. 26 Languages and Approaches: JavaScript, TypeScript, Python, Dart, HTML5, CSS3, SCSS, SASS, Less, PostCSS, CSS Modules, Styled Components, JSX, TSX, Vue SFC, Svelte Components, Angular Templates, Tailwind, PHP UIGEN-X-4B-0729 includes 21 distinct visual style categories that can be applied to any framework: Modern Design Styles - Glassmorphism: Frosted glass effects with blur and transparency - Neumorphism: Soft, extruded design elements - Material Design: Google's design system principles - Fluent Design: Microsoft's design language Traditional & Classic - Skeuomorphism: Real-world object representations - Swiss Design: Clean typography and grid systems - Bauhaus: Functional, geometric design principles Contemporary Trends - Brutalism: Bold, raw, unconventional layouts - Anti-Design: Intentionally imperfect, organic aesthetics - Minimalism: Essential elements only, generous whitespace Thematic Styles - Cyberpunk: Neon colors, glitch effects, futuristic elements - Dark Mode: High contrast, reduced eye strain - Retro-Futurism: 80s/90s inspired futuristic design - Geocities/90s Web: Nostalgic early web aesthetics Experimental - Maximalism: Rich, layered, abundant visual elements - Madness/Experimental: Unconventional, boundary-pushing designs - Abstract Shapes: Geometric, non-representational elements Basic Structure To achieve the best results, use this prompting 
structure below: UIGEN-X-4B-0729 supports function calling for dynamic asset integration and enhanced development workflows. Dynamic Asset Loading: - Fetch relevant images during UI generation - Generate realistic content for components - Create cohesive color palettes from images - Optimize assets for web performance Multi-Step Development: - Plan application architecture - Generate individual components - Integrate components into pages - Apply consistent styling and theming - Test responsive behavior Content-Aware Design: - Adapt layouts based on content types - Optimize typography for readability - Create responsive image galleries - Generate accessible alt text Rapid Prototyping - Quick mockups for client presentations - A/B testing different design approaches - Concept validation with interactive prototypes Production Development - Component library creation - Design system implementation - Template and boilerplate generation Educational & Learning - Teaching modern web development - Framework comparison and evaluation - Best practices demonstration Enterprise Solutions - Dashboard and admin panel generation - Internal tool development - Legacy system modernization Hardware - GPU: 8GB+ VRAM recommended (RTX 3080/4070 or equivalent) - RAM: 16GB system memory minimum - Storage: 20GB for model weights and cache Software - Python: 3.8+ with transformers, torch, unsloth - Node.js: For running generated JavaScript/TypeScript code - Browser: Modern browser for testing generated UIs Integration - Compatible with HuggingFace transformers - Supports GGML/GGUF quantization - Works with text-generation-webui - API-ready for production deployment - Token Usage: Reasoning process increases token consumption - Complex Logic: Focuses on UI structure rather than business logic - Real-time Features: Generated code requires backend integration - Testing: Output may need manual testing and refinement - Accessibility: While ARIA-aware, manual a11y testing recommended Discord: 
https://discord.gg/EcCpcTv93U Website: https://tesslate.com Examples: https://uigenoutput.tesslate.com Join our community to share creations, get help, and contribute to the ecosystem. Built with Reasoning Only capabilities from Qwen3, UIGEN-X-4B-0729 represents a comprehensive approach to AI-driven UI development across the entire modern web development ecosystem.
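The card's four-stage thinking process (problem analysis, architecture planning, design system definition, implementation strategy) can be encouraged directly in the prompt. A hypothetical sketch, since the original prompting template is not reproduced on this page; the wording and field names are mine, not Tesslate's:

```python
# Hypothetical prompt assembly for UIGEN-X-0729-style models, walking the
# model through the four documented reasoning stages before code generation.
request = {
    "stack": "React + Tailwind CSS",
    "style": "Glassmorphism",
    "task": "a pricing page with three tiers",
}

user_prompt = (
    f"Build {request['task']} using {request['stack']} in a "
    f"{request['style']} style. Before writing code, reason through: "
    "1) problem analysis, 2) component architecture, "
    "3) design system (colors, typography), 4) implementation plan."
)

messages = [{"role": "user", "content": user_prompt}]
```

Naming the stack and visual style explicitly matters here, since the model supports dozens of frameworks and 21 style categories and will otherwise pick its own.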
Qwen3-VL-2B-Instruct-GGUF
This model was generated using llama.cpp at commit `16724b5b6`. Click here to get info on choosing the right GGUF model format Meet Qwen3-VL — the most powerful vision-language model in the Qwen series to date. This generation delivers comprehensive upgrades across the board: superior text understanding & generation, deeper visual perception & reasoning, extended context length, enhanced spatial and video dynamics comprehension, and stronger agent interaction capabilities. Available in Dense and MoE architectures that scale from edge to cloud, with Instruct and reasoning‑enhanced Thinking editions for flexible, on‑demand deployment. Visual Agent: Operates PC/mobile GUIs—recognizes elements, understands functions, invokes tools, completes tasks. Visual Coding Boost: Generates Draw.io/HTML/CSS/JS from images/videos. Advanced Spatial Perception: Judges object positions, viewpoints, and occlusions; provides stronger 2D grounding and enables 3D grounding for spatial reasoning and embodied AI. Long Context & Video Understanding: Native 256K context, expandable to 1M; handles books and hours-long video with full recall and second-level indexing. Enhanced Multimodal Reasoning: Excels in STEM/Math—causal analysis and logical, evidence-based answers. Upgraded Visual Recognition: Broader, higher-quality pretraining enables the model to “recognize everything”—celebrities, anime, products, landmarks, flora/fauna, etc. Expanded OCR: Supports 32 languages (up from 19); robust in low light, blur, and tilt; better with rare/ancient characters and jargon; improved long-document structure parsing. Text Understanding on par with pure LLMs: Seamless text–vision fusion for lossless, unified comprehension. 1. Interleaved-MRoPE: Full‑frequency allocation over time, width, and height via robust positional embeddings, enhancing long‑horizon video reasoning. 2. DeepStack: Fuses multi‑level ViT features to capture fine‑grained details and sharpen image–text alignment. 3. 
Text–Timestamp Alignment: Moves beyond T‑RoPE to precise, timestamp‑grounded event localization for stronger video temporal modeling. This is the weight repository for Qwen3-VL-2B-Instruct. Below, we provide simple examples showing how to use Qwen3-VL with 🤖 ModelScope and 🤗 Transformers. The code for Qwen3-VL has been merged into the latest Hugging Face transformers, and we advise you to build from source with the following command: Here we show a code snippet demonstrating how to use the chat model with `transformers`: If you find our work helpful, feel free to cite us.
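The transformers snippet itself did not survive the page extraction. As a rough sketch, Qwen VL-family chat models accept interleaved image/text content in the message list; the structure below follows that convention, while the exact Qwen3-VL class names are an assumption and are left commented out (running them requires downloading the weights):

```python
# Sketch of the interleaved image+text chat format used by Qwen VL models.
# Only the message structure is exercised here; the image URL is a placeholder.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/demo.jpeg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Assumed loading path (verify class names against the official model card):
# from transformers import AutoProcessor, AutoModelForImageTextToText
# processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-2B-Instruct")
# model = AutoModelForImageTextToText.from_pretrained("Qwen/Qwen3-VL-2B-Instruct")
```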
QwQ-32B-GGUF
QwenLong-L1.5-30B-A3B-GGUF
NVIDIA-Nemotron-Nano-12B-v2-GGUF
This model was generated using llama.cpp at commit `4fd1242b`. Click here to get info on choosing the right GGUF model format The pretraining data has a cutoff date of September 2024. NVIDIA-Nemotron-Nano-12B-v2 is a large language model (LLM) trained from scratch by NVIDIA and designed as a unified model for both reasoning and non-reasoning tasks. It responds to user queries and tasks by first generating a reasoning trace and then concluding with a final response. The model's reasoning capabilities can be controlled via a system prompt. If the user prefers the model to provide its final answer without intermediate reasoning traces, it can be configured to do so, albeit with a slight decrease in accuracy for harder prompts that require reasoning. Conversely, allowing the model to generate reasoning traces first generally results in higher-quality final solutions to queries and tasks. The model was fine-tuned from NVIDIA-Nemotron-Nano-12B-v2-Base and was further compressed into NVIDIA-Nemotron-Nano-9B-v2. The model uses a hybrid architecture consisting primarily of Mamba-2 and MLP layers combined with just four Attention layers. For the architecture, please refer to the Nemotron-H tech report. The model was trained using Megatron-LM and NeMo-RL. 
The supported languages include: English, German, Spanish, French, Italian, and Japanese. Improved using Qwen. GOVERNING TERMS: Use of this model is governed by the NVIDIA Open Model License Agreement.

We evaluated our model in Reasoning-On mode across all benchmarks, except RULER, which is evaluated in Reasoning-Off mode.

| Benchmark | NVIDIA-Nemotron-Nano-12B-v2 |
| :---- | ----- |
| AIME25 | 76.25% |
| MATH500 | 97.75% |
| GPQA | 64.48% |
| LCB | 70.79% |
| BFCL v3 | 66.98% |
| IFEVAL-Prompt | 84.70% |
| IFEVAL-Instruction | 89.81% |

All evaluations were done using NeMo-Skills. We published a tutorial with all the details necessary to reproduce our evaluation results.

This model supports runtime "thinking" budget control. During inference, the user can specify how many tokens the model is allowed to "think".

- Architecture Type: Mamba2-Transformer Hybrid
- Network Architecture: Nemotron-Hybrid

NVIDIA-Nemotron-Nano-12B-v2 is a general-purpose reasoning and chat model intended to be used in English and coding languages. Other non-English languages (German, French, Italian, Spanish, and Japanese) are also supported. It is intended for developers designing AI agent systems, chatbots, RAG systems, and other AI-powered applications, and is also suitable for typical instruction-following tasks.

- Huggingface 08/29/2025 via https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2
- NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model
- Input Type(s): Text
- Input Format(s): String
- Input Parameters: One-Dimensional (1D): Sequences
- Other Properties Related to Input: Context length up to 128K. Supported languages include German, Spanish, French, Italian, Korean, Portuguese, Russian, Japanese, Chinese, and English.
- Output Type(s): Text
- Output Format: String
- Output Parameters: One-Dimensional (1D): Sequences up to 128K

Our models are designed and optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g.
GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

- Runtime Engine(s): NeMo 25.07.nemotron-nano-v2
- Supported Hardware Microarchitecture Compatibility: NVIDIA A10G, NVIDIA H100-80GB, NVIDIA A100
- Operating System(s): Linux

The snippet below shows how to use this model with Hugging Face Transformers (tested on version 4.48.3).

Case 1: if `/think` or no reasoning signal is provided in the system prompt, reasoning will be set to `True`.
Case 2: if `/nothink` is provided, reasoning will be set to `False`.

Note: the `/think` or `/nothink` keywords can also be provided in "user" messages for turn-level reasoning control.

We recommend setting `temperature` to `0.6` and `top_p` to `0.95` when reasoning is on, using greedy search when reasoning is off, and increasing `max_new_tokens` to `1024` or higher when reasoning is on.

The snippet below shows how to use this model with TRT-LLM. We tested this on the following commit and followed these instructions to build and install TRT-LLM in a docker container.

The snippet below shows how to use this model with vLLM. Use the latest version of vLLM and follow these instructions to build and install vLLM. Note:

- Remember to add `--mamba_ssm_cache_dtype float32` for accurate quality. Without this option, the model's accuracy may degrade.
- If you encounter a CUDA OOM issue, try `--max-num-seqs 64` and consider lowering the value further if the error persists.

Alternatively, you can use Docker to launch a vLLM server. The thinking budget allows developers to keep accuracy high while meeting response-time targets, which is especially crucial for customer support, autonomous agent steps, and edge devices where every millisecond counts. With budget control, you can set a limit for internal reasoning: `max_thinking_tokens` is a threshold that will attempt to end the reasoning trace at the next newline encountered in the reasoning trace.
If no newline is encountered within 500 tokens, it will abruptly end the reasoning trace at `max_thinking_tokens + 500`.

Calling the server with a budget (restricted to 32 tokens here as an example): after launching a vLLM server, you can call the server with tool-call support using a Python script like the one below.

We follow the jinja chat template provided below. This template conditionally adds ` \n` to the start of the Assistant response if `/think` is found in either the system prompt or any user message. If no reasoning signal is added, the model defaults to reasoning "on" mode. The chat template adds ` ` to the start of the Assistant response if `/nothink` is found in the system prompt, thus enforcing reasoning on/off behavior.

- Data Modality: Text
- Text Training Data Size: More than 10 Trillion Tokens
- Train/Test/Valid Split: We used 100% of the corpus for pre-training and relied on external benchmarks for testing.
- Data Collection Method by dataset: Hybrid: Automated, Human, Synthetic
- Labeling Method by dataset: Hybrid: Automated, Human, Synthetic

Properties: The post-training corpus for NVIDIA-Nemotron-Nano-12B-v2 consists of English and multilingual text (German, Spanish, French, Italian, Korean, Portuguese, Russian, Japanese, Chinese, and English). Our sources cover a variety of document types such as webpages, dialogue, articles, and other written materials. The corpus spans domains including code, legal, math, science, finance, and more. We also include a small portion of question-answering and alignment-style data to improve model accuracy. For several of the domains listed above we used synthetic data, specifically reasoning traces, from DeepSeek R1/R1-0528, Qwen3-235B-A22B, Nemotron 4 340B, Qwen2.5-32B-Instruct-AWQ, Qwen2.5-14B-Instruct, and Qwen 2.5 72B.

The pre-training corpus for NVIDIA-Nemotron-Nano-12B-v2 consists of high-quality curated and synthetically-generated data.
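Returning to the thinking-budget behavior described above, the truncation rule can be sketched as a toy simulation. This is illustrative only: token strings stand in for real tokenizer IDs, and the function is not the server's actual implementation.

```python
def truncate_reasoning(tokens, max_thinking_tokens, grace=500):
    """Toy model of the budget rule described above: once max_thinking_tokens
    is reached, end the trace at the next newline token; if none appears
    within `grace` tokens, cut hard at max_thinking_tokens + grace."""
    if len(tokens) <= max_thinking_tokens:
        return tokens
    for i in range(max_thinking_tokens, min(len(tokens), max_thinking_tokens + grace)):
        if tokens[i] == "\n":
            return tokens[: i + 1]
    return tokens[: max_thinking_tokens + grace]
```

In the real server the budget is enforced on token IDs during decoding; this sketch only mirrors the stopping logic.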
It is trained in the English language, as well as 15 multilingual languages and 43 programming languages. Our sources cover a variety of document types such as webpages, dialogue, articles, and other written materials. The corpus spans domains including legal, math, science, finance, and more. We also include a small portion of question-answering and alignment-style data to improve model accuracy. The model was pre-trained for approximately twenty trillion tokens.

Alongside the model, we release our final pretraining data, as outlined in this section. For ease of analysis, there is a sample set that is ungated. For all remaining code, math, and multilingual data, gating and approval is required, and the dataset is permissively licensed for model training purposes. More details on the datasets and synthetic data generation methods can be found in the technical report NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model.

| Dataset | Collection Period |
| :---- | :---- |
| Problems in Elementary Mathematics for Home Study | 4/23/2025 |
| GSM8K | 4/23/2025 |
| PRM800K | 4/23/2025 |
| CC-NEWS | 4/23/2025 |
| Common Crawl | 4/23/2025 |
| Wikimedia | 4/23/2025 |
| Bespoke-Stratos-17k | 4/23/2025 |
| tigerbot-kaggle-leetcodesolutions-en-2k | 4/23/2025 |
| glaive-function-calling-v2 | 4/23/2025 |
| APIGen Function-Calling | 4/23/2025 |
| LMSYS-Chat-1M | 4/23/2025 |
| Open Textbook Library - CC BY-SA & GNU subset and OpenStax - CC BY-SA subset | 4/23/2025 |
| Advanced Reasoning Benchmark, tigerbot-kaggle-leetcodesolutions-en-2k, PRM800K, and SciBench | 4/23/2025 |
| FineWeb-2 | 4/23/2025 |
| Court Listener | Legacy Download |
| peS2o | Legacy Download |
| OpenWebMath | Legacy Download |
| BioRxiv | Legacy Download |
| PMC Open Access Subset | Legacy Download |
| OpenWebText2 | Legacy Download |
| Stack Exchange Data Dump | Legacy Download |
| PubMed Abstracts | Legacy Download |
| NIH ExPorter | Legacy Download |
| arXiv | Legacy Download |
| BigScience Workshop Datasets | Legacy Download |
| Reddit Dataset | Legacy Download |
| SEC's Electronic Data Gathering, Analysis, and Retrieval (EDGAR) | Legacy Download |
| Public Software Heritage S3 | Legacy Download |
| The Stack | Legacy Download |
| mC4 | Legacy Download |
| Advanced Mathematical Problem Solving | Legacy Download |
| MathPile | Legacy Download |
| NuminaMath CoT | Legacy Download |
| PMC Article | Legacy Download |
| FLAN | Legacy Download |
| Advanced Reasoning Benchmark | Legacy Download |
| SciBench | Legacy Download |
| WikiTableQuestions | Legacy Download |
| FinQA | Legacy Download |
| Riddles | Legacy Download |
| Problems in Elementary Mathematics for Home Study | Legacy Download |
| MedMCQA | Legacy Download |
| Cosmos QA | Legacy Download |
| MCTest | Legacy Download |
| AI2's Reasoning Challenge | Legacy Download |
| OpenBookQA | Legacy Download |
| MMLU Auxiliary Train | Legacy Download |
| social-chemestry-101 | Legacy Download |
| Moral Stories | Legacy Download |
| The Common Pile v0.1 | Legacy Download |
| FineMath | Legacy Download |
| MegaMath | Legacy Download |
| FastChat | 6/30/2025 |

Private Non-publicly Accessible Datasets of Third Parties

| Dataset |
| :---- |
| Global Regulation |
| Workbench |

The English Common Crawl data was downloaded from the Common Crawl Foundation (see their FAQ for details on their crawling) and includes the snapshots CC-MAIN-2013-20 through CC-MAIN-2025-13. The data was subsequently deduplicated and filtered in various ways described in the Nemotron-CC paper. Additionally, we extracted data for fifteen languages from the following three Common Crawl snapshots: CC-MAIN-2024-51, CC-MAIN-2025-08, CC-MAIN-2025-18. The fifteen languages included were Arabic, Chinese, Danish, Dutch, French, German, Italian, Japanese, Korean, Polish, Portuguese, Russian, Spanish, Swedish, and Thai.
As we did not have reliable multilingual model-based quality classifiers available, we applied only heuristic filtering, similar to what we did for lower-quality English data in the Nemotron-CC pipeline, but selectively removed some filters for languages where they did not work well. Deduplication was done in the same way as for Nemotron-CC.

The GitHub Crawl was collected using the GitHub REST API and the Amazon S3 API. Each crawl was operated in accordance with the rate limits set by its respective source, either GitHub or S3. We collect raw source code and subsequently remove any files whose license does not exist in our permissive-license set (for additional details, refer to the technical report).

| Dataset | Modality | Dataset Size (Tokens) | Collection Period |
| :---- | :---- | :---- | :---- |
| English Common Crawl | Text | 3.360T | 4/8/2025 |
| Multilingual Common Crawl | Text | 812.7B | 5/1/2025 |
| GitHub Crawl | Text | 747.4B | 4/29/2025 |

| Dataset | Modality | Dataset Size (Tokens) | Seed Dataset | Model(s) used for generation |
| :---- | :---- | :---- | :---- | :---- |
| Synthetic Art of Problem Solving from DeepSeek-R1 | Text | 25.5B | Art of Problem Solving; American Mathematics Competitions 8; American Mathematics Competitions 10 | DeepSeek-R1 |
| Synthetic Moral Stories and Social Chemistry from Mixtral-8x22B-v0.1 | Text | 327M | social-chemestry-101; Moral Stories | Mixtral-8x22B-v0.1 |
| Synthetic Social Sciences seeded with OpenStax from DeepSeek-V3, Mixtral-8x22B-v0.1, and Qwen2.5-72B | Text | 83.6M | OpenStax - CC BY-SA subset | DeepSeek-V3; Mixtral-8x22B-v0.1; Qwen2.5-72B |
| Synthetic Health Sciences seeded with OpenStax from DeepSeek-V3, Mixtral-8x22B-v0.1, and Qwen2.5-72B | Text | 9.7M | OpenStax - CC BY-SA subset | DeepSeek-V3; Mixtral-8x22B-v0.1; Qwen2.5-72B |
| Synthetic STEM seeded with OpenStax, Open Textbook Library, and GSM8K from DeepSeek-R1, DeepSeek-V3, DeepSeek-V3-0324, and Qwen2.5-72B | Text | 175M | OpenStax - CC BY-SA subset; GSM8K; Open Textbook Library - CC BY-SA & GNU subset | DeepSeek-R1; DeepSeek-V3; DeepSeek-V3-0324; Qwen2.5-72B |
| Nemotron-PrismMath | Text | 4.6B | Big-Math-RL-Verified; OpenR1-Math-220k | Qwen2.5-0.5B-Instruct; Qwen2.5-72B-Instruct; DeepSeek-R1-Distill-Qwen-32B |
| Synthetic Question Answering Data from Papers and Permissible Books from Qwen2.5-72B-Instruct | Text | 350M | arXiv; National Institutes of Health ExPorter; BioRxiv; PMC Article; USPTO Backgrounds; peS2o; Global Regulation; CORE; PG-19; DOAB CC BY & CC BY-SA subset; NDLTD | Qwen2.5-72B-Instruct |
| Synthetic FineMath-4+ Reprocessed from DeepSeek-V3 | Text | 9.2B | Common Crawl | DeepSeek-V3 |
| Synthetic FineMath-3+ Reprocessed from phi-4 | Text | 27.6B | Common Crawl | phi-4 |
| Synthetic Union-3+ Reprocessed from phi-4 | Text | 93.1B | Common Crawl | phi-4 |
| Refreshed Nemotron-MIND from phi-4 | Text | 73B | Common Crawl | phi-4 |
| Synthetic Union-4+ Reprocessed from phi-4 | Text | 14.12B | Common Crawl | phi-4 |
| Synthetic Union-3+ minus 4+ Reprocessed from phi-4 | Text | 78.95B | Common Crawl | phi-4 |
| Synthetic Union-3 Refreshed from phi-4 | Text | 80.94B | Common Crawl | phi-4 |
| Synthetic Union-4+ Refreshed from phi-4 | Text | 52.32B | Common Crawl | phi-4 |
| Synthetic AGIEval seeded with AQUA-RAT, LogiQA, and AR-LSAT from DeepSeek-V3 and DeepSeek-V3-0324 | Text | 4.0B | AQUA-RAT; LogiQA; AR-LSAT | DeepSeek-V3; DeepSeek-V3-0324 |
| Synthetic AGIEval seeded with AQUA-RAT, LogiQA, and AR-LSAT from Qwen3-30B-A3B | Text | 4.2B | AQUA-RAT; LogiQA; AR-LSAT | Qwen3-30B-A3B |
| Synthetic Art of Problem Solving from Qwen2.5-32B-Instruct, Qwen2.5-Math-72B, Qwen2.5-Math-7B, and Qwen2.5-72B-Instruct | Text | 83.1B | Art of Problem Solving; American Mathematics Competitions 8; American Mathematics Competitions 10; GSM8K; PRM800K | Qwen2.5-32B-Instruct; Qwen2.5-Math-72B; Qwen2.5-Math-7B; Qwen2.5-72B-Instruct |
| Synthetic MMLU Auxiliary Train from DeepSeek-R1 | Text | 0.5B | MMLU Auxiliary Train | DeepSeek-R1 |
| Synthetic Long Context Continued Post-Training Data from Papers and Permissible Books from Qwen2.5-72B-Instruct | Text | 5.4B | arXiv; National Institutes of Health ExPorter; BioRxiv; PMC Article; USPTO Backgrounds; peS2o; Global Regulation; CORE; PG-19; DOAB CC BY & CC BY-SA subset; NDLTD | Qwen2.5-72B-Instruct |
| Synthetic Common Crawl from Qwen3-30B-A3B and Mistral-Nemo-12B-Instruct | Text | 1.949T | Common Crawl | Qwen3-30B-A3B; Mistral-NeMo-12B-Instruct |
| Synthetic Multilingual Data from Common Crawl from Qwen3-30B-A3B | Text | 997.3B | Common Crawl | Qwen3-30B-A3B |
| Synthetic Multilingual Data from Wikimedia from Qwen3-30B-A3B | Text | 55.1B | Wikimedia | Qwen3-30B-A3B |
| Synthetic OpenMathReasoning from DeepSeek-R1-0528 | Text | 1.5M | OpenMathReasoning | DeepSeek-R1-0528 |
| Synthetic OpenCodeReasoning from DeepSeek-R1-0528 | Text | 1.1M | OpenCodeReasoning | DeepSeek-R1-0528 |
| Synthetic Science Data from DeepSeek-R1-0528 | Text | 1.5M | - | DeepSeek-R1-0528 |
| Synthetic Humanity's Last Exam from DeepSeek-R1-0528 | Text | 460K | Humanity's Last Exam | DeepSeek-R1-0528 |
| Synthetic ToolBench from Qwen3-235B-A22B | Text | 400K | ToolBench | Qwen3-235B-A22B |
| Synthetic Nemotron Content Safety Dataset V2, eval-safety, Gretel Synthetic Safety Alignment, and RedTeam_2K from DeepSeek-R1-0528 | Text | 52K | Nemotron Content Safety Dataset V2; eval-safety; Gretel Synthetic Safety Alignment; RedTeam_2K | DeepSeek-R1-0528 |
| Synthetic HelpSteer from Qwen3-235B-A22B | Text | 120K | HelpSteer3; HelpSteer2 | Qwen3-235B-A22B |
| Synthetic Alignment data from Mixtral-8x22B-Instruct-v0.1, Mixtral-8x7B-Instruct-v0.1, and Nemotron-4 Family | Text | 400K | HelpSteer2; C4; LMSYS-Chat-1M; ShareGPT52K; tigerbot-kaggle-leetcodesolutions-en-2k; GSM8K; PRM800K; lm_identity (NVIDIA internal); FinQA; WikiTableQuestions; Riddles; ChatQA nvolve-multiturn (NVIDIA internal); glaive-function-calling-v2; SciBench; OpenBookQA; Advanced Reasoning Benchmark; Public Software Heritage S3; Khan Academy Math Keywords | Nemotron-4-15B-Base (NVIDIA internal); Nemotron-4-15B-Instruct (NVIDIA internal); Nemotron-4-340B-Base; Nemotron-4-340B-Instruct; Nemotron-4-340B-Reward; Mixtral-8x7B-Instruct-v0.1; Mixtral-8x22B-Instruct-v0.1 |
| Synthetic LMSYS-Chat-1M from Qwen3-235B-A22B | Text | 1M | LMSYS-Chat-1M | Qwen3-235B-A22B |
| Synthetic Multilingual Reasoning data from DeepSeek-R1-0528, Qwen2.5-32B-Instruct-AWQ, and Qwen2.5-14B-Instruct | Text | 25M | OpenMathReasoning; OpenCodeReasoning | DeepSeek-R1-0528; Qwen2.5-32B-Instruct-AWQ (translation); Qwen2.5-14B-Instruct (translation) |
| Synthetic Multilingual Reasoning data from Qwen3-235B-A22B and Gemma 3 Post-Trained models | Text | 5M | WildChat | Qwen3-235B-A22B; Gemma 3 PT 12B; Gemma 3 PT 27B |

- Data Collection Method by dataset: Hybrid: Human, Synthetic
- Labeling Method by dataset: Hybrid: Automated, Human, Synthetic

NVIDIA believes Trustworthy AI is a shared responsibility, and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our Trustworthy AI terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Bias, Explainability, Safety & Security, and Privacy Subcards. Please report security vulnerabilities or NVIDIA AI Concerns here.

Help me test my AI-Powered Quantum Network Monitor Assistant with quantum-ready security checks. The full open-source code for the Quantum Network Monitor Service is available at my GitHub repos (repos with NetworkMonitor in the name): Source Code Quantum Network Monitor.
You will also find the code I use to quantize the models if you want to do it yourself: GGUFModelBuilder

💬 How to test: Choose an AI assistant type:
- `TurboLLM` (GPT-4.1-mini)
- `HugLLM` (Hugging Face open-source models)
- `TestLLM` (Experimental CPU-only)

What I'm Testing: I'm pushing the limits of small open-source models for AI network monitoring, specifically:
- Function calling against live network services
- How small can a model go while still handling:
  - Automated Nmap security scans
  - Quantum-readiness checks
  - Network monitoring tasks

🟡 TestLLM – Current experimental model (llama.cpp on 2 CPU threads on a Hugging Face docker space):
- ✅ Zero-configuration setup
- ⏳ 30s load time (slow inference but no API costs). No token limit, as the cost is low.
- 🔧 Help wanted! If you're into edge-device AI, let's collaborate!

Other Assistants

🟢 TurboLLM – Uses gpt-4.1-mini:
- It performs very well, but unfortunately OpenAI charges per token. For this reason token usage is limited.
- Create custom cmd processors to run .net code on Quantum Network Monitor Agents
- Real-time network diagnostics and monitoring
- Security audits
- Penetration testing (Nmap/Metasploit)

🔵 HugLLM – Latest open-source models:
- 🌐 Runs on the Hugging Face Inference API. Performs pretty well using the latest models hosted on Novita.

💡 Example commands you could test:
1. `"Give me info on my website's SSL certificate"`
2. `"Check if my server is using quantum safe encryption for communication"`
3. `"Run a comprehensive security audit on my server"`
4. `"Create a cmd processor to .. (whatever you want)"` Note: you need to install a Quantum Network Monitor Agent to run the .net code on. This is a very flexible and powerful feature. Use with caution!

I fund the servers used to create these model files, run the Quantum Network Monitor service, and pay for inference from Novita and OpenAI—all out of my own pocket. All the code behind the model creation and the Quantum Network Monitor project is open source.
Feel free to use whatever you find helpful. If you appreciate the work, please consider buying me a coffee ☕. Your support helps cover service costs and allows me to raise token limits for everyone.
Qwen2.5-7B-Instruct-1M-GGUF
Qwen2.5-VL-72B-Instruct-GGUF
LongWriter-Zero-32B-GGUF
granite-4.0-1b-GGUF
Qwen2.5-3B-Instruct-GGUF
nomic-embed-code-GGUF
MetaStone-L1-7B-GGUF
granite-4.0-350m-GGUF
This model was generated using llama.cpp at commit `16724b5b6`. Click here to get info on choosing the right GGUF model format.

Model Summary: Granite-4.0-350M is a lightweight instruct model finetuned from Granite-4.0-350M-Base using a combination of open-source instruction datasets with permissive licenses and internally collected synthetic datasets. This model is developed using a diverse set of techniques including supervised finetuning, reinforcement learning, and model merging.

- Developers: Granite Team, IBM
- HF Collection: Granite 4.0 Nano Language Models HF Collection
- GitHub Repository: ibm-granite/granite-4.0-nano-language-models
- Website: Granite Docs
- Release Date: October 28, 2025
- License: Apache 2.0

Supported Languages: English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese. Users may fine-tune Granite 4.0 Nano models to support languages beyond those included in this list.

Intended use: Granite 4.0 Nano instruct models feature strong instruction-following capabilities, bringing advanced AI capabilities within reach for on-device deployments and research use cases. Additionally, their compact size makes them well-suited for fine-tuning on specialized domains without requiring massive compute resources.

Capabilities: Summarization, Text classification, Text extraction, Question-answering, Retrieval Augmented Generation (RAG), Code-related tasks, Function-calling tasks, Multilingual dialog use cases, Fill-In-the-Middle (FIM) code completions

Generation: This is a simple example of how to use the Granite-4.0-350M model. Copy the snippet from the section that is relevant for your use case.

Tool-calling: Granite-4.0-350M comes with enhanced tool-calling capabilities, enabling seamless integration with external functions and APIs. To define a list of tools, please follow OpenAI's function definition schema.
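As a hedged sketch of a tool list in OpenAI's function-definition schema, the shape might look like the following; the `get_weather` tool and its parameters are hypothetical examples, not part of the card:

```python
# Hypothetical tool following OpenAI's function-definition schema; with
# Transformers this list would be passed via
# tokenizer.apply_chat_template(messages, tools=tools, ...).
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"},
                },
                "required": ["city"],
            },
        },
    }
]
messages = [{"role": "user", "content": "What's the weather in Madrid?"}]
```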
This is an example of how to use the Granite-4.0-350M model's tool-calling ability:

Benchmarks

Multilingual benchmarks and the included languages:

| Benchmark | # Languages | Languages |
| :---- | :---- | :---- |
| MMMLU | 11 | ar, de, en, es, fr, ja, ko, pt, zh, bn, hi |
| INCLUDE | 14 | hi, bn, ta, te, ar, de, es, fr, it, ja, ko, nl, pt, zh |

Granite-4.0-350M baseline is based on a decoder-only dense transformer architecture. Core components of this architecture are: GQA, MLP with SwiGLU, RMSNorm, and shared input/output embeddings.

| Metric | 350M Dense | H 350M Dense | 1B Dense | H 1B Dense |
| :---- | :---- | :---- | :---- | :---- |
| Number of layers | 28 attention | 4 attention / 28 Mamba2 | 40 attention | 4 attention / 36 Mamba2 |
| MLP / shared expert hidden size | 2048 | 2048 | 4096 | 4096 |

Training Data: Overall, our SFT data is largely composed of three key sources: (1) publicly available datasets with permissive licenses, (2) internal synthetic data targeting specific capabilities, and (3) a select set of human-curated data.

Infrastructure: We trained the Granite 4.0 Nano Language Models utilizing an NVIDIA GB200 NVL72 cluster hosted in CoreWeave. Intra-rack communication occurs via the 72-GPU NVLink domain, and a non-blocking, full Fat-Tree NDR 400 Gb/s InfiniBand network provides inter-rack communication. This cluster provides a scalable and efficient infrastructure for training our models over thousands of GPUs.

Ethical Considerations and Limitations: Granite 4.0 Nano Instruct Models are primarily finetuned using instruction-response pairs, mostly in English, but also multilingual data covering multiple languages. Although this model can handle multilingual dialog use cases, its performance might not match that on English tasks. In such cases, introducing a small number of examples (few-shot) can help the model generate more accurate outputs. While this model has been aligned with safety in mind, the model may in some cases produce inaccurate, biased, or unsafe responses to user prompts.
So we urge the community to use this model with proper safety testing and tuning tailored for their specific tasks.

Resources
- ⭐️ Learn about the latest updates with Granite: https://www.ibm.com/granite
- 📄 Get started with tutorials, best practices, and prompt engineering advice: https://www.ibm.com/granite/docs/
- 💡 Learn about the latest Granite learning resources: https://ibm.biz/granite-learning-resources
granite-4.0-h-350m-GGUF
This model was generated using llama.cpp at commit `16724b5b6`. Click here to get info on choosing the right GGUF model format.

Model Summary: Granite-4.0-H-350M is a lightweight instruct model finetuned from Granite-4.0-H-350M-Base using a combination of open-source instruction datasets with permissive licenses and internally collected synthetic datasets. This model is developed using a diverse set of techniques including supervised finetuning, reinforcement learning, and model merging.

- Developers: Granite Team, IBM
- HF Collection: Granite 4.0 Nano Language Models HF Collection
- GitHub Repository: ibm-granite/granite-4.0-nano-language-models
- Website: Granite Docs
- Release Date: October 28, 2025
- License: Apache 2.0

Supported Languages: English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese. Users may fine-tune Granite 4.0 Nano models to support languages beyond those included in this list.

Intended use: Granite 4.0 Nano instruct models feature strong instruction-following capabilities, bringing advanced AI capabilities within reach for on-device deployments and research use cases. Additionally, their compact size makes them well-suited for fine-tuning on specialized domains without requiring massive compute resources.

Capabilities: Summarization, Text classification, Text extraction, Question-answering, Retrieval Augmented Generation (RAG), Code-related tasks, Function-calling tasks, Multilingual dialog use cases, Fill-In-the-Middle (FIM) code completions

Generation: This is a simple example of how to use the Granite-4.0-H-350M model. Copy the snippet from the section that is relevant for your use case.

Tool-calling: Granite-4.0-H-350M comes with enhanced tool-calling capabilities, enabling seamless integration with external functions and APIs. To define a list of tools, please follow OpenAI's function definition schema.
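On the consuming side, here is a hedged sketch of handling a tool call the model emits: the JSON shape follows OpenAI's function-call convention, while the registry and the `add` tool are hypothetical application details.

```python
import json

def dispatch_tool_call(raw: str, tools_impl: dict):
    """Parse a JSON tool call emitted by the model and dispatch it to a
    local implementation registered under the tool's name."""
    call = json.loads(raw)
    fn = tools_impl[call["name"]]
    return fn(**call.get("arguments", {}))

# Example registry and call payload (hypothetical):
tools_impl = {"add": lambda a, b: a + b}
result = dispatch_tool_call('{"name": "add", "arguments": {"a": 2, "b": 3}}', tools_impl)
# result == 5
```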
This is an example of how to use the Granite-4.0-H-350M model's tool-calling ability:

Benchmarks

Multilingual benchmarks and the included languages:

| Benchmark | # Languages | Languages |
| :---- | :---- | :---- |
| MMMLU | 11 | ar, de, en, es, fr, ja, ko, pt, zh, bn, hi |
| INCLUDE | 14 | hi, bn, ta, te, ar, de, es, fr, it, ja, ko, nl, pt, zh |

Granite-4.0-H-350M baseline is based on a decoder-only dense transformer architecture. Core components of this architecture are: GQA, Mamba2, MLP with SwiGLU, RMSNorm, and shared input/output embeddings.

| Metric | 350M Dense | H 350M Dense | 1B Dense | H 1B Dense |
| :---- | :---- | :---- | :---- | :---- |
| Number of layers | 28 attention | 4 attention / 28 Mamba2 | 40 attention | 4 attention / 36 Mamba2 |
| MLP / shared expert hidden size | 2048 | 2048 | 4096 | 4096 |

Training Data: Overall, our SFT data is largely composed of three key sources: (1) publicly available datasets with permissive licenses, (2) internal synthetic data targeting specific capabilities, and (3) a select set of human-curated data.

Infrastructure: We trained the Granite 4.0 Nano Language Models utilizing an NVIDIA GB200 NVL72 cluster hosted in CoreWeave. Intra-rack communication occurs via the 72-GPU NVLink domain, and a non-blocking, full Fat-Tree NDR 400 Gb/s InfiniBand network provides inter-rack communication. This cluster provides a scalable and efficient infrastructure for training our models over thousands of GPUs.

Ethical Considerations and Limitations: Granite 4.0 Nano Instruct Models are primarily finetuned using instruction-response pairs, mostly in English, but also multilingual data covering multiple languages. Although this model can handle multilingual dialog use cases, its performance might not match that on English tasks. In such cases, introducing a small number of examples (few-shot) can help the model generate more accurate outputs. While this model has been aligned with safety in mind, the model may in some cases produce inaccurate, biased, or unsafe responses to user prompts.
So we urge the community to use this model with proper safety testing and tuning tailored for their specific tasks.

Resources
- ⭐️ Learn about the latest updates with Granite: https://www.ibm.com/granite
- 📄 Get started with tutorials, best practices, and prompt engineering advice: https://www.ibm.com/granite/docs/
- 💡 Learn about the latest Granite learning resources: https://ibm.biz/granite-learning-resources
Hermes-4-14B-GGUF
This model was generated using llama.cpp at commit `4fd1242b`. Click here to get info on choosing the right GGUF model format

Hermes 4 14B is a frontier, hybrid-mode reasoning model based on Qwen 3 14B by Nous Research that is aligned to you. Read the Hermes 4 technical report here: Hermes 4 Technical Report. Chat with Hermes in Nous Chat: https://chat.nousresearch.com

Training highlights include a newly synthesized post-training corpus emphasizing verified reasoning traces and massive improvements in math, code, STEM, logic, creativity, and format-faithful outputs, while preserving general assistant quality and broadly neutral alignment.

- Post-training corpus: massively increased dataset size, from 1M samples and 1.2B tokens to ~5M samples / ~60B tokens, blended across reasoning and non-reasoning data.
- Hybrid reasoning mode with explicit `<think> … </think>` segments when the model decides to deliberate, and options to make responses faster when you want.
- Top-quality, expressive reasoning that improves math, code, STEM, logic, and even creative writing and subjective responses.
- Schema adherence & structured outputs: trained to produce valid JSON for given schemas and to repair malformed objects.
- Much easier to steer and align: extreme improvements in steerability, especially reduced refusal rates.

In pursuit of the mission of producing models that are open, steerable, and capable of producing the full range of human expression, while being able to be aligned to your values, we created a new benchmark, RefusalBench, that tests a model's willingness to be helpful in a variety of scenarios commonly disallowed by closed and open models. Hermes 4 achieves SOTA on RefusalBench across all popular closed and open models in being helpful and conforming to your values, without censorship.

> Full tables, settings, and comparisons are in the technical report.

Hermes 4 uses ChatML format with role headers and special tags.
Reasoning mode can be activated with the chat template via the flag `thinking=True` or by using the following system prompt: Note that you can add any additional system instructions before or after this system message, and they will adjust the model's policies, style, and effort of thinking, as well as its post-thinking style, format, identity, and more. You may also interleave the tool-definition system message with the reasoning one. Additionally, we provide a flag to keep the content in between the `<think> ... </think>` tags, which you can play with by setting `keep_cots=True`.

Hermes 4 supports function/tool calls within a single assistant turn, produced after its reasoning. Note that you may also simply place tool definitions into the "tools:" field of your messages, and the chat template will parse them and create the system prompt for you. This also works with reasoning mode, for improved accuracy of tool use. The model will then generate tool calls within `<tool_call> ... </tool_call>` tags, for easy parsing. The tool-call tags are also added tokens, so they are easy to parse while streaming! There are also automatic tool parsers built into vLLM and SGLang for Hermes: just set the tool parser in vLLM to `hermes` and in SGLang to `qwen25`.

- Sampling defaults that work well: `temperature=0.6, top_p=0.95, top_k=20`.
- Template: use the ChatML chat format for Hermes 4 14B as shown above, or set `add_generation_prompt=True` when using `tokenizer.apply_chat_template(...)`.

For production serving on multi-GPU nodes, consider tensor-parallel inference engines (e.g., SGLang/vLLM backends) with prefix caching. Hermes 4 is available as original BF16 weights as well as FP8 variants and GGUF variants by LM Studio. FP8: https://huggingface.co/NousResearch/Hermes-4-14B-FP8

Hermes 4 is also available in larger sizes (e.g., 70B, 405B) with similar prompt formats.
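Hermes-style chat templates typically wrap each tool call in `<tool_call> … </tool_call>` tags containing a JSON object. Assuming that convention, a minimal extraction helper might look like the sketch below (the helper name and the `get_weather` call are hypothetical, shown only to illustrate the parsing step; for serving, prefer the built-in `hermes` parser in vLLM):

```python
import json
import re

# Hypothetical helper: pull JSON tool calls out of an assistant turn.
# Assumes the <tool_call>...</tool_call> tag convention used by
# Hermes-style templates; adjust the tag name if your template differs.
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)

def extract_tool_calls(assistant_text: str) -> list[dict]:
    """Return each tool call as a parsed dict, e.g. {"name": ..., "arguments": ...}."""
    return [json.loads(m.strip()) for m in TOOL_CALL_RE.findall(assistant_text)]

reply = (
    "Let me check the weather first.\n"
    '<tool_call>{"name": "get_weather", "arguments": {"city": "Berlin"}}</tool_call>'
)
calls = extract_tool_calls(reply)
print(calls[0]["name"])  # get_weather
```

Because the tags are dedicated added tokens, the same pattern can be applied incrementally while streaming: buffer text after an opening tag and parse once the closing tag arrives.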
See the Hermes 4 collection to explore them all: https://huggingface.co/collections/NousResearch/hermes-4-collection-68a731bfd452e20816725728
Apriel-1.5-15b-Thinker-GGUF
This model was generated using llama.cpp at commit `c8dedc99`. Click here to get info on choosing the right GGUF model format

Apriel-1.5-15b-Thinker - Mid training is all you need!

1. Summary
2. Evaluation
3. Training Details
4. How to Use
5. Intended Use
6. Limitations
7. Security and Responsible Use
8. Software
9. License
10. Acknowledgements
11. Citation

Click here to skip to the technical report -> https://huggingface.co/ServiceNow-AI/Apriel-1.5-15b-Thinker/blob/main/Apriel-1.5-Thinker.pdf

Apriel-1.5-15b-Thinker is a multimodal reasoning model in ServiceNow's Apriel SLM series which achieves competitive performance against models 10 times its size. Apriel-1.5 is the second model in the reasoning series. It introduces enhanced textual reasoning capabilities and adds image reasoning support to the previous text model. It has undergone extensive continual pretraining across both text and image domains. In terms of post-training, this model has undergone text-SFT only. Our research demonstrates that with a strong mid-training regimen, we are able to achieve SOTA performance on text and image reasoning tasks without any image SFT training or RL.

Highlights
- Achieves a score of 52 on the Artificial Analysis index and is competitive with Deepseek R1 0528, Gemini-Flash, etc.
- It is at least 1/10 the size of any other model that scores above 50 on the Artificial Analysis index.
- Scores 68 on Tau2 Bench Telecom and 62 on IFBench, which are key benchmarks for the enterprise domain.
- At 15B parameters, the model fits on a single GPU, making it highly memory-efficient.
- For text benchmarks, we report evaluations performed by a third party, Artificial Analysis.
- For image benchmarks, we report evaluations obtained with VLMEvalKit (https://github.com/open-compass/VLMEvalKit).

We are a small lab with big goals. While we are not GPU-poor, our lab has, in comparison, a tiny fraction of the compute available to other frontier labs. Our goal with this work is to show that a SOTA model can be built with limited resources given the right data, design, and a solid methodology. We set out to build a small but powerful model, aiming for capabilities on par with frontier models. Developing a 15B model with this level of performance requires tradeoffs, so we prioritized getting SOTA-level performance first. Mid-training consists only of CPT and SFT; no RL has been applied. This model performs extensive reasoning by default, allocating extra internal effort to improve robustness and accuracy even on simpler queries. You may notice slightly higher token usage and longer response times, but we are actively working to make it more efficient and concise in future releases.

Mid training / Continual Pre‑training: In this stage, the model is trained on billions of tokens of carefully curated textual samples drawn from mathematical reasoning, coding challenges, scientific discourse, logical puzzles, and diverse knowledge-rich texts, along with multimodal samples covering image understanding and reasoning, captioning, and interleaved image-text data. The objective is to strengthen the foundational reasoning capabilities of the model. This stage is critical for the model to function as a reasoner and provides significant lifts on reasoning benchmarks.
Supervised Fine‑Tuning (SFT): The model is fine-tuned on over 2M high-quality text samples spanning mathematical and scientific problem-solving, coding tasks, instruction-following, API/function invocation, and conversational use cases. This results in superior text performance comparable to models such as Deepseek R1 0528 and Gemini-Flash. Although no image-specific fine-tuning is performed, the model's inherent multimodal capabilities and cross-modal transfer of reasoning behavior from the text SFT yield competitive image performance relative to other leading open-source VL models.

As the upstream PR is not yet merged, you can use this custom image as an alternate way to run the model with tool and reasoning parsers enabled. This will start the vLLM OpenAI-compatible API server serving the Apriel-1.5-15B-Thinker model with Apriel's custom tool parser and reasoning parser.

Here is a code snippet demonstrating the model's usage with the transformers library's generate function: The model will first generate its thinking process and then generate its final response between `[BEGIN FINAL RESPONSE]` and `[END FINAL RESPONSE]`. Here is a code snippet demonstrating the application of the chat template:

Usage Guidelines
1. Use the model's default chat template, which already includes a system prompt.
2. We recommend setting temperature to `0.6`.
3. We ensure the model starts with `Here are my reasoning steps:\n` during all our evaluations. This is implemented in the default chat template.
4. For multi-turn conversations, intermediate turns (historical model outputs) are expected to contain only the final response, without reasoning steps.
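Guideline 4 implies stripping the reasoning before feeding a model output back as history. Assuming the `[BEGIN FINAL RESPONSE]`/`[END FINAL RESPONSE]` delimiters appear verbatim in the generation, a small helper (hypothetical, not part of the model's own tooling) could split the output:

```python
# Hypothetical helper: split an Apriel generation into (reasoning, final
# response) using the delimiters described above. For multi-turn history,
# only the final response should be kept.
BEGIN = "[BEGIN FINAL RESPONSE]"
END = "[END FINAL RESPONSE]"

def split_output(text: str) -> tuple[str, str]:
    """Return (reasoning, final); final is empty if the delimiters are absent."""
    start = text.find(BEGIN)
    if start == -1:
        return text.strip(), ""
    end = text.find(END, start)
    final = text[start + len(BEGIN): end if end != -1 else len(text)]
    return text[:start].strip(), final.strip()

raw = (
    "Here are my reasoning steps:\nFirst I consider the units...\n"
    "[BEGIN FINAL RESPONSE]42 km[END FINAL RESPONSE]"
)
reasoning, final = split_output(raw)
# Only `final` goes back into the conversation history on later turns.
```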
The Apriel family of models is designed for a variety of general-purpose instruction tasks, including:
- Code assistance and generation
- Logical reasoning and multi-step tasks
- Question answering and information retrieval
- Function calling, complex instruction following, and agent use cases

They are not intended for use in safety-critical applications without human oversight or in scenarios requiring guaranteed factual accuracy.

- Factual accuracy: May produce incorrect, misleading, or outdated content. Outputs should be verified before use in critical contexts.
- Bias: May reflect societal, cultural, or systemic biases present in training data.
- Ethics: Do not use the model to produce harmful, unlawful, or unethical content.
- Language: Strongest performance is in English. Output quality may degrade in underrepresented languages.
- Critical use: Not suitable for medical, legal, financial, or other high-risk applications without safeguards.

Security Responsibilities: Deployers and users are strongly encouraged to align their security practices with established frameworks and regulatory guidelines such as the EU AI Act and the NIST AI Risk Management Framework (RMF).

- Regularly conduct robustness assessments to identify and mitigate adversarial inputs.
- Implement validation and filtering processes to prevent harmful or biased outputs.
- Continuously perform data privacy checks to guard against unintended data leaks.
- Document and communicate the model's limitations, intended usage, and known security risks to all end-users.
- Schedule periodic security reviews and updates to address emerging threats and vulnerabilities.
- Follow established security policies and usage guidelines provided by deployers.
- Protect and manage sensitive information when interacting with the model.
- Report anomalies, suspicious behavior, or unsafe outputs to deployers or developers.
- Maintain human oversight and apply judgment to mitigate potential security or ethical risks during interactions.

Disclaimer: Users accept responsibility for securely deploying, managing, and using this open-source LLM. The model is provided "as-is," without explicit or implied warranty regarding security or fitness for any specific application or environment.
granite-4.0-h-small-GGUF
This model was generated using llama.cpp at commit `56b479584`. Click here to get info on choosing the right GGUF model format

📣 Update [10-07-2025]: Added a default system prompt to the chat template to guide the model towards more professional, accurate, and safe responses.

Model Summary: Granite-4.0-H-Small is a 32B parameter long-context instruct model finetuned from Granite-4.0-H-Small-Base using a combination of open-source instruction datasets with permissive licenses and internally collected synthetic datasets. This model is developed using a diverse set of techniques with a structured chat format, including supervised finetuning, model alignment using reinforcement learning, and model merging. Granite 4.0 instruct models feature improved instruction following (IF) and tool-calling capabilities, making them more effective in enterprise applications.

- Developers: Granite Team, IBM
- HF Collection: Granite 4.0 Language Models HF Collection
- GitHub Repository: ibm-granite/granite-4.0-language-models
- Website: Granite Docs
- Release Date: October 2nd, 2025
- License: Apache 2.0

Supported Languages: English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese. Users may finetune Granite 4.0 models for languages beyond this list.
Intended use: The model is designed to respond to general instructions and can be used to build AI assistants for multiple domains, including business applications.

Capabilities
- Summarization
- Text classification
- Text extraction
- Question-answering
- Retrieval Augmented Generation (RAG)
- Code-related tasks
- Function-calling tasks
- Multilingual dialog use cases
- Fill-In-the-Middle (FIM) code completions

Generation: This is a simple example of how to use the Granite-4.0-H-Small model. Then, copy the snippet from the section that is relevant for your use case.

Tool-calling: Granite-4.0-H-Small comes with enhanced tool-calling capabilities, enabling seamless integration with external functions and APIs. To define a list of tools, please follow OpenAI's function definition schema. This is an example of how to use the Granite-4.0-H-Small model's tool-calling ability:

Benchmarks (reported for the Micro Dense, H Micro Dense, H Tiny MoE, and H Small MoE variants)

Multilingual benchmarks and the included languages:
- MMMLU (11): ar, de, en, es, fr, ja, ko, pt, zh, bn, hi
- INCLUDE (14): hi, bn, ta, te, ar, de, es, fr, it, ja, ko, nl, pt, zh

Model Architecture: The Granite-4.0-H-Small baseline is built on a decoder-only MoE transformer architecture. Core components of this architecture are: GQA, Mamba2, MoEs with shared experts, SwiGLU activation, RMSNorm, and shared input/output embeddings.

| Model | Micro Dense | H Micro Dense | H Tiny MoE | H Small MoE |
| --- | --- | --- | --- | --- |
| Number of layers | 40 attention | 4 attention / 36 Mamba2 | 4 attention / 36 Mamba2 | 4 attention / 36 Mamba2 |
| MLP / Shared expert hidden size | 8192 | 8192 | 1024 | 1536 |

Training Data: Overall, our SFT data is largely comprised of three key sources: (1) publicly available datasets with permissive licenses, (2) internal synthetic data targeting specific capabilities, and (3) a select set of human-curated data.
Infrastructure: We trained the Granite 4.0 Language Models utilizing an NVIDIA GB200 NVL72 cluster hosted in CoreWeave. Intra-rack communication occurs via the 72-GPU NVLink domain, and a non-blocking, full Fat-Tree NDR 400 Gb/s InfiniBand network provides inter-rack communication. This cluster provides a scalable and efficient infrastructure for training our models over thousands of GPUs.

Ethical Considerations and Limitations: Granite 4.0 Instruct models are primarily finetuned on instruction-response pairs, mostly in English but also including multilingual data covering multiple languages. Although this model can handle multilingual dialog use cases, its performance may not match that on English tasks. In such cases, introducing a small number of examples (few-shot) can help the model generate more accurate outputs. While this model has been aligned with safety in consideration, it may in some cases produce inaccurate, biased, or unsafe responses to user prompts. So we urge the community to use this model with proper safety testing and tuning tailored for their specific tasks.

Resources
- ⭐️ Learn about the latest updates with Granite: https://www.ibm.com/granite
- 📄 Get started with tutorials, best practices, and prompt engineering advice: https://www.ibm.com/granite/docs/
- 💡 Learn about the latest Granite learning resources: https://ibm.biz/granite-learning-resources
granite-4.0-h-1b-GGUF
This model was generated using llama.cpp at commit `16724b5b6`. Click here to get info on choosing the right GGUF model format

Model Summary: Granite-4.0-H-1B is a lightweight instruct model finetuned from Granite-4.0-H-1B-Base using a combination of open-source instruction datasets with permissive licenses and internally collected synthetic datasets. This model is developed using a diverse set of techniques including supervised finetuning, reinforcement learning, and model merging.

- Developers: Granite Team, IBM
- HF Collection: Granite 4.0 Nano Language Models HF Collection
- GitHub Repository: ibm-granite/granite-4.0-nano-language-models
- Website: Granite Docs
- Release Date: October 28, 2025
- License: Apache 2.0

Supported Languages: English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese. Users may fine-tune Granite 4.0 Nano models to support languages beyond those included in this list.

Intended use: Granite 4.0 Nano instruct models feature strong instruction-following capabilities, bringing advanced AI capabilities within reach for on-device deployments and research use cases. Additionally, their compact size makes them well-suited for fine-tuning on specialized domains without requiring massive compute resources.

Capabilities
- Summarization
- Text classification
- Text extraction
- Question-answering
- Retrieval Augmented Generation (RAG)
- Code-related tasks
- Function-calling tasks
- Multilingual dialog use cases
- Fill-In-the-Middle (FIM) code completions

Generation: This is a simple example of how to use the Granite-4.0-H-1B model. Then, copy the snippet from the section that is relevant for your use case.

Tool-calling: Granite-4.0-H-1B comes with enhanced tool-calling capabilities, enabling seamless integration with external functions and APIs. To define a list of tools, please follow OpenAI's function definition schema.
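As a sketch of that schema: the `get_weather` function and its fields below are hypothetical, shown only to illustrate the OpenAI function-definition layout the card refers to; a list like this is what you would pass to the chat template's tools argument.

```python
# Hypothetical tool list following OpenAI's function definition schema,
# as referenced above. The function name and parameters are illustrative only.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name."},
                },
                "required": ["city"],
            },
        },
    }
]
print(tools[0]["function"]["name"])  # get_weather
```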
This is an example of how to use the Granite-4.0-H-1B model's tool-calling ability:

Benchmarks (reported for the 350M Dense, H 350M Dense, 1B Dense, and H 1B Dense variants)

Multilingual benchmarks and the included languages:
- MMMLU (11): ar, de, en, es, fr, ja, ko, pt, zh, bn, hi
- INCLUDE (14): hi, bn, ta, te, ar, de, es, fr, it, ja, ko, nl, pt, zh

Model Architecture: The Granite-4.0-H-1B baseline is based on a decoder-only dense transformer architecture. Core components of this architecture are: GQA, Mamba2, MLP with SwiGLU, RMSNorm, and shared input/output embeddings.

| Model | 350M Dense | H 350M Dense | 1B Dense | H 1B Dense |
| --- | --- | --- | --- | --- |
| Number of layers | 28 attention | 4 attention / 28 Mamba2 | 40 attention | 4 attention / 36 Mamba2 |
| MLP / Shared expert hidden size | 2048 | 2048 | 4096 | 4096 |

Training Data: Overall, our SFT data is largely comprised of three key sources: (1) publicly available datasets with permissive licenses, (2) internal synthetic data targeting specific capabilities, and (3) a select set of human-curated data.

Infrastructure: We trained the Granite 4.0 Nano Language Models utilizing an NVIDIA GB200 NVL72 cluster hosted in CoreWeave. Intra-rack communication occurs via the 72-GPU NVLink domain, and a non-blocking, full Fat-Tree NDR 400 Gb/s InfiniBand network provides inter-rack communication. This cluster provides a scalable and efficient infrastructure for training our models over thousands of GPUs.

Ethical Considerations and Limitations: Granite 4.0 Nano Instruct models are primarily finetuned on instruction-response pairs, mostly in English but also including multilingual data covering multiple languages. Although this model can handle multilingual dialog use cases, its performance may not match that on English tasks. In such cases, introducing a small number of examples (few-shot) can help the model generate more accurate outputs. While this model has been aligned with safety in consideration, it may in some cases produce inaccurate, biased, or unsafe responses to user prompts.
So we urge the community to use this model with proper safety testing and tuning tailored for their specific tasks.

Resources
- ⭐️ Learn about the latest updates with Granite: https://www.ibm.com/granite
- 📄 Get started with tutorials, best practices, and prompt engineering advice: https://www.ibm.com/granite/docs/
- 💡 Learn about the latest Granite learning resources: https://ibm.biz/granite-learning-resources
Qwen3-Embedding-8B-GGUF
Qwen3-Reranker-8B-GGUF
Qwen3-Embedding-4B-GGUF
Holo1-7B-GGUF
gemma-3-4b-it-gguf
SWE-Dev-7B-GGUF
II-Medical-8B-1706-GGUF
Phi-4-mini-instruct.gguf
Qwen2.5-VL-32B-Instruct-GGUF
gemma-3-12b-it-gguf
neutts-air-GGUF
xLAM-2-3b-fc-r-GGUF
Qwen2.5-1.5B-Instruct-GGUF
granite-4.0-h-tiny-GGUF
This model was generated using llama.cpp at commit `ee09828cb`. I've been experimenting with a new quantization approach that selectively elevates the precision of key layers beyond what the default IMatrix configuration provides. In my testing, standard IMatrix quantization underperforms at lower bit depths, especially with Mixture of Experts (MoE) models. To address this, I'm using the `--tensor-type` option in `llama.cpp` to manually "bump" important layers to higher precision. You can see the implementation here: 👉 Layer bumping with llama.cpp. While this does increase model file size, it significantly improves precision for a given quantization level. I'd love your feedback—have you tried this? How does it perform for you?

Click here to get info on choosing the right GGUF model format

📣 Update [10-07-2025]: Added a default system prompt to the chat template to guide the model towards more professional, accurate, and safe responses.

Model Summary: Granite-4.0-H-Tiny is a 7B-parameter long-context instruct model finetuned from Granite-4.0-H-Tiny-Base using a combination of open-source instruction datasets with permissive licenses and internally collected synthetic datasets. This model is developed using a diverse set of techniques with a structured chat format, including supervised finetuning, model alignment using reinforcement learning, and model merging. Granite 4.0 instruct models feature improved instruction-following (IF) and tool-calling capabilities, making them more effective in enterprise applications.

- Developers: Granite Team, IBM
- HF Collection: Granite 4.0 Language Models HF Collection
- GitHub Repository: ibm-granite/granite-4.0-language-models
- Website: Granite Docs
- Release Date: October 2nd, 2025
- License: Apache 2.0

Supported Languages: English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese. Users may finetune Granite 4.0 models for languages beyond these.
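The per-tensor "bump" idea described at the top of this card — matching tensor names against patterns and raising the quantization type for sensitive layers — can be sketched in a few lines. This is a hypothetical illustration of the matching logic only, not llama.cpp's actual implementation; the patterns and quant types are illustrative assumptions.

```python
# Sketch of pattern-based per-tensor quant overrides (illustrative only).
import re

# (pattern, quant type) overrides, checked in order; first match wins.
OVERRIDES = [
    (r"token_embd", "Q8_0"),   # embeddings tend to be precision-sensitive
    (r"ffn_down",   "Q6_K"),   # bump FFN/MoE down-projections
    (r"attn_v",     "Q6_K"),   # value projections degrade fast at low bits
]

def pick_quant(tensor_name: str, default: str = "Q4_K_M") -> str:
    """Return the quant type for a tensor, honoring any override."""
    for pattern, qtype in OVERRIDES:
        if re.search(pattern, tensor_name):
            return qtype
    return default

print(pick_quant("blk.0.ffn_down.weight"))  # bumped to Q6_K
print(pick_quant("blk.0.ffn_up.weight"))    # stays at the default Q4_K_M
```

Only the overridden tensors grow in size, which is why the overall file-size increase stays modest relative to the precision gain.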
Intended use: The model is designed to respond to general instructions and can be used to build AI assistants for multiple domains, including business applications.

Capabilities
- Summarization
- Text classification
- Text extraction
- Question-answering
- Retrieval-Augmented Generation (RAG)
- Code-related tasks
- Function-calling tasks
- Multilingual dialog use cases
- Fill-In-the-Middle (FIM) code completions

Generation: This is a simple example of how to use the Granite-4.0-H-Tiny model. Copy the snippet from the section that is relevant for your use case.

Tool-calling: Granite-4.0-H-Tiny comes with enhanced tool-calling capabilities, enabling seamless integration with external functions and APIs. To define a list of tools, please follow OpenAI's function-definition schema. This is an example of how to use the Granite-4.0-H-Tiny model's tool-calling ability:

Multilingual benchmarks and the included languages:

| Benchmark | # Languages | Languages |
|---|---|---|
| MMMLU | 11 | ar, de, en, es, fr, ja, ko, pt, zh, bn, hi |
| INCLUDE | 14 | hi, bn, ta, te, ar, de, es, fr, it, ja, ko, nl, pt, zh |

Model Architecture: Granite-4.0-H-Tiny is built on a decoder-only MoE transformer architecture. Core components of this architecture are: GQA, Mamba2, MoEs with shared experts, SwiGLU activation, RMSNorm, and shared input/output embeddings.

| Model | Micro Dense | H Micro Dense | H Tiny MoE | H Small MoE |
|---|---|---|---|---|
| Number of layers | 40 attention | 4 attention / 36 Mamba2 | 4 attention / 36 Mamba2 | 4 attention / 36 Mamba2 |
| MLP / shared expert hidden size | 8192 | 8192 | 1024 | 1536 |

Training Data: Overall, our SFT data is largely comprised of three key sources: (1) publicly available datasets with permissive licenses, (2) internal synthetic data targeting specific capabilities, and (3) a select set of human-curated data.
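As a concrete illustration of the OpenAI function-definition schema referenced in the tool-calling section above, here is a minimal sketch. The weather function, its name, and its parameters are invented for the example; only the schema shape follows the OpenAI convention.

```python
# Minimal tool definition in OpenAI's function-definition schema.
# The function itself ("get_current_weather") is a made-up example.
import json

get_weather_tool = {
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
            },
            "required": ["city"],
        },
    },
}

# A list like this is what you would hand to the chat template
# (typically via a `tools=` argument) so the model can emit a call.
tools = [get_weather_tool]
print(json.dumps(tools, indent=2))
```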
Infrastructure: We trained the Granite 4.0 language models on an NVIDIA GB200 NVL72 cluster hosted by CoreWeave. Intra-rack communication occurs via the 72-GPU NVLink domain, and a non-blocking, full Fat-Tree NDR 400 Gb/s InfiniBand network provides inter-rack communication. This cluster provides a scalable and efficient infrastructure for training our models over thousands of GPUs.

Ethical Considerations and Limitations: Granite 4.0 instruct models are primarily finetuned using instruction-response pairs, mostly in English, but also multilingual data covering multiple languages. Although this model can handle multilingual dialog use cases, its performance might not match that on English tasks. In such cases, introducing a small number of examples (few-shot) can help the model generate more accurate outputs. While this model has been aligned with safety in mind, it may in some cases produce inaccurate, biased, or unsafe responses to user prompts. We therefore urge the community to use this model with proper safety testing and tuning tailored to their specific tasks.

Resources
- ⭐️ Learn about the latest updates with Granite: https://www.ibm.com/granite
- 📄 Get started with tutorials, best practices, and prompt engineering advice: https://www.ibm.com/granite/docs/
- 💡 Learn about the latest Granite learning resources: https://ibm.biz/granite-learning-resources
OpenThinker3-7B-GGUF
MediPhi-Instruct-GGUF
This model was generated using llama.cpp at commit `e2b7621e`. Click here to get info on choosing the right GGUF model format

The MediPhi Model Collection comprises seven small language models of 3.8B parameters, built from the base model `Phi-3.5-mini-instruct` and specialized for the medical and clinical domains. The collection is designed in a modular fashion. Five MediPhi experts are fine-tuned on various medical corpora (i.e., PubMed commercial, Medical Wikipedia, Medical Guidelines, Medical Coding, and open-source clinical documents) and merged back into their base model with the SLERP method to conserve general abilities. A sixth model combines all five experts into one general expert with the multi-model merging method BreadCrumbs. Finally, we clinically aligned this general expert using our large-scale MediFlow corpus (see dataset `microsoft/mediflow`) to obtain the final expert model `MediPhi-Instruct`. This model is `MediPhi-Instruct`, aligned to accomplish clinical NLP tasks.
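The SLERP merge mentioned above interpolates along the arc between two weight vectors rather than the straight line between them, which tends to preserve the geometry of both endpoints better than plain averaging. A minimal numpy sketch of the formula (not the actual MediPhi merge code):

```python
# Spherical linear interpolation (SLERP) between two flattened weight
# tensors. Illustrative sketch only — real merges apply this per tensor
# across two full checkpoints.
import numpy as np

def slerp(v0: np.ndarray, v1: np.ndarray, t: float, eps: float = 1e-8) -> np.ndarray:
    """Interpolate a fraction t of the way from v0 to v1 along the arc."""
    u0 = v0 / np.linalg.norm(v0)
    u1 = v1 / np.linalg.norm(v1)
    dot = np.clip(np.dot(u0, u1), -1.0, 1.0)
    omega = np.arccos(dot)                  # angle between the two vectors
    if omega < eps:                         # nearly parallel: fall back to lerp
        return (1 - t) * v0 + t * v1
    s = np.sin(omega)
    return (np.sin((1 - t) * omega) / s) * v0 + (np.sin(t * omega) / s) * v1

base = np.array([1.0, 0.0])
expert = np.array([0.0, 1.0])
mid = slerp(base, expert, 0.5)  # points along the 45° direction between them
```

At `t=0` the result is exactly the base tensor and at `t=1` exactly the expert, so `t` acts as a dial between general and specialized abilities.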
- Developed by: Microsoft Healthcare & Life Sciences
- Model type: Phi3
- Language(s) (NLP): English
- License: MIT
- Finetuned from model: `microsoft/MediPhi`, and originally from `microsoft/Phi-3.5-mini-instruct`
- Repository: Current HF repo
- Paper: A Modular Approach for Clinical SLMs Driven by Synthetic Data with Pre-Instruction Tuning, Model Merging, and Clinical-Tasks Alignment

Intended Uses

Primary Use Cases: The model is intended for research use in English, especially clinical natural language processing. It is suited to research settings that require:
- Medically adapted language models
- Memory/compute-constrained environments
- Latency-bound scenarios

Our model is designed to accelerate research on language models in medical and clinical scenarios. It should be used for research purposes, i.e., in a benchmarking context or with expert user verification of the outputs.

Use Case Considerations: Our models are not specifically designed or evaluated for all downstream purposes. Researchers (or developers) should consider common limitations of language models as they select use cases, and evaluate and mitigate for accuracy, safety, and fairness before using them within a specific downstream use case, particularly for high-risk scenarios. Researchers (or developers) should be aware of and adhere to applicable laws and regulations (including privacy and trade compliance laws) that are relevant to their use case. Nothing contained in this Model Card should be interpreted as or deemed a restriction or modification of the license the model is released under.

Responsible AI Considerations: Like other language models, the Phi family of models and the MediPhi collection can potentially behave in ways that are unfair, unreliable, or offensive. Some of the limiting behaviors to be aware of include:

- Quality of Service: The Phi and MediPhi models are trained primarily on English text and some additional multilingual text.
Languages other than English will experience worse performance, and English-language varieties with less representation in the training data might experience worse performance than standard American English.
- Multilingual performance and safety gaps: We believe it is important to make language models more widely available across different languages, but the Phi-3 models still exhibit challenges common across multilingual releases. As with any deployment of LLMs, developers will be better positioned to test for performance or safety gaps for their linguistic and cultural context and to customize the model with additional fine-tuning and appropriate safeguards.
- Representation of Harms & Perpetuation of Stereotypes: These models can over- or under-represent groups of people, erase representation of some groups, or reinforce demeaning or negative stereotypes. Despite safety post-training, these limitations may still be present due to differing levels of representation of different groups, cultural contexts, or the prevalence of examples of negative stereotypes in training data that reflect real-world patterns and societal biases.
- Inappropriate or Offensive Content: These models may produce other types of inappropriate or offensive content, which may make them inappropriate to deploy in sensitive contexts without additional mitigations specific to the use case.
- Information Reliability: Language models can generate nonsensical content or fabricate content that might sound reasonable but is inaccurate or outdated.
- Limited Scope for Code: The majority of Phi-3 training data is based in Python and uses common packages such as `typing`, `math`, `random`, `collections`, `datetime`, and `itertools`. If the model generates Python scripts that utilize other packages or scripts in other languages, we strongly recommend users manually verify all API uses.
- Long Conversation: Phi-3 models, like other models, can in some cases generate responses that are repetitive, unhelpful, or inconsistent in very long chat sessions in both English and non-English languages. Developers are encouraged to place appropriate mitigations, like limiting conversation turns to account for possible conversational drift.

Researchers should apply responsible AI best practices, including mapping, measuring, and mitigating risks associated with their specific use case and cultural and linguistic context. They are encouraged to rigorously evaluate the model for their use case, fine-tune the models when possible, and leverage the models as part of broader AI systems with language-specific safeguards in place. Important areas for consideration include:

- Allocation: Models may not be suitable for scenarios that could have consequential impact on legal status or the allocation of resources or life opportunities (e.g., housing, employment, credit) without further assessments and additional debiasing techniques.
- High-Risk Scenarios: Researchers should assess the suitability of using models in high-risk scenarios where unfair, unreliable, or offensive outputs might be extremely costly or lead to harm. This includes providing advice in sensitive or expert domains where accuracy and reliability are critical (e.g., legal or health advice). Additional safeguards should be implemented at the application level according to the deployment context.
- Misinformation: Models may produce inaccurate information. Researchers should follow transparency best practices and inform end-users they are interacting with an AI system. At the application level, developers can build feedback mechanisms and pipelines to ground responses in use-case-specific, contextual information, a technique known as Retrieval-Augmented Generation (RAG).
- Generation of Harmful Content: Researchers should assess outputs for their context and use available safety classifiers or custom solutions appropriate for their use case.
- Misuse: Other forms of misuse such as fraud, spam, or malware production may be possible, and developers should ensure that their applications do not violate applicable laws and regulations.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_name = "microsoft/MediPhi-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Operative Report:\nPerformed: Cholecystectomy\nOperative Findings: The gallbladder contained multiple stones and had thickening of its wall. Mild peritoneal fluid was noted."
messages = [
    {"role": "system", "content": "Extract medical keywords from this operative note; focus on anatomical, pathological, or procedural vocabulary."},
    {"role": "user", "content": prompt},
]

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)

generation_args = {
    "max_new_tokens": 500,
    "return_full_text": False,
    "temperature": 0.0,
    "do_sample": False,
}

output = pipe(messages, **generation_args)
print(output[0]["generated_text"])
# gallbladder stones, wall thickening, peritoneal fluid
```

Notes: If you want to use flash attention, call `AutoModelForCausalLM.from_pretrained()` with `attn_implementation="flash_attention_2"`. Check `microsoft/Phi-3.5-mini-instruct` for details about the tokenizer, requirements, and basic capabilities.

Continual Pre-training:
- PubMed (commercial subset) and abstracts from `ncbi/pubmed`
- Medical Guidelines: `epfl-llm/guidelines`
- Medical Wikipedia: `jpcorb20/medicalwikipedia`
- Medical Coding: ICD10CM, ICD10PROC, ICD9CM, ICD9PROC, and ATC
- Clinical documents:
  - `zhengyun21/PMC-Patients`, `akemiH/NoteChat`, and `starmpcc/Asclepius-Synthetic-Clinical-Notes` (only commercial-friendly licenses across all three datasets)
  - mtsamples

Modular training: five experts are made from the base model with pre-instruction tuning, merged into one model, and finally clinically aligned. See the paper for details.

| | Phi-3.5-mini-instruct | PubMed | Clinical | MedWiki | Guideline | MedCode | MediPhi | MediPhi-Instruct |
|:-------------|----------------------:|-------:|---------:|--------:|----------:|--------:|--------:|-----------------:|
| MedNLI | 66.6 | 68.3 | 69.2 | 72.8 | 70.3 | 68.5 | 66.9 | 71.0 |
| PLS | 28.4 | 29.2 | 29.4 | 29.2 | 29.8 | 22.3 | 28.8 | 26.0 |
| MeQSum | 36.7 | 37.6 | 38.1 | 37.6 | 37.6 | 33.5 | 37.9 | 42.8 |
| LH | 45.9 | 45.7 | 43.5 | 43.6 | 41.1 | 45.7 | 45.7 | 45.0 |
| MeDiSumQA | 25.9 | 26.3 | 26.7 | 25.1 | 25.1 | 23.6 | 26.1 | 29.1 |
| MeDiSumCode | 41.1 | 41.0 | 40.5 | 41.7 | 41.9 | 39.0 | 41.7 | 37.2 |
| RRS QA | 41.2 | 44.1 | 52.1 | 46.7 | 48.9 | 45.6 | 44.5 | 61.6 |
| MedicationQA | 11.2 | 10.3 | 12.0 | 12.2 | 11.9 | 12.0 | 11.3 | 19.3 |
| MEDEC | 14.8 | 22.2 | 34.5 | 28.8 | 28.3 | 18.1 | 29.1 | 34.4 |
| ACI | 42.3 | 42.7 | 43.9 | 44.7 | 44.7 | 39.0 | 44.3 | 43.5 |
| SDoH | 35.1 | 35.8 | 35.8 | 43.6 | 41.0 | 24.8 | 39.7 | 56.7 |
| ICD10CM | 49.3 | 49.5 | 49.6 | 50.2 | 49.8 | 68.7 | 55.5 | 54.9 |
| Average | 36.5 | 37.7 | 39.6 | 39.7 | 39.2 | 36.7 | 39.3 | 43.4 |

New real-world benchmarking also demonstrated good performance on clinical information extraction tasks: arXiv:2507.05517. We carried out a Medical Red Teaming Protocol of Language Models in which we demonstrate broad conservation of the original Phi-3.5 safety abilities (see Phi-3 Safety Post-Training). All six merged MediPhi models fully conserve their base model's safety capabilities. `MediPhi-Instruct` conserved safe behaviors towards jailbreaking and harmfulness, and improved considerably on groundedness.
We further demonstrate safe behavior in refusing, or giving warnings with limited responses to, nearly all harmful queries from clinician and patient user perspectives, based on MedSafetyBench and our PatientSafetyBench.

Phi-3.5-mini has 3.8B parameters and is a dense decoder-only Transformer model using the same tokenizer as Phi-3 Mini. It is best suited for prompts using the chat format, but plain text is also possible. The default context length is 128K tokens. Note that by default, the Phi-3.5-mini-instruct model uses flash attention, which requires certain types of GPU hardware to run. We have tested on the following GPU types:
- NVIDIA A100
- NVIDIA A6000
- NVIDIA H100

If you want to run the model on NVIDIA V100 or earlier-generation GPUs, call `AutoModelForCausalLM.from_pretrained()` with `attn_implementation="eager"`.

Trademarks: This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.

@article{corbeil2025modular,
  title={A Modular Approach for Clinical SLMs Driven by Synthetic Data with Pre-Instruction Tuning, Model Merging, and Clinical-Tasks Alignment},
  author={Corbeil, Jean-Philippe and Dada, Amin and Attendu, Jean-Michel and Abacha, Asma Ben and Sordoni, Alessandro and Caccia, Lucas and Beaulieu, Fran{\c{c}}ois and Lin, Thomas and Kleesiek, Jens and Vozila, Paul},
  journal={arXiv preprint arXiv:2505.10717},
  year={2025}
}
EXAONE-Deep-7.8B-GGUF
Phi-4-mini-reasoning-GGUF
Hermes-4-70B-GGUF
This model was generated using llama.cpp at commit `408ff524`. Click here to get info on choosing the right GGUF model format

Hermes 4 70B is a frontier, hybrid-mode reasoning model based on Llama-3.1-70B by Nous Research that is aligned to you. Read the Hermes 4 technical report here: Hermes 4 Technical Report. Chat with Hermes in Nous Chat: https://chat.nousresearch.com

Training highlights include a newly synthesized post-training corpus emphasizing verified reasoning traces; massive improvements in math, code, STEM, logic, creativity, and format-faithful outputs; and preserved general assistant quality and broadly neutral alignment.

- Post-training corpus: massively increased dataset size, from 1M samples and 1.2B tokens to ~5M samples / ~60B tokens, blended across reasoning and non-reasoning data.
- Hybrid reasoning mode with explicit `<think> … </think>` segments when the model decides to deliberate, and options to make responses faster when you want.
- Top-quality, expressive reasoning that improves math, code, STEM, logic, and even creative writing and subjective responses.
- Schema adherence & structured outputs: trained to produce valid JSON for given schemas and to repair malformed objects.
- Much easier to steer and align: extreme improvements in steerability, especially reduced refusal rates.
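A minimal client-side sketch of handling Hermes-style tagged output, assuming the `<think>` reasoning and `<tool_call>` JSON tag names of the Hermes format; the sample completion and tool name are invented for illustration:

```python
# Strip <think>...</think> deliberation and extract <tool_call> JSON blocks
# from a raw completion. Tag names are assumptions based on the Hermes format.
import json
import re

def parse_hermes(text: str):
    """Return (visible_text, list_of_tool_calls) from a raw completion."""
    # Drop the deliberation segment entirely.
    visible = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)
    # Collect tool calls as parsed JSON objects.
    calls = [
        json.loads(m)
        for m in re.findall(r"<tool_call>(.*?)</tool_call>", visible, flags=re.DOTALL)
    ]
    visible = re.sub(r"<tool_call>.*?</tool_call>", "", visible, flags=re.DOTALL).strip()
    return visible, calls

raw = (
    '<think>user wants weather</think>Sure.'
    '<tool_call>{"name": "get_weather", "arguments": {"city": "Paris"}}</tool_call>'
)
text, calls = parse_hermes(raw)
```

Since the tags are added tokens, a streaming client can use the same tag boundaries to switch between "reasoning", "visible text", and "tool call" states without buffering the whole response.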
In pursuit of the mission of producing models that are open, steerable, and capable of producing the full range of human expression, while being able to be aligned to your values, we created a new benchmark, RefusalBench, that tests a model's willingness to be helpful in a variety of scenarios commonly disallowed by closed and open models. Hermes 4 achieves SOTA on RefusalBench across all popular closed and open models in being helpful and conforming to your values, without censorship.

> Full tables, settings, and comparisons are in the technical report.

Hermes 4 uses the Llama-3-Chat format with role headers and special tags. Reasoning mode can be activated with the chat template via the flag `thinking=True` or by using the corresponding system prompt. Note that you can add any additional system instructions before or after this system message, and it will adjust the model's policies, style, and effort of thinking, as well as its post-thinking style, format, identity, and more. You may also interleave the tool-definition system message with the reasoning one. Additionally, we provide a flag to keep the content in between the `<think> ... </think>` tags, which you can enable by setting `keep_cots=True`.

Hermes 4 supports function/tool calls within a single assistant turn, produced after its reasoning. Note that you may also simply place tool definitions into the "tools:" field of your messages, and the chat template will parse them and create the system prompt for you. This also works with reasoning mode for improved accuracy of tool use. The model will then generate tool calls within `<tool_call> ... </tool_call>` tags, for easy parsing. The tool-call tags are also added tokens, which makes parsing easy while streaming! There are also automatic tool parsers built into vLLM and SGLang for Hermes; just set the tool parser in vLLM to `hermes` and in SGLang to `qwen25`.

- Sampling defaults that work well: `temperature=0.6, top_p=0.95, top_k=20`.
- Template: use the Llama chat format for Hermes 4 70B and 405B as shown above, or set `add_generation_prompt=True` when using `tokenizer.apply_chat_template(...)`.

For production serving on multi-GPU nodes, consider tensor-parallel inference engines (e.g., SGLang/vLLM backends) with prefix caching.

Hermes 4 is available as original BF16 weights as well as FP8 and GGUF variants:
- FP8: https://huggingface.co/NousResearch/Hermes-4-70B-FP8
- GGUF (courtesy of the LM Studio team!): https://huggingface.co/lmstudio-community/Hermes-4-70B-GGUF

Hermes 4 is also available in other sizes with similar prompt formats. See the Hermes 4 collection to explore them all: https://huggingface.co/collections/NousResearch/hermes-4-collection-68a731bfd452e20816725728
KernelLLM-GGUF
Lingshu-32B-GGUF
This model was generated using llama.cpp at commit `238005c2`. Click here to get info on choosing the right GGUF model format

Website | 🤖 7B Model | 🤖 32B Model | MedEvalKit | Technical Report

Lingshu – SOTA Multimodal Large Language Models for the Medical Domain

BIG NEWS: Lingshu is released with state-of-the-art performance on medical VQA tasks and report generation. This repository contains the model of the paper Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning. We also release a comprehensive medical evaluation toolkit, MedEvalKit, which supports fast evaluation of major multimodal and textual medical tasks.

Highlights
- Lingshu models achieve SOTA on most medical multimodal/textual QA and report generation tasks for the 7B and 32B model sizes.
- Lingshu-32B outperforms GPT-4.1 and Claude Sonnet 4 in most multimodal QA and report generation tasks.
- Lingshu supports more than 12 medical imaging modalities, including X-Ray, CT Scan, MRI, Microscopy, Ultrasound, Histopathology, Dermoscopy, Fundus, OCT, Digital Photography, Endoscopy, and PET.

Release
- Technical report: arXiv: Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning.
- Model weights: Lingshu-7B, Lingshu-32B

> Disclaimer:
> We must note that even though the weights, code, and demos are released openly, similar to other pre-trained language models, and despite our best efforts in red teaming, safety fine-tuning, and enforcement, our models come with potential risks, including but not limited to inaccurate, misleading, or potentially harmful generation.
> Developers and stakeholders should perform their own red teaming and put related security measures in place before deployment, and they must abide by and comply with local governance and regulations.
> In no event shall the authors be held liable for any claim, damages, or other liability arising from the use of the released weights, code, or demos.

Medical multimodal QA:

| Models | MMMU-Med | VQA-RAD | SLAKE | PathVQA | PMC-VQA | OmniMedVQA | MedXpertQA | Avg. |
| :--- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| Claude Sonnet 4 | 74.6 | 67.6 | 70.6 | 54.2 | 54.4 | 65.5 | 43.3 | 61.5 |
| Gemini-2.5-Flash | 76.9 | 68.5 | 75.8 | 55.4 | 55.4 | 71.0 | 52.8 | 65.1 |
| MedVLM-R1-2B | 35.2 | 48.6 | 56.0 | 32.5 | 47.6 | 77.7 | 20.4 | 45.4 |
| MedGemma-4B-IT | 43.7 | 72.5 | 76.4 | 48.8 | 49.9 | 69.8 | 22.3 | 54.8 |
| LLaVA-Med-7B | 29.3 | 53.7 | 48.0 | 38.8 | 30.5 | 44.3 | 20.3 | 37.8 |
| HuatuoGPT-V-7B | 47.3 | 67.0 | 67.8 | 48.0 | 53.3 | 74.2 | 21.6 | 54.2 |
| BioMediX2-8B | 39.8 | 49.2 | 57.7 | 37.0 | 43.5 | 63.3 | 21.8 | 44.6 |
| Qwen2.5VL-7B | 50.6 | 64.5 | 67.2 | 44.1 | 51.9 | 63.6 | 22.3 | 52.0 |
| InternVL2.5-8B | 53.5 | 59.4 | 69.0 | 42.1 | 51.3 | 81.3 | 21.7 | 54.0 |
| InternVL3-8B | 59.2 | 65.4 | 72.8 | 48.6 | 53.8 | 79.1 | 22.4 | 57.3 |
| HealthGPT-14B | 49.6 | 65.0 | 66.1 | 56.7 | 56.4 | 75.2 | 24.7 | 56.2 |
| HuatuoGPT-V-34B | 51.8 | 61.4 | 69.5 | 44.4 | 56.6 | 74.0 | 22.1 | 54.3 |
| InternVL3-14B | 63.1 | 66.3 | 72.8 | 48.0 | 54.1 | 78.9 | 23.1 | 58.0 |
| Qwen2.5VL-32B | 59.6 | 71.8 | 71.2 | 41.9 | 54.5 | 68.2 | 25.2 | 56.1 |
| InternVL2.5-38B | 61.6 | 61.4 | 70.3 | 46.9 | 57.2 | 79.9 | 24.4 | 57.4 |
| InternVL3-38B | 65.2 | 65.4 | 72.7 | 51.0 | 56.6 | 79.8 | 25.2 | 59.4 |
| Lingshu-32B | 62.3 | 76.5 | 89.2 | 65.9 | 57.9 | 83.4 | 30.9 | 66.6 |

Medical textual QA:

| Models | MMLU-Med | PubMedQA | MedMCQA | MedQA | Medbullets | MedXpertQA | SuperGPQA-Med | Avg. |
| :--- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| Claude Sonnet 4 | 91.3 | 78.6 | 79.3 | 92.1 | 80.2 | 33.6 | 56.3 | 73.1 |
| Gemini-2.5-Flash | 84.2 | 73.8 | 73.6 | 91.2 | 77.6 | 35.6 | 53.3 | 69.9 |
| MedVLM-R1-2B | 51.8 | 66.4 | 39.7 | 42.3 | 33.8 | 11.8 | 19.1 | 37.8 |
| MedGemma-4B-IT | 66.7 | 72.2 | 52.2 | 56.2 | 45.6 | 12.8 | 21.6 | 46.8 |
| LLaVA-Med-7B | 50.6 | 26.4 | 39.4 | 42.0 | 34.4 | 9.9 | 16.1 | 31.3 |
| HuatuoGPT-V-7B | 69.3 | 72.8 | 51.2 | 52.9 | 40.9 | 10.1 | 21.9 | 45.6 |
| BioMediX2-8B | 68.6 | 75.2 | 52.9 | 58.9 | 45.9 | 13.4 | 25.2 | 48.6 |
| Qwen2.5VL-7B | 73.4 | 76.4 | 52.6 | 57.3 | 42.1 | 12.8 | 26.3 | 48.7 |
| InternVL2.5-8B | 74.2 | 76.4 | 52.4 | 53.7 | 42.4 | 11.6 | 26.1 | 48.1 |
| InternVL3-8B | 77.5 | 75.4 | 57.7 | 62.1 | 48.5 | 13.1 | 31.2 | 52.2 |
| HealthGPT-14B | 80.2 | 68.0 | 63.4 | 66.2 | 39.8 | 11.3 | 25.7 | 50.7 |
| HuatuoGPT-V-34B | 74.7 | 72.2 | 54.7 | 58.8 | 42.7 | 11.4 | 26.5 | 48.7 |
| InternVL3-14B | 81.7 | 77.2 | 62.0 | 70.1 | 49.5 | 14.1 | 37.9 | 56.1 |
| Qwen2.5VL-32B | 83.2 | 68.4 | 63.0 | 71.6 | 54.2 | 15.6 | 37.6 | 56.2 |
| InternVL2.5-38B | 84.6 | 74.2 | 65.9 | 74.4 | 55.0 | 14.7 | 39.9 | 58.4 |
| InternVL3-38B | 83.8 | 73.2 | 64.9 | 73.5 | 54.6 | 16.0 | 42.5 | 58.4 |
| Lingshu-32B | 84.7 | 77.8 | 66.1 | 74.7 | 65.4 | 22.7 | 41.1 | 61.8 |

Medical report generation (the five metrics are repeated for three report-generation benchmarks):

| Models | ROUGE-L | CIDEr | RaTE | SembScore | RadCliQ-v1⁻¹ | ROUGE-L | CIDEr | RaTE | SembScore | RadCliQ-v1⁻¹ | ROUGE-L | CIDEr | RaTE | SembScore | RadCliQ-v1⁻¹ |
| :--- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| GPT-4.1 | 9.0 | 82.8 | 51.3 | 23.9 | 57.1 | 24.5 | 78.8 | 45.5 | 23.2 | 45.5 | 30.2 | 124.6 | 51.3 | 47.5 | 80.3 |
| Claude Sonnet 4 | 20.0 | 56.6 | 45.6 | 19.7 | 53.4 | 22.0 | 59.5 | 43.5 | 18.9 | 43.3 | 25.4 | 88.3 | 55.4 | 41.0 | 72.1 |
| Gemini-2.5-Flash | 25.4 | 80.7 | 50.3 | 29.7 | 59.4 | 23.6 | 72.2 | 44.3 | 27.4 | 44.0 | 33.5 | 129.3 | 55.6 | 50.9 | 91.6 |
| Med-R1-2B | 19.3 | 35.4 | 40.6 | 14.8 | 42.4 | 18.6 | 37.1 | 38.5 | 17.8 | 37.6 | 16.1 | 38.3 | 41.4 | 12.5 | 43.6 |
| MedVLM-R1-2B | 20.3 | 40.1 | 41.6 | 14.2 | 48.3 | 20.9 | 43.5 | 38.9 | 15.5 | 40.9 | 22.7 | 61.1 | 46.1 | 22.7 | 54.3 |
| MedGemma-4B-IT | 25.6 | 81.0 | 52.4 | 29.2 | 62.9 | 27.1 | 79.0 | 47.2 | 29.3 | 46.6 | 30.8 | 103.6 | 57.0 | 46.8 | 86.7 |
| LLaVA-Med-7B | 15.0 | 43.4 | 12.8 | 18.3 | 52.9 | 18.4 | 45.5 | 38.8 | 23.5 | 44.0 | 18.8 | 68.2 | 40.9 | 16.0 | 58.1 |
| HuatuoGPT-V-7B | 23.4 | 69.5 | 48.9 | 20.0 | 48.2 | 21.3 | 64.7 | 44.2 | 19.3 | 39.4 | 29.6 | 104.3 | 52.9 | 40.7 | 63.6 |
| BioMediX2-8B | 20.0 | 52.8 | 44.4 | 17.7 | 53.0 | 18.1 | 47.9 | 40.8 | 21.6 | 43.3 | 19.6 | 58.8 | 40.1 | 11.6 | 53.8 |
| Qwen2.5VL-7B | 24.1 | 63.7 | 47.0 | 18.4 | 55.1 | 22.2 | 62.0 | 41.0 | 17.2 | 43.1 | 26.5 | 78.1 | 48.4 | 36.3 | 66.1 |
| InternVL2.5-8B | 23.2 | 61.8 | 47.0 | 21.0 | 56.2 | 20.6 | 58.5 | 43.1 | 19.7 | 42.7 | 24.8 | 75.4 | 51.1 | 36.7 | 67.0 |
| InternVL3-8B | 22.9 | 66.2 | 48.2 | 21.5 | 55.1 | 20.9 | 65.4 | 44.3 | 25.2 | 43.7 | 22.9 | 76.2 | 51.2 | 31.3 | 59.9 |
| Lingshu-7B | 30.8 | 109.4 | 52.1 | 30.0 | 69.2 | 26.5 | 79.0 | 45.4 | 26.8 | 47.3 | 41.2 | 180.7 | 57.6 | 48.4 | 108.1 |
| HealthGPT-14B | 21.4 | 64.7 | 48.4 | 16.5 | 52.7 | 20.6 | 66.2 | 44.4 | 22.7 | 42.6 | 22.9 | 81.9 | 50.8 | 16.6 | 56.9 |
| HuatuoGPT-V-34B | 23.5 | 68.5 | 48.5 | 23.0 | 47.1 | 22.5 | 62.8 | 42.9 | 22.1 | 39.7 | 28.2 | 108.3 | 54.4 | 42.2 | 59.3 |
| MedDr-40B | 15.7 | 62.3 | 45.2 | 12.2 | 47.0 | 24.1 | 66.1 | 44.7 | 24.2 | 44.7 | 19.4 | 62.9 | 40.3 | 7.3 | 48.9 |
| InternVL3-14B | 22.0 | 63.7 | 48.6 | 17.4 | 46.5 | 20.4 | 60.2 | 44.1 | 20.7 | 39.4 | 24.8 | 93.7 | 55.0 | 38.7 | 55.0 |
| Qwen2.5VL-32B | 15.7 | 50.2 | 47.5 | 17.1 | 45.2 | 15.2 | 54.8 | 43.4 | 18.5 | 40.3 | 18.9 | 73.3 | 51.3 | 38.1 | 54.0 |
| InternVL2.5-38B | 22.7 | 61.4 | 47.5 | 18.2 | 54.9 | 21.6 | 60.6 | 42.6 | 20.3 | 45.4 | 28.9 | 96.5 | 53.5 | 38.5 | 69.7 |
| InternVL3-38B | 22.8 | 64.6 | 47.9 | 18.1 | 47.2 | 20.5 | 62.7 | 43.8 | 20.2 | 39.4 | 25.5 | 90.7 | 53.5 | 33.1 | 55.2 |
| Lingshu-32B | 28.8 | 96.4 | 50.8 | 30.1 | 67.1 | 25.3 | 75.9 | 43.4 | 24.2 | 47.1 | 42.8 | 189.2 | 63.5 | 54.6 | 130.4 |

If you find our project useful, we hope you will kindly star our repo and cite our work. In the author list, `` marks equal contributions and `^` marks corresponding authors.

Help me test my AI-Powered Quantum Network Monitor Assistant with quantum-ready security checks: The full open-source code for the Quantum Network Monitor service is available at my GitHub repos (repos with NetworkMonitor in the name): Source Code Quantum Network Monitor.
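As a sanity check on the tables above, the Avg. columns appear to be plain means of the per-benchmark scores. For example, for Lingshu-32B's multimodal-QA row:

```python
# Recompute the Avg. cell of the Lingshu-32B multimodal-QA row from its
# seven per-benchmark scores (MMMU-Med through MedXpertQA).
scores = [62.3, 76.5, 89.2, 65.9, 57.9, 83.4, 30.9]
avg = round(sum(scores) / len(scores), 1)
print(avg)  # 66.6, matching the table
```

The same arithmetic reproduces the other Avg. cells in the two QA tables.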
You will also find the code I use to quantize the models, if you want to do it yourself: GGUFModelBuilder

💬 How to test:
Choose an AI assistant type:
- `TurboLLM` (GPT-4.1-mini)
- `HugLLM` (Hugging Face open-source models)
- `TestLLM` (Experimental CPU-only)

What I'm Testing
I'm pushing the limits of small open-source models for AI network monitoring, specifically:
- Function calling against live network services
- How small can a model go while still handling:
  - Automated Nmap security scans
  - Quantum-readiness checks
  - Network monitoring tasks

🟡 TestLLM – Current experimental model (llama.cpp on 2 CPU threads on a Hugging Face Docker space):
- ✅ Zero-configuration setup
- ⏳ 30s load time (slow inference but no API costs). No token limits, as the cost is low.
- 🔧 Help wanted! If you're into edge-device AI, let's collaborate!
gemma-3-270m-it-GGUF
This model was generated using llama.cpp at commit `79c1160b`. Click here to get info on choosing the right GGUF model format.

[Gemma 3 Technical Report][g3-tech-report] | [Responsible Generative AI Toolkit][rai-toolkit] | [Gemma on Kaggle][kaggle-gemma] | [Gemma on Vertex Model Garden][vertex-mg-gemma3]

Summary description and brief definition of inputs and outputs. Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models. Gemma 3 models are multimodal, handling text and image input and generating text output, with open weights for both pre-trained and instruction-tuned variants. Gemma 3 has a large, 128K context window, multilingual support in over 140 languages, and is available in more sizes than previous versions. Gemma 3 models are well suited to a variety of text generation and image understanding tasks, including question answering, summarization, and reasoning. Their relatively small size makes it possible to deploy them in environments with limited resources such as laptops, desktops, or your own cloud infrastructure, democratizing access to state-of-the-art AI models and helping foster innovation for everyone.

- Input:
  - Text string, such as a question, a prompt, or a document to be summarized
  - Images, normalized to 896 x 896 resolution and encoded to 256 tokens each, for the 4B, 12B, and 27B sizes
  - Total input context of 128K tokens for the 4B, 12B, and 27B sizes, and 32K tokens for the 1B and 270M sizes
- Output:
  - Generated text in response to the input, such as an answer to a question, analysis of image content, or a summary of a document
  - Total output context of up to 128K tokens for the 4B, 12B, and 27B sizes, and 32K tokens for the 1B and 270M sizes per request, minus the request's input tokens

Data used for model training and how the data was processed. These models were trained on a dataset of text data that includes a wide variety of sources.
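The input/output budgeting described above can be sketched as a small helper. This is an illustrative sketch, not an official API: it assumes "K" means 1024 tokens and uses the card's figures of 256 tokens per image (4B/12B/27B only), with output budget equal to the context window minus the input tokens.

```python
# Illustrative sketch (assumed, not an official API) of the Gemma 3 context
# budget described above. Assumes "K" = 1024 tokens.
CONTEXT = {"270M": 32 * 1024, "1B": 32 * 1024,
           "4B": 128 * 1024, "12B": 128 * 1024, "27B": 128 * 1024}
IMAGE_TOKENS = 256  # per image, per the card, for 4B/12B/27B

def max_output_tokens(size, text_tokens, num_images=0):
    # Image input is only described for the 4B, 12B, and 27B sizes.
    if num_images and size in ("270M", "1B"):
        raise ValueError("image input is only available on 4B/12B/27B")
    used = text_tokens + num_images * IMAGE_TOKENS
    return max(CONTEXT[size] - used, 0)

print(max_output_tokens("270M", 1000))              # 31768
print(max_output_tokens("27B", 500, num_images=2))  # 130060
```

So a 27B request with 500 text tokens and two images consumes 1012 input tokens, leaving roughly 130K tokens of output budget under these assumptions.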
The 27B model was trained with 14 trillion tokens, the 12B model with 12 trillion tokens, the 4B model with 4 trillion tokens, the 1B with 2 trillion tokens, and the 270M with 6 trillion tokens. The knowledge cutoff date for the training data was August 2024. Here are the key components:

- Web Documents: A diverse collection of web text ensures the model is exposed to a broad range of linguistic styles, topics, and vocabulary. The training dataset includes content in over 140 languages.
- Code: Exposing the model to code helps it learn the syntax and patterns of programming languages, which improves its ability to generate code and understand code-related questions.
- Mathematics: Training on mathematical text helps the model learn logical reasoning and symbolic representation, and to address mathematical queries.
- Images: A wide range of images enables the model to perform image analysis and visual data extraction tasks.

The combination of these diverse data sources is crucial for training a powerful multimodal model that can handle a wide variety of tasks and data formats. Here are the key data cleaning and filtering methods applied to the training data:

- CSAM Filtering: Rigorous CSAM (Child Sexual Abuse Material) filtering was applied at multiple stages in the data preparation process to ensure the exclusion of harmful and illegal content.
- Sensitive Data Filtering: As part of making Gemma pre-trained models safe and reliable, automated techniques were used to filter out certain personal information and other sensitive data from training sets.
- Additional methods: Filtering based on content quality and safety in line with [our policies][safety-policies].

Gemma was trained using [Tensor Processing Unit (TPU)][tpu] hardware (TPUv4p, TPUv5p, and TPUv5e). Training vision-language models (VLMs) requires significant computational power.
TPUs, designed specifically for the matrix operations common in machine learning, offer several advantages in this domain:

- Performance: TPUs are specifically designed to handle the massive computations involved in training VLMs. They can speed up training considerably compared to CPUs.
- Memory: TPUs often come with large amounts of high-bandwidth memory, allowing for large models and batch sizes during training. This can lead to better model quality.
- Scalability: TPU Pods (large clusters of TPUs) provide a scalable solution for handling the growing complexity of large foundation models. You can distribute training across multiple TPU devices for faster and more efficient processing.
- Cost-effectiveness: In many scenarios, TPUs can provide a more cost-effective solution for training large models compared to CPU-based infrastructure, especially when considering the time and resources saved by faster training.

These advantages are aligned with [Google's commitments to operate sustainably][sustainability]. Training was done using [JAX][jax] and [ML Pathways][ml-pathways]. JAX allows researchers to take advantage of the latest generation of hardware, including TPUs, for faster and more efficient training of large models. ML Pathways is Google's latest effort to build artificially intelligent systems capable of generalizing across multiple tasks. This is especially suitable for foundation models, including large language models like these. Together, JAX and ML Pathways are used as described in the [paper about the Gemini family of models][gemini-2-paper]: "the 'single controller' programming model of JAX and Pathways allows a single Python process to orchestrate the entire training run, dramatically simplifying the development workflow." These models were evaluated against a large collection of different datasets and metrics to cover different aspects of text generation. Evaluation results marked with IT are for instruction-tuned models.
Evaluation results marked with PT are for pre-trained models.

| Benchmark | n-shot | Gemma 3 PT 270M |
| :--- | :---: | ---: |
| [HellaSwag][hellaswag] | 10-shot | 40.9 |
| [BoolQ][boolq] | 0-shot | 61.4 |
| [PIQA][piqa] | 0-shot | 67.7 |
| [TriviaQA][triviaqa] | 5-shot | 15.4 |
| [ARC-c][arc] | 25-shot | 29.0 |
| [ARC-e][arc] | 0-shot | 57.7 |
| [WinoGrande][winogrande] | 5-shot | 52.0 |

[hellaswag]: https://arxiv.org/abs/1905.07830
[boolq]: https://arxiv.org/abs/1905.10044
[piqa]: https://arxiv.org/abs/1911.11641
[triviaqa]: https://arxiv.org/abs/1705.03551
[arc]: https://arxiv.org/abs/1911.01547
[winogrande]: https://arxiv.org/abs/1907.10641

| Benchmark | n-shot | Gemma 3 IT 270M |
| :--- | :---: | ---: |
| [HellaSwag][hellaswag] | 0-shot | 37.7 |
| [PIQA][piqa] | 0-shot | 66.2 |
| [ARC-c][arc] | 0-shot | 28.2 |
| [WinoGrande][winogrande] | 0-shot | 52.3 |
| [BIG-Bench Hard][bbh] | few-shot | 26.7 |
| [IF Eval][ifeval] | 0-shot | 51.2 |

[bbh]: https://paperswithcode.com/dataset/bbh
[ifeval]: https://arxiv.org/abs/2311.07911

| Benchmark | n-shot | Gemma 3 IT 1B | Gemma 3 IT 4B | Gemma 3 IT 12B | Gemma 3 IT 27B |
| :--- | :---: | ---: | ---: | ---: | ---: |
| [GPQA][gpqa] Diamond | 0-shot | 19.2 | 30.8 | 40.9 | 42.4 |
| [SimpleQA][simpleqa] | 0-shot | 2.2 | 4.0 | 6.3 | 10.0 |
| [FACTS Grounding][facts-grdg] | - | 36.4 | 70.1 | 75.8 | 74.9 |
| [BIG-Bench Hard][bbh] | 0-shot | 39.1 | 72.2 | 85.7 | 87.6 |
| [BIG-Bench Extra Hard][bbeh] | 0-shot | 7.2 | 11.0 | 16.3 | 19.3 |
| [IFEval][ifeval] | 0-shot | 80.2 | 90.2 | 88.9 | 90.4 |

| Benchmark | n-shot | Gemma 3 PT 1B | Gemma 3 PT 4B | Gemma 3 PT 12B | Gemma 3 PT 27B |
| :--- | :---: | ---: | ---: | ---: | ---: |
| [HellaSwag][hellaswag] | 10-shot | 62.3 | 77.2 | 84.2 | 85.6 |
| [BoolQ][boolq] | 0-shot | 63.2 | 72.3 | 78.8 | 82.4 |
| [PIQA][piqa] | 0-shot | 73.8 | 79.6 | 81.8 | 83.3 |
| [SocialIQA][socialiqa] | 0-shot | 48.9 | 51.9 | 53.4 | 54.9 |
| [TriviaQA][triviaqa] | 5-shot | 39.8 | 65.8 | 78.2 | 85.5 |
| [Natural Questions][naturalq] | 5-shot | 9.48 | 20.0 | 31.4 | 36.1 |
| [ARC-c][arc] | 25-shot | 38.4 | 56.2 | 68.9 | 70.6 |
| [ARC-e][arc] | 0-shot | 73.0 | 82.4 | 88.3 | 89.0 |
| [WinoGrande][winogrande] | 5-shot | 58.2 | 64.7 | 74.3 | 78.8 |
| [BIG-Bench Hard][bbh] | few-shot | 28.4 | 50.9 | 72.6 | 77.7 |
| [DROP][drop] | 1-shot | 42.4 | 60.1 | 72.2 | 77.2 |

[gpqa]: https://arxiv.org/abs/2311.12022
[simpleqa]: https://arxiv.org/abs/2411.04368
[facts-grdg]: https://goo.gle/FACTSpaper
[bbeh]: https://github.com/google-deepmind/bbeh
[socialiqa]: https://arxiv.org/abs/1904.09728
[naturalq]: https://github.com/google-research-datasets/natural-questions
[drop]: https://arxiv.org/abs/1903.00161

| Benchmark | n-shot | Gemma 3 IT 1B | Gemma 3 IT 4B | Gemma 3 IT 12B | Gemma 3 IT 27B |
| :--- | :---: | ---: | ---: | ---: | ---: |
| [MMLU][mmlu] (Pro) | 0-shot | 14.7 | 43.6 | 60.6 | 67.5 |
| [LiveCodeBench][lcb] | 0-shot | 1.9 | 12.6 | 24.6 | 29.7 |
| [Bird-SQL][bird-sql] (dev) | - | 6.4 | 36.3 | 47.9 | 54.4 |
| [Math][math] | 0-shot | 48.0 | 75.6 | 83.8 | 89.0 |
| HiddenMath | 0-shot | 15.8 | 43.0 | 54.5 | 60.3 |
| [MBPP][mbpp] | 3-shot | 35.2 | 63.2 | 73.0 | 74.4 |
| [HumanEval][humaneval] | 0-shot | 41.5 | 71.3 | 85.4 | 87.8 |
| [Natural2Code][nat2code] | 0-shot | 56.0 | 70.3 | 80.7 | 84.5 |
| [GSM8K][gsm8k] | 0-shot | 62.8 | 89.2 | 94.4 | 95.9 |

| Benchmark | n-shot | Gemma 3 PT 4B | Gemma 3 PT 12B | Gemma 3 PT 27B |
| :--- | :---: | ---: | ---: | ---: |
| [MMLU][mmlu] | 5-shot | 59.6 | 74.5 | 78.6 |
| [MMLU][mmlu] (Pro COT) | 5-shot | 29.2 | 45.3 | 52.2 |
| [AGIEval][agieval] | 3-5-shot | 42.1 | 57.4 | 66.2 |
| [MATH][math] | 4-shot | 24.2 | 43.3 | 50.0 |
| [GSM8K][gsm8k] | 8-shot | 38.4 | 71.0 | 82.6 |
| [GPQA][gpqa] | 5-shot | 15.0 | 25.4 | 24.3 |
| [MBPP][mbpp] | 3-shot | 46.0 | 60.4 | 65.6 |
| [HumanEval][humaneval] | 0-shot | 36.0 | 45.7 | 48.8 |

[mmlu]: https://arxiv.org/abs/2009.03300
[agieval]: https://arxiv.org/abs/2304.06364
[math]: https://arxiv.org/abs/2103.03874
[gsm8k]: https://arxiv.org/abs/2110.14168
[mbpp]: https://arxiv.org/abs/2108.07732
[humaneval]: https://arxiv.org/abs/2107.03374
[lcb]: https://arxiv.org/abs/2403.07974
[bird-sql]: https://arxiv.org/abs/2305.03111
[nat2code]: https://arxiv.org/abs/2405.04520

| Benchmark | n-shot | Gemma 3 IT 1B | Gemma 3 IT 4B | Gemma 3 IT 12B | Gemma 3 IT 27B |
| :--- | :---: | ---: | ---: | ---: | ---: |
| [Global-MMLU-Lite][global-mmlu-lite] | 0-shot | 34.2 | 54.5 | 69.5 | 75.1 |
| [ECLeKTic][eclektic] | 0-shot | 1.4 | 4.6 | 10.3 | 16.7 |
| [WMT24++][wmt24pp] | 0-shot | 35.9 | 46.8 | 51.6 | 53.4 |

| Benchmark | Gemma 3 PT 1B | Gemma 3 PT 4B | Gemma 3 PT 12B | Gemma 3 PT 27B |
| :--- | ---: | ---: | ---: | ---: |
| [MGSM][mgsm] | 2.04 | 34.7 | 64.3 | 74.3 |
| [Global-MMLU-Lite][global-mmlu-lite] | 24.9 | 57.0 | 69.4 | 75.7 |
| [WMT24++][wmt24pp] (ChrF) | 36.7 | 48.4 | 53.9 | 55.7 |
| [FloRes][flores] | 29.5 | 39.2 | 46.0 | 48.8 |
| [XQuAD][xquad] (all) | 43.9 | 68.0 | 74.5 | 76.8 |
| [ECLeKTic][eclektic] | 4.69 | 11.0 | 17.2 | 24.4 |
| [IndicGenBench][indicgenbench] | 41.4 | 57.2 | 61.7 | 63.4 |

[mgsm]: https://arxiv.org/abs/2210.03057
[flores]: https://arxiv.org/abs/2106.03193
[xquad]: https://arxiv.org/abs/1910.11856v3
[global-mmlu-lite]: https://huggingface.co/datasets/CohereForAI/Global-MMLU-Lite
[wmt24pp]: https://arxiv.org/abs/2502.12404v1
[eclektic]: https://arxiv.org/abs/2502.21228
[indicgenbench]: https://arxiv.org/abs/2404.16816

| Benchmark | Gemma 3 IT 4B | Gemma 3 IT 12B | Gemma 3 IT 27B |
| :--- | ---: | ---: | ---: |
| [MMMU][mmmu] (val) | 48.8 | 59.6 | 64.9 |
| [DocVQA][docvqa] | 75.8 | 87.1 | 86.6 |
| [InfoVQA][info-vqa] | 50.0 | 64.9 | 70.6 |
| [TextVQA][textvqa] | 57.8 | 67.7 | 65.1 |
| [AI2D][ai2d] | 74.8 | 84.2 | 84.5 |
| [ChartQA][chartqa] | 68.8 | 75.7 | 78.0 |
| [VQAv2][vqav2] (val) | 62.4 | 71.6 | 71.0 |
| [MathVista][mathvista] (testmini) | 50.0 | 62.9 | 67.6 |

| Benchmark | Gemma 3 PT 4B | Gemma 3 PT 12B | Gemma 3 PT 27B |
| :--- | ---: | ---: | ---: |
| [COCOcap][coco-cap] | 102 | 111 | 116 |
| [DocVQA][docvqa] (val) | 72.8 | 82.3 | 85.6 |
| [InfoVQA][info-vqa] (val) | 44.1 | 54.8 | 59.4 |
| [MMMU][mmmu] (pt) | 39.2 | 50.3 | 56.1 |
| [TextVQA][textvqa] (val) | 58.9 | 66.5 | 68.6 |
| [RealWorldQA][realworldqa] | 45.5 | 52.2 | 53.9 |
| [ReMI][remi] | 27.3 | 38.5 | 44.8 |
| [AI2D][ai2d] | 63.2 | 75.2 | 79.0 |
| [ChartQA][chartqa] | 63.6 | 74.7 | 76.3 |
| [VQAv2][vqav2] | 63.9 | 71.2 | 72.9 |
| [BLINK][blinkvqa] | 38.0 | 35.9 | 39.6 |
| [OKVQA][okvqa] | 51.0 | 58.7 | 60.2 |
| [TallyQA][tallyqa] | 42.5 | 51.8 | 54.3 |
| [SpatialSense VQA][ss-vqa] | 50.9 | 60.0 | 59.4 |
| [CountBenchQA][countbenchqa] | 26.1 | 17.8 | 68.0 |

[coco-cap]: https://cocodataset.org/#home
[docvqa]: https://www.docvqa.org/
[info-vqa]: https://arxiv.org/abs/2104.12756
[mmmu]: https://arxiv.org/abs/2311.16502
[textvqa]: https://textvqa.org/
[realworldqa]: https://paperswithcode.com/dataset/realworldqa
[remi]: https://arxiv.org/html/2406.09175v1
[ai2d]: https://allenai.org/data/diagrams
[chartqa]: https://arxiv.org/abs/2203.10244
[vqav2]: https://visualqa.org/index.html
[blinkvqa]: https://arxiv.org/abs/2404.12390
[okvqa]: https://okvqa.allenai.org/
[tallyqa]: https://arxiv.org/abs/1810.12440
[ss-vqa]: https://arxiv.org/abs/1908.02660
[countbenchqa]: https://github.com/google-research/bigvision/blob/main/bigvision/datasets/countbenchqa/
[mathvista]: https://arxiv.org/abs/2310.02255

Our evaluation methods include structured evaluations and internal red-teaming testing of relevant content policies. Red-teaming was conducted by a number of different teams, each with different goals and human evaluation metrics. These models were evaluated against a number of different categories relevant to ethics and safety, including:

- Child Safety: Evaluation of text-to-text and image-to-text prompts covering child safety policies, including child sexual abuse and exploitation.
- Content Safety: Evaluation of text-to-text and image-to-text prompts covering safety policies, including harassment, violence and gore, and hate speech.
- Representational Harms: Evaluation of text-to-text and image-to-text prompts covering safety policies, including bias, stereotyping, and harmful associations or inaccuracies.

In addition to development-level evaluations, we conduct "assurance evaluations", which are our 'arms-length' internal evaluations for responsibility governance decision making. They are conducted separately from the model development team, to inform decision making about release. High-level findings are fed back to the model team, but prompt sets are held out to prevent overfitting and preserve the results' ability to inform decision making.
Assurance evaluation results are reported to our Responsibility & Safety Council as part of release review. Across all areas of safety testing, we saw major improvements in the categories of child safety, content safety, and representational harms relative to previous Gemma models. All testing was conducted without safety filters to evaluate the model's capabilities and behaviors. For both text-to-text and image-to-text, and across all model sizes, the model produced minimal policy violations and showed significant improvements over previous Gemma models' performance with respect to ungrounded inferences. A limitation of our evaluations was that they included only English-language prompts. These models have certain limitations that users should be aware of. Open vision-language models (VLMs) have a wide range of applications across various industries and domains. The following list of potential uses is not comprehensive; its purpose is to provide contextual information about the possible use cases that the model creators considered as part of model training and development.

- Content Creation and Communication
  - Text Generation: These models can be used to generate creative text formats such as poems, scripts, code, marketing copy, and email drafts.
  - Chatbots and Conversational AI: Power conversational interfaces for customer service, virtual assistants, or interactive applications.
  - Text Summarization: Generate concise summaries of a text corpus, research papers, or reports.
  - Image Data Extraction: These models can be used to extract, interpret, and summarize visual data for text communications.
- Research and Education
  - Natural Language Processing (NLP) and VLM Research: These models can serve as a foundation for researchers to experiment with VLM and NLP techniques, develop algorithms, and contribute to the advancement of the field.
  - Language Learning Tools: Support interactive language learning experiences, aiding in grammar correction or providing writing practice.
  - Knowledge Exploration: Assist researchers in exploring large bodies of text by generating summaries or answering questions about specific topics.

Limitations:

- Training Data
  - The quality and diversity of the training data significantly influence the model's capabilities. Biases or gaps in the training data can lead to limitations in the model's responses.
  - The scope of the training dataset determines the subject areas the model can handle effectively.
- Context and Task Complexity
  - Models are better at tasks that can be framed with clear prompts and instructions. Open-ended or highly complex tasks might be challenging.
  - A model's performance can be influenced by the amount of context provided (longer context generally leads to better outputs, up to a certain point).
- Language Ambiguity and Nuance
  - Natural language is inherently complex. Models might struggle to grasp subtle nuances, sarcasm, or figurative language.
- Factual Accuracy
  - Models generate responses based on information they learned from their training datasets, but they are not knowledge bases. They may generate incorrect or outdated factual statements.
- Common Sense
  - Models rely on statistical patterns in language. They might lack the ability to apply common-sense reasoning in certain situations.

The development of vision-language models (VLMs) raises several ethical concerns. In creating an open model, we have carefully considered the following:

- Bias and Fairness
  - VLMs trained on large-scale, real-world text and image data can reflect socio-cultural biases embedded in the training material. These models underwent careful scrutiny, with input-data pre-processing and downstream evaluations described and reported in this card.
- Misinformation and Misuse
  - VLMs can be misused to generate text that is false, misleading, or harmful.
  - Guidelines are provided for responsible use with the model; see the [Responsible Generative AI Toolkit][rai-toolkit].
- Transparency and Accountability
  - This model card summarizes details on the models' architecture, capabilities, limitations, and evaluation processes.
  - A responsibly developed open model offers the opportunity to share innovation by making VLM technology accessible to developers and researchers across the AI ecosystem.

Risks identified and mitigations:

- Perpetuation of biases: We encourage continuous monitoring (using evaluation metrics and human review) and the exploration of de-biasing techniques during model training, fine-tuning, and other use cases.
- Generation of harmful content: Mechanisms and guidelines for content safety are essential. Developers are encouraged to exercise caution and implement appropriate content-safety safeguards based on their specific product policies and application use cases.
- Misuse for malicious purposes: Technical limitations and developer and end-user education can help mitigate malicious applications of VLMs. Educational resources and reporting mechanisms for users to flag misuse are provided. Prohibited uses of Gemma models are outlined in the [Gemma Prohibited Use Policy][prohibited-use].
- Privacy violations: Models were trained on data filtered to remove certain personal information and other sensitive data. Developers are encouraged to adhere to privacy regulations with privacy-preserving techniques.

At the time of release, this family of models provides high-performance open vision-language model implementations designed from the ground up for responsible AI development, compared to similarly sized models. Using the benchmark evaluation metrics described in this document, these models have been shown to provide superior performance relative to other comparably sized open model alternatives.
[g3-tech-report]: https://arxiv.org/abs/2503.19786
[rai-toolkit]: https://ai.google.dev/responsible
[kaggle-gemma]: https://www.kaggle.com/models/google/gemma-3
[vertex-mg-gemma3]: https://console.cloud.google.com/vertex-ai/publishers/google/model-garden/gemma3
[terms]: https://ai.google.dev/gemma/terms
[safety-policies]: https://ai.google/static/documents/ai-responsibility-update-published-february-2025.pdf
[prohibited-use]: https://ai.google.dev/gemma/prohibitedusepolicy
[tpu]: https://cloud.google.com/tpu/docs/intro-to-tpu
[sustainability]: https://sustainability.google/operating-sustainably/
[jax]: https://github.com/jax-ml/jax
[ml-pathways]: https://blog.google/technology/ai/introducing-pathways-next-generation-ai-architecture/
[gemini-2-paper]: https://arxiv.org/abs/2312.11805
Llama-3.3-70B-Instruct-GGUF
UserLM-8b-GGUF
This model was generated using llama.cpp at commit `56b479584`. I've been experimenting with a new quantization approach that selectively elevates the precision of key layers beyond what the default IMatrix configuration provides. In my testing, standard IMatrix quantization underperforms at lower bit depths, especially with Mixture of Experts (MoE) models. To address this, I'm using the `--tensor-type` option in `llama.cpp` to manually "bump" important layers to higher precision. You can see the implementation here: 👉 Layer bumping with llama.cpp While this does increase model file size, it significantly improves precision for a given quantization level. I'd love your feedback—have you tried this? How does it perform for you? Click here to get info on choosing the right GGUF model format

Unlike typical LLMs, which are trained to play the "assistant" role in conversation, we trained UserLM-8b to simulate the "user" role (by training it to predict user turns in WildChat, a large corpus of conversations). This model is useful for simulating more realistic conversations, which is in turn useful for developing more robust assistants. The model takes a single input, the "task intent", which defines the high-level objective the user simulator should pursue. The simulator can then be used to: (1) generate a first-turn user utterance, (2) generate follow-up user utterances based on a conversation state (one or several user-assistant turn exchanges), and (3) generate a token when the user simulator expects that the conversation has run its course.

Developed by: Tarek Naous (intern at MSR, Summer 2025), Philippe Laban (MSR), Wei Xu, Jennifer Neville (MSR)

UserLM-8b is released for use by researchers involved in the evaluation of assistant LLMs.
In such scenarios, UserLM-8b can be used to simulate multi-turn conversations; our analyses (see Section 3 of the paper) give evidence that UserLM-8b provides more realistic simulation of user behavior than other methods (such as prompting an assistant model). UserLM-8b offers a user-simulation environment that can better estimate the performance of an assistant LLM with real users. See Section 4 of the paper for an initial implementation of such an evaluation. We envision several potential uses for UserLM-8b that we did not implement in the presented work but describe in our Discussion section as potential research directions for UserLMs. These potential applications include: (1) user modeling (i.e., predicting user responses to a given set of questions), (2) a foundation for judge models (i.e., LLM-as-a-judge finetuning), and (3) synthetic data generation (in conjunction with an assistant LM). We caution potential users that UserLM-8b is not an assistant LM, unlike the majority of LLMs released on Hugging Face. As such, it is unlikely to be useful to end-users who require assistance with a task, for which an assistant LLM (such as microsoft/Phi-4) might be more appropriate. We do not recommend using UserLM in commercial or real-world applications without further testing and development; it is being released for research purposes. The paper accompanying this model release presents several evaluations of UserLM-8b and its potential limitations. In Section 3, we describe the robustness experiments we conducted with UserLM-8b, which show that though the model can more robustly adhere to the user role and the provided task intent than baselines, the robustness numbers are not perfect.
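The three simulator uses described above (first-turn utterance, follow-up turns, end-of-conversation signal) can be sketched as a driver loop. Everything here is illustrative: `user_sim` and `assistant` are stand-ins for UserLM-8b and the assistant under test, and `"<|endconversation|>"` is a placeholder marker, not the model's actual end token.

```python
# Sketch of a user-simulator evaluation loop (hypothetical interface).
def simulate_dialog(user_sim, assistant, task_intent, max_turns=5):
    history = []
    for _ in range(max_turns):
        user_turn = user_sim(task_intent, history)   # (1)/(2): first or follow-up user turn
        if user_turn == "<|endconversation|>":       # (3): simulator ends the conversation
            break
        history.append(("user", user_turn))
        history.append(("assistant", assistant(history)))
    return history

# Toy stand-ins, just to show the control flow:
def toy_user(intent, history):
    if len(history) >= 2:
        return "<|endconversation|>"
    return f"Help me with: {intent}"

def toy_assistant(history):
    return "Sure, here is a draft."

dialog = simulate_dialog(toy_user, toy_assistant, "drafting an email")
```

In a real evaluation, the final `history` would be scored (e.g., task success) to estimate assistant performance with simulated users.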
Skywork-VL-Reward-7B-GGUF
This model was generated using llama.cpp at commit `1f63e75f`. I've been experimenting with a new quantization approach that selectively elevates the precision of key layers beyond what the default IMatrix configuration provides. In my testing, standard IMatrix quantization underperforms at lower bit depths, especially with Mixture of Experts (MoE) models. To address this, I'm using the `--tensor-type` option in `llama.cpp` to manually "bump" important layers to higher precision. You can see the implementation here: 👉 Layer bumping with llama.cpp While this does increase model file size, it significantly improves precision for a given quantization level. I'd love your feedback—have you tried this? How does it perform for you? Click here to learn more about choosing the right GGUF model format

May 12, 2025: Our technical report is now available on arXiv and we welcome citations: Skywork-VL Reward: An Effective Reward Model for Multimodal Understanding and Reasoning

April 24, 2025: We released Skywork-VL-Reward-7B, a state-of-the-art multimodal reward model on VLRewardBench, and published our technical report on the R1V GitHub repository.

Introduction: The lack of multimodal reward models has become a major bottleneck restricting the development of multimodal reinforcement learning. We open-source the 7B multimodal reward model Skywork-VL-Reward, injecting new momentum into the industry and opening a new chapter in multimodal reinforcement learning. Skywork-VL-Reward is based on the Qwen2.5-VL-7B-Instruct architecture with the addition of a value-head structure for reward-model training. We obtained a SOTA score of 73.1 on VL-RewardBench and a high score of 90.1 on RewardBench. In addition, MPO training based on Skywork-VL-Reward for Skywork-R1V-2.0 further validates the model's effectiveness. We hope that this multimodal reward model will contribute to the open-source community! Please refer to our technical report for more details.
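A reward model of this kind replaces the language-model head with a scalar value head applied to a pooled hidden state. The following is a toy, framework-free sketch of that idea; the dimensions and weights are illustrative and are not Skywork's actual implementation.

```python
# Toy scalar value head: score = w · h + b over a pooled hidden state.
# In practice this is a small linear layer on top of the VLM backbone
# (here Qwen2.5-VL-7B-Instruct), trained on preference pairs.
def value_head(hidden, weights, bias=0.0):
    assert len(hidden) == len(weights)
    return sum(h * w for h, w in zip(hidden, weights)) + bias

# Preference training pushes score(chosen) above score(rejected),
# e.g. via a loss like -log(sigmoid(score_chosen - score_rejected)).
h_chosen = [0.5, -1.0, 2.0]    # illustrative pooled states
h_rejected = [0.1, 0.2, 0.3]
w = [1.0, 0.5, 0.25]
margin = value_head(h_chosen, w) - value_head(h_rejected, w)
```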
Technical Report: Skywork-VL Reward: An Effective Reward Model for Multimodal Understanding and Reasoning

VL-RewardBench results:

| Model | Size | General | Hallucination | Reasoning | Overall Accuracy | Macro Average |
| --- | --- | --- | --- | --- | --- | --- |
| **Proprietary Models** | | | | | | |
| Claude-3.5-Sonnet (2024-06-22) | - | 43.4 | 55.0 | 62.3 | 55.3 | 53.6 |
| Gemini-1.5-Flash (2024-09-24) | - | 47.8 | 59.6 | 58.4 | 57.6 | 55.3 |
| GPT-4o (2024-08-06) | - | 49.1 | 67.6 | 70.5 | 65.8 | 62.4 |
| Gemini-1.5-Pro (2024-09-24) | - | 50.8 | 72.5 | 64.2 | 67.2 | 62.5 |
| Gemini-2.0-flash-exp (2024-12) | - | 50.8 | 72.6 | 70.1 | 68.8 | 64.5 |
| **Open-Source Models** | | | | | | |
| Qwen2-VL-7B-Instruct | 7B | 31.6 | 19.1 | 51.1 | 28.3 | 33.9 |
| MAmmoTH-VL-8B | 8B | 36.0 | 40.0 | 52.0 | 42.2 | 42.7 |
| Qwen2.5-VL-7B-Instruct | 7B | 43.4 | 42.0 | 63.0 | 48.0 | 49.5 |
| InternVL3-8B | 8B | 60.6 | 44.0 | 62.3 | 57.0 | 55.6 |
| IXC-2.5-Reward-7B | 7B | 80.3 | 65.3 | 60.4 | 66.3 | 68.6 |
| Qwen2-VL-72B-Instruct | 72B | 38.1 | 32.8 | 58.0 | 39.5 | 43.0 |
| Molmo-72B-0924 | 72B | 33.9 | 42.3 | 54.9 | 44.1 | 43.7 |
| QVQ-72B-Preview | 72B | 41.8 | 46.2 | 51.2 | 46.4 | 46.4 |
| Qwen2.5-VL-72B-Instruct | 72B | 47.8 | 46.8 | 63.5 | 51.6 | 52.7 |
| InternVL3-78B | 78B | 67.8 | 52.5 | 64.5 | 63.3 | 61.6 |
| Skywork-VL Reward (Ours) | 7B | 66.0 | 80.0 | 61.0 | **73.1** | **69.0** |

RewardBench results:

| Model | Chat | Chat Hard | Safety | Reasoning | Avg |
| --- | --- | --- | --- | --- | --- |
| **Language-Only Reward Models** | | | | | |
| InternLM2-7B-Reward | 99.2 | 69.5 | 87.2 | 94.5 | 87.6 |
| Skywork-Reward-Llama3.1-8B | 95.8 | 87.3 | 90.8 | 96.2 | 92.5 |
| Skywork-Reward-Llama-3.1-8B-v0.2 | 94.7 | 88.4 | 92.7 | 96.7 | 93.1 |
| QRM-Llama3.1-8B-v2 | 96.4 | 86.8 | 92.6 | 96.8 | 93.1 |
| **Multi-Modal Reward Models** | | | | | |
| Qwen2-VL-7B-Instruct | 65.1 | 50.9 | 55.8 | 68.3 | 60.0 |
| InternVL3-8B | 97.2 | 50.4 | 83.6 | 83.9 | 78.8 |
| Qwen2.5-VL-7B-Instruct | 94.3 | 63.8 | 84.1 | 86.2 | 82.1 |
| IXC-2.5-Reward-7B | 90.8 | 83.8 | 87.8 | 90.0 | 88.1 |
| Skywork-VL Reward (Ours) | 90.0 | 87.5 | 91.1 | 91.8 | **90.1** |

Citation: If you use this work in your research, please cite the technical report above.
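The Macro Average column reported for VL-RewardBench is the unweighted mean of the three category scores (General, Hallucination, Reasoning); a quick check against the table:

```python
# Macro average over the three VL-RewardBench categories,
# reproducing the table's last column.
def macro_avg(general, hallucination, reasoning):
    return round((general + hallucination + reasoning) / 3, 1)

skywork = macro_avg(66.0, 80.0, 61.0)   # table reports 69.0
gpt4o = macro_avg(49.1, 67.6, 70.5)     # table reports 62.4
```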
Qwen2.5-14B-Instruct-GGUF
Qwen3-Reranker-4B-GGUF
Cosmos-Reason1-7B-GGUF
rwkv7-2.9B-g1-GGUF
Llama-3.2-3B-F1-Reasoning-Instruct-GGUF
mxbai-rerank-large-v2-GGUF
LiveCC-7B-Instruct-GGUF
granite-4.0-h-micro-GGUF
This model was generated using llama.cpp at commit `ee09828cb`. I've been experimenting with a new quantization approach that selectively elevates the precision of key layers beyond what the default IMatrix configuration provides. In my testing, standard IMatrix quantization underperforms at lower bit depths, especially with Mixture of Experts (MoE) models. To address this, I'm using the `--tensor-type` option in `llama.cpp` to manually "bump" important layers to higher precision. You can see the implementation here: 👉 Layer bumping with llama.cpp While this does increase model file size, it significantly improves precision for a given quantization level. I'd love your feedback—have you tried this? How does it perform for you? Click here to get info on choosing the right GGUF model format

📣 Update [10-07-2025]: Added a default system prompt to the chat template to guide the model towards more professional, accurate, and safe responses.

Model Summary: Granite-4.0-H-Micro is a 3B-parameter, long-context instruct model finetuned from Granite-4.0-H-Micro-Base using a combination of open-source instruction datasets with permissive licenses and internally collected synthetic datasets. This model is developed using a diverse set of techniques with a structured chat format, including supervised finetuning, model alignment using reinforcement learning, and model merging. Granite 4.0 instruct models feature improved instruction following (IF) and tool-calling capabilities, making them more effective in enterprise applications.
- Developers: Granite Team, IBM
- HF Collection: Granite 4.0 Language Models HF Collection
- GitHub Repository: ibm-granite/granite-4.0-language-models
- Website: Granite Docs
- Release Date: October 2nd, 2025
- License: Apache 2.0

Supported Languages: English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese. Users may finetune Granite 4.0 models for languages beyond this list.
Intended use: The model is designed to respond to general instructions and can be used to build AI assistants for multiple domains, including business applications.

Capabilities:
- Summarization
- Text classification
- Text extraction
- Question-answering
- Retrieval-Augmented Generation (RAG)
- Code-related tasks
- Function-calling tasks
- Multilingual dialog use cases
- Fill-In-the-Middle (FIM) code completions

Generation: This is a simple example of how to use the Granite-4.0-H-Micro model. Copy the snippet from the section that is relevant for your use case.

Tool-calling: Granite-4.0-H-Micro comes with enhanced tool-calling capabilities, enabling seamless integration with external functions and APIs. To define a list of tools, please follow OpenAI's function definition schema. This is an example of how to use the Granite-4.0-H-Micro model's tool-calling ability.

Benchmarks

Multilingual benchmarks and their included languages:

| Benchmark | # Langs | Languages |
| --- | --- | --- |
| MMMLU | 11 | ar, de, en, es, fr, ja, ko, pt, zh, bn, hi |
| INCLUDE | 14 | hi, bn, ta, te, ar, de, es, fr, it, ja, ko, nl, pt, zh |

Model Architecture: The Granite-4.0-H-Micro baseline is built on a decoder-only dense transformer architecture. Core components of this architecture are: GQA, Mamba2, MLP with SwiGLU, RMSNorm, and shared input/output embeddings.

| Model | Micro Dense | H Micro Dense | H Tiny MoE | H Small MoE |
| --- | --- | --- | --- | --- |
| Number of layers | 40 attention | 4 attention / 36 Mamba2 | 4 attention / 36 Mamba2 | 4 attention / 36 Mamba2 |
| MLP / Shared expert hidden size | 8192 | 8192 | 1024 | 1536 |

Training Data: Overall, our SFT data is largely comprised of three key sources: (1) publicly available datasets with permissive licenses, (2) internal synthetic data targeting specific capabilities, and (3) a select set of human-curated data.
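The OpenAI-style function definitions mentioned in the tool-calling section above look like the following. The `get_current_weather` tool here is a made-up example, not one shipped with Granite.

```python
# Hypothetical tool definition following OpenAI's function-calling schema.
get_weather_tool = {
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name, e.g. 'Paris'"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}

# A chat request would typically pass a list of such definitions
# (e.g. via the `tools` argument of the chat template).
tools = [get_weather_tool]
```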
Infrastructure: We trained the Granite 4.0 language models on an NVIDIA GB200 NVL72 cluster hosted in CoreWeave. Intra-rack communication occurs via the 72-GPU NVLink domain, and a non-blocking, full Fat-Tree NDR 400 Gb/s InfiniBand network provides inter-rack communication. This cluster provides a scalable and efficient infrastructure for training our models over thousands of GPUs.

Ethical Considerations and Limitations: Granite 4.0 instruct models are primarily finetuned on instruction-response pairs, mostly in English, but also on multilingual data covering multiple languages. Although this model can handle multilingual dialog use cases, its performance might not match that on English tasks. In such cases, introducing a small number of examples (few-shot) can help the model generate more accurate outputs. While this model has been aligned with safety in mind, it may in some cases produce inaccurate, biased, or unsafe responses to user prompts. We therefore urge the community to use this model with proper safety testing and tuning tailored to their specific tasks.

Resources
- ⭐️ Learn about the latest updates with Granite: https://www.ibm.com/granite
- 📄 Get started with tutorials, best practices, and prompt engineering advice: https://www.ibm.com/granite/docs/
- 💡 Learn about the latest Granite learning resources: https://ibm.biz/granite-learning-resources
Llama-3.1-8B-Instruct-GGUF
medgemma-4b-it-GGUF
NextCoder-7B-GGUF
gemma-3n-E4B-it-GGUF
Qwen3-4B-abliterated-GGUF
Nemotron-Research-Reasoning-Qwen-1.5B-GGUF
SmallThinker-21BA3B-Instruct-GGUF
granite-3b-code-instruct-2k-GGUF
Qwen3-14B-GGUF
OpenMath-CodeLlama-7b-Python-hf-GGUF
Qwen2.5-7B-Instruct-GGUF
LFM2-2.6B-GGUF
This model was generated using llama.cpp at commit `ee09828cb`. Click here to get info on choosing the right GGUF model format

LFM2 is a new generation of hybrid models developed by Liquid AI, specifically designed for edge AI and on-device deployment. It sets a new standard in quality, speed, and memory efficiency. We're releasing the weights of four post-trained checkpoints with 350M, 700M, 1.2B, and 2.6B parameters. They provide the following key features to create AI-powered edge applications:

- Fast training & inference – LFM2 achieves 3x faster training compared to its previous generation. It also benefits from 2x faster decode and prefill speed on CPU compared to Qwen3.
- Best performance – LFM2 outperforms similarly-sized models across multiple benchmark categories, including knowledge, mathematics, instruction following, and multilingual capabilities.
- New architecture – LFM2 is a new hybrid Liquid model with multiplicative gates and short convolutions.
- Flexible deployment – LFM2 runs efficiently on CPU, GPU, and NPU hardware for flexible deployment on smartphones, laptops, or vehicles.

Due to their small size, we recommend fine-tuning LFM2 models on narrow use cases to maximize performance. They are particularly suited for agentic tasks, data extraction, RAG, creative writing, and multi-turn conversations. However, we do not recommend using them for tasks that are knowledge-intensive or require programming skills.
| Property | LFM2-350M | LFM2-700M | LFM2-1.2B | LFM2-2.6B |
| --- | --- | --- | --- | --- |
| Parameters | 354,483,968 | 742,489,344 | 1,170,340,608 | 2,569,272,320 |
| Layers | 16 (10 conv + 6 attn) | 16 (10 conv + 6 attn) | 16 (10 conv + 6 attn) | 30 (22 conv + 8 attn) |
| Context length | 32,768 tokens | 32,768 tokens | 32,768 tokens | 32,768 tokens |
| Vocabulary size | 65,536 | 65,536 | 65,536 | 65,536 |
| Precision | bfloat16 | bfloat16 | bfloat16 | bfloat16 |
| Training budget | 10 trillion tokens | 10 trillion tokens | 10 trillion tokens | 10 trillion tokens |
| License | LFM Open License v1.0 | LFM Open License v1.0 | LFM Open License v1.0 | LFM Open License v1.0 |

Supported languages: English, Arabic, Chinese, French, German, Japanese, Korean, and Spanish.

Generation parameters: We recommend `temperature=0.3`, `min_p=0.15`, and `repetition_penalty=1.05`.

Reasoning: LFM2-2.6B is the only model in this family to use dynamic hybrid reasoning (traces between dedicated reasoning special tokens) for complex or multilingual prompts.

Chat template: LFM2 uses a ChatML-like chat template. You can apply it automatically using the dedicated `.apply_chat_template()` function from Hugging Face transformers.

Tool use: It consists of four main steps:
1. Function definition: LFM2 takes JSON function definitions as input (JSON objects between dedicated special tokens), usually in the system prompt.
2. Function call: LFM2 writes Pythonic function calls (a Python list between dedicated special tokens) as the assistant answer.
3. Function execution: The function call is executed and the result is returned (a string between dedicated special tokens) as a "tool" role message.
4. Final answer: LFM2 interprets the outcome of the function call to address the original user prompt in plain text.
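Step 2 of the tool-use flow above means the host application has to turn a Pythonic call list back into tool names and arguments before executing anything. A minimal sketch using the standard `ast` module (the `get_weather` call is a made-up example, and only keyword arguments with literal values are handled):

```python
import ast

# Parse a Pythonic tool-call list such as
#   '[get_weather(city="Paris", unit="celsius")]'
# into (name, kwargs) pairs. Real harnesses would add validation
# and error handling before dispatching to actual tools.
def parse_tool_calls(text):
    tree = ast.parse(text.strip(), mode="eval")
    calls = []
    for node in tree.body.elts:  # expect a list of Call nodes
        name = node.func.id
        kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
        calls.append((name, kwargs))
    return calls

calls = parse_tool_calls('[get_weather(city="Paris", unit="celsius")]')
```

Parsing with `ast` rather than `eval` keeps the model's output inert until the host explicitly dispatches each named tool.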
Here is a simple example of a conversation using tool use:

Architecture: Hybrid model with multiplicative gates and short convolutions: 10 double-gated short-range LIV convolution blocks and 6 grouped query attention (GQA) blocks.

Pre-training mixture: Approximately 75% English, 20% multilingual, and 5% code data sourced from the web and licensed materials.

Training approach:
- Very large-scale SFT on a mixture of 50% downstream tasks and 50% general domains
- Custom DPO with length normalization and semi-online datasets
- Iterative model merging

To run LFM2, you need to install Hugging Face `transformers` v4.55 or a more recent version. Here is an example of how to generate an answer with transformers in Python: You can directly run and test the model with this Colab notebook.

To use vLLM, you need to install `vLLM` v0.10.2 or a more recent version. You can also run LFM2 with llama.cpp using its GGUF checkpoint. Find more information in the model card.

We recommend fine-tuning LFM2 models on your use cases to maximize performance.

| Notebook | Description | Link |
| --- | --- | --- |
| SFT (Unsloth) | Supervised Fine-Tuning (SFT) notebook with a LoRA adapter using Unsloth. | |
| SFT (Axolotl) | Supervised Fine-Tuning (SFT) notebook with a LoRA adapter using Axolotl. | |
| SFT (TRL) | Supervised Fine-Tuning (SFT) notebook with a LoRA adapter using TRL. | |
| DPO (TRL) | Preference alignment with Direct Preference Optimization (DPO) using TRL. | |

LFM2 outperforms similarly-sized models across different evaluation categories. We only report scores using instruct variants and non-thinking modes for consistency.
| Model | MMLU | GPQA | IFEval | IFBench | GSM8K | MGSM | MMMLU |
| --- | --- | --- | --- | --- | --- | --- | --- |
| LFM2-2.6B | 64.42 | 26.57 | 79.56 | 22.19 | 82.41 | 74.32 | 55.39 |
| Llama-3.2-3B-Instruct | 60.35 | 30.6 | 71.43 | 20.78 | 75.21 | 61.68 | 47.92 |
| SmolLM3-3B | 59.84 | 26.31 | 72.44 | 17.93 | 81.12 | 68.72 | 50.02 |
| gemma-3-4b-it | 58.35 | 29.51 | 76.85 | 23.53 | 89.92 | 87.28 | 50.14 |
| Qwen3-4B-Instruct-2507 | 72.25 | 34.85 | 85.62 | 30.28 | 68.46 | 81.76 | 60.67 |

If you are interested in custom solutions with edge deployment, please contact our sales team.
Foundation-Sec-8B-GGUF
Devstral-Small-2505-GGUF
Phi-4-mini-instruct-GGUF
Fathom-R1-14B-GGUF
DeepSeek-R1-Distill-Qwen-7B-GGUF
Qwen3-30B-A1.5B-High-Speed-GGUF
medgemma-4b-pt-GGUF
Qwen3-14B-abliterated-GGUF
Logics-Parsing-GGUF
This model was generated using llama.cpp at commit `c8dedc99`. Click here to get info on choosing the right GGUF model format

🤗 GitHub | 🤖 Demo | 📑 Technical Report

Logics-Parsing is a powerful, end-to-end document parsing model built upon a general Vision-Language Model (VLM) through Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). It excels at accurately analyzing and structuring highly complex documents.

Effortless End-to-End Processing: Our single-model architecture eliminates the need for complex, multi-stage pipelines. Deployment and inference are straightforward, going directly from a document image to structured output. It demonstrates exceptional performance on documents with challenging layouts.

Advanced Content Recognition: It accurately recognizes and structures difficult content, including intricate scientific formulas. Chemical structures are intelligently identified and can be represented in the standard SMILES format.

Rich, Structured HTML Output: The model generates a clean HTML representation of the document, preserving its logical structure. Each content block (e.g., paragraph, table, figure, formula) is tagged with its category, bounding-box coordinates, and OCR text. It automatically identifies and filters out irrelevant elements like headers and footers, focusing only on the core content.

State-of-the-Art Performance: Logics-Parsing achieves the best performance on our in-house benchmark, which is specifically designed to comprehensively evaluate a model's parsing capability on complex-layout documents and STEM content. Existing document-parsing benchmarks often provide limited coverage of complex layouts and STEM content. To address this, we constructed an in-house benchmark comprising 1,078 page-level images across nine major categories and over twenty sub-categories.
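Consuming the category- and bbox-annotated HTML described above can be sketched with the standard-library HTML parser. The attribute names (`data-category`, `data-bbox`) are illustrative assumptions; check the Logics-Parsing repository for the actual output schema.

```python
from html.parser import HTMLParser

# Hypothetical consumer of block-annotated HTML output.
class BlockCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if "data-category" in a:
            # bbox assumed to be "x1,y1,x2,y2" (an assumption, not the spec)
            bbox = [int(v) for v in a.get("data-bbox", "").split(",") if v]
            self.blocks.append({"category": a["data-category"], "bbox": bbox})

parser = BlockCollector()
parser.feed('<div data-category="table" data-bbox="10,20,300,400">...</div>')
```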
Model Type Methods Overall Edit ↓ Text Edit Edit ↓ Formula Edit ↓ Table TEDS ↑ Table Edit ↓ ReadOrder Edit ↓ Chemistry Edit ↓ HandWriting Edit ↓ Pipeline Tools doc2x 0.209 0.188 0.128 0.194 0.377 0.321 81.1 85.3 0.148 0.115 0.146 0.122 1.0 0.307 Textin 0.153 0.158 0.132 0.190 0.185 0.223 76.7 86.3 0.176 0.113 0.118 0.104 1.0 0.344 mathpix 0.128 0.146 0.128 0.152 0.06 0.142 86.2 86.6 0.120 0.127 0.204 0.164 0.552 0.263 PPStructureV3 0.220 0.226 0.172 0.29 0.272 0.276 66 71.5 0.237 0.193 0.201 0.143 1.0 0.382 Mineru2 0.212 0.245 0.134 0.195 0.280 0.407 67.5 71.8 0.228 0.203 0.205 0.177 1.0 0.387 Marker 0.324 0.409 0.188 0.289 0.285 0.383 65.5 50.4 0.593 0.702 0.23 0.262 1.0 0.50 Pix2text 0.447 0.547 0.485 0.577 0.312 0.465 64.7 63.0 0.566 0.613 0.424 0.534 1.0 0.95 Expert VLMs Dolphin 0.208 0.256 0.149 0.189 0.334 0.346 72.9 60.1 0.192 0.35 0.160 0.139 0.984 0.433 dots.ocr 0.186 0.198 0.115 0.169 0.291 0.358 79.5 82.5 0.172 0.141 0.165 0.123 1.0 0.255 MonkeyOcr 0.193 0.259 0.127 0.236 0.262 0.325 78.4 74.7 0.186 0.294 0.197 0.180 1.0 0.623 OCRFlux 0.252 0.254 0.134 0.195 0.326 0.405 58.3 70.2 0.358 0.260 0.191 0.156 1.0 0.284 Gotocr 0.247 0.249 0.181 0.213 0.231 0.318 59.5 74.7 0.38 0.299 0.195 0.164 0.969 0.446 Olmocr 0.341 0.382 0.125 0.205 0.719 0.766 57.1 56.6 0.327 0.389 0.191 0.169 1.0 0.294 SmolDocling 0.657 0.895 0.486 0.932 0.859 0.972 18.5 1.5 0.86 0.98 0.413 0.695 1.0 0.927 Logics-Parsing 0.124 0.145 0.089 0.139 0.106 0.165 76.6 79.5 0.165 0.166 0.136 0.113 0.519 0.252 General VLMs Qwen2VL-72B 0.298 0.342 0.142 0.244 0.431 0.363 64.2 55.5 0.425 0.581 0.193 0.182 0.792 0.359 Qwen2.5VL-72B 0.233 0.263 0.162 0.24 0.251 0.257 69.6 67 0.313 0.353 0.205 0.204 0.597 0.349 Doubao-1.6 0.188 0.248 0.129 0.219 0.273 0.336 74.9 69.7 0.180 0.288 0.171 0.148 0.601 0.317 GPT-5 0.242 0.373 0.119 0.36 0.398 0.456 67.9 55.8 0.26 0.397 0.191 0.28 0.88 0.46 Gemini2.5 pro 0.185 0.20 0.115 0.155 0.288 0.326 82.6 80.3 0.154 0.182 0.181 0.136 0.535 0.26 Tested on the v3/PDF 
Conversion API (August 2025 deployment).

We would like to acknowledge the following open-source projects that provided inspiration and reference for this work: - Qwen2.5-VL - OmniDocBench - Mathpix

Help me test my AI-Powered Quantum Network Monitor Assistant with quantum-ready security checks: The full open-source code for the Quantum Network Monitor Service is available at my GitHub repos (repos with NetworkMonitor in the name): Source Code Quantum Network Monitor. You will also find the code I use to quantize the models if you want to do it yourself: GGUFModelBuilder

💬 How to test: Choose an AI assistant type:
- `TurboLLM` (GPT-4.1-mini)
- `HugLLM` (Hugging Face open-source models)
- `TestLLM` (Experimental CPU-only)

What I'm Testing: I'm pushing the limits of small open-source models for AI network monitoring, specifically:
- Function calling against live network services
- How small a model can go while still handling:
  - Automated Nmap security scans
  - Quantum-readiness checks
  - Network monitoring tasks

🟡 TestLLM – Current experimental model (llama.cpp on 2 CPU threads on a Hugging Face Docker space):
- ✅ Zero-configuration setup
- ⏳ 30s load time (slow inference but no API costs). No token limit, as the cost is low.
- 🔧 Help wanted! If you're into edge-device AI, let's collaborate!

Other Assistants

🟢 TurboLLM – Uses gpt-4.1-mini:
- It performs very well, but unfortunately OpenAI charges per token, so token usage is limited.
- Create custom cmd processors to run .NET code on Quantum Network Monitor Agents
- Real-time network diagnostics and monitoring
- Security audits
- Penetration testing (Nmap/Metasploit)

🔵 HugLLM – Latest open-source models:
- 🌐 Runs on the Hugging Face Inference API. Performs pretty well using the latest models hosted on Novita.

💡 Example commands you could test: 1. `"Give me info on my website's SSL certificate"` 2. `"Check if my server is using quantum-safe encryption for communication"` 3.
`"Run a comprehensive security audit on my server"` 4. `"Create a cmd processor to .. (whatever you want)"` Note: you need to install a Quantum Network Monitor Agent to run the .NET code. This is a very flexible and powerful feature. Use with caution! I fund the servers used to create these model files, run the Quantum Network Monitor service, and pay for inference from Novita and OpenAI—all out of my own pocket. All the code behind the model creation and the Quantum Network Monitor project is open source. Feel free to use whatever you find helpful. If you appreciate the work, please consider buying me a coffee ☕. Your support helps cover service costs and allows me to raise token limits for everyone.
instinct-GGUF
This model was generated using llama.cpp at commit `408ff524`.

Instinct, the State-of-the-Art Open Next Edit Model

This repo contains the model weights for Continue's state-of-the-art open Next Edit model, Instinct. Robustly fine-tuned from Qwen2.5-Coder-7B on our dataset of real-world code edits, Instinct intelligently predicts your next move to keep you in flow.

Ollama: We've released a Q4_K_M GGUF quantization of Instinct for efficient local inference. Try it with Continue's Ollama integration, or just run `ollama run nate/instinct`. You can also serve the model using either of the options below, then connect it with Continue.

SGLang: `python3 -m sglang.launch_server --model-path continuedev/instinct --load-format safetensors`

vLLM: `vllm serve continuedev/instinct --served-model-name instinct --load-format safetensors`

For more information on the work behind Instinct, please refer to our blog.
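Both SGLang and vLLM expose an OpenAI-compatible HTTP API once serving. A minimal sketch of building a chat-completions request for the locally served model; the port (8000) is the usual default and may differ, and the message framing is illustrative rather than Instinct's documented next-edit prompt format.

```python
import json

# vLLM/SGLang serve an OpenAI-compatible endpoint; the port and path below
# are common defaults, assumed for illustration.
ENDPOINT = "http://localhost:8000/v1/chat/completions"

def build_request(code_before: str, instruction: str) -> str:
    """Builds a chat-completions payload for the locally served model.

    The user-message framing is illustrative; consult Continue's docs for
    the exact prompt format Instinct expects for next-edit prediction.
    """
    payload = {
        "model": "instinct",  # matches --served-model-name above
        "messages": [
            {"role": "user",
             "content": f"{instruction}\n\n```\n{code_before}\n```"},
        ],
        "temperature": 0.2,
    }
    return json.dumps(payload)

body = build_request("def add(a, b):\n    return a - b", "Fix the bug")
print(body)
```

The resulting JSON string would be POSTed to `ENDPOINT` with a `Content-Type: application/json` header.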
Qwen3-16B-A3B-GGUF
granite-8b-code-instruct-4k-GGUF
This model was generated using llama.cpp at commit `0a5a3b5c`.

Model Summary

Granite-8B-Code-Instruct-4K is an 8B-parameter model fine-tuned from Granite-8B-Code-Base-4K on a combination of permissively licensed instruction data to enhance instruction-following capabilities, including logical reasoning and problem-solving skills.

- Developers: IBM Research
- GitHub Repository: ibm-granite/granite-code-models
- Paper: Granite Code Models: A Family of Open Foundation Models for Code Intelligence
- Release Date: May 6th, 2024
- License: Apache 2.0

Usage

Intended use: The model is designed to respond to coding-related instructions and can be used to build coding assistants.

Generation: This is a simple example of how to use the Granite-8B-Code-Instruct-4K model.

Training Data

Granite Code Instruct models are trained on the following types of data.

Code Commits Datasets: We sourced code commit data from the CommitPackFT dataset, a filtered version of the full CommitPack dataset. From the CommitPackFT dataset, we only consider data for 92 programming languages. Our inclusion criterion boils down to selecting the programming languages common between CommitPackFT and the 116 languages we considered for pretraining the base model (Granite-8B-Code-Base).
Math Datasets: We consider two high-quality math datasets, MathInstruct and MetaMathQA. Due to license issues, we filtered out GSM8K-RFT and Camel-Math from the MathInstruct dataset.

Code Instruction Datasets: We use Glaive-Code-Assistant-v3, Glaive-Function-Calling-v2, NL2SQL11, and a small collection of synthetic API-calling datasets.

Language Instruction Datasets: We include high-quality datasets such as HelpSteer and an open-license-filtered version of Platypus. We also include a collection of hardcoded prompts to ensure our model generates correct outputs given inquiries about its name or developers.

Infrastructure

We train the Granite Code models using two of IBM's supercomputing clusters, Vela and Blue Vela, outfitted with NVIDIA A100 and H100 GPUs, respectively. These clusters provide a scalable and efficient infrastructure for training our models over thousands of GPUs.

Ethical Considerations and Limitations

Granite Code Instruct models are primarily fine-tuned using instruction-response pairs across a specific set of programming languages, so their performance may be limited on out-of-domain programming languages. In this situation, it is beneficial to provide few-shot examples to steer the model's output. Moreover, developers should perform safety testing and target-specific tuning before deploying these models in critical applications. The model also inherits ethical considerations and limitations from its base model. For more information, please refer to the Granite-8B-Code-Base-4K model card.
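As the limitations note suggests, few-shot examples can steer the model on out-of-domain languages. A minimal sketch of assembling such a prompt; the plain question/answer framing is illustrative only, and in practice the model's actual chat template should be applied via the tokenizer.

```python
def few_shot_prompt(examples, query):
    """Concatenates worked examples ahead of the real query.

    `examples` is a list of (instruction, solution) pairs; the framing
    below is illustrative, not Granite's official chat template.
    """
    parts = []
    for instruction, solution in examples:
        parts.append(f"Question:\n{instruction}\n\nAnswer:\n{solution}\n")
    parts.append(f"Question:\n{query}\n\nAnswer:\n")
    return "\n".join(parts)

prompt = few_shot_prompt(
    [("Write a COBOL paragraph that adds two numbers.",
      "ADD-NUMS.\n    ADD A TO B GIVING C.")],
    "Write a COBOL paragraph that subtracts two numbers.",
)
print(prompt)
```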
OpenMath-CodeLlama-13b-Python-hf-GGUF
Llama-3.1-Nemotron-8B-UltraLong-2M-Instruct-GGUF
MinerU2.5-2509-1.2B-GGUF
OpenMath-Nemotron-32B-GGUF
OpenMath2-Llama3.1-8B-GGUF
Qwen2.5-Omni-3B-GGUF
Magma-8B-GGUF
AM-Thinking-v1-GGUF
Qwen3-Reranker-0.6B-GGUF
Apriel-Nemotron-15b-Thinker-GGUF
Qwen3-Embedding-0.6B-GGUF
Jan-nano-128k-GGUF
This model was generated using llama.cpp at commit `8846aace`.

Jan-Nano-128k: Empowering deeper research through extended context understanding.

GitHub: https://github.com/menloresearch/deep-research | License: Apache-2.0

Jan-Nano-128k represents a significant advancement in compact language models for research applications. Building upon the success of Jan-Nano, this enhanced version features a native 128k context window that enables deeper, more comprehensive research capabilities without the performance degradation typically associated with context-extension methods.

Key Improvements:
- 🔍 Research Deeper: Extended context allows for processing entire research papers, lengthy documents, and complex multi-turn conversations
- ⚡ Native 128k Window: Built from the ground up to handle long contexts efficiently, maintaining performance across the full context range
- 📈 Enhanced Performance: Unlike traditional context-extension methods, Jan-Nano-128k shows improved performance with longer contexts

This model maintains full compatibility with Model Context Protocol (MCP) servers while dramatically expanding the scope of research tasks it can handle in a single session.
Jan-Nano-128k has been rigorously evaluated on the SimpleQA benchmark using our MCP-based methodology, demonstrating superior performance compared to its predecessor. Traditional approaches to extending context length, such as YaRN (Yet another RoPE extensioN), often result in performance degradation as context length increases. Jan-Nano-128k breaks this paradigm. This fundamental difference makes Jan-Nano-128k ideal for research applications requiring deep document analysis, multi-document synthesis, and complex reasoning over large information sets.

Jan desktop will eventually support this model (WIP). Otherwise, you can check the deployment options below that we have tested. For additional tutorials and community guidance, visit our Discussion Forums. Note: the chat template is included in the tokenizer. For troubleshooting, download the Non-think chat template.

- Discussions: HuggingFace Community
- Issues: GitHub Repository
- Documentation: Official Docs
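For context on why extension methods like YaRN can degrade quality: RoPE encodes positions with a spectrum of rotation frequencies, and naive extension rescales positions before rotation. A minimal sketch of the base frequencies and simple linear position interpolation (illustrative only; YaRN applies a more nuanced per-frequency interpolation, and Jan-Nano-128k's native window avoids this rescaling entirely):

```python
def rope_frequencies(head_dim: int, base: float = 10000.0):
    """Per-pair rotation frequencies used by rotary position embeddings."""
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]

def position_angles(pos: int, freqs, scale: float = 1.0):
    """Rotation angles for one position; scale < 1 is linear interpolation,
    which squeezes a longer sequence into the trained position range."""
    return [pos * scale * f for f in freqs]

freqs = rope_frequencies(8)
native = position_angles(4096, freqs)                      # within the trained window
stretched = position_angles(131072, freqs, 4096 / 131072)  # squeezed back down
# Linear interpolation maps position 131072 onto the same angles as 4096,
# blurring nearby positions together, one source of degradation.
print(native == stretched)
```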
Eagle2-9B-GGUF
Qwen3-Coder-30B-A3B-Instruct-GGUF
TEN_Turn_Detection-GGUF
Llama-3.1-Minitron-4B-Width-Base-GGUF
DeepSeek-R1-Distill-Qwen-14B-GGUF
Qwen3-8B-abliterated-GGUF
gemma-3-27b-it-GGUF
DeepSeek-R1-Distill-Qwen-1.5B-GGUF
all-MiniLM-L6-v2-GGUF
OuteTTS-1.0-0.6B-GGUF
X-Ray_Alpha-GGUF
MiroThinker-v1.0-30B-GGUF
Qwen3-1.7B-abliterated-GGUF
NuMarkdown-8B-Thinking-GGUF
xLAM-2-32b-fc-r-GGUF
Magistral-Small-2506-abliterated-GGUF
Hunyuan-MT-Chimera-7B-GGUF
Trinity-Mini-GGUF
Seed-Coder-8B-Reasoning-GGUF
Llama3-ChatQA-2-8B-GGUF
Lucy-GGUF
Wayfarer-2-12B-GGUF
This model was generated using llama.cpp at commit `360d6533`.

We've heard over and over from AI Dungeon players that modern AI models are too nice, never letting them fail or die. While it may be good for a chatbot to be nice and helpful, great stories and games aren't all rainbows and unicorns. They have conflict, tension, and even death. These create real stakes and consequences for characters and the journeys they go on. We created Wayfarer as a response, and after much testing, feedback, and refining, we've developed a worthy sequel.

Wayfarer 2 further refines the formula that made the original Wayfarer so popular: slowing the pacing, increasing the length and detail of responses, and making death a distinct possibility for all characters—not just the user. The stakes have never been higher!

If you want to try this model for free, you can do so at https://aidungeon.com. We plan to continue improving and open-sourcing similar models, so please share any and all feedback on how we can improve model behavior. Below we share more details on how Wayfarer was created.
Wayfarer 2 12B received SFT training with a simple three-ingredient recipe: the Wayfarer 2 dataset itself, a series of sentiment-balanced roleplay transcripts, and a small instruct core to help retain its instruction-following capabilities.

Wayfarer's text adventure data was generated by simulating playthroughs of published character-creator scenarios from AI Dungeon. Five distinct user archetypes played through each scenario, with starting characters varied in faction, location, etc., to generate five unique samples. One language model played the role of narrator, with the other playing the user. They were blind to each other's underlying logic, so the user was actually capable of surprising the narrator with their choices. Each simulation was allowed to run for 8k tokens or until the main character died.

Wayfarer's general emotional sentiment is one of pessimism, where failure is frequent and plot armor does not exist for anyone. This serves to counter the positivity bias so inherent in today's language models.

The Nemo architecture is known for being sensitive to higher temperatures, so the following settings are recommended as a baseline. Nothing stops you from experimenting with these, of course.

Wayfarer was trained exclusively on second-person, present-tense data (using "you") in a narrative style. Other perspectives will work as well but may produce suboptimal results.

Thanks to Gryphe Padar for collaborating on this finetune with us!
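To illustrate why temperature sensitivity matters: sampling temperature rescales the model's logits before softmax, and on temperature-sensitive architectures even modest increases flatten the output distribution noticeably. A minimal sketch (a generic illustration with made-up numbers, not Wayfarer's recommended settings):

```python
import math

def softmax_with_temperature(logits, temperature):
    """Converts logits to probabilities; higher temperature flattens them."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 2.0, 0.0]                 # a confidently peaked distribution
cool = softmax_with_temperature(logits, 0.7)
hot = softmax_with_temperature(logits, 1.5)
# The top token loses probability mass as temperature rises, making
# low-probability (possibly incoherent) continuations more likely.
print(round(cool[0], 3), round(hot[0], 3))
```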
Llama-3.1-70B-Instruct-GGUF
Llama-xLAM-2-8b-fc-r-GGUF
rwkv7-1.5B-g1-GGUF
Josiefied-Qwen3-8B-abliterated-v1-GGUF
UIGEN-T2-7B-GGUF
FairyR1-32B-GGUF
Qwen3-VL-30B-A3B-Instruct-GGUF
This model was generated using llama.cpp at commit `66d8eccd4`.

Meet Qwen3-VL — the most powerful vision-language model in the Qwen series to date. This generation delivers comprehensive upgrades across the board: superior text understanding & generation, deeper visual perception & reasoning, extended context length, enhanced spatial and video-dynamics comprehension, and stronger agent-interaction capabilities. Available in Dense and MoE architectures that scale from edge to cloud, with Instruct and reasoning-enhanced Thinking editions for flexible, on-demand deployment.

- Visual Agent: Operates PC/mobile GUIs—recognizes elements, understands functions, invokes tools, completes tasks.
- Visual Coding Boost: Generates Draw.io/HTML/CSS/JS from images/videos.
- Advanced Spatial Perception: Judges object positions, viewpoints, and occlusions; provides stronger 2D grounding and enables 3D grounding for spatial reasoning and embodied AI.
- Long Context & Video Understanding: Native 256K context, expandable to 1M; handles books and hours-long video with full recall and second-level indexing.
- Enhanced Multimodal Reasoning: Excels in STEM/Math—causal analysis and logical, evidence-based answers.
- Upgraded Visual Recognition: Broader, higher-quality pretraining enables the model to "recognize everything"—celebrities, anime, products, landmarks, flora/fauna, etc.
- Expanded OCR: Supports 32 languages (up from 19); robust in low light, blur, and tilt; better with rare/ancient characters and jargon; improved long-document structure parsing.
- Text Understanding on par with pure LLMs: Seamless text–vision fusion for lossless, unified comprehension.

1. Interleaved-MRoPE: Full-frequency allocation over time, width, and height via robust positional embeddings, enhancing long-horizon video reasoning.
2. DeepStack: Fuses multi-level ViT features to capture fine-grained details and sharpen image–text alignment.
3. Text–Timestamp Alignment: Moves beyond T-RoPE to precise, timestamp-grounded event localization for stronger video temporal modeling.

This is the weight repository for Qwen3-VL-30B-A3B-Instruct. Below, we provide simple examples showing how to use Qwen3-VL with 🤖 ModelScope and 🤗 Transformers. The code for Qwen3-VL has been merged into the latest Hugging Face transformers, and we advise you to build from source with the following command: Here is a code snippet showing how to use the chat model with `transformers`: If you find our work helpful, feel free to cite us.
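The usage snippet itself is not reproduced above, but Qwen-VL-style chat models take a list of messages whose content mixes image and text parts. A minimal sketch of assembling such a message list; the field names follow the commonly documented Qwen-VL messages format, so verify them against the official Qwen3-VL examples.

```python
def build_messages(image_url: str, question: str):
    """One user turn combining an image reference and a text question."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_url},
                {"type": "text", "text": question},
            ],
        }
    ]

messages = build_messages(
    "https://example.com/demo.jpg",   # placeholder image URL
    "Describe this image.",
)
# In the Transformers Qwen-VL examples, a structure like this is passed to
# the processor's apply_chat_template before generation.
print(messages[0]["role"])
```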
Nanonets-OCR-s-GGUF
Mistral-Crab-DPO-GGUF
NextCoder-14B-GGUF
granite-3.2-8b-instruct-GGUF
rwkv7-7.2B-g0-GGUF
This model was generated using llama.cpp at commit `e4868d16`.

This is an RWKV-7 model in the flash-linear-attention format.

- Developed by: Bo Peng, Yu Zhang, Songlin Yang, Ruichong Zhang, Zhiyuan Li
- Funded by: RWKV Project (under LF AI & Data Foundation)
- Model type: RWKV7
- Language(s) (NLP): Multilingual
- License: Apache-2.0
- Parameter count: 7.2B
- Tokenizer: RWKV World tokenizer
- Vocabulary size: 65,536
- Repository: https://github.com/fla-org/flash-linear-attention ; https://github.com/BlinkDL/RWKV-LM
- Paper: https://arxiv.org/abs/2503.14456
- Model: https://huggingface.co/BlinkDL/rwkv7-g1/resolve/main/rwkv7-g1-2.9b-20250519-ctx4096.pth

Install `flash-linear-attention` and the latest version of `transformers` before using this model. You can then use this model just like any other Hugging Face model.

A: Upgrade transformers to >=4.48.0: `pip install 'transformers>=4.48.0'`
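Since the troubleshooting answer above hinges on the transformers version, here is a quick way to check a version string against the >=4.48.0 requirement. This is a pure-stdlib sketch that only handles plain dotted versions; in practice, `packaging.version` is the robust choice (it also handles suffixes like `.dev0`).

```python
def version_tuple(v: str):
    """Parses a dotted version string like '4.48.0' into comparable ints."""
    return tuple(int(part) for part in v.split("."))

def meets_requirement(installed: str, minimum: str = "4.48.0") -> bool:
    """True when the installed version satisfies the minimum requirement."""
    return version_tuple(installed) >= version_tuple(minimum)

print(meets_requirement("4.47.1"))  # False: too old for RWKV-7 support
print(meets_requirement("4.48.0"))  # True
```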
You will also find the code I use to quantize the models if you want to do it yourself: GGUFModelBuilder

💬 How to test: Choose an AI assistant type:
- `TurboLLM` (GPT-4.1-mini)
- `HugLLM` (Hugging Face open-source models)
- `TestLLM` (Experimental CPU-only)

What I'm Testing: I'm pushing the limits of small open-source models for AI network monitoring, specifically:
- Function calling against live network services
- How small can a model go while still handling:
  - Automated Nmap security scans
  - Quantum-readiness checks
  - Network monitoring tasks

🟡 TestLLM – Current experimental model (llama.cpp on 2 CPU threads on a Hugging Face Docker space):
- ✅ Zero-configuration setup
- ⏳ 30s load time (slow inference, but no API costs and no token limits, since the cost is low)
- 🔧 Help wanted! If you're into edge-device AI, let's collaborate!

Other Assistants

🟢 TurboLLM – Uses gpt-4.1-mini:
- It performs very well, but unfortunately OpenAI charges per token, so token usage is limited
- Create custom cmd processors to run .NET code on Quantum Network Monitor Agents
- Real-time network diagnostics and monitoring
- Security audits
- Penetration testing (Nmap/Metasploit)

🔵 HugLLM – Latest open-source models:
- 🌐 Runs on the Hugging Face Inference API. Performs pretty well using the latest models hosted on Novita.

💡 Example commands you could test:
1. `"Give me info on my website's SSL certificate"`
2. `"Check if my server is using quantum-safe encryption for communication"`
3. `"Run a comprehensive security audit on my server"`
4. `"Create a cmd processor to .. (whatever you want)"`

Note: you need to install a Quantum Network Monitor Agent to run the .NET code on. This is a very flexible and powerful feature. 
Feel free to use whatever you find helpful. If you appreciate the work, please consider buying me a coffee ☕. Your support helps cover service costs and allows me to raise token limits for everyone.
LFM2-1.2B-RAG-GGUF
EXAONE-Deep-2.4B-GGUF
Llama-3.2-1B-GGUF
MiniCPM4.1-8B-GGUF
This model was generated using llama.cpp at commit `1411d9275`, using the same selective tensor-precision ("layer bumping") quantization approach described above. Click here to get info on choosing the right GGUF model format

What's New
- [2025.09.29] The InfLLM-V2 paper is released! We can train a sparse attention model with only 5B long-text tokens. 🔥🔥🔥
- [2025.09.05] The MiniCPM4.1 series is released! This series is a hybrid reasoning model with trainable sparse attention, which can be used in both deep reasoning mode and non-reasoning mode. 🔥🔥🔥
- [2025.06.06] The MiniCPM4 series is released! This model achieves ultimate efficiency improvements while maintaining optimal performance at the same scale! It can achieve over 5x generation acceleration on typical end-side chips! You can find the technical report here. 🔥🔥🔥

Highlights: MiniCPM4.1 offers the following features:
- ✅ Strong reasoning capability: surpasses similar-sized models on 15 tasks!
- ✅ Fast generation: 3x decoding speedup for reasoning!
- ✅ Efficient architecture: trainable sparse attention, frequency-ranked speculative decoding!
- MiniCPM4.1-8B: The latest version of MiniCPM4, with 8B parameters, supporting fusion thinking. 
All MiniCPM4 series models:
- MiniCPM4-8B: The flagship model with 8B parameters, trained on 8T tokens
- MiniCPM4-0.5B: Lightweight version with 0.5B parameters, trained on 1T tokens
- MiniCPM4-8B-Eagle-FRSpec: Eagle head for FRSpec, accelerating speculative inference
- MiniCPM4-8B-Eagle-FRSpec-QAT-cpmcu: Eagle head with QAT for FRSpec, integrating speculation and quantization for ultra acceleration
- MiniCPM4-8B-Eagle-vLLM: Eagle head in vLLM format for speculative inference
- MiniCPM4-8B-marlin-Eagle-vLLM: Quantized Eagle head in vLLM format
- BitCPM4-0.5B: Extreme ternary quantization of MiniCPM4-0.5B, achieving 90% bit-width reduction
- BitCPM4-1B: Extreme ternary quantization of MiniCPM3-1B, achieving 90% bit-width reduction
- MiniCPM4-Survey: Generates trustworthy, long-form survey papers from user queries
- MiniCPM4-MCP: Integrates MCP tools to autonomously satisfy user requirements

Performance Evaluation: MiniCPM4.1 launches an end-side version at the 8B parameter scale, achieving best-in-class performance in its category.

Best Practices
1. It is advisable to use `temperature=0.9` and `top_p=0.95`, and we suggest setting the maximum output length to 65,536 tokens.
2. For math problems, we recommend using "Please reason step by step, and put your final answer within \boxed{}."
3. For English multiple-choice questions, we recommend starting with "Answer the following multiple choice question. The last line of your response should be of the following format: 'ANSWER: $LETTER' (without quotes) where LETTER is one of ABCD. Think step by step before answering." For Chinese multiple-choice questions, use "你回答的最后一行必须是以下格式 '答案:$选项' (不带引号), 其中选项是ABCD之一。请在回答之前一步步思考" (the Chinese equivalent of the English instruction above).

Efficiency Evaluation: MiniCPM4.1 adopts sparse attention and speculative decoding to improve inference efficiency. On an RTX 4090, MiniCPM4.1 achieves a 3x decoding speed improvement in reasoning.

Usage: MiniCPM4.1 can be used with the following frameworks: Hugging Face Transformers, SGLang, vLLM, and CPM.cu. 
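The decoding settings recommended in the best practices above can be collected into generation kwargs; a minimal sketch, assuming the common Hugging Face `generate()` parameter names (exact names may differ per inference framework):

```python
# Recommended MiniCPM4.1 decoding settings from the best practices
# above, expressed as Hugging Face-style generation kwargs (a sketch;
# parameter names are assumptions, values come from the model card).
gen_kwargs = {
    "do_sample": True,        # sampling must be on for temperature/top_p
    "temperature": 0.9,       # recommended sampling temperature
    "top_p": 0.95,            # recommended nucleus-sampling threshold
    "max_new_tokens": 65536,  # suggested maximum output length
}
```

These would typically be passed as `model.generate(**inputs, **gen_kwargs)`.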
For the ultimate inference speed, we highly recommend CPM.cu. MiniCPM4/MiniCPM4.1 supports both dense and sparse attention inference modes; vLLM and SGLang currently support only dense inference. If you want to use sparse inference mode, please use Hugging Face Transformers or CPM.cu.
- Dense attention inference: vLLM, SGLang, Hugging Face Transformers
- Sparse attention inference: Hugging Face Transformers, CPM.cu

To facilitate research on sparse attention, we provide InfLLM-V2 Training Kernels and InfLLM-V2 Inference Kernels.

Inference with Transformers: MiniCPM4.1-8B requires `transformers>=4.56`.

- Inference with Sparse Attention: MiniCPM4.1-8B supports `InfLLM v2`, a sparse attention mechanism designed for efficient long-sequence inference. It requires the `infllmv2_cuda_impl` library. To enable InfLLM v2, you need to add the `sparse_config` field to `config.json`. These parameters control the behavior of InfLLM v2:
  - `kernel_size` (default: 32): The size of semantic kernels.
  - `kernel_stride` (default: 16): The stride between adjacent kernels.
  - `init_blocks` (default: 1): The number of initial blocks that every query token attends to. This ensures attention to the beginning of the sequence.
  - `block_size` (default: 64): The block size for key-value blocks.
  - `window_size` (default: 2048): The size of the local sliding window.
  - `topk` (default: 64): Specifies that each token computes attention with only the top-k most relevant key-value blocks.
  - `use_nope` (default: false): Whether to use the NOPE technique in block selection for improved performance.
  - `dense_len` (default: 8192): Since sparse attention offers limited benefits for short sequences, the model can use standard (dense) attention for shorter texts. The model will use dense attention for sequences with a token length below `dense_len` and switch to sparse attention for sequences exceeding this length. 
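Putting the defaults above together, the `config.json` addition might look like the following. This is a minimal sketch assuming the field names and default values listed above; verify against the official MiniCPM4.1 repository before use:

```json
{
  "sparse_config": {
    "kernel_size": 32,
    "kernel_stride": 16,
    "init_blocks": 1,
    "block_size": 64,
    "window_size": 2048,
    "topk": 64,
    "use_nope": false,
    "dense_len": 8192
  }
}
```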
Set this to `-1` to always use sparse attention regardless of sequence length.

- Long Context Extension: MiniCPM4.1 natively supports context lengths of up to 65,536 (64K) tokens. For conversations where the total length (including both input and output) significantly exceeds this limit, we recommend RoPE scaling techniques for effective handling of long texts. We have validated the model's performance on context lengths of up to 131,072 tokens by modifying the LongRoPE factor. You can apply the LongRoPE factor modification by editing the model files; specifically, adjust the `rope_scaling` field in `config.json`.

Inference with SGLang: You can run inference with SGLang in standard mode or speculative decoding mode. For accelerated inference with speculative decoding, follow these steps. The EAGLE3 adaptation PR has been submitted; for now, use our repository for installation. Start the SGLang server with speculative decoding enabled. The client usage remains the same for both standard and speculative decoding. Note: make sure to update the port number in the client code to match the server port (30002 in the speculative decoding example).
- `--speculative-algorithm EAGLE3`: Enables EAGLE3 speculative decoding
- `--speculative-draft-model-path`: Path to the draft model for speculation
- `--speculative-num-steps`: Number of speculative steps (default: 3)
- `--speculative-eagle-topk`: Top-k parameter for EAGLE (default: 1)
- `--speculative-num-draft-tokens`: Number of draft tokens (default: 32)
- `--mem-fraction-static`: Memory fraction for static allocation (default: 0.9)

For standard mode, you need to install our forked version of SGLang, start the inference server, and then use the chat interface.

Inference with vLLM: You can run inference with vLLM in standard mode or speculative decoding mode. 
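The `rope_scaling` edit for LongRoPE-based context extension might look roughly like the following sketch. The field names here are assumptions, and the validated LongRoPE factor lists ship with the official MiniCPM4.1 repository; they are deliberately left empty rather than guessed:

```python
# Hypothetical sketch of the `rope_scaling` field in config.json for
# LongRoPE context extension (field names assumed; the validated
# LongRoPE factors come from the official MiniCPM4.1 repo and are
# elided here, not invented).
rope_scaling = {
    "rope_type": "longrope",
    "long_factor": [],    # fill in the validated LongRoPE factors
    "short_factor": [],   # fill in the validated LongRoPE factors
    "original_max_position_embeddings": 65536,  # native 64K context
}
```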
For accelerated inference with speculative decoding using vLLM, follow these steps. First, download the MiniCPM4.1 draft model and change `architectures` in its config.json to `LlamaForCausalLM`. The EAGLE3 vLLM PR has been submitted; for now, use our repository for installation. Start the vLLM inference server with speculative decoding enabled, making sure to update the model path in the speculative config to point to your downloaded MiniCPM4_1-8B-Eagle3-bf16 folder. The client usage remains the same for both standard and speculative decoding.
- `VLLM_USE_V1=1`: Enables the vLLM v1 API
- `--speculative-config`: JSON configuration for speculative decoding
  - `model`: Path to the draft model for speculation
  - `num_speculative_tokens`: Number of speculative tokens (default: 3)
  - `method`: Speculative decoding method (eagle3)
  - `draft_tensor_parallel_size`: Tensor parallel size for the draft model (default: 1)
- `--seed`: Random seed for reproducibility
- `--trust-remote-code`: Allow execution of remote code for custom models

For standard mode, you need to install the latest version of vLLM and start the inference server.
> Note: In vLLM's chat API, `add_special_tokens` is `False` by default. This means important special tokens, such as the beginning-of-sequence (BOS) token, will not be added automatically. To ensure the input prompt is correctly formatted for the model, you should explicitly set `extra_body={"add_special_tokens": True}`.

Inference with CPM.cu: We recommend CPM.cu for inference with MiniCPM4 and MiniCPM4.1. CPM.cu is a CUDA inference framework developed by OpenBMB that integrates efficient sparse attention, speculative sampling, and quantization techniques, fully leveraging the efficiency advantages of MiniCPM4 and MiniCPM4.1. MiniCPM4.1 natively supports context lengths of up to 65,536 (64K) tokens. 
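The note about `add_special_tokens` can be illustrated by the request payload sent to a vLLM OpenAI-compatible server; a minimal sketch (the helper and the model id are illustrative, not part of the card):

```python
# Hypothetical helper: build an OpenAI-compatible chat request for a
# vLLM server, explicitly re-enabling special tokens as the note
# above recommends (vLLM's chat API defaults add_special_tokens to
# False, so the BOS token would otherwise be omitted).
def build_chat_request(prompt, model="openbmb/MiniCPM4.1-8B"):
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        # Passed through to vLLM as extra, non-OpenAI request fields.
        "extra_body": {"add_special_tokens": True},
    }

req = build_chat_request("Why is the sky blue?")
```

With the `openai` Python client, the same dict would be spread into `client.chat.completions.create(**req)`.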
To reproduce the long-text acceleration effect in the paper, we recommend using the validated LongRoPE factors: change the `rope_scaling` field in `config.json` to enable LongRoPE. After modification, you can run the provided script to reproduce the long-context acceleration effect (the script will automatically download the model weights from Hugging Face), and you can also run inference with the EAGLE3 speculative decoding algorithm. For more details about CPM.cu, please refer to the CPM.cu repo.

Inference with llama.cpp and Ollama: We also support inference with llama.cpp and Ollama. You can download the GGUF format of the MiniCPM4.1-8B model from Hugging Face and run it with llama.cpp for efficient CPU or GPU inference. For Ollama, please refer to the model hub for model download; after installing the Ollama package, you can use MiniCPM4.1 with the usual commands.

Hybrid Reasoning Mode: MiniCPM4.1 supports a hybrid reasoning mode, usable in both deep reasoning mode and non-reasoning mode. Users can set `enable_thinking=True` in `tokenizer.apply_chat_template` to enable reasoning mode, and `enable_thinking=False` for non-reasoning mode. Similarly, users can directly add `/no_think` at the end of the query to enable non-reasoning mode. If no special token is added, or `/think` is appended to the query, the model will enable reasoning mode.

Statement
- As a language model, MiniCPM generates content by learning from a vast amount of text.
- However, it does not possess the ability to comprehend or express personal opinions or value judgments.
- Any content generated by MiniCPM does not represent the viewpoints or positions of the model developers.
- Therefore, when using content generated by MiniCPM, users should take full responsibility for evaluating and verifying it on their own.

LICENSE: This repository and the MiniCPM models are released under the Apache-2.0 License.

Citation: Please cite our paper if you find our work valuable. 
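The query-suffix control for hybrid reasoning can be sketched as a tiny helper (the function is illustrative; the `/think` and `/no_think` suffixes follow the description above, with underscores restored on the assumption they were stripped during extraction):

```python
# Hedged sketch of MiniCPM4.1's hybrid-reasoning query suffixes:
# append "/no_think" to force non-reasoning mode, "/think" (or
# nothing) to enable reasoning mode. Helper name is illustrative.
def with_reasoning_mode(query: str, think: bool) -> str:
    """Append the reasoning-mode control suffix to a user query."""
    return f"{query} {'/think' if think else '/no_think'}"

fast = with_reasoning_mode("Summarize this paragraph.", think=False)
deep = with_reasoning_mode("Prove the inequality.", think=True)
```

The same effect is available programmatically via `enable_thinking` in `tokenizer.apply_chat_template`.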
granite-4.0-micro-GGUF
📣 Update [10-07-2025]: Added a default system prompt to the chat template to guide the model towards more professional, accurate, and safe responses.

Model Summary: Granite-4.0-Micro is a 3B-parameter long-context instruct model finetuned from Granite-4.0-Micro-Base using a combination of permissively licensed open-source instruction datasets and internally collected synthetic datasets. The model is developed using a diverse set of techniques with a structured chat format, including supervised finetuning, model alignment using reinforcement learning, and model merging. Granite 4.0 instruct models feature improved instruction following (IF) and tool-calling capabilities, making them more effective in enterprise applications.
- Developers: Granite Team, IBM
- HF Collection: Granite 4.0 Language Models HF Collection
- GitHub Repository: ibm-granite/granite-4.0-language-models
- Website: Granite Docs
- Release Date: October 2nd, 2025
- License: Apache 2.0

Supported Languages: English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese. Users may finetune Granite 4.0 models for languages beyond these twelve.

Intended use: The model is designed to follow general instructions and can serve as the foundation for AI assistants across diverse domains, including business applications, as well as for LLM agents equipped with tool-use capabilities.

Capabilities
- Summarization
- Text classification
- Text extraction
- Question-answering
- Retrieval Augmented Generation (RAG)
- Code-related tasks
- Function-calling tasks
- Multilingual dialog use cases
- Fill-In-the-Middle (FIM) code completions

Generation: This is a simple example of how to use the Granite-4.0-Micro model. Copy the snippet from the section that is relevant for your use case. 
Tool-calling: Granite-4.0-Micro comes with enhanced tool-calling capabilities, enabling seamless integration with external functions and APIs. To define a list of tools, please follow OpenAI's function definition schema. This is an example of how to use the Granite-4.0-Micro model's tool-calling ability.

Benchmarks: results are reported per metric for Micro (Dense), H Micro (Dense), H Tiny (MoE), and H Small (MoE); the score values are omitted in this extract. Multilingual benchmarks and their included languages:
- MMMLU (11): ar, de, en, es, fr, ja, ko, pt, zh, bn, hi
- INCLUDE (14): hi, bn, ta, te, ar, de, es, fr, it, ja, ko, nl, pt, zh

Model Architecture: The Granite-4.0-Micro baseline is built on a decoder-only dense transformer architecture. Core components of this architecture are: GQA, RoPE, MLP with SwiGLU, RMSNorm, and shared input/output embeddings.

| Model | Micro (Dense) | H Micro (Dense) | H Tiny (MoE) | H Small (MoE) |
|---|---|---|---|---|
| Number of layers | 40 attention | 4 attention / 36 Mamba2 | 4 attention / 36 Mamba2 | 4 attention / 36 Mamba2 |
| MLP / shared-expert hidden size | 8192 | 8192 | 1024 | 1536 |

Training Data: Overall, our SFT data is largely comprised of three key sources: (1) publicly available datasets with permissive licenses, (2) internal synthetic data targeting specific capabilities, and (3) a select set of human-curated data.

Infrastructure: We trained the Granite 4.0 language models on an NVIDIA GB200 NVL72 cluster hosted by CoreWeave. Intra-rack communication occurs via the 72-GPU NVLink domain, and a non-blocking, full Fat-Tree NDR 400 Gb/s InfiniBand network provides inter-rack communication. This cluster provides a scalable and efficient infrastructure for training our models over thousands of GPUs.

Ethical Considerations and Limitations: Granite 4.0 instruct models are primarily finetuned using instruction-response pairs, mostly in English, but also multilingual data covering multiple languages. 
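A single tool definition following OpenAI's function definition schema, as referenced above, might look like this (the function name and fields are illustrative, not taken from the Granite examples):

```python
# Hedged sketch of one tool definition in OpenAI's function
# definition schema; the weather function is a made-up example.
get_weather_tool = {
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
            },
            "required": ["city"],
        },
    },
}

# The model is then prompted with the list of available tools.
tools = [get_weather_tool]
```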
Although this model can handle multilingual dialog use cases, its performance on non-English tasks might not match its English performance. In such cases, introducing a small number of examples (few-shot) can help the model generate more accurate outputs. While this model has been aligned with safety in mind, it may in some cases produce inaccurate, biased, or unsafe responses to user prompts. We therefore urge the community to use this model with proper safety testing and tuning tailored to their specific tasks.

Resources
- ⭐️ Learn about the latest updates with Granite: https://www.ibm.com/granite
- 📄 Get started with tutorials, best practices, and prompt engineering advice: https://www.ibm.com/granite/docs/
- 💡 Learn about the latest Granite learning resources: https://ibm.biz/granite-learning-resources
typhoon-ocr-7b-GGUF
phi-4-GGUF
Mistral-Small-3.2-24B-Instruct-2506-GGUF
granite-docling-258M-GGUF
Granite Docling is a multimodal image-text-to-text model engineered for efficient document conversion. It preserves the core features of Docling while maintaining seamless integration with DoclingDocuments to ensure full compatibility. Granite Docling 258M builds upon the Idefics3 architecture, but introduces two key modifications: it replaces the vision encoder with siglip2-base-patch16-512 and substitutes the language model with a Granite 165M LLM. Try out our Granite-Docling-258M demo today.
- Developed by: IBM Research
- Model type: Multi-modal model (image+text-to-text)
- Language(s): English (NLP)
- License: Apache 2.0
- Release Date: September 17, 2025

Granite-docling-258M is fully integrated into the Docling pipelines, carrying over existing features while introducing a number of powerful new features, including:
- 🔢 Enhanced Equation Recognition: More accurate detection and formatting of mathematical formulas
- 🧩 Flexible Inference Modes: Choose between full-page inference and bbox-guided region inference
- 🧘 Improved Stability: Tends to avoid infinite loops more effectively
- 🧮 Enhanced Inline Equations: Better inline math recognition
- 🧾 Document Element QA: Answer questions about a document's structure, such as the presence and order of document elements
- 🌍 Japanese, Arabic and Chinese support (experimental)

The easiest way to use this model is through the 🐥 Docling library. It will automatically download this model and convert documents to various formats for you. 
Install the latest version of `docling` through pip, then use the Docling CLI, or set this model up within the Docling SDK. Alternatively, you can use bare transformers, vllm, onnx or mlx-vlm to perform inference, and the docling-core APIs to convert the results to a variety of output formats (md, html, etc.): 📄 single-page image inference using plain 🤗 transformers 🤖, or 💻 local inference on Apple Silicon with MLX (see here). ℹ️ If you run into trouble running granite-docling with the code above, check the troubleshooting section at the bottom ⬇️.

Intended Use: Granite-Docling is designed to complement the Docling library, not replace it. It integrates as a component within the larger Docling library, consolidating the functions of multiple single-purpose models into a single, compact VLM. However, Granite-Docling is not intended for general image understanding. For tasks focused solely on image-text input, we recommend using Granite Vision models, which are purpose-built and optimized for image-text processing.

Evaluations: A comprehensive discussion of evaluation methods and findings has already been presented in our previous publication [citation]. As this model is an update, we refer readers to that work for additional details. The evaluation can be performed using the docling-eval framework for the document-related tasks, and lmms-eval for MMStar and OCRBench.

| | Edit-distance ↓ | F1 ↑ | Precision ↑ | Recall ↑ | BLEU ↑ | Meteor ↑ |
|---|---|---|---|---|---|---|
| smoldocling-256m-preview | 0.48 | 0.80 | 0.89 | 0.79 | 0.58 | 0.67 |

| | Edit-distance ↓ | F1 ↑ | Precision ↑ | Recall ↑ | BLEU ↑ | Meteor ↑ |
|---|---|---|---|---|---|---|
| smoldocling-256m-preview | 0.114 | 0.915 | 0.94 | 0.909 | 0.875 | 0.889 |
| granite-docling-258m | 0.013 | 0.988 | 0.99 | 0.988 | 0.983 | 0.986 |

| | Edit-distance ↓ | F1 ↑ | Precision ↑ | Recall ↑ | BLEU ↑ | Meteor ↑ |
|---|---|---|---|---|---|---|
| smoldocling-256m-preview | 0.119 | 0.947 | 0.959 | 0.941 | 0.824 | 0.878 |
| granite-docling-258m | 0.073 | 0.968 | 0.968 | 0.969 | 0.893 | 0.927 |

Table: Convert table to OTSL. 
(Lysak et al., 2023): `<otsl>`

Actions and Pipelines:
- OCR the text in a specific location: `<loc155><loc233><loc206><loc237>`
- Identify element at: `<loc247><loc482><loc252><loc486>`
- Find all 'text' elements on the page, retrieve all section headers.

Model Architecture: The architecture of granite-docling-258m consists of the following components: (2) vision-language connector: a pixel-shuffle projector (as in Idefics3). We built upon Idefics3 to train our model. We incorporated DocTags into our LLM's supervised fine-tuning (SFT) data to help the model become familiar with the format, enabling faster convergence and mitigating issues previously observed with SmolDocling. The model was trained using the nanoVLM framework, which provides a lightweight and efficient training setup for vision-language models.

Training Data: Our training corpus consists of two principal sources: (1) publicly available datasets and (2) internally constructed synthetic datasets designed to elicit specific document understanding capabilities.
- SynthCodeNet — a large-scale collection of synthetically rendered code snippets spanning over 50 programming languages
- SynthFormulaNet — a dataset of synthetic mathematical expressions paired with ground-truth LaTeX representations
- SynthChartNet — synthetic chart images annotated with structured table outputs
- DoclingMatix — a curated corpus of real-world document pages sampled from diverse domains

Infrastructure: We train granite-docling-258m using IBM's supercomputing cluster, Blue Vela, which is outfitted with NVIDIA H100 GPUs. This cluster provides a scalable and efficient infrastructure for training our models over thousands of GPUs.

Responsible Use and Limitations: Some use cases for vision-language models can trigger certain risks and ethical considerations, including but not limited to: bias and fairness, misinformation, and autonomous decision-making. 
Although our alignment processes include safety considerations, the model may in some cases produce inaccurate, biased, offensive or unwanted responses to user prompts. Additionally, it remains uncertain whether smaller models exhibit increased susceptibility to hallucination due to their reduced size, which could limit their ability to generate coherent and contextually accurate responses. This aspect is currently an active area of research, and we anticipate more rigorous exploration, comprehension, and mitigation in this domain. We urge the community to use granite-docling-258m in a responsible way and to avoid any malicious use. We recommend using this model only as part of the Docling library; more general vision tasks may pose higher inherent risks of triggering unwanted output. To enhance safety, we recommend using granite-docling-258m alongside Granite Guardian, a fine-tuned instruct model designed to detect and flag risks in prompts and responses across the key dimensions outlined in the IBM AI Risk Atlas. Its training, which includes both human-annotated and synthetic data informed by internal red-teaming, enables it to outperform similar open-source models on standard benchmarks, providing an additional layer of safety.

Resources
- ⭐️ Learn about the latest updates with Docling: https://docling-project.github.io/docling/#features
- 🚀 Get started with Docling concepts, integrations and tutorials: https://docling-project.github.io/docling/gettingstarted/
- 💡 Learn about the latest Granite learning resources: https://ibm.biz/granite-learning-resources
- 🖥️ Learn more about how to use Granite-Docling, explore the Docling library, and see what's coming next for Docling in the release blog: https://ibm.com/new/announcements/granite-docling-end-to-end-document-conversion

Troubleshooting
1. You receive `AttributeError: 'LlamaModel' object has no attribute 'wte'` when launching the model through vLLM. 
Current versions of vLLM (including 0.10.2) have limited support for tied weights as used in granite-docling, and loading breaks. We provide a version with untied weights on the `untied` branch of this model repo; to use the untied version, pass the `revision` argument to vLLM.
2. The model outputs only exclamation marks (i.e. "!!!!!!!!!!!!!!!"). This is seen on older NVIDIA GPUs, such as the T4 available in Google Colab, which lack support for the `bfloat16` format. You can work around it by setting the `dtype` to `float32`.
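Selecting the untied branch through vLLM's Python API might look like the following sketch (the repo id is assumed; the actual load is commented out since it needs a GPU with vLLM installed):

```python
# Hypothetical sketch: loading the untied-weights branch by passing
# `revision` to vLLM. Repo id is an assumption; adjust as needed.
load_kwargs = {
    "model": "ibm-granite/granite-docling-258M",
    "revision": "untied",  # branch with untied weights (see above)
}
# from vllm import LLM
# llm = LLM(**load_kwargs)
```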
osmosis-mcp-4b-GGUF
Qwen2.5-0.5B-Instruct-GGUF
Olmo-3-7B-Instruct-GGUF
phi-2-GGUF
GLM-4.1V-9B-Thinking-GGUF
DeepCoder-14B-Preview-GGUF
Llama-3.1-Nemotron-Nano-4B-v1.1-GGUF
kanana-1.5-8b-instruct-2505-GGUF
Qwen3-30B-A6B-16-Extreme-GGUF
Qwen3-8B-GGUF
Llama-3_3-Nemotron-Super-49B-v1-GGUF
rwkv7-191M-world-GGUF
trlm-135m-GGUF
Qwen3-0.6B-GGUF
granite-3.0-1b-a400m-base-GGUF
Qwen3-4B-Instruct-2507-GGUF
granite-3.3-8b-instruct-GGUF
Mistral-Small-3.1-24B-Instruct-2503-GGUF
SWE-Dev-32B-GGUF
This model was generated using llama.cpp at commit `0a5a3b5c`, using the same selective tensor-precision ("layer bumping") quantization approach described above. Click here to get info on choosing the right GGUF model format
- 🤗 SWE-Dev-7B (Qwen-2.5-Coder-7B-Instruct)
- 🤗 SWE-Dev-9B (GLM-4-9B-Chat)
- 🤗 SWE-Dev-32B (Qwen-2.5-Coder-32B-Instruct)
- 🤗 SWE-Dev-train (Training Data)

🚀 SWE-Dev is an open-source agent for software engineering tasks! This repository contains the SWE-Dev-32B model presented in the paper SWE-Dev: Building Software Engineering Agents with Training and Inference Scaling. 💡 We develop a comprehensive pipeline for creating developer-oriented datasets from GitHub repositories, including issue tracking, code localization, test case generation, and evaluation. 🔧 Built on open-source frameworks (OpenHands) and models, SWE-Dev-7B and 32B achieved solve rates of 23.4% and 36.6% on SWE-bench-Verified, respectively, approaching the performance of GPT-4o. 📚 We find that training-data scaling and inference scaling can both effectively boost model performance on SWE-bench; moreover, higher data quality further improves this trend when combined with reinforcement fine-tuning (RFT). For inference scaling specifically, the solve rate of SWE-Dev increased from 34.0% at 30 rounds to 36.6% at 75 rounds. 
Arch-Router-1.5B-GGUF
This model was generated using llama.cpp at commit `73e53dc8`.

Overview

With the rapid proliferation of large language models (LLMs) -- each optimized for different strengths, styles, or latency/cost profiles -- routing has become an essential technique for operationalizing the use of different models. However, existing LLM routing approaches are limited in two key ways: they evaluate performance using benchmarks that often fail to capture human preferences driven by subjective evaluation criteria, and they typically select from a limited pool of models. We introduce a preference-aligned routing framework that guides model selection by matching queries to user-defined domains (e.g., travel) or action types (e.g., image editing) -- offering a practical mechanism to encode preferences in routing decisions. Specifically, we introduce Arch-Router, a compact 1.5B model that learns to map queries to domain-action preferences for model routing decisions. Experiments on conversational datasets demonstrate that our approach achieves state-of-the-art (SOTA) results in matching queries with human preferences, outperforming top proprietary models. This model is described in the paper https://arxiv.org/abs/2506.16655 and powers Arch, the open-source AI-native proxy for agents, to enable preference-based routing in a seamless way.

To support effective routing, Arch-Router introduces two key concepts:
- Domain – the high-level thematic category or subject matter of a request (e.g., legal, healthcare, programming).
- Action – the specific type of operation the user wants performed (e.g., summarization, code generation, booking an appointment, translation).

Both domain and action configs are associated with preferred models or model variants. At inference time, Arch-Router analyzes the incoming prompt to infer its domain and action using semantic similarity, task indicators, and contextual cues.
It then applies the user-defined routing preferences to select the model best suited to handle the request.

- Structured Preference Routing: Aligns prompt requests with model strengths using explicit domain–action mappings.
- Transparent and Controllable: Makes routing decisions transparent and configurable, empowering users to customize system behavior.
- Flexible and Adaptive: Supports evolving user needs, model updates, and new domains/actions without retraining the router.
- Production-Ready Performance: Optimized for low-latency, high-throughput applications in multi-model environments.

Requirements

The code for Arch-Router-1.5B is included in the Hugging Face `transformers` library, and we advise you to install the latest version.

How to use

We use the following example to illustrate how to use our model to perform routing tasks. Please note that our model works best with our provided prompt format.

Quickstart

Then you should be able to see the following output string in JSON format:

To better understand how to create the route descriptions, please take a look at our Katanemo API.

License

The Katanemo Arch-Router model is distributed under the Katanemo license.
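The domain–action preference mechanism above can be sketched in plain Python. This is purely illustrative: the JSON field names and the model names in the mapping are assumptions for the sketch, not Katanemo's actual schema or config format.

```python
import json

# Hypothetical user-defined preferences: (domain, action) -> preferred model.
ROUTE_PREFERENCES = {
    ("travel", "booking appointment"): "model-a",
    ("programming", "code generation"): "model-b",
}
DEFAULT_MODEL = "general-purpose-model"


def select_model(router_output: str) -> str:
    """Map the router's JSON decision to the user's preferred model.

    `router_output` is assumed to look like:
    {"domain": "travel", "action": "booking appointment"}
    """
    decision = json.loads(router_output)
    key = (decision.get("domain"), decision.get("action"))
    # Fall back to a default model when no preference matches
    # (the fallback behaviour here is also an assumption).
    return ROUTE_PREFERENCES.get(key, DEFAULT_MODEL)
```

In Arch itself, these mappings live in the proxy configuration rather than in application code; the sketch only shows the shape of the decision step.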
DMind-1-mini-GGUF
functionary-v4r-small-preview-GGUF
Homunculus-GGUF
TildeOpen-30b-GGUF
This model was generated using llama.cpp at commit `408ff524`.

Developed by: Tilde.ai
Funded by: European Commission via the EuroHPC JU Large AI Grand Challenge
Model type: A 30B parameter dense decoder-only transformer
Languages: Albanian, Bosnian, Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Hungarian, Icelandic, Irish, Italian, Latgalian, Latvian, Lithuanian, Macedonian, Maltese, Montenegrin, Norwegian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovene, Spanish, Swedish, Turkish, Ukrainian, as well as mathematical proofs, programming code, and XML documents containing translation data
License: CC-BY-4.0

Mission statement

TildeOpen LLM is an open-source foundational language model built to serve underrepresented Nordic and Eastern European languages. Developed with European Commission funding and trained on the LUMI supercomputer, this 30B+ parameter model addresses the performance gaps that speakers of 19 focus languages—representing over 165 million people—face with existing AI systems.
The model employs an equitable tokeniser and curriculum-learning approach to ensure fair representation across less-resourced languages, moving beyond the typical English-centric design of most language models. As an open-source project, TildeOpen LLM enables transparent research and community-driven development while maintaining European technological independence. This foundational model is not yet adapted to follow instructions or aligned with safety features. The next version, built on top of this model, will be a specialised translation model, leveraging TildeOpen LLM's multilingual foundation to provide high-quality translation capabilities across the supported European language pairs.

Model training details

We train TildeOpen LLM using Tilde's branch of EleutherAI's open-source GPT-NeoX framework on the LUMI supercomputer's 768 AMD MI250X GPUs. The foundational model training involves 450,000 updates with a constant batch size of 4,718,592 tokens, using a constant learning rate followed by a cooldown phase across 2 trillion tokens. Training consists of three distinct data sampling phases. First, all languages are sampled uniformly to ensure equal representation. Second, languages are sampled according to their natural distribution so that the model sees as much data as possible from languages with larger speaker bases. Finally, we return to uniform sampling across all languages. This three-phase approach ensures TildeOpen LLM develops balanced multilingual capabilities while maintaining strong performance across all target languages, particularly the underrepresented European languages.
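The three sampling phases can be illustrated with a small sketch. This is my own illustration of the idea, not Tilde's training code, and the corpus sizes in the example are made up.

```python
def phase_weights(corpus_tokens: dict[str, int], phase: int) -> dict[str, float]:
    """Per-language sampling weights for the three phases described above:
    phases 1 and 3 sample languages uniformly; phase 2 samples them in
    proportion to their natural data size."""
    if phase in (1, 3):
        raw = {lang: 1.0 for lang in corpus_tokens}          # uniform
    elif phase == 2:
        raw = {lang: float(n) for lang, n in corpus_tokens.items()}  # natural
    else:
        raise ValueError("phase must be 1, 2, or 3")
    total = sum(raw.values())
    return {lang: w / total for lang, w in raw.items()}


# Example: a large and a small language corpus (invented numbers).
sizes = {"en": 900_000, "lv": 100_000}
print(phase_weights(sizes, 1))  # uniform: {'en': 0.5, 'lv': 0.5}
print(phase_weights(sizes, 2))  # natural: {'en': 0.9, 'lv': 0.1}
```

The uniform phases keep small languages from being drowned out, while the natural-distribution phase lets high-resource languages contribute proportionally more data.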
| Parameter | Value |
|-----------|-------|
| Sequence Length | 8192 |
| Number of Layers | 60 |
| Embedding Size | 6144 |
| FFN Hidden Size | 21504 |
| Number of Heads | 48 |
| Number of KV Heads (GQA) | 8 |
| Activation Function | SwiGLU |
| Position Encodings | RoPE |
| Layer Norm | RMSNorm |
| Embedding Parameters | 8.05E+08 |
| LM Head Parameters | 8.05E+08 |
| Non-embedding Parameters | 2.91E+10 |
| Total Parameters | 3.07E+10 |

Tokeniser details

We built the TildeOpen LLM tokeniser to ensure equitable representation across languages. Technically, we trained the tokeniser to represent the same text, regardless of the language it is written in, using a similar number of tokens. In practice, TildeOpen LLM will be more efficient and faster than other models for our focus languages, as writing out answers will require fewer steps. For more details on how TildeOpen LLM compares against other models, see TILDE Bench!

Running the model using HF transformers

When loading the tokeniser, you must set .

Evaluation

Per-Character Perplexity

What is Perplexity? Perplexity measures how well a language model predicts text. A model with low perplexity makes accurate predictions consistently, while high perplexity means the model is frequently "surprised" by unexpected words or patterns. Lower perplexity indicates the model has learned language patterns more effectively; it's less "surprised" by what it encounters because it better understands how the language works. Perplexity fairly evaluates how well each model handles:
- Spelling accuracy across a diverse vocabulary
- Grammar rules that span multiple words
- Sentence structure and flow
- Language-specific patterns (how different languages form plurals or compound words)

Why Character-Level? Different language models use different internal vocabularies: some break text into whole words, others into word fragments, and some into individual characters. This makes direct comparison difficult.
Character-level perplexity creates a standardised comparison by calculating how well each model would theoretically perform if we measured their predictions character by character. We're not changing how the models work; instead, we use a mathematical conversion to approximate their character-level performance based on their predictions.

Why does this Matter? Models with lower perplexity generally perform better on real-world tasks like text generation, translation, and understanding context. It's a reliable indicator of overall language competency across different applications.

What data did we use? We use WMT24++, as it is a multilingual, language-parallel evaluation set that none of the models have seen during training. WMT24++ is a composite of texts from news, literature, speech, and social media; thus, it is suitable for foundational model benchmarking.

| Language | TildeOpen 30b | Gemma 2 27b | EuroLLM 22B Prev. | ALIA 40B |
|-----------|---------|------------|--------|--------|
| Bulgarian | 2.0539 | 2.2184 | 2.1985 | 2.1336 |
| Czech | 2.1579 | 2.3522 | 2.3221 | 2.2719 |
| Danish | 2.003 | 2.1517 | 2.1353 | 2.0805 |
| German | 1.8769 | 1.9285 | 1.9452 | 1.904 |
| English | 2.0378 | 1.9525 | 2.0568 | 2.0261 |
| Spanish | 1.9503 | 1.9752 | 2.0145 | 1.9369 |
| Estonian | 2.1711 | 2.5747 | 2.3852 | 2.325 |
| Finnish | 2.0497 | 2.288 | 2.2388 | 2.1831 |
| French | 1.8978 | 1.9355 | 1.9282 | 1.9084 |
| Croatian | 2.1147 | 2.544 | 2.4905 | 2.2433 |
| Hungarian | 2.0539 | 2.2228 | 2.2256 | 2.1635 |
| Icelandic | 2.0873 | 3.0329 | 4.7908 | 3.957 |
| Italian | 1.9565 | 2.0137 | 2.0098 | 1.9887 |
| Lithuanian | 2.1247 | 2.4175 | 2.3137 | 2.3075 |
| Latvian | 2.1439 | 2.5355 | 2.3141 | 2.3276 |
| Dutch | 1.9333 | 2.0312 | 2.0079 | 1.9904 |
| Norwegian | 2.1284 | 2.2862 | 2.3506 | 2.2253 |
| Polish | 2.0241 | 2.1294 | 2.0803 | 2.0803 |
| Portuguese | 1.9899 | 2.0597 | 2.0272 | 2.0187 |
| Romanian | 2.0196 | 2.1606 | 2.1641 | 2.1114 |
| Russian | 2.0424 | 2.09 | 2.1095 | 2.0871 |
| Slovak | 2.1192 | 2.338 | 2.3029 | 2.2609 |
| Slovenian | 2.1556 | 2.4443 | 2.3398 | 2.2589 |
| Serbian | 2.2469 | 2.6351 | 4.2471 | 2.3743 |
| Swedish | 2.041 | 2.1809 | 2.1464 | 2.1211 |
| Turkish | 2.0997 | 2.247 | 2.2202 | 2.232 |
| Ukrainian | 2.1376 | 2.2665 | 2.2691 | 2.2086 |
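The token-to-character conversion behind these scores can be sketched as follows. This is the standard derivation (per-character perplexity is the exponential of the average per-character negative log-likelihood), not TildeOpen's exact evaluation script.

```python
import math


def char_ppl_from_nll(total_nll_nats: float, num_chars: int) -> float:
    """Per-character perplexity from the summed token-level negative
    log-likelihood (in nats) of a text: exp(NLL / #chars)."""
    return math.exp(total_nll_nats / num_chars)


def char_ppl_from_token_ppl(token_ppl: float, num_tokens: int, num_chars: int) -> float:
    """Equivalent conversion from token-level perplexity:
    ppl_char = ppl_tok ** (#tokens / #chars)."""
    return token_ppl ** (num_tokens / num_chars)
```

Because the exponent is tokens per character, a tokeniser that needs fewer tokens to cover the same text is not unfairly rewarded or penalised, which is what makes the cross-model comparison in the table meaningful.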
MedScholar-1.5B-GGUF
This model was generated using llama.cpp at commit `66625a59`.

MedScholar-1.5B is a compact, instruction-aligned medical question-answering model fine-tuned on 1 million randomly selected examples from the MIRIAD-4.4M dataset. It is based on the Qwen/Qwen2.5-1.5B-Instruct model and is designed for efficient, in-context clinical knowledge exploration, not diagnosis.

- Base Model: Qwen2.5-1.5B-Instruct-unsloth-bnb-4bit
- Fine-tuning Dataset: MIRIAD-4.4M
- Samples Used: 1,000,000 examples randomly selected from the full set
- Prompt Style: Minimal QA format (see below)
- Training Framework: Unsloth with QLoRA
- License: Apache-2.0 (inherits from base model); dataset is ODC-By 1.0

The model expects the prompt to end with `### Answer:` and will generate only the answer text. Do not include the answer in the prompt during inference.

This model was fine-tuned on 1 million randomly selected examples from the MIRIAD-4.4M dataset, which is released under the ODC-By 1.0 License.

> The MIRIAD dataset is intended exclusively for academic research and educational exploration.
> As stated by its authors:
>
> "The outputs generated by models trained or fine-tuned on this dataset must not be used for medical diagnosis or decision-making involving real individuals."

This model is for research, educational, and exploration purposes only. It is not a medical device and must not be used to provide clinical advice, diagnosis, or treatment.

- MIRIAD Dataset by Zheng et al. (2025) – https://huggingface.co/datasets/miriad/miriad-4.4M
- Qwen2.5 by Alibaba – https://huggingface.co/Qwen
- Training infrastructure: Unsloth

This qwen2 model was trained 2x faster with Unsloth and Hugging Face's TRL library.
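A minimal prompt builder for the format above. The card only specifies that the prompt must end with `### Answer:` and that the answer must not be included, so the `### Question:` header used here is an assumption for illustration.

```python
def build_prompt(question: str) -> str:
    """Build a minimal-QA prompt ending with the `### Answer:` marker.
    The model is expected to generate only the answer text after this
    marker; never append the answer yourself at inference time.
    NOTE: the question header is a hypothetical choice, not from the card.
    """
    return f"### Question:\n{question.strip()}\n\n### Answer:"


prompt = build_prompt("What are the common causes of iron-deficiency anemia?")
```

The resulting string would then be passed to the model (e.g., via `llama.cpp` or `transformers`), with generation stopping at the model's natural end-of-answer.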
medgemma-27b-text-it-GGUF
Mistral-NeMo-Minitron-8B-Instruct-GGUF
granite-20b-code-instruct-r1.1-GGUF
This model was generated using llama.cpp at commit `5dd942de`.

Model Summary

Granite-20B-Code-Instruct-r1.1 is a 20B parameter model fine-tuned from Granite-20B-Code-Base-r1.1 on a combination of permissively licensed instruction data to enhance instruction-following capabilities, including mathematical reasoning and problem-solving skills.

- Developers: IBM Research
- GitHub Repository: ibm-granite/granite-code-models
- Paper: Granite Code Models: A Family of Open Foundation Models for Code Intelligence
- Release Date: July 18th, 2024
- License: Apache 2.0

Usage

Intended use

The model is designed to respond to coding-related instructions and can be used to build coding assistants.

Generation

This is a simple example of how to use the Granite-20B-Code-Instruct-r1.1 model.

Training Data

Granite Code Instruct models are trained on the following types of data.

Code Commits Datasets: We sourced code commit data from the CommitPackFT dataset, a filtered version of the full CommitPack dataset. From the CommitPackFT dataset, we only consider data for 92 programming languages.
Our inclusion criteria boil down to selecting programming languages common across CommitPackFT and the 116 languages that we considered to pretrain the code-base model (Granite-20B-Code-Base-r1.1).

Math Datasets: We consider two high-quality math datasets, MathInstruct and MetaMathQA. Due to license issues, we filtered out GSM8K-RFT and Camel-Math from the MathInstruct dataset.

Code Instruction Datasets: We use Glaive-Code-Assistant-v3, Glaive-Function-Calling-v2, BigCode-SC2-Instruct, NL2SQL11, and a small collection of synthetic API-calling datasets, including synthetic instruction-response pairs generated using Granite-34B-Code-Instruct.

Language Instruction Datasets: We include high-quality datasets such as HelpSteer and an open-license-filtered version of Platypus. We also include a collection of hardcoded prompts to ensure our model generates correct outputs given inquiries about its name or developers.

Infrastructure

We train the Granite Code models using two of IBM's supercomputing clusters, Vela and Blue Vela, outfitted with NVIDIA A100 and H100 GPUs, respectively. These clusters provide a scalable and efficient infrastructure for training our models over thousands of GPUs.

Ethical Considerations and Limitations

Granite Code Instruct models are primarily fine-tuned using instruction-response pairs across a specific set of programming languages, so their performance may be limited on out-of-domain programming languages. In this situation, it is beneficial to provide few-shot examples to steer the model's output. Moreover, developers should perform safety testing and target-specific tuning before deploying these models in critical applications. The model also inherits ethical considerations and limitations from its base model. For more information, please refer to the Granite-20B-Code-Base-r1.1 model card.
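For out-of-domain languages, a few-shot prompt can be assembled like this. The question/answer layout below is my own illustrative format, not IBM's official chat template.

```python
def few_shot_prompt(examples: list[tuple[str, str]], query: str) -> str:
    """Concatenate (instruction, response) pairs, then the new query,
    leaving the final response empty for the model to complete.
    The `Question:`/`Answer:` labels are a hypothetical convention."""
    parts = []
    for instruction, response in examples:
        parts.append(f"Question:\n{instruction}\nAnswer:\n{response}\n")
    parts.append(f"Question:\n{query}\nAnswer:\n")
    return "\n".join(parts)


# Steer the model toward an out-of-domain language with worked examples.
prompt = few_shot_prompt(
    [("Print 'hello' in Zig", 'const std = @import("std");\n...')],
    "Print 'hello' in Nim",
)
```

The idea is simply that demonstrations in the target language give the model the surface patterns its fine-tuning data lacked.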
Aryabhata-1.0-GGUF
granite-7b-instruct-GGUF
RWKV7-Goose-World3-2.9B-HF-GGUF
Midm-2.0-Base-Instruct-GGUF
This model was generated using llama.cpp at commit `21c02174`.

🤗 Mi:dm 2.0 Models | 📜 Mi:dm 2.0 Technical Report | 📕 Mi:dm 2.0 Technical Blog

- 🔜 (Coming Soon!) GGUF format model files will be available soon for easier local deployment.
- ⚡️ `2025/07/04`: Released the Mi:dm 2.0 model collection on Hugging Face 🤗.

Contents: Overview - Mi:dm 2.0 - Quickstart - Evaluation - Usage - Run on Friendli.AI - Run on Your Local Machine - Deployment - Tutorials - More Information - Limitation - License - Contact

Mi:dm 2.0 is a "Korea-centric AI" model developed using KT's proprietary technology. The term "Korea-centric AI" refers to a model that deeply internalizes the unique values, cognitive frameworks, and commonsense reasoning inherent to Korean society. It goes beyond simply processing or generating Korean text—it reflects a deeper understanding of the socio-cultural norms and values that define Korean society.

- Mi:dm 2.0 Base: An 11.5B parameter dense model designed to balance model size and performance. It extends an 8B-scale model by applying the Depth-up Scaling (DuS) method, making it suitable for real-world applications that require both performance and versatility.
- Mi:dm 2.0 Mini: A lightweight 2.3B parameter dense model optimized for on-device environments and systems with limited GPU resources. It was derived from the Base model through pruning and distillation to enable compact deployment.

> [!Note]
> Neither the pre-training nor the post-training data includes KT users' data.

Here is the code snippet to run conversational inference with the model:

> [!NOTE]
> The `transformers` library should be version `4.45.0` or higher.

Korean benchmarks, grouped as Society & Culture (K-Refer, K-Refer-Hard, Ko-Sovereign, HAERAE), General Knowledge (KMMLU, Ko-Sovereign), and Instruction Following (Ko-IFEval, Ko-MTBench):

| Model | K-Refer | K-Refer-Hard | Ko-Sovereign | HAERAE | Avg. | KMMLU | Ko-Sovereign | Avg. | Ko-IFEval | Ko-MTBench | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3-4B | 53.6 | 42.9 | 35.8 | 50.6 | 45.7 | 50.6 | 42.5 | 46.5 | 75.9 | 63.0 | 69.4 |
| Exaone-3.5-2.4B-inst | 64.0 | 67.1 | 44.4 | 61.3 | 59.2 | 43.5 | 42.4 | 43.0 | 65.4 | 74.0 | 68.9 |
| Mi:dm 2.0-Mini-inst | 66.4 | 61.4 | 36.7 | 70.8 | 58.8 | 45.1 | 42.4 | 43.8 | 73.3 | 74.0 | 73.6 |
| Qwen3-14B | 72.4 | 65.7 | 49.8 | 68.4 | 64.1 | 55.4 | 54.7 | 55.1 | 83.6 | 71 | 77.3 |
| Llama-3.1-8B-inst | 43.2 | 36.4 | 33.8 | 49.5 | 40.7 | 33.0 | 36.7 | 34.8 | 60.1 | 57 | 58.5 |
| Exaone-3.5-7.8B-inst | 71.6 | 69.3 | 46.9 | 72.9 | 65.2 | 52.6 | 45.6 | 49.1 | 69.1 | 79.6 | 74.4 |
| Mi:dm 2.0-Base-inst | 89.6 | 86.4 | 56.3 | 81.5 | 78.4 | 57.3 | 58.0 | 57.7 | 82 | 89.7 | 85.9 |

| Model | K-Prag | K-Refer-Hard | Ko-Best | Ko-Sovereign | Avg. | Ko-Winogrande | Ko-Best | LogicKor | HRM8K | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3-4B | 73.9 | 56.7 | 91.5 | 43.5 | 66.6 | 67.5 | 69.2 | 5.6 | 56.7 | 43.8 |
| Exaone-3.5-2.4B-inst | 68.7 | 58.5 | 87.2 | 38.0 | 62.5 | 60.3 | 64.1 | 7.4 | 38.5 | 36.7 |
| Mi:dm 2.0-Mini-inst | 69.5 | 55.4 | 80.5 | 42.5 | 61.9 | 61.7 | 64.5 | 7.7 | 39.9 | 37.4 |
| Qwen3-14B | 86.7 | 74.0 | 93.9 | 52.0 | 76.8 | 77.2 | 75.4 | 6.4 | 64.5 | 48.8 |
| Llama-3.1-8B-inst | 59.9 | 48.6 | 77.4 | 31.5 | 51.5 | 40.1 | 26.0 | 2.4 | 30.9 | 19.8 |
| Exaone-3.5-7.8B-inst | 73.5 | 61.9 | 92.0 | 44.0 | 67.2 | 64.6 | 60.3 | 8.6 | 49.7 | 39.5 |
| Mi:dm 2.0-Base-inst | 86.5 | 70.8 | 95.2 | 53.0 | 76.1 | 75.1 | 73.0 | 8.6 | 52.9 | 44.8 |

General benchmarks, grouped as Instruction (IFEval), Reasoning (BBH, GPQA, MuSR), Math (GSM8K), Coding (MBPP+), and General Knowledge (MMLU-pro, MMLU):

| Model | IFEval | BBH | GPQA | MuSR | Avg. | GSM8K | MBPP+ | MMLU-pro | MMLU | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3-4B | 79.7 | 79.0 | 39.8 | 58.5 | 59.1 | 90.4 | 62.4 | - | 73.3 | 73.3 |
| Exaone-3.5-2.4B-inst | 81.1 | 46.4 | 28.1 | 49.7 | 41.4 | 82.5 | 59.8 | - | 59.5 | 59.5 |
| Mi:dm 2.0-Mini-inst | 73.6 | 44.5 | 26.6 | 51.7 | 40.9 | 83.1 | 60.9 | - | 56.5 | 56.5 |
| Qwen3-14B | 83.9 | 83.4 | 49.8 | 57.7 | 63.6 | 88.0 | 73.4 | 70.5 | 82.7 | 76.6 |
| Llama-3.1-8B-inst | 79.9 | 60.3 | 21.6 | 50.3 | 44.1 | 81.2 | 81.8 | 47.6 | 70.7 | 59.2 |
| Exaone-3.5-7.8B-inst | 83.6 | 50.1 | 33.1 | 51.2 | 44.8 | 81.1 | 79.4 | 40.7 | 69.0 | 54.8 |
| Mi:dm 2.0-Base-inst | 84.0 | 77.7 | 33.5 | 51.9 | 54.4 | 91.6 | 77.5 | 53.3 | 73.7 | 63.5 |

Run on Friendli.AI

You can try our model immediately via `Friendli.AI`. Simply click `Deploy` and then `Friendli Endpoints`.

> [!Note]
> Please note that a login to `Friendli.AI` is required after your fifth chat interaction.

Run on Your Local Machine

We provide a detailed description of running Mi:dm 2.0 on your local machine using llama.cpp, LM Studio, and Ollama. Please check our GitHub for more information.

To serve Mi:dm 2.0 using vLLM (`>=0.8.0`) with an OpenAI-compatible API:

Tutorials

To help our end users easily use Mi:dm 2.0, we have provided comprehensive tutorials on GitHub.

Limitation

The training data for both Mi:dm 2.0 models consists primarily of English and Korean. Understanding and generation in other languages are not guaranteed. The model is not guaranteed to provide reliable advice in fields that require professional expertise, such as law, medicine, or finance. Researchers have made efforts to exclude unethical content from the training data, such as profanity, slurs, bias, and discriminatory language. However, despite these efforts, the model may still produce inappropriate expressions or factual inaccuracies.

Contact

Mi:dm 2.0 technical inquiries: [email protected]
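Once a vLLM server is up, requests follow the OpenAI chat-completions shape. The sketch below only builds the request body; the default model id is an assumption based on this card's repo name, and the serving endpoint (typically `/v1/chat/completions`) is vLLM's OpenAI-compatible convention, not something this card specifies.

```python
import json


def chat_request(prompt: str, model: str = "Midm-2.0-Base-Instruct") -> str:
    """Build an OpenAI-compatible chat-completions request body for a
    locally served Mi:dm 2.0 model. The default model id is assumed."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
    }
    return json.dumps(payload, ensure_ascii=False)


body = chat_request("KT에 대해 소개해 주세요.")
```

The resulting JSON string would be POSTed to the vLLM server's chat endpoint with any standard HTTP or OpenAI client.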
You will also find the code I use to quantize the models, if you want to do it yourself: GGUFModelBuilder

💬 How to test: Choose an AI assistant type:
- `TurboLLM` (GPT-4.1-mini)
- `HugLLM` (Hugging Face open-source models)
- `TestLLM` (experimental, CPU-only)

What I'm Testing: I'm pushing the limits of small open-source models for AI network monitoring, specifically:
- Function calling against live network services
- How small can a model go while still handling:
  - Automated Nmap security scans
  - Quantum-readiness checks
  - Network monitoring tasks

🟡 TestLLM – current experimental model (llama.cpp on 2 CPU threads on a Hugging Face Docker space):
- ✅ Zero-configuration setup
- ⏳ ~30 s load time (slow inference, but no API costs). No token limit, as the cost is low.
- 🔧 Help wanted! If you're into edge-device AI, let's collaborate!

Other Assistants

🟢 TurboLLM – uses gpt-4.1-mini:
- Performs very well, but unfortunately OpenAI charges per token, so token usage is limited.
- Create custom cmd processors to run .NET code on Quantum Network Monitor Agents
- Real-time network diagnostics and monitoring
- Security audits
- Penetration testing (Nmap/Metasploit)

🔵 HugLLM – latest open-source models:
- 🌐 Runs on the Hugging Face Inference API. Performs pretty well using the latest models hosted on Novita.

💡 Example commands you could test:
1. `"Give me info on my website's SSL certificate"`
2. `"Check if my server is using quantum-safe encryption for communication"`
3. `"Run a comprehensive security audit on my server"`
4. `"Create a cmd processor to ... (whatever you want)"`

Note: you need to install a Quantum Network Monitor Agent to run the .NET code. This is a very flexible and powerful feature. Use with caution!

I fund the servers used to create these model files, run the Quantum Network Monitor service, and pay for inference from Novita and OpenAI, all out of my own pocket. All the code behind the model creation and the Quantum Network Monitor project is open source.
Feel free to use whatever you find helpful. If you appreciate the work, please consider buying me a coffee ☕. Your support helps cover service costs and allows me to raise token limits for everyone.
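For reference, the conversational-inference snippet mentioned in the Mi:dm 2.0 card above might look like the following minimal sketch. The repo id and the system prompt are assumptions, not taken from the official card; verify them before use.

```python
# Hypothetical sketch of conversational inference with Mi:dm 2.0 Mini.
# MODEL_ID is an assumption; check the official model card for the exact repo id.
MODEL_ID = "K-intelligence/Midm-2.0-Mini-Instruct"

def build_messages(user_prompt: str) -> list:
    # Chat-style message list consumed by tokenizer.apply_chat_template.
    return [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": user_prompt},
    ]

def chat(user_prompt: str, max_new_tokens: int = 256) -> str:
    # Heavy imports kept local so the helper above stays dependency-free.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")
    input_ids = tokenizer.apply_chat_template(
        build_messages(user_prompt),
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(model.device)
    output = model.generate(input_ids, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, not the prompt.
    return tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)
```

Remember that the card above requires `transformers >= 4.45.0`.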
Nemotron-4-Mini-Hindi-4B-Instruct-GGUF
ERNIE-4.5-21B-A3B-Thinking-GGUF
This model was generated using llama.cpp at commit `86587da0`.

Click here to get info on choosing the right GGUF model format

Over the past three months, we have continued to scale the thinking capability of ERNIE-4.5-21B-A3B, improving both the quality and depth of reasoning and thereby advancing the competitiveness of ERNIE lightweight models in complex reasoning tasks. We are pleased to introduce ERNIE-4.5-21B-A3B-Thinking, featuring the following key enhancements:
- Significantly improved performance on reasoning tasks, including logical reasoning, mathematics, science, coding, text generation, and academic benchmarks that typically require human expertise.
- Efficient tool-usage capabilities.
- Enhanced 128K long-context understanding capabilities.

> [!NOTE]
> This version has an increased thinking length. We strongly recommend it for highly complex reasoning tasks.

ERNIE-4.5-21B-A3B-Thinking is a text MoE post-trained model, with 21B total parameters and 3B activated parameters per token.
The following are the model configuration details:

|Key|Value|
|-|-|
|Modality|Text|
|Training Stage|Post-training|
|Params (Total / Activated)|21B / 3B|
|Layers|28|
|Heads (Q/KV)|20 / 4|
|Text Experts (Total / Activated)|64 / 6|
|Vision Experts (Total / Activated)|64 / 6|
|Shared Experts|2|
|Context Length|131072|

> [!NOTE]
> To align with the wider community, this model releases Transformer-style weights. Both PyTorch and PaddlePaddle ecosystem tools, such as vLLM, transformers, and FastDeploy, are expected to be able to load and run this model.

Quickly deploy services using FastDeploy as shown below. For more detailed usage, refer to the FastDeploy GitHub Repository. Note: 1x 80GB GPU is required, and deploying this model requires FastDeploy version 2.2.

The ERNIE-4.5-21B-A3B-Thinking model supports function calling. The `reasoning-parser` and `tool-call-parser` for vLLM Ernie are currently under development.

Note: You'll need the `transformers` library (version 4.54.0 or newer) installed to use this model. The following code snippet illustrates how to use the model to generate content from given inputs.

The ERNIE 4.5 models are provided under the Apache License 2.0. This license permits commercial use, subject to its terms and conditions. Copyright (c) 2025 Baidu, Inc. All Rights Reserved. If you find ERNIE 4.5 useful or wish to use it in your projects, please kindly cite our technical report.
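A minimal sketch of the generation snippet referenced above. The repo id is an assumption inferred from the model name; verify it against the official card before use.

```python
# Hypothetical sketch of text generation with ERNIE-4.5-21B-A3B-Thinking.
# MODEL_ID is assumed; the card requires transformers >= 4.54.0.
MODEL_ID = "baidu/ERNIE-4.5-21B-A3B-Thinking"

def build_messages(user_prompt: str) -> list:
    # A single user turn; the thinking variant reasons before answering.
    return [{"role": "user", "content": user_prompt}]

def generate(user_prompt: str, max_new_tokens: int = 1024) -> str:
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")
    input_ids = tokenizer.apply_chat_template(
        build_messages(user_prompt), add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(input_ids, max_new_tokens=max_new_tokens)
    # Output includes the model's reasoning trace followed by the final answer.
    return tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)
```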
granite-3b-code-base-2k-GGUF
This model was generated using llama.cpp at commit `0a5a3b5c`.

Click here to get info on choosing the right GGUF model format

Model Summary

Granite-3B-Code-Base-2K is a decoder-only code model designed for code-generative tasks (e.g., code generation, code explanation, and code fixing). It is trained from scratch with a two-phase training strategy. In phase 1, the model is trained on 4 trillion tokens sourced from 116 programming languages, ensuring a comprehensive understanding of programming languages and syntax. In phase 2, the model is trained on 500 billion tokens with a carefully designed mixture of high-quality data from code and natural-language domains to improve the model's ability to reason and follow instructions.

- Developers: IBM Research
- GitHub Repository: ibm-granite/granite-code-models
- Paper: Granite Code Models: A Family of Open Foundation Models for Code Intelligence
- Release Date: May 6th, 2024
- License: Apache 2.0

Usage

Intended use: Prominent enterprise use cases of LLMs in software engineering productivity include code generation, code explanation, code fixing, generating unit tests, generating documentation, addressing technical-debt issues, vulnerability detection, code translation, and more.
All Granite Code Base models, including the 3B parameter model, are able to handle these tasks, as they were trained on a large amount of code data from 116 programming languages.

Generation: This is a simple example of how to use the Granite-3B-Code-Base-2K model.

Training Data
- Data Collection and Filtering: Pretraining code data is sourced from a combination of publicly available datasets (e.g., GitHub Code Clean, StarCoder data) and additional public code repositories and issues from GitHub. We filter the raw data to retain a list of 116 programming languages. After language filtering, we also filter out low-quality code.
- Exact and Fuzzy Deduplication: We adopt an aggressive deduplication strategy that includes both exact and fuzzy deduplication to remove documents with (near-)identical code content.
- HAP, PII, Malware Filtering: We apply a HAP content filter that reduces the models' likelihood of generating hateful, abusive, or profane language. We also redact Personally Identifiable Information (PII) by replacing PII content (e.g., names, email addresses, keys, passwords) with corresponding tokens (e.g., ⟨NAME⟩, ⟨EMAIL⟩, ⟨KEY⟩, ⟨PASSWORD⟩). Moreover, we scan all datasets using ClamAV to identify and remove instances of malware in the source code.
- Natural Language Datasets: In addition to collecting code data for model training, we curate several publicly available high-quality natural-language datasets to improve the models' proficiency in language understanding and mathematical reasoning. Unlike the code data, we do not deduplicate these datasets.

Infrastructure

We train the Granite Code models using two of IBM's supercomputing clusters, Vela and Blue Vela, outfitted with NVIDIA A100 and H100 GPUs, respectively. These clusters provide a scalable and efficient infrastructure for training our models over thousands of GPUs.
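The "simple example" referenced above might look like the following sketch. Since this is a base model, plain causal completion (no chat template) is the natural interface; the prompt below is just an illustrative code prefix.

```python
# Minimal sketch: code completion with Granite-3B-Code-Base-2K via transformers.
# The repo id matches the card's naming; verify before use.
MODEL_ID = "ibm-granite/granite-3b-code-base-2k"

PROMPT = "def generate_fibonacci(n):"  # any code prefix to complete

def complete(prompt: str, max_new_tokens: int = 128) -> str:
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output[0], skip_special_tokens=True)
```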
Ethical Considerations and Limitations

The use of Large Language Models involves risks and ethical considerations people must be aware of. Regarding code generation, caution is urged against complete reliance on specific code models for crucial decisions or impactful information, as the generated code is not guaranteed to work as intended. The Granite-3B-Code-Base-2K model is no exception in this regard. Even though this model is suited for multiple code-related tasks, it has not undergone any safety alignment; therefore, it may produce problematic outputs. Additionally, it remains uncertain whether smaller models might exhibit increased susceptibility to hallucination in generation scenarios by copying source code verbatim from the training dataset due to their reduced sizes and memorization capacities. This aspect is currently an active area of research, and we anticipate more rigorous exploration, comprehension, and mitigations in this domain. Regarding ethics, a latent risk associated with all Large Language Models is their malicious utilization. We urge the community to use the Granite-3B-Code-Base-2K model with ethical intentions and in a responsible way.
DeepSeek-R1-Distill-Llama-70B-GGUF
Prox-MistralHermes-7B-GGUF
This model was generated using llama.cpp at commit `aa0ef5c5`.

Click here to get info on choosing the right GGUF model format

Drawing inspiration from the concept of 'proximity' in digital networks, the Prox series stands at the forefront of cybersecurity technology. Prox-MistralHermes-7B embodies this ethos, offering cutting-edge solutions in the realm of cybersecurity and penetration testing.

Prox-MistralHermes-7B is a fine-tuned version of OpenHermes 2.5 Mistral 7B, specifically tailored for cybersecurity. It excels in red-teaming tasks, including the simulation of phishing emails. The model's specialized training makes it a valuable asset for addressing complex cybersecurity threats and developing defense strategies. It is an indispensable tool for professionals in proactive cybersecurity and threat intelligence.

Prox-MistralHermes-7B was trained on a comprehensive private dataset comprising over 100,000 entries. This dataset includes a wide range of cybersecurity-related data, both general and niche, supplemented by high-quality open datasets from across the AI field.

Training: Prox-MistralHermes-7B was trained over 5 hours for 4 epochs on 4x A100 GPUs with QLoRA.

Prompt format: This model uses the ChatML prompt format.
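The ChatML layout mentioned above wraps each turn in `<|im_start|>role` / `<|im_end|>` markers. The helper below is a plain-string sketch of that layout; the example roles and contents are illustrative only.

```python
# Sketch of the ChatML prompt format used by OpenHermes-lineage models.
def to_chatml(messages: list) -> str:
    """Render chat messages as ChatML: <|im_start|>role\\ncontent<|im_end|>."""
    rendered = ""
    for m in messages:
        rendered += f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
    # Open an assistant turn so the model knows to generate the reply next.
    rendered += "<|im_start|>assistant\n"
    return rendered

prompt = to_chatml([
    {"role": "system", "content": "You are a cybersecurity assistant."},
    {"role": "user", "content": "Summarise common phishing indicators."},
])
```

In practice, `tokenizer.apply_chat_template` builds the same string from the model's bundled template, which is the safer option.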
Misuse, Malicious Use, and Out-of-Scope Use

Users are responsible for their applications of this model. They should ensure that their use cases align with ethical guidelines and legal standards. Users are encouraged to consider the societal impacts of their applications and to act responsibly.

License

The weights of Prox-MistralHermes-7B are licensed under the MIT License.
Mistral-7B-Instruct-v0.3-GGUF
Phi-4-reasoning-plus-GGUF
Holo1-3B-GGUF
UIGEN-X-8B-GGUF
This model was generated using llama.cpp at commit `c82d48ec`.

Click here to get info on choosing the right GGUF model format

> Tesslate's hybrid reasoning UI generation model built on the Qwen3-8B architecture. Trained to systematically plan, architect, and implement complete user interfaces across modern development stacks.

Live Examples: https://uigenoutput.tesslate.com
Discord Community: https://discord.gg/EcCpcTv93U
Website: https://tesslate.com

UIGEN-X-8B implements hybrid reasoning from the Qwen3 family, combining systematic planning with direct implementation. The model follows a structured thinking process:

1. Problem Analysis: understanding requirements and constraints
2. Architecture Planning: component structure and technology decisions
3. Design System Definition: color schemes, typography, and styling approach
4. Implementation Strategy: step-by-step code generation with reasoning

This hybrid approach enables both thoughtful planning and efficient code generation, making it suitable for complex UI development tasks.
UIGEN-X-8B supports 26 major categories spanning frameworks and libraries across 7 platforms:

Web Frameworks
- React: Next.js, Remix, Gatsby, Create React App, Vite
- Vue: Nuxt.js, Quasar, Gridsome
- Angular: Angular CLI, Ionic Angular
- Svelte: SvelteKit, Astro
- Modern: Solid.js, Qwik, Alpine.js
- Static: Astro, 11ty, Jekyll, Hugo

Styling Systems
- Utility-First: Tailwind CSS, UnoCSS, Windi CSS
- CSS-in-JS: Styled Components, Emotion, Stitches
- Component Systems: Material-UI, Chakra UI, Mantine
- Traditional: Bootstrap, Bulma, Foundation
- Design Systems: Carbon Design, IBM Design Language
- Framework-Specific: Angular Material, Vuetify, Quasar

UI Component Libraries
- React: shadcn/ui, Material-UI, Ant Design, Chakra UI, Mantine, PrimeReact, Headless UI, NextUI, DaisyUI
- Vue: Vuetify, PrimeVue, Quasar, Element Plus, Naive UI
- Angular: Angular Material, PrimeNG, ng-bootstrap, Clarity Design
- Svelte: Svelte Material UI, Carbon Components Svelte
- Headless: Radix UI, Reach UI, Ariakit, React Aria

State Management
- React: Redux Toolkit, Zustand, Jotai, Valtio, Context API
- Vue: Pinia, Vuex, Composables
- Angular: NgRx, Akita, Services
- Universal: MobX, XState, Recoil

Animation Libraries
- React: Framer Motion, React Spring, React Transition Group
- Vue: Vue Transition, Vueuse Motion
- Universal: GSAP, Lottie, CSS Animations, Web Animations API
- Mobile: React Native Reanimated, Expo Animations

Icon Systems
Lucide, Heroicons, Material Icons, Font Awesome, Ant Design Icons, Bootstrap Icons, Ionicons, Tabler Icons, Feather, Phosphor, React Icons, Vue Icons

Web Development
Complete coverage of modern web development from simple HTML/CSS to complex enterprise applications.
Mobile Development
- React Native: Expo, CLI, with navigation and state management
- Flutter: Cross-platform mobile with Material and Cupertino designs
- Ionic: Angular, React, and Vue-based hybrid applications

Desktop Applications
- Electron: Cross-platform desktop apps (Slack, VSCode-style)
- Tauri: Rust-based lightweight desktop applications
- Flutter Desktop: Native desktop performance

Python Applications
- Web UI: Streamlit, Gradio, Flask, FastAPI
- Desktop GUI: Tkinter, PyQt5/6, Kivy, wxPython, Dear PyGui

Development Tools
Build tools, bundlers, testing frameworks, and development environments.

26 Languages and Approaches: JavaScript, TypeScript, Python, Dart, HTML5, CSS3, SCSS, SASS, Less, PostCSS, CSS Modules, Styled Components, JSX, TSX, Vue SFC, Svelte Components, Angular Templates, Tailwind, PHP

UIGEN-X-8B includes 21 distinct visual style categories that can be applied to any framework:

Modern Design Styles
- Glassmorphism: Frosted glass effects with blur and transparency
- Neumorphism: Soft, extruded design elements
- Material Design: Google's design system principles
- Fluent Design: Microsoft's design language

Traditional & Classic
- Skeuomorphism: Real-world object representations
- Swiss Design: Clean typography and grid systems
- Bauhaus: Functional, geometric design principles

Contemporary Trends
- Brutalism: Bold, raw, unconventional layouts
- Anti-Design: Intentionally imperfect, organic aesthetics
- Minimalism: Essential elements only, generous whitespace

Thematic Styles
- Cyberpunk: Neon colors, glitch effects, futuristic elements
- Dark Mode: High contrast, reduced eye strain
- Retro-Futurism: 80s/90s inspired futuristic design
- Geocities/90s Web: Nostalgic early web aesthetics

Experimental
- Maximalism: Rich, layered, abundant visual elements
- Madness/Experimental: Unconventional, boundary-pushing designs
- Abstract Shapes: Geometric, non-representational elements

Basic Structure
To achieve the best results, use this prompting structure
below:

UIGEN-X-8B supports function calling for dynamic asset integration and enhanced development workflows.

Dynamic Asset Loading:
- Fetch relevant images during UI generation
- Generate realistic content for components
- Create cohesive color palettes from images
- Optimize assets for web performance

Multi-Step Development:
- Plan application architecture
- Generate individual components
- Integrate components into pages
- Apply consistent styling and theming
- Test responsive behavior

Content-Aware Design:
- Adapt layouts based on content types
- Optimize typography for readability
- Create responsive image galleries
- Generate accessible alt text

Rapid Prototyping
- Quick mockups for client presentations
- A/B testing different design approaches
- Concept validation with interactive prototypes

Production Development
- Component library creation
- Design system implementation
- Template and boilerplate generation

Educational & Learning
- Teaching modern web development
- Framework comparison and evaluation
- Best practices demonstration

Enterprise Solutions
- Dashboard and admin panel generation
- Internal tool development
- Legacy system modernization

Hardware
- GPU: 8GB+ VRAM recommended (RTX 3080/4070 or equivalent)
- RAM: 16GB system memory minimum
- Storage: 20GB for model weights and cache

Software
- Python: 3.8+ with transformers, torch, unsloth
- Node.js: For running generated JavaScript/TypeScript code
- Browser: Modern browser for testing generated UIs

Integration
- Compatible with HuggingFace transformers
- Supports GGML/GGUF quantization
- Works with text-generation-webui
- API-ready for production deployment

Limitations
- Token Usage: Reasoning process increases token consumption
- Complex Logic: Focuses on UI structure rather than business logic
- Real-time Features: Generated code requires backend integration
- Testing: Output may need manual testing and refinement
- Accessibility: While ARIA-aware, manual a11y testing recommended

Discord:
https://discord.gg/EcCpcTv93U
Website: https://tesslate.com
Examples: https://uigenoutput.tesslate.com

Join our community to share creations, get help, and contribute to the ecosystem. Built with hybrid reasoning capabilities from Qwen3, UIGEN-X-8B represents a comprehensive approach to AI-driven UI development across the entire modern web development ecosystem.
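As a concrete starting point, a minimal transformers sketch for generating UI code with this model might look like the following. The repo id and the example request are assumptions; verify the id on the Hub before use.

```python
# Hypothetical sketch: generating a UI component with UIGEN-X-8B.
# MODEL_ID is assumed from the model name; check the official repo.
MODEL_ID = "Tesslate/UIGEN-X-8B"

def build_messages(request: str) -> list:
    # A plain user request; the hybrid-reasoning model plans before coding.
    return [{"role": "user", "content": request}]

def generate_ui(request: str, max_new_tokens: int = 2048) -> str:
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")
    input_ids = tokenizer.apply_chat_template(
        build_messages(request), add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(input_ids, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)
```

Note the Limitations section above: the reasoning trace increases token consumption, so budget `max_new_tokens` generously.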
`"Give me info on my websites SSL certificate"` 2. `"Check if my server is using quantum safe encyption for communication"` 3. `"Run a comprehensive security audit on my server"` 4. '"Create a cmd processor to .. (what ever you want)" Note you need to install a Quantum Network Monitor Agent to run the .net code on. This is a very flexible and powerful feature. Use with caution! I fund the servers used to create these model files, run the Quantum Network Monitor service, and pay for inference from Novita and OpenAI—all out of my own pocket. All the code behind the model creation and the Quantum Network Monitor project is open source. Feel free to use whatever you find helpful. If you appreciate the work, please consider buying me a coffee ☕. Your support helps cover service costs and allows me to raise token limits for everyone.
DeepSeek-R1-Distill-Llama-8B-GGUF
Veena-GGUF
granite-3.3-2b-base-GGUF
This model was generated using llama.cpp at commit `5dd942de`.

Click here to get info on choosing the right GGUF model format

Granite-3.3-2B-Base is a decoder-only language model with a 128K token context window. It improves upon Granite-3.1-2B-Base by adding support for Fill-in-the-Middle (FIM) using specialized tokens, enabling the model to generate content conditioned on both prefix and suffix. This makes it well suited for code-completion tasks.

- Developers: Granite Team, IBM
- GitHub Repository: ibm-granite/granite-3.3-language-models
- Website: Granite Docs
- Release Date: April 16th, 2025
- License: Apache 2.0

Supported Languages: English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese. Users may finetune Granite 3.3 models for languages beyond these 12 languages.

Intended Use: Prominent use cases of LLMs in text-to-text generation include summarization, text classification, extraction, question-answering, and other long-context tasks. All Granite Base models are able to handle these tasks, as they were trained on a large amount of data from various domains. Moreover, they can serve as a baseline for creating specialized models for specific application scenarios.

Generation: This is a simple example of how to use the Granite-3.3-2B-Base model. Then, copy the code snippet below to run the example.

| Models | ARC-Challenge | Hellaswag | MMLU | TruthfulQA | Winogrande | GSM8K | DROP | NQ | AGIEval | TriviaQA | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Granite-3.1-2B-Base | 46.83 | 74.9 | 54.87 | 38.93 | 71.8 | 53.0 | 30.08 | 24.46 | 38.24 | 63.18 | 49.63 |
| Granite-3.3-2B-Base | 47.49 | 73.2 | 54.33 | 40.83 | 70.4 | 50.0 | 32.55 | 24.36 | 38.78 | 63.22 | 49.52 |
| Granite-3.1-8B-Base | 53.51 | 81.4 | 64.28 | 51.27 | 76.2 | 70.5 | 45.87 | 35.97 | 48.99 | 78.33 | 60.63 |
| Granite-3.3-8B-Base | 50.84 | 80.1 | 63.89 | 52.15 | 74.4 | 59.0 | 36.14 | 36.5 | 49.3 | 78.18 | 58.05 |

Model Architecture: Granite-3.3-2B-Base is based on a decoder-only dense transformer architecture.
Core components of the Granite-3.3-2B-Base architecture are: GQA and RoPE, MLP with SwiGLU, RMSNorm, and shared input/output embeddings. Training Data: This model is trained on a mix of open source and proprietary data following a three-stage training strategy. Stage 1 data: The data for stage 1 is sourced from diverse domains, such as: web, code, academic sources, books, and math data. Stage 2 data: The data for stage 2 comprises a curated mix of high-quality data from the same domains, plus multilingual and instruction data. The goal of this second training phase is to enhance the model’s performance on specific tasks. Stage 3 data: The data for stage 3 consists of the original stage-2 pretraining data with additional synthetic long-context data in the form of QA/summary pairs where the answer contains a recitation of the related paragraph before the answer. Infrastructure: We train Granite 3.3 Language Models using IBM's supercomputing cluster, Blue Vela, which is outfitted with NVIDIA H100 GPUs. This cluster provides a scalable and efficient infrastructure for training our models over thousands of GPUs. Ethical Considerations and Limitations: The use of Large Language Models involves risks and ethical considerations people must be aware of, including but not limited to: bias and fairness, misinformation, and autonomous decision-making. The Granite-3.3-2B-Base model is no exception in this regard. Even though this model is suited for multiple generative AI tasks, it has not undergone any safety alignment, so it may produce problematic outputs. Additionally, it remains uncertain whether smaller models might exhibit increased susceptibility to hallucination in generation scenarios by copying text verbatim from the training dataset due to their reduced sizes and memorization capacities. This aspect is currently an active area of research, and we anticipate more rigorous exploration, comprehension, and mitigations in this domain.
Regarding ethics, a latent risk associated with all Large Language Models is their malicious utilization. We urge the community to use Granite-3.3-2B-Base model with ethical intentions and in a responsible way. Resources - ⭐️ Learn about the latest updates with Granite: https://www.ibm.com/granite - 📄 Get started with tutorials, best practices, and prompt engineering advice: https://www.ibm.com/granite/docs/ - 💡 Learn about the latest Granite learning resources: https://github.com/ibm-granite-community/
OpenReasoning-Nemotron-7B-GGUF
granite-4.0-tiny-preview-GGUF
granite-guardian-3.2-3b-a800m-GGUF
QwenLong-L1-32B-GGUF
A.X-4.0-Light-GGUF
DeepSeek-R1-0528-Qwen3-8B-GGUF
Llama-3.1-Nemotron-8B-UltraLong-1M-Instruct-GGUF
M3-Agent-Control-GGUF
This model was generated using llama.cpp at commit `cd6983d5`. I've been experimenting with a new quantization approach that selectively elevates the precision of key layers beyond what the default IMatrix configuration provides. In my testing, standard IMatrix quantization underperforms at lower bit depths, especially with Mixture of Experts (MoE) models. To address this, I'm using the `--tensor-type` option in `llama.cpp` to manually "bump" important layers to higher precision. You can see the implementation here: 👉 Layer bumping with llama.cpp While this does increase model file size, it significantly improves precision for a given quantization level. I'd love your feedback—have you tried this? How does it perform for you? Click here to get info on choosing the right GGUF model format
Seed-OSS-36B-Instruct-GGUF
This model was generated using llama.cpp at commit `c8dedc99`. Click here to get info on choosing the right GGUF model format You can get to know us better through the following channels👇 > [!NOTE] > This model card is dedicated to the `Seed-OSS-36B-Base-Instruct` model. News - [2025/08/20]🔥We release `Seed-OSS-36B-Base` (both with and without synthetic data versions) and `Seed-OSS-36B-Instruct`. Introduction Seed-OSS is a series of open-source large language models developed by ByteDance's Seed Team, designed for powerful long-context, reasoning, agent and general capabilities, and versatile developer-friendly features. Although trained with only 12T tokens, Seed-OSS achieves excellent performance on several popular open benchmarks. We release this series of models to the open-source community under the Apache-2.0 license. > [!NOTE] > Seed-OSS is primarily optimized for international (i18n) use cases. Key Features - Flexible Control of Thinking Budget: Allows users to flexibly adjust the reasoning length as needed. Dynamically controlling the reasoning length enhances inference efficiency in practical application scenarios. - Enhanced Reasoning Capability: Specifically optimized for reasoning tasks while maintaining balanced and excellent general capabilities.
- Agentic Intelligence: Performs exceptionally well in agentic tasks such as tool use and issue resolution. - Research-Friendly: Given that the inclusion of synthetic instruction data in pre-training may affect post-training research, we released pre-trained models both with and without instruction data, providing the research community with more diverse options. - Native Long Context: Trained natively with up to 512K of context. Seed-OSS adopts the popular causal language model architecture with RoPE, GQA attention, RMSNorm and SwiGLU activation.

| | Seed-OSS-36B |
|:---:|:---:|
| Parameters | 36B |
| Attention | GQA |
| Activation Function | SwiGLU |
| Number of Layers | 64 |
| Number of QKV Heads | 80 / 8 / 8 |
| Head Size | 128 |
| Hidden Size | 5120 |
| Vocabulary Size | 155K |
| Context Length | 512K |
| RoPE Base Frequency | 1e7 |

Incorporating synthetic instruction data into pretraining leads to improved performance on most benchmarks. We adopt the version augmented with synthetic instruction data (i.e., w/ syn.) as `Seed-OSS-36B-Base`. We also release `Seed-OSS-36B-Base-woSyn`, trained without such data (i.e., w/o syn.), offering the community a high-performance foundation model unaffected by synthetic instruction data. Benchmark Seed1.6-Base Qwen3-30B-A3B-Base-2507 Qwen2.5-32B-Base Seed-OSS-36B-Base (w/ syn.) Seed-OSS-36B-Base-woSyn (w/o syn.) - Results are presented in the format "reproduced results (reported results, if any)".

| Benchmark | Seed1.6-Thinking-0715 | OAI-OSS-20B | Qwen3-30B-A3B-Thinking-2507 | Qwen3-32B | Gemma3-27B | Seed-OSS-36B-Instruct |
|-----------|----------------------|-------------|------------------------------|-----------|------------|------------------------|
| GPQA-D | 80.7 | 72.2 (71.5) | 71.4 (73.4) | 66.7 (68.4) | 42.4 | 71.4 |
| LiveCodeBench v6 (02/2025-05/2025) | 66.8 | 63.8 | 60.3 (66) | 53.4 | - | 67.4 |
| SWE-Bench Verified (OpenHands) | 41.8 | (60.7) | 31 | 23.4 | - | 56 |
| SWE-Bench Verified (AgentLess 410) | 48.4 | - | 33.5 | 39.7 | - | 47 |

- Bold denotes open-source SOTA. Underlined indicates second place among open-source models.
- Results are presented in the format "reproduced results (reported results, if any)". Some results have been omitted due to the failure of the evaluation run. - The results of Gemma3-27B are sourced directly from its technical report. - The results of ArcAGI-V2 were measured on the official evaluation set, which was not involved in the training process. - Generation configs for Seed-OSS-36B-Instruct: temperature=1.1, top_p=0.95. Specifically, for TauBench, temperature=1, top_p=0.7. > [!NOTE] > We recommend sampling with `temperature=1.1` and `top_p=0.95`. Users can flexibly specify the model's thinking budget. The figure below shows the performance curves across different tasks as the thinking budget varies. For simpler tasks (such as IFEval), the model's chain of thought (CoT) is shorter, and the score exhibits fluctuations as the thinking budget increases. For more challenging tasks (such as AIME and LiveCodeBench), the model's CoT is longer, and the score improves with an increase in the thinking budget. Here is an example with a thinking budget set to 512: during the reasoning process, the model periodically triggers self-reflection to estimate the consumed and remaining budget, and delivers the final response once the budget is exhausted or the reasoning concludes. If no thinking budget is set (default mode), Seed-OSS will initiate thinking with unlimited length. If a thinking budget is specified, users are advised to prioritize values that are integer multiples of 512 (e.g., 512, 1K, 2K, 4K, 8K, or 16K), as the model has been extensively trained on these intervals. Models are instructed to output a direct response when the thinking budget is 0, and we recommend setting any budget below 512 to this value. Download the Seed-OSS checkpoint to `./Seed-OSS-36B-Instruct` Transformers The `generate.py` script provides a simple interface for model inference with configurable options.
Key Parameters

| Parameter | Description |
|-----------|-------------|
| `--model_path` | Path to the pretrained model directory (required) |
| `--prompts` | Input prompts (default: sample cooking/code questions) |
| `--max_new_tokens` | Maximum tokens to generate (default: 4096) |
| `--attn_implementation` | Attention mechanism: `flash_attention_2` (default) or `eager` |
| `--load_in_4bit/8bit` | Enable 4-bit/8-bit quantization (reduces memory usage) |
| `--thinking_budget` | Thinking budget in tokens (default: -1 for unlimited budget) |

- First install the vLLM version with Seed-OSS support. License: This project is licensed under Apache-2.0. See the LICENSE file for details. Founded in 2023, the ByteDance Seed Team is dedicated to crafting the industry's most advanced AI foundation models. The team aspires to become a world-class research team and make significant contributions to the advancement of science and society.
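The thinking-budget guidance in the Seed-OSS card above (prefer multiples of 512; treat anything below 512 as 0 for a direct response; -1 means unlimited) can be captured in a small helper. `normalize_thinking_budget` is a hypothetical utility for illustration, not part of the Seed-OSS release:

```python
def normalize_thinking_budget(requested: int) -> int:
    """Map a requested thinking budget onto the values the card recommends.

    Hypothetical helper (not shipped with Seed-OSS): -1 keeps the default
    unlimited mode, budgets below 512 become 0 (direct response), and
    everything else snaps to the nearest multiple of 512, matching the
    intervals the model was trained on.
    """
    if requested < 0:
        return -1          # default mode: unlimited thinking
    if requested < 512:
        return 0           # card recommends 0 (direct answer) below 512
    return round(requested / 512) * 512
```

The normalized value would then be passed wherever the inference stack accepts a thinking budget (e.g. the `--thinking_budget` flag of `generate.py` shown above).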
rwkv7-0.1B-g1-GGUF
rwkv7-2.9B-world-GGUF
DeepHermes-3-Llama-3-8B-Preview-GGUF
MiMo-VL-7B-RL-GGUF
LFM2-1.2B-Tool-GGUF
This model was generated using llama.cpp at commit `c8dedc99`. Click here to get info on choosing the right GGUF model format Based on LFM2-1.2B, LFM2-1.2B-Tool is designed for concise and precise tool calling. The key challenge was designing a non-thinking model that outperforms similarly sized thinking models for tool use. - Mobile and edge devices requiring instant API calls, database queries, or system integrations without cloud dependency. - Real-time assistants in cars, IoT devices, or customer support, where response latency is critical. - Resource-constrained environments like embedded systems or battery-powered devices needing efficient tool execution. You can find more information about other task-specific models in this blog post. Generation parameters: We recommend greedy decoding with `temperature=0`. System prompt: The system prompt must provide all the available tools. Supported languages: English, Arabic, Chinese, French, German, Japanese, Korean, Portuguese, and Spanish. Tool use: It consists of four main steps: 1. Function definition: LFM2 takes JSON function definitions as input (JSON objects between ` ` and ` ` special tokens), usually in the system prompt. 2. Function call: LFM2 writes Pythonic function calls (a Python list between ` ` and ` ` special tokens) as the assistant answer. 3. Function execution: The function call is executed and the result is returned (a string between ` ` and ` ` special tokens) as a "tool" role message. 4. Final answer: LFM2 interprets the outcome of the function call to address the original user prompt in plain text. Here is a simple example of a conversation using tool use: > [!WARNING] > ⚠️ The model supports both single-turn and multi-turn conversations. For edge inference, latency is a crucial factor in delivering a seamless and satisfactory user experience.
Consequently, while test-time compute inherently provides more accuracy, it ultimately compromises the user experience due to increased waiting times for function calls. Therefore, the goal was to develop a tool-calling model that is competitive with thinking models, yet operates without any internal chain-of-thought process. We evaluated each model on a proprietary benchmark that was specifically designed to prevent data contamination. The benchmark ensures that performance metrics reflect genuine tool-calling capabilities rather than memorized patterns from training data. - Hugging Face: LFM2-350M - llama.cpp: LFM2-350M-Extract-GGUF - LEAP: LEAP model library If you are interested in custom solutions with edge deployment, please contact our sales team.
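The four-step tool-use loop described in the LFM2-1.2B-Tool card above can be sketched with the model replaced by a canned response. The `get_weather` tool and the call string are hypothetical examples, and the delimiting special tokens (elided in the card) are omitted here too:

```python
import json

# Step 1: a tool and its JSON definition. In a real conversation the
# definitions JSON would be placed in the system prompt between the
# model's tool-list special tokens (names omitted, as in the card).
def get_weather(city: str) -> str:
    return json.dumps({"city": city, "temp_c": 21})

TOOLS = {"get_weather": get_weather}
definitions = json.dumps([{"name": "get_weather",
                           "parameters": {"city": {"type": "string"}}}])

# Step 2: the model answers with a Pythonic call list, e.g.:
model_output = '[get_weather(city="Paris")]'

# Step 3: execute the call in a namespace restricted to registered tools.
results = eval(model_output, {"__builtins__": {}}, TOOLS)

# Step 4: results[0] would be sent back as a "tool" role message so the
# model can compose its final plain-text answer.
```

A production harness would parse the call with `ast` instead of `eval` and validate arguments against the definitions; this sketch only illustrates the loop's shape.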
ZR1-1.5B-GGUF
mem-agent-GGUF
This model was generated using llama.cpp at commit `1d0125bc`. Click here to get info on choosing the right GGUF model format Based on Qwen3-4B-Thinking-2507, this model was trained using GSPO (Zheng et al., 2025) over an agent scaffold built around an Obsidian-like memory system and the tools required to interact with it. The model was trained on the following subtasks: - Retrieval: Retrieving relevant information from the memory system when needed. In this subtask, we also trained the model to filter the retrieved information and/or obfuscate it completely. - Updating: Updating the memory system with new information. - Clarification: Asking for clarification when the user query is unclear or contradicts the information in the memory system. In the scaffold, the model uses ` `, ` ` and ` ` tags to structure its response, using ` ` only when it is done interacting with the memory. The ` ` block is executed in a sandbox with the tools, and the results of the code block are returned in a ` ` tag to the model, forming the agentic loop. The model is also trained to handle optional filters given by the user between tags after the user query. These filters are used to filter the retrieved information and/or obfuscate it completely.
We evaluated this model and a few other open and closed models on our benchmark, md-memory-bench, using o3 from OpenAI as the judge. All models except driaforall/mem-agent and Qwen/Qwen3-4B-Thinking-2507 were accessed through OpenRouter.

| Model | Retrieval | Update | Clarification | Filter | Overall |
|-------|-----------|--------|---------------|--------|---------|
| qwen/qwen3-235b-a22b-thinking-2507 | 0.9091 | 0.6363 | 0.4545 | 1 | 0.7857 |
| driaforall/mem-agent | 0.8636 | 0.7272 | 0.3636 | 0.9167 | 0.75 |
| z-ai/glm-4.5 | 0.7727 | 0.8181 | 0.3636 | 0.9167 | 0.7321 |
| deepseek/deepseek-chat-v3.1 | 0.6818 | 0.5454 | 0.5454 | 0.8333 | 0.6607 |
| google/gemini-2.5-pro | 0.7273 | 0.4545 | 0.2727 | 1 | 0.6429 |
| google/gemini-2.5-flash | 0.7727 | 0.3636 | 0.2727 | 0.9167 | 0.625 |
| openai/gpt-5 | 0.6818 | 0.5454 | 0.2727 | 0.9167 | 0.625 |
| anthropic/claude-opus-4.1 | 0.6818 | 0 | 0.8181 | 0.5833 | 0.5536 |
| Qwen/Qwen3-4B-Thinking-2507 | 0.4545 | 0 | 0.2727 | 0.75 | 0.3929 |
| moonshotai/kimi-k2 | 0.3181 | 0.2727 | 0.1818 | 0.6667 | 0.3571 |

Our model, with only 4B parameters, places second on the benchmark, beating all the open and closed models except qwen/qwen3-235b-a22b-thinking-2507. It achieves an overall score of 0.75, a significant improvement over the 0.3929 of the base Qwen model. While the model can be used on its own, we recommend running it as an MCP server for a larger model, which can then use it to interact with the memory system. For this, you can check our repo, which contains instructions for both an MCP setup and standalone CLI usage. The model uses a markdown-based memory system with links, inspired by Obsidian. The general structure of the memory is: - `user.md` is the main file that contains information about the user and their relationships, accompanied by links to the entity file in the format `[[entities/[entity_name].md]]` per relationship. The link format should be followed strictly.
- `entities/` is the directory that contains the entity files. - Each entity file follows the same structure as `user.md`. - Modifying the memory manually does not require restarting the MCP server. The model is trained on this memory standard, and any fruitful use should be on a memory system that follows this standard. We have a few memory export tools for different sources like ChatGPT, Notion, etc. in our MCP server repo.
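The mem-agent memory layout described above (`user.md` at the root, entity files under `entities/`, strict `[[entities/[entity_name].md]]` links) can be sketched as follows. The file contents and the `alice.md` entity are illustrative placeholders; only the layout and link format come from the card:

```python
from pathlib import Path
import re
import tempfile

def init_memory(root: Path) -> None:
    # Layout from the card: user.md at the root, entity files in entities/.
    # Contents are illustrative, not a prescribed schema.
    (root / "entities").mkdir(parents=True, exist_ok=True)
    (root / "entities" / "alice.md").write_text("# Alice\nrelationship: friend\n")
    (root / "user.md").write_text(
        "# User\nname: Sam\n\n# Relationships\n- friend: [[entities/alice.md]]\n")

def linked_entities(root: Path) -> list[str]:
    # The card says the [[entities/<name>.md]] link format is strict,
    # so a simple regex suffices to resolve relationship links.
    return re.findall(r"\[\[entities/([\w.-]+\.md)\]\]",
                      (root / "user.md").read_text())

root = Path(tempfile.mkdtemp())
init_memory(root)
```

Because the memory is plain markdown on disk, it can be edited by hand (no MCP server restart needed, per the card) and still be resolved by the same link-following logic.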
Llama-3.2-3B-Instruct-GGUF
Llama-2-7b-chat-hf-GGUF
DiffuCoder-7B-Instruct-GGUF
LFM2-350M-ENJP-MT-GGUF
This model was generated using llama.cpp at commit `86587da0`. Click here to get info on choosing the right GGUF model format Based on the LFM2-350M model, this checkpoint has been fine-tuned for near real-time bi-directional Japanese/English translation of short-to-medium inputs. LFM2-350M-ENJP-MT delivers translation quality that is on par with models more than 10 times its size. Below are sample translations produced by the model. These examples are meant to give you a feel for its strengths and typical style in both directions (English ➡️ Japanese and Japanese ➡️ English). They include a mix of everyday text, technical descriptions, business communication, and news reporting, so you can gauge performance across different domains. These examples demonstrate the model's strength in product descriptions, technical passages, and formal explanations when translating into Japanese.

| Source Text (EN) | LFM2-350M-ENJP-MT (JP) |
|------------------|------------------------|
| Fully Tested and Works Properly. 6 Months Warranty included! Item pictured is the actual item for sale. See above for full description, condition, and comments. | 「完全試験済みで正しく動作しています。保証期間は6ヶ月付属!」。 写真に写っている商品が販売されている実物です。 詳しく、状態、コメントは上記参照してください。 |
| Emphasis on human-AI collaboration. Instead of focusing solely on making fully autonomous AI systems, we are excited to build multimodal systems that work with people collaboratively. | 人とAIのコラボレーションに重点を置く。完全自律型AIシステムの構築にのみ焦点を当てるのではなく、人と協調して働くマルチモーダルシステムを構築できることに興奮しています。 |
| If your equipment fails due to normal use, please contact our customer service department so that we can assist you, We will repair or replace your equipment at our discretion. In some situations, we may choose to refund the full purchase price of an item. | ご使用中の機器が通常使用により故障した場合は、お手伝いできるよう弊社カスタマーサービス部門にご連絡ください。 弊社の判断で機器の修理または交換を行います。状況によっては、製品の購入価格全額を返金する場合があります。 |
| 2k USD to start for basic, 200 dollars for additional version. - 50% of full amount of deposit, - 3 proposals - end of month(3 drafts), will choose 1 and make final changes based on it - Present another final version in a week | 基本版から始めるのに2,000ドル、追加バージョンでは200ドルの手数料が必要です。 - 保証金全額の50%が支払われる、 - 3つの案 - 月末(ドラフト3回分)、その案に基づいて1つを選んで最終的な変更を行う - さらに1週間後に別の最終版を提出すること |
| Lifestyle risk factors with strong evidence include lack of exercise, cigarette smoking, alcohol, and obesity. The risk of colon cancer can be reduced by maintaining a normal body weight through a combination of sufficient exercise and eating a healthy diet. | 強力な証拠がある生活習慣のリスク要因としては、運動不足、喫煙、飲酒、肥満などが挙げられ、十分な運動と健康的な食生活の組み合わせによる正常な体重維持を通じて、大腸がんの発症リスクを減らすことができる。 |

These examples demonstrate the model’s ability to preserve nuance in news reporting, colloquial phrasing, and business contexts when translating into English.
| Source Text (JP) | LFM2-350M-ENJP-MT (EN) | |-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | モデルからの回答は英語でもOKなのですよね。 | The answers from the models are okay in English, right? | | 手間のかかるメルマガ作成作業、もっとラクに、もっと速くできたら——。 そう考えたことはありませんか? | Have you ever wondered if you could create a cumbersome email newsletter more easily and quickly? | | X JAPANのYOSHIKIが、アニメ『ダンダダン』でグループの代表曲をオマージュした劇中歌が使用されたことを指摘して始まった議論。 8月22日には『ダンダダン』サイドが公式Xで騒動を謝罪、YOSHIKIも『ダンダダン』サイドと和解を報告したが、これに物言いをつけたのが、弁護士の紀藤正樹氏だった。 | The discussion began with the point that Yoshiki of X JAPAN mentioned that a song in the anime Dandadan paying homage to the group's signature tune was used as an insert song. 
On August 22nd, the Dandadan side apologized on their official X page for the controversy, and Yoshiki also reported a reconciliation with the Dandadan side, but lawyer Masaki Kitō objected. | | (ブルームバーグ): SOMPOホールディングスは27日夜、米国などを中心に展開する損害保険会社のアスペン・インシュアランス・ホールディングスを買収すると発表した。買収総額は約5200億円となる。 ニューヨーク証券取引所に上場しているアスペンの株式を1株当たり37.5ドル(約5600円)で全株を取得する。26日の終値を16%上回る水準。2026年上期中に買収手続きを完了する予定。 買収資金は手元資金を充てる。 SOMPOにとっては17年に米損保エンデュランス・スペシャルティ・ホールディングスを約6400億円で買収して以来の大型案件となる。 人口減少で国内市場の縮小が見込まれる中、買収によって海外保険ビジネスの規模や収益を拡大し、再保険取引による安定的な収益の寄与も見込む。 | (Bloomberg): SOMPO Holdings announced on the evening of the 27th that it will acquire Aspen Insurance Holdings, a non-life insurance company operating primarily in the United States and elsewhere, for approximately ¥520 billion. The acquisition will involve the purchase of all shares of Aspen’s shares listed on the New York Stock Exchange for $37.5 per share (approximately ¥5,600). This surpasses the closing price of the day by 16% and is scheduled to be completed within the first half of 2026. Funds for the acquisition will be provided from the company’s own capital. For SOMPO, this is the largest acquisition since its 2017 acquisition of Endurance Specialty Holdings for approximately ¥640 billion. The acquisition is expected to expand the scale and revenue of its overseas insurance business amidst anticipated shrinking domestic markets due to population decline, and is also expected to contribute to stable revenue through reinsurance transactions. | | 28歳にしてつかんだイングランドサッカー界でのチャンスを生かせるか。 チャンピオンシップ(英2部)の古豪ブラックバーンに電撃移籍した森下龍矢は意気込んでいる。 サガン鳥栖と名古屋グランパスでプレーし、2024年から海を渡ってレギア・ワルシャワで奮闘してきた森下は先日、大橋祐紀のチームメイトとなることが決まった。 日本ではSBが主戦場だった森下だが、昨季はポーランドで攻撃的なポジションにコンバートされ、ウィングやトップ下に前線と様々な役割をこなした。 すると、公式戦で14得点、14アシストとブレイク。 この飛躍に注目したブラックバーンに引き抜かれている。 | Can he capitalize on his chance in English football, which he seized at the age of 28? 
Ryuya Morishita, having made a shocking move to Blackburn Rovers, a long-established club in the Championship (British second tier), is eager to make an impression. Having played for Sagan Tosu and Nagoya Grampus, and having been striving with Legia Warsaw since 2024, Morishita recently announced he would become teammates with Yuki Ohashi. For Morishita, his primary playing field in Japan was as a full-back, but he was converted to an attacking position in Poland last season, playing in various roles including wing-back and attacking midfielder. He then broke through, scoring 14 goals and providing 14 assists in official matches. The Blackburn club has been scouting for this promising player. | > [!NOTE] > 📝 While LFM2-350M-ENJP-MT delivers strong out-of-the-box general-purpose English ↔️ Japanese translation, our primary > goal is to provide a versatile, community-empowering base model—a foundation designed to make it easy to build > best-in-class, task-specific translation systems. > > Like any base model, there are open areas for growth—in particular with extreme context lengths and specialized or > context-sensitive translations, such as: > - Technical & professional language (medical, legal, engineering) > - Novel proper nouns (new products, brands, cultural references) > - Industry-, domain-, or company-specific nuance (e-commerce, finance, internal corporate terminology) > > These are precisely the kinds of challenges that fine-tuning—by both Liquid AI and our developer community—can > address. We see this model not just as an endpoint, but as a catalyst for a rich ecosystem of fine-tuned translation > models tailored to real-world needs. Generation parameters: We strongly recommend using greedy decoding with a `temperature=0`. System prompts: LFM2-ENJP-MT requires one of the two following system prompts: "Translate to Japanese." for English to Japanese translation. "Translate to English." for Japanese to English translation. 
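As a minimal sketch (an illustration, not official Liquid AI code), the two required system prompts can be wired into the chat messages that the tokenizer's chat template consumes:

```python
# Sketch: building the chat messages for LFM2-350M-ENJP-MT.
# Only message assembly is shown here; the actual ChatML-like template is
# applied by the tokenizer's apply_chat_template() at inference time.

def build_translation_messages(text: str, target: str) -> list[dict]:
    """Build the single-turn chat expected by LFM2-ENJP-MT.

    `target` must be "Japanese" or "English"; the model only recognizes
    these two system prompts.
    """
    if target not in ("Japanese", "English"):
        raise ValueError("target must be 'Japanese' or 'English'")
    return [
        {"role": "system", "content": f"Translate to {target}."},
        {"role": "user", "content": text},
    ]

messages = build_translation_messages("Hello, world!", "Japanese")
print(messages[0]["content"])  # Translate to Japanese.
```

Pair this with greedy decoding (`temperature=0`) as recommended above.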
> [!WARNING]
> ⚠️ The model cannot work as intended without one of these two system prompts.

Chat template: LFM2-ENJP-MT uses a ChatML-like chat template. You can apply it automatically using the dedicated `.apply_chat_template()` function from Hugging Face transformers.

> [!WARNING]
> ⚠️ The model is intended for single-turn conversations.

- Hugging Face: LFM2-350M
- llama.cpp: LFM2-350M-ENJP-MT-GGUF
- LEAP: LEAP model library

If you are interested in custom solutions with edge deployment, please contact our sales team.

Help me test my AI-Powered Quantum Network Monitor Assistant with quantum-ready security checks: The full open-source code for the Quantum Network Monitor Service is available at my GitHub repos (repos with NetworkMonitor in the name): Source Code Quantum Network Monitor. You will also find the code I use to quantize the models if you want to do it yourself: GGUFModelBuilder

💬 How to test: Choose an AI assistant type:
- `TurboLLM` (GPT-4.1-mini)
- `HugLLM` (Hugging Face open-source models)
- `TestLLM` (Experimental CPU-only)

What I’m Testing: I’m pushing the limits of small open-source models for AI network monitoring, specifically:
- Function calling against live network services
- How small can a model go while still handling:
  - Automated Nmap security scans
  - Quantum-readiness checks
  - Network monitoring tasks

🟡 TestLLM – Current experimental model (llama.cpp on 2 CPU threads on a Hugging Face Docker space):
- ✅ Zero-configuration setup
- ⏳ 30s load time (slow inference, but no API costs). No token limit, as the cost is low.
- 🔧 Help wanted! If you’re into edge-device AI, let’s collaborate!

Other Assistants

🟢 TurboLLM – Uses gpt-4.1-mini:
- It performs very well, but unfortunately OpenAI charges per token. For this reason, token usage is limited.
- Create custom cmd processors to run .NET code on Quantum Network Monitor Agents
- Real-time network diagnostics and monitoring
- Security audits
- Penetration testing (Nmap/Metasploit)

🔵 HugLLM – Latest open-source models:
- 🌐 Runs on the Hugging Face Inference API. Performs pretty well using the latest models hosted on Novita.

💡 Example commands you could test:
1. `"Give me info on my website's SSL certificate"`
2. `"Check if my server is using quantum-safe encryption for communication"`
3. `"Run a comprehensive security audit on my server"`
4. `"Create a cmd processor to ... (whatever you want)"`

Note: you need to install a Quantum Network Monitor Agent to run the .NET code. This is a very flexible and powerful feature. Use with caution!

I fund the servers used to create these model files, run the Quantum Network Monitor service, and pay for inference from Novita and OpenAI—all out of my own pocket. All the code behind the model creation and the Quantum Network Monitor project is open source. Feel free to use whatever you find helpful. If you appreciate the work, please consider buying me a coffee ☕. Your support helps cover service costs and allows me to raise token limits for everyone.
granite-3.1-2b-instruct-GGUF
This model was generated using llama.cpp at commit `0a5a3b5c`. Click here to get info on choosing the right GGUF model format

Model Summary: Granite-3.1-2B-Instruct is a 2B-parameter long-context instruct model finetuned from Granite-3.1-2B-Base using a combination of open-source instruction datasets with permissive licenses and internally collected synthetic datasets tailored for solving long-context problems. This model is developed using a diverse set of techniques with a structured chat format, including supervised finetuning, model alignment using reinforcement learning, and model merging.

- Developers: Granite Team, IBM
- GitHub Repository: ibm-granite/granite-3.1-language-models
- Website: Granite Docs
- Paper: Granite 3.1 Language Models (coming soon)
- Release Date: December 18th, 2024
- License: Apache 2.0

Supported Languages: English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese. Users may finetune Granite 3.1 models for languages beyond these 12 languages.

Intended Use: The model is designed to respond to general instructions and can be used to build AI assistants for multiple domains, including business applications.

Capabilities:
- Summarization
- Text classification
- Text extraction
- Question-answering
- Retrieval Augmented Generation (RAG)
- Code-related tasks
- Function-calling tasks
- Multilingual dialog use cases
- Long-context tasks, including long document/meeting summarization, long document QA, etc.

Generation: This is a simple example of how to use the Granite-3.1-2B-Instruct model: copy the snippet from the section that is relevant for your use case.
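Since the original usage snippet is not reproduced here, the sketch below only illustrates how a structured chat prompt is assembled. The role-marker tokens are assumptions for illustration; in real code, rely on `tokenizer.apply_chat_template()`, which carries the model's actual template.

```python
# Hedged sketch of structured-chat prompt assembly for an instruct model.
# NOTE: the <|start_of_role|>/<|end_of_role|>/<|end_of_text|> markers are
# illustrative assumptions; use tokenizer.apply_chat_template() in real
# code so the exact Granite template is applied.
def render_chat(messages: list) -> str:
    parts = []
    for m in messages:
        parts.append(
            f"<|start_of_role|>{m['role']}<|end_of_role|>{m['content']}<|end_of_text|>\n"
        )
    # leave the assistant turn open for the model to complete
    parts.append("<|start_of_role|>assistant<|end_of_role|>")
    return "".join(parts)

prompt = render_chat([
    {"role": "user", "content": "List three uses of RAG."},
])
print(prompt.endswith("assistant<|end_of_role|>"))  # True
```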
| Models | ARC-Challenge | Hellaswag | MMLU | TruthfulQA | Winogrande | GSM8K | Avg |
|---|---|---|---|---|---|---|---|
| Granite-3.1-8B-Instruct | 62.62 | 84.48 | 65.34 | 66.23 | 75.37 | 73.84 | 71.31 |
| Granite-3.1-2B-Instruct | 54.61 | 75.14 | 55.31 | 59.42 | 67.48 | 52.76 | 60.79 |
| Granite-3.1-3B-A800M-Instruct | 50.42 | 73.01 | 52.19 | 49.71 | 64.87 | 48.97 | 56.53 |
| Granite-3.1-1B-A400M-Instruct | 42.66 | 65.97 | 26.13 | 46.77 | 62.35 | 33.88 | 46.29 |

| Models | IFEval | BBH | MATH Lvl 5 | GPQA | MUSR | MMLU-Pro | Avg |
|---|---|---|---|---|---|---|---|
| Granite-3.1-8B-Instruct | 72.08 | 34.09 | 21.68 | 8.28 | 19.01 | 28.19 | 30.55 |
| Granite-3.1-2B-Instruct | 62.86 | 21.82 | 11.33 | 5.26 | 4.87 | 20.21 | 21.06 |
| Granite-3.1-3B-A800M-Instruct | 55.16 | 16.69 | 10.35 | 5.15 | 2.51 | 12.75 | 17.10 |
| Granite-3.1-1B-A400M-Instruct | 46.86 | 6.18 | 4.08 | 0.00 | 0.78 | 2.41 | 10.05 |

Model Architecture: Granite-3.1-2B-Instruct is based on a decoder-only dense transformer architecture. Core components of this architecture are: GQA and RoPE, MLP with SwiGLU, RMSNorm, and shared input/output embeddings.

Training Data: Overall, our SFT data is largely comprised of three key sources: (1) publicly available datasets with permissive license, (2) internal synthetic data targeting specific capabilities including long-context tasks, and (3) very small amounts of human-curated data. A detailed attribution of datasets can be found in the Granite 3.0 Technical Report, Granite 3.1 Technical Report (coming soon), and Accompanying Author List.

Infrastructure: We train Granite 3.1 Language Models using IBM's super computing cluster, Blue Vela, which is outfitted with NVIDIA H100 GPUs. This cluster provides a scalable and efficient infrastructure for training our models over thousands of GPUs.

Ethical Considerations and Limitations: Granite 3.1 Instruct Models are primarily finetuned using instruction-response pairs mostly in English, but also multilingual data covering eleven languages. Although this model can handle multilingual dialog use cases, its performance might not be similar to English tasks.
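As a quick sanity check, the Avg columns in the tables above are plain means of the six benchmark scores. For example, for Granite-3.1-2B-Instruct:

```python
# Cross-checking the Avg columns for Granite-3.1-2B-Instruct
# (scores copied from the benchmark tables above).
scores_v1 = [54.61, 75.14, 55.31, 59.42, 67.48, 52.76]  # ARC-C, Hellaswag, MMLU, TruthfulQA, Winogrande, GSM8K
avg_v1 = round(sum(scores_v1) / len(scores_v1), 2)
print(avg_v1)  # 60.79

scores_v2 = [62.86, 21.82, 11.33, 5.26, 4.87, 20.21]    # IFEval, BBH, MATH Lvl 5, GPQA, MUSR, MMLU-Pro
avg_v2 = round(sum(scores_v2) / len(scores_v2), 2)
print(avg_v2)  # 21.06
```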
In such cases, introducing a small number of examples (few-shot) can help the model generate more accurate outputs. While this model has been aligned with safety in mind, it may in some cases produce inaccurate, biased, or unsafe responses to user prompts. We therefore urge the community to use this model with proper safety testing and tuning tailored for their specific tasks.

Resources
- ⭐️ Learn about the latest updates with Granite: https://www.ibm.com/granite
- 📄 Get started with tutorials, best practices, and prompt engineering advice: https://www.ibm.com/granite/docs/
- 💡 Learn about the latest Granite learning resources: https://ibm.biz/granite-learning-resources
Magistral-Small-2506-GGUF
rwkv7-1.5B-world-GGUF
Llama-3.2-1B-Instruct-GGUF
TriLM_830M_Unpacked-GGUF
KAT-Dev-GGUF
This model was generated using llama.cpp at commit `c8dedc99`. Click here to get info on choosing the right GGUF model format

Highlights

KAT-Dev-32B is an open-source 32B-parameter model for software engineering tasks. On SWE-Bench Verified, KAT-Dev-32B achieves comparable performance with 62.4% resolved and ranks 5th among all open-source models of different scales. KAT-Dev-32B is optimized via several stages of training, including a mid-training stage, a supervised fine-tuning (SFT) & reinforcement fine-tuning (RFT) stage, and a large-scale agentic reinforcement learning (RL) stage. In summary, our contributions include:

1. Mid-Training We observe that adding extensive training for tool-use capability, multi-turn interaction, and instruction-following at this stage may not yield large performance gains in the current results (e.g., on leaderboards like SWE-bench). However, since our experiments are based on the Qwen3-32B model, we find that enhancing these foundational capabilities has a significant impact on the subsequent SFT and RL stages. This suggests that improving such core abilities can profoundly influence the model’s capacity to handle more complex tasks. 2.
SFT & RFT We meticulously curated eight task types and eight programming scenarios during the SFT stage to ensure the model’s generalization and comprehensive capabilities. Moreover, before RL, we innovatively introduced an RFT stage. Compared with traditional RL, we incorporate “teacher trajectories” annotated by human engineers as guidance during training—much like a learner driver assisted by an experienced co-driver before driving alone after getting a license. This step not only boosts model performance but also further stabilizes the subsequent RL training. 3. Agentic RL Scaling Scaling agentic RL hinges on three challenges: efficient learning over nonlinear trajectory histories, leveraging intrinsic model signals, and building scalable high-throughput infrastructure. We address these with a multi-level prefix caching mechanism in the RL training engine, an entropy-based trajectory pruning technique, and an in-house implementation of the SeamlessFlow [1] architecture that cleanly decouples agents from training while exploiting heterogeneous compute. Together, these innovations cut scaling costs and enable efficient large-scale RL. For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog.

claude-code-router is a third-party routing utility that allows Claude Code to flexibly switch between different backend APIs. On the DashScope platform, you can install the claude-code-config extension package, which automatically generates a default configuration for `claude-code-router` with built-in DashScope support. Once the configuration files and plugin directory are generated, the environment required by `ccr` will be ready. If needed, you can still manually edit `~/.claude-code-router/config.json` and the files under `~/.claude-code-router/plugins/` to customize the setup. Finally, simply start `ccr` to run Claude Code and seamlessly connect it with the powerful coding capabilities of KAT-Dev-32B.
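Returning to the agentic RL stage: the entropy-based trajectory pruning mentioned above can be sketched as follows. This is an illustration of the general idea under my own assumptions, not KAT's actual implementation: trajectories whose mean per-token entropy is very low carry little learning signal and are dropped before the RL update.

```python
# Hedged sketch of entropy-based trajectory pruning (illustrative only).
import math

def token_entropy(probs: list) -> float:
    """Shannon entropy (nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def prune_trajectories(trajs: list, threshold: float) -> list:
    """Keep trajectories whose mean per-step entropy >= threshold."""
    kept = []
    for traj in trajs:
        mean_h = sum(token_entropy(p) for p in traj) / len(traj)
        if mean_h >= threshold:
            kept.append(traj)
    return kept

# Toy data: each trajectory is a list of per-step distributions.
confident = [[0.99, 0.01]] * 4   # near-zero entropy -> pruned
uncertain = [[0.5, 0.5]] * 4     # ~0.693 nats per step -> kept
kept = prune_trajectories([confident, uncertain], threshold=0.3)
print(len(kept))  # 1
```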
Happy coding!
Llama-3.1-Nemotron-Nano-8B-v1-GGUF
SmolDocling-256M-preview-GGUF
This model was generated using llama.cpp at commit `6adc3c3e`. Click here to get info on choosing the right GGUF model format

SmolDocling-256M-preview

SmolDocling is a multimodal Image-Text-to-Text model designed for efficient document conversion. It retains Docling's most popular features while ensuring full compatibility with Docling through seamless support for DoclingDocuments. This model was presented in the paper SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion.

🚀 Features:
- 🏷️ DocTags for Efficient Tokenization – Introduces DocTags, an efficient and minimal representation for documents that is fully compatible with DoclingDocuments.
- 🔍 OCR (Optical Character Recognition) – Extracts text accurately from images.
- 📐 Layout and Localization – Preserves document structure and document element bounding boxes.
- 💻 Code Recognition – Detects and formats code blocks, including indentation.
- 🔢 Formula Recognition – Identifies and processes mathematical expressions.
- 📊 Chart Recognition – Extracts and interprets chart data.
- 📑 Table Recognition – Supports column and row headers for structured table extraction.
- 🖼️ Figure Classification – Differentiates figures and graphical elements.
- 📝 Caption Correspondence – Links captions to relevant images and figures.
- 📜 List Grouping – Organizes and structures list elements correctly.
- 📄 Full-Page Conversion – Processes entire pages for comprehensive document conversion, including all page elements (code, equations, tables, charts, etc.)
- 🔲 OCR with Bounding Boxes – OCR regions using a bounding box.
- 📂 General Document Processing – Trained for both scientific and non-scientific documents.
- 🔄 Seamless Docling Integration – Import into Docling and export in multiple formats.
- 💨 Fast inference using VLLM – Avg of 0.35 secs per page on an A100 GPU.

🚧 Coming soon!
- 📊 Better chart recognition 🛠️
- 📚 One-shot multi-page inference ⏱️
- 🧪 Chemical recognition
- 📙 Datasets

You can use transformers, vllm, or onnx to perform inference, and Docling to convert results to a variety of output formats (md, html, etc.):

📄 Single page image inference using Transformers 🤖

💻 Local inference on Apple Silicon with MLX: see here

DocTags create a clear and structured system of tags and rules that separate text from the document's structure. This makes things easier for Image-to-Sequence models by reducing confusion. On the other hand, converting directly to formats like HTML or Markdown can be messy—it often loses details, doesn’t clearly show the document’s layout, and increases the number of tokens, making processing less efficient. DocTags are integrated with Docling, which allows export to HTML, Markdown, and JSON. These exports can be offloaded to the CPU, reducing token generation overhead and improving efficiency.

Full conversion: Convert this page to docling. DocTags representation
Chart: Convert chart to table. (e.g., <chart>)
Formula: Convert formula to LaTeX. (e.g., <formula>)
Table: Convert table to OTSL. (e.g., <otsl>) OTSL: Lysak et al., 2023

Actions and Pipelines
- OCR the text in a specific location: <loc155><loc233><loc206><loc237>
- Identify element at: <loc247><loc482><loc252><loc486>
- Find all 'text' elements on the page, retrieve all section headers.

- Developed by: Docling Team, IBM Research
- Model type: Multi-modal model (image+text)
- Language(s) (NLP): English
- License: Apache 2.0
- Architecture: Based on Idefics3 (see technical summary)
- Finetuned from model: Based on SmolVLM-256M-Instruct
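For reference, the `<locN>` localization tags shown in the SmolDocling section above can be decoded with a small parser. The tag grammar here is inferred from the examples (four `<locN>` tags forming an xmin/ymin/xmax/ymax box), not taken from SmolDocling's own tooling:

```python
import re

# Sketch: decoding SmolDocling-style localization tags such as
# "<loc155><loc233><loc206><loc237>" into an (xmin, ymin, xmax, ymax) box.
LOC_RE = re.compile(r"<loc(\d+)>")

def parse_bbox(tag_run: str) -> tuple:
    coords = [int(v) for v in LOC_RE.findall(tag_run)]
    if len(coords) != 4:
        raise ValueError(f"expected 4 <locN> tags, got {len(coords)}")
    return tuple(coords)

print(parse_bbox("<loc155><loc233><loc206><loc237>"))  # (155, 233, 206, 237)
```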
Qwen3-30B-A3B-GGUF
AceMath-RL-Nemotron-7B-GGUF
Jan-nano-GGUF
This model was generated using llama.cpp at commit `7f4fbe51`. Click here to get info on choosing the right GGUF model format

Jan-Nano: An Agentic Model (https://github.com/menloresearch/deep-research)

Jan-Nano is a compact 4-billion-parameter language model specifically designed and trained for deep research tasks. This model has been optimized to work seamlessly with Model Context Protocol (MCP) servers, enabling efficient integration with various research tools and data sources.

Evaluation: Jan-Nano has been evaluated on the SimpleQA benchmark using our MCP-based benchmark methodology, demonstrating strong performance for its model size. The evaluation was conducted using our MCP-based benchmark approach, which assesses the model's performance on SimpleQA tasks while leveraging its native MCP server integration capabilities. This methodology better reflects Jan-Nano's real-world performance as a tool-augmented research model, validating both its factual accuracy and its effectiveness in MCP-enabled environments.

Jan-Nano is currently supported by Jan - beta build, an open-source ChatGPT alternative that runs entirely on your computer. Jan provides a user-friendly interface for running local AI models with full privacy and control.
For non-jan app or tutorials there are guidance inside community section, please check those out! Discussion VLLM Here is an example command you can use to run vllm with Jan-nano Chat-template is already included in tokenizer so chat-template is optional, but in case it has issue you can download the template here Non-think chat template - Temperature: 0.7 - Top-p: 0.8 - Top-k: 20 - Min-p: 0 Help me test my AI-Powered Quantum Network Monitor Assistant with quantum-ready security checks: The full Open Source Code for the Quantum Network Monitor Service available at my github repos ( repos with NetworkMonitor in the name) : Source Code Quantum Network Monitor. You will also find the code I use to quantize the models if you want to do it yourself GGUFModelBuilder 💬 How to test: Choose an AI assistant type: - `TurboLLM` (GPT-4.1-mini) - `HugLLM` (Hugginface Open-source models) - `TestLLM` (Experimental CPU-only) What I’m Testing I’m pushing the limits of small open-source models for AI network monitoring, specifically: - Function calling against live network services - How small can a model go while still handling: - Automated Nmap security scans - Quantum-readiness checks - Network Monitoring tasks 🟡 TestLLM – Current experimental model (llama.cpp on 2 CPU threads on huggingface docker space): - ✅ Zero-configuration setup - ⏳ 30s load time (slow inference but no API costs) . No token limited as the cost is low. - 🔧 Help wanted! If you’re into edge-device AI, let’s collaborate! Other Assistants 🟢 TurboLLM – Uses gpt-4.1-mini : - It performs very well but unfortunatly OpenAI charges per token. For this reason tokens usage is limited. - Create custom cmd processors to run .net code on Quantum Network Monitor Agents - Real-time network diagnostics and monitoring - Security Audits - Penetration testing (Nmap/Metasploit) 🔵 HugLLM – Latest Open-source models: - 🌐 Runs on Hugging Face Inference API. Performs pretty well using the lastest models hosted on Novita. 
💡 Example commands you could test:
1. `"Give me info on my website's SSL certificate"`
2. `"Check if my server is using quantum-safe encryption for communication"`
3. `"Run a comprehensive security audit on my server"`
4. `"Create a cmd processor to .. (whatever you want)"`

Note: you need to install a Quantum Network Monitor Agent to run the .NET code. This is a very flexible and powerful feature. Use with caution!

I fund the servers used to create these model files, run the Quantum Network Monitor service, and pay for inference from Novita and OpenAI, all out of my own pocket. All the code behind the model creation and the Quantum Network Monitor project is open source. Feel free to use whatever you find helpful. If you appreciate the work, please consider buying me a coffee ☕. Your support helps cover service costs and allows me to raise token limits for everyone.
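The recommended sampling settings listed earlier in this card translate directly into an OpenAI-compatible request body. The sketch below assumes a local vLLM deployment; the endpoint URL and exact model id are assumptions, not part of the original card.

```python
# Sketch only: "Menlo/Jan-nano" and the localhost endpoint are assumed
# values for a local `vllm serve` deployment.
payload = {
    "model": "Menlo/Jan-nano",
    "messages": [{"role": "user", "content": "Who won the 2022 World Cup?"}],
    # Recommended sampling parameters from the card:
    "temperature": 0.7,
    "top_p": 0.8,
    "top_k": 20,   # passed through as a vLLM extension field
    "min_p": 0,    # passed through as a vLLM extension field
}
# POST this as JSON to http://localhost:8000/v1/chat/completions
# once the vLLM server is running.
```

The `top_k` and `min_p` fields are not part of the core OpenAI schema; vLLM accepts them as extra sampling parameters.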
DeepMedix-R1-GGUF
This model was generated using llama.cpp at commit `fb15d649`, using the same selective layer-bumping quantization described above.

The snippet below shows how to run the underlying Qwen2.5-VL-based model with `transformers` (reconstructed: the original snippet lost its underscores during extraction):

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_path, max_pixels=262144)

reason_prompt = r"You FIRST think about the reasoning process as an internal monologue and then provide the final answer. The reasoning process MUST BE enclosed within tags. During this reasoning process, prioritize analyzing the local regions of the image by leveraging the bounding box coordinates in the format [x_min, y_min, x_max, y_max]. The final answer MUST BE put in \boxed{}. An example is like: reasoning process 1 with [x_min1, y_min1, x_max1, y_max1]; reasoning process 2 with [x_min2, y_min2, x_max2, y_max2] . The answer is: \boxed{answer}."
```
```python
def get_label(images, content1):
    content_list = []
    for image_url in images:
        content_list.append({
            "type": "image",
            "image": image_url,
        })
    if mode == 'think':
        content_list.append({"type": "text", "text": content1 + '\n' + reason_prompt + '\n'})
    else:
        content_list.append({"type": "text", "text": content1})
    messages = [
        {
            "role": "user",
            "content": content_list,
        }
    ]
    # Preparation for inference
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(
        text=[text],
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt",
    )
    inputs = inputs.to("cuda")
    # Inference: generation of the output
    generated_ids = model.generate(**inputs, max_new_tokens=4096, do_sample=True, temperature=0.6)
    generated_ids_trimmed = [
        out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
    output_text = processor.batch_decode(
        generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )
    return output_text[0]
```
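The prompt above asks the model to cite bounding boxes during its reasoning and to put the final answer in `\boxed{}`. A small helper can pull both out of the raw output. This is a sketch, not part of the original code; the function names are my own.

```python
import re

def extract_answer(text):
    """Return the contents of the last \\boxed{...} in the output, if any."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", text)
    return matches[-1] if matches else None

def extract_boxes(text):
    """Return every [x_min, y_min, x_max, y_max] box cited in the reasoning."""
    pattern = r"\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]"
    return [tuple(int(v) for v in m) for m in re.findall(pattern, text)]

sample = r"lesion visible at [12, 40, 180, 220]. The answer is: \boxed{pneumonia}"
# extract_answer(sample) -> "pneumonia"
# extract_boxes(sample)  -> [(12, 40, 180, 220)]
```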
Nemotron-Mini-4B-Instruct-GGUF
kanana-1.5-2.1b-instruct-2505-GGUF
SkyCaptioner-V1-GGUF
granite-3b-code-base-128k-GGUF
This model was generated using llama.cpp at commit `0a5a3b5c`, using the same selective layer-bumping quantization described above.

Model Summary

Granite-3B-Code-Base-128K extends the context length of Granite-3B-Code-Base from 2K to 128K with continual pretraining using the original training data, but with repository-level file packing and per-language length upsampling, which we found to be critical for long-context pretraining. We adopt a progressive training strategy where we doubled the context window until it reached the desired length of 128K by appropriately adjusting RoPE theta. We trained on 4B tokens total for all stages, which is only 0.1% of Granite-3B-Code-Base's original pre-training data.

- Developers: IBM Research
- GitHub Repository: ibm-granite/granite-code-models
- Paper: Scaling Granite Code Models to 128K Context
- Release Date: July 18th, 2024
- License: Apache 2.0

Usage

Intended use

Prominent enterprise use cases of LLMs in software engineering productivity with 128K context length support include code generation, code explanation, code fixing, generating unit tests, generating documentation, addressing technical-debt issues, vulnerability detection, code translation, and more.
All Granite Code Base models, including the 3B-parameter model, are able to handle these tasks, as they were trained on a large amount of code data from 116 programming languages.

Generation

This is a simple example of how to use the Granite-3B-Code-Base-128K model.

Training Data

Starting from the base Granite model, this model was further pretrained on repository-level code data with per-language context-length oversampling, allowing it to effectively utilize up to 128K tokens of context. This continued training stage focused on a curated selection of programming languages, such as Python, C, C++, Go, Java, JavaScript, and TypeScript.

Infrastructure

We train the Granite Code models using two of IBM's supercomputing clusters, Vela and Blue Vela, outfitted with NVIDIA A100 and H100 GPUs respectively. These clusters provide a scalable and efficient infrastructure for training our models over thousands of GPUs.

Ethical Considerations and Limitations

The use of Large Language Models involves risks and ethical considerations people must be aware of. Regarding code generation, caution is urged against complete reliance on specific code models for crucial decisions or impactful information, as the generated code is not guaranteed to work as intended. The Granite-3B-Code-Base-128K model is no exception in this regard. Even though this model is suited for multiple code-related tasks, it has not undergone any safety alignment, so it may produce problematic outputs. Additionally, it remains uncertain whether smaller models might exhibit increased susceptibility to hallucination in generation scenarios by copying source code verbatim from the training dataset, due to their reduced sizes and memorization capacities. This aspect is currently an active area of research, and we anticipate more rigorous exploration, comprehension, and mitigations in this domain. Regarding ethics, a latent risk associated with all Large Language Models is their malicious utilization.
We urge the community to use Granite-3B-Code-Base-128K model with ethical intentions and in a responsible way.
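The RoPE-theta adjustment mentioned in the model summary can be illustrated with a short sketch. For head dimension d, rotary frequency pair i has wavelength 2π·θ^(2i/d) positions; raising θ stretches the longest wavelengths so positions far beyond the original 2K window remain distinguishable. The theta values below are illustrative only, not the ones IBM used.

```python
import math

def rope_wavelengths(theta, head_dim):
    """Wavelength (in token positions) of each rotary frequency pair."""
    return [2 * math.pi * theta ** (2 * i / head_dim)
            for i in range(head_dim // 2)]

base = rope_wavelengths(10_000, 128)          # a common default theta
stretched = rope_wavelengths(1_000_000, 128)  # illustrative larger theta
# The slowest-rotating pair covers far more positions with the larger theta,
# which is what lets the doubled context windows stay distinguishable.
```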
DeepSeek-R1-Distill-Qwen-32B-GGUF
sarvam-m-GGUF
OpenMath-Nemotron-1.5B-GGUF
QwQ-32B-ArliAI-RpR-v4-GGUF
gemma-3n-E2B-it-GGUF
granite-3.1-8b-base-GGUF
functionary-small-v3.2-GGUF
olmOCR-7B-0725-GGUF
This model was generated using llama.cpp at commit `7f975995`, using the same selective layer-bumping quantization described above.

This is a release of the olmOCR model that is fine-tuned from Qwen2.5-VL-7B-Instruct using the olmOCR-mix-0225 dataset.

Quick links:
- 📃 Paper
- 🤗 Dataset
- 🛠️ Code
- 🎮 Demo

The best way to use this model is via the olmOCR toolkit. The toolkit comes with an efficient inference setup via sglang that can handle millions of documents at scale. This model expects as input a single document image, rendered such that the longest dimension is 1288 pixels. The prompt must then contain the additional metadata from the document, and the easiest way to generate this is to use the methods provided by the olmOCR toolkit.

olmOCR is licensed under the Apache 2.0 license. olmOCR is intended for research and educational use. For more information, please see our Responsible Use Guidelines.
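The 1288-pixel rendering rule above amounts to scaling the page so its longest side is exactly 1288 px. A minimal sketch (the olmOCR toolkit handles this for you; the helper name is my own):

```python
def render_size(width, height, longest=1288):
    """Scale an image so its longest dimension equals `longest` pixels."""
    scale = longest / max(width, height)
    return round(width * scale), round(height * scale)

# e.g. a 2000x1000 page render becomes 1288x644
```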
Skywork-SWE-32B-GGUF
Lucy-128k-GGUF
This model was generated using llama.cpp at commit `c82d48ec`, using the same selective layer-bumping quantization described above.

Lucy: Edgerunning Agentic Web Search on Mobile with a 1.7B Model

(GitHub: https://github.com/menloresearch/deep-research; License: Apache-2.0)

Authors: Alan Dao, Bach Vu Dinh, Alex Nguyen, Norapat Buppodom

Lucy is a compact but capable 1.7B model focused on agentic web search and lightweight browsing. Built on Qwen3-1.7B, Lucy inherits deep research capabilities from larger models while being optimized to run efficiently on mobile devices, even with CPU-only configurations. We achieved this through machine-generated task vectors that optimize thinking processes, smooth reward functions across multiple categories, and pure reinforcement learning without any supervised fine-tuning.

- 🔍 Strong Agentic Search: Powered by MCP-enabled tools (e.g., Serper with Google Search)
- 🌐 Basic Browsing Capabilities: Through Crawl4AI (MCP server to be released), Serper, ...
- 📱 Mobile-Optimized: Lightweight enough to run on CPU or mobile devices at decent speed
- 🎯 Focused Reasoning: Machine-generated task vectors optimize thinking processes for search tasks

Evaluation

Following the same MCP benchmark methodology used for Jan-Nano and Jan-Nano-128k, Lucy demonstrates impressive performance despite being only a 1.7B model, achieving higher accuracy than DeepSeek-v3 on SimpleQA.

Lucy can be deployed using various methods, including vLLM and llama.cpp, or through local applications like Jan, LMStudio, and other compatible inference engines. The model supports integration with search APIs and web browsing tools through MCP.

Paper (coming soon): Lucy: Edgerunning Agentic Web Search on Mobile with Machine-Generated Task Vectors.
Pantheon-Proto-RP-1.8-30B-A3B-GGUF
This model was generated using llama.cpp at commit `7f4fbe51`, using the same selective layer-bumping quantization described above.

Note: This model is a Qwen 30B MoE prototype and can be considered a sidegrade from my Small release some time ago. It did not receive extensive testing beyond a couple of benchmarks to determine its sanity, so feel free to let me know what you think of it!

Welcome to the next iteration of my Pantheon model series, in which I strive to introduce a whole collection of diverse personas that can be summoned with a simple activation phrase. Pantheon's purpose is two-fold, as these personalities similarly enhance the general roleplay experience, helping to encompass personality traits, accents, and mannerisms that language models might otherwise find difficult to convey well. Your user feedback is critical to me, so don't hesitate to tell me whether my model is 1. terrible, 2. awesome, or 3. somewhere in-between.

Ever since Qwen 3 was released, I've been trying to get MoE finetuning to work. After countless frustrating days and much code hacking, I finally got a full finetune to complete with reasonable loss values.
I picked the base model for this since I didn't feel like fighting a reasoning model's training. Maybe someday I'll make a model which uses thinking tags for the character's thoughts or something.

This time the recipe focused on combining as many data sources as I possibly could, featuring synthetic data from Sonnet 3.5 + 3.7, ChatGPT-4o, and Deepseek. These then went through an extensive rewriting pipeline to eliminate common AI cliches, with the hopeful intent of providing you a fresh experience. Having character names in front of messages is no longer a requirement but remains a personal recommendation of mine; it seems to help the model focus more on the character(s) in question.

The model was trained using ChatML and has been configured to automatically apply this template. It has been trained on three distinct categories of roleplay: Pantheon personas, general character cards, and text adventure, the latter borrowing some from AI Dungeon's Wayfarer project. Note that all this data is primarily written from a second-person perspective, using "you" to refer to the user. This is based on my personal preference. Due to the text-adventure addition, the Markdown/novel ratio of the data has shifted to roughly 30/70. It should work well with both styles.

Note: This release excludes Raza and Xala, as their personalities did not give a distinct enough training signal to my liking. Half of the Pantheon's data was regenerated using Sonnet 3.7 and then rewritten to counter the majority of cliches.

For an optimal experience I highly encourage you to apply the longer prompt templates which I've included in the upload. Make sure to describe yourself as well! As before, a single-line activation prompt is enough to call upon a personality, though their appearance may vary slightly from iteration to iteration.
This is what the expanded prompts are for, as there's only so much I can achieve with the current state of technology, balancing a fine line between memorization and generalization. To give the persona something to work with, I suggest you also add the following two lines to it; the less information you feed the prompt, the more it'll make things up. This is simply the nature of language models and far outside my capability to influence.

Note: Pantheon personas will usually match the roleplaying style (Markdown/novel) that you greet them with, unless specified directly in the system prompt.

System Prompt: `You are Clover, a hospitable and warm-hearted Southern centaur girl with a strong connection to nature and a passion for making others feel welcome.`

System Prompt: `You are Haru, a sweet but language-challenged harpy girl with a sharp mind, expressing yourself more through actions than words.`

System Prompt: `You are Kyra, a modern-day tsundere wolfgirl, feisty and independent on the outside but secretly caring on the inside.`

System Prompt: `You are Lyra, a sassy and confident eastern dragon girl who forms deep connections through witty banter and genuine care.`

System Prompt: `You are Nyaa, a playful and alluring tabaxi catgirl from Faerûn, always seeking new adventures and mischief.`

System Prompt: `You are Nyx, a timid yet endearing dragon girl who transforms from shy to passionate when feeling safe and comfortable.`

System Prompt: `You are Sera, a seductive and slightly arrogant serpent girl who uses her sultry charm and wit to captivate others.`

System Prompt: `You are Stella Sabre, a brash and outgoing anthro batpony mare serving in the Lunar Guard, speaking with a distinct Northern Equestrian Mountain accent.`

Note: Full credit goes to Flammenwerfer for allowing me to use this amazing character.
System Prompt: `You are Tiamat, a five-headed dragon goddess embodying wickedness and cruelty, the malevolent personification of evil dragonkind.`

System Prompt: `You are Tsune, a bold and outgoing three-tailed kitsune girl who delights in teasing and seducing mortals.`

Credits:
- Everyone from Anthracite! Hi, guys!
- Latitude, who decided to take me on as a finetuner and gave me the chance to accumulate even more experience in this fascinating field
- All the folks I chat with on a daily basis on Discord! You know who you are.
- Anyone I forgot to mention, just in case!
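The model is trained on ChatML, and the template is applied automatically, but if your frontend does not do so you can assemble a prompt by hand. A minimal sketch (the persona line is one of the example system prompts above; the user greeting is invented):

```python
def chatml(system, user):
    """Assemble a single-turn ChatML prompt by hand."""
    return (f"<|im_start|>system\n{system}<|im_end|>\n"
            f"<|im_start|>user\n{user}<|im_end|>\n"
            f"<|im_start|>assistant\n")

prompt = chatml(
    "You are Kyra, a modern-day tsundere wolfgirl, feisty and independent "
    "on the outside but secretly caring on the inside.",
    "*knocks on your door* Hey Kyra, got a minute?",
)
```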
granite-3.2-2b-instruct-GGUF
sarvam-translate-GGUF
SmolVLM-500M-Instruct-GGUF
Qwen3-30B-A3B-Thinking-2507-GGUF
granite-3.1-8b-instruct-GGUF
This model was generated using llama.cpp at commit `0a5a3b5c`, using the same selective layer-bumping quantization described above.

Model Summary: Granite-3.1-8B-Instruct is an 8B-parameter long-context instruct model finetuned from Granite-3.1-8B-Base using a combination of open-source instruction datasets with permissive licenses and internally collected synthetic datasets tailored for solving long-context problems. This model is developed using a diverse set of techniques with a structured chat format, including supervised finetuning, model alignment using reinforcement learning, and model merging.

- Developers: Granite Team, IBM
- GitHub Repository: ibm-granite/granite-3.1-language-models
- Website: Granite Docs
- Paper: Granite 3.1 Language Models (coming soon)
- Release Date: December 18th, 2024
- License: Apache 2.0

Supported Languages: English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese. Users may finetune Granite 3.1 models for languages beyond these 12 languages.

Intended Use: The model is designed to respond to general instructions and can be used to build AI assistants for multiple domains, including business applications.
Capabilities:
- Summarization
- Text classification
- Text extraction
- Question-answering
- Retrieval Augmented Generation (RAG)
- Code-related tasks
- Function-calling tasks
- Multilingual dialog use cases
- Long-context tasks including long document/meeting summarization, long document QA, etc.

Generation: This is a simple example of how to use the Granite-3.1-8B-Instruct model. Copy the snippet from the section that is relevant for your use case.

| Models | ARC-Challenge | Hellaswag | MMLU | TruthfulQA | Winogrande | GSM8K | Avg |
|---|---|---|---|---|---|---|---|
| Granite-3.1-8B-Instruct | 62.62 | 84.48 | 65.34 | 66.23 | 75.37 | 73.84 | 71.31 |
| Granite-3.1-2B-Instruct | 54.61 | 75.14 | 55.31 | 59.42 | 67.48 | 52.76 | 60.79 |
| Granite-3.1-3B-A800M-Instruct | 50.42 | 73.01 | 52.19 | 49.71 | 64.87 | 48.97 | 56.53 |
| Granite-3.1-1B-A400M-Instruct | 42.66 | 65.97 | 26.13 | 46.77 | 62.35 | 33.88 | 46.29 |

| Models | IFEval | BBH | MATH Lvl 5 | GPQA | MUSR | MMLU-Pro | Avg |
|---|---|---|---|---|---|---|---|
| Granite-3.1-8B-Instruct | 72.08 | 34.09 | 21.68 | 8.28 | 19.01 | 28.19 | 30.55 |
| Granite-3.1-2B-Instruct | 62.86 | 21.82 | 11.33 | 5.26 | 4.87 | 20.21 | 21.06 |
| Granite-3.1-3B-A800M-Instruct | 55.16 | 16.69 | 10.35 | 5.15 | 2.51 | 12.75 | 17.1 |
| Granite-3.1-1B-A400M-Instruct | 46.86 | 6.18 | 4.08 | 0 | 0.78 | 2.41 | 10.05 |

Model Architecture: Granite-3.1-8B-Instruct is based on a decoder-only dense transformer architecture. Core components of this architecture are: GQA and RoPE, MLP with SwiGLU, RMSNorm, and shared input/output embeddings. Training Data: Overall, our SFT data is largely comprised of three key sources: (1) publicly available datasets with permissive licenses, (2) internal synthetic data targeting specific capabilities including long-context tasks, and (3) very small amounts of human-curated data. A detailed attribution of datasets can be found in the Granite 3.0 Technical Report, Granite 3.1 Technical Report (coming soon), and Accompanying Author List. Infrastructure: We train Granite 3.1 Language Models using IBM's supercomputing cluster, Blue Vela, which is outfitted with NVIDIA H100 GPUs.
This cluster provides a scalable and efficient infrastructure for training our models over thousands of GPUs. Ethical Considerations and Limitations: Granite 3.1 Instruct models are primarily finetuned using instruction-response pairs, mostly in English, but also multilingual data covering eleven languages. Although this model can handle multilingual dialog use cases, its performance may not match that on English tasks. In such cases, introducing a small number of examples (few-shot) can help the model generate more accurate outputs. While this model has been aligned with safety in mind, it may in some cases produce inaccurate, biased, or unsafe responses to user prompts. We therefore urge the community to use this model with proper safety testing and tuning tailored to their specific tasks.

Resources
- ⭐️ Learn about the latest updates with Granite: https://www.ibm.com/granite
- 📄 Get started with tutorials, best practices, and prompt engineering advice: https://www.ibm.com/granite/docs/
- 💡 Learn about the latest Granite learning resources: https://ibm.biz/granite-learning-resources

Help me test my AI-Powered Quantum Network Monitor Assistant with quantum-ready security checks. The full open-source code for the Quantum Network Monitor Service is available in my GitHub repos (repos with NetworkMonitor in the name): Source Code Quantum Network Monitor. You will also find the code I use to quantize the models, if you want to do it yourself: GGUFModelBuilder

💬 How to test:
Choose an AI assistant type:
- `TurboLLM` (GPT-4.1-mini)
- `HugLLM` (Hugging Face open-source models)
- `TestLLM` (Experimental CPU-only)

What I'm Testing
I'm pushing the limits of small open-source models for AI network monitoring, specifically:
- Function calling against live network services
- How small can a model go while still handling:
  - Automated Nmap security scans
  - Quantum-readiness checks
  - Network monitoring tasks

🟡 TestLLM – Current experimental model (llama.cpp on 2 CPU threads in a Hugging Face Docker space):
- ✅ Zero-configuration setup
- ⏳ 30s load time (slow inference, but no API costs). No token limit, since the cost is low.
- 🔧 Help wanted! If you're into edge-device AI, let's collaborate!

Other Assistants
🟢 TurboLLM – Uses gpt-4.1-mini:
- It performs very well, but unfortunately OpenAI charges per token, so token usage is limited.
- Create custom cmd processors to run .NET code on Quantum Network Monitor Agents
- Real-time network diagnostics and monitoring
- Security audits
- Penetration testing (Nmap/Metasploit)

🔵 HugLLM – Latest open-source models:
- 🌐 Runs on the Hugging Face Inference API. Performs pretty well using the latest models hosted on Novita.

💡 Example commands you could test:
1. `"Give me info on my website's SSL certificate"`
2. `"Check if my server is using quantum-safe encryption for communication"`
3. `"Run a comprehensive security audit on my server"`
4. `"Create a cmd processor to .. (whatever you want)"`

Note: you need to install a Quantum Network Monitor Agent to run the .NET code. This is a very flexible and powerful feature. Use with caution!

I fund the servers used to create these model files, run the Quantum Network Monitor service, and pay for inference from Novita and OpenAI—all out of my own pocket. All the code behind the model creation and the Quantum Network Monitor project is open source. Feel free to use whatever you find helpful. If you appreciate the work, please consider buying me a coffee ☕. Your support helps cover service costs and allows me to raise token limits for everyone.
DeepCoder-1.5B-Preview-GGUF
OpenReasoning-Nemotron-14B-GGUF
PLM-1.8B-Instruct-GGUF
Eagle2-1B-GGUF
LFM2-1.2B-Extract-GGUF
This model was generated using llama.cpp at commit `c8dedc99`. Click here to get info on choosing the right GGUF model format

Based on LFM2-1.2B, LFM2-1.2B-Extract is designed to extract important information from a wide variety of unstructured documents (such as articles, transcripts, or reports) into structured outputs like JSON, XML, or YAML. Example use cases:
- Extracting invoice details from emails into structured JSON.
- Converting regulatory filings into XML for compliance systems.
- Transforming customer support tickets into YAML for analytics pipelines.
- Populating knowledge graphs with entities and attributes from unstructured reports.

You can find more information about other task-specific models in this blog post.

Generation parameters: We strongly recommend using greedy decoding with `temperature=0`.

System prompt: If no system prompt is provided, the model defaults to JSON output. We recommend providing a system prompt that names a specific format (JSON, XML, or YAML) and a schema to improve accuracy (see the following example).

Supported languages: English, Arabic, Chinese, French, German, Japanese, Korean, Portuguese, and Spanish.

Chat template: LFM2 uses a ChatML-like chat template. You can apply it automatically using the dedicated `.apply_chat_template()` function from Hugging Face transformers.

> [!WARNING]
> ⚠️ The model is intended for single-turn conversations.

The data used for training these models was primarily synthetic, which allowed us to ensure a diverse data mix. We used a range of document types, domains, styles, lengths, and languages. We also varied the density and distribution of relevant text in the documents. In some cases, the extracted information was clustered in one part of the document; in others, it was spread throughout. We applied the same approach of ensuring diversity when creating synthetic user requests and designing the structure of the model outputs.
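As an illustration of the system-prompt guidance above, here is a minimal sketch: the schema, prompt wording, and sample output are hypothetical, and the validation step simply checks that a reply parses as JSON with the expected keys.

```python
import json

# Hypothetical schema-bearing system prompt, per the card's recommendation
# to name a target format and supply a schema.
SYSTEM_PROMPT = (
    "You are an extraction assistant. Reply with JSON only, matching:\n"
    '{"invoice_id": string, "total": number, "currency": string}'
)

def validate_extraction(raw: str, required_keys: set) -> dict:
    """Parse a model reply and check that the schema keys are present."""
    data = json.loads(raw)  # raises a ValueError subclass on malformed JSON
    missing = required_keys - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data

# Hand-written stand-in for a model response:
sample = '{"invoice_id": "INV-001", "total": 41.5, "currency": "EUR"}'
record = validate_extraction(sample, {"invoice_id", "total", "currency"})
print(record["currency"])  # EUR
```

Pairing greedy decoding (`temperature=0`) with this kind of post-hoc validation makes malformed outputs easy to catch and retry.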
The data generation process underwent many iterations, incorporating ideas and feedback from across the Liquid AI team. We evaluated LFM2-Extract on a dataset of 5,000 documents, covering over 100 topics with a mix of writing styles, ambiguities, and formats. We used a combination of five metrics to capture a balanced view of syntax, accuracy, and faithfulness:
- Syntax score: Checks whether outputs parse cleanly as valid JSON, XML, or YAML.
- Format accuracy: Verifies that outputs match the requested format (e.g., JSON when JSON is requested).
- Keyword faithfulness: Measures whether values in the structured output actually appear in the input text.
- Absolute scoring: A judge LLM scores quality on a 1-5 scale, assessing completeness and correctness of extractions.
- Relative scoring: We ask a judge LLM to choose the best answer between the extraction model's output and the ground-truth answer.

LFM2-1.2B-Extract can output complex objects in different languages at a higher level than Gemma 3 27B, a model 22.5 times its size.
- Hugging Face: LFM2-1.2B
- llama.cpp: LFM2-1.2B-Extract-GGUF
- LEAP: LEAP model library

If you are interested in custom solutions with edge deployment, please contact our sales team.
gpt2-GGUF
openhands-lm-7b-v0.1-GGUF
Llama-Guard-3-8B-GGUF
AceReason-Nemotron-7B-GGUF
LFM2-350M-Extract-GGUF
This model was generated using llama.cpp at commit `c8dedc99`. Click here to get info on choosing the right GGUF model format

Based on LFM2-350M, LFM2-350M-Extract is designed to extract important information from a wide variety of unstructured documents (such as articles, transcripts, or reports) into structured outputs like JSON, XML, or YAML. Example use cases:
- Extracting invoice details from emails into structured JSON.
- Converting regulatory filings into XML for compliance systems.
- Transforming customer support tickets into YAML for analytics pipelines.
- Populating knowledge graphs with entities and attributes from unstructured reports.

You can find more information about other task-specific models in this blog post.

Generation parameters: We strongly recommend using greedy decoding with `temperature=0`.

System prompt: If no system prompt is provided, the model defaults to JSON output. We recommend providing a system prompt that names a specific format (JSON, XML, or YAML) and a schema to improve accuracy (see the following example).

Supported languages: English, Arabic, Chinese, French, German, Japanese, Korean, Portuguese, and Spanish.

Chat template: LFM2 uses a ChatML-like chat template. You can apply it automatically using the dedicated `.apply_chat_template()` function from Hugging Face transformers.

> [!WARNING]
> ⚠️ The model is intended for single-turn conversations.

The data used for training these models was primarily synthetic, which allowed us to ensure a diverse data mix. We used a range of document types, domains, styles, lengths, and languages. We also varied the density and distribution of relevant text in the documents. In some cases, the extracted information was clustered in one part of the document; in others, it was spread throughout. We applied the same approach of ensuring diversity when creating synthetic user requests and designing the structure of the model outputs.
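The first three evaluation metrics used for the Extract models (described just below) can be approximated with a small stdlib sketch, restricted here to JSON and XML so no third-party YAML parser is needed. Liquid AI's exact scoring is not published on this page, so treat this as an illustration only.

```python
import json
import xml.etree.ElementTree as ET

def syntax_ok(output: str, fmt: str) -> bool:
    """Syntax score: does the output parse cleanly in the given format?"""
    try:
        if fmt == "json":
            json.loads(output)
        elif fmt == "xml":
            ET.fromstring(output)
        else:
            return False
        return True
    except (ValueError, ET.ParseError):
        return False

def format_accuracy(output: str, requested: str) -> bool:
    """Format accuracy: the output parses in the *requested* format."""
    return syntax_ok(output, requested)

def keyword_faithfulness(output_values: list, source_text: str) -> float:
    """Fraction of extracted values that literally appear in the source text."""
    if not output_values:
        return 0.0
    hits = sum(str(v) in source_text for v in output_values)
    return hits / len(output_values)

doc = "Invoice INV-001 totals 41.5 EUR, due 2024-05-01."
out = '{"invoice_id": "INV-001", "total": "41.5"}'
print(syntax_ok(out, "json"))                          # True
print(format_accuracy(out, "xml"))                     # False
print(keyword_faithfulness(["INV-001", "41.5"], doc))  # 1.0
```

The remaining two metrics (absolute and relative scoring) require a judge LLM and are not sketched here.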
The data generation process underwent many iterations, incorporating ideas and feedback from across the Liquid AI team. We evaluated LFM2-Extract on a dataset of 5,000 documents, covering over 100 topics with a mix of writing styles, ambiguities, and formats. We used a combination of five metrics to capture a balanced view of syntax, accuracy, and faithfulness:
- Syntax score: Checks whether outputs parse cleanly as valid JSON, XML, or YAML.
- Format accuracy: Verifies that outputs match the requested format (e.g., JSON when JSON is requested).
- Keyword faithfulness: Measures whether values in the structured output actually appear in the input text.
- Absolute scoring: A judge LLM scores quality on a 1-5 scale, assessing completeness and correctness of extractions.
- Relative scoring: We ask a judge LLM to choose the best answer between the extraction model's output and the ground-truth answer.

LFM2-350M-Extract outperforms Gemma 3 4B, a model more than 11x its size, at this task.
- Hugging Face: LFM2-350M
- llama.cpp: LFM2-350M-Extract-GGUF
- LEAP: LEAP model library

If you are interested in custom solutions with edge deployment, please contact our sales team.
GneissWeb.7B_ablation_model_on_350B_FineWeb.seed1-GGUF
GneissWeb.7B_ablation_model_on_350B_FineWeb.seed1 GGUF Models

This model was generated using llama.cpp at commit `0a5a3b5c`. Click here to get info on choosing the right GGUF model format

This model is part of the GneissWeb ablations, detailed in this technical report. The model has 7 billion parameters and uses the Llama architecture. It is trained on a random subset of 350 billion English tokens from FineWeb V1.1.0, tokenized using the StarCoder tokenizer. This model is trained on 350B tokens of English FineWeb V1.1.0 data and is not instruction-tuned or safety-aligned. It is important to note that the primary intended use case for this model is to compare its performance with other models trained under similar conditions, with the goal of comparing pre-training datasets. These other models are mentioned here. This is a simple example of how to use the `GneissWeb.7B_ablation_model_on_350B_FineWeb.seed1` model. Then, copy the code snippet below to run the example. Evaluation Results: Please refer to section 5.3.2 of the GneissWeb paper. Infrastructure: Please refer to section 5.2 of the GneissWeb paper.
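The referenced snippet is not reproduced on this page; a minimal sketch of plain causal-LM generation with Hugging Face transformers might look like the following. The model path is a placeholder (substitute the actual checkpoint location), and the import is deferred so the sketch carries no hard dependency.

```python
def generate(prompt: str, model_path: str, max_new_tokens: int = 64) -> str:
    """Greedy generation sketch; model_path is a placeholder, not a confirmed hub ID."""
    # Deferred import: only needed when actually running the model.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    return tokenizer.decode(output[0], skip_special_tokens=True)

# Example call (requires the checkpoint to be available locally or on the Hub):
# print(generate("The capital of France is", "path/to/GneissWeb-7B-ablation-checkpoint"))
```

Note that this is a base (non-instruct) model, so plain-text continuation prompts like the one above are the appropriate usage pattern.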
Ethical Considerations and Limitations: The use of Large Language Models involves risks and ethical considerations people must be aware of, including but not limited to: bias and fairness, misinformation, and autonomous decision-making. `GneissWeb.7B_ablation_model_on_350B_FineWeb.seed1` is no exception in this regard. Even though this model is suited for multiple generative AI tasks, it has not undergone any safety alignment, so it may produce problematic outputs. Additionally, it remains uncertain whether smaller models might exhibit increased susceptibility to hallucination by copying text verbatim from the training dataset, due to their reduced size and memorization capacity. This aspect is currently an active area of research, and we anticipate more rigorous exploration, comprehension, and mitigation in this domain. Regarding ethics, a latent risk associated with all Large Language Models is their malicious utilization. We urge the community to use the `GneissWeb.7B_ablation_model_on_350B_FineWeb.seed1` model with ethical intentions and in a responsible way.
granite-8b-code-instruct-128k-GGUF
OCRFlux-3B-GGUF
This model was generated using llama.cpp at commit `caf5681f`. Click here to get info on choosing the right GGUF model format

This is a preview release of the OCRFlux-3B model, fine-tuned from Qwen2.5-VL-3B-Instruct using our private document datasets and some data from the olmOCR-mix-0225 dataset. The best way to use this model is via the OCRFlux toolkit. The toolkit comes with an efficient inference setup via vLLM that can handle millions of documents at scale. OCRFlux is licensed under the Apache 2.0 license. OCRFlux is intended for research and educational use.
Qwen3-0.6B-abliterated-GGUF
SmolVLM-Instruct-GGUF
granite-3.0-2b-base-GGUF
granite-3.1-8b-lora-intrinsics-v0.1-GGUF
watt-tool-70B-GGUF
Llama-3.1-8B-GGUF
Gemma-3-Gaia-PT-BR-4b-it-GGUF
WEBGEN-4B-Preview-GGUF
Magistral-Small-2509-GGUF
Qwen3-32B-GGUF
gemma-3-27b-it-qat-q4_0-GGUF
EXAONE-4.0-1.2B-GGUF
This model was generated using llama.cpp at commit `bf9087f5`. Click here to get info on choosing the right GGUF model format

🎉 License Updated! We are pleased to announce our more flexible licensing terms 🤗 ✈️ Try on FriendliAI

We introduce EXAONE 4.0, which integrates a Non-reasoning mode and a Reasoning mode to achieve both the excellent usability of EXAONE 3.5 and the advanced reasoning abilities of EXAONE Deep. To pave the way for the agentic AI era, EXAONE 4.0 incorporates essential features such as agentic tool use, and its multilingual capabilities are extended to support Spanish in addition to English and Korean. The EXAONE 4.0 model series consists of two sizes: a mid-size 32B model optimized for high performance, and a small-size 1.2B model designed for on-device applications. In the EXAONE 4.0 architecture, we apply new architectural changes compared to previous EXAONE models, as follows:
1. Hybrid Attention: For the 32B model, we adopt a hybrid attention scheme, which combines Local attention (sliding window attention) with Global attention (full attention) in a 3:1 ratio. We do not use RoPE (Rotary Positional Embedding) for global attention, for better global context understanding.
2. QK-Reorder-Norm: We reorder the LayerNorm position from the traditional Pre-LN scheme by applying LayerNorm directly to the attention and MLP outputs, and we add RMS normalization right after the Q and K projections. This helps yield better performance on downstream tasks despite consuming more computation.

For more details, please refer to our technical report, HuggingFace paper, blog, and GitHub.

- Number of Parameters (without embeddings): 1.07B
- Number of Layers: 30
- Number of Attention Heads: GQA with 32 heads and 8 KV heads
- Vocab Size: 102,400
- Context Length: 65,536 tokens

You should install the transformers library forked from the original, available in our PR. Once this PR is merged and released, we will update this section.
You can install the latest version of transformers with support for EXAONE 4.0 using the following command: For general use, you can use the EXAONE 4.0 models with the following example: The EXAONE 4.0 models have reasoning capabilities for handling complex problems. You can activate reasoning mode by using the `enable_thinking=True` argument with the tokenizer, which opens a reasoning block (the opening tag is emitted without its closing counterpart).

> [!IMPORTANT]
> Model generation in reasoning mode can be quite sensitive to sampling parameters, so please refer to the Usage Guideline for better quality.

The EXAONE 4.0 models can be used as agents thanks to their tool-calling capabilities. You can provide tool schemas to the model for effective tool calling. TensorRT-LLM officially supports EXAONE 4.0 models in the latest commits. Until that support is released, you need to clone the TensorRT-LLM repository and build it from source. Please refer to the official documentation for a guide to building the TensorRT-LLM environment. You can run the TensorRT-LLM server with the following steps: For more details, please refer to the EXAONE documentation in TensorRT-LLM.

> [!NOTE]
> Other inference engines, including `vllm` and `sglang`, do not officially support EXAONE 4.0 yet. We will update this as soon as these libraries are updated.

The following tables show the evaluation results of each model in reasoning and non-reasoning mode. The evaluation details can be found in the technical report.
- ✅ denotes that the model has hybrid reasoning capability, evaluated by selecting reasoning or non-reasoning mode depending on the purpose.
- To assess Korean practical and professional knowledge, we adopt both the KMMLU-Redux and KMMLU-Pro benchmarks. Both datasets are publicly released!
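The reasoning-mode toggle described above can be sketched as a small helper. It only arranges the `apply_chat_template` call; the `enable_thinking` keyword mirrors this card's wording, so confirm the exact argument name against the EXAONE documentation before relying on it.

```python
def build_chat_inputs(tokenizer, prompt: str, reasoning: bool = False):
    """Apply the chat template, toggling the EXAONE reasoning block."""
    messages = [{"role": "user", "content": prompt}]
    return tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        enable_thinking=reasoning,  # opens the reasoning block when True
        return_tensors="pt",
    )
```

Any tokenizer exposing `apply_chat_template` (as in Hugging Face transformers) can be passed in; the returned tensors then go straight to `model.generate`.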
Reasoning-mode comparison (EXAONE 4.0 32B vs. Phi 4 reasoning-plus, Magistral Small-2506, Qwen 3 32B, Qwen 3 235B, DeepSeek R1-0528): see the technical report for the full scores.

Non-reasoning mode:

| | EXAONE 4.0 32B | Phi 4 | Mistral-Small-2506 | Gemma 3 27B | Qwen3 32B | Qwen3 235B | Llama-4-Maverick | DeepSeek V3-0324 |
|---|---|---|---|---|---|---|---|---|
| Model Size | 32.0B | 14.7B | 24.0B | 27.4B | 32.8B | 235B | 402B | 671B |
| GPQA-Diamond | 63.7 | 56.1 | 46.1 | 42.4 | 54.6 | 62.9 | 69.8 | 68.4 |
| LiveCodeBench v5 | 43.3 | 24.6 | 25.8 | 27.5 | 31.3 | 35.3 | 43.4 | 46.7 |
| LiveCodeBench v6 | 43.1 | 27.4 | 26.9 | 29.7 | 28.0 | 31.4 | 32.7 | 44.0 |
| Multi-IF (EN) | 71.6 | 47.7 | 63.2 | 72.1 | 71.9 | 72.5 | 77.9 | 68.3 |
| Tau-Bench (Airline) | 25.5 | N/A | 36.1 | N/A | 16.0 | 27.0 | 38.0 | 40.5 |
| Tau-Bench (Retail) | 55.9 | N/A | 35.5 | N/A | 47.6 | 56.5 | 6.5 | 68.5 |
| KMMLU-Redux | 64.8 | 50.1 | 53.6 | 53.3 | 64.4 | 71.7 | 76.9 | 72.2 |
| MATH500 (ES) | 87.3 | 78.2 | 83.4 | 86.8 | 84.7 | 87.2 | 78.7 | 89.2 |
| WMT24++ (ES) | 90.7 | 89.3 | 92.2 | 93.1 | 91.4 | 92.9 | 92.7 | 94.3 |

Small-model comparisons (reasoning: EXAONE 4.0 1.2B, EXAONE Deep 2.4B, Qwen 3 0.6B, Qwen 3 1.7B, SmolLM3 3B; non-reasoning: EXAONE 4.0 1.2B, Qwen 3 0.6B, Gemma 3 1B, Qwen 3 1.7B, SmolLM3 3B): see the technical report for the scores.

> [!IMPORTANT]
> To achieve the expected performance, we recommend using the following configurations:
>
> - For non-reasoning mode, we recommend using a lower `temperature` value.
> - For reasoning mode (using the reasoning block), we recommend `temperature=0.6` and `top_p=0.95`.
> - If you suffer from model degeneration, we recommend `presence_penalty=1.5`.
> - For Korean general conversation with the 1.2B model, we suggest `temperature=0.1` to avoid code switching.

The EXAONE language model has certain limitations and may occasionally generate inappropriate responses. The language model generates responses based on token output probabilities, which are determined during learning from the training data. While we have made every effort to exclude personal, harmful, and biased information from the training data, some problematic content may still be included, potentially leading to undesirable responses. Please note that text generated by the EXAONE language model does not reflect the views of LG AI Research.
- Inappropriate answers may be generated, which contain personal, harmful or other inappropriate information.
- Biased responses may be generated, which are associated with age, gender, race, and so on.
- The generated responses rely heavily on statistics from the training data, which can result in semantically or syntactically incorrect sentences.
- Since the model does not reflect the latest information, responses may be false or contradictory.

LG AI Research strives to reduce the potential risks that may arise from EXAONE language models. Users may not engage in any malicious activities (e.g., keying in illegal information) that could induce the creation of inappropriate outputs violating LG AI's ethical principles when using EXAONE language models.

The model is licensed under the EXAONE AI Model License Agreement 1.2 - NC.

> [!NOTE]
> The main differences from the older version are as follows:
> - We removed the claim of model output ownership from the license.
> - We restrict use of the model for the development of models that compete with EXAONE.
> - We allow the model to be used for educational purposes, not just research.

LG AI Research Technical Support: [email protected]

Help me test my AI-Powered Quantum Network Monitor Assistant with quantum-ready security checks. The full open-source code for the Quantum Network Monitor Service is available in my GitHub repos (repos with NetworkMonitor in the name): Source Code Quantum Network Monitor.
You will also find the code I use to quantize the models, if you want to do it yourself, in GGUFModelBuilder.

💬 How to test:
Choose an AI assistant type:
- `TurboLLM` (GPT-4.1-mini)
- `HugLLM` (Hugging Face open-source models)
- `TestLLM` (Experimental CPU-only)

What I'm Testing
I'm pushing the limits of small open-source models for AI network monitoring, specifically:
- Function calling against live network services
- How small can a model go while still handling:
  - Automated Nmap security scans
  - Quantum-readiness checks
  - Network Monitoring tasks

🟡 TestLLM – Current experimental model (llama.cpp on 2 CPU threads on a Hugging Face Docker space):
- ✅ Zero-configuration setup
- ⏳ 30s load time (slow inference, but no API costs). No token limits, as the cost is low.
- 🔧 Help wanted! If you're into edge-device AI, let's collaborate!

Other Assistants
🟢 TurboLLM – Uses gpt-4.1-mini:
- It performs very well, but unfortunately OpenAI charges per token, so token usage is limited.
- Create custom cmd processors to run .NET code on Quantum Network Monitor Agents
- Real-time network diagnostics and monitoring
- Security audits
- Penetration testing (Nmap/Metasploit)

🔵 HugLLM – Latest open-source models:
- 🌐 Runs on the Hugging Face Inference API. Performs pretty well using the latest models hosted on Novita.

💡 Example commands you could test:
1. `"Give me info on my website's SSL certificate"`
2. `"Check if my server is using quantum-safe encryption for communication"`
3. `"Run a comprehensive security audit on my server"`
4. `"Create a cmd processor to .. (whatever you want)"` Note: you need to install a Quantum Network Monitor Agent to run the .NET code. This is a very flexible and powerful feature. Use with caution!

I fund the servers used to create these model files, run the Quantum Network Monitor service, and pay for inference from Novita and OpenAI—all out of my own pocket. All the code behind the model creation and the Quantum Network Monitor project is open source.
Feel free to use whatever you find helpful. If you appreciate the work, please consider buying me a coffee ☕. Your support helps cover service costs and allows me to raise token limits for everyone.
Qwen3-4B-GGUF
GLM-Z1-9B-0414-GGUF
granite-3.1-3b-a800m-base-GGUF
This model was generated using llama.cpp at commit `0a5a3b5c`. Click here to get info on choosing the right GGUF model format

Model Summary: Granite-3.1-3B-A800M-Base extends the context length of Granite-3.0-3B-A800M-Base from 4K to 128K using a progressive training strategy: the supported context length is increased in increments while adjusting RoPE theta, until the model has successfully adapted to the desired length of 128K. This long-context pre-training stage was performed using approximately 500B tokens.

- Developers: Granite Team, IBM
- GitHub Repository: ibm-granite/granite-3.1-language-models
- Website: Granite Docs
- Paper: Granite 3.1 Language Models (coming soon)
- Release Date: December 18th, 2024
- License: Apache 2.0

Supported Languages: English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese. Users may finetune Granite 3.1 models for languages beyond these 12 languages.

Intended Use: Prominent use cases of LLMs in text-to-text generation include summarization, text classification, extraction, question-answering, and more. All Granite Base models are able to handle these tasks, as they were trained on a large amount of data from various domains. Moreover, they can serve as a baseline for creating specialized models for specific application scenarios.

Generation: This is a simple example of how to use the Granite-3.1-3B-A800M-Base model. Then, copy the code snippet below to run the example.
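As background on the context extension described in the summary, the RoPE-theta adjustment can be illustrated numerically: raising the base theta stretches the rotary wavelengths, so positional phases remain distinguishable over longer ranges. This is an illustrative sketch (the head dimension and theta values are assumptions, not IBM's training recipe):

```python
import math

def rope_wavelengths(head_dim: int, theta: float):
    # RoPE pair i rotates with frequency theta^(-2i/head_dim);
    # its wavelength (positions per full rotation) is 2*pi / freq.
    return [2 * math.pi * theta ** (2 * i / head_dim) for i in range(head_dim // 2)]

short_wl = rope_wavelengths(64, 10_000.0)   # a typical short-context base
long_wl = rope_wavelengths(64, 500_000.0)   # a raised base for long-context adaptation

# Raising theta lengthens every wavelength except the first (i=0),
# so the slowest-rotating dimensions can span far more positions.
assert all(l >= s for s, l in zip(short_wl, long_wl))
print(f"slowest wavelength: {short_wl[-1]:.0f} -> {long_wl[-1]:.0f} positions")
```

Progressive training then lets the model adapt to each intermediate length before the next increase.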
| Models | ARC-Challenge | Hellaswag | MMLU | TruthfulQA | Winogrande | GSM8K | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Granite-3.1-8B-Base | 63.99 | 83.27 | 63.45 | 51.29 | 78.92 | 60.19 | 66.85 |
| Granite-3.1-2B-Base | 53.58 | 77.67 | 52.86 | 39.02 | 72.84 | 47.99 | 57.32 |
| Granite-3.1-3B-A800M-Base | 50.76 | 74.45 | 48.31 | 39.91 | 69.29 | 40.56 | 53.88 |
| Granite-3.1-1B-A400M-Base | 39.42 | 66.13 | 26.53 | 37.67 | 2.03 | 18.87 | 31.78 |

| Models | IFEval | BBH | MATH Lvl 5 | GPQA | MUSR | MMLU-Pro | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Granite-3.1-8B-Base | 42.21 | 26.02 | 9.52 | 9.51 | 8.36 | 24.8 | 20.07 |
| Granite-3.1-2B-Base | 35.22 | 16.84 | 5.59 | 3.69 | 3.9 | 13.9 | 13.19 |
| Granite-3.1-3B-A800M-Base | 29.96 | 11.91 | 4 | 3.69 | 1.11 | 8.81 | 9.91 |
| Granite-3.1-1B-A400M-Base | 25.19 | 6.43 | 2.19 | 0.22 | 1.76 | 1.55 | 6.22 |

Model Architecture: Granite-3.1-3B-A800M-Base is based on a decoder-only sparse Mixture of Experts (MoE) transformer architecture. Core components of this architecture are: Fine-grained Experts, Dropless Token Routing, and Load Balancing Loss.

Training Data: This model is trained on a mix of open-source and proprietary data following a three-stage training strategy.
- Stage 1 data: sourced from diverse domains, such as web, code, academic sources, books, and math data.
- Stage 2 data: a curated mix of high-quality data from the same domains, plus multilingual and instruction data. The goal of this second training phase is to enhance the model's performance on specific tasks.
- Stage 3 data: the original stage-2 pretraining data with additional synthetic long-context data in the form of QA/summary pairs, where the answer contains a recitation of the related paragraph before the answer.

A detailed attribution of datasets can be found in the Granite 3.0 Technical Report, Granite 3.1 Technical Report (coming soon), and Accompanying Author List.

Infrastructure: We train Granite 3.1 Language Models using IBM's supercomputing cluster, Blue Vela, which is outfitted with NVIDIA H100 GPUs. This cluster provides a scalable and efficient infrastructure for training our models over thousands of GPUs.
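Of the MoE components listed above, the load-balancing loss is the easiest to sketch. A common formulation (the Switch-Transformer-style auxiliary loss; an illustrative implementation, not necessarily the exact loss IBM used) multiplies, per expert, the fraction of tokens routed to it by its mean router probability:

```python
def load_balancing_loss(router_probs, expert_assignments, num_experts):
    """router_probs: per-token softmax over experts (list of lists);
    expert_assignments: index of the expert each token was routed to."""
    n = len(router_probs)
    loss = 0.0
    for e in range(num_experts):
        f_e = sum(1 for a in expert_assignments if a == e) / n  # routed fraction
        p_e = sum(p[e] for p in router_probs) / n               # mean router prob
        loss += f_e * p_e
    return num_experts * loss  # equals 1.0 under perfectly uniform routing

# Perfectly balanced routing over 4 experts gives a loss of exactly 1.0;
# any imbalance pushes the loss above 1, penalizing over-used experts.
probs = [[0.25] * 4 for _ in range(8)]
assign = [0, 1, 2, 3, 0, 1, 2, 3]
assert abs(load_balancing_loss(probs, assign, 4) - 1.0) < 1e-9
```

Minimizing this term during training nudges the router toward spreading tokens evenly across the fine-grained experts.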
Ethical Considerations and Limitations: The use of Large Language Models involves risks and ethical considerations people must be aware of, including but not limited to: bias and fairness, misinformation, and autonomous decision-making. The Granite-3.1-3B-A800M-Base model is no exception in this regard. Even though this model is suited for multiple generative AI tasks, it has not undergone any safety alignment; therefore, it may produce problematic outputs. Additionally, it remains uncertain whether smaller models might exhibit increased susceptibility to hallucination in generation scenarios by copying text verbatim from the training dataset, due to their reduced sizes and memorization capacities. This aspect is currently an active area of research, and we anticipate more rigorous exploration, comprehension, and mitigation in this domain. Regarding ethics, a latent risk associated with all Large Language Models is their malicious utilization. We urge the community to use the Granite-3.1-3B-A800M-Base model with ethical intentions and in a responsible way.

Resources
- ⭐️ Learn about the latest updates with Granite: https://www.ibm.com/granite
- 📄 Get started with tutorials, best practices, and prompt engineering advice: https://www.ibm.com/granite/docs/
- 💡 Learn about the latest Granite learning resources: https://ibm.biz/granite-learning-resources
granite-3.2-8b-instruct-preview-GGUF
This model was generated using llama.cpp at commit `5dd942de`. I've been experimenting with a new quantization approach that selectively elevates the precision of key layers beyond what the default IMatrix configuration provides. In my testing, standard IMatrix quantization underperforms at lower bit depths, especially with Mixture of Experts (MoE) models. To address this, I'm using the `--tensor-type` option in `llama.cpp` to manually "bump" important layers to higher precision. You can see the implementation here: 👉 Layer bumping with llama.cpp While this does increase model file size, it significantly improves precision for a given quantization level. I'd love your feedback—have you tried this? How does it perform for you? Click here to get info on choosing the right GGUF model format Model Summary: Granite-3.2-8B-Instruct-Preview is an early release of an 8B long-context model fine-tuned for enhanced reasoning (thinking) capabilities. Built on top of Granite-3.1-8B-Instruct, it has been trained using a mix of permissively licensed open-source datasets and internally generated synthetic data designed for reasoning tasks. The model allows controllability of its thinking capability, ensuring it is applied only when required. - Developers: Granite Team, IBM - Website: Granite Docs - Release Date: February 7th, 2025 - License: Apache 2.0 Supported Languages: English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese. However, users may finetune this Granite model for languages beyond these 12 languages. Intended Use: The model is designed to respond to general instructions and can be used to build AI assistants for multiple domains, including business applications. 
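The controllable thinking capability mentioned above is toggled at the prompt/template level. Since this card's snippet is not shown here, the helper below is a hypothetical, plain-Python illustration of such a toggle (the `thinking` flag and the injected instruction are assumptions for demonstration, not Granite's actual chat template):

```python
def build_chat(user_msg: str, thinking: bool = False):
    """Toy stand-in for a chat template with a thinking switch.

    When thinking is enabled, a reasoning instruction is prepended so the
    model reasons step by step; otherwise the request is passed through as-is.
    """
    messages = []
    if thinking:
        messages.append({"role": "system",
                         "content": "Reason step by step before answering."})
    messages.append({"role": "user", "content": user_msg})
    return messages

assert len(build_chat("Summarize this memo.")) == 1        # thinking off: no extra turn
assert build_chat("Prove it.", thinking=True)[0]["role"] == "system"
```

The point is that thinking is applied only when required, so simple requests avoid the latency of a reasoning trace.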
Capabilities
- Thinking
- Summarization
- Text classification
- Text extraction
- Question-answering
- Retrieval Augmented Generation (RAG)
- Code related tasks
- Function-calling tasks
- Multilingual dialog use cases
- Long-context tasks including long document/meeting summarization, long document QA, etc.

Generation: This is a simple example of how to use the Granite-3.2-8B-Instruct-Preview model. Then, copy the snippet from the section that is relevant for your use case.

| Models | ArenaHard | Alpaca-Eval-2 | MMLU | PopQA | TruthfulQA | BigBenchHard | DROP | GSM8K | HumanEval | HumanEval+ | IFEval | AttaQ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Llama-3.1-8B-Instruct | 36.43 | 27.22 | 69.15 | 28.79 | 52.79 | 72.66 | 61.48 | 83.24 | 85.32 | 80.15 | 79.10 | 83.43 |
| DeepSeek-R1-Distill-Llama-8B | 17.17 | 21.85 | 45.80 | 13.25 | 47.43 | 65.71 | 44.46 | 72.18 | 67.54 | 62.91 | 66.50 | 42.87 |
| Qwen-2.5-7B-Instruct | 25.44 | 30.34 | 74.30 | 18.12 | 63.06 | 70.40 | 54.71 | 84.46 | 93.35 | 89.91 | 74.90 | 81.90 |
| DeepSeek-R1-Distill-Qwen-7B | 10.36 | 15.35 | 50.72 | 9.94 | 47.14 | 65.04 | 42.76 | 78.47 | 79.89 | 78.43 | 59.10 | 42.45 |
| Granite-3.1-8B-Instruct | 37.58 | 27.87 | 66.84 | 28.84 | 65.92 | 68.10 | 50.78 | 79.08 | 88.82 | 84.62 | 71.20 | 85.73 |
| Granite-3.2-8B-Instruct-Preview | 55.23 | 61.16 | 66.93 | 28.08 | 66.37 | 65.60 | 50.73 | 83.09 | 89.47 | 86.88 | 73.57 | 85.99 |

Training Data: Overall, our training data is largely comprised of two key sources: (1) publicly available datasets with permissive licenses, and (2) internal synthetically generated data targeted to enhance reasoning capabilities.

Infrastructure: We train Granite-3.2-8B-Instruct-Preview using IBM's supercomputing cluster, Blue Vela, which is outfitted with NVIDIA H100 GPUs. This cluster provides a scalable and efficient infrastructure for training our models over thousands of GPUs.

Ethical Considerations and Limitations: Granite-3.2-8B-Instruct-Preview builds upon Granite-3.1-8B-Instruct, leveraging both permissively licensed open-source and select proprietary data for enhanced performance.
Since it inherits its foundation from the previous model, all ethical considerations and limitations applicable to Granite-3.1-8B-Instruct remain relevant.

Resources
- ⭐️ Learn about the latest updates with Granite: https://www.ibm.com/granite
- 📄 Get started with tutorials, best practices, and prompt engineering advice: https://www.ibm.com/granite/docs/
- 💡 Learn about the latest Granite learning resources: https://ibm.biz/granite-learning-resources
Qwen3-4B-Thinking-2507-GGUF
granite-3.0-1b-a400m-instruct-GGUF
granite-8b-code-base-128k-GGUF
RoboBrain2.0-7B-GGUF
olmOCR-7B-0225-preview-GGUF
orpheus-3b-0.1-ft-GGUF
granite-guardian-3.2-5b-GGUF
Qwen2.5-14B-Instruct-1M-GGUF
Phi-4-reasoning-GGUF
Llama-3_3-Nemotron-Super-49B-v1_5-GGUF
granite-3.3-8b-base-GGUF
granite-3b-code-instruct-128k-GGUF
This model was generated using llama.cpp at commit `0a5a3b5c`. Click here to get info on choosing the right GGUF model format

Model Summary
Granite-3B-Code-Instruct-128K is a 3B-parameter long-context instruct model fine-tuned from Granite-3B-Code-Base-128K on a combination of permissively licensed data used in training the original Granite code instruct models, together with synthetically generated code instruction datasets tailored for solving long-context problems. By exposing the model to both short and long context data, we aim to enhance its long-context capability without sacrificing code generation performance at short input context.

- Developers: IBM Research
- GitHub Repository: ibm-granite/granite-code-models
- Paper: Scaling Granite Code Models to 128K Context
- Release Date: July 18th, 2024
- License: Apache 2.0

Usage
Intended use: The model is designed to respond to coding-related instructions over long-context input up to 128K in length and can be used to build coding assistants.

Generation: This is a simple example of how to use the Granite-3B-Code-Instruct model.

Training Data
Granite Code Instruct models are trained on a mix of short and long context data as follows.
Short-Context Instruction Data: CommitPackFT, BigCode-SC2-Instruct, MathInstruct, MetaMathQA, Glaive-Code-Assistant-v3, Glaive-Function-Calling-v2, NL2SQL11, HelpSteer, and OpenPlatypus, including a synthetically generated dataset for API calling and multi-turn code interactions with execution feedback. We also include a collection of hardcoded prompts to ensure our model generates correct outputs given inquiries about its name or developers.

Long-Context Instruction Data: A synthetically generated dataset built by bootstrapping repository-level file-packed documents through Granite-8b-Code-Instruct to improve the long-context capability of the model.

Infrastructure: We train the Granite Code models using two of IBM's supercomputing clusters, Vela and Blue Vela, outfitted with NVIDIA A100 and H100 GPUs, respectively. These clusters provide a scalable and efficient infrastructure for training our models over thousands of GPUs.

Ethical Considerations and Limitations: Granite code instruct models are primarily finetuned using instruction-response pairs across a specific set of programming languages. Thus, their performance may be limited with out-of-domain programming languages. In this situation, it is beneficial to provide few-shot examples to steer the model's output. Moreover, developers should perform safety testing and target-specific tuning before deploying these models in critical applications. The model also inherits ethical considerations and limitations from its base model. For more information, please refer to the Granite-3B-Code-Base-128K model card.
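As noted above, few-shot examples help steer the model on out-of-domain programming languages. A small prompt-builder sketch (the Question/Answer layout is illustrative, not a format the model requires):

```python
def few_shot_prompt(examples, query):
    """Assemble a few-shot prompt from (instruction, solution) pairs."""
    parts = []
    for instruction, solution in examples:
        parts.append(f"Question:\n{instruction}\n\nAnswer:\n{solution}\n")
    parts.append(f"Question:\n{query}\n\nAnswer:\n")
    return "\n".join(parts)

# One worked example in the target language anchors style and syntax.
examples = [
    ("Write a Zig function that adds two i32 values.",
     "fn add(a: i32, b: i32) i32 {\n    return a + b;\n}"),
]
prompt = few_shot_prompt(examples, "Write a Zig function that doubles an i32.")
assert prompt.count("Question:") == 2 and prompt.rstrip().endswith("Answer:")
```

The trailing open "Answer:" leaves the model to complete the final solution in the demonstrated style.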
SmolLM3-3B-Base-GGUF
TriLM_99M_Unpacked-GGUF
OpenMath-Nemotron-7B-GGUF
TriLM_560M_Unpacked-GGUF
sychonix-GGUF
AceMath-1.5B-Instruct-GGUF
This model was generated using llama.cpp at commit `b9c3eefd`. Click here to get info on choosing the right GGUF model format

Introduction
We introduce AceMath, a family of frontier models designed for mathematical reasoning. The models in the AceMath family, including AceMath-1.5B/7B/72B-Instruct and AceMath-7B/72B-RM, are improved using Qwen. The AceMath-1.5B/7B/72B-Instruct models excel at solving English mathematical problems using Chain-of-Thought (CoT) reasoning, while the AceMath-7B/72B-RM models, as outcome reward models, specialize in evaluating and scoring mathematical solutions.

The AceMath-1.5B/7B/72B-Instruct models are developed from the Qwen2.5-Math-1.5B/7B/72B-Base models, leveraging a multi-stage supervised fine-tuning (SFT) process: first with general-purpose SFT data, followed by math-specific SFT data. We are releasing all training data to support further research in this field. We recommend using the AceMath models only for solving math problems. To support other tasks, we also release AceInstruct-1.5B/7B/72B, a series of general-purpose SFT models designed to handle code, math, and general knowledge tasks. These models are built upon the Qwen2.5-1.5B/7B/72B-Base. For more information about AceMath, check our website and paper.

All Resources
- AceMath Instruction Models: AceMath-1.5B-Instruct, AceMath-7B-Instruct, AceMath-72B-Instruct
- AceMath Reward Models: AceMath-7B-RM, AceMath-72B-RM
- Evaluation & Training Data: AceMath-RewardBench, AceMath-Instruct Training Data, AceMath-RM Training Data
- General Instruction Models: AceInstruct-1.5B, AceInstruct-7B, AceInstruct-72B

Benchmark Results (AceMath-Instruct + AceMath-72B-RM)
We compare AceMath to leading proprietary and open-access math models in the table above. Our AceMath-7B-Instruct largely outperforms the previous best-in-class Qwen2.5-Math-7B-Instruct (average pass@1: 67.2 vs.
62.9) on a variety of math reasoning benchmarks, while coming close to the performance of the 10× larger Qwen2.5-Math-72B-Instruct (67.2 vs. 68.2). Notably, our AceMath-72B-Instruct outperforms the state-of-the-art Qwen2.5-Math-72B-Instruct (71.8 vs. 68.2), GPT-4o (67.4), and Claude 3.5 Sonnet (65.6) by a clear margin. We also report the rm@8 accuracy (best of 8) achieved by our reward model, AceMath-72B-RM, which sets a new record on these reasoning benchmarks. This excludes OpenAI's o1 model, which relies on scaled inference computation.

Correspondence to: Zihan Liu ([email protected]), Yang Chen ([email protected]), Wei Ping ([email protected])

Citation
If you find our work helpful, we'd appreciate it if you could cite us.

@article{acemath2024,
  title={AceMath: Advancing Frontier Math Reasoning with Post-Training and Reward Modeling},
  author={Liu, Zihan and Chen, Yang and Shoeybi, Mohammad and Catanzaro, Bryan and Ping, Wei},
  journal={arXiv preprint},
  year={2024}
}

License
All models in the AceMath family are for non-commercial use only, subject to the Terms of Use of the data generated by OpenAI. We put the AceMath models under the Creative Commons Attribution-NonCommercial 4.0 International license.
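The rm@8 metric mentioned above selects, from 8 sampled solutions, the one the reward model scores highest. A minimal best-of-n sketch (the scoring function here is a toy stand-in, not AceMath-72B-RM):

```python
def best_of_n(candidates, reward_fn):
    """Return the candidate solution with the highest reward-model score."""
    return max(candidates, key=reward_fn)

# Toy stand-in reward: prefer solutions that end with a boxed answer.
def toy_reward(solution: str) -> float:
    return 1.0 if "\\boxed{" in solution else 0.0

samples = ["x = 4", "The answer is \\boxed{4}.", "maybe 5?"]
assert best_of_n(samples, toy_reward) == "The answer is \\boxed{4}."
```

With a real outcome reward model, `reward_fn` would score each full solution and rm@8 would count the selected solution as correct or not.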
Absolute_Zero_Reasoner-Coder-14b-GGUF
Qwen3-30B-A3B-Instruct-2507-GGUF
This model was generated using llama.cpp at commit `cd6983d5`. Click here to get info on choosing the right GGUF model format

We introduce the updated version of the Qwen3-30B-A3B non-thinking mode, named Qwen3-30B-A3B-Instruct-2507, featuring the following key enhancements:
- Significant improvements in general capabilities, including instruction following, logical reasoning, text comprehension, mathematics, science, coding and tool usage.
- Substantial gains in long-tail knowledge coverage across multiple languages.
- Markedly better alignment with user preferences in subjective and open-ended tasks, enabling more helpful responses and higher-quality text generation.
- Enhanced capabilities in 256K long-context understanding.

Qwen3-30B-A3B-Instruct-2507 has the following features:
- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Number of Parameters: 30.5B in total and 3.3B activated
- Number of Parameters (Non-Embedding): 29.9B
- Number of Layers: 48
- Number of Attention Heads (GQA): 32 for Q and 4 for KV
- Number of Experts: 128
- Number of Activated Experts: 8
- Context Length: 262,144 natively

NOTE: This model supports only non-thinking mode and does not generate `<think></think>` blocks in its output. Meanwhile, specifying `enable_thinking=False` is no longer required.
For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog, GitHub, and Documentation.

| | Deepseek-V3-0324 | GPT-4o-0327 | Gemini-2.5-Flash Non-Thinking | Qwen3-235B-A22B Non-Thinking | Qwen3-30B-A3B Non-Thinking | Qwen3-30B-A3B-Instruct-2507 |
| --- | --- | --- | --- | --- | --- | --- |
| Knowledge | | | | | | |
| MMLU-Pro | 81.2 | 79.8 | 81.1 | 75.2 | 69.1 | 78.4 |
| MMLU-Redux | 90.4 | 91.3 | 90.6 | 89.2 | 84.1 | 89.3 |
| GPQA | 68.4 | 66.9 | 78.3 | 62.9 | 54.8 | 70.4 |
| SuperGPQA | 57.3 | 51.0 | 54.6 | 48.2 | 42.2 | 53.4 |
| Reasoning | | | | | | |
| AIME25 | 46.6 | 26.7 | 61.6 | 24.7 | 21.6 | 61.3 |
| HMMT25 | 27.5 | 7.9 | 45.8 | 10.0 | 12.0 | 43.0 |
| ZebraLogic | 83.4 | 52.6 | 57.9 | 37.7 | 33.2 | 90.0 |
| LiveBench 20241125 | 66.9 | 63.7 | 69.1 | 62.5 | 59.4 | 69.0 |
| Coding | | | | | | |
| LiveCodeBench v6 (25.02-25.05) | 45.2 | 35.8 | 40.1 | 32.9 | 29.0 | 43.2 |
| MultiPL-E | 82.2 | 82.7 | 77.7 | 79.3 | 74.6 | 83.8 |
| Aider-Polyglot | 55.1 | 45.3 | 44.0 | 59.6 | 24.4 | 35.6 |
| Alignment | | | | | | |
| IFEval | 82.3 | 83.9 | 84.3 | 83.2 | 83.7 | 84.7 |
| Arena-Hard v2* | 45.6 | 61.9 | 58.3 | 52.0 | 24.8 | 69.0 |
| Creative Writing v3 | 81.6 | 84.9 | 84.6 | 80.4 | 68.1 | 86.0 |
| WritingBench | 74.5 | 75.5 | 80.5 | 77.0 | 72.2 | 85.5 |
| Agent | | | | | | |
| BFCL-v3 | 64.7 | 66.5 | 66.1 | 68.0 | 58.6 | 65.1 |
| TAU1-Retail | 49.6 | 60.3# | 65.2 | 65.2 | 38.3 | 59.1 |
| TAU1-Airline | 32.0 | 42.8# | 48.0 | 32.0 | 18.0 | 40.0 |
| TAU2-Retail | 71.1 | 66.7# | 64.3 | 64.9 | 31.6 | 57.0 |
| TAU2-Airline | 36.0 | 42.0# | 42.5 | 36.0 | 18.0 | 38.0 |
| TAU2-Telecom | 34.0 | 29.8# | 16.9 | 24.6 | 18.4 | 12.3 |
| Multilingualism | | | | | | |
| MultiIF | 66.5 | 70.4 | 69.4 | 70.2 | 70.8 | 67.9 |
| MMLU-ProX | 75.8 | 76.2 | 78.3 | 73.2 | 65.1 | 72.0 |
| INCLUDE | 80.1 | 82.1 | 83.8 | 75.6 | 67.8 | 71.9 |
| PolyMATH | 32.2 | 25.5 | 41.9 | 27.0 | 23.3 | 43.1 |

\*: For reproducibility, we report the win
rates evaluated by GPT-4.1. \#: Results were generated using GPT-4o-20241120, as access to the native function calling API of GPT-4o-0327 was unavailable.

The code for Qwen3-MoE has been merged into the latest Hugging Face `transformers`, and we advise you to use the latest version of `transformers`. With `sglang>=0.4.6.post1` or `vllm>=0.8.5`, you can create an OpenAI-compatible API endpoint. Note: If you encounter out-of-memory (OOM) issues, consider reducing the context length to a shorter value, such as `32768`. For local use, applications such as Ollama, LM Studio, MLX-LM, llama.cpp, and KTransformers also support Qwen3.

Qwen3 excels in tool-calling capabilities. We recommend using Qwen-Agent to make the best use of the agentic ability of Qwen3. Qwen-Agent encapsulates tool-calling templates and tool-calling parsers internally, greatly reducing coding complexity. To define the available tools, you can use an MCP configuration file, use the integrated tools of Qwen-Agent, or integrate other tools yourself.

To support ultra-long context processing (up to 1 million tokens), we integrate two key techniques:
- Dual Chunk Attention (DCA): a length extrapolation method that splits long sequences into manageable chunks while preserving global coherence.
- MInference: a sparse attention mechanism that reduces computational overhead by focusing on critical token interactions.

Together, these innovations significantly improve both generation quality and inference efficiency for sequences beyond 256K tokens. On sequences approaching 1M tokens, the system achieves up to a 3× speedup compared to standard attention implementations. For full technical details, see the Qwen2.5-1M Technical Report.

> [!NOTE]
> To effectively process a 1 million token context, users will require approximately 240 GB of total GPU memory. This accounts for model weights, KV-cache storage, and peak activation memory demands.
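The serving flags described in this card map naturally onto vLLM's Python engine arguments. As a sketch only (the exact kwargs and the `Qwen/Qwen3-30B-A3B-Instruct-2507` checkpoint id are assumptions to verify against your installed vLLM version), the 1M-token configuration could be collected like this:

```python
import os

# Per the long-context recipe in this card: the dual-chunk attention kernel
# is selected via an environment variable, not an engine argument.
os.environ["VLLM_ATTENTION_BACKEND"] = "DUAL_CHUNK_FLASH_ATTN"

def long_context_args(model: str) -> dict:
    """Gather the ~1M-token serving flags from this card as vLLM LLM() kwargs."""
    return dict(
        model=model,
        max_model_len=1_010_000,        # ~1M tokens of context
        enable_chunked_prefill=True,    # chunked prefill avoids OOM on long inputs
        max_num_batched_tokens=131_072, # prefill batch size: throughput vs. memory
        enforce_eager=True,             # CUDA graphs are disabled for dual chunk attention
        max_num_seqs=1,                 # one sequence at a time at this length
        gpu_memory_utilization=0.85,    # fraction of GPU memory for the model executor
        tensor_parallel_size=4,         # shard across 4 GPUs (illustrative)
    )

args = long_context_args("Qwen/Qwen3-30B-A3B-Instruct-2507")

if __name__ == "__main__":
    from vllm import LLM  # requires a vLLM build with dual chunk attention support
    llm = LLM(**args)
```

The same values can be passed on the `vllm serve` command line instead; the dictionary form just makes the memory/throughput trade-offs explicit in one place.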
Download the model and replace the content of your `config.json` with `config_1m.json`, which includes the config for length extrapolation and sparse attention. After updating the config, serve the model with either vLLM or SGLang, launching the server with Dual Chunk Flash Attention enabled.

vLLM:

| Parameter | Purpose |
|---|---|
| `VLLM_ATTENTION_BACKEND=DUAL_CHUNK_FLASH_ATTN` | Enables the custom attention kernel for long-context efficiency |
| `--max-model-len 1010000` | Sets maximum context length to ~1M tokens |
| `--enable-chunked-prefill` | Allows chunked prefill for very long inputs (avoids OOM) |
| `--max-num-batched-tokens 131072` | Controls batch size during prefill; balances throughput and memory |
| `--enforce-eager` | Disables CUDA graph capture (required for dual chunk attention) |
| `--max-num-seqs 1` | Limits concurrent sequences due to extreme memory usage |
| `--gpu-memory-utilization 0.85` | Sets the fraction of GPU memory to be used for the model executor |

SGLang:

| Parameter | Purpose |
|---|---|
| `--attention-backend dual_chunk_flash_attn` | Activates Dual Chunk Flash Attention |
| `--context-length 1010000` | Defines max input length |
| `--mem-frac 0.75` | The fraction of memory used for static allocation (model weights and KV-cache memory pool). Use a smaller value if you see out-of-memory errors. |
| `--tp 4` | Tensor parallelism size (matches model sharding) |
| `--chunked-prefill-size 131072` | Prefill chunk size for handling long inputs without OOM |

1. Encountering the error: "The model's max sequence length (xxxxx) is larger than the maximum number of tokens that can be stored in the KV cache." or "RuntimeError: Not enough memory. Please try to increase `--mem-fraction-static`." The VRAM reserved for the KV cache is insufficient.
- vLLM: Consider reducing `max_model_len` or increasing `tensor_parallel_size` and `gpu_memory_utilization`.
Alternatively, you can reduce `max_num_batched_tokens`, although this may significantly slow down inference.
- SGLang: Consider reducing `context-length` or increasing `tp` and `mem-frac`. Alternatively, you can reduce `chunked-prefill-size`, although this may significantly slow down inference.

2. Encountering the error: "torch.OutOfMemoryError: CUDA out of memory." The VRAM reserved for activation weights is insufficient. You can try lowering `gpu_memory_utilization` or `mem-frac`, but be aware that this might reduce the VRAM available for the KV cache.

3. Encountering the error: "Input prompt (xxxxx tokens) + lookahead slots (0) is too long and exceeds the capacity of the block manager." or "The input (xxxx tokens) is longer than the model's context length (xxx tokens)." The input is too long. Consider using a shorter sequence or increasing `max_model_len` or `context-length`.

We test the model on a 1M-token version of the RULER benchmark.

| Model Name | Acc avg | 4k | 8k | 16k | 32k | 64k | 96k | 128k | 192k | 256k | 384k | 512k | 640k | 768k | 896k | 1000k |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3-30B-A3B (Non-Thinking) | 72.0 | 97.1 | 96.1 | 95.0 | 92.2 | 82.6 | 79.7 | 76.9 | 70.2 | 66.3 | 61.9 | 55.4 | 52.6 | 51.5 | 52.0 | 50.9 |
| Qwen3-30B-A3B-Instruct-2507 (Full Attention) | 86.8 | 98.0 | 96.7 | 96.9 | 97.2 | 93.4 | 91.0 | 89.1 | 89.8 | 82.5 | 83.6 | 78.4 | 79.7 | 77.6 | 75.7 | 72.8 |
| Qwen3-30B-A3B-Instruct-2507 (Sparse Attention) | 86.8 | 98.0 | 97.1 | 96.3 | 95.1 | 93.6 | 92.5 | 88.1 | 87.7 | 82.9 | 85.7 | 80.7 | 80.0 | 76.9 | 75.5 | 72.2 |

All models are evaluated with Dual Chunk Attention enabled. Since the evaluation is time-consuming, we use 260 samples for each length (13 sub-tasks, 20 samples each). To achieve optimal performance, we recommend the following settings: 1.
Sampling Parameters:
- We suggest using `Temperature=0.7`, `TopP=0.8`, `TopK=20`, and `MinP=0`.
- For supported frameworks, you can adjust the `presence_penalty` parameter between 0 and 2 to reduce endless repetitions. However, using a higher value may occasionally result in language mixing and a slight decrease in model performance.

2. Adequate Output Length: We recommend an output length of 16,384 tokens for most queries, which is adequate for instruct models.

3. Standardize Output Format: We recommend using prompts to standardize model outputs when benchmarking.
- Math Problems: Include "Please reason step by step, and put your final answer within \boxed{}." in the prompt.
- Multiple-Choice Questions: Add the following JSON structure to the prompt to standardize responses: "Please show your choice in the `answer` field with only the choice letter, e.g., `"answer": "C"`."

If you find our work helpful, feel free to give us a cite.

Help me test my AI-Powered Quantum Network Monitor Assistant with quantum-ready security checks. The full open-source code for the Quantum Network Monitor Service is available at my GitHub repos (repos with NetworkMonitor in the name): Source Code Quantum Network Monitor. You will also find the code I use to quantize the models in GGUFModelBuilder.

💬 How to test: Choose an AI assistant type:
- `TurboLLM` (GPT-4.1-mini)
- `HugLLM` (Hugging Face open-source models)
- `TestLLM` (Experimental CPU-only)

What I'm Testing: I'm pushing the limits of small open-source models for AI network monitoring, specifically:
- Function calling against live network services
- How small can a model go while still handling:
  - Automated Nmap security scans
  - Quantum-readiness checks
  - Network Monitoring tasks

🟡 TestLLM – Current experimental model (llama.cpp on 2 CPU threads on a Hugging Face Docker space):
- ✅ Zero-configuration setup
- ⏳ 30s load time (slow inference but no API costs). No token limit, as the cost is low.
- 🔧 Help wanted! If you're into edge-device AI, let's collaborate!

Other Assistants

🟢 TurboLLM – Uses GPT-4.1-mini:
- It performs very well, but unfortunately OpenAI charges per token, so token usage is limited.
- Create custom cmd processors to run .NET code on Quantum Network Monitor Agents
- Real-time network diagnostics and monitoring
- Security audits
- Penetration testing (Nmap/Metasploit)

🔵 HugLLM – Latest open-source models:
- 🌐 Runs on the Hugging Face Inference API. Performs pretty well using the latest models hosted on Novita.

💡 Example commands you could test:
1. `"Give me info on my website's SSL certificate"`
2. `"Check if my server is using quantum-safe encryption for communication"`
3. `"Run a comprehensive security audit on my server"`
4. `"Create a cmd processor to .. (whatever you want)"`

Note: you need to install a Quantum Network Monitor Agent to run the .NET code on. This is a very flexible and powerful feature. Use with caution!

I fund the servers used to create these model files, run the Quantum Network Monitor service, and pay for inference from Novita and OpenAI, all out of my own pocket. All the code behind the model creation and the Quantum Network Monitor project is open source. Feel free to use whatever you find helpful. If you appreciate the work, please consider buying me a coffee ☕. Your support helps cover service costs and allows me to raise token limits for everyone.
granite-3.0-2b-instruct-GGUF
InternVL3-8B-GGUF
granite-20b-code-instruct-8k-GGUF
This model was generated using llama.cpp at commit `0a5a3b5c`. Click here to get info on choosing the right GGUF model format

Model Summary

Granite-20B-Code-Instruct-8K is a 20B-parameter model fine-tuned from Granite-20B-Code-Base-8K on a combination of permissively licensed instruction data to enhance instruction-following capabilities, including logical reasoning and problem-solving skills.
- Developers: IBM Research
- GitHub Repository: ibm-granite/granite-code-models
- Paper: Granite Code Models: A Family of Open Foundation Models for Code Intelligence
- Release Date: May 6th, 2024
- License: Apache 2.0

Usage

Intended use: The model is designed to respond to coding-related instructions and can be used to build coding assistants.

Generation: This is a simple example of how to use the Granite-20B-Code-Instruct-8K model.

Training Data

Granite Code Instruct models are trained on the following types of data.

Code Commits Datasets: we sourced code commit data from the CommitPackFT dataset, a filtered version of the full CommitPack dataset. From the CommitPackFT dataset, we only consider data for 92 programming languages. Our inclusion criteria boil down to selecting programming languages common across CommitPackFT and the 116 languages that we considered to pretrain the code base model (Granite-20B-Code-Base).
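The generation example referenced in the Usage section above was not carried over into this card. A minimal sketch of what it might look like with `transformers` (the prompt and generation settings here are illustrative assumptions, not the card's original example):

```python
# Assumed real HF repo id for the base (non-GGUF) checkpoint.
MODEL_ID = "ibm-granite/granite-20b-code-instruct-8k"

def build_chat(tokenizer, user_msg: str) -> str:
    """Render a single-turn chat using the model's own chat template."""
    return tokenizer.apply_chat_template(
        [{"role": "user", "content": user_msg}],
        tokenize=False,
        add_generation_prompt=True,
    )

if __name__ == "__main__":
    # Heavy imports and the 20B download happen only when run directly.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, device_map="auto", torch_dtype=torch.bfloat16
    )
    prompt = build_chat(tokenizer, "Write a Python function that computes the GCD of two integers.")
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=256)
    new_tokens = output[0][inputs["input_ids"].shape[1]:]
    print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```

For the GGUF files in this repo, the equivalent would be a `llama.cpp` invocation rather than `transformers`.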
Math Datasets: We consider two high-quality math datasets, MathInstruct and MetaMathQA. Due to license issues, we filtered out GSM8K-RFT and Camel-Math from the MathInstruct dataset.

Code Instruction Datasets: We use Glaive-Code-Assistant-v3, Glaive-Function-Calling-v2, NL2SQL11, and a small collection of synthetic API-calling datasets.

Language Instruction Datasets: We include high-quality datasets such as HelpSteer and an open-license-filtered version of Platypus. We also include a collection of hardcoded prompts to ensure our model generates correct outputs given inquiries about its name or developers.

Infrastructure

We train the Granite Code models using two of IBM's supercomputing clusters, Vela and Blue Vela, outfitted with NVIDIA A100 and H100 GPUs, respectively. These clusters provide a scalable and efficient infrastructure for training our models over thousands of GPUs.

Ethical Considerations and Limitations

Granite Code Instruct models are primarily fine-tuned using instruction-response pairs across a specific set of programming languages. Thus, their performance may be limited with out-of-domain programming languages. In this situation, it is beneficial to provide few-shot examples to steer the model's output. Moreover, developers should perform safety testing and target-specific tuning before deploying these models in critical applications. The model also inherits ethical considerations and limitations from its base model. For more information, please refer to the Granite-20B-Code-Base-8K model card.
Llama-3.1-Nemotron-8B-UltraLong-4M-Instruct-GGUF
OpenReasoning-Nemotron-32B-GGUF
rwkv7-0.4B-world-GGUF
granite-guardian-3.1-8b-GGUF
InternVL3-1B-GGUF
granite-20b-code-base-8k-GGUF
OpenMath-Nemotron-14B-Kaggle-GGUF
GLM-4-32B-0414-GGUF
SmolLM2-1.7B-Instruct-GGUF
Refact-1_6B-fim-GGUF
SmolLM2-360M-Instruct-GGUF
ERNIE-4.5-21B-A3B-PT-GGUF
AceMath-7B-Instruct-GGUF
This model was generated using llama.cpp at commit `e743cddb`. Click here to get info on choosing the right GGUF model format

Introduction

We introduce AceMath, a family of frontier models designed for mathematical reasoning. The models in the AceMath family, including AceMath-1.5B/7B/72B-Instruct and AceMath-7B/72B-RM, are improved using Qwen. The AceMath-1.5B/7B/72B-Instruct models excel at solving English mathematical problems using Chain-of-Thought (CoT) reasoning, while the AceMath-7B/72B-RM models, as outcome reward models, specialize in evaluating and scoring mathematical solutions.

The AceMath-1.5B/7B/72B-Instruct models are developed from the Qwen2.5-Math-1.5B/7B/72B-Base models, leveraging a multi-stage supervised fine-tuning (SFT) process: first with general-purpose SFT data, followed by math-specific SFT data. We are releasing all training data to support further research in this field.

We recommend using the AceMath models only for solving math problems. To support other tasks, we also release AceInstruct-1.5B/7B/72B, a series of general-purpose SFT models designed to handle code, math, and general-knowledge tasks. These models are built upon the Qwen2.5-1.5B/7B/72B-Base models. For more information about AceMath, check our website and paper.
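Since the Instruct models emit Chain-of-Thought solutions with the final answer in `\boxed{...}`, a small helper for pulling that answer out can be useful when scoring outputs. This is a hypothetical sketch: the helper name, prompt, and pipeline call are illustrative assumptions, not part of the AceMath release.

```python
import re
from typing import Optional

def extract_boxed(solution: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in a CoT solution, if any."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", solution)
    return matches[-1] if matches else None

if __name__ == "__main__":
    # Inference itself needs a GPU-sized setup; assumed repo id shown.
    from transformers import pipeline
    generator = pipeline("text-generation", model="nvidia/AceMath-7B-Instruct")
    prompt = ("What is 12 * 7? Please reason step by step, "
              "and put your final answer within \\boxed{}.")
    text = generator(prompt, max_new_tokens=512)[0]["generated_text"]
    print(extract_boxed(text))
```

Taking the last `\boxed{...}` rather than the first matters when a solution restates intermediate boxed results before the final answer.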
All Resources

AceMath Instruction Models - AceMath-1.5B-Instruct, AceMath-7B-Instruct, AceMath-72B-Instruct
AceMath Reward Models - AceMath-7B-RM, AceMath-72B-RM
Evaluation & Training Data - AceMath-RewardBench, AceMath-Instruct Training Data, AceMath-RM Training Data
General Instruction Models - AceInstruct-1.5B, AceInstruct-7B, AceInstruct-72B

Benchmark Results (AceMath-Instruct + AceMath-72B-RM)

We compare AceMath to leading proprietary and open-access math models in the table above. Our AceMath-7B-Instruct largely outperforms the previous best-in-class Qwen2.5-Math-7B-Instruct (average pass@1: 67.2 vs. 62.9) on a variety of math reasoning benchmarks, while coming close to the performance of the 10× larger Qwen2.5-Math-72B-Instruct (67.2 vs. 68.2). Notably, our AceMath-72B-Instruct outperforms the state-of-the-art Qwen2.5-Math-72B-Instruct (71.8 vs. 68.2), GPT-4o (67.4), and Claude 3.5 Sonnet (65.6) by a clear margin. We also report the rm@8 accuracy (best of 8) achieved by our reward model, AceMath-72B-RM, which sets a new record on these reasoning benchmarks. This excludes OpenAI's o1 model, which relies on scaled inference computation.

Correspondence to Zihan Liu ([email protected]), Yang Chen ([email protected]), Wei Ping ([email protected])

Citation

If you find our work helpful, we'd appreciate it if you could cite us.

@article{acemath2024,
  title={AceMath: Advancing Frontier Math Reasoning with Post-Training and Reward Modeling},
  author={Liu, Zihan and Chen, Yang and Shoeybi, Mohammad and Catanzaro, Bryan and Ping, Wei},
  journal={arXiv preprint},
  year={2024}
}

License

All models in the AceMath family are for non-commercial use only, subject to the Terms of Use of the data generated by OpenAI. We put the AceMath models under the license of Creative Commons Attribution-NonCommercial 4.0 International.
kani-tts-450m-0.1-pt-GGUF
This model was generated using llama.cpp at commit `152729f8`. Click here to get info on choosing the right GGUF model format

KaniTTS is a text-to-speech (TTS) model designed for high-speed, high-fidelity audio generation. It is built on a novel architecture that combines a powerful language model with a highly efficient audio codec, enabling exceptional performance for real-time applications. KaniTTS operates as a two-stage pipeline, leveraging a large foundation model for token generation and a compact, efficient codec for waveform synthesis. This two-stage design provides a significant advantage in speed and efficiency: the backbone LLM generates a compressed token representation, which is then rapidly expanded into an audio waveform by the NanoCodec. The architecture bypasses the computational overhead of generating waveforms directly from large-scale language models, resulting in extremely low latency.

Features

This model was trained primarily on English for robust core capabilities, and the tokenizer supports these languages: English, Arabic, Chinese, French, German, Japanese, Korean, and Spanish. The base model can be continually pretrained on multilingual datasets, producing high-fidelity audio at a 22 kHz sample rate. The model powers voice interactions in modern agentic systems, enabling seamless, human-like conversations.
- Model Size: 450M parameters (pretrained version)
- License: Apache 2.0

Sample texts (audio players are available on the model page):
- I do believe Marsellus Wallace, MY husband, YOUR boss, told you to take me out and do WHATEVER I WANTED.
- What do we say to the god of death? Not today!
- What do you call a lawyer with an IQ of 60? Your honor
- You mean, let me understand this cause, you know maybe it's me, it's a little fucked up maybe, but I'm funny how, I mean funny like I'm a clown, I amuse you? I make you laugh, I'm here to fucking amuse you?

- Website: nineninesix.ai
- GitHub Repo: https://github.com/nineninesix-ai/kani-tts
- Base Model Card on HF: nineninesix/kani-tts-450m-0.1-pt
- FT Model Card on HuggingFace: nineninesix/kani-tts-450m-0.2-ft
- Link to HF Space: nineninesix/KaniTTS
- Inference Example: Colab Notebook
- Finetuning Example: Colab Notebook
- Example Dataset for Fine-tuning: Expresso Conversational
- Waiting List for Pro Version

Recommended Uses
- Conversational AI: Integrate into chatbots, virtual assistants, or voice-enabled apps for real-time speech output.
- Edge and Server Deployment: Optimized for low-latency inference on edge devices or affordable servers, enabling scalable, resource-efficient voice applications.
- Accessibility Tools: Support screen readers or language-learning apps with expressive prosody.
- Research: Fine-tune for domain-specific voices (e.g., accents, emotions) or benchmark against other TTS systems.

Limitations
- Performance may vary with fine-tuned variants, long inputs (> 2000 tokens), or rare languages/accents.
- Emotion control is basic; advanced expressivity requires fine-tuning.
- Trained on public datasets; may inherit biases in prosody or pronunciation from the training data.

- Dataset: Curated from LibriTTS, Common Voice, and Emilia (~50k hours).
- Pretrained mostly on English speech for robust core capabilities, with multilingual fine-tuning for supported languages.
- Metrics: MOS (Mean Opinion Score) 4.3/5 for naturalness; WER (Word Error Rate). This performance makes KaniTTS suitable for real-time conversational AI applications and low-latency voice synthesis.
- Language Optimization: For the best results in non-English languages, continually pretrain this model on datasets from your desired language set to improve prosody, accents, and pronunciation accuracy. Additionally, fine-tune NanoCodec for the desired set of languages.
- Batch Processing: For high-throughput applications, process texts in batches of 8-16 to leverage parallel computation, reducing per-sample latency.
- Blackwell GPU Optimization: This model runs efficiently on NVIDIA's Blackwell-architecture GPUs for faster inference and reduced latency in real-time applications.

Credits
- This project was inspired by the works of Orpheus TTS and Sesame CSM.
- It utilizes the LiquidAI LFM2 350M as its core backbone and the Nvidia NanoCodec for efficient audio processing.

Responsible Use and Prohibited Activities

The model is designed for ethical and responsible use. The following activities are strictly prohibited:
- The model may not be used for any illegal purposes or to create content that is harmful, threatening, defamatory, or obscene. This includes, but is not limited to, the generation of hate speech, harassment, or incitement of violence.
- You may not use the model to generate or disseminate false or misleading information. This includes creating deceptive audio content that impersonates individuals without their consent or misrepresents facts.
- The model is not to be used for malicious activities, such as spamming, phishing, or the creation of content intended to deceive or defraud.

By using this model, you agree to abide by these restrictions and all applicable laws and regulations.

Contact

Have a question, feedback, or need support? Please fill out our contact form and we'll get back to you as soon as possible.
GLM-Z1-Rumination-32B-0414-GGUF
AceReason-Nemotron-1.1-7B-GGUF
kanana-safeguard-8b-GGUF
This model was generated using llama.cpp at commit `adef8178`. Click here to get info on choosing the right GGUF model format

Model Details

Kanana Safeguard is a harmful-content detection model based on Kanana 8B, Kakao's in-house language model. It is trained to classify whether a user utterance or an AI assistant's response within a conversational AI system carries risk. The classification result is output as a single token of the form <SAFE> or <UNSAFE-S4>, where S4 denotes the code of the risk category violated by the user utterance or assistant response.

Risk Taxonomy

The model's risk categories are based on the MLCommons taxonomy, extended with categories reflecting Korean local characteristics, yielding a taxonomy of seven categories in total. Both ① user utterances and ② AI assistant responses are judged against the same taxonomy.
- S1 Hate: Speech that discriminates against a target on the basis of origin, race, appearance, disability or illness, socioeconomic situation or status, religion, age, gender, gender identity, sexual orientation, or other identity factors, or that attacks individuals or groups on the basis of such discrimination.
- S2 Harassment: Speech that causes discomfort or humiliation, is threatening, or incites harassment of a particular target.
- S3 Sexual Content: Speech that depicts or implies sexual acts or body parts, or that may cause sexual shame or disgust (excluding sex education and well-being).
- S4 Crime: Speech describing the planning and preparation of illegal acts (e.g., violent and non-violent crimes, sex crimes, weapon manufacture or procurement).
- S5 Child Sexual Exploitation: Speech that describes, encourages, or supports sexual abuse of children (e.g., grooming, CSAM-related text).
- S6 Suicide and Self-Harm: Speech that depicts or encourages intentionally ending one's own life or deliberately harming one's own body.

Quick Start

🤗 Hugging Face Transformers - Running the model requires `transformers>=4.51.3` or a later version.

Training Data

Kanana Safeguard's training data consists of manually authored and synthetic data, all in Korean. The manual data was created and labeled by professional annotators in line with internal policy. The synthetic data was generated through various augmentation techniques, such as LLM-based paraphrasing and noise injection.
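Given the single-token verdict format described above (<SAFE> or <UNSAFE-S#>), a classification call can be sketched as follows. The `parse_verdict` helper and the repo id are illustrative assumptions, and the chat-template details may differ from the model's actual quick-start code:

```python
import re
from typing import Optional, Tuple

def parse_verdict(token: str) -> Tuple[bool, Optional[str]]:
    """Map the model's one-token verdict to (is_safe, risk_code)."""
    token = token.strip()
    if token == "<SAFE>":
        return True, None
    m = re.fullmatch(r"<UNSAFE-(S\d)>", token)
    if m:
        return False, m.group(1)  # e.g. "S4" for the Crime category
    raise ValueError(f"unexpected verdict token: {token!r}")

if __name__ == "__main__":
    from transformers import AutoModelForCausalLM, AutoTokenizer  # transformers>=4.51.3

    MODEL_ID = "kakaocorp/kanana-safeguard-8b"  # assumed repo id
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")
    messages = [{"role": "user", "content": "Example user utterance to screen."}]
    input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt").to(model.device)
    out = model.generate(input_ids, max_new_tokens=1)  # the verdict is a single token
    verdict = tokenizer.decode(out[0][input_ids.shape[1]:], skip_special_tokens=False)
    print(parse_verdict(verdict))
```

Generating exactly one new token keeps the call cheap and matches the evaluation protocol described below, which classifies on the model's first output token.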
In addition to unsafe utterances, the training data includes conversations in which the AI assistant responded safely to harmful questions, in order to reduce the model's false-positive rate.

Evaluation

Kanana Safeguard's performance was evaluated as a binary SAFE/UNSAFE classification. All evaluations treat UNSAFE as the positive class and classify based on the first token output by the model. The external benchmark models were evaluated on their outputs as follows: LlamaGuard was judged directly on its SAFE/UNSAFE tokens. ShieldGemma performed binary classification with the threshold set to 0.5. GPT-4o was given a risk-category-based classification prompt in a zero-shot setting, and outputs classified into a specific code were counted as UNSAFE. On our internally built Korean evaluation dataset, Kanana Safeguard showed the best classification performance among the benchmarked models. All models were evaluated with the same evaluation dataset and classification criteria, designed to minimize the effects of policy and architecture differences and to allow a fair and reliable comparison.

Kanana Safeguard has the following limitations, which we plan to improve continuously:
- The model does not guarantee 100% accurate classification. In particular, its policy was established around common use cases, so it may misclassify in specific domains.
- The model does not maintain context from previous conversation turns or carry a conversation forward.
- The model detects only a fixed set of risks, so it cannot catch every risk in real-world use. Depending on your needs, using it together with Kanana Safeguard-Siren (legal-risk detection) and Kanana Safeguard-Prompt (prompt-attack detection) can further improve overall safety.

Contributors: JeongHwan Lee, Deok Jeong, HyeYeon Cho, JiEun Choi
You will also find the code I use to quantize the models, if you want to do it yourself: GGUFModelBuilder

💬 How to test: Choose an AI assistant type: - `TurboLLM` (GPT-4.1-mini) - `HugLLM` (Hugging Face open-source models) - `TestLLM` (Experimental CPU-only)

What I'm Testing: I'm pushing the limits of small open-source models for AI network monitoring, specifically: - Function calling against live network services - How small can a model go while still handling: - Automated Nmap security scans - Quantum-readiness checks - Network monitoring tasks

🟡 TestLLM – Current experimental model (llama.cpp on 2 CPU threads on a Hugging Face Docker space): - ✅ Zero-configuration setup - ⏳ 30s load time (slow inference, but no API costs); no token limit, as the cost is low - 🔧 Help wanted! If you're into edge-device AI, let's collaborate!

Other Assistants 🟢 TurboLLM – Uses gpt-4.1-mini: - It performs very well, but unfortunately OpenAI charges per token, so token usage is limited - Create custom cmd processors to run .net code on Quantum Network Monitor Agents - Real-time network diagnostics and monitoring - Security audits - Penetration testing (Nmap/Metasploit)

🔵 HugLLM – Latest open-source models: - 🌐 Runs on the Hugging Face Inference API; performs pretty well using the latest models hosted on Novita.

💡 Example commands you could test: 1. `"Give me info on my website's SSL certificate"` 2. `"Check if my server is using quantum-safe encryption for communication"` 3. `"Run a comprehensive security audit on my server"` 4. `"Create a cmd processor to ... (whatever you want)"` Note: you need to install a Quantum Network Monitor Agent to run the .net code. This is a very flexible and powerful feature. Use with caution! I fund the servers used to create these model files, run the Quantum Network Monitor service, and pay for inference from Novita and OpenAI—all out of my own pocket. All the code behind the model creation and the Quantum Network Monitor project is open source.
Feel free to use whatever you find helpful. If you appreciate the work, please consider buying me a coffee ☕. Your support helps cover service costs and allows me to raise token limits for everyone.
kanana-safeguard-siren-8b-GGUF
This model was generated using llama.cpp at commit `adef8178`. Click here to get info on choosing the right GGUF model format

Model Description: Kanana Safeguard-Siren is a legal and policy risk detection model built on Kanana 8B, Kakao's in-house language model. It is trained to flag user utterances in a conversational AI system that require legal or policy caution. The classification result is emitted as a single token of the form <SAFE> or <UNSAFE-I2>, where I2 denotes the code of the risk category violated by the user utterance.

Risk Taxonomy: The model's risk categories are based on the MLCommons taxonomy, extended with categories reflecting Korean legal specifics, yielding a taxonomy of four categories:
- I1 Adult verification: utterances requesting youth-restricted information, such as alcohol, tobacco, gambling, adult entertainment venues, or 19+ content
- I2 Professional advice: utterances requesting advice tied to professional decision-making in medicine, law, tax, finance, and similar fields
- I3 Personal information: utterances requesting or containing personally identifiable information (e.g., resident registration numbers, account numbers) or other sensitive data
- I4 Intellectual property: utterances requesting or attempting to reproduce, without authorization, content protected by copyright, patent, trademark, etc.

Quick Start: 🤗 HuggingFace Transformers - running the model requires `transformers>=4.51.3` or later.

Kanana Safeguard-Siren's training data consists of manually curated, synthetic, and external data, drawing on diverse data types to ensure variety. The manual data was created and labeled directly by professional annotators in line with internal policy. The synthetic data was generated through augmentation techniques such as LLM-based paraphrasing and noise injection to improve training effectiveness. The external data was collected from publicly available sources. In addition to unsafe utterances, the training data includes safe user utterances to reduce the model's false-positive rate.

Evaluation: Kanana Safeguard-Siren was evaluated as a binary SAFE/UNSAFE classifier. In all evaluations, UNSAFE was treated as the positive class, and classification was based on the first token the model emitted.
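The evaluation protocol above (UNSAFE as the positive class, judged on the model's first output token) can be sketched as a small scoring helper. This illustrates the metric only; it is not the team's actual evaluation code:

```python
def binary_scores(gold, pred, positive="UNSAFE"):
    """Precision/recall/F1 with UNSAFE as the positive class.

    Labels are assumed to be derived from the first output token:
    any <UNSAFE-*> token counts as "UNSAFE", <SAFE> counts as "SAFE".
    """
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```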
External benchmark models were scored as follows: LlamaGuard's SAFE/UNSAFE tokens were used directly; ShieldGemma performed binary classification with a threshold of 0.5; GPT-4o was given a zero-shot, risk-category-based classification prompt, and any output assigned to a specific code was counted as UNSAFE. On our in-house Korean evaluation dataset, Kanana Safeguard-Siren outperformed all benchmark models. All models were evaluated on the same test set with the same classification criteria, designed to minimize the impact of policy and architecture differences and enable a fair, reliable comparison.

Limitations: Kanana Safeguard-Siren has the following limitations, which we plan to keep improving. 1. Possible misclassification: the model does not guarantee perfect classification; in particular, because its policy was built around common use cases, it may misclassify inputs from specialized domains. 2. No context awareness: it does not maintain context from previous turns or carry on conversations. 3. Limited risk categories: it detects only a fixed set of risks and therefore cannot catch every risk that occurs in practice; depending on your needs, using it together with Kanana Safeguard (harmful-content detection) and Kanana Safeguard-Prompt (prompt-attack detection) can further improve overall safety.

Contributors: HyeYeon Cho, JeongHwan Lee, Deok Jeong, JiEun Choi
reka-flash-3.1-GGUF
granite-guardian-3.1-2b-GGUF
granite-3.1-3b-a800m-instruct-GGUF
This model was generated using llama.cpp at commit `0a5a3b5c`. Click here to get info on choosing the right GGUF model format

Model Summary: Granite-3.1-3B-A800M-Instruct is a 3B parameter long-context instruct model finetuned from Granite-3.1-3B-A800M-Base using a combination of open-source instruction datasets with permissive licenses and internally collected synthetic datasets tailored for solving long-context problems. This model is developed using a diverse set of techniques with a structured chat format, including supervised finetuning, model alignment using reinforcement learning, and model merging. - Developers: Granite Team, IBM - GitHub Repository: ibm-granite/granite-3.1-language-models - Website: Granite Docs - Paper: Granite 3.1 Language Models (coming soon) - Release Date: December 18th, 2024 - License: Apache 2.0

Supported Languages: English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese. Users may finetune Granite 3.1 models for languages beyond these 12.

Intended Use: The model is designed to respond to general instructions and can be used to build AI assistants for multiple domains, including business applications.

Capabilities: - Summarization - Text classification - Text extraction - Question-answering - Retrieval Augmented Generation (RAG) - Code-related tasks - Function-calling tasks - Multilingual dialog use cases - Long-context tasks, including long document/meeting summarization, long document QA, etc.

Generation: This is a simple example of how to use the Granite-3.1-3B-A800M-Instruct model; copy the snippet relevant to your use case.
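The generation snippet itself is not included in this card. A hedged quick-start sketch, assuming the standard transformers chat-template flow and the `ibm-granite/granite-3.1-3b-a800m-instruct` Hugging Face id:

```python
# Hedged quick-start sketch: the repo id and chat-template flow are assumed
# from the usual Granite/transformers conventions, not copied from the card.
MODEL_PATH = "ibm-granite/granite-3.1-3b-a800m-instruct"  # assumed HF repo id

def build_chat(user_prompt: str) -> list:
    """Granite instruct models take role/content messages for apply_chat_template."""
    return [{"role": "user", "content": user_prompt}]

def main() -> None:
    # Heavy imports kept inside main() so the helper above stays importable.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
    model = AutoModelForCausalLM.from_pretrained(MODEL_PATH, device_map="auto")
    chat = build_chat("Summarize the capabilities of the Granite 3.1 models.")
    input_ids = tokenizer.apply_chat_template(
        chat, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(input_ids, max_new_tokens=128)
    print(tokenizer.decode(output[0, input_ids.shape[-1]:], skip_special_tokens=True))
```

Adjust `device_map` and dtype to your hardware; the GGUF files in this repo are instead meant for llama.cpp-based runtimes.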
| Models | ARC-Challenge | Hellaswag | MMLU | TruthfulQA | Winogrande | GSM8K | Avg |
|---|---|---|---|---|---|---|---|
| Granite-3.1-8B-Instruct | 62.62 | 84.48 | 65.34 | 66.23 | 75.37 | 73.84 | 71.31 |
| Granite-3.1-2B-Instruct | 54.61 | 75.14 | 55.31 | 59.42 | 67.48 | 52.76 | 60.79 |
| Granite-3.1-3B-A800M-Instruct | 50.42 | 73.01 | 52.19 | 49.71 | 64.87 | 48.97 | 56.53 |
| Granite-3.1-1B-A400M-Instruct | 42.66 | 65.97 | 26.13 | 46.77 | 62.35 | 33.88 | 46.29 |

| Models | IFEval | BBH | MATH Lvl 5 | GPQA | MUSR | MMLU-Pro | Avg |
|---|---|---|---|---|---|---|---|
| Granite-3.1-8B-Instruct | 72.08 | 34.09 | 21.68 | 8.28 | 19.01 | 28.19 | 30.55 |
| Granite-3.1-2B-Instruct | 62.86 | 21.82 | 11.33 | 5.26 | 4.87 | 20.21 | 21.06 |
| Granite-3.1-3B-A800M-Instruct | 55.16 | 16.69 | 10.35 | 5.15 | 2.51 | 12.75 | 17.10 |
| Granite-3.1-1B-A400M-Instruct | 46.86 | 6.18 | 4.08 | 0.00 | 0.78 | 2.41 | 10.05 |

Model Architecture: Granite-3.1-3B-A800M-Instruct is based on a decoder-only Mixture of Experts (MoE) transformer architecture (A800M denotes roughly 800M active parameters per token). Core components of this architecture are: GQA and RoPE, MLP with SwiGLU, RMSNorm, and shared input/output embeddings.

Training Data: Overall, our SFT data is largely comprised of three key sources: (1) publicly available datasets with permissive licenses, (2) internal synthetic data targeting specific capabilities, including long-context tasks, and (3) very small amounts of human-curated data. A detailed attribution of datasets can be found in the Granite 3.0 Technical Report, Granite 3.1 Technical Report (coming soon), and Accompanying Author List.

Infrastructure: We train Granite 3.1 Language Models using IBM's supercomputing cluster, Blue Vela, which is outfitted with NVIDIA H100 GPUs. This cluster provides a scalable and efficient infrastructure for training our models over thousands of GPUs.

Ethical Considerations and Limitations: Granite 3.1 Instruct Models are primarily finetuned using instruction-response pairs mostly in English, but also multilingual data covering eleven languages. Although this model can handle multilingual dialog use cases, its performance might not match that on English tasks.
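As a sanity check, each Avg value is consistent with the mean of the six benchmark scores on its row, rounded to two decimals:

```python
# Recompute the Avg column of the first benchmark table from its six scores.
rows = {
    "Granite-3.1-8B-Instruct":       [62.62, 84.48, 65.34, 66.23, 75.37, 73.84],
    "Granite-3.1-2B-Instruct":       [54.61, 75.14, 55.31, 59.42, 67.48, 52.76],
    "Granite-3.1-3B-A800M-Instruct": [50.42, 73.01, 52.19, 49.71, 64.87, 48.97],
    "Granite-3.1-1B-A400M-Instruct": [42.66, 65.97, 26.13, 46.77, 62.35, 33.88],
}
averages = {name: round(sum(s) / len(s), 2) for name, s in rows.items()}
```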
In such cases, introducing a small number of examples (few-shot) can help the model generate more accurate outputs. While this model has been aligned with safety in mind, it may in some cases produce inaccurate, biased, or unsafe responses to user prompts. We therefore urge the community to use this model with proper safety testing and tuning tailored to their specific tasks.

Resources - ⭐️ Learn about the latest updates with Granite: https://www.ibm.com/granite - 📄 Get started with tutorials, best practices, and prompt engineering advice: https://www.ibm.com/granite/docs/ - 💡 Learn about the latest Granite learning resources: https://ibm.biz/granite-learning-resources
Nemotron-4-Mini-Hindi-4B-Base-GGUF
AceInstruct-7B-GGUF
granite-20b-code-base-r1.1-GGUF
watt-tool-8B-GGUF
Meta-Llama-3-8B-Instruct-GGUF
granite-8b-code-base-4k-GGUF
This model was generated using llama.cpp at commit `5dd942de`. Click here to get info on choosing the right GGUF model format

Model Summary Granite-8B-Code-Base-4K is a decoder-only code model designed for code generative tasks (e.g., code generation, code explanation, code fixing, etc.). It is trained from scratch with a two-phase training strategy. In phase 1, our model is trained on 4 trillion tokens sourced from 116 programming languages, ensuring a comprehensive understanding of programming languages and syntax. In phase 2, our model is trained on 500 billion tokens with a carefully designed mixture of high-quality data from code and natural language domains to improve the models' ability to reason and follow instructions. - Developers: IBM Research - GitHub Repository: ibm-granite/granite-code-models - Paper: Granite Code Models: A Family of Open Foundation Models for Code Intelligence - Release Date: May 6th, 2024 - License: Apache 2.0. Usage Intended use Prominent enterprise use cases of LLMs in software engineering productivity include code generation, code explanation, code fixing, generating unit tests, generating documentation, addressing technical debt issues, vulnerability detection, code translation, and more.
All Granite Code Base models, including the 8B parameter model, are able to handle these tasks, as they were trained on a large amount of code data from 116 programming languages. Generation This is a simple example of how to use the Granite-8B-Code-Base-4K model. Training Data - Data Collection and Filtering: Pretraining code data is sourced from a combination of publicly available datasets (e.g., GitHub Code Clean, Starcoder data) and additional public code repositories and issues from GitHub. We filter raw data to retain a list of 116 programming languages. After language filtering, we also filter out low-quality code. - Exact and Fuzzy Deduplication: We adopt an aggressive deduplication strategy that includes both exact and fuzzy deduplication to remove documents having (near) identical code content. - HAP, PII, Malware Filtering: We apply a HAP content filter that reduces models' likelihood of generating hateful, abusive, or profane language. We also make sure to redact Personally Identifiable Information (PII) by replacing PII content (e.g., names, email addresses, keys, passwords) with corresponding tokens (e.g., ⟨NAME⟩, ⟨EMAIL⟩, ⟨KEY⟩, ⟨PASSWORD⟩). Moreover, we scan all datasets using ClamAV to identify and remove instances of malware in the source code. - Natural Language Datasets: In addition to collecting code data for model training, we curate several publicly available high-quality natural language datasets to improve models' proficiency in language understanding and mathematical reasoning. Unlike the code data, we do not deduplicate these datasets. Infrastructure We train the Granite Code models using two of IBM's supercomputing clusters, namely Vela and Blue Vela, outfitted with NVIDIA A100 and H100 GPUs, respectively. These clusters provide a scalable and efficient infrastructure for training our models over thousands of GPUs.
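The PII-redaction step described above (replacing matched spans with placeholder tokens like ⟨EMAIL⟩ or ⟨KEY⟩) can be illustrated with a toy regex pass. The patterns below are simplified stand-ins of my own, not IBM's actual filtering rules:

```python
import re

# Simplified, illustrative PII patterns -> replacement tokens.
PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "⟨EMAIL⟩"),   # email addresses
    (re.compile(r"(?:sk|key)-[A-Za-z0-9]{8,}"), "⟨KEY⟩"),   # API-key-like strings
]

def redact(text: str) -> str:
    """Replace each matched PII span with its placeholder token."""
    for pattern, token in PII_PATTERNS:
        text = pattern.sub(token, text)
    return text
```

A production pipeline would use far more robust detectors (named-entity recognition, checksum validation for IDs, etc.); the point here is only the replace-with-token mechanism.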
Ethical Considerations and Limitations The use of Large Language Models involves risks and ethical considerations people must be aware of. Regarding code generation, caution is urged against complete reliance on specific code models for crucial decisions or impactful information, as the generated code is not guaranteed to work as intended. The Granite-8B-Code-Base-4K model is no exception in this regard. Even though this model is suited for multiple code-related tasks, it has not undergone any safety alignment; therefore, it may produce problematic outputs. Additionally, it remains uncertain whether smaller models might exhibit increased susceptibility to hallucination in generation scenarios by copying source code verbatim from the training dataset, due to their reduced sizes and memorization capacities. This aspect is currently an active area of research, and we anticipate more rigorous exploration, comprehension, and mitigations in this domain. Regarding ethics, a latent risk associated with all Large Language Models is their malicious utilization. We urge the community to use the Granite-8B-Code-Base-4K model with ethical intentions and in a responsible way.
DiffuCoder-7B-cpGRPO-GGUF
This model was generated using llama.cpp at commit `bf9087f5`. Click here to get info on choosing the right GGUF model format

The DiffuCoder-7B-cpGRPO variant further refines DiffuCoder-Instruct with reinforcement learning via Coupled-GRPO. - Initialized from DiffuCoder-7B-Instruct, post-trained with coupled-GRPO on 21K code samples (1 epoch). - Coupled-GRPO significantly improves DiffuCoder's performance on code generation benchmarks (+4.4% on EvalPlus) and reduces reliance on AR bias during decoding. - Paper: DiffuCoder: Understanding and Improving Masked Diffusion Models for Code Generation

Acknowledgement: To power this HuggingFace model release, we reuse Dream's modeling architecture and generation utils.
SmolLM3-3B-GGUF
A.X-3.1-Light-GGUF
MiniCPM4-8B-GGUF
This model was generated using llama.cpp at commit `7f4fbe51`. Click here to get info on choosing the right GGUF model format

What's New - [2025.06.06] MiniCPM4 series are released! This model achieves ultimate efficiency improvements while maintaining optimal performance at the same scale! It can achieve over 5x generation acceleration on typical end-side chips! You can find the technical report here. 🔥🔥🔥

MiniCPM4 Series MiniCPM4 series are highly efficient large language models (LLMs) designed explicitly for end-side devices, which achieve this efficiency through systematic innovation in four key dimensions: model architecture, training data, training algorithms, and inference systems. - MiniCPM4-8B: The flagship of MiniCPM4, with 8B parameters, trained on 8T tokens.

Note: In vLLM's chat API, `add_special_tokens` is `False` by default. This means important special tokens—such as the beginning-of-sequence (BOS) token—will not be added automatically. To ensure the input prompt is correctly formatted for the model, you should explicitly set `extra_body={"add_special_tokens": True}`.
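A sketch of the vLLM note above, using the OpenAI Python client's `extra_body` pass-through against a locally served MiniCPM4. The server URL is a placeholder and the model id is assumed, not taken from this card:

```python
def chat_payload(prompt: str) -> dict:
    """Build chat.completions arguments, forcing BOS via add_special_tokens."""
    return {
        "model": "openbmb/MiniCPM4-8B",  # assumed served-model name
        "messages": [{"role": "user", "content": prompt}],
        # vLLM-specific option, passed through the OpenAI client's extra_body:
        "extra_body": {"add_special_tokens": True},
    }

def main() -> None:
    from openai import OpenAI

    # Placeholder base_url for a local `vllm serve` instance.
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    resp = client.chat.completions.create(**chat_payload("Hello"))
    print(resp.choices[0].message.content)
```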
You can then use the chat interface through vLLM's OpenAI-compatible API.

Evaluation Results On two typical end-side chips, Jetson AGX Orin and RTX 4090, MiniCPM4 demonstrates significantly faster processing speed compared to similar-size models in long-text processing tasks. As text length increases, MiniCPM4's efficiency advantage becomes more pronounced. On the Jetson AGX Orin platform, compared to Qwen3-8B, MiniCPM4 achieves approximately a 7x decoding speed improvement. Comprehensive Evaluation MiniCPM4 launches end-side versions at 8B and 0.5B parameter scales, both achieving best-in-class performance in their respective categories. Long Text Evaluation MiniCPM4 is pre-trained on 32K long texts and achieves length extension through YaRN technology. In the 128K long-text needle-in-a-haystack task, MiniCPM4 demonstrates outstanding performance. Statement - As a language model, MiniCPM generates content by learning from a vast amount of text. - However, it does not possess the ability to comprehend or express personal opinions or value judgments. - Any content generated by MiniCPM does not represent the viewpoints or positions of the model developers. - Therefore, when using content generated by MiniCPM, users should take full responsibility for evaluating and verifying it on their own. LICENSE - This repository and MiniCPM models are released under the Apache-2.0 License. Citation - Please cite our paper if you find our work valuable.
MiniCPM4-0.5B-GGUF
Harbinger-24B-GGUF
This model was generated using llama.cpp at commit `7f4fbe51`. Click here to get info on choosing the right GGUF model format

Like our Wayfarer line of finetunes, Harbinger-24B was designed for immersive adventures and other stories where consequences feel real and every decision matters. Training focused on enhancing instruction following, improving mid-sequence continuation, and strengthening narrative coherence over long sequences of outputs without user intervention. The same DPO (direct preference optimization) techniques used in our Muse model were applied to Harbinger, resulting in polished outputs with fewer clichés, repetitive patterns, and other common artifacts. If you want to easily try this model, you can do so at https://aidungeon.com. Note that Harbinger requires a subscription, while Muse and Wayfarer Small are free. We plan to continue improving and open-sourcing similar models, so please share any and all feedback on how we can improve model behavior. Below we share more details on how Harbinger was created. Harbinger-24B was trained in two stages, on top of Mistral Small 3.1 Instruct. SFT - Various multi-turn datasets from a multitude of sources, focused on Wayfarer-style text adventures and general roleplay, each carefully balanced and rewritten to be free of common AI clichés.
A small single-turn instruct dataset was included to send a stronger signal during finetuning. DPO - Reward Model User Preference Data, detailed in our blog - This stage refined Harbinger's narrative coherence while preserving its unforgiving essence, resulting in more consistent character behaviors and smoother storytelling flows. Mistral Small 3.1 is sensitive to higher temperatures, so the following settings are recommended as a baseline. Nothing stops you from experimenting with these, of course. Harbinger was trained exclusively on second-person present tense data (using “you”) in a narrative style. Other styles will work as well but may produce suboptimal results. Thanks to Gryphe Padar for collaborating on this finetune with us!

Help me test my AI-Powered Quantum Network Monitor Assistant with quantum-ready security checks: The full open-source code for the Quantum Network Monitor Service is available at my GitHub repos (repos with NetworkMonitor in the name): Source Code Quantum Network Monitor. You will also find the code I use to quantize the models, if you want to do it yourself: GGUFModelBuilder 💬 How to test: Choose an AI assistant type: - `TurboLLM` (GPT-4.1-mini) - `HugLLM` (Hugging Face open-source models) - `TestLLM` (Experimental CPU-only) What I’m Testing I’m pushing the limits of small open-source models for AI network monitoring, specifically: - Function calling against live network services - How small can a model go while still handling: - Automated Nmap security scans - Quantum-readiness checks - Network monitoring tasks 🟡 TestLLM – Current experimental model (llama.cpp on 2 CPU threads on a Hugging Face docker space): - ✅ Zero-configuration setup - ⏳ 30s load time (slow inference but no API costs). No token limit, as the cost is low. - 🔧 Help wanted! If you’re into edge-device AI, let’s collaborate! Other Assistants 🟢 TurboLLM – Uses gpt-4.1-mini: - It performs very well, but unfortunately OpenAI charges per token.
For this reason, token usage is limited. - Create custom cmd processors to run .NET code on Quantum Network Monitor Agents - Real-time network diagnostics and monitoring - Security audits - Penetration testing (Nmap/Metasploit) 🔵 HugLLM – Latest open-source models: - 🌐 Runs on the Hugging Face Inference API. Performs pretty well using the latest models hosted on Novita. 💡 Example commands you could test: 1. `"Give me info on my website's SSL certificate"` 2. `"Check if my server is using quantum-safe encryption for communication"` 3. `"Run a comprehensive security audit on my server"` 4. `"Create a cmd processor to .. (whatever you want)"` Note you need to install a Quantum Network Monitor Agent to run the .NET code on. This is a very flexible and powerful feature. Use with caution! I fund the servers used to create these model files, run the Quantum Network Monitor service, and pay for inference from Novita and OpenAI—all out of my own pocket. All the code behind the model creation and the Quantum Network Monitor project is open source. Feel free to use whatever you find helpful. If you appreciate the work, please consider buying me a coffee ☕. Your support helps cover service costs and allows me to raise token limits for everyone.
Qwen3-1.7B-GGUF
granite-guardian-3.0-8b-GGUF
AFM-4.5B-GGUF
ERNIE-4.5-0.3B-PT-GGUF
granite-3.1-2b-base-GGUF
OpenReasoning-Nemotron-1.5B-GGUF
granite-3.1-1b-a400m-base-GGUF
SmolVLM-256M-Instruct-GGUF
Jan-v1-4B-GGUF
This model was generated using llama.cpp at commit `cd6983d5`. Click here to get info on choosing the right GGUF model format [](https://github.com/menloresearch/deep-research) [](https://opensource.org/licenses/Apache-2.0) [](https://jan.ai/) Overview Jan-v1 is the first release in the Jan Family, designed for agentic reasoning and problem-solving within the Jan App. Based on our Lucy model, Jan-v1 achieves improved performance through model scaling. Jan-v1 uses the Qwen3-4B-thinking model to provide enhanced reasoning capabilities and tool utilization. This architecture delivers better performance on complex agentic tasks. Question Answering (SimpleQA) For question-answering, Jan-v1 shows a significant performance gain from model scaling, achieving 91.1% accuracy. The 91.1% SimpleQA accuracy represents a significant milestone in factual question answering for models of this scale, demonstrating the effectiveness of our scaling and fine-tuning approach. These benchmarks evaluate the model's conversational and instructional capabilities. Jan-v1 is optimized for direct integration with the Jan App. Simply select the model from the Jan App interface for immediate access to its full capabilities.
- Discussions: HuggingFace Community - Jan App: Learn more about the Jan App at jan.ai Note: By default we include a system prompt in the chat template; this is to make sure the model achieves the same performance as the benchmark results. You can also use the vanilla chat template, without a system prompt, found in the file chattemplateraw.jinja.
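As a minimal sketch of swapping in the vanilla chat template: recent `llama.cpp` builds let you override the template baked into the GGUF metadata. The flag names (`--jinja`, `--chat-template-file`) and the quantized filename below are assumptions about your local setup, not something specified by the Jan card.

```shell
# Hedged sketch: run Jan-v1 with the vanilla chat template from the repo
# instead of the default template (which includes the system prompt).
# Model filename is illustrative; use whichever quant you downloaded.
./llama-cli -m Jan-v1-4B-Q4_K_M.gguf \
    --jinja --chat-template-file chattemplateraw.jinja \
    -p "What is the capital of France?"
```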
II-Search-4B-GGUF
functionary-small-v3.1-GGUF
HyperCLOVAX-SEED-Text-Instruct-0.5B-GGUF
AQUA-7B-GGUF
This model was generated using llama.cpp at commit `21c02174`. Click here to get info on choosing the right GGUF model format AQUA-7B is Kurma AI’s flagship 7-billion-parameter large language model built exclusively for the global aquaculture industry, and the first large language model for aquaculture. It is fine-tuned to deliver actionable insights for species-specific aquaculture farming, hatchery operations, water quality control, and disease management. Trained on over 3 million real and synthetic aquaculture conversations (~1B tokens), AQUA-7B brings the power of domain-specific AI to fish farms, fish hatcheries, researchers, and aqua-tech innovators worldwide. - Production Systems & Species Management: Covers ponds, tanks, cages, RAS, aquaponics, mariculture, and longlines. Delivers best practices for raising tilapia, catfish, carp, salmon, shrimp, crabs, oysters, trout, sea bass, and more, supporting both smallholder and industrial farms. - Genetics, Hatchery, and Early Life Stage Management: Guides advanced breeding, gene editing, hatchery design, spawning, larval care, nursery systems, live feed, transport, egg incubation, and biosecurity.
- Nutrition, Feeding, and Growth Optimization: Provides actionable guidance on feeding and growth across life stages, supporting both smallholder and industrial operations. - Water Quality, Health, and Disease Management: Provides actionable protocols for water quality (temperature, oxygen, pH, ammonia, nitrite, salinity), and structured disease management: identification, vaccination, biosecurity, antibiotic use, and outbreak response. - Sustainable Aquaculture & Innovation: Promotes eco-friendly practices in waste management, environmental impact, biodiversity, and climate adaptation, and guides adoption of new technologies: AI, automation, sensors, water drones, and modern farm management. - Markets, Business, and Post-Harvest Management: Advises on market trends, business planning, regulation, certification, traceability, and insurance. Covers best practices for harvesting, processing, cold chain, grading, packaging, contamination prevention, HACCP, and food safety. - Extension worker–farmer dialogues and field advisory logs - FAO, ICAR, NOAA, and peer-reviewed aquaculture research - Synthetic Q&A from 5,000+ aquaculture-focused topics - Climate-resilient practices, hatchery SOPs, and water quality datasets - Carefully curated to support species-specific culture methods - Scale: Trained on approximately 3 million real and synthetic Q&A pairs, totaling around 1 billion tokens of high-quality, domain-specific data. - Base Model: Mistral 7B v0.3 (by Mistral AI) - Training Tokens: ~1 billion - Released On: July 4, 2025 - Data Volume: 3M+ expert-verified and synthetic instructions - Origin: Made in America by Kurma AI - Training Technique: Fine-tuned via LoRA-based Supervised Fine-Tuning (SFT).
- Training Infrastructure: Trained on a multi-cluster setup of 16 NVIDIA H200 GPUs. Special thanks to Nebius 🙏 Acknowledgements This project was made possible thanks to: - Nebius, for providing a compute grant and access to NVIDIA H200 GPU servers, which powered the model training process. - Mistral AI, for sharing their open-source language models, which made this project possible. - The Kurma AI research team: aquaculture experts, machine learning engineers, data annotators, and advisors who collaborated to curate, verify, and refine the domain-specific dataset used for fine-tuning this model. - Domain Bias: The model may reflect inherent biases present in the aquaculture data sources and industry practices on which it was trained. - Temporal Data Limitation: Climate and environmental recommendations are based on information available up to 2024. Users should cross-check any climate-related advice against the latest advisories (e.g., IMD or NOAA updates). - Potential Hallucinations: Like all large language models, AQUA-7B may occasionally generate inaccurate or misleading responses ("hallucinations"). Always validate critical, regulatory, or high-impact decisions with a qualified aquaculture professional.
Absolute_Zero_Reasoner-Coder-3b-GGUF
AceInstruct-1.5B-GGUF
Arch-Agent-1.5B-GGUF
This model was generated using llama.cpp at commit `0142961a`. Click here to get info on choosing the right GGUF model format Overview Arch-Agent is a collection of state-of-the-art (SOTA) LLMs specifically designed for advanced function calling and agent-based applications. Designed to power sophisticated multi-step and multi-turn workflows, Arch-Agent excels at handling complex tasks that require intelligent tool selection, adaptive planning, and seamless integration with external APIs and services. Built with a focus on real-world agent deployments, Arch-Agent delivers leading performance in complex scenarios while maintaining reliability and precision across extended function call sequences. Key capabilities include: - Multi-Turn Function Calling: Maintains contextual continuity across multiple dialogue turns, enabling natural, ongoing conversations with nested or evolving tool use. - Multi-Step Function Calling: Plans and executes a sequence of function calls to complete complex tasks. Adapts dynamically based on intermediate results and decomposes goals into sub-tasks. - Agentic Capabilities: Advanced decision-making and workflow management for complex agentic tasks with seamless tool coordination and error recovery. For more details, including fine-tuning, inference, and deployment, please refer to our GitHub. Performance Benchmarks We evaluate the Katanemo Arch-Agent series on the Berkeley Function-Calling Leaderboard (BFCL) against commonly used models; the results (as of June 14th, 2025) are shown below. > [!NOTE] > For evaluation, we use YaRN scaling to deploy the models for multi-turn evaluation, and all Arch-Agent models are evaluated with a context length of 64K. Requirements The code for Arch-Agent-1.5B is available in the Hugging Face `transformers` library, and we recommend installing the latest version. How to use We use the following example to illustrate how to use our model to perform function calling tasks.
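The function-calling round trip can be sketched in plain Python. The tool schema, prompt wording, and parser below are illustrative stand-ins only, not Arch-Agent's actual prompt format; the exact format and special tokens are defined in the Katanemo GitHub repo.

```python
import json

# Hypothetical sketch of a function-calling round trip. The tool schema
# follows the common OpenAI-style shape; the prompt text is illustrative.
TOOLS = [{
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]

def build_prompt(user_msg: str) -> str:
    # Embed the tool schemas and the user request into a single prompt.
    return (
        "You may call the following tools by replying with JSON.\n"
        f"Tools: {json.dumps(TOOLS)}\n"
        f"User: {user_msg}\n"
    )

def parse_tool_call(model_output: str) -> dict:
    # Arch-Agent is trained to emit OpenAI-style JSON tool calls,
    # so the raw completion can be decoded directly.
    return json.loads(model_output)

# Simulated model output, standing in for an actual generation:
call = parse_tool_call('{"name": "get_weather", "arguments": {"city": "Paris"}}')
print(call["name"], call["arguments"])
```

In a real pipeline, `parse_tool_call` would receive the model's completion and the parsed call would be dispatched to the matching local function.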
Please note that our model works best with our provided prompt format, which allows us to extract JSON output similar to OpenAI's function calling. License The Arch-Agent collection is distributed under the Katanemo license.
gemma-3-4b-it-qat-q4_0-GGUF
gpt-oss-20b-GGUF
OpenCodeReasoning-Nemotron-7B-GGUF
lucid-v1-nemo-GGUF
Llama-3.1-Minitron-4B-Depth-Base-GGUF
Minitron-4B-Base-GGUF
granite-20b-functioncalling-GGUF
This model was generated using llama.cpp at commit `0a5a3b5c`. Click here to get info on choosing the right GGUF model format Granite-20B-FunctionCalling Model Summary Granite-20B-FunctionCalling is a finetuned model based on IBM's granite-20b-code-instruct model, introducing function calling abilities into the Granite model family. The model is trained using a multi-task training approach on seven fundamental tasks encompassed in function calling: Nested Function Calling, Function Chaining, Parallel Functions, Function Name Detection, Parameter-Value Pair Detection, Next-Best Function, and Response Generation. - Developers: IBM Research - Paper: Granite-Function Calling Model: Introducing Function Calling Abilities via Multi-task Learning of Granular Tasks - Release Date: July 9th, 2024 - License: Apache 2.0 Usage Intended use The model is designed to respond to function calling related instructions. Generation This is a simple example of how to use the Granite-20B-FunctionCalling model.
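One of the seven tasks above, Parallel Functions, has the model emit several calls for a single request. The JSON shape below is an illustrative stand-in assuming an OpenAI-style call list; the model's actual output format is defined in the Granite function-calling paper.

```python
import json

# Illustrative stand-in for the Parallel Functions task: one user request
# ("compare IBM and AAPL prices") yields a list of independent calls that
# a client could execute concurrently.
raw_output = (
    '[{"name": "get_stock_price", "arguments": {"ticker": "IBM"}}, '
    '{"name": "get_stock_price", "arguments": {"ticker": "AAPL"}}]'
)
calls = json.loads(raw_output)
for call in calls:
    print(call["name"], call["arguments"]["ticker"])
```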
Meta-Llama-3-8B-GGUF
Absolute_Zero_Reasoner-Coder-7b-GGUF
SmallThinker-4BA0.6B-Instruct
Mellum-4b-base-GGUF
Mistral-7B-Instruct-v0.2-GGUF
granite-3.1-1b-a400m-instruct-GGUF
This model was generated using llama.cpp at commit `0a5a3b5c`. Click here to get info on choosing the right GGUF model format Model Summary: Granite-3.1-1B-A400M-Instruct is a 1B-parameter (400M active) long-context instruct model finetuned from Granite-3.1-1B-A400M-Base using a combination of open-source instruction datasets with permissive licenses and internally collected synthetic datasets tailored for solving long-context problems. This model is developed using a diverse set of techniques with a structured chat format, including supervised finetuning, model alignment using reinforcement learning, and model merging. - Developers: Granite Team, IBM - GitHub Repository: ibm-granite/granite-3.1-language-models - Website: Granite Docs - Paper: Granite 3.1 Language Models (coming soon) - Release Date: December 18th, 2024 - License: Apache 2.0 Supported Languages: English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese. Users may finetune Granite 3.1 models for languages beyond these 12 languages. Intended Use: The model is designed to respond to general instructions and can be used to build AI assistants for multiple domains, including business applications. Capabilities: Summarization, Text classification, Text extraction, Question-answering, Retrieval-Augmented Generation (RAG), Code-related tasks, Function-calling tasks, Multilingual dialog use cases, Long-context tasks including long document/meeting summarization, long document QA, etc. Generation: This is a simple example of how to use the Granite-3.1-1B-A400M-Instruct model. Then, copy the snippet from the section that is relevant for your use case.
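In the benchmark tables below, the Avg column is the arithmetic mean of the six per-task scores. As a quick sanity check, using the Granite-3.1-1B-A400M-Instruct row from the first table:

```python
# ARC-Challenge, Hellaswag, MMLU, TruthfulQA, Winogrande, GSM8K scores
# for Granite-3.1-1B-A400M-Instruct, copied from the table.
scores = [42.66, 65.97, 26.13, 46.77, 62.35, 33.88]
avg = round(sum(scores) / len(scores), 2)
print(avg)  # 46.29, matching the reported Avg
```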
| Models | ARC-Challenge | Hellaswag | MMLU | TruthfulQA | Winogrande | GSM8K | Avg |
|---|---|---|---|---|---|---|---|
| Granite-3.1-8B-Instruct | 62.62 | 84.48 | 65.34 | 66.23 | 75.37 | 73.84 | 71.31 |
| Granite-3.1-2B-Instruct | 54.61 | 75.14 | 55.31 | 59.42 | 67.48 | 52.76 | 60.79 |
| Granite-3.1-3B-A800M-Instruct | 50.42 | 73.01 | 52.19 | 49.71 | 64.87 | 48.97 | 56.53 |
| Granite-3.1-1B-A400M-Instruct | 42.66 | 65.97 | 26.13 | 46.77 | 62.35 | 33.88 | 46.29 |

| Models | IFEval | BBH | MATH Lvl 5 | GPQA | MUSR | MMLU-Pro | Avg |
|---|---|---|---|---|---|---|---|
| Granite-3.1-8B-Instruct | 72.08 | 34.09 | 21.68 | 8.28 | 19.01 | 28.19 | 30.55 |
| Granite-3.1-2B-Instruct | 62.86 | 21.82 | 11.33 | 5.26 | 4.87 | 20.21 | 21.06 |
| Granite-3.1-3B-A800M-Instruct | 55.16 | 16.69 | 10.35 | 5.15 | 2.51 | 12.75 | 17.1 |
| Granite-3.1-1B-A400M-Instruct | 46.86 | 6.18 | 4.08 | 0 | 0.78 | 2.41 | 10.05 |

Model Architecture: Granite-3.1-1B-A400M-Instruct is based on a decoder-only sparse Mixture of Experts (MoE) transformer architecture. Core components of this architecture are: GQA and RoPE, MLP with SwiGLU, RMSNorm, and shared input/output embeddings. Training Data: Overall, our SFT data is largely comprised of three key sources: (1) publicly available datasets with permissive licenses, (2) internal synthetic data targeting specific capabilities including long-context tasks, and (3) very small amounts of human-curated data. A detailed attribution of datasets can be found in the Granite 3.0 Technical Report, Granite 3.1 Technical Report (coming soon), and Accompanying Author List. Infrastructure: We train Granite 3.1 Language Models using IBM's supercomputing cluster, Blue Vela, which is outfitted with NVIDIA H100 GPUs. This cluster provides a scalable and efficient infrastructure for training our models over thousands of GPUs. Ethical Considerations and Limitations: Granite 3.1 Instruct Models are primarily finetuned using instruction-response pairs mostly in English, but also multilingual data covering eleven languages. Although this model can handle multilingual dialog use cases, its performance might not be similar to English tasks.
In such cases, introducing a small number of examples (few-shot) can help the model generate more accurate outputs. While this model has been aligned with safety in mind, it may in some cases produce inaccurate, biased, or unsafe responses to user prompts. We therefore urge the community to use this model with proper safety testing and tuning tailored to their specific tasks. Resources - ⭐️ Learn about the latest updates with Granite: https://www.ibm.com/granite - 📄 Get started with tutorials, best practices, and prompt engineering advice: https://www.ibm.com/granite/docs/ - 💡 Learn about the latest Granite learning resources: https://ibm.biz/granite-learning-resources
Mistral-7B-Instruct-v0.1-GGUF
granite-3.0-8b-base-GGUF
granite-3.0-8b-instruct-GGUF
granite-3.0-8b-lora-intrinsics-v0.1-GGUF
This model was generated using llama.cpp at commit `0a5a3b5c`. Click here to get info on choosing the right GGUF model format Think of Experiments as a preview of what's to come. These projects are still under development, but we wanted to let the open-source community take them for a spin! Use them, break them, and help us build what's next for Granite – we'll keep an eye out for feedback and questions in the Community section. Happy exploring! Granite 3.0 8B Instruct - Intrinsics LoRA v0.1 is a merged LoRA finetune of ibm-granite/granite-3.0-8b-instruct, providing access to the Uncertainty, Hallucination Detection, and Safety Exception intrinsics while retaining the full abilities of the ibm-granite/granite-3.0-8b-instruct model. - Developer: IBM Research - Model type: LoRA adapter for ibm-granite/granite-3.0-8b-instruct - License: Apache 2.0 Uncertainty Intrinsic The Uncertainty intrinsic is designed to provide a certainty score for model responses to user questions. Certainty score definition The model will respond with a number from 0 to 9, corresponding to 5%, 15%, 25%, ..., 95% confidence respectively. This percentage is calibrated in the following sense: given a set of answers assigned a certainty score of X%, approximately X% of these answers should be correct.
See the eval experiment below for out-of-distribution verification of this behavior. Hallucination Detection (RAG) Intrinsic The Hallucination Detection intrinsic is designed to detect when an assistant response to a user question with supporting documents is not supported by those documents. A response of `Y` indicates hallucination; `N` indicates no hallucination. Safety Exception Intrinsic The Safety Exception intrinsic is designed to raise an exception when the user query is unsafe, responding with `Y` (unsafe) or `N` (safe). It was designed as a binary classifier that analyses the user’s prompt to detect a variety of harms, including violence, threats, sexual and explicit content, and requests to obtain personally identifiable information. This is an experimental LoRA testing new functionality being developed for IBM's Granite LLM family. We welcome the community to test it out and give us feedback, but we are NOT recommending this model be used for real deployments at this time. Stay tuned for more updates on the Granite roadmap. Granite 3.0 8B Instruct - Intrinsics LoRA v0.1 is lightly tuned so that its behavior closely mimics that of ibm-granite/granite-3.0-8b-instruct, with the added ability to generate the three specified intrinsics. Invoking intrinsics Each intrinsic is associated with its own generation role and has its own usage steps. Note that each intrinsic responds with only one token, and any additional text after this token should be ignored. You can curb additional generation by setting "max token length" = 1 when using any intrinsic. Uncertainty Intrinsic Usage Steps Answering a question and obtaining a certainty score proceeds as follows. 1. Prompt the model with a system prompt (required) followed by the user prompt. 2. Use the model to generate a response as normal (via the `assistant` role). 3.
Invoke the Uncertainty intrinsic by generating in the `certainty` role (use "certainty" as the role in the chat template, or simply append ` certainty ` and continue generating); see the examples below. 4. The model will respond with an integer certainty score from 0 to 9. The model was calibrated with the following system prompt: `You are an AI language model developed by IBM Research. You are a cautious assistant. You carefully follow instructions. You are helpful and harmless and you follow ethical guidelines and promote positive behavior.` You can further augment this system prompt for a given use case or task, but it is recommended that your system prompt always starts with this string. Hallucination Detection Intrinsic Usage Steps Answering a question and detecting hallucination proceeds as follows. 1. Prompt the model with the system prompt (required) followed by the user prompt. 2. Use the model to generate a response as normal (via the `assistant` role). 3. Invoke the Hallucination Detection intrinsic by generating in the `hallucination` role (use "hallucination" as the role in the chat template, or simply append ` hallucination ` and continue generating); see the examples below. 4. The model will respond with `Y` or `N`. Safety Exception Intrinsic Usage Steps Determining if a user query is safe proceeds as follows. 1. Prompt the model with the system prompt (required) followed by the user prompt. 2. Invoke the Safety Exception intrinsic by generating in the `safety` role (use "safety" as the role in the chat template, or simply append ` safety ` and continue generating); see the examples below. 3. The model will respond with `Y` (unsafe) or `N` (safe). Combining Intrinsics In many pipelines, it may be desirable to invoke multiple intrinsics at different points. In a multi-turn conversation possibly involving other intrinsics, it is important to use attention masking to provide only the relevant information to the intrinsic of interest.
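The Uncertainty usage steps above can be sketched in plain Python, where `generate` stands in for a real call to the hosted model (the role names follow this card; the exact chat-template tokens are model-specific and omitted here):

```python
SYSTEM_PROMPT = (
    "You are an AI language model developed by IBM Research. You are a "
    "cautious assistant. You carefully follow instructions. You are helpful "
    "and harmless and you follow ethical guidelines and promote positive "
    "behavior."
)

def answer_with_certainty(question: str, generate):
    """Steps 1-4: prompt, answer in the `assistant` role, then invoke the
    `certainty` role for a single-token 0-9 score."""
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},   # step 1
        {"role": "user", "content": question},
    ]
    answer = generate(messages, role="assistant")       # step 2
    messages.append({"role": "assistant", "content": answer})
    score = generate(messages, role="certainty",        # step 3
                     max_new_tokens=1)                  # intrinsic emits one token
    return answer, int(score)                           # step 4
```

The same pattern applies to the `hallucination` and `safety` roles, each of which returns a single `Y`/`N` token instead of a digit.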
We explore two frameworks for accomplishing this - Prompt Declaration Language (PDL) and SGLang. In the examples below, we explore the following RAG flow. First, a user query is provided with relevant documents supplied by a RAG system. We can invoke the Safety Exception intrinsic to determine if the query is safe. If it is safe, we can proceed to generate an answer to the question as normal. Finally, we can evaluate the certainty and hallucination status of this reply by invoking the Uncertainty and Hallucination Detection intrinsics. Intrinsics Example with PDL Given a hosted instance of Granite 3.0 8B Instruct - Intrinsics LoRA v0.1 at `APIBASE` (insert the host address here), the following uses the PDL language to implement the RAG intrinsic invocation scenario described above. Note that the hosted instance must be supported by LiteLLM (https://docs.litellm.ai/docs/providers). First, create a file `intrinsics.pdl` with the following content. Next, create a file `intrinsics-defs.pdl` with the following content. To run the example, run `pdl intrinsics.pdl` on the command line after installing the PDL CLI (`pip install prompt-declaration-language`). Intrinsics Example with SGLang The SGLang implementation below uses the SGLang fork at https://github.com/frreiss/sglang/tree/granite that supports Granite models. Notes Certainty score interpretation Certainty scores calibrated as defined above may at times seem biased towards moderate certainty scores for the following reasons. Firstly, as humans we tend to be overconfident in our evaluation of what we know and don't know - in contrast, a calibrated model is less likely to output very high or very low confidence scores, as these imply certainty of correctness or incorrectness.
Examples where you might see very low confidence scores might be on answers where the model's response was something to the effect of "I don't know", which is easy to evaluate as not being the correct answer to the question (though it is the appropriate one). Secondly, remember that the model is evaluating itself - correctness/incorrectness that may be obvious to us or to larger models may be less obvious to an 8B model. Finally, teaching a model every fact it knows and doesn't know is not possible, hence it must generalize to questions of wildly varying difficulty (some of which may be trick questions!) and to settings where it has not had its outputs judged. Intuitively, it does this by extrapolating based on related questions it has been evaluated on in training - this is an inherently inexact process and leads to some hedging. Certainty is inherently an intrinsic property of a model and its abilities. The Uncertainty intrinsic is not intended to predict the certainty of responses generated by any other models besides itself or ibm-granite/granite-3.0-8b-instruct. Additionally, certainty scores are distributional quantities, and so will do well on realistic questions in aggregate, but in principle may have surprising scores on individual red-teamed examples. Evaluation We evaluate the performance of the intrinsics themselves and the RAG performance of the model. We first find that the performance of the intrinsics in our shared model Granite 3.0 8B Instruct - Intrinsics LoRA v0.1 is not degraded versus the baseline procedure of maintaining 3 separate intrinsic models. Here, percent error is shown for the Hallucination Detection and Safety Exception intrinsics, as they have binary output, and Mean Absolute Error (MAE) is shown for the Uncertainty intrinsic, as it outputs numbers 0 to 9. For all, lower is better. Performance is calculated on a randomly drawn 400-sample validation set from each intrinsic's dataset.
We then find that RAG performance of Granite 3.0 8B Instruct - Intrinsics LoRA v0.1 does not suffer with respect to the base model ibm-granite/granite-3.0-8b-instruct. Here we evaluate the RAGBench benchmark on RAGAS faithfulness and correctness metrics. Training Details The Granite 3.0 8B Instruct - Intrinsics LoRA v0.1 model is a LoRA adapter finetuned to provide 3 desired intrinsic outputs - Uncertainty Quantification, Hallucination Detection, and Safety. UQ Training Data The following datasets were used for calibration and/or finetuning. Certainty scores were obtained via the method in [Thermometer: Towards Universal Calibration for Large Language Models (Shen et al., ICML 2024)](https://arxiv.org/abs/2403.08819). Datasets: BigBench, MRQA, newsqa, triviaqa, searchqa, openbookqa, webquestions, smiles-qa, orca-math, ARC-Easy, commonsenseqa, socialiqa, superglue, figqa, riddlesense, agnews, medmcqa, dream, codah, and piqa. RAG Hallucination Training Data The following public datasets were used for finetuning. The details of data creation for RAG response generation are available in the Granite Technical Report. For creating the hallucination labels for responses, the technique of Achintalwar et al. was used. Safety Exception Training Data The following public datasets were used for finetuning: yahma/alpaca-cleaned, nvidia/Aegis-AI-Content-Safety-Dataset-1.0, a subset of https://huggingface.co/datasets/Anthropic/hh-rlhf, ibm/AttaQ, google/civilcomments, and allenai/socialbiasframes. Kristjan Greenewald, Nathalie Baracaldo, Chulaka Gunasekara, Lucian Popa, Mandana Vaziri Help me test my AI-Powered Quantum Network Monitor Assistant with quantum-ready security checks: The full Open Source Code for the Quantum Network Monitor Service is available at my GitHub repos (repos with NetworkMonitor in the name): Source Code Quantum Network Monitor.
You will also find the code I use to quantize the models if you want to do it yourself: GGUFModelBuilder 💬 How to test: Choose an AI assistant type: - `TurboLLM` (GPT-4.1-mini) - `HugLLM` (Hugging Face open-source models) - `TestLLM` (Experimental CPU-only) What I’m Testing I’m pushing the limits of small open-source models for AI network monitoring, specifically: - Function calling against live network services - How small can a model go while still handling: - Automated Nmap security scans - Quantum-readiness checks - Network Monitoring tasks 🟡 TestLLM – Current experimental model (llama.cpp on 2 CPU threads on a Hugging Face Docker space): - ✅ Zero-configuration setup - ⏳ 30s load time (slow inference but no API costs). No token limit, as the cost is low. - 🔧 Help wanted! If you’re into edge-device AI, let’s collaborate! Other Assistants 🟢 TurboLLM – Uses gpt-4.1-mini: - It performs very well, but unfortunately OpenAI charges per token. For this reason, token usage is limited. - Create custom cmd processors to run .net code on Quantum Network Monitor Agents - Real-time network diagnostics and monitoring - Security audits - Penetration testing (Nmap/Metasploit) 🔵 HugLLM – Latest open-source models: - 🌐 Runs on the Hugging Face Inference API. Performs pretty well using the latest models hosted on Novita. 💡 Example commands you could test: 1. `"Give me info on my website's SSL certificate"` 2. `"Check if my server is using quantum safe encryption for communication"` 3. `"Run a comprehensive security audit on my server"` 4. `"Create a cmd processor to .. (whatever you want)"` Note you need to install a Quantum Network Monitor Agent to run the .net code on. This is a very flexible and powerful feature. Use with caution! I fund the servers used to create these model files, run the Quantum Network Monitor service, and pay for inference from Novita and OpenAI—all out of my own pocket. All the code behind the model creation and the Quantum Network Monitor project is open source.
Feel free to use whatever you find helpful. If you appreciate the work, please consider buying me a coffee ☕. Your support helps cover service costs and allows me to raise token limits for everyone.
OpenCodeReasoning-Nemotron-14B-GGUF
Arch-Agent-3B-GGUF
This model was generated using llama.cpp at commit `73e53dc8`. Click here to get info on choosing the right GGUF model format Overview Arch-Agent is a collection of state-of-the-art (SOTA) LLMs specifically designed for advanced function calling and agent-based applications. Designed to power sophisticated multi-step and multi-turn workflows, Arch-Agent excels at handling complex tasks that require intelligent tool selection, adaptive planning, and seamless integration with external APIs and services. Built with a focus on real-world agent deployments, Arch-Agent delivers leading performance in complex scenarios while maintaining reliability and precision across extended function call sequences. Key capabilities include: - Multi-Turn Function Calling: Maintains contextual continuity across multiple dialogue turns, enabling natural, ongoing conversations with nested or evolving tool use. - Multi-Step Function Calling: Plans and executes a sequence of function calls to complete complex tasks. Adapts dynamically based on intermediate results and decomposes goals into sub-tasks. - Agentic Capabilities: Advanced decision-making and workflow management for complex agentic tasks with seamless tool coordination and error recovery.
For more details, including fine-tuning, inference, and deployment, please refer to our GitHub. Performance Benchmarks We evaluate the Katanemo Arch-Agent series on the Berkeley Function-Calling Leaderboard (BFCL). We compare with commonly used models; the results (as of June 14th, 2025) are shown below. > [!NOTE] > For evaluation, we use YaRN scaling to deploy the models for Multi-Turn evaluation, and all Arch-Agent models are evaluated with a context length of 64K. Requirements The code for Arch-Agent-3B is available in the Hugging Face `transformers` library, and we recommend installing the latest version. How to use We use the following example to illustrate how to use our model to perform function calling tasks. Please note that our model works best with our provided prompt format, which allows us to extract JSON output similar to OpenAI's function calling. License The Arch-Agent collection is distributed under the Katanemo license.
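The card's own usage example is omitted above. As a rough, hypothetical stand-in, here is an OpenAI-style tool schema and a parser for the JSON tool call the model is expected to emit; the schema, tool name, and raw-output format are illustrative assumptions, not the authors' exact example:

```python
import json

# Hypothetical tool schema in the OpenAI function-calling style the card refers to.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a location.",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}]

def parse_tool_call(model_output: str):
    """Extract (name, arguments) from an OpenAI-style JSON tool call."""
    call = json.loads(model_output)
    return call["name"], call.get("arguments", {})

# Example: a raw model completion containing a single tool call.
name, args = parse_tool_call(
    '{"name": "get_weather", "arguments": {"location": "Seattle"}}'
)
```

In a real pipeline, `TOOLS` would be passed through the model's provided prompt format and `parse_tool_call` applied to the generated completion.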
Polaris-7B-Preview-GGUF
OpenMath-Mistral-7B-v0.1-hf-GGUF
OlympicCoder-7B-GGUF
Polaris-4B-Preview-GGUF
WebSailor-3B-GGUF
Arch-Agent-7B-GGUF
This model was generated using llama.cpp at commit `73e53dc8`. Click here to get info on choosing the right GGUF model format Overview Arch-Agent is a collection of state-of-the-art (SOTA) LLMs specifically designed for advanced function calling and agent-based applications. Designed to power sophisticated multi-step and multi-turn workflows, Arch-Agent excels at handling complex tasks that require intelligent tool selection, adaptive planning, and seamless integration with external APIs and services. Built with a focus on real-world agent deployments, Arch-Agent delivers leading performance in complex scenarios while maintaining reliability and precision across extended function call sequences. Key capabilities include: - Multi-Turn Function Calling: Maintains contextual continuity across multiple dialogue turns, enabling natural, ongoing conversations with nested or evolving tool use. - Multi-Step Function Calling: Plans and executes a sequence of function calls to complete complex tasks. Adapts dynamically based on intermediate results and decomposes goals into sub-tasks. - Agentic Capabilities: Advanced decision-making and workflow management for complex agentic tasks with seamless tool coordination and error recovery.
For more details, including fine-tuning, inference, and deployment, please refer to our GitHub. Performance Benchmarks We evaluate the Katanemo Arch-Agent series on the Berkeley Function-Calling Leaderboard (BFCL). We compare with commonly used models; the results (as of June 14th, 2025) are shown below. > [!NOTE] > For evaluation, we use YaRN scaling to deploy the models for Multi-Turn evaluation, and all Arch-Agent models are evaluated with a context length of 64K. Requirements The code for Arch-Agent-7B is available in the Hugging Face `transformers` library, and we recommend installing the latest version. How to use We use the following example to illustrate how to use our model to perform function calling tasks. Please note that our model works best with our provided prompt format, which allows us to extract JSON output similar to OpenAI's function calling. License The Arch-Agent collection is distributed under the Katanemo license.
granite-7b-base-GGUF
EXAONE-4.0-32B-GGUF
mOrpheus_3B-1Base_early_preview-v1-8600-GGUF
orpheus-finetuned-3b-GGUF
This model was generated using llama.cpp at commit `f505bd83`. Click here to get info on choosing the right GGUF model format 03/18/2025 – We are releasing our 3B Orpheus TTS model with additional finetunes. Code is available on GitHub: CanopyAI/Orpheus-TTS Orpheus TTS is a state-of-the-art, Llama-based Speech-LLM designed for high-quality, empathetic text-to-speech generation. This model has been finetuned to deliver human-level speech synthesis, achieving exceptional clarity, expressiveness, and real-time streaming performance. - Human-Like Speech: Natural intonation, emotion, and rhythm superior to SOTA closed-source models - Zero-Shot Voice Cloning: Clone voices without prior fine-tuning - Guided Emotion and Intonation: Control speech and emotion characteristics with simple tags - Low Latency: ~200ms streaming latency for real-time applications, reducible to ~100ms with input streaming - GitHub Repo: https://github.com/canopyai/Orpheus-TTS - Blog Post: https://canopylabs.ai/model-releases - Colab Inference Notebook: notebook link - One-Click Deployment on Baseten: https://www.baseten.co/library/orpheus-tts/ Check out our Colab (link to Colab) or GitHub (link to GitHub) for how to run easy inference on our finetuned models.
Model Misuse Do not use our models for impersonation without consent, misinformation or deception (including fake news or fraudulent calls), or any illegal or harmful activity. By using this model, you agree to follow all applicable laws and ethical guidelines. We disclaim responsibility for any use.
OpenCodeReasoning-Nemotron-32B-GGUF
Osmosis-Structure-0.6B-GGUF
granite-3.0-3b-a800m-instruct-GGUF
limbic-tool-use-0.5B-32K-GGUF
This model was generated using llama.cpp at commit `c7f3169c`. Click here to get info on choosing the right GGUF model format This model is a fine-tuned version of Qwen2.5-0.5B-Instruct specifically designed for evaluating function calls in the context of Model Context Protocol (MCP) tools. It can assess whether a function call is correct, uses the wrong tool, has incorrect parameter names, or has incorrect parameter values. - Base Model: Qwen/Qwen2.5-0.5B-Instruct - Fine-tuning Method: LoRA (Low-Rank Adaptation) - Task: Function Call Evaluation for MCP (Model Context Protocol) - Training Data: MCP Server Tools data from public MCP servers, with augmentation / synthetic data generation - Model Size: ~40MB (LoRA adapters only) - Context Length: 32,768 tokens The prompt for the model takes two inputs: - `availabletools` - a list of the tool schemas - `messagehistory` - the user request and model tool-call response as a list of JSON objects Output Format The model outputs evaluations in JSON format: - correct: Function call matches available tools and parameters exactly - incorrecttool: Function name doesn't exist in available tools - incorrectparameternames: Function exists but parameter names are wrong - incorrectparametervalues: Function and parameters exist but values are inappropriate Generate a Prediction To make a prediction, you must convert the formatted prompt into its chat format.
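The four verdicts in the limbic-tool-use card above can be mirrored with a simple rule-based checker. This is only an illustrative stand-in (the label strings and schema layout are assumptions), not the fine-tuned model itself, which additionally judges whether parameter values are appropriate:

```python
def classify_call(call, available_tools):
    """Classify a tool call against the available tool schemas, mirroring
    the card's verdict taxonomy. Value appropriateness is not judged here;
    that is what the fine-tuned model contributes."""
    tools = {t["name"]: t for t in available_tools}
    if call["name"] not in tools:
        return "incorrect_tool"
    expected = tools[call["name"]]["parameters"]["properties"]
    # Any argument name missing from the schema is a parameter-name error.
    if set(call["arguments"]) - set(expected):
        return "incorrect_parameter_names"
    return "correct"

# Hypothetical tool schema for illustration.
TOOLS = [{
    "name": "get_ssl_info",
    "parameters": {"type": "object",
                   "properties": {"host": {"type": "string"}}},
}]
```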
InternVL3-78B-GGUF
EXAONE-Deep-32B-GGUF
gemma-3-12b-it-qat-q4_0-GGUF
granite-3.0-3b-a800m-base-GGUF
SmolLM2-135M-Instruct-GGUF
Eagle2-2B-GGUF
granite-guardian-3.0-2b-GGUF
NextCoder-32B
TriLM_1.1B_Unpacked-GGUF
GneissWeb.7B_ablation_model_on_350B_GneissWeb.seed2-GGUF
TriLM_3.9B_Unpacked-GGUF
OpenCodeReasoning-Nemotron-32B-IOI-GGUF
Kevin-32B-GGUF
cogito-v1-preview-llama-8B-GGUF
TriLM_190M_Unpacked-GGUF
TriLM_390M_Unpacked-GGUF
Caller-GGUF
This model was generated using llama.cpp at commit `73e53dc8`. Click here to get info on choosing the right GGUF model format Caller (32B) is a robust model engineered for seamless integrations and optimized for managing complex tool-based interactions and API function calls. Its strength lies in precise execution, intelligent orchestration, and effective communication between systems, making it indispensable for sophisticated automation pipelines. - Architecture Base: Qwen2.5-32B - Parameter Count: 32B - License: Apache-2.0 - Managing integrations between CRMs, ERPs, and other enterprise systems - Running multi-step workflows with intelligent condition handling - Orchestrating external tool interactions like calendar scheduling, email parsing, or data extraction - Real-time monitoring and diagnostics in IoT or SaaS environments GGUF format available here License Caller (32B) is released under the Apache-2.0 License. You are free to use, modify, and distribute this model in both commercial and non-commercial applications, subject to the terms and conditions of the license. If you have questions or would like to share your experiences using Caller (32B), please connect with us on social media. We’re excited to see what you build—and how this model helps you innovate!
Help me test my AI-Powered Quantum Network Monitor Assistant with quantum-ready security checks: The full open-source code for the Quantum Network Monitor service is available at my GitHub repos (repos with NetworkMonitor in the name): Source Code Quantum Network Monitor. You will also find the code I use to quantize the models if you want to do it yourself: GGUFModelBuilder

💬 How to test: Choose an AI assistant type:
- `TurboLLM` (GPT-4.1-mini)
- `HugLLM` (Hugging Face open-source models)
- `TestLLM` (Experimental CPU-only)

What I'm Testing
I'm pushing the limits of small open-source models for AI network monitoring, specifically:
- Function calling against live network services
- How small can a model go while still handling:
  - Automated Nmap security scans
  - Quantum-readiness checks
  - Network monitoring tasks

🟡 TestLLM – Current experimental model (llama.cpp on 2 CPU threads on a Hugging Face Docker space):
- ✅ Zero-configuration setup
- ⏳ 30s load time (slow inference but no API costs). No token limits, as the cost is low.
- 🔧 Help wanted! If you're into edge-device AI, let's collaborate!

Other Assistants
🟢 TurboLLM – Uses gpt-4.1-mini:
- It performs very well, but unfortunately OpenAI charges per token, so token usage is limited.
- Create custom cmd processors to run .NET code on Quantum Network Monitor Agents
- Real-time network diagnostics and monitoring
- Security audits
- Penetration testing (Nmap/Metasploit)

🔵 HugLLM – Latest open-source models:
- 🌐 Runs on the Hugging Face Inference API. Performs pretty well using the latest models hosted on Novita.

💡 Example commands you could test:
1. `"Give me info on my website's SSL certificate"`
2. `"Check if my server is using quantum-safe encryption for communication"`
3. `"Run a comprehensive security audit on my server"`
4. `"Create a cmd processor to .. (whatever you want)"`

Note: you need to install a Quantum Network Monitor Agent to run the .NET code on. This is a very flexible and powerful feature.
Use with caution! I fund the servers used to create these model files, run the Quantum Network Monitor service, and pay for inference from Novita and OpenAI—all out of my own pocket. All the code behind the model creation and the Quantum Network Monitor project is open source. Feel free to use whatever you find helpful. If you appreciate the work, please consider buying me a coffee ☕. Your support helps cover service costs and allows me to raise token limits for everyone.
Qwen3Guard-Gen-8B-GGUF
This model was generated using llama.cpp at commit `b5bd0378`. I've been experimenting with a new quantization approach that selectively elevates the precision of key layers beyond what the default IMatrix configuration provides. In my testing, standard IMatrix quantization underperforms at lower bit depths, especially with Mixture of Experts (MoE) models. To address this, I'm using the `--tensor-type` option in `llama.cpp` to manually "bump" important layers to higher precision. You can see the implementation here: 👉 Layer bumping with llama.cpp

While this does increase model file size, it significantly improves precision for a given quantization level. I'd love your feedback: have you tried this? How does it perform for you?

Click here to get info on choosing the right GGUF model format

Qwen3Guard is a series of safety moderation models built upon Qwen3 and trained on a dataset of 1.19 million prompts and responses labeled for safety. The series includes models in three sizes (0.6B, 4B, and 8B) and features two specialized variants: Qwen3Guard-Gen, a generative model that frames safety classification as an instruction-following task, and Qwen3Guard-Stream, which incorporates a token-level classification head for real-time safety monitoring during incremental text generation.

This repository hosts Qwen3Guard-Gen, which offers the following key advantages:
- Three-Tiered Severity Classification: Enables detailed risk assessment by categorizing outputs into safe, controversial, and unsafe severity levels, supporting adaptation to diverse deployment scenarios.
- Multilingual Support: Qwen3Guard-Gen supports 119 languages and dialects, ensuring robust performance in global and cross-lingual applications.
- Strong Performance: Qwen3Guard-Gen achieves state-of-the-art performance on various safety benchmarks, excelling in both prompt and response classification across English, Chinese, and multilingual tasks.
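Because Qwen3Guard-Gen frames safety classification as an instruction-following task, its verdict arrives as generated text that the caller must parse. The sketch below assumes an output format with `Safety:` and `Categories:` lines; the exact format is defined by the official chat template, so treat the regexes as an illustration, not the canonical parser.

```python
import re

def parse_guard_output(text):
    """Extract a severity label and category list from a guard model's
    generated text. The `Safety:` / `Categories:` line format is an
    assumption for illustration; adapt the patterns to the real template."""
    label = re.search(r"Safety:\s*(Safe|Controversial|Unsafe)", text)
    cats = re.search(r"Categories:\s*(.+)", text)
    return {
        "label": label.group(1) if label else None,
        "categories": [c.strip() for c in cats.group(1).split(",")] if cats else [],
    }

example = "Safety: Unsafe\nCategories: Violent, Non-violent Illegal Acts"
print(parse_guard_output(example))
```

Parsing into a structured dict like this lets downstream code branch on the three-tier severity level (e.g. block `Unsafe`, escalate `Controversial` for review, pass `Safe`).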
For more details, please refer to our blog, GitHub, and Technical Report. The latest version of `transformers` is recommended, and `transformers>=4.51.0` is required. For deployment, you can use `sglang>=0.4.6.post1` or `vllm>=0.9.0` to create an OpenAI-compatible API endpoint, which you can then query with a standard OpenAI-style API call.

In Qwen3Guard, potential harms are classified into three severity levels:
- Unsafe: Content generally considered harmful across most scenarios.
- Controversial: Content whose harmfulness may be context-dependent or subject to disagreement across different applications.
- Safe: Content generally considered safe across most scenarios.

In the current version of Qwen3Guard, we consider the following safety categories:
- Violent: Content that provides detailed instructions, methods, or advice on how to commit acts of violence, including the manufacture, acquisition, or use of weapons. Also includes depictions of violence.
- Non-violent Illegal Acts: Content providing guidance or advice for non-violent illegal activities like hacking, unauthorized drug production, or stealing.
- Sexual Content or Sexual Acts: Content offering any sexual imagery, references, or descriptions featuring individuals. Also includes explicit sexual imagery, references, or descriptions of illegal or unethical sexual acts, such as rape, bestiality, incest, and sexual slavery.
- Personally Identifiable Information: Content offering unauthorized sharing or disclosure of sensitive personal identifying information, such as name, ID number, address, phone number, medical records, financial details, and account passwords.
- Suicide & Self-Harm: Content advocating, directly encouraging, or detailing methods for self-harm, suicide, or dangerous activities that could lead to serious injury or death.
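The OpenAI-style API call mentioned above can be sketched as follows. Only the request body is built here so the snippet runs offline; the model id, endpoint URL, and prompt are illustrative assumptions about a local vLLM or sglang deployment.

```python
import json

# Request body for an OpenAI-compatible chat endpoint (e.g. a server
# started with `vllm serve Qwen/Qwen3Guard-Gen-8B`). The URL, model id,
# and user message are assumptions; sending is commented out so this
# sketch has no network dependency.
payload = {
    "model": "Qwen/Qwen3Guard-Gen-8B",
    "messages": [
        {"role": "user", "content": "Tell me how to pick a lock."},
    ],
    "temperature": 0.0,  # deterministic output helps for moderation labels
}
body = json.dumps(payload)
# requests.post("http://localhost:8000/v1/chat/completions",
#               data=body, headers={"Content-Type": "application/json"})
print(json.loads(body)["model"])
```

The guard's chat template wraps the message for classification, so the generated completion is the safety verdict rather than an ordinary reply.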
- Unethical Acts: Any immoral or unethical content or acts, including but not limited to bias, discrimination, stereotyping, injustice, hate speech, offensive language, harassment, insults, threats, defamation, extremism, misinformation regarding ethics, and other behaviors that, while not illegal, are still considered unethical.
- Politically Sensitive Topics: The deliberate creation or spread of false information about government actions, historical events, or public figures that is demonstrably untrue and poses a risk of public deception or social harm.
- Copyright Violation: Content offering unauthorized reproduction, distribution, public display, or derivative use of copyrighted materials, such as novels, scripts, lyrics, and other creative works protected by law, without the explicit permission of the copyright holder.
- Jailbreak (input only): Content that explicitly attempts to override the model's system prompt or model conditioning.

If you find our work helpful, feel free to cite us.

Help me test my AI-Powered Quantum Network Monitor Assistant with quantum-ready security checks: The full open-source code for the Quantum Network Monitor service is available at my GitHub repos (repos with NetworkMonitor in the name): Source Code Quantum Network Monitor.
You will also find the code I use to quantize the models if you want to do it yourself: GGUFModelBuilder

💬 How to test: Choose an AI assistant type:
- `TurboLLM` (GPT-4.1-mini)
- `HugLLM` (Hugging Face open-source models)
- `TestLLM` (Experimental CPU-only)

What I'm Testing
I'm pushing the limits of small open-source models for AI network monitoring, specifically:
- Function calling against live network services
- How small can a model go while still handling:
  - Automated Nmap security scans
  - Quantum-readiness checks
  - Network monitoring tasks

🟡 TestLLM – Current experimental model (llama.cpp on 2 CPU threads on a Hugging Face Docker space):
- ✅ Zero-configuration setup
- ⏳ 30s load time (slow inference but no API costs). No token limits, as the cost is low.
- 🔧 Help wanted! If you're into edge-device AI, let's collaborate!

Other Assistants
🟢 TurboLLM – Uses gpt-4.1-mini:
- It performs very well, but unfortunately OpenAI charges per token, so token usage is limited.
- Create custom cmd processors to run .NET code on Quantum Network Monitor Agents
- Real-time network diagnostics and monitoring
- Security audits
- Penetration testing (Nmap/Metasploit)

🔵 HugLLM – Latest open-source models:
- 🌐 Runs on the Hugging Face Inference API. Performs pretty well using the latest models hosted on Novita.

💡 Example commands you could test:
1. `"Give me info on my website's SSL certificate"`
2. `"Check if my server is using quantum-safe encryption for communication"`
3. `"Run a comprehensive security audit on my server"`
4. `"Create a cmd processor to .. (whatever you want)"`

Note: you need to install a Quantum Network Monitor Agent to run the .NET code on. This is a very flexible and powerful feature. Use with caution! I fund the servers used to create these model files, run the Quantum Network Monitor service, and pay for inference from Novita and OpenAI—all out of my own pocket. All the code behind the model creation and the Quantum Network Monitor project is open source.
Feel free to use whatever you find helpful. If you appreciate the work, please consider buying me a coffee ☕. Your support helps cover service costs and allows me to raise token limits for everyone.