Mungert
Qwen3.5-122B-A10B-GGUF
Qwen3-Coder-Next-GGUF
C2S-Scale-Gemma-2-27B-GGUF
This model was generated using llama.cpp at commit `03792ad93`.

I've been experimenting with a new quantization approach that selectively raises the precision of key layers beyond what the default IMatrix configuration provides. In my testing, standard IMatrix quantization underperforms at lower bit depths, especially with Mixture of Experts (MoE) models. To address this, I use the `--tensor-type` option in `llama.cpp` to manually "bump" important layers to higher precision. You can see the implementation here: 👉 Layer bumping with llama.cpp. While this does increase model file size, it significantly improves precision for a given quantization level. I'd love your feedback: have you tried this? How does it perform for you?

Click here to get info on choosing the right GGUF model format.

- C2S-Scale Paper: Scaling Large Language Models for Next-Generation Single-Cell Analysis
- HuggingFace C2S Collection: C2S-Scale Models
- GitHub Repository: vandijklab/cell2sentence (code, tutorials, and discussions)
- Google Research Blog Post: Teaching machines the language of biology
- Authors: van Dijk Lab (Yale), Google Research, Google DeepMind

This section describes the C2S-Scale model and how to use it. C2S-Scale-Gemma-27B is a state-of-the-art open language model built on the Gemma-2 27B architecture and fine-tuned for single-cell biology. Developed through the Cell2Sentence (C2S) framework, the model processes and understands single-cell RNA sequencing (scRNA-seq) data by treating it as a language: it converts high-dimensional scRNA-seq expression data into "cell sentences" (ordered sequences of gene names), enabling a wide range of biological analyses. This work is the result of a collaboration between Yale University, Google Research, and Google DeepMind to scale up C2S models. The C2S-Scale models were trained on Google's TPU v5s, which allowed for a significant increase in model size and capability.
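As a concrete illustration of the layer-bumping approach described above, a `llama-quantize` invocation might hold selected tensors at higher precision while quantizing the rest at a low bit depth. The tensor patterns and type choices below are illustrative assumptions, not the exact recipe used for these files; check `llama-quantize --help` in your build for the supported syntax.

```shell
# Sketch: quantize to Q3_K_M overall, but "bump" selected tensors to
# higher-precision types. Patterns and types here are examples only.
./llama-quantize \
    --imatrix imatrix.dat \
    --tensor-type attn_v=q6_k \
    --tensor-type ffn_down=q5_k \
    model-f16.gguf model-Q3_K_M-bumped.gguf Q3_K_M
```

The trade-off is exactly as described above: the bumped tensors add a few percent to file size in exchange for noticeably better fidelity at the same nominal quantization level.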
These models excel at tasks such as cell type prediction, tissue classification, and generating biologically meaningful cell representations.

- Versatility: Demonstrates strong performance across a diverse set of single-cell and multi-cell tasks.
- Scalability: Trained on a massive dataset of over 57 million cells, showcasing the power of scaling LLMs for biological data.
- Generative Power: Capable of generating realistic single-cell gene expression profiles.
- Foundation for Fine-tuning: Can serve as a powerful pretrained foundation for specialized, domain-specific single-cell analysis tasks.

C2S-Scale can be a valuable tool for researchers in the following areas:

- In Silico Experiments: Generate cells under specific conditions or predict perturbational changes to form and test new biological hypotheses.
- Cell Atlas Annotation: Streamline the annotation of large-scale single-cell datasets by predicting cell types and tissues.
- Biomarker Discovery: Analyze gene patterns within cell sentences to identify potential markers for specific cell states or diseases.

Below are code snippets to help you get started running the model locally on a GPU. The model can be used for various tasks, further described in the C2S-Scale paper. To perform cell type prediction, the model expects a prompt containing the cell sentence followed by a query; the resulting prompt is in the format expected by the model for this task. See the Colab notebooks in our GitHub repository (C2S Tutorials) for examples of how to use C2S-Scale models for tasks like cell type prediction and generation.

C2S-Scale is based on the Gemma-2 family of lightweight, state-of-the-art open LLMs, which use a decoder-only transformer architecture. Base model: Gemma-2 27B. Fine-tuning data: a comprehensive collection of over 800 datasets from CellxGene and the Human Cell Atlas, totaling over 57 million human and mouse cells.
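To make the prompt structure concrete, here is a minimal sketch of turning an expression profile into a cell sentence (genes ordered by descending expression) and wrapping it in a cell-type query. The query wording and the `top_k` value are placeholders, not the official C2S template; consult the C2S tutorials for the exact format.

```python
# Sketch: build a "cell sentence" and a cell-type-prediction prompt.
# The rank-ordering convention follows the C2S description above; the
# exact prompt wording is a placeholder, not the official template.

def cell_sentence(expression, top_k=100):
    """Order genes by descending expression and keep the top_k names."""
    ranked = sorted(expression.items(), key=lambda kv: kv[1], reverse=True)
    return " ".join(gene for gene, _ in ranked[:top_k])

def cell_type_prompt(sentence):
    # Placeholder query; see the C2S tutorials for the real format.
    return f"{sentence}\nQuestion: What cell type is this?"

expr = {"CD3D": 9.1, "MALAT1": 12.4, "CD8A": 7.5, "GNLY": 0.2}
sentence = cell_sentence(expr, top_k=3)
print(sentence)  # prints "MALAT1 CD3D CD8A"
print(cell_type_prompt(sentence))
```

The resulting string is what you would pass to the model's tokenizer; the key invariant is the descending-expression ordering of gene names.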
Training approach: instruction fine-tuning using the Cell2Sentence framework, which converts scRNA-seq expression data into sequences of gene tokens. Model type: decoder-only transformer (based on Gemma-2). Key publication: Scaling Large Language Models for Next-Generation Single-Cell Analysis.

The performance of C2S-Scale models was validated on a wide range of single-cell and multi-cell tasks, including advanced downstream tasks such as cluster captioning, question answering, and perturbation prediction. C2S-Scale models demonstrated significant improvements over other open and closed-source models, establishing new state-of-the-art benchmarks for LLMs in single-cell biology. Please see our preprint for a full breakdown of performance metrics.

- Input: Text. For best performance, prompts should be structured according to the specific task (e.g., cell type prediction, conditioned generation). Inputs are "cell sentences": ordered, space-separated lists of gene names.
- Output: Text. The model generates text as a response, which can be a predicted label (such as a cell type or tissue), a full cell sentence, or a natural-language abstract.

CellxGene and Human Cell Atlas: the model was trained on a curated collection of over 800 public scRNA-seq datasets, encompassing more than 57 million cells. This data covers a broad range of tissues, cell types, and experimental conditions from both human and mouse, ensuring the model learns a robust and generalizable representation of cellular states. Evaluation was performed using held-out datasets and standardized benchmarks designed to test the model's capabilities on the tasks listed above; all evaluation methodologies followed established best practices for splitting data to ensure robust, unbiased assessment. The model weights shared on Hugging Face are licensed CC BY 4.0. The model was trained using JAX on Google TPU v5 hardware for efficient, large-scale training. Intended uses include research in single-cell genomics and computational biology.
Further uses: as a foundational model for fine-tuning on specific biological domains or datasets, and to aid in the annotation and interpretation of large-scale scRNA-seq experiments. C2S-Scale provides a powerful, versatile, and scalable tool for single-cell analysis. It offers:

- State-of-the-art performance on a wide range of scRNA-seq tasks.
- A unified framework for handling diverse single-cell analysis challenges.
- A foundation for building more specialized models from private or proprietary data.
- The ability to perform in silico generation of cellular data to explore biological hypotheses.

The model is trained on public data, and its knowledge is limited to the genes, cell types, and conditions present in that data. Performance on out-of-distribution data (e.g., completely novel cell types or technologies) is not guaranteed and requires validation. Performance on input prompt formats that deviate greatly from the training prompt formatting is also not guaranteed.

C2S-Scale links:
- Paper: Scaling Large Language Models for Next-Generation Single-Cell Analysis
- Google Research Blog Post: Teaching machines the language of biology: Scaling large language models for next-generation single-cell analysis
- GitHub: https://github.com/vandijklab/cell2sentence (Note: the codebase has a CC BY-NC-ND 4.0 license; only the weights shared on Hugging Face are CC BY 4.0)

Gemma-2 links:
- HuggingFace: https://huggingface.co/google/gemma-2-27b
- Gemma-2 Blog Post: Gemma explained: What's new in Gemma 2
- Technical report: https://storage.googleapis.com/deepmind-media/gemma/gemma-2-report.pdf

Help me test my AI-Powered Quantum Network Monitor Assistant with quantum-ready security checks. The full open-source code for the Quantum Network Monitor Service is available in my GitHub repos (repos with NetworkMonitor in the name): Source Code Quantum Network Monitor.
You will also find the code I use to quantize the models, if you want to do it yourself: GGUFModelBuilder.

💬 How to test: choose an AI assistant type:
- `TurboLLM` (GPT-4.1-mini)
- `HugLLM` (Hugging Face open-source models)
- `TestLLM` (experimental CPU-only)

What I'm testing: I'm pushing the limits of small open-source models for AI network monitoring, specifically:
- Function calling against live network services
- How small a model can go while still handling:
  - Automated Nmap security scans
  - Quantum-readiness checks
  - Network monitoring tasks

🟡 TestLLM – current experimental model (llama.cpp on 2 CPU threads on a Hugging Face Docker space):
- ✅ Zero-configuration setup
- ⏳ 30s load time (slow inference, but no API costs). No token limit, as the cost is low.
- 🔧 Help wanted! If you're into edge-device AI, let's collaborate!

Other assistants:

🟢 TurboLLM – uses GPT-4.1-mini:
- It performs very well, but unfortunately OpenAI charges per token, so token usage is limited.
- Create custom cmd processors to run .NET code on Quantum Network Monitor Agents
- Real-time network diagnostics and monitoring
- Security audits
- Penetration testing (Nmap/Metasploit)

🔵 HugLLM – latest open-source models:
- 🌐 Runs on the Hugging Face Inference API. Performs pretty well using the latest models hosted on Novita.

💡 Example commands you could test:
1. `"Give me info on my website's SSL certificate"`
2. `"Check if my server is using quantum-safe encryption for communication"`
3. `"Run a comprehensive security audit on my server"`
4. `"Create a cmd processor to .. (whatever you want)"` (Note: you need to install a Quantum Network Monitor Agent to run the .NET code on. This is a very flexible and powerful feature; use with caution!)

I fund the servers used to create these model files, run the Quantum Network Monitor service, and pay for inference from Novita and OpenAI, all out of my own pocket. All the code behind the model creation and the Quantum Network Monitor project is open source.
Feel free to use whatever you find helpful. If you appreciate the work, please consider buying me a coffee ☕. Your support helps cover service costs and allows me to raise token limits for everyone.
NVIDIA-Nemotron-3-Super-120B-A12B-BF16-GGUF
Tongyi-DeepResearch-30B-A3B-GGUF
This model was generated using llama.cpp at commit `a2054e3a8`, using the same selective layer-precision ("layer bumping") quantization approach described above.

We present Tongyi DeepResearch, an agentic large language model featuring 30 billion total parameters, with only 3 billion activated per token. Developed by Tongyi Lab, the model is specifically designed for long-horizon, deep information-seeking tasks. Tongyi-DeepResearch demonstrates state-of-the-art performance across a range of agentic search benchmarks, including Humanity's Last Exam, BrowserComp, BrowserComp-ZH, WebWalkerQA, GAIA, xbench-DeepSearch, and FRAMES.

- ⚙️ Fully automated synthetic data generation pipeline: We design a highly scalable data synthesis pipeline that is fully automatic and powers agentic pre-training, supervised fine-tuning, and reinforcement learning.
- 🔄 Large-scale continual pre-training on agentic data: Leveraging diverse, high-quality agentic interaction data to extend model capabilities, maintain freshness, and strengthen reasoning performance.
- 🔁 End-to-end reinforcement learning: We employ a strictly on-policy RL approach based on a customized Group Relative Policy Optimization framework, with token-level policy gradients, leave-one-out advantage estimation, and selective filtering of negative samples to stabilize training in a non-stationary environment.
- 🤖 Agent inference paradigm compatibility: At inference time, Tongyi-DeepResearch supports two paradigms: ReAct, for rigorously evaluating the model's core intrinsic abilities, and an IterResearch-based "Heavy" mode, which uses a test-time scaling strategy to unlock the model's maximum performance ceiling.

You can download the model and then run the inference scripts in https://github.com/Alibaba-NLP/DeepResearch.
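The leave-one-out advantage estimation mentioned in the RL bullet above can be sketched in a few lines: each sampled trajectory's baseline is the mean reward of the *other* samples in its group. This is an illustrative sketch of the general technique, not the authors' exact implementation.

```python
# Leave-one-out (LOO) advantage estimation for a group of sampled
# rewards: advantage_i = reward_i - mean(rewards excluding reward_i).
# Requires at least two samples per group.

def loo_advantages(rewards):
    n = len(rewards)
    total = sum(rewards)
    # (total - r) / (n - 1) is the mean of all rewards except r.
    return [r - (total - r) / (n - 1) for r in rewards]

advs = loo_advantages([1.0, 2.0, 3.0])
print(advs)  # [-1.5, 0.0, 1.5]
```

Compared with using the full-group mean, the leave-one-out baseline keeps each sample's baseline independent of its own reward, which reduces bias in the per-sample gradient estimate.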
Glyph-GGUF
This model was generated using llama.cpp at commit `16724b5b6`, using the same selective layer-precision ("layer bumping") quantization approach described above.

Glyph: Scaling Context Windows via Visual-Text Compression
- Repository: https://github.com/thu-coai/Glyph
- Paper: https://arxiv.org/abs/2510.17800

Glyph is a framework for scaling context length through visual-text compression. Instead of extending token-based context windows, Glyph renders long textual sequences into images and processes them with vision-language models (VLMs). This design transforms the challenge of long-context modeling into a multimodal problem, substantially reducing computational and memory costs while preserving semantic information. A simple example of single-image inference uses the `transformers` library; first, install `transformers`.

Known limitations:
- Sensitivity to rendering parameters: Glyph's performance can vary with rendering settings such as resolution, font, and spacing. Since our search procedure adopts a fixed rendering configuration during post-training, the model may not generalize well to unseen or substantially different rendering styles.
- OCR-related challenges: Recognizing fine-grained or rare alphanumeric strings (e.g., UUIDs) remains difficult for vision-language models, especially with ultra-long inputs, sometimes leading to minor character misclassification.
- Limited generalization: Training of Glyph mainly targets long-context understanding, and its capability on broader tasks has yet to be studied.

Citation: If you find our model useful in your work, please cite it.
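The rendering step itself needs a rasterizer, but Glyph's core compression idea (pack many text tokens into each image "page" before handing pages to a VLM) can be sketched with plain string slicing. The page-size numbers below are arbitrary examples, not Glyph's tuned rendering configuration.

```python
# Minimal sketch of visual-text compression pre-processing: split a
# long text into fixed-size "pages", each of which would be rendered
# to one image for the VLM. Font/DPI/spacing rendering is omitted.

def paginate(text, chars_per_line=80, lines_per_page=40):
    lines = [text[i:i + chars_per_line] for i in range(0, len(text), chars_per_line)]
    pages = [lines[i:i + lines_per_page] for i in range(0, len(lines), lines_per_page)]
    return ["\n".join(p) for p in pages]

doc = "x" * 10_000
pages = paginate(doc)
print(len(pages))  # 10000 chars / (80 * 40 chars per page) -> 4 pages
```

Each page would then be rendered to an image, so the VLM's visual tokens per page (rather than one text token per word) determine the effective context compression ratio.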
MiroThinker-v1.5-30B-GGUF
agentflow-planner-7b-GGUF
This model was generated using llama.cpp at commit `03792ad93`, using the same selective layer-precision ("layer bumping") quantization approach described above.

AgentFlow Planner Agent 7B checkpoint (built upon Qwen2.5-7B-Instruct):
- HF Paper: https://huggingface.co/papers/date/2025-10-08
- Code: https://github.com/lupantech/AgentFlow
- Demo: https://huggingface.co/spaces/AgentFlow/agentflow
- Website: https://agentflow.stanford.edu/
- YouTube: https://www.youtube.com/watch?v=kIQbCQIH1SI
- X (Twitter): https://x.com/lupantech/status/1976016000345919803
aquif-3.5-Max-42B-A3B-GGUF
This model was generated using llama.cpp at commit `7f09a680a`, using the same selective layer-precision ("layer bumping") quantization approach described above.

The pinnacle of the aquif-3.5 series, released November 3rd, 2025. These models bring advanced reasoning capabilities and unprecedented context windows, achieving state-of-the-art performance for their respective categories. aquif-3.5-Plus combines hybrid reasoning with interchangeable thinking modes, offering flexibility for both speed-optimized and reasoning-intensive applications. aquif-3.5-Max represents frontier model capabilities with a reasoning-only architecture, delivering exceptional performance across all benchmark categories.

| Model | HuggingFace Repository |
|-------|------------------------|
| aquif-3.5-Plus | aquiffoo/aquif-3.5-Plus |
| aquif-3.5-Max | aquiffoo/aquif-3.5-Max |

| Model | Total (B) | Active Params (B) | Reasoning | Context Window | Thinking Modes |
|-------|-----------|-------------------|-----------|----------------|----------------|
| aquif-3.5-Plus | 30.5 | 3.3 | ✅ Hybrid | 1M | ✅ Interchangeable |
| aquif-3.5-Max | 42.4 | 3.3 | ✅ Reasoning-Only | 1M | Reasoning-Only |

aquif-3.5-Plus (hybrid reasoning with interchangeable modes): a breakthrough hybrid reasoning model offering unprecedented flexibility.
Toggle between thinking and non-thinking modes to optimize for your specific use case: maintain reasoning capabilities when needed, or prioritize speed for time-sensitive applications.

Artificial Analysis Intelligence Index (AAII) benchmarks:

| Benchmark | aquif-3.5-Plus (Non-Reasoning) | aquif-3.5-Plus (Reasoning) | aquif-3.5-Max |
|-----------|--------------------------------|----------------------------|---------------|
| MMLU-Pro | 80.2 | 82.8 | 85.4 |
| GPQA Diamond | 72.1 | 79.7 | 83.2 |
| AIME 2025 | 64.7 | 90.3 | 94.6 |
| LiveCodeBench | 50.5 | 76.4 | 81.6 |
| Humanity's Last Exam | 4.3 | 12.1 | 15.6 |
| TAU2-Telecom | 34.2 | 41.5 | 51.3 |
| IFBench | 39.3 | 54.3 | 65.4 |
| TerminalBench-Hard | 10.1 | 15.2 | 23.9 |
| AA-LCR | 30.4 | 59.9 | 61.2 |
| SciCode | 29.5 | 35.7 | 40.9 |
| AAII Composite Score | 42 (41.53) | 55 (54.79) | 60 (60.31) |

| Model | AAII Score |
|-------|------------|
| GPT-5 mini | 42 |
| Claude Haiku 4.5 | 42 |
| Gemini 2.5 Flash Lite 2509 | 42 |
| aquif-3.5-Plus (Non-Reasoning) | 42 |
| DeepSeek V3 0324 | 41 |
| Qwen3 VL 32B Instruct | 41 |
| Qwen3 Coder 480B A35B | 42 |

| Model | AAII Score |
|-------|------------|
| GLM-4.6 | 56 |
| Gemini 2.5 Flash 2509 | 54 |
| Claude Haiku 4.5 | 55 |
| aquif-3.5-Plus (Reasoning) | 55 |
| Qwen3 Next 80B A3B | 54 |

| Model | AAII Score |
|-------|------------|
| Gemini 2.5 Pro | 60 |
| Grok 4 Fast | 60 |
| aquif-3.5-Max | 60 |
| MiniMax-M2 | 61 |
| gpt-oss-120B high | 61 |
| GPT-5 mini | 61 |
| DeepSeek-V3.1-Terminus | 58 |
| Claude Opus 4.1 | 59 |

Massive context windows: both models support up to 1M tokens, enabling analysis of entire codebases, research papers, and extensive conversation histories without truncation. Efficient architecture: despite offering frontier-level performance, both models maintain exceptional efficiency through an optimized mixture-of-experts design and an active parameter count of just 3.3B.
Flexible reasoning (Plus only): aquif-3.5-Plus provides interchangeable thinking modes; enable reasoning for complex problems, or disable it for faster inference on straightforward tasks. Multilingual support: native support across English, German, Italian, Portuguese, French, Hindi, Spanish, Thai, Chinese, and Japanese.

aquif-3.5-Plus:
- Complex reasoning requiring flexibility between speed and depth
- Scientific analysis and mathematical problem-solving with thinking enabled
- Rapid-response applications with thinking disabled
- Code generation and review
- Multilingual applications up to 1M-token contexts

aquif-3.5-Max:
- Frontier-level problem-solving without compromise
- Advanced research and scientific computing
- Competition mathematics and algorithmic challenges
- Comprehensive code analysis and generation
- Complex multilingual tasks requiring maximum reasoning capability

Toggle between thinking and non-thinking modes by modifying the chat template: simply set the variable in your chat template before inference to switch modes. No model reloading is required.

Both models support:
- BF16 and FP16 precision
- Mixture-of-experts architecture optimizations
- Efficient attention mechanisms with optimized KV caching
- Up to 1M-token context window
- Multi-head attention with sparse routing

aquif-3.5-Plus achieves 82.3% average benchmark performance in thinking mode, surpassing models with 2-4x more total parameters; non-thinking mode maintains a competitive 66.9% for latency-sensitive applications. aquif-3.5-Max reaches 86.2% average performance, matching or exceeding frontier models while keeping just 42.4B total parameters, an extraordinary efficiency result.

Acknowledgements:
- Qwen Team: base architecture contributions
- Meta Llama Team: core model foundations
- Hugging Face: model hosting and training infrastructure

This project is released under the Apache 2.0 License. See the LICENSE file for details.
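As a toy illustration of a template-level mode switch: the real mechanism lives in the model's Jinja chat template, and both the variable name and the marker strings below are assumptions for illustration, not the model's actual special tokens. Check the bundled chat template for the real switch.

```python
# Hypothetical sketch of a thinking-mode toggle applied at prompt
# render time. The "/think" and "/no_think" markers and the
# enable_thinking flag are illustrative placeholders only.

def render_prompt(messages, enable_thinking=True):
    tag = "/think" if enable_thinking else "/no_think"
    system = f"<|system|>{tag}"
    turns = "".join(f"<|{m['role']}|>{m['content']}" for m in messages)
    return system + turns

msgs = [{"role": "user", "content": "Solve 17 * 24."}]
print(render_prompt(msgs, enable_thinking=False))
```

The point is that the switch happens entirely in the rendered prompt, which is why no model reloading is required.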
Olmo-3-1125-32B-GGUF
Qwen3.5-27B-GGUF
gpt-oss-safeguard-20b-GGUF
pokee_research_7b-GGUF
Qwen2.5-VL-7B-Instruct-GGUF
Qwen2.5-VL-3B-Instruct-GGUF
SWE-agent-LM-32B-GGUF
This model was generated using llama.cpp at commit `064cc596`. SWE-agent-LM-32B is a language model for software engineering, trained using the SWE-smith toolkit. We introduce this model as part of our work SWE-smith: Scaling Data for Software Engineering Agents. SWE-agent-LM-32B is 100% open source. Training this model was simple: we fine-tuned Qwen 2.5 Coder Instruct on 5k trajectories generated by SWE-agent + Claude 3.7 Sonnet. The dataset can be found here. SWE-agent-LM-32B is compatible with SWE-agent, and running this model locally only takes a few steps! Check [here]() for more instructions on how to do so. If you found this work exciting and want to push SWE-agents further, please feel free to connect with us (the SWE-bench team)!
Fathom-Search-4B-GGUF
Orchestrator-8B-GGUF
Nanonets-OCR2-3B-GGUF
This model was generated using llama.cpp at commit `3cfa9c3f1`.

Nanonets-OCR2: a model for transforming documents into structured markdown with intelligent content recognition and semantic tagging. Nanonets-OCR2 by Nanonets is a family of powerful, state-of-the-art image-to-markdown OCR models that go far beyond traditional text extraction, making the output ideal for downstream processing by Large Language Models (LLMs). Nanonets-OCR2 is packed with features designed to handle complex documents with ease:

- LaTeX Equation Recognition: automatically converts mathematical equations and formulas into properly formatted LaTeX syntax, distinguishing between inline (`$...$`) and display (`$$...$$`) equations.
- Intelligent Image Description: describes images within documents using structured ` ` tags, making them digestible for LLM processing. It can describe various image types, including logos, charts, and graphs, detailing their content, style, and context.
- Signature Detection & Isolation: identifies and isolates signatures from other text, outputting them within a ` ` tag. This is crucial for processing legal and business documents.
- Watermark Extraction: detects and extracts watermark text from documents, placing it within a ` ` tag.
- Smart Checkbox Handling: converts form checkboxes and radio buttons into standardized Unicode symbols (`☐`, `☑`, `☒`) for consistent and reliable processing.
- Complex Table Extraction: accurately extracts complex tables from documents and converts them into both markdown and HTML table formats.
- Flow Charts & Organisational Charts: extracts flow charts and organisational charts as Mermaid code.
- Handwritten Documents: the model is trained on handwritten documents across multiple languages.
- Multilingual: the model is trained on documents in multiple languages, including English, Chinese, French, Spanish, Portuguese, German, Italian, Russian, Japanese, Korean, Arabic, and many more.
- Visual Question Answering (VQA): the model is designed to provide the answer directly if it is present in the document; otherwise, it responds with "Not mentioned."

Nanonets-OCR2 Family

| Model | Access Link |
| ----- | ----- |
| Nanonets-OCR2-Plus | Docstrange link |
| Nanonets-OCR2-3B | 🤗 link |
| Nanonets-OCR2-1.5B-exp | 🤗 link |

(Benchmark comparison: win/lose/both-correct rates vs Nanonets-OCR2-Plus and Nanonets-OCR2-3B across datasets, compared with Qwen2.5-VL-72B-Instruct and Gemini 2.5 Flash; the table data did not survive extraction.)

Tips to improve accuracy:
1. Increasing the image resolution will improve the model's performance.
2. For complex tables (e.g. financial documents), using `repetition_penalty=1` gives better results. You can also try this prompt, which generally works better for financial documents.
3. This is already implemented in Docstrange; please use the `Markdown (Financial Docs)` option for processing table-heavy financial documents.
4. The model might work best at a certain resolution for specific document types. Please check the cookbooks for details.
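The standardized checkbox symbols described above make downstream parsing trivial. As a minimal illustration (not part of Nanonets' tooling), here is a sketch that turns checkbox lines in the OCR markdown into booleans:

```python
# Map the standardized checkbox symbols Nanonets-OCR2 emits to booleans.
# ☐ = unchecked; ☑ and ☒ both indicate a marked box.
CHECKBOX_STATES = {"☐": False, "☑": True, "☒": True}

def extract_checkboxes(markdown: str) -> list[tuple[str, bool]]:
    """Return (label, checked) pairs for lines that start with a checkbox symbol."""
    results = []
    for line in markdown.splitlines():
        line = line.strip()
        if line and line[0] in CHECKBOX_STATES:
            results.append((line[1:].strip(), CHECKBOX_STATES[line[0]]))
    return results

doc = """☑ Applicant signed
☐ Co-signer required
☒ Terms accepted"""
print(extract_checkboxes(doc))
# [('Applicant signed', True), ('Co-signer required', False), ('Terms accepted', True)]
```

Because the symbols are consistent Unicode codepoints, this kind of post-processing needs no OCR-specific heuristics.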
olmOCR-2-7B-1025-GGUF
Dolphin-Mistral-24B-Venice-Edition-GGUF
This model was generated using llama.cpp at commit `221c0e0c`.

Discord: https://discord.gg/h3K4XGj2RH Website: https://dphn.ai Twitter: https://x.com/dphnAI

Dolphin Mistral 24B Venice Edition is a collaborative project we undertook with Venice.ai with the goal of creating the most uncensored version of Mistral 24B for use within the Venice ecosystem. Dolphin Mistral 24B Venice Edition is now live on https://venice.ai/ as "Venice Uncensored," the new default model for all Venice users.

Dolphin aims to be a general-purpose model, similar to the models behind ChatGPT, Claude, and Gemini. But those models present problems for businesses seeking to include AI in their products:
1) They maintain control of the system prompt, deprecating and changing things as they wish, often causing software to break.
2) They maintain control of the model versions, sometimes changing things silently or deprecating older models that your business relies on.
3) They maintain control of the alignment, and in particular the alignment is one-size-fits-all, not tailored to the application.
4) They can see all your queries and can potentially use that data in ways you wouldn't want.
Dolphin, in contrast, is steerable and gives control to the system owner. You set the system prompt. You decide the alignment. You have control of your data. Dolphin does not impose its ethics or guidelines on you; you are the one who decides the guidelines. Dolphin belongs to YOU; it is your tool, an extension of your will. Just as you are personally responsible for what you do with a knife, gun, fire, car, or the internet, you are the creator and originator of any content you generate with Dolphin.

We maintained Mistral's default chat template for this model. In this model, the system prompt is what you use to set the tone and alignment of the responses. You can set a character, a mood, or rules for its behavior, and it will try its best to follow them. Make sure to set the system prompt in order to set the tone and guidelines for the responses; otherwise, it will act in a default way that might not be what you want. Example of the system prompt we used to get the model as uncensored as possible:

Note: we recommend using a relatively low temperature, such as `temperature=0.15`.

There are many ways to use a Hugging Face model, including:
- ollama
- LM Studio
- Hugging Face Transformers library
- vllm
- sglang
- tgi

The model can be used with the following frameworks:
- `vllm`: see here
- `transformers`: see here

We recommend using this model with the vLLM library to implement production-ready inference pipelines. Also make sure you have `mistral_common >= 1.5.2` installed. You can also use the ready-to-go Docker image on Docker Hub.
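Since vLLM exposes an OpenAI-compatible endpoint, setting the system prompt and the recommended low temperature comes down to the request payload. A minimal sketch (the model name and system prompt below are illustrative placeholders, not official values):

```python
import json

def build_chat_request(system_prompt: str, user_message: str) -> dict:
    """Assemble an OpenAI-compatible chat payload for a local vLLM server.

    The system prompt is where you set the tone and alignment, per the
    model card; temperature=0.15 follows the card's recommendation.
    """
    return {
        "model": "dolphin-mistral-24b-venice-edition",  # placeholder name
        "temperature": 0.15,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message},
        ],
    }

req = build_chat_request("You are Dolphin, a helpful assistant.", "Hello!")
print(json.dumps(req, indent=2))
```

The same payload shape works against any of the OpenAI-compatible servers in the framework list above.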
Qwen3-VL-8B-Instruct-GGUF
Foundation-Sec-8B-Instruct-GGUF
Hunyuan-MT-7B-GGUF
This model was generated using llama.cpp at commit `c8dedc99`.

🤗 Hugging Face | 🕹️ Demo | 🤖 ModelScope | 🖥️ Official Website | GitHub | Technical Report

The Hunyuan Translation Model comprises a translation model, Hunyuan-MT-7B, and an ensemble model, Hunyuan-MT-Chimera. The translation model translates source text into the target language, while the ensemble model integrates multiple translation outputs to produce a higher-quality result. It primarily supports mutual translation among 33 languages, including five ethnic minority languages of China.

- In the WMT25 competition, the model achieved first place in 30 of the 31 language categories it participated in.
- Hunyuan-MT-7B achieves industry-leading performance among models of comparable scale.
- Hunyuan-MT-Chimera-7B is the industry's first open-source translation ensemble model, elevating translation quality to a new level.
- A comprehensive training framework for translation models has been proposed, spanning pretraining → cross-lingual pretraining (CPT) → supervised fine-tuning (SFT) → translation enhancement → ensemble refinement, achieving state-of-the-art (SOTA) results for models of similar size.

Related News
2025.9.1: We have open-sourced Hunyuan-MT-7B and Hunyuan-MT-Chimera-7B on Hugging Face.

Model Links

| Model Name | Description | Download |
| ----------- | ----------- | ----------- |
| Hunyuan-MT-7B | Hunyuan 7B translation model | 🤗 Model |
| Hunyuan-MT-7B-fp8 | Hunyuan 7B translation model, FP8 quant | 🤗 Model |
| Hunyuan-MT-Chimera | Hunyuan 7B translation ensemble model | 🤗 Model |
| Hunyuan-MT-Chimera-fp8 | Hunyuan 7B translation ensemble model, FP8 quant | 🤗 Model |

Prompt template for XX↔XX translation, excluding ZH↔XX.

Use with transformers: first, please install transformers (v4.56.0 recommended). The following code snippet shows how to use the transformers library to load and apply the model. Note: if you want to load the FP8 model with transformers, you need to change the name `ignored_layers` in config.json to `ignore` and upgrade compressed-tensors to compressed-tensors-0.11.0. We recommend using the following set of parameters for inference. Note that our model does not have a default system prompt.

Supported languages:

| Languages | Abbr. | Chinese Names |
| ------- | ----- | ------- |
| Chinese | zh | 中文 |
| English | en | 英语 |
| French | fr | 法语 |
| Portuguese | pt | 葡萄牙语 |
| Spanish | es | 西班牙语 |
| Japanese | ja | 日语 |
| Turkish | tr | 土耳其语 |
| Russian | ru | 俄语 |
| Arabic | ar | 阿拉伯语 |
| Korean | ko | 韩语 |
| Thai | th | 泰语 |
| Italian | it | 意大利语 |
| German | de | 德语 |
| Vietnamese | vi | 越南语 |
| Malay | ms | 马来语 |
| Indonesian | id | 印尼语 |
| Filipino | tl | 菲律宾语 |
| Hindi | hi | 印地语 |
| Traditional Chinese | zh-Hant | 繁体中文 |
| Polish | pl | 波兰语 |
| Czech | cs | 捷克语 |
| Dutch | nl | 荷兰语 |
| Khmer | km | 高棉语 |
| Burmese | my | 缅甸语 |
| Persian | fa | 波斯语 |
| Gujarati | gu | 古吉拉特语 |
| Urdu | ur | 乌尔都语 |
| Telugu | te | 泰卢固语 |
| Marathi | mr | 马拉地语 |
| Hebrew | he | 希伯来语 |
| Bengali | bn | 孟加拉语 |
| Tamil | ta | 泰米尔语 |
| Ukrainian | uk | 乌克兰语 |
| Tibetan | bo | 藏语 |
| Kazakh | kk | 哈萨克语 |
| Mongolian | mn | 蒙古语 |
| Uyghur | ug | 维吾尔语 |
| Cantonese | yue | 粤语 |
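Before building a translation prompt, it is worth validating the language pair against the supported-language table. A small sketch (the dictionary below is a hand-copied subset of the table, not an official API; extend it as needed):

```python
# Subset of the supported-language table: abbreviation -> English name.
SUPPORTED = {
    "zh": "Chinese", "en": "English", "fr": "French", "pt": "Portuguese",
    "es": "Spanish", "ja": "Japanese", "de": "German", "ko": "Korean",
    "yue": "Cantonese", "bo": "Tibetan", "ug": "Uyghur", "zh-Hant": "Traditional Chinese",
}

def check_pair(src: str, tgt: str) -> str:
    """Validate a translation pair against the supported-language list."""
    for code in (src, tgt):
        if code not in SUPPORTED:
            raise ValueError(f"unsupported language code: {code}")
    return f"{SUPPORTED[src]} -> {SUPPORTED[tgt]}"

print(check_pair("en", "zh"))  # English -> Chinese
```

The actual prompt template differs for ZH↔XX versus other pairs, so the validated pair determines which template to apply (see the original card for the templates themselves).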
GELab-Zero-4B-preview-GGUF
gemma-3-1b-it-gguf
granite-4.0-micro-base-GGUF
This model was generated using llama.cpp at commit `03792ad93`.

Model Summary: Granite-4.0-Micro-Base is a decoder-only, long-context language model designed for a wide range of text-to-text generation tasks. It also supports Fill-in-the-Middle (FIM) code completion through the use of specialized prefix and suffix tokens. The model is trained from scratch on approximately 15 trillion tokens following a four-stage training strategy: 10 trillion tokens in the first stage, 2 trillion in the second, another 2 trillion in the third, and 0.5 trillion in the final stage.

- Developers: Granite Team, IBM
- HF Collection: Granite 4.0 Language Models HF Collection
- GitHub Repository: ibm-granite/granite-4.0-language-models
- Website: Granite Docs
- Release Date: October 2nd, 2025
- License: Apache 2.0

Supported Languages: English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese. Users may fine-tune Granite 4.0 models for languages beyond these.

Intended Use: prominent use cases of LLMs in text-to-text generation include summarization, text classification, extraction, question answering, code completion (including FIM), and long-context generation tasks.
All Granite Base models can handle these tasks, as they were trained on a large amount of data from various domains. Moreover, they can serve as baselines for creating specialized models for specific application scenarios.

Generation: this is a simple example of how to use the Granite-4.0-Micro-Base model. Then, copy the code snippet below to run the example.

Benchmarks

| Metric | Micro Dense | H Micro Dense | H Tiny MoE | H Small MoE |
| ------ | ------ | ------ | ------ | ------ |
| HumanEval pass@1 [StarCoder Prompt] | 76.19 | 73.72 | 77.59 | 83.66 |

Multilingual benchmarks and the included languages:

| Benchmark | # Langs | Languages |
| ------ | ------ | ------ |
| MMMLU | 11 | ar, de, en, es, fr, ja, ko, pt, zh, bn, hi |
| INCLUDE | 14 | hi, bn, ta, te, ar, de, es, fr, it, ja, ko, nl, pt, zh |

Granite-4.0-Micro-Base is based on a decoder-only dense transformer architecture. Core components of this architecture are: GQA, RoPE, MLP with SwiGLU, RMSNorm, and shared input/output embeddings.

| Model | Micro Dense | H Micro Dense | H Tiny MoE | H Small MoE |
| ------ | ------ | ------ | ------ | ------ |
| Number of layers | 40 attention | 4 attention / 36 Mamba2 | 4 attention / 36 Mamba2 | 4 attention / 36 Mamba2 |
| MLP / Shared expert hidden size | 8192 | 8192 | 1024 | 1536 |

Training Data: this model is trained on a mix of open-source and proprietary data following a four-stage training strategy (token counts in trillions):

| Stage | Characteristics | Micro Dense | H Micro Dense | H Tiny MoE | H Small MoE |
| ------ | ------ | ------ | ------ | ------ | ------ |
| I | General mixture of training data, warmup, and power scheduler for learning rate. | 10 | 10 | 15 | 15 |
| II | General mixture of training data with higher percentages of code and math, with power scheduler for learning rate. | 2 | 5 | 5 | 5 |
| III | High-quality training data, exponential decay of learning rate. | 2 | 2 | 2 | 2 |
| IV | High-quality training data, linear decay to zero for learning rate. | 0.5 | 0.5 | 0.5 | 0.5 |

Infrastructure: We trained the Granite 4.0 Language Models utilizing an NVIDIA GB200 NVL72 cluster hosted in CoreWeave.
Intra-rack communication occurs via the 72-GPU NVLink domain, and a non-blocking, full Fat-Tree NDR 400 Gb/s InfiniBand network provides inter-rack communication. This cluster provides a scalable and efficient infrastructure for training our models over thousands of GPUs.

Ethical Considerations and Limitations: the use of Large Language Models involves risks and ethical considerations people must be aware of, including but not limited to: bias and fairness, misinformation, and autonomous decision-making. The Granite-4.0-Micro-Base model is no exception in this regard. Even though this model is suited for multiple generative AI tasks, it has not undergone any safety alignment, so it may produce problematic outputs. Additionally, it remains uncertain whether smaller models might exhibit increased susceptibility to hallucination in generation scenarios by copying text verbatim from the training dataset, due to their reduced sizes and memorization capacities. This aspect is currently an active area of research, and we anticipate more rigorous exploration, comprehension, and mitigation in this domain. Regarding ethics, a latent risk associated with all Large Language Models is their malicious utilization. We urge the community to use the Granite-4.0-Micro-Base model with ethical intentions and in a responsible way.

Resources
- ⭐️ Learn about the latest updates with Granite: https://www.ibm.com/granite
- 📄 Get started with tutorials, best practices, and prompt engineering advice: https://www.ibm.com/granite/docs/
- 💡 Learn about the latest Granite learning resources: https://github.com/ibm-granite-community/
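The card's summary mentions FIM code completion via specialized prefix and suffix tokens. As a minimal sketch of how such a prompt is assembled (the token names below are assumptions modeled on common FIM conventions, not confirmed Granite tokens; check the model's tokenizer config for the exact special tokens):

```python
# Assumed FIM sentinel tokens -- verify against the actual tokenizer config.
FIM_PREFIX, FIM_SUFFIX, FIM_MIDDLE = "<fim_prefix>", "<fim_suffix>", "<fim_middle>"

def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Assemble a Fill-in-the-Middle prompt; the model generates the middle
    span that belongs between `prefix` and `suffix`."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"

prompt = build_fim_prompt(
    "def add(a, b):\n    return ",      # code before the hole
    "\n\nprint(add(2, 3))",             # code after the hole
)
print(prompt)
```

Generation then continues after the middle sentinel, so the decoded continuation is exactly the missing span.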
Qwen2.5-Omni-7B-GGUF
Dolphin3.0-Llama3.2-3B-GGUF
This model was generated using llama.cpp at commit `3f4fc97f`.

Dolphin 3.0 Llama 3.2 3B 🐬
Part of the Dolphin 3.0 Collection. Curated and trained by Eric Hartford, Ben Gitter, BlouseJury, and Cognitive Computations.

Discord: https://discord.gg/cognitivecomputations

Sponsors
Our appreciation for the generous sponsors of Dolphin 3.0:
- Crusoe Cloud - provided 16x L40S for training and evals
- Akash - provided on-demand 8x H100 for training
- Lazarus - provided 16x H100 for training
- Cerebras - provided excellent and fast inference services for data labeling
- Andreessen Horowitz - provided a grant that made Dolphin 1.0 possible and enabled me to bootstrap my homelab

Dolphin 3.0 is the next generation of the Dolphin series of instruct-tuned models, designed to be the ultimate general-purpose local model, enabling coding, math, agentic, function-calling, and general use cases. Dolphin aims to be a general-purpose model, similar to the models behind ChatGPT, Claude, and Gemini. But those models present problems for businesses seeking to include AI in their products:
1) They maintain control of the system prompt, deprecating and changing things as they wish, often causing software to break.
2) They maintain control of the model versions, sometimes changing things silently or deprecating older models that your business relies on.
3) They maintain control of the alignment, and in particular the alignment is one-size-fits-all, not tailored to the application.
4) They can see all your queries and can potentially use that data in ways you wouldn't want.

Dolphin, in contrast, is steerable and gives control to the system owner. You set the system prompt. You decide the alignment. You have control of your data. Dolphin does not impose its ethics or guidelines on you; you are the one who decides the guidelines. Dolphin belongs to YOU; it is your tool, an extension of your will. Just as you are personally responsible for what you do with a knife, gun, fire, car, or the internet, you are the creator and originator of any content you generate with Dolphin.

In Dolphin, the system prompt is what you use to set the tone and alignment of the responses. You can set a character, a mood, or rules for its behavior, and it will try its best to follow them. Make sure to set the system prompt in order to set the tone and guidelines for the responses; otherwise, it will act in a default way that might not be what you want.

There are many ways to use a Hugging Face model, including:
- ollama
- LM Studio
- Hugging Face Transformers library
- vllm
- sglang
- tgi

Respect and thanks to the creators of the open-source datasets that were used:
- OpenCoder-LLM (opc-sft-stage1, opc-sft-stage2)
- microsoft (orca-agentinstruct-1M-v1, orca-math-word-problems-200k)
- NousResearch (hermes-function-calling-v1)
- AI-MO (NuminaMath-CoT, NuminaMath-TIR)
- allenai (tulu-3-sft-mixture)
- HuggingFaceTB (smoltalk)
- m-a-p (CodeFeedback-Filtered-Instruction, Code-Feedback)

Special thanks to:
- Meta, Qwen, and OpenCoder, who wrote papers and published models that were instrumental in creating Dolphin 3.0.
- RLHFlow, for the excellent reward model used to filter the datasets
- DeepSeek, for the ridiculously fast DeepSeek-V3 that we used to augment the data
(what ever you want)" Note you need to install a Quantum Network Monitor Agent to run the .net code on. This is a very flexible and powerful feature. Use with caution! I fund the servers used to create these model files, run the Quantum Network Monitor service, and pay for inference from Novita and OpenAI—all out of my own pocket. All the code behind the model creation and the Quantum Network Monitor project is open source. Feel free to use whatever you find helpful. If you appreciate the work, please consider buying me a coffee ☕. Your support helps cover service costs and allows me to raise token limits for everyone.
AI21-Jamba-Reasoning-3B-GGUF
UI-TARS-1.5-7B-GGUF
LightOnOCR-1B-1025-GGUF
This model was generated using llama.cpp at commit `16724b5b6`. Click here to get info on choosing the right GGUF model format Full BF16 version of the model. We recommend this variant for inference and further fine-tuning. LightOnOCR-1B is a compact, end-to-end vision–language model for Optical Character Recognition (OCR) and document understanding. It achieves state-of-the-art accuracy in its weight class while being several times faster and cheaper than larger general-purpose VLMs. [](https://colab.research.google.com/#fileId=https%3A//huggingface.co/lightonai/LightOnOCR-1B-1025/blob/main/notebook.ipynb) ⚡ Speed: 5× faster than dots.ocr, 2× faster than PaddleOCR-VL-0.9B, 1.73× faster than DeepSeekOCR 💸 Efficiency: Processes 5.71 pages/s on a single H100 (~493k pages/day)
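The two LightOnOCR efficiency figures are consistent with each other, which is worth a quick arithmetic check:

```python
# Sanity check on the quoted throughput: pages per second -> pages per day.
pages_per_second = 5.71
pages_per_day = pages_per_second * 60 * 60 * 24
print(round(pages_per_day))  # 493344, matching the "~493k pages/day" claim
```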
Dans-PersonalityEngine-V1.3.0-24b-GGUF
GLM-4.7-GGUF
bu-30b-a3b-preview-GGUF
Llama Joycaption Beta One Hf Llava GGUF
Nanonets-OCR2-1.5B-exp-GGUF
This model was generated using llama.cpp at commit `03792ad93`. Click here to get info on choosing the right GGUF model format Nanonets-OCR2: A model for transforming documents into structured markdown with intelligent content recognition and semantic tagging Nanonets-OCR2 by Nanonets is a family of powerful, state-of-the-art image-to-markdown OCR models that go far beyond traditional text extraction. It transforms documents into structured markdown with intelligent content recognition and semantic tagging, making it ideal for downstream processing by Large Language Models (LLMs). Nanonets-OCR2 is packed with features designed to handle complex documents with ease: LaTeX Equation Recognition: Automatically converts mathematical equations and formulas into properly formatted LaTeX syntax. It distinguishes between inline (`$...$`) and display (`$$...$$`) equations. Intelligent Image Description: Describes images within documents using structured ` ` tags, making them digestible for LLM processing. It can describe various image types, including logos, charts, graphs, and so on, detailing their content, style, and context. Signature Detection & Isolation: Identifies and isolates signatures from other text, outputting them within a ` ` tag. This is crucial for processing legal and business documents. Watermark Extraction: Detects and extracts watermark text from documents, placing it within a ` ` tag. Smart Checkbox Handling: Converts form checkboxes and radio buttons into standardized Unicode symbols (`☐`, `☑`, `☒`) for consistent and reliable processing. Complex Table Extraction: Accurately extracts complex tables from documents and converts them into both markdown and HTML table formats. Flow Charts & Organisational Charts: Extracts flow charts and organisational charts as Mermaid code. Handwritten Documents: The model is trained on handwritten documents across multiple languages. 
Multilingual: The model is trained on documents in multiple languages, including English, Chinese, French, Spanish, Portuguese, German, Italian, Russian, Japanese, Korean, Arabic, and many more. Visual Question Answering (VQA): The model is designed to provide the answer directly if it is present in the document; otherwise, it responds with "Not mentioned." Nanonets-OCR2 Family

| Model | Access Link |
| ----- | ----- |
| Nanonets-OCR2-Plus | Docstrange link |
| Nanonets-OCR2-3B | 🤗 link |
| Nanonets-OCR2-1.5B-exp | 🤗 link |

(The original card also includes win/lose-rate benchmark tables against Nanonets OCR2 Plus and Nanonets OCR2 3B, and per-dataset results for Nanonets OCR2 Plus, Nanonets OCR2 3B, Qwen2.5-VL-72B-Instruct, and Gemini 2.5 Flash; the numeric values are not recoverable here.)

Tips to improve accuracy 1. Increasing the image resolution will improve the model's performance. 2. For complex tables (e.g. financial documents), using `repetition_penalty=1` gives better results. You can also try this prompt, which generally works better for financial documents. 3. This is already implemented in Docstrange; please use the `Markdown (Financial Docs)` option for processing table-heavy financial documents.
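Nanonets-OCR2's checkbox convention maps form widgets onto three fixed Unicode symbols; a tiny post-processing helper in the same spirit (my own sketch, not Nanonets code):

```python
# Illustrative helper mirroring the Nanonets-OCR2 checkbox convention:
# unchecked, checked, and crossed boxes become standardized Unicode symbols.
CHECKBOX_SYMBOLS = {
    "unchecked": "\u2610",  # ☐
    "checked": "\u2611",    # ☑
    "crossed": "\u2612",    # ☒
}

def normalize_checkbox(state: str) -> str:
    """Return the standard symbol for a checkbox state."""
    return CHECKBOX_SYMBOLS[state]

print(normalize_checkbox("checked"))  # ☑
```

Using fixed code points keeps downstream LLM prompts and diffs stable, regardless of how the checkbox was drawn in the source document.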
KAT-Dev-72B-Exp-GGUF
Fin-R1-GGUF
SoulX-Podcast-1.7B-GGUF
This model was generated using llama.cpp at commit `16724b5b6`. Click here to get info on choosing the right GGUF model format Official inference code for SoulX-Podcast: Towards Realistic Long-form Podcasts with Dialectal and Paralinguistic Diversity SoulX-Podcast: Towards Realistic Long-form Podcasts with Dialectal and Paralinguistic Diversity Overview SoulX-Podcast is designed for podcast-style multi-turn, multi-speaker dialogic speech generation, while also achieving superior performance in the conventional monologue TTS task. To meet the higher naturalness demands of multi-turn spoken dialogue, SoulX-Podcast integrates a range of paralinguistic controls and supports both Mandarin and English, as well as several Chinese dialects, including Sichuanese, Henanese, and Cantonese, enabling more personalized podcast-style speech generation. - Long-form, multi-turn, multi-speaker dialogic speech generation: SoulX-Podcast excels in generating high-quality, natural-sounding dialogic speech for multi-turn, multi-speaker scenarios. - Cross-dialectal, zero-shot voice cloning: SoulX-Podcast supports zero-shot voice cloning across different Chinese dialects, enabling the generation of high-quality, personalized speech in any of the supported dialects. - Paralinguistic controls: SoulX-Podcast supports a variety of paralinguistic events, such as laughter and sighs, to enhance the realism of synthesized results. Clone and Install Here are instructions for installing on Linux. - Clone the repo - Install Conda: please see https://docs.conda.io/en/latest/miniconda.html - Create Conda env: You can simply run the demo with the following commands: TODOs - [ ] Add example scripts for monologue TTS. - [x] Publish the technical report. - [ ] Develop a WebUI for easy inference. - [ ] Deploy an online demo on Hugging Face Spaces. - [ ] Dockerize the project with vLLM support. - [ ] Add support for streaming inference. We use the Apache 2.0 license. 
Researchers and developers are free to use the code and model weights of SoulX-Podcast. Check the license at LICENSE for more details. Acknowledgement - This repo benefits from FlashCosyVoice Usage Disclaimer This project provides a speech synthesis model for podcast generation capable of zero-shot voice cloning, intended for academic research, educational purposes, and legitimate applications such as personalized speech synthesis, assistive technologies, and linguistic research. Do not use this model for unauthorized voice cloning, impersonation, fraud, scams, deepfakes, or any illegal activities. Ensure compliance with local laws and regulations when using this model and uphold ethical standards. The developers assume no liability for any misuse of this model. We advocate for the responsible development and use of AI and encourage the community to uphold safety and ethical principles in AI research and applications. If you have any concerns regarding ethics or misuse, please contact us. Contact Us If you are interested in our work, feel free to email [email protected] You're welcome to join our WeChat group for technical discussions and updates. Due to group limits, if you can't scan the QR code, please add my WeChat for group access.
UIGEN-X-4B-0729-GGUF
This model was generated using llama.cpp at commit `1e15bfd4`. Click here to get info on choosing the right GGUF model format > Tesslate's Reasoning Only UI generation model built on Qwen3-4B architecture. Trained to systematically plan, architect, and implement complete user interfaces across modern development stacks. Live Examples: https://uigenoutput.tesslate.com Discord Community: https://discord.gg/EcCpcTv93U Website: https://tesslate.com UIGEN-X-4B-0729 implements Reasoning Only from the Qwen3 family - combining systematic planning with direct implementation. The model follows a structured thinking process: 1. Problem Analysis — Understanding requirements and constraints 2. Architecture Planning — Component structure and technology decisions 3. Design System Definition — Color schemes, typography, and styling approach 4. Implementation Strategy — Step-by-step code generation with reasoning This approach enables both thoughtful planning and efficient code generation, making it suitable for complex UI development tasks. 
UIGEN-X-4B-0729 supports 26 major categories spanning frameworks and libraries across 7 platforms: Web Frameworks - React: Next.js, Remix, Gatsby, Create React App, Vite - Vue: Nuxt.js, Quasar, Gridsome - Angular: Angular CLI, Ionic Angular - Svelte: SvelteKit, Astro - Modern: Solid.js, Qwik, Alpine.js - Static: Astro, 11ty, Jekyll, Hugo Styling Systems - Utility-First: Tailwind CSS, UnoCSS, Windi CSS - CSS-in-JS: Styled Components, Emotion, Stitches - Component Systems: Material-UI, Chakra UI, Mantine - Traditional: Bootstrap, Bulma, Foundation - Design Systems: Carbon Design, IBM Design Language - Framework-Specific: Angular Material, Vuetify, Quasar UI Component Libraries - React: shadcn/ui, Material-UI, Ant Design, Chakra UI, Mantine, PrimeReact, Headless UI, NextUI, DaisyUI - Vue: Vuetify, PrimeVue, Quasar, Element Plus, Naive UI - Angular: Angular Material, PrimeNG, ng-bootstrap, Clarity Design - Svelte: Svelte Material UI, Carbon Components Svelte - Headless: Radix UI, Reach UI, Ariakit, React Aria State Management - React: Redux Toolkit, Zustand, Jotai, Valtio, Context API - Vue: Pinia, Vuex, Composables - Angular: NgRx, Akita, Services - Universal: MobX, XState, Recoil Animation Libraries - React: Framer Motion, React Spring, React Transition Group - Vue: Vue Transition, Vueuse Motion - Universal: GSAP, Lottie, CSS Animations, Web Animations API - Mobile: React Native Reanimated, Expo Animations Icon Systems Lucide, Heroicons, Material Icons, Font Awesome, Ant Design Icons, Bootstrap Icons, Ionicons, Tabler Icons, Feather, Phosphor, React Icons, Vue Icons Web Development Complete coverage of modern web development from simple HTML/CSS to complex enterprise applications. 
Mobile Development - React Native: Expo, CLI, with navigation and state management - Flutter: Cross-platform mobile with Material and Cupertino designs - Ionic: Angular, React, and Vue-based hybrid applications Desktop Applications - Electron: Cross-platform desktop apps (Slack, VSCode-style) - Tauri: Rust-based lightweight desktop applications - Flutter Desktop: Native desktop performance Python Applications - Web UI: Streamlit, Gradio, Flask, FastAPI - Desktop GUI: Tkinter, PyQt5/6, Kivy, wxPython, Dear PyGui Development Tools Build tools, bundlers, testing frameworks, and development environments. 26 Languages and Approaches: JavaScript, TypeScript, Python, Dart, HTML5, CSS3, SCSS, SASS, Less, PostCSS, CSS Modules, Styled Components, JSX, TSX, Vue SFC, Svelte Components, Angular Templates, Tailwind, PHP UIGEN-X-4B-0729 includes 21 distinct visual style categories that can be applied to any framework: Modern Design Styles - Glassmorphism: Frosted glass effects with blur and transparency - Neumorphism: Soft, extruded design elements - Material Design: Google's design system principles - Fluent Design: Microsoft's design language Traditional & Classic - Skeuomorphism: Real-world object representations - Swiss Design: Clean typography and grid systems - Bauhaus: Functional, geometric design principles Contemporary Trends - Brutalism: Bold, raw, unconventional layouts - Anti-Design: Intentionally imperfect, organic aesthetics - Minimalism: Essential elements only, generous whitespace Thematic Styles - Cyberpunk: Neon colors, glitch effects, futuristic elements - Dark Mode: High contrast, reduced eye strain - Retro-Futurism: 80s/90s inspired futuristic design - Geocities/90s Web: Nostalgic early web aesthetics Experimental - Maximalism: Rich, layered, abundant visual elements - Madness/Experimental: Unconventional, boundary-pushing designs - Abstract Shapes: Geometric, non-representational elements Basic Structure To achieve the best results, use this prompting 
structure below: UIGEN-X-4B-0729 supports function calling for dynamic asset integration and enhanced development workflows. Dynamic Asset Loading: - Fetch relevant images during UI generation - Generate realistic content for components - Create cohesive color palettes from images - Optimize assets for web performance Multi-Step Development: - Plan application architecture - Generate individual components - Integrate components into pages - Apply consistent styling and theming - Test responsive behavior Content-Aware Design: - Adapt layouts based on content types - Optimize typography for readability - Create responsive image galleries - Generate accessible alt text Rapid Prototyping - Quick mockups for client presentations - A/B testing different design approaches - Concept validation with interactive prototypes Production Development - Component library creation - Design system implementation - Template and boilerplate generation Educational & Learning - Teaching modern web development - Framework comparison and evaluation - Best practices demonstration Enterprise Solutions - Dashboard and admin panel generation - Internal tool development - Legacy system modernization Hardware - GPU: 8GB+ VRAM recommended (RTX 3080/4070 or equivalent) - RAM: 16GB system memory minimum - Storage: 20GB for model weights and cache Software - Python: 3.8+ with transformers, torch, unsloth - Node.js: For running generated JavaScript/TypeScript code - Browser: Modern browser for testing generated UIs Integration - Compatible with HuggingFace transformers - Supports GGML/GGUF quantization - Works with text-generation-webui - API-ready for production deployment - Token Usage: Reasoning process increases token consumption - Complex Logic: Focuses on UI structure rather than business logic - Real-time Features: Generated code requires backend integration - Testing: Output may need manual testing and refinement - Accessibility: While ARIA-aware, manual a11y testing recommended Discord: 
https://discord.gg/EcCpcTv93U Website: https://tesslate.com Examples: https://uigenoutput.tesslate.com Join our community to share creations, get help, and contribute to the ecosystem. Built with Reasoning Only capabilities from Qwen3, UIGEN-X-4B-0729 represents a comprehensive approach to AI-driven UI development across the entire modern web development ecosystem.
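The card's four-stage thinking process (problem analysis, architecture planning, design system definition, implementation strategy) can be encouraged directly in the prompt. A hypothetical sketch, since the original prompting template is not reproduced on this page; the wording and field names are mine, not Tesslate's:

```python
# Hypothetical prompt assembly for UIGEN-X-0729-style models, walking the
# model through the four documented reasoning stages before code generation.
request = {
    "stack": "React + Tailwind CSS",
    "style": "Glassmorphism",
    "task": "a pricing page with three tiers",
}

user_prompt = (
    f"Build {request['task']} using {request['stack']} in a "
    f"{request['style']} style. Before writing code, reason through: "
    "1) problem analysis, 2) component architecture, "
    "3) design system (colors, typography), 4) implementation plan."
)

messages = [{"role": "user", "content": user_prompt}]
```

Naming the stack and visual style explicitly matters here, since the model supports dozens of frameworks and 21 style categories and will otherwise pick its own.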
Qwen3-VL-2B-Instruct-GGUF
This model was generated using llama.cpp at commit `16724b5b6`. Click here to get info on choosing the right GGUF model format Meet Qwen3-VL — the most powerful vision-language model in the Qwen series to date. This generation delivers comprehensive upgrades across the board: superior text understanding & generation, deeper visual perception & reasoning, extended context length, enhanced spatial and video dynamics comprehension, and stronger agent interaction capabilities. Available in Dense and MoE architectures that scale from edge to cloud, with Instruct and reasoning‑enhanced Thinking editions for flexible, on‑demand deployment. Visual Agent: Operates PC/mobile GUIs—recognizes elements, understands functions, invokes tools, completes tasks. Visual Coding Boost: Generates Draw.io/HTML/CSS/JS from images/videos. Advanced Spatial Perception: Judges object positions, viewpoints, and occlusions; provides stronger 2D grounding and enables 3D grounding for spatial reasoning and embodied AI. Long Context & Video Understanding: Native 256K context, expandable to 1M; handles books and hours-long video with full recall and second-level indexing. Enhanced Multimodal Reasoning: Excels in STEM/Math—causal analysis and logical, evidence-based answers. Upgraded Visual Recognition: Broader, higher-quality pretraining enables the model to “recognize everything”—celebrities, anime, products, landmarks, flora/fauna, etc. Expanded OCR: Supports 32 languages (up from 19); robust in low light, blur, and tilt; better with rare/ancient characters and jargon; improved long-document structure parsing. Text Understanding on par with pure LLMs: Seamless text–vision fusion for lossless, unified comprehension. 1. Interleaved-MRoPE: Full‑frequency allocation over time, width, and height via robust positional embeddings, enhancing long‑horizon video reasoning. 2. DeepStack: Fuses multi‑level ViT features to capture fine‑grained details and sharpen image–text alignment. 3. 
Text–Timestamp Alignment: Moves beyond T‑RoPE to precise, timestamp‑grounded event localization for stronger video temporal modeling. This is the weight repository for Qwen3-VL-2B-Instruct. Below, we provide simple examples showing how to use Qwen3-VL with 🤖 ModelScope and 🤗 Transformers. The code for Qwen3-VL has been merged into the latest Hugging Face transformers, and we advise you to build from source with the following command: Here we show a code snippet demonstrating how to use the chat model with `transformers`: If you find our work helpful, feel free to cite us.
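The transformers snippet itself did not survive the page extraction. As a rough sketch, Qwen VL-family chat models accept interleaved image/text content in the message list; the structure below follows that convention, while the exact Qwen3-VL class names are an assumption and are left commented out (running them requires downloading the weights):

```python
# Sketch of the interleaved image+text chat format used by Qwen VL models.
# Only the message structure is exercised here; the image URL is a placeholder.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/demo.jpeg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Assumed loading path (verify class names against the official model card):
# from transformers import AutoProcessor, AutoModelForImageTextToText
# processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-2B-Instruct")
# model = AutoModelForImageTextToText.from_pretrained("Qwen/Qwen3-VL-2B-Instruct")
```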
QwQ-32B-GGUF
QwenLong-L1.5-30B-A3B-GGUF
NVIDIA-Nemotron-Nano-12B-v2-GGUF
This model was generated using llama.cpp at commit `4fd1242b`. Click here to get info on choosing the right GGUF model format The pretraining data has a cutoff date of September 2024. NVIDIA-Nemotron-Nano-12B-v2 is a large language model (LLM) trained from scratch by NVIDIA and designed as a unified model for both reasoning and non-reasoning tasks. It responds to user queries and tasks by first generating a reasoning trace and then concluding with a final response. The model's reasoning capabilities can be controlled via a system prompt. If the user prefers the model to provide its final answer without intermediate reasoning traces, it can be configured to do so, albeit with a slight decrease in accuracy for harder prompts that require reasoning. Conversely, allowing the model to generate reasoning traces first generally results in higher-quality final solutions to queries and tasks. The model was fine-tuned from NVIDIA-Nemotron-Nano-12B-v2-Base and was further compressed into NVIDIA-Nemotron-Nano-9B-v2. The model uses a hybrid architecture consisting primarily of Mamba-2 and MLP layers combined with just four Attention layers. For the architecture, please refer to the Nemotron-H tech report. The model was trained using Megatron-LM and NeMo-RL. 
The supported languages include: English, German, Spanish, French, Italian, and Japanese. Improved using Qwen. GOVERNING TERMS: Use of this model is governed by the NVIDIA Open Model License Agreement.

We evaluated our model in Reasoning-On mode across all benchmarks, except RULER, which is evaluated in Reasoning-Off mode.

| Benchmark | NVIDIA-Nemotron-Nano-12B-v2 |
| :---- | ----- |
| AIME25 | 76.25% |
| MATH500 | 97.75% |
| GPQA | 64.48% |
| LCB | 70.79% |
| BFCL v3 | 66.98% |
| IFEVAL-Prompt | 84.70% |
| IFEVAL-Instruction | 89.81% |

All evaluations were done using NeMo-Skills. We published a tutorial with all the details necessary to reproduce our evaluation results.

This model supports runtime "thinking" budget control. During inference, the user can specify how many tokens the model is allowed to "think".

- Architecture Type: Mamba2-Transformer Hybrid
- Network Architecture: Nemotron-Hybrid

NVIDIA-Nemotron-Nano-12B-v2 is a general-purpose reasoning and chat model intended to be used in English and coding languages. Other non-English languages (German, French, Italian, Spanish, and Japanese) are also supported. It is intended for developers designing AI agent systems, chatbots, RAG systems, and other AI-powered applications, and is also suitable for typical instruction-following tasks.

- Huggingface 08/29/2025 via https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2
- NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model
- Input Type(s): Text
- Input Format(s): String
- Input Parameters: One-Dimensional (1D): Sequences
- Other Properties Related to Input: Context length up to 128K. Supported languages include German, Spanish, French, Italian, Korean, Portuguese, Russian, Japanese, Chinese, and English.
- Output Type(s): Text
- Output Format: String
- Output Parameters: One-Dimensional (1D): Sequences up to 128K

Our models are designed and optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g.
GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

- Runtime Engine(s): NeMo 25.07.nemotron-nano-v2
- Supported Hardware Microarchitecture Compatibility: NVIDIA A10G, NVIDIA H100-80GB, NVIDIA A100
- Operating System(s): Linux

The snippet below shows how to use this model with Hugging Face Transformers (tested on version 4.48.3).

Case 1: if `/think` or no reasoning signal is provided in the system prompt, reasoning will be set to `True`.
Case 2: if `/nothink` is provided, reasoning will be set to `False`.

Note: the `/think` or `/nothink` keywords can also be provided in "user" messages for turn-level reasoning control.

We recommend setting `temperature` to `0.6` and `top_p` to `0.95` when reasoning is on, using greedy search when reasoning is off, and increasing `max_new_tokens` to `1024` or higher when reasoning is on.

The snippet below shows how to use this model with TRT-LLM. We tested this on the following commit and followed these instructions to build and install TRT-LLM in a docker container.

The snippet below shows how to use this model with vLLM. Use the latest version of vLLM and follow these instructions to build and install vLLM. Note:

- Remember to add `--mamba_ssm_cache_dtype float32` for accurate quality. Without this option, the model's accuracy may degrade.
- If you encounter a CUDA OOM issue, try `--max-num-seqs 64` and consider lowering the value further if the error persists.

Alternatively, you can use Docker to launch a vLLM server. The thinking budget allows developers to keep accuracy high while meeting response-time targets, which is especially crucial for customer support, autonomous agent steps, and edge devices where every millisecond counts. With budget control, you can set a limit for internal reasoning: `max_thinking_tokens` is a threshold that will attempt to end the reasoning trace at the next newline encountered in the reasoning trace.
If no newline is encountered within 500 tokens, it will abruptly end the reasoning trace at `max_thinking_tokens + 500`.

Calling the server with a budget (restricted to 32 tokens here as an example): after launching a vLLM server, you can call the server with tool-call support using a Python script like the one below.

We follow the jinja chat template provided below. This template conditionally adds ` \n` to the start of the Assistant response if `/think` is found in either the system prompt or any user message. If no reasoning signal is added, the model defaults to reasoning "on" mode. The chat template adds ` ` to the start of the Assistant response if `/nothink` is found in the system prompt, thus enforcing reasoning on/off behavior.

- Data Modality: Text
- Text Training Data Size: More than 10 Trillion Tokens
- Train/Test/Valid Split: We used 100% of the corpus for pre-training and relied on external benchmarks for testing.
- Data Collection Method by dataset: Hybrid: Automated, Human, Synthetic
- Labeling Method by dataset: Hybrid: Automated, Human, Synthetic

Properties: The post-training corpus for NVIDIA-Nemotron-Nano-12B-v2 consists of English and multilingual text (German, Spanish, French, Italian, Korean, Portuguese, Russian, Japanese, Chinese, and English). Our sources cover a variety of document types such as webpages, dialogue, articles, and other written materials. The corpus spans domains including code, legal, math, science, finance, and more. We also include a small portion of question-answering and alignment-style data to improve model accuracy. For several of the domains listed above we used synthetic data, specifically reasoning traces, from DeepSeek R1/R1-0528, Qwen3-235B-A22B, Nemotron 4 340B, Qwen2.5-32B-Instruct-AWQ, Qwen2.5-14B-Instruct, and Qwen 2.5 72B.

The pre-training corpus for NVIDIA-Nemotron-Nano-12B-v2 consists of high-quality curated and synthetically-generated data.
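Returning to the thinking-budget behavior described above, the truncation rule can be sketched as a toy simulation. This is illustrative only: token strings stand in for real tokenizer IDs, and the function is not the server's actual implementation.

```python
def truncate_reasoning(tokens, max_thinking_tokens, grace=500):
    """Toy model of the budget rule described above: once max_thinking_tokens
    is reached, end the trace at the next newline token; if none appears
    within `grace` tokens, cut hard at max_thinking_tokens + grace."""
    if len(tokens) <= max_thinking_tokens:
        return tokens
    for i in range(max_thinking_tokens, min(len(tokens), max_thinking_tokens + grace)):
        if tokens[i] == "\n":
            return tokens[: i + 1]
    return tokens[: max_thinking_tokens + grace]
```

In the real server the budget is enforced on token IDs during decoding; this sketch only mirrors the stopping logic.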
It is trained in the English language, as well as 15 multilingual languages and 43 programming languages. Our sources cover a variety of document types such as webpages, dialogue, articles, and other written materials. The corpus spans domains including legal, math, science, finance, and more. We also include a small portion of question-answering and alignment-style data to improve model accuracy. The model was pre-trained for approximately twenty trillion tokens.

Alongside the model, we release our final pretraining data, as outlined in this section. For ease of analysis, there is a sample set that is ungated. For all remaining code, math, and multilingual data, gating and approval is required, and the dataset is permissively licensed for model training purposes. More details on the datasets and synthetic data generation methods can be found in the technical report NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model.

| Dataset | Collection Period |
| :---- | :---- |
| Problems in Elementary Mathematics for Home Study | 4/23/2025 |
| GSM8K | 4/23/2025 |
| PRM800K | 4/23/2025 |
| CC-NEWS | 4/23/2025 |
| Common Crawl | 4/23/2025 |
| Wikimedia | 4/23/2025 |
| Bespoke-Stratos-17k | 4/23/2025 |
| tigerbot-kaggle-leetcodesolutions-en-2k | 4/23/2025 |
| glaive-function-calling-v2 | 4/23/2025 |
| APIGen Function-Calling | 4/23/2025 |
| LMSYS-Chat-1M | 4/23/2025 |
| Open Textbook Library - CC BY-SA & GNU subset and OpenStax - CC BY-SA subset | 4/23/2025 |
| Advanced Reasoning Benchmark, tigerbot-kaggle-leetcodesolutions-en-2k, PRM800K, and SciBench | 4/23/2025 |
| FineWeb-2 | 4/23/2025 |
| Court Listener | Legacy Download |
| peS2o | Legacy Download |
| OpenWebMath | Legacy Download |
| BioRxiv | Legacy Download |
| PMC Open Access Subset | Legacy Download |
| OpenWebText2 | Legacy Download |
| Stack Exchange Data Dump | Legacy Download |
| PubMed Abstracts | Legacy Download |
| NIH ExPorter | Legacy Download |
| arXiv | Legacy Download |
| BigScience Workshop Datasets | Legacy Download |
| Reddit Dataset | Legacy Download |
| SEC's Electronic Data Gathering, Analysis, and Retrieval (EDGAR) | Legacy Download |
| Public Software Heritage S3 | Legacy Download |
| The Stack | Legacy Download |
| mC4 | Legacy Download |
| Advanced Mathematical Problem Solving | Legacy Download |
| MathPile | Legacy Download |
| NuminaMath CoT | Legacy Download |
| PMC Article | Legacy Download |
| FLAN | Legacy Download |
| Advanced Reasoning Benchmark | Legacy Download |
| SciBench | Legacy Download |
| WikiTableQuestions | Legacy Download |
| FinQA | Legacy Download |
| Riddles | Legacy Download |
| Problems in Elementary Mathematics for Home Study | Legacy Download |
| MedMCQA | Legacy Download |
| Cosmos QA | Legacy Download |
| MCTest | Legacy Download |
| AI2's Reasoning Challenge | Legacy Download |
| OpenBookQA | Legacy Download |
| MMLU Auxiliary Train | Legacy Download |
| social-chemestry-101 | Legacy Download |
| Moral Stories | Legacy Download |
| The Common Pile v0.1 | Legacy Download |
| FineMath | Legacy Download |
| MegaMath | Legacy Download |
| FastChat | 6/30/2025 |

Private Non-publicly Accessible Datasets of Third Parties

| Dataset |
| :---- |
| Global Regulation |
| Workbench |

The English Common Crawl data was downloaded from the Common Crawl Foundation (see their FAQ for details on their crawling) and includes the snapshots CC-MAIN-2013-20 through CC-MAIN-2025-13. The data was subsequently deduplicated and filtered in various ways described in the Nemotron-CC paper. Additionally, we extracted data for fifteen languages from the following three Common Crawl snapshots: CC-MAIN-2024-51, CC-MAIN-2025-08, CC-MAIN-2025-18. The fifteen languages included were Arabic, Chinese, Danish, Dutch, French, German, Italian, Japanese, Korean, Polish, Portuguese, Russian, Spanish, Swedish, and Thai.
As we did not have reliable multilingual model-based quality classifiers available, we applied only heuristic filtering, similar to what we did for lower-quality English data in the Nemotron-CC pipeline, but selectively removed some filters for languages where they did not work well. Deduplication was done in the same way as for Nemotron-CC.

The GitHub Crawl was collected using the GitHub REST API and the Amazon S3 API. Each crawl was operated in accordance with the rate limits set by its respective source, either GitHub or S3. We collect raw source code and subsequently remove any files whose license does not exist in our permissive-license set (for additional details, refer to the technical report).

| Dataset | Modality | Dataset Size (Tokens) | Collection Period |
| :---- | :---- | :---- | :---- |
| English Common Crawl | Text | 3.360T | 4/8/2025 |
| Multilingual Common Crawl | Text | 812.7B | 5/1/2025 |
| GitHub Crawl | Text | 747.4B | 4/29/2025 |

| Dataset | Modality | Dataset Size (Tokens) | Seed Dataset | Model(s) used for generation |
| :---- | :---- | :---- | :---- | :---- |
| Synthetic Art of Problem Solving from DeepSeek-R1 | Text | 25.5B | Art of Problem Solving; American Mathematics Competitions 8; American Mathematics Competitions 10 | DeepSeek-R1 |
| Synthetic Moral Stories and Social Chemistry from Mixtral-8x22B-v0.1 | Text | 327M | social-chemestry-101; Moral Stories | Mixtral-8x22B-v0.1 |
| Synthetic Social Sciences seeded with OpenStax from DeepSeek-V3, Mixtral-8x22B-v0.1, and Qwen2.5-72B | Text | 83.6M | OpenStax - CC BY-SA subset | DeepSeek-V3; Mixtral-8x22B-v0.1; Qwen2.5-72B |
| Synthetic Health Sciences seeded with OpenStax from DeepSeek-V3, Mixtral-8x22B-v0.1, and Qwen2.5-72B | Text | 9.7M | OpenStax - CC BY-SA subset | DeepSeek-V3; Mixtral-8x22B-v0.1; Qwen2.5-72B |
| Synthetic STEM seeded with OpenStax, Open Textbook Library, and GSM8K from DeepSeek-R1, DeepSeek-V3, DeepSeek-V3-0324, and Qwen2.5-72B | Text | 175M | OpenStax - CC BY-SA subset; GSM8K; Open Textbook Library - CC BY-SA & GNU subset | DeepSeek-R1; DeepSeek-V3; DeepSeek-V3-0324; Qwen2.5-72B |
| Nemotron-PrismMath | Text | 4.6B | Big-Math-RL-Verified; OpenR1-Math-220k | Qwen2.5-0.5B-Instruct; Qwen2.5-72B-Instruct; DeepSeek-R1-Distill-Qwen-32B |
| Synthetic Question Answering Data from Papers and Permissible Books from Qwen2.5-72B-Instruct | Text | 350M | arXiv; National Institutes of Health ExPorter; BioRxiv; PMC Article; USPTO Backgrounds; peS2o; Global Regulation; CORE; PG-19; DOAB CC BY & CC BY-SA subset; NDLTD | Qwen2.5-72B-Instruct |
| Synthetic FineMath-4+ Reprocessed from DeepSeek-V3 | Text | 9.2B | Common Crawl | DeepSeek-V3 |
| Synthetic FineMath-3+ Reprocessed from phi-4 | Text | 27.6B | Common Crawl | phi-4 |
| Synthetic Union-3+ Reprocessed from phi-4 | Text | 93.1B | Common Crawl | phi-4 |
| Refreshed Nemotron-MIND from phi-4 | Text | 73B | Common Crawl | phi-4 |
| Synthetic Union-4+ Reprocessed from phi-4 | Text | 14.12B | Common Crawl | phi-4 |
| Synthetic Union-3+ minus 4+ Reprocessed from phi-4 | Text | 78.95B | Common Crawl | phi-4 |
| Synthetic Union-3 Refreshed from phi-4 | Text | 80.94B | Common Crawl | phi-4 |
| Synthetic Union-4+ Refreshed from phi-4 | Text | 52.32B | Common Crawl | phi-4 |
| Synthetic AGIEval seeded with AQUA-RAT, LogiQA, and AR-LSAT from DeepSeek-V3 and DeepSeek-V3-0324 | Text | 4.0B | AQUA-RAT; LogiQA; AR-LSAT | DeepSeek-V3; DeepSeek-V3-0324 |
| Synthetic AGIEval seeded with AQUA-RAT, LogiQA, and AR-LSAT from Qwen3-30B-A3B | Text | 4.2B | AQUA-RAT; LogiQA; AR-LSAT | Qwen3-30B-A3B |
| Synthetic Art of Problem Solving from Qwen2.5-32B-Instruct, Qwen2.5-Math-72B, Qwen2.5-Math-7B, and Qwen2.5-72B-Instruct | Text | 83.1B | Art of Problem Solving; American Mathematics Competitions 8; American Mathematics Competitions 10; GSM8K; PRM800K | Qwen2.5-32B-Instruct; Qwen2.5-Math-72B; Qwen2.5-Math-7B; Qwen2.5-72B-Instruct |
| Synthetic MMLU Auxiliary Train from DeepSeek-R1 | Text | 0.5B | MMLU Auxiliary Train | DeepSeek-R1 |
| Synthetic Long Context Continued Post-Training Data from Papers and Permissible Books from Qwen2.5-72B-Instruct | Text | 5.4B | arXiv; National Institutes of Health ExPorter; BioRxiv; PMC Article; USPTO Backgrounds; peS2o; Global Regulation; CORE; PG-19; DOAB CC BY & CC BY-SA subset; NDLTD | Qwen2.5-72B-Instruct |
| Synthetic Common Crawl from Qwen3-30B-A3B and Mistral-Nemo-12B-Instruct | Text | 1.949T | Common Crawl | Qwen3-30B-A3B; Mistral-NeMo-12B-Instruct |
| Synthetic Multilingual Data from Common Crawl from Qwen3-30B-A3B | Text | 997.3B | Common Crawl | Qwen3-30B-A3B |
| Synthetic Multilingual Data from Wikimedia from Qwen3-30B-A3B | Text | 55.1B | Wikimedia | Qwen3-30B-A3B |
| Synthetic OpenMathReasoning from DeepSeek-R1-0528 | Text | 1.5M | OpenMathReasoning | DeepSeek-R1-0528 |
| Synthetic OpenCodeReasoning from DeepSeek-R1-0528 | Text | 1.1M | OpenCodeReasoning | DeepSeek-R1-0528 |
| Synthetic Science Data from DeepSeek-R1-0528 | Text | 1.5M | - | DeepSeek-R1-0528 |
| Synthetic Humanity's Last Exam from DeepSeek-R1-0528 | Text | 460K | Humanity's Last Exam | DeepSeek-R1-0528 |
| Synthetic ToolBench from Qwen3-235B-A22B | Text | 400K | ToolBench | Qwen3-235B-A22B |
| Synthetic Nemotron Content Safety Dataset V2, eval-safety, Gretel Synthetic Safety Alignment, and RedTeam_2K from DeepSeek-R1-0528 | Text | 52K | Nemotron Content Safety Dataset V2; eval-safety; Gretel Synthetic Safety Alignment; RedTeam_2K | DeepSeek-R1-0528 |
| Synthetic HelpSteer from Qwen3-235B-A22B | Text | 120K | HelpSteer3; HelpSteer2 | Qwen3-235B-A22B |
| Synthetic Alignment data from Mixtral-8x22B-Instruct-v0.1, Mixtral-8x7B-Instruct-v0.1, and Nemotron-4 Family | Text | 400K | HelpSteer2; C4; LMSYS-Chat-1M; ShareGPT52K; tigerbot-kaggle-leetcodesolutions-en-2k; GSM8K; PRM800K; lm_identity (NVIDIA internal); FinQA; WikiTableQuestions; Riddles; ChatQA nvolve-multiturn (NVIDIA internal); glaive-function-calling-v2; SciBench; OpenBookQA; Advanced Reasoning Benchmark; Public Software Heritage S3; Khan Academy Math Keywords | Nemotron-4-15B-Base (NVIDIA internal); Nemotron-4-15B-Instruct (NVIDIA internal); Nemotron-4-340B-Base; Nemotron-4-340B-Instruct; Nemotron-4-340B-Reward; Mixtral-8x7B-Instruct-v0.1; Mixtral-8x22B-Instruct-v0.1 |
| Synthetic LMSYS-Chat-1M from Qwen3-235B-A22B | Text | 1M | LMSYS-Chat-1M | Qwen3-235B-A22B |
| Synthetic Multilingual Reasoning data from DeepSeek-R1-0528, Qwen2.5-32B-Instruct-AWQ, and Qwen2.5-14B-Instruct | Text | 25M | OpenMathReasoning; OpenCodeReasoning | DeepSeek-R1-0528; Qwen2.5-32B-Instruct-AWQ (translation); Qwen2.5-14B-Instruct (translation) |
| Synthetic Multilingual Reasoning data from Qwen3-235B-A22B and Gemma 3 Post-Trained models | Text | 5M | WildChat | Qwen3-235B-A22B; Gemma 3 PT 12B; Gemma 3 PT 27B |

- Data Collection Method by dataset: Hybrid: Human, Synthetic
- Labeling Method by dataset: Hybrid: Automated, Human, Synthetic

NVIDIA believes Trustworthy AI is a shared responsibility, and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our Trustworthy AI terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Bias, Explainability, Safety & Security, and Privacy Subcards. Please report security vulnerabilities or NVIDIA AI Concerns here.

Help me test my AI-Powered Quantum Network Monitor Assistant with quantum-ready security checks. The full open-source code for the Quantum Network Monitor Service is available at my GitHub repos (repos with NetworkMonitor in the name): Source Code Quantum Network Monitor.
You will also find the code I use to quantize the models if you want to do it yourself: GGUFModelBuilder

💬 How to test: Choose an AI assistant type:
- `TurboLLM` (GPT-4.1-mini)
- `HugLLM` (Hugging Face open-source models)
- `TestLLM` (Experimental CPU-only)

What I'm Testing: I'm pushing the limits of small open-source models for AI network monitoring, specifically:
- Function calling against live network services
- How small can a model go while still handling:
  - Automated Nmap security scans
  - Quantum-readiness checks
  - Network monitoring tasks

🟡 TestLLM – Current experimental model (llama.cpp on 2 CPU threads on a Hugging Face docker space):
- ✅ Zero-configuration setup
- ⏳ 30s load time (slow inference but no API costs). No token limit, as the cost is low.
- 🔧 Help wanted! If you're into edge-device AI, let's collaborate!

Other Assistants

🟢 TurboLLM – Uses gpt-4.1-mini:
- It performs very well, but unfortunately OpenAI charges per token. For this reason token usage is limited.
- Create custom cmd processors to run .net code on Quantum Network Monitor Agents
- Real-time network diagnostics and monitoring
- Security audits
- Penetration testing (Nmap/Metasploit)

🔵 HugLLM – Latest open-source models:
- 🌐 Runs on the Hugging Face Inference API. Performs pretty well using the latest models hosted on Novita.

💡 Example commands you could test:
1. `"Give me info on my website's SSL certificate"`
2. `"Check if my server is using quantum safe encryption for communication"`
3. `"Run a comprehensive security audit on my server"`
4. `"Create a cmd processor to .. (whatever you want)"` Note: you need to install a Quantum Network Monitor Agent to run the .net code on. This is a very flexible and powerful feature. Use with caution!

I fund the servers used to create these model files, run the Quantum Network Monitor service, and pay for inference from Novita and OpenAI—all out of my own pocket. All the code behind the model creation and the Quantum Network Monitor project is open source.
Feel free to use whatever you find helpful. If you appreciate the work, please consider buying me a coffee ☕. Your support helps cover service costs and allows me to raise token limits for everyone.
Qwen2.5-7B-Instruct-1M-GGUF
Qwen2.5-VL-72B-Instruct-GGUF
LongWriter-Zero-32B-GGUF
granite-4.0-1b-GGUF
Qwen2.5-3B-Instruct-GGUF
nomic-embed-code-GGUF
MetaStone-L1-7B-GGUF
granite-4.0-350m-GGUF
This model was generated using llama.cpp at commit `16724b5b6`. Click here to get info on choosing the right GGUF model format.

Model Summary: Granite-4.0-350M is a lightweight instruct model finetuned from Granite-4.0-350M-Base using a combination of open-source instruction datasets with permissive licenses and internally collected synthetic datasets. This model is developed using a diverse set of techniques including supervised finetuning, reinforcement learning, and model merging.

- Developers: Granite Team, IBM
- HF Collection: Granite 4.0 Nano Language Models HF Collection
- GitHub Repository: ibm-granite/granite-4.0-nano-language-models
- Website: Granite Docs
- Release Date: October 28, 2025
- License: Apache 2.0

Supported Languages: English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese. Users may fine-tune Granite 4.0 Nano models to support languages beyond those included in this list.

Intended use: Granite 4.0 Nano instruct models feature strong instruction-following capabilities, bringing advanced AI capabilities within reach for on-device deployments and research use cases. Additionally, their compact size makes them well-suited for fine-tuning on specialized domains without requiring massive compute resources.

Capabilities: Summarization, Text classification, Text extraction, Question-answering, Retrieval Augmented Generation (RAG), Code-related tasks, Function-calling tasks, Multilingual dialog use cases, Fill-In-the-Middle (FIM) code completions

Generation: This is a simple example of how to use the Granite-4.0-350M model. Copy the snippet from the section that is relevant for your use case.

Tool-calling: Granite-4.0-350M comes with enhanced tool-calling capabilities, enabling seamless integration with external functions and APIs. To define a list of tools, please follow OpenAI's function definition schema.
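As a hedged sketch of a tool list in OpenAI's function-definition schema, the shape might look like the following; the `get_weather` tool and its parameters are hypothetical examples, not part of the card:

```python
# Hypothetical tool following OpenAI's function-definition schema; with
# Transformers this list would be passed via
# tokenizer.apply_chat_template(messages, tools=tools, ...).
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"},
                },
                "required": ["city"],
            },
        },
    }
]
messages = [{"role": "user", "content": "What's the weather in Madrid?"}]
```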
This is an example of how to use the Granite-4.0-350M model's tool-calling ability:

Benchmarks

Multilingual benchmarks and the included languages:

| Benchmark | # Languages | Languages |
| :---- | :---- | :---- |
| MMMLU | 11 | ar, de, en, es, fr, ja, ko, pt, zh, bn, hi |
| INCLUDE | 14 | hi, bn, ta, te, ar, de, es, fr, it, ja, ko, nl, pt, zh |

Granite-4.0-350M baseline is based on a decoder-only dense transformer architecture. Core components of this architecture are: GQA, MLP with SwiGLU, RMSNorm, and shared input/output embeddings.

| Metric | 350M Dense | H 350M Dense | 1B Dense | H 1B Dense |
| :---- | :---- | :---- | :---- | :---- |
| Number of layers | 28 attention | 4 attention / 28 Mamba2 | 40 attention | 4 attention / 36 Mamba2 |
| MLP / shared expert hidden size | 2048 | 2048 | 4096 | 4096 |

Training Data: Overall, our SFT data is largely composed of three key sources: (1) publicly available datasets with permissive licenses, (2) internal synthetic data targeting specific capabilities, and (3) a select set of human-curated data.

Infrastructure: We trained the Granite 4.0 Nano Language Models utilizing an NVIDIA GB200 NVL72 cluster hosted in CoreWeave. Intra-rack communication occurs via the 72-GPU NVLink domain, and a non-blocking, full Fat-Tree NDR 400 Gb/s InfiniBand network provides inter-rack communication. This cluster provides a scalable and efficient infrastructure for training our models over thousands of GPUs.

Ethical Considerations and Limitations: Granite 4.0 Nano Instruct Models are primarily finetuned using instruction-response pairs, mostly in English, but also multilingual data covering multiple languages. Although this model can handle multilingual dialog use cases, its performance might not match that on English tasks. In such cases, introducing a small number of examples (few-shot) can help the model generate more accurate outputs. While this model has been aligned with safety in mind, the model may in some cases produce inaccurate, biased, or unsafe responses to user prompts.
So we urge the community to use this model with proper safety testing and tuning tailored for their specific tasks.

Resources
- ⭐️ Learn about the latest updates with Granite: https://www.ibm.com/granite
- 📄 Get started with tutorials, best practices, and prompt engineering advice: https://www.ibm.com/granite/docs/
- 💡 Learn about the latest Granite learning resources: https://ibm.biz/granite-learning-resources
granite-4.0-h-350m-GGUF
This model was generated using llama.cpp at commit `16724b5b6`. Click here to get info on choosing the right GGUF model format.

Model Summary: Granite-4.0-H-350M is a lightweight instruct model finetuned from Granite-4.0-H-350M-Base using a combination of open-source instruction datasets with permissive licenses and internally collected synthetic datasets. This model is developed using a diverse set of techniques including supervised finetuning, reinforcement learning, and model merging.

- Developers: Granite Team, IBM
- HF Collection: Granite 4.0 Nano Language Models HF Collection
- GitHub Repository: ibm-granite/granite-4.0-nano-language-models
- Website: Granite Docs
- Release Date: October 28, 2025
- License: Apache 2.0

Supported Languages: English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese. Users may fine-tune Granite 4.0 Nano models to support languages beyond those included in this list.

Intended use: Granite 4.0 Nano instruct models feature strong instruction-following capabilities, bringing advanced AI capabilities within reach for on-device deployments and research use cases. Additionally, their compact size makes them well-suited for fine-tuning on specialized domains without requiring massive compute resources.

Capabilities: Summarization, Text classification, Text extraction, Question-answering, Retrieval Augmented Generation (RAG), Code-related tasks, Function-calling tasks, Multilingual dialog use cases, Fill-In-the-Middle (FIM) code completions

Generation: This is a simple example of how to use the Granite-4.0-H-350M model. Copy the snippet from the section that is relevant for your use case.

Tool-calling: Granite-4.0-H-350M comes with enhanced tool-calling capabilities, enabling seamless integration with external functions and APIs. To define a list of tools, please follow OpenAI's function definition schema.
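On the consuming side, here is a hedged sketch of handling a tool call the model emits: the JSON shape follows OpenAI's function-call convention, while the registry and the `add` tool are hypothetical application details.

```python
import json

def dispatch_tool_call(raw: str, tools_impl: dict):
    """Parse a JSON tool call emitted by the model and dispatch it to a
    local implementation registered under the tool's name."""
    call = json.loads(raw)
    fn = tools_impl[call["name"]]
    return fn(**call.get("arguments", {}))

# Example registry and call payload (hypothetical):
tools_impl = {"add": lambda a, b: a + b}
result = dispatch_tool_call('{"name": "add", "arguments": {"a": 2, "b": 3}}', tools_impl)
# result == 5
```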
This is an example of how to use the Granite-4.0-H-350M model's tool-calling ability:

Benchmarks

Multilingual benchmarks and the included languages:

| Benchmark | # Languages | Languages |
| :---- | :---- | :---- |
| MMMLU | 11 | ar, de, en, es, fr, ja, ko, pt, zh, bn, hi |
| INCLUDE | 14 | hi, bn, ta, te, ar, de, es, fr, it, ja, ko, nl, pt, zh |

Granite-4.0-H-350M baseline is based on a decoder-only dense transformer architecture. Core components of this architecture are: GQA, Mamba2, MLP with SwiGLU, RMSNorm, and shared input/output embeddings.

| Metric | 350M Dense | H 350M Dense | 1B Dense | H 1B Dense |
| :---- | :---- | :---- | :---- | :---- |
| Number of layers | 28 attention | 4 attention / 28 Mamba2 | 40 attention | 4 attention / 36 Mamba2 |
| MLP / shared expert hidden size | 2048 | 2048 | 4096 | 4096 |

Training Data: Overall, our SFT data is largely composed of three key sources: (1) publicly available datasets with permissive licenses, (2) internal synthetic data targeting specific capabilities, and (3) a select set of human-curated data.

Infrastructure: We trained the Granite 4.0 Nano Language Models utilizing an NVIDIA GB200 NVL72 cluster hosted in CoreWeave. Intra-rack communication occurs via the 72-GPU NVLink domain, and a non-blocking, full Fat-Tree NDR 400 Gb/s InfiniBand network provides inter-rack communication. This cluster provides a scalable and efficient infrastructure for training our models over thousands of GPUs.

Ethical Considerations and Limitations: Granite 4.0 Nano Instruct Models are primarily finetuned using instruction-response pairs, mostly in English, but also multilingual data covering multiple languages. Although this model can handle multilingual dialog use cases, its performance might not match that on English tasks. In such cases, introducing a small number of examples (few-shot) can help the model generate more accurate outputs. While this model has been aligned with safety in mind, the model may in some cases produce inaccurate, biased, or unsafe responses to user prompts.
So we urge the community to use this model with proper safety testing and tuning tailored for their specific tasks.

Resources
- ⭐️ Learn about the latest updates with Granite: https://www.ibm.com/granite
- 📄 Get started with tutorials, best practices, and prompt engineering advice: https://www.ibm.com/granite/docs/
- 💡 Learn about the latest Granite learning resources: https://ibm.biz/granite-learning-resources
Hermes-4-14B-GGUF
This model was generated using llama.cpp at commit `4fd1242b`. Click here to get info on choosing the right GGUF model format

Hermes 4 14B is a frontier, hybrid-mode reasoning model based on Qwen 3 14B by Nous Research that is aligned to you. Read the Hermes 4 technical report here: Hermes 4 Technical Report. Chat with Hermes in Nous Chat: https://chat.nousresearch.com

Training highlights include a newly synthesized post-training corpus emphasizing verified reasoning traces and massive improvements in math, code, STEM, logic, creativity, and format-faithful outputs, while preserving general assistant quality and broadly neutral alignment.

- Post-training corpus: massively increased dataset size, from 1M samples and 1.2B tokens to ~5M samples / ~60B tokens, blended across reasoning and non-reasoning data.
- Hybrid reasoning mode with explicit `<think> … </think>` segments when the model decides to deliberate, and options to make responses faster when you want.
- Top-quality, expressive reasoning that improves math, code, STEM, logic, and even creative writing and subjective responses.
- Schema adherence & structured outputs: trained to produce valid JSON for given schemas and to repair malformed objects.
- Much easier to steer and align: extreme improvements in steerability, especially reduced refusal rates.

In pursuit of the mission of producing models that are open, steerable, and capable of producing the full range of human expression, while being able to be aligned to your values, we created a new benchmark, RefusalBench, that tests a model's willingness to be helpful in a variety of scenarios commonly disallowed by closed and open models. Hermes 4 achieves SOTA on RefusalBench across all popular closed and open models in being helpful and conforming to your values, without censorship.

> Full tables, settings, and comparisons are in the technical report.

Hermes 4 uses ChatML format with role headers and special tags.
Reasoning mode can be activated with the chat template via the flag `thinking=True` or by using the following system prompt: Note that you can add any additional system instructions before or after this system message, and they will adjust the model's policies, style, and effort of thinking, as well as its post-thinking style, format, identity, and more. You may also interleave the tool-definition system message with the reasoning one. Additionally, we provide a flag to keep the content in between the `<think> ... </think>` tags, which you can play with by setting `keep_cots=True`.

Hermes 4 supports function/tool calls within a single assistant turn, produced after its reasoning. Note that you may also simply place tool definitions into the "tools:" field of your messages, and the chat template will parse them and create the system prompt for you. This also works with reasoning mode, for improved accuracy of tool use. The model will then generate tool calls within `<tool_call> ... </tool_call>` tags, for easy parsing. The tool-call tags are also added tokens, so they are easy to parse while streaming! There are also automatic tool parsers built into vLLM and SGLang for Hermes: just set the tool parser in vLLM to `hermes` and in SGLang to `qwen25`.

- Sampling defaults that work well: `temperature=0.6, top_p=0.95, top_k=20`.
- Template: use the ChatML chat format for Hermes 4 14B as shown above, or set `add_generation_prompt=True` when using `tokenizer.apply_chat_template(...)`.

For production serving on multi-GPU nodes, consider tensor-parallel inference engines (e.g., SGLang/vLLM backends) with prefix caching. Hermes 4 is available as original BF16 weights as well as FP8 variants and GGUF variants by LM Studio. FP8: https://huggingface.co/NousResearch/Hermes-4-14B-FP8

Hermes 4 is also available in larger sizes (e.g., 70B, 405B) with similar prompt formats.
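Hermes-style chat templates typically wrap each tool call in `<tool_call> … </tool_call>` tags containing a JSON object. Assuming that convention, a minimal extraction helper might look like the sketch below (the helper name and the `get_weather` call are hypothetical, shown only to illustrate the parsing step; for serving, prefer the built-in `hermes` parser in vLLM):

```python
import json
import re

# Hypothetical helper: pull JSON tool calls out of an assistant turn.
# Assumes the <tool_call>...</tool_call> tag convention used by
# Hermes-style templates; adjust the tag name if your template differs.
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)

def extract_tool_calls(assistant_text: str) -> list[dict]:
    """Return each tool call as a parsed dict, e.g. {"name": ..., "arguments": ...}."""
    return [json.loads(m.strip()) for m in TOOL_CALL_RE.findall(assistant_text)]

reply = (
    "Let me check the weather first.\n"
    '<tool_call>{"name": "get_weather", "arguments": {"city": "Berlin"}}</tool_call>'
)
calls = extract_tool_calls(reply)
print(calls[0]["name"])  # get_weather
```

Because the tags are dedicated added tokens, the same pattern can be applied incrementally while streaming: buffer text after an opening tag and parse once the closing tag arrives.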
See the Hermes 4 collection to explore them all: https://huggingface.co/collections/NousResearch/hermes-4-collection-68a731bfd452e20816725728
Apriel-1.5-15b-Thinker-GGUF
This model was generated using llama.cpp at commit `c8dedc99`. Click here to get info on choosing the right GGUF model format

Apriel-1.5-15b-Thinker - Mid training is all you need!

1. Summary
2. Evaluation
3. Training Details
4. How to Use
5. Intended Use
6. Limitations
7. Security and Responsible Use
8. Software
9. License
10. Acknowledgements
11. Citation

Click here to skip to the technical report -> https://huggingface.co/ServiceNow-AI/Apriel-1.5-15b-Thinker/blob/main/Apriel-1.5-Thinker.pdf

Apriel-1.5-15b-Thinker is a multimodal reasoning model in ServiceNow's Apriel SLM series which achieves competitive performance against models 10 times its size. Apriel-1.5 is the second model in the reasoning series. It introduces enhanced textual reasoning capabilities and adds image reasoning support to the previous text model. It has undergone extensive continual pretraining across both text and image domains. In terms of post-training, this model has undergone text-SFT only. Our research demonstrates that with a strong mid-training regimen, we are able to achieve SOTA performance on text and image reasoning tasks without any image SFT training or RL.

Highlights
- Achieves a score of 52 on the Artificial Analysis index and is competitive with Deepseek R1 0528, Gemini-Flash, etc.
- It is at least 1/10 the size of any other model that scores above 50 on the Artificial Analysis index.
- Scores 68 on Tau2 Bench Telecom and 62 on IFBench, which are key benchmarks for the enterprise domain.
- At 15B parameters, the model fits on a single GPU, making it highly memory-efficient.
- For text benchmarks, we report evaluations performed by a third party, Artificial Analysis.
- For image benchmarks, we report evaluations obtained with VLMEvalKit (https://github.com/open-compass/VLMEvalKit).

We are a small lab with big goals. While we are not GPU-poor, our lab has, in comparison, a tiny fraction of the compute available to other frontier labs. Our goal with this work is to show that a SOTA model can be built with limited resources given the right data, design, and a solid methodology. We set out to build a small but powerful model, aiming for capabilities on par with frontier models. Developing a 15B model with this level of performance requires tradeoffs, so we prioritized getting SOTA-level performance first. Mid-training consists only of CPT and SFT; no RL has been applied. This model performs extensive reasoning by default, allocating extra internal effort to improve robustness and accuracy even on simpler queries. You may notice slightly higher token usage and longer response times, but we are actively working to make it more efficient and concise in future releases.

Mid training / Continual Pre‑training: In this stage, the model is trained on billions of tokens of carefully curated textual samples drawn from mathematical reasoning, coding challenges, scientific discourse, logical puzzles, and diverse knowledge-rich texts, along with multimodal samples covering image understanding and reasoning, captioning, and interleaved image-text data. The objective is to strengthen the foundational reasoning capabilities of the model. This stage is critical for the model to function as a reasoner and provides significant lifts on reasoning benchmarks.
Supervised Fine‑Tuning (SFT): The model is fine-tuned on over 2M high-quality text samples spanning mathematical and scientific problem-solving, coding tasks, instruction-following, API/function invocation, and conversational use cases. This results in superior text performance comparable to models such as Deepseek R1 0528 and Gemini-Flash. Although no image-specific fine-tuning is performed, the model's inherent multimodal capabilities and cross-modal transfer of reasoning behavior from the text SFT yield competitive image performance relative to other leading open-source VL models.

As the upstream PR is not yet merged, you can use this custom image as an alternate way to run the model with tool and reasoning parsers enabled. This will start the vLLM OpenAI-compatible API server serving the Apriel-1.5-15B-Thinker model with Apriel's custom tool parser and reasoning parser.

Here is a code snippet demonstrating the model's usage with the transformers library's generate function: The model will first generate its thinking process and then generate its final response between `[BEGIN FINAL RESPONSE]` and `[END FINAL RESPONSE]`. Here is a code snippet demonstrating the application of the chat template:

Usage Guidelines
1. Use the model's default chat template, which already includes a system prompt.
2. We recommend setting temperature to `0.6`.
3. We ensure the model starts with `Here are my reasoning steps:\n` during all our evaluations. This is implemented in the default chat template.
4. For multi-turn conversations, intermediate turns (historical model outputs) are expected to contain only the final response, without reasoning steps.
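Guideline 4 implies stripping the reasoning before feeding a model output back as history. Assuming the `[BEGIN FINAL RESPONSE]`/`[END FINAL RESPONSE]` delimiters appear verbatim in the generation, a small helper (hypothetical, not part of the model's own tooling) could split the output:

```python
# Hypothetical helper: split an Apriel generation into (reasoning, final
# response) using the delimiters described above. For multi-turn history,
# only the final response should be kept.
BEGIN = "[BEGIN FINAL RESPONSE]"
END = "[END FINAL RESPONSE]"

def split_output(text: str) -> tuple[str, str]:
    """Return (reasoning, final); final is empty if the delimiters are absent."""
    start = text.find(BEGIN)
    if start == -1:
        return text.strip(), ""
    end = text.find(END, start)
    final = text[start + len(BEGIN): end if end != -1 else len(text)]
    return text[:start].strip(), final.strip()

raw = (
    "Here are my reasoning steps:\nFirst I consider the units...\n"
    "[BEGIN FINAL RESPONSE]42 km[END FINAL RESPONSE]"
)
reasoning, final = split_output(raw)
# Only `final` goes back into the conversation history on later turns.
```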
The Apriel family of models is designed for a variety of general-purpose instruction tasks, including:
- Code assistance and generation
- Logical reasoning and multi-step tasks
- Question answering and information retrieval
- Function calling, complex instruction following, and agent use cases

They are not intended for use in safety-critical applications without human oversight or in scenarios requiring guaranteed factual accuracy.

- Factual accuracy: May produce incorrect, misleading, or outdated content. Outputs should be verified before use in critical contexts.
- Bias: May reflect societal, cultural, or systemic biases present in training data.
- Ethics: Do not use the model to produce harmful, unlawful, or unethical content.
- Language: Strongest performance is in English. Output quality may degrade in underrepresented languages.
- Critical use: Not suitable for medical, legal, financial, or other high-risk applications without safeguards.

Security Responsibilities: Deployers and users are strongly encouraged to align their security practices with established frameworks and regulatory guidelines such as the EU AI Act and the NIST AI Risk Management Framework (RMF).

- Regularly conduct robustness assessments to identify and mitigate adversarial inputs.
- Implement validation and filtering processes to prevent harmful or biased outputs.
- Continuously perform data privacy checks to guard against unintended data leaks.
- Document and communicate the model's limitations, intended usage, and known security risks to all end-users.
- Schedule periodic security reviews and updates to address emerging threats and vulnerabilities.
- Follow established security policies and usage guidelines provided by deployers.
- Protect and manage sensitive information when interacting with the model.
- Report anomalies, suspicious behavior, or unsafe outputs to deployers or developers.
- Maintain human oversight and apply judgment to mitigate potential security or ethical risks during interactions.

Disclaimer: Users accept responsibility for securely deploying, managing, and using this open-source LLM. The model is provided "as-is," without explicit or implied warranty regarding security or fitness for any specific application or environment.
granite-4.0-h-small-GGUF
This model was generated using llama.cpp at commit `56b479584`. Click here to get info on choosing the right GGUF model format

📣 Update [10-07-2025]: Added a default system prompt to the chat template to guide the model towards more professional, accurate, and safe responses.

Model Summary: Granite-4.0-H-Small is a 32B parameter long-context instruct model finetuned from Granite-4.0-H-Small-Base using a combination of open-source instruction datasets with permissive licenses and internally collected synthetic datasets. This model is developed using a diverse set of techniques with a structured chat format, including supervised finetuning, model alignment using reinforcement learning, and model merging. Granite 4.0 instruct models feature improved instruction following (IF) and tool-calling capabilities, making them more effective in enterprise applications.

- Developers: Granite Team, IBM
- HF Collection: Granite 4.0 Language Models HF Collection
- GitHub Repository: ibm-granite/granite-4.0-language-models
- Website: Granite Docs
- Release Date: October 2nd, 2025
- License: Apache 2.0

Supported Languages: English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese. Users may finetune Granite 4.0 models for languages beyond this list.
Intended use: The model is designed to respond to general instructions and can be used to build AI assistants for multiple domains, including business applications.

Capabilities
- Summarization
- Text classification
- Text extraction
- Question-answering
- Retrieval Augmented Generation (RAG)
- Code-related tasks
- Function-calling tasks
- Multilingual dialog use cases
- Fill-In-the-Middle (FIM) code completions

Generation: This is a simple example of how to use the Granite-4.0-H-Small model. Then, copy the snippet from the section that is relevant for your use case.

Tool-calling: Granite-4.0-H-Small comes with enhanced tool-calling capabilities, enabling seamless integration with external functions and APIs. To define a list of tools, please follow OpenAI's function definition schema. This is an example of how to use the Granite-4.0-H-Small model's tool-calling ability:

Benchmarks (reported for the Micro Dense, H Micro Dense, H Tiny MoE, and H Small MoE variants)

Multilingual benchmarks and the included languages:
- MMMLU (11): ar, de, en, es, fr, ja, ko, pt, zh, bn, hi
- INCLUDE (14): hi, bn, ta, te, ar, de, es, fr, it, ja, ko, nl, pt, zh

Model Architecture: The Granite-4.0-H-Small baseline is built on a decoder-only MoE transformer architecture. Core components of this architecture are: GQA, Mamba2, MoEs with shared experts, SwiGLU activation, RMSNorm, and shared input/output embeddings.

| Model | Micro Dense | H Micro Dense | H Tiny MoE | H Small MoE |
| --- | --- | --- | --- | --- |
| Number of layers | 40 attention | 4 attention / 36 Mamba2 | 4 attention / 36 Mamba2 | 4 attention / 36 Mamba2 |
| MLP / Shared expert hidden size | 8192 | 8192 | 1024 | 1536 |

Training Data: Overall, our SFT data is largely comprised of three key sources: (1) publicly available datasets with permissive licenses, (2) internal synthetic data targeting specific capabilities, and (3) a select set of human-curated data.
Infrastructure: We trained the Granite 4.0 Language Models utilizing an NVIDIA GB200 NVL72 cluster hosted in CoreWeave. Intra-rack communication occurs via the 72-GPU NVLink domain, and a non-blocking, full Fat-Tree NDR 400 Gb/s InfiniBand network provides inter-rack communication. This cluster provides a scalable and efficient infrastructure for training our models over thousands of GPUs.

Ethical Considerations and Limitations: Granite 4.0 Instruct models are primarily finetuned on instruction-response pairs, mostly in English but also including multilingual data covering multiple languages. Although this model can handle multilingual dialog use cases, its performance may not match that on English tasks. In such cases, introducing a small number of examples (few-shot) can help the model generate more accurate outputs. While this model has been aligned with safety in consideration, it may in some cases produce inaccurate, biased, or unsafe responses to user prompts. So we urge the community to use this model with proper safety testing and tuning tailored for their specific tasks.

Resources
- ⭐️ Learn about the latest updates with Granite: https://www.ibm.com/granite
- 📄 Get started with tutorials, best practices, and prompt engineering advice: https://www.ibm.com/granite/docs/
- 💡 Learn about the latest Granite learning resources: https://ibm.biz/granite-learning-resources
granite-4.0-h-1b-GGUF
This model was generated using llama.cpp at commit `16724b5b6`. Click here to get info on choosing the right GGUF model format

Model Summary: Granite-4.0-H-1B is a lightweight instruct model finetuned from Granite-4.0-H-1B-Base using a combination of open-source instruction datasets with permissive licenses and internally collected synthetic datasets. This model is developed using a diverse set of techniques including supervised finetuning, reinforcement learning, and model merging.

- Developers: Granite Team, IBM
- HF Collection: Granite 4.0 Nano Language Models HF Collection
- GitHub Repository: ibm-granite/granite-4.0-nano-language-models
- Website: Granite Docs
- Release Date: October 28, 2025
- License: Apache 2.0

Supported Languages: English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese. Users may fine-tune Granite 4.0 Nano models to support languages beyond those included in this list.

Intended use: Granite 4.0 Nano instruct models feature strong instruction-following capabilities, bringing advanced AI capabilities within reach for on-device deployments and research use cases. Additionally, their compact size makes them well-suited for fine-tuning on specialized domains without requiring massive compute resources.

Capabilities
- Summarization
- Text classification
- Text extraction
- Question-answering
- Retrieval Augmented Generation (RAG)
- Code-related tasks
- Function-calling tasks
- Multilingual dialog use cases
- Fill-In-the-Middle (FIM) code completions

Generation: This is a simple example of how to use the Granite-4.0-H-1B model. Then, copy the snippet from the section that is relevant for your use case.

Tool-calling: Granite-4.0-H-1B comes with enhanced tool-calling capabilities, enabling seamless integration with external functions and APIs. To define a list of tools, please follow OpenAI's function definition schema.
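As a sketch of that schema: the `get_weather` function and its fields below are hypothetical, shown only to illustrate the OpenAI function-definition layout the card refers to; a list like this is what you would pass to the chat template's tools argument.

```python
# Hypothetical tool list following OpenAI's function definition schema,
# as referenced above. The function name and parameters are illustrative only.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name."},
                },
                "required": ["city"],
            },
        },
    }
]
print(tools[0]["function"]["name"])  # get_weather
```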
This is an example of how to use the Granite-4.0-H-1B model's tool-calling ability:

Benchmarks (reported for the 350M Dense, H 350M Dense, 1B Dense, and H 1B Dense variants)

Multilingual benchmarks and the included languages:
- MMMLU (11): ar, de, en, es, fr, ja, ko, pt, zh, bn, hi
- INCLUDE (14): hi, bn, ta, te, ar, de, es, fr, it, ja, ko, nl, pt, zh

Model Architecture: The Granite-4.0-H-1B baseline is based on a decoder-only dense transformer architecture. Core components of this architecture are: GQA, Mamba2, MLP with SwiGLU, RMSNorm, and shared input/output embeddings.

| Model | 350M Dense | H 350M Dense | 1B Dense | H 1B Dense |
| --- | --- | --- | --- | --- |
| Number of layers | 28 attention | 4 attention / 28 Mamba2 | 40 attention | 4 attention / 36 Mamba2 |
| MLP / Shared expert hidden size | 2048 | 2048 | 4096 | 4096 |

Training Data: Overall, our SFT data is largely comprised of three key sources: (1) publicly available datasets with permissive licenses, (2) internal synthetic data targeting specific capabilities, and (3) a select set of human-curated data.

Infrastructure: We trained the Granite 4.0 Nano Language Models utilizing an NVIDIA GB200 NVL72 cluster hosted in CoreWeave. Intra-rack communication occurs via the 72-GPU NVLink domain, and a non-blocking, full Fat-Tree NDR 400 Gb/s InfiniBand network provides inter-rack communication. This cluster provides a scalable and efficient infrastructure for training our models over thousands of GPUs.

Ethical Considerations and Limitations: Granite 4.0 Nano Instruct models are primarily finetuned on instruction-response pairs, mostly in English but also including multilingual data covering multiple languages. Although this model can handle multilingual dialog use cases, its performance may not match that on English tasks. In such cases, introducing a small number of examples (few-shot) can help the model generate more accurate outputs. While this model has been aligned with safety in consideration, it may in some cases produce inaccurate, biased, or unsafe responses to user prompts.
So we urge the community to use this model with proper safety testing and tuning tailored for their specific tasks.

Resources
- ⭐️ Learn about the latest updates with Granite: https://www.ibm.com/granite
- 📄 Get started with tutorials, best practices, and prompt engineering advice: https://www.ibm.com/granite/docs/
- 💡 Learn about the latest Granite learning resources: https://ibm.biz/granite-learning-resources
Qwen3-Embedding-8B-GGUF
Qwen3-Reranker-8B-GGUF
Qwen3-Embedding-4B-GGUF
Holo1-7B-GGUF
gemma-3-4b-it-gguf
SWE-Dev-7B-GGUF
II-Medical-8B-1706-GGUF
Phi-4-mini-instruct.gguf
Qwen2.5-VL-32B-Instruct-GGUF
gemma-3-12b-it-gguf
neutts-air-GGUF
xLAM-2-3b-fc-r-GGUF
Qwen2.5-1.5B-Instruct-GGUF
granite-4.0-h-tiny-GGUF
This model was generated using llama.cpp at commit `ee09828cb`. I've been experimenting with a new quantization approach that selectively elevates the precision of key layers beyond what the default IMatrix configuration provides. In my testing, standard IMatrix quantization underperforms at lower bit depths, especially with Mixture of Experts (MoE) models. To address this, I'm using the `--tensor-type` option in `llama.cpp` to manually "bump" important layers to higher precision. You can see the implementation here: 👉 Layer bumping with llama.cpp. While this does increase model file size, it significantly improves precision for a given quantization level. I'd love your feedback—have you tried this? How does it perform for you?

Click here to get info on choosing the right GGUF model format

📣 Update [10-07-2025]: Added a default system prompt to the chat template to guide the model towards more professional, accurate, and safe responses.

Model Summary: Granite-4.0-H-Tiny is a 7B-parameter long-context instruct model finetuned from Granite-4.0-H-Tiny-Base using a combination of open-source instruction datasets with permissive licenses and internally collected synthetic datasets. This model is developed using a diverse set of techniques with a structured chat format, including supervised finetuning, model alignment using reinforcement learning, and model merging. Granite 4.0 instruct models feature improved instruction-following (IF) and tool-calling capabilities, making them more effective in enterprise applications.

- Developers: Granite Team, IBM
- HF Collection: Granite 4.0 Language Models HF Collection
- GitHub Repository: ibm-granite/granite-4.0-language-models
- Website: Granite Docs
- Release Date: October 2nd, 2025
- License: Apache 2.0

Supported Languages: English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese. Users may finetune Granite 4.0 models for languages beyond these.
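The per-tensor "bump" idea described at the top of this card — matching tensor names against patterns and raising the quantization type for sensitive layers — can be sketched in a few lines. This is a hypothetical illustration of the matching logic only, not llama.cpp's actual implementation; the patterns and quant types are illustrative assumptions.

```python
# Sketch of pattern-based per-tensor quant overrides (illustrative only).
import re

# (pattern, quant type) overrides, checked in order; first match wins.
OVERRIDES = [
    (r"token_embd", "Q8_0"),   # embeddings tend to be precision-sensitive
    (r"ffn_down",   "Q6_K"),   # bump FFN/MoE down-projections
    (r"attn_v",     "Q6_K"),   # value projections degrade fast at low bits
]

def pick_quant(tensor_name: str, default: str = "Q4_K_M") -> str:
    """Return the quant type for a tensor, honoring any override."""
    for pattern, qtype in OVERRIDES:
        if re.search(pattern, tensor_name):
            return qtype
    return default

print(pick_quant("blk.0.ffn_down.weight"))  # bumped to Q6_K
print(pick_quant("blk.0.ffn_up.weight"))    # stays at the default Q4_K_M
```

Only the overridden tensors grow in size, which is why the overall file-size increase stays modest relative to the precision gain.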
Intended use: The model is designed to respond to general instructions and can be used to build AI assistants for multiple domains, including business applications.

Capabilities
- Summarization
- Text classification
- Text extraction
- Question-answering
- Retrieval-Augmented Generation (RAG)
- Code-related tasks
- Function-calling tasks
- Multilingual dialog use cases
- Fill-In-the-Middle (FIM) code completions

Generation: This is a simple example of how to use the Granite-4.0-H-Tiny model. Copy the snippet from the section that is relevant for your use case.

Tool-calling: Granite-4.0-H-Tiny comes with enhanced tool-calling capabilities, enabling seamless integration with external functions and APIs. To define a list of tools, please follow OpenAI's function-definition schema. This is an example of how to use the Granite-4.0-H-Tiny model's tool-calling ability:

Multilingual benchmarks and the included languages:

| Benchmark | # Languages | Languages |
|---|---|---|
| MMMLU | 11 | ar, de, en, es, fr, ja, ko, pt, zh, bn, hi |
| INCLUDE | 14 | hi, bn, ta, te, ar, de, es, fr, it, ja, ko, nl, pt, zh |

Model Architecture: Granite-4.0-H-Tiny is built on a decoder-only MoE transformer architecture. Core components of this architecture are: GQA, Mamba2, MoEs with shared experts, SwiGLU activation, RMSNorm, and shared input/output embeddings.

| Model | Micro Dense | H Micro Dense | H Tiny MoE | H Small MoE |
|---|---|---|---|---|
| Number of layers | 40 attention | 4 attention / 36 Mamba2 | 4 attention / 36 Mamba2 | 4 attention / 36 Mamba2 |
| MLP / shared expert hidden size | 8192 | 8192 | 1024 | 1536 |

Training Data: Overall, our SFT data is largely comprised of three key sources: (1) publicly available datasets with permissive licenses, (2) internal synthetic data targeting specific capabilities, and (3) a select set of human-curated data.
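As a concrete illustration of the OpenAI function-definition schema referenced in the tool-calling section above, here is a minimal sketch. The weather function, its name, and its parameters are invented for the example; only the schema shape follows the OpenAI convention.

```python
# Minimal tool definition in OpenAI's function-definition schema.
# The function itself ("get_current_weather") is a made-up example.
import json

get_weather_tool = {
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
            },
            "required": ["city"],
        },
    },
}

# A list like this is what you would hand to the chat template
# (typically via a `tools=` argument) so the model can emit a call.
tools = [get_weather_tool]
print(json.dumps(tools, indent=2))
```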
Infrastructure: We trained the Granite 4.0 language models on an NVIDIA GB200 NVL72 cluster hosted by CoreWeave. Intra-rack communication occurs via the 72-GPU NVLink domain, and a non-blocking, full Fat-Tree NDR 400 Gb/s InfiniBand network provides inter-rack communication. This cluster provides a scalable and efficient infrastructure for training our models over thousands of GPUs.

Ethical Considerations and Limitations: Granite 4.0 instruct models are primarily finetuned using instruction-response pairs, mostly in English, but also multilingual data covering multiple languages. Although this model can handle multilingual dialog use cases, its performance might not match that on English tasks. In such cases, introducing a small number of examples (few-shot) can help the model generate more accurate outputs. While this model has been aligned with safety in mind, it may in some cases produce inaccurate, biased, or unsafe responses to user prompts. We therefore urge the community to use this model with proper safety testing and tuning tailored to their specific tasks.

Resources
- ⭐️ Learn about the latest updates with Granite: https://www.ibm.com/granite
- 📄 Get started with tutorials, best practices, and prompt engineering advice: https://www.ibm.com/granite/docs/
- 💡 Learn about the latest Granite learning resources: https://ibm.biz/granite-learning-resources
OpenThinker3-7B-GGUF
MediPhi-Instruct-GGUF
This model was generated using llama.cpp at commit `e2b7621e`. Click here to get info on choosing the right GGUF model format

The MediPhi Model Collection comprises seven small language models of 3.8B parameters, built from the base model `Phi-3.5-mini-instruct` and specialized for the medical and clinical domains. The collection is designed in a modular fashion. Five MediPhi experts are fine-tuned on various medical corpora (i.e., PubMed commercial, Medical Wikipedia, Medical Guidelines, Medical Coding, and open-source clinical documents) and merged back into their base model with the SLERP method to conserve general abilities. A sixth model combines all five experts into one general expert with the multi-model merging method BreadCrumbs. Finally, we clinically aligned this general expert using our large-scale MediFlow corpus (see dataset `microsoft/mediflow`) to obtain the final expert model `MediPhi-Instruct`. This model is `MediPhi-Instruct`, aligned to accomplish clinical NLP tasks.
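The SLERP merge mentioned above interpolates along the arc between two weight vectors rather than the straight line between them, which tends to preserve the geometry of both endpoints better than plain averaging. A minimal numpy sketch of the formula (not the actual MediPhi merge code):

```python
# Spherical linear interpolation (SLERP) between two flattened weight
# tensors. Illustrative sketch only — real merges apply this per tensor
# across two full checkpoints.
import numpy as np

def slerp(v0: np.ndarray, v1: np.ndarray, t: float, eps: float = 1e-8) -> np.ndarray:
    """Interpolate a fraction t of the way from v0 to v1 along the arc."""
    u0 = v0 / np.linalg.norm(v0)
    u1 = v1 / np.linalg.norm(v1)
    dot = np.clip(np.dot(u0, u1), -1.0, 1.0)
    omega = np.arccos(dot)                  # angle between the two vectors
    if omega < eps:                         # nearly parallel: fall back to lerp
        return (1 - t) * v0 + t * v1
    s = np.sin(omega)
    return (np.sin((1 - t) * omega) / s) * v0 + (np.sin(t * omega) / s) * v1

base = np.array([1.0, 0.0])
expert = np.array([0.0, 1.0])
mid = slerp(base, expert, 0.5)  # points along the 45° direction between them
```

At `t=0` the result is exactly the base tensor and at `t=1` exactly the expert, so `t` acts as a dial between general and specialized abilities.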
- Developed by: Microsoft Healthcare & Life Sciences
- Model type: Phi3
- Language(s) (NLP): English
- License: MIT
- Finetuned from model: `microsoft/MediPhi`, and originally from `microsoft/Phi-3.5-mini-instruct`
- Repository: Current HF repo
- Paper: A Modular Approach for Clinical SLMs Driven by Synthetic Data with Pre-Instruction Tuning, Model Merging, and Clinical-Tasks Alignment

Intended Uses

Primary Use Cases: The model is intended for research use in English, especially clinical natural language processing. It is suited to research settings that require:
- Medically adapted language models
- Memory/compute-constrained environments
- Latency-bound scenarios

Our model is designed to accelerate research on language models in medical and clinical scenarios. It should be used for research purposes, i.e., in a benchmarking context or with expert user verification of the outputs.

Use Case Considerations: Our models are not specifically designed or evaluated for all downstream purposes. Researchers (or developers) should consider common limitations of language models as they select use cases, and evaluate and mitigate for accuracy, safety, and fairness before using them within a specific downstream use case, particularly for high-risk scenarios. Researchers (or developers) should be aware of and adhere to applicable laws and regulations (including privacy and trade compliance laws) that are relevant to their use case. Nothing contained in this Model Card should be interpreted as or deemed a restriction or modification of the license the model is released under.

Responsible AI Considerations: Like other language models, the Phi family of models and the MediPhi collection can potentially behave in ways that are unfair, unreliable, or offensive. Some of the limiting behaviors to be aware of include:

- Quality of Service: The Phi and MediPhi models are trained primarily on English text and some additional multilingual text.
Languages other than English will experience worse performance, and English-language varieties with less representation in the training data might experience worse performance than standard American English.
- Multilingual performance and safety gaps: We believe it is important to make language models more widely available across different languages, but the Phi-3 models still exhibit challenges common across multilingual releases. As with any deployment of LLMs, developers will be better positioned to test for performance or safety gaps for their linguistic and cultural context and to customize the model with additional fine-tuning and appropriate safeguards.
- Representation of Harms & Perpetuation of Stereotypes: These models can over- or under-represent groups of people, erase representation of some groups, or reinforce demeaning or negative stereotypes. Despite safety post-training, these limitations may still be present due to differing levels of representation of different groups, cultural contexts, or the prevalence of examples of negative stereotypes in training data that reflect real-world patterns and societal biases.
- Inappropriate or Offensive Content: These models may produce other types of inappropriate or offensive content, which may make them inappropriate to deploy in sensitive contexts without additional mitigations specific to the use case.
- Information Reliability: Language models can generate nonsensical content or fabricate content that might sound reasonable but is inaccurate or outdated.
- Limited Scope for Code: The majority of Phi-3 training data is based in Python and uses common packages such as `typing`, `math`, `random`, `collections`, `datetime`, and `itertools`. If the model generates Python scripts that utilize other packages or scripts in other languages, we strongly recommend users manually verify all API uses.
- Long Conversation: Phi-3 models, like other models, can in some cases generate responses that are repetitive, unhelpful, or inconsistent in very long chat sessions in both English and non-English languages. Developers are encouraged to place appropriate mitigations, like limiting conversation turns to account for possible conversational drift.

Researchers should apply responsible AI best practices, including mapping, measuring, and mitigating risks associated with their specific use case and cultural and linguistic context. They are encouraged to rigorously evaluate the model for their use case, fine-tune the models when possible, and leverage the models as part of broader AI systems with language-specific safeguards in place. Important areas for consideration include:

- Allocation: Models may not be suitable for scenarios that could have consequential impact on legal status or the allocation of resources or life opportunities (e.g., housing, employment, credit) without further assessments and additional debiasing techniques.
- High-Risk Scenarios: Researchers should assess the suitability of using models in high-risk scenarios where unfair, unreliable, or offensive outputs might be extremely costly or lead to harm. This includes providing advice in sensitive or expert domains where accuracy and reliability are critical (e.g., legal or health advice). Additional safeguards should be implemented at the application level according to the deployment context.
- Misinformation: Models may produce inaccurate information. Researchers should follow transparency best practices and inform end-users they are interacting with an AI system. At the application level, developers can build feedback mechanisms and pipelines to ground responses in use-case-specific, contextual information, a technique known as Retrieval-Augmented Generation (RAG).
- Generation of Harmful Content: Researchers should assess outputs for their context and use available safety classifiers or custom solutions appropriate for their use case.
- Misuse: Other forms of misuse such as fraud, spam, or malware production may be possible, and developers should ensure that their applications do not violate applicable laws and regulations.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_name = "microsoft/MediPhi-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Operative Report:\nPerformed: Cholecystectomy\nOperative Findings: The gallbladder contained multiple stones and had thickening of its wall. Mild peritoneal fluid was noted."
messages = [
    {"role": "system", "content": "Extract medical keywords from this operative note; focus on anatomical, pathological, or procedural vocabulary."},
    {"role": "user", "content": prompt},
]

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)

generation_args = {
    "max_new_tokens": 500,
    "return_full_text": False,
    "temperature": 0.0,
    "do_sample": False,
}

output = pipe(messages, **generation_args)
print(output[0]["generated_text"])
# gallbladder stones, wall thickening, peritoneal fluid
```

Notes: If you want to use flash attention, call `AutoModelForCausalLM.from_pretrained()` with `attn_implementation="flash_attention_2"`. Check `microsoft/Phi-3.5-mini-instruct` for details about the tokenizer, requirements, and basic capabilities.

Continual Pre-training:
- PubMed (commercial subset) and abstracts from `ncbi/pubmed`
- Medical Guidelines: `epfl-llm/guidelines`
- Medical Wikipedia: `jpcorb20/medicalwikipedia`
- Medical Coding: ICD10CM, ICD10PROC, ICD9CM, ICD9PROC, and ATC
- Clinical documents:
  - `zhengyun21/PMC-Patients`, `akemiH/NoteChat`, and `starmpcc/Asclepius-Synthetic-Clinical-Notes` (only commercial-friendly licenses across all three datasets)
  - mtsamples

Modular training: five experts are made from the base model with pre-instruction tuning, merged into one model, and finally clinically aligned. See the paper for details.

| | Phi-3.5-mini-instruct | PubMed | Clinical | MedWiki | Guideline | MedCode | MediPhi | MediPhi-Instruct |
|:-------------|----------------------:|-------:|---------:|--------:|----------:|--------:|--------:|-----------------:|
| MedNLI | 66.6 | 68.3 | 69.2 | 72.8 | 70.3 | 68.5 | 66.9 | 71.0 |
| PLS | 28.4 | 29.2 | 29.4 | 29.2 | 29.8 | 22.3 | 28.8 | 26.0 |
| MeQSum | 36.7 | 37.6 | 38.1 | 37.6 | 37.6 | 33.5 | 37.9 | 42.8 |
| LH | 45.9 | 45.7 | 43.5 | 43.6 | 41.1 | 45.7 | 45.7 | 45.0 |
| MeDiSumQA | 25.9 | 26.3 | 26.7 | 25.1 | 25.1 | 23.6 | 26.1 | 29.1 |
| MeDiSumCode | 41.1 | 41.0 | 40.5 | 41.7 | 41.9 | 39.0 | 41.7 | 37.2 |
| RRS QA | 41.2 | 44.1 | 52.1 | 46.7 | 48.9 | 45.6 | 44.5 | 61.6 |
| MedicationQA | 11.2 | 10.3 | 12.0 | 12.2 | 11.9 | 12.0 | 11.3 | 19.3 |
| MEDEC | 14.8 | 22.2 | 34.5 | 28.8 | 28.3 | 18.1 | 29.1 | 34.4 |
| ACI | 42.3 | 42.7 | 43.9 | 44.7 | 44.7 | 39.0 | 44.3 | 43.5 |
| SDoH | 35.1 | 35.8 | 35.8 | 43.6 | 41.0 | 24.8 | 39.7 | 56.7 |
| ICD10CM | 49.3 | 49.5 | 49.6 | 50.2 | 49.8 | 68.7 | 55.5 | 54.9 |
| Average | 36.5 | 37.7 | 39.6 | 39.7 | 39.2 | 36.7 | 39.3 | 43.4 |

New real-world benchmarking also demonstrated good performance on clinical information extraction tasks: arXiv:2507.05517. We carried out a Medical Red Teaming Protocol of Language Models in which we demonstrate broad conservation of the original Phi-3.5 safety abilities (see Phi-3 Safety Post-Training). All six merged MediPhi models fully conserve their base model's safety capabilities. `MediPhi-Instruct` conserved safe behaviors towards jailbreaking and harmfulness, and improved considerably on groundedness.
We further demonstrate safe behavior in refusing, or giving warnings with limited responses to, nearly all harmful queries from clinician and patient user perspectives, based on MedSafetyBench and our PatientSafetyBench.

Phi-3.5-mini has 3.8B parameters and is a dense decoder-only Transformer model using the same tokenizer as Phi-3 Mini. It is best suited for prompts using the chat format, but plain text is also possible. The default context length is 128K tokens. Note that by default, the Phi-3.5-mini-instruct model uses flash attention, which requires certain types of GPU hardware to run. We have tested on the following GPU types:
- NVIDIA A100
- NVIDIA A6000
- NVIDIA H100

If you want to run the model on NVIDIA V100 or earlier-generation GPUs, call `AutoModelForCausalLM.from_pretrained()` with `attn_implementation="eager"`.

Trademarks: This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.

@article{corbeil2025modular,
  title={A Modular Approach for Clinical SLMs Driven by Synthetic Data with Pre-Instruction Tuning, Model Merging, and Clinical-Tasks Alignment},
  author={Corbeil, Jean-Philippe and Dada, Amin and Attendu, Jean-Michel and Abacha, Asma Ben and Sordoni, Alessandro and Caccia, Lucas and Beaulieu, Fran{\c{c}}ois and Lin, Thomas and Kleesiek, Jens and Vozila, Paul},
  journal={arXiv preprint arXiv:2505.10717},
  year={2025}
}
EXAONE-Deep-7.8B-GGUF
Phi-4-mini-reasoning-GGUF
Hermes-4-70B-GGUF
This model was generated using llama.cpp at commit `408ff524`. Click here to get info on choosing the right GGUF model format

Hermes 4 70B is a frontier, hybrid-mode reasoning model based on Llama-3.1-70B by Nous Research that is aligned to you. Read the Hermes 4 technical report here: Hermes 4 Technical Report. Chat with Hermes in Nous Chat: https://chat.nousresearch.com

Training highlights include a newly synthesized post-training corpus emphasizing verified reasoning traces; massive improvements in math, code, STEM, logic, creativity, and format-faithful outputs; and preserved general assistant quality and broadly neutral alignment.

- Post-training corpus: massively increased dataset size, from 1M samples and 1.2B tokens to ~5M samples / ~60B tokens, blended across reasoning and non-reasoning data.
- Hybrid reasoning mode with explicit `<think> … </think>` segments when the model decides to deliberate, and options to make responses faster when you want.
- Top-quality, expressive reasoning that improves math, code, STEM, logic, and even creative writing and subjective responses.
- Schema adherence & structured outputs: trained to produce valid JSON for given schemas and to repair malformed objects.
- Much easier to steer and align: extreme improvements in steerability, especially reduced refusal rates.
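A minimal client-side sketch of handling Hermes-style tagged output, assuming the `<think>` reasoning and `<tool_call>` JSON tag names of the Hermes format; the sample completion and tool name are invented for illustration:

```python
# Strip <think>...</think> deliberation and extract <tool_call> JSON blocks
# from a raw completion. Tag names are assumptions based on the Hermes format.
import json
import re

def parse_hermes(text: str):
    """Return (visible_text, list_of_tool_calls) from a raw completion."""
    # Drop the deliberation segment entirely.
    visible = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)
    # Collect tool calls as parsed JSON objects.
    calls = [
        json.loads(m)
        for m in re.findall(r"<tool_call>(.*?)</tool_call>", visible, flags=re.DOTALL)
    ]
    visible = re.sub(r"<tool_call>.*?</tool_call>", "", visible, flags=re.DOTALL).strip()
    return visible, calls

raw = (
    '<think>user wants weather</think>Sure.'
    '<tool_call>{"name": "get_weather", "arguments": {"city": "Paris"}}</tool_call>'
)
text, calls = parse_hermes(raw)
```

Since the tags are added tokens, a streaming client can use the same tag boundaries to switch between "reasoning", "visible text", and "tool call" states without buffering the whole response.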
In pursuit of the mission of producing models that are open, steerable, and capable of producing the full range of human expression, while being able to be aligned to your values, we created a new benchmark, RefusalBench, that tests a model's willingness to be helpful in a variety of scenarios commonly disallowed by closed and open models. Hermes 4 achieves SOTA on RefusalBench across all popular closed and open models in being helpful and conforming to your values, without censorship.

> Full tables, settings, and comparisons are in the technical report.

Hermes 4 uses the Llama-3-Chat format with role headers and special tags. Reasoning mode can be activated with the chat template via the flag `thinking=True` or by using the corresponding system prompt. Note that you can add any additional system instructions before or after this system message, and it will adjust the model's policies, style, and effort of thinking, as well as its post-thinking style, format, identity, and more. You may also interleave the tool-definition system message with the reasoning one. Additionally, we provide a flag to keep the content in between the `<think> ... </think>` tags, which you can enable by setting `keep_cots=True`.

Hermes 4 supports function/tool calls within a single assistant turn, produced after its reasoning. Note that you may also simply place tool definitions into the "tools:" field of your messages, and the chat template will parse them and create the system prompt for you. This also works with reasoning mode for improved accuracy of tool use. The model will then generate tool calls within `<tool_call> ... </tool_call>` tags, for easy parsing. The tool-call tags are also added tokens, which makes parsing easy while streaming! There are also automatic tool parsers built into vLLM and SGLang for Hermes; just set the tool parser in vLLM to `hermes` and in SGLang to `qwen25`.

- Sampling defaults that work well: `temperature=0.6, top_p=0.95, top_k=20`.
- Template: use the Llama chat format for Hermes 4 70B and 405B as shown above, or set `add_generation_prompt=True` when using `tokenizer.apply_chat_template(...)`.

For production serving on multi-GPU nodes, consider tensor-parallel inference engines (e.g., SGLang/vLLM backends) with prefix caching.

Hermes 4 is available as original BF16 weights as well as FP8 and GGUF variants:
- FP8: https://huggingface.co/NousResearch/Hermes-4-70B-FP8
- GGUF (courtesy of the LM Studio team!): https://huggingface.co/lmstudio-community/Hermes-4-70B-GGUF

Hermes 4 is also available in other sizes with similar prompt formats. See the Hermes 4 collection to explore them all: https://huggingface.co/collections/NousResearch/hermes-4-collection-68a731bfd452e20816725728
KernelLLM-GGUF
Lingshu-32B-GGUF
This model was generated using llama.cpp at commit `238005c2`. Click here to get info on choosing the right GGUF model format

Website | 🤖 7B Model | 🤖 32B Model | MedEvalKit | Technical Report

Lingshu – SOTA Multimodal Large Language Models for the Medical Domain

BIG NEWS: Lingshu is released with state-of-the-art performance on medical VQA tasks and report generation. This repository contains the model of the paper Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning. We also release a comprehensive medical evaluation toolkit, MedEvalKit, which supports fast evaluation of major multimodal and textual medical tasks.

Highlights
- Lingshu models achieve SOTA on most medical multimodal/textual QA and report generation tasks for the 7B and 32B model sizes.
- Lingshu-32B outperforms GPT-4.1 and Claude Sonnet 4 in most multimodal QA and report generation tasks.
- Lingshu supports more than 12 medical imaging modalities, including X-Ray, CT Scan, MRI, Microscopy, Ultrasound, Histopathology, Dermoscopy, Fundus, OCT, Digital Photography, Endoscopy, and PET.

Release
- Technical report: arXiv: Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning.
- Model weights: Lingshu-7B, Lingshu-32B

> Disclaimer:
> We must note that even though the weights, code, and demos are released openly, similar to other pre-trained language models, and despite our best efforts in red teaming, safety fine-tuning, and enforcement, our models come with potential risks, including but not limited to inaccurate, misleading, or potentially harmful generation.
> Developers and stakeholders should perform their own red teaming and put related security measures in place before deployment, and they must abide by and comply with local governance and regulations.
> In no event shall the authors be held liable for any claim, damages, or other liability arising from the use of the released weights, code, or demos.

Medical multimodal QA:

| Models | MMMU-Med | VQA-RAD | SLAKE | PathVQA | PMC-VQA | OmniMedVQA | MedXpertQA | Avg. |
| :--- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| Claude Sonnet 4 | 74.6 | 67.6 | 70.6 | 54.2 | 54.4 | 65.5 | 43.3 | 61.5 |
| Gemini-2.5-Flash | 76.9 | 68.5 | 75.8 | 55.4 | 55.4 | 71.0 | 52.8 | 65.1 |
| MedVLM-R1-2B | 35.2 | 48.6 | 56.0 | 32.5 | 47.6 | 77.7 | 20.4 | 45.4 |
| MedGemma-4B-IT | 43.7 | 72.5 | 76.4 | 48.8 | 49.9 | 69.8 | 22.3 | 54.8 |
| LLaVA-Med-7B | 29.3 | 53.7 | 48.0 | 38.8 | 30.5 | 44.3 | 20.3 | 37.8 |
| HuatuoGPT-V-7B | 47.3 | 67.0 | 67.8 | 48.0 | 53.3 | 74.2 | 21.6 | 54.2 |
| BioMediX2-8B | 39.8 | 49.2 | 57.7 | 37.0 | 43.5 | 63.3 | 21.8 | 44.6 |
| Qwen2.5VL-7B | 50.6 | 64.5 | 67.2 | 44.1 | 51.9 | 63.6 | 22.3 | 52.0 |
| InternVL2.5-8B | 53.5 | 59.4 | 69.0 | 42.1 | 51.3 | 81.3 | 21.7 | 54.0 |
| InternVL3-8B | 59.2 | 65.4 | 72.8 | 48.6 | 53.8 | 79.1 | 22.4 | 57.3 |
| HealthGPT-14B | 49.6 | 65.0 | 66.1 | 56.7 | 56.4 | 75.2 | 24.7 | 56.2 |
| HuatuoGPT-V-34B | 51.8 | 61.4 | 69.5 | 44.4 | 56.6 | 74.0 | 22.1 | 54.3 |
| InternVL3-14B | 63.1 | 66.3 | 72.8 | 48.0 | 54.1 | 78.9 | 23.1 | 58.0 |
| Qwen2.5VL-32B | 59.6 | 71.8 | 71.2 | 41.9 | 54.5 | 68.2 | 25.2 | 56.1 |
| InternVL2.5-38B | 61.6 | 61.4 | 70.3 | 46.9 | 57.2 | 79.9 | 24.4 | 57.4 |
| InternVL3-38B | 65.2 | 65.4 | 72.7 | 51.0 | 56.6 | 79.8 | 25.2 | 59.4 |
| Lingshu-32B | 62.3 | 76.5 | 89.2 | 65.9 | 57.9 | 83.4 | 30.9 | 66.6 |

Medical textual QA:

| Models | MMLU-Med | PubMedQA | MedMCQA | MedQA | Medbullets | MedXpertQA | SuperGPQA-Med | Avg. |
| :--- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| Claude Sonnet 4 | 91.3 | 78.6 | 79.3 | 92.1 | 80.2 | 33.6 | 56.3 | 73.1 |
| Gemini-2.5-Flash | 84.2 | 73.8 | 73.6 | 91.2 | 77.6 | 35.6 | 53.3 | 69.9 |
| MedVLM-R1-2B | 51.8 | 66.4 | 39.7 | 42.3 | 33.8 | 11.8 | 19.1 | 37.8 |
| MedGemma-4B-IT | 66.7 | 72.2 | 52.2 | 56.2 | 45.6 | 12.8 | 21.6 | 46.8 |
| LLaVA-Med-7B | 50.6 | 26.4 | 39.4 | 42.0 | 34.4 | 9.9 | 16.1 | 31.3 |
| HuatuoGPT-V-7B | 69.3 | 72.8 | 51.2 | 52.9 | 40.9 | 10.1 | 21.9 | 45.6 |
| BioMediX2-8B | 68.6 | 75.2 | 52.9 | 58.9 | 45.9 | 13.4 | 25.2 | 48.6 |
| Qwen2.5VL-7B | 73.4 | 76.4 | 52.6 | 57.3 | 42.1 | 12.8 | 26.3 | 48.7 |
| InternVL2.5-8B | 74.2 | 76.4 | 52.4 | 53.7 | 42.4 | 11.6 | 26.1 | 48.1 |
| InternVL3-8B | 77.5 | 75.4 | 57.7 | 62.1 | 48.5 | 13.1 | 31.2 | 52.2 |
| HealthGPT-14B | 80.2 | 68.0 | 63.4 | 66.2 | 39.8 | 11.3 | 25.7 | 50.7 |
| HuatuoGPT-V-34B | 74.7 | 72.2 | 54.7 | 58.8 | 42.7 | 11.4 | 26.5 | 48.7 |
| InternVL3-14B | 81.7 | 77.2 | 62.0 | 70.1 | 49.5 | 14.1 | 37.9 | 56.1 |
| Qwen2.5VL-32B | 83.2 | 68.4 | 63.0 | 71.6 | 54.2 | 15.6 | 37.6 | 56.2 |
| InternVL2.5-38B | 84.6 | 74.2 | 65.9 | 74.4 | 55.0 | 14.7 | 39.9 | 58.4 |
| InternVL3-38B | 83.8 | 73.2 | 64.9 | 73.5 | 54.6 | 16.0 | 42.5 | 58.4 |
| Lingshu-32B | 84.7 | 77.8 | 66.1 | 74.7 | 65.4 | 22.7 | 41.1 | 61.8 |

Medical report generation (the five metrics are repeated for three report-generation benchmarks):

| Models | ROUGE-L | CIDEr | RaTE | SembScore | RadCliQ-v1⁻¹ | ROUGE-L | CIDEr | RaTE | SembScore | RadCliQ-v1⁻¹ | ROUGE-L | CIDEr | RaTE | SembScore | RadCliQ-v1⁻¹ |
| :--- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| GPT-4.1 | 9.0 | 82.8 | 51.3 | 23.9 | 57.1 | 24.5 | 78.8 | 45.5 | 23.2 | 45.5 | 30.2 | 124.6 | 51.3 | 47.5 | 80.3 |
| Claude Sonnet 4 | 20.0 | 56.6 | 45.6 | 19.7 | 53.4 | 22.0 | 59.5 | 43.5 | 18.9 | 43.3 | 25.4 | 88.3 | 55.4 | 41.0 | 72.1 |
| Gemini-2.5-Flash | 25.4 | 80.7 | 50.3 | 29.7 | 59.4 | 23.6 | 72.2 | 44.3 | 27.4 | 44.0 | 33.5 | 129.3 | 55.6 | 50.9 | 91.6 |
| Med-R1-2B | 19.3 | 35.4 | 40.6 | 14.8 | 42.4 | 18.6 | 37.1 | 38.5 | 17.8 | 37.6 | 16.1 | 38.3 | 41.4 | 12.5 | 43.6 |
| MedVLM-R1-2B | 20.3 | 40.1 | 41.6 | 14.2 | 48.3 | 20.9 | 43.5 | 38.9 | 15.5 | 40.9 | 22.7 | 61.1 | 46.1 | 22.7 | 54.3 |
| MedGemma-4B-IT | 25.6 | 81.0 | 52.4 | 29.2 | 62.9 | 27.1 | 79.0 | 47.2 | 29.3 | 46.6 | 30.8 | 103.6 | 57.0 | 46.8 | 86.7 |
| LLaVA-Med-7B | 15.0 | 43.4 | 12.8 | 18.3 | 52.9 | 18.4 | 45.5 | 38.8 | 23.5 | 44.0 | 18.8 | 68.2 | 40.9 | 16.0 | 58.1 |
| HuatuoGPT-V-7B | 23.4 | 69.5 | 48.9 | 20.0 | 48.2 | 21.3 | 64.7 | 44.2 | 19.3 | 39.4 | 29.6 | 104.3 | 52.9 | 40.7 | 63.6 |
| BioMediX2-8B | 20.0 | 52.8 | 44.4 | 17.7 | 53.0 | 18.1 | 47.9 | 40.8 | 21.6 | 43.3 | 19.6 | 58.8 | 40.1 | 11.6 | 53.8 |
| Qwen2.5VL-7B | 24.1 | 63.7 | 47.0 | 18.4 | 55.1 | 22.2 | 62.0 | 41.0 | 17.2 | 43.1 | 26.5 | 78.1 | 48.4 | 36.3 | 66.1 |
| InternVL2.5-8B | 23.2 | 61.8 | 47.0 | 21.0 | 56.2 | 20.6 | 58.5 | 43.1 | 19.7 | 42.7 | 24.8 | 75.4 | 51.1 | 36.7 | 67.0 |
| InternVL3-8B | 22.9 | 66.2 | 48.2 | 21.5 | 55.1 | 20.9 | 65.4 | 44.3 | 25.2 | 43.7 | 22.9 | 76.2 | 51.2 | 31.3 | 59.9 |
| Lingshu-7B | 30.8 | 109.4 | 52.1 | 30.0 | 69.2 | 26.5 | 79.0 | 45.4 | 26.8 | 47.3 | 41.2 | 180.7 | 57.6 | 48.4 | 108.1 |
| HealthGPT-14B | 21.4 | 64.7 | 48.4 | 16.5 | 52.7 | 20.6 | 66.2 | 44.4 | 22.7 | 42.6 | 22.9 | 81.9 | 50.8 | 16.6 | 56.9 |
| HuatuoGPT-V-34B | 23.5 | 68.5 | 48.5 | 23.0 | 47.1 | 22.5 | 62.8 | 42.9 | 22.1 | 39.7 | 28.2 | 108.3 | 54.4 | 42.2 | 59.3 |
| MedDr-40B | 15.7 | 62.3 | 45.2 | 12.2 | 47.0 | 24.1 | 66.1 | 44.7 | 24.2 | 44.7 | 19.4 | 62.9 | 40.3 | 7.3 | 48.9 |
| InternVL3-14B | 22.0 | 63.7 | 48.6 | 17.4 | 46.5 | 20.4 | 60.2 | 44.1 | 20.7 | 39.4 | 24.8 | 93.7 | 55.0 | 38.7 | 55.0 |
| Qwen2.5VL-32B | 15.7 | 50.2 | 47.5 | 17.1 | 45.2 | 15.2 | 54.8 | 43.4 | 18.5 | 40.3 | 18.9 | 73.3 | 51.3 | 38.1 | 54.0 |
| InternVL2.5-38B | 22.7 | 61.4 | 47.5 | 18.2 | 54.9 | 21.6 | 60.6 | 42.6 | 20.3 | 45.4 | 28.9 | 96.5 | 53.5 | 38.5 | 69.7 |
| InternVL3-38B | 22.8 | 64.6 | 47.9 | 18.1 | 47.2 | 20.5 | 62.7 | 43.8 | 20.2 | 39.4 | 25.5 | 90.7 | 53.5 | 33.1 | 55.2 |
| Lingshu-32B | 28.8 | 96.4 | 50.8 | 30.1 | 67.1 | 25.3 | 75.9 | 43.4 | 24.2 | 47.1 | 42.8 | 189.2 | 63.5 | 54.6 | 130.4 |

If you find our project useful, we hope you will kindly star our repo and cite our work. In the author list, `` marks equal contributions and `^` marks corresponding authors.

Help me test my AI-Powered Quantum Network Monitor Assistant with quantum-ready security checks: The full open-source code for the Quantum Network Monitor service is available at my GitHub repos (repos with NetworkMonitor in the name): Source Code Quantum Network Monitor.
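As a sanity check on the tables above, the Avg. columns appear to be plain means of the per-benchmark scores. For example, for Lingshu-32B's multimodal-QA row:

```python
# Recompute the Avg. cell of the Lingshu-32B multimodal-QA row from its
# seven per-benchmark scores (MMMU-Med through MedXpertQA).
scores = [62.3, 76.5, 89.2, 65.9, 57.9, 83.4, 30.9]
avg = round(sum(scores) / len(scores), 1)
print(avg)  # 66.6, matching the table
```

The same arithmetic reproduces the other Avg. cells in the two QA tables.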
You will also find the code I use to quantize the models, if you want to do it yourself: GGUFModelBuilder

💬 How to test:
Choose an AI assistant type:
- `TurboLLM` (GPT-4.1-mini)
- `HugLLM` (Hugging Face open-source models)
- `TestLLM` (Experimental CPU-only)

What I'm Testing
I'm pushing the limits of small open-source models for AI network monitoring, specifically:
- Function calling against live network services
- How small can a model go while still handling:
  - Automated Nmap security scans
  - Quantum-readiness checks
  - Network monitoring tasks

🟡 TestLLM – Current experimental model (llama.cpp on 2 CPU threads on a Hugging Face Docker space):
- ✅ Zero-configuration setup
- ⏳ 30s load time (slow inference but no API costs). No token limits, as the cost is low.
- 🔧 Help wanted! If you're into edge-device AI, let's collaborate!
gemma-3-270m-it-GGUF
This model was generated using llama.cpp at commit `79c1160b`. Click here to get info on choosing the right GGUF model format.

[Gemma 3 Technical Report][g3-tech-report] | [Responsible Generative AI Toolkit][rai-toolkit] | [Gemma on Kaggle][kaggle-gemma] | [Gemma on Vertex Model Garden][vertex-mg-gemma3]

Summary description and brief definition of inputs and outputs. Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models. Gemma 3 models are multimodal, handling text and image input and generating text output, with open weights for both pre-trained and instruction-tuned variants. Gemma 3 has a large, 128K context window, multilingual support in over 140 languages, and is available in more sizes than previous versions. Gemma 3 models are well suited to a variety of text generation and image understanding tasks, including question answering, summarization, and reasoning. Their relatively small size makes it possible to deploy them in environments with limited resources such as laptops, desktops, or your own cloud infrastructure, democratizing access to state-of-the-art AI models and helping foster innovation for everyone.

- Input:
  - Text string, such as a question, a prompt, or a document to be summarized
  - Images, normalized to 896 x 896 resolution and encoded to 256 tokens each, for the 4B, 12B, and 27B sizes
  - Total input context of 128K tokens for the 4B, 12B, and 27B sizes, and 32K tokens for the 1B and 270M sizes
- Output:
  - Generated text in response to the input, such as an answer to a question, analysis of image content, or a summary of a document
  - Total output context of up to 128K tokens for the 4B, 12B, and 27B sizes, and 32K tokens for the 1B and 270M sizes per request, minus the request's input tokens

Data used for model training and how the data was processed. These models were trained on a dataset of text data that includes a wide variety of sources.
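The input/output budgeting described above can be sketched as a small helper. This is an illustrative sketch, not an official API: it assumes "K" means 1024 tokens and uses the card's figures of 256 tokens per image (4B/12B/27B only), with output budget equal to the context window minus the input tokens.

```python
# Illustrative sketch (assumed, not an official API) of the Gemma 3 context
# budget described above. Assumes "K" = 1024 tokens.
CONTEXT = {"270M": 32 * 1024, "1B": 32 * 1024,
           "4B": 128 * 1024, "12B": 128 * 1024, "27B": 128 * 1024}
IMAGE_TOKENS = 256  # per image, per the card, for 4B/12B/27B

def max_output_tokens(size, text_tokens, num_images=0):
    # Image input is only described for the 4B, 12B, and 27B sizes.
    if num_images and size in ("270M", "1B"):
        raise ValueError("image input is only available on 4B/12B/27B")
    used = text_tokens + num_images * IMAGE_TOKENS
    return max(CONTEXT[size] - used, 0)

print(max_output_tokens("270M", 1000))              # 31768
print(max_output_tokens("27B", 500, num_images=2))  # 130060
```

So a 27B request with 500 text tokens and two images consumes 1012 input tokens, leaving roughly 130K tokens of output budget under these assumptions.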
The 27B model was trained with 14 trillion tokens, the 12B model with 12 trillion tokens, the 4B model with 4 trillion tokens, the 1B with 2 trillion tokens, and the 270M with 6 trillion tokens. The knowledge cutoff date for the training data was August 2024. Here are the key components:

- Web Documents: A diverse collection of web text ensures the model is exposed to a broad range of linguistic styles, topics, and vocabulary. The training dataset includes content in over 140 languages.
- Code: Exposing the model to code helps it learn the syntax and patterns of programming languages, which improves its ability to generate code and understand code-related questions.
- Mathematics: Training on mathematical text helps the model learn logical reasoning and symbolic representation, and to address mathematical queries.
- Images: A wide range of images enables the model to perform image analysis and visual data extraction tasks.

The combination of these diverse data sources is crucial for training a powerful multimodal model that can handle a wide variety of tasks and data formats. Here are the key data cleaning and filtering methods applied to the training data:

- CSAM Filtering: Rigorous CSAM (Child Sexual Abuse Material) filtering was applied at multiple stages in the data preparation process to ensure the exclusion of harmful and illegal content.
- Sensitive Data Filtering: As part of making Gemma pre-trained models safe and reliable, automated techniques were used to filter out certain personal information and other sensitive data from training sets.
- Additional methods: Filtering based on content quality and safety in line with [our policies][safety-policies].

Gemma was trained using [Tensor Processing Unit (TPU)][tpu] hardware (TPUv4p, TPUv5p, and TPUv5e). Training vision-language models (VLMs) requires significant computational power.
TPUs, designed specifically for the matrix operations common in machine learning, offer several advantages in this domain:

- Performance: TPUs are specifically designed to handle the massive computations involved in training VLMs. They can speed up training considerably compared to CPUs.
- Memory: TPUs often come with large amounts of high-bandwidth memory, allowing for large models and batch sizes during training. This can lead to better model quality.
- Scalability: TPU Pods (large clusters of TPUs) provide a scalable solution for handling the growing complexity of large foundation models. You can distribute training across multiple TPU devices for faster and more efficient processing.
- Cost-effectiveness: In many scenarios, TPUs can provide a more cost-effective solution for training large models compared to CPU-based infrastructure, especially when considering the time and resources saved by faster training.

These advantages are aligned with [Google's commitments to operate sustainably][sustainability]. Training was done using [JAX][jax] and [ML Pathways][ml-pathways]. JAX allows researchers to take advantage of the latest generation of hardware, including TPUs, for faster and more efficient training of large models. ML Pathways is Google's latest effort to build artificially intelligent systems capable of generalizing across multiple tasks. This is especially suitable for foundation models, including large language models like these. Together, JAX and ML Pathways are used as described in the [paper about the Gemini family of models][gemini-2-paper]: "the 'single controller' programming model of JAX and Pathways allows a single Python process to orchestrate the entire training run, dramatically simplifying the development workflow." These models were evaluated against a large collection of different datasets and metrics to cover different aspects of text generation. Evaluation results marked with IT are for instruction-tuned models.
Evaluation results marked with PT are for pre-trained models.

| Benchmark | n-shot | Gemma 3 PT 270M |
| :--- | :---: | ---: |
| [HellaSwag][hellaswag] | 10-shot | 40.9 |
| [BoolQ][boolq] | 0-shot | 61.4 |
| [PIQA][piqa] | 0-shot | 67.7 |
| [TriviaQA][triviaqa] | 5-shot | 15.4 |
| [ARC-c][arc] | 25-shot | 29.0 |
| [ARC-e][arc] | 0-shot | 57.7 |
| [WinoGrande][winogrande] | 5-shot | 52.0 |

[hellaswag]: https://arxiv.org/abs/1905.07830
[boolq]: https://arxiv.org/abs/1905.10044
[piqa]: https://arxiv.org/abs/1911.11641
[triviaqa]: https://arxiv.org/abs/1705.03551
[arc]: https://arxiv.org/abs/1911.01547
[winogrande]: https://arxiv.org/abs/1907.10641

| Benchmark | n-shot | Gemma 3 IT 270M |
| :--- | :---: | ---: |
| [HellaSwag][hellaswag] | 0-shot | 37.7 |
| [PIQA][piqa] | 0-shot | 66.2 |
| [ARC-c][arc] | 0-shot | 28.2 |
| [WinoGrande][winogrande] | 0-shot | 52.3 |
| [BIG-Bench Hard][bbh] | few-shot | 26.7 |
| [IF Eval][ifeval] | 0-shot | 51.2 |

[bbh]: https://paperswithcode.com/dataset/bbh
[ifeval]: https://arxiv.org/abs/2311.07911

| Benchmark | n-shot | Gemma 3 IT 1B | Gemma 3 IT 4B | Gemma 3 IT 12B | Gemma 3 IT 27B |
| :--- | :---: | ---: | ---: | ---: | ---: |
| [GPQA][gpqa] Diamond | 0-shot | 19.2 | 30.8 | 40.9 | 42.4 |
| [SimpleQA][simpleqa] | 0-shot | 2.2 | 4.0 | 6.3 | 10.0 |
| [FACTS Grounding][facts-grdg] | - | 36.4 | 70.1 | 75.8 | 74.9 |
| [BIG-Bench Hard][bbh] | 0-shot | 39.1 | 72.2 | 85.7 | 87.6 |
| [BIG-Bench Extra Hard][bbeh] | 0-shot | 7.2 | 11.0 | 16.3 | 19.3 |
| [IFEval][ifeval] | 0-shot | 80.2 | 90.2 | 88.9 | 90.4 |

| Benchmark | n-shot | Gemma 3 PT 1B | Gemma 3 PT 4B | Gemma 3 PT 12B | Gemma 3 PT 27B |
| :--- | :---: | ---: | ---: | ---: | ---: |
| [HellaSwag][hellaswag] | 10-shot | 62.3 | 77.2 | 84.2 | 85.6 |
| [BoolQ][boolq] | 0-shot | 63.2 | 72.3 | 78.8 | 82.4 |
| [PIQA][piqa] | 0-shot | 73.8 | 79.6 | 81.8 | 83.3 |
| [SocialIQA][socialiqa] | 0-shot | 48.9 | 51.9 | 53.4 | 54.9 |
| [TriviaQA][triviaqa] | 5-shot | 39.8 | 65.8 | 78.2 | 85.5 |
| [Natural Questions][naturalq] | 5-shot | 9.48 | 20.0 | 31.4 | 36.1 |
| [ARC-c][arc] | 25-shot | 38.4 | 56.2 | 68.9 | 70.6 |
| [ARC-e][arc] | 0-shot | 73.0 | 82.4 | 88.3 | 89.0 |
| [WinoGrande][winogrande] | 5-shot | 58.2 | 64.7 | 74.3 | 78.8 |
| [BIG-Bench Hard][bbh] | few-shot | 28.4 | 50.9 | 72.6 | 77.7 |
| [DROP][drop] | 1-shot | 42.4 | 60.1 | 72.2 | 77.2 |

[gpqa]: https://arxiv.org/abs/2311.12022
[simpleqa]: https://arxiv.org/abs/2411.04368
[facts-grdg]: https://goo.gle/FACTSpaper
[bbeh]: https://github.com/google-deepmind/bbeh
[socialiqa]: https://arxiv.org/abs/1904.09728
[naturalq]: https://github.com/google-research-datasets/natural-questions
[drop]: https://arxiv.org/abs/1903.00161

| Benchmark | n-shot | Gemma 3 IT 1B | Gemma 3 IT 4B | Gemma 3 IT 12B | Gemma 3 IT 27B |
| :--- | :---: | ---: | ---: | ---: | ---: |
| [MMLU][mmlu] (Pro) | 0-shot | 14.7 | 43.6 | 60.6 | 67.5 |
| [LiveCodeBench][lcb] | 0-shot | 1.9 | 12.6 | 24.6 | 29.7 |
| [Bird-SQL][bird-sql] (dev) | - | 6.4 | 36.3 | 47.9 | 54.4 |
| [Math][math] | 0-shot | 48.0 | 75.6 | 83.8 | 89.0 |
| HiddenMath | 0-shot | 15.8 | 43.0 | 54.5 | 60.3 |
| [MBPP][mbpp] | 3-shot | 35.2 | 63.2 | 73.0 | 74.4 |
| [HumanEval][humaneval] | 0-shot | 41.5 | 71.3 | 85.4 | 87.8 |
| [Natural2Code][nat2code] | 0-shot | 56.0 | 70.3 | 80.7 | 84.5 |
| [GSM8K][gsm8k] | 0-shot | 62.8 | 89.2 | 94.4 | 95.9 |

| Benchmark | n-shot | Gemma 3 PT 4B | Gemma 3 PT 12B | Gemma 3 PT 27B |
| :--- | :---: | ---: | ---: | ---: |
| [MMLU][mmlu] | 5-shot | 59.6 | 74.5 | 78.6 |
| [MMLU][mmlu] (Pro COT) | 5-shot | 29.2 | 45.3 | 52.2 |
| [AGIEval][agieval] | 3-5-shot | 42.1 | 57.4 | 66.2 |
| [MATH][math] | 4-shot | 24.2 | 43.3 | 50.0 |
| [GSM8K][gsm8k] | 8-shot | 38.4 | 71.0 | 82.6 |
| [GPQA][gpqa] | 5-shot | 15.0 | 25.4 | 24.3 |
| [MBPP][mbpp] | 3-shot | 46.0 | 60.4 | 65.6 |
| [HumanEval][humaneval] | 0-shot | 36.0 | 45.7 | 48.8 |

[mmlu]: https://arxiv.org/abs/2009.03300
[agieval]: https://arxiv.org/abs/2304.06364
[math]: https://arxiv.org/abs/2103.03874
[gsm8k]: https://arxiv.org/abs/2110.14168
[mbpp]: https://arxiv.org/abs/2108.07732
[humaneval]: https://arxiv.org/abs/2107.03374
[lcb]: https://arxiv.org/abs/2403.07974
[bird-sql]: https://arxiv.org/abs/2305.03111
[nat2code]: https://arxiv.org/abs/2405.04520

| Benchmark | n-shot | Gemma 3 IT 1B | Gemma 3 IT 4B | Gemma 3 IT 12B | Gemma 3 IT 27B |
| :--- | :---: | ---: | ---: | ---: | ---: |
| [Global-MMLU-Lite][global-mmlu-lite] | 0-shot | 34.2 | 54.5 | 69.5 | 75.1 |
| [ECLeKTic][eclektic] | 0-shot | 1.4 | 4.6 | 10.3 | 16.7 |
| [WMT24++][wmt24pp] | 0-shot | 35.9 | 46.8 | 51.6 | 53.4 |

| Benchmark | Gemma 3 PT 1B | Gemma 3 PT 4B | Gemma 3 PT 12B | Gemma 3 PT 27B |
| :--- | ---: | ---: | ---: | ---: |
| [MGSM][mgsm] | 2.04 | 34.7 | 64.3 | 74.3 |
| [Global-MMLU-Lite][global-mmlu-lite] | 24.9 | 57.0 | 69.4 | 75.7 |
| [WMT24++][wmt24pp] (ChrF) | 36.7 | 48.4 | 53.9 | 55.7 |
| [FloRes][flores] | 29.5 | 39.2 | 46.0 | 48.8 |
| [XQuAD][xquad] (all) | 43.9 | 68.0 | 74.5 | 76.8 |
| [ECLeKTic][eclektic] | 4.69 | 11.0 | 17.2 | 24.4 |
| [IndicGenBench][indicgenbench] | 41.4 | 57.2 | 61.7 | 63.4 |

[mgsm]: https://arxiv.org/abs/2210.03057
[flores]: https://arxiv.org/abs/2106.03193
[xquad]: https://arxiv.org/abs/1910.11856v3
[global-mmlu-lite]: https://huggingface.co/datasets/CohereForAI/Global-MMLU-Lite
[wmt24pp]: https://arxiv.org/abs/2502.12404v1
[eclektic]: https://arxiv.org/abs/2502.21228
[indicgenbench]: https://arxiv.org/abs/2404.16816

| Benchmark | Gemma 3 IT 4B | Gemma 3 IT 12B | Gemma 3 IT 27B |
| :--- | ---: | ---: | ---: |
| [MMMU][mmmu] (val) | 48.8 | 59.6 | 64.9 |
| [DocVQA][docvqa] | 75.8 | 87.1 | 86.6 |
| [InfoVQA][info-vqa] | 50.0 | 64.9 | 70.6 |
| [TextVQA][textvqa] | 57.8 | 67.7 | 65.1 |
| [AI2D][ai2d] | 74.8 | 84.2 | 84.5 |
| [ChartQA][chartqa] | 68.8 | 75.7 | 78.0 |
| [VQAv2][vqav2] (val) | 62.4 | 71.6 | 71.0 |
| [MathVista][mathvista] (testmini) | 50.0 | 62.9 | 67.6 |

| Benchmark | Gemma 3 PT 4B | Gemma 3 PT 12B | Gemma 3 PT 27B |
| :--- | ---: | ---: | ---: |
| [COCOcap][coco-cap] | 102 | 111 | 116 |
| [DocVQA][docvqa] (val) | 72.8 | 82.3 | 85.6 |
| [InfoVQA][info-vqa] (val) | 44.1 | 54.8 | 59.4 |
| [MMMU][mmmu] (pt) | 39.2 | 50.3 | 56.1 |
| [TextVQA][textvqa] (val) | 58.9 | 66.5 | 68.6 |
| [RealWorldQA][realworldqa] | 45.5 | 52.2 | 53.9 |
| [ReMI][remi] | 27.3 | 38.5 | 44.8 |
| [AI2D][ai2d] | 63.2 | 75.2 | 79.0 |
| [ChartQA][chartqa] | 63.6 | 74.7 | 76.3 |
| [VQAv2][vqav2] | 63.9 | 71.2 | 72.9 |
| [BLINK][blinkvqa] | 38.0 | 35.9 | 39.6 |
| [OKVQA][okvqa] | 51.0 | 58.7 | 60.2 |
| [TallyQA][tallyqa] | 42.5 | 51.8 | 54.3 |
| [SpatialSense VQA][ss-vqa] | 50.9 | 60.0 | 59.4 |
| [CountBenchQA][countbenchqa] | 26.1 | 17.8 | 68.0 |

[coco-cap]: https://cocodataset.org/#home
[docvqa]: https://www.docvqa.org/
[info-vqa]: https://arxiv.org/abs/2104.12756
[mmmu]: https://arxiv.org/abs/2311.16502
[textvqa]: https://textvqa.org/
[realworldqa]: https://paperswithcode.com/dataset/realworldqa
[remi]: https://arxiv.org/html/2406.09175v1
[ai2d]: https://allenai.org/data/diagrams
[chartqa]: https://arxiv.org/abs/2203.10244
[vqav2]: https://visualqa.org/index.html
[blinkvqa]: https://arxiv.org/abs/2404.12390
[okvqa]: https://okvqa.allenai.org/
[tallyqa]: https://arxiv.org/abs/1810.12440
[ss-vqa]: https://arxiv.org/abs/1908.02660
[countbenchqa]: https://github.com/google-research/bigvision/blob/main/bigvision/datasets/countbenchqa/
[mathvista]: https://arxiv.org/abs/2310.02255

Our evaluation methods include structured evaluations and internal red-teaming testing of relevant content policies. Red-teaming was conducted by a number of different teams, each with different goals and human evaluation metrics. These models were evaluated against a number of different categories relevant to ethics and safety, including:

- Child Safety: Evaluation of text-to-text and image-to-text prompts covering child safety policies, including child sexual abuse and exploitation.
- Content Safety: Evaluation of text-to-text and image-to-text prompts covering safety policies, including harassment, violence and gore, and hate speech.
- Representational Harms: Evaluation of text-to-text and image-to-text prompts covering safety policies, including bias, stereotyping, and harmful associations or inaccuracies.

In addition to development-level evaluations, we conduct "assurance evaluations", which are our 'arms-length' internal evaluations for responsibility governance decision making. They are conducted separately from the model development team, to inform decision making about release. High-level findings are fed back to the model team, but prompt sets are held out to prevent overfitting and preserve the results' ability to inform decision making.
Assurance evaluation results are reported to our Responsibility & Safety Council as part of release review. Across all areas of safety testing, we saw major improvements in the categories of child safety, content safety, and representational harms relative to previous Gemma models. All testing was conducted without safety filters to evaluate the model's capabilities and behaviors. For both text-to-text and image-to-text, and across all model sizes, the model produced minimal policy violations and showed significant improvements over previous Gemma models' performance with respect to ungrounded inferences. A limitation of our evaluations was that they included only English-language prompts. These models have certain limitations that users should be aware of. Open vision-language models (VLMs) have a wide range of applications across various industries and domains. The following list of potential uses is not comprehensive; its purpose is to provide contextual information about the possible use cases that the model creators considered as part of model training and development.

- Content Creation and Communication
  - Text Generation: These models can be used to generate creative text formats such as poems, scripts, code, marketing copy, and email drafts.
  - Chatbots and Conversational AI: Power conversational interfaces for customer service, virtual assistants, or interactive applications.
  - Text Summarization: Generate concise summaries of a text corpus, research papers, or reports.
  - Image Data Extraction: These models can be used to extract, interpret, and summarize visual data for text communications.
- Research and Education
  - Natural Language Processing (NLP) and VLM Research: These models can serve as a foundation for researchers to experiment with VLM and NLP techniques, develop algorithms, and contribute to the advancement of the field.
  - Language Learning Tools: Support interactive language learning experiences, aiding in grammar correction or providing writing practice.
  - Knowledge Exploration: Assist researchers in exploring large bodies of text by generating summaries or answering questions about specific topics.

Limitations:

- Training Data
  - The quality and diversity of the training data significantly influence the model's capabilities. Biases or gaps in the training data can lead to limitations in the model's responses.
  - The scope of the training dataset determines the subject areas the model can handle effectively.
- Context and Task Complexity
  - Models are better at tasks that can be framed with clear prompts and instructions. Open-ended or highly complex tasks might be challenging.
  - A model's performance can be influenced by the amount of context provided (longer context generally leads to better outputs, up to a certain point).
- Language Ambiguity and Nuance
  - Natural language is inherently complex. Models might struggle to grasp subtle nuances, sarcasm, or figurative language.
- Factual Accuracy
  - Models generate responses based on information they learned from their training datasets, but they are not knowledge bases. They may generate incorrect or outdated factual statements.
- Common Sense
  - Models rely on statistical patterns in language. They might lack the ability to apply common-sense reasoning in certain situations.

The development of vision-language models (VLMs) raises several ethical concerns. In creating an open model, we have carefully considered the following:

- Bias and Fairness
  - VLMs trained on large-scale, real-world text and image data can reflect socio-cultural biases embedded in the training material. These models underwent careful scrutiny, with input-data pre-processing and downstream evaluations described and reported in this card.
- Misinformation and Misuse
  - VLMs can be misused to generate text that is false, misleading, or harmful.
  - Guidelines are provided for responsible use with the model; see the [Responsible Generative AI Toolkit][rai-toolkit].
- Transparency and Accountability
  - This model card summarizes details on the models' architecture, capabilities, limitations, and evaluation processes.
  - A responsibly developed open model offers the opportunity to share innovation by making VLM technology accessible to developers and researchers across the AI ecosystem.

Risks identified and mitigations:

- Perpetuation of biases: We encourage continuous monitoring (using evaluation metrics and human review) and the exploration of de-biasing techniques during model training, fine-tuning, and other use cases.
- Generation of harmful content: Mechanisms and guidelines for content safety are essential. Developers are encouraged to exercise caution and implement appropriate content-safety safeguards based on their specific product policies and application use cases.
- Misuse for malicious purposes: Technical limitations and developer and end-user education can help mitigate malicious applications of VLMs. Educational resources and reporting mechanisms for users to flag misuse are provided. Prohibited uses of Gemma models are outlined in the [Gemma Prohibited Use Policy][prohibited-use].
- Privacy violations: Models were trained on data filtered to remove certain personal information and other sensitive data. Developers are encouraged to adhere to privacy regulations with privacy-preserving techniques.

At the time of release, this family of models provides high-performance open vision-language model implementations designed from the ground up for responsible AI development, compared to similarly sized models. Using the benchmark evaluation metrics described in this document, these models have been shown to provide superior performance relative to other comparably sized open model alternatives.
[g3-tech-report]: https://arxiv.org/abs/2503.19786
[rai-toolkit]: https://ai.google.dev/responsible
[kaggle-gemma]: https://www.kaggle.com/models/google/gemma-3
[vertex-mg-gemma3]: https://console.cloud.google.com/vertex-ai/publishers/google/model-garden/gemma3
[terms]: https://ai.google.dev/gemma/terms
[safety-policies]: https://ai.google/static/documents/ai-responsibility-update-published-february-2025.pdf
[prohibited-use]: https://ai.google.dev/gemma/prohibitedusepolicy
[tpu]: https://cloud.google.com/tpu/docs/intro-to-tpu
[sustainability]: https://sustainability.google/operating-sustainably/
[jax]: https://github.com/jax-ml/jax
[ml-pathways]: https://blog.google/technology/ai/introducing-pathways-next-generation-ai-architecture/
[gemini-2-paper]: https://arxiv.org/abs/2312.11805
Llama-3.3-70B-Instruct-GGUF
UserLM-8b-GGUF
This model was generated using llama.cpp at commit `56b479584`. I've been experimenting with a new quantization approach that selectively elevates the precision of key layers beyond what the default IMatrix configuration provides. In my testing, standard IMatrix quantization underperforms at lower bit depths, especially with Mixture of Experts (MoE) models. To address this, I'm using the `--tensor-type` option in `llama.cpp` to manually "bump" important layers to higher precision. You can see the implementation here: 👉 Layer bumping with llama.cpp While this does increase model file size, it significantly improves precision for a given quantization level. I'd love your feedback—have you tried this? How does it perform for you? Click here to get info on choosing the right GGUF model format

Unlike typical LLMs, which are trained to play the "assistant" role in conversation, we trained UserLM-8b to simulate the "user" role (by training it to predict user turns in WildChat, a large corpus of conversations). This model is useful for simulating more realistic conversations, which is in turn useful for developing more robust assistants. The model takes a single input, the "task intent", which defines the high-level objective the user simulator should pursue. The simulator can then be used to: (1) generate a first-turn user utterance, (2) generate follow-up user utterances based on a conversation state (one or several user-assistant turn exchanges), and (3) generate a token when the user simulator expects that the conversation has run its course.

Developed by: Tarek Naous (intern at MSR, Summer 2025), Philippe Laban (MSR), Wei Xu, Jennifer Neville (MSR)

UserLM-8b is released for use by researchers involved in the evaluation of assistant LLMs.
In such scenarios, UserLM-8b can be used to simulate multi-turn conversations; our analyses (see Section 3 of the paper) give evidence that UserLM-8b provides more realistic simulation of user behavior than other methods (such as prompting an assistant model). UserLM-8b offers a user-simulation environment that can better estimate the performance of an assistant LLM with real users. See Section 4 of the paper for an initial implementation of such an evaluation. We envision several potential uses for UserLM-8b that we did not implement in the presented work but describe in our Discussion section as potential research directions for UserLMs. These potential applications include: (1) user modeling (i.e., predicting user responses to a given set of questions), (2) a foundation for judge models (i.e., LLM-as-a-judge finetuning), and (3) synthetic data generation (in conjunction with an assistant LM). We caution potential users that UserLM-8b is not an assistant LM, unlike the majority of LLMs released on Hugging Face. As such, it is unlikely to be useful to end-users who require assistance with a task, for which an assistant LLM (such as microsoft/Phi-4) might be more appropriate. We do not recommend using UserLM in commercial or real-world applications without further testing and development; it is being released for research purposes. The paper accompanying this model release presents several evaluations of UserLM-8b and its potential limitations. In Section 3, we describe the robustness experiments we conducted with UserLM-8b, which show that though the model can more robustly adhere to the user role and the provided task intent than baselines, the robustness numbers are not perfect.
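The three simulator uses described above (first-turn utterance, follow-up turns, end-of-conversation signal) can be sketched as a driver loop. Everything here is illustrative: `user_sim` and `assistant` are stand-ins for UserLM-8b and the assistant under test, and `"<|endconversation|>"` is a placeholder marker, not the model's actual end token.

```python
# Sketch of a user-simulator evaluation loop (hypothetical interface).
def simulate_dialog(user_sim, assistant, task_intent, max_turns=5):
    history = []
    for _ in range(max_turns):
        user_turn = user_sim(task_intent, history)   # (1)/(2): first or follow-up user turn
        if user_turn == "<|endconversation|>":       # (3): simulator ends the conversation
            break
        history.append(("user", user_turn))
        history.append(("assistant", assistant(history)))
    return history

# Toy stand-ins, just to show the control flow:
def toy_user(intent, history):
    if len(history) >= 2:
        return "<|endconversation|>"
    return f"Help me with: {intent}"

def toy_assistant(history):
    return "Sure, here is a draft."

dialog = simulate_dialog(toy_user, toy_assistant, "drafting an email")
```

In a real evaluation, the final `history` would be scored (e.g., task success) to estimate assistant performance with simulated users.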
Skywork-VL-Reward-7B-GGUF
This model was generated using llama.cpp at commit `1f63e75f`. I've been experimenting with a new quantization approach that selectively elevates the precision of key layers beyond what the default IMatrix configuration provides. In my testing, standard IMatrix quantization underperforms at lower bit depths, especially with Mixture of Experts (MoE) models. To address this, I'm using the `--tensor-type` option in `llama.cpp` to manually "bump" important layers to higher precision. You can see the implementation here: 👉 Layer bumping with llama.cpp While this does increase model file size, it significantly improves precision for a given quantization level. I'd love your feedback—have you tried this? How does it perform for you? Click here to learn more about choosing the right GGUF model format

May 12, 2025: Our technical report is now available on arXiv and we welcome citations: Skywork-VL Reward: An Effective Reward Model for Multimodal Understanding and Reasoning

April 24, 2025: We released Skywork-VL-Reward-7B, a state-of-the-art multimodal reward model on VLRewardBench, and published our technical report on the R1V GitHub repository.

Introduction: The lack of multimodal reward models has become a major bottleneck restricting the development of multimodal reinforcement learning. We open-source the 7B multimodal reward model Skywork-VL-Reward, injecting new momentum into the industry and opening a new chapter in multimodal reinforcement learning. Skywork-VL-Reward is based on the Qwen2.5-VL-7B-Instruct architecture with the addition of a value-head structure for reward-model training. We obtained a SOTA score of 73.1 on VL-RewardBench and a high score of 90.1 on RewardBench. In addition, MPO training based on Skywork-VL-Reward for Skywork-R1V-2.0 further validates the model's effectiveness. We hope that this multimodal reward model will contribute to the open-source community! Please refer to our technical report for more details.
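A reward model of this kind replaces the language-model head with a scalar value head applied to a pooled hidden state. The following is a toy, framework-free sketch of that idea; the dimensions and weights are illustrative and are not Skywork's actual implementation.

```python
# Toy scalar value head: score = w · h + b over a pooled hidden state.
# In practice this is a small linear layer on top of the VLM backbone
# (here Qwen2.5-VL-7B-Instruct), trained on preference pairs.
def value_head(hidden, weights, bias=0.0):
    assert len(hidden) == len(weights)
    return sum(h * w for h, w in zip(hidden, weights)) + bias

# Preference training pushes score(chosen) above score(rejected),
# e.g. via a loss like -log(sigmoid(score_chosen - score_rejected)).
h_chosen = [0.5, -1.0, 2.0]    # illustrative pooled states
h_rejected = [0.1, 0.2, 0.3]
w = [1.0, 0.5, 0.25]
margin = value_head(h_chosen, w) - value_head(h_rejected, w)
```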
Technical Report: Skywork-VL Reward: An Effective Reward Model for Multimodal Understanding and Reasoning

VL-RewardBench results:

| Model | Size | General | Hallucination | Reasoning | Overall Accuracy | Macro Average |
| --- | --- | --- | --- | --- | --- | --- |
| **Proprietary Models** | | | | | | |
| Claude-3.5-Sonnet (2024-06-22) | - | 43.4 | 55.0 | 62.3 | 55.3 | 53.6 |
| Gemini-1.5-Flash (2024-09-24) | - | 47.8 | 59.6 | 58.4 | 57.6 | 55.3 |
| GPT-4o (2024-08-06) | - | 49.1 | 67.6 | 70.5 | 65.8 | 62.4 |
| Gemini-1.5-Pro (2024-09-24) | - | 50.8 | 72.5 | 64.2 | 67.2 | 62.5 |
| Gemini-2.0-flash-exp (2024-12) | - | 50.8 | 72.6 | 70.1 | 68.8 | 64.5 |
| **Open-Source Models** | | | | | | |
| Qwen2-VL-7B-Instruct | 7B | 31.6 | 19.1 | 51.1 | 28.3 | 33.9 |
| MAmmoTH-VL-8B | 8B | 36.0 | 40.0 | 52.0 | 42.2 | 42.7 |
| Qwen2.5-VL-7B-Instruct | 7B | 43.4 | 42.0 | 63.0 | 48.0 | 49.5 |
| InternVL3-8B | 8B | 60.6 | 44.0 | 62.3 | 57.0 | 55.6 |
| IXC-2.5-Reward-7B | 7B | 80.3 | 65.3 | 60.4 | 66.3 | 68.6 |
| Qwen2-VL-72B-Instruct | 72B | 38.1 | 32.8 | 58.0 | 39.5 | 43.0 |
| Molmo-72B-0924 | 72B | 33.9 | 42.3 | 54.9 | 44.1 | 43.7 |
| QVQ-72B-Preview | 72B | 41.8 | 46.2 | 51.2 | 46.4 | 46.4 |
| Qwen2.5-VL-72B-Instruct | 72B | 47.8 | 46.8 | 63.5 | 51.6 | 52.7 |
| InternVL3-78B | 78B | 67.8 | 52.5 | 64.5 | 63.3 | 61.6 |
| Skywork-VL Reward (Ours) | 7B | 66.0 | 80.0 | 61.0 | **73.1** | **69.0** |

RewardBench results:

| Model | Chat | Chat Hard | Safety | Reasoning | Avg |
| --- | --- | --- | --- | --- | --- |
| **Language-Only Reward Models** | | | | | |
| InternLM2-7B-Reward | 99.2 | 69.5 | 87.2 | 94.5 | 87.6 |
| Skywork-Reward-Llama3.1-8B | 95.8 | 87.3 | 90.8 | 96.2 | 92.5 |
| Skywork-Reward-Llama-3.1-8B-v0.2 | 94.7 | 88.4 | 92.7 | 96.7 | 93.1 |
| QRM-Llama3.1-8B-v2 | 96.4 | 86.8 | 92.6 | 96.8 | 93.1 |
| **Multi-Modal Reward Models** | | | | | |
| Qwen2-VL-7B-Instruct | 65.1 | 50.9 | 55.8 | 68.3 | 60.0 |
| InternVL3-8B | 97.2 | 50.4 | 83.6 | 83.9 | 78.8 |
| Qwen2.5-VL-7B-Instruct | 94.3 | 63.8 | 84.1 | 86.2 | 82.1 |
| IXC-2.5-Reward-7B | 90.8 | 83.8 | 87.8 | 90.0 | 88.1 |
| Skywork-VL Reward (Ours) | 90.0 | 87.5 | 91.1 | 91.8 | **90.1** |

Citation: If you use this work in your research, please cite the technical report above.
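The Macro Average column reported for VL-RewardBench is the unweighted mean of the three category scores (General, Hallucination, Reasoning); a quick check against the table:

```python
# Macro average over the three VL-RewardBench categories,
# reproducing the table's last column.
def macro_avg(general, hallucination, reasoning):
    return round((general + hallucination + reasoning) / 3, 1)

skywork = macro_avg(66.0, 80.0, 61.0)   # table reports 69.0
gpt4o = macro_avg(49.1, 67.6, 70.5)     # table reports 62.4
```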
Qwen2.5-14B-Instruct-GGUF
Qwen3-Reranker-4B-GGUF
Cosmos-Reason1-7B-GGUF
rwkv7-2.9B-g1-GGUF
Llama-3.2-3B-F1-Reasoning-Instruct-GGUF
mxbai-rerank-large-v2-GGUF
LiveCC-7B-Instruct-GGUF
granite-4.0-h-micro-GGUF
This model was generated using llama.cpp at commit `ee09828cb`. I've been experimenting with a new quantization approach that selectively elevates the precision of key layers beyond what the default IMatrix configuration provides. In my testing, standard IMatrix quantization underperforms at lower bit depths, especially with Mixture of Experts (MoE) models. To address this, I'm using the `--tensor-type` option in `llama.cpp` to manually "bump" important layers to higher precision. You can see the implementation here: 👉 Layer bumping with llama.cpp While this does increase model file size, it significantly improves precision for a given quantization level. I'd love your feedback—have you tried this? How does it perform for you? Click here to get info on choosing the right GGUF model format

📣 Update [10-07-2025]: Added a default system prompt to the chat template to guide the model towards more professional, accurate, and safe responses.

Model Summary: Granite-4.0-H-Micro is a 3B-parameter, long-context instruct model finetuned from Granite-4.0-H-Micro-Base using a combination of open-source instruction datasets with permissive licenses and internally collected synthetic datasets. This model is developed using a diverse set of techniques with a structured chat format, including supervised finetuning, model alignment using reinforcement learning, and model merging. Granite 4.0 instruct models feature improved instruction following (IF) and tool-calling capabilities, making them more effective in enterprise applications.
- Developers: Granite Team, IBM
- HF Collection: Granite 4.0 Language Models HF Collection
- GitHub Repository: ibm-granite/granite-4.0-language-models
- Website: Granite Docs
- Release Date: October 2nd, 2025
- License: Apache 2.0

Supported Languages: English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese. Users may finetune Granite 4.0 models for languages beyond this list.
Intended use: The model is designed to respond to general instructions and can be used to build AI assistants for multiple domains, including business applications.

Capabilities:
- Summarization
- Text classification
- Text extraction
- Question-answering
- Retrieval-Augmented Generation (RAG)
- Code-related tasks
- Function-calling tasks
- Multilingual dialog use cases
- Fill-In-the-Middle (FIM) code completions

Generation: This is a simple example of how to use the Granite-4.0-H-Micro model. Copy the snippet from the section that is relevant for your use case.

Tool-calling: Granite-4.0-H-Micro comes with enhanced tool-calling capabilities, enabling seamless integration with external functions and APIs. To define a list of tools, please follow OpenAI's function definition schema. This is an example of how to use the Granite-4.0-H-Micro model's tool-calling ability.

Benchmarks

Multilingual benchmarks and their included languages:

| Benchmark | # Langs | Languages |
| --- | --- | --- |
| MMMLU | 11 | ar, de, en, es, fr, ja, ko, pt, zh, bn, hi |
| INCLUDE | 14 | hi, bn, ta, te, ar, de, es, fr, it, ja, ko, nl, pt, zh |

Model Architecture: The Granite-4.0-H-Micro baseline is built on a decoder-only dense transformer architecture. Core components of this architecture are: GQA, Mamba2, MLP with SwiGLU, RMSNorm, and shared input/output embeddings.

| Model | Micro Dense | H Micro Dense | H Tiny MoE | H Small MoE |
| --- | --- | --- | --- | --- |
| Number of layers | 40 attention | 4 attention / 36 Mamba2 | 4 attention / 36 Mamba2 | 4 attention / 36 Mamba2 |
| MLP / Shared expert hidden size | 8192 | 8192 | 1024 | 1536 |

Training Data: Overall, our SFT data is largely comprised of three key sources: (1) publicly available datasets with permissive licenses, (2) internal synthetic data targeting specific capabilities, and (3) a select set of human-curated data.
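The OpenAI-style function definitions mentioned in the tool-calling section above look like the following. The `get_current_weather` tool here is a made-up example, not one shipped with Granite.

```python
# Hypothetical tool definition following OpenAI's function-calling schema.
get_weather_tool = {
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name, e.g. 'Paris'"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}

# A chat request would typically pass a list of such definitions
# (e.g. via the `tools` argument of the chat template).
tools = [get_weather_tool]
```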
Infrastructure: We trained the Granite 4.0 language models on an NVIDIA GB200 NVL72 cluster hosted in CoreWeave. Intra-rack communication occurs via the 72-GPU NVLink domain, and a non-blocking, full Fat-Tree NDR 400 Gb/s InfiniBand network provides inter-rack communication. This cluster provides a scalable and efficient infrastructure for training our models over thousands of GPUs.

Ethical Considerations and Limitations: Granite 4.0 instruct models are primarily finetuned on instruction-response pairs, mostly in English, but also on multilingual data covering multiple languages. Although this model can handle multilingual dialog use cases, its performance might not match that on English tasks. In such cases, introducing a small number of examples (few-shot) can help the model generate more accurate outputs. While this model has been aligned with safety in mind, it may in some cases produce inaccurate, biased, or unsafe responses to user prompts. We therefore urge the community to use this model with proper safety testing and tuning tailored to their specific tasks.

Resources
- ⭐️ Learn about the latest updates with Granite: https://www.ibm.com/granite
- 📄 Get started with tutorials, best practices, and prompt engineering advice: https://www.ibm.com/granite/docs/
- 💡 Learn about the latest Granite learning resources: https://ibm.biz/granite-learning-resources
Llama-3.1-8B-Instruct-GGUF
medgemma-4b-it-GGUF
NextCoder-7B-GGUF
gemma-3n-E4B-it-GGUF
Qwen3-4B-abliterated-GGUF
Nemotron-Research-Reasoning-Qwen-1.5B-GGUF
SmallThinker-21BA3B-Instruct-GGUF
granite-3b-code-instruct-2k-GGUF
Qwen3-14B-GGUF
OpenMath-CodeLlama-7b-Python-hf-GGUF
Qwen2.5-7B-Instruct-GGUF
LFM2-2.6B-GGUF
This model was generated using llama.cpp at commit `ee09828cb`. Click here to get info on choosing the right GGUF model format

LFM2 is a new generation of hybrid models developed by Liquid AI, specifically designed for edge AI and on-device deployment. It sets a new standard in quality, speed, and memory efficiency. We're releasing the weights of four post-trained checkpoints with 350M, 700M, 1.2B, and 2.6B parameters. They provide the following key features to create AI-powered edge applications:

- Fast training & inference – LFM2 achieves 3x faster training compared to its previous generation. It also benefits from 2x faster decode and prefill speed on CPU compared to Qwen3.
- Best performance – LFM2 outperforms similarly-sized models across multiple benchmark categories, including knowledge, mathematics, instruction following, and multilingual capabilities.
- New architecture – LFM2 is a new hybrid Liquid model with multiplicative gates and short convolutions.
- Flexible deployment – LFM2 runs efficiently on CPU, GPU, and NPU hardware for flexible deployment on smartphones, laptops, or vehicles.

Due to their small size, we recommend fine-tuning LFM2 models on narrow use cases to maximize performance. They are particularly suited for agentic tasks, data extraction, RAG, creative writing, and multi-turn conversations. However, we do not recommend using them for tasks that are knowledge-intensive or require programming skills.
| Property | LFM2-350M | LFM2-700M | LFM2-1.2B | LFM2-2.6B |
| --- | --- | --- | --- | --- |
| Parameters | 354,483,968 | 742,489,344 | 1,170,340,608 | 2,569,272,320 |
| Layers | 16 (10 conv + 6 attn) | 16 (10 conv + 6 attn) | 16 (10 conv + 6 attn) | 30 (22 conv + 8 attn) |
| Context length | 32,768 tokens | 32,768 tokens | 32,768 tokens | 32,768 tokens |
| Vocabulary size | 65,536 | 65,536 | 65,536 | 65,536 |
| Precision | bfloat16 | bfloat16 | bfloat16 | bfloat16 |
| Training budget | 10 trillion tokens | 10 trillion tokens | 10 trillion tokens | 10 trillion tokens |
| License | LFM Open License v1.0 | LFM Open License v1.0 | LFM Open License v1.0 | LFM Open License v1.0 |

Supported languages: English, Arabic, Chinese, French, German, Japanese, Korean, and Spanish.

Generation parameters: We recommend `temperature=0.3`, `min_p=0.15`, and `repetition_penalty=1.05`.

Reasoning: LFM2-2.6B is the only model in this family to use dynamic hybrid reasoning (traces between dedicated reasoning special tokens) for complex or multilingual prompts.

Chat template: LFM2 uses a ChatML-like chat template. You can apply it automatically using the dedicated `.apply_chat_template()` function from Hugging Face transformers.

Tool use: It consists of four main steps:
1. Function definition: LFM2 takes JSON function definitions as input (JSON objects between dedicated special tokens), usually in the system prompt.
2. Function call: LFM2 writes Pythonic function calls (a Python list between dedicated special tokens) as the assistant answer.
3. Function execution: The function call is executed and the result is returned (a string between dedicated special tokens) as a "tool" role message.
4. Final answer: LFM2 interprets the outcome of the function call to address the original user prompt in plain text.
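Step 2 of the tool-use flow above means the host application has to turn a Pythonic call list back into tool names and arguments before executing anything. A minimal sketch using the standard `ast` module (the `get_weather` call is a made-up example, and only keyword arguments with literal values are handled):

```python
import ast

# Parse a Pythonic tool-call list such as
#   '[get_weather(city="Paris", unit="celsius")]'
# into (name, kwargs) pairs. Real harnesses would add validation
# and error handling before dispatching to actual tools.
def parse_tool_calls(text):
    tree = ast.parse(text.strip(), mode="eval")
    calls = []
    for node in tree.body.elts:  # expect a list of Call nodes
        name = node.func.id
        kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
        calls.append((name, kwargs))
    return calls

calls = parse_tool_calls('[get_weather(city="Paris", unit="celsius")]')
```

Parsing with `ast` rather than `eval` keeps the model's output inert until the host explicitly dispatches each named tool.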
Here is a simple example of a conversation using tool use:

Architecture: Hybrid model with multiplicative gates and short convolutions: 10 double-gated short-range LIV convolution blocks and 6 grouped query attention (GQA) blocks.

Pre-training mixture: Approximately 75% English, 20% multilingual, and 5% code data sourced from the web and licensed materials.

Training approach:
- Very large-scale SFT on a mixture of 50% downstream tasks and 50% general domains
- Custom DPO with length normalization and semi-online datasets
- Iterative model merging

To run LFM2, you need to install Hugging Face `transformers` v4.55 or a more recent version. Here is an example of how to generate an answer with transformers in Python: You can directly run and test the model with this Colab notebook.

To use vLLM, you need to install `vLLM` v0.10.2 or a more recent version. You can also run LFM2 with llama.cpp using its GGUF checkpoint. Find more information in the model card.

We recommend fine-tuning LFM2 models on your use cases to maximize performance.

| Notebook | Description | Link |
| --- | --- | --- |
| SFT (Unsloth) | Supervised Fine-Tuning (SFT) notebook with a LoRA adapter using Unsloth. | |
| SFT (Axolotl) | Supervised Fine-Tuning (SFT) notebook with a LoRA adapter using Axolotl. | |
| SFT (TRL) | Supervised Fine-Tuning (SFT) notebook with a LoRA adapter using TRL. | |
| DPO (TRL) | Preference alignment with Direct Preference Optimization (DPO) using TRL. | |

LFM2 outperforms similarly-sized models across different evaluation categories. We only report scores using instruct variants and non-thinking modes for consistency.
| Model | MMLU | GPQA | IFEval | IFBench | GSM8K | MGSM | MMMLU |
| --- | --- | --- | --- | --- | --- | --- | --- |
| LFM2-2.6B | 64.42 | 26.57 | 79.56 | 22.19 | 82.41 | 74.32 | 55.39 |
| Llama-3.2-3B-Instruct | 60.35 | 30.6 | 71.43 | 20.78 | 75.21 | 61.68 | 47.92 |
| SmolLM3-3B | 59.84 | 26.31 | 72.44 | 17.93 | 81.12 | 68.72 | 50.02 |
| gemma-3-4b-it | 58.35 | 29.51 | 76.85 | 23.53 | 89.92 | 87.28 | 50.14 |
| Qwen3-4B-Instruct-2507 | 72.25 | 34.85 | 85.62 | 30.28 | 68.46 | 81.76 | 60.67 |

If you are interested in custom solutions with edge deployment, please contact our sales team.
Foundation-Sec-8B-GGUF
Devstral-Small-2505-GGUF
Phi-4-mini-instruct-GGUF
Fathom-R1-14B-GGUF
DeepSeek-R1-Distill-Qwen-7B-GGUF
Qwen3-30B-A1.5B-High-Speed-GGUF
medgemma-4b-pt-GGUF
Qwen3-14B-abliterated-GGUF
Logics-Parsing-GGUF
This model was generated using llama.cpp at commit `c8dedc99`. Click here to get info on choosing the right GGUF model format

🤗 GitHub | 🤖 Demo | 📑 Technical Report

Logics-Parsing is a powerful, end-to-end document parsing model built upon a general Vision-Language Model (VLM) through Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). It excels at accurately analyzing and structuring highly complex documents.

Effortless End-to-End Processing: Our single-model architecture eliminates the need for complex, multi-stage pipelines. Deployment and inference are straightforward, going directly from a document image to structured output. It demonstrates exceptional performance on documents with challenging layouts.

Advanced Content Recognition: It accurately recognizes and structures difficult content, including intricate scientific formulas. Chemical structures are intelligently identified and can be represented in the standard SMILES format.

Rich, Structured HTML Output: The model generates a clean HTML representation of the document, preserving its logical structure. Each content block (e.g., paragraph, table, figure, formula) is tagged with its category, bounding-box coordinates, and OCR text. It automatically identifies and filters out irrelevant elements like headers and footers, focusing only on the core content.

State-of-the-Art Performance: Logics-Parsing achieves the best performance on our in-house benchmark, which is specifically designed to comprehensively evaluate a model's parsing capability on complex-layout documents and STEM content. Existing document-parsing benchmarks often provide limited coverage of complex layouts and STEM content. To address this, we constructed an in-house benchmark comprising 1,078 page-level images across nine major categories and over twenty sub-categories.
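Consuming the category- and bbox-annotated HTML described above can be sketched with the standard-library HTML parser. The attribute names (`data-category`, `data-bbox`) are illustrative assumptions; check the Logics-Parsing repository for the actual output schema.

```python
from html.parser import HTMLParser

# Hypothetical consumer of block-annotated HTML output.
class BlockCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if "data-category" in a:
            # bbox assumed to be "x1,y1,x2,y2" (an assumption, not the spec)
            bbox = [int(v) for v in a.get("data-bbox", "").split(",") if v]
            self.blocks.append({"category": a["data-category"], "bbox": bbox})

parser = BlockCollector()
parser.feed('<div data-category="table" data-bbox="10,20,300,400">...</div>')
```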
Model Type Methods Overall Edit ↓ Text Edit Edit ↓ Formula Edit ↓ Table TEDS ↑ Table Edit ↓ ReadOrder Edit ↓ Chemistry Edit ↓ HandWriting Edit ↓ Pipeline Tools doc2x 0.209 0.188 0.128 0.194 0.377 0.321 81.1 85.3 0.148 0.115 0.146 0.122 1.0 0.307 Textin 0.153 0.158 0.132 0.190 0.185 0.223 76.7 86.3 0.176 0.113 0.118 0.104 1.0 0.344 mathpix 0.128 0.146 0.128 0.152 0.06 0.142 86.2 86.6 0.120 0.127 0.204 0.164 0.552 0.263 PPStructureV3 0.220 0.226 0.172 0.29 0.272 0.276 66 71.5 0.237 0.193 0.201 0.143 1.0 0.382 Mineru2 0.212 0.245 0.134 0.195 0.280 0.407 67.5 71.8 0.228 0.203 0.205 0.177 1.0 0.387 Marker 0.324 0.409 0.188 0.289 0.285 0.383 65.5 50.4 0.593 0.702 0.23 0.262 1.0 0.50 Pix2text 0.447 0.547 0.485 0.577 0.312 0.465 64.7 63.0 0.566 0.613 0.424 0.534 1.0 0.95 Expert VLMs Dolphin 0.208 0.256 0.149 0.189 0.334 0.346 72.9 60.1 0.192 0.35 0.160 0.139 0.984 0.433 dots.ocr 0.186 0.198 0.115 0.169 0.291 0.358 79.5 82.5 0.172 0.141 0.165 0.123 1.0 0.255 MonkeyOcr 0.193 0.259 0.127 0.236 0.262 0.325 78.4 74.7 0.186 0.294 0.197 0.180 1.0 0.623 OCRFlux 0.252 0.254 0.134 0.195 0.326 0.405 58.3 70.2 0.358 0.260 0.191 0.156 1.0 0.284 Gotocr 0.247 0.249 0.181 0.213 0.231 0.318 59.5 74.7 0.38 0.299 0.195 0.164 0.969 0.446 Olmocr 0.341 0.382 0.125 0.205 0.719 0.766 57.1 56.6 0.327 0.389 0.191 0.169 1.0 0.294 SmolDocling 0.657 0.895 0.486 0.932 0.859 0.972 18.5 1.5 0.86 0.98 0.413 0.695 1.0 0.927 Logics-Parsing 0.124 0.145 0.089 0.139 0.106 0.165 76.6 79.5 0.165 0.166 0.136 0.113 0.519 0.252 General VLMs Qwen2VL-72B 0.298 0.342 0.142 0.244 0.431 0.363 64.2 55.5 0.425 0.581 0.193 0.182 0.792 0.359 Qwen2.5VL-72B 0.233 0.263 0.162 0.24 0.251 0.257 69.6 67 0.313 0.353 0.205 0.204 0.597 0.349 Doubao-1.6 0.188 0.248 0.129 0.219 0.273 0.336 74.9 69.7 0.180 0.288 0.171 0.148 0.601 0.317 GPT-5 0.242 0.373 0.119 0.36 0.398 0.456 67.9 55.8 0.26 0.397 0.191 0.28 0.88 0.46 Gemini2.5 pro 0.185 0.20 0.115 0.155 0.288 0.326 82.6 80.3 0.154 0.182 0.181 0.136 0.535 0.26 Tested on the v3/PDF 
Conversion API (August 2025 deployment).

We would like to acknowledge the following open-source projects that provided inspiration and reference for this work: - Qwen2.5-VL - OmniDocBench - Mathpix

Help me test my AI-Powered Quantum Network Monitor Assistant with quantum-ready security checks: The full open-source code for the Quantum Network Monitor Service is available at my GitHub repos (repos with NetworkMonitor in the name): Source Code Quantum Network Monitor. You will also find the code I use to quantize the models if you want to do it yourself: GGUFModelBuilder

💬 How to test: Choose an AI assistant type:
- `TurboLLM` (GPT-4.1-mini)
- `HugLLM` (Hugging Face open-source models)
- `TestLLM` (Experimental CPU-only)

What I'm Testing: I'm pushing the limits of small open-source models for AI network monitoring, specifically:
- Function calling against live network services
- How small a model can go while still handling:
  - Automated Nmap security scans
  - Quantum-readiness checks
  - Network monitoring tasks

🟡 TestLLM – Current experimental model (llama.cpp on 2 CPU threads on a Hugging Face Docker space):
- ✅ Zero-configuration setup
- ⏳ 30s load time (slow inference but no API costs). No token limit, as the cost is low.
- 🔧 Help wanted! If you're into edge-device AI, let's collaborate!

Other Assistants

🟢 TurboLLM – Uses gpt-4.1-mini:
- It performs very well, but unfortunately OpenAI charges per token, so token usage is limited.
- Create custom cmd processors to run .NET code on Quantum Network Monitor Agents
- Real-time network diagnostics and monitoring
- Security audits
- Penetration testing (Nmap/Metasploit)

🔵 HugLLM – Latest open-source models:
- 🌐 Runs on the Hugging Face Inference API. Performs pretty well using the latest models hosted on Novita.

💡 Example commands you could test: 1. `"Give me info on my website's SSL certificate"` 2. `"Check if my server is using quantum-safe encryption for communication"` 3.
`"Run a comprehensive security audit on my server"` 4. `"Create a cmd processor to .. (whatever you want)"` Note: you need to install a Quantum Network Monitor Agent to run the .NET code. This is a very flexible and powerful feature. Use with caution! I fund the servers used to create these model files, run the Quantum Network Monitor service, and pay for inference from Novita and OpenAI—all out of my own pocket. All the code behind the model creation and the Quantum Network Monitor project is open source. Feel free to use whatever you find helpful. If you appreciate the work, please consider buying me a coffee ☕. Your support helps cover service costs and allows me to raise token limits for everyone.
instinct-GGUF
This model was generated using llama.cpp at commit `408ff524`.

Instinct, the State-of-the-Art Open Next Edit Model

This repo contains the model weights for Continue's state-of-the-art open Next Edit model, Instinct. Robustly fine-tuned from Qwen2.5-Coder-7B on our dataset of real-world code edits, Instinct intelligently predicts your next move to keep you in flow.

Ollama: We've released a Q4_K_M GGUF quantization of Instinct for efficient local inference. Try it with Continue's Ollama integration, or just run `ollama run nate/instinct`. You can also serve the model using either of the options below, then connect it with Continue.

SGLang: `python3 -m sglang.launch_server --model-path continuedev/instinct --load-format safetensors`

vLLM: `vllm serve continuedev/instinct --served-model-name instinct --load-format safetensors`

For more information on the work behind Instinct, please refer to our blog.
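Both SGLang and vLLM expose an OpenAI-compatible HTTP API once serving. A minimal sketch of building a chat-completions request for the locally served model; the port (8000) is the usual default and may differ, and the message framing is illustrative rather than Instinct's documented next-edit prompt format.

```python
import json

# vLLM/SGLang serve an OpenAI-compatible endpoint; the port and path below
# are common defaults, assumed for illustration.
ENDPOINT = "http://localhost:8000/v1/chat/completions"

def build_request(code_before: str, instruction: str) -> str:
    """Builds a chat-completions payload for the locally served model.

    The user-message framing is illustrative; consult Continue's docs for
    the exact prompt format Instinct expects for next-edit prediction.
    """
    payload = {
        "model": "instinct",  # matches --served-model-name above
        "messages": [
            {"role": "user",
             "content": f"{instruction}\n\n```\n{code_before}\n```"},
        ],
        "temperature": 0.2,
    }
    return json.dumps(payload)

body = build_request("def add(a, b):\n    return a - b", "Fix the bug")
print(body)
```

The resulting JSON string would be POSTed to `ENDPOINT` with a `Content-Type: application/json` header.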
Qwen3-16B-A3B-GGUF
granite-8b-code-instruct-4k-GGUF
This model was generated using llama.cpp at commit `0a5a3b5c`.

Model Summary

Granite-8B-Code-Instruct-4K is an 8B-parameter model fine-tuned from Granite-8B-Code-Base-4K on a combination of permissively licensed instruction data to enhance instruction-following capabilities, including logical reasoning and problem-solving skills.

- Developers: IBM Research
- GitHub Repository: ibm-granite/granite-code-models
- Paper: Granite Code Models: A Family of Open Foundation Models for Code Intelligence
- Release Date: May 6th, 2024
- License: Apache 2.0

Usage

Intended use: The model is designed to respond to coding-related instructions and can be used to build coding assistants.

Generation: This is a simple example of how to use the Granite-8B-Code-Instruct-4K model.

Training Data

Granite Code Instruct models are trained on the following types of data.

Code Commits Datasets: We sourced code commit data from the CommitPackFT dataset, a filtered version of the full CommitPack dataset. From the CommitPackFT dataset, we only consider data for 92 programming languages. Our inclusion criterion boils down to selecting the programming languages common between CommitPackFT and the 116 languages we considered for pretraining the base model (Granite-8B-Code-Base).
Math Datasets: We consider two high-quality math datasets, MathInstruct and MetaMathQA. Due to license issues, we filtered out GSM8K-RFT and Camel-Math from the MathInstruct dataset.

Code Instruction Datasets: We use Glaive-Code-Assistant-v3, Glaive-Function-Calling-v2, NL2SQL11, and a small collection of synthetic API-calling datasets.

Language Instruction Datasets: We include high-quality datasets such as HelpSteer and an open-license-filtered version of Platypus. We also include a collection of hardcoded prompts to ensure our model generates correct outputs given inquiries about its name or developers.

Infrastructure

We train the Granite Code models using two of IBM's supercomputing clusters, Vela and Blue Vela, outfitted with NVIDIA A100 and H100 GPUs, respectively. These clusters provide a scalable and efficient infrastructure for training our models over thousands of GPUs.

Ethical Considerations and Limitations

Granite Code Instruct models are primarily fine-tuned using instruction-response pairs across a specific set of programming languages, so their performance may be limited on out-of-domain programming languages. In this situation, it is beneficial to provide few-shot examples to steer the model's output. Moreover, developers should perform safety testing and target-specific tuning before deploying these models in critical applications. The model also inherits ethical considerations and limitations from its base model. For more information, please refer to the Granite-8B-Code-Base-4K model card.
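As the limitations note suggests, few-shot examples can steer the model on out-of-domain languages. A minimal sketch of assembling such a prompt; the plain question/answer framing is illustrative only, and in practice the model's actual chat template should be applied via the tokenizer.

```python
def few_shot_prompt(examples, query):
    """Concatenates worked examples ahead of the real query.

    `examples` is a list of (instruction, solution) pairs; the framing
    below is illustrative, not Granite's official chat template.
    """
    parts = []
    for instruction, solution in examples:
        parts.append(f"Question:\n{instruction}\n\nAnswer:\n{solution}\n")
    parts.append(f"Question:\n{query}\n\nAnswer:\n")
    return "\n".join(parts)

prompt = few_shot_prompt(
    [("Write a COBOL paragraph that adds two numbers.",
      "ADD-NUMS.\n    ADD A TO B GIVING C.")],
    "Write a COBOL paragraph that subtracts two numbers.",
)
print(prompt)
```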
OpenMath-CodeLlama-13b-Python-hf-GGUF
Llama-3.1-Nemotron-8B-UltraLong-2M-Instruct-GGUF
MinerU2.5-2509-1.2B-GGUF
OpenMath-Nemotron-32B-GGUF
OpenMath2-Llama3.1-8B-GGUF
Qwen2.5-Omni-3B-GGUF
Magma-8B-GGUF
AM-Thinking-v1-GGUF
Qwen3-Reranker-0.6B-GGUF
Apriel-Nemotron-15b-Thinker-GGUF
Qwen3-Embedding-0.6B-GGUF
Jan-nano-128k-GGUF
This model was generated using llama.cpp at commit `8846aace`.

Jan-Nano-128k: Empowering deeper research through extended context understanding.

GitHub: https://github.com/menloresearch/deep-research | License: Apache-2.0

Jan-Nano-128k represents a significant advancement in compact language models for research applications. Building upon the success of Jan-Nano, this enhanced version features a native 128k context window that enables deeper, more comprehensive research capabilities without the performance degradation typically associated with context-extension methods.

Key Improvements:
- 🔍 Research Deeper: Extended context allows for processing entire research papers, lengthy documents, and complex multi-turn conversations
- ⚡ Native 128k Window: Built from the ground up to handle long contexts efficiently, maintaining performance across the full context range
- 📈 Enhanced Performance: Unlike traditional context-extension methods, Jan-Nano-128k shows improved performance with longer contexts

This model maintains full compatibility with Model Context Protocol (MCP) servers while dramatically expanding the scope of research tasks it can handle in a single session.
Jan-Nano-128k has been rigorously evaluated on the SimpleQA benchmark using our MCP-based methodology, demonstrating superior performance compared to its predecessor. Traditional approaches to extending context length, such as YaRN (Yet another RoPE extensioN), often result in performance degradation as context length increases. Jan-Nano-128k breaks this paradigm. This fundamental difference makes Jan-Nano-128k ideal for research applications requiring deep document analysis, multi-document synthesis, and complex reasoning over large information sets.

Jan desktop will eventually support this model (WIP). Otherwise, you can check the deployment options below that we have tested. For additional tutorials and community guidance, visit our Discussion Forums. Note: the chat template is included in the tokenizer. For troubleshooting, download the Non-think chat template.

- Discussions: HuggingFace Community
- Issues: GitHub Repository
- Documentation: Official Docs
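For context on why extension methods like YaRN can degrade quality: RoPE encodes positions with a spectrum of rotation frequencies, and naive extension rescales positions before rotation. A minimal sketch of the base frequencies and simple linear position interpolation (illustrative only; YaRN applies a more nuanced per-frequency interpolation, and Jan-Nano-128k's native window avoids this rescaling entirely):

```python
def rope_frequencies(head_dim: int, base: float = 10000.0):
    """Per-pair rotation frequencies used by rotary position embeddings."""
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]

def position_angles(pos: int, freqs, scale: float = 1.0):
    """Rotation angles for one position; scale < 1 is linear interpolation,
    which squeezes a longer sequence into the trained position range."""
    return [pos * scale * f for f in freqs]

freqs = rope_frequencies(8)
native = position_angles(4096, freqs)                      # within the trained window
stretched = position_angles(131072, freqs, 4096 / 131072)  # squeezed back down
# Linear interpolation maps position 131072 onto the same angles as 4096,
# blurring nearby positions together, one source of degradation.
print(native == stretched)
```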
Eagle2-9B-GGUF
Qwen3-Coder-30B-A3B-Instruct-GGUF
TEN_Turn_Detection-GGUF
Llama-3.1-Minitron-4B-Width-Base-GGUF
DeepSeek-R1-Distill-Qwen-14B-GGUF
Qwen3-8B-abliterated-GGUF
gemma-3-27b-it-GGUF
DeepSeek-R1-Distill-Qwen-1.5B-GGUF
all-MiniLM-L6-v2-GGUF
OuteTTS-1.0-0.6B-GGUF
X-Ray_Alpha-GGUF
MiroThinker-v1.0-30B-GGUF
Qwen3-1.7B-abliterated-GGUF
NuMarkdown-8B-Thinking-GGUF
xLAM-2-32b-fc-r-GGUF
Magistral-Small-2506-abliterated-GGUF
Hunyuan-MT-Chimera-7B-GGUF
Trinity-Mini-GGUF
Seed-Coder-8B-Reasoning-GGUF
Llama3-ChatQA-2-8B-GGUF
Lucy-GGUF
Wayfarer-2-12B-GGUF
This model was generated using llama.cpp at commit `360d6533`.

We've heard over and over from AI Dungeon players that modern AI models are too nice, never letting them fail or die. While it may be good for a chatbot to be nice and helpful, great stories and games aren't all rainbows and unicorns. They have conflict, tension, and even death. These create real stakes and consequences for characters and the journeys they go on. We created Wayfarer as a response, and after much testing, feedback, and refining, we've developed a worthy sequel.

Wayfarer 2 further refines the formula that made the original Wayfarer so popular: slowing the pacing, increasing the length and detail of responses, and making death a distinct possibility for all characters—not just the user. The stakes have never been higher!

If you want to try this model for free, you can do so at https://aidungeon.com. We plan to continue improving and open-sourcing similar models, so please share any and all feedback on how we can improve model behavior. Below we share more details on how Wayfarer was created.
Wayfarer 2 12B received SFT training with a simple three-ingredient recipe: the Wayfarer 2 dataset itself, a series of sentiment-balanced roleplay transcripts, and a small instruct core to help retain its instruction-following capabilities.

Wayfarer's text adventure data was generated by simulating playthroughs of published character-creator scenarios from AI Dungeon. Five distinct user archetypes played through each scenario, with starting characters varied in faction, location, etc., to generate five unique samples. One language model played the role of narrator, with the other playing the user. They were blind to each other's underlying logic, so the user was actually capable of surprising the narrator with their choices. Each simulation was allowed to run for 8k tokens or until the main character died.

Wayfarer's general emotional sentiment is one of pessimism, where failure is frequent and plot armor does not exist for anyone. This serves to counter the positivity bias so inherent in today's language models.

The Nemo architecture is known for being sensitive to higher temperatures, so the following settings are recommended as a baseline. Nothing stops you from experimenting with these, of course.

Wayfarer was trained exclusively on second-person, present-tense data (using "you") in a narrative style. Other perspectives will work as well but may produce suboptimal results.

Thanks to Gryphe Padar for collaborating on this finetune with us!
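To illustrate why temperature sensitivity matters: sampling temperature rescales the model's logits before softmax, and on temperature-sensitive architectures even modest increases flatten the output distribution noticeably. A minimal sketch (a generic illustration with made-up numbers, not Wayfarer's recommended settings):

```python
import math

def softmax_with_temperature(logits, temperature):
    """Converts logits to probabilities; higher temperature flattens them."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 2.0, 0.0]                 # a confidently peaked distribution
cool = softmax_with_temperature(logits, 0.7)
hot = softmax_with_temperature(logits, 1.5)
# The top token loses probability mass as temperature rises, making
# low-probability (possibly incoherent) continuations more likely.
print(round(cool[0], 3), round(hot[0], 3))
```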
Llama-3.1-70B-Instruct-GGUF
Llama-xLAM-2-8b-fc-r-GGUF
rwkv7-1.5B-g1-GGUF
Josiefied-Qwen3-8B-abliterated-v1-GGUF
UIGEN-T2-7B-GGUF
FairyR1-32B-GGUF
Qwen3-VL-30B-A3B-Instruct-GGUF
This model was generated using llama.cpp at commit `66d8eccd4`.

Meet Qwen3-VL — the most powerful vision-language model in the Qwen series to date. This generation delivers comprehensive upgrades across the board: superior text understanding & generation, deeper visual perception & reasoning, extended context length, enhanced spatial and video-dynamics comprehension, and stronger agent-interaction capabilities. Available in Dense and MoE architectures that scale from edge to cloud, with Instruct and reasoning-enhanced Thinking editions for flexible, on-demand deployment.

- Visual Agent: Operates PC/mobile GUIs—recognizes elements, understands functions, invokes tools, completes tasks.
- Visual Coding Boost: Generates Draw.io/HTML/CSS/JS from images/videos.
- Advanced Spatial Perception: Judges object positions, viewpoints, and occlusions; provides stronger 2D grounding and enables 3D grounding for spatial reasoning and embodied AI.
- Long Context & Video Understanding: Native 256K context, expandable to 1M; handles books and hours-long video with full recall and second-level indexing.
- Enhanced Multimodal Reasoning: Excels in STEM/Math—causal analysis and logical, evidence-based answers.
- Upgraded Visual Recognition: Broader, higher-quality pretraining enables the model to "recognize everything"—celebrities, anime, products, landmarks, flora/fauna, etc.
- Expanded OCR: Supports 32 languages (up from 19); robust in low light, blur, and tilt; better with rare/ancient characters and jargon; improved long-document structure parsing.
- Text Understanding on par with pure LLMs: Seamless text–vision fusion for lossless, unified comprehension.

1. Interleaved-MRoPE: Full-frequency allocation over time, width, and height via robust positional embeddings, enhancing long-horizon video reasoning.
2. DeepStack: Fuses multi-level ViT features to capture fine-grained details and sharpen image–text alignment.
3. Text–Timestamp Alignment: Moves beyond T-RoPE to precise, timestamp-grounded event localization for stronger video temporal modeling.

This is the weight repository for Qwen3-VL-30B-A3B-Instruct. Below, we provide simple examples showing how to use Qwen3-VL with 🤖 ModelScope and 🤗 Transformers. The code for Qwen3-VL has been merged into the latest Hugging Face transformers, and we advise you to build from source with the following command: Here is a code snippet showing how to use the chat model with `transformers`: If you find our work helpful, feel free to cite us.
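The usage snippet itself is not reproduced above, but Qwen-VL-style chat models take a list of messages whose content mixes image and text parts. A minimal sketch of assembling such a message list; the field names follow the commonly documented Qwen-VL messages format, so verify them against the official Qwen3-VL examples.

```python
def build_messages(image_url: str, question: str):
    """One user turn combining an image reference and a text question."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_url},
                {"type": "text", "text": question},
            ],
        }
    ]

messages = build_messages(
    "https://example.com/demo.jpg",   # placeholder image URL
    "Describe this image.",
)
# In the Transformers Qwen-VL examples, a structure like this is passed to
# the processor's apply_chat_template before generation.
print(messages[0]["role"])
```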
Nanonets-OCR-s-GGUF
Mistral-Crab-DPO-GGUF
NextCoder-14B-GGUF
granite-3.2-8b-instruct-GGUF
rwkv7-7.2B-g0-GGUF
This model was generated using llama.cpp at commit `e4868d16`.

This is an RWKV-7 model in the flash-linear-attention format.

- Developed by: Bo Peng, Yu Zhang, Songlin Yang, Ruichong Zhang, Zhiyuan Li
- Funded by: RWKV Project (under LF AI & Data Foundation)
- Model type: RWKV7
- Language(s) (NLP): Multilingual
- License: Apache-2.0
- Parameter count: 7.2B
- Tokenizer: RWKV World tokenizer
- Vocabulary size: 65,536
- Repository: https://github.com/fla-org/flash-linear-attention ; https://github.com/BlinkDL/RWKV-LM
- Paper: https://arxiv.org/abs/2503.14456
- Model: https://huggingface.co/BlinkDL/rwkv7-g1/resolve/main/rwkv7-g1-2.9b-20250519-ctx4096.pth

Install `flash-linear-attention` and the latest version of `transformers` before using this model. You can then use this model just like any other Hugging Face model.

A: Upgrade transformers to >=4.48.0: `pip install 'transformers>=4.48.0'`
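Since the troubleshooting answer above hinges on the transformers version, here is a quick way to check a version string against the >=4.48.0 requirement. This is a pure-stdlib sketch that only handles plain dotted versions; in practice, `packaging.version` is the robust choice (it also handles suffixes like `.dev0`).

```python
def version_tuple(v: str):
    """Parses a dotted version string like '4.48.0' into comparable ints."""
    return tuple(int(part) for part in v.split("."))

def meets_requirement(installed: str, minimum: str = "4.48.0") -> bool:
    """True when the installed version satisfies the minimum requirement."""
    return version_tuple(installed) >= version_tuple(minimum)

print(meets_requirement("4.47.1"))  # False: too old for RWKV-7 support
print(meets_requirement("4.48.0"))  # True
```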
You will also find the code I use to quantize the models if you want to do it yourself: GGUFModelBuilder

💬 How to test: Choose an AI assistant type:
- `TurboLLM` (GPT-4.1-mini)
- `HugLLM` (Hugging Face open-source models)
- `TestLLM` (Experimental CPU-only)

What I'm Testing: I'm pushing the limits of small open-source models for AI network monitoring, specifically:
- Function calling against live network services
- How small can a model go while still handling:
  - Automated Nmap security scans
  - Quantum-readiness checks
  - Network monitoring tasks

🟡 TestLLM – Current experimental model (llama.cpp on 2 CPU threads on a Hugging Face Docker space):
- ✅ Zero-configuration setup
- ⏳ 30s load time (slow inference, but no API costs and no token limits, since the cost is low)
- 🔧 Help wanted! If you're into edge-device AI, let's collaborate!

Other Assistants

🟢 TurboLLM – Uses gpt-4.1-mini:
- It performs very well, but unfortunately OpenAI charges per token, so token usage is limited
- Create custom cmd processors to run .NET code on Quantum Network Monitor Agents
- Real-time network diagnostics and monitoring
- Security audits
- Penetration testing (Nmap/Metasploit)

🔵 HugLLM – Latest open-source models:
- 🌐 Runs on the Hugging Face Inference API. Performs pretty well using the latest models hosted on Novita.

💡 Example commands you could test:
1. `"Give me info on my website's SSL certificate"`
2. `"Check if my server is using quantum-safe encryption for communication"`
3. `"Run a comprehensive security audit on my server"`
4. `"Create a cmd processor to .. (whatever you want)"`

Note: you need to install a Quantum Network Monitor Agent to run the .NET code on. This is a very flexible and powerful feature. 
Feel free to use whatever you find helpful. If you appreciate the work, please consider buying me a coffee ☕. Your support helps cover service costs and allows me to raise token limits for everyone.
LFM2-1.2B-RAG-GGUF
EXAONE-Deep-2.4B-GGUF
Llama-3.2-1B-GGUF
MiniCPM4.1-8B-GGUF
This model was generated using llama.cpp at commit `1411d9275`, using the same selective tensor-precision ("layer bumping") quantization approach described above. Click here to get info on choosing the right GGUF model format

What's New
- [2025.09.29] The InfLLM-V2 paper is released! We can train a sparse attention model with only 5B long-text tokens. 🔥🔥🔥
- [2025.09.05] The MiniCPM4.1 series is released! This series is a hybrid reasoning model with trainable sparse attention, which can be used in both deep reasoning mode and non-reasoning mode. 🔥🔥🔥
- [2025.06.06] The MiniCPM4 series is released! This model achieves ultimate efficiency improvements while maintaining optimal performance at the same scale! It can achieve over 5x generation acceleration on typical end-side chips! You can find the technical report here. 🔥🔥🔥

Highlights: MiniCPM4.1 offers the following features:
- ✅ Strong reasoning capability: surpasses similar-sized models on 15 tasks!
- ✅ Fast generation: 3x decoding speedup for reasoning!
- ✅ Efficient architecture: trainable sparse attention, frequency-ranked speculative decoding!
- MiniCPM4.1-8B: The latest version of MiniCPM4, with 8B parameters, supporting fusion thinking. 
All MiniCPM4 series models:
- MiniCPM4-8B: The flagship model with 8B parameters, trained on 8T tokens
- MiniCPM4-0.5B: Lightweight version with 0.5B parameters, trained on 1T tokens
- MiniCPM4-8B-Eagle-FRSpec: Eagle head for FRSpec, accelerating speculative inference
- MiniCPM4-8B-Eagle-FRSpec-QAT-cpmcu: Eagle head with QAT for FRSpec, integrating speculation and quantization for ultra acceleration
- MiniCPM4-8B-Eagle-vLLM: Eagle head in vLLM format for speculative inference
- MiniCPM4-8B-marlin-Eagle-vLLM: Quantized Eagle head in vLLM format
- BitCPM4-0.5B: Extreme ternary quantization of MiniCPM4-0.5B, achieving 90% bit-width reduction
- BitCPM4-1B: Extreme ternary quantization of MiniCPM3-1B, achieving 90% bit-width reduction
- MiniCPM4-Survey: Generates trustworthy, long-form survey papers from user queries
- MiniCPM4-MCP: Integrates MCP tools to autonomously satisfy user requirements

Performance Evaluation: MiniCPM4.1 launches an end-side version at the 8B parameter scale, achieving best-in-class performance in its category.

Best Practices
1. It is advisable to use `temperature=0.9` and `top_p=0.95`, and we suggest setting the maximum output length to 65,536 tokens.
2. For math problems, we recommend using "Please reason step by step, and put your final answer within \boxed{}."
3. For English multiple-choice questions, we recommend starting with "Answer the following multiple choice question. The last line of your response should be of the following format: 'ANSWER: $LETTER' (without quotes) where LETTER is one of ABCD. Think step by step before answering." For Chinese multiple-choice questions, use "你回答的最后一行必须是以下格式 '答案:$选项' (不带引号), 其中选项是ABCD之一。请在回答之前一步步思考" (the Chinese equivalent of the English instruction above).

Efficiency Evaluation: MiniCPM4.1 adopts sparse attention and speculative decoding to improve inference efficiency. On an RTX 4090, MiniCPM4.1 achieves a 3x decoding speed improvement in reasoning.

Usage: MiniCPM4.1 can be used with the following frameworks: Hugging Face Transformers, SGLang, vLLM, and CPM.cu. 
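The decoding settings recommended in the best practices above can be collected into generation kwargs; a minimal sketch, assuming the common Hugging Face `generate()` parameter names (exact names may differ per inference framework):

```python
# Recommended MiniCPM4.1 decoding settings from the best practices
# above, expressed as Hugging Face-style generation kwargs (a sketch;
# parameter names are assumptions, values come from the model card).
gen_kwargs = {
    "do_sample": True,        # sampling must be on for temperature/top_p
    "temperature": 0.9,       # recommended sampling temperature
    "top_p": 0.95,            # recommended nucleus-sampling threshold
    "max_new_tokens": 65536,  # suggested maximum output length
}
```

These would typically be passed as `model.generate(**inputs, **gen_kwargs)`.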
For the ultimate inference speed, we highly recommend CPM.cu. MiniCPM4/MiniCPM4.1 supports both dense and sparse attention inference modes; vLLM and SGLang currently support only dense inference. If you want to use sparse inference mode, please use Hugging Face Transformers or CPM.cu.
- Dense attention inference: vLLM, SGLang, Hugging Face Transformers
- Sparse attention inference: Hugging Face Transformers, CPM.cu

To facilitate research on sparse attention, we provide InfLLM-V2 Training Kernels and InfLLM-V2 Inference Kernels.

Inference with Transformers: MiniCPM4.1-8B requires `transformers>=4.56`.

- Inference with Sparse Attention: MiniCPM4.1-8B supports `InfLLM v2`, a sparse attention mechanism designed for efficient long-sequence inference. It requires the `infllmv2_cuda_impl` library. To enable InfLLM v2, you need to add the `sparse_config` field to `config.json`. These parameters control the behavior of InfLLM v2:
  - `kernel_size` (default: 32): The size of semantic kernels.
  - `kernel_stride` (default: 16): The stride between adjacent kernels.
  - `init_blocks` (default: 1): The number of initial blocks that every query token attends to. This ensures attention to the beginning of the sequence.
  - `block_size` (default: 64): The block size for key-value blocks.
  - `window_size` (default: 2048): The size of the local sliding window.
  - `topk` (default: 64): Specifies that each token computes attention with only the top-k most relevant key-value blocks.
  - `use_nope` (default: false): Whether to use the NOPE technique in block selection for improved performance.
  - `dense_len` (default: 8192): Since sparse attention offers limited benefits for short sequences, the model can use standard (dense) attention for shorter texts. The model will use dense attention for sequences with a token length below `dense_len` and switch to sparse attention for sequences exceeding this length. 
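Putting the defaults above together, the `config.json` addition might look like the following. This is a minimal sketch assuming the field names and default values listed above; verify against the official MiniCPM4.1 repository before use:

```json
{
  "sparse_config": {
    "kernel_size": 32,
    "kernel_stride": 16,
    "init_blocks": 1,
    "block_size": 64,
    "window_size": 2048,
    "topk": 64,
    "use_nope": false,
    "dense_len": 8192
  }
}
```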
Set this to `-1` to always use sparse attention regardless of sequence length.

- Long Context Extension: MiniCPM4.1 natively supports context lengths of up to 65,536 (64K) tokens. For conversations where the total length (including both input and output) significantly exceeds this limit, we recommend RoPE scaling techniques for effective handling of long texts. We have validated the model's performance on context lengths of up to 131,072 tokens by modifying the LongRoPE factor. You can apply the LongRoPE factor modification by editing the model files; specifically, adjust the `rope_scaling` field in `config.json`.

Inference with SGLang: You can run inference with SGLang in standard mode or speculative decoding mode. For accelerated inference with speculative decoding, follow these steps. The EAGLE3 adaptation PR has been submitted; for now, use our repository for installation. Start the SGLang server with speculative decoding enabled. The client usage remains the same for both standard and speculative decoding. Note: make sure to update the port number in the client code to match the server port (30002 in the speculative decoding example).
- `--speculative-algorithm EAGLE3`: Enables EAGLE3 speculative decoding
- `--speculative-draft-model-path`: Path to the draft model for speculation
- `--speculative-num-steps`: Number of speculative steps (default: 3)
- `--speculative-eagle-topk`: Top-k parameter for EAGLE (default: 1)
- `--speculative-num-draft-tokens`: Number of draft tokens (default: 32)
- `--mem-fraction-static`: Memory fraction for static allocation (default: 0.9)

For standard mode, you need to install our forked version of SGLang, start the inference server, and then use the chat interface.

Inference with vLLM: You can run inference with vLLM in standard mode or speculative decoding mode. 
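The `rope_scaling` edit for LongRoPE-based context extension might look roughly like the following sketch. The field names here are assumptions, and the validated LongRoPE factor lists ship with the official MiniCPM4.1 repository; they are deliberately left empty rather than guessed:

```python
# Hypothetical sketch of the `rope_scaling` field in config.json for
# LongRoPE context extension (field names assumed; the validated
# LongRoPE factors come from the official MiniCPM4.1 repo and are
# elided here, not invented).
rope_scaling = {
    "rope_type": "longrope",
    "long_factor": [],    # fill in the validated LongRoPE factors
    "short_factor": [],   # fill in the validated LongRoPE factors
    "original_max_position_embeddings": 65536,  # native 64K context
}
```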
For accelerated inference with speculative decoding using vLLM, follow these steps. First, download the MiniCPM4.1 draft model and change `architectures` in its config.json to `LlamaForCausalLM`. The EAGLE3 vLLM PR has been submitted; for now, use our repository for installation. Start the vLLM inference server with speculative decoding enabled, making sure to update the model path in the speculative config to point to your downloaded MiniCPM4_1-8B-Eagle3-bf16 folder. The client usage remains the same for both standard and speculative decoding.
- `VLLM_USE_V1=1`: Enables the vLLM v1 API
- `--speculative-config`: JSON configuration for speculative decoding
  - `model`: Path to the draft model for speculation
  - `num_speculative_tokens`: Number of speculative tokens (default: 3)
  - `method`: Speculative decoding method (eagle3)
  - `draft_tensor_parallel_size`: Tensor parallel size for the draft model (default: 1)
- `--seed`: Random seed for reproducibility
- `--trust-remote-code`: Allow execution of remote code for custom models

For standard mode, you need to install the latest version of vLLM and start the inference server.
> Note: In vLLM's chat API, `add_special_tokens` is `False` by default. This means important special tokens, such as the beginning-of-sequence (BOS) token, will not be added automatically. To ensure the input prompt is correctly formatted for the model, you should explicitly set `extra_body={"add_special_tokens": True}`.

Inference with CPM.cu: We recommend CPM.cu for inference with MiniCPM4 and MiniCPM4.1. CPM.cu is a CUDA inference framework developed by OpenBMB that integrates efficient sparse attention, speculative sampling, and quantization techniques, fully leveraging the efficiency advantages of MiniCPM4 and MiniCPM4.1. MiniCPM4.1 natively supports context lengths of up to 65,536 (64K) tokens. 
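The note about `add_special_tokens` can be illustrated by the request payload sent to a vLLM OpenAI-compatible server; a minimal sketch (the helper and the model id are illustrative, not part of the card):

```python
# Hypothetical helper: build an OpenAI-compatible chat request for a
# vLLM server, explicitly re-enabling special tokens as the note
# above recommends (vLLM's chat API defaults add_special_tokens to
# False, so the BOS token would otherwise be omitted).
def build_chat_request(prompt, model="openbmb/MiniCPM4.1-8B"):
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        # Passed through to vLLM as extra, non-OpenAI request fields.
        "extra_body": {"add_special_tokens": True},
    }

req = build_chat_request("Why is the sky blue?")
```

With the `openai` Python client, the same dict would be spread into `client.chat.completions.create(**req)`.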
To reproduce the long-text acceleration effect in the paper, we recommend using the validated LongRoPE factors: change the `rope_scaling` field in `config.json` to enable LongRoPE. After modification, you can run the provided script to reproduce the long-context acceleration effect (the script will automatically download the model weights from Hugging Face), and you can also run inference with the EAGLE3 speculative decoding algorithm. For more details about CPM.cu, please refer to the CPM.cu repo.

Inference with llama.cpp and Ollama: We also support inference with llama.cpp and Ollama. You can download the GGUF format of the MiniCPM4.1-8B model from Hugging Face and run it with llama.cpp for efficient CPU or GPU inference. For Ollama, please refer to the model hub for model download; after installing the Ollama package, you can use MiniCPM4.1 with the usual commands.

Hybrid Reasoning Mode: MiniCPM4.1 supports a hybrid reasoning mode, usable in both deep reasoning mode and non-reasoning mode. Users can set `enable_thinking=True` in `tokenizer.apply_chat_template` to enable reasoning mode, and `enable_thinking=False` for non-reasoning mode. Similarly, users can directly add `/no_think` at the end of the query to enable non-reasoning mode. If no special token is added, or `/think` is appended to the query, the model will enable reasoning mode.

Statement
- As a language model, MiniCPM generates content by learning from a vast amount of text.
- However, it does not possess the ability to comprehend or express personal opinions or value judgments.
- Any content generated by MiniCPM does not represent the viewpoints or positions of the model developers.
- Therefore, when using content generated by MiniCPM, users should take full responsibility for evaluating and verifying it on their own.

LICENSE: This repository and the MiniCPM models are released under the Apache-2.0 License.

Citation: Please cite our paper if you find our work valuable. 
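The query-suffix control for hybrid reasoning can be sketched as a tiny helper (the function is illustrative; the `/think` and `/no_think` suffixes follow the description above, with underscores restored on the assumption they were stripped during extraction):

```python
# Hedged sketch of MiniCPM4.1's hybrid-reasoning query suffixes:
# append "/no_think" to force non-reasoning mode, "/think" (or
# nothing) to enable reasoning mode. Helper name is illustrative.
def with_reasoning_mode(query: str, think: bool) -> str:
    """Append the reasoning-mode control suffix to a user query."""
    return f"{query} {'/think' if think else '/no_think'}"

fast = with_reasoning_mode("Summarize this paragraph.", think=False)
deep = with_reasoning_mode("Prove the inequality.", think=True)
```

The same effect is available programmatically via `enable_thinking` in `tokenizer.apply_chat_template`.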
granite-4.0-micro-GGUF
📣 Update [10-07-2025]: Added a default system prompt to the chat template to guide the model towards more professional, accurate, and safe responses.

Model Summary: Granite-4.0-Micro is a 3B-parameter long-context instruct model finetuned from Granite-4.0-Micro-Base using a combination of permissively licensed open-source instruction datasets and internally collected synthetic datasets. The model is developed using a diverse set of techniques with a structured chat format, including supervised finetuning, model alignment using reinforcement learning, and model merging. Granite 4.0 instruct models feature improved instruction following (IF) and tool-calling capabilities, making them more effective in enterprise applications.
- Developers: Granite Team, IBM
- HF Collection: Granite 4.0 Language Models HF Collection
- GitHub Repository: ibm-granite/granite-4.0-language-models
- Website: Granite Docs
- Release Date: October 2nd, 2025
- License: Apache 2.0

Supported Languages: English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese. Users may finetune Granite 4.0 models for languages beyond these twelve.

Intended use: The model is designed to follow general instructions and can serve as the foundation for AI assistants across diverse domains, including business applications, as well as for LLM agents equipped with tool-use capabilities.

Capabilities
- Summarization
- Text classification
- Text extraction
- Question-answering
- Retrieval Augmented Generation (RAG)
- Code-related tasks
- Function-calling tasks
- Multilingual dialog use cases
- Fill-In-the-Middle (FIM) code completions

Generation: This is a simple example of how to use the Granite-4.0-Micro model. Copy the snippet from the section that is relevant for your use case. 
Tool-calling: Granite-4.0-Micro comes with enhanced tool-calling capabilities, enabling seamless integration with external functions and APIs. To define a list of tools, please follow OpenAI's function definition schema. This is an example of how to use the Granite-4.0-Micro model's tool-calling ability.

Benchmarks: results are reported per metric for Micro (Dense), H Micro (Dense), H Tiny (MoE), and H Small (MoE); the score values are omitted in this extract. Multilingual benchmarks and their included languages:
- MMMLU (11): ar, de, en, es, fr, ja, ko, pt, zh, bn, hi
- INCLUDE (14): hi, bn, ta, te, ar, de, es, fr, it, ja, ko, nl, pt, zh

Model Architecture: The Granite-4.0-Micro baseline is built on a decoder-only dense transformer architecture. Core components of this architecture are: GQA, RoPE, MLP with SwiGLU, RMSNorm, and shared input/output embeddings.

| Model | Micro (Dense) | H Micro (Dense) | H Tiny (MoE) | H Small (MoE) |
|---|---|---|---|---|
| Number of layers | 40 attention | 4 attention / 36 Mamba2 | 4 attention / 36 Mamba2 | 4 attention / 36 Mamba2 |
| MLP / shared-expert hidden size | 8192 | 8192 | 1024 | 1536 |

Training Data: Overall, our SFT data is largely comprised of three key sources: (1) publicly available datasets with permissive licenses, (2) internal synthetic data targeting specific capabilities, and (3) a select set of human-curated data.

Infrastructure: We trained the Granite 4.0 language models on an NVIDIA GB200 NVL72 cluster hosted by CoreWeave. Intra-rack communication occurs via the 72-GPU NVLink domain, and a non-blocking, full Fat-Tree NDR 400 Gb/s InfiniBand network provides inter-rack communication. This cluster provides a scalable and efficient infrastructure for training our models over thousands of GPUs.

Ethical Considerations and Limitations: Granite 4.0 instruct models are primarily finetuned using instruction-response pairs, mostly in English, but also multilingual data covering multiple languages. 
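A single tool definition following OpenAI's function definition schema, as referenced above, might look like this (the function name and fields are illustrative, not taken from the Granite examples):

```python
# Hedged sketch of one tool definition in OpenAI's function
# definition schema; the weather function is a made-up example.
get_weather_tool = {
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
            },
            "required": ["city"],
        },
    },
}

# The model is then prompted with the list of available tools.
tools = [get_weather_tool]
```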
Although this model can handle multilingual dialog use cases, its performance on non-English tasks might not match its English performance. In such cases, introducing a small number of examples (few-shot) can help the model generate more accurate outputs. While this model has been aligned with safety in mind, it may in some cases produce inaccurate, biased, or unsafe responses to user prompts. We therefore urge the community to use this model with proper safety testing and tuning tailored to their specific tasks.

Resources
- ⭐️ Learn about the latest updates with Granite: https://www.ibm.com/granite
- 📄 Get started with tutorials, best practices, and prompt engineering advice: https://www.ibm.com/granite/docs/
- 💡 Learn about the latest Granite learning resources: https://ibm.biz/granite-learning-resources
typhoon-ocr-7b-GGUF
phi-4-GGUF
Mistral-Small-3.2-24B-Instruct-2506-GGUF
granite-docling-258M-GGUF
Granite Docling is a multimodal image-text-to-text model engineered for efficient document conversion. It preserves the core features of Docling while maintaining seamless integration with DoclingDocuments to ensure full compatibility. Granite Docling 258M builds upon the Idefics3 architecture, but introduces two key modifications: it replaces the vision encoder with siglip2-base-patch16-512 and substitutes the language model with a Granite 165M LLM. Try out our Granite-Docling-258M demo today.
- Developed by: IBM Research
- Model type: Multi-modal model (image+text-to-text)
- Language(s): English (NLP)
- License: Apache 2.0
- Release Date: September 17, 2025

Granite-docling-258M is fully integrated into the Docling pipelines, carrying over existing features while introducing a number of powerful new features, including:
- 🔢 Enhanced Equation Recognition: More accurate detection and formatting of mathematical formulas
- 🧩 Flexible Inference Modes: Choose between full-page inference and bbox-guided region inference
- 🧘 Improved Stability: Tends to avoid infinite loops more effectively
- 🧮 Enhanced Inline Equations: Better inline math recognition
- 🧾 Document Element QA: Answer questions about a document's structure, such as the presence and order of document elements
- 🌍 Japanese, Arabic and Chinese support (experimental)

The easiest way to use this model is through the 🐥 Docling library. It will automatically download this model and convert documents to various formats for you. 
Install the latest version of `docling` through pip, then use the Docling CLI, or set this model up within the Docling SDK. Alternatively, you can use bare transformers, vllm, onnx or mlx-vlm to perform inference, and the docling-core APIs to convert the results to a variety of output formats (md, html, etc.): 📄 single-page image inference using plain 🤗 transformers 🤖, or 💻 local inference on Apple Silicon with MLX (see here). ℹ️ If you run into trouble running granite-docling with the code above, check the troubleshooting section at the bottom ⬇️.

Intended Use: Granite-Docling is designed to complement the Docling library, not replace it. It integrates as a component within the larger Docling library, consolidating the functions of multiple single-purpose models into a single, compact VLM. However, Granite-Docling is not intended for general image understanding. For tasks focused solely on image-text input, we recommend using Granite Vision models, which are purpose-built and optimized for image-text processing.

Evaluations: A comprehensive discussion of evaluation methods and findings has already been presented in our previous publication [citation]. As this model is an update, we refer readers to that work for additional details. The evaluation can be performed using the docling-eval framework for the document-related tasks, and lmms-eval for MMStar and OCRBench.

| | Edit-distance ↓ | F1 ↑ | Precision ↑ | Recall ↑ | BLEU ↑ | Meteor ↑ |
|---|---|---|---|---|---|---|
| smoldocling-256m-preview | 0.48 | 0.80 | 0.89 | 0.79 | 0.58 | 0.67 |

| | Edit-distance ↓ | F1 ↑ | Precision ↑ | Recall ↑ | BLEU ↑ | Meteor ↑ |
|---|---|---|---|---|---|---|
| smoldocling-256m-preview | 0.114 | 0.915 | 0.94 | 0.909 | 0.875 | 0.889 |
| granite-docling-258m | 0.013 | 0.988 | 0.99 | 0.988 | 0.983 | 0.986 |

| | Edit-distance ↓ | F1 ↑ | Precision ↑ | Recall ↑ | BLEU ↑ | Meteor ↑ |
|---|---|---|---|---|---|---|
| smoldocling-256m-preview | 0.119 | 0.947 | 0.959 | 0.941 | 0.824 | 0.878 |
| granite-docling-258m | 0.073 | 0.968 | 0.968 | 0.969 | 0.893 | 0.927 |

Table: Convert table to OTSL. 
(Lysak et al., 2023): `<otsl>`

Actions and Pipelines:
- OCR the text in a specific location: `<loc155><loc233><loc206><loc237>`
- Identify element at: `<loc247><loc482><loc252><loc486>`
- Find all 'text' elements on the page, retrieve all section headers.

Model Architecture: The architecture of granite-docling-258m consists of the following components: (2) vision-language connector: a pixel-shuffle projector (as in Idefics3). We built upon Idefics3 to train our model. We incorporated DocTags into our LLM's supervised fine-tuning (SFT) data to help the model become familiar with the format, enabling faster convergence and mitigating issues previously observed with SmolDocling. The model was trained using the nanoVLM framework, which provides a lightweight and efficient training setup for vision-language models.

Training Data: Our training corpus consists of two principal sources: (1) publicly available datasets and (2) internally constructed synthetic datasets designed to elicit specific document understanding capabilities.
- SynthCodeNet — a large-scale collection of synthetically rendered code snippets spanning over 50 programming languages
- SynthFormulaNet — a dataset of synthetic mathematical expressions paired with ground-truth LaTeX representations
- SynthChartNet — synthetic chart images annotated with structured table outputs
- DoclingMatix — a curated corpus of real-world document pages sampled from diverse domains

Infrastructure: We train granite-docling-258m using IBM's supercomputing cluster, Blue Vela, which is outfitted with NVIDIA H100 GPUs. This cluster provides a scalable and efficient infrastructure for training our models over thousands of GPUs.

Responsible Use and Limitations: Some use cases for vision-language models can trigger certain risks and ethical considerations, including but not limited to: bias and fairness, misinformation, and autonomous decision-making. 
Although our alignment processes include safety considerations, the model may in some cases produce inaccurate, biased, offensive or unwanted responses to user prompts. Additionally, it remains uncertain whether smaller models exhibit increased susceptibility to hallucination due to their reduced size, which could limit their ability to generate coherent and contextually accurate responses. This aspect is currently an active area of research, and we anticipate more rigorous exploration, comprehension, and mitigation in this domain. We urge the community to use granite-docling-258m in a responsible way and to avoid any malicious use. We recommend using this model only as part of the Docling library; more general vision tasks may pose higher inherent risks of triggering unwanted output. To enhance safety, we recommend using granite-docling-258m alongside Granite Guardian, a fine-tuned instruct model designed to detect and flag risks in prompts and responses across the key dimensions outlined in the IBM AI Risk Atlas. Its training, which includes both human-annotated and synthetic data informed by internal red-teaming, enables it to outperform similar open-source models on standard benchmarks, providing an additional layer of safety.

Resources
- ⭐️ Learn about the latest updates with Docling: https://docling-project.github.io/docling/#features
- 🚀 Get started with Docling concepts, integrations and tutorials: https://docling-project.github.io/docling/gettingstarted/
- 💡 Learn about the latest Granite learning resources: https://ibm.biz/granite-learning-resources
- 🖥️ Learn more about how to use Granite-Docling, explore the Docling library, and see what's coming next for Docling in the release blog: https://ibm.com/new/announcements/granite-docling-end-to-end-document-conversion

Troubleshooting
1. You receive `AttributeError: 'LlamaModel' object has no attribute 'wte'` when launching the model through vLLM. 
Current versions of vLLM (including 0.10.2) have limited support for tied weights as used in granite-docling, and loading breaks. We provide a version with untied weights on the `untied` branch of this model repo; to use the untied version, pass the `revision` argument to vLLM.
2. The model outputs only exclamation marks (i.e. "!!!!!!!!!!!!!!!"). This is seen on older NVIDIA GPUs, such as the T4 available in Google Colab, which lack support for the `bfloat16` format. You can work around it by setting the `dtype` to `float32`.
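Selecting the untied branch through vLLM's Python API might look like the following sketch (the repo id is assumed; the actual load is commented out since it needs a GPU with vLLM installed):

```python
# Hypothetical sketch: loading the untied-weights branch by passing
# `revision` to vLLM. Repo id is an assumption; adjust as needed.
load_kwargs = {
    "model": "ibm-granite/granite-docling-258M",
    "revision": "untied",  # branch with untied weights (see above)
}
# from vllm import LLM
# llm = LLM(**load_kwargs)
```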
osmosis-mcp-4b-GGUF
Qwen2.5-0.5B-Instruct-GGUF
Olmo-3-7B-Instruct-GGUF
phi-2-GGUF
GLM-4.1V-9B-Thinking-GGUF
DeepCoder-14B-Preview-GGUF
Llama-3.1-Nemotron-Nano-4B-v1.1-GGUF
kanana-1.5-8b-instruct-2505-GGUF
Qwen3-30B-A6B-16-Extreme-GGUF
Qwen3-8B-GGUF
Llama-3_3-Nemotron-Super-49B-v1-GGUF
rwkv7-191M-world-GGUF
trlm-135m-GGUF
Qwen3-0.6B-GGUF
granite-3.0-1b-a400m-base-GGUF
Qwen3-4B-Instruct-2507-GGUF
granite-3.3-8b-instruct-GGUF
Mistral-Small-3.1-24B-Instruct-2503-GGUF
SWE-Dev-32B-GGUF
This model was generated using llama.cpp at commit `0a5a3b5c`, using the same selective tensor-precision ("layer bumping") quantization approach described above. Click here to get info on choosing the right GGUF model format
- 🤗 SWE-Dev-7B (Qwen-2.5-Coder-7B-Instruct)
- 🤗 SWE-Dev-9B (GLM-4-9B-Chat)
- 🤗 SWE-Dev-32B (Qwen-2.5-Coder-32B-Instruct)
- 🤗 SWE-Dev-train (Training Data)

🚀 SWE-Dev is an open-source agent for software engineering tasks! This repository contains the SWE-Dev-32B model presented in the paper SWE-Dev: Building Software Engineering Agents with Training and Inference Scaling. 💡 We develop a comprehensive pipeline for creating developer-oriented datasets from GitHub repositories, including issue tracking, code localization, test case generation, and evaluation. 🔧 Built on open-source frameworks (OpenHands) and models, SWE-Dev-7B and 32B achieved solve rates of 23.4% and 36.6% on SWE-bench-Verified, respectively, approaching the performance of GPT-4o. 📚 We find that training-data scaling and inference scaling can both effectively boost model performance on SWE-bench; moreover, higher data quality further improves this trend when combined with reinforcement fine-tuning (RFT). For inference scaling specifically, the solve rate of SWE-Dev increased from 34.0% at 30 rounds to 36.6% at 75 rounds. 
Arch-Router-1.5B-GGUF
This model was generated using llama.cpp at commit `73e53dc8`.

Overview

With the rapid proliferation of large language models (LLMs) -- each optimized for different strengths, styles, or latency/cost profiles -- routing has become an essential technique for operationalizing the use of different models. However, existing LLM routing approaches are limited in two key ways: they evaluate performance using benchmarks that often fail to capture human preferences driven by subjective evaluation criteria, and they typically select from a limited pool of models. We introduce a preference-aligned routing framework that guides model selection by matching queries to user-defined domains (e.g., travel) or action types (e.g., image editing) -- offering a practical mechanism to encode preferences in routing decisions. Specifically, we introduce Arch-Router, a compact 1.5B model that learns to map queries to domain-action preferences for model routing decisions. Experiments on conversational datasets demonstrate that our approach achieves state-of-the-art (SOTA) results in matching queries with human preferences, outperforming top proprietary models. This model is described in the paper https://arxiv.org/abs/2506.16655 and powers Arch, the open-source AI-native proxy for agents, to enable preference-based routing in a seamless way.

To support effective routing, Arch-Router introduces two key concepts:
- Domain – the high-level thematic category or subject matter of a request (e.g., legal, healthcare, programming).
- Action – the specific type of operation the user wants performed (e.g., summarization, code generation, booking an appointment, translation).

Both domain and action configs are associated with preferred models or model variants. At inference time, Arch-Router analyzes the incoming prompt to infer its domain and action using semantic similarity, task indicators, and contextual cues.
It then applies the user-defined routing preferences to select the model best suited to handle the request.

- Structured Preference Routing: Aligns prompt requests with model strengths using explicit domain–action mappings.
- Transparent and Controllable: Makes routing decisions transparent and configurable, empowering users to customize system behavior.
- Flexible and Adaptive: Supports evolving user needs, model updates, and new domains/actions without retraining the router.
- Production-Ready Performance: Optimized for low-latency, high-throughput applications in multi-model environments.

Requirements

The code for Arch-Router-1.5B is included in the Hugging Face `transformers` library, and we advise you to install the latest version.

How to use

We use the following example to illustrate how to use our model to perform routing tasks. Please note that our model works best with our provided prompt format.

Quickstart

Then you should be able to see the following output string in JSON format:

To better understand how to create the route descriptions, please take a look at our Katanemo API.

License

The Katanemo Arch-Router model is distributed under the Katanemo license.
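The domain–action preference mechanism above can be sketched in plain Python. This is purely illustrative: the JSON field names and the model names in the mapping are assumptions for the sketch, not Katanemo's actual schema or config format.

```python
import json

# Hypothetical user-defined preferences: (domain, action) -> preferred model.
ROUTE_PREFERENCES = {
    ("travel", "booking appointment"): "model-a",
    ("programming", "code generation"): "model-b",
}
DEFAULT_MODEL = "general-purpose-model"


def select_model(router_output: str) -> str:
    """Map the router's JSON decision to the user's preferred model.

    `router_output` is assumed to look like:
    {"domain": "travel", "action": "booking appointment"}
    """
    decision = json.loads(router_output)
    key = (decision.get("domain"), decision.get("action"))
    # Fall back to a default model when no preference matches
    # (the fallback behaviour here is also an assumption).
    return ROUTE_PREFERENCES.get(key, DEFAULT_MODEL)
```

In Arch itself, these mappings live in the proxy configuration rather than in application code; the sketch only shows the shape of the decision step.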
DMind-1-mini-GGUF
functionary-v4r-small-preview-GGUF
Homunculus-GGUF
TildeOpen-30b-GGUF
This model was generated using llama.cpp at commit `408ff524`.

Developed by: Tilde.ai
Funded by: European Commission via the EuroHPC JU Large AI Grand Challenge
Model type: A 30B parameter dense decoder-only transformer
Languages: Albanian, Bosnian, Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Hungarian, Icelandic, Irish, Italian, Latgalian, Latvian, Lithuanian, Macedonian, Maltese, Montenegrin, Norwegian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovene, Spanish, Swedish, Turkish, Ukrainian, as well as mathematical proofs, programming code, and XML documents containing translation data
License: CC-BY-4.0

Mission statement

TildeOpen LLM is an open-source foundational language model built to serve underrepresented Nordic and Eastern European languages. Developed with European Commission funding and trained on the LUMI supercomputer, this 30B+ parameter model addresses the performance gaps that speakers of 19 focus languages—representing over 165 million people—face with existing AI systems.
The model employs an equitable tokeniser and curriculum-learning approach to ensure fair representation across less-resourced languages, moving beyond the typical English-centric design of most language models. As an open-source project, TildeOpen LLM enables transparent research and community-driven development while maintaining European technological independence. This foundational model is not yet adapted to follow instructions or aligned with safety features. The next version, built on top of this model, will be a specialised translation model, leveraging TildeOpen LLM's multilingual foundation to provide high-quality translation capabilities across the supported European language pairs.

Model training details

We train TildeOpen LLM using Tilde's branch of EleutherAI's open-source GPT-NeoX framework on the LUMI supercomputer's 768 AMD MI250X GPUs. The foundational model training involves 450,000 updates with a constant batch size of 4,718,592 tokens, using a constant learning rate followed by a cooldown phase across 2 trillion tokens. Training consists of three distinct data sampling phases. First, all languages are sampled uniformly to ensure equal representation. Second, languages are sampled according to their natural distribution so that the model sees as much data as possible from languages with larger speaker bases. Finally, we return to uniform sampling across all languages. This three-phase approach ensures TildeOpen LLM develops balanced multilingual capabilities while maintaining strong performance across all target languages, particularly the underrepresented European languages.
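The three sampling phases can be illustrated with a small sketch. This is my own illustration of the idea, not Tilde's training code, and the corpus sizes in the example are made up.

```python
def phase_weights(corpus_tokens: dict[str, int], phase: int) -> dict[str, float]:
    """Per-language sampling weights for the three phases described above:
    phases 1 and 3 sample languages uniformly; phase 2 samples them in
    proportion to their natural data size."""
    if phase in (1, 3):
        raw = {lang: 1.0 for lang in corpus_tokens}          # uniform
    elif phase == 2:
        raw = {lang: float(n) for lang, n in corpus_tokens.items()}  # natural
    else:
        raise ValueError("phase must be 1, 2, or 3")
    total = sum(raw.values())
    return {lang: w / total for lang, w in raw.items()}


# Example: a large and a small language corpus (invented numbers).
sizes = {"en": 900_000, "lv": 100_000}
print(phase_weights(sizes, 1))  # uniform: {'en': 0.5, 'lv': 0.5}
print(phase_weights(sizes, 2))  # natural: {'en': 0.9, 'lv': 0.1}
```

The uniform phases keep small languages from being drowned out, while the natural-distribution phase lets high-resource languages contribute proportionally more data.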
| Parameter | Value |
|-----------|-------|
| Sequence Length | 8192 |
| Number of Layers | 60 |
| Embedding Size | 6144 |
| FFN Hidden Size | 21504 |
| Number of Heads | 48 |
| Number of KV Heads (GQA) | 8 |
| Activation Function | SwiGLU |
| Position Encodings | RoPE |
| Layer Norm | RMSNorm |
| Embedding Parameters | 8.05E+08 |
| LM Head Parameters | 8.05E+08 |
| Non-embedding Parameters | 2.91E+10 |
| Total Parameters | 3.07E+10 |

Tokeniser details

We built the TildeOpen LLM tokeniser to ensure equitable representation across languages. Technically, we trained the tokeniser to represent the same text, regardless of the language it is written in, using a similar number of tokens. In practice, TildeOpen LLM will be more efficient and faster than other models for our focus languages, as writing out answers will require fewer steps. For more details on how TildeOpen LLM compares against other models, see TILDE Bench!

Running the model using HF transformers

When loading the tokeniser, you must set .

Evaluation

Per-Character Perplexity

What is Perplexity? Perplexity measures how well a language model predicts text. A model with low perplexity makes accurate predictions consistently, while high perplexity means the model is frequently "surprised" by unexpected words or patterns. Lower perplexity indicates the model has learned language patterns more effectively; it's less "surprised" by what it encounters because it better understands how the language works. Perplexity fairly evaluates how well each model handles:
- Spelling accuracy across a diverse vocabulary
- Grammar rules that span multiple words
- Sentence structure and flow
- Language-specific patterns (how different languages form plurals or compound words)

Why Character-Level? Different language models use different internal vocabularies: some break text into whole words, others into word fragments, and some into individual characters. This makes direct comparison difficult.
Character-level perplexity creates a standardised comparison by calculating how well each model would theoretically perform if we measured their predictions character by character. We're not changing how the models work; instead, we use a mathematical conversion to approximate their character-level performance based on their predictions.

Why does this Matter? Models with lower perplexity generally perform better on real-world tasks like text generation, translation, and understanding context. It's a reliable indicator of overall language competency across different applications.

What data did we use? We use WMT24++, as it is a multilingual, language-parallel evaluation set that none of the models have seen during training. WMT24++ is a composite of texts from news, literature, speech, and social media; thus, it is suitable for foundational model benchmarking.

| Language | TildeOpen 30b | Gemma 2 27b | EuroLLM 22B Prev. | ALIA 40B |
|-----------|---------|------------|--------|--------|
| Bulgarian | 2.0539 | 2.2184 | 2.1985 | 2.1336 |
| Czech | 2.1579 | 2.3522 | 2.3221 | 2.2719 |
| Danish | 2.003 | 2.1517 | 2.1353 | 2.0805 |
| German | 1.8769 | 1.9285 | 1.9452 | 1.904 |
| English | 2.0378 | 1.9525 | 2.0568 | 2.0261 |
| Spanish | 1.9503 | 1.9752 | 2.0145 | 1.9369 |
| Estonian | 2.1711 | 2.5747 | 2.3852 | 2.325 |
| Finnish | 2.0497 | 2.288 | 2.2388 | 2.1831 |
| French | 1.8978 | 1.9355 | 1.9282 | 1.9084 |
| Croatian | 2.1147 | 2.544 | 2.4905 | 2.2433 |
| Hungarian | 2.0539 | 2.2228 | 2.2256 | 2.1635 |
| Icelandic | 2.0873 | 3.0329 | 4.7908 | 3.957 |
| Italian | 1.9565 | 2.0137 | 2.0098 | 1.9887 |
| Lithuanian | 2.1247 | 2.4175 | 2.3137 | 2.3075 |
| Latvian | 2.1439 | 2.5355 | 2.3141 | 2.3276 |
| Dutch | 1.9333 | 2.0312 | 2.0079 | 1.9904 |
| Norwegian | 2.1284 | 2.2862 | 2.3506 | 2.2253 |
| Polish | 2.0241 | 2.1294 | 2.0803 | 2.0803 |
| Portuguese | 1.9899 | 2.0597 | 2.0272 | 2.0187 |
| Romanian | 2.0196 | 2.1606 | 2.1641 | 2.1114 |
| Russian | 2.0424 | 2.09 | 2.1095 | 2.0871 |
| Slovak | 2.1192 | 2.338 | 2.3029 | 2.2609 |
| Slovenian | 2.1556 | 2.4443 | 2.3398 | 2.2589 |
| Serbian | 2.2469 | 2.6351 | 4.2471 | 2.3743 |
| Swedish | 2.041 | 2.1809 | 2.1464 | 2.1211 |
| Turkish | 2.0997 | 2.247 | 2.2202 | 2.232 |
| Ukrainian | 2.1376 | 2.2665 | 2.2691 | 2.2086 |
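The token-to-character conversion behind these scores can be sketched as follows. This is the standard derivation (per-character perplexity is the exponential of the average per-character negative log-likelihood), not TildeOpen's exact evaluation script.

```python
import math


def char_ppl_from_nll(total_nll_nats: float, num_chars: int) -> float:
    """Per-character perplexity from the summed token-level negative
    log-likelihood (in nats) of a text: exp(NLL / #chars)."""
    return math.exp(total_nll_nats / num_chars)


def char_ppl_from_token_ppl(token_ppl: float, num_tokens: int, num_chars: int) -> float:
    """Equivalent conversion from token-level perplexity:
    ppl_char = ppl_tok ** (#tokens / #chars)."""
    return token_ppl ** (num_tokens / num_chars)
```

Because the exponent is tokens per character, a tokeniser that needs fewer tokens to cover the same text is not unfairly rewarded or penalised, which is what makes the cross-model comparison in the table meaningful.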
MedScholar-1.5B-GGUF
This model was generated using llama.cpp at commit `66625a59`.

MedScholar-1.5B is a compact, instruction-aligned medical question-answering model fine-tuned on 1 million randomly selected examples from the MIRIAD-4.4M dataset. It is based on the Qwen/Qwen2.5-1.5B-Instruct model and is designed for efficient, in-context clinical knowledge exploration, not diagnosis.

- Base Model: Qwen2.5-1.5B-Instruct-unsloth-bnb-4bit
- Fine-tuning Dataset: MIRIAD-4.4M
- Samples Used: 1,000,000 examples randomly selected from the full set
- Prompt Style: Minimal QA format (see below)
- Training Framework: Unsloth with QLoRA
- License: Apache-2.0 (inherits from base model); dataset is ODC-By 1.0

The model expects the prompt to end with `### Answer:` and will generate only the answer text. Do not include the answer in the prompt during inference.

This model was fine-tuned on 1 million randomly selected examples from the MIRIAD-4.4M dataset, which is released under the ODC-By 1.0 License.

> The MIRIAD dataset is intended exclusively for academic research and educational exploration.
> As stated by its authors:
>
> "The outputs generated by models trained or fine-tuned on this dataset must not be used for medical diagnosis or decision-making involving real individuals."

This model is for research, educational, and exploration purposes only. It is not a medical device and must not be used to provide clinical advice, diagnosis, or treatment.

- MIRIAD Dataset by Zheng et al. (2025) – https://huggingface.co/datasets/miriad/miriad-4.4M
- Qwen2.5 by Alibaba – https://huggingface.co/Qwen
- Training infrastructure: Unsloth

This qwen2 model was trained 2x faster with Unsloth and Hugging Face's TRL library.
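A minimal prompt builder for the format above. The card only specifies that the prompt must end with `### Answer:` and that the answer must not be included, so the `### Question:` header used here is an assumption for illustration.

```python
def build_prompt(question: str) -> str:
    """Build a minimal-QA prompt ending with the `### Answer:` marker.
    The model is expected to generate only the answer text after this
    marker; never append the answer yourself at inference time.
    NOTE: the question header is a hypothetical choice, not from the card.
    """
    return f"### Question:\n{question.strip()}\n\n### Answer:"


prompt = build_prompt("What are the common causes of iron-deficiency anemia?")
```

The resulting string would then be passed to the model (e.g., via `llama.cpp` or `transformers`), with generation stopping at the model's natural end-of-answer.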
medgemma-27b-text-it-GGUF
Mistral-NeMo-Minitron-8B-Instruct-GGUF
granite-20b-code-instruct-r1.1-GGUF
This model was generated using llama.cpp at commit `5dd942de`.

Model Summary

Granite-20B-Code-Instruct-r1.1 is a 20B parameter model fine-tuned from Granite-20B-Code-Base-r1.1 on a combination of permissively licensed instruction data to enhance instruction-following capabilities, including mathematical reasoning and problem-solving skills.

- Developers: IBM Research
- GitHub Repository: ibm-granite/granite-code-models
- Paper: Granite Code Models: A Family of Open Foundation Models for Code Intelligence
- Release Date: July 18th, 2024
- License: Apache 2.0

Usage

Intended use

The model is designed to respond to coding-related instructions and can be used to build coding assistants.

Generation

This is a simple example of how to use the Granite-20B-Code-Instruct-r1.1 model.

Training Data

Granite Code Instruct models are trained on the following types of data.

Code Commits Datasets: We sourced code commit data from the CommitPackFT dataset, a filtered version of the full CommitPack dataset. From the CommitPackFT dataset, we only consider data for 92 programming languages.
Our inclusion criteria boil down to selecting programming languages common across CommitPackFT and the 116 languages that we considered to pretrain the code-base model (Granite-20B-Code-Base-r1.1).

Math Datasets: We consider two high-quality math datasets, MathInstruct and MetaMathQA. Due to license issues, we filtered out GSM8K-RFT and Camel-Math from the MathInstruct dataset.

Code Instruction Datasets: We use Glaive-Code-Assistant-v3, Glaive-Function-Calling-v2, BigCode-SC2-Instruct, NL2SQL11, and a small collection of synthetic API-calling datasets, including synthetic instruction-response pairs generated using Granite-34B-Code-Instruct.

Language Instruction Datasets: We include high-quality datasets such as HelpSteer and an open-license-filtered version of Platypus. We also include a collection of hardcoded prompts to ensure our model generates correct outputs given inquiries about its name or developers.

Infrastructure

We train the Granite Code models using two of IBM's supercomputing clusters, Vela and Blue Vela, outfitted with NVIDIA A100 and H100 GPUs, respectively. These clusters provide a scalable and efficient infrastructure for training our models over thousands of GPUs.

Ethical Considerations and Limitations

Granite Code Instruct models are primarily fine-tuned using instruction-response pairs across a specific set of programming languages, so their performance may be limited on out-of-domain programming languages. In this situation, it is beneficial to provide few-shot examples to steer the model's output. Moreover, developers should perform safety testing and target-specific tuning before deploying these models in critical applications. The model also inherits ethical considerations and limitations from its base model. For more information, please refer to the Granite-20B-Code-Base-r1.1 model card.
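For out-of-domain languages, a few-shot prompt can be assembled like this. The question/answer layout below is my own illustrative format, not IBM's official chat template.

```python
def few_shot_prompt(examples: list[tuple[str, str]], query: str) -> str:
    """Concatenate (instruction, response) pairs, then the new query,
    leaving the final response empty for the model to complete.
    The `Question:`/`Answer:` labels are a hypothetical convention."""
    parts = []
    for instruction, response in examples:
        parts.append(f"Question:\n{instruction}\nAnswer:\n{response}\n")
    parts.append(f"Question:\n{query}\nAnswer:\n")
    return "\n".join(parts)


# Steer the model toward an out-of-domain language with worked examples.
prompt = few_shot_prompt(
    [("Print 'hello' in Zig", 'const std = @import("std");\n...')],
    "Print 'hello' in Nim",
)
```

The idea is simply that demonstrations in the target language give the model the surface patterns its fine-tuning data lacked.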
Aryabhata-1.0-GGUF
granite-7b-instruct-GGUF
RWKV7-Goose-World3-2.9B-HF-GGUF
Midm-2.0-Base-Instruct-GGUF
This model was generated using llama.cpp at commit `21c02174`.

🤗 Mi:dm 2.0 Models | 📜 Mi:dm 2.0 Technical Report | 📕 Mi:dm 2.0 Technical Blog

- 🔜 (Coming Soon!) GGUF format model files will be available soon for easier local deployment.
- ⚡️ `2025/07/04`: Released the Mi:dm 2.0 model collection on Hugging Face 🤗.

Contents: Overview - Mi:dm 2.0 - Quickstart - Evaluation - Usage - Run on Friendli.AI - Run on Your Local Machine - Deployment - Tutorials - More Information - Limitation - License - Contact

Mi:dm 2.0 is a "Korea-centric AI" model developed using KT's proprietary technology. The term "Korea-centric AI" refers to a model that deeply internalizes the unique values, cognitive frameworks, and commonsense reasoning inherent to Korean society. It goes beyond simply processing or generating Korean text—it reflects a deeper understanding of the socio-cultural norms and values that define Korean society.

- Mi:dm 2.0 Base: An 11.5B parameter dense model designed to balance model size and performance. It extends an 8B-scale model by applying the Depth-up Scaling (DuS) method, making it suitable for real-world applications that require both performance and versatility.
- Mi:dm 2.0 Mini: A lightweight 2.3B parameter dense model optimized for on-device environments and systems with limited GPU resources. It was derived from the Base model through pruning and distillation to enable compact deployment.

> [!Note]
> Neither the pre-training nor the post-training data includes KT users' data.

Here is the code snippet to run conversational inference with the model:

> [!NOTE]
> The `transformers` library should be version `4.45.0` or higher.

Korean benchmarks, grouped as Society & Culture (K-Refer, K-Refer-Hard, Ko-Sovereign, HAERAE), General Knowledge (KMMLU, Ko-Sovereign), and Instruction Following (Ko-IFEval, Ko-MTBench):

| Model | K-Refer | K-Refer-Hard | Ko-Sovereign | HAERAE | Avg. | KMMLU | Ko-Sovereign | Avg. | Ko-IFEval | Ko-MTBench | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3-4B | 53.6 | 42.9 | 35.8 | 50.6 | 45.7 | 50.6 | 42.5 | 46.5 | 75.9 | 63.0 | 69.4 |
| Exaone-3.5-2.4B-inst | 64.0 | 67.1 | 44.4 | 61.3 | 59.2 | 43.5 | 42.4 | 43.0 | 65.4 | 74.0 | 68.9 |
| Mi:dm 2.0-Mini-inst | 66.4 | 61.4 | 36.7 | 70.8 | 58.8 | 45.1 | 42.4 | 43.8 | 73.3 | 74.0 | 73.6 |
| Qwen3-14B | 72.4 | 65.7 | 49.8 | 68.4 | 64.1 | 55.4 | 54.7 | 55.1 | 83.6 | 71 | 77.3 |
| Llama-3.1-8B-inst | 43.2 | 36.4 | 33.8 | 49.5 | 40.7 | 33.0 | 36.7 | 34.8 | 60.1 | 57 | 58.5 |
| Exaone-3.5-7.8B-inst | 71.6 | 69.3 | 46.9 | 72.9 | 65.2 | 52.6 | 45.6 | 49.1 | 69.1 | 79.6 | 74.4 |
| Mi:dm 2.0-Base-inst | 89.6 | 86.4 | 56.3 | 81.5 | 78.4 | 57.3 | 58.0 | 57.7 | 82 | 89.7 | 85.9 |

| Model | K-Prag | K-Refer-Hard | Ko-Best | Ko-Sovereign | Avg. | Ko-Winogrande | Ko-Best | LogicKor | HRM8K | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3-4B | 73.9 | 56.7 | 91.5 | 43.5 | 66.6 | 67.5 | 69.2 | 5.6 | 56.7 | 43.8 |
| Exaone-3.5-2.4B-inst | 68.7 | 58.5 | 87.2 | 38.0 | 62.5 | 60.3 | 64.1 | 7.4 | 38.5 | 36.7 |
| Mi:dm 2.0-Mini-inst | 69.5 | 55.4 | 80.5 | 42.5 | 61.9 | 61.7 | 64.5 | 7.7 | 39.9 | 37.4 |
| Qwen3-14B | 86.7 | 74.0 | 93.9 | 52.0 | 76.8 | 77.2 | 75.4 | 6.4 | 64.5 | 48.8 |
| Llama-3.1-8B-inst | 59.9 | 48.6 | 77.4 | 31.5 | 51.5 | 40.1 | 26.0 | 2.4 | 30.9 | 19.8 |
| Exaone-3.5-7.8B-inst | 73.5 | 61.9 | 92.0 | 44.0 | 67.2 | 64.6 | 60.3 | 8.6 | 49.7 | 39.5 |
| Mi:dm 2.0-Base-inst | 86.5 | 70.8 | 95.2 | 53.0 | 76.1 | 75.1 | 73.0 | 8.6 | 52.9 | 44.8 |

General benchmarks, grouped as Instruction (IFEval), Reasoning (BBH, GPQA, MuSR), Math (GSM8K), Coding (MBPP+), and General Knowledge (MMLU-pro, MMLU):

| Model | IFEval | BBH | GPQA | MuSR | Avg. | GSM8K | MBPP+ | MMLU-pro | MMLU | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3-4B | 79.7 | 79.0 | 39.8 | 58.5 | 59.1 | 90.4 | 62.4 | - | 73.3 | 73.3 |
| Exaone-3.5-2.4B-inst | 81.1 | 46.4 | 28.1 | 49.7 | 41.4 | 82.5 | 59.8 | - | 59.5 | 59.5 |
| Mi:dm 2.0-Mini-inst | 73.6 | 44.5 | 26.6 | 51.7 | 40.9 | 83.1 | 60.9 | - | 56.5 | 56.5 |
| Qwen3-14B | 83.9 | 83.4 | 49.8 | 57.7 | 63.6 | 88.0 | 73.4 | 70.5 | 82.7 | 76.6 |
| Llama-3.1-8B-inst | 79.9 | 60.3 | 21.6 | 50.3 | 44.1 | 81.2 | 81.8 | 47.6 | 70.7 | 59.2 |
| Exaone-3.5-7.8B-inst | 83.6 | 50.1 | 33.1 | 51.2 | 44.8 | 81.1 | 79.4 | 40.7 | 69.0 | 54.8 |
| Mi:dm 2.0-Base-inst | 84.0 | 77.7 | 33.5 | 51.9 | 54.4 | 91.6 | 77.5 | 53.3 | 73.7 | 63.5 |

Run on Friendli.AI

You can try our model immediately via `Friendli.AI`. Simply click `Deploy` and then `Friendli Endpoints`.

> [!Note]
> Please note that a login to `Friendli.AI` is required after your fifth chat interaction.

Run on Your Local Machine

We provide a detailed description of running Mi:dm 2.0 on your local machine using llama.cpp, LM Studio, and Ollama. Please check our GitHub for more information.

To serve Mi:dm 2.0 using vLLM (`>=0.8.0`) with an OpenAI-compatible API:

Tutorials

To help our end users easily use Mi:dm 2.0, we have provided comprehensive tutorials on GitHub.

Limitation

The training data for both Mi:dm 2.0 models consists primarily of English and Korean. Understanding and generation in other languages are not guaranteed. The model is not guaranteed to provide reliable advice in fields that require professional expertise, such as law, medicine, or finance. Researchers have made efforts to exclude unethical content from the training data, such as profanity, slurs, bias, and discriminatory language. However, despite these efforts, the model may still produce inappropriate expressions or factual inaccuracies.

Contact

Mi:dm 2.0 technical inquiries: [email protected]
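Once a vLLM server is up, requests follow the OpenAI chat-completions shape. The sketch below only builds the request body; the default model id is an assumption based on this card's repo name, and the serving endpoint (typically `/v1/chat/completions`) is vLLM's OpenAI-compatible convention, not something this card specifies.

```python
import json


def chat_request(prompt: str, model: str = "Midm-2.0-Base-Instruct") -> str:
    """Build an OpenAI-compatible chat-completions request body for a
    locally served Mi:dm 2.0 model. The default model id is assumed."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
    }
    return json.dumps(payload, ensure_ascii=False)


body = chat_request("KT에 대해 소개해 주세요.")
```

The resulting JSON string would be POSTed to the vLLM server's chat endpoint with any standard HTTP or OpenAI client.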
You will also find the code I use to quantize the models, if you want to do it yourself: GGUFModelBuilder

💬 How to test: Choose an AI assistant type:
- `TurboLLM` (GPT-4.1-mini)
- `HugLLM` (Hugging Face open-source models)
- `TestLLM` (experimental, CPU-only)

What I'm Testing: I'm pushing the limits of small open-source models for AI network monitoring, specifically:
- Function calling against live network services
- How small can a model go while still handling:
  - Automated Nmap security scans
  - Quantum-readiness checks
  - Network monitoring tasks

🟡 TestLLM – current experimental model (llama.cpp on 2 CPU threads on a Hugging Face Docker space):
- ✅ Zero-configuration setup
- ⏳ ~30 s load time (slow inference, but no API costs). No token limit, as the cost is low.
- 🔧 Help wanted! If you're into edge-device AI, let's collaborate!

Other Assistants

🟢 TurboLLM – uses gpt-4.1-mini:
- Performs very well, but unfortunately OpenAI charges per token, so token usage is limited.
- Create custom cmd processors to run .NET code on Quantum Network Monitor Agents
- Real-time network diagnostics and monitoring
- Security audits
- Penetration testing (Nmap/Metasploit)

🔵 HugLLM – latest open-source models:
- 🌐 Runs on the Hugging Face Inference API. Performs pretty well using the latest models hosted on Novita.

💡 Example commands you could test:
1. `"Give me info on my website's SSL certificate"`
2. `"Check if my server is using quantum-safe encryption for communication"`
3. `"Run a comprehensive security audit on my server"`
4. `"Create a cmd processor to ... (whatever you want)"`

Note: you need to install a Quantum Network Monitor Agent to run the .NET code. This is a very flexible and powerful feature. Use with caution!

I fund the servers used to create these model files, run the Quantum Network Monitor service, and pay for inference from Novita and OpenAI, all out of my own pocket. All the code behind the model creation and the Quantum Network Monitor project is open source.
Feel free to use whatever you find helpful. If you appreciate the work, please consider buying me a coffee ☕. Your support helps cover service costs and allows me to raise token limits for everyone.
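For reference, the conversational-inference snippet mentioned in the Mi:dm 2.0 card above might look like the following minimal sketch. The repo id and the system prompt are assumptions, not taken from the official card; verify them before use.

```python
# Hypothetical sketch of conversational inference with Mi:dm 2.0 Mini.
# MODEL_ID is an assumption; check the official model card for the exact repo id.
MODEL_ID = "K-intelligence/Midm-2.0-Mini-Instruct"

def build_messages(user_prompt: str) -> list:
    # Chat-style message list consumed by tokenizer.apply_chat_template.
    return [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": user_prompt},
    ]

def chat(user_prompt: str, max_new_tokens: int = 256) -> str:
    # Heavy imports kept local so the helper above stays dependency-free.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")
    input_ids = tokenizer.apply_chat_template(
        build_messages(user_prompt),
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(model.device)
    output = model.generate(input_ids, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, not the prompt.
    return tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)
```

Remember that the card above requires `transformers >= 4.45.0`.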
Nemotron-4-Mini-Hindi-4B-Instruct-GGUF
ERNIE-4.5-21B-A3B-Thinking-GGUF
This model was generated using llama.cpp at commit `86587da0`.

Click here to get info on choosing the right GGUF model format

Over the past three months, we have continued to scale the thinking capability of ERNIE-4.5-21B-A3B, improving both the quality and depth of reasoning and thereby advancing the competitiveness of ERNIE lightweight models in complex reasoning tasks. We are pleased to introduce ERNIE-4.5-21B-A3B-Thinking, featuring the following key enhancements:
- Significantly improved performance on reasoning tasks, including logical reasoning, mathematics, science, coding, text generation, and academic benchmarks that typically require human expertise.
- Efficient tool-usage capabilities.
- Enhanced 128K long-context understanding capabilities.

> [!NOTE]
> This version has an increased thinking length. We strongly recommend it for highly complex reasoning tasks.

ERNIE-4.5-21B-A3B-Thinking is a text MoE post-trained model, with 21B total parameters and 3B activated parameters per token.
The following are the model configuration details:

|Key|Value|
|-|-|
|Modality|Text|
|Training Stage|Post-training|
|Params (Total / Activated)|21B / 3B|
|Layers|28|
|Heads (Q/KV)|20 / 4|
|Text Experts (Total / Activated)|64 / 6|
|Vision Experts (Total / Activated)|64 / 6|
|Shared Experts|2|
|Context Length|131072|

> [!NOTE]
> To align with the wider community, this model releases Transformer-style weights. Both PyTorch and PaddlePaddle ecosystem tools, such as vLLM, transformers, and FastDeploy, are expected to be able to load and run this model.

Quickly deploy services using FastDeploy as shown below. For more detailed usage, refer to the FastDeploy GitHub Repository. Note: 1x 80GB GPU is required, and deploying this model requires FastDeploy version 2.2.

The ERNIE-4.5-21B-A3B-Thinking model supports function calling. The `reasoning-parser` and `tool-call-parser` for vLLM Ernie are currently under development.

Note: You'll need the `transformers` library (version 4.54.0 or newer) installed to use this model. The following code snippet illustrates how to use the model to generate content from given inputs.

The ERNIE 4.5 models are provided under the Apache License 2.0. This license permits commercial use, subject to its terms and conditions. Copyright (c) 2025 Baidu, Inc. All Rights Reserved. If you find ERNIE 4.5 useful or wish to use it in your projects, please kindly cite our technical report.
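A minimal sketch of the generation snippet referenced above. The repo id is an assumption inferred from the model name; verify it against the official card before use.

```python
# Hypothetical sketch of text generation with ERNIE-4.5-21B-A3B-Thinking.
# MODEL_ID is assumed; the card requires transformers >= 4.54.0.
MODEL_ID = "baidu/ERNIE-4.5-21B-A3B-Thinking"

def build_messages(user_prompt: str) -> list:
    # A single user turn; the thinking variant reasons before answering.
    return [{"role": "user", "content": user_prompt}]

def generate(user_prompt: str, max_new_tokens: int = 1024) -> str:
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")
    input_ids = tokenizer.apply_chat_template(
        build_messages(user_prompt), add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(input_ids, max_new_tokens=max_new_tokens)
    # Output includes the model's reasoning trace followed by the final answer.
    return tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)
```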
granite-3b-code-base-2k-GGUF
This model was generated using llama.cpp at commit `0a5a3b5c`.

Click here to get info on choosing the right GGUF model format

Model Summary

Granite-3B-Code-Base-2K is a decoder-only code model designed for code-generative tasks (e.g., code generation, code explanation, and code fixing). It is trained from scratch with a two-phase training strategy. In phase 1, the model is trained on 4 trillion tokens sourced from 116 programming languages, ensuring a comprehensive understanding of programming languages and syntax. In phase 2, the model is trained on 500 billion tokens with a carefully designed mixture of high-quality data from code and natural-language domains to improve the model's ability to reason and follow instructions.

- Developers: IBM Research
- GitHub Repository: ibm-granite/granite-code-models
- Paper: Granite Code Models: A Family of Open Foundation Models for Code Intelligence
- Release Date: May 6th, 2024
- License: Apache 2.0

Usage

Intended use: Prominent enterprise use cases of LLMs in software engineering productivity include code generation, code explanation, code fixing, generating unit tests, generating documentation, addressing technical-debt issues, vulnerability detection, code translation, and more.
All Granite Code Base models, including the 3B parameter model, are able to handle these tasks, as they were trained on a large amount of code data from 116 programming languages.

Generation: This is a simple example of how to use the Granite-3B-Code-Base-2K model.

Training Data
- Data Collection and Filtering: Pretraining code data is sourced from a combination of publicly available datasets (e.g., GitHub Code Clean, StarCoder data) and additional public code repositories and issues from GitHub. We filter the raw data to retain a list of 116 programming languages. After language filtering, we also filter out low-quality code.
- Exact and Fuzzy Deduplication: We adopt an aggressive deduplication strategy that includes both exact and fuzzy deduplication to remove documents with (near-)identical code content.
- HAP, PII, Malware Filtering: We apply a HAP content filter that reduces the models' likelihood of generating hateful, abusive, or profane language. We also redact Personally Identifiable Information (PII) by replacing PII content (e.g., names, email addresses, keys, passwords) with corresponding tokens (e.g., ⟨NAME⟩, ⟨EMAIL⟩, ⟨KEY⟩, ⟨PASSWORD⟩). Moreover, we scan all datasets using ClamAV to identify and remove instances of malware in the source code.
- Natural Language Datasets: In addition to collecting code data for model training, we curate several publicly available high-quality natural-language datasets to improve the models' proficiency in language understanding and mathematical reasoning. Unlike the code data, we do not deduplicate these datasets.

Infrastructure

We train the Granite Code models using two of IBM's supercomputing clusters, Vela and Blue Vela, outfitted with NVIDIA A100 and H100 GPUs, respectively. These clusters provide a scalable and efficient infrastructure for training our models over thousands of GPUs.
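The "simple example" referenced above might look like the following sketch. Since this is a base model, plain causal completion (no chat template) is the natural interface; the prompt below is just an illustrative code prefix.

```python
# Minimal sketch: code completion with Granite-3B-Code-Base-2K via transformers.
# The repo id matches the card's naming; verify before use.
MODEL_ID = "ibm-granite/granite-3b-code-base-2k"

PROMPT = "def generate_fibonacci(n):"  # any code prefix to complete

def complete(prompt: str, max_new_tokens: int = 128) -> str:
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output[0], skip_special_tokens=True)
```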
Ethical Considerations and Limitations

The use of Large Language Models involves risks and ethical considerations people must be aware of. Regarding code generation, caution is urged against complete reliance on specific code models for crucial decisions or impactful information, as the generated code is not guaranteed to work as intended. The Granite-3B-Code-Base-2K model is no exception in this regard. Even though this model is suited for multiple code-related tasks, it has not undergone any safety alignment; therefore, it may produce problematic outputs. Additionally, it remains uncertain whether smaller models might exhibit increased susceptibility to hallucination in generation scenarios by copying source code verbatim from the training dataset due to their reduced sizes and memorization capacities. This aspect is currently an active area of research, and we anticipate more rigorous exploration, comprehension, and mitigations in this domain. Regarding ethics, a latent risk associated with all Large Language Models is their malicious utilization. We urge the community to use the Granite-3B-Code-Base-2K model with ethical intentions and in a responsible way.
DeepSeek-R1-Distill-Llama-70B-GGUF
Prox-MistralHermes-7B-GGUF
This model was generated using llama.cpp at commit `aa0ef5c5`.

Click here to get info on choosing the right GGUF model format

Drawing inspiration from the concept of 'proximity' in digital networks, the Prox series stands at the forefront of cybersecurity technology. Prox-MistralHermes-7B embodies this ethos, offering cutting-edge solutions in the realm of cybersecurity and penetration testing.

Prox-MistralHermes-7B is a fine-tuned version of OpenHermes 2.5 Mistral 7B, specifically tailored for cybersecurity. It excels in red-teaming tasks, including the simulation of phishing emails. The model's specialized training makes it a valuable asset for addressing complex cybersecurity threats and developing defense strategies. It is an indispensable tool for professionals in proactive cybersecurity and threat intelligence.

Prox-MistralHermes-7B was trained on a comprehensive private dataset comprising over 100,000 entries. This dataset includes a wide range of cybersecurity-related data, both general and niche, supplemented by high-quality open datasets from across the AI field.

Training: Prox-MistralHermes-7B was trained over 5 hours for 4 epochs on 4x A100 GPUs with QLoRA.

Prompt format: This model uses the ChatML prompt format.
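The ChatML layout mentioned above wraps each turn in `<|im_start|>role` / `<|im_end|>` markers. The helper below is a plain-string sketch of that layout; the example roles and contents are illustrative only.

```python
# Sketch of the ChatML prompt format used by OpenHermes-lineage models.
def to_chatml(messages: list) -> str:
    """Render chat messages as ChatML: <|im_start|>role\\ncontent<|im_end|>."""
    rendered = ""
    for m in messages:
        rendered += f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
    # Open an assistant turn so the model knows to generate the reply next.
    rendered += "<|im_start|>assistant\n"
    return rendered

prompt = to_chatml([
    {"role": "system", "content": "You are a cybersecurity assistant."},
    {"role": "user", "content": "Summarise common phishing indicators."},
])
```

In practice, `tokenizer.apply_chat_template` builds the same string from the model's bundled template, which is the safer option.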
Misuse, Malicious Use, and Out-of-Scope Use

Users are responsible for their applications of this model. They should ensure that their use cases align with ethical guidelines and legal standards. Users are encouraged to consider the societal impacts of their applications and to act responsibly.

License

The weights of Prox-MistralHermes-7B are licensed under the MIT License.
Mistral-7B-Instruct-v0.3-GGUF
Phi-4-reasoning-plus-GGUF
Holo1-3B-GGUF
UIGEN-X-8B-GGUF
This model was generated using llama.cpp at commit `c82d48ec`.

Click here to get info on choosing the right GGUF model format

> Tesslate's hybrid reasoning UI generation model built on the Qwen3-8B architecture. Trained to systematically plan, architect, and implement complete user interfaces across modern development stacks.

Live Examples: https://uigenoutput.tesslate.com
Discord Community: https://discord.gg/EcCpcTv93U
Website: https://tesslate.com

UIGEN-X-8B implements hybrid reasoning from the Qwen3 family, combining systematic planning with direct implementation. The model follows a structured thinking process:

1. Problem Analysis: understanding requirements and constraints
2. Architecture Planning: component structure and technology decisions
3. Design System Definition: color schemes, typography, and styling approach
4. Implementation Strategy: step-by-step code generation with reasoning

This hybrid approach enables both thoughtful planning and efficient code generation, making it suitable for complex UI development tasks.
UIGEN-X-8B supports 26 major categories spanning frameworks and libraries across 7 platforms:

Web Frameworks
- React: Next.js, Remix, Gatsby, Create React App, Vite
- Vue: Nuxt.js, Quasar, Gridsome
- Angular: Angular CLI, Ionic Angular
- Svelte: SvelteKit, Astro
- Modern: Solid.js, Qwik, Alpine.js
- Static: Astro, 11ty, Jekyll, Hugo

Styling Systems
- Utility-First: Tailwind CSS, UnoCSS, Windi CSS
- CSS-in-JS: Styled Components, Emotion, Stitches
- Component Systems: Material-UI, Chakra UI, Mantine
- Traditional: Bootstrap, Bulma, Foundation
- Design Systems: Carbon Design, IBM Design Language
- Framework-Specific: Angular Material, Vuetify, Quasar

UI Component Libraries
- React: shadcn/ui, Material-UI, Ant Design, Chakra UI, Mantine, PrimeReact, Headless UI, NextUI, DaisyUI
- Vue: Vuetify, PrimeVue, Quasar, Element Plus, Naive UI
- Angular: Angular Material, PrimeNG, ng-bootstrap, Clarity Design
- Svelte: Svelte Material UI, Carbon Components Svelte
- Headless: Radix UI, Reach UI, Ariakit, React Aria

State Management
- React: Redux Toolkit, Zustand, Jotai, Valtio, Context API
- Vue: Pinia, Vuex, Composables
- Angular: NgRx, Akita, Services
- Universal: MobX, XState, Recoil

Animation Libraries
- React: Framer Motion, React Spring, React Transition Group
- Vue: Vue Transition, Vueuse Motion
- Universal: GSAP, Lottie, CSS Animations, Web Animations API
- Mobile: React Native Reanimated, Expo Animations

Icon Systems
Lucide, Heroicons, Material Icons, Font Awesome, Ant Design Icons, Bootstrap Icons, Ionicons, Tabler Icons, Feather, Phosphor, React Icons, Vue Icons

Web Development
Complete coverage of modern web development from simple HTML/CSS to complex enterprise applications.
Mobile Development
- React Native: Expo, CLI, with navigation and state management
- Flutter: Cross-platform mobile with Material and Cupertino designs
- Ionic: Angular, React, and Vue-based hybrid applications

Desktop Applications
- Electron: Cross-platform desktop apps (Slack, VSCode-style)
- Tauri: Rust-based lightweight desktop applications
- Flutter Desktop: Native desktop performance

Python Applications
- Web UI: Streamlit, Gradio, Flask, FastAPI
- Desktop GUI: Tkinter, PyQt5/6, Kivy, wxPython, Dear PyGui

Development Tools
Build tools, bundlers, testing frameworks, and development environments.

26 Languages and Approaches: JavaScript, TypeScript, Python, Dart, HTML5, CSS3, SCSS, SASS, Less, PostCSS, CSS Modules, Styled Components, JSX, TSX, Vue SFC, Svelte Components, Angular Templates, Tailwind, PHP

UIGEN-X-8B includes 21 distinct visual style categories that can be applied to any framework:

Modern Design Styles
- Glassmorphism: Frosted glass effects with blur and transparency
- Neumorphism: Soft, extruded design elements
- Material Design: Google's design system principles
- Fluent Design: Microsoft's design language

Traditional & Classic
- Skeuomorphism: Real-world object representations
- Swiss Design: Clean typography and grid systems
- Bauhaus: Functional, geometric design principles

Contemporary Trends
- Brutalism: Bold, raw, unconventional layouts
- Anti-Design: Intentionally imperfect, organic aesthetics
- Minimalism: Essential elements only, generous whitespace

Thematic Styles
- Cyberpunk: Neon colors, glitch effects, futuristic elements
- Dark Mode: High contrast, reduced eye strain
- Retro-Futurism: 80s/90s inspired futuristic design
- Geocities/90s Web: Nostalgic early web aesthetics

Experimental
- Maximalism: Rich, layered, abundant visual elements
- Madness/Experimental: Unconventional, boundary-pushing designs
- Abstract Shapes: Geometric, non-representational elements

Basic Structure
To achieve the best results, use this prompting structure
below:

UIGEN-X-8B supports function calling for dynamic asset integration and enhanced development workflows.

Dynamic Asset Loading:
- Fetch relevant images during UI generation
- Generate realistic content for components
- Create cohesive color palettes from images
- Optimize assets for web performance

Multi-Step Development:
- Plan application architecture
- Generate individual components
- Integrate components into pages
- Apply consistent styling and theming
- Test responsive behavior

Content-Aware Design:
- Adapt layouts based on content types
- Optimize typography for readability
- Create responsive image galleries
- Generate accessible alt text

Rapid Prototyping
- Quick mockups for client presentations
- A/B testing different design approaches
- Concept validation with interactive prototypes

Production Development
- Component library creation
- Design system implementation
- Template and boilerplate generation

Educational & Learning
- Teaching modern web development
- Framework comparison and evaluation
- Best practices demonstration

Enterprise Solutions
- Dashboard and admin panel generation
- Internal tool development
- Legacy system modernization

Hardware
- GPU: 8GB+ VRAM recommended (RTX 3080/4070 or equivalent)
- RAM: 16GB system memory minimum
- Storage: 20GB for model weights and cache

Software
- Python: 3.8+ with transformers, torch, unsloth
- Node.js: For running generated JavaScript/TypeScript code
- Browser: Modern browser for testing generated UIs

Integration
- Compatible with HuggingFace transformers
- Supports GGML/GGUF quantization
- Works with text-generation-webui
- API-ready for production deployment

Limitations
- Token Usage: Reasoning process increases token consumption
- Complex Logic: Focuses on UI structure rather than business logic
- Real-time Features: Generated code requires backend integration
- Testing: Output may need manual testing and refinement
- Accessibility: While ARIA-aware, manual a11y testing recommended

Discord:
https://discord.gg/EcCpcTv93U
Website: https://tesslate.com
Examples: https://uigenoutput.tesslate.com

Join our community to share creations, get help, and contribute to the ecosystem. Built with hybrid reasoning capabilities from Qwen3, UIGEN-X-8B represents a comprehensive approach to AI-driven UI development across the entire modern web development ecosystem.
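As a concrete starting point, a minimal transformers sketch for generating UI code with this model might look like the following. The repo id and the example request are assumptions; verify the id on the Hub before use.

```python
# Hypothetical sketch: generating a UI component with UIGEN-X-8B.
# MODEL_ID is assumed from the model name; check the official repo.
MODEL_ID = "Tesslate/UIGEN-X-8B"

def build_messages(request: str) -> list:
    # A plain user request; the hybrid-reasoning model plans before coding.
    return [{"role": "user", "content": request}]

def generate_ui(request: str, max_new_tokens: int = 2048) -> str:
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")
    input_ids = tokenizer.apply_chat_template(
        build_messages(request), add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(input_ids, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)
```

Note the Limitations section above: the reasoning trace increases token consumption, so budget `max_new_tokens` generously.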
`"Give me info on my websites SSL certificate"` 2. `"Check if my server is using quantum safe encyption for communication"` 3. `"Run a comprehensive security audit on my server"` 4. '"Create a cmd processor to .. (what ever you want)" Note you need to install a Quantum Network Monitor Agent to run the .net code on. This is a very flexible and powerful feature. Use with caution! I fund the servers used to create these model files, run the Quantum Network Monitor service, and pay for inference from Novita and OpenAI—all out of my own pocket. All the code behind the model creation and the Quantum Network Monitor project is open source. Feel free to use whatever you find helpful. If you appreciate the work, please consider buying me a coffee ☕. Your support helps cover service costs and allows me to raise token limits for everyone.
DeepSeek-R1-Distill-Llama-8B-GGUF
Veena-GGUF
granite-3.3-2b-base-GGUF
This model was generated using llama.cpp at commit `5dd942de`.

Click here to get info on choosing the right GGUF model format

Granite-3.3-2B-Base is a decoder-only language model with a 128K token context window. It improves upon Granite-3.1-2B-Base by adding support for Fill-in-the-Middle (FIM) using specialized tokens, enabling the model to generate content conditioned on both prefix and suffix. This makes it well suited for code-completion tasks.

- Developers: Granite Team, IBM
- GitHub Repository: ibm-granite/granite-3.3-language-models
- Website: Granite Docs
- Release Date: April 16th, 2025
- License: Apache 2.0

Supported Languages: English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese. Users may finetune Granite 3.3 models for languages beyond these 12 languages.

Intended Use: Prominent use cases of LLMs in text-to-text generation include summarization, text classification, extraction, question-answering, and other long-context tasks. All Granite Base models are able to handle these tasks, as they were trained on a large amount of data from various domains. Moreover, they can serve as a baseline for creating specialized models for specific application scenarios.

Generation: This is a simple example of how to use the Granite-3.3-2B-Base model. Then, copy the code snippet below to run the example.

| Models | ARC-Challenge | Hellaswag | MMLU | TruthfulQA | Winogrande | GSM8K | DROP | NQ | AGIEval | TriviaQA | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Granite-3.1-2B-Base | 46.83 | 74.9 | 54.87 | 38.93 | 71.8 | 53.0 | 30.08 | 24.46 | 38.24 | 63.18 | 49.63 |
| Granite-3.3-2B-Base | 47.49 | 73.2 | 54.33 | 40.83 | 70.4 | 50.0 | 32.55 | 24.36 | 38.78 | 63.22 | 49.52 |
| Granite-3.1-8B-Base | 53.51 | 81.4 | 64.28 | 51.27 | 76.2 | 70.5 | 45.87 | 35.97 | 48.99 | 78.33 | 60.63 |
| Granite-3.3-8B-Base | 50.84 | 80.1 | 63.89 | 52.15 | 74.4 | 59.0 | 36.14 | 36.5 | 49.3 | 78.18 | 58.05 |

Model Architecture: Granite-3.3-2B-Base is based on a decoder-only dense transformer architecture.
Core components of the Granite-3.3-2B-Base architecture are: GQA and RoPE, MLP with SwiGLU, RMSNorm, and shared input/output embeddings. Training Data: This model is trained on a mix of open source and proprietary data following a three-stage training strategy. Stage 1 data: The data for stage 1 is sourced from diverse domains, such as: web, code, academic sources, books, and math data. Stage 2 data: The data for stage 2 comprises a curated mix of high-quality data from the same domains, plus multilingual and instruction data. The goal of this second training phase is to enhance the model’s performance on specific tasks. Stage 3 data: The data for stage 3 consists of the original stage-2 pretraining data with additional synthetic long-context data in the form of QA/summary pairs where the answer contains a recitation of the related paragraph before the answer. Infrastructure: We train Granite 3.3 Language Models using IBM's supercomputing cluster, Blue Vela, which is outfitted with NVIDIA H100 GPUs. This cluster provides a scalable and efficient infrastructure for training our models over thousands of GPUs. Ethical Considerations and Limitations: The use of Large Language Models involves risks and ethical considerations people must be aware of, including but not limited to: bias and fairness, misinformation, and autonomous decision-making. The Granite-3.3-2B-Base model is no exception in this regard. Even though this model is suited for multiple generative AI tasks, it has not undergone any safety alignment, so it may produce problematic outputs. Additionally, it remains uncertain whether smaller models might exhibit increased susceptibility to hallucination in generation scenarios by copying text verbatim from the training dataset due to their reduced sizes and memorization capacities. This aspect is currently an active area of research, and we anticipate more rigorous exploration, comprehension, and mitigations in this domain.
Regarding ethics, a latent risk associated with all Large Language Models is their malicious utilization. We urge the community to use Granite-3.3-2B-Base model with ethical intentions and in a responsible way. Resources - ⭐️ Learn about the latest updates with Granite: https://www.ibm.com/granite - 📄 Get started with tutorials, best practices, and prompt engineering advice: https://www.ibm.com/granite/docs/ - 💡 Learn about the latest Granite learning resources: https://github.com/ibm-granite-community/
OpenReasoning-Nemotron-7B-GGUF
granite-4.0-tiny-preview-GGUF
granite-guardian-3.2-3b-a800m-GGUF
QwenLong-L1-32B-GGUF
A.X-4.0-Light-GGUF
DeepSeek-R1-0528-Qwen3-8B-GGUF
Llama-3.1-Nemotron-8B-UltraLong-1M-Instruct-GGUF
M3-Agent-Control-GGUF
This model was generated using llama.cpp at commit `cd6983d5`. I've been experimenting with a new quantization approach that selectively elevates the precision of key layers beyond what the default IMatrix configuration provides. In my testing, standard IMatrix quantization underperforms at lower bit depths, especially with Mixture of Experts (MoE) models. To address this, I'm using the `--tensor-type` option in `llama.cpp` to manually "bump" important layers to higher precision. You can see the implementation here: 👉 Layer bumping with llama.cpp While this does increase model file size, it significantly improves precision for a given quantization level. I'd love your feedback—have you tried this? How does it perform for you? Click here to get info on choosing the right GGUF model format
Seed-OSS-36B-Instruct-GGUF
This model was generated using llama.cpp at commit `c8dedc99`. Click here to get info on choosing the right GGUF model format You can get to know us better through the following channels👇 > [!NOTE] > This model card is dedicated to the `Seed-OSS-36B-Base-Instruct` model. News - [2025/08/20]🔥We release `Seed-OSS-36B-Base` (both with and without synthetic data versions) and `Seed-OSS-36B-Instruct`. Introduction Seed-OSS is a series of open-source large language models developed by ByteDance's Seed Team, designed for powerful long-context, reasoning, agent and general capabilities, and versatile developer-friendly features. Although trained with only 12T tokens, Seed-OSS achieves excellent performance on several popular open benchmarks. We release this series of models to the open-source community under the Apache-2.0 license. > [!NOTE] > Seed-OSS is primarily optimized for international (i18n) use cases. Key Features - Flexible Control of Thinking Budget: Allows users to flexibly adjust the reasoning length as needed. Dynamically controlling the reasoning length enhances inference efficiency in practical application scenarios. - Enhanced Reasoning Capability: Specifically optimized for reasoning tasks while maintaining balanced and excellent general capabilities.
- Agentic Intelligence: Performs exceptionally well in agentic tasks such as tool use and issue resolution. - Research-Friendly: Given that the inclusion of synthetic instruction data in pre-training may affect post-training research, we released pre-trained models both with and without instruction data, providing the research community with more diverse options. - Native Long Context: Trained natively with up to 512K of context. Seed-OSS adopts the popular causal language model architecture with RoPE, GQA attention, RMSNorm and SwiGLU activation.

| | Seed-OSS-36B |
|:---:|:---:|
| Parameters | 36B |
| Attention | GQA |
| Activation Function | SwiGLU |
| Number of Layers | 64 |
| Number of QKV Heads | 80 / 8 / 8 |
| Head Size | 128 |
| Hidden Size | 5120 |
| Vocabulary Size | 155K |
| Context Length | 512K |
| RoPE Base Frequency | 1e7 |

Incorporating synthetic instruction data into pretraining leads to improved performance on most benchmarks. We adopt the version augmented with synthetic instruction data (i.e., w/ syn.) as `Seed-OSS-36B-Base`. We also release `Seed-OSS-36B-Base-woSyn`, trained without such data (i.e., w/o syn.), offering the community a high-performance foundation model unaffected by synthetic instruction data. Benchmark Seed1.6-Base Qwen3-30B-A3B-Base-2507 Qwen2.5-32B-Base Seed-OSS-36B-Base (w/ syn.) Seed-OSS-36B-Base-woSyn (w/o syn.) - Results are presented in the format "reproduced results (reported results, if any)".

| Benchmark | Seed1.6-Thinking-0715 | OAI-OSS-20B | Qwen3-30B-A3B-Thinking-2507 | Qwen3-32B | Gemma3-27B | Seed-OSS-36B-Instruct |
|-----------|----------------------|-------------|------------------------------|-----------|------------|------------------------|
| GPQA-D | 80.7 | 72.2 (71.5) | 71.4 (73.4) | 66.7 (68.4) | 42.4 | 71.4 |
| LiveCodeBench v6 (02/2025-05/2025) | 66.8 | 63.8 | 60.3 (66) | 53.4 | - | 67.4 |
| SWE-Bench Verified (OpenHands) | 41.8 | (60.7) | 31 | 23.4 | - | 56 |
| SWE-Bench Verified (AgentLess 410) | 48.4 | - | 33.5 | 39.7 | - | 47 |

- Bold denotes open-source SOTA. Underlined indicates second place among open-source models.
- Results are presented in the format "reproduced results (reported results, if any)". Some results have been omitted due to the failure of the evaluation run. - The results of Gemma3-27B are sourced directly from its technical report. - The results of ArcAGI-V2 were measured on the official evaluation set, which was not involved in the training process. - Generation configs for Seed-OSS-36B-Instruct: temperature=1.1, top_p=0.95. Specifically, for TauBench, temperature=1, top_p=0.7. > [!NOTE] > We recommend sampling with `temperature=1.1` and `top_p=0.95`. Users can flexibly specify the model's thinking budget. The figure below shows the performance curves across different tasks as the thinking budget varies. For simpler tasks (such as IFEval), the model's chain of thought (CoT) is shorter, and the score exhibits fluctuations as the thinking budget increases. For more challenging tasks (such as AIME and LiveCodeBench), the model's CoT is longer, and the score improves with an increase in the thinking budget. Here is an example with a thinking budget set to 512: during the reasoning process, the model periodically triggers self-reflection to estimate the consumed and remaining budget, and delivers the final response once the budget is exhausted or the reasoning concludes. If no thinking budget is set (default mode), Seed-OSS will initiate thinking with unlimited length. If a thinking budget is specified, users are advised to prioritize values that are integer multiples of 512 (e.g., 512, 1K, 2K, 4K, 8K, or 16K), as the model has been extensively trained on these intervals. Models are instructed to output a direct response when the thinking budget is 0, and we recommend setting any budget below 512 to this value. Download the Seed-OSS checkpoint to `./Seed-OSS-36B-Instruct` Transformers The `generate.py` script provides a simple interface for model inference with configurable options.
Key Parameters

| Parameter | Description |
|-----------|-------------|
| `--model_path` | Path to the pretrained model directory (required) |
| `--prompts` | Input prompts (default: sample cooking/code questions) |
| `--max_new_tokens` | Maximum tokens to generate (default: 4096) |
| `--attn_implementation` | Attention mechanism: `flash_attention_2` (default) or `eager` |
| `--load_in_4bit/8bit` | Enable 4-bit/8-bit quantization (reduces memory usage) |
| `--thinking_budget` | Thinking budget in tokens (default: -1 for unlimited budget) |

- First install the vLLM version with Seed-OSS support. License: This project is licensed under Apache-2.0. See the LICENSE file for details. Founded in 2023, the ByteDance Seed Team is dedicated to crafting the industry's most advanced AI foundation models. The team aspires to become a world-class research team and make significant contributions to the advancement of science and society.
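The thinking-budget guidance in the Seed-OSS card above (prefer multiples of 512; treat anything below 512 as 0 for a direct response; -1 means unlimited) can be captured in a small helper. `normalize_thinking_budget` is a hypothetical utility for illustration, not part of the Seed-OSS release:

```python
def normalize_thinking_budget(requested: int) -> int:
    """Map a requested thinking budget onto the values the card recommends.

    Hypothetical helper (not shipped with Seed-OSS): -1 keeps the default
    unlimited mode, budgets below 512 become 0 (direct response), and
    everything else snaps to the nearest multiple of 512, matching the
    intervals the model was trained on.
    """
    if requested < 0:
        return -1          # default mode: unlimited thinking
    if requested < 512:
        return 0           # card recommends 0 (direct answer) below 512
    return round(requested / 512) * 512
```

The normalized value would then be passed wherever the inference stack accepts a thinking budget (e.g. the `--thinking_budget` flag of `generate.py` shown above).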
rwkv7-0.1B-g1-GGUF
rwkv7-2.9B-world-GGUF
DeepHermes-3-Llama-3-8B-Preview-GGUF
MiMo-VL-7B-RL-GGUF
LFM2-1.2B-Tool-GGUF
This model was generated using llama.cpp at commit `c8dedc99`. Click here to get info on choosing the right GGUF model format Based on LFM2-1.2B, LFM2-1.2B-Tool is designed for concise and precise tool calling. The key challenge was designing a non-thinking model that outperforms similarly sized thinking models for tool use. - Mobile and edge devices requiring instant API calls, database queries, or system integrations without cloud dependency. - Real-time assistants in cars, IoT devices, or customer support, where response latency is critical. - Resource-constrained environments like embedded systems or battery-powered devices needing efficient tool execution. You can find more information about other task-specific models in this blog post. Generation parameters: We recommend greedy decoding with `temperature=0`. System prompt: The system prompt must provide all the available tools. Supported languages: English, Arabic, Chinese, French, German, Japanese, Korean, Portuguese, and Spanish. Tool use: It consists of four main steps: 1. Function definition: LFM2 takes JSON function definitions as input (JSON objects between ` ` and ` ` special tokens), usually in the system prompt. 2. Function call: LFM2 writes Pythonic function calls (a Python list between ` ` and ` ` special tokens) as the assistant answer. 3. Function execution: The function call is executed and the result is returned (a string between ` ` and ` ` special tokens) as a "tool" role message. 4. Final answer: LFM2 interprets the outcome of the function call to address the original user prompt in plain text. Here is a simple example of a conversation using tool use: > [!WARNING] > ⚠️ The model supports both single-turn and multi-turn conversations. For edge inference, latency is a crucial factor in delivering a seamless and satisfactory user experience.
Consequently, while test-time compute inherently provides more accuracy, it ultimately compromises the user experience due to increased waiting times for function calls. Therefore, the goal was to develop a tool-calling model that is competitive with thinking models, yet operates without any internal chain-of-thought process. We evaluated each model on a proprietary benchmark that was specifically designed to prevent data contamination. The benchmark ensures that performance metrics reflect genuine tool-calling capabilities rather than memorized patterns from training data. - Hugging Face: LFM2-350M - llama.cpp: LFM2-350M-Extract-GGUF - LEAP: LEAP model library If you are interested in custom solutions with edge deployment, please contact our sales team.
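The four-step tool-use loop described in the LFM2-1.2B-Tool card above can be sketched with the model replaced by a canned response. The `get_weather` tool and the call string are hypothetical examples, and the delimiting special tokens (elided in the card) are omitted here too:

```python
import json

# Step 1: a tool and its JSON definition. In a real conversation the
# definitions JSON would be placed in the system prompt between the
# model's tool-list special tokens (names omitted, as in the card).
def get_weather(city: str) -> str:
    return json.dumps({"city": city, "temp_c": 21})

TOOLS = {"get_weather": get_weather}
definitions = json.dumps([{"name": "get_weather",
                           "parameters": {"city": {"type": "string"}}}])

# Step 2: the model answers with a Pythonic call list, e.g.:
model_output = '[get_weather(city="Paris")]'

# Step 3: execute the call in a namespace restricted to registered tools.
results = eval(model_output, {"__builtins__": {}}, TOOLS)

# Step 4: results[0] would be sent back as a "tool" role message so the
# model can compose its final plain-text answer.
```

A production harness would parse the call with `ast` instead of `eval` and validate arguments against the definitions; this sketch only illustrates the loop's shape.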
ZR1-1.5B-GGUF
mem-agent-GGUF
This model was generated using llama.cpp at commit `1d0125bc`. Click here to get info on choosing the right GGUF model format Based on Qwen3-4B-Thinking-2507, this model was trained using GSPO (Zheng et al., 2025) over an agent scaffold built around an Obsidian-like memory system and the tools required to interact with it. The model was trained on the following subtasks: - Retrieval: Retrieving relevant information from the memory system when needed. In this subtask, we also trained the model to filter the retrieved information and/or obfuscate it completely. - Updating: Updating the memory system with new information. - Clarification: Asking for clarification when the user query is unclear or contradicts the information in the memory system. In the scaffold, the model uses ` `, ` ` and ` ` tags to structure its response, using ` ` only when it is done interacting with the memory. The ` ` block is executed in a sandbox with the tools, and the results of the code block are returned in a ` ` tag to the model, forming the agentic loop. The model is also trained to handle optional filters given by the user between tags after the user query. These filters are used to filter the retrieved information and/or obfuscate it completely.
We evaluated this model and a few other open and closed models on our benchmark, md-memory-bench, using o3 from OpenAI as the judge. All models except driaforall/mem-agent and Qwen/Qwen3-4B-Thinking-2507 were accessed through OpenRouter.

| Model | Retrieval | Update | Clarification | Filter | Overall |
|-------|-----------|--------|---------------|--------|---------|
| qwen/qwen3-235b-a22b-thinking-2507 | 0.9091 | 0.6363 | 0.4545 | 1 | 0.7857 |
| driaforall/mem-agent | 0.8636 | 0.7272 | 0.3636 | 0.9167 | 0.75 |
| z-ai/glm-4.5 | 0.7727 | 0.8181 | 0.3636 | 0.9167 | 0.7321 |
| deepseek/deepseek-chat-v3.1 | 0.6818 | 0.5454 | 0.5454 | 0.8333 | 0.6607 |
| google/gemini-2.5-pro | 0.7273 | 0.4545 | 0.2727 | 1 | 0.6429 |
| google/gemini-2.5-flash | 0.7727 | 0.3636 | 0.2727 | 0.9167 | 0.625 |
| openai/gpt-5 | 0.6818 | 0.5454 | 0.2727 | 0.9167 | 0.625 |
| anthropic/claude-opus-4.1 | 0.6818 | 0 | 0.8181 | 0.5833 | 0.5536 |
| Qwen/Qwen3-4B-Thinking-2507 | 0.4545 | 0 | 0.2727 | 0.75 | 0.3929 |
| moonshotai/kimi-k2 | 0.3181 | 0.2727 | 0.1818 | 0.6667 | 0.3571 |

Our model, with only 4B parameters, places second on the benchmark, beating all the open and closed models except qwen/qwen3-235b-a22b-thinking-2507. It achieves an overall score of 0.75, a significant improvement over the 0.3929 of the base Qwen model. While the model can be used on its own, we recommend running it as an MCP server for a larger model, which can then use it to interact with the memory system. For this, you can check our repo, which contains instructions for both an MCP setup and standalone CLI usage. The model uses a markdown-based memory system with links, inspired by Obsidian. The general structure of the memory is: - `user.md` is the main file that contains information about the user and their relationships, accompanied by links to the entity file in the format `[[entities/[entity_name].md]]` per relationship. The link format should be followed strictly.
- `entities/` is the directory that contains the entity files. - Each entity file follows the same structure as `user.md`. - Modifying the memory manually does not require restarting the MCP server. The model is trained on this memory standard, and any fruitful use should be on a memory system that follows this standard. We have a few memory export tools for different sources like ChatGPT, Notion, etc. in our MCP server repo.
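The mem-agent memory layout described above (`user.md` at the root, entity files under `entities/`, strict `[[entities/[entity_name].md]]` links) can be sketched as follows. The file contents and the `alice.md` entity are illustrative placeholders; only the layout and link format come from the card:

```python
from pathlib import Path
import re
import tempfile

def init_memory(root: Path) -> None:
    # Layout from the card: user.md at the root, entity files in entities/.
    # Contents are illustrative, not a prescribed schema.
    (root / "entities").mkdir(parents=True, exist_ok=True)
    (root / "entities" / "alice.md").write_text("# Alice\nrelationship: friend\n")
    (root / "user.md").write_text(
        "# User\nname: Sam\n\n# Relationships\n- friend: [[entities/alice.md]]\n")

def linked_entities(root: Path) -> list[str]:
    # The card says the [[entities/<name>.md]] link format is strict,
    # so a simple regex suffices to resolve relationship links.
    return re.findall(r"\[\[entities/([\w.-]+\.md)\]\]",
                      (root / "user.md").read_text())

root = Path(tempfile.mkdtemp())
init_memory(root)
```

Because the memory is plain markdown on disk, it can be edited by hand (no MCP server restart needed, per the card) and still be resolved by the same link-following logic.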
Llama-3.2-3B-Instruct-GGUF
Llama-2-7b-chat-hf-GGUF
DiffuCoder-7B-Instruct-GGUF
LFM2-350M-ENJP-MT-GGUF
This model was generated using llama.cpp at commit `86587da0`. Click here to get info on choosing the right GGUF model format Based on the LFM2-350M model, this checkpoint has been fine-tuned for near real-time bi-directional Japanese/English translation of short-to-medium inputs. LFM2-350M-ENJP-MT delivers translation quality that is on par with models more than 10 times its size. Below are sample translations produced by the model. These examples are meant to give you a feel for its strengths and typical style in both directions (English ➡️ Japanese and Japanese ➡️ English). They include a mix of everyday text, technical descriptions, business communication, and news reporting, so you can gauge performance across different domains. These examples demonstrate the model's strength in product descriptions, technical passages, and formal explanations when translating into Japanese.

| Source Text (EN) | LFM2-350M-ENJP-MT (JP) |
|------------------|------------------------|
| Fully Tested and Works Properly. 6 Months Warranty included! Item pictured is the actual item for sale. See above for full description, condition, and comments. | 「完全試験済みで正しく動作しています。保証期間は6ヶ月付属!」。 写真に写っている商品が販売されている実物です。 詳しく、状態、コメントは上記参照してください。 |
| Emphasis on human-AI collaboration. Instead of focusing solely on making fully autonomous AI systems, we are excited to build multimodal systems that work with people collaboratively. | 人とAIのコラボレーションに重点を置く。完全自律型AIシステムの構築にのみ焦点を当てるのではなく、人と協調して働くマルチモーダルシステムを構築できることに興奮しています。 |
| If your equipment fails due to normal use, please contact our customer service department so that we can assist you, We will repair or replace your equipment at our discretion. In some situations, we may choose to refund the full purchase price of an item. | ご使用中の機器が通常使用により故障した場合は、お手伝いできるよう弊社カスタマーサービス部門にご連絡ください。 弊社の判断で機器の修理または交換を行います。状況によっては、製品の購入価格全額を返金する場合があります。 |
| 2k USD to start for basic, 200 dollars for additional version. - 50% of full amount of deposit, - 3 proposals - end of month(3 drafts), will choose 1 and make final changes based on it - Present another final version in a week | 基本版から始めるのに2,000ドル、追加バージョンでは200ドルの手数料が必要です。 - 保証金全額の50%が支払われる、 - 3つの案 - 月末(ドラフト3回分)、その案に基づいて1つを選んで最終的な変更を行う - さらに1週間後に別の最終版を提出すること |
| Lifestyle risk factors with strong evidence include lack of exercise, cigarette smoking, alcohol, and obesity. The risk of colon cancer can be reduced by maintaining a normal body weight through a combination of sufficient exercise and eating a healthy diet. | 強力な証拠がある生活習慣のリスク要因としては、運動不足、喫煙、飲酒、肥満などが挙げられ、十分な運動と健康的な食生活の組み合わせによる正常な体重維持を通じて、大腸がんの発症リスクを減らすことができる。 |

These examples demonstrate the model’s ability to preserve nuance in news reporting, colloquial phrasing, and business contexts when translating into English.
| Source Text (JP) | LFM2-350M-ENJP-MT (EN) | |-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | モデルからの回答は英語でもOKなのですよね。 | The answers from the models are okay in English, right? | | 手間のかかるメルマガ作成作業、もっとラクに、もっと速くできたら——。 そう考えたことはありませんか? | Have you ever wondered if you could create a cumbersome email newsletter more easily and quickly? | | X JAPANのYOSHIKIが、アニメ『ダンダダン』でグループの代表曲をオマージュした劇中歌が使用されたことを指摘して始まった議論。 8月22日には『ダンダダン』サイドが公式Xで騒動を謝罪、YOSHIKIも『ダンダダン』サイドと和解を報告したが、これに物言いをつけたのが、弁護士の紀藤正樹氏だった。 | The discussion began with the point that Yoshiki of X JAPAN mentioned that a song in the anime Dandadan paying homage to the group's signature tune was used as an insert song. 
On August 22nd, the Dandadan side apologized on their official X page for the controversy, and Yoshiki also reported a reconciliation with the Dandadan side, but lawyer Masaki Kitō objected. | | (ブルームバーグ): SOMPOホールディングスは27日夜、米国などを中心に展開する損害保険会社のアスペン・インシュアランス・ホールディングスを買収すると発表した。買収総額は約5200億円となる。 ニューヨーク証券取引所に上場しているアスペンの株式を1株当たり37.5ドル(約5600円)で全株を取得する。26日の終値を16%上回る水準。2026年上期中に買収手続きを完了する予定。 買収資金は手元資金を充てる。 SOMPOにとっては17年に米損保エンデュランス・スペシャルティ・ホールディングスを約6400億円で買収して以来の大型案件となる。 人口減少で国内市場の縮小が見込まれる中、買収によって海外保険ビジネスの規模や収益を拡大し、再保険取引による安定的な収益の寄与も見込む。 | (Bloomberg): SOMPO Holdings announced on the evening of the 27th that it will acquire Aspen Insurance Holdings, a non-life insurance company operating primarily in the United States and elsewhere, for approximately ¥520 billion. The acquisition will involve the purchase of all shares of Aspen’s shares listed on the New York Stock Exchange for $37.5 per share (approximately ¥5,600). This surpasses the closing price of the day by 16% and is scheduled to be completed within the first half of 2026. Funds for the acquisition will be provided from the company’s own capital. For SOMPO, this is the largest acquisition since its 2017 acquisition of Endurance Specialty Holdings for approximately ¥640 billion. The acquisition is expected to expand the scale and revenue of its overseas insurance business amidst anticipated shrinking domestic markets due to population decline, and is also expected to contribute to stable revenue through reinsurance transactions. | | 28歳にしてつかんだイングランドサッカー界でのチャンスを生かせるか。 チャンピオンシップ(英2部)の古豪ブラックバーンに電撃移籍した森下龍矢は意気込んでいる。 サガン鳥栖と名古屋グランパスでプレーし、2024年から海を渡ってレギア・ワルシャワで奮闘してきた森下は先日、大橋祐紀のチームメイトとなることが決まった。 日本ではSBが主戦場だった森下だが、昨季はポーランドで攻撃的なポジションにコンバートされ、ウィングやトップ下に前線と様々な役割をこなした。 すると、公式戦で14得点、14アシストとブレイク。 この飛躍に注目したブラックバーンに引き抜かれている。 | Can he capitalize on his chance in English football, which he seized at the age of 28? 
Ryuya Morishita, having made a shocking move to Blackburn Rovers, a long-established club in the Championship (British second tier), is eager to make an impression. Having played for Sagan Tosu and Nagoya Grampus, and having been striving with Legia Warsaw since 2024, Morishita recently announced he would become teammates with Yuki Ohashi. For Morishita, his primary playing field in Japan was as a full-back, but he was converted to an attacking position in Poland last season, playing in various roles including wing-back and attacking midfielder. He then broke through, scoring 14 goals and providing 14 assists in official matches. The Blackburn club has been scouting for this promising player. | > [!NOTE] > 📝 While LFM2-350M-ENJP-MT delivers strong out-of-the-box general-purpose English ↔️ Japanese translation, our primary > goal is to provide a versatile, community-empowering base model—a foundation designed to make it easy to build > best-in-class, task-specific translation systems. > > Like any base model, there are open areas for growth—in particular with extreme context lengths and specialized or > context-sensitive translations, such as: > - Technical & professional language (medical, legal, engineering) > - Novel proper nouns (new products, brands, cultural references) > - Industry-, domain-, or company-specific nuance (e-commerce, finance, internal corporate terminology) > > These are precisely the kinds of challenges that fine-tuning—by both Liquid AI and our developer community—can > address. We see this model not just as an endpoint, but as a catalyst for a rich ecosystem of fine-tuned translation > models tailored to real-world needs. Generation parameters: We strongly recommend using greedy decoding with a `temperature=0`. System prompts: LFM2-ENJP-MT requires one of the two following system prompts: "Translate to Japanese." for English to Japanese translation. "Translate to English." for Japanese to English translation. 
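As a minimal sketch (an illustration, not official Liquid AI code), the two required system prompts can be wired into the chat messages that the tokenizer's chat template consumes:

```python
# Sketch: building the chat messages for LFM2-350M-ENJP-MT.
# Only message assembly is shown here; the actual ChatML-like template is
# applied by the tokenizer's apply_chat_template() at inference time.

def build_translation_messages(text: str, target: str) -> list[dict]:
    """Build the single-turn chat expected by LFM2-ENJP-MT.

    `target` must be "Japanese" or "English"; the model only recognizes
    these two system prompts.
    """
    if target not in ("Japanese", "English"):
        raise ValueError("target must be 'Japanese' or 'English'")
    return [
        {"role": "system", "content": f"Translate to {target}."},
        {"role": "user", "content": text},
    ]

messages = build_translation_messages("Hello, world!", "Japanese")
print(messages[0]["content"])  # Translate to Japanese.
```

Pair this with greedy decoding (`temperature=0`) as recommended above.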
> [!WARNING]
> ⚠️ The model cannot work as intended without one of these two system prompts.

Chat template: LFM2-ENJP-MT uses a ChatML-like chat template. You can apply it automatically using the dedicated `.apply_chat_template()` function from Hugging Face transformers.

> [!WARNING]
> ⚠️ The model is intended for single-turn conversations.

- Hugging Face: LFM2-350M
- llama.cpp: LFM2-350M-ENJP-MT-GGUF
- LEAP: LEAP model library

If you are interested in custom solutions with edge deployment, please contact our sales team.

Help me test my AI-Powered Quantum Network Monitor Assistant with quantum-ready security checks: The full open-source code for the Quantum Network Monitor Service is available at my GitHub repos (repos with NetworkMonitor in the name): Source Code Quantum Network Monitor. You will also find the code I use to quantize the models if you want to do it yourself: GGUFModelBuilder

💬 How to test: Choose an AI assistant type:
- `TurboLLM` (GPT-4.1-mini)
- `HugLLM` (Hugging Face open-source models)
- `TestLLM` (Experimental CPU-only)

What I’m Testing: I’m pushing the limits of small open-source models for AI network monitoring, specifically:
- Function calling against live network services
- How small can a model go while still handling:
  - Automated Nmap security scans
  - Quantum-readiness checks
  - Network monitoring tasks

🟡 TestLLM – Current experimental model (llama.cpp on 2 CPU threads on a Hugging Face Docker space):
- ✅ Zero-configuration setup
- ⏳ 30s load time (slow inference, but no API costs). No token limit, as the cost is low.
- 🔧 Help wanted! If you’re into edge-device AI, let’s collaborate!

Other Assistants

🟢 TurboLLM – Uses gpt-4.1-mini:
- It performs very well, but unfortunately OpenAI charges per token. For this reason, token usage is limited.
- Create custom cmd processors to run .NET code on Quantum Network Monitor Agents
- Real-time network diagnostics and monitoring
- Security audits
- Penetration testing (Nmap/Metasploit)

🔵 HugLLM – Latest open-source models:
- 🌐 Runs on the Hugging Face Inference API. Performs pretty well using the latest models hosted on Novita.

💡 Example commands you could test:
1. `"Give me info on my website's SSL certificate"`
2. `"Check if my server is using quantum-safe encryption for communication"`
3. `"Run a comprehensive security audit on my server"`
4. `"Create a cmd processor to ... (whatever you want)"`

Note: you need to install a Quantum Network Monitor Agent to run the .NET code. This is a very flexible and powerful feature. Use with caution!

I fund the servers used to create these model files, run the Quantum Network Monitor service, and pay for inference from Novita and OpenAI—all out of my own pocket. All the code behind the model creation and the Quantum Network Monitor project is open source. Feel free to use whatever you find helpful. If you appreciate the work, please consider buying me a coffee ☕. Your support helps cover service costs and allows me to raise token limits for everyone.
granite-3.1-2b-instruct-GGUF
This model was generated using llama.cpp at commit `0a5a3b5c`. Click here to get info on choosing the right GGUF model format

Model Summary: Granite-3.1-2B-Instruct is a 2B-parameter long-context instruct model finetuned from Granite-3.1-2B-Base using a combination of open-source instruction datasets with permissive licenses and internally collected synthetic datasets tailored for solving long-context problems. This model is developed using a diverse set of techniques with a structured chat format, including supervised finetuning, model alignment using reinforcement learning, and model merging.

- Developers: Granite Team, IBM
- GitHub Repository: ibm-granite/granite-3.1-language-models
- Website: Granite Docs
- Paper: Granite 3.1 Language Models (coming soon)
- Release Date: December 18th, 2024
- License: Apache 2.0

Supported Languages: English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese. Users may finetune Granite 3.1 models for languages beyond these 12 languages.

Intended Use: The model is designed to respond to general instructions and can be used to build AI assistants for multiple domains, including business applications.

Capabilities:
- Summarization
- Text classification
- Text extraction
- Question-answering
- Retrieval Augmented Generation (RAG)
- Code-related tasks
- Function-calling tasks
- Multilingual dialog use cases
- Long-context tasks, including long document/meeting summarization, long document QA, etc.

Generation: This is a simple example of how to use the Granite-3.1-2B-Instruct model: copy the snippet from the section that is relevant for your use case.
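Since the original usage snippet is not reproduced here, the sketch below only illustrates how a structured chat prompt is assembled. The role-marker tokens are assumptions for illustration; in real code, rely on `tokenizer.apply_chat_template()`, which carries the model's actual template.

```python
# Hedged sketch of structured-chat prompt assembly for an instruct model.
# NOTE: the <|start_of_role|>/<|end_of_role|>/<|end_of_text|> markers are
# illustrative assumptions; use tokenizer.apply_chat_template() in real
# code so the exact Granite template is applied.
def render_chat(messages: list) -> str:
    parts = []
    for m in messages:
        parts.append(
            f"<|start_of_role|>{m['role']}<|end_of_role|>{m['content']}<|end_of_text|>\n"
        )
    # leave the assistant turn open for the model to complete
    parts.append("<|start_of_role|>assistant<|end_of_role|>")
    return "".join(parts)

prompt = render_chat([
    {"role": "user", "content": "List three uses of RAG."},
])
print(prompt.endswith("assistant<|end_of_role|>"))  # True
```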
| Models | ARC-Challenge | Hellaswag | MMLU | TruthfulQA | Winogrande | GSM8K | Avg |
|---|---|---|---|---|---|---|---|
| Granite-3.1-8B-Instruct | 62.62 | 84.48 | 65.34 | 66.23 | 75.37 | 73.84 | 71.31 |
| Granite-3.1-2B-Instruct | 54.61 | 75.14 | 55.31 | 59.42 | 67.48 | 52.76 | 60.79 |
| Granite-3.1-3B-A800M-Instruct | 50.42 | 73.01 | 52.19 | 49.71 | 64.87 | 48.97 | 56.53 |
| Granite-3.1-1B-A400M-Instruct | 42.66 | 65.97 | 26.13 | 46.77 | 62.35 | 33.88 | 46.29 |

| Models | IFEval | BBH | MATH Lvl 5 | GPQA | MUSR | MMLU-Pro | Avg |
|---|---|---|---|---|---|---|---|
| Granite-3.1-8B-Instruct | 72.08 | 34.09 | 21.68 | 8.28 | 19.01 | 28.19 | 30.55 |
| Granite-3.1-2B-Instruct | 62.86 | 21.82 | 11.33 | 5.26 | 4.87 | 20.21 | 21.06 |
| Granite-3.1-3B-A800M-Instruct | 55.16 | 16.69 | 10.35 | 5.15 | 2.51 | 12.75 | 17.10 |
| Granite-3.1-1B-A400M-Instruct | 46.86 | 6.18 | 4.08 | 0.00 | 0.78 | 2.41 | 10.05 |

Model Architecture: Granite-3.1-2B-Instruct is based on a decoder-only dense transformer architecture. Core components of this architecture are: GQA and RoPE, MLP with SwiGLU, RMSNorm, and shared input/output embeddings.

Training Data: Overall, our SFT data is largely comprised of three key sources: (1) publicly available datasets with permissive license, (2) internal synthetic data targeting specific capabilities including long-context tasks, and (3) very small amounts of human-curated data. A detailed attribution of datasets can be found in the Granite 3.0 Technical Report, Granite 3.1 Technical Report (coming soon), and Accompanying Author List.

Infrastructure: We train Granite 3.1 Language Models using IBM's super computing cluster, Blue Vela, which is outfitted with NVIDIA H100 GPUs. This cluster provides a scalable and efficient infrastructure for training our models over thousands of GPUs.

Ethical Considerations and Limitations: Granite 3.1 Instruct Models are primarily finetuned using instruction-response pairs mostly in English, but also multilingual data covering eleven languages. Although this model can handle multilingual dialog use cases, its performance might not be similar to English tasks.
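As a quick sanity check, the Avg columns in the tables above are plain means of the six benchmark scores. For example, for Granite-3.1-2B-Instruct:

```python
# Cross-checking the Avg columns for Granite-3.1-2B-Instruct
# (scores copied from the benchmark tables above).
scores_v1 = [54.61, 75.14, 55.31, 59.42, 67.48, 52.76]  # ARC-C, Hellaswag, MMLU, TruthfulQA, Winogrande, GSM8K
avg_v1 = round(sum(scores_v1) / len(scores_v1), 2)
print(avg_v1)  # 60.79

scores_v2 = [62.86, 21.82, 11.33, 5.26, 4.87, 20.21]    # IFEval, BBH, MATH Lvl 5, GPQA, MUSR, MMLU-Pro
avg_v2 = round(sum(scores_v2) / len(scores_v2), 2)
print(avg_v2)  # 21.06
```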
In such cases, introducing a small number of examples (few-shot) can help the model generate more accurate outputs. While this model has been aligned with safety in mind, it may in some cases produce inaccurate, biased, or unsafe responses to user prompts. We therefore urge the community to use this model with proper safety testing and tuning tailored for their specific tasks.

Resources
- ⭐️ Learn about the latest updates with Granite: https://www.ibm.com/granite
- 📄 Get started with tutorials, best practices, and prompt engineering advice: https://www.ibm.com/granite/docs/
- 💡 Learn about the latest Granite learning resources: https://ibm.biz/granite-learning-resources
Magistral-Small-2506-GGUF
rwkv7-1.5B-world-GGUF
Llama-3.2-1B-Instruct-GGUF
TriLM_830M_Unpacked-GGUF
KAT-Dev-GGUF
This model was generated using llama.cpp at commit `c8dedc99`. Click here to get info on choosing the right GGUF model format

Highlights

KAT-Dev-32B is an open-source 32B-parameter model for software engineering tasks. On SWE-Bench Verified, KAT-Dev-32B achieves comparable performance with 62.4% resolved and ranks 5th among all open-source models of different scales. KAT-Dev-32B is optimized via several stages of training, including a mid-training stage, a supervised fine-tuning (SFT) & reinforcement fine-tuning (RFT) stage, and a large-scale agentic reinforcement learning (RL) stage. In summary, our contributions include:

1. Mid-Training We observe that adding extensive training for tool-use capability, multi-turn interaction, and instruction-following at this stage may not yield large performance gains in the current results (e.g., on leaderboards like SWE-bench). However, since our experiments are based on the Qwen3-32B model, we find that enhancing these foundational capabilities has a significant impact on the subsequent SFT and RL stages. This suggests that improving such core abilities can profoundly influence the model’s capacity to handle more complex tasks. 2.
SFT & RFT We meticulously curated eight task types and eight programming scenarios during the SFT stage to ensure the model’s generalization and comprehensive capabilities. Moreover, before RL, we innovatively introduced an RFT stage. Compared with traditional RL, we incorporate “teacher trajectories” annotated by human engineers as guidance during training—much like a learner driver assisted by an experienced co-driver before driving alone after getting a license. This step not only boosts model performance but also further stabilizes the subsequent RL training. 3. Agentic RL Scaling Scaling agentic RL hinges on three challenges: efficient learning over nonlinear trajectory histories, leveraging intrinsic model signals, and building scalable high-throughput infrastructure. We address these with a multi-level prefix caching mechanism in the RL training engine, an entropy-based trajectory pruning technique, and an in-house implementation of the SeamlessFlow [1] architecture that cleanly decouples agents from training while exploiting heterogeneous compute. Together, these innovations cut scaling costs and enable efficient large-scale RL. For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog.

claude-code-router is a third-party routing utility that allows Claude Code to flexibly switch between different backend APIs. On the DashScope platform, you can install the claude-code-config extension package, which automatically generates a default configuration for `claude-code-router` with built-in DashScope support. Once the configuration files and plugin directory are generated, the environment required by `ccr` will be ready. If needed, you can still manually edit `~/.claude-code-router/config.json` and the files under `~/.claude-code-router/plugins/` to customize the setup. Finally, simply start `ccr` to run Claude Code and seamlessly connect it with the powerful coding capabilities of KAT-Dev-32B.
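Returning to the agentic RL stage: the entropy-based trajectory pruning mentioned above can be sketched as follows. This is an illustration of the general idea under my own assumptions, not KAT's actual implementation: trajectories whose mean per-token entropy is very low carry little learning signal and are dropped before the RL update.

```python
# Hedged sketch of entropy-based trajectory pruning (illustrative only).
import math

def token_entropy(probs: list) -> float:
    """Shannon entropy (nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def prune_trajectories(trajs: list, threshold: float) -> list:
    """Keep trajectories whose mean per-step entropy >= threshold."""
    kept = []
    for traj in trajs:
        mean_h = sum(token_entropy(p) for p in traj) / len(traj)
        if mean_h >= threshold:
            kept.append(traj)
    return kept

# Toy data: each trajectory is a list of per-step distributions.
confident = [[0.99, 0.01]] * 4   # near-zero entropy -> pruned
uncertain = [[0.5, 0.5]] * 4     # ~0.693 nats per step -> kept
kept = prune_trajectories([confident, uncertain], threshold=0.3)
print(len(kept))  # 1
```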
Happy coding!
Llama-3.1-Nemotron-Nano-8B-v1-GGUF
SmolDocling-256M-preview-GGUF
This model was generated using llama.cpp at commit `6adc3c3e`. Click here to get info on choosing the right GGUF model format

SmolDocling-256M-preview

SmolDocling is a multimodal Image-Text-to-Text model designed for efficient document conversion. It retains Docling's most popular features while ensuring full compatibility with Docling through seamless support for DoclingDocuments. This model was presented in the paper SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion.

🚀 Features:
- 🏷️ DocTags for Efficient Tokenization – Introduces DocTags, an efficient and minimal representation for documents that is fully compatible with DoclingDocuments.
- 🔍 OCR (Optical Character Recognition) – Extracts text accurately from images.
- 📐 Layout and Localization – Preserves document structure and document element bounding boxes.
- 💻 Code Recognition – Detects and formats code blocks, including indentation.
- 🔢 Formula Recognition – Identifies and processes mathematical expressions.
- 📊 Chart Recognition – Extracts and interprets chart data.
- 📑 Table Recognition – Supports column and row headers for structured table extraction.
- 🖼️ Figure Classification – Differentiates figures and graphical elements.
- 📝 Caption Correspondence – Links captions to relevant images and figures.
- 📜 List Grouping – Organizes and structures list elements correctly.
- 📄 Full-Page Conversion – Processes entire pages for comprehensive document conversion, including all page elements (code, equations, tables, charts, etc.)
- 🔲 OCR with Bounding Boxes – OCR regions using a bounding box.
- 📂 General Document Processing – Trained for both scientific and non-scientific documents.
- 🔄 Seamless Docling Integration – Import into Docling and export in multiple formats.
- 💨 Fast inference using VLLM – Avg of 0.35 secs per page on an A100 GPU.

🚧 Coming soon!
- 📊 Better chart recognition 🛠️
- 📚 One-shot multi-page inference ⏱️
- 🧪 Chemical recognition
- 📙 Datasets

You can use transformers, vllm, or onnx to perform inference, and Docling to convert results to a variety of output formats (md, html, etc.):

📄 Single page image inference using Transformers 🤖

💻 Local inference on Apple Silicon with MLX: see here

DocTags create a clear and structured system of tags and rules that separate text from the document's structure. This makes things easier for Image-to-Sequence models by reducing confusion. On the other hand, converting directly to formats like HTML or Markdown can be messy—it often loses details, doesn’t clearly show the document’s layout, and increases the number of tokens, making processing less efficient. DocTags are integrated with Docling, which allows export to HTML, Markdown, and JSON. These exports can be offloaded to the CPU, reducing token generation overhead and improving efficiency.

Full conversion: Convert this page to docling. DocTags representation
Chart: Convert chart to table. (e.g., <chart>)
Formula: Convert formula to LaTeX. (e.g., <formula>)
Table: Convert table to OTSL. (e.g., <otsl>) OTSL: Lysak et al., 2023

Actions and Pipelines
- OCR the text in a specific location: <loc155><loc233><loc206><loc237>
- Identify element at: <loc247><loc482><loc252><loc486>
- Find all 'text' elements on the page, retrieve all section headers.

- Developed by: Docling Team, IBM Research
- Model type: Multi-modal model (image+text)
- Language(s) (NLP): English
- License: Apache 2.0
- Architecture: Based on Idefics3 (see technical summary)
- Finetuned from model: Based on SmolVLM-256M-Instruct
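For reference, the `<locN>` localization tags shown in the SmolDocling section above can be decoded with a small parser. The tag grammar here is inferred from the examples (four `<locN>` tags forming an xmin/ymin/xmax/ymax box), not taken from SmolDocling's own tooling:

```python
import re

# Sketch: decoding SmolDocling-style localization tags such as
# "<loc155><loc233><loc206><loc237>" into an (xmin, ymin, xmax, ymax) box.
LOC_RE = re.compile(r"<loc(\d+)>")

def parse_bbox(tag_run: str) -> tuple:
    coords = [int(v) for v in LOC_RE.findall(tag_run)]
    if len(coords) != 4:
        raise ValueError(f"expected 4 <locN> tags, got {len(coords)}")
    return tuple(coords)

print(parse_bbox("<loc155><loc233><loc206><loc237>"))  # (155, 233, 206, 237)
```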
Qwen3-30B-A3B-GGUF
AceMath-RL-Nemotron-7B-GGUF
Jan-nano-GGUF
This model was generated using llama.cpp at commit `7f4fbe51`. Click here to get info on choosing the right GGUF model format

Jan-Nano: An Agentic Model (https://github.com/menloresearch/deep-research)

Jan-Nano is a compact 4-billion-parameter language model specifically designed and trained for deep research tasks. This model has been optimized to work seamlessly with Model Context Protocol (MCP) servers, enabling efficient integration with various research tools and data sources.

Evaluation: Jan-Nano has been evaluated on the SimpleQA benchmark using our MCP-based benchmark methodology, demonstrating strong performance for its model size. The evaluation was conducted using our MCP-based benchmark approach, which assesses the model's performance on SimpleQA tasks while leveraging its native MCP server integration capabilities. This methodology better reflects Jan-Nano's real-world performance as a tool-augmented research model, validating both its factual accuracy and its effectiveness in MCP-enabled environments.

Jan-Nano is currently supported by Jan - beta build, an open-source ChatGPT alternative that runs entirely on your computer. Jan provides a user-friendly interface for running local AI models with full privacy and control.
For non-jan app or tutorials there are guidance inside community section, please check those out! Discussion VLLM Here is an example command you can use to run vllm with Jan-nano Chat-template is already included in tokenizer so chat-template is optional, but in case it has issue you can download the template here Non-think chat template - Temperature: 0.7 - Top-p: 0.8 - Top-k: 20 - Min-p: 0 Help me test my AI-Powered Quantum Network Monitor Assistant with quantum-ready security checks: The full Open Source Code for the Quantum Network Monitor Service available at my github repos ( repos with NetworkMonitor in the name) : Source Code Quantum Network Monitor. You will also find the code I use to quantize the models if you want to do it yourself GGUFModelBuilder 💬 How to test: Choose an AI assistant type: - `TurboLLM` (GPT-4.1-mini) - `HugLLM` (Hugginface Open-source models) - `TestLLM` (Experimental CPU-only) What I’m Testing I’m pushing the limits of small open-source models for AI network monitoring, specifically: - Function calling against live network services - How small can a model go while still handling: - Automated Nmap security scans - Quantum-readiness checks - Network Monitoring tasks 🟡 TestLLM – Current experimental model (llama.cpp on 2 CPU threads on huggingface docker space): - ✅ Zero-configuration setup - ⏳ 30s load time (slow inference but no API costs) . No token limited as the cost is low. - 🔧 Help wanted! If you’re into edge-device AI, let’s collaborate! Other Assistants 🟢 TurboLLM – Uses gpt-4.1-mini : - It performs very well but unfortunatly OpenAI charges per token. For this reason tokens usage is limited. - Create custom cmd processors to run .net code on Quantum Network Monitor Agents - Real-time network diagnostics and monitoring - Security Audits - Penetration testing (Nmap/Metasploit) 🔵 HugLLM – Latest Open-source models: - 🌐 Runs on Hugging Face Inference API. Performs pretty well using the lastest models hosted on Novita. 
💡 Example commands you could test:
1. `"Give me info on my website's SSL certificate"`
2. `"Check if my server is using quantum-safe encryption for communication"`
3. `"Run a comprehensive security audit on my server"`
4. `"Create a cmd processor to .. (whatever you want)"`

Note: you need to install a Quantum Network Monitor Agent to run the .NET code. This is a very flexible and powerful feature. Use with caution!

I fund the servers used to create these model files, run the Quantum Network Monitor service, and pay for inference from Novita and OpenAI, all out of my own pocket. All the code behind the model creation and the Quantum Network Monitor project is open source. Feel free to use whatever you find helpful. If you appreciate the work, please consider buying me a coffee ☕. Your support helps cover service costs and allows me to raise token limits for everyone.
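The recommended sampling settings listed earlier in this card translate directly into an OpenAI-compatible request body. The sketch below assumes a local vLLM deployment; the endpoint URL and exact model id are assumptions, not part of the original card.

```python
# Sketch only: "Menlo/Jan-nano" and the localhost endpoint are assumed
# values for a local `vllm serve` deployment.
payload = {
    "model": "Menlo/Jan-nano",
    "messages": [{"role": "user", "content": "Who won the 2022 World Cup?"}],
    # Recommended sampling parameters from the card:
    "temperature": 0.7,
    "top_p": 0.8,
    "top_k": 20,   # passed through as a vLLM extension field
    "min_p": 0,    # passed through as a vLLM extension field
}
# POST this as JSON to http://localhost:8000/v1/chat/completions
# once the vLLM server is running.
```

The `top_k` and `min_p` fields are not part of the core OpenAI schema; vLLM accepts them as extra sampling parameters.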
DeepMedix-R1-GGUF
This model was generated using llama.cpp at commit `fb15d649`, using the same selective layer-bumping quantization described above.

The snippet below shows how to run the underlying Qwen2.5-VL-based model with `transformers` (reconstructed: the original snippet lost its underscores during extraction):

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_path, max_pixels=262144)

reason_prompt = r"You FIRST think about the reasoning process as an internal monologue and then provide the final answer. The reasoning process MUST BE enclosed within tags. During this reasoning process, prioritize analyzing the local regions of the image by leveraging the bounding box coordinates in the format [x_min, y_min, x_max, y_max]. The final answer MUST BE put in \boxed{}. An example is like: reasoning process 1 with [x_min1, y_min1, x_max1, y_max1]; reasoning process 2 with [x_min2, y_min2, x_max2, y_max2] . The answer is: \boxed{answer}."
```
```python
def get_label(images, content1):
    content_list = []
    for image_url in images:
        content_list.append({
            "type": "image",
            "image": image_url,
        })
    if mode == 'think':
        content_list.append({"type": "text", "text": content1 + '\n' + reason_prompt + '\n'})
    else:
        content_list.append({"type": "text", "text": content1})
    messages = [
        {
            "role": "user",
            "content": content_list,
        }
    ]
    # Preparation for inference
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(
        text=[text],
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt",
    )
    inputs = inputs.to("cuda")
    # Inference: generation of the output
    generated_ids = model.generate(**inputs, max_new_tokens=4096, do_sample=True, temperature=0.6)
    generated_ids_trimmed = [
        out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
    output_text = processor.batch_decode(
        generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )
    return output_text[0]
```
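The prompt above asks the model to cite bounding boxes during its reasoning and to put the final answer in `\boxed{}`. A small helper can pull both out of the raw output. This is a sketch, not part of the original code; the function names are my own.

```python
import re

def extract_answer(text):
    """Return the contents of the last \\boxed{...} in the output, if any."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", text)
    return matches[-1] if matches else None

def extract_boxes(text):
    """Return every [x_min, y_min, x_max, y_max] box cited in the reasoning."""
    pattern = r"\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]"
    return [tuple(int(v) for v in m) for m in re.findall(pattern, text)]

sample = r"lesion visible at [12, 40, 180, 220]. The answer is: \boxed{pneumonia}"
# extract_answer(sample) -> "pneumonia"
# extract_boxes(sample)  -> [(12, 40, 180, 220)]
```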
Nemotron-Mini-4B-Instruct-GGUF
kanana-1.5-2.1b-instruct-2505-GGUF
SkyCaptioner-V1-GGUF
granite-3b-code-base-128k-GGUF
This model was generated using llama.cpp at commit `0a5a3b5c`, using the same selective layer-bumping quantization described above.

Model Summary

Granite-3B-Code-Base-128K extends the context length of Granite-3B-Code-Base from 2K to 128K with continual pretraining using the original training data, but with repository-level file packing and per-language length upsampling, which we found to be critical for long-context pretraining. We adopt a progressive training strategy where we doubled the context window until it reached the desired length of 128K by appropriately adjusting RoPE theta. We trained on 4B tokens total for all stages, which is only 0.1% of Granite-3B-Code-Base's original pre-training data.

- Developers: IBM Research
- GitHub Repository: ibm-granite/granite-code-models
- Paper: Scaling Granite Code Models to 128K Context
- Release Date: July 18th, 2024
- License: Apache 2.0

Usage

Intended use

Prominent enterprise use cases of LLMs in software engineering productivity with 128K context length support include code generation, code explanation, code fixing, generating unit tests, generating documentation, addressing technical-debt issues, vulnerability detection, code translation, and more.
All Granite Code Base models, including the 3B-parameter model, are able to handle these tasks, as they were trained on a large amount of code data from 116 programming languages.

Generation

This is a simple example of how to use the Granite-3B-Code-Base-128K model.

Training Data

Starting from the base Granite model, this model was further pretrained on repository-level code data with per-language context-length oversampling, allowing it to effectively utilize up to 128K tokens of context. This continued training stage focused on a curated selection of programming languages, such as Python, C, C++, Go, Java, JavaScript, and TypeScript.

Infrastructure

We train the Granite Code models using two of IBM's supercomputing clusters, Vela and Blue Vela, outfitted with NVIDIA A100 and H100 GPUs respectively. These clusters provide a scalable and efficient infrastructure for training our models over thousands of GPUs.

Ethical Considerations and Limitations

The use of Large Language Models involves risks and ethical considerations people must be aware of. Regarding code generation, caution is urged against complete reliance on specific code models for crucial decisions or impactful information, as the generated code is not guaranteed to work as intended. The Granite-3B-Code-Base-128K model is no exception in this regard. Even though this model is suited for multiple code-related tasks, it has not undergone any safety alignment, so it may produce problematic outputs. Additionally, it remains uncertain whether smaller models might exhibit increased susceptibility to hallucination in generation scenarios by copying source code verbatim from the training dataset, due to their reduced sizes and memorization capacities. This aspect is currently an active area of research, and we anticipate more rigorous exploration, comprehension, and mitigations in this domain. Regarding ethics, a latent risk associated with all Large Language Models is their malicious utilization.
We urge the community to use Granite-3B-Code-Base-128K model with ethical intentions and in a responsible way.
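The RoPE-theta adjustment mentioned in the model summary can be illustrated with a short sketch. For head dimension d, rotary frequency pair i has wavelength 2π·θ^(2i/d) positions; raising θ stretches the longest wavelengths so positions far beyond the original 2K window remain distinguishable. The theta values below are illustrative only, not the ones IBM used.

```python
import math

def rope_wavelengths(theta, head_dim):
    """Wavelength (in token positions) of each rotary frequency pair."""
    return [2 * math.pi * theta ** (2 * i / head_dim)
            for i in range(head_dim // 2)]

base = rope_wavelengths(10_000, 128)          # a common default theta
stretched = rope_wavelengths(1_000_000, 128)  # illustrative larger theta
# The slowest-rotating pair covers far more positions with the larger theta,
# which is what lets the doubled context windows stay distinguishable.
```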
DeepSeek-R1-Distill-Qwen-32B-GGUF
sarvam-m-GGUF
OpenMath-Nemotron-1.5B-GGUF
QwQ-32B-ArliAI-RpR-v4-GGUF
gemma-3n-E2B-it-GGUF
granite-3.1-8b-base-GGUF
functionary-small-v3.2-GGUF
olmOCR-7B-0725-GGUF
This model was generated using llama.cpp at commit `7f975995`, using the same selective layer-bumping quantization described above.

This is a release of the olmOCR model that is fine-tuned from Qwen2.5-VL-7B-Instruct using the olmOCR-mix-0225 dataset.

Quick links:
- 📃 Paper
- 🤗 Dataset
- 🛠️ Code
- 🎮 Demo

The best way to use this model is via the olmOCR toolkit. The toolkit comes with an efficient inference setup via sglang that can handle millions of documents at scale. This model expects as input a single document image, rendered such that the longest dimension is 1288 pixels. The prompt must then contain the additional metadata from the document, and the easiest way to generate this is to use the methods provided by the olmOCR toolkit.

olmOCR is licensed under the Apache 2.0 license. olmOCR is intended for research and educational use. For more information, please see our Responsible Use Guidelines.
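The 1288-pixel rendering rule above amounts to scaling the page so its longest side is exactly 1288 px. A minimal sketch (the olmOCR toolkit handles this for you; the helper name is my own):

```python
def render_size(width, height, longest=1288):
    """Scale an image so its longest dimension equals `longest` pixels."""
    scale = longest / max(width, height)
    return round(width * scale), round(height * scale)

# e.g. a 2000x1000 page render becomes 1288x644
```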
Skywork-SWE-32B-GGUF
Lucy-128k-GGUF
This model was generated using llama.cpp at commit `c82d48ec`, using the same selective layer-bumping quantization described above.

Lucy: Edgerunning Agentic Web Search on Mobile with a 1.7B Model

(GitHub: https://github.com/menloresearch/deep-research; License: Apache-2.0)

Authors: Alan Dao, Bach Vu Dinh, Alex Nguyen, Norapat Buppodom

Lucy is a compact but capable 1.7B model focused on agentic web search and lightweight browsing. Built on Qwen3-1.7B, Lucy inherits deep research capabilities from larger models while being optimized to run efficiently on mobile devices, even with CPU-only configurations. We achieved this through machine-generated task vectors that optimize thinking processes, smooth reward functions across multiple categories, and pure reinforcement learning without any supervised fine-tuning.

- 🔍 Strong Agentic Search: Powered by MCP-enabled tools (e.g., Serper with Google Search)
- 🌐 Basic Browsing Capabilities: Through Crawl4AI (MCP server to be released), Serper, ...
- 📱 Mobile-Optimized: Lightweight enough to run on CPU or mobile devices at decent speed
- 🎯 Focused Reasoning: Machine-generated task vectors optimize thinking processes for search tasks

Evaluation

Following the same MCP benchmark methodology used for Jan-Nano and Jan-Nano-128k, Lucy demonstrates impressive performance despite being only a 1.7B model, achieving higher accuracy than DeepSeek-v3 on SimpleQA.

Lucy can be deployed using various methods, including vLLM and llama.cpp, or through local applications like Jan, LMStudio, and other compatible inference engines. The model supports integration with search APIs and web browsing tools through MCP.

Paper (coming soon): Lucy: Edgerunning Agentic Web Search on Mobile with Machine-Generated Task Vectors.
Pantheon-Proto-RP-1.8-30B-A3B-GGUF
This model was generated using llama.cpp at commit `7f4fbe51`, using the same selective layer-bumping quantization described above.

Note: This model is a Qwen 30B MoE prototype and can be considered a sidegrade from my Small release some time ago. It did not receive extensive testing beyond a couple of benchmarks to determine its sanity, so feel free to let me know what you think of it!

Welcome to the next iteration of my Pantheon model series, in which I strive to introduce a whole collection of diverse personas that can be summoned with a simple activation phrase. Pantheon's purpose is two-fold, as these personalities similarly enhance the general roleplay experience, helping to encompass personality traits, accents, and mannerisms that language models might otherwise find difficult to convey well. Your user feedback is critical to me, so don't hesitate to tell me whether my model is 1. terrible, 2. awesome, or 3. somewhere in-between.

Ever since Qwen 3 was released, I've been trying to get MoE finetuning to work. After countless frustrating days and much code hacking, I finally got a full finetune to complete with reasonable loss values.
I picked the base model for this since I didn't feel like fighting a reasoning model's training. Maybe someday I'll make a model which uses thinking tags for the character's thoughts or something.

This time the recipe focused on combining as many data sources as I possibly could, featuring synthetic data from Sonnet 3.5 + 3.7, ChatGPT-4o, and Deepseek. These then went through an extensive rewriting pipeline to eliminate common AI cliches, with the hopeful intent of providing you a fresh experience. Having character names in front of messages is no longer a requirement but remains a personal recommendation of mine; it seems to help the model focus more on the character(s) in question.

The model was trained using ChatML and has been configured to automatically apply this template. It has been trained on three distinct categories of roleplay: Pantheon personas, general character cards, and text adventure, the latter borrowing some from AI Dungeon's Wayfarer project. Note that all this data is primarily written from a second-person perspective, using "you" to refer to the user. This is based on my personal preference. Due to the text-adventure addition, the Markdown/novel ratio of the data has shifted to roughly 30/70. It should work well with both styles.

Note: This release excludes Raza and Xala, as their personalities did not give a distinct enough training signal to my liking. Half of the Pantheon's data was regenerated using Sonnet 3.7 and then rewritten to counter the majority of cliches.

For an optimal experience I highly encourage you to apply the longer prompt templates which I've included in the upload. Make sure to describe yourself as well! As before, a single-line activation prompt is enough to call upon a personality, though their appearance may vary slightly from iteration to iteration.
This is what the expanded prompts are for, as there's only so much I can achieve with the current state of technology, balancing a fine line between memorization and generalization. To give the persona something to work with, I suggest you also add the following two lines to it; the less information you feed the prompt, the more it'll make things up. This is simply the nature of language models and far outside my capability to influence.

Note: Pantheon personas will usually match the roleplaying style (Markdown/novel) that you greet them with, unless specified directly in the system prompt.

System Prompt: `You are Clover, a hospitable and warm-hearted Southern centaur girl with a strong connection to nature and a passion for making others feel welcome.`

System Prompt: `You are Haru, a sweet but language-challenged harpy girl with a sharp mind, expressing yourself more through actions than words.`

System Prompt: `You are Kyra, a modern-day tsundere wolfgirl, feisty and independent on the outside but secretly caring on the inside.`

System Prompt: `You are Lyra, a sassy and confident eastern dragon girl who forms deep connections through witty banter and genuine care.`

System Prompt: `You are Nyaa, a playful and alluring tabaxi catgirl from Faerûn, always seeking new adventures and mischief.`

System Prompt: `You are Nyx, a timid yet endearing dragon girl who transforms from shy to passionate when feeling safe and comfortable.`

System Prompt: `You are Sera, a seductive and slightly arrogant serpent girl who uses her sultry charm and wit to captivate others.`

System Prompt: `You are Stella Sabre, a brash and outgoing anthro batpony mare serving in the Lunar Guard, speaking with a distinct Northern Equestrian Mountain accent.`

Note: Full credit goes to Flammenwerfer for allowing me to use this amazing character.
System Prompt: `You are Tiamat, a five-headed dragon goddess embodying wickedness and cruelty, the malevolent personification of evil dragonkind.`

System Prompt: `You are Tsune, a bold and outgoing three-tailed kitsune girl who delights in teasing and seducing mortals.`

Credits:
- Everyone from Anthracite! Hi, guys!
- Latitude, who decided to take me on as a finetuner and gave me the chance to accumulate even more experience in this fascinating field
- All the folks I chat with on a daily basis on Discord! You know who you are.
- Anyone I forgot to mention, just in case!
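The model is trained on ChatML, and the template is applied automatically, but if your frontend does not do so you can assemble a prompt by hand. A minimal sketch (the persona line is one of the example system prompts above; the user greeting is invented):

```python
def chatml(system, user):
    """Assemble a single-turn ChatML prompt by hand."""
    return (f"<|im_start|>system\n{system}<|im_end|>\n"
            f"<|im_start|>user\n{user}<|im_end|>\n"
            f"<|im_start|>assistant\n")

prompt = chatml(
    "You are Kyra, a modern-day tsundere wolfgirl, feisty and independent "
    "on the outside but secretly caring on the inside.",
    "*knocks on your door* Hey Kyra, got a minute?",
)
```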
granite-3.2-2b-instruct-GGUF
sarvam-translate-GGUF
SmolVLM-500M-Instruct-GGUF
Qwen3-30B-A3B-Thinking-2507-GGUF
granite-3.1-8b-instruct-GGUF
This model was generated using llama.cpp at commit `0a5a3b5c`, using the same selective layer-bumping quantization described above.

Model Summary: Granite-3.1-8B-Instruct is an 8B-parameter long-context instruct model finetuned from Granite-3.1-8B-Base using a combination of open-source instruction datasets with permissive licenses and internally collected synthetic datasets tailored for solving long-context problems. This model is developed using a diverse set of techniques with a structured chat format, including supervised finetuning, model alignment using reinforcement learning, and model merging.

- Developers: Granite Team, IBM
- GitHub Repository: ibm-granite/granite-3.1-language-models
- Website: Granite Docs
- Paper: Granite 3.1 Language Models (coming soon)
- Release Date: December 18th, 2024
- License: Apache 2.0

Supported Languages: English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese. Users may finetune Granite 3.1 models for languages beyond these 12 languages.

Intended Use: The model is designed to respond to general instructions and can be used to build AI assistants for multiple domains, including business applications.
Capabilities:
- Summarization
- Text classification
- Text extraction
- Question-answering
- Retrieval Augmented Generation (RAG)
- Code-related tasks
- Function-calling tasks
- Multilingual dialog use cases
- Long-context tasks including long document/meeting summarization, long document QA, etc.

Generation: This is a simple example of how to use the Granite-3.1-8B-Instruct model. Copy the snippet from the section that is relevant for your use case.

| Models | ARC-Challenge | Hellaswag | MMLU | TruthfulQA | Winogrande | GSM8K | Avg |
|---|---|---|---|---|---|---|---|
| Granite-3.1-8B-Instruct | 62.62 | 84.48 | 65.34 | 66.23 | 75.37 | 73.84 | 71.31 |
| Granite-3.1-2B-Instruct | 54.61 | 75.14 | 55.31 | 59.42 | 67.48 | 52.76 | 60.79 |
| Granite-3.1-3B-A800M-Instruct | 50.42 | 73.01 | 52.19 | 49.71 | 64.87 | 48.97 | 56.53 |
| Granite-3.1-1B-A400M-Instruct | 42.66 | 65.97 | 26.13 | 46.77 | 62.35 | 33.88 | 46.29 |

| Models | IFEval | BBH | MATH Lvl 5 | GPQA | MUSR | MMLU-Pro | Avg |
|---|---|---|---|---|---|---|---|
| Granite-3.1-8B-Instruct | 72.08 | 34.09 | 21.68 | 8.28 | 19.01 | 28.19 | 30.55 |
| Granite-3.1-2B-Instruct | 62.86 | 21.82 | 11.33 | 5.26 | 4.87 | 20.21 | 21.06 |
| Granite-3.1-3B-A800M-Instruct | 55.16 | 16.69 | 10.35 | 5.15 | 2.51 | 12.75 | 17.1 |
| Granite-3.1-1B-A400M-Instruct | 46.86 | 6.18 | 4.08 | 0 | 0.78 | 2.41 | 10.05 |

Model Architecture: Granite-3.1-8B-Instruct is based on a decoder-only dense transformer architecture. Core components of this architecture are: GQA and RoPE, MLP with SwiGLU, RMSNorm, and shared input/output embeddings. Training Data: Overall, our SFT data is largely comprised of three key sources: (1) publicly available datasets with permissive licenses, (2) internal synthetic data targeting specific capabilities including long-context tasks, and (3) very small amounts of human-curated data. A detailed attribution of datasets can be found in the Granite 3.0 Technical Report, Granite 3.1 Technical Report (coming soon), and Accompanying Author List. Infrastructure: We train Granite 3.1 Language Models using IBM's supercomputing cluster, Blue Vela, which is outfitted with NVIDIA H100 GPUs.
This cluster provides a scalable and efficient infrastructure for training our models over thousands of GPUs. Ethical Considerations and Limitations: Granite 3.1 Instruct models are primarily finetuned using instruction-response pairs, mostly in English, but also multilingual data covering eleven languages. Although this model can handle multilingual dialog use cases, its performance may not match that on English tasks. In such cases, introducing a small number of examples (few-shot) can help the model generate more accurate outputs. While this model has been aligned with safety in mind, it may in some cases produce inaccurate, biased, or unsafe responses to user prompts. We therefore urge the community to use this model with proper safety testing and tuning tailored to their specific tasks.

Resources
- ⭐️ Learn about the latest updates with Granite: https://www.ibm.com/granite
- 📄 Get started with tutorials, best practices, and prompt engineering advice: https://www.ibm.com/granite/docs/
- 💡 Learn about the latest Granite learning resources: https://ibm.biz/granite-learning-resources

Help me test my AI-Powered Quantum Network Monitor Assistant with quantum-ready security checks. The full open-source code for the Quantum Network Monitor Service is available in my GitHub repos (repos with NetworkMonitor in the name): Source Code Quantum Network Monitor. You will also find the code I use to quantize the models, if you want to do it yourself: GGUFModelBuilder

💬 How to test:
Choose an AI assistant type:
- `TurboLLM` (GPT-4.1-mini)
- `HugLLM` (Hugging Face open-source models)
- `TestLLM` (Experimental CPU-only)

What I'm Testing
I'm pushing the limits of small open-source models for AI network monitoring, specifically:
- Function calling against live network services
- How small can a model go while still handling:
  - Automated Nmap security scans
  - Quantum-readiness checks
  - Network monitoring tasks

🟡 TestLLM – Current experimental model (llama.cpp on 2 CPU threads in a Hugging Face Docker space):
- ✅ Zero-configuration setup
- ⏳ 30s load time (slow inference, but no API costs). No token limit, since the cost is low.
- 🔧 Help wanted! If you're into edge-device AI, let's collaborate!

Other Assistants
🟢 TurboLLM – Uses gpt-4.1-mini:
- It performs very well, but unfortunately OpenAI charges per token, so token usage is limited.
- Create custom cmd processors to run .NET code on Quantum Network Monitor Agents
- Real-time network diagnostics and monitoring
- Security audits
- Penetration testing (Nmap/Metasploit)

🔵 HugLLM – Latest open-source models:
- 🌐 Runs on the Hugging Face Inference API. Performs pretty well using the latest models hosted on Novita.

💡 Example commands you could test:
1. `"Give me info on my website's SSL certificate"`
2. `"Check if my server is using quantum-safe encryption for communication"`
3. `"Run a comprehensive security audit on my server"`
4. `"Create a cmd processor to .. (whatever you want)"`

Note: you need to install a Quantum Network Monitor Agent to run the .NET code. This is a very flexible and powerful feature. Use with caution!

I fund the servers used to create these model files, run the Quantum Network Monitor service, and pay for inference from Novita and OpenAI—all out of my own pocket. All the code behind the model creation and the Quantum Network Monitor project is open source. Feel free to use whatever you find helpful. If you appreciate the work, please consider buying me a coffee ☕. Your support helps cover service costs and allows me to raise token limits for everyone.
DeepCoder-1.5B-Preview-GGUF
OpenReasoning-Nemotron-14B-GGUF
PLM-1.8B-Instruct-GGUF
Eagle2-1B-GGUF
LFM2-1.2B-Extract-GGUF
This model was generated using llama.cpp at commit `c8dedc99`. Click here to get info on choosing the right GGUF model format

Based on LFM2-1.2B, LFM2-1.2B-Extract is designed to extract important information from a wide variety of unstructured documents (such as articles, transcripts, or reports) into structured outputs like JSON, XML, or YAML. Example use cases:
- Extracting invoice details from emails into structured JSON.
- Converting regulatory filings into XML for compliance systems.
- Transforming customer support tickets into YAML for analytics pipelines.
- Populating knowledge graphs with entities and attributes from unstructured reports.

You can find more information about other task-specific models in this blog post.

Generation parameters: We strongly recommend using greedy decoding with `temperature=0`.

System prompt: If no system prompt is provided, the model defaults to JSON output. We recommend providing a system prompt that names a specific format (JSON, XML, or YAML) and a schema to improve accuracy (see the following example).

Supported languages: English, Arabic, Chinese, French, German, Japanese, Korean, Portuguese, and Spanish.

Chat template: LFM2 uses a ChatML-like chat template. You can apply it automatically using the dedicated `.apply_chat_template()` function from Hugging Face transformers.

> [!WARNING]
> ⚠️ The model is intended for single-turn conversations.

The data used for training these models was primarily synthetic, which allowed us to ensure a diverse data mix. We used a range of document types, domains, styles, lengths, and languages. We also varied the density and distribution of relevant text in the documents. In some cases, the extracted information was clustered in one part of the document; in others, it was spread throughout. We applied the same approach of ensuring diversity when creating synthetic user requests and designing the structure of the model outputs.
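As an illustration of the system-prompt guidance above, here is a minimal sketch: the schema, prompt wording, and sample output are hypothetical, and the validation step simply checks that a reply parses as JSON with the expected keys.

```python
import json

# Hypothetical schema-bearing system prompt, per the card's recommendation
# to name a target format and supply a schema.
SYSTEM_PROMPT = (
    "You are an extraction assistant. Reply with JSON only, matching:\n"
    '{"invoice_id": string, "total": number, "currency": string}'
)

def validate_extraction(raw: str, required_keys: set) -> dict:
    """Parse a model reply and check that the schema keys are present."""
    data = json.loads(raw)  # raises a ValueError subclass on malformed JSON
    missing = required_keys - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data

# Hand-written stand-in for a model response:
sample = '{"invoice_id": "INV-001", "total": 41.5, "currency": "EUR"}'
record = validate_extraction(sample, {"invoice_id", "total", "currency"})
print(record["currency"])  # EUR
```

Pairing greedy decoding (`temperature=0`) with this kind of post-hoc validation makes malformed outputs easy to catch and retry.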
The data generation process underwent many iterations, incorporating ideas and feedback from across the Liquid AI team. We evaluated LFM2-Extract on a dataset of 5,000 documents, covering over 100 topics with a mix of writing styles, ambiguities, and formats. We used a combination of five metrics to capture a balanced view of syntax, accuracy, and faithfulness:
- Syntax score: Checks whether outputs parse cleanly as valid JSON, XML, or YAML.
- Format accuracy: Verifies that outputs match the requested format (e.g., JSON when JSON is requested).
- Keyword faithfulness: Measures whether values in the structured output actually appear in the input text.
- Absolute scoring: A judge LLM scores quality on a 1-5 scale, assessing completeness and correctness of extractions.
- Relative scoring: We ask a judge LLM to choose the best answer between the extraction model's output and the ground-truth answer.

LFM2-1.2B-Extract can output complex objects in different languages at a higher level than Gemma 3 27B, a model 22.5 times its size.
- Hugging Face: LFM2-1.2B
- llama.cpp: LFM2-1.2B-Extract-GGUF
- LEAP: LEAP model library

If you are interested in custom solutions with edge deployment, please contact our sales team.
gpt2-GGUF
openhands-lm-7b-v0.1-GGUF
Llama-Guard-3-8B-GGUF
AceReason-Nemotron-7B-GGUF
LFM2-350M-Extract-GGUF
This model was generated using llama.cpp at commit `c8dedc99`. Click here to get info on choosing the right GGUF model format

Based on LFM2-350M, LFM2-350M-Extract is designed to extract important information from a wide variety of unstructured documents (such as articles, transcripts, or reports) into structured outputs like JSON, XML, or YAML. Example use cases:
- Extracting invoice details from emails into structured JSON.
- Converting regulatory filings into XML for compliance systems.
- Transforming customer support tickets into YAML for analytics pipelines.
- Populating knowledge graphs with entities and attributes from unstructured reports.

You can find more information about other task-specific models in this blog post.

Generation parameters: We strongly recommend using greedy decoding with `temperature=0`.

System prompt: If no system prompt is provided, the model defaults to JSON output. We recommend providing a system prompt that names a specific format (JSON, XML, or YAML) and a schema to improve accuracy (see the following example).

Supported languages: English, Arabic, Chinese, French, German, Japanese, Korean, Portuguese, and Spanish.

Chat template: LFM2 uses a ChatML-like chat template. You can apply it automatically using the dedicated `.apply_chat_template()` function from Hugging Face transformers.

> [!WARNING]
> ⚠️ The model is intended for single-turn conversations.

The data used for training these models was primarily synthetic, which allowed us to ensure a diverse data mix. We used a range of document types, domains, styles, lengths, and languages. We also varied the density and distribution of relevant text in the documents. In some cases, the extracted information was clustered in one part of the document; in others, it was spread throughout. We applied the same approach of ensuring diversity when creating synthetic user requests and designing the structure of the model outputs.
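The first three evaluation metrics used for the Extract models (described just below) can be approximated with a small stdlib sketch, restricted here to JSON and XML so no third-party YAML parser is needed. Liquid AI's exact scoring is not published on this page, so treat this as an illustration only.

```python
import json
import xml.etree.ElementTree as ET

def syntax_ok(output: str, fmt: str) -> bool:
    """Syntax score: does the output parse cleanly in the given format?"""
    try:
        if fmt == "json":
            json.loads(output)
        elif fmt == "xml":
            ET.fromstring(output)
        else:
            return False
        return True
    except (ValueError, ET.ParseError):
        return False

def format_accuracy(output: str, requested: str) -> bool:
    """Format accuracy: the output parses in the *requested* format."""
    return syntax_ok(output, requested)

def keyword_faithfulness(output_values: list, source_text: str) -> float:
    """Fraction of extracted values that literally appear in the source text."""
    if not output_values:
        return 0.0
    hits = sum(str(v) in source_text for v in output_values)
    return hits / len(output_values)

doc = "Invoice INV-001 totals 41.5 EUR, due 2024-05-01."
out = '{"invoice_id": "INV-001", "total": "41.5"}'
print(syntax_ok(out, "json"))                          # True
print(format_accuracy(out, "xml"))                     # False
print(keyword_faithfulness(["INV-001", "41.5"], doc))  # 1.0
```

The remaining two metrics (absolute and relative scoring) require a judge LLM and are not sketched here.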
The data generation process underwent many iterations, incorporating ideas and feedback from across the Liquid AI team. We evaluated LFM2-Extract on a dataset of 5,000 documents, covering over 100 topics with a mix of writing styles, ambiguities, and formats. We used a combination of five metrics to capture a balanced view of syntax, accuracy, and faithfulness:
- Syntax score: Checks whether outputs parse cleanly as valid JSON, XML, or YAML.
- Format accuracy: Verifies that outputs match the requested format (e.g., JSON when JSON is requested).
- Keyword faithfulness: Measures whether values in the structured output actually appear in the input text.
- Absolute scoring: A judge LLM scores quality on a 1-5 scale, assessing completeness and correctness of extractions.
- Relative scoring: We ask a judge LLM to choose the best answer between the extraction model's output and the ground-truth answer.

LFM2-350M-Extract outperforms Gemma 3 4B, a model more than 11x its size, at this task.
- Hugging Face: LFM2-350M
- llama.cpp: LFM2-350M-Extract-GGUF
- LEAP: LEAP model library

If you are interested in custom solutions with edge deployment, please contact our sales team.
GneissWeb.7B_ablation_model_on_350B_FineWeb.seed1-GGUF
GneissWeb.7B_ablation_model_on_350B_FineWeb.seed1 GGUF Models

This model was generated using llama.cpp at commit `0a5a3b5c`. Click here to get info on choosing the right GGUF model format

This model is part of the GneissWeb ablations, detailed in this technical report. The model has 7 billion parameters and uses the Llama architecture. It is trained on a random subset of 350 billion English tokens from FineWeb V1.1.0, tokenized using the StarCoder tokenizer. This model is trained on 350B tokens of English FineWeb V1.1.0 data and is not instruction-tuned or safety-aligned. It is important to note that the primary intended use case for this model is to compare its performance with other models trained under similar conditions, with the goal of comparing pre-training datasets. These other models are mentioned here. This is a simple example of how to use the `GneissWeb.7B_ablation_model_on_350B_FineWeb.seed1` model. Then, copy the code snippet below to run the example. Evaluation Results: Please refer to section 5.3.2 of the GneissWeb paper. Infrastructure: Please refer to section 5.2 of the GneissWeb paper.
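The referenced snippet is not reproduced on this page; a minimal sketch of plain causal-LM generation with Hugging Face transformers might look like the following. The model path is a placeholder (substitute the actual checkpoint location), and the import is deferred so the sketch carries no hard dependency.

```python
def generate(prompt: str, model_path: str, max_new_tokens: int = 64) -> str:
    """Greedy generation sketch; model_path is a placeholder, not a confirmed hub ID."""
    # Deferred import: only needed when actually running the model.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    return tokenizer.decode(output[0], skip_special_tokens=True)

# Example call (requires the checkpoint to be available locally or on the Hub):
# print(generate("The capital of France is", "path/to/GneissWeb-7B-ablation-checkpoint"))
```

Note that this is a base (non-instruct) model, so plain-text continuation prompts like the one above are the appropriate usage pattern.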
Ethical Considerations and Limitations: The use of Large Language Models involves risks and ethical considerations people must be aware of, including but not limited to: bias and fairness, misinformation, and autonomous decision-making. `GneissWeb.7B_ablation_model_on_350B_FineWeb.seed1` is no exception in this regard. Even though this model is suited for multiple generative AI tasks, it has not undergone any safety alignment, so it may produce problematic outputs. Additionally, it remains uncertain whether smaller models might exhibit increased susceptibility to hallucination by copying text verbatim from the training dataset, due to their reduced size and memorization capacity. This aspect is currently an active area of research, and we anticipate more rigorous exploration, comprehension, and mitigation in this domain. Regarding ethics, a latent risk associated with all Large Language Models is their malicious utilization. We urge the community to use the `GneissWeb.7B_ablation_model_on_350B_FineWeb.seed1` model with ethical intentions and in a responsible way.
granite-8b-code-instruct-128k-GGUF
OCRFlux-3B-GGUF
This model was generated using llama.cpp at commit `caf5681f`. Click here to get info on choosing the right GGUF model format

This is a preview release of the OCRFlux-3B model, fine-tuned from Qwen2.5-VL-3B-Instruct using our private document datasets and some data from the olmOCR-mix-0225 dataset. The best way to use this model is via the OCRFlux toolkit. The toolkit comes with an efficient inference setup via vLLM that can handle millions of documents at scale. OCRFlux is licensed under the Apache 2.0 license. OCRFlux is intended for research and educational use.
Qwen3-0.6B-abliterated-GGUF
SmolVLM-Instruct-GGUF
granite-3.0-2b-base-GGUF
granite-3.1-8b-lora-intrinsics-v0.1-GGUF
watt-tool-70B-GGUF
Llama-3.1-8B-GGUF
Gemma-3-Gaia-PT-BR-4b-it-GGUF
WEBGEN-4B-Preview-GGUF
Magistral-Small-2509-GGUF
Qwen3-32B-GGUF
gemma-3-27b-it-qat-q4_0-GGUF
EXAONE-4.0-1.2B-GGUF
This model was generated using llama.cpp at commit `bf9087f5`. Click here to get info on choosing the right GGUF model format

🎉 License Updated! We are pleased to announce our more flexible licensing terms 🤗 ✈️ Try on FriendliAI

We introduce EXAONE 4.0, which integrates a Non-reasoning mode and a Reasoning mode to achieve both the excellent usability of EXAONE 3.5 and the advanced reasoning abilities of EXAONE Deep. To pave the way for the agentic AI era, EXAONE 4.0 incorporates essential features such as agentic tool use, and its multilingual capabilities are extended to support Spanish in addition to English and Korean. The EXAONE 4.0 model series consists of two sizes: a mid-size 32B model optimized for high performance, and a small-size 1.2B model designed for on-device applications. In the EXAONE 4.0 architecture, we apply new architectural changes compared to previous EXAONE models, as follows:
1. Hybrid Attention: For the 32B model, we adopt a hybrid attention scheme, which combines Local attention (sliding window attention) with Global attention (full attention) in a 3:1 ratio. We do not use RoPE (Rotary Positional Embedding) for global attention, for better global context understanding.
2. QK-Reorder-Norm: We reorder the LayerNorm position from the traditional Pre-LN scheme by applying LayerNorm directly to the attention and MLP outputs, and we add RMS normalization right after the Q and K projections. This helps yield better performance on downstream tasks despite consuming more computation.

For more details, please refer to our technical report, HuggingFace paper, blog, and GitHub.

- Number of Parameters (without embeddings): 1.07B
- Number of Layers: 30
- Number of Attention Heads: GQA with 32 heads and 8 KV heads
- Vocab Size: 102,400
- Context Length: 65,536 tokens

You should install the transformers library forked from the original, available in our PR. Once this PR is merged and released, we will update this section.
You can install the latest version of transformers with support for EXAONE 4.0 using the following command: For general use, you can use the EXAONE 4.0 models with the following example: The EXAONE 4.0 models have reasoning capabilities for handling complex problems. You can activate reasoning mode by using the `enable_thinking=True` argument with the tokenizer, which opens a reasoning block (the opening tag is emitted without its closing counterpart).

> [!IMPORTANT]
> Model generation in reasoning mode can be quite sensitive to sampling parameters, so please refer to the Usage Guideline for better quality.

The EXAONE 4.0 models can be used as agents thanks to their tool-calling capabilities. You can provide tool schemas to the model for effective tool calling. TensorRT-LLM officially supports EXAONE 4.0 models in the latest commits. Until that support is released, you need to clone the TensorRT-LLM repository and build it from source. Please refer to the official documentation for a guide to building the TensorRT-LLM environment. You can run the TensorRT-LLM server with the following steps: For more details, please refer to the EXAONE documentation in TensorRT-LLM.

> [!NOTE]
> Other inference engines, including `vllm` and `sglang`, do not officially support EXAONE 4.0 yet. We will update this as soon as these libraries are updated.

The following tables show the evaluation results of each model in reasoning and non-reasoning mode. The evaluation details can be found in the technical report.
- ✅ denotes that the model has hybrid reasoning capability, evaluated by selecting reasoning or non-reasoning mode depending on the purpose.
- To assess Korean practical and professional knowledge, we adopt both the KMMLU-Redux and KMMLU-Pro benchmarks. Both datasets are publicly released!
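The reasoning-mode toggle described above can be sketched as a small helper. It only arranges the `apply_chat_template` call; the `enable_thinking` keyword mirrors this card's wording, so confirm the exact argument name against the EXAONE documentation before relying on it.

```python
def build_chat_inputs(tokenizer, prompt: str, reasoning: bool = False):
    """Apply the chat template, toggling the EXAONE reasoning block."""
    messages = [{"role": "user", "content": prompt}]
    return tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        enable_thinking=reasoning,  # opens the reasoning block when True
        return_tensors="pt",
    )
```

Any tokenizer exposing `apply_chat_template` (as in Hugging Face transformers) can be passed in; the returned tensors then go straight to `model.generate`.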
Reasoning-mode comparison (EXAONE 4.0 32B vs. Phi 4 reasoning-plus, Magistral Small-2506, Qwen 3 32B, Qwen 3 235B, DeepSeek R1-0528): see the technical report for the full scores.

Non-reasoning mode:

| | EXAONE 4.0 32B | Phi 4 | Mistral-Small-2506 | Gemma 3 27B | Qwen3 32B | Qwen3 235B | Llama-4-Maverick | DeepSeek V3-0324 |
|---|---|---|---|---|---|---|---|---|
| Model Size | 32.0B | 14.7B | 24.0B | 27.4B | 32.8B | 235B | 402B | 671B |
| GPQA-Diamond | 63.7 | 56.1 | 46.1 | 42.4 | 54.6 | 62.9 | 69.8 | 68.4 |
| LiveCodeBench v5 | 43.3 | 24.6 | 25.8 | 27.5 | 31.3 | 35.3 | 43.4 | 46.7 |
| LiveCodeBench v6 | 43.1 | 27.4 | 26.9 | 29.7 | 28.0 | 31.4 | 32.7 | 44.0 |
| Multi-IF (EN) | 71.6 | 47.7 | 63.2 | 72.1 | 71.9 | 72.5 | 77.9 | 68.3 |
| Tau-Bench (Airline) | 25.5 | N/A | 36.1 | N/A | 16.0 | 27.0 | 38.0 | 40.5 |
| Tau-Bench (Retail) | 55.9 | N/A | 35.5 | N/A | 47.6 | 56.5 | 6.5 | 68.5 |
| KMMLU-Redux | 64.8 | 50.1 | 53.6 | 53.3 | 64.4 | 71.7 | 76.9 | 72.2 |
| MATH500 (ES) | 87.3 | 78.2 | 83.4 | 86.8 | 84.7 | 87.2 | 78.7 | 89.2 |
| WMT24++ (ES) | 90.7 | 89.3 | 92.2 | 93.1 | 91.4 | 92.9 | 92.7 | 94.3 |

Small-model comparisons (reasoning: EXAONE 4.0 1.2B, EXAONE Deep 2.4B, Qwen 3 0.6B, Qwen 3 1.7B, SmolLM3 3B; non-reasoning: EXAONE 4.0 1.2B, Qwen 3 0.6B, Gemma 3 1B, Qwen 3 1.7B, SmolLM3 3B): see the technical report for the scores.

> [!IMPORTANT]
> To achieve the expected performance, we recommend using the following configurations:
>
> - For non-reasoning mode, we recommend using a lower `temperature` value.
> - For reasoning mode (using the reasoning block), we recommend `temperature=0.6` and `top_p=0.95`.
> - If you suffer from model degeneration, we recommend `presence_penalty=1.5`.
> - For Korean general conversation with the 1.2B model, we suggest `temperature=0.1` to avoid code switching.

The EXAONE language model has certain limitations and may occasionally generate inappropriate responses. The language model generates responses based on token output probabilities, which are determined during learning from the training data. While we have made every effort to exclude personal, harmful, and biased information from the training data, some problematic content may still be included, potentially leading to undesirable responses. Please note that text generated by the EXAONE language model does not reflect the views of LG AI Research.
- Inappropriate answers may be generated, which contain personal, harmful or other inappropriate information.
- Biased responses may be generated, which are associated with age, gender, race, and so on.
- The generated responses rely heavily on statistics from the training data, which can result in semantically or syntactically incorrect sentences.
- Since the model does not reflect the latest information, responses may be false or contradictory.

LG AI Research strives to reduce the potential risks that may arise from EXAONE language models. Users may not engage in any malicious activities (e.g., keying in illegal information) that could induce the creation of inappropriate outputs violating LG AI's ethical principles when using EXAONE language models.

The model is licensed under the EXAONE AI Model License Agreement 1.2 - NC.

> [!NOTE]
> The main differences from the older version are as follows:
> - We removed the claim of model output ownership from the license.
> - We restrict use of the model for the development of models that compete with EXAONE.
> - We allow the model to be used for educational purposes, not just research.

LG AI Research Technical Support: [email protected]

Help me test my AI-Powered Quantum Network Monitor Assistant with quantum-ready security checks. The full open-source code for the Quantum Network Monitor Service is available in my GitHub repos (repos with NetworkMonitor in the name): Source Code Quantum Network Monitor.
You will also find the code I use to quantize the models, if you want to do it yourself, in GGUFModelBuilder.

💬 How to test:
Choose an AI assistant type:
- `TurboLLM` (GPT-4.1-mini)
- `HugLLM` (Hugging Face open-source models)
- `TestLLM` (Experimental CPU-only)

What I'm Testing
I'm pushing the limits of small open-source models for AI network monitoring, specifically:
- Function calling against live network services
- How small can a model go while still handling:
  - Automated Nmap security scans
  - Quantum-readiness checks
  - Network Monitoring tasks

🟡 TestLLM – Current experimental model (llama.cpp on 2 CPU threads on a Hugging Face Docker space):
- ✅ Zero-configuration setup
- ⏳ 30s load time (slow inference, but no API costs). No token limits, as the cost is low.
- 🔧 Help wanted! If you're into edge-device AI, let's collaborate!

Other Assistants
🟢 TurboLLM – Uses gpt-4.1-mini:
- It performs very well, but unfortunately OpenAI charges per token, so token usage is limited.
- Create custom cmd processors to run .NET code on Quantum Network Monitor Agents
- Real-time network diagnostics and monitoring
- Security audits
- Penetration testing (Nmap/Metasploit)

🔵 HugLLM – Latest open-source models:
- 🌐 Runs on the Hugging Face Inference API. Performs pretty well using the latest models hosted on Novita.

💡 Example commands you could test:
1. `"Give me info on my website's SSL certificate"`
2. `"Check if my server is using quantum-safe encryption for communication"`
3. `"Run a comprehensive security audit on my server"`
4. `"Create a cmd processor to .. (whatever you want)"` Note: you need to install a Quantum Network Monitor Agent to run the .NET code. This is a very flexible and powerful feature. Use with caution!

I fund the servers used to create these model files, run the Quantum Network Monitor service, and pay for inference from Novita and OpenAI—all out of my own pocket. All the code behind the model creation and the Quantum Network Monitor project is open source.
Feel free to use whatever you find helpful. If you appreciate the work, please consider buying me a coffee ☕. Your support helps cover service costs and allows me to raise token limits for everyone.
Qwen3-4B-GGUF
GLM-Z1-9B-0414-GGUF
granite-3.1-3b-a800m-base-GGUF
This model was generated using llama.cpp at commit `0a5a3b5c`. Click here to get info on choosing the right GGUF model format

Model Summary: Granite-3.1-3B-A800M-Base extends the context length of Granite-3.0-3B-A800M-Base from 4K to 128K using a progressive training strategy: the supported context length is increased in increments while adjusting RoPE theta, until the model has successfully adapted to the desired length of 128K. This long-context pre-training stage was performed using approximately 500B tokens.

- Developers: Granite Team, IBM
- GitHub Repository: ibm-granite/granite-3.1-language-models
- Website: Granite Docs
- Paper: Granite 3.1 Language Models (coming soon)
- Release Date: December 18th, 2024
- License: Apache 2.0

Supported Languages: English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese. Users may finetune Granite 3.1 models for languages beyond these 12 languages.

Intended Use: Prominent use cases of LLMs in text-to-text generation include summarization, text classification, extraction, question-answering, and more. All Granite Base models are able to handle these tasks, as they were trained on a large amount of data from various domains. Moreover, they can serve as a baseline for creating specialized models for specific application scenarios.

Generation: This is a simple example of how to use the Granite-3.1-3B-A800M-Base model. Then, copy the code snippet below to run the example.
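As background on the context extension described in the summary, the RoPE-theta adjustment can be illustrated numerically: raising the base theta stretches the rotary wavelengths, so positional phases remain distinguishable over longer ranges. This is an illustrative sketch (the head dimension and theta values are assumptions, not IBM's training recipe):

```python
import math

def rope_wavelengths(head_dim: int, theta: float):
    # RoPE pair i rotates with frequency theta^(-2i/head_dim);
    # its wavelength (positions per full rotation) is 2*pi / freq.
    return [2 * math.pi * theta ** (2 * i / head_dim) for i in range(head_dim // 2)]

short_wl = rope_wavelengths(64, 10_000.0)   # a typical short-context base
long_wl = rope_wavelengths(64, 500_000.0)   # a raised base for long-context adaptation

# Raising theta lengthens every wavelength except the first (i=0),
# so the slowest-rotating dimensions can span far more positions.
assert all(l >= s for s, l in zip(short_wl, long_wl))
print(f"slowest wavelength: {short_wl[-1]:.0f} -> {long_wl[-1]:.0f} positions")
```

Progressive training then lets the model adapt to each intermediate length before the next increase.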
| Models | ARC-Challenge | Hellaswag | MMLU | TruthfulQA | Winogrande | GSM8K | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Granite-3.1-8B-Base | 63.99 | 83.27 | 63.45 | 51.29 | 78.92 | 60.19 | 66.85 |
| Granite-3.1-2B-Base | 53.58 | 77.67 | 52.86 | 39.02 | 72.84 | 47.99 | 57.32 |
| Granite-3.1-3B-A800M-Base | 50.76 | 74.45 | 48.31 | 39.91 | 69.29 | 40.56 | 53.88 |
| Granite-3.1-1B-A400M-Base | 39.42 | 66.13 | 26.53 | 37.67 | 2.03 | 18.87 | 31.78 |

| Models | IFEval | BBH | MATH Lvl 5 | GPQA | MUSR | MMLU-Pro | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Granite-3.1-8B-Base | 42.21 | 26.02 | 9.52 | 9.51 | 8.36 | 24.8 | 20.07 |
| Granite-3.1-2B-Base | 35.22 | 16.84 | 5.59 | 3.69 | 3.9 | 13.9 | 13.19 |
| Granite-3.1-3B-A800M-Base | 29.96 | 11.91 | 4 | 3.69 | 1.11 | 8.81 | 9.91 |
| Granite-3.1-1B-A400M-Base | 25.19 | 6.43 | 2.19 | 0.22 | 1.76 | 1.55 | 6.22 |

Model Architecture: Granite-3.1-3B-A800M-Base is based on a decoder-only sparse Mixture of Experts (MoE) transformer architecture. Core components of this architecture are: Fine-grained Experts, Dropless Token Routing, and Load Balancing Loss.

Training Data: This model is trained on a mix of open-source and proprietary data following a three-stage training strategy.
- Stage 1 data: sourced from diverse domains, such as web, code, academic sources, books, and math data.
- Stage 2 data: a curated mix of high-quality data from the same domains, plus multilingual and instruction data. The goal of this second training phase is to enhance the model's performance on specific tasks.
- Stage 3 data: the original stage-2 pretraining data with additional synthetic long-context data in the form of QA/summary pairs, where the answer contains a recitation of the related paragraph before the answer.

A detailed attribution of datasets can be found in the Granite 3.0 Technical Report, Granite 3.1 Technical Report (coming soon), and Accompanying Author List.

Infrastructure: We train Granite 3.1 Language Models using IBM's supercomputing cluster, Blue Vela, which is outfitted with NVIDIA H100 GPUs. This cluster provides a scalable and efficient infrastructure for training our models over thousands of GPUs.
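Of the MoE components listed above, the load-balancing loss is the easiest to sketch. A common formulation (the Switch-Transformer-style auxiliary loss; an illustrative implementation, not necessarily the exact loss IBM used) multiplies, per expert, the fraction of tokens routed to it by its mean router probability:

```python
def load_balancing_loss(router_probs, expert_assignments, num_experts):
    """router_probs: per-token softmax over experts (list of lists);
    expert_assignments: index of the expert each token was routed to."""
    n = len(router_probs)
    loss = 0.0
    for e in range(num_experts):
        f_e = sum(1 for a in expert_assignments if a == e) / n  # routed fraction
        p_e = sum(p[e] for p in router_probs) / n               # mean router prob
        loss += f_e * p_e
    return num_experts * loss  # equals 1.0 under perfectly uniform routing

# Perfectly balanced routing over 4 experts gives a loss of exactly 1.0;
# any imbalance pushes the loss above 1, penalizing over-used experts.
probs = [[0.25] * 4 for _ in range(8)]
assign = [0, 1, 2, 3, 0, 1, 2, 3]
assert abs(load_balancing_loss(probs, assign, 4) - 1.0) < 1e-9
```

Minimizing this term during training nudges the router toward spreading tokens evenly across the fine-grained experts.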
Ethical Considerations and Limitations: The use of Large Language Models involves risks and ethical considerations people must be aware of, including but not limited to: bias and fairness, misinformation, and autonomous decision-making. The Granite-3.1-3B-A800M-Base model is no exception in this regard. Even though this model is suited for multiple generative AI tasks, it has not undergone any safety alignment; therefore, it may produce problematic outputs. Additionally, it remains uncertain whether smaller models might exhibit increased susceptibility to hallucination in generation scenarios by copying text verbatim from the training dataset, due to their reduced sizes and memorization capacities. This aspect is currently an active area of research, and we anticipate more rigorous exploration, comprehension, and mitigation in this domain. Regarding ethics, a latent risk associated with all Large Language Models is their malicious utilization. We urge the community to use the Granite-3.1-3B-A800M-Base model with ethical intentions and in a responsible way.

Resources
- ⭐️ Learn about the latest updates with Granite: https://www.ibm.com/granite
- 📄 Get started with tutorials, best practices, and prompt engineering advice: https://www.ibm.com/granite/docs/
- 💡 Learn about the latest Granite learning resources: https://ibm.biz/granite-learning-resources
granite-3.2-8b-instruct-preview-GGUF
This model was generated using llama.cpp at commit `5dd942de`. I've been experimenting with a new quantization approach that selectively elevates the precision of key layers beyond what the default IMatrix configuration provides. In my testing, standard IMatrix quantization underperforms at lower bit depths, especially with Mixture of Experts (MoE) models. To address this, I'm using the `--tensor-type` option in `llama.cpp` to manually "bump" important layers to higher precision. You can see the implementation here: 👉 Layer bumping with llama.cpp While this does increase model file size, it significantly improves precision for a given quantization level. I'd love your feedback—have you tried this? How does it perform for you? Click here to get info on choosing the right GGUF model format Model Summary: Granite-3.2-8B-Instruct-Preview is an early release of an 8B long-context model fine-tuned for enhanced reasoning (thinking) capabilities. Built on top of Granite-3.1-8B-Instruct, it has been trained using a mix of permissively licensed open-source datasets and internally generated synthetic data designed for reasoning tasks. The model allows controllability of its thinking capability, ensuring it is applied only when required. - Developers: Granite Team, IBM - Website: Granite Docs - Release Date: February 7th, 2025 - License: Apache 2.0 Supported Languages: English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese. However, users may finetune this Granite model for languages beyond these 12 languages. Intended Use: The model is designed to respond to general instructions and can be used to build AI assistants for multiple domains, including business applications. 
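The controllable thinking capability mentioned above is toggled at the prompt/template level. Since this card's snippet is not shown here, the helper below is a hypothetical, plain-Python illustration of such a toggle (the `thinking` flag and the injected instruction are assumptions for demonstration, not Granite's actual chat template):

```python
def build_chat(user_msg: str, thinking: bool = False):
    """Toy stand-in for a chat template with a thinking switch.

    When thinking is enabled, a reasoning instruction is prepended so the
    model reasons step by step; otherwise the request is passed through as-is.
    """
    messages = []
    if thinking:
        messages.append({"role": "system",
                         "content": "Reason step by step before answering."})
    messages.append({"role": "user", "content": user_msg})
    return messages

assert len(build_chat("Summarize this memo.")) == 1        # thinking off: no extra turn
assert build_chat("Prove it.", thinking=True)[0]["role"] == "system"
```

The point is that thinking is applied only when required, so simple requests avoid the latency of a reasoning trace.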
Capabilities
- Thinking
- Summarization
- Text classification
- Text extraction
- Question-answering
- Retrieval Augmented Generation (RAG)
- Code related tasks
- Function-calling tasks
- Multilingual dialog use cases
- Long-context tasks including long document/meeting summarization, long document QA, etc.

Generation: This is a simple example of how to use the Granite-3.2-8B-Instruct-Preview model. Then, copy the snippet from the section that is relevant for your use case.

| Models | ArenaHard | Alpaca-Eval-2 | MMLU | PopQA | TruthfulQA | BigBenchHard | DROP | GSM8K | HumanEval | HumanEval+ | IFEval | AttaQ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Llama-3.1-8B-Instruct | 36.43 | 27.22 | 69.15 | 28.79 | 52.79 | 72.66 | 61.48 | 83.24 | 85.32 | 80.15 | 79.10 | 83.43 |
| DeepSeek-R1-Distill-Llama-8B | 17.17 | 21.85 | 45.80 | 13.25 | 47.43 | 65.71 | 44.46 | 72.18 | 67.54 | 62.91 | 66.50 | 42.87 |
| Qwen-2.5-7B-Instruct | 25.44 | 30.34 | 74.30 | 18.12 | 63.06 | 70.40 | 54.71 | 84.46 | 93.35 | 89.91 | 74.90 | 81.90 |
| DeepSeek-R1-Distill-Qwen-7B | 10.36 | 15.35 | 50.72 | 9.94 | 47.14 | 65.04 | 42.76 | 78.47 | 79.89 | 78.43 | 59.10 | 42.45 |
| Granite-3.1-8B-Instruct | 37.58 | 27.87 | 66.84 | 28.84 | 65.92 | 68.10 | 50.78 | 79.08 | 88.82 | 84.62 | 71.20 | 85.73 |
| Granite-3.2-8B-Instruct-Preview | 55.23 | 61.16 | 66.93 | 28.08 | 66.37 | 65.60 | 50.73 | 83.09 | 89.47 | 86.88 | 73.57 | 85.99 |

Training Data: Overall, our training data is largely comprised of two key sources: (1) publicly available datasets with permissive licenses, and (2) internal synthetically generated data targeted to enhance reasoning capabilities.

Infrastructure: We train Granite-3.2-8B-Instruct-Preview using IBM's supercomputing cluster, Blue Vela, which is outfitted with NVIDIA H100 GPUs. This cluster provides a scalable and efficient infrastructure for training our models over thousands of GPUs.

Ethical Considerations and Limitations: Granite-3.2-8B-Instruct-Preview builds upon Granite-3.1-8B-Instruct, leveraging both permissively licensed open-source and select proprietary data for enhanced performance.
Since it inherits its foundation from the previous model, all ethical considerations and limitations applicable to Granite-3.1-8B-Instruct remain relevant.

Resources
- ⭐️ Learn about the latest updates with Granite: https://www.ibm.com/granite
- 📄 Get started with tutorials, best practices, and prompt engineering advice: https://www.ibm.com/granite/docs/
- 💡 Learn about the latest Granite learning resources: https://ibm.biz/granite-learning-resources
Qwen3-4B-Thinking-2507-GGUF
granite-3.0-1b-a400m-instruct-GGUF
granite-8b-code-base-128k-GGUF
RoboBrain2.0-7B-GGUF
olmOCR-7B-0225-preview-GGUF
orpheus-3b-0.1-ft-GGUF
granite-guardian-3.2-5b-GGUF
Qwen2.5-14B-Instruct-1M-GGUF
Phi-4-reasoning-GGUF
Llama-3_3-Nemotron-Super-49B-v1_5-GGUF
granite-3.3-8b-base-GGUF
granite-3b-code-instruct-128k-GGUF
This model was generated using llama.cpp at commit `0a5a3b5c`. Click here to get info on choosing the right GGUF model format

Model Summary
Granite-3B-Code-Instruct-128K is a 3B-parameter long-context instruct model fine-tuned from Granite-3B-Code-Base-128K on a combination of permissively licensed data used in training the original Granite code instruct models, together with synthetically generated code instruction datasets tailored for solving long-context problems. By exposing the model to both short and long context data, we aim to enhance its long-context capability without sacrificing code generation performance at short input context.

- Developers: IBM Research
- GitHub Repository: ibm-granite/granite-code-models
- Paper: Scaling Granite Code Models to 128K Context
- Release Date: July 18th, 2024
- License: Apache 2.0

Usage
Intended use: The model is designed to respond to coding-related instructions over long-context input up to 128K in length and can be used to build coding assistants.

Generation: This is a simple example of how to use the Granite-3B-Code-Instruct model.

Training Data
Granite Code Instruct models are trained on a mix of short and long context data as follows.
Short-Context Instruction Data: CommitPackFT, BigCode-SC2-Instruct, MathInstruct, MetaMathQA, Glaive-Code-Assistant-v3, Glaive-Function-Calling-v2, NL2SQL11, HelpSteer, and OpenPlatypus, including a synthetically generated dataset for API calling and multi-turn code interactions with execution feedback. We also include a collection of hardcoded prompts to ensure our model generates correct outputs given inquiries about its name or developers.

Long-Context Instruction Data: A synthetically generated dataset built by bootstrapping repository-level file-packed documents through Granite-8b-Code-Instruct to improve the long-context capability of the model.

Infrastructure: We train the Granite Code models using two of IBM's supercomputing clusters, Vela and Blue Vela, outfitted with NVIDIA A100 and H100 GPUs, respectively. These clusters provide a scalable and efficient infrastructure for training our models over thousands of GPUs.

Ethical Considerations and Limitations: Granite code instruct models are primarily finetuned using instruction-response pairs across a specific set of programming languages. Thus, their performance may be limited with out-of-domain programming languages. In this situation, it is beneficial to provide few-shot examples to steer the model's output. Moreover, developers should perform safety testing and target-specific tuning before deploying these models in critical applications. The model also inherits ethical considerations and limitations from its base model. For more information, please refer to the Granite-3B-Code-Base-128K model card.
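As noted above, few-shot examples help steer the model on out-of-domain programming languages. A small prompt-builder sketch (the Question/Answer layout is illustrative, not a format the model requires):

```python
def few_shot_prompt(examples, query):
    """Assemble a few-shot prompt from (instruction, solution) pairs."""
    parts = []
    for instruction, solution in examples:
        parts.append(f"Question:\n{instruction}\n\nAnswer:\n{solution}\n")
    parts.append(f"Question:\n{query}\n\nAnswer:\n")
    return "\n".join(parts)

# One worked example in the target language anchors style and syntax.
examples = [
    ("Write a Zig function that adds two i32 values.",
     "fn add(a: i32, b: i32) i32 {\n    return a + b;\n}"),
]
prompt = few_shot_prompt(examples, "Write a Zig function that doubles an i32.")
assert prompt.count("Question:") == 2 and prompt.rstrip().endswith("Answer:")
```

The trailing open "Answer:" leaves the model to complete the final solution in the demonstrated style.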
SmolLM3-3B-Base-GGUF
TriLM_99M_Unpacked-GGUF
OpenMath-Nemotron-7B-GGUF
TriLM_560M_Unpacked-GGUF
sychonix-GGUF
AceMath-1.5B-Instruct-GGUF
This model was generated using llama.cpp at commit `b9c3eefd`. Click here to get info on choosing the right GGUF model format

Introduction
We introduce AceMath, a family of frontier models designed for mathematical reasoning. The models in the AceMath family, including AceMath-1.5B/7B/72B-Instruct and AceMath-7B/72B-RM, are improved using Qwen. The AceMath-1.5B/7B/72B-Instruct models excel at solving English mathematical problems using Chain-of-Thought (CoT) reasoning, while the AceMath-7B/72B-RM models, as outcome reward models, specialize in evaluating and scoring mathematical solutions.

The AceMath-1.5B/7B/72B-Instruct models are developed from the Qwen2.5-Math-1.5B/7B/72B-Base models, leveraging a multi-stage supervised fine-tuning (SFT) process: first with general-purpose SFT data, followed by math-specific SFT data. We are releasing all training data to support further research in this field. We recommend using the AceMath models only for solving math problems. To support other tasks, we also release AceInstruct-1.5B/7B/72B, a series of general-purpose SFT models designed to handle code, math, and general knowledge tasks. These models are built upon the Qwen2.5-1.5B/7B/72B-Base. For more information about AceMath, check our website and paper.

All Resources
- AceMath Instruction Models: AceMath-1.5B-Instruct, AceMath-7B-Instruct, AceMath-72B-Instruct
- AceMath Reward Models: AceMath-7B-RM, AceMath-72B-RM
- Evaluation & Training Data: AceMath-RewardBench, AceMath-Instruct Training Data, AceMath-RM Training Data
- General Instruction Models: AceInstruct-1.5B, AceInstruct-7B, AceInstruct-72B

Benchmark Results (AceMath-Instruct + AceMath-72B-RM)
We compare AceMath to leading proprietary and open-access math models in the table above. Our AceMath-7B-Instruct largely outperforms the previous best-in-class Qwen2.5-Math-7B-Instruct (average pass@1: 67.2 vs.
62.9) on a variety of math reasoning benchmarks, while coming close to the performance of the 10× larger Qwen2.5-Math-72B-Instruct (67.2 vs. 68.2). Notably, our AceMath-72B-Instruct outperforms the state-of-the-art Qwen2.5-Math-72B-Instruct (71.8 vs. 68.2), GPT-4o (67.4), and Claude 3.5 Sonnet (65.6) by a clear margin. We also report the rm@8 accuracy (best of 8) achieved by our reward model, AceMath-72B-RM, which sets a new record on these reasoning benchmarks. This excludes OpenAI's o1 model, which relies on scaled inference computation.

Correspondence to: Zihan Liu ([email protected]), Yang Chen ([email protected]), Wei Ping ([email protected])

Citation
If you find our work helpful, we'd appreciate it if you could cite us.

@article{acemath2024,
  title={AceMath: Advancing Frontier Math Reasoning with Post-Training and Reward Modeling},
  author={Liu, Zihan and Chen, Yang and Shoeybi, Mohammad and Catanzaro, Bryan and Ping, Wei},
  journal={arXiv preprint},
  year={2024}
}

License
All models in the AceMath family are for non-commercial use only, subject to the Terms of Use of the data generated by OpenAI. We put the AceMath models under the Creative Commons Attribution-NonCommercial 4.0 International license.
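The rm@8 metric mentioned above selects, from 8 sampled solutions, the one the reward model scores highest. A minimal best-of-n sketch (the scoring function here is a toy stand-in, not AceMath-72B-RM):

```python
def best_of_n(candidates, reward_fn):
    """Return the candidate solution with the highest reward-model score."""
    return max(candidates, key=reward_fn)

# Toy stand-in reward: prefer solutions that end with a boxed answer.
def toy_reward(solution: str) -> float:
    return 1.0 if "\\boxed{" in solution else 0.0

samples = ["x = 4", "The answer is \\boxed{4}.", "maybe 5?"]
assert best_of_n(samples, toy_reward) == "The answer is \\boxed{4}."
```

With a real outcome reward model, `reward_fn` would score each full solution and rm@8 would count the selected solution as correct or not.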
Absolute_Zero_Reasoner-Coder-14b-GGUF
Qwen3-30B-A3B-Instruct-2507-GGUF
This model was generated using llama.cpp at commit `cd6983d5`. Click here to get info on choosing the right GGUF model format

We introduce the updated version of the Qwen3-30B-A3B non-thinking mode, named Qwen3-30B-A3B-Instruct-2507, featuring the following key enhancements:
- Significant improvements in general capabilities, including instruction following, logical reasoning, text comprehension, mathematics, science, coding and tool usage.
- Substantial gains in long-tail knowledge coverage across multiple languages.
- Markedly better alignment with user preferences in subjective and open-ended tasks, enabling more helpful responses and higher-quality text generation.
- Enhanced capabilities in 256K long-context understanding.

Qwen3-30B-A3B-Instruct-2507 has the following features:
- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Number of Parameters: 30.5B in total and 3.3B activated
- Number of Parameters (Non-Embedding): 29.9B
- Number of Layers: 48
- Number of Attention Heads (GQA): 32 for Q and 4 for KV
- Number of Experts: 128
- Number of Activated Experts: 8
- Context Length: 262,144 natively

NOTE: This model supports only non-thinking mode and does not generate `<think></think>` blocks in its output. Meanwhile, specifying `enable_thinking=False` is no longer required.
For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog, GitHub, and Documentation.

| | Deepseek-V3-0324 | GPT-4o-0327 | Gemini-2.5-Flash Non-Thinking | Qwen3-235B-A22B Non-Thinking | Qwen3-30B-A3B Non-Thinking | Qwen3-30B-A3B-Instruct-2507 |
| --- | --- | --- | --- | --- | --- | --- |
| Knowledge | | | | | | |
| MMLU-Pro | 81.2 | 79.8 | 81.1 | 75.2 | 69.1 | 78.4 |
| MMLU-Redux | 90.4 | 91.3 | 90.6 | 89.2 | 84.1 | 89.3 |
| GPQA | 68.4 | 66.9 | 78.3 | 62.9 | 54.8 | 70.4 |
| SuperGPQA | 57.3 | 51.0 | 54.6 | 48.2 | 42.2 | 53.4 |
| Reasoning | | | | | | |
| AIME25 | 46.6 | 26.7 | 61.6 | 24.7 | 21.6 | 61.3 |
| HMMT25 | 27.5 | 7.9 | 45.8 | 10.0 | 12.0 | 43.0 |
| ZebraLogic | 83.4 | 52.6 | 57.9 | 37.7 | 33.2 | 90.0 |
| LiveBench 20241125 | 66.9 | 63.7 | 69.1 | 62.5 | 59.4 | 69.0 |
| Coding | | | | | | |
| LiveCodeBench v6 (25.02-25.05) | 45.2 | 35.8 | 40.1 | 32.9 | 29.0 | 43.2 |
| MultiPL-E | 82.2 | 82.7 | 77.7 | 79.3 | 74.6 | 83.8 |
| Aider-Polyglot | 55.1 | 45.3 | 44.0 | 59.6 | 24.4 | 35.6 |
| Alignment | | | | | | |
| IFEval | 82.3 | 83.9 | 84.3 | 83.2 | 83.7 | 84.7 |
| Arena-Hard v2* | 45.6 | 61.9 | 58.3 | 52.0 | 24.8 | 69.0 |
| Creative Writing v3 | 81.6 | 84.9 | 84.6 | 80.4 | 68.1 | 86.0 |
| WritingBench | 74.5 | 75.5 | 80.5 | 77.0 | 72.2 | 85.5 |
| Agent | | | | | | |
| BFCL-v3 | 64.7 | 66.5 | 66.1 | 68.0 | 58.6 | 65.1 |
| TAU1-Retail | 49.6 | 60.3# | 65.2 | 65.2 | 38.3 | 59.1 |
| TAU1-Airline | 32.0 | 42.8# | 48.0 | 32.0 | 18.0 | 40.0 |
| TAU2-Retail | 71.1 | 66.7# | 64.3 | 64.9 | 31.6 | 57.0 |
| TAU2-Airline | 36.0 | 42.0# | 42.5 | 36.0 | 18.0 | 38.0 |
| TAU2-Telecom | 34.0 | 29.8# | 16.9 | 24.6 | 18.4 | 12.3 |
| Multilingualism | | | | | | |
| MultiIF | 66.5 | 70.4 | 69.4 | 70.2 | 70.8 | 67.9 |
| MMLU-ProX | 75.8 | 76.2 | 78.3 | 73.2 | 65.1 | 72.0 |
| INCLUDE | 80.1 | 82.1 | 83.8 | 75.6 | 67.8 | 71.9 |
| PolyMATH | 32.2 | 25.5 | 41.9 | 27.0 | 23.3 | 43.1 |

\*: For reproducibility, we report the win
rates evaluated by GPT-4.1. \#: Results were generated using GPT-4o-20241120, as access to the native function calling API of GPT-4o-0327 was unavailable.

The code for Qwen3-MoE has been merged into the latest Hugging Face `transformers`, and we advise you to use the latest version of `transformers`. With `sglang>=0.4.6.post1` or `vllm>=0.8.5`, you can create an OpenAI-compatible API endpoint. Note: If you encounter out-of-memory (OOM) issues, consider reducing the context length to a shorter value, such as `32768`. For local use, applications such as Ollama, LM Studio, MLX-LM, llama.cpp, and KTransformers also support Qwen3.

Qwen3 excels in tool-calling capabilities. We recommend using Qwen-Agent to make the best use of the agentic ability of Qwen3. Qwen-Agent encapsulates tool-calling templates and tool-calling parsers internally, greatly reducing coding complexity. To define the available tools, you can use an MCP configuration file, use the integrated tools of Qwen-Agent, or integrate other tools yourself.

To support ultra-long context processing (up to 1 million tokens), we integrate two key techniques:
- Dual Chunk Attention (DCA): a length extrapolation method that splits long sequences into manageable chunks while preserving global coherence.
- MInference: a sparse attention mechanism that reduces computational overhead by focusing on critical token interactions.

Together, these innovations significantly improve both generation quality and inference efficiency for sequences beyond 256K tokens. On sequences approaching 1M tokens, the system achieves up to a 3× speedup compared to standard attention implementations. For full technical details, see the Qwen2.5-1M Technical Report.

> [!NOTE]
> To effectively process a 1 million token context, users will require approximately 240 GB of total GPU memory. This accounts for model weights, KV-cache storage, and peak activation memory demands.
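The serving flags described in this card map naturally onto vLLM's Python engine arguments. As a sketch only (the exact kwargs and the `Qwen/Qwen3-30B-A3B-Instruct-2507` checkpoint id are assumptions to verify against your installed vLLM version), the 1M-token configuration could be collected like this:

```python
import os

# Per the long-context recipe in this card: the dual-chunk attention kernel
# is selected via an environment variable, not an engine argument.
os.environ["VLLM_ATTENTION_BACKEND"] = "DUAL_CHUNK_FLASH_ATTN"

def long_context_args(model: str) -> dict:
    """Gather the ~1M-token serving flags from this card as vLLM LLM() kwargs."""
    return dict(
        model=model,
        max_model_len=1_010_000,        # ~1M tokens of context
        enable_chunked_prefill=True,    # chunked prefill avoids OOM on long inputs
        max_num_batched_tokens=131_072, # prefill batch size: throughput vs. memory
        enforce_eager=True,             # CUDA graphs are disabled for dual chunk attention
        max_num_seqs=1,                 # one sequence at a time at this length
        gpu_memory_utilization=0.85,    # fraction of GPU memory for the model executor
        tensor_parallel_size=4,         # shard across 4 GPUs (illustrative)
    )

args = long_context_args("Qwen/Qwen3-30B-A3B-Instruct-2507")

if __name__ == "__main__":
    from vllm import LLM  # requires a vLLM build with dual chunk attention support
    llm = LLM(**args)
```

The same values can be passed on the `vllm serve` command line instead; the dictionary form just makes the memory/throughput trade-offs explicit in one place.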
Download the model and replace the content of your `config.json` with `config_1m.json`, which includes the config for length extrapolation and sparse attention. After updating the config, serve the model with either vLLM or SGLang, launching the server with Dual Chunk Flash Attention enabled.

vLLM:

| Parameter | Purpose |
|---|---|
| `VLLM_ATTENTION_BACKEND=DUAL_CHUNK_FLASH_ATTN` | Enables the custom attention kernel for long-context efficiency |
| `--max-model-len 1010000` | Sets maximum context length to ~1M tokens |
| `--enable-chunked-prefill` | Allows chunked prefill for very long inputs (avoids OOM) |
| `--max-num-batched-tokens 131072` | Controls batch size during prefill; balances throughput and memory |
| `--enforce-eager` | Disables CUDA graph capture (required for dual chunk attention) |
| `--max-num-seqs 1` | Limits concurrent sequences due to extreme memory usage |
| `--gpu-memory-utilization 0.85` | Sets the fraction of GPU memory to be used for the model executor |

SGLang:

| Parameter | Purpose |
|---|---|
| `--attention-backend dual_chunk_flash_attn` | Activates Dual Chunk Flash Attention |
| `--context-length 1010000` | Defines max input length |
| `--mem-frac 0.75` | The fraction of memory used for static allocation (model weights and KV-cache memory pool). Use a smaller value if you see out-of-memory errors. |
| `--tp 4` | Tensor parallelism size (matches model sharding) |
| `--chunked-prefill-size 131072` | Prefill chunk size for handling long inputs without OOM |

1. Encountering the error: "The model's max sequence length (xxxxx) is larger than the maximum number of tokens that can be stored in the KV cache." or "RuntimeError: Not enough memory. Please try to increase `--mem-fraction-static`." The VRAM reserved for the KV cache is insufficient.
- vLLM: Consider reducing `max_model_len` or increasing `tensor_parallel_size` and `gpu_memory_utilization`.
Alternatively, you can reduce `max_num_batched_tokens`, although this may significantly slow down inference.
- SGLang: Consider reducing `context-length` or increasing `tp` and `mem-frac`. Alternatively, you can reduce `chunked-prefill-size`, although this may significantly slow down inference.

2. Encountering the error: "torch.OutOfMemoryError: CUDA out of memory." The VRAM reserved for activation weights is insufficient. You can try lowering `gpu_memory_utilization` or `mem-frac`, but be aware that this might reduce the VRAM available for the KV cache.

3. Encountering the error: "Input prompt (xxxxx tokens) + lookahead slots (0) is too long and exceeds the capacity of the block manager." or "The input (xxxx tokens) is longer than the model's context length (xxx tokens)." The input is too long. Consider using a shorter sequence or increasing `max_model_len` or `context-length`.

We test the model on a 1M-token version of the RULER benchmark.

| Model Name | Acc avg | 4k | 8k | 16k | 32k | 64k | 96k | 128k | 192k | 256k | 384k | 512k | 640k | 768k | 896k | 1000k |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3-30B-A3B (Non-Thinking) | 72.0 | 97.1 | 96.1 | 95.0 | 92.2 | 82.6 | 79.7 | 76.9 | 70.2 | 66.3 | 61.9 | 55.4 | 52.6 | 51.5 | 52.0 | 50.9 |
| Qwen3-30B-A3B-Instruct-2507 (Full Attention) | 86.8 | 98.0 | 96.7 | 96.9 | 97.2 | 93.4 | 91.0 | 89.1 | 89.8 | 82.5 | 83.6 | 78.4 | 79.7 | 77.6 | 75.7 | 72.8 |
| Qwen3-30B-A3B-Instruct-2507 (Sparse Attention) | 86.8 | 98.0 | 97.1 | 96.3 | 95.1 | 93.6 | 92.5 | 88.1 | 87.7 | 82.9 | 85.7 | 80.7 | 80.0 | 76.9 | 75.5 | 72.2 |

All models are evaluated with Dual Chunk Attention enabled. Since the evaluation is time-consuming, we use 260 samples for each length (13 sub-tasks, 20 samples each). To achieve optimal performance, we recommend the following settings: 1.
Sampling Parameters:
- We suggest using `Temperature=0.7`, `TopP=0.8`, `TopK=20`, and `MinP=0`.
- For supported frameworks, you can adjust the `presence_penalty` parameter between 0 and 2 to reduce endless repetitions. However, using a higher value may occasionally result in language mixing and a slight decrease in model performance.

2. Adequate Output Length: We recommend an output length of 16,384 tokens for most queries, which is adequate for instruct models.

3. Standardize Output Format: We recommend using prompts to standardize model outputs when benchmarking.
- Math Problems: Include "Please reason step by step, and put your final answer within \boxed{}." in the prompt.
- Multiple-Choice Questions: Add the following JSON structure to the prompt to standardize responses: "Please show your choice in the `answer` field with only the choice letter, e.g., `"answer": "C"`."

If you find our work helpful, feel free to give us a cite.

Help me test my AI-Powered Quantum Network Monitor Assistant with quantum-ready security checks. The full open-source code for the Quantum Network Monitor Service is available at my GitHub repos (repos with NetworkMonitor in the name): Source Code Quantum Network Monitor. You will also find the code I use to quantize the models in GGUFModelBuilder.

💬 How to test: Choose an AI assistant type:
- `TurboLLM` (GPT-4.1-mini)
- `HugLLM` (Hugging Face open-source models)
- `TestLLM` (Experimental CPU-only)

What I'm Testing: I'm pushing the limits of small open-source models for AI network monitoring, specifically:
- Function calling against live network services
- How small can a model go while still handling:
  - Automated Nmap security scans
  - Quantum-readiness checks
  - Network Monitoring tasks

🟡 TestLLM – Current experimental model (llama.cpp on 2 CPU threads on a Hugging Face Docker space):
- ✅ Zero-configuration setup
- ⏳ 30s load time (slow inference but no API costs). No token limit, as the cost is low.
- 🔧 Help wanted! If you're into edge-device AI, let's collaborate!

Other Assistants

🟢 TurboLLM – Uses GPT-4.1-mini:
- It performs very well, but unfortunately OpenAI charges per token, so token usage is limited.
- Create custom cmd processors to run .NET code on Quantum Network Monitor Agents
- Real-time network diagnostics and monitoring
- Security audits
- Penetration testing (Nmap/Metasploit)

🔵 HugLLM – Latest open-source models:
- 🌐 Runs on the Hugging Face Inference API. Performs pretty well using the latest models hosted on Novita.

💡 Example commands you could test:
1. `"Give me info on my website's SSL certificate"`
2. `"Check if my server is using quantum-safe encryption for communication"`
3. `"Run a comprehensive security audit on my server"`
4. `"Create a cmd processor to .. (whatever you want)"`

Note: you need to install a Quantum Network Monitor Agent to run the .NET code on. This is a very flexible and powerful feature. Use with caution!

I fund the servers used to create these model files, run the Quantum Network Monitor service, and pay for inference from Novita and OpenAI, all out of my own pocket. All the code behind the model creation and the Quantum Network Monitor project is open source. Feel free to use whatever you find helpful. If you appreciate the work, please consider buying me a coffee ☕. Your support helps cover service costs and allows me to raise token limits for everyone.
granite-3.0-2b-instruct-GGUF
InternVL3-8B-GGUF
granite-20b-code-instruct-8k-GGUF
This model was generated using llama.cpp at commit `0a5a3b5c`. Click here to get info on choosing the right GGUF model format

Model Summary

Granite-20B-Code-Instruct-8K is a 20B-parameter model fine-tuned from Granite-20B-Code-Base-8K on a combination of permissively licensed instruction data to enhance instruction-following capabilities, including logical reasoning and problem-solving skills.
- Developers: IBM Research
- GitHub Repository: ibm-granite/granite-code-models
- Paper: Granite Code Models: A Family of Open Foundation Models for Code Intelligence
- Release Date: May 6th, 2024
- License: Apache 2.0

Usage

Intended use: The model is designed to respond to coding-related instructions and can be used to build coding assistants.

Generation: This is a simple example of how to use the Granite-20B-Code-Instruct-8K model.

Training Data

Granite Code Instruct models are trained on the following types of data.

Code Commits Datasets: we sourced code commit data from the CommitPackFT dataset, a filtered version of the full CommitPack dataset. From the CommitPackFT dataset, we only consider data for 92 programming languages. Our inclusion criteria boil down to selecting programming languages common across CommitPackFT and the 116 languages that we considered to pretrain the code base model (Granite-20B-Code-Base).
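The generation example referenced in the Usage section above was not carried over into this card. A minimal sketch of what it might look like with `transformers` (the prompt and generation settings here are illustrative assumptions, not the card's original example):

```python
# Assumed real HF repo id for the base (non-GGUF) checkpoint.
MODEL_ID = "ibm-granite/granite-20b-code-instruct-8k"

def build_chat(tokenizer, user_msg: str) -> str:
    """Render a single-turn chat using the model's own chat template."""
    return tokenizer.apply_chat_template(
        [{"role": "user", "content": user_msg}],
        tokenize=False,
        add_generation_prompt=True,
    )

if __name__ == "__main__":
    # Heavy imports and the 20B download happen only when run directly.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, device_map="auto", torch_dtype=torch.bfloat16
    )
    prompt = build_chat(tokenizer, "Write a Python function that computes the GCD of two integers.")
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=256)
    new_tokens = output[0][inputs["input_ids"].shape[1]:]
    print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```

For the GGUF files in this repo, the equivalent would be a `llama.cpp` invocation rather than `transformers`.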
Math Datasets: We consider two high-quality math datasets, MathInstruct and MetaMathQA. Due to license issues, we filtered out GSM8K-RFT and Camel-Math from the MathInstruct dataset.

Code Instruction Datasets: We use Glaive-Code-Assistant-v3, Glaive-Function-Calling-v2, NL2SQL11, and a small collection of synthetic API-calling datasets.

Language Instruction Datasets: We include high-quality datasets such as HelpSteer and an open-license-filtered version of Platypus. We also include a collection of hardcoded prompts to ensure our model generates correct outputs given inquiries about its name or developers.

Infrastructure

We train the Granite Code models using two of IBM's supercomputing clusters, Vela and Blue Vela, outfitted with NVIDIA A100 and H100 GPUs, respectively. These clusters provide a scalable and efficient infrastructure for training our models over thousands of GPUs.

Ethical Considerations and Limitations

Granite Code Instruct models are primarily fine-tuned using instruction-response pairs across a specific set of programming languages. Thus, their performance may be limited with out-of-domain programming languages. In this situation, it is beneficial to provide few-shot examples to steer the model's output. Moreover, developers should perform safety testing and target-specific tuning before deploying these models in critical applications. The model also inherits ethical considerations and limitations from its base model. For more information, please refer to the Granite-20B-Code-Base-8K model card.
Llama-3.1-Nemotron-8B-UltraLong-4M-Instruct-GGUF
OpenReasoning-Nemotron-32B-GGUF
rwkv7-0.4B-world-GGUF
granite-guardian-3.1-8b-GGUF
InternVL3-1B-GGUF
granite-20b-code-base-8k-GGUF
OpenMath-Nemotron-14B-Kaggle-GGUF
GLM-4-32B-0414-GGUF
SmolLM2-1.7B-Instruct-GGUF
Refact-1_6B-fim-GGUF
SmolLM2-360M-Instruct-GGUF
ERNIE-4.5-21B-A3B-PT-GGUF
AceMath-7B-Instruct-GGUF
This model was generated using llama.cpp at commit `e743cddb`. Click here to get info on choosing the right GGUF model format

Introduction

We introduce AceMath, a family of frontier models designed for mathematical reasoning. The models in the AceMath family, including AceMath-1.5B/7B/72B-Instruct and AceMath-7B/72B-RM, are improved using Qwen. The AceMath-1.5B/7B/72B-Instruct models excel at solving English mathematical problems using Chain-of-Thought (CoT) reasoning, while the AceMath-7B/72B-RM models, as outcome reward models, specialize in evaluating and scoring mathematical solutions.

The AceMath-1.5B/7B/72B-Instruct models are developed from the Qwen2.5-Math-1.5B/7B/72B-Base models, leveraging a multi-stage supervised fine-tuning (SFT) process: first with general-purpose SFT data, followed by math-specific SFT data. We are releasing all training data to support further research in this field.

We recommend using the AceMath models only for solving math problems. To support other tasks, we also release AceInstruct-1.5B/7B/72B, a series of general-purpose SFT models designed to handle code, math, and general-knowledge tasks. These models are built upon the Qwen2.5-1.5B/7B/72B-Base models. For more information about AceMath, check our website and paper.
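Since the Instruct models emit Chain-of-Thought solutions with the final answer in `\boxed{...}`, a small helper for pulling that answer out can be useful when scoring outputs. This is a hypothetical sketch: the helper name, prompt, and pipeline call are illustrative assumptions, not part of the AceMath release.

```python
import re
from typing import Optional

def extract_boxed(solution: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in a CoT solution, if any."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", solution)
    return matches[-1] if matches else None

if __name__ == "__main__":
    # Inference itself needs a GPU-sized setup; assumed repo id shown.
    from transformers import pipeline
    generator = pipeline("text-generation", model="nvidia/AceMath-7B-Instruct")
    prompt = ("What is 12 * 7? Please reason step by step, "
              "and put your final answer within \\boxed{}.")
    text = generator(prompt, max_new_tokens=512)[0]["generated_text"]
    print(extract_boxed(text))
```

Taking the last `\boxed{...}` rather than the first matters when a solution restates intermediate boxed results before the final answer.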
All Resources

AceMath Instruction Models - AceMath-1.5B-Instruct, AceMath-7B-Instruct, AceMath-72B-Instruct
AceMath Reward Models - AceMath-7B-RM, AceMath-72B-RM
Evaluation & Training Data - AceMath-RewardBench, AceMath-Instruct Training Data, AceMath-RM Training Data
General Instruction Models - AceInstruct-1.5B, AceInstruct-7B, AceInstruct-72B

Benchmark Results (AceMath-Instruct + AceMath-72B-RM)

We compare AceMath to leading proprietary and open-access math models in the table above. Our AceMath-7B-Instruct largely outperforms the previous best-in-class Qwen2.5-Math-7B-Instruct (average pass@1: 67.2 vs. 62.9) on a variety of math reasoning benchmarks, while coming close to the performance of the 10× larger Qwen2.5-Math-72B-Instruct (67.2 vs. 68.2). Notably, our AceMath-72B-Instruct outperforms the state-of-the-art Qwen2.5-Math-72B-Instruct (71.8 vs. 68.2), GPT-4o (67.4), and Claude 3.5 Sonnet (65.6) by a clear margin. We also report the rm@8 accuracy (best of 8) achieved by our reward model, AceMath-72B-RM, which sets a new record on these reasoning benchmarks. This excludes OpenAI's o1 model, which relies on scaled inference computation.

Correspondence to Zihan Liu ([email protected]), Yang Chen ([email protected]), Wei Ping ([email protected])

Citation

If you find our work helpful, we'd appreciate it if you could cite us.

@article{acemath2024,
  title={AceMath: Advancing Frontier Math Reasoning with Post-Training and Reward Modeling},
  author={Liu, Zihan and Chen, Yang and Shoeybi, Mohammad and Catanzaro, Bryan and Ping, Wei},
  journal={arXiv preprint},
  year={2024}
}

License

All models in the AceMath family are for non-commercial use only, subject to the Terms of Use of the data generated by OpenAI. We put the AceMath models under the license of Creative Commons Attribution-NonCommercial 4.0 International.
kani-tts-450m-0.1-pt-GGUF
This model was generated using llama.cpp at commit `152729f8`. Click here to get info on choosing the right GGUF model format

KaniTTS is a text-to-speech (TTS) model designed for high-speed, high-fidelity audio generation. It is built on a novel architecture that combines a powerful language model with a highly efficient audio codec, enabling exceptional performance for real-time applications. KaniTTS operates as a two-stage pipeline, leveraging a large foundation model for token generation and a compact, efficient codec for waveform synthesis. This two-stage design provides a significant advantage in speed and efficiency: the backbone LLM generates a compressed token representation, which is then rapidly expanded into an audio waveform by the NanoCodec. The architecture bypasses the computational overhead of generating waveforms directly from large-scale language models, resulting in extremely low latency.

Features

This model was trained primarily on English for robust core capabilities, and the tokenizer supports these languages: English, Arabic, Chinese, French, German, Japanese, Korean, and Spanish. The base model can be continually pretrained on multilingual datasets, producing high-fidelity audio at a 22 kHz sample rate. The model powers voice interactions in modern agentic systems, enabling seamless, human-like conversations.
- Model Size: 450M parameters (pretrained version)
- License: Apache 2.0

Sample texts (audio players are available on the model page):
- I do believe Marsellus Wallace, MY husband, YOUR boss, told you to take me out and do WHATEVER I WANTED.
- What do we say to the god of death? Not today!
- What do you call a lawyer with an IQ of 60? Your honor
- You mean, let me understand this cause, you know maybe it's me, it's a little fucked up maybe, but I'm funny how, I mean funny like I'm a clown, I amuse you? I make you laugh, I'm here to fucking amuse you?

- Website: nineninesix.ai
- GitHub Repo: https://github.com/nineninesix-ai/kani-tts
- Base Model Card on HF: nineninesix/kani-tts-450m-0.1-pt
- FT Model Card on HuggingFace: nineninesix/kani-tts-450m-0.2-ft
- Link to HF Space: nineninesix/KaniTTS
- Inference Example: Colab Notebook
- Finetuning Example: Colab Notebook
- Example Dataset for Fine-tuning: Expresso Conversational
- Waiting List for Pro Version

Recommended Uses
- Conversational AI: Integrate into chatbots, virtual assistants, or voice-enabled apps for real-time speech output.
- Edge and Server Deployment: Optimized for low-latency inference on edge devices or affordable servers, enabling scalable, resource-efficient voice applications.
- Accessibility Tools: Support screen readers or language-learning apps with expressive prosody.
- Research: Fine-tune for domain-specific voices (e.g., accents, emotions) or benchmark against other TTS systems.

Limitations
- Performance may vary with fine-tuned variants, long inputs (> 2000 tokens), or rare languages/accents.
- Emotion control is basic; advanced expressivity requires fine-tuning.
- Trained on public datasets; may inherit biases in prosody or pronunciation from the training data.

- Dataset: Curated from LibriTTS, Common Voice, and Emilia (~50k hours).
- Pretrained mostly on English speech for robust core capabilities, with multilingual fine-tuning for supported languages.
- Metrics: MOS (Mean Opinion Score) 4.3/5 for naturalness; WER (Word Error Rate). This performance makes KaniTTS suitable for real-time conversational AI applications and low-latency voice synthesis.
- Language Optimization: For the best results in non-English languages, continually pretrain this model on datasets from your desired language set to improve prosody, accents, and pronunciation accuracy. Additionally, fine-tune NanoCodec for the desired set of languages.
- Batch Processing: For high-throughput applications, process texts in batches of 8-16 to leverage parallel computation, reducing per-sample latency.
- Blackwell GPU Optimization: This model runs efficiently on NVIDIA's Blackwell-architecture GPUs for faster inference and reduced latency in real-time applications.

Credits
- This project was inspired by the works of Orpheus TTS and Sesame CSM.
- It utilizes the LiquidAI LFM2 350M as its core backbone and the Nvidia NanoCodec for efficient audio processing.

Responsible Use and Prohibited Activities

The model is designed for ethical and responsible use. The following activities are strictly prohibited:
- The model may not be used for any illegal purposes or to create content that is harmful, threatening, defamatory, or obscene. This includes, but is not limited to, the generation of hate speech, harassment, or incitement of violence.
- You may not use the model to generate or disseminate false or misleading information. This includes creating deceptive audio content that impersonates individuals without their consent or misrepresents facts.
- The model is not to be used for malicious activities, such as spamming, phishing, or the creation of content intended to deceive or defraud.

By using this model, you agree to abide by these restrictions and all applicable laws and regulations.

Contact

Have a question, feedback, or need support? Please fill out our contact form and we'll get back to you as soon as possible.
GLM-Z1-Rumination-32B-0414-GGUF
AceReason-Nemotron-1.1-7B-GGUF
kanana-safeguard-8b-GGUF
This model was generated using llama.cpp at commit `adef8178`. Click here to get info on choosing the right GGUF model format

Model Details

Kanana Safeguard is a harmful-content detection model based on Kanana 8B, Kakao's in-house language model. It is trained to classify whether a user utterance or an AI assistant's response within a conversational AI system carries risk. The classification result is output as a single token of the form <SAFE> or <UNSAFE-S4>, where S4 denotes the code of the risk category violated by the user utterance or assistant response.

Risk Taxonomy

The model's risk categories are based on the MLCommons taxonomy, extended with categories reflecting Korean local characteristics, yielding a taxonomy of seven categories in total. Both ① user utterances and ② AI assistant responses are judged against the same taxonomy.
- S1 Hate: Speech that discriminates against a target on the basis of origin, race, appearance, disability or illness, socioeconomic situation or status, religion, age, gender, gender identity, sexual orientation, or other identity factors, or that attacks individuals or groups on the basis of such discrimination.
- S2 Harassment: Speech that causes discomfort or humiliation, is threatening, or incites harassment of a particular target.
- S3 Sexual Content: Speech that depicts or implies sexual acts or body parts, or that may cause sexual shame or disgust (excluding sex education and well-being).
- S4 Crime: Speech describing the planning and preparation of illegal acts (e.g., violent and non-violent crimes, sex crimes, weapon manufacture or procurement).
- S5 Child Sexual Exploitation: Speech that describes, encourages, or supports sexual abuse of children (e.g., grooming, CSAM-related text).
- S6 Suicide and Self-Harm: Speech that depicts or encourages intentionally ending one's own life or deliberately harming one's own body.

Quick Start

🤗 Hugging Face Transformers - Running the model requires `transformers>=4.51.3` or a later version.

Training Data

Kanana Safeguard's training data consists of manually authored and synthetic data, all in Korean. The manual data was created and labeled by professional annotators in line with internal policy. The synthetic data was generated through various augmentation techniques, such as LLM-based paraphrasing and noise injection.
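Given the single-token verdict format described above (<SAFE> or <UNSAFE-S#>), a classification call can be sketched as follows. The `parse_verdict` helper and the repo id are illustrative assumptions, and the chat-template details may differ from the model's actual quick-start code:

```python
import re
from typing import Optional, Tuple

def parse_verdict(token: str) -> Tuple[bool, Optional[str]]:
    """Map the model's one-token verdict to (is_safe, risk_code)."""
    token = token.strip()
    if token == "<SAFE>":
        return True, None
    m = re.fullmatch(r"<UNSAFE-(S\d)>", token)
    if m:
        return False, m.group(1)  # e.g. "S4" for the Crime category
    raise ValueError(f"unexpected verdict token: {token!r}")

if __name__ == "__main__":
    from transformers import AutoModelForCausalLM, AutoTokenizer  # transformers>=4.51.3

    MODEL_ID = "kakaocorp/kanana-safeguard-8b"  # assumed repo id
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")
    messages = [{"role": "user", "content": "Example user utterance to screen."}]
    input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt").to(model.device)
    out = model.generate(input_ids, max_new_tokens=1)  # the verdict is a single token
    verdict = tokenizer.decode(out[0][input_ids.shape[1]:], skip_special_tokens=False)
    print(parse_verdict(verdict))
```

Generating exactly one new token keeps the call cheap and matches the evaluation protocol described below, which classifies on the model's first output token.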
In addition to unsafe utterances, the training data includes conversations in which the AI assistant responded safely to harmful questions, in order to reduce the model's false-positive rate.

Evaluation

Kanana Safeguard's performance was evaluated as a binary SAFE/UNSAFE classification. All evaluations treat UNSAFE as the positive class and classify based on the first token output by the model. The external benchmark models were evaluated on their outputs as follows: LlamaGuard was judged directly on its SAFE/UNSAFE tokens. ShieldGemma performed binary classification with the threshold set to 0.5. GPT-4o was given a risk-category-based classification prompt in a zero-shot setting, and outputs classified into a specific code were counted as UNSAFE. On our internally built Korean evaluation dataset, Kanana Safeguard showed the best classification performance among the benchmarked models. All models were evaluated with the same evaluation dataset and classification criteria, designed to minimize the effects of policy and architecture differences and to allow a fair and reliable comparison.

Kanana Safeguard has the following limitations, which we plan to improve continuously:
- The model does not guarantee 100% accurate classification. In particular, its policy was established around common use cases, so it may misclassify in specific domains.
- The model does not maintain context from previous conversation turns or carry a conversation forward.
- The model detects only a fixed set of risks, so it cannot catch every risk in real-world use. Depending on your needs, using it together with Kanana Safeguard-Siren (legal-risk detection) and Kanana Safeguard-Prompt (prompt-attack detection) can further improve overall safety.

Contributors: JeongHwan Lee, Deok Jeong, HyeYeon Cho, JiEun Choi
You will also find the code I use to quantize the models, if you want to do it yourself: GGUFModelBuilder

💬 How to test: Choose an AI assistant type: - `TurboLLM` (GPT-4.1-mini) - `HugLLM` (Hugging Face open-source models) - `TestLLM` (Experimental CPU-only)

What I'm Testing: I'm pushing the limits of small open-source models for AI network monitoring, specifically: - Function calling against live network services - How small can a model go while still handling: - Automated Nmap security scans - Quantum-readiness checks - Network monitoring tasks

🟡 TestLLM – Current experimental model (llama.cpp on 2 CPU threads on a Hugging Face Docker space): - ✅ Zero-configuration setup - ⏳ 30s load time (slow inference, but no API costs); no token limit, as the cost is low - 🔧 Help wanted! If you're into edge-device AI, let's collaborate!

Other Assistants 🟢 TurboLLM – Uses gpt-4.1-mini: - It performs very well, but unfortunately OpenAI charges per token, so token usage is limited - Create custom cmd processors to run .net code on Quantum Network Monitor Agents - Real-time network diagnostics and monitoring - Security audits - Penetration testing (Nmap/Metasploit)

🔵 HugLLM – Latest open-source models: - 🌐 Runs on the Hugging Face Inference API; performs pretty well using the latest models hosted on Novita.

💡 Example commands you could test: 1. `"Give me info on my website's SSL certificate"` 2. `"Check if my server is using quantum-safe encryption for communication"` 3. `"Run a comprehensive security audit on my server"` 4. `"Create a cmd processor to ... (whatever you want)"` Note: you need to install a Quantum Network Monitor Agent to run the .net code. This is a very flexible and powerful feature. Use with caution! I fund the servers used to create these model files, run the Quantum Network Monitor service, and pay for inference from Novita and OpenAI—all out of my own pocket. All the code behind the model creation and the Quantum Network Monitor project is open source.
Feel free to use whatever you find helpful. If you appreciate the work, please consider buying me a coffee ☕. Your support helps cover service costs and allows me to raise token limits for everyone.
kanana-safeguard-siren-8b-GGUF
This model was generated using llama.cpp at commit `adef8178`. Click here to get info on choosing the right GGUF model format

Model Description: Kanana Safeguard-Siren is a legal and policy risk detection model built on Kanana 8B, Kakao's in-house language model. It is trained to flag user utterances in a conversational AI system that require legal or policy caution. The classification result is emitted as a single token of the form <SAFE> or <UNSAFE-I2>, where I2 denotes the code of the risk category violated by the user utterance.

Risk Taxonomy: The model's risk categories are based on the MLCommons taxonomy, extended with categories reflecting Korean legal specifics, yielding a taxonomy of four categories:
- I1 Adult verification: utterances requesting youth-restricted information, such as alcohol, tobacco, gambling, adult entertainment venues, or 19+ content
- I2 Professional advice: utterances requesting advice tied to professional decision-making in medicine, law, tax, finance, and similar fields
- I3 Personal information: utterances requesting or containing personally identifiable information (e.g., resident registration numbers, account numbers) or other sensitive data
- I4 Intellectual property: utterances requesting or attempting to reproduce, without authorization, content protected by copyright, patent, trademark, etc.

Quick Start: 🤗 HuggingFace Transformers - running the model requires `transformers>=4.51.3` or later.

Kanana Safeguard-Siren's training data consists of manually curated, synthetic, and external data, drawing on diverse data types to ensure variety. The manual data was created and labeled directly by professional annotators in line with internal policy. The synthetic data was generated through augmentation techniques such as LLM-based paraphrasing and noise injection to improve training effectiveness. The external data was collected from publicly available sources. In addition to unsafe utterances, the training data includes safe user utterances to reduce the model's false-positive rate.

Evaluation: Kanana Safeguard-Siren was evaluated as a binary SAFE/UNSAFE classifier. In all evaluations, UNSAFE was treated as the positive class, and classification was based on the first token the model emitted.
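The evaluation protocol above (UNSAFE as the positive class, judged on the model's first output token) can be sketched as a small scoring helper. This illustrates the metric only; it is not the team's actual evaluation code:

```python
def binary_scores(gold, pred, positive="UNSAFE"):
    """Precision/recall/F1 with UNSAFE as the positive class.

    Labels are assumed to be derived from the first output token:
    any <UNSAFE-*> token counts as "UNSAFE", <SAFE> counts as "SAFE".
    """
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```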
External benchmark models were scored as follows: LlamaGuard's SAFE/UNSAFE tokens were used directly; ShieldGemma performed binary classification with a threshold of 0.5; GPT-4o was given a zero-shot, risk-category-based classification prompt, and any output assigned to a specific code was counted as UNSAFE. On our in-house Korean evaluation dataset, Kanana Safeguard-Siren outperformed all benchmark models. All models were evaluated on the same test set with the same classification criteria, designed to minimize the impact of policy and architecture differences and enable a fair, reliable comparison.

Limitations: Kanana Safeguard-Siren has the following limitations, which we plan to keep improving. 1. Possible misclassification: the model does not guarantee perfect classification; in particular, because its policy was built around common use cases, it may misclassify inputs from specialized domains. 2. No context awareness: it does not maintain context from previous turns or carry on conversations. 3. Limited risk categories: it detects only a fixed set of risks and therefore cannot catch every risk that occurs in practice; depending on your needs, using it together with Kanana Safeguard (harmful-content detection) and Kanana Safeguard-Prompt (prompt-attack detection) can further improve overall safety.

Contributors: HyeYeon Cho, JeongHwan Lee, Deok Jeong, JiEun Choi
reka-flash-3.1-GGUF
granite-guardian-3.1-2b-GGUF
granite-3.1-3b-a800m-instruct-GGUF
This model was generated using llama.cpp at commit `0a5a3b5c`. Click here to get info on choosing the right GGUF model format

Model Summary: Granite-3.1-3B-A800M-Instruct is a 3B parameter long-context instruct model finetuned from Granite-3.1-3B-A800M-Base using a combination of open-source instruction datasets with permissive licenses and internally collected synthetic datasets tailored for solving long-context problems. This model is developed using a diverse set of techniques with a structured chat format, including supervised finetuning, model alignment using reinforcement learning, and model merging. - Developers: Granite Team, IBM - GitHub Repository: ibm-granite/granite-3.1-language-models - Website: Granite Docs - Paper: Granite 3.1 Language Models (coming soon) - Release Date: December 18th, 2024 - License: Apache 2.0

Supported Languages: English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese. Users may finetune Granite 3.1 models for languages beyond these 12.

Intended Use: The model is designed to respond to general instructions and can be used to build AI assistants for multiple domains, including business applications.

Capabilities: - Summarization - Text classification - Text extraction - Question-answering - Retrieval Augmented Generation (RAG) - Code-related tasks - Function-calling tasks - Multilingual dialog use cases - Long-context tasks, including long document/meeting summarization, long document QA, etc.

Generation: This is a simple example of how to use the Granite-3.1-3B-A800M-Instruct model; copy the snippet relevant to your use case.
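The generation snippet itself is not included in this card. A hedged quick-start sketch, assuming the standard transformers chat-template flow and the `ibm-granite/granite-3.1-3b-a800m-instruct` Hugging Face id:

```python
# Hedged quick-start sketch: the repo id and chat-template flow are assumed
# from the usual Granite/transformers conventions, not copied from the card.
MODEL_PATH = "ibm-granite/granite-3.1-3b-a800m-instruct"  # assumed HF repo id

def build_chat(user_prompt: str) -> list:
    """Granite instruct models take role/content messages for apply_chat_template."""
    return [{"role": "user", "content": user_prompt}]

def main() -> None:
    # Heavy imports kept inside main() so the helper above stays importable.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
    model = AutoModelForCausalLM.from_pretrained(MODEL_PATH, device_map="auto")
    chat = build_chat("Summarize the capabilities of the Granite 3.1 models.")
    input_ids = tokenizer.apply_chat_template(
        chat, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(input_ids, max_new_tokens=128)
    print(tokenizer.decode(output[0, input_ids.shape[-1]:], skip_special_tokens=True))
```

Adjust `device_map` and dtype to your hardware; the GGUF files in this repo are instead meant for llama.cpp-based runtimes.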
| Models | ARC-Challenge | Hellaswag | MMLU | TruthfulQA | Winogrande | GSM8K | Avg |
|---|---|---|---|---|---|---|---|
| Granite-3.1-8B-Instruct | 62.62 | 84.48 | 65.34 | 66.23 | 75.37 | 73.84 | 71.31 |
| Granite-3.1-2B-Instruct | 54.61 | 75.14 | 55.31 | 59.42 | 67.48 | 52.76 | 60.79 |
| Granite-3.1-3B-A800M-Instruct | 50.42 | 73.01 | 52.19 | 49.71 | 64.87 | 48.97 | 56.53 |
| Granite-3.1-1B-A400M-Instruct | 42.66 | 65.97 | 26.13 | 46.77 | 62.35 | 33.88 | 46.29 |

| Models | IFEval | BBH | MATH Lvl 5 | GPQA | MUSR | MMLU-Pro | Avg |
|---|---|---|---|---|---|---|---|
| Granite-3.1-8B-Instruct | 72.08 | 34.09 | 21.68 | 8.28 | 19.01 | 28.19 | 30.55 |
| Granite-3.1-2B-Instruct | 62.86 | 21.82 | 11.33 | 5.26 | 4.87 | 20.21 | 21.06 |
| Granite-3.1-3B-A800M-Instruct | 55.16 | 16.69 | 10.35 | 5.15 | 2.51 | 12.75 | 17.10 |
| Granite-3.1-1B-A400M-Instruct | 46.86 | 6.18 | 4.08 | 0.00 | 0.78 | 2.41 | 10.05 |

Model Architecture: Granite-3.1-3B-A800M-Instruct is based on a decoder-only Mixture of Experts (MoE) transformer architecture (A800M denotes roughly 800M active parameters per token). Core components of this architecture are: GQA and RoPE, MLP with SwiGLU, RMSNorm, and shared input/output embeddings.

Training Data: Overall, our SFT data is largely comprised of three key sources: (1) publicly available datasets with permissive licenses, (2) internal synthetic data targeting specific capabilities, including long-context tasks, and (3) very small amounts of human-curated data. A detailed attribution of datasets can be found in the Granite 3.0 Technical Report, Granite 3.1 Technical Report (coming soon), and Accompanying Author List.

Infrastructure: We train Granite 3.1 Language Models using IBM's supercomputing cluster, Blue Vela, which is outfitted with NVIDIA H100 GPUs. This cluster provides a scalable and efficient infrastructure for training our models over thousands of GPUs.

Ethical Considerations and Limitations: Granite 3.1 Instruct Models are primarily finetuned using instruction-response pairs mostly in English, but also multilingual data covering eleven languages. Although this model can handle multilingual dialog use cases, its performance might not match that on English tasks.
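As a sanity check, each Avg value is consistent with the mean of the six benchmark scores on its row, rounded to two decimals:

```python
# Recompute the Avg column of the first benchmark table from its six scores.
rows = {
    "Granite-3.1-8B-Instruct":       [62.62, 84.48, 65.34, 66.23, 75.37, 73.84],
    "Granite-3.1-2B-Instruct":       [54.61, 75.14, 55.31, 59.42, 67.48, 52.76],
    "Granite-3.1-3B-A800M-Instruct": [50.42, 73.01, 52.19, 49.71, 64.87, 48.97],
    "Granite-3.1-1B-A400M-Instruct": [42.66, 65.97, 26.13, 46.77, 62.35, 33.88],
}
averages = {name: round(sum(s) / len(s), 2) for name, s in rows.items()}
```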
In such cases, introducing a small number of examples (few-shot) can help the model generate more accurate outputs. While this model has been aligned with safety in mind, it may in some cases produce inaccurate, biased, or unsafe responses to user prompts. We therefore urge the community to use this model with proper safety testing and tuning tailored to their specific tasks.

Resources - ⭐️ Learn about the latest updates with Granite: https://www.ibm.com/granite - 📄 Get started with tutorials, best practices, and prompt engineering advice: https://www.ibm.com/granite/docs/ - 💡 Learn about the latest Granite learning resources: https://ibm.biz/granite-learning-resources
Nemotron-4-Mini-Hindi-4B-Base-GGUF
AceInstruct-7B-GGUF
granite-20b-code-base-r1.1-GGUF
watt-tool-8B-GGUF
Meta-Llama-3-8B-Instruct-GGUF
granite-8b-code-base-4k-GGUF
This model was generated using llama.cpp at commit `5dd942de`. Click here to get info on choosing the right GGUF model format

Model Summary Granite-8B-Code-Base-4K is a decoder-only code model designed for code generative tasks (e.g., code generation, code explanation, code fixing, etc.). It is trained from scratch with a two-phase training strategy. In phase 1, our model is trained on 4 trillion tokens sourced from 116 programming languages, ensuring a comprehensive understanding of programming languages and syntax. In phase 2, our model is trained on 500 billion tokens with a carefully designed mixture of high-quality data from code and natural language domains to improve the models' ability to reason and follow instructions. - Developers: IBM Research - GitHub Repository: ibm-granite/granite-code-models - Paper: Granite Code Models: A Family of Open Foundation Models for Code Intelligence - Release Date: May 6th, 2024 - License: Apache 2.0. Usage Intended use Prominent enterprise use cases of LLMs in software engineering productivity include code generation, code explanation, code fixing, generating unit tests, generating documentation, addressing technical debt issues, vulnerability detection, code translation, and more.
All Granite Code Base models, including the 8B parameter model, are able to handle these tasks, as they were trained on a large amount of code data from 116 programming languages. Generation This is a simple example of how to use the Granite-8B-Code-Base-4K model. Training Data - Data Collection and Filtering: Pretraining code data is sourced from a combination of publicly available datasets (e.g., GitHub Code Clean, Starcoder data) and additional public code repositories and issues from GitHub. We filter raw data to retain a list of 116 programming languages. After language filtering, we also filter out low-quality code. - Exact and Fuzzy Deduplication: We adopt an aggressive deduplication strategy that includes both exact and fuzzy deduplication to remove documents having (near) identical code content. - HAP, PII, Malware Filtering: We apply a HAP content filter that reduces models' likelihood of generating hateful, abusive, or profane language. We also make sure to redact Personally Identifiable Information (PII) by replacing PII content (e.g., names, email addresses, keys, passwords) with corresponding tokens (e.g., ⟨NAME⟩, ⟨EMAIL⟩, ⟨KEY⟩, ⟨PASSWORD⟩). Moreover, we scan all datasets using ClamAV to identify and remove instances of malware in the source code. - Natural Language Datasets: In addition to collecting code data for model training, we curate several publicly available high-quality natural language datasets to improve models' proficiency in language understanding and mathematical reasoning. Unlike the code data, we do not deduplicate these datasets. Infrastructure We train the Granite Code models using two of IBM's supercomputing clusters, namely Vela and Blue Vela, outfitted with NVIDIA A100 and H100 GPUs, respectively. These clusters provide a scalable and efficient infrastructure for training our models over thousands of GPUs.
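The PII-redaction step described above (replacing matched spans with placeholder tokens like ⟨EMAIL⟩ or ⟨KEY⟩) can be illustrated with a toy regex pass. The patterns below are simplified stand-ins of my own, not IBM's actual filtering rules:

```python
import re

# Simplified, illustrative PII patterns -> replacement tokens.
PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "⟨EMAIL⟩"),   # email addresses
    (re.compile(r"(?:sk|key)-[A-Za-z0-9]{8,}"), "⟨KEY⟩"),   # API-key-like strings
]

def redact(text: str) -> str:
    """Replace each matched PII span with its placeholder token."""
    for pattern, token in PII_PATTERNS:
        text = pattern.sub(token, text)
    return text
```

A production pipeline would use far more robust detectors (named-entity recognition, checksum validation for IDs, etc.); the point here is only the replace-with-token mechanism.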
Ethical Considerations and Limitations The use of Large Language Models involves risks and ethical considerations people must be aware of. Regarding code generation, caution is urged against complete reliance on specific code models for crucial decisions or impactful information, as the generated code is not guaranteed to work as intended. The Granite-8B-Code-Base-4K model is no exception in this regard. Even though this model is suited for multiple code-related tasks, it has not undergone any safety alignment; therefore, it may produce problematic outputs. Additionally, it remains uncertain whether smaller models might exhibit increased susceptibility to hallucination in generation scenarios by copying source code verbatim from the training dataset, due to their reduced sizes and memorization capacities. This aspect is currently an active area of research, and we anticipate more rigorous exploration, comprehension, and mitigations in this domain. Regarding ethics, a latent risk associated with all Large Language Models is their malicious utilization. We urge the community to use the Granite-8B-Code-Base-4K model with ethical intentions and in a responsible way.
DiffuCoder-7B-cpGRPO-GGUF
This model was generated using llama.cpp at commit `bf9087f5`. Click here to get info on choosing the right GGUF model format

The DiffuCoder-7B-cpGRPO variant further refines DiffuCoder-Instruct with reinforcement learning via Coupled-GRPO. - Initialized from DiffuCoder-7B-Instruct, post-trained with coupled-GRPO on 21K code samples (1 epoch). - Coupled-GRPO significantly improves DiffuCoder's performance on code generation benchmarks (+4.4% on EvalPlus) and reduces reliance on AR bias during decoding. - Paper: DiffuCoder: Understanding and Improving Masked Diffusion Models for Code Generation

Acknowledgement: To power this HuggingFace model release, we reuse Dream's modeling architecture and generation utils.
SmolLM3-3B-GGUF
A.X-3.1-Light-GGUF
MiniCPM4-8B-GGUF
This model was generated using llama.cpp at commit `7f4fbe51`. Click here to get info on choosing the right GGUF model format

What's New - [2025.06.06] MiniCPM4 series are released! This model achieves ultimate efficiency improvements while maintaining optimal performance at the same scale! It can achieve over 5x generation acceleration on typical end-side chips! You can find the technical report here. 🔥🔥🔥

MiniCPM4 Series MiniCPM4 series are highly efficient large language models (LLMs) designed explicitly for end-side devices, which achieve this efficiency through systematic innovation in four key dimensions: model architecture, training data, training algorithms, and inference systems. - MiniCPM4-8B: The flagship of MiniCPM4, with 8B parameters, trained on 8T tokens.

Note: In vLLM's chat API, `add_special_tokens` is `False` by default. This means important special tokens—such as the beginning-of-sequence (BOS) token—will not be added automatically. To ensure the input prompt is correctly formatted for the model, you should explicitly set `extra_body={"add_special_tokens": True}`.
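A sketch of the vLLM note above, using the OpenAI Python client's `extra_body` pass-through against a locally served MiniCPM4. The server URL is a placeholder and the model id is assumed, not taken from this card:

```python
def chat_payload(prompt: str) -> dict:
    """Build chat.completions arguments, forcing BOS via add_special_tokens."""
    return {
        "model": "openbmb/MiniCPM4-8B",  # assumed served-model name
        "messages": [{"role": "user", "content": prompt}],
        # vLLM-specific option, passed through the OpenAI client's extra_body:
        "extra_body": {"add_special_tokens": True},
    }

def main() -> None:
    from openai import OpenAI

    # Placeholder base_url for a local `vllm serve` instance.
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    resp = client.chat.completions.create(**chat_payload("Hello"))
    print(resp.choices[0].message.content)
```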
You can then use the chat interface through vLLM's OpenAI-compatible API.

Evaluation Results On two typical end-side chips, Jetson AGX Orin and RTX 4090, MiniCPM4 demonstrates significantly faster processing speed compared to similar-size models in long-text processing tasks. As text length increases, MiniCPM4's efficiency advantage becomes more pronounced. On the Jetson AGX Orin platform, compared to Qwen3-8B, MiniCPM4 achieves approximately a 7x decoding speed improvement. Comprehensive Evaluation MiniCPM4 launches end-side versions at 8B and 0.5B parameter scales, both achieving best-in-class performance in their respective categories. Long Text Evaluation MiniCPM4 is pre-trained on 32K long texts and achieves length extension through YaRN technology. In the 128K long-text needle-in-a-haystack task, MiniCPM4 demonstrates outstanding performance. Statement - As a language model, MiniCPM generates content by learning from a vast amount of text. - However, it does not possess the ability to comprehend or express personal opinions or value judgments. - Any content generated by MiniCPM does not represent the viewpoints or positions of the model developers. - Therefore, when using content generated by MiniCPM, users should take full responsibility for evaluating and verifying it on their own. LICENSE - This repository and MiniCPM models are released under the Apache-2.0 License. Citation - Please cite our paper if you find our work valuable.
MiniCPM4-0.5B-GGUF
Harbinger-24B-GGUF
This model was generated using llama.cpp at commit `7f4fbe51`. Click here to get info on choosing the right GGUF model format

Like our Wayfarer line of finetunes, Harbinger-24B was designed for immersive adventures and other stories where consequences feel real and every decision matters. Training focused on enhancing instruction following, improving mid-sequence continuation, and strengthening narrative coherence over long sequences of outputs without user intervention. The same DPO (direct preference optimization) techniques used in our Muse model were applied to Harbinger, resulting in polished outputs with fewer clichés, repetitive patterns, and other common artifacts. If you want to easily try this model, you can do so at https://aidungeon.com. Note that Harbinger requires a subscription, while Muse and Wayfarer Small are free. We plan to continue improving and open-sourcing similar models, so please share any and all feedback on how we can improve model behavior. Below we share more details on how Harbinger was created. Harbinger-24B was trained in two stages, on top of Mistral Small 3.1 Instruct. SFT - Various multi-turn datasets from a multitude of sources, focused on Wayfarer-style text adventures and general roleplay, each carefully balanced and rewritten to be free of common AI clichés.
A small single-turn instruct dataset was included to send a stronger signal during finetuning. DPO - Reward Model User Preference Data, detailed in our blog - This stage refined Harbinger's narrative coherence while preserving its unforgiving essence, resulting in more consistent character behaviors and smoother storytelling flows. Mistral Small 3.1 is sensitive to higher temperatures, so the following settings are recommended as a baseline. Nothing stops you from experimenting with these, of course. Harbinger was trained exclusively on second-person present tense data (using “you”) in a narrative style. Other styles will work as well but may produce suboptimal results. Thanks to Gryphe Padar for collaborating on this finetune with us!

Help me test my AI-Powered Quantum Network Monitor Assistant with quantum-ready security checks: The full open-source code for the Quantum Network Monitor Service is available at my GitHub repos (repos with NetworkMonitor in the name): Source Code Quantum Network Monitor. You will also find the code I use to quantize the models, if you want to do it yourself: GGUFModelBuilder 💬 How to test: Choose an AI assistant type: - `TurboLLM` (GPT-4.1-mini) - `HugLLM` (Hugging Face open-source models) - `TestLLM` (Experimental CPU-only) What I’m Testing I’m pushing the limits of small open-source models for AI network monitoring, specifically: - Function calling against live network services - How small can a model go while still handling: - Automated Nmap security scans - Quantum-readiness checks - Network monitoring tasks 🟡 TestLLM – Current experimental model (llama.cpp on 2 CPU threads on a Hugging Face docker space): - ✅ Zero-configuration setup - ⏳ 30s load time (slow inference but no API costs). No token limit, as the cost is low. - 🔧 Help wanted! If you’re into edge-device AI, let’s collaborate! Other Assistants 🟢 TurboLLM – Uses gpt-4.1-mini: - It performs very well, but unfortunately OpenAI charges per token.
For this reason, token usage is limited. - Create custom cmd processors to run .NET code on Quantum Network Monitor Agents - Real-time network diagnostics and monitoring - Security audits - Penetration testing (Nmap/Metasploit) 🔵 HugLLM – Latest open-source models: - 🌐 Runs on the Hugging Face Inference API. Performs pretty well using the latest models hosted on Novita. 💡 Example commands you could test: 1. `"Give me info on my website's SSL certificate"` 2. `"Check if my server is using quantum-safe encryption for communication"` 3. `"Run a comprehensive security audit on my server"` 4. `"Create a cmd processor to .. (whatever you want)"` Note you need to install a Quantum Network Monitor Agent to run the .NET code on. This is a very flexible and powerful feature. Use with caution! I fund the servers used to create these model files, run the Quantum Network Monitor service, and pay for inference from Novita and OpenAI—all out of my own pocket. All the code behind the model creation and the Quantum Network Monitor project is open source. Feel free to use whatever you find helpful. If you appreciate the work, please consider buying me a coffee ☕. Your support helps cover service costs and allows me to raise token limits for everyone.
Qwen3-1.7B-GGUF
granite-guardian-3.0-8b-GGUF
AFM-4.5B-GGUF
ERNIE-4.5-0.3B-PT-GGUF
granite-3.1-2b-base-GGUF
OpenReasoning-Nemotron-1.5B-GGUF
granite-3.1-1b-a400m-base-GGUF
SmolVLM-256M-Instruct-GGUF
Jan-v1-4B-GGUF
This model was generated using llama.cpp at commit `cd6983d5`. Click here to get info on choosing the right GGUF model format [](https://github.com/menloresearch/deep-research) [](https://opensource.org/licenses/Apache-2.0) [](https://jan.ai/) Overview Jan-v1 is the first release in the Jan Family, designed for agentic reasoning and problem-solving within the Jan App. Based on our Lucy model, Jan-v1 achieves improved performance through model scaling. Jan-v1 uses the Qwen3-4B-thinking model to provide enhanced reasoning capabilities and tool utilization. This architecture delivers better performance on complex agentic tasks. Question Answering (SimpleQA) For question-answering, Jan-v1 shows a significant performance gain from model scaling, achieving 91.1% accuracy. The 91.1% SimpleQA accuracy represents a significant milestone in factual question answering for models of this scale, demonstrating the effectiveness of our scaling and fine-tuning approach. These benchmarks evaluate the model's conversational and instructional capabilities. Jan-v1 is optimized for direct integration with the Jan App. Simply select the model from the Jan App interface for immediate access to its full capabilities.
- Discussions: HuggingFace Community - Jan App: Learn more about the Jan App at jan.ai Note: By default we include a system prompt in the chat template; this is to make sure the model achieves the same performance as the benchmark results. You can also use the vanilla chat template, without a system prompt, found in the file chattemplateraw.jinja.
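As a minimal sketch of swapping in the vanilla chat template: recent `llama.cpp` builds let you override the template baked into the GGUF metadata. The flag names (`--jinja`, `--chat-template-file`) and the quantized filename below are assumptions about your local setup, not something specified by the Jan card.

```shell
# Hedged sketch: run Jan-v1 with the vanilla chat template from the repo
# instead of the default template (which includes the system prompt).
# Model filename is illustrative; use whichever quant you downloaded.
./llama-cli -m Jan-v1-4B-Q4_K_M.gguf \
    --jinja --chat-template-file chattemplateraw.jinja \
    -p "What is the capital of France?"
```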
II-Search-4B-GGUF
functionary-small-v3.1-GGUF
HyperCLOVAX-SEED-Text-Instruct-0.5B-GGUF
AQUA-7B-GGUF
This model was generated using llama.cpp at commit `21c02174`. Click here to get info on choosing the right GGUF model format AQUA-7B is Kurma AI’s flagship 7-billion-parameter large language model built exclusively for the global aquaculture industry, and the first large language model for aquaculture. It is fine-tuned to deliver actionable insights for species-specific aquaculture farming, hatchery operations, water quality control, and disease management. Trained on over 3 million real and synthetic aquaculture conversations (~1B tokens), AQUA-7B brings the power of domain-specific AI to fish farms, fish hatcheries, researchers, and aqua-tech innovators worldwide. - Production Systems & Species Management: Covers ponds, tanks, cages, RAS, aquaponics, mariculture, and longlines. Delivers best practices for raising tilapia, catfish, carp, salmon, shrimp, crabs, oysters, trout, sea bass, and more, supporting both smallholder and industrial farms. - Genetics, Hatchery, and Early Life Stage Management: Guides advanced breeding, gene editing, hatchery design, spawning, larval care, nursery systems, live feed, transport, egg incubation, and biosecurity.
- Nutrition, Feeding, and Growth Optimization: Provides actionable guidance on feeding and growth across life stages, supporting both smallholder and industrial operations. - Water Quality, Health, and Disease Management: Provides actionable protocols for water quality (temperature, oxygen, pH, ammonia, nitrite, salinity), and structured disease management: identification, vaccination, biosecurity, antibiotic use, and outbreak response. - Sustainable Aquaculture & Innovation: Promotes eco-friendly practices in waste management, environmental impact, biodiversity, and climate adaptation, and guides adoption of new technologies: AI, automation, sensors, water drones, and modern farm management. - Markets, Business, and Post-Harvest Management: Advises on market trends, business planning, regulation, certification, traceability, and insurance. Covers best practices for harvesting, processing, cold chain, grading, packaging, contamination prevention, HACCP, and food safety. - Extension worker–farmer dialogues and field advisory logs - FAO, ICAR, NOAA, and peer-reviewed aquaculture research - Synthetic Q&A from 5,000+ aquaculture-focused topics - Climate-resilient practices, hatchery SOPs, and water quality datasets - Carefully curated to support species-specific culture methods - Scale: Trained on approximately 3 million real and synthetic Q&A pairs, totaling around 1 billion tokens of high-quality, domain-specific data. - Base Model: Mistral 7B v0.3 (by Mistral AI) - Training Tokens: ~1 billion - Released On: July 4, 2025 - Data Volume: 3M+ expert-verified and synthetic instructions - Origin: Made in America by Kurma AI - Training Technique: Fine-tuned via LoRA-based Supervised Fine-Tuning (SFT).
- Training Infrastructure: Trained on a multi-cluster setup of 16 NVIDIA H200 GPUs. Special thanks to Nebius 🙏 Acknowledgements This project was made possible thanks to: - Nebius, for providing a compute grant and access to NVIDIA H200 GPU servers, which powered the model training process. - Mistral AI, for sharing their open-source language models, which made this project possible. - The Kurma AI research team: aquaculture experts, machine learning engineers, data annotators, and advisors who collaborated to curate, verify, and refine the domain-specific dataset used for fine-tuning this model. - Domain Bias: The model may reflect inherent biases present in the aquaculture data sources and industry practices on which it was trained. - Temporal Data Limitation: Climate and environmental recommendations are based on information available up to 2024. Users should cross-check any climate-related advice against the latest advisories (e.g., IMD or NOAA updates). - Potential Hallucinations: Like all large language models, AQUA-7B may occasionally generate inaccurate or misleading responses ("hallucinations"). Always validate critical, regulatory, or high-impact decisions with a qualified aquaculture professional.
Absolute_Zero_Reasoner-Coder-3b-GGUF
AceInstruct-1.5B-GGUF
Arch-Agent-1.5B-GGUF
This model was generated using llama.cpp at commit `0142961a`. Click here to get info on choosing the right GGUF model format Overview Arch-Agent is a collection of state-of-the-art (SOTA) LLMs specifically designed for advanced function calling and agent-based applications. Designed to power sophisticated multi-step and multi-turn workflows, Arch-Agent excels at handling complex tasks that require intelligent tool selection, adaptive planning, and seamless integration with external APIs and services. Built with a focus on real-world agent deployments, Arch-Agent delivers leading performance in complex scenarios while maintaining reliability and precision across extended function call sequences. Key capabilities include: - Multi-Turn Function Calling: Maintains contextual continuity across multiple dialogue turns, enabling natural, ongoing conversations with nested or evolving tool use. - Multi-Step Function Calling: Plans and executes a sequence of function calls to complete complex tasks. Adapts dynamically based on intermediate results and decomposes goals into sub-tasks. - Agentic Capabilities: Advanced decision-making and workflow management for complex agentic tasks with seamless tool coordination and error recovery. For more details, including fine-tuning, inference, and deployment, please refer to our GitHub. Performance Benchmarks We evaluate the Katanemo Arch-Agent series on the Berkeley Function-Calling Leaderboard (BFCL) against commonly used models; the results (as of June 14th, 2025) are shown below. > [!NOTE] > For evaluation, we use YaRN scaling to deploy the models for multi-turn evaluation, and all Arch-Agent models are evaluated with a context length of 64K. Requirements The code for Arch-Agent-1.5B is available in the Hugging Face `transformers` library, and we recommend installing the latest version. How to use We use the following example to illustrate how to use our model to perform function calling tasks.
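The function-calling round trip can be sketched in plain Python. The tool schema, prompt wording, and parser below are illustrative stand-ins only, not Arch-Agent's actual prompt format; the exact format and special tokens are defined in the Katanemo GitHub repo.

```python
import json

# Hypothetical sketch of a function-calling round trip. The tool schema
# follows the common OpenAI-style shape; the prompt text is illustrative.
TOOLS = [{
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]

def build_prompt(user_msg: str) -> str:
    # Embed the tool schemas and the user request into a single prompt.
    return (
        "You may call the following tools by replying with JSON.\n"
        f"Tools: {json.dumps(TOOLS)}\n"
        f"User: {user_msg}\n"
    )

def parse_tool_call(model_output: str) -> dict:
    # Arch-Agent is trained to emit OpenAI-style JSON tool calls,
    # so the raw completion can be decoded directly.
    return json.loads(model_output)

# Simulated model output, standing in for an actual generation:
call = parse_tool_call('{"name": "get_weather", "arguments": {"city": "Paris"}}')
print(call["name"], call["arguments"])
```

In a real pipeline, `parse_tool_call` would receive the model's completion and the parsed call would be dispatched to the matching local function.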
Please note that our model works best with our provided prompt format, which allows us to extract JSON output similar to OpenAI's function calling. License The Arch-Agent collection is distributed under the Katanemo license.
gemma-3-4b-it-qat-q4_0-GGUF
gpt-oss-20b-GGUF
OpenCodeReasoning-Nemotron-7B-GGUF
lucid-v1-nemo-GGUF
Llama-3.1-Minitron-4B-Depth-Base-GGUF
Minitron-4B-Base-GGUF
granite-20b-functioncalling-GGUF
This model was generated using llama.cpp at commit `0a5a3b5c`. Click here to get info on choosing the right GGUF model format Granite-20B-FunctionCalling Model Summary Granite-20B-FunctionCalling is a finetuned model based on IBM's granite-20b-code-instruct model, introducing function calling abilities into the Granite model family. The model is trained using a multi-task training approach on seven fundamental tasks encompassed in function calling: Nested Function Calling, Function Chaining, Parallel Functions, Function Name Detection, Parameter-Value Pair Detection, Next-Best Function, and Response Generation. - Developers: IBM Research - Paper: Granite-Function Calling Model: Introducing Function Calling Abilities via Multi-task Learning of Granular Tasks - Release Date: July 9th, 2024 - License: Apache 2.0 Usage Intended use The model is designed to respond to function calling related instructions. Generation This is a simple example of how to use the Granite-20B-FunctionCalling model.
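One of the seven tasks above, Parallel Functions, has the model emit several calls for a single request. The JSON shape below is an illustrative stand-in assuming an OpenAI-style call list; the model's actual output format is defined in the Granite function-calling paper.

```python
import json

# Illustrative stand-in for the Parallel Functions task: one user request
# ("compare IBM and AAPL prices") yields a list of independent calls that
# a client could execute concurrently.
raw_output = (
    '[{"name": "get_stock_price", "arguments": {"ticker": "IBM"}}, '
    '{"name": "get_stock_price", "arguments": {"ticker": "AAPL"}}]'
)
calls = json.loads(raw_output)
for call in calls:
    print(call["name"], call["arguments"]["ticker"])
```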
Meta-Llama-3-8B-GGUF
Absolute_Zero_Reasoner-Coder-7b-GGUF
SmallThinker-4BA0.6B-Instruct
Mellum-4b-base-GGUF
Mistral-7B-Instruct-v0.2-GGUF
granite-3.1-1b-a400m-instruct-GGUF
This model was generated using llama.cpp at commit `0a5a3b5c`. Click here to get info on choosing the right GGUF model format Model Summary: Granite-3.1-1B-A400M-Instruct is a 1B-parameter (400M active) long-context instruct model finetuned from Granite-3.1-1B-A400M-Base using a combination of open-source instruction datasets with permissive licenses and internally collected synthetic datasets tailored for solving long-context problems. This model is developed using a diverse set of techniques with a structured chat format, including supervised finetuning, model alignment using reinforcement learning, and model merging. - Developers: Granite Team, IBM - GitHub Repository: ibm-granite/granite-3.1-language-models - Website: Granite Docs - Paper: Granite 3.1 Language Models (coming soon) - Release Date: December 18th, 2024 - License: Apache 2.0 Supported Languages: English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese. Users may finetune Granite 3.1 models for languages beyond these 12 languages. Intended Use: The model is designed to respond to general instructions and can be used to build AI assistants for multiple domains, including business applications. Capabilities: Summarization, Text classification, Text extraction, Question-answering, Retrieval-Augmented Generation (RAG), Code-related tasks, Function-calling tasks, Multilingual dialog use cases, Long-context tasks including long document/meeting summarization, long document QA, etc. Generation: This is a simple example of how to use the Granite-3.1-1B-A400M-Instruct model. Then, copy the snippet from the section that is relevant for your use case.
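In the benchmark tables below, the Avg column is the arithmetic mean of the six per-task scores. As a quick sanity check, using the Granite-3.1-1B-A400M-Instruct row from the first table:

```python
# ARC-Challenge, Hellaswag, MMLU, TruthfulQA, Winogrande, GSM8K scores
# for Granite-3.1-1B-A400M-Instruct, copied from the table.
scores = [42.66, 65.97, 26.13, 46.77, 62.35, 33.88]
avg = round(sum(scores) / len(scores), 2)
print(avg)  # 46.29, matching the reported Avg
```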
| Models | ARC-Challenge | Hellaswag | MMLU | TruthfulQA | Winogrande | GSM8K | Avg |
|---|---|---|---|---|---|---|---|
| Granite-3.1-8B-Instruct | 62.62 | 84.48 | 65.34 | 66.23 | 75.37 | 73.84 | 71.31 |
| Granite-3.1-2B-Instruct | 54.61 | 75.14 | 55.31 | 59.42 | 67.48 | 52.76 | 60.79 |
| Granite-3.1-3B-A800M-Instruct | 50.42 | 73.01 | 52.19 | 49.71 | 64.87 | 48.97 | 56.53 |
| Granite-3.1-1B-A400M-Instruct | 42.66 | 65.97 | 26.13 | 46.77 | 62.35 | 33.88 | 46.29 |

| Models | IFEval | BBH | MATH Lvl 5 | GPQA | MUSR | MMLU-Pro | Avg |
|---|---|---|---|---|---|---|---|
| Granite-3.1-8B-Instruct | 72.08 | 34.09 | 21.68 | 8.28 | 19.01 | 28.19 | 30.55 |
| Granite-3.1-2B-Instruct | 62.86 | 21.82 | 11.33 | 5.26 | 4.87 | 20.21 | 21.06 |
| Granite-3.1-3B-A800M-Instruct | 55.16 | 16.69 | 10.35 | 5.15 | 2.51 | 12.75 | 17.1 |
| Granite-3.1-1B-A400M-Instruct | 46.86 | 6.18 | 4.08 | 0 | 0.78 | 2.41 | 10.05 |

Model Architecture: Granite-3.1-1B-A400M-Instruct is based on a decoder-only sparse Mixture of Experts (MoE) transformer architecture. Core components of this architecture are: GQA and RoPE, MLP with SwiGLU, RMSNorm, and shared input/output embeddings. Training Data: Overall, our SFT data is largely comprised of three key sources: (1) publicly available datasets with permissive licenses, (2) internal synthetic data targeting specific capabilities including long-context tasks, and (3) very small amounts of human-curated data. A detailed attribution of datasets can be found in the Granite 3.0 Technical Report, Granite 3.1 Technical Report (coming soon), and Accompanying Author List. Infrastructure: We train Granite 3.1 Language Models using IBM's supercomputing cluster, Blue Vela, which is outfitted with NVIDIA H100 GPUs. This cluster provides a scalable and efficient infrastructure for training our models over thousands of GPUs. Ethical Considerations and Limitations: Granite 3.1 Instruct Models are primarily finetuned using instruction-response pairs mostly in English, but also multilingual data covering eleven languages. Although this model can handle multilingual dialog use cases, its performance might not be similar to English tasks.
In such cases, introducing a small number of examples (few-shot) can help the model generate more accurate outputs. While this model has been aligned with safety in mind, it may in some cases produce inaccurate, biased, or unsafe responses to user prompts. We therefore urge the community to use this model with proper safety testing and tuning tailored to their specific tasks. Resources - ⭐️ Learn about the latest updates with Granite: https://www.ibm.com/granite - 📄 Get started with tutorials, best practices, and prompt engineering advice: https://www.ibm.com/granite/docs/ - 💡 Learn about the latest Granite learning resources: https://ibm.biz/granite-learning-resources
Mistral-7B-Instruct-v0.1-GGUF
granite-3.0-8b-base-GGUF
granite-3.0-8b-instruct-GGUF
granite-3.0-8b-lora-intrinsics-v0.1-GGUF
This model was generated using llama.cpp at commit `0a5a3b5c`. Click here to get info on choosing the right GGUF model format Think of Experiments as a preview of what's to come. These projects are still under development, but we wanted to let the open-source community take them for a spin! Use them, break them, and help us build what's next for Granite – we'll keep an eye out for feedback and questions in the Community section. Happy exploring! Granite 3.0 8B Instruct - Intrinsics LoRA v0.1 is a merged LoRA finetune of ibm-granite/granite-3.0-8b-instruct, providing access to the Uncertainty, Hallucination Detection, and Safety Exception intrinsics while retaining the full abilities of the ibm-granite/granite-3.0-8b-instruct model. - Developer: IBM Research - Model type: LoRA adapter for ibm-granite/granite-3.0-8b-instruct - License: Apache 2.0 Uncertainty Intrinsic The Uncertainty intrinsic is designed to provide a certainty score for model responses to user questions. Certainty score definition The model will respond with a number from 0 to 9, corresponding to 5%, 15%, 25%, ..., 95% confidence respectively. This percentage is calibrated in the following sense: given a set of answers assigned a certainty score of X%, approximately X% of these answers should be correct.
See the eval experiment below for out-of-distribution verification of this behavior. Hallucination Detection (RAG) Intrinsic The Hallucination Detection intrinsic is designed to detect when an assistant response to a user question with supporting documents is not supported by those documents. A response of `Y` indicates hallucination; `N` indicates no hallucination. Safety Exception Intrinsic The Safety Exception intrinsic is designed to raise an exception when the user query is unsafe, responding with `Y` (unsafe) or `N` (safe). It was designed as a binary classifier that analyses the user’s prompt to detect a variety of harms, including violence, threats, sexual and explicit content, and requests to obtain personally identifiable information. This is an experimental LoRA testing new functionality being developed for IBM's Granite LLM family. We welcome the community to test it out and give us feedback, but we are NOT recommending this model be used for real deployments at this time. Stay tuned for more updates on the Granite roadmap. Granite 3.0 8B Instruct - Intrinsics LoRA v0.1 is lightly tuned so that its behavior closely mimics that of ibm-granite/granite-3.0-8b-instruct, with the added ability to generate the three specified intrinsics. Invoking intrinsics Each intrinsic is associated with its own generation role and has its own usage steps. Note that each intrinsic responds with only one token, and any additional text after this token should be ignored. You can curb additional generation by setting "max token length" = 1 when using any intrinsic. Uncertainty Intrinsic Usage Steps Answering a question and obtaining a certainty score proceeds as follows. 1. Prompt the model with a system prompt (required) followed by the user prompt. 2. Use the model to generate a response as normal (via the `assistant` role). 3.
Invoke the Uncertainty intrinsic by generating in the `certainty` role (use "certainty" as the role in the chat template, or simply append ` certainty ` and continue generating); see the examples below. 4. The model will respond with an integer certainty score from 0 to 9. The model was calibrated with the following system prompt: `You are an AI language model developed by IBM Research. You are a cautious assistant. You carefully follow instructions. You are helpful and harmless and you follow ethical guidelines and promote positive behavior.` You can further augment this system prompt for a given use case or task, but it is recommended that your system prompt always starts with this string. Hallucination Detection Intrinsic Usage Steps Answering a question and detecting hallucination proceeds as follows. 1. Prompt the model with the system prompt (required) followed by the user prompt. 2. Use the model to generate a response as normal (via the `assistant` role). 3. Invoke the Hallucination Detection intrinsic by generating in the `hallucination` role (use "hallucination" as the role in the chat template, or simply append ` hallucination ` and continue generating); see the examples below. 4. The model will respond with `Y` or `N`. Safety Exception Intrinsic Usage Steps Determining if a user query is safe proceeds as follows. 1. Prompt the model with the system prompt (required) followed by the user prompt. 2. Invoke the Safety Exception intrinsic by generating in the `safety` role (use "safety" as the role in the chat template, or simply append ` safety ` and continue generating); see the examples below. 3. The model will respond with `Y` (unsafe) or `N` (safe). Combining Intrinsics In many pipelines, it may be desirable to invoke multiple intrinsics at different points. In a multi-turn conversation possibly involving other intrinsics, it is important to use attention masking to provide only the relevant information to the intrinsic of interest.
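The Uncertainty usage steps above can be sketched in plain Python, where `generate` stands in for a real call to the hosted model (the role names follow this card; the exact chat-template tokens are model-specific and omitted here):

```python
SYSTEM_PROMPT = (
    "You are an AI language model developed by IBM Research. You are a "
    "cautious assistant. You carefully follow instructions. You are helpful "
    "and harmless and you follow ethical guidelines and promote positive "
    "behavior."
)

def answer_with_certainty(question: str, generate):
    """Steps 1-4: prompt, answer in the `assistant` role, then invoke the
    `certainty` role for a single-token 0-9 score."""
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},   # step 1
        {"role": "user", "content": question},
    ]
    answer = generate(messages, role="assistant")       # step 2
    messages.append({"role": "assistant", "content": answer})
    score = generate(messages, role="certainty",        # step 3
                     max_new_tokens=1)                  # intrinsic emits one token
    return answer, int(score)                           # step 4
```

The same pattern applies to the `hallucination` and `safety` roles, each of which returns a single `Y`/`N` token instead of a digit.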
We explore two frameworks for accomplishing this - Prompt Declaration Language (PDL) and SGLang. In the examples below, we explore the following RAG flow. First, a user query is provided with relevant documents supplied by a RAG system. We can invoke the Safety Exception intrinsic to determine if the query is safe. If it is safe, we can proceed to generate an answer to the question as normal. Finally, we can evaluate the certainty and hallucination status of this reply by invoking the Uncertainty and Hallucination Detection intrinsics. Intrinsics Example with PDL Given a hosted instance of Granite 3.0 8B Instruct - Intrinsics LoRA v0.1 at `APIBASE` (insert the host address here), the following uses the PDL language to implement the RAG intrinsic invocation scenario described above. Note that the hosted instance must be supported by LiteLLM (https://docs.litellm.ai/docs/providers). First, create a file `intrinsics.pdl` with the following content. Next, create a file `intrinsics-defs.pdl` with the following content. To run the example, run `pdl intrinsics.pdl` on the command line after installing the PDL CLI (`pip install prompt-declaration-language`). Intrinsics Example with SGLang The SGLang implementation below uses the SGLang fork at https://github.com/frreiss/sglang/tree/granite that supports Granite models. Notes Certainty score interpretation Certainty scores calibrated as defined above may at times seem biased towards moderate certainty scores for the following reasons. Firstly, as humans we tend to be overconfident in our evaluation of what we know and don't know - in contrast, a calibrated model is less likely to output very high or very low confidence scores, as these imply certainty of correctness or incorrectness.
Examples where you might see very low confidence scores might be on answers where the model's response was something to the effect of "I don't know", which is easy to evaluate as not being the correct answer to the question (though it is the appropriate one). Secondly, remember that the model is evaluating itself - correctness/incorrectness that may be obvious to us or to larger models may be less obvious to an 8B model. Finally, teaching a model every fact it knows and doesn't know is not possible, hence it must generalize to questions of wildly varying difficulty (some of which may be trick questions!) and to settings where it has not had its outputs judged. Intuitively, it does this by extrapolating based on related questions it has been evaluated on in training - this is an inherently inexact process and leads to some hedging. Certainty is inherently an intrinsic property of a model and its abilities. The Uncertainty intrinsic is not intended to predict the certainty of responses generated by any other models besides itself or ibm-granite/granite-3.0-8b-instruct. Additionally, certainty scores are distributional quantities, and so will do well on realistic questions in aggregate, but in principle may have surprising scores on individual red-teamed examples. Evaluation We evaluate the performance of the intrinsics themselves and the RAG performance of the model. We first find that the performance of the intrinsics in our shared model Granite 3.0 8B Instruct - Intrinsics LoRA v0.1 is not degraded versus the baseline procedure of maintaining 3 separate intrinsic models. Here, percent error is shown for the Hallucination Detection and Safety Exception intrinsics, as they have binary output, and Mean Absolute Error (MAE) is shown for the Uncertainty intrinsic, as it outputs numbers 0 to 9. For all, lower is better. Performance is calculated on a randomly drawn 400-sample validation set from each intrinsic's dataset.
We then find that RAG performance of Granite 3.0 8B Instruct - Intrinsics LoRA v0.1 does not suffer with respect to the base model ibm-granite/granite-3.0-8b-instruct. Here we evaluate the RAGBench benchmark on RAGAS faithfulness and correctness metrics. Training Details The Granite 3.0 8B Instruct - Intrinsics LoRA v0.1 model is a LoRA adapter finetuned to provide 3 desired intrinsic outputs - Uncertainty Quantification, Hallucination Detection, and Safety. UQ Training Data The following datasets were used for calibration and/or finetuning. Certainty scores were obtained via the method in [Thermometer: Towards Universal Calibration for Large Language Models (Shen et al., ICML 2024)](https://arxiv.org/abs/2403.08819). Datasets: BigBench, MRQA, newsqa, triviaqa, searchqa, openbookqa, webquestions, smiles-qa, orca-math, ARC-Easy, commonsenseqa, socialiqa, superglue, figqa, riddlesense, agnews, medmcqa, dream, codah, and piqa. RAG Hallucination Training Data The following public datasets were used for finetuning. The details of data creation for RAG response generation are available in the Granite Technical Report. For creating the hallucination labels for responses, the technique of Achintalwar et al. was used. Safety Exception Training Data The following public datasets were used for finetuning: yahma/alpaca-cleaned, nvidia/Aegis-AI-Content-Safety-Dataset-1.0, a subset of https://huggingface.co/datasets/Anthropic/hh-rlhf, ibm/AttaQ, google/civilcomments, and allenai/socialbiasframes. Kristjan Greenewald, Nathalie Baracaldo, Chulaka Gunasekara, Lucian Popa, Mandana Vaziri Help me test my AI-Powered Quantum Network Monitor Assistant with quantum-ready security checks: The full Open Source Code for the Quantum Network Monitor Service is available at my GitHub repos (repos with NetworkMonitor in the name): Source Code Quantum Network Monitor.
You will also find the code I use to quantize the models if you want to do it yourself: GGUFModelBuilder 💬 How to test: Choose an AI assistant type: - `TurboLLM` (GPT-4.1-mini) - `HugLLM` (Hugging Face open-source models) - `TestLLM` (Experimental CPU-only) What I’m Testing I’m pushing the limits of small open-source models for AI network monitoring, specifically: - Function calling against live network services - How small can a model go while still handling: - Automated Nmap security scans - Quantum-readiness checks - Network Monitoring tasks 🟡 TestLLM – Current experimental model (llama.cpp on 2 CPU threads on a Hugging Face Docker space): - ✅ Zero-configuration setup - ⏳ 30s load time (slow inference but no API costs). No token limit, as the cost is low. - 🔧 Help wanted! If you’re into edge-device AI, let’s collaborate! Other Assistants 🟢 TurboLLM – Uses gpt-4.1-mini: - It performs very well, but unfortunately OpenAI charges per token. For this reason, token usage is limited. - Create custom cmd processors to run .net code on Quantum Network Monitor Agents - Real-time network diagnostics and monitoring - Security audits - Penetration testing (Nmap/Metasploit) 🔵 HugLLM – Latest open-source models: - 🌐 Runs on the Hugging Face Inference API. Performs pretty well using the latest models hosted on Novita. 💡 Example commands you could test: 1. `"Give me info on my website's SSL certificate"` 2. `"Check if my server is using quantum safe encryption for communication"` 3. `"Run a comprehensive security audit on my server"` 4. `"Create a cmd processor to .. (whatever you want)"` Note you need to install a Quantum Network Monitor Agent to run the .net code on. This is a very flexible and powerful feature. Use with caution! I fund the servers used to create these model files, run the Quantum Network Monitor service, and pay for inference from Novita and OpenAI—all out of my own pocket. All the code behind the model creation and the Quantum Network Monitor project is open source.
Feel free to use whatever you find helpful. If you appreciate the work, please consider buying me a coffee ☕. Your support helps cover service costs and allows me to raise token limits for everyone.
OpenCodeReasoning-Nemotron-14B-GGUF
Arch-Agent-3B-GGUF
This model was generated using llama.cpp at commit `73e53dc8`. Click here to get info on choosing the right GGUF model format Overview Arch-Agent is a collection of state-of-the-art (SOTA) LLMs specifically designed for advanced function calling and agent-based applications. Designed to power sophisticated multi-step and multi-turn workflows, Arch-Agent excels at handling complex tasks that require intelligent tool selection, adaptive planning, and seamless integration with external APIs and services. Built with a focus on real-world agent deployments, Arch-Agent delivers leading performance in complex scenarios while maintaining reliability and precision across extended function call sequences. Key capabilities include: - Multi-Turn Function Calling: Maintains contextual continuity across multiple dialogue turns, enabling natural, ongoing conversations with nested or evolving tool use. - Multi-Step Function Calling: Plans and executes a sequence of function calls to complete complex tasks. Adapts dynamically based on intermediate results and decomposes goals into sub-tasks. - Agentic Capabilities: Advanced decision-making and workflow management for complex agentic tasks with seamless tool coordination and error recovery.
For more details, including fine-tuning, inference, and deployment, please refer to our GitHub. Performance Benchmarks We evaluate the Katanemo Arch-Agent series on the Berkeley Function-Calling Leaderboard (BFCL). We compare with commonly used models; the results (as of June 14th, 2025) are shown below. > [!NOTE] > For evaluation, we use YaRN scaling to deploy the models for Multi-Turn evaluation, and all Arch-Agent models are evaluated with a context length of 64K. Requirements The code for Arch-Agent-3B is available in the Hugging Face `transformers` library, and we recommend installing the latest version. How to use We use the following example to illustrate how to use our model to perform function calling tasks. Please note that our model works best with our provided prompt format, which allows us to extract JSON output similar to OpenAI's function calling. License The Arch-Agent collection is distributed under the Katanemo license.
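The card's own usage example is omitted above. As a rough, hypothetical stand-in, here is an OpenAI-style tool schema and a parser for the JSON tool call the model is expected to emit; the schema, tool name, and raw-output format are illustrative assumptions, not the authors' exact example:

```python
import json

# Hypothetical tool schema in the OpenAI function-calling style the card refers to.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a location.",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}]

def parse_tool_call(model_output: str):
    """Extract (name, arguments) from an OpenAI-style JSON tool call."""
    call = json.loads(model_output)
    return call["name"], call.get("arguments", {})

# Example: a raw model completion containing a single tool call.
name, args = parse_tool_call(
    '{"name": "get_weather", "arguments": {"location": "Seattle"}}'
)
```

In a real pipeline, `TOOLS` would be passed through the model's provided prompt format and `parse_tool_call` applied to the generated completion.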
Polaris-7B-Preview-GGUF
OpenMath-Mistral-7B-v0.1-hf-GGUF
OlympicCoder-7B-GGUF
Polaris-4B-Preview-GGUF
WebSailor-3B-GGUF
Arch-Agent-7B-GGUF
This model was generated using llama.cpp at commit `73e53dc8`. Click here to get info on choosing the right GGUF model format Overview Arch-Agent is a collection of state-of-the-art (SOTA) LLMs specifically designed for advanced function calling and agent-based applications. Designed to power sophisticated multi-step and multi-turn workflows, Arch-Agent excels at handling complex tasks that require intelligent tool selection, adaptive planning, and seamless integration with external APIs and services. Built with a focus on real-world agent deployments, Arch-Agent delivers leading performance in complex scenarios while maintaining reliability and precision across extended function call sequences. Key capabilities include: - Multi-Turn Function Calling: Maintains contextual continuity across multiple dialogue turns, enabling natural, ongoing conversations with nested or evolving tool use. - Multi-Step Function Calling: Plans and executes a sequence of function calls to complete complex tasks. Adapts dynamically based on intermediate results and decomposes goals into sub-tasks. - Agentic Capabilities: Advanced decision-making and workflow management for complex agentic tasks with seamless tool coordination and error recovery.
For more details, including fine-tuning, inference, and deployment, please refer to our GitHub. Performance Benchmarks We evaluate the Katanemo Arch-Agent series on the Berkeley Function-Calling Leaderboard (BFCL). We compare with commonly used models; the results (as of June 14th, 2025) are shown below. > [!NOTE] > For evaluation, we use YaRN scaling to deploy the models for Multi-Turn evaluation, and all Arch-Agent models are evaluated with a context length of 64K. Requirements The code for Arch-Agent-7B is available in the Hugging Face `transformers` library, and we recommend installing the latest version. How to use We use the following example to illustrate how to use our model to perform function calling tasks. Please note that our model works best with our provided prompt format, which allows us to extract JSON output similar to OpenAI's function calling. License The Arch-Agent collection is distributed under the Katanemo license.
granite-7b-base-GGUF
EXAONE-4.0-32B-GGUF
mOrpheus_3B-1Base_early_preview-v1-8600-GGUF
orpheus-finetuned-3b-GGUF
This model was generated using llama.cpp at commit `f505bd83`. Click here to get info on choosing the right GGUF model format 03/18/2025 – We are releasing our 3B Orpheus TTS model with additional finetunes. Code is available on GitHub: CanopyAI/Orpheus-TTS Orpheus TTS is a state-of-the-art, Llama-based Speech-LLM designed for high-quality, empathetic text-to-speech generation. This model has been finetuned to deliver human-level speech synthesis, achieving exceptional clarity, expressiveness, and real-time streaming performance. - Human-Like Speech: Natural intonation, emotion, and rhythm superior to SOTA closed-source models - Zero-Shot Voice Cloning: Clone voices without prior fine-tuning - Guided Emotion and Intonation: Control speech and emotion characteristics with simple tags - Low Latency: ~200ms streaming latency for real-time applications, reducible to ~100ms with input streaming - GitHub Repo: https://github.com/canopyai/Orpheus-TTS - Blog Post: https://canopylabs.ai/model-releases - Colab Inference Notebook: notebook link - One-Click Deployment on Baseten: https://www.baseten.co/library/orpheus-tts/ Check out our Colab (link to Colab) or GitHub (link to GitHub) for how to run easy inference on our finetuned models.
Model Misuse Do not use our models for impersonation without consent, misinformation or deception (including fake news or fraudulent calls), or any illegal or harmful activity. By using this model, you agree to follow all applicable laws and ethical guidelines. We disclaim responsibility for any use.
OpenCodeReasoning-Nemotron-32B-GGUF
Osmosis-Structure-0.6B-GGUF
granite-3.0-3b-a800m-instruct-GGUF
limbic-tool-use-0.5B-32K-GGUF
This model was generated using llama.cpp at commit `c7f3169c`. Click here to get info on choosing the right GGUF model format This model is a fine-tuned version of Qwen2.5-0.5B-Instruct specifically designed for evaluating function calls in the context of Model Context Protocol (MCP) tools. It can assess whether a function call is correct, uses the wrong tool, has incorrect parameter names, or has incorrect parameter values. - Base Model: Qwen/Qwen2.5-0.5B-Instruct - Fine-tuning Method: LoRA (Low-Rank Adaptation) - Task: Function Call Evaluation for MCP (Model Context Protocol) - Training Data: MCP Server Tools data from public MCP servers, with augmentation / synthetic data generation - Model Size: ~40MB (LoRA adapters only) - Context Length: 32,768 tokens The prompt for the model takes two inputs: - `availabletools` - a list of the tool schemas - `messagehistory` - the user request and model tool-call response as a list of JSON objects Output Format The model outputs evaluations in JSON format: - correct: Function call matches available tools and parameters exactly - incorrecttool: Function name doesn't exist in available tools - incorrectparameternames: Function exists but parameter names are wrong - incorrectparametervalues: Function and parameters exist but values are inappropriate Generate a Prediction To make a prediction, you must convert the formatted prompt into its chat format.
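The four verdicts in the limbic-tool-use card above can be mirrored with a simple rule-based checker. This is only an illustrative stand-in (the label strings and schema layout are assumptions), not the fine-tuned model itself, which additionally judges whether parameter values are appropriate:

```python
def classify_call(call, available_tools):
    """Classify a tool call against the available tool schemas, mirroring
    the card's verdict taxonomy. Value appropriateness is not judged here;
    that is what the fine-tuned model contributes."""
    tools = {t["name"]: t for t in available_tools}
    if call["name"] not in tools:
        return "incorrect_tool"
    expected = tools[call["name"]]["parameters"]["properties"]
    # Any argument name missing from the schema is a parameter-name error.
    if set(call["arguments"]) - set(expected):
        return "incorrect_parameter_names"
    return "correct"

# Hypothetical tool schema for illustration.
TOOLS = [{
    "name": "get_ssl_info",
    "parameters": {"type": "object",
                   "properties": {"host": {"type": "string"}}},
}]
```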
InternVL3-78B-GGUF
EXAONE-Deep-32B-GGUF
gemma-3-12b-it-qat-q4_0-GGUF
granite-3.0-3b-a800m-base-GGUF
SmolLM2-135M-Instruct-GGUF
Eagle2-2B-GGUF
granite-guardian-3.0-2b-GGUF
NextCoder-32B
TriLM_1.1B_Unpacked-GGUF
GneissWeb.7B_ablation_model_on_350B_GneissWeb.seed2-GGUF
TriLM_3.9B_Unpacked-GGUF
OpenCodeReasoning-Nemotron-32B-IOI-GGUF
Kevin-32B-GGUF
cogito-v1-preview-llama-8B-GGUF
TriLM_190M_Unpacked-GGUF
TriLM_390M_Unpacked-GGUF
Caller-GGUF
This model was generated using llama.cpp at commit `73e53dc8`. Click here to get info on choosing the right GGUF model format Caller (32B) is a robust model engineered for seamless integrations and optimized for managing complex tool-based interactions and API function calls. Its strength lies in precise execution, intelligent orchestration, and effective communication between systems, making it indispensable for sophisticated automation pipelines. - Architecture Base: Qwen2.5-32B - Parameter Count: 32B - License: Apache-2.0 - Managing integrations between CRMs, ERPs, and other enterprise systems - Running multi-step workflows with intelligent condition handling - Orchestrating external tool interactions like calendar scheduling, email parsing, or data extraction - Real-time monitoring and diagnostics in IoT or SaaS environments GGUF format available here License Caller (32B) is released under the Apache-2.0 License. You are free to use, modify, and distribute this model in both commercial and non-commercial applications, subject to the terms and conditions of the license. If you have questions or would like to share your experiences using Caller (32B), please connect with us on social media. We’re excited to see what you build—and how this model helps you innovate!
Help me test my AI-Powered Quantum Network Monitor Assistant with quantum-ready security checks: The full open-source code for the Quantum Network Monitor service is available at my GitHub repos (repos with NetworkMonitor in the name): Source Code Quantum Network Monitor. You will also find the code I use to quantize the models if you want to do it yourself: GGUFModelBuilder

💬 How to test: Choose an AI assistant type:
- `TurboLLM` (GPT-4.1-mini)
- `HugLLM` (Hugging Face open-source models)
- `TestLLM` (Experimental CPU-only)

What I'm Testing
I'm pushing the limits of small open-source models for AI network monitoring, specifically:
- Function calling against live network services
- How small can a model go while still handling:
  - Automated Nmap security scans
  - Quantum-readiness checks
  - Network monitoring tasks

🟡 TestLLM – Current experimental model (llama.cpp on 2 CPU threads on a Hugging Face Docker space):
- ✅ Zero-configuration setup
- ⏳ 30s load time (slow inference but no API costs). No token limits, as the cost is low.
- 🔧 Help wanted! If you're into edge-device AI, let's collaborate!

Other Assistants
🟢 TurboLLM – Uses gpt-4.1-mini:
- It performs very well, but unfortunately OpenAI charges per token, so token usage is limited.
- Create custom cmd processors to run .NET code on Quantum Network Monitor Agents
- Real-time network diagnostics and monitoring
- Security audits
- Penetration testing (Nmap/Metasploit)

🔵 HugLLM – Latest open-source models:
- 🌐 Runs on the Hugging Face Inference API. Performs pretty well using the latest models hosted on Novita.

💡 Example commands you could test:
1. `"Give me info on my website's SSL certificate"`
2. `"Check if my server is using quantum-safe encryption for communication"`
3. `"Run a comprehensive security audit on my server"`
4. `"Create a cmd processor to .. (whatever you want)"`

Note: you need to install a Quantum Network Monitor Agent to run the .NET code on. This is a very flexible and powerful feature.
Use with caution! I fund the servers used to create these model files, run the Quantum Network Monitor service, and pay for inference from Novita and OpenAI—all out of my own pocket. All the code behind the model creation and the Quantum Network Monitor project is open source. Feel free to use whatever you find helpful. If you appreciate the work, please consider buying me a coffee ☕. Your support helps cover service costs and allows me to raise token limits for everyone.
Qwen3Guard-Gen-8B-GGUF
This model was generated using llama.cpp at commit `b5bd0378`. I've been experimenting with a new quantization approach that selectively elevates the precision of key layers beyond what the default IMatrix configuration provides. In my testing, standard IMatrix quantization underperforms at lower bit depths, especially with Mixture of Experts (MoE) models. To address this, I'm using the `--tensor-type` option in `llama.cpp` to manually "bump" important layers to higher precision. You can see the implementation here: 👉 Layer bumping with llama.cpp

While this does increase model file size, it significantly improves precision for a given quantization level. I'd love your feedback: have you tried this? How does it perform for you?

Click here to get info on choosing the right GGUF model format

Qwen3Guard is a series of safety moderation models built upon Qwen3 and trained on a dataset of 1.19 million prompts and responses labeled for safety. The series includes models in three sizes (0.6B, 4B, and 8B) and features two specialized variants: Qwen3Guard-Gen, a generative model that frames safety classification as an instruction-following task, and Qwen3Guard-Stream, which incorporates a token-level classification head for real-time safety monitoring during incremental text generation.

This repository hosts Qwen3Guard-Gen, which offers the following key advantages:
- Three-Tiered Severity Classification: Enables detailed risk assessment by categorizing outputs into safe, controversial, and unsafe severity levels, supporting adaptation to diverse deployment scenarios.
- Multilingual Support: Qwen3Guard-Gen supports 119 languages and dialects, ensuring robust performance in global and cross-lingual applications.
- Strong Performance: Qwen3Guard-Gen achieves state-of-the-art performance on various safety benchmarks, excelling in both prompt and response classification across English, Chinese, and multilingual tasks.
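Because Qwen3Guard-Gen frames safety classification as an instruction-following task, its verdict arrives as generated text that the caller must parse. The sketch below assumes an output format with `Safety:` and `Categories:` lines; the exact format is defined by the official chat template, so treat the regexes as an illustration, not the canonical parser.

```python
import re

def parse_guard_output(text):
    """Extract a severity label and category list from a guard model's
    generated text. The `Safety:` / `Categories:` line format is an
    assumption for illustration; adapt the patterns to the real template."""
    label = re.search(r"Safety:\s*(Safe|Controversial|Unsafe)", text)
    cats = re.search(r"Categories:\s*(.+)", text)
    return {
        "label": label.group(1) if label else None,
        "categories": [c.strip() for c in cats.group(1).split(",")] if cats else [],
    }

example = "Safety: Unsafe\nCategories: Violent, Non-violent Illegal Acts"
print(parse_guard_output(example))
```

Parsing into a structured dict like this lets downstream code branch on the three-tier severity level (e.g. block `Unsafe`, escalate `Controversial` for review, pass `Safe`).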
For more details, please refer to our blog, GitHub, and Technical Report. The latest version of `transformers` is recommended, and `transformers>=4.51.0` is required. For deployment, you can use `sglang>=0.4.6.post1` or `vllm>=0.9.0` to create an OpenAI-compatible API endpoint, which you can then query with a standard OpenAI-style API call.

In Qwen3Guard, potential harms are classified into three severity levels:
- Unsafe: Content generally considered harmful across most scenarios.
- Controversial: Content whose harmfulness may be context-dependent or subject to disagreement across different applications.
- Safe: Content generally considered safe across most scenarios.

In the current version of Qwen3Guard, we consider the following safety categories:
- Violent: Content that provides detailed instructions, methods, or advice on how to commit acts of violence, including the manufacture, acquisition, or use of weapons. Also includes depictions of violence.
- Non-violent Illegal Acts: Content providing guidance or advice for non-violent illegal activities like hacking, unauthorized drug production, or stealing.
- Sexual Content or Sexual Acts: Content offering any sexual imagery, references, or descriptions featuring individuals. Also includes explicit sexual imagery, references, or descriptions of illegal or unethical sexual acts, such as rape, bestiality, incest, and sexual slavery.
- Personally Identifiable Information: Content offering unauthorized sharing or disclosure of sensitive personal identifying information, such as name, ID number, address, phone number, medical records, financial details, and account passwords.
- Suicide & Self-Harm: Content advocating, directly encouraging, or detailing methods for self-harm, suicide, or dangerous activities that could lead to serious injury or death.
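The OpenAI-style API call mentioned above can be sketched as follows. Only the request body is built here so the snippet runs offline; the model id, endpoint URL, and prompt are illustrative assumptions about a local vLLM or sglang deployment.

```python
import json

# Request body for an OpenAI-compatible chat endpoint (e.g. a server
# started with `vllm serve Qwen/Qwen3Guard-Gen-8B`). The URL, model id,
# and user message are assumptions; sending is commented out so this
# sketch has no network dependency.
payload = {
    "model": "Qwen/Qwen3Guard-Gen-8B",
    "messages": [
        {"role": "user", "content": "Tell me how to pick a lock."},
    ],
    "temperature": 0.0,  # deterministic output helps for moderation labels
}
body = json.dumps(payload)
# requests.post("http://localhost:8000/v1/chat/completions",
#               data=body, headers={"Content-Type": "application/json"})
print(json.loads(body)["model"])
```

The guard's chat template wraps the message for classification, so the generated completion is the safety verdict rather than an ordinary reply.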
- Unethical Acts: Any immoral or unethical content or acts, including but not limited to bias, discrimination, stereotyping, injustice, hate speech, offensive language, harassment, insults, threats, defamation, extremism, misinformation regarding ethics, and other behaviors that, while not illegal, are still considered unethical.
- Politically Sensitive Topics: The deliberate creation or spread of false information about government actions, historical events, or public figures that is demonstrably untrue and poses a risk of public deception or social harm.
- Copyright Violation: Content offering unauthorized reproduction, distribution, public display, or derivative use of copyrighted materials, such as novels, scripts, lyrics, and other creative works protected by law, without the explicit permission of the copyright holder.
- Jailbreak (input only): Content that explicitly attempts to override the model's system prompt or model conditioning.

If you find our work helpful, feel free to cite us.

Help me test my AI-Powered Quantum Network Monitor Assistant with quantum-ready security checks: The full open-source code for the Quantum Network Monitor service is available at my GitHub repos (repos with NetworkMonitor in the name): Source Code Quantum Network Monitor.
You will also find the code I use to quantize the models if you want to do it yourself: GGUFModelBuilder

💬 How to test: Choose an AI assistant type:
- `TurboLLM` (GPT-4.1-mini)
- `HugLLM` (Hugging Face open-source models)
- `TestLLM` (Experimental CPU-only)

What I'm Testing
I'm pushing the limits of small open-source models for AI network monitoring, specifically:
- Function calling against live network services
- How small can a model go while still handling:
  - Automated Nmap security scans
  - Quantum-readiness checks
  - Network monitoring tasks

🟡 TestLLM – Current experimental model (llama.cpp on 2 CPU threads on a Hugging Face Docker space):
- ✅ Zero-configuration setup
- ⏳ 30s load time (slow inference but no API costs). No token limits, as the cost is low.
- 🔧 Help wanted! If you're into edge-device AI, let's collaborate!

Other Assistants
🟢 TurboLLM – Uses gpt-4.1-mini:
- It performs very well, but unfortunately OpenAI charges per token, so token usage is limited.
- Create custom cmd processors to run .NET code on Quantum Network Monitor Agents
- Real-time network diagnostics and monitoring
- Security audits
- Penetration testing (Nmap/Metasploit)

🔵 HugLLM – Latest open-source models:
- 🌐 Runs on the Hugging Face Inference API. Performs pretty well using the latest models hosted on Novita.

💡 Example commands you could test:
1. `"Give me info on my website's SSL certificate"`
2. `"Check if my server is using quantum-safe encryption for communication"`
3. `"Run a comprehensive security audit on my server"`
4. `"Create a cmd processor to .. (whatever you want)"`

Note: you need to install a Quantum Network Monitor Agent to run the .NET code on. This is a very flexible and powerful feature. Use with caution! I fund the servers used to create these model files, run the Quantum Network Monitor service, and pay for inference from Novita and OpenAI—all out of my own pocket. All the code behind the model creation and the Quantum Network Monitor project is open source.
Feel free to use whatever you find helpful. If you appreciate the work, please consider buying me a coffee ☕. Your support helps cover service costs and allows me to raise token limits for everyone.