inclusionAI
LLaDA2.1-flash
LLaDA2.0-mini
Ring-2.5-1T
Ling-lite-1.5
Ling-mini-2.0
LLaDA2.1-mini
ZwZ-8B
LLaDA2.0-mini-preview
Ming-flash-omni-2.0
ZwZ-4B
Ling-mini-2.0-GGUF
Use https://github.com/im0qianqian/llama.cpp to quantize the model. For inference, please download our release package from https://github.com/im0qianqian/llama.cpp/releases. We look forward to the following PRs being merged:
- #16063 model: add BailingMoeV2 support
- #16028 Add support for Ling v2
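If you prefer Python bindings over the prebuilt binaries, a minimal sketch with llama-cpp-python is shown below; this assumes your build already contains the BailingMoeV2 / Ling v2 support from the PRs above, and the GGUF filename is a placeholder.

```python
# Minimal sketch: loading a quantized Ling-mini-2.0 GGUF with the llama-cpp-python bindings.
# Assumes the underlying llama.cpp build already includes the BailingMoeV2 / Ling v2 support
# referenced above; the model path and quantization suffix are placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="./Ling-mini-2.0-Q4_K_M.gguf",  # hypothetical local GGUF file
    n_ctx=4096,                                # context window to allocate
    n_gpu_layers=-1,                           # offload all layers to GPU if available
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Give me a short introduction to large language models."}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```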
Ming-flash-omni-Preview
Technical Report | Hugging Face | ModelScope

Ming-flash-omni Preview, an upgraded version of Ming-Omni, is built upon a sparser Mixture-of-Experts (MoE) variant of Ling-Flash-2.0 with 100B total parameters, of which only 6B are active per token. Compared to its predecessor, the upgraded version exhibits substantial improvements across multimodal understanding and generation. We significantly advance speech recognition capabilities, achieving state-of-the-art performance in both contextual ASR and dialect-aware ASR. In image generation, Ming-flash-omni Preview introduces high-fidelity text rendering and demonstrates marked gains in scene consistency and identity preservation during image editing. Furthermore, Ming-flash-omni Preview introduces generative segmentation, a capability that not only achieves strong standalone segmentation performance but also enhances spatial control in image generation and improves editing consistency. It demonstrates highly competitive results on various modality benchmarks compared to industry-leading models.

- [2025.10.27] We release the preview version of Ming-flash-omni: Ming-flash-omni Preview.
- [2025.07.15] We release Ming-lite-omni v1.5 with significant improvements across all modalities.
- [2025.06.12] Our Technical Report is public on arXiv.
- [2025.05.28] The official version of Ming-lite-omni v1 is released, with better performance and image generation support.
- [2025.05.04] We release the test version of Ming-lite-omni: Ming-lite-omni-Preview.

Key Features

Compared to Ming-lite-omni v1.5, Ming-flash-omni Preview features key optimizations in the following three areas:
- Sparse MoE Architecture for Omni-Modality: The Sparse MoE Architecture for Omni-Modality features a 100B-A6B MoE backbone (an extension of Ling-Flash-2.0). To ensure uniform expert activation and stable training across all modalities, Ming-flash-omni Preview employs a Dual-Balanced Routing Mechanism that combines an Auxiliary Load Balancing Loss with a Modality-Level Router Bias Update.
- Generative Segmentation-as-Editing Paradigm: It unifies segmentation and editing into a semantics-preserving generation task and achieves $0.90$ on GenEval, surpassing non-RL methods in fine-grained spatial control.
- Context-Aware and Dialectal Speech Recognition: Ming-flash-omni Preview sets new state-of-the-art performance across all 12 ContextASR benchmarks and significantly improves recognition for 15 Chinese dialects.

You can download our latest model from both Hugging Face and ModelScope. For previous versions, such as Ming-Lite-Omni v1.5, please refer to this link.

| Model | Input modality | Output modality | Download |
|:------|:---:|:---:|:---:|
| Ming-flash-omni Preview | Image, text, video, audio | Image, text, audio | HuggingFace / ModelScope |

If you're in mainland China, we strongly recommend downloading our model from ModelScope. Note: the download may take several minutes to several hours, depending on your network conditions.

Evaluation

Ming-flash-omni Preview shows competitive performance in vision-text understanding, image generation, audio understanding, and text-to-speech capabilities. For detailed evaluation results, please refer to our technical report. We provide a simple example of the usage of this repo.
For detailed usage, please refer to cookbook.ipynb. If you find our work helpful, feel free to cite our work.
Ring-1T
Hugging Face | ModelScope | Experience Now

Today, we officially launch the trillion-parameter thinking model, Ring-1T. It is open-source upon release...
Ring-mini-2.0-GGUF
Use https://github.com/im0qianqian/llama.cpp to quantize the model. For inference, please download our release package from https://github.com/im0qianqian/llama.cpp/releases. We look forward to the following PRs being merged:
- #16063 model: add BailingMoeV2 support
- #16028 Add support for Ling v2
Ling-flash-2.0-GGUF
Ring-flash-2.0-GGUF
LLaDA-MoE-7B-A1B-Instruct
LLaDA-MoE is a new and upgraded series of the LLaDA diffusion language model. This pre-release includes two cutting-edge models: LLaDA-MoE-7B-A1B-Base and LLaDA-MoE-7B-A1B-Instruct.
UI-Venus-Ground-7B
UI-Venus

This repository contains the UI-Venus model from the report UI-Venus Technical Report: Building High-performance UI Agents with RFT. UI-Venus is a native UI agent based on the Qwen2.5-VL multimodal large language model, designed to perform precise GUI element grounding and effective navigation using only screenshots as input. It achieves state-of-the-art performance through Reinforcement Fine-Tuning (RFT) with high-quality training data. More inference details and usage guides are available in the GitHub repository. We will continue to update results on standard benchmarks, including ScreenSpot-v2/Pro and AndroidWorld.

License: Apache-2.0 (https://opensource.org/licenses/Apache-2.0) | Paper: http://arxiv.org/abs/2508.10833 | Code: https://github.com/inclusionAI/UI-Venus | Model: https://huggingface.co/inclusionAI/UI-Venus-Ground-7B

> Figure: Performance of UI-Venus across multiple benchmark datasets. UI-Venus achieves state-of-the-art (SOTA) results on key UI understanding and interaction benchmarks, including ScreenSpot-Pro, ScreenSpot-v2, OS-World-G, UI-Vision, and AndroidWorld. The results demonstrate its superior capability in visual grounding, UI navigation, cross-platform generalization, and complex task reasoning.

UI-Venus is a multimodal UI agent built on Qwen2.5-VL that performs accurate UI grounding and navigation using only screenshots as input. The 7B and 72B variants achieve 94.1%/50.8% and 95.3%/61.9% on the ScreenSpot-v2 and ScreenSpot-Pro benchmarks, surpassing prior SOTA models such as GTA1 and UI-TARS-1.5. On the AndroidWorld navigation benchmark, they achieve 49.1% and 65.9% success rates, respectively, demonstrating strong planning and generalization capabilities.

Key innovations include:
- SOTA Open-Source UI Agent: Publicly released to advance research in autonomous UI interaction and agent-based systems.
- Reinforcement Fine-Tuning (RFT): Utilizes carefully designed reward functions for both grounding and navigation tasks.
- Efficient Data Cleaning: Trained on several hundred thousand high-quality samples to ensure robustness.
- Self-Evolving Trajectory History Alignment & Sparse Action Enhancement: Improves reasoning coherence and action distribution for better long-horizon planning.

Use the shell scripts to launch the evaluation. The evaluation setup follows the same protocol as ScreenSpot, including data format, annotation structure, and metric calculation.

| Model | Mobile Text | Mobile Icon | Desktop Text | Desktop Icon | Web Text | Web Icon | Avg. |
|--------------------------|-----------------|-----------------|------------------|------------------|--------------|--------------|----------|
| uitars-1.5 | - | - | - | - | - | - | 94.2 |
| Seed-1.5-VL | - | - | - | - | - | - | 95.2 |
| GPT-4o | 26.6 | 24.2 | 24.2 | 19.3 | 12.8 | 11.8 | 20.1 |
| Qwen2.5-VL-7B | 97.6 | 87.2 | 90.2 | 74.2 | 93.2 | 81.3 | 88.8 |
| UI-TARS-7B | 96.9 | 89.1 | 95.4 | 85.0 | 93.6 | 85.2 | 91.6 |
| UI-TARS-72B | 94.8 | 86.3 | 91.2 | 87.9 | 91.5 | 87.7 | 90.3 |
| LPO | 97.9 | 82.9 | 95.9 | 86.4 | 95.6 | 84.2 | 90.5 |
| UI-Venus-Ground-7B (Ours) | 99.0 | 90.0 | 97.0 | 90.7 | 96.2 | 88.7 | 94.1 |
| UI-Venus-Ground-72B (Ours) | 99.7 | 93.8 | 95.9 | 90.0 | 96.2 | 92.6 | 95.3 |

Performance comparison of GUI agent models across six task categories on ScreenSpot-Pro. Scores are in percentage (%). `T` = Text, `I` = Icon. `*`: reproduced; `†`: trained from UI-TARS-1.5-7B.
| Model | CAD (T/I) | Dev (T/I) | Creative (T/I) | Scientific (T/I) | Office (T/I) | OS (T/I) | Avg T | Avg I | Overall | Type |
|-------|-----------|-----------|----------------|------------------|--------------|---------|--------|--------|------------|------|
| GPT-4o | 2.0 / 0.0 | 1.3 / 0.0 | 1.0 / 0.0 | 2.1 / 0.0 | 1.1 / 0.0 | 0.0 / 0.0 | 1.3 | 0.0 | 0.8 | Closed |
| Claude Computer Use | 14.5 / 3.7 | 22.0 / 3.9 | 25.9 / 3.4 | 33.9 / 15.8 | 30.1 / 16.3 | 11.0 / 4.5 | 23.4 | 7.1 | 17.1 | Closed |
| UI-TARS-1.5 | – / – | – / – | – / – | – / – | – / – | – / – | – | – | 61.6 | Closed |
| Seed1.5-VL | – / – | – / – | – / – | – / – | – / – | – / – | – | – | 60.9 | Closed |
| Qwen2.5-VL-7B* | 16.8 / 1.6 | 46.8 / 4.1 | 35.9 / 7.7 | 49.3 / 7.3 | 52.5 / 20.8 | 37.4 / 6.7 | 38.9 | 7.1 | 26.8 | SFT |
| Qwen2.5-VL-72B | 54.8 / 15.6 | 65.6 / 16.6 | 63.1 / 19.6 | 78.5 / 34.5 | 79.1 / 47.2 | 66.4 / 29.2 | 67.3 | 25.0 | 51.2 | SFT |
| UI-TARS-7B | 20.8 / 9.4 | 58.4 / 12.4 | 50.0 / 9.1 | 63.9 / 31.8 | 63.3 / 20.8 | 30.8 / 16.9 | 47.8 | 16.2 | 35.7 | SFT |
| UI-TARS-72B | 18.8 / 12.5 | 62.9 / 17.2 | 57.1 / 15.4 | 64.6 / 20.9 | 63.3 / 26.4 | 42.1 / 15.7 | 50.9 | 17.6 | 38.1 | SFT |
| Phi-Ground-7B | 26.9 / 17.2 | 70.8 / 16.7 | 56.6 / 13.3 | 58.0 / 29.1 | 76.4 / 44.0 | 55.1 / 25.8 | 56.4 | 21.8 | 43.2 | RL |
| UI-TARS-1.5-7B | – / – | – / – | – / – | – / – | – / – | – / – | – | – | 49.6 | RL |
| GTA1-7B† | 53.3 / 17.2 | 66.9 / 20.7 | 62.6 / 18.2 | 76.4 / 31.8 | 82.5 / 50.9 | 48.6 / 25.9 | 65.5 | 25.2 | 50.1 | RL |
| GTA1-72B | 56.9 / 28.1 | 79.9 / 33.1 | 73.2 / 20.3 | 81.9 / 38.2 | 85.3 / 49.1 | 73.8 / 39.1 | 74.5 | 32.5 | 58.4 | RL |
| UI-Venus-Ground-7B | 60.4 / 21.9 | 74.7 / 24.1 | 63.1 / 14.7 | 76.4 / 31.8 | 75.7 / 41.5 | 49.5 / 22.5 | 67.1 | 24.3 | 50.8 | Ours (RL) |
| UI-Venus-Ground-72B | 66.5 / 29.7 | 84.4 / 33.1 | 73.2 / 30.8 | 84.7 / 42.7 | 83.1 / 60.4 | 75.7 / 36.0 | 77.4 | 36.8 | 61.9 | Ours (RL) |

> Experimental results show that UI-Venus-Ground-72B achieves state-of-the-art performance on ScreenSpot-Pro with an average score of 61.7, while also setting new benchmarks on ScreenSpot-v2 (95.3), OSWorld-G (69.8), AgentCPM (84.7), and UI-Vision (38.0), highlighting its effectiveness in complex visual grounding and action prediction tasks.

Citation

Please consider citing if you find our work useful:
Ling-flash-2.0
Hugging Face | ModelScope

Today, Ling-flash-2.0 is officially open-sourced! Following the release of the language model Ling-mini-2.0 and the thinking model Ring-mini-2.0, we are now open-sourcing the third MoE LLM under the Ling 2.0 architecture: Ling-flash-2.0, a language model with 100B total parameters and 6.1B activated parameters (4.8B non-embedding). Trained on 20T+ tokens of high-quality data, together with supervised fine-tuning and multi-stage reinforcement learning, Ling-flash-2.0 achieves SOTA performance among dense models under 40B parameters, despite activating only ~6B parameters. Compared to MoE models with larger activation/total parameters, it also demonstrates strong competitiveness. Notably, it delivers outstanding performance in complex reasoning, code generation, and frontend development.

We conducted a comprehensive evaluation of Ling-flash-2.0's reasoning capabilities, reporting strong results on representative benchmarks:
- Multi-disciplinary knowledge reasoning: GPQA-Diamond, MMLU-Pro
- Advanced mathematical reasoning: AIME 2025, Omni-MATH, OptMATH (advanced mathematical optimization tasks)
- Challenging code generation: LiveCodeBench v6, CodeForces-Elo
- Logical reasoning: KOR-Bench, ARC-Prize
- Key regulated industries (Finance, Healthcare): FinanceReasoning, HealthBench

Compared with dense models under 40B (e.g., Qwen3-32B-Non-Thinking, Seed-OSS-36B-Instruct (think budget=0)) and larger-activation/total-parameter MoE models (e.g., Hunyuan-A13B-Instruct, GPT-OSS-120B/low), Ling-flash-2.0 demonstrates stronger complex reasoning power. Moreover, it shows high competitiveness on creative tasks (Creative Writing v3).

Guided by Ling Scaling Laws, Ling 2.0 adopts a 1/32 activation-ratio MoE architecture, optimized across multiple design choices: expert granularity, shared-expert ratio, attention balance, aux-loss-free + sigmoid routing strategy, MTP layers, QK-Norm, Partial-RoPE, and more. These refinements enable small-activation MoE models to achieve 7× efficiency gains over equivalent dense architectures. In other words, with just 6.1B activated parameters (4.8B non-embedding), Ling-flash-2.0 can match the performance of ~40B dense models. Thanks to its small activation size, it also delivers major inference speed advantages: on H20 hardware, Ling-flash-2.0 achieves 200+ tokens/s, offering 3× speedups compared to 36B dense models in everyday use. With YaRN extrapolation, it supports 128K context length, and as output length grows, its relative speedup can reach 7× or more.

The following table lists the various stages of Ling-flash-2.0 models. If you are located in mainland China, we also provide the model on ModelScope.cn to speed up the download process.

| Model | Context Length | Download |
|:---:|:---:|:---:|
| Ling-flash-base-2.0 | 32K -> 128K (YaRN) | HuggingFace / ModelScope |
| Ling-flash-2.0 | 32K -> 128K (YaRN) | HuggingFace / ModelScope |

Note: If you are interested in previous versions, please visit the past model collections on Huggingface or ModelScope.

Here is a code snippet to show you how to use the chat model with `transformers` (a minimal sketch is also given below). If you're in mainland China, we strongly recommend using our model from ModelScope. vLLM supports offline batched inference or launching an OpenAI-compatible API service for online inference.
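Picking up the `transformers` usage mentioned above, here is a minimal sketch; it assumes the checkpoint loads through `AutoModelForCausalLM` with `trust_remote_code=True`, and the generation settings are illustrative (the authoritative snippet is in the model card):

```python
# Minimal chat sketch with transformers; model id and generation settings are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "inclusionAI/Ling-flash-2.0"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",   # pick BF16/FP16 automatically
    device_map="auto",    # spread layers across available GPUs
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Give me a short introduction to large language models."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=512)
response = tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
```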
Since the Pull Request (PR) has not yet been submitted to the vLLM community, please prepare the environment by following the steps below. To handle long context in vLLM using YaRN, follow these two steps (sketched at the end of this section):
1. Add a `rope_scaling` field to the model's `config.json` file, for example:
2. Use the additional parameter `--max-model-len` to specify the desired maximum context length when starting the vLLM service.
For detailed guidance, please refer to the vLLM `instructions`.

We will submit our model to the official SGLang release later; for now, prepare the environment with the following steps:
Then apply the patch to your SGLang installation:
Both BF16 and FP8 models are supported by SGLang now; which is used depends on the dtype of the model in ${MODELPATH}. Both share the same launch command:
MTP is supported for the base model, but not yet for the chat model. You can add the parameter `--speculative-algorithm NEXTN` to the start command.

We recommend using Llama-Factory to finetune Ling. This code repository is licensed under the MIT License.
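To make the YaRN step above concrete, here is a small sketch that patches `rope_scaling` into a local checkpoint's `config.json`; the field layout and the 4.0 factor (32K -> 128K) follow the common YaRN convention and should be checked against the official instructions:

```python
# Sketch: add a YaRN rope_scaling entry to a local checkpoint's config.json before serving with vLLM.
# The factor (32K -> 128K = 4.0) and field names follow the usual YaRN convention; verify them
# against the official Ling instructions for your exact checkpoint.
import json
from pathlib import Path

config_path = Path("/path/to/Ling-flash-2.0/config.json")  # local model directory (placeholder)
config = json.loads(config_path.read_text())

config["rope_scaling"] = {
    "type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
}

config_path.write_text(json.dumps(config, indent=2))
print("rope_scaling added; start vLLM with --max-model-len 131072")
```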
Ring Mini 2.0
Hugging Face | ModelScope | Experience Now

This model is presented in the paper Every Step Evolves: Scaling Reinforcement Learning for Trillion-Scale Thinking Model. Today, we officially release Ring-mini-2.0, a high-performance inference-oriented MoE model deeply optimized based on the Ling 2.0 architecture. With only 16B total parameters and 1.4B activated parameters, it achieves comprehensive reasoning capabilities comparable to dense models below the 10B scale. It excels particularly in logical reasoning, code generation, and mathematical tasks, while supporting 128K long-context processing and 300+ tokens/s high-speed generation.

Enhanced Reasoning: Joint Training with SFT + RLVR + RLHF
Built upon Ling-mini-2.0-base, Ring-mini-2.0 undergoes further training with Long-CoT SFT, more stable and continuous RLVR, and RLHF joint optimization, significantly improving the stability and generalization of complex reasoning. On multiple challenging benchmarks (LiveCodeBench, AIME 2025, GPQA, ARC-AGI-v1, etc.), it outperforms dense models below 10B and even rivals larger MoE models (e.g., gpt-oss-20B-medium) with comparable output lengths, particularly excelling in logical reasoning.

High Sparsity, High-Speed Generation
Inheriting the efficient MoE design of the Ling 2.0 series, Ring-mini-2.0 activates only 1.4B parameters and achieves performance equivalent to 7–8B dense models through architectural optimizations such as a 1/32 expert activation ratio and MTP layers. Thanks to its low activation and high sparsity design, Ring-mini-2.0 delivers a throughput of 300+ tokens/s when deployed on H20. With Expert Dual Streaming inference optimization, this can be further boosted to 500+ tokens/s, significantly reducing inference costs for high-concurrency scenarios involving thinking models. Additionally, with YaRN extrapolation, it supports 128K long-context processing, achieving a relative speedup of up to 7x in long-output scenarios.

| Model | #Total Params | #Activated Params | Context Length | Download |
| :---: | :---: | :---: | :---: | :---: |
| Ring-mini-2.0 | 16.8B | 1.4B | 128K | HuggingFace / ModelScope |

Here is a code snippet to show you how to use the chat model with `transformers`:

License: This code repository is licensed under the MIT License.
Project Page: Access the demo and experience the model at https://zenmux.ai/inclusionai/ring-mini-2.0
Code: The full code repository for this model can be found on GitHub: https://github.com/inclusionAI/Ring-V2
Citation: If you find our work helpful, feel free to cite our work.
Ring Mini Linear 2.0
Technical Report | Hugging Face | ModelScope

Today, we are officially open-sourcing Ring-mini-linear-2.0. This model continues to employ a hybrid architecture that combines linear attention and standard attention mechanisms, striking a balance between performance and efficiency. Inheriting the efficient MoE (Mixture-of-Experts) design of the Ling 2.0 series, and through architectural optimizations such as a 1/32 expert activation ratio and MTP layers, Ring-mini-linear achieves the performance of an ~8B dense model while activating only 1.6B of its 16.4B total parameters. This model was converted from Ling-mini-base-2.0 and continually trained on an additional 600B tokens. In terms of performance, the hybrid linear model is comparable in overall performance to standard attention models of a similar size (e.g., Ring-mini-2.0) and surpasses other open-source MoE and dense models of the same class on several challenging benchmarks. Additionally, we support a 512K long context window, achieved by extrapolating the window 4x using YaRN. This provides superior speed, especially on tasks involving long inputs and outputs.

To better demonstrate our model's reasoning capabilities, we compared it with three other models (Ring-mini-2.0, Qwen3-8B-thinking, and GPT-OSS-20B-Medium) on 5 challenging reasoning benchmarks across mathematics, code, and science. We observe that the hybrid-linear architecture achieves performance comparable to that of softmax attention models.

Linear Attention, Highly Sparse, High-Speed Generation
Thanks to its hybrid attention mechanism and highly sparse MoE architecture, `Ring-mini-linear-2.0` achieves near-linear time complexity and constant space complexity, resulting in outstanding inference efficiency. To fully demonstrate this advantage, we conducted a comparison between our model and top-tier competitors of similar size or performance. The results clearly demonstrate the advantage of our model in inference efficiency.

We have submitted our PR to the official SGLang release and it will be merged later; for now, prepare the environment with the following steps. First, install the community version of SGLang and the required packages:
Both BF16 and FP8 models are supported by SGLang now; which is used depends on the dtype of the model in ${MODELPATH}. Both share the same launch command:
Since the Pull Request (PR) has not yet been submitted to the vLLM community, please prepare the environment by following the steps below. First, create a Conda environment with Python 3.10 and CUDA 12.8:
Finally, install compatible versions of transformers after vLLM is installed:
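Once such an environment is prepared, offline batched inference with vLLM might look like the sketch below; the model id, context length, and sampling settings are illustrative, and hybrid-linear support must come from the patched build described above:

```python
# Sketch: offline batched generation with vLLM, assuming an environment with Ring-mini-linear-2.0
# support has been prepared as described above. Model id and settings are illustrative.
from vllm import LLM, SamplingParams

llm = LLM(
    model="inclusionAI/Ring-mini-linear-2.0",
    trust_remote_code=True,
    max_model_len=32768,
)

sampling = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=1024)
prompts = ["Explain the difference between linear attention and softmax attention in two sentences."]

for output in llm.generate(prompts, sampling):
    print(output.outputs[0].text)
```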
Ling-1T
Hugging Face | ModelScope | Experience Now

Ling-1T is the first flagship non-thinking model in the Ling 2.0 series, featuring 1 trillion total parameters...
Ring-1T-FP8
Every Step Evolves: Scaling Reinforcement Learning for Trillion-Scale Thinking Model

This repository presents Ring-1T, an open-source, state-of-the-art thinking model with trillion-scale parameters, as detailed in the paper Every Step Evolves: Scaling Reinforcement Learning for Trillion-Scale Thinking Model. For the full codebase, please refer to the GitHub repository.

Hugging Face | ModelScope | Experience Now

Today, we officially launch the trillion-parameter thinking model, Ring-1T. It is open-source upon release: developers can download the model weights from Hugging Face and ModelScope, or experience direct chat interactions and API calls via the Ling Chat page and ZenMux (links provided at the end of the article). Building upon the preview version released at the end of last month, Ring-1T has undergone continued scaling with large-scale verifiable-reward reinforcement learning (RLVR) training, further unlocking the natural language reasoning capabilities of the trillion-parameter foundation model. Through RLHF training, the model's general abilities have also been refined, making this release of Ring-1T more balanced in performance across various tasks.

Ring-1T adopts the Ling 2.0 architecture and is trained on the Ling-1T-base foundation model, which contains 1 trillion total parameters with 50 billion activated parameters, supporting a context window of up to 128K tokens. Leveraging our self-developed icepop reinforcement learning stabilization method and the efficient reinforcement learning system ASystem (whose AReaL framework is already open-source), we have achieved smooth scaling of MoE-architecture reinforcement learning, from tens of billions (Ring-mini-2.0) to hundreds of billions (Ring-flash-2.0) to trillions (Ring-1T) of parameters, significantly enhancing the model's deep reasoning and natural language inference capabilities.

You can download Ring-1T from the following table. If you are located in mainland China, we also provide the model on ModelScope to speed up the download process.

| Model | Context Length | Download |
| :---: | :---: | :---: |
| Ring-1T | 64K -> 128K (YaRN) | HuggingFace / ModelScope |
| Ring-1T-FP8 | 64K -> 128K (YaRN) | HuggingFace / ModelScope |

Note: If you are interested in previous versions, please visit the past model collections on Huggingface or ModelScope.

To evaluate the deep reasoning capabilities of Ring-1T, we selected representative open-source thinking models (Ring-1T-preview, Deepseek-V3.1-Terminus-Thinking, Qwen-235B-A22B-Thinking-2507) and closed-source APIs (Gemini-2.5-Pro and GPT-5-Thinking (High)) as benchmarks. First, compared to the previously open-sourced preview version, Ring-1T demonstrates more balanced performance across various tasks. Furthermore, Ring-1T achieves open-source leading performance on challenging reasoning benchmarks such as math competitions (AIME 25, HMMT 25), code generation (LiveCodeBench, CodeForces), and logical reasoning (ARC-AGI-v1). It also exhibits strong competitiveness in comprehensive tasks (Arena-Hard-v2.0), healthcare (HealthBench), and creative writing (Creative Writing v3).
Although we have implemented string-level and semantic-level contamination filtering for benchmark tasks across all training stages (including pre-training, fine-tuning instructions, and reinforcement learning prompts), rigorous decontamination of earlier published benchmarks remains a significant challenge in the industry. To analyze Ring-1T's deep reasoning capabilities more objectively, we conducted tests using the IMO 2025 (International Mathematical Olympiad) held in July this year and the recently concluded ICPC World Finals 2025 (International Collegiate Programming Contest World Finals).

For the IMO 2025 test, similar to the previous preview version, we integrated Ring-1T into the multi-agent framework AWorld (https://github.com/inclusionAI/AWorld) and used pure natural language reasoning to solve the problems. The results show that Ring-1T solved Problems 1, 3, 4, and 5 in a single attempt (silver medal level at IMO). On the third attempt, it also produced a nearly perfect proof for Problem 2, a geometry proof. For the most challenging Problem 6 (which no AI contestant in IMO 2025 solved correctly), Ring-1T converged to the same answer as Gemini 2.5 Pro, "4048" (the correct answer is 2112). We believe that with ongoing optimizations, Ring-1T has the potential to reach gold medal level at IMO in a single attempt in the future.

At the ICPC World Finals 2025, we compared GPT-5-Thinking, Gemini-2.5-Pro, and Ring-1T. In a test allowing three attempts for direct problem-solving by the models, they solved 6 (problems CDEFKL), 3 (problems DFK), and 5 (problems DFJKL) problems, respectively. The results demonstrate that Ring-1T also delivers outstanding performance in top-tier international programming competitions. Further testing is ongoing, and we will also open-source the models' solution traces for the aforementioned competitions (IMO traces are provided at the end of the article). We look forward to collaborating with the community to further optimize the reasoning potential of this trillion-parameter thinking model.

Icepop: Ensuring Stable Reinforcement Learning Through Long-Term Training
In the reinforcement learning training of MoE models, discrepancies in operator implementations between the training and inference engines are more pronounced than for dense models. This divergence becomes increasingly significant as sequence length and training steps accumulate, particularly during long-sequence generation and extended training cycles. As illustrated in the experiment below, the original GRPO algorithm begins to collapse after relatively few training steps. In contrast, our proposed Icepop algorithm mitigates this issue by correcting distributions through masked bidirectional truncation, effectively reducing the gap between the training and inference phases, thereby "cooling down" the rapidly escalating training-inference discrepancy.

Figure 1: The training-inference discrepancy of GRPO increases exponentially with training, while Icepop remains relatively stable.
Figure 2: Maximum training-inference discrepancy: GRPO shows a significant rise with training, whereas Icepop maintains a low level.

ASystem: In-House RL Framework "Mastering" Trillion-Scale Training
To ensure stable and efficient reinforcement learning training for trillion-parameter foundation models, we independently developed a high-performance reinforcement learning system, ASystem. ASystem adopts a SingleController + SPMD architecture.
In terms of training and inference engines, ASystem has been meticulously optimized to address the memory management and weight exchange challenges specific to trillion-parameter models. Leveraging our self-developed unified memory pool technology for training and inference, it achieves transparent memory offloading, efficiently releases memory fragmentation, and reduces the risk of running out of memory. Through techniques such as direct P2P communication between GPUs and in-place updates, it enables second-level, zero-redundancy model weight exchange. For the RL training framework, we built a hybrid reward system based on large-scale serverless sandbox technology. This system can start up in milliseconds, supports execution environments for over 10 programming languages, and handles request throughput of up to 10K/s. We have open-sourced AReaL and hope to accelerate RL training and research in the open-source community through technological openness.

Here is a code snippet to show you how to use the chat model with `transformers`:

We will submit our model to the official SGLang release later. For now, prepare the environment by following these steps: Both BF16 and FP8 models are supported by SGLang now; which is used depends on the dtype of the model in ${MODELPATH}. Here is an example of running Ring-1T with multiple GPU nodes, where the master node IP is ${MASTERIP} and the server port is ${PORT}:

For the latest guidance, please refer to the vLLM `instructions`. Here is an example of deploying the model with multiple GPU nodes, where the master node IP is ${MASTERIP}, the server port is ${PORT}, and the model path is ${MODELPATH}: To handle long context in vLLM using YaRN, follow these two steps:
1. Add a `rope_scaling` field to the model's `config.json` file, for example:
2. Use the additional parameter `--max-model-len` to specify the desired maximum context length when starting the vLLM service.
We recommend using Llama-Factory to finetune Ring.

Ring-1T represents the Bailing team's first attempt at developing a trillion-scale deep thinking model. The current version may occasionally exhibit issues such as identity recognition bias, language mixing, and repetitive generation. Additionally, since its attention architecture still adopts the GQA approach from Ling 2.0, there remains room for improvement in inference efficiency under long-context scenarios. We will continue to optimize these aspects in future releases and highly welcome feedback from the community. Furthermore, training for Ring-1T is still ongoing. We are committed to further unlocking the reasoning potential of this trillion-parameter foundation model and look forward to sharing more mature upgraded versions with everyone as soon as possible. Welcome to visit our open-source repository and demo page for download and usage.

- Hugging Face: https://huggingface.co/inclusionAI/Ring-1T
- ModelScope: https://modelscope.cn/models/inclusionAI/Ring-1T
- Ling Chat (for Chinese users): https://ling.tbox.cn/chat
- ZenMux (for overseas developers, offering chat testing and API capabilities): https://zenmux.ai/inclusionai/ring-1t?utmsource=hfinclusionAI
- Ring-1T@AWorld IMO test trajectory: https://github.com/inclusionAI/AWorld/tree/main/examples/imo/samples/samples%20from%20Ring-1T

This code repository is licensed under the MIT License. If you find our work helpful, feel free to cite our work.
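Once an SGLang or vLLM server for Ring-1T is running as described above, it can be queried through any OpenAI-compatible client; in the sketch below, the host, port, and served model name are placeholders for your own deployment:

```python
# Sketch: querying a Ring-1T server (SGLang or vLLM, started as described above) through an
# OpenAI-compatible client. Host, port, and served model name are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://MASTER_IP:PORT/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Ring-1T",
    messages=[{"role": "user", "content": "Prove that the sum of two even integers is even."}],
    max_tokens=2048,
)
print(response.choices[0].message.content)
```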
Ling-2.5-1T
LLaDA-MoE-7B-A1B-Base
Ming-omni-tts-0.5B
Ming-Lite-Omni-1.5
MingTok Vision
MingTok: A Unified Tokenizer for Visual Understanding and Generation without Vector Quantization

Technical Report | Project Page | Hugging Face | ModelScope | GitHub

Key Features
- First Continuous Unified Vision Tokenizer: MingTok enables unified vision understanding and generation via a continuous latent space, eliminating quantization while preserving semantic and perceptual fidelity.
- High-Fidelity Image Reconstruction: A three-stage architecture (encoding, expansion, reconstruction) captures fine details and global structure for accurate, high-quality image recovery.
- Accelerated Autoregressive Convergence: Masked modeling with multi-level supervision shapes a compact, semantically rich latent space, enabling faster and more stable autoregressive training.

Figure 1: Conceptual comparison and qualitative examples of MingTok.

| Tokenizer | Res. | # Tokens | rFID ↓ | PSNR ↑ | SSIM ↑ | LPIPS ↓ |

† denotes using the semantic decoder after joint pre-training.
UI-Venus-Navi-7B
Ming-UniAudio-16B-A3B
Ring Flash Linear 2.0
Technical Report | Hugging Face | ModelScope

We are excited to announce the official open-source release of Ring-flash-linear-2.0! Building on the success of our Ling 2.0 series, this model continues to leverage a powerful hybrid architecture of linear and standard attention, perfectly balancing high performance with superior efficiency. By integrating our proven MoE design with optimizations such as a 1/32 expert activation ratio and MTP layers, Ring-flash-linear achieves the performance of a 40B dense model while activating only 6.1B parameters. This model was converted from Ling-flash-base-2.0 and further trained on an additional 1T tokens.

When it comes to benchmarks, Ring-flash-linear-2.0 not only holds its own against standard attention models (like Ring-flash-2.0) but also outperforms other open-source MoE and dense models in its class on several demanding tasks. Plus, with support for a 128K long context, it's faster and more precise than ever, especially when handling long-form inputs and outputs.

To better demonstrate the model's capabilities, we selected representative open-source thinking models and closed-source APIs for comparison. We present results on several challenging reasoning benchmarks spanning domains such as mathematics, coding, and science. We also evaluate the model's performance on a creative writing task (Creative Writing v3). We observe that our model achieves performance on par with the other models.

Linear Attention, Highly Sparse, High-Speed Generation
Thanks to its hybrid attention mechanism and highly sparse MoE architecture, Ring-flash-linear-2.0 achieves near-linear time complexity and constant space complexity, resulting in outstanding inference efficiency. To fully demonstrate this advantage, we conducted a comparison between our model and top-tier competitors of similar size or performance. The results clearly demonstrate the advantage of our model in inference efficiency.

| Model | Context Length | Download |
| :---: | :---: | :---: |
| Ring-flash-linear-2.0 | 128K | HuggingFace / ModelScope |

We have submitted our PR to the official SGLang release and it will be merged later; for now, prepare the environment with the following steps. First, install the community version of SGLang and the required packages:
Both BF16 and FP8 models are supported by SGLang now; which is used depends on the dtype of the model in ${MODELPATH}. Both share the same launch command:
Since the Pull Request (PR) has not yet been submitted to the vLLM community, please prepare the environment by following the steps below. First, create a Conda environment with Python 3.10 and CUDA 12.8:
Finally, install compatible versions of transformers after vLLM is installed:
LLaDA2.0-flash-CAP
Ring Flash 2.0
Ling 1T FP8
Hugging Face | ModelScope | Experience Now

Ling-1T is the first flagship non-thinking model in the Ling 2.0 series, featuring 1 trillion total parameters with ≈ 50 billion active parameters per token. Built on the Ling 2.0 architecture, Ling-1T is designed to push the limits of efficient reasoning and scalable cognition. Pre-trained on 20 trillion+ high-quality, reasoning-dense tokens, Ling-1T-base supports up to 128K context length and adopts an evolutionary chain-of-thought (Evo-CoT) process across mid-training and post-training. This curriculum greatly enhances the model's efficiency and reasoning depth, allowing Ling-1T to achieve state-of-the-art performance on multiple complex reasoning benchmarks while balancing accuracy and efficiency.

We comprehensively evaluated Ling-1T against leading flagship models, including both open-source giants (e.g., DeepSeek-V3.1-Terminus, Kimi-K2-Instruct-0905) and closed-source APIs (GPT-5-main, Gemini-2.5-Pro). Across code generation, software development, competition-level mathematics, professional math, and logical reasoning, Ling-1T consistently demonstrates superior complex reasoning ability and overall advantage. In the AIME 25 benchmark, Ling-1T extends the Pareto frontier of reasoning accuracy vs. reasoning length, showcasing its strength in "efficient thinking and precise reasoning."

Ling-1T excels in visual reasoning and front-end code generation tasks, combining deep semantic understanding with precise code synthesis. We introduce a hybrid Syntax-Function-Aesthetics reward mechanism, enabling the model not only to generate correct and functional code but also to demonstrate a refined sense of visual aesthetics. On ArtifactsBench, Ling-1T ranks first among open-source models, and the benchmark visualizations in this card were, in fact, generated by Ling-1T itself.

Scaling to the trillion-parameter level has revealed strong emergent reasoning and transfer capabilities. For example, in the BFCL V3 tool-use benchmark, Ling-1T achieves ≈ 70% tool-call accuracy with only light instruction tuning, despite having seen no large-scale trajectory data during training. Ling-1T can:
- Interpret complex natural-language instructions
- Transform abstract logic into functional visual components
- Generate cross-platform compatible front-end code
- Create stylistically controlled marketing copy and multilingual text

These capabilities form the foundation for general, collaborative human-AI intelligence, which we aim to advance together with the open-source community through Ling-1T's release.

The Ling 2.0 architecture was designed from the ground up for trillion-scale efficiency, guided by the Ling Scaling Law (arXiv:2507.17702). This ensures architectural and hyperparameter scalability even under 1e25-1e26 FLOPs of compute:
- 1T total / 50B active parameters with a 1/32 MoE activation ratio
- MTP layers for enhanced compositional reasoning
- Aux-loss-free, sigmoid-scoring expert routing with zero-mean updates
- QK Normalization for fully stable convergence

Ling-1T is the largest FP8-trained foundation model known to date. FP8 mixed-precision training yields a 15%+ end-to-end speedup, improved memory efficiency, and maintains ≤ 0.1% loss deviation from BF16 across 1T tokens. A fine-grained, heterogeneous 1F1B interleaved pipeline further boosts utilization by 40%+. System-level optimizations (fused kernels, communication scheduling, recomputation, checkpointing, simulation, and telemetry) ensure stable trillion-scale training.
Pre-training used over 20T high-quality tokens, with more than 40% reasoning-dense data in later stages. Mid-training introduced curated chain-of-thought corpora for "reasoning pre-activation", improving downstream reasoning stability. A custom WSM (Warmup-Stable-Merge) LR scheduler with mid-train checkpoint merging simulates LR decay and boosts generalization.

Built upon mid-training reasoning activation, post-training adopts Evo-CoT (Evolutionary Chain-of-Thought) for progressive reasoning enhancement under controllable cost. This approach continually expands the Pareto frontier of reasoning accuracy vs. efficiency, which is ideal for reflexive non-thinking models. For reinforcement learning, we introduce LPO (Linguistics-Unit Policy Optimization), a novel sentence-level policy optimization method. Unlike GRPO (token-level) or GSPO (sequence-level) algorithms, LPO treats sentences as the natural semantic action units, enabling precise alignment between rewards and reasoning behavior. Empirically, LPO offers superior training stability and generalization across reasoning tasks.

Ling-1T has been extensively evaluated across knowledge, code, math, reasoning, agent, and alignment benchmarks. It currently stands as the best open-source flagship non-thinking model, rivaling closed-source APIs in complex reasoning while maintaining exceptional efficiency and interpretability.

Evaluation

| Task | Benchmark | DeepSeek-V3.1-Terminus (NonThinking) | Kimi-K2-Instruct-0905 | gpt-5-main | Gemini 2.5 Pro (thinkBudget=128) | Ling-1T |
| --- | --- | --- | --- | --- | --- | --- |
| Knowledge | Professional Knowledge | | | | | |
| | C-Eval | 91.76 | 91.12 | 83.59 | 88.77 | 92.19 |
| | MMLU-Redux (EM) | 92.37 | 91.58 | 92.75 | 94.67 | 92.25 |
| | MMLU-Pro | 83.25 | 81.03 | 81.94 | 82.13 | 82.04 |
| Knowledge | STEM | | | | | |
| | MMLU-Pro-Stem | 87.91 | 85.30 | 73.45 | 88.60 | 88.5 |
| | OlympiadBench-stem | 87.83 | 79.13 | 78.26 | 89.57 | 91.3 |
| | GPQA-Diamond | 76.23 | 73.93 | 71.31 | 71.81 | 72.98 |
| Coding | Code Generation | | | | | |
| | MultiPL-E | 77.68 | 73.76 | 76.66 | 71.48 | 77.91 |
| | mbpp | 90.69 | 89.96 | 91.72 | 91.01 | 96.87 |
| | LiveCodeBench (2408-2505) | 48.02 | 48.95 | 48.57 | 45.43 | 61.68 |
| | CodeForces-rating | 1582 | 1574 | 1120 | 1675 | 1901 |
| | BIRDSQL | 44.88 | 46.45 | 43.97 | 54.76 | 52.38 |
| Coding | Software Development | | | | | |
| | ArtifactsBench | 43.29 | 44.87 | 41.04 | 60.28 | 59.31 |
| | FullStack Bench | 55.48 | 54.00 | 50.92 | 48.19 | 56.55 |
| | Aider | 88.16 | 85.34 | 84.40 | 89.85 | 83.65 |
| Math | Competition Math | | | | | |
| | CNMO 2024 | 73.78 | 68.92 | 63.11 | 74.65 | 79.25 |
| | AIME 2025 | 55.21 | 50.16 | 59.43 | 70.10 | 70.42 |
| | UGMathBench | 72.70 | 69.97 | 67.27 | 70.10 | 74.95 |
| | Omni-Math | 64.77 | 62.42 | 61.09 | 72.02 | 74.46 |
| Math | Professional Math | | | | | |
| | FinanceReasoning | 86.44 | 84.83 | 86.28 | 86.65 | 87.45 |
| | Optibench | 64.30 | 60.83 | 40.06 | 68.76 | 74.71 |
| | OptMATH | 35.99 | 35.84 | 39.16 | 42.77 | 57.68 |
| General Reasoning | | | | | | |
| | BBEH | 42.86 | 34.83 | 39.75 | 29.08 | 47.34 |
| | KOR-Bench | 73.76 | 73.20 | 70.56 | 59.68 | 76.00 |
| | ARC-AGI-1 | 14.69 | 22.19 | 14.06 | 18.94 | 43.81 |
| | ZebraLogic | 81.6 | 85.5 | 57.3 | 70.2 | 90.8 |
| Agent | | | | | | |
| | BFCL-V3 | 52.67 | 71.05 | 50.27 | 63.31 | 69.64 |
| Alignment | | | | | | |
| | Arena Hard V2 ELO | 54.09 | 76.95 | 68.37 | 65.37 | 76.26 |
| | Arena Hard V2 Win Rate | 63.24 | 69.88 | 65.06 | 74.46 | 75.83 |
| | writingbench | 80.95 | 87.59 | 77.07 | 80.53 | 89.4 |
| | Creative Writing v3 | 85.18 | 87.01 | 80.93 | 84.99 | 89.24 |
| | MultiChallenge | 42.49 | 48.72 | 48.72 | 51.28 | 58.24 |

You can download Ling-1T from the following table. If you are located in mainland China, we also provide the model on ModelScope.cn to speed up the download process.

| Model | Context Length | Download |
| :---: | :---: | :---: |
| Ling-1T | 32K -> 128K (YaRN) | HuggingFace / ModelScope |

Note: If you are interested in previous versions, please visit the past model collections on Huggingface or ModelScope.

We will submit our model to the official SGLang release later. For now, prepare the environment by following these steps: Both BF16 and FP8 models are supported by SGLang now; which is used depends on the dtype of the model in ${MODELPATH}. Here is an example of running Ling-1T with multiple GPU nodes, where the master node IP is ${MASTERIP} and the server port is ${PORT}:

For the latest guidance, please refer to the vLLM `instructions`. Here is an example of deploying the model with multiple GPU nodes, where the master node IP is ${MASTERIP}, the server port is ${PORT}, and the model path is ${MODELPATH}: To handle long context in vLLM using YaRN, follow these two steps:
1. Add a `rope_scaling` field to the model's `config.json` file, for example:
2. Use the additional parameter `--max-model-len` to specify the desired maximum context length when starting the vLLM service.

While Ling-1T has made strong progress in efficient reasoning, cross-domain generalization, and training efficiency, several limitations remain:
- GQA-based attention: stable for long-context reasoning but relatively costly. Future versions will adopt hybrid attention to improve efficiency.
- Limited agentic ability: the current model has room to grow in multi-turn interaction, long-term memory, and tool use.
- Instruction and identity issues: occasional deviations or role confusion may occur; future updates will enhance alignment and consistency.

Future versions of Ling-1T will continue to evolve in architecture, reasoning, and alignment, advancing the series toward more general intelligence. This code repository is licensed under the MIT License. Recommended temperature: 0.7. Recommended top_p: 0.95. If you find our work helpful, feel free to give our paper a citation.

```bibtex
@article{Ling-Team2025,
    author        = {Ling-Team and 141 others},
    title         = {{Every Activation Boosted: Scaling General Reasoner to 1 Trillion Open Language Foundation}},
    journal       = {arXiv preprint arXiv:2510.22115},
    eprint        = {2510.22115},
    archivePrefix = {arXiv},
    primaryClass  = {cs.CL},
    year          = {2025},
    doi           = {10.48550/arXiv.2510.22115},
    url           = {https://arxiv.org/abs/2510.22115}
}
```
MingTok-Audio
Ling-flash-base-2.0
Hugging Face | ModelScope

Today, Ling-flash-2.0 is officially open-sourced! Following the release of the language model Ling-mini-2.0 and the thinking model Ring-mini-2.0, we are now open-sourcing the third MoE LLM under the Ling 2.0 architecture: Ling-flash-2.0, a language model with 100B total parameters and 6.1B activated parameters (4.8B non-embedding). Trained on 20T+ tokens of high-quality data, together with supervised fine-tuning and multi-stage reinforcement learning, Ling-flash-2.0 achieves SOTA performance among dense models under 40B parameters, despite activating only ~6B parameters. Compared to MoE models with larger activation/total parameters, it also demonstrates strong competitiveness. Notably, it delivers outstanding performance in complex reasoning, code generation, and frontend development.

We conducted a comprehensive evaluation of Ling-flash-2.0's reasoning capabilities, reporting strong results on representative benchmarks:
- Multi-disciplinary knowledge reasoning: GPQA-Diamond, MMLU-Pro
- Advanced mathematical reasoning: AIME 2025, Omni-MATH, OptMATH (advanced mathematical optimization tasks)
- Challenging code generation: LiveCodeBench v6, CodeForces-Elo
- Logical reasoning: KOR-Bench, ARC-Prize
- Key regulated industries (Finance, Healthcare): FinanceReasoning, HealthBench

Compared with dense models under 40B (e.g., Qwen3-32B-Non-Thinking, Seed-OSS-36B-Instruct (think budget=0)) and larger-activation/total-parameter MoE models (e.g., Hunyuan-A13B-Instruct, GPT-OSS-120B/low), Ling-flash-2.0 demonstrates stronger complex reasoning power. Moreover, it shows high competitiveness on creative tasks (Creative Writing v3).

Guided by Ling Scaling Laws, Ling 2.0 adopts a 1/32 activation-ratio MoE architecture, optimized across multiple design choices: expert granularity, shared-expert ratio, attention balance, aux-loss-free + sigmoid routing strategy, MTP layers, QK-Norm, Partial-RoPE, and more. These refinements enable small-activation MoE models to achieve 7× efficiency gains over equivalent dense architectures. In other words, with just 6.1B activated parameters (4.8B non-embedding), Ling-flash-2.0 can match the performance of ~40B dense models. Thanks to its small activation size, it also delivers major inference speed advantages: on H20 hardware, Ling-flash-2.0 achieves 200+ tokens/s, offering 3× speedups compared to 36B dense models in everyday use. With YaRN extrapolation, it supports 128K context length, and as output length grows, its relative speedup can reach 7× or more.

The following table lists the various stages of Ling-flash-2.0 models. If you are located in mainland China, we also provide the model on ModelScope.cn to speed up the download process.

| Model | Context Length | Download |
|:---:|:---:|:---:|
| Ling-flash-base-2.0 | 32K -> 128K (YaRN) | HuggingFace / ModelScope |
| Ling-flash-2.0 | 32K -> 128K (YaRN) | HuggingFace / ModelScope |

Note: If you are interested in previous versions, please visit the past model collections on Huggingface or ModelScope.

Here is a code snippet to show you how to use the chat model with `transformers`: If you're in mainland China, we strongly recommend using our model from ModelScope. vLLM supports offline batched inference or launching an OpenAI-compatible API service for online inference.
Since the Pull Request (PR) has not yet been submitted to the vLLM community, please prepare the environment by following the steps below. To handle long context in vLLM using YaRN, follow these two steps:
1. Add a `rope_scaling` field to the model's `config.json` file, for example:
2. Use the additional parameter `--max-model-len` to specify the desired maximum context length when starting the vLLM service.
For detailed guidance, please refer to the vLLM `instructions`.

We will submit our model to the official SGLang release later; for now, prepare the environment with the following steps:
Then apply the patch to your SGLang installation:
Both BF16 and FP8 models are supported by SGLang now; which is used depends on the dtype of the model in ${MODELPATH}. Both share the same launch command:
MTP is supported for the base model, but not yet for the chat model. You can add the parameter `--speculative-algorithm NEXTN` to the start command.

We recommend using Llama-Factory to finetune Ling. This code repository is licensed under the MIT License.
Ling-mini-base-2.0
Hugging Face | ModelScope

Today, we are excited to announce the open-sourcing of Ling 2.0, a family of MoE-based large language models that combine SOTA performance with high efficiency. The first released version, Ling-mini-2.0, is compact yet powerful. It has 16B total parameters, but only 1.4B are activated per input token (789M non-embedding). Trained on more than 20T tokens of high-quality data and enhanced through multi-stage supervised fine-tuning and reinforcement learning, Ling-mini-2.0 achieves remarkable improvements in complex reasoning and instruction following. With just 1.4B activated parameters, it still reaches the top-tier level of sub-10B dense LLMs and even matches or surpasses much larger MoE models.

We evaluated Ling-mini-2.0 on challenging general reasoning tasks in coding (LiveCodeBench, CodeForces) and mathematics (AIME 2025, HMMT 2025), as well as knowledge-intensive reasoning tasks across multiple domains (MMLU-Pro, Humanity's Last Exam). Compared with sub-10B dense models (e.g., Qwen3-4B-instruct-2507, Qwen3-8B-nothinking) and larger-scale MoE models (Ernie-4.5-21B-A3B-PT, GPT-OSS-20B/low), Ling-mini-2.0 demonstrated outstanding overall reasoning capabilities.

Guided by Ling Scaling Laws, Ling 2.0 adopts a 1/32 activation ratio MoE architecture, with empirically optimized design choices in expert granularity, shared expert ratio, attention ratio, aux-loss-free + sigmoid routing strategy, MTP loss, QK-Norm, half RoPE, and more. This enables small-activation MoE models to achieve over 7× equivalent dense performance. In other words, Ling-mini-2.0, with only 1.4B activated parameters (789M non-embedding), can deliver performance equivalent to a 7–8B dense model.

The highly sparse, small-activation MoE architecture also delivers significant training and inference efficiency. In simple QA scenarios (within 2000 tokens), Ling-mini-2.0 generates at 300+ tokens/s (on an H20 deployment), more than 2× faster than an 8B dense model. Ling-mini-2.0 is able to handle 128K context length with YaRN; as sequence length increases, the relative speedup can reach over 7×.

Ling 2.0 employs FP8 mixed-precision training throughout. Compared with BF16, experiments with over 1T training tokens show nearly identical loss curves and downstream benchmark performance. To support the community in efficient continued pretraining and fine-tuning under limited compute, we are also open-sourcing our FP8 training solution. Based on tile/blockwise FP8 scaling, it further introduces an FP8 optimizer, FP8 on-demand transposed weights, and an FP8 padding routing map for extreme memory optimization. On 8/16/32 80G GPUs, compared with LLaMA 3.1 8B and Qwen3 8B, Ling-mini-2.0 achieved 30–60% throughput gains with MTP enabled, and 90–120% throughput gains with MTP disabled.

We believe Ling-mini-2.0 is an ideal starting point for MoE research. For the first time at this scale, it integrates 1/32 sparsity, MTP layers, and FP8 training, achieving both strong effectiveness and efficient training/inference performance, making it a prime candidate for the small-size LLM segment. To further foster community research, in addition to releasing the post-trained version, we are also open-sourcing five pretraining checkpoints: the pre-finetuning Ling-mini-2.0-base, along with four base models trained on 5T, 10T, 15T, and 20T tokens, enabling deeper research and broader applications.

The following table lists the various stages of Ling-mini-2.0 models (1.43B activated of 16.26B total params).
If you are located in mainland China, we also provide the model on ModelScope.cn to speed up the download process.

| Model | Context Length | Download |
|:---:|:---:|:---:|
| Ling-mini-base-2.0 | 32K -> 128K (YaRN) | HuggingFace / ModelScope |
| Ling-mini-base-2.0-5T | 4K | HuggingFace / ModelScope |
| Ling-mini-base-2.0-10T | 4K | HuggingFace / ModelScope |
| Ling-mini-base-2.0-15T | 4K | HuggingFace / ModelScope |
| Ling-mini-base-2.0-20T | 4K | HuggingFace / ModelScope |
| Ling-mini-2.0 | 32K -> 128K (YaRN) | HuggingFace / ModelScope |

Note: If you are interested in previous versions, please visit the past model collections on Huggingface or ModelScope (a minimal usage sketch for the base checkpoints is given at the end of this entry). Models in safetensors format can be downloaded from HuggingFace or ModelScope. If you want to train your own model and evaluate it, you can convert from the DCP checkpoints produced by training. Currently, BF16 and FP8 formats are supported; use the conversion parameters to select one:
- `--force-bf16` for BF16 format.
- `--force-fp8` for FP8 format.

Here is a code snippet to show you how to use the chat model with `transformers`: If you're in mainland China, we strongly recommend using our model from ModelScope. vLLM supports offline batched inference or launching an OpenAI-compatible API service for online inference. Since the Pull Request (PR) has not yet been submitted to the vLLM community, please prepare the environment by following the steps below. To handle long context in vLLM using YaRN, follow these two steps:
1. Add a `rope_scaling` field to the model's `config.json` file, for example:
2. Use the additional parameter `--max-model-len` to specify the desired maximum context length when starting the vLLM service.
For detailed guidance, please refer to the vLLM `instructions`. We will submit our model to the official SGLang release later; for now, prepare the environment with the following steps:
Then apply the patch to your SGLang installation:
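As a quick sanity check for the base checkpoints listed above, here is a minimal plain-completion sketch with `transformers` (base models are not chat-tuned, so no chat template is applied); the model id and settings are illustrative:

```python
# Sketch: plain text completion with one of the Ling-mini-base-2.0 checkpoints listed above.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "inclusionAI/Ling-mini-base-2.0-5T"  # any of the 5T/10T/15T/20T base checkpoints

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

inputs = tokenizer("The key idea behind mixture-of-experts models is", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```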
Ring Mini Linear 2.0 GPTQ Int4
ViLaSR
Ming UniVision 16B A3B
Ming-UniVision: Joint Image Understanding and Generation with a Unified Continuous Tokenizer

Technical Report | Project Page | Hugging Face | ModelScope | GitHub

Key Features
- First Unified Autoregressive MLLM with Continuous Vision Tokens: Ming-UniVision is the first multimodal large language model that natively integrates continuous visual representations from MingTok into a next-token prediction (NTP) framework, unifying vision and language under a single autoregressive paradigm without discrete quantization or modality-specific heads.
- 3.5× Faster Convergence in Joint Vision-Language Training: The coherent representational space between understanding and generation, enabled by MingTok, reduces optimization conflicts across tasks, leading to dramatically faster convergence during end-to-end multimodal pretraining.
- Multi-Round In-Context Vision Tasks: Ming-UniVision supports iterative understanding, generation, and editing entirely within the continuous latent space, without the need to decode intermediate states into images, enabling efficient and coherent multimodal reasoning. Users can alternate between asking questions and requesting edits, just like conversing with a human.

Figure 1: Conceptual comparison and qualitative examples of Ming-UniVision built upon MingTok.
Figure 2: Multi-round image understanding, generation, and editing architecture of Ming-UniVision, powered by MingTok.

- Image generation: use descriptive prompts + `output_image_prefix` to save output.
- Image understanding: include "image" and "text" in the same message.
- Image editing: chain multiple `generate(..., for_edit=True)` calls with unique `output_image_prefix` names.
- Multi-turn interactions are supported via internal state; call `model.reset_inner_state()` to reset.
- Input types: "text" and "image"; flexible order, mixed inputs allowed.
- The current model was trained with only two-turn conversations and has not been optimized for alternating rounds of image understanding and generation, although it may generalize to more than two turns during inference. As a result, performance may be limited in complex multimodal dialogue scenarios requiring deep contextual reasoning across turns.
- This open-sourced version was trained using mixed-resolution strategies: high resolution for image understanding, but lower resolution for image editing and generation. Additionally, large-scale interleaved image-text data was not included during pretraining.
- Due to these factors, image editing quality and consistency may be suboptimal compared to fully end-to-end, high-resolution multimodal models. We are actively working on improved versions with unified resolution training and richer interleaved data.

Table 1. Quantitative evaluations on MMBench, MMStar, MMMU, MathVista, HallusionBench, AI2D, MM-Vet, OCRBench, and MME.
| Model | MMB ↑ | MMS ↑ | MMMU ↑ | MathV ↑ | Hall ↑ | AI2D ↑ | MM-Vet ↑ | OCRBench ↑ | MME ↑ |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| Qwen2.5-VL-3B | 79.1 | 55.9 | 53.1 | 62.3 | 46.3 | 81.6 | - | 797 | 2157 |
| Qwen2.5-VL-7B | 83.5 | 63.9 | 58.6 | 68.2 | 52.9 | 83.9 | 67.1 | 864 | 2347 |
| InternVL2.5-4B | 81.1 | 58.3 | 52.3 | 60.5 | 46.3 | 81.4 | 60.6 | 828 | 2338 |
| InternVL2.5-8B | 84.6 | 62.8 | 56.0 | 64.4 | 50.1 | 84.5 | 62.8 | 822 | 2344 |
| Ming-UniVision-16B-A3B (Ours) | 78.5 | 63.7 | 40.3 | 66.6 | 47.8 | 82.8 | 64.2 | 724 | 2023 |

Table 2. Evaluation of text-to-image generation ability on GenEval and DPG-Bench. † denotes performance obtained by rewritten prompts.

| Method | Single Obj. ↑ | Two Obj. ↑ | Counting ↑ | Colors ↑ | Position ↑ | Color Attri. ↑ | Overall ↑ | DPG-Bench ↑ |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| SD3-Medium | 0.99 | 0.94 | 0.72 | 0.89 | 0.33 | 0.60 | 0.74 | 84.08 |
| Janus-Pro-1B | 0.98 | 0.82 | 0.51 | 0.89 | 0.65 | 0.56 | 0.73 | 82.63 |
| Janus-Pro-7B | 0.99 | 0.89 | 0.59 | 0.90 | 0.79 | 0.66 | 0.80 | 84.19 |
| Show-o2-7B | 1.00 | 0.87 | 0.58 | 0.92 | 0.52 | 0.62 | 0.76 | 86.14 |
| TokenFlow-XL | 0.95 | 0.60 | 0.41 | 0.81 | 0.16 | 0.24 | 0.55 | 73.38 |
| Ming-UniVision-16B-A3B (Ours) | 1.00 | 0.93 | 0.59 | 0.93 | 0.92 | 0.70 | 0.85 | 82.12 |
LLaDA2.0-flash
Ming UniAudio 16B A3B Edit
π Technical Report ο½π Project Page ο½π€ Hugging Face ο½ π€ ModelScope Ming-UniAudio is a novel framework that unifies speech understanding, generation, and editing. Its core is a unified continuous speech tokenizer that effectively unifies semantic and acoustic features within an end-to-end model. We developed a speech language model that strikes a balance between generation and understanding capabilities based on the unified continuous audio tokenizer. Leveraging this foundational model, which exhibits robust performance in both domains, we further trained a dedicated speech editing model built upon Ming-Lite-Omni. Crucially, Ming-UniAudio is the first to enable universal, free-form speech editing guided solely by natural language instructions, handling complex semantic and acoustic modifications without manual region specification. - π₯ First unified continuous speech tokenizer for both understanding and generation tasks: MingTok-Audio - π₯ First Speech LLM with unifed continuous tokenizer for both understanding and generation: Ming-UniAudio - π₯ First universal free-form speech editing model for various semantic and acoustic editing task without any temporal regime: Ming-UniAudio-Edit - π₯ First benchmark for free-form speech editing: Ming-Freeform-Audio-Edit-Benchmark [2025.09.30] π₯ We release Ming-UniAudio with significant improvements across speech understanding, generation, and free-form editing tasks. Key Features Ming-UniAudio features key optimizations as follows, compared to other audio-assisted LLMs: - Unified Continuous Speech Tokenizer: Ming-UniAudio proposes a unified continuous speech tokenizer MingTok-Audio based on a VAE framework with a causal Transformer architecture, the first continuous speech tokenizer to effectively integrate semantic and acoustic features, and enables a closed-loop system with LLMs through hierarchical feature representations, makes it suitable for both understanding and generation tasks - Unified Speech Language Model for Generation and Understanding: We pretrain an end-to-end unified speech language model with a single LLM backbone for both understanding and generation tasks, enhanced with a Diffusion Head to ensure high-fidelity speech synthesis. - Instruction-Guided Free-Form Speech Editing: We introduce the first instruction-guided, free-form speech editing framework that supports comprehensive semantic and acoustic edits without requiring explicit edit regions, along with Ming-Freeform-Audio-Edit, the first open-source evaluation set for such tasks. Evaluation In various benchmark tests, Ming-UniAudio demonstrates highly competitive results compared to industry-leading models of similar scale. Context ASR performance comparison on various audio benchmark datasets. Qwen2-Audio 11.49 | 27.27 | 35.08 13.99 | 33.02 | 32.92 9.92 | 24.10 | 30.02 7.00 | 22.76 | 26.17 Baichuan-Audio 7.52 | 5.87 | 4.55 5.66 | 10.01 | 3.64 2.16 | 6.65 | 2.35 2.96 | 11.48 | 3.94 Kimi-Audio 2.90 | 6.68 | 8.01 4.67 | 13.50 | 11.31 1.95 | 11.13 | 15.28 2.90 | 15.91 | 16.68 Baichuan-Omni-1.5 8.16 | 7.69 | 6.53 9.91 | 14.40 | 5.54 2.98 | 8.39 | 4.71 5.00 | 16.83 | 7.84 Qwen2.5-Omni-3B 3.99 | 7.80 | 9.69 4.83 | 14.36 | 12.85 2.13 | 10.55 | 14.11 3.12 | 15.07 | 15.17 Qwen2.5-Omni-7B 3.96 | 7.38 | 8.72 5.32 | 11.83 | 9.24 1.84 | 9.80 | 12.19 2.40 | 14.06 | 13.17 Ming-UniAudio-16B-A3B-Edit(ours) 4.00 | 3.56 | 3.69 5.34 | 8.73 | 2.53 1.58 | 5.98 | 2.40 3.04 | 9.50 | 1.48 Performance comparison on various audio benchmark datasets. The best results are in bold . 
| Datasets | Model | Model Type | DNSMOS OVRL | DNSMOS SIG | DNSMOS BAK |
|:---|:---|:---|:---:|:---:|:---:|

You can download our latest model and benchmark from both Huggingface and ModelScope.

| Type | Model | Input modality | Output modality | Download |
|:---|:---|:---:|:---:|:---:|
| Tokenizer | MingTok-Audio | audio | audio | π€ HuggingFace π€ ModelScope |
| SpeechLLM | Ming-UniAudio-16B-A3B | audio | audio | π€ HuggingFace π€ ModelScope |
| SpeechLLM | Ming-UniAudio-16B-A3B-Edit | text, audio | text, audio | π€ HuggingFace π€ ModelScope |
| Benchmark | Ming-Freeform-Audio-Edit | - | - | π€ HuggingFace π€ ModelScope |
| Eval tools | | | | |

If you're in mainland China, we strongly recommend you to download our model from π€ ModelScope. Note: this download process will take several minutes to several hours, depending on your network conditions. Additional demonstration cases are available on our project page.

You can also initialize the environment by building the docker image. First clone this repository. Then build the docker image with the provided Dockerfile in `docker/docker-py310-cu121`; this step might take a while. At last, start the container with the current repo directory mounted (a minimal command sketch is given at the end of this section). You can run the model with the Python interface. You may download the huggingface model into the repo directory first (`.../Ming-UniAudio/`) or mount the downloaded model path when starting the container.

Step 2 - Download the Ming-UniAudio model weights and create a soft link to the source code directory. Download our model following `Model & Benchmark Downloads`.
Step 3 - Enter the code directory; you can refer to the following codes to run the Ming-UniAudio model. We also provide a simple example on the usage of this repo. For detailed usage, please refer to demobook.ipynb.

Note: We test the examples on hardware of NVIDIA H800-80GB/H20-96G with CUDA 12.4. If you find our work helpful, feel free to give us a cite.
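As a reference for the Docker-based setup described above, here is a minimal sketch run from inside the cloned repository. The image tag, the in-container mount point, and the working directory are arbitrary choices, not values from this card.

```bash
# Build the image from the provided Dockerfile (this step can take a while).
docker build -t ming-uniaudio -f docker/docker-py310-cu121/Dockerfile .

# Start a container with the current repo directory mounted and GPUs visible.
docker run -it --rm --gpus all \
  -v "$(pwd)":/workspace/Ming-UniAudio \
  -w /workspace/Ming-UniAudio \
  ming-uniaudio /bin/bash
```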
Ring-1T-preview
π€ Hugging Face    |   π€ ModelScope Recently, we have been fully occupied with the post-training of Ling 2.0's 1T foundational language model, striving to maximize the natural language reasoning potential of this trillion-scale base model. Conducting post-training on such a huge model, particularly the "training" involved in large-scale reinforcement learning, stands as one of the most technically challenging tasks the Ling Team has encountered since its establishment. On the other hand, it has also been a process that continuously reshapes our technical understanding and reinforces the belief that "scaling is all you need" In the early stages of large-scale reinforcement learning training, Ring-1T, the thinking version of the 1T foundational language model, has already demonstrated powerful natural language reasoning capabilities. In AIME 2025 (American Invitational Mathematics Examination), the model achieved a high score of 92.6 through pure natural language reasoning, further approaching GPT-5 with thinking (no tools)'s score of 94.6. Additionally, the model has shown strong competitiveness in the Harvard-MIT Mathematics Tournament (HMMT) 2025, competition-level code generation tasks such as LiveCodeBench v6 and CodeForces, as well as the abstraction and reasoning benchmark ARC-AGI-1 task. To further explore the reasoning limits of the early version of Ring-1T, we integrated it into the multi-agent framework AWorld (https://github.com/inclusionAI/AWorld ) and conducted pure natural language reasoning tests on the IMO 2025 (International Mathematical Olympiad, 6 problems in total). Previously, we tested Ring-flash-2.0, using the same method: under the setting of three allowed reasoning attempts, Ring-flash-2.0 only managed to solve Problem 3 on the third try. In contrast, during this test, Ring-1T solved Problem 3 in just one attempt, and also produced partially correct answers on Problems 1, 2, 4, and 5 in a single try. This demonstrates advanced reasoning capabilities essential for top-tier math competitionsβsuch as insight, constructive problem solving, counterexample generation, strategic thinking, and rigorous logical-chain reasoningβhighlighting the stronger reasoning potential of large-scale thinking models. To facilitate early community exploration of the reasoning capabilities of the trillion-parameter thinking model Ring-1T, we have decided to open-source its preview version, Ring-1T-preview, ahead of schedule. This model retains the efficient MoE architecture of Ling 2.0, completed pre-training on 20T tokens of corpora, and underwent RLVR training tailored for reasoning abilities within our self-developed efficient reinforcement learning system ASystem (the AReaL framework of which has been open-sourced), leveraging the previously disclosed "icepop" method(https://ringtech.notion.site/icepop). Ring-1T remains under continuous training. While the preview version already demonstrates powerful natural language reasoning capabilities, it still exhibits issues such as language mixing, repetitive reasoning and identity misperception. We look forward to community exploration and feedback to collectively accelerate the iterative refinement of this trillion-parameter foundation. Here is a code snippet to show you how to use the chat model with `transformers`: If you're in mainland China, we strongly recommend you to use our model from π€ ModelScope . This code repository is licensed under the MIT License. 
Tip: To facilitate academic research and downstream applications with customizable model naming, we did not conduct specific identity-recognition training.
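As a reference for the `transformers` usage mentioned above, here is a minimal, illustrative chat example. The repo id `inclusionAI/Ring-1T-preview` and `trust_remote_code=True` are assumptions based on this card; a trillion-parameter thinking model normally needs multi-GPU or quantized deployment, so treat this as a sketch rather than a deployment recipe.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "inclusionAI/Ring-1T-preview"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",       # sharding across available GPUs; requires `accelerate`
    trust_remote_code=True,
)

messages = [
    {"role": "user", "content": "Prove that the product of two consecutive integers is always even."},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([prompt], return_tensors="pt").to(model.device)

# Thinking models produce long reasoning traces, so allow a generous output budget.
outputs = model.generate(**inputs, max_new_tokens=4096)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```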
GUI-G2-7B
This repository contains the GUI-G2-7B model from the paper GUI-G²: Gaussian Reward Modeling for GUI Grounding. We provide more inference details in the GitHub quick start (a minimal sketch is also given at the end of this card).

[Paper](https://huggingface.co/papers/2507.15846) [arXiv](https://arxiv.org/abs/2507.15846) [alphaXiv](https://www.alphaxiv.org/abs/2507.15846) [Project Page](https://zju-real.github.io/GUI-G2) [GitHub](https://github.com/zju-real/GUI-G2)

The model is based on `Qwen2.5-VL-7B-Instruct` and is fine-tuned using our proposed Gaussian dense reward framework.
- π‘ Gaussian Point & Coverage Rewards: encourage accurate, spatially aligned clicks.
- π Adaptive Variance Mechanism: adjusts reward granularity based on element scale.
- π Dense Learning Signals: smooth gradients outperform binary RL rewards in early-stage learning.
- π State-of-the-art Performance on ScreenSpot, ScreenSpot-v2, and ScreenSpot-Pro datasets.

| Model | Mobile Text | Mobile Icon | Desktop Text | Desktop Icon | Web Text | Web Icon | Avg. |
| -------------------- | --------------- | --------------- | ---------------- | ---------------- | ------------ | ------------ | -------- |
| GPT-4o | 26.6 | 24.2 | 24.2 | 19.3 | 12.8 | 11.8 | 20.1 |
| Qwen2.5-VL-3B | 93.4 | 73.5 | 88.1 | 58.6 | 88.0 | 71.4 | 80.9 |
| Qwen2.5-VL-7B | 97.6 | 87.2 | 90.2 | 74.2 | 93.2 | 81.3 | 88.8 |
| SeeClick-9.6B | 78.4 | 50.7 | 70.1 | 29.3 | 55.2 | 32.5 | 55.1 |
| UGround-7B | 75.1 | 84.5 | 85.1 | 61.4 | 84.6 | 71.9 | 76.3 |
| OS-Atlas-7B | 95.2 | 75.8 | 90.7 | 63.6 | 90.6 | 77.3 | 84.1 |
| UI-TARS-2B | 95.2 | 79.1 | 90.7 | 68.6 | 87.2 | 78.3 | 84.7 |
| UI-TARS-7B | 96.9 | 89.1 | 95.4 | 85.0 | 93.6 | 85.2 | 91.6 |
| UI-TARS-72B | 94.8 | 86.3 | 91.2 | 87.9 | 91.5 | 87.7 | 90.3 |
| JEDI-7B | 96.9 | 87.2 | 95.9 | 87.9 | 94.4 | 84.2 | 91.7 |
| GUI-Actor-7B | 97.6 | 88.2 | 96.9 | 85.7 | 93.2 | 86.7 | 92.1 |
| UI-R1-3B | 96.2 | 84.3 | 92.3 | 63.6 | 89.2 | 75.4 | 85.4 |
| UI-R1-E-3B | 98.2 | 83.9 | 94.8 | 75.0 | 93.2 | 83.7 | 89.5 |
| SE-GUI-7B | - | - | - | - | - | - | 90.3 |
| LPO | 97.9 | 82.9 | 95.9 | 86.4 | 95.6 | 84.2 | 90.5 |
| GUI-G²-7B (Ours) | 98.3 | 91.9 | 95.4 | 89.3 | 94.0 | 87.7 | 93.3 |
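Because GUI-G2-7B is fine-tuned from `Qwen2.5-VL-7B-Instruct`, the standard Qwen2.5-VL chat pipeline in `transformers` should apply. The sketch below is illustrative only: the repo id, the screenshot path, and the grounding prompt are assumptions; see the project's GitHub quick start for the exact prompt and output format.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "inclusionAI/GUI-G2-7B"  # assumed repo id

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("screenshot.png")  # a GUI screenshot to ground against
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Locate the 'Submit' button and output its click coordinates."},
    ],
}]

prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=64)
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```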
Ling-lite
Ming-lite-omni-1.5-FP8
UI-Venus-Ground-72B
Ling-mini-base-2.0-20T
π€ Hugging Face    |   π€ ModelScope Today, we are excited to announce the open-sourcing of Ling 2.0 β a family of MoE-based large language models that combine SOTA performance with high efficiency. The first released version, Ling-mini-2.0, is compact yet powerful. It has 16B total parameters, but only 1.4B are activated per input token (non-embedding 789M). Trained on more than 20T tokens of high-quality data and enhanced through multi-stage supervised fine-tuning and reinforcement learning, Ling-mini-2.0 achieves remarkable improvements in complex reasoning and instruction following. With just 1.4B activated parameters, it still reaches the top-tier level of sub-10B dense LLMs and even matches or surpasses much larger MoE models. We evaluated Ling-mini-2.0 on challenging general reasoning tasks in coding (LiveCodeBench, CodeForces) and mathematics (AIME 2025, HMMT 2025), as well as knowledge-intensive reasoning tasks across multiple domains (MMLU-Pro, Humanity's Last Exam). Compared with sub-10B dense models (e.g., Qwen3-4B-instruct-2507, Qwen3-8B-nothinking) and larger-scale MoE models (Ernie-4.5-21B-A3B-PT, GPT-OSS-20B/low), Ling-mini-2.0 demonstrated outstanding overall reasoning capabilities. Guided by Ling Scaling Laws, Ling 2.0 adopts a 1/32 activation ratio MoE architecture, with empirically optimized design choices in expert granularity, shared expert ratio, attention ratio, aux-loss free + sigmoid routing strategy, MTP loss, QK-Norm, half RoPE, and more. This enables small-activation MoE models to achieve over 7Γ equivalent dense performance. In other words, Ling-mini-2.0 with only 1.4B activated parameters (non-embedding 789M) can deliver performance equivalent to a 7β8B dense model. The highly sparse small-activation MoE architecture also delivers significant training and inference efficiency. In simple QA scenarios (within 2000 tokens), Ling-mini-2.0 generates at 300+ token/s (on H20 deployment) β more than 2Γ faster than an 8B dense model. Ling-mini-2.0 is able to handle 128K context length with YaRN, as sequence length increases, the relative speedup can reach over 7Γ. Ling 2.0 employs FP8 mixed-precision training throughout. Compared with BF16, experiments with over 1T training tokens show nearly identical loss curves and downstream benchmark performance. To support the community in efficient continued pretraining and fine-tuning under limited compute, we are also open-sourcing our FP8 training solution. Based on tile/blockwise FP8 scaling, it further introduces FP8 optimizer, FP8 on-demand transpose weight, and FP8 padding routing map for extreme memory optimization. On 8/16/32 80G GPUs, compared with LLaMA 3.1 8B and Qwen3 8B, Ling-mini-2.0 achieved 30β60% throughput gains with MTP enabled, and 90β120% throughput gains with MTP disabled. We believe Ling-mini-2.0 is an ideal starting point for MoE research. For the first time at this scale, it integrates 1/32 sparsity, MTP layers, and FP8 training β achieving both strong effectiveness and efficient training/inference performance, making it a prime candidate for the small-size LLM segment. To further foster community research, in addition to releasing the post-trained version, we are also open-sourcing five pretraining checkpoints: the pre-finetuning Ling-mini-2.0-base, along with four base models trained on 5T, 10T, 15T, and 20T tokens, enabling deeper research and broader applications. You can download the following table to see the various stage of Ling-mini-2.0 models(1.43B activated of 16.26B total params). 
If you are located in mainland China, we also provide the model on ModelScope.cn to speed up the download process. | Model | Context Length | Download | |:----------------------:| :----------------: |:--------------------------------------------------------------------------------------------------------------------------------------------------------------------:| | Ling-mini-base-2.0 | 32K -> 128K (YaRN) | π€ HuggingFace π€ ModelScope | | Ling-mini-base-2.0-5T | 4K | π€ HuggingFace π€ ModelScope | | Ling-mini-base-2.0-10T | 4K | π€ HuggingFace π€ ModelScope | | Ling-mini-base-2.0-15T | 4K | π€ HuggingFace π€ ModelScope | | Ling-mini-base-2.0-20T | 4K | π€ HuggingFace π€ ModelScope | | Ling-mini-2.0 | 32K -> 128K (YaRN) | π€ HuggingFace π€ ModelScope | Note: If you are interested in previous version, please visit the past model collections in Huggingface or ModelScope. Models with safetensors format can be downloaded from HuggingFace or ModelScope. If you want to train your model and eval it, you can convert from dcp produced by training. Currently, BF16 and FP8 formats are supported, you can use convert parameter to handle it: - `--force-bf16` for BF16 format. - `--force-fp8` for FP8 format. Here is a code snippet to show you how to use the chat model with `transformers`: If you're in mainland China, we strongly recommend you to use our model from π€ ModelScope . vLLM supports offline batched inference or launching an OpenAI-Compatible API Service for online inference. Since the Pull Request (PR) has not been submitted to the vLLM community at this stage, please prepare the environment by following the steps below: To handle long context in vLLM using YaRN, we need to follow these two steps: 1. Add a `ropescaling` field to the model's `config.json` file, for example: 2. Use an additional parameter `--max-model-len` to specify the desired maximum context length when starting the vLLM service. For detailed guidance, please refer to the vLLM `instructions`. We will later submit our model to SGLang official release, now we can prepare the environment following steps: Then you should apply patch to sglang installation:
ASearcher-Local-7B
.no-border-table table, .no-border-table th, .no-border-table td { border: none !important; } | | | |-|-| | [](https://github.com/inclusionAI/ASearcher) | [](https://arxiv.org/abs/2508.07976) | ASearcher is an open-source framework designed for large-scale online reinforcement learning (RL) training of search agents. Our mission is to advance Search Intelligence to expert-level performance. We are fully committed to open-source by releasing model weights, detailed training methodologies, and data construction pipelines. Additionally, we provide comprehensive guidance on building and training customized agents based on AReaL. ASearcher empowers developers to build their own high-performance search agents easily and cost-effectively. We have released multiple models trained with different settings and based on foundation models of varying sizes. These models have achieved outstanding performance on Single-Hop / Multi-Hop QA and more challenging tool-augmented benchmarks like GAIA, Xbench. | Model Name | Base Model | Training Setting | Download Link | |------------|----------------|------------------|----------------| | ASearcher-Local-7B | Qwen2.5-7B | Local knowledge base with RAG | π€Huggingface | | ASearcher-Web-7B | Qwen2.5-7B | Web-based search and browsing | π€Huggingface | | ASearcher-Local-14B | Qwen2.5-14B | Local knowledge base with RAG | π€Huggingface | | ASearcher-Web-14B | Qwen2.5-14B | Web-based search and browsing | π€Huggingface | | ASearcher-Web-QwQ-32B | QwQ-32B | Web-based search and browsing | π€Huggingface | Performance Evaluation on challenging benchmarks (ASearcher-Web-QwQ) Dataset Download We also release our full training data and test data, you can easily get them and reproduce our result. Quickstart If you want to learn more details, please refer to our GitHub repository: ASearcher
LLaDA-MoE-7B-A1B-Instruct-TD
Ring-lite-distill-preview
Ling-lite-1.5-2507
Model Overview We are excited to introduce Ling-lite-1.5-2507, the latest version of our highly capable Ling-lite-1.5 model. Ling-lite-1.5-2507 boasts 16.8 billion parameters with 2.75 billion activated parameters, which demonstrates significant improvements over previous versions across professional knowledge assessments, logical reasoning evaluations, and coding capability benchmarks. Key Features As the flagship model of our Lite series, Ling-lite-1.5-2507 features two major enhancements: Smarter and More Efficient Reasoning For straightforward inquiries, the model generates concise and direct responses. When confronting complex challenges, it exhibits advanced problem-solving prowess by systematically decomposing problems, integrating a sophisticated reflective mechanism, and producing elaborate reasoning traces to achieve accurate solutions through an inherently efficient and integrated reasoning process. Enhanced Human-Aligned Subjectivity The model delivers well-structured and coherent responses, demonstrating profound cognitive depth in subjective and open-ended tasks. This leads to a strong alignment with human preferences concerning response organization and conceptual richness. Here is a code snippet to show you how to use the chat model with `transformers`: License This code repository is licensed under the MIT License. If you find our work helpful, feel free to give us a cite.
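The chat snippet referenced above is not included in this card, so here is a minimal, illustrative example. The repo id `inclusionAI/Ling-lite-1.5-2507` and `trust_remote_code=True` are assumptions based on the model name.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "inclusionAI/Ling-lite-1.5-2507"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto", trust_remote_code=True
)

messages = [
    {"role": "user", "content": "Explain the difference between dense and MoE language models in three sentences."},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([prompt], return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=1024)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```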
Ling-plus
Ring-lite
Ring Flash Linear 2.0 GPTQ Int4
To enable deployment of Ring-Linear-2.0 on memory-constrained devices, we release quantized weights using the GPTQ INT4 format. Additionally, we evaluate the online FP8 quantization performance of `Ring-Linear-2.0` models, which closely approaches that of BF16 precision. | Model | Maximum Supported Length | Download | |:----------------------:| :----------------: |:--------------------------------------------------------------------------------------------------------------------------------------------------------------------:| | Ring-flash-linear-2.0-GPTQ-int4 | 128k | π€ HuggingFace π€ ModelScope | | Ring-mini-linear-2.0-GPTQ-int4 | 512k | π€ HuggingFace π€ ModelScope | Since the Pull Request (PR) has not been submitted to the vLLM community at this stage, please prepare the environment by following the steps below. First, create a Conda environment with Python 3.10 and CUDA 12.8: Finally, install compatible versions of transformers after vLLM is installed: We evaluate the INT4 and FP8 quantized models using several datasets. The FP8 quantization is applied via the quantization="fp8" argument in SGLang or vLLM. Ring-mini-linear-2.0 | Dataset | BF16 | FP8 | GPTQ-Int4 | | :----------------: |:--------:|:-------:|:-------------:| | AIME25 | 73.65 | 72.40 | 66.56 | | AIME24 | 79.95 | 79.53 | 74.95 | | LiveCodeBench| 59.53 | 58.42 | 56.29 | | GPQA | 65.69 | 66.79 | 62.53 | Ring-flash-linear-2.0 | Dataset | BF16 | FP8 | GPTQ-Int4 | | :----------------: |:--------:|:-------:| :-----------------------:| | AIME25 | 85.10 | 84.22 | 82.88 | | LiveCodeBench| 69.82 | 69.44 | 66.14 | | GPQA | 72.85 | 72.95 | 71.72 | This code repository is licensed under the MIT License.
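A minimal sketch of the environment step described above. The environment name is arbitrary, and the specific vLLM fork or branch to install comes from the project's own instructions and is not reproduced here.

```bash
# Create and activate a Python 3.10 environment (a CUDA 12.8 toolchain is assumed on the host).
conda create -n ring-linear-int4 python=3.10 -y
conda activate ring-linear-int4
```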
Ring-1T-preview-FP8
LLaDA2.0-flash-preview
LLaDA2.0-flash-preview is a diffusion language model featuring a 100BA6B Mixture-of-Experts (MoE) architecture. As an enhanced, instruction-tuned iteration of the LLaDA2.0 series, it is optimized for practical applications. | Benchmark | Ling-flash-2.0 | LLaDA2.0-mini-preview | LLaDA2.0-flash-preview | | :------------------------------ | :-------------: | :-------------------------: | :---------------------: | | Average | 79.93 | 66.89 | 77.03 | | Knowledge | | | | | MMLU | 87.98 | 72.49 | 83.15 | | MMLU-PRO | 76.84 | 49.22 | 66.16 | | CMMLU | 86.59 | 67.53 | 79.64 | | C-EVAL | 88.03 | 66.54 | 79.28 | | Reasoning | | | | | squad2.0 | 81.32 | 85.61 | 90.61 | | drop | 88.32 | 79.49 | 88.17 | | korbench | 68.96 | 37.26 | 53.28 | | Coding | | | | | CruxEval-O | 82.75 | 61.88 | 74.50 | | mbpp | 85.01 | 77.75 | 86.65 | | MultiPL-E | 65.76 | 62.43 | 72.38 | | humaneval | 85.98 | 80.49 | 88.41 | | Bigcodebench-Full | 40.70 | 30.44 | 40.44 | | Math | | | | | GSM8K | 95.45 | 89.01 | 95.75 | | math | 96.1 | 73.50 | 83.52 | | Agent & Alignment | | | | | BFCLLive | 67.57 | 74.11 | 74.86 | | IFEval-strict -prompt | 81.52 | 62.50 | 75.60 | π Performance Highlights + Leading MoE Architecture: The open-source Mixture-of-Experts (MoE) diffusion large language model continually trained on the Ling2.0 series with approximately 20 trillion tokens. + Efficient Inference: With 100 billion total parameters, only 6.1 billion are activated during inference. LLaDA2.0-flash-preview significantly reduces computational costs while outperforming open-source dense models of similar scale. + Impressive Performance on Code & Complex Reasoning: Excels in tasks such as code generation and advanced mathematical reasoning, demonstrating strong reasoning capabilities. + Tool Use: Supports tool calling and achieves excellent performance in complex agent-based tasks. + Open & Extensible: Fully open-source with commitment to transparency. We plan to release a leading inference framework in the future and continue investing in cutting-edge areas like diffusion LLMs (dLLM) to drive disruptive innovation. + Supercharged Reasoning with LLaDA 2.0: LLaDA 2.0 series will be fine-tuned with Reinforcement Learning, unlocking a new level of sophisticated reasoning and problem-solving abilities. + Tools for Innovators: The model was finetuned on the VeOmni framework using Fully Sharded Data Parallel (FSDP2). We will release a detailed tutorial and our complete post-training framework. Whether you want to master the current model or build your own customized versions, you'll have the tools you need. Stay tuned π¦ Model Variants | Model ID | Description | Hugging Face Link | | --- | --- | --- | | `inclusionAI/LLaDA2.0-mini-preview` | Instruction-tuned model, ready for downstream applications. | π€ Model Card | | `inclusionAI/LLaDA2.0-flash-preview` | Instruction-tuned model, ready for downstream applications. | π€ Model Card | π Model Overview LLaDA2.0-flash-preview has the following specifications: + Type: Mixture-of-Experts (MoE) Diffusion Language Model + Total Parameters (Non-Embedding): 100B + Number of Layers: 32 + Attention Heads: 32 + Context Length: 4,096 tokens + Position Embedding: Rotary (RoPE) + Vocabulary Size: 157,184 π€ Hugging Face Transformers Make sure you have `transformers` and its dependencies installed: Best Practices To achieve optimal performance, we recommend the following settings: 1. Sampling Parameters: We suggest using `Temperature=0.0`, `blocklength=32`, and `steps=32`. 
Using a higher temperature value may occasionally result in language mixing and a slight decrease in model performance. 2. Adequate Output Length: We recommend an output length of 2048 tokens for most queries. For benchmarks whose problems require longer outputs, such as math and programming competitions, we suggest setting the maximum output length to 4096 tokens. π License This project is licensed under the terms of the Apache License 2.0. π€ Contact & Collaboration For questions, collaborations, or feedback, please reach out via Hugging Face or open an issue in the repository. π Join us in advancing open, efficient, and intelligent language models!
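As a reference for the Hugging Face Transformers setup mentioned above, here is a minimal, hedged loading sketch. The repo id matches this card's model name; the diffusion-specific generation utilities, including how the recommended temperature, block length, and step count are passed, are defined by the repository's remote code and are therefore only noted in comments rather than invented here.

```bash
pip install -U transformers accelerate
```

```python
from transformers import AutoModel, AutoTokenizer

model_id = "inclusionAI/LLaDA2.0-flash-preview"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",       # requires `accelerate`
    trust_remote_code=True,  # the diffusion sampling logic ships with the repo's remote code
)

# Recommended settings from this card: temperature 0.0, block length 32, 32 steps,
# and an output length of 2048 tokens (4096 for math/coding benchmarks).
# The exact generation entry point is provided by the remote code, so it is not shown here.
```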
Ring-lite-2507
GUI-G2-3B
This repository contains the GUI-G2-3B model from the paper GUI-GΒ²: Gaussian Reward Modeling for GUI Grounding. We provided more inference details on the github quick start. We will update GUI-G2-3B results on GUI Grounding benchmark. [](https://huggingface.co/papers/2507.15846) [](https://arxiv.org/abs/2507.15846) [](https://www.alphaxiv.org/abs/2507.15846) [](https://zju-real.github.io/GUI-G2) [](https://github.com/zju-real/GUI-G2) The model is based on `Qwen2.5-VL-3B-Instruct` and is fine-tuned using our proposed Gaussian dense reward framework framework. - π‘Gaussian Point & Coverage Rewards: Encourage accurate, spatially-aligned clicks. π Adaptive Variance Mechanism: Adjusts reward granularity based on element scale. π Dense Learning Signals: Smooth gradients outperform binary RL rewards in early-stage learning. π State-of-the-art Performance on ScreenSpot, ScreenSpot-v2, and ScreenSpot-Pro datasets. | Model | Mobile Text | Mobile Icon | Desktop Text | Desktop Icon | Web Text | Web Icon | Avg. | | -------------------- | --------------- | --------------- | ---------------- | ---------------- | ------------ | ------------ | -------- | | GPT-4o | 26.6 | 24.2 | 24.2 | 19.3 | 12.8 | 11.8 | 20.1 | | Qwen2.5-VL-3B | 93.4 | 73.5 | 88.1 | 58.6 | 88.0 | 71.4 | 80.9 | | Qwen2.5-VL-7B | 97.6 | 87.2 | 90.2 | 74.2 | 93.2 | 81.3 | 88.8 | | SeeClick-9.6B | 78.4 | 50.7 | 70.1 | 29.3 | 55.2 | 32.5 | 55.1 | | UGround-7B | 75.1 | 84.5 | 85.1 | 61.4 | 84.6 | 71.9 | 76.3 | | OS-Atlas-7B | 95.2 | 75.8 | 90.7 | 63.6 | 90.6 | 77.3 | 84.1 | | UI-TARS-2B | 95.2 | 79.1 | 90.7 | 68.6 | 87.2 | 78.3 | 84.7 | | UI-TARS-7B | 96.9 | 89.1 | 95.4 | 85.0 | 93.6 | 85.2 | 91.6 | | UI-TARS-72B | 94.8 | 86.3 | 91.2 | 87.9 | 91.5 | 87.7 | 90.3 | | JEDI-7B | 96.9 | 87.2 | 95.9 | 87.9 | 94.4 | 84.2 | 91.7 | | GUI-Actor-7B | 97.6 | 88.2 | 96.9 | 85.7 | 93.2 | 86.7 | 92.1 | | UI-R1-3B | 96.2 | 84.3 | 92.3 | 63.6 | 89.2 | 75.4 | 85.4 | | UI-R1-E-3B | 98.2 | 83.9 | 94.8 | 75.0 | 93.2 | 83.7 | 89.5 | | SE-GUI-7B | - | - | - | - | - | - | 90.3 | | LPO | 97.9 | 82.9 | 95.9 | 86.4 | 95.6 | 84.2 | 90.5 | | GUI-GΒ²-7B (Ours) | 98.3 | 91.9 | 95.4 | 89.3 | 94.0 | 87.7 | 93.3 |
Ling-lite-base-1.5
Ling-Coder-lite
Ling-mini-base-2.0-5T
π€ Hugging Face    |   π€ ModelScope Today, we are excited to announce the open-sourcing of Ling 2.0 β a family of MoE-based large language models that combine SOTA performance with high efficiency. The first released version, Ling-mini-2.0, is compact yet powerful. It has 16B total parameters, but only 1.4B are activated per input token (non-embedding 789M). Trained on more than 20T tokens of high-quality data and enhanced through multi-stage supervised fine-tuning and reinforcement learning, Ling-mini-2.0 achieves remarkable improvements in complex reasoning and instruction following. With just 1.4B activated parameters, it still reaches the top-tier level of sub-10B dense LLMs and even matches or surpasses much larger MoE models. We evaluated Ling-mini-2.0 on challenging general reasoning tasks in coding (LiveCodeBench, CodeForces) and mathematics (AIME 2025, HMMT 2025), as well as knowledge-intensive reasoning tasks across multiple domains (MMLU-Pro, Humanity's Last Exam). Compared with sub-10B dense models (e.g., Qwen3-4B-instruct-2507, Qwen3-8B-nothinking) and larger-scale MoE models (Ernie-4.5-21B-A3B-PT, GPT-OSS-20B/low), Ling-mini-2.0 demonstrated outstanding overall reasoning capabilities. Guided by Ling Scaling Laws, Ling 2.0 adopts a 1/32 activation ratio MoE architecture, with empirically optimized design choices in expert granularity, shared expert ratio, attention ratio, aux-loss free + sigmoid routing strategy, MTP loss, QK-Norm, half RoPE, and more. This enables small-activation MoE models to achieve over 7Γ equivalent dense performance. In other words, Ling-mini-2.0 with only 1.4B activated parameters (non-embedding 789M) can deliver performance equivalent to a 7β8B dense model. The highly sparse small-activation MoE architecture also delivers significant training and inference efficiency. In simple QA scenarios (within 2000 tokens), Ling-mini-2.0 generates at 300+ token/s (on H20 deployment) β more than 2Γ faster than an 8B dense model. Ling-mini-2.0 is able to handle 128K context length with YaRN, as sequence length increases, the relative speedup can reach over 7Γ. Ling 2.0 employs FP8 mixed-precision training throughout. Compared with BF16, experiments with over 1T training tokens show nearly identical loss curves and downstream benchmark performance. To support the community in efficient continued pretraining and fine-tuning under limited compute, we are also open-sourcing our FP8 training solution. Based on tile/blockwise FP8 scaling, it further introduces FP8 optimizer, FP8 on-demand transpose weight, and FP8 padding routing map for extreme memory optimization. On 8/16/32 80G GPUs, compared with LLaMA 3.1 8B and Qwen3 8B, Ling-mini-2.0 achieved 30β60% throughput gains with MTP enabled, and 90β120% throughput gains with MTP disabled. We believe Ling-mini-2.0 is an ideal starting point for MoE research. For the first time at this scale, it integrates 1/32 sparsity, MTP layers, and FP8 training β achieving both strong effectiveness and efficient training/inference performance, making it a prime candidate for the small-size LLM segment. To further foster community research, in addition to releasing the post-trained version, we are also open-sourcing five pretraining checkpoints: the pre-finetuning Ling-mini-2.0-base, along with four base models trained on 5T, 10T, 15T, and 20T tokens, enabling deeper research and broader applications. You can download the following table to see the various stage of Ling-mini-2.0 models(1.43B activated of 16.26B total params). 
If you are located in mainland China, we also provide the model on ModelScope.cn to speed up the download process. | Model | Context Length | Download | |:----------------------:| :----------------: |:--------------------------------------------------------------------------------------------------------------------------------------------------------------------:| | Ling-mini-base-2.0 | 32K -> 128K (YaRN) | π€ HuggingFace π€ ModelScope | | Ling-mini-base-2.0-5T | 4K | π€ HuggingFace π€ ModelScope | | Ling-mini-base-2.0-10T | 4K | π€ HuggingFace π€ ModelScope | | Ling-mini-base-2.0-15T | 4K | π€ HuggingFace π€ ModelScope | | Ling-mini-base-2.0-20T | 4K | π€ HuggingFace π€ ModelScope | | Ling-mini-2.0 | 32K -> 128K (YaRN) | π€ HuggingFace π€ ModelScope | Note: If you are interested in previous version, please visit the past model collections in Huggingface or ModelScope. Models with safetensors format can be downloaded from HuggingFace or ModelScope. If you want to train your model and eval it, you can convert from dcp produced by training. Currently, BF16 and FP8 formats are supported, you can use convert parameter to handle it: - `--force-bf16` for BF16 format. - `--force-fp8` for FP8 format. Here is a code snippet to show you how to use the chat model with `transformers`: If you're in mainland China, we strongly recommend you to use our model from π€ ModelScope . vLLM supports offline batched inference or launching an OpenAI-Compatible API Service for online inference. Since the Pull Request (PR) has not been submitted to the vLLM community at this stage, please prepare the environment by following the steps below: To handle long context in vLLM using YaRN, we need to follow these two steps: 1. Add a `ropescaling` field to the model's `config.json` file, for example: 2. Use an additional parameter `--max-model-len` to specify the desired maximum context length when starting the vLLM service. For detailed guidance, please refer to the vLLM `instructions`. We will later submit our model to SGLang official release, now we can prepare the environment following steps: Then you should apply patch to sglang installation:
AReaL-boba-2-8B
Ling-plus-base
Ling-mini-base-2.0-15T
π€ Hugging Face    |   π€ ModelScope Today, we are excited to announce the open-sourcing of Ling 2.0 β a family of MoE-based large language models that combine SOTA performance with high efficiency. The first released version, Ling-mini-2.0, is compact yet powerful. It has 16B total parameters, but only 1.4B are activated per input token (non-embedding 789M). Trained on more than 20T tokens of high-quality data and enhanced through multi-stage supervised fine-tuning and reinforcement learning, Ling-mini-2.0 achieves remarkable improvements in complex reasoning and instruction following. With just 1.4B activated parameters, it still reaches the top-tier level of sub-10B dense LLMs and even matches or surpasses much larger MoE models. We evaluated Ling-mini-2.0 on challenging general reasoning tasks in coding (LiveCodeBench, CodeForces) and mathematics (AIME 2025, HMMT 2025), as well as knowledge-intensive reasoning tasks across multiple domains (MMLU-Pro, Humanity's Last Exam). Compared with sub-10B dense models (e.g., Qwen3-4B-instruct-2507, Qwen3-8B-nothinking) and larger-scale MoE models (Ernie-4.5-21B-A3B-PT, GPT-OSS-20B/low), Ling-mini-2.0 demonstrated outstanding overall reasoning capabilities. Guided by Ling Scaling Laws, Ling 2.0 adopts a 1/32 activation ratio MoE architecture, with empirically optimized design choices in expert granularity, shared expert ratio, attention ratio, aux-loss free + sigmoid routing strategy, MTP loss, QK-Norm, half RoPE, and more. This enables small-activation MoE models to achieve over 7Γ equivalent dense performance. In other words, Ling-mini-2.0 with only 1.4B activated parameters (non-embedding 789M) can deliver performance equivalent to a 7β8B dense model. The highly sparse small-activation MoE architecture also delivers significant training and inference efficiency. In simple QA scenarios (within 2000 tokens), Ling-mini-2.0 generates at 300+ token/s (on H20 deployment) β more than 2Γ faster than an 8B dense model. Ling-mini-2.0 is able to handle 128K context length with YaRN, as sequence length increases, the relative speedup can reach over 7Γ. Ling 2.0 employs FP8 mixed-precision training throughout. Compared with BF16, experiments with over 1T training tokens show nearly identical loss curves and downstream benchmark performance. To support the community in efficient continued pretraining and fine-tuning under limited compute, we are also open-sourcing our FP8 training solution. Based on tile/blockwise FP8 scaling, it further introduces FP8 optimizer, FP8 on-demand transpose weight, and FP8 padding routing map for extreme memory optimization. On 8/16/32 80G GPUs, compared with LLaMA 3.1 8B and Qwen3 8B, Ling-mini-2.0 achieved 30β60% throughput gains with MTP enabled, and 90β120% throughput gains with MTP disabled. We believe Ling-mini-2.0 is an ideal starting point for MoE research. For the first time at this scale, it integrates 1/32 sparsity, MTP layers, and FP8 training β achieving both strong effectiveness and efficient training/inference performance, making it a prime candidate for the small-size LLM segment. To further foster community research, in addition to releasing the post-trained version, we are also open-sourcing five pretraining checkpoints: the pre-finetuning Ling-mini-2.0-base, along with four base models trained on 5T, 10T, 15T, and 20T tokens, enabling deeper research and broader applications. You can download the following table to see the various stage of Ling-mini-2.0 models(1.43B activated of 16.26B total params). 
If you are located in mainland China, we also provide the model on ModelScope.cn to speed up the download process. | Model | Context Length | Download | |:----------------------:| :----------------: |:--------------------------------------------------------------------------------------------------------------------------------------------------------------------:| | Ling-mini-base-2.0 | 32K -> 128K (YaRN) | π€ HuggingFace π€ ModelScope | | Ling-mini-base-2.0-5T | 4K | π€ HuggingFace π€ ModelScope | | Ling-mini-base-2.0-10T | 4K | π€ HuggingFace π€ ModelScope | | Ling-mini-base-2.0-15T | 4K | π€ HuggingFace π€ ModelScope | | Ling-mini-base-2.0-20T | 4K | π€ HuggingFace π€ ModelScope | | Ling-mini-2.0 | 32K -> 128K (YaRN) | π€ HuggingFace π€ ModelScope | Note: If you are interested in previous version, please visit the past model collections in Huggingface or ModelScope. Models with safetensors format can be downloaded from HuggingFace or ModelScope. If you want to train your model and eval it, you can convert from dcp produced by training. Currently, BF16 and FP8 formats are supported, you can use convert parameter to handle it: - `--force-bf16` for BF16 format. - `--force-fp8` for FP8 format. Here is a code snippet to show you how to use the chat model with `transformers`: If you're in mainland China, we strongly recommend you to use our model from π€ ModelScope . vLLM supports offline batched inference or launching an OpenAI-Compatible API Service for online inference. Since the Pull Request (PR) has not been submitted to the vLLM community at this stage, please prepare the environment by following the steps below: To handle long context in vLLM using YaRN, we need to follow these two steps: 1. Add a `ropescaling` field to the model's `config.json` file, for example: 2. Use an additional parameter `--max-model-len` to specify the desired maximum context length when starting the vLLM service. For detailed guidance, please refer to the vLLM `instructions`. We will later submit our model to SGLang official release, now we can prepare the environment following steps: Then you should apply patch to sglang installation:
AReaL-boba-2-14B
Ling-mini-base-2.0-10T
π€ Hugging Face    |   π€ ModelScope Today, we are excited to announce the open-sourcing of Ling 2.0 β a family of MoE-based large language models that combine SOTA performance with high efficiency. The first released version, Ling-mini-2.0, is compact yet powerful. It has 16B total parameters, but only 1.4B are activated per input token (non-embedding 789M). Trained on more than 20T tokens of high-quality data and enhanced through multi-stage supervised fine-tuning and reinforcement learning, Ling-mini-2.0 achieves remarkable improvements in complex reasoning and instruction following. With just 1.4B activated parameters, it still reaches the top-tier level of sub-10B dense LLMs and even matches or surpasses much larger MoE models. We evaluated Ling-mini-2.0 on challenging general reasoning tasks in coding (LiveCodeBench, CodeForces) and mathematics (AIME 2025, HMMT 2025), as well as knowledge-intensive reasoning tasks across multiple domains (MMLU-Pro, Humanity's Last Exam). Compared with sub-10B dense models (e.g., Qwen3-4B-instruct-2507, Qwen3-8B-nothinking) and larger-scale MoE models (Ernie-4.5-21B-A3B-PT, GPT-OSS-20B/low), Ling-mini-2.0 demonstrated outstanding overall reasoning capabilities. Guided by Ling Scaling Laws, Ling 2.0 adopts a 1/32 activation ratio MoE architecture, with empirically optimized design choices in expert granularity, shared expert ratio, attention ratio, aux-loss free + sigmoid routing strategy, MTP loss, QK-Norm, half RoPE, and more. This enables small-activation MoE models to achieve over 7Γ equivalent dense performance. In other words, Ling-mini-2.0 with only 1.4B activated parameters (non-embedding 789M) can deliver performance equivalent to a 7β8B dense model. The highly sparse small-activation MoE architecture also delivers significant training and inference efficiency. In simple QA scenarios (within 2000 tokens), Ling-mini-2.0 generates at 300+ token/s (on H20 deployment) β more than 2Γ faster than an 8B dense model. Ling-mini-2.0 is able to handle 128K context length with YaRN, as sequence length increases, the relative speedup can reach over 7Γ. Ling 2.0 employs FP8 mixed-precision training throughout. Compared with BF16, experiments with over 1T training tokens show nearly identical loss curves and downstream benchmark performance. To support the community in efficient continued pretraining and fine-tuning under limited compute, we are also open-sourcing our FP8 training solution. Based on tile/blockwise FP8 scaling, it further introduces FP8 optimizer, FP8 on-demand transpose weight, and FP8 padding routing map for extreme memory optimization. On 8/16/32 80G GPUs, compared with LLaMA 3.1 8B and Qwen3 8B, Ling-mini-2.0 achieved 30β60% throughput gains with MTP enabled, and 90β120% throughput gains with MTP disabled. We believe Ling-mini-2.0 is an ideal starting point for MoE research. For the first time at this scale, it integrates 1/32 sparsity, MTP layers, and FP8 training β achieving both strong effectiveness and efficient training/inference performance, making it a prime candidate for the small-size LLM segment. To further foster community research, in addition to releasing the post-trained version, we are also open-sourcing five pretraining checkpoints: the pre-finetuning Ling-mini-2.0-base, along with four base models trained on 5T, 10T, 15T, and 20T tokens, enabling deeper research and broader applications. You can download the following table to see the various stage of Ling-mini-2.0 models(1.43B activated of 16.26B total params). 
If you are located in mainland China, we also provide the model on ModelScope.cn to speed up the download process. | Model | Context Length | Download | |:----------------------:| :----------------: |:--------------------------------------------------------------------------------------------------------------------------------------------------------------------:| | Ling-mini-base-2.0 | 32K -> 128K (YaRN) | π€ HuggingFace π€ ModelScope | | Ling-mini-base-2.0-5T | 4K | π€ HuggingFace π€ ModelScope | | Ling-mini-base-2.0-10T | 4K | π€ HuggingFace π€ ModelScope | | Ling-mini-base-2.0-15T | 4K | π€ HuggingFace π€ ModelScope | | Ling-mini-base-2.0-20T | 4K | π€ HuggingFace π€ ModelScope | | Ling-mini-2.0 | 32K -> 128K (YaRN) | π€ HuggingFace π€ ModelScope | Note: If you are interested in previous version, please visit the past model collections in Huggingface or ModelScope. Models with safetensors format can be downloaded from HuggingFace or ModelScope. If you want to train your model and eval it, you can convert from dcp produced by training. Currently, BF16 and FP8 formats are supported, you can use convert parameter to handle it: - `--force-bf16` for BF16 format. - `--force-fp8` for FP8 format. Here is a code snippet to show you how to use the chat model with `transformers`: If you're in mainland China, we strongly recommend you to use our model from π€ ModelScope . vLLM supports offline batched inference or launching an OpenAI-Compatible API Service for online inference. Since the Pull Request (PR) has not been submitted to the vLLM community at this stage, please prepare the environment by following the steps below: To handle long context in vLLM using YaRN, we need to follow these two steps: 1. Add a `ropescaling` field to the model's `config.json` file, for example: 2. Use an additional parameter `--max-model-len` to specify the desired maximum context length when starting the vLLM service. For detailed guidance, please refer to the vLLM `instructions`. We will later submit our model to SGLang official release, now we can prepare the environment following steps: Then you should apply patch to sglang installation:
Ring-lite-2506
Ring-lite-2506 is a lightweight, fully open-sourced MoE (Mixture of Experts) LLM designed for complex reasoning tasks. It is built upon the publicly available Ling-lite-1.5 model, which has 16.8B parameters with 2.75B activated parameters. We use a joint training pipeline combining knowledge distillation with reinforcement learning, achieving performance comparable to state-of-the-art (SOTA) small-size reasoning models on challenging benchmarks (AIME, LiveCodeBench, and GPQA-Diamond) while activating only one-third of their parameters.

| Model | #Total Params | #Activated Params | Context Length | Download |
| :----------------: | :---------------: | :-------------------: | :----------------: | :----------: |
| Ring-lite-2506 | 16.8B | 2.75B | 128K | π€ HuggingFace |

Evaluation: For a comprehensive evaluation of our reasoning models, we implemented automatic benchmarks covering math, code, and science.

π€ Hugging Face Transformers: Here is a code snippet to show you how to use the chat model with `transformers` (see the example below).

Dataset: The training data of Ring-lite-2506 is released at Ring-lite-sft-data and Ring-lite-rl-data.

License: This code repository is licensed under the MIT License.
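A minimal, illustrative `transformers` chat example for this card. The repo id `inclusionAI/Ring-lite-2506` and `trust_remote_code=True` are assumptions; adjust the generation budget to your use case.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "inclusionAI/Ring-lite-2506"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto", trust_remote_code=True
)

messages = [
    {"role": "user", "content": "How many positive integers less than 1000 are divisible by neither 2 nor 5?"},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([prompt], return_tensors="pt").to(model.device)

# Reasoning models emit a thinking trace before the final answer, so allow a larger budget.
outputs = model.generate(**inputs, max_new_tokens=2048)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```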
Ling-Coder-lite-base
ASearcher-Web-7B
Ling-lite-base
ASearcher-Web-14B
.no-border-table table, .no-border-table th, .no-border-table td { border: none !important; } | | | |-|-| | [](https://github.com/inclusionAI/ASearcher) | [](https://arxiv.org/abs/2508.07976) | ASearcher is an open-source framework designed for large-scale online reinforcement learning (RL) training of search agents. Our mission is to advance Search Intelligence to expert-level performance. We are fully committed to open-source by releasing model weights, detailed training methodologies, and data construction pipelines. Additionally, we provide comprehensive guidance on building and training customized agents based on AReaL. ASearcher empowers developers to build their own high-performance search agents easily and cost-effectively. We have released multiple models trained with different settings and based on foundation models of varying sizes. These models have achieved outstanding performance on Single-Hop / Multi-Hop QA and more challenging tool-augmented benchmarks like GAIA, Xbench. | Model Name | Base Model | Training Setting | Download Link | |------------|----------------|------------------|----------------| | ASearcher-Local-7B | Qwen2.5-7B | Local knowledge base with RAG | π€Huggingface | | ASearcher-Web-7B | Qwen2.5-7B | Web-based search and browsing | π€Huggingface | | ASearcher-Local-14B | Qwen2.5-14B | Local knowledge base with RAG | π€Huggingface | | ASearcher-Web-14B | Qwen2.5-14B | Web-based search and browsing | π€Huggingface | | ASearcher-Web-QwQ-32B | QwQ-32B | Web-based search and browsing | π€Huggingface | Performance Evaluation on challenging benchmarks (ASearcher-Web-QwQ) Dataset Download We also release our full training data and test data, you can easily get them and reproduce our result. Quickstart If you want to learn more details, please refer to our GitHub repository: ASearcher
Ling-lite-1.5-2506
Ming-omni-tta-0.5B
Qwen3-32B-AWorld
.no-border-table table, .no-border-table th, .no-border-table td { border: none !important; } | | | |-|-| | [](https://github.com/inclusionAI/AWorld/tree/main/train) | [](https://arxiv.org/abs/2508.20404) | Qwen3-32B-AWorld is a large language model fine-tuned from `Qwen3-32B`, specializing in agent capabilities and proficient tool usage. The model excels at complex agent-based tasks through precise integration with external tools, achieving a pass@1 score on the GAIA benchmark that surpasses GPT-4o and is comparable to DeepSeek-V3. This guide provides instructions for quickly deploying and running inference with `Qwen3-32B-AWorld` using vLLM. To deploy the model, use the following `vllm serve` command: Deployment Recommendation: We recommend deploying the model on 8 GPUs to enhance concurrency. The `tensor-parallel-size` argument should be set to the number of GPUs you are using (e.g., `8` in the command above). Tool Usage Flags: To enable the model's tool-calling capabilities, it is crucial to include the `--enable-auto-tool-choice` and `--tool-call-parser hermes` flags. These ensure that the model can correctly process tool calls and parse the results. When making an inference request, you must include the `tools` you want the model to use. The format should follow the official OpenAI API specification. Here is a complete Python example for making an API call to the deployed model using the requests library. This example demonstrates how to query the model with a specific tool. Remember to replace `{yourip}` and `{yourport}` in the `url` variable with the actual IP address and port where your vLLM server is running. The default port is typically `8000`.
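The deployment command and the request example described above are not included in this card, so here is a hedged sketch. The repo id `inclusionAI/Qwen3-32B-AWorld`, the port, and the weather tool are illustrative; the `--tensor-parallel-size`, `--enable-auto-tool-choice`, and `--tool-call-parser hermes` flags come from the card itself.

```bash
vllm serve inclusionAI/Qwen3-32B-AWorld \
  --tensor-parallel-size 8 \
  --enable-auto-tool-choice \
  --tool-call-parser hermes \
  --port 8000
```

```python
import requests

# Replace the host and port with the address where your vLLM server is running.
url = "http://127.0.0.1:8000/v1/chat/completions"

payload = {
    "model": "inclusionAI/Qwen3-32B-AWorld",  # the served model name defaults to the path passed to `vllm serve`
    "messages": [{"role": "user", "content": "What is the weather in Hangzhou today?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical tool for illustration
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
}

response = requests.post(url, json=payload, timeout=60)
# The assistant message will contain a tool call if the model decides to use the tool.
print(response.json()["choices"][0]["message"])
```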
UI-Venus-1.5-2B
ASearcher-Web-QwQ-V2
UI-Venus-1.5-30B-A3B
AReaL-boba-2-32B
UI-Venus-1.5-8B
ASearcher-Local-14B
ASearcher-Web-QwQ
Ring Mini Sparse 2.0 Exp
π€ Hugging Face    |   π€ ModelScope

We are excited to announce the official release of Ring-mini-sparse-2.0-exp. This model employs a Mixture of Block Attention (MoBA) architecture, delivering highly efficient inference without compromising performance. It inherits from Ling-mini-base-2.0 and is continually trained on an additional 100B tokens. The performance of the MoBA-based model is on par with standard-attention models of the same size (e.g., Ring-mini-v2). Furthermore, by applying YaRN-based 4× window extrapolation, we extend the context length to 128K tokens, delivering superior inference speed on tasks that involve long inputs and outputs.

Figure 1: The model architecture of Ring-mini-sparse-2.0-exp.

To comprehensively assess the reasoning capability of our model, we conducted evaluations on five challenging benchmarks spanning mathematics, coding, and science, comparing it with Ring-mini-2.0, Qwen3-8B-Thinking, and GPT-OSS-20B-Medium. The MoBA architecture demonstrates performance comparable to full softmax attention models.

Ring-mini-sparse-2.0-exp achieves high inference efficiency through highly sparse attention and a Mixture-of-Experts (MoE) architecture. Unlike the MoBA used in Kimi, our approach shares the same KV block selection across all heads within a GQA group, reducing the total number of KV tokens each query head retrieves from the KV cache during decoding. During 64K-context decoding, only 8,192 key-value (KV) tokens are activated per query, reducing KV cache retrieval overhead by 87.5% compared to full attention and delivering up to 3× inference speedup over Ring-mini-2.0. This design significantly lowers computational costs for high-concurrency scenarios involving reasoning-intensive models while maintaining competitive performance. Additionally, with YaRN extrapolation, the model extends context capacity to 128K tokens, achieving up to 2× relative speedup in long-input scenarios compared to Ring-mini-2.0 (full softmax attention).

Figure 4: Inference speedup ratios of Ring-mini-sparse-2.0-exp compared to Ring-mini-2.0.

π€ Hugging Face Transformers
Installation requirements: Here is a code snippet to show you how to use the chat model with `transformers` (see the example below).

We have submitted our PR to the official SGLang release and it will be merged later. For now, prepare the environment with the following steps: first install the community version of SGLang and the required packages. Our model is supported by SGLang now; you can launch the server with the following command.
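A minimal, illustrative `transformers` example for this card, followed by a hedged SGLang launch command. The repo id `inclusionAI/Ring-mini-sparse-2.0-exp` is an assumption, and `trust_remote_code=True` is assumed to be required for the MoBA implementation shipped with the checkpoint; the port is arbitrary.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "inclusionAI/Ring-mini-sparse-2.0-exp"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto", trust_remote_code=True
)

messages = [
    {"role": "user", "content": "Summarize the main idea behind block-sparse attention in two sentences."},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([prompt], return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=1024)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

```bash
# Hedged example of launching an SGLang server once the patched/community SGLang is installed.
python -m sglang.launch_server \
  --model-path inclusionAI/Ring-mini-sparse-2.0-exp \
  --trust-remote-code \
  --port 30000
```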
AReaL-boba-SFT-32B
Ring Flash Linear 2.0 128k
Rubicon-Preview
AReaL-boba-2-8B-Open
Ming-omni-tts-16.8B-A3B
AReaL-boba-RL-7B
AReaL-boba-2-14B-Open
AReaL-1.5B-Preview-Stage-2
AReaL-1.5B-Preview-Stage-3
GroveMoE-Inst
ZwZ-7B
Ling-Coder-lite-GPTQ-Int8
LLaDA2.0-mini-CAP
GroveMoE-Base
Ming Lite Omni
π Technical Report ο½π Project Page ο½π€ Hugging Face ο½ π€ ModelScope Ming-lite-omni, a light version of Ming-omni, which is derived from Ling-lite and features 2.8 billion activated parameter. Ming-lite-omni is a unified multimodal model capable of processing images, text, audio, and video, while demonstrating strong proficiency in both speech and image generation. Ming-lite-omni employs dedicated encoders to extract tokens from different modalities, which are then processed by Ling, an MoE architecture equipped with newly proposed modality-specific routers. This design enables a single model to efficiently process and fuse multimodal inputs within a unified framework, thereby facilitating diverse tasks without requiring separate models, task-specific fine-tuning, or structural redesign. Importantly, Ming-lite-omni extends beyond conventional multimodal models by supporting audio and image generation. This is achieved through the integration of an advanced audio decoder for natural-sounding speech and Ming-Lite-Uni for high-quality image generation, which also allow the model to engage in context-aware chatting, perform text-to-speech conversion, and conduct versatile image editing. Our experimental results showcase Ming-lite-omni offers a powerful solution for unified perception and generation across all modalities. Notably, Ming-lite-omni is the first open-source model we are aware of to match GPT-4o in modality support, and we release all code and model weights to encourage further research and development in the community. [2025.06.12] π₯ Our Technical Report is in public on arxiv. [2025.05.28] π₯ The official version of Ming-lite-omni is released, with better performance and image generation support. [2025.05.04] π₯ We release the test version of Ming-lite-omniοΌMing-lite-omni-Preview. - Unified Omni-Modality Perception: Ming-lite-omni, built on Ling, an MoE architecture LLM, resolves task conflicts and ensures coherent integration of tokens from different modalities through modality-specific routers. - Unified Perception and Generation: Ming-lite-omni achieves unified understanding and generation, enabling the model to interpret multimodal instructions and user intent during generation, which helps enhance generation quality and improves usability across multiple tasks. - Innovative Generation Capabilities: Ming-lite-omni can perceive all modalities and generate high-quality text, real-time speech, and vivid images simultaneously, delivering exceptional cross-modal performance across diverse tasks including image perception, audio-visual interaction, and image generation. Evaluation Ming-lite-omni delivers exceptional cross-modal performance, as validated across image perception, audio-visual interaction, and image generation tasks. Specifically, in the image perception task, Ming-lite-omni attained performance comparable to that of Qwen2.5-VL-7B by activating only 2.8B parameters. It delivers superior performance in end-to-end speech understanding and instruction following, surpassing Qwen2.5-Omni and Kimi-Audio. It also supports native-resolution image generation, editing, and style transfer, achieving a GenEval score of 0.64, outperforming mainstream models such as SDXL. In terms of FID, Ming-lite-omni reaches 4.85, setting a new SOTA across existing methods. 
| Benchmarks | Ming-lite-omni | Qwen2.5-VL-7B-Instruct | InternVL2.5-8B-MPO |
|:-----------------|:--------------:|:----------------------:|:------------------:|
| AI2D | 83.1 | 84.4 | 84.5 |
| HallusionBench | 55.0 | 55.8 | 51.7 |
| MMBench_TEST_V11 | 80.8 | 82.8 | 82.0 |
| MMMU | 56.3 | 56.6 | 54.8 |
| MMStar | 64.7 | 65.3 | 65.2 |
| MMVet | 71.3 | 71.6 | 68.1 |
| MathVista | 71.6 | 68.1 | 67.9 |
| OCRBench | 88.4 | 87.8 | 88.2 |
| Average | 71.4 | 71.5 | 70.3 |

| Object Recognition | Ming-lite-omni | Qwen2.5-VL-7B-Instruct |
|:-------------------|:--------------:|:----------------------:|
| Plants | 54.96 | 47.8 |
| Animals | 56.7 | 50.85 |
| Vehicles | 41.91 | 42.29 |
| Food & Ingredients | 62.28 | 54.09 |
| Dishes | 44.3 | 39.07 |
| General | 91.08 | 92.42 |
| Average | 58.54 | 54.43 |

| Benchmarks | Ming-lite-omni | Qwen2.5-VL-7B-Instruct |
|:----------------|:--------------:|:----------------------:|
| VideoMME | 67.0 | 67.3 |
| MVBench | 67.7 | 67.4 |
| Video-MMMU | 46.3 | 47.4 |
| LongVideoBench | 56.6 | 54.7 |
| Average | 59.4 | 59.2 |
Note: All models are evaluated based on 128 uniformly sampled frames.

| Model | Average | AlpacaEval | CommonEval | SD-QA | MMSU | OpenBookQA | IFEval | AdvBench |
|:-----------------|:-----:|:----:|:----:|:-----:|:-----:|:-----:|:-----:|:------:|
| Qwen2-Audio-chat | 3.545 | 3.69 | 3.40 | 35.35 | 35.43 | 49.01 | 22.57 | 98.85 |
| Baichuan-Audio | 3.695 | 4.00 | 3.39 | 49.64 | 48.80 | 63.30 | 41.32 | 86.73 |
| GLM-4-Voice | 3.77 | 4.06 | 3.48 | 43.31 | 40.11 | 52.97 | 24.91 | 88.08 |
| Kimi-Audio | 4.215 | 4.46 | 3.97 | 63.12 | 62.17 | 83.52 | 61.10 | 100.00 |
| Qwen2.5-Omni | 4.21 | 4.49 | 3.93 | 55.71 | 61.32 | 81.10 | 52.87 | 99.42 |
| Ming-lite-omni | 4.34 | 4.63 | 4.06 | 58.84 | 47.53 | 61.98 | 58.36 | 99.04 |

| Model | aishell1 | aishell2_android | aishell2_ios | cv15_zh | fleurs_zh | wenetspeech_meeting | wenetspeech_net | librispeech_test_clean | librispeech_test_other | multilingual_librispeech | cv15_en | fleurs_en | voxpopuli_v1.0_en |
|:--------------|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|
| Ming-lite-omni | 1.47 | 2.55 | 2.52 | 6.31 | 2.96 | 5.95 | 5.46 | 1.44 | 2.80 | 4.15 | 6.89 | 3.39 | 5.80 |
| Qwen2.5-Omni | 1.18 | 2.75 | 2.63 | 5.20 | 3.00 | 5.90 | 7.70 | 1.80 | 3.40 | 7.56 | 7.60 | 4.10 | 5.80 |
| Qwen2-Audio | 1.53 | 2.92 | 2.92 | 6.90 | 7.50 | 7.16 | 8.42 | 1.60 | 3.60 | 5.40 | 8.60 | 6.90 | 6.84 |
| Kimi-Audio | 0.60 | 2.64 | 2.56 | 7.21 | 2.69 | 6.28 | 5.37 | 1.28 | 2.42 | 5.88 | 10.31 | 4.44 | 7.97 |

| Model | InfoSeek_H-mean | InfoSeek_unseen_question | InfoSeek_unseen_entity |
|:---------------|:-----:|:-----:|:-----:|
| GPT-4o | 36.05 | - | - |
| PaLI-X | 22.06 | 23.5 | 20.8 |
| Qwen2.5-VL-32B | 19.35 | 20.55 | 18.28 |
| Ming-lite-omni | 27.7 | 30.4 | 25.4 |

| Model | Ming-lite-omni | Qwen2.5-VL-7B-Instruct |
|:------------------|:--------------:|:----------------------:|
| ChartQA_TEST | 85.1 | 87.3 |
| DocVQA_TEST | 93 | 95.7 |
| OCRBenchV2_en/zh | 53.3/52 | 56.3/57.2 |
| OmniDocBench↓ | 34/34.4 | 30.8/39.8 |
| TextVQA_VAL | 82.8 | 84.9 |

| Model | Ming-lite-omni | InternVL3 8B | Qwen2.5-VL-7B-Instruct |
|:-------------|:--------------:|:------------:|:----------------------:|
| ScreenSpot | 82.1 | 79.5 | 78.9 |
| ScreenSpot-V2 | 84.1 | 81.4 | - |
| AITZ (EM) | 66.6 | - | 57.6 |

| Model | single_object | two_object | counting | colors | position | color_attr | GenEval | DPGBench | FID↓ |
|:---------------|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|
| Ming-lite-omni | 0.9875 | 0.7727 | 0.6812 | 0.7872 | 0.31 | 0.29 | 0.64 | 81.72 | 4.85 |
| Metaquery-XL | - | - | - | - | - | - | 0.61 | 82.05 | 6.02 |
| SDv2.1 | 0.98 | 0.51 | 0.44 | 0.85 | 0.07 | 0.17 | 0.50 | 68.09 | 26.96 |
| Emu3-Gen | 0.98 | 0.71 | 0.34 | 0.81 | 0.17 | 0.21 | 0.54 | 80.60 | - |
| SDXL | 0.98 | 0.74 | 0.39 | 0.85 | 0.15 | 0.23 | 0.55 | 74.65 | 8.76 |
| Janus | 0.97 | 0.68 | 0.30 | 0.84 | 0.46 | 0.42 | 0.61 | 79.68 | 10.10 |
| JanusFlow | - | - | - | - | - | - | 0.63 | 80.09 | 9.51 |

Please refer to our technical report for more comprehensive evaluation results.
You can download the model from both HuggingFace and ModelScope.
| Model | Input modality | Output modality | Download |
|:---------------|:--------------------------:|:-----------------:|:------------------------------:|
| Ming-Lite-Omni | Image, text, video, audio | Image, text, audio | 🤗 HuggingFace · 🤖 ModelScope |
If you're in mainland China, we strongly recommend downloading our model from 🤖 ModelScope.
Additional demonstration cases are available on our project page.
Please download our model following Model Downloads; you can then refer to the following code to run the Ming-lite-omni model (a hedged orientation sketch is given at the end of this section).
Note: We tested the following examples on NVIDIA H800-80GB hardware with CUDA 12.2. Loading inclusionAI/Ming-Lite-Omni in bfloat16 takes about 40890 MB of memory.
To enable thinking before the response, add the following system prompt before your question:
For detailed usage of the ASR, SpeechQA, and TTS tasks, please refer to `testaudiotasks.py`.
Ming-omni natively supports image generation and image editing. To use these functions, you only need to add the corresponding parameters in the generate function.
This code repository is licensed under the MIT License, and the Legal Disclaimer is located in the LEGAL.md file under the project's root directory.
If you find our work helpful, feel free to cite us.
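The original inference listings referenced in the usage notes above did not survive extraction. As a rough orientation only, the sketch below shows the general shape of loading a multimodal checkpoint in bfloat16 with transformers and calling generate. The exact entry points, processor behaviour, thinking system prompt, and image-generation arguments of inclusionAI/Ming-Lite-Omni are defined by the repository itself; every identifier below is an assumption, so follow the repository's own examples for real use.

```python
import torch
from transformers import AutoModel, AutoProcessor

MODEL_ID = "inclusionAI/Ming-Lite-Omni"

# Load in bfloat16; the note above reports roughly 40 GB (40890 MB) of memory for this.
model = AutoModel.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,  # the checkpoint ships custom modelling code
).to("cuda").eval()
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)

# Chat-style multimodal input. To enable "thinking before response", the README says to
# prepend a dedicated system prompt; its exact text is given in the repository, not here.
messages = [
    # {"role": "system", "content": "<thinking system prompt from the repository>"},
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "example.jpg"},
            {"type": "text", "text": "Describe this image."},
        ],
    },
]

# Generic transformers-style preprocessing; the actual Ming processor API may differ.
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
).to(model.device)

# Image generation / editing are toggled via extra generate() arguments according to the
# README; their names are defined by the repository and intentionally not guessed here.
output_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```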
Ring-lite-linear-preview
ZwZ-2B-GGUF
V2P 7B
Model Card for V2P: Valley-to-Peak GUI Grounding Model
Model Name: V2P (Valley-to-Peak)
Version: 1.0
Model Type: GUI Grounding / UI Element Localization
Developers: Jikai Chen, Long Chen, Dong Wang, Zhixuan Chu, Qinglin Su, Leilei Gan, Chenyi Zhuang, Jinjie Gu
Paper: https://arxiv.org/abs/2508.13634
Code: https://github.com/inclusionAI/AgenticLearning/tree/main/V2P