Gen-Verse
MMaDA-8B-MixCoT
MMaDA-8B-Base
We introduce MMaDA, a novel class of multimodal diffusion foundation models designed to achieve superior performance across diverse domains such as textual reasoning, multimodal understanding, and text-to-image generation. MMaDA is distinguished by three key innovations:

1. MMaDA adopts a unified diffusion architecture with a shared probabilistic formulation and a modality-agnostic design, eliminating the need for modality-specific components.
2. MMaDA introduces a mixed long chain-of-thought (CoT) fine-tuning strategy that curates a unified CoT format across modalities.
3. MMaDA adopts a unified policy-gradient-based RL algorithm, which we call UniGRPO, tailored for diffusion foundation models. Utilizing diversified reward modeling, UniGRPO unifies post-training across both reasoning and generation tasks, ensuring consistent performance improvements.
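The card does not include training code; for UniGRPO's exact objective, see the MMaDA paper. As a rough, hypothetical sketch of the two ingredients it is described as combining — group-relative policy-gradient advantages and diversified reward modeling — one might write (all names, weights, and scores below are illustrative, not from the release):

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style advantages: normalize each sampled completion's reward
    against the mean and standard deviation of its own group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero spread
    return [(r - mean) / std for r in rewards]

def diversified_reward(scores, weights):
    """Combine multiple reward signals (e.g. answer correctness, CoT format,
    image-text alignment for generation) into a single scalar."""
    return sum(w * s for w, s in zip(weights, scores))

# Toy example: 4 sampled completions for one prompt, two reward signals each.
per_sample_scores = [(1.0, 0.8), (0.0, 0.9), (1.0, 0.2), (0.0, 0.1)]
weights = (0.7, 0.3)
rewards = [diversified_reward(s, weights) for s in per_sample_scores]
advantages = group_relative_advantages(rewards)
```

The group-relative normalization is what lets one recipe span reasoning and generation tasks: each task only needs a reward signal, not a task-specific value model.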
DemyAgent-4B
Demystifying Reinforcement Learning in Agentic Reasoning

## 🎯 About This Repository

This repository contains the DemyAgent-4B model weights, a 4B-parameter agentic reasoning model that achieves state-of-the-art performance on challenging benchmarks including AIME2024/2025, GPQA-Diamond, and LiveCodeBench-v6. DemyAgent-4B is trained using our GRPO-TCR recipe with 30K high-quality agentic RL data, demonstrating that small models can outperform much larger alternatives (14B/32B) through effective RL training strategies.

## 📖 Introduction

In our work, we systematically investigate three dimensions of agentic RL: data, algorithms, and reasoning modes. Our findings reveal:

- 🎯 **Data Quality Matters**: Real end-to-end trajectories and high-diversity datasets significantly outperform synthetic alternatives
- ⚡ **Training Efficiency**: Exploration-friendly techniques like reward clipping and entropy maintenance boost training efficiency
- 🧠 **Reasoning Strategy**: Deliberative reasoning with selective tool calls surpasses frequent invocation or verbose self-reasoning

We contribute high-quality SFT and RL datasets, demonstrating that simple recipes enable even 4B models to outperform 32B models on the most challenging reasoning benchmarks.

## 📦 Resources

| Type | Name | Link |
| --- | --- | --- |
| 📊 Dataset | 3K Agentic SFT Data | 🤗 HuggingFace |
| 📊 Dataset | 30K Agentic RL Data | 🤗 HuggingFace |
| 🤗 Model | Qwen2.5-7B-RA-SFT | 🤗 HuggingFace |
| 🤗 Model | Qwen3-4B-RA-SFT | 🤗 HuggingFace |
| 🤗 Model | DemyAgent-4B | 🤗 HuggingFace |

> Note:
> - Qwen2.5-7B-RA-SFT and Qwen3-4B-RA-SFT are finetuned from Qwen2.5-7B-Instruct and Qwen3-4B-Instruct-2507 using our 3K Agentic SFT Data
> - DemyAgent-4B is trained through agentic RL with our 30K Agentic RL Data using the GRPO-TCR recipe

## 📊 Performance

We evaluate our models on challenging benchmarks spanning mathematics, science, and code generation tasks.
### Benchmark Results

| Method | AIME2024 (Math) | AIME2025 (Math) | GPQA-Diamond (Science) | LiveCodeBench-v6 (Code) |
| --- | --- | --- | --- | --- |
| **Self-Contained Reasoning** | | | | |
| Qwen2.5-7B-Instruct | 16.7 | 10.0 | 31.3 | 15.2 |
| Qwen3-4B-Instruct-2507 | 63.3 | 47.4 | 52.0 | 35.1 |
| Qwen2.5-72B-Instruct | 18.9 | 15.0 | 49.0 | - |
| DeepSeek-V3 | 39.2 | 28.8 | 59.1 | 16.1 |
| DeepSeek-R1-Distill-32B | 70.0 | 46.7 | 59.6 | - |
| DeepSeek-R1-Zero (671B) | 71.0 | 53.5 | 59.6 | - |
| **Agentic Reasoning** | | | | |
| Qwen2.5-7B-Instruct | 4.8 | 5.6 | 25.5 | 12.2 |
| Qwen3-4B-Instruct-2507 | 17.9 | 16.3 | 44.3 | 23.0 |
| ToRL-7B | 43.3 | 30.0 | - | - |
| ReTool-32B | 72.5 | 54.3 | - | - |
| Tool-Star-3B | 20.0 | 16.7 | - | - |
| ARPO-7B | 30.0 | 30.0 | 53.0 | 18.3 |
| rStar2-Agent-14B | 80.6 | 69.8 | 60.9 | - |
| DemyAgent-4B (Ours) | 72.6 | 70.0 | 58.5 | 26.8 |

### Key Highlights ✨

Despite having only 4B parameters, DemyAgent-4B achieves:

- 🥇 State-of-the-art on AIME2025 (70.0%), outperforming even DeepSeek-R1-Zero (671B)
- 🥈 Second place on AIME2024 (72.6%) and GPQA-Diamond (58.5%)
- 🚀 Competitive performance against 14B-32B models with 4-8× fewer parameters
- 💡 Superior efficiency compared to long-CoT models through deliberative tool use
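The repository's training code is not reproduced on this card; as a minimal, hypothetical sketch of the two exploration-friendly techniques named in the introduction (reward clipping and entropy maintenance), under assumed bounds and coefficients that are not from the GRPO-TCR recipe itself:

```python
import math

def clipped_reward(reward, low=-1.0, high=1.0):
    """Reward clipping: bound raw rewards so rare outliers
    cannot dominate the policy-gradient update."""
    return max(low, min(high, reward))

def entropy_bonus(probs, coeff=0.01):
    """Entropy maintenance: add a small bonus proportional to the
    policy's entropy to keep exploration alive during training."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return coeff * entropy

# Toy shaped objective for one sampled action distribution.
raw_reward = 3.7         # an outlier reward from the environment
probs = [0.7, 0.2, 0.1]  # current policy over 3 actions
shaped = clipped_reward(raw_reward) + entropy_bonus(probs)
```

Both tweaks push in the same direction: the clip keeps single lucky trajectories from collapsing the policy, and the entropy term keeps the policy from collapsing onto them.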
TraDo-4B-Instruct
We introduce TraDo, a series of SOTA diffusion language models trained with TraceRL. TraDo-4B-Instruct and TraDo-8B-Instruct outperform similarly sized strong AR models on math reasoning tasks. TraDo-8B-Thinking is the first long-CoT diffusion language model.
ReasonFlux-PRM-7B
Qwen3-4B-RA-SFT
TraDo-8B-Instruct
TraDo-8B-Thinking
ReasonFlux-Coder-7B
We introduce the ReasonFlux-Coder models, trained with CURE, our algorithm for co-evolving an LLM's coding and unit-test-generation abilities. ReasonFlux-Coder-7B and ReasonFlux-Coder-14B outperform similarly sized Qwen Coders, DeepSeek Coders, and Seed-Coders, and integrate naturally into common test-time scaling and agentic coding pipelines. ReasonFlux-Coder-4B is our long-CoT model; it outperforms Qwen3-4B while achieving 64.8% efficiency in unit test generation, and we have demonstrated its ability to serve as a reward model for training base models via reinforcement learning (see our paper).
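CURE's actual reward formulation is given in the paper; the toy below only illustrates the co-evolution idea in spirit — candidate solutions and candidate unit tests scoring each other through a cross-execution matrix — with a majority-vote heuristic of my own choosing standing in for the real objective:

```python
def cure_style_rewards(pass_matrix):
    """Hypothetical sketch of a CURE-like co-evolution signal.

    pass_matrix[i][j] is True if candidate solution i passes candidate
    unit test j. A solution is rewarded for passing many tests; a test is
    rewarded for agreeing with the majority verdict on each solution, so
    a trivial always-pass test scores low when solutions disagree.
    """
    n_sol = len(pass_matrix)
    n_test = len(pass_matrix[0])
    sol_rewards = [sum(row) / n_test for row in pass_matrix]
    # Majority verdict for each solution across all tests.
    majority = [sum(row) * 2 >= n_test for row in pass_matrix]
    test_rewards = [
        sum(pass_matrix[i][j] == majority[i] for i in range(n_sol)) / n_sol
        for j in range(n_test)
    ]
    return sol_rewards, test_rewards

# 2 candidate solutions evaluated against 3 candidate unit tests.
matrix = [
    [True, True, False],   # solution 0
    [True, False, False],  # solution 1
]
sol_r, test_r = cure_style_rewards(matrix)
```

The key property of any such scheme is that neither side has ground truth: solutions and tests bootstrap each other, which is what lets coding and unit-test generation improve jointly.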
ReasonFlux-Coder-4B
Qwen2.5-7B-RA-SFT
ReasonFlux-Coder-14B
ReasonFlux-F1
ReasonFlux-PRM-1.5B
ReasonFlux-F1-14B
ReasonFlux-PRM-Qwen-2.5-7B
We introduce ReasonFlux-PRM, a trajectory-aware process reward model (PRM) explicitly designed to evaluate the trajectory-response type of reasoning traces. ReasonFlux-PRM incorporates both step-level and trajectory-level supervision, enabling fine-grained reward assignment aligned with structured chain-of-thought data. ReasonFlux-PRM supports both offline and online reward supervision: selecting high-quality training data for model distillation, providing dense process-level rewards for policy optimization during reinforcement learning, and enabling reward-guided test-time scaling.

| Model | Type | Size | Features | Use Cases |
| --- | --- | --- | --- | --- |
| ReasonFlux-PRM | PRM | 7B | Trajectory-aware scoring • Online/offline supervision • Dense process rewards | Data selection, RL training, test-time scaling |
| ReasonFlux-PRM | PRM | 1.5B | Lightweight scoring • Efficient inference • Edge deployment | Resource-constrained applications |
| ReasonFlux-PRM-Qwen-2.5 | End-to-end trained policy model | 7B | Long-CoT reasoning • Solving complex tasks and problems | Math and science reasoning |

> Note: We obtain ReasonFlux-PRM-Qwen-2.5-7B through an end-to-end training process: first applying SFT on 1k Trajectory–Response pairs selected by ReasonFlux-PRM-7B, followed by RL training with ReasonFlux-PRM-7B-integrated GRPO.

## Citation
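ReasonFlux-PRM's actual scoring is defined in the paper; the sketch below only illustrates the interface the card describes — combining step-level and trajectory-level supervision into one scalar, then using it offline to select top-k training pairs. The function names, the `alpha` weighting, and all scores are hypothetical:

```python
def prm_score(step_scores, trajectory_score, alpha=0.5):
    """Hypothetical blend of step-level and trajectory-level
    supervision into a single scalar for one reasoning trace."""
    step_avg = sum(step_scores) / len(step_scores)
    return alpha * step_avg + (1 - alpha) * trajectory_score

def select_top_k(traces, k):
    """Offline use: keep the k highest-scoring trajectory-response
    pairs for SFT / distillation."""
    return sorted(traces, key=lambda t: t["score"], reverse=True)[:k]

# Toy pool of three scored reasoning traces.
traces = [
    {"id": "a", "score": prm_score([0.9, 0.8, 0.95], 0.9)},
    {"id": "b", "score": prm_score([0.4, 0.3, 0.5], 0.6)},
    {"id": "c", "score": prm_score([0.7, 0.9, 0.8], 0.5)},
]
best = select_top_k(traces, k=1)
```

The same scalar could in principle serve the card's other two uses: as a dense per-step reward during RL (scoring prefixes instead of whole traces) and as a ranking signal for reward-guided test-time scaling.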