Gen-Verse
MMaDA-8B-MixCoT
MMaDA-8B-Base
We introduce MMaDA, a novel class of multimodal diffusion foundation models designed to achieve superior performance across diverse domains such as textual reasoning, multimodal understanding, and text-to-image generation. MMaDA is distinguished by three key innovations:

1. MMaDA adopts a unified diffusion architecture with a shared probabilistic formulation and a modality-agnostic design, eliminating the need for modality-specific components.
2. MMaDA introduces a mixed long chain-of-thought (CoT) fine-tuning strategy that curates a unified CoT format across modalities.
3. MMaDA adopts a unified policy-gradient-based RL algorithm, which we call UniGRPO, tailored for diffusion foundation models. Utilizing diversified reward modeling, UniGRPO unifies post-training across both reasoning and generation tasks, ensuring consistent performance improvements.
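The card does not include training code; for UniGRPO's exact objective, see the MMaDA paper. As a rough, hypothetical sketch of the two ingredients it is described as combining — group-relative policy-gradient advantages and diversified reward modeling — one might write (all names, weights, and scores below are illustrative, not from the release):

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style advantages: normalize each sampled completion's reward
    against the mean and standard deviation of its own group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero spread
    return [(r - mean) / std for r in rewards]

def diversified_reward(scores, weights):
    """Combine multiple reward signals (e.g. answer correctness, CoT format,
    image-text alignment for generation) into a single scalar."""
    return sum(w * s for w, s in zip(weights, scores))

# Toy example: 4 sampled completions for one prompt, two reward signals each.
per_sample_scores = [(1.0, 0.8), (0.0, 0.9), (1.0, 0.2), (0.0, 0.1)]
weights = (0.7, 0.3)
rewards = [diversified_reward(s, weights) for s in per_sample_scores]
advantages = group_relative_advantages(rewards)
```

The group-relative normalization is what lets one recipe span reasoning and generation tasks: each task only needs a reward signal, not a task-specific value model.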
DemyAgent-4B
Demystifying Reinforcement Learning in Agentic Reasoning

## 🎯 About This Repository

This repository contains the DemyAgent-4B model weights, a 4B-parameter agentic reasoning model that achieves state-of-the-art performance on challenging benchmarks including AIME2024/2025, GPQA-Diamond, and LiveCodeBench-v6. DemyAgent-4B is trained using our GRPO-TCR recipe with 30K high-quality agentic RL data, demonstrating that small models can outperform much larger alternatives (14B/32B) through effective RL training strategies.

## 📖 Introduction

In our work, we systematically investigate three dimensions of agentic RL: data, algorithms, and reasoning modes. Our findings reveal:

- 🎯 **Data Quality Matters**: Real end-to-end trajectories and high-diversity datasets significantly outperform synthetic alternatives
- ⚡ **Training Efficiency**: Exploration-friendly techniques like reward clipping and entropy maintenance boost training efficiency
- 🧠 **Reasoning Strategy**: Deliberative reasoning with selective tool calls surpasses frequent invocation or verbose self-reasoning

We contribute high-quality SFT and RL datasets, demonstrating that simple recipes enable even 4B models to outperform 32B models on the most challenging reasoning benchmarks.

## 📦 Resources

| Type | Name | Link |
| --- | --- | --- |
| 📊 Dataset | 3K Agentic SFT Data | 🤗 HuggingFace |
| 📊 Dataset | 30K Agentic RL Data | 🤗 HuggingFace |
| 🤗 Model | Qwen2.5-7B-RA-SFT | 🤗 HuggingFace |
| 🤗 Model | Qwen3-4B-RA-SFT | 🤗 HuggingFace |
| 🤗 Model | DemyAgent-4B | 🤗 HuggingFace |

> Note:
> - Qwen2.5-7B-RA-SFT and Qwen3-4B-RA-SFT are finetuned from Qwen2.5-7B-Instruct and Qwen3-4B-Instruct-2507 using our 3K Agentic SFT Data
> - DemyAgent-4B is trained through agentic RL with our 30K Agentic RL Data using the GRPO-TCR recipe

## 📊 Performance

We evaluate our models on challenging benchmarks spanning mathematics, science, and code generation tasks.
### Benchmark Results

| Method | AIME2024 (Math) | AIME2025 (Math) | GPQA-Diamond (Science) | LiveCodeBench-v6 (Code) |
| --- | --- | --- | --- | --- |
| **Self-Contained Reasoning** | | | | |
| Qwen2.5-7B-Instruct | 16.7 | 10.0 | 31.3 | 15.2 |
| Qwen3-4B-Instruct-2507 | 63.3 | 47.4 | 52.0 | 35.1 |
| Qwen2.5-72B-Instruct | 18.9 | 15.0 | 49.0 | - |
| DeepSeek-V3 | 39.2 | 28.8 | 59.1 | 16.1 |
| DeepSeek-R1-Distill-32B | 70.0 | 46.7 | 59.6 | - |
| DeepSeek-R1-Zero (671B) | 71.0 | 53.5 | 59.6 | - |
| **Agentic Reasoning** | | | | |
| Qwen2.5-7B-Instruct | 4.8 | 5.6 | 25.5 | 12.2 |
| Qwen3-4B-Instruct-2507 | 17.9 | 16.3 | 44.3 | 23.0 |
| ToRL-7B | 43.3 | 30.0 | - | - |
| ReTool-32B | 72.5 | 54.3 | - | - |
| Tool-Star-3B | 20.0 | 16.7 | - | - |
| ARPO-7B | 30.0 | 30.0 | 53.0 | 18.3 |
| rStar2-Agent-14B | 80.6 | 69.8 | 60.9 | - |
| DemyAgent-4B (Ours) | 72.6 | 70.0 | 58.5 | 26.8 |

### Key Highlights ✨

Despite having only 4B parameters, DemyAgent-4B achieves:

- 🥇 State-of-the-art on AIME2025 (70.0%), outperforming even DeepSeek-R1-Zero (671B)
- 🥈 Second place on AIME2024 (72.6%) and GPQA-Diamond (58.5%)
- 🚀 Competitive performance against 14B-32B models with 4-8× fewer parameters
- 💡 Superior efficiency compared to long-CoT models through deliberative tool use
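The repository's training code is not reproduced on this card; as a minimal, hypothetical sketch of the two exploration-friendly techniques named in the introduction (reward clipping and entropy maintenance), under assumed bounds and coefficients that are not from the GRPO-TCR recipe itself:

```python
import math

def clipped_reward(reward, low=-1.0, high=1.0):
    """Reward clipping: bound raw rewards so rare outliers
    cannot dominate the policy-gradient update."""
    return max(low, min(high, reward))

def entropy_bonus(probs, coeff=0.01):
    """Entropy maintenance: add a small bonus proportional to the
    policy's entropy to keep exploration alive during training."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return coeff * entropy

# Toy shaped objective for one sampled action distribution.
raw_reward = 3.7         # an outlier reward from the environment
probs = [0.7, 0.2, 0.1]  # current policy over 3 actions
shaped = clipped_reward(raw_reward) + entropy_bonus(probs)
```

Both tweaks push in the same direction: the clip keeps single lucky trajectories from collapsing the policy, and the entropy term keeps the policy from collapsing onto them.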
TraDo-4B-Instruct
We introduce TraDo, a series of SOTA diffusion language models trained with TraceRL. TraDo-4B-Instruct and TraDo-8B-Instruct outperform similarly sized strong AR models on math reasoning tasks. TraDo-8B-Thinking is the first long-CoT diffusion language model.
ReasonFlux-PRM-7B
Qwen3-4B-RA-SFT
TraDo-8B-Instruct
TraDo-8B-Thinking
ReasonFlux-Coder-7B
We introduce the ReasonFlux-Coder models, trained with CURE, our algorithm for co-evolving an LLM's coding and unit-test-generation abilities. ReasonFlux-Coder-7B and ReasonFlux-Coder-14B outperform similarly sized Qwen Coders, DeepSeek Coders, and Seed-Coders, and integrate naturally into common test-time scaling and agentic coding pipelines. ReasonFlux-Coder-4B is our long-CoT model; it outperforms Qwen3-4B while achieving 64.8% efficiency in unit test generation, and we have demonstrated its ability to serve as a reward model for training base models via reinforcement learning (see our paper).
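CURE's actual reward formulation is given in the paper; the toy below only illustrates the co-evolution idea in spirit — candidate solutions and candidate unit tests scoring each other through a cross-execution matrix — with a majority-vote heuristic of my own choosing standing in for the real objective:

```python
def cure_style_rewards(pass_matrix):
    """Hypothetical sketch of a CURE-like co-evolution signal.

    pass_matrix[i][j] is True if candidate solution i passes candidate
    unit test j. A solution is rewarded for passing many tests; a test is
    rewarded for agreeing with the majority verdict on each solution, so
    a trivial always-pass test scores low when solutions disagree.
    """
    n_sol = len(pass_matrix)
    n_test = len(pass_matrix[0])
    sol_rewards = [sum(row) / n_test for row in pass_matrix]
    # Majority verdict for each solution across all tests.
    majority = [sum(row) * 2 >= n_test for row in pass_matrix]
    test_rewards = [
        sum(pass_matrix[i][j] == majority[i] for i in range(n_sol)) / n_sol
        for j in range(n_test)
    ]
    return sol_rewards, test_rewards

# 2 candidate solutions evaluated against 3 candidate unit tests.
matrix = [
    [True, True, False],   # solution 0
    [True, False, False],  # solution 1
]
sol_r, test_r = cure_style_rewards(matrix)
```

The key property of any such scheme is that neither side has ground truth: solutions and tests bootstrap each other, which is what lets coding and unit-test generation improve jointly.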
ReasonFlux-Coder-4B
Qwen2.5-7B-RA-SFT
ReasonFlux-Coder-14B
ReasonFlux-F1
ReasonFlux-PRM-1.5B
ReasonFlux-F1-14B
ReasonFlux-PRM-Qwen-2.5-7B
We introduce ReasonFlux-PRM, a trajectory-aware process reward model (PRM) explicitly designed to evaluate the trajectory-response type of reasoning traces. ReasonFlux-PRM incorporates both step-level and trajectory-level supervision, enabling fine-grained reward assignment aligned with structured chain-of-thought data. ReasonFlux-PRM supports both offline and online reward supervision: selecting high-quality training data for model distillation, providing dense process-level rewards for policy optimization during reinforcement learning, and enabling reward-guided test-time scaling.

| Model | Type | Size | Features | Use Cases |
| --- | --- | --- | --- | --- |
| ReasonFlux-PRM | PRM | 7B | Trajectory-aware scoring • Online/offline supervision • Dense process rewards | Data selection, RL training, test-time scaling |
| ReasonFlux-PRM | PRM | 1.5B | Lightweight scoring • Efficient inference • Edge deployment | Resource-constrained applications |
| ReasonFlux-PRM-Qwen-2.5 | End-to-end trained policy model | 7B | Long-CoT reasoning • Solving complex tasks and problems | Math and science reasoning |

> Note: We obtain ReasonFlux-PRM-Qwen-2.5-7B through an end-to-end training process: first applying SFT on 1k Trajectory–Response pairs selected by ReasonFlux-PRM-7B, followed by RL training with ReasonFlux-PRM-7B-integrated GRPO.

## Citation
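ReasonFlux-PRM's actual scoring is defined in the paper; the sketch below only illustrates the interface the card describes — combining step-level and trajectory-level supervision into one scalar, then using it offline to select top-k training pairs. The function names, the `alpha` weighting, and all scores are hypothetical:

```python
def prm_score(step_scores, trajectory_score, alpha=0.5):
    """Hypothetical blend of step-level and trajectory-level
    supervision into a single scalar for one reasoning trace."""
    step_avg = sum(step_scores) / len(step_scores)
    return alpha * step_avg + (1 - alpha) * trajectory_score

def select_top_k(traces, k):
    """Offline use: keep the k highest-scoring trajectory-response
    pairs for SFT / distillation."""
    return sorted(traces, key=lambda t: t["score"], reverse=True)[:k]

# Toy pool of three scored reasoning traces.
traces = [
    {"id": "a", "score": prm_score([0.9, 0.8, 0.95], 0.9)},
    {"id": "b", "score": prm_score([0.4, 0.3, 0.5], 0.6)},
    {"id": "c", "score": prm_score([0.7, 0.9, 0.8], 0.5)},
]
best = select_top_k(traces, k=1)
```

The same scalar could in principle serve the card's other two uses: as a dense per-step reward during RL (scoring prefixes instead of whole traces) and as a ranking signal for reward-guided test-time scaling.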