Kwai-Klear

7 models

Klear-46B-A2.5B-Instruct

license:apache-2.0

Klear-46B-A2.5B-Base

license:apache-2.0

Klear-Reasoner-8B

✨ **Klear-Reasoner-8B**

We present Klear-Reasoner, a model with long reasoning capabilities that deliberates carefully during problem solving and achieves outstanding performance across multiple benchmarks. We investigate two key issues with the clipping mechanism in current RL methods: clipping suppresses critical exploration signals and ignores suboptimal trajectories. To address these challenges, we propose Gradient-Preserving clipping Policy Optimization (GPPO), which gently backpropagates gradients from clipped tokens.

| Resource | Link |
|---|---|
| 📝 Preprints | Paper |
| 🤗 Daily Paper | Paper |
| 🤗 Model Hub | Klear-Reasoner-8B |
| 🤗 Dataset Hub | Math RL |
| 🤗 Dataset Hub | Code RL |
| 🐛 Issues & Discussions | GitHub Issues |
| 📧 Contact | [email protected] |

Benchmark accuracy of Klear-Reasoner-8B on AIME 2024/2025 (avg@64), LiveCodeBench V5 (2024/08/01–2025/02/01, avg@8), and V6 (2025/02/01–2025/05/01, avg@8).

Klear-Reasoner is an 8-billion-parameter reasoning model that achieves SOTA performance on challenging math and coding benchmarks:

| Benchmark | AIME 2024 | AIME 2025 | LiveCodeBench V5 | LiveCodeBench V6 |
|---|---|---|---|---|
| Score | 90.5% | 83.2% | 66.0% | 58.1% |

The model combines:

1. Quality-centric long CoT SFT, distilled from DeepSeek-R1-0528.
2. Gradient-Preserving Clipping Policy Optimization (GPPO), a novel RL method that keeps gradients from clipped tokens to boost exploration and convergence.

**Evaluation**

The headline numbers above are obtained when we expand the inference budget to 64K and adopt the YaRN method with a scaling factor of 2.5. Evaluation is coming soon; stay tuned.
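The gradient-preserving idea behind GPPO can be illustrated with a minimal numeric sketch. This is only an illustration of the concept (standard PPO clipping zeroes the gradient of clipped tokens; a gradient-preserving variant keeps a scaled gradient flowing), not the paper's exact GPPO formulation; the `beta` down-scaling factor is hypothetical.

```python
def ppo_grad(ratio, adv, eps=0.2):
    """Gradient of the standard PPO clipped surrogate w.r.t. the ratio.

    The surrogate is min(ratio * adv, clip(ratio, 1-eps, 1+eps) * adv);
    when the clipped branch is active, the gradient w.r.t. the ratio is 0.
    """
    lo, hi = 1 - eps, 1 + eps
    if adv >= 0:
        return adv if ratio <= hi else 0.0  # clipped above -> no gradient
    else:
        return adv if ratio >= lo else 0.0  # clipped below -> no gradient

def gppo_grad(ratio, adv, eps=0.2, beta=1.0):
    """Sketch of gradient-preserving clipping: clipped tokens still
    contribute a (scaled) gradient instead of a hard zero."""
    g = ppo_grad(ratio, adv, eps)
    if g == 0.0:
        return beta * adv  # gently backpropagate through clipped tokens
    return g

# A token clipped above (ratio=1.5, eps=0.2) gets zero gradient under
# standard PPO but a scaled one under the gradient-preserving sketch.
print(ppo_grad(1.5, 1.0))              # 0.0
print(gppo_grad(1.5, 1.0, beta=0.5))   # 0.5
```

The design intent stated in the card is exactly this contrast: clipping discards exploration signal from clipped tokens, while GPPO keeps that signal flowing during backpropagation.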
| Model | AIME2024 avg@64 | AIME2025 avg@64 | HMMT2025 avg@64 | LCB V5 avg@8 | LCB V6 avg@8 |
|-------|--------------------|--------------------|--------------------|-----------------|-----------------|
| AReal-boba-RL-7B | 61.9 | 48.3 | 29.4 | 34.3 | 31.0† |
| MiMo-7B-RL | 68.2 | 55.4 | 35.7 | 57.8 | 49.3 |
| Skywork-OR1-7B | 70.2 | 54.6 | 35.7 | 47.6 | 42.7 |
| AceReason-Nemotron-1.1-7B | 72.6 | 64.8 | 42.9 | 57.2 | 52.1 |
| POLARIS-4B-Preview | 81.2 | 79.4 | 58.7 | 58.5† | 53.0† |
| Qwen3-8B | 76.0 | 67.3 | 44.7† | 57.5 | 48.4† |
| Deepseek-R1-0528-Distill-8B | 86.0 | 76.3 | 61.5 | 61.0† | 51.6† |
| OpenReasoning-Nemotron-7B | 84.7 | 78.2 | 63.5 | 65.6† | 56.3† |
| Klear-Reasoner-8B-SFT | 75.6 | 70.1 | 57.6 | 58.5 | 49.6 |
| Klear-Reasoner-8B | 83.2 | 75.6 | 60.3 | 61.6 | 53.1 |
| w/ 64K Inference Budget | 90.5 | 83.2 | 70.8 | 66.0 | 58.1 |

> We report the average `pass@1` results (avg@n); all other evaluation settings follow the DeepSeek-R1 assessment framework (temperature=0.6, top_p=0.95).

For code, we use Firejail as the sandbox environment. Additionally, we implement multi-process control based on Pebble, enabling automatic resource reclamation upon task timeout. For mathematics, we use math-verify for judging.

**Using Ray for Multi-Node Training**

For multi-node training, ensure all nodes are started and connected via Ray before executing the training script. Below is a brief setup guide for Ray across multiple machines:

Step 1: Start Ray on the head node (node0).
Step 2: On each additional worker node (e.g., `node1`), start Ray with the head node's address, replacing the IP with that of your head node.

**RL Training**

Run the following script on the master node to start the training task. In the startup script, you need to set the following variables:

**Evaluation**

For the best results, we expand the inference budget to 64K and adopt the YaRN method with a scaling factor of 2.5.
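The two-step Ray bring-up described above typically looks like the following sketch (the port and IP are placeholders; consult the repository's startup scripts for the exact commands):

```shell
# Step 1, on the head node (node0): start Ray and expose the cluster address.
ray start --head --port=6379

# Step 2, on each worker node (e.g. node1): join the cluster,
# replacing 10.0.0.1 with your head node's IP.
ray start --address=10.0.0.1:6379

# Optional: verify from any node that all nodes have joined.
ray status
```

Only after `ray status` shows every node connected should the RL training script be launched on the master node.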
The evaluation data for AIME24, AIME25, and HMMT2025 are available in our GitHub repository under the `benchmarks` directory. For LiveCodeBench, please download the data from the official website. You can run the following commands to perform inference and evaluation:

🤝 **Citation**

If you find this work helpful, please cite our paper:
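The avg@n metric used throughout the tables above is simply the mean `pass@1` over n sampled generations per problem, averaged across problems. A minimal sketch (the dict-of-samples input format is an assumption for illustration, not the repository's actual evaluation interface):

```python
def avg_at_n(results):
    """avg@n: average pass@1 over n sampled generations per problem.

    `results` maps a problem id to a list of booleans, one per sample
    (True if that sample solved the problem).
    """
    per_problem = [sum(samples) / len(samples) for samples in results.values()]
    return sum(per_problem) / len(per_problem)

# Example with n=2 samples per problem:
scores = {"aime_p1": [True, False], "aime_p2": [True, True]}
print(avg_at_n(scores))  # 0.75
```

For AIME this is reported with n=64 (avg@64) and for LiveCodeBench with n=8 (avg@8), which reduces the variance of single-sample pass@1 estimates.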

license:apache-2.0

Klear-Reasoner-8B-SFT

license:apache-2.0

Klear-Qwen3-Thinking-Preview

Improving the reasoning capabilities of large language models (LLMs) has recently attracted significant attention in the AI community. The current paradigm for developing strong reasoning models typically involves a two-stage approach: supervised fine-tuning (SFT) on distilled data, followed by reinforcement learning (RL). While the open-source community has flourished thanks to increasingly available open datasets, many critical training details remain unclear. In this study, we present a comprehensive, open-source pipeline for training a high-performance reasoning model, named `Klear-Qwen3-Thinking`, starting from `Qwen3-8B-Base`. We balance training stability and exploratory behavior in RL through multiple strategies.

`Klear-Qwen3-Thinking-Preview` achieves 76.4% on AIME 2025 and 63.9% on LiveCodeBench V5, improving +13.7% and +8.8% over its SFT baseline, respectively. Notably, `Klear-Qwen3-Thinking-Preview` outperforms `Qwen3-8B` (Thinking mode) and is competitive with `DeepSeek-R1-0528-Qwen3-8B` on math and coding, without distilling from DeepSeek-R1-0528.

👨‍💻 GitHub, 🤗 HF Model, 🤗 HF Dataset, 📖 Tech Report, 🔎 Evaluation results

Performance in comparison with SOTA models on AIME 24&25 and LiveCodeBench V5. Klear-SFT and Klear-Preview refer to Klear-Qwen3-Thinking-SFT and Klear-Qwen3-Thinking-Preview, respectively. Among 7B and 8B models, we outperform AceReason-Nemotron-1.1-7B (AceReason) and Qwen3-8B. Although we do not use the DeepSeek-R1-0528 dataset, we achieve results comparable to DeepSeek-R1-0528-Qwen3-8B. Compared to larger models such as Qwen3-32B and DeepSeek-R1 (0120), we also demonstrate significant advantages.

license:apache-2.0

qwen2.5-math-rlep

license:apache-2.0

Klear AgentForge 8B SFT

license:mit