THU-KEG
AdaptThink-7B-delta0.05
OpenSAE-LLaMA-3.1-Layer_00
OpenSAE-LLaMA-3.1-Layer_01
OpenSAE-LLaMA-3.1-Layer_15
This is the model card of a 🤗 transformers model that has been pushed to the Hub. This model card has been automatically generated. - Developed by: [More Information Needed] - Funded by [optional]: [More Information Needed] - Shared by [optional]: [More Information Needed] - Model type: [More Information Needed] - Language(s) (NLP): [More Information Needed] - License: [More Information Needed] - Finetuned from model [optional]: [More Information Needed] - Repository: [More Information Needed] - Paper [optional]: [More Information Needed] - Demo [optional]: [More Information Needed] Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. More information needed for further recommendations. Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019). - Hardware Type: [More Information Needed] - Hours used: [More Information Needed] - Cloud Provider: [More Information Needed] - Compute Region: [More Information Needed] - Carbon Emitted: [More Information Needed]
DeepDive-4B-SFT
DeepPrune-Judge-4B
DeepPrune: Parallel Scaling without Inter-trace Redundancy Abstract Parallel scaling has emerged as a powerful paradigm to enhance reasoning capabilities in large language models (LLMs) by generating multiple Chain-of-Thought (CoT) traces simultaneously. However, this approach introduces significant computational inefficiency due to inter-trace redundancy -- our analysis reveals that over 80% of parallel reasoning traces yield identical final answers, representing substantial wasted computation. To address this critical efficiency bottleneck, we propose DeepPrune, a novel framework that enables efficient parallel scaling through dynamic pruning. Our method features a specialized judge model, trained with focal loss and oversampling techniques to accurately predict answer equivalence from partial reasoning traces (achieving 0.87 AUROC on equivalence prediction), combined with an online greedy clustering algorithm that dynamically prunes redundant paths while preserving answer diversity. Comprehensive evaluations across three challenging benchmarks (AIME 2024, AIME 2025, and GPQA) and multiple reasoning models demonstrate that DeepPrune reduces tokens by over 80% compared to conventional consensus sampling in most cases, while maintaining accuracy within 3 percentage points. Our work establishes a new standard for efficient parallel reasoning, making high-performance reasoning more efficient. Our code and data are here: this https URL This model is a fine-tuned version of Qwen3-4B-Instruct-2507 on the mycustomdataset dataset. It achieves the following results on the evaluation set: - Loss: 0.0438 To address the inter-trace redundancy problem in parallel scaling, we propose DeepPrune, a two-stage framework that includes offline training of a specialized judge model and online inference-time pruning.
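The online greedy clustering step can be sketched as follows. This is a minimal illustration, not the released implementation: `judge_equivalent(a, b)` is a hypothetical predicate assumed to wrap the judge model's prediction that two partial traces will reach the same final answer.

```python
# Hedged sketch of DeepPrune-style online greedy clustering (names hypothetical).
def online_greedy_prune(partial_traces, judge_equivalent):
    """Keep one representative per predicted-answer cluster; prune the rest.

    judge_equivalent(a, b) -> bool is assumed to wrap the judge model's
    prediction that two partial traces will yield the same final answer.
    """
    representatives = []  # one surviving trace per answer cluster
    pruned = []
    for trace in partial_traces:
        if any(judge_equivalent(trace, rep) for rep in representatives):
            pruned.append(trace)           # redundant: predicted same answer
        else:
            representatives.append(trace)  # new answer cluster: keep running
    return representatives, pruned
```

Only the representatives continue decoding, which is where the token savings come from; pruned traces are stopped early.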
The core idea is that by accurately predicting whether two incomplete reasoning traces will yield identical final answers, we can efficiently prune redundant paths while preserving answer diversity. We fine-tune `Qwen/Qwen3-4B-Instruct-2507` into a judge model, `DeepPrune-Judge-4B`, that predicts whether two unfinished traces will yield the same answer. Our training data is collected exclusively from DeepSeek-R1-Distill-Llama-8B outputs, while traces from other models are reserved for testing cross-model generalization. The model is trained on DeepPrune's fine-tuning dataset. The following hyperparameters were used during training: - learning_rate: 1e-05 - train_batch_size: 1 - eval_batch_size: 1 - seed: 42 - distributed_type: multi-GPU - num_devices: 4 - gradient_accumulation_steps: 2 - total_train_batch_size: 8 - total_eval_batch_size: 4 - optimizer: ADAMW_TORCH_FUSED with betas=(0.9,0.999), epsilon=1e-08, and no additional optimizer arguments - lr_scheduler_type: cosine - lr_scheduler_warmup_ratio: 0.1 - num_epochs: 3.0 Evaluation results are also reported in the Offline Experiment Results section (Section 5.2) of our paper. - Transformers 4.55.0 - Pytorch 2.8.0+cu128 - Datasets 3.6.0 - Tokenizers 0.21.1
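The focal-loss objective used to train the judge on imbalanced equivalent/non-equivalent pairs can be illustrated with a minimal sketch; the `alpha` and `gamma` defaults below are illustrative, not the paper's settings.

```python
import math

# Hedged sketch of binary focal loss for the equivalence judge
# (alpha/gamma values are illustrative assumptions, not the paper's).
def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """p: predicted probability of 'equivalent'; y: 1 if equivalent else 0.

    The (1 - pt)^gamma factor down-weights easy, well-classified pairs so
    training focuses on hard examples from the minority class.
    """
    pt = p if y == 1 else 1.0 - p           # probability of the true class
    weight = alpha if y == 1 else 1.0 - alpha
    return -weight * (1.0 - pt) ** gamma * math.log(pt)
```

With `gamma=0` and `alpha=0.5` this reduces to a scaled cross-entropy; increasing `gamma` shrinks the loss on confident correct predictions.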
LongWriter Zero 32B
LongWriter-Zero: Mastering Ultra-Long Text Generation via Reinforcement Learning Table of Contents - LongWriter-Zero - Benchmarks & Evaluation - Quick Start - Citation LongWriter-Zero is a purely reinforcement learning (RL)-based large language model capable of generating coherent passages exceeding 10,000 tokens. Built upon Qwen2.5-32B-Base, the training process includes: - 30 billion tokens of continual pretraining on long-form books and technical reports to enhance fundamental writing capabilities; - Group Relative Policy Optimization (GRPO) with a composite reward function: a Length Reward Model (RM) enforces the desired output length, a Writing RM scores fluency, coherence, and helpfulness, and a Format RM ensures strict adherence to the required output structure while detecting repeated content to avoid redundancy; - A dedicated prompting strategy that encourages the model to explicitly reflect before answering, thereby improving structural planning and fine-grained length control. The resulting model, LongWriter-Zero-32B, matches or surpasses the performance of 100B-scale models in ultra-long-form generation. LongWriter-Zero's effectiveness is demonstrated on two fronts: WritingBench and Arena-Write for automatic scoring, and a human-in-the-loop win-rate study for pairwise quality comparison. > WritingBench (scale 1-10) & Arena-Write (Elo) performance of different LLMs. > Donut charts showing win/tie/loss proportions against six baselines (left) and aggregated human evaluation (right). Summary: LongWriter-Zero achieves the highest automatic WritingBench score among open models and secures dominant win-rates in pairwise GPT-4.1 evaluations, confirming its superior quality in ultra-long-form generation while maintaining efficiency. Note: We use a slightly different tokenizer and chat template compared to the original Qwen2.5-32B-Instruct model.
The snippet below shows how to format prompts with LongWriter-Zero's prompt protocol and call the model through an SGLang-powered endpoint supporting streaming responses. ```bibtex @misc{wu2025longwriterzeromasteringultralongtext, title={LongWriter-Zero: Mastering Ultra-Long Text Generation via Reinforcement Learning}, author={Yuhao Wu and Yushi Bai and Zhiqiang Hu and Roy Ka-Wei Lee and Juanzi Li}, year={2025}, eprint={2506.18841}, archivePrefix={arXiv}, primaryClass={cs.CL}, url={https://arxiv.org/abs/2506.18841}, } ```
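A minimal sketch of building a streaming request against an OpenAI-compatible SGLang endpoint is shown below; the endpoint URL, model name, and token budget are illustrative assumptions, not the project's published quick-start values.

```python
import json
import urllib.request

# Hedged sketch: SGLang serves an OpenAI-compatible chat API; the URL,
# model name, and sampling parameters here are placeholder assumptions.
ENDPOINT = "http://localhost:30000/v1/chat/completions"

def build_request(prompt, max_tokens=12000):
    """Build a chat-completions request for an ultra-long-form generation."""
    payload = {
        "model": "LongWriter-Zero-32B",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,  # budget sized for >10k-token outputs
        "stream": True,            # stream partial text as it is generated
    }
    return urllib.request.Request(
        ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

req = build_request("Write a 10,000-word essay on the history of writing.")
# urllib.request.urlopen(req) would then yield server-sent event chunks.
```

Any OpenAI-compatible client (e.g. the `openai` SDK pointed at the same base URL) can replace the raw `urllib` call.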
ADELIE-DPO-1.5B
LLaDA-8B-BGPO-math
[](https://arxiv.org/abs/2510.11683) [](https://github.com/THU-KEG/BGPO) LLaDA-8B-BGPO-math is an 8-billion-parameter diffusion large language model (dLLM) trained from LLaDA-8B-Instruct using Boundary-Guided Policy Optimization (BGPO) for enhanced mathematical capabilities. - Model Type: Diffusion Large Language Model (dLLM) - Parameters: 8 billion - Training Method: Boundary-Guided Policy Optimization (BGPO) - Base Model: LLaDA-8B-Instruct - Task: Mathematics - Language: English - Training Steps: 700 steps - Response Length: 512 tokens - Train Diffusion Steps: 256 - Eval Diffusion Steps: 512 - Block Size: 32 - Monte Carlo Sample Size ($n_t$): 16 - Learning Rate: 5e-7 - Batch Size: 16 - Framework: Built on VeRL (Volcengine Reinforcement Learning) - Primarily designed for mathematical tasks. - Performance may vary on other tasks. - Requires appropriate computational resources for inference.
WildReward-8B
ADELIE-SFT-1.5B
WildReward-4B
LLaDA-8B-BGPO-sudoku
[](https://arxiv.org/abs/2510.11683) [](https://github.com/THU-KEG/BGPO) LLaDA-8B-BGPO-sudoku is an 8-billion-parameter diffusion large language model (dLLM) trained from LLaDA-8B-Instruct using Boundary-Guided Policy Optimization (BGPO) for enhanced Sudoku-solving capabilities. - Model Type: Diffusion Large Language Model (dLLM) - Parameters: 8 billion - Training Method: Boundary-Guided Policy Optimization (BGPO) - Base Model: LLaDA-8B-Instruct - Task: Sudoku - Language: English - Training Steps: 400 steps - Response Length: 256 tokens - Train Diffusion Steps: 128 - Eval Diffusion Steps: 256 - Block Size: 32 - Monte Carlo Sample Size ($n_t$): 32 - Learning Rate: 5e-7 - Batch Size: 16 - Framework: Built on VeRL (Volcengine Reinforcement Learning) - Primarily designed for Sudoku tasks. - Performance may vary on other tasks. - Requires appropriate computational resources for inference.
LLaDA-8B-BGPO-code
[](https://arxiv.org/abs/2510.11683) [](https://github.com/THU-KEG/BGPO) LLaDA-8B-BGPO-code is an 8-billion-parameter diffusion large language model (dLLM) trained from LLaDA-8B-Instruct using Boundary-Guided Policy Optimization (BGPO) for enhanced code generation capabilities. - Model Type: Diffusion Large Language Model (dLLM) - Parameters: 8 billion - Training Method: Boundary-Guided Policy Optimization (BGPO) - Base Model: LLaDA-8B-Instruct - Task: Code generation - Language: English - Training Epochs: 5 epochs (112 steps per epoch) - Total Steps: 560 steps - Response Length: 512 tokens - Train Diffusion Steps: 512 - Eval Diffusion Steps: 512 - Block Size: 32 - Monte Carlo Sample Size ($n_t$): 16 - Learning Rate: 5e-7 - Batch Size: 16 - Framework: Built on VeRL (Volcengine Reinforcement Learning) - Primarily designed for code generation tasks. - Performance may vary on other tasks. - Requires appropriate computational resources for inference.
LLaDA-8B-BGPO-countdown
[](https://arxiv.org/abs/2510.11683) [](https://github.com/THU-KEG/BGPO) LLaDA-8B-BGPO-countdown is an 8-billion-parameter diffusion large language model (dLLM) trained from LLaDA-8B-Instruct using Boundary-Guided Policy Optimization (BGPO) for enhanced planning capabilities on the Countdown task. - Model Type: Diffusion Large Language Model (dLLM) - Parameters: 8 billion - Training Method: Boundary-Guided Policy Optimization (BGPO) - Base Model: LLaDA-8B-Instruct - Task: Countdown - Language: English - Training Steps: 560 steps - Response Length: 256 tokens - Train Diffusion Steps: 128 - Eval Diffusion Steps: 256 - Block Size: 32 - Monte Carlo Sample Size ($n_t$): 16 - Learning Rate: 5e-7 - Batch Size: 16 - Framework: Built on VeRL (Volcengine Reinforcement Learning) - Primarily designed for Countdown tasks. - Performance may vary on other tasks. - Requires appropriate computational resources for inference.
AdaptThink-1.5B-delta0.05
ADELIE-DPO
OpenSAE-LLaMA-3.1-Layer_25
SIRI-7B-high
ADELIE-DPO-3B
SIRI-7B-low
DeepDive-4B-C-GRPO
ADELIE-SFT-3B
ADELIE-SFT
OpenSAE-LLaMA-3.1-Layer_06
R1-Distill-Qwen-7B-VerIF
- Developed by: Hao Peng @ THU-KEG - Model type: RL-trained LLM - Language(s) (NLP): English, Chinese - License: apache-2.0 - Finetuned from model [optional]: deepseek-ai/DeepSeek-R1-Distill-Qwen-7B - Repository: https://github.com/THU-KEG/VerIF - Paper: https://arxiv.org/abs/2506.09942 The model is trained using RL with VerIF on the VerInstruct training data. VerIF is a practical and efficient method for verification in instruction-following reinforcement learning. Built on the idea of Reinforcement Learning with Verifiable Rewards (RLVR), VerIF integrates rule-based code checks with LLM-based reasoning verification (e.g., QwQ-32B) to provide accurate and scalable reward signals. The model is optimized for instruction following without affecting other general capabilities. Evaluation Results We evaluate the model on several representative instruction-following benchmarks, including IFEval, Multi-IF, SysBench, and FollowBench. You can find more details in our GitHub repo (https://github.com/THU-KEG/VerIF). If you find this model helpful, please kindly cite us:
AdaptThink-1.5B-delta0.1
IF-Verifier-7B
- Developed by: Hao Peng @ THU-KEG - Model type: Generative reward model - Language(s) (NLP): English, Chinese - License: apache-2.0 - Finetuned from model [optional]: deepseek-ai/DeepSeek-R1-Distill-Qwen-7B - Repository: https://github.com/THU-KEG/VerIF - Paper: https://arxiv.org/abs/2506.09942 This model is trained from DeepSeek-R1-Distill-Qwen-7B on 131k critic examples from IF-Verifier-Data. It is used to verify the soft constraints of instruction following. Deploying IF-Verifier-7B requires only a single H800 GPU, with an average reward computation time of 120 seconds per batch, which can be further reduced with multiple GPUs. Results Models trained with rewards from this verifier are comparable to those trained with QwQ-32B as the verifier. Summary Please refer to our paper and our GitHub repo (https://github.com/THU-KEG/VerIF) for more details. Citation If this model helps, please kindly cite us:
AdaptThink-1.5B-delta0.01
DeepDive-30B-A3B-C-GRPO
SIRI-1.5B-high
ReaRAG-9B
ReaRAG-9B is trained based on glm-4-9b, with enhanced capability to generate knowledge-guided reasoning chains for iterative RAG. The model supports a context window of up to 8k tokens. Please refer to the Inference section in the GitHub repository for usage details. Citation If you use this model in your research or projects, please consider citing our work:
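The iterative, knowledge-guided RAG loop can be sketched abstractly as follows; the `propose_step` and `retrieve` interfaces are hypothetical stand-ins for illustration, not ReaRAG's actual API.

```python
# Hedged sketch of an iterative RAG loop in the spirit of ReaRAG
# (interfaces hypothetical): the model alternates between issuing search
# actions and reasoning over retrieved evidence until it emits an answer.
def iterative_rag(question, propose_step, retrieve, max_iters=6):
    """propose_step(question, chain) -> ("search", query) or ("finish", answer).

    retrieve(query) returns evidence text; the (query, evidence) pairs
    accumulate into a reasoning chain that conditions the next step.
    """
    chain = []  # accumulated (query, evidence) reasoning steps
    for _ in range(max_iters):
        action, payload = propose_step(question, chain)
        if action == "finish":
            return payload
        chain.append((payload, retrieve(payload)))
    return None  # budget exhausted without a final answer
```

The `max_iters` cap bounds the number of retrieval rounds, which keeps the reasoning chain within the model's 8k-token context window.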
OpenSAE-LLaMA-3.1-Layer_31
AdaptThink-1.5B-delta0.075
AdaptThink-1.5B-delta0
Mistral-Crab-SFT
TULU3-VerIF
SIRI-1.5B-low
SIRI: Scaling Iterative Reinforcement Learning with Interleaved Compression SIRI (Scaling Iterative Reinforcement Learning with Interleaved Compression) is a reinforcement-learning-based framework designed to improve the efficiency and accuracy of Large Reasoning Models (LRMs). Traditional RL training often causes overthinking and long, redundant reasoning traces. Prior methods that compress outputs (length penalties, pruning, or skipping thought tokens) improve efficiency but hurt accuracy. SIRI solves this trade-off by iteratively alternating between compression and expansion of the reasoning budget, controlled by a cosine length scheduler. This approach dynamically balances concise reasoning with long-horizon exploration. - Interleaved Compression-Expansion: - Compression phase: forces concise, high-density reasoning by limiting rollout length. - Expansion phase: restores longer rollouts to encourage exploration and planning. - Token Efficiency without Accuracy Loss: Unlike previous methods, SIRI improves accuracy while reducing average token usage. - Iterative RL Training: Built on GRPO with modifications from DAPO (clip-high/low decoupling, KL removal). - Generalization Across Model Sizes: Validated on both 1.5B and 7B models.
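The cosine length scheduler idea can be sketched minimally as follows; the period and length bounds are illustrative assumptions, not SIRI's actual training settings.

```python
import math

# Hedged sketch of an interleaved cosine length schedule (constants are
# hypothetical): the rollout-length budget oscillates between a compressed
# floor and an expanded ceiling as training alternates the two phases.
def length_budget(step, period=100, min_len=2048, max_len=8192):
    """Cosine-interpolated rollout token budget at a given training step."""
    phase = math.cos(2 * math.pi * (step % period) / period)  # in [-1, 1]
    # phase = +1 -> expansion (max_len); phase = -1 -> compression (min_len)
    return int(min_len + (max_len - min_len) * (phase + 1) / 2)
```

At the start of each period the budget sits at the expansion ceiling, tightens smoothly to the compression floor mid-period, then relaxes again, so the policy repeatedly experiences both dense reasoning and long-horizon exploration.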
DeepDive-30B-A3B-SFT
OpenSAE-LLaMA-3.1-Layer_02
OpenSAE-LLaMA-3.1-Layer_03
OpenSAE-LLaMA-3.1-Layer_04
OpenSAE-LLaMA-3.1-Layer_05
OpenSAE-LLaMA-3.1-Layer_08
OpenSAE-LLaMA-3.1-Layer_09
OpenSAE-LLaMA-3.1-Layer_01-shift_back
OpenSAE-LLaMA-3.1-Layer_02-shift_back
Llama3-Crab-DPO
kopl_semantic_parser
PairJudge-RM
PairJudge RM is a pairwise judge reward model designed to enhance Best-of-N sampling for mathematical reasoning tasks. Instead of assigning arbitrary absolute scores to candidate solutions, PairJudge RM compares them in pairs using chain-of-thought (CoT) reasoning and selects the best answer via a knockout tournament strategy. - Paper: https://arxiv.org/abs/2501.13007 - Code: https://github.com/THU-KEG/PairJudgeRM - Dataset: https://huggingface.co/datasets/THU-KEG/PairJudge-432K - Pairwise Judgment: Evaluates two candidate solutions simultaneously to determine which is more correct. - Chain-of-Thought Reasoning: Leverages CoT to transparently verify each step of the candidate solutions. PairJudge RM is built by fine-tuning a pre-trained language model (e.g., Qwen-2.5-7B-Instruct) on the PairJudge-432K dataset. Key training details include: - Optimizer: Adam - Learning Rate: 1e-5 - Batch Size: 128 - Epochs: 8 Below is an example of how to use PairJudge RM for evaluating candidate solutions: Citation If you find our work useful, please consider citing our paper:
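The knockout tournament selection can be sketched as follows; `judge_better(a, b)` is a hypothetical stand-in for a call to PairJudge RM that returns whichever of the two solutions is judged more correct.

```python
# Hedged sketch of knockout-tournament Best-of-N selection (judge
# interface hypothetical): pairwise winners advance until one remains.
def knockout(candidates, judge_better):
    """judge_better(a, b) -> a or b, the solution judged more correct."""
    pool = list(candidates)
    while len(pool) > 1:
        next_round = []
        for i in range(0, len(pool) - 1, 2):
            next_round.append(judge_better(pool[i], pool[i + 1]))
        if len(pool) % 2 == 1:          # odd one out gets a bye
            next_round.append(pool[-1])
        pool = next_round
    return pool[0]
```

For N candidates this makes N - 1 pairwise judge calls in total, versus N(N-1)/2 for a full round-robin comparison.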
OpenSAE-LLaMA-3.1-Layer_07
OpenSAE-LLaMA-3.1-Layer_11
OpenSAE-LLaMA-3.1-Layer_13
OpenSAE-LLaMA-3.1-Layer_14
OpenSAE-LLaMA-3.1-Layer_18
OpenSAE-LLaMA-3.1-Layer_19
OpenSAE-LLaMA-3.1-Layer_20
OpenSAE-LLaMA-3.1-Layer_22
OpenSAE-LLaMA-3.1-Layer_24
OpenSAE-LLaMA-3.1-Layer_26
OpenSAE-LLaMA-3.1-Layer_30
OpenSAE-LLaMA-3.1-Layer_04-shift_back
LongWriter-V-7B
This model is a fine-tuned version of Qwen/Qwen2.5-VL-7B-Instruct on the LongWriter-V-22K dataset. The following hyperparameters were used during training: - learning_rate: 1e-05 - train_batch_size: 1 - eval_batch_size: 8 - seed: 42 - distributed_type: multi-GPU - num_devices: 8 - gradient_accumulation_steps: 2 - total_train_batch_size: 16 - total_eval_batch_size: 64 - optimizer: adamw_torch with betas=(0.9,0.999), epsilon=1e-08, and no additional optimizer arguments - lr_scheduler_type: cosine - lr_scheduler_warmup_ratio: 0.1 - num_epochs: 3 - Transformers 4.49.0.dev0 - Pytorch 2.5.1+cu124 - Datasets 3.2.0 - Tokenizers 0.21.0
Mistral-Crab-DPO
LLMAEL-ReFinED-FT
LongWriter-V-72B
OpenSAE-LLaMA-3.1-Layer_10
OpenSAE-LLaMA-3.1-Layer_12
OpenSAE-LLaMA-3.1-Layer_16
OpenSAE-LLaMA-3.1-Layer_17
OpenSAE-LLaMA-3.1-Layer_21