THU-KEG
AdaptThink-7B-delta0.05
OpenSAE-LLaMA-3.1-Layer_00
OpenSAE-LLaMA-3.1-Layer_01
OpenSAE-LLaMA-3.1-Layer_15
This is the model card of a 🤗 transformers model that has been pushed to the Hub. This model card has been automatically generated. - Developed by: [More Information Needed] - Funded by [optional]: [More Information Needed] - Shared by [optional]: [More Information Needed] - Model type: [More Information Needed] - Language(s) (NLP): [More Information Needed] - License: [More Information Needed] - Finetuned from model [optional]: [More Information Needed] - Repository: [More Information Needed] - Paper [optional]: [More Information Needed] - Demo [optional]: [More Information Needed] Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. More information needed for further recommendations. Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019). - Hardware Type: [More Information Needed] - Hours used: [More Information Needed] - Cloud Provider: [More Information Needed] - Compute Region: [More Information Needed] - Carbon Emitted: [More Information Needed]
DeepDive-4B-SFT
DeepPrune-Judge-4B
DeepPrune: Parallel Scaling without Inter-trace Redundancy Abstract Parallel scaling has emerged as a powerful paradigm to enhance reasoning capabilities in large language models (LLMs) by generating multiple Chain-of-Thought (CoT) traces simultaneously. However, this approach introduces significant computational inefficiency due to inter-trace redundancy -- our analysis reveals that over 80% of parallel reasoning traces yield identical final answers, representing substantial wasted computation. To address this critical efficiency bottleneck, we propose DeepPrune, a novel framework that enables efficient parallel scaling through dynamic pruning. Our method features a specialized judge model, trained with focal loss and oversampling techniques to accurately predict answer equivalence from partial reasoning traces (achieving 0.87 AUROC on equivalence prediction), combined with an online greedy clustering algorithm that dynamically prunes redundant paths while preserving answer diversity. Comprehensive evaluations across three challenging benchmarks (AIME 2024, AIME 2025, and GPQA) and multiple reasoning models demonstrate that DeepPrune reduces tokens by over 80% compared to conventional consensus sampling in most cases, while maintaining accuracy within 3 percentage points. Our work establishes a new standard for efficient parallel reasoning, making high-performance reasoning more efficient. Our code and data are here: this https URL This model is a fine-tuned version of Qwen3-4B-Instruct-2507 on the mycustomdataset dataset. It achieves the following results on the evaluation set: - Loss: 0.0438 To address the inter-trace redundancy problem in parallel scaling, we propose DeepPrune, a two-stage framework that includes offline training of a specialized judge model and online inference-time pruning.
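The online greedy clustering step can be sketched as follows. This is a minimal illustration, not the released implementation: `judge_equivalent(a, b)` is a hypothetical predicate assumed to wrap the judge model's prediction that two partial traces will reach the same final answer.

```python
# Hedged sketch of DeepPrune-style online greedy clustering (names hypothetical).
def online_greedy_prune(partial_traces, judge_equivalent):
    """Keep one representative per predicted-answer cluster; prune the rest.

    judge_equivalent(a, b) -> bool is assumed to wrap the judge model's
    prediction that two partial traces will yield the same final answer.
    """
    representatives = []  # one surviving trace per answer cluster
    pruned = []
    for trace in partial_traces:
        if any(judge_equivalent(trace, rep) for rep in representatives):
            pruned.append(trace)           # redundant: predicted same answer
        else:
            representatives.append(trace)  # new answer cluster: keep running
    return representatives, pruned
```

Only the representatives continue decoding, which is where the token savings come from; pruned traces are stopped early.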
The core idea is that by accurately predicting whether two incomplete reasoning traces will yield identical final answers, we can efficiently prune redundant paths while preserving answer diversity. We fine-tune `Qwen/Qwen3-4B-Instruct-2507` into a judge model, `DeepPrune-Judge-4B`, that predicts whether two unfinished traces will yield the same answer. Our training data is collected exclusively from DeepSeek-R1-Distill-Llama-8B outputs, while traces from other models are reserved for testing cross-model generalization. The model is trained on DeepPrune's fine-tuning dataset. The following hyperparameters were used during training: - learning_rate: 1e-05 - train_batch_size: 1 - eval_batch_size: 1 - seed: 42 - distributed_type: multi-GPU - num_devices: 4 - gradient_accumulation_steps: 2 - total_train_batch_size: 8 - total_eval_batch_size: 4 - optimizer: ADAMW_TORCH_FUSED with betas=(0.9,0.999), epsilon=1e-08, and no additional optimizer arguments - lr_scheduler_type: cosine - lr_scheduler_warmup_ratio: 0.1 - num_epochs: 3.0 Evaluation results are also reported in the Offline Experiment Results section (Section 5.2) of our paper. - Transformers 4.55.0 - Pytorch 2.8.0+cu128 - Datasets 3.6.0 - Tokenizers 0.21.1
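The focal-loss objective used to train the judge on imbalanced equivalent/non-equivalent pairs can be illustrated with a minimal sketch; the `alpha` and `gamma` defaults below are illustrative, not the paper's settings.

```python
import math

# Hedged sketch of binary focal loss for the equivalence judge
# (alpha/gamma values are illustrative assumptions, not the paper's).
def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """p: predicted probability of 'equivalent'; y: 1 if equivalent else 0.

    The (1 - pt)^gamma factor down-weights easy, well-classified pairs so
    training focuses on hard examples from the minority class.
    """
    pt = p if y == 1 else 1.0 - p           # probability of the true class
    weight = alpha if y == 1 else 1.0 - alpha
    return -weight * (1.0 - pt) ** gamma * math.log(pt)
```

With `gamma=0` and `alpha=0.5` this reduces to a scaled cross-entropy; increasing `gamma` shrinks the loss on confident correct predictions.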
LongWriter Zero 32B
LongWriter-Zero: Mastering Ultra-Long Text Generation via Reinforcement Learning Table of Contents - LongWriter-Zero - Benchmarks & Evaluation - Quick Start - Citation LongWriter-Zero is a purely reinforcement learning (RL)-based large language model capable of generating coherent passages exceeding 10,000 tokens. Built upon Qwen2.5-32B-Base, the training process includes: - 30 billion tokens of continual pretraining on long-form books and technical reports to enhance fundamental writing capabilities; - Group Relative Policy Optimization (GRPO) with a composite reward function: a Length Reward Model (RM) enforces the desired output length, a Writing RM scores fluency, coherence, and helpfulness, and a Format RM ensures strict adherence to the required output structure while detecting repeated content to avoid redundancy; - A dedicated prompting strategy that encourages the model to explicitly reflect before answering, thereby improving structural planning and fine-grained length control. The resulting model, LongWriter-Zero-32B, matches or surpasses the performance of 100B-scale models in ultra-long-form generation. LongWriter-Zero's effectiveness is demonstrated on two fronts: WritingBench and Arena-Write for automatic scoring, and a human-in-the-loop win-rate study for pairwise quality comparison. > WritingBench (scale 1-10) & Arena-Write (Elo) performance of different LLMs. > Donut charts showing win/tie/loss proportions against six baselines (left) and aggregated human evaluation (right). Summary: LongWriter-Zero achieves the highest automatic WritingBench score among open models and secures dominant win-rates in pairwise GPT-4.1 evaluations, confirming its superior quality in ultra-long-form generation while maintaining efficiency. Note: We use a slightly different tokenizer and chat template compared to the original Qwen2.5-32B-Instruct model.
The snippet below shows how to format prompts with LongWriter-Zero's prompt protocol and call the model through an SGLang-powered endpoint supporting streaming responses. ```bibtex @misc{wu2025longwriterzeromasteringultralongtext, title={LongWriter-Zero: Mastering Ultra-Long Text Generation via Reinforcement Learning}, author={Yuhao Wu and Yushi Bai and Zhiqiang Hu and Roy Ka-Wei Lee and Juanzi Li}, year={2025}, eprint={2506.18841}, archivePrefix={arXiv}, primaryClass={cs.CL}, url={https://arxiv.org/abs/2506.18841}, } ```
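A minimal sketch of building a streaming request against an OpenAI-compatible SGLang endpoint is shown below; the endpoint URL, model name, and token budget are illustrative assumptions, not the project's published quick-start values.

```python
import json
import urllib.request

# Hedged sketch: SGLang serves an OpenAI-compatible chat API; the URL,
# model name, and sampling parameters here are placeholder assumptions.
ENDPOINT = "http://localhost:30000/v1/chat/completions"

def build_request(prompt, max_tokens=12000):
    """Build a chat-completions request for an ultra-long-form generation."""
    payload = {
        "model": "LongWriter-Zero-32B",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,  # budget sized for >10k-token outputs
        "stream": True,            # stream partial text as it is generated
    }
    return urllib.request.Request(
        ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

req = build_request("Write a 10,000-word essay on the history of writing.")
# urllib.request.urlopen(req) would then yield server-sent event chunks.
```

Any OpenAI-compatible client (e.g. the `openai` SDK pointed at the same base URL) can replace the raw `urllib` call.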
ADELIE-DPO-1.5B
LLaDA-8B-BGPO-math
[](https://arxiv.org/abs/2510.11683) [](https://github.com/THU-KEG/BGPO) LLaDA-8B-BGPO-math is an 8-billion-parameter diffusion large language model (dLLM) trained from LLaDA-8B-Instruct using Boundary-Guided Policy Optimization (BGPO) for enhanced mathematical capabilities. - Model Type: Diffusion Large Language Model (dLLM) - Parameters: 8 billion - Training Method: Boundary-Guided Policy Optimization (BGPO) - Base Model: LLaDA-8B-Instruct - Task: Mathematics - Language: English - Training Steps: 700 steps - Response Length: 512 tokens - Train Diffusion Steps: 256 - Eval Diffusion Steps: 512 - Block Size: 32 - Monte Carlo Sample Size ($n_t$): 16 - Learning Rate: 5e-7 - Batch Size: 16 - Framework: Built on VeRL (Volcengine Reinforcement Learning) - Primarily designed for mathematical tasks. - Performance may vary on other tasks. - Requires appropriate computational resources for inference.
WildReward-8B
ADELIE-SFT-1.5B
WildReward-4B
LLaDA-8B-BGPO-sudoku
[](https://arxiv.org/abs/2510.11683) [](https://github.com/THU-KEG/BGPO) LLaDA-8B-BGPO-sudoku is an 8-billion-parameter diffusion large language model (dLLM) trained from LLaDA-8B-Instruct using Boundary-Guided Policy Optimization (BGPO) for enhanced Sudoku-solving capabilities. - Model Type: Diffusion Large Language Model (dLLM) - Parameters: 8 billion - Training Method: Boundary-Guided Policy Optimization (BGPO) - Base Model: LLaDA-8B-Instruct - Task: Sudoku - Language: English - Training Steps: 400 steps - Response Length: 256 tokens - Train Diffusion Steps: 128 - Eval Diffusion Steps: 256 - Block Size: 32 - Monte Carlo Sample Size ($n_t$): 32 - Learning Rate: 5e-7 - Batch Size: 16 - Framework: Built on VeRL (Volcengine Reinforcement Learning) - Primarily designed for Sudoku tasks. - Performance may vary on other tasks. - Requires appropriate computational resources for inference.
LLaDA-8B-BGPO-code
[](https://arxiv.org/abs/2510.11683) [](https://github.com/THU-KEG/BGPO) LLaDA-8B-BGPO-code is an 8-billion-parameter diffusion large language model (dLLM) trained from LLaDA-8B-Instruct using Boundary-Guided Policy Optimization (BGPO) for enhanced code generation capabilities. - Model Type: Diffusion Large Language Model (dLLM) - Parameters: 8 billion - Training Method: Boundary-Guided Policy Optimization (BGPO) - Base Model: LLaDA-8B-Instruct - Task: Code generation - Language: English - Training Epochs: 5 epochs (112 steps per epoch) - Total Steps: 560 steps - Response Length: 512 tokens - Train Diffusion Steps: 512 - Eval Diffusion Steps: 512 - Block Size: 32 - Monte Carlo Sample Size ($n_t$): 16 - Learning Rate: 5e-7 - Batch Size: 16 - Framework: Built on VeRL (Volcengine Reinforcement Learning) - Primarily designed for code generation tasks. - Performance may vary on other tasks. - Requires appropriate computational resources for inference.
LLaDA-8B-BGPO-countdown
[](https://arxiv.org/abs/2510.11683) [](https://github.com/THU-KEG/BGPO) LLaDA-8B-BGPO-countdown is an 8-billion-parameter diffusion large language model (dLLM) trained from LLaDA-8B-Instruct using Boundary-Guided Policy Optimization (BGPO) for enhanced planning capabilities on the Countdown task. - Model Type: Diffusion Large Language Model (dLLM) - Parameters: 8 billion - Training Method: Boundary-Guided Policy Optimization (BGPO) - Base Model: LLaDA-8B-Instruct - Task: Countdown - Language: English - Training Steps: 560 steps - Response Length: 256 tokens - Train Diffusion Steps: 128 - Eval Diffusion Steps: 256 - Block Size: 32 - Monte Carlo Sample Size ($n_t$): 16 - Learning Rate: 5e-7 - Batch Size: 16 - Framework: Built on VeRL (Volcengine Reinforcement Learning) - Primarily designed for Countdown tasks. - Performance may vary on other tasks. - Requires appropriate computational resources for inference.
AdaptThink-1.5B-delta0.05
ADELIE-DPO
OpenSAE-LLaMA-3.1-Layer_25
SIRI-7B-high
ADELIE-DPO-3B
SIRI-7B-low
DeepDive-4B-C-GRPO
ADELIE-SFT-3B
ADELIE-SFT
OpenSAE-LLaMA-3.1-Layer_06
R1-Distill-Qwen-7B-VerIF
- Developed by: Hao Peng @ THU-KEG - Model type: RL-trained LLM - Language(s) (NLP): English, Chinese - License: apache-2.0 - Finetuned from model [optional]: deepseek-ai/DeepSeek-R1-Distill-Qwen-7B - Repository: https://github.com/THU-KEG/VerIF - Paper: https://arxiv.org/abs/2506.09942 The model is trained using RL with VerIF on the VerInstruct training data. VerIF is a practical and efficient method for verification in instruction-following reinforcement learning. Built on the idea of Reinforcement Learning with Verifiable Rewards (RLVR), VerIF integrates rule-based code checks with LLM-based reasoning verification (e.g., QwQ-32B) to provide accurate and scalable reward signals. The model is optimized for instruction following without affecting other general capabilities. Evaluation Results We evaluate the model on several representative instruction-following benchmarks, including IFEval, Multi-IF, SysBench, and FollowBench. You can find more details in our GitHub repo (https://github.com/THU-KEG/VerIF). If you find this model helpful, please kindly cite us:
AdaptThink-1.5B-delta0.1
IF-Verifier-7B
- Developed by: Hao Peng @ THU-KEG - Model type: Generative reward model - Language(s) (NLP): English, Chinese - License: apache-2.0 - Finetuned from model [optional]: deepseek-ai/DeepSeek-R1-Distill-Qwen-7B - Repository: https://github.com/THU-KEG/VerIF - Paper: https://arxiv.org/abs/2506.09942 This model is trained from DeepSeek-R1-Distill-Qwen-7B on 131k critic examples from IF-Verifier-Data. It is used to verify the soft constraints of instruction following. Deploying IF-Verifier-7B requires only a single H800 GPU, with an average reward computation time of 120 seconds per batch, which can be further reduced with multiple GPUs. Results Models trained with rewards from this verifier are comparable to those trained with QwQ-32B as the verifier. Summary Please refer to our paper and our GitHub repo (https://github.com/THU-KEG/VerIF) for more details. Citation If this model helps, please kindly cite us:
AdaptThink-1.5B-delta0.01
DeepDive-30B-A3B-C-GRPO
SIRI-1.5B-high
ReaRAG-9B
ReaRAG-9B is trained based on glm-4-9b, with enhanced capability to generate knowledge-guided reasoning chains for iterative RAG. The model supports a context window of up to 8k tokens. Please refer to the Inference section in the GitHub repository for usage details. Citation If you use this model in your research or projects, please consider citing our work:
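The iterative, knowledge-guided RAG loop can be sketched abstractly as follows; the `propose_step` and `retrieve` interfaces are hypothetical stand-ins for illustration, not ReaRAG's actual API.

```python
# Hedged sketch of an iterative RAG loop in the spirit of ReaRAG
# (interfaces hypothetical): the model alternates between issuing search
# actions and reasoning over retrieved evidence until it emits an answer.
def iterative_rag(question, propose_step, retrieve, max_iters=6):
    """propose_step(question, chain) -> ("search", query) or ("finish", answer).

    retrieve(query) returns evidence text; the (query, evidence) pairs
    accumulate into a reasoning chain that conditions the next step.
    """
    chain = []  # accumulated (query, evidence) reasoning steps
    for _ in range(max_iters):
        action, payload = propose_step(question, chain)
        if action == "finish":
            return payload
        chain.append((payload, retrieve(payload)))
    return None  # budget exhausted without a final answer
```

The `max_iters` cap bounds the number of retrieval rounds, which keeps the reasoning chain within the model's 8k-token context window.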
OpenSAE-LLaMA-3.1-Layer_31
AdaptThink-1.5B-delta0.075
AdaptThink-1.5B-delta0
Mistral-Crab-SFT
TULU3-VerIF
SIRI-1.5B-low
SIRI: Scaling Iterative Reinforcement Learning with Interleaved Compression SIRI (Scaling Iterative Reinforcement Learning with Interleaved Compression) is a reinforcement-learning-based framework designed to improve the efficiency and accuracy of Large Reasoning Models (LRMs). Traditional RL training often causes overthinking and long, redundant reasoning traces. Prior methods that compress outputs (length penalties, pruning, or skipping thought tokens) improve efficiency but hurt accuracy. SIRI solves this trade-off by iteratively alternating between compression and expansion of the reasoning budget, controlled by a cosine length scheduler. This approach dynamically balances concise reasoning with long-horizon exploration. - Interleaved Compression-Expansion: - Compression phase: forces concise, high-density reasoning by limiting rollout length. - Expansion phase: restores longer rollouts to encourage exploration and planning. - Token Efficiency without Accuracy Loss: Unlike previous methods, SIRI improves accuracy while reducing average token usage. - Iterative RL Training: Built on GRPO with modifications from DAPO (clip-high/low decoupling, KL removal). - Generalization Across Model Sizes: Validated on both 1.5B and 7B models.
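The cosine length scheduler idea can be sketched minimally as follows; the period and length bounds are illustrative assumptions, not SIRI's actual training settings.

```python
import math

# Hedged sketch of an interleaved cosine length schedule (constants are
# hypothetical): the rollout-length budget oscillates between a compressed
# floor and an expanded ceiling as training alternates the two phases.
def length_budget(step, period=100, min_len=2048, max_len=8192):
    """Cosine-interpolated rollout token budget at a given training step."""
    phase = math.cos(2 * math.pi * (step % period) / period)  # in [-1, 1]
    # phase = +1 -> expansion (max_len); phase = -1 -> compression (min_len)
    return int(min_len + (max_len - min_len) * (phase + 1) / 2)
```

At the start of each period the budget sits at the expansion ceiling, tightens smoothly to the compression floor mid-period, then relaxes again, so the policy repeatedly experiences both dense reasoning and long-horizon exploration.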
DeepDive-30B-A3B-SFT
OpenSAE-LLaMA-3.1-Layer_02
OpenSAE-LLaMA-3.1-Layer_03
OpenSAE-LLaMA-3.1-Layer_04
OpenSAE-LLaMA-3.1-Layer_05
OpenSAE-LLaMA-3.1-Layer_08
OpenSAE-LLaMA-3.1-Layer_09
OpenSAE-LLaMA-3.1-Layer_01-shift_back
OpenSAE-LLaMA-3.1-Layer_02-shift_back
Llama3-Crab-DPO
kopl_semantic_parser
PairJudge-RM
PairJudge RM is a pairwise judge reward model designed to enhance Best-of-N sampling for mathematical reasoning tasks. Instead of assigning arbitrary absolute scores to candidate solutions, PairJudge RM compares them in pairs using chain-of-thought (CoT) reasoning and selects the best answer via a knockout tournament strategy. - Paper: https://arxiv.org/abs/2501.13007 - Code: https://github.com/THU-KEG/PairJudgeRM - Dataset: https://huggingface.co/datasets/THU-KEG/PairJudge-432K - Pairwise Judgment: Evaluates two candidate solutions simultaneously to determine which is more correct. - Chain-of-Thought Reasoning: Leverages CoT to transparently verify each step of the candidate solutions. PairJudge RM is built by fine-tuning a pre-trained language model (e.g., Qwen-2.5-7B-Instruct) on the PairJudge-432K dataset. Key training details include: - Optimizer: Adam - Learning Rate: 1e-5 - Batch Size: 128 - Epochs: 8 Below is an example of how to use PairJudge RM for evaluating candidate solutions: Citation If you find our work useful, please consider citing our paper:
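The knockout tournament selection can be sketched as follows; `judge_better(a, b)` is a hypothetical stand-in for a call to PairJudge RM that returns whichever of the two solutions is judged more correct.

```python
# Hedged sketch of knockout-tournament Best-of-N selection (judge
# interface hypothetical): pairwise winners advance until one remains.
def knockout(candidates, judge_better):
    """judge_better(a, b) -> a or b, the solution judged more correct."""
    pool = list(candidates)
    while len(pool) > 1:
        next_round = []
        for i in range(0, len(pool) - 1, 2):
            next_round.append(judge_better(pool[i], pool[i + 1]))
        if len(pool) % 2 == 1:          # odd one out gets a bye
            next_round.append(pool[-1])
        pool = next_round
    return pool[0]
```

For N candidates this makes N - 1 pairwise judge calls in total, versus N(N-1)/2 for a full round-robin comparison.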
OpenSAE-LLaMA-3.1-Layer_07
OpenSAE-LLaMA-3.1-Layer_11
OpenSAE-LLaMA-3.1-Layer_13
OpenSAE-LLaMA-3.1-Layer_14
OpenSAE-LLaMA-3.1-Layer_18
OpenSAE-LLaMA-3.1-Layer_19
OpenSAE-LLaMA-3.1-Layer_20
OpenSAE-LLaMA-3.1-Layer_22
OpenSAE-LLaMA-3.1-Layer_24
OpenSAE-LLaMA-3.1-Layer_26
OpenSAE-LLaMA-3.1-Layer_30
OpenSAE-LLaMA-3.1-Layer_04-shift_back
LongWriter-V-7B
This model is a fine-tuned version of Qwen/Qwen2.5-VL-7B-Instruct on the LongWriter-V-22K dataset. The following hyperparameters were used during training: - learning_rate: 1e-05 - train_batch_size: 1 - eval_batch_size: 8 - seed: 42 - distributed_type: multi-GPU - num_devices: 8 - gradient_accumulation_steps: 2 - total_train_batch_size: 16 - total_eval_batch_size: 64 - optimizer: adamw_torch with betas=(0.9,0.999), epsilon=1e-08, and no additional optimizer arguments - lr_scheduler_type: cosine - lr_scheduler_warmup_ratio: 0.1 - num_epochs: 3 - Transformers 4.49.0.dev0 - Pytorch 2.5.1+cu124 - Datasets 3.2.0 - Tokenizers 0.21.0
Mistral-Crab-DPO
LLMAEL-ReFinED-FT
LongWriter-V-72B
OpenSAE-LLaMA-3.1-Layer_10
OpenSAE-LLaMA-3.1-Layer_12
OpenSAE-LLaMA-3.1-Layer_16
OpenSAE-LLaMA-3.1-Layer_17
OpenSAE-LLaMA-3.1-Layer_21