xl-zhao

14 models

PromptCoT 2.0 SelfPlay 30B A3B

This model is part of **PromptCoT 2.0** (*Scaling Prompt Synthesis for LLM Reasoning*). It is a 30B-A3B model trained via self-play, where synthesized problems from PromptCoT 2.0 provide verifiable feedback (unit tests for code, boxed answers for math). The training loop uses Direct Preference Optimization (DPO) to align generations with automatically verified outcomes, removing the dependence on stronger external teachers. The model achieves state-of-the-art performance at the 30B scale, competitive with closed-source models such as Gemini 2.5 Pro and OpenAI o3.

- **Self-play training:** the model improves autonomously on synthetic math and code problems generated by PromptCoT 2.0. Positive/negative pairs are constructed from verifiable feedback signals (unit-test success, final-answer correctness).
- **Competitive with closed-source models:** despite activating only 3B parameters, the model achieves results comparable to Gemini 2.5 Pro and OpenAI o3 across six benchmarks (AIME 24/25, HMMT Feb 25, LiveCodeBench v5/v6, Codeforces), while surpassing strong open-source baselines.
- **Math + code reasoning:** strong, balanced gains across both Olympiad-level math (AIME, HMMT) and competitive programming (LiveCodeBench, Codeforces).
- **Efficient scaling:** only 3B activated parameters during self-play fine-tuning, making it significantly more efficient than comparable closed-source models.

- 📄 Paper: PromptCoT 2.0
- 💻 GitHub: inclusionAI/PromptCoT
- 📊 Dataset: PromptCoT-2.0-SelfPlay-30B-11K

If you find this model useful, please consider citing:

license:mit
62
3

PromptCoT-2.0-Prompt-Generation-Model

This repository hosts the **Problem Generation Model (PGM)** used in PromptCoT 2.0, a framework for scalable prompt synthesis that advances LLM reasoning in mathematics and programming.

- **Input:** a set of domain concepts (math or programming) and an optional difficulty tag.
- **Output:** a rationale (the structured "thinking process" that connects the concepts) followed by a fully formed problem (Olympiad-level math or coding task).

**How it fits into PromptCoT 2.0:** the framework jointly trains two models via an EM optimization loop:

- **Rationale generator (E-step):** infers rationales given concepts and problems, updated via reinforcement learning with reward signals.
- **Problem Generation Model (M-step):** learns to produce rationale–problem pairs conditioned only on concepts.

At inference time, the PGM is all you need: provide concepts and it generates (rationale → problem) in one pass, without handcrafted templates or domain-specific heuristics.

- **Model type:** causal language model for problem generation.
- **Training data:** concept–rationale–problem triples synthesized and refined via PromptCoT 2.0.
- **Domains:** mathematics (Olympiad-level) and programming (competitive programming).
- **Initialization:** warm-started from `Qwen2.5-32B-Base` with cold-start annotations (concepts and rationales) generated by instruction-tuned models.

You can load this model with Hugging Face `transformers`. Use the templates below for inference, replacing `{concepttext}` and `{level}`. The output will first include a rationale (a multi-step explanation of how the concepts are combined) and then a precise problem statement.

The PGM is the core component powering the creation of:

- **Self-play datasets:** math/code problems paired with verifiable answers or unit tests.
- **SFT datasets:** problems with complete reasoning traces distilled from teacher models.
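The inference flow described above can be sketched as follows. The placeholder names `{concepttext}` and `{level}` come from the model card, but the surrounding template wording and the repository id are assumptions; prefer the exact templates published with the model:

```python
# Illustrative template only; use the templates shipped with the model card.
PGM_TEMPLATE = (
    "Concepts: {concepttext}\n"
    "Difficulty: {level}\n"
    "First write a rationale that connects the concepts, then state the problem."
)


def build_prompt(concepts: list[str], level: str) -> str:
    """Fill the concept/difficulty template for the PGM."""
    return PGM_TEMPLATE.format(concepttext=", ".join(concepts), level=level)


def generate_problem(concepts: list[str], level: str) -> str:
    """Generate (rationale -> problem) in one pass.

    Downloads the model weights on first use; repo id assumed from this page.
    """
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "xl-zhao/PromptCoT-2.0-Prompt-Generation-Model"  # assumed repo id
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
    inputs = tokenizer(build_prompt(concepts, level), return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=1024)
    return tokenizer.decode(out[0], skip_special_tokens=True)
```

For example, `generate_problem(["modular arithmetic", "pigeonhole principle"], "olympiad")` would return a rationale connecting the two concepts followed by a problem statement.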
PromptCoT 2.0 demonstrates that rationale-driven prompt synthesis yields harder and more diverse problems than existing datasets.

**Self-Play (30B-A3B):** strong gains in both mathematics and programming.
- Math: 92.1 on AIME 24, 89.8 on AIME 25, 76.7 on HMMT Feb 25.
- Code: 74.2 on LiveCodeBench v5, 71.0 on v6, and 2079 Elo on Codeforces.

Overall, performance is competitive with Gemini 2.5 Pro / OpenAI o3 and surpasses strong open-source baselines.

**SFT (7B, 100% synthetic):** fully synthetic data can rival or outperform human-written datasets.
- Math: 73.1 on AIME 24, 65.6 on AIME 25, 46.5 on HMMT Feb 25.
- Code: 53.4 on LiveCodeBench v5, 48.9 on v6, and 1815 Elo on Codeforces.

These results exceed human-written baselines such as OpenMathReasoning and OpenCodeReasoning, highlighting the scalability of synthetic data.

- 📄 Paper (arXiv:2509.19894)
- 🤗 HF Collection
- 📚 PromptCoT 2.0 SFT Data (4.8M prompts)
- 🤖 PromptCoT 2.0 SFT Model (7B)
- 🎮 Self-Play Models (4B, 30B-A3B)

If you find this model or the PromptCoT 2.0 framework useful, please cite:

license:mit
9
1

PromptCoT-2.0-SelfPlay-4B

This model is part of **PromptCoT 2.0** (*Scaling Prompt Synthesis for LLM Reasoning*). It is a 4B model trained via self-play, where synthesized problems from PromptCoT 2.0 provide verifiable feedback (unit tests for code, boxed answers for math). The training loop uses Direct Preference Optimization (DPO) to align generations with automatically verified outcomes, removing the dependence on stronger external teachers. The model establishes new state-of-the-art performance at the 4B scale, consistently outperforming strong open-source baselines and curated datasets.

- **Self-play training:** the model improves autonomously on synthetic math and code problems generated by PromptCoT 2.0. Positive/negative pairs are constructed from verifiable feedback signals (unit-test success, final-answer correctness).
- **Strong baseline improvements:** outperforms Qwen3-4B-Thinking-2507 and surpasses curated datasets such as OpenMathReasoning, OpenCodeReasoning, and OpenThoughts3 across all six benchmarks.

Evaluation on six benchmarks under the self-play setting with 4B parameters (**bold** = best):

| Model | AIME 24 | AIME 25 | HMMT Feb 25 | LiveCodeBench v5 (2408–2502) | LiveCodeBench v6 (2502–2505) | Codeforces |
|---|---|---|---|---|---|---|
| Qwen3-4B-Thinking-2507 | 85.2 | 81.3 | 55.5 | 63.8 | 55.2 | 1852 |
| OpenCodeReasoning | 83.1 | 78.5 | 50.4 | 64.4 | 57.1 | 1867 |
| OpenMathReasoning | 85.3 | 83.0 | 56.8 | 59.7 | 48.5 | 1826 |
| OpenThoughts3 | 84.7 | 80.6 | 54.2 | 65.2 | 54.4 | 1846 |
| OpenR1 | 84.6 | 80.9 | 56.7 | 63.0 | 54.6 | 1829 |
| PromptCoT 1.0 | 85.3 | 81.8 | 58.6 | 64.5 | 56.7 | 1878 |
| PromptCoT 2.0 | **87.3** | **85.0** | **66.5** | **67.7** | **61.1** | **1934** |

PromptCoT 2.0 achieves the best scores across all six benchmarks: AIME 24/25, HMMT Feb 25, LiveCodeBench v5/v6, and Codeforces.
- **Large gains on high-difficulty tasks:** +11.0 points on HMMT Feb 25, +5.9 on LiveCodeBench v6, and +82 Elo on Codeforces over the Qwen3-4B-Thinking-2507 baseline.
- **Beyond curated baselines:** unlike OpenMathReasoning, OpenCodeReasoning, and OpenThoughts3, which saturate on strong 4B bases, PromptCoT 2.0 continues to deliver significant improvements.

- 📄 Paper: PromptCoT 2.0
- 💻 GitHub: inclusionAI/PromptCoT
- 📊 Dataset: PromptCoT-2.0-SelfPlay-4B-48K

If you find this model useful, please consider citing:
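The construction of DPO preference pairs from verified outcomes, as described above, can be sketched as follows (a minimal illustration under the stated setup; the function name is hypothetical):

```python
def build_dpo_pairs(prompt, generations, verify):
    """Split sampled generations by verifiable outcome and pair them up as
    (chosen, rejected) preference examples for DPO training.

    `verify` is any automatic check, e.g. unit-test execution for code or
    boxed-answer matching for math.
    """
    chosen_set = [g for g in generations if verify(g)]
    rejected_set = [g for g in generations if not verify(g)]
    return [
        {"prompt": prompt, "chosen": c, "rejected": r}
        for c in chosen_set
        for r in rejected_set
    ]
```

Because preference labels come from the verifier rather than a stronger model's judgments, the loop needs no external teacher.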

license:mit
8
0

PromptCoT-Problem-Generation-Model

llama
7
0

PromptCoT-2.0-SFT-7B

This model is part of **PromptCoT 2.0** (*Scaling Prompt Synthesis for LLM Reasoning*). It is a 7B-parameter model trained entirely on synthetic prompts generated by PromptCoT 2.0, with reasoning trajectories distilled from GPT-OSS-120B (medium). Unlike prior work (e.g., OpenMathReasoning, OpenCodeReasoning) that relies on human-written prompts, this model demonstrates that fully synthetic data can match or even surpass manually curated datasets for advancing reasoning in both mathematics and programming.

Below we compare it against two widely used human-written prompt baselines.

> Metric: Pass@1 for AIME 24/25, HMMT Feb 25, and LiveCodeBench v5/v6; Elo for Codeforces.

| Model | Prompt Source | Teacher | AIME24 | AIME25 | HMMT Feb25 | LiveCodeBench v5 (2408–2502) | LiveCodeBench v6 (2502–2505) | Codeforces |
|---|---|---|:---:|:---:|:---:|:---:|:---:|:---:|
| PromptCoT-2.0-SFT-7B | Synthetic | GPT-OSS-120B (med.) | 73.1 | 65.6 | 46.5 | 53.4 | 48.9 | 1815 |
| OpenMathReasoning | Human | DeepSeek-R1 | 73.3 | 58.1 | 42.1 | 9.7 | 10.7 | 676 |
| OpenCodeReasoning | Human | DeepSeek-R1 | 11.7 | 7.7 | 6.0 | 50.5 | 42.0 | 1648 |

Takeaways:
- **Fully synthetic wins:** PromptCoT-2.0-SFT-7B outperforms the human-written baselines on most math benchmarks and all code benchmarks.
- **Scalable and practical:** high performance without manual prompt curation suggests a clear path to scaling reasoning with synthetic data.
You can load the model via Hugging Face `transformers`.

- **Data:** 4.8M fully synthetic prompts generated by PromptCoT 2.0
- **Teacher:** GPT-OSS-120B (medium), used for reasoning-trajectory distillation
- **Domains:** mathematics (Olympiad-level) and programming (competitive coding)
- **Training regime:** supervised fine-tuning (SFT), 100% synthetic data

Highlights:
- **Fully synthetic prompts work:** no reliance on human-written datasets.
- **Compact trajectories:** distilled responses are shorter than those in prior datasets, reducing inference cost while maintaining quality.
- **Scalability:** opens the door to training larger reasoning models on purely synthetic corpora.

If you use this model or the PromptCoT 2.0 dataset, please cite:
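The `transformers` loading step mentioned above can be sketched as follows. The repository id is assumed from this page; verify it against the model card:

```python
MODEL_ID = "xl-zhao/PromptCoT-2.0-SFT-7B"  # assumed repo id


def solve(question: str, max_new_tokens: int = 2048) -> str:
    """Generate a reasoning trace for a single question.

    Downloads the 7B weights on first call; requires a GPU for practical use.
    """
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")
    messages = [{"role": "user", "content": question}]
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Strip the prompt tokens so only the generated reasoning is returned.
    return tokenizer.decode(
        out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
```

A generous `max_new_tokens` budget is left as the default because long-form reasoning traces are the model's intended output.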

license:mit
7
0

PromptCoT-DS-7B

license:mit
4
0

PromptCoT-QwQ-32B

license:mit
3
1

formal_proof_generator_v1_iter3

llama
3
0

formal_proof_generator_v2_iter3

llama
3
0

formal_proof_generator_v4_iter3

llama
3
0

PromptCoT-Mamba-7B

license:mit
2
3

formal_proof_generator_v3_iter3

llama
2
1

PromptCoT-DS-1.5B

license:mit
2
0

PromptCoT-Mamba-Math-7B

license:mit
2
0