Goedel-LM

8 models

Goedel-Prover-V2-32B

Goedel-Prover-V2: The Strongest Open-Source Theorem Prover to Date

[Blog](http://blog.goedel-prover.com) · [GitHub](https://github.com/Goedel-LM/Goedel-Prover-V2) · [Hugging Face](https://huggingface.co/Goedel-LM/Goedel-Prover-V2-32B) · [arXiv:2508.03613](https://arxiv.org/abs/2508.03613) · [License: Apache-2.0](https://opensource.org/licenses/Apache-2.0)

We introduce Goedel-Prover-V2, an open-source language model series that sets a new state of the art in automated formal proof generation. Built on the standard expert-iteration and reinforcement-learning pipeline, our approach incorporates three key innovations:

1. **Scaffolded data synthesis**: we generate synthetic proof tasks of increasing difficulty to progressively train the model, enabling it to master increasingly complex theorems.
2. **Verifier-guided self-correction**: the model learns to iteratively revise its own proofs by leveraging feedback from Lean's compiler, closely mimicking how humans refine their work.
3. **Model averaging**: we combine multiple model checkpoints to improve robustness and overall performance.

Our small model, Goedel-Prover-V2-8B, reaches 84.6% on the MiniF2F test set at Pass@32, matching the performance of the prior state-of-the-art DeepSeek-Prover-V2-671B while being nearly 100 times smaller in model size. Our flagship model, Goedel-Prover-V2-32B, achieves 88.0% on MiniF2F at Pass@32 in standard mode and 90.4% in self-correction mode, outperforming the prior SOTA DeepSeek-Prover-V2-671B and the concurrent Kimina-Prover-72B by a large margin. Additionally, our flagship model with self-correction solves 64 problems on PutnamBench at Pass@64, securing 1st place on the leaderboard and surpassing DeepSeek-Prover-V2-671B's record of 47 problems at Pass@1024.

**Self-correction mode**: our model improves proof quality by first generating an initial candidate and then using Lean compiler feedback to iteratively revise it.
We perform two rounds of self-correction, which remain computationally efficient: the total output length (including the initial proof and two revisions) increases only modestly, from the standard 32K to 40K tokens.

**Figure 1**: Pass@32 performance on MiniF2F, PutnamBench, and our new MathOlympiadBench containing 360 IMO-level problems. The charts demonstrate the state-of-the-art performance of Goedel-Prover-V2. We report all numbers at Pass@32: (1) across all three datasets, our flagship 32B model, in both standard and self-correction mode, significantly outperforms the prior state-of-the-art DeepSeek-Prover-V2-671B and Kimina-Prover-72B; (2) on MiniF2F, our 8B model matches the performance of DeepSeek-Prover-V2-671B while being 100 times smaller in model size.

| Rank | Model | Problems solved | Budget |
| ---- | ----- | --------------- | ------ |
| 1 | Goedel-Prover-V2-32B (self-correction mode) | 86 | Pass@192 |
| 1 | Goedel-Prover-V2-32B (self-correction mode) | 57 | Pass@32 |
| 1 | Goedel-Prover-V2-32B | 43 | Pass@32 |
| 2 | DeepSeek-Prover-V2-671B | 47 | Pass@1024 |
| 2 | DeepSeek-Prover-V2-671B | 22 | Pass@32 |
| 3 | DSP+ | 23 | Pass@128 |
| 4 | Kimina-Prover-7B-Distill | 10 | Pass@192 |
| 5 | Self-play Theorem Prover | 8 | Pass@3200 |
| 6 | Goedel-Prover-V1 | 7 | Pass@512 |

**Table 1**: PutnamBench leaderboard. Goedel-Prover-V2-32B secures the top rank with significantly less compute (pass number) than the previous state-of-the-art.
**Figure 2**: Performance on the MiniF2F test set under different sample budgets. The scaling curves show that our 32B model consistently outperforms all prior state-of-the-art models across the entire range of inference-time compute budgets.

We release our Goedel-Prover-V2 models and the new MathOlympiadBench benchmark to foster future research.

| Model | Download |
| -------- | -------- |
| Goedel-Prover-V2-32B | 🤗 Download |
| Goedel-Prover-V2-8B | 🤗 Download |

| Dataset | Download |
| -------- | -------- |
| MathOlympiadBench | 🤗 Download |

MathOlympiadBench comprises human-verified formalizations of Olympiad-level mathematical competition problems, sourced from the Compfiles and IMOSLLean4 repositories. It contains 360 problems: 158 IMO problems from 1959 to 2024, 131 IMO shortlist problems covering 2006 to 2023, 68 regional mathematical Olympiad problems, and 3 additional mathematical puzzles.

This model is being released to aid other open-source projects, including those geared toward the upcoming IMO competition. A full paper with all details will be released in the coming weeks.

Quick Start

You can use Hugging Face Transformers directly for model inference.
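A minimal inference sketch with Transformers is shown below. The model ID comes from the Hugging Face link above; the prompt wording in `build_prompt` and the generation settings are illustrative assumptions, so check the model card for the exact prompt format the prover expects.

```python
def build_prompt(formal_statement: str) -> str:
    """Wrap a Lean 4 statement in a simple proving instruction (assumed format)."""
    return (
        "Complete the following Lean 4 proof:\n\n"
        f"```lean4\n{formal_statement}\n```"
    )


def run_inference() -> None:
    """Load Goedel-Prover-V2-32B and generate a proof (requires a GPU and
    the `transformers` / `torch` packages; not called at import time)."""
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "Goedel-LM/Goedel-Prover-V2-32B"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    )

    statement = "theorem add_comm' (a b : Nat) : a + b = b + a := by sorry"
    messages = [{"role": "user", "content": build_prompt(statement)}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    outputs = model.generate(inputs, max_new_tokens=32768)
    print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

Call `run_inference()` to generate a proof; the 32K-token generation budget mirrors the standard-mode output length reported above.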

license: apache-2.0 · 18,227 downloads · 58 likes

Goedel-Prover-V2-8B

license: apache-2.0 · 5,635 downloads · 17 likes

Goedel-Prover-SFT

llama · 762 downloads · 29 likes

Goedel-Formalizer-V2-8B

This is the formalizer for translating informal math problems into formal statements in Lean 4. In the Goedel-Prover-V2 project we use this formalizer to generate Lean 4 statements. Unlike all previous open-source formalizers, this formalizer can think before generating the Lean 4 statements, which significantly increases the accuracy of the translation. In our internal evaluation, we tested formalizers on an eval set of 300 informal statements from Omni-MATH: Kimina-formalizer-8B successfully translates 161/300 statements, while Goedel-Formalizer-V2-32B succeeds on 228.

license: apache-2.0 · 680 downloads · 4 likes

Goedel-Formalizer-V2-32B

license: apache-2.0 · 233 downloads · 7 likes

Goedel-Prover-DPO

llama · 5 downloads · 4 likes

Goedel-Formalizer-32B-SonnetAnnotated

license: mit · 1 download · 1 like

Goedel-Formalizer-32B-LeanWorkbookAnnotated

license: mit · 1 download · 0 likes