dbest-isi

2 models • 2 total models in database

Searchless Chess 9M Selfplay

A 9-million-parameter transformer-based chess engine trained via self-play with Stockfish evaluation. This model learns to play chess without explicit search during inference, relying purely on learned pattern recognition.

- Model Size: 9M parameters (8 layers, 256 embedding dim, 8 attention heads)
- Architecture: Decoder-only transformer with learned positional encodings
- Training Method: Self-play with Stockfish rewards
- Framework: JAX + Haiku
- Q-Value Distribution: 128 return buckets for action-value prediction

This model predicts action-values (Q-values) for chess positions without performing tree search, which makes inference extremely fast while maintaining strong play.

Install the required dependencies for CPU inference; for other CUDA versions, see the JAX installation guide. Note: this model includes all necessary code and can be used without cloning the original repository.

Training details:
- Base Model: Initialized from the pretrained 9M checkpoint
- Training Method: Self-play reinforcement learning
- Reward Signal: Stockfish evaluation at depth 20
- Iteration: 22 (EMA parameters)
- Action Space: 1968 possible moves (all legal chess moves)
- Value Representation: Discretized into 128 buckets

Intended uses:
- Fast chess move prediction without search
- Chess position evaluation
- Research on learned planning in board games
- Integration into chess applications requiring low-latency move suggestions

Limitations:
- Does not perform explicit search (unlike traditional chess engines)
- May make suboptimal moves in complex tactical positions
- Performance depends on the training data distribution
- Best suited for fast move suggestions rather than deep analysis

This model is based on the architecture from DeepMind's Searchless Chess work. The self-play training implementation and this trained model are original work by Darrell Best.
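The 128-bucket value representation can be illustrated with a short sketch. It assumes uniform buckets over a [0, 1] win-probability range, as in the original Searchless Chess work; the function names here are illustrative, not taken from the model's code.

```python
NUM_BUCKETS = 128  # matches the card's 128 return buckets

def prob_to_bucket(p: float, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a win probability in [0, 1] to a discrete return bucket."""
    # Clamp so p == 1.0 falls in the last bucket instead of overflowing.
    return min(int(p * num_buckets), num_buckets - 1)

def bucket_to_prob(b: int, num_buckets: int = NUM_BUCKETS) -> float:
    """Recover a representative value (the bucket midpoint) for a bucket index."""
    return (b + 0.5) / num_buckets

print(prob_to_bucket(0.0))           # 0
print(prob_to_bucket(1.0))           # 127
print(round(bucket_to_prob(64), 6))  # 0.503906
```

Discretizing values this way turns value regression into a 128-way classification problem, which is why the model outputs a distribution over return buckets rather than a single scalar.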
For the full self-play training implementation and codebase, visit:

- Repository: https://github.com/DarrellBest/searchlesschess

For questions or issues, please open an issue on the GitHub repository.

license:apache-2.0

Searchless Chess 9M DPO

This is a 9-million-parameter transformer-based chess engine trained using Direct Preference Optimization (DPO) with mistake-focused self-play and Stockfish supervision.

- Architecture: Transformer with 8 layers, 256 embedding dim, 8 attention heads
- Training Method: DPO (Direct Preference Optimization) with self-play
- Framework: JAX/Haiku
- Parameters: ~9 million
- Base Model: DeepMind's Searchless Chess 9M
- Training Iteration: 1
- Self-play Games: 1000 games
- Preference Pairs: 36,407 (model mistakes)
- Training Steps: 50 gradient steps
- Final Loss: 0.6890 (down from 0.6931)

Puzzle Solving:
- Base 9M model: 87% accuracy
- DPO-trained model: 88% accuracy
- +1% improvement overall, with the best gains in the 1000-1500 rating range (+3.45%)

Head-to-Head Games (50 games):
- Win-Draw-Loss: 24-9-17 (vs. base 9M)
- Win rate: 57%
- Elo Improvement: +25 Elo (BayesElo calculation)

Direct Preference Optimization (DPO) is a preference-based learning algorithm that optimizes the policy directly, without requiring a separate reward model. The training process:

1. Self-Play Generation: The model plays 1000 games against itself
2. Mistake Identification: Stockfish analyzes each position to find model errors
3. Preference Pair Creation: For each mistake:
   - Chosen action: Stockfish's move (better outcome)
   - Rejected action: the model's move (worse outcome)
   - Filtering: Only include mistakes with an eval difference > 0.3 pawns
4.
DPO Training: Optimize the policy to prefer Stockfish's moves using the DPO loss.

Training configuration:
- Base Model: 9M parameter action-value model (pre-trained by DeepMind)
- Training Algorithm: Direct Preference Optimization (DPO)
- Self-play Games: 1000 games per iteration
- Preference Pairs Found: 36,407 (mistakes where the model played suboptimal moves)
- Batch Size: 32
- Learning Rate: 1e-5
- Gradient Steps: 50 per iteration
- DPO Beta: 0.1 (KL penalty coefficient)
- Eval Threshold: 0.3 pawns (minimum mistake margin)
- Stockfish Analysis: Depth 20, 0.1 s per position
- Optimizer: Adam with gradient clipping (max norm 1.0)
- EMA Decay: 0.999 (used for inference)
- Reference Model: Updated every 3 iterations for stability

The training objective is the standard DPO loss:

L_DPO(θ) = −E[ log σ( β ( log(π_θ(y_w|x) / π_ref(y_w|x)) − log(π_θ(y_l|x) / π_ref(y_l|x)) ) ) ]

where:
- π_θ: Current policy (converted from Q-values)
- π_ref: Reference policy (frozen snapshot)
- β: KL penalty coefficient
- τ: Temperature for the softmax conversion of Q-values to a policy
- (x, y_w, y_l): a position x with chosen move y_w and rejected move y_l

Architecture:
- Input: 77-token FEN representation
- Embedding: 256 dimensions
- Layers: 8 transformer blocks
- Attention Heads: 8 per layer
- Output: 128-bucket Q-value distribution over actions
- Positional Encoding: Learned
- Activation: GELU in feed-forward layers
- Total Parameters: ~9M

The model was trained using mistake-focused self-play:

1. Generate Self-Play Games: The model plays 1000 games against itself from diverse openings
2. Analyze with Stockfish: Each position is analyzed at depth 20 (0.1 s per move)
3. Extract Preferences: 36,407 position-move pairs where the model made mistakes
4. Filter Quality:
   - Eval difference ≥ 0.3 pawns (meaningful mistakes)
   - Position quality |eval| ≤ 3.0 pawns (avoid blown positions)
5. DPO Training: 50 gradient steps optimizing preference likelihood
6.
Checkpoint: Save EMA parameters for inference.

Strengths:
- Improved tactical accuracy (fewer blunders)
- Better move selection in middlegame positions
- Stronger in the 1000-1500 Elo puzzle range

Current Limitations:
- Early training (only 1 iteration completed)
- Limited self-play data (1000 games)
- No explicit opening book or endgame tablebase
- Evaluation based on Q-values, not full search

Future Work:
- Continue training for more iterations (recommended: 10 iterations)
- Progressive curriculum (increase Stockfish depth over time)
- Larger batch sizes and more gradient steps
- Test on a wider puzzle range and benchmark positions

| Metric | Base 9M | DPO-trained | Improvement |
|--------|---------|-------------|-------------|
| Puzzle Accuracy | 87% | 88% | +1% |
| Head-to-Head Win Rate | 43% | 57% | +14% |
| Elo Rating | Baseline | +25 | +25 Elo |

Based on the Searchless Chess work by DeepMind Technologies Limited:
- Original Searchless Chess Repository
- Training Code and Documentation
- DPO Paper
- Searchless Chess Paper
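The DPO objective described above can be sketched numerically. This is a minimal pure-Python illustration (not the repository's JAX/Haiku implementation): Q-values are converted to a policy with a temperature softmax, and the loss is computed for one preference pair. At initialization the current and reference policies coincide, so the loss is ln 2 ≈ 0.6931, matching the starting loss reported above.

```python
import math

def softmax(qs, tau=1.0):
    """Convert per-move Q-values to a policy via a temperature softmax."""
    exps = [math.exp(q / tau) for q in qs]
    z = sum(exps)
    return [e / z for e in exps]

def dpo_loss(policy, ref_policy, chosen, rejected, beta=0.1):
    """DPO loss for a single preference pair (indices of chosen/rejected moves)."""
    logit = beta * (
        math.log(policy[chosen] / ref_policy[chosen])
        - math.log(policy[rejected] / ref_policy[rejected])
    )
    return math.log1p(math.exp(-logit))  # -log(sigmoid(logit))

q_values = [0.2, 0.8, -0.5]   # toy Q-values for three legal moves (hypothetical)
ref = softmax(q_values)       # frozen reference policy
pol = softmax(q_values)       # current policy, identical at initialization

print(round(dpo_loss(pol, ref, chosen=1, rejected=0), 4))  # 0.6931
```

As the tuned policy shifts probability mass toward the chosen (Stockfish) move relative to the reference, the logit becomes positive and the loss drops below ln 2; the small β = 0.1 keeps the policy close to the reference snapshot.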

license:apache-2.0