# HarleyCooper/nanochat561
> **Status (Oct 22, 2025 @ 11:30 PM MT):** Oops again! The tokenizer files on the Hub got out of sync with the weights, so `HarleyCooper/nanochat561` currently produces gibberish at inference, and I am unsure of the fix. The package was meant to be used alongside Karpathy's custom tokenizer/model modules (`trust_remote_code=True`). During the dozen tweaks I attempted, I also changed the tokenizer metadata several times, so the pickled tokenizer and the model weights went out of sync. I also can't find my original `.pt` checkpoints.

## nanochat: The Best ChatGPT That "About Two Fifty" Can Buy

> "Oops, I thought I had a custom-model-to-HF inference converter, but it's not working. You can run locally, in the HF Spaces I will put up, via Gradio, or with the inference engine Karpathy built, right out of the box. – christian"

nanochat's published checkpoint is a 20-layer Transformer with 560,988,160 learnable parameters. The count comes from:

| Component | Calculation | Params |
| --- | --- | --- |
| Token embedding | 65,536 vocab × 1,280 dim | 83,886,080 |
| 20 × attention projections | 20 × (4 × 1,280 × 1,280) | 131,072,000 |
| 20 × MLP projections | 20 × [1,280 × (4 × 1,280) + (4 × 1,280) × 1,280] | 262,144,000 |
| Output head | 1,280 dim × 65,536 vocab | 83,886,080 |
| **Total** | | **560,988,160** |

The embeddings are untied, so the input lookup and the output head each contribute the same ~84M parameters, while each of the 20 decoder blocks supplies ~19.7M parameters, split between attention (6.6M) and MLP (13.1M) weights.

nanochat is a full-stack implementation of a ChatGPT-like language model trained from scratch in a single clean, minimal, hackable codebase. It demonstrates that capable conversational AI can be built on a modest computational budget, making language-model training accessible to researchers, educators, and practitioners.
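The parameter table above can be verified with plain arithmetic:

```python
# Sanity check of the parameter count table (numbers taken from the table above).
vocab, dim, layers = 65_536, 1_280, 20

token_embedding = vocab * dim                        # input lookup
attention = layers * (4 * dim * dim)                 # Q, K, V, and output projections
mlp = layers * (dim * (4 * dim) + (4 * dim) * dim)   # up- and down-projections
output_head = dim * vocab                            # untied, so counted separately

total = token_embedding + attention + mlp + output_head
print(total)  # 560988160
```

Because the embeddings are untied, `token_embedding` and `output_head` are equal but counted twice.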
This live model card documents the 561M-parameter (`depth=20`, `sequence_len=2048`) run currently training on 8x NVIDIA H100 80GB GPUs via Lambda Labs. Once the run finishes, we will post the resulting checkpoints, evaluation artifacts, and chat-ready weights here.

This implementation represents a complete pipeline from raw text to a deployable chat interface, including:

- Custom tokenizer training (BPE)
- Pretraining on web-scale data
- Instruction fine-tuning
- Conversational adaptation
- Optional reinforcement learning from human feedback

**Key innovation:** nanochat shows that with careful architectural choices and efficient training procedures, a capable ChatGPT clone can be trained for approximately $100 in compute, making it a valuable educational resource and research baseline.

nanochat implements a modern Transformer architecture with several key improvements over the original GPT design.

**Core specifications:**

- Parameters: 560,988,160 (~561M)
- Layers: 20 (configurable via the `--depth` parameter)
- Model dimension: 1,280
- Attention heads: 10 (128 dimensions per head)
- KV heads: 10 (Multi-Query Attention capable)
- Vocabulary size: 65,536 tokens (2^16)
- Context length: 2,048 tokens
- Precision: BFloat16

**Architectural choices:**

1. **Rotary Position Embeddings (RoPE):** Unlike traditional learned positional embeddings, RoPE provides better length generalization and more efficient position encoding.
2. **RMSNorm:** Replaces LayerNorm for improved training stability and efficiency, reducing computational overhead while maintaining performance.
3. **Multi-Query Attention (MQA):** Enables efficient inference by sharing key-value projections across heads, reducing memory-bandwidth requirements.
4. **Untied embeddings:** Separate input and output embedding matrices provide additional model capacity without a prohibitive parameter increase.
5. **ReLU² activation:** Uses squared ReLU in the feedforward network for improved expressiveness compared to standard activations.
6. **QK normalization:** Normalizes query and key vectors before the attention computation for training stability.

nanochat achieves its training efficiency through:

**Optimizer design:**

- **Muon optimizer:** Applied to the weight matrices in attention and MLP layers, providing strong convergence properties
- **AdamW:** Used for embedding layers, where adaptive learning rates are beneficial
- **Learning-rate scaling:** The learning rate is automatically scaled by 1/√(dim/768) for larger models

**Computational profile:**

- Target compute: ~4e19 FLOPs
- FLOPs per token: 3.49e9
- Model FLOPs Utilization (MFU): ~48% on H100 GPUs
- Throughput: 1.08M tokens/second on 8x H100

**Primary configuration (current run):**

- GPUs: 8x NVIDIA H100 80GB (SXM)
- Provider: Lambda Labs GPU Cloud (via the Hyperbolic deployment automation)
- Launch time: Oct 13, 2025 @ 9:00 PM ET
- Projected runtime: ~10.5 hours for base pretraining plus alignment
- Projected cost: ~$250 at $24/hour for the 8x H100 node

**Alternative configurations:**

- 8x A100 80GB (adds ~35% to wall-clock time)
- 4x H100 80GB with increased gradient accumulation
- Single-GPU experiments by lowering `depth`, `device_batch_size`, and `max_seq_len`

nanochat employs a four-stage training pipeline optimized for performance and capability development.

### Stage 1: Tokenizer training

- Duration: ~1 minute
- Training data: 2 billion characters from FineWeb-EDU
- Algorithm: Byte-Pair Encoding (BPE) with regex splitting
- Vocabulary size: 65,536 tokens
- Compression ratio: 4.8 characters/token

The custom Rust-based tokenizer (`rustbpe`) provides the training performance needed for rapid iteration while remaining compatible with OpenAI's tiktoken for efficient inference.
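The learning-rate rule above (scale by 1/√(dim/768)) can be sketched as a small helper. The base rate of 0.02 below is purely illustrative, not nanochat's actual setting:

```python
import math

def scaled_lr(base_lr: float, model_dim: int, base_dim: int = 768) -> float:
    """Scale a base learning rate by 1/sqrt(model_dim / base_dim)."""
    return base_lr / math.sqrt(model_dim / base_dim)

# At the reference width (768) the rate is unchanged; at dim=1280 it shrinks ~23%.
print(scaled_lr(0.02, 768))             # 0.02
print(round(scaled_lr(0.02, 1280), 4))  # 0.0155
```

Wider models get proportionally smaller steps, which keeps update magnitudes roughly comparable as `--depth` (and with it the model dimension) grows.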
**Tokenizer performance vs. baselines:**

- Superior to GPT-2's tokenizer (50,257 tokens) across all categories except mathematics
- Optimized for FineWeb-EDU's document distribution (English-heavy web text)
- Competitive with GPT-4's tokenizer on English text despite the smaller vocabulary

### Stage 2: Pretraining (base model)

- Duration: ~10.5 hours (projected; 21,400 steps at ~1.9 s/step)
- Iterations: 21,400 steps
- Training tokens: 11.2 billion (following Chinchilla-optimal scaling)
- Batch size: 524,288 tokens per step (32 sequences × 2,048 tokens × 8 GPUs)
- Dataset: FineWeb-EDU (240 shards, ~24GB)
- Final validation loss: 0.81 bits per byte
- CORE score: 0.2219

**Chinchilla scaling adherence:** Following Hoffmann et al. (2022), nanochat trains with a 20:1 token-to-parameter ratio:

- Parameters: 560M
- Optimal training tokens: 560M × 20 = 11.2B tokens
- Compute budget: 6 × 560M × 11.2B ≈ 3.8e19 FLOPs

This ensures maximal performance for the allocated compute budget, avoiding both undertrained and overtrained regimes.
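The Chinchilla numbers above can be reproduced in a few lines:

```python
# Reproducing the Chinchilla-style budget quoted above (pure arithmetic).
params = 560e6
tokens = 20 * params          # 20:1 token-to-parameter ratio -> 11.2B tokens
flops = 6 * params * tokens   # ~6 FLOPs per parameter per training token

batch_tokens = 32 * 2048 * 8  # sequences per GPU x seq_len x GPUs = 524,288
steps = tokens / batch_tokens

print(f"{tokens / 1e9:.1f}B tokens, {flops:.2e} FLOPs, {steps:,.0f} steps")
# 11.2B tokens, 3.76e+19 FLOPs, 21,362 steps
```

The computed step count (21,362) lands a hair under the quoted 21,400, since the last partial batch still counts as a full step.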
**Training data composition:** FineWeb-EDU is a high-quality subset of CommonCrawl filtered for educational content, providing:

- Diverse knowledge across domains
- High-quality English prose
- Educational and informative content
- Minimal toxic or low-quality text

### Stage 3: Midtraining (instruction adaptation)

- Duration: ~8 minutes
- Dataset mixture:
  - SmolTalk: 460K conversational examples
  - MMLU auxiliary: 100K multiple-choice questions
  - GSM8K: 8K math problems with tool use
  - Total: 568K instruction examples

**Purpose:** Midtraining bridges the gap between document completion (pretraining) and conversational interaction:

- Teaches multi-turn conversation structure
- Introduces special tokens for chat formatting
- Develops multiple-choice reasoning capabilities
- Enables tool use (a Python interpreter for mathematics)
- Adapts the model to structured dialogue patterns

**Chat format:** nanochat uses OpenAI's Harmony-style formatting.

**Evaluation results (after midtraining):**

- ARC-Easy: 0.3561
- ARC-Challenge: 0.2875
- MMLU: 0.3111
- GSM8K: 0.0250
- HumanEval: 0.0671
- ChatCORE: 0.0730

### Stage 4: Supervised fine-tuning (SFT)

- Duration: ~7 minutes
- Focus: high-quality conversation refinement
- Key adaptation: padding to match the inference-time format

SFT performs final refinement on carefully curated conversational data, eliminating the domain shift between packed training sequences and padded inference sequences.
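The Harmony-style chat format mentioned above can be sketched as role-tagged turns wrapped in special tokens. The token strings below are illustrative assumptions, not nanochat's confirmed special-token vocabulary; see `tokenizer.py` for the real one:

```python
# Hedged sketch of Harmony-style chat rendering (token names are assumptions).
def render_conversation(messages):
    parts = ["<|bos|>"]
    for msg in messages:
        role = msg["role"]  # "user" or "assistant"
        parts.append(f"<|{role}_start|>{msg['content']}<|{role}_end|>")
    return "".join(parts)

chat = [
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "4"},
]
print(render_conversation(chat))
# <|bos|><|user_start|>What is 2 + 2?<|user_end|><|assistant_start|>4<|assistant_end|>
```

Midtraining teaches the model these delimiters, which is why it can later stop cleanly at the end of an assistant turn during inference.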
**Final evaluation results (after SFT):**

- ARC-Easy: 0.3876 (+3.15 points)
- ARC-Challenge: 0.2807 (−0.68 points)
- MMLU: 0.3151 (+0.40 points)
- GSM8K: 0.0455 (+2.05 points)
- HumanEval: 0.0854 (+1.83 points)
- ChatCORE: 0.0884 (+1.54 points)

### Optional Stage 5: Reinforcement learning

- Duration: ~1.5 hours (when enabled)
- Algorithm: simplified GRPO (Group Relative Policy Optimization)
- Focus: GSM8K mathematical reasoning
- Improvement: GSM8K accuracy increases from 4.55% to 7.58%

The RL stage demonstrates that even simple reinforcement learning can yield measurable improvements on domains with clear reward signals, though it remains optional in the default pipeline.

**Pretraining data:**

- FineWeb-EDU (HuggingFace): high-quality educational web text derived from CommonCrawl
- 240 shards used (~24GB compressed)
- Each shard: ~250M characters, ~100MB compressed
- Document diversity: news, encyclopedic content, educational materials, technical documentation

**Instruction-tuning data:**

- SmolTalk (HuggingFace): 460K diverse conversational examples
- MMLU auxiliary train: 100K multiple-choice questions across academic domains
- GSM8K: 8K grade-school math problems with step-by-step solutions

**Evaluation data:**

- CORE benchmark: 22 diverse autocompletion tasks (HellaSwag, PIQA, WinoGrande, etc.)
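The group-relative baseline at the heart of simplified GRPO is easy to sketch: sample several completions per prompt, score each with a scalar reward, and subtract the group's mean. Everything beyond that one idea (clipping, KL penalties, batching) is omitted in this sketch:

```python
def group_relative_advantages(rewards):
    """Advantages for the sampled completions of ONE prompt: reward minus group mean."""
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]

# Four sampled answers to one GSM8K problem, rewarded 1.0 when correct:
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))  # [0.5, -0.5, -0.5, 0.5]
```

Because the baseline comes from the group itself, no learned value function is needed, which is what keeps the RL stage simple enough to run in ~1.5 hours.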
- ARC-Easy & ARC-Challenge: science questions at elementary and middle-school level
- MMLU: multitask language understanding across 57 subjects
- GSM8K: grade-school mathematics reasoning
- HumanEval: Python code-generation benchmark

**Base model (after pretraining):**

| Metric | Score | Comparison |
|--------|-------|------------|
| CORE | 0.2219 | Between GPT-2 Large (0.21) and GPT-2 XL (0.26) |
| Validation BPB | 0.81 | Bits per byte on held-out data |

**Chat model (after SFT):**

| Benchmark | Score | Baseline (Random) | Description |
|-----------|-------|-------------------|-------------|
| ARC-Easy | 38.76% | 25% | Elementary science questions |
| ARC-Challenge | 28.07% | 25% | Middle school science questions |
| MMLU | 31.51% | 25% | Multitask language understanding |
| GSM8K | 4.55% | 0% | Grade-school math problems |
| HumanEval | 8.54% | 0% | Python code generation |
| ChatCORE | 0.0884 | 0.0 | Aggregate chat performance |

For reference, GPT-2 (1.5B parameters, 2019) achieved:

- Similar CORE scores (~0.26 for the XL variant)
- Limited mathematical reasoning
- No instruction-following capability

nanochat achieves comparable base capabilities with:

- 63% fewer parameters (560M vs. 1.5B)
- Substantially lower training cost ($100 vs. an estimated $10K+)
- Native instruction-following and conversational ability
- Modern architectural improvements (RoPE, RMSNorm, MQA)

The recommended way to interact with nanochat is through its web interface. nanochat models can also be exported to HuggingFace format for deployment on Inference Endpoints.

The fastest way to train nanochat is the speedrun script, which will:

1. Set up the environment (uv, Rust, dependencies)
2. Download training data
3. Train the tokenizer
4. Pretrain the base model (~10.5 hours on 8x H100)
5. Perform midtraining (~8 minutes)
6. Perform supervised fine-tuning (~7 minutes)
7. Generate an evaluation report

For those without access to cloud GPU infrastructure, we provide a complete Colab notebook: `notebooks/train_on_colab.ipynb`.

**Colab options:**

- Free tier: T4 GPU (16GB), train smaller models in 2-3 hours
- Colab Pro: V100 GPU, faster training
- Colab Pro+: A100 GPU, full-scale training

The Colab notebook provides:

- Zero setup required
- Step-by-step instructions
- Automatic checkpoint saving to Google Drive
- Real-time training visualization
- A guided walkthrough of the entire pipeline

nanochat scales to larger models by adjusting the `--depth` parameter. The codebase automatically adjusts:

- Model dimensions (channels scale with depth)
- Learning rates (scale with 1/√dim)
- Training tokens (Chinchilla ratio maintained)
- Gradient accumulation (to maintain the effective batch size)

nanochat prioritizes:

- **Readability:** clean, commented, educational code
- **Minimalism:** a single cohesive implementation, no abstraction layers
- **Hackability:** easy to modify and experiment with
- **Light dependencies:** minimal external dependencies (PyTorch, tiktoken, a few utilities)

**Codebase statistics:**

- Lines of code: ~8,300
- Files: 44
- Total characters: ~334,000
- Approximate tokens: ~83,500 (fits in the context window of modern LLMs)

**Core implementation (`nanochat/`):**

- `gpt.py`: Transformer model implementation
- `engine.py`: inference engine with KV caching
- `tokenizer.py`: tokenizer interface and special tokens
- `dataloader.py`: efficient data loading and batching
- `checkpoint_manager.py`: model checkpointing and loading
- `adamw.py` / `muon.py`: optimizer implementations
- `configurator.py`: hyperparameter configuration

**Training scripts (`scripts/`):**

- `tok_train.py`: tokenizer training
- `base_train.py`: pretraining
- `mid_train.py`: midtraining
- `chat_sft.py`: supervised fine-tuning
- `chat_rl.py`: reinforcement learning (optional)
- `chat_eval.py`: comprehensive evaluation
- `chat_web.py`: web interface server
- `export_to_huggingface.py`: HuggingFace export

**Tokenizer (`rustbpe/`):**

- High-performance Rust implementation
- Training compatible with the Python reference
- Efficient inference via tiktoken

nanochat builds upon decades of research in neural language modeling and represents a practical synthesis of modern best practices.

**Transformer architecture:**

- Vaswani et al. (2017). "Attention Is All You Need." NeurIPS. arXiv:1706.03762
  - Foundation of modern language models
  - Self-attention mechanism
  - Positional-encoding concepts

**GPT models and scaling:**

- Radford et al. (2018). "Improving Language Understanding by Generative Pre-Training." OpenAI.
  - Demonstrates the effectiveness of generative pretraining
  - Decoder-only Transformer architecture
- Radford et al. (2019). "Language Models are Unsupervised Multitask Learners." OpenAI.
  - GPT-2: scaling to 1.5B parameters
  - Zero-shot task transfer
- Brown et al. (2020). "Language Models are Few-Shot Learners." NeurIPS. arXiv:2005.14165
  - GPT-3: scaling to 175B parameters
  - In-context learning and few-shot prompting

**Scaling laws:**

- Kaplan et al. (2020). "Scaling Laws for Neural Language Models." arXiv:2001.08361
  - Power-law relationships between loss and scale
  - Compute-optimal model sizing
- Hoffmann et al. (2022). "Training Compute-Optimal Large Language Models." NeurIPS. arXiv:2203.15556
  - Chinchilla scaling laws (followed by nanochat)
  - Optimal token-to-parameter ratio of 20:1
  - Demonstrates that smaller, longer-trained models outperform larger, shorter-trained ones

**Architectural innovations:**

- Su et al. (2021). "RoFormer: Enhanced Transformer with Rotary Position Embedding." arXiv:2104.09864
  - RoPE: relative position encoding
  - Better length extrapolation
- Shazeer (2019). "Fast Transformer Decoding: One Write-Head is All You Need." arXiv:1911.02150
  - Multi-Query Attention (MQA)
  - Inference efficiency improvements
- Zhang & Sennrich (2019). "Root Mean Square Layer Normalization." NeurIPS.
  arXiv:1910.07467
  - RMSNorm: simplified normalization
  - Training stability with reduced computation

**Instruction tuning and alignment:**

- Wei et al. (2022). "Finetuned Language Models are Zero-Shot Learners." ICLR. arXiv:2109.01652
  - Instruction fine-tuning methodology
  - Task generalization through instructions
- Ouyang et al. (2022). "Training language models to follow instructions with human feedback." NeurIPS. arXiv:2203.02155
  - InstructGPT: RLHF methodology
  - Alignment techniques

**Evaluation and benchmarking:**

- Li et al. (2024). "CORE: A Data-Efficient Benchmark for Evaluating Language Models." arXiv:2406.11794
  - CORE metric used in nanochat
  - Broad evaluation across 22 datasets

**Optimization:**

- Loshchilov & Hutter (2019). "Decoupled Weight Decay Regularization." ICLR. arXiv:1711.05101
  - AdamW optimizer
  - Improved weight-decay handling

nanochat is designed as the capstone project for LLM101n, a course on large language models developed by Eureka Labs (founded by Andrej Karpathy). The implementation philosophy emphasizes:

- **Transparency:** every component is visible and understandable
- **Accessibility:** the entire codebase fits in an LLM context window
- **Practicality:** real-world training on modest budgets
- **Extensibility:** a clean baseline for research experiments

Related educational resources by Andrej Karpathy:

- Neural Networks: Zero to Hero (video course series)
- nanoGPT: minimal GPT pretraining (nanochat's predecessor)
- minGPT: minimal GPT implementation
- makemore: character-level language modeling

As a model trained on a $100 budget with 560M parameters, nanochat has inherent limitations.

**Knowledge and reasoning:**

- Limited factual knowledge compared to larger models
- Struggles with complex multi-step reasoning
- May produce incorrect or nonsensical information
- Cannot reliably solve advanced mathematics or coding problems
- Knowledge cutoff reflects the training data (FineWeb-EDU, circa 2024)

**Language and coverage:**

- Primarily English-focused (tokenizer optimized for English)
- Limited multilingual capabilities
- May not handle specialized domains well (legal, medical, scientific)
- Code-generation capabilities are basic (8.54% on HumanEval)

**Context and memory:**

- 2,048-token context limit
- Cannot process very long documents
- No persistent memory across conversations

**Important:** nanochat has not undergone extensive safety testing or alignment:

- May generate biased, offensive, or inappropriate content
- Not suitable for production applications without additional safety measures
- Should not be relied upon for factual information
- May hallucinate convincingly false information
- Has not been trained to refuse harmful requests

nanochat is intended for:

- Educational purposes and learning about LLMs
- Research baselines and experimentation
- Understanding full-stack LLM development
- Rapid prototyping of LLM applications
- Cost-effective model-training research

nanochat should NOT be used for:

- Production applications without additional fine-tuning
- High-stakes decision making
- Medical, legal, or financial advice
- Any application where factual accuracy is critical
- Public-facing applications without content filtering

**Environmental impact:** Training nanochat consumes approximately:

- 4×10^19 FLOPs of computation
- ~100 kWh of electricity (estimated; an 8x H100 node draws roughly 10 kW, running for ~10.5 hours)
- Corresponding CO2 emissions depend on the energy source

**Data and bias:**

- Trained on web-scraped data (FineWeb-EDU), which may contain biases
- Reflects biases present in internet text circa 2024
- English-language bias in the training data
- Limited representation of non-Western perspectives

**Transparency:** nanochat prioritizes transparency by:

- Open-sourcing all training code and procedures
- Documenting training data sources
- Providing detailed methodology
- Enabling reproducibility

- Author: Andrej Karpathy
- Organization: Eureka Labs
- Repository: github.com/karpathy/nanochat
- License: MIT

For questions, issues, or contributions:

- GitHub Issues: github.com/karpathy/nanochat/issues
- GitHub Discussions: github.com/karpathy/nanochat/discussions
- Discord: Karpathy's Discord, #nanochat channel

If you use nanochat in your research or projects, please cite:

**Acknowledgements:**

- nanoGPT: minimal GPT pretraining implementation
- modded-nanoGPT: performance optimization and gamification
- HuggingFace: FineWeb-EDU and SmolTalk datasets
- Lambda Labs: GPU infrastructure for development
- Alec Radford: technical guidance and LLM architecture insights

Special thanks to the open-source ML community for making projects like this possible.

**Version history:**

- v1.0 (October 2025): initial release
  - 560M-parameter model
  - Complete training pipeline (tokenizer → base → mid → sft → rl)
  - Web interface and inference engine
  - Comprehensive evaluation suite
  - Google Colab support
  - Hugging Face export capability

**Documentation:**

- Training walkthrough: detailed guide to the speedrun
- Colab notebook: interactive training guide
- Hyperbolic deployment: cloud training guide

**Related projects:**

- nanoGPT: pretraining-focused predecessor
- minGPT: minimal GPT implementation
- llm.c: LLM training in pure C/CUDA

**Educational materials:**

- Neural Networks: Zero to Hero: video course series
- LLM101n: comprehensive LLM course (in development)
- The Illustrated Transformer: visual guide to Transformers

Note: This model card will be automatically updated with actual training metrics, costs, and performance numbers when the model completes training. Placeholder values should be replaced with real measurements from your training run.
# Nanochat AquaRat
## Training Language Models with Reinforcement Learning on Mathematical Reasoning

A modified version of nanochat trained with reinforcement learning on the DeepMind AQuA-RAT dataset for algebraic reasoning and multiple-choice problem solving.

Quick Start • Dataset • Modifications • Training • Results

**Contents:**

- Overview
- The Base: nanochat Framework
- Dataset Structure
- Modifications from Base nanochat
- Training Pipeline
- Quick Start
- File Structure
- Monitoring & Visualization
- Results

### Overview

This project adapts the nanochat training framework (originally designed for GSM8K numerical reasoning) to work with AQuA-RAT (Algebra Question Answering with Rationales), a dataset of ~97,000 algebraic word problems with multiple-choice answers (A-E) and natural-language solution rationales.

- **Domain transfer:** demonstrates how to adapt a mathematical-reasoning pipeline from free-form numeric answers to a multiple-choice format
- **RL on math:** implements GRPO-style reinforcement learning with reward shaping for categorical outputs
- **Mechanistic interpretability:** integrates attention analysis during training to understand model reasoning patterns
- **Production-ready:** includes automated Lambda Labs and Hyperbolic Labs deployment helpers for cloud GPU training

| Model | Parameters | Training Time | AQuA-RAT Dev Accuracy |
|-------|------------|---------------|----------------------|
| depth-8 | ~60M | 3-4 hours | 30-50% |
| depth-20 | ~561M | 6-8 hours | 40-60% |

### The Base: nanochat Framework

nanochat is a minimalist yet complete pipeline for training transformer language models from scratch, created by Andrej Karpathy.
It implements:

- **Custom tokenizer:** a BPE tokenizer written in Rust for performance
- **Training stages:** pretraining → mid-training → SFT → RL
- **Evaluation suite:** CORE benchmarks and task-specific metrics
- **Optimizations:** memory-efficient training, gradient accumulation, distributed training

Original focus: training on GSM8K (Grade School Math 8K) with free-form numeric answers.

### Dataset Structure

The DeepMind AQuA-RAT dataset contains algebraic reasoning problems in JSON format.

**Dataset splits:**

- Training: 97,467 problems
- Development: 254 problems
- Test: 254 problems

**Key characteristics:**

- Multiple-choice (A-E) format
- Algebraic word problems
- Natural-language rationales
- Topics: arithmetic, algebra, geometry, probability

| Aspect | GSM8K (Original) | AQuA-RAT (This Project) |
|--------|------------------|-------------------------|
| Format | Free-form numeric | Multiple choice (A-E) |
| Answer | Single number | Letter choice |
| Size | 8,500 problems | 97,700 problems |
| Difficulty | Elementary school | High school algebra |
| Rationale | Step-by-step | Natural language |
| Evaluation | Exact match on number | Categorical accuracy |

### Modifications from Base nanochat

To adapt nanochat from GSM8K to AQuA-RAT, we modified the following components:

```python
def format_example(row):
    options = row["options"]
    # The derivations of `letters` and `correct` were not shown in the original
    # snippet; this is a plausible reconstruction for options like "A)12".
    letters = [opt.split(")")[0].strip().upper() for opt in options]
    correct = row["correct"].strip().upper()
    assistant_content = [
        {"type": "text", "text": row["rationale"].strip()},
        {"type": "text", "text": f"Answer: {correct}"},
    ]
    return {
        "messages": [
            {"role": "user", "content": render_user_prompt(row["question"], options)},
            {"role": "assistant", "content": assistant_content},
        ],
        "letters": letters,
        "answer_letter": correct,
    }
```

```python
import re

# Permissive fallback: any standalone A-E letter (definition not shown in the original).
LETTER_RE = re.compile(r"\b([A-E])\b")

def extract_letter(text, default=None):
    # Prefer an explicit "Answer: X" (or "Answer - X") marker.
    answer_match = re.search(r"answer\s*[:\-]\s*([A-E])", text, flags=re.IGNORECASE)
    if answer_match:
        return answer_match.group(1).upper()
    match = LETTER_RE.search(text)
    return match.group(1).upper() if match else default
```

```bash
# (Optional) Cache the dataset locally as JSONL
python -m scripts.prepare_aqua --output_dir "$NANOCHAT_BASE_DIR/aqua"
```
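The "reward shaping for categorical outputs" used in the RL stage can be sketched as follows. Only "reward the correct letter" is stated in this README; the exact shaping values (especially the format penalty) are illustrative assumptions, not the values in `scripts/chat_rl.py`:

```python
import re

# Same answer-marker pattern used for evaluation.
ANSWER_RE = re.compile(r"answer\s*[:\-]\s*([A-E])", re.IGNORECASE)

def letter_reward(completion: str, answer_letter: str) -> float:
    """Hedged sketch: 1.0 for the correct letter, 0.0 for a wrong one,
    and a small penalty (an assumption) when no A-E choice is parsable."""
    match = ANSWER_RE.search(completion)
    if match is None:
        return -0.1  # format error: the model failed to emit a choice
    return 1.0 if match.group(1).upper() == answer_letter else 0.0

print(letter_reward("The total is 40, so Answer: C", "C"))  # 1.0
print(letter_reward("I think the result is 12.", "C"))      # -0.1
```

Penalizing unparsable outputs, rather than treating them like wrong answers, gives the model a gradient toward emitting a well-formed "Answer: X" line, which matches the "reduced format errors" result reported after RL.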
```bash
# Mid-training now samples from the AQuA mixture
torchrun -m scripts.mid_train -- --run=demo --num_iterations=200

# SFT stage emphasises AQuA problems
torchrun -m scripts.sft_train -- --run=demo --aqua_train_examples=20000

# RL fine-tuning rewards the correct letter on AQuA-RAT
torchrun -m scripts.chat_rl -- --run=demo --temperature=0.7 --max_new_tokens=64
```

### Training Pipeline

```bash
# Stage 1: base pretraining
torchrun --nproc_per_node=8 -m scripts.base_train -- --depth=8
```

```bash
# Stage 2: mid-training
torchrun --nproc_per_node=8 -m scripts.mid_train
```

```bash
# Stage 3: supervised fine-tuning
torchrun --nproc_per_node=8 -m scripts.sft_train -- \
  --aqua_train_examples=20000 \
  --aqua_val_examples=254
```

```bash
# Stage 4: reinforcement learning
torchrun --nproc_per_node=1 -m scripts.chat_rl -- \
  --temperature=0.7 \
  --max_new_tokens=64
```

### Quick Start

```bash
git clone --recurse-submodules https://github.com/HarleyCoops/nanochatAquaRat.git
```

Windows Rust toolchain setup:

```powershell
$env:Path += ";$env:USERPROFILE\.cargo\bin"
setx PATH "$env:Path"
setx CARGO_HOME "$env:USERPROFILE\.cargo"
setx RUSTUP_HOME "$env:USERPROFILE\.rustup"
rustup set default-host x86_64-pc-windows-msvc
rustup default stable-x86_64-pc-windows-msvc
cargo --version
rustup --version
```

```bash
uv run maturin develop
```

Lambda Labs launch:

```bash
# Set credentials
export LAMBDA_API_KEY='your-lambda-api-key'
export WANDB_API_KEY='your-wandb-api-key'

# Launch with auto-start
python scripts/launch_lambda_training.py \
  --ssh-key-name your_lambda_ssh_key \
  --instance-type gpu_8x_h100_sxm5 \
  --region us-west-1 \
  --auto-start \
  --inject-env WANDB_API_KEY
```

```bash
# SSH to the instance
ssh ubuntu@<instance-ip>

# Attach to the tmux session
tmux attach -t nanochat-train

# Or view logs
tail -f ~/nanochatAquaRat/training.log
```

Hyperbolic Labs launch:

```bash
# Set credentials
export HYPERBOLIC_API_KEY='your-hyperbolic-api-key'
export WANDB_API_KEY='your-wandb-api-key'

# Launch with auto-start
python scripts/launch_hyperbolic_training.py \
  --gpu-count 1 \
  --region us-east \
  --auto-start \
  --inject-env WANDB_API_KEY
```

Manual setup:

```bash
sudo apt-get update
sudo apt-get install -y git curl unzip build-essential python3 python3-venv tmux
git clone https://github.com/HarleyCoops/nanochatAquaRat.git
cd nanochatAquaRat
```

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y --default-toolchain stable
source "$HOME/.cargo/env"
export PATH="$HOME/.local/bin:$PATH"
uv venv && uv sync --extra gpu
source .venv/bin/activate
uv run maturin develop
uv run python -m scripts.tok_train
```

```bash
curl -sSL https://sdk.cloud.google.com | bash
source "$HOME/.bashrc"
gcloud auth login --no-launch-browser
gcloud config set project
gcloud storage cp gs://nanochat-aquarat-datasets/datasets/aqua/aqua_cache.zip .
unzip -o aqua_cache.zip -d ~/aqua_cache
export AQUA_DATA_DIR=$HOME/aqua_cache
```

```bash
cd ~/.cache/nanochat
curl -L -o identity_conversations.jsonl https://karpathy-public.s3.us-west-2.amazonaws.com/identity_conversations.jsonl
curl -L -o eval_bundle.zip https://karpathy-public.s3.us-west-2.amazonaws.com/eval_bundle.zip
unzip -q eval_bundle.zip && rm eval_bundle.zip
cd ~/nanochatAquaRat
```

```bash
export LAMBDA_API_KEY='your-key'
export WANDB_API_KEY='your-key'
python launch_lambda.py \
  --instance-type gpu_8x_h100_sxm5 \
  --region us-west-1
```

```bash
# Set up the environment
cp .env.template .env
# Edit .env with your WANDB_API_KEY

# Run training
bash run_aquarat_small.sh
```

```bash
uv run python -m scripts.sync_hf_repo --no-push
```

```bash
uv run python -m scripts.sync_hf_repo --repo-id HarleyCooper/nanochatAquaRat
```

### File Structure

```
nanochatAquaRat/
├── nanochat/…                        # Vendored upstream nanochat package
├── scripts/
│   ├── base_train.py                 # Base pretraining stage
│   ├── mid_train.py                  # Mid-training (now includes AQuA)
│   ├── chat_sft.py                   # Chat SFT pipeline
│   ├── sft_train.py                  # Shim so `-m scripts.sft_train` still works
│   ├── chat_rl.py                    # Reinforcement learning on AQuA-RAT
│   ├── chat_eval.py                  # Evaluation harness (adds the AQuA task)
│   ├── prepare_aqua.py               # AQuA-RAT JSONL exporter
│   ├── launch_lambda_training.py     # Lambda Labs automation
│   ├── launch_hyperbolic_training.py # Hyperbolic Labs automation
│   └── upload_to_gcs.sh              # Artifact helper
├── tasks/
│   ├── aqua.py                       # AQuA-RAT task implementation
│   ├── arc.py / gsm8k.py / mmlu.py   # Other reasoning tasks
│   └── …
├── run_aquarat_small.sh              # End-to-end orchestration
├── pyproject.toml / uv.lock          # Environment definitions
└── README.md
```

### Monitoring & Visualization

Example metrics logged during RL training:

```
rl/acc                ━━━━━━━━━━ 0.45
rl/kl_letter_mean     ━━━━━━━━━━ 0.12
rl/letter_margin_mean ━━━━━━━━━━ 2.34
attn/entropy_mean     ━━━━━━━━━━ 3.21
```

### Results

| Depth | Parameters | Training Time | Best Instance Type | Estimated Cost |
|-------|------------|---------------|-------------------|----------------|
| 8 | ~60M | 3-4 hours | 1-2x A100 | ~$18-35 |
| 12 | ~180M | 4-5 hours | 4x A100 | ~$35-45 |
| 20 | ~561M | 6-8 hours | 8x H100 | ~$144-192 |
| 26 | ~1.1B | 10-12 hours | 8x H100 | ~$240-288 |

To change the model depth, edit the `--depth` parameter in `run_aquarat_small.sh`.

**After SFT (before RL):**

- Dev accuracy: 20-30% (depth-8), 30-40% (depth-20)
- Basic problem-solving capability
- Some format errors (invalid letters)

**After RL:**

- Dev accuracy: 30-50% (depth-8), 40-60% (depth-20)
- Improved reasoning coherence
- Better multiple-choice selection confidence
- Reduced format errors
- Stable attention patterns

| Model | Training Time | Total Cost |
|-------|---------------|------------|
| depth-8 (60M) | 3-4 hours | ~$96 |
| depth-20 (561M) | 6-8 hours | ~$192 |

**Budget options:**

- Test pipeline: 1x A10 @ $0.60/hr
- Small model: 2x A100 @ $4.40/hr
- Production: 8x H100 @ $24/hr

**For Lambda Labs users:**

- Always terminate instances after training to avoid charges
- Monitor spending in the Lambda Labs dashboard
- Check instance availability before launching (high-demand periods)

**Known limitations:**

- RL on AQuA-RAT is experimental; results may vary
- Attention logging adds ~5-10% overhead
- KL computation can be expensive with large batch sizes
- Smaller models (<100M params) may struggle with complex reasoning

**Automation scripts:**

- `scripts/launch_lambda_training.py`: full-featured automation
- `scripts/launch_hyperbolic_training.py`: Hyperbolic marketplace automation
- `launch_lambda.py`: simplified launcher

**Guides:**

- QUICKSTART.md: fast-track guide
- LAMBDA_MANUAL_SETUP.md: manual setup walkthrough
- GCS_UPLOAD_GUIDE.md: upload weights to Google Cloud Storage
- .env.template: environment configuration

This project is based on the nanochat framework. For issues specific to:

- AQuA-RAT training: open an issue in this repository
- The base nanochat framework: refer to the upstream nanochat project
- Lambda Labs deployment: see the documentation above

This project inherits the license from the base nanochat project.

**Acknowledgements:**

- Andrej Karpathy: nanochat framework
- DeepMind: AQuA-RAT dataset and mechanistic interpretability tools
- Lambda Labs: cloud GPU infrastructure
- Weights & Biases: experiment tracking and visualization

**Support:**

- Lambda Labs support: https://lambdalabs.com/support
- Weights & Biases docs: https://docs.wandb.ai
- Project issues: https://github.com/HarleyCoops/nanochatAquaRat/issues