nanochat-students
Nanochat D20
This is the checkpoint from Andrej Karpathy's full-stack project to build an LLM, nanochat. You can also run the model in vLLM, using the branch install above; see the sketch after these lists.

Chat SFT training configuration:

- run:
- source: mid
- dtype: bfloat16
- device_batch_size: 4
- num_epochs: 1
- max_iterations: -1
- target_examples_per_step: 32
- unembedding_lr: 0.0040
- embedding_lr: 0.2000
- matrix_lr: 0.0200
- weight_decay: 0.0000
- init_lr_frac: 0.0200
- eval_every: 100
- eval_steps: 100
- eval_metrics_every: 200

Training results:

- Training rows: 20,843
- Number of iterations: 651
- Training loss: 1.1904
- Validation loss: 1.0664

Chat evaluation configuration:

- source: sft
- task_name: None
- dtype: bfloat16
- temperature: 0.0000
- max_new_tokens: 512
- num_samples: 1
- top_k: 50
- batch_size: 8
- model_tag: None
- step: None
- max_problems: None

Evaluation results:

- ARC-Easy: 0.4259
- ARC-Challenge: 0.2961
- MMLU: 0.3250
- GSM8K: 0.0432
- HumanEval: 0.0549
- ChatCORE metric: 0.0988

Logs from training can be found here: https://huggingface.co/spaces/nanochat-students/trackio
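A minimal vLLM inference sketch, assuming the branch install mentioned above exposes the standard vLLM API. The model id is a placeholder for this repository's actual path, and the sampling parameters mirror the evaluation configuration:

```python
from vllm import LLM, SamplingParams

# Minimal vLLM inference sketch, assuming the nanochat-compatible branch
# install mentioned above. The model id below is a placeholder; substitute
# this repository's actual path.
llm = LLM(model="nanochat-students/<this-repo>", dtype="bfloat16")
params = SamplingParams(temperature=0.0, top_k=50, max_tokens=512)

outputs = llm.generate(["Why is the sky blue?"], params)
print(outputs[0].outputs[0].text)
```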
Base D20
This model is trained with the nanochat recipe by Andrej Karpathy. It was trained with a depth of 20 on ~11.2 billion tokens (a 20:1 tokens-to-parameters ratio) and corresponds to this tokenizer, which was trained on ~2B characters. I will combine this repo with the tokenizer.

Base model evaluation timestamp: 2025-10-14 16:16:53

- Model: base_model (step 21400)
- CORE metric: 0.1963
- hellaswag_zeroshot: 0.2634
- jeopardy: 0.0959
- bigbench_qa_wikidata: 0.4993
- arc_easy: 0.5269
- arc_challenge: 0.1251
- copa: 0.4400
- commonsense_qa: 0.0653
- piqa: 0.3743
- openbook_qa: 0.1440
- lambada_openai: 0.3683
- hellaswag: 0.2630
- winograd: 0.2674
- winogrande: 0.0923
- bigbench_dyck_languages: 0.1050
- agi_eval_lsat_ar: 0.0326
- bigbench_cs_algorithms: 0.3674
- bigbench_operators: 0.1524
- bigbench_repeat_copy_logic: 0.0000
- squad: 0.2222
- coqa: 0.1957
- boolq: -0.4615
- bigbench_language_identification: 0.1801
- train bpb: 0.8147
- val bpb: 0.8121

Samples:

- sample 0: The capital of France is Paris. It is the largest city in France and the capital of the country.
- sample 1: The chemical symbol of gold is Au and the atomic number is 79. Gold is a soft, malleable,
- sample 2: If yesterday was Friday, then tomorrow will be Saturday. If today is Monday, then tomorrow will be Tuesday. If today is
- sample 3: The opposite of hot is cold. The opposite of hot is cold. The opposite of hot is cold.
- sample 4: The planets of the solar system are: Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, Neptune,
- sample 5: My favorite color is blue. I love the color blue because it is a color that is so versatile
- sample 6: If 5x + 3 = 13, then x is a factor of 5. If 5x + 3 =

Training configuration:

- run: dummy
- depth: 20
- max_seq_len: 2048
- num_iterations: -1
- target_flops: -1.0000
- target_param_data_ratio: 20
- device_batch_size: 32
- total_batch_size: 524,288
- embedding_lr: 0.2000
- unembedding_lr: 0.0040
- weight_decay: 0.0000
- matrix_lr: 0.0200
- grad_clip: 1.0000
- eval_every: 250
- eval_tokens: 10,485,760
- core_metric_every: 2000
- core_metric_max_per_task: 500
- sample_every: 2000
- model_tag:

Training results (see the arithmetic check after this list):

- Number of parameters: 560,988,160
- Number of FLOPs per token: 3.491758e+09
- Calculated number of iterations: 21,400
- Number of training tokens: 11,219,763,200
- Tokens : Params ratio: 20.0000
- DDP world size: 8
- warmup_ratio: 0.0000
- warmdown_ratio: 0.2000
- final_lr_frac: 0.0000
- Minimum validation bpb: 0.8120
- Final validation bpb: 0.8120
- CORE metric estimate: 0.2059
- MFU %: 48.36%
- Total training flops: 3.917670e+19
- Total training time: 172.18m
- Peak memory usage: 75422.02MiB
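The Chinchilla-style bookkeeping above follows from the reported numbers: the token budget is parameters × target_param_data_ratio, and the iteration count is that budget divided by total_batch_size. A few lines of Python verify it:

```python
# Verify the scaling numbers reported above.
num_params = 560_988_160        # Number of parameters
ratio = 20                      # target_param_data_ratio
total_batch_size = 524_288      # tokens per optimizer step
flops_per_token = 3.491758e9    # Number of FLOPs per token

train_tokens = num_params * ratio
print(train_tokens)                      # 11219763200, "Number of training tokens"
print(train_tokens // total_batch_size)  # 21400, "Calculated number of iterations"
print(flops_per_token * train_tokens)    # ~3.91767e+19, "Total training flops"
```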
rl-d20
This is the RL-trained checkpoint from Andrej Karpathy's full-stack project to build an LLM, nanochat.

RL training configuration (see the reward sketch after these lists):

- run: burtenshaw-20251015111354
- source: sft
- dtype: bfloat16
- device_batch_size: 8
- examples_per_step: 16
- num_samples: 16
- max_new_tokens: 256
- temperature: 1.0000
- top_k: 50
- unembedding_lr: 0.0040
- embedding_lr: 0.2000
- matrix_lr: 0.0200
- weight_decay: 0.0000
- init_lr_frac: 0.0500
- num_epochs: 1
- save_every: 60
- eval_every: 60
- eval_examples: 400

Evaluation configuration and results:

- source: rl
- task_name: GSM8K
- dtype: bfloat16
- temperature: 0.0000
- max_new_tokens: 512
- num_samples: 1
- top_k: 50
- batch_size: 8
- model_tag: None
- step: None
- max_problems: None
- GSM8K: 0.0970

Logs from training can be found here: https://huggingface.co/spaces/nanochat-students/trackio
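With the configuration above, each RL step samples num_samples = 16 completions for each of examples_per_step = 16 prompts (256 rollouts per step) and scores them against GSM8K references. Below is a minimal sketch of the standard GSM8K exact-match correctness check; nanochat's actual reward code may differ, and extract_answer is a hypothetical helper:

```python
import re

# Hypothetical sketch of a GSM8K exact-match reward; nanochat's actual
# reward implementation may differ. GSM8K references put the final answer
# after "####"; we compare it to the last number in the completion.
def extract_answer(text: str) -> str | None:
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return numbers[-1] if numbers else None

def reward(completion: str, reference: str) -> float:
    gold = reference.split("####")[-1].strip().replace(",", "")
    return 1.0 if extract_answer(completion) == gold else 0.0

print(reward("48 / 2 = 24 clips. The answer is 24.", "... #### 24"))  # 1.0
```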
mid-d20
nanochat-tokenizer-2B
This is the tokenizer from Andrej Karpathy's educational project nanochat, produced by the first step of the speedrun.sh script. First, download the first ~2B characters of the pretraining dataset using the dataset script in nanochat; then train the tokenizer, with a vocab size of 65,536, on that data.

- max_chars: 2,000,000,000
- doc_cap: 10,000
- vocab_size: 65,536
- train_time: 52.9085
- num_special_tokens: 9
- token_bytes_min: 1
- token_bytes_max: 32
- token_bytes_mean: 6.9197
- token_bytes_std: 2.8748

Tokenizer evaluation timestamp: 2025-10-14 10:29:10

In the tables below, Ratio is bytes per token (higher means better compression) and Relative Diff % compares token counts against the baseline (positive means ours uses fewer tokens).

| Text Type | Bytes | GPT-2 Tokens | GPT-2 Ratio | Ours Tokens | Ours Ratio | Relative Diff % |
|-----------|-------|--------------|-------------|-------------|------------|-----------------|
| news | 1819 | 404 | 4.50 | 375 | 4.85 | +7.2% |
| korean | 893 | 745 | 1.20 | 712 | 1.25 | +4.4% |
| code | 1259 | 576 | 2.19 | 492 | 2.56 | +14.6% |
| math | 1834 | 936 | 1.96 | 966 | 1.90 | -3.2% |
| science | 1112 | 260 | 4.28 | 228 | 4.88 | +12.3% |
| fwe-train | 4208518 | 900364 | 4.67 | 856883 | 4.91 | +4.8% |
| fwe-val | 4991242 | 1075364 | 4.64 | 1027241 | 4.86 | +4.5% |

| Text Type | Bytes | GPT-4 Tokens | GPT-4 Ratio | Ours Tokens | Ours Ratio | Relative Diff % |
|-----------|-------|--------------|-------------|-------------|------------|-----------------|
| news | 1819 | 387 | 4.70 | 375 | 4.85 | +3.1% |
| korean | 893 | 364 | 2.45 | 712 | 1.25 | -95.6% |
| code | 1259 | 309 | 4.07 | 492 | 2.56 | -59.2% |
| math | 1834 | 832 | 2.20 | 966 | 1.90 | -16.1% |
| science | 1112 | 249 | 4.47 | 228 | 4.88 | +8.4% |
| fwe-train | 4208518 | 874799 | 4.81 | 856883 | 4.91 | +2.0% |
| fwe-val | 4991242 | 1048837 | 4.76 | 1027241 | 4.86 | +2.1% |
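The bytes-per-token ratios above can be reproduced with tiktoken for the GPT-2 and GPT-4 baselines. Loading our tokenizer through the Hugging Face tokenizers library is an assumption here, and the artifact name tokenizer.json is a placeholder for whatever file this repo ships:

```python
import tiktoken
from tokenizers import Tokenizer

# Sketch: recompute bytes-per-token ratios as in the tables above.
# "tokenizer.json" is an assumed artifact name for this repo's tokenizer.
text = open("sample.txt").read()          # hypothetical sample document
n_bytes = len(text.encode("utf-8"))

gpt2 = tiktoken.get_encoding("gpt2")
gpt4 = tiktoken.get_encoding("cl100k_base")   # encoding used by GPT-4
ours = Tokenizer.from_file("tokenizer.json")

for name, n_tokens in [
    ("gpt2", len(gpt2.encode(text))),
    ("gpt4", len(gpt4.encode(text))),
    ("ours", len(ours.encode(text).ids)),
]:
    print(f"{name}: {n_bytes / n_tokens:.2f} bytes/token")
```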