sdobson
# tinystories-llama-15m
This is a small Llama-architecture language model trained on the TinyStories dataset. The model is designed to generate simple, coherent children's stories using a vocabulary and concepts that a typical 3-4 year old would understand.

## Model Details

- Model architecture: Llama 2
- Training framework: PyTorch
- Implementation: based on llama2.c
- Dimension: 288
- Number of layers: 6
- Number of attention heads: 6
- Number of KV heads: 6
- Vocabulary size: 32,000 (Llama 2 tokenizer)
- Maximum sequence length: 256 tokens
- Dropout: 0.0
- Hidden dimension multiple: 32

## Training Configuration

- Batch size: 128 (micro-batch)
- Gradient accumulation steps: 4
- Effective batch size: 512
- Learning rate: 5e-4 (max)
- Learning rate schedule: cosine decay with warmup
- Warmup iterations: 1,000
- Total training iterations: 100,000
- Weight decay: 0.1
- Beta1: 0.9
- Beta2: 0.95
- Gradient clipping: 1.0
- Optimizer: AdamW
- Precision: bfloat16 (with mixed precision training)
- Tokens per iteration: ~65,536 (4 grad accum × 1 process × 64 batch × 256 seq len)

## Intended Use

This model is intended for:

- Generating simple children's stories
- Educational demonstrations of small-scale language model training
- Research into emergent capabilities in small language models
- Experimentation with efficient inference (e.g., pure C implementation)

## Limitations

- Domain-specific: the model is trained exclusively on simple stories and will not perform well on general text generation tasks
- Vocabulary: limited to concepts and language appropriate for very young children
- Context length: the maximum sequence length of 256 tokens limits story length
- No instruction following: this is a base model without instruction tuning

## Training Data

The model was trained on the TinyStories dataset, which consists of short stories generated to contain only words that a typical 3-4 year old would understand. The dataset was created to study the capabilities of small language models.
- Dataset size: ~2.1M stories
- Vocabulary: words understandable by 3-4 year olds
- Content: simple narratives, common objects, basic emotions and actions

## Example Prompt

"Once upon a time, there was a little girl named Lily."

## Citation

If you use this model or the llama2.c implementation, please cite:

- Model architecture and training code adapted from llama2.c by Andrej Karpathy
- Trained on the TinyStories dataset by Ronen Eldan and Yuanzhi Li
- Based on the Llama 2 architecture by Meta AI
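The learning-rate schedule listed in the training configuration (linear warmup to 5e-4 over 1,000 iterations, then cosine decay over the remaining 99,000) can be sketched as below. The minimum learning rate of 0.0 is an assumption following the llama2.c default; the card itself only states "cosine decay with warmup".

```python
import math

# Hyperparameters from the model card above.
MAX_LR = 5e-4
WARMUP_ITERS = 1_000
TOTAL_ITERS = 100_000
MIN_LR = 0.0  # assumed (llama2.c-style); not stated on the card


def get_lr(it: int) -> float:
    """Linear warmup followed by cosine decay, nanoGPT/llama2.c style."""
    if it < WARMUP_ITERS:
        # Linear warmup from 0 to MAX_LR.
        return MAX_LR * it / WARMUP_ITERS
    if it >= TOTAL_ITERS:
        # Past the end of training: hold at the floor.
        return MIN_LR
    # Cosine decay from MAX_LR down to MIN_LR.
    ratio = (it - WARMUP_ITERS) / (TOTAL_ITERS - WARMUP_ITERS)
    coeff = 0.5 * (1.0 + math.cos(math.pi * ratio))
    return MIN_LR + coeff * (MAX_LR - MIN_LR)


print(get_lr(500), get_lr(1_000), get_lr(100_000))
```

Halfway through warmup this gives 2.5e-4, peaks at 5e-4 at iteration 1,000, and decays smoothly to the floor by iteration 100,000.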
# catalan-stories-6m
This is a small Llama-style language model trained on a dataset of synthetically generated short stories in Catalan, with a custom 512-token vocabulary.

Try it out here: https://huggingface.co/spaces/sdobson/catalan-stories-6m

## Model Details

- Architecture: Llama (decoder-only transformer)
- Parameters: ~6M
- Hidden size: 256
- Layers: 8
- Attention heads: 8
- KV heads: 4 (grouped-query attention)
- Vocabulary size: 512 (custom SentencePiece tokenizer)
- Max sequence length: 256 tokens
- Training data: Catalan stories dataset

## Tokenizer

This model uses a custom SentencePiece tokenizer trained specifically on our dataset, with a vocabulary of only 512 tokens. This makes the model:

- Very lightweight and fast
- Optimised for simple Catalan stories
- Easy to deploy in resource-constrained environments

## Training Details

- Framework: llama2.c (PyTorch)
- Dataset: Catalan stories
- Tokenizer: custom SentencePiece model (512 vocab)
- Hardware: GeForce RTX 3060
- Training time: ~30 minutes

## Limitations

- Domain-specific: the model is optimised for simple Catalan stories and may not generalise well to other domains
- Small vocabulary: with only 512 tokens, the model has limited vocabulary coverage
- Short context: maximum sequence length of 256 tokens
- Size: while efficient, this is a small model (~6M parameters) with limited capabilities compared to larger models

## Intended Use

This model is intended for:

- Educational purposes
- Learning about language models and tokenization
- Lightweight text generation in resource-constrained environments
- Generating simple children's stories
- Experimentation with custom tokenizers

## Training Data

The model was trained on the Catalan stories dataset, which consists of short stories written in simple Catalan, generated synthetically to be suitable for language learning.
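The ~6M parameter figure can be sanity-checked from the configuration above. The sketch below assumes llama2.c conventions (SwiGLU feed-forward with hidden_dim = 2/3 · 4 · dim rounded up to a multiple of 32, tied input/output embeddings, and grouped-query KV projections); norms and biases are omitted as negligible.

```python
# Back-of-envelope parameter count for the catalan-stories-6m configuration.
dim, n_layers, n_heads, n_kv_heads, vocab = 256, 8, 8, 4, 512

head_dim = dim // n_heads          # 32
kv_dim = n_kv_heads * head_dim     # 128: GQA shrinks the K/V projections
hidden = 4 * dim * 2 // 3          # 682 before rounding
hidden = 32 * ((hidden + 31) // 32)  # round up to a multiple of 32 -> 704

attn = 2 * dim * dim + 2 * dim * kv_dim  # wq, wo (full) + wk, wv (grouped)
ffn = 3 * dim * hidden                   # w1, w2, w3 of the SwiGLU block
total = vocab * dim + n_layers * (attn + ffn)  # tied embeddings counted once
print(f"{total / 1e6:.1f}M parameters")  # ~6.0M
```

The grouped-query attention halves the K/V projection size (128 vs. 256 columns), which is a meaningful saving at this scale.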
# nanochat
nanochat is a 561M parameter transformer language model trained for conversational AI tasks. This model demonstrates that capable chat models can be trained efficiently on modest hardware budgets (~$100 on 8x H100 GPUs).

Read about the process at https://samdobson.uk/posts/training-a-chatgpt-clone-for-cheap/

Chat with the model at https://huggingface.co/spaces/sdobson/nanochat

## Model Details

- Developed by: Andrej Karpathy
- Trained by: Sam Dobson
- Model type: transformer-based causal language model
- Language(s): English
- License: MIT
- Parameters: 560,988,160 (~561M)
- Layers: 20
- Hidden size: 1280 channels
- Attention heads: 10
- Head dimension: 128
- Vocabulary size: 65,536 tokens

## Training Data

1. Pretraining: 100B-token subset of FineWeb-EDU (11.2B tokens processed)
2. Midtraining: SmolTalk conversations, MMLU multiple-choice questions, GSM8K math problems
3. Supervised fine-tuning (SFT): conversational adaptation data

## Tokenization

- Custom Rust-based tokenizer
- Vocabulary: 65,536 tokens
- Compression ratio: 4.8 characters per token

## Training Infrastructure

- Hardware: 8x H100 GPUs (Lambda GPU Cloud)
- Training time: ~3 hours for the pretraining stage
- Estimated compute: ~4e19 FLOPs
- Total cost: ~$100

## Training Stages

The model was trained in three stages:

1. Pretraining on web text (FineWeb-EDU)
2. Midtraining on domain-specific datasets (reasoning, conversation, maths)
3.
Supervised fine-tuning for chat optimisation

## Evaluation

| Benchmark | Score | Description |
|-----------|-------|-------------|
| MMLU | 23.99% | Multitask language understanding |
| GSM8K | 4.47% | Grade school math problems |
| HumanEval | 6.71% | Python code generation |
| ARC-Easy | 24.79% | Science questions (easy) |
| ARC-Challenge | 24.32% | Science questions (hard) |
| ChatCORE | 1.73% | Conversational reasoning |

## Intended Use

nanochat is designed for:

- Conversational AI applications
- Research on efficient language model training
- Educational purposes for understanding LLM training pipelines
- Low-resource deployment scenarios

The model can be fine-tuned for specific conversational tasks or used as a base model for further domain adaptation.

## Out-of-Scope Use

- Production-grade conversational AI (the model is relatively small and has limited capabilities)
- Tasks requiring specialised knowledge or high accuracy
- Critical applications where errors could cause harm

## Limitations

- Small scale: at 561M parameters, this model has significantly fewer capabilities than larger models (1B+ parameters)
- Limited training: trained on only 11.2B tokens, which is modest by modern standards
- Performance: benchmark scores indicate limited reasoning and mathematical capabilities
- Bias: inherits biases from the training data (FineWeb-EDU, SmolTalk, etc.)
- Language: English-only

## Running on CPU (macOS)

Simon Willison created a script that allows the model to run on CPU on macOS:

1. Download all files
2. Put `tokenizer.pkl` and `tokenbytes.pt` in `~/.cache/nanochat/tokenizer`
3. Put `model000650.pt` and `meta000650.json` in `~/.cache/nanochat/chatsftcheckpoints/d20`
4. Clone https://github.com/karpathy/nanochat
5. Run `uv sync` followed by `uv run python -m scripts.chatweb`
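As a rough sanity check on the ~4e19 FLOPs figure in the training infrastructure section, the common C ≈ 6·N·D rule of thumb for dense-transformer training compute (an approximation, not the exact accounting used for the card) lands in the same place:

```python
# Estimate training compute with the standard 6 * N * D approximation:
# ~6 FLOPs per parameter per token (forward + backward pass).
params = 560_988_160   # ~561M parameters, from the model card
tokens = 11.2e9        # pretraining tokens processed, from the model card

flops = 6 * params * tokens
print(f"~{flops:.1e} FLOPs")  # ~3.8e19, consistent with the card's ~4e19
```

At H100-class throughput this order of magnitude is what makes the ~3-hour, ~$100 pretraining run plausible.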