codelion

31 models • 1 total model in database

dhara-70m

gpt-2-70m

A 70M-parameter GPT-2 model trained on 1 billion tokens using a 50-30-20 dataset mixing strategy. The model illustrates how much careful dataset composition matters for efficient language model pretraining. Despite training on one tenth as many tokens as the GPT-2 baseline (1B vs 10B), it achieves competitive benchmark scores by drawing on a mixture of high-quality data sources.

Architecture: GPT-2

- Parameters: 70M (64.09M trainable)
- Layers: 12
- Hidden Size: 512
- Attention Heads: 8
- Context Length: 1024 tokens
- Vocabulary Size: 50,257

The model was trained on 1 billion tokens with the following composition:

- 50% - FinePDFs (500M tokens): High-quality PDF content
- 30% - DCLM Baseline (300M tokens): Filtered web content
- 20% - FineWeb-Edu (200M tokens): Educational web content

This 50-30-20 mixing ratio was identified through systematic experimentation as optimal for balanced performance across multiple domains.

Training configuration:

- Total Tokens: 1,000,000,000
- Batch Size: 24 (effective: 120 with gradient accumulation)
- Learning Rate: 5e-4 → 5e-5 (cosine decay)
- Warmup Steps: 162 (2% of total)
- Precision: BFloat16
- Optimizer: AdamW
- Final Loss: 2.92

Benchmark results (accuracy in percent; the "vs" columns are differences in percentage points):

| Benchmark | Our Model | Random | GPT-2 | vs Random | vs GPT-2 |
|-----------|-----------|--------|-------|-----------|----------|
| MMLU (5-shot) | 24.11% | 25.00% | 26.00% | -0.89 | -1.89 |
| HellaSwag (0-shot) | 27.03% | 25.00% | 30.00% | +2.03 | -2.97 |
| ARC-Challenge (0-shot) | 21.67% | 25.00% | 24.00% | -3.33 | -2.33 |
| PIQA (0-shot) | 57.29% | 50.00% | 63.00% | +7.29 | -5.71 |
| WinoGrande (0-shot) | 51.46% | 50.00% | 51.00% | +1.46 | +0.46 |
| TruthfulQA MC2 (0-shot) | 47.31% | 25.00% | 40.00% | +22.31 | +7.31 |
| Average | 38.15% | 33.33% | 39.00% | +4.81 | -0.85 |

Summary:

- Performance Gap: The average score is 0.85 percentage points behind the GPT-2 baseline (38.15% vs 39.00%)
- Efficiency: Closes 84.9% of the gap between random guessing and GPT-2 (4.81 of 5.67 points)
- Data Efficiency: Competitive results with one tenth of the training data
- TruthfulQA: 7.31 percentage points above the GPT-2 baseline, the model's largest margin on any benchmark

Key findings:

1. Data Quality > Quantity: The 50-30-20 mixing strategy shows that careful dataset composition can come close to the GPT-2 baseline with significantly reduced compute
2. Factual Accuracy: The model scores highest relative to GPT-2 on truthfulness (TruthfulQA), likely due to the high-quality FinePDFs content (50% of the mix)
3. Practical Commonsense: Above-chance scores on PIQA (+7.29 points) and WinoGrande (+1.46 points) indicate some real-world reasoning ability, although PIQA still trails GPT-2 by 5.71 points
4. Knowledge Gaps: Below-random performance on MMLU and ARC-Challenge indicates insufficient academic and scientific knowledge at this scale

Limitations:

- Academic Knowledge: Limited performance on academic benchmarks (MMLU, ARC-Challenge)
- Training Scale: 1B tokens is insufficient for comprehensive world knowledge
- Parameter Count: 70M parameters may limit capacity for complex reasoning

For questions or issues, please open an issue on the model repository.
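Several of the card's headline figures can be re-derived from its own numbers. The sketch below is only a sanity check, not the author's training or evaluation code. It assumes the standard GPT-2 layout (4x MLP expansion, biases on all linear layers, output head tied to the token embeddings) and that every training sequence is packed to the full 1024-token context. All function names are hypothetical.

```python
# Figures copied from the model card above.
VOCAB, CONTEXT, HIDDEN, LAYERS = 50_257, 1_024, 512, 12
TOTAL_TOKENS = 1_000_000_000
EFFECTIVE_BATCH = 120  # sequences per optimizer step
MIX_PERCENT = {"FinePDFs": 50, "DCLM Baseline": 30, "FineWeb-Edu": 20}

# benchmark -> (this model, random, GPT-2), accuracy in percent
SCORES = {
    "MMLU": (24.11, 25.00, 26.00),
    "HellaSwag": (27.03, 25.00, 30.00),
    "ARC-Challenge": (21.67, 25.00, 24.00),
    "PIQA": (57.29, 50.00, 63.00),
    "WinoGrande": (51.46, 50.00, 51.00),
    "TruthfulQA MC2": (47.31, 25.00, 40.00),
}


def gpt2_param_count(vocab, context, hidden, layers):
    """Parameter count under the ASSUMED standard GPT-2 layout."""
    embeddings = vocab * hidden + context * hidden   # token + position tables
    attention = 4 * hidden * hidden + 4 * hidden     # fused QKV + output projection
    mlp = 8 * hidden * hidden + 5 * hidden           # up (4x) and down projections
    layer_norms = 4 * hidden                         # two LayerNorms per block
    per_layer = attention + mlp + layer_norms
    # The final LayerNorm adds 2 * hidden; the tied output head adds nothing.
    return embeddings + layers * per_layer + 2 * hidden


def schedule_lengths(total_tokens, batch, context, warmup_percent=2):
    """Optimizer steps and warmup steps, ASSUMING fully packed sequences."""
    steps = total_tokens // (batch * context)
    return steps, steps * warmup_percent // 100


def token_budget(total_tokens, mix_percent):
    """Tokens allotted to each source under the given percentage mix."""
    return {name: total_tokens * pct // 100 for name, pct in mix_percent.items()}


def benchmark_summary(scores):
    """Column averages plus the fraction of the random-to-GPT-2 gap closed."""
    n = len(scores)
    ours, rand, gpt2 = (sum(col) / n for col in zip(*scores.values()))
    return ours, rand, gpt2, (ours - rand) / (gpt2 - rand)


if __name__ == "__main__":
    print("parameters:", gpt2_param_count(VOCAB, CONTEXT, HIDDEN, LAYERS))
    print("steps, warmup:", schedule_lengths(TOTAL_TOKENS, EFFECTIVE_BATCH, CONTEXT))
    print("token budget:", token_budget(TOTAL_TOKENS, MIX_PERCENT))
    print("averages and gap closed:", benchmark_summary(SCORES))
```

Under these assumptions the parameter count comes to 64,085,504, which agrees with the card's 64.09M trainable figure, and the schedule works out to 8,138 optimizer steps with 162 warmup steps. The averages match the table's last row to within rounding.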

dhara-70m

gpt-2-70m

Llama-3.3-70B-o1

Llama-3.2-1B-Instruct-tool-calling-lora

gemma-3-1b-it-reasoning-grpo-lora

Qwen3-0.6B-accuracy-recovery-lora

Qwen3 4B Execution World Model Lora

Llama-3.2-3B-o1

Qwen2.5-Coder-0.5B-Instruct-security-grpo-lora

qwen2-5-coder-0-5b-instruct-progressive-2000k-lora

Qwen3-0.6B-ICM-DPO-mlx-fp16

MathCoT

gemma-3-1b-it-ICM-DPO-mlx-fp16

Llama-3.3-70B-o1-gguf

Qwen3-4B-Instruct-2507-self-verify-lora

malm-165m

public-domain-mickey-mouse

Qwen3-0.6B-ICM-DPO

SmolLM2-70M

Qwen3-0.6B-PTS-DPO

gemma-3-1b-it-ICM-DPO

whisper-age-estimator

Qwen3-0.6B-GRPO-mlx-fp16

DeepSeek-R1-Distill-Qwen-1.5B-PTS-DPO

Qwen3-0.6B-GRPO

optillm-bert-uncased

scorelora

optillm-modernbert-large

Llama-3.2-3B-o1-lora

Llama-3.3-70B-o1-lora

Qwen3-0.6B-PTS-DPO-LoRA