# HF1BitLLM/Llama3-8B-1.58-100B-tokens
The Llama3-8B-1.58 models are large language models fine-tuned with the BitNet 1.58-bit architecture, starting from the base model Llama-3-8B-Instruct. For a deeper dive into the methods and results, check out our blog post.

- Repository: Model
- Paper: The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits

You can easily load and test the model in Transformers. Start by installing a transformers version with the configuration required to load BitNet models, then follow the code below.

## Training Details

1. Starting Point
   - Best-performing checkpoint from the 10-billion-token runs with a linear lambda scheduler
2. Training Duration
   - Fine-tuned for an additional 45,000 steps
   - Reached a total of 100 billion tokens
3. Batch Size
   - 2 million tokens per step
   - Total per run: 45,000 steps × 2 million tokens = 90 billion tokens
   - Combined with the initial 10 billion tokens to reach 100 billion
4. Learning Rate Experiments
   - Various learning rates were tested; according to the experiments, the best-performing peak learning rate is 1e-5
5. Performance
   - Close to Llama3 8B on some metrics
   - Behind Llama3 8B in overall average performance
6. Evaluation
   - Metrics included perplexity, MMLU scores, and other standard benchmarks

These extended training runs on 100 billion tokens pushed the boundaries of highly quantized models, bringing performance closer to half-precision models like Llama3. The evaluation of the models is done on the nanotron checkpoints using LightEval.
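The loading code referred to above is not reproduced here, so the following is a minimal sketch of how such a checkpoint is typically loaded with the standard `transformers` Auto classes. It assumes a `transformers` build that supports BitNet (1.58-bit) checkpoints; the prompt, function name, and generation parameters are illustrative, not from the original card.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Model repository on the Hugging Face Hub (from this card).
MODEL_ID = "HF1BitLLM/Llama3-8B-1.58-100B-tokens"


def generate(prompt: str, max_new_tokens: int = 64) -> str:
    """Load the quantized model and generate a completion for `prompt`.

    Assumes an installed transformers version that can deserialize
    BitNet/1.58-bit weights, as the card's install note requires.
    """
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        device_map="auto",
        torch_dtype=torch.bfloat16,
    )
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)


if __name__ == "__main__":
    print(generate("What is 1.58-bit quantization?"))
```

Downloading an 8B-parameter checkpoint requires substantial disk space and, for reasonable speed, a GPU; `device_map="auto"` lets Accelerate place the weights on the available devices.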
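The token accounting in the training details can be verified with a quick back-of-the-envelope check; the figures below are exactly those stated above.

```python
# Token accounting for the extended training run (figures from the card).
steps = 45_000                    # additional fine-tuning steps
tokens_per_step = 2_000_000       # batch size: 2 million tokens per step
initial_tokens = 10_000_000_000   # 10 billion tokens from the starting runs

fine_tuning_tokens = steps * tokens_per_step
total_tokens = initial_tokens + fine_tuning_tokens

print(fine_tuning_tokens)  # → 90000000000 (90 billion)
print(total_tokens)        # → 100000000000 (100 billion)
```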