# Instruction Pre-Training: Language Models are Supervised Multitask Learners (EMNLP 2024)

This repo contains the general models pre-trained from scratch (on 100B tokens) in our paper **Instruction Pre-Training: Language Models are Supervised Multitask Learners**. We explore supervised multitask pre-training by proposing **Instruction Pre-Training**, a framework that scalably augments massive raw corpora with instruction-response pairs to pre-train language models. The instruction-response pairs are generated by an efficient instruction synthesizer built on open-source models. In our experiments, we synthesize 200M instruction-response pairs covering 40+ task categories to verify the effectiveness of Instruction Pre-Training. Instruction Pre-Training outperforms vanilla pre-training in both general pre-training from scratch and domain-adaptive continual pre-training. In pre-training from scratch, Instruction Pre-Training not only improves pre-trained base models but also benefits more from further instruction tuning. In continual pre-training, Instruction Pre-Training enables Llama3-8B to be comparable to or even outperform Llama3-70B.

## Updates

- 2024/11/30: Released the multimodal version of the instruction synthesizer: Visual Instruction Synthesizer
- 2024/9/20: Our paper has been accepted by the EMNLP 2024 main conference 🎉
- 2024/9/11: Updated FAQ on continual pre-training from Llama3
- 2024/8/29: Updated guidelines on evaluating any 🤗Huggingface models on the domain-specific tasks
- 2024/7/31: Updated pre-training suggestions in the `Advanced Usage` section of instruction-synthesizer
- 2024/7/15: Scaled up the pre-trained tokens from 100B to 250B, with the number of synthesized instruction-response pairs reaching 500M

The performance trend on downstream tasks throughout the pre-training process:

## Resources

🤗 We share our data and models with example usages — feel free to open any discussions at this page!
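The pre-trained models can be loaded like any 🤗 Transformers causal LM. Below is a minimal sketch of such a usage, assuming the standard `AutoModelForCausalLM` loading path; the plain-text prompt format in `build_prompt` is an illustrative assumption, not an official template from the repo:

```python
def build_prompt(instruction: str) -> str:
    """Format a plain-text prompt for the base model.

    The trailing newline is a simple convention assumed here; the
    individual model cards may describe their own prompt templates.
    """
    return f"{instruction}\n"


if __name__ == "__main__":
    # Lazy import so the prompt helper is usable without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "instruction-pretrain/InstructLM-500M"  # from the Resources list
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    inputs = tokenizer(
        build_prompt("Explain instruction pre-training in one sentence."),
        return_tensors="pt",
    )
    outputs = model.generate(**inputs, max_new_tokens=64)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The same pattern applies to the larger and domain-specific checkpoints by swapping the model name.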
- Thanks to the demo davanstrien/instruction-synthesizer for implementing our approach
- Context-Based Instruction Synthesizer: instruction-synthesizer
- Fine-Tuning Data for the Synthesizer: ft-instruction-synthesizer-collection
- General Models Pre-Trained from Scratch (on 100B tokens):
  - InstructLM-500M
  - InstructLM-1.3B
- Domain-Specific Models Pre-Trained from Llama3-8B:
  - Finance-Llama3-8B
  - Biomedicine-Llama3-8B
- General Instruction-Augmented Corpora: general-instruction-augmented-corpora
- Domain-Specific Instruction-Augmented Corpora (no finance data to avoid ethical issues): medicine-instruction-augmented-corpora

## General Pre-Training From Scratch

We augment the RefinedWeb corpus with instruction-response pairs generated by our context-based instruction synthesizer to pre-train general language models from scratch. To evaluate our general base model, we use the lm-evaluation-harness framework.

## Citation

If you find our work helpful, please cite us:
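For the lm-evaluation-harness evaluation mentioned above, a typical invocation looks like the sketch below; the task selection and batch size are illustrative assumptions, not the paper's exact settings:

```shell
# Install the evaluation harness (assumes a working Python environment).
pip install lm-eval

# Evaluate the 500M base model on a few common benchmarks.
# The tasks and batch size below are illustrative, not the paper's settings.
lm_eval --model hf \
  --model_args pretrained=instruction-pretrain/InstructLM-500M \
  --tasks hellaswag,arc_easy,piqa \
  --batch_size 8
```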