# MicroLlava
A compact vision language model that you can pretrain and finetune on a single consumer GPU.

## Highlights

- 📊 VQAv2 Accuracy: Achieves 56.91% on VQAv2 dev/test, making MicroLLaVA one of the best-performing open-source language models with vision capabilities under 700M parameters.
- 🧠 Parameter Budget:
  - 🗣️ Language Model: MicroLLaMA (300M)
  - 👁️ Vision Encoder: SigLIP2 (400M)
  - → ~700M total parameters
- 🏆 Best in Class: According to ChatGPT's Deep Research Agent (Aug 2025):
  > "No known open model below ~700M currently surpasses MicroLLaVA's VQAv2 accuracy. Models that do perform better tend to have larger language components."
- 🧪 Ongoing Experiments:
  - 🔧 Qwen3-0.6B + SigLIP2 → Training is converging with promising loss curves. (Qwen3-0.6B is significantly larger than MicroLLaMA.)
  - ❌ Gemma-3-270M-IT + SigLIP2 → Training did not converge, likely due to instability, bugs, or poor alignment under the current hyperparameters.

## News

- 08/17/2025: This Hugging Face repo was renamed to https://huggingface.co/keeeeenw/MicroLlava.
- 08/17/2025: Improved the VQAv2 average dev-test score from 44.01% to 56.91% by upgrading the vision tower from SigLIP to SigLIP2.
- 08/09/2025: Initial version of MicroLlava released.

## Model overview

| Item            | Detail |
|-----------------|--------|
| Framework       | Transformers + PyTorch |
| Checkpoint type | `safetensors` |
| LLM             | `keeeeenw/MicroLlama` (about 300M parameters) |
| Vision tower    | `siglip2-so400m-patch14-384` |
| Hardware used   | Single NVIDIA RTX 4090 |
| Training stack  | No DeepSpeed required |
| Intended tasks  | Visual Question Answering, caption-style prompts |

MicroLLaVA is a TinyLLaVA Factory based model that pairs the very small language model `keeeeenw/MicroLlama` with an efficient SigLIP2 vision encoder. The goal is to create a vision language model that almost anyone can train and iterate on with one consumer GPU.
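For a quick start, the sketch below shows how inference might look. The `.chat()` call, its argument names, and the prompt handling follow the pattern of other TinyLLaVA Factory checkpoints published with `trust_remote_code`; they are assumptions here, so check the repository files before relying on them.

```python
# Hypothetical quick-start sketch. The trust_remote_code `.chat()` API and its
# argument names mirror other TinyLLaVA Factory checkpoints; they are NOT
# verified against this exact repo.

MODEL_ID = "keeeeenw/MicroLlava"

def build_prompt(question: str) -> str:
    # Plain question text; the TinyLLaVA chat helper is assumed to apply the
    # conversation template (system prompt, image tokens) internally.
    return question.strip()

def answer(image_path: str, question: str) -> str:
    # Imported lazily so the helper above stays usable without transformers.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, trust_remote_code=True)
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=False)
    text, _generation_time = model.chat(
        prompt=build_prompt(question),
        image=image_path,
        tokenizer=tokenizer,
    )
    return text

# Example (downloads ~700M parameters; a GPU is recommended):
# print(answer("example.jpg", "What is in this picture?"))
```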
## Model components

- Language model: `keeeeenw/MicroLlama` with ~300M parameters
- Vision encoder: `siglip2-so400m-patch14-384`
- Training codebase: TinyLLaVA Factory, with additional changes in my custom fork with training tweaks

## Training

Because of its compact size, this model can be trained entirely on a single NVIDIA RTX 4090 without DeepSpeed. Pretraining on LAION-CC-SBU-558K took about 5 hours; supervised finetuning on all datasets from the TinyLLaVA Factory guide (except `ocrvqa`) took about 12 hours on the same GPU.

## Evaluation

VQAv2 Evaluation Results (MicroLlama 300M + SigLIP2-so400m-patch14-384)

| Question Type | Accuracy |
|---------------|----------|
| Yes/No        | 72.32% |
| Number        | 43.89% |
| Other         | 46.65% |
| Overall       | 56.91% |

(Previous version) VQAv2 Evaluation Results (MicroLlama 300M + SigLIP-so400m-patch14-384)

| Question Type | Accuracy |
|---------------|----------|
| Yes/No        | 65.08% |
| Number        | 28.97% |
| Other         | 29.32% |
| Overall       | 44.01% |

More evaluation results will be added in the coming days. Community contributions with benchmark results are welcome and encouraged.

## Intended uses

- Rapid experimentation for vision-language research on limited hardware
- Educational demonstrations for students and hobbyists
- Starting point for domain-specific finetuning

## Limitations

- The small LLM size and compact vision encoder may limit reasoning depth and OCR performance
- Performance can vary significantly depending on the image domain and quality
- The model includes minimal safety filtering and refusal behavior; downstream applications should implement their own safeguards

> ⚠️ This model should not be used for applications that may cause harm or have significant safety, financial, legal, or medical implications without thorough human review.

## License

This model is released under the Apache License 2.0. You are free to use, modify, and distribute this model and its derivatives, provided that you comply with the terms of the license.
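For reference, the VQAv2 tables above use the dataset's soft-accuracy metric: each question has 10 human answers, and a prediction scores `min(#matching answers / 3, 1)`. The sketch below illustrates the core formula; the official evaluator additionally normalizes answer strings and averages over 9-annotator subsets, which is omitted here for brevity.

```python
# Minimal sketch of the VQAv2 soft-accuracy metric. The official evaluator
# also normalizes answers (articles, punctuation, number words) and averages
# over every 9-annotator subset; both refinements are omitted here.

def vqa_accuracy(prediction: str, human_answers: list[str]) -> float:
    matches = sum(
        1 for a in human_answers
        if a.strip().lower() == prediction.strip().lower()
    )
    # Matching 3 or more of the 10 annotators earns full credit.
    return min(matches / 3.0, 1.0)

# Example: 6 annotators said "cat", 4 said "kitten".
answers = ["cat"] * 6 + ["kitten"] * 4
print(vqa_accuracy("cat", answers))     # 1.0
print(vqa_accuracy("kitten", answers))  # 1.0 (4 matches, capped at 1)
print(vqa_accuracy("dog", answers))     # 0.0
```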
If you use this model in your research or applications, please credit the original authors and clearly indicate any modifications you have made.

> Note: Ensure that the datasets used for pretraining or finetuning also allow redistribution of derived model weights.

## Acknowledgements

This work builds upon the efforts of many in the open-source AI community:

- TinyLLaVA Factory maintainers and contributors for creating the training framework
- `keeeeenw/MicroLlama` (I am also the creator of MicroLlama; please help support my work!)
- SigLIP2 authors for the efficient vision encoder architecture
- Contributors to LAION-CC-SBU-558K and the other datasets used in pretraining and finetuning
- The Hugging Face ecosystem for hosting, tools, and community support