HuggingFaceTB
✓ Verified · AI Startup · Hugging Face's technical team, platform leaders
SmolLM2-135M
---
library_name: transformers
license: apache-2.0
language:
- en
---
SmolLM2-360M-Instruct
---
library_name: transformers
license: apache-2.0
language:
- en
pipeline_tag: text-generation
tags:
- safetensors
- onnx
- transformers.js
base_model:
- HuggingFaceTB/SmolLM2-360M
---
SmolLM-135M
---
library_name: transformers
license: apache-2.0
language:
- en
datasets:
- HuggingFaceTB/smollm-corpus
---
SmolLM2-360M
---
library_name: transformers
license: apache-2.0
language:
- en
---
SmolVLM-256M-Instruct
---
library_name: transformers
license: apache-2.0
datasets:
- HuggingFaceM4/the_cauldron
- HuggingFaceM4/Docmatix
pipeline_tag: image-text-to-text
language:
- en
base_model:
- HuggingFaceTB/SmolLM2-135M-Instruct
- google/siglip-base-patch16-512
---
SmolLM2-135M-Instruct
---
library_name: transformers
license: apache-2.0
language:
- en
pipeline_tag: text-generation
tags:
- safetensors
- onnx
- transformers.js
base_model:
- HuggingFaceTB/SmolLM2-135M
---
SmolVLM2-2.2B-Instruct
---
library_name: transformers
license: apache-2.0
datasets:
- HuggingFaceM4/the_cauldron
- HuggingFaceM4/Docmatix
- lmms-lab/LLaVA-OneVision-Data
- lmms-lab/M4-Instruct-Data
- HuggingFaceFV/finevideo
- MAmmoTH-VL/MAmmoTH-VL-Instruct-12M
- lmms-lab/LLaVA-Video-178K
- orrzohar/Video-STaR
- Mutonix/Vript
- TIGER-Lab/VISTA-400K
- Enxin/MovieChat-1K_train
- ShareGPT4Video/ShareGPT4Video
pipeline_tag: image-text-to-text
tags:
- video-text-to-text
language:
- en
base_model:
- HuggingFaceTB/SmolVLM-Ins
SmolVLM2-500M-Video-Instruct
---
library_name: transformers
license: apache-2.0
datasets:
- HuggingFaceM4/the_cauldron
- HuggingFaceM4/Docmatix
- lmms-lab/LLaVA-OneVision-Data
- lmms-lab/M4-Instruct-Data
- HuggingFaceFV/finevideo
- MAmmoTH-VL/MAmmoTH-VL-Instruct-12M
- lmms-lab/LLaVA-Video-178K
- orrzohar/Video-STaR
- Mutonix/Vript
- TIGER-Lab/VISTA-400K
- Enxin/MovieChat-1K_train
- ShareGPT4Video/ShareGPT4Video
pipeline_tag: image-text-to-text
language:
- en
base_model:
- HuggingFaceTB/SmolVLM-500M-Instruct
---
SmolVLM-Instruct
---
library_name: transformers
license: apache-2.0
datasets:
- HuggingFaceM4/the_cauldron
- HuggingFaceM4/Docmatix
pipeline_tag: image-text-to-text
language:
- en
base_model:
- HuggingFaceTB/SmolLM2-1.7B-Instruct
- google/siglip-so400m-patch14-384
---
SmolLM-360M
A model based on the Transformers library, licensed under Apache 2.0, designed for natural language processing tasks.
SmolVLM2-256M-Video-Instruct
SmolVLM2-256M-Video is a lightweight multimodal model designed to analyze video content. The model processes video, image, and text inputs to generate text outputs, whether answering questions about media files, comparing visual content, or transcribing text from images. Despite its compact size, it requires only 1.38 GB of GPU RAM for video inference. This efficiency makes it particularly well suited for on-device applications where computational resources are limited and domain-specific fine-tuning is required.

Model Summary
- Developed by: Hugging Face 🤗
- Model type: Multi-modal model (image/multi-image/video/text)
- Language(s) (NLP): English
- License: Apache 2.0
- Architecture: Based on Idefics3 (see technical summary)
- Demo: Video Highlight Generator
- Blog: Blog post

SmolVLM2 can be used for inference on multimodal (video/image/text) tasks where the input consists of text queries along with a video or one or more images. Text and media files can be interleaved arbitrarily, enabling tasks like captioning, visual question answering, and storytelling based on visual content. The model does not support image or video generation. To fine-tune SmolVLM2 on a specific task, you can follow the fine-tuning tutorial.

We evaluated the performance of the SmolVLM2 family on the following scientific benchmarks:

| Size | Video-MME | MLVU | MVBench |
|------|-----------|------|---------|
| 2.2B | 52.1 | 55.2 | 46.27 |
| 500M | 42.2 | 47.3 | 39.73 |
| 256M | 33.7 | 40.6 | 32.7 |

You can use transformers to load, run inference with, and fine-tune SmolVLM2. Make sure you have num2words, flash-attn, and the latest transformers installed. You can preprocess your inputs with chat templates and pass them directly to the model. To use SmolVLM2 for video inference, make sure you have decord installed. You can interleave multiple media files with text using chat templates.
SmolVLM is not intended for high-stakes scenarios or critical decision-making processes that affect an individual's well-being or livelihood. The model may produce content that appears factual but may not be accurate. Misuse includes, but is not limited to:

- Prohibited Uses:
  - Evaluating or scoring individuals (e.g., in employment, education, credit)
  - Critical automated decision-making
  - Generating unreliable factual content
- Malicious Activities:
  - Spam generation
  - Disinformation campaigns
  - Harassment or abuse
  - Unauthorized surveillance

SmolVLM2 uses SigLIP as the image encoder and SmolLM2 as the text decoder. We release the SmolVLM2 checkpoints under the Apache 2.0 license.

Citation information: you can cite us in the following way.

Training Data

SmolVLM2 used 3.3M samples for training, drawn from ten different datasets: LLaVA-OneVision, M4-Instruct, MAmmoTH-VL, LLaVA-Video-178K, FineVideo, Video-STaR, Vript, VISTA-400K, MovieChat, and ShareGPT4Video. The following tables give a general overview of the samples across modalities and the sources of those samples.
| Data Type | Percentage |
|-------------|------------|
| Image | 34.4% |
| Text | 20.2% |
| Video | 33.0% |
| Multi-image | 12.3% |

Text Datasets

| Dataset | Percentage |
|---------|------------|
| llava-onevision/magpieproft380bmt | 6.8% |
| llava-onevision/magpieproft380btt | 6.8% |
| llava-onevision/magpieproqwen272btt | 5.8% |
| llava-onevision/mathqa | 0.9% |

Multi-image Datasets

| Dataset | Percentage |
|---------|------------|
| m4-instruct-data/m4instructmultiimage | 10.4% |
| mammoth/multiimage-cap6 | 1.9% |

Image Datasets

| Dataset | Percentage |
|---------|------------|
| llava-onevision/other | 17.4% |
| llava-onevision/visionflan | 3.9% |
| llava-onevision/mavismathmetagen | 2.6% |
| llava-onevision/mavismathrulegeo | 2.5% |
| llava-onevision/sharegpt4o | 1.7% |
| llava-onevision/sharegpt4vcoco | 1.5% |
| llava-onevision/imagetextualization | 1.3% |
| llava-onevision/sharegpt4vllava | 0.9% |
| llava-onevision/mapqa | 0.9% |
| llava-onevision/qa | 0.8% |
| llava-onevision/textocr | 0.8% |

Video Datasets

| Dataset | Percentage |
|---------|------------|
| llava-video-178k/1-2m | 7.3% |
| llava-video-178k/2-3m | 7.0% |
| other-video/combined | 5.7% |
| llava-video-178k/hound | 4.4% |
| llava-video-178k/0-30s | 2.4% |
| video-star/starb | 2.2% |
| vista-400k/combined | 2.2% |
| vript/long | 1.0% |
| ShareGPT4Video/all | 0.8% |
SmolLM3-3B
---
library_name: transformers
license: apache-2.0
language:
- en
- fr
- es
- it
- pt
- zh
- ar
- ru
base_model:
- HuggingFaceTB/SmolLM3-3B-Base
---
SmolVLM-500M-Instruct
---
library_name: transformers
license: apache-2.0
datasets:
- HuggingFaceM4/the_cauldron
- HuggingFaceM4/Docmatix
pipeline_tag: image-text-to-text
language:
- en
base_model:
- HuggingFaceTB/SmolLM2-360M-Instruct
- google/siglip-base-patch16-512
---
SmolLM2-1.7B-Instruct
SmolVLM-256M-Base
SmolLM-1.7B
This model is based on the transformers library and is licensed under Apache 2.0.
SmolLM-135M-Instruct
This model is based on HuggingFaceTB/SmolLM-135M and is licensed under Apache 2.0.
SmolLM2-1.7B
1. Model Summary
2. Evaluation
3. Limitations
4. Training
5. License
6. Citation

SmolLM2 is a family of compact language models available in three sizes: 135M, 360M, and 1.7B parameters. They are capable of solving a wide range of tasks while being lightweight enough to run on-device. More details in our paper: https://arxiv.org/abs/2502.02737v1

The 1.7B variant demonstrates significant advances over its predecessor SmolLM1-1.7B, particularly in instruction following, knowledge, reasoning, and mathematics. It was trained on 11 trillion tokens using a diverse dataset combination: FineWeb-Edu, DCLM, and The Stack, along with new mathematics and coding datasets that we curated and will release soon.

We developed the instruct version through supervised fine-tuning (SFT) using a combination of public datasets and our own curated datasets. We then applied Direct Preference Optimization (DPO) using UltraFeedback. The instruct model additionally supports tasks such as text rewriting, summarization, and function calling thanks to datasets developed by Argilla such as Synth-APIGen-v0.1. You can find the SFT dataset here: https://huggingface.co/datasets/HuggingFaceTB/smoltalk and the fine-tuning code in the alignment handbook. For more details refer to: https://github.com/huggingface/smollm, where you will find pre-training, post-training, evaluation, and local inference code.

Running the model on CPU/GPU/multi-GPU

Using full precision

In this section, we report the evaluation results of SmolLM2. All evaluations are zero-shot unless stated otherwise, and we use lighteval to run them.
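The full-precision CPU/GPU loading mentioned above can be sketched as follows (a minimal example with transformers; the prompt is an arbitrary illustration):

```python
# Sketch: running SmolLM2-1.7B in full precision on CPU or a single GPU.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "HuggingFaceTB/SmolLM2-1.7B"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)

# Base (non-instruct) model: plain text completion, no chat template.
inputs = tokenizer.encode("Gravity is", return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=30)
text = tokenizer.decode(outputs[0])
print(text)
```

For multi-GPU or reduced-precision inference, `device_map="auto"` and `torch_dtype=torch.bfloat16` in `from_pretrained` are the usual transformers options.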
| Metric | SmolLM2-1.7B | Llama-1B | Qwen2.5-1.5B | SmolLM1-1.7B |
|------------------|--------------|----------|--------------|--------------|
| HellaSwag | 68.7 | 61.2 | 66.4 | 62.9 |
| ARC (Average) | 60.5 | 49.2 | 58.5 | 59.9 |
| PIQA | 77.6 | 74.8 | 76.1 | 76.0 |
| MMLU-Pro (MCF) | 19.4 | 11.7 | 13.7 | 10.8 |
| CommonsenseQA | 43.6 | 41.2 | 34.1 | 38.0 |
| TriviaQA | 36.7 | 28.1 | 20.9 | 22.5 |
| Winogrande | 59.4 | 57.8 | 59.3 | 54.7 |
| OpenBookQA | 42.2 | 38.4 | 40.0 | 42.4 |
| GSM8K (5-shot) | 31.0 | 7.2 | 61.3 | 5.5 |

| Metric | SmolLM2-1.7B-Instruct | Llama-1B-Instruct | Qwen2.5-1.5B-Instruct | SmolLM1-1.7B-Instruct |
|:-----------------------------|:---------------------:|:-----------------:|:---------------------:|:---------------------:|
| IFEval (Average prompt/inst) | 56.7 | 53.5 | 47.4 | 23.1 |
| MT-Bench | 6.13 | 5.48 | 6.52 | 4.33 |
| OpenRewrite-Eval (micro_avg RougeL) | 44.9 | 39.2 | 46.9 | NaN |
| HellaSwag | 66.1 | 56.1 | 60.9 | 55.5 |
| ARC (Average) | 51.7 | 41.6 | 46.2 | 43.7 |
| PIQA | 74.4 | 72.3 | 73.2 | 71.6 |
| MMLU-Pro (MCF) | 19.3 | 12.7 | 24.2 | 11.7 |
| BBH (3-shot) | 32.2 | 27.6 | 35.3 | 25.7 |
| GSM8K (5-shot) | 48.2 | 26.8 | 42.8 | 4.62 |

SmolLM2 models primarily understand and generate content in English. They can produce text on a variety of topics, but the generated content may not always be factually accurate, logically consistent, or free from biases present in the training data. These models should be used as assistive tools rather than definitive sources of information. Users should always verify important information and critically evaluate any generated content.

- Architecture: Transformer decoder
- Pretraining tokens: 11T
- Precision: bfloat16
SmolLM3-3B-Base
1. Model Summary
2. How to use
3. Evaluation
4. Training
5. Limitations
6. License

SmolLM3 is a 3B-parameter language model designed to push the boundaries of small models. It supports six languages, advanced reasoning, and long context. SmolLM3 is a fully open model that offers strong performance at the 3B–4B scale.

SmolLM3-3B-Base is the base model after pretraining; you can find the instruct model at SmolLM3-3B. The model is a decoder-only transformer using GQA and NoPE. It was pretrained on 11.2T tokens with a staged curriculum of web, code, math, and reasoning data. Post-training included midtraining on 140B reasoning tokens followed by supervised fine-tuning and alignment via Anchored Preference Optimization (APO).

Key features
- Instruct model optimized for hybrid reasoning
- Fully open model: open weights + full training details, including the public data mixture and training configs
- Long context: trained on 64k context and supports up to 128k tokens using YaRN extrapolation
- Multilingual: six natively supported languages (English, French, Spanish, German, Italian, and Portuguese)

For more details refer to our blog post: https://hf.co/blog/smollm3

How to use

The modeling code for SmolLM3 is available in transformers `v4.53.0`, so make sure to upgrade your transformers version. You can also load the model with the latest `vllm`, which uses transformers as a backend. For local inference, you can use `llama.cpp`, `ONNX`, `MLX`, and `MLC`. You can find quantized checkpoints in this collection (https://huggingface.co/collections/HuggingFaceTB/smollm3-686d33c1fdffe8e635317e23).

The current `config.json` is set for a context length of up to 65,536 tokens. To handle longer inputs (128k or 256k), we use YaRN: change `max_position_embeddings` and `rope_scaling` accordingly.

In this section, we report the evaluation results of the SmolLM3 model. All evaluations are zero-shot unless stated otherwise, and we use lighteval to run them.
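The `max_position_embeddings`/`rope_scaling` change mentioned above can be sketched as a config fragment. This is a sketch: the field names follow the transformers rope-scaling schema, and a factor of 2.0 is what doubles the trained 65,536-token window to 131,072; scaling beyond that (e.g. 256k) would need a larger factor.

```python
# Sketch: config.json overrides to enable YaRN extrapolation to 128k tokens.
config_update = {
    "max_position_embeddings": 131072,  # target context length
    "rope_scaling": {
        "rope_type": "yarn",
        "factor": 2.0,  # target / trained context
        "original_max_position_embeddings": 65536,  # trained context length
    },
}

# Sanity check: the factor must match the ratio of target to trained context.
ratio = (config_update["max_position_embeddings"]
         / config_update["rope_scaling"]["original_max_position_embeddings"])
assert ratio == config_update["rope_scaling"]["factor"]
print(config_update)
```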
We highlight the best score in bold and underline the second-best score.

English benchmarks

Note: All evaluations are zero-shot unless stated otherwise. For the Ruler 64k evaluation, we apply YaRN to the Qwen models with 32k context to extrapolate the context length.

| Category | Metric | SmolLM3-3B | Qwen2.5-3B | Llama3-3.2B | Qwen3-1.7B-Base | Qwen3-4B-Base |
|---------|--------|------------|------------|-------------|-----------------|---------------|
| Reasoning & Commonsense | HellaSwag | 76.15 | 74.19 | 75.52 | 60.52 | 74.37 |
| | ARC-CF (Average) | 65.61 | 59.81 | 58.58 | 55.88 | 62.11 |
| | Winogrande | 58.88 | 61.41 | 58.72 | 57.06 | 59.59 |
| | CommonsenseQA | 55.28 | 49.14 | 60.60 | 48.98 | 52.99 |
| Knowledge & Understanding | MMLU-CF (Average) | 44.13 | 42.93 | 41.32 | 39.11 | 47.65 |
| | MMLU Pro CF | 19.61 | 16.66 | 16.42 | 18.04 | 24.92 |
| | MMLU Pro MCF | 32.70 | 31.32 | 25.07 | 30.39 | 41.07 |
| | PIQA | 78.89 | 78.35 | 78.51 | 75.35 | 77.58 |
| | OpenBookQA | 40.60 | 40.20 | 42.00 | 36.40 | 42.40 |
| | BoolQ | 78.99 | 73.61 | 75.33 | 74.46 | 74.28 |
| Coding & Math | HumanEval+ | 30.48 | 34.14 | 25.00 | 43.29 | 54.87 |
| | MBPP+ | 52.91 | 52.11 | 38.88 | 59.25 | 63.75 |
| | MATH (4-shot) | 46.10 | 40.10 | 7.44 | 41.64 | 51.20 |
| | GSM8k (5-shot) | 67.63 | 70.13 | 25.92 | 65.88 | 74.14 |
| Long context | Ruler 32k | 76.35 | 75.93 | 77.58 | 70.63 | 83.98 |
| | Ruler 64k | 67.85 | 64.90 | 72.93 | 57.18 | 60.29 |
| | Ruler 128k | 61.03 | 62.23 | 71.30 | 43.03 | 47.23 |

Main supported languages

| Category | Metric | SmolLM3 3B Base | Qwen2.5-3B | Llama3.2 3B | Qwen3 1.7B Base | Qwen3 4B Base |
|---------|--------|-----------------|------------|-------------|-----------------|---------------|
| French | MLMM HellaSwag | 63.94 | 57.47 | 57.66 | 51.26 | 61.00 |
| | Belebele | 51.00 | 51.55 | 49.22 | 49.44 | 55.00 |
| | Global MMLU (CF) | 38.37 | 34.22 | 33.71 | 34.94 | 41.80 |
| | Flores-200 (5-shot) | 62.85 | 61.38 | 62.89 | 58.68 | 65.76 |
| Spanish | MLMM HellaSwag | 65.85 | 58.25 | 59.39 | 52.40 | 61.85 |
| | Belebele | 47.00 | 48.88 | 47.00 | 47.56 | 50.33 |
| | Global MMLU (CF) | 38.51 | 35.84 | 35.60 | 34.79 | 41.22 |
| | Flores-200 (5-shot) | 48.25 | 50.00 | 44.45 | 46.93 | 50.16 |
| German | MLMM HellaSwag | 59.56 | 49.99 | 53.19 | 46.10 | 56.43 |
| | Belebele | 48.44 | 47.88 | 46.22 | 48.00 | 53.44 |
| | Global MMLU (CF) | 35.10 | 33.19 | 32.60 | 32.73 | 38.70 |
| | Flores-200 (5-shot) | 56.60 | 50.63 | 54.95 | 52.58 | 50.48 |
| Italian | MLMM HellaSwag | 62.49 | 53.21 | 54.96 | 48.72 | 58.76 |
| | Belebele | 46.44 | 44.77 | 43.88 | 44.00 | 48.78 |
| | Global MMLU (CF) | 36.99 | 33.91 | 32.79 | 35.37 | 39.26 |
| | Flores-200 (5-shot) | 52.65 | 54.87 | 48.83 | 48.37 | 49.11 |
| Portuguese | MLMM HellaSwag | 63.22 | 57.38 | 56.84 | 50.73 | 59.89 |
| | Belebele | 47.67 | 49.22 | 45.00 | 44.00 | 50.00 |
| | Global MMLU (CF) | 36.88 | 34.72 | 33.05 | 35.26 | 40.66 |
| | Flores-200 (5-shot) | 60.93 | 57.68 | 54.28 | 56.58 | 63.43 |

The model has also been trained on Arabic (standard), Chinese, and Russian data, but has seen fewer tokens in these languages compared to the six above. We report the performance on these languages for information.
Other supported languages

| Category | Metric | SmolLM3 3B Base | Qwen2.5-3B | Llama3.2 3B | Qwen3 1.7B Base | Qwen3 4B Base |
|---------|--------|-----------------|------------|-------------|-----------------|---------------|
| Arabic | Belebele | 40.22 | 44.22 | 45.33 | 42.33 | 51.78 |
| | Global MMLU (CF) | 28.57 | 28.81 | 27.67 | 29.37 | 31.85 |
| | Flores-200 (5-shot) | 40.22 | 39.44 | 44.43 | 35.82 | 39.76 |
| Chinese | Belebele | 43.78 | 44.56 | 49.56 | 48.78 | 53.22 |
| | Global MMLU (CF) | 36.16 | 33.79 | 39.57 | 38.56 | 44.55 |
| | Flores-200 (5-shot) | 29.17 | 33.21 | 31.89 | 25.70 | 32.50 |
| Russian | Belebele | 47.44 | 45.89 | 47.44 | 45.22 | 51.44 |
| | Global MMLU (CF) | 36.51 | 32.47 | 34.52 | 34.83 | 38.80 |
| | Flores-200 (5-shot) | 47.13 | 48.74 | 50.74 | 54.70 | 60.53 |

No Extended Thinking

Evaluation results of non-reasoning models and reasoning models in no-thinking mode. We highlight the best and second-best scores in bold.

| Category | Metric | SmolLM3-3B | Qwen2.5-3B | Llama3.1-3B | Qwen3-1.7B | Qwen3-4B |
|---------|--------|------------|------------|-------------|------------|----------|
| High school math competition | AIME 2025 | 9.3 | 2.9 | 0.3 | 8.0 | 17.1 |
| Math problem-solving | GSM-Plus | 72.8 | 74.1 | 59.2 | 68.3 | 82.1 |
| Competitive programming | LiveCodeBench v4 | 15.2 | 10.5 | 3.4 | 15.0 | 24.9 |
| Graduate-level reasoning | GPQA Diamond | 35.7 | 32.2 | 29.4 | 31.8 | 44.4 |
| Instruction following | IFEval | 76.7 | 65.6 | 71.6 | 74.0 | 68.9 |
| Alignment | MixEval Hard | 26.9 | 27.6 | 24.9 | 24.3 | 31.6 |
| Tool Calling | BFCL | 92.3 | - | 92.3 | 89.5 | 95.0 |
| Multilingual Q&A | Global MMLU | 53.5 | 50.54 | 46.8 | 49.5 | 65.1 |

Extended Thinking

Evaluation results in reasoning mode for SmolLM3 and Qwen3 models:

| Category | Metric | SmolLM3-3B | Qwen3-1.7B | Qwen3-4B |
|---------|--------|------------|------------|----------|
| High school math competition | AIME 2025 | 36.7 | 30.7 | 58.8 |
| Math problem-solving | GSM-Plus | 83.4 | 79.4 | 88.2 |
| Competitive programming | LiveCodeBench v4 | 30.0 | 34.4 | 52.9 |
| Graduate-level reasoning | GPQA Diamond | 41.7 | 39.9 | 55.3 |
| Instruction following | IFEval | 71.2 | 74.2 | 85.4 |
| Alignment | MixEval Hard | 30.8 | 33.9 | 38.0 |
| Tool Calling | BFCL | 88.8 | 88.8 | 95.5 |
| Multilingual Q&A | Global MMLU | 64.1 | 62.3 | 73.3 |

- Architecture: Transformer decoder
- Pretraining tokens: 11T
- Precision: bfloat16
- GPUs: 384 H100
- Training Framework: nanotron
- Data processing framework: datatrove
- Evaluation framework: lighteval
- Post-training Framework: TRL

Open resources

Here is an infographic with all the training details.
- The datasets used for pretraining can be found in this collection; those used in mid-training and post-training will be released in the following weeks
- The training and evaluation configs and code can be found in the huggingface/smollm repository
- The training intermediate checkpoints are available at HuggingFaceTB/SmolLM3-3B-checkpoints

SmolLM3 can produce text on a variety of topics, but the generated content may not always be factually accurate, logically consistent, or free from biases present in the training data. These models should be used as assistive tools rather than definitive sources of information. Users should always verify important information and critically evaluate any generated content.
SmolLM-360M-Instruct
License: Apache 2.0. Base model: HuggingFaceTB/SmolLM-360M.
SmolLM2-360M-Instruct-GGUF
ngxson/SmolLM2-360M-Instruct-Q8_0-GGUF

This model was converted to GGUF format from `HuggingFaceTB/SmolLM2-360M-Instruct` using llama.cpp via ggml.ai's GGUF-my-repo space. Refer to the original model card for more details on the model.

Use with llama.cpp

Step 1: Install llama.cpp through brew (works on Mac and Linux). Note: you can also use this checkpoint directly through the usage steps listed in the llama.cpp repo.

Step 2: Move into the llama.cpp folder and build it with the `LLAMA_CURL=1` flag along with any hardware-specific flags (for example, `LLAMA_CUDA=1` for Nvidia GPUs on Linux).
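A sketch of the brew-based flow above; the `--hf-file` name is an assumption (check the repo's file list), and the prompt is an arbitrary illustration:

```shell
# Step 1: install llama.cpp (Mac and Linux).
brew install llama.cpp

# Run the CLI straight from the Hub; llama.cpp downloads the GGUF on first use.
# NOTE: the --hf-file value below is an assumed file name.
llama-cli --hf-repo ngxson/SmolLM2-360M-Instruct-Q8_0-GGUF \
  --hf-file smollm2-360m-instruct-q8_0.gguf \
  -p "Write a haiku about small language models."
```

When building from source instead (Step 2), pass the flags at configure time, e.g. `cmake -B build -DLLAMA_CURL=1`.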
SmolLM-1.7B-Instruct
This model is based on HuggingFaceTB/SmolLM-1.7B and is licensed under Apache 2.0.
SmolLM3-3B-checkpoints
We are releasing intermediate checkpoints of SmolLM3 to enable further research. For more details, check the SmolLM GitHub repo with the end-to-end training and evaluation code:
- ✓ Pretraining scripts (nanotron)
- ✓ Post-training code for SFT + APO (TRL/alignment-handbook)
- ✓ Evaluation scripts to reproduce all reported metrics

We release checkpoints every 40,000 steps, which equals 94.4B tokens. The GBS (global batch size) in tokens for SmolLM3-3B is 2,359,296. To calculate the number of tokens seen at a given step, multiply the step number by the GBS.

- Stage 1: steps 0 to 3,450,000 (86 checkpoints) config
- Stage 2: steps 3,450,000 to 4,200,000 (19 checkpoints) config
- Stage 3: steps 4,200,000 to 4,720,000 (13 checkpoints) config

For the additional two stages that extend the context length to 64k, we sample checkpoints every 4,000 steps (9.4B tokens), for a total of 10 checkpoints. We also release checkpoints at every step of our post-training recipe: mid-training, SFT, APO soup, and LC expert.
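The step-to-token conversion described above is a single multiplication; a quick check that the stated numbers agree:

```python
# Tokens seen at a given training step = step * GBS (global batch size in tokens).
GBS_TOKENS = 2_359_296  # GBS for SmolLM3-3B, from the card above

def tokens_at_step(step: int) -> int:
    return step * GBS_TOKENS

# One checkpoint interval of 40,000 steps is ~94.4B tokens, as stated.
interval_tokens = tokens_at_step(40_000)
print(f"{interval_tokens / 1e9:.1f}B tokens per 40k-step interval")

# The 4,000-step intervals of the long-context stages are ~9.4B tokens.
print(f"{tokens_at_step(4_000) / 1e9:.1f}B tokens per 4k-step interval")
```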
smollm-135M-instruct-v0.2-Q8_0-GGUF
SmolLM2-1.7B-Instruct-GGUF
SmolVLM-Base
finemath-classifier
smolvlm-app-config
SmolLM2-1.7B-Instruct-16k
SmolLM3-3B-ONNX
SmolVLM-500M-Base
smollm-360M-instruct-add-basics-q0f16-MLC
SmolLM2-1.7B-intermediate-checkpoints
smollm-360M-instruct-v0.2-Q8_0-GGUF
SmolLM-360M-Instruct-ONNX-fp16
SmolLM2-135M-intermediate-checkpoints
smollm2-135M-SFT-Only
cosmo-1b
SmolVLM2-2.2B-Base
This is the base model for SmolVLM2-2.2B, a lightweight multimodal model designed to analyze video content. The model processes video, image, and text inputs to generate text outputs, whether answering questions about media files, comparing visual content, or transcribing text from images. Despite its compact size, requiring only 5.2 GB of GPU RAM for video inference, it delivers robust performance on complex multimodal tasks. This efficiency makes it particularly well suited for on-device applications where computational resources may be limited.

Model Summary
- Developed by: Hugging Face 🤗
- Model type: Multi-modal model (image/multi-image/video/text)
- Language(s) (NLP): English
- License: Apache 2.0
- Architecture: Based on Idefics3 (see technical summary)
- Demo: Video Highlight Generator
- Blog: Blog post

SmolVLM2 can be used for inference on multimodal (video/image/text) tasks where the input consists of text queries along with a video or one or more images. Text and media files can be interleaved arbitrarily, enabling tasks like captioning, visual question answering, and storytelling based on visual content. The model does not support image or video generation. To fine-tune SmolVLM2 on a specific task, you can follow the fine-tuning tutorial.

You can use transformers to load, run inference with, and fine-tune SmolVLM2. Make sure you have num2words, flash-attn, and the latest transformers installed. You can preprocess your inputs with chat templates and pass them directly to the model. To use SmolVLM2 for video inference, make sure you have decord installed. You can interleave multiple media files with text using chat templates.

SmolVLM is not intended for high-stakes scenarios or critical decision-making processes that affect an individual's well-being or livelihood. The model may produce content that appears factual but may not be accurate.
Misuse includes, but is not limited to:

- Prohibited Uses:
  - Evaluating or scoring individuals (e.g., in employment, education, credit)
  - Critical automated decision-making
  - Generating unreliable factual content
- Malicious Activities:
  - Spam generation
  - Disinformation campaigns
  - Harassment or abuse
  - Unauthorized surveillance

SmolVLM2 is built upon the shape-optimized SigLIP as the image encoder and SmolLM2 as the text decoder. We release the SmolVLM2 checkpoints under the Apache 2.0 license.

Citation information: you can cite us in the following way.
SmolLM2-1.7B-sft-only
smollm-360M-instruct-add-basics
finemath-ablation-infiwebmath
finemath-ablation-finemath-4plus
finemath-ablation-finemath-3plus
stack-edu-classifier-python
finemath-ablation-infiwebmath-4plus
finemath-ablation-infiwebmath-3plus
finemath-ablation-finemath-infimath-3plus
finemath-ablation-finemath-infimath-4plus
finemath-ablation-4plus-160B
finemath-ablation-3plus-160B
finemath-ablation-owm
This model is part of the 📐 FineMath ablations, in which we continue pretraining Llama-3.2-3B base on different math datasets for 60B tokens. The model has 3.21B parameters and a 4096-token context length. It was trained on 60B tokens from OpenWebMath, tokenized using the `llama3` tokenizer.

This model was trained on English math data and is not instruction-tuned, making it intended for text completion in English with a focus on math. It is important to note that the primary intended use case of this model is to compare its performance with other models trained under the same conditions. This model is not necessarily the best possible outcome achievable with the given dataset.

We are releasing intermediate checkpoints for this model at intervals of every 10,000 training steps (10B tokens) in separate branches. The naming convention follows the token count, e.g. `10B`. You can load a specific model revision with `transformers` using the `revision` argument, and you can list all the revisions for the model programmatically.

Training

Model
- Architecture: Llama3
- Pretraining steps: 60k
- Pretraining tokens: 60B
- Precision: bfloat16

Software
- nanotron for training
- datatrove for tokenization
- lighteval for evaluation

Evaluation

We used the SmolLM2 setup to evaluate all our ablation models with `lighteval`. You can find the details here: https://github.com/huggingface/smollm/tree/main/evaluation#smollm2-base-models

Limitations

This model was predominantly trained on English math data, potentially limiting its performance in other languages. Furthermore, the model's behavior is influenced by the quality and diversity of its training data, which may include biases and harmful content.
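The revision workflow described above can be sketched with `huggingface_hub` (listing branches is a lightweight metadata call; the weight download in the commented part is heavy, so it is shown but not executed here):

```python
# Sketch: enumerating the per-10B-token checkpoint branches of a FineMath
# ablation repo, then loading one via the `revision` argument.
from huggingface_hub import list_repo_refs

repo_id = "HuggingFaceTB/finemath-ablation-owm"
refs = list_repo_refs(repo_id)
branches = sorted(ref.name for ref in refs.branches)
print(branches)  # branch names such as "10B", "main", ...

# To actually load a specific revision (downloads the 3.21B-parameter weights):
# from transformers import AutoModelForCausalLM
# model = AutoModelForCausalLM.from_pretrained(repo_id, revision="10B")
```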