# Xkev/Llama-3.2V-11B-cot
Llama-3.2V-11B-cot is a visual language model capable of spontaneous, systematic reasoning. The model was proposed in LLaVA-CoT: Let Vision Language Models Reason Step-by-Step. Our model is built upon meta-llama/Llama-3.2-11B-Vision-Instruct. Llama 3.2 is licensed under the LLaMA 3.2 Community License, Copyright © Meta Platforms, Inc. The use of our model must comply with Meta's Acceptable Use Policy.

- License: apache-2.0
- Finetuned from model: meta-llama/Llama-3.2-11B-Vision-Instruct

| MMStar | MMBench | MMVet | MathVista | AI2D | Hallusion | Average |
|--------|---------|-------|-----------|------|-----------|---------|
| 57.6   | 75.0    | 60.3  | 54.8      | 85.7 | 47.8      | 63.5    |

To reproduce our results, you should use VLMEvalKit with the following settings:

| Parameter      | Value |
|----------------|-------|
| do_sample      | True  |
| temperature    | 0.6   |
| top_p          | 0.9   |
| max_new_tokens | 2048  |

You may change them in this file, lines 80-83, and modify max_new_tokens throughout the file. Note: we follow the same settings as Llama-3.2-11B-Vision-Instruct, except that we extend max_new_tokens to 2048.

After you get the results, you should filter the model output and keep only the text between \ and \ . In theory this should make no difference, but empirically we observe some performance gap because the judge, GPT-4o, can occasionally be inaccurate. By keeping only the outputs between \ and \ , most answers can be extracted directly by VLMEvalKit's matching system, which is much less biased.

You can use the inference code for Llama-3.2-11B-Vision-Instruct.

The model is trained on the LLaVA-CoT-100k dataset.

The model is finetuned with llama-recipes using the following settings. Using the same settings should accurately reproduce our results.
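The post-filtering step described above can be sketched as follows. Note that the actual delimiter pair did not survive the formatting of this card, so the `<ANSWER>` markers below are purely hypothetical placeholders; substitute the model's real answer delimiters.

```python
import re

def extract_final_answer(output: str, start: str, end: str) -> str:
    """Keep only the text between the start and end delimiters.

    If the delimiters are missing (e.g. the model stopped early),
    fall back to the full output so the judge still sees an answer.
    """
    pattern = re.escape(start) + r"(.*?)" + re.escape(end)
    match = re.search(pattern, output, flags=re.DOTALL)
    return match.group(1).strip() if match else output.strip()

# Hypothetical markers for illustration only -- use the model's
# actual delimiters, which are elided in the text above.
raw = "step-by-step reasoning ... <ANSWER>B</ANSWER>"
print(extract_final_answer(raw, "<ANSWER>", "</ANSWER>"))  # prints B
```

Falling back to the unfiltered output keeps the pipeline robust when a generation is truncated before the closing delimiter appears.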
| Parameter                   | Value   |
|-----------------------------|---------|
| FSDP                        | enabled |
| lr                          | 1e-5    |
| num_epochs                  | 3       |
| batch_size_training         | 4       |
| use_fast_kernels            | True    |
| run_validation              | False   |
| batching_strategy           | padding |
| context_length              | 4096    |
| gradient_accumulation_steps | 1       |
| gradient_clipping           | False   |
| gradient_clipping_threshold | 1.0     |
| weight_decay                | 0.0     |
| gamma                       | 0.85    |
| seed                        | 42      |
| use_fp16                    | False   |
| mixed_precision             | True    |

Like other VLMs, the model may generate biased or offensive content due to limitations in the training data. Its performance in aspects such as instruction following also still falls short of leading industry models.
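For reference, the fine-tuning settings above can be collected into a single config object. This is a hedged sketch only: the field names follow llama-recipes' training-config naming style, but the dataclass below is illustrative and is not the actual llama-recipes config class.

```python
from dataclasses import dataclass

@dataclass
class TrainConfig:
    """Mirror of the hyperparameter table above (illustrative only)."""
    enable_fsdp: bool = True
    lr: float = 1e-5
    num_epochs: int = 3
    batch_size_training: int = 4
    use_fast_kernels: bool = True
    run_validation: bool = False
    batching_strategy: str = "padding"
    context_length: int = 4096
    gradient_accumulation_steps: int = 1
    gradient_clipping: bool = False
    gradient_clipping_threshold: float = 1.0
    weight_decay: float = 0.0
    gamma: float = 0.85  # per-epoch LR decay factor
    seed: int = 42
    use_fp16: bool = False
    mixed_precision: bool = True

config = TrainConfig()
print(config.lr)  # prints 1e-05
```

Keeping the settings in one typed object makes it easy to diff a run against this table when reproducing the results.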