This model combines:

- Vision encoder: google/siglip-base-patch16-224
- Language model: Qwen/Qwen2.5-0.5B-Instruct
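
As a rough illustration of the two components listed above, the sketch below loads each checkpoint separately with Hugging Face `transformers` and works out how many patch tokens the vision encoder produces per image. It assumes the `patch16-224` suffix means 16x16 patches on a 224x224 input. The helper names (`patch_grid`, `num_patch_tokens`, `load_components`) are hypothetical and not part of this model's code. How the two components are connected (for example, a projection layer) is not stated here, so the sketch does not attempt it.

```python
import re

VISION_ENCODER_ID = "google/siglip-base-patch16-224"
LANGUAGE_MODEL_ID = "Qwen/Qwen2.5-0.5B-Instruct"


def patch_grid(checkpoint_id: str) -> tuple:
    """Parse (patch_size, image_size) from a '...patch<P>-<S>' checkpoint name.

    Assumption: the name suffix encodes patch size and input resolution.
    """
    match = re.search(r"patch(\d+)-(\d+)", checkpoint_id)
    if match is None:
        raise ValueError(f"no patch/size suffix in {checkpoint_id!r}")
    return int(match.group(1)), int(match.group(2))


def num_patch_tokens(checkpoint_id: str) -> int:
    """Number of non-overlapping square patches covering one input image."""
    patch_size, image_size = patch_grid(checkpoint_id)
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side


def load_components():
    """Hypothetical helper: load both pretrained parts independently.

    transformers is imported lazily so the helpers above work without it.
    Calling this downloads the checkpoints from the Hugging Face Hub.
    """
    from transformers import (
        AutoImageProcessor,
        AutoModelForCausalLM,
        AutoTokenizer,
        SiglipVisionModel,
    )

    image_processor = AutoImageProcessor.from_pretrained(VISION_ENCODER_ID)
    vision_encoder = SiglipVisionModel.from_pretrained(VISION_ENCODER_ID)
    tokenizer = AutoTokenizer.from_pretrained(LANGUAGE_MODEL_ID)
    language_model = AutoModelForCausalLM.from_pretrained(LANGUAGE_MODEL_ID)
    return image_processor, vision_encoder, tokenizer, language_model


if __name__ == "__main__":
    print(patch_grid(VISION_ENCODER_ID))
    print(num_patch_tokens(VISION_ENCODER_ID))
```

Under that naming assumption, each image yields a 14x14 grid of patches. `load_components()` only fetches the base checkpoints named above; it does not load this combined model's own weights.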