Aloe-Vision-72B-AR
Aloe-Vision is a medical Large Vision-Language Model built on Qwen2-VL-Instruct and released in 7B and 72B sizes. The model is trained on a balanced mixture of ~3.5M samples spanning medical vs. general and multimodal vs. text-only sources, rebalanced by the number of loss-contributing assistant tokens to avoid a bias toward long answers. Leakage of evaluation images into the training data is controlled via exact 64-bit image-hash matching, removing any duplicates from the training set (a deduplication sketch is provided below). Quality filtering of the training data combines (1) LVLM-based sample scoring (1-5 scale) for image-question-answer coherence and relevance and (2) answer-perplexity checks that flag trivial or noisy annotations. Thresholds are dataset-specific and manually tuned, removing low-quality outliers while preserving clinically meaningful diversity. The model is additionally fine-tuned on 17.2K adversarially perturbed medical samples to improve robustness against sycophantic and misleading multimodal cues. Aloe-Vision is released for research purposes under CC BY-NC-SA 4.0.

Model details:

- Base model: Qwen2-VL-Instruct (7B / 72B)
- Variant: Aloe-Vision-72B-AR (Adversarially Robust)
- Training type: two-stage SFT (medical + adversarial fine-tuning)
- Sizes: 7B, 72B
- Languages: English
- Images per turn: Qwen2-VL-style multi-image support
- License: CC BY-NC-SA 4.0
- Developed by: HPAI, Barcelona Supercomputing Center (BSC)
- Contact: [email protected]

Intended use: research on medical VQA and multimodal reasoning, dataset analysis, and academic benchmarking.

Out-of-scope use:

- Clinical diagnosis/treatment, triage, or any unsupervised medical use.
- Generation of harmful, misleading, or fraudulent medical content.
- Processing of PHI or any personally identifiable patient data.

Usage: Aloe-Vision follows the Qwen2-VL chat template and processor API; replace the image path(s) and prompt content to suit your use case (a minimal inference sketch is provided after the evaluation setup below). Grounding: Aloe-Vision supports region-referenced grounding using Qwen2-VL box marker tokens.

Training details (see the configuration sketch below):

- Training type: two-stage SFT (medical + adversarial fine-tuning)
- Stack: TRL + DeepSpeed ZeRO-3
- Precision: BF16
- Global batch size: 2000
- Micro batch size: 4
- Epochs: 1
- Sequence length: 4096
- LR: 1.25e-5, cosine schedule, 3% warmup
- Optimizer: AdamW
- Gradient checkpointing: enabled
- Parallelism: DeepSpeed ZeRO-3
- Cluster: MareNostrum-5 (BSC)
- Nodes/GPUs: 25 nodes × 4× NVIDIA H100 (100 GPUs total)
- GPU hours: ~4,500

Training data: we construct a balanced mixture across two axes: modality (multimodal vs. text-only) and domain (medical vs. general). All sources are normalized to a unified trl conversation schema. The medical multimodal portion covers both global understanding and fine-grained region reasoning. The dataset can be found in HPAI-BSC/Aloe-Vision-Data.

Evaluation: Aloe-Vision targets comprehensive evaluation across medical multimodal, medical text-only, general multimodal, and general text-only tasks. Benchmarks are run with identical settings for Aloe-Vision and all baselines to ensure reproducibility.

- PathMMU (multimodal, medical, MCQ): 1.1K
- GMAI-MMBench (multimodal, medical, MCQ): 4.5K
- OmniMedVQA (multimodal, medical, MCQ): 89K
- ProbMed (multimodal, medical, Y/N): 57K
- SLAKE (multimodal, medical, open-ended; LLM-as-judge): 2K
- MMMU (multimodal, general, MCQ): 1.4K
- MultiMedQA (text, medical, MCQ): 7K
- MMLU (text, general, MCQ): 14K

Evaluation setup: multimodal benchmarks are run via VLMEvalKit, text-only benchmarks via lm-evaluation-harness. Decoding is greedy; accuracy is exact match for MCQ and Y/N tasks (sketched below). LLM-as-judge (SLAKE): Qwen2.5-VL-72B with a rubric-based {0.0, 0.5, 1.0} scale.
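A minimal inference sketch following the standard Qwen2-VL recipe in transformers is shown below. The repository id, image path, and prompt are placeholders/assumptions rather than values confirmed by this card; adapt them to the checkpoint and case you are working with.

```python
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils

# Hypothetical repository id; adjust to the checkpoint you actually use.
model_id = "HPAI-BSC/Aloe-Vision-72B-AR"

model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Placeholder image path and prompt; replace with your own case.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/image.png"},
            {"type": "text", "text": "What abnormality, if any, is visible in this image?"},
        ],
    }
]

# Standard Qwen2-VL preprocessing: chat template + vision inputs.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

# Greedy decoding, matching the evaluation setup above.
output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=False)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```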
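The leakage control described above (exact 64-bit image-hash matching against the evaluation sets) can be reproduced with a short deduplication pass. The sketch below assumes 64-bit perceptual hashes from the `imagehash` library and placeholder directory layouts; the exact hash function used for Aloe-Vision is not specified in this card, so `phash` is an illustrative choice.

```python
from pathlib import Path
from PIL import Image
import imagehash  # pip install imagehash

def image_hash64(path: Path) -> int:
    # 8x8 perceptual hash -> 64-bit integer (illustrative choice of hash function).
    return int(str(imagehash.phash(Image.open(path), hash_size=8)), 16)

# Hash every evaluation image once (placeholder directory).
eval_hashes = {image_hash64(p) for p in Path("eval_images").rglob("*.png")}

# Drop any training sample whose image hash exactly matches an evaluation hash.
train_paths = list(Path("train_images").rglob("*.jpg"))
clean_train_paths = [p for p in train_paths if image_hash64(p) not in eval_hashes]
print(f"kept {len(clean_train_paths)} / {len(train_paths)} training images")
```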
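One way to read the training configuration above: with 100 GPUs and a micro batch size of 4, a global batch size of 2000 corresponds to 5 gradient-accumulation steps (4 × 100 × 5 = 2000). The sketch below maps the listed hyperparameters onto a TRL `SFTConfig`; it is an assumption-laden illustration rather than the actual training script (the output directory and DeepSpeed config path are hypothetical, and the sequence-length argument name varies across TRL versions).

```python
from trl import SFTConfig

# Hyperparameters copied from the training details above; the decomposition of the
# global batch size into micro batch x GPUs x accumulation steps is an inference.
training_args = SFTConfig(
    output_dir="aloe-vision-sft",      # placeholder
    num_train_epochs=1,
    per_device_train_batch_size=4,     # micro batch size
    gradient_accumulation_steps=5,     # 4 * 100 GPUs * 5 = 2000 global batch
    learning_rate=1.25e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    optim="adamw_torch",
    bf16=True,
    gradient_checkpointing=True,
    max_seq_length=4096,               # named "max_length" in recent TRL releases
    deepspeed="ds_zero3.json",         # hypothetical ZeRO-3 config path
)
```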
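For the MCQ and Y/N benchmarks, accuracy is plain exact match over the predicted option. The snippet below is a minimal sketch of that metric; the answer-extraction regex is an assumption, since the actual parsing rules live in VLMEvalKit and lm-evaluation-harness.

```python
import re

def extract_choice(prediction: str) -> str | None:
    # Pull the first standalone option letter (A-E) or yes/no token from the output.
    text = prediction.strip().upper()
    m = re.search(r"\b([A-E])\b", text)
    if m:
        return m.group(1)
    m = re.search(r"\b(YES|NO)\b", text)
    return m.group(1) if m else None

def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    correct = sum(
        extract_choice(p) == r.strip().upper() for p, r in zip(predictions, references)
    )
    return 100.0 * correct / len(references)

# Toy example: both predictions resolve to the reference answers -> 100.0
print(exact_match_accuracy(["The answer is B.", "No, there is no fracture."], ["B", "NO"]))
```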
Results on standard benchmarks:

| Model | OmniMedVQA | GMAI-MMBench | PathMMU | ProbMed | SLAKE | MMMU | MultiMedQA | MMLU |
| ----- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Qwen2-VL-72B (general) | 77.90 | 51.03 | 64.71 | 73.87 | 68.15 | 61.22 | 74.25 | 81.86 |
| InternVL3.5-30B-A3B (general) | 91.60 | 63.91 | 72.07 | 82.21 | 79.87 | 60.67 | 71.21 | 81.68 |
| Lingshu-32B | 80.20 | 53.54 | 67.60 | 80.84 | 86.08 | 53.00 | 72.08 | 81.32 |
| HuatuoGPT-Vision-34B | 68.90 | 48.31 | 54.90 | 71.79 | 60.03 | 27.11 | 60.57 | 72.80 |
| Aloe-Vision-72B | 85.20 | 55.12 | 70.32 | 77.71 | 69.36 | 62.00 | 76.35 | 82.52 |
| Aloe-Vision-72B-AR | 84.00 | 54.79 | 71.45 | 77.29 | 67.36 | 62.89 | 76.24 | 82.55 |

Adversarial robustness:

To improve robustness against noisy or misleading inputs, we conducted an additional fine-tuning stage focused on adversarial robustness. This stage aimed to mitigate common LVLM vulnerabilities such as sycophantic behavior and misleading multimodal cues. An adversarial benchmark was first created by applying controlled perturbations to existing medical datasets (distinct from those used in evaluation). These perturbations introduce conflicting or false multimodal signals, e.g., mismatched region annotations or incorrect textual hints (a toy example of a prompt-level perturbation is sketched at the end of this card). Using this adversarially transformed dataset, we trained the Aloe-Vision-72B-AR variant through a single post-training SFT stage on 17.2K adversarial samples. The adversarial fine-tuning employed the same optimization setup as the base model and ran for 1 epoch. This procedure yielded substantial improvements across all adversarial evaluation categories while preserving performance on standard benchmarks.

The following table reports model accuracy (%) under different adversarial perturbations for the two task groups (Cls and Det). "Base" denotes unperturbed inputs; the perturbation columns are:

- Cap: misleading captions inserted into the image
- Pmt: misleading captions in the prompt
- Syc: sycophantic prompt bias
- Leg: misleading legends inserted into the image

| Model | Cls Base | Cls Cap | Cls Pmt | Cls Syc | Det Base | Det Cap | Det Pmt | Det Syc | Det Leg |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Qwen2-VL-72B | 61.2 | 2.1 | 5.1 | 35.6 | 75.9 | 3.7 | 3.6 | 14.7 | 41.1 |
| InternVL3.5-30B | 68.3 | 7.3 | 5.1 | 35.7 | 73.1 | 35.1 | 33.5 | 27.5 | 48.1 |
| Lingshu-32B | 68.1 | 4.5 | 29.7 | 55.4 | 80.2 | 8.5 | 19.6 | 35.0 | 60.1 |
| HuatuoGPT-Vision-34B | 59.6 | 22.2 | 10.0 | 16.1 | 64.1 | 40.9 | 5.6 | 5.2 | 50.7 |
| Aloe-Vision-72B | 64.2 | 4.7 | 6.9 | 73.9 | 69.7 | 3.0 | 1.9 | 8.6 | 37.5 |
| Aloe-Vision-72B-AR | 69.8 | 17.0 | 50.5 | 62.8 | 81.8 | 48.6 | 64.0 | 66.3 | 53.4 |

Limitations:

- Not a medical device. Do not rely on outputs for diagnosis or treatment.
- Failure modes: the model may hallucinate, misinterpret findings, or over-generalize across modalities and specialties.
- Sensitive content: the model can produce unsafe content if prompted adversarially.
- Keep a qualified clinician in the loop for any medically relevant use.

Clinical safety: Aloe-Vision is a research model. It must not be used for diagnosis, treatment, or clinical decision-making. Always place a qualified human in the loop.

Developed by the High Performance Artificial Intelligence (HPAI) group at Barcelona Supercomputing Center (BSC). Contact: [email protected].
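For illustration only, here is a toy version of the prompt-level perturbations described in the adversarial robustness section (sycophantic bias and misleading textual hints). The templates and sample schema are assumptions; the actual pipeline behind the 17.2K adversarial samples is not published in this card, and image-level perturbations (inserted captions/legends) are omitted.

```python
import random

def sycophantic_prompt(question: str, wrong_option: str) -> str:
    # Bias the question toward an incorrect option, as in the "Syc" setting.
    return f"I am fairly sure the answer is {wrong_option}. {question}"

def misleading_hint_prompt(question: str, false_hint: str) -> str:
    # Inject a false textual finding that conflicts with the image, as in the "Pmt" setting.
    return f"Note that the accompanying report mentions {false_hint}. {question}"

def perturb_sample(sample: dict) -> dict:
    # Sample schema is hypothetical: {"question": str, "options": list[str], "answer": str}.
    wrong = random.choice([o for o in sample["options"] if o != sample["answer"]])
    perturbed = dict(sample)
    perturbed["question"] = sycophantic_prompt(sample["question"], wrong)
    # The target stays the original correct answer, so SFT rewards resisting the bias.
    return perturbed
```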