VisionSelector-Qwen2.5-VL-7B
VisionSelector-code: [\[📂 VisionSelector\]](https://github.com/JulietChoo/VisionSelector)

VisionSelector-model: [\[🤗 VisionSelector-Qwen2.5-VL-3B\]](https://huggingface.co/JulietChoo/VisionSelector-Qwen2.5-VL-3B) [\[🤗 VisionSelector-Qwen2.5-VL-7B\]](https://huggingface.co/JulietChoo/VisionSelector-Qwen2.5-VL-7B) [\[🤗 VisionSelector-LLaVA-OV-1.5-8B\]](https://huggingface.co/JulietChoo/VisionSelector-LLaVA-OV-1.5-8B)

Model Overview

We introduce VisionSelector, an end-to-end learnable framework that recasts visual token compression as an optimization-driven decision process. VisionSelector integrates into existing MLLMs without modifying the backbone and adapts to different compression budgets. Its three key components are (an illustrative sketch follows this card):

- A Differentiable Top-K Selection Mechanism that preserves end-to-end gradient flow while remaining fully compatible with high-performance acceleration kernels such as FlashAttention.
- A Curriculum Annealing Strategy with a composite loss that bridges the performance gap between soft selection during training and hard selection at inference.
- A backbone-decoupled Learnable Importance Scorer (LIS) that lets a model trained at a single compression rate generalize robustly to other compression budgets at inference.

VisionSelector is highly parameter-efficient, adding only 12.85M trainable parameters, and delivers substantial performance-efficiency gains: a 12.14% performance improvement at 10% token retention, and a 1.73× prefill speedup with an 86.08% memory reduction at 20% retention. It consistently outperforms state-of-the-art baselines across 13 image and video understanding benchmarks.

Institution

- University of Science and Technology of China
- ZTE-AIM
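The card describes the selector only in prose; below is a minimal PyTorch sketch of how a backbone-decoupled importance scorer with differentiable, straight-through top-k selection could be wired up, with a comment indicating where a curriculum-annealed temperature would enter. Every name here (`LearnableImportanceScorer`, `differentiable_topk_mask`, the MLP shape, the sigmoid relaxation) is an illustrative assumption, not the repository's actual implementation; see the GitHub repo for the real code.

```python
import torch
import torch.nn as nn


class LearnableImportanceScorer(nn.Module):
    """Small MLP that maps each visual token embedding to a scalar
    importance score. Backbone-decoupled: it only reads the token
    embeddings, so the MLLM backbone itself stays unmodified."""

    def __init__(self, dim: int, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, n_tokens, dim) -> scores: (batch, n_tokens)
        return self.mlp(tokens).squeeze(-1)


def differentiable_topk_mask(scores: torch.Tensor, k: int, tau: float) -> torch.Tensor:
    """Straight-through top-k: the forward pass uses the hard 0/1 mask,
    while gradients flow through a temperature-tau sigmoid relaxation."""
    threshold = scores.topk(k, dim=-1).values[..., -1:]   # k-th largest score per sample
    soft = torch.sigmoid((scores - threshold) / tau)      # soft (differentiable) selection
    hard = (scores >= threshold).float()                  # hard top-k selection
    return hard + (soft - soft.detach())                  # straight-through estimator


# Toy usage: retain 20% of 1000 visual tokens.
# A curriculum-annealing-style schedule would start tau high (near-soft
# selection) and decay it toward 0 during training, so the learned scorer
# converges to the hard selection actually used at inference.
tokens = torch.randn(2, 1000, 1024)
scorer = LearnableImportanceScorer(dim=1024)
mask = differentiable_topk_mask(scorer(tokens), k=200, tau=0.1)
pruned = tokens * mask.unsqueeze(-1)  # a real pipeline would gather only the kept tokens
```

Because the mask is applied as a multiplication rather than a data-dependent attention pattern, the downstream attention remains dense over the kept tokens, which is what keeps kernels like FlashAttention usable unchanged.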
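For completeness, a hypothetical loading snippet. It assumes the checkpoint follows the standard Qwen2.5-VL interface in `transformers`; the actual entry point, and whether `trust_remote_code` is needed for the selector module, are assumptions to be checked against the repository README.

```python
# Hypothetical loading sketch -- not the card's official usage instructions.
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "JulietChoo/VisionSelector-Qwen2.5-VL-7B"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto", trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
```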
VisionSelector-Qwen2.5-VL-3B
Same model card as VisionSelector-Qwen2.5-VL-7B above (identical code, checkpoints, overview, and institutions); this checkpoint applies VisionSelector to the Qwen2.5-VL-3B backbone.
VisionSelector-LLaVA-OV-1.5-8B
Same model card as VisionSelector-Qwen2.5-VL-7B above (identical code, checkpoints, overview, and institutions); this checkpoint applies VisionSelector to the LLaVA-OV-1.5-8B backbone.