Zigeng
# SlimSAM-uniform-77
> **0.1% Data Makes Segment Anything Slim**
> Zigeng Chen, Gongfan Fang, Xinyin Ma, Xinchao Wang
> Learning and Vision Lab, National University of Singapore
> Paper: [[Arxiv]](https://arxiv.org/abs/2312.05284)
> Code: [[GitHub]](https://github.com/czg1225/SlimSAM)

SlimSAM is a novel SAM compression method that efficiently reuses pre-trained SAMs without extensive retraining, achieved through a unified pruning-distillation framework. To enhance knowledge inheritance from the original SAM, we employ an innovative alternate slimming strategy that partitions the compression process into a progressive procedure. Diverging from prior pruning techniques, we meticulously prune and distill decoupled model structures in an alternating fashion. Furthermore, a novel label-free pruning criterion is proposed to align the pruning objective with the optimization target, thereby boosting post-distillation performance after pruning. SlimSAM achieves performance approaching that of the original SAM-H while reducing the parameter count to 0.9% (5.7M), MACs to 0.8% (21G), and requiring a mere 0.1% (10k) of the training data. Extensive experiments demonstrate that our method realizes significantly superior performance while using over 10 times less training data than other SAM compression methods.

## Fast state_dict loading for the local uniform pruning SlimSAM-50 model

## BibTeX of our SlimSAM

If you use SlimSAM in your research, please use the following BibTeX entry. Thank you!

Torch Pruning (DepGraph: Towards Any Structural Pruning) [ bib ]
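The fast state_dict loading snippet did not survive extraction. Below is a minimal sketch of the pattern, assuming (as in the SlimSAM repo) that the checkpoint file pickles the whole pruned model object; the checkpoint path and `load_slimsam` helper name are placeholders, not the released API.

```python
import torch
import torch.nn as nn

def load_slimsam(checkpoint_path: str, device: str = "cpu") -> nn.Module:
    # The SlimSAM checkpoints store the full pruned model object, so
    # torch.load returns an nn.Module directly; weights_only=False is
    # required on recent PyTorch versions for pickled module objects.
    model = torch.load(checkpoint_path, map_location=device, weights_only=False)
    model.eval()  # inference mode: disables dropout / batch-norm updates
    return model

# Self-contained demo with a stand-in module instead of the real checkpoint.
dummy = nn.Linear(4, 2)
torch.save(dummy, "demo_checkpoint.pth")
restored = load_slimsam("demo_checkpoint.pth")
print(type(restored).__name__)  # Linear
```

Passing `map_location` avoids a GPU round-trip when the checkpoint was saved from a CUDA device but loaded on CPU.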
# dParallel LLaDA 8B Instruct
> **dParallel: Learnable Parallel Decoding for dLLMs**
> Zigeng Chen, Gongfan Fang, Xinyin Ma, Ruonan Yu, Xinchao Wang
> xML Lab, National University of Singapore

## Introduction

We introduce dParallel, a simple and effective method that unlocks the inherent parallelism of dLLMs for fast sampling. We identify that the key bottleneck to parallel decoding arises from the sequential certainty convergence of masked tokens. Building on this insight, we introduce the core of our approach: certainty-forcing distillation, a novel training strategy that distills the model to follow its original sampling trajectories while enforcing it to reach high certainty on masked tokens more rapidly and in parallel. Extensive experiments across various benchmarks demonstrate that our method dramatically reduces the number of decoding steps while maintaining performance. When applied to the LLaDA-8B-Instruct model, dParallel reduces decoding steps from 256 to 30 on GSM8K, achieving an 8.5x speedup without performance degradation. On the MBPP benchmark, it cuts decoding steps from 256 to 24, a 10.5x speedup at maintained accuracy.

*Overview of proposed certainty-forcing distillation.*

## Experimental Results

Results on LLaDA-8B-Instruct:

## Acknowledgement

Our code builds on LLaDA, Dream, Fast-dLLM, and dKV-Cache, and we acknowledge these great works for laying the groundwork that made our approach possible.

## Citation

If our research assists your work, please give us a star ⭐ or cite us using:
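The parallel decoding that certainty-forcing distillation targets can be illustrated with a toy confidence-thresholded unmasking step: instead of committing one token per step, every masked position whose top-1 probability clears a threshold is committed at once. This is a sketch of the general mechanism, not the released dParallel code; the function name and the 0.9 threshold are illustrative.

```python
import torch

def parallel_unmask_step(logits, is_masked, threshold=0.9):
    """One dLLM decoding step: commit every masked position whose top-1
    probability exceeds `threshold`, with at least one commit per step
    so decoding always makes progress."""
    probs = torch.softmax(logits, dim=-1)
    conf, pred = probs.max(dim=-1)
    # Ignore already-decoded positions when selecting commits.
    conf = torch.where(is_masked, conf, torch.full_like(conf, -1.0))
    accept = is_masked & (conf >= threshold)
    if not accept.any():  # fallback: greedy single-token unmasking
        accept = torch.zeros_like(is_masked)
        accept[conf.argmax()] = True
    return pred, accept

# Toy example: 4 positions, 3-token vocabulary; positions 0 and 2 are
# masked and confident, 1 is masked but uncertain, 3 is already decoded.
logits = torch.tensor([[10., 0., 0.],
                       [1., 1., 1.],
                       [0., 10., 0.],
                       [0., 0., 10.]])
is_masked = torch.tensor([True, True, True, False])
pred, accept = parallel_unmask_step(logits, is_masked)
print(accept.tolist())  # [True, False, True, False]
```

The intuition behind certainty forcing is that training pushes many masked positions over this threshold simultaneously, so steps like the one above commit several tokens instead of one.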
# dParallel Dream 7B Instruct
# R1 VeriThinker 7B
> **VeriThinker: Learning to Verify Makes Reasoning Model Efficient**
> Zigeng Chen, Xinyin Ma, Gongfan Fang, Ruonan Yu, Xinchao Wang
> xML Lab, National University of Singapore

*The key distinction between VeriThinker and traditional SFT- or RL-based long-to-short methods: we uniquely train LRMs on an auxiliary CoT verification task, achieving effective CoT compression without relying on synthetic target reasoning chains.*

## Introduction

We introduce VeriThinker, a novel approach for CoT compression. Unlike conventional methods that fine-tune LRMs directly on the original reasoning task using synthetic concise CoT data, we fine-tune the model solely through an auxiliary verification task. By training LRMs to accurately verify the correctness of CoT solutions, the models inherently become more discerning about the necessity of subsequent self-reflection steps, thereby effectively suppressing overthinking. Extensive experiments validate that VeriThinker substantially reduces reasoning chain lengths while maintaining or even slightly improving accuracy. When applied to DeepSeek-R1-Distill-Qwen-7B, our approach reduces reasoning tokens on MATH500 from 3790 to 2125 while improving accuracy by 0.8% (94.0% to 94.8%); on AIME25, tokens decrease from 14321 to 10287 with a 2.1% accuracy gain (38.7% to 40.8%). Additionally, our experiments demonstrate that VeriThinker generalizes zero-shot to speculative reasoning, boosting throughput.

## Speculative Reasoning Results

*Speculative reasoning results on three reasoning models. When Qwen-2.5-Math-Instruct-7B serves as the draft model, most problems in MATH500 and GSM8K can be solved with the short-CoT model; only a few (around 10%) require activating the long-CoT model for more complex solutions.*

## Citation

If our research assists your work, please give us a star ⭐ or cite us using:
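The speculative reasoning setup described above can be sketched as a draft-then-verify routing loop: the short-CoT draft model answers first, and the long-CoT model is invoked only when the verification-trained LRM rejects the draft. The function names and stub models below are illustrative placeholders, not the released VeriThinker interface.

```python
def speculative_reason(problem, draft_model, verifier, long_model):
    """Draft-then-verify routing: return the short-CoT draft answer when
    the verifier accepts it; otherwise escalate to the long-CoT model."""
    draft_answer = draft_model(problem)
    if verifier(problem, draft_answer):
        return draft_answer, "draft"   # cheap path, taken for most problems
    return long_model(problem), "long"  # expensive path for hard problems

# Stub models standing in for Qwen-2.5-Math-Instruct-7B (draft) and a
# VeriThinker-tuned LRM (verifier + long-CoT fallback).
draft = lambda p: "x = 2"
verify = lambda p, a: a == "x = 2" and "easy" in p
long_cot = lambda p: "x = 2 (after long reflection)"

print(speculative_reason("easy: solve x + 2 = 4", draft, verify, long_cot)[1])  # draft
print(speculative_reason("hard: solve ...", draft, verify, long_cot)[1])        # long
```

The throughput gain follows from the routing statistics quoted above: when roughly 90% of problems stop at the cheap draft path, the long-CoT model runs only on the remainder.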