jieliu
SD3.5M-FlowGRPO-PickScore
SD3.5M-FlowGRPO-GenEval
[Update] We release a new GenEval model that maintains image quality close to the base model, while still achieving the original GenEval score of 95. Feel free to give it a try! This model is trained using Flow-GRPO with LoRA. We provide only the LoRA weights here, so you will need to download the SD 3.5 Medium base model first. - Repository: https://github.com/yifan123/flowgrpo - Paper: https://www.arxiv.org/pdf/2505.05470
SD3.5M-FlowGRPO-Text
Storm-7B
Storm-7B - Developed by: Jie Liu \\(^{1,2}\\), Zhanhui Zhou \\(^{2}\\), Jiaheng Liu \\(^{2}\\), Xingyuan Bu \\(^{2}\\), Chao Yang \\(^{2}\\), Han-Sen Zhong \\(^{\dag 2}\\), Wanli Ouyang \\(^{1,2}\\). - \\(^{1}\\)MMLab, The Chinese University of Hong Kong   \\(^{2}\\)Shanghai AI Laboratory - Paper: Iterative Length-Regularized Direct Preference Optimization: A Case Study on Improving 7B Language Models to GPT-4 Level - Finetuned from the model: openchat-3.5-0106 - Dataset: berkeley-nest/Nectar - Reward Model: Starling-RM-34B We released Storm-7B, the first open-source language model comparable to the GPT-4 series on the AlpacaEval 2.0 leaderboard. Recent studies show that DPO benefits from iterative training with online preferences labeled by a trained reward model. In this work, we identify a pitfall of vanilla iterative DPO - improved response quality can lead to increased verbosity. To address this, we introduce iterative length-regularized DPO (iLR-DPO) to penalize response length. Our empirical results show that iLR-DPO can enhance a 7B model to perform on par with GPT-4 without increasing verbosity. Performance Our 7B model achieves a 50.5% length-controlled win rate against GPT-4 Preview on AlpacaEval 2.0. Our model's LC win rate improves over iterations without significantly changing the response length, indicating better alignment with human values without length bias. The final trained model (iteration 3) achieves a 50.5% LC win rate, making it the first open-source model to surpass the baseline model GPT-4 Preview. In addition to regular decoding, we also test beam search and best-of-n sampling on top of our trained model. Beam search over our trained model shows a 5% improvement over regular decoding, Best-of-n sampling with Starling-RM-34B achieves 61.6% LC Win rate and outperforms GPT-4 Omni. We observe no significant degradation in traditional NLP tasks from the Huggingface Open LLM Leaderboard. Our model uses the same chat template as Openchat-3.5-0106. A sample code snippet for inference using our model is provided below. Scripts You can reproduce our results on AlphaEval 2.0 using the script provided below. Our work has several limitations: (1) We focus on aligning with human preferences but only use GPT-4 as a proxy for human judgment to evaluate language models. (2) We reduce verbosity with a length penalty, though verbosity and length are not necessarily correlated. Future work could train a specific reward model to directly penalize verbosity, replacing the length margin with a verbosity margin, following the standard MODPO pipeline.