RLHFlow
ArmoRM-Llama3-8B-v0.1
License: llama3. Authors include Haoxiang Wang, Wei Xiong, Tengyang Xie, Han Zhao, and Tong Zhang.
Llama3.1-8B-PRM-Deepseek-Data
Llama3-v2-iterative-DPO-iter3
Llama3.1-8B-PRM-Mistral-Data
Qwen2.5-Math-1.5B-DAPO-easy
Qwen2.5-Math-1.5B-GRPO-n8-easy
LLaMA3-iterative-DPO-final
This model is part of the RLHF workflow spanning reward modeling through online iterative RLHF. It accompanies the paper 'RLHF Workflow: From Reward Modeling to Online RLHF' (TMLR, 2024), authored by Hanze Dong, Wei Xiong, Bo Pang, Haoxiang Wang, Han Zhao, Yingbo Zhou, Nan Jiang, Doyen Sahoo, Caiming Xiong, and Tong Zhang.
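The iterative DPO checkpoints in this collection are trained with the standard DPO objective: the policy is pushed to prefer the chosen response over the rejected one relative to a frozen reference (SFT) model. A minimal pure-Python sketch of the per-example loss (illustrative values and function name, not the repository's actual training code):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-example DPO loss on summed token log-probabilities.

    pi_*  : log-prob of the response under the policy being trained
    ref_* : log-prob of the response under the frozen reference (SFT) model
    beta  : temperature controlling deviation from the reference
    """
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # -log(sigmoid(margin)), computed stably as softplus(-margin)
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# At zero margin the loss is log(2); it decreases as the policy's
# preference for the chosen response grows relative to the reference.
loss = dpo_loss(pi_chosen=-12.0, pi_rejected=-15.0,
                ref_chosen=-13.0, ref_rejected=-14.0)
```

In the iterative recipe, each round generates responses with the current policy, labels pairs with a reward model, minimizes this loss, and repeats (iter2, iter3, ...).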
Qwen2.5-Math-1-5B-Reinforce-Ada-balance-easy
LLaMA3-SFT
Qwen2.5-Math-7B-Reinforce-Ada-balance-easy
Checkpoint at step 500, trained on the easy prompt set.
Qwen3-4B-Instruct-2507-Reinforce-Ada-balance-hard
Checkpoint at step 400, trained on the hard prompt set.
Qwen2.5-Math-7B-Reinforce-Ada-balance-hard
Checkpoint at step 400, trained on the hard prompt set.
Llama3.1-8B-ORM-Deepseek-Data
Llama-3.2-3B-Instruct-Reinforce-Ada-balance-hard
Checkpoint at step 400, trained on the hard prompt set.
pair-preference-model-LLaMA3-8B
Qwen2.5-Math-1-5B-Reinforce-Ada-balance-hard
LLaMA3.2-3B-SFT
Decision-Tree-Reward-Llama-3.1-8B
LLaMA3-SFT-v2
RewardModel-Mistral-7B-for-DPA-v1
LLaMA3.2-1B-SFT
Decision-Tree-Reward-Gemma-2-27B
Qwen2.5-Math-7B-Zero-Reinforce-Rej
Llama3.1-8B-ORM-Mistral-Data
Qwen2.5-7B-PPO-Zero
DPA-v1-Mistral-7B
Qwen2.5-Math-7B-Zero-RAFTpp
Llama3-SFT-v2.0-epoch1
Llama3-SFT-v2.0-epoch2
Llama3-SFT-v2.0-epoch3
Llama3-v2-iterative-DPO-iter2