RLHFlow

34 models

ArmoRM-Llama3-8B-v0.1

License: llama3. Authors include Haoxiang Wang, Wei Xiong, Tengyang Xie, Han Zhao, and Tong Zhang.

llama • 10,839 downloads • 182 likes
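ArmoRM is a multi-objective reward model that scores a response along several reward attributes and combines them through a learned gating layer. A toy sketch of that kind of aggregation, with purely illustrative attribute names and weights (not the model's actual objectives or parameters):

```python
def aggregate_rewards(objective_scores, gating_weights):
    """Combine per-objective reward scores using gating weights
    that sum to 1, as a mixture-of-experts style gate would produce."""
    assert abs(sum(gating_weights) - 1.0) < 1e-6, "gate must be a distribution"
    return sum(score * weight for score, weight in zip(objective_scores, gating_weights))

# Hypothetical objectives: helpfulness, correctness, safety.
scores = [0.8, 0.6, 0.9]
weights = [0.5, 0.3, 0.2]
reward = aggregate_rewards(scores, weights)  # 0.5*0.8 + 0.3*0.6 + 0.2*0.9 = 0.76
```

In the actual model the gating weights are produced per-prompt by a small network, so different prompts emphasize different objectives; the fixed weights above are only for illustration.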

Llama3.1-8B-PRM-Deepseek-Data

llama • 1,444 downloads • 37 likes

Llama3-v2-iterative-DPO-iter3

llama • 96 downloads • 1 like

Llama3.1-8B-PRM-Mistral-Data

llama • 37 downloads • 10 likes

Qwen2.5-Math-1.5B-DAPO-easy

apache-2.0 • 37 downloads • 0 likes

Qwen2.5-Math-1.5B-GRPO-n8-easy

apache-2.0 • 25 downloads • 0 likes

LLaMA3-iterative-DPO-final

This model implements the RLHF workflow from reward modeling to online RLHF. It accompanies the paper 'RLHF Workflow: From Reward Modeling to Online RLHF' (TMLR, 2024). Authors: Hanze Dong, Wei Xiong, Bo Pang, Haoxiang Wang, Han Zhao, Yingbo Zhou, Nan Jiang, Doyen Sahoo, Caiming Xiong, and Tong Zhang.

llama • 22 downloads • 41 likes
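Since this final model comes out of an iterative DPO pipeline, a minimal sketch of the standard DPO objective for a single preference pair may help clarify what each iteration optimizes. This is the textbook formulation, not RLHFlow's exact implementation, and the log-probabilities below are made-up toy values:

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    Each argument is the summed log-probability of the full response
    under either the trained policy or the frozen reference model.
    """
    # Implicit reward = beta * log-ratio between policy and reference.
    chosen_reward = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (policy_logp_rejected - ref_logp_rejected)
    # -log sigmoid of the reward margin: pushes chosen above rejected.
    margin = chosen_reward - rejected_reward
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Toy numbers: the policy already prefers the chosen response slightly,
# so the loss is a bit below -log(0.5) ≈ 0.693.
loss = dpo_loss(-12.0, -15.0, -13.0, -14.0)
```

In the online/iterative variant, each round samples fresh response pairs, labels them with a reward model, runs a DPO update like the one above, and uses the resulting policy as the starting point for the next round.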

Qwen2.5-Math-1-5B-Reinforce-Ada-balance-easy

16 downloads • 0 likes

LLaMA3-SFT

llama • 15 downloads • 10 likes

Qwen2.5-Math-7B-Reinforce-Ada-balance-easy

Checkpoint from step=500, trained on the easy prompt set.

apache-2.0 • 15 downloads • 0 likes

Qwen3-4B-Instruct-2507-Reinforce-Ada-balance-hard

Checkpoint from step=400, trained on the hard prompt set.

apache-2.0 • 14 downloads • 0 likes

Qwen2.5-Math-7B-Reinforce-Ada-balance-hard

Checkpoint from step=400, trained on the hard prompt set.

apache-2.0 • 13 downloads • 0 likes

Llama3.1-8B-ORM-Deepseek-Data

llama • 11 downloads • 2 likes

Llama-3.2-3B-Instruct-Reinforce-Ada-balance-hard

Checkpoint from step=400, trained on the hard prompt set.

llama • 10 downloads • 0 likes

pair-preference-model-LLaMA3-8B

llama • 9 downloads • 38 likes

Qwen2.5-Math-1-5B-Reinforce-Ada-balance-hard

9 downloads • 0 likes

LLaMA3.2-3B-SFT

llama • 8 downloads • 0 likes

Decision-Tree-Reward-Llama-3.1-8B

llama • 6 downloads • 7 likes

LLaMA3-SFT-v2

llama • 6 downloads • 2 likes

RewardModel-Mistral-7B-for-DPA-v1

4 downloads • 4 likes

LLaMA3.2-1B-SFT

llama • 4 downloads • 0 likes

Decision-Tree-Reward-Gemma-2-27B

3 downloads • 8 likes

Qwen2.5-Math-7B-Zero-Reinforce-Rej

3 downloads • 1 like

Llama3.1-8B-ORM-Mistral-Data

llama • 2 downloads • 0 likes

Qwen2.5-7B-PPO-Zero

1 download • 2 likes

DPA-v1-Mistral-7B

1 download • 1 like

Qwen2.5-Math-7B-Zero-RAFTpp

1 download • 1 like

Llama3-SFT-v2.0-epoch1

llama • 1 download • 0 likes

Llama3-SFT-v2.0-epoch2

llama • 1 download • 0 likes

Llama3-SFT-v2.0-epoch3

llama • 1 download • 0 likes

Llama3-v2-iterative-DPO-iter2

llama • 1 download • 0 likes

Qwen2.5-7B-DPO-Zero

1 download • 0 likes

Qwen2.5-7B-SFT

1 download • 0 likes

Qwen2.5-7B-DPO

1 download • 0 likes