RLHFlow

34 models

ArmoRM-Llama3-8B-v0.1

License: llama3. Authors include Haoxiang Wang, Wei Xiong, Tengyang Xie, Han Zhao, and Tong Zhang.

llama • 10,839 downloads • 182 likes
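ArmoRM is a multi-objective reward model that scores a response along several reward attributes and combines them through a learned gating layer. A toy sketch of that kind of aggregation, with purely illustrative attribute names and weights (not the model's actual objectives or parameters):

```python
def aggregate_rewards(objective_scores, gating_weights):
    """Combine per-objective reward scores using gating weights
    that sum to 1, as a mixture-of-experts style gate would produce."""
    assert abs(sum(gating_weights) - 1.0) < 1e-6, "gate must be a distribution"
    return sum(score * weight for score, weight in zip(objective_scores, gating_weights))

# Hypothetical objectives: helpfulness, correctness, safety.
scores = [0.8, 0.6, 0.9]
weights = [0.5, 0.3, 0.2]
reward = aggregate_rewards(scores, weights)  # 0.5*0.8 + 0.3*0.6 + 0.2*0.9 = 0.76
```

In the actual model the gating weights are produced per-prompt by a small network, so different prompts emphasize different objectives; the fixed weights above are only for illustration.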

Llama3.1-8B-PRM-Deepseek-Data

llama • 1,444 downloads • 37 likes

Llama3-v2-iterative-DPO-iter3

llama • 96 downloads • 1 like

Llama3.1-8B-PRM-Mistral-Data

llama • 37 downloads • 10 likes

Qwen2.5-Math-1.5B-DAPO-easy

apache-2.0 • 37 downloads • 0 likes

Qwen2.5-Math-1.5B-GRPO-n8-easy

apache-2.0 • 25 downloads • 0 likes

LLaMA3-iterative-DPO-final

This model implements the RLHF workflow from reward modeling to online RLHF. It accompanies the paper 'RLHF Workflow: From Reward Modeling to Online RLHF' (TMLR, 2024). Authors: Hanze Dong, Wei Xiong, Bo Pang, Haoxiang Wang, Han Zhao, Yingbo Zhou, Nan Jiang, Doyen Sahoo, Caiming Xiong, and Tong Zhang.

llama • 22 downloads • 41 likes
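Since this final model comes out of an iterative DPO pipeline, a minimal sketch of the standard DPO objective for a single preference pair may help clarify what each iteration optimizes. This is the textbook formulation, not RLHFlow's exact implementation, and the log-probabilities below are made-up toy values:

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    Each argument is the summed log-probability of the full response
    under either the trained policy or the frozen reference model.
    """
    # Implicit reward = beta * log-ratio between policy and reference.
    chosen_reward = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (policy_logp_rejected - ref_logp_rejected)
    # -log sigmoid of the reward margin: pushes chosen above rejected.
    margin = chosen_reward - rejected_reward
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Toy numbers: the policy already prefers the chosen response slightly,
# so the loss is a bit below -log(0.5) ≈ 0.693.
loss = dpo_loss(-12.0, -15.0, -13.0, -14.0)
```

In the online/iterative variant, each round samples fresh response pairs, labels them with a reward model, runs a DPO update like the one above, and uses the resulting policy as the starting point for the next round.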

Qwen2.5-Math-1-5B-Reinforce-Ada-balance-easy

16 downloads • 0 likes

LLaMA3-SFT

llama • 15 downloads • 10 likes

Qwen2.5-Math-7B-Reinforce-Ada-balance-easy

Checkpoint from step=500, trained on the easy prompt set.

apache-2.0 • 15 downloads • 0 likes

Qwen3-4B-Instruct-2507-Reinforce-Ada-balance-hard

Checkpoint from step=400, trained on the hard prompt set.

apache-2.0 • 14 downloads • 0 likes

Qwen2.5-Math-7B-Reinforce-Ada-balance-hard

Checkpoint from step=400, trained on the hard prompt set.

apache-2.0 • 13 downloads • 0 likes

Llama3.1-8B-ORM-Deepseek-Data

llama • 11 downloads • 2 likes

Llama-3.2-3B-Instruct-Reinforce-Ada-balance-hard

Checkpoint from step=400, trained on the hard prompt set.

llama • 10 downloads • 0 likes

pair-preference-model-LLaMA3-8B

llama • 9 downloads • 38 likes

Qwen2.5-Math-1-5B-Reinforce-Ada-balance-hard

9 downloads • 0 likes

LLaMA3.2-3B-SFT

llama • 8 downloads • 0 likes

Decision-Tree-Reward-Llama-3.1-8B

llama • 6 downloads • 7 likes

LLaMA3-SFT-v2

llama • 6 downloads • 2 likes

RewardModel-Mistral-7B-for-DPA-v1

4 downloads • 4 likes

LLaMA3.2-1B-SFT

llama • 4 downloads • 0 likes

Decision-Tree-Reward-Gemma-2-27B

3 downloads • 8 likes

Qwen2.5-Math-7B-Zero-Reinforce-Rej

3 downloads • 1 like

Llama3.1-8B-ORM-Mistral-Data

llama • 2 downloads • 0 likes

Qwen2.5-7B-PPO-Zero

1 download • 2 likes

DPA-v1-Mistral-7B

1 download • 1 like

Qwen2.5-Math-7B-Zero-RAFTpp

1 download • 1 like

Llama3-SFT-v2.0-epoch1

llama • 1 download • 0 likes

Llama3-SFT-v2.0-epoch2

llama • 1 download • 0 likes

Llama3-SFT-v2.0-epoch3

llama • 1 download • 0 likes

Llama3-v2-iterative-DPO-iter2

llama • 1 download • 0 likes

Qwen2.5-7B-DPO-Zero

1 download • 0 likes

Qwen2.5-7B-SFT

1 download • 0 likes

Qwen2.5-7B-DPO

1 download • 0 likes