JayHyeon

267 models

pythia-2.8b-rDPO_5e-7_1.0vpo_constant-1ep_0.3label_smoothing

This model is a fine-tuned version of EleutherAI/pythia-2.8b on the trl-lib/ultrafeedback_binarized dataset, trained with TRL using DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" (TRL 0.19.0.dev0, Transformers 4.52.4, PyTorch 2.7.1, Datasets 3.6.0, Tokenizers 0.21.1).
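Many runs above vary the DPO label-smoothing parameter (the rDPO/cDPO variants with `0.1label_smoothing` or `0.3label_smoothing` suffixes). As a minimal sketch, assuming the standard "conservative DPO" formulation, the smoothed per-pair loss can be written as follows; `h` is the policy-vs-reference log-ratio margin, and the signature is illustrative, not the actual training code behind these checkpoints:

```python
import math

def dpo_loss(h, beta=0.1, label_smoothing=0.0):
    """Smoothed (conservative) DPO loss for one preference pair.

    h is the margin of log-ratios:
    h = (log pi(y_w|x) - log ref(y_w|x)) - (log pi(y_l|x) - log ref(y_l|x))
    With label_smoothing = eps, the preference label is trusted with
    probability 1 - eps (eps > 0 models noisy labels).
    """
    def log_sigmoid(z):
        # numerically stable log(sigmoid(z))
        return -math.log1p(math.exp(-z)) if z >= 0 else z - math.log1p(math.exp(z))

    eps = label_smoothing
    return -(1 - eps) * log_sigmoid(beta * h) - eps * log_sigmoid(-beta * h)
```

With `label_smoothing=0` this reduces to the vanilla DPO objective; a value of 0.3, as in the `0.3label_smoothing` runs, corresponds to assuming up to 30% of preference labels may be wrong.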

pythia-2.8b-cDPO_5e-7_1.0vpo_constant-1ep_0.3label_smoothing

This model is a fine-tuned version of EleutherAI/pythia-2.8b on the trl-lib/ultrafeedback_binarized dataset, trained with TRL using DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" (TRL 0.19.0.dev0, Transformers 4.52.4, PyTorch 2.7.1, Datasets 3.6.0, Tokenizers 0.21.1).

pythia-2.8b-cDPO_5e-7_1.0vpo_constant-1ep_0.1label_smoothing

This model is a fine-tuned version of EleutherAI/pythia-2.8b on the trl-lib/ultrafeedback_binarized dataset, trained with TRL using DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" (TRL 0.19.0.dev0, Transformers 4.52.4, PyTorch 2.7.1, Datasets 3.6.0, Tokenizers 0.21.1).

pythia-2.8b-2e-5-1ep

This model is a fine-tuned version of EleutherAI/pythia-2.8b on the HuggingFaceH4/ultrafeedback_binarized dataset, trained with TRL (TRL 0.15.2, Transformers 4.50.0, PyTorch 2.6.0, Datasets 3.4.1, Tokenizers 0.21.1).

pythia-2.8b-rDPO_5e-7_1.0vpo_constant-1ep_0.1label_smoothing

This model is a fine-tuned version of EleutherAI/pythia-2.8b on the trl-lib/ultrafeedback_binarized dataset, trained with TRL using DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" (TRL 0.15.2, Transformers 4.50.0, PyTorch 2.6.0, Datasets 3.4.1, Tokenizers 0.21.1).

Qwen_0.5-VDPO_5e-7_3.0vpo_constant-1ep


Qwen_0.5-rDPO_5e-7_0.1lsmooth-1.0vpo_constant

This model is a fine-tuned version of Qwen/Qwen2.5-0.5B-Instruct on the JayHyeon/shp-dpo-converted dataset, trained with TRL using DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" (TRL 0.15.2, Transformers 4.55.0, PyTorch 2.7.1, Datasets 3.4.1, Tokenizers 0.21.1).

Qwen_1.5B-math-cDPO_5e-7_0.1lsmooth-1.0vpo_constant-1ep

Qwen_1.5B-math-rDPO_5e-7_0.1lsmooth-1.0vpo_constant-1ep

llama-1e-6-1ep

This model is a fine-tuned version of meta-llama/Llama-2-7b-hf on the HuggingFaceH4/ultrafeedback_binarized dataset, trained with TRL (TRL 0.22.0.dev0, Transformers 4.55.0, PyTorch 2.7.1, Datasets 4.0.0, Tokenizers 0.21.4).

Qwen_0.5-VDPO_5e-7_3.0vpo_constant_0.3label_smoothing

Qwen_0.5-rDPO_5e-7_1.0vpo_constant_0.3label_smoothing

pythia-2.8b-VIPO_5e-7_1.0vpo_constant-1ep

This model is a fine-tuned version of EleutherAI/pythia-2.8b on the trl-lib/ultrafeedback_binarized dataset, trained with TRL using DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" (TRL 0.19.0.dev0, Transformers 4.52.4, PyTorch 2.7.1, Datasets 3.6.0, Tokenizers 0.21.1).

Qwen_1.5B-math-DPO_5e-6_1.0vpo_constant-10ep

This model is a fine-tuned version of Qwen/Qwen2.5-Math-1.5B on the argilla/distilabel-math-preference-dpo dataset, trained with TRL using DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" (TRL 0.15.2, Transformers 4.50.0, PyTorch 2.6.0, Datasets 3.4.1, Tokenizers 0.21.1).

Qwen_0.5-VDPO_3e-6_10.0vpo_constant-1ep_0.3flip


Qwen_0.5-IRPO_1e-6-3ep_0.01alp_0.5bdpo_lam_0dpop_lam

This model is a fine-tuned version of JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep on the trl-lib/ultrafeedback_binarized dataset, trained with TRL using DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" (TRL 0.20.0.dev0, Transformers 4.52.4, PyTorch 2.7.1, Datasets 3.6.0, Tokenizers 0.21.1).

llama-DPOP_5e-7-1ep_0alp_0.5bdpo_lam_5dpop_lam

Qwen_0.5-DPOP_1e-6-3ep_0alp_0.5bdpo_lam_50dpop_lam

gemma-DPO_1e-6-3ep_0alp_0.5bdpo_lam_0dpop_lam

gemma-DPO_1e-7-1ep_0alp_0.5bdpo_lam_0dpop_lam

Qwen_0.5-VDPO_5e-7_0.3vpo_constant-1ep

This model is a fine-tuned version of Qwen/Qwen2.5-0.5B-Instruct on the JayHyeon/shp-dpo-converted dataset, trained with TRL using DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" (TRL 0.15.2, Transformers 4.50.0, PyTorch 2.6.0, Datasets 3.4.1, Tokenizers 0.21.1).

Qwen2.5-0.5B_ultrainteract_sft_2e-5_1ep

Qwen_0.5-BDPO_1e-6-3ep_0alp_0.999bdpo_lam_0dpop_lam

Qwen_0.5-IRPO_1e-6-3ep_0.005alp_0.5bdpo_lam_0dpop_lam

This model is a fine-tuned version of JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep on the trl-lib/ultrafeedback_binarized dataset, trained with TRL using DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" (TRL 0.20.0.dev0, Transformers 4.52.4, PyTorch 2.7.1, Datasets 3.6.0, Tokenizers 0.21.1).

llama-BDPO_1e-7-1ep_0alp_0.5bdpo_lam_0dpop_lam

Qwen_1.5B-math-cDPO_5e-7_0.3lsmooth-1.0vpo_constant-1ep

Qwen_1.5B-math-rDPO_5e-7_0.3lsmooth-1.0vpo_constant-1ep

Qwen_0.5-cDPO_5e-7_1.0vpo_constant_0.3label_smoothing

llama-VDPO_5e-7_1.0vpo_constant

Qwen_0.5-VDPO_5e-7_1.0vpo_constant_0.1label_smoothing

Qwen_0.5-cDPO_5e-7_0.1lsmooth-1.0vpo_constant-1ep

This model is a fine-tuned version of Qwen/Qwen2.5-0.5B-Instruct on the JayHyeon/shp-dpo-converted dataset, trained with TRL using DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" (TRL 0.15.2, Transformers 4.50.0, PyTorch 2.6.0, Datasets 3.4.1, Tokenizers 0.21.1).

Qwen2.5-0.5B-SFT-2e-5-2ep-DPO_5e-7_3ep_0alp_0lam


Qwen2.5-0.5B-SFT-2e-5-2ep-DPOP_5e-6-3ep_0alp_5lam

Trained on the ultrafeedback_binarized dataset with the transformers library.

Qwen_0.5-IPO_5e-7-3ep_0alp_0lam

Based on JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep and trained on the trl-lib/ultrafeedback_binarized dataset with the transformers library.

Qwen_1.5B-math-VIPO_5e-6_3.0vpo_constant-5ep

pythia-2.8b-VIPO_5e-7_1.0vpo_const-1ep

pythia-2.8b-VIPO_5e-7_3.0vpo_const-1ep

This model is a fine-tuned version of EleutherAI/pythia-2.8b on the trl-lib/ultrafeedback_binarized dataset, trained with TRL using DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" (TRL 0.13.0.dev0, Transformers 4.47.0.dev0, PyTorch 2.5.1, Datasets 3.1.0, Tokenizers 0.20.3).

Qwen_0.5-IRPO_1e-6-3ep_2alp_0.5bdpo_lam_0dpop_lam

This model is a fine-tuned version of JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep on the trl-lib/ultrafeedback_binarized dataset, trained with TRL using DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" (TRL 0.20.0.dev0, Transformers 4.52.4, PyTorch 2.7.1, Datasets 3.6.0, Tokenizers 0.21.1).

Qwen_0.5-DPO_5e-7_1.0vpo_constant

This model is a fine-tuned version of Qwen/Qwen2.5-0.5B-Instruct on the JayHyeon/shp-dpo-converted dataset, trained with TRL using DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" (TRL 0.15.2, Transformers 4.50.0, PyTorch 2.6.0, Datasets 3.4.1, Tokenizers 0.21.1).

Qwen_0.5-VDPO_5e-7_1.0vpo_constant

This model is a fine-tuned version of Qwen/Qwen2.5-0.5B-Instruct on the JayHyeon/shp-dpo-converted dataset, trained with TRL using DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" (TRL 0.15.2, Transformers 4.50.0, PyTorch 2.6.0, Datasets 3.4.1, Tokenizers 0.21.1).

Qwen_0.5-cDPO_5e-7_1.0vpo_constant_0.1label_smoothing

llama-DPO_5e-7_1.0vpo_constant

Qwen2.5-0.5B-SFT-7e-5-3ep

SFT model trained on the HuggingFaceH4/ultrafeedback_binarized dataset with the transformers library. License: MIT.

Qwen2.5-0.5B-SFT-1e-5-5ep

Qwen2.5-0.5B trained on the HuggingFaceH4/ultrafeedback_binarized dataset with the transformers library.

Qwen2.5-0.5B-SFT-7e-5-5ep

Based on Qwen/Qwen2.5-0.5B and trained on the HuggingFaceH4/ultrafeedback_binarized dataset with the transformers library.

Qwen2.5-0.5B-SFT-2e-5-2ep-DPO_7e-7_2ep_0alp_0lam


Qwen2.5-0.5B-SFT-2e-5-2ep-MDPO_1e-6_1ep_0alp_0lam


Qwen2.5-0.5B-SFT-2e-5-2ep-MDPO_5e-6-3ep_0alp_0lam

Based on JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep and trained on the trl-lib/ultrafeedback_binarized dataset with the transformers library.

Qwen2.5-0.5B-SFT-2e-5-2ep-IRPO_5e-7-3ep_1alp_0lam

Based on JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep and trained on the trl-lib/ultrafeedback_binarized dataset with the transformers library.

Qwen2.5-0.5B-SFT-2e-5-2ep-IRPO_1e-7-1ep_1alp_0lam


Qwen_0.5-DPOP_3e-7-2ep_0alp_5lam


Qwen_0.5-rDPO_1e-6-1ep_0vpo_const_0.1


Qwen_1.5B-math-DPO_5e-6_1.0vpo_constant-5ep

This model is a fine-tuned version of Qwen/Qwen2.5-Math-1.5B on the argilla/distilabel-math-preference-dpo dataset, trained with TRL using DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" (TRL 0.15.2, Transformers 4.50.0, PyTorch 2.6.0, Datasets 3.4.1, Tokenizers 0.21.1).

pythia-2.8b-VDPO_5e-7_1.0vpo_constant-1ep

This model is a fine-tuned version of EleutherAI/pythia-2.8b on the trl-lib/ultrafeedback_binarized dataset, trained with TRL using DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" (TRL 0.19.0.dev0, Transformers 4.52.4, PyTorch 2.7.1, Datasets 3.6.0, Tokenizers 0.21.1).

Qwen_1.5B-math-IPO_5e-6_1.0vpo_constant-5ep


Qwen_1.5B-math-VIPO_5e-6_1.0vpo_constant-5ep

This model is a fine-tuned version of Qwen/Qwen2.5-Math-1.5B on the argilla/distilabel-math-preference-dpo dataset, trained with TRL using DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" (TRL 0.15.2, Transformers 4.50.0, PyTorch 2.6.0, Datasets 3.4.1, Tokenizers 0.21.1).

Qwen_1.5B-math-VIPO_5e-6_10.0vpo_constant-5ep

This model is a fine-tuned version of Qwen/Qwen2.5-Math-1.5B on the argilla/distilabel-math-preference-dpo dataset, trained with TRL using DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" (TRL 0.15.2, Transformers 4.50.0, PyTorch 2.6.0, Datasets 3.4.1, Tokenizers 0.21.1).

Qwen_1.5B-math-VDPO_5e-6_3.0vpo_constant-5ep

Qwen_1.5B-math-VDPO_5e-6_10.0vpo_constant-5ep

Qwen_1.5B-math-DPO_1e-5_1.0vpo_constant-5ep

This model is a fine-tuned version of Qwen/Qwen2.5-Math-1.5B on the argilla/distilabel-math-preference-dpo dataset, trained with TRL using DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" (TRL 0.15.2, Transformers 4.50.0, PyTorch 2.6.0, Datasets 3.4.1, Tokenizers 0.21.1).

Qwen_1.5B-math-DPO_1e-5_1.0vpo_constant-10ep


pythia-2.8b-DPO_1e-6_1.0vpo_constant-1ep

This model is a fine-tuned version of EleutherAI/pythia-2.8b on the trl-lib/ultrafeedback_binarized dataset, trained with TRL using DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" (TRL 0.19.0.dev0, Transformers 4.52.4, PyTorch 2.7.1, Datasets 3.6.0, Tokenizers 0.21.1).

Qwen_1.5B-math-DPO_5e-5_1.0vpo_constant-10ep

This model is a fine-tuned version of Qwen/Qwen2.5-Math-1.5B on the argilla/distilabel-math-preference-dpo dataset, trained with TRL using DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" (TRL 0.15.2, Transformers 4.50.0, PyTorch 2.6.0, Datasets 3.4.1, Tokenizers 0.21.1).

Qwen_1.5B-math-DPO_5e-5_1.0vpo_constant-20ep

This model is a fine-tuned version of Qwen/Qwen2.5-Math-1.5B on the argilla/distilabel-math-preference-dpo dataset, trained with TRL using DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" (TRL 0.15.2, Transformers 4.50.0, PyTorch 2.6.0, Datasets 3.4.1, Tokenizers 0.21.1).

pythia-2.8b-IPO_5e-7_1.0vpo_constant-1ep

Qwen_1.5B-math-DPO_1e-4_1.0vpo_constant-10ep

pythia-2.8b-IPO_5e-7_1.0vpo_const-1ep

This model is a fine-tuned version of EleutherAI/pythia-2.8b on the trl-lib/ultrafeedback_binarized dataset, trained with TRL using DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" (TRL 0.13.0.dev0, Transformers 4.47.0.dev0, PyTorch 2.5.1, Datasets 3.1.0, Tokenizers 0.20.3).
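The IPO/VIPO runs in this list replace DPO's logistic loss with IPO's squared objective. A minimal sketch of the per-pair IPO loss, assuming the formulation from "A General Theoretical Paradigm to Understand Learning from Human Preferences" (Azar et al.) as implemented by TRL's `loss_type="ipo"`; `h` again denotes the preference log-ratio margin:

```python
def ipo_loss(h, beta=0.1):
    """IPO objective for one preference pair: squared distance of the
    log-ratio margin h from the target margin 1/(2*beta). Unlike DPO's
    logistic loss, it does not keep rewarding ever-larger margins."""
    return (h - 1.0 / (2.0 * beta)) ** 2
```

Note the loss is minimized at a finite margin (h = 1/(2β)) and grows on both sides of it, which is IPO's guard against over-optimization.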

Qwen_1.5B-math-VDPO_1e-4_1.0vpo_constant-10ep

This model is a fine-tuned version of Qwen/Qwen2.5-Math-1.5B on the argilla/distilabel-math-preference-dpo dataset, trained with TRL using DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" (TRL 0.19.0.dev0, Transformers 4.52.4, PyTorch 2.7.1, Datasets 3.6.0, Tokenizers 0.21.1).

pythia-2.8b-VDPO_5e-7_3.0vpo_constant-1ep


pythia-2.8b-VDPO_5e-7_10.0vpo_constant-1ep

This model is a fine-tuned version of EleutherAI/pythia-2.8b on the trl-lib/ultrafeedback_binarized dataset, trained with TRL using DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" (TRL 0.19.0.dev0, Transformers 4.52.4, PyTorch 2.7.1, Datasets 3.6.0, Tokenizers 0.21.1).

Qwen_0.5-VDPO_5e-7_1.0vpo_constant-1ep

This model is a fine-tuned version of JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep on the trl-lib/ultrafeedback_binarized dataset, trained with TRL using DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" (TRL 0.15.2, Transformers 4.50.0, PyTorch 2.6.0, Datasets 3.4.1, Tokenizers 0.21.1).

Qwen_0.5-DPO_5e-7_1.0vpo_constant-1ep_0.3flip

Qwen_0.5-cDPO_5e-7_1.0vpo_constant-1ep_0.3flip

Qwen_0.5-DPO_3e-6_1.0vpo_constant-1ep_0.3flip

This model is a fine-tuned version of JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep on the trl-lib/ultrafeedback_binarized dataset, trained with TRL using DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" (TRL 0.19.0.dev0, Transformers 4.52.4, PyTorch 2.7.1, Datasets 3.6.0, Tokenizers 0.21.1).
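Several runs carry a `0.3flip` suffix, which presumably means each preference pair's chosen/rejected labels were swapped with probability 0.3 to probe robustness to label noise; that reading is an assumption, and the helper below is hypothetical, not the actual training script:

```python
import random

def flip_pairs(pairs, flip_prob=0.3, seed=0):
    """Return a copy of (chosen, rejected) pairs where each pair's labels
    are swapped independently with probability flip_prob.

    Hypothetical helper illustrating the '0.3flip' naming convention;
    seeded for reproducibility.
    """
    rng = random.Random(seed)
    out = []
    for chosen, rejected in pairs:
        if rng.random() < flip_prob:
            out.append((rejected, chosen))  # inject a noisy (flipped) label
        else:
            out.append((chosen, rejected))
    return out
```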

Qwen_0.5-VDPO_3e-6_1.0vpo_constant-1ep_0.3flip

This model is a fine-tuned version of JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep on the trl-lib/ultrafeedback_binarized dataset, trained with TRL using DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" (TRL 0.20.0.dev0, Transformers 4.52.4, PyTorch 2.7.1, Datasets 3.6.0, Tokenizers 0.21.1).

Qwen_0.5-VDPO_3e-6_3.0vpo_constant-1ep_0.3flip


Qwen_0.5-cDPO_3e-6_1.0vpo_constant-1ep_0.3flip

This model is a fine-tuned version of JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep on the trl-lib/ultrafeedback_binarized dataset, trained with TRL using DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" (TRL 0.15.2, Transformers 4.50.0, PyTorch 2.6.0, Datasets 3.4.1, Tokenizers 0.21.1).

Qwen_0.5-IPO_3e-6_1.0vpo_constant-1ep_0.3flip


Qwen_0.5-IRPO_5e-7-3ep_0.1alp_0.5bdpo_lam_0dpop_lam

This model is a fine-tuned version of JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep on the trl-lib/ultrafeedback_binarized dataset, trained with TRL using DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" (TRL 0.13.0.dev0, Transformers 4.47.0.dev0, PyTorch 2.5.1, Datasets 3.1.0, Tokenizers 0.20.3).

Qwen_0.5-IRPO_1e-6-3ep_10alp_0.5bdpo_lam_0dpop_lam

Qwen_0.5-BDPO_5e-7-3ep_0alp_0.5bdpo_lam_0dpop_lam

Qwen_0.5-IRPO_5e-7-3ep_10alp_0.5bdpo_lam_0dpop_lam

This model is a fine-tuned version of JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep on the trl-lib/ultrafeedback_binarized dataset, trained with TRL using DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" (TRL 0.13.0.dev0, Transformers 4.47.0.dev0, PyTorch 2.5.1, Datasets 3.1.0, Tokenizers 0.20.3).

Qwen_0.5-IRPO_1e-6-3ep_5alp_0.5bdpo_lam_0dpop_lam

This model is a fine-tuned version of JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep on the trl-lib/ultrafeedback_binarized dataset, trained with TRL using DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" (TRL 0.20.0.dev0, Transformers 4.52.4, PyTorch 2.7.1, Datasets 3.6.0, Tokenizers 0.21.1).

llama-BDPO_5e-7-1ep_0alp_0.5bdpo_lam_0dpop_lam

llama-DPO_5e-7-1ep_0alp_0.5bdpo_lam_0dpop_lam

llama-IRPO_5e-7-1ep_1alp_0.5bdpo_lam_0dpop_lam

Qwen_0.5-IRPO_5e-7-3ep_0.05alp_0.5bdpo_lam_0dpop_lam

This model is a fine-tuned version of JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep, trained with TRL using DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" (TRL 0.20.0.dev0, Transformers 4.53.0, PyTorch 2.7.1, Datasets 3.6.0, Tokenizers 0.21.2).

llama-DPO_1e-6-1ep_0alp_0.5bdpo_lam_0dpop_lam

llama-IRPO_1e-6-3ep_1alp_0.5bdpo_lam_0dpop_lam

llama-DPOP_1e-6-1ep_0alp_0.5bdpo_lam_5dpop_lam

llama-BDPO_1e-6-1ep_0alp_0.5bdpo_lam_0dpop_lam

llama-DPO_1e-6-3ep_0alp_0.5bdpo_lam_0dpop_lam

llama-IRPO_1e-6-1ep_1alp_0.5bdpo_lam_0dpop_lam

llama-DPOP_1e-6-3ep_0alp_0.5bdpo_lam_5dpop_lam

Qwen_0.5-IRPO_1e-6-3ep_0.25alp_0.5bdpo_lam_0dpop_lam

This model is a fine-tuned version of JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep on the trl-lib/ultrafeedback_binarized dataset, trained with TRL using DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" (TRL 0.20.0.dev0, Transformers 4.52.4, PyTorch 2.7.1, Datasets 3.6.0, Tokenizers 0.21.1).

Qwen_0.5-BDPO_1e-6-3ep_0alp_0.99999bdpo_lam_0dpop_lam

This model is a fine-tuned version of JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep on the trl-lib/ultrafeedback_binarized dataset, trained with TRL using DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" (TRL 0.13.0.dev0, Transformers 4.47.0.dev0, PyTorch 2.5.1, Datasets 3.1.0, Tokenizers 0.20.3).

Qwen_0.5-IRPO_5e-7-3ep_0.25alp_0.5bdpo_lam_0dpop_lam

Qwen_0.5-IRPO_1e-6-3ep_0.5alp_0.5bdpo_lam_0dpop_lam

Qwen_0.5-DPOP_5e-7-3ep_0alp_0.5bdpo_lam_50dpop_lam

Qwen_0.5-IRPO_5e-7-3ep_0.5alp_0.5bdpo_lam_0dpop_lam

Qwen_0.5-DPOP_5e-7-3ep_0alp_0.5bdpo_lam_500dpop_lam

Qwen_0.5-BDPO_5e-7-3ep_0alp_0.99999bdpo_lam_0dpop_lam

gemma-BDPO_1e-6-1ep_0alp_0.5bdpo_lam_0dpop_lam

gemma-DPO_1e-6-1ep_0alp_0.5bdpo_lam_0dpop_lam

gemma-DPOP_1e-6-1ep_0alp_0.5bdpo_lam_5dpop_lam

Qwen_0.5-IRPO_5e-7-3ep_2alp_0.5bdpo_lam_0dpop_lam

gemma-BDPO_5e-7-1ep_0alp_0.5bdpo_lam_0dpop_lam

This model is a fine-tuned version of google/gemma-3-1b-it on the trl-lib/ultrafeedback_binarized dataset, trained with TRL using DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" (TRL 0.20.0.dev0, Transformers 4.53.0, PyTorch 2.7.1, Datasets 3.6.0, Tokenizers 0.21.2).

gemma-IRPO_5e-7-1ep_1alp_0.5bdpo_lam_0dpop_lam

This model is a fine-tuned version of google/gemma-3-1b-it on the trl-lib/ultrafeedback_binarized dataset, trained with TRL using DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" (TRL 0.20.0.dev0, Transformers 4.53.0, PyTorch 2.7.1, Datasets 3.6.0, Tokenizers 0.21.2).

gemma-DPO_5e-7-1ep_0alp_0.5bdpo_lam_0dpop_lam

This model is a fine-tuned version of google/gemma-3-1b-it on the trl-lib/ultrafeedback_binarized dataset, trained with TRL using DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" (TRL 0.20.0.dev0, Transformers 4.53.0, PyTorch 2.7.1, Datasets 3.6.0, Tokenizers 0.21.2).

gemma-DPOP_1e-6-3ep_0alp_0.5bdpo_lam_5dpop_lam

gemma-IRPO_1e-6-3ep_1alp_0.5bdpo_lam_0dpop_lam

Qwen_0.5-ultrainteract_ORPO_5e-7-1ep

Qwen_0.5-SLiC_5e-7-1ep

This model is a fine-tuned version of JayHyeon/Qwen2.5-0.5B_ultrainteract_sft_2e-5_1ep on the JayHyeon/trl_ultrainteract-pair dataset, trained with TRL using DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" (TRL 0.19.0.dev0, Transformers 4.52.4, PyTorch 2.7.1, Datasets 3.6.0, Tokenizers 0.21.1).

Qwen_0.5-BDPO_5e-7-3ep_0alp_0.999bdpo_lam_0dpop_lam

This model is a fine-tuned version of JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep on the trl-lib/ultrafeedback_binarized dataset, trained with TRL using DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" (TRL 0.15.2, Transformers 4.50.0, PyTorch 2.6.0, Datasets 3.4.1, Tokenizers 0.21.1).

llama-IRPO_1e-6-2ep_1alp_0.5bdpo_lam_0dpop_lam

This model is a fine-tuned version of meta-llama/Llama-3.2-1B-Instruct on the trl-lib/ultrafeedback_binarized dataset, trained with TRL using DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" (TRL 0.20.0.dev0, Transformers 4.53.0, PyTorch 2.7.1, Datasets 3.6.0, Tokenizers 0.21.2).

llama-BDPO_1e-6-2ep_0alp_0.5bdpo_lam_0dpop_lam

This model is a fine-tuned version of meta-llama/Llama-3.2-1B-Instruct on the trl-lib/ultrafeedback_binarized dataset, trained with TRL using DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" (TRL 0.20.0.dev0, Transformers 4.53.0, PyTorch 2.7.1, Datasets 3.6.0, Tokenizers 0.21.2).

llama-DPOP_1e-6-2ep_0alp_0.5bdpo_lam_5dpop_lam

This model is a fine-tuned version of meta-llama/Llama-3.2-1B-Instruct on the trl-lib/ultrafeedback_binarized dataset, trained with TRL using DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" (TRL 0.20.0.dev0, Transformers 4.53.0, PyTorch 2.7.1, Datasets 3.6.0, Tokenizers 0.21.2).

gemma-BDPO_3e-6-1ep_0alp_0.5bdpo_lam_0dpop_lam

gemma-DPOP_3e-6-1ep_0alp_0.5bdpo_lam_5dpop_lam

gemma-IRPO_3e-6-1ep_1alp_0.5bdpo_lam_0dpop_lam

gemma-IRPO_1e-7-1ep_1alp_0.5bdpo_lam_0dpop_lam

gemma-BDPO_1e-7-1ep_0alp_0.5bdpo_lam_0dpop_lam

gemma-DPOP_1e-7-1ep_0alp_0.5bdpo_lam_5dpop_lam

llama-IRPO_1e-7-1ep_1alp_0.5bdpo_lam_0dpop_lam

llama-DPOP_1e-7-1ep_0alp_0.5bdpo_lam_5dpop_lam

llama-DPO_1e-7-1ep_0alp_0.5bdpo_lam_0dpop_lam

This model is a fine-tuned version of meta-llama/Llama-3.2-1B-Instruct on the trl-lib/ultrafeedback_binarized dataset, trained with TRL using DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" (TRL 0.20.0.dev0, Transformers 4.53.0, PyTorch 2.7.1, Datasets 3.6.0, Tokenizers 0.21.2).

llama-BDPO_2e-7-1ep_0alp_0.5bdpo_lam_0dpop_lam

This model is a fine-tuned version of meta-llama/Llama-3.2-1B-Instruct on the trl-lib/ultrafeedback_binarized dataset, trained with TRL using DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" (TRL 0.20.0.dev0, Transformers 4.53.0, PyTorch 2.7.1, Datasets 3.6.0, Tokenizers 0.21.2).

Qwen_0.5-ultrainteract_SLiC_5e-7-1ep

This model is a fine-tuned version of JayHyeon/Qwen2.5-0.5B_ultrainteract_sft_2e-5_1ep on the JayHyeon/trl_ultrainteract-pair dataset, trained with TRL using DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" (TRL 0.20.0.dev0, Transformers 4.53.0, PyTorch 2.7.1, Datasets 3.6.0, Tokenizers 0.21.2).

llama-DPO_2e-7-1ep_0alp_0.5bdpo_lam_0dpop_lam

This model is a fine-tuned version of meta-llama/Llama-3.2-1B-Instruct on the trl-lib/ultrafeedback_binarized dataset, trained with TRL using DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" (TRL 0.20.0.dev0, Transformers 4.53.0, PyTorch 2.7.1, Datasets 3.6.0, Tokenizers 0.21.2).

llama-BDPO_3e-7-1ep_0alp_0.5bdpo_lam_0dpop_lam

llama-IRPO_3e-7-1ep_1alp_0.5bdpo_lam_0dpop_lam

llama-DPOP_3e-7-1ep_0alp_0.5bdpo_lam_5dpop_lam

llama-DPO_3e-7-1ep_0alp_0.5bdpo_lam_0dpop_lam

llama-BDPO_3e-6-1ep_0alp_0.5bdpo_lam_0dpop_lam

This model is a fine-tuned version of meta-llama/Llama-3.2-1B-Instruct on the trl-lib/ultrafeedback_binarized dataset, trained with TRL using DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" (TRL 0.20.0.dev0, Transformers 4.53.0, PyTorch 2.7.1, Datasets 3.6.0, Tokenizers 0.21.2).

llama-IRPO_3e-6-1ep_1alp_0.5bdpo_lam_0dpop_lam

This model is a fine-tuned version of meta-llama/Llama-3.2-1B-Instruct on the trl-lib/ultrafeedback_binarized dataset, trained with TRL using DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" (TRL 0.20.0.dev0, Transformers 4.53.0, PyTorch 2.7.1, Datasets 3.6.0, Tokenizers 0.21.2).

Qwen_0.5-ultrainteract_SPPO_5e-7-1ep

Qwen_0.5-cDPO_5e-7_0.3lsmooth-1.0vpo_constant-1ep

Qwen_0.5-rDPO_5e-7_0.3lsmooth-1.0vpo_constant

Qwen2-0.5B-Reward_VPO_5e-4

Qwen2.5-0.5B-SFT

Qwen2.5-0.5B SFT model trained on the HuggingFaceH4/ultrafeedback_binarized dataset with the transformers library. License: MIT.

Qwen2.5-0.5B-SFT-1e-4

Qwen2.5-0.5B trained on the HuggingFaceH4/ultrafeedback_binarized dataset with the transformers library. License: MIT.

Qwen2.5-0.5B-SFT-2e-5-2ep

Qwen2.5-0.5B fine-tuned on the HuggingFaceH4/ultrafeedback_binarized dataset with the transformers library.

Qwen2.5-0.5B-SFT-5e-5-2ep

The model is based on Qwen/Qwen2.5-0.5B and utilizes datasets from HuggingFaceH4/ultrafeedback_binarized with the library transformers.

2
0

Qwen2.5-0.5B-SFT-2e-5-2ep-MDPO_5e-7-3ep_0alp_0lam

Base model trained on datasets from trl-lib/ultrafeedback_binarized using the transformers library.

2
0

Qwen2.5-0.5B-SFT-2e-5-2ep-MDPO_5e-7_2ep_0alp_0lam

Library name: transformers.

2
0

Qwen2.5-0.5B-SFT-2e-5-2ep-DPO_5e-7_1ep_0alp_0lam

A model designed for various natural language processing tasks using the transformers library.

2
0

Qwen2.5-0.5B-SFT-2e-5-2ep-MDPO_5e-6-1ep_0alp_0lam

A model designed for various natural language processing tasks using the transformers library.

2
0

Qwen2.5-0.5B-SFT-2e-5-2ep-MDPO_0.5_1e-7-3ep_0alp_0lam

The model is based on JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep and utilizes datasets from trl-lib/ultrafeedback_binarized with the transformers library.

2
0

Qwen2.5-0.5B-SFT-2e-5-2ep-MDPO_0.5_1e-7-2ep_0alp_0lam

A model designed for specific tasks using the transformers library.

2
0

Qwen2.5-0.5B-SFT-2e-5-2ep-IRPO_3e-7-3ep_1alp_0lam

The model is based on JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep and utilizes datasets from trl-lib/ultrafeedback_binarized with the transformers library.

2
0

Qwen2.5-0.5B-SFT-2e-5-2ep-DPOP_3e-7-3ep_0alp_5lam

The model is based on JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep and utilizes datasets from trl-lib/ultrafeedback_binarized with the transformers library.

2
0

Qwen_0.5-DPO_5e-7-3ep_0alp_0lam

Base model: JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep. Datasets: trl-lib/ultrafeedback_binarized. Library name: transformers.

2
0
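The repository names above appear to follow a compact convention: method, learning rate, epoch count, and the alpha/lambda loss coefficients are packed into the suffix (e.g. `DPO_5e-7-3ep_0alp_0lam`). The field meanings are inferred from the listing rather than documented by the author, so the parser below is a hypothetical convenience:

```python
import re

# Longer / more specific method names first, so e.g. 'DPOP' is not matched as 'DPO'.
METHODS = ("rDPO", "cDPO", "MDPO", "BDPO", "VDPO", "DPOP", "IRPO",
           "VIPO", "IPO", "SPPO", "DPO")

def parse_run_name(name: str) -> dict:
    """Best-effort parse of names like 'Qwen_0.5-DPO_5e-7-3ep_0alp_0lam'."""
    fields = {}
    for method in METHODS:
        if f"-{method}_" in name or f"-{method}-" in name:
            fields["method"] = method
            break
    # NOTE: on names that also embed the SFT learning rate (e.g. 'SFT-2e-5-2ep-...'),
    # this picks up the first rate it sees, not necessarily the preference-tuning one.
    if (m := re.search(r"(\d+(?:\.\d+)?e-\d+)", name)):
        fields["lr"] = float(m.group(1))
    if (m := re.search(r"(\d+)ep", name)):
        fields["epochs"] = int(m.group(1))
    if (m := re.search(r"(\d+(?:\.\d+)?)alp", name)):
        fields["alpha"] = float(m.group(1))
    if (m := re.search(r"(\d+(?:\.\d+)?)lam", name)):
        fields["lam"] = float(m.group(1))
    return fields

print(parse_run_name("Qwen_0.5-DPO_5e-7-3ep_0alp_0lam"))
```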

Qwen_0.5-DPOP_3e-6-3ep_0alp_5lam

Base model: JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep. Datasets: trl-lib/ultrafeedback_binarized. Library name: transformers.

2
0

Qwen_0.5-DPO_3e-6-2ep_0alp_0lam

Library name: transformers.

2
0

Qwen_0.5-DPOP_3e-7-1ep_0alp_5lam

Library name: transformers.

2
0

Qwen_math-DPO_5e-7-1ep_0alp_0lam

This model is a fine-tuned version of Qwen/Qwen2.5-0.5B-Instruct on the openbmb/UltraInteract_pair dataset. It has been trained using TRL. This model was trained with DPO, a method introduced in Direct Preference Optimization: Your Language Model is Secretly a Reward Model. - TRL: 0.13.0.dev0 - Transformers: 4.47.0.dev0 - Pytorch: 2.5.1 - Datasets: 3.1.0 - Tokenizers: 0.20.3

2
0

Qwen_math-IRPO_5e-7-1ep_1alp_0lam

2
0

Qwen_0.5-IPO_5e-7-1ep_0alp_0lam

Base model: JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep. Datasets: trl-lib/ultrafeedback_binarized. Library name: transformers.

2
0

Qwen_0.5-VIPO_1e-6-1ep_10vpo_const

2
0

Qwen_1.5B-math-VDPO_5e-7_1.0vpo_constant-20ep

Model Card for Qwen_1.5B-math-VDPO_5e-7_1.0vpo_constant-20ep. This model is a fine-tuned version of Qwen/Qwen2.5-Math-1.5B on the argilla/distilabel-math-preference-dpo dataset. It has been trained using TRL. This model was trained with DPO, a method introduced in Direct Preference Optimization: Your Language Model is Secretly a Reward Model. - TRL: 0.15.2 - Transformers: 4.50.0 - Pytorch: 2.6.0 - Datasets: 3.4.1 - Tokenizers: 0.21.1

2
0

pythia-2.8b-DPO_5e-7_1.0vpo_constant-1ep

2
0

Qwen_0.5-rDPO_5e-7_1.0vpo_constant-1ep_0.3flip

2
0

Qwen_0.5-rDPO_3e-6_1.0vpo_constant-1ep_0.3flip

2
0

Qwen_0.5-VIPO_3e-6_1.0vpo_constant-1ep_0.3flip

Model Card for Qwen_0.5-VIPO_3e-6_1.0vpo_constant-1ep_0.3flip. This model is a fine-tuned version of JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep on the trl-lib/ultrafeedback_binarized dataset. It has been trained using TRL. This model was trained with DPO, a method introduced in Direct Preference Optimization: Your Language Model is Secretly a Reward Model. - TRL: 0.15.2 - Transformers: 4.50.0 - Pytorch: 2.6.0 - Datasets: 3.4.1 - Tokenizers: 0.21.1

2
0

llama-BDPO_1e-6-3ep_0alp_0.5bdpo_lam_0dpop_lam

llama
2
0

Qwen_0.5-DPOP_1e-6-3ep_0alp_0.5bdpo_lam_500dpop_lam

Model Card for Qwen_0.5-DPOP_1e-6-3ep_0alp_0.5bdpo_lam_500dpop_lam. This model is a fine-tuned version of JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep on the trl-lib/ultrafeedback_binarized dataset. It has been trained using TRL. This model was trained with DPO, a method introduced in Direct Preference Optimization: Your Language Model is Secretly a Reward Model. - TRL: 0.19.0.dev0 - Transformers: 4.52.4 - Pytorch: 2.7.1 - Datasets: 3.6.0 - Tokenizers: 0.21.1

2
0

gemma-IRPO_1e-6-1ep_1alp_0.5bdpo_lam_0dpop_lam

2
0

gemma-BDPO_1e-6-3ep_0alp_0.5bdpo_lam_0dpop_lam

2
0

llama-DPO_1e-6-2ep_0alp_0.5bdpo_lam_0dpop_lam

Model Card for llama-DPO_1e-6-2ep_0alp_0.5bdpo_lam_0dpop_lam. This model is a fine-tuned version of meta-llama/Llama-3.2-1B-Instruct on the trl-lib/ultrafeedback_binarized dataset. It has been trained using TRL. This model was trained with DPO, a method introduced in Direct Preference Optimization: Your Language Model is Secretly a Reward Model. - TRL: 0.20.0.dev0 - Transformers: 4.53.0 - Pytorch: 2.7.1 - Datasets: 3.6.0 - Tokenizers: 0.21.2

llama
2
0

llama-IRPO_2e-7-1ep_1alp_0.5bdpo_lam_0dpop_lam

Model Card for llama-IRPO_2e-7-1ep_1alp_0.5bdpo_lam_0dpop_lam. This model is a fine-tuned version of meta-llama/Llama-3.2-1B-Instruct on the trl-lib/ultrafeedback_binarized dataset. It has been trained using TRL. This model was trained with DPO, a method introduced in Direct Preference Optimization: Your Language Model is Secretly a Reward Model. - TRL: 0.20.0.dev0 - Transformers: 4.53.0 - Pytorch: 2.7.1 - Datasets: 3.6.0 - Tokenizers: 0.21.2

llama
2
0

llama-DPOP_2e-7-1ep_0alp_0.5bdpo_lam_5dpop_lam

Model Card for llama-DPOP_2e-7-1ep_0alp_0.5bdpo_lam_5dpop_lam. This model is a fine-tuned version of meta-llama/Llama-3.2-1B-Instruct on the trl-lib/ultrafeedback_binarized dataset. It has been trained using TRL. This model was trained with DPO, a method introduced in Direct Preference Optimization: Your Language Model is Secretly a Reward Model. - TRL: 0.20.0.dev0 - Transformers: 4.53.0 - Pytorch: 2.7.1 - Datasets: 3.6.0 - Tokenizers: 0.21.2

llama
2
0

Qwen_0.5-rDPO_5e-7_1.0vpo_constant_0.1label_smoothing

Model Card for Qwen_0.5-rDPO_5e-7_1.0vpo_constant_0.1label_smoothing. This model is a fine-tuned version of Qwen/Qwen2.5-0.5B-Instruct on the JayHyeon/shp-dpo-converted dataset. It has been trained using TRL. This model was trained with DPO, a method introduced in Direct Preference Optimization: Your Language Model is Secretly a Reward Model. - TRL: 0.19.0.dev0 - Transformers: 4.52.4 - Pytorch: 2.7.1 - Datasets: 3.6.0 - Tokenizers: 0.21.1

2
0

Qwen_0.5-IPO_5e-7_1.0vpo_constant

2
0

Qwen_VIPO_SHP

2
0

Qwen_0.5-IPO_5e-7_seed42

This model is a fine-tuned version of Qwen/Qwen2.5-0.5B-Instruct on the JayHyeon/shp-dpo-converted dataset. It has been trained using TRL. This model was trained with DPO, a method introduced in Direct Preference Optimization: Your Language Model is Secretly a Reward Model. - TRL: 0.22.0.dev0 - Transformers: 4.55.0 - Pytorch: 2.8.0 - Datasets: 4.0.0 - Tokenizers: 0.21.4

2
0

Qwen_0.5-DPO_5e-7_1.0vpo_constant_ls0.0_seed42

Model Card for Qwen_0.5-DPO_5e-7_1.0vpo_constant_ls0.0_seed42. This model is a fine-tuned version of Qwen/Qwen2.5-0.5B-Instruct on the JayHyeon/shp-dpo-converted dataset. It has been trained using TRL. This model was trained with DPO, a method introduced in Direct Preference Optimization: Your Language Model is Secretly a Reward Model. - TRL: 0.22.0.dev0 - Transformers: 4.55.0 - Pytorch: 2.8.0 - Datasets: 4.0.0 - Tokenizers: 0.21.4

2
0

Qwen_0.5-VDPO_5e-7_1.0vpo_constant_ls0.0_seed42

Model Card for Qwen_0.5-VDPO_5e-7_1.0vpo_constant_ls0.0_seed42. This model is a fine-tuned version of Qwen/Qwen2.5-0.5B-Instruct on the JayHyeon/shp-dpo-converted dataset. It has been trained using TRL. This model was trained with DPO, a method introduced in Direct Preference Optimization: Your Language Model is Secretly a Reward Model. - TRL: 0.22.0.dev0 - Transformers: 4.55.0 - Pytorch: 2.8.0 - Datasets: 4.0.0 - Tokenizers: 0.21.4

2
0

Qwen_0.5-IPO_5e-7_1.0vpo_constant_ls0.0_seed42

Model Card for Qwen_0.5-IPO_5e-7_1.0vpo_constant_ls0.0_seed42. This model is a fine-tuned version of Qwen/Qwen2.5-0.5B-Instruct on the JayHyeon/shp-dpo-converted dataset. It has been trained using TRL. This model was trained with DPO, a method introduced in Direct Preference Optimization: Your Language Model is Secretly a Reward Model. - TRL: 0.22.0.dev0 - Transformers: 4.55.0 - Pytorch: 2.8.0 - Datasets: 4.0.0 - Tokenizers: 0.21.4

2
0

Qwen_0.5-VIPO_5e-7_1.0vpo_constant_ls0.0_seed42

Model Card for Qwen_0.5-VIPO_5e-7_1.0vpo_constant_ls0.0_seed42. This model is a fine-tuned version of Qwen/Qwen2.5-0.5B-Instruct on the JayHyeon/shp-dpo-converted dataset. It has been trained using TRL. This model was trained with DPO, a method introduced in Direct Preference Optimization: Your Language Model is Secretly a Reward Model. - TRL: 0.22.0.dev0 - Transformers: 4.55.0 - Pytorch: 2.8.0 - Datasets: 4.0.0 - Tokenizers: 0.21.4

2
0

Qwen_0.5-rDPO_5e-7_1.0vpo_constant_ls0.1_seed42

Model Card for Qwen_0.5-rDPO_5e-7_1.0vpo_constant_ls0.1_seed42. This model is a fine-tuned version of Qwen/Qwen2.5-0.5B-Instruct on the JayHyeon/shp-dpo-converted dataset. It has been trained using TRL. This model was trained with DPO, a method introduced in Direct Preference Optimization: Your Language Model is Secretly a Reward Model. - TRL: 0.22.0.dev0 - Transformers: 4.55.0 - Pytorch: 2.8.0 - Datasets: 4.0.0 - Tokenizers: 0.21.4

2
0

Qwen_0.5-rDPO_5e-7_1.0vpo_constant_ls0.3_seed42

Model Card for Qwen_0.5-rDPO_5e-7_1.0vpo_constant_ls0.3_seed42. This model is a fine-tuned version of Qwen/Qwen2.5-0.5B-Instruct. It has been trained using TRL. This model was trained with DPO, a method introduced in Direct Preference Optimization: Your Language Model is Secretly a Reward Model. - TRL: 0.22.0.dev0 - Transformers: 4.55.0 - Pytorch: 2.8.0 - Datasets: 4.0.0 - Tokenizers: 0.21.4

2
0

Qwen_0.5-cDPO_5e-7_1.0vpo_constant_ls0.1_seed42

Model Card for Qwen_0.5-cDPO_5e-7_1.0vpo_constant_ls0.1_seed42. This model is a fine-tuned version of Qwen/Qwen2.5-0.5B-Instruct on the JayHyeon/shp-dpo-converted dataset. It has been trained using TRL. This model was trained with DPO, a method introduced in Direct Preference Optimization: Your Language Model is Secretly a Reward Model. - TRL: 0.22.0.dev0 - Transformers: 4.55.0 - Pytorch: 2.8.0 - Datasets: 4.0.0 - Tokenizers: 0.21.4

2
0

Qwen_0.5-cDPO_5e-7_1.0vpo_constant_ls0.3_seed42

Model Card for Qwen_0.5-cDPO_5e-7_1.0vpo_constant_ls0.3_seed42. This model is a fine-tuned version of Qwen/Qwen2.5-0.5B-Instruct on the JayHyeon/shp-dpo-converted dataset. It has been trained using TRL. This model was trained with DPO, a method introduced in Direct Preference Optimization: Your Language Model is Secretly a Reward Model. - TRL: 0.22.0.dev0 - Transformers: 4.55.0 - Pytorch: 2.8.0 - Datasets: 4.0.0 - Tokenizers: 0.21.4

2
0

Qwen_0.5-VDPO_5e-7_0.3vpo_constant_ls0.0_seed42

2
0

Qwen_0.5-VDPO_5e-7_3.0vpo_constant_ls0.0_seed42

2
0

Qwen_0.5-VDPO_5e-7_5vpo_constant_ls0.0_seed42

2
0

Qwen2.5-0.5B-Instruct-SFT

Qwen 2.5 0.5B Instruct is designed for instruction-based tasks using the Hugging Face Ultrafeedback binarized dataset with the Transformers library.

license:mit
1
1

Qwen2-0.5B-Reward_VPO_1e-4

1
0

Qwen2-0.5B-Reward_1e-4-test

1
0

Qwen2-0.5B-Reward_VPO_5e-3

1
0

Qwen-0.5B-IRPO-5epoch

A model designed for efficient natural language processing tasks, utilizing the transformers library under the MIT license.

license:mit
1
0

Qwen-0.5B-DPO-1epoch

A model designed for efficient natural language processing tasks, utilizing the transformers library under the MIT license.

license:mit
1
0

Qwen-0.5B-IRPO-1epoch

A model designed for various natural language processing tasks, utilizing the transformers library and licensed under MIT.

license:mit
1
0

Qwen2.5-0.5B-Instruct-SFT-MDPO-1epoch_v1

This model is designed for instruction-based tasks using the Qwen architecture. It is built with the Transformers library and is licensed under MIT.

license:mit
1
0

Qwen2.5-0.5B-Instruct-SFT-DPO-1epoch_v1

This model is designed for instruction-based tasks using the Qwen architecture. It is built on the transformers library and is licensed under MIT.

license:mit
1
0

Qwen2.5-0.5B-Instruct-SFT-IRPO-1epoch_v1

This model is designed for instruction-based tasks using the Qwen architecture. It is built on the transformers library and is licensed under MIT.

license:mit
1
0

Qwen2.5-0.5B-SFT-DPO-1epoch_v1

A model that utilizes the transformers library and is licensed under MIT.

license:mit
1
0

Qwen2.5-0.5B-SFT-2e-5

Qwen 2.5 0.5B is designed for fine-tuning with datasets from HuggingFaceH4/ultrafeedback_binarized using the transformers library.

license:mit
1
0

Qwen2.5-0.5B-SFT-5e-5

Core model for fine-tuning with datasets from HuggingFaceH4/ultrafeedback_binarized using the transformers library.

license:mit
1
0

Qwen2.5-0.5B-SFT-7e-5

Core model for fine-tuning with datasets from HuggingFaceH4/ultrafeedback_binarized using the transformers library.

license:mit
1
0

Qwen2.5-0.5B-SFT_2ep

1
0

Qwen2.5-0.5B-SFT-1e-5-3ep

Qwen 2.5 0.5B is a model trained on the HuggingFaceH4 ultrafeedback binarized dataset using the transformers library.

license:mit
1
0

Qwen2.5-0.5B-SFT-5e-5-3ep

Core model for fine-tuning with datasets from HuggingFaceH4/ultrafeedback_binarized using the transformers library.

license:mit
1
0

Qwen2.5-0.5B-SFT-1e-4-3ep

Qwen 2.5 0.5B is a model trained on the HuggingFaceH4 ultrafeedback binarized dataset using the transformers library.

license:mit
1
0

Qwen2.5-0.5B-SFT-1e-5-2ep

Qwen 2.5 0.5B is designed for fine-tuning with datasets from HuggingFaceH4's ultrafeedback_binarized. It utilizes the transformers library.

1
0

Qwen2.5-0.5B-SFT-7e-5-2ep

Core model for fine-tuning with datasets from HuggingFaceH4/ultrafeedback_binarized using the transformers library.

1
0

Qwen2.5-0.5B-SFT-1e-4-2ep

Qwen 2.5 0.5B is designed for fine-tuning with datasets from HuggingFaceH4/ultrafeedback_binarized using the transformers library.

1
0

Qwen2.5-0.5B-SFT-2e-4-2ep

Qwen 2.5 0.5B is designed for fine-tuning tasks using the Hugging Face Transformers library with the ultrafeedback_binarized dataset.

1
0

Qwen2.5-0.5B-SFT-5e-5-5ep

Core model for Qwen 2.5 with 0.5B parameters, trained on the HuggingFaceH4 ultrafeedback binarized dataset using the transformers library.

1
0

Qwen2.5-0.5B-SFT-1e-4-5ep

Qwen 2.5 0.5B is designed for fine-tuning with datasets from HuggingFaceH4/ultrafeedback_binarized using the transformers library.

1
0

Qwen2.5-0.5B-SFT-2e-4-5ep

Qwen 2.5 0.5B is designed for fine-tuning with datasets from HuggingFaceH4/ultrafeedback_binarized using the transformers library.

1
0

Qwen2.5-0.5B-SFT-2e-5-5ep-MDPO_7e-7_3ep_0alp_0lam

Core purpose is to provide a fine-tuned model based on JayHyeon/Qwen2.5-0.5B-SFT-2e-5-5ep. It utilizes datasets from trl-lib/ultrafeedback_binarized and is built using the transformers library.

1
0

Qwen2.5-0.5B-SFT-2e-5-5ep-MDPO_5e-7_3ep_0alp_0lam_1ep

Library name: transformers.

1
0

Qwen2.5-0.5B-SFT-2e-5-5ep-MDPO_5e-7_3ep_0alp_0lam_2ep

Library name: transformers.

1
0

Qwen2.5-0.5B-SFT-2e-5-5ep-MDPO_7e-7_3ep_0alp_0lam_2ep

Library name: transformers.

1
0

Qwen2.5-0.5B-SFT-2e-5-2ep-MDPO_5e-7_1ep_0alp_0lam

A model designed for various natural language processing tasks using the transformers library.

1
0

Qwen2.5-0.5B-SFT-2e-5-2ep-MDPO_7e-7-3ep_0alp_0lam

The model is based on JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep and utilizes datasets from trl-lib/ultrafeedback_binarized with the library transformers.

1
0

Qwen2.5-0.5B-SFT-2e-5-2ep-DPO_1e-6-3ep_0alp_0lam

The model is based on JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep and utilizes datasets from trl-lib/ultrafeedback_binarized, implemented using the transformers library.

1
0

Qwen2.5-0.5B-SFT-2e-5-2ep-MDPO_7e-7_1ep_0alp_0lam

Library name: transformers.

1
0

Qwen2.5-0.5B-SFT-2e-5-2ep-DPO_5e-6-3ep_0alp_0lam

The model is based on JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep and utilizes datasets from trl-lib/ultrafeedback_binarized, implemented using the transformers library.

1
0

Qwen2.5-0.5B-SFT-2e-5-2ep-MDPO_3e-6-3ep_0alp_0lam

The model is based on JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep and utilizes datasets from trl-lib/ultrafeedback_binarized, implemented using the transformers library.

1
0

Qwen2.5-0.5B-SFT-2e-5-2ep-MDPO_1e-6_2ep_0alp_0lam

A model designed for fine-tuning tasks using the transformers library.

1
0

Qwen2.5-0.5B-SFT-2e-5-2ep-MDPO_2e-6_1ep_0alp_0lam

A model designed for various tasks using the transformers library.

1
0

Qwen2.5-0.5B-SFT-2e-5-2ep-DPO_2e-6-3ep_0alp_0lam

The model is based on JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep and utilizes datasets from trl-lib/ultrafeedback_binarized, designed for use with the transformers library.

1
0

Qwen2.5-0.5B-SFT-2e-5-2ep-DPO_3e-6-3ep_0alp_0lam

The model is based on JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep and utilizes datasets from trl-lib/ultrafeedback_binarized, designed for use with the transformers library.

1
0

Qwen2.5-0.5B-SFT-2e-5-2ep-MDPO_3e-6-2ep_0alp_0lam

A model designed for various natural language processing tasks, utilizing the transformers library.

1
0

Qwen2.5-0.5B-SFT-2e-5-2ep-MDPO_3e-6-1ep_0alp_0lam

A model designed for various tasks using the transformers library.

1
0

Qwen2.5-0.5B-SFT-2e-5-2ep-DPO_1e-6-2ep_0alp_0lam

A model designed for various natural language processing tasks, utilizing the transformers library.

1
0

Qwen2.5-0.5B-SFT-2e-5-2ep-MDPO_5e-6-2ep_0alp_0lam

A model designed for efficient fine-tuning and deployment, utilizing advanced techniques for optimal performance.

1
0

Qwen2.5-0.5B-SFT-2e-5-2ep-DPO_2e-6-2ep_0alp_0lam

A model designed for various natural language processing tasks using the transformers library.

1
0

Qwen2.5-0.5B-SFT-2e-5-2ep-DPO_2e-6-1ep_0alp_0lam

A model designed for various natural language processing tasks, utilizing the transformers library.

1
0

Qwen2.5-0.5B-SFT-2e-5-2ep-DPO_3e-6-1ep_0alp_0lam

A model designed for various natural language processing tasks using the transformers library.

1
0

Qwen2.5-0.5B-SFT-2e-5-2ep-DPO_5e-6-1ep_0alp_0lam

A model designed for various tasks using the transformers library.

1
0

Qwen2.5-0.5B-SFT-2e-5-2ep-IRPO_5e-7-1ep_1alp_0lam

A model designed for various natural language processing tasks, utilizing the transformers library.

1
0

Qwen2.5-0.5B-SFT-2e-5-2ep-IRPO_5e-7-2ep_1alp_0lam

A model designed for various tasks using the transformers library.

1
0

Qwen2.5-0.5B-SFT-2e-5-2ep-DPOP_5e-7-1ep_0alp_5lam

A model designed for fine-tuning and optimization tasks using the transformers library.

1
0

Qwen2.5-0.5B-SFT-2e-5-2ep-DPOP_5e-7-2ep_0alp_5lam

A model designed for various tasks using the transformers library.

1
0

Qwen2.5-0.5B-SFT-2e-5-2ep-IRPO_5e-6-3ep_1alp_0lam

The model is based on JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep and utilizes datasets from trl-lib/ultrafeedback_binarized, implemented using the transformers library.

1
0

Qwen2.5-0.5B-SFT-2e-5-2ep-IRPO_5e-6-2ep_1alp_0lam

A model designed for various tasks using the transformers library.

1
0

Qwen2.5-0.5B-SFT-2e-5-2ep-IRPO_5e-6-1ep_1alp_0lam

A model designed for various natural language processing tasks using the transformers library.

1
0

Qwen2.5-0.5B-SFT-2e-5-2ep-DPOP_5e-6-2ep_0alp_5lam

A model designed for fine-tuning and optimization tasks using the transformers library.

1
0

Qwen2.5-0.5B-SFT-2e-5-2ep-DPO_1e-7-3ep_0alp_0lam

The model is based on JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep and utilizes datasets from trl-lib/ultrafeedback_binarized, designed for use with the transformers library.

1
0

Qwen2.5-0.5B-SFT-2e-5-2ep-IRPO_1e-7-3ep_1alp_0lam

Core model for fine-tuning with datasets from trl-lib/ultrafeedback_binarized using the transformers library.

1
0

Qwen2.5-0.5B-SFT-2e-5-2ep-MDPO_0.5_1e-7-1ep_0alp_0lam

A model designed for fine-tuning with specific parameters for enhanced performance in various tasks.

1
0

Qwen2.5-0.5B-SFT-2e-5-2ep-IRPO_1e-7-2ep_1alp_0lam

A model designed for fine-tuning tasks using the transformers library.

1
0

Qwen2.5-0.5B-SFT-2e-5-2ep-DPO_3e-7-3ep_0alp_0lam

The model is based on JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep and utilizes datasets from trl-lib/ultrafeedback_binarized, implemented using the transformers library.

1
0

Qwen_0.5-DPO_5e-7-2ep_0alp_0lam

Library name: transformers.

1
0

Qwen_0.5-DPOP_3e-6-1ep_0alp_5lam

library_name: transformers

1
0

Qwen_0.5-DPOP_3e-6-2ep_0alp_5lam

Library name: transformers.

1
0

Qwen_0.5-DPO_1e-6-3ep_0alp_0lam

Base model: JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep. Datasets: trl-lib/ultrafeedback_binarized. Library name: transformers.

1
0

Qwen_0.5-DPOP_1e-6-3ep_0alp_5lam

Base model: JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep. Datasets: trl-lib/ultrafeedback_binarized. Library name: transformers.

1
0

Qwen_0.5-DPOP_1e-7-3ep_0alp_5lam

Base model: JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep. Datasets: trl-lib/ultrafeedback_binarized. Library name: transformers.

1
0

Qwen_0.5-DPO_3e-7-3ep_0alp_0lam

Base model: JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep. Datasets: trl-lib/ultrafeedback_binarized. Library name: transformers.

1
0

Qwen_0.5-DPOP_3e-7-3ep_0alp_5lam

Base model: JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep. Datasets: trl-lib/ultrafeedback_binarized. Library name: transformers.

1
0

Qwen_0.5-DPO_3e-7-1ep_0alp_0lam

Library name: transformers.

1
0

Qwen_0.5-DPO_3e-7-2ep_0alp_0lam

Library name: transformers.

1
0

Qwen_math-DPOP_5e-7-1ep_0alp_5lam

This model is a fine-tuned version of Qwen/Qwen2.5-0.5B-Instruct on the openbmb/UltraInteract_pair dataset. It has been trained using TRL. This model was trained with DPO, a method introduced in Direct Preference Optimization: Your Language Model is Secretly a Reward Model. - TRL: 0.13.0.dev0 - Transformers: 4.47.0.dev0 - Pytorch: 2.5.1 - Datasets: 3.1.0 - Tokenizers: 0.20.3

1
0

Qwen_0.5-VIPO_1e-6-1ep_30vpo_const

1
0

Qwen_0.5-VDPO_1e-6-1ep_1vpo_const

1
0

Qwen_0.5-cDPO_1e-6-1ep_0vpo_const_0.3

1
0

Qwen_0.5-VDPO_5e-6-1ep_10vpo_const

1
0

Qwen_0.5-VDPO_3e-6-1ep_0vpo_const

1
0

Qwen_0.5-VDPO_3e-6-1ep_0.3vpo_const_exp

1
0

Math-Qwen_0.5-BDPO_5e-7-1ep_0alp_0lam

1
0

Qwen_0.5-ultrainteract_DPOP_5e-7-1ep_0.5bdpo_lambda

1
0

Qwen_1.5B-math-DPO_1e-5_1.0vpo_constant-20ep

1
0

Qwen-0.5B-DPO-5epoch

A model designed for efficient natural language processing tasks, utilizing the Transformers library under the MIT license.

license:mit
0
1