JayHyeon
pythia-2.8b-rDPO_5e-7_1.0vpo_constant-1ep_0.3label_smoothing
Model Card for pythia-2.8b-rDPO_5e-7_1.0vpo_constant-1ep_0.3label_smoothing. This model is a fine-tuned version of EleutherAI/pythia-2.8b on the trl-lib/ultrafeedback_binarized dataset. It has been trained using TRL. This model was trained with DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model". Framework versions: TRL 0.19.0.dev0, Transformers 4.52.4, PyTorch 2.7.1, Datasets 3.6.0, Tokenizers 0.21.1.
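The DPO-family cards in this list all follow the same TRL recipe, varying only the base model, dataset, loss variant, and the hyperparameters encoded in the repo name. As a rough sketch of what such a run could look like (not the author's actual script; the values are read off the repo name above, and loss_type="robust" is TRL's rDPO-style loss):

```python
# Hypothetical reproduction sketch using TRL's DPOTrainer; the script
# details are assumptions inferred from the card, not the author's code.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-2.8b")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-2.8b")
train_dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

args = DPOConfig(
    output_dir="pythia-2.8b-rDPO_5e-7_1.0vpo_constant-1ep_0.3label_smoothing",
    learning_rate=5e-7,   # "5e-7" in the repo name
    num_train_epochs=1,   # "1ep"
    loss_type="robust",   # rDPO-style robust loss
    label_smoothing=0.3,  # "0.3label_smoothing"
)
trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    processing_class=tokenizer,
)
trainer.train()
```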
pythia-2.8b-cDPO_5e-7_1.0vpo_constant-1ep_0.3label_smoothing
Model Card for pythia-2.8b-cDPO_5e-7_1.0vpo_constant-1ep_0.3label_smoothing. This model is a fine-tuned version of EleutherAI/pythia-2.8b on the trl-lib/ultrafeedback_binarized dataset. It has been trained using TRL. This model was trained with DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model". Framework versions: TRL 0.19.0.dev0, Transformers 4.52.4, PyTorch 2.7.1, Datasets 3.6.0, Tokenizers 0.21.1.
pythia-2.8b-cDPO_5e-7_1.0vpo_constant-1ep_0.1label_smoothing
Model Card for pythia-2.8b-cDPO_5e-7_1.0vpo_constant-1ep_0.1label_smoothing. This model is a fine-tuned version of EleutherAI/pythia-2.8b on the trl-lib/ultrafeedback_binarized dataset. It has been trained using TRL. This model was trained with DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model". Framework versions: TRL 0.19.0.dev0, Transformers 4.52.4, PyTorch 2.7.1, Datasets 3.6.0, Tokenizers 0.21.1.
pythia-2.8b-2e-5-1ep
This model is a fine-tuned version of EleutherAI/pythia-2.8b on the HuggingFaceH4/ultrafeedback_binarized dataset. It has been trained using TRL. Framework versions: TRL 0.15.2, Transformers 4.50.0, PyTorch 2.6.0, Datasets 3.4.1, Tokenizers 0.21.1.
pythia-2.8b-rDPO_5e-7_1.0vpo_constant-1ep_0.1label_smoothing
Model Card for pythia-2.8b-rDPO_5e-7_1.0vpo_constant-1ep_0.1label_smoothing. This model is a fine-tuned version of EleutherAI/pythia-2.8b on the trl-lib/ultrafeedback_binarized dataset. It has been trained using TRL. This model was trained with DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model". Framework versions: TRL 0.15.2, Transformers 4.50.0, PyTorch 2.6.0, Datasets 3.4.1, Tokenizers 0.21.1.
Qwen_0.5-VDPO_5e-7_3.0vpo_constant-1ep
Qwen_0.5-rDPO_5e-7_0.1lsmooth-1.0vpo_constant
Model Card for Qwen_0.5-rDPO_5e-7_0.1lsmooth-1.0vpo_constant. This model is a fine-tuned version of Qwen/Qwen2.5-0.5B-Instruct on the JayHyeon/shp-dpo-converted dataset. It has been trained using TRL. This model was trained with DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model". Framework versions: TRL 0.15.2, Transformers 4.55.0, PyTorch 2.7.1, Datasets 3.4.1, Tokenizers 0.21.1.
Qwen_1.5B-math-cDPO_5e-7_0.1lsmooth-1.0vpo_constant-1ep
Qwen_1.5B-math-rDPO_5e-7_0.1lsmooth-1.0vpo_constant-1ep
llama-1e-6-1ep
This model is a fine-tuned version of meta-llama/Llama-2-7b-hf on the HuggingFaceH4/ultrafeedback_binarized dataset. It has been trained using TRL. Framework versions: TRL 0.22.0.dev0, Transformers 4.55.0, PyTorch 2.7.1, Datasets 4.0.0, Tokenizers 0.21.4.
Qwen_0.5-VDPO_5e-7_3.0vpo_constant_0.3label_smoothing
Qwen_0.5-rDPO_5e-7_1.0vpo_constant_0.3label_smoothing
pythia-2.8b-VIPO_5e-7_1.0vpo_constant-1ep
Model Card for pythia-2.8b-VIPO_5e-7_1.0vpo_constant-1ep. This model is a fine-tuned version of EleutherAI/pythia-2.8b on the trl-lib/ultrafeedback_binarized dataset. It has been trained using TRL. This model was trained with DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model". Framework versions: TRL 0.19.0.dev0, Transformers 4.52.4, PyTorch 2.7.1, Datasets 3.6.0, Tokenizers 0.21.1.
Qwen_1.5B-math-DPO_5e-6_1.0vpo_constant-10ep
Model Card for Qwen_1.5B-math-DPO_5e-6_1.0vpo_constant-10ep. This model is a fine-tuned version of Qwen/Qwen2.5-Math-1.5B on the argilla/distilabel-math-preference-dpo dataset. It has been trained using TRL. This model was trained with DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model". Framework versions: TRL 0.15.2, Transformers 4.50.0, PyTorch 2.6.0, Datasets 3.4.1, Tokenizers 0.21.1.
Qwen_0.5-VDPO_3e-6_10.0vpo_constant-1ep_0.3flip
Qwen_0.5-IRPO_1e-6-3ep_0.01alp_0.5bdpo_lam_0dpop_lam
Model Card for Qwen_0.5-IRPO_1e-6-3ep_0.01alp_0.5bdpo_lam_0dpop_lam. This model is a fine-tuned version of JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep on the trl-lib/ultrafeedback_binarized dataset. It has been trained using TRL. This model was trained with DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model". Framework versions: TRL 0.20.0.dev0, Transformers 4.52.4, PyTorch 2.7.1, Datasets 3.6.0, Tokenizers 0.21.1.
llama-DPOP_5e-7-1ep_0alp_0.5bdpo_lam_5dpop_lam
Qwen_0.5-DPOP_1e-6-3ep_0alp_0.5bdpo_lam_50dpop_lam
gemma-DPO_1e-6-3ep_0alp_0.5bdpo_lam_0dpop_lam
gemma-DPO_1e-7-1ep_0alp_0.5bdpo_lam_0dpop_lam
Qwen_0.5-VDPO_5e-7_0.3vpo_constant-1ep
This model is a fine-tuned version of Qwen/Qwen2.5-0.5B-Instruct on the JayHyeon/shp-dpo-converted dataset. It has been trained using TRL. This model was trained with DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model". Framework versions: TRL 0.15.2, Transformers 4.50.0, PyTorch 2.6.0, Datasets 3.4.1, Tokenizers 0.21.1.
Qwen2.5-0.5B_ultrainteract_sft_2e-5_1ep
Qwen_0.5-BDPO_1e-6-3ep_0alp_0.999bdpo_lam_0dpop_lam
Qwen_0.5-IRPO_1e-6-3ep_0.005alp_0.5bdpo_lam_0dpop_lam
Model Card for Qwen_0.5-IRPO_1e-6-3ep_0.005alp_0.5bdpo_lam_0dpop_lam. This model is a fine-tuned version of JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep on the trl-lib/ultrafeedback_binarized dataset. It has been trained using TRL. This model was trained with DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model". Framework versions: TRL 0.20.0.dev0, Transformers 4.52.4, PyTorch 2.7.1, Datasets 3.6.0, Tokenizers 0.21.1.
llama-BDPO_1e-7-1ep_0alp_0.5bdpo_lam_0dpop_lam
Qwen_1.5B-math-cDPO_5e-7_0.3lsmooth-1.0vpo_constant-1ep
Qwen_1.5B-math-rDPO_5e-7_0.3lsmooth-1.0vpo_constant-1ep
Qwen_0.5-cDPO_5e-7_1.0vpo_constant_0.3label_smoothing
llama-VDPO_5e-7_1.0vpo_constant
Qwen_0.5-VDPO_5e-7_1.0vpo_constant_0.1label_smoothing
Qwen_0.5-cDPO_5e-7_0.1lsmooth-1.0vpo_constant-1ep
Model Card for Qwen_0.5-cDPO_5e-7_0.1lsmooth-1.0vpo_constant-1ep. This model is a fine-tuned version of Qwen/Qwen2.5-0.5B-Instruct on the JayHyeon/shp-dpo-converted dataset. It has been trained using TRL. This model was trained with DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model". Framework versions: TRL 0.15.2, Transformers 4.50.0, PyTorch 2.6.0, Datasets 3.4.1, Tokenizers 0.21.1.
Qwen2.5-0.5B-SFT-2e-5-2ep-DPO_5e-7_3ep_0alp_0lam
A DPO fine-tune of JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep on trl-lib/ultrafeedback_binarized, built with the transformers library.
Qwen2.5-0.5B-SFT-2e-5-2ep-DPOP_5e-6-3ep_0alp_5lam
A DPOP fine-tune of JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep on trl-lib/ultrafeedback_binarized, built with the transformers library.
Qwen_0.5-IPO_5e-7-3ep_0alp_0lam
Base model: JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep. Dataset: trl-lib/ultrafeedback_binarized. Library name: transformers.
Qwen_1.5B-math-VIPO_5e-6_3.0vpo_constant-5ep
pythia-2.8b-VIPO_5e-7_1.0vpo_const-1ep
pythia-2.8b-VIPO_5e-7_3.0vpo_const-1ep
This model is a fine-tuned version of EleutherAI/pythia-2.8b on the trl-lib/ultrafeedback_binarized dataset. It has been trained using TRL. This model was trained with DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model". Framework versions: TRL 0.13.0.dev0, Transformers 4.47.0.dev0, PyTorch 2.5.1, Datasets 3.1.0, Tokenizers 0.20.3.
Qwen_0.5-IRPO_1e-6-3ep_2alp_0.5bdpo_lam_0dpop_lam
Model Card for Qwen_0.5-IRPO_1e-6-3ep_2alp_0.5bdpo_lam_0dpop_lam. This model is a fine-tuned version of JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep on the trl-lib/ultrafeedback_binarized dataset. It has been trained using TRL. This model was trained with DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model". Framework versions: TRL 0.20.0.dev0, Transformers 4.52.4, PyTorch 2.7.1, Datasets 3.6.0, Tokenizers 0.21.1.
Qwen_0.5-DPO_5e-7_1.0vpo_constant
This model is a fine-tuned version of Qwen/Qwen2.5-0.5B-Instruct on the JayHyeon/shp-dpo-converted dataset. It has been trained using TRL. This model was trained with DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model". Framework versions: TRL 0.15.2, Transformers 4.50.0, PyTorch 2.6.0, Datasets 3.4.1, Tokenizers 0.21.1.
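As a usage note, any of these checkpoints loads like a standard causal LM. A minimal sketch, assuming the card above is published under the JayHyeon namespace and that the Qwen chat template is present (the generation settings are illustrative):

```python
# Illustrative inference with a published checkpoint; the repo id is
# assumed to be the Hub path of the card above.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "JayHyeon/Qwen_0.5-DPO_5e-7_1.0vpo_constant"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

messages = [{"role": "user", "content": "Summarize DPO in one sentence."}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```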
Qwen_0.5-VDPO_5e-7_1.0vpo_constant
This model is a fine-tuned version of Qwen/Qwen2.5-0.5B-Instruct on the JayHyeon/shp-dpo-converted dataset. It has been trained using TRL. This model was trained with DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model". Framework versions: TRL 0.15.2, Transformers 4.50.0, PyTorch 2.6.0, Datasets 3.4.1, Tokenizers 0.21.1.
Qwen_0.5-cDPO_5e-7_1.0vpo_constant_0.1label_smoothing
llama-DPO_5e-7_1.0vpo_constant
Qwen2.5-0.5B-SFT-7e-5-3ep
An SFT checkpoint of Qwen/Qwen2.5-0.5B (lr 7e-5, 3 epochs) trained on HuggingFaceH4/ultrafeedback_binarized with the transformers library.
Qwen2.5-0.5B-SFT-1e-5-5ep
An SFT checkpoint of Qwen/Qwen2.5-0.5B (lr 1e-5, 5 epochs) trained on HuggingFaceH4/ultrafeedback_binarized with the transformers library.
Qwen2.5-0.5B-SFT-7e-5-5ep
An SFT checkpoint of Qwen/Qwen2.5-0.5B (lr 7e-5, 5 epochs) trained on HuggingFaceH4/ultrafeedback_binarized with the transformers library.
Qwen2.5-0.5B-SFT-2e-5-2ep-DPO_7e-7_2ep_0alp_0lam
A DPO fine-tune of JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep on trl-lib/ultrafeedback_binarized, built with the transformers library.
Qwen2.5-0.5B-SFT-2e-5-2ep-MDPO_1e-6_1ep_0alp_0lam
An MDPO fine-tune of JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep on trl-lib/ultrafeedback_binarized, built with the transformers library.
Qwen2.5-0.5B-SFT-2e-5-2ep-MDPO_5e-6-3ep_0alp_0lam
The model is based on JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep and utilizes datasets from trl-lib/ultrafeedback_binarized with the transformers library.
Qwen2.5-0.5B-SFT-2e-5-2ep-IRPO_5e-7-3ep_1alp_0lam
The model is based on JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep and utilizes datasets from trl-lib/ultrafeedback_binarized, implemented using the transformers library.
Qwen2.5-0.5B-SFT-2e-5-2ep-IRPO_1e-7-1ep_1alp_0lam
An IRPO fine-tune of JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep on trl-lib/ultrafeedback_binarized, built with the transformers library.
Qwen_0.5-DPOP_3e-7-2ep_0alp_5lam
Library name: transformers.
Qwen_0.5-rDPO_1e-6-1ep_0vpo_const_0.1
Qwen_1.5B-math-DPO_5e-6_1.0vpo_constant-5ep
Model Card for Qwen_1.5B-math-DPO_5e-6_1.0vpo_constant-5ep. This model is a fine-tuned version of Qwen/Qwen2.5-Math-1.5B on the argilla/distilabel-math-preference-dpo dataset. It has been trained using TRL. This model was trained with DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model". Framework versions: TRL 0.15.2, Transformers 4.50.0, PyTorch 2.6.0, Datasets 3.4.1, Tokenizers 0.21.1.
pythia-2.8b-VDPO_5e-7_1.0vpo_constant-1ep
Model Card for pythia-2.8b-VDPO_5e-7_1.0vpo_constant-1ep. This model is a fine-tuned version of EleutherAI/pythia-2.8b on the trl-lib/ultrafeedback_binarized dataset. It has been trained using TRL. This model was trained with DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model". Framework versions: TRL 0.19.0.dev0, Transformers 4.52.4, PyTorch 2.7.1, Datasets 3.6.0, Tokenizers 0.21.1.
Qwen_1.5B-math-IPO_5e-6_1.0vpo_constant-5ep
Qwen_1.5B-math-VIPO_5e-6_1.0vpo_constant-5ep
Model Card for Qwen_1.5B-math-VIPO_5e-6_1.0vpo_constant-5ep. This model is a fine-tuned version of Qwen/Qwen2.5-Math-1.5B on the argilla/distilabel-math-preference-dpo dataset. It has been trained using TRL. This model was trained with DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model". Framework versions: TRL 0.15.2, Transformers 4.50.0, PyTorch 2.6.0, Datasets 3.4.1, Tokenizers 0.21.1.
Qwen_1.5B-math-VIPO_5e-6_10.0vpo_constant-5ep
Model Card for Qwen_1.5B-math-VIPO_5e-6_10.0vpo_constant-5ep. This model is a fine-tuned version of Qwen/Qwen2.5-Math-1.5B on the argilla/distilabel-math-preference-dpo dataset. It has been trained using TRL. This model was trained with DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model". Framework versions: TRL 0.15.2, Transformers 4.50.0, PyTorch 2.6.0, Datasets 3.4.1, Tokenizers 0.21.1.
Qwen_1.5B-math-VDPO_5e-6_3.0vpo_constant-5ep
Qwen_1.5B-math-VDPO_5e-6_10.0vpo_constant-5ep
Qwen_1.5B-math-DPO_1e-5_1.0vpo_constant-5ep
Model Card for Qwen_1.5B-math-DPO_1e-5_1.0vpo_constant-5ep. This model is a fine-tuned version of Qwen/Qwen2.5-Math-1.5B on the argilla/distilabel-math-preference-dpo dataset. It has been trained using TRL. This model was trained with DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model". Framework versions: TRL 0.15.2, Transformers 4.50.0, PyTorch 2.6.0, Datasets 3.4.1, Tokenizers 0.21.1.
Qwen_1.5B-math-DPO_1e-5_1.0vpo_constant-10ep
pythia-2.8b-DPO_1e-6_1.0vpo_constant-1ep
Model Card for pythia-2.8b-DPO_1e-6_1.0vpo_constant-1ep. This model is a fine-tuned version of EleutherAI/pythia-2.8b on the trl-lib/ultrafeedback_binarized dataset. It has been trained using TRL. This model was trained with DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model". Framework versions: TRL 0.19.0.dev0, Transformers 4.52.4, PyTorch 2.7.1, Datasets 3.6.0, Tokenizers 0.21.1.
Qwen_1.5B-math-DPO_5e-5_1.0vpo_constant-10ep
Model Card for Qwen_1.5B-math-DPO_5e-5_1.0vpo_constant-10ep. This model is a fine-tuned version of Qwen/Qwen2.5-Math-1.5B on the argilla/distilabel-math-preference-dpo dataset. It has been trained using TRL. This model was trained with DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model". Framework versions: TRL 0.15.2, Transformers 4.50.0, PyTorch 2.6.0, Datasets 3.4.1, Tokenizers 0.21.1.
Qwen_1.5B-math-DPO_5e-5_1.0vpo_constant-20ep
Model Card for Qwen_1.5B-math-DPO_5e-5_1.0vpo_constant-20ep. This model is a fine-tuned version of Qwen/Qwen2.5-Math-1.5B on the argilla/distilabel-math-preference-dpo dataset. It has been trained using TRL. This model was trained with DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model". Framework versions: TRL 0.15.2, Transformers 4.50.0, PyTorch 2.6.0, Datasets 3.4.1, Tokenizers 0.21.1.
pythia-2.8b-IPO_5e-7_1.0vpo_constant-1ep
Qwen_1.5B-math-DPO_1e-4_1.0vpo_constant-10ep
pythia-2.8b-IPO_5e-7_1.0vpo_const-1ep
This model is a fine-tuned version of EleutherAI/pythia-2.8b on the trl-lib/ultrafeedback_binarized dataset. It has been trained using TRL. This model was trained with DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model". Framework versions: TRL 0.13.0.dev0, Transformers 4.47.0.dev0, PyTorch 2.5.1, Datasets 3.1.0, Tokenizers 0.20.3.
Qwen_1.5B-math-VDPO_1e-4_1.0vpo_constant-10ep
Model Card for Qwen_1.5B-math-VDPO_1e-4_1.0vpo_constant-10ep. This model is a fine-tuned version of Qwen/Qwen2.5-Math-1.5B on the argilla/distilabel-math-preference-dpo dataset. It has been trained using TRL. This model was trained with DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model". Framework versions: TRL 0.19.0.dev0, Transformers 4.52.4, PyTorch 2.7.1, Datasets 3.6.0, Tokenizers 0.21.1.
pythia-2.8b-VDPO_5e-7_3.0vpo_constant-1ep
pythia-2.8b-VDPO_5e-7_10.0vpo_constant-1ep
Model Card for pythia-2.8b-VDPO_5e-7_10.0vpo_constant-1ep. This model is a fine-tuned version of EleutherAI/pythia-2.8b on the trl-lib/ultrafeedback_binarized dataset. It has been trained using TRL. This model was trained with DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model". Framework versions: TRL 0.19.0.dev0, Transformers 4.52.4, PyTorch 2.7.1, Datasets 3.6.0, Tokenizers 0.21.1.
Qwen_0.5-VDPO_5e-7_1.0vpo_constant-1ep
This model is a fine-tuned version of JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep on the trl-lib/ultrafeedback_binarized dataset. It has been trained using TRL. This model was trained with DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model". Framework versions: TRL 0.15.2, Transformers 4.50.0, PyTorch 2.6.0, Datasets 3.4.1, Tokenizers 0.21.1.
Qwen_0.5-DPO_5e-7_1.0vpo_constant-1ep_0.3flip
Qwen_0.5-cDPO_5e-7_1.0vpo_constant-1ep_0.3flip
Qwen_0.5-DPO_3e-6_1.0vpo_constant-1ep_0.3flip
Model Card for Qwen_0.5-DPO_3e-6_1.0vpo_constant-1ep_0.3flip. This model is a fine-tuned version of JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep on the trl-lib/ultrafeedback_binarized dataset. It has been trained using TRL. This model was trained with DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model". Framework versions: TRL 0.19.0.dev0, Transformers 4.52.4, PyTorch 2.7.1, Datasets 3.6.0, Tokenizers 0.21.1.
Qwen_0.5-VDPO_3e-6_1.0vpo_constant-1ep_0.3flip
Model Card for Qwen_0.5-VDPO_3e-6_1.0vpo_constant-1ep_0.3flip. This model is a fine-tuned version of JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep on the trl-lib/ultrafeedback_binarized dataset. It has been trained using TRL. This model was trained with DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model". Framework versions: TRL 0.20.0.dev0, Transformers 4.52.4, PyTorch 2.7.1, Datasets 3.6.0, Tokenizers 0.21.1.
Qwen_0.5-VDPO_3e-6_3.0vpo_constant-1ep_0.3flip
Qwen_0.5-cDPO_3e-6_1.0vpo_constant-1ep_0.3flip
Model Card for Qwen_0.5-cDPO_3e-6_1.0vpo_constant-1ep_0.3flip. This model is a fine-tuned version of JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep on the trl-lib/ultrafeedback_binarized dataset. It has been trained using TRL. This model was trained with DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model". Framework versions: TRL 0.15.2, Transformers 4.50.0, PyTorch 2.6.0, Datasets 3.4.1, Tokenizers 0.21.1.
Qwen_0.5-IPO_3e-6_1.0vpo_constant-1ep_0.3flip
Qwen_0.5-IRPO_5e-7-3ep_0.1alp_0.5bdpo_lam_0dpop_lam
Model Card for Qwen_0.5-IRPO_5e-7-3ep_0.1alp_0.5bdpo_lam_0dpop_lam. This model is a fine-tuned version of JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep on the trl-lib/ultrafeedback_binarized dataset. It has been trained using TRL. This model was trained with DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model". Framework versions: TRL 0.13.0.dev0, Transformers 4.47.0.dev0, PyTorch 2.5.1, Datasets 3.1.0, Tokenizers 0.20.3.
Qwen_0.5-IRPO_1e-6-3ep_10alp_0.5bdpo_lam_0dpop_lam
Qwen_0.5-BDPO_5e-7-3ep_0alp_0.5bdpo_lam_0dpop_lam
Qwen_0.5-IRPO_5e-7-3ep_10alp_0.5bdpo_lam_0dpop_lam
Model Card for Qwen_0.5-IRPO_5e-7-3ep_10alp_0.5bdpo_lam_0dpop_lam. This model is a fine-tuned version of JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep on the trl-lib/ultrafeedback_binarized dataset. It has been trained using TRL. This model was trained with DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model". Framework versions: TRL 0.13.0.dev0, Transformers 4.47.0.dev0, PyTorch 2.5.1, Datasets 3.1.0, Tokenizers 0.20.3.
Qwen_0.5-IRPO_1e-6-3ep_5alp_0.5bdpo_lam_0dpop_lam
Model Card for Qwen_0.5-IRPO_1e-6-3ep_5alp_0.5bdpo_lam_0dpop_lam. This model is a fine-tuned version of JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep on the trl-lib/ultrafeedback_binarized dataset. It has been trained using TRL. This model was trained with DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model". Framework versions: TRL 0.20.0.dev0, Transformers 4.52.4, PyTorch 2.7.1, Datasets 3.6.0, Tokenizers 0.21.1.
llama-BDPO_5e-7-1ep_0alp_0.5bdpo_lam_0dpop_lam
llama-DPO_5e-7-1ep_0alp_0.5bdpo_lam_0dpop_lam
llama-IRPO_5e-7-1ep_1alp_0.5bdpo_lam_0dpop_lam
Qwen_0.5-IRPO_5e-7-3ep_0.05alp_0.5bdpo_lam_0dpop_lam
Model Card for Qwen_0.5-IRPO_5e-7-3ep_0.05alp_0.5bdpo_lam_0dpop_lam. This model is a fine-tuned version of JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep. It has been trained using TRL. This model was trained with DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model". Framework versions: TRL 0.20.0.dev0, Transformers 4.53.0, PyTorch 2.7.1, Datasets 3.6.0, Tokenizers 0.21.2.
llama-DPO_1e-6-1ep_0alp_0.5bdpo_lam_0dpop_lam
llama-IRPO_1e-6-3ep_1alp_0.5bdpo_lam_0dpop_lam
llama-DPOP_1e-6-1ep_0alp_0.5bdpo_lam_5dpop_lam
llama-BDPO_1e-6-1ep_0alp_0.5bdpo_lam_0dpop_lam
llama-DPO_1e-6-3ep_0alp_0.5bdpo_lam_0dpop_lam
llama-IRPO_1e-6-1ep_1alp_0.5bdpo_lam_0dpop_lam
llama-DPOP_1e-6-3ep_0alp_0.5bdpo_lam_5dpop_lam
Qwen_0.5-IRPO_1e-6-3ep_0.25alp_0.5bdpo_lam_0dpop_lam
Model Card for Qwen_0.5-IRPO_1e-6-3ep_0.25alp_0.5bdpo_lam_0dpop_lam. This model is a fine-tuned version of JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep on the trl-lib/ultrafeedback_binarized dataset. It has been trained using TRL. This model was trained with DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model". Framework versions: TRL 0.20.0.dev0, Transformers 4.52.4, PyTorch 2.7.1, Datasets 3.6.0, Tokenizers 0.21.1.
Qwen_0.5-BDPO_1e-6-3ep_0alp_0.99999bdpo_lam_0dpop_lam
Model Card for Qwen_0.5-BDPO_1e-6-3ep_0alp_0.99999bdpo_lam_0dpop_lam. This model is a fine-tuned version of JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep on the trl-lib/ultrafeedback_binarized dataset. It has been trained using TRL. This model was trained with DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model". Framework versions: TRL 0.13.0.dev0, Transformers 4.47.0.dev0, PyTorch 2.5.1, Datasets 3.1.0, Tokenizers 0.20.3.
Qwen_0.5-IRPO_5e-7-3ep_0.25alp_0.5bdpo_lam_0dpop_lam
Qwen_0.5-IRPO_1e-6-3ep_0.5alp_0.5bdpo_lam_0dpop_lam
Qwen_0.5-DPOP_5e-7-3ep_0alp_0.5bdpo_lam_50dpop_lam
Qwen_0.5-IRPO_5e-7-3ep_0.5alp_0.5bdpo_lam_0dpop_lam
Qwen_0.5-DPOP_5e-7-3ep_0alp_0.5bdpo_lam_500dpop_lam
Qwen_0.5-BDPO_5e-7-3ep_0alp_0.99999bdpo_lam_0dpop_lam
gemma-BDPO_1e-6-1ep_0alp_0.5bdpo_lam_0dpop_lam
gemma-DPO_1e-6-1ep_0alp_0.5bdpo_lam_0dpop_lam
gemma-DPOP_1e-6-1ep_0alp_0.5bdpo_lam_5dpop_lam
Qwen_0.5-IRPO_5e-7-3ep_2alp_0.5bdpo_lam_0dpop_lam
gemma-BDPO_5e-7-1ep_0alp_0.5bdpo_lam_0dpop_lam
Model Card for gemma-BDPO_5e-7-1ep_0alp_0.5bdpo_lam_0dpop_lam. This model is a fine-tuned version of google/gemma-3-1b-it on the trl-lib/ultrafeedback_binarized dataset. It has been trained using TRL. This model was trained with DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model". Framework versions: TRL 0.20.0.dev0, Transformers 4.53.0, PyTorch 2.7.1, Datasets 3.6.0, Tokenizers 0.21.2.
gemma-IRPO_5e-7-1ep_1alp_0.5bdpo_lam_0dpop_lam
Model Card for gemma-IRPO_5e-7-1ep_1alp_0.5bdpo_lam_0dpop_lam. This model is a fine-tuned version of google/gemma-3-1b-it on the trl-lib/ultrafeedback_binarized dataset. It has been trained using TRL. This model was trained with DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model". Framework versions: TRL 0.20.0.dev0, Transformers 4.53.0, PyTorch 2.7.1, Datasets 3.6.0, Tokenizers 0.21.2.
gemma-DPO_5e-7-1ep_0alp_0.5bdpo_lam_0dpop_lam
Model Card for gemma-DPO_5e-7-1ep_0alp_0.5bdpo_lam_0dpop_lam. This model is a fine-tuned version of google/gemma-3-1b-it on the trl-lib/ultrafeedback_binarized dataset. It has been trained using TRL. This model was trained with DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model". Framework versions: TRL 0.20.0.dev0, Transformers 4.53.0, PyTorch 2.7.1, Datasets 3.6.0, Tokenizers 0.21.2.
gemma-DPOP_1e-6-3ep_0alp_0.5bdpo_lam_5dpop_lam
gemma-IRPO_1e-6-3ep_1alp_0.5bdpo_lam_0dpop_lam
Qwen_0.5-ultrainteract_ORPO_5e-7-1ep
Qwen_0.5-SLiC_5e-7-1ep
This model is a fine-tuned version of JayHyeon/Qwen2.5-0.5B_ultrainteract_sft_2e-5_1ep on the JayHyeon/trl_ultrainteract-pair dataset. It has been trained using TRL. This model was trained with DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model". Framework versions: TRL 0.19.0.dev0, Transformers 4.52.4, PyTorch 2.7.1, Datasets 3.6.0, Tokenizers 0.21.1.
Qwen_0.5-BDPO_5e-7-3ep_0alp_0.999bdpo_lam_0dpop_lam
Model Card for Qwen_0.5-BDPO_5e-7-3ep_0alp_0.999bdpo_lam_0dpop_lam. This model is a fine-tuned version of JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep on the trl-lib/ultrafeedback_binarized dataset. It has been trained using TRL. This model was trained with DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model". Framework versions: TRL 0.15.2, Transformers 4.50.0, PyTorch 2.6.0, Datasets 3.4.1, Tokenizers 0.21.1.
llama-IRPO_1e-6-2ep_1alp_0.5bdpo_lam_0dpop_lam
Model Card for llama-IRPO_1e-6-2ep_1alp_0.5bdpo_lam_0dpop_lam. This model is a fine-tuned version of meta-llama/Llama-3.2-1B-Instruct on the trl-lib/ultrafeedback_binarized dataset. It has been trained using TRL. This model was trained with DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model". Framework versions: TRL 0.20.0.dev0, Transformers 4.53.0, PyTorch 2.7.1, Datasets 3.6.0, Tokenizers 0.21.2.
llama-BDPO_1e-6-2ep_0alp_0.5bdpo_lam_0dpop_lam
Model Card for llama-BDPO_1e-6-2ep_0alp_0.5bdpo_lam_0dpop_lam. This model is a fine-tuned version of meta-llama/Llama-3.2-1B-Instruct on the trl-lib/ultrafeedback_binarized dataset. It has been trained using TRL. This model was trained with DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model". Framework versions: TRL 0.20.0.dev0, Transformers 4.53.0, PyTorch 2.7.1, Datasets 3.6.0, Tokenizers 0.21.2.
llama-DPOP_1e-6-2ep_0alp_0.5bdpo_lam_5dpop_lam
Model Card for llama-DPOP_1e-6-2ep_0alp_0.5bdpo_lam_5dpop_lam. This model is a fine-tuned version of meta-llama/Llama-3.2-1B-Instruct on the trl-lib/ultrafeedback_binarized dataset. It has been trained using TRL. This model was trained with DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model". Framework versions: TRL 0.20.0.dev0, Transformers 4.53.0, PyTorch 2.7.1, Datasets 3.6.0, Tokenizers 0.21.2.
gemma-BDPO_3e-6-1ep_0alp_0.5bdpo_lam_0dpop_lam
gemma-DPOP_3e-6-1ep_0alp_0.5bdpo_lam_5dpop_lam
gemma-IRPO_3e-6-1ep_1alp_0.5bdpo_lam_0dpop_lam
gemma-IRPO_1e-7-1ep_1alp_0.5bdpo_lam_0dpop_lam
gemma-BDPO_1e-7-1ep_0alp_0.5bdpo_lam_0dpop_lam
gemma-DPOP_1e-7-1ep_0alp_0.5bdpo_lam_5dpop_lam
llama-IRPO_1e-7-1ep_1alp_0.5bdpo_lam_0dpop_lam
llama-DPOP_1e-7-1ep_0alp_0.5bdpo_lam_5dpop_lam
llama-DPO_1e-7-1ep_0alp_0.5bdpo_lam_0dpop_lam
Model Card for llama-DPO_1e-7-1ep_0alp_0.5bdpo_lam_0dpop_lam. This model is a fine-tuned version of meta-llama/Llama-3.2-1B-Instruct on the trl-lib/ultrafeedback_binarized dataset. It has been trained using TRL. This model was trained with DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model". Framework versions: TRL 0.20.0.dev0, Transformers 4.53.0, PyTorch 2.7.1, Datasets 3.6.0, Tokenizers 0.21.2.
llama-BDPO_2e-7-1ep_0alp_0.5bdpo_lam_0dpop_lam
Model Card for llama-BDPO_2e-7-1ep_0alp_0.5bdpo_lam_0dpop_lam. This model is a fine-tuned version of meta-llama/Llama-3.2-1B-Instruct on the trl-lib/ultrafeedback_binarized dataset. It has been trained using TRL. This model was trained with DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model". Framework versions: TRL 0.20.0.dev0, Transformers 4.53.0, PyTorch 2.7.1, Datasets 3.6.0, Tokenizers 0.21.2.
Qwen_0.5-ultrainteract_SLiC_5e-7-1ep
This model is a fine-tuned version of JayHyeon/Qwen2.5-0.5B_ultrainteract_sft_2e-5_1ep on the JayHyeon/trl_ultrainteract-pair dataset. It has been trained using TRL. This model was trained with DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model". Framework versions: TRL 0.20.0.dev0, Transformers 4.53.0, PyTorch 2.7.1, Datasets 3.6.0, Tokenizers 0.21.2.
llama-DPO_2e-7-1ep_0alp_0.5bdpo_lam_0dpop_lam
Model Card for llama-DPO_2e-7-1ep_0alp_0.5bdpo_lam_0dpop_lam. This model is a fine-tuned version of meta-llama/Llama-3.2-1B-Instruct on the trl-lib/ultrafeedback_binarized dataset. It has been trained using TRL. This model was trained with DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model". Framework versions: TRL 0.20.0.dev0, Transformers 4.53.0, PyTorch 2.7.1, Datasets 3.6.0, Tokenizers 0.21.2.
llama-BDPO_3e-7-1ep_0alp_0.5bdpo_lam_0dpop_lam
llama-IRPO_3e-7-1ep_1alp_0.5bdpo_lam_0dpop_lam
llama-DPOP_3e-7-1ep_0alp_0.5bdpo_lam_5dpop_lam
llama-DPO_3e-7-1ep_0alp_0.5bdpo_lam_0dpop_lam
llama-BDPO_3e-6-1ep_0alp_0.5bdpo_lam_0dpop_lam
Model Card for llama-BDPO_3e-6-1ep_0alp_0.5bdpo_lam_0dpop_lam. This model is a fine-tuned version of meta-llama/Llama-3.2-1B-Instruct on the trl-lib/ultrafeedback_binarized dataset. It has been trained using TRL. This model was trained with DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model". Framework versions: TRL 0.20.0.dev0, Transformers 4.53.0, PyTorch 2.7.1, Datasets 3.6.0, Tokenizers 0.21.2.
llama-IRPO_3e-6-1ep_1alp_0.5bdpo_lam_0dpop_lam
Model Card for llama-IRPO_3e-6-1ep_1alp_0.5bdpo_lam_0dpop_lam. This model is a fine-tuned version of meta-llama/Llama-3.2-1B-Instruct on the trl-lib/ultrafeedback_binarized dataset. It has been trained using TRL. This model was trained with DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model". Framework versions: TRL 0.20.0.dev0, Transformers 4.53.0, PyTorch 2.7.1, Datasets 3.6.0, Tokenizers 0.21.2.
Qwen_0.5-ultrainteract_SPPO_5e-7-1ep
Qwen_0.5-cDPO_5e-7_0.3lsmooth-1.0vpo_constant-1ep
Qwen_0.5-rDPO_5e-7_0.3lsmooth-1.0vpo_constant
Qwen2-0.5B-Reward_VPO_5e-4
Qwen2.5-0.5B-SFT
An SFT checkpoint of Qwen/Qwen2.5-0.5B trained on HuggingFaceH4/ultrafeedback_binarized with the transformers library.
Qwen2.5-0.5B-SFT-1e-4
An SFT checkpoint of Qwen/Qwen2.5-0.5B (lr 1e-4) trained on HuggingFaceH4/ultrafeedback_binarized with the transformers library.
Qwen2.5-0.5B-SFT-2e-5-2ep
An SFT checkpoint of Qwen/Qwen2.5-0.5B (lr 2e-5, 2 epochs) trained on HuggingFaceH4/ultrafeedback_binarized with the transformers library; it serves as the base for most of the preference-tuned runs above.
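The SFT checkpoints in this family appear to share one recipe, with only the learning rate and epoch count changing in the repo name. A minimal sketch of such a run with TRL's SFTTrainer (the "train_sft" split name and the script itself are assumptions, not the author's code):

```python
# Hypothetical SFT sketch with TRL; hyperparameters mirror the repo name
# above (lr 2e-5, 2 epochs).
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

train_dataset = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_sft")

args = SFTConfig(
    output_dir="Qwen2.5-0.5B-SFT-2e-5-2ep",
    learning_rate=2e-5,
    num_train_epochs=2,
)
trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",  # SFTTrainer accepts a Hub model id
    args=args,
    train_dataset=train_dataset,
)
trainer.train()
```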
Qwen2.5-0.5B-SFT-5e-5-2ep
An SFT checkpoint of Qwen/Qwen2.5-0.5B (lr 5e-5, 2 epochs) trained on HuggingFaceH4/ultrafeedback_binarized with the transformers library.
Qwen2.5-0.5B-SFT-2e-5-2ep-MDPO_5e-7-3ep_0alp_0lam
An MDPO fine-tune of JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep on trl-lib/ultrafeedback_binarized, built with the transformers library.
Qwen2.5-0.5B-SFT-2e-5-2ep-MDPO_5e-7_2ep_0alp_0lam
Library name: transformers.
Qwen2.5-0.5B-SFT-2e-5-2ep-DPO_5e-7_1ep_0alp_0lam
A DPO fine-tune of JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep on trl-lib/ultrafeedback_binarized, built with the transformers library.
Qwen2.5-0.5B-SFT-2e-5-2ep-MDPO_5e-6-1ep_0alp_0lam
An MDPO fine-tune of JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep on trl-lib/ultrafeedback_binarized, built with the transformers library.
Qwen2.5-0.5B-SFT-2e-5-2ep-MDPO_0.5_1e-7-3ep_0alp_0lam
The model is based on JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep and utilizes datasets from trl-lib/ultrafeedback_binarized with the transformers library.
Qwen2.5-0.5B-SFT-2e-5-2ep-MDPO_0.5_1e-7-2ep_0alp_0lam
An MDPO fine-tune of JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep on trl-lib/ultrafeedback_binarized, built with the transformers library.
Qwen2.5-0.5B-SFT-2e-5-2ep-IRPO_3e-7-3ep_1alp_0lam
The model is based on JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep and utilizes datasets from trl-lib/ultrafeedback_binarized with the transformers library.
Qwen2.5-0.5B-SFT-2e-5-2ep-DPOP_3e-7-3ep_0alp_5lam
The model is based on JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep and utilizes datasets from trl-lib/ultrafeedback_binarized with the transformers library.
Qwen_0.5-DPO_5e-7-3ep_0alp_0lam
Base model: JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep. Dataset: trl-lib/ultrafeedback_binarized. Library name: transformers.
Qwen_0.5-DPOP_3e-6-3ep_0alp_5lam
Base model: JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep. Dataset: trl-lib/ultrafeedback_binarized. Library name: transformers.
Qwen_0.5-DPO_3e-6-2ep_0alp_0lam
Library name: transformers.
Qwen_0.5-DPOP_3e-7-1ep_0alp_5lam
Library name: transformers.
Qwen_math-DPO_5e-7-1ep_0alp_0lam
This model is a fine-tuned version of Qwen/Qwen2.5-0.5B-Instruct on the openbmb/UltraInteract_pair dataset. It has been trained using TRL. This model was trained with DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model". Framework versions: TRL 0.13.0.dev0, Transformers 4.47.0.dev0, PyTorch 2.5.1, Datasets 3.1.0, Tokenizers 0.20.3.
Qwen_math-IRPO_5e-7-1ep_1alp_0lam
Qwen_0.5-IPO_5e-7-1ep_0alp_0lam
Base model: JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep. Dataset: trl-lib/ultrafeedback_binarized. Library name: transformers.
Qwen_0.5-VIPO_1e-6-1ep_10vpo_const
Qwen_1.5B-math-VDPO_5e-7_1.0vpo_constant-20ep
Model Card for Qwen_1.5B-math-VDPO_5e-7_1.0vpo_constant-20ep. This model is a fine-tuned version of Qwen/Qwen2.5-Math-1.5B on the argilla/distilabel-math-preference-dpo dataset. It has been trained using TRL. This model was trained with DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model". Framework versions: TRL 0.15.2, Transformers 4.50.0, PyTorch 2.6.0, Datasets 3.4.1, Tokenizers 0.21.1.
pythia-2.8b-DPO_5e-7_1.0vpo_constant-1ep
Qwen_0.5-rDPO_5e-7_1.0vpo_constant-1ep_0.3flip
Qwen_0.5-rDPO_3e-6_1.0vpo_constant-1ep_0.3flip
Qwen_0.5-VIPO_3e-6_1.0vpo_constant-1ep_0.3flip
Model Card for Qwen_0.5-VIPO_3e-6_1.0vpo_constant-1ep_0.3flip. This model is a fine-tuned version of JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep on the trl-lib/ultrafeedback_binarized dataset. It has been trained using TRL. This model was trained with DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model". Framework versions: TRL 0.15.2, Transformers 4.50.0, PyTorch 2.6.0, Datasets 3.4.1, Tokenizers 0.21.1.
llama-BDPO_1e-6-3ep_0alp_0.5bdpo_lam_0dpop_lam
Qwen_0.5-DPOP_1e-6-3ep_0alp_0.5bdpo_lam_500dpop_lam
Model Card for Qwen_0.5-DPOP_1e-6-3ep_0alp_0.5bdpo_lam_500dpop_lam. This model is a fine-tuned version of JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep on the trl-lib/ultrafeedback_binarized dataset. It has been trained using TRL. This model was trained with DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model". Framework versions: TRL 0.19.0.dev0, Transformers 4.52.4, PyTorch 2.7.1, Datasets 3.6.0, Tokenizers 0.21.1.
gemma-IRPO_1e-6-1ep_1alp_0.5bdpo_lam_0dpop_lam
gemma-BDPO_1e-6-3ep_0alp_0.5bdpo_lam_0dpop_lam
llama-DPO_1e-6-2ep_0alp_0.5bdpo_lam_0dpop_lam
Model Card for llama-DPO_1e-6-2ep_0alp_0.5bdpo_lam_0dpop_lam. This model is a fine-tuned version of meta-llama/Llama-3.2-1B-Instruct on the trl-lib/ultrafeedback_binarized dataset. It has been trained using TRL. This model was trained with DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model". Framework versions: TRL 0.20.0.dev0, Transformers 4.53.0, PyTorch 2.7.1, Datasets 3.6.0, Tokenizers 0.21.2.
llama-IRPO_2e-7-1ep_1alp_0.5bdpo_lam_0dpop_lam
Model Card for llama-IRPO_2e-7-1ep_1alp_0.5bdpo_lam_0dpop_lam. This model is a fine-tuned version of meta-llama/Llama-3.2-1B-Instruct on the trl-lib/ultrafeedback_binarized dataset. It has been trained using TRL. This model was trained with DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model". Framework versions: TRL 0.20.0.dev0, Transformers 4.53.0, PyTorch 2.7.1, Datasets 3.6.0, Tokenizers 0.21.2.
llama-DPOP_2e-7-1ep_0alp_0.5bdpo_lam_5dpop_lam
Model Card for llama-DPOP_2e-7-1ep_0alp_0.5bdpo_lam_5dpop_lam. This model is a fine-tuned version of meta-llama/Llama-3.2-1B-Instruct on the trl-lib/ultrafeedback_binarized dataset. It has been trained using TRL. This model was trained with DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model". Framework versions: TRL 0.20.0.dev0, Transformers 4.53.0, PyTorch 2.7.1, Datasets 3.6.0, Tokenizers 0.21.2.
Qwen_0.5-rDPO_5e-7_1.0vpo_constant_0.1label_smoothing
Model Card for Qwen_0.5-rDPO_5e-7_1.0vpo_constant_0.1label_smoothing. This model is a fine-tuned version of Qwen/Qwen2.5-0.5B-Instruct on the JayHyeon/shp-dpo-converted dataset. It has been trained using TRL. This model was trained with DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model". Framework versions: TRL 0.19.0.dev0, Transformers 4.52.4, PyTorch 2.7.1, Datasets 3.6.0, Tokenizers 0.21.1.
Qwen_0.5-IPO_5e-7_1.0vpo_constant
Qwen_VIPO_SHP
Qwen_0.5-IPO_5e-7_seed42
This model is a fine-tuned version of Qwen/Qwen2.5-0.5B-Instruct on the JayHyeon/shp-dpo-converted dataset. It has been trained using TRL. This model was trained with DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model". Framework versions: TRL 0.22.0.dev0, Transformers 4.55.0, PyTorch 2.8.0, Datasets 4.0.0, Tokenizers 0.21.4.
Qwen_0.5-DPO_5e-7_1.0vpo_constant_ls0.0_seed42
Model Card for Qwen_0.5-DPO_5e-7_1.0vpo_constant_ls0.0_seed42. This model is a fine-tuned version of Qwen/Qwen2.5-0.5B-Instruct on the JayHyeon/shp-dpo-converted dataset. It has been trained using TRL. This model was trained with DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model". Framework versions: TRL 0.22.0.dev0, Transformers 4.55.0, PyTorch 2.8.0, Datasets 4.0.0, Tokenizers 0.21.4.
Qwen_0.5-VDPO_5e-7_1.0vpo_constant_ls0.0_seed42
Model Card for Qwen_0.5-VDPO_5e-7_1.0vpo_constant_ls0.0_seed42. This model is a fine-tuned version of Qwen/Qwen2.5-0.5B-Instruct on the JayHyeon/shp-dpo-converted dataset. It has been trained using TRL. This model was trained with DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model". Framework versions: TRL 0.22.0.dev0, Transformers 4.55.0, PyTorch 2.8.0, Datasets 4.0.0, Tokenizers 0.21.4.
Qwen_0.5-IPO_5e-7_1.0vpo_constant_ls0.0_seed42
Model Card for Qwen_0.5-IPO_5e-7_1.0vpo_constant_ls0.0_seed42. This model is a fine-tuned version of Qwen/Qwen2.5-0.5B-Instruct on the JayHyeon/shp-dpo-converted dataset. It has been trained using TRL. This model was trained with DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model". Framework versions: TRL 0.22.0.dev0, Transformers 4.55.0, PyTorch 2.8.0, Datasets 4.0.0, Tokenizers 0.21.4.
Qwen_0.5-VIPO_5e-7_1.0vpo_constant_ls0.0_seed42
Model Card for Qwen_0.5-VIPO_5e-7_1.0vpo_constant_ls0.0_seed42. This model is a fine-tuned version of Qwen/Qwen2.5-0.5B-Instruct on the JayHyeon/shp-dpo-converted dataset. It has been trained using TRL. This model was trained with DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model". Framework versions: TRL 0.22.0.dev0, Transformers 4.55.0, PyTorch 2.8.0, Datasets 4.0.0, Tokenizers 0.21.4.
Qwen_0.5-rDPO_5e-7_1.0vpo_constant_ls0.1_seed42
Model Card for Qwen_0.5-rDPO_5e-7_1.0vpo_constant_ls0.1_seed42. This model is a fine-tuned version of Qwen/Qwen2.5-0.5B-Instruct on the JayHyeon/shp-dpo-converted dataset. It has been trained using TRL. This model was trained with DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model". Framework versions: TRL 0.22.0.dev0, Transformers 4.55.0, PyTorch 2.8.0, Datasets 4.0.0, Tokenizers 0.21.4.
Qwen_0.5-rDPO_5e-7_1.0vpo_constant_ls0.3_seed42
Model Card for Qwen_0.5-rDPO_5e-7_1.0vpo_constant_ls0.3_seed42. This model is a fine-tuned version of Qwen/Qwen2.5-0.5B-Instruct. It has been trained using TRL. This model was trained with DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model". Framework versions: TRL 0.22.0.dev0, Transformers 4.55.0, PyTorch 2.8.0, Datasets 4.0.0, Tokenizers 0.21.4.
Qwen_0.5-cDPO_5e-7_1.0vpo_constant_ls0.1_seed42
Model Card for Qwen_0.5-cDPO_5e-7_1.0vpo_constant_ls0.1_seed42. This model is a fine-tuned version of Qwen/Qwen2.5-0.5B-Instruct on the JayHyeon/shp-dpo-converted dataset. It has been trained using TRL. This model was trained with DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model". Framework versions: TRL 0.22.0.dev0, Transformers 4.55.0, PyTorch 2.8.0, Datasets 4.0.0, Tokenizers 0.21.4.
Qwen_0.5-cDPO_5e-7_1.0vpo_constant_ls0.3_seed42
Model Card for Qwen_0.5-cDPO_5e-7_1.0vpo_constant_ls0.3_seed42. This model is a fine-tuned version of Qwen/Qwen2.5-0.5B-Instruct on the JayHyeon/shp-dpo-converted dataset. It has been trained using TRL. This model was trained with DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model". Framework versions: TRL 0.22.0.dev0, Transformers 4.55.0, PyTorch 2.8.0, Datasets 4.0.0, Tokenizers 0.21.4.
Qwen_0.5-VDPO_5e-7_0.3vpo_constant_ls0.0_seed42
Qwen_0.5-VDPO_5e-7_3.0vpo_constant_ls0.0_seed42
Qwen_0.5-VDPO_5e-7_5vpo_constant_ls0.0_seed42
Qwen2.5-0.5B-Instruct-SFT
An SFT checkpoint of Qwen/Qwen2.5-0.5B-Instruct trained on the HuggingFaceH4/ultrafeedback_binarized dataset with the transformers library.
Qwen2-0.5B-Reward_VPO_1e-4
Qwen2-0.5B-Reward_1e-4-test
Qwen2-0.5B-Reward_VPO_5e-3
Qwen-0.5B-IRPO-5epoch
An IRPO fine-tune in the Qwen-0.5B series, trained for five epochs; built with the transformers library under the MIT license.
Qwen-0.5B-DPO-1epoch
A DPO fine-tune in the Qwen-0.5B series, trained for one epoch; built with the transformers library under the MIT license.
Qwen-0.5B-IRPO-1epoch
An IRPO fine-tune in the Qwen-0.5B series, trained for one epoch; built with the transformers library under the MIT license.
Qwen2.5-0.5B-Instruct-SFT-MDPO-1epoch_v1
An MDPO fine-tune of the Qwen2.5-0.5B-Instruct SFT model, trained for one epoch with the transformers library under the MIT license.
Qwen2.5-0.5B-Instruct-SFT-DPO-1epoch_v1
A DPO fine-tune of the Qwen2.5-0.5B-Instruct SFT model, trained for one epoch with the transformers library under the MIT license.
Qwen2.5-0.5B-Instruct-SFT-IRPO-1epoch_v1
An IRPO fine-tune of the Qwen2.5-0.5B-Instruct SFT model, trained for one epoch with the transformers library under the MIT license.
Qwen2.5-0.5B-SFT-DPO-1epoch_v1
A DPO fine-tune of the Qwen2.5-0.5B SFT model, trained for one epoch with the transformers library under the MIT license.
Qwen2.5-0.5B-SFT-2e-5
An SFT checkpoint of Qwen/Qwen2.5-0.5B (lr 2e-5) trained on HuggingFaceH4/ultrafeedback_binarized with the transformers library.
Qwen2.5-0.5B-SFT-5e-5
An SFT checkpoint of Qwen/Qwen2.5-0.5B (lr 5e-5) trained on HuggingFaceH4/ultrafeedback_binarized with the transformers library.
Qwen2.5-0.5B-SFT-7e-5
An SFT checkpoint of Qwen/Qwen2.5-0.5B (lr 7e-5) trained on HuggingFaceH4/ultrafeedback_binarized with the transformers library.
Qwen2.5-0.5B-SFT_2ep
Qwen2.5-0.5B-SFT-1e-5-3ep
An SFT checkpoint of Qwen/Qwen2.5-0.5B (lr 1e-5, 3 epochs) trained on HuggingFaceH4/ultrafeedback_binarized with the transformers library.
Qwen2.5-0.5B-SFT-5e-5-3ep
An SFT checkpoint of Qwen/Qwen2.5-0.5B (lr 5e-5, 3 epochs) trained on HuggingFaceH4/ultrafeedback_binarized with the transformers library.
Qwen2.5-0.5B-SFT-1e-4-3ep
An SFT checkpoint of Qwen/Qwen2.5-0.5B (lr 1e-4, 3 epochs) trained on HuggingFaceH4/ultrafeedback_binarized with the transformers library.
Qwen2.5-0.5B-SFT-1e-5-2ep
An SFT checkpoint of Qwen/Qwen2.5-0.5B (lr 1e-5, 2 epochs) trained on HuggingFaceH4/ultrafeedback_binarized with the transformers library.
Qwen2.5-0.5B-SFT-7e-5-2ep
An SFT checkpoint of Qwen/Qwen2.5-0.5B (lr 7e-5, 2 epochs) trained on HuggingFaceH4/ultrafeedback_binarized with the transformers library.
Qwen2.5-0.5B-SFT-1e-4-2ep
An SFT checkpoint of Qwen/Qwen2.5-0.5B (lr 1e-4, 2 epochs) trained on HuggingFaceH4/ultrafeedback_binarized with the transformers library.
Qwen2.5-0.5B-SFT-2e-4-2ep
An SFT checkpoint of Qwen/Qwen2.5-0.5B (lr 2e-4, 2 epochs) trained on HuggingFaceH4/ultrafeedback_binarized with the transformers library.
Qwen2.5-0.5B-SFT-5e-5-5ep
An SFT checkpoint of Qwen/Qwen2.5-0.5B (lr 5e-5, 5 epochs) trained on HuggingFaceH4/ultrafeedback_binarized with the transformers library.
Qwen2.5-0.5B-SFT-1e-4-5ep
An SFT checkpoint of Qwen/Qwen2.5-0.5B (lr 1e-4, 5 epochs) trained on HuggingFaceH4/ultrafeedback_binarized with the transformers library.
Qwen2.5-0.5B-SFT-2e-4-5ep
An SFT checkpoint of Qwen/Qwen2.5-0.5B (lr 2e-4, 5 epochs) trained on HuggingFaceH4/ultrafeedback_binarized with the transformers library.
Qwen2.5-0.5B-SFT-2e-5-5ep-MDPO_7e-7_3ep_0alp_0lam
An MDPO fine-tune of JayHyeon/Qwen2.5-0.5B-SFT-2e-5-5ep on trl-lib/ultrafeedback_binarized, built with the transformers library.
Qwen2.5-0.5B-SFT-2e-5-5ep-MDPO_5e-7_3ep_0alp_0lam_1ep
Library name: transformers.
Qwen2.5-0.5B-SFT-2e-5-5ep-MDPO_5e-7_3ep_0alp_0lam_2ep
Library name: transformers.
Qwen2.5-0.5B-SFT-2e-5-5ep-MDPO_7e-7_3ep_0alp_0lam_2ep
Library name: transformers.
Qwen2.5-0.5B-SFT-2e-5-2ep-MDPO_5e-7_1ep_0alp_0lam
An MDPO fine-tune of JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep on trl-lib/ultrafeedback_binarized, built with the transformers library.
Qwen2.5-0.5B-SFT-2e-5-2ep-MDPO_7e-7-3ep_0alp_0lam
Based on JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep and trained on trl-lib/ultrafeedback_binarized with the transformers library.
Qwen2.5-0.5B-SFT-2e-5-2ep-DPO_1e-6-3ep_0alp_0lam
The model is based on JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep and utilizes datasets from trl-lib/ultrafeedback_binarized, implemented using the transformers library.
Qwen2.5-0.5B-SFT-2e-5-2ep-MDPO_7e-7_1ep_0alp_0lam
Library name: transformers.
Qwen2.5-0.5B-SFT-2e-5-2ep-DPO_5e-6-3ep_0alp_0lam
The model is based on JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep and utilizes datasets from trl-lib/ultrafeedback_binarized, implemented using the transformers library.
Qwen2.5-0.5B-SFT-2e-5-2ep-MDPO_3e-6-3ep_0alp_0lam
The model is based on JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep and utilizes datasets from trl-lib/ultrafeedback_binarized, implemented using the transformers library.
Qwen2.5-0.5B-SFT-2e-5-2ep-MDPO_1e-6_2ep_0alp_0lam
An MDPO fine-tune of JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep on trl-lib/ultrafeedback_binarized, built with the transformers library.
Qwen2.5-0.5B-SFT-2e-5-2ep-MDPO_2e-6_1ep_0alp_0lam
An MDPO fine-tune of JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep on trl-lib/ultrafeedback_binarized, built with the transformers library.
Qwen2.5-0.5B-SFT-2e-5-2ep-DPO_2e-6-3ep_0alp_0lam
The model is based on JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep and utilizes datasets from trl-lib/ultrafeedback_binarized, designed for use with the transformers library.
Qwen2.5-0.5B-SFT-2e-5-2ep-DPO_3e-6-3ep_0alp_0lam
The model is based on JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep and utilizes datasets from trl-lib/ultrafeedback_binarized, designed for use with the transformers library.
Qwen2.5-0.5B-SFT-2e-5-2ep-MDPO_3e-6-2ep_0alp_0lam
An MDPO fine-tune of JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep on trl-lib/ultrafeedback_binarized, built with the transformers library.
Qwen2.5-0.5B-SFT-2e-5-2ep-MDPO_3e-6-1ep_0alp_0lam
An MDPO fine-tune of JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep on trl-lib/ultrafeedback_binarized, built with the transformers library.
Qwen2.5-0.5B-SFT-2e-5-2ep-DPO_1e-6-2ep_0alp_0lam
A DPO fine-tune of JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep on trl-lib/ultrafeedback_binarized, built with the transformers library.
Qwen2.5-0.5B-SFT-2e-5-2ep-MDPO_5e-6-2ep_0alp_0lam
An MDPO fine-tune of JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep on trl-lib/ultrafeedback_binarized, built with the transformers library.
Qwen2.5-0.5B-SFT-2e-5-2ep-DPO_2e-6-2ep_0alp_0lam
A DPO fine-tune of JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep on trl-lib/ultrafeedback_binarized, built with the transformers library.
Qwen2.5-0.5B-SFT-2e-5-2ep-DPO_2e-6-1ep_0alp_0lam
A DPO fine-tune of JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep on trl-lib/ultrafeedback_binarized, built with the transformers library.
Qwen2.5-0.5B-SFT-2e-5-2ep-DPO_3e-6-1ep_0alp_0lam
A DPO fine-tune of JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep on trl-lib/ultrafeedback_binarized, built with the transformers library.
Qwen2.5-0.5B-SFT-2e-5-2ep-DPO_5e-6-1ep_0alp_0lam
A DPO fine-tune of JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep on trl-lib/ultrafeedback_binarized, built with the transformers library.
Qwen2.5-0.5B-SFT-2e-5-2ep-IRPO_5e-7-1ep_1alp_0lam
An IRPO fine-tune of JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep on trl-lib/ultrafeedback_binarized, built with the transformers library.
Qwen2.5-0.5B-SFT-2e-5-2ep-IRPO_5e-7-2ep_1alp_0lam
An IRPO fine-tune of JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep on trl-lib/ultrafeedback_binarized, built with the transformers library.
Qwen2.5-0.5B-SFT-2e-5-2ep-DPOP_5e-7-1ep_0alp_5lam
A DPOP fine-tune of JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep on trl-lib/ultrafeedback_binarized, built with the transformers library.
Qwen2.5-0.5B-SFT-2e-5-2ep-DPOP_5e-7-2ep_0alp_5lam
A DPOP fine-tune of JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep on trl-lib/ultrafeedback_binarized, built with the transformers library.
Qwen2.5-0.5B-SFT-2e-5-2ep-IRPO_5e-6-3ep_1alp_0lam
The model is based on JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep and utilizes datasets from trl-lib/ultrafeedback_binarized, implemented using the transformers library.
Qwen2.5-0.5B-SFT-2e-5-2ep-IRPO_5e-6-2ep_1alp_0lam
An IRPO fine-tune of JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep on trl-lib/ultrafeedback_binarized, built with the transformers library.
Qwen2.5-0.5B-SFT-2e-5-2ep-IRPO_5e-6-1ep_1alp_0lam
An IRPO fine-tune of JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep on trl-lib/ultrafeedback_binarized, built with the transformers library.
Qwen2.5-0.5B-SFT-2e-5-2ep-DPOP_5e-6-2ep_0alp_5lam
A DPOP fine-tune of JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep on trl-lib/ultrafeedback_binarized, built with the transformers library.
Qwen2.5-0.5B-SFT-2e-5-2ep-DPO_1e-7-3ep_0alp_0lam
The model is based on JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep and utilizes datasets from trl-lib/ultrafeedback_binarized, designed for use with the transformers library.
Qwen2.5-0.5B-SFT-2e-5-2ep-IRPO_1e-7-3ep_1alp_0lam
An IRPO fine-tune of JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep on trl-lib/ultrafeedback_binarized, built with the transformers library.
Qwen2.5-0.5B-SFT-2e-5-2ep-MDPO_0.5_1e-7-1ep_0alp_0lam
An MDPO fine-tune of JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep on trl-lib/ultrafeedback_binarized, built with the transformers library.
Qwen2.5-0.5B-SFT-2e-5-2ep-IRPO_1e-7-2ep_1alp_0lam
An IRPO fine-tune of JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep on trl-lib/ultrafeedback_binarized, built with the transformers library.
Qwen2.5-0.5B-SFT-2e-5-2ep-DPO_3e-7-3ep_0alp_0lam
The model is based on JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep and utilizes datasets from trl-lib/ultrafeedback_binarized, implemented using the transformers library.
Qwen_0.5-DPO_5e-7-2ep_0alp_0lam
Library name: transformers.
Qwen_0.5-DPOP_3e-6-1ep_0alp_5lam
Library name: transformers.
Qwen_0.5-DPOP_3e-6-2ep_0alp_5lam
Library name: transformers.
Qwen_0.5-DPO_1e-6-3ep_0alp_0lam
Base model: JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep. Dataset: trl-lib/ultrafeedback_binarized. Library name: transformers.
Qwen_0.5-DPOP_1e-6-3ep_0alp_5lam
Base model: JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep. Dataset: trl-lib/ultrafeedback_binarized. Library name: transformers.
Qwen_0.5-DPOP_1e-7-3ep_0alp_5lam
Base model: JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep. Dataset: trl-lib/ultrafeedback_binarized. Library name: transformers.
Qwen_0.5-DPO_3e-7-3ep_0alp_0lam
Base model: JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep. Dataset: trl-lib/ultrafeedback_binarized. Library name: transformers.
Qwen_0.5-DPOP_3e-7-3ep_0alp_5lam
Base model: JayHyeon/Qwen2.5-0.5B-SFT-2e-5-2ep. Dataset: trl-lib/ultrafeedback_binarized. Library name: transformers.
Qwen_0.5-DPO_3e-7-1ep_0alp_0lam
Library name: transformers.
Qwen_0.5-DPO_3e-7-2ep_0alp_0lam
Library name: transformers.
Qwen_math-DPOP_5e-7-1ep_0alp_5lam
This model is a fine-tuned version of Qwen/Qwen2.5-0.5B-Instruct on the openbmb/UltraInteract_pair dataset. It has been trained using TRL. This model was trained with DPO, a method introduced in "Direct Preference Optimization: Your Language Model is Secretly a Reward Model". Framework versions: TRL 0.13.0.dev0, Transformers 4.47.0.dev0, PyTorch 2.5.1, Datasets 3.1.0, Tokenizers 0.20.3.
Qwen_0.5-VIPO_1e-6-1ep_30vpo_const
Qwen_0.5-VDPO_1e-6-1ep_1vpo_const
Qwen_0.5-cDPO_1e-6-1ep_0vpo_const_0.3
Qwen_0.5-VDPO_5e-6-1ep_10vpo_const
Qwen_0.5-VDPO_3e-6-1ep_0vpo_const
Qwen_0.5-VDPO_3e-6-1ep_0.3vpo_const_exp
Math-Qwen_0.5-BDPO_5e-7-1ep_0alp_0lam
Qwen_0.5-ultrainteract_DPOP_5e-7-1ep_0.5bdpo_lambda
Qwen_1.5B-math-DPO_1e-5_1.0vpo_constant-20ep
Qwen-0.5B-DPO-5epoch
A DPO fine-tune in the Qwen-0.5B series, trained for five epochs; built with the transformers library under the MIT license.