PKU-Alignment

192 models

alpaca-8b-reproduced-llama-3 | llama | 4,432 | 0

beaver-dam-7b | llama | 2,187 | 12

Boasting 7 billion parameters, Beaver-Dam-7B is a QA-Moderation model derived from the Llama-7B base model and trained on the PKU-Alignment/BeaverTails classification dataset. Its key feature is the ability to analyze responses to prompts for toxicity across 14 categories.

- Developed by: PKU-Alignment Team
- Model type: QA moderation
- License: Non-commercial license
- Fine-tuned from model: LLaMA
- Repository: https://github.com/PKU-Alignment/beavertails
- Web: https://sites.google.com/view/pku-beavertails
- Paper: Coming soon

Traditional approaches to content moderation in question-answering (QA) tasks gauge the toxicity of a QA pair by examining each utterance individually. While effective to a degree, this method can discard a significant number of user prompts: if the moderation system deems a prompt too harmful, the language model never generates a response at all, interrupting the user experience and hindering the evolution of beneficial, human-aligned AI. BeaverDam shifts the approach to content moderation for QA tasks, a paradigm we term "QA moderation": a QA pair is classified as harmful or benign based on its degree of risk neutrality, that is, the extent to which the potential risks in a potentially harmful question can be counteracted by a non-threatening response.
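The QA-moderation decision described above can be sketched in a few lines. Everything here is an illustrative placeholder, not part of this listing: the category names, the scores, and the `moderate` helper are hypothetical, and in practice the per-category toxicity scores would come from running Beaver-Dam-7B on the (question, answer) pair together, since QA moderation judges the pair rather than each utterance alone.

```python
# Hedged sketch: flagging a QA pair from per-category toxicity scores.
# Category names and scores below are illustrative placeholders, not the
# actual 14 BeaverTails categories or real model output.

def moderate(scores: dict, threshold: float = 0.5) -> dict:
    """Return which harm categories exceed the threshold and an overall flag."""
    flagged = {cat: s for cat, s in scores.items() if s >= threshold}
    return {"flagged_categories": sorted(flagged), "is_harmful": bool(flagged)}

# Stand-in scores for one (question, answer) pair.
example_scores = {"violence": 0.82, "privacy_violation": 0.10, "hate_speech": 0.31}
result = moderate(example_scores)
print(result)  # {'flagged_categories': ['violence'], 'is_harmful': True}
```

A benign answer to a risky question would pull every category score below the threshold, so the pair passes moderation, which is exactly the "risk neutrality" idea described above.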

alpaca-7b-reproduced | llama | 812 | 5
beaver-7b-v1.0-reward | llama | 365 | 17
beaver-7b-v1.0-cost | llama | 334 | 10
alpaca-7b-reproduced-llama-2 | llama | 108 | 1
AA-chameleon-7b-base | license:cc-by-4.0 | 49 | 8
ProgressGym-HistLlama3-8B-C013-instruct-v0.2 | llama | 48 | 0
beaver-7b-v3.0-cost | llama | 33 | 0
beaver-7b-unified-reward | llama | 31 | 0
beaver-7b-v3.0-reward | llama | 28 | 0
beaver-7b-v3.0 | llama | 26 | 0
ProgressGym-HistLlama3-8B-C017-instruct-v0.2 | llama | 25 | 0
beaver-7b-unified-cost | llama | 21 | 1
ProgressGym-HistLlama3-8B-C016-instruct-v0.2 | llama | 21 | 0
llama3.1-8b-vision-audio | llama_vision_audio | 20 | 4
ProgressGym-HistLlama3-8B-C018-pretrain-v0.2 | llama | 15 | 0
AnyRewardModel | license:cc-by-nc-4.0 | 14 | 4
ProgressGym-HistLlama3-8B-C021-instruct-v0.2 | llama | 13 | 0
ProgressGym-HistLlama3-8B-C018-instruct-v0.2 | llama | 12 | 0
ProgressGym-HistLlama3-8B-C014-instruct-v0.2 | llama | 11 | 0

Beaver 7b V1.0 | llama | 9 | 13

Beaver is a chat assistant trained from the Stanford Alpaca model (reproduced version) using the PKU-Alignment/safe-rlhf library. Beaver was created to study the safety of large language models (LLMs). Compared with its predecessor Alpaca, Beaver relies on Safe-RLHF alignment technology, which avoids outputting harmful content while providing helpful information as much as possible.

- Developed by: the PKU-Alignment Team
- Model type: an auto-regressive language model based on the transformer architecture
- License: Non-commercial license
- Fine-tuned from model: LLaMA, Alpaca
- Training framework: the PKU-Alignment/safe-rlhf GitHub repository
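A minimal inference sketch for a Beaver chat model via Hugging Face transformers might look like the following. The one-turn prompt template is the convention used by the safe-rlhf framework as far as we know; verify it against the model card before relying on it, and note that `chat` downloads a 7B checkpoint.

```python
# Hedged sketch of chatting with beaver-7b-v1.0; the prompt template is an
# assumption based on the safe-rlhf convention, not confirmed by this listing.

PROMPT_TEMPLATE = "BEGINNING OF CONVERSATION: USER: {input} ASSISTANT:"

def build_prompt(user_input: str) -> str:
    """Wrap a single user turn in the (assumed) safe-rlhf chat template."""
    return PROMPT_TEMPLATE.format(input=user_input)

def chat(user_input: str) -> str:
    # Heavy imports kept local so the formatting helper stays dependency-free.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "PKU-Alignment/beaver-7b-v1.0"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    )
    inputs = tokenizer(build_prompt(user_input), return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=256)
    # Strip the prompt tokens, keep only the generated continuation.
    return tokenizer.decode(
        output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )

print(build_prompt("How can I stay safe online?"))
```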

Qwen1.5-0.5B-IMDB-Q1-10000 | 9 | 0
Beaver-Vision-11B | mllama | 7 | 2
ProgressGym-HistLlama3-8B-C013-pretrain-v0.2 | llama | 6 | 0
ProgressGym-HistLlama3-8B-C015-pretrain-v0.2 | llama | 6 | 0
Align-DS-V | base_model:deepseek-ai/DeepSeek-R1-Distill-Llama-8B | 5 | 72
AA-chameleon-7b-plus | license:cc-by-4.0 | 5 | 5
ProgressGym-HistLlama3-70B-C021-pretrain-v0.1 | llama | 5 | 1
ProgressGym-HistLlama3-70B-C015-instruct-v0.1 | llama | 5 | 0
ProgressGym-HistLlama3-8B-C019-instruct-v0.2 | llama | 5 | 0
ProgressGym-HistLlama3-8B-C017-pretrain-v0.2 | llama | 5 | 0
Qwen1.5-4B-Safety-Q1-1k | 5 | 0
ProgressGym-HistLlama3-70B-C013-instruct-v0.1 | llama | 4 | 0
ProgressGym-HistLlama3-70B-C016-instruct-v0.1 | llama | 4 | 0
ProgressGym-HistLlama3-70B-C015-pretrain-v0.1 | llama | 4 | 0
ProgressGym-HistLlama3-70B-C020-pretrain-v0.1 | llama | 4 | 0
ProgressGym-HistLlama3-8B-C015-instruct-v0.2 | llama | 4 | 0
ProgressGym-HistLlama3-8B-C020-instruct-v0.2 | llama | 4 | 0
Qwen1.5-4B-IMDB-Q1-1000-Q2-100 | 4 | 0
Qwen1.5-7B-Safety-Q1-10k | 4 | 0
Qwen1.5-4B-IMDB-Q1-2000-Q2-500 | 4 | 0
tinyllama-3T-IMDB-Q1-2000-Q2-2000 | llama | 4 | 0
beaver-7b-v2.0-reward | llama | 3 | 0
ProgressGym-HistLlama3-8B-C014-pretrain-v0.2 | llama | 3 | 0
ProgressGym-HistLlama3-8B-C019-pretrain-v0.2 | llama | 3 | 0
Qwen1.5-0.5B-IMDB-Q1-2000-Q2-2000 | 3 | 0
Qwen1.5-4B-IMDB-Q1-1000-Q2-1000 | 3 | 0
Qwen1.5-0.5B-Safety-Q1-50k | 3 | 0
Qwen1.5-7B-Safety-Q1-40k-Q2-500 | 3 | 0
tinyllama-1.5T-Safety-Q1-5k-Q2-500 | llama | 3 | 0
Qwen1.5-4B-IMDB-Q1-5000-Q2-200 | 3 | 0
tinyllama-1.5T-Safety-Q1-2k-Q2-500 | llama | 3 | 0
tinyllama-1T-Safety-Q1-40k-Q2-1k | llama | 3 | 0
tinyllama-3T-Safety-Q1-40k-Q2-100 | llama | 3 | 0
tinyllama-3T-Safety-Q1-5k-Q2-5k | llama | 3 | 0
Qwen1.5-7B-IMDB-Q1-10000-Q2-500 | 3 | 0
tinyllama-1.5T-IMDB-Q1-1000 | llama | 3 | 0
tinyllama-2T-IMDB-Q1-5000-Q2-2000 | llama | 3 | 0
tinyllama-3T-IMDB-Q1-10000-Q2-100 | llama | 3 | 0
tinyllama-3T-IMDB-Q1-2000-Q2-200 | llama | 3 | 0
beaver-7b-v2.0 | llama | 2 | 0
ProgressGym-HistLlama3-70B-C017-instruct-v0.1 | llama | 2 | 0
ProgressGym-HistLlama3-70B-C019-instruct-v0.1 | llama | 2 | 0
ProgressGym-HistLlama3-70B-C021-instruct-v0.1 | llama | 2 | 0
ProgressGym-HistLlama3-70B-C014-pretrain-v0.1 | llama | 2 | 0
ProgressGym-HistLlama3-70B-C017-pretrain-v0.1 | llama | 2 | 0
ProgressGym-HistLlama3-8B-C020-pretrain-v0.2 | llama | 2 | 0
llama3.1-8b-instruct-vision | base_model:meta-llama/Llama-3.1-8B-Instruct | 2 | 0
safe-o1-7b | license:cc-by-4.0 | 2 | 0
Qwen1.5-0.5B-IMDB-Q1-10000-Q2-200 | 2 | 0
Qwen1.5-4B-IMDB-Q1-1000-Q2-200 | 2 | 0
Llama-2-7b-hf-Safety-Q1-20k | llama | 2 | 0
Llama-2-7b-hf-Safety-Q1-20k-Q2-1k | llama | 2 | 0
Qwen1.5-0.5B-Safety-Q1-10k-Q2-1k | 2 | 0
Qwen1.5-0.5B-Safety-Q1-1k-Q2-100 | 2 | 0
Qwen1.5-0.5B-Safety-Q1-1k-Q2-500 | 2 | 0
Qwen1.5-0.5B-Safety-Q1-50k-Q2-1k | 2 | 0
Qwen1.5-0.5B-Safety-Q1-50k-Q2-2k | 2 | 0
Qwen1.5-4B-Safety-Q1-1k-Q2-2k | 2 | 0
Qwen1.5-4B-Safety-Q1-5k | 2 | 0
Qwen1.5-7B-Safety-Q1-10k-Q2-100 | 2 | 0
Qwen1.5-7B-Safety-Q1-2k-Q2-1k | 2 | 0
Qwen1.5-7B-Safety-Q1-30k-Q2-100 | 2 | 0
tinyllama-1T-Safety-Q1-10k-Q2-2k | llama | 2 | 0
tinyllama-0.5T-Safety-Q1-5k-Q2-500 | llama | 2 | 0
tinyllama-1T-Safety-Q1-40k-Q2-100 | llama | 2 | 0
tinyllama-2.5T-Safety-Q1-1k-Q2-5k | llama | 2 | 0
tinyllama-2.5T-Safety-Q1-40k-Q2-500 | llama | 2 | 0
tinyllama-2T-Safety-Q1-50k-Q2-500 | llama | 2 | 0
tinyllama-3T-Safety-Q1-5k-Q2-100 | llama | 2 | 0
tinyllama-3T-Safety-Q1-5k-Q2-1k | llama | 2 | 0
tinyllama-3T-Safety-Q1-5k-Q2-2k | llama | 2 | 0
tinyllama-3T-Safety-Q1-5k-Q2-500 | llama | 2 | 0
Qwen1.5-7B-IMDB-Q1-2000-Q2-1000 | 2 | 0
Qwen1.5-7B-IMDB-Q1-2000-Q2-200 | 2 | 0
Qwen1.5-7B-IMDB-Q1-2000-Q2-2000 | 2 | 0
Qwen1.5-7B-IMDB-Q1-2000-Q2-500 | 2 | 0
tinyllama-3T-IMDB-Q1-1000-Q2-1000 | llama | 2 | 0
tinyllama-3T-IMDB-Q1-1000-Q2-500 | llama | 2 | 0
tinyllama-3T-IMDB-Q1-5000-Q2-2000 | llama | 2 | 0
Beaver-0.5B-Instruct | license:apache-2.0 | 1 | 1
ProgressGym-HistLlama3-70B-C014-instruct-v0.1 | llama | 1 | 0
ProgressGym-HistLlama3-70B-C018-instruct-v0.1 | llama | 1 | 0
ProgressGym-HistLlama3-70B-C020-instruct-v0.1 | llama | 1 | 0
ProgressGym-HistLlama3-70B-C013-pretrain-v0.1 | llama | 1 | 0
ProgressGym-HistLlama3-70B-C016-pretrain-v0.1 | llama | 1 | 0
ProgressGym-HistLlama3-70B-C019-pretrain-v0.1 | llama | 1 | 0
ProgressGym-HistLlama3-8B-C021-pretrain-v0.2 | llama | 1 | 0
alpaca-70b-reproduced-llama-3 | llama | 1 | 0
s1-m_7b_beta | license:cc-by-nc-4.0 | 1 | 0

TruthfulJudge | license:apache-2.0 | 1 | 0

TruthfulJudge is a reliable evaluation pipeline designed to mitigate the pitfalls of AI-as-judge setups. The methodology emphasizes in-depth human involvement to prevent feedback loops of hallucinated errors, ensuring faithful assessment of multimodal model truthfulness. The specialized judge model, TruthfulJudge, is well calibrated (ECE = 0.11), self-consistent, and shows high inter-annotator agreement (Cohen's κ = 0.79), achieving 88.4% judge accuracy. It is a pairwise critique-label judge trained to express a preference between two responses to open-ended questions from the TruthfulVQA dataset. The model outputs a structured response with three components:
- a detailed analysis of the two responses;
- a verdict, either 'A' or 'B', indicating which response is better;
- a confidence score between 0 and 1 for the judgment.
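The usage example originally attached to this card did not survive extraction, and neither did the exact field markers of the structured output. As a stand-in, here is a hedged sketch of parsing a judgment with that three-part shape; the `<analysis>`, `<verdict>`, and `<confidence>` tags are hypothetical placeholders, not TruthfulJudge's confirmed format.

```python
# Hedged sketch of consuming a TruthfulJudge-style structured judgment.
# The tag names are hypothetical; check the model card for the real format.
import re

def parse_judgment(text: str) -> dict:
    """Extract the analysis, the 'A'/'B' verdict, and a confidence in [0, 1]."""
    def grab(tag: str) -> str:
        m = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
        return m.group(1).strip() if m else ""

    verdict = grab("verdict")
    confidence = float(grab("confidence") or "nan")
    assert verdict in ("A", "B"), "judge must pick one of the two responses"
    assert 0.0 <= confidence <= 1.0, "confidence must lie in [0, 1]"
    return {"analysis": grab("analysis"), "verdict": verdict,
            "confidence": confidence}

raw = ("<analysis>Response A describes the image correctly.</analysis>"
       "<verdict>A</verdict><confidence>0.92</confidence>")
print(parse_judgment(raw)["verdict"])  # A
```

Validating the verdict and confidence at parse time keeps a malformed judge output from silently entering an evaluation pipeline.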

Qwen1.5-0.5B-IMDB-Q1-1000 | 1 | 0
Qwen1.5-0.5B-IMDB-Q1-1000-Q2-200 | 1 | 0
Qwen1.5-0.5B-IMDB-Q1-10000-Q2-2000 | 1 | 0
Qwen1.5-0.5B-IMDB-Q1-2000 | 1 | 0
Qwen1.5-0.5B-IMDB-Q1-5000-Q2-2000 | 1 | 0
Qwen1.5-4B-IMDB-Q1-10000-Q2-1000 | 1 | 0
Llama-2-7b-hf-Safety-Q1-1k-Q2-100 | llama | 1 | 0
Llama-2-7b-hf-Safety-Q1-1k-Q2-500 | llama | 1 | 0
Llama-2-7b-hf-Safety-Q1-20k-Q2-100 | llama | 1 | 0
Qwen1.5-0.5B-Safety-Q1-10k-Q2-5k | 1 | 0
Qwen1.5-0.5B-Safety-Q1-1k-Q2-1k | 1 | 0
Qwen1.5-0.5B-Safety-Q1-20k-Q2-2k | 1 | 0
Qwen1.5-0.5B-Safety-Q1-20k-Q2-500 | 1 | 0
Qwen1.5-0.5B-Safety-Q1-2k-Q2-100 | 1 | 0
Qwen1.5-0.5B-Safety-Q1-2k-Q2-200 | 1 | 0
Qwen1.5-0.5B-Safety-Q1-2k-Q2-2k | 1 | 0
Qwen1.5-0.5B-Safety-Q1-2k-Q2-500 | 1 | 0
Qwen1.5-0.5B-Safety-Q1-2k-Q2-5k | 1 | 0
Qwen1.5-0.5B-Safety-Q1-30k | 1 | 0
Qwen1.5-0.5B-Safety-Q1-30k-Q2-200 | 1 | 0
Qwen1.5-0.5B-Safety-Q1-40k-Q2-100 | 1 | 0
Qwen1.5-0.5B-Safety-Q1-40k-Q2-200 | 1 | 0
Qwen1.5-4B-Safety-Q1-10k-Q2-1k | 1 | 0
Qwen1.5-4B-Safety-Q1-10k-Q2-200 | 1 | 0
Qwen1.5-4B-Safety-Q1-10k-Q2-500 | 1 | 0
Qwen1.5-4B-Safety-Q1-1k-Q2-5k | 1 | 0
Qwen1.5-4B-Safety-Q1-20k-Q2-200 | 1 | 0
Qwen1.5-4B-Safety-Q1-20k-Q2-5k | 1 | 0
Qwen1.5-4B-Safety-Q1-30k | 1 | 0
Qwen1.5-4B-Safety-Q1-30k-Q2-500 | 1 | 0
Qwen1.5-4B-Safety-Q1-40k | 1 | 0
Qwen1.5-4B-Safety-Q1-40k-Q2-1k | 1 | 0
Qwen1.5-4B-Safety-Q1-50k-Q2-100 | 1 | 0
Qwen1.5-4B-Safety-Q1-50k-Q2-1k | 1 | 0
Qwen1.5-4B-Safety-Q1-50k-Q2-2k | 1 | 0
Qwen1.5-4B-Safety-Q1-50k-Q2-500 | 1 | 0
Qwen1.5-4B-Safety-Q1-5k-Q2-500 | 1 | 0
Qwen1.5-7B-Safety-Q1-10k-Q2-500 | 1 | 0
Qwen1.5-7B-Safety-Q1-1k-Q2-200 | 1 | 0
Qwen1.5-7B-Safety-Q1-20k-Q2-100 | 1 | 0
Qwen1.5-7B-Safety-Q1-20k-Q2-1k | 1 | 0
Qwen1.5-7B-Safety-Q1-2k-Q2-2k | 1 | 0
Qwen1.5-7B-Safety-Q1-2k-Q2-500 | 1 | 0
Qwen1.5-7B-Safety-Q1-40k-Q2-2k | 1 | 0
Qwen1.5-7B-Safety-Q1-50k-Q2-1k | 1 | 0
Qwen1.5-7B-Safety-Q1-5k-Q2-100 | 1 | 0
Qwen1.5-7B-Safety-Q1-5k-Q2-500 | 1 | 0
tinyllama-1.5T-Safety-Q1-50k-Q2-100 | llama | 1 | 0
tinyllama-1.5T-Safety-Q1-5k | llama | 1 | 0
Qwen1.5-4B-IMDB-Q1-10000-Q2-2000 | 1 | 0
Qwen1.5-4B-IMDB-Q1-2000-Q2-100 | 1 | 0
Qwen1.5-4B-IMDB-Q1-2000-Q2-2000 | 1 | 0
Qwen1.5-4B-IMDB-Q1-5000 | 1 | 0
Qwen1.5-4B-IMDB-Q1-5000-Q2-1000 | 1 | 0
Qwen1.5-4B-IMDB-Q1-5000-Q2-500 | 1 | 0
Qwen1.5-7B-IMDB-Q1-1000 | 1 | 0
tinyllama-1.5T-Safety-Q1-1k-Q2-200 | llama | 1 | 0
tinyllama-1.5T-Safety-Q1-30k-Q2-100 | llama | 1 | 0
tinyllama-1.5T-Safety-Q1-30k-Q2-5k | llama | 1 | 0
tinyllama-1.5T-Safety-Q1-50k-Q2-5k | llama | 1 | 0
tinyllama-1T-Safety-Q1-50k-Q2-500 | llama | 1 | 0
tinyllama-2.5T-Safety-Q1-20k-Q2-500 | llama | 1 | 0
tinyllama-2.5T-Safety-Q1-30k-Q2-500 | llama | 1 | 0
tinyllama-3T-Safety-Q1-10k-Q2-2k | llama | 1 | 0
tinyllama-3T-Safety-Q1-50k-Q2-5k | llama | 1 | 0
Qwen1.5-7B-IMDB-Q1-2000-Q2-100 | 1 | 0
Qwen1.5-7B-IMDB-Q1-5000-Q2-100 | 1 | 0
Qwen1.5-7B-IMDB-Q1-5000-Q2-2000 | 1 | 0
Qwen1.5-7B-IMDB-Q1-5000-Q2-500 | 1 | 0
gemma-2b-IMDB-Q1-5000 | 1 | 0
tinyllama-2T-IMDB-Q1-10000-Q2-1000 | llama | 1 | 0
tinyllama-2T-IMDB-Q1-5000-Q2-1000 | llama | 1 | 0
tinyllama-3T-IMDB-Q1-1000-Q2-200 | llama | 1 | 0
tinyllama-3T-IMDB-Q1-10000-Q2-2000 | llama | 1 | 0
tinyllama-3T-IMDB-Q1-10000-Q2-500 | llama | 1 | 0
tinyllama-3T-IMDB-Q1-2000 | llama | 1 | 0
tinyllama-3T-IMDB-Q1-5000-Q2-100 | llama | 1 | 0
tinyllama-3T-IMDB-Q1-5000-Q2-1000 | llama | 1 | 0
tinyllama-3T-IMDB-Q1-5000-Q2-200 | llama | 1 | 0

SAE V | 0 | 2

(ICML 2025 Poster) SAE-V: Interpreting Multimodal Models for Enhanced Alignment

This repository contains the SAE-V models for the ICML 2025 poster paper "SAE-V: Interpreting Multimodal Models for Enhanced Alignment", including 2 sparse autoencoders (SAE) and 3 sparse autoencoders with vision (SAE-V). See the individual model folders and the source code for more information.

Hyper-parameters are given separately for the SAE and SAE-V of LLaVA-NeXT/Mistral and the SAE and SAE-V of Chameleon/Anole. The differences in training parameters arise because the LLaVA-NeXT-7B model requires more GPU memory to handle vision input, so fewer batches can be cached. For the SAE and SAE-V parameters, we set different hook layers and context sizes based on the distinct architectures of the two models. We also experimented with different feature numbers on both models, but found that only around 30,000 features are actually activated during training. All training runs were conducted until convergence, on 8xA800 GPUs, and we verified that the variations in parameters did not affect the experimental results. The SAE and SAE-V models were developed on top of SAELens-V.
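The loading example referenced by the card did not survive extraction; actual checkpoints are loaded through SAELens-V. As a stand-in, here is a toy, framework-free sketch of the sparse-autoencoder forward pass that SAE/SAE-V apply to hooked activations. The dimensions and weights are illustrative only, not real SAE-V parameters.

```python
# Toy sparse-autoencoder forward pass: encode with ReLU(W_enc @ x + b_enc),
# decode with W_dec @ f + b_dec. Real checkpoints load via SAELens-V.

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu(v):
    return [max(0.0, a) for a in v]

# 2-dim activations -> 4 sparse features -> 2-dim reconstruction (toy sizes;
# the real models activate around 30,000 features).
W_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]  # rows: features
b_enc = [0.0, 0.0, 0.0, 0.0]
W_dec = [[1.0, 0.0, -1.0, 0.0], [0.0, 1.0, 0.0, -1.0]]      # rows: dims
b_dec = [0.0, 0.0]

def sae_forward(x):
    """Return (sparse feature activations, reconstructed activation)."""
    feats = relu([h + b for h, b in zip(matvec(W_enc, x), b_enc)])
    recon = [r + b for r, b in zip(matvec(W_dec, feats), b_dec)]
    return feats, recon

feats, recon = sae_forward([0.5, -2.0])
print(feats)  # [0.5, 0.0, 0.0, 2.0] -- only 2 of 4 features fire
print(recon)  # [0.5, -2.0] -- this toy dictionary reconstructs exactly
```

The ReLU is what makes the feature vector sparse: most features sit at zero, and the few that fire are the interpretable directions the paper analyzes.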