PKU-Alignment
alpaca-8b-reproduced-llama-3
beaver-dam-7b
Boasting 7 billion parameters, Beaver-Dam-7B is a powerful QA-Moderation model derived from the Llama-7B base model and trained on the PKU-Alignment/BeaverTails Classification Dataset. Beaver-Dam's key feature is its ability to analyze responses to prompts for toxicity across 14 different categories.

- Developed by: PKU-Alignment Team
- Model type: QA moderation
- License: Non-commercial license
- Finetuned from model: LLaMA
- Repository: https://github.com/PKU-Alignment/beavertails
- Web: https://sites.google.com/view/pku-beavertails
- Paper: Coming soon

Traditional approaches to content moderation in Question-Answering (QA) tasks often gauge the toxicity of a QA pair by examining each utterance individually. While effective to a degree, this method can inadvertently discard a significant number of user prompts: if the moderation system perceives a prompt as too harmful, it prevents the language model from generating any response, interrupting the user experience and potentially hindering the evolution of a beneficial AI with human-like understanding. BeaverDam is a shift in the approach to content moderation for QA tasks, a concept we term "QA moderation": in this paradigm, a QA pair is classified as harmful or benign based on its degree of risk neutrality, i.e., the extent to which the potential risks in a potentially harmful question can be counteracted by a non-threatening response.
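As a toy illustration of the difference between the two paradigms (not the Beaver-Dam model itself — the keyword "classifier" below is a hypothetical stand-in for a real toxicity model):

```python
# Toy contrast between per-utterance moderation and QA moderation.
# RISKY_TERMS and both classifiers are hypothetical illustrations only.

RISKY_TERMS = {"steal", "weapon", "hack"}

def utterance_is_harmful(text: str) -> bool:
    """Hypothetical per-utterance classifier: flags any risky keyword."""
    return any(term in text.lower() for term in RISKY_TERMS)

def qa_pair_is_harmful(question: str, answer: str) -> bool:
    """Hypothetical QA-moderation classifier: the pair counts as harmful
    only if the answer engages with the risk instead of neutralizing it."""
    return utterance_is_harmful(answer)

question = "How do I hack into my neighbor's wifi?"
safe_answer = "I can't help with that; unauthorized access is illegal."

# Per-utterance moderation rejects the prompt outright...
print(utterance_is_harmful(question))             # True -> prompt discarded
# ...while QA moderation keeps the pair: the response neutralizes the risk.
print(qa_pair_is_harmful(question, safe_answer))  # False -> pair is benign
```

Under per-utterance moderation the prompt never reaches the model; under QA moderation the model is free to answer, and only a response that itself carries risk is flagged.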
alpaca-7b-reproduced
beaver-7b-v1.0-reward
beaver-7b-v1.0-cost
alpaca-7b-reproduced-llama-2
AA-chameleon-7b-base
ProgressGym-HistLlama3-8B-C013-instruct-v0.2
beaver-7b-v3.0-cost
beaver-7b-unified-reward
beaver-7b-v3.0-reward
beaver-7b-v3.0
ProgressGym-HistLlama3-8B-C017-instruct-v0.2
beaver-7b-unified-cost
ProgressGym-HistLlama3-8B-C016-instruct-v0.2
llama3.1-8b-vision-audio
ProgressGym-HistLlama3-8B-C018-pretrain-v0.2
AnyRewardModel
ProgressGym-HistLlama3-8B-C021-instruct-v0.2
ProgressGym-HistLlama3-8B-C018-instruct-v0.2
ProgressGym-HistLlama3-8B-C014-instruct-v0.2
Beaver 7b V1.0
Beaver is a chat assistant trained on top of the Stanford Alpaca model (reproduced version) using the PKU-Alignment/safe-rlhf library. Beaver was born to study the safety of LLMs (Large Language Models). Compared with its predecessor Alpaca, Beaver relies on Safe-RLHF alignment technology, which avoids outputting harmful content while providing helpful information as much as possible.

- Developed by: the PKU-Alignment Team.
- Model Type: An auto-regressive language model based on the transformer architecture.
- License: Non-commercial license.
- Fine-tuned from model: LLaMA, Alpaca.
- Repository:
- Beaver:
- Dataset:
- Reward Model:
- Cost Model:
- Dataset Paper:
- Paper:
- Using the PKU-Alignment/safe-rlhf GitHub repository.
Qwen1.5-0.5B-IMDB-Q1-10000
Beaver-Vision-11B
ProgressGym-HistLlama3-8B-C013-pretrain-v0.2
ProgressGym-HistLlama3-8B-C015-pretrain-v0.2
Align-DS-V
AA-chameleon-7b-plus
ProgressGym-HistLlama3-70B-C021-pretrain-v0.1
ProgressGym-HistLlama3-70B-C015-instruct-v0.1
ProgressGym-HistLlama3-8B-C019-instruct-v0.2
ProgressGym-HistLlama3-8B-C017-pretrain-v0.2
Qwen1.5-4B-Safety-Q1-1k
ProgressGym-HistLlama3-70B-C013-instruct-v0.1
ProgressGym-HistLlama3-70B-C016-instruct-v0.1
ProgressGym-HistLlama3-70B-C015-pretrain-v0.1
ProgressGym-HistLlama3-70B-C020-pretrain-v0.1
ProgressGym-HistLlama3-8B-C015-instruct-v0.2
ProgressGym-HistLlama3-8B-C020-instruct-v0.2
Qwen1.5-4B-IMDB-Q1-1000-Q2-100
Qwen1.5-7B-Safety-Q1-10k
Qwen1.5-4B-IMDB-Q1-2000-Q2-500
tinyllama-3T-IMDB-Q1-2000-Q2-2000
beaver-7b-v2.0-reward
ProgressGym-HistLlama3-8B-C014-pretrain-v0.2
ProgressGym-HistLlama3-8B-C019-pretrain-v0.2
Qwen1.5-0.5B-IMDB-Q1-2000-Q2-2000
Qwen1.5-4B-IMDB-Q1-1000-Q2-1000
Qwen1.5-0.5B-Safety-Q1-50k
Qwen1.5-7B-Safety-Q1-40k-Q2-500
tinyllama-1.5T-Safety-Q1-5k-Q2-500
Qwen1.5-4B-IMDB-Q1-5000-Q2-200
tinyllama-1.5T-Safety-Q1-2k-Q2-500
tinyllama-1T-Safety-Q1-40k-Q2-1k
tinyllama-3T-Safety-Q1-40k-Q2-100
tinyllama-3T-Safety-Q1-5k-Q2-5k
Qwen1.5-7B-IMDB-Q1-10000-Q2-500
tinyllama-1.5T-IMDB-Q1-1000
tinyllama-2T-IMDB-Q1-5000-Q2-2000
tinyllama-3T-IMDB-Q1-10000-Q2-100
tinyllama-3T-IMDB-Q1-2000-Q2-200
beaver-7b-v2.0
ProgressGym-HistLlama3-70B-C017-instruct-v0.1
ProgressGym-HistLlama3-70B-C019-instruct-v0.1
ProgressGym-HistLlama3-70B-C021-instruct-v0.1
ProgressGym-HistLlama3-70B-C014-pretrain-v0.1
ProgressGym-HistLlama3-70B-C017-pretrain-v0.1
ProgressGym-HistLlama3-8B-C020-pretrain-v0.2
llama3.1-8b-instruct-vision
safe-o1-7b
Qwen1.5-0.5B-IMDB-Q1-10000-Q2-200
Qwen1.5-4B-IMDB-Q1-1000-Q2-200
Llama-2-7b-hf-Safety-Q1-20k
Llama-2-7b-hf-Safety-Q1-20k-Q2-1k
Qwen1.5-0.5B-Safety-Q1-10k-Q2-1k
Qwen1.5-0.5B-Safety-Q1-1k-Q2-100
Qwen1.5-0.5B-Safety-Q1-1k-Q2-500
Qwen1.5-0.5B-Safety-Q1-50k-Q2-1k
Qwen1.5-0.5B-Safety-Q1-50k-Q2-2k
Qwen1.5-4B-Safety-Q1-1k-Q2-2k
Qwen1.5-4B-Safety-Q1-5k
Qwen1.5-7B-Safety-Q1-10k-Q2-100
Qwen1.5-7B-Safety-Q1-2k-Q2-1k
Qwen1.5-7B-Safety-Q1-30k-Q2-100
tinyllama-1T-Safety-Q1-10k-Q2-2k
tinyllama-0.5T-Safety-Q1-5k-Q2-500
tinyllama-1T-Safety-Q1-40k-Q2-100
tinyllama-2.5T-Safety-Q1-1k-Q2-5k
tinyllama-2.5T-Safety-Q1-40k-Q2-500
tinyllama-2T-Safety-Q1-50k-Q2-500
tinyllama-3T-Safety-Q1-5k-Q2-100
tinyllama-3T-Safety-Q1-5k-Q2-1k
tinyllama-3T-Safety-Q1-5k-Q2-2k
tinyllama-3T-Safety-Q1-5k-Q2-500
Qwen1.5-7B-IMDB-Q1-2000-Q2-1000
Qwen1.5-7B-IMDB-Q1-2000-Q2-200
Qwen1.5-7B-IMDB-Q1-2000-Q2-2000
Qwen1.5-7B-IMDB-Q1-2000-Q2-500
tinyllama-3T-IMDB-Q1-1000-Q2-1000
tinyllama-3T-IMDB-Q1-1000-Q2-500
tinyllama-3T-IMDB-Q1-5000-Q2-2000
Beaver-0.5B-Instruct
ProgressGym-HistLlama3-70B-C014-instruct-v0.1
ProgressGym-HistLlama3-70B-C018-instruct-v0.1
ProgressGym-HistLlama3-70B-C020-instruct-v0.1
ProgressGym-HistLlama3-70B-C013-pretrain-v0.1
ProgressGym-HistLlama3-70B-C016-pretrain-v0.1
ProgressGym-HistLlama3-70B-C019-pretrain-v0.1
ProgressGym-HistLlama3-8B-C021-pretrain-v0.2
alpaca-70b-reproduced-llama-3
s1-m_7b_beta
TruthfulJudge
TruthfulJudge is a reliable evaluation pipeline designed to mitigate the pitfalls of AI-as-judge setups. Our methodology emphasizes in-depth human involvement to prevent feedback loops of hallucinated errors, ensuring faithful assessment of multimodal model truthfulness. Our specialized judge model, TruthfulJudge, is well-calibrated (ECE = 0.11), self-consistent, and shows high inter-annotator agreement (Cohen's κ = 0.79), achieving 88.4% judge accuracy. The model is a pairwise critique-label judge trained to express a preference between two responses to open-ended questions from the TruthfulVQA dataset.

The model outputs a structured response with three components:
- ` `: a detailed analysis of the responses
- ` `: either 'A' or 'B', indicating which response is better
- ` `: a score between 0 and 1 indicating the confidence in the judgment
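The original field names of the structured response are not reproduced here, so the sketch below uses hypothetical placeholders (`analysis`, `preference`, `confidence`) for the three components described above, and assumes a JSON-formatted judgment:

```python
import json

def parse_judgment(raw: str) -> dict:
    """Parse a JSON judgment and validate its three components.
    Field names are hypothetical placeholders, not TruthfulJudge's actual keys."""
    judgment = json.loads(raw)
    assert isinstance(judgment["analysis"], str)        # detailed analysis text
    assert judgment["preference"] in ("A", "B")         # which response is better
    assert 0.0 <= judgment["confidence"] <= 1.0         # confidence in [0, 1]
    return judgment

example = ('{"analysis": "Response A describes the image accurately.", '
           '"preference": "A", "confidence": 0.92}')
print(parse_judgment(example)["preference"])  # A
```

Validating the label and confidence range on the consumer side guards against malformed judge outputs before they enter downstream evaluation statistics.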
Qwen1.5-0.5B-IMDB-Q1-1000
Qwen1.5-0.5B-IMDB-Q1-1000-Q2-200
Qwen1.5-0.5B-IMDB-Q1-10000-Q2-2000
Qwen1.5-0.5B-IMDB-Q1-2000
Qwen1.5-0.5B-IMDB-Q1-5000-Q2-2000
Qwen1.5-4B-IMDB-Q1-10000-Q2-1000
Llama-2-7b-hf-Safety-Q1-1k-Q2-100
Llama-2-7b-hf-Safety-Q1-1k-Q2-500
Llama-2-7b-hf-Safety-Q1-20k-Q2-100
Qwen1.5-0.5B-Safety-Q1-10k-Q2-5k
Qwen1.5-0.5B-Safety-Q1-1k-Q2-1k
Qwen1.5-0.5B-Safety-Q1-20k-Q2-2k
Qwen1.5-0.5B-Safety-Q1-20k-Q2-500
Qwen1.5-0.5B-Safety-Q1-2k-Q2-100
Qwen1.5-0.5B-Safety-Q1-2k-Q2-200
Qwen1.5-0.5B-Safety-Q1-2k-Q2-2k
Qwen1.5-0.5B-Safety-Q1-2k-Q2-500
Qwen1.5-0.5B-Safety-Q1-2k-Q2-5k
Qwen1.5-0.5B-Safety-Q1-30k
Qwen1.5-0.5B-Safety-Q1-30k-Q2-200
Qwen1.5-0.5B-Safety-Q1-40k-Q2-100
Qwen1.5-0.5B-Safety-Q1-40k-Q2-200
Qwen1.5-4B-Safety-Q1-10k-Q2-1k
Qwen1.5-4B-Safety-Q1-10k-Q2-200
Qwen1.5-4B-Safety-Q1-10k-Q2-500
Qwen1.5-4B-Safety-Q1-1k-Q2-5k
Qwen1.5-4B-Safety-Q1-20k-Q2-200
Qwen1.5-4B-Safety-Q1-20k-Q2-5k
Qwen1.5-4B-Safety-Q1-30k
Qwen1.5-4B-Safety-Q1-30k-Q2-500
Qwen1.5-4B-Safety-Q1-40k
Qwen1.5-4B-Safety-Q1-40k-Q2-1k
Qwen1.5-4B-Safety-Q1-50k-Q2-100
Qwen1.5-4B-Safety-Q1-50k-Q2-1k
Qwen1.5-4B-Safety-Q1-50k-Q2-2k
Qwen1.5-4B-Safety-Q1-50k-Q2-500
Qwen1.5-4B-Safety-Q1-5k-Q2-500
Qwen1.5-7B-Safety-Q1-10k-Q2-500
Qwen1.5-7B-Safety-Q1-1k-Q2-200
Qwen1.5-7B-Safety-Q1-20k-Q2-100
Qwen1.5-7B-Safety-Q1-20k-Q2-1k
Qwen1.5-7B-Safety-Q1-2k-Q2-2k
Qwen1.5-7B-Safety-Q1-2k-Q2-500
Qwen1.5-7B-Safety-Q1-40k-Q2-2k
Qwen1.5-7B-Safety-Q1-50k-Q2-1k
Qwen1.5-7B-Safety-Q1-5k-Q2-100
Qwen1.5-7B-Safety-Q1-5k-Q2-500
tinyllama-1.5T-Safety-Q1-50k-Q2-100
tinyllama-1.5T-Safety-Q1-5k
Qwen1.5-4B-IMDB-Q1-10000-Q2-2000
Qwen1.5-4B-IMDB-Q1-2000-Q2-100
Qwen1.5-4B-IMDB-Q1-2000-Q2-2000
Qwen1.5-4B-IMDB-Q1-5000
Qwen1.5-4B-IMDB-Q1-5000-Q2-1000
Qwen1.5-4B-IMDB-Q1-5000-Q2-500
Qwen1.5-7B-IMDB-Q1-1000
tinyllama-1.5T-Safety-Q1-1k-Q2-200
tinyllama-1.5T-Safety-Q1-30k-Q2-100
tinyllama-1.5T-Safety-Q1-30k-Q2-5k
tinyllama-1.5T-Safety-Q1-50k-Q2-5k
tinyllama-1T-Safety-Q1-50k-Q2-500
tinyllama-2.5T-Safety-Q1-20k-Q2-500
tinyllama-2.5T-Safety-Q1-30k-Q2-500
tinyllama-3T-Safety-Q1-10k-Q2-2k
tinyllama-3T-Safety-Q1-50k-Q2-5k
Qwen1.5-7B-IMDB-Q1-2000-Q2-100
Qwen1.5-7B-IMDB-Q1-5000-Q2-100
Qwen1.5-7B-IMDB-Q1-5000-Q2-2000
Qwen1.5-7B-IMDB-Q1-5000-Q2-500
gemma-2b-IMDB-Q1-5000
tinyllama-2T-IMDB-Q1-10000-Q2-1000
tinyllama-2T-IMDB-Q1-5000-Q2-1000
tinyllama-3T-IMDB-Q1-1000-Q2-200
tinyllama-3T-IMDB-Q1-10000-Q2-2000
tinyllama-3T-IMDB-Q1-10000-Q2-500
tinyllama-3T-IMDB-Q1-2000
tinyllama-3T-IMDB-Q1-5000-Q2-100
tinyllama-3T-IMDB-Q1-5000-Q2-1000
tinyllama-3T-IMDB-Q1-5000-Q2-200
SAE V
(ICML 2025 Poster) SAE-V: Interpreting Multimodal Models for Enhanced Alignment

This repository contains the SAE-V models for our ICML 2025 poster paper "SAE-V: Interpreting Multimodal Models for Enhanced Alignment", including two sparse autoencoders (SAE) and three sparse autoencoders with vision (SAE-V). See the individual model folders and the source code for more information.

Hyper-parameters:
- SAE and SAE-V of LLaVA-NeXT/Mistral
- SAE and SAE-V of Chameleon/Anole

The differences in training parameters arise because the LLaVA-NeXT-7B model requires more GPU memory to handle vision input, so fewer batches can be cached. For the SAE and SAE-V parameters, we set different hook layers and context sizes based on the distinct architectures of the two models. We also experimented with different feature counts on both models, but found that only around 30,000 features are actually activated during training. All training runs were conducted until convergence, on 8xA800 GPUs. We verified that the variations in parameters did not affect the experimental results. The SAE and SAE-V models are developed based on SAELens-V; see the repository for a loading example.