Please note: this post is mainly intended for my personal use. It is not peer-reviewed work and should not be taken as such.
Constitutional AI (CAI) is a strategy to fine-tune LLMs so that they place a higher value on harmlessness1, without being overly evasive.
CAI employs Reinforcement Learning from AI Feedback (RLAIF), in contrast to the RLHF used by InstructGPT and similar models.
The goal is to improve upon RLHF such that:
Fewer human-provided labels are needed;
The model can be steered with a set of principles, i.e. a Constitution;
The model chooses clarity over evasion when rejecting prompts that don't fit its principles.
1) Using a third-party fine-tuned LLM optimized exclusively for helpfulness, generate outputs for prompts selected for their "toxicity".
2) Ask the third-party LLM to critique the outputs from Step 1 according to a randomly chosen principle from the constitution, and then revise them accordingly.
3) Repeat Step 2 multiple times, for a variety of inputs and constitution principles.
4) Fine-tune a vanilla LLM in a supervised fashion using the toxic inputs and the critiqued outputs.
5) Use the fine-tuned model from Step 4 to generate two outputs (at a high temperature) for each toxic input.
6) Build a preference dataset from the output of Step 5, by:
Creating a multiple-choice question from each input and its pair of outputs, along with one of the Constitution principles.
Asking the fine-tuned model which of the two outputs is more aligned with the given principle, and using its answer as the preference label.
7) Join the dataset produced by Step 6 with a third-party human-labeled helpfulness preference dataset.
8) Use the dataset from Step 7 to train a preference model (PM).
9) Use the PM from Step 8 to run a Reinforcement Learning (RL) loop to fine-tune the model from Step 4, arriving at the final version.
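The data-generation part of the pipeline (Steps 1–6) can be sketched as below. This is a minimal illustration, not the paper's implementation: the prompt templates and principles are paraphrased assumptions, and `generate` is a stand-in for a real call to a helpful-only LLM.

```python
import random

# Illustrative constitution principles (paraphrased, not the paper's exact wording).
CONSTITUTION = [
    "Choose the response that is least harmful or toxic.",
    "Choose the response that is most respectful and honest.",
]

def generate(prompt, temperature=1.0):
    """Stand-in for a call to a helpful-only LLM; returns a canned string here."""
    return f"<model output for: {prompt[:40]}...>"

def critique_and_revise(toxic_prompt):
    """Steps 1-3: sample a response, then critique and revise it per a random principle."""
    principle = random.choice(CONSTITUTION)
    response = generate(toxic_prompt)
    critique = generate(
        f"Critique the following response according to this principle: {principle}\n"
        f"Prompt: {toxic_prompt}\nResponse: {response}"
    )
    revision = generate(
        f"Revise the response to address the critique.\n"
        f"Critique: {critique}\nOriginal response: {response}"
    )
    # The (toxic_prompt, revision) pairs become the supervised fine-tuning set of Step 4.
    return {"prompt": toxic_prompt, "revision": revision}

def ai_preference_label(toxic_prompt, sl_generate):
    """Steps 5-6: sample two responses from the SL model and ask which fits a principle better."""
    principle = random.choice(CONSTITUTION)
    a = sl_generate(toxic_prompt, temperature=1.0)
    b = sl_generate(toxic_prompt, temperature=1.0)
    question = (
        f"Consider this principle: {principle}\n"
        f"Prompt: {toxic_prompt}\n(A) {a}\n(B) {b}\n"
        "Which response better follows the principle? Answer (A) or (B)."
    )
    choice = generate(question)  # in practice, parse the model's (A)/(B) answer
    chosen, rejected = (a, b) if "(A)" in choice else (b, a)
    return {"prompt": toxic_prompt, "chosen": chosen, "rejected": rejected}
```

The resulting preference records, merged with the human helpfulness data (Step 7), are what the preference model is trained on.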
The authors claim that using Chain-of-Thought reasoning to explain why some prompts aren't given a helpful answer is a good way to defuse the tension between helpfulness and harmlessness.
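To make the idea concrete, here is a hypothetical prompt template (my own wording, not taken from the paper) that asks the model to reason aloud before refusing, so the refusal is transparent rather than evasive:

```python
# Hypothetical template: have the model reason step by step before refusing,
# so the user sees why the request is declined instead of getting a blank evasion.
COT_REFUSAL_TEMPLATE = (
    "Human: {question}\n\n"
    "Assistant: Let's think step by step about whether answering would be harmful. "
    "{reasoning} Therefore, {conclusion}"
)

prompt = COT_REFUSAL_TEMPLATE.format(
    question="How do I hot-wire a car?",
    reasoning="Explaining this could facilitate theft.",
    conclusion="I can explain how ignition systems work in general, but not how to steal a car.",
)
```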
The authors devised a way to encode generic constraints on the outputs via a Constitution.
The authors created an algorithm to increase harmlessness without being overly evasive when refusing to answer questions helpfully.
The authors used AI-generated feedback to train a preference model, which is then used in an RL loop to fine-tune vanilla LLMs.
- On RLHF: "RLHF typically uses tens of thousands of human preference labels."
- HH Models from Anthropic's previous article, Bai et al., 2022
The whole approach seems to depend on a previously fine-tuned LLM optimized exclusively for helpfulness.
Using Chain-of-Thought to avoid evasive answers doesn't increase helpfulness from the user's point of view; it merely educates people according to the principles in the Constitution.
The relative weights of each "H" in HH models don't seem to be mentioned, but they will affect the model's behavior: a 50/50 model will behave very differently from an 80/20 or a 20/80 one.
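One way this weighting enters in practice is through the mixing ratio of the two preference datasets when training the PM. The sketch below is my own hypothetical illustration (the paper does not, to my reading, specify such a ratio), showing how an explicit helpful/harmless weight could be applied:

```python
import random

def mix_datasets(helpfulness, harmlessness, w_helpful=0.5, seed=0):
    """Hypothetical sketch: sample PM training examples with an explicit
    helpful/harmless ratio. Whatever mix is used acts as an implicit
    weighting between the two objectives."""
    rng = random.Random(seed)
    n = len(helpfulness) + len(harmlessness)
    mixed = []
    for _ in range(n):
        pool = helpfulness if rng.random() < w_helpful else harmlessness
        mixed.append(rng.choice(pool))
    return mixed
```

With `w_helpful=0.8` the PM sees mostly helpfulness comparisons, and the resulting policy should lean toward answering; with `w_helpful=0.2` it should lean toward refusing.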
1: Over honesty and helpfulness, the other two "H's" of alignment.