http://queirozf.com/
queirozf.com - Main Entries Feed
2024-03-09T04:11:45-03:00
Technology reference and information archive.
Felipe
Jekyll

http://queirozf.com/entries/paper-summary-learning-to-summarize-from-human-feedback
Paper Summary: Learning to summarize from human feedback
2024-03-09T00:00:00-03:00
Felipe
<blockquote>
<p><span style="font-weight:bold">Please note</span> This post is mainly intended for my <strong>personal use</strong>. It is not peer-reviewed work and should not be taken as such.</p>
</blockquote>
<p><div class="paper-screenshot-img-div" markdown="1">
<img src="//queirozf.com/images/contents/xIbcAVP.png" alt="alt-text">
<em>Learning to summarize from human feedback <a href="https://arxiv.org/pdf/2009.01325.pdf">Source</a></em>
</div></p>
<h2 id="what">WHAT</h2>
<p>RLHF<sup><a href="#myfootnote1">1</a></sup> is applied to the task of generating <em>abstractive</em> summaries of an input text.</p>
<h2 id="why">WHY</h2>
<p>The authors wanted to extend the work by Ziegler et al 2019, using offline instead of online RL and better managing the labelers.</p>
<h2 id="how">HOW</h2>
<ul>
<li><p>1) Generate and/or collect pairs of summaries and have a human labeler select the better of the two.</p></li>
<li><p>2) Train a Reward model to be able to tell which of a pair of summaries was the better one (see the loss sketch after this list).</p></li>
<li><p>3) Use the Reward model from Step 2 to train an RL model using PPO. </p>
<ul>
<li>I.e.: generate a summary for a post then get its score from the reward model, update the RL model, and repeat.</li>
</ul></li>
</ul>
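<p>A minimal sketch of the pairwise loss used to train the Reward model in Step 2 (PyTorch; the function name and toy scores are illustrative, not from the paper): the model should assign a higher score to the summary the labeler preferred.</p>
<div class="highlight"><pre><code class="language-python" data-lang="python">import torch
import torch.nn.functional as F

def pairwise_reward_loss(score_preferred, score_rejected):
    # negative log-likelihood that the preferred summary wins:
    # -log(sigmoid(r_preferred - r_rejected))
    return -F.logsigmoid(score_preferred - score_rejected).mean()

# toy reward-model scores for a batch of 3 summary pairs
score_preferred = torch.tensor([1.2, 0.3, 2.0])
score_rejected = torch.tensor([0.7, 0.9, 1.1])
print(pairwise_reward_loss(score_preferred, score_rejected))  # scalar loss
</code></pre></div>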
<p><div class="img-div" markdown="1">
<img src="//queirozf.com/images/contents/h1xSGqT.png" alt="rlhf-flow-from-stiennon-et-al">
<em>The RL flow used by Stiennon et al. Remarkably similar to the image from the <a href="https://arxiv.org/pdf/2203.02155.pdf">InstructGPT Paper</a>. <br/> <a href="https://arxiv.org/pdf/2009.01325.pdf">Source</a></em>
</div></p>
<h2 id="claims">CLAIMS</h2>
<ul>
<li>Abstractive summarization with RLHF works much better than previous baselines trained with SFT only.
<ul>
<li>Both in terms of subjective quality and ability to generalize to unseen domains.</li>
</ul></li>
</ul>
<h2 id="quotes">QUOTES</h2>
<ul>
<li><p>On generalization: <em>"Our human feedback models can also generate excellent summaries of CNN/DM news articles without any further training."</em></p></li>
<li><p>On using numeric metrics to measure subjective quality: <em>"We also find that ROUGE fails to track sample quality as our models improve."</em></p></li>
</ul>
<h2 id="extends-uses">EXTENDS/USES</h2>
<ul>
<li><p>The setup is adapted from Ziegler et al 2019.</p></li>
<li><p>Model architecture is based on GPT-3, using 1.3B and 6.7B parameters.</p></li>
<li><p>TL;DR summarization dataset from Reddit.</p></li>
</ul>
<h2 id="notes">NOTES</h2>
<ul>
<li><p>The RL setup is offline, not online as in the previous paper by Ziegler et al.</p></li>
<li><p>The initial generation of summaries is done with a simple LLM, and it's fully <em>in-context</em>.</p></li>
<li><p>The KL-divergence correction in the reward function, which prevents the RL model from finding reward hacks, seems to have been introduced here (see the sketch after this list).</p></li>
</ul>
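<p>The KL correction in the last note can be sketched as follows: the reward handed to PPO is the reward-model score minus a penalty for drifting away from the supervised (SFT) policy. A plain-Python sketch with illustrative numbers, assuming per-summary log-probabilities are available:</p>
<div class="highlight"><pre><code class="language-python" data-lang="python">def kl_corrected_reward(rm_score, logprob_policy, logprob_sft, beta=0.1):
    # R(x, y) = r(x, y) - beta * log(pi_RL(y|x) / pi_SFT(y|x))
    return rm_score - beta * (logprob_policy - logprob_sft)

# the more the policy over-weights this summary relative to the SFT
# model, the larger the penalty
print(kl_corrected_reward(rm_score=1.5, logprob_policy=-10.0, logprob_sft=-14.0))
# 1.5 - 0.1 * 4.0 = 1.1
</code></pre></div>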
<h2 id="my-2">MY 2¢</h2>
<ul>
<li><p>Interesting points about the length vs quality tradeoff. Models that generate longer summaries may be taken to be better but they are kind of <em>cheating</em>.</p></li>
<li><p>The authors say that generating the summaries was done in a <em>zero-shot</em> manner but they then say that they provided examples in the context, which makes it <em>few-shot</em> (<strong>not</strong> zero-shot) :thinking:.</p></li>
<li><p>The authors correctly predict several problems related to hallucinations and possible bias generated by using humans to direct model preferences.</p></li>
</ul>
<hr>
<p><a name="myfootnote1">1</a>: The acronym RLHF is nowhere to be found in this article, however.</p>
<h3 id="references">References</h3>
<ul>
<li><a href="https://arxiv.org/pdf/2009.01325.pdf">Arxiv: Learning to summarize from human feedback</a></li>
</ul>
2024-03-08T22:47:38-03:00

http://queirozf.com/entries/paper-summary-zephyr-direct-distillation-of-lm-alignment
Paper Summary: Zephyr: Direct Distillation of LM Alignment
2024-01-14T00:00:00-03:00
Felipe
<blockquote>
<p><span style="font-weight:bold">Please note</span> This post is mainly intended for my <strong>personal use</strong>. It is not peer-reviewed work and should not be taken as such.</p>
</blockquote>
<p><div class="paper-screenshot-img-div" markdown="1">
<img src="//queirozf.com/images/contents/8N0RD5w.png" alt="alt-text">
<em>Zephyr: Direct Distillation of LM Alignment <a href="https://arxiv.org/pdf/2310.16944.pdf">Source</a></em>
</div></p>
<h2 id="what">WHAT</h2>
<p>Authors instruction-tune Mistral-7B vanilla by <em>distillation</em>: using <a href="https://queirozf.com/entries/paper-summary-direct-preference-optimization-your-language-model-is-secretly-a-reward-model">DPO</a> on open preference datasets and samples generated from previously aligned teacher models.</p>
<h2 id="why">WHY</h2>
<p>Because traditional distillation strategies are only good at transferring stylistic — not alignment capabilities.</p>
<h2 id="how">HOW</h2>
<p>Starting with Mistral-7B as the V0 model:</p>
<ul>
<li><p>1) Run SFT on V0 using input/output pairs from the UltraChat dataset, generating model V1.</p></li>
<li><p>2) Take inputs from the UltraFeedback dataset and feed each one to several intermediary models (Claude, Falcon, etc.), generating multiple output variations for the same input.</p></li>
<li><p>3) For each input from step 2, feed all the output variations to the teacher model (GPT-4) and ask it to select the best one (see the sketch after this list).</p></li>
<li><p>4) Use DPO to align model V1, using the best output for each input, as selected in step 3.<sup><a href="#myfootnote1">1</a></sup></p></li>
</ul>
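<p>Steps 2-4 amount to building a preference dataset with no human labelers. A rough sketch of that bookkeeping (the <code>models</code> and <code>teacher_score</code> callables are stand-ins for the intermediary models and GPT-4, not actual APIs):</p>
<div class="highlight"><pre><code class="language-python" data-lang="python">import random

def build_preference_pairs(prompts, models, teacher_score):
    # for each prompt: one completion per model, teacher scores them,
    # keep (best, random other) as a (chosen, rejected) pair
    pairs = []
    for prompt in prompts:
        completions = [model(prompt) for model in models]
        scores = [teacher_score(prompt, c) for c in completions]
        best = completions[scores.index(max(scores))]
        rejected = random.choice([c for c in completions if c is not best])
        pairs.append({"prompt": prompt, "chosen": best, "rejected": rejected})
    return pairs

# stub example: two fake "models" and a teacher that prefers longer outputs
models = [lambda p: p + " short answer", lambda p: p + " a much longer answer"]
teacher = lambda prompt, completion: len(completion)
print(build_preference_pairs(["Explain DPO."], models, teacher))
</code></pre></div>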
<h2 id="claims">CLAIMS</h2>
<ul>
<li><p>It's possible to transfer alignment capabilities from teacher models using the suggested approach.</p></li>
<li><p>The DPO model quickly overfits when trained for longer.</p></li>
<li><p>Zephyr-7B outperforms 70B models (such as Llama-chat-70B) on some benchmarks.</p></li>
</ul>
<h2 id="quotes">QUOTES</h2>
<ul>
<li><em>"... without an initial SFT step ... models are not able to learn at all from feedback and perform terribly."</em>
<ul>
<li>This is interesting. We can't jump to reward modeling without the initial SFT step.</li>
</ul></li>
</ul>
<h2 id="extends-uses">EXTENDS/USES</h2>
<ul>
<li><p>Mistral-7B</p></li>
<li><p>Other aligned LLMs as teachers: Claude, Falcon, Llama, GPT-4.</p></li>
</ul>
<h2 id="notes">NOTES</h2>
<ul>
<li><p><em>Distillation</em> appears to be the default term for extracting the capabilities of a "teacher" model into a simpler and cheaper "student" model. Apparently it was introduced by <a href="https://arxiv.org/abs/1503.02531">Hinton et al 2015</a>.</p></li>
<li><p>Zephyr-7B was fully optimized for Helpfulness only.</p></li>
</ul>
<hr>
<p><a name="myfootnote1">1</a>: More precisely, DPO is optimized using the best response to each each, but contrasting it to a randomly chosen response. It doesn't classify response, it <em>ranks</em> them</p>
2024-01-01T21:52:09-03:00

http://queirozf.com/entries/examples-installing-and-updating-packages-with-apt
Examples: Installing and Updating Packages with Apt
2024-01-01T00:00:00-03:00
Felipe
<blockquote>
<p><span style="color:red; font-weight:bold">WIP Alert</span> This is a work in progress. Current information is correct but more content may be added in the future.</p>
</blockquote>
<h2 id="list-installed-packages">List installed packages</h2>
<p>Use <code>apt list --installed</code>, optionally piping through <code>grep</code> to limit the search.</p>
<div class="highlight"><pre><code class="language-" data-lang="">$ apt list --installed | grep qua
pngquant/focal,now 2.12.2-1 amd64 [installed]
quarto/now 1.3.433 amd64 [installed,local]
</code></pre></div>
<h2 id="install-deb-package">Install .deb package</h2>
<p><code>sudo dpkg -i my-file.deb</code> </p>
<div style="text-align:center; margin-bottom: 7px; margin-top: 7px; min-height: 287px !important; height: 287px !important;">
<script async src="//pagead2.googlesyndication.com/pagead/js/adsbygoogle.js"></script>
<!-- ab_test_large_square -->
<ins class="adsbygoogle"
style="display:inline-block;width:336px;height:280px"
data-ad-client="ca-pub-2217532725941275"
data-ad-slot="7164375745"></ins>
<script>
(adsbygoogle = window.adsbygoogle || []).push({});
</script>
</div>
<h2 id="uninstall-deb-package">Uninstall .deb package</h2>
<blockquote>
<p>To find out the package name, <a href="#get-package-name-from-deb-file">get package from .deb file</a></p>
</blockquote>
<p>Run <code>$ sudo apt remove my-package-name</code> where <code>my-package-name</code> is the name of the package installed by the <code>.deb</code> file. </p>
<h2 id="get-package-name-from-deb-file">Get package name from .deb file</h2>
<p>Run <code>$ dpkg --info my_deb_file.deb</code></p>
<div class="highlight"><pre><code class="language-" data-lang="">$ dpkg --info quarto-1.3.433-linux-amd64.deb | grep Package:
Package: quarto
</code></pre></div>
2023-12-31T21:15:40-03:00

http://queirozf.com/entries/git-examples-reverting-a-file-from-a-branch
Git Examples: Reverting a File from a Branch
2023-12-27T00:00:00-03:00
Felipe
<blockquote>
<p>Git version 2.x used</p>
</blockquote>
<h2 id="revert-file-from-branch">Revert file from branch</h2>
<p>Use <code>git checkout</code> to retrieve a file from <code>my-other-branch</code></p>
<div class="highlight"><pre><code class="language-" data-lang="">$ git checkout my-other-branch -- path/to/file
</code></pre></div>
<div style="text-align:center; margin-bottom: 7px; margin-top: 7px; min-height: 287px !important; height: 287px !important;">
<script async src="//pagead2.googlesyndication.com/pagead/js/adsbygoogle.js"></script>
<!-- ab_test_large_square -->
<ins class="adsbygoogle"
style="display:inline-block;width:336px;height:280px"
data-ad-client="ca-pub-2217532725941275"
data-ad-slot="7164375745"></ins>
<script>
(adsbygoogle = window.adsbygoogle || []).push({});
</script>
</div>
<h2 id="revert-file-from-remote-branch-using-git-log-and-checkout">Revert file from remote branch (using git log and checkout)</h2>
<ul>
<li><p>Retrieve the last commit hash of the remote branch</p>
<div class="highlight"><pre><code class="language-" data-lang="">$ git log origin/main
commit 123456abcde (origin/main, origin/HEAD, main)
Author: John Doe <john-doe@example.com>
Date: Mon Jul 3 12:44:25 2023 -0300
some commit message
</code></pre></div></li>
<li><p>Revert file to that commit hash with <code>git checkout</code> (like <a href="#restore-file-from-previous-commit">Restore file from Previous commit</a>)</p>
<div class="highlight"><pre><code class="language-" data-lang="">$ git checkout 123456abcde -- path/to/your/file
</code></pre></div></li>
</ul>
<h2 id="revert-file-from-remote-branch-using-fetch-and-checkout">Revert file from remote branch (using fetch and checkout)</h2>
<ul>
<li><p>Use git fetch to make sure your local branches are up to date</p>
<div class="highlight"><pre><code class="language-" data-lang="">$ git fetch --all
</code></pre></div></li>
<li><p>Then just use checkout to reset the file to <code>origin/my-other-branch</code></p>
<div class="highlight"><pre><code class="language-" data-lang="">$ git checkout origin/my-other-branch -- path/to/your/file
</code></pre></div></li>
</ul>
2023-12-26T23:49:38-03:00

http://queirozf.com/entries/paper-summary-constitutional-ai
Paper Summary: Constitutional AI
2023-11-20T00:00:00-03:00
Felipe
<blockquote>
<p><span style="font-weight:bold">Please note</span> This post is mainly intended for my <strong>personal use</strong>. It is not peer-reviewed work and should not be taken as such.</p>
</blockquote>
<p><div class="paper-screenshot-img-div" markdown="1">
<img src="//queirozf.com/images/contents/5L0Csyz.png" alt="front-page-of-article-constitutional-ai">
<em>Constitutional AI <a href="https://arxiv.org/pdf/2212.08073.pdf">Source</a></em>
</div></p>
<h2 id="what">WHAT</h2>
<p>Constitutional AI (CAI) is a strategy for fine-tuning LLMs so that they place a higher value on harmlessness<sup><a href="#myfootnote1">1</a></sup> without being overly evasive.</p>
<p>CAI employs Reinforcement Learning from AI Feedback (RLAIF), standing in contrast to the RLHF used by <a href="https://queirozf.com/entries/paper-summary-training-language-models-to-follow-instructions-with-human-feedback">InstructGPT</a> and similar models.</p>
<h2 id="why">WHY</h2>
<p>To improve upon RLHF, such that:</p>
<ul>
<li><p>Fewer human-provided labels are needed;</p></li>
<li><p>The model can be <em>steered</em> with a set of principles, i.e. a <em>Constitution</em>;</p></li>
<li><p>The model chooses clarity over <em>evasion</em> when rejecting prompts that conflict with its principles.</p></li>
</ul>
<h2 id="how">HOW</h2>
<p>1) Using a third-party fine-tuned LLM optimized exclusively for helpfulness, generate outputs for prompts selected for their "toxicity".</p>
<p>2) Ask the third-party LLM to critique and then revise the outputs from Step 1 according to a randomly chosen principle in the constitution.</p>
<p>3) Repeat step 2 multiple times, for a variety of inputs and constitution principles.</p>
<p>4) Fine-tune a vanilla LLM in a supervised fashion using the toxic inputs and the critiqued outputs.</p>
<p>5) Use the fine-tuned model from Step 4 to generate two outputs (at a high temperature) for each toxic input.</p>
<p>6) Build a preference dataset from the output of Step 5, by:</p>
<ul>
<li><p>Creating a multiple-choice question with each input-output pair along with one of the Constitution principles.</p></li>
<li><p>Asking the fine-tuned model which of those two outputs is more aligned with the given principle and using its answer as the label (see the sketch after these steps).</p></li>
</ul>
<p>7) Join the dataset produced by Step 6 with a third-party human-labeled helpfulness preference dataset.</p>
<p>8) Use the dataset from Step 7 to train a preference model (PM).</p>
<p>9) Use the PM from Step 8 to run a Reinforcement Learning (RL) loop to fine-tune the model from Step 4, arriving at the final version.</p>
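<p>Step 6 is the "AI feedback" part: the model itself produces the preference labels. A minimal sketch of one labeling round (the <code>ask_model</code> callable is a placeholder for the fine-tuned model, and the prompt wording is illustrative, not the paper's):</p>
<div class="highlight"><pre><code class="language-python" data-lang="python">def ai_preference_label(ask_model, prompt, output_a, output_b, principle):
    # build a multiple-choice question pairing the two outputs with one
    # randomly drawn constitutional principle; use the answer as the label
    question = (
        "Consider the following principle: " + principle + "\n"
        "Prompt: " + prompt + "\n"
        "(A) " + output_a + "\n(B) " + output_b + "\n"
        "Which response is more aligned with the principle? Answer A or B."
    )
    answer = ask_model(question)
    return output_a if answer.strip().startswith("A") else output_b

# stub example: a fake model that always answers "A"
print(ai_preference_label(lambda q: "A", "some toxic prompt",
                          "output one", "output two",
                          "Choose the less harmful response."))
</code></pre></div>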
<h2 id="claims">CLAIMS</h2>
<ul>
<li><p>Authors claim that using Chain-of-Thought to explain why some inputs aren't given a helpful answer is a good way to defuse the tension between helpfulness and harmlessness.</p></li>
<li><p>Authors devised a way to encode generic constraints to the outputs via a Constitution.</p></li>
<li><p>Authors created an algorithm to reduce the level of harmfulness while not being overly evasive when refusing to answer questions.</p></li>
<li><p>Authors used AI itself to create a preference model, to be used in an RL loop to fine-tune vanilla LLMs.</p></li>
</ul>
<h2 id="quotes">QUOTES</h2>
<ul>
<li>On RLHF: <em>"RLHF typically uses tens of thousands of human preference labels."</em></li>
</ul>
<h2 id="extends-uses">EXTENDS/USES</h2>
<ul>
<li>HH Models from Anthropic's previous article, <a href="https://arxiv.org/pdf/2204.05862.pdf">Bai et al, 2022</a></li>
</ul>
<h2 id="notes">NOTES</h2>
<ul>
<li>Anthropic's <a href="https://www.anthropic.com/product">Claude</a> was trained using Constitutional AI. The constitution used can be found <a href="https://www.anthropic.com/index/claudes-constitution">here</a>.</li>
</ul>
<h2 id="my-2">MY 2¢</h2>
<ul>
<li><p>The whole thing seems to depend on a previously fine-tuned LLM optimized exclusively for helpfulness.</p></li>
<li><p>Using Chain-of-Thought to avoid evasive answers doesn't increase Helpfulness, from the point of view of the user. It's just trying to educate people according to the principles in the Constitution.</p></li>
<li><p>The relative weights of each "H" in HH models don't seem to be mentioned, but they will affect the model behavior. A 50/50 model will be very different from an 80/20 or a 20/80 model.</p></li>
</ul>
<hr>
<h3 id="footnotes">Footnotes</h3>
<p><a name="myfootnote1">1</a>: Over <em>honesty</em> and <em>helpfulness</em>, the other 2 "H's" of alignment.</p>
<h3 id="references">References</h3>
<ul>
<li><a href="https://arxiv.org/pdf/2212.08073.pdf">Arxiv: Constitutional AI</a></li>
</ul>
2023-11-15T21:20:33-03:00

http://queirozf.com/entries/pytest-examples-handling-exceptions
Pytest Examples: Handling Exceptions
2023-10-29T00:00:00-03:00
Felipe
<blockquote>
<p>Python 3x+, Pytest 7x+ used unless otherwise stated.</p>
</blockquote>
<h2 id="assert-exception-is-raised">Assert exception is raised</h2>
<p>Use <code>with pytest.raises(ValueError):</code> as a context manager:</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="c"># inside my_test.py</span>
<span class="k">def</span> <span class="nf">test_raises_index_error</span><span class="p">():</span>
<span class="c"># test will success if an IndexError is raised</span>
<span class="k">with</span> <span class="n">pytest</span><span class="o">.</span><span class="n">raises</span><span class="p">(</span><span class="nb">IndexError</span><span class="p">):</span>
<span class="n">arr</span> <span class="o">=</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">3</span><span class="p">]</span>
<span class="n">arr</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
</code></pre></div>
<div style="text-align:center; margin-bottom: 7px; margin-top: 7px; min-height: 287px !important; height: 287px !important;">
<script async src="//pagead2.googlesyndication.com/pagead/js/adsbygoogle.js"></script>
<!-- ab_test_large_square -->
<ins class="adsbygoogle"
style="display:inline-block;width:336px;height:280px"
data-ad-client="ca-pub-2217532725941275"
data-ad-slot="7164375745"></ins>
<script>
(adsbygoogle = window.adsbygoogle || []).push({});
</script>
</div>
<h2 id="assert-exception-with-specific-text">Assert exception with specific text</h2>
<p>Use <code>pytest.raises(<class>, match=<regular_expression>)</code>. <code><regular_expression></code> supports whatever you can use in <a href="https://docs.python.org/3/library/re.html#re.search">re.search</a>.</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="c"># inside my_test.py</span>
<span class="k">def</span> <span class="nf">test_raises_specific_exception</span><span class="p">():</span>
<span class="c"># test will success if a ValueError is raised,</span>
<span class="c"># but only if the text contains a number starting with "5"</span>
<span class="c"># (e.g. 500 or 503 HTTP errors)</span>
<span class="k">with</span> <span class="n">pytest</span><span class="o">.</span><span class="n">raises</span><span class="p">(</span><span class="nb">RuntimeError</span><span class="p">,</span> <span class="n">match</span><span class="o">=</span><span class="s">r"5</span><span class="err">\</span><span class="s">d+"</span><span class="p">):</span>
<span class="n">some_code_that_raises_the_exception</span><span class="p">()</span>
</code></pre></div>
2023-10-18T20:08:25-03:00

http://queirozf.com/entries/troubleshooting-colima-start-problems
Troubleshooting Colima Start Problems
2023-10-18T00:00:00-03:00
Felipe
<blockquote>
<p>All examples run on an Intel-MacOS</p>
</blockquote>
<h2 id="waiting-for-the-essential-requirement-1-of-5-ssh">Waiting for the essential requirement 1 of 5: ssh</h2>
<p>This has many possible reasons. In my case, what solves this is running these two commands:</p>
<ul>
<li><p><code>$ colima delete</code></p></li>
<li><p><code>$ colima start --arch x86_64</code></p></li>
</ul>
<p>You will <em>still</em> see the dreaded message once or twice, but it should work.</p>
<hr>
<h3 id="references">References</h3>
<ul>
<li><p><a href="https://github.com/abiosoft/colima/issues/424#issuecomment-1335912905">The original thread on github</a></p></li>
<li><p><a href="https://github.com/abiosoft/colima/issues/777#issuecomment-1676135303">Sometimes it's caused by a regression so you need to downgrade</a></p></li>
</ul>
2023-10-01T13:03:50-03:00

http://queirozf.com/entries/pyenv-and-jupyter-notebook-integration
Pyenv and Jupyter Notebook Integration
2023-09-16T00:00:00-03:00
Felipe
<blockquote>
<p>Examples run on MacOS</p>
</blockquote>
<h2 id="add-pyenv-environment-as-kernel">Add pyenv Environment as Kernel</h2>
<p>Running <a href="//queirozf.com/entries/jupyter-kernels-how-to-add-change-remove#add-virtualenv-as-python-kernel">ipython kernel install</a> <em>alone</em> doesn't seem to work for PyEnv.</p>
<p>Do this instead:</p>
<p>1) Activate the environment: <code>$ pyenv activate my-venv</code></p>
<p>2) Install the kernel with <code>$ ipython kernel install --name "my-venv" --user</code> (This creates a file at <code>~/Library/Jupyter/kernels/my-venv/kernel.json</code>)</p>
<p>3) The created file (on the path above) will probably have the <strong>wrong</strong> path to the Python executable.<sup><a href="#myfootnote1">1</a></sup> Open it and edit it to point to the PyEnv executable:</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"> <span class="p">{</span>
<span class="s">"argv"</span><span class="p">:</span> <span class="p">[</span>
<span class="s">"~/.pyenv/versions/my-venv/bin/python"</span><span class="p">,</span> <span class="c"># <-- HERE</span>
<span class="s">"-m"</span><span class="p">,</span>
<span class="s">"ipykernel_launcher"</span><span class="p">,</span>
<span class="s">"-f"</span><span class="p">,</span>
<span class="s">"{connection_file}"</span>
<span class="p">],</span>
<span class="s">"display_name"</span><span class="p">:</span> <span class="s">"my-venv"</span><span class="p">,</span>
<span class="s">"language"</span><span class="p">:</span> <span class="s">"python"</span><span class="p">,</span>
<span class="s">"metadata"</span><span class="p">:</span> <span class="p">{</span>
<span class="s">"debugger"</span><span class="p">:</span> <span class="n">true</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div>
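<p>To confirm the kernel now points at the PyEnv interpreter, a quick sanity check (not part of the original steps) is to run this in a notebook cell using that kernel:</p>
<div class="highlight"><pre><code class="language-python" data-lang="python">import sys
print(sys.executable)  # should print ~/.pyenv/versions/my-venv/bin/python
</code></pre></div>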
<div style="text-align:center; margin-bottom: 7px; margin-top: 7px; min-height: 287px !important; height: 287px !important;">
<script async src="//pagead2.googlesyndication.com/pagead/js/adsbygoogle.js"></script>
<!-- ab_test_large_square -->
<ins class="adsbygoogle"
style="display:inline-block;width:336px;height:280px"
data-ad-client="ca-pub-2217532725941275"
data-ad-slot="7164375745"></ins>
<script>
(adsbygoogle = window.adsbygoogle || []).push({});
</script>
</div>
<h2 id="troubleshooting-no-module-named-ipykernel_launcher">Troubleshooting: No module named ipykernel_launcher</h2>
<p>You need to install <code>ipykernel</code> in the virtualenv you want to use.</p>
<hr>
<h3 id="references">References</h3>
<p><a name="myfootnote1">1</a>: In my case it pointed to a native Python version: <code>/usr/local/opt/python@3.11/bin/python3.11</code></p>
2023-08-17T12:24:43-03:00

http://queirozf.com/entries/sublime-4-productivity-examples-keymaps-snippets-macros
Sublime 4 Productivity Examples: Keymaps, Snippets, Macros
2023-08-06T00:00:00-03:00
Felipe
<blockquote>
<p>Examples assume Sublime text version 4</p>
<p><a href="https://www.sublimetext.com/docs/key_bindings.html">All sublime-text bindings here</a></p>
</blockquote>
<h2 id="replace-characters">Replace characters</h2>
<p><strong>Example:</strong> Add a new Key Binding to Make <code>"--"</code> expand to <code>"&mdash;"</code></p>
<ul>
<li><p>Open <em>Settings</em> -> <em>Key Bindings</em>. This will open a file such as <code>Default (Linux).sublime-keymap</code></p></li>
<li><p>Add the following to that file:</p>
<div class="highlight"><pre><code class="language-" data-lang="">[
{ "keys": ["-", "-"], "command": "insert", "args": {"characters": "&mdash;"}}
]
</code></pre></div></li>
</ul>
<div style="text-align:center; margin-bottom: 7px; margin-top: 7px; min-height: 287px !important; height: 287px !important;">
<script async src="//pagead2.googlesyndication.com/pagead/js/adsbygoogle.js"></script>
<!-- ab_test_large_square -->
<ins class="adsbygoogle"
style="display:inline-block;width:336px;height:280px"
data-ad-client="ca-pub-2217532725941275"
data-ad-slot="7164375745"></ins>
<script>
(adsbygoogle = window.adsbygoogle || []).push({});
</script>
</div>
<h2 id="run-snippet-on-selection">Run snippet on selection</h2>
<p><strong>Example</strong>: Wrap the selected text with <code>"**"</code> to make markdown text bold when you hit <code>ctrl</code>+<code>b</code></p>
<ul>
<li><p>Open <em>Settings</em> -> <em>Key Bindings</em>.</p></li>
<li><p>Add the following to that file:</p>
<div class="highlight"><pre><code class="language-" data-lang="">[
{ "keys": ["ctrl+b"], "command": "insert_snippet", "args": {"contents": "**${0:$SELECTION}**"}}
]
</code></pre></div></li>
</ul>
2023-08-05T17:21:32-03:00

http://queirozf.com/entries/paper-summary-llama-2-open-foundation-and-fine-tuned-chat-models
Paper Summary: Llama 2: Open Foundation and Fine-Tuned Chat Models
2024-01-14T00:00:00-03:00
Felipe
<blockquote>
<p><span style="font-weight:bold">Please note</span> This post is mainly intended for my <strong>personal use</strong>. It is not peer-reviewed work and should not be taken as such.</p>
</blockquote>
<p><div class="paper-screenshot-img-div" markdown="1">
<img src="//queirozf.com/images/contents/XGmLDJo.png" alt="llama-2-article-cover-arxiv">
<em>Llama 2: Open Foundation and Fine-Tuned Chat Models <a href="https://arxiv.org/pdf/2307.09288.pdf">Source</a></em>
</div></p>
<h2 id="what">WHAT</h2>
<p>Updated version of LLaMA 1 (<a href="https://queirozf.com/entries/paper-summary-llama-open-and-efficient-foundation-language-models">summary</a>) with more data (still fully open), double the context size, and enhanced attention.</p>
<p>Two model variations are published: a vanilla LLM and an instruction-tuned version.</p>
<h2 id="how">HOW</h2>
<ul>
<li><p>LLaMA-2: Similar to LLaMA-1, with 40% more data (only public data), better data cleaning and larger context. One epoch over the training data. Also, enhanced attention.</p></li>
<li><p>LLaMA-2-chat: SFT and RLHF instruction-tuning on top of LLaMA-2.</p></li>
</ul>
<h2 id="claims">CLAIMS</h2>
<ul>
<li><p>Using a smaller but higher-quality preference dataset yields better results.</p></li>
<li><p>RLHF is responsible for most of the increase in instruction-following performance.</p></li>
</ul>
<h2 id="quotes">QUOTES</h2>
<ul>
<li><p>Small but high-quality instruction-following data for SFT: <em>"We found that SFT annotations in the order of tens of thousands was (sic) enough to achieve a high-quality result. We stopped annotating SFT after collecting a total of 27,540 annotations"</em></p></li>
<li><p>Reward model initialization: <em>"We initialize our reward models from pretrained chat model checkpoints, as it ensures that both models benefit from knowledge acquired in pretraining. In short, the reward model “knows” what the chat model knows."</em></p></li>
</ul>
<h2 id="extends-uses">EXTENDS/USES</h2>
<ul>
<li><p>Main architectural decisions from LLaMA-1 (Touvron et al., 2023).</p></li>
<li><p>Grouped-query Attention (GQA), from Ainslie et al., 2023.</p></li>
<li><p>RLHF loop from Instruct-GPT (Ouyang et al., 2022).</p>
<ul>
<li>But they experiment with Rejection Sampling Fine-tuning instead of PPO (see the sketch after this list).<br></li>
</ul></li>
</ul>
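<p>Rejection Sampling Fine-tuning, mentioned in the list above, is conceptually simple: sample several completions per prompt, keep the one the reward model scores highest, and run another SFT pass on those winners. A sketch with stand-in <code>policy</code> and <code>reward</code> callables (not Meta's actual code):</p>
<div class="highlight"><pre><code class="language-python" data-lang="python">def rejection_sample(policy, reward, prompt, k=4):
    # draw k candidates and keep the highest-reward one; the kept
    # (prompt, completion) pairs then feed another round of SFT
    candidates = [policy(prompt) for _ in range(k)]
    return max(candidates, key=lambda c: reward(prompt, c))

# stub example: candidates drawn from a fixed pool, reward = length
pool = iter(["short", "a bit longer", "the longest completion here", "mid"])
policy = lambda prompt: next(pool)
reward = lambda prompt, completion: len(completion)
print(rejection_sample(policy, reward, "Summarize:"))  # picks the longest
</code></pre></div>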
<h2 id="notes">NOTES</h2>
<ul>
<li><p>Just like the DPO paper (<a href="https://queirozf.com/entries/paper-summary-direct-preference-optimization-your-language-model-is-secretly-a-reward-model">summary</a>), the authors used GPT-4 to evaluate the models subjectively.</p></li>
<li><p>Authors tried to decrease hallucination by oversampling known trusted sources.</p></li>
<li><p>Two reward models were trained: one optimized only for helpfulness, the other only for safety.</p></li>
<li><p>The reward model is also a transformer-based LM (but trained for regression instead of predicting the next token).</p></li>
<li><p>Authors introduce a variant of Attention during fine-tuning, called Ghost Attention. The objective is to help the optimizer learn from multi-turn messaging like a chat conversation.</p></li>
<li><p>Authors used red-team adversarial attacks on the model, to test its safety.</p></li>
</ul>
<h2 id="my-2">MY 2¢</h2>
<ul>
<li>PPL (perplexity) shows no sign of saturation as more tokens are used (Figure 5).</li>
</ul>
<hr>
<h3 id="references">References</h3>
<ul>
<li><a href="https://arxiv.org/pdf/2307.09288.pdf">Arxiv: Touvron et al 2023: Llama 2: Open Foundation and Fine-Tuned Chat Models</a></li>
</ul>
2023-08-01T20:41:25-03:00

http://queirozf.com/entries/python-dependency-management-examples-and-reference
Python Dependency Management: Examples and Reference
2023-08-06T00:00:00-03:00
Felipe
<blockquote>
<p><span style="color:red; font-weight:bold">WIP Alert</span> This is a work in progress. Current information is correct but more content may be added in the future.</p>
</blockquote>
<h2 id="get-path-to-site-packages">Get path to site-packages</h2>
<blockquote>
<p>Must activate virtualenv, if applicable</p>
</blockquote>
<p>Run <code>python -m site</code>. It will be listed (usually as the last element).</p>
<div class="highlight"><pre><code class="language-" data-lang=""># python -m site
sys.path = [
'/momo/src/momo',
'/usr/local/lib/python39.zip',
'/usr/local/lib/python3.9',
'/usr/local/lib/python3.9/lib-dynload',
'/usr/local/lib/python3.9/site-packages', <--- HERE
]
USER_BASE: '/root/.local' (doesn't exist)
USER_SITE: '/root/.local/lib/python3.9/site-packages' (doesn't exist)
ENABLE_USER_SITE: True
</code></pre></div>
<div style="text-align:center; margin-bottom: 7px; margin-top: 7px; min-height: 287px !important; height: 287px !important;">
<script async src="//pagead2.googlesyndication.com/pagead/js/adsbygoogle.js"></script>
<!-- ab_test_large_square -->
<ins class="adsbygoogle"
style="display:inline-block;width:336px;height:280px"
data-ad-client="ca-pub-2217532725941275"
data-ad-slot="7164375745"></ins>
<script>
(adsbygoogle = window.adsbygoogle || []).push({});
</script>
</div>
2023-07-22T17:33:08-03:00

http://queirozf.com/entries/paper-summary-deep-reinforcement-learning-from-human-preferences
Paper Summary: Deep Reinforcement Learning from Human Preferences
2023-07-16T00:00:00-03:00
Felipe
<blockquote>
<p><span style="font-weight:bold">Please note</span> This post is mainly intended for my <strong>personal use</strong>. It is not peer-reviewed work and should not be taken as such.</p>
</blockquote>
<p><div class="paper-screenshot-img-div" markdown="1">
<img src="//queirozf.com/images/contents/r4stcNj.png" alt="deep-reinforcement-learning-from-human-preferences-cover">
<em>Deep Reinforcement Learning from Human Preferences <a href="https://arxiv.org/pdf/1706.03741.pdf">Source</a></em>
</div></p>
<h2 id="what">WHAT</h2>
<p>An algorithm to estimate a reward function using human opinions. The function is then optimized in a Reinforcement Learning (RL) setting.</p>
<p>This approach is now called RLHF (Reinforcement Learning from Human Feedback).</p>
<h2 id="why">WHY</h2>
<p>Because it isn't practical to mathematically formulate a reward function for some types of RL problems. But it <em>is</em> possible to ask humans to subjectively rate how <em>preferable</em> a given state is.</p>
<h2 id="how">HOW</h2>
<ul>
<li><p><strong>1)</strong> Show humans pairs of states and ask them to <em>rank</em> these states in terms of desirability (i.e. say which state is preferable);</p></li>
<li><p><strong>2)</strong> Learn a reward function in a supervised manner using the data from step 1;</p></li>
<li><p><strong>3)</strong> Train an RL model using the learned reward function as a proxy for the real reward.</p></li>
</ul>
<h2 id="claims">CLAIMS</h2>
<ul>
<li><p>It is possible to use a learned reward function built from human preferences.</p></li>
<li><p>In some cases, a learned reward function performs better than an actual mathematical reward function.</p></li>
</ul>
<h2 id="extends-uses">EXTENDS/USES</h2>
<ul>
<li>OpenAI Gym</li>
</ul>
<h2 id="notes">NOTES</h2>
<ul>
<li><p>Performance is evaluated on a set of robotics and video-game-playing RL tasks.</p></li>
<li><p>In addition to human feedback, authors also used so-called <em>synthetic</em> feedback—preferences generated from the task's true reward signal rather than from humans.</p></li>
</ul>
<h2 id="my-2">MY 2¢</h2>
<ul>
<li><p>The term "RLHF" is not mentioned in the article.</p></li>
<li><p>RLHF is not introduced in this article. The authors' contributions revolve around making the process more efficient.</p></li>
<li><p>RLHF is relevant for NLP and instruction-tuning because it is not trivial to estimate how <em>appropriate</em> an output is to a given instruction. RLHF can be used to fine-tune a pre-trained LLM.</p></li>
<li><p>There exists a way to produce a reward function from pairwise preference rankings—the Bradley-Terry model (sketched below).</p></li>
</ul>
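<p>The Bradley-Terry model mentioned above is what turns pairwise rankings into a trainable objective: the probability that segment 1 is preferred is a logistic function of the difference in total predicted reward. A plain-Python sketch with toy per-step rewards (not numbers from the paper):</p>
<div class="highlight"><pre><code class="language-python" data-lang="python">import math

def preference_probability(rewards_1, rewards_2):
    # P(segment 1 preferred) = exp(sum r1) / (exp(sum r1) + exp(sum r2)),
    # i.e. sigmoid(sum r1 - sum r2)
    diff = sum(rewards_1) - sum(rewards_2)
    return 1.0 / (1.0 + math.exp(-diff))

print(preference_probability([0.5, 0.7, 0.9], [0.4, 0.4, 0.5]))  # ~0.69
</code></pre></div>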
<hr>
<h3 id="references">References</h3>
<ul>
<li><a href="https://arxiv.org/pdf/1706.03741.pdf">Arxiv: Christiano et al., 2017: Deep Reinforcement Learning from Human Preferences</a></li>
</ul>
2023-07-15T19:48:15-03:00

http://queirozf.com/entries/jenv-examples-on-macos
Jenv Examples on MacOS
2023-07-09T00:00:00-03:00
Felipe
<h2 id="list-available-java-versions">List available java versions</h2>
<div class="highlight"><pre><code class="language-" data-lang="">$ jenv versions
system
1.8
1.8.0.362
19.0
19.0.2
openjdk64-19.0.2
* temurin64-1.8.0.362 (set by /Users/felipe.almeida/.jenv/version)
</code></pre></div>
<div style="text-align:center; margin-bottom: 7px; margin-top: 7px; min-height: 287px !important; height: 287px !important;">
<script async src="//pagead2.googlesyndication.com/pagead/js/adsbygoogle.js"></script>
<!-- ab_test_large_square -->
<ins class="adsbygoogle"
style="display:inline-block;width:336px;height:280px"
data-ad-client="ca-pub-2217532725941275"
data-ad-slot="7164375745"></ins>
<script>
(adsbygoogle = window.adsbygoogle || []).push({});
</script>
</div>
<h2 id="add-java-version-installed-with-homebrew">Add java version installed with homebrew</h2>
<p>Example with OpenJDK 11 installed via <code>brew install openjdk@11</code></p>
<p>Add it to <code>jenv</code>: <code>$ jenv add /usr/local/opt/openjdk@11/libexec/openjdk.jdk/Contents/Home/</code></p>
<h3 id="references">References</h3>
<ul>
<li><a href="https://www.jenv.be/">jenv website</a></li>
</ul>
2023-07-07T15:28:51-03:00

http://queirozf.com/entries/paper-summary-finetuned-language-models-are-zero-shot-learners
Paper Summary: Fine-tuned Language models are Zero-Shot Learners
2023-07-07T00:00:00-03:00
Felipe
<blockquote>
<p><span style="font-weight:bold">Please note</span> This post is mainly intended for my <strong>personal use</strong>. It is not peer-reviewed work and should not be taken as such.</p>
</blockquote>
<p><div class="paper-screenshot-img-div" markdown="1">
<img src="//queirozf.com/images/contents/Ol0e1GY.png" alt="flan-finetuned-models-are-zero-shot-learners">
<em>Finetuned Language models are Zero-Shot Learners <a href="https://arxiv.org/pdf/2109.01652.pdf">Source</a></em>
</div></p>
<h2 id="what">WHAT</h2>
<p>Fine-tune LaMDA-PT 137B with NLP tasks framed as natural language instructions. The final model is called FLAN.</p>
<h2 id="why">WHY</h2>
<p>To understand the impact of instruction-tuning LMs for free-form NLP problems.</p>
<h2 id="how">HOW</h2>
<ul>
<li><p>Took supervised datasets covering 12 NLP task clusters and rewrote them as natural-language instructions (see the sketch after this list).</p></li>
<li><p>Fine-tuned a LaMDA-PT 137B model on the rewritten tasks.</p></li>
<li><p>Compared the results from the fine-tuned model (FLAN) with the pre-trained version (LaMDA-PT) and GPT-3 on several regimes<sup><a href="#myfootnote1">1</a></sup> and tasks.</p></li>
</ul>
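<p>The rewriting in the first step is essentially templating: each supervised example is rendered through several natural-language instruction templates. A toy sketch (the templates are illustrative, not the paper's actual ones):</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"># a toy NLI example rendered through two hypothetical instruction templates
templates = [
    "Premise: {premise}\nHypothesis: {hypothesis}\nDoes the premise entail the hypothesis?",
    "Read the following and answer yes or no: {premise}\nCan we infer that \"{hypothesis}\"?",
]
example = {"premise": "The cat sat on the mat.", "hypothesis": "A cat is sitting."}
for template in templates:
    print(template.format(**example))
</code></pre></div>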
<h2 id="claims">CLAIMS</h2>
<ul>
<li><p>FLAN outperforms GPT-3 (untuned) on most zero-shot tasks.</p></li>
<li><p>FLAN performs better using zero-shot in some tasks than GPT-3 using few-shot examples.</p></li>
<li><p>Instruction-tuning enhances results even on unseen tasks.</p></li>
</ul>
<p><div class="img-div" markdown="1">
<img src="//queirozf.com/images/contents/zFtYA8R.png" alt="effect-of-number-of-params-on-scaling-flan">
<em>Fine-tuning only helps once the pre-trained <br/> model reaches a minimum number of parameters. Under that threshold, <b>fine-tuning <br/> actually hurts performance.</b> <a href="https://arxiv.org/pdf/2109.01652.pdf">Source</a></em>
</div></p>
<h2 id="extends-uses">EXTENDS/USES</h2>
<ul>
<li><p>LaMDA-PT 137B</p></li>
<li><p>Data processing from T5 <a href="https://queirozf.com/entries/paper-summary-exploring-the-limits-of-transfer-learning-with-a-unified-text-to-text-transformer">summary</a></p></li>
<li><p><em>Prompt Tuning</em> (Lester et al., 2021)</p></li>
</ul>
<hr>
<h3 id="references">References</h3>
<p><a name="myfootnote1">1</a>: Zero-shot and few-shot learning.</p>
<ul>
<li><a href="https://arxiv.org/pdf/2109.01652.pdf">Arxiv: Wei et al., 2022: Fine-tuned Language models are Zero-Shot Learners</a></li>
</ul>
2023-07-02T16:39:47-03:00

http://queirozf.com/entries/paper-summary-cross-task-generalization-via-natural-language-crowdsourcing-instructions
Paper Summary: Cross-Task Generalization via Natural Language Crowdsourcing Instructions
2023-07-02T00:00:00-03:00
Felipe
<blockquote>
<p><span style="font-weight:bold">Please note</span> This post is mainly intended for my <strong>personal use</strong>. It is not peer-reviewed work and should not be taken as such.</p>
</blockquote>
<p><div class="paper-screenshot-img-div" markdown="1">
<img src="//queirozf.com/images/contents/i7peRXW.png" alt="mishra-et-al-2022-instruction-following">
<em>Cross-Task Generalization via Natural Language Crowdsourcing Instructions <a href="https://aclanthology.org/2022.acl-long.244.pdf">Source</a></em>
</div></p>
<h2 id="what">WHAT</h2>
<ul>
<li><p>Build a dataset with pairs of high-quality instruction-following examples;</p></li>
<li><p>Measure how fine-tuned models perform when trained to follow those instructions.</p></li>
</ul>
<h2 id="why">WHY</h2>
<ul>
<li><p>To provide a dataset for other people to build upon.</p></li>
<li><p>To examine the tradeoff between fine-tuning a smaller model vs using a much larger model.</p></li>
</ul>
<h2 id="how">HOW</h2>
<ul>
<li><p>Build a dataset with examples of instructions and fine-tune a pre-trained LM on those.</p></li>
<li><p>The dataset consists of instructions and task examples, so models are queried in a <em>few-shot</em> setting.</p></li>
</ul>
<h2 id="claims">CLAIMS</h2>
<ul>
<li><p>LMs fine-tuned for instruction-following can generalize into task instances and even task <em>types</em> not seen in the training dataset.</p></li>
<li><p>A 170M-parameter model (BART), when fine-tuned, is better at following instructions than GPT-3 with 175B parameters.</p></li>
</ul>
<h2 id="extends-uses">EXTENDS/USES</h2>
<ul>
<li>BART LM (Lewis et al., 2019)</li>
</ul>
<h2 id="quotes">QUOTES</h2>
<ul>
<li>Authors didn't try to fine-tune GPT-3, apparently because they didn't have enough compute resources: <em>"We cannot fine-tune the parameters of [GPT-3] and use it as-is under its default setting"</em></li>
</ul>
<h2 id="notes">NOTES</h2>
<ul>
<li><p>Uses ROUGE for evaluation (generated vs actual)</p></li>
<li><p>Examples in the evaluation set are not from <em>different</em> tasks as those in the training set—they are different <em>examples</em> of the same tasks.</p></li>
</ul>
<h2 id="my-2">MY 2¢</h2>
<ul>
<li><p>Why don't people use this instruction dataset more often?</p></li>
<li><p>This is an updated version of a 2021 paper called "Natural Instructions: Benchmarking generalization to new tasks from natural language instructions". It is sometimes referenced by its old name.</p></li>
</ul>
<hr>
<h3 id="references">References</h3>
<ul>
<li><a href="https://aclanthology.org/2022.acl-long.244.pdf">ACL: Mishra et al., 2022: Cross-Task Generalization via Natural Language Crowdsourcing Instructions</a></li>
</ul>
2023-06-25T07:38:04-03:00

http://queirozf.com/entries/python-3-regex-named-capture-examples
Python 3 Regex: Named Capture Examples
2023-06-25T00:00:00-03:00
Felipe
<h2 id="extract-named-capture-groups">Extract Named capture groups</h2>
<blockquote>
<p><code>re.match</code> only matches at the start of the string!</p>
</blockquote>
<p>Extract matches into a dict, using <code>re.match</code> and <code>.groupdict()</code>:</p>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">re</span>
<span class="c"># a word followed by a comma and then another word followed by a period</span>
<span class="n">pattern</span> <span class="o">=</span> <span class="s">r'(?P<param1>[</span><span class="err">\</span><span class="s">w]+),(?P<param2>[</span><span class="err">\</span><span class="s">w]+)</span><span class="err">\</span><span class="s">.'</span>
<span class="n">re</span><span class="o">.</span><span class="n">match</span><span class="p">(</span><span class="n">pattern</span><span class="p">,</span><span class="s">'foo,bar.'</span><span class="p">)</span><span class="o">.</span><span class="n">groupdict</span><span class="p">()</span>
<span class="c"># >>> {'param1': 'foo', 'param2': 'bar'}</span>
</code></pre></div>
<div style="text-align:center; margin-bottom: 7px; margin-top: 7px; min-height: 287px !important; height: 287px !important;">
<script async src="//pagead2.googlesyndication.com/pagead/js/adsbygoogle.js"></script>
<!-- ab_test_large_square -->
<ins class="adsbygoogle"
style="display:inline-block;width:336px;height:280px"
data-ad-client="ca-pub-2217532725941275"
data-ad-slot="7164375745"></ins>
<script>
(adsbygoogle = window.adsbygoogle || []).push({});
</script>
</div>
<h2 id="re-search">Re.search</h2>
<blockquote>
<p><code>re.search</code> matches anywhere in the string!</p>
</blockquote>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">re</span>
<span class="c"># a word followed by a comma and then a word followed by a period</span>
<span class="n">pattern</span> <span class="o">=</span> <span class="s">r'(?P<param1>[</span><span class="err">\</span><span class="s">w]+),(?P<param2>[</span><span class="err">\</span><span class="s">w]+)</span><span class="err">\</span><span class="s">.'</span>
<span class="n">re</span><span class="o">.</span><span class="n">search</span><span class="p">(</span><span class="n">pattern</span><span class="p">,</span><span class="s">'xxx foo,bar.'</span><span class="p">)</span>
<span class="c"># >>> <re.Match object; span=(5, 13), match='foo,bar.'></span>
</code></pre></div>
<h2 id="search-multiple-matches">Search, multiple matches</h2>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">re</span>
<span class="c"># a word followed by a comma and then a word followed by a period</span>
<span class="n">pattern</span> <span class="o">=</span> <span class="s">r'(?P<param1>[</span><span class="err">\</span><span class="s">w]+),(?P<param2>[</span><span class="err">\</span><span class="s">w]+)</span><span class="err">\</span><span class="s">.'</span>
<span class="n">matches</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">search</span><span class="p">(</span><span class="n">pattern</span><span class="p">,</span><span class="s">' foo,bar. aaaand another xxx,yyy.'</span><span class="p">)</span>
<span class="k">for</span> <span class="n">match</span> <span class="ow">in</span> <span class="n">matches</span><span class="o">.</span><span class="n">groups</span><span class="p">():</span>
<span class="k">print</span><span class="p">(</span><span class="n">match</span><span class="p">)</span>
<span class="c"># >>> foo</span>
<span class="c"># >>> bar</span>
</code></pre></div>
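<p>Note that <code>re.search</code> only returns the <em>first</em> match, so the loop above iterates over that single match's capture groups. To walk over every match in the string (a sketch, not from the original post), use <code>re.finditer</code>:</p>
<div class="highlight"><pre><code class="language-python" data-lang="python">import re
# a word followed by a comma and then another word followed by a period
pattern = r'(?P<param1>[\w]+),(?P<param2>[\w]+)\.'
for m in re.finditer(pattern, ' foo,bar. aaaand another xxx,yyy.'):
    print(m.groupdict())
# >>> {'param1': 'foo', 'param2': 'bar'}
# >>> {'param1': 'xxx', 'param2': 'yyy'}
</code></pre></div>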
<h2 id="re-findall">Re.findall</h2>
<blockquote>
<p><code>re.findall</code> returns a list of matches</p>
</blockquote>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">re</span>
<span class="c"># a word followed by a comma and then another word followed by a period</span>
<span class="n">pattern</span> <span class="o">=</span> <span class="s">r'(?P<param1>[</span><span class="err">\</span><span class="s">w]+),(?P<param2>[</span><span class="err">\</span><span class="s">w]+)</span><span class="err">\</span><span class="s">.'</span>
<span class="n">re</span><span class="o">.</span><span class="n">findall</span><span class="p">(</span><span class="n">pattern</span><span class="p">,</span><span class="s">'foo,bar.'</span><span class="p">)</span>
<span class="c"># >> [('foo', 'bar')]</span>
</code></pre></div>
<h2 id="findall-multiple-matches">Findall, multiple matches</h2>
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">re</span>
<span class="c"># a word followed by a comma and then another word followed by a period</span>
<span class="n">pattern</span> <span class="o">=</span> <span class="s">r'(?P<param1>[</span><span class="err">\</span><span class="s">w]+),(?P<param2>[</span><span class="err">\</span><span class="s">w]+)</span><span class="err">\</span><span class="s">.'</span>
<span class="n">re</span><span class="o">.</span><span class="n">findall</span><span class="p">(</span><span class="n">pattern</span><span class="p">,</span><span class="s">'foo,bar. and another xxx,yyy.'</span><span class="p">)</span>
<span class="c"># >>> [('foo', 'bar'), ('xxx', 'yyy')]</span>
</code></pre></div>
2023-06-25T05:54:52-03:00

http://queirozf.com/entries/paper-summary-direct-preference-optimization-your-language-model-is-secretly-a-reward-model
Paper Summary: Direct Preference Optimization: Your Language Model is Secretly a Reward Model
2023-08-02T00:00:00-03:00
Felipe
<blockquote>
<p><span style="font-weight:bold">Please note</span> This post is mainly intended for my <strong>personal use</strong>. It is not peer-reviewed work and should not be taken as such.</p>
</blockquote>
<p><div class="paper-screenshot-img-div" markdown="1">
<img src="//queirozf.com/images/contents/7FGVAUN.png" alt="direct-preference-optimization-arxiv">
<em>Direct Preference Optimization: Your Language Model is Secretly a Reward Model <a href="https://arxiv.org/pdf/2305.18290.pdf">Source</a></em>
</div></p>
<h2 id="what">WHAT</h2>
<p>An approach to align pre-trained LMs to human preferences without using Reinforcement Learning (RL).</p>
<h2 id="why">WHY</h2>
<p>Because RL-based instruction-tuning methods (such as RLHF) are costly and difficult to implement.</p>
<h2 id="how">HOW</h2>
<p>The authors figured out a way to represent the objective function from RLHF as a loss function that can be directly optimized using algorithms such as SGD.</p>
<p>A dataset containing <strong>good</strong> (so-called <em>preferred</em>) as well as <strong>bad</strong> (so-called <em>dispreferred</em>) prompt/output pairs is needed to fine-tune the model. The loss function includes both types of pairs to calculate the loss.</p>
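<p>Concretely, the per-pair DPO loss can be sketched in plain Python as follows (log-probabilities are toy numbers; <code>w</code> denotes the preferred and <code>l</code> the dispreferred completion):</p>
<div class="highlight"><pre><code class="language-python" data-lang="python">import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # -log sigmoid(beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l)))
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# the loss shrinks as the policy raises the preferred completion's
# log-probability relative to the frozen reference (SFT) model
print(dpo_loss(logp_w=-12.0, logp_l=-15.0, ref_logp_w=-14.0, ref_logp_l=-14.0))
</code></pre></div>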
<h2 id="claims">CLAIMS</h2>
<ul>
<li><p>Objective evaluation: better results than PPO (the RL algorithm used by RHLF) as measured by reward and KL-divergence from the original text distribution.</p></li>
<li><p>Subjective evaluation: also better results than RLHF-PPO, <strong>but</strong> the comparison setup is very nontraditional and based upon proxies. Authors use GPT-4 to provide ground truth for experiments, sentiment classifiers to filter generated text with respect to sentiment, etc.</p></li>
<li><p>Learning with DPO is more stable (smaller variance) than RLHF-PPO.</p></li>
<li><p>DPO converges quickly.</p></li>
</ul>
<h2 id="notes">NOTES</h2>
<ul>
<li><p>GPT-4 (zero-shot) was used to evaluate DPO against other types of fine-tuning. Crazy.</p></li>
<li><p>DPO was applied on an LM that had been previously fine-tuned with regular SFT.</p></li>
</ul>
<hr>
<h3 id="references">References</h3>
<ul>
<li><a href="https://arxiv.org/pdf/2305.18290.pdf">Arxiv: Rafailov et al., 2023: Direct Preference Optimization: Your Language Model is Secretly a Reward Model</a></li>
</ul>
2023-06-22T22:31:44-03:00

http://queirozf.com/entries/pyenv-examples-managing-multiple-python-versions-and-virtualenvs
Pyenv Examples: Managing multiple Python versions and Virtualenvs
2023-07-07T00:00:00-03:00
Felipe
<h2 id="create-virtualenv">Create virtualenv</h2>
<p><code>$ pyenv virtualenv my-venv</code></p>
<h2 id="create-virtualenv-with-python-version">Create virtualenv with python version</h2>
<p>Create a virtualenv using a specific python version.</p>
<p><code>$ pyenv virtualenv 3.7.16 my-venv</code></p>
<div style="text-align:center; margin-bottom: 7px; margin-top: 7px; min-height: 287px !important; height: 287px !important;">
<script async src="//pagead2.googlesyndication.com/pagead/js/adsbygoogle.js"></script>
<!-- ab_test_large_square -->
<ins class="adsbygoogle"
style="display:inline-block;width:336px;height:280px"
data-ad-client="ca-pub-2217532725941275"
data-ad-slot="7164375745"></ins>
<script>
(adsbygoogle = window.adsbygoogle || []).push({});
</script>
</div>
<h2 id="activate-virtualenv">Activate virtualenv</h2>
<p>To activate a virtualenv called <code>my-venv</code>:</p>
<div class="highlight"><pre><code class="language-" data-lang="">$ pyenv activate my-venv
</code></pre></div>
<h2 id="set-default-virtualenv-for-directory">Set default virtualenv for directory</h2>
<p>Use <code>pyenv local my-venv</code>. </p>
<p>This will create a hidden <code>.python-version</code> file (should not be versioned).</p>
<p>The virtualenv will be activated automatically every time you cd to that directory (without the need to call <code>pyenv activate</code>)</p>
<div class="highlight"><pre><code class="language-" data-lang="">$ pyenv local my-venv
</code></pre></div>
<h2 id="virtualenv-location">Virtualenv location</h2>
<p>For virtualenv <code>my-venv</code>:</p>
<ul>
<li>on MacOS</li>
</ul>
<div class="highlight"><pre><code class="language-" data-lang=""> ~/.pyenv/versions/my-venv
</code></pre></div>
<h2 id="install-python-version">Install python version</h2>
<div class="highlight"><pre><code class="language-" data-lang="">$ pyenv install 3.9
</code></pre></div>
<div style="text-align:center; margin-bottom: 7px; margin-top: 7px; min-height: 257px !important; height: 257px !important;">
<script async src="//pagead2.googlesyndication.com/pagead/js/adsbygoogle.js"></script>
<!-- ab_test_medium_square -->
<ins class="adsbygoogle"
style="display:inline-block;width:300px;height:250px"
data-ad-client="ca-pub-2217532725941275"
data-ad-slot="3340325586"></ins>
<script>
(adsbygoogle = window.adsbygoogle || []).push({});
</script>
</div>
<h2 id="list-python-versions">List python versions</h2>
<div class="highlight"><pre><code class="language-" data-lang="">$ pyenv versions
</code></pre></div>
2023-06-21T20:52:25-03:00

http://queirozf.com/entries/paper-summary-pythia-a-suite-for-analyzing-large-language-models-across-training-and-scaling
Paper Summary: Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling
2023-06-24T00:00:00-03:00
Felipe
<blockquote>
<p><span style="font-weight:bold">Please note</span> This post is mainly intended for my <strong>personal use</strong>. It is not peer-reviewed work and should not be taken as such.</p>
</blockquote>
<p><div class="paper-screenshot-img-div" markdown="1">
<img src="//queirozf.com/images/contents/umS7kEa.png" alt="pythia-biderman-et-al-2023">
<em>Pythia: A Suite for Analyzing Large Language <br/>Models Across Training and Scaling<br/><a href="https://arxiv.org/pdf/2304.01373.pdf">Source</a></em>
</div></p>
<h2 id="what">WHAT</h2>
<p>A framework—<em>Pythia</em>—to <em>uniformly</em> train variations of LLMs to measure the impact of hyperparameter choices:</p>
<ul>
<li>number of layers</li>
<li>model dimensionality</li>
<li>number of attention heads</li>
<li>dimensionality of attention heads</li>
<li>batch size</li>
<li>learning rate</li>
</ul>
<h2 id="why">WHY</h2>
<p>It's hard to measure the impact of hyperparameters using other published LLMs because they have been trained using different architectures, different data, and different training decisions.</p>
<h2 id="how">HOW</h2>
<ul>
<li>Train 8 variations of GPT-3-like models and study the impact of changing hyperparameters on the model performance, as evaluated on several NLP tasks (via <a href="https://github.com/EleutherAI/lm-evaluation-harness">EleutherAI/lm-evaluation-harness</a>)</li>
</ul>
<h2 id="claims">CLAIMS</h2>
<ul>
<li><p>Deduplicating <em>The Pile</em> training dataset had no benefit on performance, contrary to existing literature.</p></li>
<li><p>Using parallel attention and MLP sublayers did not degrade performance, contrary to existing literature.</p></li>
<li><p>Using multi-lingual datasets hurt performance less than expected.</p></li>
<li><p>The position of a piece of text—i.e. at the start or the end of the training dataset— does not make it more or less likely to be <em>memorized</em> by the model.</p></li>
<li><p>Term frequencies in the pretraining dataset <em>do affect</em> the downstream performance of the model, especially in models with higher capacity.</p></li>
</ul>
<h2 id="extends-uses">EXTENDS/USES</h2>
<ul>
<li>Toolset from <a href="https://github.com/EleutherAI/gpt-neox">GPT-NeoX</a></li>
<li>GPT-3 for architecture and most other decisions</li>
<li><a href="https://pile.eleuther.ai/">EleutherAI's <em>The Pile</em> dataset</a></li>
<li>BPE tokenizer (from GPT-NeoX-20B)</li>
<li>Flash Attention (Dao et al., 2022)</li>
<li>Rotary Embeddings (Su et al., 2021)</li>
<li>Parallel Attention (from GPT-J-6B)</li>
</ul>
<h2 id="notes">NOTES</h2>
<ul>
<li>Authors applied <em>interventions</em> :thinking: in the dataset to address bias</li>
</ul>
<h2 id="my-2">MY 2¢</h2>
<ul>
<li>Most new articles are using a curated English-language dataset called <em>The Pile</em></li>
</ul>
<hr>
<h3 id="references">References</h3>
<ul>
<li><a href="https://arxiv.org/pdf/2304.01373.pdf">Arxiv: Biderman et al., 2023: Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling</a></li>
</ul>
2023-06-18T19:58:21-03:00

http://queirozf.com/entries/paper-summary-llama-adapter-efficient-fine-tuning-of-language-models-with-zero-init-attention
Paper Summary: LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
2024-01-14T00:00:00-03:00
Felipe
<blockquote>
<p><span style="font-weight:bold">Please note</span> This post is mainly intended for my <strong>personal use</strong>. It is not peer-reviewed work and should not be taken as such.</p>
</blockquote>
<p><div class="paper-screenshot-img-div" markdown="1">
<img src="//queirozf.com/images/contents/UcxlAzb.png" alt="llama-adapter">
<em>LLaMA-Adapter: Efficient Fine-tuning of Language<br/>Models with Zero-init Attention<br/><a href="https://arxiv.org/pdf/2303.16199.pdf">Source</a></em>
</div></p>
<h2 id="what">WHAT</h2>
<p>A cheaper way to fine-tune a vanilla LLM based on the 52k input/output pairs from <a href="//queirozf.com/entries/paper-summary-self-instruct-aligning-language-models-with-self-generated-instructions">self-instruct</a>.</p>
<h2 id="why">WHY</h2>
<p>To reduce the cost to fine-tune LLMs for instruction-following.</p>
<h2 id="how">HOW</h2>
<ul>
<li><p>A few layers (1.2M parameters) are added to a pre-trained LLaMA model and only <em>these</em> are unfrozen and fine-tuned.</p></li>
<li><p>Attention mechanisms in the unfrozen layers are initialized with zeros plus a gating mechanism, to prevent disturbing the information coming from the base LLM (see the sketch after this list).</p></li>
</ul>
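<p>The zero-init gating in the second bullet can be sketched as follows: the adapter branch is scaled by a learnable gate that starts at zero, so at initialization the model behaves exactly like the frozen base. A minimal PyTorch sketch (not the paper's implementation):</p>
<div class="highlight"><pre><code class="language-python" data-lang="python">import torch
import torch.nn as nn

class ZeroInitGatedAdapter(nn.Module):
    # adds an adapter branch whose output is scaled by a zero-initialized
    # gate, so training starts from the unmodified base model
    def __init__(self, dim):
        super().__init__()
        self.adapter = nn.Linear(dim, dim)
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, base_out, adapter_in):
        return base_out + torch.tanh(self.gate) * self.adapter(adapter_in)

x = torch.randn(2, 16)
block = ZeroInitGatedAdapter(16)
print(torch.allclose(block(x, x), x))  # True: the gate starts at zero
</code></pre></div>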
<h2 id="claims">CLAIMS</h2>
<ul>
<li>Fine-tuning a LLaMA 7B model takes 1 hour. Comparable performance to Alpaca while taking 1/3 of the time.</li>
</ul>
<h2 id="extends-uses">EXTENDS/USES</h2>
<ul>
<li>Adapter-based Fine-tuning from <a href="https://arxiv.org/pdf/1902.00751.pdf">Houlsby et al 2019</a> </li>
<li><p>Fine-tuning input/output pairs from <a href="https://queirozf.com/entries/paper-summary-self-instruct-aligning-language-models-with-self-generated-instructions">Self-instruct</a></p></li>
<li><p>Base LLM from <a href="https://queirozf.com/entries/paper-summary-llama-open-and-efficient-foundation-language-models">LLaMA</a></p></li>
</ul>
<h2 id="notes">NOTES</h2>
<ul>
<li><p>LLaMA-adapter also supports other modalities (audio, images, video).</p></li>
<li><p>LLaMA-adapter is a type of Parameter-Efficient Fine-Tuning (PEFT)</p></li>
</ul>
<h2 id="my-2c">MY 2c</h2>
<ul>
<li>No quantitative comparison with Alpaca, only examples (possibly cherry-picked) and a vague claim of "comparable instruction-following proficiency with the 7B Alpaca"</li>
</ul>
<hr>
<h3 id="references">References</h3>
<ul>
<li><p><a href="https://arxiv.org/pdf/2303.16199.pdf">Arxiv: Zhang et al 2023:LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention</a></p></li>
<li><p><a href="https://github.com/ZrrSkywalker/LLaMA-Adapter">Github: ZrrSkywalker/LLaMA-Adapter</a></p></li>
<li><p><a href="https://arxiv.org/pdf/1902.00751.pdf">Arxiv: Houlsby et al 2019: Parameter-Efficient Transfer Learning for NLP</a></p></li>
</ul>
2023-06-04T20:54:16-03:00http://queirozf.com/entries/paper-summary-llama-open-and-efficient-foundation-language-modelsPaper Summary: LLaMA: Open and Efficient Foundation Language Models2023-08-02T00:00:00-03:00Felipe<blockquote>
<p><span style="font-weight:bold">Please note</span> This post is mainly intended for my <strong>personal use</strong>. It is not peer-reviewed work and should not be taken as such.</p>
</blockquote>
<h2 id="what">WHAT</h2>
<p>An LLM (LLaMA) is trained from scratch using more data but fewer training iterations than GPT3. Only public data is used.</p>
<h2 id="why">WHY</h2>
<p>To test how the data vs. compute-budget tradeoff behaves as scale grows.</p>
<h2 id="how">HOW</h2>
<p>LLaMA is a standard Transformer LLM with some optimizations used by previous LMs. It's trained exclusively on open-access data.</p>
<h2 id="claims">CLAIMS</h2>
<ul>
<li><p>Models with fewer parameters are cheaper to use at inference time</p></li>
<li><p>LLaMA outperforms or matches LMs having 3-10x the number of parameters (GPT3, Gopher, Chinchilla) at most natural language tasks (Zero-shot and Few-shot)</p></li>
</ul>
<h2 id="extends-uses">EXTENDS/USES</h2>
<ul>
<li>AdamW Optimizer</li>
<li>Transformers Implementation from <a href="https://github.com/facebookresearch/xformers">facebookresearch/xformers</a></li>
<li>RMSNorm</li>
<li>SwiGLU Activation Function (a sketch of RMSNorm and SwiGLU follows this list)</li>
<li>Rotary Embeddings from GPTNeo</li>
</ul>
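<p>For reference, minimal sketches of RMSNorm and SwiGLU as I understand them (my own illustration, not LLaMA's actual implementation):</p>
<pre><code># RMSNorm: like LayerNorm but without mean-centering or bias.
# SwiGLU: a gated activation, silu(x W) * (x V), used in LLaMA's MLP.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x * rms

class SwiGLU(nn.Module):
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w = nn.Linear(dim, hidden, bias=False)   # gate projection
        self.v = nn.Linear(dim, hidden, bias=False)   # value projection
        self.out = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        return self.out(F.silu(self.w(x)) * self.v(x))

x = torch.randn(2, 16)
print(SwiGLU(16, 64)(RMSNorm(16)(x)).shape)  # torch.Size([2, 16])
</code></pre>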
<h2 id="notes">NOTES</h2>
<ul>
<li><p>Total number of tokens used for training: 1.4T</p></li>
<li><p>Some fine-tuning was done using simple SFT</p></li>
</ul>
<h2 id="my-2">MY 2¢</h2>
<ul>
<li><p>This is an engineering article; not many theoretical advancements.</p></li>
<li><p>The moat enjoyed by big players gets smaller every day.</p></li>
</ul>
<hr>
<h3 id="references">References</h3>
<ul>
<li><p><a href="https://arxiv.org/pdf/2302.13971.pdf">Arxiv: Touvron et al 2023: LLaMA: Open and Efficient Foundation Language Models</a></p></li>
<li><p><a href="https://github.com/facebookresearch/llama">Github: facebookresearch/llama</a></p></li>
</ul>
2023-06-04T16:13:09-03:00http://queirozf.com/entries/paper-summary-self-instruct-aligning-language-models-with-self-generated-instructionsPaper Summary: Self-instruct: Aligning Language Models with Self-generated Instructions2023-06-25T00:00:00-03:00Felipe<blockquote>
<p><span style="font-weight:bold">Please note</span> This post is mainly intended for my <strong>personal use</strong>. It is not peer-reviewed work and should not be taken as such.</p>
</blockquote>
<p><div class="paper-screenshot-img-div" markdown="1">
<img src="//queirozf.com/images/contents/iqUE1iK.png" alt="self-instrut-article-image">
<em>Self-Instruct: Aligning Language Models with Self-Generated Instructions<br/><a href="https://arxiv.org/pdf/2212.10560.pdf">Source</a></em>
</div></p>
<h2 id="what">WHAT</h2>
<p>A way to fine-tune LLMs to follow instructions using only information from the model itself—no human annotation needed.</p>
<h2 id="why">WHY</h2>
<p>Because human-annotated datasets are expensive to come by.</p>
<h2 id="how">HOW</h2>
<ul>
<li><p><strong>1)</strong> Use the pre-trained LLM <em>itself</em> to generate input/output instruction pairs, from a <strong>small</strong> set of seed pairs (one seed example per task, 175 examples in total; see the sketch after this list).</p>
<ul>
<li><a href="https://github.com/yizhongw/self-instruct/blob/main/data/seed_tasks.jsonl">Seed data</a></li>
<li><a href="https://github.com/yizhongw/self-instruct/blob/main/data/gpt3_generations/batch_221203/all_instances_82K.jsonl">Generated instructions</a></li>
</ul></li>
<li><p><strong>2)</strong> Perform supervised fine-tuning on the pairs from step <strong>1)</strong>, after using heuristics to filter out low-quality or near-duplicate generated pairs.</p></li>
</ul>
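<p>A heavily simplified sketch of the bootstrap loop in step <strong>1)</strong>; <code>generate</code> stands in for a call to a GPT-3 completion endpoint, and the novelty filter below is a crude token-overlap approximation of the paper's ROUGE-based filtering. All names are hypothetical:</p>
<pre><code># Hedged sketch of the self-instruct bootstrap loop.
import random

def token_overlap(a, b):
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta.intersection(tb)) / max(1, len(ta.union(tb)))

def bootstrap(seed_tasks, generate, rounds=10, max_overlap=0.7):
    pool = list(seed_tasks)
    for _ in range(rounds):
        shots = random.sample(pool, k=min(8, len(pool)))  # in-context examples
        candidate = generate(shots)  # the model writes a brand-new instruction
        # keep only candidates that are sufficiently different from the pool
        if not any(token_overlap(candidate, t) >= max_overlap for t in pool):
            pool.append(candidate)
    return pool

seeds = ["Write a haiku about rain.", "Sort a list of numbers in Python."]
toy_generate = lambda shots: "Translate the given sentence into French."
print(bootstrap(seeds, toy_generate, rounds=3))
</code></pre>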
<h2 id="claims">CLAIMS</h2>
<ul>
<li>In one experiment, GPT3<sub>self-instruct</sub> answers 44.4% of questions correctly, while InstructGPT (GPT3 aligned with RLHF) reaches 50.7%.</li>
</ul>
<h2 id="notes">NOTES</h2>
<ul>
<li>All tasks are represented in the form <code>(task definition, input/output pairs)</code>. It's a versatile way to represent any kind of task. Example below:</li>
</ul>
<p><div class="img-div" markdown="1">
<img src="//queirozf.com/images/contents/jUrpZxG.png" alt="sample-input-output-pairs-self-instruct">
<em>How the authors represent the instruction tasks to align the model.<br/><a href="https://arxiv.org/pdf/2212.10560.pdf">Source</a></em>
</div></p>
<ul>
<li>No need to host a local version of GPT3. Everything was done <a href="https://github.com/yizhongw/self-instruct">using OpenAI CLI tools and making HTTP requests to GPT3 endpoints</a></li>
</ul>
<h2 id="my-2">MY 2¢</h2>
<p>The key contribution is the recipe for generating alignment examples from a vanilla LLM itself.</p>
<hr>
<h3 id="references">References</h3>
<ul>
<li><a href="https://arxiv.org/pdf/2212.10560.pdf">Arxiv: Wang et al., 2023: Self-Instruct: Aligning Language Models with Self-Generated Instructions</a></li>
</ul>
<p><a name="myfootnote1">1</a>:Such as <a href="//queirozf.com/entries/paper-summary-training-language-models-to-follow-instructions-with-human-feedback">InstructGPT/ChatGPT</a> which are based on RHLF</p>
2023-06-03T18:00:09-03:00http://queirozf.com/entries/as-a-manager-is-it-worthwhile-how-worthwhileAs a Manager: Is it Worthwhile? How Worthwhile?2023-05-29T00:00:00-03:00Felipe<h2 id="prioritization-is-key">Prioritization is key</h2>
<p>As a (new) manager, one of your most important activities will be to <em>prioritize</em> work to be done by your team.</p>
<p>Get into the habit of thinking not only <em>if</em> some given task/project is worthwhile, but also <em>how</em> worthwhile.</p>
<p>You will never have infinite resources or people in your team, so some tasks or projects must by necessity be dropped in favor of others.</p>
<h2 id="value-effort-ratio-instead-of-just-value">Value/Effort ratio instead of just Value</h2>
<p>Every task or project has an expected <em>Value</em>: the benefit (financial or otherwise) it brings to the team or organization.</p>
<p>But the Value alone is not what you should use to prioritize tasks. A better metric to use is Value/Effort, whereby you also take into account the <em>Effort</em> (i.e. man-hours) needed to accomplish the task.</p>
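<p>To make the ratio concrete, here is a toy example with made-up numbers: a task worth 10 "value points" that takes 2 days of effort beats a task worth 20 that takes 10.</p>
<pre><code># Toy prioritization by Value/Effort ratio (all numbers are made up).
tasks = [
    {"name": "A", "value": 10, "effort": 2},   # ratio 5.0
    {"name": "B", "value": 20, "effort": 10},  # ratio 2.0
    {"name": "C", "value": 8, "effort": 4},    # ratio 2.0
]
ranked = sorted(tasks, key=lambda t: t["value"] / t["effort"], reverse=True)
for t in ranked:
    print(t["name"], round(t["value"] / t["effort"], 1))
# A 5.0, B 2.0, C 2.0 -- A wins despite having half of B's raw Value
</code></pre>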
<p>Naturally, there are many other dimensions that must also be taken into account when prioritizing (emergency tasks, unblocking tasks, etc.).</p>
2023-05-28T19:47:59-03:00http://queirozf.com/entries/as-a-manager-stating-the-obvious-is-importantAs a Manager: Stating the Obvious is Important2023-05-29T00:00:00-03:00Felipe<blockquote>
<p><span style="color:red; font-weight:bold">WIP Alert</span> This is a work in progress. Current information is correct but more content may be added in the future.</p>
</blockquote>
<p>Stating the obvious when talking with reports is important but managers don't always do it, either due to laziness or to wrongly assuming reports already know it.</p>
<h2 id="define-expectations-precisely-with-examples">Define expectations precisely, with examples</h2>
<p>The definition of what it means for some task to be <em>done</em> varies a lot from person to person. It's important to make expectations clear. </p>
<ul>
<li><p><em>"Make sure you double check the result after completing the task, to confirm the task's objective was achieved."</em></p></li>
<li><p><em>"Make sure you look at the system logs after deploying the changes, to make sure they worked."</em></p></li>
<li><p><em>"Make sure you communicate everyone who may be impacted before you start working."</em></p></li>
</ul>
<h2 id="explain-the-impacts-of-ones-actions-to-impart-a-sense-of-ownership">Explain the impacts of one's actions to impart a sense of ownership</h2>
<p>TODO</p>
<h2 id="stating-the-obvious-fills-knowledge-gaps">Stating the obvious fills knowledge gaps</h2>
<p>Many people have <em>knowledge gaps</em>: things they don't fully understand about their work. These gaps are usually skipped over or outright ignored, as people cut corners to get work done.</p>
<p>Stating the obvious is a good way to help people plug gaps they may not even realize they have.</p>
<h2 id="be-careful-not-to-be-repetitive">Be careful not to be repetitive</h2>
<p>Stating the obvious can easily turn you into a "boring" person if you don't watch out. Be careful not to overdo it.</p>
<p>You need to be able to detect <em>who</em> you need to state the obvious to and <em>when</em> to do it.</p>
<p>An alternative is to ask people whether they understand the topic you are about to explain, but make sure you ask in a non-judgmental manner. If the person suspects you are asking about something that should be trivial, they may choose to <strong>lie</strong> and say they do, in fact, understand it, to "save face".</p>
<h2 id="watch-peoples-body-language-as-you-explain">Watch people's body language as you explain</h2>
<p>If you choose to state the obvious and people seem impatient or look elsewhere as you talk, it may be a sign you don't need to explain that particular thing to that particular person.</p>
<p>Likewise, if the listener pays close attention to what you are saying, it might be a sign that you are indeed filling a knowledge gap (or they may just be trying to flatter you).</p>
2023-05-28T18:00:50-03:00http://queirozf.com/entries/as-a-manager-drive-growth-by-asking-open-ended-questionsAs a Manager: Drive Growth by Asking Open-Ended Questions2023-08-06T00:00:00-03:00Felipe<p>When managing junior/midlevel engineers, one of your key objectives should be to encourage them to think about what they are doing—instead of just executing pre-assigned tasks.</p>
<p>Asking <em>open-ended</em> questions is an excellent way to get reports to <em>think</em> and <em>talk</em> about topics they may not have thought about yet—making them think at a higher level about what they are doing, and enabling them to become more <strong>autonomous</strong> and <strong>self-aware</strong>.</p>
<p>Think of it as a sort of <em>therapy</em>: one of the reasons why therapy works is that people hear themselves talking about their issues, rather than having people tell them what to do.</p>
<blockquote>
<p>Asking questions (rather than providing answers) is better to encourage growth!</p>
</blockquote>
<p>Here are some examples of the questions you could ask your reports, during 1:1 meetings, project review meetings, etc.</p>
<h2 id="fishing-for-problems-are-the-next-steps-clear">Fishing for problems: Are the next steps clear?</h2>
<p>TODO</p>
<h2 id="what-do-you-think-should-be-the-next-steps-of-this-project-and-why">What do you think should be the next steps of this project, and why?</h2>
<p>This gets reports to think about the project as a whole (rather than the specific task they are currently executing).</p>
<p>As they think about the next tasks, they will have to think about topics such as:</p>
<ul>
<li><p><strong>Project management</strong> (how to properly conduct a project such that there's less risk of failure)</p></li>
<li><p><strong>Task sequencing</strong> (which tasks should be done first, unblocking other team members, de-risking the project by doing risky tasks first, etc)</p></li>
</ul>
<h2 id="what-do-you-think-we-should-be-working-on-next">What do you think we should be working on next?</h2>
<p>This is useful to get people to think about <strong>prioritization</strong>. When they are focused on executing only, it may be hard to think about the high-level objectives of the team.</p>
<p>This encourages them to think about:</p>
<ul>
<li><p><strong>Value/Effort Tradeoff</strong>: Each task has an estimated value but it also has some cost (i.e. work hours) attached to it. Both should be taken into account when selecting tasks to be worked on.</p></li>
<li><p><strong>Focus on business outcomes</strong>: Thinking about the team priorities forces people to don the "business" hat and think about which tasks are more important <em>from a business perspective</em>. </p>
<ul>
<li>Being able to think about business needs is a key skill many technically-oriented people lack.</li>
</ul></li>
</ul>
<h2 id="if-you-could-start-over-what-would-you-have-done-differently">If you could start over, what would you have done differently?</h2>
<p>Thinking about one's work objectively—from a distance—is a great way to let go of unhelpful or unproductive behavior patterns that can hinder one's career.</p>
<p>By asking reports what they could have done differently in a project or task, you allow them to reflect upon their work, helping them grow as professionals.</p>
2023-05-21T23:51:16-03:00http://queirozf.com/entries/paper-summary-training-language-models-to-follow-instructions-with-human-feedbackPaper Summary: Training language models to follow instructions with human feedback2023-06-25T00:00:00-03:00Felipe<blockquote>
<p><span style="font-weight:bold">Please note</span> This post is mainly intended for my <strong>personal use</strong>. It is not peer-reviewed work and should not be taken as such.</p>
</blockquote>
<p><div class="paper-screenshot-img-div" markdown="1">
<img src="//queirozf.com/images/contents/8x6JCkv.png" alt="instruct-gpt">
<em>Training language models to follow <br/>instructions with human feedback <a href="https://arxiv.org/pdf/2203.02155.pdf">Source</a></em>
</div></p>
<h2 id="what">WHAT</h2>
<p>Introduces a strategy, InstructGPT, to fine-tune pre-trained LLMs to follow human instructions using Reinforcement Learning.<sup><a href="#myfootnote2">2</a></sup></p>
<h2 id="why">WHY</h2>
<p>Pretraining LLMs on unlabelled data does not make them good at following instructions or at providing output that's <em>aligned</em> with the user's intent: we need something else.</p>
<h2 id="rlhf">RLHF</h2>
<ul>
<li><p>It's a 3-stage strategy (it assumes you already have a pre-trained, so-called <em>vanilla</em> LM)</p>
<ul>
<li><strong>1) Supervised Fine-tuning (SFT)</strong>: Take some prompts, give them to human annotators, and have them write a proper <em>response</em> to each prompt. Then fine-tune the pre-trained LM in a supervised manner on those prompt/response pairs. </li>
<li><strong>2) Reward Model (RM)</strong>: With the fine-tuned LM, we again sample some prompts, feed them to the model<sup><a href="#myfootnote1">1</a></sup>, and get some outputs. We then ask human annotators to <em>rank</em> the outputs on a Likert scale, defining how aligned each output is with the original prompt.
<ul>
<li>The outcome is a model (RM) that takes a prompt/output pair and says how <em>aligned</em> it is to what humans usually want.</li>
<li>Also an LLM; can be Transformer-based</li>
</ul></li>
<li><strong>3) RL Fine-tuning</strong>: Initiate a Reinforcement Learning (RL) feedback loop whereby:
<ul>
<li>Sample the LM for a prompt/output pair</li>
<li>Score the prompt/output pair with the Reward Model (a Preference Reward)</li>
<li>Score the output with the original LM itself (before fine-tuning) to see how close to "normal language" the output is. </li>
<li><strong>PPO-ptx</strong>: Calculate a Final Reward that takes into account <em>both</em> the Preference Reward <em>and</em> the original LM's perplexity, to make sure the output is good in terms of alignment but also <em>natural</em> (as judged by the original, untuned LM). A simplified sketch of such a reward follows this list.</li>
<li>Feed the Final Reward back to the LM and repeat the loop</li>
</ul></li>
</ul></li>
</ul>
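<p>A heavily simplified sketch of how the Final Reward might be assembled, with a KL-style penalty keeping the tuned LM close to the original one. This is my own rendering, not OpenAI's implementation, and <code>beta</code> is a hypothetical coefficient:</p>
<pre><code># Simplified sketch of the Final Reward with a KL-style penalty.
# PPO-ptx additionally mixes pretraining gradients into the PPO
# objective, which is not shown here.
import torch

def final_reward(rm_score, policy_logprobs, ref_logprobs, beta=0.02):
    # Per-token KL estimate: log p_policy(token) - log p_ref(token),
    # summed over the generated tokens.
    kl = (policy_logprobs - ref_logprobs).sum(dim=-1)
    return rm_score - beta * kl  # high RM score, without drifting from the ref LM

rm_score = torch.tensor([1.3])                     # Preference Reward from the RM
policy_lp = torch.log(torch.tensor([[0.5, 0.4]]))  # log-probs under the tuned LM
ref_lp = torch.log(torch.tensor([[0.6, 0.5]]))     # log-probs under the original LM
print(final_reward(rm_score, policy_lp, ref_lp))
</code></pre>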
<h2 id="how">HOW</h2>
<p>The <em>how</em> is basically applying RLHF to a GPT-3 LM, with some technical optimizations.</p>
<p>PPO (Proximal Policy Optimization) is used to update the LM in the RL Fine-tuning loop, with a modification that lends some weight to the original, untuned LM (PPO-ptx, see above <a href="#rlhf">RLHF</a>) </p>
<h2 id="claims">CLAIMS</h2>
<ul>
<li><p>InstructGPT (1.3B params) provides better outputs than GPT-3 (175B params). (According to labelers)</p></li>
<li><p><em>"The cost of increasing model alignment is <strong>modest</strong> relative to pretraining"</em></p></li>
<li><p>Learned alignment generalizes to held-out annotators</p></li>
<li><p>PPO-ptx can be used to avoid regressions (i.e. text that is statistically very close to preferences but unnatural and/or bad in other ways)</p></li>
</ul>
<h2 id="quotes">QUOTES</h2>
<ul>
<li><p><strong>Misalignment</strong>: <em>"... the language modeling objective used for many recent large LMs—predicting the next token on a webpage from the internet—is different from the objective "follow the user’s instructions helpfully and safely""</em></p></li>
<li><p><strong>Alignment Tax</strong>: <em>"... our alignment procedure comes at the cost of lower performance on certain tasks that we may care about."</em></p>
<ul>
<li>This is reduced with PPO-ptx</li>
</ul></li>
</ul>
<h2 id="notes">NOTES</h2>
<ul>
<li><p>The 3 H's (helpful, honest, and harmless) of implicit alignment were defined in Askell et al., 2021. (see <a href="#refs">refs</a>)</p></li>
<li><p>Types of alignment</p>
<ul>
<li><strong>Explicit alignment</strong>: Following express orders such as "write a list such that..."</li>
<li><strong>Implicit alignment</strong>: Not producing outright misleading text, not hallucinating.</li>
</ul></li>
</ul>
<h2 id="my-2">MY 2¢</h2>
<ul>
<li>In addition to the technological breakthroughs in the paper, it's a masterpiece of <em>experiment design</em> as well. Everything is done to avoid bias and inaccuracies, and to make efficient use of resources (humans, computing, etc.)</li>
</ul>
<h3 id="footnotes">Footnotes</h3>
<p><a name="myfootnote1">1</a>: With an appropriate temperature setting, to generate diverse samples.</p>
<p><a name="myfootnote2">2</a>: It is widely believed that ChatGPT was trained using RLHF as described in this article.</p>
<hr>
<h3 id="references">References</h3>
<ul>
<li><p><a href="https://arxiv.org/pdf/2203.02155.pdf">Arxiv: Ouyang et al 2022: Training language models to follow instructions with human feedback</a></p></li>
<li><p><a href="https://openai.com/blog/instruction-following/">Open AI Blog: Aligning Language Models to Follow Instructions</a></p></li>
<li><p><a href="https://www.youtube.com/watch?v=2MBJOuVq380">Youtube: Reinforcement Learning from Human Feedback: From Zero to chatGPT</a></p>
<ul>
<li>Amazing Video Lecture on RLHF by Nathan Lambert @HuggingFace</li>
</ul></li>
<li><p><a href="https://arxiv.org/abs/2112.00861">Arxiv: Askell et al 2021: A General Language Assistant as a Laboratory for Alignment</a></p></li>
</ul>
2023-02-05T13:12:46-03:00http://queirozf.com/entries/paper-summary-language-models-are-few-shot-learnersPaper Summary: Language Models are Few-Shot Learners2023-02-05T00:00:00-03:00Felipe<blockquote>
<p><span style="font-weight:bold">Please note</span> This post is mainly intended for my <strong>personal use</strong>. It is not peer-reviewed work and should not be taken as such.</p>
</blockquote>
<h2 id="what">WHAT</h2>
<p>GPT-3 model is introduced.</p>
<p>Authors show that, if you have enough data, you can start solving all kinds of problems by few-shot prompting, even beating SOTA, with no fine-tuning.</p>
<h2 id="why">WHY</h2>
<p>Because the usual pretraining/fine-tuning architecture for NLP tasks has some downsides:</p>
<ul>
<li><p>The need to have a smaller annotated dataset for each new downstream application is still a cost/time bottleneck.</p></li>
<li><p>Forcing such a large pretrained model to relearn on small task-specific datasets doesn't necessarily go well.</p></li>
</ul>
<h2 id="how">HOW</h2>
<p>Added more data (and more money $$) with some tweaks on top of <a href="https://queirozf.com/entries/paper-summary-language-models-are-unsupervised-multitask-learners">GPT-2</a></p>
<h2 id="claims">CLAIMS</h2>
<ul>
<li><p>The more parameters a model has, the larger the performance differences between zero-, one-, and few-shot learning.</p></li>
<li><p>In some tasks, Few-shot (even one- or zero-shot) learning with GPT-3 175B surpasses task-specific fine-tuned models, but not in all.<sup><a href="#myfootnote1">1</a></sup> </p></li>
<li><p>Near-100% accuracy on adding/subtracting numbers of up to 3 digits, but performance degrades as more digits are added (few-shot setting).</p></li>
</ul>
<h2 id="quotes">QUOTES</h2>
<ul>
<li><strong>Model size and ability to learn from context</strong>: <em>"Larger models make increasingly efficient use of in-context information"</em></li>
</ul>
<h2 id="notes">NOTES</h2>
<ul>
<li><p>They provide a consistent definition of <strong>zero-shot</strong>, <strong>one-shot</strong> and <strong>few-shot</strong> learning, i.e. the number of examples provided at inference time (in the prompt), without any inference-time weight updates (see the example after this list).</p></li>
<li><p>Several models are trained, with increasing number of parameters, to test how the performance scales with more capacity (from 125M to 175B params)</p></li>
<li><p>One of the several tasks the model was evaluated on was to ask humans to detect if some text was model-generated or not!</p></li>
<li><p>GPT3 does not include any <strong>bidirectional architecture</strong></p></li>
</ul>
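<p>For example, a few-shot prompt for the 3-digit addition task is literally just solved examples concatenated into the context, followed by the query (a sketch; exact prompt formats vary):</p>
<pre><code># Building a few-shot (k=2) prompt for 3-digit addition: the "training
# examples" live entirely in the context; no weights are updated.
examples = [("123 + 456", "579"), ("210 + 305", "515")]
query = "642 + 137"

prompt = "\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
prompt += f"\nQ: {query}\nA:"
print(prompt)
# With k=1 this would be one-shot; with no examples at all, zero-shot.
</code></pre>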
<h2 id="my-2">MY 2¢</h2>
<ul>
<li><p>One interesting point: they found a bug in their code for removing overlaps between train/test data, but the <strong>cost</strong> of retraining was prohibitive, so they didn't retrain the whole thing!</p></li>
<li><p>For translation tasks, the <strong>direction</strong> of translation matters a lot (performance is better when translating <em>into</em> English than when translating <em>from</em> English)</p></li>
<li><p>There are <strong>so many different NLP tasks</strong> available; you can basically encode any problem as an NLP problem, provided you can represent it in words.</p></li>
<li><p>The <strong>data overlap</strong> problem is larger than I first thought - makes one wonder how much of that skews the results</p></li>
<li><p><strong>Section 5: Limitations</strong> is a great write-up of the operational challenges of training such models and running inference on them</p></li>
</ul>
<hr>
<h3 id="references">References</h3>
<ul>
<li><a href="https://arxiv.org/pdf/2005.14165.pdf">Brown et al 2020: Language Models are Few-Shot Learners</a></li>
</ul>
<h3 id="footnotes">Footnotes</h3>
<p><a name="myfootnote1">1</a>: But in all likelihood, training GPT-3 with more than 175B params could change that.</p>
2023-01-01T01:44:34-03:00http://queirozf.com/entries/paper-summary-bert-pre-training-of-deep-bidirectional-transformers-for-language-understandingPaper Summary: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding2023-02-05T00:00:00-03:00Felipe<blockquote>
<p><span style="font-weight:bold">Please note</span> This post is mainly intended for my <strong>personal use</strong>. It is not peer-reviewed work and should not be taken as such.</p>
</blockquote>
<h2 id="what">WHAT</h2>
<p>This article introduces the BERT model, which is a type of transformer-based <strong>fine-tuning</strong><sup><a href="#myfootnote3">3</a></sup> architecture for all sorts of NLP tasks.</p>
<p>BERT introduces bidirectional self-attention to Transformers (instead of left-to-right only) and combines token-level and sentence-level self-supervision, so that the model is good at both levels of tasks.</p>
<h2 id="why">WHY</h2>
<p>To verify whether transfer-learning approaches can also benefit from bidirectional architectures.</p>
<p>To test different self-supervision strategies (token-level and sentence-level) together.</p>
<h2 id="how">HOW</h2>
<ul>
<li><p><strong>Two steps</strong>: Pre-training and fine-tuning</p></li>
<li><p><strong>Self-supervision target</strong>. BERT uses two tasks:</p>
<ul>
<li>A masked language model, AKA the <strong>Cloze</strong> task, whereby words at random are masked and the network must predict them from the surrounding words.</li>
<li>"Next sentence prediction" self-supervision target in addition to the above. (Binarized, as in a 1 or 0 target)</li>
</ul></li>
<li><p><strong>Bidirectional Transformers</strong>: BERT uses bidirectional self-attention (vanilla Transformers use left-only self-attention)</p></li>
<li><p><strong>Encoding</strong>: Input embeddings are actually a sum of the raw token embeddings (WordPiece), segment embeddings to tell which sentence a token is from, and a learned positional embedding (see the sketch after this list). </p></li>
</ul>
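<p>A minimal sketch of this input encoding (dimensions follow BERT-base, but the code is my own illustration, not the reference implementation):</p>
<pre><code># Sketch of BERT-style input encoding: the input embedding is the SUM of
# token, segment (sentence A/B) and learned positional embeddings.
import torch
import torch.nn as nn

class BertEmbeddings(nn.Module):
    def __init__(self, vocab=30522, d=768, max_len=512):
        super().__init__()
        self.tok = nn.Embedding(vocab, d)    # WordPiece token ids
        self.seg = nn.Embedding(2, d)        # sentence A = 0, sentence B = 1
        self.pos = nn.Embedding(max_len, d)  # learned positions

    def forward(self, token_ids, segment_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return self.tok(token_ids) + self.seg(segment_ids) + self.pos(positions)

ids = torch.randint(0, 30522, (1, 10))
segs = torch.zeros(1, 10, dtype=torch.long)
print(BertEmbeddings()(ids, segs).shape)  # torch.Size([1, 10, 768])
</code></pre>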
<h2 id="claims">CLAIMS</h2>
<ul>
<li><p>SOTA scores for many NLP tasks and benchmarks such as GLUE and SQuAD.</p></li>
<li><p>Better results than GPT-1 with the same number of parameters</p></li>
</ul>
<h2 id="quotes">QUOTES</h2>
<ul>
<li><p><strong>Feature-based</strong> adaptation vs <strong>fine-tuning</strong>: <em>"There are two existing strategies for applying pre-trained language representations to downstream tasks: feature-based and fine-tuning"</em></p>
<ul>
<li><strong>Feature-based</strong>: <em>"task-specific architectures that include the pre-trained representations as additional features"</em><sup><a href="#myfootnote1">1</a></sup> </li>
<li><strong>Fine-tuning</strong>: <em>"introduces minimal task-specific parameters, and is trained on the downstream tasks by simply fine-tuning all pretrained parameters"</em><sup><a href="#myfootnote2">2</a></sup> </li>
</ul></li>
<li><p><strong>Architecture</strong>: <em>"A distinctive feature of BERT is its unified architecture across different tasks. There is minimal difference between the pre-trained architecture and the final downstream architecture."</em></p></li>
</ul>
<h2 id="notes">NOTES</h2>
<ul>
<li><p>They mention that the Billion Word Benchmark is a collection of <em>shuffled</em> sentences, which hurts <em>document-level</em> comprehension.</p></li>
<li><p>During the fine-tuning task, all pre-trained parameters are updated. No frozen layers.</p></li>
<li><p>BERT can be used to just produce embeddings to be used downstream too. It performs slightly worse than in the fine-tuning approach but is still very good.</p>
<ul>
<li>Note that it's possible to use several model layers as embeddings, not just the last layer!</li>
</ul></li>
</ul>
<h2 id="my-2">MY 2¢</h2>
<p>Very important point: left-only (as in, unidirectional) Transformers are also called <em>Transformer Decoders</em> (because they can be used to generate text) while bidirectional transformers are called <em>Transformer Encoders</em> in the literature.</p>
<hr>
<h3 id="references">References</h3>
<ul>
<li><a href="https://arxiv.org/pdf/1810.04805.pdf">Devlin et al 2019 BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding</a></li>
</ul>
<hr>
<h3 id="footnotes">Footnotes</h3>
<p><a name="myfootnote1">1</a>: One example of a feature-based strategy is <a href="https://arxiv.org/abs/1802.05365">Peters et al, 2018: Deep Contextualized Word Representations</a></p>
<p><a name="myfootnote1">2</a>: Fine-tuning is the strategy used by GPT-1 (Radford et al, 2018)</p>
<p><a name="myfootnote3">3</a>: As opposed to <em>feature-based</em> (see <a href="#quotes">quotes</a>)</p>
2022-12-31T22:51:53-03:00http://queirozf.com/entries/paper-summary-long-short-term-memory-networks-for-machine-readingPaper Summary: Long Short-Term Memory-Networks for Machine Reading2022-12-26T00:00:00-03:00Felipe<blockquote>
<p><span style="font-weight:bold">Please note</span> This post is mainly intended for my <strong>personal use</strong>. It is not peer-reviewed work and should not be taken as such.</p>
</blockquote>
<h2 id="what">WHAT</h2>
<p>Authors present an enhancement to how Attention is used in LSTMs, namely <strong>intra-attention</strong> or <strong>self-attention</strong></p>
<p>They name it LSTMNs (Long Short-Term Memory Networks)<sup><a href="#myfootnote1">1</a></sup></p>
<h2 id="how">HOW</h2>
<p>In the LSTMN, the attention mechanism is added <strong>within</strong> the encoder (whereas in previous implementations it was added <strong>between</strong> the encoder and the decoder.)</p>
<p>Authors present <strong>two ways</strong> of integrating self-attention into LSTMs:</p>
<ul>
<li><p><em>"Shallow Fusion"</em>: Use encoder-decoders and both use self-attention</p></li>
<li><p><em>"Deep Fusion"</em>: Use encoder-decoders and they use both inter-attention and self-attention</p></li>
</ul>
<p><div class="img-div" markdown="1">
<img src="//queirozf.com/images/contents/2pC1w53.png" alt="self-attention-deep-shallow-lstm">
<em>On the left the <i>Shallow Fusion</i> integration technique and on the right <br/>the <i>Deep Fusion</i> technique, where the encoder and the decoder <br/>have <b>both</b> regular and self-attention</em>
</div></p>
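<p>A rough sketch of the intra-attention idea: the current step attends over a "tape" of all previous hidden states, instead of relying only on a single compressed cell state. This is a simplification of mine, not the paper's exact equations:</p>
<pre><code># Rough sketch of intra-attention over a memory tape of past hidden states.
import torch
import torch.nn.functional as F

def intra_attention(tape, query):
    # tape: (t, d) previous hidden states; query: (d,) current input repr.
    scores = tape @ query          # (t,) similarity to each past state
    weights = F.softmax(scores, dim=0)
    return weights @ tape          # adaptive summary of the whole history

tape = torch.randn(5, 16)  # 5 previous hidden states
query = torch.randn(16)
print(intra_attention(tape, query).shape)  # torch.Size([16])
</code></pre>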
<h2 id="why">WHY</h2>
<p>Traditional LSTMs with Attention may have a hard time storing knowledge that:</p>
<ul>
<li><p>Requires it to store long sequences of text</p></li>
<li><p>Has structure (other than sequential ordering)</p></li>
</ul>
<p>Traditional LSTMs have to recursively <em>compress</em> the knowledge in their memory cells after each iteration; this makes it harder for them to represent finer concepts accurately.</p>
<h2 id="claims">CLAIMS</h2>
<ul>
<li><p><strong>Language modelling</strong></p>
<ul>
<li>LSTMN beats traditional LSTMs with the same memory (as measured by perplexity)</li>
</ul></li>
<li><p><strong>Sentiment Analysis</strong></p>
<ul>
<li>LSTMN beats traditional LSTMs on this task (measured by accuracy)</li>
<li>But a CNN (called T-CNN) was better than both the LSTMN and traditional LSTMs</li>
</ul></li>
<li><p><strong>Natural Language Inference</strong> (textual entailment)</p>
<ul>
<li>LSTMNs beats traditional LSTMs on this task (measured by accuracy)</li>
</ul></li>
</ul>
<h2 id="quotes">QUOTES</h2>
<ul>
<li>On <strong>self-attention</strong>: <em>"A key idea behind the LSTMN is to use attention for inducing relations between tokens"</em></li>
</ul>
<h2 id="notes">NOTES</h2>
<ul>
<li><p>Model is tested in the following tasks: language modeling, sentiment analysis, and natural language inference</p></li>
<li><p>The term "self-attention" doesn't seem to show up in this article - they call it <em>"intra-attention"</em> (as opposed to Bahdanau's <em>"inter-attention"</em>)</p></li>
<li><p>There was <strong>no pre-training</strong> (self-supervised or otherwise)</p>
<ul>
<li>But they used pretrained embeddings</li>
</ul></li>
</ul>
<hr>
<h3 id="references">References</h3>
<ul>
<li><a href="https://arxiv.org/pdf/1601.06733.pdf">Cheng et al, 2016: Long Short-Term Memory-Networks for Machine Reading</a></li>
</ul>
<h3 id="footnotes">Footnotes</h3>
<p><a name="myfootnote1">1</a>: "Memory networks" refer back to <a href="https://arxiv.org/pdf/1410.3916.pdf">Weston et al 2015: Memory Networks</a></p>
2022-12-25T17:02:11-03:00http://queirozf.com/entries/as-a-manager-tell-reports-why-as-often-as-possibleAs a Manager: Tell Reports "Why" as Often as Possible2022-12-27T00:00:00-03:00Felipe<h2 id="not-telling-them-why-slows-their-development">Not telling them "Why" slows their development</h2>
<p>You lose an opportunity to get people to understand the real reason why they are doing the work they are doing.</p>
<p>When people know the reason for things, they:</p>
<ul>
<li><p>Will be better able to handle work on their own when you are not there to help them (autonomy)</p></li>
<li><p>Will develop more of an <strong>"owner mentality"</strong> because they will be able to connect their work with the real-world impact the company has</p></li>
<li><p>Can even think of <strong>better ways to achieve the underlying objective</strong> - often in ways that you may not have considered yourself</p></li>
</ul>
<h2 id="make-an-effort-to-articulate">Make an effort to articulate</h2>
<p>After years of experience in your field, you will have seen things happen over and over again; by the time something happens for the 10th time, you already know how it will end.</p>
<p>So you, as a manager, will sometimes tell reports things like:</p>
<p><code>"Do it like this. It's better, trust me."</code></p>
<p><strong>Why do you do this?</strong></p>
<p>You have 10 other things to do during the day so you just don't have the time to explain:</p>
<ul>
<li><p>Why it's better to do X instead of Y as that will unblock other tasks being done by another team;</p></li>
<li><p>Why it's better to do X first because you will de-risk the project and simultaneously enable some person to do a task they are good at right now, before they leave on vacation</p></li>
</ul>
<p>But at the end of the day, it <em>is</em> possible to articulate these fuzzy nuggets of experience. It will take some minutes to write it down or explain it in spoken form but it's doable.</p>
<p><strong>Don't be lazy.</strong> It's your job to help people grow.</p>
<h2 id="write-things-down">Write things down</h2>
<p>Writing stuff down is much more scalable than coaching on a 1:1 meeting. For one, text scales to multiple people and it scales across time<sup><a href="#myfootnote1">1</a></sup> .</p>
<p>After you catch yourself having to explain something to someone more than 2-3 times, <strong>it's time to write it down.</strong></p>
<p>Examples from myself:</p>
<ul>
<li><p><a href="https://queirozf.com/entries/how-to-ask-for-tech-support">How to Ask for Tech Support</a></p></li>
<li><p><a href="https://queirozf.com/entries/if-you-use-git-you-should-be-very-liberal-in-deleting-stale-code">If you use git you should be very Liberal in Deleting Stale Code</a></p></li>
</ul>
<h2 id="get-them-to-try-and-guess">Get them to try and guess</h2>
<p>If you absolutely do not have the time to explain the rationale behind some decision, the next best thing is to get them to try and guess the underlying reason and check back with them later:</p>
<p><code>"Can you see why doing X is better for the project right now than doing Y? Think about it and we'll follow up on our next meeting"</code></p>
<p>(This can also be followed up on async, via text, if the next meeting is too far in the future)</p>
<hr>
<p><a name="myfootnote1">1</a>: Meaning: <em>a)</em> once a document is written, it can be read by 1 or by 1,000 people at no additional cost to you and <em>b)</em> a document can still be used for years after it was written.</p>
2022-12-11T21:18:16-03:00