Paper Summary: The Science of Detecting LLM-Generated Texts
Please note: This post is mainly intended for my personal use. It is not peer-reviewed work and should not be taken as such.
WHAT
The authors survey the nascent field of machine-generated text detection, categorize the main types of detection methods, explain the main approaches (mostly watermarking), and argue for the need for fair evaluation.
WHY
At the time the paper was written there was no comprehensive survey of the field, and some structure was needed.
HOW
The authors categorized existing detection methods into two types:
Black-box methods: You collect samples of human-written text and of text generated by a specific LLM, then train a binary classifier to tell them apart.
White-box methods: In addition to having samples from both human- and machine-generated text, you also have access to the full weights of the LLM you are trying to detect.
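To make the black-box setup concrete for myself, here is a minimal sketch of such a binary classifier: a bag-of-words Naive Bayes in plain Python. This is not a method taken from the paper. The class name `NaiveBayesDetector`, the whitespace tokenizer, and the four training sentences are all made up for illustration; a real detector would need far more data and better features.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Naive whitespace tokenizer (an assumption, chosen for brevity)."""
    return text.lower().split()


class NaiveBayesDetector:
    """Toy binary classifier: label 0 = human-written, label 1 = LLM-generated."""

    def fit(self, texts: list[str], labels: list[int]) -> "NaiveBayesDetector":
        self.counts = {0: Counter(), 1: Counter()}
        self.docs = Counter(labels)
        for text, label in zip(texts, labels):
            self.counts[label].update(tokenize(text))
        self.vocab = set(self.counts[0]) | set(self.counts[1])
        return self

    def log_odds(self, text: str) -> float:
        """Positive score: more likely LLM-generated. Negative: more likely human."""
        score = math.log(self.docs[1] / self.docs[0])  # class prior
        vocab_size = len(self.vocab)
        totals = {c: sum(self.counts[c].values()) for c in (0, 1)}
        for tok in tokenize(text):
            if tok not in self.vocab:
                continue  # ignore tokens never seen in training
            # Add-one (Laplace) smoothing so unseen class/token pairs are not zero.
            p_machine = (self.counts[1][tok] + 1) / (totals[1] + vocab_size)
            p_human = (self.counts[0][tok] + 1) / (totals[0] + vocab_size)
            score += math.log(p_machine / p_human)
        return score


# Made-up training samples, purely illustrative.
human_texts = ["omg that movie was so good!!", "ugh, i hate mondays lol"]
machine_texts = [
    "As an overview, the movie presents several notable themes.",
    "In summary, there are several factors to consider.",
]
detector = NaiveBayesDetector().fit(
    human_texts + machine_texts, [0, 0, 1, 1]
)
```

Calling `detector.log_odds(some_text)` gives a score whose sign is the prediction. Note that the classifier only knows about the one LLM whose samples it was trained on, which is exactly the model-specific limitation of this family of methods.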
Different approaches to watermarking were discussed too, along with some points about how to compare approaches fairly.
Watermarking
Watermarking refers to methods to encode hidden information in the text, which can be checked or validated a posteriori.
The two basic approaches described are Inference-time and Post-hoc watermarking:
Inference-time Watermarking | Post-hoc Watermarking |
---|---|
Change the sampling process inside the decoder: a seed splits the vocabulary into groups, and every time a token is chosen, it is picked only from the allowed group. | After the text is generated, modify blocks of text in a way that maintains the semantics. This can be done by paraphrasing, changing syntactic structures, using synonyms, etc. |
Watermarking usually lowers the quality of the generated text, and it may slow down inference.
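Here is a rough sketch of how I understand inference-time watermarking, loosely in the spirit of the green-list scheme of Kirchenbauer et al. It is a toy under several assumptions: the vocabulary is fake, the "LLM" samples tokens uniformly instead of using real logits, only the "hard" variant (green tokens only) is shown, and the names `green_list`, `generate`, `z_score`, and the key `"secret"` are mine.

```python
import hashlib
import math
import random

VOCAB = [f"w{i}" for i in range(1000)]  # toy vocabulary (assumption)
GAMMA = 0.5  # fraction of the vocabulary placed on the green list


def green_list(prev_token: str, key: str = "secret") -> set[str]:
    """Seed a PRNG from the previous token and pick the green subset."""
    digest = hashlib.sha256(f"{key}:{prev_token}".encode()).hexdigest()
    rng = random.Random(int(digest, 16))
    return set(rng.sample(VOCAB, int(GAMMA * len(VOCAB))))


def generate(n_tokens: int, watermark: bool, seed: int = 0) -> list[str]:
    """Stand-in for an LLM: uniform sampling, optionally restricted to green tokens."""
    rng = random.Random(seed)
    tokens = ["<s>"]
    for _ in range(n_tokens):
        # sorted() makes the candidate order deterministic across runs.
        candidates = sorted(green_list(tokens[-1])) if watermark else VOCAB
        tokens.append(rng.choice(candidates))
    return tokens[1:]


def z_score(tokens: list[str]) -> float:
    """One-proportion z-test: how far is the green-token count above chance?"""
    prev = "<s>"
    hits = 0
    for tok in tokens:
        hits += tok in green_list(prev)
        prev = tok
    n = len(tokens)
    return (hits - GAMMA * n) / math.sqrt(n * GAMMA * (1 - GAMMA))


watermarked = generate(200, watermark=True)
plain = generate(200, watermark=False)
```

Detection only needs the key and the token sequence, not the model: `z_score(watermarked)` comes out far above what chance would produce, while `z_score(plain)` should stay near zero. Restricting the candidates at every step is also where the quality cost comes from, since the model may be forced away from its preferred token.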
CLAIMS
- Adversarial attacks on detectors: "...a paraphrasing attack could break a wide array of detectors, including both white-box and black-box approaches."
- This is a simple way to evade model-specific detectors because you can use another LLM to generate paraphrases.
QUOTES
- On Emotion: "Initial observations indicate that LLM-generated texts are less emotional and objective compared to human-authored text, which often uses punctuation and grammar to convey subjective feelings"
NOTES
Detection methods listed here are usually limited to a specific LLM. No mention is made of model-agnostic detection.
Fact-checking is also used as a way to detect LLM-generated text, especially hallucinations.
MY 2¢
Model-agnostic detection: What about model-agnostic detection? This is what we want!
Temperature: What about the role of temperature when generating text? The paper does not mention it. Surely this makes a ton of difference; if the temperature is zero, decoding is deterministic, so it should become trivial to detect whether some text is LLM-generated.
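A small sketch of what I mean by temperature, using the standard softmax-with-temperature formula. The function name and the example logits are made up; greedy decoding corresponds to the limit where the temperature goes to zero, which the function itself does not accept.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Turn logits into sampling probabilities; lower temperature = sharper."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; use argmax for greedy decoding")
    scaled = [x / temperature for x in logits]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
sharp = softmax_with_temperature(logits, 0.1)   # almost all mass on the top token
flat = softmax_with_temperature(logits, 10.0)   # close to uniform
```

At low temperature nearly all the probability mass sits on the single most likely token, so the same prompt keeps producing (almost) the same text. At high temperature the distribution flattens and the output varies much more, which is why I would expect detection difficulty to depend on this setting.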
Good insights from this paper: "How Close is ChatGPT to Human Experts? Comparison Corpus, Evaluation, and Detection" seems to have lots of interesting insights and statistics about human-written vs LLM-generated text:
- LLM-generated text is longer on average;
- "ChatGPT texts use more determiners, conjunctions, and auxiliary relations"
- "Unlike humans, large language models tend to be neutral by default and lack emotional expression."
Detecting vs Verifying: Detecting is a different problem from verifying. When we are talking about a single LLM, the two collapse into a single problem, but with model-agnostic detection they are very different. Watermarking, for instance, may help in model-specific detection but not in model-agnostic detection.
Regulation: Maybe governments will require all public LLMs to embed some sort of watermark in their output. But this is impossible to enforce with open-source LLMs.
Tradeoffs: There is a clear tradeoff between the ease of detection and the text length. I don't think this was explored enough in this article.
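One way to put numbers on this tradeoff, assuming a green-list watermark detected with a one-proportion z-test (the kind of test used by Kirchenbauer et al.): the z-score grows with the square root of the number of tokens, so short texts cannot reach a confident threshold. The function name, the default threshold of 4, and the example rates are my own choices.

```python
import math


def min_tokens_for_detection(
    green_rate: float, gamma: float = 0.5, z_threshold: float = 4.0
) -> int:
    """Smallest token count n for which the z-score reaches z_threshold.

    With n tokens and an observed green-token rate `green_rate`:
        z = (green_rate - gamma) * n / sqrt(n * gamma * (1 - gamma))
          = sqrt(n) * (green_rate - gamma) / sqrt(gamma * (1 - gamma))
    Solving z >= z_threshold for n gives the bound below.
    """
    if green_rate <= gamma:
        raise ValueError("a green rate at or below gamma is never detectable")
    effect = (green_rate - gamma) / math.sqrt(gamma * (1 - gamma))
    return math.ceil((z_threshold / effect) ** 2)


tokens_hard = min_tokens_for_detection(1.0)    # every token is green
tokens_soft = min_tokens_for_detection(0.75)   # only 75% of tokens are green
```

Under these assumptions a fully watermarked text needs 16 tokens to reach z = 4, and a text with a 75% green rate needs 64. Weakening the watermark (or editing the text after generation) pushes the required length up quickly.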
References
arXiv: Tang et al. 2023: The Science of Detecting LLM-Generated Texts
- This is the article this page is about.
-
- This is another well-cited article on the detection of LLM-generated text
arXiv: Kirchenbauer et al. 2023: A Watermark for Large Language Models
- Most cited work on watermarking for text.
Huggingface: Watermarking text
- Good info here, especially this part: "Detecting whether a given text was generated using a language model without having access to that model is currently impossible."