Erik Jones on Automatically Auditing Large Language Models

11/08/2023 22 min


Episode Synopsis

Erik is a PhD student at Berkeley working with Jacob Steinhardt. He is interested in making generative machine learning systems more robust, reliable, and aligned, with a focus on large language models. In this interview, we talk about his paper "Automatically Auditing Large Language Models via Discrete Optimization", which he presented at ICML.

YouTube: https://youtu.be/bhE5Zs3Y1n8
Paper: https://arxiv.org/abs/2303.04381
Erik: https://twitter.com/ErikJones313
Host: https://twitter.com/MichaelTrazzi
Patreon: https://www.patreon.com/theinsideview

Outline

00:00 Highlights
00:31 Erik's background and research at Berkeley
01:19 Motivation for doing safety research on language models
02:56 Is it too easy to fool today's language models?
03:31 The goal of adversarial attacks on language models
04:57 Automatically Auditing Large Language Models via Discrete Optimization
06:01 Optimizing over a finite set of tokens rather than continuous embeddings
06:44 Goal is revealing behaviors, not necessarily breaking the AI
07:51 On the feasibility of solving adversarial attacks
09:18 Suppressing dangerous knowledge vs just bypassing safety filters
10:35 Can you really ask a language model to cook meth?
11:48 Optimizing French to English translation example
13:07 Forcing toxic celebrity outputs just to test rare behaviors
13:19 Testing the method on GPT-2 and GPT-J
14:03 Adversarial prompts transferred to GPT-3 as well
14:39 How this auditing research fits into the broader AI safety field
15:49 Need for automated tools to audit failures beyond what humans can find
17:47 Auditing to avoid unsafe deployments, not for existential risk reduction
18:41 Adaptive auditing that updates based on the model's outputs
19:54 Prospects for using these methods to detect model deception
22:26 Prefer safety via alignment over just auditing constraints; closing thoughts

Patreon supporters:

Tassilo Neubauer
MonikerEpsilon
Alexey Malafeev
Jack Seroy
JJ Hepburn
Max Chiswick
William Freire
Edward Huff
Gunnar Höglund
Ryan Coppolo
Cameron Holmes
Emil Wallner
Jesse Hoogland
Jacques Thibodeau
Vincent Weisser
