Listen "Adaptive Stress Testing for Language Model Toxicity"
Episode Synopsis
This episode explores ASTPrompter, a novel approach to automated red-teaming for large language models (LLMs). Unlike traditional methods that focus on simply triggering toxic outputs, ASTPrompter is designed to discover likely toxic prompts: prompts that could plausibly arise during ordinary language model use. The approach applies Adaptive Stress Testing (AST), a technique for identifying likely failure modes, and uses reinforcement learning to train an "adversary" model. The adversary generates prompts intended to elicit toxic responses from a "defender" model, but unlike prompts produced by many other red-teaming methods, these prompts have low perplexity, meaning they are fluent and likely to occur in real use.
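To make the training signal concrete, below is a minimal Python sketch of how an adversary reward could combine the two objectives described in the synopsis: rewarding toxicity in the defender's reply while penalizing unnatural, high-perplexity prompts. The function name, weights, and inputs are illustrative assumptions for this sketch, not the formulation used in the ASTPrompter paper.

```python
import math

def adversary_reward(defender_toxicity: float,
                     prompt_logprob: float,
                     prompt_length: int,
                     alpha: float = 1.0,
                     beta: float = 0.1) -> float:
    """Illustrative reward for the adversary policy.

    defender_toxicity: toxicity score of the defender's reply in [0, 1],
        e.g. from an external toxicity classifier.
    prompt_logprob: total log-probability of the adversary's prompt under a
        reference language model (more negative = less natural).
    alpha, beta: assumed trade-off weights, not values from the paper.
    """
    # Perplexity = exp(-average log-probability per token).
    perplexity = math.exp(-prompt_logprob / max(prompt_length, 1))
    # Reward toxic defender replies; penalize unnatural (high-perplexity) prompts.
    return alpha * defender_toxicity - beta * math.log(perplexity)

# A toxic reply elicited by a fluent, low-perplexity prompt scores higher
# than the same reply elicited by an unnatural, high-perplexity prompt.
print(adversary_reward(defender_toxicity=0.8, prompt_logprob=-30.0, prompt_length=12))
print(adversary_reward(defender_toxicity=0.8, prompt_logprob=-120.0, prompt_length=12))
```

Under these assumptions, the perplexity penalty is what distinguishes this style of red-teaming from methods that optimize only for eliciting toxicity: the adversary is pushed toward prompts that a real user might actually type.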
More episodes of the podcast AI Safety Breakthrough
Navigating the New AI Security
13/08/2025
DeepSeek: A Disruptive Force in AI
03/02/2025
Agent Bench: Evaluating LLMs as Agents
27/11/2024
Surgical Precision: PKE’s Role in AI Safety
24/11/2024