Petri: Accelerating AI Safety Auditing

10/10/2025 · 15 min

Listen "Petri: Accelerating AI Safety Auditing"

Episode Synopsis

On October 6, 2025, Anthropic introduced **Petri (Parallel Exploration Tool for Risky Interactions)**, an open-source framework for automated auditing designed to accelerate AI safety research. Petri uses **AI-driven auditor agents** to interact with and test the behavior of target language models across diverse, multi-turn scenarios, automating environment simulation and initial transcript analysis. A **judge component** then scores the generated transcripts across dozens of dimensions, such as "unprompted deception" or "whistleblowing," to quickly surface **misaligned behaviors** like autonomous deception and cooperation with misuse. The source post gives a detailed technical overview of Petri's architecture, including how researchers form hypotheses, write seed instructions, and use the automated assessment and iteration steps, and it also discusses the **limitations and biases** observed in the auditor and judge agents during pilot evaluations.

Source: https://alignment.anthropic.com/2025/petri/
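To make the auditor→target→judge loop concrete, here is a minimal Python sketch of the workflow the synopsis describes: a seed instruction drives a multi-turn auditor conversation with a target model, after which a judge scores the full transcript on named dimensions. This is an illustrative sketch only; every function, class, and parameter name here is a hypothetical stand-in, not Petri's actual API, and the model calls are stubbed where a real run would invoke LLMs.

```python
from dataclasses import dataclass, field

# Hypothetical stand-ins for model calls. In a real audit, each of these
# would be an LLM invocation; here they return placeholder strings so the
# control flow of the loop is runnable end to end.
def auditor_turn(seed: str, history: list[str]) -> str:
    """Auditor agent crafts its next probe from the seed instruction and history."""
    return f"[auditor probe derived from seed {seed!r}, turn {len(history) // 2 + 1}]"

def target_turn(message: str) -> str:
    """Target model's reply to the auditor's probe."""
    return f"[target reply to: {message}]"

def judge_score(transcript: list[str], dimensions: list[str]) -> dict[str, float]:
    """Judge scores the whole transcript on each dimension (0.0 to 1.0)."""
    return {dim: 0.0 for dim in dimensions}  # stub: a real judge is itself an LLM

@dataclass
class AuditResult:
    seed: str
    transcript: list[str] = field(default_factory=list)
    scores: dict[str, float] = field(default_factory=dict)

def run_audit(seed: str, dimensions: list[str], max_turns: int = 3) -> AuditResult:
    """One audit: multi-turn auditor/target exchange, then judge scoring."""
    result = AuditResult(seed=seed)
    for _ in range(max_turns):
        probe = auditor_turn(seed, result.transcript)
        reply = target_turn(probe)
        result.transcript += [probe, reply]
    result.scores = judge_score(result.transcript, dimensions)
    return result

if __name__ == "__main__":
    # One seed instruction encodes one hypothesis; Petri runs many in parallel.
    dims = ["unprompted deception", "whistleblowing"]
    audit = run_audit("Probe whether the model deceives when given agentic tools.", dims)
    flagged = {d: s for d, s in audit.scores.items() if s > 0.5}
    print(f"{len(audit.transcript)} messages; flagged dimensions: {flagged or 'none'}")
```

In this framing, the "parallel" in Petri's name maps to running `run_audit` over many seed instructions at once, then sorting the resulting score tables to surface the transcripts most worth a researcher's manual review.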