AI Control: Improving Safety Despite Intentional Subversion

15/12/2023 16 min

Listen "AI Control: Improving Safety Despite Intentional Subversion"

Episode Synopsis

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

We've released a paper, AI Control: Improving Safety Despite Intentional Subversion. This paper explores techniques that prevent AI catastrophes even if AI instances are colluding to subvert the safety techniques. In this post, we summarize the paper and compare our methodology to the one used in other safety papers. The next post in this sequence (which we'll release in the coming weeks) discusses what we mean by AI control and argues that it is a promising methodology for reducing risk from scheming models.

Here's the abstract of the paper:

As large language models (LLMs) become more powerful and are deployed more autonomously, it will be increasingly important to prevent them from causing harmful outcomes. Researchers have investigated a variety of safety techniques for this purpose, e.g. using models to review the outputs of other models [...]

---

First published: December 13th, 2023

Source: https://www.lesswrong.com/posts/d9FJHawgkiMSPjagR/ai-control-improving-safety-despite-intentional-subversion

---

Narrated by TYPE III AUDIO.
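The abstract's example of "using models to review the outputs of other models" corresponds to protocols in the style of trusted monitoring: a more capable but untrusted model proposes work, a weaker trusted model flags suspicious outputs, and a limited human-audit budget is spent on the most suspicious cases. The sketch below is a minimal, hypothetical illustration of that idea, not code from the paper; the callables (`untrusted_solve`, `trusted_score`, `trusted_solve`) and the thresholds are stand-in assumptions.

```python
# Hypothetical sketch of a trusted-monitoring-style protocol (illustrative only).
# An untrusted model proposes a solution, a trusted model scores how suspicious
# it looks, and the most suspicious outputs are escalated for human auditing or
# replaced with the trusted model's own (safer, lower-quality) answer.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Decision:
    task: str
    solution: str
    suspicion: float
    action: str  # "accept", "audit", or "use_trusted_fallback"


def trusted_monitoring(
    tasks: List[str],
    untrusted_solve: Callable[[str], str],       # capable but possibly scheming model
    trusted_score: Callable[[str, str], float],  # trusted model: 0.0 (benign) .. 1.0 (suspicious)
    trusted_solve: Callable[[str], str],         # trusted model used as a fallback solver
    audit_threshold: float = 0.9,                # illustrative: escalate the most suspicious outputs
    fallback_threshold: float = 0.6,             # illustrative: otherwise discard and use the fallback
) -> List[Decision]:
    decisions = []
    for task in tasks:
        solution = untrusted_solve(task)
        suspicion = trusted_score(task, solution)
        if suspicion >= audit_threshold:
            action = "audit"                     # spend limited human-audit budget here
        elif suspicion >= fallback_threshold:
            action = "use_trusted_fallback"
            solution = trusted_solve(task)       # safer but typically lower-quality answer
        else:
            action = "accept"
        decisions.append(Decision(task, solution, suspicion, action))
    return decisions
```

In the paper's framing, protocols like this are evaluated against a red team that tries to sneak subverted outputs past the monitor, so the interesting question is how well such a scheme holds up under intentional subversion rather than against accidental errors.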
