Ep 14 - Interp, latent robustness, RLHF limitations w/ Stephen Casper (PhD AI researcher, MIT)

Artificial General Intelligence (AGI) Show with Soroush Pour

19/06/2024 2h 42min Season 1 Episode 14


Episode Synopsis

We speak with Stephen Casper, or "Cas" as his friends call him. Cas is a PhD student at MIT in the Computer Science (EECS) department, in the Algorithmic Alignment Group advised by Prof Dylan Hadfield-Menell. Formerly, he worked with the Harvard Kreiman Lab and the Center for Human-Compatible AI (CHAI) at Berkeley. His work focuses on better understanding the internal workings of AI models (better known as “interpretability”), making them robust to various kinds of adversarial attacks, and calling out the current technical and policy gaps when it comes to making sure our future with AI goes well. He’s particularly interested in automated ways of finding & fixing flaws in how deep neural nets handle human-interpretable concepts.

We talk to Stephen about:
* His technical AI safety work in the areas of:
  * Interpretability
  * Latent attacks and adversarial robustness
  * Model unlearning
  * The limitations of RLHF
* Cas' journey to becoming an AI safety researcher
* How he thinks the AI safety field is going and whether we're on track for a positive future with AI
* Where he sees the biggest risks coming with AI
* Gaps in the AI safety field that people should work on
* Advice for early career researchers

Hosted by Soroush Pour.
Follow me for more AGI content:
Twitter: https://twitter.com/soroushjp
LinkedIn: https://www.linkedin.com/in/soroushjp/

== Show links ==

-- Follow Stephen --
* Website: https://stephencasper.com/
* Email: (see Cas' website above)
* Twitter: https://twitter.com/StephenLCasper
* Google Scholar: https://scholar.google.com/citations?user=zaF8UJcAAAAJ

-- Further resources --
* Automated jailbreaks / red-teaming paper that Cas and I worked on together (2023) - https://twitter.com/soroushjp/status/1721950722626077067
* Sam Marks paper on Sparse Autoencoders (SAEs) - https://arxiv.org/abs/2403.19647
* Interpretability papers involving downstream tasks - See section 4.2 of https://arxiv.org/abs/2401.14446
* MEMIT paper on model editing - https://arxiv.org/abs/2210.07229
* Motte & bailey definition - https://en.wikipedia.org/wiki/Motte-and-bailey_fallacy
* Bomb-making papers tweet thread by Cas - https://twitter.com/StephenLCasper/status/1780370601171198246
* Paper: undoing safety with as few as 10 examples - https://arxiv.org/abs/2310.03693
* Recommended papers on latent adversarial training (LAT) -
  * https://ai-alignment.com/training-robust-corrigibility-ce0e0a3b9b4d
  * https://arxiv.org/abs/2403.05030
* Scoping (related to model unlearning) blog post by Cas - https://www.alignmentforum.org/posts/mFAvspg4sXkrfZ7FA/deep-forgetting-and-unlearning-for-safely-scoped-llms
* Defending against failure modes using LAT - https://arxiv.org/abs/2403.05030
* Cas' systems for reading for research -
  * Follow ML Twitter
  * Use a combination of the following two search tools for new arXiv papers:
    * https://vjunetxuuftofi.github.io/arxivredirect/
    * https://chromewebstore.google.com/detail/highlight-this-finds-and/fgmbnmjmbjenlhbefngfibmjkpbcljaj?pli=1
  * Skim a new paper or two a day + take brief notes in a searchable notes app
* Recommended people to follow to learn about how to impact the world through research -
  * Dan Hendrycks
  * Been Kim
  * Jacob Steinhardt
  * Nicholas Carlini
  * Paul Christiano
  * Ethan Perez

Recorded May 1, 2024

More episodes of the podcast Artificial General Intelligence (AGI) Show with Soroush Pour

Ep 13 - AI researchers expect AGI sooner w/ Katja Grace (Co-founder & Lead Researcher, AI Impacts) 19/06/2024

Ep 12 - Education & advocacy for AI safety w/ Rob Miles (YouTube host) 09/03/2024

Ep 11 - Technical alignment overview w/ Thomas Larsen (Director of Strategy, Center for AI Policy) 14/12/2023

Ep 10 - Accelerated training to become an AI safety researcher w/ Ryan Kidd (Co-Director, MATS) 08/11/2023

Ep 9 - Scaling AI safety research w/ Adam Gleave (CEO, FAR AI) 06/11/2023

Ep 8 - Getting started in AI safety & alignment w/ Jamie Bernardi (AI Safety Lead, BlueDot Impact) 13/10/2023

Ep 7 - Responding to a world with AGI - Richard Dazeley (Prof AI & ML, Deakin University) 03/08/2023

Ep 6 - Will we see AGI this decade? Our AGI predictions & debate w/ Hunter Jay (CEO, Ripe Robotics) 20/07/2023

Ep 5 - Accelerating AGI timelines since GPT-4 w/ Alex Browne (ML Engineer) 22/05/2023

Ep 4 - When will AGI arrive? - Ryan Kupyn (Data Scientist & Forecasting Researcher @ Amazon AWS) 31/03/2023


