Interpretability in the wild and other papers

06/04/2023 5 min

Episode Synopsis

This episode covers 3 abstracts:

Active reward learning from multiple teachers - Peter Barnett et al.
Conditioning Predictive Models: Risks and Strategies - Hubinger et al.
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small - Kevin Wang et al.