Direct Preference Optimization: Your Language Model is Secretly a Reward Model

04/12/2023 25 min


Episode Synopsis

The paper introduces Direct Preference Optimization (DPO), a method for fine-tuning large-scale unsupervised language models (LMs) to align with human preferences. DPO is stable, performant, and computationally lightweight. Compared with existing RLHF methods, it gives better control over the sentiment of generations and matches or improves response quality.

https://arxiv.org/abs/2305.18290

YouTube: https://www.youtube.com/@ArxivPapers

TikTok: https://www.tiktok.com/@arxiv_papers

Apple Podcasts: https://podcasts.apple.com/us/podcast/arxiv-papers/id1692476016

Spotify: https://podcasters.spotify.com/pod/show/arxiv-papers
