RLHF (Reinforcement Learning from Human Feedback)

07/02/2025 15 min

Listen "RLHF (Reinforcement Learning from Human Feedback)"

Episode Synopsis

Reinforcement Learning from Human Feedback (RLHF) incorporates human preferences into AI systems, addressing problems where specifying a clear reward function is difficult. The basic pipeline involves training a language model, collecting human preference data to train a reward model, and then optimizing the language model with an RL optimizer against that reward model. A KL-divergence penalty against the initial (reference) model is typically used as regularization to prevent over-optimizing the reward model. RLHF is one of a broader family of preference fine-tuning techniques, and it has become a crucial post-training step for aligning language models with human values and eliciting desirable behaviors.
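To make the KL-regularization step concrete, here is a minimal sketch of how a per-token KL penalty against a frozen reference model can be combined with the reward model's score before the RL optimizer sees it. The function name, the beta value, and the example numbers are illustrative assumptions for this page, not a specific library's API or the exact implementation discussed in the episode.

```python
# Sketch of KL-penalized rewards in a typical RLHF setup (assumed, illustrative names).
from typing import List

def kl_penalized_rewards(
    reward_model_score: float,       # scalar score from the trained reward model
    policy_logprobs: List[float],    # per-token log-probs under the policy being optimized
    ref_logprobs: List[float],       # per-token log-probs under the frozen reference model
    beta: float = 0.1,               # strength of the KL penalty (assumed value)
) -> List[float]:
    """Return per-token rewards: a KL penalty at every token, with the
    reward-model score added at the final token of the response."""
    # Per-token KL estimate: log pi(token) - log pi_ref(token)
    per_token_kl = [p - r for p, r in zip(policy_logprobs, ref_logprobs)]
    rewards = [-beta * kl for kl in per_token_kl]
    rewards[-1] += reward_model_score  # sequence-level score attributed to the last token
    return rewards

# Example: a 4-token response scored 1.2 by the reward model.
print(kl_penalized_rewards(1.2, [-0.5, -1.0, -0.2, -0.8], [-0.6, -0.9, -0.3, -0.7]))
```

The penalty discourages the optimized policy from drifting far from the reference model, which is how RLHF pipelines guard against the over-optimization mentioned above.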