Listen "LLMs Learning from Verbal Feedback Without Scalar Rewards"
Episode Synopsis
This September 25, 2025 paper, a collaboration between Sea AI Lab, SUTD, NUS, NTU, and the University of Waterloo, proposes an alternative to traditional Reinforcement Learning (RL) for Large Language Models (LLMs): the **Feedback-Conditional Policy (FCP)**, which learns directly from rich **verbal feedback** instead of compressing it into scalar rewards. The authors argue that scalarization leads to information loss, ambiguity, and imbalanced reward scales, hindering effective learning from natural language critiques. FCP reframes learning as a **conditional generation** problem: it approximates the feedback-conditional posterior through maximum likelihood training on offline data, then refines the policy with an **online bootstrapping stage** conditioned on positive feedback. The approach draws inspiration from text-to-image generation's ability to combine mixed captions (as shown in the accompanying image) and lets LLMs leverage their inherent linguistic priors for better control and performance, matching or surpassing scalar-based RL methods on reasoning tasks. Source: https://arxiv.org/pdf/2509.22638
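To make the two-stage recipe described above concrete, here is a minimal sketch of what feedback-conditional training could look like in PyTorch with Hugging Face Transformers. It is an illustration based only on the synopsis, not the authors' implementation: the `gpt2` model, the prompt template, the toy data, and the `judge()` feedback source are all assumptions.

```python
# Minimal sketch of a Feedback-Conditional Policy (FCP) training loop,
# based on the synopsis above. Model name, prompt template, toy data,
# and the judge() feedback source are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "gpt2"  # placeholder; the paper works with much larger LLMs
tok = AutoTokenizer.from_pretrained(model_name)
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)
opt = torch.optim.AdamW(model.parameters(), lr=1e-5)

def fcp_loss(prompt: str, feedback: str, response: str) -> torch.Tensor:
    """Maximum-likelihood loss on the response, conditioned on (feedback, prompt).

    Only response tokens contribute to the loss; the conditioning prefix is
    masked out with label -100.
    """
    cond = f"Feedback: {feedback}\nPrompt: {prompt}\nResponse: "
    cond_ids = tok(cond, return_tensors="pt").input_ids
    resp_ids = tok(response + tok.eos_token, return_tensors="pt").input_ids
    input_ids = torch.cat([cond_ids, resp_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : cond_ids.shape[1]] = -100  # ignore the conditioning prefix
    return model(input_ids=input_ids, labels=labels).loss

# Stage 1: offline MLE on (prompt, response, verbal feedback) triples.
offline_data = [("Solve 2+2.", "4", "Correct and concise.")]  # toy example
for prompt, response, feedback in offline_data:
    loss = fcp_loss(prompt, feedback, response)
    loss.backward(); opt.step(); opt.zero_grad()

# Stage 2: online bootstrapping. Generate while conditioning on positive
# feedback, collect fresh verbal feedback on the samples, and continue MLE.
def judge(prompt: str, response: str) -> str:
    """Stand-in for a verbal-feedback source (human or critic model)."""
    return "Correct and concise." if response.strip() else "Empty answer."

positive = "Correct and concise."
for prompt in ["Solve 3+5."]:
    cond = f"Feedback: {positive}\nPrompt: {prompt}\nResponse: "
    ids = tok(cond, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=32, do_sample=True)
    response = tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True)
    feedback = judge(prompt, response)  # verbal feedback on the fresh sample
    loss = fcp_loss(prompt, feedback, response)
    loss.backward(); opt.step(); opt.zero_grad()
```

The key design point this sketch tries to capture is that feedback is never squashed into a scalar reward: it enters the model as conditioning text, and improvement comes from conditioning generation on positive feedback and re-fitting on the feedback the new samples actually receive.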