Confidence-Reward Preference Optimization for Machine Translation

10/06/2025 55 min


Episode Synopsis

This episode introduces Confidence-Reward driven Preference Optimization (CRPO), a method for improving machine translation by selecting training data for large language models (LLMs) more effectively. The paper highlights the challenges of applying LLMs to translation, which stem from pretraining on English-centric data and the complexity of traditional reinforcement learning from human feedback. While Direct Preference Optimization (DPO) simplifies training, its success depends on high-quality preference data. CRPO addresses this by combining reward scores with model confidence to identify challenging sentence pairs, those where the model is uncertain or underperforming, which leads to more efficient fine-tuning. The authors demonstrate CRPO's effectiveness on both LLMs and encoder-decoder models, showing that it outperforms existing data selection methods in translation accuracy and data efficiency.
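To make the selection idea concrete, here is a minimal sketch of how reward scores and model confidence might be combined to pick informative preference pairs for DPO-style fine-tuning. This is an illustration of the general principle described in the episode, not the paper's exact formulation; the names `reward`, `log_prob`, `Candidate`, and `crpo_style_score` are hypothetical placeholders for a learned reward or quality-estimation scorer, the policy's sequence log-likelihood, and the selection routine.

```python
# Sketch (assumptions labeled): score preference pairs by how much the reward
# gap exceeds the model's own confidence gap, then keep the most informative
# pairs. `reward` and `log_prob` are stand-ins for a QE/reward model and the
# policy's log p(translation | source); they are not the paper's exact APIs.

from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Candidate:
    source: str    # source sentence
    chosen: str    # preferred translation (higher reward)
    rejected: str  # dispreferred translation (lower reward)


def crpo_style_score(
    cand: Candidate,
    reward: Callable[[str, str], float],    # reward(source, translation)
    log_prob: Callable[[str, str], float],  # policy log-likelihood of translation
) -> float:
    """Large values mean the reward model prefers `chosen` but the policy is
    not yet confident in it, so the pair should be informative to train on."""
    reward_gap = reward(cand.source, cand.chosen) - reward(cand.source, cand.rejected)
    confidence_gap = log_prob(cand.source, cand.chosen) - log_prob(cand.source, cand.rejected)
    return reward_gap - confidence_gap


def select_pairs(
    pool: List[Candidate],
    reward: Callable[[str, str], float],
    log_prob: Callable[[str, str], float],
    k: int,
) -> List[Candidate]:
    """Keep the top-k pairs where reward and model confidence disagree most."""
    scored: List[Tuple[float, Candidate]] = [
        (crpo_style_score(c, reward, log_prob), c) for c in pool
    ]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [c for _, c in scored[:k]]
```

Pairs the model already handles well (high confidence in the higher-reward translation) score low and are filtered out, which is what gives the method its data efficiency.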
