Listen "“OpenAI finetuning metrics: What is going on with the loss curves?” by jorio, James Chua"
Episode Synopsis
Introduction

For our current project, we've been using the OpenAI fine-tuning API. To run some of our experiments, we needed to understand exactly how the reported metrics (loss and accuracy) are calculated. Unfortunately, the official documentation is sparse, and the most detailed explanation we could find was a table from Microsoft's Azure documentation (not reproduced in this narration).

Our experimental results didn't match what we expected from these definitions, so we ran controlled experiments to reverse-engineer the metrics. What we found: the loss and accuracy metrics are indeed based on standard cross-entropy loss and token-level accuracy with teacher forcing, but with a critical caveat: both metrics include two additional tokens beyond the visible assistant response. These are likely an end-of-sequence (EOS) token plus another special token, though this is not mentioned in the documentation.

Some calculations

To be concrete, suppose that you are performing SFT on the following conversation:

User: blah blah blah
Assistant: TOKEN1

where TOKEN1 is a single token. We claim that there are two additional tokens, TOKEN2 and TOKEN3, such that the loss is

$$\text{Loss} = -\frac{1}{3}\left(\log p(\text{TOKEN1}) + \log p(\text{TOKEN2}) + \log p(\text{TOKEN3})\right)$$

and the accuracy is

$$\text{ACC} = \frac{\text{number of correctly predicted tokens}}{3}$$
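To make the claimed computation concrete, here is a minimal Python sketch of the metrics under this hypothesis. The function name, token probabilities, and correctness flags below are invented for illustration; that the two extra tokens exist (and are likely EOS plus another special token) is the post's empirical claim, not documented API behavior.

```python
import math

def finetune_metrics(token_log_probs, top1_correct):
    """Sketch of the claimed metric computation.

    token_log_probs: log p(token) for each scored token
        (visible assistant tokens plus the two special tokens).
    top1_correct: whether the model's argmax prediction matched each token
        under teacher forcing.
    """
    n = len(token_log_probs)
    loss = -sum(token_log_probs) / n   # mean cross-entropy over all n tokens
    accuracy = sum(top1_correct) / n   # fraction of tokens predicted correctly
    return loss, accuracy

# One visible token (TOKEN1) plus the two extra tokens (TOKEN2, TOKEN3).
# Probabilities and correctness flags are hypothetical.
log_probs = [math.log(0.9), math.log(0.6), math.log(0.99)]
correct = [True, False, True]

loss, acc = finetune_metrics(log_probs, correct)
print(f"loss = {loss:.4f}, accuracy = {acc:.2f}")  # loss = 0.2087, accuracy = 0.67
```

One consequence of this averaging: if the two special tokens are predicted with near-certainty, the reported loss on a one-token response would be roughly one third of the cross-entropy on the visible token alone.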
---

Outline:
(00:13) Introduction
(01:21) Some calculations
(02:14) Experiments
(03:40) Results
(05:28) Conclusions
(06:12) Acknowledgments

The original text contained 1 footnote which was omitted from this narration.

---

First published:
November 24th, 2025
Source:
https://www.lesswrong.com/posts/zDnA4SEbkeCj2CDHp/openai-finetuning-metrics-what-is-going-on-with-the-loss
---
Narrated by TYPE III AUDIO.
---

Images from the article: Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.