Evaluating Language Models as Synthetic Data Generators

08/12/2024 21 min Episode 164

Episode Synopsis

🤗 Upvotes: 30 | cs.CL

Authors:
Seungone Kim, Juyoung Suk, Xiang Yue, Vijay Viswanathan, Seongyun Lee, Yizhong Wang, Kiril Gashteovski, Carolin Lawrence, Sean Welleck, Graham Neubig

Title:
Evaluating Language Models as Synthetic Data Generators

Arxiv:
http://arxiv.org/abs/2412.03679v1

Abstract:
Given the increasing use of synthetic data in language model (LM) post-training, an LM's ability to generate high-quality data has become nearly as crucial as its ability to solve problems directly. While prior works have focused on developing effective data generation methods, they lack systematic comparison of different LMs as data generators in a unified setting. To address this gap, we propose AgoraBench, a benchmark that provides standardized settings and metrics to evaluate LMs' data generation abilities. Through synthesizing 1.26 million training instances using 6 LMs and training 99 student models, we uncover key insights about LMs' data generation capabilities. First, we observe that LMs exhibit distinct strengths. For instance, GPT-4o excels at generating new problems, while Claude-3.5-Sonnet performs better at enhancing existing ones. Furthermore, our analysis reveals that an LM's data generation ability doesn't necessarily correlate with its problem-solving ability. Instead, multiple intrinsic features of data quality (including response quality, perplexity, and instruction difficulty) collectively serve as better indicators. Finally, we demonstrate that strategic choices in output format and cost-conscious model selection significantly impact data generation effectiveness.