VibeVoice Technical Report

27/08/2025 21 min Episode 1102

Episode Synopsis

🤗 Upvotes: 45 | cs.CL, cs.AI, cs.SD, eess.AS

Authors:
Zhiliang Peng, Jianwei Yu, Wenhui Wang, Yaoyao Chang, Yutao Sun, Li Dong, Yi Zhu, Weijiang Xu, Hangbo Bao, Zehua Wang, Shaohan Huang, Yan Xia, Furu Wei

Title:
VibeVoice Technical Report

Arxiv:
http://arxiv.org/abs/2508.19205v1

Abstract:
This report presents VibeVoice, a novel model designed to synthesize long-form speech with multiple speakers by employing next-token diffusion, which is a unified method for modeling continuous data by autoregressively generating latent vectors via diffusion. To enable this, we introduce a novel continuous speech tokenizer that, when compared to the popular Encodec model, improves data compression by 80 times while maintaining comparable performance. The tokenizer effectively preserves audio fidelity while significantly boosting computational efficiency for processing long sequences. Thus, VibeVoice can synthesize long-form speech for up to 90 minutes (in a 64K context window length) with a maximum of 4 speakers, capturing the authentic conversational "vibe" and surpassing open-source and proprietary dialogue models.
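
To see why heavy compression matters for the 90-minute figure, here is a rough back-of-the-envelope sketch of the sequence budget implied by the abstract. It assumes that "64K" means 65,536 positions and that the whole window is available for audio; in practice some of it would also hold text and speaker information. The function name and constants are illustrative and do not come from the paper.

```python
def max_positions_per_second(context_tokens: int, minutes: float) -> float:
    """Upper bound on sequence positions available per second of audio."""
    return context_tokens / (minutes * 60.0)


# Assumption: "64K" in the abstract is read as 64 * 1024 = 65,536 positions.
CONTEXT_TOKENS = 64 * 1024
# Maximum duration stated in the abstract.
MAX_MINUTES = 90

budget = max_positions_per_second(CONTEXT_TOKENS, MAX_MINUTES)
print(f"{budget:.2f} positions per second of audio")  # prints "12.14 positions per second of audio"
```

Under these assumptions, the model can spend at most about a dozen sequence positions on each second of speech, which is only feasible with a tokenizer that runs at a very low frame rate.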