We present Live Avatar, an algorithm-system co-designed framework that enables real-time,
streaming, and infinite-length interactive avatar video generation.
It is powered by a 14-billion-parameter diffusion model that achieves 20 FPS on five H800 GPUs using 4-step sampling.
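As a quick sanity check on what 20 FPS implies, here is the per-frame time budget, assuming the four denoising steps of a frame run sequentially (an assumption for illustration; the actual pipeline may overlap steps across the five GPUs):

```python
# Back-of-envelope latency budget for real-time generation.
# All numbers come from the stated 20 FPS / 4-step configuration;
# the sequential-steps assumption is ours, not Live Avatar's.
fps = 20
steps_per_frame = 4

frame_budget_ms = 1000 / fps                        # 50.0 ms available per frame
step_budget_ms = frame_budget_ms / steps_per_frame  # 12.5 ms per denoising step

print(f"{frame_budget_ms} ms/frame, {step_budget_ms} ms/step")
```

Any frame that takes longer than this budget stalls playback, which is why the system must sustain generation strictly faster than real time.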
Critically, it supports Block-wise Autoregressive processing, enabling streaming generation of videos exceeding 10,000 seconds.
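Block-wise autoregressive generation can be sketched as follows. This is a minimal illustration, not Live Avatar's implementation: the block size, context-window length, and the `generate_block` stand-in are all hypothetical.

```python
# Sketch of block-wise autoregressive streaming: frames are produced a
# block at a time, each block conditioned on a bounded window of
# previously generated frames, so generation can continue indefinitely.
from collections import deque

BLOCK = 4    # frames generated per autoregressive step (illustrative)
CONTEXT = 8  # previous frames that condition the next block (illustrative)

def generate_block(context, audio_chunk):
    # Stand-in for the diffusion sampler: each "frame" here just records
    # how many context frames it saw and which audio chunk drove it.
    return [(len(context), audio_chunk) for _ in range(BLOCK)]

def stream_frames(audio_chunks):
    """Yield frames block by block; the fixed-size context deque keeps
    memory bounded regardless of how long the stream runs."""
    context = deque(maxlen=CONTEXT)
    for chunk in audio_chunks:
        block = generate_block(list(context), chunk)
        context.extend(block)   # old frames fall out of the window
        yield from block

frames = list(stream_frames(range(6)))  # 6 audio chunks -> 24 frames
```

Because the conditioning window is fixed, per-block cost stays constant, which is what makes unbounded stream length feasible.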
The streaming and real-time nature of Live Avatar unlocks powerful interactions: users can hold natural, face-to-face conversations via microphone and camera, with the avatar responding in real time.
By integrating Live Avatar with Qwen3-Omni, we enable fully interactive dialogue agents. Below is a demonstration of a streaming, real-time conversation between two autonomous agents:
Real-time streaming interaction requires the model to generate frames faster than playback speed and to continue generation indefinitely, conditioning each new segment on preceding frames. We achieve this by:
Existing talking-avatar systems degrade over long autoregressive generation, manifesting as identity drift and color shifts. We attribute these long-horizon failures to three internal phenomena: