Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length

Yubo Huang1,2, Hailong Guo1,3, Fangtai Wu1,4, Shifeng Zhang1, Shijie Huang1, Qijun Gan4, Lin Liu2, Sirui Zhao2,*, Enhong Chen2,*, Jiaming Liu1,‡, Steven Hoi1
1Alibaba Group    2University of Science and Technology of China    3Beijing University of Posts and Telecommunications    4Zhejiang University
* Corresponding authors.    ‡ Project Leader.

Introduction

We present Live Avatar, an algorithm-system co-designed framework that enables real-time, streaming, and infinite-length interactive avatar video generation. It is powered by a 14-billion-parameter diffusion model that achieves 20 FPS on 5 H800 GPUs with 4-step sampling. Critically, it supports Block-wise Autoregressive processing, enabling streaming video generation that extends beyond 10,000 seconds.

The streaming and real-time nature of Live Avatar unlocks powerful interactions: users can engage in natural, face-to-face conversations via microphone and camera, receiving immediate visual feedback as the avatar responds in real time. By integrating Live Avatar with Qwen3-Omni, we enable fully interactive dialogue agents. Below is a demonstration of a streaming, real-time conversation between two autonomous agents:

LiveStream Interaction

Infinite Autoregressive Generation

⚡ Achieving Real-Time Streaming Performance

Real-time streaming interaction requires the model to generate frames faster than playback speed and to extend the stream continuously, without bound, conditioned on preceding frames. We achieve this by:

  • Employing Distribution Matching Distillation to transform a 14B bidirectional, multi-step video diffusion model into a 4-step streaming one.
  • Designing Timestep-forcing Pipeline Parallelism (TPP), a novel paradigm that decouples sequential denoising stages across multiple devices. This method achieves linear speedup proportional to the number of devices, scaling effectively up to the sampling step count.
Together, these techniques yield an 84× FPS improvement over the baseline, enabling live video generation at over 20 FPS without quantization. Minimal sketches of the distillation objective and the TPP scheduling are given below.
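
To make the distillation step concrete, here is a simplified, hedged sketch of a DMD-style generator loss in PyTorch. The model signatures (`generator`, `teacher`, `fake_critic`) and the x0-prediction parameterization are illustrative assumptions; this is not the exact recipe used to distill Live Avatar.

```python
import torch
import torch.nn.functional as F


def dmd_generator_loss(generator, teacher, fake_critic, noise, cond, sigmas):
    """Simplified DMD-style generator loss (x0-prediction denoisers assumed).

    `teacher` is the frozen multi-step model scoring the real video
    distribution; `fake_critic` is trained online to score the few-step
    generator's own output distribution. Their disagreement on a re-noised
    sample supplies the distribution-matching gradient for the generator.
    """
    x_fake = generator(noise, cond)                      # few-step (e.g. 4-step) sample
    sigmas = torch.as_tensor(sigmas, dtype=x_fake.dtype, device=x_fake.device)
    idx = torch.randint(0, len(sigmas), (x_fake.shape[0],), device=x_fake.device)
    sigma = sigmas[idx].view(-1, *([1] * (x_fake.dim() - 1)))
    x_noisy = x_fake + sigma * torch.randn_like(x_fake)  # re-noise the generated sample
    with torch.no_grad():
        x0_real = teacher(x_noisy, sigma, cond)          # pull toward the real distribution
        x0_fake = fake_critic(x_noisy, sigma, cond)      # push away from the fake distribution
    grad = x0_fake - x0_real                             # distribution-matching gradient direction
    target = (x_fake - grad).detach()
    return 0.5 * F.mse_loss(x_fake, target)              # gradient w.r.t. x_fake follows `grad`
```

In full DMD training, the fake critic is updated in alternation with the generator using a standard denoising loss on freshly generated samples, and the distilled generator is additionally made causal (block-wise autoregressive) rather than bidirectional so it can stream.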
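
The following is a minimal, single-process sketch of the TPP scheduling idea: each pipeline stage is pinned to one denoising step (and, conceptually, one GPU), so once the pipeline fills, every device works on a different latent block during the same tick. Helper names, tensor shapes, and the noise schedule are assumptions for illustration; the real system overlaps stages across GPUs with asynchronous transfers.

```python
from collections import deque

import torch

NUM_STEPS = 4                                            # distilled 4-step sampler
DEVICES = ([f"cuda:{i}" for i in range(NUM_STEPS)]
           if torch.cuda.device_count() >= NUM_STEPS else ["cpu"] * NUM_STEPS)
SIGMAS = [1.0, 0.75, 0.5, 0.25]                          # assumed per-stage noise levels


def denoise_step(latent_block, sigma, audio_feat):
    """Placeholder for one distilled denoising step of the video DiT."""
    return latent_block


def tpp_stream(noisy_blocks, audio_feats):
    """Pipeline latent blocks through stages with fixed ("forced") timesteps.

    Stage i always runs denoising step i on device i. Once the pipeline fills,
    each tick advances every in-flight block by one stage, so throughput is
    roughly NUM_STEPS times that of running the steps serially on one device.
    """
    in_flight = deque()                                  # entries: [latent, audio, next_stage]
    for block, audio in zip(noisy_blocks, audio_feats):
        in_flight.append([block, audio, 0])
        # One pipeline tick: in the real system these stage calls overlap on
        # separate GPUs; here they run serially for clarity.
        for item in in_flight:
            latent, aud, stage = item
            item[0] = denoise_step(latent.to(DEVICES[stage]), SIGMAS[stage], aud)
            item[2] = stage + 1
        # Emit blocks that have passed through all stages, oldest first.
        while in_flight and in_flight[0][2] == NUM_STEPS:
            clean, _, _ = in_flight.popleft()
            yield clean                                  # hand off to VAE decode / stream out
    # (Flushing the partially filled pipeline at end-of-stream is omitted.)
```

Because each device owns exactly one sampling step, the speedup saturates at the number of sampling steps, which is why TPP scales linearly up to the sampling step count.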

♾️ Achieving Infinite-length Generation

Existing talking-avatar systems degrade over long autoregressive generation, manifesting as identity drift and color shifts. We attribute these long-horizon failures to three internal phenomena:

  • inference-mode drift: the conditioning pattern at inference (e.g., the RoPE-relative positioning between the sink frame and current target blocks) gradually diverges from the training-time setup, weakening identity cues.
  • distribution drift: the distribution of generated frames progressively deviates from normal, realistic video distributions, likely driven by persistent factors that continuously push the rolling generation toward unrealistic outputs.
  • error accumulation: subtle flaws in each generated frame are inherited and compounded frame by frame; this hard-to-recover accumulation causes rapid quality deterioration and increasingly incoherent outputs over time.

We address these by:
  • Rolling RoPE: Dynamically updating the sink frame’s RoPE to preserve relative positioning, mitigating inference drift to stabilize long-term identity.
  • Adaptive Attention Sink (AAS): Replacing the initial reference with a generated frame as the sink to eliminate the persistent factor driving distribution drift.
  • History Corrupt: Injecting noise into the KV-cache to simulate inference errors, guiding the model to extract motion from history and stable details from the sink frame.
Together, these strategies enable infinite-length streaming for over 10,000 seconds without quality degradation or identity drift. A minimal sketch of these mechanisms is given below.
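
The three stabilizers are largely bookkeeping around positional encoding and the KV-cache. The sketch below illustrates one plausible shape for them; function names, tensor layouts, and the specific offsets are assumptions for illustration, not the released implementation.

```python
import torch


def rolling_rope_positions(sink_len, cache_len, block_len, train_gap):
    """Rolling RoPE (sketch): re-index the sink frame for every new block so its
    *relative* offset to the current target block matches the training-time
    setup, instead of letting the positional gap grow without bound."""
    block_start = cache_len                              # current block's first position
    sink_start = block_start - train_gap - sink_len      # rolled forward with the stream
    sink_pos = torch.arange(sink_start, sink_start + sink_len)
    block_pos = torch.arange(block_start, block_start + block_len)
    return sink_pos, block_pos


def corrupt_history(kv_cache, noise_std=0.05):
    """History Corrupt (training time): perturb cached keys/values to mimic
    inference-time errors, so the model learns to take motion cues from the
    history while relying on the sink frame for stable appearance."""
    return kv_cache + noise_std * torch.randn_like(kv_cache)


def update_attention_sink(generated_block):
    """Adaptive Attention Sink (sketch): once generation is underway, use a
    recently generated frame as the sink instead of the static reference image,
    removing the persistent mismatch that drives distribution drift."""
    return generated_block[:, -1:].detach()              # [B, 1, ...] new sink frame
```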

Generated Portrait Videos

Generated Cartoon Videos

Generated Long Videos

Comparison with Other Methods