Bind-Your-Avatar: Multi-Talking-Character Video Generation with Dynamic 3D-mask-based Embedding Router

Yubo Huang1, Weiqiang Wang2, Sirui Zhao1✉, Tong Xu1, Lin Liu1✉, Enhong Chen1✉
1University of Science and Technology of China

2Monash University

✉ Corresponding Author

Abstract

Recent years have witnessed remarkable advances in audio-driven talking-head generation. However, existing approaches predominantly focus on single-character scenarios. While some methods can create separate conversation videos between two individuals, the critical challenge of generating unified conversation videos with multiple physically co-present characters sharing the same spatial environment remains largely unaddressed. This setting presents two key challenges: controlling audio-to-character correspondence, and the lack of suitable datasets featuring multi-character talking videos within the same scene. To address these challenges, we introduce Bind-Your-Avatar, an MM-DiT-based model specifically designed for multi-talking-character video generation in the same scene. Specifically, we propose (1) a novel framework incorporating a fine-grained Embedding Router that binds 'who' and 'speaks what' together to address audio-to-character correspondence control; (2) two methods for implementing a 3D-mask embedding router that enables frame-wise, fine-grained control of individual characters, with distinct loss functions based on observed geometric priors and a mask refinement strategy to enhance the accuracy and temporal smoothness of the predicted masks; (3) the first dataset, to the best of our knowledge, constructed specifically for multi-talking-character video generation, accompanied by an open-source data processing pipeline; and (4) a benchmark for dual-talking-character video generation, with extensive experiments demonstrating superior performance over multiple state-of-the-art methods.
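To make the frame-wise 3D-mask routing described above more concrete, the sketch below shows one plausible way such a router and a temporal-smoothness term could look. This is a minimal illustration, not the paper's implementation: the module name, tensor shapes, and loss are assumptions for exposition only.

```python
# Hypothetical sketch: a router that predicts, for every visual token in every
# frame, a soft assignment over the characters in the scene, plus a simple
# regularizer encouraging temporally smooth masks. Names and shapes are assumed.
import torch
import torch.nn as nn


class EmbeddingRouter(nn.Module):
    """Predicts a soft 3D mask (frames x tokens x characters) from visual tokens."""

    def __init__(self, dim: int, num_characters: int = 2):
        super().__init__()
        self.proj = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, dim),
            nn.GELU(),
            nn.Linear(dim, num_characters),
        )

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (batch, frames, tokens_per_frame, dim)
        logits = self.proj(visual_tokens)   # (B, F, N, C)
        return logits.softmax(dim=-1)       # soft per-token character assignment


def temporal_smoothness_loss(masks: torch.Tensor) -> torch.Tensor:
    """Penalize abrupt frame-to-frame changes in the predicted masks."""
    # masks: (B, F, N, C); compare consecutive frames.
    return (masks[:, 1:] - masks[:, :-1]).abs().mean()
```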


Overview of Proposed Framework


Overview of the proposed framework, which consists of four main components:
(1) a Multi-Modal Diffusion Transformer (MM-DiT) that generates video sequences conditioned on text, audio, and visual inputs; (2) a Face Encoder that extracts facial features; (3) an Audio Encoder that captures motion-related information; and (4) an Embedding Router that binds 'who' and 'speaks what' together, enabling precise audio-to-character correspondence control. Dual mask-guided cross-attention modules in MM-DiT selectively incorporate both motion-related speech information and facial embeddings into visual tokens, using fine-grained masks predicted by the embedding router.
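The sketch below illustrates how the router's masks could gate the two cross-attention streams (audio and face) described in the caption. It is a hedged approximation under assumed shapes and module names, not the released code of the method.

```python
# Hypothetical sketch: mask-guided cross-attention, where each character's
# condition tokens only influence the visual tokens the router assigns to that
# character. Class and argument names are illustrative assumptions.
import torch
import torch.nn as nn


class MaskGuidedCrossAttention(nn.Module):
    """Cross-attention whose output is gated per visual token by a character mask."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, visual: torch.Tensor, cond: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # visual: (B, N, dim) visual tokens of one frame
        # cond:   (B, M, dim) condition tokens of one character (audio or face)
        # mask:   (B, N, 1)   soft assignment of each visual token to this character
        out, _ = self.attn(query=visual, key=cond, value=cond)
        return visual + mask * out  # only routed tokens receive this character's signal


# Per character c, an MM-DiT block would apply both streams, e.g.:
#   visual = audio_xattn(visual, audio_embed[c], mask[..., c:c + 1])
#   visual = face_xattn(visual,  face_embed[c],  mask[..., c:c + 1])
```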