DiFlowDubber: Discrete Flow Matching for Automated Video Dubbing via Cross-Modal Alignment and Synchronization

| Source: arXiv AI

Tags: DiFlowDubber, video dubbing, text-to-speech, cross-modal alignment, synchronization, prosody, facial expressions

DiFlowDubber introduces a novel two-stage training framework for automated video dubbing, enhancing synchronization and expressiveness in speech generation.

Details

DiFlowDubber is a new approach to automated video dubbing that addresses limitations of existing methods, which often rely on limited datasets or struggle to generate expressive speech. The system uses a two-stage training framework, built on discrete flow matching, to transfer knowledge from pre-trained text-to-speech (TTS) models to video dubbing. Its key components are the FaPro module, which captures prosody and stylistic cues from facial expressions, and a Synchronizer module, which improves alignment between text, video, and speech. Experimental results indicate that DiFlowDubber outperforms previous dubbing methods on benchmark datasets, generating temporally synchronized and expressive speech.
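The Synchronizer's role, aligning text tokens to video frames, can be illustrated with a minimal sketch. The code below is not the paper's implementation; it assumes a generic scaled dot-product cross-attention in which each text token attends over video-frame features, and the resulting attention map acts as a soft text-to-frame alignment. All names (`cross_modal_align`, the token/frame counts) are hypothetical:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_align(text_emb, video_emb):
    """Hypothetical sketch of cross-modal alignment: each text token
    attends over video frames via scaled dot-product attention, and the
    attention map is read as a soft text-to-frame alignment."""
    d = text_emb.shape[-1]
    scores = text_emb @ video_emb.T / np.sqrt(d)  # (T_text, T_video)
    align = softmax(scores, axis=-1)              # rows sum to 1
    attended = align @ video_emb                  # video context per text token
    return align, attended

rng = np.random.default_rng(0)
text = rng.standard_normal((10, 64))   # 10 text/phoneme tokens
video = rng.standard_normal((50, 64))  # 50 video frames
align, ctx = cross_modal_align(text, video)
print(align.shape, ctx.shape)  # (10, 50) (10, 64)
```

In a full dubbing pipeline, an alignment map like this would typically drive per-token duration prediction so the generated speech tracks the speaker's lip motion; DiFlowDubber's actual Synchronizer is described only at a high level in the summary above.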