VividFace: Face-Aware One-Step Diffusion for High-Fidelity Video Face Enhancement

Shulian Zhang1, Long Peng2, Ziyang Wang1, Ye Chen1, Jie Li1,
Wenbo Li3, Yulun Zhang4, Jian Chen1, Yong Guo5
1South China University of Technology
2University of Science and Technology of China
3Chinese University of Hong Kong
4Shanghai Jiao Tong University
5Max Planck Institute for Informatics

Visualization

Abstract

Video Face Enhancement (VFE) aims to restore high-quality facial details from degraded video sequences, enabling a wide range of practical applications. Despite substantial progress in the field, current methods that primarily rely on video super-resolution and generative frameworks face three fundamental challenges: (1) computational inefficiency caused by iterative multi-step denoising in diffusion models; (2) insufficient recovery of fine-grained facial textures; and (3) poor restoration quality due to the lack of high-quality face video training data. To address these challenges, we propose VividFace, a one-step diffusion framework that reformulates a text-to-video generation model into a single-step generator with switchable face-aware guidance, which persistently focuses optimization on perceptually critical facial regions across both latent and pixel spaces. Furthermore, we propose a human-aligned, MLLM-driven data curation pipeline that leverages the video understanding capabilities of Multimodal Large Language Models (MLLMs) and iteratively refines scoring criteria through lightweight human-in-the-loop calibration to align quality judgments with human perceptual expectations, yielding MLLM-Face90, a high-quality face video dataset. Extensive experiments demonstrate that VividFace achieves superior perceptual quality, identity preservation, and temporal consistency on both synthetic and real-world benchmarks, while delivering a $12\times$ inference speedup over state-of-the-art diffusion-based VFE methods.
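To make the one-step idea concrete, the sketch below shows how a single predicted velocity step can map a degraded video latent directly to a restored latent, replacing iterative multi-step denoising. All names (OneStepRestorer, velocity_net) and the toy network are illustrative assumptions, not the released VividFace architecture or API.

```python
# Minimal sketch of one-step restoration via a predicted velocity field.
# The tiny Conv3d stack stands in for the reformulated text-to-video DiT backbone.
import torch
import torch.nn as nn


class OneStepRestorer(nn.Module):
    """Predicts a single velocity step mapping a degraded latent to a restored latent."""

    def __init__(self, dim: int = 16):
        super().__init__()
        # Placeholder backbone; the real model is a DiT distilled from a T2V generator.
        self.velocity_net = nn.Sequential(
            nn.Conv3d(dim, dim * 4, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Conv3d(dim * 4, dim, kernel_size=3, padding=1),
        )

    def forward(self, z_lq: torch.Tensor) -> torch.Tensor:
        # One-step flow matching: restored latent = degraded latent + predicted velocity,
        # instead of running many denoising iterations.
        v = self.velocity_net(z_lq)
        return z_lq + v


# Toy usage on a batch of degraded video latents shaped (B, C, T, H, W).
z_lq = torch.randn(1, 16, 5, 32, 32)
z_hq_hat = OneStepRestorer()(z_lq)
print(z_hq_hat.shape)  # torch.Size([1, 16, 5, 32, 32])
```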

Pipeline

VividFace pipeline

Overview of our VividFace framework. VividFace is a face-aware one-step enhancement method built upon a flow-matching formulation. Middle: Facial masks are first extracted in the pixel space (M_p) and then geometrically aligned to the latent space (M_l) to match the VAE's spatiotemporal compression. Top: In the first stage (Latent-space Optimization), the DiT model learns a single-step velocity field to transform the degraded latent z_l into the restored latent ẑ_h. This process is supervised by our switchable face-aware guidance using M_l. Bottom: In the second stage (Pixel-space Refinement), the decoded video x̂_h undergoes fine-grained optimization, where the switchable guidance is applied using M_p to further enhance facial textures and perceptual fidelity.
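The sketch below illustrates the two-stage supervision described in the caption: a facial mask is downsampled to match the VAE's spatiotemporal compression, and a masked loss emphasizes face regions in latent space (stage 1) and pixel space (stage 2). The mask alignment, loss form, and weighting are assumptions for illustration only; the paper's actual switchable guidance may differ.

```python
# Minimal sketch of switchable face-aware guidance across latent and pixel spaces.
import torch
import torch.nn.functional as F


def align_mask_to_latent(mask_pixel: torch.Tensor, latent_shape) -> torch.Tensor:
    """Downsample the pixel-space facial mask (B, 1, T, H, W) to the latent grid
    (B, C, t, h, w), mimicking the VAE's spatiotemporal compression."""
    _, _, t, h, w = latent_shape
    return F.interpolate(mask_pixel, size=(t, h, w), mode="nearest")


def face_aware_loss(pred, target, mask, face_weight: float = 2.0):
    """L1 loss plus an extra-weighted term on facial regions; the 'switchable'
    behaviour is modelled here by simply toggling this masked term on or off."""
    base = F.l1_loss(pred, target)
    face = F.l1_loss(pred * mask, target * mask)
    return base + face_weight * face


# Stage 1: latent-space optimization with the aligned mask M_l.
z_hat = torch.randn(1, 16, 5, 32, 32)                      # restored latent ẑ_h
z_gt = torch.randn(1, 16, 5, 32, 32)                       # ground-truth latent
m_pixel = (torch.rand(1, 1, 20, 256, 256) > 0.5).float()   # pixel-space mask M_p
m_latent = align_mask_to_latent(m_pixel, z_hat.shape)      # aligned mask M_l
loss_latent = face_aware_loss(z_hat, z_gt, m_latent)

# Stage 2: pixel-space refinement on the decoded video with M_p directly.
x_hat = torch.randn(1, 3, 20, 256, 256)                    # decoded video x̂_h
x_gt = torch.randn(1, 3, 20, 256, 256)
loss_pixel = face_aware_loss(x_hat, x_gt, m_pixel)
print(loss_latent.item(), loss_pixel.item())
```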

BibTeX