VividFace: Face-Aware One-Step Diffusion for High-Fidelity Video Face Enhancement

Shulian Zhang1, Long Peng2, Ziyang Wang1, Ye Chen1, Jie Li1,
Wenbo Li3, Yulun Zhang4, Jian Chen1, Yong Guo5
1South China University of Technology
2University of Science and Technology of China
3Chinese University of Hong Kong
4Shanghai Jiao Tong University
5Max Planck Institute for Informatics

Visualization

Abstract

Video Face Enhancement (VFE) aims to restore high-quality facial details from degraded video sequences, enabling a wide range of practical applications. Despite substantial progress in the field, current methods that primarily rely on video super-resolution and generative frameworks face three fundamental challenges: (1) computational inefficiency caused by iterative multi-step denoising in diffusion models; (2) insufficient recovery of fine-grained facial textures; and (3) poor restoration quality due to the lack of high-quality face video training data. To address these challenges, we propose VividFace, a one-step diffusion framework that reformulates a text-to-video generation model into a single-step generator with switchable face-aware guidance, which persistently focuses optimization on perceptually critical facial regions across both latent and pixel spaces. Furthermore, we propose a human-aligned, MLLM-driven data curation pipeline that leverages the video understanding capabilities of Multimodal Large Language Models (MLLMs) and iteratively refines scoring criteria through lightweight human-in-the-loop calibration to align quality judgments with human perceptual expectations, yielding MLLM-Face90, a high-quality face video dataset. Extensive experiments demonstrate that VividFace achieves superior perceptual quality, identity preservation, and temporal consistency on both synthetic and real-world benchmarks, while delivering a $12\times$ inference speedup over state-of-the-art diffusion-based VFE methods.
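To make the one-step idea concrete, the sketch below shows how a single predicted velocity step can map a degraded video latent directly to a restored latent, replacing iterative multi-step denoising. All names (OneStepRestorer, velocity_net) and the toy network are illustrative assumptions, not the released VividFace architecture or API.

```python
# Minimal sketch of one-step restoration via a predicted velocity field.
# The tiny Conv3d stack stands in for the reformulated text-to-video DiT backbone.
import torch
import torch.nn as nn


class OneStepRestorer(nn.Module):
    """Predicts a single velocity step mapping a degraded latent to a restored latent."""

    def __init__(self, dim: int = 16):
        super().__init__()
        # Placeholder backbone; the real model is a DiT distilled from a T2V generator.
        self.velocity_net = nn.Sequential(
            nn.Conv3d(dim, dim * 4, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Conv3d(dim * 4, dim, kernel_size=3, padding=1),
        )

    def forward(self, z_lq: torch.Tensor) -> torch.Tensor:
        # One-step flow matching: restored latent = degraded latent + predicted velocity,
        # instead of running many denoising iterations.
        v = self.velocity_net(z_lq)
        return z_lq + v


# Toy usage on a batch of degraded video latents shaped (B, C, T, H, W).
z_lq = torch.randn(1, 16, 5, 32, 32)
z_hq_hat = OneStepRestorer()(z_lq)
print(z_hq_hat.shape)  # torch.Size([1, 16, 5, 32, 32])
```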

Pipeline

VividFace pipeline

Overview of our VividFace framework. VividFace is a face-aware one-step enhancement method built upon a flow-matching formulation. Middle: Facial masks are first extracted in the pixel space (M_p) and then geometrically aligned to the latent space (M_l) to match the VAE's spatiotemporal compression. Top: In the first stage (Latent-space Optimization), the DiT model learns a single-step velocity field to transform the degraded latent z_l into the restored latent ẑ_h. This process is supervised by our switchable face-aware guidance using M_l. Bottom: In the second stage (Pixel-space Refinement), the decoded video x̂_h undergoes fine-grained optimization, where the switchable guidance is applied using M_p to further enhance facial textures and perceptual fidelity.
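The sketch below illustrates the two-stage supervision described in the caption: a facial mask is downsampled to match the VAE's spatiotemporal compression, and a masked loss emphasizes face regions in latent space (stage 1) and pixel space (stage 2). The mask alignment, loss form, and weighting are assumptions for illustration only; the paper's actual switchable guidance may differ.

```python
# Minimal sketch of switchable face-aware guidance across latent and pixel spaces.
import torch
import torch.nn.functional as F


def align_mask_to_latent(mask_pixel: torch.Tensor, latent_shape) -> torch.Tensor:
    """Downsample the pixel-space facial mask (B, 1, T, H, W) to the latent grid
    (B, C, t, h, w), mimicking the VAE's spatiotemporal compression."""
    _, _, t, h, w = latent_shape
    return F.interpolate(mask_pixel, size=(t, h, w), mode="nearest")


def face_aware_loss(pred, target, mask, face_weight: float = 2.0):
    """L1 loss plus an extra-weighted term on facial regions; the 'switchable'
    behaviour is modelled here by simply toggling this masked term on or off."""
    base = F.l1_loss(pred, target)
    face = F.l1_loss(pred * mask, target * mask)
    return base + face_weight * face


# Stage 1: latent-space optimization with the aligned mask M_l.
z_hat = torch.randn(1, 16, 5, 32, 32)                      # restored latent ẑ_h
z_gt = torch.randn(1, 16, 5, 32, 32)                       # ground-truth latent
m_pixel = (torch.rand(1, 1, 20, 256, 256) > 0.5).float()   # pixel-space mask M_p
m_latent = align_mask_to_latent(m_pixel, z_hat.shape)      # aligned mask M_l
loss_latent = face_aware_loss(z_hat, z_gt, m_latent)

# Stage 2: pixel-space refinement on the decoded video with M_p directly.
x_hat = torch.randn(1, 3, 20, 256, 256)                    # decoded video x̂_h
x_gt = torch.randn(1, 3, 20, 256, 256)
loss_pixel = face_aware_loss(x_hat, x_gt, m_pixel)
print(loss_latent.item(), loss_pixel.item())
```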

BibTeX