HVDM: Hybrid Video Diffusion Models with 2D Triplane and 3D Wavelet Representation

Kihong Kim¹ Haneol Lee² Jihye Park³ Seyeon Kim³ Kwanghee Lee¹ Seungryong Kim³† Jaejun Yoo²†
†Corresponding Author
¹VIVE STUDIOS ²UNIST ³Korea University


Abstract

Generating high-quality videos that synthesize desired realistic content is a challenging task due to the high dimensionality and complexity of video data. Several recent diffusion-based methods have shown comparable performance by compressing videos into a lower-dimensional latent space using a traditional video autoencoder architecture. However, such methods, which employ standard frame-wise 2D or 3D convolutions, fail to fully exploit the spatio-temporal nature of videos. To address this issue, we propose a novel hybrid video diffusion model, called HVDM, which captures spatio-temporal dependencies more effectively. HVDM is trained with a hybrid video autoencoder that extracts a disentangled representation of the video comprising: (i) global context information captured by a 2D projected latent, (ii) local volume information captured by 3D convolutions with wavelet decomposition, and (iii) frequency information for improving video reconstruction. Based on this disentangled representation, our hybrid autoencoder provides a more comprehensive video latent, enriching the generated videos with fine structures and details. Experiments on standard video generation benchmarks such as UCF101, SkyTimelapse, and TaiChi demonstrate that the proposed approach achieves state-of-the-art video generation quality and supports a wide range of video applications (e.g., long video generation, image-to-video, and video dynamics control).
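To make the two kinds of representation named in the abstract more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a single-channel video of shape (T, H, W) with even dimensions, uses mean pooling as a stand-in for the learned 2D projection, and uses a one-level Haar transform as the wavelet. The function names `triplane_project`, `haar_step`, and `haar_dwt3d` are hypothetical.

```python
import numpy as np


def triplane_project(video):
    """Collapse a (T, H, W) volume onto three 2D planes.

    Mean pooling is an assumption made for illustration only;
    a learned projection would replace it in a real autoencoder.
    """
    hw = video.mean(axis=0)  # spatial plane, shape (H, W)
    th = video.mean(axis=2)  # time-height plane, shape (T, H)
    tw = video.mean(axis=1)  # time-width plane, shape (T, W)
    return hw, th, tw


def haar_step(x, axis):
    """One-level orthonormal Haar split along `axis` (length must be even)."""
    even = np.take(x, np.arange(0, x.shape[axis], 2), axis=axis)
    odd = np.take(x, np.arange(1, x.shape[axis], 2), axis=axis)
    low = (even + odd) / np.sqrt(2.0)   # local averages
    high = (even - odd) / np.sqrt(2.0)  # local differences
    return low, high


def haar_dwt3d(video):
    """One-level 3D Haar decomposition into 8 subbands.

    Keys such as 'LLH' give the filter applied along (T, H, W) in order.
    The Haar wavelet is an assumed choice for this sketch.
    """
    bands = {"": video}
    for axis in range(3):
        split = {}
        for key, arr in bands.items():
            low, high = haar_step(arr, axis)
            split[key + "L"] = low
            split[key + "H"] = high
        bands = split
    return bands
```

For a video of shape (8, 16, 16), `triplane_project` returns planes of shape (16, 16), (8, 16), and (8, 16), and `haar_dwt3d` returns eight subbands of shape (4, 8, 8). Because the Haar filters here are orthonormal, the total energy of the subbands equals that of the input, so the decomposition itself discards no information.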

Overall architecture of our hybrid video autoencoder (HVDM)

Diverse latent video diffusion models

Main Results

Short Video Generation

Compared methods: DIGAN, LVDM, PVDM, HVDM

Long Video Generation

Compared methods: DIGAN, LVDM, PVDM, HVDM

Applications

Image-to-Video

Image / Video (four example pairs)

Video Dynamics Control

Motion settings: Slow Motion, Medium Motion, Fast Motion

BibTeX


    @article{kim2024hybrid,
        title={Hybrid Video Diffusion Models with 2D Triplane and 3D Wavelet Representation},
        author={Kim, Kihong and Lee, Haneol and Park, Jihye and Kim, Seyeon and Lee, Kwanghee and Kim, Seungryong and Yoo, Jaejun},
        year={2024},
        eprint={2402.13729},
        archivePrefix={arXiv},
        primaryClass={cs.CV}
    }

Project page template is borrowed from DreamBooth.