Application and expansion of DiT architecture in video generation models

Abstract

In recent years, the field of text-to-video generation has made remarkable progress. However, most current text-to-video models still rely on UNet as their main backbone, a choice that not only limits performance but also hinders large-scale expansion. In contrast, Transformers offer unique advantages thanks to their suitability for processing long-range contexts and their ease of scalability. Latte, the first open-source DiT-based text-to-video model, was proposed to address this gap. It aims to pioneer the construction of stable and effective large-scale neural networks in the video generation domain.

Date
Jun 14, 2024 10:00 AM — 11:10 AM
Location
Virtual & Online
Xin Ma

I’m a Ph.D. candidate at Monash University. My research interests include image super-resolution and inpainting, model compression, face recognition, video generation, and large-scale generative models.