In recent years, text-to-video generation has made remarkable progress. However, most current text-to-video models still rely on a UNet as the main backbone, a choice that not only limits further gains in model quality but also hinders scaling to larger models. In contrast, Transformers offer distinct advantages: they are well suited to modeling long-range context and scale readily. To address this, Latte, the world's first open-source DiT-based text-to-video model, has been proposed, aiming to pioneer stable and effective large-scale neural networks for video generation.