In recent years, text-to-video generation has made remarkable progress. However, most current text-to-video models still rely on a UNet as the main backbone, a choice that not only limits further gains in model quality but also hinders scaling to larger models. In contrast, Transformers offer distinct advantages: they are well suited to modeling long-range context and scale readily. To address this, Latte, the world's first open-source DiT-based text-to-video model, has been proposed, aiming to pioneer stable and effective large-scale neural networks for video generation.