Consistent and Controllable Image Animation with Motion Linear Diffusion Transformers  

Xin Ma¹  Yaohui Wang²  Gengyun Jia³  Xinyuan Chen²
Tien-Tsin Wong¹  Cunjian Chen¹

¹Monash University  ²Shanghai Artificial Intelligence Laboratory  ³Nanjing University of Posts and Telecommunications

[Paper]     [Github]    


Click to play the animations! Each example pairs an input image with its animated video, generated with the prompts below.

"A Car Driving on the Road" · "A Red Car Driving Slowly on the Road" · "Birds Shaking its Body"

"Bubbles Floating Upwards" · "Candle Flickering" · "City Lightning"

"Clouds in the Sky Moving Slowly" · "Doggy Barking" · "Hummingbird Flying in the Air"

Methodology

The powerful generative capabilities of diffusion transformer models have driven significant progress in image animation. However, the quadratic cost of vanilla self-attention in transformer blocks leads to heavy resource demands, making video generation computationally expensive. In addition, maintaining appearance consistency with the static input image and preventing abrupt motion transitions in the generated animation remain difficult problems. In this paper, we introduce MiraMo, which aims to achieve fast generation, better appearance consistency, and motion smoothness. Specifically, we first design a base text-to-video generation model that replaces all vanilla attention in the transformer blocks with more efficient linear attention, maintaining generation quality while ensuring temporal consistency. On top of this base model, MiraMo learns the distribution of motion residuals rather than directly predicting frames as in existing image animation methods. During inference, we further mitigate sudden motion changes in the animated video by introducing a novel DCT-based noise refinement strategy, DCTInit. To counteract over-smoothing of motion, we introduce a dynamics degree control design for better control over the magnitude of motion. Together, these strategies enable MiraMo to produce highly consistent, smooth, and motion-controllable animations with fast inference. Extensive comparisons with several state-of-the-art methods demonstrate the effectiveness and superiority of our approach. Finally, we also show how MiraMo can be applied to motion transfer and video editing of any given video.
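As a rough illustration of the efficiency point above, the following is a minimal sketch of a generic linear attention layer of the kind that can replace vanilla softmax attention in a transformer block. The feature map (ELU + 1), head layout, and dimensions are our own assumptions for illustration and do not reproduce the released MiraMo implementation.

```python
# Minimal sketch of linear attention (illustrative, not the MiraMo code).
# A positive feature map on q and k plus the associativity
# (q k^T) v = q (k^T v) makes the cost linear in the number of tokens.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LinearAttention(nn.Module):
    def __init__(self, dim, heads=8):
        super().__init__()
        self.heads = heads
        self.to_qkv = nn.Linear(dim, dim * 3, bias=False)
        self.to_out = nn.Linear(dim, dim)

    def forward(self, x):  # x: (batch, tokens, dim)
        b, n, d = x.shape
        h = self.heads
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)
        # split heads -> (batch, heads, tokens, head_dim)
        q, k, v = (t.reshape(b, n, h, d // h).transpose(1, 2) for t in (q, k, v))
        # positive feature map keeps the softmax-free kernel trick well defined
        q, k = F.elu(q) + 1, F.elu(k) + 1
        # aggregate keys and values first: O(n * d^2) rather than O(n^2 * d)
        kv = torch.einsum("bhnd,bhne->bhde", k, v)
        normalizer = 1.0 / (torch.einsum("bhnd,bhd->bhn", q, k.sum(dim=2)) + 1e-6)
        out = torch.einsum("bhnd,bhde,bhn->bhne", q, kv, normalizer)
        return self.to_out(out.transpose(1, 2).reshape(b, n, d))
```

Because the key-value summary `kv` has a fixed size independent of the token count, attention over long video token sequences (frames × height × width) no longer dominates the compute budget.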

Comparisons

We show the animated results generated by different methods using the prompt "the ship sailing on the water".
We qualitatively compare our method with both commercial tools and research approaches, including Hailuo, Genmo, ConsistI2V, DynamiCrafter, I2VGen-XL, SEINE, PIA, SVD, and Cinemo.

Click to play the following animations!

Analysis

The ablation studies and potential applications are presented here.

Motion intensity controllability

We demonstrate that our method can finely control the motion intensity of the animated videos. The prompt is "car driving". Note how fast the background trees move backward in each video, which reflects the car's speed.

Click to play the following animations!
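One common way to expose such a dynamics-degree control, sketched below under our own assumptions (this is not the released MiraMo interface), is to embed a scalar motion-intensity score the same way the diffusion timestep is embedded and add it to the conditioning signal that modulates the transformer blocks.

```python
# Illustrative sketch: conditioning a diffusion transformer on a scalar
# motion-intensity score, analogous to how the timestep is embedded.
# The embedding dimension and MLP are assumptions for illustration.
import math
import torch
import torch.nn as nn


def sinusoidal_embedding(value, dim):
    """Map a scalar batch (timestep or motion intensity) to `dim` features."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half)
    angles = value.float().unsqueeze(-1) * freqs
    return torch.cat([angles.sin(), angles.cos()], dim=-1)


class DynamicsConditioning(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.dim = dim
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, timestep, motion_intensity):
        t_emb = sinusoidal_embedding(timestep, self.dim)
        m_emb = sinusoidal_embedding(motion_intensity, self.dim)
        # larger motion_intensity values push the model toward stronger dynamics
        return self.mlp(t_emb + m_emb)


# usage: the same prompt and timestep, two different motion strengths
cond_module = DynamicsConditioning()
cond_slow = cond_module(torch.tensor([500]), torch.tensor([0.2]))
cond_fast = cond_module(torch.tensor([500]), torch.tensor([0.9]))
```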


Effectiveness of DCTInit

We demonstrate that the proposed DCTInit stabilizes the video generation process and effectively mitigates sudden motion changes. We also show that the DCT-based frequency-domain decomposition avoids the color inconsistency issues caused by FFT-based frequency-domain decomposition. Here, "Baseline" refers to results produced solely by the motion flow matching model, without any test-time improvement techniques. The prompts for the first and second rows are "woman smiling" and "tank moving", respectively.
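For readers wondering what a DCT-based initialization could look like in practice, below is a minimal sketch under our own assumptions (our reading of the idea, not the released DCTInit code): the low-frequency DCT coefficients of a reference latent, e.g. the noised input-image latent, are kept, while the high frequencies come from fresh Gaussian noise. The cutoff ratio and the third-party `torch-dct` helpers are assumptions.

```python
# Illustrative sketch of a DCT-based noise initialization (not the official
# DCTInit implementation). Low DCT frequencies come from a reference latent;
# high frequencies come from fresh noise, which tends to keep the overall
# layout stable while still re-sampling the fine detail.
import torch
import torch_dct as dct  # third-party package: pip install torch-dct


def dct_init(ref_latent, noise, cutoff=0.25):
    """ref_latent, noise: tensors of shape (frames, channels, height, width)."""
    h, w = noise.shape[-2:]
    ref_freq = dct.dct_2d(ref_latent, norm="ortho")
    noise_freq = dct.dct_2d(noise, norm="ortho")
    # low frequencies live in the top-left corner of the DCT spectrum
    low_pass = torch.zeros(h, w, device=noise.device)
    low_pass[: int(h * cutoff), : int(w * cutoff)] = 1.0
    mixed = ref_freq * low_pass + noise_freq * (1.0 - low_pass)
    return dct.idct_2d(mixed, norm="ortho")
```

Unlike an FFT, the DCT is real-valued, which may be one reason this kind of decomposition sidesteps the color inconsistency noted above.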


Motion control by prompt

We demonstrate that our method does not rely on complex guiding instructions; even simple textual prompts yield satisfactory visual effects.


Motion transfer/Video editing

We demonstrate that our proposed method can also be applied to motion transfer and video editing. We use an off-the-shelf image editing method to edit the first frame of the input video.

Gallery

Text-to-video generation (480p)

Using prompts:

©Xin Ma · Powered by DreamBooth