Consistent and Controllable Image Animation with Motion Linear Diffusion Transformers  

Xin Ma¹  Yaohui Wang²  Gengyun Jia³  Xinyuan Chen²
Tien-Tsin Wong¹  Cunjian Chen¹

¹Monash University  ²Shanghai Artificial Intelligence Laboratory  ³Nanjing University of Posts and Telecommunications

[Paper]     [Github]    


Click to play the animations! Each example pairs an input image with its animated video, generated with the prompts below.

"A Car Driving on the Road" · "A Red Car Driving Slowly on the Road" · "Birds Shaking its Body"

"Bubbles Floating Upwards" · "Candle Flickering" · "City Lightning"

"Clouds in the Sky Moving Slowly" · "Doggy Barking" · "Hummingbird Flying in the Air"

Methodology

The powerful generative capabilities of diffusion transformer models have driven significant progress in image animation. However, the quadratic cost of vanilla self-attention in transformer blocks leads to heavy resource demands, making video generation computationally expensive. In addition, maintaining appearance consistency with the static input image and preventing abrupt motion transitions in the generated animation remain difficult problems. In this paper, we introduce MiraMo, which aims to achieve fast generation, better appearance consistency, and motion smoothness. Specifically, we first design a base text-to-video generation model that replaces all vanilla attention in the transformer blocks with more efficient linear attention, maintaining generation quality while ensuring temporal consistency. On top of this base model, MiraMo learns the distribution of motion residuals rather than directly predicting frames as in existing image animation methods. During inference, we further mitigate sudden motion changes in the animated video by introducing a novel DCT-based noise refinement strategy, DCTInit. To counteract over-smoothing of motion, we introduce a dynamics degree control design for better control over the magnitude of motion. Together, these strategies enable MiraMo to produce highly consistent, smooth, and motion-controllable animations with fast inference. Extensive comparisons with several state-of-the-art methods demonstrate the effectiveness and superiority of our approach. Finally, we also show how MiraMo can be applied to motion transfer and video editing of any given video.
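As a rough illustration of the efficiency point above, the following is a minimal sketch of a generic linear attention layer of the kind that can replace vanilla softmax attention in a transformer block. The feature map (ELU + 1), head layout, and dimensions are our own assumptions for illustration and do not reproduce the released MiraMo implementation.

```python
# Minimal sketch of linear attention (illustrative, not the MiraMo code).
# A positive feature map on q and k plus the associativity
# (q k^T) v = q (k^T v) makes the cost linear in the number of tokens.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LinearAttention(nn.Module):
    def __init__(self, dim, heads=8):
        super().__init__()
        self.heads = heads
        self.to_qkv = nn.Linear(dim, dim * 3, bias=False)
        self.to_out = nn.Linear(dim, dim)

    def forward(self, x):  # x: (batch, tokens, dim)
        b, n, d = x.shape
        h = self.heads
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)
        # split heads -> (batch, heads, tokens, head_dim)
        q, k, v = (t.reshape(b, n, h, d // h).transpose(1, 2) for t in (q, k, v))
        # positive feature map keeps the softmax-free kernel trick well defined
        q, k = F.elu(q) + 1, F.elu(k) + 1
        # aggregate keys and values first: O(n * d^2) rather than O(n^2 * d)
        kv = torch.einsum("bhnd,bhne->bhde", k, v)
        normalizer = 1.0 / (torch.einsum("bhnd,bhd->bhn", q, k.sum(dim=2)) + 1e-6)
        out = torch.einsum("bhnd,bhde,bhn->bhne", q, kv, normalizer)
        return self.to_out(out.transpose(1, 2).reshape(b, n, d))
```

Because the key-value summary `kv` has a fixed size independent of the token count, attention over long video token sequences (frames × height × width) no longer dominates the compute budget.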

Comparisons

We show the animated results generated by different methods using the prompt "the ship sailing on the water".
We qualitatively compare our method with both commercial tools and research approaches, including Hailuo, Genmo, ConsistI2V, DynamiCrafter, I2VGen-XL, SEINE, PIA, SVD, and Cinemo.

Click to play the following animations!

Analysis

The ablation studies and potential applications are presented here.

Motion intensity controllability

We demonstrate that our method can finely control the motion intensity of the animated videos. The prompt is "car driving". Note how fast the background trees move backward in each video, which reflects the car's speed.

Click to play the following animations!
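One common way to expose such a dynamics-degree control, sketched below under our own assumptions (this is not the released MiraMo interface), is to embed a scalar motion-intensity score the same way the diffusion timestep is embedded and add it to the conditioning signal that modulates the transformer blocks.

```python
# Illustrative sketch: conditioning a diffusion transformer on a scalar
# motion-intensity score, analogous to how the timestep is embedded.
# The embedding dimension and MLP are assumptions for illustration.
import math
import torch
import torch.nn as nn


def sinusoidal_embedding(value, dim):
    """Map a scalar batch (timestep or motion intensity) to `dim` features."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half)
    angles = value.float().unsqueeze(-1) * freqs
    return torch.cat([angles.sin(), angles.cos()], dim=-1)


class DynamicsConditioning(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.dim = dim
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, timestep, motion_intensity):
        t_emb = sinusoidal_embedding(timestep, self.dim)
        m_emb = sinusoidal_embedding(motion_intensity, self.dim)
        # larger motion_intensity values push the model toward stronger dynamics
        return self.mlp(t_emb + m_emb)


# usage: the same prompt and timestep, two different motion strengths
cond_module = DynamicsConditioning()
cond_slow = cond_module(torch.tensor([500]), torch.tensor([0.2]))
cond_fast = cond_module(torch.tensor([500]), torch.tensor([0.9]))
```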


Effectiveness of DCTInit

We demonstrate that the proposed DCTInit stabilizes the video generation process and effectively mitigates sudden motion changes. We also show that the DCT-based frequency-domain decomposition avoids the color inconsistency issues caused by FFT-based frequency-domain decomposition. Here, "Baseline" refers to results produced solely by the motion flow matching model, without any test-time improvement techniques. The prompts for the first and second rows are "woman smiling" and "tank moving", respectively.
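For readers wondering what a DCT-based initialization could look like in practice, below is a minimal sketch under our own assumptions (our reading of the idea, not the released DCTInit code): the low-frequency DCT coefficients of a reference latent, e.g. the noised input-image latent, are kept, while the high frequencies come from fresh Gaussian noise. The cutoff ratio and the third-party `torch-dct` helpers are assumptions.

```python
# Illustrative sketch of a DCT-based noise initialization (not the official
# DCTInit implementation). Low DCT frequencies come from a reference latent;
# high frequencies come from fresh noise, which tends to keep the overall
# layout stable while still re-sampling the fine detail.
import torch
import torch_dct as dct  # third-party package: pip install torch-dct


def dct_init(ref_latent, noise, cutoff=0.25):
    """ref_latent, noise: tensors of shape (frames, channels, height, width)."""
    h, w = noise.shape[-2:]
    ref_freq = dct.dct_2d(ref_latent, norm="ortho")
    noise_freq = dct.dct_2d(noise, norm="ortho")
    # low frequencies live in the top-left corner of the DCT spectrum
    low_pass = torch.zeros(h, w, device=noise.device)
    low_pass[: int(h * cutoff), : int(w * cutoff)] = 1.0
    mixed = ref_freq * low_pass + noise_freq * (1.0 - low_pass)
    return dct.idct_2d(mixed, norm="ortho")
```

Unlike an FFT, the DCT is real-valued, which may be one reason this kind of decomposition sidesteps the color inconsistency noted above.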


Motion control by prompt

We demonstrate that our method does not rely on complex guiding instructions; even simple textual prompts yield satisfactory visual effects.


Motion transfer/Video editing

We demonstrate that our proposed method can also be applied to motion transfer and video editing. We use an off-the-shelf image editing method to edit the first frame of the input video.

Gallery

Text-to-video generation (480p)

Using prompts:

©Xin Ma · Powered by DreamBooth