Diffusion transformer models, with their powerful generative capabilities, have driven significant progress in image animation.
However, the quadratic complexity of vanilla self-attention in transformer blocks imposes heavy resource demands, making video generation
computationally expensive. In addition, maintaining appearance consistency with the static input image and preventing abrupt motion transitions
in the generated animation remain difficult problems. In this paper, we introduce MiraMo, a framework that aims to achieve fast generation,
better appearance consistency, and smoother motion. Specifically, we first design a base text-to-video generation model that replaces all
vanilla attention in the transformer blocks with more efficient linear attention, maintaining generation quality while ensuring temporal consistency.
Building on this base model, MiraMo learns the distribution of motion residuals rather than directly predicting frames, as existing image animation
methods do. During inference, we further mitigate sudden motion changes in the animated video with a novel DCT-based noise refinement
strategy. To counteract the over-smoothing of motion, we also introduce a dynamics degree control mechanism that regulates the magnitude of motion. Altogether,
these strategies enable MiraMo to produce highly consistent, smooth, and motion-controllable animations with fast inference. Extensive experiments
against several state-of-the-art methods demonstrate the effectiveness and superiority of the proposed approach. Finally, we also show how
MiraMo can be applied to motion transfer and video editing of any given video.
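To give a concrete sense of the efficiency argument, below is a minimal, self-contained sketch of linear attention in the style of Katharopoulos et al.; the elu(x)+1 feature map and the function shape are illustrative assumptions, not necessarily MiraMo's exact kernel.

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    """O(N) attention over spatio-temporal tokens.

    q, k, v: (batch, heads, tokens, dim). The elu(.)+1 feature map is one
    common non-negative kernel choice; MiraMo's exact formulation may differ.
    """
    q = F.elu(q) + 1.0                                    # phi(q) >= 0
    k = F.elu(k) + 1.0                                    # phi(k) >= 0
    kv = torch.einsum('bhnd,bhne->bhde', k, v)            # fixed-size key/value summary
    z = 1.0 / (torch.einsum('bhnd,bhd->bhn', q, k.sum(dim=2)) + eps)  # normalizer
    return torch.einsum('bhnd,bhde,bhn->bhne', q, kv, z)  # no N x N attention matrix
```

Because keys and values are folded into a dim x dim summary, the cost grows linearly with the number of spatio-temporal tokens, which is what keeps long video sequences affordable.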
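The motion-residual idea can likewise be sketched as a flow-matching objective on residuals (frame minus conditioning frame) in latent space; the function below is a schematic training step under that assumption, with a hypothetical model signature, and is not the authors' released code.

```python
import torch
import torch.nn.functional as F

def motion_residual_fm_loss(model, latents, text_emb):
    """Schematic flow-matching loss on motion residuals.

    latents: (B, F, C, H, W) video latents whose first frame is the static
    input image. The network predicts a velocity field for the residuals
    r = latents - first_frame, so appearance stays anchored to the input
    image and only the motion distribution has to be learned.
    """
    first = latents[:, :1]                       # (B, 1, C, H, W) conditioning frame
    residual = latents - first                   # motion residuals (training target)

    noise = torch.randn_like(residual)
    t = torch.rand(latents.shape[0], device=latents.device)   # flow-matching time
    t_ = t.view(-1, 1, 1, 1, 1)

    x_t = (1.0 - t_) * noise + t_ * residual     # linear interpolation path
    target_v = residual - noise                  # velocity of that path

    # hypothetical signature: conditioning on the first frame and the text prompt
    pred_v = model(x_t, t, first_frame=first, text_emb=text_emb)
    return F.mse_loss(pred_v, target_v)
```

Under this formulation, the residuals sampled at inference would be added back onto the input-image latent, which is what keeps the animation close to the static image.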
Comparisons
We show the animated results generated by different methods using the prompt "the ship sailing on the water".
We qualitatively compare our method with both commercial tools and research approaches,
including Hailuo, Genmo, ConsistI2V, DynamiCrafter, I2VGen-XL, SEINE, PIA, SVD and Cinemo.
Click to play the following animations!
Input Image
Hailuo
Genmo
ConsistI2V
DynamiCrafter
I2VGen-XL
SEINE
PIA
SVD
Cinemo
Ours
Analysis
Ablation studies and potential applications are presented below.
Motion intensity controllability
We demonstrate that our method can finely control the motion intensity of animated videos. The prompt is "car driving".
Observe how quickly the background trees move backward in each video, which reflects the car's speed; a schematic sketch of the dynamics degree conditioning follows the examples below.
Click to play the following animations!
Input Image
b=0
b=9
b=18
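One simple way to realize such a dynamics degree control, shown purely as a hypothetical sketch (the scalar b, its range, and the conditioning path are assumptions rather than the authors' implementation), is to embed b and add it to the diffusion timestep embedding:

```python
import torch
import torch.nn as nn

class DynamicsDegreeEmbedding(nn.Module):
    """Hypothetical scalar conditioning for motion intensity.

    Maps a dynamics degree b (e.g. 0, 9, 18 in the examples above) to an
    embedding added to the timestep embedding, letting the denoiser
    modulate how large the predicted motion residuals should be.
    """
    def __init__(self, dim, max_degree=20.0):
        super().__init__()
        self.max_degree = max_degree
        self.mlp = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, b, t_emb):
        b = torch.as_tensor(b, dtype=t_emb.dtype, device=t_emb.device)
        b = b.reshape(-1, 1) / self.max_degree   # normalize the degree to [0, 1]
        return t_emb + self.mlp(b)               # inject motion intensity alongside t
```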
Effectiveness of DCTInit
We demonstrate that the proposed DCTInit stabilizes the video generation process and effectively mitigates sudden motion changes,
and that its DCT-based frequency-domain decomposition avoids the color inconsistency issues caused by an FFT-based decomposition.
Here, "Baseline" refers to results produced solely by our motion flow-matching model, without any test-time refinement techniques.
The prompts for the first and second rows are "woman smiling" and "tank moving", respectively; a rough sketch of the DCT-based frequency mixing follows the examples below.
Input Image
Baseline
FFTInit
DCTInit
Input Image
Baseline
FFTInit
DCTInit
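As a rough, hypothetical sketch of what a DCT-based noise refinement can look like (the cutoff ratio, the choice of latents, and the schedule are assumptions, not DCTInit's specification), the low-frequency DCT coefficients of a noised latent derived from the static input image can be mixed into the random initial noise while the random high frequencies are kept; one plausible reason such a DCT split behaves better than an FFT split is that the DCT is purely real-valued, so no complex phase has to be recombined.

```python
import numpy as np
from scipy.fft import dctn, idctn

def dct_init(noise, image_latent_noised, low_freq_ratio=0.25):
    """Schematic low/high-frequency mixing in the DCT domain.

    noise:               (frames, channels, height, width) random initial noise
    image_latent_noised: same shape, a noised latent built from the input image
    Low-frequency coefficients (top-left block of the DCT array) are taken
    from the image-conditioned latent; high frequencies stay random.
    """
    refined = np.empty_like(noise)
    f, c, h, w = noise.shape
    kh, kw = int(h * low_freq_ratio), int(w * low_freq_ratio)
    for i in range(f):
        for j in range(c):
            noise_dct = dctn(noise[i, j], norm='ortho')
            image_dct = dctn(image_latent_noised[i, j], norm='ortho')
            noise_dct[:kh, :kw] = image_dct[:kh, :kw]   # keep image low frequencies
            refined[i, j] = idctn(noise_dct, norm='ortho')
    return refined
```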
Motion control by prompt
We demonstrate that our method does not rely on complex guiding instructions; even simple textual prompts can yield satisfactory visual effects.
Input Image
"Fireworks"
"Leaves Swaying"
"Lightning"
Motion transfer/Video editing
We demonstrate that our proposed method can also be applied to motion transfer and video editing. We use an off-the-shelf image editing method to edit the first frame of the input video.
"A panda standing on a surfboard in the ocean in sunset"
"A cyborg koala DJ in front of a turntable, in heavy raining futuristic Tokyo rooftop cyberpunk night, sci-fi, fantasy, intricate, neon light, soft light smooth, sharp focus, illustration"