Diffusion models have achieved significant progress in image animation thanks to their powerful generative capabilities.
However, preserving appearance consistency with the static input image and avoiding abrupt motion changes in the generated animation remain challenging.
In this paper, we introduce Cinemo, a novel image animation approach that aims at better appearance consistency and motion smoothness.
At the core of Cinemo is learning the distribution of motion residuals, rather than directly predicting frames as existing diffusion models do.
During inference, we further mitigate sudden motion changes in the generated video by introducing a novel DCT-based noise refinement strategy.
To counteract over-smoothing of motion, we introduce a dynamics degree control design for finer control over the magnitude of motion.
Altogether, these strategies enable Cinemo to produce highly consistent, smooth, and motion-controllable results.
Extensive experiments against several state-of-the-art methods demonstrate the effectiveness and superiority of our proposed approach.
Finally, we also demonstrate how our model can be applied to motion transfer and video editing of any given video.
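To make the motion-residual formulation concrete, here is a minimal sketch, assuming frame-space residuals and PyTorch tensors for clarity; Cinemo's actual backbone and training setup are not reproduced here, and the function names are ours rather than the released implementation's.

```python
# Minimal sketch of the motion-residual idea (a simplification, not the exact
# Cinemo pipeline): the diffusion model is trained to denoise per-frame residuals
# relative to the static input image rather than the frames themselves.
import torch

def residual_target(frames: torch.Tensor, input_image: torch.Tensor) -> torch.Tensor:
    # frames: (B, T, C, H, W) training clip; input_image: (B, C, H, W) first frame.
    return frames - input_image.unsqueeze(1)

def reconstruct_video(pred_residuals: torch.Tensor, input_image: torch.Tensor) -> torch.Tensor:
    # Add the predicted motion residuals back onto the static input image,
    # which keeps the appearance of every frame tied to the input.
    return input_image.unsqueeze(1) + pred_residuals
```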
Comparisons
We show the animation results generated by different methods using the prompt "girl smiling".
We qualitatively compare our method with both commercial tools and research approaches,
including Pika Labs, Genmo, ConsistI2V, DynamiCrafter, I2VGen-XL, SEINE, PIA and SVD.
Click to play the following animations!
Input Image
Pika Labs
Genmo
ConsistI2V
DynamiCrafter
I2VGen-XL
SEINE
PIA
SVD
Ours
Input Image
DynamiCrafter
I2VGen-XL
SEINE
PIA
SVD
Ours
Analysis
The ablation studies and potential applications are presented here.
Motion intensity controllability
We demonstrate that our method can finely control the motion intensity of the animated videos. The prompts are "shark swimming" and "car moving", respectively; a sketch of how the intensity level b could be computed and used as conditioning follows the examples.
Click to play the following animations!
Input Image
b=0
b=9
b=18
Input Image
b=0
b=9
b=18
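The intensity level b above is the conditioning signal of the dynamics degree control design. The snippet below is a hypothetical sketch of how such a scalar could be derived from a training clip; the binning scheme and function name are our assumptions, not the released implementation.

```python
import torch

def dynamics_degree(frames: torch.Tensor, num_bins: int = 19) -> int:
    # frames: (T, C, H, W) clip with values normalized to [0, 1].
    # Average inter-frame change, mapped to an integer bin b in [0, num_bins - 1];
    # a larger b corresponds to more pronounced motion.
    diff = (frames[1:] - frames[:-1]).abs().mean()
    return int(torch.clamp(diff * num_bins, 0, num_bins - 1))
```

At inference, the user simply picks b (e.g. 0, 9, or 18 as above) and passes it to the model as an extra conditioning signal, trading subtle motion off against large motion.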
Effectiveness of DCTInit
We demonstrate that the proposed DCTInit stabilizes the video generation process and effectively mitigates sudden motion changes.
Moreover, the DCT-based frequency-domain decomposition effectively mitigates the color inconsistency issues caused by FFT-based frequency-domain decomposition.
Here, "Baseline" refers to results produced solely by our motion diffusion model, without any test-time improvement techniques.
The prompts for the first and second rows are "woman smiling" and "robot dancing", respectively; a sketch of the DCTInit procedure is given after the examples.
Input Image
Baseline
FFTInit
DCTInit
Input Image
Baseline
FFTInit
DCTInit
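As referenced above, the following is a minimal sketch of DCT-based noise initialization, assuming a FreeInit-style recipe in which the low spatial frequencies of the initial noise are taken from the noised input-image latent while the high frequencies stay random. The function name, cutoff value, and exact masking scheme are our assumptions.

```python
import numpy as np
from scipy.fft import dctn, idctn

def dct_init(image_latent: np.ndarray, noise: np.ndarray, cutoff: float = 0.25) -> np.ndarray:
    # image_latent, noise: (C, H, W). Returns refined initial noise whose low
    # spatial frequencies come from the (noised) input-image latent.
    c, h, w = noise.shape
    yy, xx = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    low_mask = (yy / h + xx / w) < cutoff   # low frequencies sit near the DCT origin
    refined = np.empty_like(noise)
    for i in range(c):
        img_dct = dctn(image_latent[i], norm="ortho")
        noise_dct = dctn(noise[i], norm="ortho")
        refined[i] = idctn(np.where(low_mask, img_dct, noise_dct), norm="ortho")
    return refined
```

Because the DCT operates on real signals without the periodic wrap-around of the FFT, this split plausibly explains why DCTInit avoids the color shifts observed with FFTInit.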
Motion control by prompt
We demonstrate that our method does not rely on complex guiding instructions; even simple textual prompts can yield satisfactory visual effects.
Input Image
"Fireworks"
"Leaves Swaying"
"Lightning"
Motion transfer/Video editing
We demonstrate that our proposed method can also be applied to motion transfer and video editing. We use an off-the-shelf image editing method to edit the first frame of the input video.
Original video
First frame
Edited first frame
Output video
Gallery
More animation results generated by our method are shown here.
Click to play results from Cinemo!
"Birds Rubbing Their Beaks"
"Downward Flow of Waterfall"
"Bubbles Floating Upwards"
"Car Driving on the Road"
"City Lightning"
"Clouds in the Sky Moving Slowly"
"Dragon Glowing Eyes"
"Ducks Swimming on the Water"
"Flames Burning and Light Snow Falling"
"People Walking"
"Planet Rotating"
"River Flowing"
"Bird Walking on the beach"
"Snowman Waving His Hand"
"Tank Moving"
"Monkeys Playing"
"Butterfly Folding its Wings Upward"
"Woman Walking"
"Sea Swell"
"Candle Flickering"
"Girl Dancing under the Stars"
Project page template is borrowed from DreamBooth.