Training-free Stylized Text-to-Image Generation with Fast Inference
Xin Ma1
Yaohui Wang2
Xinyuan Chen2
Tien-Tsin Wong1
Cunjian Chen1
1Monash University
2Shanghai Artificial Intelligence Laboratory
[Paper]
[Github]
OmniPainter can generate high-quality images that match both the given prompt and the style reference image within just 4 to 6 timesteps, without requiring any inversion!
Reference styles and generated images (512px) for the prompts "Lotus flowers", "Mountain", "Sea", "Cherry blossom", "Kitten", and "Woman".
Methodology
Although diffusion models exhibit impressive generative capabilities,
existing methods for stylized image generation based on these models often require textual inversion or fine-tuning with style images,
which is time-consuming and limits the practical applicability of large-scale diffusion models. To address these challenges, we propose a
novel stylized image generation method, termed OmniPainter, that leverages a pre-trained large-scale diffusion model without fine-tuning or any additional
optimization. Specifically, we exploit the self-consistency property of latent consistency models to extract
representative style statistics from reference style images and use them to guide the stylization process. We then introduce the
norm mixture of self-attention, which enables the model to query the most relevant style patterns from these statistics for the intermediate
content features. This mechanism also ensures that the stylized results align closely with the distribution of the reference style images.
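The exact formulation is given in the paper; as a rough illustration, here is a minimal PyTorch sketch of a norm mixture of self-attention, assuming feature tensors of shape (batch, tokens, dim), an AdaIN-style normalization for the "norm" step, and helper names of our own choosing:

```python
# Minimal sketch of a norm mixture of self-attention. Shapes, helper names,
# and the AdaIN-style "norm" step are our assumptions, not released code.
import torch

def attn(q, k, v):
    # Standard scaled dot-product attention.
    scale = q.shape[-1] ** -0.5
    return torch.softmax(q @ k.transpose(-2, -1) * scale, dim=-1) @ v

def adain(x, ref, eps=1e-5):
    # Shift x's per-channel statistics (over tokens) to match those of ref.
    mu_x, std_x = x.mean(dim=1, keepdim=True), x.std(dim=1, keepdim=True)
    mu_r, std_r = ref.mean(dim=1, keepdim=True), ref.std(dim=1, keepdim=True)
    return (x - mu_x) / (std_x + eps) * std_r + mu_r

def norm_mixture_of_self_attention(q_c, k_c, v_c, k_s, v_s):
    # Content queries attend over concatenated content and style keys/values,
    # so each content token can retrieve the most relevant style patterns.
    k = torch.cat([k_c, k_s], dim=1)
    v = torch.cat([v_c, v_s], dim=1)
    mixed = attn(q_c, k, v)
    # Renormalize the mixed output toward the style-only attention output so
    # the result stays close to the reference style distribution.
    style_only = attn(q_c, k_s, v_s)
    return adain(mixed, style_only)
```

Here the content branch provides the queries while keys and values come from both branches, which is what lets each content token query the extracted style statistics.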
Comparisons
We show the results generated by different methods.
We qualitatively compare our method with both personalized text-to-image generation methods and style transfer methods,
including DEADiff, Custom Diffusion, DreamBooth, StyleDrop, InstaStyle, AdaAttN, CCPL, StyleID, CAST, and AesPA-Net.
Comparison with personalized text-to-image generation methods
From top to bottom, the prompts are "Couch", "Bus", "Sweet peppers", "Castle", "Woman driving lawn mower", "Clouds", "Dinosaur", and "Raccoon", respectively.
Columns (left to right): reference, DEADiff, Custom Diffusion, DreamBooth, StyleDrop, InstaStyle, and ours.
Comparison with neural image style transfer methods
We first use "bed", "bridge", "tank", "camel", "fox", and "butterfly" as prompts to generate the corresponding content images with the base text-to-image model.
We then apply each style transfer method to these content images using the reference style images.
Content images shown: "bed", "bridge", "tank", "camel", and "fox".
Analysis
The analysis is presented below.
Challenges in fine-grained style control for text-to-image generation
While text-to-image diffusion models can generate images in common styles via prompts,
producing fine-grained or less popular styles remains difficult, making reference images a more effective tool for nuanced style control.
Reference styles and generated images (512px) for the prompts "Beaver", "Boy", "Mushrooms", "Mountain", "Pickup truck", and "Rabbit".
Effects of different attention controls
We discuss three attention control strategies (direct replacement, direct addition, and the mixture of self-attention) and analyze their limitations.
To address these issues, we propose the norm mixture of self-attention; the baselines are sketched below.
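For concreteness, here is a minimal sketch of the three baseline controls, using the same assumed shapes and helper names as the sketch in Methodology; these illustrate the failure modes discussed below, not the paper's exact implementation:

```python
# Baseline attention controls, sketched under the same assumptions as above
# (features of shape (batch, tokens, dim); names are ours).
import torch

def attn(q, k, v):
    scale = q.shape[-1] ** -0.5
    return torch.softmax(q @ k.transpose(-2, -1) * scale, dim=-1) @ v

def direct_replacement(q_c, k_s, v_s):
    # Content queries attend only to style keys/values: style statistics
    # dominate and prompt semantics can be lost.
    return attn(q_c, k_s, v_s)

def direct_addition(q_c, k_c, v_c, k_s, v_s, lam=0.5):
    # Weighted blend of content and style attention outputs: the result
    # hinges on the choice of the balance factor `lam`.
    return (1 - lam) * attn(q_c, k_c, v_c) + lam * attn(q_c, k_s, v_s)

def mixture_of_self_attention(q_c, k_c, v_c, k_s, v_s):
    # Attend over concatenated keys/values; flexible, but the output's global
    # statistics can drift from the reference style (hence the norm step).
    k = torch.cat([k_c, k_s], dim=1)
    v = torch.cat([v_c, v_s], dim=1)
    return attn(q_c, k, v)
```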
Issues of direct replacement.
It tends to prioritize style statistics at the expense of the semantic information derived from the prompt.
Figure: style reference; "Girl" and "House" results; ResNet block, query, and self-attention map visualizations.
Issues of direct addition.
It is less robust and highly dependent on the choice of balance factors.
Figure: style reference with "Apples" and "Butterfly" results.
Effect of the norm mixture of self-attention.
The mixture of self-attention can sometimes cause a mismatch between the global style distribution of the reference style image and that of the generated image (top row).
The norm mixture of self-attention effectively mitigates this issue (bottom row).
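One way to write a normalization with this effect (our notation, not necessarily the paper's): let $O$ be the mixed attention output, $O_s$ the style-only attention output, and $\mu$, $\sigma$ channel-wise mean and standard deviation; then

$$\tilde{O} = \sigma(O_s)\,\frac{O - \mu(O)}{\sigma(O)} + \mu(O_s),$$

which matches the output's first- and second-order statistics to the style branch and thereby aligns the global style distribution.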
Figure: style references with "Apples" and "Butterfly" results (mixture of self-attention, top; norm mixture of self-attention, bottom).
Overall effect of the different attention controls.
Columns (left to right): style reference, direct replacement, direct addition, mixture of self-attention, and norm mixture of self-attention (Norm MSA).
Comparison of performance and efficiency
Circle size reflects performance: the larger the circle, the better.
Our method delivers strong results without any fine-tuning and achieves the shortest inference time.
Gallery
More results (512px) generated by our method, using diverse style reference images, are shown here.
Reference styles paired with generated images for the prompts: "Antelope", "Barbecue grill", "Lamp", "Leopard", "Man"; "Bed", "Bookshelf", "Bottles", "Couch", "Mountain"; "Castle", "Fence", "House", "Laptop", "Palm tree"; "Bowls", "Cactus", "Cherry blossom", "Crane", "Helicopter"; "Bear", "Beetle car", "Duck", "Girl", "Hedgehog"; "Bear", "Boy", "Bridge", "Cups", "Curtain"; "Antelope", "Castle", "Cloud", "Kitten", "Sunflowers"; "Bee", "Boy", "Headphones", "Ice cream", "Koala"; "Bicycle", "Fox", "Ladder", "Mushrooms", "Wolf"; "Girl", "Tram", "Tree", "Wolf", "Yacht".
Most of the reference style images are sourced from the Internet.
If you believe that any of them infringe upon your legal rights,
please contact me and I will remove them.
Project page template is borrowed from DreamBooth.