Training-free Stylized Text-to-Image Generation with Fast Inference

Xin Ma1 Yaohui Wang2 Xinyuan Chen2 Tien-Tsin Wong1 Cunjian Chen1

1Monash University 2Shanghai Artificial Intelligence Laboratory

[Paper]     [Github]    


OmniPainter can generate high-quality images that match both the given prompt and the style reference image within just 4 to 6 timesteps, without requiring any inversion!
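OmniPainter's own code path is not shown on this page; purely as a reference point for the few-step, inversion-free claim above, the snippet below illustrates plain few-step sampling with an off-the-shelf latent consistency model in diffusers. The checkpoint name, step count, and guidance scale are example choices, and no style conditioning is involved here.

```python
# Minimal sketch of plain few-step LCM sampling; OmniPainter's style guidance
# is NOT included here. The checkpoint name, step count, and guidance scale
# are example choices, not settings taken from the paper.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "SimianLuo/LCM_Dreamshaper_v7", torch_dtype=torch.float16
).to("cuda")

# 4 denoising steps; no inversion of any reference image is involved.
image = pipe("Lotus flowers", num_inference_steps=4, guidance_scale=8.0).images[0]
image.save("lotus_flowers.png")
```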

Reference Styles

Generated Images (512px)

Prompts, row by row: "Lotus flowers", "Mountain", "Sea"; "Cherry blossom", "Kitten", "Woman"; "Boy", "Crane", "Dog"

Methodology

Although diffusion models exhibit impressive generative capabilities, existing stylized image generation methods built on them often require textual inversion or fine-tuning with style images, which is time-consuming and limits the practical applicability of large-scale diffusion models. To address these challenges, we propose OmniPainter, a stylized image generation method that leverages a pre-trained large-scale diffusion model without any fine-tuning or additional optimization. Specifically, we exploit the self-consistency property of latent consistency models to extract representative style statistics from reference style images and use them to guide the stylization process. We then introduce the norm mixture of self-attention, which lets the model query the most relevant style patterns from these statistics for the intermediate content features. This mechanism also ensures that the stylized results align closely with the distribution of the reference style images.
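The precise definitions of the style statistics and the norm mixture of self-attention are given in the paper rather than on this page; the snippet below is only a rough PyTorch sketch of the idea, with assumed (B, N, C) attention-token shapes and an assumed AdaIN-style re-normalization step.

```python
# Illustrative sketch only: tensor shapes and the AdaIN-style normalization
# are assumptions, not OmniPainter's exact formulation.
import torch


def style_statistics(style_feats: torch.Tensor):
    """Channel-wise mean/std of reference self-attention tokens.

    style_feats: (B, N, C) tokens collected from a self-attention layer
    during the reference (style) pass of a latent consistency model.
    """
    mu = style_feats.mean(dim=1, keepdim=True)            # (B, 1, C)
    sigma = style_feats.std(dim=1, keepdim=True) + 1e-6   # (B, 1, C)
    return mu, sigma


def norm_mixture_self_attention(q, k_c, v_c, k_s, v_s, mu_s, sigma_s):
    """Content queries attend to concatenated content/style keys and values;
    the mixed output is then re-normalized toward the style statistics."""
    d = q.shape[-1]
    k = torch.cat([k_c, k_s], dim=1)                      # (B, 2N, C)
    v = torch.cat([v_c, v_s], dim=1)
    attn = torch.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)
    mixed = attn @ v                                       # (B, N, C)
    mu_m = mixed.mean(dim=1, keepdim=True)
    sigma_m = mixed.std(dim=1, keepdim=True) + 1e-6
    return (mixed - mu_m) / sigma_m * sigma_s + mu_s


# Toy usage with random tensors standing in for U-Net attention tokens.
B, N, C = 1, 64, 320
q, k_c, v_c = (torch.randn(B, N, C) for _ in range(3))
style_tokens = torch.randn(B, N, C)
mu_s, sigma_s = style_statistics(style_tokens)
out = norm_mixture_self_attention(q, k_c, v_c, style_tokens, style_tokens, mu_s, sigma_s)
print(out.shape)  # torch.Size([1, 64, 320])
```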

Comparisons

We qualitatively compare our method with both personalized text-to-image generation methods and neural style transfer methods, including DEADiff, Custom Diffusion, DreamBooth, StyleDrop, InstaStyle, AdaAttN, CCPL, StyleID, CAST, and AesPA-Net.

Comparison with personalized text-to-image generation methods

From top to bottom, the prompts are "Couch", "Bus", "Sweet peppers", "Castle", "Woman driving lawn mower", "Clouds", "Dinosaur", and "Raccoon".

Columns, left to right: Reference, DEADiff, Custom Diffusion, DreamBooth, StyleDrop, InstaStyle, Ours.

Comparison with neural image style transfer methods

We first use "bed", "bridge", "tank", "camel", "fox", and "butterfly" as prompts to generate the corresponding content images with the base text-to-image model. We then apply each style transfer method to stylize these content images with the reference style image.

Content prompts (rows): bed, bridge, tank, camel, fox, butterfly. Columns, left to right: Reference, Content, AesPA-Net, CAST, StyleID, CCPL, Ours.

Analysis

We analyze three aspects below: the challenge of fine-grained style control with text prompts alone, the effects of different attention controls, and the trade-off between performance and efficiency.

Challenges in fine-grained style control for text-to-image generation

While text-to-image diffusion models can generate images in common styles via prompts, producing fine-grained or less popular styles remains difficult, making reference images a more effective tool for nuanced style control.

Reference Styles

Generated Images (512px)

Prompts, row by row: "Bear", "Bridge", "Girl"; "Beaver", "Boy", "Mushrooms"; "Mountain", "Pickup truck", "Rabbit"


Effects of different attention controls

We discuss three attention control strategies, namely direct replacement, direct addition, and the mixture of self-attention, and analyze their limitations. To address these issues, we propose the norm mixture of self-attention; a minimal sketch contrasting these strategies follows the figure captions below.

Issues of direct replacement. It tends to prioritize style statistics at the expense of the semantic information derived from the prompt.

Issues of direct addition. It is less robust and highly dependent on the choice of balance factor.

Effect of the norm mixture of self-attention. The plain mixture of self-attention can result in a mismatch between the global style distribution of the reference style image and that of the generated image (first row). The norm mixture of self-attention effectively mitigates this issue (second row).

Overall effect of different attention controls.
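For concreteness, here is a minimal, non-authoritative sketch of the three baseline attention controls discussed above, using the same single-head (B, N, C) convention as the earlier snippet; the balance factor alpha in direct addition is an assumed hyperparameter.

```python
# Baseline attention controls, sketched under assumed shapes; not the
# paper's exact implementation.
import torch


def attn(q, k, v):
    d = q.shape[-1]
    w = torch.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)
    return w @ v


def direct_replacement(q, k_s, v_s):
    # Content queries attend only to style keys/values: style statistics
    # dominate and prompt semantics tend to be lost.
    return attn(q, k_s, v_s)


def direct_addition(q, k_c, v_c, k_s, v_s, alpha=0.5):
    # Weighted sum of content and style attention outputs; results depend
    # heavily on the choice of the balance factor alpha.
    return (1 - alpha) * attn(q, k_c, v_c) + alpha * attn(q, k_s, v_s)


def mixture(q, k_c, v_c, k_s, v_s):
    # One attention over concatenated content and style keys/values; the
    # output's global statistics can still drift from the reference style.
    return attn(q, torch.cat([k_c, k_s], dim=1), torch.cat([v_c, v_s], dim=1))
```

The norm mixture sketched earlier adds a re-normalization step on top of the mixture, which is what pulls the output distribution back toward the reference style statistics.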


Comparison of performance and efficiency

The size of each colored circle reflects performance: the larger the circle, the better the performance. Our method achieves strong results without any fine-tuning and has the shortest inference time.

Gallery

More results (512px) generated by our method, using diverse style reference images, are shown here.

Reference Styles

Generated Images

Prompts, row by row:
"Antelope", "Barbecue grill", "Lamp", "Leopard", "Man"
"Bed", "Bookshelf", "Bottles", "Couch", "Mountain"
"Castle", "Fence", "House", "Laptop", "Palm tree"
"Bowls", "Cactus", "Cherry blossom", "Crane", "Helicopter"
"Bear", "Beetle car", "Duck", "Girl", "Hedgehog"
"Bear", "Boy", "Bridge", "Cups", "Curtain"
"Antelope", "Castle", "Cloud", "Kitten", "Sunflowers"
"Bee", "Boy", "Headphones", "Ice cream", "Koala"
"Bicycle", "Fox", "Ladder", "Mushrooms", "Wolf"
"Girl", "Tram", "Tree", "Wolf", "Yacht"

Most of the reference style images are sourced from the Internet. If you believe that any of them infringe upon your legal rights, please contact me and I will remove them. Project page template is borrowed from DreamBooth.