Training-free Stylized Text-to-Image Generation with Fast Inference
Xin Ma1
Yaohui Wang2
Xinyuan Chen2
Tien-Tsin Wong1
Cunjian Chen1
1Monash University
2Shanghai Artificial Intelligence Laboratory
[Paper]
[Github]
OmniPainter can generate high-quality images that match both the given prompt and the style reference image within just 4 to 6 timesteps, without requiring any inversion!
Reference styles and generated images (512px) for the prompts "Lotus flowers", "Mountain", "Sea", "Cherry blossom", "Kitten", and "Woman".
Methodology
Although diffusion models exhibit impressive generative capabilities,
existing methods for stylized image generation based on these models often require textual inversion or fine-tuning with style images,
which is time-consuming and limits the practical applicability of large-scale diffusion models. To address these challenges, we propose a
novel stylized image generation method, termed OmniPainter, that leverages a pre-trained large-scale diffusion model without fine-tuning or any additional
optimization. Specifically, we exploit the self-consistency property of latent consistency models to extract
representative style statistics from reference style images and use them to guide the stylization process. We then introduce the
norm mixture of self-attention, which enables the model to query the most relevant style patterns from these statistics for the intermediate
content features. This mechanism also ensures that the stylized results align closely with the distribution of the reference style images.
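The exact formulation is given in the paper; as a rough illustration, here is a minimal PyTorch sketch of a norm mixture of self-attention, assuming feature tensors of shape (batch, tokens, dim), an AdaIN-style normalization for the "norm" step, and helper names of our own choosing:

```python
# Minimal sketch of a norm mixture of self-attention. Shapes, helper names,
# and the AdaIN-style "norm" step are our assumptions, not released code.
import torch

def attn(q, k, v):
    # Standard scaled dot-product attention.
    scale = q.shape[-1] ** -0.5
    return torch.softmax(q @ k.transpose(-2, -1) * scale, dim=-1) @ v

def adain(x, ref, eps=1e-5):
    # Shift x's per-channel statistics (over tokens) to match those of ref.
    mu_x, std_x = x.mean(dim=1, keepdim=True), x.std(dim=1, keepdim=True)
    mu_r, std_r = ref.mean(dim=1, keepdim=True), ref.std(dim=1, keepdim=True)
    return (x - mu_x) / (std_x + eps) * std_r + mu_r

def norm_mixture_of_self_attention(q_c, k_c, v_c, k_s, v_s):
    # Content queries attend over concatenated content and style keys/values,
    # so each content token can retrieve the most relevant style patterns.
    k = torch.cat([k_c, k_s], dim=1)
    v = torch.cat([v_c, v_s], dim=1)
    mixed = attn(q_c, k, v)
    # Renormalize the mixed output toward the style-only attention output so
    # the result stays close to the reference style distribution.
    style_only = attn(q_c, k_s, v_s)
    return adain(mixed, style_only)
```

Here the content branch provides the queries while keys and values come from both branches, which is what lets each content token query the extracted style statistics.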
Comparisons
We show the results generated by different methods.
We qualitatively compare our method with both personalized text-to-image generation methods and style transfer methods,
including DEADiff, Custom Diffusion, DreamBooth, StyleDrop, InstaStyle, AdaAttN, CCPL, StyleID, CAST, and AesPA-Net.
Comparison with personalized text-to-image generation methods
From top to bottom, the prompts are "Couch", "Bus", "Sweet peppers", "Castle", "Woman driving lawn mower", "Clouds", "Dinosaur", and "Raccoon", respectively.
Columns (left to right): reference, DEADiff, Custom Diffusion, DreamBooth, StyleDrop, InstaStyle, and ours.
Comparison with neural image style transfer methods
We first use "bed", "bridge", "tank", "camel", "fox", and "butterfly" as prompts to generate the corresponding content images with the base text-to-image model.
We then apply each style transfer method to these content images using the reference style images.
Content images shown: "bed", "bridge", "tank", "camel", and "fox".
Analysis
The analysis is presented below.
Challenges in fine-grained style control for text-to-image generation
While text-to-image diffusion models can generate images in common styles via prompts,
producing fine-grained or less popular styles remains difficult, making reference images a more effective tool for nuanced style control.
Reference styles and generated images (512px) for the prompts "Beaver", "Boy", "Mushrooms", "Mountain", "Pickup truck", and "Rabbit".
Effects of different attention controls
We discuss three attention control strategies (direct replacement, direct addition, and the mixture of self-attention) and analyze their limitations.
To address these issues, we propose the norm mixture of self-attention; the baselines are sketched below.
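For concreteness, here is a minimal sketch of the three baseline controls, using the same assumed shapes and helper names as the sketch in Methodology; these illustrate the failure modes discussed below, not the paper's exact implementation:

```python
# Baseline attention controls, sketched under the same assumptions as above
# (features of shape (batch, tokens, dim); names are ours).
import torch

def attn(q, k, v):
    scale = q.shape[-1] ** -0.5
    return torch.softmax(q @ k.transpose(-2, -1) * scale, dim=-1) @ v

def direct_replacement(q_c, k_s, v_s):
    # Content queries attend only to style keys/values: style statistics
    # dominate and prompt semantics can be lost.
    return attn(q_c, k_s, v_s)

def direct_addition(q_c, k_c, v_c, k_s, v_s, lam=0.5):
    # Weighted blend of content and style attention outputs: the result
    # hinges on the choice of the balance factor `lam`.
    return (1 - lam) * attn(q_c, k_c, v_c) + lam * attn(q_c, k_s, v_s)

def mixture_of_self_attention(q_c, k_c, v_c, k_s, v_s):
    # Attend over concatenated keys/values; flexible, but the output's global
    # statistics can drift from the reference style (hence the norm step).
    k = torch.cat([k_c, k_s], dim=1)
    v = torch.cat([v_c, v_s], dim=1)
    return attn(q_c, k, v)
```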
Issues of direct replacement.
It tends to prioritize style statistics at the expense of the semantic information derived from the prompt.
Figure: style reference; "Girl" and "House" results; ResNet block, query, and self-attention map visualizations.
Issues of direct addition.
It is less robust and highly dependent on the choice of balance factors.
Figure: style reference with "Apples" and "Butterfly" results.
Effect of the norm mixture of self-attention.
The mixture of self-attention can sometimes cause a mismatch between the global style distribution of the reference style image and that of the generated image (top row).
The norm mixture of self-attention effectively mitigates this issue (bottom row).
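One way to write a normalization with this effect (our notation, not necessarily the paper's): let $O$ be the mixed attention output, $O_s$ the style-only attention output, and $\mu$, $\sigma$ channel-wise mean and standard deviation; then

$$\tilde{O} = \sigma(O_s)\,\frac{O - \mu(O)}{\sigma(O)} + \mu(O_s),$$

which matches the output's first- and second-order statistics to the style branch and thereby aligns the global style distribution.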
Figure: style references with "Apples" and "Butterfly" results (mixture of self-attention, top; norm mixture of self-attention, bottom).
Overall effect of the different attention controls.
Columns (left to right): style reference, direct replacement, direct addition, mixture of self-attention, and norm mixture of self-attention (Norm MSA).
Comparison of performance and efficiency
Circle size reflects performance: the larger the circle, the better.
Our method delivers strong results without any fine-tuning and achieves the shortest inference time.
Gallery
More results (512px) generated by our method, using diverse style reference images, are shown here.
Reference styles paired with generated images for the prompts: "Antelope", "Barbecue grill", "Lamp", "Leopard", "Man"; "Bed", "Bookshelf", "Bottles", "Couch", "Mountain"; "Castle", "Fence", "House", "Laptop", "Palm tree"; "Bowls", "Cactus", "Cherry blossom", "Crane", "Helicopter"; "Bear", "Beetle car", "Duck", "Girl", "Hedgehog"; "Bear", "Boy", "Bridge", "Cups", "Curtain"; "Antelope", "Castle", "Cloud", "Kitten", "Sunflowers"; "Bee", "Boy", "Headphones", "Ice cream", "Koala"; "Bicycle", "Fox", "Ladder", "Mushrooms", "Wolf"; "Girl", "Tram", "Tree", "Wolf", "Yacht".
Most of the reference style images are sourced from the Internet.
If you believe that any of them infringe upon your legal rights,
please contact me and I will remove them.
Project page template is borrowed from DreamBooth.