Self-Cross Diffusion Guidance for Text-to-Image Synthesis of Similar Subjects

1University of California, Merced

Abstract

Subject mixing is a known and unresolved issue for diffusion-based image synthesis, particularly for synthesizing multiple similar-looking subjects. We propose Self-Cross Diffusion Guidance to penalize the overlap between cross-attention maps and aggregated self-attention maps. Compared to previous methods based on self-attention or cross-attention alone, our guidance is more effective in eliminating subject mixing.

Moreover, our guidance addresses mixing for all relevant patches of a subject, not just the most discriminative one (e.g., the beak of a bird). We aggregate the self-attention maps of automatically selected patches to form a region that the whole subject attends to.
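A minimal NumPy sketch of this selection-and-aggregation step. The function name, the cross-attention-weighted averaging, and the `top_frac` threshold are illustrative assumptions, not the paper's exact hyperparameters:

```python
import numpy as np

def aggregate_self_attention(cross_attn, self_attn, top_frac=0.1):
    """Aggregate the self-attention maps of a subject's top cross-attention patches.

    cross_attn: (N,) cross-attention responses of one subject token over N patches.
    self_attn:  (N, N) self-attention; row i is patch i's attention over all patches.
    top_frac is an illustrative selection threshold, not the paper's setting.
    """
    k = max(1, int(round(top_frac * cross_attn.size)))
    top = np.argsort(cross_attn)[-k:]            # patches responding most to the subject
    weights = cross_attn[top] / cross_attn[top].sum()
    # weighted average of the selected patches' self-attention maps -> (N,) region map
    return weights @ self_attn[top]
```

Because each self-attention row is a distribution over patches, the aggregated map is again a distribution over the image, covering the region the whole subject attends to.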

Our guidance is training-free and can boost the performance of both UNet-based and transformer-based diffusion models such as the Stable Diffusion series. We also release a more challenging benchmark with prompts of similar-looking subjects and utilize GPT-4o for automatic and reliable evaluation. Extensive qualitative and quantitative results demonstrate the effectiveness of our Self-Cross Diffusion Guidance.

Consider the prompt “a bear and an elephant”. The image below illustrates the aggregation of self-attention maps: given the cross-attention map of “bear”, we select patches with high responses and visualize their self-attention maps. Under our assumption that, at certain time steps and layers, a subject should not attend to other subjects in the image, we penalize the overlap between the aggregated self-attention map and the cross-attention maps of other subjects.
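As a rough sketch, the penalty can be written as the overlap between two normalized maps over image patches. The elementwise-min overlap below is one simple choice and may differ from the paper's exact formulation; during sampling, the gradient of such a loss with respect to the latent supplies the training-free guidance signal:

```python
import numpy as np

def self_cross_overlap(agg_map_a, cross_attn_b, eps=1e-8):
    """Overlap between subject A's aggregated self-attention map and subject B's
    cross-attention map, both of shape (N,) over N image patches.

    Elementwise-min overlap of the normalized maps is an illustrative choice,
    not necessarily the paper's exact loss: 0 = disjoint, ~1 = identical.
    """
    a = agg_map_a / (agg_map_a.sum() + eps)
    b = cross_attn_b / (cross_attn_b.sum() + eps)
    return float(np.minimum(a, b).sum())
```

Minimizing this overlap for every pair of subjects pushes each subject's attended region away from the other subjects' cross-attention maps, which is what suppresses mixing.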

Sample Image

Qualitative comparisons

Qualitative comparisons between the original SD3-medium and our method. Note that for SD3 we use image-text attention maps in place of cross-attention maps and image-image attention maps in place of self-attention maps.

prompt: a bear riding an elephant with a rabbit
SD3: [sample images]
Ours: [sample images]

prompt: a beagle and a collie and a husky
SD3: [sample images]
Ours: [sample images]

prompt: a bear and a wolf and a fox
SD3: [sample images]
Ours: [sample images]

prompt: a eagle and a owl and a seagull
SD3: [sample images]
Ours: [sample images]

prompt: a parrot and a pigeon and a sparrow
SD3: [sample images]
Ours: [sample images]

Qualitative comparisons between the original SD2.1 and our method.

prompt: a bear and a horse
SD2: [sample images]
Ours: [sample images]

prompt: a bear and a rabbit
SD2: [sample images]
Ours: [sample images]

prompt: a lion and a monkey
SD2: [sample images]
Ours: [sample images]

Quantitative comparisons

Quantitative comparisons between the original SD1.4, SD2.1, and SD3-medium and our method.

Similar Subjects Dataset-3 (21 prompts)

Method   Existence   Recognizability   Without Mixture   Text-to-text similarity
SD3      33.54       30.31             70.08             73.82
Ours     57.92       53.08             77.15             74.96

Animal-Animal (66 prompts)

Method   Existence   Recognizability   Without Mixture   Text-to-text similarity
SD2      61.63       38.23             67.51             81.68
Ours     89.53       55.03             78.79             84.02

Animal-Animal (66 prompts)

Method   Existence   Recognizability   Without Mixture   Text-to-text similarity
SD1      39.51       29.70             72.24             76.50
Ours     94.55       87.79             92.94             84.30

Related Links

There's a lot of excellent work on this topic.

CONFORM: Contrast is All You Need For High-Fidelity Text-to-Image Diffusion Models introduces a contrastive loss between cross-attention maps.

InitNO: Boosting Text-to-Image Diffusion Models via Initial Noise Optimization uses self-attention maps to optimize the first few denoising steps during inference.

Some works focus on ensuring the existence of subjects, such as Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models.

There are probably many more by the time you read this, for example Enhancing MMDiT-Based Text-to-Image Models for Similar Subject Generation.

BibTeX

@inproceedings{2025selfcross,
  author    = {Qiu, Weimin and Wang, Jieke and Tang, Meng},
  title     = {Self-Cross Diffusion Guidance for Text-to-Image Synthesis of Similar Subjects},
  booktitle = {CVPR},
  year      = {2025},
}