Self-Cross Diffusion Guidance for Text-to-Image Synthesis of Similar Subjects

1University of California, Merced

Abstract

Subject mixing is a known and unresolved issue for diffusion-based image synthesis, particularly for synthesizing multiple similar-looking subjects. We propose Self-Cross Diffusion Guidance to penalize the overlap between cross-attention maps and aggregated self-attention maps. Compared to previous methods based on self-attention or cross-attention alone, our guidance is more effective in eliminating subject mixing.

Moreover, our guidance addresses mixing for all relevant patches of a subject, not just the most discriminative one (e.g., the beak of a bird). We aggregate the self-attention maps of automatically selected patches to form a region that the whole subject attends to.
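A minimal NumPy sketch of this selection-and-aggregation step. The function name, the cross-attention-weighted averaging, and the `top_frac` threshold are illustrative assumptions, not the paper's exact hyperparameters:

```python
import numpy as np

def aggregate_self_attention(cross_attn, self_attn, top_frac=0.1):
    """Aggregate the self-attention maps of a subject's top cross-attention patches.

    cross_attn: (N,) cross-attention responses of one subject token over N patches.
    self_attn:  (N, N) self-attention; row i is patch i's attention over all patches.
    top_frac is an illustrative selection threshold, not the paper's setting.
    """
    k = max(1, int(round(top_frac * cross_attn.size)))
    top = np.argsort(cross_attn)[-k:]            # patches responding most to the subject
    weights = cross_attn[top] / cross_attn[top].sum()
    # weighted average of the selected patches' self-attention maps -> (N,) region map
    return weights @ self_attn[top]
```

Because each self-attention row is a distribution over patches, the aggregated map is again a distribution over the image, covering the region the whole subject attends to.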

Our guidance is training-free and can boost the performance of both UNet-based and transformer-based diffusion models such as the Stable Diffusion series. We also release a more challenging benchmark with prompts of similar-looking subjects and utilize GPT-4o for automatic and reliable evaluation. Extensive qualitative and quantitative results demonstrate the effectiveness of our Self-Cross Diffusion Guidance.

Consider the prompt “a bear and an elephant”. The image below illustrates the aggregation of self-attention maps: given the cross-attention map of “bear”, we select patches with high responses and visualize their self-attention maps. Under our assumption that, at certain time steps and layers, a subject should not attend to other subjects in the image, we penalize the overlap between the aggregated self-attention map and the cross-attention maps of other subjects.
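As a rough sketch, the penalty can be written as the overlap between two normalized maps over image patches. The elementwise-min overlap below is one simple choice and may differ from the paper's exact formulation; during sampling, the gradient of such a loss with respect to the latent supplies the training-free guidance signal:

```python
import numpy as np

def self_cross_overlap(agg_map_a, cross_attn_b, eps=1e-8):
    """Overlap between subject A's aggregated self-attention map and subject B's
    cross-attention map, both of shape (N,) over N image patches.

    Elementwise-min overlap of the normalized maps is an illustrative choice,
    not necessarily the paper's exact loss: 0 = disjoint, ~1 = identical.
    """
    a = agg_map_a / (agg_map_a.sum() + eps)
    b = cross_attn_b / (cross_attn_b.sum() + eps)
    return float(np.minimum(a, b).sum())
```

Minimizing this overlap for every pair of subjects pushes each subject's attended region away from the other subjects' cross-attention maps, which is what suppresses mixing.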

Sample Image

Qualitative comparisons

Qualitative comparisons between the original SD3-medium and our method. Note that for SD3 we use image-text attention maps in place of cross-attention maps and image-image attention maps in place of self-attention maps.

prompt: a bear riding an elephant with a rabbit
SD3: [sample images]
Ours: [sample images]

prompt: a beagle and a collie and a husky
SD3: [sample images]
Ours: [sample images]

prompt: a bear and a wolf and a fox
SD3: [sample images]
Ours: [sample images]

prompt: a eagle and a owl and a seagull
SD3: [sample images]
Ours: [sample images]

prompt: a parrot and a pigeon and a sparrow
SD3: [sample images]
Ours: [sample images]

Qualitative comparisons between the original SD2.1 and our method.

prompt: a bear and a horse
SD2: [sample images]
Ours: [sample images]

prompt: a bear and a rabbit
SD2: [sample images]
Ours: [sample images]

prompt: a lion and a monkey
SD2: [sample images]
Ours: [sample images]

Quantitative comparisons

Quantitative comparisons between the original SD1.4, SD2.1, and SD3-medium and our method.

Similar Subjects Dataset-3 (21 prompts)

Method   Existence   Recognizability   Without Mixture   Text-to-text similarity
SD3      33.54       30.31             70.08             73.82
Ours     57.92       53.08             77.15             74.96

Animal-Animal (66 prompts)

Method   Existence   Recognizability   Without Mixture   Text-to-text similarity
SD2      61.63       38.23             67.51             81.68
Ours     89.53       55.03             78.79             84.02

Animal-Animal (66 prompts)

Method   Existence   Recognizability   Without Mixture   Text-to-text similarity
SD1      39.51       29.70             72.24             76.50
Ours     94.55       87.79             92.94             84.30

Related Links

There's a lot of excellent work on this topic.

CONFORM: Contrast is All You Need For High-Fidelity Text-to-Image Diffusion Models introduces a contrastive loss between cross-attention maps.

InitNO: Boosting Text-to-Image Diffusion Models via Initial Noise Optimization uses self-attention maps to optimize the first few denoising steps during inference.

Some works focus on ensuring the existence of subjects, such as Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models.

There are probably many more by the time you read this, for example Enhancing MMDiT-Based Text-to-Image Models for Similar Subject Generation.

BibTeX

@inproceedings{2025selfcross,
  author    = {Qiu, Weimin and Wang, Jieke and Tang, Meng},
  title     = {Self-Cross Diffusion Guidance for Text-to-Image Synthesis of Similar Subjects},
  booktitle = {CVPR},
  year      = {2025},
}