An Efficient and Training-Free Approach for Subject-Driven Text-to-Image Generation

Subject-driven text-to-image generation presents a significant challenge: faithfully reproducing a specific subject’s identity within novel, text-described scenes. Existing solutions typically require computationally expensive model fine-tuning or settle for training-free methods with noticeably weaker performance. This paper introduces Content-Adaptive Grafting (CAG), a novel, efficient, and entirely training-free framework designed to achieve both high subject fidelity and strong text alignment. CAG operates without modifying the underlying generative model’s weights, instead relying on intelligent noise initialization and adaptive feature fusion during inference. Our framework comprises Initial Structure Guidance (ISG), which prepares a structurally consistent starting point by inverting a collage of the reference subject, and Dynamic Content Fusion (DCF), which adaptively infuses multi-scale reference features through a gated attention mechanism with a time-dependent decay schedule. Extensive experiments demonstrate that CAG significantly outperforms state-of-the-art training-free baselines in subject fidelity and text alignment while maintaining competitive efficiency. Ablation studies and human evaluations further validate the contributions of ISG and DCF, establishing CAG as a high-quality, practical solution for subject-driven text-to-image generation.
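
To make the DCF idea concrete, below is a minimal PyTorch sketch of one plausible instantiation of gated attention fusion with time-dependent decay. Everything in it is an illustrative assumption rather than the authors’ implementation: the function names `time_decay` and `gated_fusion`, the exponential decay schedule, and the cosine-similarity gate are hypothetical choices consistent with “a gated attention mechanism and a time-dependent decay strategy,” not the paper’s exact formulation.

```python
import math
import torch
import torch.nn.functional as F

def time_decay(t: float, gamma: float = 5.0) -> float:
    """Illustrative exponential decay: reference guidance is strongest at
    the start of sampling (t = 1, pure noise) and fades toward t = 0."""
    return math.exp(-gamma * (1.0 - t))

def gated_fusion(gen_feats: torch.Tensor,
                 ref_feats: torch.Tensor,
                 t: float) -> torch.Tensor:
    """One plausible gated attention fusion step for DCF.

    gen_feats: (B, N, C) features of the image being generated
    ref_feats: (B, M, C) reference-subject features (multi-scale maps
               flattened and concatenated along the token axis)
    t:         normalized timestep in [0, 1], 1 = first sampling step
    """
    # Cross-attention: generated tokens attend to reference tokens.
    scale = gen_feats.shape[-1] ** -0.5
    attn = torch.softmax(gen_feats @ ref_feats.transpose(1, 2) * scale, dim=-1)
    grafted = attn @ ref_feats  # (B, N, C) reference content per token

    # Content-adaptive gate (assumed form): per-token cosine similarity
    # decides how much reference content each location accepts.
    gate = torch.sigmoid(
        (F.normalize(gen_feats, dim=-1) * F.normalize(grafted, dim=-1))
        .sum(dim=-1, keepdim=True)
    )

    # Time-dependent decay: strong subject guidance early, tapering off
    # so the text prompt governs the final layout and details.
    return gen_feats + time_decay(t) * gate * grafted

# Toy usage on random features at an early denoising step.
fused = gated_fusion(torch.randn(1, 64, 320), torch.randn(1, 96, 320), t=0.9)
print(fused.shape)  # torch.Size([1, 64, 320])
```

Under these assumptions, a similarity-driven sigmoid gate lets regions that already resemble the subject accept more reference content, while the scalar decay hands control back to the text prompt during the later denoising steps.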
