Method BootPIG



Post by rochona »

We now describe (1) the BootPIG architecture and (2) a training pipeline that enables BootPIG to perform zero-shot subject-driven generation. The BootPIG architecture builds on pretrained text-to-image models and introduces new layers that allow the pretrained model to accept images of the subject at test time. Training only these new layers enables the BootPIG model to generate images of the desired subject without any subject-specific finetuning. On top of that, BootPIG's training process does not require real images featuring the subject and only takes about an hour!

Architecture: The BootPIG architecture shares similarities with the ControlNet architecture. Specifically, we introduce a copy of the original text-to-image U-Net, which we call the Reference U-Net. The Reference U-Net takes as input reference images, e.g., images of your dog or cat, and learns features that enable the original text-to-image network, which we call the Base U-Net, to synthesize the subject in the desired scene. Additionally, we replace the self-attention layers in the Base U-Net with Reference Self-Attention (RSA) operators. The RSA operator performs attention between the Base U-Net features (queries) and the concatenation of the Base U-Net features and Reference U-Net features (keys and values). During training, the Base U-Net and Reference U-Net are trained jointly: all parameters in the Reference U-Net are updated, while only the RSA layers in the Base U-Net are trained.
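To make the RSA wiring concrete, here is a minimal PyTorch sketch. The class name, tensor shapes, and the use of nn.MultiheadAttention are our own illustrative assumptions, not the authors' implementation; only the query/key/value routing follows the description above.

import torch
from torch import nn

class ReferenceSelfAttention(nn.Module):
    """Sketch of an RSA layer (shapes and module choice are assumptions):
    queries come from the Base U-Net features, while keys/values come from
    the concatenation of Base and Reference U-Net features."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, base_feats: torch.Tensor, ref_feats: torch.Tensor) -> torch.Tensor:
        # base_feats: (B, N, C) tokens from a Base U-Net block
        # ref_feats:  (B, M, C) tokens from the matching Reference U-Net block
        kv = torch.cat([base_feats, ref_feats], dim=1)  # keys/values: (B, N+M, C)
        out, _ = self.attn(query=base_feats, key=kv, value=kv)
        return out

Because the keys and values still include the Base U-Net's own features, the operator reduces to ordinary self-attention when no reference features are injected, which is presumably why it can drop in for the original self-attention layers.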


Data: BootPIG is trained entirely on synthetic data. Training requires triplets of the form (image, caption, reference image). First, we use ChatGPT to generate captions for potential images. Next, we use Stable Diffusion to generate an image for each caption. Lastly, we use Segment Anything, a state-of-the-art segmentation model, to segment the subject in each image, and we use the segmented portion as the reference image.
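The recipe itself is simple enough to sketch in a few lines. The three callables below are hypothetical stand-ins for ChatGPT, Stable Diffusion, and Segment Anything; none of the names here come from the BootPIG code.

from typing import Callable, List, Tuple

def build_synthetic_triplets(
    n: int,
    caption_model: Callable[[], str],          # stand-in for ChatGPT caption generation
    image_model: Callable[[str], "Image"],     # stand-in for Stable Diffusion text-to-image
    segmenter: Callable[["Image"], "Image"],   # stand-in for Segment Anything foreground crop
) -> List[Tuple["Image", str, "Image"]]:
    """Sketch of the synthetic-data pipeline: caption -> image -> segmented reference."""
    triplets = []
    for _ in range(n):
        caption = caption_model()         # 1. generate a plausible image caption
        image = image_model(caption)      # 2. synthesize an image for that caption
        reference = segmenter(image)      # 3. segment the subject to use as the reference
        triplets.append((image, caption, reference))
    return triplets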

Results
BootPIG demonstrates state-of-the-art quantitative results in zero-shot subject-driven generation. We present qualitative (visual) comparisons to existing zero-shot and test-time-finetuned methods below. As seen in the figure, BootPIG maintains key features of the subjects, such as the fur markings of the dog in the third row, in the new scene.