Diffusion-DPO formulates what “more likely” means for diffusion models. The conclusion (after some chunky math) turns out to be pretty simple: diffusion models are trained to denoise images, and if you give a diffusion model a noisy image to denoise, the “likelihood” of the clean image scales with how good the model’s denoising estimate is. In other words, the Diffusion-DPO objective is to tune the model to be better at denoising preferred data and relatively worse at denoising unpreferred data.
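As a rough sketch (simplifying away the timestep-dependent weighting into a single constant β), the loss compares denoising errors on the noised preferred image x_t^w and unpreferred image x_t^l, both for the model being tuned (ε_θ) and for a frozen “reference” model (ε_ref, discussed below):

$$
\mathcal{L}(\theta) = -\,\mathbb{E}\!\left[\log\sigma\!\Big(-\beta\Big(\big(\|\epsilon^w-\epsilon_\theta(x^w_t,t)\|^2-\|\epsilon^w-\epsilon_{\mathrm{ref}}(x^w_t,t)\|^2\big)-\big(\|\epsilon^l-\epsilon_\theta(x^l_t,t)\|^2-\|\epsilon^l-\epsilon_{\mathrm{ref}}(x^l_t,t)\|^2\big)\Big)\Big)\right]
$$

The first difference measures how much worse the tuned model is than the reference at denoising the preferred image; the second does the same for the unpreferred image. Driving the first down and the second up lowers the loss.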
Figure: The loss surface for the Diffusion-DPO objective (lower is better). The loss can be improved by becoming better on the good data while getting worse on the bad data.
The error increase/decrease (getting better/worse) is measured relative to a “reference” or initialization model. In our experiments we mainly use StableDiffusion-XL-1.0, which we’ll just refer to as “SDXL”. We use SDXL as the starting point and train it on the Pick-a-Pic dataset, which consists of collected preferences between pairs of images generated from the same caption.
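To make that concrete, here is a minimal PyTorch-style sketch of the pairwise loss. The function name, argument names, and the default `beta` are illustrative rather than the exact training code; the inputs are per-example denoising errors already computed for the tuned model and the frozen reference (e.g. SDXL).

```python
import torch
import torch.nn.functional as F

def diffusion_dpo_loss(
    model_err_w: torch.Tensor,  # ||eps - eps_theta(x_t^w, t)||^2, tuned model, preferred image
    model_err_l: torch.Tensor,  # same error for the unpreferred image
    ref_err_w: torch.Tensor,    # ||eps - eps_ref(x_t^w, t)||^2, frozen reference, preferred image
    ref_err_l: torch.Tensor,    # same error for the unpreferred image
    beta: float = 5000.0,       # strength of the pull toward the reference; treat as illustrative
) -> torch.Tensor:
    """Simplified Diffusion-DPO loss: get better at denoising preferred data and
    relatively worse at denoising unpreferred data, measured against the reference."""
    # How much worse (positive) or better (negative) the tuned model is than the
    # reference on each image of the preference pair.
    diff_w = model_err_w - ref_err_w
    diff_l = model_err_l - ref_err_l
    # Loss decreases when diff_w shrinks (better on preferred data) and/or
    # diff_l grows (worse on unpreferred data).
    return -F.logsigmoid(-beta * (diff_w - diff_l)).mean()
```

Note that only the gap between the two differences matters: the tuned model doesn’t have to out-denoise the reference everywhere, it just has to improve on the preferred images relative to the unpreferred ones, which is exactly the loss surface shown above.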
Results
We first visually compare the generations of our DPO-tuned SDXL model (DPO-SDXL) with those of the original SDXL. DPO-SDXL is both more faithful to the given prompt and produces high-quality imagery that people find very appealing; in other words, the model has become aligned to our preferences! Note that preferences are not universal, but a love for detailed, exciting imagery seems to be a preference shared across a broad swath of users.