In some cases, such as the Stable Diffusion family of models, a second training stage of learning to denoise visually appealing imagery is used to bias the model towards “higher aesthetic value” generations. While useful, this highlights a stark contrast between LLMs and T2I models: the former field has many well-established recipes for incorporating human feedback into its models, with huge benefits, while the latter largely relies on empirically justified or ad hoc approaches.
One of the key differences between diffusion (T2I) generation and language generation is the incremental unit of generation. In LLMs it is a single token (a word, word-part, or other chunk of text) that ultimately becomes part of the final generation. In diffusion models, each incremental model decision steers a noisy generation towards a clean, denoised version (see our blog on our prior work EDICT for more of an introduction to diffusion models). This means that there can be many paths to the same image, which changes the meaning and importance of sequential diffusion steps.
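To make this contrast concrete, here is a minimal sketch (not code from our work) of the two generation loops. The functions `lm_step` and `denoise_step`, and the shapes used, are illustrative placeholders: the point is that an LLM appends tokens that all end up in the final output, while a diffusion sampler repeatedly refines one latent toward a clean image, and only the final result is kept.

```python
import torch

def lm_generate(lm_step, prompt_ids, num_tokens):
    """Autoregressive loop: every sampled token is part of the final text."""
    ids = list(prompt_ids)
    for _ in range(num_tokens):
        next_id = lm_step(ids)   # pick the next token given the context so far
        ids.append(next_id)      # the token itself stays in the output
    return ids

def diffusion_generate(denoise_step, shape, num_steps):
    """Denoising loop: each step replaces the whole latent with a less-noisy one."""
    x = torch.randn(shape)               # start from pure noise
    for t in reversed(range(num_steps)):
        x = denoise_step(x, t)           # steer the entire image toward a clean sample
    return x                             # only the final latent/image is the output

# Toy placeholders so the sketch runs end to end.
text = lm_generate(lambda ids: len(ids) % 50, prompt_ids=[1, 2, 3], num_tokens=5)
image = diffusion_generate(lambda x, t: 0.9 * x, shape=(1, 3, 64, 64), num_steps=10)
```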
To consider how to apply RLHF to diffusion models, we turned to a recent development in preference tuning for LLMs called Direct Preference Optimization (DPO). DPO enables directly learning a model to become “optimal” with respect to a dataset of human preferences, which greatly simplifies the RLHF pipeline. This is a much simpler framework than traditional RLHF methods, which require learning a separate “reward” model to evaluate and critique the generative model’s outputs. The DPO objective boils down to a simple criterion: tune the model to be more likely to output the preferred data and less likely to output the unpreferred data.
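As a rough illustration of that criterion, here is a minimal sketch of the standard DPO loss for an LLM-style policy, assuming we already have summed log-probabilities of the preferred (y_w) and dispreferred (y_l) responses under the model being tuned and under a frozen reference model. The function name, argument names, and the value of `beta` are illustrative choices, not our exact implementation.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO: -log sigmoid(beta * (preferred log-ratio - dispreferred log-ratio))."""
    preferred_ratio = policy_logp_w - ref_logp_w     # how much the policy upweights y_w vs. the reference
    dispreferred_ratio = policy_logp_l - ref_logp_l  # same quantity for y_l
    # Minimizing this pushes the policy to raise p(y_w) and lower p(y_l),
    # while staying anchored to the reference model through the log-ratios.
    return -F.logsigmoid(beta * (preferred_ratio - dispreferred_ratio)).mean()

# Toy usage with made-up log-probabilities for a batch of two preference pairs.
loss = dpo_loss(
    policy_logp_w=torch.tensor([-12.0, -9.5]),
    policy_logp_l=torch.tensor([-11.0, -10.0]),
    ref_logp_w=torch.tensor([-12.5, -9.8]),
    ref_logp_l=torch.tensor([-10.5, -10.2]),
)
print(loss.item())
```

In practice the log-probabilities would come from the tuned model and a frozen copy of it, and the loss would be backpropagated only through the tuned model; no reward model is ever trained.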