Learning from human preferences



Post by rochona »

Learning from human preferences, specifically Reinforcement Learning from Human Feedback (RLHF), has been a key component in the recent development of large language models such as ChatGPT and Llama 2. Until recently, the impact of human-feedback training on text-to-image models was much more limited. In this work, Diffusion-DPO, we bring the benefits of learning from human feedback to diffusion models, resulting in a state-of-the-art generative text-to-image model. This closes the gap between the Stable Diffusion family of open-source models and closed models such as Midjourney v5 (the most current at the time of this project), and opens the door to a new generation of aligned text-to-image models.

In summary:

We adapt the Direct Preference Optimization (DPO) training method to text-to-image models (the objective is sketched after this list)
DPO-tuned Stable Diffusion XL models far outperform their initialization and are comparable to closed-source models such as Midjourney and Meta’s Emu
We publicly release the training code and the resulting models
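
For reference, the standard DPO objective from the LLM setting (written here in the usual notation, which this post does not spell out) is

\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]

where y_w and y_l are the preferred and dispreferred outputs for prompt x, \pi_{\mathrm{ref}} is a frozen copy of the initial model, and \beta controls how far the tuned model \pi_\theta may drift from it. For diffusion models the image likelihoods \pi_\theta(y \mid x) are intractable, so, roughly speaking, the adaptation replaces each log-likelihood ratio with a difference of denoising errors against the reference model; a sketch of the resulting per-pair loss appears after the Introduction below.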
Introduction

The story of alignment (i.e., alignment with human goals, preferences, and ethics) in large language models (LLMs) is very different from that in text-to-image (T2I) models. While the most powerful current LLMs, such as GPT-4, Bard, and Llama 2, all specifically cite alignment via RLHF as a key component of their training process, state-of-the-art T2I models are primarily trained via a single simple objective.
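
That single objective is the standard denoising (noise-prediction) loss. To make the contrast concrete, here is a minimal PyTorch-style sketch of that objective next to a preference loss in the spirit of Diffusion-DPO. Everything in it (eps_model, eps_ref, alphas_cumprod, the beta value, the shared timestep and noise across the pair) is an illustrative assumption, not code from the released implementation.

import torch
import torch.nn.functional as F

def add_noise(x0, noise, t, alphas_cumprod):
    # Forward diffusion q(x_t | x_0): mix the clean image with Gaussian noise.
    a = alphas_cumprod[t].view(-1, 1, 1, 1)
    return a.sqrt() * x0 + (1.0 - a).sqrt() * noise

def simple_loss(eps_model, x0, cond, t, noise, alphas_cumprod):
    # The usual T2I training objective: predict the injected noise.
    x_t = add_noise(x0, noise, t, alphas_cumprod)
    return F.mse_loss(eps_model(x_t, t, cond), noise)

def diffusion_dpo_loss(eps_model, eps_ref, x0_w, x0_l, cond, t, noise,
                       alphas_cumprod, beta=5000.0):
    # Preference objective in the spirit of Diffusion-DPO: lower the tuned
    # model's denoising error on the preferred image (x0_w), relative to a
    # frozen reference model, more than on the dispreferred image (x0_l).
    # Sharing t and noise across the pair is a simplification in this sketch.
    xt_w = add_noise(x0_w, noise, t, alphas_cumprod)
    xt_l = add_noise(x0_l, noise, t, alphas_cumprod)

    def err(model, x_t):
        # Per-example squared denoising error, averaged over C, H, W.
        return ((model(x_t, t, cond) - noise) ** 2).mean(dim=(1, 2, 3))

    model_diff = err(eps_model, xt_w) - err(eps_model, xt_l)
    with torch.no_grad():
        ref_diff = err(eps_ref, xt_w) - err(eps_ref, xt_l)
    return -F.logsigmoid(-beta * (model_diff - ref_diff)).mean()

Minimizing the preference loss pushes model_diff below ref_diff, i.e., the tuned model improves on preferred images more than on dispreferred ones. This is how the DPO log-likelihood ratios can be approximated for diffusion models without computing intractable image likelihoods.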