We use two prompt benchmarks: PartiPrompts (1632 captions) and HPS (800 captions). We generate SDXL and DPO-SDXL images for each caption and ask five human labelers to vote on which image they:
- Generally like better
- Find more visually appealing
- Think is better aligned to the target prompt
We collect five responses for each comparison and take the majority vote as the collective decision.
We see that DPO-SDXL significantly improves on SDXL, winning approximately ⅔ of the time.
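The aggregation above can be sketched in a few lines. This is a minimal illustration, not the actual evaluation pipeline: the ballot data, model labels, and function names are hypothetical.

```python
from collections import Counter

def majority_vote(votes):
    """Return the model favored by most labelers for one comparison."""
    return Counter(votes).most_common(1)[0][0]

def win_rate(comparisons, model="dpo-sdxl"):
    """Fraction of comparisons whose majority vote favors `model`."""
    wins = sum(1 for votes in comparisons if majority_vote(votes) == model)
    return wins / len(comparisons)

# Hypothetical ballots: five labeler votes for each of three prompts.
ballots = [
    ["dpo-sdxl", "dpo-sdxl", "sdxl", "dpo-sdxl", "sdxl"],
    ["sdxl", "sdxl", "dpo-sdxl", "sdxl", "sdxl"],
    ["dpo-sdxl"] * 5,
]
print(win_rate(ballots))  # ~0.667 for this toy data
```

With an odd number of labelers per comparison, the majority vote is always well defined and no tie-breaking rule is needed.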
Comparison to state-of-the-art closed-source models
While our results on the academic benchmarks demonstrate the effectiveness of aligning diffusion models to user preferences, we also want to understand whether alignment helps close the gap with powerful closed-source models. Closed-source models such as Midjourney and Meta’s Emu have been shown to generate significantly better images than open-source alternatives. We now explore whether DPO-tuned SDXL is competitive with Midjourney and Emu. Since their training datasets and (in the case of Midjourney) training recipes are not available, these are not apples-to-apples comparisons, but rather an effort to document the relative performance of the different models.