UniControl is accepted to NeurIPS’23

Other authors include Yingbo Zhou, Huan Wang, Juan Carlos Niebles, Caiming Xiong, Silvio Savarese, Stefano Ermon, and Yun Fu.

Is it possible for a single model to master the art of creating images from sketches, maps, diagrams, and more? While diffusion-based text-to-image generators like DALL-E 3 have showcased remarkable results from natural language prompts, achieving precise control over layouts, boundaries, and geometry remains challenging using just text descriptions. Now, researchers have developed UniControl, a unified model capable of handling diverse visual conditions ranging from edges to depth maps within one unified framework.


Background
Text-to-image (T2I) synthesis has advanced rapidly in recent years through progress in deep generative models. Systems like DALL-E 2, Imagen, and Stable Diffusion can now generate highly photorealistic images controllable by natural language prompts. These breakthroughs are powered by diffusion models, which have proven extremely effective for text-to-image generation.

However, text prompts alone offer limited precision over spatial, structural, and geometric attributes. For example, asking to “add a large purple cube” relies on the model’s implicitly learned understanding of 3D geometry. Recent approaches like ControlNet introduced conditioning on additional visual signals such as segmentation maps or edge maps, enabling explicit control over image regions, boundaries, and object locations. A minimal sketch of this kind of conditioning appears below.
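To make visual conditioning concrete, here is a minimal Python sketch using the open-source Hugging Face diffusers library, not the authors' UniControl code. It conditions Stable Diffusion on a Canny edge map via a pretrained ControlNet; the input path, prompt, and checkpoint choices are illustrative assumptions.

import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Extract a Canny edge map from a reference image ("input.png" is a placeholder path).
image = np.array(Image.open("input.png").convert("RGB"))
edges = cv2.Canny(image, 100, 200)
control_image = Image.fromarray(np.stack([edges] * 3, axis=-1))  # 1 channel -> 3

# Attach a Canny-conditioned ControlNet to a Stable Diffusion v1.5 backbone.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# The edge map pins down layout and boundaries; the prompt controls appearance.
result = pipe("a large purple cube on a table", image=control_image).images[0]
result.save("output.png")

The key design point is that the edge map constrains where structure goes, while the text prompt still governs style and content; UniControl generalizes this idea to many condition types within a single model.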