Open-Set Grounded Text-to-Image Generation

University of Wisconsin-Madison; Columbia University; Microsoft   *Equal Advising

Figure 1. GLIGEN enables versatile grounding capabilities for a frozen text-to-image generation model.


Large-scale text-to-image diffusion models have made amazing advances. However, the status quo is to use text input alone, which can impede controllability. In this work, we propose GLIGEN, Grounded-Language-to-Image Generation, a novel approach that builds upon and extends the functionality of existing pre-trained text-to-image diffusion models by enabling them to also be conditioned on grounding inputs. To preserve the vast concept knowledge of the pre-trained model, we freeze all of its weights and inject the grounding information into new trainable layers via a gated mechanism. Our model achieves open-world grounded text2img generation with caption and bounding box condition inputs, and the grounding ability generalizes well to novel spatial configurations and concepts. GLIGEN's zero-shot performance on COCO and LVIS outperforms that of existing supervised layout-to-image baselines by a large margin.

I. Generated examples from GLIGEN Demo

II. Demo Instruction

Model Designs: Efficient Training & Flexible Inference

  • GLIGEN is built upon existing pretrained diffusion models, whose original weights are frozen to preserve their vast pretrained knowledge.
  • A new trainable Gated Self-Attention layer is added to each transformer block to absorb the new grounding input.
  • Each grounding token combines two types of information: the semantics of the grounded entity (encoded text or image) and its spatial location (encoded bounding box or keypoints).
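
The last bullet can be illustrated with a minimal sketch of how one grounding token might be assembled from a semantic feature and a box. The sin/cos (Fourier-style) encoding of the coordinates, the `[x_min, y_min, x_max, y_max]` box convention, the function names, the dimensions, and the single linear layer standing in for a learned network are all assumptions made for illustration; this page does not specify them.

```python
import numpy as np

def fourier_embed(box, num_freqs=8):
    """Encode a normalized box [x_min, y_min, x_max, y_max] with sin/cos features (assumed encoding)."""
    box = np.asarray(box, dtype=np.float64)            # shape (4,)
    freqs = 2.0 ** np.arange(num_freqs)                # shape (num_freqs,)
    angles = box[:, None] * freqs[None, :] * np.pi     # shape (4, num_freqs)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1).ravel()

def grounding_token(semantic_feat, box, weight, bias):
    """Fuse a semantic feature (text or image embedding) with the encoded location."""
    joint = np.concatenate([semantic_feat, fourier_embed(box)])
    return weight @ joint + bias   # one linear layer as a stand-in for a learned network

rng = np.random.default_rng(0)
semantic_dim, token_dim, num_freqs = 16, 32, 8
location_dim = 4 * 2 * num_freqs                       # 4 coordinates, sin and cos per frequency
weight = rng.normal(scale=0.02, size=(token_dim, semantic_dim + location_dim))
bias = np.zeros(token_dim)
semantic_feat = rng.normal(size=semantic_dim)          # placeholder for an encoded phrase
token = grounding_token(semantic_feat, [0.1, 0.2, 0.6, 0.9], weight, bias)
```

Each grounded entity yields one such token, so a layout with several boxes becomes a short sequence of tokens.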

Figure 2. Gated Self-Attention is used to fuse new grounding tokens.
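
A rough sketch of the fusion step in Figure 2, under stated assumptions: the gate is taken to be `tanh` of a learnable scalar `gamma` that starts at zero, attention is single-head, and `beta` is a hypothetical on/off switch. None of these specifics are given on this page, which only says that a gated mechanism is used.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a token sequence."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def gated_self_attention(visual, grounding, w_q, w_k, w_v, gamma, beta=1.0):
    """Attend over visual and grounding tokens jointly, then add a gated residual to the visual tokens."""
    joint = np.concatenate([visual, grounding], axis=0)
    attended = self_attention(joint, w_q, w_k, w_v)
    visual_update = attended[: visual.shape[0]]        # keep only the visual positions
    return visual + beta * np.tanh(gamma) * visual_update

rng = np.random.default_rng(0)
dim = 8
visual = rng.normal(size=(5, dim))       # 5 visual tokens
grounding = rng.normal(size=(2, dim))    # 2 grounding tokens
w_q, w_k, w_v = (rng.normal(scale=0.1, size=(dim, dim)) for _ in range(3))

closed = gated_self_attention(visual, grounding, w_q, w_k, w_v, gamma=0.0)
opened = gated_self_attention(visual, grounding, w_q, w_k, w_v, gamma=1.0)
```

With `gamma` at zero the layer is an identity on the visual tokens, so in this sketch adding it to a frozen model does not change the model's output until the gate is trained open.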

I. Modulated Training

Compared with other ways of using a pretrained diffusion model, such as full-model finetuning, our approach is more cost-efficient: only the newly added modulated layers are continually pre-trained on large-scale grounding data (image-text-box). Like Lego, one can plug and play different trained layers to enable different new capabilities.
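
The idea of training only the added layers can be shown with a toy example. This is a sketch, not the GLIGEN training code: the parameter names, the linear "model", and the plain gradient-descent update are hypothetical stand-ins chosen to keep the example self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)
params = {
    "base.weight": rng.normal(size=(4, 4)),   # stands in for frozen pretrained weights
    "adapter.weight": np.zeros((4, 4)),       # stands in for a newly added layer
}
trainable = {"adapter.weight"}                # only these names receive updates

def forward(params, x):
    return x @ params["base.weight"] + x @ params["adapter.weight"]

def loss_fn(params, x, target):
    return float(np.mean(np.sum((forward(params, x) - target) ** 2, axis=1)))

def sgd_step(params, x, target, lr=0.05):
    error = forward(params, x) - target
    grad = 2.0 * x.T @ error / x.shape[0]     # gradient of loss_fn w.r.t. the adapter weight
    updated = dict(params)
    for name in trainable:                    # frozen entries are never touched
        updated[name] = params[name] - lr * grad
    return updated

x = rng.normal(size=(8, 4))
target = x @ rng.normal(size=(4, 4))
base_before = params["base.weight"].copy()
loss_before = loss_fn(params, x, target)
for _ in range(50):
    params = sgd_step(params, x, target)
loss_after = loss_fn(params, x, target)
```

Because the base weights never change, replacing the `adapter.weight` entry with a differently trained one would be the toy analogue of plugging in a different set of trained layers.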

II. Scheduled Sampling

A favorable property of modulated training is that GLIGEN supports scheduled sampling in the diffusion process at inference time: the model can dynamically choose between using the grounding tokens (by enabling the new layers) and falling back to the original diffusion model with its good prior (by disabling the new layers), thereby balancing generation quality and grounding ability.
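
One simple way to express such a schedule is a per-step switch. The sketch below assumes the new layers are enabled for an initial fraction `tau` of the sampling steps and disabled afterwards; the function name, the parameter `tau`, and the choice of "early steps grounded" are illustrative assumptions rather than details stated on this page.

```python
def grounding_schedule(num_steps, tau):
    """Return one gate value per sampling step: 1.0 for the first tau fraction, 0.0 afterwards."""
    if not 0.0 <= tau <= 1.0:
        raise ValueError("tau must be in [0, 1]")
    grounded_steps = round(tau * num_steps)
    return [1.0 if step < grounded_steps else 0.0 for step in range(num_steps)]

schedule = grounding_schedule(num_steps=10, tau=0.3)
# schedule == [1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```

Each value would multiply the output of the new layers at that step, so `tau=1.0` keeps grounding on throughout and `tau=0.0` recovers the original model.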


Text Grounded T2I Generation (Bounding box)

By exploiting the knowledge of the pretrained text2img model, GLIGEN can generate a variety of objects at given locations. It also supports a variety of styles.

Compared with existing text2img models such as DALLE1 and DALLE2, GLIGEN adds the ability to follow grounding instructions. The text prompts and DALLE-generated images are from the OpenAI Blog.

Spatially counterfactual generation

By explicitly specifying object size and location, GLIGEN can generate spatially counterfactual results that are difficult to realize with a text2img model alone (e.g., Stable Diffusion).

Image Grounded T2I Generation (Bounding box)

GLIGEN can also ground on reference images. The top row shows that reference images can provide fine-grained details beyond the text description, such as the style and shape of a car. The second row shows that a reference image can also serve as a style image, in which case we find that grounding it to a corner or edge of the image is sufficient.

Grounded T2I Generation (Keypoints)

GLIGEN can also be grounded on human keypoints during text-to-image generation.

Grounded Inpainting

Like other diffusion models, GLIGEN can also perform grounded image inpainting, generating objects that tightly follow the provided bounding boxes.
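
A common masking pattern for box-driven inpainting is sketched below: the box is rasterized into a binary mask, and generated content is kept only inside it. This page does not describe how GLIGEN implements inpainting, so the function names, the normalized `[x_min, y_min, x_max, y_max]` box convention, and the simple blend are assumptions for illustration.

```python
import numpy as np

def box_to_mask(box, height, width):
    """Rasterize a normalized [x_min, y_min, x_max, y_max] box into a binary mask."""
    x_min, y_min, x_max, y_max = box
    mask = np.zeros((height, width), dtype=np.float64)
    row_start, row_end = int(round(y_min * height)), int(round(y_max * height))
    col_start, col_end = int(round(x_min * width)), int(round(x_max * width))
    mask[row_start:row_end, col_start:col_end] = 1.0
    return mask

def blend(generated, known, mask):
    """Keep generated content inside the mask and the known image outside it."""
    return mask * generated + (1.0 - mask) * known

mask = box_to_mask([0.25, 0.25, 0.75, 0.75], height=8, width=8)
generated = np.ones((8, 8))    # placeholder for model output
known = np.zeros((8, 8))       # placeholder for the original image
result = blend(generated, known, mask)
```

In a diffusion sampler a blend of this kind is typically applied at every denoising step, so the region outside the box stays consistent with the original image.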


  author      = {Li, Yuheng and Liu, Haotian and Wu, Qingyang and Mu, Fangzhou and Yang, Jianwei and Gao, Jianfeng and Li, Chunyuan and Lee, Yong Jae},
  title       = {GLIGEN: Open-Set Grounded Text-to-Image Generation},
  publisher   = {arXiv:2301.07093},
  year        = {2023},


This website is adapted from Nerfies and X-Decoder, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Related Links:

  • [Computer Vision in the Wild]
  • GLIGEN: (box, concept) → image || GLIP: image → (box, concept); see grounded image understanding in [GLIP]
  • Modulated design and training of foundation models for image understanding [REACT]