¹Meta AI (FAIR), ²The Hebrew University of Jerusalem, ³Reichman University
CVPR 2023
SpaText – a new method for text-to-image generation using open-vocabulary scene control.
Abstract
Recent text-to-image diffusion models are able to generate convincing results of unprecedented quality.
However, it is nearly impossible to control the shapes of different regions/objects or their layout in a
fine-grained fashion. Previous attempts to provide such controls were hindered by their reliance on a
fixed set of labels. To address this, we present SpaText – a new method for text-to-image generation using
open-vocabulary scene control. In addition to a global text prompt that describes the entire scene, the
user provides a segmentation map where each region of interest is annotated by a free-form natural
language description. Due to the lack of large-scale datasets that have a detailed textual description for
each region in the image, we choose to leverage the current large-scale text-to-image datasets and base
our approach on a novel CLIP-based spatio-textual representation, and show its effectiveness on two
state-of-the-art diffusion models: pixel-based and latent-based. In addition, we show how to extend the
classifier-free guidance method in diffusion models to the multi-conditional case and present an
alternative accelerated inference algorithm. Finally, we offer several automatic evaluation metrics and
use them, in addition to FID scores and a user study, to evaluate our method and show that it achieves
state-of-the-art results on image generation with free-form textual scene control.
Method
We aim to provide the user with more fine-grained control over the generated image. In addition to a
single global text prompt, the user also provides a segmentation map, where the
content of each segment of interest is described using a local free-form text prompt.
However, current large-scale text-to-image datasets cannot be used for this task because they do not
contain local text descriptions for each segment in the images. Hence, we need a way to extract
the objects in each image along with their textual descriptions. To this end, we opt to use a pre-trained
panoptic segmentation model along with a CLIP model.
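As a rough illustration, the training-time extraction step could look like the following Python sketch. It assumes the open-source clip package, and panoptic_model is a placeholder for any pre-trained panoptic segmentation model that returns binary masks; this is an illustrative sketch, not the released implementation.

# Sketch of extracting segments and their CLIP image embeddings from a training image.
# Assumptions: the open-source "clip" package; "panoptic_model" is a placeholder that
# returns a list of HxW boolean numpy masks. Not the exact implementation.
import random
import numpy as np
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, clip_preprocess = clip.load("ViT-L/14", device=device)

def extract_segment_embeddings(image: Image.Image, panoptic_model, k: int = 3):
    """Pick up to k random segments and return (binary mask, CLIP image embedding) pairs."""
    masks = panoptic_model(image)  # list of HxW boolean masks (placeholder API)
    pairs = []
    for mask in random.sample(masks, min(k, len(masks))):
        # Pre-process: keep only the segment's pixels and crop tightly around them,
        # so CLIP embeds the isolated object rather than the whole image.
        pixels = np.array(image) * mask[..., None]
        ys, xs = np.nonzero(mask)
        crop = Image.fromarray(pixels[ys.min():ys.max() + 1, xs.min():xs.max() + 1])
        with torch.no_grad():
            emb = clip_model.encode_image(clip_preprocess(crop).unsqueeze(0).to(device))
        pairs.append((mask, emb.squeeze(0).float().cpu()))
    return pairs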
During training (left) – given a training image x, we extract K random segments, pre-process them,
and extract their CLIP image embeddings. Then we stack these embeddings in the shapes of their segments
to form the spatio-textual representation ST. During inference (right) – we embed the local prompts
into the CLIP text embedding space, convert them using the prior model P to the CLIP image
embedding space, and finally stack them in the shapes of the input masks to form the spatio-textual
representation ST.
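For illustration, assembling ST at inference time could be sketched as follows. The open-source clip package is assumed, and prior_model stands in for a DALL-E-2-style prior that maps CLIP text embeddings to CLIP image embeddings; the names and shapes here are assumptions, not the exact implementation.

# Sketch of building the spatio-textual representation ST at inference time.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-L/14", device=device)

@torch.no_grad()
def build_spatio_textual(masks, local_prompts, prior_model, height=64, width=64, dim=768):
    """masks: list of HxW boolean torch tensors; local_prompts: matching free-form strings."""
    st = torch.zeros(dim, height, width, device=device)  # unannotated pixels stay all-zero
    tokens = clip.tokenize(local_prompts).to(device)
    text_emb = clip_model.encode_text(tokens).float()    # local prompts in CLIP text space
    image_emb = prior_model(text_emb)                    # converted to CLIP image embedding space
    for mask, emb in zip(masks, image_emb):
        st[:, mask.to(device)] = emb[:, None]            # stack the embedding in the mask's shape
    return st                                            # (dim, H, W), given to the diffusion model as extra input

At training time the same stacking is applied, except that the embeddings come directly from CLIP's image encoder applied to the extracted segments.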
Results
Examples
The following images were generated by our method. Each example consists of (i) the inputs: a global text
prompt (top left, black) and a spatio-textual representation describing each segment with a free-form text
prompt (left, colored text and sketches), and (ii) the corresponding generated image (right). (The colors
are for illustration purposes only and do not affect the actual inputs.)
The prompt combinations used in these examples (global prompt followed by its local prompts) are:
Global: “on a wooden table outdoors” | Local: “a brown hat”
Global: “on a concrete floor” | Local: “an elephant”
Global: “at the beach” | Local: “a lion”, “a book”
Global: “at the desert” | Local: “a squirrel”, “a sign with an apple painting”
Global: “on a snowy day” | Local: “a mouse”, “boxing gloves”, “a black punching bag”
Global: “a sunny day at the street” | Local: “a lemur”, “oranges”, “a soda can”
Global: “in the forest” | Local: “a black cat with a red sweater and a blue jeans”
Global: “a sunny day near the Eiffel tower” | Local: “a white Labrador”, “a blue ball”
Global: “in the style of The Starry Night” | Local: “a black horse”, “a red full moon”
Global: “on the moon” | Local: “an astronaut”, “a horse”
Global: “room with sunlight” | Local: “a wooden table”, “a red bowl”, “a picture on the wall”
Global: “near a lake” | Local: “a black elephant”
Global: “a painting” | Local: “a snowy mountain”, “a red car”
Global: “a sunny day after the snow” | Local: “a Husky dog”, “a German Shepherd dog”
Global: “on a table” | Local: “a mug”, “a white plate with cookies”
Global: “a bathroom with an artificial light” | Local: “a mirror”, “a white sink”, “a vase with red flowers”
Global: “near a river” | Local: “a grizzly bear”, “a huge avocado”
Global: “next to a wooden house” | Local: “a chimpanzee”, “a red wooden stick”
Global: “indoors” | Local: “a glass tea pot”, “a golden straw”
Global: “on a grass” | Local: “an Amanita mushroom”
Global: “a black and white photo in the desert” | Local: “a nuclear explosion”
Global: “a portrait photo” | Local: “a rabbit”
Global: “a portrait photo” | Local: “a duck”
Global: “under the sun” | Local: “a blue butterfly”
Global: “inside a lake” | Local: “an elephant”
Global: “an oil painting” | Local: “a black horse”, “a white horse”
Global: “in the snow” | Local: “a white horse”, “a brown horse”
Global: “a photo taken during the golden hour” | Local: “a snowy mountain”, “a black cat”
Global: “a sunny day at the beach” | Local: “a colorful beach umbrella”, “a ginger cat”
Global: “near some ruins” | Local: “a colorful parrot”, “a straw hat”
Global: “at the desert” | Local: “a green parrot”, “a red hat”
Global: “sitting on a wooden floor” | Local: “a gray teddy bear”, “a brown teddy bear”
Global: “in the street” | Local: “a brown teddy bear”, “a gray teddy bear”
Global: “a night with the city in the background” | Local: “a white car”, “a big full moon”
Global: “in a sunny day near the forest” | Local: “a blue car”, “a red balloon”
Global: “in an empty room” | Local: “a canvas with a painting of a Corgi dog”, “a metallic yellow robot”
Global: “day outdoors” | Local: “a canvas with math equations”, “a wooden robot”
Mask Sensitivity
During our experiments, we noticed that the model generates images that correspond to the implicit masks
in the spatio-textual representation, but not perfectly. Nevertheless, we argue that this
characteristic can be beneficial. For example, given a general animal-shaped mask (first image), the model
is able to generate a diverse set of results driven by the different local prompts: it changes the body
type according to the local prompt while leaving the overall posture of the character intact.
Global prompt: “a sunny day outdoors”; local prompts: “a white cat”, “a Shiba Inu dog”, “a goat”, “a pig”, “a black rabbit”, “a gray donkey”, “a panda bear”, “a gorilla”, “a toad”, “a cow”, “The Statue of Liberty”, “a golden calf”, “a shark”, “a cactus”, “a tortoise”.
We also demonstrate this characteristic on a Rorschach test mask: given a general Rorschach
mask (first image), the model is able to generate a diverse set of results driven by the different local
prompts, changing fine details according to the local prompt while leaving the overall shape intact.
Global prompt: “a painting”; local prompts: “a bat”, “a colorful butterfly”, “a moth”, “two birds facing away from each other”, “a dragon”, “mythical creatures”, “two dogs”, “a crab”, “an evil pig”, “a flying angel”, “an owl”, “a bee”, “two flamingos”, “an avocado”, “two clowns”.
These results are also visualized in an accompanying AI art video.
Multi-Scale Control
The extension of classifier-free guidance to the
multi-conditional case allows fine-grained control over the input conditions. Given the same inputs
(left), we can use different scales for each condition. In the following example, if we put all the weight
on the local scene (1), the generated image contains a horse with the correct color and posture, but not
at the beach. Conversely, if we place all the weight on the global text (5), we get an image of a beach
with no horse in it. The in-between results correspond to a mix of conditions – in (4) we get a gray
donkey, in (2) the beach contains no water, and in (3) we get a brown horse at the beach on a sunny day.
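For illustration, one common way to implement such per-condition scales in a denoising loop is sketched below. Here denoiser is a placeholder noise-prediction network, null_text and null_st are the dropped (null) conditions, and this particular combination rule is an assumption; the exact formulation in the paper may differ.

# Sketch of classifier-free guidance with a separate scale per condition.
import torch

def guided_eps(denoiser, x_t, t, text_cond, st_cond, null_text, null_st,
               s_text: float, s_spatial: float):
    eps_uncond = denoiser(x_t, t, null_text, null_st)  # both conditions dropped
    eps_text   = denoiser(x_t, t, text_cond, null_st)  # global text only
    eps_full   = denoiser(x_t, t, text_cond, st_cond)  # global text + spatio-textual ST
    # Each condition contributes its own guidance direction with its own scale,
    # letting the user trade off global-scene fidelity against local-layout fidelity.
    return (eps_uncond
            + s_text * (eps_text - eps_uncond)
            + s_spatial * (eps_full - eps_text))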
BibTeX
If you find this research useful, please cite the following:
@article{avrahami2022spatext,
title={SpaText: Spatio-Textual Representation for Controllable Image Generation},
author={Avrahami, Omri and Hayes, Thomas and Gafni, Oran and Gupta, Sonal and Taigman, Yaniv and Parikh, Devi and Lischinski, Dani and Fried, Ohad and Yin, Xi},
journal={arXiv preprint arXiv:2211.14305},
year={2022}
}