An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

Tel Aviv University, NVIDIA

We learn to generate specific concepts, like personal objects or artistic styles, by describing them using new "words" in the embedding space of pre-trained text-to-image models. These can be used in new sentences, just like any other word.

Our work builds on the publicly available Latent Diffusion Models.

Here are some more generated examples. We hope they are cool enough to convince you it's worth reading on :)

Abstract

Text-to-image models offer unprecedented freedom to guide creation through natural language. Yet, it is unclear how such freedom can be exercised to generate images of specific unique concepts, modify their appearance, or compose them in new roles and novel scenes. In other words, we ask: how can we use language-guided models to turn our cat into a painting, or imagine a new product based on our favorite toy? Here we present a simple approach that allows such creative freedom.

Using only 3-5 images of a user-provided concept, like an object or a style, we learn to represent it through new "words" in the embedding space of a frozen text-to-image model. These "words" can be composed into natural language sentences, guiding personalized creation in an intuitive way. Notably, we find evidence that a single word embedding is sufficient for capturing unique and varied concepts.

We compare our approach to a wide range of baselines, and demonstrate that it can more faithfully portray the concepts across a range of applications and tasks.

How does it work?

In most text-to-image models, the first stage of text encoding converts the prompt into a numerical representation. This is typically done by converting the words into tokens, each equivalent to an entry in the model's dictionary. Each entry is then mapped to an "embedding", a continuous vector representation of that specific token. These embeddings are usually learned as part of the training process. In our work, we find new embeddings that represent specific, user-provided visual concepts. These embeddings are then linked to new pseudo-words, which can be incorporated into new sentences like any other word. In a sense, we are performing inversion into the text-embedding space of the frozen model. We call this process 'Textual Inversion'.
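
To make the idea concrete, here is a minimal Python sketch of the mechanism described above. It is a toy under stated assumptions, not the released implementation: a small fixed linear map (`W`) stands in for the frozen text-to-image model, three short feature vectors stand in for the user's images, and the names `tokenize`, `generate`, `learn_pseudo_word` and the pseudo-word `<my-concept>` are all hypothetical. The only point it illustrates is that the dictionary gains one new entry, and that the embedding of that entry is the only thing optimized while everything else stays frozen.

```python
import random

random.seed(0)
DIM = 4  # toy embedding size; real models use far larger vectors

# Frozen stand-ins for a pre-trained model (assumptions, not a real API):
# a word dictionary, its embedding table, and a fixed linear "generator" W.
vocab = {"a": 0, "photo": 1, "of": 2}
table = [[random.gauss(0, 1) for _ in range(DIM)] for _ in vocab]
W = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(DIM)]


def tokenize(prompt):
    """Map each whitespace-separated word to its dictionary entry."""
    return [vocab[word] for word in prompt.split()]


def generate(ids):
    """Toy frozen model: average the token embeddings, then apply W."""
    pooled = [sum(table[i][d] for i in ids) / len(ids) for d in range(DIM)]
    return [sum(W[r][d] * pooled[d] for d in range(DIM)) for r in range(DIM)]


def mean_loss(ids, targets):
    """Mean squared error between the model output and each target."""
    out = generate(ids)
    per_target = [sum((o - t) ** 2 for o, t in zip(out, tgt)) for tgt in targets]
    return sum(per_target) / len(targets)


def learn_pseudo_word(word, template, targets, steps=500, lr=0.1):
    """Add `word` to the dictionary and optimize only its embedding row."""
    vocab[word] = len(table)
    table.append([0.0] * DIM)  # the single trainable vector
    new_id = vocab[word]
    ids = tokenize(template.format(word))
    share = ids.count(new_id) / len(ids)  # weight of the new row in the average
    history = [mean_loss(ids, targets)]
    for _ in range(steps):
        out = generate(ids)
        # Gradient of the mean squared error with respect to the output.
        g_out = [sum(2 * (out[r] - tgt[r]) for tgt in targets) / len(targets)
                 for r in range(DIM)]
        # Chain rule back through W and the averaging step.
        grad = [share * sum(W[r][d] * g_out[r] for r in range(DIM))
                for d in range(DIM)]
        table[new_id] = [e - lr * g for e, g in zip(table[new_id], grad)]
    history.append(mean_loss(ids, targets))
    return new_id, history


# Three toy "images" of one concept, written as feature vectors.
concept_images = [[1.0, 0.2, -0.5, 0.3], [0.9, 0.1, -0.4, 0.4], [1.1, 0.3, -0.6, 0.2]]
frozen_before = [row[:] for row in table]

new_id, history = learn_pseudo_word("<my-concept>", "a photo of {}", concept_images)
print("loss before:", history[0], "loss after:", history[-1])
print("pre-trained rows unchanged:", table[:3] == frozen_before)
```

After training, `<my-concept>` tokenizes like any other dictionary word, so it can be placed in new prompts. In the actual method the loss would come from the frozen text-to-image model itself rather than from this linear stand-in.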

Learning to represent styles

Our method can be used to represent a wide array of concepts - including visual artistic styles. In a sense, we can learn a pseudo-word that represents a specific artist or a new artistic movement, and mimic it in future creations.

Image credits: @QinniArt (top), @David Revoy (bottom). Image reproduction authorized for non-commercial use only.


Reducing Biases

Text-to-image models suffer from biases inherited from the training data. Rather than learning a new concept, we can find new embeddings for 'biased' concepts. These are found using small datasets, so we can easily curate the data and ensure a fairer representation. For example, here we replace the model's notion of 'Doctor' with a new, more inclusive word.



Compositions

We can combine the new words in order to create scenes that draw on both concepts. Unfortunately, this doesn't yet work for relational prompts, so we can't show you our cat on a fishing trip with our clock.

Image credits: @QinniArt (left), @Leslie Manlapig (right). Reproductions authorized for non-commercial & non-print use.


Downstream Models

Our pseudo-words work with downstream models. For example, if you're tired of your old photographs, you can spice them up by inserting some new friends using Blended Latent Diffusion:



BibTeX

If you find our work useful, please cite our paper:

@misc{gal2022textual,
      doi = {10.48550/ARXIV.2208.01618},
      url = {https://arxiv.org/abs/2208.01618},
      author = {Gal, Rinon and Alaluf, Yuval and Atzmon, Yuval and Patashnik, Or and Bermano, Amit H. and Chechik, Gal and Cohen-Or, Daniel},
      title = {An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion},
      publisher = {arXiv},
      year = {2022},
      primaryClass={cs.CV}
}