An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

Abstract

Text-to-image models offer unprecedented freedom to guide creation through natural language. Yet, it is unclear how such freedom can be exercised to generate images of specific unique concepts, modify their appearance, or compose them in new roles and novel scenes. In other words, we ask: how can we use language-guided models to turn our cat into a painting, or imagine a new product based on our favorite toy? Here we present a simple approach that allows such creative freedom.

Using only 3-5 images of a user-provided concept, like an object or a style, we learn to represent it through new "words" in the embedding space of a frozen text-to-image model. These "words" can be composed into natural language sentences, guiding personalized creation in an intuitive way. Notably, we find evidence that a single word embedding is sufficient for capturing unique and varied concepts.
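To make the idea of learning a single new "word" against a frozen model more concrete, here is a toy sketch. It is not the authors' implementation: the frozen model is replaced by a small fixed linear map, the images by short feature vectors, and the loss by a plain mean squared error minimized with gradient descent. The names `FROZEN_WEIGHTS`, `frozen_model`, `learn_pseudo_word`, and `concept_views` are hypothetical and chosen only for illustration.

```python
# Toy stand-in for a frozen generator (an assumption for illustration only):
# a fixed linear map from a 2-D word embedding to a 3-D "image feature" vector.
FROZEN_WEIGHTS = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]


def frozen_model(embedding, weights=FROZEN_WEIGHTS):
    """Apply the fixed map; the weights are read but never modified."""
    return [sum(w * e for w, e in zip(row, embedding)) for row in weights]


def learn_pseudo_word(targets, weights=FROZEN_WEIGHTS, steps=500, lr=0.05):
    """Optimize only the new embedding v so that the frozen model's output
    matches the few target examples on average (mean squared error)."""
    dim = len(weights[0])
    v = [0.0] * dim  # the single trainable "word" embedding
    for _ in range(steps):
        out = frozen_model(v, weights)
        grad = [0.0] * dim
        for target in targets:
            for i, row in enumerate(weights):
                err = out[i] - target[i]
                for j in range(dim):
                    # d/dv_j of the mean over targets of sum_i (out_i - target_i)^2
                    grad[j] += 2.0 * err * row[j] / len(targets)
        v = [vj - lr * gj for vj, gj in zip(v, grad)]
    return v


# Three slightly different "views" of one concept, standing in for 3-5 images.
concept_views = [[0.4, -1.0], [0.5, -0.9], [0.6, -1.1]]
TARGETS = [frozen_model(e) for e in concept_views]

learned = learn_pseudo_word(TARGETS)
print(learned)
```

In this toy setting the learned vector should land near the average of the three views, while `FROZEN_WEIGHTS` stays untouched. The only point carried over from the text is the division of labor: the model is fixed and a single embedding is the sole trainable quantity.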

We compare our approach to a wide range of baselines, and demonstrate that it can more faithfully portray the concepts across a range of applications and tasks.

Learning to represent styles

Our method can represent a wide array of concepts, including visual artistic styles. In a sense, we can learn a pseudo-word that represents a specific artist or a new artistic movement, and mimic it in future creations.

Image credits: @David Revoy. @QinniArt result removed at family's request. Image reproduction authorized for non-commercial use only.

Reducing Biases

Text-to-image models suffer from biases inherited from their training data. Rather than learning a new concept, we can find new embeddings for 'biased' concepts. Because these embeddings are learned from small datasets, we can easily curate the data to ensure a fairer representation. For example, here we replace the model's notion of 'Doctor' with a new, more inclusive word.

Compositions

We can combine the new words in order to create scenes that draw on both concepts. Unfortunately, this doesn't yet work for relational prompts, so we can't show you our cat on a fishing trip with our clock.
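As a rough sketch of what combining learned words in one prompt can look like at the embedding level, the snippet below maps each token of a prompt to a vector, with learned pseudo-words taking priority over the ordinary vocabulary. Everything here is an assumption made for illustration: the whitespace tokenizer, the 2-D vectors, the placeholder strings `<cat>` and `<clock>`, and the names `embed_prompt`, `VOCAB`, and `PSEUDO_WORDS`. A real text encoder would use its own tokenizer and far larger embeddings.

```python
def embed_prompt(prompt, vocab, pseudo_words):
    """Map each whitespace-separated token to an embedding vector.
    Learned pseudo-words are looked up before the frozen vocabulary."""
    embeddings = []
    for token in prompt.split():
        if token in pseudo_words:
            embeddings.append(pseudo_words[token])
        elif token in vocab:
            embeddings.append(vocab[token])
        else:
            raise KeyError(f"unknown token: {token!r}")
    return embeddings


# Hypothetical frozen vocabulary and two separately learned pseudo-words.
VOCAB = {"a": [0.1, 0.0], "photo": [0.3, 0.2], "of": [0.0, 0.1], "and": [0.2, 0.2]}
PSEUDO_WORDS = {"<cat>": [0.5, -1.0], "<clock>": [-0.7, 0.4]}

sequence = embed_prompt("a photo of <cat> and <clock>", VOCAB, PSEUDO_WORDS)
print(len(sequence))
```

The resulting sequence is what a downstream generator would be conditioned on. Nothing in this lookup addresses the relational-prompt limitation mentioned above; it only shows that two learned words can sit side by side in one sentence.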

Image credits: @Leslie Manlapig. Reproductions authorized for non-commercial & non-print use.

Downstream Models

Our pseudo-words work with downstream models. For example, if you're tired of your old photographs, you can spice them up by inserting some new friends using Blended Latent Diffusion.

BibTeX

If you find our work useful, please cite our paper:

@misc{gal2022textual,
      doi = {10.48550/ARXIV.2208.01618},
      url = {https://arxiv.org/abs/2208.01618},
      author = {Gal, Rinon and Alaluf, Yuval and Atzmon, Yuval and Patashnik, Or and Bermano, Amit H. and Chechik, Gal and Cohen-Or, Daniel},
      title = {An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion},
      publisher = {arXiv},
      year = {2022},
      primaryClass={cs.CV}
}


