
Commit 5b4f595

Update text_inversion.mdx (#393)
* Update text_inversion.mdx: getting in a bit of background info
* Fixed typo mode -> model
* Link SD and re-write a few bits for clarity
* Copied in info from the example script, as suggested by surajpatil :)
* Removed an unnecessary heading
1 parent 3dcc5e9 commit 5b4f595

File tree: 1 file changed (+101, -8 lines)


docs/source/training/text_inversion.mdx

Lines changed: 101 additions & 8 deletions
# Textual Inversion

Textual Inversion is a technique for capturing novel concepts from a small number of example images in a way that can later be used to control text-to-image pipelines. It does so by learning new 'words' in the embedding space of the pipeline's text encoder. These special words can then be used within text prompts to achieve very fine-grained control of the resulting images.

![Textual Inversion example](https://textual-inversion.github.io/static/images/editing/colorful_teapot.JPG)
_By using just 3-5 images you can teach new concepts to a model such as Stable Diffusion for personalized image generation ([image source](https://github.com/rinongal/textual_inversion))._

This technique was introduced in [An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion](https://arxiv.org/abs/2208.01618). The paper demonstrated the concept using a [latent diffusion model](https://github.com/CompVis/latent-diffusion), but the idea has since been applied to other variants such as [Stable Diffusion](https://huggingface.co/docs/diffusers/main/en/conceptual/stable_diffusion).

## How It Works

![Diagram from the paper showing overview](https://textual-inversion.github.io/static/images/training/training.JPG)
_Architecture overview from the [textual inversion blog post](https://textual-inversion.github.io/)_

Before a text prompt can be used in a diffusion model, it must first be processed into a numerical representation. This typically involves tokenizing the text, converting each token to an embedding and then feeding those embeddings through a model (typically a transformer) whose output will be used as the conditioning for the diffusion model.
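For illustration, here is a minimal sketch of that step using the CLIP tokenizer and text encoder that Stable Diffusion relies on; the checkpoint name and `subfolder` layout are assumptions based on `CompVis/stable-diffusion-v1-4`, and you may need to be authenticated to download the weights:

```python
import torch
from transformers import CLIPTextModel, CLIPTokenizer

# Assumption: tokenizer and text encoder live in subfolders of the
# Stable Diffusion checkpoint, as in CompVis/stable-diffusion-v1-4.
model_name = "CompVis/stable-diffusion-v1-4"
tokenizer = CLIPTokenizer.from_pretrained(model_name, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_name, subfolder="text_encoder")

prompt = "A photo of a cat"

# Tokenize: text -> token ids, padded to the model's fixed context length.
input_ids = tokenizer(
    prompt,
    padding="max_length",
    max_length=tokenizer.model_max_length,
    truncation=True,
    return_tensors="pt",
).input_ids

# Each token id is looked up in an embedding table and passed through the
# transformer; the output sequence is the conditioning for the diffusion model.
with torch.no_grad():
    conditioning = text_encoder(input_ids).last_hidden_state  # shape (1, 77, 768)
```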
Textual inversion learns a new token embedding (v* in the diagram above). A prompt (that includes a token which will be mapped to this new embedding) is used in conjunction with a noised version of one or more training images as inputs to the generator model, which attempts to predict the denoised version of the image. The embedding is optimized based on how well the model does at this task - an embedding that better captures the object or style shown by the training images will give more useful information to the diffusion model and thus result in a lower denoising loss. After many steps (typically several thousand) with a variety of prompt and image variants the learned embedding should hopefully capture the essence of the new concept being taught.
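To make this concrete, here is a heavily simplified sketch of a single optimization step, assuming the `CompVis/stable-diffusion-v1-4` checkpoint layout and a `<cat-toy>` placeholder token; the actual `textual_inversion.py` script adds data loading, gradient masking of the other embedding rows, accumulation and checkpointing on top of this:

```python
import torch
import torch.nn.functional as F
from transformers import CLIPTextModel, CLIPTokenizer
from diffusers import AutoencoderKL, DDPMScheduler, UNet2DConditionModel

model_name = "CompVis/stable-diffusion-v1-4"  # assumed checkpoint
tokenizer = CLIPTokenizer.from_pretrained(model_name, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_name, subfolder="text_encoder")
vae = AutoencoderKL.from_pretrained(model_name, subfolder="vae")
unet = UNet2DConditionModel.from_pretrained(model_name, subfolder="unet")
noise_scheduler = DDPMScheduler.from_pretrained(model_name, subfolder="scheduler")

# Add the placeholder token and initialize its embedding from a related word.
tokenizer.add_tokens("<cat-toy>")
text_encoder.resize_token_embeddings(len(tokenizer))
embeds = text_encoder.get_input_embeddings().weight
new_id = tokenizer.convert_tokens_to_ids("<cat-toy>")
init_id = tokenizer.encode("toy", add_special_tokens=False)[0]
embeds.data[new_id] = embeds.data[init_id].clone()

# Freeze everything except the token embedding table.
# (The real script also zeroes the gradients of every row except the new one.)
vae.requires_grad_(False)
unet.requires_grad_(False)
text_encoder.requires_grad_(False)
text_encoder.get_input_embeddings().requires_grad_(True)
optimizer = torch.optim.AdamW(text_encoder.get_input_embeddings().parameters(), lr=5e-4)


def training_step(pixel_values, input_ids):
    """One denoising step: pixel_values is a batch of training images,
    input_ids a tokenized prompt that contains <cat-toy>."""
    latents = vae.encode(pixel_values).latent_dist.sample() * 0.18215
    noise = torch.randn_like(latents)
    timesteps = torch.randint(
        0, noise_scheduler.config.num_train_timesteps, (latents.shape[0],)
    )
    noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)

    encoder_hidden_states = text_encoder(input_ids).last_hidden_state
    noise_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample

    # A lower denoising loss means the embedding describes the images better.
    loss = F.mse_loss(noise_pred, noise)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```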
## Usage

To train your own textual inversions, see the [example script here](https://github.com/huggingface/diffusers/tree/main/examples/textual_inversion).

There is also a notebook for training:
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/sd_textual_inversion_training.ipynb)

And one for inference:
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/stable_conceptualizer_inference.ipynb)

In addition to using concepts you have trained yourself, there is a community-created collection of trained textual inversions in the new [Stable Diffusion public concepts library](https://huggingface.co/sd-concepts-library) which you can also use from the inference notebook above. Over time this will hopefully grow into a useful resource as more examples are added.
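As a rough sketch of what the inference notebook does when you pick a concept from the library, you can download its learned embedding and register it with a pipeline yourself; the `sd-concepts-library/cat-toy` repo id and the `learned_embeds.bin` filename are assumptions about how concepts are typically stored:

```python
import torch
from diffusers import StableDiffusionPipeline
from huggingface_hub import hf_hub_download

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")

# Assumption: each concept repo stores a dict {placeholder_token: embedding}
# in a file called learned_embeds.bin.
embeds_path = hf_hub_download(repo_id="sd-concepts-library/cat-toy", filename="learned_embeds.bin")
token, embedding = next(iter(torch.load(embeds_path, map_location="cpu").items()))

# Register the placeholder token and copy its embedding into the text encoder.
pipe.tokenizer.add_tokens(token)
pipe.text_encoder.resize_token_embeddings(len(pipe.tokenizer))
token_id = pipe.tokenizer.convert_tokens_to_ids(token)
pipe.text_encoder.get_input_embeddings().weight.data[token_id] = embedding

# Recent diffusers releases also provide pipe.load_textual_inversion(), which
# wraps these steps.
image = pipe(f"A {token} backpack").images[0]
```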
## Example: Running locally

The `textual_inversion.py` script [here](https://github.com/huggingface/diffusers/blob/main/examples/textual_inversion) shows how to implement the training procedure and adapt it for Stable Diffusion.

### Installing the dependencies

Before running the scripts, make sure to install the library's training dependencies:

```bash
pip install diffusers[training] accelerate transformers
```
And initialize an [🤗Accelerate](https://github.com/huggingface/accelerate/) environment with:
```bash
accelerate config
```

### Cat toy example
You need to accept the model license before downloading or using the weights. In this example we'll use model version `v1-4`, so you'll need to visit [its card](https://huggingface.co/CompVis/stable-diffusion-v1-4), read the license and tick the checkbox if you agree.

You have to be a registered user on the 🤗 Hugging Face Hub, and you'll also need to use an access token for the code to work. For more information on access tokens, please refer to [this section of the documentation](https://huggingface.co/docs/hub/security-tokens).

Run the following command to authenticate your token:
```bash
huggingface-cli login
```
If you have already cloned the repo, then you won't need to go through these steps. You can simply remove the `--use_auth_token` argument from the following command.

<br>

Now let's get our dataset. Download 3-4 images from [here](https://drive.google.com/drive/folders/1fmJMs25nxS_rSNqS5hTcRdLem_YQXbq5) and save them in a directory. This will be our training data.

And launch the training using:
```bash
export MODEL_NAME="CompVis/stable-diffusion-v1-4"
export DATA_DIR="path-to-dir-containing-images"

accelerate launch textual_inversion.py \
  --pretrained_model_name_or_path=$MODEL_NAME --use_auth_token \
  --train_data_dir=$DATA_DIR \
  --learnable_property="object" \
  --placeholder_token="<cat-toy>" --initializer_token="toy" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --max_train_steps=3000 \
  --learning_rate=5.0e-04 --scale_lr \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --output_dir="textual_inversion_cat"
```
A full training run takes ~1 hour on one V100 GPU.

### Inference

Once you have trained a model using the above command, inference can be done simply with the `StableDiffusionPipeline`. Make sure to include the `placeholder_token` in your prompt.
```python
import torch
from torch import autocast
from diffusers import StableDiffusionPipeline

model_id = "path-to-your-trained-model"
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

prompt = "A <cat-toy> backpack"

with autocast("cuda"):
    image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0]

image.save("cat-backpack.png")
```
