# Docs: fp16 page #404

Merged — 12 commits, Sep 8, 2022. Showing changes from 3 commits.
## docs/source/_toctree.yml (1 addition, 1 deletion)

```diff
@@ -28,7 +28,7 @@
     title: "Using Diffusers"
 - sections:
   - local: optimization/fp16
-    title: "Torch Float16"
+    title: "Memory and Speed"
   - local: optimization/onnx
     title: "ONNX"
   - local: optimization/open_vino
```
## docs/source/_toctree_new.yml (1 addition, 1 deletion)

```diff
@@ -22,7 +22,7 @@
     title: "Pipelines for Inference"
 - sections:
   - local: optimization/fp16
-    title: "Torch Float16"
+    title: "Memory and Speed"
   - local: optimization/onnx
     title: "ONNX"
   - local: optimization/open_vino
```
## docs/source/optimization/fp16.mdx (44 additions, 8 deletions)

The previous placeholder content (an outdated copy of the Quicktour page) is replaced; the new page reads as follows.

# Memory and speed

We present some techniques and ideas to optimize 🤗 Diffusers _inference_ for memory or speed.

## CUDA `autocast`

If you use a CUDA GPU, you can take advantage of `torch.autocast` to perform much faster inference. All you need to do is put your inference call inside an `autocast` context manager. The following shows how, using Stable Diffusion text-to-image generation as an example:

```Python
from torch import autocast
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", use_auth_token=True)
pipe = pipe.to("cuda")

prompt = "a photo of an astronaut riding a horse on mars"
with autocast("cuda"):
    image = pipe(prompt).images[0]
```
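With the default output type, the pipeline returns PIL images, so the result can be saved directly. A small follow-on sketch (the filename here is arbitrary):

```Python
# `image` is a PIL.Image.Image under the default output type,
# so it can be saved or inspected directly.
image.save("astronaut_rides_horse.png")
print(image.size)
```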

## Half precision weights

To save a substantial amount of GPU memory, you can load the model weights in half precision. This involves loading the float16 version of the weights, which was saved to a branch named `fp16`, and telling PyTorch to use the `float16` type when loading them:

```Python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    revision="fp16",
    torch_dtype=torch.float16,
    use_auth_token=True
)
```
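To see the effect, one can measure the peak VRAM used by a generation. A minimal sketch, assuming a CUDA device and the half-precision `pipe` loaded above (the prompt is just an example):

```Python
import torch

pipe = pipe.to("cuda")

# Reset the CUDA peak-memory counter, run one generation, then read the peak.
torch.cuda.reset_peak_memory_stats()
image = pipe("a photo of an astronaut riding a horse on mars").images[0]
print(f"Peak VRAM: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GiB")
```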

## Sliced attention for additional memory savings

### Models
For additional memory savings, you can use a sliced version of attention that performs the computation in steps instead of all at once. You only need to invoke `enable_attention_slicing()` on your pipeline before inference, as shown here:

```Python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    revision="fp16",
    torch_dtype=torch.float16,
    use_auth_token=True
)
pipe = pipe.to("cuda")

prompt = "a photo of an astronaut riding a horse on mars"
pipe.enable_attention_slicing()
with torch.autocast("cuda"):
    image = pipe(prompt).images[0]
```

There's a small performance penalty of about 10% slower inference times, but this method allows you to use Stable Diffusion in as little as 3.2 GB of VRAM!
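To illustrate what "computation in steps" means, here is a rough, self-contained sketch of the idea behind attention slicing — not the actual diffusers implementation — where the attention weights are materialized for only one slice of the batch-times-heads dimension at a time:

```Python
import torch

def sliced_attention(q, k, v, slice_size):
    # q, k, v: (batch_x_heads, seq_len, dim). Process `slice_size` rows of
    # the first dimension per step so only a fraction of the full
    # (seq_len x seq_len) attention weights is in memory at once.
    out = torch.empty_like(q)
    scale = q.shape[-1] ** -0.5
    for i in range(0, q.shape[0], slice_size):
        s = slice(i, i + slice_size)
        # Only this slice's attention weights exist at this point.
        weights = (q[s] @ k[s].transpose(1, 2)).mul(scale).softmax(dim=-1)
        out[s] = weights @ v[s]
    return out

# Example: 16 batch*head rows, processed 4 at a time.
q = k = v = torch.randn(16, 64, 40)
print(sliced_attention(q, k, v, slice_size=4).shape)  # torch.Size([16, 64, 40])
```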
## examples/textual_inversion/README.md (1 addition, 1 deletion)

```diff
@@ -79,7 +79,7 @@ from torch import autocast
 from diffusers import StableDiffusionPipeline
 
 model_id = "path-to-your-trained-model"
-pipe = pipe = StableDiffusionPipeline.from_pretrained(model_id,torch_dtype=torch.float16).to("cuda")
+pipe = StableDiffusionPipeline.from_pretrained(model_id,torch_dtype=torch.float16).to("cuda")
 
 prompt = "A <cat-toy> backpack"
```