Commit ae69462

a-r-r-o-w authored and stevhliu committed
[docs] Add a note on torchao/quanto benchmarks for CogVideoX and memory-efficient inference (#9296)
* add a note on torchao/quanto benchmarks and memory-efficient inference * apply suggestions from review * update * Update docs/source/en/api/pipelines/cogvideox.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * Update docs/source/en/api/pipelines/cogvideox.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * add note on enable sequential cpu offload --------- Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
1 parent b6cd4a4 commit ae69462

File tree: 1 file changed, +11 −0 lines changed


docs/source/en/api/pipelines/cogvideox.md

Lines changed: 11 additions & 0 deletions
@@ -77,10 +77,21 @@ CogVideoX-2b requires about 19 GB of GPU memory to decode 49 frames (6 seconds o
 - `pipe.enable_model_cpu_offload()`:
   - Without enabling cpu offloading, memory usage is `33 GB`
   - With enabling cpu offloading, memory usage is `19 GB`
+- `pipe.enable_sequential_cpu_offload()`:
+  - Similar to `enable_model_cpu_offload` but can significantly reduce memory usage at the cost of slower inference
+  - When enabled, memory usage is under `4 GB`
 - `pipe.vae.enable_tiling()`:
   - With enabling cpu offloading and tiling, memory usage is `11 GB`
 - `pipe.vae.enable_slicing()`
 
+### Quantized inference
+
+[torchao](https://github.com/pytorch/ao) and [optimum-quanto](https://github.com/huggingface/optimum-quanto/) can be used to quantize the text encoder, transformer, and VAE modules to lower the memory requirements. This makes it possible to run the model on a free-tier T4 Colab or on GPUs with less VRAM!
+
+It is also worth noting that torchao quantization is fully compatible with [torch.compile](/optimization/torch2.0#torchcompile), which allows for much faster inference. Additionally, models can be serialized and stored in a quantized datatype with torchao to save disk space. Find examples and benchmarks in the gists below.
+- [torchao](https://gist.github.com/a-r-r-o-w/4d9732d17412888c885480c6521a9897)
+- [quanto](https://gist.github.com/a-r-r-o-w/31be62828b00a9292821b85c1017effa)
+
 ## CogVideoXPipeline
 
 [[autodoc]] CogVideoXPipeline
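
The memory figures in the diff above come from combining these toggles on the pipeline object. Below is a minimal sketch of how they fit together, assuming the `THUDM/CogVideoX-2b` checkpoint; the prompt and generation settings are illustrative only.

```python
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-2b", torch_dtype=torch.float16
)

# Pick one offloading strategy:
pipe.enable_model_cpu_offload()        # ~19 GB, moderate slowdown
# pipe.enable_sequential_cpu_offload() # under ~4 GB, much slower inference

# Reduce VAE memory while decoding the 49-frame latent:
pipe.vae.enable_tiling()
pipe.vae.enable_slicing()

video = pipe(
    prompt="A panda playing a guitar in a bamboo forest",  # illustrative prompt
    num_frames=49,
    num_inference_steps=50,
).frames[0]
export_to_video(video, "output.mp4", fps=8)
```

Since model offloading and sequential offloading are alternative strategies, only one of the two should be enabled at a time.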
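
For the quantized path, the gists linked in the diff benchmark several quantization dtypes; the sketch below uses torchao's int8 weight-only quantization as one example, with an optimum-quanto equivalent shown in comments. API names follow the torchao `quantize_` entry point and may shift between library versions.

```python
import torch
from diffusers import CogVideoXPipeline
from torchao.quantization import quantize_, int8_weight_only

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-2b", torch_dtype=torch.bfloat16
)

# Quantize the three large modules to int8 weights to lower VRAM needs.
quantize_(pipe.text_encoder, int8_weight_only())
quantize_(pipe.transformer, int8_weight_only())
quantize_(pipe.vae, int8_weight_only())

# optimum-quanto alternative (one module shown):
#   from optimum.quanto import quantize, freeze, qint8
#   quantize(pipe.transformer, weights=qint8)
#   freeze(pipe.transformer)

pipe.to("cuda")

# torchao quantization composes with torch.compile for faster inference.
pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune", fullgraph=True)

video = pipe(prompt="A panda playing a guitar in a bamboo forest").frames[0]
```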
