
Commit c81e607

Merge branch 'main' into feat/sa-solver
2 parents 425b96d + f72b28c

73 files changed: +9837, -2477 lines


.github/workflows/pr_tests.yml

Lines changed: 1 addition & 1 deletion

@@ -115,7 +115,7 @@ jobs:
       run: |
         python -m pytest -n 2 --max-worker-restart=0 --dist=loadfile \
           --make-reports=tests_${{ matrix.config.report }} \
-          examples/test_examples.py
+          examples
 
     - name: Failure short reports
       if: ${{ failure() }}

.github/workflows/push_tests_fast.yml

Lines changed: 1 addition & 1 deletion

@@ -100,7 +100,7 @@ jobs:
       run: |
         python -m pytest -n 2 --max-worker-restart=0 --dist=loadfile \
           --make-reports=tests_${{ matrix.config.report }} \
-          examples/test_examples.py
+          examples
 
     - name: Failure short reports
       if: ${{ failure() }}

PHILOSOPHY.md

Lines changed: 1 addition & 1 deletion

@@ -82,7 +82,7 @@ Models are designed as configurable toolboxes that are natural extensions of [Py
 The following design principles are followed:
 - Models correspond to **a type of model architecture**. *E.g.* the [`UNet2DConditionModel`] class is used for all UNet variations that expect 2D image inputs and are conditioned on some context.
 - All models can be found in [`src/diffusers/models`](https://github.com/huggingface/diffusers/tree/main/src/diffusers/models) and every model architecture shall be defined in its file, e.g. [`unet_2d_condition.py`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/unet_2d_condition.py), [`transformer_2d.py`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/transformer_2d.py), etc...
-- Models **do not** follow the single-file policy and should make use of smaller model building blocks, such as [`attention.py`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention.py), [`resnet.py`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/resnet.py), [`embeddings.py`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/embeddings.py), etc... **Note**: This is in stark contrast to Transformers' modelling files and shows that models do not really follow the single-file policy.
+- Models **do not** follow the single-file policy and should make use of smaller model building blocks, such as [`attention.py`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention.py), [`resnet.py`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/resnet.py), [`embeddings.py`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/embeddings.py), etc... **Note**: This is in stark contrast to Transformers' modeling files and shows that models do not really follow the single-file policy.
 - Models intend to expose complexity, just like PyTorch's `Module` class, and give clear error messages.
 - Models all inherit from `ModelMixin` and `ConfigMixin`.
 - Models can be optimized for performance when it doesn’t demand major code changes, keep backward compatibility, and give significant memory or compute gain.
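
As a minimal sketch of the pattern these principles describe (the class and its dimensions are hypothetical, not library code): one class per architecture, composed of smaller building blocks, inheriting from `ModelMixin` and `ConfigMixin` so serialization and config handling come for free.

```python
import torch
import torch.nn.functional as F
from diffusers import ConfigMixin, ModelMixin
from diffusers.configuration_utils import register_to_config


# Hypothetical toy model, not part of the library: one class per architecture,
# built from reusable sub-modules, inheriting from ModelMixin and ConfigMixin
# so that save_pretrained/from_pretrained and config recording work out of the box.
class TinyConditionedModel(ModelMixin, ConfigMixin):
    @register_to_config
    def __init__(self, in_channels: int = 4, hidden_dim: int = 32):
        super().__init__()
        self.proj_in = torch.nn.Conv2d(in_channels, hidden_dim, kernel_size=3, padding=1)
        self.proj_out = torch.nn.Conv2d(hidden_dim, in_channels, kernel_size=3, padding=1)

    def forward(self, sample: torch.Tensor) -> torch.Tensor:
        return self.proj_out(F.silu(self.proj_in(sample)))


model = TinyConditionedModel()
out = model(torch.randn(1, 4, 64, 64))  # init args were recorded in model.config
```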

docs/source/en/_toctree.yml

Lines changed: 8 additions & 0 deletions

@@ -72,6 +72,8 @@
     title: Overview
   - local: using-diffusers/sdxl
     title: Stable Diffusion XL
+  - local: using-diffusers/sdxl_turbo
+    title: SDXL Turbo
   - local: using-diffusers/kandinsky
     title: Kandinsky
   - local: using-diffusers/controlnet
@@ -94,6 +96,8 @@
     title: Latent Consistency Model-LoRA
   - local: using-diffusers/inference_with_lcm
     title: Latent Consistency Model
+  - local: using-diffusers/svd
+    title: Stable Video Diffusion
   title: Specific pipeline examples
 - sections:
   - local: training/overview
@@ -129,6 +133,8 @@
     title: LoRA
   - local: training/custom_diffusion
     title: Custom Diffusion
+  - local: training/lcm_distill
+    title: Latent Consistency Distillation
   - local: training/ddpo
     title: Reinforcement learning training with DDPO
   title: Methods
@@ -329,6 +335,8 @@
     title: Stable Diffusion 2
   - local: api/pipelines/stable_diffusion/stable_diffusion_xl
     title: Stable Diffusion XL
+  - local: api/pipelines/stable_diffusion/sdxl_turbo
+    title: SDXL Turbo
   - local: api/pipelines/stable_diffusion/latent_upscale
     title: Latent upscaler
   - local: api/pipelines/stable_diffusion/upscale

docs/source/en/api/pipelines/kandinsky3.md

Lines changed: 26 additions & 1 deletion

@@ -9,7 +9,32 @@ specific language governing permissions and limitations under the License.
 
 # Kandinsky 3
 
-TODO
+Kandinsky 3 was created by [Vladimir Arkhipkin](https://github.com/oriBetelgeuse), [Anastasia Maltseva](https://github.com/NastyaMittseva), [Igor Pavlov](https://github.com/boomb0om), [Andrei Filatov](https://github.com/anvilarth), [Arseniy Shakhmatov](https://github.com/cene555), [Andrey Kuznetsov](https://github.com/kuznetsoffandrey), [Denis Dimitrov](https://github.com/denndimitrov), and [Zein Shaheen](https://github.com/zeinsh).
+
+The description from its GitHub page:
+
+*Kandinsky 3.0 is an open-source text-to-image diffusion model built upon the Kandinsky2-x model family. In comparison to its predecessors, enhancements have been made to the text understanding and visual quality of the model, achieved by increasing the size of the text encoder and Diffusion U-Net models, respectively.*
+
+Its architecture includes 3 main components:
+1. [FLAN-UL2](https://huggingface.co/google/flan-ul2), an encoder-decoder model based on the T5 architecture.
+2. A new U-Net architecture featuring BigGAN-deep blocks, which doubles the depth while maintaining the same number of parameters.
+3. Sber-MoVQGAN, a decoder proven to achieve superior results in image restoration.
+
+The original codebase can be found at [ai-forever/Kandinsky-3](https://github.com/ai-forever/Kandinsky-3).
+
+<Tip>
+
+Check out the [Kandinsky Community](https://huggingface.co/kandinsky-community) organization on the Hub for the official model checkpoints for tasks like text-to-image, image-to-image, and inpainting.
+
+</Tip>
+
+<Tip>
+
+Make sure to check out the schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
+
+</Tip>
 
 ## Kandinsky3Pipeline
 
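Since the diff only wires up the autodoc entries for the new pipeline, a minimal text-to-image sketch may help; the `kandinsky-community/kandinsky-3` checkpoint id is an assumption based on the Kandinsky Community organization mentioned above:

```python
import torch
from diffusers import Kandinsky3Pipeline

# Checkpoint id is an assumption; see the Kandinsky Community org linked above.
pipe = Kandinsky3Pipeline.from_pretrained(
    "kandinsky-community/kandinsky-3", torch_dtype=torch.float16
)
pipe.enable_model_cpu_offload()  # the FLAN-UL2 text encoder is large

prompt = "A photograph of the inside of a subway train, with raccoons as passengers"
image = pipe(prompt, num_inference_steps=25).images[0]
```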

docs/source/en/api/pipelines/overview.md

Lines changed: 1 addition & 0 deletions

@@ -51,6 +51,7 @@ The table below lists all the pipelines currently available in 🤗 Diffusers an
 | [InstructPix2Pix](pix2pix) | image editing |
 | [Kandinsky 2.1](kandinsky) | text2image, image2image, inpainting, interpolation |
 | [Kandinsky 2.2](kandinsky_v22) | text2image, image2image, inpainting |
+| [Kandinsky 3](kandinsky3) | text2image, image2image |
 | [Latent Consistency Models](latent_consistency_models) | text2image |
 | [Latent Diffusion](latent_diffusion) | text2image, super-resolution |
 | [LDM3D](stable_diffusion/ldm3d_diffusion) | text2image, text-to-3D, text-to-pano, upscaling |
docs/source/en/api/pipelines/stable_diffusion/sdxl_turbo.md

Lines changed: 53 additions & 0 deletions

@@ -0,0 +1,53 @@
+<!--Copyright 2023 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+-->
+
+# SDXL Turbo
+
+Stable Diffusion XL (SDXL) Turbo was proposed in [Adversarial Diffusion Distillation](https://stability.ai/research/adversarial-diffusion-distillation) by Axel Sauer, Dominik Lorenz, Andreas Blattmann, and Robin Rombach.
+
+The abstract from the paper is:
+
+*We introduce Adversarial Diffusion Distillation (ADD), a novel training approach that efficiently samples large-scale foundational image diffusion models in just 1–4 steps while maintaining high image quality. We use score distillation to leverage large-scale off-the-shelf image diffusion models as a teacher signal in combination with an adversarial loss to ensure high image fidelity even in the low-step regime of one or two sampling steps. Our analyses show that our model clearly outperforms existing few-step methods (GANs, Latent Consistency Models) in a single step and reaches the performance of state-of-the-art diffusion models (SDXL) in only four steps. ADD is the first method to unlock single-step, real-time image synthesis with foundation models.*
+
+## Tips
+
+- SDXL Turbo uses the exact same architecture as [SDXL](./stable_diffusion_xl).
+- SDXL Turbo should disable guidance by setting `guidance_scale=0.0`.
+- SDXL Turbo should use `timestep_spacing='trailing'` for the scheduler and between 1 and 4 inference steps.
+- SDXL Turbo has been trained to generate images of size 512x512.
+- SDXL Turbo is open-access but not open-source, meaning that one might have to buy a model license in order to use it for commercial applications. Make sure to read the [official model card](https://huggingface.co/stabilityai/sdxl-turbo) to learn more.
+
+<Tip>
+
+To learn how to use SDXL Turbo for various tasks, how to optimize performance, and other usage examples, take a look at the [SDXL Turbo](../../../using-diffusers/sdxl_turbo) guide.
+
+Check out the [Stability AI](https://huggingface.co/stabilityai) Hub organization for the official model checkpoints!
+
+</Tip>
+
+## StableDiffusionXLPipeline
+
+[[autodoc]] StableDiffusionXLPipeline
+	- all
+	- __call__
+
+## StableDiffusionXLImg2ImgPipeline
+
+[[autodoc]] StableDiffusionXLImg2ImgPipeline
+	- all
+	- __call__
+
+## StableDiffusionXLInpaintPipeline
+
+[[autodoc]] StableDiffusionXLInpaintPipeline
+	- all
+	- __call__
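
Pulling those tips together, a minimal text-to-image sketch might look like the following; the `stabilityai/sdxl-turbo` checkpoint id comes from the model card linked above, and the scheduler override simply makes the `timestep_spacing='trailing'` recommendation explicit:

```python
import torch
from diffusers import EulerAncestralDiscreteScheduler, StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/sdxl-turbo", torch_dtype=torch.float16
).to("cuda")

# Per the tips: trailing timestep spacing, guidance disabled, very few steps, 512x512.
pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(
    pipe.scheduler.config, timestep_spacing="trailing"
)

image = pipe(
    "A cinematic photo of a raccoon wearing an intricate robe",
    guidance_scale=0.0,
    num_inference_steps=1,
    height=512,
    width=512,
).images[0]
```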

docs/source/en/api/pipelines/text_to_video_zero.md

Lines changed: 45 additions & 1 deletion

@@ -92,6 +92,19 @@ imageio.mimsave("video.mp4", result, fps=4)
 ```
 
 
+#### SDXL Support
+In order to use the SDXL model when generating a video from a prompt, use the `TextToVideoZeroSDXLPipeline` pipeline:
+
+```python
+import torch
+from diffusers import TextToVideoZeroSDXLPipeline
+
+model_id = "stabilityai/stable-diffusion-xl-base-1.0"
+pipe = TextToVideoZeroSDXLPipeline.from_pretrained(
+    model_id, torch_dtype=torch.float16, variant="fp16", use_safetensors=True
+).to("cuda")
+```
+
 ### Text-To-Video with Pose Control
 To generate a video from prompt with additional pose control

@@ -141,7 +154,33 @@ To generate a video from prompt with additional pose control
 result = pipe(prompt=[prompt] * len(pose_images), image=pose_images, latents=latents).images
 imageio.mimsave("video.mp4", result, fps=4)
 ```
-
+#### SDXL Support
+
+Since our attention processor also works with SDXL, it can be utilized to generate a video from a prompt using ControlNet models powered by SDXL:
+```python
+import torch
+from diffusers import StableDiffusionXLControlNetPipeline, ControlNetModel
+from diffusers.pipelines.text_to_video_synthesis.pipeline_text_to_video_zero import CrossFrameAttnProcessor
+
+controlnet_model_id = 'thibaud/controlnet-openpose-sdxl-1.0'
+model_id = 'stabilityai/stable-diffusion-xl-base-1.0'
+
+controlnet = ControlNetModel.from_pretrained(controlnet_model_id, torch_dtype=torch.float16)
+pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
+    model_id, controlnet=controlnet, torch_dtype=torch.float16
+).to('cuda')
+
+# Set the attention processor so attention is shared across frames
+pipe.unet.set_attn_processor(CrossFrameAttnProcessor(batch_size=2))
+pipe.controlnet.set_attn_processor(CrossFrameAttnProcessor(batch_size=2))
+
+# Fix latents for all frames; `pose_images` is prepared as in the pose-control example above
+latents = torch.randn((1, 4, 128, 128), device="cuda", dtype=torch.float16).repeat(len(pose_images), 1, 1, 1)
+
+prompt = "Darth Vader dancing in a desert"
+result = pipe(prompt=[prompt] * len(pose_images), image=pose_images, latents=latents).images
+imageio.mimsave("video.mp4", result, fps=4)
+```
 
 ### Text-To-Video with Edge Control

@@ -253,5 +292,10 @@ Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers)
 	- all
 	- __call__
 
+## TextToVideoZeroSDXLPipeline
+[[autodoc]] TextToVideoZeroSDXLPipeline
+	- all
+	- __call__
+
 ## TextToVideoPipelineOutput
 [[autodoc]] pipelines.text_to_video_synthesis.pipeline_text_to_video_zero.TextToVideoPipelineOutput
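
The first SDXL snippet above builds the pipeline but stops before generating. Assuming the SDXL variant's `__call__` mirrors the SD 1.5 `TextToVideoZeroPipeline` usage shown at the top of this file (a sketch, not the committed docs), a complete run could look like:

```python
import imageio
import torch
from diffusers import TextToVideoZeroSDXLPipeline

pipe = TextToVideoZeroSDXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
    use_safetensors=True,
).to("cuda")

prompt = "A panda is playing guitar on times square"
result = pipe(prompt=prompt).images  # assumption: frames returned as arrays in [0, 1]
result = [(r * 255).astype("uint8") for r in result]
imageio.mimsave("video.mp4", result, fps=4)
```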
