Cleaning release code #1

Merged: 11 commits, Jun 22, 2023
7 changes: 7 additions & 0 deletions .gitignore
@@ -0,0 +1,7 @@
+.pt2
+.pt2_2
+.pt13
+*.egg-info
+build
+/outputs
+/checkpoints
27 changes: 16 additions & 11 deletions README.md
@@ -1,14 +1,19 @@
-# Generative Models at Stability AI
+# Generative Models by Stability AI

![sample1](assets/000.jpg)

## News

**June 22, 2023**

-- We are releasing two new diffusion models for research: SD-XL 0.9-base and SD-XL 0.9-refiner. The refiner has been trained to denoise small noise levels of high quality data and as such is not expected to work as a text-to-image model; instead, it should only be used as an image-to-image model. The base model was trained on a variety of aspect ratios on images with resolution 1024^2. The base model uses [OpenCLIP-ViT/G](https://github.com/mlfoundations/open_clip) and [CLIP-ViT/L](https://github.com/openai/CLIP/tree/main) for text encoding whereas the refiner model only uses the OpenCLIP model. **We plan to do a full release soon (July).**
-
-## The code base
+- We are releasing two new diffusion models:
+  - `SD-XL 0.9-base`: The base model was trained on a variety of aspect ratios on images with resolution 1024^2. The base model uses [OpenCLIP-ViT/G](https://github.com/mlfoundations/open_clip) and [CLIP-ViT/L](https://github.com/openai/CLIP/tree/main) for text encoding whereas the refiner model only uses the OpenCLIP model.
+  - `SD-XL 0.9-refiner`: The refiner has been trained to denoise small noise levels of high quality data and as such is not expected to work as a text-to-image model; instead, it should only be used as an image-to-image model.
+
+**We plan to do a full release soon (July).**
+
+## The codebase

### General Philosophy

@@ -18,12 +23,12 @@ Modularity is king. This repo implements a config-driven approach where we build

For training, we use [pytorch-lightning](https://www.pytorchlightning.ai/index.html), but it should be easy to use other training wrappers around the base modules. The core diffusion model class (formerly `LatentDiffusion`, now `DiffusionEngine`) has been cleaned up:

-- No more extensive subclassing! We now handle all types of conditioning inputs (vectors, sequences and spatial conditionings, and all combinations thereof) in a single class: `GeneralConditioner`, see `ldm/modules/encoders/modules.py`.
-- We separate guiders (such as classifier-free guidance, see `ldm/modules/diffusionmodules/guiders.py`) from the
-  samplers (`ldm/modules/diffusionmodules/sampling.py`), and the samplers are independent of the model.
+- No more extensive subclassing! We now handle all types of conditioning inputs (vectors, sequences and spatial conditionings, and all combinations thereof) in a single class: `GeneralConditioner`, see `sgm/modules/encoders/modules.py`.
+- We separate guiders (such as classifier-free guidance, see `sgm/modules/diffusionmodules/guiders.py`) from the
+  samplers (`sgm/modules/diffusionmodules/sampling.py`), and the samplers are independent of the model.
- We adopt the ["denoiser framework"](https://arxiv.org/abs/2206.00364) for both training and inference (most notable change is probably now the option to train continuous time models):
-  * Discrete times models (denoisers) are simply a special case of continuous time models (denoisers); see `ldm/modules/diffusionmodules/denoiser.py`.
-  * The following features are now independent: weighting of the diffusion loss function (`ldm/modules/diffusionmodules/denoiser_weighting.py`), preconditioning of the network (`ldm/modules/diffusionmodules/denoiser_scaling.py`), and sampling of noise levels during training (`ldm/modules/diffusionmodules/sigma_sampling.py`).
+  * Discrete times models (denoisers) are simply a special case of continuous time models (denoisers); see `sgm/modules/diffusionmodules/denoiser.py`.
+  * The following features are now independent: weighting of the diffusion loss function (`sgm/modules/diffusionmodules/denoiser_weighting.py`), preconditioning of the network (`sgm/modules/diffusionmodules/denoiser_scaling.py`), and sampling of noise levels during training (`sgm/modules/diffusionmodules/sigma_sampling.py`).
- Autoencoding models have also been cleaned up.
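
All of these components are declared in the configs below via the same `target`/`params` convention. A minimal sketch of how such a block is presumably turned into a live object; the helper names mirror the `instantiate_from_config` utility from the predecessor `ldm` codebase and are an assumption here, not code from this diff:

```python
import importlib
from typing import Any, Dict


def get_obj_from_str(path: str) -> Any:
    """Resolve a dotted path like 'sgm.models.diffusion.DiffusionEngine' to a class."""
    module, cls = path.rsplit(".", 1)
    return getattr(importlib.import_module(module), cls)


def instantiate_from_config(config: Dict[str, Any]) -> Any:
    """Build an object from a {'target': ..., 'params': {...}} mapping."""
    return get_obj_from_str(config["target"])(**config.get("params", {}))
```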

## Installation:
Expand All @@ -36,7 +41,7 @@ git clone git@github.com:Stability-AI/generative-models.git
cd generative-models
```

-#### 2. setting up the virtualenv
+#### 2. Setting up the virtualenv

This is assuming you have navigated to the `generative-models` root after cloning it.

@@ -120,7 +125,7 @@ python scripts/demo/detect.py <your folder name here>/*
We are providing example training configs in `configs/example_training`. To launch a training, run

```
python main.py --base configs/<config1.yaml> configs/<config2.yaml>
```

where configs are merged from left to right (later configs overwrite the same values).
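
The merging code itself is not shown in this diff; a minimal sketch of left-to-right merging, assuming OmegaConf (as used by the predecessor `ldm` codebase) and hypothetical config names:

```python
from omegaconf import OmegaConf

# Later configs win: values from config2.yaml overwrite those from config1.yaml.
paths = ["configs/config1.yaml", "configs/config2.yaml"]  # hypothetical files
config = OmegaConf.merge(*[OmegaConf.load(p) for p in paths])
```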
@@ -178,5 +183,5 @@ e.g.,
example = {"jpg": x, # this is a tensor -1...1 chw
"txt": "a beautiful image"}
```

where we expect images in -1...1, channel-first format.
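
A minimal sketch of a dataset that emits this dict; the class and its arguments are illustrative, not part of this repo. Only the keys (`jpg`, `txt`), the [-1, 1] value range, and the channel-first layout come from the README above:

```python
import numpy as np
import torch
from PIL import Image
from torch.utils.data import Dataset


class ToyTextImageDataset(Dataset):
    """Illustrative dataset yielding {'jpg': chw float tensor in [-1, 1], 'txt': str}."""

    def __init__(self, image_paths, captions):
        assert len(image_paths) == len(captions)
        self.image_paths = image_paths
        self.captions = captions

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, i):
        img = np.array(Image.open(self.image_paths[i]).convert("RGB"))
        x = torch.from_numpy(img).float() / 127.5 - 1.0  # uint8 [0, 255] -> [-1, 1]
        x = x.permute(2, 0, 1)                           # hwc -> chw
        return {"jpg": x, "txt": self.captions[i]}
```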
@@ -1,12 +1,12 @@
model:
  base_learning_rate: 4.5e-6
-  target: ldm.models.autoencoder.AutoencodingEngine
+  target: sgm.models.autoencoder.AutoencodingEngine
  params:
    input_key: jpg
    monitor: val/rec_loss

    loss_config:
-      target: ldm.modules.autoencoding.losses.GeneralLPIPSWithDiscriminator
+      target: sgm.modules.autoencoding.losses.GeneralLPIPSWithDiscriminator
      params:
        perceptual_weight: 0.25
        disc_start: 20001
@@ -17,10 +17,10 @@ model:
        kl_loss: 1.0

    regularizer_config:
-      target: ldm.modules.autoencoding.regularizers.DiagonalGaussianRegularizer
+      target: sgm.modules.autoencoding.regularizers.DiagonalGaussianRegularizer

    encoder_config:
-      target: ldm.modules.diffusionmodules.model.Encoder
+      target: sgm.modules.diffusionmodules.model.Encoder
      params:
        attn_type: none
        double_z: True
@@ -35,7 +35,7 @@ model:
        dropout: 0.0

    decoder_config:
-      target: ldm.modules.diffusionmodules.model.Decoder
+      target: sgm.modules.diffusionmodules.model.Decoder
      params:
        attn_type: none
        double_z: False
@@ -50,7 +50,7 @@ model:
        dropout: 0.0

data:
-  target: ldm.data.dataset.StableDataModuleFromConfig
+  target: sgm.data.dataset.StableDataModuleFromConfig
  params:
    train:
      datapipeline:
38 changes: 19 additions & 19 deletions configs/example_training/imagenet-f8_cond.yaml
@@ -1,14 +1,14 @@
model:
  base_learning_rate: 1.0e-4
-  target: ldm.models.diffusion.DiffusionEngine
+  target: sgm.models.diffusion.DiffusionEngine
  params:
    scale_factor: 0.13025
    disable_first_stage_autocast: True
    log_keys:
      - cls

    scheduler_config:
-      target: ldm.lr_scheduler.LambdaLinearScheduler
+      target: sgm.lr_scheduler.LambdaLinearScheduler
      params:
        warm_up_steps: [10000]
        cycle_lengths: [10000000000000]
@@ -17,19 +17,19 @@ model:
        f_min: [1.]

    denoiser_config:
-      target: ldm.modules.diffusionmodules.denoiser.DiscreteDenoiser
+      target: sgm.modules.diffusionmodules.denoiser.DiscreteDenoiser
      params:
        num_idx: 1000

        weighting_config:
-          target: ldm.modules.diffusionmodules.denoiser_weighting.EpsWeighting
+          target: sgm.modules.diffusionmodules.denoiser_weighting.EpsWeighting
        scaling_config:
-          target: ldm.modules.diffusionmodules.denoiser_scaling.EpsScaling
+          target: sgm.modules.diffusionmodules.denoiser_scaling.EpsScaling
        discretization_config:
-          target: ldm.modules.diffusionmodules.discretizer.LegacyDDPMDiscretization
+          target: sgm.modules.diffusionmodules.discretizer.LegacyDDPMDiscretization

    network_config:
-      target: ldm.modules.diffusionmodules.openaimodel.UNetModel
+      target: sgm.modules.diffusionmodules.openaimodel.UNetModel
      params:
        use_checkpoint: True
        use_fp16: True
@@ -48,14 +48,14 @@ model:
        spatial_transformer_attn_type: softmax-xformers

    conditioner_config:
-      target: ldm.modules.GeneralConditioner
+      target: sgm.modules.GeneralConditioner
      params:
        emb_models:
          # crossattn cond
          - is_trainable: True
            input_key: cls
            ucg_rate: 0.2
-            target: ldm.modules.encoders.modules.ClassEmbedder
+            target: sgm.modules.encoders.modules.ClassEmbedder
            params:
              add_sequence_dim: True # will be used through crossattn then
              embed_dim: 1024
@@ -64,19 +64,19 @@ model:
          - is_trainable: False
            ucg_rate: 0.2
            input_key: original_size_as_tuple
-            target: ldm.modules.encoders.modules.ConcatTimestepEmbedderND
+            target: sgm.modules.encoders.modules.ConcatTimestepEmbedderND
            params:
              outdim: 256 # multiplied by two
          # vector cond
          - is_trainable: False
            input_key: crop_coords_top_left
            ucg_rate: 0.2
-            target: ldm.modules.encoders.modules.ConcatTimestepEmbedderND
+            target: sgm.modules.encoders.modules.ConcatTimestepEmbedderND
            params:
              outdim: 256 # multiplied by two

    first_stage_config:
-      target: ldm.models.autoencoder.AutoencoderKLInferenceWrapper
+      target: sgm.models.autoencoder.AutoencoderKLInferenceWrapper
      params:
        ckpt_path: CKPT_PATH
        embed_dim: 4
@@ -97,31 +97,31 @@ model:
          target: torch.nn.Identity

    loss_fn_config:
-      target: ldm.modules.diffusionmodules.loss.StandardDiffusionLoss
+      target: sgm.modules.diffusionmodules.loss.StandardDiffusionLoss
      params:
        sigma_sampler_config:
-          target: ldm.modules.diffusionmodules.sigma_sampling.DiscreteSampling
+          target: sgm.modules.diffusionmodules.sigma_sampling.DiscreteSampling
          params:
            num_idx: 1000

            discretization_config:
-              target: ldm.modules.diffusionmodules.discretizer.LegacyDDPMDiscretization
+              target: sgm.modules.diffusionmodules.discretizer.LegacyDDPMDiscretization

    sampler_config:
-      target: ldm.modules.diffusionmodules.sampling.EulerEDMSampler
+      target: sgm.modules.diffusionmodules.sampling.EulerEDMSampler
      params:
        num_steps: 50

        discretization_config:
-          target: ldm.modules.diffusionmodules.discretizer.LegacyDDPMDiscretization
+          target: sgm.modules.diffusionmodules.discretizer.LegacyDDPMDiscretization

        guider_config:
-          target: ldm.modules.diffusionmodules.guiders.VanillaCFG
+          target: sgm.modules.diffusionmodules.guiders.VanillaCFG
          params:
            scale: 5.0

data:
-  target: ldm.data.dataset.StableDataModuleFromConfig
+  target: sgm.data.dataset.StableDataModuleFromConfig
  params:
    train:
      datapipeline:
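A note on `VanillaCFG` above: the guider is what combines the conditional and unconditional model outputs at sampling time (this is the guider/sampler separation described in the README). A sketch of the standard classifier-free-guidance combination; that `VanillaCFG` computes exactly this is an assumption:

```python
import torch


def cfg_combine(x_cond: torch.Tensor, x_uncond: torch.Tensor, scale: float) -> torch.Tensor:
    """Standard classifier-free guidance: move the unconditional prediction
    toward the conditional one, amplified by `scale` (5.0 in the config above)."""
    return x_uncond + scale * (x_cond - x_uncond)
```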
28 changes: 14 additions & 14 deletions configs/example_training/toy/cifar10_cond.yaml
@@ -1,21 +1,21 @@
model:
  base_learning_rate: 1.0e-4
-  target: ldm.models.diffusion.DiffusionEngine
+  target: sgm.models.diffusion.DiffusionEngine
  params:
    denoiser_config:
-      target: ldm.modules.diffusionmodules.denoiser.Denoiser
+      target: sgm.modules.diffusionmodules.denoiser.Denoiser
      params:
        weighting_config:
-          target: ldm.modules.diffusionmodules.denoiser_weighting.EDMWeighting
+          target: sgm.modules.diffusionmodules.denoiser_weighting.EDMWeighting
          params:
            sigma_data: 1.0
        scaling_config:
-          target: ldm.modules.diffusionmodules.denoiser_scaling.EDMScaling
+          target: sgm.modules.diffusionmodules.denoiser_scaling.EDMScaling
          params:
            sigma_data: 1.0

    network_config:
-      target: ldm.modules.diffusionmodules.openaimodel.UNetModel
+      target: sgm.modules.diffusionmodules.openaimodel.UNetModel
      params:
        use_checkpoint: True
        in_channels: 3
@@ -29,41 +29,41 @@ model:
        adm_in_channels: 128

    conditioner_config:
-      target: ldm.modules.GeneralConditioner
+      target: sgm.modules.GeneralConditioner
      params:
        emb_models:
          - is_trainable: True
            input_key: cls
            ucg_rate: 0.2
-            target: ldm.modules.encoders.modules.ClassEmbedder
+            target: sgm.modules.encoders.modules.ClassEmbedder
            params:
              embed_dim: 128
              n_classes: 10

    first_stage_config:
-      target: ldm.models.autoencoder.IdentityFirstStage
+      target: sgm.models.autoencoder.IdentityFirstStage

    loss_fn_config:
-      target: ldm.modules.diffusionmodules.loss.StandardDiffusionLoss
+      target: sgm.modules.diffusionmodules.loss.StandardDiffusionLoss
      params:
        sigma_sampler_config:
-          target: ldm.modules.diffusionmodules.sigma_sampling.EDMSampling
+          target: sgm.modules.diffusionmodules.sigma_sampling.EDMSampling

    sampler_config:
-      target: ldm.modules.diffusionmodules.sampling.EulerEDMSampler
+      target: sgm.modules.diffusionmodules.sampling.EulerEDMSampler
      params:
        num_steps: 50

        discretization_config:
-          target: ldm.modules.diffusionmodules.discretizer.EDMDiscretization
+          target: sgm.modules.diffusionmodules.discretizer.EDMDiscretization

        guider_config:
-          target: ldm.modules.diffusionmodules.guiders.VanillaCFG
+          target: sgm.modules.diffusionmodules.guiders.VanillaCFG
          params:
            scale: 3.0

data:
-  target: ldm.data.cifar10.CIFAR10Loader
+  target: sgm.data.cifar10.CIFAR10Loader
  params:
    batch_size: 512
    num_workers: 1
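The `EDMWeighting`/`EDMScaling`/`EDMSampling` targets in these toy configs follow the "denoiser framework" paper cited in the README (Karras et al. 2022, arXiv:2206.00364). A sketch of that paper's loss weighting and log-normal noise-level sampling; that the repo's classes use exactly these formulas and defaults is an assumption:

```python
import torch


def edm_weighting(sigma: torch.Tensor, sigma_data: float = 1.0) -> torch.Tensor:
    """EDM loss weight: lambda(sigma) = (sigma^2 + sigma_data^2) / (sigma * sigma_data)^2."""
    return (sigma**2 + sigma_data**2) / (sigma * sigma_data) ** 2


def edm_sigma_sampling(n: int, p_mean: float = -1.2, p_std: float = 1.2) -> torch.Tensor:
    """Draw training noise levels log-normally: ln(sigma) ~ N(p_mean, p_std^2)."""
    return torch.exp(p_mean + p_std * torch.randn(n))
```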
22 changes: 11 additions & 11 deletions configs/example_training/toy/mnist.yaml
@@ -1,21 +1,21 @@
model:
  base_learning_rate: 1.0e-4
-  target: ldm.models.diffusion.DiffusionEngine
+  target: sgm.models.diffusion.DiffusionEngine
  params:
    denoiser_config:
-      target: ldm.modules.diffusionmodules.denoiser.Denoiser
+      target: sgm.modules.diffusionmodules.denoiser.Denoiser
      params:
        weighting_config:
-          target: ldm.modules.diffusionmodules.denoiser_weighting.EDMWeighting
+          target: sgm.modules.diffusionmodules.denoiser_weighting.EDMWeighting
          params:
            sigma_data: 1.0
        scaling_config:
-          target: ldm.modules.diffusionmodules.denoiser_scaling.EDMScaling
+          target: sgm.modules.diffusionmodules.denoiser_scaling.EDMScaling
          params:
            sigma_data: 1.0

    network_config:
-      target: ldm.modules.diffusionmodules.openaimodel.UNetModel
+      target: sgm.modules.diffusionmodules.openaimodel.UNetModel
      params:
        use_checkpoint: True
        in_channels: 1
@@ -27,24 +27,24 @@ model:
        num_head_channels: 32

    first_stage_config:
-      target: ldm.models.autoencoder.IdentityFirstStage
+      target: sgm.models.autoencoder.IdentityFirstStage

    loss_fn_config:
-      target: ldm.modules.diffusionmodules.loss.StandardDiffusionLoss
+      target: sgm.modules.diffusionmodules.loss.StandardDiffusionLoss
      params:
        sigma_sampler_config:
-          target: ldm.modules.diffusionmodules.sigma_sampling.EDMSampling
+          target: sgm.modules.diffusionmodules.sigma_sampling.EDMSampling

    sampler_config:
-      target: ldm.modules.diffusionmodules.sampling.EulerEDMSampler
+      target: sgm.modules.diffusionmodules.sampling.EulerEDMSampler
      params:
        num_steps: 50

        discretization_config:
-          target: ldm.modules.diffusionmodules.discretizer.EDMDiscretization
+          target: sgm.modules.diffusionmodules.discretizer.EDMDiscretization

data:
-  target: ldm.data.mnist.MNISTLoader
+  target: sgm.data.mnist.MNISTLoader
  params:
    batch_size: 512
    num_workers: 1