Fix EMA and make it compatible with deepspeed. #813
Conversation
```python
if device is not None:
    self.averaged_model = self.averaged_model.to(device=device)
parameters = list(parameters)
self.shadow_params = [p.clone().detach() for p in parameters]
```
Never heard them being called shadow, but pretty creative :D
Pretty much copied it from here https://github.com/fadel/pytorch_ema/blob/master/torch_ema/ema.py#L14
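For context, the core of that shadow-parameter trick is just an in-place decayed update of stored copies of the parameters; roughly (a paraphrased sketch in the spirit of the linked pytorch_ema code, not the exact diffusers implementation):

```python
import torch

@torch.no_grad()
def ema_step(shadow_params, parameters, decay=0.9999):
    # Nudge each stored "shadow" copy towards the corresponding live parameter.
    for s_param, param in zip(shadow_params, parameters):
        s_param.sub_((1.0 - decay) * (s_param - param))
```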
The logic looks good to me!
@patil-suraj make sure it works with fp16 (if it's an option) for the whole training run. Subtracting the decayed parameters might become numerically unstable when the decay gets close to 0.9999.
Will do a run in fp16, but our fp16 training is mixed precision, so the params are always kept in fp32.
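For what it's worth, here is a quick toy check (not from this thread) of why a decay of 0.9999 is risky in pure fp16:

```python
import torch

# 0.9999 is not even representable in fp16: it rounds up to 1.0, so a naive
# `decay * shadow + (1 - decay) * param` done entirely in half precision
# would never move the average at all.
print(torch.tensor(0.9999, dtype=torch.float16))  # tensor(1., dtype=torch.float16)

# The subtraction form has the same problem once the per-step correction is
# far below the fp16 spacing around the shadow value.
shadow = torch.tensor(1.0, dtype=torch.float16)
param = torch.tensor(1.001, dtype=torch.float16)
step = (1 - 0.9999) * (shadow - param)  # ~ -1e-7
print(shadow - step)  # still tensor(1., dtype=torch.float16)
```

Keeping the shadow params in fp32, as mixed precision does, sidesteps both problems.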
Maybe this would be helpful. I've got DeepSpeed working on my 12 GB 3060 by changing this line (like in https://github.com/huggingface/diffusers/pull/735/files#diff-8702f762e46a3b5363085930b0b045de554909d32560864031ca7b12ddd349d5R555):

```diff
diff --git a/examples/text_to_image/train_text_to_image.py b/examples/text_to_image/train_text_to_image.py
index e4a91ff..4481951 100644
--- a/examples/text_to_image/train_text_to_image.py
+++ b/examples/text_to_image/train_text_to_image.py
@@ -566,7 +566,7 @@ def main():
                 # Predict the noise residual and compute loss
                 noise_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample
-                loss = F.mse_loss(noise_pred, noise, reduction="mean")
+                loss = F.mse_loss(noise_pred.float(), noise.float(), reduction="mean")
                 # Gather the losses across all processes for logging (if we use distributed training).
                 avg_loss = accelerator.gather(loss.repeat(args.train_batch_size)).mean()
```

By "working" I mean that it doesn't throw the exception "Found dtype Float but expected Half" at this line:
I should say that I'm still training the model on the pokemon dataset, so I don't know what the actual result will be yet. The command I've used for training is almost identical to the one in the README; I've only added the DeepSpeed options to `accelerate launch`:

```bash
accelerate launch --use_deepspeed --zero_stage=2 --gradient_accumulation_steps=1 --offload_param_device=cpu --offload_optimizer_device=cpu train_text_to_image.py \
  --pretrained_model_name_or_path="CompVis/stable-diffusion-v1-4" \
  --dataset_name="lambdalabs/pokemon-blip-captions" \
  --use_ema \
  --resolution=512 --center_crop --random_flip \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --gradient_checkpointing \
  --mixed_precision="fp16" \
  --max_train_steps=15000 \
  --learning_rate=1e-05 \
  --max_grad_norm=1 \
  --lr_scheduler="constant" --lr_warmup_steps=0 \
  --output_dir="sd-pokemon-model"
```
Thanks a lot @pink-red, would you like to open a PR for that once this is merged? Indeed, casting it to `float` makes sense.
@patil-suraj No problem! 👌
Deferring my review here to @anton-l as he knows EMA much better :-)
Also @patil-suraj, let's maybe fix the code quality with `make style` and `make quality`.
Fix EMA and make it compatible with deepspeed. (huggingface#813)

* fix ema
* style
* add comment about copy
* style
* quality
There's an issue with the current EMA in multi-GPU and DeepSpeed training. This PR updates the `EMAModel` to only keep the parameters instead of copying the whole model, which doesn't seem to work with `deepspeed`.
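For illustration, a minimal sketch of the parameter-only EMA idea described here (names and details are illustrative, not the exact diffusers API):

```python
import torch


class ParameterEMA:
    """Keeps an exponential moving average of a list of parameters.

    Stores detached copies ("shadow" params) instead of a full model copy,
    which avoids re-wrapping the model and plays nicer with DeepSpeed/DDP.
    """

    def __init__(self, parameters, decay=0.9999):
        parameters = list(parameters)
        self.decay = decay
        self.shadow_params = [p.clone().detach() for p in parameters]

    @torch.no_grad()
    def step(self, parameters):
        # Update the shadow copies after each optimizer step.
        for s_param, param in zip(self.shadow_params, parameters):
            if param.requires_grad:
                s_param.sub_((1.0 - self.decay) * (s_param - param))
            else:
                s_param.copy_(param)

    def copy_to(self, parameters):
        # Copy the averaged values into the live parameters, e.g. for eval or saving.
        for s_param, param in zip(self.shadow_params, parameters):
            param.data.copy_(s_param.data)
```

Typical usage would be to call `step(unet.parameters())` after each optimizer step and `copy_to(unet.parameters())` before evaluation or saving.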