resume_from_checkpoint training fail in train_dreambooth_lora_sdxl.py #5412

Closed
@yuxu915

Describe the bug

I trained DreamBooth with LoRA on SD-XL for 1000 steps and then tried to resume training from a saved checkpoint. However, training seems to start without the state from the 1000-step checkpoint, i.e. it starts from the beginning. The training script is below.
By the way, I already fixed another resume bug as advised in #5004.

Could you please help with this?

Reproduction

export MODEL_NAME="/data/model/stable-diffusion-xl-base-1.0/stable-diffusion-xl-base-1.0"
export INSTANCE_DIR="/data/datasets/image_instance/chair_crop"
export OUTPUT_DIR="/data/lora-xl/diffusers/examples/dreambooth/lora-trained-xl_1e-6_step1500_chair_crop_per25"
export VAE_PATH="/data/model/stable-diffusion-xl-base-1.0/sdxl-vae-fp16-fix"

accelerate launch train_dreambooth_lora_sdxl.py \
  --pretrained_model_name_or_path=$MODEL_NAME  \
  --instance_data_dir=$INSTANCE_DIR \
  --pretrained_vae_model_name_or_path=$VAE_PATH \
  --output_dir=$OUTPUT_DIR \
  --mixed_precision="fp16" \
  --instance_prompt="a photo of sks chair" \
  --resolution=1024 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --learning_rate=1e-5 \
  --report_to="wandb" \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=2700 \
  --validation_prompt="A photo of sks chair on grass" \
  --validation_epochs=25 \
  --seed="0" \
  --checkpointing_steps=500 \
  --resume_from_checkpoint="/data/xuyu/lora-xl/diffusers/examples/dreambooth/lora-trained-xl_1e-6_step1500_chair_crop_per25/checkpoint-1000"
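
Before resuming, it may be worth confirming which checkpoint directories actually exist under the output directory. In the command above, `OUTPUT_DIR` is under `/data/lora-xl/...` while `--resume_from_checkpoint` points under `/data/xuyu/lora-xl/...`; whether that difference matters depends on how the script resolves the path, which this report does not confirm. The following is a minimal sketch only: `latest_checkpoint` is a hypothetical helper, not part of diffusers or accelerate, and it assumes checkpoints are saved as `checkpoint-<step>` directories, as the path above suggests.

```shell
# Hypothetical helper (not part of diffusers): print the name of the
# highest-numbered checkpoint-<step> directory inside the given directory.
# The body runs in a subshell so its variables do not leak into the caller.
latest_checkpoint() (
  dir="$1"
  best=""
  best_step=-1
  for path in "$dir"/checkpoint-*; do
    # Skip non-directories (also covers the unmatched-glob case).
    [ -d "$path" ] || continue
    name="${path##*/}"
    step="${name#checkpoint-}"
    # Skip names whose suffix is not a plain non-negative integer.
    case "$step" in
      ''|*[!0-9]*) continue ;;
    esac
    if [ "$step" -gt "$best_step" ]; then
      best_step="$step"
      best="$name"
    fi
  done
  # Print nothing when no checkpoint directory was found.
  if [ -n "$best" ]; then
    printf '%s\n' "$best"
  fi
)
```

Calling `latest_checkpoint "$OUTPUT_DIR"` prints the newest checkpoint directory name, or nothing if none exists, which can then be compared with the directory passed to `--resume_from_checkpoint`.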

Logs

No response

System Info

  • diffusers version: 0.22.0.dev0
  • Platform: Linux-3.10.0-1160.95.1.el7.x86_64-x86_64-with-glibc2.17
  • Python version: 3.10.13
  • PyTorch version (GPU?): 2.1.0+cu121 (True)
  • Huggingface_hub version: 0.17.3
  • Transformers version: 4.34.0
  • Accelerate version: 0.23.0
  • xFormers version: not installed
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: parallel

Who can help?

@patrickvonplaten @sayakpaul @yiyixuxu @dn

Labels

bug, stale