Closed
Description
Describe the bug
I trained DreamBooth with LoRA on SD-XL for 1000 steps, then tried to continue training by resuming from the 500th step. However, training seems to start without loading the checkpoint-1000 state, i.e. it starts from the beginning. The training script is below.
By the way, I have already fixed another resume bug as advised in #5004.
Could you please help with this?
Reproduction
export MODEL_NAME="/data/model/stable-diffusion-xl-base-1.0/stable-diffusion-xl-base-1.0"
export INSTANCE_DIR="/data/datasets/image_instance/chair_crop"
export OUTPUT_DIR="/data/lora-xl/diffusers/examples/dreambooth/lora-trained-xl_1e-6_step1500_chair_crop_per25"
export VAE_PATH="/data/model/stable-diffusion-xl-base-1.0/sdxl-vae-fp16-fix"
accelerate launch train_dreambooth_lora_sdxl.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--instance_data_dir=$INSTANCE_DIR \
--pretrained_vae_model_name_or_path=$VAE_PATH \
--output_dir=$OUTPUT_DIR \
--mixed_precision="fp16" \
--instance_prompt="a photo of sks chair" \
--resolution=1024 \
--train_batch_size=1 \
--gradient_accumulation_steps=4 \
--learning_rate=1e-5 \
--report_to="wandb" \
--lr_scheduler="constant" \
--lr_warmup_steps=0 \
--max_train_steps=2700 \
--validation_prompt="A photo of sks chair on grass" \
--validation_epochs=25 \
--seed="0" \
--checkpointing_steps=500 \
--resume_from_checkpoint="/data/xuyu/lora-xl/diffusers/examples/dreambooth/lora-trained-xl_1e-6_step1500_chair_crop_per25/checkpoint-1000"
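One thing that may be worth checking before digging into the script: in the command above, OUTPUT_DIR is under /data/lora-xl/... while --resume_from_checkpoint points under /data/xuyu/lora-xl/.... The snippet below is only a minimal sketch for sanity-checking a resume path. `inspect_checkpoint` is a hypothetical helper, not part of diffusers or accelerate, and it assumes checkpoint folders are named `checkpoint-<step>` as in the command above.

```python
import os
import re


def inspect_checkpoint(resume_path, output_dir):
    """Hypothetical helper: report basic facts about a resume path.

    Assumes checkpoint folders are named "checkpoint-<step>".
    """
    # Folder name without any trailing slash, e.g. "checkpoint-1000".
    name = os.path.basename(os.path.normpath(resume_path))
    match = re.fullmatch(r"checkpoint-(\d+)", name)
    # Parent directory of the checkpoint folder, as an absolute path.
    parent = os.path.dirname(os.path.abspath(resume_path))
    return {
        # Does the checkpoint folder exist on this machine?
        "exists": os.path.isdir(resume_path),
        # Step number parsed from the folder name, or None if it does not match.
        "step": int(match.group(1)) if match else None,
        # Is the checkpoint folder a direct child of output_dir?
        "inside_output_dir": parent == os.path.abspath(output_dir),
    }


if __name__ == "__main__":
    # Paths copied from the reproduction command above.
    output_dir = "/data/lora-xl/diffusers/examples/dreambooth/lora-trained-xl_1e-6_step1500_chair_crop_per25"
    resume = "/data/xuyu/lora-xl/diffusers/examples/dreambooth/lora-trained-xl_1e-6_step1500_chair_crop_per25/checkpoint-1000"
    print(inspect_checkpoint(resume, output_dir))
```

For the two paths in the command, the parsed step is 1000 and `inside_output_dir` is False, since the parent directories differ. Whether that mismatch is related to the behaviour described here is not confirmed.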
Logs
No response
System Info
- diffusers version: 0.22.0.dev0
- Platform: Linux-3.10.0-1160.95.1.el7.x86_64-x86_64-with-glibc2.17
- Python version: 3.10.13
- PyTorch version (GPU?): 2.1.0+cu121 (True)
- Huggingface_hub version: 0.17.3
- Transformers version: 4.34.0
- Accelerate version: 0.23.0
- xFormers version: not installed
- Using GPU in script?: yes
- Using distributed or parallel set-up in script?: parallel