save_checkpoint deepspeed bug #2557

Open
@kmn1024

Description

Notice: In order to resolve issues more efficiently, please raise issue following the template.

🐛 Bug

save_checkpoint crashes at:

  File "/workspace/FunASR/funasr/train_utils/trainer_ds.py", line 644, in train_epoch
    self.save_checkpoint(
  File "/workspace/FunASR/funasr/train_utils/trainer_ds.py", line 290, in save_checkpoint
    misc_utils.smart_remove(filename)
  File "/workspace/FunASR/funasr/utils/misc.py", line 118, in smart_remove
    shutil.rmtree(path)
  File "/usr/lib/python3.10/shutil.py", line 731, in rmtree
    onerror(os.rmdir, path, sys.exc_info())
  File "/usr/lib/python3.10/shutil.py", line 729, in rmtree
    os.rmdir(path)
FileNotFoundError: [Errno 2] No such file or directory: './train_outputs_full2/model.pt.ep0.5000'

when training on multiple GPUs with DeepSpeed.

To Reproduce

Any multi-GPU training run that uses DeepSpeed.

Additional context

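I think the smart_remove call in save_checkpoint needs to be guarded by an "if self.rank == 0:" check. My guess is that every rank currently tries to delete the same old checkpoint directory, so the ranks that lose the race fail with FileNotFoundError.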
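
A minimal sketch of the guard proposed above, not the actual FunASR code. The function name remove_checkpoint_guarded and its rank parameter are hypothetical, and the real smart_remove may take different arguments. It assumes that only rank 0 should delete old checkpoints and that a path which has already disappeared should not be treated as an error.

```python
import os
import shutil


def remove_checkpoint_guarded(path: str, rank: int) -> bool:
    """Delete an old checkpoint from rank 0 only (hypothetical helper).

    Returns True if this process was responsible for the removal,
    False if it skipped it because it is not rank 0.
    """
    if rank != 0:
        # Other ranks leave the shared path alone, so they cannot race
        # with rank 0 on the same directory.
        return False

    if os.path.isdir(path):
        # ignore_errors=True tolerates entries that vanish mid-removal.
        shutil.rmtree(path, ignore_errors=True)
    else:
        try:
            os.remove(path)
        except FileNotFoundError:
            # Already gone; nothing left to clean up.
            pass
    return True
```

In the trainer this would presumably be called with self.rank in place of the rank argument. Whether the ranks also need to synchronize afterwards (for example with torch.distributed.barrier()) depends on how the checkpoint path is reused, which the traceback does not show.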

Labels: bug