🐛 Bug
save_checkpoint crashes with the following traceback when training on multiple GPUs with DeepSpeed:

  File "/workspace/FunASR/funasr/train_utils/trainer_ds.py", line 644, in train_epoch
    self.save_checkpoint(
  File "/workspace/FunASR/funasr/train_utils/trainer_ds.py", line 290, in save_checkpoint
    misc_utils.smart_remove(filename)
  File "/workspace/FunASR/funasr/utils/misc.py", line 118, in smart_remove
    shutil.rmtree(path)
  File "/usr/lib/python3.10/shutil.py", line 731, in rmtree
    onerror(os.rmdir, path, sys.exc_info())
  File "/usr/lib/python3.10/shutil.py", line 729, in rmtree
    os.rmdir(path)
FileNotFoundError: [Errno 2] No such file or directory: './train_outputs_full2/model.pt.ep0.5000'
To Reproduce
Any multi-GPU training run using DeepSpeed.
Additional context
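I think the smart_remove call in save_checkpoint needs to be guarded by if self.rank == 0: .... Otherwise every rank presumably tries to delete the same checkpoint directory, and the ranks that arrive after it is already gone fail with FileNotFoundError.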
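
To make the suggestion concrete, here is a minimal sketch of a rank-guarded removal. It is not FunASR's actual code: remove_checkpoint is a hypothetical helper, and it assumes the trainer can pass in its process rank. It also tolerates a path that is already gone, which might be a useful second line of defence.

```python
import os
import shutil
import tempfile


def remove_checkpoint(path: str, rank: int) -> bool:
    """Hypothetical helper (not FunASR's smart_remove).

    Only rank 0 deletes the checkpoint. Returns True if something was
    removed, False if this rank skipped the call or the path was missing.
    """
    if rank != 0:
        # Other ranks must not touch the shared checkpoint directory.
        return False
    try:
        if os.path.isdir(path):
            shutil.rmtree(path)
        else:
            os.remove(path)
    except FileNotFoundError:
        # Already deleted (or never written); nothing left to do.
        return False
    return True
```

In a real distributed run the other ranks may also need to wait until rank 0 has finished (for example with torch.distributed.barrier()) before they write a new checkpoint to the same location; whether that is needed here depends on how save_checkpoint is structured.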