Open
Description
AdamW does appear to train better, but the weight decay seems to be producing strange generations.
I'm getting the same strange "brown" generations even though the loss continues to go down, albeit at a pretty slow rate. If you're training with --fp16, it's tough to know the generations are poor until after training, because of the inability to log images through wandb.
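For context on why weight decay is the suspect: in AdamW the decay is decoupled from the gradient, so it shrinks every weight a little on each step regardless of what the loss is doing. The sketch below is a plain-Python, single-scalar version of that update, not this repo's training code; `adamw_step`, `run`, and all hyperparameter values are hypothetical names and numbers chosen for illustration. It only shows the mechanism and is not a diagnosis of the brown outputs.

```python
import math


def adamw_step(p, grad, m, v, t, lr=1e-3, betas=(0.9, 0.999),
               eps=1e-8, weight_decay=1e-2):
    """One AdamW update for a single scalar weight (illustrative only)."""
    beta1, beta2 = betas
    # Decoupled weight decay: shrink the weight directly, independent of grad.
    p = p * (1.0 - lr * weight_decay)
    # Standard Adam moment estimates with bias correction.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v


def run(steps, weight_decay, grad=0.0, p0=1.0, lr=1e-3):
    """Apply `steps` updates with a constant gradient and return the weight."""
    p, m, v = p0, 0.0, 0.0
    for t in range(1, steps + 1):
        p, m, v = adamw_step(p, grad, m, v, t, lr=lr,
                             weight_decay=weight_decay)
    return p


# With a zero gradient, only the decay term can move the weight.
no_decay = run(1000, weight_decay=0.0)
with_decay = run(1000, weight_decay=0.1)
print(no_decay, with_decay)
```

With a zero gradient the weight stays put when `weight_decay=0.0` and drifts toward zero otherwise, by a factor of `(1 - lr * weight_decay)` per step. If that drift is what is hurting the generations, a run with weight decay set to 0 (or with it disabled for selected parameters) would be one way to check.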