Open
Description
Hi, Thank you for sharing your great work!
I tried to run your training scripts. But my machine only has 4 GPUs. So I changed its WORLD SIZE to 4 from 8 in original yaml file.
Then it says "Timed out initializing process group in store based barrier on rank: 2" or sometimes it suddenly crashes during the epoch and my docker container shutdowns (indicating memory leak?).
Any advise on successfully running your training code?
Thank you for your cooperation.
Metadata
Metadata
Assignees
Labels
No labels