Timed out initializing process group in store based barrier on rank: 2 #48

Open
@yotaroshimose

Hi, thank you for sharing your great work!

I tried to run your training scripts, but my machine only has 4 GPUs, so I changed WORLD SIZE from 8 to 4 in the original YAML file.

After that change, training either fails with "Timed out initializing process group in store based barrier on rank: 2", or it crashes partway through an epoch and my Docker container shuts down (possibly indicating a memory leak?).

Any advice on running your training code successfully?

Thank you for your cooperation.
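
For anyone trying to reproduce this, here is a minimal sketch of a sanity check that may help narrow down the timeout. As far as I understand, the store-based barrier waits until as many ranks as the configured world size have checked in, so it can time out when the configured value and the number of processes actually launched disagree. The function name `check_world_size` and its parameters are hypothetical, not part of this repository or of PyTorch; the three numbers have to be supplied by hand from the YAML file, the launch command, and the machine.

```python
from typing import List


def check_world_size(configured: int, launched: int, visible_gpus: int) -> List[str]:
    """Return a list of problems; an empty list means the three values agree.

    configured   -- world size written in the config file (assumed to be read by hand)
    launched     -- number of training processes actually started
    visible_gpus -- number of GPUs visible to those processes
    """
    problems = []
    if launched != configured:
        # The barrier waits for `configured` ranks, but only `launched` will ever join.
        problems.append(
            f"config expects {configured} ranks but {launched} processes were launched"
        )
    if launched > visible_gpus:
        # More processes than GPUs means at least two ranks would share a device.
        problems.append(
            f"{launched} processes were launched but only {visible_gpus} GPUs are visible"
        )
    return problems
```

For example, `check_world_size(configured=8, launched=4, visible_gpus=4)` reports one mismatch, which would correspond to a config that still expects 8 ranks somewhere while only 4 processes start.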
