Feat/picotron resume from checkpoint #656

Open
wants to merge 5 commits into
base: main
from

Conversation

KeitaW
Collaborator

@KeitaW KeitaW commented Apr 29, 2025

Issue #, if available:

Description of changes:

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@KeitaW KeitaW requested a review from allela-roy April 29, 2025 14:16
@KeitaW KeitaW self-assigned this Apr 29, 2025
@allela-roy
Contributor

@KeitaW the training currently fails when it tries to save the checkpoint. I have shared the full error log for debugging.

@KeitaW
Collaborator Author

KeitaW commented Apr 30, 2025

Using a docker command to create the config is suboptimal because of a permission issue (thanks @allela-roy for pointing this out). The PR now uses the enroot command instead. However, this requires squashfuse and fuse-overlayfs. I opened #661, which suggests installing them by default.

@KeitaW KeitaW added the bug (Something isn't working) and enhancement (New feature or request) labels and removed the bug (Something isn't working) label May 1, 2025
@KeitaW
Collaborator Author

KeitaW commented May 1, 2025

Due to NVIDIA/enroot#130, it is better to create the enroot container manually. Addressed in 20f9a42.

Labels
enhancement New feature or request
2 participants