
Update distributed checkpoint recipes #3446

Merged: 6 commits merged into main on Jul 10, 2025


Saiteja64 (Contributor) commented Jul 10, 2025

Description

Replace references to FSDP with FSDP2.
Also includes some fixes so that the tutorials run without errors.

Follow-up to convert these tutorials to .py files: #3452 (comment)

Saiteja64 added the tutorials_audit label (used on tutorial audit PRs) on Jul 10, 2025

pytorch-bot bot commented Jul 10, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/tutorials/3446

Note: Links to docs will display an error until the docs builds have been completed.

⏳ No Failures, 1 Pending

As of commit 9078c86 with merge base ab48a0c (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

svekars (Contributor) left a comment

Looks good, but did you try to run it? Does it run?

Saiteja64 (Contributor, Author) replied:

> Looks good, but did you try to run it? Does it run?

@svekars Yes. I copied each code block into a Python script and verified that they work as expected.
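
The check described in this reply (copying each tutorial code block into a Python script) could be automated roughly as follows. This is a minimal sketch, not the script used in this PR: `extract_python_blocks` is a hypothetical helper, and it assumes the tutorial marks its code with `.. code-block:: python` directives.

```python
import re
import textwrap


def extract_python_blocks(rst_text: str) -> list[str]:
    """Return the dedented body of every ``.. code-block:: python`` directive.

    Hypothetical helper for illustration; it handles only the simple case of
    an indented body that ends at the next line with no extra indentation.
    """
    blocks = []
    lines = rst_text.splitlines()
    i = 0
    while i < len(lines):
        match = re.match(r"^(\s*)\.\. code-block:: python\s*$", lines[i])
        if not match:
            i += 1
            continue
        directive_indent = len(match.group(1))
        i += 1
        # Skip directive options such as ":linenos:" that follow immediately.
        while i < len(lines) and lines[i].strip().startswith(":"):
            i += 1
        body = []
        while i < len(lines):
            line = lines[i]
            if line.strip() == "":
                body.append("")  # blank lines may belong to the block
            elif len(line) - len(line.lstrip()) > directive_indent:
                body.append(line)  # indented deeper than the directive
            else:
                break  # back at prose level: the block is over
            i += 1
        code = textwrap.dedent("\n".join(body)).strip("\n")
        if code:
            blocks.append(code)
    return blocks
```

Joining the returned blocks with blank lines gives a single script to run. As noted later in the conversation, such a script contains only the code and none of the tutorial's explanatory text.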

svekars (Contributor) commented Jul 10, 2025

Do you think it's possible to replace the .rst with that .py script? Overall, we prefer .py to .rst.

Saiteja64 (Contributor, Author) replied:

> Do you think it's possible to replace the .rst with that .py script? Overall, we prefer .py to .rst.

@svekars The .py script does not contain all the information; it is just the code blocks. I can handle the conversion from .rst to .py as a lower-priority follow-up.

svekars (Contributor) commented Jul 10, 2025

Sounds good. Please create an issue, assign it to yourself, and link it in this PR. Thanks!

@@ -152,7 +151,7 @@ Now, let's create a toy module, wrap it with FSDP, feed it with some dummy input
 join=True,
 )
 
-Please go ahead and check the `checkpoint` directory. You should see 8 checkpoint files as shown below.
+Please go ahead and check the `checkpoint` directory. You should see world_size checkpoint files as shown below.
Contributor

Was this change intentional? "You should see world_size checkpoint files" sounds a bit odd. Can you rephrase?

Contributor Author

Yes. Updated to be a bit clearer.
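
The expectation under discussion (one checkpoint file per rank, so `world_size` files in total) can be checked with a short script. This is a sketch under stated assumptions, not part of the tutorial: `count_shard_files` and `has_one_shard_per_rank` are hypothetical helpers, and the `.distcp` suffix and the one-file-per-rank layout are assumed from the default filesystem writer's behavior.

```python
from pathlib import Path


def count_shard_files(checkpoint_dir, suffix=".distcp"):
    """Count files in checkpoint_dir whose names end with suffix.

    Assumption: each rank writes one shard file with this suffix, and any
    other file in the directory (for example a metadata file) is not a shard.
    """
    return sum(
        1
        for path in Path(checkpoint_dir).iterdir()
        if path.is_file() and path.name.endswith(suffix)
    )


def has_one_shard_per_rank(checkpoint_dir, world_size):
    """Return True if the number of shard files equals world_size."""
    return count_shard_files(checkpoint_dir) == world_size
```

For example, `has_one_shard_per_rank("checkpoint", 8)` would be expected to return `True` after a run on 8 ranks, if the assumptions above hold.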

Saiteja64 merged commit 8de775a into main on Jul 10, 2025
21 checks passed
Labels: cla signed, tutorials_audit (used on tutorial audit PRs)
3 participants