Update DDP example #1364

Open. Wants to merge 4 commits into base: main.

jafraustro (Contributor) commented Jul 8, 2025

Update DDP to use the accelerator API and switch to torchrun for distributed launches

CC: @dvrogozh, @msaroufim

jafraustro marked this pull request as ready for review July 8, 2025 15:06

netlify bot commented Jul 8, 2025

Deploy Preview for pytorch-examples-preview canceled.

Latest commit: afdd3ce
Latest deploy log: https://app.netlify.com/projects/pytorch-examples-preview/deploys/68712b124833f100080d2c69

dvrogozh (Contributor) left a comment

LGTM. @jafraustro: CC the reviewers in the PR description.

soumith (Member) commented Jul 10, 2025

the CI is failing for Distributed examples because something can't find numpy

jafraustro (Contributor, Author) commented

> the CI is failing for Distributed examples because something can't find numpy

Hi, I changed the torch version in the requirements.txt file.

× No solution found when resolving dependencies:
╰─▶ Because only torch<=2.7.1 is available and you require torch>=2.8
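
To spell out why the resolver gives up here: 2.7.1 sorts below 2.8, so no available release can meet the `torch>=2.8` floor. The sketch below is a simplified illustration with hypothetical helper names; it compares dotted numeric versions only and ignores pre-release and local tags, which real resolvers handle under the full PEP 440 rules.

```python
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn "2.7.1" into (2, 7, 1); trailing zeros are dropped so 2.8 == 2.8.0."""
    parts = [int(part) for part in text.split(".")]
    while len(parts) > 1 and parts[-1] == 0:
        parts.pop()
    return tuple(parts)


def can_satisfy_minimum(newest_available: str, minimum_required: str) -> bool:
    """True when the newest available release meets a ">=" requirement."""
    return parse_version(newest_available) >= parse_version(minimum_required)
```

With the values from the error message, `can_satisfy_minimum("2.7.1", "2.8")` is false, which is the "No solution found" case.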

- Replace deprecated launch utility with torchrun (see PyTorch docs: https://pytorch.org/docs/stable/distributed.html#launch-utility)
- Update README to reflect torchrun usage
- Remove main.py (no longer referenced in documentation)
- Update CI to test example.py script instead

Signed-off-by: jafraustro <[email protected]>
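
As background on the launcher switch: torchrun passes rank information to each worker through the environment variables `RANK`, `LOCAL_RANK`, and `WORLD_SIZE`, whereas the deprecated launch utility handed the local rank to the script as a command-line argument by default. A minimal sketch of reading those variables follows; `LaunchInfo` and `read_launch_info` are hypothetical names for illustration and are not taken from example.py.

```python
import os
from typing import Mapping, NamedTuple, Optional


class LaunchInfo(NamedTuple):
    rank: int        # global rank across all nodes
    local_rank: int  # rank within the current node
    world_size: int  # total number of processes


def read_launch_info(env: Optional[Mapping[str, str]] = None) -> LaunchInfo:
    """Read the rank variables that torchrun exports for each worker."""
    env = os.environ if env is None else env
    # The defaults describe a single-process run started without torchrun.
    return LaunchInfo(
        rank=int(env.get("RANK", "0")),
        local_rank=int(env.get("LOCAL_RANK", "0")),
        world_size=int(env.get("WORLD_SIZE", "1")),
    )
```

Under a launch such as `torchrun --nproc_per_node=2 example.py`, each of the two workers would see its own `RANK` and `LOCAL_RANK` and a `WORLD_SIZE` of 2.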

soumith (Member) commented Jul 11, 2025

it's failing now with some new errors

jafraustro (Contributor, Author) commented

> it's failing now with some new errors

Hello @soumith,

The errors occurred because there were not enough GPUs available. To address this, I added a minimum GPU verification step, similar to the approach used in the tensor_parallel_example.py example. This ensures the script only runs when the required number of GPUs is present.
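
A check of this kind could look roughly like the sketch below. It is an assumption about the general shape rather than the code added in this PR: the function name `verify_min_device_count` and the injectable `device_count` parameter are hypothetical, and the PyTorch branch assumes a release that provides `torch.accelerator`.

```python
from typing import Optional


def verify_min_device_count(
    min_devices: int = 2, device_count: Optional[int] = None
) -> bool:
    """Return True when at least `min_devices` accelerators are visible.

    `device_count` can be passed explicitly (handy for testing); when it is
    omitted, the count is queried from PyTorch.
    """
    if device_count is None:
        import torch  # lazy import: only needed when querying real hardware

        if torch.accelerator.is_available():
            device_count = torch.accelerator.device_count()
        else:
            device_count = 0
    return device_count >= min_devices
```

A script would call this once at startup and exit early with a short message when it returns false, so that CI runners with too few GPUs skip the example instead of failing partway through.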


4 participants