make timeouts configurable #229

Merged
merged 1 commit into from
Jul 12, 2025
@tushar00jain (Contributor) commented Jul 10, 2025

Summary:
While training, we need to set higher quorum timeouts to make sure all replicas can finish training.

These parameters are only exposed through the manager, but users may not be able to access the manager directly, e.g. when using torchtitan.

So make the timeouts configurable using env vars that take precedence over the parameters passed through the manager.
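
A minimal sketch of the precedence rule described above, not the code from this PR. The env var name `EXAMPLE_QUORUM_TIMEOUT_SEC` and the helper `resolve_timeout` are hypothetical and chosen for illustration only; the sketch also assumes the value is given in seconds.

```python
import os
from datetime import timedelta
from typing import Mapping, Optional

# Hypothetical env var name, for illustration only (not taken from the PR).
QUORUM_TIMEOUT_SEC_ENV = "EXAMPLE_QUORUM_TIMEOUT_SEC"


def resolve_timeout(
    param: timedelta,
    env_name: str,
    environ: Optional[Mapping[str, str]] = None,
) -> timedelta:
    """Return the timeout from the env var if it is set, else the parameter.

    `param` stands in for the value passed through the manager; the env var,
    when present, takes precedence over it.
    """
    env = os.environ if environ is None else environ
    raw = env.get(env_name)
    if raw is None:
        return param
    # Assumption: the env var holds a number of seconds.
    return timedelta(seconds=float(raw))


# Value passed through the manager, used when the env var is unset.
default_timeout = timedelta(seconds=60)
quorum_timeout = resolve_timeout(default_timeout, QUORUM_TIMEOUT_SEC_ENV)
```

With this shape, a user who launches training through a wrapper and cannot reach the manager constructor could still raise the timeout by exporting the variable before launch.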


Stack created with Sapling. Best reviewed with ReviewStack.

@facebook-github-bot added the CLA Signed label Jul 10, 2025
@tushar00jain force-pushed the pr229 branch 3 times, most recently from 7653989 to ca01921, July 11, 2025 22:22
@tushar00jain requested review from d4l3k and H-Huang July 11, 2025 22:38
@tushar00jain marked this pull request as ready for review July 11, 2025 22:38
@d4l3k (Member) left a comment

LGTM

@tushar00jain merged commit 347fd32 into pytorch:main Jul 12, 2025
12 of 17 checks passed
@tushar00jain deleted the pr229 branch July 12, 2025 01:29