Skip to content

enable merging parameters for diloco #212

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jul 12, 2025
Merged

Conversation

tushar00jain
Copy link
Contributor

@tushar00jain tushar00jain commented Jun 10, 2025

Summary:

  • merge local and global parameters of the model after synchronization
  • add the "alpha" parameter to integration tests

Test Plan:

pytest -vs ./torchft/local_sgd_integ_test.py

Stack created with Sapling. Best reviewed with ReviewStack.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Jun 10, 2025
@tushar00jain tushar00jain force-pushed the pr212 branch 18 times, most recently from e609b5a to b93142d Compare June 12, 2025 17:07
@tushar00jain tushar00jain force-pushed the pr212 branch 5 times, most recently from 3f8e0b8 to 323fb47 Compare June 15, 2025 22:56
@tushar00jain tushar00jain marked this pull request as ready for review June 16, 2025 16:14
@tushar00jain tushar00jain marked this pull request as draft June 16, 2025 16:15
@tushar00jain tushar00jain force-pushed the pr212 branch 2 times, most recently from 37da9d9 to 8ce038d Compare June 16, 2025 22:27
@tushar00jain tushar00jain force-pushed the pr212 branch 8 times, most recently from 7a016b6 to 355deed Compare June 19, 2025 01:39
@tushar00jain tushar00jain force-pushed the pr212 branch 3 times, most recently from b5eb209 to 6ad8993 Compare June 30, 2025 15:26
@tushar00jain tushar00jain requested a review from dzmitry-huba July 8, 2025 22:27
@tushar00jain tushar00jain marked this pull request as ready for review July 8, 2025 22:28
This was referenced Jul 10, 2025
@tushar00jain tushar00jain force-pushed the pr212 branch 3 times, most recently from 8b76f91 to 6524d16 Compare July 11, 2025 20:43
Copy link
Member

@d4l3k d4l3k left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -719,6 +721,7 @@ def test_streaming_diloco_commit_failure(
"diloco_args": {
"fragment_sync_delay": fragment_sync_delay,
"sync_every": 4,
"fragment_update_alpha": alpha,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thoughts on adding some mocked out tests where we actually check the numerical values against some references to avoid regressions?

With a fixed seed and deterministic mode in torch we can add regression tests by comparing against a known value.

Merges the local and global parameters.
"""
for name, p in self._model_fragment.named_parameters():
torch.lerp(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this isn't inplace right? having some numerics tests would be nice to catch issues like this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep let me add this and regression test in a separate diff?

Summary:
- merge local and global parameters of the model after synchronization
- add the "alpha" parameter to integration tests

Test Plan:
```
pytest -vs ./torchft/local_sgd_integ_test.py
```
@tushar00jain tushar00jain merged commit 2a10ef2 into pytorch:main Jul 12, 2025
13 of 15 checks passed
@tushar00jain tushar00jain deleted the pr212 branch July 12, 2025 01:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Meta Open Source bot.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants