Labels
enhancement, good first issue, lighthouse, rust
Description
We currently always heal on step 0 to avoid synchronization issues. We want an option to support skipping this sync for users who set the PyTorch seed so all ranks are initialized with the same values.
The option should match the name init_sync from pytorch/pytorch#142824.
As a bonus, we could randomly initialize a value in Manager so that we can detect whether ranks are seeded identically, and throw an error on the first quorum if there is a mismatch.
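The mismatch check could look roughly like the sketch below. It is written in Rust only to match the snippet quoted in this issue; the real check may belong in Manager on the Python side. The `Member` struct, its `init_check` field (the randomly drawn value each replica would report), and `check_init_values` are all hypothetical names, not existing torchft APIs.

```rust
/// Hypothetical stand-in for a quorum member; not the torchft type.
#[derive(Debug, Clone)]
pub struct Member {
    pub replica_id: String,
    pub step: i64,
    /// Assumed field: a value drawn from the (possibly seeded) RNG at startup.
    pub init_check: u64,
}

/// Returns an error if, on the first quorum (every member at step 0),
/// the members report different `init_check` values.
pub fn check_init_values(members: &[Member]) -> Result<(), String> {
    // Once any member has made progress, healing handles divergence anyway.
    if members.iter().any(|m| m.step != 0) {
        return Ok(());
    }
    let first = match members.first() {
        Some(m) => m,
        None => return Ok(()),
    };
    for m in &members[1..] {
        if m.init_check != first.init_check {
            return Err(format!(
                "seed mismatch: replica {} reported {}, replica {} reported {}",
                first.replica_id, first.init_check, m.replica_id, m.init_check
            ));
        }
    }
    Ok(())
}
```

With identical seeds every replica would draw the same value and the check passes; an unseeded replica would almost certainly draw a different one and trigger the error.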
Relevant code:
- Manager https://github.com/pytorch/torchft/blob/main/torchft/manager.py
max_step == 0 && primary.replica_id != p.replica_id
Lines 403 to 410 in d427bef
    // Nodes are recovering if:
    // 1. not at the max step
    // 2. max_step == 0 and not the primary replica
    let all_recover_dst_ranks: Vec<usize> = participants
        .iter()
        .enumerate()
        .filter_map(|(i, p)| {
            if p.step != max_step
                || max_step == 0 && primary.replica_id != p.replica_id
            {
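One possible shape for the change is to gate the step-0 condition on the new option. The sketch below is a simplified, self-contained version of the filter above, not the actual torchft code: the `Participant` struct is a stand-in, and passing `init_sync` as a single boolean argument is an assumption (it might instead be carried per participant in the quorum request).

```rust
/// Simplified stand-in for a quorum participant; not the torchft type.
#[derive(Debug, Clone)]
pub struct Participant {
    pub replica_id: String,
    pub step: i64,
}

/// Returns the indices of participants that must recover.
/// `init_sync` is the hypothetical new option: when false, the forced
/// sync at step 0 is skipped.
pub fn recover_dst_ranks(
    participants: &[Participant],
    primary: &Participant,
    init_sync: bool,
) -> Vec<usize> {
    let max_step = participants.iter().map(|p| p.step).max().unwrap_or(0);
    participants
        .iter()
        .enumerate()
        .filter_map(|(i, p)| {
            // 1. Behind the max step: always recover.
            let behind = p.step != max_step;
            // 2. At step 0 and not the primary: recover only if init_sync is set.
            let needs_init_sync =
                init_sync && max_step == 0 && primary.replica_id != p.replica_id;
            if behind || needs_init_sync {
                Some(i)
            } else {
                None
            }
        })
        .collect()
}
```

With `init_sync` set to true this keeps the current behavior. With it set to false, replicas at step 0 are trusted to hold identical weights and only replicas that are behind the max step recover.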