
retry quorum #228

Open · tushar00jain wants to merge 1 commit into main from pr228

Conversation

@tushar00jain (Contributor) commented on Jul 10, 2025

Summary:

  • We currently don't retry quorum requests from the manager to the lighthouse.
  • If the lighthouse crashes, this can result in all replicas crashing.
  • So add retries, configurable through an env var.
  • Remove holding the state lock while making network calls in the manager.
  • The manager tries reconnecting to the lighthouse if a request to it fails, up to the configured number of retries (see the sketch below).
  • There are still some unhandled cases:
    • The manager doesn't broadcast the result to all ranks if there's a failure in `_run_quorum`, resulting in a hang.
    • If a rank gets an error from quorum, it still crashes (the handling will be more complicated if ranks are on multiple hosts and they can reconnect independently).
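Roughly, the retry loop looks like the sketch below. This is not the exact diff: the generated client name, the `quorum` RPC, and the `connect` call are assumed from the snippets quoted further down, and in the real code this logic lives on the manager rather than a free function.

```rust
use std::time::Duration;
use tonic::{transport::Channel, Response, Status};
// Imports of the generated proto types (LighthouseServiceClient,
// LighthouseQuorumRequest, LighthouseQuorumResponse) are omitted here.

/// Sketch only: retry the quorum RPC up to `quorum_retries` times,
/// rebuilding the lighthouse client after each failure so a crashed and
/// restarted lighthouse can be reached again.
async fn quorum_with_retries(
    mut client: LighthouseServiceClient<Channel>,
    lighthouse_addr: String,
    quorum_retries: u64,
    timeout: Duration,
    lighthouse_request: LighthouseQuorumRequest,
) -> Result<Response<LighthouseQuorumResponse>, Status> {
    let mut attempt: u64 = 0;
    loop {
        let mut request = tonic::Request::new(lighthouse_request.clone());
        request.set_timeout(timeout);
        match client.quorum(request).await {
            Ok(response) => return Ok(response),
            Err(_) if attempt < quorum_retries => {
                attempt += 1;
                // Reconnect with a fresh client so a restarted lighthouse is
                // reachable again; if reconnecting fails, keep the old client
                // and let the next attempt try again.
                if let Ok(new_client) =
                    LighthouseServiceClient::connect(lighthouse_addr.clone()).await
                {
                    client = new_client;
                }
            }
            Err(status) => return Err(status),
        }
    }
}
```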

@facebook-github-bot added the CLA Signed label (managed by the Meta Open Source bot) on Jul 10, 2025
@tushar00jain force-pushed the pr228 branch 5 times, most recently from 2210b6c to 5c14c9e, on July 11, 2025 22:34
@tushar00jain marked this pull request as ready for review on July 11, 2025 22:40
@d4l3k (Member) left a comment:

LGTM, but I have some comments. If we want to target things other than just timeouts, I think we need to tweak the retry loop.

src/manager.rs (outdated)
timeout: Duration,
lighthouse_request: LighthouseQuorumRequest,
) -> Result<tonic::Response<LighthouseQuorumResponse>, Status> {
let mut client = self.lighthouse_client.clone();
@d4l3k (Member):
Will this reconnect if an error occurred, or do we need to construct a new lighthouse_client in that case?

@tushar00jain (Contributor, Author):
Updated: we now create a new client each time the request fails.
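For reference, a minimal sketch of what "create a new client" means here (the generated client name and the address parameter are assumed): `connect` dials the lighthouse address again, rather than reusing the channel held by a clone of the old client.

```rust
use tonic::transport::{Channel, Error};

// Sketch only: dial the lighthouse again so a restarted lighthouse is
// reachable; cloning the old client would reuse its existing channel.
async fn reconnect_lighthouse(
    lighthouse_addr: String,
) -> Result<LighthouseServiceClient<Channel>, Error> {
    LighthouseServiceClient::connect(lighthouse_addr).await
}
```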

@@ -271,6 +276,7 @@ def __init__(
world_size=group_world_size,
heartbeat_interval=heartbeat_interval,
connect_timeout=connect_timeout,
quorum_retries=int(os.environ.get(QUORUM_RETRIES_ENV, "0")),
@d4l3k (Member):
may be good to also add a config variable for this

@tushar00jain (Contributor, Author):
the env var is the config? or you mean somewhere else?

@d4l3k (Member):
i.e. able to specify it via an arg to Manager init

@tushar00jain force-pushed the pr228 branch 3 times, most recently from 9ae71da to cc1b895, on July 12, 2025 01:23
@tushar00jain requested a review from d4l3k on July 12, 2025 01:25
@tushar00jain force-pushed the pr228 branch 7 times, most recently from ecd4720 to 0931e7e, on July 12, 2025 03:50
@d4l3k (Member) left a comment:
LGTM

"Failed to send heartbeat to lighthouse: {}",
e.to_string()
);
let _ = self.create_lighthouse_client().await;
@d4l3k (Member):
Should we be checking status for this?
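A minimal sketch of the suggested check (the error type and logging macro are assumed; the point is just to inspect the Result instead of discarding it with `let _ = ...`):

```rust
// Sketch: surface the failure instead of silently discarding it, then let
// the next heartbeat attempt retry the reconnect.
if let Err(e) = self.create_lighthouse_client().await {
    // Logging macro is illustrative; match whatever the manager already uses.
    tracing::warn!("failed to re-create lighthouse client: {}", e);
}
```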
