
retry quorum #228

Open · tushar00jain wants to merge 1 commit into main from pr228

Conversation

@tushar00jain (Contributor) commented on Jul 10, 2025

Summary:

  • We currently don't retry quorum requests from the manager to the lighthouse.
  • If the lighthouse crashes, this can result in all replicas crashing.
  • So add retries, configurable through an env var.
  • Remove holding the state lock while making network calls in the manager.
  • The manager tries reconnecting to the lighthouse if a request to it fails, up to the configured number of retries (see the sketch below).
  • There are still some unhandled cases:
    • The manager doesn't broadcast the result to all ranks if there's a failure in `_run_quorum`, resulting in a hang.
    • If a rank gets an error from quorum, it still crashes (the handling will be more complicated if ranks are on multiple hosts and they can reconnect independently).
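Roughly, the retry loop looks like the sketch below. This is not the exact diff: the generated client name, the `quorum` RPC, and the `connect` call are assumed from the snippets quoted further down, and in the real code this logic lives on the manager rather than a free function.

```rust
use std::time::Duration;
use tonic::{transport::Channel, Response, Status};
// Imports of the generated proto types (LighthouseServiceClient,
// LighthouseQuorumRequest, LighthouseQuorumResponse) are omitted here.

/// Sketch only: retry the quorum RPC up to `quorum_retries` times,
/// rebuilding the lighthouse client after each failure so a crashed and
/// restarted lighthouse can be reached again.
async fn quorum_with_retries(
    mut client: LighthouseServiceClient<Channel>,
    lighthouse_addr: String,
    quorum_retries: u64,
    timeout: Duration,
    lighthouse_request: LighthouseQuorumRequest,
) -> Result<Response<LighthouseQuorumResponse>, Status> {
    let mut attempt: u64 = 0;
    loop {
        let mut request = tonic::Request::new(lighthouse_request.clone());
        request.set_timeout(timeout);
        match client.quorum(request).await {
            Ok(response) => return Ok(response),
            Err(_) if attempt < quorum_retries => {
                attempt += 1;
                // Reconnect with a fresh client so a restarted lighthouse is
                // reachable again; if reconnecting fails, keep the old client
                // and let the next attempt try again.
                if let Ok(new_client) =
                    LighthouseServiceClient::connect(lighthouse_addr.clone()).await
                {
                    client = new_client;
                }
            }
            Err(status) => return Err(status),
        }
    }
}
```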

@facebook-github-bot added the CLA Signed label (managed by the Meta Open Source bot) on Jul 10, 2025
@tushar00jain force-pushed the pr228 branch 5 times, most recently from 2210b6c to 5c14c9e, on July 11, 2025 22:34
@tushar00jain marked this pull request as ready for review on July 11, 2025 22:40
@d4l3k (Member) left a comment:

LGTM, but I have some comments. If we want to target things other than just timeouts, I think we need to tweak the retry loop.

src/manager.rs (outdated)
timeout: Duration,
lighthouse_request: LighthouseQuorumRequest,
) -> Result<tonic::Response<LighthouseQuorumResponse>, Status> {
let mut client = self.lighthouse_client.clone();
@d4l3k (Member):
Will this reconnect if an error occurred, or do we need to construct a new lighthouse_client in that case?

@tushar00jain (Contributor, Author):
Updated: we now create a new client each time the request fails.
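For reference, a minimal sketch of what "create a new client" means here (the generated client name and the address parameter are assumed): `connect` dials the lighthouse address again, rather than reusing the channel held by a clone of the old client.

```rust
use tonic::transport::{Channel, Error};

// Sketch only: dial the lighthouse again so a restarted lighthouse is
// reachable; cloning the old client would reuse its existing channel.
async fn reconnect_lighthouse(
    lighthouse_addr: String,
) -> Result<LighthouseServiceClient<Channel>, Error> {
    LighthouseServiceClient::connect(lighthouse_addr).await
}
```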

@@ -271,6 +276,7 @@ def __init__(
world_size=group_world_size,
heartbeat_interval=heartbeat_interval,
connect_timeout=connect_timeout,
quorum_retries=int(os.environ.get(QUORUM_RETRIES_ENV, "0")),
@d4l3k (Member):
may be good to also add a config variable for this

@tushar00jain (Contributor, Author):
the env var is the config? or you mean somewhere else?

@d4l3k (Member):
i.e. able to specify it via an arg to Manager init

@tushar00jain force-pushed the pr228 branch 3 times, most recently from 9ae71da to cc1b895, on July 12, 2025 01:23
@tushar00jain requested a review from d4l3k on July 12, 2025 01:25
@tushar00jain force-pushed the pr228 branch 7 times, most recently from ecd4720 to 0931e7e, on July 12, 2025 03:50
@d4l3k (Member) left a comment:
LGTM

"Failed to send heartbeat to lighthouse: {}",
e.to_string()
);
let _ = self.create_lighthouse_client().await;
@d4l3k (Member):
Should we be checking status for this?
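A minimal sketch of the suggested check (the error type and logging macro are assumed; the point is just to inspect the Result instead of discarding it with `let _ = ...`):

```rust
// Sketch: surface the failure instead of silently discarding it, then let
// the next heartbeat attempt retry the reconnect.
if let Err(e) = self.create_lighthouse_client().await {
    // Logging macro is illustrative; match whatever the manager already uses.
    tracing::warn!("failed to re-create lighthouse client: {}", e);
}
```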
