retry quorum #228
base: main
Conversation
Force-pushed from 2210b6c to 5c14c9e
LGTM but have some comments. If we want to target things other than just timeouts I think we need to tweak the retry loop
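As a rough illustration of what "targeting things other than just timeouts" could look like, here is a small Python sketch (the names `StatusCode`, `RETRYABLE_CODES`, and `is_retryable` are hypothetical, not from this PR) that classifies failures by status code so the retry loop can retry transient errors but fail fast on permanent ones:

```python
from enum import Enum

# Hypothetical subset of gRPC-style status codes, for illustration only.
class StatusCode(Enum):
    OK = 0
    INVALID_ARGUMENT = 3
    DEADLINE_EXCEEDED = 4
    UNAVAILABLE = 14

# Codes worth retrying: transient failures, not caller bugs.
RETRYABLE_CODES = {StatusCode.DEADLINE_EXCEEDED, StatusCode.UNAVAILABLE}

def is_retryable(code: StatusCode) -> bool:
    """Retry only transient errors; permanent errors should surface immediately."""
    return code in RETRYABLE_CODES
```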
src/manager.rs
Outdated
timeout: Duration,
lighthouse_request: LighthouseQuorumRequest,
) -> Result<tonic::Response<LighthouseQuorumResponse>, Status> {
    let mut client = self.lighthouse_client.clone();
Will this reconnect if an error occurred, or do we need to construct a new lighthouse_client in that case?
Updated: now we create a new client each time the request fails.
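A minimal Python sketch of the behavior described here (the actual change is in Rust in src/manager.rs; `make_client` and `send_request` are hypothetical stand-ins): on each failed request the old client is discarded and a fresh one is constructed before the next attempt, so a dead connection is never reused.

```python
def request_quorum(make_client, send_request, retries: int):
    """Try the quorum request up to retries + 1 times, rebuilding the
    client after every failure so a broken connection is not reused."""
    client = make_client()
    last_err = None
    for _ in range(retries + 1):
        try:
            return send_request(client)
        except ConnectionError as e:
            last_err = e
            # Fresh client: the old one may hold a dead connection.
            client = make_client()
    raise last_err
```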
torchft/manager.py
Outdated
@@ -271,6 +276,7 @@ def __init__(
    world_size=group_world_size,
    heartbeat_interval=heartbeat_interval,
    connect_timeout=connect_timeout,
    quorum_retries=int(os.environ.get(QUORUM_RETRIES_ENV, "0")),
may be good to also add a config variable for this
the env var is the config? or you mean somewhere else?
i.e. able to specify it via an arg to Manager init
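One way that could look, sketched in Python (the `Manager` body here is hypothetical, and the env var's value string is assumed since only the `QUORUM_RETRIES_ENV` constant appears in the diff): accept `quorum_retries` as an init argument and fall back to the env var only when the caller doesn't pass one.

```python
import os
from typing import Optional

# Assumed value for illustration; the real constant lives in torchft.
QUORUM_RETRIES_ENV = "TORCHFT_QUORUM_RETRIES"

class Manager:
    def __init__(self, quorum_retries: Optional[int] = None):
        # Explicit argument wins; otherwise fall back to the env var, then 0.
        if quorum_retries is None:
            quorum_retries = int(os.environ.get(QUORUM_RETRIES_ENV, "0"))
        self.quorum_retries = quorum_retries
```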
Force-pushed from 9ae71da to cc1b895
Force-pushed from ecd4720 to 0931e7e
Summary:
- We currently don't retry quorum requests from the manager to the lighthouse; if the lighthouse crashes, this can result in all replicas crashing.
- So, add retries configurable through an env var.
- Remove holding the state lock while making network calls in the manager.
- The manager tries reconnecting to the lighthouse when a request fails, up to the configured number of retries.
- There are still some unhandled cases:
  - The manager doesn't broadcast the result to all ranks if there's a failure in `_run_quorum`, resulting in a hang.
  - If a rank gets an error from quorum, it still crashes (the handling will be more complicated if ranks are on multiple hosts and can independently reconnect).
LGTM
    "Failed to send heartbeat to lighthouse: {}",
    e.to_string()
);
let _ = self.create_lighthouse_client().await;
Should we be checking status for this?