My understanding is that there are always some collective operations across replication groups to allreduce the gradients (whenever any form of DDP or HSDP is used). If a node in one replication group fails, all the other groups will hang until the timeout, because the allreduce can never complete. Since the lighthouse already knows that the node has failed, should the allreduce be aborted right away instead of waiting for the timeout?
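To make the question concrete, here is a minimal sketch in plain Python of the behavior I have in mind. It is not torchft or NCCL code: `wait_for_collective`, `simulate`, and the two events are hypothetical names used only for illustration, and the `abort` event stands in for whatever failure notification the lighthouse could deliver. The point is that a healthy group waiting on the allreduce could return as soon as the failure is announced, rather than sitting out the full timeout.

```python
import threading
import time


def wait_for_collective(done, abort, timeout, poll=0.01):
    """Wait for a collective to finish, returning early on a failure notice.

    Returns "done", "aborted", or "timeout".
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if done.is_set():
            return "done"
        if abort.is_set():
            return "aborted"
        time.sleep(poll)
    return "timeout"


def simulate(failure_notice_after, timeout):
    """Simulate a healthy group whose peer has died mid-allreduce."""
    done = threading.Event()   # never set: the failed peer never contributes
    abort = threading.Event()  # set when the (hypothetical) coordinator reports the failure
    timer = threading.Timer(failure_notice_after, abort.set)
    timer.start()
    start = time.monotonic()
    outcome = wait_for_collective(done, abort, timeout)
    elapsed = time.monotonic() - start
    timer.cancel()
    return outcome, elapsed
```

With `simulate(0.05, 2.0)` the wait ends with `"aborted"` shortly after the notice arrives, well before the 2-second timeout. If the notice arrives later than the timeout, as in `simulate(1.0, 0.1)`, the result is `"timeout"`, which corresponds to the behavior I am describing above.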