[KEP-4680] Update README to include configurable HealthCheckTimeout #5476

Open: wants to merge 1 commit into base: master
25 changes: 19 additions & 6 deletions keps/sig-node/4680-add-resource-health-to-pod-status/README.md
@@ -259,12 +259,13 @@ We may consider this as a future improvement.

### Notes/Constraints/Caveats (Optional)

<!--
What are the caveats to the proposal?
What are some important details that didn't come across above?
Go in to as much detail as necessary here.
This might be a good place to talk about core concepts and how they relate.
-->
- **DRA Device Health Timeout Configuration:** The timeout after which a DRA device's health is marked "Unknown"
  when no updates are received can be configured per device through the `health_check_timeout_seconds` field
  in the `DeviceHealth` message. This lets different hardware types (e.g., GPUs, FPGAs, TPUs, storage devices)
  specify timeout values suited to their health-reporting characteristics. If the field is unset or zero,
  Kubelet uses a default timeout of 30 seconds. This addresses
  [Issue #133118](https://github.com/kubernetes/kubernetes/issues/133118) and the discussion in
  [PR #130606](https://github.com/kubernetes/kubernetes/pull/130606/files#r2221829511).

### Risks and Mitigations

@@ -310,6 +311,13 @@ optional, proactive health reporting mechanism from DRA plugins.
will be responsible for reconciling the state reported by the plugin, handling
timeouts for stale data (marking devices as "Unknown" if not updated
within a certain period), and persisting this information across Kubelet restarts.

**Note:** The timeout for marking a device's health as "Unknown" can be
configured per device via the `health_check_timeout_seconds` field in the
`DeviceHealth` message. If the field is unset or zero, Kubelet uses a default
timeout of 30 seconds. This addresses [Issue #133118](https://github.com/kubernetes/kubernetes/issues/133118),
allowing different hardware types (e.g., GPUs, FPGAs, TPUs, storage devices) to specify
timeout values suited to their health-reporting characteristics.

3. **Kubelet Integration:** The DRA Manager in Kubelet will act as the gRPC client.
Upon plugin registration, it will attempt to initiate the health monitoring
@@ -368,6 +376,10 @@ message DeviceHealth {
// Timestamp of when this health status was last determined by the plugin, as a Unix timestamp (seconds).
// Required.
int64 last_updated_timestamp = 4;
// Health check timeout duration in seconds for this device.
// If not specified or zero, Kubelet will use a default timeout.
// Optional.
int64 health_check_timeout_seconds = 5;
Review comment (Member): What would a negative value or 0 mean? Let's specify that in the field description.

Review comment (Member): Quick question for PR review: is a Duration field OK to use in our APIs, or is that not recommended?

}
```

@@ -448,6 +460,7 @@ Planned tests will cover the user-visible behavior of the feature:
#### Beta

- Complete e2e test coverage
- Verify that the configurable device health check timeout works correctly with plugins from different vendors (see [Issue #133118](https://github.com/kubernetes/kubernetes/issues/133118))

#### GA
