Description
Describe the feature you'd like
Today, local-mode endpoint deployment waits a hard-coded 120s for the container to pass its health check.
This does not appear to be consistent with the start-up requirements for actual SageMaker endpoints, and even if it were, it may not be appropriate to assume that local environments have network bandwidth or compute capabilities similar to the target instance types.
How would this feature be used? Please describe.
I'm currently testing a use case with large (e.g. ~5GB+) model archives, and finding that local mode deployment fails due to this health check timeout, even though actual SageMaker endpoint deployments succeed without any issue.
If the default timeout were significantly longer, I think it should work okay. If the timeout were configurable somehow, I could force it to wait longer for my use case.
Describe alternatives you've considered
Possible options could include:
- Extending the timeout
- Making the timeout configurable
- Somehow excluding tarball download/extract time from the coverage of the timeout check
- Supporting decompressed local folders as `model_data` targets for local models/endpoints, instead of requiring S3/tarball.
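
To illustrate the "configurable timeout" option, here is a minimal sketch of a health-check wait loop whose timeout can be overridden. It is not the SDK's actual implementation: the function name `wait_until_healthy`, its parameters, and the `LOCAL_HEALTH_CHECK_TIMEOUT_S` environment variable are all hypothetical, and the 120s default simply mirrors the current hard-coded value.

```python
import os
import time


def wait_until_healthy(ping, timeout_s=None, interval_s=1.0,
                       sleep=time.sleep, clock=time.monotonic):
    """Poll ping() until it returns True or the timeout elapses.

    `ping` is any zero-argument callable that returns True once the
    container answers its health check.
    """
    if timeout_s is None:
        # Hypothetical override; this is NOT an existing SageMaker SDK setting.
        timeout_s = float(os.environ.get("LOCAL_HEALTH_CHECK_TIMEOUT_S", "120"))

    deadline = clock() + timeout_s
    while clock() < deadline:
        if ping():
            return True
        sleep(interval_s)

    raise TimeoutError(f"container not healthy after {timeout_s:.0f}s")
```

With something like this, a user with a large model archive could pass `timeout_s=900`, or set the environment variable, without any change to the default behaviour for everyone else.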
Additional context
N/A