Configurable (or just much longer?) health-check timeout in local mode #3362

@athewsey

Describe the feature you'd like

Today, local-mode endpoint deployment uses a hard-coded health-check timeout of 120s for the container to become healthy.

This does not appear to be consistent with the start-up requirements for actual SageMaker endpoints, and even if it were, it may not be appropriate to assume that local environments have network bandwidth or compute capabilities similar to those of the target instance types.

How would this feature be used? Please describe.

I'm currently testing a use case with large (e.g. ~5GB+) model archives, and finding that local-mode deployment fails due to this health-check timeout, even though actual SageMaker endpoint deployments succeed without any issue.

If the default timeout were significantly longer, I think it would work okay. If the timeout were configurable somehow, I could force it to wait longer for my use case.

Describe alternatives you've considered

Possible options could include:

  • Extending the timeout
  • Making the timeout configurable
  • Somehow excluding tarball download/extract time from the coverage of the timeout check
  • Supporting decompressed local folders as model_data targets for local models/endpoints, instead of requiring S3/tarball.
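
To illustrate the "configurable timeout" option, here is a minimal sketch of what I have in mind. It is not the SDK's actual code: the environment variable name LOCAL_HEALTH_CHECK_TIMEOUT_S and the functions resolve_timeout and wait_for_healthy are hypothetical names made up for this example. Only the 120s default comes from the behaviour described above. The ping argument stands in for whatever check the SDK performs against the container.

```python
import os
import time
from typing import Callable

# Default mirrors the hard-coded 120s value reported in this issue.
DEFAULT_TIMEOUT_S = 120.0


def resolve_timeout(env_var: str = "LOCAL_HEALTH_CHECK_TIMEOUT_S") -> float:
    """Read the timeout from a (hypothetical) environment variable, else use the default."""
    raw = os.environ.get(env_var)
    if raw is None:
        return DEFAULT_TIMEOUT_S
    value = float(raw)
    if value <= 0:
        raise ValueError(f"{env_var} must be a positive number of seconds, got {raw!r}")
    return value


def wait_for_healthy(
    ping: Callable[[], bool],
    timeout_s: float,
    interval_s: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll ping() until it returns True or timeout_s elapses.

    Returns True if the container became healthy, False on timeout.
    clock and sleep are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if ping():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

With something like this, a user with a large model archive could set LOCAL_HEALTH_CHECK_TIMEOUT_S=900 before deploying locally, while everyone else keeps the current 120s behaviour.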

Additional context

N/A
