fix: Windows service error-handling correctness #13312
Open
+87
−48
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Description
The Windows Service wrapper has a couple flaws around how it handles service control:
Run()
method finishing is not responded tootelcol.exe
continues running, doing nothingSERVICE_CONTROL_INTERROGATE
requests will hang during startup (these are always valid to be sent)SERVICE_CONTROL_STOP
command to collector during startupOf these two issues, not properly exiting the service manager wrapper when the actual
col.Run()
call finishes is the more severe. In particular, this condition can be hit if the config watch returns an error or the collector reports an async error, e.g. a component transitioning to the "fatal error" state.The handling of service requests during start pending is less problematic in practice, but per-Windows documentation, a service should handle requests on a separate thread, not blocking during long operations such as
SERVICE_START_PENDING
. Fixing this has the nice side effect of allowing the service to be stopped while it's still in a start pending state.Additionally, within the generic collector code in
col.Run()
(service.go
), fatal errors are logged but not propagated up, which will result in the process erroneously terminating cleanly with exit code 0. These should be returned (along with any shutdown error), and then handled either by the Windows Service wrapper or Cobra, so that they can be logged and cause an appropriate error/exit code to be returned.Link to tracking issue
n/a - did not create issue first
Testing
There is existing test coverage for running the collector as a Windows Service.
As far as I know, it's not practical to introduce a synthetic error to verify this behavior programmatically.
Documentation
None, but a changelog entry is likely warranted.
In particular, propagating config watch + async errors from
col.Run()
can cause non-zero exit codes on ALL platforms (not just Windows). The current behavior is incorrect/buggy, but it's still a notable behavior difference to be considered by end-users depending on their service manager configuration that's actually running the collector in a container or native OS process.