
fix: Windows service error-handling correctness #13312


Open
wants to merge 1 commit into
base: main


@milas milas commented Jul 1, 2025

Description

The Windows Service wrapper has a couple of flaws in how it handles service control:

  • The wrapper does not react when the collector's Run() method returns
    • Impact: the collector can shut down while otelcol.exe keeps running, doing nothing
  • Service control requests are not handled while the service is in the start pending state
    • Impact: SERVICE_CONTROL_INTERROGATE requests hang during startup (these are always valid to send)
    • Impact: it is not possible to send a SERVICE_CONTROL_STOP command to the collector during startup

Of these two issues, not properly exiting the service manager wrapper when the actual col.Run() call finishes is the more severe. In particular, this condition can be hit if the config watch returns an error or the collector reports an async error, e.g. a component transitioning to the "fatal error" state.

The handling of service requests during start pending is less problematic in practice, but per the Windows documentation, a service should handle requests on a separate thread rather than blocking during long operations such as startup (while the service is in the SERVICE_START_PENDING state). Fixing this has the nice side effect of allowing the service to be stopped while it's still in a start pending state.
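To illustrate both points, here is a minimal sketch of a control loop that keeps answering requests during startup and exits as soon as the collector's run call returns. It is a simplified model, not the code in this PR: `Cmd`, `State`, and `serve` are hypothetical stand-ins, and the real wrapper would work with the request and status channels from `golang.org/x/sys/windows/svc` instead.

```go
package main

// Cmd and State are hypothetical stand-ins for the Windows service control
// codes and service states; they are not the real svc package types.
type Cmd int

const (
	Interrogate Cmd = iota
	Stop
)

type State int

const (
	StartPending State = iota
	Running
	StopPending
	Stopped
)

// serve models the service control loop.
//   - started is closed once collector startup has completed.
//   - runDone delivers the result of the collector's run call.
//   - stop asks the collector to shut down; it must not block.
//
// Because startup runs elsewhere (for example in its own goroutine), the loop
// can answer requests while the state is still StartPending.
func serve(requests <-chan Cmd, status chan<- State, started <-chan struct{}, runDone <-chan error, stop func()) error {
	state := StartPending
	status <- state
	for {
		select {
		case <-started:
			started = nil // a nil channel is never selected again
			if state == StartPending {
				state = Running
				status <- state
			}
		case cmd := <-requests:
			switch cmd {
			case Interrogate:
				// Always valid: report the current state, even during startup.
				status <- state
			case Stop:
				if state != StopPending {
					state = StopPending
					status <- state
					stop()
				}
			}
		case err := <-runDone:
			// The collector finished, on request or on its own (for example
			// after a fatal error). Leave the loop so the process can exit.
			status <- Stopped
			return err
		}
	}
}
```

The key design choice in this sketch is that the result of the run call is just another case in the `select`, so a collector that stops on its own ends the loop instead of leaving an idle process behind.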

Additionally, within the generic collector code in col.Run() (service.go), fatal errors are logged but not propagated up, so the process erroneously terminates cleanly with exit code 0. These errors should be returned (along with any shutdown error) and then handled by either the Windows Service wrapper or Cobra, so that they can be logged and cause an appropriate error/exit code to be returned.
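A minimal sketch of that propagation, assuming the fatal error and the shutdown error are combined with the standard library's `errors.Join` (Go 1.20+). The names `runAndShutdown`, `exitCode`, `errFatal`, and `errShutdown` are hypothetical and only illustrate the idea; the actual change in service.go may combine the errors differently.

```go
package main

import "errors"

// Hypothetical sentinel errors used only for illustration.
var (
	errFatal    = errors.New("component reported fatal error")
	errShutdown = errors.New("shutdown failed")
)

// runAndShutdown is a hypothetical stand-in for the tail end of a run
// function. asyncErr is the fatal error observed while running (nil on a
// normal stop), and shutdown stops the service and may fail itself.
func runAndShutdown(asyncErr error, shutdown func() error) error {
	// Shutdown always runs, even after a fatal error.
	shutdownErr := shutdown()
	// errors.Join discards nil values and returns nil when every value is
	// nil, so a clean run with a clean shutdown still returns nil.
	return errors.Join(asyncErr, shutdownErr)
}

// exitCode maps the returned error to a process exit code, which is what a
// caller such as a CLI entry point or a service wrapper would report.
func exitCode(err error) int {
	if err != nil {
		return 1
	}
	return 0
}
```

With this shape, a caller that evaluates `exitCode(runAndShutdown(errFatal, shutdown))` reports a non-zero exit code even when the shutdown itself succeeds, instead of exiting with 0.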

Link to tracking issue

n/a - did not create issue first

Testing

There is existing test coverage for running the collector as a Windows Service.

As far as I know, it's not practical to introduce a synthetic error to verify this behavior programmatically.

Documentation

None, but a changelog entry is likely warranted.

In particular, propagating config watch and async errors from col.Run() can cause non-zero exit codes on ALL platforms (not just Windows). The current behavior is incorrect/buggy, but this is still a notable behavior change for end users to consider, depending on how the service manager that actually runs the collector (in a container or as a native OS process) is configured.

@milas milas requested a review from a team as a code owner July 1, 2025 19:05
@milas milas requested a review from bogdandrutu July 1, 2025 19:05