fix: Windows service error-handling correctness #13312

milas · 2025-07-01T19:05:12Z

Description

The Windows Service wrapper has a couple of flaws in how it handles service control:

- The collector's Run() method finishing is not handled
  - Impact: the collector can shut down while otelcol.exe continues running, doing nothing
- Requests are not handled during start pending
  - Impact: SERVICE_CONTROL_INTERROGATE requests will hang during startup (these are always valid to be sent)
  - Impact: it is not possible to send a SERVICE_CONTROL_STOP command to the collector during startup

Of these two issues, not properly exiting the service manager wrapper when the actual col.Run() call finishes is the more severe. In particular, this condition can be hit if the config watch returns an error or the collector reports an async error, e.g. a component transitioning to the "fatal error" state.

The handling of service requests during start pending is less problematic in practice, but per the Windows documentation, a service should handle control requests on a separate thread rather than blocking during long operations such as startup (while in the SERVICE_START_PENDING state). Fixing this has the nice side effect of allowing the service to be stopped while it is still in the start pending state.

Additionally, within the generic collector code in col.Run() (service.go), fatal errors are logged but not propagated up, which will result in the process erroneously terminating cleanly with exit code 0. These should be returned (along with any shutdown error), and then handled either by the Windows Service wrapper or Cobra, so that they can be logged and cause an appropriate error/exit code to be returned.

Link to tracking issue

n/a - did not create issue first

Testing

There is existing test coverage for running the collector as a Windows Service.

As far as I know, it's not practical to introduce a synthetic error to verify this behavior programmatically.

Documentation

None, but a changelog entry is likely warranted.

In particular, propagating config watch and async errors from col.Run() can cause non-zero exit codes on all platforms, not just Windows. The current behavior is incorrect, but this is still a notable behavior change that end users should consider, depending on the configuration of the service manager that actually runs the collector, whether in a container or as a native OS process.

fix: stop Windows service on collector error

119c853

milas requested a review from a team as a code owner July 1, 2025 19:05

milas requested a review from bogdandrutu July 1, 2025 19:05
