Emit outcome: failure in obsconsumer #13234

Merged

Conversation

jade-guiton-dd
Contributor

Description

The last remaining part of #12676 is to implement the `outcome: failure` part of the Pipeline Component Telemetry RFC (see here). This is done by introducing a downstream error wrapper struct to distinguish errors originating in the next component from errors bubbled up from further downstream.
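For illustration, a minimal sketch of what such a wrapper could look like follows; the names (`downstreamError`, `NewDownstream`, `IsDownstream`) are hypothetical and not necessarily the API this PR adds to `consumererror`:

```go
// Hypothetical sketch of a downstream error wrapper; names are illustrative,
// not necessarily the consumererror API introduced by this PR.
package consumererror

import "errors"

// downstreamError marks an error as having originated further downstream,
// i.e. past the component whose instrumentation layer first observed it.
type downstreamError struct {
	err error
}

func (e downstreamError) Error() string { return e.err.Error() }
func (e downstreamError) Unwrap() error { return e.err }

// NewDownstream wraps err so that instrumentation layers further upstream can
// tell that the failure was not caused by their own "next" component.
func NewDownstream(err error) error {
	if err == nil {
		return nil
	}
	return downstreamError{err: err}
}

// IsDownstream reports whether err, or any error it wraps, carries the
// downstream marker.
func IsDownstream(err error) bool {
	var de downstreamError
	return errors.As(err, &de)
}
```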

Important note

This PR implements things slightly differently from what the text of the RFC describes.

If a pipeline contains components A → B and an error occurs in B, this PR records:

  • `otelcol.component.outcome = failure` in the `otelcol.*.consumed.*` metric for B
  • `otelcol.component.outcome = refused` in the `otelcol.*.produced.*` metric for A

whereas the RFC would set both outcomes to `failure`.

This is programmatically simpler — no need to have different behavior between the obsconsumer around the output of A and the one around the input of B — but more importantly, I think it is clearer for users as well: `outcome = failure` only occurs on metrics associated with the component where the failure actually occurred.
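As a rough sketch of this outcome mapping (the function `outcomeFor` and the `isDownstream` predicate are hypothetical names standing in for the helpers sketched above, not the PR's actual obsconsumer code):

```go
// Hypothetical sketch of the outcome mapping described above; not the PR's
// actual obsconsumer code.
package obsconsumer

// outcomeFor maps the error returned by the next consumer to the value of the
// otelcol.component.outcome attribute. isDownstream stands in for a helper
// like the IsDownstream sketch above.
func outcomeFor(err error, isDownstream func(error) bool) string {
	switch {
	case err == nil:
		return "success"
	case isDownstream(err):
		// The failure happened in a later component; this component's data
		// was merely refused by its consumer.
		return "refused"
	default:
		// The next component itself failed. The instrumentation layer would
		// then wrap the error as downstream before returning it, so layers
		// further upstream record "refused" rather than "failure".
		return "failure"
	}
}
```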

This subtlety wasn't discussed in depth in #11956, which introduced `outcome = refused`, so I took the liberty of making this change. If necessary, I can file another RFC amendment to match, or, if there are objections, implement the RFC as written instead.

Link to tracking issue

Fixes #12676

Testing

I've updated the existing tests in obsconsumer to expect a downstream-wrapped error to exit the obsconsumer layer. I may add more tests later.
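For example, such a test could look roughly like this (a sketch with hypothetical helper names mirroring the wrapper above, not the actual test code):

```go
// Hypothetical test sketch: an error returned through the instrumentation
// layer should exit wrapped as a downstream error. Names are illustrative.
package obsconsumer

import (
	"errors"
	"testing"
)

type downstreamError struct{ err error }

func (e downstreamError) Error() string { return e.err.Error() }
func (e downstreamError) Unwrap() error { return e.err }

// wrapDownstream stands in for the error wrapping an obsconsumer would apply
// before returning an error from the next component.
func wrapDownstream(err error) error {
	if err == nil {
		return nil
	}
	return downstreamError{err: err}
}

func TestErrorExitsWrappedAsDownstream(t *testing.T) {
	nextErr := errors.New("next component failed")
	got := wrapDownstream(nextErr)

	var de downstreamError
	if !errors.As(got, &de) {
		t.Fatalf("expected a downstream-wrapped error, got %v", got)
	}
}
```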

Documentation

None.

codecov bot commented Jun 18, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 91.45%. Comparing base (f68d710) to head (61ba663).
Report is 14 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main   #13234   +/-   ##
=======================================
  Coverage   91.44%   91.45%           
=======================================
  Files         533      534    +1     
  Lines       29564    29596   +32     
=======================================
+ Hits        27034    27066   +32     
  Misses       1998     1998           
  Partials      532      532           


github-actions bot commented Jul 8, 2025

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Jul 8, 2025
github-merge-queue bot pushed a commit that referenced this pull request Jul 9, 2025
#### Description

This PR updates the [Pipeline Component Telemetry
RFC](https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/rfcs/component-universal-telemetry.md)
with the following changes:
- Reflect implementation choices that have been made since the RFC was
  written:
  1. using instrumentation scope attributes instead of datapoint attributes
     to identify component instances (see discussion in #12217 and
     open-telemetry/opentelemetry-go#6404)
  2. automatically injecting these attributes, without changes to component
     code
  3. changing the instrumentation scope name used for pipeline metrics
- Slightly change the semantics of `outcome = refused`:

The current planned behavior (from #11956) is that, in the case of a
pipeline A → B where component B returns an error, the "consumed" metric
for B and the "produced" metric for A should both have `outcome =
failure`.

I fear that this may lead users to think that a failure occurred in A,
and would like to restrict `outcome = failure` to only be associated
with the component that "failed", i.e. component B. The "produced" metric
associated with A would instead have `outcome = refused`.

This incidentally makes implementation slightly easier, since an
instrumentation layer will not need different error wrapping behavior
between the "producer" layer and the "consumer" layer.

See draft PR #13234 for an example implementation.

As this is a non-trivial change to an RFC, it may need to follow the RFC
process.

Co-authored-by: Alex Boten <[email protected]>
@mx-psi
Member

mx-psi commented Jul 10, 2025

Is this ready for review now that the RFC amendment has been merged?

@jade-guiton-dd
Contributor Author

Not quite, I'm thinking of adding some additional tests. Not sure when I'll have the time to get to it though 😅

@jade-guiton-dd jade-guiton-dd marked this pull request as ready for review July 15, 2025 16:39
@jade-guiton-dd jade-guiton-dd requested a review from a team as a code owner July 15, 2025 16:39
@jade-guiton-dd jade-guiton-dd requested a review from dmitryax July 15, 2025 16:39
@jade-guiton-dd
Contributor Author

Looks like the contrib failures are due to #13364. This should be ready for review.

Member

@mx-psi mx-psi left a comment

LGTM. @evan-bradley could you review the consumererror bits?

@mx-psi mx-psi requested a review from evan-bradley July 18, 2025 10:22
@jade-guiton-dd
Contributor Author

I think I've addressed your comments, could you take another look @evan-bradley?

Contributor

@evan-bradley evan-bradley left a comment

Looks good to me. Thanks for your patience with all my questions. 🙂

@jade-guiton-dd
Contributor Author

We discussed whether `errors.Join(downstream, notDownstream)` should be considered downstream or not during the Collector stability meeting (see this comment thread for context). Some points that were raised:

  • A more idiomatic way to add context to a downstream error would be `fmt.Errorf("error in <context>: %w", downstream)`, so it's not clear that we should support the "additional context" interpretation of `errors.Join`.
  • If a component experiences an issue, but still succeeds in forwarding data to the next component, that issue was clearly not fatal to the processing of that payload. It should thus probably be surfaced as a warning log, not as an error returned to the caller with `errors.Join`. So it's not clear that we should support the "unrelated failure" interpretation of `errors.Join` either.
  • Given that we can't really think of an idiomatic use case for this `errors.Join` call, we may want to consider it "undefined behavior", and default to the simplest implementation using `errors.As` (see the sketch below). This is the current state of the PR.

I don't remember very well, but I think @dmitryax raised the point that perhaps we should instead default to the interpretation which is least likely to "hide" internal failures. Do you have objections to the current logic in the PR?
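
For reference, a small self-contained example of how an `errors.As`-based check classifies this edge case, assuming a wrapper type like the hypothetical `downstreamError` sketched in the PR description (not the PR's actual code):

```go
// Self-contained illustration of the errors.Join edge case discussed above,
// using a hypothetical downstreamError wrapper; not the PR's actual code.
package main

import (
	"errors"
	"fmt"
)

type downstreamError struct{ err error }

func (e downstreamError) Error() string { return e.err.Error() }
func (e downstreamError) Unwrap() error { return e.err }

func main() {
	downstream := downstreamError{err: errors.New("exporter failed")}
	local := errors.New("processor-internal problem")

	joined := errors.Join(downstream, local)

	// errors.As walks every error in the join, so the combined error is
	// classified as downstream even though it also carries a local failure.
	var de downstreamError
	fmt.Println(errors.As(joined, &de)) // prints: true
}
```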

@mx-psi
Copy link
Member

mx-psi commented Jul 23, 2025

@open-telemetry/collector-approvers Based on the above comment I think we can merge this by EOW (after the merge conflicts are resolved) unless there are objections. If we explicitly consider this edge case to be unspecified behavior, I think we can go ahead with the choice we made.

@mx-psi mx-psi added this pull request to the merge queue Jul 25, 2025
Merged via the queue into open-telemetry:main with commit 545866f Jul 25, 2025
56 checks passed