Update Pipeline Component Telemetry RFC #13260

Open
wants to merge 2 commits into
base: main

jade-guiton-dd
Contributor

Description

This PR updates the Pipeline Component Telemetry RFC with the following changes:

  • Reflect implementation choices that have been made since the RFC was written:

    1. using instrumentation scope attributes instead of datapoint attributes to identify component instances
      (see the discussions in "System for managing own telemetry attributes within pipeline components" #12217 and "Attribute injection in the Collector" opentelemetry-go#6404)
    2. automatically injecting these attributes, without changes to component code
    3. changing the instrumentation scope name used for pipeline metrics
  • Slightly change the semantics of outcome = refused:

    The currently planned behavior (from Amend Pipeline Component Telemetry RFC to add a "rejected" outcome #11956) is that, in the case of a pipeline A → B where component B returns an error, the "consumed" metric for B and the "produced" metric for A should both have outcome = failure.

    I fear that this may lead users to think that a failure occurred in A, so I would like to restrict outcome = failure to the component that actually failed, i.e. component B. The "produced" metric associated with A would instead have outcome = refused.

    This incidentally makes implementation slightly easier, since an instrumentation layer will not need different error wrapping behavior between the "producer" layer and the "consumer" layer.

    See draft PR Emit outcome: failure in obsconsumer #13234 for an example implementation.
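As a rough illustration of the proposed behavior (separate from the draft PR, and written in Python rather than the Collector's Go), the sketch below uses a single wrapper for both the "producer" and "consumer" layers. All names here (`DownstreamError`, `Recorder`, `instrument`, `run_a_to_b`) are hypothetical and exist only for this example; the short metric names "consumed" and "produced" stand in for the real metric names.

```python
from dataclasses import dataclass, field


class DownstreamError(Exception):
    """Hypothetical marker: the wrapped error was returned by a downstream component."""

    def __init__(self, cause: Exception):
        super().__init__(str(cause))
        self.cause = cause


@dataclass
class Recorder:
    """Collects (metric, component, outcome) points in place of a real metrics SDK."""

    points: list = field(default_factory=list)

    def record(self, metric: str, component: str, outcome: str) -> None:
        self.points.append((metric, component, outcome))


def instrument(consume, metric: str, component: str, recorder: Recorder):
    """Wrap a consume function with one instrumentation layer.

    The same wrapper is used for the "produced" layer of the upstream
    component and the "consumed" layer of the downstream component.
    """

    def wrapped(items):
        try:
            consume(items)
        except DownstreamError:
            # Already tagged by a layer further down: this component did not fail.
            recorder.record(metric, component, "refused")
            raise
        except Exception as err:
            # Untagged error: it comes directly from the wrapped component.
            recorder.record(metric, component, "failure")
            # Tag the error as coming from downstream before returning it to the caller.
            raise DownstreamError(err) from err
        else:
            recorder.record(metric, component, "success")

    return wrapped


def run_a_to_b(b_fails: bool) -> list:
    """Simulate component A sending items to component B; return the recorded points."""

    def component_b(items):
        if b_fails:
            raise ValueError("B cannot process these items")

    recorder = Recorder()
    b_consumed = instrument(component_b, "consumed", "B", recorder)
    a_produced = instrument(b_consumed, "produced", "A", recorder)
    try:
        a_produced(["item"])
    except DownstreamError:
        pass  # Component A would receive the tagged error here.
    return recorder.points


if __name__ == "__main__":
    for point in run_a_to_b(b_fails=True):
        print(point)
```

With `b_fails=True`, the layer around B sees an untagged error and records `failure`, while the layer around A sees the tagged error and records `refused`. Neither layer needs to know whether it is the "producer" or the "consumer" layer.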

As this is a non-trivial change to an RFC, it may need to follow the RFC process.

jade-guiton-dd added the Skip Changelog label (PRs that do not require a CHANGELOG.md entry) on Jun 24, 2025
jade-guiton-dd requested a review from a team as a code owner on June 24, 2025 12:30

codecov bot commented Jun 24, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 91.55%. Comparing base (7f957ed) to head (e4765d4).
Report is 25 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #13260      +/-   ##
==========================================
- Coverage   91.57%   91.55%   -0.03%     
==========================================
  Files         522      522              
  Lines       29089    29089              
==========================================
- Hits        26639    26631       -8     
- Misses       1933     1939       +6     
- Partials      517      519       +2     

Member

@songy23 songy23 left a comment

LGTM


The upstream component which called `ConsumeX` will have this `otelcol.component.outcome` attribute applied to its produced measurements, and the downstream
component that `ConsumeX` was called on will have the attribute applied to its consumed measurements.
After inspecting the error, the instrumentation layer should tag it as coming from downstream before returning it to the caller. Since there are two instrumentation layers between each pair of successive components (one recording produced data and one recording consumed data), this means that a call recorded with `outcome = failure` by the "consumer" layer will be recorded with `outcome = refused` by the "producer" layer, reflecting the fact that only the "consumer" component failed. In all other cases, the `outcome` recorded by both layers should be identical.
Contributor

This sounds great. I think it is useful and important to identify the original failure as distinct from the subsequent refusals. Is this really a change, relative to what's already written for failure and refused? (Referring to lines 100-101.)

Contributor Author

@jade-guiton-dd jade-guiton-dd Jun 25, 2025

The current version of the RFC already allows users to identify where the original failure occurred. It's just that they need to be careful in interpreting the metrics:

  • a metric point like otelcol.processor.consumed.items (otelcol.component.id = transform, outcome = failure) means that a failure occurred in the Consume call where the transform processor consumed items, i.e. it happened in the transform processor's code.
  • a metric point like otelcol.processor.produced.items (otelcol.component.id = transform, outcome = failure) means that a failure occurred in the Consume call where the transform processor produced items, i.e. the transform processor succeeded in transforming data, but the next component in the pipeline returned an error.

The change in this PR just amounts to saying that the second case should be labeled as outcome = refused instead, like for all the components further upstream. As a side effect, it means that we never see outcome = failure on a "produced" metric.
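To make the difference in interpretation concrete, here is a small Python sketch of how a reader of these metrics might classify one metric point before and after the change. The function `failure_origin` and its return values are made up for illustration and are not part of the Collector; it only assumes that "produced" metric names contain `.produced.`, as in the examples above.

```python
def failure_origin(metric: str, outcome: str, amended: bool = True) -> str:
    """Classify where an error originated for a single metric point.

    Returns "self" (the component owning the metric failed), "downstream"
    (a later component returned the error), "none" (no error), or
    "unexpected" (a combination the amended RFC should never produce).
    """
    if outcome == "refused":
        return "downstream"
    if outcome != "failure":
        return "none"
    if ".produced." in metric:
        # Before the amendment, failure on a "produced" metric meant that the
        # next component failed. After it, this case is labeled refused instead.
        return "unexpected" if amended else "downstream"
    # failure on a "consumed" metric: the error comes from this component's code.
    return "self"


if __name__ == "__main__":
    print(failure_origin("otelcol.processor.consumed.items", "failure"))
    print(failure_origin("otelcol.processor.produced.items", "failure", amended=False))
    print(failure_origin("otelcol.processor.produced.items", "refused"))
```

Under the amended semantics, `outcome = failure` always maps to "self", so the reader no longer needs to check whether the metric is a "consumed" or a "produced" one.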

Hope that clears things up.

@jade-guiton-dd
Contributor Author

I'll announce this RFC amendment at Wednesday's SIG meeting; after that, we can start the waiting period.

mx-psi added the rfc:final-comment-period label (This RFC is in the final comment period phase) on Jul 2, 2025
@mx-psi
Member

mx-psi commented Jul 2, 2025

If there are no further objections, this can be merged on or after July 9th (added one more day to account for a lot of folks in the US)

Labels
  • rfc:final-comment-period: This RFC is in the final comment period phase
  • Skip Changelog: PRs that do not require a CHANGELOG.md entry
  • Skip Contrib Tests
7 participants