Update Pipeline Component Telemetry RFC #13260
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:
@@            Coverage Diff             @@
##             main   #13260      +/-   ##
==========================================
- Coverage   91.57%   91.55%   -0.03%
==========================================
  Files         522      522
  Lines       29089    29089
==========================================
- Hits        26639    26631       -8
- Misses       1933     1939       +6
- Partials      517      519       +2
LGTM
The upstream component that called `ConsumeX` will have this `otelcol.component.outcome` attribute applied to its produced measurements, and the downstream component that `ConsumeX` was called on will have the attribute applied to its consumed measurements.
After inspecting the error, the instrumentation layer should tag it as coming from downstream before returning it to the caller. Since there are two instrumentation layers between each pair of successive components (one recording produced data and one recording consumed data), a call recorded with `outcome = failure` by the "consumer" layer will be recorded with `outcome = refused` by the "producer" layer, reflecting the fact that only the "consumer" component failed. In all other cases, the `outcome` recorded by both layers should be identical.
This sounds great. I think it is useful and important to identify the original failure as distinct from the subsequent refusals. Is this really a change, relative to what's already written for failure and refused? (Referring to lines 100-101.)
The current version of the RFC already allows users to identify where the original failure occurred. It's just that they need to be careful when interpreting the metrics:
- a metric point like `otelcol.processor.consumed.items (otelcol.component.id = transform, outcome = failure)` means that a failure occurred in the `Consume` call where the transform processor consumed items, i.e. it happened in the transform processor's code;
- a metric point like `otelcol.processor.produced.items (otelcol.component.id = transform, outcome = failure)` means that a failure occurred in the `Consume` call where the transform processor produced items, i.e. the transform processor succeeded in transforming data, but the next component in the pipeline returned an error.

The change in this PR just amounts to saying that the second case should be labeled as `outcome = refused` instead, like for all the components further upstream. As a side effect, it means that we never see `outcome = failure` on a "produced" metric.

Hope that clears things up.
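The reading of metric points described above can be summarized as a small lookup. This is a hypothetical helper (`Point` and `Explain` are not part of the Collector), written against the amended semantics, under which a "produced" metric never carries `outcome = failure`.

```go
package main

import "fmt"

// Point is a simplified, hypothetical view of one metric data point, e.g.
// otelcol.processor.consumed.items{otelcol.component.id=transform, outcome=failure}.
type Point struct {
	ComponentID string
	Direction   string // "consumed" or "produced"
	Outcome     string // "success", "failure" or "refused"
}

// Explain returns how a point should be read under the amended semantics.
func Explain(p Point) string {
	switch {
	case p.Outcome == "success":
		return fmt.Sprintf("%s %s the items successfully", p.ComponentID, p.Direction)
	case p.Direction == "consumed" && p.Outcome == "failure":
		return fmt.Sprintf("the error originated in %s's own code", p.ComponentID)
	case p.Direction == "produced" && p.Outcome == "refused":
		return fmt.Sprintf("%s succeeded, but the next component returned an error", p.ComponentID)
	case p.Direction == "consumed" && p.Outcome == "refused":
		return fmt.Sprintf("%s passed on an error from further downstream", p.ComponentID)
	case p.Direction == "produced" && p.Outcome == "failure":
		return "not expected: produced metrics never carry outcome=failure"
	default:
		return "unknown combination"
	}
}

func main() {
	fmt.Println(Explain(Point{"transform", "consumed", "failure"}))
	fmt.Println(Explain(Point{"transform", "produced", "refused"}))
}
```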
I'll announce this RFC amendment at Wednesday's SIG meeting, then we can start the waiting period.
If there are no further objections, this can be merged on or after July 9th (I added one more day to account for a lot of folks in the US).
Description
This PR updates the Pipeline Component Telemetry RFC with the following changes:
- Reflect implementation choices that have been made since the RFC was written (see discussion in "System for managing own telemetry attributes within pipeline components" #12217 and "Attribute injection in the Collector" opentelemetry-go#6404).
- Slightly change the semantics of `outcome = refused`:
  The current planned behavior (from #11956, Amend Pipeline Component Telemetry RFC to add a "rejected" outcome) is that, in the case of a pipeline A → B where component B returns an error, the "consumed" metric for B and the "produced" metric for A should both have `outcome = failure`.
  I fear that this may lead users to think that a failure occurred in A, and would like to restrict `outcome = failure` to only be associated with the component that "failed", i.e. component B. The "produced" metric associated with A would instead have `outcome = refused`.
  This incidentally makes implementation slightly easier, since an instrumentation layer will not need different error-wrapping behavior between the "producer" layer and the "consumer" layer.
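To make the before/after comparison concrete, the sketch below lists the expected outcome on each metric for a linear pipeline whose last component fails in its own code, under the previous rule and under the amended one. It is an illustration of the two rules as described in this PR, not Collector code; `ExpectedOutcomes` is a hypothetical name, and it assumes the first component is a receiver (no "consumed" metric) and the last is an exporter (no "produced" metric).

```go
package main

import (
	"fmt"
	"sort"
)

// ExpectedOutcomes maps "<component> <consumed|produced>" to the outcome
// recorded when the last component of a linear pipeline returns an error
// from its own code. amended selects the semantics proposed in this PR.
func ExpectedOutcomes(pipeline []string, amended bool) map[string]string {
	out := map[string]string{}
	last := len(pipeline) - 1
	for i, name := range pipeline {
		if i > 0 { // the first component (a receiver) has no consumed metric
			if i == last {
				out[name+" consumed"] = "failure" // the component that actually failed
			} else {
				out[name+" consumed"] = "refused"
			}
		}
		if i < last { // the last component (an exporter) has no produced metric
			if i == last-1 && !amended {
				// Previous rule: the immediate upstream producer also records failure.
				out[name+" produced"] = "failure"
			} else {
				out[name+" produced"] = "refused"
			}
		}
	}
	return out
}

func main() {
	pipeline := []string{"receiver", "processor", "exporter"}
	for _, amended := range []bool{false, true} {
		out := ExpectedOutcomes(pipeline, amended)
		keys := make([]string, 0, len(out))
		for k := range out {
			keys = append(keys, k)
		}
		sort.Strings(keys)
		fmt.Println("amended:", amended)
		for _, k := range keys {
			fmt.Printf("  %s: %s\n", k, out[k])
		}
	}
}
```

The two runs differ only in `processor produced`: with the amendment, `failure` appears solely on the exporter's "consumed" metric.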

See draft PR "Emit `outcome: failure` in obsconsumer" #13234 for an example implementation. As this is a non-trivial change to an RFC, it may need to follow the RFC process.