fix(dsm): set dsm checkpoint for all records in event #643
base: main
Conversation
The big question I guess is do we:
- Couple APM & DSM a bit more, but avoid deserializing the Datadog context twice for the first record
- Couple APM & DSM less, but at the cost of deserializing the Datadog context twice for the first record
datadog_lambda/dsm.py
Outdated
not event_source.equals(EventTypes.KINESIS)
and not event_source.equals(EventTypes.SNS)
and not event_source.equals(EventTypes.SQS)
):
do we expect this to happen? If not, let's add a debug log here.
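For example, something along these lines (the exact message is just a suggestion, and `logger` is assumed to be the module-level logger already used in dsm.py):

```python
if (
    not event_source.equals(EventTypes.KINESIS)
    and not event_source.equals(EventTypes.SNS)
    and not event_source.equals(EventTypes.SQS)
):
    # Record why no DSM checkpoint was set for this invocation.
    logger.debug("DSM checkpoint not set: unsupported event source %s", event_source)
    return
```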
I agree the double deserialization introduces some performance cost. But we only need one extra deserialization for tracing purposes, and one json.loads of ~50 bytes is sub-microsecond (see the quick benchmark sketch after this comment), so I feel comfortable not worrying about it at all. I'd say decoupling and better code structure are worth more, and for Lambdas, cold-start costs are more critical than this. With that being said:
- If in the future we need to deserialize the full trace context for span links, for example, that would bring the cost to the millisecond level for each invocation, and we might want to refactor the code to do it only once then.
- Might be too late to mention, but I'm starting to think datadog-lambda-js may be a better starting place for a refactor like this, because there we already use a bunch of extractors and it could be easier to restructure.
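To put a rough number on that cost claim, a quick standalone micro-benchmark sketch (the payload below is made up, roughly the size of an injected Datadog context):

```python
import json
import timeit

# Hypothetical small Datadog context, similar to what a record's `_datadog`
# message attribute carries.
payload = '{"x-datadog-trace-id": "1234567890123", "x-datadog-parent-id": "9876543210"}'

# One json.loads of a payload this size is typically a few hundred nanoseconds,
# negligible next to Lambda cold-start costs.
runs = 100_000
per_call = timeit.timeit(lambda: json.loads(payload), number=runs) / runs
print(f"json.loads per call: {per_call * 1e9:.0f} ns")
```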
datadog_lambda/tracing.py
Outdated
)
# Handle case where trace context is injected into attributes.AWSTraceHeader
# example: Root=1-654321ab-000000001234567890abcdef;Parent=0123456789abcdef;Sampled=1
attrs = event.get("Records")[0].get("attributes")
here, it's accessing records[0] again. I would put this whole section in an else of the if idx == 0 (line 315)
- attrs = event.get("Records")[0].get("attributes")
+ attrs = record.get("attributes")
datadog_lambda/tracing.py
Outdated
)
if idx == 0:
    context = propagator.extract(dd_data)
    dsm_data = dd_data
else:
in this else, dsm_data is not set. Is that an issue?
Not an issue. I checked in dd-trace-py and DSM never injects context into attributes.AWSTraceHeader, so we can just set a checkpoint with None.
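For illustration, a hedged sketch of what "set a checkpoint with None" can look like on top of ddtrace's public data streams API; the wrapper name mirrors _dsm_set_checkpoint from this PR, but the body here is only an assumption:

```python
from ddtrace.data_streams import set_consume_checkpoint


def _dsm_set_checkpoint(context_json, event_type, arn):
    # Sketch: context_json is the decoded `_datadog` payload (a dict), or None
    # when the record carried no DSM context (e.g. AWSTraceHeader-only records).
    if not arn:
        return
    try:
        def carrier_get(key):
            # With no context, every lookup returns None and DSM simply starts
            # a new pathway for this record.
            return context_json.get(key) if context_json else None

        set_consume_checkpoint(event_type, arn, carrier_get)
    except Exception:
        # DSM bookkeeping must never break trace extraction.
        pass
```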
datadog_lambda/tracing.py
Outdated
)
context = None
for idx, record in enumerate(records):
    dsm_data = None
this is not specific to dsm, it's dd_ctx
datadog_lambda/tracing.py
Outdated
    if idx == 0
    else context
)
_dsm_set_checkpoint(None, "kinesis", source_arn)
we are setting a "kinesis" checkpoint even when the event is not Kinesis; I think the name "kinesis" is wrong there, because this code is a bit confusing.
Hmm. This is deep enough inside the extract function that we believe the event source is Kinesis, based on the parsing done beforehand. However, all AWS documentation says a Kinesis Lambda event should have this field. To my understanding, this check is for Lambda synchronous invocations with records that match the Kinesis shape but don't actually come from a Kinesis stream. @DataDog/apm-serverless can you help confirm why this check is here in the first place?
datadog_lambda/tracing.py
Outdated
for idx, record in enumerate(records):
    try:
        source_arn = record.get("eventSourceARN", "")
        dsm_data = None
Same as below: dsm_data is not specific to data streams here. I would name it dd_ctx, or something like that.
datadog_lambda/tracing.py
Outdated
if dd_json_data_type == "Binary":
    import base64
context = None
records = (
Maybe let's have records always be event.get("Records", []); however, in the loop, we can break early if data streams is disabled.
if index == 0:
    do apm stuff
    if data streams is not enabled:
        break
datadog_lambda/tracing.py
Outdated
)
# Handle case where trace context is injected into attributes.AWSTraceHeader
# example: Root=1-654321ab-000000001234567890abcdef;Parent=0123456789abcdef;Sampled=1
attrs = event.get("Records")[0].get("attributes")
- attrs = event.get("Records")[0].get("attributes")
+ attrs = record.get("attributes")
datadog_lambda/tracing.py
Outdated
    "Failed to extract Step Functions context from SQS/SNS event."
)
context = propagator.extract(dd_data)
if not config.data_streams_enabled:
this break is too hidden. It is inside the if dd_payload block?
I suggest this high-level approach:
apm_context = None
for record in records:
    context = extract_context(record)
    if apm_context is None:
        apm_context = context
    if data_streams_enabled:
        set_checkpoint()
    if not data_streams_enabled:
        # APM only looks at the first record.
        break
You can break the code down into helper functions to make that structure very clear.
Basically, can you avoid some of the nested conditions, like the if not config.data_streams_enabled here?
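In Python, that suggested shape could look roughly like the sketch below; it is illustrative only, the helper names reuse ones discussed in this thread, and config.data_streams_enabled is assumed to be the existing flag:

```python
def _extract_contexts_and_set_checkpoints(event, event_type):
    # Illustrative structure only, not the PR's final code.
    apm_context = None
    for record in event.get("Records", []):
        dd_ctx = _extract_context_from_sqs_or_sns_record(record)
        if apm_context is None:
            apm_context = _extract_apm_context(dd_ctx, record)
        if config.data_streams_enabled:
            _dsm_set_checkpoint(dd_ctx, event_type, record.get("eventSourceARN", ""))
        else:
            # APM only looks at the first record.
            break
    return apm_context
```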
datadog_lambda/tracing.py
Outdated
# example: Root=1-654321ab-000000001234567890abcdef;Parent=0123456789abcdef;Sampled=1
attrs = event.get("Records")[0].get("attributes")
if attrs:
    x_ray_header = attrs.get("AWSTraceHeader")
maybe put this logic in the extract_context I suggested above. extract_context can take an argument: extract_from_xray?
(extract_context is probably not a great name; I'll let you find a better one.)
Hmm, I might be misinterpreting, but I'm not sure we should have one function return both a Context() object and the return value of a json.loads(). I ended up splitting the X-Ray extractor into another helper; let me know what you think.
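As a rough illustration of that split, a small standalone parser for the AWSTraceHeader string shown in the diff above; the helper name is hypothetical, and mapping the X-Ray IDs onto a Datadog Context is left to the PR's actual extractor:

```python
def _parse_aws_trace_header(header):
    # Parse "Root=...;Parent=...;Sampled=..." into a dict, e.g.
    # "Root=1-654321ab-000000001234567890abcdef;Parent=0123456789abcdef;Sampled=1"
    # -> {"Root": "1-654321ab-...", "Parent": "0123456789abcdef", "Sampled": "1"}
    parts = {}
    for segment in header.split(";"):
        key, sep, value = segment.partition("=")
        if sep:
            parts[key] = value
    return parts
```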
@@ -248,91 +247,105 @@ def extract_context_from_sqs_or_sns_event_or_context(
    except Exception:
        logger.debug("Failed extracting context as EventBridge to SQS.")

    try:
if context is extracted from event bridge, we don't set a checkpoint. Is that expected?
The tracers never inject DSM context in the case of EventBridge or Step Functions. I'm not sure this is the PR to add that functionality for these event types.
)


def _extract_context_from_sqs_or_sns_record(record):
that function looks great to me
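For readers following the thread, a hedged sketch of what a per-record extractor like this typically does: look up the `_datadog` message attribute on the record (SQS puts it under messageAttributes, SNS under MessageAttributes), base64-decode Binary values, and json.loads the result. The attribute shapes below are the common SQS/SNS ones, not a copy of the PR's code:

```python
import base64
import json


def _extract_context_from_sqs_or_sns_record(record):
    # Sketch only: return the decoded `_datadog` payload for one record, or None.
    try:
        # SQS record shape.
        dd_attr = record.get("messageAttributes", {}).get("_datadog")
        if dd_attr is not None:
            if dd_attr.get("dataType") == "Binary":
                return json.loads(base64.b64decode(dd_attr["binaryValue"]))
            return json.loads(dd_attr["stringValue"])

        # SNS record, or an SNS notification delivered through SQS in the body.
        sns = record.get("Sns") or json.loads(record.get("body", "{}"))
        dd_attr = sns.get("MessageAttributes", {}).get("_datadog")
        if dd_attr is not None:
            if dd_attr.get("Type") == "Binary":
                return json.loads(base64.b64decode(dd_attr["Value"]))
            return json.loads(dd_attr["Value"])
    except Exception:
        pass
    return None
```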
datadog_lambda/tracing.py
Outdated
try:
    dd_ctx = _extract_context_from_sqs_or_sns_record(record)
    if apm_context is None:
        if dd_ctx and is_step_function_event(dd_ctx):
I guess the DSM context can't be in a step_function_event?
In any case, the logic that gets the APM context from the dd_ctx should be in its own function, I think. That function can be the _extract_context_from_xray that you can rename to _extract_apm_context, and it can take the parameters dd_ctx and record.
Then the code here can be:
if apm_context is None:
    apm_context = _extract_apm_context(dd_ctx, record)
This will make the code easier to read, but also less error prone.
Here, if the context is extracted from step function, we are not setting a checkpoint. I don't think this is what we want?
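A possible shape for the suggested _extract_apm_context helper, purely as a sketch: step-function handling is omitted, and propagator is assumed to be the module-level HTTPPropagator already used in tracing.py.

```python
def _extract_apm_context(dd_ctx, record):
    # Sketch: prefer the Datadog context carried in the record's `_datadog`
    # attribute; fall back to the X-Ray AWSTraceHeader when it is absent.
    if dd_ctx:
        return propagator.extract(dd_ctx)
    attrs = record.get("attributes") or {}
    if attrs.get("AWSTraceHeader"):
        return _extract_context_from_xray(record)
    return None
```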
datadog_lambda/tracing.py
Outdated
    return None


def _extract_context_from_xray(record):
As mentioned above, I would change this function to extract the APM context, not just from X-Ray.
What does this PR do?
This PR sets a DSM checkpoint for every record in an event. It does this by extracting a helper that gets the Datadog context per record; if DSM is enabled, the loop continues past the first record and sets a DSM checkpoint for each one (the first record is always the one used for APM).
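As an illustration of the shape of event this targets (all values below are made up), a two-record SQS event where each record carries its own `_datadog` attribute and should now yield its own DSM checkpoint:

```python
# Hypothetical two-record SQS event: previously only Records[0] produced a DSM
# checkpoint; with this change each record does, while Records[0] still drives APM.
event = {
    "Records": [
        {
            "eventSourceARN": "arn:aws:sqs:us-east-1:123456789012:my-queue",
            "body": "first message",
            "messageAttributes": {
                "_datadog": {
                    "dataType": "String",
                    "stringValue": '{"dd-pathway-ctx-base64": "...", "x-datadog-trace-id": "1"}',
                }
            },
        },
        {
            "eventSourceARN": "arn:aws:sqs:us-east-1:123456789012:my-queue",
            "body": "second message",
            "messageAttributes": {
                "_datadog": {
                    "dataType": "String",
                    "stringValue": '{"dd-pathway-ctx-base64": "...", "x-datadog-trace-id": "2"}',
                }
            },
        },
    ]
}
```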
Motivation
Please note the discrepancy between the incoming produce throughput (msg/s) and the outgoing throughput reported by the downstream queue. When batch processing, if the consuming service does not set a checkpoint for each message coming from upstream, we lose track of those messages, causing the noticeable drop in throughput.
Testing Guidelines
All of the tests in the table from #622 are maintained. Ensured no regressions by checking that all of test_tracing.py continues to pass. Tested on a sandbox AWS account for all queue types to verify context propagation and that throughput matches. Added tests to show that DSM can now handle multiple records in an event.
Additional Notes
Types of Changes
Check all that apply