fix(dsm): set dsm checkpoint for all records in event #643
base: main
Conversation
The big question I guess is do we:
- Couple APM & DSM a bit more, but avoid deserializing the Datadog context twice for the first record
- Couple APM & DSM less, but at the cost of deserializing the Datadog context twice for the first record
datadog_lambda/dsm.py
Outdated
not event_source.equals(EventTypes.KINESIS)
and not event_source.equals(EventTypes.SNS)
and not event_source.equals(EventTypes.SQS)
):
do we expect this to happen? If not, let's add a debug log here.
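For example, something along these lines (the exact message is just a suggestion, and `logger` is assumed to be the module-level logger already used in dsm.py):

```python
if (
    not event_source.equals(EventTypes.KINESIS)
    and not event_source.equals(EventTypes.SNS)
    and not event_source.equals(EventTypes.SQS)
):
    # Record why no DSM checkpoint was set for this invocation.
    logger.debug("DSM checkpoint not set: unsupported event source %s", event_source)
    return
```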
I agree the double deserialization introduces some performance cost. But we only need one extra deserialization for tracing purposes, and one json.loads of ~50 bytes is sub-microsecond (see the quick benchmark sketch after this comment), so I feel comfortable not worrying about it at all. I'd say decoupling and better code structure are worth more, and for Lambdas, cold-start costs are more critical than this. With that being said:
- If in the future we need to deserialize the full trace context for span links, for example, that would bring the cost to the millisecond level for each invocation, and we might want to refactor the code to do it only once then.
- Might be too late to mention, but I'm starting to think datadog-lambda-js may be a better starting place for a refactor like this, because there we already use a bunch of extractors and it could be easier to restructure.
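To put a rough number on that cost claim, a quick standalone micro-benchmark sketch (the payload below is made up, roughly the size of an injected Datadog context):

```python
import json
import timeit

# Hypothetical small Datadog context, similar to what a record's `_datadog`
# message attribute carries.
payload = '{"x-datadog-trace-id": "1234567890123", "x-datadog-parent-id": "9876543210"}'

# One json.loads of a payload this size is typically a few hundred nanoseconds,
# negligible next to Lambda cold-start costs.
runs = 100_000
per_call = timeit.timeit(lambda: json.loads(payload), number=runs) / runs
print(f"json.loads per call: {per_call * 1e9:.0f} ns")
```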
datadog_lambda/tracing.py
Outdated
)
# Handle case where trace context is injected into attributes.AWSTraceHeader
# example: Root=1-654321ab-000000001234567890abcdef;Parent=0123456789abcdef;Sampled=1
attrs = event.get("Records")[0].get("attributes")
here, it's accessing records[0] again. I would put this whole section in an else of the if idx == 0 (line 315)
- attrs = event.get("Records")[0].get("attributes")
+ attrs = record.get("attributes")
datadog_lambda/tracing.py
Outdated
)
if idx == 0:
    context = propagator.extract(dd_data)
    dsm_data = dd_data
else:
in this else, dsm_data is not set. Is that an issue?
Not an issue. I checked in dd-trace-py and DSM never injects context into attributes.AWSTraceHeader, so we can just set a checkpoint with None.
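For illustration, a hedged sketch of what "set a checkpoint with None" can look like on top of ddtrace's public data streams API; the wrapper name mirrors _dsm_set_checkpoint from this PR, but the body here is only an assumption:

```python
from ddtrace.data_streams import set_consume_checkpoint


def _dsm_set_checkpoint(context_json, event_type, arn):
    # Sketch: context_json is the decoded `_datadog` payload (a dict), or None
    # when the record carried no DSM context (e.g. AWSTraceHeader-only records).
    if not arn:
        return
    try:
        def carrier_get(key):
            # With no context, every lookup returns None and DSM simply starts
            # a new pathway for this record.
            return context_json.get(key) if context_json else None

        set_consume_checkpoint(event_type, arn, carrier_get)
    except Exception:
        # DSM bookkeeping must never break trace extraction.
        pass
```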
datadog_lambda/tracing.py
Outdated
)
context = None
for idx, record in enumerate(records):
    dsm_data = None
this is not specific to dsm, it's dd_ctx
datadog_lambda/tracing.py
Outdated
    if idx == 0
    else context
)
_dsm_set_checkpoint(None, "kinesis", source_arn)
we are setting a "kinesis" checkpoint even when the event is not Kinesis; I think the name "kinesis" is wrong there, because this code is a bit confusing.
Hmm. This is deep enough inside the extract function that we believe the event source is Kinesis, based on the parsing done beforehand. However, all AWS documentation says a Kinesis Lambda event should have this field. To my understanding, this check is for Lambda synchronous invocations with records that match the Kinesis shape but don't actually come from a Kinesis stream. @DataDog/apm-serverless can you help confirm why this check is here in the first place?
datadog_lambda/tracing.py
Outdated
for idx, record in enumerate(records):
    try:
        source_arn = record.get("eventSourceARN", "")
        dsm_data = None
Same as below: dsm_data is not specific to data streams here. I would name it dd_ctx, or something like that.
datadog_lambda/tracing.py
Outdated
if dd_json_data_type == "Binary":
    import base64
context = None
records = (
Maybe let's have records always be event.get("Records", []); however, in the loop, we can break early if data streams is disabled.
if index == 0:
    do apm stuff
    if data streams is not enabled:
        break
datadog_lambda/tracing.py
Outdated
)
# Handle case where trace context is injected into attributes.AWSTraceHeader
# example: Root=1-654321ab-000000001234567890abcdef;Parent=0123456789abcdef;Sampled=1
attrs = event.get("Records")[0].get("attributes")
- attrs = event.get("Records")[0].get("attributes")
+ attrs = record.get("attributes")
datadog_lambda/tracing.py
Outdated
    "Failed to extract Step Functions context from SQS/SNS event."
)
context = propagator.extract(dd_data)
if not config.data_streams_enabled:
this break is too hidden. It is inside the if dd_payload block?
I suggest this high-level approach:
apm_context = None
for record in records:
    context = extract_context(record)
    if apm_context is None:
        apm_context = context
    if data_streams_enabled:
        set_checkpoint()
    if not data_streams_enabled:
        # APM only looks at the first record.
        break
You can break the code down into helper functions to make that structure very clear.
Basically, can you avoid some of the nested conditions, like the if not config.data_streams_enabled here?
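In Python, that suggested shape could look roughly like the sketch below; it is illustrative only, the helper names reuse ones discussed in this thread, and config.data_streams_enabled is assumed to be the existing flag:

```python
def _extract_contexts_and_set_checkpoints(event, event_type):
    # Illustrative structure only, not the PR's final code.
    apm_context = None
    for record in event.get("Records", []):
        dd_ctx = _extract_context_from_sqs_or_sns_record(record)
        if apm_context is None:
            apm_context = _extract_apm_context(dd_ctx, record)
        if config.data_streams_enabled:
            _dsm_set_checkpoint(dd_ctx, event_type, record.get("eventSourceARN", ""))
        else:
            # APM only looks at the first record.
            break
    return apm_context
```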
datadog_lambda/tracing.py
Outdated
# example: Root=1-654321ab-000000001234567890abcdef;Parent=0123456789abcdef;Sampled=1
attrs = event.get("Records")[0].get("attributes")
if attrs:
    x_ray_header = attrs.get("AWSTraceHeader")
maybe put this logic in the extract_context I suggested above. extract_context can take an argument: extract_from_xray?
(extract_context is probably not a great name; I'll let you find a better one.)
Hmm, I might be misinterpreting, but I'm not sure we should have one function return both a Context() object and the return value of a json.loads(). I ended up splitting the X-Ray extractor into another helper; let me know what you think.
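As a rough illustration of that split, a small standalone parser for the AWSTraceHeader string shown in the diff above; the helper name is hypothetical, and mapping the X-Ray IDs onto a Datadog Context is left to the PR's actual extractor:

```python
def _parse_aws_trace_header(header):
    # Parse "Root=...;Parent=...;Sampled=..." into a dict, e.g.
    # "Root=1-654321ab-000000001234567890abcdef;Parent=0123456789abcdef;Sampled=1"
    # -> {"Root": "1-654321ab-...", "Parent": "0123456789abcdef", "Sampled": "1"}
    parts = {}
    for segment in header.split(";"):
        key, sep, value = segment.partition("=")
        if sep:
            parts[key] = value
    return parts
```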
@@ -248,91 +247,105 @@ def extract_context_from_sqs_or_sns_event_or_context(
    except Exception:
        logger.debug("Failed extracting context as EventBridge to SQS.")

    try:
if context is extracted from event bridge, we don't set a checkpoint. Is that expected?
The tracers never inject DSM context in the case of EventBridge or Step Functions. I'm not sure this is the PR to add that functionality for these event types.
)


def _extract_context_from_sqs_or_sns_record(record):
that function looks great to me
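For readers following the thread, a hedged sketch of what a per-record extractor like this typically does: look up the `_datadog` message attribute on the record (SQS puts it under messageAttributes, SNS under MessageAttributes), base64-decode Binary values, and json.loads the result. The attribute shapes below are the common SQS/SNS ones, not a copy of the PR's code:

```python
import base64
import json


def _extract_context_from_sqs_or_sns_record(record):
    # Sketch only: return the decoded `_datadog` payload for one record, or None.
    try:
        # SQS record shape.
        dd_attr = record.get("messageAttributes", {}).get("_datadog")
        if dd_attr is not None:
            if dd_attr.get("dataType") == "Binary":
                return json.loads(base64.b64decode(dd_attr["binaryValue"]))
            return json.loads(dd_attr["stringValue"])

        # SNS record, or an SNS notification delivered through SQS in the body.
        sns = record.get("Sns") or json.loads(record.get("body", "{}"))
        dd_attr = sns.get("MessageAttributes", {}).get("_datadog")
        if dd_attr is not None:
            if dd_attr.get("Type") == "Binary":
                return json.loads(base64.b64decode(dd_attr["Value"]))
            return json.loads(dd_attr["Value"])
    except Exception:
        pass
    return None
```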
datadog_lambda/tracing.py
Outdated
try:
    dd_ctx = _extract_context_from_sqs_or_sns_record(record)
    if apm_context is None:
        if dd_ctx and is_step_function_event(dd_ctx):
I guess the DSM context can't be in a step_function_event?
In any case, the logic that gets the APM context from the dd_ctx should be in its own function, I think. That function can be the _extract_context_from_xray that you can rename to _extract_apm_context, and it can take the parameters dd_ctx and record.
Then the code here can be:
if apm_context is None:
    apm_context = _extract_apm_context(dd_ctx, record)
This will make the code easier to read, but also less error prone.
Here, if the context is extracted from step function, we are not setting a checkpoint. I don't think this is what we want?
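A possible shape for the suggested _extract_apm_context helper, purely as a sketch: step-function handling is omitted, and propagator is assumed to be the module-level HTTPPropagator already used in tracing.py.

```python
def _extract_apm_context(dd_ctx, record):
    # Sketch: prefer the Datadog context carried in the record's `_datadog`
    # attribute; fall back to the X-Ray AWSTraceHeader when it is absent.
    if dd_ctx:
        return propagator.extract(dd_ctx)
    attrs = record.get("attributes") or {}
    if attrs.get("AWSTraceHeader"):
        return _extract_context_from_xray(record)
    return None
```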
datadog_lambda/tracing.py
Outdated
    return None


def _extract_context_from_xray(record):
As mentioned above, I would change this function to extract the APM context, not just from X-Ray.
What does this PR do?
This PR sets a DSM checkpoint for every record in an event. It does this by extracting a helper that gets the Datadog context per record; if DSM is enabled, the loop continues past the first record and sets a DSM checkpoint for each one (the first record is always the one used for APM).
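As an illustration of the shape of event this targets (all values below are made up), a two-record SQS event where each record carries its own `_datadog` attribute and should now yield its own DSM checkpoint:

```python
# Hypothetical two-record SQS event: previously only Records[0] produced a DSM
# checkpoint; with this change each record does, while Records[0] still drives APM.
event = {
    "Records": [
        {
            "eventSourceARN": "arn:aws:sqs:us-east-1:123456789012:my-queue",
            "body": "first message",
            "messageAttributes": {
                "_datadog": {
                    "dataType": "String",
                    "stringValue": '{"dd-pathway-ctx-base64": "...", "x-datadog-trace-id": "1"}',
                }
            },
        },
        {
            "eventSourceARN": "arn:aws:sqs:us-east-1:123456789012:my-queue",
            "body": "second message",
            "messageAttributes": {
                "_datadog": {
                    "dataType": "String",
                    "stringValue": '{"dd-pathway-ctx-base64": "...", "x-datadog-trace-id": "2"}',
                }
            },
        },
    ]
}
```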
Motivation
Please note the discrepancy between the incoming produce throughput (msg/s) and the outgoing throughput reported by the downstream queue. When batch processing, if the consuming service does not set a checkpoint for each message coming from upstream, we lose track of those messages, causing the noticeable drop in throughput.
Testing Guidelines
All of the tests in the table from #622 are maintained. Ensured no regressions by checking that all of test_tracing.py continues to pass. Tested on a sandbox AWS account for all queue types to verify context propagation and that throughput matches. Added tests to show that DSM can now handle multiple records in an event.
Additional Notes
Types of Changes
Check all that apply