
[exporter/prometheusremotewrite] Excessive buffer allocations in prometheus remote write exporter with large batch sizes #34269

Closed
@ben-childs-docusign

Description


Component(s)

exporter/prometheusremotewrite

What happened?

Description

We have observed high memory usage in our OpenTelemetry Collectors, which are set up to pull data from a Kafka cluster and push to a Prometheus remote write endpoint. To handle the volume of data, we have been increasing the batch processor's batch size beyond the default (8192) to 50,000 or more. This is because the Prometheus remote write exporter only parallelizes sends within a batch and never parallelizes across batches.

While a 50k batch size works most of the time, we have seen cases where this config falls behind, so we have also tried increasing it beyond 50k to 100k/200k or more. When we did this, we noticed a dramatic increase in the amount of memory used by our collectors.

A pprof profile revealed that a huge portion of memory allocations was occurring in a single function, `prometheusremotewriteexporter.batchTimeSeries`.

Reviewing this function, it allocates many large buffers with capacity set to the full size of the batch, even though the individual requests being batched are expected to be much smaller. We patched this locally to allocate smaller buffers and observed a huge reduction in memory usage: in our test environment, from ~80 GB to ~30-40 GB across 11 pods.
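The over-allocation can be illustrated with a simplified Go sketch. The `series` type, the byte sizes, and both batching functions below are hypothetical stand-ins for the exporter's real `prompb.TimeSeries` batching, not its actual code; the point is only to contrast preallocating every sub-request at full batch capacity with letting `append` grow each slice on demand:

```go
package main

import "fmt"

// series is a hypothetical stand-in for prompb.TimeSeries.
type series struct{ sizeBytes int }

// batchFullCap mirrors the reported pattern: every sub-request slice is
// allocated with capacity equal to the entire input, even though each
// sub-request is cut off at maxBatchBytes and holds far fewer entries.
func batchFullCap(all []series, maxBatchBytes int) (requests [][]series, reservedCap int) {
	cur := make([]series, 0, len(all)) // capacity = whole batch
	reservedCap = cap(cur)
	size := 0
	for _, s := range all {
		if size+s.sizeBytes > maxBatchBytes && len(cur) > 0 {
			requests = append(requests, cur)
			cur = make([]series, 0, len(all)) // full capacity again
			reservedCap += cap(cur)
			size = 0
		}
		cur = append(cur, s)
		size += s.sizeBytes
	}
	if len(cur) > 0 {
		requests = append(requests, cur)
	}
	return requests, reservedCap
}

// batchGrowCap is the shape of the local patch: start each sub-request
// empty and let append grow the backing array as needed.
func batchGrowCap(all []series, maxBatchBytes int) (requests [][]series, reservedCap int) {
	var cur []series
	size := 0
	for _, s := range all {
		if size+s.sizeBytes > maxBatchBytes && len(cur) > 0 {
			requests = append(requests, cur)
			reservedCap += cap(cur)
			cur, size = nil, 0
		}
		cur = append(cur, s)
		size += s.sizeBytes
	}
	if len(cur) > 0 {
		requests = append(requests, cur)
		reservedCap += cap(cur)
	}
	return requests, reservedCap
}

func main() {
	// 200,000 series of ~500 bytes each, split into ~10 MB sub-requests,
	// roughly matching the config below.
	all := make([]series, 200000)
	for i := range all {
		all[i] = series{sizeBytes: 500}
	}
	_, capA := batchFullCap(all, 10_000_000)
	_, capB := batchGrowCap(all, 10_000_000)
	fmt.Printf("reserved slots: full-cap=%d grow-cap=%d\n", capA, capB)
}
```

With these numbers each sub-request holds 20,000 entries, yet the full-capacity variant reserves 200,000 slots for every one of its 10 slices, an order of magnitude more backing memory than the on-demand variant ever touches.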


I will prepare a PR with the proposed patch for further discussion.

Steps to Reproduce

Configure a metrics pipeline with the Prometheus remote write exporter and a large batch size (e.g. 100k data points or more). Send a large volume of data through the pipeline and observe the memory allocations in pprof.

Expected Result

Actual Result

Collector version

v0.101.0

Environment information

Environment

OS: Ubuntu 20.04
Compiler (if manually compiled): go 1.21.12

OpenTelemetry Collector configuration

      prometheusremotewrite:
        auth:
          authenticator: bearertokenauth
        endpoint: [SCRUBBED]
        max_batch_size_bytes: 10000000
        remote_write_queue:
          enabled: true
          num_consumers: 25
        resource_to_telemetry_conversion:
          enabled: true
        timeout: 30s
      batch:
        send_batch_size: 200000
        timeout: 5s

Log output

No response

Additional context

No response
