[kafkaexporter] How to optimize the performance of the Collector, especially the CPU utilization rate? #36974

@xiaoyao2246

Description

Describe the bug
I deployed the Collector on Kubernetes to receive trace data and export it to Kafka. However, the Collector's CPU utilization is high while its memory utilization is low.

Steps to reproduce

I gave the Collector 1 CPU core and 2 GiB of memory (1C2G).

What did you expect to see?
I expected the Collector's CPU utilization to be lower, given that its memory utilization is very low.

Are there other ways to optimize this Collector? I want it to use fewer CPU resources.

What did you see instead?
After sending data to the Collector, its resource usage looked like this:
[Screenshot: Collector CPU and memory utilization metrics]

What version did you use?
v0.95.0

What config did you use?

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-gd-config
  namespace: default
data:
  config.yaml: |-
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
    processors:
      memory_limiter:
        check_interval: 1s
        limit_mib: 2000
        spike_limit_mib: 400
      batch:
        send_batch_size: 500
        send_batch_max_size: 500
      resource:
        attributes:
          - key: from-collector
            value: gd-fat-k8s
            action: insert
    exporters:
      logging:
        verbosity: normal
      kafka:
        brokers:
          - xx.xx.xx.xx:9092
          - xx.xx.xx.xx:9092
          - xx.xx.xx.xx:9092
        topic: otlp_trace_fat
        partition_traces_by_id: true
        protocol_version: 1.0.0
        sending_queue:
          enabled: true
          num_consumers: 10
          queue_size: 10000
    
    extensions:
      pprof:
        endpoint: ":1777"
    service:
      extensions: [pprof]
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, resource, batch]
          exporters: [logging, kafka]

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector-gd
  namespace: default
  labels:
    app: opentelemetry
    component: otel-collector-gd
spec:
  selector:
    matchLabels:
      app: opentelemetry
      component: otel-collector-gd
  template:
    metadata:
      labels:
        app: opentelemetry
        component: otel-collector-gd
    spec:
      containers:
        - name: otel-collector-gd
          image: otel/opentelemetry-collector-contrib:0.95.0
          resources:
            limits:
              cpu: 1000m
              memory: 2048Mi
          volumeMounts:
            - mountPath: /var/log
              name: varlog
              readOnly: true
            - mountPath: /var/lib/docker/containers
              name: varlibdockercontainers
              readOnly: true
            - mountPath: /etc/otelcol-contrib/config.yaml
              name: data
              subPath: config.yaml
              readOnly: true
      terminationGracePeriodSeconds: 30
      volumes:
        - name: varlog
          hostPath:
            path: /var/log
        - name: varlibdockercontainers
          hostPath:
            path: /var/lib/docker/containers
        - name: data
          configMap:
            name: otel-collector-gd-config

---
apiVersion: v1
kind: Service
metadata:
  name: otel-collector-gd
  namespace: default
  labels:
    app: opentelemetry
    component: otel-collector-gd
spec:
  ports:
    - name: otlp-grpc
      port: 4317
      protocol: TCP
      targetPort: 4317
    - name: otlp-http
      port: 4318
      protocol: TCP
      targetPort: 4318
    - name: pprof
      port: 1777
      protocol: TCP
      targetPort: 1777
  selector:
    component: otel-collector-gd
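
For what it's worth, the kafka exporter also exposes producer-level settings that directly affect CPU cost. Below is a minimal sketch of that part of the config using the kafkaexporter's documented producer options; the values are illustrative, not recommendations:

exporters:
  kafka:
    brokers:
      - xx.xx.xx.xx:9092
    topic: otlp_trace_fat
    protocol_version: 1.0.0
    producer:
      # 'none' (the default) spends no CPU compressing outgoing batches;
      # gzip/snappy/lz4/zstd trade CPU for network bandwidth.
      compression: none
      # Maximum messages per broker request; 0 (the default) is unlimited.
      flush_max_messages: 0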

Environment

Additional context
I used pprof to analyze the CPU usage.

File: otelcol-contrib
Type: cpu
Time: Dec 26, 2024 at 1:50pm (CST)
Duration: 300s, Total samples = 230.42s (76.81%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 82.53s, 35.82% of 230.42s total
Dropped 1059 nodes (cum <= 1.15s)
Showing top 10 nodes out of 261
      flat  flat%   sum%        cum   cum%
    14.37s  6.24%  6.24%     14.37s  6.24%  runtime/internal/syscall.Syscall6
    10.79s  4.68% 10.92%     13.18s  5.72%  compress/flate.(*decompressor).huffSym
    10.43s  4.53% 15.45%     20.19s  8.76%  runtime.scanobject
    10.29s  4.47% 19.91%     45.47s 19.73%  runtime.mallocgc
     8.19s  3.55% 23.47%      8.19s  3.55%  runtime.memclrNoHeapPointers
     7.17s  3.11% 26.58%      7.17s  3.11%  runtime.memmove
     7.03s  3.05% 29.63%      7.95s  3.45%  runtime.lock2
     5.52s  2.40% 32.02%     24.51s 10.64%  compress/flate.(*decompressor).huffmanBlock
     4.57s  1.98% 34.01%      4.75s  2.06%  runtime.unlock2
     4.17s  1.81% 35.82%      4.17s  1.81%  runtime.nextFreeFast (inline)

[Screenshot: pprof CPU profile visualization]
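
Separately, runtime.mallocgc accounts for roughly 20% of cumulative samples, i.e. allocation and garbage collection. Since memory utilization is low, one experiment worth trying is an explicit soft memory limit for the Go runtime so it collects less aggressively; a sketch against the Deployment above (Go 1.19+, value illustrative):

containers:
  - name: otel-collector-gd
    image: otel/opentelemetry-collector-contrib:0.95.0
    env:
      # Soft heap target for the Go GC; leaves headroom under the
      # 2048Mi container limit. Illustrative value, not a recommendation.
      - name: GOMEMLIMIT
        value: "1600MiB"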
