
[exporterhelper] Persistent queue can block collector start-up #9451

Closed
@swar8080

Description


Describe the bug
When using a persistent queue backed by file storage, the persisted state can cause the collector to get stuck partway through start-up: the collector process doesn't exit, but some components are never initialized. Disabling the persistent queue fixes the issue.

From dumping the stack traces of all goroutines using the pprof extension, it looks like the goroutine below is what blocks start-up from continuing. I confirmed this stack trace was always present the few times this bug happened, and never present on collector instances that start cleanly:

1 @ 0x43f1ce 0x408c45 0x408897 0x11ea95e 0x11eaddb 0x11eaf4b 0x11e706c 0x11e3829 0x11de3a7 0xd0b245 0xe5c0a5 0xeca506 0xeca912 0xecce0c 0xe8867c 0xe88f3f 0xe88a93 0x630ea7d 0x630e9f8 0x630e96a 0x43ed5b 0x472ac1
#	0x11ea95d	go.opentelemetry.io/collector/exporter/exporterhelper/internal.(*persistentQueue[...]).initPersistentContiguousStorage+0x3bd	go.opentelemetry.io/collector/[email protected]/exporterhelper/internal/persistent_queue.go:154
#	0x11eadda	go.opentelemetry.io/collector/exporter/exporterhelper/internal.(*persistentQueue[...]).initClient+0x7a				go.opentelemetry.io/collector/[email protected]/exporterhelper/internal/persistent_queue.go:123
#	0x11eaf4a	go.opentelemetry.io/collector/exporter/exporterhelper/internal.(*persistentQueue[...]).Start+0xea				go.opentelemetry.io/collector/[email protected]/exporterhelper/internal/persistent_queue.go:115
#	0x11e706b	go.opentelemetry.io/collector/exporter/exporterhelper/internal.(*QueueConsumers[...]).Start+0x4b				go.opentelemetry.io/collector/[email protected]/exporterhelper/internal/consumers.go:31
#	0x11e3828	go.opentelemetry.io/collector/exporter/exporterhelper.(*queueSender).Start+0x48							go.opentelemetry.io/collector/[email protected]/exporterhelper/queue_sender.go:136
#	0x11de3a6	go.opentelemetry.io/collector/exporter/exporterhelper.(*baseExporter).Start+0xa6						go.opentelemetry.io/collector/[email protected]/exporterhelper/common.go:202
#	0xd0b244	go.opentelemetry.io/collector/service/internal/graph.(*Graph).StartAll+0x1a4							go.opentelemetry.io/collector/[email protected]/internal/graph/graph.go:396
#	0xe5c0a4	go.opentelemetry.io/collector/service.(*Service).Start+0x2e4									go.opentelemetry.io/collector/[email protected]/service.go:169
#	0xeca505	go.opentelemetry.io/collector/otelcol.(*Collector).setupConfigurationComponents+0x705						go.opentelemetry.io/collector/[email protected]/collector.go:190
#	0xeca911	go.opentelemetry.io/collector/otelcol.(*Collector).Run+0x51									go.opentelemetry.io/collector/[email protected]/collector.go:229
#	0xecce0b	go.opentelemetry.io/collector/otelcol.NewCommand.func1+0x6b									go.opentelemetry.io/collector/[email protected]/command.go:27
#	0xe8867b	github.com/spf13/cobra.(*Command).execute+0xabb											github.com/spf13/[email protected]/command.go:983
#	0xe88f3e	github.com/spf13/cobra.(*Command).ExecuteC+0x3fe										github.com/spf13/[email protected]/command.go:1115
#	0xe88a92	github.com/spf13/cobra.(*Command).Execute+0x12											github.com/spf13/[email protected]/command.go:1039
#	0x630ea7c	main.runInteractive+0x3c													github.com/open-telemetry/opentelemetry-collector-releases/contrib/main.go:27
#	0x630e9f7	main.run+0x117															github.com/open-telemetry/opentelemetry-collector-releases/contrib/main_others.go:11
#	0x630e969	main.main+0x89															github.com/open-telemetry/opentelemetry-collector-releases/contrib/main.go:20
#	0x43ed5a	runtime.main+0x2ba		

Line 154 is trying to write to a buffered channel. If I'm reading it correctly, the goroutine would block if the persisted write index minus the persisted read index exceeds the queue capacity?
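
To illustrate the suspicion, here is a minimal sketch, not the actual collector code, assuming initialization pushes one token per persisted-but-unconsumed item into a buffered channel sized to the configured queue_size; all names below are hypothetical stand-ins:

    package main

    import (
        "fmt"
        "time"
    )

    // Hypothetical stand-in for the initialization step: one send per
    // persisted-but-unconsumed item into a buffered channel sized to the
    // configured queue capacity.
    func initFromPersistedState(readIndex, writeIndex uint64, queueCapacity int) {
        // Buffered channel acting as the in-memory index of queued items.
        itemIndexes := make(chan uint64, queueCapacity)

        // If the persisted backlog (writeIndex - readIndex) exceeds
        // queueCapacity, a send below blocks forever, because the queue
        // consumers only start after initialization returns.
        for i := readIndex; i < writeIndex; i++ {
            itemIndexes <- i
        }
    }

    func main() {
        done := make(chan struct{})
        go func() {
            // Backlog of 10001 persisted items with queue_size 10000:
            // the 10001st send never completes.
            initFromPersistedState(0, 10001, 10000)
            close(done)
        }()

        select {
        case <-done:
            fmt.Println("initialization finished")
        case <-time.After(2 * time.Second):
            fmt.Println("initialization blocked, like the stuck Start() goroutine")
        }
    }

Under that assumption, any persisted backlog larger than queue_size would wedge Start() before the consumers are running, which matches the stack trace above.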


Steps to reproduce

I've since disabled file storage and didn't have time to find consistent steps to reproduce the issue, but can revisit later if it'd help.

Here are some observations though:

  • Our exporter configuration was left broken for a while, so many export failures were happening
  • I remember seeing error messages about the exporter queue being full
  • The issue never resolves itself, even after restarts or after letting the instance sit for hours. I think we did fix it at one point by deleting the persistent volume claims, before realizing they were probably at fault, but it eventually happened again
  • Some of the collectors in our Kubernetes StatefulSet never ran into the issue even when restarted, but I didn't test this thoroughly

What did you expect to see?
The collector starts up cleanly.

What did you see instead?
The collector process is up and running but not doing anything, i.e. no new log statements appear and the receivers are not accepting data.

What version did you use?
v0.92.0 of the collector-contrib image

What config did you use?

    extensions:
      ...
      file_storage/otc:
        directory: /data/wal/storage/otc
        timeout: 10s

    exporters:
      otlp:
        endpoint: ...
        sending_queue:
          queue_size: 10000
          storage: file_storage/otc


    service:
      extensions: [file_storage/otc]
      ...
      pipelines:
        ...
        traces:
          receivers: [otlp]
          processors: [k8sattributes, attributes/env, batch]
          exporters: [otlp, spanmetrics, servicegraph, debug]

Environment

  • Using the collector-contrib image on Linux EC2 instances
  • Using the OpenTelemetry Operator and Helm charts, with pods deployed as a StatefulSet using Kubernetes persistent volume claims
