
[exporterhelper] Persistent queue can block collector start-up #9451

Closed
@swar8080

Description


Describe the bug
When using a persistent queue backed by file storage, the persisted state can cause the collector to get stuck partway through start-up: the collector process doesn't exit, but some components are never initialized. Disabling the persistent queue fixes the issue.

From dumping the stack traces of all goroutines using the pprof extension, it looks like the goroutine below is what blocks start-up from continuing. I confirmed this stack trace was always present the few times this bug happened, and never present on collector instances that start cleanly:

1 @ 0x43f1ce 0x408c45 0x408897 0x11ea95e 0x11eaddb 0x11eaf4b 0x11e706c 0x11e3829 0x11de3a7 0xd0b245 0xe5c0a5 0xeca506 0xeca912 0xecce0c 0xe8867c 0xe88f3f 0xe88a93 0x630ea7d 0x630e9f8 0x630e96a 0x43ed5b 0x472ac1
#	0x11ea95d	go.opentelemetry.io/collector/exporter/exporterhelper/internal.(*persistentQueue[...]).initPersistentContiguousStorage+0x3bd	go.opentelemetry.io/collector/[email protected]/exporterhelper/internal/persistent_queue.go:154
#	0x11eadda	go.opentelemetry.io/collector/exporter/exporterhelper/internal.(*persistentQueue[...]).initClient+0x7a				go.opentelemetry.io/collector/[email protected]/exporterhelper/internal/persistent_queue.go:123
#	0x11eaf4a	go.opentelemetry.io/collector/exporter/exporterhelper/internal.(*persistentQueue[...]).Start+0xea				go.opentelemetry.io/collector/[email protected]/exporterhelper/internal/persistent_queue.go:115
#	0x11e706b	go.opentelemetry.io/collector/exporter/exporterhelper/internal.(*QueueConsumers[...]).Start+0x4b				go.opentelemetry.io/collector/[email protected]/exporterhelper/internal/consumers.go:31
#	0x11e3828	go.opentelemetry.io/collector/exporter/exporterhelper.(*queueSender).Start+0x48							go.opentelemetry.io/collector/[email protected]/exporterhelper/queue_sender.go:136
#	0x11de3a6	go.opentelemetry.io/collector/exporter/exporterhelper.(*baseExporter).Start+0xa6						go.opentelemetry.io/collector/[email protected]/exporterhelper/common.go:202
#	0xd0b244	go.opentelemetry.io/collector/service/internal/graph.(*Graph).StartAll+0x1a4							go.opentelemetry.io/collector/[email protected]/internal/graph/graph.go:396
#	0xe5c0a4	go.opentelemetry.io/collector/service.(*Service).Start+0x2e4									go.opentelemetry.io/collector/[email protected]/service.go:169
#	0xeca505	go.opentelemetry.io/collector/otelcol.(*Collector).setupConfigurationComponents+0x705						go.opentelemetry.io/collector/[email protected]/collector.go:190
#	0xeca911	go.opentelemetry.io/collector/otelcol.(*Collector).Run+0x51									go.opentelemetry.io/collector/[email protected]/collector.go:229
#	0xecce0b	go.opentelemetry.io/collector/otelcol.NewCommand.func1+0x6b									go.opentelemetry.io/collector/[email protected]/command.go:27
#	0xe8867b	github.com/spf13/cobra.(*Command).execute+0xabb											github.com/spf13/[email protected]/command.go:983
#	0xe88f3e	github.com/spf13/cobra.(*Command).ExecuteC+0x3fe										github.com/spf13/[email protected]/command.go:1115
#	0xe88a92	github.com/spf13/cobra.(*Command).Execute+0x12											github.com/spf13/[email protected]/command.go:1039
#	0x630ea7c	main.runInteractive+0x3c													github.com/open-telemetry/opentelemetry-collector-releases/contrib/main.go:27
#	0x630e9f7	main.run+0x117															github.com/open-telemetry/opentelemetry-collector-releases/contrib/main_others.go:11
#	0x630e969	main.main+0x89															github.com/open-telemetry/opentelemetry-collector-releases/contrib/main.go:20
#	0x43ed5a	runtime.main+0x2ba		

Line 154 is trying to write to a buffered channel. If I'm reading it correctly, the goroutine would block if the persisted write index minus the persisted read index exceeds the queue capacity?
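
To illustrate the suspicion, here is a minimal sketch, not the actual collector code, assuming initialization pushes one token per persisted-but-unconsumed item into a buffered channel sized to the configured queue_size; all names below are hypothetical stand-ins:

    package main

    import (
        "fmt"
        "time"
    )

    // Hypothetical stand-in for the initialization step: one send per
    // persisted-but-unconsumed item into a buffered channel sized to the
    // configured queue capacity.
    func initFromPersistedState(readIndex, writeIndex uint64, queueCapacity int) {
        // Buffered channel acting as the in-memory index of queued items.
        itemIndexes := make(chan uint64, queueCapacity)

        // If the persisted backlog (writeIndex - readIndex) exceeds
        // queueCapacity, a send below blocks forever, because the queue
        // consumers only start after initialization returns.
        for i := readIndex; i < writeIndex; i++ {
            itemIndexes <- i
        }
    }

    func main() {
        done := make(chan struct{})
        go func() {
            // Backlog of 10001 persisted items with queue_size 10000:
            // the 10001st send never completes.
            initFromPersistedState(0, 10001, 10000)
            close(done)
        }()

        select {
        case <-done:
            fmt.Println("initialization finished")
        case <-time.After(2 * time.Second):
            fmt.Println("initialization blocked, like the stuck Start() goroutine")
        }
    }

Under that assumption, any persisted backlog larger than queue_size would wedge Start() before the consumers are running, which matches the stack trace above.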


Steps to reproduce

I've since disabled file storage and didn't have time to find consistent steps to reproduce the issue, but can revisit later if it'd help.

Here are some observations though:

  • Our exporter configuration was left broken for a while, so many export failures were happening
  • I remember seeing error messages about the exporter queue being full
  • The issue never resolves itself, even after restarts or after letting the instance sit for hours. I think we did fix it at one point by deleting the persistent volume claims, before realizing they were probably at fault, but it eventually happened again
  • Some of the collectors in our Kubernetes StatefulSet never ran into the issue even when restarted, but I didn't test this thoroughly

What did you expect to see?
The collector starts up cleanly.

What did you see instead?
The collector process is up and running but not doing anything, i.e. no new log statements appear and the receivers are not accepting data.

What version did you use?
v0.92.0 of the collector-contrib image

What config did you use?

    extensions:
      ...
      file_storage/otc:
        directory: /data/wal/storage/otc
        timeout: 10s

    exporters:
      otlp:
        endpoint: ...
        sending_queue:
          queue_size: 10000
          storage: file_storage/otc


    service:
      extensions: [file_storage/otc]
      ...
      pipelines:
        ...
        traces:
          receivers: [otlp]
          processors: [k8sattributes, attributes/env, batch]
          exporters: [otlp, spanmetrics, servicegraph, debug]

Environment

  • Using the collector-contrib image on Linux EC2 instances
  • Using the OpenTelemetry Operator and Helm charts, with pods deployed as a StatefulSet using Kubernetes persistent volume claims
