Description
Describe the bug
When using a persistent queue backed by file storage, the persisted state can cause the collector to get stuck partway through start-up: the process doesn't exit, but some components are never initialized. Disabling the persistent queue fixes the issue.
Dumping the stack traces of all goroutines via the pprof extension suggests the goroutine below is blocking start-up from continuing. I confirmed this stack trace was present every time the bug happened (only a few occurrences so far), and never present on collector instances that start cleanly:
```
1 @ 0x43f1ce 0x408c45 0x408897 0x11ea95e 0x11eaddb 0x11eaf4b 0x11e706c 0x11e3829 0x11de3a7 0xd0b245 0xe5c0a5 0xeca506 0xeca912 0xecce0c 0xe8867c 0xe88f3f 0xe88a93 0x630ea7d 0x630e9f8 0x630e96a 0x43ed5b 0x472ac1
# 0x11ea95d go.opentelemetry.io/collector/exporter/exporterhelper/internal.(*persistentQueue[...]).initPersistentContiguousStorage+0x3bd go.opentelemetry.io/collector/[email protected]/exporterhelper/internal/persistent_queue.go:154
# 0x11eadda go.opentelemetry.io/collector/exporter/exporterhelper/internal.(*persistentQueue[...]).initClient+0x7a go.opentelemetry.io/collector/[email protected]/exporterhelper/internal/persistent_queue.go:123
# 0x11eaf4a go.opentelemetry.io/collector/exporter/exporterhelper/internal.(*persistentQueue[...]).Start+0xea go.opentelemetry.io/collector/[email protected]/exporterhelper/internal/persistent_queue.go:115
# 0x11e706b go.opentelemetry.io/collector/exporter/exporterhelper/internal.(*QueueConsumers[...]).Start+0x4b go.opentelemetry.io/collector/[email protected]/exporterhelper/internal/consumers.go:31
# 0x11e3828 go.opentelemetry.io/collector/exporter/exporterhelper.(*queueSender).Start+0x48 go.opentelemetry.io/collector/[email protected]/exporterhelper/queue_sender.go:136
# 0x11de3a6 go.opentelemetry.io/collector/exporter/exporterhelper.(*baseExporter).Start+0xa6 go.opentelemetry.io/collector/[email protected]/exporterhelper/common.go:202
# 0xd0b244 go.opentelemetry.io/collector/service/internal/graph.(*Graph).StartAll+0x1a4 go.opentelemetry.io/collector/[email protected]/internal/graph/graph.go:396
# 0xe5c0a4 go.opentelemetry.io/collector/service.(*Service).Start+0x2e4 go.opentelemetry.io/collector/[email protected]/service.go:169
# 0xeca505 go.opentelemetry.io/collector/otelcol.(*Collector).setupConfigurationComponents+0x705 go.opentelemetry.io/collector/[email protected]/collector.go:190
# 0xeca911 go.opentelemetry.io/collector/otelcol.(*Collector).Run+0x51 go.opentelemetry.io/collector/[email protected]/collector.go:229
# 0xecce0b go.opentelemetry.io/collector/otelcol.NewCommand.func1+0x6b go.opentelemetry.io/collector/[email protected]/command.go:27
# 0xe8867b github.com/spf13/cobra.(*Command).execute+0xabb github.com/spf13/[email protected]/command.go:983
# 0xe88f3e github.com/spf13/cobra.(*Command).ExecuteC+0x3fe github.com/spf13/[email protected]/command.go:1115
# 0xe88a92 github.com/spf13/cobra.(*Command).Execute+0x12 github.com/spf13/[email protected]/command.go:1039
# 0x630ea7c main.runInteractive+0x3c github.com/open-telemetry/opentelemetry-collector-releases/contrib/main.go:27
# 0x630e9f7 main.run+0x117 github.com/open-telemetry/opentelemetry-collector-releases/contrib/main_others.go:11
# 0x630e969 main.main+0x89 github.com/open-telemetry/opentelemetry-collector-releases/contrib/main.go:20
# 0x43ed5a runtime.main+0x2ba
```
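For reference, the dump above is the aggregated goroutine profile served by the pprof extension. The snippet below is only a sketch of how it can be collected, assuming the extension is enabled with its default endpoint `localhost:1777` (a plain `curl` against the same URL works just as well):

```go
package main

import (
	"io"
	"net/http"
	"os"
)

// Fetches the aggregated goroutine dump (the "<count> @ <addresses>" format
// shown above). Assumes the pprof extension's default endpoint, localhost:1777;
// adjust the host/port if your config overrides it.
func main() {
	resp, err := http.Get("http://localhost:1777/debug/pprof/goroutine?debug=1")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	if _, err := io.Copy(os.Stdout, resp.Body); err != nil {
		panic(err)
	}
}
```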
Line 154 is trying to write to a buffered channel. If I'm reading it correctly, the goroutine would get blocked whenever persisted write index - persisted read index > queue capacity?
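To illustrate that reading of the code, here is a minimal, self-contained sketch of the suspected mechanism. It is not the collector's actual implementation; `queueSize`, `readIndex` and `writeIndex` are illustrative stand-ins for `sending_queue.queue_size` and the indices recovered from file storage:

```go
package main

import (
	"fmt"
	"time"
)

// Minimal sketch of the suspected blocking mechanism, NOT the actual
// collector code. queueSize stands in for sending_queue.queue_size;
// readIndex/writeIndex stand in for the persisted queue indices.
func main() {
	const queueSize = 3            // channel buffer, analogous to the queue capacity
	readIndex, writeIndex := 0, 10 // persisted state holding more items than capacity

	items := make(chan int, queueSize)

	done := make(chan struct{})
	go func() {
		// Recovery loop: re-enqueue every persisted item on start-up.
		for i := readIndex; i < writeIndex; i++ {
			items <- i // blocks once the buffer is full and no consumer is running yet
		}
		close(done)
	}()

	select {
	case <-done:
		fmt.Println("recovery finished")
	case <-time.After(500 * time.Millisecond):
		// With writeIndex-readIndex > queueSize and consumers not started yet,
		// the recovery goroutine is stuck on the send, mirroring the hang in Start().
		fmt.Println("recovery blocked: persisted backlog exceeds channel capacity")
	}
}
```

If the persisted backlog is larger than the channel buffer and the consumers only start after `Start()` returns, the send in the recovery loop can never complete, which would match the hang observed above.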

Steps to reproduce
I've since disabled file storage and haven't had time to find consistent steps to reproduce the issue, but I can revisit later if it would help. Some observations in the meantime:
- Our exporter configuration was left broken for a while, with many failures occurring
- I remember seeing error messages about the exporter queue being full
- The issue never resolves itself, even after restarts or letting the instance sit for hours. At one point we did clear it by deleting the persistent volume claims (before realizing they were probably at fault), but it eventually happened again
- Some of the collectors in our Kubernetes StatefulSet never ran into the issue even when restarted, though I did not test this thoroughly
What did you expect to see?
The collector starts up cleanly.
What did you see instead?
The collector process is up but not doing anything: it emits no new log statements and the receivers do not accept data.
What version did you use?
v0.92.0 of the collector-contrib image
What config did you use?
```yaml
extensions:
  ...
  file_storage/otc:
    directory: /data/wal/storage/otc
    timeout: 10s

exporters:
  otlp:
    endpoint: ...
    sending_queue:
      queue_size: 10000
      storage: file_storage/otc

service:
  extensions: [file_storage/otc]
  ...
  pipelines:
    ...
    traces:
      receivers: [otlp]
      processors: [k8sattributes, attributes/env, batch]
      exporters: [otlp, spanmetrics, servicegraph, debug]
```
Environment
- Using the collector-contrib image on Linux EC2 instances
- Using the OpenTelemetry Operator and Helm charts, with pods deployed as a StatefulSet backed by Kubernetes persistent volume claims