[receiver/hostmetrics/cpuscraper] Windows - CTX timeout, use CountsWithContext instead and make it configurable #32133

Closed as not planned
@dloucasfx

Component(s)

receiver/hostmetrics

What happened?

Description

The gopsutil cpu.Counts function, which the cpu scraper calls, does not set a deadline or timeout on its context. As a result, WMIQueryWithContext falls back to its hardcoded timeout of 3 seconds.
In large, busy, or under-resourced environments, the WMI call can take longer than 3 seconds. The call then fails with a "context deadline exceeded" error and the scraper cannot get the CPU counts.

Steps to Reproduce

Find a Windows host where WMI calls take longer than 3 seconds, and run the hostmetrics receiver with the cpu scraper enabled.

Expected Result

All metrics are collected, including the physical and logical CPU counts.

Actual Result

The CPU counts are missing, and this error appears in the logs:

4670103 Mar 29 00:13 Error splunk-otel-collector 3 1.7116855975244713e+09 error scraperhelper/scrapercontroller.go:200 Error scraping metrics {"kind": "receiver", "name": "hostmetrics", "data_type": "metrics", "error": "context deadline exceeded", "scraper": "cpu"}
go.opentelemetry.io/collector/receiver/scraperhelper.(*controller).scrapeMetricsAndReport
        go.opentelemetry.io/collector/[email protected]/scraperhelper/scrapercontroller.go:200
go.opentelemetry.io/collector/receiver/scraperhelper.(*controller).startScraping.func1
        go.opentelemetry.io/collector/[email protected]/scraperhelper/scrapercontroller.go:176

Collector version

v0.95.0

Environment information

Environment

host_cpu_cores:"2"
host_cpu_model:"Intel(R) Xeon(R) Platinum 8175M CPU @ 2.50GHz"
host_mem_total:"8080924"
host_os_name:"Microsoft Windows Server 2016 Datacenter"

OpenTelemetry Collector configuration

No response

Additional context

The suggestion is to call CountsWithContext instead of Counts and to introduce a wmi_timeout option for the cpu scraper, so the deadline is set by the scraper and can be configured.

cc: @atoulme who helped with the RCA
