
[receiver/azuremonitorreceiver] feat: option to use azmetrics getBatch API #38895


Merged

Conversation


@celian-garcia celian-garcia commented Mar 24, 2025

Description

Let me introduce the Batch Metrics API: https://learn.microsoft.com/en-us/rest/api/monitor/metrics-batch/batch?view=rest-monitor-2023-10-01&tabs=HTTP

The advantages of using the azmetrics batch metrics API over the armmonitor metrics API are described in detail in the issue and in the README.md.

Note that this is a revival of a previous PR we made last year that was never properly followed up; we have been running that forked code at Amadeus for months with good results.

Link to tracking issue

Relates #38651

Testing

✅ Manual tests + mocked unit tests

Documentation

✅ Added config field documentation in README.md

@github-actions github-actions bot requested a review from nslaughter March 24, 2025 10:26
@celian-garcia celian-garcia force-pushed the azuremonitorreceiver/getBatch branch 5 times, most recently from 5e66998 to 6c51627 on March 24, 2025 10:45
// Once all metrics have been collected for one subscription, we move to the next.
// We need to keep it synchronous to have the subscription id in resource attributes and not in metric attributes.
// It can be revamped later if we need to parallelize more, but currently, resource emit is not thread safe.
rb := s.mb.NewResourceBuilder()
celian-garcia (Member, Author):

Running each subscription synchronously might become a problem when a tenant contains a lot of subscriptions.
In our fork we run each subscription in parallel and put the subscription id at data point level, but I prefer to keep it at resource attribute level.

We first need to run it on our tenants and see whether it takes longer than the collection interval.
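
For illustration, a minimal, self-contained sketch of the synchronous shape the code comment above describes; the `emitter` type and all names here are hypothetical stand-ins, not the receiver's actual identifiers:

```go
package main

import "fmt"

// emitter stands in for the generated metrics builder, whose emit path is
// not thread safe -- hence one subscription is processed at a time.
type emitter struct{}

func (e *emitter) emit(subscriptionID string, metrics []string) {
	for _, m := range metrics {
		// the subscription id lands on the resource, not on each metric
		fmt.Printf("resource{subscription_id=%q} metric=%s\n", subscriptionID, m)
	}
}

func main() {
	subs := map[string][]string{
		"sub-a": {"cpu", "memory"},
		"sub-b": {"requests"},
	}
	e := &emitter{}
	for id, metrics := range subs { // sequential on purpose, never in parallel
		e.emit(id, metrics)
	}
}
```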

celian-garcia (Member, Author):

Tested with 94 discovered subscriptions, and it took well under 1m, on my laptop. Quite satisfying and reassuring.

celian-garcia (Member, Author):

Well, OK, that was for only one resource type:

    services: [ "Microsoft.Storage/storageAccounts" ]
    use_batch_api: true
    discover_subscriptions: true

celian-garcia (Member, Author):

I think it's acceptable to release it like this, since it's not a migration but just an option to use a different API.
It's hard to test large metric volumes locally, but we'll certainly test in our environments and probably come back with fixes 😄

@celian-garcia (Member, Author):

A lot of code is duplicated between scraper and scraper_batch. I think it's acceptable technical debt if we then have a follow-up PR/issue to address it later.

@celian-garcia celian-garcia force-pushed the azuremonitorreceiver/getBatch branch 9 times, most recently from caae7a3 to 03b4ec3 on March 27, 2025 09:52
@celian-garcia celian-garcia marked this pull request as ready for review March 27, 2025 10:00
@celian-garcia celian-garcia requested a review from a team as a code owner March 27, 2025 10:00
@@ -35,6 +35,8 @@ The following settings are optional:
- `cloud` (default = `AzureCloud`): defines which Azure cloud to use. Valid values: `AzureCloud`, `AzureUSGovernment`, `AzureChinaCloud`.
- `dimensions.enabled` (default = `true`): allows opting out of automatically splitting by all the dimensions of the resource type.
- `dimensions.overrides` (default = `{}`): if dimensions are enabled, allows you to specify a set of dimensions for a particular metric. This is a two-level map whose first key is the resource type and whose second key is the metric name. Use the programmatic metric name: https://learn.microsoft.com/en-us/azure/azure-monitor/reference/metrics-index
- `use_batch_api` (default = `false`): use the batch API to fetch metrics. This is useful when the number of subscriptions is high and the API calls are rate limited.
- `region` (default = `""`): the region to use when the batch API is enabled AND the subscriptions are not discovered (i.e. given as a list). In that case we need the region and cannot guess it.
@ashishbans commented Mar 27, 2025:

Hello @celian-garcia, what if a user wants to specify a list of regions, or wants metrics to be collected from all regions in a particular subscription, and does not want to use subscription discovery?

@celian-garcia (Member, Author) commented Mar 27, 2025:

This would be another feature, I guess? How would it work in terms of the Azure API? Would it get all subscriptions in the region and then make the usual calls?

@ashishbans commented Mar 28, 2025:

Okay @celian-garcia, I appreciate your efforts on this feature request.
We can take it up in the next feature request, but it could be something like: get the list of all regions in a particular subscription and make API calls accordingly. I'm not sure what the most efficient way is.

One more thing:
I tested it by providing westeurope as the location. It is still checking and pulling metrics from other locations, which is fine for me, but I'm not sure this was expected.

2025-03-28T01:59:22.670+0530	info	[email protected]/scraper_batch.go:79	Batch Endpoint	{"otelcol.component.id": "azuremonitor", "otelcol.component.kind": "Receiver", "otelcol.signal": "metrics", "endpoint": "https://northeurope.metrics.monitor.azure.com"}
2025-03-28T01:59:31.541+0530	info	[email protected]/scraper_batch.go:79	Batch Endpoint	{"otelcol.component.id": "azuremonitor", "otelcol.component.kind": "Receiver", "otelcol.signal": "metrics", "endpoint": "https://eastus.metrics.monitor.azure.com"}
2025-03-28T01:59:39.248+0530	info	[email protected]/scraper_batch.go:79	Batch Endpoint	{"otelcol.component.id": "azuremonitor", "otelcol.component.kind": "Receiver", "otelcol.signal": "metrics", "endpoint": "https://westeurope.metrics.monitor.azure.com"}
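
The endpoints in these log lines are regional, which is why a region is needed at all when subscriptions are given as a list. A minimal sketch, assuming only the URL pattern visible above (the function name is illustrative, not the receiver's actual code):

```go
package main

import "fmt"

// batchEndpoint derives the regional Azure Monitor batch metrics endpoint,
// following the URL pattern visible in the scraper_batch.go log lines above.
func batchEndpoint(region string) string {
	return fmt.Sprintf("https://%s.metrics.monitor.azure.com", region)
}

func main() {
	fmt.Println(batchEndpoint("westeurope")) // https://westeurope.metrics.monitor.azure.com
}
```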

celian-garcia (Member, Author):

Cool, thanks for testing it out! No, it's not expected, except if you are using subscription discovery, in which case the region field is not checked.

Actually, I will try to remove the region field; if I can keep the user from having to bother with it, that would be better.


celian-garcia (Member, Author):

Ok, reading the code twice, it looks like the configured region has no effect, as the region used is in any case the one retrieved in the getResourceAndTypes function. I will be able to simplify by removing the region field.

My confusion came from the fact that I initially thought a subscription could only be in one region, which is not the case: https://dev.to/simonwaight/understanding-tenants-subscriptions-regions-and-geographies-in-azure-550h

> Subscriptions are not tied to an Azure Region and as a result can contain resources from any number of Regions.

celian-garcia (Member, Author):

I removed the region field. The region is now computed directly from the resources.
Basically, for each resource, it stores the resource's region as a requestable region on the subscription.
Then, when we query the getBatch metrics API, we do it for each region found in the subscription, as sketched below.
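
A minimal sketch of that grouping, with illustrative types (the real receiver uses the armresources/azmetrics SDK types):

```go
package main

import "fmt"

// resource is an illustrative stand-in for an Azure resource.
type resource struct {
	ID       string
	Location string
}

// groupByRegion buckets resource IDs by region so that one getBatch call
// can be issued per region found in the subscription.
func groupByRegion(resources []resource) map[string][]string {
	byRegion := make(map[string][]string)
	for _, r := range resources {
		byRegion[r.Location] = append(byRegion[r.Location], r.ID)
	}
	return byRegion
}

func main() {
	resources := []resource{
		{ID: "storage1", Location: "westeurope"},
		{ID: "storage2", Location: "northeurope"},
		{ID: "storage3", Location: "westeurope"},
	}
	for region, ids := range groupByRegion(resources) {
		fmt.Printf("getBatch in %s for %d resources\n", region, len(ids))
	}
}
```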

@atoulme atoulme marked this pull request as draft March 27, 2025 20:56
atoulme added a commit that referenced this pull request Apr 7, 2025
…attribute (#39029)

#### Description
The subscription name is often more meaningful to people working with metrics than the subscription id, which is a UUID. Hence this PR.

#### Link to tracking issue
Linked with #38895. I prefer a separate PR for adding the subscription name to resource attributes, to keep one PR per feature.

#### Testing
Added unit tests + a new mock API to mock the new get-by-subscription calls.


Signed-off-by: Célian Garcia <[email protected]>
Co-authored-by: Antoine Toulme <[email protected]>
dmathieu pushed a commit to dmathieu/opentelemetry-collector-contrib that referenced this pull request Apr 8, 2025
…attribute (open-telemetry#39029)

@celian-garcia celian-garcia force-pushed the azuremonitorreceiver/getBatch branch from 03b4ec3 to 8ff4a53 on April 9, 2025 14:54
@celian-garcia celian-garcia force-pushed the azuremonitorreceiver/getBatch branch from 8ff4a53 to 5ef0796 on April 9, 2025 15:01
LucianoGiannotti pushed a commit to LucianoGiannotti/opentelemetry-collector-contrib that referenced this pull request Apr 9, 2025
…attribute (open-telemetry#39029)

@celian-garcia celian-garcia marked this pull request as ready for review April 10, 2025 11:59
@celian-garcia (Member, Author):

For everyone's information, we rolled over from our fork to this PR with good success: same amount of timeseries, and seamless for users.

@atoulme (Contributor) left a comment:

Written by a codeowner; it had review and confirmation that it works on a fork. Let's merge.

@atoulme atoulme merged commit 1fdf89d into open-telemetry:main Apr 14, 2025
174 checks passed
@github-actions github-actions bot added this to the next release milestone Apr 14, 2025
akshays-19 pushed a commit to akshays-19/opentelemetry-collector-contrib that referenced this pull request Apr 23, 2025
…h API (open-telemetry#38895)

Fiery-Fenix pushed a commit to Fiery-Fenix/opentelemetry-collector-contrib that referenced this pull request Apr 24, 2025
…attribute (open-telemetry#39029)

Fiery-Fenix pushed a commit to Fiery-Fenix/opentelemetry-collector-contrib that referenced this pull request Apr 24, 2025
…h API (open-telemetry#38895)

attributes["timegrain"] = &compositeKey.timeGrain
for i := len(timeseriesElement.Data) - 1; i >= 0; i-- { // reverse for loop because newest timestamp is at the end of the slice


What's the reason for this choice? Wouldn't you want the data points created from oldest -> newest?


This is mostly because the receiver works in a "streaming" pipeline.
Every time the receiver calls the Azure API, the result can contain one or more samples. Here the code retrieves only the newest sample. On the next call, a new point (or maybe the same point) will be collected. This is to avoid producing duplicates over time: the collector only handles the latest data point.


It also goes together with the break on the next lines, because we want to stop processing the time range once we have one value available. (The Azure API often replies with timestamps that have no samples in them (empty values), even more so for 1-hour time-grain aggregate metrics like blob storage size.)

@celian-garcia (Member, Author) commented May 14, 2025:

@andrewegel you might then ask why we're making a range query and not a point query. It's because the API is sometimes a bit late. For the PT1M timegrain, for example, it may return results only for 3m or 4m in the past. So we query a range and take the last point.

The drawback is that if the API "wakes up" and is momentarily no longer behind, we can miss some points.
e.g.:
At time T, the query for range [T-3m, T] gives only the T-3m result.
At time T+1m (assuming the collection interval is 1m), the query for range [T-2m, T+1m] now gives all results [T-2, T-1, T, T+1] (the API "wakes up" and catches up).
=> then we'll take only T+1 and lose T-2, T-1, T.

There's no perfect approach without sending duplicates ^^


I also see the break in there after encountering the "last" data point you want to scrape; I should provide some more context:

  • We're running this on a 300s (5m) scrape interval;
  • We want 1m granular data.

Therefore we want all 1m data points in that 5m window, and the code as written only scrapes the latest data point value.

If I had to guess, this code assumes you're using the default of 10s:

    defaultCollectionInterval = 10 * time.Second

where it makes sense to just grab the last data point and use the "last scraped 1m" metric to determine when to scrape a certain metric/dimension combination. But in our usage we're scraping potentially thousands of resources, metrics, and dimensions; I don't think we can call this API that often within a 1m interval.

Let me create an official issue describing our requirements and you can advise us on what the proper receiver settings should be, because we're also trying to make this scrape somewhat cost-optimal, as every scrape call comes with an Azure cost to call that API.
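
For contrast with the newest-point-only approach, a sketch of the window-wide emission described above, using illustrative types; it tracks the last emitted timestamp to avoid duplicates across overlapping query windows, which is roughly the trade-off being discussed:

```go
package main

import (
	"fmt"
	"time"
)

// point is an illustrative stand-in for one sample in the queried window.
type point struct {
	ts  time.Time
	avg *float64
}

// emitWindow emits every non-empty point newer than lastEmitted, oldest
// first, and returns the new high-water mark.
func emitWindow(data []point, lastEmitted time.Time) time.Time {
	for _, p := range data { // oldest -> newest
		if p.avg == nil || !p.ts.After(lastEmitted) {
			continue // skip empty samples and points already emitted
		}
		fmt.Printf("emit ts=%s value=%v\n", p.ts.Format(time.RFC3339), *p.avg)
		lastEmitted = p.ts
	}
	return lastEmitted
}

func main() {
	v := 1.0
	now := time.Now().Truncate(time.Minute)
	window := []point{
		{ts: now.Add(-2 * time.Minute), avg: &v},
		{ts: now.Add(-1 * time.Minute), avg: nil}, // empty sample from the API
		{ts: now, avg: &v},
	}
	last := emitWindow(window, now.Add(-3*time.Minute))
	fmt.Println("high-water mark:", last.Format(time.RFC3339))
}
```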


FWIW #40078

attributes["timegrain"] = &compositeKey.timeGrain
for i := len(timeseriesElement.Data) - 1; i >= 0; i-- { // reverse for loop because newest timestamp is at the end of the slice
metricValue := timeseriesElement.Data[i]
if metricValue.Average != nil {


What's this conditional for? If I wanted an aggregation other than Average, this would prevent that data point from actually getting into the collector's pipeline.

@ahurtaud commented May 14, 2025:

I think it's a fair point. We are using Average because we don't rely on Azure Monitor aggregates. Average was the default choice because it seems to be available for every metric, and as we are using the smallest timegrain it's convenient. But I think this is a valid feature request: make the aggregation configurable (either globally or as an override per resourceType).
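
As a sketch of that feature request, a configurable aggregation picker over an illustrative stand-in for the SDK's MetricValue type (nothing here is the receiver's actual code):

```go
package main

import "fmt"

// metricValue is an illustrative stand-in for the SDK's MetricValue.
type metricValue struct {
	Average, Count, Maximum, Minimum, Total *float64
}

// pickAggregation returns the configured aggregation from a data point,
// if that aggregation is present.
func pickAggregation(v metricValue, agg string) (float64, bool) {
	var p *float64
	switch agg {
	case "Average":
		p = v.Average
	case "Count":
		p = v.Count
	case "Maximum":
		p = v.Maximum
	case "Minimum":
		p = v.Minimum
	case "Total":
		p = v.Total
	}
	if p == nil {
		return 0, false
	}
	return *p, true
}

func main() {
	total := 42.0
	v := metricValue{Total: &total}
	if val, ok := pickAggregation(v, "Total"); ok {
		fmt.Println(val) // 42
	}
}
```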

@celian-garcia (Member, Author) commented May 14, 2025:

Can we step back and try to understand, functionally, what this code is doing?

Reminder of the MetricValue structure:
[screenshot: the MetricValue struct definition]

attributes["timegrain"] = &compositeKey.timeGrain
for i := len(timeseriesElement.Data) - 1; i >= 0; i-- { // reverse for loop because newest timestamp is at the end of the slice
  metricValue := timeseriesElement.Data[i]
  if metricValue.Average != nil {
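    // Only the newest point that actually carries a value is processed,
    // then the loop stops: older points in the window are skipped so that
    // subsequent scrapes do not re-emit duplicates.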
    s.processQueryTimeseriesData(resID, metric, metricValue, attributes)
    break
  }
}

So what is it doing?
It takes only the last timestamp in the queried range, and if that value has an Average aggregation, we process all the aggregations present.

If I understand your answer @ahurtaud, we're doing this because we observed situations where some of the MetricValue fields are nil while the value entry is actually there?

Like we're receiving an array like this from Azure?

{
  data: [{
    timestamp: null
    average: null
    // ...
  }, {
    timestamp: 123456789
    average: 0.5
    // ...
  }]
}

If so, why didn't we use the timestamp for this oddity check?

I'm sorry @ahurtaud, maybe this is far back in your mind since you touched it a long time ago, and as you know I just copied that part.

@andrewegel is quite right that this is a problem, because we actually already have a feature to collect only certain aggregations: #37420. It builds a query filter before the query is made, to make sure we get only the selected aggregations for particular metrics.
Documentation: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/azuremonitorreceiver/README.md#filtering-metrics
So if someone configures it like in the example, this will not collect the IncomingMessage/Total aggregation.

@celian-garcia (Member, Author) commented May 14, 2025:

Okay, there is another problem. The metrics aggregation filter feature that I mentioned was duly backported to scraper_batch, to add the aggregation into the composite key, but the composite key is not used in the query that gets the metrics 😲

Let me create an issue to raise that point, and also the point that we should find a way to safely remove that average == nil check. Thanks!

@ahurtaud commented May 14, 2025:

> Like we're receiving an array like this from Azure?
>
>     {
>       data: [{
>         timestamp: null
>         average: null
>         // ...
>       }, {
>         timestamp: 123456789
>         average: 0.5
>         // ...
>       }]
>     }
>
> If so, why didn't we use the timestamp for this oddity check?

No, if I remember correctly it relates to your first comment; we are receiving things like:

{
  data: [{
    timestamp: 124
    average: null
    // ...
  }, {
    timestamp: 123
    average: null
    // ...
  }, {
    timestamp: 122
    average: 0.5
    // ...
  }]
}

So, as Average is the value most likely to always be there when Azure returns data, we decided to use it for "value selection"; the timestamp, on the other hand, is always valid.

celian-garcia (Member, Author):

OK, clear. Then we need a more clever decider, maybe taking the first aggregation from the composite key, or something more dynamic that analyzes the data and takes the most complete point. I will add it to the issue.
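
A minimal sketch of one such decider, using an illustrative stand-in for the SDK's MetricValue type: take the newest point that carries any non-nil aggregation, instead of keying on Average alone:

```go
package main

import "fmt"

// metricValue is an illustrative stand-in for the SDK's MetricValue.
type metricValue struct {
	Average, Count, Maximum, Minimum, Total *float64
}

// newestComplete returns the newest data point carrying any non-nil
// aggregation (data is ordered oldest to newest, as in the Azure response).
func newestComplete(data []metricValue) (metricValue, bool) {
	for i := len(data) - 1; i >= 0; i-- {
		v := data[i]
		if v.Average != nil || v.Total != nil || v.Maximum != nil ||
			v.Minimum != nil || v.Count != nil {
			return v, true
		}
	}
	return metricValue{}, false
}

func main() {
	avg := 0.5
	data := []metricValue{{Average: &avg}, {}, {}} // the two newest points are empty
	if v, ok := newestComplete(data); ok {
		fmt.Println(*v.Average) // 0.5
	}
}
```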
