
[receiver/azuremonitorreceiver] feat: option to use azmetrics getBatch API #38895


Merged

Conversation


@celian-garcia celian-garcia commented Mar 24, 2025

Description

Let me introduce the Batch Metrics API: https://learn.microsoft.com/en-us/rest/api/monitor/metrics-batch/batch?view=rest-monitor-2023-10-01&tabs=HTTP

The advantages of using the azmetrics batch metrics API over the armmonitor metrics API are described in detail in the issue and in the README.md.

Note that this is a revival of a previous PR we made last year that was never properly followed up; we have been running that forked code at Amadeus for months with good results.

Link to tracking issue

Relates #38651

Testing

✅ Manual tests + mocked unit tests

Documentation

✅ Added config field documentation in README.md

@github-actions github-actions bot requested a review from nslaughter March 24, 2025 10:26
@celian-garcia celian-garcia force-pushed the azuremonitorreceiver/getBatch branch 5 times, most recently from 5e66998 to 6c51627 on March 24, 2025 10:45
// Once all metrics have been collected for one subscription, we move to the next.
// We need to keep it synchronous to have the subscription id in resource attributes and not in metric attributes.
// It can be revamped later if we need to parallelize more, but currently, resource emit is not thread safe.
rb := s.mb.NewResourceBuilder()
celian-garcia (Member, Author):

Running each subscription synchronously might become a problem when a tenant contains a lot of subscriptions.
In our fork we run each subscription in parallel and put the subscription id at data point level, but I prefer to keep it at resource attribute level.

We first need to run it on our tenants and see whether it takes longer than the collection interval.
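
For illustration, a minimal, self-contained sketch of the synchronous shape the code comment above describes; the `emitter` type and all names here are hypothetical stand-ins, not the receiver's actual identifiers:

```go
package main

import "fmt"

// emitter stands in for the generated metrics builder, whose emit path is
// not thread safe -- hence one subscription is processed at a time.
type emitter struct{}

func (e *emitter) emit(subscriptionID string, metrics []string) {
	for _, m := range metrics {
		// the subscription id lands on the resource, not on each metric
		fmt.Printf("resource{subscription_id=%q} metric=%s\n", subscriptionID, m)
	}
}

func main() {
	subs := map[string][]string{
		"sub-a": {"cpu", "memory"},
		"sub-b": {"requests"},
	}
	e := &emitter{}
	for id, metrics := range subs { // sequential on purpose, never in parallel
		e.emit(id, metrics)
	}
}
```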

celian-garcia (Member, Author):

Tested with 94 discovered subscriptions, and it took well under 1m, on my laptop. Quite satisfying and reassuring.

celian-garcia (Member, Author):

Well, OK, that was for only one resource type:

    services: [ "Microsoft.Storage/storageAccounts" ]
    use_batch_api: true
    discover_subscriptions: true

celian-garcia (Member, Author):

I think it's acceptable to release it like this, since it's not a migration but just an option to use a different API.
It's hard to test large metric volumes locally, but we'll certainly test in our environments and probably come back with fixes 😄

@celian-garcia (Member, Author):

A lot of code is duplicated between scraper and scraper_batch. I think it's acceptable technical debt if we then have a follow-up PR/issue to address it later.

@celian-garcia celian-garcia force-pushed the azuremonitorreceiver/getBatch branch 9 times, most recently from caae7a3 to 03b4ec3 on March 27, 2025 09:52
@celian-garcia celian-garcia marked this pull request as ready for review March 27, 2025 10:00
@celian-garcia celian-garcia requested a review from a team as a code owner March 27, 2025 10:00
@@ -35,6 +35,8 @@ The following settings are optional:
- `cloud` (default = `AzureCloud`): defines which Azure cloud to use. Valid values: `AzureCloud`, `AzureUSGovernment`, `AzureChinaCloud`.
- `dimensions.enabled` (default = `true`): allows opting out of automatically splitting by all the dimensions of the resource type.
- `dimensions.overrides` (default = `{}`): if dimensions are enabled, allows you to specify a set of dimensions for a particular metric. This is a two-level map whose first key is the resource type and whose second key is the metric name. Use the programmatic metric name: https://learn.microsoft.com/en-us/azure/azure-monitor/reference/metrics-index
- `use_batch_api` (default = `false`): use the batch API to fetch metrics. This is useful when the number of subscriptions is high and the API calls are rate limited.
- `region` (default = `""`): the region to use when the batch API is enabled AND the subscriptions are not discovered (i.e. given as a list). In that case we need the region and cannot guess it.
@ashishbans commented Mar 27, 2025:

Hello @celian-garcia, what if a user wants to specify a list of regions, or wants metrics to be collected from all regions in a particular subscription, and does not want to use subscription discovery?

@celian-garcia (Member, Author) commented Mar 27, 2025:

This would be another feature, I guess? How would it work in terms of the Azure API? Would it get all subscriptions in the region and then make the usual calls?

@ashishbans commented Mar 28, 2025:

Okay @celian-garcia, I appreciate your efforts on this feature request.
We can take it up in the next feature request, but it could be something like: get the list of all regions in a particular subscription and make API calls accordingly. I'm not sure what the most efficient way is.

One more thing:
I tested it by providing westeurope as the location. It is still checking and pulling metrics from other locations, which is fine for me, but I'm not sure this was expected.

2025-03-28T01:59:22.670+0530	info	[email protected]/scraper_batch.go:79	Batch Endpoint	{"otelcol.component.id": "azuremonitor", "otelcol.component.kind": "Receiver", "otelcol.signal": "metrics", "endpoint": "https://northeurope.metrics.monitor.azure.com"}
2025-03-28T01:59:31.541+0530	info	[email protected]/scraper_batch.go:79	Batch Endpoint	{"otelcol.component.id": "azuremonitor", "otelcol.component.kind": "Receiver", "otelcol.signal": "metrics", "endpoint": "https://eastus.metrics.monitor.azure.com"}
2025-03-28T01:59:39.248+0530	info	[email protected]/scraper_batch.go:79	Batch Endpoint	{"otelcol.component.id": "azuremonitor", "otelcol.component.kind": "Receiver", "otelcol.signal": "metrics", "endpoint": "https://westeurope.metrics.monitor.azure.com"}
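
The endpoints in these log lines are regional, which is why a region is needed at all when subscriptions are given as a list. A minimal sketch, assuming only the URL pattern visible above (the function name is illustrative, not the receiver's actual code):

```go
package main

import "fmt"

// batchEndpoint derives the regional Azure Monitor batch metrics endpoint,
// following the URL pattern visible in the scraper_batch.go log lines above.
func batchEndpoint(region string) string {
	return fmt.Sprintf("https://%s.metrics.monitor.azure.com", region)
}

func main() {
	fmt.Println(batchEndpoint("westeurope")) // https://westeurope.metrics.monitor.azure.com
}
```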

celian-garcia (Member, Author):

Cool, thanks for testing it out! No, it's not expected, except if you are using subscription discovery, in which case the region field is not checked.

Actually, I will try to remove the region field; if I can keep the user from having to bother with it, that would be better.


celian-garcia (Member, Author):

Ok, reading the code twice, it looks like the configured region has no effect, as the region used is in any case the one retrieved in the getResourceAndTypes function. I will be able to simplify by removing the region field.

My confusion came from the fact that I initially thought a subscription could only be in one region, which is not the case: https://dev.to/simonwaight/understanding-tenants-subscriptions-regions-and-geographies-in-azure-550h

> Subscriptions are not tied to an Azure Region and as a result can contain resources from any number of Regions.

celian-garcia (Member, Author):

I removed the region field. The region is now computed directly from the resources.
Basically, for each resource, it stores the resource's region as a requestable region on the subscription.
Then, when we query the getBatch metrics API, we do it for each region found in the subscription, as sketched below.
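
A minimal sketch of that grouping, with illustrative types (the real receiver uses the armresources/azmetrics SDK types):

```go
package main

import "fmt"

// resource is an illustrative stand-in for an Azure resource.
type resource struct {
	ID       string
	Location string
}

// groupByRegion buckets resource IDs by region so that one getBatch call
// can be issued per region found in the subscription.
func groupByRegion(resources []resource) map[string][]string {
	byRegion := make(map[string][]string)
	for _, r := range resources {
		byRegion[r.Location] = append(byRegion[r.Location], r.ID)
	}
	return byRegion
}

func main() {
	resources := []resource{
		{ID: "storage1", Location: "westeurope"},
		{ID: "storage2", Location: "northeurope"},
		{ID: "storage3", Location: "westeurope"},
	}
	for region, ids := range groupByRegion(resources) {
		fmt.Printf("getBatch in %s for %d resources\n", region, len(ids))
	}
}
```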

@atoulme atoulme marked this pull request as draft March 27, 2025 20:56
atoulme added a commit that referenced this pull request Apr 7, 2025
…attribute (#39029)

#### Description
The subscription name is often more meaningful to people working with metrics than the subscription id, which is a UUID. Hence this PR.

#### Link to tracking issue
Linked with #38895. I prefer a separate PR for adding the subscription name to resource attributes, to keep one PR per feature.

#### Testing
Added unit tests + a new mock API to mock the new get-by-subscription calls.


Signed-off-by: Célian Garcia <[email protected]>
Co-authored-by: Antoine Toulme <[email protected]>
dmathieu pushed a commit to dmathieu/opentelemetry-collector-contrib that referenced this pull request Apr 8, 2025
…attribute (open-telemetry#39029)

@celian-garcia celian-garcia force-pushed the azuremonitorreceiver/getBatch branch from 03b4ec3 to 8ff4a53 on April 9, 2025 14:54
@celian-garcia celian-garcia force-pushed the azuremonitorreceiver/getBatch branch from 8ff4a53 to 5ef0796 on April 9, 2025 15:01
LucianoGiannotti pushed a commit to LucianoGiannotti/opentelemetry-collector-contrib that referenced this pull request Apr 9, 2025
…attribute (open-telemetry#39029)

@celian-garcia celian-garcia marked this pull request as ready for review April 10, 2025 11:59
@celian-garcia (Member, Author):

For everyone's information, we rolled over from our fork to this PR with good success: same amount of timeseries, and seamless for users.

@atoulme (Contributor) left a comment:

Written by a codeowner; it had review and confirmation that it works on a fork. Let's merge.

@atoulme atoulme merged commit 1fdf89d into open-telemetry:main Apr 14, 2025
174 checks passed
@github-actions github-actions bot added this to the next release milestone Apr 14, 2025
akshays-19 pushed a commit to akshays-19/opentelemetry-collector-contrib that referenced this pull request Apr 23, 2025
…h API (open-telemetry#38895)

Fiery-Fenix pushed a commit to Fiery-Fenix/opentelemetry-collector-contrib that referenced this pull request Apr 24, 2025
…attribute (open-telemetry#39029)

Fiery-Fenix pushed a commit to Fiery-Fenix/opentelemetry-collector-contrib that referenced this pull request Apr 24, 2025
…h API (open-telemetry#38895)

attributes["timegrain"] = &compositeKey.timeGrain
for i := len(timeseriesElement.Data) - 1; i >= 0; i-- { // reverse for loop because newest timestamp is at the end of the slice


What's the reason for this choice? Wouldn't you want the data points created from oldest -> newest?


This is mostly because the receiver works in a "streaming" pipeline.
Every time the receiver calls the Azure API, the result can contain one or more samples. Here the code retrieves only the newest sample. On the next call, a new point (or maybe the same point) will be collected. This is to avoid producing duplicates over time: the collector only handles the latest data point.


It also goes together with the break on the next lines, because we want to stop processing the time range once we have one value available. (The Azure API often replies with timestamps that have no samples in them (empty values), even more so for 1-hour time-grain aggregate metrics like blob storage size.)

@celian-garcia (Member, Author) commented May 14, 2025:

@andrewegel you might then ask why we're making a range query and not a point query. It's because the API is sometimes a bit late. For the PT1M timegrain, for example, it may return results only for 3m or 4m in the past. So we query a range and take the last point.

The drawback is that if the API "wakes up" and is momentarily no longer behind, we can miss some points.
e.g.:
At time T, the query for range [T-3m, T] gives only the T-3m result.
At time T+1m (assuming the collection interval is 1m), the query for range [T-2m, T+1m] now gives all results [T-2, T-1, T, T+1] (the API "wakes up" and catches up).
=> then we'll take only T+1 and lose T-2, T-1, T.

There's no perfect approach without sending duplicates ^^


I also see the break in there after encountering the "last" data point you want to scrape; I should provide some more context:

  • We're running this on a 300s (5m) scrape interval;
  • We want 1m granular data.

Therefore we want all 1m data points in that 5m window, and the code as written only scrapes the latest data point value.

If I had to guess, this code assumes you're using the default of 10s:

    defaultCollectionInterval = 10 * time.Second

where it makes sense to just grab the last data point and use the "last scraped 1m" metric to determine when to scrape a certain metric/dimension combination. But in our usage we're scraping potentially thousands of resources, metrics, and dimensions; I don't think we can call this API that often within a 1m interval.

Let me create an official issue describing our requirements and you can advise us on what the proper receiver settings should be, because we're also trying to make this scrape somewhat cost-optimal, as every scrape call comes with an Azure cost to call that API.
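
For contrast with the newest-point-only approach, a sketch of the window-wide emission described above, using illustrative types; it tracks the last emitted timestamp to avoid duplicates across overlapping query windows, which is roughly the trade-off being discussed:

```go
package main

import (
	"fmt"
	"time"
)

// point is an illustrative stand-in for one sample in the queried window.
type point struct {
	ts  time.Time
	avg *float64
}

// emitWindow emits every non-empty point newer than lastEmitted, oldest
// first, and returns the new high-water mark.
func emitWindow(data []point, lastEmitted time.Time) time.Time {
	for _, p := range data { // oldest -> newest
		if p.avg == nil || !p.ts.After(lastEmitted) {
			continue // skip empty samples and points already emitted
		}
		fmt.Printf("emit ts=%s value=%v\n", p.ts.Format(time.RFC3339), *p.avg)
		lastEmitted = p.ts
	}
	return lastEmitted
}

func main() {
	v := 1.0
	now := time.Now().Truncate(time.Minute)
	window := []point{
		{ts: now.Add(-2 * time.Minute), avg: &v},
		{ts: now.Add(-1 * time.Minute), avg: nil}, // empty sample from the API
		{ts: now, avg: &v},
	}
	last := emitWindow(window, now.Add(-3*time.Minute))
	fmt.Println("high-water mark:", last.Format(time.RFC3339))
}
```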


FWIW #40078

attributes["timegrain"] = &compositeKey.timeGrain
for i := len(timeseriesElement.Data) - 1; i >= 0; i-- { // reverse for loop because newest timestamp is at the end of the slice
metricValue := timeseriesElement.Data[i]
if metricValue.Average != nil {


What's this conditional for? If I wanted an aggregation other than Average, this would prevent that data point from actually getting into the collector's pipeline.

@ahurtaud commented May 14, 2025:

I think it's a fair point. We are using Average because we don't rely on Azure Monitor aggregates. Average was the default choice because it seems to be available for every metric, and as we are using the smallest timegrain it's convenient. But I think this is a valid feature request: make the aggregation configurable (either globally or as an override per resourceType).
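
As a sketch of that feature request, a configurable aggregation picker over an illustrative stand-in for the SDK's MetricValue type (nothing here is the receiver's actual code):

```go
package main

import "fmt"

// metricValue is an illustrative stand-in for the SDK's MetricValue.
type metricValue struct {
	Average, Count, Maximum, Minimum, Total *float64
}

// pickAggregation returns the configured aggregation from a data point,
// if that aggregation is present.
func pickAggregation(v metricValue, agg string) (float64, bool) {
	var p *float64
	switch agg {
	case "Average":
		p = v.Average
	case "Count":
		p = v.Count
	case "Maximum":
		p = v.Maximum
	case "Minimum":
		p = v.Minimum
	case "Total":
		p = v.Total
	}
	if p == nil {
		return 0, false
	}
	return *p, true
}

func main() {
	total := 42.0
	v := metricValue{Total: &total}
	if val, ok := pickAggregation(v, "Total"); ok {
		fmt.Println(val) // 42
	}
}
```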

@celian-garcia (Member, Author) commented May 14, 2025:

Can we step back and try to understand, functionally, what this code is doing?

Reminder of the MetricValue structure:
[screenshot: the MetricValue struct definition]

attributes["timegrain"] = &compositeKey.timeGrain
for i := len(timeseriesElement.Data) - 1; i >= 0; i-- { // reverse for loop because newest timestamp is at the end of the slice
  metricValue := timeseriesElement.Data[i]
  if metricValue.Average != nil {
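    // Only the newest point that actually carries a value is processed,
    // then the loop stops: older points in the window are skipped so that
    // subsequent scrapes do not re-emit duplicates.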
    s.processQueryTimeseriesData(resID, metric, metricValue, attributes)
    break
  }
}

So what is it doing?
It takes only the last timestamp in the queried range, and if that value has an Average aggregation, we process all the aggregations present.

If I understand your answer @ahurtaud, we're doing this because we observed situations where some of the MetricValue fields are nil while the value entry is actually there?

Like we're receiving an array like this from Azure?

{
  data: [{
    timestamp: null
    average: null
    // ...
  }, {
    timestamp: 123456789
    average: 0.5
    // ...
  }]
}

If so, why didn't we use the timestamp for this oddity check?

I'm sorry @ahurtaud, maybe this is far back in your mind since you touched it a long time ago, and as you know I just copied that part.

@andrewegel is quite right that this is a problem, because we actually already have a feature to collect only certain aggregations: #37420. It builds a query filter before the query is made, to make sure we get only the selected aggregations for particular metrics.
Documentation: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/azuremonitorreceiver/README.md#filtering-metrics
So if someone configures it like in the example, this will not collect the IncomingMessage/Total aggregation.

@celian-garcia (Member, Author) commented May 14, 2025:

Okay, there is another problem. The metrics aggregation filter feature that I mentioned was duly backported to scraper_batch, to add the aggregation into the composite key, but the composite key is not used in the query that gets the metrics 😲

Let me create an issue to raise that point, and also the point that we should find a way to safely remove that average == nil check. Thanks!

@ahurtaud commented May 14, 2025:

> Like we're receiving an array like this from Azure?
>
>     {
>       data: [{
>         timestamp: null
>         average: null
>         // ...
>       }, {
>         timestamp: 123456789
>         average: 0.5
>         // ...
>       }]
>     }
>
> If so, why didn't we use the timestamp for this oddity check?

No, if I remember correctly it relates to your first comment; we are receiving things like:

{
  data: [{
    timestamp: 124
    average: null
    // ...
  }, {
    timestamp: 123
    average: null
    // ...
  }, {
    timestamp: 122
    average: 0.5
    // ...
  }]
}

So, as Average is the value most likely to always be there when Azure returns data, we decided to use it for "value selection"; the timestamp, on the other hand, is always valid.

celian-garcia (Member, Author):

OK, clear. Then we need a more clever decider, maybe taking the first aggregation from the composite key, or something more dynamic that analyzes the data and takes the most complete point. I will add it to the issue.
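
A minimal sketch of one such decider, using an illustrative stand-in for the SDK's MetricValue type: take the newest point that carries any non-nil aggregation, instead of keying on Average alone:

```go
package main

import "fmt"

// metricValue is an illustrative stand-in for the SDK's MetricValue.
type metricValue struct {
	Average, Count, Maximum, Minimum, Total *float64
}

// newestComplete returns the newest data point carrying any non-nil
// aggregation (data is ordered oldest to newest, as in the Azure response).
func newestComplete(data []metricValue) (metricValue, bool) {
	for i := len(data) - 1; i >= 0; i-- {
		v := data[i]
		if v.Average != nil || v.Total != nil || v.Maximum != nil ||
			v.Minimum != nil || v.Count != nil {
			return v, true
		}
	}
	return metricValue{}, false
}

func main() {
	avg := 0.5
	data := []metricValue{{Average: &avg}, {}, {}} // the two newest points are empty
	if v, ok := newestComplete(data); ok {
		fmt.Println(*v.Average) // 0.5
	}
}
```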
