[receiver/azuremonitorreceiver] feat: option to use azmetrics getBatch API #38895
Conversation
Force-pushed 5e66998 to 6c51627 (compare)
// Once all metrics have been collected for one subscription, we move to the next.
// We need to keep it synchronous to have the subscription id in resource attributes and not metric attributes.
// It can be revamped later if we need to parallelize more, but currently, resource emit is not thread safe.
rb := s.mb.NewResourceBuilder()
Running each subscription synchronously might become a problem when a tenant contains a lot of subscriptions.
In our fork, we run each sub in parallel and have the subscription id at data point level, but I prefer to keep it at resource attribute level.
We first need to run it on our tenants and see if it takes longer than the collection interval.
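For illustration, a minimal sketch of the parallel approach described here: concurrent per-subscription scrapes, with the emit serialized behind a mutex since resource emit is not thread safe. `scrapeSubscription` and `emitForSubscription` are hypothetical stand-ins, not the receiver's actual functions.

```go
package sketch

import (
	"context"
	"sync"

	"golang.org/x/sync/errgroup"
)

// Stubs standing in for the receiver's real scrape and emit logic.
func scrapeSubscription(ctx context.Context, subID string) ([]string, error) { return nil, nil }
func emitForSubscription(subID string, metrics []string) error               { return nil }

// scrapeAll runs the Azure API calls for each subscription concurrently,
// but serializes the emit, keeping the subscription id at resource attribute level.
func scrapeAll(ctx context.Context, subscriptionIDs []string) error {
	var emitMu sync.Mutex
	g, ctx := errgroup.WithContext(ctx)
	for _, subID := range subscriptionIDs {
		subID := subID // loop-variable capture (needed before Go 1.22)
		g.Go(func() error {
			metrics, err := scrapeSubscription(ctx, subID)
			if err != nil {
				return err
			}
			emitMu.Lock()
			defer emitMu.Unlock()
			return emitForSubscription(subID, metrics)
		})
	}
	return g.Wait()
}
```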
Tested with 94 subscriptions discovered, and it took well under 1m, on my laptop. Quite satisfying and reassuring.
Well, ok, it was for only one resource type:
services: [ "Microsoft.Storage/storageAccounts" ]
use_batch_api: true
discover_subscriptions: true
I think it's acceptable to release it like that, since it's not a migration but just a proposal to use a different API.
I struggle to test large amounts of metrics locally, but we'll test in our environments for sure and probably come back with fixes 😄
A lot of code is duplicated between scraper and scraper_batch. I think that's acceptable technical debt if we then have a follow-up PR/issue to address it later.
Force-pushed caae7a3 to 03b4ec3 (compare)
@@ -35,6 +35,8 @@ The following settings are optional:
- `cloud` (default = `AzureCloud`): defines which Azure cloud to use. Valid values: `AzureCloud`, `AzureUSGovernment`, `AzureChinaCloud`.
- `dimensions.enabled` (default = `true`): allows to opt out from automatically split by all the dimensions of the resource type.
- `dimensions.overrides` (default = `{}`): if dimensions are enabled, it allows you to specify a set of dimensions for a particular metric. This is a two levels map with first key being the resource type and second key being the metric name. Programmatic value should be used for metric name https://learn.microsoft.com/en-us/azure/azure-monitor/reference/metrics-index
- `use_batch_api` (default = `false`): Use the batch API to fetch metrics. This is useful when the number of subscriptions is high and the API calls are rate limited.
- `region` (default = `""`): The region is used when batch API is used AND the subscriptions are not discovered (given as list). In this case we need region and cannot guess it.
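For illustration, a hypothetical config combining these two fields with an explicit subscription list. The `subscription_ids` field name is assumed from the existing receiver config, not from this diff.

```yaml
receivers:
  azuremonitor:
    use_batch_api: true
    # region is only needed because the subscriptions are given explicitly
    # (no discovery), so the receiver cannot guess it.
    region: "westeurope"
    subscription_ids:
      - "00000000-0000-0000-0000-000000000000"
    services: [ "Microsoft.Storage/storageAccounts" ]
```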
Hello @celian-garcia, what if a user wants to specify a list of regions, or wants metrics to be collected from all regions in a particular subscription, and does not want to use subscription discovery?
This would be another feature, I guess? How would it work in terms of the Azure API? Would it get all subscriptions in the region and then make the usual calls?
Okay @celian-garcia, I appreciate your efforts on this feature request.
We can take it as a next feature request, but it could be: get the list of all regions in a particular subscription and make API calls accordingly. I'm not sure what the most efficient way is.
One more thing:
I tested it by providing westeurope as the location. It is still checking and pulling metrics from other locations, which is good for me, but I'm not sure this was expected.
2025-03-28T01:59:22.670+0530 info [email protected]/scraper_batch.go:79 Batch Endpoint {"otelcol.component.id": "azuremonitor", "otelcol.component.kind": "Receiver", "otelcol.signal": "metrics", "endpoint": "https://northeurope.metrics.monitor.azure.com"}
2025-03-28T01:59:31.541+0530 info [email protected]/scraper_batch.go:79 Batch Endpoint {"otelcol.component.id": "azuremonitor", "otelcol.component.kind": "Receiver", "otelcol.signal": "metrics", "endpoint": "https://eastus.metrics.monitor.azure.com"}
2025-03-28T01:59:39.248+0530 info [email protected]/scraper_batch.go:79 Batch Endpoint {"otelcol.component.id": "azuremonitor", "otelcol.component.kind": "Receiver", "otelcol.signal": "metrics", "endpoint": "https://westeurope.metrics.monitor.azure.com"}
Cool, thanks for testing it out! No, it's not expected.
Except if you are using subscription discovery; in that case the region field is not checked.
I will actually try to remove the region field: if I can keep the user from having to bother with it, all the better.
Thanks @celian-garcia
Ok, reading the code twice, it looks like the configured region has no effect, as the region used is anyway the one retrieved in the getResourceAndTypes function. I will be able to simplify by removing the region field.
My confusion came from the fact that I initially thought a subscription could only be in one region, which is not the case. https://dev.to/simonwaight/understanding-tenants-subscriptions-regions-and-geographies-in-azure-550h?utm_source=chatgpt.com
"Subscriptions are not tied to an Azure Region and as a result can contain resources from any number of Regions."
I removed the region field. The region is now computed directly from the resources.
Basically, for each resource it stores the region as a requestable region in the subscription.
Then, when we query the getBatch metrics API, we do it for each region found in the subscription.
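For illustration, a minimal sketch of that per-region batching logic with simplified types; `queryBatch` is a hypothetical stand-in for the azmetrics getBatch call against the regional endpoint.

```go
package sketch

import "context"

// resource is a minimal stand-in for the receiver's resource metadata.
type resource struct {
	ID     string
	Region string
}

// queryBatch stands in for the azmetrics getBatch call made against a
// regional endpoint such as https://<region>.metrics.monitor.azure.com.
func queryBatch(ctx context.Context, region string, resourceIDs []string) error { return nil }

// scrapeSubscription groups the subscription's resources by the region
// recorded during resource discovery, then issues one batch query per region.
func scrapeSubscription(ctx context.Context, resources []resource) error {
	byRegion := map[string][]string{}
	for _, r := range resources {
		byRegion[r.Region] = append(byRegion[r.Region], r.ID)
	}
	for region, ids := range byRegion {
		if err := queryBatch(ctx, region, ids); err != nil {
			return err
		}
	}
	return nil
}
```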
…attribute (#39029)
Description: The subscription name is often more meaningful to people working with metrics than the subscription id, which is a UUID. Hence this PR.
Link to tracking issue: Linked with #38895. I prefer a separate PR for adding the subscription name to resource attributes, to have one PR per feature.
Testing: Added unit tests + a new mock API to mock the new get-by-subscription calls.
Signed-off-by: Célian Garcia <[email protected]>
Co-authored-by: Antoine Toulme <[email protected]>
Force-pushed 03b4ec3 to 8ff4a53 (compare)
…h API Signed-off-by: Célian Garcia <[email protected]>
Force-pushed 8ff4a53 to 5ef0796 (compare)
For everyone's information, we rolled out from our fork to this PR with good success: the same amount of timeseries, and seamless for users.
Written by a codeowner, with review and confirmation that it works on a fork: let's merge.
…h API (open-telemetry#38895)
…attribute (open-telemetry#39029)
}
}
attributes["timegrain"] = &compositeKey.timeGrain
for i := len(timeseriesElement.Data) - 1; i >= 0; i-- { // reverse for loop because newest timestamp is at the end of the slice
What's the reason for this choice? Wouldn't you want the datapoints created from oldest -> newest?
This is mostly because the receiver works in a "streaming" pipeline.
Every time the receiver calls the Azure API, the result can contain 1 or more samples. Here the code retrieves only the newest sample. On the next call, a new point (or maybe the same point) will be collected. This is to avoid producing duplicates over time; the collector only handles the latest datapoint.
It also goes with the break on the next lines, because we want to stop processing the time range once we have 1 value available (the Azure API often replies with a timestamp that has no samples in it (empty values), even more so for 1-hour timegrain aggregate metrics like blob storage size).
@andrewegel you might then ask yourself why we make a range query and not a point query. It's because the API is sometimes a bit late: for the PT1M timegrain, it may only return results for 3m or 4m in the past. So we query a range and take the last point.
The drawback is that if the API "wakes up" and is no longer late at some moment, we can miss some points.
e.g.:
At time T, the query for range [T-3m, T] returns only the T-3m result.
At time T+1m (assuming the collection interval is 1m), the query for range [T-2m, T+1m] now returns all results [T-2m, T-1m, T, T+1m] (the API "wakes up" and catches up on its lateness)
=> then we take only T+1m and lose T-2m, T-1m, T.
There's no perfect approach without sending duplicates ^^
I also see the break in there after encountering the "last" data point you want to scrape; I should provide some more context:
- We're running this on a 300s (5m) scrape interval;
- We want 1m-granular data.
Therefore we want all 1m data points in that 5m window, and the code as written only scrapes the latest datapoint value.
If I had to guess, this code assumes you're using the default of 10s:
defaultCollectionInterval = 10 * time.Second
Let me create an official issue describing our requirements and you can advise us on what the proper receiver settings should be, because we're also trying to make this scrape somewhat cost-optimal, as every call to that API comes with an Azure cost.
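A minimal sketch of the alternative being asked for here, with simplified stand-in types (not the receiver's actual code): walk the returned window oldest to newest and emit every populated sample, instead of breaking on the newest one. Deduplication across overlapping windows is sketched by remembering the last emitted timestamp.

```go
package sketch

import "time"

// metricValue mirrors just the fields used from the SDK's MetricValue.
type metricValue struct {
	TimeStamp *time.Time
	Average   *float64
}

// emit stands in for pushing a datapoint into the collector pipeline.
func emit(ts time.Time, v float64) {}

// emitAllPoints emits every populated sample, so a 5m window at 1m
// granularity yields 5 datapoints instead of only the latest one.
func emitAllPoints(data []metricValue, lastEmitted time.Time) time.Time {
	for _, mv := range data {
		if mv.TimeStamp == nil || mv.Average == nil {
			continue // Azure frequently returns timestamps with empty values
		}
		if !mv.TimeStamp.After(lastEmitted) {
			continue // already emitted during a previous scrape
		}
		emit(*mv.TimeStamp, *mv.Average)
		lastEmitted = *mv.TimeStamp
	}
	return lastEmitted
}
```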
FWIW #40078
attributes["timegrain"] = &compositeKey.timeGrain
for i := len(timeseriesElement.Data) - 1; i >= 0; i-- { // reverse for loop because newest timestamp is at the end of the slice
metricValue := timeseriesElement.Data[i]
if metricValue.Average != nil {
What's this conditional for? If I wanted an aggregation other than Average, this would prevent that data point from actually getting into the collector's pipeline.
I think it's a fair point. We use Average because we don't rely on Azure Monitor aggregates. Average was the default choice because it seems to be available for every metric, and since we use the smallest timegrain, it's convenient. But I think this is a valid feature request: make the aggregation configurable (either globally or overridden per resourceType).
Can we step back and try to understand functionally what this code is doing?
Reminder of the `metricValue` structure:
attributes["timegrain"] = &compositeKey.timeGrain
for i := len(timeseriesElement.Data) - 1; i >= 0; i-- { // reverse for loop because newest timestamp is at the end of the slice
metricValue := timeseriesElement.Data[i]
if metricValue.Average != nil {
s.processQueryTimeseriesData(resID, metric, metricValue, attributes)
break
}
}
So what is it doing?
It takes only the last timestamp in the queried range, and if that value has an Average aggregation, we process all the aggregations present.
If I understand your answer @ahurtaud, we do this because we observed situations where some of the MetricValue fields are nil while the value entry is actually there?
Like we're receiving an array like this from Azure?
{
data: [{
timestamp: null
average: null
// ...
}, {
timestamp: 123456789
average: 0.5
// ...
}]
}
If so, why didn't we choose timestamp for this oddity check?
I'm sorry @ahurtaud, maybe it's far back in your mind, as you touched it a long time ago and, as you know, I just copied that part.
@andrewegel is quite right that it's a problem, because we already have a feature to collect only some aggregations: #37420. It builds a query filter before the query is made, to make sure that we get only the selected aggregations for particular metrics.
Documentation: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/azuremonitorreceiver/README.md#filtering-metrics
So if someone configures it like in the example, this will not collect the IncomingMessage/Total aggregation.
Okay, there is another problem. The metrics aggregation filter feature that I mentioned has been properly backported to scraper_batch, adding the aggregation into the composite key, but the composite key is not used in the query that gets the metrics 😲
Let me create an issue to raise that point, and also the point that we should find a way to safely remove that average == nil check. Thanks!
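For the record, a hypothetical sketch of what wiring the composite key's aggregations into the batch query could look like; `compositeKey` and `queryOptions` here are simplified stand-ins, not the scraper's or the SDK's actual types.

```go
package sketch

import "strings"

// compositeKey mirrors the idea of the scraper's composite key: a timegrain
// plus the aggregations selected by the filtering feature.
type compositeKey struct {
	timeGrain    string
	aggregations []string // e.g. ["Average", "Total"]
}

// queryOptions stands in for the batch query options; the Azure Monitor
// metrics API accepts the aggregation list as a comma-separated string.
type queryOptions struct {
	Interval    string
	Aggregation string
}

// optionsFrom builds query options from the composite key, so the filter
// configured per metric is actually sent to the API instead of being
// dropped after key construction.
func optionsFrom(key compositeKey) queryOptions {
	return queryOptions{
		Interval:    key.timeGrain,
		Aggregation: strings.Join(key.aggregations, ","),
	}
}
```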
Like we're receiving an array like this from Azure?
{
  data: [{
    timestamp: null
    average: null
    // ...
  }, {
    timestamp: 123456789
    average: 0.5
    // ...
  }]
}
If so, why didn't we choose timestamp to make this oddity verification?
No, if I remember correctly it's about your first comment; we are receiving things like:
{
data: [{
timestamp: 124
average: null
// ...
}, {
timestamp: 123
average: null
// ...
}, {
timestamp: 122
average: 0.5
// ...
}]
}
So, as Average is the value most likely to always be present when Azure returns data, we decided to use it for value selection; the timestamp, on the other hand, is always valid.
Ok, clear. Then we need a smarter decider: maybe taking the first of the aggregations from the composite key, or something more dynamic that analyzes the data and takes the most complete one. I will add it to the issue.
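A sketch of one possible "smarter decider": instead of hardcoding Average, take the first non-nil aggregation following a preference order, which could come from the composite key. The types are simplified stand-ins for the SDK's MetricValue.

```go
package sketch

// metricValue mirrors the aggregation fields of the SDK's MetricValue.
type metricValue struct {
	Average *float64
	Total   *float64
	Minimum *float64
	Maximum *float64
	Count   *float64
}

// pickValue returns the first non-nil aggregation following the given
// preference order, so samples that only carry Total (for example) are
// no longer skipped by an average == nil check.
func pickValue(mv metricValue, order []string) (string, *float64) {
	byName := map[string]*float64{
		"Average": mv.Average,
		"Total":   mv.Total,
		"Minimum": mv.Minimum,
		"Maximum": mv.Maximum,
		"Count":   mv.Count,
	}
	for _, name := range order {
		if v := byName[name]; v != nil {
			return name, v
		}
	}
	return "", nil
}
```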
Description
Let me introduce the Batch Metrics API https://learn.microsoft.com/en-us/rest/api/monitor/metrics-batch/batch?view=rest-monitor-2023-10-01&tabs=HTTP
The advantages of using the `azmetrics` batch metrics API over the `armmonitor` metrics API are described in detail in the issue and in the README.md.
Note that this is a revival of a previous PR that we made last year and that was not followed up properly; we have been using that forked code at the Amadeus company for months, with good results.
Link to tracking issue
Relates #38651
Testing
✅ Manual tests + mocked unit tests
Documentation
✅ Added config field documentation in README.md