What is the issue?
linkerd viz shows that some pods receive no RPS at all, while others get all the traffic.

How can it be reproduced?
Deploy two apps that communicate over gRPC (using the FQDN ending in .cluster.local). Both apps are meshed.
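For reference, a minimal sketch of the setup (names, namespace, port, and image are placeholders, not our actual manifests): a meshed gRPC server behind a ClusterIP Service, called by a meshed client via the Service's cluster-local FQDN.

apiVersion: v1
kind: Service
metadata:
  name: grpc-server            # placeholder name
  namespace: demo              # placeholder namespace
spec:
  selector:
    app: grpc-server
  ports:
    - name: grpc
      port: 50051
      targetPort: 50051
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grpc-server
  namespace: demo
spec:
  replicas: 4
  selector:
    matchLabels:
      app: grpc-server
  template:
    metadata:
      labels:
        app: grpc-server
      annotations:
        linkerd.io/inject: enabled        # both apps are meshed
    spec:
      containers:
        - name: server
          image: example.com/grpc-server:latest   # placeholder image
          ports:
            - containerPort: 50051

The client (also injected with linkerd.io/inject: enabled) dials grpc-server.demo.svc.cluster.local:50051.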
Logs, error output, etc
There are no visible errors. As load increases, pods start to crash because not all of them are actually handling traffic.
output of linkerd check -o short
╰> linkerd check -o short [10:16:57]
linkerd-version
---------------
‼ cli is up-to-date
is running version 25.4.1 but the latest edge version is 25.5.5
see https://linkerd.io/2/checks/#l5d-version-cli for hints
control-plane-version
---------------------
‼ control plane is up-to-date
is running version 25.4.1 but the latest edge version is 25.5.5
see https://linkerd.io/2/checks/#l5d-version-control for hints
linkerd-control-plane-proxy
---------------------------
‼ control plane proxies are up-to-date
some proxies are not running the current version:
* linkerd-destination-68f7bd57cb-csvqn (edge-25.4.1)
* linkerd-destination-68f7bd57cb-svt8m (edge-25.4.1)
* linkerd-identity-6f6d4d4f64-6d468 (edge-25.4.1)
* linkerd-identity-6f6d4d4f64-vst8j (edge-25.4.1)
* linkerd-proxy-injector-858587c6ff-b87hs (edge-25.4.1)
* linkerd-proxy-injector-858587c6ff-h4qk6 (edge-25.4.1)
see https://linkerd.io/2/checks/#l5d-cp-proxy-version for hints
linkerd-viz
-----------
‼ viz extension proxies are up-to-date
some proxies are not running the current version:
* metrics-api-6b6994d46-8jbdc (edge-25.4.1)
* prometheus-576d6c98cf-527nh (edge-25.4.1)
* tap-574f8fb84f-2tl8n (edge-25.4.1)
* tap-574f8fb84f-5hbzg (edge-25.4.1)
* tap-574f8fb84f-gnht6 (edge-25.4.1)
* tap-injector-6c9d7895dd-6vl8v (edge-25.4.1)
* web-6b676dcf7-v9kxs (edge-25.4.1)
see https://linkerd.io/2/checks/#l5d-viz-proxy-cp-version for hints
Status check results are √
Environment
- k8s: 1.30.1 (also 1.31.2)
- AKS
- managed zonal cluster
- OS: ubuntu
- linkerd version: edge-25.4.1
- CNI: cilium 1.12.9 (also 1.15.10)
Possible solution
If we remove the most loaded pod(s), the other pod(s) start to get all the requests.
Additional context
We are also using the topology-mode: auto annotation on our Services (see the sketch below), but linkerd viz shows that only a few pods in the same zone receive requests while the others look idle.
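For clarity, this is roughly how the annotation is applied (service name and port are placeholders; the annotation key shown is the standard service.kubernetes.io/topology-mode for these Kubernetes versions):

apiVersion: v1
kind: Service
metadata:
  name: grpc-server                                # placeholder name
  annotations:
    service.kubernetes.io/topology-mode: Auto      # topology-aware routing
spec:
  selector:
    app: grpc-server
  ports:
    - name: grpc
      port: 50051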
Would you like to work on fixing this bug?
maybe