Some pods have more rps than others #14086

Open
@1ovsss

Description

What is the issue?

linkerd-viz shows that some pods receive no requests at all (0 rps), while others get all the traffic.
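
To put a number on the skew, the per-pod rps values shown in linkerd-viz can be turned into traffic shares. The snippet below is only a minimal sketch: the pod names and rps figures are made up for illustration and are not taken from the affected cluster, and the helper names `traffic_shares` and `idle_pods` are hypothetical.

```python
def traffic_shares(rps_by_pod):
    """Return each pod's fraction of total RPS (all zeros if there is no traffic)."""
    total = sum(rps_by_pod.values())
    if total == 0:
        return {pod: 0.0 for pod in rps_by_pod}
    return {pod: rps / total for pod, rps in rps_by_pod.items()}


def idle_pods(rps_by_pod, threshold=0.0):
    """Return the sorted names of pods whose RPS is at or below the threshold."""
    return sorted(pod for pod, rps in rps_by_pod.items() if rps <= threshold)


# Hypothetical numbers, NOT measurements from the cluster in this report.
sample = {"app-a": 120.0, "app-b": 0.0, "app-c": 80.0, "app-d": 0.0}

shares = traffic_shares(sample)
for pod in sorted(shares):
    print(f"{pod}: {shares[pod]:.0%} of traffic")
print("idle pods:", idle_pods(sample))
```

With evenly balanced traffic, each of the four pods would sit near 25%. A result where several pods are at 0% matches the behaviour described above.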

How can it be reproduced?

Deploy two apps that communicate over gRPC, addressing each other by FQDN (.cluster.local). Both apps are meshed.

Logs, error output, etc

There are no visible errors. Under higher load, pods start to crash because not all of them are actually serving traffic.

output of linkerd check -o short

╰> linkerd check -o short
linkerd-version
---------------
‼ cli is up-to-date
    is running version 25.4.1 but the latest edge version is 25.5.5
    see https://linkerd.io/2/checks/#l5d-version-cli for hints

control-plane-version
---------------------
‼ control plane is up-to-date
    is running version 25.4.1 but the latest edge version is 25.5.5
    see https://linkerd.io/2/checks/#l5d-version-control for hints

linkerd-control-plane-proxy
---------------------------
‼ control plane proxies are up-to-date
    some proxies are not running the current version:
	* linkerd-destination-68f7bd57cb-csvqn (edge-25.4.1)
	* linkerd-destination-68f7bd57cb-svt8m (edge-25.4.1)
	* linkerd-identity-6f6d4d4f64-6d468 (edge-25.4.1)
	* linkerd-identity-6f6d4d4f64-vst8j (edge-25.4.1)
	* linkerd-proxy-injector-858587c6ff-b87hs (edge-25.4.1)
	* linkerd-proxy-injector-858587c6ff-h4qk6 (edge-25.4.1)
    see https://linkerd.io/2/checks/#l5d-cp-proxy-version for hints

linkerd-viz
-----------
‼ viz extension proxies are up-to-date
    some proxies are not running the current version:
	* metrics-api-6b6994d46-8jbdc (edge-25.4.1)
	* prometheus-576d6c98cf-527nh (edge-25.4.1)
	* tap-574f8fb84f-2tl8n (edge-25.4.1)
	* tap-574f8fb84f-5hbzg (edge-25.4.1)
	* tap-574f8fb84f-gnht6 (edge-25.4.1)
	* tap-injector-6c9d7895dd-6vl8v (edge-25.4.1)
	* web-6b676dcf7-v9kxs (edge-25.4.1)
    see https://linkerd.io/2/checks/#l5d-viz-proxy-cp-version for hints

Status check results are √

Environment

  • k8s: 1.30.1 (also 1.31.2)
  • AKS
  • managed zonal cluster
  • OS: ubuntu
  • linkerd version: edge-25.4.1
  • CNI: cilium 1.12.9 (also 1.15.10)

Possible solution

If we remove the most loaded pod(s), the remaining pod(s) start to receive all the requests.

Additional context

We are also using the topology-mode: auto annotation on our services, but only a few pods in the same zone get requests, while the others look idle.
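
Since topology-aware routing is generally expected to prefer endpoints in the caller's zone, the useful comparison is between pods within one zone rather than across zones. The sketch below groups pods by zone and separates busy from idle ones. It rests on assumptions: the pod names, zone names, and rps values are hypothetical, and `split_by_zone` is an illustrative helper, not part of Linkerd or Kubernetes.

```python
from collections import defaultdict


def split_by_zone(pods):
    """Group (name, zone, rps) tuples by zone into busy and idle pod names."""
    zones = defaultdict(lambda: {"busy": [], "idle": []})
    for name, zone, rps in pods:
        # A pod counts as busy if it received any requests at all.
        zones[zone]["busy" if rps > 0 else "idle"].append(name)
    return dict(zones)


# Hypothetical pods, zones, and RPS values, for illustration only.
pods = [
    ("app-a", "zone-1", 150.0),
    ("app-b", "zone-1", 0.0),
    ("app-c", "zone-1", 0.0),
    ("app-d", "zone-2", 0.0),
]

report = split_by_zone(pods)
for zone, groups in sorted(report.items()):
    print(zone, "busy:", groups["busy"], "idle:", groups["idle"])
```

Idle pods in a zone that sends no traffic (zone-2 here) would be consistent with zone-local routing. Idle pods that share a zone with a busy pod (zone-1 here) are the case this issue is about.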

Would you like to work on fixing this bug?

maybe
