[v25.1.x] rpk: improve rpk debug bundle assumptions in k8s environments #26164

Merged

vbotbuildovich
Collaborator

Backport of PR #26091

r-vasquez added 6 commits May 15, 2025 17:39
For Mac users.

We won't need this now that we have upgraded to
Bazel 8.

(cherry picked from commit 17cbc78)
We had build tags for the other files already.

We previously missed the build tag on this file.

(cherry picked from commit 1dc1a5e)
rpk debug bundle works on a best-effort basis. It
always tries to return a bundle, even if some steps
(like logs or resource collection) fail.
Sometimes this is because the system is unhealthy,
which leads to expected errors. Seeing errors doesn't
mean the bundle is useless.

This change aims to make that clearer to the user.

(cherry picked from commit 066732a)
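
A minimal sketch of the best-effort behavior described in the commit above, assuming a bundle is built from independent steps. The `step` type, `collectBestEffort`, and `errUnhealthy` are hypothetical names used for illustration only; they are not rpk's actual code.

```go
package main

import (
	"errors"
	"fmt"
)

// step is a hypothetical unit of bundle collection (logs, resources, ...).
type step struct {
	name string
	run  func() error
}

// errUnhealthy is a hypothetical stand-in for the kind of expected failure
// that an unhealthy system produces.
var errUnhealthy = errors.New("cluster unhealthy")

// collectBestEffort runs every step, even after earlier failures. It returns
// the names of the steps that succeeded and the joined errors of the rest,
// so the caller can still produce a bundle and report what went wrong.
func collectBestEffort(steps []step) (collected []string, err error) {
	var errs []error
	for _, s := range steps {
		if e := s.run(); e != nil {
			errs = append(errs, fmt.Errorf("%s: %w", s.name, e))
			continue // keep going: a failed step does not abort the bundle
		}
		collected = append(collected, s.name)
	}
	// errors.Join returns nil when there are no errors to join.
	return collected, errors.Join(errs...)
}
```

A caller would write whatever is in `collected` to the bundle and surface `err` as a warning rather than a fatal error.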
If --namespace is not provided, fall back to $NAMESPACE, then
/var/run/secrets/kubernetes.io/serviceaccount/namespace, and
finally default to "redpanda". This avoids hardcoding the namespace
and better supports various deployment environments.

(cherry picked from commit adb9d38)
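
The namespace fallback order from the commit above can be sketched as follows. `resolveNamespace` and `namespaceFromEnvironment` are hypothetical helpers, not rpk's actual functions; the environment lookup and file read are injected only so the order is easy to test.

```go
package main

import (
	"os"
	"strings"
)

// Path of the namespace file named in the commit message.
const serviceAccountNamespaceFile = "/var/run/secrets/kubernetes.io/serviceaccount/namespace"

// resolveNamespace (hypothetical) applies the fallback order:
// --namespace flag, then $NAMESPACE, then the service account file,
// then the "redpanda" default.
func resolveNamespace(
	flagValue string,
	getenv func(string) string,
	readFile func(string) ([]byte, error),
) string {
	if flagValue != "" {
		return flagValue
	}
	if ns := getenv("NAMESPACE"); ns != "" {
		return ns
	}
	if b, err := readFile(serviceAccountNamespaceFile); err == nil {
		if ns := strings.TrimSpace(string(b)); ns != "" {
			return ns
		}
	}
	return "redpanda"
}

// namespaceFromEnvironment (hypothetical) wires in the real environment
// and filesystem.
func namespaceFromEnvironment(flagValue string) string {
	return resolveNamespace(flagValue, os.Getenv, os.ReadFile)
}
```

Trimming whitespace from the file contents is an assumption of this sketch, made so that a trailing newline does not end up in the namespace name.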
When collecting a debug bundle from a
k8s environment, we now:

- Log a warning if admin addresses
  cannot be retrieved from the k8s API.
- Use the union of the profile-defined
  admin addresses and those returned by
  the k8s API.
- Fall back to 127.0.0.1 if no addresses
  are available from either source.
- Log the final list of admin addresses
  used.

This improves reliability and visibility
in environments with incomplete or
misconfigured cluster info.

(cherry picked from commit 939e1ba)
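
A minimal sketch of the address selection listed in the commit above, assuming the k8s lookup has already run and produced a list and an error. `adminAddresses` is a hypothetical helper, and the log messages are illustrative, not rpk's actual output.

```go
package main

import "log"

// adminAddresses (hypothetical) merges the profile-defined admin addresses
// with those returned by the k8s API, falling back to 127.0.0.1 when both
// sources are empty.
func adminAddresses(profileAddrs, k8sAddrs []string, k8sErr error) []string {
	if k8sErr != nil {
		// A failed lookup is a warning, not a fatal error.
		log.Printf("warning: unable to get admin addresses from the k8s API: %v", k8sErr)
	}
	seen := make(map[string]struct{})
	var union []string
	for _, list := range [][]string{profileAddrs, k8sAddrs} {
		for _, addr := range list {
			if _, dup := seen[addr]; dup || addr == "" {
				continue
			}
			seen[addr] = struct{}{}
			union = append(union, addr)
		}
	}
	if len(union) == 0 {
		union = []string{"127.0.0.1"}
	}
	log.Printf("using admin addresses: %v", union)
	return union
}
```

The sketch keeps first-seen order and drops duplicates; whether rpk orders or deduplicates the list the same way is not stated in the commit message.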
We no longer assume `redpanda` to be the default
container name. Instead, we fetch the logs of all
containers and initContainers in the namespace/pod.

Logs are still stored under logs/.

(cherry picked from commit d66c6b0)
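
To illustrate the change in the commit above, here is a small sketch that enumerates every container in a pod instead of assuming one named `redpanda`. The `container` and `podSpec` types are hypothetical stand-ins that mirror the shape of a pod spec; they are not the Kubernetes client types, and `logTargets` is not an rpk function.

```go
package main

// container and podSpec are hypothetical, simplified stand-ins for the
// corresponding parts of a Kubernetes pod spec.
type container struct {
	Name string
}

type podSpec struct {
	InitContainers []container
	Containers     []container
}

// logTargets (hypothetical) returns the name of every container whose logs
// should be fetched: init containers first, then regular containers.
func logTargets(spec podSpec) []string {
	names := make([]string, 0, len(spec.InitContainers)+len(spec.Containers))
	for _, c := range spec.InitContainers {
		names = append(names, c.Name)
	}
	for _, c := range spec.Containers {
		names = append(names, c.Name)
	}
	return names
}
```

The ordering (init containers first) is a choice made for this sketch; the commit message only says that logs of all containers and initContainers are fetched.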
@vbotbuildovich vbotbuildovich requested a review from r-vasquez as a code owner May 15, 2025 17:39
@vbotbuildovich vbotbuildovich added this to the v25.1.x-next milestone May 15, 2025
@vbotbuildovich vbotbuildovich added the kind/backport PRs targeting a stable branch label May 15, 2025
@vbotbuildovich vbotbuildovich requested review from kbatuigas and a team as code owners May 15, 2025 17:39
@vbotbuildovich
Collaborator Author

vbotbuildovich commented May 15, 2025

Retry command for Build#66072

please wait until all jobs are finished before running the slash command

/ci-repeat 1
tests/rptest/tests/partition_move_interruption_test.py::PartitionMoveInterruption.test_cancelling_partition_move@{"compacted":false,"recovery":"restart_recovery","replication_factor":1,"unclean_abort":true}
tests/rptest/tests/datalake/iceberg_toggling_test.py::IcebergTogglingTest.test_iceberg_toggling@{"cloud_storage_type":1}
tests/rptest/tests/partition_reassignments_test.py::PartitionReassignmentsTest.test_add_partitions_with_inprogress_reassignments
tests/rptest/tests/scaling_up_test.py::ScalingUpTest.test_scaling_up_with_recovered_topic

@vbotbuildovich
Collaborator Author

CI test results

test results on build#66072
| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason |
| --- | --- | --- | --- | --- | --- | --- | --- |
| IcebergTogglingTest | test_iceberg_toggling | {"cloud_storage_type": 1} | ducktape | https://buildkite.com/redpanda/redpanda/builds/66072#0196d548-c496-4e2b-8e0a-fc078f95bd3a | FLAKY | 20/21 | upstream reliability is '100.0'. current run reliability is '95.23809523809523'. drift is 4.7619 and the allowed drift is set to 0. The test should FAIL |
| PartitionMoveInterruption | test_cancelling_partition_move | {"compacted": false, "recovery": "restart_recovery", "replication_factor": 1, "unclean_abort": true} | ducktape | https://buildkite.com/redpanda/redpanda/builds/66072#0196d548-c495-4003-8f91-7cd60a9df386 | FLAKY | 20/21 | upstream reliability is '100.0'. current run reliability is '95.23809523809523'. drift is 4.7619 and the allowed drift is set to 0. The test should FAIL |
| PartitionReassignmentsTest | test_add_partitions_with_inprogress_reassignments | | ducktape | https://buildkite.com/redpanda/redpanda/builds/66072#0196d54d-3e40-48ef-8e9a-f6d51d233b65 | FLAKY | 19/21 | upstream reliability is '100.0'. current run reliability is '90.47619047619048'. drift is 9.52381 and the allowed drift is set to 0. The test should FAIL |
| ScalingUpTest | test_scaling_up_with_recovered_topic | | ducktape | https://buildkite.com/redpanda/redpanda/builds/66072#0196d54d-3e40-48ef-8e9a-f6d51d233b65 | FLAKY | 5/21 | upstream reliability is '48.1981981981982'. current run reliability is '23.809523809523807'. drift is 24.38867 and the allowed drift is set to 0. The test should FAIL |

@r-vasquez
Contributor

/ci-repeat 1
tests/rptest/tests/partition_move_interruption_test.py::PartitionMoveInterruption.test_cancelling_partition_move@{"compacted":false,"recovery":"restart_recovery","replication_factor":1,"unclean_abort":true}
tests/rptest/tests/datalake/iceberg_toggling_test.py::IcebergTogglingTest.test_iceberg_toggling@{"cloud_storage_type":1}
tests/rptest/tests/partition_reassignments_test.py::PartitionReassignmentsTest.test_add_partitions_with_inprogress_reassignments
tests/rptest/tests/scaling_up_test.py::ScalingUpTest.test_scaling_up_with_recovered_topic

@r-vasquez r-vasquez merged commit b599de1 into redpanda-data:v25.1.x May 20, 2025
22 checks passed
@piyushredpanda piyushredpanda modified the milestones: v25.1.x-next, v25.1.5 Jun 11, 2025
Labels
area/build area/rpk kind/backport PRs targeting a stable branch
3 participants