rpk: improve rpk debug bundle assumptions in k8s environments #26091

Merged
r-vasquez merged 6 commits into redpanda-data:dev on May 15, 2025

Conversation

Contributor

@r-vasquez r-vasquez commented May 9, 2025

Fixes UX-114

This PR:

  • Improves the error messaging around diagnostic step failures.
  • Improves the namespace resolution.
  • Improves the admin API address resolution.

Misc. changes

  • Adds a missing build tag in debug bundle for non-Linux builds.
  • Fixes a bug in our makefile.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v25.1.x
  • v24.3.x
  • v24.2.x

Release Notes

Improvements

  • rpk debug bundle: improve reliability of debug bundle collection in k8s environments.

@r-vasquez r-vasquez requested review from kbatuigas and a team as code owners May 9, 2025 20:09
@r-vasquez r-vasquez requested review from andresaristizabal and removed request for a team May 9, 2025 20:09
@r-vasquez r-vasquez requested a review from chrisseto May 9, 2025 20:10
chrisseto
chrisseto previously approved these changes May 9, 2025
Contributor

@chrisseto chrisseto left a comment

LGTM!

		fmt.Println(errs.Error())
	}

	fmt.Printf("Debug bundle saved to %q\n", f.Name())
	return nil
}

// adminAddressesUnion returns the union of two slices of adminAddresses.
func adminAddressesUnion(a, b []string) []string {
	m := make(map[string]bool) // track unique addresses.
Contributor

map[string]struct{} is technically more efficient here. Though it's much less pleasant to use. This is more of a "the more you know" comment, no change needed IMO.

Contributor Author

Thanks! I made the change anyway 👍

		m[v] = true
	}
	for _, v := range b {
		if _, ok := m[v]; !ok {
Contributor

Suggested change
- if _, ok := m[v]; !ok {
+ if !m[v] {

The primary benefit of using map[string]bool (IMO) is that you don't have to use the extended syntax here.
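
For readers following the thread, here is a minimal sketch of the `map[string]struct{}` variant discussed above. It is not the code that was merged: `unionAddresses` is a hypothetical name, the addresses are made up, and preserving first-seen order is an assumption of this sketch.

```go
package main

import "fmt"

// unionAddresses is a hypothetical stand-in for the helper under review.
// It returns the elements of a followed by the elements of b, dropping
// duplicates and keeping first-seen order (an assumption of this sketch).
func unionAddresses(a, b []string) []string {
	// Empty struct values occupy no storage, which is the efficiency
	// point raised in the review.
	seen := make(map[string]struct{}, len(a)+len(b))
	out := make([]string, 0, len(a)+len(b))
	for _, src := range [][]string{a, b} {
		for _, v := range src {
			// The comma-ok form is required here, because the value
			// type carries no information of its own.
			if _, ok := seen[v]; ok {
				continue
			}
			seen[v] = struct{}{}
			out = append(out, v)
		}
	}
	return out
}

func main() {
	profile := []string{"10.0.0.1:9644", "10.0.0.2:9644"} // made-up addresses
	fromK8s := []string{"10.0.0.2:9644", "10.0.0.3:9644"}
	fmt.Println(unionAddresses(profile, fromK8s))
	// prints [10.0.0.1:9644 10.0.0.2:9644 10.0.0.3:9644]
}
```

The trade-off is the one described in the comments above: `map[string]bool` allows the shorter `if !m[v]` check, while `map[string]struct{}` needs the comma-ok lookup.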

@vbotbuildovich
Collaborator

vbotbuildovich commented May 10, 2025

CI test results

test results on build#65763
| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason |
|---|---|---|---|---|---|---|---|
| CloudStorageTimingStressTest | test_cloud_storage | {"cleanup_policy": "compact,delete"} | ducktape | https://buildkite.com/redpanda/redpanda/builds/65763#0196b6db-f34c-4c27-aef3-3284bd001e3f | FLAKY | 20/21 | upstream reliability is '97.74011299435028'. current run reliability is '95.23809523809523'. drift is 2.50202 and the allowed drift is set to 50. The test should PASS |
| RandomNodeOperationsTest | test_node_operations | {"cloud_storage_type": 1, "enable_failures": false, "mixed_versions": false, "with_chunked_compaction": true, "with_iceberg": true, "with_tiered_storage": false} | ducktape | https://buildkite.com/redpanda/redpanda/builds/65763#0196b6db-f34c-4c27-aef3-3284bd001e3f | FLAKY | 20/21 | upstream reliability is '99.3421052631579'. current run reliability is '95.23809523809523'. drift is 4.10401 and the allowed drift is set to 50. The test should PASS |
| RedpandaOIDCTest | test_init | | ducktape | https://buildkite.com/redpanda/redpanda/builds/65763#0196b6ee-d30a-4989-883b-d52ff3a6e02a | FLAKY | 20/21 | upstream reliability is '99.10714285714286'. current run reliability is '95.23809523809523'. drift is 3.86905 and the allowed drift is set to 50. The test should PASS |
test results on build#65823
| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason |
|---|---|---|---|---|---|---|---|
| DatalakeTransactionTests | test_with_transactions | {"cloud_storage_type": 1, "compaction": false} | ducktape | https://buildkite.com/redpanda/redpanda/builds/65823#0196c560-f830-4354-a863-baf4fc47a2ca | FLAKY | 20/21 | upstream reliability is '100.0'. current run reliability is '95.23809523809523'. drift is 4.7619 and the allowed drift is set to 50. The test should PASS |
| MaintenanceTest | test_maintenance_sticky | {"use_rpk": false} | ducktape | https://buildkite.com/redpanda/redpanda/builds/65823#0196c557-c4dd-4697-9cd7-0a1eba7b9e3d | FLAKY | 20/21 | upstream reliability is '99.13419913419914'. current run reliability is '95.23809523809523'. drift is 3.8961 and the allowed drift is set to 50. The test should PASS |
| RandomNodeOperationsTest | test_node_operations | {"cloud_storage_type": 1, "enable_failures": false, "mixed_versions": false, "with_chunked_compaction": true, "with_iceberg": false, "with_tiered_storage": false} | ducktape | https://buildkite.com/redpanda/redpanda/builds/65823#0196c560-f82d-457a-a999-7f755d85d1c2 | FLAKY | 20/21 | upstream reliability is '99.2248062015504'. current run reliability is '95.23809523809523'. drift is 3.98671 and the allowed drift is set to 50. The test should PASS |
| RandomNodeOperationsTest | test_node_operations | {"cloud_storage_type": 1, "enable_failures": false, "mixed_versions": false, "with_chunked_compaction": true, "with_iceberg": true, "with_tiered_storage": false} | ducktape | https://buildkite.com/redpanda/redpanda/builds/65823#0196c560-f82d-457a-a999-7f755d85d1c2 | FLAKY | 20/21 | upstream reliability is '97.17741935483872'. current run reliability is '95.23809523809523'. drift is 1.93932 and the allowed drift is set to 50. The test should PASS |
| SaslPlainTest | test_plain_authn | {"client_type": 4, "sasl_plain_enabled": true, "scram_type": 1} | ducktape | https://buildkite.com/redpanda/redpanda/builds/65823#0196c560-f830-4354-a863-baf4fc47a2ca | FLAKY | 20/21 | upstream reliability is '99.52606635071089'. current run reliability is '95.23809523809523'. drift is 4.28797 and the allowed drift is set to 50. The test should PASS |
| ShardPlacementTest | test_node_join | {"disable_license": true} | ducktape | https://buildkite.com/redpanda/redpanda/builds/65823#0196c560-f830-4b8d-8f45-e4f74bcd7a9d | FLAKY | 20/21 | upstream reliability is '100.0'. current run reliability is '95.23809523809523'. drift is 4.7619 and the allowed drift is set to 50. The test should PASS |
| TestTieredStoragePause | test_safe_pause_resume | {"allow_gaps_cluster_level": true, "allow_gaps_topic_level": null} | ducktape | https://buildkite.com/redpanda/redpanda/builds/65823#0196c560-f82d-457a-a999-7f755d85d1c2 | FLAKY | 20/21 | upstream reliability is '98.47328244274809'. current run reliability is '95.23809523809523'. drift is 3.23519 and the allowed drift is set to 50. The test should PASS |
| UpgradeBackToBackTest | test_upgrade_with_all_workloads | {"single_upgrade": false} | ducktape | https://buildkite.com/redpanda/redpanda/builds/65823#0196c560-f82e-4797-9420-50b04d234855 | FLAKY | 11/21 | upstream reliability is '86.58649398704902'. current run reliability is '52.38095238095239'. drift is 34.20554 and the allowed drift is set to 50. The test should PASS |
test results on build#65860
| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason |
|---|---|---|---|---|---|---|---|
| ArchivalTest | test_single_partition_leadership_transfer | {"cloud_storage_type": 2} | ducktape | https://buildkite.com/redpanda/redpanda/builds/65860#0196c762-4c4e-4dea-b6d1-7f75392ff23c | FLAKY | 20/21 | upstream reliability is '100.0'. current run reliability is '95.23809523809523'. drift is 4.7619 and the allowed drift is set to 50. The test should PASS |
| DatalakeE2ETests | test_topic_lifecycle | {"catalog_type": "rest_hadoop", "cloud_storage_type": 1} | ducktape | https://buildkite.com/redpanda/redpanda/builds/65860#0196c762-4c4e-4dea-b6d1-7f75392ff23c | FLAKY | 20/21 | upstream reliability is '99.56709956709958'. current run reliability is '95.23809523809523'. drift is 4.329 and the allowed drift is set to 50. The test should PASS |
| DeleteRecordsTest | test_delete_records_topic_start_delta | {"cloud_storage_enabled": false} | ducktape | https://buildkite.com/redpanda/redpanda/builds/65860#0196c762-4c50-4103-a895-571faf79ffcf | FLAKY | 20/21 | upstream reliability is '100.0'. current run reliability is '95.23809523809523'. drift is 4.7619 and the allowed drift is set to 50. The test should PASS |
| PartitionBalancerTest | test_unavailable_nodes | | ducktape | https://buildkite.com/redpanda/redpanda/builds/65860#0196c75a-d7f3-44b0-88c9-3be6eca74747 | FLAKY | 11/21 | upstream reliability is '71.5670436187399'. current run reliability is '52.38095238095239'. drift is 19.18609 and the allowed drift is set to 50. The test should PASS |
| RandomNodeOperationsTest | test_node_operations | {"cloud_storage_type": 1, "enable_failures": false, "mixed_versions": false, "with_chunked_compaction": true, "with_iceberg": true, "with_tiered_storage": false} | ducktape | https://buildkite.com/redpanda/redpanda/builds/65860#0196c75a-d7f1-4cb2-b01f-74effe663dce | FAIL | 0/1 | |
| RandomNodeOperationsTest | test_node_operations | {"cloud_storage_type": 1, "enable_failures": true, "mixed_versions": false, "with_chunked_compaction": true, "with_iceberg": false, "with_tiered_storage": true} | ducktape | https://buildkite.com/redpanda/redpanda/builds/65860#0196c75a-d7f1-4cb2-b01f-74effe663dce | FAIL | 0/1 | |
test results on build#65946
| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason |
|---|---|---|---|---|---|---|---|
| PartitionBalancerTest | test_transfer_controller_leadership | | ducktape | https://buildkite.com/redpanda/redpanda/builds/65946#0196cb99-afa5-4084-84e4-b4e3bd009b24 | FLAKY | 15/21 | upstream reliability is '81.47448015122873'. current run reliability is '71.42857142857143'. drift is 10.04591 and the allowed drift is set to 50. The test should PASS |
| RandomNodeOperationsTest | test_node_operations | {"cloud_storage_type": 1, "enable_failures": false, "mixed_versions": false, "with_chunked_compaction": true, "with_iceberg": true, "with_tiered_storage": false} | ducktape | https://buildkite.com/redpanda/redpanda/builds/65946#0196cb99-afa4-4058-ab25-470cded51f92 | FLAKY | 20/21 | upstream reliability is '95.65217391304348'. current run reliability is '95.23809523809523'. drift is 0.41408 and the allowed drift is set to 50. The test should PASS |
| RandomNodeOperationsTest | test_node_operations | {"cloud_storage_type": 1, "enable_failures": false, "mixed_versions": false, "with_chunked_compaction": true, "with_iceberg": false, "with_tiered_storage": true} | ducktape | https://buildkite.com/redpanda/redpanda/builds/65946#0196cb99-afa4-4058-ab25-470cded51f92 | FLAKY | 20/21 | upstream reliability is '98.63013698630137'. current run reliability is '95.23809523809523'. drift is 3.39204 and the allowed drift is set to 50. The test should PASS |
| RandomNodeOperationsTest | test_node_operations | {"cloud_storage_type": 2, "enable_failures": true, "mixed_versions": false, "with_chunked_compaction": false, "with_iceberg": false, "with_tiered_storage": true} | ducktape | https://buildkite.com/redpanda/redpanda/builds/65946#0196cb99-afa5-4084-84e4-b4e3bd009b24 | FLAKY | 20/21 | upstream reliability is '96.42857142857143'. current run reliability is '95.23809523809523'. drift is 1.19048 and the allowed drift is set to 50. The test should PASS |
| RandomNodeOperationsTest | test_node_operations | {"cloud_storage_type": 1, "enable_failures": true, "mixed_versions": false, "with_chunked_compaction": false, "with_iceberg": false, "with_tiered_storage": true} | ducktape | https://buildkite.com/redpanda/redpanda/builds/65946#0196cb99-afa7-43c7-8c56-f0407ae4197c | FLAKY | 20/21 | upstream reliability is '99.27536231884058'. current run reliability is '95.23809523809523'. drift is 4.03727 and the allowed drift is set to 50. The test should PASS |

r-vasquez added 5 commits May 12, 2025 17:54
For Mac users.

We won't need this now that we have upgraded to Bazel 8.

We had build tags for the other files already.

We previously missed the build tag on this file.

rpk debug bundle works on a best-effort basis. It
always tries to return a bundle, even if some steps
(like logs or resource collection) fail.
Sometimes this is because the system is unhealthy,
which leads to expected errors. Seeing errors doesn't
mean the bundle is useless.

This change aims to make this clearer for the user.
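
As an illustration of the best-effort behaviour described in this commit message, the sketch below runs every collection step, gathers failures instead of aborting, and still reports a saved bundle. The `step` type, `runBestEffort`, the step names, and the message wording are all hypothetical; rpk's actual implementation may differ.

```go
package main

import (
	"errors"
	"fmt"
)

// errUnreachable is a made-up failure used only by this sketch.
var errUnreachable = errors.New("connection refused")

// step is a hypothetical unit of bundle collection.
type step struct {
	name string
	run  func() error
}

// runBestEffort runs every step without stopping at the first failure and
// returns the joined errors, or nil when all steps succeed.
func runBestEffort(steps []step) error {
	var errs []error
	for _, s := range steps {
		if err := s.run(); err != nil {
			errs = append(errs, fmt.Errorf("%s: %w", s.name, err))
		}
	}
	return errors.Join(errs...) // nil when errs is empty
}

func main() {
	steps := []step{
		{"cluster config", func() error { return nil }},
		{"logs", func() error { return errUnreachable }},
		{"disk usage", func() error { return nil }},
	}
	if err := runBestEffort(steps); err != nil {
		// Failures are reported, but they do not prevent the bundle
		// from being written with whatever was collected.
		fmt.Println("Some steps failed; the bundle still contains the data that could be collected:")
		fmt.Println(err)
	}
	fmt.Println("Debug bundle saved")
}
```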
If --namespace is not provided, fall back to $NAMESPACE, then
/var/run/secrets/kubernetes.io/serviceaccount/namespace, and
finally default to "redpanda". This avoids hardcoding and better
supports various deployment environments.
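
The lookup order in this commit message can be sketched as follows. `resolveNamespace` is a hypothetical helper, not the function added in the PR; the environment and file readers are injected so the order can be exercised outside a pod, and trimming whitespace from the file contents is an assumption of this sketch.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

const (
	// Path taken from the commit message above.
	serviceAccountNamespacePath = "/var/run/secrets/kubernetes.io/serviceaccount/namespace"
	defaultNamespace            = "redpanda"
)

// resolveNamespace is a hypothetical helper that applies the fallback
// chain: flag value, then $NAMESPACE, then the service account file,
// then the default.
func resolveNamespace(
	flagValue string,
	getenv func(string) string,
	readFile func(string) ([]byte, error),
) string {
	if flagValue != "" {
		return flagValue // 1. explicit --namespace
	}
	if ns := getenv("NAMESPACE"); ns != "" {
		return ns // 2. $NAMESPACE
	}
	if b, err := readFile(serviceAccountNamespacePath); err == nil {
		if ns := strings.TrimSpace(string(b)); ns != "" {
			return ns // 3. namespace mounted alongside the service account
		}
	}
	return defaultNamespace // 4. last resort
}

func main() {
	// The result depends on the environment this runs in.
	fmt.Println(resolveNamespace("", os.Getenv, os.ReadFile))
}
```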
When collecting a debug bundle from a
k8s environment, we now:

- Log a warning if admin addresses
  cannot be retrieved from the k8s API.
- Use the union of the profile-defined
  admin addresses and those returned by
  the k8s API.
- Fall back to 127.0.0.1 if no addresses
  are available from either source.
- Log the final list of admin addresses
  used.

This improves reliability and visibility
in environments with incomplete or
misconfigured cluster info.
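
A rough sketch of the four steps listed in this commit message is shown below. `resolveAdminAddrs` and its parameters are hypothetical, the log wording is invented, and the `:9644` port on the fallback address is an assumption of this sketch (the commit message only names 127.0.0.1).

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// fallbackAdminAddr uses the host named in the commit message; the port is
// an assumption made for this sketch.
const fallbackAdminAddr = "127.0.0.1:9644"

// resolveAdminAddrs is a hypothetical helper. k8sErr is whatever error the
// k8s API lookup returned, or nil on success.
func resolveAdminAddrs(profileAddrs, k8sAddrs []string, k8sErr error) []string {
	if k8sErr != nil {
		// Warn, but keep going with whatever the profile provides.
		log.Printf("warning: unable to get admin addresses from the k8s API: %v", k8sErr)
	}
	seen := make(map[string]bool)
	var addrs []string
	for _, src := range [][]string{profileAddrs, k8sAddrs} {
		for _, a := range src {
			if !seen[a] {
				seen[a] = true
				addrs = append(addrs, a)
			}
		}
	}
	if len(addrs) == 0 {
		addrs = []string{fallbackAdminAddr}
	}
	log.Printf("using admin addresses: %v", addrs)
	return addrs
}

func main() {
	// Neither source yields an address, so the fallback is used.
	addrs := resolveAdminAddrs(nil, nil, errors.New("forbidden"))
	fmt.Println(addrs) // prints [127.0.0.1:9644]
}
```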
@r-vasquez
Contributor Author

Force Push 1: Addresses PR review suggestions.

Force Push 2: Rebases onto dev to fix a CI issue.

We will stop assuming `redpanda` to be the default
container name. Instead we are going to fetch the
logs of all containers and initContainers in
the namespace/pod.

Logs will still be stored under logs/
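
To illustrate the change in this commit message, the sketch below enumerates every init container and regular container of each pod and derives one log file path per container. The `pod` struct is a simplified, hypothetical stand-in for the Kubernetes pod spec, and the pod and container names are made up; the `logs/%v-%v.txt` pattern is the one visible in the diff under review.

```go
package main

import "fmt"

// pod is a simplified, hypothetical stand-in for a Kubernetes pod spec.
type pod struct {
	Name           string
	InitContainers []string
	Containers     []string
}

// logPaths returns one bundle path per container, listing init containers
// before regular containers for each pod.
func logPaths(pods []pod) []string {
	var paths []string
	for _, p := range pods {
		names := append(append([]string{}, p.InitContainers...), p.Containers...)
		for _, c := range names {
			paths = append(paths, fmt.Sprintf("logs/%v-%v.txt", p.Name, c))
		}
	}
	return paths
}

func main() {
	pods := []pod{{
		Name:           "redpanda-0", // made-up names
		InitContainers: []string{"init-config"},
		Containers:     []string{"redpanda", "sidecar"},
	}}
	for _, path := range logPaths(pods) {
		fmt.Println(path)
	}
	// prints:
	// logs/redpanda-0-init-config.txt
	// logs/redpanda-0-redpanda.txt
	// logs/redpanda-0-sidecar.txt
}
```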
@r-vasquez
Contributor Author

Added d66c6b0 to include the logs of all containers in the pod and stop assuming redpanda to be the container name.

Contributor

@chrisseto chrisseto left a comment

LGTM sans the loop/closure scoping question.

cb := func(ctx context.Context) ([]byte, error) {
	return podsInterface.GetLogs(p.Name, opts).Do(ctx).Raw()
}
grp.Go(func() error { return requestAndSave(ctx, ps, fmt.Sprintf("logs/%v-%v.txt", p.Name, c.Name), cb) })
Contributor

I think this is fine as of Go 1.23, but the reference to p and c in a closure from another goroutine inside a doubly nested loop has all of my internal Go alarms ringing.

Just to check, have you tested this on a cluster with multiple containers and Pods and observed that the logs are all named correctly?

Contributor Author

Correct, I double-checked just now and the logs are named and saved correctly.
[attached image]
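
For context on the scoping question in this thread: since Go 1.22, each iteration of a `for` loop gets its own copy of the loop variables (when the module declares `go 1.22` or later), so closures started in goroutines no longer share one variable across iterations. The sketch below uses hypothetical names and plain strings instead of the Kubernetes client; it also keeps the explicit per-iteration copy, which is redundant on Go 1.22+ but makes the code behave the same on older toolchains.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// collectNames starts one goroutine per (pod, container) pair and records
// the file name each goroutine computes. The input maps a pod name to its
// container names.
func collectNames(pods map[string][]string) []string {
	var (
		mu    sync.Mutex
		wg    sync.WaitGroup
		names []string
	)
	for p, containers := range pods {
		for _, c := range containers {
			p, c := p, c // redundant since Go 1.22; needed on older toolchains
			wg.Add(1)
			go func() {
				defer wg.Done()
				name := fmt.Sprintf("logs/%v-%v.txt", p, c)
				mu.Lock()
				names = append(names, name)
				mu.Unlock()
			}()
		}
	}
	wg.Wait()
	sort.Strings(names) // goroutine completion order is not deterministic
	return names
}

func main() {
	pods := map[string][]string{ // made-up pod and container names
		"redpanda-0": {"redpanda", "sidecar"},
		"redpanda-1": {"redpanda", "sidecar"},
	}
	for _, n := range collectNames(pods) {
		fmt.Println(n)
	}
}
```

If the loop variables were shared across iterations, several goroutines could compute the same file name; with per-iteration variables every pair gets its own name.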

@r-vasquez r-vasquez merged commit 0f9a335 into redpanda-data:dev May 15, 2025
24 checks passed
@vbotbuildovich
Collaborator

/backport v25.1.x

@vbotbuildovich
Collaborator

/backport v24.3.x

@vbotbuildovich
Collaborator

/backport v24.2.x

@vbotbuildovich
Collaborator

Failed to create a backport PR to v24.2.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-26091-v24.2.x-158 remotes/upstream/v24.2.x
git cherry-pick -x 17cbc780d3 1dc1a5ea91 066732ae11 adb9d3855e 939e1baad2 d66c6b00ed

Workflow run logs.
