[v25.1.x] Decommission status improvements #26147

vbotbuildovich

Backport of PR #26054

Signed-off-by: Michał Maślanka <[email protected]>
(cherry picked from commit 3b20ef2)
Signed-off-by: Michał Maślanka <[email protected]>
(cherry picked from commit f1f97cc)
Introduced a type providing detailed information about reallocation
failure.

Signed-off-by: Michał Maślanka <[email protected]>
(cherry picked from commit 936098d)
…ations

When partition replicas are moved in the cluster, for whatever reason,
Redpanda should expose all the information required to identify issues
experienced during that operation. Previously, the partition balancer
planner included only basic information about failures in the status:
the list of `ntps` whose replicas failed to be reallocated and the total
number of reallocation failures. Obtaining more detailed information
required looking into the partition balancer log. Moreover,
reallocations that were not even attempted while preparing an
allocation plan were not reported at all. This led to confusion: for
example, a partition whose only replica is on an offline node, and
which therefore cannot be moved, was not reported as a failed one.

To address this problem, we introduced a new
`allocation_failure_details` map in the partition balancer planner
result. The map provides detailed information about each failed
reallocation. Additionally, a reallocation failure is now reported when
a partition is marked as `immutable` while planning a balancer
action. This makes it easy for the operator to identify partition
balancer issues without having to look for specific log entries.
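
The shape of such a result could be sketched as follows. This is an
illustrative Python model, not Redpanda's C++ code: apart from
`allocation_failure_details`, `ntp`, and `immutable`, which the commit
message names, every class, field, and reason value below is a
hypothetical stand-in.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class FailureReason(Enum):
    # Hypothetical reason codes; the real set is defined by Redpanda.
    NO_ELIGIBLE_NODE = "no_eligible_node"
    IMMUTABLE_PARTITION = "immutable_partition"


@dataclass
class PlannerResult:
    # Basic information: ntps whose replicas failed to be reallocated.
    failed_ntps: List[str] = field(default_factory=list)
    # Detailed information: why each reallocation failed, keyed by ntp.
    allocation_failure_details: Dict[str, FailureReason] = field(
        default_factory=dict
    )

    def report_failure(self, ntp: str, reason: FailureReason) -> None:
        """Record a failure in both the basic list and the detailed map."""
        if ntp not in self.failed_ntps:
            self.failed_ntps.append(ntp)
        self.allocation_failure_details[ntp] = reason


result = PlannerResult()
# A reallocation that was attempted but found no target node (assumed case).
result.report_failure("kafka/orders/0", FailureReason.NO_ELIGIBLE_NODE)
# A partition marked immutable during planning is reported as a failure too.
result.report_failure("kafka/orders/3", FailureReason.IMMUTABLE_PARTITION)
```

Keying the map by `ntp` lets an operator look up the reason for one
partition directly instead of searching the balancer log.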

The change is fully backward compatible because all the changes are
incremental, i.e. when no detailed information about failures is
available, Redpanda falls back to the old basic `ntp` list.
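
The fallback behaviour can be illustrated with a minimal sketch. The
function `describe_failures` and its argument names are hypothetical
and only model the idea that detailed entries are preferred when
present; they are not part of Redpanda.

```python
from typing import Dict, List, Optional


def describe_failures(
    failed_ntps: List[str], details: Optional[Dict[str, str]] = None
) -> List[str]:
    """Render failed reallocations, preferring detailed entries.

    Hypothetical helper: when `details` is missing or empty, for example
    because the status came from a node that does not report them, fall
    back to the plain ntp list.
    """
    if details:
        return [f"{ntp}: {reason}" for ntp, reason in sorted(details.items())]
    return sorted(failed_ntps)


# No details available: the old basic list is shown.
basic = describe_failures(["kafka/orders/0"])
# Details available: each ntp is shown with its failure reason.
detailed = describe_failures(
    ["kafka/orders/0"], {"kafka/orders/0": "partition is immutable"}
)
```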

Signed-off-by: Michał Maślanka <[email protected]>
(cherry picked from commit 48834cf)
…atus

Signed-off-by: Michał Maślanka <[email protected]>
(cherry picked from commit ba820b0)
Signed-off-by: Michał Maślanka <[email protected]>
(cherry picked from commit 17ff1d9)
Signed-off-by: Michał Maślanka <[email protected]>
(cherry picked from commit 8f4b6a4)
@vbotbuildovich vbotbuildovich requested a review from a team as a code owner May 14, 2025 19:22
@vbotbuildovich vbotbuildovich added this to the v25.1.x-next milestone May 14, 2025
@vbotbuildovich vbotbuildovich added the kind/backport PRs targeting a stable branch label May 14, 2025

vbotbuildovich commented May 14, 2025

Retry command for Build#66007

please wait until all jobs are finished before running the slash command

/ci-repeat 1
tests/rptest/tests/datalake/rest_catalog_connection_test.py::RestCatalogConnectionTest.test_redpanda_connection_to_rest_catalog@{"cloud_storage_type":1}
tests/rptest/tests/tiered_storage_pause_test.py::TestTieredStoragePause.test_safe_pause_resume@{"allow_gaps_cluster_level":true,"allow_gaps_topic_level":null}
tests/rptest/tests/scaling_up_test.py::ScalingUpTest.test_scaling_up_with_recovered_topic

@vbotbuildovich

CI test results

test results on build#66007
| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason |
| --- | --- | --- | --- | --- | --- | --- | --- |
| RestCatalogConnectionTest | test_redpanda_connection_to_rest_catalog | {"cloud_storage_type": 1} | ducktape | https://buildkite.com/redpanda/redpanda/builds/66007#0196d0a1-d164-407b-a10d-e2d56f87eabe | FLAKY | 20/21 | upstream reliability is '97.43589743589743'. current run reliability is '95.23809523809523'. drift is 2.1978 and the allowed drift is set to 0. The test should FAIL |
| ScalingUpTest | test_scaling_up_with_recovered_topic | | ducktape | https://buildkite.com/redpanda/redpanda/builds/66007#0196d08e-65b4-45c6-a1f4-38fc4f92b1e9 | FLAKY | 10/21 | upstream reliability is '49.65034965034965'. current run reliability is '47.61904761904761'. drift is 2.0313 and the allowed drift is set to 0. The test should FAIL |
| TestTieredStoragePause | test_safe_pause_resume | {"allow_gaps_cluster_level": true, "allow_gaps_topic_level": null} | ducktape | https://buildkite.com/redpanda/redpanda/builds/66007#0196d08e-65b5-4df5-8917-f974daaba828 | FLAKY | 20/21 | upstream reliability is '100.0'. current run reliability is '95.23809523809523'. drift is 4.7619 and the allowed drift is set to 0. The test should FAIL |
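
The drift values in the results can be reproduced with a short
calculation. The formula below is inferred from the reported numbers
(upstream reliability minus the current run's pass percentage), not
taken from the CI tooling itself, and the function name `drift` is
hypothetical.

```python
def drift(upstream_reliability: float, passed: int, total: int) -> float:
    """Difference between upstream reliability and this run's pass rate.

    Assumed formula: both values are percentages, and the current run
    reliability is 100 * passed / total.
    """
    current_reliability = 100.0 * passed / total
    return upstream_reliability - current_reliability


# (upstream reliability, passed runs, total runs) as reported for build#66007.
rows = {
    "RestCatalogConnectionTest": (97.43589743589743, 20, 21),
    "ScalingUpTest": (49.65034965034965, 10, 21),
    "TestTieredStoragePause": (100.0, 20, 21),
}

drifts = {name: round(drift(*args), 4) for name, args in rows.items()}
```

Under this assumption the computed values agree with the reported
drifts of 2.1978, 2.0313, and 4.7619, each of which exceeds the allowed
drift of 0.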

@piyushredpanda piyushredpanda merged commit 0a46a32 into redpanda-data:v25.1.x Jun 11, 2025
13 of 17 checks passed
@piyushredpanda piyushredpanda modified the milestones: v25.1.x-next, v25.1.5 Jun 11, 2025
Labels
area/build area/redpanda kind/backport PRs targeting a stable branch
3 participants