r/stm_manager: added watchdog logging an error when stm did not stop #26259

Merged

mmaslankaprv (Member) commented May 28, 2025

State machines should stop in a timely manner, because any problem stopping a state machine can prevent the whole partition from stopping. This change adds a watchdog that logs an error when an STM fails to stop.
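
As an illustration of the idea only, here is a minimal Python `asyncio` sketch of a stop watchdog. It is not the Seastar C++ code from this PR; the function name `stop_with_watchdog`, the logger name, and the timeout parameter are all hypothetical. The sketch assumes the watchdog only reports and never cancels the stop operation.

```python
import asyncio
import logging

# Hypothetical logger name, chosen for this sketch only.
logger = logging.getLogger("stm_manager_sketch")


async def stop_with_watchdog(name, stop, timeout_s):
    """Await stop() and log an error if it is still running after timeout_s.

    Returns True if the watchdog fired. The stop operation is never
    cancelled; the watchdog only reports the slow shutdown.
    """
    task = asyncio.ensure_future(stop())
    # Wait up to timeout_s; `done` is empty if stop() is still running.
    done, _ = await asyncio.wait({task}, timeout=timeout_s)
    fired = not done
    if fired:
        logger.error(
            "state machine %s did not stop within %.3fs", name, timeout_s
        )
    # Keep waiting for the stop to finish and propagate its exception, if any.
    await task
    return fired
```

A caller would wrap each state machine's stop coroutine, for example `await stop_with_watchdog("my_stm", stm.stop, 30.0)`, so that a stuck shutdown shows up in the logs while the shutdown itself proceeds unchanged.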

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v25.1.x
  • v24.3.x
  • v24.2.x

Release Notes

Improvements

  • Better observability of state machine shutdown issues

@mmaslankaprv mmaslankaprv force-pushed the stm-manager-stop-watchdog branch from a0215d9 to afcd313 Compare May 28, 2025 15:49
bharathv
bharathv previously approved these changes May 28, 2025
Member

@dotnwat dotnwat left a comment

lgtm. also wondering about the unused function

@vbotbuildovich
Collaborator

vbotbuildovich commented May 29, 2025

CI test results

test results on build#66506
| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason |
|---|---|---|---|---|---|---|---|
| DataTransformsTest | test_consume_off_end | {"offset": null} | ducktape | https://buildkite.com/redpanda/redpanda/builds/66506#01971814-6ea6-45b7-a5ef-321a8de1eab0 | FLAKY | 20/21 | upstream reliability is '99.57805907172997'. current run reliability is '95.23809523809523'. drift is 4.33996 and the allowed drift is set to 50. The test should PASS |
| MaintenanceCycleTest | test_leader_distribution | {"use_rpk": false} | ducktape | https://buildkite.com/redpanda/redpanda/builds/66506#01971814-6ea5-49b1-b5e6-da1e9d334074 | FLAKY | 20/21 | upstream reliability is '99.59677419354838'. current run reliability is '95.23809523809523'. drift is 4.35868 and the allowed drift is set to 50. The test should PASS |
| PartitionReassignmentsTest | test_add_partitions_with_inprogress_reassignments | | ducktape | https://buildkite.com/redpanda/redpanda/builds/66506#01971814-6ea5-49b1-b5e6-da1e9d334074 | FLAKY | 16/21 | upstream reliability is '89.57475994513031'. current run reliability is '76.19047619047619'. drift is 13.38428 and the allowed drift is set to 50. The test should PASS |
| RandomNodeOperationsTest | test_node_operations | {"cloud_storage_type": 1, "enable_failures": false, "mixed_versions": false, "with_chunked_compaction": false, "with_iceberg": false} | ducktape | https://buildkite.com/redpanda/redpanda/builds/66506#01971816-da34-4a98-a8d7-86518a37fead | FLAKY | 20/21 | |
| RandomNodeOperationsTest | test_node_operations | {"cloud_storage_type": 1, "enable_failures": false, "mixed_versions": false, "with_chunked_compaction": false, "with_iceberg": true} | ducktape | https://buildkite.com/redpanda/redpanda/builds/66506#01971816-da34-4a98-a8d7-86518a37fead | FLAKY | 20/21 | |
test results on build#66565
| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason |
|---|---|---|---|---|---|---|---|
| RandomNodeOperationsTest | test_node_operations | {"cloud_storage_type": 2, "enable_failures": false, "mixed_versions": false, "with_chunked_compaction": true, "with_iceberg": false} | ducktape | https://buildkite.com/redpanda/redpanda/builds/66565#01971bb7-5015-4319-bdac-874a687f3311 | FAIL | 0/1 | |
| RandomNodeOperationsTest | test_node_operations | {"cloud_storage_type": 2, "enable_failures": true, "mixed_versions": false, "with_chunked_compaction": true, "with_iceberg": false} | ducktape | https://buildkite.com/redpanda/redpanda/builds/66565#01971bb7-5015-4319-bdac-874a687f3311 | FAIL | 0/1 | |
| RandomNodeOperationsTest | test_node_operations | {"cloud_storage_type": 1, "enable_failures": true, "mixed_versions": true, "with_chunked_compaction": true, "with_iceberg": false} | ducktape | https://buildkite.com/redpanda/redpanda/builds/66565#01971bb7-5016-4b1b-842f-f2f40ebeba67 | FLAKY | 20/21 | upstream reliability is '97.70992366412213'. current run reliability is '95.23809523809523'. drift is 2.47183 and the allowed drift is set to 50. The test should PASS |
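
The figures in the reason column are consistent with simple arithmetic: current run reliability is passed runs over total runs as a percentage (20/21 ≈ 95.238), and drift is upstream reliability minus that value (99.578... − 95.238... ≈ 4.33996). The Python sketch below reproduces that calculation. The function names are hypothetical, and the `<=` comparison against the allowed drift is an assumption inferred from the reported verdicts, not taken from the CI bot's source.

```python
def reliability(passed, total):
    """Percentage of runs that passed."""
    return 100.0 * passed / total


def drift(upstream_reliability, passed, total):
    """Upstream reliability minus the current run's reliability."""
    return upstream_reliability - reliability(passed, total)


def verdict(upstream_reliability, passed, total, allowed_drift=50.0):
    # Assumption: the test is expected to pass while drift stays within
    # the allowed drift. The exact comparison used by the bot is not shown.
    within = drift(upstream_reliability, passed, total) <= allowed_drift
    return "PASS" if within else "FAIL"


# Row for DataTransformsTest.test_consume_off_end (20 of 21 runs passed).
d = drift(99.57805907172997, 20, 21)
print(f"drift = {d:.5f}, verdict = {verdict(99.57805907172997, 20, 21)}")
```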

@mmaslankaprv mmaslankaprv force-pushed the stm-manager-stop-watchdog branch from 4a7c940 to e012d75 Compare May 29, 2025 06:15
@mmaslankaprv mmaslankaprv requested review from dotnwat and bharathv May 29, 2025 06:15
State machines should stop in a timely manner, because any problem
stopping a state machine can prevent the whole partition from stopping.
Added a watchdog that logs an error when an STM fails to stop.

Signed-off-by: Michał Maślanka <[email protected]>
@mmaslankaprv mmaslankaprv force-pushed the stm-manager-stop-watchdog branch from e012d75 to 9d4dbfb Compare May 29, 2025 10:30
@mmaslankaprv
Member Author

/ci-repeat 1

@mmaslankaprv mmaslankaprv merged commit 0641d9e into redpanda-data:dev May 30, 2025
17 checks passed
@vbotbuildovich
Collaborator

/backport v25.1.x

@vbotbuildovich
Collaborator

/backport v24.3.x

@vbotbuildovich
Collaborator

/backport v24.2.x

@mmaslankaprv mmaslankaprv deleted the stm-manager-stop-watchdog branch May 30, 2025 14:42
@vbotbuildovich
Collaborator

Failed to create a backport PR to v24.2.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-26259-v24.2.x-362 remotes/upstream/v24.2.x
git cherry-pick -x 9d4dbfbbb7

Workflow run logs.

@vbotbuildovich
Collaborator

Failed to create a backport PR to v24.3.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-26259-v24.3.x-457 remotes/upstream/v24.3.x
git cherry-pick -x 9d4dbfbbb7

Workflow run logs.

@vbotbuildovich
Collaborator

Failed to create a backport PR to v25.1.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-26259-v25.1.x-975 remotes/upstream/v25.1.x
git cherry-pick -x 9d4dbfbbb7

Workflow run logs.

5 participants