Skip to content

KVM local storage pool gets removed from hosts #11104

Open
@jdhirst

Description

@jdhirst

problem

Issue Details

After switching over to main for testing the CKS improvements, I ran into an issue where my compute nodes lose their local storage pools in libvirt. The pools exist in Cloudstack however the pools get undefined. This can be temporarily fixed by manually defining the pools with the same UUID via virsh pool-define, however they get removed after as well.

Both of the compute nodes have local.storage.uuid defined in their agent.properties file. Once this issue starts to happen, restarting the agent does not cause the pool to come back. However, if you change the UUID in agent.properties, a new pool gets created and will start without issue.

The affect of this is that after the pool is removed, existing instances continue to run, however new instances cannot be created until the pool is manually recreated.

Creating a new pool with a different UUID works for a time, but eventually also gets removed. I'm not exactly sure what is triggering the removal.

Log Snippets

Both management server and agent encounter errors trying to find the pool:

2025-06-29 07:36:58,138 DEBUG [c.c.a.m.ClusteredAgentManagerImpl] (StatsCollector-6:[ctx-8a6c9ea0]) (logid:58ac492b) Details from executing class com.cloud.agent.api.GetVolumeStatsCommand: Can't get vm disk stats: Could not fetch storage pool 27ea62dc-ded2-4a25-9115-277c981f33fa from libvirt due to org.libvirt.LibvirtException: Storage pool not found: no storage pool with matching uuid '27ea62dc-ded2-4a25-9115-277c981f33fa'
2025-06-29 07:36:58,096 DEBUG [cloud.agent.Agent] (AgentRequest-Handler-2:[]) (logid:58ac492b) Processing command: com.cloud.agent.api.GetVolumeStatsCommand
2025-06-29 07:36:58,096 DEBUG [kvm.resource.LibvirtConnection] (AgentRequest-Handler-2:[]) (logid:58ac492b) Looking for libvirtd connection at: qemu:///system
2025-06-29 07:36:58,096 INFO  [kvm.storage.LibvirtStorageAdaptor] (AgentRequest-Handler-2:[]) (logid:58ac492b) Trying to fetch storage pool 27ea62dc-ded2-4a25-9115-277c981f33fa from libvirt
2025-06-29 07:36:58,096 DEBUG [kvm.resource.LibvirtConnection] (AgentRequest-Handler-2:[]) (logid:58ac492b) Looking for libvirtd connection at: qemu:///system
2025-06-29 07:36:58,097 DEBUG [kvm.storage.LibvirtStorageAdaptor] (AgentRequest-Handler-2:[]) (logid:58ac492b) Could not find storage pool 27ea62dc-ded2-4a25-9115-277c981f33fa in libvirt

The pool is being removed by the management server:

2025-06-29 03:36:47,496 DEBUG [c.c.s.StorageManagerImpl] (AgentConnectTaskPool-4:[ctx-d22dee82]) (logid:5fbc7c2a) Removing pool StoragePool {"id":13,"name":"cs-compute2.cloud.redacted.net-local-d997caf8","poolType":"Filesystem","uuid":"d997caf8-e98c-498c-aac2-3016f6ae2f5d"} from host Host {"id":2,"name":"cs-compute2.cloud.redacted.net","type":"Routing","uuid":"dfd0102e-f60c-44b4-8a12-70bc46fc8154"}
2025-06-29 03:36:47,511 DEBUG [c.c.a.t.Request] (AgentConnectTaskPool-4:[ctx-d22dee82]) (logid:5fbc7c2a) Seq 2-3926857400090361861: Sending  { Cmd , MgmtId: 90520738888109, via: 2(cs-compute2.cloud.redacted.net), Ver: v1, Flags: 100011, [{"com.cloud.agent.api.DeleteStoragePoolCommand":{"_pool":{"id":"13","uuid":"d997caf8-e98c-498c-aac2-3016f6ae2f5d","host":"10.25.0.3","path":"/var/lib/libvirt/images","port":"0","type":"Filesystem"},"_localPath":"/mnt//492af6d4-b5d1-3dc8-8a50-040101d28230","_removeDatastore":"false","wait":"0","bypassHostMaintenance":"false"}}] }
2025-06-29 03:36:47,563 DEBUG [c.c.a.t.Request] (AgentConnectTaskPool-4:[ctx-d22dee82]) (logid:5fbc7c2a) Seq 2-3926857400090361861: Received:  { Ans: , MgmtId: 90520738888109, via: 2(cs-compute2.cloud.redacted.net), Ver: v1, Flags: 10, { Answer } }
2025-06-29 03:36:47,576 INFO  [o.a.c.s.d.p.DefaultHostListener] (AgentConnectTaskPool-4:[ctx-d22dee82]) (logid:5fbc7c2a) Connection removed between storage pool: StoragePool {"id":13,"name":"cs-compute2.cloud.redacted.net-local-d997caf8","poolType":"Filesystem","uuid":"d997caf8-e98c-498c-aac2-3016f6ae2f5d"} and host: Host {"id":2,"name":"cs-compute2.cloud.redacted.net","type":"Routing","uuid":"dfd0102e-f60c-44b4-8a12-70bc46fc8154"}
2025-06-29 03:36:47,606 DEBUG [c.c.s.StorageManagerImpl] (AgentConnectTaskPool-4:[ctx-d22dee82]) (logid:5fbc7c2a) Found storage pool StoragePool {"id":13,"name":"cs-compute2.cloud.redacted.net-local-d997caf8","poolType":"Filesystem","uuid":"d997caf8-e98c-498c-aac2-3016f6ae2f5d"} of type Filesystem with overprovisioning factor 2
2025-06-29 03:36:47,608 DEBUG [c.c.s.StorageManagerImpl] (AgentConnectTaskPool-4:[ctx-d22dee82]) (logid:5fbc7c2a) Total over provisioned capacity of the pool StoragePool {"id":13,"name":"cs-compute2.cloud.redacted.net-local-d997caf8","poolType":"Filesystem","uuid":"d997caf8-e98c-498c-aac2-3016f6ae2f5d"} is (306.79 GB) 329410707456
2025-06-29 03:36:47,629 DEBUG [c.c.s.StorageManagerImpl] (AgentConnectTaskPool-4:[ctx-d22dee82]) (logid:5fbc7c2a) Successfully set Capacity - (306.79 GB) 329410707456 for capacity type - 9 , DataCenterId - 1, Pool - StoragePool {"id":13,"name":"cs-compute2.cloud.redacted.net-local-d997caf8","poolType":"Filesystem","uuid":"d997caf8-e98c-498c-aac2-3016f6ae2f5d"}, PodId 1

Agent's logs showing pool removal:

2025-06-29 03:36:47,233 DEBUG [cloud.agent.Agent] (AgentRequest-Handler-1:[]) (logid:4c5c4abb) Request:Seq 1-4725683383995203589:  { Cmd , MgmtId: 90520738888109, via: 1, Ver: v1, Flags: 100011, [{"com.cloud.agent.api.DeleteStoragePoolCommand":{"_pool":{"id":"12","uuid":"27ea62dc-ded2-4a25-9115-277c981f33fa","host":"10.25.0.2","path":"/var/lib/libvirt/images","port":"0","type":"Filesystem"},"_localPath":"/mnt//36cfe92b-5de4-3baf-9813-796fbddeb8af","_removeDatastore":"false","wait":"0","bypassHostMaintenance":"false"}}] }
2025-06-29 03:36:47,233 DEBUG [cloud.agent.Agent] (AgentRequest-Handler-1:[]) (logid:4c5c4abb) Processing command: com.cloud.agent.api.DeleteStoragePoolCommand
2025-06-29 03:36:47,233 INFO  [kvm.storage.LibvirtStorageAdaptor] (AgentRequest-Handler-1:[]) (logid:4c5c4abb) Attempting to remove storage pool 27ea62dc-ded2-4a25-9115-277c981f33fa from libvirt
2025-06-29 03:36:47,234 DEBUG [kvm.resource.LibvirtConnection] (AgentRequest-Handler-1:[]) (logid:4c5c4abb) Looking for libvirtd connection at: qemu:///system
2025-06-29 03:36:47,235 INFO  [kvm.storage.LibvirtStorageAdaptor] (AgentRequest-Handler-1:[]) (logid:4c5c4abb) Storage pool 27ea62dc-ded2-4a25-9115-277c981f33fa has no corresponding secret. Not removing any secret.
2025-06-29 03:36:47,236 INFO  [kvm.storage.LibvirtStorageAdaptor] (AgentRequest-Handler-1:[]) (logid:4c5c4abb) Storage pool 27ea62dc-ded2-4a25-9115-277c981f33fa was successfully removed from libvirt.

The pools still show as up in the UI:
Image

And in the DB: (as you can see, I attempted to replace the pools with new ones to see what would happen)

id name uuid pool_type port data_center_id pod_id cluster_id used_bytes capacity_bytes host_address user_info path created removed update_time status storage_provider_name scope hypervisor managed capacity_iops parent used_iops
1 cs-compute1.cloud.redacted.net-local-0322a17b Filesystem 0 1 1 1 424638771200 913383583744 10.25.0.2 /var/lib/libvirt/images 2024-09-17 12:14:34 2025-06-28 12:44:14 2025-06-27 12:40:43 Up DefaultPrimary HOST KVM 0 0 120
2 cs-compute2.cloud.redacted.net-local-7aa8e196 Filesystem 0 1 1 1 114410532864 164705353728 10.25.0.3 /var/lib/libvirt/images 2024-09-17 12:17:05 2025-06-28 12:43:04 2025-06-28 09:09:47 Up DefaultPrimary HOST KVM 0 0 60
4 nfs 10805e0a-2205-3158-993f-9350b08b9137 NetworkFilesystem 2049 1 1 1 36540069576704 61401196134400 nas.redacted.net /volume1/cs-primary 2024-09-19 16:55:58 2025-06-29 06:34:38 Up DefaultPrimary CLUSTER KVM 0 0
5 nfs-ssd 80a947f2-ab02-3d8c-b1c0-7ba964ae0e80 NetworkFilesystem 2049 1 1 1 358665551872 1418370088960 nas.redacted.net /volume2/cs-primary-ssd 2024-09-28 10:31:47 2025-06-29 06:34:38 Up DefaultPrimary CLUSTER KVM 0 0
12 cs-compute1.cloud.redacted.net-local-27ea62dc 27ea62dc-ded2-4a25-9115-277c981f33fa Filesystem 0 1 1 1 424638771200 913383583744 10.25.0.2 /var/lib/libvirt/images 2025-06-28 13:15:34 2025-06-29 00:38:22 Up DefaultPrimary HOST KVM 0 0 29
13 cs-compute2.cloud.redacted.net-local-d997caf8 d997caf8-e98c-498c-aac2-3016f6ae2f5d Filesystem 0 1 1 1 114449989632 164705353728 10.25.0.3 /var/lib/libvirt/images 2025-06-28 13:21:47 2025-06-29 00:49:49 Up DefaultPrimary HOST KVM 0 0 60

The removed column is set to NULL for the two current local storage pools even after they have been removed from the hosts.

Tracing

I attempted to do some tracing of this issue (see trace logs attached), however I am still unable to find what triggers the pool removal call in the first place.

Log Files

Debug

management-server-part1.log
management-server-part2.log
management-server-part3.log

pool_removal.log
storagemanager.log

agent-compute1.log
agent-compute2.log

agent-compute1.err.log
agent-compute2.err.log

Trace

management-server-trace.log
pool_removal-trace.log

versions

I'm using the latest nightly build as of yesterday, however this issue has been there ever since I switched over to nightly.

Current build: 4.21.0.0-SNAPSHOT.20250628
OS: RHEL 8.10

The cluster is made up of one management node and two KVM compute nodes, all running the same build.

The steps to reproduce the bug

  1. Install the latest nightly build on RHEL with KVM compute nodes.
  2. Configure local storage pools on the compute nodes and launch a few VM instances.
  3. Either restart the agent or wait a number of hours, and the pool should vanish from libvirt.
    ...

What to do about it?

Find a code fix or workaround to prevent local storage pools from being removed.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions