Description
problem
Issue Details
After switching over to main for testing the CKS improvements, I ran into an issue where my compute nodes lose their local storage pools in libvirt. The pools exist in Cloudstack however the pools get undefined. This can be temporarily fixed by manually defining the pools with the same UUID via virsh pool-define
, however they get removed after as well.
Both of the compute nodes have local.storage.uuid
defined in their agent.properties file. Once this issue starts to happen, restarting the agent does not cause the pool to come back. However, if you change the UUID in agent.properties
, a new pool gets created and will start without issue.
The affect of this is that after the pool is removed, existing instances continue to run, however new instances cannot be created until the pool is manually recreated.
Creating a new pool with a different UUID works for a time, but eventually also gets removed. I'm not exactly sure what is triggering the removal.
Log Snippets
Both management server and agent encounter errors trying to find the pool:
2025-06-29 07:36:58,138 DEBUG [c.c.a.m.ClusteredAgentManagerImpl] (StatsCollector-6:[ctx-8a6c9ea0]) (logid:58ac492b) Details from executing class com.cloud.agent.api.GetVolumeStatsCommand: Can't get vm disk stats: Could not fetch storage pool 27ea62dc-ded2-4a25-9115-277c981f33fa from libvirt due to org.libvirt.LibvirtException: Storage pool not found: no storage pool with matching uuid '27ea62dc-ded2-4a25-9115-277c981f33fa'
2025-06-29 07:36:58,096 DEBUG [cloud.agent.Agent] (AgentRequest-Handler-2:[]) (logid:58ac492b) Processing command: com.cloud.agent.api.GetVolumeStatsCommand
2025-06-29 07:36:58,096 DEBUG [kvm.resource.LibvirtConnection] (AgentRequest-Handler-2:[]) (logid:58ac492b) Looking for libvirtd connection at: qemu:///system
2025-06-29 07:36:58,096 INFO [kvm.storage.LibvirtStorageAdaptor] (AgentRequest-Handler-2:[]) (logid:58ac492b) Trying to fetch storage pool 27ea62dc-ded2-4a25-9115-277c981f33fa from libvirt
2025-06-29 07:36:58,096 DEBUG [kvm.resource.LibvirtConnection] (AgentRequest-Handler-2:[]) (logid:58ac492b) Looking for libvirtd connection at: qemu:///system
2025-06-29 07:36:58,097 DEBUG [kvm.storage.LibvirtStorageAdaptor] (AgentRequest-Handler-2:[]) (logid:58ac492b) Could not find storage pool 27ea62dc-ded2-4a25-9115-277c981f33fa in libvirt
The pool is being removed by the management server:
2025-06-29 03:36:47,496 DEBUG [c.c.s.StorageManagerImpl] (AgentConnectTaskPool-4:[ctx-d22dee82]) (logid:5fbc7c2a) Removing pool StoragePool {"id":13,"name":"cs-compute2.cloud.redacted.net-local-d997caf8","poolType":"Filesystem","uuid":"d997caf8-e98c-498c-aac2-3016f6ae2f5d"} from host Host {"id":2,"name":"cs-compute2.cloud.redacted.net","type":"Routing","uuid":"dfd0102e-f60c-44b4-8a12-70bc46fc8154"}
2025-06-29 03:36:47,511 DEBUG [c.c.a.t.Request] (AgentConnectTaskPool-4:[ctx-d22dee82]) (logid:5fbc7c2a) Seq 2-3926857400090361861: Sending { Cmd , MgmtId: 90520738888109, via: 2(cs-compute2.cloud.redacted.net), Ver: v1, Flags: 100011, [{"com.cloud.agent.api.DeleteStoragePoolCommand":{"_pool":{"id":"13","uuid":"d997caf8-e98c-498c-aac2-3016f6ae2f5d","host":"10.25.0.3","path":"/var/lib/libvirt/images","port":"0","type":"Filesystem"},"_localPath":"/mnt//492af6d4-b5d1-3dc8-8a50-040101d28230","_removeDatastore":"false","wait":"0","bypassHostMaintenance":"false"}}] }
2025-06-29 03:36:47,563 DEBUG [c.c.a.t.Request] (AgentConnectTaskPool-4:[ctx-d22dee82]) (logid:5fbc7c2a) Seq 2-3926857400090361861: Received: { Ans: , MgmtId: 90520738888109, via: 2(cs-compute2.cloud.redacted.net), Ver: v1, Flags: 10, { Answer } }
2025-06-29 03:36:47,576 INFO [o.a.c.s.d.p.DefaultHostListener] (AgentConnectTaskPool-4:[ctx-d22dee82]) (logid:5fbc7c2a) Connection removed between storage pool: StoragePool {"id":13,"name":"cs-compute2.cloud.redacted.net-local-d997caf8","poolType":"Filesystem","uuid":"d997caf8-e98c-498c-aac2-3016f6ae2f5d"} and host: Host {"id":2,"name":"cs-compute2.cloud.redacted.net","type":"Routing","uuid":"dfd0102e-f60c-44b4-8a12-70bc46fc8154"}
2025-06-29 03:36:47,606 DEBUG [c.c.s.StorageManagerImpl] (AgentConnectTaskPool-4:[ctx-d22dee82]) (logid:5fbc7c2a) Found storage pool StoragePool {"id":13,"name":"cs-compute2.cloud.redacted.net-local-d997caf8","poolType":"Filesystem","uuid":"d997caf8-e98c-498c-aac2-3016f6ae2f5d"} of type Filesystem with overprovisioning factor 2
2025-06-29 03:36:47,608 DEBUG [c.c.s.StorageManagerImpl] (AgentConnectTaskPool-4:[ctx-d22dee82]) (logid:5fbc7c2a) Total over provisioned capacity of the pool StoragePool {"id":13,"name":"cs-compute2.cloud.redacted.net-local-d997caf8","poolType":"Filesystem","uuid":"d997caf8-e98c-498c-aac2-3016f6ae2f5d"} is (306.79 GB) 329410707456
2025-06-29 03:36:47,629 DEBUG [c.c.s.StorageManagerImpl] (AgentConnectTaskPool-4:[ctx-d22dee82]) (logid:5fbc7c2a) Successfully set Capacity - (306.79 GB) 329410707456 for capacity type - 9 , DataCenterId - 1, Pool - StoragePool {"id":13,"name":"cs-compute2.cloud.redacted.net-local-d997caf8","poolType":"Filesystem","uuid":"d997caf8-e98c-498c-aac2-3016f6ae2f5d"}, PodId 1
Agent's logs showing pool removal:
2025-06-29 03:36:47,233 DEBUG [cloud.agent.Agent] (AgentRequest-Handler-1:[]) (logid:4c5c4abb) Request:Seq 1-4725683383995203589: { Cmd , MgmtId: 90520738888109, via: 1, Ver: v1, Flags: 100011, [{"com.cloud.agent.api.DeleteStoragePoolCommand":{"_pool":{"id":"12","uuid":"27ea62dc-ded2-4a25-9115-277c981f33fa","host":"10.25.0.2","path":"/var/lib/libvirt/images","port":"0","type":"Filesystem"},"_localPath":"/mnt//36cfe92b-5de4-3baf-9813-796fbddeb8af","_removeDatastore":"false","wait":"0","bypassHostMaintenance":"false"}}] }
2025-06-29 03:36:47,233 DEBUG [cloud.agent.Agent] (AgentRequest-Handler-1:[]) (logid:4c5c4abb) Processing command: com.cloud.agent.api.DeleteStoragePoolCommand
2025-06-29 03:36:47,233 INFO [kvm.storage.LibvirtStorageAdaptor] (AgentRequest-Handler-1:[]) (logid:4c5c4abb) Attempting to remove storage pool 27ea62dc-ded2-4a25-9115-277c981f33fa from libvirt
2025-06-29 03:36:47,234 DEBUG [kvm.resource.LibvirtConnection] (AgentRequest-Handler-1:[]) (logid:4c5c4abb) Looking for libvirtd connection at: qemu:///system
2025-06-29 03:36:47,235 INFO [kvm.storage.LibvirtStorageAdaptor] (AgentRequest-Handler-1:[]) (logid:4c5c4abb) Storage pool 27ea62dc-ded2-4a25-9115-277c981f33fa has no corresponding secret. Not removing any secret.
2025-06-29 03:36:47,236 INFO [kvm.storage.LibvirtStorageAdaptor] (AgentRequest-Handler-1:[]) (logid:4c5c4abb) Storage pool 27ea62dc-ded2-4a25-9115-277c981f33fa was successfully removed from libvirt.
The pools still show as up in the UI:
And in the DB: (as you can see, I attempted to replace the pools with new ones to see what would happen)
id | name | uuid | pool_type | port | data_center_id | pod_id | cluster_id | used_bytes | capacity_bytes | host_address | user_info | path | created | removed | update_time | status | storage_provider_name | scope | hypervisor | managed | capacity_iops | parent | used_iops |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | cs-compute1.cloud.redacted.net-local-0322a17b | Filesystem | 0 | 1 | 1 | 1 | 424638771200 | 913383583744 | 10.25.0.2 | /var/lib/libvirt/images | 2024-09-17 12:14:34 | 2025-06-28 12:44:14 | 2025-06-27 12:40:43 | Up | DefaultPrimary | HOST | KVM | 0 | 0 | 120 | |||
2 | cs-compute2.cloud.redacted.net-local-7aa8e196 | Filesystem | 0 | 1 | 1 | 1 | 114410532864 | 164705353728 | 10.25.0.3 | /var/lib/libvirt/images | 2024-09-17 12:17:05 | 2025-06-28 12:43:04 | 2025-06-28 09:09:47 | Up | DefaultPrimary | HOST | KVM | 0 | 0 | 60 | |||
4 | nfs | 10805e0a-2205-3158-993f-9350b08b9137 | NetworkFilesystem | 2049 | 1 | 1 | 1 | 36540069576704 | 61401196134400 | nas.redacted.net | /volume1/cs-primary | 2024-09-19 16:55:58 | 2025-06-29 06:34:38 | Up | DefaultPrimary | CLUSTER | KVM | 0 | 0 | ||||
5 | nfs-ssd | 80a947f2-ab02-3d8c-b1c0-7ba964ae0e80 | NetworkFilesystem | 2049 | 1 | 1 | 1 | 358665551872 | 1418370088960 | nas.redacted.net | /volume2/cs-primary-ssd | 2024-09-28 10:31:47 | 2025-06-29 06:34:38 | Up | DefaultPrimary | CLUSTER | KVM | 0 | 0 | ||||
12 | cs-compute1.cloud.redacted.net-local-27ea62dc | 27ea62dc-ded2-4a25-9115-277c981f33fa | Filesystem | 0 | 1 | 1 | 1 | 424638771200 | 913383583744 | 10.25.0.2 | /var/lib/libvirt/images | 2025-06-28 13:15:34 | 2025-06-29 00:38:22 | Up | DefaultPrimary | HOST | KVM | 0 | 0 | 29 | |||
13 | cs-compute2.cloud.redacted.net-local-d997caf8 | d997caf8-e98c-498c-aac2-3016f6ae2f5d | Filesystem | 0 | 1 | 1 | 1 | 114449989632 | 164705353728 | 10.25.0.3 | /var/lib/libvirt/images | 2025-06-28 13:21:47 | 2025-06-29 00:49:49 | Up | DefaultPrimary | HOST | KVM | 0 | 0 | 60 |
The removed column is set to NULL for the two current local storage pools even after they have been removed from the hosts.
Tracing
I attempted to do some tracing of this issue (see trace logs attached), however I am still unable to find what triggers the pool removal call in the first place.
Log Files
Debug
management-server-part1.log
management-server-part2.log
management-server-part3.log
pool_removal.log
storagemanager.log
agent-compute1.log
agent-compute2.log
agent-compute1.err.log
agent-compute2.err.log
Trace
management-server-trace.log
pool_removal-trace.log
versions
I'm using the latest nightly build as of yesterday, however this issue has been there ever since I switched over to nightly.
Current build: 4.21.0.0-SNAPSHOT.20250628
OS: RHEL 8.10
The cluster is made up of one management node and two KVM compute nodes, all running the same build.
The steps to reproduce the bug
- Install the latest nightly build on RHEL with KVM compute nodes.
- Configure local storage pools on the compute nodes and launch a few VM instances.
- Either restart the agent or wait a number of hours, and the pool should vanish from libvirt.
...
What to do about it?
Find a code fix or workaround to prevent local storage pools from being removed.