Update (20140114): I have tested KVM HA on Apache CloudStack 4.2.1 and it is functioning as expected. Please see this new blog post.
The Linux Kernel Virtual Machine (KVM) is a very popular hypervisor choice amongst CloudStack and OpenStack users. It is free and ships with popular Linux distributions like CentOS/RedHat and Ubuntu. In some cases, customers insist on using an end-to-end open source solution for their private cloud, and KVM ends up being the only choice available.
So in a recent deployment, where the private cloud had to run entirely on mature open source solutions, the obvious choice was Apache CloudStack 4.1 with KVM hypervisors on CentOS 6.x. Building the management tier and the KVM hosts themselves was a breeze with CentOS KickStart and the SSH-based Ansible for post-install configuration of services.
The infrastructure too was built for resilience: dual power supplies, dual 4-port network controllers wired across east-west switches with LACP, and HA for storage. On the CloudStack side, the management servers sat behind load balancers with MySQL replication, and there were multiple PODs, multiple Clusters and multiple Hosts in each cluster. New service offerings were also created with HA enabled.
One of the resilience tests was to simply power off a random KVM hypervisor within a logical cluster and watch the affected HA-enabled VM(s) auto start on another host within the same cluster after the timeout period. To everyone's surprise, the guest VMs just sat there marked in ‘Up’ state despite being physically offline. A close look at the management logs showed little to no sign that CloudStack even cared about the affected guest VMs or the KVM host.
CloudStack VM HA with KVM was simply not working.
After spending some time on the Apache CloudStack mailing lists and JIRA, it turns out that it is a CloudStack feature to “do nothing” in a host down scenario. This is primarily to avoid split brain situations, where network connectivity problems could leave us with multiple copies of a guest VM running on more than one physical host. Since KVM does not have built-in clustering/HA features, it is up to the CloudStack layer to decide on a corrective course of action. At this time, CloudStack simply chooses to ignore failed KVM hosts.
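To make the split brain reasoning concrete, here is a minimal Python sketch of the kind of decision involved. It is not CloudStack's actual code: the `HostStatus` values, the `Vm` type and the `vms_to_restart` function are all hypothetical names made up for illustration. The assumption is that a host the management server merely cannot reach might still be running its guests, so HA restarts are only safe once the host is confirmed down by some out-of-band check.

```python
from dataclasses import dataclass
from enum import Enum


class HostStatus(Enum):
    # Hypothetical states, not CloudStack's own host state names.
    UP = "up"
    UNREACHABLE = "unreachable"        # contact lost; guests may still be running
    CONFIRMED_DOWN = "confirmed_down"  # power-off verified out of band


@dataclass
class Vm:
    name: str
    host: str
    ha_enabled: bool


def vms_to_restart(vms, host_status):
    """Return the names of HA-enabled VMs that are safe to restart elsewhere.

    A VM qualifies only when its host is CONFIRMED_DOWN. An UNREACHABLE
    host is left alone, because restarting its guests could create a
    second running copy of the same VM (split brain).
    """
    return [
        vm.name
        for vm in vms
        if vm.ha_enabled and host_status.get(vm.host) is HostStatus.CONFIRMED_DOWN
    ]


if __name__ == "__main__":
    vms = [
        Vm("web-1", "kvm-01", True),
        Vm("db-1", "kvm-02", True),
        Vm("batch-1", "kvm-02", False),
    ]
    status = {
        "kvm-01": HostStatus.UNREACHABLE,
        "kvm-02": HostStatus.CONFIRMED_DOWN,
    }
    print(vms_to_restart(vms, status))  # ['db-1']
```

In this sketch the powered-off host from the resilience test would only trigger restarts if something could promote it from unreachable to confirmed down. Without such a check, doing nothing is the conservative choice.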
The situation could be even more problematic if the CloudStack “virtual router” also happens to be running on the failed host. All basic network services like DHCP, DNS and routing for that POD will fail as the router would be offline. This actually happened to someone on the mailing lists. The “fix” would be to go into the CloudStack database and mark the virtual router as “destroyed”. CloudStack would then create a new virtual router and services would resume.
This issue is currently being discussed in the JIRA ticket “No HA actions are performed when a KVM host goes offline”, and there is developer interest in coming up with a solution for an upcoming Apache CloudStack 4.1.x release. Also see the thread “HA not working – CloudStack 4.1.0 and KVM hypervisor hosts” on the cloudstack-users mailing list.
Please note that this problem is specific to KVM hypervisors, as they do not have built-in clustering capabilities. CloudStack with VMware or XenServer does not have this issue. Both VMware and XenServer clusters automatically do the right thing using their built-in clustering features.
As a side note, Citrix XenServer 6.2 was fully open sourced in July and installation ISOs are available from XenServer.Org. Given the enterprise features that XenServer already has over KVM (like HA clustering and fault tolerance), it is very likely to see massive adoption in fully open source clouds with future releases of Apache CloudStack.
Update: According to this thread, XCP is also affected.