Failed Host
Prior to vSphere 5.0, the restart of virtual machines from a failed host was straightforward.
With the introduction of master/slave hosts and heartbeat datastores in vSphere 5.0, the restart procedure changed, and with it the associated timelines. There is a clear distinction between the failure of a master and the failure of a slave. We emphasize this because the time it takes before a restart attempt is initiated differs between the two scenarios. Let’s start with the most common failure, that of a host, while noting that failures in general occur infrequently: in most environments, hardware failures are very uncommon to begin with. Should one happen, it helps to understand the process and its associated timelines.
The Failure of a Slave
This is a fairly complex scenario compared to how HA handled host failures prior to vSphere 5.0. Part of this complexity comes from the introduction of a new heartbeat mechanism. There are in fact two different scenarios: one where heartbeat datastores are configured and one where they are not. Keeping in mind that this is an actual failure of the host, the timeline is as follows:
T0 – Slave failure.
T3s – Master begins monitoring datastore heartbeats for 15 seconds.
T10s – The host is declared unreachable and the master will ping the management network of the failed host. This is a continuous ping for 5 seconds.
T15s – If no heartbeat datastores are configured, the host will be declared dead.
T18s – If heartbeat datastores are configured, the host will be declared dead.
The master monitors the network heartbeats of a slave. When the slave fails, the master no longer receives these heartbeats. We have defined this as T0. After 3 seconds (T3s), the master starts monitoring for datastore heartbeats, and it does this for 15 seconds. On the 10th second (T10s), when no network or datastore heartbeats have been detected, the host is declared “unreachable”. At that same moment the master also starts pinging the management network of the failed host, and it does so for 5 seconds. If no heartbeat datastores were configured, the host is declared “dead” at the 15th second (T15s) and virtual machine restarts are initiated by the master. If heartbeat datastores have been configured, the host is declared dead at the 18th second (T18s) and restarts are initiated. We realize that this can be confusing and hope the timeline depicted in Figure 18 makes it easier to digest.
Figure 18: Restart timeline slave failure
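The slave-failure timeline can also be written down as a small Python sketch. This is only an illustration of the timings stated above, not HA code: the names `Event` and `slave_failure_timeline` are hypothetical, and the arithmetic (3 s + 15 s = 18 s, 10 s + 5 s = 15 s) simply reflects how the stated values line up.

```python
from dataclasses import dataclass

# All values are seconds after T0, taken from the timeline in the text.
DATASTORE_MONITOR_START_S = 3      # master starts watching datastore heartbeats
DATASTORE_MONITOR_DURATION_S = 15  # ...and does so for 15 seconds
UNREACHABLE_AT_S = 10              # host declared unreachable, ping starts
PING_DURATION_S = 5                # management network is pinged for 5 seconds


@dataclass(frozen=True)
class Event:
    """One step of the timeline (hypothetical helper type)."""
    second: int
    description: str


def slave_failure_timeline(heartbeat_datastores: bool) -> list[Event]:
    """Return the events the master goes through after a slave fails."""
    if heartbeat_datastores:
        # Assumption: T18s coincides with the end of the datastore window.
        dead_at = DATASTORE_MONITOR_START_S + DATASTORE_MONITOR_DURATION_S
    else:
        # Assumption: T15s coincides with the end of the 5-second ping.
        dead_at = UNREACHABLE_AT_S + PING_DURATION_S
    return [
        Event(0, "slave fails; network heartbeats stop"),
        Event(DATASTORE_MONITOR_START_S,
              "master begins monitoring datastore heartbeats"),
        Event(UNREACHABLE_AT_S,
              "host declared unreachable; master pings management network"),
        Event(dead_at, "host declared dead; master initiates restarts"),
    ]


if __name__ == "__main__":
    for configured in (False, True):
        print(f"heartbeat datastores configured: {configured}")
        for event in slave_failure_timeline(configured):
            print(f"  T{event.second}s  {event.description}")
```

Running the script prints both variants side by side, which makes the 3-second difference between the two scenarios easy to see.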
The master filters the virtual machines it thinks failed before initiating restarts. Prior to vSphere 5.0 Update 1, a master used the protectedlist for this. If the master did not know the on-disk protection state of a virtual machine, it did not try to restart it. Moreover, the on-disk state could be obtained by only one master at a time, because doing so required opening the protectedlist file in exclusive mode. As of vSphere 5.0 Update 1, this behavior has changed. If there is a network partition, multiple masters could try to restart the same virtual machine, because vCenter Server also provides the necessary details for a restart. As an example, one master may have locked a virtual machine’s home datastore while the other master is in contact with vCenter Server and as such is aware of the current desired protection state. In this scenario, the master that does not own the home datastore of the virtual machine may restart the virtual machine based on the information provided by vCenter Server.
This change in behavior was introduced to avoid the scenario where a restart of a virtual machine would fail due to insufficient resources in the partition which was responsible for the virtual machine. With this change, there is less chance of such a situation occurring as the master in the other partition would be using the information provided by vCenter Server to initiate the restart.
That leaves us with the question of what happens in the case of the failure of a master.
The Failure of a Master
In the case of a master failure, the process and the associated timeline are slightly different, because there needs to be a master before any restart can be initiated. This means that an election must take place among the slaves. The timeline is as follows:
T0 – Master failure.
T10s – Master election process initiated.
T25s – New master elected and reads the protectedlist.
T35s – New master initiates restarts for all virtual machines on the protectedlist which are not running.
Slaves receive network heartbeats from their master. If the master fails (let’s define this as T0), the slaves detect this when they stop receiving its network heartbeats. As every cluster needs a master, the slaves initiate an election at T10s. The election process takes 15 seconds to complete, which brings us to T25s. At T25s, the new master reads the protectedlist, which contains all the virtual machines that are protected by HA. At T35s, the master initiates the restart of all virtual machines that are protected but not currently running. The timeline depicted in Figure 19 hopefully clarifies the process.
Figure 19: Restart timeline master failure
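As with the slave failure, the master-failure timeline can be sketched in a few lines of Python. Again, this only illustrates the timings and the "protected but not running" selection described above. The function names and the virtual machine names are hypothetical and do not correspond to anything in HA itself.

```python
# All values are seconds after T0, taken from the timeline in the text.
ELECTION_START_S = 10     # slaves initiate the election
ELECTION_DURATION_S = 15  # the election takes 15 seconds
RESTART_START_S = 35      # new master initiates restarts


def master_failure_timeline() -> list[tuple[int, str]]:
    """Return (second, description) pairs for a master failure."""
    elected_at = ELECTION_START_S + ELECTION_DURATION_S  # T25s
    return [
        (0, "master fails; slaves stop receiving network heartbeats"),
        (ELECTION_START_S, "slaves initiate master election"),
        (elected_at, "new master elected; reads the protectedlist"),
        (RESTART_START_S,
         "new master restarts protected VMs that are not running"),
    ]


def vms_to_restart(protected: set[str], running: set[str]) -> set[str]:
    """Hypothetical helper: VMs on the protectedlist that are not running."""
    return protected - running


if __name__ == "__main__":
    for second, description in master_failure_timeline():
        print(f"T{second}s  {description}")
    # Example with made-up VM names.
    print(sorted(vms_to_restart({"vm-a", "vm-b", "vm-c"}, {"vm-b"})))
```

Compared with the slave scenario, restarts begin at T35s rather than T15s or T18s, and the difference comes mainly from the election that has to complete first.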
Besides the failure of a host, there is another reason for restarting virtual machines: an isolation event.