I attended a great session at VMworld Europe: INF5898 - vSphere High Availability (HA) Best Practices. With vSphere 6.0, VMware introduced a useful new feature: VM Component Protection (VMCP). It is a new vSphere HA option that protects VMs against Permanent Device Loss (PDL) and All Paths Down (APD) storage conditions.
This post covers the following topics:
- PDL and APD definitions and causes.
- HA Response for Datastore with Permanent Device Loss or All Paths Down.
- Configuration of VM Component Protection (VMCP).
When does a Permanent Device Loss (PDL) occur?
A PDL occurs when the storage device becomes unavailable to the host. Example causes:
- ESXi host removed from the array's storage group.
- Removal of the storage ports' WWN from the zone configuration.
- Failed LUN.
In the PDL state, the storage array can still communicate with the host (all paths remain up), but it returns SCSI sense codes indicating that the device is unavailable. When ESXi detects a PDL state, the host stops sending I/O requests for that device to the storage array.
What is All Paths Down (APD)?
In this case, there is no connection between the storage array and the host, so no PDL SCSI sense codes are returned by the array. Example causes:
- FC Switch failure.
- FC HBA failure.
Because the ESXi host does not have enough information to know whether the device will return, it waits for a period known as the APD Timeout before the device is marked as APD timed out. The default is 140 seconds, and it can be changed per host through the advanced setting Misc.APDTimeout.
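The distinction between PDL and APD described above can be sketched in a few lines of Python. This is only an illustrative model, not how ESXi implements detection: the `StorageState` enum, the `classify` helper, and its parameters are hypothetical names introduced for this example. The only value taken from the text is the 140-second default APD Timeout.

```python
from enum import Enum

# Default APD Timeout in seconds (Misc.APDTimeout), as stated in the text.
APD_TIMEOUT_SECONDS = 140


class StorageState(Enum):
    OK = "device reachable"
    PDL = "permanent device loss"
    APD = "all paths down, timeout not yet reached"
    APD_TIMEOUT = "all paths down, timeout reached"


def classify(paths_up: bool,
             pdl_sense_received: bool,
             seconds_without_paths: float = 0.0,
             apd_timeout: float = APD_TIMEOUT_SECONDS) -> StorageState:
    """Hypothetical helper: map simplified observations to a storage state."""
    # PDL: the array still answers, but reports the device as unavailable.
    if paths_up and pdl_sense_received:
        return StorageState.PDL
    # APD: no path to the array, so no sense code can arrive at all.
    if not paths_up:
        if seconds_without_paths >= apd_timeout:
            return StorageState.APD_TIMEOUT
        return StorageState.APD
    return StorageState.OK


if __name__ == "__main__":
    print(classify(paths_up=True, pdl_sense_received=True))
    print(classify(paths_up=False, pdl_sense_received=False,
                   seconds_without_paths=30))
```

The key point of the sketch is that a PDL needs a working path (the sense code has to travel over it), while an APD is defined by the absence of any path, which is why the host has to fall back on a timer.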
The figure below presents the VMCP recovery timeline:
How to configure VM Component Protection (VMCP)?
To enable and configure VM Component Protection (VMCP), you have to use the Web Client. Edit the cluster's HA settings and enable the following option: Protect against Storage Connectivity Loss.
As shown in the figure above, the following options and parameters are available:
- Response for Datastore with Permanent Device Loss (PDL)
  - Issue events - No action is taken; only an event is issued when a PDL has occurred.
  - Power off and restart VMs
- Response for Datastore with All Paths Down (APD)
  - Issue events - No action is taken; only an event is issued when an APD has occurred.
  - Power off and restart VMs (conservative) - Restart only if there is sufficient capacity on healthy hosts.
  - Power off and restart VMs (aggressive)
- Delay for VM failover for APD - After the APD Timeout has been reached (default: 140 seconds), VMCP waits an additional period (3 minutes) before taking action against the affected VMs.
- Response for APD recovery after APD timeout
  - Reset VMs - Hard reset of the VMs.
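To make the APD timing concrete, here is a minimal Python sketch that adds the APD Timeout and the failover delay together. It is a simplified model under the defaults mentioned above (140 seconds and 3 minutes); the `VmcpApdSettings` class and the `vmcp_phase` function are hypothetical names for illustration and are not part of any VMware API.

```python
from dataclasses import dataclass


@dataclass
class VmcpApdSettings:
    """Hypothetical container for the two APD timers discussed above."""
    apd_timeout_s: int = 140      # Misc.APDTimeout default
    failover_delay_s: int = 180   # "Delay for VM failover for APD" (3 minutes)

    def seconds_until_action(self) -> int:
        # VMCP acts only after both periods have elapsed, one after the other.
        return self.apd_timeout_s + self.failover_delay_s


def vmcp_phase(elapsed_s: float, settings: VmcpApdSettings) -> str:
    """Return the phase of the APD timeline after elapsed_s seconds."""
    if elapsed_s < settings.apd_timeout_s:
        return "waiting for APD timeout"
    if elapsed_s < settings.seconds_until_action():
        return "APD timeout reached, failover delay running"
    return "VMCP response due"


if __name__ == "__main__":
    settings = VmcpApdSettings()
    print(settings.seconds_until_action())  # 140 + 180 = 320
    for t in (60, 200, 320):
        print(t, vmcp_phase(t, settings))
```

With the defaults, the configured APD response is therefore due 320 seconds (5 minutes 20 seconds) after the paths were lost, assuming the device does not come back in the meantime.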
VM Component Protection is a powerful feature that can help us minimize the impact of storage problems on our VMware infrastructure.