Recently I had a long discussion with a customer about problems in their VMware environment. We needed to find a solution for the following problems, (mostly) caused by snapshots:
- Insufficient disk space on datastore
- Problems with VM backups
- Problems with VM performance
During our conversation and my investigation of the VMware infrastructure, I noted the following important things:
- Many VM snapshots (some 4 months old!)
- All users (VM guest admins) can create snapshots but they cannot delete them...
- Many VMs needed disk consolidation
This is the second time I have seen VM admins holding the "Create a Snapshot" permission but not the "Remove a Snapshot" permission. The VMware admins had decided that "Remove a Snapshot" is a more important/dangerous permission than "Create a Snapshot", and by default VM owners were not given "dangerous" permissions. Snapshots were also being used as VM backups! Of course, this is not a correct use of snapshots.
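The create-without-remove mismatch is easy to check for once role definitions are available as data. Below is a minimal sketch, assuming the roles have already been exported from vCenter into a plain dictionary; the role names and the export itself are hypothetical, only the privilege IDs are the ones vSphere uses.

```python
# vSphere privilege IDs for the two snapshot operations discussed above.
CREATE = "VirtualMachine.State.CreateSnapshot"
REMOVE = "VirtualMachine.State.RemoveSnapshot"


def roles_missing_remove(roles):
    """Return names of roles that may create snapshots but not remove them."""
    return sorted(
        name
        for name, privileges in roles.items()
        if CREATE in privileges and REMOVE not in privileges
    )


# Hypothetical role export; names and contents are made up for illustration.
example_roles = {
    "VM-Guest-Admin": {CREATE, "VirtualMachine.Interact.PowerOn"},
    "VM-Dev-Owner": {CREATE, REMOVE},
    "Read-Only": set(),
}

print(roles_missing_remove(example_roles))  # ['VM-Guest-Admin']
```

Any role this reports is a candidate for "forgotten" snapshots, because its holders can start a snapshot chain but have to ask someone else to clean it up.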
Let's recall some of VMware's best practices and recommendations for snapshots:
- Snapshots are not backups. Why? Because a snapshot file is only a change log of the original virtual disk; without the base disk it cannot restore anything.
- Delta files can grow to the same size as the original base disk file, which is why the provisioned storage size of a virtual machine increases by an amount up to the original size of the virtual machine multiplied by the number of snapshots on the virtual machine.
- The maximum supported number of snapshots in a chain is 32. However, VMware recommends that you use only 2-3 snapshots in a chain.
- Do not use a single snapshot for more than 24-72 hours. Snapshots should not be maintained over long periods of time for application or virtual machine version control purposes.
- If using a third-party product that takes advantage of snapshots (such as virtual machine backup software), regularly monitor systems configured for backups to ensure that no snapshots remain active for extended periods of time.
- An excessive number of delta files in a chain (caused by an excessive number of snapshots) or large delta files may cause decreased virtual machine and host performance.
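To make the numbers above concrete, here is a small sketch of the worst-case storage arithmetic and the chain/age limits. The function names and warning texts are my own; the thresholds are the ones quoted in the list (32 snapshots supported, 2-3 recommended, 24-72 hours), taking the upper end of each range.

```python
MAX_SUPPORTED_CHAIN = 32  # supported maximum quoted above
RECOMMENDED_CHAIN = 3     # upper end of the 2-3 recommendation
MAX_AGE_HOURS = 72        # upper end of the 24-72 hour recommendation


def worst_case_provisioned_gb(base_disk_gb, snapshot_count):
    """Upper bound: the base disk plus one full-size delta per snapshot."""
    if base_disk_gb < 0 or snapshot_count < 0:
        raise ValueError("size and count must be non-negative")
    return base_disk_gb * (1 + snapshot_count)


def snapshot_warnings(chain_length, oldest_age_hours):
    """Return human-readable warnings for one VM's snapshot chain."""
    warnings = []
    if chain_length > MAX_SUPPORTED_CHAIN:
        warnings.append("chain exceeds supported maximum")
    elif chain_length > RECOMMENDED_CHAIN:
        warnings.append("chain longer than recommended")
    if oldest_age_hours > MAX_AGE_HOURS:
        warnings.append("snapshot older than recommended")
    return warnings


# A 100 GB VM with 3 snapshots can need up to 400 GB on the datastore.
print(worst_case_provisioned_gb(100, 3))  # 400
# A 5-snapshot chain whose oldest snapshot is about 4 months old.
print(snapshot_warnings(5, 24 * 120))
```

This is only the upper bound; real delta files are usually much smaller, but the bound is what tells you how badly a forgotten chain can hurt a datastore.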
So what could be a good way to solve or avoid the problems mentioned earlier in this article?
- If you have test/dev VMs (ideally in a dedicated VMware HA cluster), you can give both permissions, "Create a Snapshot" and "Remove a Snapshot", to the VM owners. Test and development VMs are machines where developers often try out e.g. new software features, so there is generally no impact on production systems.
- Production and DMZ VMs need to be dealt with individually, but at the very least, if you grant permission to create a snapshot, you should also grant permission to remove it. "Old and forgotten" snapshots can affect not only their own VM but also other VMs (e.g. by running the datastore out of space). As a VMware administrator, you are generally responsible for the VMs (at least from the "outside").
- You should monitor all VMs running on snapshots, as well as free space on datastores.
- If you need a backup, you should always take it with a backup application such as vSphere Data Protection (VDP), Symantec NetBackup or another third-party backup application. Of course, these applications use snapshots too, but they delete (or should 😉 ) them after the backup is completed.
- Big, planned "changes" to VMs should be preceded by a backup. Snapshots should be used only for sudden changes or reconfiguration of a VM.
- Snapshots should be kept for as short a time as possible. They can grow very quickly (e.g. on high-transaction virtual machines) and fill the datastore. Committing snapshots also takes time and can impact the VM.
- Avoid nested snapshots.
- Please educate VM owners about the risks caused by snapshots - generally everybody knows only the benefits...
- Datastore sizing should also account for snapshot requirements (e.g. how many snapshots, expected snapshot size). I have seen a delta file reach 80% (if I remember correctly) of the original disk size - because the OS admin changed the guest OS...
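The monitoring suggestion from the list above can be sketched as a simple report over inventory data. This is only an illustration: the `Snapshot` and `Datastore` records, the example values and the 20% free-space threshold are my assumptions, and collecting the data from vCenter (for example with PowerCLI or pyVmomi) is left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Snapshot:
    vm_name: str
    name: str
    created: datetime


@dataclass
class Datastore:
    name: str
    capacity_gb: float
    free_gb: float


def stale_snapshots(snapshots, now, max_age=timedelta(hours=72)):
    """Return snapshots older than max_age, oldest first."""
    old = [s for s in snapshots if now - s.created > max_age]
    return sorted(old, key=lambda s: s.created)


def low_space_datastores(datastores, min_free_ratio=0.20):
    """Return datastores whose free space is below min_free_ratio of capacity."""
    return [
        d for d in datastores
        if d.capacity_gb > 0 and d.free_gb / d.capacity_gb < min_free_ratio
    ]


# Hypothetical inventory, fixed "now" so the example is reproducible.
now = datetime(2024, 1, 10, 12, 0)
snapshots = [
    Snapshot("dev-01", "before-patch", datetime(2024, 1, 10, 8, 0)),
    Snapshot("prod-db", "pre-upgrade", datetime(2023, 9, 1, 9, 0)),
]
datastores = [Datastore("ds-01", 1000, 450), Datastore("ds-02", 2000, 150)]

for s in stale_snapshots(snapshots, now):
    print(f"stale snapshot: {s.vm_name} / {s.name} ({(now - s.created).days} days old)")
for d in low_space_datastores(datastores):
    print(f"low free space: {d.name} ({d.free_gb / d.capacity_gb:.1%} free)")
```

Run on a schedule and mailed to the VM owners, a report like this also helps with the "educate VM owners" point, because the owners see their own forgotten snapshots.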
I have mentioned only my personal suggestions and VMware best practices. Remember that a best practice is not something you have to implement as-is. It means that you should implement it or adapt it to your environment.