Tom Howarth, VMware Communities Moderator and blogger at PlanetVM.net, posted this week about how a developer at a virtualization backup vendor alerted him to a scenario in which reverting to an ESX snapshot results in corrupted incremental backups when vSphere’s Changed Block Tracking (CBT) is used. Howarth’s post, “Major issue with Change Block Tracking”, recounts his conversation with the developer and their exploration of the problem. In summary, Howarth reported “there is a major issue with the way VMware handles the indexing of the ChangeID.”
Almost a week later, after a flurry of comments from most of the vendors that leverage CBT for virtual machine backups, VMware has published a KB article on the subject.
The KB article describes the exact scenario that causes the problem:
Four things need to occur in the following sequence before there is a possibility of this issue occurring:
- A VM with hardware version 7 needs to have a snapshot present AND has been backed up previously by a backup product leveraging CBT
- A backup product performs an incremental backup of VM and leverages CBT to determine changed blocks since last backup
- After incremental backup is complete, user manually reverts snapshot on the VM
- A backup product performs an incremental backup of VM and leverages CBT to determine changed blocks since last backup
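To make the ordering concrete, here is a minimal sketch that walks a per-VM event history and reports whether the next CBT-based incremental would fall into the sequence listed above. It assumes a hardware version 7 VM and that every backup in the history used CBT. The event labels and the function name are hypothetical and do not come from any VMware or backup-vendor API; the logic only mirrors the four conditions as the KB article lists them.

```python
# Hypothetical event labels for a single VM's history (not a real API).
SNAPSHOT_CREATE = "snapshot_create"
FULL_BACKUP = "full_backup"                # assumed to use CBT
INCREMENTAL_BACKUP = "incremental_backup"  # assumed to use CBT
SNAPSHOT_REVERT = "snapshot_revert"        # manual revert by a user


def next_incremental_at_risk(events):
    """Return True if the next CBT incremental would follow the KB sequence.

    The sequence is: snapshot present and VM already backed up with CBT,
    then an incremental backup, then a manual snapshot revert.
    """
    snapshot_present = False
    has_cbt_backup = False
    incremental_done = False
    revert_pending = False

    for event in events:
        if event == SNAPSHOT_CREATE:
            snapshot_present = True
        elif event == FULL_BACKUP:
            # A full backup is the workaround, so it clears any pending risk.
            has_cbt_backup = True
            incremental_done = False
            revert_pending = False
        elif event == INCREMENTAL_BACKUP:
            # Only counts once a snapshot exists and a prior backup was taken.
            if snapshot_present and has_cbt_backup:
                incremental_done = True
        elif event == SNAPSHOT_REVERT:
            # The revert matters only if it comes after an incremental.
            if incremental_done:
                revert_pending = True
        else:
            raise ValueError(f"unknown event: {event!r}")

    return revert_pending


if __name__ == "__main__":
    history = [SNAPSHOT_CREATE, FULL_BACKUP, INCREMENTAL_BACKUP, SNAPSHOT_REVERT]
    print(next_incremental_at_risk(history))
    print(next_incremental_at_risk(history + [FULL_BACKUP]))
```

A backup script could run a check like this before each job and switch to a full backup whenever it returns True.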
This issue is arguably caused by an unusual set of circumstances, but it is important for VMware administrators to be aware of nonetheless. I’ve blogged about (other bloggers blogging about) ESX snapshots being like a loaded gun before, and here is another example of why.
VMware offers the following resolution for now:
“The workaround for this issue is to do a full VM backup after a snapshot revert operation. If the backup application does not allow this as an option, you will need to remove the CTK files for that VM. The CTK files mentioned are stored with the virtual machine on the datastore and can be removed via the Datastore Browser. This delete operation can be safely done while the VM is running.”
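If you go the CTK removal route, it helps to see which files are involved before touching anything. The sketch below only lists candidate files and deletes nothing. It rests on two assumptions that the KB excerpt does not state: that the VM's datastore folder is reachable as an ordinary filesystem path (for example on an NFS mount or from the service console), and that the tracking files end in `-ctk.vmdk`. The function name is hypothetical.

```python
from pathlib import Path


def find_ctk_files(vm_dir):
    """List CBT tracking files in a VM folder without modifying anything.

    Assumes tracking files are named '<disk>-ctk.vmdk'; verify this against
    what the Datastore Browser shows for your VM before relying on it.
    """
    folder = Path(vm_dir)
    return sorted(
        entry
        for entry in folder.iterdir()
        if entry.is_file() and entry.name.lower().endswith("-ctk.vmdk")
    )


if __name__ == "__main__":
    import sys

    # Usage: python find_ctk.py /path/to/vm/folder
    for ctk in find_ctk_files(sys.argv[1]):
        print(ctk)
```

Reviewing the printed list against the Datastore Browser before removing anything keeps the step a deliberate, manual one.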
Disclaimer: I work for Veeam Software, the creators of Veeam Backup and Replication.
Veeam Software has confirmed that Backup and Replication v4.1 successfully handles this issue without corruption in all but one specific scenario of manually caused events as described in the Veeam Forums here: http://www.veeam.com/forums/viewtopic.php?f=2&t=3699&p=15139#p15139.
A hotfix has been created for this one remaining scenario and is in “testing for validation” as of this writing.
In the same linked thread, Veeam also recommends that, until the patch is available, users work around this last scenario by “disabling the use of changed block tracking in the Advanced job settings for all jobs which process VMs where manual snapshot reversal may happen, and triggering Full Backup on these jobs to heal the backup file (in case you believe you may have this scenario happened before for some VMs).”