Over the past few years I’ve been asked more than once to troubleshoot and explain why cloning a virtual machine (VM) from a master template takes longer than expected. Usually when I’m asked, the virtualization admin is frustrated with the hypervisor. “This shouldn’t take this long. It needs to be fixed!” they say. “I definitely agree,” I say, “but let’s take a deeper look at what is happening here before we flame the vendor’s help desk technician on the phone.”
So, this post is about taking a deeper look at where the master template VM resides versus where the cloned VM is destined. My math may be a little off or may not account for every factor involved, but my point is to be close enough to demonstrate that the disk/array/LUN design can be the culprit more often than not.
When I started this post I emailed for some help, asking for a sanity check from some storage experts. I had been reasonably happy with my own answer, but I figured I’d do some research before adding the content to VM /ETC. I got back a single reply that I am paraphrasing: “Sounds about right. Let me think about it some more, and if I can stump you with anything else I’ll let you know.” He never did, so I’ll take that as positive confirmation meaning “yes VM moron, it is that simple.” Good enough for me! If anyone can point out any other factors I am not properly accounting for, please leave a comment.
The following is part of my email for help. It not only explains my test scenario but also illustrates the problem and resolution. At the end of this post I make some suggestions for improving the time it takes to clone a VM.
The email for help
Oh wise and all powerful masters of the disk,
I humbly submit the following concept for your review. Guide me to a greater disk performance understanding when cloning a VM in a VMware ESX environment.
Here’s the scenario:
- Cloning a VM takes a long time — 10 GB VM using only 3.5 GB of space takes roughly 45 min to an hour to clone.
- The master template and the clone reside on the same disk and NFS mount.
- Yeah, it’s a single SATA disk in a lab. I know, it should suck.
I’m trying to explain the expected speed of reads and writes using the IOPS calculator here: http://www.wmarow.com/strcalc/
See the attached screen shot for the values I put in the calculator, but the results I’m interested in are:
- With 50% reads and 50% writes (master and clone on the same disk), the average throughput is 1.2 MB/s.
- I used 50% reads and 50% writes for the cache settings as well.
Which means to me that:
- 3548 MB / 1.2 MB/s = 2957 seconds, or 0.82 hours.
Here’s the screen shot of the IOPS Calculator I linked in the email for help:
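If you’d rather skip the calculator and just sanity-check the division, here’s a minimal Python sketch of the same arithmetic. The 1.2 MB/s throughput figure is taken straight from the calculator output in the screen shot; everything else is just unit conversion.

```python
# Back-of-the-envelope clone-time estimate from the email above.
# 3548 MB is the space actually used by the 10 GB template; 1.2 MB/s is the
# average throughput the wmarow calculator reported for a single SATA disk
# with a 50% read / 50% write mix.

def clone_time(used_mb: float, throughput_mb_s: float) -> tuple[float, float]:
    """Return the estimated clone time in seconds and in hours."""
    seconds = used_mb / throughput_mb_s
    return seconds, seconds / 3600.0

secs, hrs = clone_time(used_mb=3548, throughput_mb_s=1.2)
print(f"~{secs:.0f} seconds, or ~{hrs:.2f} hours")  # roughly 2957 s, 0.82 hrs
```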
Suggestions for improvement
Obviously, the type and performance of the disks, the number of disks, and the type of array make a huge difference. I should also point out that I am using 8 ms as the value for the seek latency. I’m not as focused on technical accuracy because my point is served without it, but changing this value makes a significant difference as well; the sketch after the list below shows how much that one number swings the per-disk IOPS estimate. If you want technical accuracy and more explanation about some of the numbers to use in the calculator, check out these posts on the topic of IOPS and the impact on a virtual environment:
- http://www.yellow-bricks.com/2009/12/23/iops/
- http://vmtoday.com/2009/12/storage-basics-part-ii-iops/
- http://virtualgeek.typepad.com/virtual_geek/2010/02/solving-a-weird-slow-performance-cloning-issue.html
- http://vpivot.com/2009/09/18/storage-is-the-problem/
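To see how much that one seek-latency number matters, here’s a rough sketch of the classic per-disk estimate those posts walk through: IOPS is roughly the inverse of the average seek time plus the average rotational latency. The 7200 RPM spindle speed and the seek times below are my assumptions for a generic SATA disk, not values pulled from the calculator.

```python
# Rough per-disk IOPS estimate:
# IOPS ~= 1 / (avg seek time + avg rotational latency).
# 7200 RPM and the seek times are assumed values for a generic SATA disk.

def rotational_latency_ms(rpm: float) -> float:
    """Average rotational latency is half a revolution, in milliseconds."""
    return 0.5 * 60_000.0 / rpm

def disk_iops(seek_ms: float, rpm: float = 7200) -> float:
    return 1000.0 / (seek_ms + rotational_latency_ms(rpm))

for seek_ms in (4.0, 8.0, 12.0):
    print(f"seek {seek_ms:>4} ms -> ~{disk_iops(seek_ms):.0f} IOPS per disk")
```

Going from 8 ms down to 4 ms of seek latency takes the estimate from roughly 80 to roughly 120 IOPS per disk, which is why the value you plug into the calculator matters so much.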
In my case, moving the VM template to another disk/array or increasing the number of disks used on my NFS server would help: the reads and writes would be separated when the cloned VM resides on a different disk/array, and more disks means more IOPS available. Yes, this post uses a single SATA disk as a simple example, but the point is hopefully clear. The same logic and math apply to shared storage scenarios, all storage protocols, any vendor’s storage device, and all RAID types. Plug your own values into the IOPS calculator to calculate your own results.
My ultimate point is to make everyone think about how the disk/array/LUN design decisions impact the behavior of the virtual infrastructure.
As an example, if my lab NFS server were using 6 SATA disks configured as a RAID 5 array, the calculation for the expected time to clone changes as follows:
- 3548 MB / 2.99 MB/s = 1187 seconds, or 0.33 hours.
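For anyone who wants to poke at this outside the calculator, here’s a sketch of the whole chain: per-disk IOPS, aggregate usable IOPS with a RAID 5 write penalty, throughput at an assumed IO size, and finally clone time. The IO size, write penalty, spindle speed, and seek latency are all my assumptions, so it won’t match the calculator’s 2.99 MB/s exactly, but it lands in the same ballpark as the 0.82 and 0.33 hour figures.

```python
# Clone-time estimate as a function of spindle count and RAID write penalty.
# Assumed inputs: 7200 RPM SATA disks, 8 ms seek, 16 KB IOs, 50/50 read/write
# mix, RAID 5 write penalty of 4. The real calculator uses its own internals,
# so expect ballpark agreement only.

def disk_iops(seek_ms: float = 8.0, rpm: float = 7200) -> float:
    return 1000.0 / (seek_ms + 0.5 * 60_000.0 / rpm)

def usable_iops(disks: int, read_fraction: float, write_penalty: float) -> float:
    raw = disks * disk_iops()
    # Each front-end write costs `write_penalty` back-end IOs (4 for RAID 5).
    return raw / (read_fraction + (1 - read_fraction) * write_penalty)

def clone_hours(used_mb: float, iops: float, io_kb: float = 16.0) -> float:
    throughput_mb_s = iops * io_kb / 1024.0
    return used_mb / throughput_mb_s / 3600.0

print(f"1 SATA disk      : ~{clone_hours(3548, usable_iops(1, 0.5, 1)):.2f} hours")
print(f"6 disks in RAID 5: ~{clone_hours(3548, usable_iops(6, 0.5, 4)):.2f} hours")
```

Swap in your own spindle count, write penalty, and IO size to see how your own design decisions move the clone time.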
Better, right? Hey, it’s a basement lab. It’s supposed to suck!