The hardest part of life is figuring out where to put all your stuff. If it were that easy, things like attics and storage units wouldn’t exist. Instead, people are plagued with the problem of constantly looking for somewhere to store an object that doesn’t quite fit. Sometimes we try to get creative and wind up causing more issues than we started with.
This metaphor isn’t just an existential crisis for my camping gear closet. It’s a problem in enterprise server farms as well. Just because most of our workloads are now software running in the cloud or in an on-site server cluster doesn’t mean we have unlimited space. Cloud servers cost money. Servers in our farm aren’t always capable of holding the biggest possible workloads. And given the propensity of workloads to balloon out of control over time, it’s not always easy to find a place to put them months or years after they’ve been created.
We can create the biggest possible SAN volumes to store monster database VMs. We can optimize big iron boxes with massive CPU counts and enough memory to make an elephant jealous. But it all still comes crashing down when an external force causes our carefully laid plans to vanish into the same ether that our servers live in. No host has 100% uptime. No workload behaves optimally for its entire life. And accidents happen that force everything to move.
vMotion is the greatest thing to happen to virtual workloads in the history of virtualization. The ability to move a server while it’s still running is the “magic moment” when most people finally get the power of running everything in software. And yet, for all the greatness that vMotion represents, it’s still a huge Achilles’ heel for objects that live at the limits of our infrastructure. Those monster database VMs? Those could take an hour to migrate. How about that supermassive AI workload that needs all kinds of specialized CPU or GPU resources? What happens if the host goes down and we need to find a place to send it so it can stay running? How do we even know that we’ve got our workloads running on the most optimal servers in the first place?
During VMworld 2019, I got to see a great presentation from Virtana about the problem of vMotion and unintended consequences. Formerly known as Virtual Instruments, the company is working hard to provide analytics that help you understand the needs of your workloads and the capabilities of your infrastructure, then weave the two together to figure out how to keep your virtual servers put when they should stay and move them when they should go. Take a look at this great overview video from the event:
I love the examples that Tim Van Ash lays out in this video. We always think about vMotion solving reachability or stability problems. If you’re someone who uses VMware’s DRS solution, you probably think of vMotion as the underlay that enables performance-driven placement. But if you’ve ever heard the horror stories of what DRS can do when it’s left unchecked, you know why there are some very tight controls on what it should be allowed to do.
I’ve been told stories about organizations that treat DRS recommendations as suggestions and require a change control notification for every vMotion undertaken; otherwise, you may lose track of your VMs. Worse yet, you may end up moving your VM to a new location based on a simple recommendation, only to find out that something caused a problem along the way and made things fall apart. Imagine our monster DB VM is close to the redline on the host it’s on. DRS suggests moving it to a new location. We kick off the migration, and when the DB lands on the new hypervisor it executes a big query and spikes the CPU on the destination host. Now the old host looks like a better destination, so the VM gets queued to be sent right back.
No Guessing Games
That’s an actual example from the video. Except it took over half an hour for the VM to be migrated. And then it got sent back right away! These are the kinds of problems that simple algorithms can’t solve. You need real analytics to help you figure out the best destinations for your workloads. You need to be able to add hard statistics to your decision-making processes. Maybe Virtana knows that you can’t just ping-pong the VM between Los Angeles and New York, but if you clear a couple of lightweight VMs from the Miami cluster, you can send that big bad resource hog down there permanently and stop worrying about things.
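The oscillation in that story falls out of the math almost immediately. Here’s a minimal sketch (my own toy model, not Virtana’s or DRS’s actual algorithm, and all host names and load numbers are hypothetical) showing why a placement rule that looks only at instantaneous load ping-pongs a heavy VM forever, while adding even a crude hysteresis margin on the *projected* post-move load keeps it put:

```python
def naive_migrations(loads, vm_load, host, steps=4):
    """Pick a target by instantaneous load alone: the classic ping-pong.
    `loads` maps host name -> CPU load fraction (VM's load included on
    its current host)."""
    moves = []
    for _ in range(steps):
        target = min(loads, key=loads.get)   # least-loaded host right now
        if target != host:
            loads[host] -= vm_load           # the VM leaves its host...
            loads[target] += vm_load         # ...and spikes the new one
            moves.append((host, target))
            host = target
    return moves

def damped_migrations(loads, vm_load, host, margin=0.15, steps=4):
    """Only move when the projected post-move load on the destination
    beats staying put by a margin (hysteresis) -- one simple way to
    suppress oscillation."""
    moves = []
    for _ in range(steps):
        others = {h: loads[h] + vm_load for h in loads if h != host}
        target = min(others, key=others.get)
        if others[target] < loads[host] - margin:
            loads[host] -= vm_load
            loads[target] += vm_load
            moves.append((host, target))
            host = target
    return moves

# Host A runs at 0.9 (including our 0.5-load monster VM); host B sits at 0.5.
print(naive_migrations({"A": 0.9, "B": 0.5}, 0.5, "A"))
# -> [('A', 'B'), ('B', 'A'), ('A', 'B'), ('B', 'A')]  endless ping-pong
print(damped_migrations({"A": 0.9, "B": 0.5}, 0.5, "A"))
# -> []  moving would land B at 1.0, worse than staying, so the VM stays
```

Real analytics platforms obviously go far beyond a single margin parameter, modeling workload behavior over time rather than a single snapshot, but the sketch shows why snapshot-driven placement alone is doomed.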
Virtana has a lot more wisdom to share with their solution. The fact that they call it VirtualWisdom is no accident either. Humans are bad at math. Simple software algorithms are better at math but not good at thinking ahead. Like the sages of old, you need someone with the experience and the vision to say that this thing that looks like a good idea is in fact terrible and you shouldn’t do it. You also need a tool that can tell you to make changes when everything looks to be running smoothly because you’re actually on the edge of a problem. Wisdom comes from insight, and Virtana has all the insight you could want in an analytics platform.
Bringing It All Together
I will admit that I’m the kind of person who either likes to keep things static and unchanging or live in complete chaos. Either leave those VMs where they are until a failure happens, or turn DRS up to maximum and watch the machines fly all over the place. Thankfully, the middle ground comes in the form of sanity from companies like Virtana. They can help you plan your vMotions and smile with pride as the roulette wheel starts to look more like a choice between two reasonable options. Rather than letting it ride yourself, you should let Virtana stack the deck in your favor.
For more information on Virtana and their analytics offerings for both on-premises and cloud workloads, make sure to check out their website at http://Virtana.com. To see more videos from Virtana’s presentation at VMworld US 2019, head over to https://techfieldday.com/companies/virtana/