vMotion: an elephant in the Data Center room

A while ago I had a chat with a fellow CCIE (working in a large enterprise network with reasonably-sized Data Center) and briefly described vMotion to him. His response: “Interesting, I didn’t know that.” … and “Ouch” a few seconds later as he realized what vMotion means from bandwidth consumption and routing perspectives. Before going into the painful details, let’s cover the basics.

The basics

vMotion is VMware’s technology that allows migration of live virtual machines from one physical server to another without losing VM state, its data or its LAN or SAN sessions. It’s not a new technology (the Usenix ’05 article contains lots of useful technical details) and even the competing products (Hyper-V) got similar functionality in recent releases.

vMotion by itself is useful, but additional functionality built on top of it makes it indispensable: with DRS, VM load can be spread dynamically across a cluster of physical servers and with DPM, all VMs running on a cluster are consolidated on a small number of servers in the off-hours and the unused servers shut down.

Bandwidth implications

Obviously, vMotion needs to transfer the whole VM memory image before the VM state can be migrated to the second server; expect a few gigabytes of data traversing your network. The data transfer usually doesn’t matter if the physical servers are close together, but if the VM image is transferred across the Data Center backbone or even across a WAN link, the load can become significant.

Depending on the server uplink utilization and the QoS requirements of your traffic you might want to consider the impact of vMotion even when the servers are close together. For example, if you use a fully loaded UCS chassis with only a few uplinks, you might want to configure QoS on the uplinks to ensure vMotion does not interfere with your regular traffic (hint: it’s easy to configure vMotion to run in a dedicated VLAN).

Routing implications

Obviously the moved VM has to retain its IP address if you want to retain its TCP/IP sessions (and it makes no sense to move a running machine if it loses all the sessions). You can easily guess what the “traditional” solution is: bridging between the source and the destination physical servers. When a VM is moved from one server to another but remains in the same VLAN, all it takes is a single packet sent from the VM to update the MAC address tables on all intervening switches.

The implications of the same-VLAN requirement are manifold:

You need a trunking link connecting the physical server with a switch.
All servers have to have the same set of VLANs.
All intermediate switches have to participate in all those VLANs.
The server and switch configurations have to be synchronized (don’t forget, every VMware server has an embedded switch).

Net result: if you have a security-conscious environment where different applications reside in different layer-3 segments, you’ll end with a veritable VLAN sprawl.

Traffic flow implication

Every VM is connected to a number of network-layer and application-layer peers (routers, firewalls, database servers …). As you move a VM, those connections are usually not moved and might significantly increase the amount of traffic flowing across your Data Center core.

For example, server VLANs commonly use a single default gateway established with a first-hop redundancy protocol (FHRP) like ~~NHRP~~HSRP or GLBP.

After the VM is moved, it still uses the same default gateway, causing its outbound traffic to go all the way back to the original switch.

Notes:

Traffic flow is optimal if the two switches use VSS or vPC.
Very smart layer-2 filters can create two simultaneous FHRP groups on the same VLAN. This is an ideal solution if you’re looking for job security and late-night troubleshooting sessions.

Need more information?

If you’d like to know more about vMotion, there are literally tons of good articles on the Internet. Unfortunately, most of them are written from the server admin’s perspective.

For an overview of vMotion, related technologies and their impact on the network register for my Data Center 3.0 for Networking Engineers.

vMotion: an elephant in the Data Center room

The basics

Bandwidth implications

Routing implications

Traffic flow implication

Need more information?

Related Posts

Pure Storage Takes Aim at Disk Storage | Gestalt IT Rundown: June 14, 2023

Building an AI Training Data Pipeline with VAST Data | Utilizing Tech 07x02

Operationalizing AI with VAST Data | Utilizing Tech 06x08

The Bedrock of AI is Data with Nick Magnuson and Clive Bearman of Qlik | Utilizing Tech 06x01

About the author

Ivan Pepelnjak

Leave a Comment X

The basics

Bandwidth implications

Routing implications

Traffic flow implication

Need more information?

Related Posts

Pure Storage Takes Aim at Disk Storage | Gestalt IT Rundown: June 14, 2023

Building an AI Training Data Pipeline with VAST Data | Utilizing Tech 07x02

Operationalizing AI with VAST Data | Utilizing Tech 06x08

The Bedrock of AI is Data with Nick Magnuson and Clive Bearman of Qlik | Utilizing Tech 06x01

About the author

Ivan Pepelnjak

Leave a Comment X

Private video