vMotion, an online operation?

There are two types of vMotion: storage and regular. Storage vMotion moves a VM’s files, or a single .vmdk file, to another datastore. Regular vMotion copies the VM’s memory from one host to another and then stuns the VM, pausing processing so the new host can open the files and take ownership of the VM. Today I’ll be referring mostly to regular vMotion.

These are both fantastic technologies that allow for rolling upgrades of all kinds and also the ability to load balance workloads based on usage. The Distributed Resource Scheduler (DRS) runs every 5 minutes by default to do this load balancing. Datastore clusters can be automated to balance VMs across datastores for space and usage reasons. Like I said, these technologies are fantastic but need to be used responsibly.

“VMware vSphere® live migration allows you to move an entire running virtual machine from one physical server to another, without downtime”

That last little bit is up for debate. It depends on what your definition of downtime is. This interesting historical read shows that vMotion was the next logical step after a pause, move and start operation was worked out. Even though VMware is now transferring the state over the network and things are much more live, we still have to pause. The virtual machine’s memory is copied to a new host, which takes time. The deltas are then copied over repeatedly until only a very small amount of changed memory is left, and the VM is stunned. No CPU cycles are processed while that last bit of memory is copied over, the source host closes the VM’s files and the new host opens them, at which point the CPU comes back alive. Depending on what else is going on, this can take seconds. Yes, that is plural: seconds of an unresponsive virtual machine.
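
To make the copy-then-stun loop above concrete, here is a toy Python model. It is a sketch under my own assumptions, not VMware’s actual algorithm: the function name, the fixed dirty rate, the 50 MB stun threshold and the pass limit are all made up for illustration.

```python
def simulate_precopy(memory_mb, dirty_mb_per_s, link_mb_per_s,
                     stun_threshold_mb=50.0, max_passes=30):
    """Toy model of iterative pre-copy (hypothetical, not VMware's code).

    Returns (passes, live_copy_seconds, stun_seconds).
    """
    remaining = float(memory_mb)  # the first pass copies all of memory
    live_seconds = 0.0
    passes = 0
    while remaining > stun_threshold_mb and passes < max_passes:
        pass_seconds = remaining / link_mb_per_s
        live_seconds += pass_seconds
        passes += 1
        # Memory dirtied while this pass was in flight becomes the next
        # delta, capped at the total memory size.
        remaining = min(float(memory_mb), dirty_mb_per_s * pass_seconds)
    # The VM is stunned while the final delta crosses the wire.
    stun_seconds = remaining / link_mb_per_s
    return passes, live_seconds, stun_seconds


if __name__ == "__main__":
    # 16 GB VM dirtying 100 MB/s, copied at roughly 1000 MB/s.
    print(simulate_precopy(16384, 100, 1000))
```

With those example numbers the model converges in three passes and the final delta takes about 0.016 seconds to send. Shrink the link or raise the dirty rate and the final delta, and therefore the stun, grows.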

What does that mean? Usually in my environment, a dropped ping, or maybe not even a dropped ping but a couple of slow pings in the 300ms range. This is all normally fine because TCP is designed to re-transmit packets that don’t make it through. Connections generally stay connected in my environment. However, I have had a couple of strange occurrences in certain applications that have led to problems and downtime. Downtime during vMotion is rare and inconsistent. Some applications don’t appreciate delays during some operations and throw a temper tantrum when they don’t get their CPU cycles. I am on the side of vMotion and strongly believe these applications need to increase their tolerance levels, but I am in a position where I can’t always do that.
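
If you keep a ping log against a VM, a few lines of Python can pull out the dropped and slow replies described above. This is a minimal sketch: the `find_blips` name and the `(timestamp, rtt_ms)` sample format are my own assumptions, and the 300ms default simply mirrors the number I mentioned.

```python
def find_blips(samples, slow_ms=300.0):
    """Return (timestamp, kind) for every dropped or slow ping.

    samples: iterable of (timestamp_seconds, rtt_ms) pairs, where
    rtt_ms is None when no reply came back (hypothetical log format).
    """
    blips = []
    for timestamp, rtt_ms in samples:
        if rtt_ms is None:
            blips.append((timestamp, "dropped"))
        elif rtt_ms >= slow_ms:
            blips.append((timestamp, "slow"))
    return blips


if __name__ == "__main__":
    log = [(0, 1.2), (1, None), (2, 310.0), (3, 0.9)]
    print(find_blips(log))
```

The timestamps it returns are what you would line up against the vMotion task times.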

The other cause of vMotion problems is usually related to overcommitted or poorly configured resources. vMotion is a stellar example of super efficient network usage. I’m not sure what magic sauce they have poured into it, but the process can fully utilize a 10Gb connection to copy that memory. Because of that, vMotion should definitely be on its own VLAN and physical set of NICs. If it is not, the network bandwidth could be too narrow to complete the vMotion process smoothly, and that last little bit of memory could take longer than normal to copy over, which lengthens the stun. Very active memory can also cause the last delta to take longer.

Hardware vendors advertise their “east-west” traffic to promote efficiencies they have discovered inside blade chassis. There isn’t much reason for a vMotion from one blade to another blade in the same chassis to leave the chassis switch. This can help reduce problems with vMotions and reduce the traffic on core switches.

In the vSphere client, vMotions are recorded under tasks and events. When troubleshooting a network “blip”, the completed time of this task is the important part. Never have I seen an issue during the first 99% of a vMotion. If I want to troubleshoot broader issues, I use some T-SQL and touch the database inappropriately. PowerShell and PowerCLI should be used in lieu of database calls for several reasons, but a query is definitely the most responsive of the bunch. This query will list VMs by their DRS vMotion frequency since mid-August.

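-- Count DRS-initiated vMotions per VM, most-moved first.
-- VPX_EVENT is assumed here to be the vCenter events table; adjust for your schema.
SELECT MAX([VM_NAME]) as 'VM',
count(*) as 'Number of vmotions'
FROM VPX_EVENT
WHERE EVENT_TYPE = 'vm.event.DrsVmMigratedEvent' and
CREATE_TIME > '2014-8-14'
GROUP BY vm_name
ORDER BY count(*) DESC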

This query can reveal some interesting problems. DRS kicks in every 5 minutes and decides if VMs need to be relocated or not. I have clusters that have DRS on but never need to vMotion a VM for load reasons, and I have clusters that are incredibly tight on resources and vMotion VMs all the time. One thing I have noticed is that VMs that end up at the top of this query can sometimes be in a state of disarray. A hung thread or process that is using CPU can cause DRS to search every 5 minutes for a new host for the VM. Given the stun, this isn’t usually a good thing.
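
Once the events are exported, picking out the frequent movers is a small post-processing step. Here is a minimal sketch in Python. The `frequent_movers` name, the `(vm_name, event_type)` row format and the threshold of 10 are assumptions of mine; only the event type string comes from the query.

```python
from collections import Counter

DRS_MIGRATED = "vm.event.DrsVmMigratedEvent"


def frequent_movers(events, min_migrations=10):
    """Return [(vm_name, count)] for VMs DRS moved at least min_migrations
    times, most-moved first.

    events: iterable of (vm_name, event_type) rows (hypothetical export).
    """
    counts = Counter(vm for vm, event_type in events
                     if event_type == DRS_MIGRATED)
    return [(vm, n) for vm, n in counts.most_common() if n >= min_migrations]


if __name__ == "__main__":
    rows = [("sql01", DRS_MIGRATED)] * 12 + [("web01", DRS_MIGRATED)] * 3
    print(frequent_movers(rows))
```

Anything this returns is a candidate for a closer look at what is burning CPU inside the guest.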

IMHO, a responsible VM admin is willing to contact VM owners when their VMs are hitting the top of the vMotions list. “Don’t be a silent DBA.” That is some advice I received earlier on in my career. Maintenance and other DBA-type actions can be “online” but in actuality cause slowdowns in the system that other support teams may never find the cause of. The same advice can be applied to VMware admins as well.

Posted on September 16, 2014 in Virtual