I recently ran into some strange behavior on one of my virtual hosts. I first noticed that the VMware console was nearly unusable because it was so slow: it looked like an overly compressed JPEG, and mouse clicks took forever on every VM on that host. RDP worked fine and the VMs themselves didn’t seem to have any problems. VMs on identical hosts in the same cluster were fine. I attempted to vMotion a VM off of that host and it didn’t seem to be progressing. The VMs were happy, so it wasn’t an urgent issue. The next day I came in and was presented with this error:
“The migration was cancelled because the amount of changing memory for the VM was greater than the available network bandwidth”
I was stuck in a sticky situation: I had a problem with a host, but I couldn’t move any VMs off of it because vMotion wasn’t working. Since I am new at VMware administration, I ran the problem by a couple of co-workers and then opened a ticket with the vendor.
Lead 1: Busy network
When you connect to a VM console, that traffic is handled by the host’s VMkernel management network. vMotion traffic is also handled by the VMkernel, but it can be separated from the other traffic. Both of these things were suffering, so it seemed pretty obvious where my problem was. I suspected a network problem but wasn’t really sure how to troubleshoot management network traffic.
This KB was relevant but not helpful: 23 ratings, averaging one star.
The article does some finger-pointing at a broader network problem with general contention. But my console was never fast, and the vMotions never finished. Support recommended segregating vMotion traffic from the other VMkernel traffic into separate VLANs and port groups, basically trying to solve a generally busy network. With only two connections, I could keep both redundancy and segregation by setting one vmnic to active and one to standby on the management port group, and doing the opposite on the vMotion port group. That would give me the minimum requirement for vMotion: a dedicated 1 Gbit connection.
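For anyone who wants to script that split rather than click through the vSphere client, it can be sketched with esxcli on a standard vSwitch. The port-group and vmnic names here are assumptions; substitute your own.

```shell
# Sketch only, assuming a standard vSwitch with port groups named
# "Management Network" and "vMotion", uplinked by vmnic0 and vmnic1.

# Management traffic: vmnic0 active, vmnic1 standby
esxcli network vswitch standard portgroup policy failover set \
    --portgroup-name "Management Network" \
    --active-uplinks vmnic0 --standby-uplinks vmnic1

# vMotion: the mirror image, so each traffic type normally
# gets its own dedicated 1Gb NIC
esxcli network vswitch standard portgroup policy failover set \
    --portgroup-name "vMotion" \
    --active-uplinks vmnic1 --standby-uplinks vmnic0
```

This keeps redundancy: if either NIC dies, the standby takes over and the two traffic types share a link until it’s fixed.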
Lead 2: Bad something or other
The performance was consistently slow, which doesn’t fit a contention problem. So the next suggestion was a bad driver, firmware, or cable. I tried swapping management cables with another host; the problem stayed with the host, not the cables.
So the last option support kept coming back to was a bad driver or firmware. They suggested updating, but that caused me a problem: the host would need a reboot for the update to take effect, and I couldn’t reboot because I couldn’t get the stinking VMs off this host. Also, why the heck do I have six identical hosts and only one is having this problem?
Important discovery: Direction matters
I started examining the performance with esxtop. I noticed that while these vMotions were running, throughput stayed at a consistent 1 Mbit/s. That’s about 1,000 times slower than it should be; vMotion can saturate a 1 Gbit/s connection. I also found a new and quite useful tool to graph this information: Visual esxtop is like perfmon for ESX.
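If you’d rather capture those counters than watch the live screen, esxtop can also run in batch mode on the host and dump everything to CSV (the interval, count, and filename here are just examples):

```shell
# Sample all esxtop counters every 5 seconds, 12 times (about a minute),
# writing CSV to a file. The network columns include per-vmnic
# Mb TX/s and Mb RX/s, which is where the 1 Mbit/s showed up.
esxtop -b -d 5 -n 12 > esxtop-net.csv
```

The batch output is in a perfmon-compatible format, so it can be loaded into Windows perfmon or Visual esxtop for graphing.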
I was able to look at each port individually. I began to understand that even though the two NICs were “active/active” in the port group, a single vMotion would only use a single NIC. Out of the blue, against my better judgement, I decided to vMotion a VM onto this box; previously I had been focused on getting VMs off of it. To my surprise, the process worked just fine. That was because the inbound migration hit the second NIC and not the first one.
Resolution: Turn it off and on again
To resolve my issue, I went into the vSwitch properties and edited my management port group, making the bad NIC standby. I was then able to move all the VMs off this host and reboot it. Afterward, I added the “bad” NIC back into the port group as active and watched a couple of vMotions succeed at 950 Mbit/s of throughput on that same “bad” NIC.
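The same demote-to-standby trick can be done from the command line. Again, the port-group and vmnic names are assumptions for illustration:

```shell
# Sketch, assuming the management port group sits on a standard vSwitch
# with vmnic0 (the misbehaving NIC) and vmnic1 (the healthy one).

# Demote the bad NIC to standby so vMotion uses the healthy uplink:
esxcli network vswitch standard portgroup policy failover set \
    --portgroup-name "Management Network" \
    --active-uplinks vmnic1 --standby-uplinks vmnic0

# ...evacuate the VMs and reboot, then restore active/active:
esxcli network vswitch standard portgroup policy failover set \
    --portgroup-name "Management Network" \
    --active-uplinks vmnic0,vmnic1
```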
So the root cause was a single management NIC (1 of 12) acting a bit goofy and requiring a reboot. Not exactly a perfect resolution, but at least I didn’t have to cause any downtime for my users.