Bug? Well, I don’t want to jump to conclusions, but I recently saw some less-than-desirable behavior when putting a host into maintenance mode.
In this 3-node cluster I had recently created some larger storage VMs to replace a single VM. Both the old and new guests were running during the parallel upgrade, so my storage requirements doubled for a short period of time.
While this upgrade was happening I needed to do a driver update on the hosts. I planned to put one in maintenance mode near the end of the day and do the driver update the next day. Maintenance mode started without incident, and all VMs began vMotioning off of that host.
The next morning I got some complaints that one of the storage VMs was having issues. I checked for VMware error messages but didn’t find any. I could ping the VM, but I couldn’t log in or do anything else with it. I couldn’t even send a remote reboot command: shutdown -m \\storagevm /f /r /t 00
I was forced to power the VM off. When I tried to power it back on, VMware gave me this message: Cannot open the disk ‘/vmfs/volumes/xxxx/vm/vm.vmdk’ or one of the snapshot disks it depends on.
I was stumped. I first thought it might be some kind of pathing issue, but that didn’t make sense because paths are not defined on hosts individually. While the VM was off I was able to migrate it back to the original host and power it on. That worked, so I was back in business, but I still needed to update that driver. I then tried to manually vMotion this server to the same host and was presented with this slightly more descriptive error: The VM failed to resume on the destination during early power on. Cannot open the disk ‘/vmfs/volumes/xxxx/vm/vm.vmdk’ or one of the snapshot disks it depends on. 0 (Cannot allocate memory)
What happened next is why I use the b word in my title. The manual vMotion does fail, but the guest stays happy. It continues to run because instead of forcing itself onto the destination, it hops back to the source host. The previous evening, maintenance mode failed to foresee this problem and rammed my VM onto the host, making it unresponsive. No error or message was presented other than that maintenance mode completed successfully. My VMs running properly is far more important than my maintenance mode completing successfully.
My problem was a VMware storage limitation: a single host can only keep a limited amount of VMDK storage open at once. I do take responsibility for not knowing how close I was to one of these limitations. But, in my defense, I had just gotten back from VMworld, where they were bragging about 64TB datastores.
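The failure mode is easy to reason about once you think of it as simple arithmetic: before evacuating a host, compare the VMDK capacity the destination already has open plus what it would inherit against its per-host limit. Here is a minimal sketch in Python, with a hypothetical 8 TB cap and made-up disk sizes; the real figure for your ESXi version comes from the KB linked below.

```python
# Sketch: would a host exceed its open-VMDK capacity limit after
# absorbing the VMs evacuating from a host entering maintenance mode?
# The 8 TB cap is a placeholder, NOT an official number -- take the
# real limit for your ESXi version from VMware's KB.

OPEN_VMDK_CAP_GB = 8 * 1024  # hypothetical per-host limit, in GB


def can_host_absorb(resident_gb, incoming_gb, cap_gb=OPEN_VMDK_CAP_GB):
    """True if the host can open all resident plus incoming VMDKs."""
    return sum(resident_gb) + sum(incoming_gb) <= cap_gb


# Made-up example: a host already running 3 TB of open VMDKs is asked
# to take on 6 TB more during an evacuation.
resident = [1024, 1024, 1024]  # GB per open VMDK on the destination
incoming = [4096, 2048]        # GB per VMDK evacuating to it

print(can_host_absorb(resident, incoming))  # prints False: 9 TB > 8 TB cap
```

Maintenance mode performs no such check before placing VMs, which is exactly how my storage VM ended up on a host that couldn't open its disks.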
This KB was quite helpful and let me know that if I complete my ESXi upgrade I won’t be pushing the limits anymore: http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1004424
I also found these other two resources useful. This PowerCLI one-liner summarizes the virtual disk sizes on a datastore (substitute your own datastore name for the placeholder):
PowerCLI > Get-HardDisk -Datastore <YourDatastore> | select -ExpandProperty CapacityGB | measure -sum -ave -max -min
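For anyone without a PowerCLI session handy, the aggregation that one-liner performs is just sum, average, max, and min over each disk's CapacityGB value. A quick sketch in Python with made-up sizes shows the same idea:

```python
# Sketch of the aggregation the PowerCLI one-liner performs:
# given each VMDK's CapacityGB, report sum / average / max / min.
# The sizes below are made up for illustration.

capacities_gb = [100, 250, 2048, 512]  # hypothetical per-VMDK sizes

stats = {
    "sum": sum(capacities_gb),
    "average": sum(capacities_gb) / len(capacities_gb),
    "max": max(capacities_gb),
    "min": min(capacities_gb),
}
print(stats)
```

The sum is the number that matters here: it tells you how close a datastore's disks would put a host to its open-VMDK limit.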