
Category Archives: Virtual

VMware 6.5 LACP Configuration

I just got back from Cisco Live in Las Vegas and brought a nasty cold home with me. After the extended time away from work, I came back to a sizable pile of “priorities”. One of the more interesting/challenging items I was working on was getting LACP going. We had some new rackmounts that were partially configured and had ESXi installed. They had two 10Gb ports which were handling all the traffic on standard virtual switches. The networking guy wanted to turn on LACP, which is a best practice, but we couldn’t get it going at first for various reasons. It is one of those settings that has to be either on or off, and the host side and the switch side have to match. Now that the project is nearing some deadlines, we decided to give it another go.

There are a couple of key reasons you might want to set up Link Aggregation Control Protocol on uplink ports:
1. Faster failover in the event of a switch or port going offline
2. Higher bandwidth for a single logical uplink

VMware does a pretty nice job of handling failover without this turned on. VMs run on a single connection and jump over to the other connection if there are issues. That might be reason enough not to go through the extra hassle of setting up LACP. Another reason is that you might get yourself into a chicken-and-egg catch-22 scenario. If your VMkernel management ports have to run LACP, and your vCenter runs on a host with only LACP uplinks available, you might have a hard time configuring your virtual distributed switch. You might be able to log into a remote console of the host and revert management network changes in 6.5, but I have not tested this. For this reason, I recommend using different uplinks (perhaps a pair of onboard 1Gb ports) for your ESXi management network.

So step one is to get your host online with a VMware standard switch. Then you can deploy the vCenter 6.5 appliance to this host; you will need vCenter to configure LACP. I would also recommend keeping vCenter and ESXi management traffic on the standard switch. This can be done by editing the port group on the vCenter appliance VM.

On the physical switch side, you must set up a vPC (virtual port channel). This is done by configuring a port channel on each switch port, then a virtual port channel that pairs the two ports.
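Since vPC is a Cisco Nexus feature, the switch side might look roughly like the sketch below. This is just an illustration: the interface and channel-group numbers are examples, and it assumes the vPC domain and peer link are already configured by your network team.

! Member port on each Nexus switch (one per ESXi 10Gb NIC); "mode active" enables LACP
interface Ethernet1/10
  switchport mode trunk
  channel-group 100 mode active
!
! Logical port channel, paired across the two switches as a vPC
interface port-channel100
  switchport mode trunk
  vpc 100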

Then in vCenter, you create a distributed virtual switch. Under the configuration tab there are LACP settings. First create a Link Aggregation Group. You will want to set this to active so the NIC will negotiate with the physical switch to aggregate the links. Create one LAG with the number of ports you will have for VM traffic in your entire cluster. This is one step that confused me. The documentation says to create one LAG per port channel ( https://docs.vmware.com/en/VMware-vSphere/6.0/com.vmware.vsphere.networking.doc/GUID-34A96848-5930-4417-9BEB-CEF487C6F8B6.html ); however, VMware handles creating a LAG instance for each host, and you only need to create the overall LAG on the distributed switch. So basically I got one host set up pretty easily, but when I went to set up my second host, I couldn’t add a second LAG to the uplink options because two LAGs are not supported.

Once you create the LAG, you can now add hosts to the distributed switch and assign the physical NICs as uplinks with the LAG selected.
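If you prefer scripting that step, a rough PowerCLI sketch looks like this. The switch, host, and vmnic names are examples, and depending on your PowerCLI version you may still need the web client to place the NICs onto the LAG's uplink ports rather than the standalone uplinks.

# Add a host to the distributed switch and migrate its 10Gb NICs to it
$vds    = Get-VDSwitch -Name 'dvSwitch01'
$vmhost = Get-VMHost -Name 'esx01.example.com'
Add-VDSwitchVMHost -VDSwitch $vds -VMHost $vmhost
$nics = Get-VMHostNetworkAdapter -VMHost $vmhost -Physical -Name vmnic2, vmnic3
Add-VDSwitchPhysicalNetworkAdapter -DistributedSwitch $vds -VMHostPhysicalNic $nics -Confirm:$false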

Lastly, create port groups for each VLAN. Then you can assign the LAG as the active uplink under teaming and failover for each port group.
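A hedged PowerCLI sketch of that last step follows. The port group name, VLAN ID, and the 'lag1' uplink name are examples; the idea is to make the LAG the only active uplink and leave the standalone uplinks unused so all traffic rides the port channel.

# Create a port group per VLAN and make the LAG the active uplink
$vds = Get-VDSwitch -Name 'dvSwitch01'
New-VDPortgroup -VDSwitch $vds -Name 'VM-VLAN100' -VlanId 100 |
    Get-VDUplinkTeamingPolicy |
    Set-VDUplinkTeamingPolicy -ActiveUplinkPort 'lag1'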


Posted on July 16, 2017 in Network Admin, Virtual

 

VMworld 2016 Keynote

I’ve been lucky enough to attend VMworld once, in 2013, but didn’t get to go this year. I decided to take advantage of the free live stream of the keynote on Monday.

Pat Gelsinger took the stage with 23,000 of “his closest friends”, a.k.a. VMworld attendees, and had a few interesting things to say. First, he no longer makes a distinction between digital and non-digital businesses. Truth is, I cannot think of a business that will grow into the next decade without embracing technology. They may not need VMware and the cloud, but with the new generation seeking products and services exclusively online, every business will need a way to engage these new customers.

After that he shared some interesting data points:

– In 2006, only 2% of the cloud was considered public, with most of that being salesforce.com
– In 2016, about 80% is still traditional IT, meaning automated service deployment or future service-based IT is still only 20%
– 2016 is the year that 50% of hardware is in some form of public cloud, shifting rapidly away from private datacenters
– In 2016, 70% of workloads run in traditional IT; by 2021 that number will be down to 50%

Servers are becoming more powerful, and consolidation is improving. Pat doesn’t see this as a shrinking market but actually generating growth by making the cost of servers more accessible to the masses. He compared this to processor enhancements of the 90s and I agree with this assessment.

Enterprises are finding cases where the public clouds make a lot of sense but there will (always?) be lots of cases where the existing owned datacenter will have advantages and certain services will be provided in-house.

The most interesting announcement from the keynote was VMware’s cross-cloud platform. This was a higher-level GUI for managing services on all the big public clouds and on private clouds. There was a demo where you could enter the login information for AWS and this platform would discover the applications hosted under this account and present them in a uniform fashion with other cloud services.

This service is interesting, but it appears to only work with IBM’s cloud, which advertised that “500” customers are on it. That doesn’t seem like a lot of market penetration to me. If this cross-cloud platform doesn’t work with all the cloud vendors, it won’t make it very far in my opinion.

Another buzzword I picked up on was Automated Lifecycle Manager. This sounds like a product that has reached its end of life (https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2015661). There is an obvious need for this as the demand for services increases with the decreasing cost of cloud services. We have to automate the full lifecycle to prevent sprawl and complexity. I wonder if VMware is coming out with something new on this front and whether that is why it was mentioned.

The guest speakers from IBM, Marriott and Michael Dell all seemed to fall flat to me. None had any really innovative or inspiring comments. I’m not sure who the speaker was, but towards the end they started talking about service architectures. One comment I did like was choosing multiple cloud providers for a single service in case one cloud goes down. This has happened before, and even though it could cause a lot of complexity, it could be something we would want to architect around.

When I attend conferences like this, I usually try to find something more valuable to do with my time during the general sessions. In conclusion, the session wasn’t that bad. My view wasn’t as negative as this post (http://diginomica.com/2016/08/29/vmworld-2016s-keynote-snoozer-is-no-worry-for-the-infrastructure-innovators/), but since my opportunity cost was low, I don’t regret tuning in.

 

Posted on August 30, 2016 in Virtual

 

vCenter Server Upgrade Finished

In my previous post I bypassed a fairly lengthy step we had in validating the compatibility matrix. VMware software has strict guidelines on what hardware and firmware levels will be supported. Layer in the 3rd-party plug-ins, databases and other VMware software and you have something that resembles a pile of cubes rather than a single matrix. For the basics like ESXi & vCenter, this page is helpful: https://www.vmware.com/resources/compatibility/sim/interop_matrix.php#interop&1=577&2=

After the previous list of gotchas, I ran into a few more. My main vCenter was asking for 24GB of free space to upgrade from 5.5 to 6. Easy enough problem, unless you have only 2 drives in a physical blade that are already fully partitioned. Some of the solutions we batted around were:

1. Get the HBAs working and zone a LUN, then swap the drive letters
2. Back up and restore to a VM
3. Install a clean 5.5 and point it to the existing database
4. Use the VMware converter to P2V

We tried option 2 and failed; we ran into a limitation of the software. Option 4 worked out quite well. At first I was told it wasn’t possible because you need vCenter online for the converter to work. Turns out there is a workaround. The P2V only took an hour and I was able to resize the partitions in the process. Two posts that were very helpful in this process were:

https://blog.shiplett.org/performing-a-p2v-on-a-physical-vcenter-server/

https://virtualnotions.wordpress.com/2015/04/23/how-to-p2v-of-vcenter-server/

5.5 came back online in virtual form on this isolated host fairly quickly. Then it was time for the upgrade.

After about 30 minutes of solid progress bar moving, it appeared to stall out. CPU was idle and the upgrade window showed a message like, “Starting Up vCenter Server…”

[screenshot: vcenter_stall]

I got concerned, almost scrapped it and started over from my VM snapshot. I checked a bunch of log files and looked at the disk activity to see what files it was writing to. None of this really amounted to much of a lead. I looked at the windows services and vCenter was in the “started” state. I tried with the thick client to log in but it said I didn’t have access. It was at that point the upgrade appeared to take off again and completed without error. I guess it just needed a kick.
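If you hit the same kind of stall, a quick service check from PowerShell at least tells you whether vCenter itself is up while the installer sits there. The 'vpxd' service name is my assumption for the Windows vCenter Server service, and display names vary by version.

# Is the vCenter service running while the installer appears hung?
Get-Service -Name vpxd -ErrorAction SilentlyContinue | Format-Table Status, Name, DisplayName -AutoSize

# Other VMware services on the box (display names vary by version)
Get-Service -DisplayName 'VMware*' -ErrorAction SilentlyContinue | Sort-Object Status | Format-Table Status, DisplayName -AutoSize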

[screenshot: vcenter_fin]

The update manager install was simple and uses the same install media. After that I only had one issue remaining. The overview performance charts were not showing up. This is by design in the thick client https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2109800

However, the error I was getting in the Web Client was not by design. Adjusting some network parameters corrected the error I was receiving, after a restart of the Performance Charts service:
https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2131040

Overall this project was a great experience. I have a better understanding of vCenter and learned which logs are important. I got some practice in disaster recovery (failed upgrades). I am also more comfortable with running virtualized vCenters. The plan is to move to the vCenter appliance at some point, but I suspect that will come after we upgrade our hosts.

 

Posted on March 13, 2016 in Virtual

 

vCenter Server Upgrade Gotchas

We have a sandbox vCenter with a couple of hosts in it to test things just like the 5.5 to 6 upgrade we were ready to start. This vCenter has been rebuilt from scratch a few times and doesn’t really have any complexity to it. So when we went to upgrade it, of course it went fine.

The next vCenter I tried to upgrade was at our DR location. Fortunately, I took a backup of the OS and database, because this time I ran into several issues.

administrator@vsphere.local password

This password had expired, and a current password is required for setup to continue. There is a database hack available that I have used to extend this timeout value inside the RSA database, but it wasn’t working this time. The utility vdcadmintool.exe is documented here and is quite easy to use to get a new password. It is just a command-line utility that will spit out a random password, which is a great reason to lock down who has Windows Administrator on your vCenter Server.
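For reference, here is how I invoke it. The path is an assumption based on a default vCenter 5.5 Windows install, and the tool presents a small text menu where one of the options generates the new random password.

# Launch the SSO admin utility (assumed default install path)
& 'C:\Program Files\VMware\Infrastructure\VMware\cis\vmdird\vdcadmintool.exe'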

VIEW SERVER STATE permission on the database

There are only a few options to select and the install starts.

In previous versions, we have allowed the service account defined in the ODBC connection to have db_owner. This grants every permission inside the database but nothing at the server level. It turns out v6 requires a server-level permission called VIEW SERVER STATE.
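The fix is a quick grant on the SQL Server instance that hosts the vCenter database. The login name below is an example; grant it to whatever account the ODBC connection uses.

-- Server-level permission required by the vCenter 6 installer
USE [master];
GRANT VIEW SERVER STATE TO [DOMAIN\svc-vcenter];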

Here is another KB.

Rollback Gotchas

After this error was hit, a rollback process started. Rollback doesn’t put 5.5 back in place at the filesystem level, so you need more than just a database backup. Part of our operating system restore procedure requires an ISO to be mounted, but since vCenter was down, I couldn’t mount that ISO. I had to look in the database to find which host the vCenter VM was running on and connect directly to it with the thick client. There is a VPX_HOSTS view that makes it fairly simple to find which host to connect to.
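Something along these lines against the vCenter database will show the host inventory so you can match up the vCenter VM. This is only a sketch: the view is the one named above, and its columns vary a bit between vCenter versions, so look at what comes back before relying on specific fields.

-- List the hosts vCenter knows about, then match the vCenter VM to its host
SELECT * FROM dbo.VPX_HOSTS;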

The restore process also requires us to add a NIC but distributed switches were not available to select in the dropdown. I had to create a standard switch on this host and assign that to the VM so vlan tagging could happen.

After the OS restore and database restore I was able to connect to vCenter 5.5.

Inventory service broken

The next time I tried the install, I wasn’t able to start it. There was an error complaining about the inventory service. I checked this by trying to search for a VM in the thick client and, sure enough, it was broken. I’m guessing this was due to the restore, but a restart didn’t seem to fix it. I went searching and found another KB to reset the inventory service. There is a handy PowerShell script at the bottom to speed this lengthy process along.

Logon failure: unknown user name or bad password.

After the restore, I did have to reset the administrator account again. I got a strange password that started with a space, but it worked in the web console, so I tried the install again. The next go at the install died with this message:

Error 1326 while creating SSO group “ComponentManager.Administrators”:dir-cli failed. Error 1326: Logon failure: unknown user name or bad password.

The error had a very strange ” ,” with extra line breaks around it; there seemed to be a parsing error. This error left my vCenter a steaming pile, so I applied the backups and tried again with a new administrator password. I wasn’t able to confirm it, but I am pretty sure I got really unlucky and that space at the beginning of the password caused the install to fail. No KBs for that one.

Success… Almost!

This upgrade can take around 30 minutes so I was very delighted to finally see a successful message at the end. I was able to log into the thick client and see VMs. However, my web client was giving me a permission problem even when I was logged in as the administrator@vsphere.local account.

You do not have permissions to view this object or this object does not exist

I ended up calling support on this one and they pointed me to a registry hack. I’m not sure how it happens, but an important service account registry key for a path can get overwritten.

Good Luck!

Hopefully this list helps save someone some grief. vCenter is a complex product with a lot of interconnected services. I’m not terribly unhappy with my upgrade experience. I probably would have had a better time if I had read through all of the best practices. Even though it doesn’t run on MS SQL, I’ll be seriously considering migrating to the appliance version of vCenter after we get completely upgraded.

 

Posted on February 28, 2016 in Virtual

 

Book Review: The Phoenix Project

I’ve broken this post up into two parts, the first directed at convincing you to buy this book and read it several times, and the second to open up discussion for those who have read the book. There will be spoilers in the second part.

PART 1: No Spoilers

Link to Buy on Amazon

I borrowed this book from a co-worker on Friday and finished it Saturday. Yup, done in one day. 382 pages of stories that seem like they could have come straight from my work-related nightmares.

The main character Bill takes over after his boss and his boss’s boss both leave the company. The company is not an IT company and the growing complexity of IT has caused great stress and financial loss.

It is an obvious plug for DevOps. By the end of reading you might wonder if there is any other way to get things done. Keep a skeptical view and enjoy this book.

PART 2: SPOILERS

After the first 10 chapters, I didn’t know how much more I could take. I was physically stressed after reading about the constant firefighting, poor communication, late nights, political sabotage, yelling, swearing, night/weekends/all-nighters, and unreasonable demands. The book depicted a sad state of affairs. I recognized some of the outages, and even the blame game comments sounded spot on.

It’s like they consolidated the most frustrating parts of my 9 years at my current company into 3 months. I’m a SAN administrator, and that first outage of payroll that got blamed on the SAN but ended up being a poorly implemented security feature caused my first wave of stress. It was like watching a horror movie. “Corruption” is the catch-all for unknown software errors. If you take action based on wild assumptions, bad things are going to happen. And let me tell you, they continue to happen even though the new boss Bill seems to have a calm, logical approach to things.

I wonder if this book was written like Dilbert, where the author was simply writing about what really happened to him. It’s the only way this could be so close to accurate.

About halfway through the book, I had a guess that 3 of the secondary characters that were helping Bill, especially Erik, may have just been his alternate personalities. Wes is the aggressive, obnoxious one, Patty is the over-documenter and process type, and Erik is the philosophical one. I was actually disappointed that they remained real characters and not imaginary. I think it would have added to the story to find out that Bill had really just been going crazy from all the stress.

Change Control

I loved watching the team be shocked at how many changes actually happen in the ops world they have been living in. How could they not know? Changes are like queries on a database: sometimes it makes sense to count them, but mostly they are so different that they can’t be counted. One single big change can be more impactful and riskier than 1,000 small changes combined.

Who changed what, when? Questions all ops teams should be able to answer. The book describes “changes” as one of four types of work. I’m not really certain how it fits into DevOps. Maybe change control is about reducing unplanned work, which is another type of work.

I liked the compromise they made between using the crappy change control system and still forcing and encouraging teams to communicate by writing changes on cards. It started a habit and the process communicated the vision. It was an early win in their struggles. The system had so many side benefits, such as discovering the Brent bottleneck.

I wouldn’t encourage IT departments to use an index card method to schedule changes. It’s not searchable and doesn’t scale well. A heavy-handed software application with too many required fields is not the best approach either. The key is having clear definitions of what “change” really means and which systems need to be tracked the most, e.g. important financial systems such as payroll.

%Busy/%Idle

This concept hit close to home. My team has lost two people in the last few months and the workload is climbing to unprecedented levels. The automation I’ve put in place is in need of upgrades and important business projects are coming to fruition.

When you are busy, you make mistakes. When you make mistakes, it’s time-consuming to recover. You also take shortcuts that tend to create more work in the long run. Being busy sucks the life out of people.

Decreasing the wait time for IT to add value to the business is what DevOps is all about. The book illustrates this quite well across several fronts. The way Bill achieves some of his goals before achieving kumbaya in the datacenter is with endless hours. He gets denied more people, so he takes his salaried workforce and makes something out of nothing.

The graph explains why wait times go through the roof. People can function quite well until they are over 90% busy; from there, wait times explode. You can’t squeeze 11% of output out of 10% of idle time. It creates context-switching penalties and queuing, and that is what drives the wait times up.
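The book’s rule of thumb is that relative wait time is roughly %busy divided by %idle, which is easy to play with (a rough sketch; the exact curve in the book may differ):

# Relative wait time = %busy / %idle -- note the explosion past 90%
foreach ($busy in 0.50, 0.80, 0.90, 0.95, 0.99) {
    '{0,4:P0} busy -> {1,6:N1}x wait' -f $busy, ($busy / (1 - $busy))
}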

This is why I sometimes work long hours. I know that if I fall behind, it piles up like laundry and I have no clean underwear. It didn’t quite click until I saw the graph in this book, but it makes total sense. Trying to squeeze that last little bit of production out of a person or process can lead to devastating results.

In the book, Bill realizes he needs to dedicate Brent to project Phoenix. I like the pool of people dedicated to dealing with the escalations that usually go to Brent. It’s like training without the training. Allowing Brent to focus leads to some interesting automation discoveries later in the book.

Everything is Awesome!

After the first 10 chapters, the book slows its pace quite a bit. Some characters do a 180 and everything starts going better. It was a little harder to read, and the politics started to take over.

The authors started to apply DevOps approaches to a small team and everything just magically worked. I was hoping there would be continuing issues before they actually got things right, but magic pixie dust just made things work. Brent’s server builds just converted over to the cloud without any mention of problems, or of the massive cost increases on top of what they had already sunk into onsite servers, not to mention the architectural shift that would have had to take place to successfully run the old code in the cloud. But I suppose they were close to 10 deployments a day, so it would have been fast, right?

 

Posted on November 12, 2015 in Virtual

 

Disaster Recovery

I have recently been sucked into all that is Disaster Recovery, or Business Continuity Planning. Previously I have been a bit dodgy about the topic. I haven’t really enjoyed the subject because it always seems to distract from my focus on backups and local recovery. I liked to focus on the more likely failure scenarios and make sure those are covered before we get distracted. I’m not really sure if that was a good plan or not.

We would have to lose almost our entire datacenter to trigger our disaster recovery plan. A fire in the datacenter, a tornado, or maybe losing our key storage array might trigger DR. Dropping a table in a business application isn’t something you want to trigger a DR plan. Developing a highly available, resilient system is a separate task from developing a DR plan for that system. It was very challenging to convince people to complete a discussion of the local recovery problems without falling into the endless pit of DR.

There seem to be two different business reasons for DR: 1. Complete a test of the plan so we can pass an audit once a year, and 2. Create a plan so we can actually recover if there were a disaster. The first one comes with a few key caveats: the test must be non-disruptive to the business, it cannot change the data we have copied offsite, and it cannot disrupt the replication of that data offsite.

In a cool or warm DR site, the hardware is powered on and ready but it is not actively running any applications. If I were to approach this problem from scratch, I would seriously consider a hot active site. I hear metro clusters are becoming more common. Sites that are close enough for synchronous storage replication enable a quick failover with no data loss. A hot site like this would have many benefits including:
1. Better utilization of hardware
2. Easier Disaster Recovery testing
3. Planned failovers for disaster avoidance or core infrastructure maintenance

However, there are downsides…
1. Increased complexity
2. Increased storage latency and cost
3. Increased risk of disaster affecting both sites because they are closer

Testing is vital. In our current configuration, in order to do a test we have to take snapshots at the cold site and bring those online in an isolated network. This test brings online the systems deemed critical to the business and nothing more. In an active/active datacenter configuration, the test could be much more thorough, where you actually run production systems at the second site.

A most basic understanding of DR covers the simple fact that we now need hardware in a second location. There is much more to DR than a second set of servers. I hope to learn more about the process in the future.

 

Posted on February 7, 2015 in Hardware, Storage, Virtual

 

5 9s Lead to Nestfrastructure (and fewer 9s)

Off the top of my head,

Microsoft DNS issue a handful of hours before the Xbox One launch (http://redmondmag.com/articles/2013/11/21/windows-azure-outages.aspx)

Widespread Amazon outages (http://www.zdnet.com/amazon-web-services-suffers-outage-takes-down-vine-instagram-flipboard-with-it-7000019842/)

NASDAQ (http://www.bloomberg.com/news/2013-08-26/nasdaq-three-hour-halt-highlights-vulnerability-in-market.html)

The POTUS’s baby (http://www.healthcare.gov)

I learned about 5 9’s in a college business class. If a manufacturer wants to be respected as building quality products, they should be able to build 99.999% of them accurately. That concept has translated to IT as some kind of reasonable expectation of uptime. (http://en.wikipedia.org/wiki/High_availability)

I take great pride in my ability to keep servers running. Not only avoiding unplanned downtime, but developing a highly available system so it requires little to no planned downtime. These HA features add additional complexity and can sometimes backfire. Simplicity with more planned downtime is often the best choice. If 99.999% uptime is the goal, there is no room for flexibility, agility, budgets or sanity. To me, 5 9s is not a reasonable expectation of uptime even if you only count unplanned downtime. I will strive for this perfection; however, I will not stand idly by while this expectation is demanded.
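To put numbers on it, five 9s leaves you a bit over five minutes of total downtime for an entire year. A quick back-of-the-envelope calculation (simple arithmetic, assuming a 365-day year and counting all downtime):

# Allowed downtime per year at each availability target
$minutesPerYear = 365 * 24 * 60
foreach ($target in 0.99, 0.999, 0.9999, 0.99999) {
    '{0:P3} -> {1,8:N1} minutes/year' -f $target, ($minutesPerYear * (1 - $target))
}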

Jaron Lanier, the author and inventor of the concept of virtual reality, warned that digital infrastructure was moving beyond human control. He said: “When you try to achieve great scale with automation and the automation exceeds the boundaries of human oversight, there is going to be failure … It is infuriating because it is driven by unreasonable greed.”
Source: http://www.theguardian.com/technology/2013/aug/23/nasdaq-crash-data

IMHO the problem stems from dishonest salespeople. False hopes are injected into organizations’ leaders. These salespeople are often internal to the organization. An example is an inexperienced engineer who hasn’t been around long enough to measure his or her own uptime for a year. They haven’t realized the benefit of keeping track of outages objectively and buy into new technologies that don’t always pan out. That hope bubbles up to upper management and then propagates down to the real engineers in the form of an SLA that no real engineer would actually be able to achieve.

About two weeks later, the priority shifts to the new code release and not uptime. Even though releasing untested code puts availability at risk, the code changes must be released. These ever-changing goals are prone to failure.

So where is 5 9s appropriate? With the influx of cloud services, the term infrastructure is being too broadly used. IIS is not infrastructure, it is part of your platform. Power and cooling are infrastructure and those should live by the 5 9s rule. A local network would be a stretch to apply 5 9s to. Storage arrays and storage networks are less of a stretch because the amount of change is limited.

Even when redundancies exist, platform failures are disruptive. A database mirroring failover (connections closed), a webserver failure (sessions lost), a compute node failure (OS reboots) and even live migrations of VMs require a “stun” which stops the CPU for a period of time (a second?). These details I listed in parentheses are often omitted from the sales pitch. The reaction varies with each application. As the load increases on a system, these adverse reactions can increase as well.

If you want to achieve 5 9s for your platform, you have to move the redundancy logic up the stack. Catch errors, wait and retry.

[image: stack]
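As a trivial sketch of what moving redundancy up the stack looks like in practice: catch the error, wait, and retry. Invoke-BusinessQuery below is a placeholder for whatever call the application makes, not a real cmdlet.

# Retry a transient failure a few times with increasing backoff before giving up
$maxAttempts = 5
for ($attempt = 1; $attempt -le $maxAttempts; $attempt++) {
    try {
        Invoke-BusinessQuery      # placeholder for the real call (DB query, API call, etc.)
        break                     # success, stop retrying
    }
    catch {
        if ($attempt -eq $maxAttempts) { throw }          # out of retries, surface the error
        Start-Sleep -Seconds ([math]::Pow(2, $attempt))   # back off before the next attempt
    }
}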

Yes, use the tools you are familiar with lower in the stack. But don’t build yourself a nest at every layer in the stack, understand the big picture and apply pressure as needed. Just like you wouldn’t jump on every possible new shiny security feature, don’t jump on every redundancy feature to avoid nestfrastructure.

 

vMotion, an online operation?

There are two types of vMotions, storage and regular. Storage vMotion moves VM files or a single .vmdk file to another datastore. The regular vMotion moves the VMs memory from one host to another and then stuns the VM in order to pause processing so the new host can open the file and take ownership of the VM. Today I’ll be referring mostly to the regular vMotion.

These are both fantastic technologies that allow for rolling upgrades of all kinds and also the ability to load balance workloads based on usage. The Distributed Resource Scheduler (DRS) runs every 5 minutes by default to do this load balancing. Datastore clusters can be automated to balance VMs across datastores for space and usage reasons. Like I said, these technologies are fantastic but need to be used responsibly.

“VMware vSphere® live migration allows you to move an entire running virtual machine from one physical server to another, without downtime” – http://www.vmware.com/products/vsphere/features/vmotion

That last little bit is up for debate. It depends on what your definition of downtime is. This interesting historical read shows that vMotion was the next logical step after a pause, move and start operation was worked out. Even though VMware is now transferring the state over the network and things are much more live, we still have to pause. The virtual machine memory is copied to a new host, which takes time, then the deltas are copied over repeatedly until a very small amount of changed memory is left and the VM is stunned. This means no CPU cycles are processed while the last tiny bit of memory is copied over, the file is closed by the old host and the file is opened on the new host, which allows the CPU to come back alive. Depending on what else is going on, this can take seconds. Yes, that is plural: seconds of an unresponsive virtual machine.

What does that mean? Usually in my environment, a dropped ping, or maybe not even a dropped ping but a couple of slow pings in the 300ms range. This is all normally fine because TCP is designed to re-transmit packets that don’t make it through. Connections generally stay connected in my environment. However, I have had a couple of strange occurrences in certain applications that have led to problems and downtime. Downtime during vMotion is rare and inconsistent. Some applications don’t appreciate delays during some operations and throw a temper tantrum when they don’t get their CPU cycles. I am on the side of vMotion and strongly believe these applications need to increase their tolerance levels, but I am in a position where I can’t always do that.

The other cause of vMotion problems is usually related to overcommitted or poorly configured resources. vMotion is a stellar example of super-efficient network usage. I’m not sure what magic sauce they have poured into it, but the process can fully utilize a 10Gb connection to copy that memory. Because of that, vMotion should definitely be on its own VLAN and physical set of NICs. If it is not, the network bandwidth could be too narrow to complete the vMotion process smoothly, and that last little bit of memory could take longer than normal to copy over, causing the stun to take longer. Very active memory can also cause the last delta to take longer.

Hardware vendors advertise their “east-west” traffic to promote efficiencies they have discovered inside blade chassis. There isn’t much reason for a vMotion from one blade to another blade in a chassis to leave the chassis switch. This can help reduce problems with vMotions and reduce the traffic on core switches.

In the vSphere client, vMotions are recorded under the tasks and events. When troubleshooting a network “blip”, the completed time of this task is the important part. Never have I seen an issue during the first 99% of a vMotion. If I want to troubleshoot broader issues, I use some T-SQL and touch the database inappropriately. PowerShell and PowerCLI should be used in lieu of database calls for several reasons, but a query is definitely the most responsive of the bunch. This query will list VMs by their vMotion frequency since August.


-- VMs ranked by number of DRS-initiated vMotions since August 2014
-- (most frequently moved VMs at the top)
SELECT
    VM_NAME  AS 'VM',
    COUNT(*) AS 'Number of vMotions'
FROM [dbo].[VPXV_EVENTS]
WHERE
    EVENT_TYPE = 'vm.event.DrsVmMigratedEvent'
    AND CREATE_TIME > '2014-08-14'
GROUP BY VM_NAME
ORDER BY COUNT(*) DESC

This query can reveal some interesting problems. DRS kicks in every 5 minutes and decides if VMs need to be relocated or not. I have clusters that have DRS on but don’t ever need to vMotion any VMs because of load, and I have clusters that are incredibly tight on resources and vMotion VMs all the time. One thing I have noticed is that VMs that end up at the top of this query can sometimes be in a state of disarray. A hung thread or process that is using CPU can cause DRS to search every 5 minutes for a new host for the VM. Given the stun, this isn’t usually a good thing.
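Since PowerShell and PowerCLI are the politer option, a rough equivalent with Get-VIEvent looks like this. It assumes an existing Connect-VIServer session and the same start date, and it will be noticeably slower than the SQL query against a busy vCenter.

# Count DRS-initiated vMotions per VM since 2014-08-14, most frequent first
Get-VIEvent -Start (Get-Date '2014-08-14') -MaxSamples ([int]::MaxValue) |
    Where-Object { $_ -is [VMware.Vim.DrsVmMigratedEvent] } |
    Group-Object { $_.Vm.Name } |
    Sort-Object Count -Descending |
    Select-Object Name, Count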

IMHO, a responsible VM admin is willing to contact VM owners when they are hitting the top of the vMotions list. “Don’t be a silent DBA.” That is some advice I received early on in my career: maintenance and other DBA-type actions can be “online” but in actuality cause slowdowns in the system that other support teams may never find the cause for. The same advice applies to VMware admins as well.

 

Posted on September 16, 2014 in Virtual

 

Guest Memory Dump From the Hypervisor

Part of VMware’s vMotion process copies all the guest system’s memory from one physical host to another over the network. Snapshots and VM suspends will force a memory checkpoint, making sure there is a persisted full copy of memory on disk. The point here is that the hypervisor is very much aware of the guest’s memory.

Without the hypervisor there are a few ways to capture the data in RAM needed for some serious debugging. A single process is easy: just fire up the proper bitness of Task Manager.

[screenshot: process_dump]

If the Windows computer is actually crashing, you can have it automatically create a dump file. One requirement is enough space for the page file. http://blogs.technet.com/b/askcore/archive/2012/09/12/windows-8-and-windows-server-2012-automatic-memory-dump.aspx

If the problem you are trying to debug doesn’t crash your computer, you have a little more reading to do: https://support.microsoft.com/kb/969028 There are several tools, including a registry entry for CTRL+Scroll and a Sysinternals utility whose name I love: NotMyFault.exe

But wait! It gets better!


The hypervisor checkpoint process. Just hit the pause button on your VM and voila. Browse the datastore and download the .vmss file. VMware has kindly written a Windows version of its application to handle the conversion: https://labs.vmware.com/flings/vmss2core To convert this .vmss file to a WinDbg memory dump file, just run this command:

vmss2core.exe -W C:\pathtodmp\vm_suspend.vmss

You can also perform this same process using a snapshot instead. This can be an even better option to avoid downtime if your guest is still mostly working.
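Grabbing the .vmss without the datastore browser can also be scripted. A hedged PowerCLI sketch follows; the vCenter name, datacenter, datastore, and VM folder/file names are all examples.

# Pull the suspend checkpoint file down from the datastore for conversion
Connect-VIServer vcenter.example.com
Copy-DatastoreItem -Item 'vmstore:\MyDatacenter\datastore1\MyVM\MyVM.vmss' -Destination 'C:\dumps\'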

Now What?


Well, this is the point where I call in the experts. I generally do this to ship the file off for analysis by the developers of suspect code. As a teaser to some future posts, here are the ingredients we are going to have to collect:

The file we created is consumable by WinDBG http://msdn.microsoft.com/en-us/windows/hardware/hh852365.aspx

http://support.microsoft.com/kb/311503: Symbols help map out the functions

Commands for analysis in Windbg: http://msdn.microsoft.com/en-us/library/windows/hardware/ff564043(v=vs.85).aspx

 

Posted on February 7, 2014 in Virtual

 


Changing memory for the VM was greater than the available network bandwidth

I recently ran into some strange behavior on one of my virtual hosts. I first noticed that the VMware console was nearly unusable because it was so slow. It looked like an overly compressed JPEG, and mouse clicks were taking forever on all the VMs on that host. RDP worked fine and the VMs didn’t seem to have any problems. VMs on identical hosts in the same cluster were not having problems. I attempted to vMotion a VM off of that host and it didn’t seem to be progressing. VMs were happy, so it wasn’t an urgent issue. The next day I came in and was presented with this error:

“The migration was cancelled because the amount of changing memory for the VM was greater than the available network bandwidth”

I was stuck in a sticky situation: I had a problem with a host, but I couldn’t move any VMs off of the host because vMotion wasn’t working. Since I was new at VMware administration, I ran the problem by a couple of co-workers and then opened a ticket with the vendor.

Lead 1: Busy network

When connecting to a VM console, that traffic is managed by the host kernel. vMotion traffic is also handled by the kernel but can be separated from the other traffic. Both of these things were suffering, so it seemed pretty obvious where my problem was. I suspected a network problem but wasn’t really sure how to troubleshoot the management network traffic.

This KB was relevant but not helpful. 23 Ratings and the average is one star.
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2009135

This article does some finger-pointing at a broader network problem with some general contention. The console was never fast and the vMotions never finished. Support recommended segregating vMotion traffic from kernel traffic in different VLANs and port groups, basically trying to solve a general network-busy problem. With only two connections, I could maintain redundancy and segregation by setting one vmnic to active and one to standby, and doing the opposite for the other port group. That would give me the minimum requirement for vMotion of a dedicated 1Gbit connection.

Lead 2: Bad something or other

The performance was consistently slow. That doesn’t describe a contention problem. So the next suggestion was a bad driver, firmware, or cable. I tried swapping management cables with another host. The problem stayed with this host and not the cables.

So the last option support kept coming back to was a bad driver or firmware. They suggested updating, but that caused me problems because I would have to reboot for it to take effect, and I couldn’t reboot because I couldn’t get the stinking VMs off this host. Also, why the heck did I have 6 identical hosts and only 1 having this problem?

Important discovery: Direction matters

I started examining the performance with esxtop. I noticed that while these vMotions were running, throughput was staying at a consistent 1 Mbit/sec. That is about 1,000 times slower than it should be; vMotion can saturate a 1Gb/sec connection. I also found a new and quite useful tool to graph this information: Visual esxtop is like perfmon for ESX.

http://labs.vmware.com/flings/visualesxtop

I was able to look at each port individually. I began to understand that even though the two NICs were “active/active” in the port group, a single vMotion would only use a single NIC. Out of the blue, against my better judgment, I decided to vMotion a VM onto this box. Previously I had been focused on getting VMs off of this server. To my surprise, the process worked just fine. That is because it hit the second NIC and not the first one.

Resolution: Turn it off and on again

To resolve my issue, I went into the vSwitch properties and edited my management port group, making the bad NIC standby. I was then able to move all the VMs off this host and reboot. I added the “bad” NIC back into the port group as active and watched a couple of vMotions succeed using 950 Mbit/sec of throughput on that same “bad” NIC.
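The same change can be made with PowerCLI if the GUI is fighting you. The host, port group, and vmnic names below are examples, and cmdlet parameters vary a little between PowerCLI versions, so treat this as a sketch.

# Demote the suspect NIC to standby on the management port group's teaming policy
$vmhost = Get-VMHost -Name 'esx01.example.com'
$badNic = Get-VMHostNetworkAdapter -VMHost $vmhost -Physical -Name vmnic0
Get-VirtualPortGroup -VMHost $vmhost -Name 'Management Network' |
    Get-NicTeamingPolicy |
    Set-NicTeamingPolicy -MakeNicStandby $badNic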

So the root cause was a single management NIC (1 of 12) acting a bit goofy and requiring a reboot. Not exactly a perfect resolution, but at least I didn’t have to cause any downtime for my users.

 

Posted on October 10, 2013 in Virtual