This is a task that is hard to muster the effort to complete. I am talking about building a valuable baseline. A baseline is a detailed picture of an application environment when it is working well and has a normal user load. Without users everything works just great, so a normal load is a key aspect of a good baseline. A baseline can also be performance test results from a system, but that is only useful if you are allowed to re-run that test.
Don’t rely on user-reported data for a baseline. Users are all different, but generally they cannot detect a 3x performance gain, let alone a 40% performance gain. These survey-type inaccuracies can be exacerbated by a slowly changing system.
When I picture a good SQL baseline, I picture the data I would collect if I were having a problem. The big 4 [CPU][MEM][DISK][NET] performance metrics are key. The SQL waits over a period of time, such as a business day, are essential. I would like to dive a bit deeper in this post. There are more complicated problems that a good baseline can assist with.
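To make "SQL waits over a business day" concrete, here is a minimal sketch. It assumes you have taken two snapshots of cumulative wait counters (for example wait_type and wait_time_ms from sys.dm_os_wait_stats, which accumulates until the instance restarts or the stats are cleared) and loaded each into a plain dictionary. The function name and the dictionary shape are my own illustration, not part of any tool.

```python
from typing import Dict


def wait_deltas(start: Dict[str, int], end: Dict[str, int]) -> Dict[str, int]:
    """Return wait time (ms) accumulated per wait type between two snapshots.

    Both arguments map wait type name -> cumulative wait time in ms.
    The result is ordered from the largest delta to the smallest.
    """
    deltas = {}
    for wait_type, end_ms in end.items():
        delta = end_ms - start.get(wait_type, 0)
        if delta < 0:
            # Counters went backwards, so they were reset between snapshots
            # (e.g. a restart). The end value is the best we can report.
            delta = end_ms
        if delta > 0:
            deltas[wait_type] = delta
    return dict(sorted(deltas.items(), key=lambda kv: kv[1], reverse=True))
```

Taking one snapshot at the start of the business day and one at the end, then saving the result with a date on it, gives you something to hold next to the same calculation on a bad day.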
You aren’t going to go from 0 to fixed in 3 seconds flat with a good baseline. It isn’t that easy. What you can do is look at a broken system, look at the baseline, and then pick out the differences that should be the focus of your attention.
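As a rough sketch of that comparison, assume the baseline and the broken system have each been boiled down to a dictionary of metric name to number. The metric names, the function name, and the 1.5x threshold below are all hypothetical; pick whatever fits your environment.

```python
from typing import Dict, List, Tuple


def biggest_differences(
    baseline: Dict[str, float],
    current: Dict[str, float],
    min_ratio: float = 1.5,
) -> List[Tuple[str, float, float, float]]:
    """List metrics that moved by at least min_ratio in either direction.

    Each entry is (name, baseline value, current value, current / baseline),
    ordered so the largest relative change comes first.
    """
    findings = []
    for name, base_value in baseline.items():
        if name not in current or base_value <= 0:
            continue  # nothing to compare against
        ratio = current[name] / base_value
        if ratio >= min_ratio or ratio <= 1 / min_ratio:
            findings.append((name, base_value, current[name], ratio))

    def deviation(entry: Tuple[str, float, float, float]) -> float:
        ratio = entry[3]
        # A drop to zero counts as the largest possible change.
        return max(ratio, 1 / ratio) if ratio > 0 else float("inf")

    findings.sort(key=deviation, reverse=True)
    return findings
```

This only tells you where to look first. It does not tell you why the number moved, which is still the real work.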
This baseline vs. broken analysis should come after you have identified what changed. That is a very annoying question to the people who make changes. SAN admins, DBAs, and VMware admins dislike that question because they hear it a lot. It’s not so much a dislike of the question; it’s a dislike of the competence of the person asking it. It’s a completely valid question, but when you have a laundry list of changes… it can very much complicate and drag out the problem solving. Most changes can be [ctrl+z]’d, but we have to understand why this change happened. If we smash [ctrl+z] in a panic we might be in the same boat a month from now. If we undo the change without figuring out what is wrong, it could effectively issue a DNR order on that product. It creates a fear that change is bad. Before you ask “what changed?” remember that time itself changes.
I personally like to sit in the “what changed” camp for a while, but if that doesn’t fix the problem you have to switch the question back to what’s wrong. A good baseline can help answer both of those questions. Like I mentioned, a good baseline will have pictures of all the areas that you go to when troubleshooting. Even a simple test like ping can save you time. Having it in the baseline will prevent the “OMG I can’t ping it, call the network team” response when actually Windows Server 2008 R2 blocks ping (ICMP echo request) by default. Also, under network pressure, Windows will choose ICMP to drop first.
To get more specific, here are some things that I dream of in my baseline:
1. pathping results during peak usage from several areas
2. network trace
3. SQL trace
4. graphs of ready time vs. CPU utilization
5. memory usage and allocation
6. disk latency and throughput
7. Windows event log
8. application log
9. host log
10. SQL Server and database configuration
11. full backup times
You think that’s a lot of data? Well, that’s what it takes to solve complex problems. It takes a lot of work to really solve problems by figuring out the root cause. The real root cause, not just lupus. Not just the fact that you changed something, but the reason the change didn’t work.
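Several items on that list (ping or pathping round-trip times, disk latency, full backup durations) are just piles of numbers collected over time. One way to keep them manageable is to reduce each pile to a few summary values and save those with a date. This is only a sketch: the function names are mine, and the nearest-rank percentile is one of several common definitions.

```python
import math
from typing import Dict, List


def percentile(ordered: List[float], pct: float) -> float:
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(samples: List[float]) -> Dict[str, float]:
    """Reduce raw samples (e.g. latency in ms) to a small baseline record."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    return {
        "count": len(ordered),
        "min": ordered[0],
        "p50": percentile(ordered, 50),
        "p95": percentile(ordered, 95),
        "max": ordered[-1],
    }
```

Keeping the 95th percentile next to the median matters because a system can have a perfectly normal median and still have a miserable worst 5%.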
A personal story
I’d like to close with a story of home ownership and stupidity. I’ve lived in the same house for over 4 years now, and from time to time there has been a spot where you can stand and hear a vibration sound. When no one is standing there the vibration is gone. This makes it rather hard to troubleshoot, but I assumed (correctly) that it was a vent making contact with the floorboard.
Several times this would irritate me beyond a reasonable level and I would march downstairs and bang at the nails holding the vents in place until I thought it went away. I would then march back upstairs in Homer Simpson fashion and step on the same spot and hear the vibration again, driving me insane.
The furnace and the furnace fan are very old so I, without a proper baseline, thought this vibration was normal. I must have replaced the furnace filter 10 times and decided that maybe this time I would buy a better filter. Still, with a new filter the vibration continued. Within the last month I started noticing that my dog would wake up hungry and pace around the bedroom, causing the vibration intermittently. This compounded frustrations because my dog should not know that it is 5:30am on the dot and should not be pacing in circles like some sort of zoo animal.
I marched downstairs determined to figure out how to replace this furnace fan that was making this terrible vibration. thumpthumpwumpthumpthumpthumpwumpthumpthumpwump…. gah! I walked around and felt the vents because the furnace was running at the time. There was a cold side and a hot side right next to the furnace. I opened up the furnace, removed the filter, and looked at it. I thought… hrm… that’s odd, the dust is on the wrong side! DOH! I put the new, new furnace filter in with the nicely labeled [AIR-->FLOW] pointing the other way. It makes sense now that the cold air goes in and the hot air comes out…duh.
When my wife got home later I bragged to her about my victory. She got mad. She explained that she had already said something about that, which I didn’t think I remembered. I posed a fairly good argument: if she reeeeealy thought that was it, why didn’t she change it? So she stayed mad for a day, and then the next day told me exactly what I had said about 6 months earlier, when we were talking while I was putting the first new filter in.
The furnace fan was running and you were replacing the filter. Dustin, I said to you, “Are you sure that’s right?” And you replied, “O yeeeeea, I stuck my hand in there and felt the air”.
This caused an immediate rush of humility because I did remember saying that, and it made me feel downright stupid. We laughed hysterically for a while. I admitted I was completely and totally wrong because:
A. I didn’t have a good baseline and blamed the fan that has been working for years
B. I “stuck my hand in there” while the thing was turned on thinking that was a good test for air flow in a vacuum
C. I was arrogant and discounted her when she asked “Are you sure?”