I have recently experienced a crossover between having knowledge in theory and putting it into practice. It’s one thing to know that DMVs and Profiler exist, and another to use them to identify the query causing performance problems. This falls in line with my favorite path to mastery: learn one, do one, teach one.
Query tuning is a topic I will keep coming back to as I get better at it. As DBAs go, I rate myself far from good; I haven’t had that much experience. Still, I would like to share the steps I recently took to identify a process running rampant on one of my servers, which hosts a single vendor-developed database.
We had an issue where the SQL Server CPU would shoot up to 85-90%, hang there all afternoon, and return to normal (20-25%) after business hours. I focused on a blocking issue that would come and go. SQL Sentry helped identify the head of the blocking chain: a query that ran for about 30 seconds and blocked anywhere from 10 to 50 UPDATE queries while it ran. This was happening a lot during the day. We found out that one of the query’s parameters used a leading wildcard, which didn’t work out too well against the 30-million-row table.
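To see why the leading wildcard hurt so much: with a wildcard at the front of the search string, the optimizer can’t seek an index on the column and has to scan every row. A hypothetical illustration (table and column names are mine, not the vendor’s):

```sql
-- Leading wildcard: no index seek possible, all 30 million rows get scanned
SELECT CustomerID, CustomerName
FROM dbo.Customers                      -- hypothetical table
WHERE CustomerName LIKE '%smith';

-- Trailing wildcard only: an index on CustomerName can be used for a seek
SELECT CustomerID, CustomerName
FROM dbo.Customers
WHERE CustomerName LIKE 'smith%';
```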
We were running traces at the time with a filter of duration > 1 second, for fear of adding too much overhead; we were sure the bad queries would be long running. We did this using the GUI (SQL TV is bad) as instructed by the software vendor. Beyond the leading wildcard, we suspected there was also a similar query that returned far too many results. We identified the two problems and sent them to the vendor. What we got back amounted to “this would be hard to fix; educate your users not to do leading-wildcard searches.” We were quite unsatisfied, but then it was time for me to go to SQL Rally.
About a week after I got back, I took a Friday off and returned to find that the SQL Server had been upgraded from 16GB of RAM to 32GB and from 4 vCPUs to 8. It was running on its own blade server and seemed to be doing OK for the first day. Tuesday afternoon the same spiking CPU problem came back. We had failed to kill this problem with iron.
The first thing I did Tuesday was open up a query I got at SQL Rally from Eddie Wuerch. It examines the actively running queries and shows their wait stats. While the CPU was high, I ran it several times. I got rather discouraged when I didn’t see any of the leading-wildcard or large-result queries that we thought were the problem before. I then ran it several times in succession with a WAITFOR DELAY command between runs. I found a query waiting on CXPACKET that had a short duration (100-200ms) but was executing every 200 to 300 milliseconds.
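I can’t reproduce Eddie’s query here, but a minimal sketch of the same idea, polling sys.dm_exec_requests for active queries and their current waits, looks something like this:

```sql
-- Poll active requests and their current wait types; run repeatedly with a
-- WAITFOR DELAY in between to catch short, frequent queries.
SELECT r.session_id,
       r.wait_type,           -- e.g. CXPACKET for parallel queries
       r.wait_time,           -- ms spent in the current wait
       r.cpu_time,            -- ms of CPU used so far
       r.total_elapsed_time,  -- ms since the request started
       t.text AS query_text
FROM sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.session_id > 50;      -- skip system sessions

WAITFOR DELAY '00:00:01';     -- pause one second, then run the SELECT again
```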
I decided to pull out a few tricks I picked up from Grant Fritchey at the SQL Rally pre-con. One tip was not to watch SQL TV and to do server-side traces instead, what I call SQL DVR. I used Profiler to select my events (RPC:Completed and SQL:BatchCompleted) and filters (database ID), then exported the trace definition. Another was to check the query plans for large pipes and compare actual vs. estimated rows. The last was to use Profiler to import perfmon data (% Processor Time and SQL Server buffer cache hit ratio). That gives me the ability to drag a time slider and see which queries started running right when the CPU spiked, and whether the same queries were already running before the high CPU.
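The trace definition Profiler exports is just a script built on the sp_trace_* procedures. A trimmed-down sketch of what it looks like (the file path, database ID, and file size here are placeholders):

```sql
DECLARE @TraceID int;
DECLARE @maxfilesize bigint;
SET @maxfilesize = 100;  -- roll over to a new file every 100MB

-- Option 2 = TRACE_FILE_ROLLOVER; SQL Server appends .trc to the path
EXEC sp_trace_create @TraceID OUTPUT, 2, N'D:\Traces\cpu_spike', @maxfilesize;

DECLARE @on bit;
SET @on = 1;
EXEC sp_trace_setevent @TraceID, 10, 1,  @on;  -- RPC:Completed, TextData
EXEC sp_trace_setevent @TraceID, 10, 13, @on;  -- RPC:Completed, Duration
EXEC sp_trace_setevent @TraceID, 10, 18, @on;  -- RPC:Completed, CPU
EXEC sp_trace_setevent @TraceID, 12, 1,  @on;  -- SQL:BatchCompleted, TextData
EXEC sp_trace_setevent @TraceID, 12, 13, @on;  -- SQL:BatchCompleted, Duration
EXEC sp_trace_setevent @TraceID, 12, 18, @on;  -- SQL:BatchCompleted, CPU

-- Column 3 = DatabaseID; 0, 0 = AND, equals; 7 is a placeholder database ID
EXEC sp_trace_setfilter @TraceID, 3, 0, 0, 7;

EXEC sp_trace_setstatus @TraceID, 1;  -- start the trace
SELECT @TraceID;                      -- note the ID for the stop script
```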
I very carefully set up a server-side trace, keeping my trace-stop query ready to run in case the server slowed down. I captured the data to a local drive in 100MB files. We ran it for about five minutes and caught 300MB of data. I moved the files to my local computer, opened the first one, and from there exported the data to a trace table on one of our test servers. I ordered the query by CPU DESC and saw the same query that had shown up in the wait-stats query. I had heard this before, but only after looking it up did it register that in a trace table Duration is in microseconds while CPU is in milliseconds. At first I didn’t understand how CPU could be longer than duration, but a parallel query racks up CPU time on every worker at once; dividing CPU by 8 (the number of CPUs) gave a number that was always less than duration.
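The stop script and the load into a trace table are both short. A sketch, with the trace ID and file path as placeholders:

```sql
-- Stop and close the trace (the "trace stop query" kept ready to run)
EXEC sp_trace_setstatus @traceid = 2, @status = 0;  -- stop collecting
EXEC sp_trace_setstatus @traceid = 2, @status = 2;  -- close and delete the definition

-- On the test server: load the file(s) into a table, then rank by CPU.
-- Duration is stored in microseconds; CPU is in milliseconds.
SELECT *
INTO dbo.CpuSpikeTrace
FROM fn_trace_gettable(N'D:\Traces\cpu_spike.trc', DEFAULT);

SELECT TOP (50)
       TextData,
       CPU,                            -- ms, summed across parallel workers
       Duration / 1000 AS DurationMs   -- convert to ms for a fair comparison
FROM dbo.CpuSpikeTrace
ORDER BY CPU DESC;
```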
I pulled this query out, put it into SSMS exactly as captured, and ran it with the “Include Actual Execution Plan” button selected. Sure enough, I found several large pipes and three parallelism operators. One of the pipes showed estimated rows at 30 million and actual rows at 200 thousand.
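If you’d rather not click the button, SET STATISTICS XML ON returns the same actual plan, estimated and actual row counts included, as an XML column you can click open (the query here is a hypothetical stand-in for the captured one):

```sql
SET STATISTICS XML ON;

-- hypothetical stand-in for the captured vendor query
SELECT CustomerID, CustomerName
FROM dbo.Customers
WHERE CustomerName LIKE '%smith';

SET STATISTICS XML OFF;
```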
Creating statistics on one of the columns might help this query, but that is in the hands of the vendor’s developers. We created some neat pie charts showing how this one query accounted for 98% of all CPU usage. I hope they address the root cause of the problem: running the same query three times a second and getting the same results every time.
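If the vendor does take it up, the fix could be as small as one statement. A hypothetical example (the real table and column live in their schema):

```sql
-- Column statistics give the optimizer a histogram to estimate rows from,
-- which should help close the 30-million-estimated vs. 200-thousand-actual gap
CREATE STATISTICS st_SearchColumn
ON dbo.VendorTable (SearchColumn)   -- hypothetical names
WITH FULLSCAN;
```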