An accidental DBA can go quite some time before learning about database maintenance. Users become numb to performance problems that slowly worsen over time. It’s like watching a child grow up: if you see them every day the change isn’t that noticeable, but if you see them once every 5 years the shock factor is there.
With experience I’ve gained a spidey sense for approximately how long something should take. I don’t need many details or prior experience with the database in question. So when searching the email archive took 10 minutes with only a date filter selected, I jumped right on the issue.
A 10-minute query is easy to catch with the DMVs. The system is pretty lightly used but holds all of the archived email, so there is plenty of data. I had the user recreate the issue and pulled out the SQL text that was running. I ran it in Management Studio with the actual execution plan enabled. Sure enough, it seemed to hit some particularly gooey part of the database and would crawl every time it was run.
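For reference, catching a long-running statement while it executes looks roughly like this (the columns and filter here are just a sketch, run while the user reproduces the slow search):

```sql
-- Find currently executing requests and pull their SQL text.
SELECT r.session_id,
       r.start_time,
       r.status,
       r.total_elapsed_time,
       t.text AS sql_text
FROM sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.session_id <> @@SPID   -- exclude this query itself
ORDER BY r.total_elapsed_time DESC;
```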
I jumped to the premature conclusion that an index was missing. But after checking the tables, all filters and joins were covered by indexes and the query plan was using them.
Then I checked my agent job that runs an index maintenance script. The particular script I ran on this server was Michelle Ufford’s: http://sqlfool.com/2011/06/index-defrag-script-v4-1/. I have been using it since I discovered, a few years ago, that maintenance was a necessity.
Back in time
I have been a big fan of this script for maintenance. When I inherited about 30 servers 3 years ago, little to no maintenance was being done. (I’m now up to about 100 servers.) One server did have a maintenance plan, but it ran almost all of the GUI-based maintenance plan tasks in succession and took nearly all night when it completed successfully. It looked something like this:
After some research I concluded that some of those tasks were unnecessary. The first was a rebuild running right after a reorg of all indexes; the reorg is completely redundant maintenance. The next thing I investigated was statistics. I didn’t really know what they were at the time, but I had read somewhere that stats get updated when you rebuild an index. This is true for the statistics on the index itself (histogrammed on its lead column), but a rebuild doesn’t refresh ALL stats. I eventually decided to scrap the whole plan and run Michelle’s stored procedure weekly, with CHECKDB less frequently.
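To illustrate the difference (the table and index names here are made up):

```sql
-- Rebuilding an index refreshes only the statistics attached to that index
-- (equivalent to a FULLSCAN update of that one statistic):
ALTER INDEX IX_Email_SentDate ON dbo.EmailArchive REBUILD;  -- hypothetical names

-- Column-level and auto-created stats on the table still need their own pass:
UPDATE STATISTICS dbo.EmailArchive;  -- default sampling, all stats on the table
```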
Michelle’s script builds some intelligence around index maintenance. Rebuilding all indexes isn’t necessary and heavily taxes the system, especially on VLDBs. In the script, lightly fragmented indexes only get reorganized while heavily fragmented indexes get rebuilt. Tiny indexes get no maintenance at all because they don’t cause performance problems.
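The core decision logic looks something like this sketch against sys.dm_db_index_physical_stats (the 10%/30% fragmentation cutoffs and the 1000-page floor are illustrative, not the script’s exact values):

```sql
-- Classify each index: skip tiny ones, reorganize light fragmentation,
-- rebuild heavy fragmentation.
SELECT OBJECT_NAME(ps.object_id) AS table_name,
       i.name AS index_name,
       ps.avg_fragmentation_in_percent,
       ps.page_count,
       CASE
           WHEN ps.page_count < 1000 THEN 'skip (too small to matter)'
           WHEN ps.avg_fragmentation_in_percent < 30 THEN 'REORGANIZE'
           ELSE 'REBUILD'
       END AS suggested_action
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ps
JOIN sys.indexes AS i
    ON i.object_id = ps.object_id
   AND i.index_id = ps.index_id
WHERE ps.index_id > 0   -- ignore heaps
  AND ps.avg_fragmentation_in_percent >= 10;
```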
Back to the problem
There are a couple of problems with running only an index maintenance script like this one. Some indexes can be quite large but not fragmented, so their stats can go stale because they never hit the rebuild threshold. SQL Server can auto-update stats when they are stale, but large tables sometimes need more frequent attention. Also, the script pays no attention to the column-level stats that get added when the database is set to auto-create statistics (those prefixed with _WA_), or to the column-level stats added by hand during database design.
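You can see those neglected statistics, and when each was last touched, with something like:

```sql
-- Auto-created (_WA_) column stats, oldest update first.
SELECT OBJECT_NAME(s.object_id) AS table_name,
       s.name AS stats_name,
       s.auto_created,
       STATS_DATE(s.object_id, s.stats_id) AS last_updated
FROM sys.stats AS s
WHERE OBJECTPROPERTY(s.object_id, 'IsUserTable') = 1
  AND s.auto_created = 1   -- remove this filter to see every statistic
ORDER BY last_updated;
```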
In the case of my 10-minute query, I checked the actual query plan while running the query in SSMS. I noticed that one of the pipes in the plan showed roughly 50,000 actual rows where the optimizer estimated only 1. I’ve read that gap usually means a stats problem, so I decided to do a complete update of statistics on all tables with FULLSCAN. I should have checked the statistics properties to see when they were last updated, but I wasn’t too concerned. It just means I’ll never know how stale they were.
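The fix itself was a one-off pass along these lines (sp_MSforeachtable is undocumented but handy for a one-time job; a cursor over sys.tables works just as well):

```sql
-- One-off: update every statistic on every user table with a full scan.
EXEC sp_MSforeachtable 'UPDATE STATISTICS ? WITH FULLSCAN;';
```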
After that was done, I ran the query once more and it took 4 minutes; the second time, it ran in just over 1 second. I’d call that a success, but one thing baffles me to this day: the estimated and actual rows are still the same as before. I did get a slight difference in the query plan. The optimizer changed a nested loop join to hit the clustered index instead of a non-clustered index, and it changed the join order slightly. But I expected the new plan to estimate rows better on that pipe, and it was still 50,000 actual to 1 estimated.
I switched the maintenance to Ola Hallengren’s scripts, which can include FULLSCAN stats maintenance if needed.
Yes, I improved query performance by 99%. Yes, that is awesome, but I was also the root cause of the problem because I neglected to do proper maintenance. I hope to one day understand the internals of query plans and stats a bit better, more specifically why auto-update stats and index maintenance combined can still let stats get out of whack. I also want a better way to alert when stats are stale in my environment.
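As a starting point for that alert, something like this might work (sys.dm_db_stats_properties needs SQL Server 2008 R2 SP2 / 2012 SP1 or later, and the 20% threshold is an arbitrary guess on my part):

```sql
-- Flag statistics with many row modifications since their last update.
SELECT OBJECT_NAME(s.object_id) AS table_name,
       s.name AS stats_name,
       sp.last_updated,
       sp.rows,
       sp.modification_counter
FROM sys.stats AS s
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
WHERE sp.modification_counter > 0.2 * sp.rows   -- arbitrary 20% threshold
ORDER BY sp.modification_counter DESC;
```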