We have this 3rd-party webapp that is used quite heavily inside our organization. Users get rather testy when this thing goes down, and for about three weeks it was crashing several times a day. By crashing I mean:
no event logs
no log files
no increased CPU/MEM/DISK/NETWORK utilization
leaves database connections open
no queries being run
http requests just hang
So the server left without even saying goodbye. A quick cycle of a Windows service would fix the problem, but there was no event to capture to automate the restart. A ticket was opened with the vendor, but not much progress was made in a couple of weeks. I am actually a user of the product, and it started affecting me, so I decided to do some investigation of my own. I manage the database but not the application. Everyone was fairly certain it was an application problem and not a SQL problem. However, SQL has tools easy enough for someone without an opposable thumb to understand.
I didn't really expect much out of it, but I started a server-side trace. I captured SQL:BatchCompleted and RPC:Completed and filtered on database ID. Instead of putting extra load on the server by running the Profiler GUI, I scripted the event and filter setup out to a query and ran that. I also set the log file to roll over at 100 MB and capped the trace at 10 files, very important settings to have so you don't fill up your drives.
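A condensed sketch of what that server-side trace script looks like. The file path and database name are illustrative, not the ones I used; event 10 is RPC:Completed and event 12 is SQL:BatchCompleted, and the columns shown are just the handful needed for this kind of hunt. Note that with rollover, @filecount caps how many files are kept on disk.

```sql
DECLARE @TraceID int, @maxfilesize bigint = 100, @on bit = 1;
DECLARE @dbid int = DB_ID(N'AppDB');          -- database name is illustrative

EXEC sp_trace_create @TraceID OUTPUT,
     @options     = 2,                        -- 2 = TRACE_FILE_ROLLOVER
     @tracefile   = N'D:\Traces\app_crash',   -- hypothetical path, no extension
     @maxfilesize = @maxfilesize,             -- MB per file before rollover
     @stoptime    = NULL,
     @filecount   = 10;                       -- keep at most 10 rollover files

-- Event 10 = RPC:Completed, event 12 = SQL:BatchCompleted
-- Columns: 1 = TextData, 3 = DatabaseID, 12 = SPID, 14 = StartTime
EXEC sp_trace_setevent @TraceID, 10, 1,  @on;
EXEC sp_trace_setevent @TraceID, 10, 3,  @on;
EXEC sp_trace_setevent @TraceID, 10, 12, @on;
EXEC sp_trace_setevent @TraceID, 10, 14, @on;
EXEC sp_trace_setevent @TraceID, 12, 1,  @on;
EXEC sp_trace_setevent @TraceID, 12, 3,  @on;
EXEC sp_trace_setevent @TraceID, 12, 12, @on;
EXEC sp_trace_setevent @TraceID, 12, 14, @on;

-- Filter on DatabaseID (column 3): logical AND (0), equals (0)
EXEC sp_trace_setfilter @TraceID, 3, 0, 0, @dbid;

EXEC sp_trace_setstatus @TraceID, 1;          -- 1 = start the trace
```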
Sure enough, we caught a crash while I was out of the office. I had a couple of extra days of trace data to dig through and a very approximate time of the crash. This app runs hundreds of queries a second, which isn't too bad but makes it a bit more difficult to find things. I did load these trace files into a SQL table, but couldn't really figure out a good way to find the gap while the app was down. If you have something to identify gaps in time in a trace table, let me know! 🙂 I eventually scrolled enough and found the last query run before the gap (the app crash).
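One way I could have skipped all that scrolling: compare each event's start time to the previous one and surface the big gaps. This is a sketch assuming the trace was loaded into a table named dbo.TraceData with a StartTime column (both names hypothetical); LAG() needs SQL Server 2012 or later, and on older versions a self-join on ROW_NUMBER() does the same job.

```sql
;WITH ordered AS (
    SELECT StartTime,
           LAG(StartTime) OVER (ORDER BY StartTime) AS PrevStartTime
    FROM dbo.TraceData
)
SELECT PrevStartTime AS GapStart,
       StartTime     AS GapEnd,
       DATEDIFF(SECOND, PrevStartTime, StartTime) AS GapSeconds
FROM ordered
WHERE DATEDIFF(SECOND, PrevStartTime, StartTime) > 60   -- gaps longer than a minute
ORDER BY GapSeconds DESC;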
sp_execute 24, 'blah', 'blah', 'param'
Ok, that doesn't help much… what the heck is 24? Come to find out, 24 is a prepared statement handle, and I needed to find the "declare… sp_prepare 24" statement that ran on that same SPID to figure out what the actual SQL was. I did, but that didn't tell me much either, so I started another trace.
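If you're hunting for the same thing, here's roughly how I went about it, again assuming the trace is loaded into a hypothetical dbo.TraceData table. The handle is only meaningful on the connection that prepared it, so you filter on the same SPID and look backward in time from the sp_execute for the sp_prepare that created it (the SPID and timestamp below are made up).

```sql
SELECT TOP (10) StartTime, CAST(TextData AS nvarchar(max)) AS TextData
FROM dbo.TraceData
WHERE SPID = 61                              -- SPID of the sp_execute call (illustrative)
  AND TextData LIKE '%sp_prepare%'
  AND StartTime < '2012-01-15 09:30:00'      -- time of the sp_execute (illustrative)
ORDER BY StartTime DESC;
```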
In the meantime, the vendor's genius idea was to turn on RCSI (read committed snapshot isolation). I believed the database was a great candidate for RCSI; I just knew it wouldn't solve the crashes. I turned it on with an ALTER DATABASE command but noticed, looking at the sessions DMV, that the transaction isolation level was still 2, or read committed. From the trace I saw that the JDBC driver opens up 8 connections to the database (a connection pool) and sets the transaction isolation level to read committed. The vendor didn't have any information on how to change this. I left it alone for two reasons: one, because I had caught a few more crashes in my new trace, and two, I wanted to alert faster than the users could by phone so we wouldn't get so many complaints.
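For reference, this is the shape of the change and the check (database name illustrative). The seemingly unchanged isolation level is actually expected: RCSI is a database-level behavior that changes what read committed *does*, so sys.dm_exec_sessions keeps reporting level 2 for those sessions either way.

```sql
-- Requires exclusive access; ROLLBACK IMMEDIATE kicks out other connections
ALTER DATABASE [AppDB] SET READ_COMMITTED_SNAPSHOT ON WITH ROLLBACK IMMEDIATE;

-- transaction_isolation_level 2 = ReadCommitted, with or without RCSI
SELECT session_id, transaction_isolation_level, host_name
FROM sys.dm_exec_sessions
WHERE is_user_process = 1;
```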
I decided to look into a way to alert me if this application wasn't running any queries. During business hours it's a safe bet that we wouldn't go 30 seconds without running a query unless the server was down. With the help of #sqlhelp and @alzdba I constructed a query of sys.dm_exec_sessions that I ran with an agent job every 30 seconds. It would either return the last session or no sessions if the connections were dropped.
It was something like:
IF EXISTS ( SELECT TOP 1 [last_request_end_time]
            FROM sys.dm_exec_sessions
            WHERE [is_user_process] = 1
              AND [host_name] = 'APPSERVER'
              -- a query finished within the last 30 seconds
              AND [last_request_end_time] > DATEADD(MILLISECOND, -30000, GETDATE())
            ORDER BY [last_request_end_time] DESC )
BEGIN
    PRINT 'exists';
END
ELSE
BEGIN
    PRINT 'error';
    EXEC msdb.dbo.sp_send_dbmail
        @profile_name    = 'apperror',
        @recipients      = 'firstname.lastname@example.org',
        @copy_recipients = 'ccAppAdmin@domain.com',
        @subject         = 'app has not run a query in a while';
    -- get job id from msdb.dbo.sysjobs table
    EXEC msdb.dbo.sp_update_job
        @job_id  = 'B4DEAB0E-312F-4FE1-8155-BF1DFE70C7C1',
        @enabled = 0;
END
Turns out lunchtime threw a few false positives. I also added some handling to disable the job when it fired so I wouldn't get hundreds of emails.
Now on to my new trace data. I noticed that in 3 out of the 4 crashes there was a particularly interesting parameter in the execute statement. I recognized it as a network userid and looked it up. With hundreds of queries a second, this person, whom I happened to know, showed up as the last person to be doing something before the server crashed. Three times is more than a fluke or a coincidence… that is a pattern.
So one user, a.k.a. Typhoid Mary, was crashing a server that thousands used. It happened to be a very strange combination of actions including new browser sessions with new cookies that threw this app into a “dur” state. SQL trace was able to identify this user and get the application support team closer to a fix.