RSS

Monthly Archives: September 2011

#sqlwinning with server side trace

We have this 3rd party webapp that is used quite heavily inside our organization. Users get rather testy when this thing goes down and for about 3 weeks it was crashing several times a day. By crashing I mean:

no event logs
no log files
no increased CPU/MEM/DISK/NETOWRK utilization
leaves database connections open
no queries being run
http requests just hang

So the server left without even saying goodbye. A quick cycle of a windows service would fix the problem but there was no event to capture to automate. A ticket was opened with the vendor but not much progress was made in a couple weeks.  I am actually a user of the product and it started affecting me so I decided to do some investigation of my own.  I manage the database but not the application.  It was fairly certain to everyone that it was an application problem and not a SQL problem. However, SQL has tools easy enough for someone without an opposable thumb to understand.

I didn’t really expect much out of it but I started a server side trace.  I captured batch completed and RPC completed and filtered on database ID.  Instead of putting extra load on the server by running profiler GUI, I scripted the event and filter select to a query and ran that.  I also set the log file to roll over at 100MB and to stop the trace after 10 files, very important settings to have so you don’t fill up your drives.

Sure enough we caught a crash while I was out of office.  I had a couple extra days of trace data to dig through and a very approximate time of the crash. This app runs hundreds of queries a second which isn’t too bad but makes it a bit more difficult to find things.  I did load these trace files into a sql table, but couldn’t really figure out a good way to find the gap while the app was down.  If you have something to identify gaps in time in a trace table, let me know! 🙂 I eventually scrolled enough and found the last query run before the gap(app crash).

sp_execute 24, ‘blah’, ‘blah’, ‘param’

Ok, that doesn’t help much… what the heck is 24?  Well come to find out I needed to find the “declare… sp_prepare 24” statement that processed on that spid to figure out what the actual SQL was that was running. I did, but that didn’t tell me much either, so I started another trace.

In the meantime, the vendor’s genius idea was to turn on RCSI. I believed the database was a great candidate for RCSI, I just knew it wouldn’t solve the crashes.  I turned it on with an alter database command but noticed looking at the sessions dmo that the isolation level was still 2, or read_committed.  From the trace I saw that the jdbc driver opens up 8 connections to the database (a conn pool) and sets transaction isolation level to read_committed.  The vendor didn’t have any information on how to change this.  I left it alone for two reasons, one because I had caught a few more crashes in my new trace and two I wanted to alert faster than the users could by phone so we wouldn’t get so many complaints.

I decided to look into a way to alert me if this application wasn’t running any queries.  During business hours its a safe bet that we wouldn’t go 30 seconds without running a query unless the server was down.  With the help of #sqlhelp and @alzdba I constructed a query of sys.dm_exec_sessions that I ran with an agent job every 30 seconds. It would either return the last session or no sessions if the connections were dropped.

It was something like:


IF EXISTS
(
  SELECT TOP 1
    [last_request_end_time]
  FROM sys.dm_exec_sessions
  WHERE
    [is_user_process] = 1 AND
    [host_name] = 'APPSERVER' AND
    [last_request_end_time] < DATEADD(MILLISECOND, -30000, getdate())
  ORDER BY last_request_end_time DESC
)
BEGIN
  print 'exists'
END
ELSE
BEGIN
  print 'error'
  exec msdb.dbo.sp_send_dbmail @MailProfile = 'apperror', 'me@domain.com', 'ccAppAdmin@domain.com',,'app has not run query ina while'
  --get job id from msdb.dbo.sysjobs table
  exec msdb.dbo.sp_update_job @job_id = 'B4DEAB0E-312F-4FE1-8155-BF1DFE70C7C1', @enabled = 0
END

Turns out lunchtime threw a few false positives.  Also, I added some handling to disable the job when it fired so I wouldn’t get 100s of emails.

Now on to my new trace data. I noticed that in 3 out of the 4 crashes there was a particularly interesting parameter in the execute statement.  I recognized it as a network userid and looked it up.  With hundreds of queries a second this person, that I happened to know, had showed up as the last person be doing something before the server crashed. 3 times is more than a fluke, or coincidence… that is a pattern.

So one user, a.k.a. Typhoid Mary, was crashing a server that thousands used. It happened to be a very strange combination of actions including new browser sessions with new cookies that threw this app into a “dur” state. SQL trace was able to identify this user and get the application support team closer to a fix.

Advertisements
 
Leave a comment

Posted by on September 28, 2011 in SQL Admin, SQL Dev

 

Thomas LaRock’s #SQLSat84 sessions

Monitoring Databases in a Virtual Environment

The main points that I pulled from this session were

1. Task manager is a dirty filthy liar. Not a whole lot of specifics on this but it was a beginner session.
2. Don’t waste my time in SQL and check vmware ready time first, then go back to SQL.
3. DBAs should have read only access to vSphere to check host and guest counters.
4. Let vmware admins give you a 1vCPU box. Chances are they are faster CPUs and its easy to scale up later.

This was a great beginner session for DBAs new to vmware or looking to virtualize. My one comment is that he seemed to be stuck in a high gear when trying to get up to speed. Or another metaphor would be stuck at 60,000 ft when I would have liked to dive in a little sooner/faster.

Waits and Queues and You

Thomas described the running-waiting-runnable queue quite well including his salty humor. This queue is a first in first out system by design. The order of talking about queues first and then waits was good because the runable status in the queue is the one where the waits are tracked.

He again asked a question and told everyone who raised their hand that they were wasting their time. Note to self: don’t answer Thomas’s questions, however he would probably make a mockery of anyone not raising their hand. The question asked was if anyone was using sp_who regularly. I suppose this was a good lead in for waits that are in the DMOs.

He covered the big 4 CPU/NETWORK/DISK/MEM bottlenecks and what waits to look for when trying to diagnose which one of these is the culprit. I liked the fact that he went over his top 5 waits specifically. He had a demo script that showed how easy it is to query the DMOs and get wait information.

 
Leave a comment

Posted by on September 19, 2011 in PASS, SQL Admin, SQL Dev, Virtual

 

#sqlsat84

SQL Saturday in Kalamazoo was a huge success. The event leader Tim a.k.a. (Brent Ozar PLF) not only did most of the prep work for the event, he delivered a great session on DMOs AND hosted a great speaker/volunteer party the day before.

I volunteered because I have mooched knowledge from both the wmssug and semssug for over a year now without doing anything. I was curious to see the inner workings that make PASS tick. I got that and much more.

Tim really grabbed this corn cob by its ears. To make sure all of my comments are qualified, this was my first sqlsat. The networking, speakers, venue and the prizes are the big four things that make sqlsat’s work. I believe we excelled in all of these areas. If the prizes and venue are lacking you might get some complaints but the important things are that you drawn well known speakers and create opportunities for networking.

The whole OPLF made it to the event and shared their skills. That alone is quite a feat. We planned quite well so I was able to attend most of a few sessions. Thanks specifically to Thomas LaRock, Wendy Pastrick, Eddie Wuerch and Jason Strate for educating me in the many areas of SQL.

The pre and post parties were great opportunities for networking. The vendors came through with great prizes (iPad2, xbox 360, thousands of dollars in software and training ..etc). And finally the venue was fantastic. I’ll be curious if anyone was able to capture its awesomeness in a picture because my droid cam was not able to do that.

 
Leave a comment

Posted by on September 18, 2011 in PASS

 

Age bias in IT? Or, something more complex

I read an intriguing article shared by a colleague of mine. Its well written with actual numbers and stories to back up the opinions.

http://www.computerworld.com/s/article/9219187/Age_bias_in_IT_The_reality_behind_the_rumors?taxonomyId=10&pageNumber=1

I decided to write a response to this article. I am 27 years old with 5 years of full time experience and 7 more years of part time experience in IT. Fortunately (or unfortunately), I have almost no experience interviewing for positions and job hunting.

For you to understand my response to this article, I have to describe my experience at my current employer. Our company values stability. Employees do not get laid off, or juggled around as the economy fluctuates. This allows IT staff to do their job, the same way they have done it for the last 30 years. Decision makers and managers get promoted from within the company. The IT department is slow and resistant to change as a whole but has gotten much better in recent years.

I find this article especially relevant because I work very closely with a large number of COBOL programmers, young, old and in-between. I used to work on an older desktop application in C and when it came time to phase that out I saw most of my co-workers getting sucked into “the dark side”. I made it very clear to my managers, well before they asked, that this was not my preferred career path. “the dark side” was an accurate nickname because our COBOL development environment was a black screen with white text.

Basically, my response to the article is that it isn’t an age bias but a personality bias. As myself, and all others age, we develop a stronger identity. Part of that identity is loosing the need to “try new things” because we know what we like, and we know what we don’t like. Trying something new is a risk and could be a huge waste of time. I have talked to quite a few blatantly obvious dinosaurs. And also some folks who may be half their age getting into the a routine that lacks innovation. You are only as old as you feel… or act for that matter. Physical age doesn’t have much to do with it.

When meeting someone for 5 minutes or less, stereotypes have to come into play. If someone falls into the buckets of “stuck in their ways”, “resistant to change” or a “know-it-all” they are not a good fit for an IT position. One might describe IT as full of unknowns and agile and if you are not those then you should pick a new career path. Being over 50 does not mean you adhere to any of these negative stereotypes but I assume that people, including open minded interviewers, might think that if you have one of these traits you have them all.

The key stat that the article references is that Math/Computer professionals have the highest difference in unemployment between 25-54 and 55+ age groups. I think all this states is that when the economy got bad, IT was the first budget to get cut. Within IT, more efficiency was needed and the 55+ $100,000+ COBOL programmer was the easy choice for most companies. They could be replaced with a framework and college graduates willing to work for 1/2 of that salary.

What would be more relevant is a stat that includes the age of all applicants compared to chosen applicants. I don’t know this number but I imagine that a BIAS might be able to be derived from it.

The end of the article grants some advice to those looking for a job. It says to be flexible which I believe I am. I am will to do a lot of things in IT. Not just SQL Server, .NET, networking and Windows which are my favorites but I would explore much more. Not wanting to be a COBOL programmer was me being honest and a visionary not inflexible.

The user group and community involvement is the 55+ person’s wild card. This kind of active involvement in the community can be very beneficial for someone many years out of college.

 
Leave a comment

Posted by on September 15, 2011 in Uncategorized

 

405 Method Not Allowed and WCF Activation

We deployed some new features to a web application and I got word that some were not working. Its a new application so each problem has a pretty steep learning curve for me. The QA team said that no error messages were shown and it worked on some computers but not others. This initially sounded like a browser problem so I was not involved from the start since my focus is on the servers.

The dev team found using fiddler is that there was an HTTP error of 500. This was happening when trying to call a validation service (.svc) file. To get more out of this message we had to turn on debugging in the web.config file. Then with fiddler I was able to inspect the inner exception message which was the title of this blog.

There are two IIS7 servers that are network load balanced. I had no idea why IIS wouldn’t allow this simple HTTP call to a web service. Some of the first things we did were:

Make sure .NET 3.5 was installed [check]
Make sure .NET 4.0 was installed [check]
Check the application pool identities and settings [check]
recycle app pools [check]
reboot server [check]

None of these worked. These settings were the same on both servers but I decided to do some testing on each server individually. This narrowed down the problem to the second server. All of the files in the web app are replicated using a mesh topology. But the Temporary files in the .NET folders are not so I checked security on those next. That also was not a problem.

I turned on failed request tracing which is a really cool feature in IIS7. It spit out some .xml log files for the 500 error I was getting and showed me that ISAPI was at fault. However, I have done these traces before and had the same result. I think ISAPI is at a higher level than I needed to see to get a good error.

Searching of this 405 error reported some people having trouble with handler mappings. But that didn’t make sense for me because those are stored in the web.config file for each site, and that was the same on the working server and broken server. I started then looking at server level IIS7 settings. The ISAPI restrictions were the same so that wasn’t it.

At this point I decided to go nuclear and uninstall .NET 4.0, IIS7, application server role and .NET 3.5. In 2008 R2 .NET 3.5 is a feature that you can choose to install. I noticed when trying to uninstall that, there are actually some sub features to .NET 3.5. As it turns out, WCF Activation is a very important feature to have installed on a server that uses HTTP service calls.

Once I saw this I re-installed everything that I blew away and sure enough, the code worked just fine. What really gets me is that in all the googling I did there wasn’t a link between 405 and WCF Activiation. Hopefully now there is :]

 
Leave a comment

Posted by on September 14, 2011 in .NET, Network Admin