
Database Mirroring Split-Brain failure scenario

20 Oct

I have a database that needs to be online 24/7/365.242199. OK, so it doesn’t reeeeally need to be online that much. Uptime does not need to be measured in milliseconds for any of my company’s applications. No one’s life depends on this database being online, so I’m quite alright with small outages. When revenue-generating systems go offline, it’s bad, but not life-threatening in my case.

That said, it’s a nice bonus for my reputation if I can keep a database online for a month straight or maybe even a year. However, I need to be careful not to make HA decisions purely based on my reputation.

I will not be upgrading any production or QA servers to SQL 2012 until at least 6 months after RTM. I am making this a rule of my own so I am not tempted by any of the shiny new features in AlwaysOn. I will investigate them, but from what I have heard, all of the solutions still drop connections on failover, so it’s not worth upgrading for that alone. Is a dropped connection an outage? I think so, but I’ll let you talk to your users and decide.

The Project

We chose VMware and two physical host servers for minimal redundancy. VMware allows for flexibility, scalability, and some awesome HA/DR tools. For the server architecture, it was easy to spin up multiple DC/IIS/SQL servers to support this application. Database mirroring was the obvious choice because, with this particular application, I could then patch the SQL Server with a few seconds or less of downtime.

But we only had two physical servers. Automatic failover requires a witness server in addition to the principal and mirror servers. We have separate VMs but only two physical hosts. I put some hard thought into this and decided that the witness VM should be on the same physical host as the mirror. In the event of a hardware failure on the first host, the witness will still be able to “turn on” the mirror. With a DC/IIS/SQL box still running on the surviving host, users wouldn’t have to wait for the VMware HA features to kick in.

Even given the failure scenario I am about to describe, I will still keep it that way.

Boom, disaster strikes

Turns out VLAN settings are a single point of failure. If a vital VLAN is blown away, some servers may still be able to talk to each other, but not to their redundant pairs. My particular failure allowed VMs sitting on the same host to talk to each other, but VMs on different hosts could not. On top of that, the VMs could not talk to anything else, intranet or internet. The host management traffic is on a separate VLAN, so I was still able to access the console of the VMs for troubleshooting.
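To make the failure easier to picture, here is a minimal Python sketch of who could talk to whom. The VM and host names are made up for illustration, and the rule is my simplified reading of the outage (VLAN down means only same-host traffic works), not anything pulled from VMware.

```python
# Hypothetical VM names and host placement, matching the layout described above:
# principal on one host, mirror and witness together on the other.
PLACEMENT = {
    "sql-principal": "host1",
    "sql-mirror": "host2",
    "sql-witness": "host2",
}


def can_talk(a, b, vlan_up):
    """With the VLAN up everyone can talk; with it down only same-host VMs can."""
    return vlan_up or PLACEMENT[a] == PLACEMENT[b]


def reachable_peers(vm, vlan_up):
    """Return the sorted list of other VMs this VM can still reach."""
    return sorted(
        other
        for other in PLACEMENT
        if other != vm and can_talk(vm, other, vlan_up)
    )


if __name__ == "__main__":
    for vm in PLACEMENT:
        print(vm, "->", reachable_peers(vm, vlan_up=False))
```

With the VLAN down, the principal is left with no peers while the mirror and witness can still see each other, which is exactly the setup for the failover described next.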

Those of you familiar with database mirroring might have already guessed what happened, but there is an interesting twist. The witness wasn’t able to talk to the principal. It was able to talk to the mirror because the two were on the same host, so together they had quorum and the mirror took over as principal. Boom, bam, we’ve got two working principals…or maybe not. Of course, I didn’t really think about this scenario until after everything was working again. One of the troubleshooting steps we took was to vMotion all the servers to one host. I believe that at that point the partners could see each other again and the new principal was able to re-sync with the new mirror.

Is that a big deal? Well, it depends. Had internet traffic been flowing into the load-balanced IIS boxes, writes could have been occurring on both databases.

Unless, of course, the original principal, unable to talk to either the witness or the mirror, lost quorum and stopped serving the database. That is how quorum is supposed to work: 2 out of 3 wins.
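The 2-out-of-3 rule can be sketched in a few lines of Python. This is a simplified model under my own assumptions, not SQL Server's actual implementation: the function names are hypothetical, each "group" is a set of instances that can still reach each other, and the mirror is assumed to be synchronized so that automatic failover is allowed.

```python
def serving_partner(group):
    """Return which partner serves the database inside one connected group.

    Simplified model: a group needs at least 2 of the 3 instances
    (principal, mirror, witness) to have quorum.
    """
    if len(group) < 2:
        return None             # isolated instance: no quorum, database not served
    if "principal" in group:
        return "principal"      # the original principal keeps its role
    return "mirror"             # mirror + witness: the mirror fails over


def active_partners(groups):
    """List every partner that is serving the database across all groups."""
    active = []
    for group in groups:
        partner = serving_partner(group)
        if partner is not None:
            active.append(partner)
    return active


if __name__ == "__main__":
    # Hosts cut off from each other: principal alone, mirror with witness.
    print(active_partners([{"principal"}, {"mirror", "witness"}]))
    # Everything healthy again.
    print(active_partners([{"principal", "mirror", "witness"}]))
```

However the three instances get split, at most one group can hold two of them, so in this model there is never more than one partner serving the database at a time.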

#sqlhelp was able to calm my nerves and buy me some time to test this out myself. When designing the solution, I wanted to believe split-brain couldn’t happen, but when the failure actually hit, I had doubts. Even in this scenario, the VLAN came back online and everything was fine. All databases and mirroring connections were restored and synchronized without intervention. Myth busted!

 

Posted by on October 20, 2011 in Network Admin, SQL Admin

 
