restore testing

07 Aug

No good app goes untested.  So why would so many backups go untested?

Fact is, backups are low on the priority chart.  That is the way it should be.  If last night’s backup failed once, no big deal; it will take care of itself tonight.  If it’s been failing for a week… I’ll get to it when I get a chance.

If you have been in the business for a while, you have either stopped reading in disgust or continued on so you can post the most damning comment when you’re done.  I can hear the users now, freaking out that this is how I feel about backups.  For those people, let me explain why.

Some buzz phrases that have been in IT for quite some time are “high availability”, “disaster recovery”, “5 9’s” and “redundancy”.  Hardware and software vendors alike absolutely LOVE these phrases.  They may as well translate into the word “gravy”.  As you can imagine vendors make a killing every time a company hops on the HA bandwagon.  So let me briefly describe these buzzwords:

5 9’s : In IT it means your system is available to users 99.999% of the time.

Redundancy: For full redundancy, every component in a system must have a duplicate that can take over in the event of failure.

Disaster Recovery: If the entire location were lost a second location could be brought online.

High Availability: Usually defined in an SLA with a Recovery Point Objective (RPO) and a Recovery Time Objective (RTO), HA is the combination of the three previous buzzwords.
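To put a number on “5 9’s”, here is a minimal sketch that converts an availability percentage into a yearly downtime budget. The function name and the 365-day year are my own assumptions for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_budget_minutes(availability_percent: float) -> float:
    """Minutes per year a system may be down at the given availability."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for nines in ("99", "99.9", "99.99", "99.999"):
    minutes = downtime_budget_minutes(float(nines))
    print(f"{nines}% available -> {minutes:.1f} minutes of downtime per year")
```

At 99.999% the budget works out to a little over five minutes a year, which is why the hardware needed to hit it gets so expensive.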

I like my “theoretical” examples, so let’s set one up.  A new piece of accounting software has been POC’d and your company is ready to dive right in.  IT is usually brought into the discussion only now, which is way too late.  At this point we IT’ers generally have no say in the software and no chance to review it; we just need to stand up the environment as fast as possible.  Any environment should come with an SLA.  Now would be a good time to do some mock-ups of the environment so you can accurately predict your RPO/RTO.  Remember, your accounting department is rather $$$ savvy, so this is where they might weigh the cost against their wants.  Their wants usually include that the system never goes down.  Explain that this is not possible and that most systems have a worst-case RPO/RTO of 24/2.  This means you take nightly backups of the data (up to 24 hours of data loss) and can have the environment back up in 2 hours.

IT in 1995 DR mode might suggest a system that simply backs up the important files nightly to tape.  If they’re smart, they send their tapes offsite.  In the event of a fire, a restore would go something like this:

1. Buy new hardware
2. Set up hardware
3. Buy new OS (yup, the CD is melted)
4. Install drivers
5. Download app software
6. Set up app software
7. Set up database software
8. Restore files from tape
9. Modify security settings
10. Modify DNS settings to point to new location
11. Notify users environment is up

Ok, so you met your 24 hours of data loss, but how are you doing on time?  If it was just one server and the application software was easy to install, you are looking at about 72 hours of downtime.  1995 mode no longer works for today’s environments with 50+ servers.  With that many servers you would be looking at a month of downtime, and most likely you would not have all the data you need to meet your 24 hours of data loss.
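The single-server numbers above can be checked against the 24/2 SLA with a few lines of Python. This is only a sketch; the `Sla` class and `check_sla` function are names I made up for the illustration.

```python
from dataclasses import dataclass


@dataclass
class Sla:
    rpo_hours: float  # maximum acceptable data loss, in hours
    rto_hours: float  # maximum acceptable downtime, in hours


def check_sla(sla: Sla, backup_interval_hours: float, restore_hours: float):
    """Return (meets_rpo, meets_rto) for a backup schedule and a restore time."""
    meets_rpo = backup_interval_hours <= sla.rpo_hours
    meets_rto = restore_hours <= sla.rto_hours
    return meets_rpo, meets_rto


# Nightly tape backups and the roughly 72 hour single-server rebuild.
sla = Sla(rpo_hours=24, rto_hours=2)
meets_rpo, meets_rto = check_sla(sla, backup_interval_hours=24, restore_hours=72)
print(f"RPO met: {meets_rpo}, RTO met: {meets_rto}")
```

The nightly backup satisfies the data-loss side of the SLA, while the 72 hour rebuild misses the 2 hour downtime target by a wide margin.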

Fast forward to 2010, enter super sweet HA technologies ($$$$$$$$).

Consider the pieces of a fairly simple but modern server farm. Let’s call this Proposal A.
Network Load Balancer
Domain Controller1
Domain Controller2
Web Server1
Web Server2
Web Server3
Database Server Principal
Database Server Mirror
Database Server Witness
Storage Area Network

A user request comes in and hits the NLB, then one of the web servers. The user is authenticated with a domain controller, a database call is made to the principal, data is pulled off the SAN, and the response is sent back to the user. The NLB and SAN have built-in redundancies, so you’re nearly covered. Now multiply every cost by 2 for your 1:1 DR site.

Proposal B, from that server admin stuck in 1995, might include just one server that handles all of this. Fortunately, there will be two hard drives and two power supplies built into this server.

So after that lengthy setup I can now make a decent argument. Factor in variables like hardware life cycle and maintenance costs, and Proposal A is so ridiculously expensive it would take a madman to choose it. At what point did we careen down the path of insanity and throw out all cost-benefit analysis? We have fallen into a trap that is very hard to get out of. Who in your organization will have the courage to pick Proposal B? How about if it is not even a mission-critical application? Chances are that if the cost won’t put your company under, no IT manager will pull the trigger on B.

I say go for it. The odds are with you. Chances are you won’t have any problems with B in the 4-5 year lifespan of the hardware. Just work harder to script everything. Script the Windows patches. Script the entire environment restore. Do this to minimize the downtime, and test your restores. Buy that second server for a cold site in case the unthinkable happens. Have your tapes over there, ready for remotely scripted restores. Don’t buy into the fear tactics of the hardware vendors, which even the government is helping propagate now. All you need is a good plan.
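A scripted restore is also easy to time, which is the whole point of testing it. Below is a rough sketch of a drill runner that executes each restore step in order and compares the total against the RTO. The step functions are hypothetical placeholders; in a real drill each one would call your own restore scripts.

```python
import time


def run_restore_drill(steps, rto_hours):
    """Run each (name, callable) step in order and time it.

    Returns (timings, total_hours, within_rto). An exception in any step
    propagates, because a partial restore counts as a failed drill.
    """
    timings = []
    for name, step in steps:
        start = time.perf_counter()
        step()
        timings.append((name, time.perf_counter() - start))
    total_hours = sum(seconds for _, seconds in timings) / 3600
    return timings, total_hours, total_hours <= rto_hours


# Hypothetical placeholder steps; swap in calls to your real restore scripts.
def restore_files_from_tape():
    time.sleep(0.01)


def modify_dns_settings():
    time.sleep(0.01)


drill_steps = [
    ("restore files from tape", restore_files_from_tape),
    ("modify DNS settings", modify_dns_settings),
]
timings, total_hours, within_rto = run_restore_drill(drill_steps, rto_hours=2)
for name, seconds in timings:
    print(f"{name}: {seconds:.2f}s")
print(f"total: {total_hours:.4f}h, within RTO: {within_rto}")
```

Run the drill against the cold-site server on a schedule and keep the timings, so the RTO you quote in the SLA is a measured number instead of a guess.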


Posted by on August 7, 2010 in Network Admin, SQL Admin

