I have recently been sucked into all that is Disaster Recovery (DR) or Business Continuity Planning. Previously I had dodged the topic a bit. I haven’t really enjoyed the subject because it always seems to distract from my focus on backups and local recovery. I like to focus on the more likely failure scenarios and make sure those are covered before we get distracted. I’m not really sure if that was a good plan or not.
We would have to lose almost our entire datacenter to trigger our disaster recovery plan. A fire in the datacenter, a tornado, or maybe losing our key storage array might trigger DR. Dropping a table in a business application isn’t something you want to trigger a DR plan. Developing a highly available, resilient system is a separate task from developing a DR plan for that system. It was very challenging to get people to finish a discussion of the local recovery problems without falling into the endless pit of DR.
There seem to be two different business reasons for DR: 1. Complete a test of the plan so we can pass an audit once a year and 2. Create a plan so we can actually recover if there were a disaster. The first one comes with a few key caveats: the test must be non-disruptive to the business, it cannot change the data we have copied offsite, and it cannot disrupt the replication of that data offsite.
In a cool or warm DR site, the hardware is powered on and ready but it is not actively running any applications. If I were to approach this problem from scratch, I would seriously consider a hot, active site. I hear metro clusters are becoming more common. Sites that are close enough for synchronous storage replication enable a quick failover with no data loss. A hot site like this would have many benefits, including:
1. Better utilization of hardware
2. Easier Disaster Recovery testing
3. Planned failovers for disaster avoidance or core infrastructure maintenance
However, there are downsides…
1. Increased complexity
2. Increased storage latency and cost
3. Increased risk of disaster affecting both sites because they are closer
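To put a rough number on the latency downside, here is a back-of-the-envelope sketch. It assumes light travels through fiber at about 200,000 km/s (roughly 5 microseconds per kilometer one way) and that each synchronous write waits for one round trip to the second site. The function name and the sample distances are my own, and real arrays add switch, controller, and protocol overhead on top of this, so treat the output as a floor rather than a measurement.

```python
# Rough estimate only. Assumes about 200,000 km/s signal speed in fiber and
# ignores switch, array, and protocol overhead, which add to the real figure.
FIBER_KM_PER_SECOND = 200_000.0


def added_write_latency_ms(fiber_km: float, round_trips: int = 1) -> float:
    """Propagation delay added to each synchronous write, in milliseconds.

    fiber_km is the length of the fiber path between sites, which is usually
    longer than the straight-line distance. round_trips is an assumption;
    some replication protocols need more than one per write.
    """
    if fiber_km < 0 or round_trips < 1:
        raise ValueError("fiber_km must be >= 0 and round_trips must be >= 1")
    one_way_seconds = fiber_km / FIBER_KM_PER_SECOND
    return 2 * one_way_seconds * round_trips * 1000.0


for km in (10, 50, 100, 500):
    print(f"{km:>4} km of fiber: +{added_write_latency_ms(km):.2f} ms per write")
```

By this estimate a site 100 km of fiber away adds about a millisecond to every write before any equipment overhead. That is the trade-off in the lists above: sites close enough to keep that number small are also close enough to share the same tornado.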
Testing is vital. In our current configuration, in order to do a test we have to take snapshots at the cold site and bring those online in an isolated network. This test brings online the systems deemed critical to the business and nothing more. In an active/active datacenter configuration, the test could be much more thorough, because you could actually run production systems at the second site.
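The reason a snapshot-based test can satisfy the audit caveats is easier to see in a toy model. The sketch below is not how any particular storage array works; `ReplicaVolume` and `WritableSnapshot` are hypothetical classes I made up to show the idea that test writes land in a private overlay while replication keeps updating the real copy.

```python
class ReplicaVolume:
    """Hypothetical stand-in for the replicated copy at the DR site."""

    def __init__(self):
        self.blocks = {}

    def apply_replication(self, block_id, data):
        # Replication from production keeps writing here during the test.
        self.blocks[block_id] = data


class WritableSnapshot:
    """Point-in-time view of a replica; test writes go to a private overlay."""

    def __init__(self, replica):
        # A real array would use copy-on-write instead of a full copy.
        # A plain dict copy keeps this sketch simple.
        self._base = dict(replica.blocks)
        self._overlay = {}

    def read(self, block_id):
        if block_id in self._overlay:
            return self._overlay[block_id]
        return self._base.get(block_id)

    def write(self, block_id, data):
        # Never touches the replica, so the offsite copy stays unchanged.
        self._overlay[block_id] = data


replica = ReplicaVolume()
replica.apply_replication(1, "orders-v1")

snapshot = WritableSnapshot(replica)        # start of the DR test
snapshot.write(1, "changed-by-test")        # test systems modify their data
replica.apply_replication(2, "orders-v2")   # replication continues meanwhile

print(replica.blocks)    # the offsite copy holds only replicated data
print(snapshot.read(1))  # the test sees its own change
print(snapshot.read(2))  # the test does not see data replicated after the snapshot
```

The isolated network plays the same role for traffic that the overlay plays for data: the test systems can run without reaching, or being reached by, production.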
The most basic understanding of DR covers the simple fact that we need hardware in a second location. There is much more to DR than a second set of servers. I hope to learn more about the process in the future.