SQL Server Backup Infrastructure

What are your backups sitting on? How fast are restores? Compression? Dedupe? Magic? There are lots of questions to be answered before buying storage for backups.

What is your retention policy?

Space requirements vary greatly with compression and dedupe, but nothing has a greater effect on space than the retention policy. If your legal department gets involved you may have “indefinite retention” on some or all of your backups. That means you can’t delete any backups. Better get that storage vendor on speed dial.

A more realistic retention policy would be 30 days of nightly backups. Another approach would be to keep a week of nightly backups, a month of weekly backups and a year of monthly backups.
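
If you want to ground the policy in real numbers, SQL Server already records the size of every backup in msdb. A rough sketch, assuming SQL 2008 or later (for the compressed_backup_size column); multiply the typical compressed size by the number of backups your policy keeps (30 nightlies, or 7 + 4 + 12 for the tiered scheme above):

-- Typical compressed size of recent full backups, from msdb history
SELECT  database_name,
        backup_finish_date,
        backup_size / 1048576.0            AS backup_mb,
        compressed_backup_size / 1048576.0 AS compressed_mb
FROM    msdb.dbo.backupset
WHERE   type = 'D'  -- full backups
        AND backup_finish_date > DATEADD(DAY, -30, GETDATE())
ORDER BY backup_finish_date DESC;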

What exactly are you sending over the wire?

Unless you are backing up on the same server or SAN, something is going over the wire. That wire is usually the bottleneck in a well-tuned environment. A well-tuned environment is actually shaped more like a can than a bottle, but you get my point.

A full backup means all data, a copy of the space used in your database, is going over the wire. A differential sends only the extents changed since the last full. Turning on compression reduces the size of these files by 50%-80% in my experience. SQL 2008 and up can natively apply this compression, or you can use 3rd party tools from Quest or RedGate to send less data over the wire.
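
For reference, native compression and differentials are just WITH options on the backup command. A minimal sketch; the database name and UNC path are placeholders:

-- Full backup with native compression (SQL 2008 and up)
BACKUP DATABASE [YourDb]
TO DISK = N'\\backuptarget\sql\YourDb_full.bak'
WITH COMPRESSION, CHECKSUM, STATS = 10;

-- Differential: only extents changed since the last full go over the wire
BACKUP DATABASE [YourDb]
TO DISK = N'\\backuptarget\sql\YourDb_diff.bak'
WITH DIFFERENTIAL, COMPRESSION, CHECKSUM, STATS = 10;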

EMC’s Data Domain Boost is not yet publicly available as far as I know, but it’s worth mentioning. With a generic Data Domain target, a full uncompressed copy of your data would have to go over the wire. That would be bad (http://www.brentozar.com/archive/2009/11/why-dedupe-is-a-bad-idea-for-sql-server-backups/). But with the addition of DDBoost, an integrated tool that is supposed to send only unique data over the wire, we have a possibly workable solution. This is slightly better than differentials, which send the changed data over the wire night after night until another full is taken.

Watch out for the simultaneous in/out.

One thing that cropped up and bit me in the arse was backups going in and out at the same time. This can happen in a couple of different scenarios. For starters, heaven forbid you actually have to restore a database during your backup cycle. Can the drives and network support that operation? Or is your restore going to crawl?

Another time this can happen is if you are forced into backing up your backups. Say you have indefinite retention and backups have to be sent to tape. Depending on how fast things are, you might be reading and writing to disk at the same time. You might also be sending and receiving data over the wire at the same time.

Are you sending these backups offsite? If so that might be another opportunity to have multiple simultaneous ins and outs. If you tuned your system for only one operation at a time, you might want to rethink your RTO.
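
You can at least see the overlap happening. Backup and restore commands report progress in sys.dm_exec_requests, so a quick query shows whether your restore is crawling because a backup is running at the same time:

-- In-flight backup/restore operations and their progress
SELECT  session_id,
        command,
        percent_complete,
        estimated_completion_time / 60000.0 AS est_minutes_left
FROM    sys.dm_exec_requests
WHERE   command LIKE '%BACKUP%'
        OR command LIKE '%RESTORE%';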

Scrubbing

Unless you are restoring your database and running CHECKDB, you have to assume your backups are not good. Scrubbing is the process of verifying that data written long ago is still good. Some appliances have this process built in, so they can at least verify that the bits that were written are still the same. A small problem can be blown up by dedupe or compression. Small problems in a backup file can cause restores to fail, and then you will have to call in the experts in data recovery.
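
SQL Server gives you a couple of levels of verification short of a full restore. A sketch, assuming the backups were taken WITH CHECKSUM; the paths and logical file names are placeholders:

-- Cheap check: re-read the backup file and validate page checksums
RESTORE VERIFYONLY
FROM DISK = N'\\backuptarget\sql\YourDb_full.bak'
WITH CHECKSUM;

-- Real check: restore a copy and run CHECKDB against it
RESTORE DATABASE [YourDb_verify]
FROM DISK = N'\\backuptarget\sql\YourDb_full.bak'
WITH MOVE 'YourDb'     TO N'E:\verify\YourDb.mdf',
     MOVE 'YourDb_log' TO N'E:\verify\YourDb_log.ldf';
DBCC CHECKDB ('YourDb_verify') WITH NO_INFOMSGS;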

Reused tapes would frequently have errors. I don’t know anyone backing up to SSDs, but early models had some problems with failure rates. That said, good old-fashioned, enterprise-class HDDs make a good home for data. Adjust your scrubbing intensity based on your experience, and make sure you are not causing more problems than you solve. This process might be pushing a whole lot of IOPS to a shared storage system. Know who your neighbors are and don’t piss them off.

Throughput

I like to simplify a baseline throughput measurement into a simple clock time: a round of full backups took X amount of time. This translates well to the business, who will be paying additional money so that you can achieve their RTO. That said, when tuning the system we have to look at the throughput of each potential bottleneck.
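
The backup history in msdb makes this baseline easy to pull. A sketch that computes clock time and effective throughput for the last week of full backups:

-- Wall-clock time and MB/sec for recent full backups
SELECT  database_name,
        DATEDIFF(SECOND, backup_start_date, backup_finish_date) AS seconds,
        backup_size / 1048576.0 AS mb,
        backup_size / 1048576.0
          / NULLIF(DATEDIFF(SECOND, backup_start_date, backup_finish_date), 0) AS mb_per_sec
FROM    msdb.dbo.backupset
WHERE   type = 'D'
        AND backup_finish_date > DATEADD(DAY, -7, GETDATE())
ORDER BY mb_per_sec;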

The wire is generally the bottleneck. 1Gbps = 125MBps. Make sure you understand the difference between network terminology and storage terminology when it comes to bits and bytes of throughput. If you want to sound like an amateur, just say “MEG” or “MEGS” when the details really do matter. Your mileage may vary, but I have not experienced a 10x improvement when switching from 1Gbps to 10Gbps network ports and adapters. Tuning your MTU size across the network to support larger frames (aka jumbo frames) can help you utilize more bandwidth. Teaming multiple network cards can increase network throughput, and dedicating a team of NICs to a backup network can help with a busy SQL Server that has users during the backup window.

I have seen a RAID5 SATA configuration write sequentially at 500MBps. If you are not concerned about the simultaneous ins and outs, or the potentially random workload from scrubbing, the storage cost can be really low. If you want offsite replication built into the storage product, costs will increase very fast. If you can do this with robocopy and little to no WAN acceleration, a simple pool of NAS drives could be a viable option.

Dedupe and compression can actually cause CPU contention on the SQL Server if it is busy during the backup window. This is important to be aware of for test VMs, which might have a full set of data but only 1 CPU. That might be fine for users, but it could be slowing down a backup stream. In a virtualized environment you may not want to kick off all of your backups simultaneously. Instead, try to schedule backups in streams that you can adjust for the best performance. It is easier to set up and maintain streams than to stagger start times.
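
A minimal sketch of one stream, with the member databases and backup target made up for illustration. Each stream backs up its databases serially, so running one job like this per stream controls concurrency without juggling start times:

-- One backup stream: back up this stream's databases one after another
DECLARE @db sysname, @sql nvarchar(max);
DECLARE stream_cursor CURSOR LOCAL FAST_FORWARD FOR
    SELECT name FROM sys.databases
    WHERE name IN (N'Db1', N'Db2', N'Db3');  -- this stream's members

OPEN stream_cursor;
FETCH NEXT FROM stream_cursor INTO @db;
WHILE @@FETCH_STATUS = 0
BEGIN
    SET @sql = N'BACKUP DATABASE ' + QUOTENAME(@db) +
               N' TO DISK = N''\\backuptarget\sql\' + @db + N'_full.bak''' +
               N' WITH COMPRESSION, CHECKSUM;';
    EXEC sys.sp_executesql @sql;
    FETCH NEXT FROM stream_cursor INTO @db;
END
CLOSE stream_cursor;
DEALLOCATE stream_cursor;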

Conclusion

I highly recommend a solution that gets a copy of your data off server, off SAN and offsite as quickly as possible. I suggest keeping 1 backup set onsite and 1 backup set offsite. This allows for fast restores and disaster recovery.

You may not have a Superdome with 32 NICs, but this is still a good read:
http://download.microsoft.com/download/d/9/4/d948f981-926e-40fa-a026-5bfcf076d9b9/Technical%20Case%20Study-Backup%20VLDB%20Over%20Network_Final.docx

Posted on November 7, 2013 in SQL Admin, Storage

 
