Off the top of my head,
Microsoft DNS issue a handful of hours before xbox one launch(http://redmondmag.com/articles/2013/11/21/windows-azure-outages.aspx)
Widespread Amazon outages (http://www.zdnet.com/amazon-web-services-suffers-outage-takes-down-vine-instagram-flipboard-with-it-7000019842/)
The POTUS’s baby (http://www.healthcare.gov)
I learned about 5 9’s in a college business class. If a manufacturer wants to be respected as building quality products, they should be able to build 99.999% of them accurately. That concept has translated to IT as some kind of reasonable expectation of uptime. (http://en.wikipedia.org/wiki/High_availability)
I take great pride in my ability to keep servers running. Not only avoiding unplanned downtime, but developing a highly available system so it requires little to no planned downtime. These HA features add additional complexity and can sometimes backfire. Simplicity and more planned downtime is often times the best choice. If 99,999% uptime is the goal, there is no room for flexibility, agility, budgets or sanity. To me, 5 9s is not a reasonable expectation of uptime even if you only count unplanned downtime. I will strive for this perfection, however, I will not stand idly by while this expectation is demanded.
Jaron Lanier, the author and inventor of the concept of virtual reality, warned that digital infrastructure was moving beyond human control. He said: “When you try to achieve great scale with automation and the automation exceeds the boundaries of human oversight, there is going to be failure … It is infuriating because it is driven by unreasonable greed.”
IMHO the problem stems from dishonest salespeople. False hopes are injected into organizations’ leaders. These salespeople are often times internal to the organization. An example is an inexperienced engineer that hasn’t been around for long enough to measure his or her own uptime for a year. They haven’t realized the benefit of keeping track of outages objectively and buy into new technologies that don’t always pan out. That hope bubbles up to upper management and then propagates down to the real engineers in the form of an SLA that no real engineer would actually be able to achieve.
About two weeks later, the priority shifts to the new code release and not uptime. Even though releasing untested code puts availability as risk, the code changes must be released. These ever changing goals are prone to failure.
So where is 5 9s appropriate? With the influx of cloud services, the term infrastructure is being too broadly used. IIS is not infrastructure, it is part of your platform. Power and cooling are infrastructure and those should live by the 5 9s rule. A local network would be a stretch to apply 5 9s to. Storage arrays and storage networks are less of a stretch because the amount of change is limited.
Even when redundancies exist, platform failures are disruptive. A database mirroring failover (connections closed), webserver failure (sessions lost), a compute node (os reboots) and even live migrations of vms require a “stun” which stops the cpu for a period of time(a second?). These details I listed in parentheses are often omitted from the sales pitch. The reaction varies with each application. As the load increases on a system these adverse reactions can increase as well.
If you want to achieve 5 9s for your platform, you have to move the redundancy logic up the stack. Catch errors, wait and retry.
Yes, use the tools you are familiar with lower in the stack. But don’t build yourself a nest at every layer in the stack, understand the big picture and apply pressure as needed. Just like you wouldn’t jump on every possible new shiny security feature, don’t jump on every redundancy feature to avoid nestfrastructure.