This content is part of the Essential Guide: Building a disaster recovery architecture with cloud and colocation

Colocation and cloud providers experience outage woes

Cloud infrastructure offerings increased in resiliency in 2015, assuaging the fears of many businesses looking to switch some applications or transition production IT entirely to the cloud. Enterprises want to save money while retaining the same performance, which cloud providers aim to deliver. Granted, 2015 wasn’t a perfect year.

While evaluating cloud providers’ reliability is difficult since there are few independent data sources, it is not impossible. SearchCloudComputing created a general assessment of cloud infrastructure performance in 2015 by combining a few sources of data, including a CloudHarmony snapshot of cloud provider performance over a 30-day period and Nasuni’s reports on the cloud providers that it uses.

In February 2015, Google’s infrastructure-as-a-service offering, Google Compute Engine (GCE), experienced a global outage lasting more than two hours. The outage peaked for forty minutes, during which outbound traffic from GCE experienced 70% loss of flows.
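For a rough sense of scale, an outage like this can be converted into an availability percentage. The sketch below is illustrative only: the `availability_pct` helper, the 30-day window and the round figure of 120 minutes are assumptions made for the arithmetic, not numbers published by Google. Since the outage ran for more than two hours, the real figure would be slightly lower.

```python
def availability_pct(downtime_minutes: float, window_days: float = 30.0) -> float:
    """Return the percentage of the window during which the service was up.

    Hypothetical helper for illustration; the 30-day default is an assumption.
    """
    window_minutes = window_days * 24 * 60
    if not 0 <= downtime_minutes <= window_minutes:
        raise ValueError("downtime must fit inside the window")
    return 100.0 * (1 - downtime_minutes / window_minutes)


# Assumption: exactly 120 minutes of downtime in a 30-day window.
two_hour_outage = availability_pct(120)
print(f"{two_hour_outage:.3f}%")  # prints "99.722%"
```

By this measure, a single two-hour outage in a month already puts a service below the 99.9% level, which allows only about 43 minutes of downtime over the same 30 days.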

Months later, Amazon Web Services (AWS) experienced outages over a weekend in September that affected content delivery giant Netflix and throttled service for other AWS users in the US-East-1 region while recovery efforts took place. Compared to previous years, when AWS experienced some major outages, 2015’s cloud problems were less severe, more of a slowdown than a full stop. However, the list of AWS services affected was longer than the list of services unaffected.

Is Colo the Way to Go?

Even though offerings from cloud providers are improving, some companies found that the cloud just couldn’t handle their business needs. Since 2011, Groupon has been moving away from the cloud and to a colocation provider. Cost drove the online deals company towards running its own data center IT, with its enterprise needs covered in nearly every area, from databases and storage to hosting virtual machines.

However, colocation providers aren’t free of problems. A study of the costs of data center outages from Emerson Network Power and the Ponemon Institute found that UPS system failure accounted for a fourth of all unplanned outages, while cybercrime rose from 2% of outages in 2010 to 22% in 2016.

Verizon’s recent data center outage, which took airline JetBlue offline for three hours and grounded flights, highlights the importance of failover plans and redundant power. Verizon, which runs its own data centers for its telecom business, is a surprising sufferer in this outage scenario, according to some observers.

Companies that run owned data centers aren’t free from the same problems that plague cloud and colocation data centers, from stale diesel fuel to poor disaster recovery planning in advance of an attack, error or natural disaster. Data center IT staff must consider how much oversight they have over potential problem areas, and how much control they want — or can have — over the outage and how it is resolved. Visibility into the outage and its aftermath also will vary from provider to provider.

Join the conversation



Cloud is more hype than capability. Sure, in theory it provides increased elasticity and flexibility, but it is complex at every stage: installation, app/service instantiation, performance, reliability, and troubleshooting.

I have done Tier3 support for mobility and layers 1-3 network services. Take VPN services for enterprises; if outage time exceeds 1 second, many video codecs at the customer sites will choke, causing pixelization and a screaming CEO. GE-Telepresence used to own my life because of problems like this. Now, can you imagine how much more difficult problem isolation becomes and how much more frequently these types of problems will occur when I am running vPE, vRR, and vCE on an OpenStack cloud with the data plane using DPDK or SR-IOV over SmartNICs, and maybe a hypervisor is in the path and packets are being silently dropped there?

What about the fact that there is now an opensource craze on the software side, and now Open Compute Project "opensource" hardware design?

Does software reliability get better or worse by virtue of adding severely under-tested opensource from multiple 3rd parties? Answer: worse.

Does software reliability get better or worse with Agile development techniques? Answer: worse.

Does the entire discipline of Software Reliability Engineering (SRE) get dropped in Agile? Answer: Yes.

Does traditional Failure Mode and Effects Analysis (FMEA) take a back seat given the focus on software? Answer: Yes.

With so many more layers of opensource 3rd party software, with abandonment of SRE due to Agile, with little to no focus on FMEA, with the abandonment of just about every software and hardware engineering best practice developed over the years, for example SRE developed by John Musa at Bell Labs, is it any wonder that we are having nightmare outages in cloud?

This is like the Climate Change debate; that is, "debate" is a misnomer. In essence, "Cloud", like Climate Change, is not based on sound engineering and science, but rather, it has become a religion. One simply cannot abandon the well established software reliability practices, engineered and refined over the years, and expect a largely software-based services environment to deliver reliable services. It is the definition of insanity.

Cloud is even extremely cumbersome to orchestrate. Have you ever tried using Mirantis Fuel to deploy a cloud? Wow, the amount of switch and server configuration you must do, largely manual, defeats the purpose of Fuel as an automation tool for OpenStack deployment. In addition, once the cloud is finally deployed, there is a complete set of new headaches related to the actual services one plans to offer.

Cloud is a religion, one that many executives have latched onto. They believe it will reduce OpEx substantially, and at the same time, they have their heads in the sand regarding SLAs and breach of contract as they abandon the very engineering principles invented to ensure services reliability.

If you want a secure job for the foreseeable future, and you don't mind 24x7 on-call duty, endless conference calls with angry enterprise customers, and an inability to see silent packet loss inside the many layers of poorly tested 3rd party opensource, then become a Tier2 or Tier3 "Cloud" support specialist. Surely you will have a well paying job until executives realize that we still need ASICs and FPGAs and best practice software reliability engineering in order to meet SLA and contract commitments.

