Cloud computing depends on connectivity and availability, but those can be far from perfect with some providers. Disruptions to a public cloud can impair the productivity of an entire organization unless data center administrators know what to do when they happen, and they will happen. In this tip, you will learn how to choose a public cloud provider and how to best respond to outages.
Learning to accept cloud crashes
Cloud disruptions are a fact of life, even among the most common providers. Amazon Web Services
had two major outages in 2011—one in April in its Virginia data center and one in August in its
Ireland data center, which was literally hit by lightning. Both crashes caused major disruptions to
hundreds of Amazon cloud customers. What followed was an inevitable debate on the viability of
using a public cloud for mission-critical applications.
Although major disruptions like these get the attention of the media and the blogosphere, there is a bigger, ongoing issue with the reliability of public clouds such as Amazon EC2: they are not designed to be reliable in the first place. Instances on Amazon’s cloud crash all the time.
Major catastrophic crashes and ongoing instance failures underscore that ensuring high availability in the cloud is more complex,
Public cloud failures at different levels
It’s important to understand that failures in the cloud can occur at different levels,
according to George Reese, chief technology officer at enStratus Networks Inc., which offers
infrastructure management services to major cloud providers. Reese describes those different levels
as “the five levels of redundancy”:
- Physical machine level
- Virtual machine level
- Availability zone level
- Regional/data center level
- Cloud provider level
Not all public cloud providers offer all of these levels. For example, very few—if any—cloud providers besides Amazon offer availability zones, which are data centers within a single geographical location that are insulated from each other. The idea behind availability zones is that if one data center fails, it won’t drag the others down.
But even though Amazon does offer availability zones, it does not provide visibility into or control of the physical machine level. As a rule of thumb, the further down this list you build redundancy, the more reliable the application becomes, but the more complex and expensive it is to build. Let’s review that trade-off in more detail:
Incorporating redundancy
Redundancy at the physical machine level is a familiar concept and a practice in many
traditional data centers. Because of that, it is the least complex and the least expensive option.
Many public cloud services offer this type of redundancy automatically and as part of the standard
price of the service. For example, Amazon’s Elastic Block Store service automatically replicates
all data to a separate physical machine.
Similarly, at the virtual machine level, there are many known practices—and available commercial and open source products—for maintaining high availability through load balancing, replication and fail-over. This is true both for traditional data centers and public clouds.
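To make the load-balancing and failover idea concrete, here is a minimal sketch in Python of a round-robin balancer that skips instances marked as down. The class name, method names and instance labels are all invented for illustration; a production setup would normally rely on one of the commercial or open source products mentioned above rather than hand-rolled code.

```python
class RoundRobinBalancer:
    """Rotate requests across instances, skipping any marked unhealthy."""

    def __init__(self, instances):
        self.instances = list(instances)
        self.healthy = set(self.instances)
        self._next = 0

    def mark_down(self, instance):
        # Called when a health check (not shown) reports a failure.
        self.healthy.discard(instance)

    def mark_up(self, instance):
        if instance in self.instances:
            self.healthy.add(instance)

    def pick(self):
        """Return the next healthy instance, or raise if none remain."""
        for _ in range(len(self.instances)):
            candidate = self.instances[self._next]
            self._next = (self._next + 1) % len(self.instances)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy instances available")


# Hypothetical instance names; "vm-b" simulates a crashed virtual machine.
balancer = RoundRobinBalancer(["vm-a", "vm-b", "vm-c"])
balancer.mark_down("vm-b")
picks = [balancer.pick() for _ in range(4)]
print(picks)  # ['vm-a', 'vm-c', 'vm-a', 'vm-c']
```

Traffic keeps flowing to the surviving instances, and calling `mark_up` puts a recovered instance back into rotation.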
Designing for high availability at the physical machine level and at the virtual machine level is familiar to Web application developers, who now have established best practices for building these types of distributed apps. It would be a challenge, however, for legacy applications that were designed with a more monolithic, client-server architecture. For this reason, legacy applications are better suited to solid Infrastructure as a Service clouds, such as Verizon’s Terremark Worldwide Inc., Bluelock or Virtacore Systems Inc., than to liquid clouds such as AWS or GoGrid.
All that said, there is still some complexity involved in designing applications for high availability in volatile environments such as Amazon EC2. It puts a big burden on developers.
This is where Platform as a Service vendors such as Salesforce.com’s Heroku Inc. come in. They run on top of Infrastructure as a Service (IaaS) providers such as Amazon and promise to handle many of the complexities of running applications in dynamic environments.
Two kinds of IaaS public clouds
Two basic models of IaaS public clouds—called liquid clouds and solid clouds—are
emerging, each with an almost diametrically opposed philosophy behind it.
Amazon EC2 is the epitome of the liquid cloud. It was built with the philosophy of unreliable
hardware and reliable software. In other words, expect the infrastructure to fail and fail often
and design your applications to deal with it.
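One common way an application "deals with it" is to retry failed calls with an increasing delay instead of giving up on the first error. The sketch below is only an illustration of that idea, not a description of how any particular provider or library works; `call_with_retries` and `flaky_fetch` are hypothetical names, and the retry counts and delays are arbitrary.

```python
import time


def call_with_retries(operation, attempts=4, base_delay=0.1, sleep=time.sleep):
    """Run operation(), retrying with exponential backoff on ConnectionError."""
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of retries: surface the failure to the caller
            sleep(base_delay * 2 ** attempt)


# Hypothetical flaky dependency: fails twice, then succeeds.
calls = {"count": 0}


def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("instance unavailable")
    return "ok"


# Record the delays instead of actually sleeping, to keep the demo fast.
delays = []
result = call_with_retries(flaky_fetch, sleep=delays.append)
print(result, calls["count"], len(delays))  # ok 3 2
```

The caller sees a single successful result even though the underlying instance failed twice along the way.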
Solid clouds follow the philosophy of reliable hardware and unreliable software. They are offered
by companies such as Bluelock and Verizon’s Terremark Worldwide Inc. They use expensive proprietary
hardware and are suitable for legacy applications that were not designed for frequent hardware
failures.
| | Liquid | Solid |
| --- | --- | --- |
| Alternative names | commodity, webscale, “design for failure” | enterprise, legacy, traditional |
| Hardware | cheap, unreliable, commodity | expensive, reliable, proprietary |
| Isolation | public, shared | private, dedicated |
| Provisioning | minutes, self-service, API | hours/days, professional service |
| Automation | high | low |
| CloudOS, platform | open source | typically VMware |
| Customer acquisition and onboarding | low-touch | high-touch |
| Environment | homogeneous | heterogeneous |
| Price | $ | $$$ |
High availability across data centers
At the availability zone level, things get a little more complicated but are still within the
realm of expertise of many developers. This is because Amazon, possibly the only public cloud
provider that offers the availability zone concept, provides various tools for maintaining high
availability across availability zones.
One such tool is Amazon EC2 Elastic IP addresses, static addresses that mask the failure of
specific instances or availability zones. If an instance fails, its address can be remapped
programmatically to an instance in another availability zone.
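The remapping logic can be sketched with a toy model. Everything below is hypothetical: `AddressTable` is an in-memory stand-in for a provider's address mapping, not Amazon's API, and the instance identifiers, zone names and IP address are made up. In a real deployment the `associate` step would be a call to the provider's API.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    instance_id: str
    zone: str
    healthy: bool = True


class AddressTable:
    """Toy stand-in for a provider's static-address mapping (not a real API)."""

    def __init__(self):
        self._bindings = {}

    def associate(self, address, instance):
        self._bindings[address] = instance

    def target(self, address):
        return self._bindings[address]


def fail_over(table, address, standbys):
    """Rebind address to the first healthy standby in a different zone."""
    current = table.target(address)
    if current.healthy:
        return current  # nothing to do
    for candidate in standbys:
        if candidate.healthy and candidate.zone != current.zone:
            table.associate(address, candidate)
            return candidate
    raise RuntimeError("no healthy standby outside zone " + current.zone)


primary = Instance("i-primary", "zone-a")
standby = Instance("i-standby", "zone-b")
table = AddressTable()
table.associate("203.0.113.10", primary)

primary.healthy = False  # simulate the primary instance crashing
active = fail_over(table, "203.0.113.10", [standby])
print(active.instance_id, active.zone)  # i-standby zone-b
```

Clients keep using the same address throughout; only the instance behind it changes.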
That said, data center admins still need to maintain multiple copies of various components of the app, which adds to costs. And although Amazon has said there is no single point of failure among availability zones within a single region, this has already been disproven in both the April and August outages.
Moving down the levels to regions and cloud providers, things get significantly more complicated and expensive. For one thing, connections among regions and among providers go over the public Internet, which is much less reliable and has higher latency.
So the logic that addresses the move between these data centers will need to be much more sophisticated and will have to address a number of scenarios, especially to prevent data and application state inconsistencies. And although Amazon does not charge for data transfers between availability zones, such communication across regions and outside cloud providers may become costly.
It can become even more complex when attempting distributed architectures across cloud
providers—for example, Amazon and Rackspace—because each uses different APIs and other management
approaches.
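One way to contain the differing-API problem is to have the application depend on a single provider-neutral interface, with one adapter per provider. The sketch below assumes two invented "SDK" classes with deliberately different call shapes; they do not mirror Amazon's, Rackspace's or any other real provider's API, and all class and method names are hypothetical.

```python
from abc import ABC, abstractmethod


class CloudProvider(ABC):
    """Provider-neutral interface that application code depends on."""

    @abstractmethod
    def launch(self, image: str) -> str:
        """Start a server from an image and return its identifier."""


# Invented stand-ins for two providers' client libraries (not real APIs).
class FakeSdkOne:
    def __init__(self, down=False):
        self.down = down

    def start_machine(self, image_id):
        if self.down:
            raise ConnectionError("provider one is unreachable")
        return {"id": f"one-{image_id}"}


class FakeSdkTwo:
    def create_server(self, template, size):
        return f"two-{template}-{size}"


class AdapterOne(CloudProvider):
    def __init__(self, sdk):
        self.sdk = sdk

    def launch(self, image):
        return self.sdk.start_machine(image_id=image)["id"]


class AdapterTwo(CloudProvider):
    def __init__(self, sdk, size="small"):
        self.sdk = sdk
        self.size = size

    def launch(self, image):
        return self.sdk.create_server(template=image, size=self.size)


def launch_with_fallback(providers, image):
    """Try each provider in order and return the first server identifier."""
    failures = []
    for provider in providers:
        try:
            return provider.launch(image)
        except ConnectionError as exc:
            failures.append(str(exc))
    raise RuntimeError("all providers failed: " + "; ".join(failures))


# Provider one is simulated as down, so the launch falls through to provider two.
providers = [AdapterOne(FakeSdkOne(down=True)), AdapterTwo(FakeSdkTwo())]
server_id = launch_with_fallback(providers, "web-image")
print(server_id)  # two-web-image-small
```

The adapters hide the API differences, but not the harder problems described above, such as keeping data and application state consistent across providers.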
Third-party vendors in the fast-paced cloud computing business are coming up with solutions to
these issues. Cloudsoft Corp., RightScale Inc. and enStratus are examples of companies that offer
various multicloud solutions for application mobility and disaster recovery. Xeround is another
company tackling the particularly sticky problem of relational database cross-cloud data mobility
and disaster recovery.
In the end, each organization will need to decide on the level of high
availability each of its applications requires. It then needs to make a trade-off decision, for
example, on how much it is willing to invest to prevent the relatively rare occasions of
availability zone, regional and public cloud provider failures.
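That trade-off can be roughed out with simple arithmetic. The sketch below assumes that redundant copies fail independently, which is optimistic given that the 2011 outages affected multiple availability zones at once. Every figure in it (99% availability per copy, $1,000 per hour of downtime, $5,000 per year per extra copy) is a made-up placeholder, not data from any provider.

```python
HOURS_PER_YEAR = 24 * 365


def combined_availability(single, replicas):
    """Availability of independent replicas when any one of them suffices."""
    return 1 - (1 - single) ** replicas


def expected_downtime_cost(availability, cost_per_hour):
    """Expected yearly cost of downtime at the given availability."""
    return (1 - availability) * HOURS_PER_YEAR * cost_per_hour


def worth_adding_replica(single, replicas, cost_per_hour, replica_cost_per_year):
    """True if one more replica saves more in downtime than it costs to run."""
    before = expected_downtime_cost(
        combined_availability(single, replicas), cost_per_hour)
    after = expected_downtime_cost(
        combined_availability(single, replicas + 1), cost_per_hour)
    return (before - after) > replica_cost_per_year


# Hypothetical inputs for illustration only.
second_copy = worth_adding_replica(0.99, 1, 1000, 5000)
third_copy = worth_adding_replica(0.99, 2, 1000, 5000)
print(second_copy, third_copy)  # True False
```

With these placeholder numbers the second copy pays for itself many times over, while the third does not, which is the kind of diminishing return each organization has to weigh for its own applications.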
About the expert: Geva Perry writes the blog Thinking Out Cloud and has been named as one of the Top 25 Most Influential People in the Web Hosting Industry, Top 50 Cloud Computing Bloggers and one of the 12 Top Thinkers in Cloud Computing. Perry has been an adviser to cloud computing companies, including Deutsche Telekom, NEC, Internap, New Relic, Twilio, Sauce Labs, Heroku/Salesforce.com, Xeround, Totango, JFrog, Cloudsoft and others. He also serves on the board of directors of Upstream Commerce and BlazeMeter. Previously he was CMO of GigaSpaces.
This was first published in November 2011
Data Center Strategies for the CIO