Small World Big Data
Published: 17 Nov 2015
A big challenge for IT is managing big clusters effectively, especially with bigger data, larger mashed-up workflows, and the need for more agile operations.
Cluster designs are everywhere these days. Popular examples include software-defined storage, virtual infrastructure, hyper-convergence, public and private clouds, and, of course, big data. Clustering is the scale-out way to architect infrastructure to use commodity resources like servers and JBODs. Scale-out designs can gain capacity and performance incrementally, reaching huge sizes cost-effectively compared to most scale-up infrastructure.
Big clusters are appealing because they support large-scale convergence and consolidation initiatives that help optimize overall CapEx. So why haven't we always used cluster designs for everyday IT infrastructure? Large cluster management and operations are quite complex, especially when you start mixing workloads and tenants. If you build a big cluster, you'll want to make sure it gets used effectively, and that usually means hosting multiple workloads. As soon as that happens, IT has trouble figuring out how to prioritize or share resources fairly. This has never been easy -- the total OpEx in implementing, provisioning, and optimally managing shared clustered architectures is often higher than just deploying fully contained and individually assigned scale-up products.
When clustering in a virtualized infrastructure, it's the job of the hypervisor to enforce sharing, isolate noisy neighbors, dynamically migrate and/or restart impacted or suddenly demanding workloads, and generally play traffic cop. We've seen great progress in this space over the years, to the point where we can dynamically enforce user-specified quality of service (QoS) at the level of the virtual machine and virtual storage volume (e.g., VMware VVOLs).
Of course, one could interpret the whole idea of an infrastructure cloud (e.g., OpenStack) as a large, optimally managed cluster of resources. Still, virtual and cloud infrastructure platforms have taken years to mature and still aren't perfect. There are miles to go in developing cloud management tools that make it as easy to implement these systems as it is to deploy dedicated equipment. And inside a virtualized environment, it's still hard to ensure that an application in a virtual machine can deliver a guaranteed response time to an end-user.
Cluster management tools for big data
Effective cluster design is especially important to big data, which is all about bringing HPC technology like clustering at scale to enterprise IT. Hadoop, Spark and some scalable NoSQL tools are designed to make distributed processing feasible for everyone. However, production big data applications are just now requiring consistent application performance. When big data applications underpin key business processes, reliable operations and consistent performance matter.
In vanilla big data tools such as a Hadoop cluster, every big data job competes for the same resources. Up until now, many Hadoop clusters simply hosted a single big data process or served a small group of users, often in a non-production data science environment. But as big data clusters move into production, they are usually expected to host multiple jobs and serve multiple tenants -- just like big virtualization or cloud clusters. And when that cluster is shared -- and it usually is -- managing big data performance becomes a big challenge.
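One common way to reason about sharing a cluster among competing tenants is max-min fairness: split capacity evenly, let tenants that need less than their share keep only what they need, and redistribute the leftover among the rest. The Python sketch below is illustrative only. The function name, tenant names and numbers are hypothetical, and it is not a description of how Hadoop or any particular scheduler is implemented.

```python
def max_min_fair_share(demands, capacity):
    """Split `capacity` across tenants using max-min fairness.

    `demands` maps a (hypothetical) tenant name to the resource units it
    wants, e.g. CPU cores. No tenant receives more than it asked for.
    """
    allocation = {name: 0.0 for name in demands}
    unsatisfied = dict(demands)      # tenants still waiting for their grant
    remaining = float(capacity)

    while unsatisfied and remaining > 1e-9:
        equal_share = remaining / len(unsatisfied)
        # Tenants whose whole demand fits inside an equal share.
        fits = {n: d for n, d in unsatisfied.items() if d <= equal_share}
        if not fits:
            # Everyone left wants more than an equal share: split evenly.
            for name in unsatisfied:
                allocation[name] += equal_share
            remaining = 0.0
            break
        # Satisfy the small tenants fully and recycle what they leave over.
        for name, demand in fits.items():
            allocation[name] += demand
            remaining -= demand
            del unsatisfied[name]

    return allocation


# Hypothetical demands on a 100-unit cluster.
shares = max_min_fair_share({"analytics": 20, "etl": 40, "reporting": 70}, 100)
print(shares)
```

In this example the two smaller tenants get everything they asked for, and the largest tenant absorbs the shortfall. That is the kind of trade-off a shared cluster has to make explicit once it hosts more than one workload.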
It's not surprising then to see new approaches to big data cluster management and operations. Cluster management tools tend to fall into a couple of categories. The first covers deployment and provisioning. Bright Computing, for example, got its start in the high-performance computing space and is now being used in the enterprise to help deploy, provision and manage large clusters from bare metal.
But the real trick is performance management, the key to which is knowing who's doing what, and when. At a minimum, there are standard tools that can generate reports out of the (often prodigious) log files collected across a cluster. But this approach gets harder as log files grow. And when it comes to operational performance, what you really need is to optimize QoS and runtimes for mixed-tenant and mixed-workload environments. For example, Pepperdata assembles a live run-time view of what's going on across the cluster, and then uses that insight to dynamically control the assignment of cluster resources. This ensures priority applications meet service-level agreements while minimizing the cluster infrastructure needed.
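To make the idea of priority-driven resource assignment concrete, here is a minimal Python sketch. It is not Pepperdata's method or any vendor's API. The `Job` fields, the priority scale, the job names and the single "resource units" figure are all assumptions made for illustration. The sketch takes a snapshot of current demand and grants capacity to the highest-priority jobs first.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str        # hypothetical job name
    priority: int    # assumed scale: higher number = more important
    demand: float    # resource units requested right now, e.g. CPU cores


def allocate(jobs, capacity):
    """Grant capacity in priority order; lower priorities get what is left."""
    grants = {}
    remaining = capacity
    for job in sorted(jobs, key=lambda j: j.priority, reverse=True):
        granted = min(job.demand, remaining)
        grants[job.name] = granted
        remaining -= granted
    return grants


# A hypothetical live snapshot of demand on a 100-unit cluster.
snapshot = [
    Job("nightly-etl", priority=1, demand=60.0),
    Job("fraud-scoring", priority=3, demand=50.0),
    Job("ad-hoc-query", priority=2, demand=30.0),
]
grants = allocate(snapshot, capacity=100.0)
print(grants)
```

Re-running `allocate` each time the snapshot changes is what would make the control dynamic. A real tool would also have to handle multiple resource types, preemption and jobs that are already running, all of which this sketch leaves out.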
At a higher level, big data deserves its own version of application performance management. One example is Concurrent's Driven, which tracks historical and ongoing application execution, providing direct visibility into business and application level workflows, their application inter-dependencies, runtimes and failures. This helps identify code bottlenecks, plan and fix workflow execution windows, and even assist with data governance.
As more scale-out architectures land in the data center, the value proposition for cluster management tools will grow -- minimizing the CapEx needed to share resources while helping guarantee performance and other QoS for big data business processes. And in some cases, IT just may not be able to stand up big data clusters effectively without these cluster-specific tools.
Overall, we think that big data cluster management will mature much faster than the decade or so it took virtualization management to handle high-priority production workloads. By learning from their virtualization and cloud predecessors, big data cluster management tools may even supplant them with a more complete clustered data center vision.