Small World Big Data
Published: 21 Sep 2017
It's ironic that we spend a lot of money on proprietary databases, business applications and structured business intelligence platforms for "little" data, but we turn to open source platforms for big data analytics. Why not just scale down free, big data open source systems to handle the little data too?
Of course, there are a number of real reasons, including minimizing risk and meeting enterprise-class data management requirements. Cost probably isn't even the first criterion for most enterprises. Even when it comes to cost, open source doesn't mean free in a real economic sense. Open source strategies require cutting-edge expertise, professional support and often a buy-up into proprietary enterprise-class feature sets. The truth is, open source platforms don't necessarily maximize ROI.
Still, open source strategies create attractive opportunities for businesses that want to evolve their aging applications. Many IT investment strategies now include a core principle preferring open source for new applications. In fact, we'd claim open source now represents the fastest growing segment of enterprise IT initiatives. From a theoretical point of view, when it comes to developing new ways of doing business, new types of agile and web-scale applications, and new approaches to analyzing today's ever-bigger data, open source presents innovative opportunities to compete and even disrupt the competition.
But this is much easier said than done. We've seen many enterprises fumble with aggressive open source strategies, eventually reverting to tried-and-true proprietary software stacks. So if enterprises aren't adopting open source because it's cheaper, and it often lacks enterprise-class features, then why has it become such a popular strategy?
Adopting open source strategies goes hand in hand with an ability to attract top technical talent, Rajnish Verma said at the Dataworks Summit in June, when he was president of big data software vendor Hortonworks. Smart people want to work in an open source environment so they can develop in-demand skills, establish broader relationships outside a single company and potentially contribute back to a larger community -- all part of building a personal brand, I suppose.
In other words, organizations adopt open source because that's what today's prospective employees want to work on.
All open, or mostly open?
When organizations adopt open source strategies, they rarely intend to dive into the source code. That would require hiring internal miracle workers -- an expensive proposition. Instead, they contract for support, usually with a vendor that's a primary contributor to the open source project.
Often, but not always, this is the company that has many of the original open source project contributors on staff, and continues to make the most commits back to the code base. Sometimes, like with big data analytics, this gets competitive, resulting in several downstream distributions -- each from a different vendor.
In addition, it's common to find that the business model of an open source distributor includes layering on some proprietary licensed components that provide enterprise features. In other words, I can download the application and run it freely on my laptop, but IT shops will likely need to deploy an enterprise version -- as a license or through a support contract -- to get all the security, scale, service levels and management features they need in a production data center.
As an example, the main Apache Hadoop distribution vendors include Hortonworks, Cloudera and MapR Technologies. It is important to note that as the big data space has evolved over the last 10 years, none of these companies still actively promote themselves as a Hadoop distribution. In fact, they are all repositioning and rebranding as next-generation data processing platforms. In particular, they are no longer tied to any specific project but able to evolve a broader field of offerings and value propositions -- and not limited, for example, to the original Hadoop MapReduce or even the Hadoop Distributed File System.
In general, big data distribution vendors have added enterprise features for operations and governance, SQL and business application support, and have evolved real-time components, Spark support and a focus on machine learning. The future for all of these companies may revolve around how they come to support containers, the internet of things, cloud and edge architectures. But, of course, each vendor focuses on carving out a niche:
- Hortonworks. Taking the high road in terms of open source purity, Hortonworks is striving to be the big data analytics platform for other major IT vendors, with deep partnerships with Microsoft and IBM. Somewhat ironically, this requires customized integration with each partner's proprietary stack.
- Cloudera. From the start, the market leader focused on directly meeting enterprise needs and helping IT transition to next-gen applications. Cloudera also helps IT move away from expensive legacy enterprise data warehouse stacks to more cost-effective operational intelligence platforms.
- MapR. MapR eschewed the plain Hadoop Distributed File System and created a fully transactional -- and, as a storage analyst, I would say software-defined -- big data storage layer. Today, MapR's platform works as well for scalable container hosting as it does for converged application types, such as online analytical and transaction processing and big data.
Maybe this is not how these vendors present themselves in their own elevator pitches, but you get the idea that, with multiple open source vendors, there are important differences in mission and vision. An IT shop just kicking the tires will find a lot in common among them at a functional level, whereas key differences might show up later in production operations.
Open is as open does
IT managers would do well to recognize the real drivers pushing their organizations toward open source strategies and strive to get ahead of them. Employee demand will likely make open source inevitable. And there is probably already an internal need for some deeper experience and knowledge about just what's available in all those public open source projects.
If you have been avoiding open source, it's best to implement some smaller projects before taking on major open source initiatives. Encourage employees to take ownership of some open source sandboxes. Play with different distributions and learn where you can use mostly open applications. Learn about the edge elements of key open source offerings, such as how to secure and manage them and integrate them into existing architectures. And remember that although the future of IT will be more open with bigger data, it's still a small world.