BOSTON -- IT pros have been suckers for a constantly changing big data marketing message.
That's what database pioneer and A.M. Turing Award-winner Michael Stonebraker told the estimated 1,000 attendees here at the HP Big Data conference this week.
How much "you all listen to and believe the marketing hype" has been one of the biggest surprises in Stonebraker's 40-year career, he said. His career includes co-creating Vertica, the analytics platform that became an HP product after its purchase of Vertica Systems, Inc. in 2011.
Stonebraker was awarded the A.M. Turing Award -- considered the most prestigious computer science award -- in 2014 by the Association for Computing Machinery for his fundamental contributions to the concepts and practices underlying modern database systems.
Big data analytics evolves, requires data scientists
Until the mid-2000s, all data warehouse products were row stores. Stonebraker said he disrupted the industry with column stores, which are now widely used by data warehouse products.
"It wasn't that Vertica invented column stores -- they had been around -- but Vertica challenged the then-incumbents, and challenged the entire industry to move from row stores to column stores," he said.
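The row-store versus column-store distinction Stonebraker describes can be sketched in a few lines of plain Python. This is an illustration, not Vertica code: the same toy table laid out both ways, showing why an analytic aggregate favors columnar layout.

```python
# Illustrative sketch (not Vertica code): one table, two physical layouts.

# Row store: each record stored together -- good for fetching whole rows,
# as in transactional workloads.
row_store = [
    {"id": 1, "region": "east", "sales": 100.0},
    {"id": 2, "region": "west", "sales": 250.0},
    {"id": 3, "region": "east", "sales": 175.0},
]

# Column store: each column stored contiguously -- an aggregate over one
# column reads only that column, not every field of every row.
column_store = {
    "id": [1, 2, 3],
    "region": ["east", "west", "east"],
    "sales": [100.0, 250.0, 175.0],
}

# SELECT SUM(sales): the column store touches a single list.
total = sum(column_store["sales"])
print(total)  # 525.0
```

A warehouse query that sums one column of a billion-row, hundred-column table only has to scan one-hundredth of the data in the columnar layout, which is the advantage that pushed the industry from row stores to column stores.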
In addition to pointing out his peers' naivety about big data hype and discussing his own accomplishments, the adjunct professor in the Department of Electrical Engineering and Computer Science at MIT offered his view of the future of big data.
Organizations have a big data problem when too much data comes at them, too fast, from too many places.
HP Vertica can do SQL analytics on petabytes of data -- a solved problem, Stonebraker said.
The change will come when business analysts who work with SQL on large amounts of data give way to data scientists performing more sophisticated analysis: predictive modeling, regressions and Bayesian classification.
"That stuff at scale doesn't work well on anyone's engine right now. If you want to do complex analytics on big data, you have a big problem right now," he said.
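To make the contrast with plain SQL aggregation concrete, here is a minimal sketch of the kind of analysis Stonebraker means: a least-squares regression, written in pure Python on toy data. Real data-science work would run this kind of model through libraries or scale-out engines; the code only illustrates the technique.

```python
# A minimal ordinary-least-squares fit for y = a + b*x, illustrating the
# "regression" class of analytics Stonebraker says doesn't yet run well
# at scale. Toy data and pure Python, for illustration only.

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x on paired samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is covariance(x, y) divided by variance(x).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    b = cov / var
    a = mean_y - b * mean_x
    return a, b

# Toy example: y is roughly 2x + 1 plus noise.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]
a, b = fit_line(xs, ys)
print(round(b, 2))  # slope close to 2
```

Unlike a SQL SUM or GROUP BY, this kind of model fitting involves iterative or multi-pass numeric computation over the data, which is why it strains engines built for set-oriented queries.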
The velocity problem -- data coming at an organization too fast -- may be addressed by stream processing with Apache Storm paired with a high-performance transactional in-memory database system, such as MemSQL or VoltDB, or by Apache Kafka. But that velocity problem sits upstream from most data warehouses, and it will become more of an issue as the Internet of Things grows.
Integrating systems both upstream and downstream will be a challenge for vendors, he said.
The "800-pound gorilla" is that enterprises are getting data from too many places and all organizations want to integrate more data sources. The average company has 5,000 data systems; a large enterprise, such as Verizon, has 10,000, Stonebraker said.
"You get 20 of them in your Vertica data warehouse, how about the other 4,980?" he said. "They are siloed and not accessible."
Analytics will grow more complex, and business analytics will morph into data science -- so more data scientists will be needed.
Some business analysts doing SQL could figure out how to "retread" themselves into a data scientist, he said.
However, one data scientist said a business analyst can't just become a data scientist.
"I think it is two very different things," said Massimo Mascaro, who works in the consumer tax group at Intuit Inc. in San Diego. "You need both."
A business analyst helps turn data into suggestions about what the business should do next, Mascaro said, while a data scientist's job is to come up with models to help predict "both sides of the coin."
Producing enough data scientists may take a decade, as many colleges and universities have only recently begun offering data science programs, Stonebraker said.
"That is clearly going to be the future," he said.
In his criticism of "marketing buzz," Stonebraker chronicled the history of sales pitches around big data in recent years, noting that MapReduce was the answer four years ago -- even though it was purpose-built by Google 10 years ago. Google later open sourced it and eventually replaced MapReduce with Google Cloud Dataflow.
"Google has to be laughing in their beer for convincing everybody that what they discarded was a good idea," Stonebraker said.
"The new pivot is to say if it is a data warehouse, what value is HDFS?" he said.
A data lake is a place to put all of an organization's data files -- essentially making it "your junk drawer," Stonebraker said. Then, the data has to be curated before it is loaded into a system, such as Vertica.
"How much money are you willing to pay for a junk drawer? Not very much," he said.
IT pros have believed the succession of hype, and, in summary, Stonebraker said: "You guys should all be way, way more cynical."
IT pros express big data Hadoop-la skepticism
Stonebraker's message of skepticism was welcomed by those who recognize hoopla when they see it.
"The fact that he is a techie and cynical, and tells people to be cynical is wonderful," said David Hatala, a data warehouse architect at Vircor, a Tampa, Fla., consultancy focused on "high performance solutions."
However, Stonebraker compromised his message when he mentioned companies he's co-founded, such as in-memory database company VoltDB and data unification platform Tamr, Hatala said.
Big data in three words
Volume: too much data;
Velocity: data coming too fast;
Variety: data coming from too many places.
"As cynical as he is about marketing messages, everybody has their own agenda," he said.
Those who see beyond the buzz expect big data tools to provide a way to build actionable items from the analytics of mixed workloads.
"It will not only give you analytics, it will take you all the way through processing that data to give you actionable items," said Gabriel Carlson, a solutions architect at Vircor.
To analyze mixed workloads, IT pros need several processing threads to happen concurrently and not step on each other, Hatala said.
Once a system is built around a lot of data, the system architecture will be focused on what to do with all the data, Carlson said.
"If you haven't thought about how your system can be used for actionable items, you have designed your system halfway," Carlson said.
Robert Gates covers data centers, data center strategies, server technologies, converged and hyperconverged infrastructure and open source operating systems for SearchDataCenter. Follow him on Twitter @RBGatesTT or email him at firstname.lastname@example.org.