Research data center scientists are the cowboys of the computing world, but the research computing IT architecture has carry-over potential into enterprises grappling with how to process big data.
"Research computing was big data before big data," said Richard Villar, vice president of data center and cloud at IDC, a research firm based in Framingham, Mass.
Most enterprise data centers were designed to support systems of record -- availability and reliability mattered for the entire infrastructure, Villar said. But now IT supports systems of engagement and insight.
Systems of engagement are inherently dispersive and customer-facing. Systems of insight focus on analytics and thrive on quickly allocated and reallocated compute resources. Availability and reliability for these applications refer to the entire fabric of the systems, not individual servers.
Enterprise IT shops also deal with new types of data coming into the business and how to improve customer service through data analysis. The Massachusetts Green High-Performance Computing Center (MGHPCC) shows how distributed compute addresses this problem.
The original big data architecture
The MGHPCC, a cooperative research computing data center located in Holyoke, Mass., operates on a "hard shell and soft core IT architecture," according to James Cuff, researcher at Harvard University.
Harvard, MIT, the University of Massachusetts, Boston University and Northeastern University share the MGHPCC. Researchers must obey different data regulations, and often bring different mixes of hardware into the facility. But these tenants frequently work together on projects that demand cross-connects on the network and, increasingly, share resources for the cost benefit of higher utilization.
The MGHPCC is analogous to a modern corporation -- some servers are isolated for compliance and security reasons, while others host workloads from multiple departments or scale down to idle when demand is low. Only 20% of the IT equipment runs on uninterruptible power supplies, and the remaining 80% can fail -- or power down -- without much issue. It's a design tailored to research computing, where algorithms churn away toward an end result, producing high volumes of expendable data along the way. It could also teach enterprises how to process big data.
"In research computing, we're the cowboys on the frontier taking chances," Cuff said. "But we're responsible cowboys."
Load balancing is the best way to deal with the pressure of new services and rapid scaling, provided it is applied to the right use cases, Villar said.
"All data centers are under pressure to get the maximum work for a minimal cost of IT assets," he said.
A batch scheduling system orchestrates all MGHPCC workloads. For example, a scientist loads 6,000 pieces of work that each run for two hours. The computing software's architecture tolerates bad nodes or missing compute. While the code runs, it writes out periodic stop points. If an individual piece of work lands on a failed node, the program picks up from that point on a good one. Google and Netflix rely on this style of computing, and so do astrophysicists.
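The checkpoint-and-resume pattern described above can be illustrated with a minimal Python sketch. It is not the MGHPCC's actual scheduler; the function names, the step-based model of a work unit and the `fails_at` parameter used to simulate a bad node are all assumptions made for illustration.

```python
def run_on_node(unit_id, total_steps, checkpoints, fails_at=None, checkpoint_every=10):
    """Run one work unit on a (simulated) node, resuming from its last stop point.

    fails_at is a hypothetical knob: the step at which this node dies
    (None means the node is healthy). Returns True if the unit finished.
    """
    step = checkpoints.get(unit_id, 0)  # pick up from the last saved stop point
    while step < total_steps:
        if fails_at is not None and step >= fails_at:
            return False  # node failed; work since the last checkpoint is lost
        step += 1  # one step of "real" computation would happen here
        if step % checkpoint_every == 0 or step == total_steps:
            checkpoints[unit_id] = step  # write out a periodic stop point
    return True


def run_batch(unit_ids, total_steps, node_failures):
    """Try each unit on the listed nodes in order until one finishes it.

    node_failures holds one fails_at value per node (None = healthy node).
    Returns the final checkpoints and the number of attempts per unit.
    """
    checkpoints = {}
    attempts = {}
    for unit_id in unit_ids:
        attempts[unit_id] = 0
        for fails_at in node_failures:
            attempts[unit_id] += 1
            if run_on_node(unit_id, total_steps, checkpoints, fails_at):
                break
        else:
            raise RuntimeError(f"no healthy node could finish unit {unit_id}")
    return checkpoints, attempts


# Three work units of 100 steps each; the first node dies at step 35,
# the second node is healthy.
checkpoints, attempts = run_batch(range(3), 100, [35, None])
print(checkpoints, attempts)
```

In this sketch, each unit loses only the five steps completed after its last stop point at step 30, then finishes on the healthy node. That is the property that lets most of the compute run without backup power.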
"It wouldn't work for financial trades," admitted Cuff. "For six nines data centers, this is heresy."
For the majority of business IT, however, rethinking facility design can cut unnecessary costs without abandoning the systems of record and workloads that need protection. Major Web properties build their data centers with only enough power to shut down gracefully in the event of a failure, but they don't treat mission-critical data the same as batch data analytics programs. Batch analytics and non-real-time big data can pick up where they left off and do not need to stay up 100% of the time, Villar said.
"We used to build a lot of high availability and mainframe bomb-proof power," Cuff said. "You can do the same thing with distributed computing, but it's a lot harder because there's a business function that's hard to change."
The core storage and networking resources at the center are its "crown jewels" and must be protected from downtime. But the hundreds of racks of compute are more flexible.
In case of a major failure, there's very little value in shoring up all of the systems with backup power because workloads can shift to other facilities instead, Villar said.
MGHPCC's researchers build out data systems for large projects that can support data processing across distributed resources, Cuff said, with 20 Gbps backlinks to replicate between data centers on different campuses and the MGHPCC. For projects that require rigorous backup or transfer ridiculously large data streams, MGHPCC spreads the load over the storage closest to the compute. And for international projects, such as data analysis from the Large Hadron Collider at CERN, reliable access is paramount.
"We treat [the various data centers] all as one big Layer 2 network with fast switches and large chunks of the infrastructure as one machine," Cuff said.
Thanks in part to server efficiency increases with every product generation -- and the researchers' willingness to share assets for higher utilization -- the MGHPCC data center still has a great deal of room left for growth.