Donald Becker, an MIT grad who in the 1990s pioneered high-performance computing (HPC) with commodity components, returned to his alma mater recently to update the Boston Linux & Unix User Group (BLU) on the state of HPC and his work to make Linux clusters more powerful, more user-friendly and easier to manage.
Becker is now the CTO of Scyld Software and its parent company, Penguin Computing Inc. In 1994, he helped launch NASA's Beowulf Project, which demonstrated that $50,000 of commodity hardware, clustered together, could equal the performance of a $1 million Cray supercomputer. For his work, he received the Gordon Bell Prize from the IEEE (Institute of Electrical and Electronics Engineers Inc.) Computer Society in 1997.
Improving the Beowulf model
Noted for his work in modifying network drivers to improve speed, Becker said that, according to the Top500 supercomputer list, more than 75% of high-performance computing systems costing $1 million or more run Linux today, as does an even greater percentage of midrange machines costing $50,000 to $1 million. And most of them use his Beowulf model, Becker said.
After the Beowulf project confirmed the feasibility of low-cost cluster computing, the team tackled other problems that made commodity-based HPC so daunting: designing better hardware, building in diagnostics, adding debugging features, tuning the BIOS settings, creating libraries and improving the software, he said. In addition, the team's educational mission was to create a recipe for the research and development project and to show others how to replicate their achievement commercially, he said.
Big hurdles remained, however. Building an HPC cluster was still too complex, and installing, configuring and using one required extensive training. Clusters were also difficult to administer over the long term, Becker said. HPC systems built on the Beowulf model still needed more power, plug-in capabilities, reliable booting, and improved software and networking to create a unified system that would be easier to manage, he said.
"We had to educate people so they could [build Linux clusters] themselves," Becker told nearly 40 BLU members who gathered in the MIT classroom for his talk. "We've been making it simpler, addressing scalability issues, creating faster clusters for 10,000 machines and teaching others how to build them."
Scyld Software addresses the current challenge of making HPC Linux clusters more user-friendly by automating the configuration of the operating system, enabling a single full install on a master node with a diskless, single-system image. Administrators download only a small operating system image that uses as little memory as possible, and the master drives the configuration, booting and provisioning simultaneously from a single point of control.
In turn, the master node directs individual compute nodes to load into cache only the specific components needed to run a particular application, which improves machine restarts because extraneous data is not copied across a cluster, he said.
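The idea of a compute node caching only what an application needs can be illustrated with a minimal Python sketch. This is not Scyld's implementation; the class names, the component names and the transfer counter are all hypothetical, chosen only to show why on-demand loading keeps extraneous data off the cluster network.

```python
class MasterNode:
    """Hypothetical master holding the single full install."""

    def __init__(self, components):
        self.components = dict(components)  # component name -> payload
        self.transfers = 0                  # how many copies were sent out

    def fetch(self, name):
        # Each call stands in for one component copied across the network.
        self.transfers += 1
        return self.components[name]


class ComputeNode:
    """Hypothetical diskless node that starts with an empty cache."""

    def __init__(self, master):
        self.master = master
        self.cache = {}

    def run(self, needed):
        # Pull only the components this application needs and that
        # are not already cached; everything else stays on the master.
        for name in needed:
            if name not in self.cache:
                self.cache[name] = self.master.fetch(name)
        return sorted(self.cache)


# Assumed example install: five components, of which the job needs three.
master = MasterNode({
    "libc": b"...",
    "libmpi": b"...",
    "solver": b"...",
    "editor": b"...",
    "compiler": b"...",
})
node = ComputeNode(master)
loaded = node.run(["libc", "libmpi", "solver"])
```

In this sketch the editor and compiler never leave the master, and running a second job that reuses cached components triggers no further transfers.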
"Today, 80% in HPC use our approach," Becker said. "You can replace custom hardware and unique software and run an HPC pilot on a laptop. And it will run the same on the cluster."
Becker attributes that success to the open source model. "Open source makes HPC possible," Becker said. "The open source kernel lets us track, monitor and control activity and build tools to fix problems. By looking at the source code, we know the solution will work in HPC. With a proprietary vendor, we couldn't do that."
Becker currently is working with the Linux community to improve the network boot process, enabling it to identify automatically all machine hardware and upload the correct drivers, he said. If drivers are unavailable, the machine would initiate a driver request directly to Red Hat or Novell during the boot-up process, eliminating the need for new driver hardware updates, he said.
Another community group called EtherBoot has investigated how to boot machines remotely over the network without a CD and download, install and run the correct drivers, Becker said. The group even hopes to enable machines to diagnose the cause of crashes and then retrieve URLs to fix those problems, he said.
"These community projects are necessary, because Linux doesn't have new driver support for everything," Becker said. "We need an automatic way to accomplish these tasks."