Beowulf is 10 years old this month and Donald Becker is a proud parent. Becker created Beowulf clustering while at NASA, providing a less costly but equally effective vehicle (PC clusters) for doing complex mathematical calculations. Previously, that task could only be run on supercomputers. Becker has been instrumental in helping that Linux-based high-performance computing software grow up. He founded Scyld Computing in 1998 to redesign the Beowulf software and make it commercially usable. About a year ago, Scyld was acquired by hardware vendor Penguin Computing of San Francisco. Today, Becker is chief technology officer (CTO) of Penguin, which provides integrated hardware/software systems that offer scalable performance. In this interview, Becker discusses the legacy and future of Beowulf and Linux's place in the high-performance computing landscape.
On the 10th anniversary of your creation of Beowulf, can you reflect on its impact on Linux and IT?
Donald Becker: Ten years ago, clustering independent, commodity-class machines -- and building, essentially, supercomputers out of them -- was a controversial idea. At first, we put Beowulf clusters beside supercomputers to remove some of their workloads. But Beowulf clusters are clearly supercomputers in their own right.
I think that Beowulf and Linux have had a significant beneficial interaction. Linux was successful with a broad base of early adopters because it had excellent device support at a time when other similar operating systems supported only a very limited selection.
One of the unique demands of clusters was that the machines needed to communicate much more than [networked] workstations [do]. My work on Linux focused on networking: adding network drivers and other infrastructure so that Linux worked well with a wide variety of hardware. This included low-cost hardware that would normally be ignored because it at first appears unrelated to high-end use. But the small-scale and educational clusters often started as low-end, proof-of-concept clusters. Being able to use existing and commonly available hardware made Linux much easier to install and use than other OSes that had specific, limited hardware support.
What does the Scyld version of Beowulf have that the non-commercial version did not?
Becker: Scyld developed a set of innovations that makes it possible to install and maintain a large set of machines just as if it were a single machine, while maintaining the full performance of each additional machine.
Rather than install the full operating system on each machine, we use a full, standard Linux install only on one 'master' machine. The other machines are compute nodes that are dynamically added and configured with just enough of an environment to run applications. This approach allows a completely standard environment for users and administrators, with the master running all of the expected services. The compute nodes thus may run a minimal, carefully optimized environment that does not need to have any of the rarely used but essential tools and services users expect.
Part of the development was creating a network booting system, new node monitoring and management, specialized directory services and many other subsystems that 'add simplicity' to make the cluster reliable and easy to use.
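Editor's illustration: the master/compute-node arrangement Becker describes can be pictured with a toy model. This is only a sketch under stated assumptions, not Scyld's software: the `Master` and `ComputeNode` classes, the `join` and `update` methods, and the version strings are all hypothetical names invented for the example. It shows one fully installed master, nodes that join dynamically with a minimal environment, and a single point of update.

```python
from dataclasses import dataclass, field


@dataclass
class ComputeNode:
    """Hypothetical compute node: holds only what it needs to run jobs."""
    node_id: int
    environment: dict = field(default_factory=dict)


class Master:
    """Hypothetical master: the only machine with a full install."""

    def __init__(self, os_version):
        self.os_version = os_version
        self.nodes = {}  # node_id -> ComputeNode

    def join(self):
        """Add a node dynamically and hand it a minimal environment."""
        node = ComputeNode(node_id=len(self.nodes))
        node.environment = {"os_version": self.os_version}
        self.nodes[node.node_id] = node
        return node

    def update(self, os_version):
        """Single point of update: change the master, then every node."""
        self.os_version = os_version
        for node in self.nodes.values():
            node.environment["os_version"] = os_version


# Start with a master alone, then let three nodes join.
master = Master(os_version="1.0")
nodes = [master.join() for _ in range(3)]

# One update on the master reaches every node that has joined.
master.update("1.1")
versions = {n.environment["os_version"] for n in nodes}
print(len(master.nodes), versions)
```

In this toy model the administrator only ever touches the master; the nodes carry no independent configuration of their own.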
How does Beowulf differ from other types of Linux-based clusters?
Becker: Beowulf is now a generic name for scalable performance clusters based on commodity hardware, a private system network and open source software [Linux] infrastructure.
Scalable performance clusters improve performance proportionally with added machines, compared to failover, which tries to improve availability.
Commodity hardware [means the cluster is built] from machines capable of standalone use versus custom-designed single machines that may not be incrementally scaled as needed. Mass-market, standalone compute nodes take advantage of the improvements driven by a larger market rather than the slow development of traditional supercomputers.
In a private system network, nodes [are] dedicated to computation, [which gives] predictable, efficient [performance] and a simple security model, as compared to [other types of clusters, such as a] wide-area, ad hoc collection of workstations.
In an open source software or Linux infrastructure, core software is trustable and verifiable. This allows inspection of utilities to check that they will work correctly in a cluster environment.
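Editor's illustration: one common way to check whether a cluster "improves performance proportionally with added machines" is to compute speedup and parallel efficiency from measured run times. The sketch below uses those standard definitions; the function names and the timings are hypothetical and are not figures from the interview.

```python
def speedup(time_one_machine, time_n_machines):
    """How many times faster the job ran on the cluster."""
    return time_one_machine / time_n_machines


def efficiency(time_one_machine, time_n_machines, machines):
    """Speedup per machine; 1.0 means perfectly proportional scaling."""
    return speedup(time_one_machine, time_n_machines) / machines


# Hypothetical wall-clock times in seconds, keyed by machine count.
timings = {1: 1200.0, 4: 300.0, 8: 160.0}

baseline = timings[1]
for machines, seconds in timings.items():
    s = speedup(baseline, seconds)
    e = efficiency(baseline, seconds, machines)
    print(f"{machines} machines: speedup {s:.2f}, efficiency {e:.2f}")
```

An efficiency that stays close to 1.0 as machines are added is what proportional scaling looks like in practice; a failover cluster, by contrast, would be judged on availability rather than on this number.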
How do you see the next 10 years of Beowulf's life panning out?
Becker: For about the first five years of building Beowulf clusters, the challenge was to get them to work at all. Not just putting together a workable software system, but demonstrating that applications written for high-end machines could be done on clusters.
The focus of the second five years was to make these increasingly large collections of machines easy enough for non-scientists to install and maintain. That meant developing approaches to handle the software complexity and simplify administration.
For the future I see the focus continuing on making the software easier to use. The same software system that makes scalable performance clusters easy to install and maintain solves the same problems for any large collection of machines, even if they are not all working on a single job. I see a future where a site only installs a single machine for each operating system type, and that machine starts out as a 'cluster of one.' Additional machines just join the cluster rather than have an independent installation.
Diagnostics and monitoring have always been part of using clusters, but as more clusters are deployed and the underlying operating system becomes more complex, there is an increasing need for tools that identify problems and point to their cause.
There are many interesting ideas around for creating minimal virtual environments so that each application appears to run in its own pristine environment while the underlying system efficiently shares resources.
Supercomputer vendor Cray has created a new product that is designed to compete with some Linux clusters. Cray Canada CTO Paul Terry said that Linux clusters really can't compare to a supercomputer. What is your take on Cray's moves against Linux?
Becker: They are simultaneously saying that Linux clusters are not high-performance computing systems while introducing a product to compete with Linux clusters. They clearly saw that a large part of their customer base was moving toward commodity clusters, Beowulf-class clusters, to do high-end computing.
Clusters can't replace all of the workload being done by supercomputers today, but they can replace the bulk of the traditional vector supercomputers. There is always that 10% of the market that won't run well on clusters, and that is the market that Cray is in. We are happy to solve most of the problems of the world and run most of the applications and play in our part of the marketplace.
Obviously, the high performance clusters started out in engineering and other scientific areas. Where are they moving to today?
Becker: Linux clusters started out being used by the people who traditionally used supercomputers, mostly putting clusters to work toward the single end goal of running a single job. Now, people in businesses are seeing that a cluster is a really good platform for managing hundreds of machines running independent applications. So any place where you have hundreds of machines deployed, you will probably want to set those up in a cluster structure. That will allow you to have single points of update and single points of management, so you can manage those machines with the same effort as managing one machine.
FEEDBACK: Can Linux clusters do the job of a supercomputer?
Send your feedback to the SearchEnterpriseLinux.com news team.