Christian Belady, the director of hardware architecture at Microsoft's eXtreme Computing Group, is one of the most influential thinkers in the data center. In 2006 Belady introduced the power usage effectiveness (PUE) metric, and today PUE is a global data center standard.
When he joined Microsoft's data center operations team, Belady placed a rack of servers in a tent with no cooling to prove that hardware can handle less-than-ideal conditions. Since then, he has moved to Microsoft's cutting-edge cloud computing think tank to work on new ways to deliver applications. In this interview, Belady discusses data center metrics; the next big steps for improving data center effectiveness; and why a continued focus on infrastructure may create diminishing returns. This Q&A is part of a series of interviews with five data center professionals who are changing the industry.
Your data center efficiency metric PUE has been public for about four years. What do you think of adoption at this point?
Christian Belady: I think it's pretty good. If you talk to anyone running a major data center operation, they're using PUE. There are perceived issues with it, comments about people cheating on PUE reporting. But who cares?
What these companies are doing is benchmarking their own operations. I don't care if your PUE is 1.25 or 1.15. It doesn't matter, as long as you are focused on improving and are consistent in how you measure.
There is no question that these companies are leading the industry, while the bulk of the industry still has a PUE around 2.0. These guys are benchmarking themselves and improving year over year. That's the importance of the metric, and from that standpoint it's been very successful. In addition, it really has become globally accepted.
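For readers who want to put numbers on the PUE values mentioned above, here is a minimal sketch. It relies on the standard definition of PUE as total facility power divided by IT equipment power; the function names and the 1,000 kW IT load are illustrative assumptions, not figures from the interview.

```python
def pue(total_facility_kw: float, it_equipment_kw: float) -> float:
    """Power usage effectiveness: total facility power / IT equipment power."""
    if it_equipment_kw <= 0:
        raise ValueError("IT equipment power must be positive")
    return total_facility_kw / it_equipment_kw


def overhead_kw(it_equipment_kw: float, pue_value: float) -> float:
    """Power spent on cooling, power distribution and other non-IT loads."""
    return it_equipment_kw * (pue_value - 1.0)


# Hypothetical 1,000 kW IT load, compared at the PUE values quoted in the text.
IT_LOAD_KW = 1000.0
for value in (2.0, 1.25, 1.15):
    print(f"PUE {value}: about {overhead_kw(IT_LOAD_KW, value):.0f} kW of overhead")
```

Under these assumptions, the gap between 1.25 and 1.15 is small next to the gap between either of them and 2.0, which is consistent with the point that consistent measurement and year-over-year improvement matter more than the exact figure.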
What do you think about the next level of efficiency metric: measuring the useful work of the data center? Will that kind of metric be available in the future?
C.B.: This is going to take a lot of effort. Look at PUE and see how difficult it was to get buy-in from various stakeholders. A data center productivity metric will be an order of magnitude more difficult to get broad acceptance. It's a very complex metric, and I'm very supportive of it, but it will be really hard to get agreement.
What did running servers in tents outside prove for you?
C.B.: I did that experiment in the tent, running servers, because of something I learned from negotiating with my wife. If I want to come to an agreement with her, I have to go to an extreme and come to a compromise, which is where I wanted to be in the first place. The point is the experiment was the extreme, but in the end I was shooting for the compromise.
I conducted that experiment when I came to Microsoft to show the industry that we're being far too sensitive about protecting the servers with hyperoptimized temperature control.
That opened the door within Microsoft early on to really look at aggressive air-side economization. If you look at our container and ITPAC [IT Pre-Assembled Component] strategy, the cost of our data center infrastructure is almost an order of magnitude cheaper because of economization. We also asked the question, 'Why don't we run our data centers at 95 degrees?' And guess what? That is our spec now. It gives us huge opportunities for cost savings and in the end is more efficient and sustainable.
Do you think hardening the server is the way to save more energy?
C.B.: I think people should look at developing a 120-degree Fahrenheit server. With that I can operate it without mechanical cooling anywhere on the globe. How hard is it to do? It's a piece of cake -- bigger heat sinks, larger fans. I can create a server that operates at a higher temperature. I can design it, but are you willing to give up on density?
Up to 80% of the cost of a data center is in the mechanical and electrical infrastructure and 10% is space. What is density buying me in these scale-out environments? All the costs in the data center are power and cooling, not space. The only reason we went for server density in the past is that companies were charging for space in the data center. Now companies are allocating data center costs based on power usage.
As a thought experiment, what if I give you a 1-U server in 2-U space? I could use outside air in Abu Dhabi with no mechanical cooling, but some would argue [that would] take up twice the space in the data center. However, the mechanical and electrical infrastructure typically consumes half the space in the data center. With high temperature and less-dense servers, the mechanical equipment goes away and frees up space that can now be used by the less-dense servers. So it isn't even a bigger footprint at the end of the day. I'm not saying these are exact numbers, but these are the tradeoffs.
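The arithmetic behind this thought experiment can be sketched in a few lines. This is only an illustration under loud assumptions: the 50 units of server floor space are hypothetical, the infrastructure share is taken as exactly half, and the whole of that share is treated as freed once mechanical cooling is removed. A real facility would likely still need some electrical space, so treat the result as a best case rather than a prediction.

```python
def total_footprint(it_space: float, infra_share: float) -> float:
    """Total floor space when infrastructure occupies `infra_share` of the total."""
    if not 0.0 <= infra_share < 1.0:
        raise ValueError("infra_share must be in the range [0, 1)")
    return it_space / (1.0 - infra_share)


# Hypothetical: dense 1-U servers occupy 50 units of floor space.
DENSE_IT_SPACE = 50.0

# Conventional build: infrastructure takes half of the total floor space.
conventional = total_footprint(DENSE_IT_SPACE, infra_share=0.5)

# Less-dense build: each 1-U server sits in 2-U of space (IT space doubles),
# and the infrastructure share is assumed to drop to zero (a simplification).
less_dense = total_footprint(DENSE_IT_SPACE * 2, infra_share=0.0)

print(f"conventional: {conventional:.0f} units, less dense: {less_dense:.0f} units")
```

With these simplified inputs the two totals come out equal, which is the tradeoff the answer describes: doubling the server space is offset by the infrastructure space that is no longer needed.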
In a recent blog post, you suggested there are two kinds of companies: (1) those whose IT operations are a huge portion of the cost structure and (2) traditional companies where operations are not and efficiency is not a concern. Can traditional data centers adopt aggressive efficiency techniques? Or will more efficient data center operators just subsume IT responsibilities as they can deliver the services so much cheaper?
C.B.: This is the polarization I see in the industry. In the past, I've been frustrated with the industry's resistance to broadly adopting efficiency metrics in data center operations. Then I had an epiphany when I looked at the Gartner chart. It all boiled down to the answer to a simple question. What percentage of your business's total costs is from IT operations? If it's a big piece, you're going to be optimizing and looking for ways to improve in the future. For a company where it's just a small piece of the cost, it is not a concern.
I think it's an interesting play for the cloud ultimately. A lot of efficiency improvements come with scale. Microsoft data centers today use outside air economization. Maybe we'll find that there are higher server failure rates but because of our scale we can tolerate it without impacting services. Someone with a smaller operation won't be able to handle the loss of some server hardware. But for large players, 1% loss of hardware may be nothing compared with the efficiency gain.
I think people are going to look at costs and make the decisions about cloud computing accordingly. Who likes the cloud today? Startups that don't have to buy servers to start their business or hire people to manage them. Slowly this will grow up in the stack and smaller IT operations will start looking at the trade-off in the cloud, and eventually large enterprises will start looking to see that running operations in the cloud is significantly cheaper than in-house. Certainly there are enterprises looking already, but we're talking about the mass movement.
Users in the large internet data center space have become efficient, and more traditional companies have done enough to stop the bleeding, or data center cost is insignificant enough to their business that they'll never go green. So is the data center efficiency problem solved?
C.B.: The job is never done, but if you focus on improving in one area for very long, you'll start to get diminishing returns. You have to be conscious of the cost pie, always be conscious of where the bulk of the costs are. At that time, it made sense to focus on data center efficiency. But some of us have improved this so much -- is it still the biggest piece of the pie?
If the answer is yes, we can still work on it. Always be conscious of the TCO cost pie, and always be focused on the biggest slice. We're not in a static environment -- all these costs are going to change. Maybe tomorrow it's a completely different slice.
What are you doing in your current role?
C.B.: When I came to Microsoft, there was a great opportunity because they were building a group that blurred the lines between IT and the infrastructure, between the server and the data center. It was a fun job and we did great stuff that we have been pretty public about. I loved the job and could have continued my career in our operations group. It's a great team.
But I'm all about the interfaces: the big opportunities are not to dive deep in one area, but to look across disciplines. In my new team, the eXtreme Computing Group, I get to look at the opportunities across hardware, software, applications and security interfaces. What could we do if we really stripped all our legacy IT requirements? What if we blurred the lines between these disciplines and developed a new cloud ecosystem from ground zero? What could that ecosystem look like? How can we see an order of magnitude reduction in cost?
My interest always lies way out in the future. How do we change the game? All the guys in this series have demonstrated that they have made significant changes in the industry. My challenge to all of us is [to ask] how we take the next big step. That's what excites me more and that is what I am working on. Stay tuned!
Let us know what you think about the story; email Matt Stansberry, Executive Editor.