A lack of prescience in Splunk's spunky search-based troubleshooting tool limited its usage to after-the-fact diagnosis. Adding Nagios, the mighty open source monitoring tool, puts Splunkers one step ahead in the data center troubleshooting race, according to Michael Baum, chief executive Splunker.
Today, San Francisco-based Splunk announced its partnership with Nagios Project, the development team behind the popular systems management host and service monitor, and Nagios creator Ethan Galstad. Just prior to the announcement, I talked to Baum and Patrick McGovern, Splunk's chief community Splunker, about Splunk, Nagios and why data center management and troubleshooting are so difficult.
What is your 30-second description of Splunk?
Michael Baum: Splunk, of course, is a play on spelunking, where you are dropping into a cave with your light on and looking around. Splunking is where you drop into your IT data center and look around.
There are a lot of tools that give you fancy dashboards of services. But, generally, tools available now to diagnose the problems in today's complex data centers have become very brutal and try to take the intelligence away from the system administrator.
Splunk allows the system administrator to surf their IT data just like they surf the Web. So, if one has an issue and thinks there may be a problem with system number 123, one can start there and run queries and different types of complex searches to find out. Then, you are able to graphically drill down into the different tiers and find out more about the problem.
Splunk is not a magic potion that just fixes your system. It allows people to fix their systems faster.
Why is IT troubleshooting so challenging?
Baum: The underlying root problem is too much complexity.
In a presentation that I give to Linux users groups, I have a slide that says: 'In the beginning, there was the mainframe, and it was good.' I show a picture of a giant computer with an on-off switch. Then I show a picture of a rack of servers and a data center with racks and racks of servers.
Now, think of this: companies have dozens to thousands of servers stacked up in racks. Just one of those servers could be processing data that comes in from a thousand different sources. Think about a mail server and the tremendous number of services that are going to create all of these different log entries. Now, that is just one machine. In a data center, the amount of log information that you need to sift through to resolve situations becomes unmanageable very quickly.
Then, you have all these different systems that have to communicate together. There are a lot of different services to sort through. You have to see inside the different operating systems that are running simultaneously, each with their own format. You have people yanking out machines in the data center and putting new ones in and updating old ones. You're doing backup exactly at the same time a query is happening. Of course, all of these things are generating log information. It is a very complex situation.
So, if you ask a sys admin how he troubleshoots today, he'll say: 'Well, I have 200 machines, and I think that machine number 17 has a problem. So, I will look around and go through log files, trying to diagnose what is going on. Oftentimes, it is not that machine; it is one or two machines nearby that are causing that machine to fail. But I've had to spend a lot of time troubleshooting the machine that is having the problem, not the ones that are causing trouble.'
Splunk sees all that unstructured, time-based data and takes it in in real time, whether there are systems being added or systems being yanked out. Splunk indexes all of the different data from all of the different tiers and services.
From the response to our articles about Nagios, I've learned it's well-respected by sys admins. What does Nagios bring to Splunk's product?
McGovern: With Nagios, we can make it so that if Nagios notices that something is wrong, then the sys admin can click a button, search their IT data, determine whether the problem resides in a log file or somewhere else and see very quickly what is going on.
Nagios does have a great installed base. Nagios users can use Splunk to troubleshoot the tons of log files that they have to deal with on a regular basis. Nagios shows the flashing red light, and Splunk provides the troubleshooting tools that take them to the cause of that flashing light, so to speak. It is going to be a tremendous win for both sides.
Beyond Nagios, are there other open source tools that may be added to Splunk's troubleshooting tools?
McGovern: When I ran Sourceforge.net for five years, I watched it go from 500 to over 100,000 open source projects. I saw firsthand the software that gets generated when a lot of passionate people get together and want to solve a problem. So, there is a lot of fantastic software out there, such as Nagios. There are a lot of possibilities. Our ability to work closely in an open source community is a big win.
As a commercial vendor, how does Splunk draw the line between its free, open source products and commercial products?
McGovern: We have two products out today. One is the Splunk Server. That server is freely downloadable from our Web site. Anyone who wants to troubleshoot their data center can use it. The free product is a very simple install, and in five minutes you can Splunk your data.
Splunk Professional has all of the things that the free product has, but it also has enterprise-class support, as well as functionality and features, such as the ability to search data from a number of different places and have multiple indexes of information. It also has live Splunk, where you can run searches on a regular basis and be paged if something happens.