"Why is it so hard to figure out what's slowing down network performance?" That's the question many IT managers asked Peter Harrison while he was researching The Linux Quick Fix Notebook, a new book from Prentice Hall PTR. In this tip, Harrison answers that question and offers advice for diagnosing performance problems. - Editor
Detecting network performance bottlenecks can be tricky, and several factors add to problem-resolution difficulties. Many applications rely on inter-server communication to function properly. Slow application response times could therefore be due to problems anywhere along the communications path at any level of the Open Systems Interconnection (OSI) stack. It could be a physical problem with a cable or NIC, a link protocol problem, routing latency across networks, poorly implemented TCP/IP stacks causing intermittent delays, packet filtering at security devices, overloaded systems or overloaded applications.
All of these things have to be checked along every communications path your application needs to use. A slow Web server response could be related to issues between your Web browser and the Web server, the Web server and the application server, the application server and the database server, the database server and the disks in the storage area network or the application server and the credit card processing bureau.
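As a rough illustration of checking each leg of such a path, here is a minimal Python sketch that times a TCP connection to each hop. The function names are made up for this example, and any hostnames and ports you pass in are your own. A TCP connect time reflects only network round-trip and listener responsiveness, not application response time, and each leg has to be measured from the machine that actually originates it (the application server for the database leg, for example).

```python
import socket
import time


def connect_time_ms(host, port, timeout=3.0):
    """Return the milliseconds needed to open a TCP connection, or None on failure."""
    start = time.perf_counter()
    try:
        # create_connection resolves the name and completes the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return (time.perf_counter() - start) * 1000.0
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return None


def survey(hops, timeout=3.0):
    """Time a connect to each (label, host, port) hop; return {label: ms or None}."""
    return {
        label: connect_time_ms(host, port, timeout)
        for label, host, port in hops
    }
```

Calling `survey([("database server", "db.example.com", 5432)])` from the application server, with a placeholder hostname replaced by a real one, would return a dictionary mapping the label to a time in milliseconds, or to `None` if the hop could not be reached. Comparing the numbers taken from each tier gives the groups involved a shared, concrete starting point.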
The next problem is that this same communications path is indirectly managed by many different departments: the building facilities team responsible for wiring, the networking team, the security team, programmers, database administrators, storage experts, systems administrators, and hardware and software vendors. Communicating the steps taken in the process of elimination from group to group is often a great challenge. The problem becomes worse when some of the responsibilities are outsourced to contractors, possibly in other countries.
Finally, there is human nature. No one wants to admit they changed anything. If there is a history of denial of responsibility, people remain wary even when the problem appears to be a hardware fault or software bug, and they may unnecessarily double-check claims made by team members. This wastes time and further delays problem resolution.
Finding, communicating bottlenecks
Each of the teams mentioned previously has tools to identify possible performance bottlenecks. The process of elimination should follow the OSI stack from bottom to top.
- The networking group should try to eliminate the lower level physical cabling, communications link, routing and packet filtering problems.
- The systems administrators should attempt to eliminate server issues that could affect application performance.
- The development staff should review their programming code and the DBAs should evaluate database performance issues.
The real challenge is always in relaying the findings to other groups to help eliminate probable problem sources.
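One way to picture the bottom-to-top elimination and the hand-off of findings is a shared checklist. The Python sketch below is only an illustration: the `Check` fields, the status values and the ordering numbers are assumptions made for this example, not part of any particular tool.

```python
from dataclasses import dataclass


@dataclass
class Check:
    order: int            # position in the bottom-to-top order (1 = lowest level)
    description: str
    owner: str            # team responsible for running the check
    status: str = "pending"   # "pending", "cleared" or "suspect"
    finding: str = ""


def next_check(checks):
    """Return the lowest-level check that is still pending, or None if none remain."""
    pending = [c for c in checks if c.status == "pending"]
    return min(pending, key=lambda c: c.order) if pending else None


def record(check, status, finding):
    """Store a team's result so that other groups can see it."""
    if status not in ("cleared", "suspect"):
        raise ValueError("status must be 'cleared' or 'suspect'")
    check.status = status
    check.finding = finding


def summary(checks):
    """Return one line per check, ordered bottom to top, for relaying to other groups."""
    ordered = sorted(checks, key=lambda c: c.order)
    return [
        f"{c.order} {c.owner}: {c.description} [{c.status}] {c.finding}".rstrip()
        for c in ordered
    ]


# Illustrative entries only; the order values are a rough assumption.
checks = [
    Check(1, "physical cabling and NICs", "networking"),
    Check(2, "routing and packet filtering", "networking"),
    Check(3, "server load", "systems administration"),
    Check(4, "application code and database", "development and DBAs"),
]
record(next_check(checks), "cleared", "no errors reported on the link")
print("\n".join(summary(checks)))
```

The point of the design is that every group writes its result into the same record, so the next team up the stack can see what has already been ruled out before it starts.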
In my experience, the best way to handle these multi-disciplinary issues is to assign the responsibility of fixing the problem to a single person. This should be someone with project management skills, plus enough technical ability to grasp the IT concepts surrounding the situation.
The difficulty is that a project manager is usually respected by the executive team, because they can explain the technical situation in managerial terms, but is often held in less regard by technical staff with superior IT skills. For this reason, I recommend teaming the project manager with a technical lead who has hands-on knowledge of the various technologies used by the IT organization. This will help with the diplomacy required to determine which technical group should be called upon to fix problems as they're discovered.
Both people should be assigned for the duration of the problem. It may sound excessive, but the project manager will provide continuity between the management staff of the groups, and the technical lead will be able to coordinate and track the technical challenges faced at each step. Once the problem has been resolved, the two will be well placed to run a post-mortem meeting to determine how to prevent and detect the recurrence of such an issue.
If the bottleneck is serious, then it is often wise to create a conference call bridge that all the necessary parties can join. In such cases, e-mail communication should be largely limited to delivering status messages because the technical staff members resolving the problem are frequently far away from e-mail. Also, direct phone calls are often given much greater priority. The project manager should lead the call, and the technical lead should operate behind the scenes overseeing the systems engineering details.
As I mentioned earlier, a post-mortem meeting should be arranged to document the lessons learned. This will help in more rapidly detecting and resolving the recurrence of the issue. It is also possible to analyze this information to help eliminate the potential problem altogether. The post-mortem document should be used as the basis of internal staff training of all the affected groups -- either in a formal setting or as part of regular meetings -- to ensure that the knowledge is correctly disseminated.