We've all heard it, or maybe even said it. There are many tools and testers to
help administrators identify when a network is down, and several approaches to
reacting to the alarms. Which method is best? The short answer: none of them. No
single method works in every situation. There are essentially two approaches to
troubleshooting, top-down and bottom-up, and one rule that applies to both: at
some point, you will use the one only to realize you should have used the other!
According to a recent survey by Infonetics Research, the top three threats to
your network are, in that order, network products, security, and cabling and
connectors. Gartner has also released a study finding that roughly 20 percent of
all IT investments go toward things that don't work.
Top-Down Approach
In a top-down approach, the network manager begins at the upper layers of
the OSI protocol stack. The administrator tests the application to make sure it
is working, then pings the servers, and continues down until reaching the bottom
of the stack, the physical layer. This approach is best when multiple users
initiate the help desk calls: it is very rare for a physical layer problem to
affect all users, unless, of course, it happens to sit in the only connection to
the server. This methodology allows the administrator to determine whether the
application or server is down, slow, or otherwise unresponsive to network
commands. To be effective, it is generally aided by a tool or network monitoring
application that can provide the network manager with trending and actionable
data.
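To make the sequence concrete, here is a minimal top-down sketch in Python. The
application URL and host (app.example.com) are placeholders, not anything in
your environment, and the `ping -c` flag assumes a Linux or macOS host.

```python
import subprocess
import urllib.request

APP_URL = "http://app.example.com/health"  # hypothetical application endpoint
SERVER = "app.example.com"                 # placeholder server name

def check_application(url: str) -> bool:
    """Layer 7 first: does the application itself answer?"""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except OSError:  # URLError/HTTPError are both OSError subclasses
        return False

def check_host(host: str) -> bool:
    """One step down the stack: does the server answer an ICMP echo?"""
    result = subprocess.run(["ping", "-c", "3", host], capture_output=True)
    return result.returncode == 0

if check_application(APP_URL):
    print("Application responds; no need to dig lower in the stack.")
elif check_host(SERVER):
    print("Host reachable but application down: suspect the server or app.")
else:
    print("Host unreachable: keep moving down toward the physical layer.")
```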
Actionable data can be as simple as a ping that returns "host unreachable," all
the way up to bit errors and other faults delivered via SNMP (Simple Network
Management Protocol) traps. The real trick, however, is determining the cause of
the errors. To do so effectively, use a methodical troubleshooting plan, and one
that involves more than rebooting a server. If a server is going down, something
is causing it to do so. It may be a memory leak, processor over-utilization, or
some other issue, but rebooting should be considered a bandage, not a solution.
So what exactly is actionable data? It is data that provides enough information
to be useful and is clear enough to determine a plan of action.
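As one illustration, the sketch below polls an interface error counter and
reports only the change over time, which is closer to actionable data than a
raw counter value. It assumes the net-snmp `snmpget` command is installed; the
switch name and the "public" community string are placeholders for your own.

```python
import subprocess
import time

HOST = "switch1.example.com"   # placeholder switch name
OID = "IF-MIB::ifInErrors.1"   # inbound errors on interface index 1

def read_in_errors() -> int:
    out = subprocess.run(
        ["snmpget", "-v", "2c", "-c", "public", HOST, OID],
        capture_output=True, text=True, check=True).stdout
    # Typical net-snmp output: "IF-MIB::ifInErrors.1 = Counter32: 42"
    return int(out.rsplit(":", 1)[1])

last = read_in_errors()
while True:                    # a sketch; a real poller would run as a service
    time.sleep(60)
    now = read_in_errors()
    if now > last:
        # Actionable: which device, which counter, how many, over what window.
        print(f"{HOST} {OID}: {now - last} new errors in the last 60 s")
    last = now
```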
Most management packages and monitoring tools allow a network
administrator to set thresholds for performance outside of an acceptable range.
Knowing where to set these for specific issues will require a bit of trial and
error: set too low, they will flood a pager or cell phone with messages; set too
high, they can result in unemployment. Blindly accepting the defaults can leave
the tools underutilized. Any time you deploy management software, be sure to
spend the money and get trained. The best training is on site, in your
environment, by someone certified in the software. That way you can eliminate
the modules you don't want or need and tune the ones that will provide you with
the best information. Bandwidth-heavy applications and heavily utilized servers
will require the most tuning to be of benefit.
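The sketch below shows one common way to keep a threshold from flooding a
pager: pair the alert level with a lower clear level (hysteresis), so a value
hovering near the limit pages once rather than once per sample. The 80/70
numbers are purely illustrative; they are exactly the settings that take trial
and error to get right.

```python
class Threshold:
    """Alert once on crossing alert_at; re-arm only after falling to clear_at."""

    def __init__(self, alert_at: float, clear_at: float):
        self.alert_at = alert_at
        self.clear_at = clear_at
        self.alerting = False

    def update(self, value: float) -> bool:
        """Return True only on the transition into the alert state."""
        if not self.alerting and value >= self.alert_at:
            self.alerting = True
            return True            # one page, not one per sample
        if self.alerting and value <= self.clear_at:
            self.alerting = False  # re-arm for the next excursion
        return False

t = Threshold(alert_at=80.0, clear_at=70.0)
for sample in [75, 82, 85, 84, 78, 72, 69, 81]:
    if t.update(sample):
        print(f"ALERT: utilization at {sample}%")
# Pages only at 82 and again at 81, not on every sample above 80.
```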
Another benefit of management software is the ability to query disparate
equipment and retain statistics and trends in one reporting tool. In the old
days, and still in many environments today, the network manager is stuck
double-clicking on each switch across a wide variety of interfaces, depending on
the server software and active electronics. With a single tool, trending and
overall traffic reports can be exported, sorted, and so on, and can even be used
to justify new equipment and upgrades (a little side perk). The advantage of
trending and utilization models is that they let you determine which servers
could benefit from multiple network cards, for instance. They also let you
segment your switches to balance the packet load, so that one switch is not
over-utilized while the others sit under-utilized, and they show you what types
of packets are moving where, so that traffic can be optimized as well.
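For reference, the utilization figure behind most of those trending reports is
a simple calculation over two SNMP counter samples. The sketch below shows it,
including the wrap-around that 32-bit octet counters (ifInOctets/ifOutOctets)
are subject to; the sample numbers are made up.

```python
COUNTER32_MAX = 2**32  # SNMP Counter32 values wrap at this boundary

def utilization_pct(octets_t1: int, octets_t2: int,
                    interval_s: float, if_speed_bps: int) -> float:
    """Percent utilization between two ifInOctets (or ifOutOctets) samples."""
    delta = (octets_t2 - octets_t1) % COUNTER32_MAX  # tolerates one wrap
    return (delta * 8 * 100.0) / (interval_s * if_speed_bps)

# Example: 45,000,000 octets moved in 300 s on a 100 Mb/s port.
print(f"{utilization_pct(1_000_000, 46_000_000, 300, 100_000_000):.1f}%")
# -> 1.2%
```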
Bottom-Up Approach
In a bottom-up approach, the cabling is checked first and then troubleshooting
moves up the protocol stack. When a single user goes down, it is far easier to
start at the physical layer and move up. Some idiosyncrasies can develop when
EMI and/or environmental conditions are causing the problem. Physical layer
testers vary: they can be field testers, smart bit testers, and/or spectrum
analyzers for radio frequency information, and what you are testing will
determine what type of tester you need. The key here is that the tester be
calibrated just prior to the test and be certified by an independent agency.
Test and Measurement World (http://www.reed-electronics.com/tmworld/) has a
listing of testers, ratings for how well they perform, and the certifications
for each variety.
When there are errors, either continuous or intermittent, it is a good idea to
look at your physical layer. Field-terminated patch cords are a particular
culprit, but other environmental conditions can also be to blame. When walls are
moved, cables that were once routed away from fluorescent light fixtures may no
longer be a safe distance from them, new power panels may be located too close,
and so on. It is important not to rely on the link light on your switch port to
determine whether a cable is good or bad. Just as with your electronics, there
are conditions where you may have a link, but the signal is so degraded from
sender to receiver that the packets are useless. Remember the expression "the
lights are on but no one's home": it holds for copper or fiber that is not
performing but still causes the link light on the switch to illuminate.
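On a Linux host you can see this "lights on, nobody home" condition directly:
the sketch below compares the carrier (link) flag against the receive error
counters using standard sysfs paths. The interface name eth0 is a placeholder.

```python
from pathlib import Path

IFACE = "eth0"  # placeholder interface name
base = Path(f"/sys/class/net/{IFACE}")

try:
    link_up = (base / "carrier").read_text().strip() == "1"
except OSError:
    link_up = False  # carrier is unreadable when the interface is down

rx_errors = int((base / "statistics" / "rx_errors").read_text())
rx_crc = int((base / "statistics" / "rx_crc_errors").read_text())

if link_up and (rx_errors or rx_crc):
    print(f"{IFACE}: link is up, yet {rx_errors} rx errors "
          f"({rx_crc} CRC) -- check the physical layer anyway.")
```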
Another thing that can happen is performance degradation through
autonegotiation, where a link drops to a lower speed or to half duplex in an
attempt to maintain the connection. If you have deployed Gigabit Ethernet and
your cabling was installed before the new parameters for channel performance
were adopted, you will also want to have your cabling recertified against the
new parameters; this is recommended by the cabling standards bodies. Note that
when equipment is qualified for operation over any physical layer media, the
testing is done in a lab under pristine conditions. Actual installations may
vary for a number of reasons. If you are going with the bottom-up approach,
check all of the physical medium, and don't skip this step just because you can
ping a device or see its link light. On the other hand, if you don't have a link
light at all, the problem is obvious.
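A downgrade like that can be silent, so it is worth checking what a link
actually negotiated. On Linux, the negotiated speed and duplex are exposed in
sysfs, as in the sketch below; eth0 and the 1000/full expectation are
placeholders for your own links.

```python
from pathlib import Path

IFACE = "eth0"             # placeholder interface name
EXPECT_SPEED_MBPS = 1000   # what this link should negotiate
EXPECT_DUPLEX = "full"

base = Path(f"/sys/class/net/{IFACE}")
# Note: these files can read as -1 or raise OSError while the link is down.
speed = int((base / "speed").read_text())        # negotiated speed, Mb/s
duplex = (base / "duplex").read_text().strip()   # "full" or "half"

if speed < EXPECT_SPEED_MBPS or duplex != EXPECT_DUPLEX:
    print(f"{IFACE} negotiated {speed} Mb/s {duplex} duplex -- "
          "suspect the cabling or a mismatched port setting.")
```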
From there, you work your way up: check the network card diagnostics and switch
port statistics, and continue up to the application. If only one application is
not working, start at the top. If several, or all, applications are failing for
one workstation, start at the bottom and work up. And remember, once in a while
the problem will be in the middle, or this rule will work in reverse.
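Put as a deliberately rough rule of thumb in code, the triage looks something
like the sketch below; the inputs and cutoffs are simplifications of the
guidance above, not a real dispatch policy.

```python
def pick_approach(apps_affected: int, users_affected: int) -> str:
    """A rough encoding of the triage rule; real incidents defy neat rules."""
    if users_affected > 1:
        return "top-down"   # physical faults rarely hit many users at once
    if apps_affected == 1:
        return "top-down"   # one app for one user: likely the application
    return "bottom-up"      # several/all apps on one workstation: start low

print(pick_approach(apps_affected=5, users_affected=1))  # -> bottom-up
```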
Pre-Installation Testing
This step can be one of the best tools for eliminating problems before
applications and networks go live. It should include thorough testing of all
components under load, where "all components" means the physical layer, the
network layer, and, where possible, the applications. It is unwise to assume
that standards-based components will be problem-free, and this is particularly
true at the physical layer. Installation anomalies, poor installation practices,
EMI or RF interference, and marginally compliant components can all cause
errors, especially in combination, and the higher the frequency, the worse the
problems can become.
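One piece of that load testing can be scripted. The sketch below drives a TCP
throughput test with the iperf3 CLI, assuming an iperf3 server is already
listening on the far end; testhost.example.com is a placeholder, and -J asks
iperf3 for JSON output.

```python
import json
import subprocess

SERVER = "testhost.example.com"  # placeholder; run `iperf3 -s` there first

out = subprocess.run(["iperf3", "-c", SERVER, "-t", "30", "-J"],
                     capture_output=True, text=True, check=True).stdout
result = json.loads(out)

# For a TCP test, end/sum_sent carries the client-side totals.
sent = result["end"]["sum_sent"]
print(f"throughput: {sent['bits_per_second'] / 1e9:.2f} Gb/s, "
      f"retransmits: {sent['retransmits']}")
```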
Many manufacturers make cables and connectivity with margin above the minimum
standards. This provides a bit of a forgiveness factor for installation issues,
but proper installation remains the key to error-free performance of any system,
active or passive.
Many larger companies maintain test networks for just this reason. When you are
troubleshooting a problem, you can move the components to a test lab where the
physical layer is certified and known to be trouble-free. Network electronics
can then be evaluated in the test bed, in a controlled environment, before
implementation or after problems are found.
Carrie Higbie
of The Siemon Company