Network
Troubleshooting
By Othmar Kyas
Section I
Basic Concepts
Chapter 1 – Network Availability
1.1 The Strategic Importance of Information Technology
1.2 Intranets and the Internet: Revolutions in Network Technology
1.3 The Behavior of Complex Network Systems: Catastrophe Theory
1.4 The Causes of Network Failure
1.4.1 Operator Error
1.4.2 Mass Storage Problems
1.4.3 Computer Hardware Problems
1.4.4 Software Problems
1.4.5 Network Problems
1.5 Calculation and Estimation of Costs Incurred Due to Network Failures
1.5.1 Immediate Costs
1.5.2 Consequential Costs
1.6 High Availability and Fault Tolerance in Networks
1.7 Summary
Chapter 2 – Error Management in Data Networks
2.1 Corporate Guidelines for Data Processing Systems
2.1.1 Corporate Guidelines for Employees
2.1.2 Corporate Guidelines for Network Hardware
2.1.3 Corporate Guidelines for Network Software
2.2 Services and Service Level Agreements
2.3 Network Planning and Documentation
2.4 Network Management Tools
2.5 Monitoring and Diagnostic Tools
2.6 Regular Network Audits
2.6.1 Physical Layer Audit
2.6.2 Data Link Layer Audit
2.6.3 Network Layer Audit
2.7 Network Simulation Tools
2.8 Change Management and Problem Solving
2.9 Documentation of Network Problems
2.10 Network Support Staff Training
Chapter 3 – Diagnostic Tools
3.1 Tactical Test Systems
3.1.1 Tactical Test Systems for Physical Parameters in Local-Area Networks
3.1.2 Tactical Test Systems for Physical Parameters in Wide-Area Networks
3.1.3 Tactical Test Systems for OSI Layers 2-7
3.2 Strategic Test Systems
3.2.1 SNMP and Management Information Bases (MIBs)
3.2.2 RMON (Remote Monitoring) Systems
3.2.3 Strategic Systems Based on Software Agents
3.2.4 Network Management Systems
4.1.3 Digital Data Communications (Baseband Systems)
4.1.4 Data Communications over Fiber-Optic Links
4.1.5 Synchronous and Asynchronous Data Communications
4.1.6 The Decibel
4.2
Chapter 5 – Fundamentals of Testing in Local-Area Networks
5.1 Basic Concepts
5.1.1 Local-Area Network Topologies: Bus, Ring, Switch
5.1.2 Standards for Local-Area Data Networks
5.2 In-Service Testing in Local-Area Networks
5.2.1 Signal Condition Measurements
5.2.2 Protocol Analysis and Statistical Measurements
5.3 Out-of-Service Measurements in Local-Area Networks
5.3.1 Cable Testing
5.3.2 Detecting External Interference
5.3.3 Conformance and Interoperability Testing
5.3.4 Load Tests
Section II
Troubleshooting Local-Area-Networks
Chapter 6 – Cable Infrastructures in Local-Area Networks
6.1 Coaxial Cable
6.1.1 Coaxial Cable: Specification and Implementation
6.1.2 Troubleshooting Coaxial Cable (10 Base2, 10 Base5)
6.1.3 Symptoms and Causes: Coaxial Cable
6.2 Twisted-Pair Cable
6.2.1 Twisted-Pair Cable: Specification and Implementation
6.2.2 Troubleshooting Twisted-Pair Cable
6.2.3 Symptoms and Causes: Twisted-Pair Cable
6.3 Fiber-Optic Cable
6.3.1 Fiber-Optic Cable: Specification and Implementation
6.3.2 Troubleshooting Fiber-Optic Cable
6.3.3 Symptoms and Causes: Fiber-Optic Cable
Chapter 7 – 10/100/1,000 Mbit/s Ethernet
7.1
7.2 Troubleshooting in 10/100/1,000 Mbit/s Ethernet
7.2.1 Gathering Information on Symptoms and Recent Changes
7.2.2 Starting the Troubleshooting Procedure
7.2.3 Error Symptoms in Ethernet
7.2.4 Cabling Problems
7.2.5 Problems with Ethernet Interface Cards
7.2.6 Problems with Media Access Units (MAUs)
7.2.7 Problems with Repeaters and Hubs
7.2.8 Problems with Bridges
7.2.9 Problems with Routers
7.2.10 Symptoms and Causes: 10/100/1,000 Mbit/s Ethernet
Chapter 8 – ...…
8.1
Chapter 9 – ...…
9.1
10.1.2 ATM in Heterogeneous LAN Environments
10.1.3 ATM in Public Wide-Area Networks
10.1.4 Asynchronous Transfer Mode: ATM
10.1.5 The ATM Layer Model
10.1.6 The ATM Physical Layer
10.1.7 The ATM Layer
10.1.8 The ATM Adaptation Layer (AAL)
10.2 Troubleshooting ATM
10.2.1 Troubleshooting the Physical Layer
10.2.2 Troubleshooting the ATM Layer
10.2.3 Troubleshooting Higher Layers
10.2.4 Cabling Problems
10.2.5 Problems with ATM Interface Cards
10.2.6 Problems with Routers
10.2.7 Symptoms and Causes: ATM
Chapter 11 – Switched LANs
11.1 Switched LANs: Specification and Implementation
11.2 Design Guidelines for Switched LANs
11.3 Troubleshooting in Switched LANs
11.3.1 Gathering Information on Symptoms and Recent Changes
11.3.2 Starting the Troubleshooting Procedure
11.3.3 Error Symptoms in Switched Networks
11.3.4 Symptoms and Causes: LAN Switching
Section III
Troubleshooting Wide-Area-Networks
Chapter 12 – ISDN
12.1
12.2
12.3
12.4 Troubleshooting ISDN
12.4.1 Gathering Information on Symptoms and Recent Changes
12.4.2 Error Symptoms in ISDN
12.4.3 Symptoms and Causes: ISDN
Chapter 13 – Frame Relay
13.1 Specification and Implementation
13.2 Frame Relay Standards
13.3 Troubleshooting Frame Relay
13.3.1 Gathering Information on Symptoms and Recent Changes
13.3.2 Starting the Troubleshooting Procedure
13.3.3 Error Symptoms in Frame Relay Networks
13.3.4 Symptoms and Causes: Frame Relay
Section IV
Troubleshooting Higher-Layer Protocols
Chapter 15 – ...…
15.1
Chapter 16 – Internet Protocols
16.1 The Internet Protocol (IP)
16.2 The Transmission Control Protocol (TCP)
16.3 The User Datagram Protocol (UDP)
16.4 IP Encapsulation: SLIP, PPP
16.4.1 Serial Line Internet Protocol (SLIP)
16.4.2 The Point-to-Point Protocol (PPP)
16.5 Addressing in IP Networks
16.5.1 The IP Address Space
16.5.2 Special Internet Addresses
16.5.3 Subnetwork Address Masks
16.5.4 The Internet Domain Name System (DNS)
16.5.5 The Host Command and the DNS Protocol
16.6 Internet Protocol - The Next Generation (IPv6)
16.6.1 The IPv6 Address Format
16.6.2 The IPv6 Data Packet Format
16.6.3 Authentication and Encryption in IPv6
16.7 The NetBIOS Protocol
16.7.1 NetBEUI (NetBIOS over LLC)
16.7.2 NetBIOS over TCP/IP
16.8 The Dynamic Host Configuration Protocol (DHCP)
16.8.1 Diagnosing DHCP Errors
16.9 Internet Protocols: Standards
16.10 Troubleshooting Internet Protocols
16.10.1 Gathering Information on Symptoms and Recent Changes
16.10.2 Error Symptoms in IP Networks
16.10.3 Symptoms and Causes: Internet Protocols
Chapter 17 – Network Services and Applications
17.1 E-Mail
17.2 Internet News
17.3 The File Transfer Protocol (FTP)
17.4 Trivial File Transfer Protocol (TFTP)
17.5 Telnet
17.6 Hypertext Transfer Protocol (HTTP)
17.7 The Server Message Block (SMB) Protocol
17.8 The Microsoft Browsing Protocols
17.9 Troubleshooting Application Protocols
17.10 Symptoms and Causes: Network Services and Applications in General
17.11 Symptoms and Causes: Mail
Chapter 18 – Testing Network Performance
18.1
Section V
Appendix
Appendix A – Error Symptoms and Root Causes
A.1
A.6
A.7
A.8 Common Errors in ATM Networks
A.9 Symptoms and Causes: LAN Switching
A.10 Common Errors in LAN Switched Networks
A.11 Symptoms and Causes: ISDN
A.12 Common Errors in ISDN Networks
A.13 Symptoms and Causes: Frame Relay
A.14 Common Errors in Frame Relay Networks
A.15 Symptoms and Causes: PDH
A.16 Common Errors in PDH Networks
A.17 Symptoms and Causes: SDH
A.18 Common Errors in SDH Networks
A.19 Symptoms and Causes: X and V Series Interfaces
A.20 Common Errors in X/V Series Connections
Appendix B – ...…
B.1
Appendix C – Tables for Network Design and Operation
C.1 Transmission Delays in LANs
C.2 Ethernet
C.2.1 Parameters for 10 Mbit/s Ethernet Operation
C.2.2 Parameters for 100 Mbit/s Ethernet Operation
C.2.3 Parameters for Gigabit Ethernet Operation
C.3 Token Ring
C.4 FDDI
C.5 ATM
C.5.1 ATM Transmission Delays
C.6 ISDN
Section I
Basic Concepts
Network Availability
1
“Waiting for an alarm is not the ideal form of network management.” BOB BUCHANAN, THE NETWORK JOURNAL
1.1
The Strategic Importance
of Information Technology
Growing financial and competitive pressures in the business world mean that companies everywhere must continuously optimize their internal and external structures in order to survive. All business processes and routines must be reviewed regularly for effectiveness (“Are we doing the right things?”) and for efficiency (“Are we doing things right?”). Most business processes today consist of physical activities, such as the manufacture of a metal part, combined with information flow: How many parts should be produced? When? In what sizes? Increasingly, key business processes (in insurance companies, travel agencies, banks, and airlines, for example) consist entirely of information flow. Today, of course, the flow of information is largely dependent on information technology, that is, computers, databases and networks. A high-performance, high-availability information technology (IT) system is becoming a prerequisite for successful execution of the business practices that are decisive in maintaining a leadership position in today’s competitive markets. The role of the computer in business has changed radically over the past few years, from a tolerated plaything to a cornerstone of corporate infrastructures. This change has taken place so rapidly that in many companies IT still has not taken a central position in managerial circles, even though it has long since become indispensable for day-to-day business functions.
The reliance of enterprises on smoothly functioning IT infrastructures will continue to grow in the coming years. Areas of business that until recently had little to do with computer technology, such as marketing and customer service, are increasingly IT-based. This is largely due to the advent of customer interfaces that allow consumers to perform many transactions electronically, such as placing orders or making reservations. In fact, the proportion of people who work directly or indirectly with IT has grown in recent years to more than 50 percent (see Figure 1.1).
[Chart: percentage of employment (0 to 70) by sector (agriculture, production, services, information) for the years 1800 to 2000, spanning the agrarian, industrial, and information ages. Source: Leo A. Nefiodow, "The Fifth Kondratieff"]
Figure 1.1 Changes in employment patterns in Western industrial countries since 1800
Industry             Business processes                     Downtime costs per hour (US$)
Financial services   Stock trading                          6,000,000
Financial services   Credit card/telecash transactions      2,400,000
Media                Pay-per-view                           150,000
Retail               Home shopping (TV)                     100,000
Retail               Mail order (catalog)                   80,000
Travel/tourism       Airline reservation                    82,500
Shipping             Parcel service                         25,500
Source: AT&T/Gartner Group
In keeping with these developments, the professional operation and management of computer networks has long since ceased to be a necessary evil. On the contrary, it has become a decisive strategic necessity for the success of almost any enterprise. A network failure that lasts only a few hours can cost millions of dollars. According to a study carried out by AT&T, companies that deal in financial services, such as investment brokerages or credit card firms, can suffer losses of 2.5 to 5 million dollars from just 1 hour of network downtime (see Figure 1.2).
1.2
Intranets and the Internet:
Revolutions in Network Technology
The difficulties involved in the professional operation of high-performance data networks have been further complicated by the Internet revolution, which has brought about radical changes in network technology and applications. Since the mid-1990s the Internet has not only developed into a universal communications medium, but has also become a global marketplace for the exchange of goods and services. As a result, growing numbers of businesses are faced with the necessity of providing their employees with Internet access. Special network infrastructures are now required in order to provide electronic access for increasing numbers of Internet-based consumers. Once it was sufficient to have just a few carefully controlled wide-area network (WAN) links in an otherwise homogeneous local-area network (LAN). Today, however, a secure, high-performance LAN-WAN structure is indispensable.
Internet technologies are also being introduced into company networks, leading to the development of “corporate intranets”. This has necessitated further restructuring so that broad areas of internal data processing can be adapted to the transport mechanisms, protocols and formats used in the Internet. All of these developments have caused the World Wide Web (WWW) to take on a position of global importance as a uniform user interface. At the same time, these changes have placed enormous demands on network managers. In many cases, the skills and tools available for managing computer systems and networks can barely keep up with the increasing complexity of data network structures. And to add to the difficulty of the task, the technology cycles in data communications (the intervals at which new and more powerful data communication technologies are introduced) are getting shorter all the time. Whereas the classic 10 Mbit/s Ethernet topologies shaped computer networking throughout the 1980s, the 1990s have seen the introduction of new technologies almost every year, including LAN switching, 100 Mbit/s Ethernet, Gigabit Ethernet, ATM (Asynchronous Transfer Mode), IP (Internet Protocol) switching, Packet over SONET (PoS) and ADSL (Asymmetric Digital Subscriber Line), to name just a few. Product life cycles in the IT field are often measured in months now rather than years. This rapid pace of technological development puts manufacturers and users alike under tremendous pressure to keep abreast of constant innovation (see Figure 1.3).
[Figure 1.3: timeline from 1982 to 2000 of adopted standards and standards in development, plotted by bandwidth (10 Mbit/s to 1,500 Mbit/s). Technologies shown: Token Ring (4 Mbit/s and 16 Mbit/s), Ethernet, Isochronous Ethernet, 100Base-T, IEEE 802.12 (100Base-VG-AnyLAN), FDDI, FDDI-2, CDDI, FFOL, TCNS (Thomas Conrad Networking System), Fibre Channel (133 Mbit/s, 266 Mbit/s, 530 Mbit/s, 1 Gbit/s), ATM (52 Mbit/s, 100 Mbit/s, 155 Mbit/s, 622 Mbit/s, 2.4 Gbit/s, 9.9 Gbit/s), Frame Relay (1.545 Mbit/s), LAN switching (n x 10 Mbit/s, n x 100 Mbit/s), xDSL (16 Mbit/s), Gigabit Ethernet (1,000 Mbit/s), and 10 Gigabit/s Ethernet. ATM bandwidths are indicated per switch port: a 155 Mbit/s ATM switch with 10 ports represents a subdivided bandwidth of 1.5 Gbit/s.]
1.3
The Behavior of Complex Network Systems:
Catastrophe Theory
The enormous technological complexity in combination with the large numbers of hardware and software components used in networks makes operation and management a difficult task, to say the least. Communication media, connectors, hubs, switches, repeaters, network interface cards, operating systems, data protocols, driver software, and application software must all function smoothly under widely varying conditions, including network load, number of nodes connected, and size of data packets transmitted. Even when a given system has attained a relatively stable operating state, its stability is constantly put to the test by dynamic variations as well as by operator errors, administrative errors, configuration changes, and hardware and software problems. In general, the more complex a system is and the greater the number of parameters that influence it, the more difficult it is to predict its behavior. Catastrophe theory (see René Thom, 1975) offers an excellent model for describing the behavior of systems as complex as computer networks. This theory can provide at least qualitative descriptions of system behavior, especially for non-linear operating states, such as those that often accompany a network breakdown. Catastrophe theory postulates seven elementary catastrophes, which behave in a given manner according to the number rather than the type of control parameters influencing the system. The behavior model for catastrophes determined by two parameters, for example, is called a cusp graph. The cusp is a three-dimensional surface whose upper side represents balanced states, while the lower surface represents unstable maxima. Catastrophe theory can be applied to Ethernet networks, for example, to show the effects of two control parameters, slot time and network load, on throughput.
Slot time, which is defined as twice the time it takes a signal to travel between the two nodes that are farthest apart in an Ethernet segment, is influenced by the network components that cause signal transmission delay or latency, such as cables, repeaters or hubs. Figure 1.4 shows the behavior pattern for throughput when all other variables, such as network load, average packet size and number of network nodes, are constant. An increase in traffic in a network with a given slot time a moves the operating state across the upper surface of the cusp. The rise along the x-axis indicates increasing throughput. Starting from the higher slot time b, however, the same increase in traffic drastically reduces network efficiency. All processes take place on the surface of the cusp and are thus linear. If the operating state is a when the network load increases, and subsequently the slot time increases from state c (Figure 1.5), an abrupt departure from the balanced state takes place at [...] state, but one in which throughput is minimal. The abrupt transition from d to e constitutes a catastrophe.
The model provided by catastrophe theory clearly illustrates how complex and unpredictable a network can be. The symptoms that indicate problems in a network are often caused by a series of errors. One event triggers another, and the resulting state yet another, and so on. Feedback may either amplify or reduce the effects of error events. When the error symptom is finally detected, it may be far removed from its original locus in a completely different form and appear to have been triggered by some trivial event.
[Figure 1.4: cusp surface of throughput over the control parameters network load and slot time, with slot times a and b marked. Labeled features: divergence, bimodality, fold lines, unstable zone.]
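For reference, the two-parameter cusp described in this section can be written in its canonical textbook form. This is a sketch of the general model only: the state variable $x$ and the control parameters $a$ and $b$ are abstract, and identifying $x$ with throughput and $a$, $b$ with combinations of network load and slot time is an assumption made here for illustration, not a mapping given in the text.

```latex
% Canonical cusp catastrophe (standard form; symbols x, a, b are illustrative)
\begin{align*}
  V(x; a, b) &= \tfrac{1}{4}x^{4} + \tfrac{1}{2}a\,x^{2} + b\,x
    && \text{(potential)} \\
  \frac{\partial V}{\partial x} &= x^{3} + a\,x + b = 0
    && \text{(equilibrium surface)} \\
  4a^{3} + 27b^{2} &= 0
    && \text{(fold lines in the control plane)}
\end{align*}
```

Inside the region $4a^{3} + 27b^{2} < 0$ the equilibrium surface has three sheets for each pair $(a, b)$: two stable states and an unstable one between them. When the control parameters cross a fold line, one of the stable states disappears and the system jumps to the remaining sheet, which is the kind of abrupt transition the text calls a catastrophe.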
1.4
The Causes of Network Failure
There are five categories of errors that can lead to system failure:
• Operator error
• Mass storage problems
• Computer hardware problems
• Software problems
• Network problems
[Figure 1.5: cusp surface of throughput over network load and slot time, showing operating points a to f, the hysteresis region, and the catastrophe transitions.]
1.4.1
Operator Error
On average, operator error is responsible for over 5 percent of all system failures, a large enough proportion to merit a closer look. Operator errors can be classified as intentional or unintentional mistakes, and as errors that do or do not cause consequential damage.
The term “intentional error” does not necessarily indicate that the error itself was the operator’s intent, but rather that it resulted from some intentional action, such as trying to take a shortcut. The belief that a given process can be shortened, or that certain quality control or safety guidelines are superfluous, can lead to error situations with or without consequential damage. Less common are the truly intentional errors motivated, for example, by an employee’s desire for “revenge” against a superior or the company, by the desire to cause trouble for a colleague (by making mistakes that the colleague will be blamed for), or by destructiveness brought on by general frustration.
Unintentional errors usually result either from insufficient understanding of a given process or from poor concentration. Other common causes include software and hardware errors (the system does not behave as it should even though it is configured and operated correctly) or installation and configuration errors (errors occur even though the system is operated correctly and the software or hardware is functioning according to specification). Sometimes a series of minor errors, which individually go undetected because no harmful effects are noted, are eventually compounded so that serious errors or even system failures result.
1.4.2
Mass Storage Problems
Problems with hard disks are the most common cause of failures in data processing. More than 26 percent of all system failures can be traced to faults in mass storage media. Although high-performance mass storage can attain a mean time between failures (MTBF) of over $10^6$ hours, this could still mean replacing hard disks almost every month if the system has a large number of disk drives. There is usually a wide gap between theoretical MTBF and the operational MTBF that can be achieved in practice. The probability that a hard disk drive with a theoretical MTBF of $10^6$ hours (almost 114 years) will actually run that long without error is only 30 percent. To calculate the number of hard disks that will have to be replaced within a certain period of time in a given system, multiply the total number of hard disk drives in the system by the period of system service in hours, and then divide this number by the theoretical MTBF. For example, in a system that has 1,000 disk drives, each of which has a theoretical MTBF of $10^6$ hours, the number of failures $A$ in the first 5 years (43,800 hours) is

\[
A = \frac{1{,}000\ \text{disks} \times 43{,}800\ \text{hours}}{1{,}000{,}000\ \text{hours per disk}} \approx 44
\]
This is based on the assumption that all of the hard disk systems have the same MTBF and are operated under similar conditions. Tests have shown that mass storage units operating in warm ambient conditions tend to show a lower actual MTBF than those operated in well-cooled environments. Furthermore, frequent disk search operations and changes in location have both been shown to have negative effects on the service life of mass storage media. For this reason, some hard disk manufacturers use another value in addition to the theoretical and operational MTBF to indicate the probable period of error-free operation for their products. This value, called the cumulative distribution function (CDF), indicates the probability that a mass storage medium will fail within a specified time. For example, a CDF of 4 percent over 5 years means that there is a 4 percent chance that the medium in question will break down within the first 5 years of use.
[Chart: mass storage 26.4%, computer hardware 24.3%, software 22.3%, data networks 21.1%, users 5.9%. Source: Find/SVP]
Figure 1.6 Causes of system failures in data processing
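The disk replacement rule from Section 1.4.2 (number of drives times service hours, divided by the theoretical MTBF) can be sketched in a few lines of Python. This is a minimal illustration using the figures from the text; the function names are hypothetical, and the sketch assumes, as the text does, that all drives share the same MTBF.

```python
HOURS_PER_YEAR = 8_760  # 365 days x 24 hours, so 5 years = 43,800 hours as in the text


def expected_disk_failures(num_disks: int, service_hours: float, mtbf_hours: float) -> float:
    """Expected number of drive failures: drives x service hours / theoretical MTBF."""
    if mtbf_hours <= 0:
        raise ValueError("mtbf_hours must be positive")
    return num_disks * service_hours / mtbf_hours


def hours_between_failures(num_disks: int, mtbf_hours: float) -> float:
    """Average interval between failures somewhere in the pool of drives."""
    if num_disks <= 0:
        raise ValueError("num_disks must be positive")
    return mtbf_hours / num_disks


# Worked example from the text: 1,000 drives, MTBF of 10^6 hours, 5 years of service.
failures = expected_disk_failures(1_000, 5 * HOURS_PER_YEAR, 1_000_000)
interval = hours_between_failures(1_000, 1_000_000)

print(f"Expected failures in 5 years: {failures:.1f}")    # 43.8, rounded to 44 in the text
print(f"Average hours between failures: {interval:.0f}")  # 1000
```

An interval of 1,000 hours is roughly six weeks of continuous operation, which is in line with the remark that a large installation may need a replacement disk almost every month.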
1.4.3
Computer Hardware Problems
Roughly one-quarter of all system failures are caused by computer hardware problems. By definition, this includes problems with any computer hardware component, including monitor, keyboard, mouse, CPU, RAM, hard disks and floppy disk drives. The average error-free service life of a system is calculated from the sum of the MTBF values of its components divided by the number of components. The following are some average MTBF values for various computer system components:
• RAM chips: 8,000,000 hours
• Floppy disk drives, mice, CD-ROM drives: 2,000,000 hours
• 10Base-T interface cards: 5,000,000 hours
• FDDI, ATM interface cards: 400,000 hours
• CPUs: 100,000 hours
The MTBF values calculated for today’s computer systems average between 10,000 and 50,000 hours. In general, the more complex a system is, the lower the average MTBF. A system with multiple processors and multiple network links, for example, is more error-prone than a comparatively simple server with only one processor.
The Annual Failure Rate (AFR) is a better indicator of reliability than the MTBF. The AFR is the number of hours per year that the system is in operation divided by the MTBF. When a server system with an MTBF of 25,000 hours is in constant operation (8,760 hours per year), the AFR amounts to 8,760/25,000 = 0.35, or about one failure every 3 years. Another important parameter for the availability of computer systems is the mean time to repair (MTTR), which indicates the average length of time it takes to repair the system after a failure. The MTTR is the total repair time divided by the number of system failures. Typical MTTR values lie between 2 and 3 hours when the repair time used in the calculation is the amount of work time actually spent repairing the system.
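To show how the MTTR definition above relates to availability, here is a minimal Python sketch. The MTTR calculation follows the text (total repair time divided by the number of failures). The availability formula MTBF / (MTBF + MTTR) is the common steady-state approximation and is an assumption added here, not something the text states. The repair times in the example are hypothetical.

```python
def mean_time_to_repair(repair_hours: list[float]) -> float:
    """MTTR as defined in the text: total repair time / number of failures."""
    if not repair_hours:
        raise ValueError("at least one repair record is required")
    return sum(repair_hours) / len(repair_hours)


def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Common approximation (assumption, not from the text): MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical repair log: three failures, each repaired in 2 to 3 hours.
repairs = [2.0, 3.0, 2.5]
mttr = mean_time_to_repair(repairs)

# Server MTBF of 25,000 hours, the example value used in the text.
availability = steady_state_availability(25_000, mttr)

print(f"MTTR: {mttr:.1f} hours")
print(f"Availability: {availability:.4%}")
```

Note that this counts only hands-on repair time, as the text does. Time spent detecting the fault and waiting for parts or staff would lengthen the effective outage and lower the resulting availability.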
1.4.4
Software Problems
Software problems cause almost as many failures as hardware problems do. The widespread use of client-server architectures and distributed platforms in enterprise networks has led to such complex combinations of software that it is almost impossible to monitor system behavior under all network loads and in all operating states. In the age of corporate intranets and the Internet, the update schedules for software applications are becoming shorter all the time, so that sufficient time is not allowed for detailed testing before software is released. Automatic testing tools, such as LoadRunner (from Mercury Interactive, www.loadrunner.com) or AutoTester (from AutoTester Inc, www.autotester.com), which attempt to simulate various extreme operating situations, provide only limited assistance. Problems with new software that can lead to system failure arise not only at the application level, but also as a result of unstable software drivers, faulty installation or backup procedures, or operating system errors.
1.4.5
Network Problems
The fifth major category of IT problems encompasses errors that occur within the network itself. When the software and hardware problems that are directly related to network operation are included in this category (such as problems with network interface cards or with certain components of application software, protocols and card drivers), this group accounts for more than one-third of all IT failures. These network errors can be classified by OSI layer. As shown in Figure 1.7, 30 percent of all LAN errors occur on OSI layers 1 and 2. Typical causes are defective cables, connectors, or interface cards; defective modules in hubs, bridges or routers; collisions (in Ethernet networks); beacon processes (in Token-Ring networks); checksum errors, and incorrect packet sizes. The development and implementation of more reliable hardware components, coupled with the continuous improvement of cabling systems, have meant a decrease in the absolute numbers of these types of errors, but developments in software have had similar effects on the higher OSI layers as well. More stable network operating systems and applications as well as mature protocol stacks have also reduced the number of failures per network segment. As a result, the distribution of error sources over the seven OSI layers has remained roughly the same over the past few years.
[Figure 1.7: frequency distribution of LAN errors by OSI model layer: physical layer 20%, data link layer 10%, network layer 25%, transport layer 15%, session layer 5%, presentation layer 5%, application layer 20%. Wide-area networks, leased lines: physical layer 80%; higher layers, configuration errors, protocol errors, etc. 20%. Wide-area networks, ISDN: physical layer 90%; higher layers, configuration errors, protocol errors, etc. 10%.]
In wide-area networks (WANs), the proportion of errors occurring on the physical layer is even higher. Where permanent WAN links (leased lines) are employed, 80 percent of all errors (in the case of ISDN, the Integrated Services Digital Network, as many as 90 percent) can be traced to component failure, defective modems, or cable and connector faults (see Figure 1.7).
1.5
Calculation and Estimation of Costs Incurred Due
to Network Failures
It is becoming increasingly important to estimate the costs that are incurred in the event of network failure. These can be difficult to quantify, however. Nonetheless, a fairly clear idea of the financial impact of system failure is essential in order to determine the optimum infrastructure dimensions from the perspective of network management and maintenance. Knowing the costs of system failure enables the enterprise to make informed decisions regarding the level of investment in redundant components or network management and troubleshooting systems. All too often the costs of system failure are grossly underestimated. It may be true that the exorbitant losses of $100,000 per minute and more reported in some superficial studies apply only to a few special cases, such as when system failure affects production control systems, financial services offered by credit card companies, or investment brokerages. Nonetheless, the consequences of a network breakdown even in small or medium-sized companies should not be underrated.
The average availability of a data network today is between 98 and 99 percent. A system that is in operation 10 hours a day, 5 days a week can expect network downtime totaling between 52 and 104 hours per year. If an average of 100 employees are affected by a network failure, this means a maximum loss of productivity between 5,200 and 10,400 hours. This type of oversimplified calculation, however, quickly leads to inflated figures that do not necessarily reflect real situations.
The first step toward a more realistic analysis of network downtime costs is to distinguish between immediate costs incurred within the first 24 hours following the failure and consequential costs that arise after the first 24 hours. Costs in each of these categories are further divided into direct and indirect costs. Direct costs include all expenditures that are directly involved in correcting the network problem, while indirect costs include such factors as lost employee productivity and delayed project completion.
1.5.1
Immediate Costs
(Costs arising within the first 24 hours)
Direct Costs
• Replacement parts (network cards, cable, repeaters, hubs, etc.)
• New components (bridges, routers, servers, etc.)
• Rental or purchase of diagnostic equipment (network analyzers, cable testers, etc.)
• Consulting fees charged by network specialists
• Consulting fees charged by software/hardware manufacturers
• Overtime compensation for network support staff
Indirect Costs
• Loss of employee productivity at computer workstations
• Loss of productivity on production lines, in shipping and receiving departments, or in warehouse management; downtime of automated warehousing systems, etc.
• Loss of consumer or customer orders and confidence
Easiest to calculate are the direct immediate costs, such as the purchase of replacement components or consultants’ fees, because these are automatically documented by invoices. In mid-sized networks (around 500 nodes) with an availability of 99 percent, the average downtime of 52 hours results from an average of 10 to 20 failures of the network or parts of it, lasting between 1 and 5 hours each. If the direct immediate costs of solving the problem average $1,250 per case, then the direct cost of restoring operation after 10 failures comes to $12,500.
Quantifying the indirect immediate costs is more difficult. In general, only the cumulative loss of employee productivity is calculated. The extent of this loss, however, depends mainly on the degree to which employee productivity is dependent on network availability. Often a number of employee activities can be postponed until the next day, or at least for a few hours, without significant loss of productivity. In mid-sized office environments, therefore, loss of employee productivity is usually estimated at roughly 25 percent of the total network downtime. For example, if an average of 100 employees are affected by the network breakdown, at 52 hours of downtime per year the loss of productivity amounts to 52 · 0.25 · 100 = 1,300 hours. If the average gross salary costs come to $40 per hour, this puts the immediate indirect costs at $52,000.
1.5.2 Consequential Costs
(Costs arising after the first 24 hours)
Direct Costs
• New or adjusted hardware configuration in the network (restructuring of servers, bridges, etc.)
• Testing of other network segments for errors similar to those that caused the failure
• Documentation of the system failure
Indirect Costs
• Delayed project completion (product development, production, etc.)
• Delayed services (tenders, invoices, entering transactions in accounts, etc.)
• Loss of customer loyalty and satisfaction
Consequential indirect costs resulting from network failure are the most difficult to calculate. These costs are also referred to as “company losses” because they cannot be attributed to any one department or cost center. The amount of such costs is proportional to the degree to which the company depends on network-supported processes. Tenders may have to be printed and sent a day later than planned, for example. Incoming orders and payments may be similarly delayed. Incoming deliveries may be blocked if receiving slips cannot be printed or automated warehousing equipment cannot be operated. Urgent shipments sent by special courier result in higher shipping rates. Late charges may be incurred for bills that cannot be processed. Sales may be lost due to unavailable Web-based ordering systems. Customers may grow dissatisfied if they cannot reach a support hotline, which means a loss of future orders. These are just a few examples of company losses as consequential costs of network failures.
At a fairly low estimate of $1,000 in consequential indirect costs and $250 in consequential direct costs per failure, the consequential costs of 10 failures come to $12,500 per year. Added to the immediate costs calculated previously ($12,500 direct and $52,000 indirect), the total annual loss is $77,000. This means each hour of downtime costs the company about $1,480. Or, to look at the case from another perspective, an improvement of a mere 0.1 percent in network availability saves the company $7,700.
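The worked example in this section can be retraced in a few lines of code. The following Python sketch is illustrative only: the class name `FailureCostModel` and its field names are assumptions introduced here, and the default values are simply the figures used in this section (52 hours of downtime at 99 percent availability, 10 failures, 100 employees, 25 percent dependency, $40 per hour, and $1,250, $250 and $1,000 per failure). Like the text, it assumes that costs scale linearly with downtime.

```python
from dataclasses import dataclass


@dataclass
class FailureCostModel:
    """Inputs of the worked example; names are illustrative, values come from the text."""
    downtime_percent: float = 1.0            # 100 minus the availability in percent
    annual_downtime_hours: float = 52.0
    failures_per_year: int = 10
    employees_affected: int = 100
    productivity_dependency: float = 0.25    # share of downtime counted as lost work
    hourly_salary_cost: float = 40.0         # average gross salary cost in dollars
    direct_immediate_per_failure: float = 1250.0
    direct_consequential_per_failure: float = 250.0
    indirect_consequential_per_failure: float = 1000.0

    def lost_productivity_hours(self) -> float:
        return (self.annual_downtime_hours
                * self.productivity_dependency
                * self.employees_affected)

    def annual_cost(self) -> float:
        per_failure = (self.direct_immediate_per_failure
                       + self.direct_consequential_per_failure
                       + self.indirect_consequential_per_failure)
        indirect_immediate = self.lost_productivity_hours() * self.hourly_salary_cost
        return per_failure * self.failures_per_year + indirect_immediate

    def cost_per_downtime_hour(self) -> float:
        return self.annual_cost() / self.annual_downtime_hours

    def cost_per_tenth_percent(self) -> float:
        # Linear scaling assumption: 0.1 percent is one tenth of 1 percent downtime.
        return self.annual_cost() / (self.downtime_percent / 0.1)


model = FailureCostModel()
print(f"Lost productivity: {model.lost_productivity_hours():,.0f} hours")
print(f"Annual cost:       ${model.annual_cost():,.0f}")
print(f"Cost per hour:     ${model.cost_per_downtime_hour():,.0f}")
print(f"Cost per 0.1%:     ${model.cost_per_tenth_percent():,.0f}")
```

With the default values the sketch reproduces the figures of this section: 1,300 lost hours, $77,000 per year, roughly $1,481 per hour of downtime and $7,700 per 0.1 percent of availability. Changing a single input, such as the productivity dependency, shows how sensitive the total is to that estimate.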
1.6 High Availability and Fault Tolerance in Networks
High-availability data processing infrastructures have become a basic requirement for smooth business processes in commercial data processing. Barring special measures taken to maximize network availability, the average availability of today’s IT systems is between 98 and 99 percent, which corresponds to a total annual downtime of 50 to 100 hours. For a growing number of companies, however, even this is too much downtime. Special systems can be added to boost network availability to between 99.9 and 99.9999 percent (99.999 percent uptime is equivalent to 6.8 minutes of downtime in one year). In this way the average downtime per year can be reduced to a few hours or even, in the extreme case, a few minutes. The costs of availability, however, increase almost exponentially with each additional decimal place.

Example calculation of network failure costs (figures from Section 1.5):
Network availability: 99%
Annual downtime (hours): 52
Number of employees affected per failure: 100
Dependency of employee productivity on network availability: 25%
Average annual failures: 10
Average direct, immediate costs per year / per failure (replacement parts, etc.): $12,500 ($1,250 per failure)
Average indirect, immediate costs (loss of employee productivity): 100 · 0.25 · 52 = 1,300 h · $40 = $52,000
Average direct, consequential costs per year / per failure (planning, failure documentation, reconfiguration): $2,500 ($250)
Average indirect, consequential costs per year / per failure (company losses): $10,000 ($1,000)
Annual network failure costs: $77,000
Hourly network failure costs: $1,481
Network failure costs per 0.1% downtime: $7,700

Before planning a high-availability system, it is important to specify exactly what service levels are required. This determines the degree of availability that must be guaranteed. Availability is expressed as a percentage, calculated from the total operating time and the downtime:
$$\text{Availability} = \frac{\text{total operating time} - \text{downtime}}{\text{total operating time}}$$
Another important factor is the average downtime resulting from system failure, which is called the mean time to repair or MTTR. In most cases, a large number of short service interruptions, lasting only seconds or minutes, is acceptable, while just a few failures that last for several hours each have serious consequences.
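The relationship between availability and MTTR can be illustrated with a short Python sketch. The function names and the two outage profiles below are hypothetical and chosen only for illustration: both profiles add up to 10 hours of downtime within 1,000 operating hours, so their availability is identical, yet their mean time to repair differs by a factor of ten.

```python
def availability(total_operating_hours: float, downtime_hours: float) -> float:
    """Share of the operating time during which the network was usable."""
    if total_operating_hours <= 0:
        raise ValueError("total operating time must be positive")
    return (total_operating_hours - downtime_hours) / total_operating_hours


def mean_time_to_repair(outage_hours: list[float]) -> float:
    """Average duration of the recorded outages (MTTR), in hours."""
    if not outage_hours:
        raise ValueError("at least one outage is required")
    return sum(outage_hours) / len(outage_hours)


# Two hypothetical outage profiles with the same total downtime.
many_short = [0.5] * 20      # twenty half-hour interruptions
few_long = [5.0, 5.0]        # two five-hour failures

for name, outages in (("many short", many_short), ("few long", few_long)):
    share = availability(1000.0, sum(outages))
    mttr = mean_time_to_repair(outages)
    print(f"{name}: availability {share:.2%}, MTTR {mttr:.1f} h")
```

An availability goal alone therefore does not distinguish between the two cases; a service level definition usually needs a maximum failure duration as well.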
The main prerequisite for a high-availability infrastructure is the use of high-quality components. Even without any special equipment or configuration for ultra-high availability, the quality of components is an important factor in the reliability of hardware and software. Component quality is also decisive for the performance of diagnostic tools and system and network management applications, as well as for the level of maintenance and support that can be attained. If no concessions are made in these areas, the availability of the data processing structure is bound to be significantly above average. Availability can only be improved beyond this level by the addition of components and services. These can include:
• Redundant components
• Software and hardware switching
• Detailed planning of every scheduled downtime
• Reduction of system administration tasks
• Development of automatic error reaction systems
• Thorough acceptance testing prior to installation of new hardware or software components
• Specifications and practice drills for operator response to system failure
• Replicated databases and application software
• Clustering

Architecture                   Availability   Typical failure duration   Annual downtime
Uninterruptible operation      100%           None                       None
Fault tolerance                99.9999%       Ticks                      0.5 minutes
Fail-over by cluster           99.999%        Ticks to seconds           Up to 5 minutes
Fault resilience (fail-over)   99.99%         Seconds to minutes         Up to 50 minutes
High availability              99.9%          Minutes                    Up to 8 hours
Standard system                99%            Hours                      Several days
Source: AT&T/Gartner/TPPC
Redundant components can reduce the number of single points of failure in the network. When a given network component fails, its redundant counterpart is activated automatically. If the installation of fallback components is combined with software and hardware switching technologies, the redundant components can take over for malfunctioning components within seconds or even fractions of seconds. Reducing the level of interaction between the network and administrator is a useful step in establishing deterministic reactions to different error scenarios; ideally, a given error should consistently trigger a single, defined process. The individual steps involved in introducing a high-availability system are shown in Figure 1.10.

1. Determine current availability
2. Define availability goal
3. Define maximum failure duration
4. Choose a high-availability architecture
5. Modify/add application software for HA use
6. Develop procedures for system failures
7. Train administrators and operators
8. Test procedures at regular intervals
9. Document and monitor current system state
Figure 1.10 Steps involved in introducing a high-availability system
1.7 Summary
Mission-critical systems in today’s enterprise networks are growing more dependent every day on smoothly functioning data processing systems. Network managers are thus faced with the enormous challenge of increasing the availability of their data processing infrastructures while these infrastructures grow in both size and complexity. Network management is further complicated by the fact that corporate intranets are increasingly accessible through remote or public networks, such as the Internet, telecommunication service providers, customers’ networks, telecommuters’ systems and so on. It is no longer possible to have complete, end-to-end control over a company network. This makes it even more important to plan network operation and maintenance systematically, to implement appropriate procedures, and to have experienced network support staff equipped with advanced diagnostic and management tools.
Chapter 2 – Error Management in Data Networks
“…All those occasions on which a planned sequence of mental or physical activities fails to achieve its intended outcome.”
NEVILLE STANTON, DEFINITION OF “ERROR”
It is practically impossible to operate and maintain today’s large and complex data networks without carefully structured management systems and adherence to precisely defined procedures. Modern network management is primarily based on service level agreements (SLAs), which specify precise quality characteristics for guaranteed services. The network manager’s overriding responsibility is to ensure that SLAs are fulfilled. This represents a significant change in network management philosophy. Until recently, the individual components of network operation (such as network management, error management, installation, configuration and planning) were largely handled by separate operational entities. The new concept of service management unites all of these tasks under the common goal of fulfilling SLAs. This permits more efficient and purposeful utilization of available resources. Furthermore, keeping track of service level agreements is equivalent to constantly monitoring the data processing infrastructure, which in turn makes capacity planning realistic and documentable, and allows detailed cost management for the entire data processing infrastructure.

[Figure 2.1: management levels and their tasks]
Corporate Management
Service Management: service planning, SLA definition, SLA monitoring, capacity planning, cost management
Network Management: error management, configuration management, change management, operation monitoring, user support, installation and acceptance tests
Component Management: replacement part supply, backups, maintenance work, installation of acceptance tests, configuration and documentation, repairs
As can be seen in Figure 2.1, error management takes place at the network management level. The many aspects of error management, however, are so closely connected with all other areas of data-processing service management that these also have a role to play in strategic error management planning. For example, a given SLA includes a definition of the maximum mean time to repair (MTTR) for the service in question, which in turn has a direct impact on staff and equipment requirements in the network support department. Thus a comprehensive error management strategy must include specifications for the following network management elements:
• Corporate guidelines for data processing utilization
• All the services to be provided in the network
• Documented service level agreements for all services
• Systematic network planning
• Systematic documentation of all network components
• Use of network management tools
• Use of monitoring and diagnostic tools
• Regular network audits
• Use of network simulation tools
• Procedures for change management
• Procedures for problem handling and diagnostics
• Documentation of network problems
2.1 Corporate Guidelines for Data Processing Systems
Every enterprise should have clearly formulated guidelines regarding the use of its data processing systems and networks. Such corporate guidelines provide a framework for the data processing environment and form the best starting point for developing a comprehensive error management strategy. Unfortunately, these guidelines are often imprecisely formulated, leaving room for disagreements between network support staff and other personnel about responsibilities for particular areas of network operation or management. Some departments or business units, for example, assume that they can acquire and install whatever network hardware or software they like because they have always done so. Users generally see only the PC on their own desks, and some believe they are free to configure their “own” machines as they see fit. Obviously such activities make centralized network management impossible. This is why corporate guidelines must provide clear instructions for all personnel. In addition, steps must be taken to ensure that all staff members are aware of, and appropriately trained in, their areas of responsibility.
Corporate guidelines are usually based on a general definition of functions and tasks for the data network and computer system. As a rule, they include descriptions of the services to be provided, as well as the security and network availability requirements. The exact content of such guidelines varies according to the conditions within the organizational unit in question. In any case, it is essential that measures specified in these corporate guidelines be economically and technologically feasible and that the enterprise management guarantee their implementation. The task of formulating the guidelines should be shared between the network support department and enterprise management. Responsibility for interpreting the guidelines must also be defined in individual cases. No matter how detailed the guidelines are, there are always situations that require further interpretation of the guidelines before a decision can be reached.
2.1.1 Corporate Guidelines for Employees
Guidelines for employees should describe the rights and obligations of all network users and user groups, including permanent employees, guests, temporary employees, system administrators, service and maintenance personnel, and external consultants. Guidelines should cover the following topics for all user groups:
• Access to systems and services
• Restrictions on the use of systems and services, including prohibition of such actions as:
  • Breaking into other systems
  • Revealing passwords
  • Manipulating data in other users’ files
  • Sharing user accounts
  • Copying copyright-protected software
• Authorization to create user accounts
• Duties of each user:
  • Preserving confidentiality of user passwords
  • Changing user passwords regularly
  • Backing up user data
  • Preserving confidentiality of sensitive data
  • Adhering to guidelines regarding:
    • The use of system resources (data storage, CPU time, etc.)
    • Internet use
    • The use of company data processing services for private purposes
  • Monitoring user accounts for unauthorized use
  • Reporting unusual operating behavior, viruses, etc.
2.1.2 Corporate Guidelines for Network Hardware
Guidelines for hardware in the data processing infrastructure must include:
• Security regulations for individual hardware components (servers, routers, terminals, wiring closets, etc.)
• Security regulations for the company as a whole (concerning fire and flood alarm systems, air conditioning systems, etc.)
All critical components of the data processing infrastructure, including programs, file servers, routers, bridges, root terminals, etc., should be physically shielded and installed in air-conditioned rooms. Access to these rooms should be controlled via personal access codes or ID cards with magnetic strips, for example.
2.1.3 Corporate Guidelines for Network Software
Another important part of a holistic network strategy is the control of software applications used in the company. Users configuring their own computers without supervision from network managers often cause network errors due to the loss of data, faulty data storage, or system incompatibilities. The unauthorized installation of applications (often unlicensed!) or computer games can result in computer virus infection or Trojan horse infiltration. The network administration staff must have documentation of all computer system configurations as well as all network components. Only authorized personnel should be allowed to make changes in any configuration. Last but not least, a systematic backup strategy, including offsite backups for disaster recovery, should be implemented to provide security for user data.
2.2 Services and Service Level Agreements
A service level agreement should be defined for each of the main services specified in the corporate guidelines. Each SLA must include a precise description of the service, the funds that can be spent to attain it, its user group, and the date from which the SLA is effective. Furthermore, it should give the number of hours per day and the number of days per week or month during which the service in question shall be available, taking the required maintenance time into account. SLAs must also specify the exact number and location of users, as well as the hardware provided for the services in question. In addition, SLAs must describe procedures for reporting problems and requesting changes, and must define escalation procedures and network management response times for such requests. Finally, SLAs must state the specific goals for such service quality parameters as:
• Average availability
• Minimum availability
• Average response time
• Maximum response time
• Average throughput
Unfortunately, users’ obligations to participate in training programs or to adhere to certain procedural guidelines, for example, are often overlooked in SLAs. A user who does not fulfill such obligations may be responsible for a violation of the SLA. Typical services covered by SLAs in many companies include user support, network printing services, e-mail services, Internet access services, server access, and network operation in general. In addition to such services, there should also be SLAs for all mission-critical applications, such as customer ordering systems, databases, and production control systems.
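The service quality parameters listed above lend themselves to a simple machine-readable form. The following Python sketch is one possible representation, not a prescribed format: the names `ServiceLevelTargets` and `sla_violations`, the units, and all numeric values are assumptions made for illustration. It compares periodic measurements against the five targets and reports which ones were missed.

```python
from dataclasses import dataclass
from statistics import fmean


@dataclass(frozen=True)
class ServiceLevelTargets:
    """Hypothetical container for the quality goals stated in an SLA."""
    average_availability: float       # fraction, e.g. 0.995
    minimum_availability: float       # worst acceptable value in any period
    average_response_time_s: float
    maximum_response_time_s: float
    average_throughput_mbps: float


def sla_violations(targets: ServiceLevelTargets,
                   availability: list[float],
                   response_times_s: list[float],
                   throughput_mbps: list[float]) -> list[str]:
    """Return the names of the quality parameters whose targets were missed."""
    checks = {
        "average availability": fmean(availability) >= targets.average_availability,
        "minimum availability": min(availability) >= targets.minimum_availability,
        "average response time": fmean(response_times_s) <= targets.average_response_time_s,
        "maximum response time": max(response_times_s) <= targets.maximum_response_time_s,
        "average throughput": fmean(throughput_mbps) >= targets.average_throughput_mbps,
    }
    return [name for name, met in checks.items() if not met]


# Illustrative targets and four weekly measurements for one service.
targets = ServiceLevelTargets(0.995, 0.98, 0.5, 2.0, 40.0)
violations = sla_violations(
    targets,
    availability=[0.999, 0.997, 0.975, 0.999],
    response_times_s=[0.3, 0.4, 0.6, 0.2],
    throughput_mbps=[55.0, 48.0, 52.0, 45.0],
)
print(violations)
```

In this example a single poor week (97.5 percent) is enough to miss both the average and the minimum availability goals, while the response time and throughput goals are met.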
2.3 Network Planning and Documentation
The next step in implementing an error management strategy is to compile comprehensive documentation if such does not already exist. Network management in general, and error management in particular, are impossible without detailed documentation of all hardware and software components in the entire information processing infrastructure. Comprehensive documentation must include the following components:
• Building floor plans that show the cable infrastructure, wall jacks, wiring closets, network components and terminals
• Functional block diagrams
• Configuration database of all network components
• Documentation of equipment test results
The most time-consuming of these tasks is creating a diagram of the cable infrastructure. If possible, a computer-aided cable management system should be used. Usually the building floor plan already exists in electronic form and can be imported into such a system. Most computer-aided cable management systems can store visual representations of each network component as well as details about it, such as cable test results, device specifications or configuration data. If there is no such program available, copies of the original building floor plans must be obtained and every cable end and wall jack in the network clearly marked on them.
The cables and connectors should be designated in such a way that users can quickly inform network management personnel of the cables and connections in their own system. Ideally a dedicated label maker should be used to produce clear and professional-looking labels. If the pin assignments of a given cable or connector are not standard, the exact pin assignments must also be listed. Once all the data on cable routes, wall jacks and user terminals has been collected, it can be entered on the floor plan or in the cable management software. Wiring closets are usually depicted on separate diagrams that include both structural diagrams of the closet itself and detailed slot lists. If bus topologies (10Base2, 10Base5) are used, care must be taken to ensure that all connectors are included on the cabling diagram. Furthermore, passive components, such as repeaters and hubs, are often overlooked when documenting networks. It is important that these also be documented in detail, however, and included on cabling diagrams.
Another important part of network documentation is the functional block diagram. This diagram depicts all network components and segments, including all terminals, repeaters, hubs, bridges, routers, switches, and media access units (MAUs), along with their functions, connections, port bandwidths and redundancies. The third component of network documentation is a database containing the system configurations and specifications of all network components. The fourth component is a comprehensive record of all test results gathered during network audits and troubleshooting.
Building cabling plan
• Cable routes [include manufacturer, dimensions, use (power/data, data rate)]
• Wiring closet locations
• Sockets
• Terminal equipment
• Date of last modification
Wiring closet plan
• Location (building, floor, room)
• Wiring closet number
• Incoming/outgoing cable feeds
• Ground buses
• Incoming/outgoing cable feeds
• Date of last modification
Wiring closet layout
• Wiring closet number
• Make and model
• Installation position
• Equipment lists
• Sockets
• Date of last modification
Figure 2.2 Components of cable documentation
The TIA/EIA 606 specification, “The Administration Standard for the Telecommunications Infrastructure of Commercial Buildings”, describes standardized nomenclature for the documentation of data networks. This specification, available from Global Engineering Documents at http://www.global.ihs.com, is the “bible” of cable infrastructure management. Every computer-aided cable management system used should be based on TIA/EIA 606.
2.4 Network Management Tools
Network management systems (NMS) are essential for the operation of today’s complex data processing infrastructures and, by the same token, are indispensable for any error management system. In networks with more than 1,000 users, controlled network operation with careful consideration of expenditures is almost impossible without a network management system. Such systems are software applications that monitor and manage the individual software and hardware components in a network as well as the overall network structure. According to the definition in the ISO/IEC 7498-4 standard (“Information Processing Systems – Open Systems Interconnection – Basic Reference Model – Part 4: Management Framework”), a comprehensive network management system is composed of the following five parts:
• Configuration management
• Performance management
• Security management
• Accounting management
• Error management
Configuration management encompasses the administration of all hardware systems and applications used in the network. This includes system configuration and component management, distribution and licensing of software, and management of client and server system inventories. Furthermore, all network topologies must be surveyed and continuously updated.
Performance management consists of monitoring hardware and software performance by means of dedicated software agents. These agents monitor the relevant objects and report results to a central analytical application. The system may be configured to send an alarm signal to the network manager when defined thresholds are exceeded. In addition to monitoring applications and hardware, performance management has the task of monitoring network efficiency. This involves continuous monitoring of data streams between and within each network segment and recording the characteristics of traffic on each segment, such as packet size distribution, packet error rate, transmission delay, etc.
Security management involves the administration of network addresses, file areas, access rights, and passwords for each user. Additionally, system- and network-critical components, such as servers, routers, gateways, firewalls, etc., must be monitored for unauthorized or unusual activities.
The purpose of accounting management is to register all network use and classify overall usage of system resources by user. Such classification can be performed on the basis of a number of different parameters, such as hard disk storage on file servers, database access time, or the number and length of data packets transmitted over the network. The goal is to achieve a fair distribution of the total costs of data processing so that users are motivated to use the available resources sensibly and economically.
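A usage-based distribution of costs can be sketched as a simple proportional allocation. The Python example below is an illustration under stated assumptions, not a description of any particular accounting product: the function name `allocate_costs`, the department names, the choice of file server storage as the usage parameter, and all figures are hypothetical.

```python
def allocate_costs(total_cost: float, usage_by_user: dict[str, float]) -> dict[str, float]:
    """Split total_cost among users in proportion to their recorded usage."""
    total_usage = sum(usage_by_user.values())
    if total_usage <= 0:
        raise ValueError("no usage recorded")
    return {user: total_cost * usage / total_usage
            for user, usage in usage_by_user.items()}


# Hypothetical file server storage per department, in gigabytes.
storage_gb = {"sales": 50.0, "engineering": 30.0, "finance": 20.0}
shares = allocate_costs(10_000.0, storage_gb)

for department, cost in shares.items():
    print(f"{department}: ${cost:,.2f}")
```

In practice several parameters (storage, database access time, transmitted packets) would probably be weighted and combined; the proportional rule stays the same.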
Error management is probably the most widely implemented component of the ISO network management model. Every network requires some kind of error management system, even if only in rudimentary form. According to the ISO definition, error management in the strictest sense means defining and following procedures for detecting symptoms, restricting error domains, solving problems, testing solutions and documenting errors and their solutions. In practice, error management is far more complex, encompassing aspects of practically every area of responsibility in network operation, from documentation to planning.
Aside from the five aspects of network management described in the ISO model, the importance of data storage management has risen dramatically in recent years, and with it the amount of time and money involved. This area has thus developed into a separate network management field. With the sharp increase in storage space requirements and the growing dependence of companies on the availability of stored information, backup systems and their management have become new challenges in the data processing field. The amounts of data to be stored grow larger every day, and backup operations must be performed either outside working hours (“cold backups”) or during working hours (“hot backups”). Furthermore, separate procedures must be defined for storing data on local backup media and for transmitting data over the network to central storage media. Storage management tasks also include defining backup cycles, determining how long backups take, and controlling data restoration processes.
Management of all these components and their communication links is based upon software agents that run within the network components being monitored or controlled and on the communication protocols used for transmitting and receiving the corresponding information to and from a central network management station. The protocols available for these management tasks include the Simple Network Management Protocol (SNMP) and the less widely used Common Management Information Protocol (CMIP). Most modern network components have integrated SNMP agents, through which operating statistics can be read and configuration parameters loaded using SNMP. For network components with proprietary control mechanisms, proxies are used to translate SNMP into the control commands used.
In view of the wide variety of tasks involved in network operation, it is clear that high-performance management tools are required for maximum support. Monitoring and diagnostic systems are also indispensable when problems occur, and beforehand as well, for preventive action.
2.5 Monitoring and Diagnostic Tools
Special monitoring and diagnostic tools are essential for systematic troubleshooting and proactive error management in data networks. In very small networks, experienced network managers can solve some network problems without such tools, for example when the source of an error can be localized from the error symptoms alone, or when problems are prevented before they occur by running routine tests and replacing a component or two “just in case”. Unfortunately, this is only possible in a limited number of situations. In some of these cases the symptom may be removed, but the magnitude of the actual problem remains unknown. Another difficulty with this basically reactive method is that it neglects opportunities for proactive prevention of errors. Even in small networks, it is a good idea to use monitoring and diagnostic tools, at least to some extent. In larger networks with 1,000 nodes or more, such tools are indispensable as part of a universal network management system.
Diagnostic tools can be divided into two categories: tactical and strategic. Tactical systems are mobile, and are deployed in the affected network segment only until the problem in question has been solved. Strategic diagnostic tools are used continuously, to monitor a given network segment, for example. Strategic tools can be used to perform long-term studies or trend analyses. They can also be programmed to set off an alarm when defined threshold values are exceeded during network operation. Strategic and tactical diagnostic tools are often used in combination. The operating parameters detected by these diagnostic systems range from characteristics of the cable infrastructure to detailed analyses of communication protocols on the higher OSI layers. The test instruments used include cable testers, multimeters, optical time-domain reflectometers (OTDRs), protocol analyzers, bit error testers, and system probes. Details on the functions and uses of diagnostic tools are discussed in the next chapter.
2.6 Regular Network Audits
Regular network check-ups are an essential part of successful error management. Comprehensive audits of the network, including the cable infrastructure, should be performed at long intervals, such as once a year. Less detailed audits, involving a review of the principal operating parameters, should be performed at weekly or monthly intervals. These audits should be supplemented by continuous measurements of key parameters, such as the network traffic load, data packet sizes, numbers of error packets and service response times. When performed conscientiously, the combination of continuous measurements and regular audits keeps the network manager apprised of characteristic network operating parameters. This makes it possible to detect and correct many types of network problems at a very early stage. Experience has shown that in over 50 percent of serious network problems a comparison of the active operating parameters with those recorded previously on the same weekday at the same time of day can quickly limit the error domain to a very small range. This significantly reduces the mean time to repair. Network audits generally include examinations of the first three OSI layers: the physical, data link and network layers.
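The comparison of current operating parameters with values recorded on the same weekday at the same time of day can be sketched as a lookup in a baseline table. The following Python example is a minimal illustration: the function name `deviations`, the parameter names, the 50 percent tolerance and all measured values are assumptions chosen for the example, not recommended thresholds.

```python
from datetime import datetime

# Baseline values keyed by (weekday, hour); weekday 0 is Monday.
Baseline = dict[tuple[int, int], dict[str, float]]


def deviations(baseline: Baseline,
               measured_at: datetime,
               current: dict[str, float],
               tolerance: float = 0.5) -> dict[str, float]:
    """Return the relative change of every parameter that differs from the
    baseline for the same weekday and hour by more than `tolerance`."""
    reference = baseline.get((measured_at.weekday(), measured_at.hour), {})
    flagged = {}
    for name, value in current.items():
        expected = reference.get(name)
        if not expected:                 # no baseline (or zero): cannot compare
            continue
        relative_change = (value - expected) / expected
        if abs(relative_change) > tolerance:
            flagged[name] = relative_change
    return flagged


# Hypothetical baseline for Mondays at 10:00 and a current measurement.
baseline = {(0, 10): {"load_percent": 30.0, "error_packets_per_min": 4.0}}
now = datetime(2024, 3, 4, 10, 15)       # a Monday
current = {"load_percent": 33.0, "error_packets_per_min": 20.0}

print(deviations(baseline, now, current))
```

Here the traffic load is within the tolerance, while the number of error packets has risen to five times the baseline value, which points the search toward the physical or data link layer of that segment.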
2.6.1
Physical Layer Audit
Testing the entire cable infrastructure during a network audit may seem extravagant at first glance, but in practice it has repeatedly proven to be a highly effective preventive measure. Although today’s cable and connector systems are far more sturdy and reliable than those manufactured just a few years ago, a fair proportion of those errors that are intermittent or otherwise difficult to localize can be traced to these components. Such errors are often caused by the aging of materials (which is accelerated by the effects of light and humidity), by mechanical wear and tear, by poor workmanship, or by defective materials. In many cases the consequences of these defects become evident only gradually: the proportion of defective data packets with checksum errors increases and service response times grow longer, but communication basically continues, so the diminished throughput and mounting error rates often go unnoticed until the system fails completely. Yearly cable audits can reveal wear and tear in copper and fiber-optic cable segments. Replacing all lines that no longer fulfill minimum performance requirements can prevent errors during network operation. Some of the tests performed on cables require that the LAN or WAN connection be shut down. Redundant network structuring is useful in this case because the flow of data can be automatically switched to backup segments when an important connection is broken. In Token-Ring and FDDI
networks the data flow is shifted to the secondary ring; when routing and switching are used, network traffic is sent over alternative ports. Disconnecting individual lines for measurements during the annual cable infrastructure audit can also test these mechanisms.
Cable Audits in Local-Area Networks
Cable audits in local-area networks should include testing of the following parameters, as applicable:
Copper Twisted-Pair Cable
• Cable length
• Near-end crosstalk (NEXT)
• Signal-to-noise ratio (SNR)
• Attenuation
Coaxial Cable
• Cable length
• Reflection at connectors
• Number of stations
Fiber-Optic Cable
• Cable length
• Total attenuation
• Attenuation caused by splices
Token Ring, FDDI, Routers with Backup Routes, ATM Switches with PVCs
• Backup test: disconnect line to test backup switching/routing
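One way to record the results of such a cable audit is to compare each measured parameter against a minimum performance requirement and list the lines that fail. The sketch below is only an illustration: the parameter names and the numeric limits in `TWISTED_PAIR_LIMITS` are hypothetical placeholders, not values from this text or from any cabling standard, and would have to be replaced by the requirements that apply to the installed cable category.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limit:
    parameter: str  # name of the measured parameter
    bound: float    # required limit for that parameter
    kind: str       # "max": value must not exceed bound; "min": must not fall below


# Hypothetical placeholder limits for a twisted-pair segment (assumed values).
TWISTED_PAIR_LIMITS = [
    Limit("length_m", 100.0, "max"),
    Limit("next_db", 30.0, "min"),
    Limit("attenuation_db", 24.0, "max"),
]


def failed_parameters(measured, limits):
    """Return one message per parameter that is missing or out of limits."""
    failures = []
    for limit in limits:
        value = measured.get(limit.parameter)
        if value is None:
            failures.append(f"{limit.parameter}: not measured")
        elif limit.kind == "max" and value > limit.bound:
            failures.append(f"{limit.parameter}: {value} above maximum {limit.bound}")
        elif limit.kind == "min" and value < limit.bound:
            failures.append(f"{limit.parameter}: {value} below minimum {limit.bound}")
    return failures
```

An empty result means the segment met every listed requirement. Any entry marks a line that is a candidate for replacement before it causes errors during operation.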
Cable Audits in Wide-Area Networks
The most important Layer 1 tests in wide-area networks are signal condition and bit-error-rate (BER) measurements. The first step is to use a protocol analyzer to check all packets in network traffic for Cyclic Redundancy Check (CRC) errors. The next step is to check signal levels and signal conditions with an interface tester. The test results may indicate anomalies, such as deviations from the nominal output voltage, distorted signal shape, or jitter. If the line can be disconnected briefly, separate bit-error-rate testing (BERT) should also be performed. In this procedure, a number of different bit patterns are transmitted over the WAN link under test (either framed in 64 Kbit/s channels or unframed, depending on the transmitting port), then looped back from the receiving end and evaluated by the BER tester upon return. To obtain meaningful BERT results, these tests should be carried out over a period of several hours. The quality of the WAN link under test can be determined from the following BERT results:
• Bit-error rate
• Block-error rate
• Number of error-free seconds
• Number of errored seconds
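The four BERT results listed above can be derived from per-second counters, as in the following sketch. It assumes that the tester reports, for each second of the test, the number of bits and blocks received and the number of those found in error. The `SecondSample` record and the `summarize_bert` function are hypothetical names introduced for this example, and an errored second is taken here to mean any second with at least one bit error.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SecondSample:
    bits: int            # bits received during this second
    bit_errors: int      # bits received in error during this second
    blocks: int          # blocks received during this second
    errored_blocks: int  # blocks containing at least one bit error


def summarize_bert(samples):
    """Aggregate per-second counters into the four BERT results."""
    total_bits = sum(s.bits for s in samples)
    total_blocks = sum(s.blocks for s in samples)
    bit_errors = sum(s.bit_errors for s in samples)
    errored_blocks = sum(s.errored_blocks for s in samples)
    errored_seconds = sum(1 for s in samples if s.bit_errors > 0)
    return {
        "bit_error_rate": bit_errors / total_bits if total_bits else 0.0,
        "block_error_rate": errored_blocks / total_blocks if total_blocks else 0.0,
        "error_free_seconds": len(samples) - errored_seconds,
        "errored_seconds": errored_seconds,
    }
```

For a test lasting several hours, the list of samples would hold one record per second of the measurement.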
Audit parameter                Test instruments               Type of test
Local-area networks
Twisted pair
Cable length                   Cable tester                   Out-of-service
Near-end crosstalk (NEXT)      Cable tester                   Out-of-service
Signal-to-noise ratio (SNR)    Cable tester                   Out-of-service
Attenuation                    Cable tester                   Out-of-service
Coaxial cable
Cable length                   Cable tester                   Out-of-service
Reflections at connectors      Cable tester                   Out-of-service
Number of nodes                Protocol analyzer, LAN probes  In-service
Fiber optic
Cable length                   OTDR                           Out-of-service
Total attenuation              OTDR                           Out-of-service
Splice attenuation             OTDR                           Out-of-service
Wide-area networks
Signal level                   Interface tester               In-service
Signal condition               Interface tester               In-service
Bit err