Troubleshooting

Introduction

This section provides an overview of common issues that might be encountered at different insertion points when inserting Juniper platforms as a result of a trigger event (adding a new application or service to the organization). It does not provide exhaustive troubleshooting details; instead, it describes the principal recommended approaches to troubleshooting the most common issues and provides guidelines for identification, isolation, and resolution.

Troubleshooting Overview

When investigating the root cause of a problem, it is important to determine the problem’s nature and analyze its symptoms. When troubleshooting a problem, it is generally advisable to start at the most general level and work progressively into the details, as needed. Using the OSI model as a reference, troubleshooting typically begins at the lower layers (physical and data link) and works progressively up toward the application layer until the problem is found. This approach tends to quickly identify what is working properly so that it can be eliminated from consideration, and narrows the problem domain for quick problem identification and resolution.
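The bottom-up approach can be pictured as an ordered walk over per-layer checks that stops at the first failure. The following Python sketch is illustrative only: the layer names follow the OSI grouping used in this section, while the check functions and their results are hypothetical placeholders, not real device queries.

```python
from typing import Callable, List, Optional, Tuple

# (layer name, check function) pairs, ordered from the lowest layer upward.
Check = Tuple[str, Callable[[], bool]]

def first_failing_layer(checks: List[Check]) -> Optional[str]:
    """Run the checks bottom-up and return the first layer whose check fails."""
    for layer, check in checks:
        if not check():
            return layer  # every layer below this one has been eliminated
    return None  # all layers passed

if __name__ == "__main__":
    # Hypothetical results: physical and data link are healthy, routing is not.
    checks = [
        ("physical", lambda: True),
        ("data link", lambda: True),
        ("network", lambda: False),
        ("transport to application", lambda: True),
    ]
    print(first_failing_layer(checks))
```

In practice each placeholder would be replaced by the operator's own verification for that layer, such as the commands described later in this section.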

The following questions provide a method for using the clues and visible effects of a problem to reduce diagnostic time.

• Has the issue appeared just after a migration, a deployment of new network equipment, a new link connection, or a configuration change? This is the context being presented in this Data Center LAN Migration Guide. The Method of Procedure (MOP) detailing the steps of the operation in question should include the tasks to be performed to return to the original state before the network event, should any abnormal conditions be identified. If any issue arises during or after the operation that cannot be resolved in a timely manner, it may be necessary to roll back and disconnect newly deployed equipment while the problem is researched and resolved. The decision to back out should be made well in advance, prior to the expiration of the maintenance window. This type of problem is likely due to an equipment misconfiguration or planning error.

• Does the problem have a local or a global impact on the network? The cause of a local problem is likely to be found at L1 or L2, or it could be related to an Ethernet switching issue at the access layer. An IP routing problem may have a global impact on networks, and the operator should focus the investigation on the aggregation and core layers of the network.

• Is it an intermittent problem? When troubleshooting an intermittent problem, system logging and traceoptions are the primary debugging tools on Juniper Networks platforms, and they can be focused on various protocol mechanisms at various levels of detail. Events occurring in the network cause state transitions related to physical, logical, or protocol events to be logged to local or remote files for analysis.

• Is it a total or partial loss of connectivity or is it a performance problem? All Juniper Networks platforms have a common architecture in that there are separate control and forwarding planes. For connectivity issues, Juniper recommends that you first focus on the control plane to verify routing and signaling states and then concentrate on the forwarding or data plane, which is implemented in the forwarding hardware (Packet Forwarding Engine or PFE). If network performance is adversely affected by packet loss, delays, and jitter impacting one or multiple traffic types, the root cause is most likely related to network congestion, high link utilization, and packet queuing along the traversed path.

Hardware

The first action to take when troubleshooting a problem, and also before making any change in the network, is to ensure proper functionality and integrity of the network equipment and systems. A series of validation checks and inspection tests should be completed to verify that the hardware and the software operate properly and that there are no fault conditions. The following list presents the relevant “show commands” from the Junos OS CLI, with a brief description of the expected outcomes.

• show system boot-messages

Review the output and verify that no abnormal conditions or errors occurred during the booting process. POST (power-on self-test) results are captured in the bootup message log and stored on the hard drive.

• show chassis hardware detail

Verify that all hardware appears in the output (i.e., routing engines, control boards, switch fabric boards, power supplies, line cards, and physical ports). Verify that no hardware indicates a failure condition.

• show chassis alarms

Verify that there are no active alarms.

• show log messages

Search the log for errors and failures, and review it for any abnormal conditions. The search can be narrowed to specific keywords using the “grep” function.

• show system core-dumps

Check for any transient software failures. Under a fatal fault condition, Junos OS creates a core file of the kernel or of the processes in question for diagnostic analysis.

For more details on platform specifics, please refer to the Juniper technical documentation that can be found at www.juniper.net/techpubs.

OSI Layer 1: Physical Troubleshooting

An OSI Layer 1 problem or physical link failure can occur in any part of the network. Each media type has different physical and logical properties and provides different diagnostic capabilities. Focus here will be on Ethernet, as it is universally deployed in data centers at all tiers and in multiple flavors: GbE, 10GbE, copper, fiber, etc.

• The show interfaces extensive command produces the most detailed and complete information about all interfaces. It displays input and output errors for the interface in multiple categories such as carrier transitions, cyclic redundancy check (CRC) errors, L3 incomplete errors, policed discards, L2 channel errors, static RAM (SRAM) errors, packet drops, etc. It also contains interface status and setup information at both the physical and logical layers. Ethernet networks can present many symptoms, but troubleshooting can be helped by applying common principles: verify media type, speed, fiber mode and length, interface and protocol maximum transmission unit (MTU), flow control, and link mode. The physical interface may have a link status of “up” because the physical link is operational with no active alarm, while the logical interface has a link status of “down” because the data link layer cannot be established end to end. If this occurs, refer to the next command.

• The monitor interface command provides real-time packet and byte counters and displays error and alarm conditions.

After an equipment migration or a new link activation, the network operator should ping a locally connected host to verify that the link and interface are operating correctly and monitor if there are any incrementing error counters. The do-not-fragment flag in a ping test is a good tool to detect MTU problems which can adversely affect end-to-end communication.
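As a worked illustration of the do-not-fragment test, the largest ICMP echo payload that fits in a given IP MTU is the MTU minus the 20-byte IPv4 header (assuming no IP options) and the 8-byte ICMP header. The minimal Python sketch below computes that size. The function name and the example MTU values are illustrative assumptions, not taken from this guide.

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options (assumption)
ICMP_HEADER_BYTES = 8   # ICMP echo header

def max_ping_payload(ip_mtu: int) -> int:
    """Return the largest ICMP echo payload that fits in ip_mtu unfragmented."""
    payload = ip_mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES
    if payload < 0:
        raise ValueError("MTU too small to carry an ICMP echo")
    return payload

if __name__ == "__main__":
    # Example MTUs: standard Ethernet and a common jumbo-frame value.
    for mtu in (1500, 9000):
        print(mtu, max_ping_payload(mtu))
```

A do-not-fragment ping with this payload size should succeed end to end when every hop supports the assumed MTU, while a payload one byte larger should fail; a failure at the computed size suggests a smaller MTU somewhere along the path.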

For 802.3ad aggregated Ethernet interfaces, we recommend enabling Link Aggregation Control Protocol (LACP) as a dynamic bundling protocol to form one logical interface with multiple physical interfaces. LACP is designed to provide link monitoring capabilities and fast failure detection over an Ethernet bundle connection.

OSI Layer 2: Data Link Troubleshooting

Below are some common steps to assist in troubleshooting issues at Layer 2 in the access and aggregation tiers:

• Are the devices using the Dynamic Host Configuration Protocol (DHCP) to obtain an IP address? Is the DHCP server functioning properly so that host devices receive an IP address assignment from it? If routed, is the DHCP request being correctly forwarded?

• The monitor traffic interface ge-0/0/0 command provides a tool for monitoring local traffic. Expect to see all packets that are sent out of and received on ge-0/0/0. This is particularly useful to verify the Address Resolution Protocol (ARP) process over the connected LAN or VLAN. Use the show arp command to display ARP entries.

• Is the VLAN in question active on the switch? Is a trunk active on the switch that could interfere with the ability to communicate? Is the routed VLAN interface (RVI) configured with the correct prefix and attached to the corresponding VLAN? Is VRRP functioning properly and showing one unique routing node as master for the virtual IP (VIP) address?

• Virtual Chassis, Layer 3 uplinks, inverted U designs, and VPLS offer different alternatives to prevent L2 data forwarding loops in a switching infrastructure without the need to implement Spanning Tree Protocols (STPs). Nevertheless, it is common best practice to enable STP as a protection mechanism to prevent broadcast storms in the event of a switch misconfiguration or a connection being established by accident between two access switches.

Virtual Chassis Troubleshooting

Configuring a Virtual Chassis is essentially plug and play. However, if there are connectivity issues, the following section provides the relevant commands to perform operational analysis and troubleshooting. To troubleshoot the configuration of a Virtual Chassis, perform the following steps.

Check and confirm Virtual Chassis configuration and status with the following commands:

• show configuration virtual-chassis

• show virtual-chassis member-config all-members

• show virtual-chassis status

Check and confirm Virtual Chassis interfaces:

• show interfaces terse

• show interfaces terse vcp*

• show interfaces terse *me*

Verify that the mastership priority is assigned appropriately:

• show virtual-chassis status

• show virtual-chassis vc-port all-members

Verify the Virtual Chassis active topology and neighbors:

• show virtual-chassis active-topology

• show virtual-chassis protocol adjacency

• show virtual-chassis protocol database extensive

• show virtual-chassis protocol route

• show virtual-chassis protocol statistics

In addition to the verifications above, also check the following:

• Check the cable to make sure that it is properly and securely connected to the ports. If the Virtual Chassis port (VCP) is an uplink port, make sure that the uplink module is model EX-UM-2XFP.

• If the VCP is an uplink port, make sure that the uplink port has been explicitly set as a VCP.

• If the VCP is an uplink port, make sure that you have specified the options (pic-slot, port-number, member-id) correctly.

OSI Layer 3: Network Troubleshooting

While L1 and L2 problems have limited effect and are local, L3 or routing issues may affect other networks by propagation and may have a global impact. In the data center, the aggregation/core tiers may be affected. The following focuses on the operation of OSPF and BGP as they are commonly implemented in data centers to exchange internal and external routing information.

In a next-generation or newly deployed network, OSPF’s primary responsibility typically is to discover endpoints for internal BGP. Unlike OSPF, BGP may play multiple roles that include providing connectivity to an external network, exchanging information between VRFs for L3 MPLS VPN or VPLS, and possibly carrying the data center’s internal routes to access routers.

OSPF

A common OSPF problem is an adjacency that fails to form, which can occur for multiple reasons: mismatched IP subnet/mask, area number, area type, authentication, hello/dead interval, network type, or IP MTU. The following are useful commands for troubleshooting an OSPF problem:

• show ospf neighbor displays information about OSPF neighbors and the state of the adjacencies, which must be shown as “full.”

• show ospf interface displays information about the status of OSPF interfaces.

• show ospf log displays the log of shortest-path-first (SPF) calculations.

• show ospf statistics displays number and type of OSPF packets sent and received.

• show ospf database displays entries in the OSPF link-state database (LSDB).

OSPF traceoptions provide the primary debugging tool, and the OSPF operation can be flagged to log error packets and state transitions along with the events causing them.
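The adjacency parameters named at the start of this subsection must agree on both ends of a link. A minimal Python sketch of that comparison follows. The dictionary keys and sample values are hypothetical stand-ins for what an operator would read from each router, not the output of any Junos command.

```python
from typing import Dict, List

# Parameters that must match for an OSPF adjacency to form (from the list above).
MUST_MATCH = ("subnet", "area", "area_type", "authentication",
              "hello_interval", "dead_interval", "network_type", "ip_mtu")

def ospf_mismatches(local: Dict[str, object], remote: Dict[str, object]) -> List[str]:
    """Return the names of parameters whose values differ between two neighbors."""
    return [key for key in MUST_MATCH if local.get(key) != remote.get(key)]

if __name__ == "__main__":
    # Hypothetical interface settings collected from two routers on one link.
    r1 = {"subnet": "10.0.0.0/30", "area": "0.0.0.0", "area_type": "normal",
          "authentication": "md5", "hello_interval": 10, "dead_interval": 40,
          "network_type": "broadcast", "ip_mtu": 1500}
    r2 = dict(r1, hello_interval=5, ip_mtu=9000)  # two deliberate mismatches
    print(ospf_mismatches(r1, r2))
```

An empty result means the compared parameters agree, so attention can move to other causes such as a physical or data link fault.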

BGP

show bgp summary is a primary command used to verify the state of BGP peer sessions. The peering must be shown as “established” to be fully operational.

BGP has multiprotocol capabilities made possible through simple extensions that add new address families. This command also helps to verify which address families are carried over the BGP session, for example, inet-vpn if L3 MPLS VPN service is required or L2VPN for VPLS.

BGP is a policy-driven routing protocol. It offers flexibility and granularity when implementing routing policy for path determination and for prefix filtering. A network operator must be familiar with the rich set of attributes that can be modified and also with the BGP route selection process. Routing policy controls and filters can modify routing information entering or leaving the router in order to alter forwarding and routing decisions based on the following criteria:

• What should be learned about the network from all protocols?

• What routes should be shared with other routing protocols?

• What should be advertised to other routers?

• What routing information should be modified, if any?

Consistent policies must be applied across the entire network to filter/advertise routes and modify BGP route attributes. The following commands assist in the troubleshooting of routing policies:

• show route receive-protocol bgp <neighbor> displays routes and attributes received by BGP from a specific peer.

• show route advertising-protocol bgp <neighbor> displays route and attributes sent by BGP to a specific peer.

• show route hidden extensive displays routes not usable due to BGP next-hop problems and routes filtered by an inbound route filter.

Logging of peer state transitions and flagging BGP operations provides a good source of information when investigating BGP problems.

VPLS Troubleshooting

This section provides a logical approach to take when determining the root cause of a problem in a VPLS network. A good place to start is to verify the configuration setup. Following is a brief configuration snippet with corresponding descriptor.

routing-instances {
    vpls_vpn1 {                              # arbitrary name
        instance-type vpls;                  # VPLS type
        vlan-tags outer 4094 inner 4093;     # VLAN normalization; must match if configured
        interface ge-1/0/0.3001;             # int.unit
        route-distinguisher 65000:1001;      # RD carried in MP-BGP
        vrf-target target:65000:1001;        # VPN RT; must match on all PEs in this VPLS
        protocols {
            vpls {
                mac-table-size {
                    100;                     # max MAC table size
                }
                interface-mac-limit {
                    50;                      # max MACs that may be learned from all CE-facing interfaces
                }
                no-tunnel-services;          # LSI interfaces for tunneling
                site site-1 {                # arbitrary name
                    site-identifier 1001;    # unique site ID
                    interface ge-1/0/0.3001; # list of int.unit in this VPN
                }
            }
        }
    }
}

The next step is to verify the control plane with the following operational commands:

• show route receive-protocol bgp <neighbor> table <vpls-vpn> detail displays BGP routes received from an MP-iBGP peer for a VPLS instance. Use the detail/extensive option to see other BGP attributes such as route target (RT), label base, and site ID. The BGP next hop must have a route in the routing table for the mapping to a transport MPLS LSP.

• show vpls connections is an excellent command to verify the VPLS connection status and to aid in troubleshooting.

After the control plane has been validated as fully functional, the forwarding plane should be checked next by issuing the following commands. Note that the naming of devices maps to a private MPLS network, as opposed to using a service provider MPLS network.

On local switch:

• show arp

• show interfaces ge-0/0/0

On MPLS edge router:

• show vpls mac-table

• show route forwarding-table

On MPLS core router:

The commands presented in this section should highlight the proper VPLS operation as follows:

• Sending to unknown MAC address, VPLS edge router floods to all members of the VPLS.

• Sending to a known MAC address, VPLS edge router maps to an outer and inner label.

• Receiving a MAC address, VPLS edge router identifies the sender and maps the MAC address to a label stack in the MAC address cache.

• VPLS provider edge (PE) router periodically ages out unused entries from the MAC address cache.

Multicast

Viewed simply, multicast routing is unicast routing turned upside down: it focuses on where the packet came from and directs traffic away from its source. When troubleshooting multicast, the following methodology is recommended:

• Gather information

In one-to-many and many-to-many communications, it is important to have a good understanding of the expected traffic flow to clearly identify all sources and receivers for a particular multicast group.

• Verify receiver interest by issuing the following commands:

show igmp group <mc_group> displays information about Internet Group Management Protocol (IGMP) group membership received from the multicast receivers on the LAN interface.

show pim interfaces is used to verify the designated router for that interface or VLAN.

• Verify knowledge of the active source by issuing the following commands:

show multicast route group <mc_group> source-prefix <ip_address> extensive displays the forwarding state (pruned or forwarding) and the rate for this multicast route.

show pim rps extensive determines if the source designated router has the right rendezvous point (RP) and displays tunnel interface-related information for register message encapsulation/de-encapsulation.

• Trace the forwarding state backwards, working your way back towards the source IP and looking for Protocol Independent Multicast (PIM) problems along the way with the following commands:

show pim neighbors displays information about PIM neighbors.

show pim join extensive <mc_group> validates the outgoing interface list and upstream neighbor, and displays source tree and shared tree (rendezvous-point tree, or RPT) state, with join/prune status.

show multicast route group <mc_group> source-prefix <ip_address> extensive checks whether traffic is flowing and has a positive traffic rate.

show multicast rpf <source_address> displays the reverse path forwarding (RPF) information for the source. Multicast routing uses a “reverse path forwarding” check: a router forwards a multicast packet only if it is received on the upstream interface to the source. Otherwise, the RPF check fails, and the packet is discarded.
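The reverse path forwarding check described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the routing table is a plain dictionary of prefix to interface, the upstream interface is chosen by longest-prefix match, and the prefixes and interface names are hypothetical examples.

```python
import ipaddress
from typing import Dict, Optional

def rpf_interface(routes: Dict[str, str], source: str) -> Optional[str]:
    """Return the interface of the longest-prefix route covering source, if any."""
    addr = ipaddress.ip_address(source)
    best = None  # (network, interface) of the most specific match so far
    for prefix, interface in routes.items():
        network = ipaddress.ip_network(prefix)
        if addr in network and (best is None or network.prefixlen > best[0].prefixlen):
            best = (network, interface)
    return best[1] if best else None

def rpf_check(routes: Dict[str, str], source: str, arrival_interface: str) -> bool:
    """Pass only if the packet arrived on the upstream interface toward source."""
    return rpf_interface(routes, source) == arrival_interface

if __name__ == "__main__":
    # Hypothetical unicast routes used to locate the multicast source.
    routes = {"10.0.0.0/8": "ge-0/0/1.0", "10.1.1.0/24": "ge-0/0/2.0"}
    print(rpf_check(routes, "10.1.1.5", "ge-0/0/2.0"))  # arrived on the upstream interface
    print(rpf_check(routes, "10.1.1.5", "ge-0/0/1.0"))  # arrived elsewhere, so discarded
```

A source with no covering route yields no upstream interface, so the check fails for every arrival interface, which mirrors a router that has no route back to the source.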

Quality of Service/Class of Service (CoS)

Link congestion in the network can be the root cause of packet drops. The show interfaces queue command provides CoS queue statistics for all physical interfaces to assist in determining the number of packets dropped due to tail drop, and the number of packets dropped due to random early detection (RED).
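To illustrate the difference between the two drop counters, the sketch below uses the textbook RED model: no drops below a minimum queue fill level, a linearly rising drop probability up to a maximum fill level, and certain drop beyond it. This is a generic illustration and not the Junos drop-profile implementation; the function name, thresholds, and probabilities are assumptions.

```python
def red_drop_probability(fill: float, min_fill: float, max_fill: float,
                         max_prob: float) -> float:
    """Textbook RED: drop probability as a function of queue fill (0.0 to 1.0)."""
    if not 0.0 <= min_fill < max_fill <= 1.0:
        raise ValueError("require 0 <= min_fill < max_fill <= 1")
    if fill < min_fill:
        return 0.0  # queue is shallow: nothing is dropped
    if fill >= max_fill:
        return 1.0  # beyond the upper threshold: every packet is dropped
    # Between the thresholds the probability rises linearly up to max_prob.
    return max_prob * (fill - min_fill) / (max_fill - min_fill)

if __name__ == "__main__":
    # Assumed thresholds: start dropping at 40% fill, drop everything at 80%.
    for fill in (0.2, 0.6, 0.9):
        print(fill, red_drop_probability(fill, 0.4, 0.8, 0.5))
```

Tail drop, by contrast, discards packets only once the queue is completely full, which corresponds to the case where no early-drop region is configured.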

OSI Layer 4-7: Transport to Application Troubleshooting

This type of problem is most likely to occur on firewalls or on routers secured with firewall filters. Below are some important things to remember when troubleshooting Layer 4-7 issues:

• Standard troubleshooting tools such as ping and traceroute may not work. Generally, ping and traceroute are not enabled through a firewall except in specific circumstances.

• Firewalls are routers too. In addition to enforcing stateful policies on traffic, firewalls also have the responsibility of routing packets to their next hop. To do this, firewalls must have a working and complete routing table, statically or dynamically defined. If the table is incomplete or incorrect, the firewall will not be able to forward traffic correctly.
