
Chapter 4. Problem determination guide

4.3 Problem determination flows

4.3.2 Connectivity problem isolation flows

At this point, one or more systems within the SAN have been identified as good starting locations for the next problem determination step. This section provides two maps for checking the physical connections and logical links, and describes each step that is outlined in the maps. To illustrate the process by example, we use the simple test SAN that was created for this publication. Figure 4-6 on page 125 is a connections diagram of that SAN. As an additional reference, Table 4-1 on page 125 lists the devices by name and WWN. Units with multiple HBAs list only the ports that connect to this SAN.

The basic premise is to start with a server as the starting edge device. We then proceed to follow the connections through the SAN environment to the storage edge device on the other side of the SAN fabric. As the various steps are described, we will provide examples using our sample SAN.

If your initial investigations have determined that a fabric device or storage device is the best initial point, then the maps can be re-ordered as needed. We are using the server-to-storage device as a simplified method to explain the process. This methodology is aimed primarily at troubleshooting constantly occurring problems. The process described in this section can also provide some valuable clues for cases with intermittent problems.

Figure 4-6 Example SAN diagram

Table 4-1 List of code levels and WWN by device name

Device name           Code level  WWN

tiod95                AIX 5.1.0   10:00:00:00:c9:27:38:72 (fcs0)
                                  10:00:00:00:c9:23:ed:73 (fcs3)
tiod96                AIX 4.3.3   10:00:00:00:c9:26:58:12 (fcs0)
                                  10:00:00:00:c9:24:0d:8b (fcs2)
tiod70                AIX 4.3.3   10:00:00:00:c9:26:a6:28
brocade02 (2109-S16)  2.4.1e      10:00:00:60:69:10:74:9c
McDATA ED-5000        3.0.1       10:00:08:00:88:60:6a:e5
McDATA ES-1000        1.02.0      10:00:08:00:88:60:40:e8
Mickey (2105-F20)     1.4.0.237   10:00:00:00:c9:21:23:f5
Minnie (2105-E20)     1.4.0.237   10:00:00:00:c9:23:9e:a2
EMC Symmetrix                     50:06:04:82:03:3e:e5:cf
3590 - 1 (3590-E11)   1.15.9.7    50:05:07:63:00:40:3b:5e
3590 - 1 (3590-E11)   1.15.9.7    50:05:07:63:00:40:20:14
3590 - 1 (3590-E11)   1.15.9.7    50:05:07:63:00:40:36:02
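When a WWN shows up in a switch or host listing, it helps to translate it back to a device name quickly. The following is a minimal sketch of such a lookup, using the names and WWNs from Table 4-1. The helper names `normalize_wwn` and `device_for` are hypothetical and are not part of any tool described in this publication.

```python
# WWN-to-device lookup built from Table 4-1.
# The function names below are illustrative assumptions, not an existing tool.
WWN_TO_DEVICE = {
    "10:00:00:00:c9:27:38:72": "tiod95 (fcs0)",
    "10:00:00:00:c9:23:ed:73": "tiod95 (fcs3)",
    "10:00:00:00:c9:26:58:12": "tiod96 (fcs0)",
    "10:00:00:00:c9:24:0d:8b": "tiod96 (fcs2)",
    "10:00:00:00:c9:26:a6:28": "tiod70",
    "10:00:00:60:69:10:74:9c": "brocade02 (2109-S16)",
    "10:00:08:00:88:60:6a:e5": "McDATA ED-5000",
    "10:00:08:00:88:60:40:e8": "McDATA ES-1000",
    "10:00:00:00:c9:21:23:f5": "Mickey (2105-F20)",
    "10:00:00:00:c9:23:9e:a2": "Minnie (2105-E20)",
    "50:06:04:82:03:3e:e5:cf": "EMC Symmetrix",
    "50:05:07:63:00:40:3b:5e": "3590 - 1 (3590-E11)",
    "50:05:07:63:00:40:20:14": "3590 - 1 (3590-E11)",
    "50:05:07:63:00:40:36:02": "3590 - 1 (3590-E11)",
}


def normalize_wwn(raw):
    """Return a WWN as eight lowercase, colon-separated hex bytes."""
    # Keep only hex digits so that separators such as ':' or '-' are ignored.
    digits = "".join(ch for ch in raw.lower() if ch in "0123456789abcdef")
    if len(digits) != 16:
        raise ValueError(f"expected 16 hex digits, got {len(digits)}: {raw!r}")
    return ":".join(digits[i:i + 2] for i in range(0, 16, 2))


def device_for(raw):
    """Map a WWN in any common notation to its Table 4-1 device name."""
    return WWN_TO_DEVICE.get(normalize_wwn(raw), "unknown device")


if __name__ == "__main__":
    print(device_for("10000000C9273872"))          # tiod95 (fcs0)
    print(device_for("10-00-00-60-69-10-74-9C"))   # brocade02 (2109-S16)
```

Normalizing before the lookup matters because different devices display the same WWN with different separators and letter case.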

The first map, Figure 4-7 on page 127, checks the physical connections from the server to a storage device. This map outlines three steps that take little effort or time to carry out. These first steps are nothing more than a visual inspection of the status LEDs that are now incorporated on most SAN devices. In situations with a constant problem, this initial step should quickly indicate whether the cause is a physical issue with the connections.

If no obvious connection problems are indicated by the LEDs on the various devices, the next action in the problem determination process is to follow Map 2 (Figure 4-8 on page 145). This troubleshooting map is based on the search for one or more issues in the logical connectivity between edge devices. In this map we proceed in a similar manner as in Map 1, starting with further investigation of the pSeries server. Once started, the plan is to progress methodically from the server through the fabric to the storage resources until the problem is found.
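The server-to-fabric-to-storage progression can be sketched as an ordered list of checks that stops at the first failing stage. This is only an illustration of the ordering in Map 2. The function name `run_isolation_flow` and the placeholder checks are assumptions, and each placeholder stands in for the real inspections described in Sections 4.4, 4.5, and 4.6.

```python
def run_isolation_flow(checks):
    """Run checks in order and return the first failing stage, or None.

    checks is a list of (stage_name, check_function) pairs. Each check
    function takes no arguments and returns True when the stage is healthy.
    """
    for stage, check in checks:
        if not check():
            return stage   # stop here: later stages are not examined yet
    return None            # every stage passed


if __name__ == "__main__":
    # Placeholder results only; a real check would inspect the actual device.
    checks = [
        ("Step 4: pSeries server", lambda: True),
        ("Step 5: SAN fabric", lambda: False),
        ("Step 6: storage devices", lambda: True),
    ]
    print(run_isolation_flow(checks))   # Step 5: SAN fabric
```

If the initial investigation points to a fabric or storage device instead, the same sketch applies with the list re-ordered.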

Based on Map 2’s outline, the next series of actions (Step 4) involves verifying the basic operations within the pSeries server. This includes checking such items as:

• Correct operation of the HBA in the pSeries server

• Proper installation of device drivers for the HBA

• Proper installation of drivers for the logical SCSI I/O Controller protocol device

Step 4 in Map 2 refers to the actions and explanations that are found in Section 4.4, “Checking the pSeries server” on page 132. If the problem is discovered during those actions, that section also contains some possible solutions. The possible solutions are based on some of the more common issues.

While most of the actions for implementing Step 4 are outlined and discussed in Section 4.4, “Checking the pSeries server” on page 132, Section 4.3.3, “Advanced problem isolation techniques” on page 129 provides additional methods that the reader can use. Please note that these additional actions are not necessarily required to troubleshoot every problem. The reader should review the various actions and determine whether a given action will provide useful information.

If the problem has not been found and still exists after Step 4, then the next step is to perform a basic health check of the SAN fabric. Step 5 of Map 2 is the entry point for these actions, and further explanation is located in Section 4.5, “Checking the fabric” on page 147. Some of the basic items that are checked in Step 5 are:

• Confirming logical connections between all fabric devices

• Inspecting the fabric’s zone configuration for conflicts or other errors

Figure 4-7 Map 1 - Troubleshooting physical links in a SAN

Outside of a given device’s hard failure, a typical fabric-based problem involves the consistency of the zone configuration information that is maintained in most fabric devices.

One method of identifying this type of issue is to manually inspect and compare the configuration of each device with all others. However, Section 4.3.3, “Advanced problem isolation techniques” on page 129 and Section 4.5, “Checking the fabric” on page 147, have additional details and enhanced troubleshooting actions for the fabric.
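The manual comparison of zone configurations can be expressed as a small sketch, assuming that the zone definitions have already been collected from each fabric device into a dictionary. The data layout, the function name `find_zone_mismatches`, and the switch and zone names are hypothetical. The zone memberships shown are invented for illustration and do not describe the zoning of the sample SAN.

```python
def find_zone_mismatches(configs):
    """Return zones whose definition is not identical on every switch.

    configs maps switch name -> {zone name -> set of member WWNs}.
    The result maps zone name -> {switch name -> member set, or None
    when that switch does not define the zone at all}.
    """
    all_zones = set()
    for zones in configs.values():
        all_zones.update(zones)

    mismatches = {}
    for zone in sorted(all_zones):
        per_switch = {sw: zones.get(zone) for sw, zones in configs.items()}
        views = list(per_switch.values())
        # A zone is consistent only if every switch reports the same members.
        if any(view != views[0] for view in views[1:]):
            mismatches[zone] = per_switch
    return mismatches


if __name__ == "__main__":
    # Hypothetical data: switch names, zone names, and memberships are invented.
    configs = {
        "switch_1": {
            "zone_a": {"10:00:00:00:c9:27:38:72", "10:00:00:00:c9:21:23:f5"},
            "zone_b": {"10:00:00:00:c9:26:a6:28", "50:06:04:82:03:3e:e5:cf"},
        },
        "switch_2": {
            "zone_a": {"10:00:00:00:c9:27:38:72", "10:00:00:00:c9:21:23:f5"},
            "zone_b": {"10:00:00:00:c9:26:a6:28"},   # one member missing
        },
    }
    for zone, views in find_zone_mismatches(configs).items():
        print(zone, "differs:", sorted(views))
```

Comparing every device against every other by hand grows quickly with the number of devices, which is why reporting only the zones that differ is the more practical form of this check.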

If the pSeries server and the SAN fabric are apparently working correctly, then the next action is Step 6. Step 6 covers troubleshooting storage devices. Further information and materials can be found in Section 4.3.3, “Advanced problem isolation techniques” on page 129, and Section 4.6, “Checking the storage systems” on page 191. Some of the basic actions for the inspection of the various storage units are:

• Verifying the logical connectivity to the SAN fabric

• Checking any masking implementation in the storage devices, such as LUN masking

• Inspecting the resource allocation to specific servers within the storage unit

The actions described in Section 4.6, “Checking the storage systems” on page 191 should be completed before implementing any of the advanced techniques.
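As an illustration of the masking check, the sketch below lists which LUNs a given server HBA is allowed to see, assuming the masking definitions have been exported from the storage unit into a simple dictionary. The data layout, the LUN identifiers, and the function name `visible_luns` are hypothetical and do not correspond to the interface of any particular storage device.

```python
def visible_luns(masking, initiator_wwn):
    """Return the sorted LUN identifiers that the initiator WWN may access.

    masking maps LUN identifier -> collection of allowed initiator WWNs.
    The comparison ignores letter case in the WWNs.
    """
    wanted = initiator_wwn.lower()
    return sorted(
        lun
        for lun, allowed in masking.items()
        if wanted in {wwn.lower() for wwn in allowed}
    )


if __name__ == "__main__":
    # Hypothetical masking table; the LUN names and assignments are invented.
    masking = {
        "lun_00": ["10:00:00:00:c9:27:38:72"],
        "lun_01": ["10:00:00:00:c9:27:38:72", "10:00:00:00:c9:26:58:12"],
        "lun_02": ["10:00:00:00:c9:26:a6:28"],
    }
    print(visible_luns(masking, "10:00:00:00:C9:27:38:72"))
```

An empty result for a server that should have storage assigned suggests that the masking or resource allocation in the storage unit is worth a closer look.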
