
Chapter 4. Problem determination guide

4.3 Problem determination flows

4.3.3 Advanced problem isolation techniques

This section describes advanced troubleshooting actions intended for readers with a solid background in SAN environments. Several of these actions are techniques typically used by remote technical support groups. Some of them can be disruptive to a SAN, and those are identified as such. These actions are not appropriate in every problem situation, so use them with caution.

This section occasionally refers to material that is covered in later sections. It is presented here as a natural progression of increasing complexity, with references and examples to follow.

Figure 4-9 Map 3 - Advanced troubleshooting

As shown in Map 3, there is no set starting point for continuing the problem determination process. The actions listed under each SAN area (server, fabric, and storage) should already have been checked before considering any of the methods provided in this section. We now revisit the list of possible advanced actions with further details about each.

• Reboot the pSeries server.

Rebooting is one of the most basic operations on an AIX system, and it can resolve a number of issues. A server reboot does, however, impact the SAN environment. In return, the reboot causes the AIX operating system to reload all associated device drivers after a consistency check of the filesets, and to run the configuration manager application again.

• Reload the device driver/component in question.

If a good first rule of thumb is to reboot the server, a close second is to reload the code component in question. This action is particularly useful in new installation situations. An example occurred during the preparations for this publication. In our case, a server had its device drivers reloaded, but showed only a subset of storage resources in a defined state. After extensive troubleshooting, we deleted all of the defined resources, reloaded the device drivers, and then ran the configuration manager application again. The problem was then resolved.

• Delete the adapter instance plus any associated logical resources and rerun the configuration manager application, cfgmgr.

This action is a variation of reloading the device component. Its intent is to reinitialize the AIX operating system’s knowledge of the storage resources available to it. The need for it usually arises during installation of new devices in the SAN environment. This action is disruptive to the server, and it can impact a heavily loaded SAN when cfgmgr is run on a server with many resources available to it.
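The delete-and-rediscover sequence described above can be sketched as a short script. This is a minimal sketch, not a tested procedure: the adapter name fcs0 is a hypothetical example, the run helper and the DRY_RUN switch are our own additions, and the script only prints the commands unless DRY_RUN=0 is set. It assumes that no file systems or volume groups on the affected disks are in use. Pseudo-devices layered above the disks, such as SDD vpath devices, are not children of the adapter and may need to be removed separately.

```shell
#!/bin/sh
# Dry-run by default: print each command instead of executing it.
# Set DRY_RUN=0 on the AIX server, as root, to really run the commands.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Delete an adapter instance and its child devices, then rediscover them.
# "fcs0" is a hypothetical adapter name; use a name reported by lsdev.
reconfigure_adapter() {
    adapter="${1:-fcs0}"
    run lsdev -Cc adapter       # list the adapters known to AIX
    run rmdev -Rdl "$adapter"   # -R: include child devices, -d: delete the definition, -l: device name
    run cfgmgr                  # rerun the configuration manager
    run lsdev -Cc disk          # check which disks were rediscovered
}
```

Calling reconfigure_adapter fcs0 with the default dry-run setting prints the four commands in order, which allows the sequence to be reviewed before it is run on a production server.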

• Run an SDD trace.

The Subsystem Device Driver (SDD) supports AIX trace functions. The trace ID for the SDD is 2F8. The trace tracks routine entry, exit, and error paths of the SDD algorithm. To use the trace, manually turn on the trace function before the problem is re-created, then turn it off either after the problem is seen or whenever the trace report is needed. To start the trace function, issue the trace -a -j 2F8 command. To stop the trace function, use the trcstop command. To then read the report, use the trcrpt | pg command.

When the AIX trace is running, the SDD logs error conditions into the AIX errlog system. To check whether the SDD generated an error log message, use the errpt -a | grep VPATH command. Refer to the IBM Subsystem Device Driver Installation and User’s Guide softcopy documentation for more information about the various error log messages and their explanations.
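The start, stop, and report steps above can be wrapped in two small functions so that the trace brackets the attempt to re-create the problem. This is a minimal sketch under stated assumptions: the run helper, the DRY_RUN switch, and the report path /tmp/sdd_trace.rpt are hypothetical additions of ours, and the report is written to a file with trcrpt -o as a variation on paging it with trcrpt | pg. The lslpp check confirms that the trace fileset is installed before the trace is started.

```shell
#!/bin/sh
# Dry-run by default: print each command instead of executing it.
# Set DRY_RUN=0 on the AIX server, as root, to really run the commands.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Start tracing SDD events only (trace ID 2F8).
sdd_trace_start() {
    run lslpp -l bos.sysmgt.trace   # confirm that the trace fileset is installed
    run trace -a -j 2F8             # -a: run in the background, -j: collect only this trace ID
}

# Stop the trace and write the formatted report to a file.
# The default report path is a hypothetical example.
sdd_trace_stop() {
    report="${1:-/tmp/sdd_trace.rpt}"
    run trcstop
    run trcrpt -o "$report"         # -o: write the report to the named file
}
```

A typical session calls sdd_trace_start, re-creates the problem, and then calls sdd_trace_stop. The resulting report file can be read with pg or sent to a support group.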

• Generate and interpret the error log from the pSeries servers.

A number of applications and the AIX operating system make entries in the AIX error log. This error log can contain very useful information if a problem occurs on a regular basis. A problem entry in the log provides data about the type of problem, and valuable information about the source of the problem can also be gathered from it. This action is not disruptive to the SAN environment.
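One way to work through the error log is to start with the summary report and then narrow it down. The following is a minimal sketch, assuming the standard errpt flags -d (error class), -a (detailed report), and -N (resource name). The resource name fcs0, the run helper, and the DRY_RUN switch are hypothetical additions of ours, and the script only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Dry-run by default: print each command instead of executing it.
# Set DRY_RUN=0 on the AIX server to really run the commands.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Review the AIX error log from the general to the specific.
# "fcs0" is a hypothetical resource name; use one shown in the summary report.
review_errlog() {
    resource="${1:-fcs0}"
    run errpt                     # one-line summary of each logged entry
    run errpt -d H                # summary limited to the hardware error class
    run errpt -a -N "$resource"   # detailed entries logged against one resource
}
```

Because errpt only reads the log, these commands can be run at any time without affecting the server or the SAN.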

Note: The AIX trace function, which the SDD trace relies on, requires the bos.sysmgt.trace fileset to be installed on the AIX system.

• Capture and interpret the error log from fabric devices.

Practically all fabric devices have some mechanism to detect and record various types of events and errors. A review of this information in conjunction with data from other sources can be a valuable resource in identifying the problem source. For more information on obtaining these logs from different switches, refer to Section 4.5, “Checking the fabric” on page 147.

• Run tapeutil to obtain errors reported by the drive to the host.

Check for any messages that may have been reported back to the AIX operating system by the tape drive unit. To gather this information, run the tapeutil program. From the menu, select option 9, Error Log Analysis. For help in interpreting the messages, refer to Chapter 4, “Messages,” of IBM 3590 Tape Subsystem Maintenance Information, SA37-0301.

• Run the Fibre Channel trace tool.

Fibre Channel trace tools are usually reserved for highly intermittent problems in a SAN environment. Because the trace tool must be inserted into a link, this action is disruptive to the SAN environment. The severity of the disruption varies depending on the link to be traced. Because of this impact, trace tools are typically used as a last resort method of problem determination.

In document Practical Guide for SAN with pseries (Page 147-150)