5.2 Problem diagnosis

The best method of initial diagnosis of a problem is to run Diagnostics in Problem Determination mode to let the system analyze the error log and test any suspect devices. See Section 5.3.1, “Concurrent mode” on page 125.

Diagnosis of hardware problems depends upon observation, error information collection, and the results of running diagnostics. Generally, the first indication of a problem is an entry in the error log. The first entry for a problem can be many days or even weeks old before you notice a problem. Sometimes, there may be multiple similar entries showing a degradation in device performance before a failure. In other cases, there may only be a single error log entry.

There are several ways of accessing the error log. You can use SMIT, the command line, or the Display Hardware Error Report task, which provides options to display both the summary and detail hardware error logs. Refer to Section 2.2, “Error log file processing” on page 21 for more information.
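As an illustration of the command-line route, the following Python sketch builds argument lists for the errpt command. The helper names errpt_command and run_errpt are hypothetical. The flags used (-a for the detailed report, -j for an error identifier, -N for a resource name, -d for an error class) are the commonly documented errpt options, but check the errpt documentation for your AIX level before relying on them.

```python
import shutil
import subprocess


def errpt_command(detail=False, identifier=None, resource=None, error_class=None):
    """Build an errpt argument list; with no options this is the summary report."""
    argv = ["errpt"]
    if detail:
        argv.append("-a")             # detailed report instead of summary
    if identifier:
        argv += ["-j", identifier]    # select entries by error identifier
    if resource:
        argv += ["-N", resource]      # select entries by resource name
    if error_class:
        argv += ["-d", error_class]   # select entries by class, for example H
    return argv


def run_errpt(argv):
    """Run the command if errpt is on the PATH; return None on other systems."""
    if shutil.which(argv[0]) is None:
        return None
    result = subprocess.run(argv, capture_output=True, text=True, check=False)
    return result.stdout


if __name__ == "__main__":
    # Detailed report for everything logged against sysplanar0.
    print(errpt_command(detail=True, resource="sysplanar0"))
```

Keeping the command construction separate from its execution means the argument lists can be inspected on any machine, while run_errpt only does anything where errpt is installed.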

Observing how a problem manifests itself will quite often give you an indication of what the cause may be.

One of the more powerful tools to help you resolve a fault is the Diagnostics system, which is available through AIX when it is loaded on the machine, and also on a separate CD-ROM or diskette. Additionally, the Task Selection or Service Aid section of the diagnostics includes a set of utilities that help to further diagnose SCSI, LAN, and disk subsystem faults.

5.2.1 Making sense of the error log

To get the most out of the AIX Error Log, you must understand the following points:

1. Errors are written to the error log by the kernel and device drivers, not by diagnostics. Diagnostics (Error Log Analysis) analyzes errors that have already been logged.

2. In many cases, permanent and temporary hardware errors are logged that are not indicative of a hardware problem. Error Log Analysis should always be run to determine if these errors indicate a hardware problem.

3. Resource Name indicates the resource that detected the error. It does NOT indicate the failing resource. It does indicate the resource that diagnostics and Error Log Analysis should be run on.

4. Failure Causes, Probable Causes, and User Causes are generic recommendations and are not intended to be used to determine what parts to replace. Parts should only be replaced based on diagnostics and Error Log Analysis results.

5. System errors related to the processors, memory, power supplies, fans, and so on, are logged under resource name sysplanar0. Error Log Analysis should be run on sysplanar0 any time there is an error log with a resource name of sysplanar0.

6. Error Log Analysis can be run by running diagnostics in Problem Determination mode or by running the Run Error Log Analysis task. Stand-alone diagnostics do not perform error log analysis.

7. Error Log Analysis will analyze all the errors in the error log associated with a specific resource. For those errors that should be corrected, Error Log Analysis will provide a list of actions or an SRN. For errors that can safely be ignored, Error Log Analysis will indicate that no problems were found.

For information on how to access the error log, refer to Section 2.2, “Error log file processing” on page 21.

The error log can contain many hundreds of entries, so it is always best to start with the summary format. This format will give you a chronological list of events, starting with the latest event at the top of the screen. The following is an example of the summary output of the errpt command:

IDENTIFIER TIMESTAMP  T C RESOURCE_NAME DESCRIPTION
C60BB505   0511122901 P S SYSPROC       SOFTWARE PROGRAM ABNORMALLY TERMINATED
C60BB505   0511122901 P S SYSPROC       SOFTWARE PROGRAM ABNORMALLY TERMINATED
C60BB505   0511122901 P S SYSPROC       SOFTWARE PROGRAM ABNORMALLY TERMINATED
74533D1A   0510182901 U H SYSIOS        LOSS OF ELECTRICAL POWER
9DBCFDEE   0511083901 T O errdemon      ERROR LOGGING TURNED ON
192AC071   0510182501 T O errdemon      ERROR LOGGING TURNED OFF
0734DA1D   0503110401 P H fd0           DISKETTE MEDIA ERROR
C60BB505   0422105301 P S SYSPROC       SOFTWARE PROGRAM ABNORMALLY TERMINATED
C60BB505   0422105301 P S SYSPROC       SOFTWARE PROGRAM ABNORMALLY TERMINATED
C60BB505   0422105301 P S SYSPROC       SOFTWARE PROGRAM ABNORMALLY TERMINATED
3573A829   0422105001 U S CMDCRASH      SYSTEM DUMP
AD331440   0422104501 U S SYSDUMP       SYSTEM DUMP
AE26DD07   0422083001 P S SYSSPECFS     DRIVER RETURNED WITH INTERRUPTS DISABLED
9DBCFDEE   0422104701 T O errdemon      ERROR LOGGING TURNED ON

Note: Sysplanar0 errors are detected by Sysplanar0, not necessarily caused by Sysplanar0, although the error log may list it as a probable cause. Run the Error Log Analysis task or Diagnostics in Problem Determination mode to find the true cause.

Look through the first couple of screens to see the sort of errors being produced and the frequency. Also check whether the errors fall into an obvious sequence, for example, a disk error followed by a SCSI error. Look at the time stamps of the errors for any pattern, for example, if they occur at or near the same time each day. For tape drive entries, look back over several weeks. Media errors on the same day of the week may indicate a particular tape is failing.

The time stamp format is MMDDHHMMYY. So, the first error in the example output above occurred on May 11th, 2001 at 12:29. The column marked C denotes the class of error. Class types are H for hardware, S for software, and O for operator message. See “Class” on page 28 for a full description of the class entries.

Once you have decided which of the errors interest you the most, expand the error log into the detail or intermediate format. Example 5-1 shows a complete detailed error log entry.
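The time stamp decoding and the frequency check described above can be sketched in Python as follows. This is only an illustration: the function names and the SAMPLE text are assumptions made for this sketch, the rows are copied from the summary output shown earlier, and the six-column layout is assumed to match that output. Two-digit years follow Python's strptime convention (00 to 68 map to 2000 to 2068).

```python
from collections import Counter
from datetime import datetime

# Class codes as described in the text; any other code is passed through as-is.
CLASS_NAMES = {"H": "hardware", "S": "software", "O": "operator message"}


def parse_timestamp(stamp):
    """Decode a 10-digit MMDDHHMMYY time stamp into a datetime."""
    return datetime.strptime(stamp, "%m%d%H%M%y")


def parse_summary_line(line):
    """Split one summary row into its six columns."""
    ident, stamp, err_type, err_class, resource, description = line.split(None, 5)
    return {
        "identifier": ident,
        "when": parse_timestamp(stamp),
        "type": err_type,
        "class": CLASS_NAMES.get(err_class, err_class),
        "resource": resource,
        "description": description.strip(),
    }


def count_by_resource(lines):
    """Tally entries per (resource, description) to expose repeated errors."""
    entries = [parse_summary_line(line) for line in lines if line.strip()]
    return Counter((e["resource"], e["description"]) for e in entries)


# Rows copied from the example summary output (header line omitted).
SAMPLE = """\
C60BB505 0511122901 P S SYSPROC SOFTWARE PROGRAM ABNORMALLY TERMINATED
C60BB505 0511122901 P S SYSPROC SOFTWARE PROGRAM ABNORMALLY TERMINATED
C60BB505 0511122901 P S SYSPROC SOFTWARE PROGRAM ABNORMALLY TERMINATED
74533D1A 0510182901 U H SYSIOS LOSS OF ELECTRICAL POWER
0734DA1D 0503110401 P H fd0 DISKETTE MEDIA ERROR
"""

if __name__ == "__main__":
    for (resource, description), count in count_by_resource(SAMPLE.splitlines()).most_common():
        print(count, resource, description)
```

Because each parsed entry carries a datetime, the same data can be grouped by hour or by weekday to look for the daily and weekly patterns mentioned above.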

Example 5-1 Example of a detailed error log entry

LABEL: DISK_ERR4

ROS Level and ID...5 5A
Serial Number...00438487
EC Level...895186
FRU Number...86F0118
Device Specific.(Z0)...000002029F00001E
Device Specific.(Z1)...75G3644
Device Specific.(Z2)...0983
Device Specific.(Z3)...95333
Device Specific.(Z4)...0002
Device Specific.(Z5)...22
Device Specific.(Z6)...895180

Description
DISK OPERATION ERROR

Probable Causes
MEDIA
DASD DEVICE

User Causes
MEDIA DEFECTIVE

Recommended Actions
FOR REMOVABLE MEDIA, CHANGE MEDIA AND RETRY
PERFORM PROBLEM DETERMINATION PROCEDURES

Failure Causes
MEDIA
DISK DRIVE

Recommended Actions
FOR REMOVABLE MEDIA, CHANGE MEDIA AND RETRY
PERFORM PROBLEM DETERMINATION PROCEDURES

Detail Data
SENSE DATA

0600 0000 0800 10F3 0100 0000 0000 0000 0102 0000 7000 0100 0000 0018 0000 0000 1500 0180 0001 0000 01A6 0000 000E 02FF 0000 0000 0000 0000 0000 0000 0A06 0000 0000 0000 100E 0801 0000 0032 4000 1800 0000 0000 1106 0100 0000 0000 0F0E 8080 0000 0001 0000 0001 0000 0000 2020 2020 2020 2020 2020 4C31 2020 2020 2020 2020 2020 2020 2020 3539 4833 3438 3120 2020 2020 4533 0000 0028 0002 7600

The Error Log Analysis program within Diagnostics, Task Selection, will decode the sense data and give a possible cause and Action Plan. For an older version of AIX, this can also be run remotely by IBM support personnel, who will cut and paste the sense data into the following IBM intranet Web interface URL:

http://starbase5.austin.ibm.com/cgi-bin/hardware/dsense/dsense_form.sh

The next area of the error log entry to look at is the area starting at the resource name and finishing at the end of the VPD section. The information given here will help you to identify the type of device that has detected the problem, its size, and, more importantly on a complex system, its location. Additionally, the VPD will often give you the part number and FRU number to enable you to arrange for a replacement part to be ordered if diagnostics determine the part is failing.

The VPD can also indicate the microcode level of the device at the time the error occurred.

Most diagnostic routines, except when running diagnostics from CD-ROM or via network boot, use the sense data in the error log when run in Problem Determination mode. This use of the error log data is the main way the diagnostics decide the cause of any hardware errors logged in the AIX error log.

To run the diagnostics in this way on machines running AIX Version 4.3.1 and above, the date set on the machine must be within seven days of the error log time stamp. On these versions, the Diagnostic Run Time options within Task Selection allow you to set a value of 1 to 60 days. Machines running AIX Version 4.3.0 and below must have the date set within 24 hours of the time stamp for the error log analysis diagnostic to give correct results.
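The date windows described above can be expressed as a small check. The following Python sketch is illustrative only: the names DEFAULT_WINDOW, LEGACY_WINDOW, custom_window, and within_window are hypothetical, and reading "within" as an absolute difference between the system date and the error time stamp is an assumption, not a statement of how the diagnostics implement the test.

```python
from datetime import datetime, timedelta

# Windows taken from the text; the constant names are illustrative.
DEFAULT_WINDOW = timedelta(days=7)    # AIX Version 4.3.1 and above
LEGACY_WINDOW = timedelta(hours=24)   # AIX Version 4.3.0 and below


def custom_window(days):
    """Return a window for a configured value, limited to 1 to 60 days."""
    if not 1 <= days <= 60:
        raise ValueError("window must be between 1 and 60 days")
    return timedelta(days=days)


def within_window(entry_time, system_time, window=DEFAULT_WINDOW):
    """True if the system date is within the window of the error time stamp.

    Assumption: 'within' is treated as an absolute difference.
    """
    return abs(system_time - entry_time) <= window


if __name__ == "__main__":
    entry = datetime(2001, 5, 11, 12, 29)   # first entry in the example summary
    print(within_window(entry, datetime(2001, 5, 15, 9, 0)))
    print(within_window(entry, datetime(2001, 5, 20, 9, 0)))
    print(within_window(entry, datetime(2001, 5, 20, 9, 0), custom_window(30)))
```

A check like this shows why an old error may need either a wider configured window or a system date set close to the error time stamp before Error Log Analysis will consider it.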

5.2.2 Physical inspection

The most obvious thing to look for when starting a diagnosis is physical damage, such as impact damage. Is anything obviously broken or does anything look incorrect to you? Look at cables going into the machine: Are any of them showing damage? Is each cable securely fixed to the adapter and to the device at the other end? Most installations end up with a tangle of cables either at the rear of the machine or under the floor. Check whether any additional cabling nearby, or intertwined with the cables of the machine you are looking at, might be a source of electrical interference. Power cables carrying heavy current are a prime source of electrical noise.

Look at the cabling of devices attached to the machine, especially their routing, the tightness of the fixing of cabling, damage to cabling, and the proximity of heavy current carrying power cables. The positioning of adapters and devices can also influence problems.

If the machine you are working on is a PCI Bus machine, check the adapter positions in the machine against the recommended positions shown in the RS/6000 and pSeries PCI Adapter Placement Reference, SA38-0538. The latest copy of this publication can be found at the following URL:

http://www.ibm.com/servers/eserver/pseries/library/

Note: Do not replace any parts solely due to errors in the error log.

This guide will show whether the PCI card and slot within the system are compatible. Details of voltage (3.3 V or 5 V), bus width (32-bit or 64-bit), bus speed (33, 50, or 66 MHz), hot-pluggable or not, customer or service representative installable, maximum cards per system/bus, and restrictions causing performance problems are listed. Note that some restrictions are rules that must not be broken and others are just performance-related guidelines. The placement of the adapters should also be checked against a search by your IBM marketing support representative to ensure that you have the most recent information. This especially applies to high-bandwidth adapters, such as SP Switch, SSA RAID, and ATM.

While you are looking at the adapter placement, make sure that the adapters are securely clamped to the chassis and are as deep into the card slot as possible. The correct seating of adapters is most important, especially in the case of J and R series SMP machines. An improperly seated adapter will sometimes show no problem itself but will cause another adapter elsewhere on the bus to produce strange or intermittent problems.