A dump can be forced when the system locks up to determine the cause of the hang.

A system hang is a total system lockup. A dump forced by turning the key to the Service position and pressing the Reset button can be examined to see which locks are being held and by whom. Refer to Section 4.6.2, “How to force a dump” on page 91 for more information.

4.7.7 Data required by IBM support

In any type of system crash (Trap or DSI), the following data is required by IBM support to perform problem determination.

Ideally, send the output of the snap command, collected as follows:

/usr/bin/snap -a -o /dev/rmt#

This collects the system dump, /unix, and other required information and writes it to tape, where # is the number of the tape drive.
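As a minimal sketch of how the # placeholder is filled in, the following shell function builds the snap command line for a given tape drive number. The function name snap_tape_cmd is hypothetical, drive number 0 is only an assumed example, and the function prints the command rather than running it so that it can be reviewed first.

```shell
# Sketch only: builds the snap invocation shown above for one tape drive.
# snap_tape_cmd is a hypothetical helper name, not an AIX command.
# $1 is the tape drive number that replaces the '#' in /dev/rmt#.
snap_tape_cmd() {
    drive="$1"
    case "$drive" in
        ''|*[!0-9]*)
            # Reject an empty or non-numeric drive number.
            echo "usage: snap_tape_cmd <tape drive number>" >&2
            return 1
            ;;
    esac
    # Print the command; nothing is executed here.
    echo "/usr/bin/snap -a -o /dev/rmt${drive}"
}

# Assumed example: tape drive 0.
snap_tape_cmd 0
```

Run the printed command as root once you have confirmed that the device name matches a tape drive on your system.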

If a dump cannot be sent, IBM support requires at least the following information to analyze the problem:

 The system status and messages, obtained by using the stat subcommand at the kdb or crash prompt. For example:

(0)> stat

 If you are using the crash command, obtain a kernel stack trace by using the trace -m subcommand. For example:

# crash <dump> <unix>

> trace -m

This is usually sufficient unless the crash occurred in a kernel extension or device driver.

 The error log from the dump, obtained by using the errpt subcommand at the kdb or crash prompt. For example:

(0)> errpt
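The minimum information listed above can be summarized as a short checklist of subcommands for each tool. This is an illustrative sketch only: min_info_subcommands is a hypothetical helper, and for kdb it lists only the subcommands that the text above names for that prompt.

```shell
# Sketch only: prints the subcommands named above, one per line.
# min_info_subcommands is a hypothetical helper name, not an AIX command.
# $1 is the tool whose prompt you are using: crash or kdb.
min_info_subcommands() {
    case "$1" in
        crash)
            # Status, kernel stack trace, and error log.
            printf '%s\n' 'stat' 'trace -m' 'errpt'
            ;;
        kdb)
            # Status and error log.
            printf '%s\n' 'stat' 'errpt'
            ;;
        *)
            echo "unknown tool: $1 (expected crash or kdb)" >&2
            return 1
            ;;
    esac
}

# Print the checklist for the crash command.
min_info_subcommands crash
```

Enter each listed subcommand at the corresponding prompt and save the output to send to IBM support.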

Chapter 5. Hardware problem determination

This chapter guides you through the process of running diagnostics. There are various modes of running diagnostics, each with some limitations. This chapter will help you decide which mode is best for you. The process of running diagnostics enables you to confirm whether or not the problem you are experiencing is hardware related.

5.1 General advice

Where possible, run diagnostics concurrently in Problem Determination mode while AIX is running. This analyzes the error log and shows any problems found. Check the date of each reported problem, as it may be an unrelated event. If it is not possible to test the suspect device concurrently, or there is doubt about the integrity of the AIX system, run stand-alone diagnostics from CD-ROM, NIM, or diskette against the suspect device, using the correct additional parts requested, such as wrap plugs or test media. If you run diagnostics and get a No Trouble Found report, you will probably be more successful in resolving the problem by concentrating on software issues.

It is important that you use the exact replacement parts requested by the diagnostic system, which specifies each required part by part number. Using a similar but incorrect part can cause the diagnostic system to report a failure when none exists, or to report No Trouble Found when in fact there is a problem.

5.1.1 Diagnostic tips

To get the most out of the online and stand-alone diagnostics, you must understand the following points:

1. Error log analysis (ELA) is a major part of the diagnostic strategy.

2. Stand-alone diagnostics do NOT perform error log analysis, except for Power-On Self-Test (POST) errors that occurred while booting stand-alone diagnostics and checkstops that just occurred.

3. Online diagnostics perform error log analysis only when the Problem Determination option is selected from the DIAGNOSTIC MODE SELECTION menu or when the Run Error Log Analysis task is selected.

4. Stand-alone diagnostics should only be used when you are unable to run the online diagnostics.

5. If a part is replaced as a result of error log analysis, a repair action must be logged to prevent the problem from being reported again. A repair action can be logged by using the Log Repair Action task or by running Advanced Diagnostics in System Verification mode.

6. Except for the floating-point tests and the system exerciser, all processor and memory testing is done by POST. Errors that prevent the system from booting are reported by 8-digit error codes on later PCI machines. Errors that do not prevent the system from booting are logged and are reported when the memory, processor, or sysplanar diagnostics are run. Although memory and processors are not fully tested by stand-alone and online diagnostics, they are monitored for correct operation by various checkers. If one of these checkers detects an error, it is logged in the AIX error log.

7. Some systems support both a fast and a slow boot. Additional POSTs are run in slow boot. Boot the system in slow mode if you suspect a problem in the base system or if you have no idea where the problem may be.

8. Sysplanar diagnostics not only test the system planar functions, they also test and monitor other major system components, such as power supplies and fans.

Always run sysplanar diagnostics in Problem Determination mode to ensure that there is not a system problem.
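The guidance in this section on choosing between online and stand-alone diagnostics can be sketched as a small decision helper. The function name choose_diag_mode and its yes/no arguments are assumptions made for illustration; they are not part of the AIX diagnostics.

```shell
# Sketch only: suggests a diagnostics mode from three yes/no answers.
# choose_diag_mode is a hypothetical helper name, not an AIX command.
#   $1  Is AIX running?
#   $2  Is the integrity of the AIX system trusted?
#   $3  Can the suspect device be tested concurrently?
choose_diag_mode() {
    if [ "$1" = "yes" ] && [ "$2" = "yes" ] && [ "$3" = "yes" ]; then
        # Online diagnostics in Problem Determination mode include
        # error log analysis.
        echo "online: Problem Determination mode"
    else
        # Stand-alone diagnostics are the fallback when online
        # diagnostics cannot be used.
        echo "stand-alone: boot diagnostics from CD-ROM, NIM, or diskette"
    fi
}

# Example: AIX is running and trusted, but the suspect device
# cannot be tested concurrently.
choose_diag_mode yes yes no
```

Whichever mode is suggested, remember that stand-alone diagnostics perform only limited error log analysis, so a problem that is recorded only in the error log may not be reported.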

5.1.2 Device location notation

You will see, both in this chapter and other chapters, output from various commands showing the location of devices or adapters in the system. This section explains the notation used to describe device location. The location is shown either as an AIX Location Code or a Physical Location Code.