Problem Solving and Troubleshooting in AIX 5L


HyunGoo Kim

JaeYong An

John Harrison

Tony Abruzzesse

KyeongWon Jeong

Introduction of new problem solving tools and changes in AIX 5L

Step by step troubleshooting in real environments

Complete guide to the problem determination process


January 2002


Second Edition (January 2002)

This edition applies to IBM eServer pSeries and RS/6000 Systems for use with the AIX 5L Operating System Version 5.1 and is based on information available in August 2001. Comments may be addressed to:

IBM Corporation, International Technical Support Organization Dept. JN9B Building 003 Internal Zip 2834

11400 Burnet Road Austin, Texas 78758-3493

When you send information to IBM, you grant IBM a non-exclusive right to use or distribute the information in any way it believes appropriate without incurring any obligation to you.


Figures

1-1 Boot path flowchart: part one
1-2 Boot path flowchart: part two
1-3 Boot path flowchart: part three
1-4 Machine displaying 888 in LED or LCD
2-1 Error log file processing
4-1 Start panel of Web-based System Manager
4-2 System Environment Settings panel
4-3 Dumps Properties panel
4-4 Selecting dump device on Web-based System Manager
4-5 Dump Device Check panel
4-6 Always allow system dump setting
4-7 Machine state save area
5-1 Example of SCSI labelling
7-1 TCP/IP network connection
7-2 Selective host
7-3 No network access
9-1 Endian diagram example
10-1 Bottleneck determining process
10-2 System Environment Settings panel


Tables

1-1 pSeries and RS/6000 general service documentation
1-2 pSeries and RS/6000 System Guides
1-3 Symptom cross reference chart
3-1 Flag settings to obtain maintenance menu
4-1 Dedicated dump device size by real memory size
4-2 Dump status codes
4-3 vmmerrlog structure components
5-1 SSA adapter information
8-1 X11 and Motif compatibility filesets
13-1 Overview of relevant NIM LED codes


Preface

This redbook covers problem determination and troubleshooting on the IBM eServer pSeries and RS/6000 platform. It is intended as a guide to the approach that should be adopted when attempting to resolve a problem on an IBM eServer pSeries and RS/6000 system running AIX 5L Version 5.1. It is not intended to replace the comprehensive documentation available for the AIX operating system or the System Installation and Service Guides available for IBM eServer pSeries and RS/6000 system hardware. Instead, it should be considered as a useful supplement to guide you through the problem determination process.

A problem can manifest itself in many ways, and very often the root cause of the problem is not obvious. This redbook describes an approach to problem determination that will guide you through the initial steps in problem diagnosis and help you narrow down and identify the component or subsystem that is experiencing a problem.

In addition to helping you determine the likely cause of your problem, this redbook describes some of the most common problems that can occur with AIX systems and, where possible, describes the actions you can take to resolve them.

This redbook is a valuable tool for system administrators and other technical support personnel who deal with RS/6000 and AIX problems.

The information in this redbook is not intended to completely cover problem determination on SP systems or systems configured as nodes in an HACMP cluster. This redbook does not cover Itanium-based systems (IA-64), although many items will be true for both POWER and Itanium-based systems.

The team that wrote this redbook


KyeongWon Jeong is a Consulting IT Specialist at the International Technical Support Organization, Austin Center. He writes extensively on AIX and education materials and teaches IBM classes worldwide on all areas of AIX. Before joining the ITSO three years ago, he worked in IBM Global Learning Services in Korea as a Senior Education Specialist and was a class manager of all AIX classes for customers and interns. He has many years of teaching and development experience.

HyunGoo Kim is an Advisory Education Specialist at Global Learning Services in IBM Korea. He has five years of experience teaching AIX, HACMP, and SP classes to customers and business partners. His areas of expertise include Linux, and he has an interest in clustering technology.

JaeYong An is an Advisory System Service Representative in IBM Korea ITS Technical Operations. He has six years of experience working on IBM eServer pSeries, RS/6000, and AIX. He currently works on AIX-related problem solving for customers, business partners, and IBM internal support. His areas of expertise include IBM eServer pSeries, RS/6000, SP systems, AIX dump analysis, and performance problem solving.

John Harrison is an Advisory System Service Representative in IBM United Kingdom. He has worked for IBM for 24 years, for the past ten years as a Field Hardware Specialist on IBM eServer pSeries, RS/6000 products, and AIX. He has extensive hardware knowledge on other IBM systems and provides Field Support in the north London area. He is also an instructor for UK IBM eServer pSeries and RS/6000 hardware courses and a member of the UK ITS Technical Council.

Tony Abruzzesse is an Advisory IT Specialist at the IBM Support Center in Sydney, Australia. He has been serving the IBM AIX customer community in the technical support area for over a decade. He holds a Bachelor of Mathematics from the University of Wollongong and a Graduate Diploma of Finance and Investment from the Securities Institute of Australia. His areas of expertise include SP, AIX, and base related products.

Thanks to the following people for their contributions to this project:

International Technical Support Organization, Austin Center
Wade Wallace

IBM Austin

IBM Dallas
John Tesch and Rashid Rahim

IBM UK
Derrick Daines and Michael Pearson

Special notice

This publication is intended to help system administrators and IBM service personnel perform basic problem determination on IBM eServer pSeries and RS/6000 systems running AIX 5L Version 5.1. The information in this publication is not intended as the specification of any programming interfaces that are provided by AIX 5L Version 5.1. See the PUBLICATIONS section of the IBM Programming Announcement for AIX 5L Version 5.1, Program Number 5765-C34 for more information about what publications are considered to be product documentation.

IBM trademarks

The following terms are trademarks of the International Business Machines Corporation in the United States and/or other countries:

Comments welcome

Your comments are important to us!

We want our IBM Redbooks to be as helpful as possible. Send us your comments about this or other Redbooks in one of the following ways:

Use the online Contact us review redbook form found at:

ibm.com/redbooks

Send your comments in an Internet note to:

redbook@us.ibm.com


Chapter 1.

Problem determination introduction

This redbook is intended to help system administrators and other support personnel have a better understanding of problem solving on IBM eServer pSeries and RS/6000 systems using AIX 5L Version 5.1. It is not the intention of this book to provide an exhaustive list of all the solutions to all of the possible problems that can be encountered on IBM eServer pSeries and RS/6000 systems running AIX. Instead, the book is intended to guide you through the problem determination process and to help you narrow down and identify the component or subsystem that is experiencing a problem.

This redbook does not cover Itanium-based systems (IA-64), although many items will be true for both POWER and Itanium-based systems.

The approach detailed in this book will help you to determine the cause of a problem in a component or subsystem that was previously functioning correctly. It is not intended to help you diagnose problems when you are installing and configuring new hardware or software. You should refer to the appropriate documentation supplied with the new component for installation and configuration information.

Note that some of the Web addresses referenced in this redbook are IBM intranet addresses, and are, therefore, not available for general public access. Where possible, Internet sites offering similar information are referenced.


1.1 Problem determination process

For the purposes of this redbook, a problem can be considered as any situation where something that was previously working correctly is now not working as expected.

1.1.1 Defining the problem

The first step in problem resolution is to define the problem. It is important that the person trying to solve the problem understands exactly what the users of the system perceive the problem to be. A clear definition of the problem is useful in two ways. First of all, it can give you a hint as to the cause of the problem. Secondly, it is much easier to demonstrate to the users that the problem has been solved if you know how the problem is seen from their point of view. Take, for example, the situation where a user is unable to print a document. The problem may be due to the /var file system running out of space. The person solving the problem may fix this and demonstrate that the problem has been fixed by using the df command to show that the /var file system is no longer full. This example can also be used to illustrate another difficulty with problem determination. Problems can be hidden by other problems. When you fix the most visible problem, another one may come to light. The problems that are unearthed during the problem determination process may be related to the one that was initially reported, in other words, multiple problems with the same symptoms. In some cases, you may discover problems that are completely unrelated to the one that was initially reported.

In the example described above, simply increasing the amount of free space in the /var file system may not solve the problem being experienced by the user. The printing problem may turn out to be a cable problem, a problem with the printer, or perhaps a failure of the lpd daemon. This is why understanding the problem from the user's perspective is so important. In this example, a better way of proving that the problem has been resolved is to get the user to print their document.

1.1.2 Gathering information from the user


The following questions should be asked when gathering information from the user during problem determination:

What is the problem?

Try to get the user to explain what the problem is and how it affects them. Depending on the situation and the nature of the problem, this question can be supplemented by either of the following two questions:

– What is the system doing?
– What is the system not doing?

Once you have determined what the symptoms of the problem are, you should try to establish the history of the problem.

How did you first notice the problem? Did you do anything different that made you notice the problem?

When did it happen? Does it always happen at the same time, for example, when the same job or application is run?

Does the same problem occur elsewhere? Is only one machine experiencing the problem or are multiple machines experiencing the same problem?

Have any changes been made recently?

This refers to any type of change made to the system, ranging from adding new hardware or software, to configuration changes to existing software. If a change has been made recently, were all of the prerequisites met before the change was made?

Software problems most often occur when changes have been made to the system and either the prerequisites have not been met (for example, system firmware is not at the minimum required level) or the instructions have not been followed exactly and in order (for example, the person following the instructions second-guesses what they are attempting to do and decides there is a quicker route). When a step is skipped or reordered in this way, the prerequisites for subsequent steps may not have been met, and the problem develops into the situation you are confronted with.


If the problem occurs on more than one machine, look for similarities and differences between the situations.

1.1.3 Gathering information from the system

The second step in problem determination is gathering information from the system. Some information will already have been obtained from the user during the process of defining the problem.

It is not only the user of the machine that can provide information on a problem. By using various commands, it is possible to determine how the machine is configured, the errors that are being produced, and the state of the operating system.

Commands such as lsdev, lspv, lsvg, lslpp, lsattr, and others enable you to gather information on how the system is configured. Other commands, such as errpt, can give you an indication of any errors being logged by the system.
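As an illustration of gathering this information in one pass, the following Python sketch runs a list of commands and collects their output into a single snapshot that can be kept with the problem record. It is a minimal sketch, not a supported tool: the function names (run_command, gather, format_snapshot) are hypothetical, and the particular flags in DEFAULT_COMMANDS are an assumption about what is useful. The lsattr command is omitted because it needs a device name.

```python
import subprocess
from typing import Callable, Dict, List

# Commands named in the text. The flags chosen here are an assumption;
# adjust the list to suit the system being examined.
DEFAULT_COMMANDS: List[List[str]] = [
    ["lsdev", "-C"],   # configured devices
    ["lspv"],          # physical volumes
    ["lsvg"],          # volume groups
    ["lslpp", "-l"],   # installed filesets
    ["errpt"],         # error log summary
]


def run_command(argv: List[str]) -> str:
    """Run one command and return its output, or a note if it cannot run."""
    try:
        result = subprocess.run(argv, capture_output=True, text=True, timeout=60)
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"[could not run: {exc}]"
    return result.stdout + result.stderr


def gather(commands: List[List[str]],
           runner: Callable[[List[str]], str] = run_command) -> Dict[str, str]:
    """Map each command line to the text it produced."""
    return {" ".join(argv): runner(argv) for argv in commands}


def format_snapshot(snapshot: Dict[str, str]) -> str:
    """Render the snapshot as one text report, one section per command."""
    sections = [f"== {cmd} ==\n{out.rstrip()}" for cmd, out in snapshot.items()]
    return "\n\n".join(sections)
```

Calling print(format_snapshot(gather(DEFAULT_COMMANDS))) on the system under investigation would print one section per command. The runner parameter exists only so that the collection logic can be exercised on a machine where these commands are not installed.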

If the system administrator uses SMIT or the Web-based System Manager to perform administrative tasks, examine the log files of these applications to look for recent configuration changes. The log files are normally contained in the home directory of the root user and are named /smit.log for SMIT and /websm.log for the Web-based System Manager by default.

If the problem described by the user points to something specific, other files, such as system dumps or checkstop files, are usually viewed or extracted and sent to your IBM support function for analysis.

1.1.4 Resolving the problem

Once you have defined the problem, you should use the flowcharts in Figure 1-1 on page 12, Figure 1-2 on page 13, and Figure 1-3 on page 14 and the symptom index cross reference chart in Table 1-3 on page 15 to direct you to the most relevant part of this book to perform basic troubleshooting of the suspected device or software component. During the investigative process, you should keep a log of the actions you perform in trying to determine the cause of the problem, and any actions you perform to correct the problem.


should be supplied to IBM. If possible, you should create a simple test case to replicate the problem. In some cases, a test case may not be possible, for example, if a particular problem appears on a random basis and there is no apparent sequence of actions that trigger the problem.

1.1.5 Obtaining software fixes and hardware microcode updates

Software fixes for AIX and many LPPs and hardware microcode updates are available on the Internet from the following URL:

http://techsupport.services.ibm.com/server/fixes

For a more customized approach to downloading AIX fixes for AIX 4.3, you can use the AIX application called FixDist instead of the Web. FixDist provides more discrete downloads and transparently delivers all required images with just one click. It can also keep track of fixes you have already downloaded, so you can download smaller fix packages the next time you need them. FixDist can be downloaded from the Web site mentioned above.

For AIX 5L, the fixes are still obtained from this site, but you will need to register as a user. At the time of writing this redbook, registration was free and could be done immediately. FixDist is to be replaced with a service that will dynamically build one .tar file that contains the required filesets and place it on an FTP server. An e-mail is then sent with information on how to get the file and extract the AIX filesets. This service is expected to be integrated with the new tool known as Inventory Scout, which will check which fixes are already installed.

Once you have determined the nature of your problem, you should try searching this Web site or using FixDist to see if you are experiencing a known problem for which a fix has already been made available.

Microcode Discovery Service

Microcode Discovery Service gives you two ways to generate a real-time comparison report that shows which subsystems may need to be updated and whether your IBM eServer pSeries or RS/6000 is at the latest microcode level. On the later machines, it is the customer’s responsibility to obtain and load the latest microcode on the system, adapters, and devices.

Service Agent

Before using these services, you will need to install Inventory Scout onto the server you want to survey for microcode information or run Service Agent. Inventory Scout, Microcode Discovery Service, and Service Agent can be obtained from the following URL:

http://techsupport.services.ibm.com/server/fixes

1.1.6 Other relevant documentation

Each IBM eServer pSeries and RS/6000 machine has a specific set of documentation that should be used in the problem determination process when a hardware problem is suspected. For some systems, this documentation is an orderable feature and is not shipped with the machine.

Every IBM eServer pSeries and RS/6000 machine has a System Installation and Service Guide, which is supplemented by a number of additional documents, depending on whether the machine is a Micro Channel Bus system or a PCI machine, otherwise known as a Multiple Bus system. Each bus type (Micro Channel and Multiple Bus) has a set of manuals, one covering common diagnostic procedures on machines of that type, and the other describing the adapters, devices, and cables that can be used on systems of that type. In addition, the Multiple Bus systems have a guide detailing the placement rules for PCI adapters.

Most of the hardware documentation can be viewed online at the IBM eServer pSeries and RS/6000 Library Internet site. The URL for the site is:

http://www-1.ibm.com/servers/eserver/pseries/library

Additionally, hardcopy versions of the manuals can be ordered from your IBM marketing representative. Table 1-1 shows details of the IBM eServer pSeries and RS/6000 General Service Documentation. Table 1-2 on page 8 shows the details of the model-specific System Installation and Service Guides.

Table 1-1 pSeries and RS/6000 general service documentation

Document | Form number
IBM RS/6000 Adapters, Devices, and Cables Information for Micro Channel Bus Systems | SA38-0533
RS/6000 Adapters, Devices, and Cable Information for Multiple Bus Systems | SA38-0516
RS/6000 Diagnostics Information for Micro Channel Bus Systems | SA38-0532
RS/6000 and pSeries Diagnostics Information for Multiple Bus Systems |
RS/6000 and pSeries PCI Adapter Placement Reference | SA38-0538
pSeries Customer Support Information | SA38-0584
Site & Hardware Planning Information | SA38-0508

Table 1-2 pSeries and RS/6000 System Guides

pSeries and RS/6000 Model | Form number
7006 Graphics Workstation Service Guide | SA23-2719
7009 Compact Server Service Guide | SA23-2716
7012 300 Series Service Guide | SA38-0545
7012 G Series Service Guide | SA23-2741
IBM RS/6000 7013 500 Series Installation and Service Guide | SA38-0531
7013 J Series Service Guide | SA23-2725
7014 Model S00 Rack Installation and Service Guide | SA38-0550
7014 Series Model T00 and T42 Installation and Service Guide | SA38-0577
7015 Model R00 Rack Installation and Service Guide | SA23-2744
7015 R10/R20/R21 Service Guide | SA23-2708
7015 Model R30, R40, & R50, CPU Enclosure Installation and Service Guide | SA23-2743
7017 S Series Installation and Service Guide | SA38-0548
IBM 7024 E Series Service Guide | SA38-0502
IBM 7025 F Series Service Guide | SA38-0505
IBM RS/6000 7025 F40 Service Guide | SA38-0515
RS/6000 7025 F50 Installation and Service Guide | SA38-0541
7026 CPU Drawer Install and Service Guide |
IBM RS/6000 7043 Service Guide | SA38-0512
7043 260 Service Guide | SA38-0554
7043 260 Setup Guide | SA38-0555
7043 260 User's Guide | SA38-0553
RS/6000 44P Series Model 170 Service Guide | SA38-0560
44P Series Model 170 Installation Guide | SA38-0561
RS/6000 44P Series Model 170 User's Guide | SA38-0559
RS/6000 44P Series Model 270 Service Guide | SA38-0572
RS/6000 44P Series Model 270 Install Guide | SA38-0574
RS/6000 44P Series Model 270 User's Guide | SA38-0573
RS/6000 7046 B50 Service Guide | SA38-0564
RS/6000 7046 B50 Setup Guide | SA38-0562
RS/6000 7046 B50 User's Guide | SA38-0563
Enterprise Server H Series CPU Drawer Installation and Service Guide | SA38-0547
Enterprise Server H Series User's Guide | SA38-0546
Enterprise Server Model F80 and pSeries 620 Model 6F1 Installation Guide | SA38-0569
Enterprise Server Model F80 and pSeries 620 Model 6F1 Service Guide | SA38-0568
Enterprise Server Model F80 and pSeries 620 Model 6F1 User's Guide | SA38-0567
RS/6000 7026 B80 Installation Guide | SA38-0579
RS/6000 7026 B80 Service Guide | SA38-0581
RS/6000 7026 B80 User's Guide | SA38-0580
Enterprise Server Model H80 and pSeries 660 Model 6H1 Installation Guide | SA38-0575
Enterprise Server Model H80 and pSeries 660 Model 6H1 Service Guide | SA38-0566
Enterprise Server Model H80 and pSeries 660 Model 6H1 User's Guide | SA38-0565
RS/6000 Enterprise Server Model M80 and eServer pSeries 680 Model 6M1 Service Guide | SA38-0571
RS/6000 Enterprise Server Model M80 and eServer pSeries 680 Model 6M1 Install Guide | SA38-0576
RS/6000 Enterprise Server Model M80 and eServer pSeries 680 Model 6M1 User's Guide | SA38-0570
RS/6000 Enterprise Server S80 Installation Guide | SA38-0582
Enterprise Server S80, pSeries 680 Model S85 Service Guide | SA38-0558
RS/6000 Enterprise Server S80 User's Guide | SA38-0557
Capacity Upgrade on Demand Installing and Upgrading Processors | SA38-0583

1.2 Troubleshooting starting point

This section contains flowcharts and a table of symptoms of common problems. Based on the problem described by the user, you can do one of the following:

Go directly to the chapter most relevant to your problem. You can use the table of contents at the front of the book or the index at the back of the book to decide which section is most relevant.

If you have a boot problem, use the boot path flowchart starting in Figure 1-1 on page 12 in this section to decide on a course of action.

Refer to the symptom index shown in Table 1-3 on page 15 to decide your next action.

1.2.1 Boot path flowchart

The boot path flowchart is designed to assist you in the diagnosis of boot problems. The chart takes you through the stages of the boot process, listing some of the LED or LCD codes that are signposts for either things completing successfully or indicating you have a problem. As you follow the chart, you will eventually come to an exit point. This exit point will direct you to the relevant chapter or section within the book. The section you are referred to will then try to either provide you with suggestions for a solution, or direct you to a publication or Web site that will provide you with additional information to fix the problem. If neither of these provides a solution, contact the organization that provides you with AIX software support.

Figure 1-1 Boot path flowchart: part one (flowchart; the text of its boxes follows, connecting arrows are not reproduced)

– Does machine power on and stay powered on? Indications are fan noise and power light(s) on solid. Wait 30 seconds for 7017-S7x and later systems for confirmation of power on.
– Replace power cable with known good power cable. If rack mounted, check for tripped circuit breakers. Does machine power on and stay powered on?
– Plug machine into a known good power connection. Does machine power on and stay powered on?
– Original power connection is defective. Try another connection and call electrician to repair the original connection.
– Possible Hardware Failure. Refer to the Diagnostic Information Manual applicable to machine type. This may direct you to the Minimum Config MAP 1540. See Section 3.4, “Minimum configuration” on page 55.
– Do you observe any of the following: MCA - LEDs 292, 252, 243, 233; PCI - Icons on graphics panel or memory, keyboard, and so on, on ASCII panel.
– Go to Figure 1-2 on page 13.
– Is 269 (Non-bootable) or alternating appearance of LED 229 displayed? (MCA machines only)
– Does machine start to load AIX? Indicated by momentary appearance of LED 299 on MCA machines and Starting Software message on PCI machines.

Figure 1-2 Boot path flowchart: part two (flowchart; the text of its boxes follows, connecting arrows are not reproduced)

– From Figure 1-1 on page 12.
– Does the machine hang on LED 549, or any LED value between 551 and 557? A hang is suspected if LED is displayed for more than five minutes.
– LED 551 indicates IPL Varyon.
– See Table 1-3 on page 15.
– Does the machine hang on LED 581 for more than five minutes?
– Wait for the machine to complete the boot process.
– Refer to Section 7.3.2, “LED 581 hang” on page 219.
– Does the machine hang on any other LED value, apart from 581?
– For machines displaying 888, refer to Figure 1-4 on page 15. For any other codes, refer to Section 3.3.2, “Codes displayed longer than five minutes” on page 51.
– Go to Figure 1-3 on page 14.

Figure 1-3 Boot path flowchart: part three (flowchart; the text of its boxes follows, connecting arrows are not reproduced)

– Do you see c31 on LED display? This indicates the system is prompting for input on the console device.
– Can you see output on the console device?
– This indicates the system is using the LFT as the console device.
– This indicates the system is using a serial port as the console device.
– Do you see c99 on the LED display? This indicates the system is running diagnostics but has not detected a console device.
– Do you get a login prompt?

Figure 1-4 Machine displaying 888 in LED or LCD (flowchart; the text of its boxes follows, connecting arrows are not reproduced)

– Do you see 888-xxx-xxx-xxx or flashing 888 on the LED display?
– If 888 is on solid on an MCA machine, this indicates a power problem. Go to the Diagnostic Information Manual for Micro Channel Bus Systems.
– Record all of the numbers displayed following the 888. You may have to press reset or scroll buttons on some systems. If the number following the 888 is 102, 103, 104, or 105, go to the appropriate box below.
– 102: This is a software crash. However, it can be caused by a related hardware failure. Ensure any dump produced is saved for analysis.
– 103: This is a hardware failure. The next two sets of 3 LEDs give you the SRN of the failing hardware. 0Cx gives dump status.
– 104 or 105: These are reserved codes. Consult your software support if any occur.

1.2.2 Symptom index

Use this table as an alternative method for finding the correct section in the book that will help you deal with your problem.

Table 1-3 Symptom cross reference chart

Symptom | Possible cause | Refer to
SCSI devices missing or duplicated | Address conflicts on the bus | “SCSI devices missing or duplicated” on page 147
Checkstops or machine checks in error log | |
LED 553 | inittab problem, AIX corruption | Section 3.6.3, “LED 553 halt” on page 64
LED 551, 555, and 557 | A corrupted file system, JFS log, or defective disk | Section 3.6.1, “LED 551, 555, or 557 halt” on page 58
LED 552, 554, and 556 | A corrupted file system, JFS log, bad IPL record, corrupted boot LV, defective disk | Section 3.6.2, “LED 552, 554, or 556 halt” on page 60
LED 549 | No system console or console problem with unsaved dump | Section 4.2.1, “LED 549” on page 71
LED 888-102 | AIX kernel or kernel extension problem | Chapter 4, “System dumps” on page 69
System hang | AIX system resource | “System hang” on page 113
Tape stuck in drive | Back up not completed, tape or tape drive problem | Section 5.9.4, “4 mm, 8 mm, and DLT tape drives” on page 158
Tape media errors | Defective media or drive needs cleaning | Section 5.9.4, “4 mm, 8 mm, and DLT tape drives” on page 158
TAPE_ERR in error log | Media failure, drive failure, and so on | “Tape error log entries” on page 160
hdisk - pdisk mismatch | Disk from RAID set | Section 5.8.3, “SSA disk does not configure as a hdisk” on page 152
SSA errors in error log | SSA drive or cable | Section 5.8, “Serial Storage Architecture (SSA) disks” on page 150
Cannot log in to specific remote machine | Network problem, remote machine down | Section 7.2.1, “Selective host network problems” on page 197
Cannot log in to any remote machine | Network problem, adapter problem | Section 7.2.2, “No network access” on page 198
LED 581 | TCP/IP problem | Section 7.3.2, “LED 581 hang” on page 219
Error log entries | Hardware or software | Section 2.3, “Error log management” on page 22
Core dump error log entries | Software failure | Section 2.4, “Finding a core dump” on page 34
Machine fails to boot | Hardware or software | Section 1.2.1, “Boot path flowchart” on page 10
Corrupt boot list | Corrupt boot list, erased boot list, or list non-existent device | Section 3.3.1, “Failure to locate a boot image” on page 49
3 Digit LED starting 0c | System dump in progress | Section 4.4.2, “Dump status codes” on page 85

1.3 Avoiding problems

The Reliability, Availability, and Serviceability (RAS) features of AIX are designed to fulfill many functions. Some, such as the diagnostics subsystem, help you determine the cause of a problem once it has actually occurred. Many others are designed to provide information on potential problems before they occur. By default, IBM eServer pSeries and RS/6000 systems are configured to run periodic automatic diagnostic checks. Any errors or warnings reported by this system will appear in the system error log.

Good system administration is not only fixing problems when they occur, but managing a system in such a way as to minimize the chances of a problem having an impact on the users of the system.

Periodic system maintenance can help reduce the number of problems experienced on a machine. Simple tasks, such as cleaning the tape drive as required, can prevent tape errors. Examining the system error log on a regular basis can help you spot a potential problem when the related error log entries are still warnings rather than errors.

1.3.1 System health check

The following section lists some simple commands that can be run on a regular basis to monitor a system. They will help you to become aware of how the system is operating.

Use the errpt command to look at a summary error log report. Be on the lookout for recent additions to the log. Use the errpt -a command to examine any suspicious detailed error log entries.


Check disk space availability on the system with the df -k command. A full file system can cause a number of problems, so it is best to avoid the situation if at all possible. The two solutions to a full file system are to either delete some files to free up space, or use the LVM to add additional resources to the file system. The option you take will depend on the nature of the data in the file system, and whether there is any available space in the volume group.

Check volume groups for stale partitions with the lsvg command. If stale partitions, logical volumes, or physical volumes are reported, try to repair the situation with the syncvg command.

Check system paging space with the lsps -s command. A system will not perform very well if it is low on paging space. In extreme circumstances, the system can terminate processes in order to solve the problem. Obviously, it is better that the system administrator ensures that there is enough paging space. You can either increase the size of existing paging space volumes, or add a new paging space volume. Again, the option you take will depend on the available space in the volume groups on the system.

Check if all expected subsystems are running with the lssrc -a command.

Check the networking by trying to ping a well-known address. If you cannot
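The disk space and paging space checks in this list lend themselves to simple automation. The following Python sketch parses text in the column layout that df -k and lsps -s are assumed to print and flags file systems at or above a threshold. The function names, the 90 percent default, and the sample output are all illustrative assumptions. The samples were written for this example and were not captured from a real system.

```python
from typing import List, Tuple

# Illustrative samples in the assumed layout of "df -k" and "lsps -s".
SAMPLE_DF = """Filesystem    1024-blocks      Free %Used    Iused %Iused Mounted on
/dev/hd4            32768     16016   52%     2271    14% /
/dev/hd9var         16384       120  100%      310     8% /var
/proc                   -         -    -         -     -  /proc
"""

SAMPLE_LSPS = """Total Paging Space   Percent Used
      512MB               1%
"""


def full_filesystems(df_output: str, threshold: int = 90) -> List[Tuple[str, int]]:
    """Return (mount point, percent used) pairs at or above the threshold."""
    flagged = []
    for line in df_output.splitlines()[1:]:      # skip the header line
        fields = line.split()
        if len(fields) < 7:
            continue                             # not a data line
        used = fields[3].rstrip("%")             # the %Used column
        if used.isdigit() and int(used) >= threshold:
            flagged.append((fields[6], int(used)))
    return flagged


def paging_percent_used(lsps_output: str) -> int:
    """Return the Percent Used figure from "lsps -s" style output."""
    data_line = lsps_output.strip().splitlines()[1]
    return int(data_line.split()[-1].rstrip("%"))


if __name__ == "__main__":
    for mount, used in full_filesystems(SAMPLE_DF):
        print(f"{mount} is {used}% full")
    print(f"paging space in use: {paging_percent_used(SAMPLE_LSPS)}%")
```

On a live system, the sample strings would be replaced by the captured output of the two commands. Pseudo file systems that report a dash instead of a percentage are skipped.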


Chapter 2.

Error logging

This chapter describes the error logging subsystem on AIX 5L Version 5.1. It shows how you can use various commands to view the error log and interpret the information it contains. AIX 5L provides three enhancements in the area of error logging. First, you can specify a duplicate check threshold: identical errors arriving closer together than this threshold are treated as duplicates and counted only once. Second, with the errpt command, you can now request an intermediate format report that removes seldom needed data from the detailed error report format. Third, the diagnostic tool will now put additional information into the error log entry.


2.1 Error logging overview

The error logging process begins when an operating system module detects an error. The error-detecting segment of code then sends error information either to the errsave kernel service (or to the errlast kernel service, when a system crash is pending) or to the errlog subroutine, which logs an application error. The information is, in turn, written to the /dev/error special file. The errlast kernel service preserves the error record in NVRAM, so the last logged error is not lost in the event of a system crash.

This process then adds a time stamp to the collected data. The errdemon daemon constantly checks the /dev/error file for new entries, and when new data is written, the daemon conducts a series of operations.

Before an entry is written to the error log, the errdemon daemon compares the label sent by the kernel or application code to the contents of the error record template repository. If the label matches an item in the repository, the daemon collects additional data from other parts of the system.

To create an entry in the error log, the errdemon daemon retrieves the appropriate template from the repository, the resource name of the unit that detected the error, and detailed data. Also, if the error signifies a hardware-related problem and hardware Vital Product Data (VPD) exists, the daemon retrieves the VPD from the Object Data Manager (ODM).

When you access the error log, either through SMIT or with the errpt command, the error log is formatted according to the error template in the error template repository and presented in either a summary or detailed report. Most entries in the error log are attributable to hardware and software problems, but informational messages can also be logged.

The system administrator can look at the error log to determine what caused a failure, or to periodically check the health of the system when it is running. Service personnel can also examine the log to help them service the machine. The software components that allow the AIX kernel and commands to log errors to the error log are contained in the fileset bos.rte.serv_aid. This fileset is automatically installed as part of the AIX installation process.

The commands that allow you to view and manipulate the error log, such as the errpt and errclear commands, are contained in the fileset called bos.sysmgt.serv_aid. This fileset is not automatically installed by earlier releases of AIX Version 4. Use the following command to check whether the package is installed on your system:


2.2 Error log file processing

The error log is the first tool that system administrators and IBM service personnel use for problem determination. Each entry records an error ID, time stamp, error type, error class, and the resource name associated with the error. Some error log entries also contain detailed data passed by the errlog or errsave routines, such as the slot number of the failing card or the name of the file that could not be opened. Figure 2-1 shows how the error log is processed to allow users to view information about system errors.

Figure 2-1 Error log file processing

2.2.1 Error templates

The error template contains numbers that correspond to error messages in the codepoint catalogue. These are sometimes referred to as codepoint messages. These error messages are used to communicate possible causes and to recommend actions for an error. They are also used to explain the detailed data that may accompany the error.

[Figure 2-1 labels: error message catalogue (/usr/lib/nls/msg/$LANG/codepage), error template repository (/usr/adm/ras/errtmplt), error log (/usr/adm/ras/errlog), error log formatting command (errpt)]


The template is also used to indicate whether the error is reportable, loggable, or alertable. If an error is not reportable, it will not appear when the errpt command is run. If an error is not loggable, it will not be put in the error log. For example, a developer who only wants concurrent error notification to kick off an action may not want the error to show up in an error report.

Each template in the template file contains unique information that corresponds to a unique error. The contents of the error template are used to calculate the error ID of the error. The error numbers correspond to error messages indicating causes and recommendations for action.
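The idea of deriving an identifier from template contents can be illustrated with a checksum. The Python sketch below uses the standard CRC-32 from zlib to turn a string into an eight-digit hexadecimal value. This is an illustration only: the exact inputs and CRC variant that AIX uses are not described in this chapter, so the values produced here will not match real AIX error identifiers, and the template strings and the function name illustrative_error_id are hypothetical.

```python
import zlib


def illustrative_error_id(template_text):
    """Derive an eight-digit hexadecimal ID from template contents.

    Uses zlib's CRC-32 purely for illustration; this is not the
    computation that AIX performs on its error templates.
    """
    crc = zlib.crc32(template_text.encode("utf-8")) & 0xFFFFFFFF
    return format(crc, "08X")


# Hypothetical template contents, not real AIX templates.
template_a = "LABEL=EXAMPLE_ERR Class=H Type=PERM Desc=1234"
template_b = "LABEL=EXAMPLE_ERR Class=H Type=TEMP Desc=1234"

id_a = illustrative_error_id(template_a)
id_b = illustrative_error_id(template_b)
print(id_a, id_b)
```

The same contents always give the same identifier, and changing a field in the template changes the identifier, which is the property the paragraph above relies on.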

The templates in the errtmplt file can be viewed by invoking the errpt command with the -t flag.

Templates are installed into the errtmplt file using the errupdate command. This is carried out automatically as part of the installation process of any filesets that contain new error templates.

2.2.2 Error messages

Error messages are numbered and placed in a separate file, called the codepoint catalogue. Similar to message catalogues, the codepoint catalogue is a translated file that exists in the different language directories. The user's LANG environment variable determines which codepoint catalogue is accessed when the errinstall, errmsg, or errpt commands are run. The error message source is installed in the codepoint catalogue using the errinstall command. This is carried out automatically as part of the installation process of any filesets that contain new error messages.

The codepoint catalogue can be viewed by using the errmsg command with the -w flag. You can also use errmsg to add your own messages to the catalogue for an application you have written. Beginning with AIX Version 4.3.2, error messages may also come from regular XPG/4 message catalogues, also called NLS message catalogues.

2.3 Error log management

You can generate an error report from entries in an error log. The errpt


2.3.1 Viewing the error log

There are two main ways of viewing the error log:

You can use the System Management Interface Tool (SMIT) with a fast path to run the errpt command. To use the SMIT fast path, enter:

# smit errpt

After completing a dialog about the destination of the output and concurrent error reporting, you will see a panel similar to that shown in Example 2-1.

Example 2-1 SMIT Generate an Error Report panel

                         Generate an Error Report

Type or select values in entry fields.
Press Enter AFTER making all desired changes.

[TOP]                                         [Entry Fields]
  CONCURRENT error reporting?                  no
  Type of Report                               summary       +
  Error CLASSES (default is all)              []             +
  Error TYPES (default is all)                []             +
  Error LABELS (default is all)               []             +
  Error ID's (default is all)                 []
  Resource CLASSES (default is all)           []
  Resource TYPES (default is all)             []
  Resource NAMES (default is all)             []
  SEQUENCE numbers (default is all)           []
  STARTING time interval                      []
  ENDING time interval                        []
  Show only Duplicated Errors                 [no]           +
[MORE...5]

F1=Help      F2=Refresh    F3=Cancel    F4=List
F5=Reset     F6=Command    F7=Edit      F8=Image
F9=Shell     F10=Exit      Enter=Do

You can also view the error log from the command line using the errpt command. When used from the command line, considerable amounts of output can often be generated, so it is best to control the command by piping the output to either the more or pg commands, which allow it to be viewed one panel at a time. When invoked with no options, errpt will display a summary report, listing one line of information about each error log entry. An example of this is shown in Example 2-2.

Example 2-2 Errpt output

# errpt
35BFC499   0821161601 P H cd0           DISK OPERATION ERROR
0BA49C99   0821161601 T H scsi0         SCSI BUS ERROR
F89FB899   0806150001 P O dumpcheck     The copy directory is too small.
2120313C   0806142301 I H tok0          PROBLEM RESOLVED
75CE5DC5   0806142301 I H tok0          ADAPTER ERROR
FFE305EE   0806142301 P H tok0          WIRE FAULT
2120313C   0806142101 I H tok0          PROBLEM RESOLVED
75CE5DC5   0806142001 I H tok0          ADAPTER ERROR
FFE305EE   0806142001 P H tok0          WIRE FAULT
FFE305EE   0806142001 P H tok0          WIRE FAULT
2120313C   0806141901 I H tok0          PROBLEM RESOLVED
75CE5DC5   0806141801 I H tok0          ADAPTER ERROR
FFE305EE   0806141801 P H tok0          WIRE FAULT
FFE305EE   0806141801 P H tok0          WIRE FAULT
FFE305EE   0806141701 P H tok0          WIRE FAULT
2120313C   0806141701 I H tok0          PROBLEM RESOLVED
499B30CC   0806133601 T H ent1          ETHERNET DOWN
75CE5DC5   0806133501 I H tok0          ADAPTER ERROR
FFE305EE   0806133501 P H tok0          WIRE FAULT
F89FB899   0805150001 P O dumpcheck     The copy directory is too small.
F89FB899   0804150001 P O dumpcheck     The copy directory is too small.

In addition to the summary report, the errpt command can be used with various flags to generate a customized report detailing the error log entries you are interested in:

To display information about errors in the error log file in detailed format, enter the following command:

# errpt -a

In AIX 5L Version 5.1, the errpt command supports an intermediate output format, selected with the -A flag, in addition to the summary and detailed formats already provided. Only the values for LABEL, Date/Time, Type, Resource Name, Description, and Detail Data are displayed. To display this shortened version of the detailed report produced by the -a flag, enter the following command:

# errpt -A -j identifier

Where identifier is the eight digit hexadecimal unique error identifier. Example 2-3 shows the output of the above command.

Example 2-3 A shortened error report

# errpt -A -j 9DBCFDEE
LABEL: ERRLOG_ON
Resource Name: errdemon
Description
ERROR LOGGING TURNED ON

To display a detailed report of all errors logged for a particular error identifier, enter the following command:

# errpt -a -j identifier

Where identifier is the eight-digit hexadecimal unique error identifier.

To clear all entries from the error log, enter the following command:

# errclear 0

To stop error logging, enter the following command:

# /usr/lib/errstop

To start error logging, enter the following command:

# /usr/lib/errdemon

To list the current settings for the error log file, the buffer size, and duplicate handling, enter the following command:

# /usr/lib/errdemon -l

If you want to change the buffer size and error log file size, you can use the errdemon command. For further detail, refer to the manual page for errdemon.

Software service aid configuration information is stored in the /etc/objrepos/SWservAt ODM database. This ODM class is used to store information about the location and size of various log files used by the system. It is also used to hold information about trace hooks available for use by the trace subsystem, described in Chapter 11, “Event tracing” on page 341.

By default, AIX runs cron jobs that delete all hardware error log entries older than 90 days (daily at 12 noon) and all software and operator message error log entries older than 30 days (daily at 11 AM). The cron jobs simply use the errclear command to delete the old entries. If you are investigating a software problem that has been on a machine for a long time, do not assume that the first instance of the error in the error log marks the first time the problem occurred. Earlier entries may have been deleted because they were older than 30 days.
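The effect of these retention periods on what you can still see in the log can be modelled in a few lines. The Python sketch below encodes the 90-day and 30-day periods from the text and reports whether an entry of a given class and age would already have been cleared. It is a model of the policy only, not of the errclear command itself; the names RETENTION_DAYS and would_be_cleared and the example dates are hypothetical.

```python
from datetime import datetime, timedelta

# Retention periods taken from the text: 90 days for hardware (H)
# entries, 30 days for software (S) and operator (O) entries.
RETENTION_DAYS = {"H": 90, "S": 30, "O": 30}


def would_be_cleared(error_class, logged_at, now):
    """Return True if an entry of this class and age would be deleted."""
    days = RETENTION_DAYS.get(error_class)
    if days is None:
        return False  # classes the text does not mention are left alone
    return now - logged_at > timedelta(days=days)


# Hypothetical dates: an entry logged 51 days before the check.
now = datetime(2001, 8, 21, 12, 0)
logged = datetime(2001, 7, 1, 9, 0)
print(would_be_cleared("S", logged, now))
print(would_be_cleared("H", logged, now))
```

In this example the software entry is past its 30-day limit while a hardware entry of the same age is still retained, which is why the oldest software entry in the log may not be the first occurrence of a problem.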

Note: If you accidentally remove the errlog file, use the /usr/lib/errstop and /usr/lib/errdemon commands in sequence to recover the file. errdemon


AIX 5L Version 5.1 introduces several new options that reduce the effort of managing repeated, identical error log entries. The first suppresses rapidly logged duplicate error log entries, and the second adds diagnostic output to error log entries.

Duplicate Removal flag: -D

The errdemon command was enhanced in AIX 5L to support four additional flags. The flags -D and -d specify if duplicate error log entries are to be removed or not. The default is the -D flag, which instructs the command to remove the duplicates.

Time Range Flag: -t

With the -t and -m flags, you can control what is considered a duplicate error log entry. A value in the range 1 to 2^31 - 1 specifies the time in milliseconds within which an error identical to the previous one is considered a duplicate. The default value for this flag is 100 milliseconds (0.1 seconds). You should normally not change this time value. If, for example, you make it too large, diagnostics may not consider a condition serious when it really is serious. Diagnostics occasionally depends on an error being logged multiple times within a time period.

Count Flag: -m

The -m flag sets a count, after which the next error is no longer considered a duplicate of the previous one. The range for this value is 1 to 2^31 - 1 with a default of 1000.
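How the time and count settings interact can be sketched as a small model. The Python class below is a simplified illustration, not errdemon's implementation: it assumes the interval is measured from the most recent identical error and that the count is reset whenever a new entry is logged. The class name DuplicateFilter and the event data are hypothetical; the default values mirror those stated in the text.

```python
class DuplicateFilter:
    """Simplified model of duplicate suppression; not errdemon's code."""

    def __init__(self, interval_ms=100, max_count=1000):
        self.interval_ms = interval_ms  # models the -t value
        self.max_count = max_count      # models the -m value
        self.last_key = None
        self.last_time_ms = None
        self.dup_count = 0

    def is_duplicate(self, key, time_ms):
        """Return True if this error would be folded into the previous entry."""
        duplicate = (
            key == self.last_key
            and time_ms - self.last_time_ms <= self.interval_ms
            and self.dup_count < self.max_count
        )
        if duplicate:
            self.dup_count += 1
        else:
            # A new entry is logged and becomes the reference for later errors.
            self.last_key = key
            self.dup_count = 0
        self.last_time_ms = time_ms
        return duplicate


# Hypothetical stream of identical errors with arrival times in milliseconds.
dup_filter = DuplicateFilter(interval_ms=100, max_count=2)
events = [("FFE305EE", 0), ("FFE305EE", 50), ("FFE305EE", 90),
          ("FFE305EE", 120), ("FFE305EE", 400)]
results = [dup_filter.is_duplicate(key, t) for key, t in events]
print(results)
```

With a count of 2, the fourth error is logged as a new entry even though it arrives inside the interval, and the fifth is logged because it arrives too late to count as a duplicate.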

The errpt command also has a new -D flag, which consolidates duplicate errors. In conjunction with the -a flag, only the number of duplicate errors and the time stamps for the first and last occurrence are reported. This is complemented by a new -P flag, which displays only the duplicate errors logged by the new mechanisms of errdemon mentioned previously.

The link between the error log and diagnostics is also new.

When the diagnostic tool runs, it automatically tries to diagnose hardware errors it finds in the error log. Starting with AIX 5L, the information generated by the diag command is put back into the error log entry, so that it is easy to make the connection between the error event and, for example, the FRU number required to repair failing hardware. After replacing the part, run the diag command, go to Task Selection, and select Log Repair Action.

To get the most out of the AIX Error Log and Error Log Analysis in Diagnostics, you must understand the following points:


2. In many cases, permanent and temporary hardware errors are logged that are not indicative of a hardware problem. Error Log Analysis should always be run to determine if these errors indicate a hardware problem.

3. Resource Name indicates the resource that detected the error. It does not indicate the failing resource. It does indicate the resource that diagnostics and Error Log Analysis should be run on.

4. Failure Causes, Probable Causes, and User Causes are generic recommendations and are not intended to be used to determine what parts to replace. Parts should only be replaced based on diagnostics and Error Log Analysis results.

5. System errors related to the processors, memory, power supplies, fans, and so on, are logged under resource name sysplanar0. Error Log Analysis should be run on sysplanar0 any time there is an error log with a resource name of sysplanar0.

6. Error Log Analysis can be run by running diagnostics in problem determination mode or by running the Run Error Log Analysis task. Stand-alone diagnostics do not perform error log analysis.

7. Error Log Analysis will analyze all the errors in the error log associated with a specific resource. For those errors that should be corrected, Error Log Analysis will provide a list of actions or an SRN. For errors that can safely be ignored, Error Log Analysis will indicate that no problems were found.

2.3.2 Reading a summary error log

A summary error log report, obtained by using the errpt command with no flags, contains the following information for each error log entry.

Identifier

The error identifier is a 32-bit CRC hexadecimal code that determines which error record template is used to interpret the information contained in the error log entry.

Time stamp

This is the time and date when the error occurred, formatted as MMDDhhmmYY.

Type

The type indicates the severity of the reason for logging the error. There are five different types of errors, along with a sixth type for when the severity cannot be determined:


PERF    The performance of the device or component has degraded to below an acceptable level.

PERM    Condition that could not be recovered from. Error types with this value are usually the most severe errors and are more likely to mean that you have a defective hardware device or software module. Error types other than PERM usually do not indicate a defect, but they are recorded so that they can be analyzed by the diagnostics programs.

TEMP    Condition that was recovered from after a number of unsuccessful attempts. This error type is also used to record informational entries, such as data transfer statistics for DASD devices.

UNKN    It is not possible to determine the severity of the error.

INFO    The error log entry is informational and was not the result of an error.

The summary error log report only displays the first letter of the error type.

Class

This indicates the general source of the error. The classes are as follows:

H    This means hardware device failures or media errors. When you receive a hardware error, refer to your system operator guide for information about performing diagnostics on the problem device or other piece of equipment. The diagnostics program tests the device and analyzes the error log entries related to it to determine the state of the device. Refer to Chapter 5, “Hardware problem determination” on page 115 for more information.

S    This means software application program failures, system program failures, and kernel problems, such as low paging space.

O    This indicates an operator notification error that gets logged when the errorlogger command is used.

U    This means the source of the error is undetermined.

Resource name


Description

Gives a description of the logged information.
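The summary fields described above can be pulled apart programmatically. The following Python sketch splits one summary line into its six fields, decodes the MMDDhhmmYY time stamp, and expands the class letter. It assumes the fields are separated by whitespace and that only the description contains spaces; the function name parse_summary_line and the dictionary keys are hypothetical. The type letter is left as printed, because several error types share the same first letter.

```python
from datetime import datetime

# Class letters as described in this section.
CLASS_NAMES = {
    "H": "hardware",
    "S": "software",
    "O": "operator notification",
    "U": "undetermined",
}


def parse_summary_line(line):
    """Split one errpt summary line into a dictionary of fields."""
    identifier, stamp, err_type, err_class, resource, description = (
        line.strip().split(None, 5)
    )
    return {
        "identifier": identifier,
        # MMDDhhmmYY; two-digit years follow Python's %y convention.
        "timestamp": datetime.strptime(stamp, "%m%d%H%M%y"),
        "type": err_type,
        "class": CLASS_NAMES.get(err_class, err_class),
        "resource": resource,
        "description": description,
    }


entry = parse_summary_line("35BFC499 0821161601 P H cd0 DISK OPERATION ERROR")
print(entry["timestamp"], entry["class"], entry["resource"])
```

The sample line is taken from Example 2-2; its time stamp decodes to 21 August 2001 at 16:16.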

2.3.3 Reading error logs in detail

To obtain a detailed report of all errors logged by the system, enter the following command:

# errpt -a | pg

A detailed error log report may contain multiple entries. Each entry contains multiple fields. The exact fields that are displayed depend upon the nature of the error. The following fields are always included in every error entry:

LABEL              Predefined name for the event.

ID                 Numerical identifier for the event.

Date/Time          Date and time of the event.

Sequence Number    Unique number for the event.

Machine ID         Identification number of your system processor unit.

Node ID            Mnemonic name of your system.

Class              General source of the error. See “Class” on page 28 for a description of the class types.

Type               Severity of the error that has occurred. Five types of errors are possible. See “Type” on page 27 for a description of the error log types.

Resource Name      Name of the resource that has detected the error. For software errors, this is the name of a software component or an executable program. For hardware errors, this is the name of a device or system component. It does not indicate that the component is faulty or needs replacement. Instead, it is used to determine the appropriate diagnostic modules to be used to analyze the error.

Description        Summary of the error.
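A detailed entry can be read into a simple data structure for further processing. The Python sketch below assumes that single-value fields appear as "Name: value" lines and that free-text sections such as Description appear as a heading line followed by text and ended by a blank line. That layout is an assumption based on Example 2-3, and the function name parse_detailed_entry is hypothetical.

```python
def parse_detailed_entry(text):
    """Parse one detailed errpt entry into a dictionary."""
    fields = {}
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            section = None               # a blank line ends a text section
        elif set(line) == {"-"}:
            continue                     # ignore separator rules between entries
        elif section is not None:
            fields[section].append(line)
        elif ":" in line:
            key, value = line.split(":", 1)
            fields[key.strip()] = value.strip()
        else:
            section = line               # heading such as "Description"
            fields[section] = []
    return fields


# Sample built from the fields shown in Example 2-3.
SAMPLE_ENTRY = """\
LABEL:          ERRLOG_ON
Resource Name:  errdemon

Description
ERROR LOGGING TURNED ON
"""

parsed = parse_detailed_entry(SAMPLE_ENTRY)
print(parsed)
```

Only the first colon on a line is treated as the separator, so values that themselves contain colons, such as a time of day, are kept whole.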

The following fields may or may not be present, depending on the nature of the error:

Resource Class     General class of the resource that detected the failure.

Resource Type      Type of the resource that detected the failure.
