Service Agent
Chapter 2. Error logging
2.3 Error log management
You can generate an error report from entries in an error log. The errpt command accepts flags for selecting errors that match specific criteria. By default, it displays error log entries in reverse chronological order, so the latest entry appears first.
2.3.1 Viewing the error log
There are two main ways of viewing the error log:
You can use the System Management Interface Tool (SMIT) with a fast path to run the errpt command. To use the SMIT fast path, enter:
# smit errpt
After completing a dialog about the destination of the output and concurrent error reporting, you will see a panel similar to that shown in Example 2-1.
Example 2-1 SMIT Generate an Error Report panel
Generate an Error Report
Type or select values in entry fields.
Press Enter AFTER making all desired changes.
[TOP]                                      [Entry Fields]
CONCURRENT error reporting?                no
Type of Report                             summary +
Error CLASSES (default is all)             [] +
Error TYPES (default is all)               [] +
Error LABELS (default is all)              [] +
Error ID's (default is all)                []
Resource CLASSES (default is all)          []
Resource TYPES (default is all)            []
Resource NAMES (default is all)            []
SEQUENCE numbers (default is all)          []
STARTING time interval                     []
ENDING time interval                       []
Show only Duplicated Errors                [no] +
[MORE...5]
F1=Help F2=Refresh F3=Cancel F4=List
F5=Reset F6=Command F7=Edit F8=Image
F9=Shell F10=Exit Enter=Do
You can also view the error log from the command line using the errpt command.
When used from the command line, errpt can generate a considerable amount of output, so it is best to pipe the output to either the more or pg command, which lets you view it one panel at a time. When invoked with no options, errpt displays a summary report with one line of information for each error log entry, as shown in Example 2-2.
Example 2-2 Errpt output
# errpt
IDENTIFIER TIMESTAMP T C RESOURCE_NAME DESCRIPTION
35BFC499 0821161601 P H cd0 DISK OPERATION ERROR
0BA49C99 0821161601 T H scsi0 SCSI BUS ERROR
F89FB899 0806150001 P O dumpcheck The copy directory is too small.
2120313C 0806142301 I H tok0 PROBLEM RESOLVED
F89FB899 0805150001 P O dumpcheck The copy directory is too small.
F89FB899 0804150001 P O dumpcheck The copy directory is too small.
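Because each summary line starts with the error identifier, a short filter can show which errors recur most often. The following is a minimal sketch, assuming the column layout shown in Example 2-2; count_by_identifier is a hypothetical helper name, not an AIX command, and the sample input is copied from the example.

```shell
# count_by_identifier: hypothetical helper (not part of AIX).
# Reads an errpt summary report on stdin and prints "count identifier",
# most frequent identifier first.
count_by_identifier() {
    awk '$1 != "IDENTIFIER" && NF > 0 { count[$1]++ }
         END { for (id in count) print count[id], id }' |
    sort -k1,1nr -k2,2
}

# Sample input taken from Example 2-2.
count_by_identifier <<'EOF'
IDENTIFIER TIMESTAMP T C RESOURCE_NAME DESCRIPTION
35BFC499 0821161601 P H cd0 DISK OPERATION ERROR
0BA49C99 0821161601 T H scsi0 SCSI BUS ERROR
F89FB899 0806150001 P O dumpcheck The copy directory is too small.
2120313C 0806142301 I H tok0 PROBLEM RESOLVED
F89FB899 0805150001 P O dumpcheck The copy directory is too small.
F89FB899 0804150001 P O dumpcheck The copy directory is too small.
EOF
```

On a live system, the same helper could be fed directly from the command, for example errpt | count_by_identifier.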
In addition to the summary report, the errpt command can be used with various flags to generate a customized report detailing the error log entries you are interested in:
To display information about errors in the error log file in detailed format, enter the following command:
# errpt -a
AIX 5L Version 5.1 adds an intermediate output format to the errpt command, selected with the -A flag, alongside the existing summary and detailed formats. Only the values for LABEL, Date/Time, Type, Resource Name, Description, and Detail Data are displayed. To display this shortened version of the detailed report produced by the -a flag for a particular error identifier, enter the following command:
# errpt -A -j identifier
Where identifier is the unique eight-digit hexadecimal error identifier.
Example 2-3 shows the output of the above command.
Example 2-3 A shortened error report
# errpt -A -j 9DBCFDEE
LABEL: ERRLOG_ON
Date/Time: Wed Aug 22 11:03:54 CDT
Type: TEMP
Resource Name: errdemon
Description
ERROR LOGGING TURNED ON
To display a detailed report of all errors logged for a particular error identifier, enter the following command:
# errpt -a -j identifier
Where identifier is the unique eight-digit hexadecimal error identifier.
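When the identifier comes from a script or from user input, it can be worth checking that it really is eight hexadecimal digits before passing it to the -j flag. The following is a minimal sketch; is_error_id is a hypothetical helper name, not an AIX command.

```shell
# is_error_id: hypothetical helper (not part of AIX).
# Succeeds only if the argument is exactly eight hexadecimal digits.
is_error_id() {
    case "$1" in
        *[!0-9A-Fa-f]*) return 1 ;;   # contains a non-hexadecimal character
    esac
    [ "${#1}" -eq 8 ]                 # must be exactly eight characters long
}

# Check a few candidate values; 9DBCFDEE is the identifier from Example 2-3.
for id in 9DBCFDEE 9DBCFDE ERRLOG_ON; do
    if is_error_id "$id"; then
        echo "$id: valid identifier"
    else
        echo "$id: not a valid identifier"
    fi
done
```

A script might then guard the report with a test such as is_error_id "$id" && errpt -a -j "$id".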
To clear all entries from the error log, enter the following command:
# errclear 0
To stop error logging, enter the following command:
# /usr/lib/errstop
To start error logging, enter the following command:
# /usr/lib/errdemon
To list the current settings for the error log file, the buffer size, and duplicate handling, enter the following command:
# /usr/lib/errdemon -l
If you want to change the buffer size and error log file size, you can use the errdemon command. For further detail, refer to the manual page for errdemon.
Software service aid configuration information is stored in the /etc/objrepos/SWservAt ODM database. This ODM class is used to store information about the location and size of various log files used by the system. It is also used to hold information about trace hooks available for use by the trace subsystem, described in Chapter 11, “Event tracing” on page 341.
By default, AIX runs cron jobs that delete all hardware error log entries older than 90 days daily at 12 noon, and all software and operator message error log entries older than 30 days daily at 11 AM. The cron jobs simply use the errclear command to delete the old entries. If you are investigating a software problem that has been on a machine for a long time, do not assume that the first instance of the error in the error log marks the first time the software problem occurred. Earlier entries may have been deleted because they were older than 30 days.
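The retention rule described above can be sketched as a small decision function. This is only an illustration of the default thresholds stated in the text, not the cron job itself; would_be_purged is a hypothetical helper name, and the class letters H, S, and O are assumed to match the C column of the errpt summary report.

```shell
# would_be_purged CLASS AGE_DAYS: hypothetical helper (not part of AIX).
# Succeeds if an entry of the given class and age would be removed under
# the default retention described in the text:
#   H (hardware)                       older than 90 days
#   S (software), O (operator message) older than 30 days
# The default cron entries usually take a form such as
# "errclear -d H 90" and "errclear -d S,O 30"; verify with crontab -l.
would_be_purged() {
    case "$1" in
        H)   limit=90 ;;
        S|O) limit=30 ;;
        *)   echo "unknown class: $1" >&2; return 2 ;;
    esac
    [ "$2" -gt "$limit" ]
}

# A 45-day-old software entry is already gone; a hardware entry is not.
for class in H S; do
    if would_be_purged "$class" 45; then
        echo "class $class, 45 days old: deleted by the cron job"
    else
        echo "class $class, 45 days old: still in the error log"
    fi
done
```

The difference between the two thresholds is why an old software problem may appear to start later in the log than it actually did.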
Note: If you accidentally remove the errlog file, run the /usr/lib/errstop and /usr/lib/errdemon commands in sequence to recover it. The errdemon command creates the errlog file if it does not exist.
AIX 5L Version 5.1 introduces some new options that reduce the effort of managing identical, repeated error log entries. The first reduces the number of rapidly logged duplicate error log entries, and the second adds diagnostic output to error log entries.
Duplicate Removal flag: -D
The errdemon command was enhanced in AIX 5L to support four additional flags. The -D and -d flags specify whether duplicate error log entries are removed. The default is the -D flag, which instructs the command to remove the duplicates.
Time Range Flag: -t
With the -t and -m flags, you can control what is considered a duplicate error log entry. For the -t flag, a value in the range 1 to 2^31 - 1 specifies the time in milliseconds within which an error identical to the previous one is considered a duplicate.
The default value for this flag is 100 milliseconds (0.1 seconds). You should normally not change this time value. If, for example, you make it too large, diagnostics may not treat a condition as serious when it really is. Diagnostics occasionally depends on an error being logged multiple times within a time period.
Count Flag: -m
The -m flag sets a count, after which the next error is no longer considered a duplicate of the previous one. The range for this value is 1 to 2^31 - 1, with a default of 1000.
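The interaction of the time window and the count can be illustrated with a simplified model. The following sketch is not the errdemon implementation; it assumes that the time window is measured from the immediately preceding occurrence and that the duplicate counter resets whenever an entry is logged. simulate_dedup is a hypothetical helper name, and its input format (a millisecond timestamp followed by an identifier) is invented for the illustration.

```shell
# simulate_dedup TIME_MS MAX_COUNT: hypothetical model (not part of AIX).
# Reads "timestamp_ms identifier" pairs on stdin and prints LOG for events
# that would be logged and DUP for events treated as duplicates.
# Assumptions: the window is measured from the previous event, and the
# duplicate counter resets each time an event is logged.
simulate_dedup() {
    awk -v window="$1" -v max="$2" '
    {
        t = $1; id = $2
        if (NR > 1 && id == prev_id && (t - prev_t) <= window && dups < max) {
            dups++
            print "DUP", t, id
        } else {
            dups = 0
            print "LOG", t, id
        }
        prev_id = id; prev_t = t
    }'
}

# Defaults from the text: 100 ms window, count of 1000.
simulate_dedup 100 1000 <<'EOF'
0 F89FB899
50 F89FB899
120 F89FB899
400 F89FB899
450 35BFC499
EOF
```

With the default values, the events at 50 ms and 120 ms fall within the window and are treated as duplicates, while the event at 400 ms and the event with a different identifier are logged. Lowering the count to 1 causes every second identical event in a burst to be logged again.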
The errpt command also has a new -D flag, which consolidates duplicate errors.
In conjunction with the -a flag, only the number of duplicate errors and the time stamps for the first and last occurrence are reported. This is complemented by a new -P flag, which displays only the duplicate errors logged by the new mechanisms of errdemon mentioned previously.
The second new function links the error log with diagnostics.
When the diagnostic tool runs, it automatically tries to diagnose hardware errors it finds in the error log. Starting with AIX 5L, the information generated by the diag command is written back into the error log entry, which makes it easy to connect the error event with, for example, the FRU number required to repair the failing hardware. After replacing the part, run the diag command, go to Task Selection, and select Log Repair Action.
To get the most out of the AIX Error Log and Error Log Analysis in Diagnostics, you must understand the following points:
1. The kernel and device drivers log errors to the error log; diagnostics does not. Diagnostics (Error Log Analysis) analyzes errors that have been logged.
2. In many cases, permanent and temporary hardware errors are logged that are not indicative of a hardware problem. Error Log Analysis should always be run to determine if these errors indicate a hardware problem.
3. Resource Name indicates the resource that detected the error. It does not indicate the failing resource. It does indicate the resource that diagnostics and Error Log Analysis should be run on.
4. Failure Causes, Probable Causes, and User Causes are generic recommendations and are not intended to be used to determine what parts to replace. Parts should only be replaced based on diagnostics and Error Log Analysis results.
5. System errors related to the processors, memory, power supplies, fans, and so on, are logged under resource name sysplanar0. Error Log Analysis should be run on sysplanar0 any time there is an error log entry with a resource name of sysplanar0.
6. Error Log Analysis can be run by running diagnostics in problem determination mode or by running the Run Error Log Analysis task. Stand-alone diagnostics do not perform error log analysis.
7. Error Log Analysis will analyze all the errors in the error log associated with a specific resource. For those errors that should be corrected, Error Log Analysis will provide a list of actions or an SRN. For errors that can safely be ignored, Error Log Analysis will indicate that no problems were found.
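Points 2 and 3 suggest a simple way to build a list of resources on which to run Error Log Analysis: collect the resource names of all permanent and temporary hardware entries in the summary report. The following is a minimal sketch, assuming the column layout of Example 2-2 (type in the third column, class in the fourth, resource name in the fifth); ela_candidates is a hypothetical helper name, not an AIX command.

```shell
# ela_candidates: hypothetical helper (not part of AIX).
# Reads an errpt summary report on stdin and prints the unique resource
# names that logged permanent (P) or temporary (T) hardware (H) errors.
ela_candidates() {
    awk '$1 != "IDENTIFIER" && $4 == "H" && ($3 == "P" || $3 == "T") { print $5 }' |
    sort -u
}

# Sample input taken from Example 2-2.
ela_candidates <<'EOF'
IDENTIFIER TIMESTAMP T C RESOURCE_NAME DESCRIPTION
35BFC499 0821161601 P H cd0 DISK OPERATION ERROR
0BA49C99 0821161601 T H scsi0 SCSI BUS ERROR
F89FB899 0806150001 P O dumpcheck The copy directory is too small.
2120313C 0806142301 I H tok0 PROBLEM RESOLVED
EOF
```

For the sample input, the helper lists cd0 and scsi0. On a live system, it could be fed with errpt | ela_candidates, and each resulting resource then checked with Error Log Analysis as described in point 6.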
2.3.2 Reading a summary error log
A summary error log report, obtained by using the errpt command with no flags, contains the following information for each error log entry.
Identifier
The error identifier is a 32-bit CRC hexadecimal code that determines which error record template is used to interpret the information contained in the error log entry.