Recommended action for proactive replacement of disk drive(s) that report “Recommend Drive Replacement”

Section 4 Troubleshooting & Tools

4. Recommended action for proactive replacement of disk drive(s) that report “Recommend Drive Replacement” (event code 803), or for customers who are demanding proactive action on soft media error events (event code 820). Verify that the suspect drive has not already been faulted. Run Background Verify on all LUNs associated with the suspect drive, and allow the Background Verify(s) to complete before taking any other action. Running Background Verify against that drive’s LUNs first helps reduce the possibility of data loss due to latent soft media errors that might be present on other drives in the RAID group. There is no need to run Background Verify if the drive that you are replacing is already faulted. It is a good idea to run the Raid Group Health Check (rghc.pl) on the drive that is to be replaced. This script reads the SPCollect files, analyzes all other drives in the Raid Group, and reports whether it is safe to replace the drive in question. Run rghc.pl for help on appropriate arguments. If the drive is not safe to remove, other action, such as backing up data first, might be appropriate.

5. Recommended action when soft media errors seem to be frequent or increasing. As noted above, Sniffer constantly runs in the background to detect and re-allocate any defective sectors. These re-allocations are logged by FLARE as soft media errors. In addition, any defective sectors encountered during normal operation are also logged as soft media errors. These logged errors generally do not indicate an abnormal condition and do not require any corrective action. Allowing FLARE to function as designed is EMC’s recommendation in most situations. However, if more than three (3) soft media errors (specifically 820 with sense key of 05, “Bad Block”) are detected on any one drive in any 30-day period, Background Verify should be run on all LUNs associated with that disk drive. If, following the Background Verify operation, that same drive logs more than two (2) soft media errors (specifically 820 with sense key of 05) in the subsequent 30-day period (excluding any errors logged during the
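The first threshold above (more than three 820 events with sense key 05 on one drive in any 30-day period) can be checked with a sliding window. The following is a minimal sketch, not an EMC tool: the function name and its inputs are hypothetical, the caller is assumed to have already filtered the log down to 820 / sense key 05 events for a single drive, and treating events exactly 30 days apart as falling outside the same window is an assumption.

```python
from datetime import datetime, timedelta


def exceeds_soft_media_threshold(timestamps, max_errors=3, window_days=30):
    """Return True if more than max_errors events fall inside any window_days span.

    timestamps: datetimes of 820 (sense key 05) events for ONE drive,
    already extracted from the logs by the caller (hypothetical input).
    """
    window = timedelta(days=window_days)
    events = sorted(timestamps)
    start = 0
    for end, ts in enumerate(events):
        # Advance the left edge until the window spans less than window_days.
        while ts - events[start] >= window:
            start += 1
        if end - start + 1 > max_errors:
            return True
    return False
```

With the defaults, four events spread over nine days trip the threshold, while three events in the same period do not.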

801 Soft SCSI Bus Error

Description: A lot of information is available regarding these types of messages. In fact, there is a course covering backend fault isolation and the architecture of the backend buses for the various CLARiiON families of products. The course content is captured in a document entitled “EMC/CLARiiON Troubleshooting”. Ultimately, Soft SCSI errors indicate that there is some type of disturbance on the bus. This disturbance may be caused by a bad transmitter on a drive, an LCC, or a cable. Unfortunately, the bad component can be hard to find. The backend buses in a CLARiiON CX product consist of a Fibre Channel arbitrated loop. The enclosures attached to the loop are DAE, DAE2-ATA, DAE2P or DAE4P. The DAE (Katana) extends the loop from the first drive to the next, through to the last drive in the enclosure. Bus disturbances tend to be reflected up and down the topology, affecting good devices. Flare will attempt to stem these problems by shutting down drives. Unfortunately, searching for the failing drive is not an exact science, and many times good drives are excommunicated from the bus. The DAE2P and DAE4P (Stiletto) enclosures isolate drives from each other in switched mode in an attempt to prevent one drive from affecting many. Keep in mind that if a Stiletto enclosure is added to a CX (chameleon/Fish) DPE, DAE (Katana), or DAE2-ATA (Klondike), it operates in loop mode and does not provide drive isolation.

Recommendation: There are several recommended courses of action, depending on whether logs are available and how suddenly the problem appeared.

Troubleshooting with logs

1. 801 errors with extended status 0x2 “Parity Error” and 0x2A “Bad Transfer Count”. Drives on the loop before the problem drive(s) tend to report 0x2A, and drives after the problem tend to report 0x2. Identify the last drive reporting 0x2A and the first drive reporting 0x2. Suspected drives include those two drives and all drives in between. Use FBI to corroborate your finding and further narrow the selection. Remove any hot spares that are not engaged. If FBI information cannot be obtained or is inconclusive, remove the last drive to report 0x2A and monitor the array to see if the 801 messages stop. Hot spares can complicate the identification process if they are involved in Raid Groups where the messages are being reported. Note: The tricky part here is to keep in mind that IBM drives report “Aborted by Device” (error code 0x11) when they really mean “Parity Error” (error code 0x2).

2. If the above combination is not present, run FBI and examine the output. If messages occur on an enclosure boundary and on one side only, then suspect an LCC. Other indicators of LCC problems are “6c2 BE Fibre Loop Hung” and “LCC Glitch” messages.
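The suspect-range rule in step 1 can be sketched as follows. This is only an illustration under stated assumptions: the function name and the input shape are hypothetical, slot numbers are assumed to reflect loop order within the enclosure, and the caller is assumed to know which slots hold IBM drives so that their 0x11 reports can be read as 0x2.

```python
PARITY_ERROR = 0x02           # 801 extended status "Parity Error"
BAD_TRANSFER_COUNT = 0x2A     # 801 extended status "Bad Transfer Count"
IBM_ABORTED_BY_DEVICE = 0x11  # IBM drives report this when they mean 0x2


def suspect_slots(codes_by_slot, ibm_slots=()):
    """Return the slots from the last 0x2A reporter to the first 0x2 reporter.

    codes_by_slot: {slot_number: iterable of 801 extended status codes seen}.
    Returns [] when the 0x2A / 0x2 combination is not present.
    """
    last_2a = None
    first_02 = None
    for slot in sorted(codes_by_slot):
        codes = set(codes_by_slot[slot])
        if slot in ibm_slots and IBM_ABORTED_BY_DEVICE in codes:
            codes.add(PARITY_ERROR)  # treat IBM 0x11 as a parity error
        if BAD_TRANSFER_COUNT in codes:
            last_2a = slot
        if PARITY_ERROR in codes and first_02 is None:
            first_02 = slot
    if last_2a is None or first_02 is None:
        return []
    low, high = sorted((last_2a, first_02))
    # The two boundary drives and everything in between are suspect.
    return list(range(low, high + 1))
```

For example, if slots 0 through 4 report 0x2A and slots 7 and above report 0x2, the sketch returns slots 4, 5, 6 and 7, which is the set to corroborate with FBI.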

Troubleshooting without logs

Unfortunately, there are times when backend problems are severe enough to prevent gathering logs. This can occur when the Navisphere agent is degraded or one of the SPs is not operating properly.

1. Try booting the failing SP with only the boot DAE attached. Monitor the boot log through a serial PPP connection to make sure the array successfully boots the OS.

2. If this does not work, attempt to get the failing SP booted into degraded mode.

3. If step two does not work, replace the SP.

4. If you can put the machine into degraded mode, then drives 0_0_0 through 0_0_14 are suspect. Set HFOFF, then reboot again. Start an off-array ktcons session with ktail output to a log file. Make a copy of the flareandlayeredstart.bat file, then edit the file to include a pause command after each driver starts. Step through to see which driver fails and examine ktail for clues to the reason for the failure. Keep in mind that one SP can affect the operation of the other when backend problems are severe enough.

5. If the ktcons output does not reveal much, or you are having trouble getting output, you can shut down both SPs, remove all drives but the primary boot drive for the troublesome SP, and attempt to boot. Monitor progress from ktail and the boot log. If the SP boots, add disks to the enclosure slowly, one by one, monitoring output and status after each. This type of process may serve to isolate the problem areas.

78b/78c Drive physically removed/inserted

Description: The Configuration Manager detects that a disk drive has left or returned to the Fibre Channel loop or bus. Be aware that these messages can be found in abundance in the triage log files and do not necessarily mean that the drive has been removed or inserted. In fact, if a 78b and a 78c occur within seconds of each other, it can be assumed that the device has not been physically removed from and inserted into an enclosure. Instead, another event on the disk in the slot may have caused the software to report the disk as removed and/or inserted. For example, if the drive is powered down because of errors, it may appear to the software that the drive has been removed. Release 19 and beyond record the serial number of the drive in the logs, which can help determine whether the drive has actually been replaced.

Example:

B 02/14/06 14:50:29 Bus3 Enc2 DskB 78b Drive physically removed from slot 0 0 0
B 02/14/06 14:51:12 Bus3 Enc2 DskB 78c Drive physically inserted into slot 0 0 0

Recommendation: Further investigation for the reported removal/insertion must be done before any action is taken.
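As an aid to that investigation, the gap between a 78b and the following 78c for the same device can be computed from the log. The sketch below is an assumption-laden illustration: the function name is hypothetical, and the line layout (SP letter, MM/DD/YY date, time, device, event code) is inferred solely from the two example lines above. The guide says only "within seconds", so no cut-off is built in; the reader chooses one.

```python
import re
from datetime import datetime

# Layout inferred from the example lines; real logs may differ.
LINE_RE = re.compile(
    r"^(?P<sp>[AB]) (?P<stamp>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<device>Bus\d+ Enc\d+ Dsk\S+) (?P<code>78[bc]) "
)


def removal_insertion_gaps(lines):
    """Pair each 78b with the next 78c on the same device.

    Returns a list of (device, gap_in_seconds) tuples.
    """
    pending = {}  # device -> time of the most recent unmatched 78b
    gaps = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        when = datetime.strptime(match["stamp"], "%m/%d/%y %H:%M:%S")
        device = match["device"]
        if match["code"] == "78b":
            pending[device] = when
        elif device in pending:
            removed = pending.pop(device)
            gaps.append((device, (when - removed).total_seconds()))
    return gaps


sample = [
    "B 02/14/06 14:50:29 Bus3 Enc2 DskB 78b Drive physically removed from slot 0 0 0",
    "B 02/14/06 14:51:12 Bus3 Enc2 DskB 78c Drive physically inserted into slot 0 0 0",
]
gaps = removal_insertion_gaps(sample)
```

For the example pair the gap is 43 seconds. A short gap only suggests that the drive was not physically pulled; on Release 19 and later, the logged serial number remains the better evidence.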

798 The Drive Port Bypass Circuit Status changed.

Description: This message was added in Flare release 16 to indicate when a drive is leaving the loop or attempting access. An extended status of 1 indicates that the drive port bypass circuit (PBC) is set and the drive is leaving the loop. An extended status of 0 indicates that a drive is attempting to regain access to the loop. The PBC can be controlled by the drive itself or by the LCC. This message is output to the log when the CM performs flaky drive handling.

Recommendation: Examine the circumstances around why a drive or drives are reporting changes in their PBC. If it is a single drive, the drive might be flaky and may need to be replaced.

799 Peer Requested Drive Power Down

Description: Added in R16 to indicate the reporting SP received a message from the peer requesting drive shutdown.

Recommendation: Further investigation as to the reason for the shutdown must be done before any action is taken.

6a0 Disk soft media error remapped via disk ECC

Description: This message indicates that the disk has successfully remapped a bad sector using its internal ECC.

Today’s high density drives have thousands of sectors available for remapping bad sectors.

Recommendation: See Primus article emc64488. There is no need to consider the drive defective unless the “Recommend Disk Replacement” message appears in the triage log files.

69d/69e Recovery started/completed

Description: The message “69d A bad drive or LCC is causing hardware problems.” indicates that a bad disk or LCC is causing problems on the bus. The storage system will soon remove the bad drive or LCC from service, and the storage system will then generate an 0x9-level or 0xA-level “xxx removed” message. The message “69e A bad drive or LCC is causing hardware problems.” indicates that the storage system has finished removing the bad disk or LCC noted in message 0x69d from service. This is an informational message that follows the 0x9-level or 0xA-level “xxx removed” message.

Recommendation: Look for numerous 63e LCC port glitch entries in Navisphere log files (TRiiAGE_Splogs.txt) to isolate a faulty LCC(s) causing backend instability which might then need replacement. Similarly, look for drive issues (media errors, Soft SCSI Bus errors, a18 CRU Drive Causing Loop Failure messages etc.) to isolate a faulty drive(s) in the loop which might also need proactive replacement.

63e A port glitch was detected by the LCC

SP State

Advanced Lustat

Statistics Logging: Reports Statistics Logging is ENABLED or DISABLED. The SP maintains a log of statistics for the LUNs, disk modules, and system caching that you can turn on and off. When enabled, logging affects storage-system performance, so you may want to leave it disabled unless you have a reason to monitor performance. You can change the Statistics Logging from the General tab in the storage-system properties dialog box. Note: If Navisphere Analyzer is installed, and you enable statistics logging for the storage system, Analyzer logging is also enabled.

PEER SP: Reports the status of the peer SP as seen from this SP. This field can be used to get the status of the peer SP when it cannot be reached or is having problems. Status can be REMOVED, PRESENT or NONE (when the product contains only one SP, e.g. AX100SC).

WRITE CACHE: Write Cache displays the state of the storage system’s write cache. Write cache can be enabled or disabled from the Cache tab in the storage-system properties dialog. The size of the write cache can be set from the Memory tab in the storage-system properties dialog box.

Write cache states are:

Write Cache State    Details

INITING    Cache is initializing. This is the initial value of the cache state when powering up.

SYNCING    When the peer is powering up (determined by a peer INITING event), the running SP enters the syncing state, in which the two SPs (a.k.a. boards) attempt to sync the cache RAM images. The SP remains syncing until the peer either dies or transitions to another state (i.e. ENABLING, DISABLING or DISABLED).

ENABLING    Before cache is ENABLED, each SP enters the ENABLING state. Enabling is simply a handshake between the two caches indicating that each is ready to enable.

ENABLED**    The cache is ENABLED when all necessary caching components are enabled. The conditions that must hold for a viable cache are: the peer SP is up and communicating appropriately, the vault is enabled, the BBU (Battery Backup Unit, i.e. SPS) is charged, and the fans and VSCs (power supplies) are not faulty. If one or more of these components should fail, the cache is disabled. Please note: The CX550, CX750, and CX950 family of arrays can lose one fan and still maintain a viable cache.

QUIESCING    When a required cache component fails, the cache image is backed up on the vault. Before the writes to the disk can begin, all cache RAM modifications must be stopped (due to parity encoding). This stage in the cache shutdown process is referred to as “quiescing”. The QUIESCING state is when all CAQEs (Cache Queue Elements) are being stopped. Once all active CAQEs are stopped, the cache is said to be frozen.

FROZEN    All CAQEs are frozen. But before backing up the cache image, we must wait for all CMI traffic from the other SP to stop (for the same reason as above). When the peer responds that it is also frozen (or dead), the cache image backup can take place.

DUMPING    One of the SPs is dumping the cache to the vault.

DISABLING    The cache is disabling while there are component failures and the cache is dirty. When the cache is clean (no cache dirty LUNs), the cache is said to be disabled.

DISABLED**    The cache is disabled if there are component failures (and the cache is clean), or if the operator has purposefully shut down or not enabled the cache. All component failures must be rectified before the cache can be enabled.

RECOVERING    The cache is recovering if we are caching on a single SP (non-mirrored) and the CM tells us that cache recovery is needed. The RECOVERING state is similar to the INITING state, in that the Front End (host ports) is not yet turned on, and LUNs are not yet assigned. After a successful recovery, we will transition

**: State should be either ENABLED or DISABLED. The other states are interim conditions in which the cache can be found. If the TRiiAGE Analysis report indicates that the cache is in one of these momentary states, it is best to check the live status of the array before taking any action. The array should not be “stuck” in any of these states for an extended period of time; if it is, a detailed examination of the array is in order, and escalation to the Crisis Team may be appropriate.
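The distinction between steady and interim write cache states can be captured in a small lookup when scripting a review of a report. This is a sketch only: the function name and the returned labels are hypothetical, and the state names are taken from the table above.

```python
# States marked ** in the table: the cache should settle in one of these.
STEADY_STATES = {"ENABLED", "DISABLED"}

# All other states listed in the table are interim conditions.
INTERIM_STATES = {
    "INITING", "SYNCING", "ENABLING", "QUIESCING",
    "FROZEN", "DUMPING", "DISABLING", "RECOVERING",
}


def classify_write_cache_state(state):
    """Return 'steady', 'interim', or 'unknown' for a reported state name."""
    name = state.strip().upper()
    if name in STEADY_STATES:
        return "steady"
    if name in INTERIM_STATES:
        return "interim"
    return "unknown"
```

An "interim" result from a captured report is a prompt to check the live status of the array, not evidence of a fault in itself.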

READ CACHE: Read Cache displays status of SP Read Cache. Each SP has a read cache in its memory, which is either enabled or disabled. The read cache on one SP is independent of the read cache on the other SP. Storage system read caching for an SP must be enabled before read caching can be enabled for any given LUN. You can enable or disable an SP’s read cache from the Cache tab in the storage-system properties dialog. You can set the size of an SP’s read cache from the Memory tab in the storage-system properties dialog box.

Status can be DISABLING, ENABLED, DISABLED or UNKNOWN.

A: DP 50% TOTAL 122751 DIRTY 62251

TOTAL: Total write cache page count on the SP

DIRTY: Write cache dirty page count

DP: The dirty pages percentage (=DIRTY/TOTAL)

B: TOTAL 122752

Total write cache page count on SPB

U: DP 00% TOTAL 0000

TOTAL: Write cache unassigned page count

DP: (Unassigned/Total) % (from the code, this looks like it will always be 0)

Requests Complete: 209382809

Number of completed host requests.
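The DP figure on the SP A line can be reproduced from the TOTAL and DIRTY counts. The sketch below is illustrative: the function name is hypothetical, and the use of integer (truncating) division is an assumption, chosen because it matches the sample line, where 62251 / 122751 is about 50.7% yet the report shows 50%.

```python
import re


def dirty_percent(line):
    """Compute DP (= DIRTY / TOTAL, as a whole percent) from a line such as
    'A: DP 50% TOTAL 122751 DIRTY 62251'. Returns None if the fields are absent.
    """
    match = re.search(r"TOTAL\s+(\d+)\s+DIRTY\s+(\d+)", line)
    if not match:
        return None
    total, dirty = int(match.group(1)), int(match.group(2))
    if total == 0:
        return 0
    # Assumption: the report truncates rather than rounds.
    return dirty * 100 // total


dp = dirty_percent("A: DP 50% TOTAL 122751 DIRTY 62251")
```

Here `dp` evaluates to 50, matching the DP value printed on the sample line.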

SPS A: OK SPS B: OK

These fields report information on SPS (Standby Power Supply) and the SPS configuration.

SPS Status                 Reported as

SPS = SPS_BAT_OK           OK
                           OK Unknown Config
                           OK Invalid Power Cable1
                           OK Invalid Power Cable2
                           OK Invalid Serial Cable
                           OK Invalid Multiple Cables

SPS = SPS_TESTING          TE

SPS = UNIT_NOT_PRESENT     --

SPS = UNIT_FAILED          FLT

Advanced Lustat version 1.53

Following are the details on columns reported in the output.

LUNs are listed along with the NaviLUN (ALU), Flare LUN (FLU) and MetaLUN (listed if LUN is a component LUN) numbers. Mapping of FLU<->ALU is provided via WWN matching if .DRT file is available, otherwise TRiiAGE is forced to perform the match using the GETRG output (which can be unreliable).

MLU : MetaLUN number

ALU : Navi LUN number

FLU : Flare LUN number

RGP : Raid group number

ENC : Reports Enclosure Type on which LU is bound. This can be FC, ATA, ST2 (Stiletto 2G) or ST4 (Stiletto 4G).

TYPE : Raid Type; This can be: Ind-Disk, RAID-0, RAID-1, RAID-10, RAID-3, RAID-5, HotSpare

P : indicates Private LU. Reports “Y” if LU is private. For example: hot spare, MetaLUN Components, Snap cache LUNs and Clone Private LUNs are all reported as Private LUNs.

LD : Reports whether there are any layered drivers in the LU stack. Only the first device in the stack is listed. For example: If the stack for “LOGICAL UNIT NUMBER 46” contains K10RollBackAdmin, K10FarAdmin and K10SnapCopyAdmin in that order, only K10RollBackAdmin is reported here, as RB.

LD Abbreviations Layered Driver

SC SnapCopy Admin

RM Remote Mirror Admin

AG Aggregate (MetaLUN) Driver Admin

CL Clone Admin

AM Asynchronous Mirror

WIL Write Intent Log

SCL Snap Cache LUN

RB Roll Back Admin

CAPACITY : Reports LUN Capacity.

CAC : Reports LU Read and Write Cache state

DEFOWN : Reports Default Owner for LU

STATE : Reports LUN Status.

NAVIFRUS : Reports drives on which LU is bound in B.E.D format

Note: FRU order in case of RAID10 is: P1 P2 P3 S1 S2 S3

Some Important Notes:

1. WIL = Write Intent Log and CPL = Clone Private LUN will not assign after failure if the array FE cables are disconnected. Reference the Layered Product section for which LUNs make up the WIL and CPL.

2. ATA guideline emc95538 recommends disks in the same RG are assigned to one SP.

3. ALUSTAT is LUSTAT with ALU column appended. The ALU is determined using the WWN for R13 and greater. The mapping WILL always be correct. WARNING: The ALU is determined using Navi commands for R12 and earlier. It should usually be okay, but do not rely on it for critical analysis.

4. The default owner for a component of a MetaLUN is not important. For example, a MetaLUN with three components may have a different default owner assigned to each component. The aggregate driver ensures that all components are assigned to the same SP, i.e. the current owner for all components of a MetaLUN will be the same.

Ktcons Vpstat

Reports verify information for LUNs.

> !vpstat

Summary of Verifies:

     RAID             Sniffing             Verify   Percent   Sniff  BV    Total
LUN  Group  Type      State     Capacity   Type     Complete  Rate   Time  Passes
---  -----  ----      -----     --------   ----     --------  ----   ----  ------
0    7      RAID-5    Enabled   476.0 GB   ---                10     0     24
1    5      RAID-5    Enabled   100.0 GB   ---                10     0     44
2    205    HotSpare  Enabled   297.0 GB   ---                254    0     34
3    7      RAID-5    Enabled   150.0 GB   ---                10     0     4
4    18     RAID-5    Enabled   1782.0 GB  sniff    48        10     0     24
5    19     RAID-5    Enabled   1782.0 GB  peer:sn  0         10     0     9
6    7      RAID-5    Enabled   100.0 GB   ---                10     0     4

Sniffing State: Reports whether Sniff is Enabled or Disabled for the LU.

Ideally, all LUNs should report Sniffing State as “Enabled”.

This can be changed with the setsniffer navicli command.

Verify Type: Reports type of verify running on LU.

Values can be:

sniff – We are doing a sniff verify.

BV – We are doing a background verify.

peer:sn – Peer doing a sniff verify.

peer:bv – Peer doing a background verify.

Percent Complete: Reports verify percentage complete for LU.

Sniff Rate: Specifies the rate at which sniffs are executed. It is specified in 100-ms units. (100 ms /sniff verify IO)
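Because the Sniff Rate is expressed in 100-ms units, the interval per sniff verify IO follows by simple multiplication. The snippet below is a minimal sketch with a hypothetical function name, applied to the two rate values that appear in the sample vpstat output.

```python
SNIFF_UNIT_MS = 100  # the Sniff Rate column is specified in 100-ms units


def sniff_interval_ms(sniff_rate):
    """Convert a vpstat Sniff Rate value to milliseconds per sniff verify IO."""
    return sniff_rate * SNIFF_UNIT_MS


interval_default = sniff_interval_ms(10)     # RAID-5 LUNs in the sample output
interval_hot_spare = sniff_interval_ms(254)  # the HotSpare LUN in the sample output
```

A rate of 10 therefore corresponds to 1000 ms (one second) per sniff verify IO, and the hot spare's rate of 254 to 25400 ms (25.4 seconds).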

In document EMC / CLARiiON Troubleshooting Guide 2nd Edition (Pages 154-167)
