Use the section appropriate to the type of storage environment involved.
5. Run the ‘clean’ command using BRT with the (perhaps edited) report file as input
It is important to remember that the BRT tool does not repair or restore lost data. An uncorrectable data sector is lost data.
Prior to the introduction of this tool, the events would be reported continually in the SP event log until corrective action was taken. Normally this action was to unbind and rebind the affected LUN. If the affected sector was in unused data space, having to rebind and restore data had a major impact on customer operations.
The normal course of events prior to R19 is:
1. Run background verify on all LUNs associated with the disk/s reporting the uncorrectable or invalidated events.
2. Determine whether the data loss is in actual ‘used’ data space or in ‘unused’ data space. This can be done by having the customer run a full backup on the affected LUN/s. If no errors are returned, then skip to #4.
3. If errors are returned, then a rebind/restore operation will have to occur unless the customer can identify the affected data.
a. If the customer can identify the affected file/s, rewriting the file/s from a backup will correct the condition.
4. If no errors, then you can do one of the following:
a. Ignore the events and as the ‘unused’ data space gets written to, new WRITEs will overwrite the bad sector and clear the error condition. Note that every time the verify process attempts to read or a rebuild
CELERRA storage environment
ID: emc117301 NOTE: Always refer to the most current Primus solution; this is provided as reference only.
This solution provides information on how to recover Celerra file systems with unrecoverable data on an attached backend storage array.
For full details of these operations, contact the appropriate level of Technical Support, as the solution contains procedures to be used only by Celerra Technical Support. The error symptoms that occur are as follows:
Specific DART panic: >>PANIC in file: ../BVolumeIrp.cxx at line: 298 : IO failure despite all retries/failovers
The /nas/log/sys_log -->
Clariion passes backend events to Celerra
<date/time> NaviEventMonitor:3:3 Backend Event Number 0x953 Host
OEM-XOO25IL9VL9 Storage Array APM00041700339 SPA Device Bus 1 Enclosure 1 Disk 8 SoftwareRev 6.19.0 (4.14) Unknown Error 2.19.0.701.5.027 Description
Uncorrectable Parity Sector
<date/time> NaviEventMonitor:4:2 Backend Event Number 0x840 Host
OEM-XOO25IL9VL9 Storage Array APM00041700339 SPA Device Bus 1 Enclosure 1 Disk 11 SoftwareRev 6.19.0 (4.14) Unknown Error 2.19.0.701.5.027 Description Data Sector Invalidated
CLARiiON SPCollect events
<date/time> Bus1 Enc1 Dsk8 956 Parity Invalidated [vr_rd RAID] 0 21773bc0 12001000
<date/time> Bus1 Enc1 DskB 957 Uncorrectable Sector [vr_rd RAID] 0 21773bc0 12001000
<date/time> Bus1 Enc1 DskB 840 Data Sector Invalidated [vr_rd RAID] 0 21773bc0 12001000
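The SPCollect lines above share a fixed layout, so the disk location and event code can be pulled out mechanically. The sketch below assumes that layout (bus, enclosure, disk, hexadecimal event code, description, bracketed source) holds for every entry. The names `DiskEvent` and `parse_spcollect_event` are illustrative and not part of any EMC tool. Reading the disk field as hexadecimal is an inference from `DskB` lining up with Disk 11 in the sys_log entries.

```python
import re
from typing import NamedTuple, Optional

# Layout assumed from the sample lines:
#   <timestamp> BusN EncN DskX CODE description [source] ...
_EVENT_RE = re.compile(
    r"Bus(?P<bus>\d+)\s+Enc(?P<enc>\d+)\s+Dsk(?P<disk>[0-9A-Fa-f]+)\s+"
    r"(?P<code>[0-9A-Fa-f]+)\s+(?P<desc>.+?)\s+\["
)


class DiskEvent(NamedTuple):
    bus: int
    enclosure: int
    disk: int          # decoded from hex (assumption: DskB means disk 11)
    code: int          # event code, decoded from hex
    description: str


def parse_spcollect_event(line: str) -> Optional[DiskEvent]:
    """Return the parsed event, or None if the line does not match the layout."""
    m = _EVENT_RE.search(line)
    if m is None:
        return None
    return DiskEvent(
        bus=int(m["bus"]),
        enclosure=int(m["enc"]),
        disk=int(m["disk"], 16),
        code=int(m["code"], 16),
        description=m["desc"],
    )
```

Filtering the parsed events by `description` or `code` then gives the list of disks that reported uncorrectable or invalidated sectors.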
Navicli getsniffer results
$ ./navicli -h 192.168.1.200 getsniffer 28
                 Corrected  Uncorrectable
Checksum errors  0          103
Server log events: CamStatus 84 ScsiStatus 02 Sense 0400 00
<date/time>:CAM:3:I/O Error: c80t1l7 Irp 0x90e11084 CamStatus 0x84 ScsiStatus 0x02 Sense 0x04/0x00/0x00
<date/time>:CAM:3:camFlags 0x50 Addr 0x8d635304 Len 0x1c000
<date/time>:CAM:3:cdb: 28 00 00 c9 02 80 00 00 e0 00 00 00
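The `cdb:` line in the server log can be decoded to find the block range of the failed I/O. Opcode 0x28 is the standard SCSI READ(10) command, in which bytes 2-5 carry the logical block address and bytes 7-8 carry the transfer length in blocks. The sketch below assumes a 512-byte block size, which the log does not state, and uses the hypothetical helper name `decode_read10`. Under that assumption the decoded length agrees with the `Len 0x1c000` value logged for the same request.

```python
def decode_read10(cdb_hex: str, block_size: int = 512) -> dict:
    """Decode a SCSI READ(10) CDB given as space-separated hex bytes.

    block_size is an assumption; the log itself does not state it.
    """
    cdb = bytes.fromhex(cdb_hex)
    if cdb[0] != 0x28:
        raise ValueError(f"not a READ(10) CDB: opcode 0x{cdb[0]:02x}")
    lba = int.from_bytes(cdb[2:6], "big")      # bytes 2-5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")   # bytes 7-8: transfer length in blocks
    return {"lba": lba, "blocks": blocks, "length_bytes": blocks * block_size}


info = decode_read10("28 00 00 c9 02 80 00 00 e0 00 00 00")
print(hex(info["lba"]), info["blocks"], hex(info["length_bytes"]))
# prints: 0xc90280 224 0x1c000
```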
A fatal event on the storage system attached to a Celerra has caused data loss. Such events might be double drive faults in one RAID group or a power outage without battery backup; in general, any event that could cause previously written and committed data to become invalid. On Symmetrix and CLARiiON, invalid data is marked as "bad" and host access is denied. The problem occurs when a fatal error on the backend has caused data loss. In this case the backend "knows" that the data on the affected sectors / tracks had been changed but the data itself has been lost. Both CLARiiON and Symmetrix are designed to prevent the client from reading this old / bad data and return a read error instead.
The Celerra is designed to trust the backend for data integrity. Since the data integrity has been lost, the Data Mover will panic once a read (or write) error occurs. Since a file system check verifies only the logical structure of the file system and not the data within, it will not find the corrupted data. Any affected file that has lost its data integrity needs to be restored from tape. The only method from a host to get the data valid again is to overwrite the track / sector with "some" data. Since the NAS code (as with any other operating system) does not know what this data was, it needs to write "zero" data to this block. This is what the special NAS code provides.
Once data has been overwritten, the client is able to access the file again. But from an application point of view, data is probably bad.
ATTENTION! Before carrying out ANY procedure on the Celerra, CLARiiON or Symmetrix, support MUST verify the backend and correct as much data as possible. Any activity that could cause further backend outages must be carried out BEFORE the procedure is executed. The backend needs to be in a healthy state apart from the invalid tracks. Faulted drives or other bad backend hardware must be replaced first. On RAID groups, the parity rebuild should finish. On a CLARiiON array, background verify must run. The number of uncorrectable tracks must be known before continuing.
REQUIREMENT!
The procedure requires a special NAS code and tool provided by engineering. The patch and tool will be delivered by engineering on request. Any request MUST be escalated to engineering before this patch can be installed and used.
RESTRICTION!
Due to the nature of the problem, there is no guarantee that all file systems and data will be recovered. The recovery procedure carries many risks. It includes many manual steps that need to be executed for every affected file system / every affected block, which is time consuming. There is a maximum limit of 50 bad sectors / tracks on ATA and 100 bad sectors / tracks on FC drives per system. If there are more bad tracks / sectors, the customer is encouraged to delete the file systems and restore from tape, since the recovery will require more time than the restore would. Consult with TS2 management if there are objections. The procedure
CDL storage environment
ID: emc106007 NOTE: Always refer to the most current Primus solution; this is provided as reference only.
This solution provides information on how to address uncorrectable errors on a CLARiiON Disk Library (CDL). Not every step is described in detail. Please engage appropriate Technical Support resources for more detail on usage. The basic steps that will be taken are:
1. Run Background Verify on all LUNs in the affected RAID group. Refer to solution emc32911.
2. If uncorrectable errors are detected, go to Step 3. If none are detected, you are done.
Note: To perform the steps below, you must, at some point before unbinding the LUN from Navisphere (Step 8), stop all I/O to the CDL completely, which will require downtime.
3. Warning! If this is not done, there will be data loss.
If you have LUN 899 and 900, you are running the old LUN scheme and must look at a getall output in the SPcollect file.
HLU/ALU Pairs: (HLU = Host LUN, ALU = Array LUN)
HLU   ALU
----  ---
0     4
1     5
2     6
If ALU 5 had the errors, then in Step 4 you would use HLU 1 (LUN 1). If no LUN 899 and 900, go to Step 4.
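Translating an array LUN back to its host LUN is a reverse lookup in the HLU/ALU pair table. The following is a minimal sketch using only the three pairs listed above. The names `HLU_TO_ALU` and `hlu_for_alu` are illustrative, and a real table would be taken from the getall output.

```python
# HLU -> ALU pairs copied from the example table above.
HLU_TO_ALU = {0: 4, 1: 5, 2: 6}


def hlu_for_alu(alu: int, pairs: dict = HLU_TO_ALU) -> int:
    """Return the host LUN (HLU) that maps to the given array LUN (ALU)."""
    for hlu, mapped_alu in pairs.items():
        if mapped_alu == alu:
            return hlu
    raise KeyError(f"ALU {alu} is not in the HLU/ALU pair table")


print(hlu_for_alu(5))  # prints 1, the LUN to look for in Step 4
```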
4. At the CDL console, check for all tapes that reside on the affected LUN. Follow this path:
(Physical Resources) / Storage Devices / Fibre Channel Devices / DGC-RAID / General tab ((Check SCSI Address 0:0:0 4 is LUN 4)) / Layout tab ((VirtualTape-02806 is VID 02806))
5. Map all VID numbers to bar codes or tapes. Follow this path:
(Logical Resources) / (VirtualTape Library System) / pick each library (STK-L180-02789) / (Tapes).
This will show you bar code to Virtual TapeID numbers. Note all virtual tapes that are found.
6. The customer or administrator must perform this step:
A. Back up all data off the virtual tape(s) found in Step 4.
B. On the CDL if a physical tape is connected to the backend, the virtual tape can be moved to vault.
C. If there is a license for remote copy, that can be used.
D. The only other way to achieve this is through the backup application software.
Note: The customer must create new tapes to achieve Step 5. Then they must make sure that the new tapes are not created on the affected LUN(s). When you create a virtual tape, you can "uncheck" the affected LUN.
7. The customer or administrator will be asked to perform the following steps:
o Delete all tapes to be unbound from backup software on the backup server.
8. Delete all tapes to be unbound from the CDL.
9. Discharge the LUN in the CDL console. In the console tree: Select Physical Resources/Storage Devices/Fibre Channel Devices/CLARiiON S/N/ DGC:RAID 3(LUN to be unbound), right click and select Discharge.
10. Unbind the faulted LUN(s) from Navisphere.
11. See solution emc125981.
See solution emc48444 or emc62865 for a description of errors described in Symptoms.
General Array and Host Attach Related Information
Binding
Binding involves taking a group of one or more disk modules and grouping them into a Logical Unit (LUN). Only after a disk module has been bound into a LUN is its storage space available for host access. A LUN is always created as part of a RAID group. You can create the RAID group explicitly or have it created when you bind the LUN. The RAID group can be of any of the following RAID types:
RAID-5 (individual access array);
RAID-3 (parallel access array);
RAID-1 (mirrored pair);
RAID-0 (nonredundant individual access array);
RAID-1/0 group (mirrored RAID-0 group);
Individual disk
Hot Spare disk
A LUN created by the binding process is given a unique identifying integer. There is also a RAID group ID, assigned when the RAID group is created. A LUN can be UNBOUND, upon which all knowledge of the LUN is removed from the SP's databases and all host data on the LUN is destroyed. After all LUNs in a RAID Group have been unbound, the RAID Group itself can be removed. After a LUN is unbound, the disk modules that made up that LUN are free to be bound into new LUNs (added to existing RAID groups or bound into new RAID groups of any type).
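The lifecycle rules in this paragraph can be modelled in a few lines: a LUN is bound inside a RAID group, unbinding removes it, and a RAID group can be removed only once it holds no LUNs. This is a toy model for illustration only. The class and method names are invented here and do not correspond to Navisphere or navicli commands.

```python
class RaidGroup:
    def __init__(self, group_id: int, raid_type: str):
        self.group_id = group_id
        self.raid_type = raid_type
        self.luns = set()          # LUN numbers currently bound in this group


class Array:
    def __init__(self):
        self.raid_groups = {}      # RAID group ID -> RaidGroup

    def create_raid_group(self, group_id: int, raid_type: str) -> None:
        self.raid_groups[group_id] = RaidGroup(group_id, raid_type)

    def bind(self, group_id: int, lun: int) -> None:
        # Each LUN gets a unique identifying integer across the array.
        if any(lun in rg.luns for rg in self.raid_groups.values()):
            raise ValueError(f"LUN {lun} is already bound")
        self.raid_groups[group_id].luns.add(lun)

    def unbind(self, lun: int) -> None:
        # Unbinding removes all knowledge of the LUN (and destroys its data).
        for rg in self.raid_groups.values():
            if lun in rg.luns:
                rg.luns.remove(lun)
                return
        raise KeyError(f"LUN {lun} is not bound")

    def remove_raid_group(self, group_id: int) -> None:
        # Allowed only after every LUN in the group has been unbound.
        if self.raid_groups[group_id].luns:
            raise RuntimeError("unbind all LUNs in the RAID group first")
        del self.raid_groups[group_id]
```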
* For ‘Default Owner’ detail, see Initial Assignment below.
* For ‘Enable Auto Assign’ detail, see Auto-Assignment below.
The term used to describe how LUNs are assigned to and ‘owned’ by each SP is “LUN OWNERSHIP ACCESS”. It is sometimes referred to as an ‘active / passive’ ownership model. This differs from a DMX environment, which is known as ‘active / active’ or Dual-Simultaneous Access; that design gives multiple interfaces equal access to a logical device.
The LUN Ownership Access (CLARiiON) model allows access to LUNs through only one path at a time. This access model requires a trespass command to the LUN to move ownership from one SP to the other SP. If there are multiple interfaces to a logical device, one of them is designated as the primary route to the LUN device. Host I/O is not directed to paths connected to a non-assigned interface, meaning paths to the non-owning SP.
Normal access to a device through any interface on an SP other than the assigned one is not possible. In the event of a failure (of a storage processor or of all paths to an SP), logical devices or LUNs must be moved to the other SP. If an interface card fails, logical devices are reassigned from the broken interface to another interface. External failover software (for example, EMC PowerPath or Veritas/DMP) instructs the storage system to initiate this reassignment (known as trespassing). After devices are trespassed, data is sent via the new route to an SP.
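The active / passive behaviour described above can be illustrated with a small model: I/O is accepted only through the owning SP, and a trespass moves ownership to the peer SP. The `Lun` class and its methods are hypothetical names for this sketch. In practice the trespass is requested by failover software, not by code like this.

```python
class Lun:
    SPS = ("A", "B")

    def __init__(self, number: int, owner: str):
        if owner not in self.SPS:
            raise ValueError(f"unknown SP: {owner}")
        self.number = number
        self.owner = owner             # SP that currently owns the LUN

    def io_allowed(self, via_sp: str) -> bool:
        """Host I/O is serviced only on paths to the owning SP."""
        return via_sp == self.owner

    def trespass(self) -> str:
        """Move ownership to the other SP and return the new owner."""
        self.owner = "B" if self.owner == "A" else "A"
        return self.owner


lun = Lun(7, owner="A")
print(lun.io_allowed("B"))   # False: SP B does not own the LUN
lun.trespass()
print(lun.io_allowed("B"))   # True: ownership moved to SP B
```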
In order to understand how ownership is handled, the following information is provided.
Assignment
Assignment is the process by which a given SP is given EXCLUSIVE ownership of a given LUN. The responsibility of ownership is primarily the maintenance of data/parity integrity in a dual-ported environment; it is essential that only ONE SP access any LUN at one time. Assignment enforces this by denying access to the LUN through the SP that does NOT have the LUN assigned. The SP does not require that an explicit assign command be issued to it. LUNs are assigned by the SP as part of the SP's power-up process. This Assignment activity is called INITIAL Assign. There are two methods by which a host may alter the default Initial Assignment of any LUN: AUTO-ASSIGNMENT and TRESPASS. These methods are discussed next in more detail.
Initial Assignment
At the time a LUN is bound, one of the two SPs is identified as the default owner. At power-up time, the SP will assign all LUNs that it owns by default. This process is called Initial Assignment. The LUN will remain Assigned to this SP until changed by either a Trespass (discussed later) command from the host, or it is prompted by the fault/removal of this SP in a dual-SP cabinet (involves Auto-Assignment, discussed later), or the default SP owner for the LUN is changed (via a serial port or SCSI command) AND the cabinet is power-cycled. Thus, Initial Assignment is the process that the SP takes at power-up to assign all LUNs for which it is the default owner. The concept of Initial Assignment along with default ownership of a LUN exists only to provide the means to decide at POWER-UP time which SP owns which LUNs.
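The distinction between the default owner and the current owner can be made concrete with a small state model. In this sketch a trespass changes only the current owner, and a changed default owner takes effect only at the next power-up, as the paragraph describes. The class `OwnedLun` and its methods are invented for illustration, and Auto-Assignment is deliberately left out.

```python
class OwnedLun:
    def __init__(self, default_owner: str):
        self.default_owner = default_owner   # chosen when the LUN is bound
        self.current_owner = None            # nothing assigned until power-up

    def power_up(self) -> None:
        # Initial Assignment: each SP assigns the LUNs it owns by default.
        self.current_owner = self.default_owner

    def trespass(self) -> None:
        # Host-initiated move to the peer SP; the default owner is unchanged.
        if self.current_owner is None:
            raise RuntimeError("LUN is not assigned; power up first")
        self.current_owner = "B" if self.current_owner == "A" else "A"

    def set_default_owner(self, sp: str) -> None:
        # Takes effect only at the next power-up.
        self.default_owner = sp


lun = OwnedLun(default_owner="A")
lun.power_up()
lun.set_default_owner("B")
print(lun.current_owner)   # A: still assigned to the old owner
lun.power_up()             # power cycle
print(lun.current_owner)   # B
```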
The default owner of a LUN is the SP that assumes ownership of the LUN when the storage system is powered up. If the storage system has two SPs, you can choose to bind some LUNs using one SP as the default owner and the rest using the other SP as the default owner, or you can select Auto, which tries to divide the LUNs equally between SPs. The primary route to a LUN is the route through the SP that is its default owner, and the secondary route is through the other SP. If you do not specifically select one of the Default Owner values, default LUN owners are assigned according to RAID Group IDs as follows:
RAID Group IDs    Default LUN owner
Odd numbered      SP A
Even numbered     SP B
The default owner property is unavailable for a Hot Spare LUN.
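The default-owner rule in the table reduces to a parity check on the RAID Group ID. The following is a minimal sketch, with `default_owner` as an illustrative name. Returning `None` for a Hot Spare is simply this sketch's way of representing that the property is unavailable.

```python
from typing import Optional


def default_owner(raid_group_id: int, hot_spare: bool = False) -> Optional[str]:
    """Default LUN owner when no Default Owner value is selected explicitly."""
    if hot_spare:
        return None                      # property unavailable for a Hot Spare LUN
    return "SP A" if raid_group_id % 2 == 1 else "SP B"


for rg_id in (3, 4):
    print(rg_id, default_owner(rg_id))
# prints:
# 3 SP A
# 4 SP B
```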