Section 2 NDU – Basic Operations and Troubleshooting

16. Sync Operations (this step only occurs if a failure occurs)

Sync operations happen after an SP is re-imaged or a previous NDU operation fails. NDU reads the ndu-toc data area in PSM for the list of packages that should currently be installed and makes whatever changes are required to make this SP’s software match that list.
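The reconciliation described above can be pictured as a comparison between the package list recorded in ndu-toc and what is actually installed on the SP. The following is a minimal sketch only, not NDU's implementation: the function name plan_sync, the dictionary layout (package name mapped to revision), and the revision strings are all assumptions made for illustration.

```python
def plan_sync(toc_packages, installed_packages):
    """Work out what must change so an SP matches the ndu-toc list.

    Both arguments are hypothetical dicts of package name -> revision.
    Returns (packages to install or re-install, package names to remove).
    """
    to_install = {}
    for name, revision in toc_packages.items():
        # Package is missing, or present at a revision ndu-toc does not list.
        if installed_packages.get(name) != revision:
            to_install[name] = revision
    # Anything installed that ndu-toc does not mention is left over.
    to_remove = sorted(n for n in installed_packages if n not in toc_packages)
    return to_install, to_remove
```

An SP whose software already matches the list yields two empty results, which corresponds to a sync that has nothing to do.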

Before release 16, a sync initiated during the SP reboot would attempt to disable the cache settings, much like any other

17. Commit

A commit operation indicates to the software that the current upgrade has completed and it is safe to start using new persistent data formats and messages. When the user commits a bundle, NDU will go through each of the component packages in that bundle and issue a commit to each of those first. If they all commit successfully, then the entire bundle will be marked as committed.

Within a package, there can be a number of admin libraries that request to receive a commit. For example, when the Base package is committed, the Flare, PSM, Hostside, and System admin libraries all receive a commit opcode that is forwarded to the appropriate drivers.
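The commit ordering described above (admin libraries first, then packages, then the bundle) can be sketched as follows. This is an illustration under stated assumptions rather than NDU's code: commit_bundle, the bundle dictionary layout, and the send_commit callback are hypothetical, and a return value of 0 is assumed to mean success, as the "Admin Library returned 0" ktrace lines in the Commit Failed case suggest.

```python
def commit_bundle(bundle, send_commit):
    """Commit each package's admin libraries; mark the bundle only if all succeed.

    bundle: hypothetical dict {"packages": {package_name: [admin_library, ...]}}
    send_commit: hypothetical callable(admin_library) -> status (0 = success)
    Returns (ok, failing_package, failing_library, status).
    """
    for package, libraries in bundle["packages"].items():
        for library in libraries:
            status = send_commit(library)
            if status != 0:
                # One failing admin library fails its package and the bundle.
                return False, package, library, status
    # Every component package committed, so the bundle is committed.
    bundle["committed"] = True
    return True, None, None, 0
```

With a callback that returns 0x71508010 for K10HostAdmin, the sketch stops at that library and leaves the bundle uncommitted, which mirrors the second Commit Failed example.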

A failure to commit could be due to out-of-sync software. For example, the peer SP may have been up and running while a single-SP upgrade was done, so it has not yet performed a sync, which would ensure that both SPs are running the code listed in the ndu-toc file. It could also be that a registry flush to disk failed, so that the registry setting changes made during the previous upgrade were lost. See Commit Failed for examples.

A failure to commit could also indicate that an admin library of one of the component packages returned an error. User ktrace provides the most information in this case. See Commit Failed for examples of this case as well.

- Sample Cases -

Dependency Check Failed

These are cases where either some packages were being upgraded to a new release without upgrading other dependent packages or an enabler was being installed onto a platform where it does not belong.

In the first case, you may see messages in CLI like:

Uninstallable Reason: Dependency not met

Required packages: SnapViewOption 141, SnapView 150, AccessLogixOption 131, ManagementServer 150, Navisphere 150, Base 150

The fix is to install all packages for the new release together or just use a bundle that already has all of the needed packages together.
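When triaging the first case, it can help to compare the required-package list from the CLI message against the set of packages that was actually submitted. The sketch below parses the list shown above. Both function names are hypothetical, and treating each number as a minimum generation is an assumption, not something the message itself states.

```python
def parse_required(text):
    """Turn 'Name Gen, Name Gen, ...' into a {name: generation} dict."""
    required = {}
    for item in text.split(","):
        name, generation = item.split()
        required[name] = int(generation)
    return required


def unmet(required, planned):
    """List required packages that are absent from the planned install set,
    or present at a lower generation (assumed rule, see lead-in)."""
    return sorted(
        name
        for name, generation in required.items()
        if planned.get(name, -1) < generation
    )
```

Passing the planned packages as the same kind of dictionary returns the names that still need to be added to the install, or an empty list when the set is complete.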

In the second case, you may see a similar message before release 19. From the GUI, it may appear as:

-SANCopy - (Generation 141) - Cannot Install: Dependency not met:

The following packages must be installed to install the requested package:

OpenSANCopy generation >=140, SANCopy_PERMIT generation >=1, SANCopy_PERMIT >=10 MUST NOT BE INSTALLED

With release 19 or higher, the message may be more like “Permit attribute dependency not met. This package is not allowed on this platform type.” This message indicates that this particular enabler is not allowed on this platform type.

In this case, SANCopy, MirrorView, and MirrorView/A are all not allowed on CX500i arrays.

PSM Access Failed

Cache Disable Failed

In these cases, an upgrade was attempted when there are cache dirty LUNs. NDU must disable and zero the cache settings before it can proceed, but it will be prevented from doing so if there are cache dirty LUNs. In a recent case, the event logs showed:

5/13/04 12:00:58 PM NDU Information (1234) 0 N/A CPQA2125 Informational message.

File: K10NDUAdminManage.cpp Line: 1398 Details: Disabling cache

5/13/04 12:00:59 PM NDU Error (1234) 32820 N/A CPQA2125 Failed to disable cache settings.

File: K10NDUAdminManage.cpp Line: 1406 Details: Failed to disable cache Status: 0x6000026d

This means that NDU attempted to disable the cache but got the 0x6000026d error code back. This corresponds to a sunburst error code, which is defined in a slightly different form: #define HOST_LUNS_CACHE_DIRTY 0x206D

The corresponding sunburst error code can be derived by removing the leading 0x60000 and inserting a 0 between the first digit and the remaining digits (26d becomes 206D). A ‘clear dirty cache’ procedure may be required to resolve this issue.
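The digit manipulation described above is easy to get wrong by hand, so a small helper may be useful. This is a sketch of the rule exactly as stated in the text, checked only against the single example given here (0x6000026d and 0x206D); the function name is hypothetical, and whether the rule holds for every status of this form is an assumption.

```python
def status_to_sunburst(status):
    """Apply the rule from the text: drop the leading 0x60000, then insert
    a 0 between the first remaining digit and the rest."""
    digits = format(status, "08x")        # 0x6000026d -> "6000026d"
    if not digits.startswith("60000"):
        raise ValueError("status does not start with 0x60000")
    tail = digits[5:]                     # "26d"
    return int(tail[0] + "0" + tail[1:], 16)   # "2" + "0" + "6d" -> 0x206d
```

Calling status_to_sunburst(0x6000026d) gives 0x206D, the value defined as HOST_LUNS_CACHE_DIRTY.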

Check Script Failed

Check script failures show up after the NDU operation has gone asynchronous. The “NaviCLI ndu -status” command may show:

Is Completed: YES

Status: Operation Failed: A check script contained in the associated package failed. Check the SP event logs on the primary SP and the package release notes (0x71518013)

Operation: Install

This indicates that a check script failed. The event logs, as well as the c:\temp\ndu-check.out file on the primary SP, will contain more details. In this case, the ndu-check.out file showed:

Found LU 620 with WWID 60:6:1:60:87:c7:c:0.f3:55:e5:19:36:a1:d9:11 Exporting driver is K10AggDrvAdmin

Consuming driver is K10CloneAdmin Public is: 0

Error: Found private LU 620 for K10CloneAdmin built on a MetaLUN. WWID is 60:6:1:60:87:c7:c:0.f3:55:e5:19:36:a1:d9:11.

Found LU 621 with WWID 60:6:1:60:87:c7:c:0.f4:55:e5:19:36:a1:d9:11 Exporting driver is K10AggDrvAdmin

Consuming driver is K10CloneAdmin Public is: 0

Error: Found private LU 621 for K10CloneAdmin built on a MetaLUN. WWID is 60:6:1:60:87:c7:c:0.f4:55:e5:19:36:a1:d9:11.

...

Cannot upgrade to this release while K10CloneAdmin is consuming MetaLUNs for private use.

Detected compatibility problems

The event logs showed:

04/27/2005 15:24:02 (71518013)A check script contained in the associated package failed. Consult the SP Event Log on the NDU primary SP and the package release notes. Error: Found private LU 620 for K10CloneAdmin built on a MetaLUN. WWID is 60:6:1:60:87:c7:c:0.f3:55:e5:19:36:a1:d9: NDU

04/27/2005 15:24:02 (71518013)A check script contained in the associated package failed. Consult the SP Event Log on the NDU primary SP and the package release notes. Error: Found private LU 621 for K10CloneAdmin built on a MetaLUN. WWID is 60:6:1:60:87:c7:c:0.f4:55:e5:19:36:a1:d9: NDU

04/27/2005 15:24:02 (71518013)A check script contained in the associated package failed. Consult the SP Event Log on the NDU primary SP and the package release notes. Cannot upgrade to this release while K10CloneAdmin is consuming MetaLUNs for private use.

R16 and higher enforce a restriction that was not present in R14: private LUNs for layered drivers, such as Clones WILs and SnapCache LUNs, cannot be built on top of metaLUNs. These must be reallocated on normal LUNs before the array can be upgraded to a release after release 16.

Setup Script Failed

These incidents were upgrades from R13 to R14. In each case, the upgrade failed in a slightly different way. For one of the incidents, it failed in the setup script for MirrorView, although the event log uses the words “check script”:

B 02/01/06 14:32:41 NDU 71518013 A check script contained in the associated package failed. Consult the SP Event Log on the NDU primary SP and the package release notes. Error: Failed to auto-install the MirrorViewOption package that is required to retain MirrorView capability when upgrading.

In another incident, it failed in a check script and produced this message:

A 03/04/06 12:29:31 NDU 71518013 A check script contained in the associated package failed. Consult the SP Event Log on the NDU primary SP and the package release notes. Error: MirrorView has been stop-shipped, so this array cannot be upgraded to this bundle.

In both cases, the root cause was the same. Both arrays had MirrorView installed at one point and the package was later uninstalled, but some registry settings appear to have been left behind on one SP. Since the upgrade was tried on different SPs in each case, different messages were seen.

A special package called ToR14EnablementCheck was created to fix this problem. Note that this package is not generally available and is only made available if required at the discretion of EMC Engineering.

Quiesce Failed

This case appeared to be a ScsiTarg timeout when NDU went to quiesce I/O on the peer SP. The event logs for the primary SP showed:

01/06/2005 09:50:48 (71518016)An attempt to quiesce I/O on the peer SP failed. Consult other SP Event Log entries for details. Call Service provider.

File: K10NDUAdminManage.cpp Line: 279 Details: Quiesce returned 1901166615 Status: 0x71518017 71 51 80 16 NDU

The user ktrace output on the peer SP showed:

09:40:43.377 NDU: Received QUIESCE command

09:40:43.462 ndumon: QuesceAll TCD

09:40:43.463 ndumon: HostAdmin quiesce opcode 6

09:50:43.526 NDU: error: 0x71508003 File: D:\views\3a5fd66156ebf0222db8e58e642d6629.stg\catmerge\mgmt\K10Ho 09:50:43.526 NDU: stAdmi

09:50:43.534 NDU: QUIESCE failed 71518017

09:50:43.534 NDU: Calling TerminateThread to cancel HangTimer: 3a4

09:50:43.534 NDU: Hang Timer canceled

09:50:43.534 NDU: Returning RC: 71518017

The 71508003 error code is K10_HOSTADMIN_ERROR_IOCTL_TIMEOUT, which indicates that HostAdmin got an error from TCD (ScsiTarg).
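Note that the decimal number in the event log's Details field ("Quiesce returned 1901166615") is the same value as the hexadecimal Status 0x71518017 that appears in the ktrace. A small conversion helper makes it easier to match decimal values in event logs against hexadecimal codes in ktrace. The function name is hypothetical; masking to 32 bits is included on the assumption that some values are logged as signed 32-bit integers.

```python
def status_hex(value):
    """Render a status logged in decimal as eight hexadecimal digits.

    Masking with 0xFFFFFFFF maps a negative (signed 32-bit) value onto
    its unsigned bit pattern before formatting.
    """
    return format(value & 0xFFFFFFFF, "08x")


print(status_hex(1901166615))   # prints 71518017
```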

Deactivate Hang

The end of ndu-old-deact.out showed iSNS as the last line that was being executed:

C:\EMC\Base\02185003.005>msgbin\isns -UnregServer

The user ktrace showed that the iSNS process was still running up to the panic:

06:17:07.746 iSNS: Wait on mutex

06:17:07.746 iSNS: Got Devmap mutex

06:17:07.749 iSNS: release mutex

06:17:07.749 iSNS: FlareData mutex count dec 0

Since the command did not complete within the 16-minute timeout period, NDU panicked and the NDU operation failed.

Panic During Activate

The user ktrace in this panic showed that NDU was activating the Base package:

11:03:58.476 NDU: Activating using set Operation=Revert&& set BASE_GENERATION=160&& cd c:\EMC\Base\02165003 11:03:58.476 NDU: .446&& ndu\bin\activate > \temp\ndu-act.out 2>&1

Engineering dump analysis showed that NDULoadBios was started by cmd.exe, which was started by cmd.exe, which was started by NaviAgent.exe. Because this panic was on the primary SP, the NDU thread was running in NaviAgent’s process space. The ndu-act.out file would also likely show NDULoadBios.exe as the last line in the output file.

Reboot Failed

In this case, SPA was pinging, but there was no EMCRemote access. SPA was the primary in the NDU and it failed following its reboot (before the chkdsk). Hitting the NMI button forced a panic, from which dump analysis showed that this was an NtRaiseHardError hang, which has been seen mostly on CX500s when they reboot as part of an upgrade.

Registry Flush Failed

In some cases, an SP will reboot, but come up unmanaged. It may be running with the latest drivers, but have old registry settings. This can be caused by a failure to flush the EMC key in the registry, which causes a message like this to appear in the event logs:

04/11/2005 10:28:40 (79508017)Dynamic strings:Cannot flush EMC keyD:\views\chainsaw_r12_ch2k_nal_fr.stg\catmerge\mgmt\K10SystemAdminLib\K10SystemAdminControl.cpp768 79 50 80 17 00 00 03 f8 00 00 00 00 naviagent

The safest way to fix this condition is to re-image the system. If that is not an option, contact EMC Engineering for a possible alternative procedure. This problem should only be seen when upgrading from release 19 or earlier.

Commit Failed

In one case, the software on one SP was out of sync with the ndu-toc file in PSM. The event logs showed a shutdown failure, which may have caused registry setting changes to be lost:

06/08/2004 20:18:03 (79508017)Exception: InitiateSystemShutdown failed; File: D:\views\b79e399bec2ddb1ffa549397821ab792.stg\catmerge\mgmt\K10SystemAdminLib\K10SystemAdminControl.cpp; Line: 387. 79 50 80 17 00 00 00 05 00 00 00 00 ndumon

This caused NDU to return the 7151803B error code when the commit was attempted. Rebooting either SP should cause a sync, which would fix the problem.

In another case, an admin library returned an error to NDU, which caused the commit to fail. User ktrace showed:

06:54:16.775 NDU: Sending commit to Admin Library K10FlareAdmin

06:54:16.777 NDU: Admin Library returned 0

06:54:16.777 NDU: Sending commit to Admin Library K10HostAdmin

06:54:16.823 NDU: error: 0x71508010 File: D:\views\31bc63403d30f59a60e5e38999d17155.stg\catmerge\mgmt\K10Ho

06:54:16.824 NDU: Commit failed71508010

The event logs showed:

Unexpected Exception. Call service provider. File: K10HostAdmin.cpp Line: 2612 Status: 0x71508010 NTErrorCode: 0x467 Exception Details: Error waiting on ioctl -2134236020 7151801d

From this information, it was eventually found to be a problem with Clones.

Post Conversion Bundle Inconsistency in Release 14

When running release 14, conversions will not leave the bundle software in a consistent state. That is, the BundleIndex for the new platform may not exactly match the installed software, some of which may be left over from the old platform.

The fix is normally to install the latest patch for the new platform and commit it. Until that is done, however, any installs may see dependency failures with the 71518004 error code.

R12/R13 to R16/R17 stack size problem

In these incidents, an upgrade from R13 or earlier to R16 or higher on a CX400 or CX600 system with layered drivers installed caused a panic. In some cases, the panic was during the NDU, which caused it to fail. In others, the panic happened the first time layered drivers were used afterwards.

Usually, the panic code was “0x35, NO_MORE_IRP_STACK_LOCATIONS”. In other cases, however, the panic was “0xe1318013, CMID_BUGCHECK_PARTITION_FROM_LIVE_PEER_DETECTED”. The difference is that in some cases the problem was detected on that SP and it panicked itself, while in others it appears that memory was corrupted, which caused the other SP to panic.

The root cause in either case was that on NT systems upgrading from a release 13 or earlier, the script that sets the driver stack size was only running when the Base package was being activated, and not as the layered drivers were being enabled. This left the default stack size of 3 in place, rather than the correct number for the installed layered drivers.

There is now an R16_R17_StackSize_Fix_emc107453 package that can be used in both upgrades to R16/R17 and on systems that may have a latent problem. If the SP is already degraded, the SP can be fixed manually. Contact EMC Engineering for information about this procedure if required.

Initial Cleanup Failed

In these incidents, the UtilityPartition package was installed, but the ICA process was left running. This caused the next installation to fail because the c:\temp\ndu directory was in use and could not be deleted. User ktrace shows:

19:03:21.787 NDU: Cleanup cmd is: rd/s/q \temp\ndu & mkdir \temp\ndu

iSCSIPortx IP Configuration Restoration and Device Discovery

Two incidents were uncovered when manufacturing started using a new rev of CX300i hardware. This problem applies to any iSCSI platform (including CX500i and AX100[SC]i) where the iSCSI chip rev may change (e.g. due to an SP replacement) or may differ from the hardware revision stored in the Windows Plug and Play metadata of a freshly imaged SP. Newer code has been added in R19 to address this on CX and AX platforms. The following describes the issues in detail, along with workarounds for R16 and R17:

QLogic r4/r3 issue

There are two problems - one regarding plug-and-play, one regarding network settings and PSM. Some operations bring out one problem, some bring out both (some bring out neither). The plug-and-play issue occurs when new hardware is introduced, and, since we can't control when plug-and-play runs, the SP ends up in a state where iSCSI ports are not correctly named. So, this occurs when an R3-based SP is swapped out for an R4 one. It would also happen in the hypothetical reverse swap case, but EMC currently only intends to spare with R4s.

The network setting case occurs when a data-in-place reimage is done without changing the hardware. This is caused by a bug where the NDIS settings for the iSCSI ports are not restored from what is correctly stored in PSM. Given this information:

1. NDUs work fine. No special steps are required.

2. An SP swap where the old and new revs of chip are the same rev requires no special steps.

3. A data-in-place reimage will require the user to re-enter the IP information for the iSCSI ports.

Contact EMC Engineering for any required procedures.

4. An SP swap where the old and new revs of the 4010 chip are different requires a modified process.

Contact EMC Engineering for any required procedures.

One or both SPs in reboot cycle (CX200/CX400/CX600 being upgraded from pre-R11 only)

Event log shows newSP continually reinstalling a package successfully then rebooting

This problem most likely occurs when someone installs a TRANSIENT or EPHEMERAL NDU package (look at the package’s TOC.txt file for those keywords), perhaps along with other NDU packages, on a revision of software that does not support TRANSIENT or EPHEMERAL packages (i.e. packages that are supposed to disappear automatically after the NDU completes).

What happens is that the NDU succeeds, but because the Base software that is currently running does not know how to treat the TRANSIENT or EPHEMERAL keywords properly (it basically ignores them), it includes the TRANSIENT or EPHEMERAL package in the ndu-toc file in PSM rather than leaving it out.

At the next NDU sync opportunity, e.g. when an SP reboots after the NDU has already completed, newSP finds that this package’s <PKG>_REVISION environment variable is not set, which indicates that the package is not active, so it activates the package and then reboots. But since the package activation never sets that environment variable (by nature of being TRANSIENT or EPHEMERAL), this cycle recurs when the SP comes up again.
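The reason the cycle never ends can be seen in a small simulation. This is a simplified model of the behavior described above, not the newSP code: the function name, the max_boots cutoff, and the placeholder package names in the usage note are hypothetical, and the set named revision_set stands in for the <PKG>_REVISION environment variables.

```python
def boots_until_stable(toc_packages, transient, max_boots=5):
    """Return how many sync-triggered reboots occur before the SP settles,
    or None if it is still rebooting after max_boots.

    toc_packages: package names listed in ndu-toc
    transient: packages whose activation never sets <PKG>_REVISION
    """
    revision_set = set()   # packages whose <PKG>_REVISION variable is set
    for boot in range(max_boots):
        inactive = [p for p in toc_packages if p not in revision_set]
        if not inactive:
            return boot            # everything looks active, no more reboots
        for package in inactive:
            if package not in transient:
                revision_set.add(package)   # normal activation records itself
        # Activation is followed by a reboot, and the check runs again.
    return None
```

With a TRANSIENT package (say "TransientPkg") listed in toc_packages, the function returns None because that package looks inactive on every boot. Once the entry is removed from the list, which is the effect of the ndu-toc edit that fixes this problem, the simulated SP settles after at most one reboot.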

To fix this, it will be necessary to edit ndu-toc to remove the reference to the TRANSIENT or EPHEMERAL package.

Contact EMC Engineering for any required procedures.

Note that if any SP has been put into HFOFF mode, run HFON then reboot it. When it boots up, it should automatically remove the TRANSIENT or EPHEMERAL package that had been causing problems.

Once one SP that is in this situation has been repaired and is servicing I/O, it will cause the peer SP to perform an NDU sync and clean up the remains of the TRANSIENT or EPHEMERAL package on that SP. This may cause the peer SP to reboot. If the peer SP does not reboot and remains unmanaged, reboot it; it should come back managed if this was the only issue behind its unmanaged state.

Tips and Tricks

SPCollects

SPCollect output files are an excellent source of information for triaging an NDU problem. They contain the event logs, ktrace output, and NDU’s output files from the temp directory.

Event Logs

The sus.zip file in the SPCollect output file contains a SP?_navi_getlog.txt file, which has the event logs mixed together.

If Navi was unavailable at the time SPCollect was run, you can still get the raw .evt files out of the evt.zip file. Use evtdump.exe to extract the text from these files.
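Once the event log text has been extracted, a quick tally of the error codes it contains can show where to start looking. The sketch below is an assumption-laden helper, not an EMC tool: the function name is hypothetical, and it only recognizes codes written as eight hexadecimal digits in parentheses, as in the "(71518013)" entries quoted in the sample cases. Entries that print the code without parentheses are not counted.

```python
import re

# Eight hex digits in parentheses, e.g. "(71518013)".
CODE_PATTERN = re.compile(r"\(([0-9A-Fa-f]{8})\)")


def tally_event_codes(lines):
    """Count how often each parenthesized 8-digit hex code appears."""
    counts = {}
    for line in lines:
        for code in CODE_PATTERN.findall(line):
            code = code.lower()
            counts[code] = counts.get(code, 0) + 1
    return counts
```

Feeding it the lines of the extracted text file returns a dictionary keyed by code, so a code such as 71518013 that dominates the tally points toward the check script cases described earlier.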

Ktrace

The sus.zip file in the SPCollect output file has a SP?_kt_user.txt file, which has the latest user ktrace information. Ktrace information from previous boots can be found in the ktd.zip file. Searching for “-ruser” will jump to the start of the user ktrace output.

NDU Output Files

The rtp.zip file in the SPCollect output file has a number of NDU output files of the form ndu-*.out. These contain more detailed information about particular operations. For example, the ndu-act.out file contains information about the latest activation on that SP.

There may also be preserved output files from failures. If an activate for the Base package revision 02.16.500.5.001 failed, it would save the last ndu-act.out file as ndu-Base-02165004.001-act.out. Always check that this output file corresponds to that package though, as there are cases where a step can fail without producing a new output file, so the

In document EMC / CLARiiON Troubleshooting Guide 2nd Edition (Page 85-93)