Common Problems and Troubleshooting - SAN System Design and Deployment Guide. Second Edition

Troubleshooting SAN and storage subsystems is both a science and an art. The science of troubleshooting relies on understanding the components of your SAN or storage subsystems and obtaining a working knowledge of component specifications and limitations. The art is using your experience to troubleshoot a problem and, more specifically, to identify where in the overall system to focus your investigation first.

The troubleshooting approach you use should be that of collecting data (similar to a science experiment) to validate a hypothesis. Having a working knowledge of Fibre Channel, SCSI protocol, and FC fabric switch commands is crucial in collecting information about SAN subsystem behavior. When you change the configuration or a parameter in a SAN fabric or a storage subsystem, you can measure or record the change in SAN behavior or performance. Using this logical and methodical approach to make changes most often results in resolving a problem.

This chapter provides information on how to troubleshoot and resolve issues in systems using VMware Infrastructure with SAN. It also lists common problems that system designers and administrators of these systems may encounter, along with potential solutions.

The topics covered in this chapter include the following:

“Documenting Your Infrastructure Configuration”

“Avoiding Problems”

“Troubleshooting Basics and Methodology”

“Common Problems and Solutions”

“Resolving Performance Issues”

Documenting Your Infrastructure Configuration

It is extremely helpful to have a record of your SAN fabric infrastructure architecture and component configuration. The following information is crucial to help isolate system problems:

1. A topology diagram showing all physical, network, and device connections. This diagram should include physical connection ports and a reference to the physical location of each component.

2. A list of IP addresses for all FC fabric switches. You can use this information to log into the fabric switches, which provide information that helps identify how components are connected. For example, a switch can tell you which server is connected to which FC fabric switch at which port, and which fabric switch port is mapped to which fabric storage ports.

3. A list of IP addresses for logging into the service console of each ESX host. You can access the service console to retrieve storage component information (such as HBA type, PCI slot assignment, and HBA BIOS revision), as well as to view storage subsystem event logs.

4. A list of IP addresses for logging into each disk array type or array management application. The disk array management agent can provide information about LUN mapping, disk array firmware revisions, and volume sharing data.
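One way to keep the four items above in a form that can be queried during an incident is a small structured record. The following is a minimal sketch, not a VMware tool: the class names, field names, and sample values are all hypothetical and only illustrate what such a record might hold.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FabricLink:
    """One host-to-switch-to-storage connection taken from the topology diagram."""
    esx_host: str
    hba: str
    switch: str
    switch_port: int
    storage_port: str


@dataclass
class SanInventory:
    """Hypothetical container for the four kinds of information listed above."""
    links: List[FabricLink] = field(default_factory=list)
    switch_ips: Dict[str, str] = field(default_factory=dict)
    service_console_ips: Dict[str, str] = field(default_factory=dict)
    array_mgmt_ips: Dict[str, str] = field(default_factory=dict)

    def links_for_host(self, esx_host: str) -> List[FabricLink]:
        """Return every recorded fabric connection for one ESX host."""
        return [link for link in self.links if link.esx_host == esx_host]


# Sample data (made up for illustration; addresses use a documentation range).
inventory = SanInventory(
    links=[
        FabricLink("esx01", "vmhba1", "switch-a", 4, "SP-A port 0"),
        FabricLink("esx02", "vmhba1", "switch-a", 5, "SP-B port 0"),
    ],
    switch_ips={"switch-a": "192.0.2.10"},
    service_console_ips={"esx01": "192.0.2.21", "esx02": "192.0.2.22"},
    array_mgmt_ips={"array-1": "192.0.2.50"},
)
```

With a record like this, `inventory.links_for_host("esx01")` answers the question "which switch port and storage port does this server use?" without opening the diagram.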

Avoiding Problems

This section provides guidelines to help you avoid problems in VMware Infrastructure deployments with SAN:

Carefully determine the number of virtual machines to place on a single volume in your environment. In a lightly I/O loaded environment (20 to 40 percent I/O bandwidth use), you might consider placing up to 80 virtual machines per volume. In a moderate I/O load environment (40 to 60 percent I/O bandwidth use), up to 40 virtual machines per volume is a conservative number. In a heavy I/O load environment (above 60 percent I/O bandwidth use), 20 virtual machines per volume is a good maximum to consider. If your I/O bandwidth use is 95 percent or higher, consider adding more HBAs to the servers that host virtual machines, and consider adding inter-switch links (ISLs) to your SAN infrastructure to establish ISL trunking.

Do not change the multipath policy (Fixed or MRU) that the ESX management system sets for you by default. In particular, working with an active/passive array and setting the path policy to Fixed can lead to path thrashing.

Do not change disk queuing (or queuing in general) unless you have performed a quantitative analysis of improvements you could achieve by setting new queue depths.

For active/active arrays, check whether I/O load balancing using multipathing was set, since that may cause a bottleneck at a specific path.

Determine if your application is CPU-, memory-, or I/O-intensive, to properly assign resources to meet application demands. See Chapter 7, “Growing your Storage Capacity,” for more information on this topic.

Verify from your FC fabric switch that all ESX hosts are zoned properly to a storage processor port and that this zone is saved and enabled as an active zone.

Verify whether each virtual machine is located on a single data disk or on multiple data disks, and that these disk volumes are still accessible from the ESX host.

Ensure that the FC HBAs are installed in the correct slots in the ESX host, based on slot and bus speed.
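The virtual-machines-per-volume guideline in the first item above can be written as a small lookup. This is a minimal sketch with hypothetical function names. The text does not say how to treat use below 20 percent or the exact boundary at 40 percent, so the code assumes that anything below 40 percent is light and that 40 percent itself counts as moderate.

```python
def suggested_max_vms_per_volume(io_use_pct: float) -> int:
    """Map I/O bandwidth use (percent) to the guideline's per-volume ceiling."""
    if not 0 <= io_use_pct <= 100:
        raise ValueError("I/O bandwidth use must be between 0 and 100 percent")
    if io_use_pct > 60:
        return 20  # heavy I/O load
    if io_use_pct >= 40:
        return 40  # moderate I/O load (assumption: 40 percent counts as moderate)
    return 80      # light I/O load (assumption: below 20 percent is also light)


def should_add_bandwidth(io_use_pct: float) -> bool:
    """True at 95 percent or higher, where the text suggests more HBAs and ISLs."""
    return io_use_pct >= 95


for pct in (30, 50, 75, 96):
    print(pct, suggested_max_vms_per_volume(pct), should_add_bandwidth(pct))
```

Treat the returned numbers as starting points for planning, not limits enforced by ESX.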

Troubleshooting Basics and Methodology

Some of the main keys to successful troubleshooting are keeping accurate documentation of your system architecture and configuration and, when a problem or issue arises, following a careful, methodical approach to isolate and remedy the situation, or find a workaround.

When a problem occurs, you can typically use the following steps to quickly identify the problem. You might spend fifteen to thirty minutes performing the steps, but if you cannot locate the problem and need to contact VMware support services, you will have already collected important information that lets VMware personnel help you resolve the problem:

1. Capture as much information as possible that describes the symptoms of the problem. Information that is crucial includes:

a) The date or time the problem first occurred (or when you first noticed it).

b) The impact of the problem on your application.

c) A list of events that led up to the problem.

d) Whether the problem is recurring or a one-time event.

2. Check that the ESX host on which the problem occurs has access to all available mapped LUNs assigned from the storage array management interface.

a) From your array management, verify that the LUNs are mapped to the ESX host.

b) From the ESX host having the problem, verify that all LUNs are still recognized after a rescan.

3. Log into the FC fabric switches that are part of VMware Infrastructure and verify that all connections are still active and healthy.

4. Determine if your issue or problem is a known problem for which a solution or workaround has already been identified.

a) Visit VMware knowledge base resources at http://kb.vmware.com

b) Review the problems listed in Table 10-1 for commonly known SAN-related issues with ESX and VMware Infrastructure.
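Step 2 above amounts to comparing two lists: the LUNs the array says are mapped to the host, and the LUNs the host reports after a rescan. A minimal sketch of that comparison follows. The function name and the sample LUN numbers are hypothetical, and you would gather both lists by hand from the array management interface and the ESX host.

```python
from typing import Set


def missing_luns(mapped_on_array: Set[int], seen_by_host: Set[int]) -> Set[int]:
    """Return LUNs mapped on the array that the host did not report after a rescan."""
    return mapped_on_array - seen_by_host


# Sample values, made up for illustration.
mapped = {0, 1, 2, 5}   # from the array management interface (step 2a)
seen = {0, 1, 5}        # from the ESX host after a rescan (step 2b)

print(sorted(missing_luns(mapped, seen)))  # LUNs to investigate first
```

An empty result suggests the mapping is intact and the investigation can move on to the fabric switches in step 3.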

Common Problems and Solutions

This section lists the issues that are most frequently encountered. It either explains how to resolve them or points to the location where the issue is discussed. Some common problems that system administrators might encounter include the following:

1. LUNs not detected by the ESX host.

2. SCSI reservation conflict messages.

3. VirtualCenter freezes.

4. Disk Partition not recognized errors.

5. None of the paths to LUN x:y:z are working.

6. Service console freezes.

The following table lists solutions to some other common problems and issues:

Table 10-1. Common Issues and Solutions

Issue: A LUN is not visible in the VI Client.
Solution: See “Resolving Issues with LUNs That Are Not Visible” on page 106.

Issue: You want to understand how path failover is performed or change how path failover is performed.
Solution: The VI Client allows you to perform these actions. See “Managing Multiple Paths for Fibre Channel” on page 119.

Issue: You want to view or change the current multipathing policy or preferred path, or disable or enable a path.
Solution: The VI Client allows you to perform these actions. See “Managing Multiple Paths for Fibre Channel” on page 119.

Issue: You need to increase the Windows disk timeout to avoid disruption during failover.
Solution: See “Setting Operating System Timeout” on page 148.

Issue: You need to customize driver options for the QLogic or Emulex HBA.
Solution: See “Setting Device Driver Options for SCSI Controllers” on page 148.

Issue: The server is unable to access a LUN, or access is slow.
Solution: Path thrashing might be the problem. See “Resolving Path Thrashing Problems” on page 182.

Issue: Access is slow.
Solution: If you have many LUNs/VMFS volumes and all of them are VMFS-3, unload the VMFS-2 driver by typing at a command line prompt:

vmkload_mod -u vmfs2

You should see a significant increase in the speed of certain management operations, such as refreshing datastores and rescanning storage adapters.

Issue: You have added a new LUN or a new path to storage and want to see it in the VI Client.
Solution: You have to rescan. See “Performing a Rescan” on page 116.

Understanding Path Thrashing

In all arrays, the storage processors (SPs) are like independent computers that have access to some shared storage. Algorithms determine how concurrent access is handled.

For active/passive arrays, only one SP can access all the sectors on the storage that make up a given volume at a time. The ownership is passed around between the SPs. The reason is that storage arrays use caches and SP A must not write something to disk that invalidates SP B’s cache. Because the SP has to flush the cache when it is done with its operation, moving ownership takes a little time.

During that time, neither SP can process I/O to the volume.

For active/active arrays, the algorithms allow more fine-grained access to the storage and synchronize caches. Access can happen concurrently through any SP without extra time required.

Arrays with auto volume transfer (AVT) features are active/passive arrays that attempt to look like active/active arrays by passing the ownership of the volume to the various SPs as I/O arrives. This approach works well in a clustering setup, but if many ESX systems access the same volume concurrently through different SPs, the result is path thrashing.

Consider how path selection works during failover:

On an active/active array, the system starts sending I/O down the new path.

On an active/passive array, the ESX system checks all standby paths. The SP at the end of the path that is currently under consideration sends information to the system on whether or not it currently owns the volume.

♦ If the ESX system finds an SP that owns the volume, that path is selected and I/O is sent down that path.

♦ If the ESX host cannot find such a path, the ESX host picks one of the paths and sends the SP (at the other end of the path) a command to move the volume ownership to this SP.

Path thrashing can occur as a result of this path choice. If Server A can reach a LUN through only one SP, and Server B can reach the same LUN only through a different SP, they both continuously cause the ownership of the LUN to move between the two SPs, flipping ownership of the volume back and forth. Because the system moves the ownership quickly, the storage array cannot process any I/O (or can process only very little). As a result, any servers that depend on the volume start timing out I/O.
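The path selection rules and the resulting thrashing can be illustrated with a toy model. This is a simplified sketch, not the ESX implementation: the class, the function names, and the SP labels are hypothetical, and the model only counts ownership moves rather than measuring the I/O stall each move causes.

```python
from typing import Dict, List


class ActivePassiveVolume:
    """Toy model of one volume on an active/passive array."""

    def __init__(self, owner: str) -> None:
        self.owner = owner
        self.transfers = 0  # number of ownership moves between SPs

    def request_ownership(self, sp: str) -> None:
        """Move ownership to sp and count the move if the owner changes."""
        if sp != self.owner:
            self.owner = sp
            self.transfers += 1


def select_path(volume: ActivePassiveVolume, reachable_sps: List[str]) -> str:
    """Choose an SP following the active/passive rules described above."""
    if not reachable_sps:
        raise ValueError("host has no working path to the volume")
    # Prefer a path whose SP already owns the volume.
    for sp in reachable_sps:
        if sp == volume.owner:
            return sp
    # Otherwise pick a path and ask its SP to take over ownership.
    chosen = reachable_sps[0]
    volume.request_ownership(chosen)
    return chosen


def simulate(host_paths: Dict[str, List[str]], rounds: int) -> int:
    """Let each host issue I/O in turn and return the number of ownership moves."""
    volume = ActivePassiveVolume(owner="SP-A")
    for _ in range(rounds):
        for reachable in host_paths.values():
            select_path(volume, reachable)
    return volume.transfers


# Each server reaches the LUN through a different SP only: ownership flips constantly.
thrashing = simulate({"Server A": ["SP-A"], "Server B": ["SP-B"]}, rounds=5)
# Both servers reach both SPs: each finds the current owner and nothing moves.
healthy = simulate({"Server A": ["SP-A", "SP-B"], "Server B": ["SP-B", "SP-A"]}, rounds=5)
print(thrashing, healthy)  # prints "9 0"
```

In the first case ownership moves on almost every I/O, which is the condition under which the array can process little or no I/O.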

Resolving Path Thrashing Problems

If your server is unable to access a LUN, or access is slow, you might have a problem with path thrashing (sometimes called LUN thrashing). In path thrashing, two hosts access the volume via different storage processors (SPs), and, as a result, the volume is never actually available.

This problem usually occurs in conjunction with certain SAN configurations and under these circumstances:

You are working with an active/passive array.

Path policy is set to Fixed.

Two hosts access the volume using opposite path order. For example, Host A is set up to access the lower-numbered LUN via SP A. Host B is set up to access the lower-numbered LUN via SP B.

Path thrashing can also occur if Host A lost a certain path and can use only paths to SP A while Host B lost other paths and can use only paths to SP B.

This problem could also occur on a direct connect array (such as AX100) with HBA failover on one or more nodes.

You typically do not see path thrashing with other operating systems:

No other common operating system uses shared LUNs for more than two servers (that setup is typically reserved for clustering).

For clustering, only one server is issuing I/O at a time. Path thrashing does not become a problem.

In contrast, multiple ESX systems may be issuing I/O to the same volume concurrently.

To resolve path thrashing:

Ensure that all hosts sharing the same set of volumes on those active/passive arrays access the same storage processor simultaneously.

Correct any cabling inconsistencies between different ESX hosts and SAN targets so that all HBAs see the same targets in the same order.

Make sure the path policy is set to Most Recently Used (the default).
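To check the first point above, you can compare which storage processor each host currently uses for each shared volume. The following minimal sketch assumes you have already collected that mapping by hand. The function name, host names, and volume names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Set


def volumes_at_risk(active_sp: Dict[str, Dict[str, str]]) -> Dict[str, Set[str]]:
    """Return volumes that different hosts reach through different SPs.

    active_sp maps host name -> {volume name -> SP on that host's active path}.
    """
    sps_per_volume: Dict[str, Set[str]] = defaultdict(set)
    for volumes in active_sp.values():
        for volume, sp in volumes.items():
            sps_per_volume[volume].add(sp)
    # A volume is at risk when more than one SP appears across hosts.
    return {vol: sps for vol, sps in sps_per_volume.items() if len(sps) > 1}


# Sample data, made up for illustration.
observed = {
    "esx01": {"vol1": "SP-A", "vol2": "SP-B"},
    "esx02": {"vol1": "SP-B", "vol2": "SP-B"},
}
print(volumes_at_risk(observed))  # vol1 is reached through both SPs
```

Any volume the function reports is a candidate for the cabling and path policy checks listed above.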

Resolving Issues with Offline VMFS Volumes on Arrays

On some arrays, it is not possible to display the volume with the same volume ID across hosts. As a result, the ESX system incorrectly detects the volume as a snapshot and places the volume offline.

Examples of storage arrays for which the same volume ID might not be visible for a given volume across hosts are Clariion AX100 and a few IBM TotalStorage Enterprise Storage Systems (previously Shark Storage systems).

NOTE: If you use Clariion AX100 with Navisphere Express, you cannot configure the same volume ID across storage groups. You must use a version of Navisphere software that has more comprehensive management capabilities.

For IBM TotalStorage 8000, you need to recreate these volumes.

To resolve issues with invisible LUNs on certain arrays:

1. In the VI Client, select the host in the inventory panel.

2. Click the Configuration tab and click Advanced Settings.

3. Select LVM in the left panel and set LVM.DisallowSnapshotLUN to 0 in the right panel.

4. Rescan all LUNs.

After the rescan, all LUNs are available. You might need to rename the volume label.

Understanding Resignaturing Options

This section discusses how the EnableResignature and DisallowSnapshotLUN options interact, and explains the three states that result from changing these options.
