
TECHNICAL REPORT

MONITORING YOUR PS SERIES SAN WITH SAN

HEADQUARTERS

ABSTRACT

Provides detailed information and best practices for monitoring a Dell EqualLogic™ PS Series storage environment using SAN HeadQuarters Version 2.1

TR1050 V2.1

Copyright © 2010 Dell Inc. All Rights Reserved.

Dell EqualLogic is a trademark of Dell Inc.

All trademarks and registered trademarks mentioned herein are the property of their respective owners. Possession, use, or copying of the documentation or the software described in this publication is authorized only under the license agreement specified herein.

Dell, Inc. will not be held liable for technical or editorial errors or omissions contained herein. The information in this document is subject to change.

November 2010


TABLE OF CONTENTS

Revision Information
Preface
Introduction
Overview
Planning the Installation
    Monitor Service System Requirements
    Monitor Client System Requirements
    Permissions Needed to Access Group Data
    SNMP Polling
Installing SAN HeadQuarters
Adding Groups to be Monitored
    Using SANHQ on Groups with the Management Port Enabled
Information Provided by SANHQ
    Capacity
    Alerts and Events
    IO and Experimental Analysis Data
    Hardware and Firmware
    Network and Port Data
Receiving Email Alerts
Exporting Data
Archiving Data
Data Analysis
    Launching the Monitor Client for Offline Use
    Importing Data
    Understanding Your Applications and What is “Normal”
Solving Performance Problems
    Check for Damaged Hardware
    Check the Volume IO Latencies
    Determining Overload of SAN Resources
    Random IOPS
    Experimental Analysis and IOPS
    Group IO Load Space Distribution
    Network Performance
    Network Bandwidth
    Storage Pool Capacity
    Low Performance on Thin Provisioned Volumes
    iSCSI Connections
    MPIO Connections
    Queue Depths
EXAMPLES
    Example 1: A Stable System
    Example 2: High Latency Caused by High Random IOPS
    Example 3: TCP Retransmit Errors
    Example 4: Network Bandwidth Saturation
    Example 5: Sudden Rise in Capacity Used
    Example 6: Low Storage Pool Free Space
Summary
For More Information


REVISION INFORMATION

The following table describes the release history of this Technical Report.

Revision   Date             Description
1.0        September 2009   Initial Release
2.0        June 2010        Updated for v2 of the SANHQ tool
2.1        November 2010    Updated for v2.1 of the SANHQ tool

The following table shows the software and firmware used for the preparation of this Technical Report.

Vendor             Model                                        Software Revision
Microsoft®         Windows Server® 2008 R2 Enterprise Edition
Microsoft®         Windows 7 Enterprise Edition
Dell EqualLogic™   PS Series array firmware                     Version 5.0.2
Dell EqualLogic™   Host Integration Tools for Windows           Version 3.4.2
Dell EqualLogic™   SAN HeadQuarters                             Version 2.1.0

The following table lists the documents referred to in this Technical Report. All PS Series Technical Reports are available on the EqualLogic website: http://www.equallogic.com/resourcecenter/documentcenter.aspx

Vendor             Document Title
Dell EqualLogic™   PS Series Arrays SAN HeadQuarters User Guide Version 2.1


PREFACE

Thank you for your interest in Dell EqualLogic™ PS Series storage products. We hope you will find the PS Series products intuitive and simple to configure and manage.

PS Series arrays optimize resources by automating volume and network load balancing. Additionally, PS Series arrays offer all-inclusive array management software, host software, and free firmware updates. The following value-add features and products integrate with PS Series arrays and are available at no additional cost:

Note: SAN HeadQuarters is the focus of this document.

• PS Series Array Software

o Firmware – Installed on each array, this software allows you to manage your storage environment and provides capabilities such as volume snapshots, clones, and replicas to ensure data hosted on the arrays can be protected in the event of an error or disaster.

- Group Manager GUI: Provides a graphical user interface for managing your array.
- Group Manager CLI: Provides a command line interface for managing your array.

o Manual Transfer Utility (MTU): Runs on Windows and Linux host systems and enables secure transfer of large amounts of data to a replication partner site when configuring disaster tolerance. You use portable media to eliminate network congestion, minimize downtime, and quick-start replication.

o SAN HeadQuarters (SANHQ): Provides centralized monitoring, historical performance trending, and event reporting for multiple PS Series groups.

• Host Software for Windows

o Host Integration Tools

- Remote Setup Wizard (RSW): Initializes new PS Series arrays, configures host connections to PS Series SANs, and configures and manages multipathing.

- Multipath I/O Device Specific Module (MPIO DSM): Includes a connection-awareness module that understands PS Series network load balancing and facilitates host connections to PS Series volumes.

- VSS and VDS Provider Services: Allows 3rd-party backup software vendors to perform off-host backups.

- Auto-Snapshot Manager/Microsoft Edition (ASM/ME): Provides point-in-time SAN protection of critical application data using PS Series snapshots, clones, and replicas of supported applications such as SQL Server, Exchange Server, Hyper-V, and NTFS file shares.

• Host Software for VMware

o Storage Adapter for Site Recovery Manager (SRM): Allows SRM to understand and recognize PS Series replication for full SRM integration.

o Auto-Snapshot Manager/VMware Edition (ASM/VE): Integrates with VMware Virtual Center and PS Series snapshots to allow administrators to enable Smart Copy protection of Virtual Center folders, datastores, and virtual machines.

o EqualLogic Multipathing Extension Module for VMware ESX: Provides enhancements to existing VMware multipathing functionality.

Current Customers Please Note: You may not be running the latest versions of the tools and software listed above. If you are under valid warranty or support agreements for your PS Series array, you are entitled to obtain the latest updates and new releases as they become available.

To learn more about any of these products, contact your local sales representative or visit the Dell EqualLogic™ site at http://www.equallogic.com. To set up a Dell EqualLogic support account to download the latest available PS Series firmware and software kits, visit: https://www.equallogic.com/secure/login.aspx?ReturnUrl=%2fsupport%2fDefault.aspx


INTRODUCTION

According to IDC and Gartner, Dell is the leader in iSCSI SAN deployments. Dell EqualLogic PS Series storage arrays offer great benefits including:

• Simplified management of storage resources
• Comprehensive protection of data resources
• Load balancing and performance optimization of storage resources
• Server and host integration tools

SAN HeadQuarters (SANHQ) provides comprehensive monitoring of performance and health statistics for one or more EqualLogic PS Series groups.

The purpose of this technical report is to help storage administrators and other IT professionals use SANHQ to monitor an EqualLogic SAN. In addition, it provides basic troubleshooting tips to help administrators diagnose some common SAN problems.

OVERVIEW

SANHQ monitors one or more PS Series groups. The tool is a client/server application that runs on a Microsoft Windows server and uses SNMP to query the groups. Acting like a “flight data recorder” on an aircraft, SANHQ collects data over time and stores it on the server for later retrieval and analysis. Client systems connect to the server and format and display the data in the SANHQ GUI.

SANHQ enables the storage administrator to:

• Monitor one or more PS Series groups and store operational data for up to a year
• Obtain a centralized view of the health and status of multiple groups
• Allow the same performance data to be viewed by multiple clients simultaneously
• Monitor performance for a specific time period
• Monitor and analyze capacity usage for groups
• View IO rates, throughput, and latency for each volume, member, pool, or group
• View estimated maximum IO capabilities
• Generate alerts and email notifications based on the health status of the groups
• Use the GUI or a script to archive group performance data for later offline analysis
• View archived group data offline
• Launch the Group Manager GUI directly from SANHQ
• Use Single Sign-On for quick group login
• Use the built-in syslog server to consolidate all events and alerts into a single view
• Generate reports, including Top 10 Volume, Configuration, Performance, and Alerts reports
• Customize the SANHQ user interface

An overview of the architecture is shown in Figure 1.

FIGURE 1: SANHQ ARCHITECTURE

TABLE 1: DESCRIPTION OF ELEMENTS IN FIGURE 1

Monitor Service: The computer running the Monitor Service issues a series of SNMP requests (polls) to each group for configuration, status, and performance information. The Monitor Service also includes a syslog server to which a PS Series group can log events.

When the first set of SNMP requests returns from a group, the Monitor Service stores this baseline information in the log files for that group. The Monitor Service issues subsequent SNMP requests at regular intervals (by default, every two minutes). To obtain a data point, the Monitor Service averages the data from consecutive polling operations.

Monitor Client: Each computer running a Monitor Client accesses the log files maintained by the Monitor Service and displays the group data in the SAN HeadQuarters GUI.

Note: The computer running the Monitor Service also has a Monitor Client installed.
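For illustration, the averaging step can be sketched as follows; the two-poll window and field names are illustrative assumptions, not SANHQ internals.

```python
# Sketch: averaging consecutive SNMP poll samples into one data point.
# The two-poll window mirrors the behavior described above; the field
# names are hypothetical.

def to_data_points(polls, window=2):
    """Average each run of `window` consecutive poll samples."""
    points = []
    for i in range(0, len(polls) - window + 1, window):
        chunk = polls[i:i + window]
        points.append({
            key: sum(p[key] for p in chunk) / window
            for key in chunk[0]
        })
    return points

# Example: two polls taken two minutes apart collapse to one point.
polls = [{"iops": 1200, "latency_ms": 4.0},
         {"iops": 1400, "latency_ms": 6.0}]
print(to_data_points(polls))  # [{'iops': 1300.0, 'latency_ms': 5.0}]
```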

PLANNING THE INSTALLATION

As shown in Figure 1, implementation of SANHQ requires a Windows system to run the Monitor Service, which conducts the SNMP polling and stores the group performance data in log files. Other Windows systems run the Monitor Client and format and display the data in the SANHQ GUI. To minimize the overhead of SNMP polling on a PS Series group, it is important that only one system monitor a specific group. You can have multiple servers running the Monitor Service (for example, a central hosting facility that monitors the SANs of multiple customers) if the list of groups monitored by each Monitor Service is unique. Each Monitor Service can support many systems running the Monitor Client.


Monitor Service System Requirements

The SANHQ Monitor Service can be installed on Windows 7, Windows Vista, Windows 2003, or Windows 2008. See the SANHQ User Guide for detailed requirements.

Either a physical or virtual system can be used to host the Monitor Service. However, when using a virtual server, time is not always accurately tracked, so the polling interval might be adversely affected if the hypervisor is particularly busy.

Be sure that the network permissions and routing allow the Monitor Service access to all the group member Ethernet ports. Be sure that the log file directory is accessible to all the systems running the Monitor Client.

Monitor Client System Requirements

The SANHQ Monitor Client can be installed on Windows 7, Windows Vista, Windows 2003, or Windows 2008. See the SANHQ User Guide for detailed requirements.

Minimally, a system running the Monitor Client must have network read access to the Windows file share in which the Monitor Service stores the group log files. See Permissions Needed to Access Group Data for more information.

Note that any Monitor Client can open an archive file and view the archived group data.

Permissions Needed to Access Group Data

The data collected by the Monitor Service can be stored in log files on the system running the Monitor Service and made available to the Monitor Clients through a Windows File Share. Alternately, you can store the log files on a network device.

Standard Windows permissions are used to control access to the network share and the directory where the data is stored. Monitor Clients with read-only access can view data, but read-write access is required to manage the list of monitored groups, configure SNMP information for the monitored groups, or change the email notification settings for the monitored groups. Use standard Windows tools to manage the permissions.

SNMP Polling

The SNMP polling interval is designed to minimize the impact of SANHQ on the performance of the groups being monitored. If the SNMP poll is unable to return a full set of data during the default two-minute interval, the Monitor Service will automatically adjust the polling frequency to ensure the data can be collected.
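The adaptive behavior described above might be sketched as follows; the doubling back-off is an assumed strategy for illustration only, since the actual SANHQ adjustment algorithm is not documented here.

```python
# Sketch: widening the polling interval when a poll cannot finish in
# time. The doubling/halving strategy is an illustrative assumption,
# not the documented SANHQ algorithm.

def next_interval(current_s, poll_duration_s, default_s=120):
    """Return the interval (seconds) to use for the next SNMP poll."""
    if poll_duration_s >= current_s:
        return current_s * 2          # poll overran: back off
    if current_s > default_s and poll_duration_s < default_s / 2:
        return max(default_s, current_s // 2)  # recover toward default
    return current_s

print(next_interval(120, 150))  # 240: poll took longer than interval
print(next_interval(240, 40))   # 120: fast polls again, relax back
```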

INSTALLING SAN HEADQUARTERS

To install SANHQ, run SANHQSetup32and64.exe on the computer. You can choose to install the Monitor Service or only the Monitor Client.

An easy-to-follow wizard guides you through the installation. See the SANHQ User Guide for more details.

ADDING GROUPS TO BE MONITORED

Multiple groups can be monitored by a single SANHQ Monitor Service installation. You can add groups to be monitored, as described in the SANHQ User Guide.


Of particular importance is ensuring that the SNMP community name is the same in the Monitor Service group configuration and in the monitored group.

Optionally, you can change the default size of the log files when first setting up monitoring for a group. Generally, the default log file size of 5 MB is an appropriate size.

Larger log files allow you to maintain more detailed data for longer periods of time, but at the expense of disk space and potentially slower performance from the Monitor Client.

Smaller log files require less disk space and might improve Monitor Client performance, but they retain less detailed data.

See the SANHQ User Guide for more details.

Using SANHQ on Groups with the Management Port Enabled

SANHQ can be used with PS Series groups that have a dedicated management network enabled. The management IP address should be used to add a group, instead of the normal iSCSI VLAN group IP address. Note that the Monitor Service must be able to communicate with each network interface on each group member to properly gather data.

INFORMATION PROVIDED BY SANHQ

SANHQ provides a broad spectrum of information. Data is categorized into key areas, as discussed briefly below.

Capacity

A key component of the health of your PS Series group is capacity. To fully understand the capacity available for new applications or to support the growth of existing servers, you must examine the overall group and pool capacity, storage utilization statistics, thin provisioned space, and space used for replication. To ensure a healthy SAN, it is important to detect any sudden or unexpected changes in capacity utilization.
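As an illustration of the kind of sudden-change detection described above, a simple threshold check over daily capacity samples might look like this; the 10% day-over-day threshold is an arbitrary example, not a Dell recommendation.

```python
# Sketch: flagging a sudden jump in capacity used between samples.
# The 10% threshold is an arbitrary illustration.

def sudden_growth(samples_gb, threshold=0.10):
    """Return sample indexes where used capacity grew more than `threshold`."""
    alerts = []
    for i in range(1, len(samples_gb)):
        prev, cur = samples_gb[i - 1], samples_gb[i]
        if prev > 0 and (cur - prev) / prev > threshold:
            alerts.append(i)
    return alerts

daily_used_gb = [800, 805, 810, 930, 935]   # unexpected jump on day 3
print(sudden_growth(daily_used_gb))          # [3]
```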

Alerts and Events

Monitoring SANHQ alerts and PS Series group events can help you correlate specific performance issues with group activity. Dell recommends setting up email notification in SANHQ and also configuring the monitored groups to log events to the SANHQ syslog server, which is part of the Monitor Service, to help troubleshoot problems if they occur.

IO and Experimental Analysis Data

For the PS Series group and its storage pools and members, IO performance indicates whether the storage system can handle the load from the servers. IOPS (IO operations per second), throughput in KB/sec, and latency are all factors to consider. Latency is particularly important because it helps determine if you are exceeding the maximum IOPS or throughput.

Obtaining IO data for individual volumes can help determine the source of the load and can assist you in determining if more resources or dedicated resources are needed to support a particular application.

The Experimental Analysis data provides an estimate of how “busy” the SAN is and can help identify problems caused by an excessive load on the available resources. The estimated Experimental Analysis data is based on a specific workload (small, random IOPS) that may not resemble the actual group workload. Therefore, the data should not be used as the sole measure of group performance.

Hardware and Firmware

Because hardware failures can be a source of performance problems, SANHQ alerts you when there is a failure.

In addition, hardware and firmware can affect performance. For example, different disk types will have different performance characteristics. Also, if hardware is added or removed from the environment, viewing this information at different points in time will tell you what was in use at that time.

SANHQ tracks:

• Array model, service tag, and serial number
• RAID status and RAID policy

• Firmware version

In addition, information about individual disks is available, including disk IO data and individual disk queue depth (requires PS Series Firmware Version 4.2 or higher).

Network and Port Data

A key to understanding the overall load on the SAN is network data and member network interface (port) data. Network retransmits are a critical indicator of network problems that can affect SAN performance. SANHQ shows the network link throughput and high- and low-use network ports.

RECEIVING EMAIL ALERTS

SANHQ can proactively inform you of performance problems and PS Series group events that require your attention, such as hardware failures or high latencies.

To receive alerts from SANHQ, configure the email settings appropriate for your environment. See the SANHQ User Guide for details.

EXPORTING DATA

SANHQ enables you to export the data for one or more monitored groups in csv format for use with an external analysis tool, such as Microsoft Excel. You specify the groups and the time range for the desired data.

See the SANHQ User Guide for details.
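As a sketch of post-export analysis, an exported CSV can be summarized with a short standard-library script; the column names below are hypothetical and should be checked against the header row of an actual SANHQ export.

```python
# Sketch: summarizing a SANHQ csv export with the standard library.
# The column names below are hypothetical; check the header row of
# your actual export before adapting this.
import csv
import io

export = """timestamp,group,iops,latency_ms
2010-11-01 10:00,Group1,1200,4.1
2010-11-01 10:02,Group1,1350,5.3
2010-11-01 10:04,Group1,2900,22.8
"""

rows = list(csv.DictReader(io.StringIO(export)))
avg_latency = sum(float(r["latency_ms"]) for r in rows) / len(rows)
peak_iops = max(int(r["iops"]) for r in rows)
print(round(avg_latency, 1), peak_iops)  # 10.7 2900
```

The same pattern applies to a file exported from SANHQ: replace the `io.StringIO` wrapper with `open(path, newline="")`.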

ARCHIVING DATA

SANHQ allows you to archive data for one or more monitored groups. This enables you to use SANHQ to view and analyze the archived data offline, without needing access to the Monitor Service.

You can also archive data to preserve detailed data for a particular period in time.

A data archive is more valuable than exported data when working with Dell support to resolve issues because the archive can be viewed by anyone running the Monitor Client.


DATA ANALYSIS

Analyzing data gathered by SANHQ can be a mixture of art and science. When looking for the source of performance issues, it is important to carefully consider all the performance data before drawing conclusions. The use of the SAN may change over time, so a tool such as SANHQ, which provides historical perspective, is often able to provide the insight that a simple performance snapshot cannot.

Launching the Monitor Client for Offline Use

Normally, each Monitor Client connects to the Monitor Service to obtain and format the latest group performance data. If desired, you can archive group data for later analysis (for example, if you do not have access to the Monitor Service).

For example, if you start SANHQ, but do not have access to the Monitor Service, simply choose the “Ignore” option when launching SANHQ, as shown in Figure 2. This allows SANHQ to start in offline mode, after which you may import archive files.

FIGURE 2: LAUNCHING THE SANHQ CLIENT FOR OFFLINE USE

Importing Data

To open an archive, select Monitor > Open Archive from the SANHQ menu bar. A new SANHQ session will appear with the data from the selected archive file. The data can be viewed and analyzed just as it would be if you were connected to the Monitor Service.


FIGURE 3: OPENING AN ARCHIVE FILE FOR ANALYSIS

Understanding Your Applications and What is “Normal”

Understanding the applications that use the PS Series SAN is key to knowing whether the storage system is performing optimally for your environment. For example, normal business applications behave very differently than video editing applications. Furthermore, you might be running high-impact operations only at certain times. Therefore, what is “normal” at 10AM may be very different from what is “normal” at 10PM, and what is “normal” for most days may not be “normal” at the end of the month.

Using multiple monitoring tools including SANHQ, server monitoring with tools such as Windows PerfMon or Linux iostat, and network monitoring tools can provide insight into the overall behavior of the storage system, under normal and abnormal conditions.

Obtaining a baseline when things are working well often helps you identify the source of trouble when problems arise. The baseline must be reestablished after major system reconfigurations, upgrades, or significant changes in application use patterns.
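A baseline comparison of the kind described above can be sketched as a simple deviation check; the 50% threshold and the metric names are illustrative assumptions.

```python
# Sketch: comparing current metrics against a stored baseline and
# flagging large deviations. The 50% threshold and metric names are
# arbitrary illustrations.

def deviations(baseline, current, threshold=0.5):
    """Return metrics whose relative change exceeds `threshold`."""
    flagged = {}
    for metric, base in baseline.items():
        cur = current.get(metric, base)
        if base and abs(cur - base) / base > threshold:
            flagged[metric] = cur
    return flagged

baseline = {"iops": 1500, "latency_ms": 6.0, "mb_per_s": 90}
current  = {"iops": 1600, "latency_ms": 31.0, "mb_per_s": 85}
print(deviations(baseline, current))  # {'latency_ms': 31.0}
```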

SOLVING PERFORMANCE PROBLEMS

There are many potential causes of performance problems. Resolving problems requires a methodical approach. You must consider all possible solutions and the effect of any changes and make sure changes can be reversed if they exacerbate the problem. Use of a methodical approach can also help avoid “analysis paralysis” where nothing is tried for fear of causing irreversible damage and thus nothing improves.


Check for Damaged Hardware

Performance problems are often caused by malfunctioning hardware. The basic troubleshooting steps that are key to solving any IT problem also apply to a SAN. Table 2 lists some problems that you should watch for and immediately correct. Other factors unique to a particular SAN also might be important.

If you correct any hardware problems and the problem still exists, you must investigate further.

TABLE 2: TYPICAL HARDWARE ISSUES AFFECTING SAN PERFORMANCE

Damaged Hardware: Server NIC
Typical Symptom: Malformed Packets
Detected By: Monitor Errors at Switch
Possible Corrective Actions: Update NIC Drivers; Replace NIC

Damaged Hardware: Bad Patch Cable; Wrong Class of Patch Cables
Typical Symptom: Visible Damage; Malformed Packets
Detected By: Visual Inspection; Monitor Errors at Switch
Possible Corrective Actions: Replace Cable

Damaged Hardware: Defective Switch
Typical Symptom: Spontaneous Restarts; Random Lock-up
Detected By: Monitor Switch with Appropriate Network Tools
Possible Corrective Actions: Update Switch Firmware; Replace Switch

Damaged Hardware: Defective Array Hardware
Typical Symptom: Alerts
Detected By: Monitor EqualLogic Group; Monitor SANHQ; Setup Email Alerts on Group and SANHQ
Possible Corrective Actions: Contact Dell Support to Replace the Malfunctioning Component

Check the Volume IO Latencies

One of the leading indicators of the health of a SAN is latency. In SANHQ, latency is the time from the receipt of the IO request to the time that the IO is returned to the server. Volume latencies are easily observed using SANHQ.

Table 3 provides some typical guidelines for interpreting the observed latencies and possible corrective actions.

Many applications will begin to exhibit significant performance degradation when latencies in the storage system are consistently above 50 ms. If the servers show high latency (for example, using PerfMon) but the storage does not, the issue is not with the storage system but with the server configuration or the SAN network. Consult your operating system, server, or switch vendor for the appropriate actions to troubleshoot these components.


TABLE 3: VOLUME LATENCY GUIDELINES

Observed Value: Less than 20 ms
Indicative Of: Normal Operations
When To Be Concerned: N/A
Possible Corrective Actions: None Required

Observed Value: 20 ms to 50 ms
Indicative Of: Possible Mis-configuration of SAN Components; if OK, Possible Overload of SAN Resources
When To Be Concerned: When Condition is Sustained
Possible Corrective Actions: Check Configuration of Server NICs and SAN Switches; if OK, Add Additional Hardware to the Storage Pool or Migrate Volumes to other Storage Pools

Observed Value: Above 50 ms
Indicative Of: Possible Mis-configuration of SAN Components; if OK, Probable Overload of SAN Resources
When To Be Concerned: When Condition is Frequently Repeated or Sustained
Possible Corrective Actions: Check Configuration of Server NICs and SAN Switches; if OK, Add Additional Hardware to the Storage Pool or Migrate Volumes to other Storage Pools
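For illustration, the latency bands in Table 3 can be applied programmatically; the function and its return strings are a sketch for scripting around exported data, not part of SANHQ.

```python
# Sketch: applying the Table 3 latency bands to an observed value.
# Thresholds follow the table above; the function name and return
# strings are illustrative.

def classify_latency(latency_ms):
    if latency_ms < 20:
        return "normal"
    if latency_ms <= 50:
        return "check configuration; possible overload if sustained"
    return "check configuration; probable overload if sustained"

print(classify_latency(8))    # normal
print(classify_latency(35))   # check configuration; possible overload if sustained
print(classify_latency(75))   # check configuration; probable overload if sustained
```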

Determining Overload of SAN Resources

When high latency cannot be attributed to incorrect configuration of the SAN infrastructure (server NICs and switches), it is possible that SAN resources are overloaded. SANHQ can provide additional information to help determine if this is the situation. Areas that could become overloaded are shown in Table 4, along with possible corrective actions.


TABLE 4: DETERMINING IF SAN RESOURCES ARE OVERLOADED

Overloaded Resource: Random IOPS
Indicated By: High IOPS Values, Not Attributable to Sequential Operations Such as Backup
When To Be Concerned: When Condition is Sustained
Possible Corrective Actions: Add Additional Hardware to the Storage Pool; Migrate Volumes to other Storage Pools

Overloaded Resource: Network Performance
Indicated By: High TCP Retransmits and Alerts; Unable to Support High Throughput for Sequential Operations
When To Be Concerned: When Condition is Sustained
Possible Corrective Actions: Check Flow Control Settings on Servers and Switches; Check Jumbo Frames Settings on Servers and Switches; Disable Jumbo Frame Support; Check Receive Buffers on Servers and Switches; Disable Unicast Storm Control; Enable Broadcast and Multicast Storm Control

Overloaded Resource: Network Bandwidth
Indicated By: High Network Utilization Values and Alerts
When To Be Concerned: When Condition is Sustained
Possible Corrective Actions: Activate Additional Network Ports (if available); Add Additional Hardware to the Storage Pool; Migrate Volumes to other Storage Pools

Overloaded Resource: Storage Pool Capacity
Indicated By: Low Storage Pool Free Space Values and Alerts
When To Be Concerned: When Condition is Frequently Repeated or Sustained
Possible Corrective Actions: Reduce Storage Utilization; Reduce Overallocated Snapshot Reserve Space; Convert Underutilized Volumes to Thin Provisioned Volumes; Migrate Volumes to other Storage Pools; Add Additional Hardware to the Storage Pool

Overloaded Resource: Low Performance on Thin Provisioned Volumes
Indicated By: Low Storage Pool Free Space Values and Alerts; Volume Approaching Maximum In Use Space Values and Alerts
When To Be Concerned: When Condition is Frequently Repeated or Sustained
Possible Corrective Actions: Reduce Storage Utilization; Add Additional Hardware to the Storage Pool; Migrate Volumes to other Storage Pools; Increase the Maximum In Use Space Value; Convert Thin Provisioned Volumes to Standard Volumes

Overloaded Resource: iSCSI Connections
Indicated By: Unable to Attach Servers to Volumes, and Alerts
When To Be Concerned: When Condition is Frequently Repeated or Sustained
Possible Corrective Actions: Disconnect from Unused Volumes and Snapshots; Modify MPIO Settings to Reduce the Number of Connections Per Volume; Migrate Volumes to another Storage Pool; Create a New Storage Pool and Migrate Volumes to the New Storage Pool

Overloaded Resource: MPIO Connections
Indicated By: Unable to Establish Multiple Connections
When To Be Concerned: When Condition is Encountered
Possible Corrective Actions: Check the Number of Active iSCSI Connections in the Storage Pool; Check ACLs on Volumes; Use EqualLogic Auto-MPIO on Supported Operating Systems; Ensure that MPIO is Supported and Configured on Other Operating Systems

Overloaded Resource: Queue Depth
Indicated By: High Queue Depth (>10)
When To Be Concerned: When Condition is Frequently Repeated or Sustained, Especially if Accompanied by High Latency
Possible Corrective Actions: Add Additional Hardware to the Storage Pool; Migrate Volumes to other Storage Pools


Random IOPS

A PS Series group uses storage virtualization to distribute workloads over many disks and generally provides much higher performance than other storage systems with a similar number of like disks. However, the disks do have a finite ability to do work, as measured in IOPS. Faster disks, such as SSD or those spinning at 15K RPM, are able to do more random work than slower disks.

As with the statement “it snows in Florida,” the estimated maximum IOPS number, while valid under the right conditions, is rarely achieved in practice. Write scaling from the maximum is greatly affected by the read/write ratio and RAID level, with RAID 10 experiencing the least degradation and RAID 6 the most, due to the greater write penalty. When random IO is high, latencies in the SAN are generally the best indication that the maximum IOPS have been exceeded. The best way to address high IO issues is to redistribute the load to other storage pools within the group (if possible) or to add PS Series arrays to the storage pool that is experiencing the high random workload. Sequential workloads, such as backup, will generate much higher IOPS numbers than random workloads; however, the IOPS during sequential operations are less relevant than network bandwidth utilization, as discussed below.
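The write-penalty effect described above can be sketched with back-of-envelope math; the backend IOPS figure and the penalty values are generic rules of thumb for illustration, not PS Series specifications.

```python
# Sketch: estimating front-end IOPS for a read/write mix under a RAID
# write penalty (commonly cited as 2 backend IOs per write for RAID 10,
# 4 for RAID 5, 6 for RAID 6). Numbers are illustrative rules of thumb.

RAID_WRITE_PENALTY = {"raid10": 2, "raid5": 4, "raid6": 6}

def effective_iops(backend_iops, read_fraction, raid):
    """Front-end IOPS the backend can sustain for this read/write mix."""
    penalty = RAID_WRITE_PENALTY[raid]
    write_fraction = 1.0 - read_fraction
    return backend_iops / (read_fraction + write_fraction * penalty)

# 5000 backend IOPS, 70/30 read/write mix:
print(round(effective_iops(5000, 0.7, "raid10")))  # 3846
print(round(effective_iops(5000, 0.7, "raid6")))   # 2000
```

The gap between the two results illustrates why RAID 6 shows the most degradation from the estimated maximum under write-heavy random workloads.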

When determining if high IOPS numbers are generated by random or sequential functions, consider the applications in use at that time. For example, in most environments there will be few OLTP users active at 2:00 AM, but backup is most likely occurring. Therefore, be sure to look at the IO size. Large IOs are often indicative of sequential operations, while small IOs indicate random operations. Latencies are usually higher if a high IOPS value is caused by random operations. Overall, high random IOPS are not a problem if applications are performing satisfactorily.

Experimental Analysis and IOPS

SANHQ provides Experimental Analysis windows that display an estimate of the maximum workload that can be sustained, as well as an estimate of the workload that can be sustained if a RAID set becomes degraded due to a drive failure. There is also a related graph that shows the estimated workload on a scale of 0% to 100%.

Note that these graphs are estimates based on a small, random IO workload, which is prevalent in a typical business environment. Large IO sizes and more sequential workloads may not match the estimated workload.

Consequently, it is possible for the group workload to exceed the maximum estimated IOPS. This is not cause for concern unless it is accompanied by high latencies. Similarly, some workloads can result in high latencies, while remaining below the estimated maximum IOPS.

Understanding what is normal for your environment will help you determine the best use of the estimated workload graphs.

Group IO Load Space Distribution

Starting with v2.1 of the SANHQ software and v5 of the EqualLogic firmware, SANHQ is able to provide additional information on the IO load distribution within a group or pool. This distribution can be helpful in determining whether the IO activity observed is attributable to a relatively small amount of the total dataset, or whether it is generally uniform across the entire dataset. Knowledge of the distribution of activity can be useful in making decisions about whether or not tiering data for performance will be effective. If data activity is concentrated in a relatively small portion of the capacity, providing higher performing media for that capacity may prove effective in increasing the performance of an application. For example, many databases have a core portion of the data that is frequently accessed, such as indexes and reference data used by all users. This portion of the data, if accelerated, will often improve the overall performance of the database. The Group IO Load Space Distribution graph shows the amount of data that falls into one of three categories: high IO, medium IO, and low IO.

In addition, if SSD media is present in the EqualLogic group, the Group IO Load Space Distribution graph will reflect this capacity relative to the amount of very active data. And if one of the EqualLogic tiered array models is present (PS6000XVS or PS6010XVS), an additional graph appears to show how much of the Enhanced Write Cache unique to these array models is in use, as shown below in Figure 4.

FIGURE 4: GROUP I/O LOAD SPACE DISTRIBUTION AND ENHANCED WRITE CACHE USAGE
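The three-way categorization can be illustrated with a simple bucketing sketch; the fixed IOPS cutoffs and region names are hypothetical, since SANHQ derives its categories internally from observed activity.

```python
# Sketch: bucketing dataset regions into high / medium / low IO
# categories, similar in spirit to the Group IO Load Space
# Distribution graph. The cutoffs and region names are hypothetical.

def bucket_io(regions, high_cut=1000, low_cut=100):
    """Map each region to a category by the IOPS observed against it."""
    out = {"high": [], "medium": [], "low": []}
    for name, iops in regions.items():
        if iops >= high_cut:
            out["high"].append(name)
        elif iops >= low_cut:
            out["medium"].append(name)
        else:
            out["low"].append(name)
    return out

regions = {"index_pages": 4200, "recent_rows": 650, "archive": 12}
print(bucket_io(regions))
# {'high': ['index_pages'], 'medium': ['recent_rows'], 'low': ['archive']}
```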

Network Performance

Network performance is dependent on a number of components working in conjunction with each other.

The most critical function that affects network performance is Flow Control, which allows a network device to signal the sending device to reduce its data stream, preventing dropped packets and retransmissions.

Flow Control is typically disabled by default and must be enabled on both the server NICs and the network switches in order to be effective. Consult your switch vendor or server NIC driver documentation to determine how to configure Flow Control. If Flow Control cannot be configured on a network device such as a NIC or switch, either upgrade the device to a firmware version that supports Flow Control or replace the device with one that does. Flow Control is automatically supported by PS Series arrays.


The second item that must be properly configured is Jumbo Frames. Jumbo Frames use a larger frame size than a standard Ethernet frame and allow large amounts of data to be transmitted efficiently between the server and storage.

In environments with small average IO sizes, Jumbo Frames provide limited benefit. In general, Jumbo Frames support is disabled by default on switches and server NICs. Enabling Jumbo Frames requires that the switch use a VLAN other than the default VLAN (usually VLAN 1). PS Series arrays automatically negotiate the use of Jumbo Frames when the iSCSI connection is established by the server. Consult your switch vendor or server NIC driver documentation to determine if Jumbo Frames can be configured.

Note that some network devices run more slowly with Jumbo Frames enabled, do not properly support Jumbo Frames, or cannot support them simultaneously with Flow Control. In these cases, Jumbo Frames should be disabled, or the switches should be upgraded or replaced.

When attempting to troubleshoot network performance problems, disable Jumbo Frames and determine whether the network is performing properly with standard Ethernet frames.

Another area that can cause problems is a lack of receive buffers. Low-end switches often have limited memory and suffer from performance issues related to insufficient buffers. A recommended buffer level in switches is 1 MB per port. Dedicated buffers are preferred to shared buffers.

In addition, server performance can often be improved by increasing the number of buffers allocated to the server NICs. Consult your switch vendor or server NIC driver documentation to determine if you can increase the buffers.

Network Bandwidth

Network bandwidth may become fully utilized during highly sequential operations, such as backup. In most cases this indicates a fully utilized system rather than a problem. Using all the available bandwidth on one or more member Ethernet interfaces will generate an alert.

Make sure you connect and enable all the member Ethernet interfaces to maximize the available SAN bandwidth.

If all interfaces are enabled, but bandwidth is still insufficient, increasing the number of arrays in the storage pool may provide additional throughput, if the servers have not reached their maximum bandwidth.

If only one interface is completely utilized on a member or on a server with multiple NICs, ensure that MPIO is properly configured. If all of the server NICs exceed their capacity (use host based tools to determine this), but the PS Series group has excess network capacity, add additional server NICs. Also, configure MPIO for those operating systems that support MPIO.
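The port-level checks described above can be sketched in a few lines of monitoring code. The Gigabit line rate and the 95%/10% saturation and idle thresholds are illustrative assumptions, not SANHQ values:

```python
def find_saturated_ports(port_mbps, link_mbps=1000.0, threshold=0.95):
    """Return names of ports running at or near line rate.

    Assumes Gigabit Ethernet links; the 95% threshold is an
    illustrative choice.
    """
    return [name for name, mbps in port_mbps.items()
            if mbps >= threshold * link_mbps]


def mpio_imbalance(port_mbps, link_mbps=1000.0):
    """Heuristic: one port saturated while another is nearly idle
    suggests MPIO is not spreading the load across paths."""
    saturated = find_saturated_ports(port_mbps, link_mbps)
    idle = [name for name, mbps in port_mbps.items()
            if mbps < 0.10 * link_mbps]
    return bool(saturated) and bool(idle)
```

For example, `mpio_imbalance({"eth0": 985.0, "eth1": 20.0})` flags the one-hot-port pattern that typically points at a missing or misconfigured MPIO setup.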

Storage Pool Capacity

Low storage pool capacity is a problem that generates an alert in SANHQ. If a pool has less than 5% free space (or less than 100 GB per member, whichever is less), a PS Series group may not have sufficient free space to efficiently perform the virtualization functions required for automatic optimization of the SAN. In addition, when storage pool free space is low, write performance on thin provisioned volumes is automatically reduced in order to slow the consumption of free space.


To increase free space in a storage pool, you can:

• Reduce the amount of in-use storage space by deleting unused volumes.

• Reduce the amount of in-use storage space by reducing the amount of snapshot reserve.

• Identify large volumes that have low utilization and convert them to thin provisioned volumes.

• Migrate volumes to storage pools with excess capacity.

• Add additional hardware to the storage pool.
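The 5%-or-100-GB-per-member rule stated above can be expressed as a small helper. This is a sketch of the published guideline only, not EqualLogic firmware logic:

```python
def pool_free_space_floor_gb(pool_capacity_gb, member_count):
    """Minimum recommended free space for a storage pool:
    5% of total capacity or 100 GB per member, whichever is less."""
    return min(0.05 * pool_capacity_gb, 100.0 * member_count)


def free_space_alert(pool_capacity_gb, free_gb, member_count):
    """True when pool free space has fallen below the floor and an
    alert (and thin-provisioned write throttling) can be expected."""
    return free_gb < pool_free_space_floor_gb(pool_capacity_gb,
                                              member_count)
```

For a single-member 10 TB pool the floor is 100 GB (the per-member term wins); for a small two-member 1 TB pool the 5% term wins at 50 GB.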

Low Performance on Thin Provisioned Volumes

If free space in a storage pool falls below 5% of total capacity (or 100 GB per member, whichever is less), write performance for thin provisioned volumes decreases in order to reduce consumption of the declining free space. Using the options for alleviating low storage pool capacity listed above will correct the problem and permit the thin provisioned volumes to once again operate at full speed. In addition, thin provisioned volume performance decreases when the allocated space for the volume approaches the maximum in-use space value for the volume. Increasing the maximum in-use space value or the reported volume size, or converting the volume from a thin provisioned volume to a fully-provisioned volume, will restore performance to normal.

Note that there must be sufficient free space to reserve the remaining unreserved volume space when converting to a fully-provisioned volume. Do not convert a thin provisioned volume if doing so will reduce the storage pool free space below 5%.

iSCSI Connections

Large, complex environments can utilize many iSCSI connections. A storage pool in a PS Series group can support numerous simultaneous connections, as outlined in the release notes for the particular EqualLogic firmware release in use. These connections can be used for fully-provisioned volumes, thin provisioned volumes, and snapshots. Attempting to exceed the supported number of connections will result in an error message.

You can reduce the number of iSCSI connections to the volumes and snapshots in a storage pool in several ways:

• Disconnect from unused volumes and snapshots.

• Modify MPIO settings to reduce the number of connections per volume.

• Move volumes to another storage pool.

• Create a new storage pool and move volumes to the new storage pool.

MPIO Connections

MPIO provides additional performance capabilities and network path failover between servers and volumes. For certain operating systems (Windows Server 2003 and 2008), the connections can be managed automatically.

If MPIO is not creating multiple connections, you should:

• Check that the storage pool does not have the maximum number of iSCSI connections for the release in use (see the release notes).

• Check the access control records for the volume. Using the iSCSI initiator name instead of an IP address can make access controls easier to manage and more secure.

• Ensure that EqualLogic MPIO extensions are properly installed on the supported operating systems. See the EqualLogic Host Integration Tools documentation for details.


• Ensure that MPIO is supported and properly configured, according to the documentation for the operating system.

Queue Depths

Queue depth is a measure of how much work is pending for a resource, such as an array or a disk drive.

A high queue depth might indicate that a resource is overloaded, particularly if high latencies exist. A low queue depth might indicate that a resource has sufficient unused capacity to absorb new workloads.

Understanding what resource has a high queue depth can be helpful in deciding what workload to move or what type of resources should be added to the SAN. Note that queue depth reporting requires PS Series Firmware Version 4.2 or higher.

TABLE 5: QUEUE DEPTH GUIDELINES

Observed Value | Indicative Of        | When To Be Concerned                                      | Possible Corrective Actions
< 10           | Low queue depth      | Normal                                                    | N/A
10-25          | Moderate queue depth | Normal                                                    | N/A
> 25           | High queue depth     | When sustained, especially if accompanied by high latency | Consider moving some workloads to another pool or adding resources if sustained
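The Table 5 guidelines can be applied programmatically when scanning exported SANHQ data; the sketch below simply encodes the thresholds above:

```python
def classify_queue_depth(depth, sustained=False):
    """Apply the Table 5 guidelines: depth under 10 is low, 10-25
    is moderate, above 25 is high. Only sustained high depth
    warrants corrective action; otherwise keep watching.

    Returns (severity, recommended_action).
    """
    if depth < 10:
        return ("low", "normal")
    if depth <= 25:
        return ("moderate", "normal")
    if sustained:
        return ("high", "consider moving workloads or adding resources")
    return ("high", "watch for sustained high depth")
```

Pairing this with latency data (a high queue depth plus high latency is the strongest overload signal) is left to the caller.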


EXAMPLES

Troubleshooting SAN issues with SANHQ is best illustrated by example. The following examples show a variety of issues and some solutions that can be used to resolve them.

Example 1: A Stable System

The combined graphs view gives an overview of the overall health of the group. A healthy, stable system is shown in Figure 5. Latencies are consistently below 20 ms, network bandwidth is well below the maximum that could be sustained by multiple Gigabit Ethernet ports, and TCP retransmits are virtually zero.

FIGURE 5: A STABLE SYSTEM
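A rough programmatic version of this health check might look like the following. The 20 ms latency limit comes from the text; the 0.1% retransmit limit is an illustrative assumption:

```python
def is_stable(samples, latency_limit_ms=20.0, retransmit_limit_pct=0.1):
    """Check latency and TCP retransmit samples against the rough
    'healthy system' indicators: every sample below the latency
    limit and with virtually zero retransmits.

    Each sample is a dict with "latency_ms" and "retransmit_pct".
    """
    return all(s["latency_ms"] < latency_limit_ms and
               s["retransmit_pct"] <= retransmit_limit_pct
               for s in samples)
```

A single spike is enough to fail this strict check; in practice you would likely tolerate brief excursions and only flag sustained violations.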


Example 2: High Latency Caused by High Random IOPS

A system overloaded with a random IOPS workload is shown in Figure 6. The PS Series group has high IOPS with a low IO size at times when users are performing normal daily activities.

The small IO size, a 50/50 read/write ratio, and latencies of 20 ms and above indicate that the group performance is bound by random IOPS. The group displays high latencies that negatively affected application performance prior to January 31, at which time an additional array was added. Because of the inherent scalability of a PS Series group, the online expansion and automatic distribution of the load over more spindles occurred without any disruption to running applications. Once the group expansion was complete, latencies dropped to acceptable levels and application performance improved dramatically.

FIGURE 6: HIGH LATENCIES CAUSED BY HIGH RANDOM IOPS


Example 3: TCP Retransmit Errors

A group that exhibits high TCP retransmit errors is shown in Figure 7. TCP retransmit errors can indicate a number of problems, ranging from defective cables to overloaded SAN switches or servers.

The frequency of the errors might indicate the source of the problem. If they are frequent regardless of load, the cause may be a bad cable, NIC, HBA, or a switch that is unable to process traffic properly. Less frequent errors, such as those shown in Figure 7, are typical of switches or server components that fail only under load and are harder to diagnose. Very infrequent errors might indicate a temporary but normal condition that may not adversely affect application performance.
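The triage logic described above can be sketched as follows. The 0.5% error threshold and 70% load cutoff are hypothetical values chosen for illustration:

```python
def diagnose_retransmits(samples, err_threshold=0.5, high_load_pct=70.0):
    """Rough triage of TCP retransmit observations.

    Each sample is a (load_pct, retransmit_pct) tuple. Errors at
    every load level point at faulty hardware; errors only at high
    load point at components that fail under stress.
    """
    erring_loads = [load for load, err in samples if err > err_threshold]
    if not erring_loads:
        return "normal"
    if len(erring_loads) == len(samples):
        return "suspect faulty cable, NIC, HBA, or switch"
    if all(load > high_load_pct for load in erring_loads):
        return "suspect component that fails under load"
    return "investigate further"
```

This is only a starting point; correlating the error timestamps with SANHQ's load graphs remains the more reliable diagnostic.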

FIGURE 7: TCP RETRANSMIT ERRORS


Example 4: Network Bandwidth Saturation

A group that has saturated the available network bandwidth is shown in Figure 8. Figure 9 shows the associated alerts.

Saturation of the network bandwidth is not a problem if it occurs during sequential operations and if application performance remains acceptable. Resolving problems with network bandwidth might require redistribution of the workload over multiple arrays. The EqualLogic architecture permits the bandwidth to scale, either by enabling additional array interface ports or by adding another array to the group and thus adding more controllers and network ports.

FIGURE 8: SATURATED NETWORK BANDWIDTH

FIGURE 9: NETWORK BANDWIDTH SATURATION ALERTS


Example 5: Sudden Rise in Capacity Used

Most groups will exhibit a rise in capacity utilization over time, due to normal activities such as adding volumes, increasing snapshot reserve, expanding thin provisioned volumes, and other normal storage activities. A sudden rise in in-use capacity, as shown in Figure 10, could indicate a problem, such as an improperly sized volume (for example, you specified TB when GB was intended) or a heavily used thin provisioned volume. Use the volume data prior to and after the increase to diagnose the sudden increase in utilization.

FIGURE 10: SUDDEN RISE IN CAPACITY IN USE


Example 6: Low Storage Pool Free Space

As with any computing resource, high utilization is desirable for the sake of efficiency, but a small buffer must be maintained for overhead. In the case of a PS Series group, the recommended minimum free space in a storage pool is 5% of total capacity or 100 GB per member, whichever is less.

A group without this buffer can have difficulty with various operations, such as snapshots, load balancing, member removal, and replication. A low pool free space condition is indicated by a warning message such as the one in Figure 11.

Several steps can be taken to increase free pool space, such as:

• Converting mostly empty volumes to thin provisioning.

• Reducing snapshot reserve for volumes that do not need the amount currently allocated.

• Reducing the replication reserve for a volume if it is not needed.

• Suspending replication for some volumes.

• Deleting unneeded volumes.

• Moving volumes to another pool.

Ultimately, you might need to add more resources to the pool, either by redistributing existing group resources or by adding an array to the pool.

FIGURE 11: LOW POOL FREE SPACE WARNING

SUMMARY

Acting as a “flight data recorder” for your PS Series group, SAN HeadQuarters is a powerful monitoring and analysis tool designed to provide SAN administrators with valuable insight into the health of their storage environment. The easy-to-use graphical interface provides information on PS Series group capacity, IO performance, network data, member hardware and configuration, and volume data. With the ability to show trends and export metrics for further reporting and analysis, SAN HeadQuarters is a key tool in administrators' daily effort to do more with fewer resources.


FOR MORE INFORMATION

For detailed information about PS Series arrays, groups, and volumes see the following documentation:

• Release Notes. Provides the latest information about PS Series arrays and groups.

• Installation and Setup. Describes how to install the array hardware and configure the software. The manual also describes how to create and connect to a volume.

• Group Administration. Describes how to use the Group Manager graphical user interface (GUI) to manage a PS Series group. This manual provides comprehensive information about product concepts and procedures.

• CLI Reference. Describes how to use the Group Manager command line interface (CLI) to manage a PS Series group and individual arrays.

• Hardware Maintenance. Describes how to maintain the array hardware. Be sure to use the manual for your array model.

• Online help. In the Group Manager GUI, expand Tools in the far left panel and then click Online Help for help on both the GUI and the CLI.

See support.dell.com/EqualLogic and log in to your customer support site for the latest documentation.

TECHNICAL SUPPORT AND CUSTOMER SERVICE

Dell’s support service is available to answer your questions about PS Series arrays. If you have an Express Service Code, have it ready when you call. The code helps Dell’s automated-support telephone system direct your call more efficiently.

Contacting Dell

Dell provides several online and telephone-based support and service options. Availability varies by country and product, and some services may not be available in your area.

For customers in the United States, call 800-945-3355.

Note: If you do not have an Internet connection, you can find contact information on your purchase invoice, packing slip, bill, or Dell product catalog.

To contact Dell for sales, technical support, or customer service issues:

1. Visit support.dell.com.

2. Verify your country or region in the Choose A Country/Region drop-down menu at the bottom of the window.

3. Click Contact Us on the left side of the window.

4. Select the appropriate service or support link based on your need.

5. Choose the method of contacting Dell that is convenient for you.


Online Services

You can learn about Dell products and services using the following procedure:

1. Visit www.dell.com (or the URL specified in any Dell product information).

2. Use the locale menu or click on the link that specifies your country or region.
