EMC SCALEIO
NETWORKING BEST PRACTICES
ABSTRACT
This document describes the core concepts, best practices, and validation methods for architecting a ScaleIO network.
Copyright © 2015 EMC Corporation. All Rights Reserved.
EMC believes the information in this publication is accurate as of its publication date. The information is subject to change without notice.
The information in this publication is provided “as is.” EMC Corporation makes no representations or warranties of any kind with respect to the information in this publication, and specifically disclaims implied warranties of merchantability or fitness for a particular purpose.
Use, copying, and distribution of any EMC software described in this publication requires an applicable software license. For the most up-to-date listing of EMC product names, see EMC Corporation Trademarks on EMC.com.
VMware is a registered trademark or trademark of VMware, Inc. in the United States and/or other jurisdictions. All other trademarks used herein are the property of their respective owners.
Part Number H14708
TABLE OF CONTENTS
EXECUTIVE SUMMARY
AUDIENCE
SCALEIO OVERVIEW
NETWORK INFRASTRUCTURE
Network Topology
Leaf-Spine Network Topology
Flat Network Topology
IPv4 and IPv6
NETWORK PERFORMANCE
Network Latency
Network Speed
Infiniband
NICs
Two NICs vs. Four NICs and Other Configurations
Dedicated Access Port
NIC Pooling
Jumbo Frames
Flow Control
Link Aggregation
High Availability
Active High Availability
Passive High Availability
SOFTWARE-DEFINED SAN VS. HYPERCONVERGED
VMWARE IMPLEMENTATIONS
SDC
VM-Kernel port
Virtual Machine Port Group
VMware Advantages over a Standard Switch
VALIDATION METHODS
Internal SIO Tools
SDS Network Test
SDS Network Latency Meter Test
Iperf and NetPerf
Iperf
NetPerf
Network Monitoring
Network Troubleshooting 101
REVISION HISTORY
REFERENCES
EXECUTIVE SUMMARY
EMC’s ScaleIO software-defined storage platform presents limitless opportunities to create powerful storage systems from commodity hardware. ScaleIO’s success is driven not only by the hardware it operates on, but also by properly tuned operating system platforms and networks. Further, it empowers users with the ability to size networks properly from a redundancy and performance perspective. Please note that this guide does not cover every networking best practice for ScaleIO but attempts to cover a minimum set of network best practices. It is very likely that a ScaleIO technical expert could recommend more comprehensive, or sometimes different, best practices than those covered in this guide. This guide is intended to provide details on:
• Network topology choices
• Network performance
• Software-defined SAN and hyperconverged considerations
• ScaleIO implementations within a VMware environment
• Validation methods
• Monitoring recommendations
AUDIENCE
This white paper is intended as a guide to advise end-users regarding best practices for ScaleIO networking, while creating a better understanding of the options available for the successful operation of ScaleIO from a networking perspective.
SCALEIO OVERVIEW
The management of large-scale, rapidly growing infrastructures is a constant challenge for many data center operation teams, and it is not surprising that data storage is at the heart of these challenges. The traditional dedicated SAN and dedicated workloads cannot always provide the scale and flexibility needed. A storage array cannot borrow capacity from another SAN if demand increases, which can lead to data bottlenecks and a single point of failure. When delivering Infrastructure-as-a-Service (IaaS) or high performance applications, delays in response are simply not acceptable to customers or users.
EMC ScaleIO is software that creates a server-based SAN from local application server storage (local or network storage devices). ScaleIO delivers flexible, scalable performance and capacity on demand. ScaleIO integrates storage and compute resources, scaling to thousands of servers (also called nodes). As an alternative to traditional SAN infrastructures, ScaleIO combines hard disk drives (HDD), solid state disk (SSD), and Peripheral Component Interconnect Express (PCIe) flash cards to create a virtual pool of block storage with varying performance tiers. ScaleIO is hardware-agnostic, supports physical and/or virtual application servers, and has been proven to deliver significant TCO savings vs. traditional SAN.
Massive Scale - ScaleIO is designed to massively scale from three to thousands of nodes. Unlike most traditional storage systems, as the number of storage devices grows, so do throughput and IOPS. The scalability of performance is linear with regard to the growth of the deployment. Whenever the need arises, additional storage and compute resources (i.e., additional servers and/or drives) can be added modularly so that resources can grow individually or together to maintain balance.
Extreme Performance - Every server in the ScaleIO cluster is used in the processing of I/O operations, making all I/O and throughput accessible to any application within the cluster. Such massive I/O parallelism eliminates bottlenecks. Throughput and IOPS scale in direct proportion to the number of servers and local storage devices added to the system, improving cost/performance rates with growth. Performance optimization is automatic; whenever rebuilds and rebalances are needed, they occur in the background with minimal or no impact to applications and users. The ScaleIO system autonomously manages performance hot spots and data layout.
Compelling Economics - As opposed to traditional Fibre Channel SANs, ScaleIO has no requirement for a Fibre Channel fabric between the servers and the storage and no dedicated components like HBAs. There are no “forklift” upgrades for end-of-life hardware. You simply remove failed disks or outdated servers from the cluster. It creates a software-defined storage environment that allows users to exploit the unused local storage capacity in any server. Thus ScaleIO can reduce the cost and complexity of the solution resulting in typically greater than 60 percent TCO savings vs. traditional SAN.
Unparalleled Flexibility - ScaleIO provides two flexible deployment options. The first option is called “two-layer” and is when the application and storage are installed on separate servers in the ScaleIO cluster. This provides efficient parallelism and no single points of failure. The second option is called “hyper-converged” and is when the application and storage are installed on the same servers in the ScaleIO cluster. This creates a single-layer architecture and provides the lowest footprint and cost profile. ScaleIO provides unmatched choice for these deployment options. ScaleIO is infrastructure agnostic, making it a true software-defined storage product. It can be used with mixed server brands, operating systems (physical and virtual), and storage media types (HDDs, SSDs, and PCIe flash cards). In addition, customers can also use OpenStack commodity hardware for storage and compute nodes.
Supreme Elasticity - With ScaleIO, storage and compute resources can be increased or decreased whenever the need arises. The system automatically rebalances data “on the fly” with no downtime. Additions and removals can be done in small or large increments. No capacity planning or complex reconfiguration due to interoperability constraints is required, which reduces complexity and cost.
Based on the requirements, ScaleIO can be used either in 2-layer architecture also known as SAN.NEXT or in a single-layer architecture also known as Infrastructure.NEXT.
• Two-layer or SAN.NEXT: ScaleIO allows the customer to move towards a software-defined scale-out SAN infrastructure using commodity hardware. Customers can keep running their applications the way they do today, on separate servers, and transform the way they handle SAN storage.
• Single-layer or Infrastructure.NEXT: ScaleIO also allows the customers to run the application on the same storage server, bringing the compute and storage together in a single architecture. In a nutshell, ScaleIO brings together the storage, compute and application in a single layer, which makes the management of the infrastructure much simpler.
NETWORK INFRASTRUCTURE
Network Topology
There are two network topologies that we will be discussing for ScaleIO deployments: (1) Leaf-Spine network, and (2) Flat network. The primary considerations when determining your network topology are:
1. What is the number of ScaleIO nodes planned for your deployment?
o High number of nodes that don’t fit into a single switch
o Small number of nodes (<10) that fit into a single switch
o Port density/number of available ports on switch
2. What is your deployment plan – hyper-converged or software-defined SAN?
o If you plan on running hyper-converged, you may need additional network capacity from a bandwidth perspective to accommodate additional storage and applications
3. Are you extending an existing network to accommodate a small number of nodes?
o Small number of nodes
o Want to leverage existing port capacity
4. Network Redundancy
o Network redundancy enables you to have the same amount of bandwidth across the network infrastructure
o You should have enough connections between switches to have end-to-end capacity if there is a device failure
o No single point of failure – highly available
5. Security
o If you are connecting to untrusted SDCs, you may want or need to separate the SDC network from the SDS network
Leaf-Spine Network Topology
A Leaf-Spine topology is a two-tier architecture, and is an alternative to the classic three-layer network design. It consists of Leaf Switches and Spine Switches. In this design, each Leaf Switch is attached to all Spine Switches, but Leaf Switches are not connected to other Leaf Switches and Spine Switches are not connected to other Spine Switches. The Leaf Switches control the flow of traffic between servers, and the Spine Switches move traffic between nodes at Layer 2.
In most instances, we recommend leveraging a Leaf-Spine network topology design for a ScaleIO implementation. This is because:
• ScaleIO can scale upwards to hundreds of nodes
• Leaf-Spine architecture facilitates scale-out deployments, without having to re-architect the network (future-proofing)
• When designed correctly to allow for maximum bandwidth, a Leaf-Spine topology will be non-blocking
• All connections have equal access to bandwidth
• Predictable latency
• Highest availability and performance
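The non-blocking property mentioned above can be checked with simple arithmetic. The following Python sketch is an illustration only: it assumes that "non-blocking" means a leaf switch's uplink bandwidth to the spines is at least equal to its server-facing bandwidth, and the port counts and speeds used are hypothetical, not taken from this guide.

```python
def oversubscription_ratio(server_ports: int, server_port_gbps: float,
                           uplinks: int, uplink_gbps: float) -> float:
    """Ratio of server-facing bandwidth to spine-facing bandwidth on one leaf."""
    downlink_gbps = server_ports * server_port_gbps
    uplink_total_gbps = uplinks * uplink_gbps
    return downlink_gbps / uplink_total_gbps


def is_non_blocking(ratio: float) -> bool:
    """Assumption: a ratio of 1.0 or lower is treated as non-blocking."""
    return ratio <= 1.0


if __name__ == "__main__":
    # Hypothetical leaf: 16 servers at 10 Gbps, 4 uplinks at 40 Gbps.
    ratio = oversubscription_ratio(16, 10, 4, 40)
    print(f"oversubscription {ratio:.1f}:1, non-blocking: {is_non_blocking(ratio)}")
```

A ratio above 1.0 means the servers on a leaf can collectively offer more traffic than the uplinks can carry, which matters most during rebuild and rebalance operations.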
Flat Network Topology
A Flat network design is less costly and easier to maintain from an administration perspective. A Flat network topology is easier to implement, and may be the preferred choice if an existing network is being extended or if the network does not scale beyond four switches. If you expand beyond four switches, you will need so many cross-link ports that remaining in a flat network topology would likely be cost prohibitive. A Flat Network is the fastest, simplest way to get your ScaleIO deployment up and running.
The primary use-cases for a flat network topology are:
• Small deployment, not extending beyond four switches
• Remote Office/Branch Office
• Small Business
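To illustrate why cross-link ports become costly as a flat network grows, the sketch below counts links under one possible assumption: every switch is cross-linked directly to every other switch (a full mesh) with a single link per pair. This guide does not state the exact cross-link layout, so treat the numbers as a rough indication only.

```python
def full_mesh_links(switches: int) -> int:
    """Number of switch-to-switch links if every pair is directly connected."""
    return switches * (switches - 1) // 2


def crosslink_ports_per_switch(switches: int) -> int:
    """Ports each switch gives up to cross-links in a single-link full mesh."""
    return max(switches - 1, 0)


if __name__ == "__main__":
    for n in range(2, 7):
        print(f"{n} switches: {full_mesh_links(n)} links, "
              f"{crosslink_ports_per_switch(n)} cross-link ports per switch")
```

Under this assumption the link count grows quadratically with the number of switches, and every port used for a cross-link is a port that cannot be used for a ScaleIO node.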
IPv4 and IPv6
While the current version of ScaleIO (1.32.2) supports Internet Protocol version 4 (IPv4) addressing, IPv6 will be supported in the coming product release.
NETWORK PERFORMANCE
NOTE: It is recommended that all specialty network configurations be disabled when deploying ScaleIO. This includes Jumbo Frames, Flow Control, and Link Aggregation. After your infrastructure has been deployed successfully, you can begin to tune the environment and add these layers to achieve best performance.
Network Latency
Network latency is important to account for when designing your network. Minimizing the amount of network latency will provide for improved performance. For best performance, latency for all SDS and MDM communication should not exceed 1ms roundtrip time. This can be easily verified by pinging, and more extensively by the SDS Network Latency Meter Test.
Please note that ScaleIO is not designed to extend outside the datacenter. Leveraging wide area networks to operate ScaleIO is discouraged.
Network Speed
Network speed is a critical component when designing your ScaleIO implementation. In order to determine your network speed, the following considerations should be examined:
• Rebuild time (the amount of time it takes for a failed node to rebuild)
• Rebalance time (the amount of time it takes to redistribute data in the event of proactive device removal or uneven data distribution in the event of a node failure)
• Drive capability/performance (the amount of data a drive is capable of delivering from an IO perspective)
• Performance expectations (IOPS, Bandwidth, Latency)
While ScaleIO can be deployed on a 1Gbps network, storage performance will be bottlenecked by the network capacity. At a minimum, we recommend leveraging 10Gbps network technology.
Using the following formula, you can calculate the number of NICs that are suggested for optimal network throughput, based on the size and type of drives you will be using in a ScaleIO node.
Note: The following calculation can be used as a guide for optimizing throughput in order to minimize ScaleIO rebuild and rebalance times.
(Number of drives per server * average sequential drive performance in MBps)
Is approximately equal to
EXAMPLE 1:
EXAMPLE 2:
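As a hedged illustration of the calculation above, the following Python sketch assumes that the right-hand side of the relationship is the aggregate NIC throughput (number of NICs multiplied by per-NIC throughput in MBps). It uses 1250 MBps for a 10Gbps NIC and ignores protocol overhead. The function name and the drive figures are hypothetical.

```python
import math


def suggested_nic_count(drives_per_server: int, drive_mbps: float,
                        nic_gbps: float = 10.0) -> int:
    """Estimate the NICs needed to carry the aggregate sequential drive throughput.

    Assumption: one NIC carries nic_gbps * 1000 / 8 MBps (10 Gbps -> 1250 MBps),
    with no allowance for protocol overhead.
    """
    nic_mbps = nic_gbps * 1000 / 8
    total_drive_mbps = drives_per_server * drive_mbps
    return max(1, math.ceil(total_drive_mbps / nic_mbps))


if __name__ == "__main__":
    # Hypothetical node with 24 HDDs at about 100 MBps sequential each.
    print(suggested_nic_count(24, 100))
    # Hypothetical node with 6 SSDs at about 450 MBps sequential each.
    print(suggested_nic_count(6, 450))
```

The result is a throughput estimate only. A single-NIC result does not remove the need for a second interface for redundancy.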
Infiniband
Infiniband is not a required technology for running ScaleIO. However, because ScaleIO is a TCP/IP based storage technology, IP-over-Infiniband is supported for front-end and back-end ScaleIO storage platforms. If you are leveraging Infiniband combined with Ethernet technology, it is recommended that an MTU size of 4092 be utilized across both networks. This does present a potential disadvantage depending on your network topology.
NICs
ScaleIO supports single and multiple NIC configurations, but as a best practice we recommend having a minimum of two 10Gbps Ethernet NICs per ScaleIO node as an initial configuration.
The following items should be considered when determining the NIC configuration:
• Redundancy – Ensure your NIC configuration will be fault-tolerant in the event of a switch or single NIC failure. We always recommend leveraging active/active or active/passive network configurations for best network stability and performance.
• Performance – Network speed should be considered in ensuring capability to perform storage operations consistent with the needs of the specific use case. Ensure you are using a sufficient number of NICs to meet your IOPS and bandwidth requirements.
• Ease of use – We recommend leveraging active/active Ethernet networks that will consistently deliver the required amount of capacity and performance for the environment.
• Capacity – Network sizing corresponds directly to the rebuild time of your disks and nodes in a software-defined SAN ScaleIO deployment. If you are leveraging a hyper-converged solution, it is also important to consider the bandwidth needs of operations taking place outside of the storage infrastructure.
Two NICs vs. Four NICs and Other Configurations
As a baseline for system design, every ScaleIO SDS node should contain a minimum of two network interfaces for redundancy. Additional network capacity may be required for a variety of reasons, as outlined above in “Network Speed”. ScaleIO allows network resources to be scaled by adding network interfaces.
Although not required, there may be situations where isolating front-end and back-end traffic for the storage network may be ideal. In all cases we recommend multiple interfaces for redundancy, capacity, and speed. The primary driver to segment front end and back end network traffic is to guarantee the performance of storage and application related network traffic.
Dedicated Access Port
For optimal performance, it is recommended that switch ports remain in their factory default state. It is not recommended to enable VLAN tagging for ScaleIO traffic.
NIC Pooling
ScaleIO has the ability to use multiple IPs to manage storage traffic, and these IPs can be tied to separate NICs.
As network redundancy is a primary concern for most organizations, we recommend setting up the ScaleIO data network with multiple separate IPs for each ScaleIO implementation. Refer to the network section in the ScaleIO User Guide for more information.
Jumbo Frames
While ScaleIO does support Jumbo Frames, leveraging Jumbo Frames can be challenging depending on your infrastructure. There may be networking technology limitations that prevent the use of Jumbo Frames.
Because of the potential constraints, we recommend leaving Jumbo Frames disabled initially, and enabling them only after you have confirmed that your infrastructure can support their use.
There are several situations when Jumbo Frames do not drive performance, and can have an adverse effect. In the following scenarios, Jumbo Frames should not be utilized:
• Jumbo Frames are not supported by the network infrastructure
• Jumbo Frames are not supported by the clients
• Most reads and/or writes are not expected to exceed 1500 bytes
However, if Jumbo Frames are supported by your network technology, there are benefits to enabling Jumbo Frames. Enabling Jumbo Frames will allow blocks which are being written to the filesystem to be passed in a single Ethernet frame, decreasing interrupts and maximizing performance. For example, if you will be writing 4k or 8k blocks to a filesystem, the number of packets for such writes can be significantly reduced.
If you determine that your environment does support Jumbo Frames, and your writes typically exceed 1500 bytes, enabling Jumbo Frames allows for improved performance.
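The reduction in packet count described above can be estimated with a short calculation. This Python sketch is an approximation under stated assumptions: a 9000-byte MTU is assumed for Jumbo Frames (this guide does not specify a value), and 40 bytes per frame are reserved for IPv4 and TCP headers without options. ScaleIO protocol overhead is not modeled.

```python
import math

IP_TCP_HEADER_BYTES = 40  # assumption: IPv4 (20 bytes) + TCP (20 bytes), no options


def frames_per_write(write_bytes: int, mtu: int) -> int:
    """Approximate number of Ethernet frames needed to carry one write."""
    payload_per_frame = mtu - IP_TCP_HEADER_BYTES
    return math.ceil(write_bytes / payload_per_frame)


if __name__ == "__main__":
    for size in (4096, 8192):
        standard = frames_per_write(size, 1500)
        jumbo = frames_per_write(size, 9000)  # 9000 is an assumed jumbo MTU
        print(f"{size}-byte write: {standard} frames at MTU 1500, {jumbo} at MTU 9000")
```

For writes smaller than the standard payload size the frame count is one in both cases, which is consistent with the guidance that Jumbo Frames offer little benefit when most I/O does not exceed 1500 bytes.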
Flow Control
Flow control is a signaling method that allows two connected network devices to notify each other that they cannot accept more packets at this time because their buffers are full; it signals the other side to briefly pause sending packets before resuming traffic. This prevents packets from being dropped, and by doing so may prevent unwanted TCP/IP congestion control algorithms from activating.
When to use flow control
Global flow control:
• Applies to all traffic on a NIC
• Use when there is a dedicated network configuration for just ScaleIO
• Use when Priority Flow Control (PFC) is not supported or available on the NICs or switches
Datacenter Bridging/PFC:
• Enable if the switches and NICs support it
• Works like flow control, but only for a specific type of traffic
• No packets get lost
When NOT to use flow control
• When you can only use global flow control and you are hyper-converged (i.e., not only ScaleIO traffic is being passed along)
• With either PFC or global flow control:
o If you have a dedicated physical network only for storage (however there still may be benefits)
Link Aggregation
In order to optimize network performance, enabling Link Aggregation is recommended if your network technology has the ability to support it. Link Aggregation increases network throughput by using all available network paths simultaneously, exceeding the performance of a single network connection. It also provides a level of redundancy, in that a single link can fail and network traffic will continue to pass over the remaining trunked connections.
High Availability
Active High Availability
Link aggregation with two or more NICs where all NICs are active and divide traffic across them. Use Link Aggregation if your switch supports teaming/bonding across two or more switches; configuration is needed on both sides. Be aware that the configuration must use active link detection (no static LACP configuration) and should be configured with short timers to allow fast failover to take place.
Passive High Availability
Link aggregation with two or more NICs where only a single link/NIC is ever active. Use this if the switch does not support teaming/bonding across two or more switches; the server chooses the active NIC based on NIC interface link status.
SOFTWARE-DEFINED SAN VS. HYPERCONVERGED
ScaleIO has the ability to run hyperconverged, using the local storage of each of the ScaleIO servers and combining it into a shared storage pool. There are no significant network implications when running a hyperconverged instance of ScaleIO; however, you should consider the bandwidth utilization of ScaleIO and any implications it may have for production applications.
VMWARE IMPLEMENTATIONS
VMware-based networking provides all the options of a physical switch, but gives more flexibility within network design. To make use of virtual networking, virtual network configurations must be consistent with the physical network devices connected to the virtual structure. Just as would be necessary without network virtualization, uplinks to physical switches must take into account redundancy and bandwidth. We recommend determining traffic patterns in advance, in order to prevent bandwidth starvation.
SDC
While the ScaleIO SDC is not integrated into the ESX kernel, there is a kernel driver for ESX that implements the ScaleIO client module.
VM-Kernel port
A VM-kernel port is often used for vMotion, storage network, fault tolerance, and management traffic. We recommend giving this network traffic a higher priority than virtual machine traffic, for example virtual machine port group or user-level traffic.
Virtual Machine Port Group
Virtual Machine Port Group can be separate or joined. For example, you can have three virtual machine port groups on the same VLAN. They can be segregated onto separate VLANs, or depending on the number of NICs, they can be on different networks.
VMware Advantages over a Standard Switch
NetIOC provides the ability to prioritize certain types of traffic; for example, storage traffic can be given a higher priority than other types of traffic. This will only work with VMware distributed switches and not standard switches.
This can also be done with Datacenter Bridging (also known as Priority Flow Control), and could be configured with standard QoS; however not all switches support these features.
VALIDATION METHODS
Internal SIO Tools
There are two main built-in tools that monitor network performance:
• SDS Network Test
• SDS Network Latency Meter Test
SDS Network Test
The first test is the SDS network test; please refer to “start_sds_network_test” in the ScaleIO User Manual. Once this test has completed, you can fetch the results with “query_sds_network_test_results.” The test is intended to saturate the maximum bandwidth available in your system.
It is important to note that the network_test_size_gb option should be set to at least twice the maximum network bandwidth. For example: 1 x 10Gb NIC = 1250 MB per second * 2 = 2500 MB, or 3 GB rounded up. In this case you should run the command with “--network_test_size_gb 3”. This ensures that you send enough data across the network to produce a consistent test result, accounting for variability in the system as a whole. The parallel_messages option should be set equal to the total number of cores in your system, with a maximum of 16.
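The sizing rule above can be written as a small helper. This Python sketch follows the arithmetic in the example (1250 MB per second for a 10Gb NIC, doubled, then rounded up to whole GB). The function names are hypothetical, and the assumption that multiple NICs simply add their bandwidth is ours.

```python
import math


def network_test_size_gb(nic_count: int, nic_gbps: float = 10.0) -> int:
    """Test size in GB: at least 2x the maximum network bandwidth, rounded up."""
    max_mb_per_second = nic_count * nic_gbps * 1000 / 8  # 10 Gb -> 1250 MB
    return math.ceil(2 * max_mb_per_second / 1000)


def parallel_messages(total_cores: int) -> int:
    """Parallel messages: total number of cores, capped at 16."""
    return min(total_cores, 16)


if __name__ == "__main__":
    size = network_test_size_gb(nic_count=1)
    messages = parallel_messages(total_cores=8)
    print(f"--network_test_size_gb {size} --parallel_messages {messages}")
```

For a single 10Gb NIC the helper yields 3, matching the worked example in the text.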
Example Output:
scli --start_sds_network_test --sds_ip 10.248.0.23 --network_test_size_gb 8 --parallel_messages 8
Network testing successfully started.
scli --query_sds_network_test_results --sds_ip 10.248.0.23
SDS with IP 10.248.0.23 returned information on 7 SDSs
SDS 6bfc235100000000 10.248.0.24 bandwidth 2.4 GB (2474 MB) per-second
SDS 6bfc235200000001 10.248.0.25 bandwidth 3.5 GB (3592 MB) per-second
SDS 6bfc235400000003 10.248.0.26 bandwidth 2.5 GB (2592 MB) per-second
SDS 6bfc235500000004 10.248.0.28 bandwidth 3.0 GB (3045 MB) per-second
SDS 6bfc235600000005 10.248.0.30 bandwidth 3.2 GB (3316 MB) per-second
SDS 6bfc235700000006 10.248.0.27 bandwidth 3.0 GB (3056 MB) per-second
SDS 6bfc235800000007 10.248.0.29 bandwidth 2.6 GB (2617 MB) per-second
In the example above, you can see the network performance from the SDS you are testing from, to every
other SDS in the network. Ensure that the speed per second is close to the expected performance of your
network configuration.
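Comparing each result against the expected figure can be automated. The Python sketch below parses output in the format shown in the example above and lists any SDS whose measured bandwidth falls below a chosen fraction of the expected value. The parsing is based only on that sample and may need adjusting for other ScaleIO versions; the expected bandwidth and tolerance values are hypothetical.

```python
import re

# A few lines copied from the example output above.
SAMPLE = """\
SDS with IP 10.248.0.23 returned information on 7 SDSs
SDS 6bfc235100000000 10.248.0.24 bandwidth 2.4 GB (2474 MB) per-second
SDS 6bfc235200000001 10.248.0.25 bandwidth 3.5 GB (3592 MB) per-second
SDS 6bfc235400000003 10.248.0.26 bandwidth 2.5 GB (2592 MB) per-second
"""

LINE_RE = re.compile(
    r"SDS\s+\w+\s+(\d+\.\d+\.\d+\.\d+)\s+bandwidth\s+[\d.]+\s+GB\s+\((\d+)\s+MB\)"
)


def parse_bandwidth(text: str) -> dict:
    """Map each remote SDS IP to its measured bandwidth in MB per second."""
    return {m.group(1): int(m.group(2)) for m in LINE_RE.finditer(text)}


def below_expected(results: dict, expected_mb: float, tolerance: float = 0.9) -> list:
    """Return the IPs whose bandwidth is under tolerance * expected_mb."""
    floor = expected_mb * tolerance
    return sorted(ip for ip, mb in results.items() if mb < floor)


if __name__ == "__main__":
    results = parse_bandwidth(SAMPLE)
    # 3000 MB per second is a hypothetical expectation for this sketch.
    print(below_expected(results, expected_mb=3000))
```

Because the test runs from one SDS to every other SDS, the same check would need to be repeated with the output collected from each SDS in turn.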
SDS Network Latency Meter Test
The second tool is "query_network_latency_meters" (for writes only), which can be run at any time and shows the average network latency for each SDS when it communicates with other SDSs. The goal is to make sure that there are no outliers on the latency side and that latency stays low. Note that this can and should be run from each SDS to the other SDSs.
Example Output:
scli --query_network_latency_meters --sds_ip 10.248.0.23
SDS with IP 10.248.0.23 returned information on 7 SDSs
SDS 10.248.0.24
Average IO size: 8.0 KB (8192 Bytes)
Average latency (micro seconds): 231
SDS 10.248.0.25
Average IO size: 40.0 KB (40960 Bytes)
Average latency (micro seconds): 368
SDS 10.248.0.26
Average IO size: 38.0 KB (38912 Bytes)
Average latency (micro seconds): 315
SDS 10.248.0.28
Average IO size: 5.0 KB (5120 Bytes)
Average latency (micro seconds): 250
SDS 10.248.0.30
Average IO size: 1.0 KB (1024 Bytes)
Average latency (micro seconds): 211
SDS 10.248.0.27
Average IO size: 9.0 KB (9216 Bytes)
Average latency (micro seconds): 252
SDS 10.248.0.29
Average IO size: 66.0 KB (67584 Bytes)
Average latency (micro seconds): 418
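Spotting latency outliers in this output can also be scripted. The Python sketch below parses output in the format of the example above and lists any SDS whose average latency exceeds a limit. The 1000 microsecond default reflects the 1ms guideline from the Network Latency section, applied here to the reported average as an assumption. The parsing is based only on the sample shown and may differ between ScaleIO versions.

```python
import re

# A few entries copied from the example output above.
SAMPLE = """\
SDS with IP 10.248.0.23 returned information on 7 SDSs
SDS 10.248.0.24
Average IO size: 8.0 KB (8192 Bytes)
Average latency (micro seconds): 231
SDS 10.248.0.25
Average IO size: 40.0 KB (40960 Bytes)
Average latency (micro seconds): 368
SDS 10.248.0.29
Average IO size: 66.0 KB (67584 Bytes)
Average latency (micro seconds): 418
"""

ENTRY_RE = re.compile(
    r"SDS\s+(\d+\.\d+\.\d+\.\d+)\s+"
    r"Average IO size:\s+[\d.]+\s+KB\s+\((\d+)\s+Bytes\)\s+"
    r"Average latency \(micro seconds\):\s+(\d+)"
)


def parse_latency(text: str) -> dict:
    """Map each remote SDS IP to (average IO size in bytes, average latency in microseconds)."""
    return {m.group(1): (int(m.group(2)), int(m.group(3)))
            for m in ENTRY_RE.finditer(text)}


def over_limit(results: dict, limit_us: int = 1000) -> list:
    """Return the IPs whose average latency exceeds limit_us microseconds."""
    return sorted(ip for ip, (_, latency) in results.items() if latency > limit_us)


if __name__ == "__main__":
    results = parse_latency(SAMPLE)
    print(over_limit(results))       # default 1 ms limit
    print(over_limit(results, 400))  # tighter, hypothetical limit
```

Average IO size is kept in the parsed result because larger I/Os naturally take longer, so latency figures are best compared between SDSs with similar I/O sizes.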
Iperf and NetPerf
NOTE: Iperf and NetPerf should be used to validate your network before configuring ScaleIO. If you identify issues with Iperf or NetPerf, there may be network issues that need to be investigated. If you do not see issues with Iperf/NetPerf, use the ScaleIO internal validation tools for additional and more accurate validation.
Iperf
Iperf is a traffic generation tool, which can be used to measure the maximum possible bandwidth on IP networks. The Iperf feature-set allows for tuning of various parameters and reports on bandwidth, loss, and other measurements.
NetPerf
Netperf is a benchmark that can be used to measure the performance of many different types of networking. It provides tests for both unidirectional throughput, and end-to-end latency.
Network Monitoring
It is important to monitor the health of your network to identify any issues that are preventing your network from operating at optimal capacity, and to safeguard against network performance degradation. There are a number of network monitoring tools available on the market, which offer many different feature sets.
We recommend monitoring the following areas:
• Input and output traffic
• Errors, Discards, Overruns
• Port status
Network Troubleshooting 101
• Ping connectivity end-to-end between SDSs and SDCs
• Test traffic between devices in both directions
• Check for round-trip latency between devices
• Check for port errors/discards/overruns on the host and switch side
• Check MTU across all switches and servers
• Check to make sure LACP/MLAG is disabled
• Check SIO test output
REVISION HISTORY
Date        Version   Author   Change Summary
Nov 2015    1.0       EMC      Initial Document
REFERENCES
ScaleIO User Guide
ScaleIO Installation Guide
ScaleIO ECN community
VMware vSphere 5.5 Documentation Center
EMC ScaleIO for VMware Environment