Best Practices for Deploying
WAN Optimization with Data Replication
Keys for Successful
Data Protection
Across the WAN
The Weak Link in Data Protection
All too often, the Wide Area Network (WAN) link is the weak link in data protection. Limited bandwidth, high latency, lost packets, and out of order packets can all jeopardize strategic data replication and backup initiatives, resulting in missed Recovery Point Objectives and Recovery Time Objectives (RPO/RTO). As data volumes grow, and as the distance between data centers increases to protect business data from catastrophic disasters, there is increasing pressure being placed on the WAN. This has heightened the demand for optimization tools that can improve data replication times across the WAN while maximizing bandwidth efficiency during these processes.
There are unique requirements for deploying WAN optimization in a disaster recovery (“DR”) environment. Replication, for example, involves a high volume of sustained traffic which is highly susceptible to lost and out-of-order packets. Transfer times are very well defined to fit within allocated windows, and DR traffic is often encrypted to protect sensitive business data. Replication solutions can run over TCP, UDP and in some cases proprietary and encapsulated transport protocols (depending on the solution), and often have their own optimization techniques that can affect the performance of downstream WAN optimization devices. By understanding these requirements and establishing guidelines for addressing them, WAN optimization can be deployed with maximum effectiveness. As such, WAN optimization can live up to its potential as a key enabler for strategic disaster recovery initiatives.
Asking the Right Questions
The first step when deploying any networking solution is to ask the right questions. When it comes to WAN optimization, the following questions should be at the forefront of the evaluation process:
• Know your applications and protocols – It is impossible to measure results without first baselining the existing situation. How much traffic is being generated in an “average” replication stream or backup process? How much traffic is generated in a single delta set? What are peak loads and how many times a day are they being hit? How long does it take to transfer the data from point A to point B?
• Know your network – It is easy to determine how much bandwidth you are paying for, but it is harder to know how much throughput is really being achieved across the WAN. In many environments, such as MPLS and IP-VPNs, packets are lost or delivered out of order due to router oversubscription. This can lead to excessive re-transmissions, which can drop the effective throughput (aka “goodput”) of traffic across the WAN. As Figure 1 shows, just a small amount of packet loss (0.075%) can drop goodput to less than 5 Mbps – regardless of how much bandwidth is actually available on the WAN link. Most large backup/replication processes cannot recover from such a significant drop in throughput. As a result, WAN optimization technologies like Forward Error Correction (FEC) and Packet Order Correction (POC) are extremely valuable in some replication/backup environments.
The more quantifiable information one can collect, the easier it will be to size an appropriate WAN optimization device and gauge the level of improvement it provided. Think ahead – factor in an expected rate of data growth to ensure that a WAN optimization solution can grow with evolving replication/backup needs.
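As a rough illustration of the sizing arithmetic described above, the following Python sketch estimates the sustained throughput needed to move a delta set within its allocated window, today and after a period of data growth. The delta-set size, window length, and growth rate are hypothetical figures chosen for the example, and 1 MB is taken as 10^6 bytes.

```python
def required_mbps(delta_mb: float, window_min: float) -> float:
    """Sustained throughput (Mbps) needed to move delta_mb megabytes
    within window_min minutes (1 MB = 10^6 bytes, 1 Mbps = 10^6 bit/s)."""
    return delta_mb * 8 / (window_min * 60)


def projected_mb(delta_mb: float, annual_growth: float, years: int) -> float:
    """Delta-set size after compound annual growth."""
    return delta_mb * (1 + annual_growth) ** years


# Hypothetical baseline: a 45,000 MB delta set and a 60-minute window.
today = required_mbps(delta_mb=45_000, window_min=60)

# Hypothetical growth assumption: 30% per year for 3 years.
future = required_mbps(projected_mb(45_000, 0.30, 3), window_min=60)

print(f"required today: {today:.0f} Mbps")
print(f"required in 3 years: {future:.0f} Mbps")
```

Comparing the two figures against the throughput actually achieved across the WAN gives a first estimate of how much headroom a WAN optimization device would need to provide.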
It is also extremely valuable to know how the replication/backup applications work. Do they run over TCP (e.g. EMC SRDF/A, NetApp SnapMirror/SnapVault, Double-Take) or UDP (e.g., Veritas Volume Replicator, EMC CLARiiON disk library, Aspera/Isilon)? Are proprietary or encapsulated protocols being used, as is the case with some Fibre Channel over IP (FCIP) implementations? If anything other than standard TCP is being used to communicate between host devices, make sure that WAN optimization appliances can support those protocols.
In addition to bandwidth and packet loss, latency can be a silent killer for many DR applications. There is no getting around the laws of physics – when there is a significant distance between target and host devices, it will take time for packets to travel back and forth across the WAN. This problem is only getting worse as enterprises look to extend the geographic distances between data centers, be it for better protection from catastrophic disasters or to take advantage of cheaper power in rural environments. In many instances, TCP acceleration techniques like selective acknowledgements and adjustable window sizing will help address latency challenges across the WAN. If a WAN upgrade is underway, which includes switching to a new WAN technology (e.g. MPLS or IP-VPN) or building out a new data center, it is advisable to simulate potential WAN conditions as part of the WAN optimization evaluation process. A good WAN emulator will effectively reproduce bandwidth, latency, packet loss, and out-of-order packets to provide a real-world experience.
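The interaction between loss, latency, and goodput can be approximated with a short calculation. The Python sketch below uses the widely cited Mathis et al. approximation for a single standard TCP flow (rate is at most MSS/RTT times C divided by the square root of the loss rate, with C about 1.22) together with the classic window/RTT bound. It is not a reproduction of Figure 1: the 100 ms round-trip time, 1460-byte segment size, 64 KB window, and 155 Mbps link are assumptions chosen for illustration.

```python
import math


def loss_limited_mbps(mss_bytes: int, rtt_ms: float, loss_rate: float,
                      c: float = 1.22) -> float:
    """Mathis et al. approximation: rate <= (MSS / RTT) * C / sqrt(p)."""
    rtt_s = rtt_ms / 1000
    return (mss_bytes * 8 / rtt_s) * c / math.sqrt(loss_rate) / 1e6


def window_limited_mbps(window_bytes: int, rtt_ms: float) -> float:
    """At most one window of data can be in flight per round trip."""
    return window_bytes * 8 / (rtt_ms / 1000) / 1e6


def tcp_goodput_mbps(link_mbps: float, mss_bytes: int, rtt_ms: float,
                     loss_rate: float, window_bytes: int) -> float:
    """A single flow is bounded by the tightest of the three limits."""
    return min(link_mbps,
               loss_limited_mbps(mss_bytes, rtt_ms, loss_rate),
               window_limited_mbps(window_bytes, rtt_ms))


# Assumed conditions: 155 Mbps link, 100 ms RTT, 0.075% loss, 64 KB window.
by_loss = loss_limited_mbps(1460, 100, 0.00075)
by_window = window_limited_mbps(65_535, 100)
goodput = tcp_goodput_mbps(155, 1460, 100, 0.00075, 65_535)

print(f"loss-limited:   {by_loss:.1f} Mbps")
print(f"window-limited: {by_window:.1f} Mbps")
print(f"estimated goodput on a 155 Mbps link: {goodput:.1f} Mbps")
```

Under these assumed values both limits land on the order of 5 Mbps, far below the link rate, which is why adding bandwidth alone does not help a flow that is bounded by loss or latency.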
• Know your limits – How many simultaneous flows are generated during a typical replication process? How many are generated when multiple processes are taking place simultaneously, such as the backup of dozens of remote branch offices? If other traffic is using the same WAN as your DR traffic, how many flows is it generating?
By understanding the quantity of flows across the network one can ensure that WAN optimization devices handle the volume appropriately. Be sure to understand how a WAN optimization device reacts when its flow limits are reached. Is traffic blocked when limits are reached, or sent through un-accelerated? In addition to the above, it is useful to know if there are throughput limits being placed on individual flows by routers, firewalls and other network elements. A router, for example, may limit the amount of throughput per flow to ensure that all flows get serviced properly. Or, a firewall may limit the amount of throughput per flow to prevent malicious applications from hogging precious bandwidth. Whether deliberately set or not, throughput limits can wreak havoc on high volume traffic and should be addressed accordingly (either through reconfiguration of the network element or through WAN optimization techniques, like packet striping).
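The effect of a per-flow throughput limit is easy to quantify. The following Python sketch, with hypothetical figures, shows how a per-flow cap bounds aggregate throughput regardless of link size, and how many parallel flows would be needed to fill a link, which is the idea behind spreading traffic across flows.

```python
import math


def aggregate_mbps(flows: int, per_flow_cap_mbps: float,
                   link_mbps: float) -> float:
    """Total throughput when every flow is capped by a network element."""
    return min(flows * per_flow_cap_mbps, link_mbps)


def flows_needed(target_mbps: float, per_flow_cap_mbps: float) -> int:
    """Minimum number of parallel flows needed to reach target_mbps."""
    return math.ceil(target_mbps / per_flow_cap_mbps)


# Hypothetical case: 4 replication flows, a 10 Mbps per-flow cap, 100 Mbps WAN.
achieved = aggregate_mbps(flows=4, per_flow_cap_mbps=10, link_mbps=100)
needed = flows_needed(target_mbps=100, per_flow_cap_mbps=10)

print(f"achieved: {achieved} Mbps of 100 Mbps")
print(f"flows needed to fill the link: {needed}")
```

In this example the four flows can never exceed 40 Mbps on a 100 Mbps link, so either the cap must be raised on the network element or the traffic must be carried over more flows.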
Best Practice Configuration Guidelines
Many WAN optimization techniques (e.g., data reduction, QoS, compression, latency mitigation, and loss mitigation) work transparently to storage devices and DR software. During normal operations, the storage medium should not even know that the traffic sent across the WAN is being accelerated. However, different deployment scenarios can result in different levels of performance across the WAN.
The following configuration guidelines help to maximize end-to-end performance when performing data replication across a WAN:
• Compression / de-duplication. Many storage devices perform basic payload compression (e.g. LZ). This does not prevent downstream WAN optimization devices from working, but it can reduce the overall effectiveness of these devices by limiting visibility into “raw” data. Because this functionality is not unique to the storage device – i.e. most WAN optimization devices can perform the same or better compression than a storage array – it is typically recommended that this functionality be turned off in the array. This typically leads to better overall net performance from a compression standpoint. In addition, because compression is very CPU intensive, moving this functionality off the host (array) and onto a dedicated WAN optimization appliance can result in better scalability within the storage medium.
De-duplication is a slightly different story, as in many environments this functionality is desired within the storage medium and turning it off is not an option. As long as the WAN optimization device has byte-level granularity when doing its own data reduction, working with de-duplication should not be a problem. In fact, expect an additional 10-20x performance improvement when WAN data reduction is performed in conjunction with de-duplication.
This is particularly true when multiple applications are sent across the same WAN because the optimization device has a larger data set to sample and match from. For example, if someone sends an email across the WAN it will be fingerprinted and stored by a WAN optimization device performing data reduction. When that email is backed up, the WAN optimization device will have already seen the data, leading to immediate data reduction benefits. In contrast, this might be the first time that the storage device is backing up the data, so de-duplication may be minimal. As one can expect, having a WAN optimization device + de-duplication in the mix yields the best overall net results.
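The email example above can be sketched in a few lines of Python. This is a deliberately simplified model, not any vendor's data reduction algorithm: it assumes fixed-size chunks, SHA-256 fingerprints, and a 32-byte reference sent in place of any chunk already seen. The class name `ReductionCache` and the chunk size are hypothetical.

```python
import hashlib

CHUNK = 256  # bytes per chunk; small on purpose so the example stays readable


class ReductionCache:
    """Toy fingerprint store shared by all traffic crossing the WAN."""

    def __init__(self) -> None:
        self.seen = set()

    def wire_bytes(self, payload: bytes) -> int:
        """Return how many bytes would cross the WAN for this payload."""
        total = 0
        for i in range(0, len(payload), CHUNK):
            chunk = payload[i:i + CHUNK]
            digest = hashlib.sha256(chunk).digest()
            if digest in self.seen:
                total += len(digest)      # send a 32-byte reference only
            else:
                self.seen.add(digest)     # first sighting: send and remember
                total += len(chunk)
        return total


# Hypothetical email body with non-repeating content.
email = b"".join(f"line {i}: quarterly report figures\n".encode()
                 for i in range(500))

cache = ReductionCache()
first = cache.wire_bytes(email)    # the email is sent across the WAN
second = cache.wire_bytes(email)   # the same data later appears in a backup

print(f"payload: {len(email)} bytes")
print(f"first transfer:  {first} bytes on the wire")
print(f"second transfer: {second} bytes on the wire")
```

Because this sketch cuts chunks at fixed offsets, it would miss the match if the backup stream shifted the email by even one byte, which is one reason byte-level granularity matters in a real data reduction engine.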
• Encryption. When communicating across a WAN, many enterprises look to encryption as a necessary tool for protecting sensitive information. However, when WAN optimization is deployed with a storage medium, one must be careful as to where this encryption takes place. When encryption takes place “upstream” of the WAN optimization device, special actions must be taken to terminate the encryption session on the appliance, decrypt the traffic, optimize the traffic, and then re-encrypt the traffic. Otherwise, the WAN optimization device does not have visibility into the data and cannot perform its optimization functions. Because this process can be difficult to coordinate and can have an adverse effect on performance, it is generally not recommended unless encryption is absolutely necessary at the source. Instead, it is recommended that encryption be left to the WAN optimization device.
Best practices dictate that WAN optimization devices perform two types of encryption. One is encryption of data at rest (i.e., data stored on the appliance). The other is encryption of data sent between appliances. The former is particularly needed when the WAN optimization device is using local disk drives for data reduction, which can store several terabytes worth of information. The latter is most often needed on shared networks, such as IP VPNs, where IPsec and other VPN solutions can provide an added layer of security. In both scenarios, it is recommended that encryption take place in dedicated hardware so as not to impact performance.
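The loss of visibility caused by upstream encryption can be demonstrated with a small Python sketch. It assumes that well-encrypted data is statistically indistinguishable from random bytes, so `os.urandom` is used here as a stand-in for ciphertext rather than a real cipher, and zlib stands in for whatever compression an appliance might apply.

```python
import os
import zlib


def compressed_ratio(data: bytes) -> float:
    """Compressed size divided by original size (lower is better)."""
    return len(zlib.compress(data)) / len(data)


# Hypothetical replication payload with the repetition typical of records.
plain = b"customer record: name, address, balance\n" * 2000

# Stand-in for the same payload after upstream encryption (assumption:
# good ciphertext looks like random bytes to a compressor).
cipher_like = os.urandom(len(plain))

plain_ratio = compressed_ratio(plain)
cipher_ratio = compressed_ratio(cipher_like)

print(f"clear text compresses to {plain_ratio:.1%} of its size")
print(f"ciphertext-like data compresses to {cipher_ratio:.1%} of its size")
```

The clear-text payload shrinks dramatically while the ciphertext-like payload does not shrink at all, which is why optimization should happen before encryption rather than after it.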
• High Availability. When WAN optimization is used for disaster recovery, it takes on an increased element of importance. Poor transfer times can mean failed replication/backup processes, which means business information is placed at a higher level of risk. To avoid this scenario, it is often recommended that WAN optimization be deployed in a redundant configuration when used as part of disaster recovery operations.
Redundancy between appliances is typically achieved using standard redirection techniques, like Policy Based Routing (PBR) and Web Cache Communication Protocol (WCCP), which can be used to redirect traffic in the event of a problem. Redundant power, disk drives and other modules will help ensure maximum uptime within the appliance.
Figure 2: Common redirection techniques can be used to deploy WAN optimization appliances in a redundant fashion
Understanding Success Criteria
With the above information, one can effectively define criteria for a successful WAN optimization evaluation. More specifically, enterprises can collect quantitative data that will justify whether an investment in WAN optimization is the correct choice for their DR environment. Specific items to look for include:
• Reduced transfer times. How much faster is the replication/backup process? This is easy to measure, and easy to compare to baseline numbers (assuming they were collected prior to deploying WAN optimization).
• Increased LAN-side throughput. In many instances, removing a WAN bottleneck enables more data to be sent from the storage medium. This means that more data can be protected within allocated windows.
• Improved WAN utilization (i.e. more “virtual bandwidth”). If LAN-side throughput is constant, then WAN utilization should go down when using WAN optimization. However, in many instances LAN-side throughput goes up, which can result in an increase in overall WAN traffic. This may seem counter-intuitive to the goal of WAN optimization, but it actually means that one is getting more efficient usage out of available WAN bandwidth.
As the last point indicates, removing a WAN bottleneck can actually expose bottlenecks elsewhere in an enterprise. For example, a bad server NIC or outdated LAN hub may have worked “fine” when WAN throughput was limited to 10 Mbps, but they may strain to keep up with a WAN that can now handle 100 Mbps of traffic. Similarly, replication software can be physically limited in the amount of data it can push out, or it might have been manually configured to limit throughput based on WAN conditions. This may lead to sub-optimal performance gains when WAN optimization is deployed. For example, the amount of traffic on the WAN might be significantly reduced with WAN optimization, but transfer times across the WAN may not show a significant improvement. This might be something that can be corrected with minor configuration changes in the storage medium, or it may be a fundamental limitation of that medium.
Lastly, it is important to point out the importance of effective management tools when evaluating, and subsequently deploying a WAN optimization solution. These will help baseline existing network and application behavior, optimize configuration for seamless deployment, and monitor behavior on an ongoing basis to assess performance over time.
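The success criteria above reduce to a few simple calculations once baseline and optimized measurements are in hand. The Python sketch below shows one way to compute them. The measurement values are hypothetical, and data reduction is expressed here as the share of baseline WAN traffic that was eliminated.

```python
def data_reduction_pct(baseline_wan_mb: float, optimized_wan_mb: float) -> float:
    """Share of baseline WAN traffic eliminated by optimization, in percent."""
    return (baseline_wan_mb - optimized_wan_mb) / baseline_wan_mb * 100


def time_improvement_min(baseline_min: float, optimized_min: float) -> float:
    """Minutes saved per transfer (baseline time minus optimized time)."""
    return baseline_min - optimized_min


def lan_throughput_mbps(lan_mb: float, minutes: float) -> float:
    """LAN-side throughput: application data moved per unit of time."""
    return lan_mb * 8 / (minutes * 60)


# Hypothetical measurements for one "average" delta set of 10,000 MB.
reduction = data_reduction_pct(baseline_wan_mb=10_000, optimized_wan_mb=2_500)
saved = time_improvement_min(baseline_min=120, optimized_min=40)
before = lan_throughput_mbps(10_000, 120)
after = lan_throughput_mbps(10_000, 40)

print(f"data reduction: {reduction:.0f}%")
print(f"time improvement: {saved:.0f} minutes")
print(f"LAN-side throughput: {before:.1f} -> {after:.1f} Mbps")
```

Tracking LAN-side throughput alongside WAN traffic helps separate the two effects described above: less traffic on the WAN for the same data, and more data protected in the same window.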
Making the Business Case
Faster transfer times and higher LAN/WAN throughput mean better RPO. The more data that can be pumped out by the storage device and subsequently sent across the WAN, the more data can be protected in a given period of time.
Faster transfer times also mean better RTO. WAN optimization not only accelerates replication and backup functions, but it ensures that transfers in the opposite direction – i.e. during a recovery – also happen as quickly as possible.
What is the value placed on improved RPO and RTO? How much is it worth to protect more data and recover it faster? Do these benefits outweigh the investment in WAN optimization equipment?
Consider the alternative – adding more WAN bandwidth. This may seem like the path of least resistance when data protection is not performing as desired across the WAN, but it has several major drawbacks.
For starters, it assumes that bandwidth is the only issue that needs to be addressed when doing replication and backup across the WAN. However, if packet loss, packet ordering, and latency are also issues, adding more bandwidth will not solve the problem. (In fact, loss is often exacerbated as WAN links grow in size.) Secondly, in many regions it can take quite a long time to get a large WAN connection ordered and provisioned from a service provider. If problems exist today, waiting several months for an OC-3 or OC-12 circuit may not be a viable option.
Lastly, when all factors are considered, the cost of adding more WAN bandwidth is often significantly higher than the cost of deploying WAN acceleration. Aside from a dramatic increase in recurring bandwidth expenditures (30-60% on average), routers and other network equipment may have to be added or upgraded, the storage medium may need to be upgraded, new licenses might be required in the replication/backup software to accommodate additional WAN links, and new operational expenditures might be required to handle the added complexity of new and larger WAN connections. One can argue that bandwidth expenditures are decreasing over time, but the recurring costs are still significant and the tangential costs associated with upgrading the WAN can be quite substantial.
In the end, WAN optimization offers the best performance improvements in disaster recovery environments with the lowest total cost of ownership. When deployed correctly, the benefits of WAN optimization are very tangible – from improved data transfer times to more efficient usage of available WAN bandwidth. When best practice recommendations are followed, WAN optimization is an indispensable tool in day-to-day disaster recovery operations.
Appendix A - Baseline Information
Application: ________________________________________
Traffic generated in an “average” transfer ________________________________________ MB
Actual time to transfer an “average” delta set ________________________________________ mins
Desired time to transfer an “average” delta set ________________________________________ mins
Transport protocols used (circle one) _____ TCP / _____ UDP / Other ____________________
Total WAN Bandwidth ________________________________________ Mbps
Average latency ________________________________________ ms
Peak latency ________________________________________ ms
Average packet loss ________________________________________ %
Peak packet loss ________________________________________ %
Number of simultaneous TCP flows ________________________________________
Throughput limit per flow ________________________________________ Mbps
Appendix B - Configuration Checklist
Is compression disabled upstream of the WAN optimization device? Yes _ No _
Is de-duplication being performed in the replication/backup software? Yes _ No _
Is encryption disabled upstream of the WAN optimization device? Yes _ No _
Are the WAN optimization appliances deployed inline or out-of-path? ________________________________________
If out-of-path, what redirection technique is being used? ________________________________________
Appendix C - WAN Optimization Results Comparison
Amount of traffic sent across WAN during data transfer
Vendor ____________________ ____________________ MB
Vendor ____________________ ____________________ MB
% data reduction ((amount of baseline traffic – amount of optimized traffic) / amount of baseline traffic * 100)
Vendor ____________________ ____________________ %
Vendor ____________________ ____________________ %
Amount of LAN-side traffic during data transfer
Vendor ____________________ ____________________ MB
Vendor ____________________ ____________________ MB
Time to transfer an “average” delta set
Vendor ____________________ ____________________ mins
Vendor ____________________ ____________________ mins
Time improvement (baseline time – optimized time)
Vendor ____________________ ____________________ mins
Vendor ____________________ ____________________ mins
Appendix D - WAN Optimization Feature Comparison
Feature    Requirement (Y/N)    Vendor: ________________    Vendor: ________________
Performance
Maximum WAN Bandwidth Supported (All optimization features enabled)
TCP Flows Supported
Disk Size (Data Reduction)
Applications
Bulk TCP (e.g. backup, file, email, web)
Real-time TCP (e.g. Replication, SQL, Citrix)
UDP (e.g. Replication, VoIP, Video)
Proprietary (e.g. FCIP)
Security
Encryption across WAN?
Encryption of data on appliance?
Hardware acceleration
Loss Mitigation
Forward Error Correction
Packet Order Correction
Deployment
In band
WCCP
PBR
VRRP