Big Data Storage: Convergence and Efficiency


IDC TECHNOLOGY SPOTLIGHT


May 2014

Frank Cai, William Zhang, Craig Stires

IDC Opinion

The emergence of the big data age is changing the way businesses operate and how people live. Data assets are increasingly becoming a core competency of companies because big data analysis enables them to capture market dynamics with higher sensitivity and thus detect more sales and marketing opportunities. The explosively growing amounts of data, however, exceed what companies' existing IT infrastructure can handle. For enterprises that have decided to get involved with big data, the very first things to consider are the infrastructure's capability to carry large volumes of data and its time efficiency in data computing. It is therefore crucial that the infrastructure offers high scalability and high performance in carrying big data applications. According to IDC's China market data, more than 60% of investments in big data went to infrastructure in 2013. After all, big data applications can hardly be realized without highly efficient infrastructure.

The big data system is toppling the traditional IT architecture of enterprises, in which storage systems are a core component. The ability to analyze large volumes of varied data types quickly is exactly where the value of big data lies. A global IDC survey shows that volume, variety, and fast processing speed are the three indispensable factors that bring value to big data.

The high efficiency of big data is made possible by a high-performance processing platform and flexible scalability, both of which enterprise customers seek. In addition, for big data applications, enterprises also want to evaluate the accessibility and manageability of the big data infrastructure. In the face of complicated data types and tremendous amounts of data, finding the means to obtain maximum value is a big challenge for IT professionals. In a traditional IT infrastructure, data processing is section-specific and department-specific; that is, the acquisition, storage, backup, and filing of data, as well as data mining and analysis, are conducted by different subsystems. Improving the performance and capacity of a single subsystem can solve only part of the problem. For enterprise users, the big data processing platform can optimally converge all the capabilities of these subsystems with unified view and management,


High efficiency of the storage system. The nature of big data determines that data analysis needs to be highly real-time, and that the big data storage system has to be high-performance. An IDC survey reveals that 68.6% of users regard high performance as the single most important factor when choosing a big data storage system.

High scalability of the storage system. An important characteristic of the data held by a big data storage system is that it is massive in volume and keeps growing fast. The big data storage system needs to be highly scalable to meet the big data system's linearly growing demand for capacity and performance.

Diversified data types supported by the storage system platform. A prominent feature of the big data storage system is the variety of data types it carries, including structured data, unstructured data, semi-structured data, and object data. The big data storage system needs to support and administer a variety of data types simultaneously.

Manageability of the storage system platform and adaptability to the application programming interface (API) standard. The big data infrastructure is extraordinarily complicated, so it is important for the big data storage system to provide unified and simplified administration. In addition, for the sake of centralized, automatic administration of storage and computing, the big data storage system needs to be adaptable to the API standard.

In This Technology Spotlight

In this Technology Spotlight, we discuss the all-in-one infrastructure of the Huawei OceanStor 9000 big data storage system to gain an in-depth understanding of its design in terms of system architecture, data storage, and business operation protection. We also analyze how it meets customer demands across the storage system life cycle in order to maximize value for customers.

Market Overview

According to IDC statistics, by 2020 the amount of data in the entire world will be eight times more than in 2012, and in China it will be 24 times more. The emergence of the big data age is toppling the traditional IT architecture and especially challenging the enterprise storage architecture. The big data storage system not only stores and protects data; it also needs to provide an all-in-one architecture capable of life-cycle management and data analysis optimization, including data storage, filing, analysis, and management. The purpose is to offer powerful support for customers to build a complete big data ecosystem and to reduce the complexity of managing data.

IDC's China market data reveals that 67.2% of investments in big data went to infrastructure in 2013 (see Figure 1). Among all the categories, storage system investment is the highest, with a five-year compound annual growth rate (CAGR) of more than 60%.


Huawei OceanStor 9000 Storage Solution

Distributed File System

WushanFS Distributed File System

WushanFS, a distributed file system that Huawei developed independently, is the core system that supports the OceanStor 9000. The system employs a hardware architecture of fully symmetric distributed clusters, and its networking adopts a full-connection, full-redundancy strategy. Deployed on x86 servers, WushanFS can leverage the latest hardware features to shorten the development cycle. Logically, the WushanFS distributed file system aggregates all disks within the servers to form a resource pool and provide a unified namespace, so that different levels of data redundancy protection can be provided across nodes and across stacks.

The WushanFS distributed file system adopts decentralized technologies, including a cluster metadata service (MDS) and internode data redundancy. File data and management data (metadata) are stored on every node to avoid competition for resources. With this bottleneck removed, storage capacity and computing performance grow linearly as nodes are added. The WushanFS distributed file system has a global cache, the capacity of which also grows linearly as the number of nodes increases. With the global cache, data hit ratio rises greatly

Figure 1

China Big Data Revenue Breakdown, 2013

Source: IDC, 2014

Infrastructure (67.2%), Service (17.9%), Software (14.9%). Total: US$238 Million


High Performance

High performance is a primary characteristic of the big data storage system, and also the cornerstone for the storage, analysis, and filing of big data. The SPECsfs 2008 evaluation of OceanStor 9000 shows that, with 100 nodes and under the Network File System (NFS) protocol, the system reaches 5,030,264 operations per second (OPS), which tops the current standards of the industry. The scale-out architecture of OceanStor 9000 ensures the linear expansion of the system: every node added increases storage capacity and performance without affecting the business. This evaluation assesses the performance of OceanStor 9000 at 10 nodes, 20 nodes, and 40 nodes, demonstrating scale-out and performance growth at different numbers of nodes.

Figure 2

WushanFS Distributed System Architecture

Source: Huawei, 2014

[Diagram labels: Application, Storage; Node A, Node B, Node C, Node D, Node N; Switch, 1G Switch, Management Server, IB/10G Switch]


Seamless Scale-Out

OceanStor 9000 allows dynamic, on-demand node expansion from 3 up to 288 nodes, without interrupting the business during system expansion. OceanStor 9000 can use different types of nodes to support different applications, and supports a single namespace of up to 40 petabytes (PB). IT administrators do not have to manage multiple namespaces, which reduces the complexity of system administration. Eliminating multiple namespaces also breaks down the data silos they cause. OceanStor 9000 is a flexible storage system that is easy to extend in accordance with users' requests. This reduces the total cost of ownership for users while keeping reliability and performance high.

Lean Data Management

InfoProtector High Reliability Data Protection

The data protection technologies of OceanStor 9000 include a distribution model and node redundancy. Data entering the system is divided into N data fragments, and the system then computes M redundant data fragments. The N+M fragments are saved on different nodes. This way, the OceanStor 9000 storage system is able to restore the data not just when disks are faulty, but also when a whole node is faulty. The system can be restored to service as long as the number of faulty nodes does not exceed M.

Figure 3

Huawei OceanStor 9000 SPEC Evaluation

Source: Huawei, 2014

Throughput (ops/sec)    Response (msec)
500323                  1.1
1002290                 1.4
1503680                 1.4
2005790                 1.6
2511513                 1.7
3011686                 1.8
3516148                 2.1
4028616                 2.1
4506490                 2.1
5030264                 2.3
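The N+M protection described under InfoProtector can be illustrated with its simplest case, M = 1, using XOR parity. This is only a sketch under stated assumptions: the source does not say which coding scheme OceanStor 9000 uses, a real system with M greater than 1 would need a more general erasure code, and the function names below are hypothetical.

```python
from functools import reduce
from typing import List, Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def split_fragments(data: bytes, n: int) -> List[bytes]:
    """Split data into n equal-size fragments, zero-padding the tail."""
    size = -(-len(data) // n)  # ceiling division
    padded = data.ljust(size * n, b"\x00")
    return [padded[i * size:(i + 1) * size] for i in range(n)]


def encode(data: bytes, n: int) -> List[bytes]:
    """Return n data fragments plus one XOR parity fragment (N+1)."""
    fragments = split_fragments(data, n)
    parity = reduce(xor_bytes, fragments)
    return fragments + [parity]


def recover(fragments: List[Optional[bytes]]) -> List[bytes]:
    """Rebuild at most one missing fragment (None) from the survivors."""
    missing = [i for i, f in enumerate(fragments) if f is None]
    if len(missing) > 1:
        raise ValueError("single parity tolerates only one lost fragment")
    if missing:
        survivors = [f for f in fragments if f is not None]
        fragments[missing[0]] = reduce(xor_bytes, survivors)
    return fragments


# Each fragment would live on a different node; here a list stands in.
stored = encode(b"big data storage", n=4)  # 4 data fragments + 1 parity
stored[2] = None                           # simulate the loss of one node
restored = recover(stored)
original = b"".join(restored[:4]).rstrip(b"\x00")
```

Losing two fragments makes `recover` raise an error, which mirrors the limit that no more than M fragments can be lost.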


Info Tier Automatic Storage Tiering

The Info Tier automatic storage tiering of OceanStor 9000 allows physical nodes within one file system to be classified into different node pools. Nodes with the same characteristics, physical types, and access performance fall into one node pool. A file pool is a collection of files that share certain features, for instance, files larger than 100 MB. The file pool policy defines which node pool a file pool is stored in, for instance, that files larger than 100 MB are stored in a designated node pool. File pools and node pools are correlated through the file pool policy.

Automatic storage tiering allows users to define the value of data in a workflow by setting the file pool policy. Files of high value or importance are stored on devices of high usability and performance, which in turn are more expensive, while files of low value or importance are stored on devices of low cost and relatively low performance and usability.
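The relationship between file pools, node pools, and the file pool policy can be sketched as follows. This is a minimal illustration, not Huawei's interface: the class names, the pool names, and the choice to send large files to a capacity-oriented pool are all assumptions. Only the 100 MB size rule comes from the text.

```python
from dataclasses import dataclass
from typing import Callable, List

MB = 1024 * 1024


@dataclass
class FileInfo:
    path: str
    size_bytes: int


@dataclass
class FilePoolPolicy:
    """Ties a file pool (files matching a rule) to a node pool."""
    name: str
    matches: Callable[[FileInfo], bool]
    node_pool: str


# Hypothetical policy list, evaluated in order; the first match wins.
POLICIES: List[FilePoolPolicy] = [
    FilePoolPolicy(
        name="large-files",
        matches=lambda f: f.size_bytes > 100 * MB,
        node_pool="capacity-pool",
    ),
]
DEFAULT_NODE_POOL = "performance-pool"


def place(file: FileInfo) -> str:
    """Return the node pool in which the file should be stored."""
    for policy in POLICIES:
        if policy.matches(file):
            return policy.node_pool
    return DEFAULT_NODE_POOL


video_pool = place(FileInfo("/media/raw_footage.mxf", 500 * MB))
log_pool = place(FileInfo("/logs/app.log", 10 * 1024))
```

Keeping the rule as a predicate means a policy could just as easily match on file age or type, which is how a policy change would trigger migration between pools.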

Dynamic storage tiering is triggered by the file pool policy and migrates data automatically across nodes. The whole process is transparent to business access and will not affect normal operation.

InfoEqualizer Load Balancing Technology

OceanStor 9000 InfoEqualizer client-connection load balancing is a domain name system (DNS)-based technology. At the domain name resolution stage, the OceanStor 9000 load balancing service uses an algorithm to assign one node in the cluster for the user's access, and the subsequent business data interaction is completed between the client and this node.

The OceanStor 9000 load balancing service is designed as a cluster system. When the cluster starts, one node is elected as the main node through the Paxos algorithm. There is only one main node throughout the load balancing process. Each node in the cluster collects load status information and reports it regularly to the main node, including the number of CPU cores, CPU clock speed, RAM size, network adapter status, CPU utilization, RAM utilization, network throughput, and NAS client connection status. The main node conducts load balancing based on the information collected. OceanStor 9000 supports a unified service access domain name, and the domain name query service is integrated into the load balancing service. When a user issues a domain name query, the load balancing service performs computations based on the configured load balancing policy and returns an appropriate node to the client for access to the OceanStor 9000 system.
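The report-then-resolve flow described above can be sketched as follows. This is an illustration under assumptions: the equal-weight scoring formula, the connection cap, and all class and function names are hypothetical, and neither the Paxos election nor the DNS wire protocol is modeled. Only the idea that nodes report load and the main node answers name queries with a suitable node comes from the text.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class LoadReport:
    cpu_utilization: float      # 0.0 to 1.0
    ram_utilization: float      # 0.0 to 1.0
    network_utilization: float  # share of link capacity in use, 0.0 to 1.0
    client_connections: int


def load_score(report: LoadReport, max_connections: int = 1000) -> float:
    """Lower is better. Equal weighting is an illustrative assumption."""
    connection_share = min(report.client_connections / max_connections, 1.0)
    return (
        report.cpu_utilization
        + report.ram_utilization
        + report.network_utilization
        + connection_share
    ) / 4


class MainNode:
    """Stands in for the elected main node of the load balancing cluster."""

    def __init__(self) -> None:
        self.reports: Dict[str, LoadReport] = {}

    def report(self, node_ip: str, load: LoadReport) -> None:
        """Record the latest periodic load report from a node."""
        self.reports[node_ip] = load

    def resolve(self, domain: str) -> str:
        """Answer a name query with the address of the least-loaded node.

        The cluster exposes one unified access domain, so in this sketch
        the queried name does not influence the choice.
        """
        if not self.reports:
            raise RuntimeError("no node has reported its load yet")
        return min(self.reports, key=lambda ip: load_score(self.reports[ip]))


main = MainNode()
main.report("192.0.2.11", LoadReport(0.90, 0.80, 0.70, 500))
main.report("192.0.2.12", LoadReport(0.20, 0.30, 0.10, 50))
chosen = main.resolve("storage.example.com")
```

After the client receives the address, it talks to that node directly, so the main node stays out of the data path.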

Architecture of All-in-One

Life-Cycle Management of Data

One of the biggest challenges for big data is the life-cycle management of massive data. End users often face the challenge of how to administer online, nearline, and offline data with high efficiency and low cost. Huawei has proposed an innovative storage architecture that combines storage, analysis, and filing, and applies automated full life-cycle management to information. By combining high hardware performance with an all-in-one product design that integrates the storage, analysis, and filing of big data, Huawei OceanStor 9000 effectively improves the efficiency of data life-cycle management.

Huawei OceanStor 9000 is capable of providing full life-cycle management on one platform and deploying a variety of business applications within one architecture, thus giving a holistic view of data flow. Full life-cycle management of online, nearline, and offline data simplifies management and reduces operating cost.


In addition, full life-cycle management on the same platform avoids the system cost and data loss risks of data migration, making dynamic storage tiering more efficient to use. OceanStor 9000 provides abundant interfaces, including NFS, CIFS, POSIX, HDFS, MR, and SQL. Through the management software, OceanStor 9000 provides unified device management, NFS/CIFS/POSIX/HDFS file writes, MR/SQL queries, and life-cycle management of the data. The application logic is therefore streamlined and business development is sped up.

Analysis-Oriented Optimization

Data analysis systems usually fall into two categories. One is structured data analysis, which developed from relational databases; the other is unstructured data analysis based on Hadoop. Both types have downsides. For structured data analysis, most data analysis products require data to be processed and sorted before entering the database. For unstructured data analysis, Hadoop is developer-dependent and cannot be delivered directly, so secondary development is needed for reliability, usability, and functionality.

Huawei OceanStor 9000 big data analysis subsystem is based on distributed storage, and incorporates

Figure 4

OceanStor 9000 Full Life-Cycle Data Management

Source: Huawei, 2014

[Diagram labels: Unstructured Data, Object Data, Structured Data; Big data sharing, BI, Big data analysis, Behavior Forecast, Instant Analysis, HPC, Internet, Media Assets; NFS, CIFS, HDFS, Object, MR/HBase, SQL; Distributed Database, Enterprise Hadoop; Distributed File System; Node]


Huawei's big data platform performs well in speed and extensibility, and allows users to define metadata themselves. To speed up access, the system also supports metadata retrieval over both system metadata and the metadata of the business files loaded into the system.
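Retrieval over both system metadata and user-defined metadata can be pictured with a small in-memory catalog. This is a sketch only: the `MetadataCatalog` class, its methods, and the sample keys are hypothetical and do not describe the OceanStor 9000 interface.

```python
from typing import Any, Dict, List, Optional


class MetadataCatalog:
    """In-memory index of per-file metadata (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self._entries: Dict[str, Dict[str, Any]] = {}

    def add(self, path: str, system: Dict[str, Any],
            user: Optional[Dict[str, Any]] = None) -> None:
        # User-defined keys are stored alongside system keys; this sketch
        # assumes the two sets of key names do not collide.
        self._entries[path] = {**system, **(user or {})}

    def search(self, **criteria: Any) -> List[str]:
        """Return sorted paths whose metadata equals every given criterion."""
        return sorted(
            path for path, metadata in self._entries.items()
            if all(metadata.get(key) == value for key, value in criteria.items())
        )


catalog = MetadataCatalog()
catalog.add("/media/clip01.mxf", {"size_bytes": 2_000_000, "owner": "alice"},
            {"project": "news", "camera": "A"})
catalog.add("/media/clip02.mxf", {"size_bytes": 5_000_000, "owner": "bob"},
            {"project": "news", "camera": "B"})
catalog.add("/hpc/run42.dat", {"size_bytes": 9_000_000, "owner": "alice"},
            {"project": "seismic"})

news_clips = catalog.search(project="news")                 # user-defined key
alice_news = catalog.search(project="news", owner="alice")  # user + system keys
```

Searching the catalog avoids walking the file tree, which is the kind of speed-up that metadata retrieval is meant to provide.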

Opportunities and Challenges of OceanStor 9000

IDC believes that big data is at a nascent stage, which makes investment in infrastructure deployment and data management the first step. IDC sees huge market space for big data storage and forecasts that the compound annual growth rate of big data storage will stand as high as 40.2% from 2012 to 2017. Huawei has gained an advantage in the market as a pioneer in big data storage by introducing the OceanStor 9000.
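To put the forecast growth rate in perspective, the arithmetic below compounds 40.2% per year over the five years from 2012 to 2017. The function name is illustrative, and the result is a multiple of the 2012 market size, not a figure published by IDC.

```python
def compound_growth(rate: float, years: int) -> float:
    """Return the growth multiple after compounding `rate` for `years` years."""
    return (1.0 + rate) ** years


multiple = compound_growth(0.402, 2017 - 2012)
print(f"{multiple:.2f}x the 2012 market size")  # prints "5.42x the 2012 market size"
```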

IDC, however, has also noticed that Huawei will face some challenges in promoting its big data solution. As the ultimate goal of big data is to extract and analyze valuable data from large, fast-expanding data warehouses, the storage and management of data is only the first step. Huawei needs to provide its clients with an end-to-end solution that runs from the infrastructure up to the processing and analysis platform and is loaded with industry modeling, data mining, and business analysis software to truly realize the value of big data. To maximize the value of the data, business models should be built and insights gained for each industry. Therefore, Huawei needs to partner with ISVs that have knowledge of vertical industries to construct a complete ecosystem.

Conclusion

At this stage, a successful infrastructure is the first step toward advancing business operations and enhancing organizations' competitiveness. With the onset of the big data era, building a low-cost, highly efficient, multifunction infrastructure is an important step in addressing the challenges of the future. In this Technology Spotlight, IDC summarizes for IT administrators the features of big data storage for full life-cycle management in terms of performance, usability, and open protocols. In addition, IT managers evaluating storage solutions should pay attention to the following reminders:

Consider the reliability of the storage systems within the framework of cost

Apply lean management when reliability of the storage system is ensured

Think about the security along with the efficiency of the storage system

ABOUT THIS PUBLICATION

This publication was produced by IDC Go-to-Market Services. The opinion, analysis, and research results presented herein are drawn from more detailed research and analysis independently conducted and published by IDC, unless specific vendor sponsorship is noted. IDC Go-to-Market Services makes IDC content available in a wide range of formats for distribution by various companies. A license to distribute IDC content does not imply endorsement of or opinion about the licensee.

COPYRIGHT AND RESTRICTIONS

Any IDC information or reference to IDC that is to be used in advertising, press releases, or promotional materials requires prior written approval from IDC. For permission requests contact the GMS information line at +8610-5889-1758 or gms@idc.com. Translation and/or localization of this document requires an additional license from IDC.

For more information on IDC visit www.idc.com. For more information on IDC GMS visit www.idc.com/gms. IDC China Office: Room 1206, Tower D, Global Trade Center No.36 North 3rd Ring Road, Beijing 100013 P.+8610.5889.1666 F. +8610.5889.1777 www.idc.com.cn
