• No results found

東 海 大 學 資 訊 工 程 研 究 所 碩 士 論 文

N/A
N/A
Protected

Academic year: 2021

Share "東 海 大 學 資 訊 工 程 研 究 所 碩 士 論 文"

Copied!
66
0
0

Loading.... (view fulltext now)

Full text

(1)

東海大學

資訊工程研究所

碩士論文

指導教授

:

楊朝棟博士

以異質儲存技術實作一個軟體定義儲存服務

Implementation of a Software-Defined Storage Service with

Heterogeneous Storage Technologies

研究生

:

連威翔

(2)
(3)

雲端技術的應用和服務與日俱增,無論是政府、企業或是個人都可能有建構雲 端系統的需求。隨著雲端運算的盛行,雲端基礎設施的需求也跟著提升,要如 何以最少的成本達到最佳的效果,是眾人所面臨的一大問題,而發展各種的虛 擬化技術便是其中的一種解決方法,將資源做邏輯上的分割轉變為邏輯上可以 管理的資源,並提供給需求者,讓使用者更合理、更充分的控制與管理這些資 源。軟體定義儲存 (Software-Defined Storage, SDS)便是虛擬化技術中的一項, 透過軟體將儲存資源集中化,整合異質儲存設備,提升現有儲存設備的可用度 與使用率,並且可以視使用者需求調整,以達到最佳的效能。近年來 SDS 的概 念逐漸被大家所注意,各家廠商也紛紛推出自己的產品,但始終沒有一個公認 的標準出現,多數方案仍是偏向於自家的商品,SDS 能夠整合的儲存只限於小 範圍。本論文將採用以 OpenStack 作為架設雲端的環境,並在OpenStack 上整 合 HDFS、Ceph、Swift 等儲存,以達到軟體定義儲存的概念。本論文採用的 軟體,將可以整合不同儲存設備,提供一個整合性的儲存陣列,建立虛擬儲存 池,使用戶感覺不到儲存硬體的限制。並且提供管理者一個網頁介面,將可讓 管理者在不關閉機器的情況下調整儲存空間及管理用戶或安全設定等等。而效 能測試方面最後以 Hadoop HDFS 效能最佳,除了少數部分,整體而言是優於其他 兩者,是較佳使用於生產環境中的儲存。 關鍵字:雲端服務,虛擬化,軟體定義儲存

(4)

Abstract

With the popularity of cloud computing, the demand of cloud infrastructure also increases. Because the requirement of cloud computing increases steeply, the in-frastructure of cloud service expands too. Spending the least cost to achieve the best effect on cloud is a big problem for every researcher, and virtualization is one of the answers to resolve this problem. Virtualization is a technology that divides the real resources into logically manageable resources, which are then provided to users to enable the users control and manage them easily and legitimately. Software-Defined Storage (SDS) is a kind of virtualization, in which by integrat-ing the storage resource and different storage devices by software to increase the usability and activity, the user can according to their requirement to make adjust-ment and then achieve the optimal performance. In recent years, SDS becomes more and more popular, and several companies have announced their product. But the generic standard still has not appeared; most products are only appropri-ate for their devices and SDS just can integrappropri-ate a few storages. In this thesis, we will use the OpenStack to build and manage the cloud service, and use software to integrate storage resources include Hadoop HDFS, Ceph and Swift on OpenStack to achieve the concept of SDS. The software used can integrate different storage devices to provide an integrated storage array and build a virtual storage pool; so that users do not feel restrained by the storage devices. Our software platform also provides a web interface for managers to arrange the storage space, administrate users and security settings. For allocation of the storage resources, we make a pol-icy and assign the specific storage array to the machine that acquires the resource according to the policy. Then we run performance tests of file systems to prove

(5)

our system will run correctly. Finally, according to the experimental results, we find the performance of Hadoop HDFS is better than the other two storages; i.e., except a few instances, Hadoop HDFS out performs the other two storages in the same environment.

(6)

致謝詞

能夠有機會並且順利的完成這篇論文,並非我獨自一人能夠做到的,其中還包 含了許許多多來自各方面的支持協助,若非有這些無論是物質亦或者是精神上 的幫助,我沒有辦法達成這樣的目標。 首先我要感謝我的指導教授楊朝棟老師,謝謝老師六年來的教導,引領我進 入現在的研究領域,老師以其言傳身教,教導我研究的精神和做事的態度以及 方法,在學習的過程中督促並指點我正確的方向,讓我在一件件的工作中學習 與成長,若沒有這些磨練,我便沒有辦法順利完成這篇論文與學業。也希望在 最後的成果上,沒有讓老師失望。 感謝各位口試委員在口試的過程中,指導協助我改正並完善我的論文,感謝 薛念林老師、朱正忠老師、盧志偉老師、張志宏老師,不辭辛勞撥空擔任我的 口試委員,提供給了我許多的建議,因為有你們的幫助,令我的論文能夠更加 的完善。 感謝實驗室的學長們,在剛進入實驗室時,以身作則引領我適應研究生的生 活,感謝各位同學們在兩年來在諸多方面的幫助,無論是在技術方面、討論論 文方向、計劃的執行等等,因為有你們在,讓我可以順利的度過研究所的生活。 感謝學弟妹們的幫助,在這最後一段時間,幫忙分攤實驗室的工作,令我能夠 專注在論文上。 感謝父母對我從小的栽培,因為有你們一路上的支持,讓我順利的度過我的 求學生活,讓我有這個機會能夠站在這裡,作我想要的研究。 最後,僅以此文獻給所有一路上幫助我陪伴我的人,因為有你們才有今天的 我,謝謝你們。

(7)

Table of Contents

摘要 I

Abstract II

致謝詞 IV

Table of Contents V

List of Figures VII

List of Tables VIII

1 Introduction 1

1.1 Motivation . . . 1

1.2 Thesis Goal and Contributions . . . 2

1.3 Thesis Organization . . . 2

2 Background Review and Related Work 3 2.1 Cloud Computing . . . 3 2.2 Virtualization . . . 5 2.2.1 Full Virtualization . . . 6 2.2.2 Para-virtualization . . . 7 2.2.3 Hardware-assisted Virtualization . . . 8 2.2.4 Storage Virtualization . . . 9 2.2.5 Software-Defined Storage . . . 10 2.3 OpenStack . . . 11 2.4 Swift . . . 12 2.5 HDFS . . . 13 2.6 Ceph . . . 14 2.7 Related Work . . . 16

3 System Design and Implementation 18 3.1 System Architecture . . . 18

3.2 System Implementation . . . 20

(8)

4.1 Experimental Environment . . . 29

4.2 Performance of Data Transport . . . 29

4.3 User Interface of ViPR . . . 36

4.4 Discussion . . . 40

5 Conclusions and Future Work 41 5.1 Conclusion Remarks . . . 41 5.2 Future Work . . . 42 References 43 A Hadoop Installation 48 B Ceph Installation 52 C GlusterFS Installation 55

(9)

List of Figures

2.1 Full Virtualization . . . 7

2.2 Para-virtualization . . . 8

2.3 Hardware-assisted Virtualization . . . 9

2.4 the storage situation compare from the before and now . . . 10

2.5 OpenStack . . . 12 2.6 The HDFS architecture . . . 14 3.1 System Architecture . . . 19 3.2 Ceph Architecture . . . 21 3.3 HDFS Service Architecture . . . 22 3.4 HDFS Architecture . . . 22 3.5 OpenStack Overview . . . 23 3.6 VM Instances . . . 23

3.7 ViPR with OpenStack Architecture . . . 24

3.8 ViPR System Architecture . . . 26

3.9 ViPR Cluster . . . 27

4.1 Write single file with different file size . . . 31

4.2 Read single file with different file size . . . 32

4.3 Write small file . . . 33

4.4 Read small file . . . 34

4.5 Write small file . . . 35

4.6 Read small file . . . 35

4.7 Dashboard . . . 36 4.8 Add PhysicalAsset . . . 37 4.9 Add Host . . . 37 4.10 Add VirtualAsset . . . 38 4.11 VirtualPool Create . . . 39 4.12 DataService Setup . . . 40

(10)

List of Tables

4.1 Hardware Specification . . . 30 4.2 Software Specification . . . 30

(11)

Chapter 1

Introduction

Cloud computing, one of today’s hottest topics, is being developed very fast. There are many large companies such as Google, Amazon, and Yahoo offering cloud ser-vices to millions of users at the same time. A lot of people including programmers in the enterprise, government, and research centers, or even some common peo-ple, want to build cloud computing environments. But the operating cost of a cloud may be very expensive; the cost includes acquirement and maintenance of hardware facilities, and salaries paid to operating engineers and so on. So how to reduce the cost is a big issue.

1.1

Motivation

Cloud computing is a concept, in which computers over networks are able to cooperate with each other to provide far-reaching network services [7, 8, 19, 32, 36]. The basic approach of cloud computing is to computing through terminal operations over the Internet by moving hardware, software, and shared information from users to the server side. Due to this situation, we can say that we are living in a world that is surrounded by the cloud technology, and we almost use the cloud every day. Now, more and more enterprises join the procession of cloud service providers. There are many kinds of cloud services in the market. When a

(12)

company wants to build a cloud, a difficult problem it must face is how to choose the right cloud product to suit its need. Software-Defined Storage (SDS) is a kind of virtualization which integrates the storage resource and different storage device by software to increase the usability and activity; the users can adjust SDS according to their requirement to achieve the optimal performance. In recent years, SDS is an increasingly popular concept, and several companies have announced related products. But the generic standard still has not formed; hence, most of its products are only appropriate for some devices.

1.2

Thesis Goal and Contributions

In this thesis we implement a cloud service to integrate several heterogeneous stor-age technologies. In order to achieve the goal of high availability, we choose the technologies with a redundant mechanism, in which crash of machines is consid-ered a normal situation and mechanisms exist to resolve it. It will provide high availability to this system. In addition to this, we use the open source technology to let the system have good compatibility. It can also help the users to freely build their systems with high flexibility. These features are present in the design concept of Software-Defined Storage.

1.3

Thesis Organization

Chapter 2 will describe some background information and related works, including cloud computing, virtualization, software-defined storage. Chapter 3 will introduce our experimental environment and methods, and the overall architecture. Chapter 4 will list the hardware information measurement method and software version, present and analyze experimental results. Finally, Chapter 5 summarizes this thesis by pointing out its major contributions and future work directions.

(13)

Chapter 2

Background Review and Related

Work

2.1

Cloud Computing

Cloud computing is Internet-based computing, in which shared hardware, software resources, and data can be provided to users according to their demand of com-puters and other devices. The user does not need to know the details of the cloud infrastructure, or has appropriate expertise, and is without direct control. Cloud computing allows companies to fast deploy applications and reduce the complexity of management and maintenance costs, and enables rapid changes of IT resources reallocation to respond business needs. Cloud computing, usually involving with the Internet, describes a new Internet-based services to increase IT use and deliv-ery models, and is easy to provide dynamic and virtual extension of the resource. The cloud is a metaphor of networks or the Internet. Cloud computing can be considered to include the following levels of service: infrastructure as a service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS):

• SaaS: Consumers use the application, but do not control the operating sys-tem, hardware, or the operation of the network infrastructure, for example:

(14)

the Microsoft CRM or Salesforce.com platform as a service. With the SaaS model, the user can access the services, software, and data. The service providers maintain the normal operation of the infrastructure and platform to maintain service, while users access cloud services through browsers, and desktop or mobile applications.

• PaaS: Consumers use the host operating applications. Consumers control the operation of the application environment (host also has some control over it), but do not control the operating system, hardware, or the opera-tion of the network infrastructure. The platform is usually the applicaopera-tion infrastructure, for example: the Google App Engine.

• IaaS: Consumers can use the basic computing resources such as processing power, storage space, network components or middleware. Consumers can control the operating system, storage, deployed applications and network components (e.g. firewalls, load balancers), but do not control the cloud infrastructure. Examples of IaaS include Amazon AWS, and Rackspace.

Cloud computing relies on the sharing of resources in order to achieve economies of scale in similar infrastructure. Service providers integrate a large number of re-sources for use by multiple users. Users can easily request more rere-sources and adjust usage at any time, without the need to release all resources back to the whole structure. Users do not need to buy a lot of computing resources; they only need to enhance the amount of rent for short-term spikes of resource demands, and release some resources during low demand periods. Service providers are able to lease unused resources to other users, and even adjust rents in accordance with the demand of the whole. The National Institute of Standards and Technology (NIST) also classifies cloud computing based on the deployment model.

• Public cloud: In short, public cloud provides services through the Internet and third party service provider, and is opening up to the use of customers. The term public does not necessarily mean free, but may represent very cheap. In the public cloud, the user data is available for anyone to view.

(15)

Public cloud providers usually suggest interested users to implement some access control mechanisms. The public cloud is a flexible and cost-effective solution.

• Private cloud: Private cloud have many advantages, such as with more flex-ibility than a public cloud environment and being suitable for the provision of services. The difference between the two is that for private cloud ser-vices, data and programs are managed within the organization, while for public cloud services, they are constrained by the network bandwidth, secu-rity concerns, and regulatory restrictions. In addition, private cloud services suppliers and users have more control over the cloud infrastructure to im-prove safety and flexibility, because the user and the Internet are subject to special restrictions.

• Community Cloud: Community cloud is controlled and used by members of organizations with common interests, particular security requirements, and common purpose. Community members can access cloud data and applica-tions.

• Hybrid cloud: The hybrid cloud is a mix of the public cloud and private cloud. In this mode, the user usually has non-business-critical information dealt with in a public cloud, but at the same time, control of business-critical services and information.

2.2

Virtualization

With virtualization, the computer’s physical resources, such as servers, network, memory, and storage, are abstracted after conversion, so that users can apply those resources in a better way than the original configuration. Virtualization is commonly referred to virtualized resources including computing power and data storage; in this thesis virtualization is specifically referred to virtualization of servers [12, 15, 24, 30, 33]. The server virtualization software technology refers to

(16)

the use of one or more of the host hardware settings. It has the flexibility to configure the virtual hardware platform and operating system, like real hardware. In this way, a variety of different operating environments (for example, Windows, Linux, etc.) can operate simultaneously on the same physical host, and be inde-pendently operating in different physical hosts. Virtualization solutions can be broadly divided into three categories: full virtualization, para-virtualization, and hardware-assisted virtualization.

2.2.1

Full Virtualization

In full virtualization, Hypervisor replaces the traditional core of the operating system in the Ring 0 privilege level, via Binary Translation technology; the guest operating system requires direct access to the hardware Ring 0 level instruction by the hypervisor conversion to submit requests further to hardware.

Hypervisor virtualizes all hardware components. The guest operating system works as the individual real host, with a high degree of independence. But Binary Translation technology will cause the performance of the virtual machine (VM) to decrease significantly. One example of software using virtualization technology is the VMware Workstation.

The full virtualization is totally dependent on the virtual hardware layer cre-ated Guest OS, which can use hardware resources, only limited by the virtual hardware resources, and less able to misallocate of the entity’s computer hard-ware. In addition to the CPU, memory, the graphics card, etc. can also be virtualized if the proprietary driver has already been installed.

The benefits of full virtualization are, regardless of how the hardware environ-ment, the Guest OS is able to maintain a more consistent compatibility, and can use different operating systems with the physical machine. The disadvantage of full virtualization is that it is a greater burden of the physical machine.

(17)

Figure 2.1: Full Virtualization

2.2.2

Para-virtualization

Para-virtualization, also known as parallel virtualization, is not all hardware device virtualization such as Xen. Compared to the full virtualization virtual machine instruction Binary Translation, the para-virtualized VM invokes hardware devices through Dom0, and manages each VM access to physical resources. The VM per-formance is significantly improved, but the hardware driver binding in dom0, and the core of the operating system in the VM must go through a special modification; thus the independence of the VM is relatively low.

The para-virtualization does not configure the virtual layer on top of the hard-ware, but use more than one memory location program calls at different times. Para-virtualization allows the Guest OS share hardware resources. The advantage is that the hardware will not need to waste performance on the hardware layer; however, the drawback is that the operating system in the virtual machine must be consistent with the physical environment.

(18)

Figure 2.2: Para-virtualization

2.2.3

Hardware-assisted Virtualization

Hardware-assisted virtualization, as shown in Figure 2.3, is used to overcome the dilemma that the virtual machine operating system kernel cannot be placed in the processor Ring 0 privilege level. Since both the hypervisor and virtual op-erating system kernel can be issued in Ring 0, the hypervisor will be automati-cally intercepted by the need to deal with the instruction of the virtual operating system directly by the hardware processing; therefore the full virtualization bi-nary translation or para-virtualization Hypercall operations are no longer in need. Hardware assisted virtualization basically eliminates the difference between full virtualization and para-virtualization. On the basis of hardware-assisted full vir-tualization or para-virvir-tualization program virtual machines are high performance and independent. The representative program of the hardware-assisted virtual-ization VMware is ESXi with open source Kernel-based Virtual Machine (KVM). KVM needs a host operating system, whose core provides virtualization services; whereas VMware ESXi is the equivalent of a specialized virtualization thin client operating system, installed directly on a physical machine.

(19)

Figure 2.3: Hardware-assisted Virtualization

2.2.4

Storage Virtualization

After the server virtualization, the interactive mode of the server and storage device changed from one to one to the same application with several VMs to one storage device. It causes read/write of data more complicated like Figure 2.4. In addition to the application which has to request immediately such as DB or ERP, there are more and more new type of cloud and big data applications to be released like Node.js, Phthon and Hadoop and so on. Because the appearance of these new type technology, the ability of storage device to support them is challenged time and again. Additionally, there are several kinds of different architecture and medium of storage. For instance the new architecture like NAS and SAN or the medium like SATA, SAS, SSD and many other storage architecture developed for virtualization; the more and more new technologies let situations of data services become more complicated [28, 29].

(20)

Figure 2.4: the storage situation compare from the before and now

2.2.5

Software-Defined Storage

Software-Defined Storage (SDS) is a term for the computer data storage tech-nology which separates storage hardware from the software that manages the stor-age infrastructure. The software enabling SDS environments can provide policy management for feature options such as deduplication, replication, thin provision-ing, snapshots and backup [3, 23, 37].

Characteristics of SDS could include any or all of the following features [6,22]:

• Abstraction of logical storage services and capabilities from the underlying physical storage systems, and in some cases, pooling across multiple different implementations. Since data movement is relatively expensive and slow com-pared to compute and services (the ”data gravity” problem in infonomics), pooling approaches sometimes suggest leaving it in place and creating a mapping layer to it that spans arrays. Examples include

• Automation with policy-driven storage provisioning with service-level agree-ments replacing technology details. This requires management interfaces

(21)

that span traditional storage array products, as a particular definition of separating the ”control plane” from ”data plane,” in the spirit of OpenFlow. Prior industry standards efforts include the Storage Management Initiative – Specification (SMI-S) which began in 2000.

• Commodity hardware with storage logic abstracted into a software layer. This is also described as a clustered file system for converged storage.

• Scale-out storage architecture.

2.3

OpenStack

OpenStack [20] is an IaaS cloud computing project for public and private clouds. It is free open source software released under the terms of the Apache License. The project aims to deliver solutions for all types of clouds by being simple to im-plement, massively scalable, and features rich. The technology consists of a series of interrelated projects delivering various components for a cloud infrastructure solution. Founded by Rackspace Hosting and NASA, OpenStack has grown to be a global software community of developers collaborating on a standard and mas-sively scalable open source cloud operating system. Its mission is to enable any organization to create and offer cloud computing services running on standard hardware. The project is managed by the OpenStack Foundation, a non-profit corporate entity established in September 2012 to promote, protect and empower OpenStack software and its community.

OpenStack offers flexibility and choice through a highly engaged community of over 6,000 individuals and over 190 companies including Rackspace, such as Intel, AMD, Canonical, SUSE Linux, Inktank, Red Hat, Groupe Bull, Cisco, Dell, HP, IBM, NEC, VMware and Yahoo. It is portable software, but is mostly developed and used on operating systems running Linux.

The technology consists of a series of interrelated projects that control large pools of processing, storage, and networking resources through-out a datacenter,

(22)

all managed through a dashboard that gives administrators control while em-powering its users to provision resources through a web interface. OpenStack is committed to an open design and development process. The community op-erates around a six-month, time-based release cycle with frequent development milestones. During the planning phase of each release, the community gathers for the OpenStack Design Summit to facilitate live developer working sessions and assemble the roadmap.

Figure 2.5: OpenStack

2.4

Swift

OpenStack Object Storage (Swift) is a scalable redundant storage system. Objects and files are written to multiple disk drives spread throughout servers in the data center, with the OpenStack software responsible for ensuring data replication and integrity across the cluster. Storage clusters scale horizontally simply by adding new servers. Should a server or hard drive fail, OpenStack replicates its content from other active nodes to new locations in the cluster.

(23)

Because OpenStack uses software logic to ensure data replication and distribution across different devices, inexpensive commodity hard drives and servers can be used.

2.5

HDFS

The Hadoop Distributed File System (HDFS) is an Apache Software Foundation project and a subproject of the Apache Hadoop project. HDFS is a distributed file system designed to hold very large amounts of data (terabytes or even petabytes), and to provide high-throughput access to this information. Files are stored in a redundant fashion across multiple machines to ensure durability to failure and high availability to parallel applications [39]. HDFS has many similarities with other distributed file systems, but is different in several respects. One noticeable difference is HDFS’s write-once-read-many model that relaxes concurrency control requirements, simplifies data coherency, and enables high-throughput access [4, 16, 26, 38]. Another unique attribute of HDFS is the viewpoint that it is usually better to locate processing logic near the data rather than moving the data to the application space. HDFS rigorously restricts data write to one write at a time. Bytes are always appended to the end of a stream, and byte streams are guaranteed to be stored in the written order.

HDFS is comprised of interconnected clusters of nodes where files and direc-tories reside. An HDFS cluster consists of a single node, known as a NameNode, which manages the file system namespace and regulates client access to files. In addition, data nodes (DataNodes) store data as blocks within files. Within HDFS, a given name node manages file system namespace operations like opening, clos-ing, and renaming files and directories. A name node also maps data blocks to data nodes that handle read and write requests from HDFS clients. Data nodes also create, delete, and replicate data blocks according to instructions from the governing name node. The architecture of HDFS is shown in Figure 2.6.

(24)

Figure 2.6: The HDFS architecture

2.6

Ceph

Ceph [2, 34, 40] is a software storage platform designed to present object, block, and file storage from a single distributed computer cluster. Ceph is a distributed storage designed to provide excellent performance, reliability and scalability. Ceph was made possible by a global community of enthusiastic storage engineers and researchers. It is open source and freely-available. Ceph software runs on commod-ity hardware. The system is designed to be both self-healing and self-managing and strives to cut both administrator and budget costs.

• Object Storage: Ceph is a distributed object storage and file system designed to provide excellent performance, reliability and scalability. Its software libraries offer client applications with direct access to the reliable autonomic

(25)

distributed object store (RADOS) object-based storage system, and also provide a basis for some of Ceph’s advanced features, including RADOS Block Device (RBD), RADOS Gateway, and the Ceph File System.

The librados software libraries enable applications written in C, C++, Java, Python and PHP. The RADOS Gateway also exposes the object store as a RESTful interface which can present as both native Amazon S3 and Open-Stack Swift APIs. The librados libraries provide advanced features, includ-ing:

Partial or complete reads and writes

Snapshots

Atomic transactions with features like append, truncate and clone range

Object level key-value mappings

• Block Storage: Ceph’s object storage system allows users to mount Ceph as a thinly provisioned block device. Ceph’s RADOS Block Device (RBD) provides access to block device images that are striped and replicated across the entire storage cluster. When an application writes data to Ceph using a block device, Ceph automatically stripes and replicates the data across the cluster. Ceph’s RADOS Block Device (RBD) also integrates with KVMs, bringing Ceph’s virtually unconstrained storage to KVMs running on user’s Ceph clients.

Ceph RBD interfaces with the same Ceph object storage system that pro-vides the librados interface and the CephFS file system, and it stores block device images as objects. Since RBD is built on top of librados, RBD inherits librados’s capabilities, including read-only snapshots and revert to snapshot. Ceph’s object storage system is not bounded to native binding or RESTful APIs. User can mount Ceph as a thinly provisioned block device. When write data to Ceph using a block device, Ceph automatically stripes and replicates the data across the cluster. By striping images across the cluster, Ceph increases read access performance for large block device images.

(26)

• File System: Ceph’s file system (CephFS) runs on top of the same object stor-age system that provides object storstor-age and block device interfaces. Ceph provides a POSIX-compliant network file system that aims for high perfor-mance, large data storage, and maximum compatibility with legacy appli-cations. Compared to many object storage systems available today Ceph’s object storage system offers a significant feature: a traditional file system interface with POSIX semantics. Object storage systems are a significant innovation, but they supplement rather than replace traditional file systems. The Ceph metadata server cluster provides a service that maps the direc-tories and file names of the file system to objects stored within RADOS clusters. The metadata server cluster can expand, contract, and dynami-cally rebalance the file system to distribute data evenly among cluster hosts. As storage requirements grow for legacy applications, organizations can con-figure their legacy applications to use the Ceph file system. This means user can run one storage cluster for object, block and file-based data storage. This ensures high performance and prevents heavy loads on specific hosts within the cluster.

2.7

Related Work

Chengzhang Penga and Zejun Jiangb [21] proposed a cloud storage service system, in which a solution is suggested to build a cloud storage service system based on the open-source distributed database. Their system was designed as three-tiered architecture. And they demonstrated their prototype of CS3 built based on the framework.

A file system architecture for organization of data and metadata that will efficiently organize and enable sharing besides exploiting the power of storage virtualization and maintaining simplicity in such a highly complex and virtualized environment proposed by Ankur Agrrawal et al [1].

(27)

Tahani Hussain et al. [10] assessed the performance of an existing enterprise network before and after deploying distributed storage systems. And the simula-tion of an enterprise network with 680 clients and 54 servers followed by redesign-ing shows improvements in the storage system throughput by 13.9% with 24.4% decreased average response time and 38.3% reduced packet loss rate.

Dejun Wang [31] proposed an efficient cloud storage mode for heterogeneous cloud infrastructures. And he verified by extensive tests to the model with numer-ical examples. He thought cloud storage system with traditional storage systems has some differences; such as the demand from the performance point of view, data security, reliability, efficiency and other indicators need to be considered for cloud storage services, which are services in a wide range of complex network environment for the demands of large-scale users.

(28)

Chapter 3

System Design and

Implementation

3.1

System Architecture

The main goal of this thesis is to build a cloud service platform integrating with several heterogeneous storage technologies. The cloud storage combines different storage technologies that are the most popular and perform great in the same type of technology. These storages that we chose are open source; hence, it will be helpful for further development and maintenance in the future. Additionally, we will offer a simple GUI for users to manage the cloud service. In this chapter, we will overview the whole system.

(29)

Figure 3.1: System Architecture

The system architecture consists of some components. The first part is a server node built by OpenStack. And we also used OpenStack to build some VMs to be the storage nodes. Second, we used three kinds of different storage technology to build the storage of OpenStack. And we used software ViPR to control the different storage devices and manage storage spaces for the storage of OpenStack. ViPR will display the storage with a simple appearance and it can be used to integrate different storage technologies. The system architecture is shown in Figure 3.1.

In our experiment, we focus on the inside of system. The basic of system was built by OpenStack, which was combined with several components, including the computing side, network side and storage side. We integrated some different

(30)

storage to be the storage side for the OpenStack cloud. And the storage controller integrated them by the APIs of these storage.

3.2 System Implementation

• Architecture of Ceph: there are three daemons in the Ceph side, including OSD, MDS and MON. The OSD is responsible for saving data as an object; it used an independent partition, and it will partition to a xfs type. It at least has one node, and the capacity of Ceph is defined by the Equation(3.1), where i means the number of osd from 0 to n, and n means the last number of OSD. The function shows the total capacity of OSD as the sum of all OSDs. CephSize= ni=0 partitionsize(OSD)i, n≥1 (3.1)

The second is MDS, responsible for storing the data structure of the data index, and writing it into the OSD. It at least has one node, and it uses a redundant mechanism. The last is MON which is responsible to monitor the running status of OSD and MDS, and according to the situation to adjust them. It at least has one node, and it uses a redundant mechanism too. The Ceph architecture is shown in Figure 3.2.

(31)

Figure 3.2: Ceph Architecture

Every OSD has an independent partition; and the three daemons will not conflict with one another. They can run on a same node.

• Architecture of HDFS: the HDFS is combined with the master node and the slave node which is called NameNode and DataNode. And we built a HDFS architecture consisting of one NameNode and two DataNodes shown in Figure 3.3. The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from clients of the file system. The DataNodes also performs block creation, deletion, and replication upon instruction from the NameNode shown in Figure 3.4.

(32)

Figure 3.3: HDFS Service Architecture

(33)

• OpenStack: OpenStack was used to build our cloud; it is open source soft-ware to manager cloud. It was built on a VM with Ubuntu OS, and then some VMs were created to form a storage system cluster. The overview of the system is shown in Figure 3.5 and Figure 3.6.

Figure 3.5: OpenStack Overview

Figure 3.6: VM Instances

• EMC ViPR: traditional data center is just a set of some servers, storage, network and security and etc. In the traditional data center, to build a new application usually takes several weeks. But the process can be sim-plified by the data center driven by completely dynamic software, making it like building a new VM. If we can design a series of policies that can be

(34)

implemented on a variety of new applications to automatically support in-frastructure services included in a container, i.e., a virtual data center, then, in a few minutes or a few seconds, we can complete the deployment of a new application.

EMC Virtualization Platform Reinvented (ViPR) is a logical storage system, not a physical storage. It can integrate EMC storage and third-party storage in a storage pool, and manage it as a single system, but still save the worth of original storage. ViPR can replicate data across several different place and data center with different store product, and it provides a unified block store, object store and file system and other services. At the same time, ViPR provides a unified metadata service and self-service deployment, mea-surement and monitoring services. In addition to this, ViPR also applies to multi-tenant environments. EMC proposed a model about software-defined storage, and described the relation with OpenStack. The architecture is shown in Figure 3.7.

(35)

ViPR run on three to five virtual server machines. It consists of the control plane and data plane; the first part realizes automatic store, and the second part provides data service by building on the first part. ViPR provides some APIs to let users to use services and manage storage, as shown in Figure 3.8.

One part of ViPR is the control plane. It can manage all applications of storage array, such as data mining, storage resource searching, and capacity counting and reporting. The controller controls the application of virtual storage array, and manages the storage resource. In order to achieve the goal, ViPR virtualizes the control path of physical storage, and runs the function with it. By the virtual plane, people can manage the storage pool and split it to different virtual storage arrays with policies, just like virtualization of the server.

In addition to the controller and data service, ViPR provides the open REST-ful API. The developers can develop new services without the limit of the hardware. Through these APIs, people can access data, and add, delete, modify, monitor and measure the logical storage resources.

(36)

Figure 3.8: ViPR System Architecture

ViPR is built by the scale-out architecture; if it is built with at least three clusters, the architecture will provide the ability of high availability, load bal-ance and system upgrade and so on. It provides RESTfulAPI, GUI(console), CLI and SDK to let user control it with high flexibility as shown in Figure 3.9. And ViPR can provide the automatic of disaster recovery. Through the continuous remote replication technology, it can provide the ability of data replication continuously to successfully achieve the goal of disaster recovery.

(37)

Figure 3.9: ViPR Cluster

The data service of ViPR has some features:

• Unified Storage: ViPR offers the traditional layer on top of cloud storage data type, with the following advantages: tradition by allowing local access to the underlying storage, improved compatibility with existing applications; while allowing manipulation of data at the right time through different meth-ods to improve the efficiency of the work; and it can convert existing data to the cloud easily.

• Heterogeneous Storage: ViPR provides an engine for a variety of storage devices, for a given environment or use case can select the appropriate storage capacity and hardware, and the user can reuse the existing storage hardware investments, but also applies to mixed ViPR additional scenarios, such as across different devices tiered storage and replication.

(38)

• Enterprise-grade Storage: cloud storage platforms often lack some of the features needed in enterprise scenarios, such as snapshots and compliance. The ability to leverage and extend ViPR data services to the underlying storage devices to provide enterprise-class cloud storage, such as ViPR can provide incremental real-time snapshot of the object store.

• Flexible Storage: ViPR data services are implemented by software, which makes ViPR data service can be deployed on any server in the data center, in simple, reliable, lightweight and scalable design; And while all of these functions underlying storage devices, IP is not locked in any storage array.

• Extensible Storage: ViPR data services provides a rich API, third-party services can use it to develop their own systems. In addition, ViPR will also expose the raw building blocks, these modules can provide the core IP cloud-scale services, including distributed B + tree, metadata; third party can use these to develop services that modules can be placed inside or outside ViPR to extend a platform.

• Object Storage: object storage services like Amazon S3 ViPR provided the model for the image. ViPR APIs support AmazonS3, OpenStackSwift and EMC’s Atmos, vNextAPI support EMCCentera CAS. The ViPR also pro-vides some extensions, including Byte range updates, atomic increase, rich ACL snapshots, metering and billing, and so on. In addition, ViPR provides such functions as file can be accessed directly on the same underlying file storage device object storage, and the performance of the local file system consistency.

• HDFS: the HDFS file system has become increasingly popular, not only in Hadoop, but also in a variety of distributed applications. ViPR HDFS provides massive storage platform; ViPR achieves through existing HDFS to solve the problem addressing key storage engine, in addition ViPR design can easily be stored in HDFS into existing hardware environment. One can also make HDFS run target storage of above scenarios.

(39)

Chapter 4

Experimental Results

4.1

Experimental Environment

We used OpenStack to build our cloud platform, which then was used to create and manage the distributed storage system. In the system we integrated three heterogeneous storage technologies. And we built the storage system by some VMs, in which Ceph was constructed by three VMs with specifications of 1 core CPU, 2GB memory, and a total of 60GB storage space. The first node was MON and OSD, the second was MDS and OSD, the third was OSD. These were components in the Ceph cluster. And the HDFS part was constructed by three virtual machines including one NameNode and two DataNodes with specifications of 1 core CPU, 2GB memory, and total of 40GB storage space.

4.2

Performance of Data Transport

In the first part of our experiment, we compared the three different kinds of storage technology, including Ceph, Hadoop HDFS and GlusterFS. The reason to choose them is that they are famous storage technology in the popular distributed storage systems today.

(40)

Table 4.1: Hardware Specification

Host name CPU Memory Disk OS Management node vCPU 9 cores 20GB 200GB Ubuntu

12.04 Ceph node1 1 cores vCPU 2GB 20GB Ubuntu

12.04 Ceph node2 1 cores vCPU 2GB 20GB Ubuntu

12.04 Ceph node3 1 cores vCPU 2GB 20GB Ubuntu

12.04 HDFS Namenode1 1 cores vCPU 2GB 20GB Ubuntu

12.04 HDFS Datanode1 1 cores vCPU 2GB 20GB Ubuntu

12.04 HDFS Datanode2 1 cores vCPU 2GB 20GB Ubuntu

12.04 GlusterFS server1 1 cores vCPU 2GB 20GB Ubuntu

12.04 GlusterFS server2 1 cores vCPU 2GB 20GB Ubuntu

12.04 GlusterFS client 1 cores vCPU 2GB 20GB Ubuntu

12.04

Table 4.2: Software Specification Software Version OpenStack Havana Ceph V0.72 EMPEROR HDFS 2.2.0 Swift GlusterFS 3.5

In this experiment, we built three storage systems: Hadoop, Ceph and Glus-terFS and recorded the data of the performance of uploading and downloading files on the three systems. We used three cases to upload and download a single file. The first case was for files with the size of 1MB to 16GB. The second is tests of small sized files of 1MB to 100MB, with 16GB total file size. And the last one is files100MB to 1000MB, with 16GB total file size.

The result for the experiment to upload a single file from 1MB to 16GB from the client node to the server node is shown in Figure 4.1. We observe that when the file size is smaller than 512MB, the performance of the three storage systems are

(41)

similar. But when the file size is larger than 512MB, the performance of Hadoop is better than GlusterFS and Ceph. According to this result, we can say the ability of Hadoop is better than GlusterFS and Ceph when uploading large files.

Figure 4.1: Write single file with different file size

The result for the experiment to download a single file from 1MB to 16GB from the client node to the server node is shown in Figure 4.2. From the result, we observe that when the file is smaller than 512MB, the performance of the three storage systems are similar. But when the file size is larger than 512MB, the performance of GlusterFS is better than the other two. However, when the file size is larger than 4GB, the performance of Hadoop is better than Ceph.

(42)

Figure 4.2: Read single file with different file size

In this part, we uploaded 16GB file in total, and split the file into smaller files of 10MB to 100MB. We will use it to test the effect of performance when transporting files of different sizes. In Figure 4.5, we can see the curve is almost stable; however when the file size is just larger than 60MB, the run time of Hadoop increase fast. And when the file size is between 30MB to 60MB, its performance is the best. The performance of GlusterFS and Ceph are similar to each other. And Hadoop performs better than they.

(43)

Figure 4.3: Write small file

In this part, we downloaded 16GB file in total, and split the file to be 10MB to 100MB. We will use it to test the effect of performance of downloading different file sizes, as shown in Figure 4.6. In this experiment, we find the results have not consistent trends; they are close. The performance of Hadoop is better than Ceph most of the time, but when the file size between 50MB and 80MB, the performance of Ceph and Hadoop is close.

(44)

Figure 4.4: Read small file

In this case, we uploaded 16GB file in total, and split the file from 100MB to 1000MB; we wanted to test the effect of performance of data transport between larger file and small file. From the results, we can observe that the performance of Hadoop is better than the other two. Moreover, when the file size is smaller than 500MB, Ceph is better than GlusterFS, but if the file size is larger than 500MB, Ceph is either similar to GlusterFS or even worse.

(45)

Figure 4.5: Write small file

In this case, we download 16GB file in total, and split the file from 100MB to 1000MB. From the result, we found that the performance of them is close, but Ceph rises suddenly when the file size is between 500MB and 600MB. And Hadoop is the best in them.

(46)

4.3 User Interface of ViPR

ViPR can integrate some different original storage architecture through a data service, and provide some automatic mechanism for the storage systems to simplify the management work and decrease the work complexity. It can transform the physical storage from different company or system to be a single, scalable and open virtual storage platform. It can manage all storage resource by a single interface, as shown in Figure 4.7.

Figure 4.7: Dashboard

As shown in Figure 4.8, people can register their storage system to the physical assets part. And people can add some computers in the physical storage array to be a host, as shown in Figure 4.9.

(47)

Figure 4.8: Add PhysicalAsset

(48)

ViPR transforms the physical storage to be a virtual storage plane, and it can create or manage the virtual storage pools automatically by the policy from the manager. When people create the policy, he can set how to create the virtual storage pool according to the storage performance which the service needs, as shown in Figure 4.10 and Figure 4.11.

(49)

Figure 4.11: VirtualPool Create

There are some data service will run on the plane, and the virtual storage pool and virtual storage array can provide the application with the data service, as shown in Figure 4.12. People can split the virtual storage pool to be some virtual storage arrays, and manage them by the policy. There are three types in the data service:

• Object: the file systems can store, take a file and modify the unstructured data by the data service, and the origin data application system can read or write the data easily.

• HDFS: collocation the service, people can transform the origin storage device to be a big data storage place. And they can use it to build a big data environment quickly without building an analysis cluster.

• Block: in the block data service, people can take a data across several storage array, and it can support many kinds of applications, file systems, database and virtual platforms.

(50)

Figure 4.12: DataService Setup

4.4

Discussion

We used ViPR to provide the storage management to system; it can manage storage array simply through the data plane and control plane. The data plane is an advantage data service which runs on storage arrays, for example it can offer ability of object storage on file system storage platforms or offer ability of block storage on normal storage platforms. The data service which can virtualize the storage system combines different data types (file, object, and block), protocol (iSCSI, NFS, and REST), security, usability. We used ViPR to offer a storage pool to our system, and we prove that the system will run normality by the experiments in this chapter. According to the experimental results, we can find that the performance of Hadoop HDFS is better than the others, and thus, Hadoop HDFS is more mature and has higher performance than the others. And ViPR can create virtual storage pool and assign the storage space to different kinds of storage.

(51)

Chapter 5

Conclusions and Future Work

5.1

Conclusion Remarks

Cloud computing is a concept, in which computers over networks are able to coop-erate with each other to provide far-reaching network services. The basic approach to computing through the Internet terminal operations of cloud computing is to move hardware, software, and shared information from users to the server side. Due to the situation, we can say that we are living in a world which is surrounded by cloud technology, and we almost use the cloud every day. In recent years, the request of cloud service increase quickly every day. How to manage the cloud in a more humane way and make cloud deployment simpler are very important. In this thesis, we built a cloud platform to achieve the concept of Storage-Defined Storage by software, through distributed cloud architecture to build a high reliability and high scalability cloud service, which integrates heterogeneous storage technologies. Finally we also provide a high usability user friendly interface to let users use this system with ease.

(52)

5.2 Future Work

Our system still has many places to improve. The first problem is the storages which the software of our system can support has limit. It just can support some storage technologies. And the other is the environment of our system is small. In the future we hope we can deploy software that can support more storage tech-nologies. And we hope to build a larger environment with our system architecture, and improve the performance of the data transport from the management node to the storages.

(53)

References

[1] A. Agrrawal, R. Shankar, S. Akarsh, and P. Madan. File system aware storage virtualization management. In2012 IEEE International Conference on Cloud Computing in Emerging Markets (CCEM), pages 1–11. IEEE, 2012.

[2] B. N. Bershad and J. C. Mogul, editors. 7th Symposium on Operating Systems Design and Implementation (OSDI ’06), November 6-8, Seattle, WA, USA. USENIX Association, 2006.

[3] M. Carlson, A. Yoder, and L. Schoeb. Software defined storage.

http://snia.org/sites/default/files/SNIA%20Software%20Defined% 20Storage%20White%20Paper-%20v1.0k-DRAFT.pdf/, 2014.

[4] H.-C. Chao, T.-J. Liu, K.-H. Chen, and C.-R. Dow. A seamless and reliable distributed network file system utilizing webspace. In WSE, pages 65–68, 2008.

[5] C. Clark, K. Fraser, S. Hand, J. G. Hansen, E. Jul, C. Limpach, I. Pratt, and A. Warfield. Live migration of virtual machines. In Proceedings of the 2nd conference on Symposium on Networked Systems Design & Implementation-Volume 2, pages 273–286. USENIX Association, 2005.

[6] I. Coraid. The fundamentals of software-defined storage. http://san. coraid.com/rs/coraid/images/SB-Coraid_SoftwareDefinedStorage. pdf, 2013.

(54)

[7] P. T. Endo, G. E. Gonçalves, J. Kelner, and D. Sadok. A survey on open-source cloud computing solutions. In Brazilian Symposium on Computer Networks and Distributed Systems, 2010.

[8] S. Figuerola, M. Lemay, V. Reijs, M. Savoie, and B. S. Arnaud. Converged optical network infrastructures in support of future internet and grid services using iaas to reduce ghg emissions. J. Lightwave Technol., 27(12):1941–1946, Jun 2009.

[9] J. G. Hansen and E. Jul. Self-migration of operating systems. InProceedings of the 11th workshop on ACM SIGOPS European workshop, page 23. ACM, 2004.

[10] T. Hussain, P. N. Marimuthu, and S. J. Habib. Managing distributed storage system through network redesign. In Asia-Pacific Network Operations and Management Symposium, pages 1–6, 2013.

[11] Y. Jiang, H. Jin, and X. Liao. Devstore: Distributed storage system for desktop virtualization. In 2013 8th International Conference on Computer Science & Education (ICCSE), pages 341–346, 2013.

[12] M. Kim, H. Lee, H. Yoon, J.-I. Kim, and H. Kim. Imav: an intelligent multi-agent model based on cloud computing for resource virtualization. In 2011 international conference on information and electronics engineering.

[13] L.-j. D. Kun Liua. Research on cloud data storage technology and its archi-tecture implementation. Procedia Engineering, 29:133–137, 2012.

[14] libvirt.org. libvirt. http://libvirt.org/, 2013.

[15] Q. Ma, H. Wang, Y. Li, G. Xie, and F. Liu. A semantic qos-aware discovery framework for web services. In International Conference on Web Services, pages 129–136. IEEE, 2008.

[16] G. Mackey, S. Sehrish, and J. Wang. Improving metadata management for small files in hdfs. In CLUSTER, pages 1–4, 2009.

(55)

[17] D. S. Milojičić, F. Douglis, Y. Paindaveine, R. Wheeler, and S. Zhou. Process migration. ACM Computing Surveys (CSUR), 32(3):241–299, 2000.

[18] R. Moreno-Vozmediano, R. S. Montero, and I. M. Llorente. Elastic man-agement of cluster-based services in the cloud. In Proceedings of the 1st workshop on Automated control for datacenters and clouds, ACDC ’09, pages 19–24, New York, NY, USA, 2009. ACM.

[19] D. Nurmi, R. Wolski, C. Grzegorczyk, G. Obertelli, S. Soman, L. Youseff, and D. Zagorodnov. The eucalyptus open-source cloud-computing system. In Cluster Computing and the Grid, 2009. CCGRID’09. 9th IEEE/ACM International Symposium on, pages 124–131. IEEE, 2009.

[20] Openstack. Openstack. https://www.openstack.org/, 2014.

[21] C. Penga and Z. Jiang. Building a cloud storage service system. Procedia Environmental Sciences, 10:691–696, 2011.

[22] S. Robinson. Software-defined storage: The reality be-neath the hype. http://www.computerweekly.com/opinion/ Software-defined-storage-The-reality-beneath-the-hype, 2013.

[23] M. Rouse. software-defined storage. http://searchsdn.techtarget.com/ definition/software-defined-storage, 2013.

[24] J. Sahoo, S. Mohapatra, and R. Lath. Virtualization: A survey on concepts, taxonomy and associated security issues. In Second International Conference on Computer and Network Technology.

[25] P. Sempolinski and D. Thain. A comparison and critique of eucalyptus, open-nebula and nimbus. In Cloud Computing Technology and Science (Cloud-Com), 2010 IEEE Second International Conference on, pages 417–426. Ieee, 2010.

[26] J. Shafer, S. Rixner, and A. L. Cox. The hadoop distributed filesystem: Balancing portability and performance. In ISPASS, pages 122–133, 2010.

(56)

[27] B. Sotomayor, R. S. Montero, I. M. Llorente, and I. Foster. Virtual infrastruc-ture management in private and hybrid clouds. Internet Computing, IEEE, 13(5):14–22, 2009.

[28] J. Spillner, J. Müller, and A. Schill. Creating optimal cloud storage systems.

Future Generation Comp. Syst., 29(4):1062–1072, 2013.

[29] B. Tang and G. Fedak. Analysis of data reliability tradeoffs in hybrid dis-tributed storage systems. In IPDPS Workshops, pages 1546–1555, 2012.

[30] P. Vichare, A. Nassehi, S. Kumar, and S. T. Newman. A unified manufactur-ing resource model for representmanufactur-ing cnc machinmanufactur-ing systems. In Robotics and Computer-Integrated Manufacturing.

[31] D. Wang. An efficient cloud storage model for heterogeneous cloud infras-tructures. Procedia Engineering, 23:510–515, 2011.

[32] L. Wang, D. Chen, Y. Hu, Y. Ma, and J. Wang. Towards enabling cyber-infrastructure as a service in clouds. Computers & Electrical Engineering, 39(1):3–14, 2013.

[33] P. Wang. Qos-aware web services selection with intuitionistic fuzzy set under consumer’s vague perception. Expert Syst. Appl., 36(3):4460–4466, 2009.

[34] S. A. Weil, S. A. Brandt, E. L. Miller, D. D. E. Long, and C. Maltzahn. Ceph: A scalable, high-performance distributed file system. In OSDI, pages 307–320, 2006.

[35] X. Y. WenfengLin, YeZhongn. Research on cloud data storage technology and its architecture implementation. Journal of Network and Computer Ap-plications, 36:1696–1704, 2013.

[36] wikipedia. Cloud computing. http://en.wikipedia.org/wiki/Cloud_ computing, 2013.

[37] wikipedia. Software-defined storage. http://en.wikipedia.org/wiki/ Software-defined_storage, 2014.

(57)

[38] C.-T. Yang, W.-C. Shih, G.-H. Chen, and S.-C. Yu. Implementation of a cloud computing environment for hiding huge amounts of data. In International Symposium on Parallel and Distributed Processing with Applications, pages 1–7, 2010.

[39] C.-T. Yang, W.-C. Shih, and C.-L. Huang. Implementation of a distributed data storage system with resource monitoring on cloud computing. In GPC, pages 64–73, 2012.

[40] J. Yun, Y. H. Park, D. M. Seo, S. J. Lee, and J. S. Yoo. Design and implemen-tation of a metadata management scheme for large distributed file systems.

IEICE Transactions, 92-D(7):1475–1478, 2009.

[41] D. Zhao, K. Burlingame, C. Debains, P. Alvarez-Tabio, and I. Raicu. To-wards high-performance and cost-effective distributed storage systems with information dispersal algorithms. In CLUSTER, pages 1–5, 2013.

(58)

Appendix A

Hadoop Installation

I. Modify hosts and hostname # sudo vim /etc/hosts

# sudo vim /etc/hostname

II. Install Java JDK

# sudo apt -get -y install openjdk -7-jdk

# sudo ln -s /usr/lib/jvm/java -7-openjdk -amd64 /usr/lib/jvm/jdk

III. Add hadoop user # sudo addgroup hadoop

# sudo adduser --ingroup hadoop hduser # sudo adduser hduser sudo

IV. Creat SSH authentication login

# ssh -keygen -t rsa -f ~/. ssh/id_rsa -P "" # cat ~/. ssh/id_rsa.pub >> ~/. ssh/authorized_keys

V. Download hadoop

# wget http :// ftp.twaren.net/Unix/Web/apache/hadoop/common/hadoop 2.2.0/ hadoop -2.2.0. tar.gz

# tar zxf hadoop -2.2.0. tar.gz # mv hadoop -2.2.0. tar.gz hadoop

(59)

Appendix

# vim .bashrc

export JAVA_HOME =/usr/lib/jvm/jdk/ export HADOOP_INSTALL =/ home/hduser/hadoop export PATH=$PATH:$HADOOP_INSTALL/bin export PATH=$PATH:$HADOOP_INSTALL/sbin export HADOOP_MAPRED_HOME =$HADOOP_INSTALL export HADOOP_COMMON_HOME =$HADOOP_INSTALL export HADOOP_HDFS_HOME=$HADOOP_INSTALL export YARN_HOME=$HADOOP_INSTALL

VII. Set hadoop config

a. edit hadoop-env.sh # cd hadoop/etc/hadoop # vim hadoop -env.sh

export JAVA_HOME =/usr/lib/jvm/jdk/

b. edit core-site.xml # vim core -site.xml

<property >

<name >fs.default.name </name >

<value >hdfs :// hadoop1 -master :9000 </ value > </property >

c. edit yarn-site.xml # vim yarn -site.xml

<property >

<name >yarn.nodemanager.aux -services </name > <value >mapreduce_shuffle </value >

</property > <property >

<name >yarn.resourcemanager.hostname </name > <value >hadoop1 </value >

</property >

d. edit mapred-site.xml

# cp mapred -site.xml.template mapred -site.xml # vim mapred -site.xml

(60)

Appendix

<property >

<name >mapreduce.framework.name </name > <value >yarn </value >

</property > e. edit hdfs-site.xml # mkdir -p ~/ mydata/hdfs/namenode # mkdir -p ~/ mydata/hdfs/datanode # vim hdfs -site.xml <property >

<name >dfs.replication </name > <value >2</value >

</property > <property >

<name >dfs.namenode.name.dir </name >

<value >/ home/hduser/mydata/hdfs/namenode </value > </property >

<property >

<name >dfs.datanode.data.dir </name >

<value >/ home/hduser/mydata/hdfs/datanode </value > </property > f. edit slaves # vim slaves hadoop1 hadoop2 hadoop3

VIII. Copy hadoop to all nodes

# scp -r /home/hduser/hadoop hadoop1 :/ home/hduser # scp -r /home/hduser/hadoop hadoop2 :/ home/hduser # scp -r /home/hduser/hadoop hadoop3 :/ home/hduser

IX. Format HDFS

# hdfs namenode -format

X. Start hadoop # start -all.sh

(61)

Appendix

XI. Use jps to see java running program # jps

XII. MapReduce JobTracker monitoring website # hadoop1 :50030

(62)

Appendix B

Ceph Installation

I. Create Ceph user # ssh user@ceph -server

# sudo useradd -d /home/ceph -m ceph # sudo passwd ceph

II. Add root competence for every Ceph cluster

#echo "ceph ALL = (root) NOPASSWD:ALL" | sudo tee /etc/sudoers.d/ceph #sudo chmod 0440 /etc/sudoers.d/ceph

III. Set SSH no password login and copy to every node # ssh -keygen -t rsa -f ~/. ssh/id_rsa -P ""

# cat ~/. ssh/id_rsa.pub >> ~/. ssh/authorized_keys # ssh -copy -id ceph@ceph1

# ssh -copy -id ceph@ceph2 # ssh -copy -id ceph@ceph3

IV. Install Ceph

# wget -q -O- 'https :// ceph.com/git/?p=ceph.git;a=blob_plain;f=keys/release.asc ' | sudo apt key add

-# echo deb http :// ceph.com/debian -dumpling/ $(lsb_release -sc) main | sudo tee /etc/apt/sources.list.d/ceph.list

# sudo apt -get update

# sudo apt -get install ceph -deploy

(63)

Appendix

# mkdir my -cluster # cd my -cluster

VI. Create a monitor key

# ceph -deploy purgedata {ceph -node} [{ceph -node }] # ceph -deploy forgetkeys

VII. Create clusters # ceph -deploy new ceph1

VIII. Install Ceph

# ceph -deploy install ceph1 ceph2 ceph3

IX. Create a monitor

# ceph -deploy mon create ceph1

X. Collect the key

# ceph -deploy gatherkeys ceph1

XI. Add two OSDs # ssh ceph2

# sudo mkdir /tmp/osd0 # exit

# ssh ceph3

# sudo mkdir /tmp/osd1 # exit

XII. Prepare OSDs

# ceph -deploy osd prepare ceph2 :/tmp/osd0 ceph3 :/tmp/osd1

XIII. Activate OSDs

# ceph -deploy osd activate ceph2 :/tmp/osd0 ceph3 :/tmp/osd1

(64)

Appendix

# ceph -deploy admin admin -node ceph1 ceph2 ceph3

XV. Check Ceph status # ceph health

(65)

Appendix C

GlusterFS Installation

I. Modify host file # sudo vim /etc/hosts # sudo vim /etc/hostname

II. Install glusterfs

# apt -get install glusterfs -server

III. Set glusterfs service # chconfig glusterd on

IV. Start glusterfs

# service glusterd start

V. Modify firewall configuration # iptables -L -n

VI. Add host server2 into the trusted storage pool # gluster peer probe server2

VII. Check trust relationship # gluster peer status

(66)

Appendix

VIII. Create volume

# gluster volume create datavol replica 2 transport tcp server1 :/opt/data server2 :/opt/data

IX. Start volume

# gluster volume start datavol

X. Check connection between services # netstat -tap | grep glusterfsd

XI. Check volume status # gluster volume info datavol

XII. Mount file system on cluster

References

Related documents

1.Make sure the Android SDK platform-tools/ directory is included in PATH 2.Change directories to the root of the Android project and execute:.

He compare three different network architecture, Linux bridge, Open vSwitch and support OpenFlow protocol physical switch.. The result show how different network speed in

&#34; In-office visits by motorists increased 6.7 percent from the previous quarter, as the department served 1,796,434 customers in driver license field offices.*.. * This

Univariate analyses indicated that tumour size (p = 0.033) and treatment for initial recurrence (p,0.001) had a significant influence on post-recurrence survival in 74 patients

Some other ways to correct your vision are by wearing glasses or contact lenses, or by undergoing other kinds of laser refractive surgery such as non-custom LASIK or

Crime committed is qualified theft and not Crime committed is qualified theft and not estafa. The offense contains all the estafa. The offense contains all the elements of

 Discover mapped public IP &amp; port for private IP &amp; port  Use mapped public IP &amp; port in application layer message  Keep this mapping valid.. 

THAT this Committee recommends to Council that as recommended in a report dated February 29, 2012 from the Director of Parks and Recreation and the Manager of Special Projects,