
2016 International Conference on Mathematical, Computational and Statistical Sciences and Engineering (MCSSE 2016) ISBN: 978-1-60595-396-0

KingCloud: Object Oriented Archiving System

Jia-jia MIAO 1,*, Yin-jin FU 1 and Han-dong MAO 2

1 Institute of Command Automation, PLA University of Science and Technology, Nanjing, China

2 Pushtime Technology Inc., Beijing, China

*Corresponding author

Keywords: File system semantics, Document classification, Data layout, File prefetching.

Abstract. With the continuous advance of informatization, the large volume of data accumulated in production systems has created a demand for archiving; meanwhile, as the types of data grow richer, big data technology has increased the usefulness of unstructured data. This paper designs and implements the KingCloud intelligent object archiving system. It classifies text files through document classification technology, provides logical views of documents, and acquires content elements through image recognition, video keyframe extraction and other technologies. For the overall storage structure, document prefetching, memory caching, data layout and policy awareness can be optimized by drawing on file system semantics, and data can be intelligently classified, summarized, discovered, predicted and analyzed, which remarkably improves the service capability, service quality and performance of the storage system.

Introduction

With the advent of the Web era, the amount of data has grown explosively. E-commerce, search engines and social networking sites in particular frequently need to process petabytes of data: Facebook's photo storage system [1] currently stores 26 million pictures, the AT&T network carries 16 PB of data daily, Google processes 20 PB of data daily, and YouTube stores 31 PB of streaming media data [2]. According to Cisco statistics, network video traffic amounts to about 5000 PB per month; NASA's earth observation system EOSDIS has stored 3 PB of data and is growing at a rate of 1 TB per week [3].

On the one hand, these large-scale data lack effective analytical methods, and understanding them has exceeded human capacity. On the other hand, production systems have been deployed for as long as 10 years, and much old data, although still held in the online system, is essentially never touched and drags down the overall performance of the production system. Therefore, we need a method that can intelligently analyze, classify, summarize, discover and forecast data, and thereby provide an important foundation for data storage and data layout.

Analysis of the internal information of data has never stopped in the research community, and the use of semantic information has always been a focus of attention. Traditional research on storage system semantics usually has three stages: 1) data collection, 2) information extraction, and 3) data analysis and understanding.

Related Works

From an architectural point of view, the object storage system underlies a smart object file system; the semantic file system captures the semantic associations between files and provides intelligent support for the object file system; and the file system filter is currently a common means of obtaining file semantics transparently. The related technologies are introduced below from these three aspects.


An object is of variable length, can contain any type of data and can be resized dynamically, whereas a block is of fixed size. For example, the allocation of storage space can be predicted from the attributes of an object, which is one of the improvements that object-oriented semantic acquisition brings to the storage system. A semantic file system [6,7] automatically extracts semantic attributes from file content and establishes logical relations among them, which enables associative access. Associative access here means that the system uses the concept of a virtual directory to provide content-based indexed access to files. A semantic file system indexes files by the semantic attributes extracted from their content, and classifies and organizes files by attribute into virtual directories. It thus offers users a view of file organization and an access mechanism that differ from the conventional directory tree [8].
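The virtual-directory idea can be illustrated with a minimal Python sketch. It is not the design of any system cited here: the attribute extraction rules (file extension, "#topic" tags in the content), the class name VirtualDirectoryIndex and the path syntax are all assumptions made for illustration.

```python
from collections import defaultdict

def extract_attributes(name, text):
    """Derive (attribute, value) pairs from a file's name and content (toy rules)."""
    attrs = set()
    if "." in name:
        attrs.add(("ext", name.rsplit(".", 1)[-1].lower()))
    for word in text.lower().split():
        if word.startswith("#") and len(word) > 1:   # hypothetical "#topic" tag convention
            attrs.add(("topic", word[1:]))
    return attrs

class VirtualDirectoryIndex:
    """Maps attribute/value pairs to files so that a path acts as a query."""

    def __init__(self):
        self._index = defaultdict(set)   # (attribute, value) -> set of file names

    def add_file(self, name, text):
        for pair in extract_attributes(name, text):
            self._index[pair].add(name)

    def list_dir(self, path):
        """Resolve a virtual path such as '/topic/budget/ext/txt' to matching files."""
        parts = [p for p in path.split("/") if p]
        if not parts or len(parts) % 2:
            raise ValueError("path must be a sequence of attribute/value pairs")
        result = None
        for pair in zip(parts[0::2], parts[1::2]):
            files = self._index.get(pair, set())
            result = set(files) if result is None else result & files
        return sorted(result)

index = VirtualDirectoryIndex()
index.add_file("plan.txt", "annual #budget review")
index.add_file("memo.txt", "meeting notes #budget #staff")
index.add_file("scan.pdf", "approved #staff list")
print(index.list_dir("/topic/budget"))         # ['memo.txt', 'plan.txt']
print(index.list_dir("/topic/staff/ext/txt"))  # ['memo.txt']
```

Each path component pair narrows the result set, so the same file is reachable through several virtual directories without being copied.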

A semantic filtering system is integrated into the storage system, where it collects and stores data semantics; anti-virus software, for example, uses such filter mechanisms. What this method most commonly collects is the data access sequence pattern [9]: a filter-driver kernel module monitors file access operations, records the relative number of accesses between two logical data units, and evaluates the relationship between the units on that basis. The advantage of this method is that it makes no prior assumptions about the front-end storage application and does not modify any of its logic. However, because it is a storage system module separate from the front-end application, the information it can collect is limited to data access sequence patterns and some simple data attributes.
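As a rough illustration of collecting access sequence patterns, the sketch below counts how often one unit is accessed immediately after another and turns the counts into a relative frequency. The function names and the "immediate successor" window are assumptions; a real filter driver would observe accesses in the kernel rather than read a Python list.

```python
from collections import Counter, defaultdict

def successor_counts(access_log):
    """Count how often each unit is accessed immediately after another unit."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(access_log, access_log[1:]):
        if prev != nxt:                  # ignore repeated access to the same unit
            counts[prev][nxt] += 1
    return counts

def relative_frequency(counts, a, b):
    """Fraction of the accesses following `a` that went to `b`."""
    followers = counts.get(a, Counter())
    total = sum(followers.values())
    return followers[b] / total if total else 0.0

log = ["a", "b", "a", "b", "c", "a", "b"]
counts = successor_counts(log)
print(relative_frequency(counts, "a", "b"))  # 1.0
print(relative_frequency(counts, "b", "c"))  # 0.5
```

The relative frequency can serve as a simple measure of how strongly two units are related.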

In contrast with the white-box method, the black-box approach is the most transparent and universal method [9]. Its basic idea is to place a semantic extraction component as a layer on the data access path [10], so that all data accesses pass through this layer and semantic information is extracted as the data passes. The grey-box method balances transparency against accuracy. Like the black-box method, it involves no modification of the front-end storage application [11], but it does define several prior assumptions.

KingCloud Framework

KingCloud organizes archived data as objects, which involves the concepts of object, attribute and object storage device. An object is the logical organizational form of data: it can be a document, a video or a record of an operation, that is, a carrier of data that share a logical relationship. An attribute describes one particular aspect of an object; for example, the approval attribute of an approval document describes how the file was actually approved. An object storage device is a self-managing device backed by a file system; it can be a disk array attached to a server, or a magnetic tape or optical disk device.

The KingCloud intelligent object archiving system comprises four components: the user interface, the KingCloud server, the terminal data acquisition Agent and the data ETL tool set, as shown in Figure 1. The user interface includes an API for secondary development, a logical view browsing interface, an aggregated search interface and topic-oriented content display. It provides clients with a logical view of the data, including file names and directory structure, and with a physical view that describes how data is stored on physical media. The KingCloud server consists of a content analysis service, a file layout service, a metadata service and a web application server. In object storage, the logical and physical views of data are separated: the metadata server is responsible only for the logical view, while the physical view is managed by the object storage devices. The ETL tool set is responsible for the extraction, transformation and loading of structured data; the client acquisition Agent is responsible for the acquisition, storage and extraction of basic metadata.


to spread the load and increase reliability, as shown in Figure 2, in which each outer block denotes a node, the node marked with an asterisk (*) is the master node, and the small squares denote slices. A node is a running instance of the metadata service. A cluster is a set of nodes that work together, share data and provide failover and scaling; when a node joins or is removed, the cluster detects the change and rebalances the data automatically. One node in the cluster is elected master node and manages cluster-level changes such as creating or deleting an index and adding or removing nodes. Every node knows which node holds any given piece of data and forwards a request to the node where the data resides; the master node collects the data returned by all nodes and finally returns it to the client. When the metadata cluster is expanded or shrunk, the system automatically migrates and redistributes slices among the nodes to keep the cluster balanced.
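The rebalancing behaviour can be sketched as follows. This is only an illustration of the idea of keeping slice counts even when nodes join or leave; the greedy strategy, the function name rebalance and the node names are assumptions, not the algorithm used by KingCloud.

```python
def rebalance(assignment, nodes):
    """Return a new slice -> node mapping that is balanced over `nodes`.

    assignment: current mapping of slice id -> node name.
    nodes: names of the nodes that are members of the cluster now.
    """
    nodes = sorted(nodes)
    if not nodes:
        raise ValueError("at least one node is required")
    load = {n: [] for n in nodes}
    orphans = []
    for slice_id, node in sorted(assignment.items()):
        if node in load:
            load[node].append(slice_id)
        else:
            orphans.append(slice_id)     # the owning node has left the cluster
    # Place orphaned slices on the least loaded nodes first.
    for slice_id in orphans:
        target = min(nodes, key=lambda n: len(load[n]))
        load[target].append(slice_id)
    # Move slices from the busiest to the idlest node until counts differ by at most 1.
    while True:
        busiest = max(nodes, key=lambda n: len(load[n]))
        idlest = min(nodes, key=lambda n: len(load[n]))
        if len(load[busiest]) - len(load[idlest]) <= 1:
            break
        load[idlest].append(load[busiest].pop())
    return {s: n for n in nodes for s in load[n]}

current = {0: "n1", 1: "n1", 2: "n1", 3: "n2", 4: "n2", 5: "n2"}
grown = rebalance(current, ["n1", "n2", "n3"])   # node n3 joins
shrunk = rebalance(grown, ["n2", "n3"])          # node n1 is removed
print(grown)
print(shrunk)
```

Slices that already sit on a surviving node stay there unless that node is overloaded, which keeps data migration small.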

[Figure 1 labels: User Interface, Logical View, Search, Visual Interface, API; KingCloud Server, WEB Service, Meta-data Service, Content Analysis, File layout; Cluster of file system; Agent, File Agent 1, PIC Agent 2, Video Agent 3; ETL; Old-system Files, Files, Pictures, Videos]

Figure 1. KingCloud architecture diagram.

Figure 2. The lateral extension framework of metadata.

For document-type data, the full text content is also stored as an attribute value in the database, which enables full-text retrieval of documents.

The KingCloud archiving system uses the grey-box method: through probing (Probe), it collects information such as data access heat and data layout, and then predicts the access behavior of the archive system. Probing mainly consists of executing a number of standard data access operations on the front-end storage application and observing the I/O accesses and data changes that each operation triggers. The system runs a service process that monitors and collects the operations the upper-level file system performs on the underlying files.


The system contains the ETL tool set. Once the data of the source application system is understood, it can extract the data automatically and, after merging data from multiple databases, transform and load it into the database. The specific process is shown in Figure 3.

Figure 3. ETL process.
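A minimal extract-transform-load sketch using Python's standard sqlite3 module is given below. The two source databases, the table and column names (documents, approvals, archive_metadata) and the sample rows are hypothetical; they only illustrate merging data from several databases before loading.

```python
import sqlite3

def make_source(schema, insert, rows):
    """Create an in-memory database standing in for a source application system."""
    conn = sqlite3.connect(":memory:")
    conn.execute(schema)
    conn.executemany(insert, rows)
    return conn

# Two hypothetical source systems.
docs_db = make_source("CREATE TABLE documents (id INTEGER, title TEXT)",
                      "INSERT INTO documents VALUES (?, ?)",
                      [(1, " Budget plan "), (2, "staff LIST")])
flow_db = make_source("CREATE TABLE approvals (doc_id INTEGER, approver TEXT)",
                      "INSERT INTO approvals VALUES (?, ?)",
                      [(1, "alice"), (2, "bob")])

def extract(conn, query):
    """Extract: read raw rows from one source database."""
    return conn.execute(query).fetchall()

def transform(documents, approvals):
    """Transform: normalize titles and merge rows from both sources by document id."""
    approver_of = dict(approvals)
    return [(doc_id, title.strip().lower(), approver_of.get(doc_id))
            for doc_id, title in documents]

def load(target, records):
    """Load: write the merged records into the archive metadata table."""
    target.execute("CREATE TABLE IF NOT EXISTS archive_metadata "
                   "(doc_id INTEGER PRIMARY KEY, title TEXT, approver TEXT)")
    target.executemany("INSERT OR REPLACE INTO archive_metadata VALUES (?, ?, ?)",
                       records)
    target.commit()

target = sqlite3.connect(":memory:")
load(target, transform(extract(docs_db, "SELECT id, title FROM documents"),
                       extract(flow_db, "SELECT doc_id, approver FROM approvals")))
rows = target.execute("SELECT * FROM archive_metadata ORDER BY doc_id").fetchall()
print(rows)  # [(1, 'budget plan', 'alice'), (2, 'staff list', 'bob')]
```

Keeping the three steps as separate functions makes it easier to swap in a different source schema, which is where most of the per-application customization falls.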

Obtaining associated metadata from application systems also has its shortcomings: different applications and different customers require considerable customization, mainly to understand the content of the data source, which is a relatively time-consuming process.

Application of Semantic Analysis of Documents

Document semantics can be used to optimize system strategies such as document prefetching, data caching, layout optimization, security awareness and data search. These optimizations are widely used in the internal storage of the KingCloud archive system.

Metadata Prefetching

The KingCloud system leans toward an aggressive prefetching algorithm, mainly because the amount of metadata is relatively small, so the cost of prefetching is relatively low. Of course, when the misprediction rate rises, excessive prefetching may no longer bring a performance improvement; in that case, a threshold on the computed degree of file correlation is used to control prefetching. In the experiments, the effect of the threshold on performance is observed and the threshold value is adjusted accordingly.
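Threshold-controlled prefetching can be sketched as below. The correlation table, the class name MetadataCache and the unbounded cache are assumptions for illustration; the paper does not give the correlation formula, so the scores here are simply supplied as input.

```python
def prefetch_candidates(correlation, accessed, threshold):
    """Files whose correlation with `accessed` reaches `threshold`, strongest first."""
    related = correlation.get(accessed, {})
    hits = [(score, name) for name, score in related.items() if score >= threshold]
    return [name for score, name in sorted(hits, key=lambda h: (-h[0], h[1]))]

class MetadataCache:
    """Toy metadata cache that prefetches correlated entries on every access."""

    def __init__(self, store, correlation, threshold):
        self.store = store               # name -> metadata (stands in for the metadata server)
        self.correlation = correlation   # name -> {related name: score in [0, 1]}
        self.threshold = threshold
        self.cache = {}
        self.hits = 0
        self.misses = 0

    def get(self, name):
        if name in self.cache:
            self.hits += 1
        else:
            self.misses += 1
            self.cache[name] = self.store[name]
        for other in prefetch_candidates(self.correlation, name, self.threshold):
            self.cache.setdefault(other, self.store[other])
        return self.cache[name]

store = {"a": {"size": 1}, "b": {"size": 2}, "c": {"size": 3}}
correlation = {"a": {"b": 0.9, "c": 0.2}}

cautious = MetadataCache(store, correlation, threshold=0.5)
aggressive = MetadataCache(store, correlation, threshold=0.1)
for cache in (cautious, aggressive):
    for name in ("a", "b", "c"):
        cache.get(name)
print(cautious.hits, cautious.misses)      # 1 2
print(aggressive.hits, aggressive.misses)  # 2 1
```

A lower threshold prefetches more and raises the hit count in this trace; with a poorly predicted workload the same setting would fill the cache with entries that are never used, which is the trade-off the threshold is meant to control.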

Data Layout Optimization

Document correlation has also been shown to be highly effective in improving file layout policies [12]. Several issues must be considered when analyzing the file data layout process. One of the most important is determining which files should be grouped together. The previously mentioned dependency list is used to solve this problem. Because files in an archive system are not modified, the data layout is relatively simple: the KingCloud archiving system only considers storing read-only files together as a group. In this way, when any file in a group is accessed, the file data of the entire group is brought into the system cache, which improves the I/O performance of subsequent accesses.
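One simple way to form such groups is to treat correlated read-only files as connected components. The sketch below uses a union-find structure for this; the pairwise correlation scores, the threshold and the function name group_files are assumptions and do not reproduce the dependency list of the original system.

```python
def group_files(files, correlations, threshold):
    """Group read-only files that are linked by sufficiently strong correlations.

    files: name -> True if the file is read-only.
    correlations: {(name_a, name_b): score}; pairs below `threshold` are ignored.
    """
    parent = {name: name for name, read_only in files.items() if read_only}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving keeps the trees shallow
            x = parent[x]
        return x

    for (a, b), score in correlations.items():
        if score >= threshold and a in parent and b in parent:
            parent[find(a)] = find(b)

    groups = {}
    for name in parent:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in groups.values())

files = {"a": True, "b": True, "c": True, "d": False, "e": True}
correlations = {("a", "b"): 0.8, ("b", "c"): 0.7, ("c", "d"): 0.9, ("a", "e"): 0.1}
layout = group_files(files, correlations, threshold=0.5)
print(layout)  # [['a', 'b', 'c'], ['e']]
```

File "d" is left out because it is not read-only, and "e" forms its own group because its only correlation falls below the threshold.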

Search Results Optimization


ascending sort. These ideas are similar to PageRank. In addition, each link relationship also carries a degree of correlation, so a connection between files is not simply 0 or 1. A PageRank-style algorithm that reflects file attributes can therefore be used to optimize the search results.
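A weighted variant of PageRank, in which each link carries a correlation degree rather than a 0/1 value, might look like the following sketch. The damping factor, the iteration count, the handling of files without outgoing links and the descending sort in rank_results are conventional choices assumed here, not details taken from the paper.

```python
def weighted_pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over links with weights (correlation degrees).

    links: {source: {destination: weight}} with positive weights.
    """
    nodes = set(links) | {dst for out in links.values() for dst in out}
    if not nodes:
        return {}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for src in nodes:
            out = links.get(src, {})
            total = sum(out.values())
            if total == 0:
                # A file with no outgoing links spreads its rank evenly.
                share = damping * rank[src] / len(nodes)
                for n in nodes:
                    new[n] += share
            else:
                # Rank flows along each link in proportion to its weight.
                for dst, weight in out.items():
                    new[dst] += damping * rank[src] * weight / total
        rank = new
    return rank

def rank_results(matches, rank):
    """Order matching files by rank, highest first (ties broken by name)."""
    return sorted(matches, key=lambda name: (-rank.get(name, 0.0), name))

links = {"a": {"c": 0.9, "b": 0.1}, "b": {"c": 1.0}, "c": {"a": 0.5}}
rank = weighted_pagerank(links)
print(rank_results(["a", "b", "c"], rank))  # ['c', 'a', 'b']
```

Because outgoing weights are normalized per file, a strong correlation shifts more rank to the linked file than a weak one, while the total rank still sums to 1.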

Summary

This paper describes the overall structure of the KingCloud smart object file system and introduces the semantic acquisition method based on access behavior. It then analyzes in detail the content-based metadata extraction methods for various types of unstructured documents. In addition, for application systems with rich document metadata descriptions, the system provides an ETL tool to extract and transform that portion of the data. On the basis of this work, the paper analyzes the considerable improvements that intelligently acquired metadata brings to the file system metadata service, object data layout optimization and search optimization.

Acknowledgement

This research was financially supported by the Program for Equipment Research of China 9140A15070414JB25224.

References

[1] Beaver D, Kumar S, Li H. Finding a needle in Haystack: Facebook's photo storage. In: Proc. of the 10th USENIX Symp. on Operating Systems Design and Implementation, 2010: 30-35.

[2] Cao Qiang, Huang Jian-zhong. The design of network storage system. The University of HuaZhong technology, 2010.

[3] ECS Info. http://observer.gsfc.nasa.gov/

[4] Gibson G A, Nagle D F, Amiri K, et al. A cost-effective, high-bandwidth storage architecture. SIGOPS Oper. Syst. Rev., 1998, 32(5): 92-103.

[5] Zeng L, Zhou K, Shi Z, et al. HUSt: a heterogeneous unified storage system for GIS grid. In: Proceedings of SC ’06: the 2006 ACM/IEEE Conference on Supercomputing, New York, NY, USA: ACM, 2006, 325.

[6] Gifford D K, Jouvelot P, Sheldon M A, et al. Semantic file systems. SIGOPS Oper. Syst. Rev., 1991, 25(5): 16-25.

[7] Semantic file systems. http://www.objs.com/survey/OFSExt.htm.

[8] Bhagwat D, Polyzotis N. Searching a file system using inferred semantic links. In: Proceedings of HYPERTEXT ’05: the Sixteenth ACM Conference on Hypertext and Hypermedia, New York, NY, USA: ACM, 2005, 85-87.

[9] Kroeger T M, Long D D E. The Case for Efficient File Access Pattern Modeling. In: Proceedings of HOTOS ’99: the Seventh Workshop on Hot Topics in Operating Systems, Washington, DC, USA: IEEE Computer Society, 1999, 14.

[10] Gu P, Zhu Y, Jiang H, et al. Nexus: A Novel Weighted-Graph-Based Prefetching Algorithm for Metadata Servers in Petabyte-Scale Storage Systems. In: Proceedings of CCGRID ’06: the Sixth IEEE International Symposium on Cluster Computing and the Grid, Washington, DC, USA: IEEE Computer Society, 2006, 409-416.

[11] Schindler J, Griffin J L, Lumb C R, et al. Track-Aligned Extents: Matching Access Patterns to Disk Drive Characteristics. In: Proceedings of the 1st USENIX Conference on File and Storage Technologies, Berkeley, CA, USA: USENIX Association, 2002.
