
RUBA: Real-time Unstructured Big Data Analysis Framework

Jaein Kim, Nacwoo Kim, Byungtak Lee

IT Management Device Research Section, Honam Research Center, ETRI

Gwangju, Republic of Korea
{jaein, nwkim, bytelee}@etri.re.kr

Joonho Park, Kwangik Seo
Altibase Consulting Services, Altibase, Inc.

Seoul, Republic of Korea
{joonho.park, kwangik.seo}@altibase.com

Hunyoung Park
Infinidata Team, Tibero, Inc.

Sungnam, Republic of Korea

hypark@tibero.com

Abstract—We are entering the era of big data. As ICT technologies develop, the volume of data is growing dramatically, and much work on handling big data is under way. In this paper, we propose a novel framework for real-time analysis of unstructured big data such as video, sound, text and image data. The proposed framework provides real-time analysis and dynamic modification functions for unstructured big data analysis. We have implemented an object monitoring system as a test system to which our framework is applied, and we have confirmed each function and the availability of our framework.

Keywords—Big Data, Unstructured Data, Real-time System, CEP, CQL.

I. INTRODUCTION

As ICT technologies advance, we are entering the era of big data. Big data is a newly defined term that refers to a large volume of data, a large number and variety of data sources, and very frequent updates. Big data is recognized as an important issue for future ICT technologies, and many techniques for processing it have been developed, such as Hadoop [1], MapReduce [2] and CEP (Complex Event Processing) [3], with much related research still under way. In particular, real-time analysis of big data to find valuable information or knowledge is becoming more important. We expect that the requirements for big data analysis will increase with the appearance of new real-time ICT services [4][5].

In fact, real-time big data analysis is similar to an extension of data mining, because real-time data mining, which collects data as transactions and finds frequent events or patterns among many transactions, can be applied to real-time big data analysis. Many real-time data mining methods have been developed, from early works [6][7] to recent works [8][9][10]. Much research on real-time data mining is ongoing, and real-time big data analysis that applies data mining ideas will be an important challenge.

However, there are some problems in applying real-time data mining ideas directly to real-time big data analysis. The main problem is the unstructured data types of big data. Data mining methods can only deal with structured data types when finding knowledge, but big data applications involve a great deal of unstructured data. For example, CCTV cameras on the street produce massive video data in real-time that contain much information about the situation of people and streets, but we cannot store this video data in a structured database system and cannot apply data mining methods to find information in it. The second problem is the large-scale nature of big data: it comes from many sources, changes rapidly, and the knowledge we want to find from it changes as well. In data mining methods, an algorithm to discover knowledge is built once and then executed; if we want to discover other knowledge or modify the algorithm, we have to restructure the code, rebuild and execute it again. In a big data environment, the data is always changing in its types, states and analysis purposes, so this rebuild-and-re-execute process is not suitable for a big data analysis system.

In this paper, we propose a novel framework for real-time big data analysis, named the RUBA framework (Real-time Unstructured Big data Analysis framework). Using the RUBA framework, we can analyze unstructured big data such as CCTV data and effectively manage a distributed analysis system that collects the big data. The framework consists of a CEP engine that analyzes the big data and discovers knowledge in real-time (BA: Big data Analysis) and a big data processor that converts unstructured data into structured data (BP: Big data Processing). There are also three interfaces, for the input of BA, the input of BP and the output of BA (BAI: BA Input Interface, BPI: BP Input Interface, BAOI: BA Output Interface). This paper is organized as follows. In Section 2, we define the requirements of real-time big data analysis, and we propose the RUBA framework in Section 3. We have implemented an object monitoring system using the RUBA framework to confirm its suitability for real-time big data analysis, and the results are described in Section 4. In the last section, conclusions and future works are drawn.

II. REQUIREMENTS OF THE REAL-TIME ANALYSIS

There are two important requirements for real-time unstructured big data analysis. The first is real-time analysis with low costs for storing, processing and analyzing data. For example, suppose that we have to find a getaway vehicle A in a city that has 10,000 CCTV cameras and a monitoring system. If one person can monitor 10 cameras and work for 8 hours continuously, 3,000 people are needed for one day (1,000 people per 8-hour shift, over three shifts). If we instead use an automatic getaway vehicle detection system that receives video streams and extracts car numbers through image processing, and suppose that one such system can process video streams from 100 cameras simultaneously, we need 100 detection systems. In other words, big data analysis incurs high costs, so we need to minimize the costs of storage, transmission and querying. If storing the CCTV data takes longer than the required real-time response time, the system cannot be an efficient real-time big data analysis system; likewise, if querying the stored big data to find information takes a long time, it cannot be a good system either.

The second requirement is easy modification of the analysis system. In the real world, new data arrives all the time and new data types appear as new ICT services are introduced, so the big data analysis system should adapt to changes in the big data environment. For example, suppose a system is running to detect getaway vehicle A; after vehicle A is arrested, a new getaway vehicle B appears, and we need to modify the analysis system to detect vehicle B. To do so, we would have to re-code, re-build and re-execute the algorithm, but this process is not efficient given the demand to adapt to changes in the data environment. In particular, if there are many distributed analysis systems for big data, modifying the algorithms of all systems at the same time is very costly.

A big data analysis framework should therefore analyze big data at low computing cost and be adaptive to changes in the data environment. To meet these requirements, we developed a selective processing methodology that analyzes big data at low computing cost. For unstructured big data analysis, we also propose a data processing procedure and interfaces that systematically convert unstructured data into structured data. Finally, we developed a framework to manage distributed big data analysis systems that can accept new data and adapt to new analysis algorithms without re-coding, re-building and re-execution.

III. RUBA FRAMEWORK

A. Core of the RUBA Framework

The first step in analyzing meaningful data (patterns or rules) from big data is to extract the meaningful data from it. In existing data mining methods, to extract meaningful data such as frequent itemsets, all data is saved in a database and scanned several times. The second step is to find information such as important patterns or rules from the extracted data; the algorithm used to discover this information varies with the purpose of the application. However, the cost of saving big data into a database and scanning it is very expensive, so we cannot apply these processes to big data analysis. In fact, because we cannot store all of the big data in a database, the first step already poses a serious problem.

To find specific data in a real-time data stream, continuous query processing has been studied [3][11]. Continuous query processing differs from conventional query processing. In conventional query processing, data is first stored in the database and queries are executed whenever a user requests them. In continuous query processing, on the other hand, queries are registered in the system in advance and executed whenever the data stream arrives; if a data item matches a registered query, the system extracts that data. Through subsequent work, continuous query processing has evolved into CEP (complex event processing) engines [3][12], which use CQL (continuous query language), similar in grammar to SQL, to extract specific data from a data stream. To extract specific data, we only need to register a CQL statement that describes the conditions of that data, and to stop extracting it, we only have to delete the CQL statement from the system.

We consider the CEP engine an effective tool for real-time big data processing, because it executes in memory and can process real-time data at high speed. In addition, we can define the target data precisely in the "where" clause of a CQL statement. In particular, because CQL statements can be registered and deleted while the system is running, we can modify the analysis conditions or the state of the system without re-building and re-executing it. We therefore include a CEP engine in the real-time big data analysis framework and design interfaces to manage the registration and deletion of the CQL statements used to extract target data.
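To make the runtime registration and deletion concrete, the sketch below shows how a continuous query could be added to and removed from a running Esper engine (the CEP engine used later in our demonstration). It is a minimal illustration written against the classic two-argument UpdateListener of older Esper client releases; the CarEvent type, its fields and the plate value are hypothetical and are not taken from our system.

```java
import com.espertech.esper.client.Configuration;
import com.espertech.esper.client.EPAdministrator;
import com.espertech.esper.client.EPServiceProvider;
import com.espertech.esper.client.EPServiceProviderManager;
import com.espertech.esper.client.EPStatement;
import com.espertech.esper.client.EventBean;
import com.espertech.esper.client.UpdateListener;

import java.util.HashMap;
import java.util.Map;

public class CqlRegistrationSketch {
    public static void main(String[] args) {
        // Declare a simple Map-based event type for structured records produced by BP.
        Configuration config = new Configuration();
        Map<String, Object> carEvent = new HashMap<>();
        carEvent.put("plate", String.class);
        carEvent.put("location", String.class);
        config.addEventType("CarEvent", carEvent);

        EPServiceProvider engine = EPServiceProviderManager.getDefaultProvider(config);
        EPAdministrator admin = engine.getEPAdministrator();

        // Register a continuous query at runtime: extract sightings of plate "A-1234".
        EPStatement stmt = admin.createEPL(
                "select plate, location from CarEvent where plate = 'A-1234'");
        stmt.addListener(new UpdateListener() {
            @Override
            public void update(EventBean[] newEvents, EventBean[] oldEvents) {
                if (newEvents != null) {
                    System.out.println("Matched: " + newEvents[0].getUnderlying());
                }
            }
        });

        // Feed one event; the registered query fires immediately, no rebuild needed.
        Map<String, Object> sighting = new HashMap<>();
        sighting.put("plate", "A-1234");
        sighting.put("location", "LOT-A");
        engine.getEPRuntime().sendEvent(sighting, "CarEvent");

        // When the analysis target changes, the statement is simply destroyed...
        stmt.destroy();
        // ...and a new one is registered; the engine keeps running throughout.
    }
}
```

Because the statement is created and destroyed through the engine's administrative interface while events keep flowing, no re-coding or re-building of the host application is required.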

Figure 1. Core of RUBA

Figure 2. Conventional framework

Fig. 1 shows the core of the RUBA framework, and Fig. 2 shows an example of a conventional framework for an unstructured data analysis system. For the conventional framework in Fig. 2, consider an image analysis system that has a feature extraction module, a classification module and a class database integrated into one system, and suppose that we construct n such image analysis systems for n sources. If we want to extract new objects or modify the conditions of the target data, these systems must be re-coded with new algorithms and re-built for re-execution. In the proposed RUBA framework in Fig. 1, the preprocessor, which corresponds to the feature extraction module, and the analysis processor, which corresponds to the classification module, are separated. The role of the analysis processor is taken over by the CEP engine as BA, and the preprocessor becomes BP, which processes the unstructured data. In the RUBA framework, unstructured data streams enter BP through BPI, and all results of BP are sent to BA through BAI. In BA, the CEP engine runs the CQL statements continuously, and the target data is detected in real-time. In this framework, if we want to find new data or modify the conditions of the target data, we only have to register a new CQL statement or delete a registered one.

The method of extracting target data using a CEP engine with CQL statements is itself a form of real-time big data analysis. For example, if a statement CQL1 that counts the cars that have passed location A during the last minute is registered in the system, the result of CQL1 is already a real-time analysis result. We can also select target data using CQL, and the selected data becomes important input for discovering knowledge. Moreover, by focusing the processing on the selected data, we reduce the overall data processing cost. Using a CEP engine and CQL is thus a very efficient approach to real-time big data analysis.
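As an illustration of the kind of statement meant by CQL1, the snippet below gives one plausible Esper-style formulation of "the number of cars that passed location A during the last minute". The CarEvent type and its fields are assumptions carried over from the earlier sketch, and the exact syntax depends on the CEP engine in use.

```java
public class Cql1Sketch {
    // Hypothetical CQL/EPL for "CQL1": count CarEvent records at location A over a
    // 1-minute sliding time window, emitting a snapshot of the count every 60 seconds.
    public static final String CQL1 =
            "select count(*) as carCount "
          + "from CarEvent(location = 'A').win:time(1 min) "
          + "output snapshot every 60 seconds";

    public static void main(String[] args) {
        // In the framework this string would be registered at runtime, e.g. through
        // EPAdministrator.createEPL(CQL1), and deleted again when no longer needed.
        System.out.println(CQL1);
    }
}
```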

One of the performance criteria of the RUBA framework is how precisely we can describe the conditions of the target data in the "where" clause of a CQL statement; if certain conditions cannot be described exactly, the result will be wrong. Unlike SQL, CQL is not yet a standard. However, most CEP engines use a CQL based on the SQL standard, so almost any logical condition that can be expressed in SQL can be described. In addition, CEP engines provide special syntax for time windows and sliding windows, and in the case of the ESPER CEP engine [13] we can even use Java classes as user-defined functions. We can therefore describe the conditions of the target data and extract it exactly.
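The user-defined-function feature mentioned above, calling a Java method inside a CQL condition, could look roughly like the following sketch. The function name insideZone, its bounding-box logic and the ObjectEvent type are illustrative assumptions rather than the functions used in our system; only the general mechanism (registering a static method as a single-row function) reflects Esper's documented behavior in its older client API.

```java
import com.espertech.esper.client.Configuration;
import com.espertech.esper.client.EPServiceProvider;
import com.espertech.esper.client.EPServiceProviderManager;

import java.util.HashMap;
import java.util.Map;

public class UdfConditionSketch {
    // A user-defined helper that a CQL condition can call (hypothetical zone test).
    public static boolean insideZone(Integer x, Integer y) {
        return x >= 0 && x <= 640 && y >= 0 && y <= 480;
    }

    public static void main(String[] args) {
        Configuration config = new Configuration();

        // Expose the static method above to EPL as a single-row function named insideZone().
        config.addPlugInSingleRowFunction(
                "insideZone", UdfConditionSketch.class.getName(), "insideZone");

        // A hypothetical structured event type produced by BP.
        Map<String, Object> objectEvent = new HashMap<>();
        objectEvent.put("x", Integer.class);
        objectEvent.put("y", Integer.class);
        objectEvent.put("existA", Boolean.class);
        config.addEventType("ObjectEvent", objectEvent);

        EPServiceProvider engine = EPServiceProviderManager.getDefaultProvider(config);

        // The "where" clause combines SQL-style predicates, a 10-second sliding window
        // and the custom function.
        engine.getEPAdministrator().createEPL(
                "select x, y from ObjectEvent.win:time(10 sec) "
              + "where existA = true and insideZone(x, y)");
    }
}
```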

B. RUBA Framework

Figure 3. RUBA framework

Fig. 3 shows the RUBA framework. It consists of the RUBA processor, which has a data receive module and a data send module, an (un)structured data processing module and a real-time analysis module. In the RUBA processor, the data receive module collects unstructured data such as video data and structured data such as typical sensor data through BPI. BPI is an interface for collecting big data from various sources and can be changed according to the application. The data send module sends the result of the real-time analysis through BAOI, which can also be changed according to the application. For example, in a web application, the HTTP protocol can serve as BPI and BAOI, while in USN applications RS-232, RS-485 or ZigBee can serve as BPI and BAOI.

BP receives data from BPI and processes it. In the case of unstructured data, BP extracts feature data using feature extraction modules; for image processing, colors, positions and shapes can be features. The feature extraction modules for these various features can be defined according to the analysis purpose. In the RUBA framework, feature extraction modules are developed as independent modules such as Java class files, so we can easily add or delete a processing module for each analysis purpose (a loading sketch is given below).
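One plausible way to realize such pluggable modules, sketched below, is to load a module class at runtime with a URLClassLoader. The directory layout, class name and the extract(byte[]) convention are illustrative assumptions, not the actual module interface of RUBA.

```java
import java.io.File;
import java.lang.reflect.Method;
import java.net.URL;
import java.net.URLClassLoader;

public class ModuleLoaderSketch {
    // Minimal sketch: load a feature-extraction module shipped as a class/jar file at
    // runtime, so modules can be added or replaced without rebuilding the RUBA processor.
    // Assumed convention: every module exposes extract(byte[]) returning a feature record.
    public static Object runModule(File moduleDir, String className, byte[] frame)
            throws Exception {
        URL[] urls = { moduleDir.toURI().toURL() };
        try (URLClassLoader loader =
                     new URLClassLoader(urls, ModuleLoaderSketch.class.getClassLoader())) {
            Class<?> moduleClass = loader.loadClass(className);
            Object module = moduleClass.getDeclaredConstructor().newInstance();
            Method extract = moduleClass.getMethod("extract", byte[].class);
            return extract.invoke(module, (Object) frame);
        }
    }
}
```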

In BP, the unstructured data is converted into structured data that follows the data structure of BAI, the input interface of BA. BA is based on the CEP engine and has input queues for the data stream. Depending on the CEP engine, the data structure of BA's input queues can be a structure, an array, a map and so on. Finally, the input data is analyzed in BA with user-defined CQL statements, and the result of BA is sent to users through BAOI.
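To illustrate how BP output could be handed to BA, the sketch below declares an object-array event type whose fields mirror the [x, y, existence of A, existence of B] record used later in the demonstration, and pushes one record into the engine. The field names and the choice of Esper's object-array event representation are assumptions; as noted above, the input structure depends on the CEP engine.

```java
import com.espertech.esper.client.Configuration;
import com.espertech.esper.client.EPServiceProvider;
import com.espertech.esper.client.EPServiceProviderManager;

public class BaiInputSketch {
    public static void main(String[] args) {
        // Hypothetical BAI record layout: the structured output of BP as an object array.
        Configuration config = new Configuration();
        config.addEventType("ObjectEvent",
                new String[] {"x", "y", "existA", "existB"},
                new Object[] {Integer.class, Integer.class, Boolean.class, Boolean.class});

        EPServiceProvider engine = EPServiceProviderManager.getDefaultProvider(config);

        // BP pushes one structured record into BA's input queue; registered CQL
        // statements evaluate it immediately.
        engine.getEPRuntime().sendEvent(
                new Object[] {230, 240, true, true}, "ObjectEvent");
    }
}
```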

The RUBA framework has a data flow from collecting data to sending the result data to the user. There is also a control flow for the management of CQL statements and data processing modules. When the user wants to add or delete a CQL statement, the user sends a CQL management message containing a command field, a destination ID, a CQL ID and a CQL statement; this message is sent from the user to the system through BAOI. The message for managing data processing modules contains a command field, a destination ID, a module ID and a module filename. These messages are received by the RUBA processor, and the appropriate actions are performed according to the value of the command field.
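A minimal representation of the CQL-management message described above might look like the following. The field set follows the description in this section, while the REGISTER/DELETE command names and the pipe-delimited encoding are assumptions made for the sketch.

```java
public class CqlControlMessage {
    // Field set follows the CQL-management message of this section; the command names
    // and the pipe-delimited wire encoding are assumptions for illustration.
    public enum Command { REGISTER, DELETE }

    public final Command command;
    public final String destinationId;  // which RUBA processor should act on the message
    public final String cqlId;          // identifier later used to delete the statement
    public final String cqlStatement;   // empty for DELETE

    public CqlControlMessage(Command command, String destinationId,
                             String cqlId, String cqlStatement) {
        this.command = command;
        this.destinationId = destinationId;
        this.cqlId = cqlId;
        this.cqlStatement = cqlStatement;
    }

    public String encode() {
        return command + "|" + destinationId + "|" + cqlId + "|" + cqlStatement;
    }

    public static CqlControlMessage decode(String line) {
        String[] parts = line.split("\\|", 4);
        return new CqlControlMessage(Command.valueOf(parts[0]), parts[1], parts[2],
                parts.length > 3 ? parts[3] : "");
    }

    public static void main(String[] args) {
        CqlControlMessage msg = new CqlControlMessage(Command.REGISTER, "analysis-01",
                "CQL1", "select count(*) from CarEvent(location = 'A').win:time(1 min)");
        System.out.println(msg.encode());
        // The RUBA processor would decode the line and act on the command field,
        // e.g. REGISTER -> EPAdministrator.createEPL(cqlStatement).
        System.out.println(decode(msg.encode()).command);
    }
}
```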

Figure 4. Implemented system in a wired network environment of RUBA

Fig. 4 and Fig. 5 show examples of RUBA framework implementations. In Fig. 4, the RUBA framework is implemented for a wired network environment. In this example, there are n analysis systems and a management server system (MS: Management System) that provides a UI for a CQL editor and a view of the analysis results. The MS can send CQL statements and data processing modules (*.class files) to the analysis systems while they are running. Therefore, we can not only analyze the big data but also modify the analysis strategies in real-time.


Figure 5. Implemented system in a wireless network and distributed system environment of RUBA

The example in Fig. 5 is implemented for a distributed environment using a wireless network. In this example, BA and BP are separated into different systems, and the result data of BA is transmitted over the wireless network. If we transmitted the unstructured big data to the analysis system over the wireless network without BA, the network operation cost would be high; with the RUBA framework, however, we transmit only the minimum data required for analysis. As a result, we can collect big data from moving objects and find important knowledge in real-time at a low cost for processing, analysis and transmission of the big data.

IV. DEMONSTRATION

Figure 6. Objects and a camera (Kinect) in the demonstration

We have implemented an object monitoring system using the RUBA framework to confirm the availability of the proposed framework. This system includes an image sensor, which consists of a camera and an image processing module, an analysis system that uses the ESPER CEP engine, and a management system. The system is structured as a distributed system like the example in Fig. 5 and is connected with a wired network. Fig. 6 shows the camera and the moving objects on the rail track of our demonstration. In this demo, we use CQL to analyze whether the two objects are on the rail track properly and pass through the appointed locations normally, and the system reports the analysis result on web pages. We have defined three locations to detect normal and abnormal states: the normal paths are LOT-A and LOT-C, and the abnormal path is LOT-B.

Figure 7. Analysis process in the demonstration: 1) object detection; 2) feature extraction; 3) registered CQL; 4) analysis result on the web page


In the image sensor, the camera (a Kinect was used) captures video data of the objects (see Fig. 6), and the image processing module extracts information about the two objects. The existence and position of the objects are extracted, and the resulting value is returned to the analysis system. The value is an array of the form [x_point, y_point, Boolean of A's existence, Boolean of B's existence] and is computed by functions in a .class file on the image sensor. For example, if both objects exist and their position (x, y) is (230, 240), the value is [230, 240, 1, 1]; if object A is absent, the value is [230, 240, 0, 1]. After processing on the image sensor, the result value is sent to the analysis system. We used the CoAP protocol as the transmission method, but it can be changed according to the application environment.
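A small sketch of how the image sensor's result value could be assembled and encoded before being posted over CoAP is shown below. The [x, y, existence of A, existence of B] layout and the example values come from the description above; the textual encoding and the helper names are assumptions, and the actual CoAP client code is omitted because it depends on the chosen library.

```java
import java.util.Locale;

public class SensorPayloadSketch {
    // Build the observation record described above:
    // [x_point, y_point, existence of A, existence of B], e.g. [230, 240, 1, 1].
    public static int[] buildObservation(int x, int y, boolean existA, boolean existB) {
        return new int[] {x, y, existA ? 1 : 0, existB ? 1 : 0};
    }

    // A simple textual encoding for transmission; the real wire format used over CoAP
    // in the demonstration is not fixed here, so this is an assumption.
    public static String encode(int[] obs) {
        return String.format(Locale.ROOT, "[%d, %d, %d, %d]", obs[0], obs[1], obs[2], obs[3]);
    }

    public static void main(String[] args) {
        int[] obs = buildObservation(230, 240, true, false);
        System.out.println(encode(obs)); // -> [230, 240, 1, 0]
        // This string would be POSTed to the analysis system with a CoAP client library.
    }
}
```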

Fig. 7 shows the analysis process of the demonstration system. First, the image sensor captures video data of the objects and extracts information from it using the image processing module. Second, the result of the image sensor is sent to the analysis system using the CoAP protocol. Third, in the analysis system, the registered CQL statements that analyze the states of the objects are executed continuously. Finally, the analysis result is sent to the management system and shown on the web page. In Fig. 7(3), the registered CQL statement detects whether the objects pass through the LOT-C location properly, and the result view web page (see Fig. 7(4)) shows the result of another registered CQL statement that analyzes whether the objects have passed through LOT-A properly.
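The exact statement registered in Fig. 7(3) is not reproduced here, but a plausible Esper-style formulation of "object A is observed inside LOT-C" is sketched below. The ObjectEvent type, the field names and the LOT-C bounding-box coordinates are assumptions made for illustration.

```java
public class LotCDetectionCql {
    // Hypothetical CQL/EPL for the demonstration: flag when object A is seen inside the
    // LOT-C region. The coordinates and the event type name are illustrative only.
    public static final String PASS_LOT_C =
            "select x, y from ObjectEvent "
          + "where existA = true and x between 200 and 300 and y between 200 and 300";

    public static void main(String[] args) {
        // This string would be registered through the MS CQL editor while the system runs.
        System.out.println(PASS_LOT_C);
    }
}
```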

Users can write a CQL statement for a particular analysis on the management system's CQL editor page and send a processing module file to the image sensor. The RUBA framework thus provides real-time unstructured big data analysis together with a real-time modification function for new data or new analysis strategies without re-coding and re-building. In this demonstration, we changed the CQL statements for new analysis strategies and confirmed the correct behavior in real-time.

V. CONCLUSIONS AND FUTURE WORKS

With the advance of ICT technologies, massive data of various types is increasing explosively. Because this big data contains important information about the trends of many phenomena, analyzing it will become an increasingly significant subject of study. In this paper, we proposed a novel framework for real-time analysis of unstructured big data such as video, image, sound and text. The RUBA framework analyzes big data using a CEP engine and uses CQL to modify the analysis conditions in real-time without re-executing the system. In addition, the RUBA framework makes it easy to manage several distributed analysis systems through its CQL management method. We implemented an object monitoring system applying the RUBA framework and confirmed the availability of the proposed framework through real-time data analysis and modification of the analysis conditions. In the future, we plan to build a solution for real-time big data analysis using the RUBA framework and to apply it to the fields of U-City, U-Plant and ITS.

REFERENCES

[1] http://hadoop.apache.org

[2] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," in Proc. OSDI '04, 2004.

[3] A. Arasu, B. Babcock, S. Babu, J. Cieslewicz, M. Datar, K. Ito, R. Motwani, U. Srivastava, and J. Widom, "STREAM: The Stanford Data Stream Management System," Technical Report, Stanford InfoLab, 2004.

[4] J. T. Kim, B. J. Oh and J. Y. Park, "Standard Trends for the Big Data Technologies," Electronics and Telecommunications Trends, ETRI, 2013, pp. 92-99.

[5] C. H. Lee, J. Hur, H. J. Oh, H. J. Kim, P. M. Ryu and H. K. Kim, "Technology Trends of Issue Detection and Predictive Analysis on Social Big Data," Electronics and Telecommunications Trends, ETRI, 2013, pp. 62-71.

[6] R. Agrawal and R. Srikant, "Mining Sequential Patterns," in Proc. ICDE '95, 1995, pp. 3-14.

[7] J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal and M. Hsu, "PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth," in Proc. ICDE '01, 2001, pp. 215-226.

[8] U. Yun, "WIS: Weighted Interesting Sequential Pattern Mining with a Similar Level of Support and/or Weight," ETRI Journal, vol. 29, 2007, pp. 336-352.

[9] C. F. Ahmed, S. K. Tanbeer and B. S. Jeong, "A Novel Approach for Mining High-Utility Sequential Patterns in Sequence Databases," ETRI Journal, vol. 32, 2010, pp. 676-686.

[10] J. I. Kim, P. S. Choi and B. H. Hwang, "Real-time Sequential Pattern Mining for USN System," in Proc. ICUIMC '12, 2012.

[11] B. Babcock, S. Babu, M. Datar, R. Motwani and J. Widom, "Models and Issues in Data Stream Systems," in Proc. ACM PODS 2002, 2002.

[12] J. I. Kim, N. W. Kim, S. K. Yun and B. T. Lee, "A Study on CEP Performance in Mobile Embedded System," in Proc. ICTC 2012, 2012, pp. 15-17.

[13] EsperTech, "Esper," http://www.espertech.com/esper
