A Summarization Paradigm of Open Challenges for Data Stream Mining Issues

Dr. V. Kavitha, Associate Professor, Department of MCA, Hindusthan College of Arts and Science, Coimbatore, India

S. Subhasini, Assistant Professor, Department of Computer Applications (BCA), Hindusthan College of Arts and Science, Coimbatore, India

Abstract: Databases today are very large and must store huge amounts of data that are continuously generated and processed. Analyzing such huge data sets and extracting valuable patterns for real-time applications is of considerable interest to data mining researchers. Two kinds of processes are identified for mining such data. The former streams the data and applies data mining algorithms and techniques, whereas the latter solves a specific problem with relevant and efficient algorithms. Recently, many data mining researchers have concentrated on data stream mining as an efficient way to handle an entire database. The most important obstacle in data stream mining is that continuously growing data is difficult to handle with conventional mining techniques; hence, unsupervised techniques should be applied. Clustering techniques can help discover hidden information in huge databases. This survey identifies the various obstacles of data stream mining, the specific difficulties encountered in this field of research, and the prominent techniques for tackling these problems.

Keywords: Stream mining, Distance Measure.

I. INTRODUCTION

In information processing and data analysis, data mining plays a vital role in extracting knowledge from very large real-time data sets. Similarly, stream data mining is concerned with extracting knowledge from a continuous, non-stopping stream of real-time information. A data stream is a dynamic, massive, potentially infinite, continuous and rapid sequence of data elements rather than a static data set. Recent research in data stream mining is stimulated by emerging real-time applications involving tremendous data sets, such as structural engineering data, medical data, stock market data, multimedia data, sensor data, temperature data, website data, scientific and engineering data, customer click streams, and telephone records. Extracting useful knowledge from data streams is a challenging research task. Most data mining and knowledge discovery techniques assume that a finite amount of data is generated by an unknown, stationary probability distribution, and that this data can be physically stored and analyzed in multiple passes by a batch-mode algorithm. Implementing effective techniques for data streams is a recent research challenge. For data stream mining, however, new algorithms must take the following characteristics into account.

* Data objects arrive continuously from real-time applications.

* The system has no control over the order in which data objects arrive for processing.

* The volume of data is potentially unbounded.

* Data objects are normally discarded once they have been processed. In practice, part of the data can be stored for a given period of time, using a forgetting mechanism to discard it later.

* The unknown data generation process may change over time.
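The forgetting mechanism mentioned in the list above can be illustrated with a fixed-size sliding window, which keeps only the most recent elements of the stream and silently drops older ones. The following is a minimal sketch, not a method taken from the paper; the class name `SlidingWindow` and the window size are assumptions made for illustration.

```python
from collections import deque


class SlidingWindow:
    """Keep only the most recent `size` stream elements (hypothetical helper)."""

    def __init__(self, size):
        if size <= 0:
            raise ValueError("size must be positive")
        # A deque with maxlen drops its oldest item when a new one is appended.
        self._items = deque(maxlen=size)

    def add(self, item):
        self._items.append(item)

    def contents(self):
        return list(self._items)

    def mean(self):
        # Summary statistic over the retained part of the stream only.
        if not self._items:
            raise ValueError("window is empty")
        return sum(self._items) / len(self._items)


window = SlidingWindow(size=3)
for reading in [10, 20, 30, 40, 50]:
    window.add(reading)
print(window.contents())  # the two oldest readings have been forgotten
```

Memory use stays bounded by the window size regardless of how long the stream runs, at the cost of losing everything older than the window.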

These obstacles create a number of unique challenges that are not easily solved by conventional data mining techniques. Characteristics of data streams such as their very large size, dynamic nature, high speed, unbounded memory requirements, lack of a global view, and continuous flow pose significant challenges for researchers dealing with dynamic streaming data. Because of its volume, an entire data stream cannot practically be stored or scanned multiple times as traditional data sets are. The main motivation of this paper is to study the various techniques and algorithms for mining data streams. Section I is the introduction. Section II describes the foundations of stream data processing. Section III outlines the challenges of data stream mining. Section IV discusses research directions for stream data. Section V analyses various distance measures, and Section VI concludes the paper.

II. FOUNDATIONS OF STREAM DATA PROCESSING

Processing and analyzing streaming data requires effective techniques, algorithms and innovative data structures. A traditional database system does not have enough space to store and process stream data. Moreover, the continuous nature of streaming data forces a trade-off between accuracy and storage. Hence, approximate results are often accepted in place of exact results. From a technical point of view, the algorithms must be efficient in both execution time and memory space. Instead of storing the whole data stream, synopsis data structures use much more compact space, polylogarithmic in the size of the stream. This section discusses sliding windows and random sampling.

* Sliding Windows

* Random Sampling

In streaming data mining, sampling methods are among the simplest methods for synopsis construction. Rather than processing the whole data stream, the system can sample the stream at periodic intervals. The main advantage of random sampling is that the synopsis is easy to use with a wide variety of applications, since its representation is not specialized and uses the same multidimensional representation as the original data points. Obtaining an unbiased sample by conventional means requires knowing the length of the full data set in advance, which is not possible for a stream. A technique called reservoir sampling can be used to select an unbiased random sample of s data elements without replacement. The idea behind reservoir sampling is relatively simple. The system maintains a sample of size at least s, called the “reservoir”, from which a random sample of size s can be produced. When the reservoir is very large, however, generating this sample from it can be expensive. To avoid this step, the system maintains a set of exactly s candidates in the reservoir, which form a true random sample of the elements seen so far in the data stream. As the data stream flows, every new data element has a certain probability of replacing an old element in the reservoir.
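The replacement rule described above can be sketched with the classic single-pass scheme in which the t-th element is kept with probability s/t. This is a minimal illustration rather than the authors' implementation; the function name `reservoir_sample` and the optional `rng` parameter are assumptions introduced here.

```python
import random


def reservoir_sample(stream, s, rng=None):
    """Return s elements drawn uniformly without replacement from an iterable."""
    rng = rng or random.Random()
    reservoir = []
    for t, element in enumerate(stream, start=1):
        if t <= s:
            # The first s elements fill the reservoir directly.
            reservoir.append(element)
        else:
            # Element t is kept with probability s / t ...
            j = rng.randrange(t)  # uniform integer in [0, t)
            if j < s:
                # ... and then replaces a uniformly chosen old candidate.
                reservoir[j] = element
    return reservoir


sample = reservoir_sample(range(1_000_000), s=5, rng=random.Random(42))
print(sample)
```

The stream is read once and only s elements are held in memory at any time, so the length of the stream does not need to be known in advance.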

III. DATA MINING CHALLENGES

Data stream mining faces many challenges, and more are likely to arise in future. They vary with the nature and size of the data, and researchers will have to deal with this variety of challenges. Some of them are listed below.

Data stream mining has several challenges that follow from the nature of the data, such as volume, velocity and volatility. Figure 1.1 shows these challenges to familiarise researchers with them.

Fig 1.1 Data Mining Challenges: Volume, Velocity, Volatility, Variety, Variability, Value, Veracity, Validity

A. Volume:

Real data sets contain enormous volumes of data and grow continuously, but existing data processing techniques are not sufficient to analyze such a mass of data. Large volumes of data are produced by networks, automated machines and human interaction systems such as social media. Hence, innovative systems are required to analyze massive volumes of data. As the volume increases, data handling becomes a major obstacle in data stream mining.

B. Velocity:

In data stream mining, information arrives continuously. Acquiring useful information from real-world databases in the form of data streams is a complicated process. Velocity is the speed at which data flows from sources such as business processes, social media, automated machines, networks and mobile devices. The flow of data in a stream database is massive and continuous. Such live real-world data streams can help researchers make sound and reliable decisions that provide strategic competitive advantages.

C. Volatility:

Data volatility describes how long data remains valid and how long it should be stored in the database. For live real-world data sets, it is necessary to determine the point at which data is no longer relevant to the current analysis.

D. Variety:

Many varieties of data are found in real-world databases, such as sensor data, images, audio, video and text. Well-organized stream data mining techniques are needed to manage these varieties of data.

E. Variability:

Data items can differ in structure and appear in various forms, so each kind of data stream must be interpreted appropriately. The data can also change over time for natural reasons.

F. Value:

All data in a real-time database should have a specific value according to its nature. Exploiting this value leads to useful, optimal solutions and eventually to a competitive advantage.

G. Veracity and Validity:

Some of the stream data in real-world databases is not entirely reliable, so the system must be able to manage and handle uncertain data.

IV. RESEARCH DIRECTION OF STREAM DATA

Several factors have been identified as open research issues in data stream mining. They are listed below.

A. Hidden Data

Data streams are massive because data flows continuously. Large quantities of useful and reliable information are nevertheless lost, because such data is largely file-based, untagged and unstructured. Only a small fraction of the information is tagged, and an even smaller amount is analyzed. As a result, the structure of real data sets is often unclear. The open data and semantic web movements have emerged in response, but they still require attention and improvement.

B. Analytics architecture

It is not yet clear how the architecture of an analytics system should be built to deal with real-time data and historical data at the same time. One proposal is the Lambda architecture of Nathan Marz. It addresses the problem of computing arbitrary functions on arbitrary data in real time by decomposing the problem into three layers, namely:

* Batch Layer

* Serving Layer

* Speed Layer

It combines Hadoop for the batch layer and Storm for the speed layer in the same system.
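The division of work among the three layers can be illustrated with a toy, in-memory event counter. This is only a sketch of the idea under strong simplifying assumptions: it does not use Hadoop or Storm, and the class `LambdaCounter` and its method names are hypothetical.

```python
from collections import Counter


class LambdaCounter:
    """Toy Lambda-style pipeline that counts events per key (illustrative only)."""

    def __init__(self):
        self.master_log = []            # append-only historical data
        self.batch_view = Counter()     # serving layer: result of the last batch run
        self.realtime_view = Counter()  # speed layer: events since the last batch run

    def ingest(self, key):
        # New data is recorded in the master log and in the speed layer.
        self.master_log.append(key)
        self.realtime_view[key] += 1

    def run_batch(self):
        # Batch layer: recompute the view from all historical data, then drop
        # the real-time increments that the new batch view already covers.
        self.batch_view = Counter(self.master_log)
        self.realtime_view.clear()

    def query(self, key):
        # A query merges the batch view with the real-time view.
        return self.batch_view[key] + self.realtime_view[key]


pipeline = LambdaCounter()
for event in ["click", "view", "click"]:
    pipeline.ingest(event)
pipeline.run_batch()
pipeline.ingest("click")       # arrives after the batch run
print(pipeline.query("click"))  # batch count plus real-time count
```

The batch view is always complete but stale, the real-time view is fresh but partial, and a query combines the two.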

C. Evaluation

D. Distributed Mining

Most data mining algorithms and techniques are not trivial to parallelize. Producing distributed versions of these techniques requires substantial research, with both theoretical analysis and practical experiments.

E. Time Evolving Data

Data can evolve over time, so it is important that data mining techniques and algorithms are able to adapt to change and, in some cases, to detect it explicitly. Many data stream mining techniques are motivated by exactly this requirement.
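One simple way to detect change explicitly is to compare the mean of the most recent values with the mean of the values just before them. The sketch below is an illustration only, not a technique proposed in this paper; the function name `detect_mean_shift`, the window length and the threshold are all assumptions.

```python
from collections import deque


def detect_mean_shift(stream, window=50, threshold=1.0):
    """Yield each index at which the recent mean departs from the reference mean.

    Assumption: "change" means that the mean of the last `window` values
    differs from the mean of the preceding `window` values by more than
    `threshold`.
    """
    reference = deque(maxlen=window)
    recent = deque(maxlen=window)
    for i, x in enumerate(stream):
        if len(recent) == window:
            # The oldest recent value moves into the reference window.
            reference.append(recent[0])
        recent.append(x)
        if len(reference) == window:
            gap = abs(sum(recent) / window - sum(reference) / window)
            if gap > threshold:
                yield i
                # Adapt: forget the old concept and start learning again.
                reference.clear()
                recent.clear()


stream = [0.0] * 200 + [5.0] * 200  # the level changes at index 200
print(list(detect_mean_shift(stream)))
```

After a detection both windows are cleared, so the detector adapts to the new level instead of reporting the same change repeatedly.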

F. Compression

When dealing with big data, the amount of space required to store it is highly relevant. There are two main approaches to reducing it, namely compression and sampling. Compression has the following properties:

• No information is lost

• More time is used, but less memory space

• It can be regarded as a transformation from time to space

G. Sampling

• Information is lost

• The complexity of real-time big data problems is reduced
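The contrast between the two approaches can be shown on a small buffer of readings. This is a minimal sketch using Python's standard `zlib` and `random` modules; the synthetic readings and the sample size are assumptions chosen only for illustration.

```python
import random
import zlib

# Hypothetical stream buffer: 10,000 repetitive sensor readings.
readings = [20 + (i % 5) for i in range(10_000)]
raw = ",".join(map(str, readings)).encode("ascii")

# Compression: spends CPU time to save space, and is fully reversible.
packed = zlib.compress(raw, 9)
restored = zlib.decompress(packed)

# Sampling: keeps a small subset, so the discarded readings cannot be recovered.
sample = random.Random(0).sample(readings, k=100)

print(len(raw), len(packed), len(sample))
```

The decompressed bytes are identical to the original buffer, whereas the sample is much smaller than the data but cannot be used to rebuild it.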

H. Visualisation

A major issue in real-time big data analysis is how to visualise the results. Presenting information derived from large quantities of data in a form that humans can understand is quite challenging.

V. ANALYSING VARIOUS DISTANCE MEASURES

Distance measures play an important role in data mining, since most data mining techniques depend on them. Hence, finding an effective distance measure is vital. Assessing cluster quality is one of the most significant steps in data mining research, and both inter-cluster and intra-cluster quality are determined with the help of an efficient distance measure. Such a measure determines the distance between two data stream objects of a particular cluster. Table 1 lists the differences between three distance measures.

Table 1: Comparing various distance measures

| Distance Measure | Features of the Distance Measure |
|---|---|
| Euclidean Distance | Easy to calculate; finds the best distance; time complexity is low |
| Manhattan Distance | Easy to calculate; distance is not good; time complexity is moderate |
| Minkowski Distance | Hybrid method; distance is good; time complexity is higher |

Table 1 is intended to help select the most suitable distance measure. Three distance measures, namely Euclidean distance, Manhattan distance and Minkowski distance, are compared in this analysis. Each of them can be applied with various clustering techniques. The chosen distance measure determines the clustering that is obtained and also affects the time complexity.

A. Euclidean Distance Metric

Euclidean distance is a common and simple distance measure, used here to calculate the distance between two time series data stream objects.

The figure illustrates how the distance between two data points is calculated. The time series data stream points X1 and X2 are fixed on the cluster surface. Using the Euclidean distance measure, the distance between the two data points is calculated directly and easily. Compared with the other distance metrics, Euclidean distance has lower time complexity, is very simple to calculate and gives the best distance. Hence, this distance metric is preferred for this research work.

The Euclidean distance measure is calculated by the following formula:

$$dis(a,b) = \sqrt{\sum_{n=1}^{N} (a_n - b_n)^2}$$

where

dis(a,b) = distance between the two time series data stream data points a and b,

N = total number of time series data points,

n = index of the current data point (the iteration counter of the sum).

The proposed system takes the diameter of a cluster to be the maximum dissimilarity between any two time series data streams belonging to the same cluster.
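The Euclidean distance and the cluster diameter described above can be sketched as follows. This is an illustrative example that assumes each time series is a sequence of numbers of equal length; the function names `euclidean` and `cluster_diameter` are introduced here and do not come from the paper.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Square root of the summed squared coordinate differences."""
    if len(a) != len(b):
        raise ValueError("series must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cluster_diameter(cluster, distance=euclidean):
    """Largest pairwise distance among the members of one cluster."""
    return max((distance(a, b) for a, b in combinations(cluster, 2)), default=0.0)


x1, x2 = (0.0, 0.0), (3.0, 4.0)
print(euclidean(x1, x2))                       # 5.0
print(cluster_diameter([x1, x2, (1.0, 1.0)]))  # 5.0
```

A single distance is computed in one pass over the N coordinates, while the diameter compares every pair of members and therefore grows quadratically with the cluster size.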

B. Manhattan Distance Metric

Manhattan distance is another distance measure used to calculate the distance between two stream data points.

Figure: Manhattan distance between the two stream points X1 and X2

The figure illustrates how the distance between two time series data stream points is calculated. The time series data stream points X1 and X2 are fixed on the cluster surface, and the distance between them is calculated using the Manhattan distance measure.

In this method, a centre point is fixed with respect to the x and y axes. Relative to that centre point, the x-axis component and the y-axis component are determined, and the distance is then calculated from them. Hence, more time is taken to find the distance between the two data points.

Compared with the other distance metrics, Manhattan distance is easy to calculate and has moderate time complexity. The limitations of this metric are:

i) The time complexity is not favourable.

ii) It gives the worst distance.

Hence, owing to these limitations, this metric is not suggested for the proposed research work.
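For reference, the Manhattan distance is the sum of the absolute coordinate differences, dis(a,b) = |a_1 - b_1| + ... + |a_N - b_N|. A minimal sketch follows, assuming each time series is a sequence of numbers of equal length; the function name `manhattan` is introduced here for illustration.

```python
def manhattan(a, b):
    """Sum of the absolute coordinate differences (city-block distance)."""
    if len(a) != len(b):
        raise ValueError("series must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b))


x1, x2 = (0.0, 0.0), (3.0, 4.0)
print(manhattan(x1, x2))  # 7.0
```

For the same two points, the straight-line Euclidean distance is 5.0, so the Manhattan value is never smaller than the Euclidean one. The computation itself is a single pass over the N coordinates.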

C. Minkowski Distance Measure

Minkowski distance is a hybrid distance measure used to calculate the distance between two time series data stream objects.

Figure: Minkowski distance between the two stream points X1 and X2
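The Minkowski distance of order p is the p-th root of the sum of the p-th powers of the absolute coordinate differences. It includes the Manhattan distance (p = 1) and the Euclidean distance (p = 2) as special cases, which is one way to read the description of it as a hybrid measure. The sketch below is illustrative only; the function name `minkowski` and the default order p = 3 are assumptions.

```python
def minkowski(a, b, p=3):
    """Minkowski distance of order p (p >= 1) between two equal-length series."""
    if len(a) != len(b):
        raise ValueError("series must have the same length")
    if p < 1:
        raise ValueError("p must be at least 1")
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


x1, x2 = (0.0, 0.0), (3.0, 4.0)
print(minkowski(x1, x2, p=1))  # 7.0, the Manhattan distance
print(minkowski(x1, x2, p=2))  # 5.0, the Euclidean distance
```

As with the other two measures, one distance is computed in a single pass over the coordinates; the extra cost comes from raising each difference to the power p and taking the final root.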

VI. CONCLUSION

Real-time stream data is very large and flows continuously, so it requires a large amount of storage for data that is continuously generated and processed. Analyzing such massive data streams and extracting useful and valuable patterns for numerous real-world applications has attracted much attention from data stream mining researchers. Two kinds of processes are generally identified for data mining: streaming the data and applying data mining algorithms and techniques, and solving a specific problem with relevant and efficient algorithms. The challenges of data stream mining are classified as volume, velocity, variety, value and so on. Research directions such as hidden data, evaluation, compression and visualization are also outlined in this survey. The survey identifies the various obstacles of data stream mining, the specific difficulties encountered in this field of research, and the prominent techniques for tackling these problems.
