Emergent Trends and Challenges in Big Data Analytics, Data Mining,
Virtualization and Cyber Crimes: An Integrated Global Perspective- I
Gurdeep S Hura
Department of Mathematics and Computer Science, University of Maryland Eastern Shore, Princess Anne, MD 21853
Abstract
Big data and big data analytics play important roles in a wide range of applications, offering capabilities and tools that support efficient solutions for them. This technology has recently emerged as a popular research and practice-oriented framework that implements i) data mining, ii) predictive analysis and forecasting, iii) text mining, iv) virtualization, v) optimization, vi) data security, and vii) virtualization tools for processing very large data sets, particularly cloud data, in order to explore new business enterprise applications and decisions. Big data analytics was considered one of the fastest-growing technology trends in 2014 and is expected to remain so for a few more years, with a large number of big data applications, in particular in social networking, cloud computing, healthcare systems and many business systems. Thus, in order to understand this technology, we need to understand how each of the concepts (data mining, virtualization and data security) has evolved and contributed to big data analytics.
With this in mind, we propose a two-part paper that provides the state of the art of each of these interrelated concepts used in big data analytics, starting with how each concept evolved, its applications, available tools, limitations and current status, so that researchers and developers can understand how this new technology can be used for new applications and for deriving new technologies, tools and frameworks. In this first part of the paper, we focus on the conceptual design of big data applications, big data analytics and solutions, a discussion of a number of open source framework tools, and the roles of data mining and virtualization. Data mining techniques have been used in big data analysis, but recent applications for multimedia big data call for newer data mining techniques for managing and analyzing the huge amount of data. Further, easy representation and display of data analysis require an efficient and user-friendly tool that helps users interpret the data in a simple way. The second part of the paper focuses on the remaining two technologies, data virtualization and data security. Virtualization offers a very efficient tool for representing data, with the capability of displaying the dynamic behavior of the data. Since modern big data applications are implemented over the Internet, it is important to understand the various cyber-attacks and crimes that affect all the implementation phases of big data multimedia applications. This paper will also provide insights into how the applications can be protected from these attacks and how cyber-crime analysis techniques can be used for reliable big data implementation.
Specifically, Part I discusses i) the main concepts, features, applications, implementation issues and capabilities of big data analytics and ii) data mining for mining, analyzing and processing big data, while Part II discusses i) virtualization tools to extract useful information from processed data and ii) data security. The rationale for considering the three subtopics of data mining, virtualization and data security is that all successfully implemented big data applications are derived from big data analytics. We hope that these two papers will provide a clear understanding of big data analytics and describe how concepts such as data mining and text mining, virtualization and data security have evolved and have now been integrated into big data analytics.
I. BACKGROUND OF BIG DATA ANALYTICS AND DATA MINING (PART I)
The first part of the paper discusses the background of this new technology of big data analytics and then presents how it has been used to implement advanced data mining techniques, virtualization frameworks and data security for various applications such as the public sector, manufacturing, retail, healthcare, weather and scientific applications. In other words, the paper describes each of the concepts in detail and how each of them has been implemented in big data analytics. The paper describes operations in big data and discusses known big data applications and classifications. This is one of the reasons I have selected only these concepts and provided a detailed discussion of them. Further, the paper also describes various available open source tools that have been used to implement and solve big data applications and that implement various data mining techniques.
techniques for collection and analysis of huge amount of multimedia data need to be introduced. It is hoped these new efficient and formal data mining techniques will be used for understanding the big data analytics with a view to offer easy understanding, easy data formatting, interpreting and extracting useful information from collected data in the applications. The paper describes briefly the suitable data mining techniques and presents how some of the existing techniques will be redefined with a view to use in applications like multimedia data applications, social networking, scientific weather data and many other similar applications.
The paper presents various unresolved issues and problems in big data analytics and data mining, along with challenges and possible future applications. It also presents future research initiatives.
II. BACKGROUND OF VIRTUALIZATION AND DATA SECURITY (PART II)
The second part of the paper presents the state of the art of the remaining two important technologies, virtualization and data security, that have been implemented in big data analytics.
One of the implementation phases for big data solutions is data processing, and many methods can be used for it. A simple, user-friendly, visual and dynamic representation of data can be implemented by data virtualization. This method provides not only an easy representation of data but also the dynamic behavior of data movement, and it helps to extract useful information from the data. Virtualization tools represent the data processing steps in a simple way for data analysis. The paper discusses different architectures of virtualization tools, methodologies, mainframe virtualization, guidelines and the various abstraction tools of virtualization that have been used in big data applications.
Business and technology professionals and practitioners are deeply concerned about data security. Because data now comes from many different sources, such as mobile devices, real-time connectivity and digital business, the entire environment has changed, and it has become harder to protect data assets over the Internet. We have seen some security measures implemented in big data analytics, and it is expected that future big data applications will have an increasingly crucial role in providing data security. Recent years have seen efforts in data analytics to implement various countermeasures for data security, such as intrusion detection, differential privacy, preventive measures, authentication, digital watermarking, malware countermeasures and many others. Data security becomes critical when operational strategies must be implemented during a serious crisis. Some organizations and professionals find it difficult to be competitive in the absence of data security and are therefore adding advanced analytics capabilities to manage privacy and security challenges. By following this approach, they are able to build confidence and some level of trust among clients, customers and consumers. In order to reassure customers and consumers about privacy and data security issues, it is important to establish a framework that not only provides security but also evaluates and meets the needs of the business, of big data technology and of consumers and customers.
After a brief discussion of the role of data security in big data applications, the paper briefly describes malicious cyber-attacks and crimes. The paper presents the challenges and problems associated with creating a secure communication environment over the Internet for big data applications. Further, it briefly describes the various attacks and crimes carried out over the Internet, known as cyber-attacks and cyber-crimes. The paper also presents the known cyber-attacks and cyber-crimes that may affect the data processing, data mining techniques and virtualization tools of big data applications over the Internet. After explaining these attacks and crimes, the paper presents how big data implementations address security issues in new applications. Further, it presents cyber security analysis for big data applications.
In conclusion, after presenting the state of the art of big data analytics that implements data mining techniques in Part I and virtualization, text mining, optimization and associated security issues in Part II, the papers summarize the challenges, unsolved problems and future trends in the areas of big data and big data analytics, data mining techniques, virtualization and data security countermeasures for cyber-attacks and crimes. Further, the papers also elaborate on the various new technologies and methodologies that researchers are exploring to handle security issues in big data analytics.
Keywords: Big Data Analytics, Solution of big data applications, Data mining techniques, Virtualization, Virtualization Methodology and tools, Cyber-attacks, Cyber-crimes, Preventive and Defensive Measures
III. INTRODUCTION: PART I (BIG DATA ANALYTICS AND DATA MINING)
A. Big Data Analytics: A brief description
for enhancing the speed for analytic processing is further helping big data to provide more new applications.
The year 2014 saw a new technology emerge that provides solutions and implementations for a number of applications dealing with enormous amounts of data from different sources and devices in different formats. However, all these new applications suffer from data security problems, and as such big data analytics is taking on another role in data security. It has been observed that analytics have recently explored some countermeasures such as intrusion detection, differential privacy, digital watermarking, malware countermeasures and filters.
IV. BASIC DEFINITIONS AND CONCEPTS OF BIG DATA ANALYTICS:
Analytics and machine learning together are being applied to the process of analytics itself. Big data is very large in volume, usually in the range of zettabytes (10^21 bytes), flowing from our computers, mobile devices and machine sensors into the Internet. With the right tools for collection and analysis of big data applications, organizations can dive into all their data and gain valuable insights that were previously unimaginable. It is important to know that big data technologies and analysis tools can transform our applications into a new business arena. A framework must be introduced that can provide a big data solution for any application, with simple data processing, simple operations, simple setup, simple code and simple application development for easy implementation.
The huge amount of data stored in databases by various applications is growing exponentially, and it is becoming a major challenge to implement big data analytics, including data collection, storage, processing, sharing, analysis, visualization and other related tasks of big data applications. Data volumes in modern applications range from gigabytes (10^9 bytes) to terabytes (10^12), petabytes (10^15), exabytes (10^18) and zettabytes (10^21). One survey mentioned that about 5 exabytes of data were created worldwide in 2003. This volume is still increasing exponentially and was in the range of a few zettabytes in 2012. It is estimated that the volume of data may exceed 8 zettabytes by the end of 2015 [1-2]. IBM showed that over 2.5 exabytes of data are generated every day. Storing this volume of data would require over 20 billion PCs, since each PC today holds about 500 gigabytes of data. Google alone has more than one million servers around the world. There are over 6 billion mobile subscriptions worldwide, against a total world population of nearly 7.5 billion. It has been found that these mobile subscribers send over ten billion text messages per day [1-2-4].
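The storage estimate above can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming decimal (SI) units and using only the figures quoted in the text (about 500 gigabytes per PC, 20 billion PCs, 8 zettabytes of data); the constant and function names are illustrative and do not come from any cited source.

```python
# Decimal (SI) byte units, as used in the text above.
GIGABYTE = 10**9
TERABYTE = 10**12
PETABYTE = 10**15
EXABYTE = 10**18
ZETTABYTE = 10**21

def pcs_needed(total_bytes, bytes_per_pc=500 * GIGABYTE):
    """Number of PCs needed to hold total_bytes (rounded up)."""
    return -(-total_bytes // bytes_per_pc)  # ceiling division on integers

def capacity_in_zettabytes(num_pcs, bytes_per_pc=500 * GIGABYTE):
    """Combined capacity of num_pcs machines, in zettabytes."""
    return num_pcs * bytes_per_pc / ZETTABYTE

print(pcs_needed(8 * ZETTABYTE))           # PCs needed for 8 zettabytes
print(capacity_in_zettabytes(20 * 10**9))  # capacity of 20 billion PCs
```

Under these assumptions, 20 billion PCs of 500 gigabytes each hold about 10 zettabytes, which is of the same order as the 8 zettabytes estimated for 2015.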
A number of frameworks and tools can be used to generate appropriate solutions for big data business applications, but SAP HANA [1-6] seems to be a more widely accepted and robust framework for big data formatting and representation. SAP HANA, previously called SAP High-Performance Analytic Appliance, is an in-memory, column-oriented, relational database management system developed by SAP SE. Its architecture is designed to handle both high transaction rates and complex query processing on the same platform. An in-memory platform or framework offered as a service on top of SAP HANA can allow developers to build innovative applications with improved productivity. It is very powerful, with predictive analytics that provide intuitive modeling, advanced data visualization and profitable analysis of applications with big data. Business intelligence analysts will be responsible for showing how to predict future outcomes and uncover opportunities that maximize business performance [1-6, 8, 9, 11, 14, 38, 40].
V. COMPUTATIONAL OPERATIONS ON BIG DATA
Web services over the Internet support a variety of operations and services depending on the big data application. One very popular class of web operations is social network-based applications, which need data modeling and analysis to understand user behavior for more targeted advertising, marketing campaigns, capacity planning, customer behavior and buying patterns, and many other inferences. Based on these inferences, firms spend significant time and resources on optimizing big data content and recommendation engines. Some companies, such as Google and Amazon, publish articles related to their work. Inspired by these publications, developers are building similar technologies as open source software, such as Lucene, Solr, Hadoop and HBase. Facebook, Twitter and LinkedIn have gone a step further by publishing open source projects for big data such as Cassandra, Hive, Pig, Voldemort, Storm and IndexTank [1, 14].
In 2012, the Obama administration announced big data initiatives of more than $200 million in research and development investments for the National Science Foundation, National Institutes of Health, Department of Defense, Department of Energy and United States Geological Survey. The investments were launched to advance the instruments and methods for accessing, organizing and gleaning findings from vast volumes of digital data [21].
One of the biggest databases used by both machines and human beings is still being designed and understood by users and developers. There does not seem to exist any systematic structure that can provide accurate interdependency relationships between the various components of the web system. Currently, information can be retrieved from the database by using a single keyword or a combination of keywords to search for the web sites containing that information and show their Uniform Resource Locators (URLs). Many times this does not optimize the search time, because the result may contain a lot of unnecessary or irrelevant information; users are then expected to narrow the search words or hyperlinks, which in turn may provide faster retrieval of the requested information. Defining structure from an unstructured data set and deriving useful information from this structure is a first logical step in data analysis. The derivation of information is based on identifying documents in the collection on the basis of properties assigned to the documents by the user requesting the information retrieval [7, 9, 12-13, 38].
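Keyword-based retrieval of URLs, as described above, can be illustrated with a small inverted index. The following is a minimal sketch, not the mechanism of any particular search engine; the page texts, the URLs and the function names `build_index` and `search` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

def build_index(pages):
    """Map each lowercase word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, query):
    """Return URLs containing every keyword in the query (AND semantics)."""
    keywords = query.lower().split()
    if not keywords:
        return set()
    results = set(index.get(keywords[0], set()))
    for word in keywords[1:]:
        results &= index.get(word, set())
    return results

# Hypothetical pages, for illustration only.
pages = {
    "http://example.org/a": "big data analytics tools",
    "http://example.org/b": "data mining techniques",
    "http://example.org/c": "big data mining",
}
index = build_index(pages)
print(sorted(search(index, "data")))             # broad query, many URLs
print(sorted(search(index, "big data mining")))  # narrowed query, fewer URLs
```

The single keyword matches every page, while the combination of keywords narrows the result to one URL, which mirrors the narrowing step expected of users.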
The data in these applications usually do not follow any particular format or type, so defining the structure of the data and deriving useful information from them becomes a big challenge. This has been a main research objective in the areas of data mining, natural language processing, relational databases, data virtualization, cyber-crimes, crime analysis, information retrieval, data analysis, big data solutions, etc.
In order to appreciate and understand the state of the art of big data and big data analytics, the following section describes the classification of big data applications and some of the popular applications, along with their implementation and solutions on different platforms.
VI. CLASSIFICATION OF BIG DATA APPLICATIONS
Big data is becoming a current topic not only in the US but also in other countries around the globe, as it focuses on the collection, analysis, manipulation and visualization of big data in real time. Big data from different applications can generally be classified into the following major groups. The McKinsey Global Institute identified the potential of big data in five main areas [1, 17]:
Healthcare: clinical decision support systems, individual analytics applied for patient profile, personalized medicine, performance based pricing for personnel, analyze disease patterns, improve public health
Public sector: creating transparency by accessible related data, discover needs, improve performance, customize actions for suitable products and services, decision making with automated systems to decrease risks, innovating new products and services
Retail: in store behavior analysis, variety and price optimization, product placement design, improve performance, labor inputs optimization, distribution and logistics optimization, web based markets
Manufacturing: improved demand forecasting, supply chain planning, sales support, developed production operations, web search based applications.
Personal location data: smart routing, geo targeted advertising or emergency response, urban planning, new business models.
Business: Recently we have seen a new class of big data applications emerging in the market as Business. Some of the business applications include: e-mails, images, searching, health records, weather data, sensors and mobile devices, logs, social networks, IRS audits, social security administration, satellite communications, news bulletins, on-line transactions, banking systems, astronomy, atmospheric science, genomics, biogeochemical, biological science and research, life sciences, medical records, scientific research, government, natural disaster and resource management, private sector, military surveillance, financial services, retail, web logs, text, documents, photography, audio, video, click streams, search indexing, call detail records, Point of Sale (POS) software information, radio frequency identifier (RFID), mobile phones, sensor networks and telecommunications. Organizations in any industry that have big data can benefit from its careful analysis to gain insights and depth to solve real problems [1, 3, 20].
Other Applications: Other big data applications, such as banking, business informatics, meteorology, sports and medicine, use big data sets within data management and data mining. The storage and processing of these large volumes of data cannot be handled by traditional database management systems. Big data technology can be used to define a framework that provides solutions for big data sets. An attempt has been made to use a non-relational database architecture to implement horizontal scalability based on an optimized key-value format and agility. One such non-relational architecture is the NoSQL (Not Only SQL) database architecture [13, 24]. NoSQL databases also provide a structure for the storage of data, but these structures are less strict than a relational schema, which makes them suitable for some applications. It is important to know that NoSQL is in no way a replacement for a traditional RDBMS; it finds its applications in huge and complex data sets for which traditional relational database tools may not be suitable.
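The idea of horizontal scalability over a key-value format with a relaxed schema can be sketched as follows. This is a toy, in-memory illustration under stated assumptions, not the design of any specific NoSQL product: the class name `ShardedKeyValueStore`, the hash-modulo placement rule and the sample records are all hypothetical.

```python
import hashlib

class ShardedKeyValueStore:
    """Toy key-value store that spreads keys across several in-memory shards."""

    def __init__(self, num_shards=3):
        # Each dict stands in for a separate storage node.
        self.shards = [dict() for _ in range(num_shards)]

    def _shard_for(self, key):
        # Stable hash so that a key always maps to the same shard.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % len(self.shards)

    def put(self, key, value):
        self.shards[self._shard_for(key)][key] = value

    def get(self, key, default=None):
        return self.shards[self._shard_for(key)].get(key, default)

store = ShardedKeyValueStore(num_shards=3)
# Values need not share a schema: the two records have different fields.
store.put("user:1", {"name": "Ana", "city": "Lisbon"})
store.put("user:2", {"name": "Raj", "tags": ["sensor", "mobile"]})
print(store.get("user:2"))
print([len(shard) for shard in store.shards])  # how keys are spread out
```

Capacity grows by adding shards rather than by enlarging one server, and no table definition constrains the fields of each value. Simple modulo placement forces most keys to move when the shard count changes, which is why production systems typically use more elaborate placement schemes.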
VII. NEW IMPLEMENTATION SOLUTIONS OF BIG DATA APPLICATIONS
The volume of a big data set is usually above the terabyte range, and the data are usually unstructured. In the literature, the big data problem has been characterized as a combination of four V's: Volume (from terabytes to zettabytes of existing data to process), Velocity (data in motion, streaming data that must be processed within a time frame of milliseconds or seconds), Variety (different formats and types of data, including data in defined structured formats), and Veracity (undefined format of data, ambiguity in data, difficulty in interpreting the data, inconsistency) [1-7, 11, 15, 38].
In addition to these well-defined classes of big data applications, recent years have seen successful implementations, newer technologies and solutions for some popular multimedia big data applications in the real world:
A. Machine learning: Machine learning-based techniques have introduced new applications on iPhones that can predict, with a high degree of accuracy, the ethnicity, gender and age of users, as well as on-line prices. These applications benefit from improved data mining techniques, as these allow an easy presentation of the application to the users.
intelligence. This provides a lot of information, some of which may be unnecessary; for example, the location of a place or business may not be needed for our application. Still, Google Now is a very powerful application, as it takes all possible aspects of our lives into account to make them better and more comfortable [38].
C. Election campaign tool: Software developers introduced a new big data application that sifted, collated and combined different categories of information on each registered voter to discover patterns, which can then be used to target fundraising, advertisement, personal meetings, etc. at the voters most likely to respond. Based on this concept and its successes, other applications addressing social issues such as education, health care, utility usage and crime statistics can be developed. Other activities, such as phone calls and on-line searches, can also be used for defining the patterns. Large companies such as IBM (redrawing bus routes in Ivory Coast), Google (flu tracking software) and others have developed such applications. Data mining techniques have also found use in social issues such as tutoring disadvantaged kids, in sales forecasting for retail chains, in modeling customer behavior, and in presenting articles whose titles show their use in business applications [46, 53].
D. Social networking: One of the most successful applications of big data, dealing with its collection, analysis and virtualization, is social networks (Facebook, Twitter, Google+, LinkedIn, etc.). According to a statistical survey, over 955 million users with active accounts access Facebook in over 70 languages; they have uploaded over 140 billion photos and formed over 125 billion friend connections. In 2012, The Human Face of Big Data was carried out as a global project centered on the real-time collection, visualization and analysis of large amounts of data [1].
In addition to these applications, media project survey statistics [1-6] have been reported for some of the popular big data applications, as shown below:
Facebook has 955 million monthly active accounts using 70 languages and handles the loading of over 140 billion photos, over 125 billion friend connections, access of over 30 billion pieces of content per day and posting of over 2.7 billion likes and comments [1]
On YouTube, an estimated 48 hours of video are uploaded every minute, and over 4 billion views are recorded every day [1].
Google monitors over 7.2 billion pages per day and processes over 20 petabytes of data every day (1 petabyte = 10^15 bytes). It is interesting to note that Google provides these services, including translation, in over 66 languages around the globe. About 1 billion tweets are posted every 72 hours by more than 140 million active users on Twitter, and 571 new websites are created every minute of the day [23]. Within the next decade, the amount of information will increase by 50 times, whereas the number of information technology specialists who keep up with all that data will increase by only 1.5 times [7, 8, 11, 17, 22, 38].
On Twitter, over one billion tweets are recorded every three days from more than 140 million active users. Over 571 new web sites are created every minute of the day [1-6, 46, 68].
Further, based on the current use of the Internet for these social networks and services, it is expected that the information from them will increase by more than 50%. In the next few years, the number of people handling this large amount of data will also increase by at least 1.5 times [7, 38].
VIII. IMPLEMENTATION PHASES OF BIG DATA APPLICATIONS:
In the previous section, we introduced the state-of-the-art of big data, big data analytics, classes of big data applications, and implementation phases required for data applications. One of the most critical and important phases of implementation is data processing. The following section discusses various tasks needed during the data processing for big data application implementation. Further, we will present current trend of big data in Enterprise Computing Environment. We also describe various frameworks and tools that have been introduced and are currently being used in those applications.
IX. BIG DATA PROCESSING
Appropriate and accurate solutions for any big data application depend heavily on how the data processing is implemented. There are a number of data processing tools that offer options for implementing in-depth memory database analysis, optimization techniques, data cleansing, data forms, etc. The development of new algorithms will automate these options, which are used to join and cleanse big data sets. Some of the options provided by the tools are described below.
X. DATA COLLECTION
XI. DATA CLEANSING
After collection, data cleansing or cleaning is performed. This process will recognize any noise present in the data, or any missing data in the big data set. It uses different techniques to reduce the noise and also if possible eliminate any unwanted data from the dataset. After cleaning, the data may need to be transformed as final preparation for analytics on data set.
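The cleansing step can be illustrated on a small list of numeric readings. The sketch below is only one possible approach and is not taken from any tool discussed in this paper: it assumes that missing values are marked as `None` and treats as noise any value lying more than a chosen number of median absolute deviations from the median. The function name `cleanse`, the threshold and the sample readings are hypothetical.

```python
from statistics import median

def cleanse(readings, max_deviations=3.0):
    """Drop missing values, then drop values far from the median."""
    present = [r for r in readings if r is not None]
    if not present:
        return []
    mid = median(present)
    # Median absolute deviation: a spread measure robust to outliers.
    mad = median(abs(r - mid) for r in present)
    if mad == 0:
        return present
    return [r for r in present if abs(r - mid) / mad <= max_deviations]

# Hypothetical sensor readings with two missing values and one spike.
raw = [20.1, 19.8, None, 20.4, 250.0, 20.0, None, 19.9]
print(cleanse(raw))
```

Here the two missing entries and the spike of 250.0 are removed and the remaining readings are kept in their original order, ready for any later transformation.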
XII. DATA FORMS
Big data from different applications are received from different sources and are usually represented in three forms: structured, semi-structured and unstructured. Structured data are represented in a well-defined format, are already tagged and sorted, and are entered into a data warehouse. This type of data can be analyzed easily. Unstructured data, on the other hand, are random, have no well-defined format, and are neither tagged nor easily sorted; as such, unstructured data are difficult to analyze. Semi-structured data do not conform to fixed fields but contain tags to separate data elements. The volume of data is usually in the range of terabytes to petabytes [1].
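The difference between the three forms can be seen by holding the same piece of information in each of them. The sketch below is illustrative only: the sample record, the field names and the regular expression are hypothetical, and CSV, JSON and free text are used merely as familiar stand-ins for structured, semi-structured and unstructured data.

```python
import csv
import io
import json
import re

# Structured: fixed fields in a fixed order (a CSV row with a header).
structured = "id,name,age\n7,Ana,34\n"
row = next(csv.DictReader(io.StringIO(structured)))

# Semi-structured: tagged fields, but no fixed schema (JSON).
semi_structured = '{"id": 7, "name": "Ana", "notes": {"age": 34}}'
doc = json.loads(semi_structured)

# Unstructured: free text; fields must be inferred, here by a naive pattern.
unstructured = "Ana, who turned 34 last week, signed up as customer number 7."
match = re.search(r"turned (\d+)", unstructured)

print(row["age"], doc["notes"]["age"], match.group(1))
```

The structured row is read directly by column name, the semi-structured document requires navigating its tags, and the unstructured sentence needs a hand-written pattern that would break on a differently worded sentence. This is why unstructured data are the hardest to analyze.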
XIII. DATA ANALYSIS
Data analysis deals with extracting and interpreting new information or knowledge from the big data set. This process of deriving new information allows designers to define different policies for managing the data and rules for predicting new information from the data set.
There exist different analytic methods and techniques that have been used on this data set. These methods and techniques are grouped into three categories as: statistical analysis, data mining and machine learning. Since we are dealing with big data set, the analysis technique needs to interpret and extract meaningful information and also these techniques have to be automated as the manual techniques are very time consuming.
If data analytics techniques have to be executed on a different platform, we will face a number of problems: the new platform may not i) accommodate these large volumes of data or ii) support the needed analytic models. Further, data loading may be too slow, and the new platform may not support new advanced analytics techniques or meet the requirements.
The analysis of a big data set requires a large number of servers running massively parallel software. In a distributed environment, this type of analysis can distinguish the big data set by classes such as category, size of data, velocity and application, and the analysis will provide detailed insights into the data, their structure and their interpretation.
On some new conceptual data analysis trends:
A number of new conceptual analysis techniques for the derivation of information have been introduced such as Content-based image retrieval, Semantic Web (SW) and others.
Content-based image retrieval: This technique is based on the visual content of images. Deriving information through retrieval offers various retrieval capabilities for text, images and different attributes such as form, content and structure. This method also provides measures for accurate information retrieval.
Semantic Web (SW): It offers intelligent retrieval of information during data analysis. Representing information with web semantics allows its use in a variety of applications such as display, automation, integration and reuse by computers. It allows the representation and exchange of information in a meaningful way. A number of techniques exist for retrieving information from documents, but in general they are not advanced enough to exploit the semantic knowledge within documents and return the desired accurate information. It appears that future web services will be defined as a combination of text documents and semantic markup. The Semantic Web uses Semantic Web documents (SWDs), which must be combined with web-based indexing. To use these techniques, a normal user has to be aware of all the tools. To overcome these problems, the concept of ontology in the Semantic Web has been introduced; it represents various languages that are used for building software and increases accuracy. An ontology describes the basic concepts in a domain and defines relations among them; together with a set of individual instances of classes, it constitutes a knowledge base. Recent years have seen new applications of the Semantic Web in open, distributed and heterogeneous web environments, and for sharing knowledge in the semantic web [7-9, 16-20, 38, 40].
Data Storage, Manipulation and Handling
Since big data sets are transmitted over the Internet, we need to make sure that the data are communicated over a secure environment. Although we may use cloud computing technology for storing and processing the data, we have to make sure that the performance of big data processing is not affected by the additional overhead of secure communication. Further, we need to make sure that the data themselves are also secured, including any new types of data that represent the original data set. During transmission over the Internet, the data may be compromised and we may lose important data from the original data set. For example, an attack on MapReduce, one of the open source frameworks for big data solutions (discussed below), could be a malicious mapper that accesses sensitive data and modifies the result. Unlike in most RDBMSs, NoSQL security largely relies on mechanisms outside the database system. Research into the types of attacks that are possible on these new systems would be beneficial [12-13, 16, 24].
Storage and analytic techniques
internet. A number of approaches for sending big data sets over the internet with optimum bandwidth requirements have been suggested and implemented, e.g. compression, data deduplication, caching, protocol optimization, etc. It is important to note that bandwidth optimization based on compression offers advantages only for certain types of data, e.g. plain text, while it may not be beneficial for encrypted data. Further, the benefit of compression is available only for homogeneous and susceptible types of data.
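The point about compressible and incompressible data can be illustrated with a minimal sketch. It assumes Python's standard `zlib` module as the compressor and uses seeded pseudo-random bytes as a stand-in for encrypted data; the sample payload and the helper name `compression_ratio` are illustrative and not taken from the cited works.

```python
import random
import zlib


def compression_ratio(payload: bytes) -> float:
    """Return compressed size divided by original size (lower is better)."""
    return len(zlib.compress(payload, 9)) / len(payload)


# Repetitive plain text: highly compressible.
plain_text = b"sensor=42;status=ok;region=east\n" * 1000

# Seeded pseudo-random bytes stand in for encrypted data (an assumption
# made for illustration; real ciphertext looks similarly random).
rng = random.Random(0)
random_like = bytes(rng.getrandbits(8) for _ in range(len(plain_text)))

text_ratio = compression_ratio(plain_text)
random_ratio = compression_ratio(random_like)
print(f"plain text ratio:  {text_ratio:.3f}")
print(f"random-like ratio: {random_ratio:.3f}")
```

The plain-text payload shrinks to a small fraction of its size, while the random-like payload does not shrink, so compressing it would spend processing time without saving bandwidth.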
Data deduplication, which reduces the size of the data transmitted over the internet, deals with the data at the file and block levels. Duplicates are sent as pointers to a single copy of the data, thus avoiding the transmission of multiple copies of the same data. Some researchers also call this method redundancy elimination. The other technique deals with protocol optimization at the transport level, where one port for each of the transport layer protocols (TCP and UDP) is dedicated to session control [1, 4, 11, 12, 23, 45, 70].
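Block-level deduplication as described above might be sketched as follows. This is an illustrative sketch only: the fixed block size, the use of SHA-256 digests as pointers, and the function names `deduplicate` and `reconstruct` are assumptions, not details from the cited works.

```python
import hashlib


def deduplicate(data: bytes, block_size: int = 4):
    """Split data into fixed-size blocks and store each distinct block once.

    Returns (store, pointers): store maps a SHA-256 digest to its block,
    and pointers lists the digest of every block in original order.
    """
    store = {}
    pointers = []
    for start in range(0, len(data), block_size):
        block = data[start:start + block_size]
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, block)  # keep only the first copy
        pointers.append(digest)          # duplicates become pointers
    return store, pointers


def reconstruct(store, pointers) -> bytes:
    """Rebuild the original data by following the pointers."""
    return b"".join(store[digest] for digest in pointers)


payload = b"ABCDABCDABCDWXYZ"
store, pointers = deduplicate(payload)
print(len(pointers), "blocks referenced,", len(store), "blocks stored")
```

Only the distinct blocks and the pointer list need to be transmitted; the receiver rebuilds the data with `reconstruct`.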
Handling and storage:
The handling and storage of big data has also changed the architecture of storage systems, which has recently focused on highly scalable and flexible features to handle big data sets in an effective and efficient manner. The storage of a big data set should provide reliable, efficient methods of storage and retrieval. Google File System (GFS) is one such storage system; it is based on clusters and stores the data across the nodes of the cluster in blocks, with 64 MB as the smallest unit [46].
It has been observed that, after going through the collection, storage and analysis of data, companies can benefit from targeted marketing, detailed strategies for business insights, client-based service offerings via segmentation, determination of sales, market needs and opportunities, risk analysis, etc. These benefits are available only when data analytics is implemented properly. Further, a lack of experienced analysts, cost, a lack of database software for analytics, and the difficulty of designing an analytic system may sometimes prevent these benefits from being realized.
XIV. IMPLEMENTING AND SOLVING BIG DATA APPLICATIONS
The following steps are needed to obtain a solution for any application with a big data set. To extract the maximum value from big data and business analytics, we need to transform our IT infrastructure and implement big data technologies that allow us to capture, store, and leverage data-driven insights in real time.
Define the applications of big data according to the organization’s policy and priorities.
Design a detailed plan for future growth in big data and possible applications that may help the organization’s long-term goals.
Identify and implement the proposed big data platform along with best practices and ethics.
Implement and run big data analytics applications that will offer possible new technologies and applications, according to our needs: on-premise, cloud, or hybrid.
Big data and associated operations like collection, storage, analysis and manipulation have become an integral part of any business and of companies dealing with different types of services. These services, in one way or another, deal with different types of data in different formats in a variety of applications. One of the major concerns with a big data set is to develop suitable analytic techniques for managing it after defining its solution.
A new concept based on in-memory databases has been introduced that helps in enhancing the speed of analytic processing. Many businesses have already started using this concept in applying big data analytics in the enterprise computing environment. It forces all the records, attributes and transactions from different systems to reside in the same in-memory database. This requires the development of another tool that manages, secures and integrates the data in the database. A number of frameworks for implementing big data analytics in enterprise computing have been introduced. This new trend in big data analytics finds its application in enterprise computing, and some of the big data applications in the enterprise computing environment are discussed below.
Big Data in Enterprise Computing Environment
Many enterprise computing environments use an in-memory platform, or framework as a service, based on SAP HANA, which allows innovative applications to be built with improved productivity in handling and managing big data sets of very large volume. The analysis of big data may provide insights into the data sets that can be used to grow the business and the marketing of products. Big data analytics is becoming a new technology that has generated interest in the enterprise arena, and as such it has to support a number of architectures that have been developed for those applications [12, 53].
XV. OPEN SOURCE FRAMEWORKS/TOOLS FOR BIG DATA SOLUTIONS
There exist a number of open source frameworks/tools to solve big data applications, such as MapReduce, Job Tracker, Hadoop, High Performance Computing Clusters (HPCC) and many others. A brief description of some of the popular frameworks/tools is given below. For more details, please refer to [1-3, 12, 16, 18, 25-36, 38, 46-48].
A. MapReduce
these processes on different processors in a distributed environment. It is used to process big data sets and can be implemented in two stages [18, 28, 46-48].
The MapReduce tool is used for processing big data sets across the nodes of a Google File System (GFS) cluster [46]. It uses the distributed architecture of GFS to allocate and schedule the data sets to nodes and to transfer the data across the nodes. The functions for processing big data sets that accompany MapReduce, such as replication and storage, have been included in the implementation of the Hadoop framework, the most popular model, which includes a MapReduce engine, the Hadoop Distributed File System (HDFS) and the utilities of other Hadoop modules. HDFS is a highly fault-tolerant file system that stores the data on the clusters of the framework. Google introduced another data storage system known as BigTable, whose design has been adopted in the Hadoop framework (to be discussed below). SQL, MapReduce, in-memory processing, stream processing, graph analytics and other associated tools help Hadoop support more business and enterprise data processing applications.
The proposed big data solution based on MapReduce is usually implemented in two stages as described below:
In the first stage, the master node collects the data and divides the job into a number of smaller processes. A worker node is chosen to execute some of these processes (based on scheduling) under the control of the JobTracker, a component of the framework. The result of this execution is stored in the local file system, where it can be accessed by the reducer (a component).
In the second stage, the reducer analyzes and merges the data produced in the first stage. There can be multiple reduce tasks to parallelize the aggregation, and these tasks are executed on the worker nodes under the control of the JobTracker.
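The two stages can be illustrated with a minimal single-process word-count sketch. It is not the Hadoop API: the function names `map_phase`, `shuffle`, `reduce_phase` and `run_job` are hypothetical, and the map calls that a real cluster would schedule on worker nodes run sequentially here.

```python
from collections import defaultdict


def map_phase(chunk: str):
    """Stage 1: a worker emits one (word, 1) pair per word in its chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_outputs):
    """Group the intermediate pairs by key so each reducer sees one key."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Stage 2: a reducer merges all values emitted for one key."""
    return key, sum(values)


def run_job(chunks):
    # On a cluster each map call would run on a different worker node.
    mapped = [map_phase(chunk) for chunk in chunks]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = run_job(["big data big", "data analytics"])
print(counts)
```

Because each reduce call depends only on the values for its own key, several reduce tasks can run in parallel, which matches the description of the second stage above.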
B. Hadoop
The Apache Software Foundation introduced a tool called Apache Hadoop (an open source data computing framework) that uses a number of modules and provides solutions to handle, manage and process big data sets. This framework and set of tools for processing large data sets was originally designed to manage clusters of physical machines. Now we see wide use of this kind of framework in the cloud, for example in Amazon’s Redshift hosted BI data warehouse, Google’s BigQuery data service, IBM’s Bluemix cloud platform and Amazon’s Kinesis data processing service.
Its storage layer draws on storage systems introduced by Google: HDFS follows the design of GFS, and HBase is modeled on BigTable. Hadoop is a Java-based framework designed for heterogeneous open source platforms. The features of this framework include a distributed file system, analytics and data storage platforms, a layer to manage parallel computation, workflow and configuration administration, and many others needed to process big data sets. The Hadoop Distributed File System runs across the nodes of a Hadoop cluster and connects all the input and output nodes with a view to creating one big file system. Some of the material described below is derived from [1, 16, 18, 20, 25-26, 36-37, 42, 46].
The Hadoop framework offers a solution based on the batch processing concept for handling, managing and processing big data sets. It may not provide an appropriate solution for real-time ad hoc querying, but it has become a common solution for processing large amounts of data. Modules such as Pig and Hive, along with Hadoop MapReduce, provide query management. Some efforts have already been made to provide solutions for real-time ad hoc querying over large-scale big data sets; these querying systems use SQL to query data held in Hadoop. Other Hadoop-based solutions built on scalable, distributed relational database systems have been developed that analyze the data set and extract useful information from it.
The volume of data is continuously increasing in all applications at an exponential rate, and it is becoming a big challenge to handle the data and to develop appropriate solutions [25, 26]. The Hadoop framework has become a very popular tool for managing social networking applications over the Internet. Nearly all social networking applications (Facebook, Twitter, LinkedIn, etc.) deal with huge amounts of data from their users. The data are in different forms and need to be presented to the users in a very simple and friendly manner. In spite of these features, Hadoop has been seen to take a significant amount of time in real-time analysis and predictive analysis. SQL query tools such as Spark SQL appear to offer fast interactive querying with streaming capabilities. New tools supporting SQL-like querying have opened the door for Hadoop to be used in enterprise computing applications.
A number of new open source modules interfacing at application layer of Hadoop model have been developed to implement scalable and distributed computing environment including: database (HBase and Cassandra), querying (Hive and Pig), coordination services (ZooKeeper) [33-34].
Various functional modules offering different services on the Hadoop framework have recently been introduced. Some of the popular ones include: HDFS, MapReduce, Pig, Hive, JAQL, HBase, Flume, Sqoop, Oozie, ZooKeeper, YARN, Mahout, Ambari, Spark, Whirr, Hue, Lucene, Chukwa, Hama, Cassandra, Impala, etc. [37]. Each module provides a specific functionality and is used with Hadoop to implement a specific aspect of big data processing, from collection, storage, administration and query management to interpretation and processing of big data sets on different clusters across the distributed system over the internet.
The following is a brief description of some of these modules and their services. Each of these modules operates at the top layer of the Hadoop model.
MapReduce: This module provides a powerful parallel programming technique for distributed processing on clusters of the framework [1, 28].
Hive
This module provides a data warehousing system used with Hadoop for querying, summarization, and analysis of data stored in the Hadoop model. Queries are expressed in the SQL-like Hive Query Language (HiveQL). A compiler translates HiveQL into a set of MapReduce jobs that are executed on the Hadoop system. In other words, it provides a means to perform data manipulations with high-level HiveQL, without having to write the more complex MapReduce functions that are harder to maintain and reuse.
It arranges data into tables, partitions and buckets. A table is organized as rows and columns and contains the data. A table can have multiple partitions, which are identified by column values. Within a partition, the data is divided into buckets by the hash of a column. Further, users can influence the optimization by providing hints through HiveQL [31-32].
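The table, partition and bucket arrangement can be sketched in a few lines. This is an illustration of the idea rather than Hive's implementation: `zlib.crc32` is a stand-in for Hive's own hash function, and the column names, the row data and the helper names `bucket_of` and `layout` are hypothetical.

```python
import zlib
from collections import defaultdict


def bucket_of(value: str, num_buckets: int) -> int:
    """Assign a value to a bucket by hashing it.

    crc32 is a stand-in chosen for determinism; Hive's real hash differs.
    """
    return zlib.crc32(value.encode("utf-8")) % num_buckets


def layout(rows, partition_col, bucket_col, num_buckets):
    """Group rows as {partition value: {bucket number: [rows]}}."""
    table = defaultdict(lambda: defaultdict(list))
    for row in rows:
        partition = row[partition_col]
        bucket = bucket_of(str(row[bucket_col]), num_buckets)
        table[partition][bucket].append(row)
    return table


rows = [
    {"country": "US", "user_id": "u1"},
    {"country": "US", "user_id": "u2"},
    {"country": "IN", "user_id": "u1"},
]
table = layout(rows, "country", "user_id", num_buckets=4)
for partition, buckets in table.items():
    print(partition, {b: len(r) for b, r in buckets.items()})
```

A query that filters on the partition column only needs to read one partition, and rows with the same value of the bucketing column always land in the same bucket number.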
Pig
This module provides a high-level data processing system for analyzing data sets, with the analysis expressed in a high-level language. Apache Pig has been used as a tool for the analysis of big data sets. It uses a high-level language (known as Pig Latin) that is compiled into MapReduce programs that are executed on Hadoop. Pig also allows its language to be extended with User Defined Functions, which can be written in Java, JavaScript, or Python. One of the important features of this application is that it allows easy querying and analysis while reducing the need to write map functions [1, 29].
HBase
This module creates a scalable and distributed database for random read/write access to the data stored on clusters. HBase is a distributed, column-oriented NoSQL database that operates on top of the Hadoop Distributed File System. It is modeled on Google’s BigTable and provides a distributed data store that is highly scalable with consistent reads and writes. Data is stored as indexed StoreFiles on HDFS and is fault tolerant [1, 32].
ZooKeeper
This module supports a centralized service for providing distributed synchronization and group services across clusters. It offers coordination services that support synchronization throughout a Hadoop cluster. It achieves this by defining, in memory, objects containing information and namespaces. This information is kept across distributed ZooKeeper servers. Client applications running on a Hadoop cluster can retrieve this information from these distributed servers. One of the advantages of this model is its ability to support synchronization across the Hadoop cluster system [31-32].
Cassandra
This module is a distributed column-oriented database and was originally developed by Facebook. In this database model, a column includes a name, value and timestamp, while a row contains multiple columns, column families contain rows, and keyspaces contain column families. Column families are stored in separate files. Data is separated into partitions across the nodes in the distributed database. Each node has a random position on a hash ring. For more details on the discussion below, please refer to [37].
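The hash ring idea can be sketched as follows. This is a generic consistent-hashing illustration, not Cassandra's partitioner: node positions are derived by hashing the node name with MD5 (standing in for randomly assigned positions), and the class name `HashRing` and the node and key names are hypothetical.

```python
import bisect
import hashlib


def ring_position(label: str) -> int:
    """Map a string to a position on a ring of size 2**32."""
    digest = hashlib.md5(label.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


class HashRing:
    def __init__(self, nodes):
        # Sorted (position, node) pairs. Hashing the node name stands in
        # for the random position each node takes on the ring.
        self._ring = sorted((ring_position(node), node) for node in nodes)
        self._positions = [position for position, _ in self._ring]

    def node_for(self, key: str) -> str:
        """Return the first node at or after the key's position, wrapping."""
        index = bisect.bisect_left(self._positions, ring_position(key))
        return self._ring[index % len(self._ring)][1]


ring = HashRing(["node-a", "node-b", "node-c"])
owner = ring.node_for("user:42")
print("user:42 is stored on", owner)
```

Each key is owned by the next node clockwise from its position, so adding or removing a node only moves the keys in the neighbouring segment of the ring.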
Voldemort
This module is a highly scalable, distributed key-value data store and was originally developed by LinkedIn. In this model the data is automatically replicated and partitioned among the nodes in the distributed system. Each node is independent of the others, so there is no single point of failure. Read and write access is limited to key-value access. As such, it supports only three types of queries: get, put and delete. Owing to this simplicity, it offers predictable query performance.
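The three-operation interface with automatic partitioning and replication can be sketched with an in-memory toy. This is not Voldemort's API: the class `KeyValueStore`, the CRC32-based placement, and the default of three nodes with two replicas are assumptions made for illustration.

```python
import zlib


class KeyValueStore:
    """Toy key-value store partitioned over nodes with simple replication."""

    def __init__(self, num_nodes: int = 3, replicas: int = 2):
        self.nodes = [dict() for _ in range(num_nodes)]
        self.replicas = min(replicas, num_nodes)

    def _targets(self, key: str):
        """Pick the node for a key plus the following replica nodes."""
        first = zlib.crc32(key.encode("utf-8")) % len(self.nodes)
        return [(first + i) % len(self.nodes) for i in range(self.replicas)]

    def put(self, key, value):
        for index in self._targets(key):
            self.nodes[index][key] = value

    def get(self, key):
        # Any replica can answer, so one lost node does not lose the key.
        for index in self._targets(key):
            if key in self.nodes[index]:
                return self.nodes[index][key]
        return None

    def delete(self, key):
        for index in self._targets(key):
            self.nodes[index].pop(key, None)


store = KeyValueStore()
store.put("member:1", "profile-data")
print(store.get("member:1"))
```

Restricting access to `get`, `put` and `delete` on a single key means every request touches a fixed, small set of nodes, which is one reason the performance of such a store is easy to predict.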
Sqoop:
This module allows the transfer of data between relational databases and Hadoop [1].
Avro:
This module supports the serialization of data for its processing.
Oozie:
This module defines a systematic workflow for dependent Hadoop jobs [1].
Chukwa:
This module, a Hadoop subproject, provides a data collection and storage system for managing and monitoring distributed systems of clusters [1].
Flume:
This module provides mechanism for a reliable and distributed streaming log collection of the data across clusters [1].
XVI. HIGH PERFORMANCE COMPUTING CLUSTER (HPCC)
Another open source software framework/tool for processing big data sets is High Performance Computing Cluster (HPCC) Systems, a distributed, data-intensive open source computing platform. It allows users to define models and provides big data workflow management services. A high-level programming language, Enterprise Control Language (ECL), together with its development environment, supports the easy description of complex problems, and the framework ensures that ECL is executed in the minimum elapsed time, with all nodes processing in parallel.
The HPCC framework consists of the following three application modules [2]:
HPCC Data Refinery (Thor): This module defines a massively parallel ETL engine that enables data integration at scale and provides batch-oriented data manipulation.
HPCC Data Delivery Engine (Roxie): This module allows efficient multi-user retrieval of data and structured query response. It is massively parallel and offers high throughput with low latency.
Enterprise Control Language (ECL): This module is responsible for the automatic distribution of workload between nodes, with support for automatic synchronization of algorithms. It contains an extensible machine learning library and is a simple-to-use programming language optimized for big data operations and query transactions [1].
Out of these frameworks/tools discussed above, HPCC and Hadoop seem to be more popular and are being used for implementing a variety of applications of big data sets. However, there are some differences between them in terms of architecture and stacks. The following sections will highlight some major differences between all three frameworks/tools. For details, readers are referred to [22].
XVII. COMPARISONS OF THREE OPEN SOURCE FRAMEWORKS (MAPREDUCE, HADOOP AND HPCC)
An attempt has been made to compare these frameworks in terms of architecture and stacks. Below is a brief summary of that attempt. For more details, please refer to [2]:
Cluster processing in HPCC is performed by the Thor and Roxie modules, while in Hadoop it is implemented by the MapReduce application program.
HPCC uses ECL as its primary language, while the MapReduce application program is based on the Java language.
As stated above, HBase is based on the column-oriented concept and is supported by Hadoop, while the HPCC platform builds multi-key and multivariate indexes on its distributed file system.
The Hive application program of Hadoop provides a data warehouse and allows data to be loaded into HDFS. In HPCC, the data warehouse and loading are based on structured queries and analyzer application programs.
For a larger hardware configuration with a large number of nodes, the HPCC framework is faster than Hadoop and takes very little processing time.
XVIII. HADOOP-BASED SOLUTIONS FOR BIG DATA IN SOCIAL NETWORKING ENVIRONMENT
The following section describes a list of social networks that handle big data sets using the Hadoop model for storing, manipulating and analyzing the data. For detailed discussions, please refer to [7, 38-44]. The state of the art in social networks and data mining can be found in [68].
A. Facebook
This social network hosts the largest Hadoop cluster by volume, consisting of over 4400 nodes and 100+ petabytes of data. Five modules are used on its big data sets: the Hadoop Core, a log data collector called Scribe, Hive, a UI for querying with Hive called HiPal, and an automation framework called NoCron.
Based on its needs and other requirements, the network has defined its own configurations of the Hadoop framework. The underlying storage uses a federated HDFS, and its redundancy is implemented using RAID technology. Its analysts also use Hive to simplify their interaction with Hadoop. Roughly 90% of their MapReduce jobs are built on Hive [7, 40, 44].
B. Twitter
Another social network that needs a lot of storage and processing for big data sets also uses the Hadoop framework. All the data in this network is stored in the Hadoop Distributed File System using LZO (Lempel-Ziv-Oberhumer) compression. Further, it uses Google’s Protocol Buffers, with the generated code they provide, to serialize data so that it can be read from and written to the cluster efficiently, as these are supported by the Hadoop framework. It uses the Scalding framework to provide a simpler way of creating MapReduce jobs, much like Hive and Pig [7, 41-42, 44].
C. LinkedIn
In this social network, Hadoop is used to provide the predictive analytics and querying behind features such as People You May Know and Endorsements (PYMKE). Over a billion LinkedIn relationships are processed each day to compute People You May Know. Hadoop is also used to generate engagement emails that present a user’s profile views and their professional associations. The network adopted Apache Pig to avoid writing complex MapReduce programs. It developed a Hadoop log aggregator and dashboard called White Elephant that supports visualization of utilization across the users of a cluster, which allows the users to understand these features and make better use of them over the distributed file system [7, 43].
XIX. PROBLEMS AND CHALLENGES IN HADOOP-BASED IMPLEMENTATION
defining quality data in such a way that only subset of data is being used for processing. It is difficult to identify subset from big amount of data and as such may not represent the big data set to predict its quality.
One of the leading sports organizations, the National Football League (NFL), has adopted a cloud computing environment (based on Hadoop) that delivers a state-of-the-art big data experience for its Fantasy Football league schedule and time frame. The members associated with this franchise will be able to analyze and compare players and predict the outcomes of matches, finals, etc.
The above section described various frameworks/tools that have been used in some real world projects. Although the success stories of these projects support the use of these tools for such applications, their implementers and developers still face a number of challenges, such as the lack of uniform standard formats, the lack of formal methods, the lack of clear representation and analysis of data for providing useful interpretations to users, the lack of tools that support predictable behavior of data, and the lack of newer user-friendly representations for easy understanding. In order to address some of these difficulties and challenges, recent years have seen a new trend of redefining data mining techniques for multimedia data analysis and of using virtualization for accurate representation and extraction of useful information from processed data.
The following two sections describe the current state-of-the-art of Data mining techniques and Virtualization and present how the entire big data and big data analytics technology have taken a new turn and trend in providing efficient solutions to big data applications in a variety of disciplines.
XX. IMPLEMENTATION OF NEW DATA MINING TECHNIQUES BY BIG DATA ANALYTICS
A. Basic concepts and definitions of data mining
Data mining offers a systematic approach for understanding and analyzing big sets of data with a view to obtaining useful information in a highly readable format and friendly environment. This technique is heavily based on the predictive analysis concept, which includes the concrete assessment of complex data so that a suitable analysis technique can be identified for deriving useful information from the data set. This concept finds its use in a variety of applications that contain large data sets, such as the NASA weather data set, social network data sets, data communications via mobile devices, bioinformatics, sensors, stock markets, manufacturing of embedded systems, the World Wide Web, etc. Nearly all of these applications use the Internet for design, data collection, data analysis, dissemination, etc., and as such are expected to provide a secured communication environment over it. New trends in using data mining techniques for social networks can be found in [68].
As stated above, big data deals with large-volume, complex, growing data sets with multiple formats and sources. In recent years, we have seen big data expand into multiple areas of Science, Technology, Engineering and Mathematics (STEM) and into other sciences such as the physical, biological and biomedical sciences. As mentioned above, big data technology is capable of implementing data mining, virtualization, text mining and optimization; the newer data mining techniques will provide a new perspective, with new characteristics and features that introduce a new model. This model is data-driven and involves the aggregation of information sources on demand, the mining and analysis of users’ interests, and the security, privacy and integrity of data.
From the applications discussed above, it is clear that big data technology coupled with data mining techniques has already played a role in five known applications. Based on its usefulness in managing big data, an attempt has been made to apply these concepts to social networks, as nearly all social networks are dealing with an explosion of data in different formats. Social network analysis focuses on understanding user intelligence for defining advertising strategies, marketing strategies, capacity planning, customer behavior, shopping patterns, targeted customers, and other behavior profile databases for marketing and advertising. Based on this information, companies and industries use optimization techniques so that the needed and useful content can be used on a big data engine.
Some companies, such as Google and Amazon, have published interesting results of their work based on the underlying framework, and many other companies have developed similar frameworks as open source software, such as Lucene, Solr, Hadoop and HBase. Facebook, Twitter and LinkedIn offer further efficient and useful open source projects for big data, such as Cassandra, Hive, Pig, Voldemort, Storm and IndexTank. In addition, predictive analytics on traffic flows and the identification of threats from different video, audio and data feeds are some of the advantages that are useful for big data set analysis. In order to understand how data mining techniques can play an important role in implementing different categories of big data applications, with more emphasis on social networks, the following section describes different data mining techniques along with their features and limitations. For more discussion of these topics, please refer to [1, 7-9, 11, 12, 17, 20, 22, 36, 49-50, 68-69].
B. Data mining Techniques
i) Mining Sequence Data
Sequence data may be defined as an ordered list of events and is usually classified, based on the behavior and characteristics of the events, as time-series, symbolic or biological sequence data. The following paragraphs describe each of these in brief.
Time-series sequence data consists of long sequences of numeric data recorded at equal time intervals (e.g., per minute, per hour, or per day). Such sequences can be generated by many natural and economic processes, such as stock markets and scientific, medical, or natural observations.
Symbolic sequence data may be defined as long sequences of event or nominal data that may not be recorded or observed at equal time intervals, and for which the lapses between recorded events may not be important. A few examples of this kind of sequence include customer shopping sequences and web click streams, and sequences of events in science and engineering and in natural and social developments.
Biological sequence data may be defined as very long sequences of data that carry important, complicated information, but this information is hidden in their semantic meaning. Examples of this kind of sequence data include DNA, protein sequences and other medical-based sequence data.
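As a small illustration of mining symbolic sequences such as web click streams, the sketch below counts consecutive event pairs that appear in at least a minimum number of sequences. It is a deliberately simple stand-in for full sequential pattern mining; the function name `frequent_pairs`, the sample click streams and the support threshold are hypothetical.

```python
from collections import Counter


def frequent_pairs(sequences, min_support: int):
    """Return consecutive event pairs found in at least min_support sequences."""
    support = Counter()
    for sequence in sequences:
        # Count each pair once per sequence, so support is a sequence count.
        support.update(set(zip(sequence, sequence[1:])))
    return {pair: n for pair, n in support.items() if n >= min_support}


clicks = [
    ["home", "search", "product", "cart"],
    ["home", "search", "product"],
    ["home", "product", "cart"],
]
patterns = frequent_pairs(clicks, min_support=2)
for pair, n in sorted(patterns.items()):
    print(pair, "appears in", n, "sequences")
```

Note that the time elapsed between events plays no role here, which matches the description of symbolic sequence data above.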
ii) Mining Graphs and Networks
Graphs represent a more general class of structures than sets, sequences, lattices, and trees. There is a broad range of graph applications on the Web and in social networks, information networks, biological networks, bioinformatics, chemical informatics, computer vision, and multimedia and text retrieval. Hence, graph and network mining have become increasingly important and heavily researched. Based on the above concepts, the following graph applications have been introduced to deal with big data sets: graph pattern mining; statistical modeling of networks; data cleaning, integration, and validation by network analysis; clustering and classification of graphs and homogeneous networks; clustering, ranking, and classification of heterogeneous networks; role discovery and link prediction in information networks; similarity search and OLAP in information networks; evolution of information networks; and other topics related to graphs and networks. Various frameworks like Hadoop have successfully implemented this technique for these applications.
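Link prediction, one of the tasks listed above, can be illustrated with the common-neighbors heuristic: two unconnected nodes that share many neighbors are candidates for a future link. This is one simple technique among many, chosen here for brevity; the function name `common_neighbor_scores` and the sample graph are hypothetical.

```python
from itertools import combinations


def common_neighbor_scores(adjacency):
    """Score each unconnected node pair by its number of shared neighbors."""
    scores = {}
    for u, v in combinations(sorted(adjacency), 2):
        if v in adjacency[u]:
            continue  # already linked, nothing to predict
        shared = adjacency[u] & adjacency[v]
        if shared:
            scores[(u, v)] = len(shared)
    return scores


# Undirected graph stored as node -> set of neighbors.
graph = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "cat", "dan"},
    "cat": {"ann", "bob", "dan"},
    "dan": {"bob", "cat"},
}
scores = common_neighbor_scores(graph)
print(scores)
```

In this small graph the only unconnected pair is "ann" and "dan", who share two neighbors, so that pair is the suggested link.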
iii) Mining Other Kinds of Data
In addition to sequences and graphs, there are many other kinds of semi-structured or unstructured data, such as spatiotemporal, multimedia, and hypertext data, that carry various kinds of semantics, are either stored in or dynamically streamed through a system, and call for specialized data mining methodologies. These types of data find interesting applications in cyber-related multimedia data. Other applications include spatial data, spatiotemporal data, cyber-physical system data, multimedia data, text data, web data, data streams, and other types of data mining.
iv) Mining Spatial Data
Spatial data usually refers to geo-space-related data that are stored in geospatial data repositories; spatial data mining can be performed on spatial data warehouses, spatial databases, and other geospatial data repositories. Spatial data can be in many forms or formats, such as vector, raster, imagery and geo-referenced multimedia. The spatial data mining methodology identifies patterns, locations and knowledge from spatial data. Recently, there has been great interest in large geographic data warehouses that have been constructed by integrating thematic and geographically referenced data from multiple sources. From these, we can construct spatial data cubes that contain spatial dimensions and measures, and support spatial On-line Analytic Processing (OLAP) for multidimensional spatial data analysis. Popular topics in geographic knowledge discovery and spatial data mining include mining spatial associations and co-location patterns, spatial clustering, spatial classification, spatial modeling, and spatial trend and outlier analysis.
v) Mining Spatiotemporal Data and Moving Objects
Spatiotemporal data are data that relate to both space and time. Spatiotemporal data mining refers to the process of discovering patterns, locations and knowledge from spatiotemporal data.
Spatiotemporal data mining has become increasingly important and has far-reaching implications, given the popularity of mobile phones, GPS devices, Internet-based map services, weather services, and digital Earth, as well as satellite, RFID, sensor, wireless, and video technologies. Typical examples of spatiotemporal data mining include: discovering the evolutionary history of cities and lands, uncovering weather patterns, predicting earthquakes and hurricanes, and determining global warming trends. For example, animal scientists attach telemetry equipment on wildlife to analyze ecological behavior, mobility managers embed GPS in cars to better monitor and guide vehicles, and meteorologists use weather satellites and radars to observe hurricanes.
Among the many kinds of spatiotemporal data, moving-object data (i.e., data about moving objects) are especially important. Examples based on this kind of data include mining movement patterns of multiple moving objects (i.e., the discovery of relationships among multiple moving objects such as moving clusters, leaders and followers, merge, convoy, swarm, and pincer, as well as other collective movement patterns). Massive-scale moving-object data are becoming rich, complex, and ubiquitous, and have found application in mining periodic patterns for one or a set of moving objects, and in mining trajectory patterns, clusters, models, and outliers.
vi) Mining Cyber-Physical System Data