Beginner's Guide to Big Data Analytics

Introduction

What do the two words ‘Big Data’ really mean? Everyone is talking about it, but frankly, not many really understand what the hype is all about. This book by Jigsaw Academy aims to give you an understanding of Big Data and what makes data big, while also explaining in simple language the challenges of Big Data, the emerging technologies and the Big Data landscape. Finally, we talk about careers in Big Data and what the future could hold in store for the industry.

This book is also a useful companion for those of you enrolled in Jigsaw's course ‘Big Data Analytics Using Hadoop and R’. You can use this book to complement your learning and better understand Big Data. Please note the blue boxes in every chapter, which link the content in the chapter to the modules covered in the course.

Enjoy the book.

Outline

What is Big Data
What Makes Data Big
Challenges of Big Data Processing
Big Data Technologies
Big Data and Analytics
Unstructured Data and Text Analytics
Big Data in Action
Big Data Landscape
Big Data Career Paths
Big Data in the Future
Learn more about Big Data

“I don't know how the cost of hard disks has decreased so rapidly. These days one can buy a terabyte hard drive for just $100,” a friend told me a couple of years ago. It's hard not to agree with him, and a quick review of historical facts validated his opinion. In the 1990s, the cost of 1 gigabyte of hard disk space was around $10,000; now it can be purchased for only $0.10. The price has dropped by a factor of 100,000 over a span of 20 years. We are even seeing a few gigabytes of storage space being offered free of cost by email service providers and file hosting services. For personal accounts, Gmail offers about 15 gigabytes of free storage, whereas the file hosting service Dropbox offers up to 3.5 gigabytes. These limits are higher for business accounts.

One might wonder how enterprises are influenced by the lower costs of storage space. For one, it definitely provides them with more opportunities to store data around their product and service offerings. Virtually every industry is seeing a tremendous explosion of new data sources and is dependent on advanced storage technologies. Increased adoption of the internet and smartphones has enabled individuals across the globe to leave a huge digital footprint of online data, which many enterprises want. In the past, for example, banks used to store customer data mostly as demographic information taken from application forms and transaction information taken from passbooks. These days, the customer data being stored is enormous and spans mobile usage, online transactions, ATM withdrawals, customer feedback, social media comments and credit bureau information. All these new sources of data, which did not exist in the past, can be grouped under the new term “Big Data”. Big Data can loosely be described as data which is huge, but more importantly, Big Data is data that comes from multiple sources rather than just one.

Big Data is definitely one of the more fascinating evolutions of the 21st century in the world of IT. The truth is that Big Data has opened up tremendous opportunities and has provided us with endless solutions to deal with social, economic and business problems. For enterprises, it is a huge untapped source of profit, which if used appropriately will be the key to staying ahead of their competition. In order to deal with Big Data effectively, they need to depend on advanced database technologies and faster processing capabilities. Just having Big Data is not a sufficient criterion for success; enterprises also need to implement analytics effectively, in order to be able to garner insights that help improve profitability. They should actively pursue the art and science of capturing, analysing and creating value out of Big Data.

What is Big Data?

CHAPTER 01

The Big Data and Hadoop Overview Module provides pre-class videos and lots of reading material on the importance of Big Data and how it is transforming the way enterprises are implementing data based strategies to become more competitive.

We live in the era of Big Data, and it is not leaving any industry untouched, be it financial services, consumer goods, e-commerce, transportation, manufacturing or social media. Across all industries, enterprises now have access to an increasing number of both internal and external Big Data sources. Internal sources typically track information around demographics, online or offline transactions, sensors, server logs, website traffic and emails. This list is not exhaustive and varies from industry to industry. External sources, on the other hand, are mostly related to social media information from online discussions, comments and feedback shared about a particular product or service. Another major source of Big Data is machine data, which consists of real-time information from sensors and web logs that monitor customers’ online activity. In the coming years, as we continue to develop new ways of generating data online and offline by leveraging technological advancements, the one safe prediction we can make is this: the growth of Big Data is not going to stop.

Although at a high level Big Data is about size and about data being captured from multiple sources, there are many technical definitions which provide more clarity. The O'Reilly Strata group states that “Big Data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn’t fit the structures of your database architectures”. In simple terms, Big Data needs multiple systems, rather than a single system, to handle and process it efficiently. Say an e-commerce enterprise in a single region generates about 1000 gigabytes of data on a daily basis, which can be handled and processed using traditional database systems. On expanding operations to a global level, its daily data generation increases 10000 times to 10 petabytes (1 petabyte = 1000000 gigabytes). Traditional database systems do not have the capabilities required to handle this kind of data, and enterprises need to depend on Big Data technologies such as Hadoop, which uses a distributed computing framework. We will learn more about these technical topics in subsequent chapters.
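The arithmetic behind the e-commerce example can be checked with a short sketch. It uses only the figures quoted in the text and the decimal unit convention stated there; the variable names are illustrative.

```python
# Decimal storage units, matching the text's "1 petabyte = 1000000 gigabytes".
GB_PER_PB = 1_000_000

regional_gb_per_day = 1_000   # single-region figure quoted in the text
growth_factor = 10_000        # growth after global expansion, quoted in the text

global_gb_per_day = regional_gb_per_day * growth_factor
global_pb_per_day = global_gb_per_day / GB_PER_PB

print(f"{global_gb_per_day:,} GB per day = {global_pb_per_day:g} PB per day")
```

Running it prints 10,000,000 GB per day = 10 PB per day, which agrees with the figure in the paragraph above.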

To further simplify our understanding of Big Data, we can rely on its three major characteristics, namely volume, variety and velocity, which are commonly referred to as the 3 V’s of Big Data. Occasionally, some resources also mention a less common characteristic, veracity, which is referred to as the 4th V of Big Data. Together, these 4 characteristics provide more detail about the nature, scope and structure of Big Data.

What Makes Data Big?

CHAPTER 02

Big Data is commonly characterized by the 3 V’s, and these provide context for a new class of possibilities. You will learn more about how these characteristics help extract more information from massive amounts of data in the Big Data and Hadoop Overview Module.

Volume

Volume deals with the size aspect of Big Data. With technical advancements in global connectivity and social media, the amount of data generated on a daily basis is growing exponentially. Every day, about 30 billion pieces of information are shared globally. An IDC Digital Universe study estimated that global data volume was about 1.8 zettabytes as of 2011 and would grow about 5 times by 2015. A zettabyte is a quantity of information or storage capacity equal to one billion terabytes, which is 1 followed by 21 zeroes in bytes. Across many enterprises, these increasing volumes pose an immediate challenge to traditional database systems with regard to the storing and processing of data.
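The unit conversion in the paragraph above can be verified with a few lines. This is a minimal sketch that assumes decimal (power-of-ten) units; the 1.8 zettabyte figure and the fivefold growth are the estimates quoted in the text, not independent data.

```python
BYTES_PER_TB = 10**12   # decimal terabyte
BYTES_PER_ZB = 10**21   # decimal zettabyte: 1 followed by 21 zeroes

# How many terabytes fit in one zettabyte.
tb_per_zb = BYTES_PER_ZB // BYTES_PER_TB

volume_2011_zb = 1.8                      # estimate quoted in the text
projected_2015_zb = volume_2011_zb * 5    # "about 5 times" growth, per the text

print(f"1 ZB = {tb_per_zb:,} TB")
print(f"Projected 2015 volume: about {projected_2015_zb:g} ZB")
```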

Variety

Big Data comes from sources such as conversations on social media, media files shared on social networks, online transactions, smartphone usage, climate sensor data, financial market data and many more. The underlying formats of data coming out of these sources vary and include Excel sheets, text documents, audio files and server log files, which can be broadly classified as either structured or unstructured data types. Structured data formats typically refer to a defined way of storing data, i.e. clearly marked out rows and columns, whereas unstructured data formats do not have any such order and mostly refer to text, audio and video data. Unstructured formats of data are a more recent phenomenon, and traditional database systems do not possess the capabilities required to process this kind of information.

Velocity

Increased volumes of data have put a lot of stress on the processing abilities of traditional database systems. Enterprises should be able to quickly process incoming data from various sources and then share it with the business to ensure the smooth functioning of day-to-day operations. This quick flow of data within an enterprise is the velocity characteristic of Big Data. Another important aspect is the ability to provide relevant services to the end user on a real-time basis. For example, Amazon provides instant recommendation services depending on the user's search and location. Based on the entered keyword, these services need to search through the entire history of transactions and share relevant results which hopefully convert into a purchase. The effects of velocity are very similar to those of volume, and enterprises need to rely on advanced processing technologies to efficiently handle Big Data.

Veracity

Though enterprises have access to a lot of Big Data, some parts of it will be missing or unreliable. We have long known that data quality issues usually arise from human entry errors or from individuals withholding information. In the Big Data era, where most data capturing processes are automated, the same issues can occur due to technical lapses such as system or sensor malfunctions. Whatever the reasons, one should take care to deal with inconsistencies in Big Data before using it for any kind of analysis.

Just having a big source of data is not enough to become successful; enterprises need to implement relevant processes and systems which will help them extract value out of it. An important question here is what one should do with the data. In the absence of a business context, data in itself is meaningless and just occupies space on the storage servers. Also, many Big Data sources tend to have missing or low-quality information, as described by the veracity characteristic earlier. The actual power of Big Data surfaces only when analytics is applied on top of it and used to generate useful insights that guide future decision making. Irrespective of the size of the data, whether big or small, analytics methodologies need to be implemented to reap benefits. This typically involves cleaning, analysing, interpreting and finally visualizing the hidden patterns that emerge from the data. Due to the sheer volume, variety and velocity of Big Data, the processing capacity of traditional database systems is strained beyond its limits. So enterprises need to look for advanced processing technologies and capabilities to effectively manage Big Data and implement analytics on it.

One of the major aspects of any Big Data processing framework is to successfully handle huge amounts of information without compromising on query time. Traditional database systems lack the required infrastructure and internal designs to process Big Data at a scale of petabytes or exabytes. Since these systems tend to operate out of a single machine with huge hard disk and processing capabilities, they come with a set of limitations. The first is scalability: with a continuous increase in data volumes, the storage capacity of these systems needs to be continuously increased, and this can be an expensive option. The second is slow query time: the storage load is already operating at maximum levels, and enterprises cannot wait for days to get their daily reports. These limitations call for alternative approaches based on scalable storage and a distributed approach to scaling.

Challenges of Big Data Processing

CHAPTER 03

NoSQL technologies are helping enterprises to achieve more than what was possible previously. In the Big Data and Hadoop Overview Module, the evolution and benefits of new technologies such as Hadoop and MongoDB are discussed, along with how these help in overcoming the limitations of traditional IT systems for solving Big Data problems.

Big Data sources are diverse, and inputs to these systems can arrive in structured or unstructured formats. Since these data originate from across the globe, most of the time they do not have a pre-defined order and require pre-processing before they can be used for any analysis. A common use of Big Data processing is to draw on unstructured data, specifically comments on social media, to track customer sentiment towards various product and service offerings. Due to their inherently static design, many traditional database systems can handle only structured data and as such do not provide any alternatives for unstructured data. For example, SQL-based database systems depend on schema designs which clearly define the nature of the data being loaded, and are used to process transactional data. Since unstructured data for the most part does not have a proper structure, it is very difficult to handle in SQL-based systems. Luckily for us, there exist alternatives in the form of NoSQL databases, which can handle both structured and unstructured data formats.

The majority of client applications run by enterprises operate in real time, and instant support for services has become a priority. This requires processing Big Data by the minute in order to provide relevant service to customers. For example, based on a user's search keywords, Google instantly processes information stored in its millions of databases and returns relevant links within a matter of seconds. Similarly, banks need to track global online transactions at any time of the day and update their databases so that each transaction is reflected in the customer’s online account immediately. These services require enterprises to have a system which can ensure the fast movement of data without failures. In order to handle this velocity of Big Data, coupled with volume and variety, enterprises need to depend on sophisticated databases from the NoSQL category. These databases relax the constraints of the schema-based design of SQL systems and store data in key-value pairs, which makes them capable of handling all 3 V’s of Big Data.

Big Data Technologies - Overview of Components

CHAPTER 04

With the growing challenges of Big Data and the limitations of traditional data management solutions, enterprises need to leverage new technologies and data warehousing architectures which have significant IT implications. These technologies vary in functionality, ranging from storing and processing massive volumes of data to performing various analyses on the data at the lowest level of granularity. For example, by integrating unstructured data such as text fields, social media chatter and email documents, enterprises can leverage new sources of data which can reveal new insights about their customers.

According to a market forecast report by ABI Research, worldwide IT spending on Big Data technologies exceeded $31 billion in 2013 and is projected to reach $114 billion by 2018 [1]. Most existing Big Data technologies fall under the open source paradigm. These are free to use and can be experimented with by anyone. In the current Big Data technology landscape, there are many open-source tools which can address a wide range of problems, but one needs the right knowledge and niche expertise to work efficiently with these technologies.

One of the most popular and widely adopted open source Big Data technologies is Apache Hadoop. It is more formally defined as an open-source software framework that supports distributed processing of vast amounts of data across clusters of computers by using a simple programming model. Apache Hadoop is considered a cost-effective solution which provides capabilities to scale up from single servers to thousands of machines, each offering local computation and storage. In simple terms, it is like a cluster of machines interconnected by a network, processing chunks of data at the same time rather than depending on one single machine, which is time-consuming and inefficient, especially in the case of Big Data. A Hadoop cluster can be made up of anything from a single machine to thousands of machines, which are commonly termed nodes.

Apache Hadoop is the most popular IT solution for effectively dealing with Big Data. With the help of Big Data and Hadoop Overview, Hadoop Data Management and Processing Complex Data using Hadoop modules, you will learn technical aspects of setting up of Hadoop Cluster, its Architecture, HDFS and MapReduce Framework and other components using hands-on examples.

Let us try to understand this concept using a simple example. Say an apartment complex housing 50 families has a single washing machine catering to its laundry needs. Assuming that each family washes 10 clothes per day on average and that the washing machine takes one hour to clean about 50 clothes, the total time taken by the washing machine to meet the entire complex's needs is 10 hours per day. Now the apartment manager is considering increasing the capacity to 100 houses, and this would definitely put tremendous stress on the washing machine’s daily load capacity. In order to deal with this situation, the manager should probably consider buying 4 more washing machines. With a total of 5 machines working together, the entire apartment complex's laundry needs can be managed within 4 hours on any given day. The new solution also allows the families to be more flexible about when they use the washing machines. This example briefly captures the essence of implementing distributed processing solutions using a cluster of machines rather than depending on one single machine to meet growing Big Data needs.
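The numbers in the washing machine analogy can be reproduced with a small sketch. The function name and parameters are illustrative, and the model assumes the load splits evenly across machines with no coordination overhead, which real clusters only approximate.

```python
def hours_to_finish(families, clothes_per_family, machines,
                    clothes_per_machine_hour=50):
    """Hours needed to wash one day's laundry when the work is split evenly."""
    total_clothes = families * clothes_per_family
    combined_rate = machines * clothes_per_machine_hour
    return total_clothes / combined_rate


one_machine = hours_to_finish(families=50, clothes_per_family=10, machines=1)
overloaded = hours_to_finish(families=100, clothes_per_family=10, machines=1)
five_machines = hours_to_finish(families=100, clothes_per_family=10, machines=5)

print(one_machine, overloaded, five_machines)
```

The three values are 10, 20 and 4 hours: doubling the number of families doubles the time on a single machine, while adding four more machines brings it down to the 4 hours quoted in the example.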

Hadoop was invented by Doug Cutting, who named it after his son’s toy elephant. The Hadoop Ecosystem comprises multiple projects which together provide the complete data management solutions needed by an enterprise. Some of the projects of the Hadoop Ecosystem include HDFS, MapReduce, Hive, HBase, Pig and others. Though the evolution of Hadoop dates back to the early 2000s, its mainstream usage picked up momentum only a couple of years ago. Its major advantage is its ability to efficiently manage and process unstructured data. Since about 80% of Big Data consists of unstructured data, implementing Hadoop-based solutions has become a strategic choice for many enterprises.

Let's briefly review some of the key components of Hadoop Ecosystem.

HDFS (Hadoop Distributed File System)

Two primary components of Apache Hadoop are HDFS, which provides distributed data storage capabilities, and MapReduce, which is a parallel programming framework. To better understand how Hadoop allows scalability, one should understand how HDFS functions. HDFS breaks the data to be processed into smaller pieces called blocks and stores them across various nodes of a Hadoop cluster. This mechanism enables one to handle Big Data more efficiently and cost-effectively by employing low-cost commodity hardware on the nodes of the Hadoop cluster. Unlike relational databases, which depend on defined schemas to store structured data, HDFS puts no restrictions on the type of data and can easily handle unstructured data too. In line with the NoSQL principle, HDFS allows for schema-less storage of data, which makes it popular when it comes to Big Data management.
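The idea of breaking data into blocks and spreading them over nodes can be illustrated with a toy sketch. This is not how HDFS is implemented: the function names, the node names, the tiny block size and the round-robin placement are all assumptions made for illustration. Real HDFS uses far larger blocks, keeps multiple copies of each block and applies its own placement policy.

```python
from itertools import cycle


def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut the data into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(blocks: list[bytes], nodes: list[str]) -> dict[str, list[int]]:
    """Assign block ids to nodes in round-robin order (illustrative policy)."""
    placement = {node: [] for node in nodes}
    for block_id, node in zip(range(len(blocks)), cycle(nodes)):
        placement[node].append(block_id)
    return placement


data = b"x" * 1000                                  # stand-in for a file
blocks = split_into_blocks(data, block_size=256)    # toy block size
placement = place_blocks(blocks, ["node-a", "node-b", "node-c"])

print(len(blocks), placement)
```

The 1000-byte input yields four blocks, and no single node has to hold the whole file, which is the property that lets storage scale by adding nodes.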

MapReduce

MapReduce forms the heart of Hadoop. It is a programming model that processes data stored on the nodes of a Hadoop cluster in a parallel and distributed manner. Typically, a MapReduce program consists of two procedures: Map() and Reduce(). Both phases work on key/value pairs. A key/value pair is a set of two linked data items: a key, which is a unique identifier for some item of data, and a value, which is either the data that is identified or a pointer to the location of that data. A key/value pair could be, for example, a unique customer identifier paired with location details, or a URL paired with its number of visits. What goes into the key/value pairs depends on the type of problem being solved.

The Map() procedure, or job, performs operations such as filtering and sorting: it takes individual elements of data and breaks them down into key/value pairs. After the Map() job has executed, the Reduce() job applies summary functions to produce output in an aggregated form. Always remember that a MapReduce program executes the Map() job first, followed by the Reduce() job, and that the output of the Map() job acts as the input to the Reduce() job.

MapReduce Example

Let’s look at a simple example. Assume you have three text documents, each containing a certain number of words. Say the first document contains the sentence “I like Hadoop” and all the documents are stored in HDFS. The end objective is to find the frequency of the words present across all the text documents. For this we need to write Map and Reduce jobs to process the text data and summarize the word distribution.

As the Map job executes, the documents are first sent to the mapper that will count each unique word for each document: a list of (key/value) pairs is thus created with the word as key and its count as value. For example, the results produced from one mapper task for the first text document would look like this:

(I,1) (Like,1) (Hadoop,1)

The lists of (key/value) pairs generated by all mapper tasks are then processed by the reducer, which aggregates the (key/value) pairs from each mapper to produce a list of all the words with their counts summed across the three mappers. The final result set is as follows:

(I,1) (Like,1) (Hadoop,3) (Is,2) (Fun,1) (So,1) (Great,1)

This is a simple and straightforward example. Even though a real-world application would be quite complex and would often involve processing millions or billions of rows, the key principle behind a MapReduce execution remains the same.
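The word count above can be mimicked in plain Python to make the Map and Reduce steps concrete. This is a single-machine sketch, not Hadoop code. Only the first document is given in the text; the second and third documents are assumptions chosen so that the totals match the result set shown above, and the function names are illustrative.

```python
from collections import defaultdict

documents = [
    "I like Hadoop",        # given in the text
    "Hadoop is fun",        # assumed for illustration
    "Hadoop is so great",   # assumed for illustration
]


def map_words(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.capitalize(), 1) for word in document.split()]


def group_by_key(pairs):
    """Collect all values emitted for the same key, as happens between the two steps."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reduce step: sum the counts collected for one word."""
    return key, sum(values)


mapped = [pair for doc in documents for pair in map_words(doc)]
word_counts = dict(reduce_counts(k, v) for k, v in group_by_key(mapped).items())

print(word_counts)
```

With these three documents the result is I 1, Like 1, Hadoop 3, Is 2, Fun 1, So 1 and Great 1. On a cluster, each document would typically be handled by a separate mapper task and the grouping would be done by the framework.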

Java APIs

In order to do Hadoop programming at the MapReduce level, one needs to work with Java APIs. Since the Hadoop framework is developed on the Java platform, MapReduce programming in Java is the most native option by design. Hadoop developers or analysts should have a fair knowledge of Java concepts to process queries on data stored in the various cluster nodes. Running MapReduce jobs involves installing an Eclipse environment for Hadoop, writing the Map and Reduce job scripts, compiling them into a jar file and then executing these jobs on the data stored in HDFS.

For those who are averse to Java programming and do not have a developer background, alternatives exist for Hadoop programming in the form of the Pig, Hive and Hadoop Streaming components. The Hadoop Streaming component makes it easier to create and run MapReduce jobs in general programming languages such as Ruby, Python, Perl, C++ and R.

Pig

Pig comes to the rescue of non-technical professionals and makes working with Big Data on Hadoop clusters more approachable. It is a highly interactive, script-based environment for executing MapReduce jobs on the data stored in HDFS. It consists of a data flow language, called Pig Latin, which supports writing MapReduce programs with more ease and less code than the Java APIs require. In many ways, the functionality of Pig is similar to how SQL operates in relational database management systems. It also supports many user-defined functions, which can be embedded and executed along with a Java program.

Hive

Hive enables the connection between the worlds of Hadoop and SQL. It is very beneficial for people with strong SQL skills. Hive is a data warehouse infrastructure built on top of Hadoop that provides data summarization, querying, and analysis capabilities using an SQL-like language called HiveQL. Similar to Pig, Hive functions like an abstraction on top of MapReduce, and queries run will be converted to a series of MapReduce jobs at the time of execution. Since the internal architecture is very similar to that of relational databases, Hive is used to handle structured data and enables easy integration between Hadoop and other business intelligence tools.

Analyzing Big Data is a key component of any enterprise's IT strategy related to Hadoop. In the Processing Complex Data using Hadoop Module, you will gain a strong command of components such as Hive, Pig and Impala, which enable faster querying and aggregation of data from a Hadoop cluster.

Impala

Impala, similar to Hive, provides an interactive SQL-based query engine for data sitting on Hadoop servers. It is an open-source program for handling and ensuring the availability of large quantities of data. The engine was developed by the Hadoop distribution vendor Cloudera and is currently available under the open source Apache license. As is the case with Hive, Impala supports a widely known SQL-style query language, meaning that users can put their SQL expertise directly to use on Big Data. Based on comparison results published by Cloudera, Impala offers query times 6 to 69 times faster than Hive, which makes it a popular choice among enterprises for performing Big Data analyses on Hadoop.

Hadoop Streaming

The Hadoop Streaming component is a utility which allows us to write map and reduce programs in languages other than Java. It uses UNIX standard streams as the interface between Hadoop and the MapReduce job, so any language that supports reading from standard input and writing to standard output can be used. It supports most programming languages, such as Ruby, Python, Perl and .NET. So when you come across a MapReduce job written in any of these languages, its execution is most likely being handled by the Hadoop Streaming component.
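Because the interface is just lines of text on standard input and output, a mapper and a reducer can be sketched as ordinary Python functions that consume and produce lines. The sketch below is a local simulation under stated assumptions: the function names and sample input are illustrative, the tab-separated key/value line format is the convention assumed here, and the sorting between the two steps is done by hand, whereas on a cluster the framework would sort by key before the reducer runs.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_lines):
    """Sum the counts per word; the input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Local simulation of the pipeline: map, sort by key, reduce.
sample_input = ["I like Hadoop", "Hadoop is fun"]   # illustrative input
result = list(reducer(sorted(mapper(sample_input))))

for line in result:
    print(line)
```

In an actual streaming job, the mapper and the reducer would live in separate scripts, each iterating over `sys.stdin` and printing its output lines, so the same logic carries over with only the input and output wiring changed.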

HBase

HBase is a column-oriented database within the Hadoop Ecosystem that runs on top of HDFS. Hadoop by itself is a batch-oriented system: data is loaded into HDFS, processed and then retrieved. This operating mechanism is not ideal for tasks involving regular reading and writing of data, and HBase addresses this gap by supporting frequent reads and writes on data stored in HDFS. MapReduce programs can read their input directly from HBase and write their output to it. Apart from the Java API, Hive and Pig can also be used to write MapReduce programs that run on data sitting in HBase.

Sqoop and Flume

These components enable connectivity between Hadoop and the rest of the data world. Sqoop is a tool which allows transfer of data between Hadoop cluster and SQL based relational databases. Using Sqoop, one can easily import data from external enterprise data warehouses or relational databases, and can efficiently store it in HDFS or Hive databases.

While Sqoop is used to connect with traditional databases, Flume is used to collect and aggregate application data into HDFS. Typically, it is used to collect large amounts of log data from distributed servers. Flume’s architecture is based on streaming data flows and can be easily integrated with Hadoop or Hive for analysis of data. A common application of Flume is transporting massive quantities of event data such as web logs, network traffic data, social media data like Twitter feeds, and email messages.

Hadoop Streaming is an essential utility and quite helpful for programmers who prefer programming with Python or R over Java. In the Performing Analytics on Hadoop Module, you will learn about running R scripts for MapReduce jobs through the Hadoop Streaming utility.

The Hadoop Data Management Module provides a detailed introduction with hands-on exposure to database components of Hadoop such as HBase, Hive, Sqoop and Flume. You will also be able to develop more in-depth knowledge of how to load and query data using these components.

ZooKeeper and Oozie

While Hadoop offers a great model for scaling data processing applications through its distributed file system (HDFS), MapReduce and numerous add-ons including Hive and Pig, all this power and distributed processing requires coordination and a smooth workflow, which the ZooKeeper and Oozie components provide.

ZooKeeper is a component of the Hadoop ecosystem which enables highly reliable distributed coordination. Within a Hadoop cluster, ZooKeeper handles the synchronization and configuration of nodes and stores information about how these nodes can access different services relating to MapReduce implementations.

Oozie is an open source workflow scheduler system to manage data processing jobs across Hadoop cluster. It provides mechanisms to schedule the execution of MapReduce jobs based either on time-based criteria or on data availability. It allows for repetitive execution of multi-stage workflows that can describe a complete end-to-end process thus reducing the need for custom coding for each stage.

Big Data and Analytics

CHAPTER 05

So far, we have learned about the various technological and database architectural components that support Big Data management. The real imperative of Big Data lies in an enterprise’s ability to derive actionable insights and create business value. Building the capability to analyse Big Data provides unique opportunities for enterprises and puts them ahead of their competition. These analyses can also be performed on more detailed and complete data, whereas traditional analysis would be limited to samples. However, performing analytics on Big Data is a challenging task considering the volumes and complex structures involved. To deal with this, enterprises need to be able to find the right mix of tools, expertise and analytics techniques.

Many early adopters of Big Data such as Google, Yahoo, Amazon and eBay are considered pioneers in analysing Big Data. For example, eBay launches successful products and services by applying analytics to demographic and behavioural data from its millions of customers. The data used for analysis can come in various forms: user behaviour data, transactional data and customer service data. Amazon, for its part, offers a recommendation engine on its home page. It applies Big Data analytics to customers’ buying history and demographics to identify hidden patterns and provide accurate recommendations for potential new purchases.

How eBay Leverages Big Data? [2]

Online auction giant eBay regularly monitors and analyzes huge amounts of information from its 100 million customer interactions. It uses this data to conduct experiments on its users in order to maximise selling prices and customer satisfaction. On average, it runs about 200 experimental tests at the same time, ranging from barely noticeable alterations to the dimensions of product images right up to complete overhauls of the way content is displayed on users' personal eBay home pages. Its huge customer base creates 12 TB of data per day, from every button users click to every product they buy, which is continually added to an existing pile of Big Data. As the data is queried by automatic monitoring systems and by employees looking to find more meaning in it, data throughput reaches 100 petabytes (102,400 TB) per day.

One of the business problems eBay handles is achieving the highest possible selling price for every item users put up for sale, as its profits come from a cut of each sale. Its data scientists perform advanced analytics by looking at all the variables in the way items are presented and sold. As one approach to this problem, they began to study the impact of picture quality in a listing on the selling price. They used the Hadoop framework to process petabytes of pictures because of its capability to handle unstructured data. These pictures were then analyzed and re-analyzed, and the data scientists managed to extract structured information such as how much the items sold for and how many people viewed them. In the end, they managed to establish a possible relationship and concluded that better image quality actually does fetch a better price.

The real value of Big Data lies in the insights it can generate. The Processing Complex Data using Hadoop Module provides hands-on techniques and knowledge to analyze Big Data with the help of Hadoop components. In the Performing Analytics on Hadoop Module, you will learn how analytics tools can be used to run some advanced analyses on data residing on Hadoop.

Analytics Project Framework

Before doing a deep dive into Big Data, the first and most important step of any analysis is to identify the business problem. This is a fundamental step even in traditional data analytics projects. Once the business problem is defined, Big Data can be leveraged to search for hidden patterns and get valuable insights. Typically, the analytics problems being solved are of the following nature:

• Predicting customer churn behaviour to design reach-out campaigns

• Understanding online and offline marketing impacts on sales

• Identifying whether a transaction is fraudulent or not

• Using customer purchasing patterns to recommend new products

• Forecasting sales for better inventory management

Irrespective of the problem or the vertical, the methodology involved in implementing data analytics projects remains the same. The major differences between Big Data analytics projects and traditional data projects are the scale of the data being handled and the combination of tools required. The business problems, analytics techniques and project methodology, on the other hand, remain the same and are independent of the data being handled. As part of any analytics project cycle, the processes typically involved are problem definition, data gathering, selecting the right technique, performing the analysis and visualizing the final results.

Let us get some more perspective on the various stages involved in implementing an analytics project, using a used car price prediction example.

Problem Identification

The first question that needs to be asked in any data analytics project is: what is the problem we are trying to solve? In today's Big Data world, enterprises are performing data analytics on various kinds of business problems. It becomes essential to figure out which problem would create the highest business impact and then to focus on it to maximize ROI.

In the used car price prediction example, retailers would benefit from being able to determine the value of a used car based on a variety of characteristics such as make, model and engine. Such information would help them to better manage the flow of supply and demand in a highly price-volatile market. Also, with robust knowledge of price variations by model type, retailers can target buyers with relevant promotions and discounts.

Gathering Required Data


To predict used car prices, we will require sales data across several years that captures the type of car sold, the number of years it was in use and the final amount paid by the buyer. Additional data can be captured on the condition and performance of the car, such as mileage, and on internal characteristics such as type, trim, doors and cruise control. These days, with the rapid growth of social media and other data sources, more data can be captured on the brand perception of used cars and on insurance claims related to a car, which provides more insight into price variations.

Choosing the Right Analytics Technique

Picking the right technique for any given problem is as critical as finding the right kind of data to begin with. In analytics projects, we often depend on various tools and algorithms to work on various data problems. For example, R is known for its statistical offerings while Python is popular for text data processing. When it comes to extracting solutions, statistical techniques rely on business context and have specific use cases: clustering algorithms are used for solving customer segmentation problems, time series algorithms are used for forecasting problems, and recommendation algorithms are used to suggest more relevant products or services. Before any technique is applied, the gathered data needs to undergo a set of data operations such as data cleansing, data aggregation, data augmentation, data sorting, and data formatting, which are collectively referred to as pre-processing steps. These steps translate the raw data into a fixed data format which is then passed as input to various algorithms.
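To make the pre-processing steps more concrete, here is a minimal Python sketch of cleansing, aggregation, sorting and formatting applied to a handful of used car sales records. The records, field names and helper functions are all made up for illustration; a real project would read the data from its actual sources and would likely need more steps.

```python
from collections import defaultdict

# Hypothetical raw sales records; field names and values are illustrative only.
raw_sales = [
    {"make": " Honda ", "model": "Civic", "age_years": "3", "price": "9500"},
    {"make": "honda", "model": "Civic", "age_years": "5", "price": "7200"},
    {"make": "Ford", "model": "Focus", "age_years": "", "price": "6100"},  # missing age
    {"make": "Ford", "model": "Focus", "age_years": "4", "price": "6800"},
]


def cleanse(records):
    """Data cleansing: normalise text fields and drop rows with missing values."""
    cleaned = []
    for row in records:
        if not row["age_years"] or not row["price"]:
            continue  # skip incomplete records
        cleaned.append({
            "make": row["make"].strip().title(),
            "model": row["model"].strip().title(),
            "age_years": int(row["age_years"]),
            "price": float(row["price"]),
        })
    return cleaned


def average_price_by_model(records):
    """Data aggregation: mean sale price per (make, model) pair."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [sum of prices, count]
    for row in records:
        key = (row["make"], row["model"])
        totals[key][0] += row["price"]
        totals[key][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


def to_feature_rows(records):
    """Data sorting and formatting: fixed-order numeric rows for an algorithm."""
    ordered = sorted(records, key=lambda row: row["age_years"])
    return [(row["age_years"], row["price"]) for row in ordered]


clean_sales = cleanse(raw_sales)
model_averages = average_price_by_model(clean_sales)
feature_rows = to_feature_rows(clean_sales)
print(model_averages)
print(feature_rows)
```

The output of the last step is a list of plain numeric rows, which is the kind of fixed format most algorithms expect as input.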

Since the problem in the used car example is predicting price values, regression techniques can be used. At a broad level, these deal with predictions of continuous variables like price, income, age and sales. Regression can be implemented with many algorithms, such as linear regression, random forests, neural networks and support vector machines, which vary in terms of complexity of implementation and scope for business interpretation. At this point, this might sound rather technical, but getting a general idea is what matters here. You will be able to appreciate the underlying concepts more while using these techniques in real-world projects.
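As a rough illustration of the simplest of these techniques, the sketch below fits a linear regression with a single predictor (the age of the car) using ordinary least squares. The data points are made up and deliberately lie on a straight line so the result is easy to check by hand; the function names are hypothetical, and a real model would use many more predictors and a statistical package.

```python
def fit_simple_linear_regression(xs, ys):
    """Ordinary least squares for one predictor: y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up training data: age of the car in years and its sale price.
ages = [1, 2, 3, 4, 5]
prices = [14000, 13000, 12000, 11000, 10000]

intercept, slope = fit_simple_linear_regression(ages, prices)


def predict_price(age_years):
    """Predict the price of a car of the given age with the fitted line."""
    return intercept + slope * age_years


print(intercept, slope)
print(predict_price(6))
```

With this toy data the fitted line loses 1,000 in price for every additional year of age, which is the kind of directly interpretable result that makes linear regression attractive for business use.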

Implementing Analytics Techniques

As discussed in the section above, analytics problems can be solved with the help of a variety of statistical techniques. When it comes to implementing these techniques, there are a lot of options available in terms of analytics tools, such as SAS, R and Python. SAS is more popular amongst enterprises because of its ease of use, while R and Python are open source tools which have a lot of takers amongst academia and programmers. On average, almost 80% of the time in any analytics project goes into problem identification, data gathering and pre-processing, while the remaining 20% is used for implementing the chosen techniques and visualizing the final results. In the case of Big Data, the same algorithms can be translated into MapReduce algorithms for running on Hadoop clusters, which often requires more effort and specialized expertise. In the Hadoop ecosystem, the Mahout component uses the Java programming language to implement statistical techniques such as classification and recommendation algorithms.
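To give a feel for what translating a computation into MapReduce means, here is a small plain-Python simulation of the map, shuffle and reduce phases that computes the average sale price per make. It does not use Hadoop or any Hadoop API; the function names and records are hypothetical, and on a real cluster the framework would run the mappers and reducers in parallel across many machines.

```python
from collections import defaultdict


def map_phase(record):
    """Mapper: emit a (key, value) pair for one input record."""
    make, price = record
    yield make, price


def shuffle_phase(pairs):
    """Shuffle: group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: collapse all values for one key into a single result."""
    return key, sum(values) / len(values)


# Made-up input records: (make, sale price).
records = [("Honda", 9500), ("Ford", 6800), ("Honda", 7200), ("Ford", 6200)]

mapped = [pair for record in records for pair in map_phase(record)]
grouped = shuffle_phase(mapped)
average_price = dict(reduce_phase(key, values) for key, values in grouped.items())
print(average_price)
```

The extra effort mentioned above comes from having to express every algorithm as independent map and reduce steps like these, which is straightforward for an average but much harder for iterative statistical techniques.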

Depending on the data volumes being gathered for used car prediction problem, say to implement linear


used for Big Data. Another alternative for Big Data would be making use of Mahout Component which requires Java expertise.

Visualizing End Results

Data visualization is used for displaying the output of analytics projects. Typically this is the last step of any analytics project, where visualization techniques are implemented either to validate the outcomes of the technique or to present the end results to a non-technical management team. This can be done with various data visualization software such as Tableau and Spotfire, and also with the built-in capabilities of SAS and R. In comparison with SAS, R offers a variety of packages, namely ggplot2 and lattice, for the visualization of datasets.

After building a linear regression model for used car price prediction, visualization techniques are implemented to validate the statistical results and to check whether these results satisfy the assumptions of the technique. Some of the standard validation checks for a linear regression model look for heteroskedasticity, autocorrelation, and multicollinearity. The plots above showcase visualization examples of performing these validation checks on the final model results of the used car price prediction example.
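Alongside plots, two of these checks can also be expressed as simple numbers. The sketch below, which uses made-up residuals and predictor values, computes the Durbin-Watson statistic as a check for autocorrelation in the residuals and the Pearson correlation between two predictors as a first, rough check for multicollinearity. It does not cover heteroskedasticity, and the function names are illustrative rather than taken from any particular package.

```python
import math


def durbin_watson(residuals):
    """Durbin-Watson statistic: values near 2 suggest little first-order autocorrelation."""
    numerator = sum(
        (residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals))
    )
    denominator = sum(e * e for e in residuals)
    return numerator / denominator


def pearson_correlation(xs, ys):
    """Pearson correlation: values near +1 or -1 between predictors hint at multicollinearity."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Made-up residuals that alternate in sign (a pattern of negative autocorrelation).
residuals = [1.0, -1.0, 1.0, -1.0]
dw_statistic = durbin_watson(residuals)

# Made-up predictors: here mileage rises in lockstep with the age of the car.
age_years = [1, 2, 3, 4]
mileage_thousands = [2, 4, 6, 8]
predictor_correlation = pearson_correlation(age_years, mileage_thousands)

print(dw_statistic, predictor_correlation)
```

In this toy case the Durbin-Watson value is well above 2 and the two predictors are perfectly correlated, so both checks would flag the model for a closer look.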

Different Kinds of Analytics

By looking at the used car price prediction example, one can understand that irrespective of any domain the framework for implementing analytics projects remains the same. However, as different enterprises work on


the common ones are marketing analytics, customer analytics, risk analytics, fraud analytics, human resource analytics and web analytics, which are classified based on different business functions. Marketing analytics in any enterprise revolves around increasing efficiency and maximizing marketing performance through analyses such as marketing mix optimization, marketing mix modeling, price analysis and promotional analysis, to name a few. Customer analytics, on the other hand, deals with understanding customer behaviour and increasing loyalty using analyses like customer segmentation, attrition analysis and lifetime value analysis.

Another common classification is based on the complexity level of the analytics techniques being implemented across an enterprise and is independent of the domain. Under this classification, analytics is broadly divided into basic analytics and advanced analytics.

Basic Analytics

Basic analytics techniques, which include simple visualizations and simple statistics, are generally used to explore your data. Some of the common examples are:

• Slicing and dicing refers to breaking data down into smaller sets that are easier to explore. It is mostly employed as a preliminary step to gain a better understanding of the data attributes, of how different techniques can be used, and of how much computational power is required for a full-scale analysis.

• Anomaly identification is the process of detecting outliers, such as an event where the actual observation differs from the expected value. This might involve computing summary statistics like the mean, median and range, and sometimes involves visualization techniques such as scatter plots and box plots to identify outliers visually.
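Both ideas can be sketched in a few lines of Python. The example below first slices a made-up list of car listings down to a single make, then computes summary statistics and flags outliers with the 1.5 x IQR rule, which is the same rule a box plot uses to draw its whiskers. The listings and function names are hypothetical, and the choice of the IQR rule is an assumption made for illustration; other outlier rules are equally valid.

```python
import statistics

# Made-up listings: (make, asking price).
listings = [
    ("Honda", 8200), ("Honda", 8500), ("Ford", 6100),
    ("Honda", 8700), ("Honda", 8900), ("Ford", 6400),
    ("Honda", 9100), ("Honda", 9300), ("Honda", 25000),
]


def slice_by_make(rows, make):
    """Slicing: keep only the prices for one make."""
    return [price for row_make, price in rows if row_make == make]


def iqr_outliers(values):
    """Anomaly identification using the box plot rule (1.5 x interquartile range)."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    lower_fence = q1 - 1.5 * spread
    upper_fence = q3 + 1.5 * spread
    return [v for v in values if v < lower_fence or v > upper_fence]


honda_prices = slice_by_make(listings, "Honda")
summary = {
    "mean": statistics.mean(honda_prices),
    "median": statistics.median(honda_prices),
    "range": max(honda_prices) - min(honda_prices),
}
outliers = iqr_outliers(honda_prices)
print(summary)
print(outliers)
```

Note how the single extreme listing pulls the mean well above the median, which is itself a quick hint that an outlier is present.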

Advanced Analytics

Advanced analytics involves the application of statistical algorithms for complex analysis of either structured or unstructured data. Among its many use cases, these techniques can be deployed for finding patterns in data, prediction, forecasting, and complex event processing. With the growth of Big Data and enterprises' need to stay ahead of the competition, advanced analytics implementations have become mainstream as an integral part of the decision-making process. Some examples of advanced analytics are:

• Text Analytics is the process of analyzing unstructured text, extracting relevant information, and transforming it into structured information on which statistical techniques can be applied. Since much of Big Data comprises unstructured data, text analytics has become one of the mainstream applications of Big Data analytics.

• Predictive Modeling consists of statistical or data-mining solutions, including algorithms and techniques, used to determine future outcomes. A predictive model is made up of a number of predictors, which are variable factors that are likely to influence future behavior or results. In marketing, for example, a customer's gender, age, and purchase history might predict the likelihood of a future sale. Other common applications include churn prediction, fraud detection, customer segmentation, marketing spend optimization and many more.


Unstructured Data and Text Analytics

CHAPTER 06

Unstructured data usually takes up a lot of storage capacity and is more difficult to analyze than structured data, which is relatively easy to handle and process. It is basically information which is text heavy. In most cases it does not have a predefined data model and does not fit well into traditional database management systems. At an enterprise level, only 20% of the Big Data being handled is structured and the remaining 80% is unstructured. Most of the unstructured data these days is machine generated from various sources such as satellite images, video surveillance, scientific sensors, weather monitoring, social media, mobile and other web content. Data coming from these sources is in the form of text, images, videos, web logs, and other customary machine formats like sensor output.

A few key facts related to unstructured data are:

• Most new data is unstructured, representing almost 95 percent of all data generated

• Unstructured data tends to grow exponentially, and is estimated to be doubling every year

• Unstructured data is vastly underutilized due to the limitations of traditional IT technologies

With the evolution of Big Data technologies, enterprises can effectively process unstructured data and derive business value out of it. Most firms currently implement NoSQL-based technologies, mainly Hadoop, whose capabilities extend beyond those of traditional databases. Regardless of the native formats, Hadoop can store different types of data from multiple databases with no prior need for a schema. Within the Hadoop ecosystem, HDFS is used for storage and handles data without predefined models, while the MapReduce framework is used for quick processing of large volumes of unstructured data. Using the data sitting in Hadoop, enterprises can then tap into traditionally unexplored information and start making more decisions based on hard data.


We have seen in an example in an earlier chapter how eBay (a giant online marketplace) tries to achieve the highest selling price for items by understanding the impact of the quality of the picture shared in the listing. To find a possible solution, data science teams at eBay performed extensive image analysis and successfully found a relationship between listing views and items sold. This is a classic real-world example of unstructured data processing with the help of Hadoop. Generally, in order to create value out of unstructured data, some of the most common analytics methodologies are text analytics, image analysis and audio analysis. Out of these, text analytics has been adopted as a mainstream activity across many enterprises with the increased usage of Hadoop and other Big Data technologies.

Text analytics is commonly referred to as the process of analyzing unstructured text, extracting relevant information, and transforming it into structured information that can then be leveraged in various ways. The analysis and extraction processes used in text analytics take advantage of techniques that originated in computational linguistics, statistics, and other computer science disciplines. In a Big Data scenario, the applications of text analytics are widespread, covering social media analysis, brand perception and sentiment analysis, and even areas such as churn and fraud prediction. Increasingly, enterprises across all verticals are looking for ways to combine structured and unstructured data to get a complete view of their customers' perceptions of various product and service offerings.

In the context of Big Data analytics, text analytics can be implemented with the help of Hadoop components such as Pig and Hive, and with MapReduce programming using Java, Python and other languages. These components are equipped with built-in and custom functions which are suited to processing unstructured data formats like text, images and videos. The key to successfully handling unstructured data is to bring structure to the native format and then apply analytics or statistical techniques on top of it. Apart from Hadoop solutions, commercial text analytics tools are offered in the Big Data space by vendors like Attensity, Clarabridge, IBM and SAS.
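The idea of bringing structure to text can be illustrated with a small Python sketch that turns free-form reviews into rows of numbers. The reviews and the tiny positive and negative word lists are made up for illustration, and counting word hits is only a very crude stand-in for real sentiment analysis; it is shown here on a single machine rather than on Hadoop.

```python
import re
from collections import Counter

# Tiny, hypothetical sentiment lexicons; real ones contain thousands of terms.
POSITIVE_WORDS = {"great", "love", "reliable"}
NEGATIVE_WORDS = {"poor", "noisy", "expensive"}


def tokenize(text):
    """Split raw text into lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def to_structured_row(text):
    """Transform one piece of unstructured text into a structured record."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    positive_hits = sum(counts[word] for word in POSITIVE_WORDS)
    negative_hits = sum(counts[word] for word in NEGATIVE_WORDS)
    return {
        "token_count": len(tokens),
        "positive_hits": positive_hits,
        "negative_hits": negative_hits,
        "sentiment_score": positive_hits - negative_hits,
    }


# Made-up customer reviews.
reviews = [
    "Love this car, great mileage and very reliable!",
    "Poor service and a noisy, expensive engine.",
]
rows = [to_structured_row(review) for review in reviews]
for row in rows:
    print(row)
```

Once each review has become a row of numeric fields like these, it can be combined with structured customer data and fed into the same statistical techniques discussed in the previous chapter.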

The majority of enterprise Hadoop applications are implemented to deal with unstructured data. In the Processing Complex Data using Hadoop and Performing Analytics on Hadoop Modules, you will learn more about how to handle and analyze text data with the help of real-world examples that use Twitter and email data.


Big Data in Action

CHAPTER 07

Enterprises are spending millions of dollars on scaling up their existing IT infrastructures with Big Data technologies, with the end goal of extracting more business value from data and staying ahead of their competition. Unlike traditional data warehousing and BI opportunities, Big Data and analytics opportunities are more business-hypothesis driven and often revolve around exploratory activities. This scenario is consistent across all verticals, since Big Data is being generated by every function of a business, from manufacturing to sales. The key to success in dealing with Big Data lies in an enterprise's ability to define relevant business problems, combine structured and unstructured data sources, and identify hidden trends.

The business problem being handled varies by task, as some are computationally intensive while others are more data-analysis intensive. Understanding the nature of the problem is essential for picking the correct approach. In order to exploit Big Data analytics, enterprises often develop a compelling business use case that clearly outlines what business outcomes are to be achieved. One such use case in the IT domain is the implementation of the Aadhaar Big Data project by the Government of India. Aadhaar is a 12-digit unique number issued to all residents of India with the goal of creating the world's largest biometric database. The objective of this project is to deliver more efficient public services and facilitate direct cash transfers to the people.

Likewise, many organizations are basing their business cases on the following benefits that can be derived from Big Data and analytics:

• Smarter decisions: Enable decision making beyond traditional practices

• Faster decisions: Reduce the dependency on bureaucracy within an organization

• Impactful decisions: Focus on value-generating efforts and capabilities

The applications and use cases of Big Data are many and span business domains. The case studies taught in the course help you appreciate both the IT and the business aspects of Big Data. They cover domains such as Finance, E-Commerce, Airline and Social Media, and provide hands-on exposure to processing Big Data and then analyzing it to solve various business problems.


In search of achieving these benefits, many business verticals such as Telecom, Banking, Insurance, HealthCare, Retail, IT and Manufacturing are all riding the wave of Big Data analytics. Now we will review further how some of these industries are leveraging Big Data to solve their business problems.

Retail & E-Commerce

Retail is one of the high-potential areas for Big Data. A survey conducted by research firm IDC revealed that retailers are increasingly looking at Big Data and analytics to derive business benefits. Companies can bring together both online and offline data along with transaction information to better understand the factors that drive shoppers' behavioural traits. Beyond purchase data, retailers are looking at a whole array of new data sources: web browsing data, social data and geo-tracking data, which further help in the thorough segmentation of customers. Combining this new information with traditional data, they have started doing high-end analytics like market-basket analytics, seasonal sales analytics, inventory optimization analytics, and pricing optimization analytics.

In the case of e-mail targeting, the traditional approach has been to scan through the entire customer base, develop a list of customers and then send out mass mailers to all of them. The Big Data advantage, however, is personalization: by understanding a consumer's browsing history, retailers can share specific messages related to the items searched for and then offer that shopper a targeted promotion. Also, with the help of location data from mobile devices, a customer who is present in a store can be offered specific coupons to motivate them into making a purchase.

Telecom

Similar to other sectors, communications service providers all over the globe are witnessing significant data growth due to the increased adoption of smartphones, the rise of social media and the growth of mobile networks. Many of these firms are tackling Big Data challenges so as to gain more market share and increase profit margins. Big Data can help service providers achieve some of their key business objectives: provide better customer service with the help of internal and external data, implement innovative products and services using segmentation techniques, and develop strategies to generate new sources of revenue. Over the last few years, telecom operators have moved away from a traditional model of data warehousing towards a centralized data repository model with integrated reporting solutions. With the exponential growth of Big Data in this sector, these operators are now looking towards new technologies as a cost-effective solution for processing the growing volumes of data.

One of the common applications most telecom operators implement is around integrating network performance data with subscriber usage patterns. This is to understand what is happening in the complex intersection of network and services (voice, data, and content). It generally helps them to detect network performance issues in real time and provide quality customer services to maximize their customer satisfaction.

Financial Services


records generated on a daily basis. With digitalisation, a variety of data sources, such as social media, information portals and customized web applications, are adding more information to the industry's existing ocean of data. Implementing Big Data solutions enables these enterprises to collect and organize a host of additional data such as customer preferences, behaviors, interaction history, events and location-specific details in a cost-effective manner. Using this wealth of information, many financial services firms run sophisticated analytics to determine the best set of actions to recommend to a customer, such as a targeted promotion, a cross-sell/up-sell offer, a discounted product or fee, or a retention offer. In addition, Big Data technologies add value through real-time insight generation and help in faster decision making.

One of the major developments is to integrate external data sources such as social media with the internal IT infrastructure, which provides a broader view on customers, products and services at enterprise level. Customer segmentation is a key tool for sales, promotion, and marketing campaigns across financial services firms. Using the available data, generally customers are grouped into different segments based on their expectations, needs and other characteristics. The advantages from such an implementation are multi-fold for enterprises in terms of increasing loyalty with customers, selling more products and services and also cutting costs by better management of resources.

HealthCare

Big Data has many implications for patients, providers, researchers, payers, and other health-care constituents. Today’s patients are demanding more information about their health-care options so that they understand their choices and can participate in decisions about their care. In a health-care scenario, traditionally the key data sources have been patient demographics and medical history, diagnostic, clinical trials data and drug effectiveness index. If these traditional data sources are combined with data provided by patients themselves through social media channels and telematics, it can become a valuable source of information for researchers and others who are constantly in search of options to reduce costs, boost outcomes, and improve treatment.

One of the major applications of Big Data has been in the area of DNA analysis. With the latest tools and technologies, one can analyze an individual's entire DNA sequence and compare it against those of other individuals and groups in a shorter timeframe. The current, relatively low cost of individual DNA analysis (thousands of dollars), compared with the millions of dollars it cost in the years after the first full human genome was analyzed, has made this tool accessible to a substantial number of people.

Smart Cities

Growth of Big Data and digitalization has resulted in the availability of a wide range of information about cities, their physical infrastructure, services, and interactions between people. Smarter Cities are leveraging this Big Data to improve infrastructure, planning and management, and human services as a system of systems – with the goal of making cities more desirable, liveable, sustainable, and green. Some of the applications include mass transit, utilities, environment, emergency response, big event planning, public safety, and social programs.


one of the projects, they have implemented a project using Big Data for better traffic flow in Dublin, Ireland. By utilizing the GPS information from buses, IBM has been able to more accurately measure the arrival and departure times and pass that data on to travellers via the transportation system’s notification boards. This information would enable people to make better use of their time which would further increase the confidence in public transport system.

Big Data Landscape

CHAPTER 08

Hadoop Distribution Offerings

Although Hadoop and its projects are completely open source, a large number of companies have developed their own Hadoop distributions which are more ready to use. These distributions are packaged and guaranteed to include the HDFS and MapReduce components along with other supporting tools. There are several distributions available, such as the ones provided by EMC and Intel, as well as those provided by hardware vendors like IBM, which are typically all-in-one solutions that include hardware. But the three biggest and most prevalent Hadoop distributions that exist today are Cloudera, MapR and Hortonworks. If you are looking for a quick plug-and-play option, each of these vendors offers VM images with Linux and Hadoop already installed.

Apache Hadoop, the original release of Hadoop, comes from apache.org and is backed by the community support of the Apache Software Foundation. Many of the original Hadoop releases are made by this group, with the latest one being Hadoop 2.0. Products from other companies or organizations that contain modified or extended versions of Apache Hadoop are generally termed Hadoop distributions. One important point to note is that these Hadoop distributions are continuously upgraded to keep up with the latest Apache Hadoop releases from the Apache Software Foundation.

There are many Hadoop distributions available, and Cloudera CDH4 is one of the most widely used distributions at the enterprise level. In the Big Data and Hadoop Overview Module, you will learn about installing and working with the CDH4 Hadoop distribution. The Cloudera CDH4 installation includes Apache Hadoop along with other components such as Pig, Hive and Impala for Big Data processing.

Some of the distributions that include Apache Hadoop and provide additional capabilities, such as commercial support and other Hadoop-related utilities, are the following.

Cloudera's Hadoop Distribution, CDH4 version includes HDFS, YARN, HBase, MapReduce, Hive, Pig, Zookeeper, Oozie, Mahout, Hue, and other open source tools (including the real-time query engine Impala). Cloudera Manager Free Edition includes all of CDH, plus a basic Manager supporting up to 50 cluster nodes. Cloudera


supporting an unlimited number of cluster nodes, proactive monitoring, and additional data analysis tools.

Hortonworks Hadoop Distribution, HDP version 2.0 includes HDFS, YARN, HBase, MapReduce, Hive, Pig, HCatalog, Zookeeper, Oozie, Mahout, Hue, Ambari, Tez, and a real-time version of Hive (Stinger) and other open source tools. It also provides high-availability support, a high-performance Hive ODBC driver, and Talend Open Studio for Big Data.

MapR Hadoop Distribution, M7 version includes HDFS, HBase, MapReduce, Hive, Mahout, Oozie, Pig, ZooKeeper, Hue, and other open source tools. It also includes direct NFS access, snapshots, and mirroring for “high availability,” a proprietary HBase implementation that is fully compatible with Apache APIs, and a MapR management console.

IBM InfoSphere BigInsights is available in two editions. The Basic Edition includes HDFS, HBase, MapReduce, Hive, Mahout, Oozie, Pig, ZooKeeper, Hue, and several other open source tools, as well as a basic version of the IBM installer and data access tools. The Enterprise Edition adds sophisticated job management tools, a data access layer that integrates with major data sources, and BigSheets (a spreadsheet-like interface for manipulating data in the cluster).

Intel Distribution for Apache Hadoop is a product based on Apache Hadoop, containing optimizations for Intel's latest CPUs and chipsets. It includes the Intel Manager for Apache Hadoop for managing a cluster.

Amazon Elastic MapReduce is a cloud service that enables users to easily process vast amounts of data at a lower cost. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3). It includes HDFS (with S3 support), HBase (proprietary backup recovery), MapReduce, Hive (added support for Dynamo), Pig, and Zookeeper.

Windows Azure HDInsight is a Hadoop solution for the Azure cloud. It is integrated with the Microsoft management console for easy deployment and integration with System Center. It can be integrated with Excel through a Hive Excel plug-in. Further, it also offers connectivity services with Microsoft SQL Server Analysis Services (SSAS), PowerPivot, and Power View through the Hive Open Database Connectivity (ODBC) driver.

For Big Data analysis, apart from knowing about the analytics project cycle and the kinds of analysis that can be done, enterprises should also use the right kind of analytics tools to deal with Big Data efficiently. Broadly, Big Data analysis tools can be classified by their statistical technique offerings and their business intelligence integration capabilities. Although Hadoop components can be used to achieve each of these independently, Hadoop is not a specialized analytics tool and is popularly used only for its distributed framework. Let's explore some of the tools which offer extensive visualizations, drag-and-drop options, and easy-to-install scripts.

Handling Big Data on the cloud is a growing practice among enterprises due to its low cost and better processing capabilities. As part of the virtual lab offering, you will be provided with an AWS instance with Hadoop installed to work on the assignments and other case study problems.

R is the most popular open source analytics tool. In the Performing Analytics on Hadoop Module, you will learn about R syntax, handling data and running statistical tests with R, and also about techniques to integrate R with Hadoop for implementing MapReduce programs from the R console.

Analytics Implementations

With the explosion of Big Data, there has been a quick growth of tools providing statistical capabilities at a larger scale. Since statistics is critical for identifying and quantifying relationships between various attributes in the data, it is one of the key components of many analytics tools catering to Big Data. Some of the more interesting tools include:

R is an open source programming and statistical language that is rapidly gaining popularity in the Big Data space. It has been widely used by universities and startup companies alike for many years, but much of the recent interest can be attributed to its open source nature and its flexibility of integration with open source Big Data technologies such as Hadoop. In terms of statistical capabilities, R is very versatile and has more than 4000 packages covering a wide range of Big Data analysis problems. Also, with the introduction of the RHadoop packages by Revolution Analytics, anyone can now work with a Hadoop cluster, interact with data in HDFS, and run MapReduce jobs written in R syntax.
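To make the MapReduce idea mentioned above more concrete, here is a minimal single-machine sketch of the map, shuffle and reduce steps for a word count. It is written in plain Python purely for illustration; it is not the RHadoop API, and the function names and sample lines are assumptions made up for this example.

```python
from collections import defaultdict

def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle_phase(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce step: collapse the values for one key into a single count."""
    return key, sum(values)

def word_count(lines):
    """Run map, shuffle and reduce in sequence on a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

# Hypothetical input, invented for this sketch.
sample_lines = ["big data needs big tools", "R works with Hadoop data"]
counts = word_count(sample_lines)
print(counts)
```

On a real cluster the map and reduce steps run in parallel on different machines and the framework performs the shuffle; the logic of each step stays the same.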

SAS has been a pioneer in business analytics software and services over the last decade, and is also the largest independent vendor in the business intelligence market. In order to deal with the data deluge, SAS has recently upgraded its services with Big Data handling capabilities. These help users perform data manipulation and exploratory analysis on Hadoop. Unlike working directly with Hadoop, which requires specialized expertise, enterprises can leverage their existing SAS skills to work easily with Big Data. SAS also offers text analytics capabilities as part of its overall analytics platform, where text data is viewed as simply another source of data.

Apache Mahout, a statistical component of the Hadoop ecosystem, provides scalable machine learning algorithms on top of the Hadoop platform. Mahout provides algorithms for clustering, classification, and collaborative filtering implemented on top of Apache Hadoop using MapReduce. However, it requires Java programming expertise to implement and run MapReduce jobs using Mahout.
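As a rough illustration of what collaborative filtering means, the sketch below predicts a user's rating for an unseen item from the ratings of similar users. It is a small, single-machine Python example of the general technique, not Mahout's Java API; the ratings data, user names and function names are all hypothetical.

```python
from math import sqrt

# Hypothetical user -> {item: rating} data, invented for this sketch.
ratings = {
    "alice": {"A": 5, "B": 3, "C": 4},
    "bob":   {"A": 4, "B": 2, "C": 5, "D": 4},
    "carol": {"A": 1, "B": 5, "D": 2},
}

def cosine_similarity(u, v):
    """Cosine similarity between two users' rating dictionaries."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[item] * v[item] for item in common)
    norm_u = sqrt(sum(r * r for r in u.values()))
    norm_v = sqrt(sum(r * r for r in v.values()))
    return dot / (norm_u * norm_v)

def predict_rating(user, item):
    """Similarity-weighted average of other users' ratings for the item."""
    weighted_sum = 0.0
    similarity_sum = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        similarity = cosine_similarity(ratings[user], other_ratings)
        weighted_sum += similarity * other_ratings[item]
        similarity_sum += similarity
    return weighted_sum / similarity_sum if similarity_sum else None

# Alice has not rated item "D"; estimate it from Bob's and Carol's ratings.
predicted = predict_rating("alice", "D")
print(predicted)
```

Because Alice's ratings resemble Bob's more than Carol's, Bob's rating of the item carries more weight in the prediction. A distributed implementation applies the same idea across many machines.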

MADlib, one of the latest developments, is an open-source library that supports in-database analytics. It provides data-parallel implementations of mathematical, statistical, and machine-learning methods that support structured, semistructured, and unstructured data.
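The idea behind in-database analytics is to push the computation to where the data lives instead of exporting every row. The sketch below illustrates this with Python's built-in sqlite3 module as a stand-in database: a simple linear regression is fitted from five SQL aggregates, so only five numbers leave the database. This is not MADlib's interface; the table name, columns and data are assumptions made up for the example.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for a production database.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE observations (x REAL, y REAL)")
# Hypothetical data that follows y = 2x + 1 exactly.
db.executemany(
    "INSERT INTO observations VALUES (?, ?)",
    [(1, 3), (2, 5), (3, 7), (4, 9), (5, 11)],
)

# The database computes the sums; the raw rows are never pulled out.
n, sum_x, sum_y, sum_xy, sum_xx = db.execute(
    "SELECT COUNT(*), SUM(x), SUM(y), SUM(x * y), SUM(x * x) FROM observations"
).fetchone()
db.close()

# Closed-form least-squares estimates from the aggregates.
slope = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
intercept = (sum_y - slope * sum_x) / n
print(slope, intercept)
```

With large tables the saving comes from the aggregates being computed next to the data, which is the same motivation behind data-parallel in-database libraries.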

Text Analytics Implementations


in the text analysis Big Data market.

Attensity is one of the original text analytics companies and began developing and selling products more than ten years ago. It offers several engines for text analytics around Auto-Classification, Entity Extraction, and Exhaustive Extraction. Attensity's text analytics tools use the Hadoop framework to store data and focus on social and multichannel analytics, analyzing text from both internal and external sources for reporting.

Clarabridge is another pure-play text analytics vendor that deals extensively with unstructured data processing. It offers its solution as Software as a Service (SaaS).

Software giant IBM offers IBM Content Analytics solutions in the text analytics space. This tool transforms content into analyzed information, which is then made available for detailed analysis in much the same way that structured data would be analyzed in a BI toolset.

Business Intelligence Integration

Generally, enterprise-level business intelligence needs center on regular reporting, generating dashboards and creating visualizations. The Hive component of Hadoop provides traditional database features and business intelligence integration capabilities to meet an enterprise's reporting and analysis needs on structured data. Although it uses a SQL-like query language for Big Data analysis, additional ready-to-use features such as drag-and-drop and automated reporting are not supported. Many alternative tools provide advanced business intelligence reporting on Big Data sitting in a Hadoop cluster. These BI tools provide a rich, user-friendly environment to slice and dice data. We will review some of the widely used ones.
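Before turning to those tools, it may help to see the kind of reporting query a SQL-like language expresses. The sketch below uses Python's built-in sqlite3 module as a stand-in for a Hive table, since the GROUP BY aggregation shown is ordinary SQL; it does not connect to Hive or Hadoop, and the table, columns and values are hypothetical.

```python
import sqlite3

# In-memory SQLite database standing in for a table stored in a cluster.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (region TEXT, page TEXT, views INTEGER)")
# Hypothetical web traffic data, invented for this sketch.
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?, ?)",
    [
        ("north", "home", 120),
        ("north", "pricing", 30),
        ("south", "home", 80),
        ("south", "pricing", 45),
        ("south", "blog", 15),
    ],
)

# A typical reporting query: total views per region, largest first.
query = """
    SELECT region, SUM(views) AS total_views
    FROM page_views
    GROUP BY region
    ORDER BY total_views DESC
"""
report = conn.execute(query).fetchall()
conn.close()

for region, total_views in report:
    print(region, total_views)
```

A query like this still has to be written by hand, which is the gap that drag-and-drop BI tools aim to fill.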

Tableau has been gaining popularity across enterprises as the go-to BI tool for analysing and visualizing Big Data. It offers direct connections to many high-performance databases, cubes, Hadoop, and cloud data sources such as Salesforce.com and Google Analytics. Tableau has a fast, in-memory analytical engine which can work directly with Big Data to create reports and dashboards. It also supports publishing web dashboards on a server, which enables easy sharing across the enterprise. With more than 30 database plugins available, ranging from Big Data stores to traditional SQL databases, Tableau is becoming a must-have BI tool across many enterprises and industry verticals.

Datameer Analytics Solution (DAS) is a business integration platform for Hadoop that provides comprehensive capabilities to analyze both structured and unstructured data. Its major specialization is enabling analysis of large volumes of data stored in a Hadoop cluster. It has a spreadsheet interface with over 200 analytic functions, and visualizations including reports, charts and dashboards. DAS provides support for all major Hadoop distributions including Apache, Cloudera, EMC Greenplum HD, IBM BigInsights, MapR, Yahoo! and Amazon.

Pivotal, an EMC spinoff, offers Big Data storage and analytics capabilities. Pivotal's Big Data solutions offer a wide set of enterprise data products: MPP and column store databases, in-memory data processing, and Hadoop. Pivotal also provides in-database integrations with SAS analytics and is one of the fastest growing BI vendors in the Big Data analytics space.

Tableau is one of the leading data visualization tools at enterprise level. In the Performing Analytics on Hadoop Module, you will learn about working with Tableau and will be able to build web dashboards and complex visualizations using the data residing in a Hadoop cluster.

Pentaho Big Data Solutions supports the entire Big Data analytics process ranging from discovering and preparing data sources, to integration, visualization, analysis and prediction. For IT and developers, Pentaho provides a complete, visual design environment to simplify and accelerate data preparation and modeling. For business users, Pentaho provides visualization and exploration of data. And for data analysts and scientists, Pentaho provides full data discovery, exploration and predictive analytics.

Another commercial solution offered by IBM combines InfoSphere BigInsights and Cognos software. This gives organizations a powerful solution to translate large amounts of data into valuable, actionable insights. InfoSphere BigInsights software provides Big Data processing capabilities, whereas Cognos software offers enterprise-level BI capabilities.