WHITE PAPER. Building your Big Data analytics strategy: Block-by-Block!

Abstract

In this white paper, Impetus discusses how you can handle Big Data problems.

It describes how Big Data analytics is changing the way technology evolves and the new challenges this brings. It discusses best practices for addressing Big Data analytics concerns and how to create strategies that cope with these challenges. The white paper guides you in choosing the right strategy and the optimal technology stack for Big Data analytics problems.

Introduction

With the number of software, Internet, and mobile users growing exponentially, there is a huge demand on software infrastructure to deal with the voluminous data being created, and to do so at high speed. Ever-increasing Internet access bandwidth is allowing huge data sets to flow onto the Web. Just ten years ago, Gigabytes and Petabytes were terms meant only for academics and R&D experts. Now Gigabytes are consumed by even small handheld devices. All of this has resulted in a unique problem: not only do we have to deal with very large data, we have to do it quickly, and in a manner that yields business insights.

Building a Big Data strategy

As with almost all software strategies, the first step in building a Big Data plan is to gather the requirements correctly, so that the business problems can be thoroughly understood. This means identifying and defining what needs to be done and laying down the expectations for the solution.

The next step is to assess and select the right strategy. This involves finding the right patterns and best practices to architect, design, and implement the relevant solution.

Another important step is to determine the right tool-sets and/or technology stacks that will fulfill all the defined requirements as well as support the best practices.

Finally, it is important to implement the chosen strategy and resolve the business problems at hand in a cost-effective manner.

Big Data and the three Vs

One of the major problems faced by data architects and stakeholders is ascertaining whether their problem is actually a Big Data issue. Unfortunately, there is no magical volume limit or algorithm to help you decide when a data problem evolves into a Big Data problem. The usual trend is to define Big Data in terms of data volumes or sizes. Also, whenever a regular RDBMS breaks down under excessive data, Big Data solutions appear to be the right choice.

However, a better way of classifying Big Data is by understanding the concept of the 3Vs model—Variety, Volume, and Velocity of data.

Volume

Volume is simply the size of the data being captured, measured in bytes. Where the Gigabyte was once considered big, today terms like Terabytes, Petabytes, and Exabytes are heard in the context of Big Data.

Variety

Variety means the different kinds of data that we are trying to capture. A simple example is a social web site capturing data from its own site, as well as drawing inputs from Twitter or Facebook, using Google Analytics, and internally using data from other third-party products. This results in a number of data formats, ranging from text, audio, and video to databases, log files, web service calls, and so on.

Velocity

The last V stands for the velocity of data, which means the speed at which the data is being captured. Using the earlier example, a feed from Twitter might carry tens of tweets for a single user, while some keyword-driven feeds can reach viral status, with thousands of tweets being fired simultaneously.

Therefore, when classifying a data problem as a Big Data problem, we need to consider all three factors in the 3V model: velocity, volume, and variety. A simple volume problem might not be a Big Data problem at all, but even a marginally large data problem can become a Big Data problem if velocity and variety are important parts of it.

The Big Data Analytics lifecycle

Every software product has a life cycle, and the same holds true for Big Data. This life cycle starts with the Creation of data, which can be created in multiple ways and in several formats. The second stage is Ingestion, where the data may undergo complex transformation, filtering, or enrichment before it becomes suitable for the third stage, which is Analytics.

Analytics may also call for some processing of data before it can be understood, to derive valuable insights from it.

Visualization is the final stage. Because it is how the results of Analytics are understood, it is an important step.

Key concerns in Big Data Analytics lifecycle

There are underlying concerns and problems that need to be addressed in every stage of the Big Data lifecycle. The data is usually created in external systems such as RDBMSs, server logs, audio/video streams, or third-party data sources.

Since the data volumes are huge, there is a need to address concerns such as how to store the data and how to optimize and compress it in the data creation stage. We also need to monitor the data creation phase and take important decisions, such as whether to use Cloud-like elasticity, and to define data backup and disaster recovery strategies.

The Ingestion phase has its own challenges, where various transformations and integrations play a major role. The data warehousing industry has traditionally used ETL (extract, transform, and load) techniques, which help overcome most of the challenges related to the Ingestion phase. Here, some of the key decisions involve finding the right tools and technologies.

The same holds for the Analysis phase, where suitable tool and technology decisions need to be taken. These decisions may involve addressing the classical build-versus-buy question and assessing how existing investments can be reused.

Moreover, the data may have hidden trends and traits that are immensely useful. Statistical data mining, machine learning and NLP (natural language processing) are becoming essential parts of the Analytical phase today. The last and very critical phase is the packaging or presentation of Analytics. Here too, tools and technologies play a pivotal role and standardization is one of the key requirements.

With so many new modes of data delivery available, visualization for the various channels also needs to be considered seriously. For example, it is important to understand whether a graphical view or a classical tabular report is the better representation of the given problem. Similarly, mobile and other handheld devices may require a different representation.

Addressing the 3Vs with the Big Data Analytics lifecycle

Having understood the 3Vs and how the foundation of the Big Data Analytics lifecycle has to be laid, it is important to see how these can be combined to define and create a Big Data strategy.

Organizations can begin by creating a matrix in which they capture answers to straightforward questions about the volume, variety, and velocity of data against each phase of the Big Data life cycle. These questions can be as simple as how much, what type, and at what rate.

Once the relevant columns of the matrix are filled in, the matrix can be used as a foundation for a strategy that addresses Big Data problems.
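As an illustration only, such a matrix can be sketched as a small nested structure with one cell per lifecycle phase and per V. The names `PHASES`, `QUESTIONS`, `empty_matrix`, and `unanswered`, and the sample answers, are hypothetical and are not part of any Impetus tooling.

```python
# Hypothetical sketch: a 3Vs-by-lifecycle matrix as a nested dictionary.
PHASES = ["Creation", "Ingestion", "Analytics", "Visualization"]
QUESTIONS = {
    "Volume": "How much?",
    "Variety": "What type?",
    "Velocity": "At what rate?",
}


def empty_matrix():
    """Return a matrix with one unanswered cell per (phase, V) pair."""
    return {phase: {v: None for v in QUESTIONS} for phase in PHASES}


def unanswered(matrix):
    """List the (phase, V) cells that still need an answer."""
    return [(p, v) for p in PHASES for v in QUESTIONS if matrix[p][v] is None]


matrix = empty_matrix()
# Illustrative answers only; real values come from requirements gathering.
matrix["Ingestion"]["Volume"] = "about 200 GB per day"
matrix["Ingestion"]["Variety"] = "server logs, tweets, RDBMS extracts"
matrix["Ingestion"]["Velocity"] = "bursts of thousands of events per second"

# Print the questions that remain open, one line per cell.
for phase, v in unanswered(matrix):
    print(f"{phase:13} {v:8} {QUESTIONS[v]}")
```

Listing the open cells makes it easy to see which phases still lack answers before a strategy is chosen.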

Selecting the right Big Data Analytics strategy

Impetus Technologies has been working with Big Data problems for many years now and has fashioned a master strategy that can address almost all the major Big Data issues and problems.

Impetus has been using this strategy successfully for many of its customers, providing them with a failure-proof solution to their pertinent Big Data challenges.

An ideal Big Data Analytics solution needs to scale easily to support large data volumes, which can run to Terabytes or Petabytes. The system should also be distributed across geographically unaware processors. It should respond quickly to highly complex queries and support a wide variety of data types, including images and arbitrary data structures. The ideal analytical solution should give data scientists all the tools they need to explain the significance of data in a manner that is easily understood by others.

An analytical solution should be able to incorporate machine learning, provide recommendations, and execute analytics on real-time incoming data such as logs, as well as provide domain-specific canned reports.

It should also be able to handle data from heterogeneous sources, whether structured or unstructured, while providing a high rate for loading and analysis, as well as the ability to handle software or hardware failures.

A Big Data Analytics strategy therefore involves creating a platform or solution that covers all aspects of the Big Data lifecycle and manages the 3Vs: variety, volume, and velocity of data.

The ideal solution for the strategy can be a platform that allows different kinds of data to be ingested. One way of implementing such a solution is to use a Service Oriented Architecture (SOA) in the form of an extensible, connector-based mechanism. This mechanism allows new connectors to be added and existing ones modified, making it possible to cater to new kinds of data sources efficiently and in a foolproof way.
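The connector idea can be illustrated with a minimal sketch. It uses an in-process registry as a stand-in for the service-based mechanism described above, so it shows only the extension point, not an SOA deployment. The names `CONNECTORS`, `connector`, and `ingest`, and the two sample source types, are assumptions made for the example.

```python
import csv
import io
import json

# Hypothetical registry mapping a source type to its connector function.
CONNECTORS = {}


def connector(source_type):
    """Register a function as the connector for one kind of data source."""
    def register(func):
        CONNECTORS[source_type] = func
        return func
    return register


@connector("csv")
def read_csv(payload):
    # Each CSV row becomes a dict keyed by the header line.
    return list(csv.DictReader(io.StringIO(payload)))


@connector("json_lines")
def read_json_lines(payload):
    # One JSON object per non-empty line.
    return [json.loads(line) for line in payload.splitlines() if line.strip()]


def ingest(source_type, payload):
    """Dispatch to the registered connector, failing loudly if none exists."""
    if source_type not in CONNECTORS:
        raise ValueError(f"no connector registered for {source_type!r}")
    return CONNECTORS[source_type](payload)


records = ingest("csv", "user,clicks\nalice,3\nbob,5\n")
records += ingest("json_lines", '{"user": "carol", "clicks": "7"}\n')
print(records)
```

Supporting a new kind of source then means adding one decorated function, while `ingest` and the existing connectors stay unchanged.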

Another requirement, which is gradually gaining importance, is real-time analytics. The ideal solution should also facilitate complex real-time processing and transformation before the data is used for complex analytics. Complex event processing (CEP) and rule engine integration are related requirements and can be used to solve a variety of real-world problems. Hence, the ideal solution should also provide CEP support.
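To make the CEP requirement concrete, here is a minimal sketch of a single rule evaluated over a sliding time window, in the spirit of the viral keyword feed mentioned earlier. It is not a CEP engine. The class name `BurstRule`, its parameters, and the sample events are hypothetical, and the sketch assumes events arrive in timestamp order.

```python
from collections import defaultdict, deque


class BurstRule:
    """Fire when a key is seen at least `threshold` times within `window` seconds."""

    def __init__(self, threshold, window):
        self.threshold = threshold
        self.window = window
        self.seen = defaultdict(deque)  # key -> timestamps inside the window

    def on_event(self, key, timestamp):
        # Assumes events arrive in non-decreasing timestamp order.
        times = self.seen[key]
        times.append(timestamp)
        # Evict timestamps that have fallen out of the sliding window.
        while times and timestamp - times[0] >= self.window:
            times.popleft()
        return len(times) >= self.threshold


rule = BurstRule(threshold=3, window=10)
# Made-up (keyword, second) events.
events = [("#launch", 0), ("#other", 1), ("#launch", 4), ("#launch", 8), ("#launch", 30)]
alerts = [(key, ts) for key, ts in events if rule.on_event(key, ts)]
print(alerts)
```

A production CEP engine would add out-of-order handling, many concurrent rules, and persistence; the sketch shows only the windowed-pattern idea.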

The analytical phase should enable easy data modeling and transformation, helping data scientists to derive the maximum value. Therefore, the solutions need to have user-friendly interfaces for data modeling as well as offer easy-to-use configurable workflow management interfaces.

And of course, interaction with existing visualization tools completes the life cycle. The solution must therefore allow easy integration with visualization tools, which will enable analytical data to be understood easily and also provide deeper insights into sparse or complex data sets.

To create the ideal Big Data Analytics strategy and achieve the best results, users will need to handpick the tools and technologies. They must also create a framework that uses a leading open source solution, Apache Hadoop, for solving Big Data problems.

The Hadoop eco-system

Hadoop has certainly come a long way from its humble origin. It was initially introduced as a simple file system in the Apache Nutch project, a massive web crawler which needed a file system to store large volumes of data across the Internet.

There are several tools and components that are an integral part of the Hadoop ecosystem. These tools and components are aligned with the Big Data lifecycle and serve different purposes in Creation, Ingestion, Analytics, and Visualization.

Sqoop, Flume, or Chukwa allow users to procure the data to be ingested and place it in a Hadoop-based data warehouse. The Ingestion and Analytical phases may use Hive, Pig, or programmatic processing, or workflow systems like Oozie, for data transformation and enrichment.

Apache Mahout can be used for a wide range of machine learning and data mining algorithms including clustering, classification, collaborative filtering and frequent pattern mining. These will also cover the advanced data analytics requirements in the Analytics phase.

Currently, Hadoop is the leader in the Open Source Big Data technology world. However, there are many other products and initiatives, both commercial and Open Source, that are entering this space.

There have been attempts and even some successes in running Hadoop or similar distributed processing technologies faster and also adding real-time processing support.

MapR, DataRush, HStreaming, HPCC, Platform Computing, DataStax, etc. are examples of faster technologies that can serve as alternatives to Apache Hadoop. The major database and data warehouse vendors like Oracle, IBM, Microsoft, HP, and EMC have also jumped onto the Big Data bandwagon and come up with their own customized solutions, which are usually categorized as MPP (massively parallel processing) databases.

NoSQL is another important Big Data technology. While some call NoSQL ‘No to SQL,’ Impetus prefers the term ‘Not only SQL,’ because slowly but surely, the gap between regular RDBMSs and the NoSQL world is narrowing.

There are other options, such as graph databases like Neo4j, which can help users address Big Data issues arising from exploding social media data. There are also faster SQL databases such as VoltDB, which bring together the ACID guarantees of an RDBMS with the power of Big Data.

Hardware or appliance-based solutions also offer alternatives for Big Data problems.

Putting it all together

Now that we have the strategy, tools, and technologies in place, it is all a matter of putting them together. Essentially, this is about using Hadoop as the Big Data Analytics solution. As explained earlier, Hadoop is an excellent Big Data technology that is steadily becoming the de facto leader in the Open Source Big Data domain. There are multiple ways in which the power of Hadoop can be used or combined in the Analytical and Visualization phases of the Big Data lifecycle.

Impetus has been using Hadoop for cleaning and transforming the data into a structured form, and then loading it into RDBMS databases. Here, Hadoop capabilities are harnessed to handle Ingestion and some part of the Analytical phase, while some analytical processing is handled at the RDBMS level, where the data sinks. It is then possible to use any existing visualization technique or tool from the rich world of RDBMS visualization products. The Visualization phase can therefore be handled by existing toolsets.

Hadoop can efficiently move data between RDBMS data sources and Hadoop systems through the DBInputFormat and DBOutputFormat interfaces. Once the unstructured data is processed, it can be pushed to an RDBMS database, which can subsequently act as a data source for any BI solution.

This approach provides the end-user with the flexibility of parallel processing with Hadoop and an SQL interface at the summarized data level. It is good when the summarized data is not big enough to pose a challenge for the RDBMS database being used. This solution is not as expensive as some of the other options.
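The shape of this indirect approach can be sketched in a few lines. This is an illustration under stated assumptions, not the Impetus implementation: plain Python stands in for the parallel Hadoop job, an in-memory SQLite database stands in for the RDBMS, and the log lines, the table name `daily_hits`, and the function `summarize` are made up for the example.

```python
import sqlite3
from collections import Counter

# Made-up raw input standing in for data that Hadoop would clean in parallel.
raw_log = [
    "2012-03-01 GET /home 200",
    "2012-03-01 GET /cart 500",
    "garbled line",
    "2012-03-02 GET /home 200",
    "2012-03-02 GET /home 200",
]


def summarize(lines):
    """Clean the input and count hits per (day, path); malformed lines are dropped."""
    counts = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) == 4:
            day, _method, path, _status = parts
            counts[(day, path)] += 1
    return counts


# Load only the summary into the relational store (in-memory SQLite here).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE daily_hits (day TEXT, path TEXT, hits INTEGER)")
db.executemany(
    "INSERT INTO daily_hits VALUES (?, ?, ?)",
    [(day, path, hits) for (day, path), hits in summarize(raw_log).items()],
)

# A BI tool would now issue ordinary SQL against the summarized table.
rows = db.execute(
    "SELECT path, SUM(hits) FROM daily_hits GROUP BY path ORDER BY path"
).fetchall()
print(rows)
```

The relational store only ever sees the summarized rows, which is why the approach works well while the summary stays small.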

This approach is also suitable for high-touch queries where the user wants to perform real-time, ad hoc analytics, as most RDBMS databases come with a comprehensive set of performance enhancement techniques.

However, when the summarized data is very large, this approach might fail to deliver. Also, if batch analysis is the key requirement, then moving data to an RDBMS database could be a redundant activity.

Consider a scenario where the processed and summarized data, which is itself very large, resides on the Hadoop system. What can be done when the summarized data is needed for batch reporting, without the complications of moving it out of the Hadoop system to either an MPP DW or an RDBMS?

This can be done by using Hive as an interface to the data on the Hadoop system. Hive provides a very promising interface for executing SQL-like queries by converting them into MR (MapReduce) jobs. These MR jobs are executed on the Hadoop cluster against data that is itself stored on Hadoop. This approach allows users to run batch and asynchronous analytics over the same data in the Hadoop system. It is very cost-effective, as it does not involve managing data sources other than the existing Hadoop system. It also gives users the flexibility to scale to any level with their summarized data.
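To give a feel for what converting a query into an MR job means, the sketch below hand-decomposes a query of the form `SELECT path, COUNT(*) FROM hits GROUP BY path` into map, shuffle, and reduce steps. It is a conceptual, single-process illustration and does not reproduce Hive's actual query planner. The table contents and function names are hypothetical.

```python
from collections import defaultdict

# Made-up rows standing in for a table stored on the Hadoop file system.
hits = [
    {"path": "/home", "status": 200},
    {"path": "/cart", "status": 500},
    {"path": "/home", "status": 200},
]


def map_phase(row):
    # Emit (group-by key, 1) for every input row.
    yield row["path"], 1


def shuffle(pairs):
    # Group all values by key, as the framework does between map and reduce.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    # COUNT(*) for one group is the sum of the emitted ones.
    return key, sum(values)


pairs = [pair for row in hits for pair in map_phase(row)]
result = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(result)
```

On a real cluster the map and reduce functions run in parallel across many nodes, which is what lets the same query scale with the data.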

Today, several options are available in the market that allow the integration of Massively Parallel Processing Data Warehouses (MPP DWs) with Hadoop. This is worth considering if you have a large amount of data even after summarization.

Using Hadoop for cleaning and transforming the data into a structured form allows users to load the data into any of the available MPP DWs. While the data is being uploaded, they can write User Defined Functions to perform database-level analytics and then integrate these with Business Intelligence (BI) solutions using ODBC/JDBC connectivity for end-user analytics and reporting.

Also, using MPP Data Warehouses allows users to deploy various performance enhancement techniques like index compression, materialized views, result set caching, and I/O sharing.

Alternatively, some MPP DWs may also provide a good framework that supports MR job execution within their own clusters at the MPP level, giving users a second level of parallel processing. This feature works well for high-touch queries and also provides an excellent framework for end-user ad hoc analytics.

However, the disadvantage of this approach could be the cost involved. Most MPP DWs are expensive to acquire, and some also require high-end servers for deployment, which adds to the expense.

Using this approach also calls for an expert team with hands-on experience in MPP Data Warehouse management and development. This could turn out to be a challenge in itself in today's rapidly changing technology space, where Open Source technologies like Hadoop are being widely accepted and adopted.

Conclusion

In summary, an ideal Big Data strategy leads users to create a platform or solution that covers and manages all aspects of the Big Data lifecycle.

Organizations are using the Hadoop ecosystem or a blend of alternate technologies, including FOSS and commercial technologies such as NoSQL, DataRush, HStreaming, etc., to address Big Data problems today.

There are three strategies for using Hadoop as the Big Data Analytics solution. The first option is indirect analytics over Hadoop, which provides the flexibility of parallel processing with Hadoop and an SQL interface at the summarized data level. This solution is not very expensive when compared with other options.

The second option is direct analytics over Hadoop, which allows you to perform batch and asynchronous analytics over the same data present on the Hadoop system. It is a very cost-effective approach, as it does not involve any expense in managing separate data sources.

The third option is integrating MPP DWs with Hadoop when there is a large amount of data. This is an expensive option when compared with the two approaches discussed above.
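The trade-offs among the three options can be summarized as a small decision sketch. It is one possible reading of the guidance in this paper, reduced to two yes/no questions. The function name `choose_strategy` and its parameters are hypothetical, and a real decision would also weigh cost, skills, and existing investments.

```python
def choose_strategy(needs_ad_hoc_queries, summary_fits_rdbms):
    """Suggest one of the three approaches from two yes/no answers (illustrative only)."""
    if not needs_ad_hoc_queries:
        # Batch or asynchronous reporting: query the data in place through Hive.
        return "direct analytics over Hadoop"
    if summary_fits_rdbms:
        # High-touch queries on a manageable summary: serve SQL from an RDBMS.
        return "indirect analytics over Hadoop"
    # High-touch queries on a summary too large for an RDBMS: accept the higher cost.
    return "MPP DW integrated with Hadoop"


for ad_hoc in (False, True):
    for fits in (False, True):
        print(ad_hoc, fits, "->", choose_strategy(ad_hoc, fits))
```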

Impetus has successfully used the Hadoop ecosystem to create a comprehensive Big Data platform that provides the capabilities required to solve all concerns in the various stages of the Big Data lifecycle.

About Impetus

Impetus Technologies offers Product Engineering and Technology R&D services for software product development. With ongoing investments in research and application of emerging technology areas, innovative business models, and an agile approach, we partner with our client base comprising large scale ISVs and technology innovators to deliver cutting-edge software products. Our expertise spans the domains of Big Data, SaaS, Cloud Computing, Mobility Solutions, Test Engineering, Performance Engineering, and Social Media among others.

Impetus Technologies, Inc.

5300 Stevens Creek Boulevard, Suite 450, San Jose, CA 95129, USA Tel: 408.252.7111 | Email:[email protected]

Regional Development Centers - INDIA: • New Delhi • Bangalore • Indore • Hyderabad