(1)

Big Data Defined

Introducing DataStack 3.0

Executive Summary

The technology of data warehousing and data mining has evolved over the years. Increasingly, businesses are implementing these technologies to get new insights into their business. However, businesses are also dealing with a deluge of data, structured and unstructured, from various sources – customers, partners, employees. The technology community has coined the term Big Data to describe this new era of data and data management. However, there is a lot of confusion around exactly what Big Data is and, more importantly, how enterprises should think about Big Data within their organizations.

Enterprises can think about Big Data as the third data stack in the organization. We refer to this as DataStack 3.0.

DataStack 3.0 gives enterprises a framework to think about how Big Data fits into their data infrastructure architecture.

This paper defines DataStack 3.0 in the context of historical trends in data and enterprises’ existing data infrastructure.

This paper aims to help CIOs, MIS managers and functional heads of businesses take advantage of DataStack 3.0 in their organizations.

www.persistentsys.com

Inside:

Executive Summary... 1
Introduction... 2
Emergence of DataStack 3.0... 3
DataStack 1.0 to 2.0... 4
DataStack 2.0 Refined for Large Data & Analytics... 4
DataStack 3.0: Confluence of New Technologies & DataStack 2.0... 6
Conclusion... 7

(2)

Introduction

The technology of data warehousing and data mining has evolved over the years. Increasingly, businesses are implementing these technologies to get new insights into their business. However, businesses are also dealing with a deluge of data, structured and unstructured, from various sources – customers, partners, employees. Unless this stream of data is tapped in a meaningful way, chances are that the insights companies are looking for will remain elusive.

The technology community has coined the term Big Data to describe this new era of data and data management. However, there is a lot of confusion around exactly what Big Data is and, more importantly, how enterprises should think about Big Data within their organizations. How do we take advantage of Big Data in the context of our current data infrastructure? Do we have to throw out existing data infrastructure to take advantage of Big Data? Does Big Data only apply to unstructured data or does it refer to structured data as well?

Persistent has introduced the term DataStack 3.0 as a way to help enterprises think about how Big Data fits into their organization’s data infrastructure. This paper defines DataStack 3.0 in the context of historical trends in data and an enterprise’s existing data infrastructure. This paper aims to help CIOs, MIS managers and functional heads take advantage of Big Data within their organizations.

In the evolution of technologies that gather and analyze data, three distinct inflexion points are discernible. We have tried to capture these technologies and the data in three different generations of what we term “Data Stacks”. The three data stacks differ in what type of data is collected and how it is processed. We introduce these terms below and elaborate on them later in the paper.

The majority of this paper will focus on DataStack 3.0. We elaborate on DataStack 1.0 and DataStack 2.0 separately on page 4 to give context to DataStack 3.0.

DataStack 1.0 – Application specific data typically used to store and record transactional data of a specific application

DataStack 2.0 – Aggregation of application data, creating dimensional views and analytics on structured data

DataStack 3.0 – New technology stack capable of handling Big Data in terms of volume, data source and type of data (structured, semi-structured and unstructured)

(3)

Emergence of DataStack 3.0

As data warehouses matured there was a mind-set change emerging with internet companies such as Google, Yahoo and Facebook. With tens of millions of users and more than a billion page views every day, these companies started collecting massive amounts of data without knowing what they were looking for. This led to a fundamental shift in thinking and the need for DataStack 3.0, which gives businesses the ability to leverage the data that is generated without knowing precisely how one plans to use it. One of the challenges that companies have faced is developing a scalable way of storing and processing all these bytes. Companies understand that using this historical data is a large part of how they can improve the user experience.

The emergence of Big Data technologies such as Hadoop, NoSQL databases and related projects changed this. Hadoop in particular provides a framework for large scale parallel processing using a distributed file system and the MapReduce programming paradigm, which enabled developers to start interesting projects that were previously impossible due to their massive computational requirements. Some of these early projects have matured into publicly released features (e.g. the Facebook Lexicon) or are being used in the background to improve user experience on Facebook, Yahoo and Google. The list of projects that use Big Data infrastructure has proliferated, from those generating mundane statistics about site usage to others being used to fight spam and determine application quality.
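As an illustration of the map-reduce paradigm mentioned above, the following is a minimal single-process sketch in Python. It is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`) and the sample log lines are assumptions made for this example, and a real framework would run the map and reduce steps in parallel across many machines over a distributed file system.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values seen for one key."""
    return key, sum(values)


def word_count(documents):
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical site-usage log lines, one record per user session.
logs = ["login view view", "view logout", "login view"]
counts = word_count(logs)
```

Because each call to `map_phase` looks at one record only and each call to `reduce_phase` looks at one key only, both steps can be spread over many machines without coordination, which is what makes the paradigm suitable for very large data sets.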

Another trend that emerged in parallel is that a number of non-relational or NoSQL databases, such as Cassandra, offer the ability to process vast amounts of data in a clustered environment.

Traditional SQL databases require users to define a schema up front, and that schema often has very rigid rules. NoSQL databases attempt to provide flexibility and scalability for web-scale data, new applications such as “social apps” and frequent schema changes due to changing data. The key motivators that are driving the adoption of DataStack 3.0 are:

Shared Nothing Architecture – to increase parallelism

Reducing Infrastructure Cost – by using commodity hardware

Linear Scalability – ability to add incremental capacity to storage systems with minimal overhead and no downtime. In some cases, the system should automatically balance load and adjust utilization across the new hardware

High Write Throughput – most of the applications store (and optionally index) tremendous amounts of data and require high aggregate write throughput

High Availability & Disaster Recovery – the need to provide users with a service with very high uptime that covers both planned and unplanned events, such as software upgrades and hardware failures

Unstructured & Structured Data – While DataStack 1.0 and DataStack 2.0 looked mostly at structured tabular data, DataStack 3.0 considers unstructured data as well. In fact, most of the explosion in data has been in unstructured data such as text files, documents, weblogs, email, etc.

80% of new information growth is unstructured content – with 90% of that unmanaged.

– IDC
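The shared nothing and scalability motivators listed above can be sketched in a few lines of Python. This is an illustrative assumption, not a description of any specific product: records are assigned to independent nodes by hashing their key, each node computes a partial result over its own data only, and the partial results are combined at the end. The helper names (`node_for`, `partition`, `parallel_total`) and the sample records are hypothetical.

```python
import hashlib


def node_for(key, node_count):
    """Pick a node for a key with a stable hash (same key, same node)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % node_count


def partition(records, node_count):
    """Distribute (key, value) records across nodes that share no state."""
    nodes = [[] for _ in range(node_count)]
    for key, value in records:
        nodes[node_for(key, node_count)].append((key, value))
    return nodes


def parallel_total(nodes):
    """Each node sums only its own records; the partial sums are then merged."""
    partials = [sum(value for _, value in node) for node in nodes]
    return sum(partials)


# Hypothetical records: (user id, number of page views).
records = [(f"user{i}", i) for i in range(100)]
nodes = partition(records, 4)
total = parallel_total(nodes)
```

Simple modulo placement, as used here, moves most keys when the node count changes. Production systems typically use schemes such as consistent hashing so that adding capacity relocates only a fraction of the data.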

(4)

DataStack 1.0 to 2.0

Businesses have harnessed enterprise software such as Online Transaction Processing (OLTP) systems through the last two decades to make their businesses more efficient. From pure OLTP systems, the software has also evolved to add capabilities such as Data Warehouses to help businesses get key insights into their customers and meaningful reports on their data. OLTP systems have their origin in the eighties, and most enterprises such as banks used them for key transactions. With the emergence of UNIX and Windows server systems in the nineties, vendors such as Oracle, IBM, and Microsoft introduced databases for enterprises. As OLTP systems proliferated, SQL became the standard method of implementing them. SQL also enabled businesses to query transactional data to produce operational reports. For example, the transactional system that implements sales transactions could also produce reports on the sales by geographies or product codes.

Another trend that happened side by side was the change in the nature of data itself. Initially data was application driven and largely structured. We characterize this as DataStack 1.0. With the following explosion of data, a new set of applications, which looked at aggregated data, came into the fold. These applications allowed the enterprises to aggregate data, form dimensional views and conduct analytics. This is what we call DataStack 2.0. Businesses, driven by demand for more sophisticated queries on ever increasing data sets, pushed the envelope of DataStack 1.0. As the workload increased, it was evident that techniques had to evolve to tackle this workload. This led to DataStack 2.0, which addressed some of the scalability and volume of data issues.
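The operational report described in this section, sales by geography or by product code drawn straight from a transactional store, can be sketched with Python's built-in `sqlite3` module. The `sales` table, its columns and the sample rows are assumptions made for this example and do not come from any particular system.

```python
import sqlite3

# In-memory database standing in for a transactional (OLTP) store.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales ("
    "id INTEGER PRIMARY KEY, region TEXT, product_code TEXT, amount REAL)"
)

# Hypothetical sales transactions recorded by the application.
conn.executemany(
    "INSERT INTO sales (region, product_code, amount) VALUES (?, ?, ?)",
    [("East", "P1", 100.0), ("East", "P2", 50.0), ("West", "P1", 75.0)],
)

# Operational reports: the same transactional rows, aggregated with SQL.
sales_by_region = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
sales_by_product = conn.execute(
    "SELECT product_code, SUM(amount) FROM sales "
    "GROUP BY product_code ORDER BY product_code"
).fetchall()
conn.close()
```

Reports like these run against the same normalized tables that record the transactions, which is the pattern this paper characterizes as DataStack 1.0.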

DataStack 2.0 Refined for Large Data & Analytics

In the last few years data warehouses have continued to be refined. Software such as Cognos, Business Objects and Hyperion appeared in the market, which enabled typical businesses to slice and dice their data, form dimensional views and produce dashboards of key metrics for their businesses. For example, a telecommunications company could load their summarized call data records onto a data warehouse through a process known as ETL (Extract, Transform & Load) and use tools such as Cognos to view KPIs such as: 1) calls by region 2) calls by time of day and 3) voice vs. data calls. This data enabled the telecommunication companies to focus their marketing efforts, refine promotions and adjust pricing plans.

Another phenomenon that has been taking place in data warehouse solutions is the emergence of customized hardware/software marketed as appliances, such as Oracle Exadata and IBM Netezza. In these data warehouse appliances, the hardware components and intelligent system software are closely intertwined. The software is designed to fully exploit the hardware capabilities of the appliance and incorporate numerous innovations to offer exponential performance gains, whether for simple inquiries, complex ad-hoc queries or deep analytics.

Systems such as Exadata and Netezza, which have brought the principles of MPP and data processing close to the source, are suited for advanced analytics on large data sets. These data warehouse appliances allow complex non-SQL algorithms to be easily embedded in the processing elements of their MPP streams without the typical intricacies of parallel or grid programming. The ability to run analytics of any complexity “on stream” against huge data volumes eliminates the delays and costs of moving data to separate hardware. It also accelerates performance by orders of magnitude, making these data warehouse appliances the ideal platform for the convergence of data warehousing and advanced analytics. This evolution in data warehouse platforms and applications, which we term DataStack 2.5, was driven mainly by the inadequacy of DataStack 2.0 to handle large volumes of data for analytics.

Some of the new technologies which became the foundation of DataStack 2.5 are:

Massively Parallel Processing (MPP) architectures – enabled parallel execution of queries and eased the load on the processors by shifting some of the load to the storage path

Column oriented Databases – DBMS that stores its content by column rather than row. This has advantages in certain data warehouses, where aggregates are computed over large numbers of similar data items

In memory Databases – speeding up queries through caching of the database in memory
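To make the column oriented idea from this section concrete, the sketch below stores the same call records in a row layout and in a column layout using plain Python structures. The field names (`call_id`, `region`, `duration_sec`) and the values are assumptions made for this example; real column stores add compression and on-disk layouts that are not shown here.

```python
# Row layout: every record keeps all of its fields together.
rows = [
    {"call_id": 1, "region": "North", "duration_sec": 120},
    {"call_id": 2, "region": "South", "duration_sec": 300},
    {"call_id": 3, "region": "North", "duration_sec": 45},
]


def to_columns(records):
    """Pivot row-shaped records into one contiguous list per field."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row store: the aggregate has to visit every whole record.
total_from_rows = sum(record["duration_sec"] for record in rows)

# Column store: the aggregate reads only the one column it needs.
total_from_columns = sum(columns["duration_sec"])
```

Both layouts give the same answer. The difference is how much data has to be scanned: an aggregate over one attribute reads a single list in the column layout, regardless of how many other attributes each record carries.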

(5)

A Comparison of DataStack 1.0, 2.0 and 3.0

| | DataStack 1.0 | DataStack 2.0 | DataStack 3.0 |
| | Relational Database Systems for Operational Store | Enterprise Data Warehouse for Decision Support | Integrated Platform for structured, semi-structured & unstructured data from any source |
| Business Case/Need | Record business events | Support for Decision making | Tap all data sources for insights |
| Data Arrangement | Highly normalized data | Un-normalized dimensional model | Schema less approach |
| Data Horizon | Short | Couple of years | Multi-year |
| Data Quality | Extremely high | Not as essential as DataStack 1.0 | Not a key requirement |
| Size of Data | GBs | TBs | PBs |
| End user Access | Through enterprise apps | Through reporting systems | Currently directly |
| Language of Access | SQL | SQL/MDX | MapReduce |
| Type of Data | Structured | Structured | Structured, semi-structured & unstructured |

(6)

DataStack 3.0: Confluence of New Technologies & DataStack 2.0

As businesses explore Big Data, they need to keep in mind that DataStack 3.0 is a confluence of new technologies for Big Data, such as Cassandra, HBase, Hadoop and MapReduce, with the existing DataStack 2.0. The implementation of DataStack 3.0 has to integrate with the existing DataStack 2.0.

If DataStack 3.0 is not a complementary extension to the existing data infrastructure, benefits will be nebulous. Integration should be mapped out from the planning phases for DataStack 3.0 onward.

When an organization is planning to roll out DataStack 3.0, various stakeholders across the organization should be involved to make it successful. The various functional groups that need to participate are:

IT Administration – Plan the deployment of and configure DataStack 3.0. This needs to happen with participation of resources skilled in DataStack 3.0 deployment and planning

Compliance & Audit departments – Just like data for key transactional applications, companies should consider the compliance needs for the data collected, privacy requirements and the enforcement of those requirements

ETL Administration – Consider all data sources necessary for DataStack 3.0

Data Provider – Identifies the data sources and is responsible for providing access to the technology team

Data Architect – Expert in modeling database schemas, knowledgeable in database implementation best practices and familiar with the company’s particular database schema

Data Scientist – Acts as the bridge between the business and the technical team, helping to define the business insights that need to be extracted

Reporting/Dashboards Developer – Expert in development of reports and dashboards with one of the reporting/dashboarding tools such as IBM Cognos, Pentaho, MSTR, etc.

All of the above personnel have to work closely to plan and execute how DataStack 3.0 will be leveraged to get the desired end results. They need to review the technology available, capabilities of resources (and service providers, if applicable) and bring together the right team.

The figure below describes how the three data stacks can co-exist in an enterprise. DataStack 3.0 stores historical information for analysis, just like DataStack 2.0. The difference is that DataStack 3.0 will handle the data sources which cannot be handled by DataStack 2.0. Early in its maturity, DataStack 3.0 is primarily used for discovering use cases, and it is not uncommon for the results from the discovery exercise to be pushed back to DataStack 2.0 so as to enhance the analysis. Essentially, all three data stacks will co-exist in the enterprise and the chance of Big Data replacing any one of them will be low for the foreseeable future.

Figure 1: DataStack 3.0

(7)

Conclusion

Data warehouse solutions are moving to what can be termed “DataStack 3.0”. This has been pioneered by internet companies such as Facebook, Google and Yahoo. The technology, incubated in those companies, is being rapidly adopted by mainstream enterprises. When it is adopted as part of a business, attention must be paid to how to integrate with DataStack 2.0 to get the desired return on investment. The traditional Enterprise Data Warehouses (EDWs) are evolving with the need to deal with petabyte scale data, which includes unstructured data. However, as enterprises rethink their EDW solutions, they will have to keep in mind that the mere introduction of core Hadoop technologies such as MapReduce, HDFS and Hive will not necessarily give them greater insight into their business. All of the investments in traditional DWs, data marts, data hubs, operational data stores, etc. are reasonably safe from obsolescence. The reality is that the EDW is evolving into a platform in which all of these database architectures can and will co-exist.

We see these requirements coming directly from CTOs and other senior decision-makers in large organizations who are driving convergence of investments across all of these formerly separate technology domains. Vendors are racing to address this convergence in their product portfolios.

We believe that DataStack 3.0 will be a confluence of various technologies, including traditional EDWs, and that they will co-exist. The enterprises that manage this confluence well will benefit most from the right blend of established and emerging technologies.

About Persistent Systems

Established in 1990, Persistent Systems (BSE & NSE: PERSISTENT) is a global company specializing in software product development services. For more than two decades, Persistent has been an innovation partner for the world’s largest technology brands, leading enterprises and pioneering start-ups. With a global team of 6,600+ employees, Persistent has 350+ customers spread across North America, Europe, and Asia. Today, Persistent focuses on developing best-in-class solutions in four key next-generation technology areas: Cloud Computing, Mobility, BI & Analytics, Collaboration across technology, telecommunications, life sciences, consumer packaged goods, banking & financial services and healthcare verticals. For more information, please visit: .

India

Persistent Systems Limited
Bhageerath, 402, Senapati Bapat Road
Pune 411016.
Tel: +91 (20) 2570 2000 Fax: +91 (20) 2567 8901

USA

Persistent Systems, Inc.
2055 Laurelwood Road, Suite 210
Santa Clara, CA 95054
Tel: +1 (408) 216 7010 Fax: +1 (408) 451 9177 Email: [email protected]

DISCLAIMER: “The trademarks or trade names mentioned in this paper are property of their respective owners and are included for reference only and do