Big Data Defined
Introducing DataStack 3.0
Executive Summary
The technology of data warehousing and data mining has evolved over the years. Increasingly, businesses are implementing these technologies to get new insights into their business. However, businesses are also dealing with a deluge of data, structured and unstructured, from various sources – customers, partners, employees. The technology community has coined the term Big Data to describe this new era of data and data management. However, there is a lot of confusion around exactly what Big Data is and, more importantly, how enterprises should think about Big Data within their organizations.
Enterprises can think about Big Data as the third data stack in the organization. We refer to this as DataStack 3.0.
DataStack 3.0 gives enterprises a framework to think about how Big Data fits into their data infrastructure architecture.
This paper defines DataStack 3.0 in the context of historical trends in data and enterprises’ existing data infrastructure.
This paper aims to help CIOs, MIS managers and functional heads of businesses take advantage of DataStack 3.0 in their organizations.
www.persistentsys.com
Big Data
Inside:
Executive Summary
Introduction
Emergence of DataStack 3.0
DataStack 1.0 to 2.0
DataStack 2.0 Refined for Large Data & Analytics
DataStack 3.0: Confluence of New Technologies & DataStack 2.0
Conclusion
© 2012 Persistent Systems Ltd. All rights reserved.
Introduction
The technology of data warehousing and data mining has evolved over the years. Increasingly, businesses are implementing these technologies to get new insights into their business. However, businesses are also dealing with a deluge of data, structured and unstructured, from various sources – customers, partners, employees. Unless this stream of data is tapped in a meaningful way, chances are that the insights companies are looking for will remain elusive.
The technology community has coined the term Big Data to describe this new era of data and data management. However, there is a lot of confusion around exactly what Big Data is and, more importantly, how enterprises should think about Big Data within their organizations. How do we take advantage of Big Data in the context of our current data infrastructure? Do we have to throw out existing data infrastructure to take advantage of Big Data? Does Big Data only apply to unstructured data or does it refer to structured data as well?
Persistent has introduced the term DataStack 3.0 as a way to help enterprises think about how Big Data fits into their organization’s data infrastructure. This paper defines DataStack 3.0 in the context of historical trends in data and an enterprise’s existing data infrastructure. This paper aims to help CIOs, MIS managers and functional heads take advantage of Big Data within their organizations.
As the technologies that gather and analyze data have evolved, three distinct inflection points are discernible. We capture these technologies and the data they handle as three generations of what we term “Data Stacks”. The three data stacks differ in what type of data is collected and how it is processed. We introduce the terms below and elaborate on them later in the paper.
The majority of this paper focuses on DataStack 3.0. DataStack 1.0 and DataStack 2.0 are covered separately in the section “DataStack 1.0 to 2.0” to give context to DataStack 3.0.
DataStack 1.0 – Application-specific data, typically used to store and record the transactional data of a specific application
DataStack 2.0 – Aggregation of application data, creating dimensional views and analytics on structured data
DataStack 3.0 – New technology stack capable of handling Big Data in terms of volume, data source and type of data (structured, semi-structured and unstructured)
Emergence of DataStack 3.0
As data warehouses matured, a change in mind-set was emerging at internet companies such as Google, Yahoo and Facebook. With tens of millions of users and more than a billion page views every day, these companies started collecting massive amounts of data without knowing what they were looking for. This led to a fundamental shift in thinking and to the need for DataStack 3.0, which gives businesses the ability to leverage the data that is generated without knowing precisely how they plan to use it. One of the challenges these companies have faced is developing a scalable way of storing and processing all this data. They understand that using this historical data is a large part of how they can improve the user experience.
The emergence of Big Data technologies such as NoSQL databases, Hadoop and related projects, which provide a framework for large-scale parallel processing using a distributed file system and the MapReduce programming paradigm, enabled developers to start interesting projects that were previously impossible due to their massive computational requirements. Some of these early projects have matured into publicly released features (e.g. the Facebook Lexicon) or are being used in the background to improve the user experience on Facebook, Yahoo and Google. The list of projects that use Big Data infrastructure has proliferated, from those generating mundane statistics about site usage to others being used to fight spam and determine application quality.
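The MapReduce paradigm mentioned above can be illustrated with a minimal single-process sketch. It is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input are assumptions made for illustration, and a real framework would run the map and reduce steps in parallel across many machines reading from a distributed file system.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group emitted values by key, as a framework would between the phases."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Combine all values collected for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence on one machine."""
    emitted = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(emitted)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data big insights", "data about data"]
    print(word_count(sample))
```

Because each call to `map_phase` looks at one line only and each call to `reduce_phase` looks at one key only, both steps can in principle be spread over many workers, which is what makes the paradigm suitable for very large inputs.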
Another trend that emerged in parallel is that a number of non-relational, or NoSQL, databases such as Cassandra offer the ability to process vast amounts of data in a clustered environment.
Traditional SQL servers require users to create a schema and define its structure, which often has very rigid rules. NoSQL databases attempt to provide flexibility and scalability for web-scale data, for new applications such as “social apps”, and for frequent schema changes driven by changing data. The key motivators that are driving the adoption of DataStack 3.0 are:
Shared Nothing Architecture – to increase parallelism
Reducing Infrastructure Cost – by using commodity hardware
Linear Scalability – the ability to add incremental capacity to storage systems with minimal overhead and no downtime. In some cases, the system should automatically balance load and adjust utilization across the new hardware
High Write Throughput – most of the applications store (and optionally index) tremendous amounts of data and require high aggregate write throughput
High Availability & Disaster Recovery – the need to provide users with a service that has very high uptime, covering both planned events such as software upgrades and unplanned events such as hardware failure
Unstructured & Structured Data – while DataStack 1.0 and DataStack 2.0 looked mostly at structured tabular data, DataStack 3.0 considers unstructured data as well. In fact, most of the explosion in data has been in unstructured data such as text files, documents, weblogs, email, etc.
80% of new information growth is unstructured content – with 90% of that unmanaged.
– IDC
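The shared-nothing and linear-scalability motivators listed above can be sketched with a consistent-hash ring, one common way of spreading keys across independent nodes. This is an illustrative assumption, not a description of how any particular product distributes data: the `HashRing` class, the node names and the choice of 100 virtual nodes per server are all hypothetical.

```python
import bisect
import hashlib
from typing import Dict, List


def _hash(value: str) -> int:
    """Stable hash (Python's built-in hash() is randomized per process)."""
    return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Consistent-hash ring with virtual nodes (hypothetical illustration)."""

    def __init__(self, nodes: List[str], vnodes: int = 100) -> None:
        self._vnodes = vnodes
        self._ring: List[int] = []        # sorted hash positions on the ring
        self._owner: Dict[int, str] = {}  # position -> owning node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        """Place `vnodes` positions for the new node on the ring."""
        for i in range(self._vnodes):
            position = _hash(f"{node}#{i}")
            bisect.insort(self._ring, position)
            self._owner[position] = node

    def node_for(self, key: str) -> str:
        """Return the node owning the first position clockwise from the key."""
        index = bisect.bisect(self._ring, _hash(key)) % len(self._ring)
        return self._owner[self._ring[index]]


if __name__ == "__main__":
    keys = [f"user-{n}" for n in range(10_000)]
    ring = HashRing(["node-a", "node-b", "node-c"])
    before = {key: ring.node_for(key) for key in keys}

    ring.add_node("node-d")  # add incremental capacity
    after = {key: ring.node_for(key) for key in keys}

    moved = sum(1 for key in keys if before[key] != after[key])
    print(f"{moved / len(keys):.0%} of keys moved, all of them to node-d")
```

In this sketch, adding a fourth node relocates only the keys that the new node takes over; keys on the other nodes stay where they were, which is the property that lets capacity be added with little rebalancing overhead.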
DataStack 1.0 to 2.0
Businesses have harnessed enterprise software such as Online Transaction Processing (OLTP) systems through the last two decades to make their businesses more efficient. From pure OLTP systems, the software has also evolved to add capabilities such as Data Warehouses to help businesses get key insights into their customers and meaningful reports on their data. OLTP systems have their origin in the eighties and most enterprises such as banks used them for key transactions. With the emergence of UNIX and Windows server systems in the nineties, vendors such as Oracle, IBM, and Microsoft introduced databases for enterprises. As OLTP systems proliferated, SQL became the standard method of implementation of OLTP systems. SQL also enabled businesses to query transactional data to produce operational reports. For example, the transactional system to implement sales transactions could also produce reports on the sales by geographies or product codes.
Another trend that happened side by side was the change in the nature of data itself. Initially data was application-driven and largely structured. We characterize this as DataStack 1.0. With the following explosion of data, a new set of applications, which looked at aggregated data, came into the fold. These applications allowed the enterprises to aggregate data, form dimensional views and conduct analytics. This is what we call DataStack 2.0.
Businesses driven by demand for more sophisticated queries on ever-increasing data sets pushed the envelope of DataStack 1.0. As the workload increased, it was evident that techniques had to evolve to tackle this workload. This led to DataStack 2.0, which addressed some of the scalability and volume of data issues.
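The kind of operational report described in this section, sales by geography or by product code, comes down to a SQL aggregation over the transactional table. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the `sales` table, its columns and the sample rows are hypothetical and stand in for whatever schema a real OLTP system would have.

```python
import sqlite3

# Hypothetical transactional table; names and rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE sales (
        id INTEGER PRIMARY KEY,
        region TEXT NOT NULL,
        product_code TEXT NOT NULL,
        amount REAL NOT NULL
    )
    """
)
conn.executemany(
    "INSERT INTO sales (region, product_code, amount) VALUES (?, ?, ?)",
    [
        ("EMEA", "P-100", 120.0),
        ("EMEA", "P-200", 80.0),
        ("APAC", "P-100", 200.0),
        ("AMER", "P-200", 50.0),
    ],
)


def sales_by(column: str) -> list:
    """Total sales grouped by one of the allowed reporting columns."""
    # Column names cannot be bound as SQL parameters, so check a whitelist.
    if column not in {"region", "product_code"}:
        raise ValueError(f"unsupported grouping column: {column}")
    query = (
        f"SELECT {column}, SUM(amount) FROM sales "
        f"GROUP BY {column} ORDER BY {column}"
    )
    return conn.execute(query).fetchall()


print(sales_by("region"))
print(sales_by("product_code"))
```

The same table serves both the transaction (the `INSERT`) and the report (the `GROUP BY`), which is the DataStack 1.0 pattern: reporting is a by-product of the application's own store.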
DataStack 2.0 Refined for Large Data & Analytics
In the last few years data warehouses have continued to be refined. Software products such as Cognos, Business Objects and Hyperion appeared in the market, enabling businesses to slice and dice their data, form dimensional views and produce dashboards of key metrics for their businesses. For example, a telecommunications company could load its summarized call data records into a data warehouse through a process known as ETL (Extract, Transform & Load) and use tools such as Cognos to view KPIs such as: 1) calls by region, 2) calls by time of day and 3) voice vs. data calls. This data enabled telecommunication companies to focus their marketing efforts, refine promotions and adjust pricing plans.
Another phenomenon that has been taking place in data warehouse solutions is the emergence of customized hardware/software marketed as appliances, such as Oracle Exadata and IBM Netezza. In these data warehouse appliances, the hardware components and intelligent system software are closely intertwined. The software is designed to fully exploit the hardware capabilities of the appliance and incorporates numerous innovations to offer exponential performance gains, whether for simple inquiries, complex ad-hoc queries or deep analytics.
Systems such as Exadata and Netezza, which have brought the principles of MPP and data processing close to the source, are suited for advanced analytics on large data sets. These data warehouse appliances allow complex non-SQL algorithms to be easily embedded in the processing elements of their MPP streams without the typical intricacies of parallel or grid programming. The ability to run analytics of any complexity “on stream” against huge data volumes eliminates the delays and costs of moving data to separate hardware. It also accelerates performance by orders of magnitude, making these data warehouse appliances the ideal platform for the convergence of data warehousing and advanced analytics. This evolution in data warehouse platforms and applications, which we term DataStack 2.5, was driven mainly by the inadequacy of DataStack 2.0 to handle large volumes of data for analytics.
Some of the new technologies which became the foundation of DataStack 2.5 are:
Massively Parallel Processing (MPP) architectures – enabled parallel execution of queries and eased the load on the processors by shifting some of the load to the storage path
Column-oriented Databases – a DBMS that stores its content by column rather than by row. This has advantages in certain data warehouses, where aggregates are computed over large numbers of similar data items
In-memory Databases – speeding up queries through caching of the database in memory
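The advantage of column-oriented storage for aggregates, noted in this section, can be seen in a small sketch. It only contrasts the two layouts in plain Python lists and dictionaries; the call data records and the helper `to_columns` are hypothetical, and a real column store adds compression, indexing and on-disk layout that are not shown here.

```python
from typing import Any, Dict, List

# Hypothetical call data records, stored row by row.
rows: List[Dict[str, Any]] = [
    {"region": "north", "hour": 9, "kind": "voice", "minutes": 12.0},
    {"region": "south", "hour": 9, "kind": "data", "minutes": 3.5},
    {"region": "north", "hour": 18, "kind": "voice", "minutes": 7.0},
]


def to_columns(records: List[Dict[str, Any]]) -> Dict[str, list]:
    """Pivot row-oriented records into one list per column."""
    columns: Dict[str, list] = {}
    for name in records[0]:
        columns[name] = [record[name] for record in records]
    return columns


columns = to_columns(rows)

# Row layout: every whole record is visited just to read one field.
total_from_rows = sum(record["minutes"] for record in rows)

# Column layout: only the "minutes" column is scanned.
total_from_columns = sum(columns["minutes"])

assert total_from_rows == total_from_columns
print(total_from_columns)
```

Both layouts give the same total; the difference is how much data has to be touched to compute it, which matters when the table has many columns and billions of rows.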
A Comparison of DataStack 1.0, 2.0 and 3.0

| | DataStack 1.0 | DataStack 2.0 | DataStack 3.0 |
|---|---|---|---|
| | Relational Database Systems for Operational Store | Enterprise Data Warehouse for Decision Support | Integrated Platform for structured, semi-structured & unstructured data from any source |
| Business Case/Need | Record business events | Support for decision making | Tap all data sources for insights |
| Data Arrangement | Highly normalized data | Un-normalized dimensional model | Schema-less approach |
| Data Horizon | Short | Couple of years | Multi-year |
| Data Quality | Extremely high | Not as essential as DataStack 1.0 | Not a key requirement |
| Size of Data | GBs | TBs | PBs |
| End User Access | Through enterprise apps | Through reporting systems | Currently directly |
| Language of Access | SQL | SQL/MDX | MapReduce |
| Type of Data | Structured | Structured | Structured, semi-structured & unstructured |
DataStack 3.0: Confluence of New Technologies & DataStack 2.0
As businesses explore Big Data, they need to keep in mind that DataStack 3.0 is a confluence of new Big Data technologies, such as Cassandra, HBase, Hadoop and MapReduce, with the existing DataStack 2.0. An implementation of DataStack 3.0 has to integrate with the existing DataStack 2.0.
If DataStack 3.0 is not a complementary extension to the existing data infrastructure, its benefits will be nebulous. Integration should be mapped out from the planning phases of DataStack 3.0 onward.
When an organization is planning to roll out DataStack 3.0, various stakeholders across the organization should be involved to make it successful. The various functional groups that need to participate are:
IT Administration – Plans the deployment of and configures DataStack 3.0. This needs to happen with the participation of resources skilled in DataStack 3.0 deployment and planning
Compliance & Audit departments – Just as with data for key transactional applications, companies should consider the compliance needs for the data collected, the privacy requirements and the enforcement of those requirements
ETL Administration – Considers all data sources necessary for DataStack 3.0
Data Provider – Identifies the data sources and is responsible for providing access to the technology team
Data Architect – Expert in modeling database schemas, knowledgeable in database implementation best practices and familiar with the company’s particular database schema
Data Scientist – Acts as the bridge between the business and the technical team, helping to define the business insights that need to be extracted
Reporting/Dashboards Developer – Expert in the development of reports and dashboards with one of the reporting/dashboarding tools such as IBM Cognos, Pentaho, MSTR, etc.
All of the above personnel have to work closely to plan and execute how DataStack 3.0 will be leveraged to get the desired end results. They need to review the technology available and the capabilities of resources (and service providers, if applicable), and bring together the right team.
Figure 1 describes how the three data stacks can co-exist in an enterprise. DataStack 3.0 stores historical information for analysis, just like DataStack 2.0. The difference is that DataStack 3.0 handles the data sources which cannot be handled by DataStack 2.0. Early in its maturity, DataStack 3.0 is primarily used for discovering use cases, and it is not uncommon for the results from the discovery exercise to be pushed back to DataStack 2.0 so as to enhance the analysis. Essentially, all three data stacks will co-exist in the enterprise and the chance of Big Data replacing any one of them will be low for the foreseeable future.
Figure 1: DataStack 3.0
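The pattern of pushing discovery results from DataStack 3.0 back into DataStack 2.0 might look, in miniature, like the sketch below. It is an assumption-laden illustration rather than a reference design: the JSON events, the `region_activity` table and the use of an in-memory SQLite database as a stand-in for the enterprise data warehouse are all hypothetical.

```python
import json
import sqlite3
from collections import Counter

# Hypothetical semi-structured events; fields vary from record to record.
raw_events = [
    '{"user": "u1", "region": "north", "action": "click"}',
    '{"user": "u2", "region": "south", "action": "view", "referrer": "mail"}',
    '{"user": "u3", "action": "click"}',
    '{"user": "u1", "region": "north", "action": "view"}',
]

# Discovery step (DataStack 3.0 side): aggregate without a fixed schema.
events_per_region = Counter(
    json.loads(line).get("region", "unknown") for line in raw_events
)

# Push-back step (DataStack 2.0 side): load the summary into a structured table.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE region_activity (region TEXT PRIMARY KEY, events INTEGER)"
)
warehouse.executemany(
    "INSERT INTO region_activity (region, events) VALUES (?, ?)",
    events_per_region.items(),
)

summary = warehouse.execute(
    "SELECT region, events FROM region_activity ORDER BY region"
).fetchall()
print(summary)
```

The raw events never need to conform to a schema; only the small, agreed-upon summary does, so the existing reporting tools on DataStack 2.0 can use it without change.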
Conclusion
Data warehouse solutions are moving to what can be termed “DataStack 3.0”. This move has been pioneered by internet companies such as Facebook, Google and Yahoo. The technology, incubated in those companies, is being rapidly adopted by mainstream enterprises. When it is adopted as part of a business, attention must be paid to how it integrates with DataStack 2.0 to get the desired return on investment. Traditional Enterprise Data Warehouses (EDWs) are evolving with the need to deal with petabyte-scale data, which includes unstructured data. However, as enterprises rethink their EDW solutions, they will have to keep in mind that the mere introduction of core Hadoop technologies such as MapReduce, HDFS and Hive will not necessarily give them greater insight into their business. All of the investments in traditional DWs, data marts, data hubs, operational data stores, etc. are reasonably safe from obsolescence. The reality is that the EDW is evolving into a platform in which all of these database architectures can and will co-exist.
We see these requirements coming directly from CTOs and other senior decision-makers in large organizations who are driving convergence of investments across all of these formerly separate technology domains. Vendors are racing to address this convergence in their product portfolios.
We believe that DataStack 3.0 will be a confluence of various technologies, including traditional EDWs, and that they will co-exist. The enterprises that manage this confluence well will benefit most from the right blend of established and emerging technologies.
About Persistent Systems
www.persistentsys.com
Established in 1990, Persistent Systems (BSE & NSE: PERSISTENT) is a global company specializing in software product development services. For more than two decades, Persistent has been an innovation partner for the world’s largest technology brands, leading enterprises and pioneering start-ups. With a global team of 6,600+ employees, Persistent has 350+ customers spread across North America, Europe, and Asia. Today, Persistent focuses on developing best-in-class solutions in four key next-generation technology areas: Cloud Computing, Mobility, BI & Analytics, Collaboration across technology, telecommunications, life sciences, consumer packaged goods, banking & financial services and healthcare verticals. For more information, please visit: .
India
Persistent Systems Limited
Bhageerath, 402, Senapati Bapat Road
Pune 411016.
Tel: +91 (20) 2570 2000 Fax: +91 (20) 2567 8901
USA
Persistent Systems, Inc.
2055 Laurelwood Road, Suite 210
Santa Clara, CA 95054
Tel: +1 (408) 216 7010 Fax: +1 (408) 451 9177 Email: [email protected]
DISCLAIMER: “The trademarks or trade names mentioned in this paper are property of their respective owners and are included for reference only and do