EXECUTIVE SUMMARY
We live in a time of uncertainty for the traditional Enterprise Data Warehouse (EDW). The long-standing requirement to operationalise Business Intelligence (BI) has been accelerated by the needs of real-time operational decisioning. At the same time the EDW must cope with an explosion of volume, high user expectations and the demands of data discovery.
Enter big data technology, which promises scalability, flexibility and lower cost to serve. Enterprises have begun by experimenting with big data platforms supporting an increasing number of point solutions. This is the origin of our big data refinery architecture, which allows for experimentation without disrupting the existing EDW. Many customers have successfully used the big data refinery approach for their specific challenges. However, as big data platforms develop enterprise-grade features, it becomes feasible for big data to play a bigger role in the EDW. This is the origin of the “Data Lake” or “Data Reservoir”.
So what should a Data Lake look like? What is the best blend of big data and traditional database tools for an organisation? How do you avoid being left behind by increasingly agile competition?
This paper considers four architecture models which put big data technology and the data asset itself increasingly at the centre of the enterprise. We look at the challenges each model can help to solve and the potential pitfalls business leaders will need to consider as they determine how best to embrace big data in the digital enterprise.
BAE Systems Applied Intelligence
The Enterprise Data Warehouse (EDW) has become a long-serving, business critical capability for many organisations. The successful EDW should provide consistent, trusted data for accurate decision-making.
However, most organisations are expressing frustrations with their data warehousing solutions due to:
• Inflexibility of Evolved Platforms:
An inability to deal with a changing business quickly and get data to where it is needed most. The shorter time-to-market for new products and services demands flexible solutions and rapid software development methods. Similarly data discovery, which forms a key engine of innovation in the digital age, demands more responsive, higher capacity platforms for raw analysis.
• Heightened User Expectations:
Our personal use of smartphones and tablets has created heightened user expectations about the immediacy and intuitiveness of technology. Technology-driven EDW projects continue to fail: a string of expensive projects adopted a ‘build it and they will come’ approach and failed to properly engage end users.
• The Volume, Variety & Velocity of the Data:
Moving into a world of interaction data e.g. machine logs, clickstreams, sensor and appliance data (the “Internet of Things”), as well as semi-structured and unstructured data, has opened up opportunities for greater customer and operational insights but poses data processing and storage challenges for traditional database platforms. At the same time, the increasing desire for organisations to be able to respond to events quickly has fuelled requirements for personalised real-time decision support systems which still need to draw on a centralised single version of the truth.
• Total Cost of Ownership:
IT projects have come under increasing scrutiny to deliver lower cost to serve where software licensing, data storage, high performance infrastructure and system maintenance are large contributors to TCO. Organisations are challenging their suppliers to step up or move aside where cloud and open source tools continue to mature their enterprise-readiness.
THE PROMISE OF BIG DATA
The emergence of big data technologies has offered some potential solutions to these challenges:
• Scale-out storage in petabytes and beyond means that data which may contain critical business advantage can be retained cost-effectively
• Massively Parallel Processing (MPP) on grid-computing enables data integration and processing of many sources with very high throughput rates, at a fraction of the cost of traditional MPP platforms
• Schema-on-Read offers the ability to define the data structure at query time, as opposed to load time. This means new data sources can be loaded in their native format and quickly made available for self-service discovery
• Complex data structures can be stored and processed efficiently, alleviating limitations of relational data
• Open Source software on commodity infrastructure helps to relieve license and support costs.
Moreover, two of the major concerns around big data technology, read consistency and transaction support, do not typically apply to the EDW.
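The Schema-on-Read point in the list above can be illustrated with a minimal Python sketch. It is illustrative only: the record layout, the field names and the `read_with_schema` helper are assumptions made for this example and do not represent any particular big data platform.

```python
import json

# Raw events landed in their native format (JSON lines here); no structure is
# enforced at load time. The field names are hypothetical.
RAW_LINES = [
    '{"customer": "C1", "amount": "12.50", "channel": "web"}',
    '{"customer": "C2", "amount": "7.25"}',
    '{"customer": "C1", "amount": "3.00", "channel": "store", "note": "promo"}',
]


def read_with_schema(lines, schema):
    """Apply a schema at query time: pick the named fields and cast them.

    `schema` maps field name -> (cast function, default). Fields absent from
    a raw record take the default; fields not in the schema are ignored.
    """
    for line in lines:
        record = json.loads(line)
        yield {
            name: cast(record[name]) if name in record else default
            for name, (cast, default) in schema.items()
        }


# Two consumers read the same raw data with different schemas.
sales_schema = {"customer": (str, None), "amount": (float, 0.0)}
channel_schema = {"channel": (str, "unknown")}

total_by_customer = {}
for row in read_with_schema(RAW_LINES, sales_schema):
    total_by_customer[row["customer"]] = (
        total_by_customer.get(row["customer"], 0.0) + row["amount"]
    )

channels = [row["channel"] for row in read_with_schema(RAW_LINES, channel_schema)]

print(total_by_customer)  # {'C1': 15.5, 'C2': 7.25}
print(channels)           # ['web', 'unknown', 'store']
```

The raw lines are never rewritten: each consumer decides which fields matter and how to type them when it reads, which is what allows a new source to be made available before a full data model has been agreed.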
THE BIG DATA REFINERY
Our Big Data Refinery (BDR) model is a starting point that illustrates how a big data platform can be employed to unlock hidden value in a wide variety of data sources, and sit alongside the traditional EDW. Big data sources can be presented to data scientists and summarised as a source to the data warehouse. The primary intent of the BDR is to enhance an established well-functioning warehouse, rather than offering a complete answer to the challenges posed above.
[Figure: The Big Data Refinery. Layers: Data Sources; Storage & Processing; Presentation & Exploitation. Big data sources pass through filter / pre-processing into the data refinery (batch processing platform, fast search and query platform, real time analytics platform), used by big data specialists. Traditional data sources (legacy systems, online transaction processing) feed the Enterprise Data Warehouse and data warehouse marts, used by traditional BI analysts. Access and exploitation tools: BI tools, bulk analytics, fixed and dynamic reporting, search.]
[Figure: Data Lake architecture. Big data sources and conventional data sources feed a Data Lake (archive, ETL, staging), with archive and load flows between the Data Lake and the conventional relational data stores (traditional EDW, data marts, OLAP cube). Consumers: dashboard, data discovery / data science, search, reporting, decision automation.]
1 Gartner Magic Quadrant Data Warehouse DBMS Survey, Nov 2012 and Nov 2013, and Gartner Presentation, What About the Data Warehouse? Start? Stop? Continue? - Mark Beyer, October 2014
Unsurprisingly, vendors of traditional tools are pushing this approach, with ever improving integration with big data technologies. This will be the right choice for many customers, for example those looking to make a small investment to test the returns. This is far and away the most popular approach at present: more than 70% of organisations Gartner surveyed in 2013 were using big data for ‘marts’¹.
Big data solutions have evolved rapidly over recent years with increasingly mature enterprise grade features, strengthening their ability to support BI and analytics solutions.
RISE OF THE DATA LAKE
The Data Lake goes at least one stage beyond the BDR and becomes the initial “landing point” for enterprise data sources and externally gathered data. Like the BDR, it is underpinned by big data technologies - typically starting with Hadoop. This naturally raises questions of the role the Data Lake can play in the enterprise such as: ‘Can I host my whole Data Warehouse on Hadoop?’ and ‘in what parts of the EDW will big data technology be most effective?’ We therefore suggest four models through which the Data Lake increasingly encompasses the EDW. There is a logical progression but that does not imply the same model is ideal for every organisation and the choice depends on several factors.
1. The “Active Archive” Data Lake undertakes some responsibility for Extract, Transform and Load (ETL) and provides online access to historic data - both raw source information and data archived from the conventional relational stores. Through retention of source data in its native format, business questions can be asked in ways which were not envisaged when the data was written. Replicating the data with low latency to the Data Lake is a cheap way to alleviate the query load of Data Discovery from source systems. At the same time this brings the advantages of a distributed platform for the purpose of advanced analytics. For many years, we have implemented Historical Data Stores (HDS) in traditional EDWs. However, even then there is an analysis and development cost and lag to acquiring new data sources. Furthermore, the HDS pattern is generally unsuited to unstructured data.
2. The “Dual Warehouse” Data Lake continues to act as an archive, but in addition presents a replica and extension of existing reporting structures to broaden the use cases it can fulfil. Since the Dual Warehouse replicates data, it can represent a transition step to one of the later models.
3. The “Hybrid” Data Lake is an evolution of the Dual Warehouse; in this case the Operational Data Store, common to many EDW patterns, resides in the Data Lake and a traditional relational database and OLAP tools are used for data marts.
[Figure: Data Lake architecture. Big data sources and conventional data sources feed a Data Lake (archive, Hadoop DWH), with replicate and load flows between the Data Lake and the conventional relational data stores (traditional EDW, data marts, OLAP cube). Consumers: dashboard, data discovery / data science, search, reporting, decision automation.]
[Figure: Data Lake architecture. Big data sources and conventional data sources feed a Data Lake (Hadoop DWH), which is summarised into BI stores (data marts, OLAP cube). Consumers: dashboard, data discovery / data science, search, reporting, decision automation.]
4. The “Enterprise” Data Lake, as the endpoint of the evolution, serves all the BI and analytics needs of the organisation. This model is declining in popularity. Gartner’s Data Warehouse inquiry data shows the “replacement” idea is disappearing: 17% of organisations were considering replacing the EDW with a Big Data solution in 2010, but this had dropped to 3% by 2013².
BRIDGING THE GAP WITH DATA VIRTUALISATION
The role of Data Virtualisation (DV) is a vital consideration when deciding on an architecture.
DV is a form of data integration that allows multiple data sources to be treated as one logical source, but this does not constitute a Data Lake. It offers a means to leverage capabilities of different underlying technologies by presenting an abstracted data access layer that reduces time-to-insight for BI and analytics solutions by accessing data directly at source. DV recognises the reality of a heterogeneous data landscape and allows for optimum tooling to be used in each case. It can be used in any of the models presented so far to hide the implementation of the Data Lake (and its evolution) from Data Discovery consumers and potentially from reporting. It also allows for access to sources yet to be migrated into the Data Lake or where low latency is an important requirement.
DV solutions are not a panacea, however, and performance can vary considerably depending on the query workload. Moreover, certain types of advanced analytics can only be run effectively by bringing the data into the Data Lake.
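As a rough illustration of the abstracted data access layer described above, the following Python sketch exposes a relational store and a set of raw Data Lake records under logical names. Everything in it is an assumption made for the example: the `VirtualLayer` class, the table and field names, and the use of in-memory SQLite and a list of JSON strings as stand-ins for the real platforms. It is not a representation of any DV product.

```python
import json
import sqlite3

# Source 1: a conventional relational store (in-memory SQLite stands in for it).
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE customers (customer_id TEXT PRIMARY KEY, segment TEXT)")
warehouse.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [("C1", "retail"), ("C2", "business")],
)

# Source 2: raw clickstream records held in the Data Lake (a list of JSON
# strings stands in for files on a distributed file system).
lake_clicks = [
    '{"customer_id": "C1", "page": "/home"}',
    '{"customer_id": "C1", "page": "/offers"}',
    '{"customer_id": "C2", "page": "/home"}',
]


class VirtualLayer:
    """Hypothetical abstraction layer: one logical name per underlying source."""

    def __init__(self):
        self._sources = {}

    def register(self, name, fetch):
        # `fetch` is a zero-argument callable returning an iterable of dicts.
        self._sources[name] = fetch

    def rows(self, name):
        # Data is read from the source at query time; nothing is copied ahead.
        return list(self._sources[name]())


def fetch_customers():
    cursor = warehouse.execute("SELECT customer_id, segment FROM customers")
    return ({"customer_id": cid, "segment": seg} for cid, seg in cursor)


def fetch_clicks():
    return (json.loads(line) for line in lake_clicks)


dv = VirtualLayer()
dv.register("customers", fetch_customers)
dv.register("clicks", fetch_clicks)

# A consumer joins the two logical sources without knowing where each lives.
segment_of = {row["customer_id"]: row["segment"] for row in dv.rows("customers")}
clicks_by_segment = {}
for click in dv.rows("clicks"):
    segment = segment_of[click["customer_id"]]
    clicks_by_segment[segment] = clicks_by_segment.get(segment, 0) + 1

print(clicks_by_segment)  # {'retail': 2, 'business': 1}
```

Note that the join here runs in the consumer after pulling every row from both sources. That is acceptable for a toy data set, but it hints at why performance can vary with the workload and why some analytics are better run inside the Data Lake itself.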
[Figure: Data Lake EDW. Big data sources and conventional data sources feed a single Data Lake containing the Hadoop DWH, data marts and OLAP cube. Consumers: dashboard, data discovery / data science, search, reporting, decision automation.]
2 Gartner Presentation, What About the Data Warehouse? Start? Stop? Continue? - Mark Beyer, October 2014
COMPARISON OF THE MODELS
The benefits of each model are shown below. The risks reflect those of placing an increasing reliance on big data platforms. These are discussed in the next section.
MODEL: Active Archive
BENEFITS
• Enables online access to historical data, retained for long periods
• Capacity pressures on conventional relational stores are alleviated by offloading some ETL processing to the Data Lake therefore capitalising on specialist BI infrastructure investment
• Minimal disruption to traditional BI solutions
• Analytics migrated to the Data Lake, enabling self-service insight over a wide variety of data formats
WHERE MOST EFFECTIVE
• An existing, successful enterprise warehouse solution exists, critical to business operations and the appetite for risk of complete re-platform of the existing solution is low
• Existing ETL processing is under pressure to satisfy batch windows

MODEL: Dual Warehouse
BENEFITS
• Greater flexibility to choose the ‘right tool for the right job’ leveraging the strengths of each on a case-by-case basis
• Option to migrate conventional capabilities on demand as Data Lake technologies mature over time
WHERE MOST EFFECTIVE
• Organisations are committed to a strategy for the Data Lake in the enterprise, but desire the ability to selectively transition capabilities from the conventional relational data stores

MODEL: Hybrid
BENEFITS
• Separate solutions optimised for different workload types (e.g. batch vs interactive query)
• Reduces infrastructure cost by offloading high volume storage completely to the Data Lake
• Maintains conventional options for enterprise applications and dashboards
WHERE MOST EFFECTIVE
• There is an ambition to reduce infrastructure costs
• Strong technical expertise exists to deliver, maintain and support Hadoop based solutions
• Complex data access functionality is implemented in existing BI applications that is non-trivial to port on to Data Lake

MODEL: Enterprise Data Lake
BENEFITS
• Centralised data warehouse on single architecture for self-service analytics and BI solutions
• Single data storage platform for enforcement of governance policies and controls
• Reduced complexity
WHERE MOST EFFECTIVE
• Organisations have a significant appetite and skills to embrace emerging technologies
• The requirement is for a ‘greenfield’ site with no legacy system replacement or risk to existing capabilities
SO WHAT?
Before embarking on a warehouse enhancement, a Hadoop-based experiment or a major new data strategy, the following need to be considered:
RISKS
All BI projects carry the same notorious risks of failure, such as insufficient attention to business sponsorship and user engagement, and the Data Lake does not change these. However, the use of big data technologies at the centre of the corporate IT estate brings a number of new considerations. The open source community is enormously creative in plugging capability gaps, so many of these challenges are diminishing. Commercial support options, which offer greater stability, inevitably lag behind open source developments.
DATA ACCESS
SQL interfaces to Hadoop are evolving extremely quickly, not least because BI vendors are among those facing the greatest challenge in embracing the Data Lake. Data access is simplified where these SQL interfaces can be used, although some Hadoop-based platforms offer only a subset of SQL functionality. A schema-on-read approach is not necessarily straightforward to implement, especially if schemas change over time.
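To make the point about changing schemas concrete, the sketch below shows one possible way for a reader to cope with raw records written under two versions of a source layout. The field names, the renamed column, the `READ_SCHEMA` mapping and the default currency are all hypothetical and chosen purely for illustration.

```python
import json

# Raw records written at different times. The later producer renamed
# `cust` to `customer_id` and added `currency`. All names are hypothetical.
RAW_LINES = [
    '{"cust": "C1", "amount": 10}',
    '{"customer_id": "C2", "amount": 4, "currency": "EUR"}',
]

# Read-time mapping: for each output field, the source names to try in
# order, and a default to use when none of them is present.
READ_SCHEMA = {
    "customer_id": (("customer_id", "cust"), None),
    "amount": (("amount",), 0),
    "currency": (("currency",), "GBP"),  # assumed default for older records
}


def read_record(line, schema=READ_SCHEMA):
    """Project one raw JSON record onto the current logical schema."""
    raw = json.loads(line)
    out = {}
    for field, (candidates, default) in schema.items():
        # Take the first candidate name found in the raw record, else the default.
        out[field] = next((raw[c] for c in candidates if c in raw), default)
    return out


rows = [read_record(line) for line in RAW_LINES]
print(rows)
```

Every historical change to the source has to be encoded somewhere in the read path, and the mapping grows with each change. This is the maintenance cost that makes schema-on-read less straightforward than it first appears.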
DATA GOVERNANCE
The conventional data warehouse is a proven enabler for enterprise data governance processes through its tight controls and relational database functionality. The Data Lake has even greater responsibility to enforce governance policies given its flexibility to receive, process and store data in a variety of forms.
In an environment that includes multiple teams of data scientists, the enforcement of data governance policies including data retention, access controls, audit, data quality, ownership and stewardship is critical. We recommend drawing a distinction between the level of governance required for data services that are used to run the business and those used to discover new transformational business opportunities.
DATA AVAILABILITY
For the Data Lake to support the enterprise it will need to satisfy similar service levels expected of relational databases (e.g. high availability, monitoring, vendor response times). Hadoop-based solutions are still maturing in this area - the answer is to match the service level to the use case.
DATA MODEL
Many standard data models aligned to industry sectors are available as accelerators for traditional database systems. Implementation of these relies on standard relational database features such as data integrity constraints, individual record updates and highly structured data formats to support data quality standards. Organisations wanting to use the Enterprise Data Lake model will need to consider whether translating these models onto big data platforms is really needed and, if so, what it will cost.
Likewise there are standard patterns for ETL, such as change data capture or customer matching, which will need to be re-invented for big data. The cost of this needs to be recognised.
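As an indication of what re-implementing such a pattern involves, here is a minimal Python sketch of change data capture by snapshot comparison. This is only one of several ways to capture changes (reading a database transaction log is another) and is not put forward as the approach any given platform uses. The `capture_changes` function and the sample rows are assumptions made for the example.

```python
def capture_changes(previous, current):
    """Compare two snapshots keyed by primary key and classify the changes.

    Both arguments map key -> row (a dict). Returns (inserts, updates, deletes).
    """
    inserts = {k: current[k] for k in current.keys() - previous.keys()}
    deletes = {k: previous[k] for k in previous.keys() - current.keys()}
    updates = {
        k: current[k]
        for k in current.keys() & previous.keys()
        if current[k] != previous[k]
    }
    return inserts, updates, deletes


yesterday = {
    1: {"name": "Ann", "city": "Leeds"},
    2: {"name": "Bob", "city": "York"},
    3: {"name": "Cy", "city": "Bath"},
}
today = {
    1: {"name": "Ann", "city": "Leeds"},  # unchanged
    2: {"name": "Bob", "city": "Hull"},   # updated
    4: {"name": "Di", "city": "Derby"},   # inserted
}

inserts, updates, deletes = capture_changes(yesterday, today)
print(sorted(inserts), sorted(updates), sorted(deletes))  # [4] [2] [3]
```

On a relational platform, much of this is typically handled by existing tooling. On a platform without individual record updates, the comparison and the application of the resulting changes both have to be designed and built, which is where the cost arises.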
DATA SKILLS
Big data solutions implemented in emerging technologies face a greater barrier to entry because of limited availability of skilled resources. Over time this will be mitigated by wider adoption of big data and the emergence of more user friendly technologies.
PROFESSIONAL SERVICES
A cited benefit of big data technologies is cost saving through deployment on commodity infrastructure, but on-premise mission critical deployments will still require support services from infrastructure vendors. Cloud provision can defer some of these costs; in fact, some cloud providers are moving up the stack from infrastructure to platform and cluster provisioning. While this is an attractive alternative, the increasing variety of cloud offerings necessitates yet another skill set.
TAKING THE PLUNGE
The Data Lake is well placed to tackle some of the frustrations currently experienced with the traditional data warehouse, while leveraging new opportunities the digital age demands.
Big data platforms however bring their own risks and nervousness for architects, developers, administrators and analysts in an emerging technology space.
Today, many organisations are testing the water with big data capabilities in the form of proof-of-concept initiatives and point solutions. Thus big data skills and expertise will naturally continue to evolve; as they do so, the case for embracing the Data Lake in the enterprise is strengthened. Choosing which lake to swim in has never been more important.
ABOUT US
BAE Systems Applied Intelligence delivers solutions which help our clients to protect and enhance their critical assets in the connected world. Leading enterprises and government departments use our solutions to protect and enhance their physical infrastructure, nations and people, mission-critical systems, valuable intellectual property, corporate information, reputation and customer relationships, and competitive advantage and financial success.
We operate in four key domains of expertise:
• Cyber Security – helping our clients across the complete cyber security risk lifecycle
• Financial Crime – identifying, combating and preventing financial threats, risk, loss or penalties
• Communications Intelligence – providing sophisticated network intelligence, protection and controls
• Digital Transformation – creating competitive advantage and enhancing operating performance by exploiting data and digital connectivity
We enable organisations to be more agile, increase trust and operate more confidently. Our solutions help to strengthen national security and resilience, for a safer world. They enable enterprises to manage their business risks, optimise their operations and comply with regulatory obligations.
We are part of BAE Systems, a global defence, aerospace and security company delivering a wide range of products and services including advanced electronics, security and information technology solutions.
www.twitter.com/baesystems_ai
www.linkedin.com/company/baesystemsai E: [email protected]
W: www.baesystems.com/ai
Copyright © BAE Systems plc 2014. All rights reserved.
BAE SYSTEMS, the BAE SYSTEMS Logo and the product names referenced herein are trademarks of BAE Systems plc. BAE Systems Applied Intelligence Limited registered in England & Wales (No.1337451) with its registered office at Surrey Research Park, Guildford, England, GU2 7RQ. No part of this document may be copied, reproduced, adapted or redistributed in any form or by any means without the express prior written consent of BAE Systems Applied Intelligence.
Global Headquarters
BAE Systems Applied Intelligence
Surrey Research Park Guildford
Surrey GU2 7RQ United Kingdom
T: +44 (0) 1483 816000
BAE Systems Applied Intelligence Australia
Level 12, 20 Bridge Street Sydney NSW 2000 Australia
T: +61 (2) 9255 0400
BAE Systems Applied Intelligence Dubai
Dubai Internet City Building 17
Office Ground Floor 53 PO Box 500523 Dubai
T: +971 4369 4369
BAE Systems Applied Intelligence Malaysia
Level 28 Menara Binjai 2 Jalan Binjai, 50450 Kuala Lumpur T: +60 3 2191 3000
BAE Systems Applied Intelligence USA
265 Franklin Street Boston MA 02110 USA T: +1 (617) 737 4170