
EXECUTIVE SUMMARY

We live in a time of uncertainty for the traditional Enterprise Data Warehouse (EDW). The long-standing requirement to operationalise Business Intelligence (BI) has been accelerated by the needs of real-time operational decisioning. At the same time the EDW must cope with an explosion of volume, high user expectations and the demands of data discovery.

Enter big data technology which promises scalability, flexibility and lower cost to serve. Enterprises have begun by experimenting with big data platforms supporting an increasing number of point solutions. This is the origin of our big data refinery architecture, which allows for experimentation without disrupting the existing EDW. Many customers have successfully used the big data refinery approach for their specific challenges but as big data platforms develop enterprise-grade features, it becomes feasible to use big data to play a bigger role in the EDW. This is the origin of the “Data Lake” or “Data Reservoir”.

So what should a Data Lake look like? What is the best blend of big data and traditional database tools for an organisation? How do you avoid being left behind by increasingly agile competition?

This paper considers four architecture models which put big data technology and the data asset itself increasingly at the centre of the enterprise. We look at the challenges each model can help to solve and the potential pitfalls business leaders will need to consider as they determine how best to embrace big data in the digital enterprise.


BAE Systems Applied Intelligence

The Enterprise Data Warehouse (EDW) has become a long-serving, business critical capability for many organisations. The successful EDW should provide consistent, trusted data for accurate decision-making.

However, most organisations are expressing frustrations with their data warehousing solutions due to:

• Inflexibility of Evolved Platforms:

An inability to deal with a changing business quickly and get data to where it is needed most. The shorter time-to-market for new products and services demands flexible solutions and rapid software development methods. Similarly data discovery, which forms a key engine of innovation in the digital age, demands more responsive, higher capacity platforms for raw analysis.

• Heightened User Expectations:

Our personal use of smartphones and tablets has created heightened user expectations about the immediacy and intuitiveness of technology. Technology-driven EDW projects continue to fail: a string of expensive projects adopted a ‘build it and they will come’ approach and failed to properly engage end users.

• The Volume, Variety & Velocity of the Data:

Moving into a world of interaction data e.g. machine logs, clickstreams, sensor and appliance data (the “Internet of Things”), as well as semi-structured and unstructured data, has opened up opportunities for greater customer and operational insights but poses data processing and storage challenges for traditional database platforms. At the same time, the increasing desire for organisations to be able to respond to events quickly has fuelled requirements for personalised real-time decision support systems which still need to draw on a centralised single version of the truth.

• Total Cost of Ownership:

IT projects have come under increasing scrutiny to deliver lower cost to serve where software licensing, data storage, high performance infrastructure and system maintenance are large contributors to TCO. Organisations are challenging their suppliers to step up or move aside where cloud and open source tools continue to mature their enterprise-readiness.


THE PROMISE OF BIG DATA

The emergence of big data technologies has offered some potential solutions to these challenges:

• Scale-out storage in petabytes and beyond means that data which may contain critical business advantage can be retained cost-effectively

• Massively Parallel Processing (MPP) on grid-computing enables data integration and processing of many sources with very high throughput rates, at a fraction of the cost of traditional MPP platforms

• Schema-on-Read offers the ability to define the data structure at query time, as opposed to load time. This means new data sources can be loaded in their native format and quickly made available for self-service discovery

• Complex data structures can be stored and processed efficiently, alleviating limitations of relational data

• Open Source software on commodity infrastructure helps to relieve license and support costs.

Moreover, two of the major concerns around big data technology, read consistency and transaction support, do not typically apply to the EDW.
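The schema-on-read point above can be illustrated with a minimal sketch. It assumes raw events are kept as JSON lines in their native form and that the reader, not the loader, decides which fields matter; the event data, the `PAGE_TIMING_SCHEMA` mapping and the `read_with_schema` helper are all hypothetical names invented for this illustration.

```python
import json

# Hypothetical raw clickstream events, stored exactly as they arrived.
RAW_EVENTS = [
    '{"user": "u1", "page": "/home", "ms": "120"}',
    '{"user": "u2", "page": "/basket", "ms": "340", "device": "tablet"}',
    '{"user": "u1", "page": "/checkout"}',
]

# A read-time schema (assumed for this example):
# field name -> (converter, default used when the field is absent).
PAGE_TIMING_SCHEMA = {
    "user": (str, None),
    "page": (str, None),
    "ms": (int, 0),
}


def read_with_schema(raw_lines, schema):
    """Parse each raw line and project it onto the schema chosen by the reader."""
    for line in raw_lines:
        record = json.loads(line)
        yield {
            field: convert(record[field]) if field in record else default
            for field, (convert, default) in schema.items()
        }


# Structure is applied only now, at query time.
rows = list(read_with_schema(RAW_EVENTS, PAGE_TIMING_SCHEMA))

total_ms_by_user = {}
for row in rows:
    total_ms_by_user[row["user"]] = total_ms_by_user.get(row["user"], 0) + row["ms"]
```

Nothing was decided about the `device` field at load time; a different reader could define another schema over the same raw lines without reloading them.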

THE BIG DATA REFINERY

Our Big Data Refinery (BDR) model is a starting point that illustrates how a big data platform can be employed to unlock hidden value in a wide variety of data sources, and sit alongside the traditional EDW. Big data sources can be presented to data scientists and summarised as a source to the data warehouse. The primary intent of the BDR is to enhance an established well-functioning warehouse, rather than offering a complete answer to the challenges posed above.

[Figure: the Big Data Refinery, shown in three layers (Data Sources; Storage & Processing; Presentation & Exploitation). Labels: Traditional Data Sources (legacy systems, online transaction processing); Big Data Sources; Data Refinery (filter / pre-processing, batch processing platform, fast search and query platform, real time analytics platform); Enterprise Data Warehouse and Data Marts; Access and Exploitation Tools (BI tools, bulk analytics, fixed and dynamic reporting, search); Big Data Specialists; Traditional BI Analyst.]

[Figure: Active Archive Data Lake. Labels: Big Data Sources; Conventional Data Sources; Data Lake (Archive, ETL, Staging); Conventional Relational Data Stores (Traditional EDW, Data Marts, OLAP Cube); flows labelled Archive and Load; consumers: Dashboard, Reporting, Search, Data Discovery / Data Science, Decision Automation.]

¹ Gartner Magic Quadrant Data Warehouse DBMS Survey, Nov 2012 and Nov 2013, and Gartner Presentation, What About the Data Warehouse? Start? Stop? Continue? - Mark Beyer, October 2014


Unsurprisingly, vendors of traditional tools are pushing this approach, with ever improving integration with big data technologies. This will be the right choice for many customers - for example those looking to make a small investment to test the returns. This is far and away the most popular approach at present - more than 70% of organisations Gartner surveyed in 2013 were using big data for ‘marts’¹. Big data solutions have evolved rapidly over recent years with increasingly mature enterprise grade features, strengthening their ability to support BI and analytics solutions.

RISE OF THE DATA LAKE

The Data Lake goes at least one stage beyond the BDR and becomes the initial “landing point” for enterprise data sources and externally gathered data. Like the BDR, it is underpinned by big data technologies - typically starting with Hadoop. This naturally raises questions of the role the Data Lake can play in the enterprise such as: ‘Can I host my whole Data Warehouse on Hadoop?’ and ‘in what parts of the EDW will big data technology be most effective?’ We therefore suggest four models through which the Data Lake increasingly encompasses the EDW. There is a logical progression but that does not imply the same model is ideal for every organisation and the choice depends on several factors.

1. The “Active Archive” Data Lake undertakes some responsibility for Extract, Transform and Load (ETL) and provides online access to historic data - both raw source information and data archived from the conventional relational stores. Through retention of source data in its native format, business questions can be asked in ways which were not envisaged when the data was written. Replicating the data with low latency to the Data Lake is a cheap way to alleviate the query load of Data Discovery from source systems. At the same time this brings the advantages of a distributed platform for the purpose of advanced analytics. For many years, we have implemented Historical Data Stores (HDS) in traditional EDWs. However, even then there is an analysis and development cost and lag to acquiring new data sources. Furthermore, the HDS pattern is generally unsuited to unstructured data.


2. The “Dual Warehouse” Data Lake continues to act as an archive, but in addition presents a replica and extension of existing reporting structures to broaden the use cases it can fulfil. Since the Dual Warehouse replicates data, it can represent a transition step to one of the later models.

[Figure: Dual Warehouse Data Lake. Labels: Big Data Sources; Conventional Data Sources; Data Lake (Archive, Hadoop DWH); Conventional Relational Data Stores (Traditional EDW, Data Marts, OLAP Cube); flows labelled Replicate and Load; consumers: Dashboard, Reporting, Search, Data Discovery / Data Science, Decision Automation.]

3. The “Hybrid” Data Lake is an evolution of the Dual Warehouse; in this case the Operational Data Store, common to many EDW patterns, resides in the Data Lake and a traditional relational database and OLAP tools are used for data marts.

[Figure: Hybrid Data Lake. Labels: Big Data Sources; Conventional Data Sources; Data Lake (Hadoop DWH); BI stores (Data Marts, OLAP Cube); flow labelled Summarise; consumers: Dashboard, Reporting, Search, Data Discovery / Data Science, Decision Automation.]


4. The “Enterprise” Data Lake, as the endpoint of the evolution, serves all the BI and analytics needs of the organisation. This model is declining in popularity - Gartner’s Data Warehouse inquiry data shows the “replacement” idea is disappearing: 17% of organisations were considering replacing the EDW with a Big Data solution in 2010 but this had dropped to 3% by 2013².

BRIDGING THE GAP WITH DATA VIRTUALISATION

The role of Data Virtualisation (DV) is a vital consideration when deciding on an architecture.

DV is a form of data integration that allows multiple data sources to be treated as one logical source, but this does not constitute a Data Lake. It offers a means to leverage capabilities of different underlying technologies by presenting an abstracted data access layer that reduces time-to-insight for BI and analytics solutions by accessing data directly at source. DV recognises the reality of a heterogeneous data landscape and allows for optimum tooling to be used in each case. It can be used in any of the models presented so far to hide the implementation of the Data Lake (and its evolution) from Data Discovery consumers and potentially from reporting. It also allows for access to sources yet to be migrated into the Data Lake or where low latency is an important requirement.

DV solutions are not a panacea, however, and may be prone to mixed performance results depending on the query or workload introduced. Moreover, certain types of advanced analytics can only be run effectively by bringing the data into the Data Lake.
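As a rough illustration of the idea, the sketch below presents two physical sources as one logical source and fetches from each only when a question is asked. It is not a DV product: an in-memory SQLite database stands in for a conventional relational store, a Python list stands in for raw data held in the Data Lake, and the table, the records and the `VirtualCustomerActivity` class are all assumptions made for this example.

```python
import sqlite3

# Stand-in for a conventional relational store (hypothetical table and rows).
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE customers (customer_id TEXT PRIMARY KEY, segment TEXT)"
)
warehouse.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [("c1", "retail"), ("c2", "business")],
)

# Stand-in for raw interaction data held in the Data Lake.
lake_clicks = [
    {"customer_id": "c1", "page": "/home"},
    {"customer_id": "c1", "page": "/offers"},
    {"customer_id": "c2", "page": "/home"},
]


class VirtualCustomerActivity:
    """One logical source over two physical ones; nothing is copied in advance."""

    def __init__(self, relational_conn, lake_records):
        self._conn = relational_conn
        self._lake = lake_records

    def clicks_by_segment(self):
        # Reference data is read from the relational store at query time.
        segments = dict(
            self._conn.execute("SELECT customer_id, segment FROM customers")
        )
        counts = {}
        # Interaction data is read from the lake and joined in the access layer.
        for click in self._lake:
            segment = segments.get(click["customer_id"], "unknown")
            counts[segment] = counts.get(segment, 0) + 1
        return counts


view = VirtualCustomerActivity(warehouse, lake_clicks)
result = view.clicks_by_segment()
```

The consumer sees only `clicks_by_segment`; either source could later be migrated without changing the caller. The join runs in the access layer on every call, which hints at why performance can vary with the workload.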

[Figure: Enterprise Data Lake. Labels: Big Data Sources; Conventional Data Sources; Data Lake EDW (Hadoop DWH, Data Marts, OLAP Cube); consumers: Dashboard, Reporting, Search, Data Discovery / Data Science, Decision Automation.]

² Gartner Presentation, What About the Data Warehouse? Start? Stop? Continue? - Mark Beyer, October 2014


COMPARISON OF THE MODELS

The benefits of each model are shown below. The risks reflect those of placing an increasing reliance on big data platforms. These are discussed in the next section.

Active Archive

Benefits:

• Enables online access to historical data, retained for long periods

• Capacity pressures on conventional relational stores are alleviated by offloading some ETL processing to the Data Lake, therefore capitalising on specialist BI infrastructure investment

• Minimal disruption to traditional BI solutions

• Analytics migrated to the Data Lake, enabling self-service insight over a wide variety of data formats

Where most effective:

• An existing, successful enterprise warehouse solution exists, critical to business operations, and the appetite for risk of complete re-platform of the existing solution is low

• Existing ETL processing is under pressure to satisfy batch windows

Dual Warehouse

Benefits:

• Greater flexibility to choose the ‘right tool for the right job’, leveraging the strengths of each on a case-by-case basis

• Option to migrate conventional capabilities on demand as Data Lake technologies mature over time

Where most effective:

• Organisations are committed to a strategy for the Data Lake in the enterprise, but desire the ability to selectively transition capabilities from the conventional relational data stores

Hybrid

Benefits:

• Separate solutions optimised for different workload types (e.g. batch vs interactive query)

• Reduces infrastructure cost by offloading high volume storage completely to the Data Lake

• Maintains conventional options for enterprise applications and dashboards

Where most effective:

• There is an ambition to reduce infrastructure costs

• Strong technical expertise exists to deliver, maintain and support Hadoop based solutions

• Complex data access functionality is implemented in existing BI applications that is non-trivial to port on to the Data Lake

Enterprise Data Lake

Benefits:

• Centralised data warehouse on single architecture for self-service analytics and BI solutions

• Single data storage platform for enforcement of governance policies and controls

• Reduced complexity

Where most effective:

• Organisations have a significant appetite and skills to embrace emerging technologies

• The requirement is for a ‘greenfield’ site with no legacy system replacement or risk to existing capabilities


SO WHAT?

Before embarking on a warehouse enhancement, a Hadoop-based experiment or a major new data strategy, the following need to be considered:

RISKS

All BI projects come with the same notorious risks of failure, such as insufficient attention to business sponsorship and user engagement, and the Data Lake doesn’t change these. However, the use of big data technologies at the centre of the corporate IT estate brings a number of new considerations. The open source community is enormously creative in plugging capability gaps, so many of these challenges are diminishing. However, commercial support options which offer greater stability inevitably lag behind open source developments.

DATA ACCESS

SQL interfaces to Hadoop are evolving extremely quickly, since BI vendors are among those facing the greatest challenges in embracing the Data Lake. Data access is simplified where these SQL interfaces can be used, although some Hadoop-based platforms only offer a subset of SQL functionality. A schema-on-read approach is not necessarily straightforward to implement, especially if schemas change over time.
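To make the difficulty with changing schemas concrete, here is a small sketch in which the same logical field has been stored under two names over the life of a source, so the read-time schema has to carry that history. The sensor records, the `FIELD_ALIASES` table and the `normalise` helper are hypothetical, and an alias table is only one possible way of handling drift.

```python
import json

# Hypothetical sensor readings written by two generations of the same source:
# older records use "temp", newer ones use "temperature_c".
RAW_READINGS = [
    '{"sensor": "s1", "temp": 20.5}',
    '{"sensor": "s1", "temperature_c": 21.0}',
    '{"sensor": "s2", "temperature_c": 18.0, "firmware": "2.1"}',
]

# For each logical field, the names it has been stored under, newest first.
FIELD_ALIASES = {
    "sensor": ["sensor"],
    "temperature_c": ["temperature_c", "temp"],
}


def normalise(record, aliases):
    """Map a raw record of any generation onto the current logical fields."""
    out = {}
    for logical_name, stored_names in aliases.items():
        # Take the first stored name present in the record, else None.
        out[logical_name] = next(
            (record[name] for name in stored_names if name in record), None
        )
    return out


readings = [normalise(json.loads(line), FIELD_ALIASES) for line in RAW_READINGS]
```

Every reader of the raw data needs the same alias knowledge, which is part of the implementation cost that is avoided when the schema is fixed at load time.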

DATA GOVERNANCE

The conventional data warehouse is a proven enabler for enterprise data governance processes through its tight controls and relational database functionality. The Data Lake has even greater responsibility to enforce governance policies given its flexibility to receive, process and store data in a variety of forms.

In an environment that includes multiple teams of data scientists, the enforcement of data governance policies including data retention, access controls, audit, data quality, ownership and stewardship is critical. We recommend drawing a distinction between the level of governance required for data services that are used to run the business and those used to discover new transformational business opportunities.

DATA AVAILABILITY

For the Data Lake to support the enterprise it will need to satisfy similar service levels expected of relational databases (e.g. high availability, monitoring, vendor response times). Hadoop-based solutions are still maturing in this area - the answer is to match the service level to the use case.


DATA MODEL

Many standard data models aligned to industry sectors are available as accelerators for traditional database systems. Implementation of these relies on standard relational database features such as data integrity constraints, individual record updates and highly structured data formats to support data quality standards. Organisations wanting to use the Enterprise Data Lake model will need to consider the cost of translating these onto big data platforms, if this is really needed.

Likewise, there are standard patterns for ETL, such as change data capture or customer matching, which will need to be re-invented for big data. The cost of this needs to be recognised.
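As an indication of what re-implementing such a pattern involves, the sketch below shows change data capture by snapshot comparison, which is only one of several CDC techniques (log-based capture is another). It assumes each snapshot fits in memory as a dictionary keyed by primary key; the `capture_changes` function and the customer records are hypothetical.

```python
def capture_changes(previous, current):
    """Compare two snapshots keyed by primary key and return the change set."""
    inserts = {key: current[key] for key in current.keys() - previous.keys()}
    deletes = {key: previous[key] for key in previous.keys() - current.keys()}
    updates = {
        key: current[key]
        for key in current.keys() & previous.keys()
        if current[key] != previous[key]
    }
    return {"insert": inserts, "update": updates, "delete": deletes}


# Hypothetical customer snapshots taken on consecutive days.
yesterday = {
    "c1": {"name": "Avery", "city": "Leeds"},
    "c2": {"name": "Blake", "city": "Bristol"},
    "c3": {"name": "Casey", "city": "Cardiff"},
}
today = {
    "c1": {"name": "Avery", "city": "Leeds"},
    "c2": {"name": "Blake", "city": "Bath"},
    "c4": {"name": "Devon", "city": "Derby"},
}

changes = capture_changes(yesterday, today)
```

On a distributed platform the same comparison would have to be expressed as a partitioned join over files that are typically written once rather than updated record by record, which is where the cost of re-invention arises.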

DATA SKILLS

Big data solutions implemented in emerging technologies face a greater barrier to entry because of limited availability of skilled resources. Over time this will be mitigated by wider adoption of big data and the emergence of more user friendly technologies.

PROFESSIONAL SERVICES

A cited benefit of big data technologies is cost saving through deployment on commodity infrastructure, but on-premise mission critical deployments will still require support services from infrastructure vendors. Cloud provision can defer some of these costs; in fact, some cloud providers are moving up the stack from infrastructure to platform and cluster provisioning. While this is an attractive alternative, the increasing variety of cloud offerings necessitates yet another skill set.

TAKING THE PLUNGE

The Data Lake is well placed to tackle some of the frustrations currently experienced with the traditional data warehouse, while leveraging new opportunities the digital age demands.

Big data platforms however bring their own risks and nervousness for architects, developers, administrators and analysts in an emerging technology space.

Today, many organisations are testing the water with big data capabilities in the form of proof-of-concept initiatives and point solutions. Thus big data skills and expertise will naturally continue to evolve; as they do so, the case for embracing the Data Lake in the enterprise is strengthened. Choosing which lake to swim in has never been more important.


ABOUT US

BAE Systems Applied Intelligence delivers solutions which help our clients to protect and enhance their critical assets in the connected world. Leading enterprises and government departments use our solutions to protect and enhance their physical infrastructure, nations and people, mission-critical systems, valuable intellectual property, corporate information, reputation and customer relationships, and competitive advantage and financial success.

We operate in four key domains of expertise:

• Cyber Security – helping our clients across the complete cyber security risk lifecycle

• Financial Crime – identifying, combating and preventing financial threats, risk, loss or penalties

• Communications Intelligence – providing sophisticated network intelligence, protection and controls

• Digital Transformation – creating competitive advantage and enhancing operating performance by exploiting data and digital connectivity

We enable organisations to be more agile, increase trust and operate more confidently. Our solutions help to strengthen national security and resilience, for a safer world. They enable enterprises to manage their business risks, optimise their operations and comply with regulatory obligations.

We are part of BAE Systems, a global defence, aerospace and security company delivering a wide range of products and services including advanced electronics, security and information technology solutions.

www.twitter.com/baesystems_ai

www.linkedin.com/company/baesystemsai

E: [email protected]

W: www.baesystems.com/ai

Copyright © BAE Systems plc 2014. All rights reserved.

BAE SYSTEMS, the BAE SYSTEMS Logo and the product names referenced herein are trademarks of BAE Systems plc. BAE Systems Applied Intelligence Limited registered in England & Wales (No.1337451) with its registered office at Surrey Research Park, Guildford, England, GU2 7RQ. No part of this document may be copied, reproduced, adapted or redistributed in any form or by any means without the express prior written consent of BAE Systems Applied Intelligence.

Global Headquarters

BAE Systems Applied Intelligence

Surrey Research Park Guildford

Surrey GU2 7RQ United Kingdom

T: +44 (0) 1483 816000

BAE Systems Applied Intelligence Australia

Level 1220 Bridge Street Sydney NSW 2000 Australia

T: +61 (2) 9255 0400

BAE Systems Applied Intelligence Dubai

Dubai Internet City Building 17

Office Ground Floor 53 PO Box 500523 Dubai

T: +971 4369 4369

BAE Systems Applied Intelligence Malaysia

Level 28 Menara Binjai 2 Jalan Binjai, 50450 Kuala Lumpur

T: +60 3 2191 3000

BAE Systems Applied Intelligence USA

265 Franklin Street Boston MA 02110 USA

T: +1 (617) 737 4170
