Beyond the Data Lake


Managing Big Data for Value Creation

In this white paper

1 The Data Lake Fallacy

2 Moving Beyond Data Lakes

3 A Big Data Warehouse Supports Strategy, Value Creation


By Dr. Paul Terry, President & CEO, PHEMI

We live in an era of “big data” in which data-driven insights will drive efficiencies and increased productivity, fuel discovery, and spark innovation.

To gain these benefits, organizations must take a strategic approach to their digital assets. They should consider the value these assets can provide today, as well as into the future, if they are properly, strategically managed. A sound data management strategy should address the complete data lifecycle, including how data is collected, stored, secured, curated, analyzed, presented and, finally, destroyed.

Many vendors that purport to address the big data challenge actually offer only the first items on this checklist — collection, storage, and security. These basic services address the fact that most organizations possess multiple databases that cannot communicate with one another. Thus, one commonly proffered solution is to combine all databases into one — the so-called “data lake.”

In this white paper, I’ll point out the strengths and shortcomings of the data lake concept and describe an alternative big data management strategy that is scalable and enterprise-grade, and addresses the complete data lifecycle for significant value creation today and into the future.

The Data Lake Fallacy

The term data lake implies a single repository where all data is stored in its native format and made available for retrieval, analysis, and value creation. Without proper curation and the addition of metadata to guide governance, link related data and provide additional functionalities, a data lake risks becoming a “data swamp.”


technology (IT) department will gain a time-consuming new set of responsibilities that require specialized and hard-to-find skillsets.

With an ad hoc approach to the big data challenge, the execution of an organizational data strategy is likely to become arduous and time-consuming, with few assurances of success.

“ Data lakes typically begin as ungoverned data stores. Meeting the needs of wider audiences requires curated repositories with governance, semantic consistency and access controls—elements already found in a data warehouse. It’s beneficial to quickly move beyond a ‘data lake’ concept to develop a more robust, logical, data warehouse strategy.”

— Nick Heudecker, research director, Gartner, Inc.

A data lake, while appealing in its apparent simplicity, may collect, store, and secure data in its native format, and combine many disparate databases, but it does not address the suite of additional functionalities that create significant value over time. With a data lake, it is a challenge to protect, control, find, and retrieve data, much less create value with it.

One commonly implemented approach is to use Hadoop, an open-source software framework written in Java for distributed storage and processing of big data. Though Hadoop offers file system availability and reliability, it provides limited security, particularly in the area of access controls. A new approach is needed to ensure that only the right user, at the right time, can see the specific data or level of data they are permitted to access. Data security and privacy are all about controlling the visibility of individual pieces of data, such as the names and numbers associated with a patient record in a healthcare application.
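As a rough illustration of controlling the visibility of individual pieces of data, the following Python sketch masks fields in a patient record according to the caller's clearance. The visibility levels, the `Field` class, the `visible_view` function, and the sample record are all hypothetical names invented for this example; they do not describe Hadoop or any particular product, and time-based conditions ("at the right time") are left out for brevity.

```python
from dataclasses import dataclass

# Hypothetical visibility levels, ordered from least to most sensitive.
PUBLIC, DE_IDENTIFIED, IDENTIFIED = 0, 1, 2


@dataclass(frozen=True)
class Field:
    value: object
    sensitivity: int  # minimum clearance needed to see the value


def visible_view(record, clearance):
    """Return the record with every field above the caller's clearance masked."""
    return {
        name: (f.value if clearance >= f.sensitivity else "***")
        for name, f in record.items()
    }


# Invented sample record; names and numbers are treated as identifying data.
patient = {
    "name": Field("Jane Doe", IDENTIFIED),
    "health_number": Field("9876543210", IDENTIFIED),
    "birth_year": Field(1980, DE_IDENTIFIED),
    "diagnosis_code": Field("E11", DE_IDENTIFIED),
}

researcher_view = visible_view(patient, DE_IDENTIFIED)  # name and number masked
clinician_view = visible_view(patient, IDENTIFIED)      # everything visible
```

The point of the sketch is that the decision is made per field rather than per file, which is the granularity the paragraph above calls for.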

Moving Beyond Data Lakes

“While it is certainly true that ‘data lakes’ can provide value to various parts of the organization, the [data lake’s] proposition of enterprise-wide data management has yet to be realized.”

— Andrew White, VP and distinguished analyst, Gartner, Inc.


By assigning metadata to datasets entering a data management system, it is possible to determine data quality, maintain the original data, and track changes made to that data (version control). Without metadata, every query begins from scratch. The data lake risks becoming a data swamp.
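One way to picture how metadata can maintain the original data while tracking changes is the minimal Python sketch below. The `DataItem` class, its fields, the `derive` helper, and the sample file name are assumptions made for illustration, not a description of any specific system. Each item is immutable, and a derived version records the SHA-256 digest of the item it came from.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)  # frozen: an item cannot be modified after creation
class DataItem:
    content: bytes
    source: str
    version: int = 1
    parent_digest: Optional[str] = None  # digest of the item this was derived from
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

    @property
    def digest(self) -> str:
        """Content fingerprint used to link versions together."""
        return hashlib.sha256(self.content).hexdigest()


def derive(parent: DataItem, new_content: bytes) -> DataItem:
    """Create the next version; the parent item is left untouched."""
    return DataItem(new_content, parent.source, parent.version + 1, parent.digest)


raw = DataItem(b"name,age\nJane,44\n", source="clinic_export.csv")
cleaned = derive(raw, b"name,age\njane,44\n")
```

Because `cleaned.parent_digest` equals `raw.digest`, the lineage from a transformed dataset back to its original can be followed without re-examining the data itself.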

Governance policies to control who has access to specific datasets or levels of data granularity are critical to managing privacy, consent, confidentiality, and data-sharing agreements within or between organizations. Data lakes provide little or no oversight or control of their contents or of security and privacy policies. Access controls for a single data repository such as a data lake that lacks metadata and policy enforcement remain embryonic at best.

Performance is a critical variable in data management practices. Without metadata, and the indexing and cataloging metadata supports, finding answers to queries is slow, cumbersome, and fragmented. Queries themselves depend on data-analysis expertise on the part of enterprise users or the IT staff that supports them. A single data repository, cobbled together with à la carte functionalities, is unlikely to perform as swiftly, accurately, and comprehensively as a purpose-built, optimized data management system that reflects an organization’s strategic vision for value creation. Metadata is the key to powerful end-to-end data control.

A Big Data Warehouse Supports Strategy and Value Creation

The implications of the data lake fallacy for public- or private-sector managers who are mandated by law or driven by market pressures to store, secure, analyze, and create value from data in their care should be clear: a holistic, enterprise-grade solution for big data should manage data across its complete lifecycle.

A complete solution, found in the big data warehouse model, must enable automated data collection; ensure the application of data privacy, security, and governance measures; convert disparate data formats into an analytics-ready state; and analyze and present actionable insights on demand. When a dataset’s mandated or useful life reaches an end, a complete solution must address proper data destruction. Throughout the lifecycle, data must be preserved in its original form, and any data items derived through transformation after ingest must be accurately tracked.

[Sidebar: The Power of Metadata]

A complete solution should also be scalable. That is, to validate the solution’s value proposition, it must be possible to apply the solution incrementally, to one database entity at a time, in a systematic and cost-effective way, as the business builds out its big data management strategy and associated investments. Scalability means the marginal cost of adding the next bit of data is less than that of adding the previous bit.

The goal of any big data strategy should be actionable insights, value creation, and innovation. With a big data management platform approach, the ability to mine for insights increases as more data sources are added to the system. (This is known as the “network effect.”) Increases in efficiency and productivity, that is, doing more with less, should be a given. To illustrate how a big data warehouse approach enables an organization’s data management strategy to achieve these returns on investment, it’s useful to envision a step-by-step approach that includes collection, curation, and consumption.

Collect.

A big data warehouse solution enables a user to automatically collect disparate data sources, tag them with metadata, catalog and index them, and load them into a data repository, according to rules set by the user. A critical requirement of a big data warehouse platform in the collection phase is the ability to handle any kind of data, including structured (e.g., database records), semi-structured (e.g., Microsoft Excel, machine-collected data, or genomic files) and/or unstructured (e.g., images or documents).
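The collection step described above can be sketched in a few lines of Python. Everything here is hypothetical: the `RULES` table stands in for "rules set by the user," the file names are invented, and classifying by file extension is a simplification chosen only to keep the example short.

```python
from pathlib import PurePath

# Hypothetical user-defined rules: file extension -> (data kind, tags).
RULES = {
    ".xlsx": ("semi-structured", {"tabular", "spreadsheet"}),
    ".vcf": ("semi-structured", {"genomic"}),
    ".png": ("unstructured", {"image"}),
    ".pdf": ("unstructured", {"document"}),
}


def ingest(filename, catalog, index):
    """Tag a source with metadata, catalog it, and index it by tag."""
    suffix = PurePath(filename).suffix.lower()
    # Sources that match no rule are still accepted, but flagged for review.
    kind, tags = RULES.get(suffix, ("unstructured", {"unclassified"}))
    entry = {"name": filename, "kind": kind, "tags": sorted(tags)}
    catalog[filename] = entry
    for tag in tags:
        index.setdefault(tag, []).append(filename)
    return entry


catalog, index = {}, {}
for name in ["labs_2015.xlsx", "scan_001.png", "sample.vcf", "notes.txt"]:
    ingest(name, catalog, index)
```

After the loop, `index["genomic"]` lists the genomic sources and `catalog` holds one metadata entry per source, so later queries can start from the catalog instead of scanning raw files.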

Curate.

A big data warehouse produces analytics-ready digital assets that are cataloged and protected. The curation process preserves the original data item in its native format, providing a baseline resource that can be re-analyzed, potentially in different ways, going forward. Metadata describes the original file, adding context such as source, user history and data-sharing agreements, captures additional data, and supports indexing and cataloging. Metadata allows users to implement governance policies, security, and privacy measures such as access and visibility controls. And metadata tracks who has touched that data, when, and how.
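The idea that metadata tracks who has touched the data, when, and how can be illustrated with a small audit trail. The `AuditedStore` class below is a hypothetical in-memory sketch, not a real product interface; the user names and the stored file are invented. It refuses to overwrite an original and logs every access attempt.

```python
from datetime import datetime, timezone


class AuditedStore:
    """Keeps originals unchanged and logs each access with user, action, and time."""

    def __init__(self):
        self._items = {}
        self.audit_log = []

    def _record(self, user, action, key):
        self.audit_log.append(
            {"user": user, "action": action, "key": key,
             "at": datetime.now(timezone.utc)}
        )

    def put(self, user, key, content):
        if key in self._items:
            raise KeyError(f"{key!r} already stored; originals are never overwritten")
        self._items[key] = content
        self._record(user, "put", key)

    def get(self, user, key):
        # Log before the lookup so that failed attempts are recorded too.
        self._record(user, "get", key)
        return self._items[key]


store = AuditedStore()
store.put("ingest-service", "report.pdf", b"%PDF-1.4 ...")
store.get("analyst_a", "report.pdf")
actions = [(e["user"], e["action"]) for e in store.audit_log]
```

A production system would persist the log and protect it from tampering; the sketch only shows the shape of the record.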

Consume.


the use of in-house or third-party applications that perform the actual data analysis for actionable insights.

Unlike a data lake, a big data warehouse approach should collect, curate, and consume data at speed and scale.

Conclusion

Full data-lifecycle management in a single, purpose-built platform enables an optimal, strategic approach to digital assets for value creation.

Functionalities should include governance (including access, data-sharing, and visibility controls); secure, reliable, scalable, and fast storage; the application of metadata; data immutability; audit; version control; and timely destruction. Such a platform should enable swift development of applications to serve an organization’s specific needs.


About the author

Dr. Paul Terry is president and CEO of Vancouver, B.C.-based PHEMI, developer of a big data warehouse platform, where he provides vision and technical leadership. Terry advises private and public healthcare organizations on next-generation data strategies. He is an adjunct professor in big data at Simon Fraser University (SFU) and a partner with Magellan Angel Partners. He lectures in technology, strategy and product management for the MBA program at SFU. He is a member of the big data Sub-Committee Working Group at the BC Institute for Health Innovation and serves on Genome BC’s Health Strategy Task Force.

Prior to his experience in healthcare and venture capital, Paul was the CTO and cofounder of OctigaBay Systems, a pioneer in high performance computing, which was acquired by Cray Inc., the world leader in supercomputing. He was also the cofounder and CTO of Abatis Systems, which was acquired by Redback Networks in one of the largest technology acquisitions in Canadian history. He holds an MBA from the Cranfield School of Management, a marketing diploma from the Chartered Institute of Marketing, a PhD in electrical engineering and an honours Bachelor’s degree from the University of Liverpool.

About PHEMI

PHEMI was founded in 2013 by a team of proven entrepreneurs and industry experts. Headquartered in Vancouver, Canada, the PHEMI team has extensive experience bringing innovative technologies to enterprise-class customers. Industry expertise, ranging from healthcare to telecom to public sector to security, drives PHEMI Central features, while networking and high performance computing technology expertise drives PHEMI architecture to meet the challenges of big data.

PHEMI Central gives organizations the agility to seamlessly collect data sources, catalog and curate a powerful inventory of secure digital assets, conceive new business applications, and rapidly build new solutions to support strategic objectives.

PHEMI partners with best-in-class technology and service providers to deliver a complete solution to meet any organization’s needs.