Deriving Value from the Data Lake - Architecting Data Lakes

Controlling and Allowing Access

When you provide various users (whether C-level executives, business analysts, or data scientists) with the tools they need, security is critical. Setting and consistently enforcing security policies is essential for successful use of a data lake. In-memory technologies should support different access patterns for each user group, depending on their needs. For example, a report generated for a C-level executive may be very sensitive, and should not be available to others who don’t have the same access privileges. In addition, you may have business users who want to use data in a low-latency manner because they are interacting with data in real time with a BI tool; in this case, they need a speedy response. Data scientists may need more flexibility, with less governance; for this group, you might create a sandbox for exploratory work. By the same token, users in a company’s marketing department should not have access to the same data as users in the finance department.

With security policies in place, users only have access to the data sets assigned to their privilege levels.
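
As a minimal sketch of this idea, assume a hypothetical table that maps each user group to the data sets it may read. The group and data set names below are illustrative only and do not come from any particular product; a real data lake would enforce such rules in its security layer rather than in application code.

```python
# Hypothetical policy table: which user groups may read which data sets.
POLICIES = {
    "executive": {"board_report", "sales_summary"},
    "marketing": {"campaign_results", "sales_summary"},
    "finance": {"general_ledger", "sales_summary"},
    "data_science": {"sandbox"},
}


def can_access(group, data_set):
    """Return True only if the group's policy lists the data set."""
    return data_set in POLICIES.get(group, set())


def visible_data_sets(group):
    """List the data sets a group may see, sorted for stable display."""
    return sorted(POLICIES.get(group, set()))
```

Under these assumed policies, a marketing user can read `sales_summary` but not `general_ledger`, and a group with no policy entry sees nothing.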

You may also use security features to enable users to interact with the data, and contribute to data preparation and enrichment. For example, as users find data in the data lake through the catalog, they can be allowed to clean up the data, and enrich the fields in a data set, in a self-service manner.

Access controls can also enable a collaborative approach for accessing and consuming the data. For example, if one user finds a data set that she feels is important to a project, and there are three other team members on that same project, she can create a workspace with that data, so that it’s shared, and the team can collaborate on enrichments.

Using a Bottom-Up Approach to Data Governance to Rank Data Sets

The bottom-up approach to data governance, discussed in Chapter 2, enables you to rank the usefulness of data sets by crowdsourcing. When users rate which data sets are the most valuable, the word can spread to other users so they can make productive use of that data. This way, you are creating a single source of truth from the bottom up, rather than the top down.

To do this, you need a rating and ranking mechanism as part of your integrated data lake management platform. The obvious place for this bottom-up, watermark-based governance model would be the catalog. Thus the catalog has to have rating functions.

But it’s not enough to show what others think of a data set. An integrated data lake management and governance solution should show users the rankings of the data sets from all users, but it should also offer a personalized data rating, so that each individual can see what they have personally found useful whenever they go to the catalog.
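
The sketch below illustrates one way a catalog could keep both views: an overall ranking from all users and each user's personal rating. The `CatalogRatings` class, its method names, and the 1-to-5 star scale are assumptions made for illustration, not features of a specific catalog product.

```python
from collections import defaultdict
from statistics import mean


class CatalogRatings:
    """Toy in-memory store of 1-5 star ratings per data set and user."""

    def __init__(self):
        # data set name -> {user name -> rating}
        self._ratings = defaultdict(dict)

    def rate(self, user, data_set, stars):
        if not 1 <= stars <= 5:
            raise ValueError("stars must be between 1 and 5")
        self._ratings[data_set][user] = stars  # a re-rating overwrites

    def overall(self, data_set):
        """Average rating from all users, or None if unrated."""
        votes = self._ratings.get(data_set)
        return mean(votes.values()) if votes else None

    def personal(self, user, data_set):
        """The rating this user gave, or None if they gave none."""
        return self._ratings.get(data_set, {}).get(user)

    def ranking(self):
        """Data sets ordered from highest to lowest average rating."""
        return sorted(self._ratings, key=self.overall, reverse=True)
```

A catalog page could then show `overall()` next to every data set and highlight `personal()` for the signed-in user.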

Users also need tools to create new data models out of existing data sets. For example, users should be able to take a customer data set and a transaction data set and create a “most valuable customer” data set by grouping customers by transactions, and figuring out when customers are generating the most revenue. Being able to do these types of enrichments and transformations is important from an end-to-end perspective.
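
As a rough sketch of such an enrichment, the example below joins a small customer data set to a transaction data set, totals revenue per customer, and records the month in which each customer generated the most revenue. The field names and sample records are invented for illustration; at data lake scale the same logic would typically run in a distributed engine rather than in plain Python.

```python
from collections import defaultdict

# Illustrative data; the field names are assumptions.
customers = [
    {"customer_id": 1, "name": "Acme"},
    {"customer_id": 2, "name": "Globex"},
    {"customer_id": 3, "name": "Initech"},
]
transactions = [
    {"customer_id": 1, "month": "2016-01", "amount": 120.0},
    {"customer_id": 2, "month": "2016-01", "amount": 300.0},
    {"customer_id": 1, "month": "2016-02", "amount": 450.0},
    {"customer_id": 2, "month": "2016-02", "amount": 50.0},
]


def most_valuable_customers(customers, transactions):
    """Join the two data sets and rank customers by total revenue."""
    # customer_id -> {month -> revenue in that month}
    by_month = defaultdict(lambda: defaultdict(float))
    for t in transactions:
        by_month[t["customer_id"]][t["month"]] += t["amount"]

    rows = []
    for c in customers:
        months = by_month.get(c["customer_id"], {})
        rows.append({
            "name": c["name"],
            "total_revenue": sum(months.values()),
            # Month with the highest revenue, or None without transactions.
            "peak_month": max(months, key=months.get) if months else None,
        })
    return sorted(rows, key=lambda r: r["total_revenue"], reverse=True)
```

The result is itself a new data set that could be written back to the lake and registered in the catalog for others to use.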

Data Lakes in Different Industries

The data lake provides value in many different areas. Below are examples of industries that benefit from using a data lake to store and access information.

Healthcare

Many large healthcare providers maintain millions of records for millions of patients, including semi-structured reports such as radiology images, unstructured doctors’ notes, and data captured in spreadsheets and other common computer applications. A data lake is an obvious solution for such organizations, because it solves the challenge healthcare providers face with data storage, integration, and accessibility. By implementing a data lake based on a Hadoop architecture, a healthcare provider can enable distributed big data processing, by using broadly accepted, open software standards, and massively parallel commodity hardware.

Hadoop allows healthcare providers’ widely disparate records of both structured and unstructured data to be stored in their native formats for later analysis; this avoids the issue of forcing categorization of each data type, as would be the case in a traditional EDW. Not incidentally, preserving the native format also helps maintain data provenance and fidelity of the data, enabling different analyses to be performed using different contexts. With data lakes, sophisticated data analysis projects are possible, including those using predictive analytics to anticipate and take measures against frequent readmissions.

Financial Services

In the financial services industry, data lakes can be used to comply with the Dodd-Frank regulation. By consolidating multiple EDWs into one data lake repository, financial institutions can move reconciliation, settlement, and Dodd-Frank reporting to a single platform. This dramatically reduces the heavy lifting of integration, as data is stored in a standard yet flexible format that can accommodate unstructured data.

Retail banking also has important use cases for data lakes. In retail banking, large institutions need to process thousands of applications for new checking and savings accounts on a daily basis. Bankers who accept these applications consult third-party risk scoring services before opening an account, yet it is common for bank risk analysts to manually override negative recommendations for applicants with poor banking histories. Although these overrides can happen for good reasons (say there are extenuating circumstances for a particular person’s application), high-risk accounts tend to be overdrawn and cost banks millions of dollars in losses due to mismanagement or fraud.

By moving to a Hadoop data lake, banks can store and analyze multiple data streams, and help regional managers control account risk in distributed branches. They are able to find out which risk analysts were making account decisions that went against risk information from third parties. The net result is better control of fraud. Over time, the accumulation of data in the data lake allows the bank to build algorithms that detect subtle but high-risk patterns that bank risk analysts may have previously failed to identify.

Retail

Data lakes can also help online retail organizations. For example, retailers can store all of a customer’s shopping behavior in a data lake in Hadoop. By capturing web session data (session histories of all users on a page), retailers can do things like provide timely offers based on a customer’s web browsing and shopping history.

CHAPTER 6 Looking Ahead

As the data lake becomes an important part of next-generation data architectures, we see multiple trends emerging based on different vertical use cases that indicate what the future of data lakes will look like.

Ground-to-Cloud Deployment Options

Currently, most data lakes reside on-premises at organizations, but a growing number of enterprises are moving to the cloud because of the agility, ease of use, and economic benefits of a cloud-based platform. As clouds, both private and public, mature from security and multi-tenancy perspectives, we’ll see this trend intensify, and it’s important that data lake tools work across both environments.

As a result, we’re seeing increased adoption of cloud-based Hadoop infrastructures that complement and sometimes even replace on-premises Hadoop deployments. As data onboarding, management, and governance mature and become easier, data needs to be accessible in cloud-based architectures the same way it is available in on-premises architectures. Most data lake vendors are extending their tools so they work seamlessly across cloud and physical environments. This allows business users and data scientists to spin clusters up and down in the cloud, and create augmented platforms for both agile analytics and traditional queries.

With a cloud-to-ground environment, you have a hybrid architecture that may be useful for organizations that have yet to build their own private clouds. It can be used to store sensitive or vulnerable data that organizations can’t trust to a public cloud environment. At the same time, other, less-sensitive data sets can be moved to the public cloud.

Looking Beyond Hadoop: Logical Data Lakes

Another key trend is the emergence of logical data lakes. A logical data lake provides a unified view of data that exists across multiple data stores and across multiple environments in an enterprise.

In early Hadoop use cases, batch processing using MapReduce was the norm. Now in-memory technologies like Spark are becoming predominant, as they fit low-latency use cases that previously couldn’t be accomplished in a MapReduce architecture. We’re also seeing hybrid data stores, where data can be stored not only in HDFS, but also in object stores such as Amazon S3 or Azure Blob Storage, or in NoSQL databases.

Federated Queries

Federated queries go hand-in-hand with logical data lakes. As data is stored in different physical and virtual environments, you may need to use different query tools, and decompose a user’s query into multiple queries, sending them to on-premises data stores as well as cloud-based data stores, each of which possesses just part of the answer. Federated queries allow answers to be aggregated and combined, and sent back to the user so she gets one version of the truth across the entire logical data lake.
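
The decompose-and-combine pattern described above can be sketched as follows. The two in-memory lists stand in for an on-premises store and a cloud store, and the function names are hypothetical; a real federated engine would reach each store through its own client and push the sub-query down to it.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-ins for an on-premises store and a cloud store; each holds only
# part of the data. Real stores would be reached through their own clients.
ON_PREM_ORDERS = [("east", 100), ("west", 40)]
CLOUD_ORDERS = [("east", 25), ("north", 60)]


def query_store(rows, min_amount):
    """Sub-query: filter locally and return partial totals per region."""
    partial = {}
    for region, amount in rows:
        if amount >= min_amount:
            partial[region] = partial.get(region, 0) + amount
    return partial


def federated_totals(min_amount):
    """Send the sub-query to every store, then merge the partial answers."""
    stores = [ON_PREM_ORDERS, CLOUD_ORDERS]
    with ThreadPoolExecutor() as pool:
        partials = pool.map(lambda rows: query_store(rows, min_amount), stores)
        combined = {}
        for partial in partials:
            for region, total in partial.items():
                combined[region] = combined.get(region, 0) + total
    return combined
```

Each store filters and aggregates its own rows, so only small partial results travel back to be merged into the single answer the user sees.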

Data Discovery Portals

Another trend is to make data available to consumers via metadata-rich data catalogs put into a data-as-a-service framework. Many enterprises are already building these portals out of shared Hadoop environments, where users can browse what data is available in the data lake, and have an Amazon-like shopping cart experience where they select data based on various filters. They can then create a sandbox for that data, perform exploratory ad hoc analytics, and feed the results back into the data lake to be used by others in the organization.
