IBM Software The fundamentals of data lifecycle management in the era of big data


How data lifecycle management complements a big data strategy


1 Introduction

2 Big data, big impact: Dealing with the three Vs

3 Best practices: Putting data lifecycle management into action

4 The power of enterprise-scale data lifecycle management

5 Enhance data warehouse agility with IBM InfoSphere

6 Why InfoSphere?

Introduction

Organizations are eager to harness the power of big data. But as new big data opportunities emerge, ensuring that information is trusted and protected becomes exponentially more difficult. If these challenges are not addressed directly, end users may lose confidence in the insights generated from their data, which can leave them unable to act on new opportunities or address threats.

The tremendous volume, variety and velocity of big data mean that the old manual methods of discovering, governing and correcting data are no longer feasible. Organizations need to automate information integration and governance from the start.

By automating information integration and governance and employing it at the point of data creation and throughout its lifecycle, organizations can help protect information and improve the accuracy of big data insights.

Information integration and governance solutions must become a natural part of big data projects. They must support automated discovery and profiling and they must facilitate an understanding of diverse data sets to provide the complete context required to make informed decisions. They must be agile enough to accommodate a wide variety of data and seamlessly integrate with diverse technologies, from data marts to Apache Hadoop systems. Plus, they must discover, protect and monitor sensitive information across its lifecycle as part of big data applications.

Understanding the context of data and being able to extract the precise information necessary to meet a business objective is key to utilizing big data to the fullest. Managing the data lifecycle so that data is accurate, is appropriately used and is correctly stored to meet the required service levels and retention needs has wide-ranging benefits. These benefits include risk reduction, performance improvements and preventing an overload of useless information.

This e-book explores the challenges of managing big data, best practices for enterprise-scale data lifecycle management and how IBM® InfoSphere® Optim™ data lifecycle management solutions incorporate a comprehensive range of information integration and governance capabilities that enable companies to properly manage data over its lifetime.

Big data, big impact: Dealing with the three Vs

Without effective data lifecycle management, the increasing volume, variety and velocity of big data can reduce performance, erode margins and amplify risks.

Performance and time-to-market

As more users execute more queries on larger data volumes, slow response times and degraded application performance become major issues. If left unchecked, continued data growth will stretch resources beyond capacity and negatively impact response time for critical queries and reporting processes. These problems can affect production environments and hamper upgrades, migrations and disaster recovery efforts. Implementing intelligent data management of historical, dormant data is essential for avoiding these potentially business-halting issues.

Rapid data growth also makes testing more difficult. As data warehouses and big data environments grow to petabytes or more, testing processes are taxed by having to cull data for their specific needs. The results include longer test cycles, slower time-to-market and fewer defects identified in advance of release. Speeding up testing workflows and delivery of data warehouses requires organizations to automate the creation of realistic rightsized test data—while keeping appropriate security measures in place.

Margins

Exponential data growth also can drive up infrastructure and operational costs, often consuming most of an organization’s data warehousing or big data budget. Rising data volumes require more capacity, and organizations often must buy more hardware and spend more money to maintain, monitor and administer their expanding infrastructure. Large data warehouses and big data environments generally require bigger servers, appliances and testing environments, which can also increase software licensing costs for the database and database tooling, not to mention labor, power and legal costs.


Risks

Following the “let’s keep it in case someone needs it later” mandate, many organizations already keep too much historical data. According to the CGOC 2012 Summit Survey, 69 percent of data has no value. Opening the doors to excessive storage and retention only exacerbates the situation.

At the same time, organizations must ensure the privacy and security of the growing volumes of confidential information. Government and industry regulations from around the world, such as the Health Insurance Portability and Accountability Act (HIPAA), the Personal Information Protection and Electronic Documents Act (PIPEDA) and the Payment Card Industry Data Security Standard (PCI DSS) require organizations to protect personal information no matter where it lives, even in test and development environments.

Maintaining compliance with data retention regulations, protecting privacy and archiving data are not just legal matters; they are essential for sustaining customer satisfaction and brand reputation. In recent IBM surveys, respondents indicate that data theft/cybercrime is the number-one threat to a company’s reputation, a greater threat than system failures. Sixty-four percent of respondents say their company will be focusing more on managing and protecting their reputation than they did five years ago.¹

Figure: Data breaches and attacks risk negative consumer sentiment

• 75% of IT risks impact customer satisfaction and brand reputation

• 43% are increasing focus on reputational risk because of growth in emerging technologies such as social media

Source: Insights from the 2012 Global Reputational Risk and IT Study.

The danger of treating a backup as an archive

Many organizations are confused about the difference between archiving and backing up data. Archiving preserves data, providing a long-term repository of information that can be used by litigation and audit teams. By contrast, backing up data involves copying production data and moving it to another environment to enable disaster recovery and the restoration of deleted files. Backups are often retained for a short time, until a fresh backup replaces the existing backup. Archiving complements backups by removing old, redundant and infrequently accessed data from a system and by reducing the size of databases and their backups.

Approximately 75 percent of the data stored is typically inactive, rarely accessed by any user, process or application. An estimated 90 percent of all data access requests are serviced by new data, usually data that is less than a year old.² With an effective archiving strategy, organizations can protect old data and comply with data retention rules while reducing costs and enhancing system performance.

In an attempt to meet archiving needs, some organizations simply back up data to a Hadoop environment. But this kind of backup will not ensure that data will be fully protected or remain query-able, the way a true archive would. With an effective data lifecycle management solution, companies can create an archive that protects data, meets compliance standards, and supports queries and reporting. An emerging trend is for organizations to use Hadoop as a lower-cost storage alternative for archives.


Best practices: Putting data lifecycle management into action

The data lifecycle stretches through multiple phases as data is created, used, shared, updated, stored and eventually archived or defensibly disposed of. Data lifecycle management plays an especially key role in three areas: archiving, test data management and data masking.

Archiving

Retention policies are designed to keep important data elements for reference and for future use while deleting data that is no longer necessary to support the legal needs of an organization. Effective data lifecycle management includes the intelligence not only to archive data in its full context, which may include information across dozens of databases, but also to archive it based on specific parameters or business rules, such as the age of the data. It can also help storage administrators develop a tiered and automated storage strategy to archive dormant data in a data warehouse, thereby improving overall warehouse performance.

Figure: Where management tasks fall in the data lifecycle

Lifecycle stages shown: Create, Use, Share, Update, Store/retain, Archive, Dispose. Management tasks shown: Test data management, Data masking, Archiving.

The entire data lifecycle (shown as the grey circle) benefits from good governance, but management capabilities that focus on the use, share and archive steps have wide-ranging benefits for cost reduction and efficiency gains.

Many organizations envision big data as a large, pristine, centralized “data lake.” But a data lake can quickly turn into a “data swamp” when data is poorly managed and controlled. By setting up an intelligent data lifecycle management strategy and archiving to inexpensive storage, you can avoid turning your big data environment into a dumping ground.

Test data management

In development, testers must automate the creation of realistic, rightsized data sources that mirror the behaviors of existing production databases. To ensure that queries can be run easily and accurately, they must create a subset of actual production data and reproduce actual conditions to help identify defects or problems as early as possible in the testing cycle.

The tremendous size of big data systems creates challenges for testers. There is a greater need to speed delivery of big data applications, requiring organizations to create realistic, rightsized, masked test data for testing those applications for performance and functionality. Testers also need ways to generate test data sets that facilitate realistic functional and performance testing. Because production data contains information that may identify customers, organizations must mask that information in test environments to maintain compliance and privacy.
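To make "rightsized" concrete, the following Python sketch draws a small sample of customers and then pulls only the orders that belong to them, so the subset stays referentially intact. The record layout, the function name subset_with_related and the 30 percent fraction are assumptions chosen for illustration; real test data tools follow relationships across many more tables, and masking would still need to be applied to the result.

```python
import random

def subset_with_related(customers, orders, fraction, seed=0):
    """Sample a fraction of customers and keep every order that belongs to them."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible between test runs
    k = max(1, round(len(customers) * fraction))
    chosen = rng.sample(customers, k)
    chosen_ids = {c["cust_id"] for c in chosen}
    # Keep only child rows whose parent was sampled, so no order points at a missing customer.
    related = [o for o in orders if o["cust_id"] in chosen_ids]
    return chosen, related

# Made-up "production" data: 10 customers with 3 orders each.
customers = [{"cust_id": i, "name": f"Customer {i}"} for i in range(1, 11)]
orders = [{"order_id": n, "cust_id": (n % 10) + 1} for n in range(1, 31)]

test_customers, test_orders = subset_with_related(customers, orders, fraction=0.3)
print(len(test_customers), "customers and", len(test_orders), "orders in the test subset")
```

Sampling parents first and then following the relationship to the children is one simple design choice; it keeps the subset small while preserving the joins that application queries depend on.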

Figure: Many organizations hope that big data will provide a large, centralized “lake” of data, but in many cases, it becomes a data swamp full of unreliable information.

• 31% Enterprise information, made up of: 25% Has business utility; 5% Regulatory record keeping; 1% Subject to legal hold

• 69% Everything else

Applying data masking techniques to the test data means testers use realistic-looking but fictional data; no actual sensitive data is revealed. Application developers can also use test data management technologies to easily access and refresh test data, which speeds the testing and delivery of the new data source. Organizations also need ways to mask certain sensitive data, such as credit card and phone numbers. While testing their big data environments, they must mask sensitive data from unauthorized users, even though those users might be authorized to see the data in aggregate. For example, a pharmaceutical company that is testing its data warehouse environment might mask Social Security numbers and dates of birth but not patients’ ages and other demographic information. Masking certain data this way satisfies corporate and industry regulations by removing identifiable information, while still maintaining business context and referential integrity for testing in non-production environments.

Figure: Data masking techniques protect the confidentiality of private information.

Original data

Customers table
Cust ID  Name            Street
08054    Alice Bennett   2 Park Blvd
19101    Carl Davis      258 Main
27645    Elliot Flynn    96 Avenue

Orders table
Cust ID  Item #   Order date
27645    80-2382  20 June 2004
27645    86-4538  10 October 2005

De-identified data

Customers table
Cust ID  Name            Street
10000    Auguste Renoir  23 Mars
10001    Claude Monet    24 Venus
10002    Pablo Picasso   25 Saturn

Orders table
Cust ID  Item #   Order date
10002    80-2382  20 June 2004
10002    86-4538  10 October 2005
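The masking example in the figure can be reproduced with a short Python sketch. It assumes a simple lookup-table approach: each original customer ID is assigned a surrogate ID once, and the same mapping is applied to every table, which is what keeps the Orders rows pointing at the right masked customer. The function names and the list of fictional identities are hypothetical, and this is not a description of how InfoSphere Optim performs masking.

```python
def build_id_map(customers, start=10000):
    """Assign each original customer ID a surrogate ID, in order of appearance."""
    return {c["cust_id"]: str(start + i) for i, c in enumerate(customers)}

def mask_customers(customers, id_map, fake_identities):
    """Replace IDs, names and streets with fictional values."""
    if len(fake_identities) < len(customers):
        raise ValueError("need one fictional identity per customer")
    return [
        {"cust_id": id_map[c["cust_id"]], "name": name, "street": street}
        for c, (name, street) in zip(customers, fake_identities)
    ]

def mask_orders(orders, id_map):
    """Swap only the customer ID; non-sensitive business fields are left untouched."""
    return [{**o, "cust_id": id_map[o["cust_id"]]} for o in orders]

# Data taken from the figure above.
customers = [
    {"cust_id": "08054", "name": "Alice Bennett", "street": "2 Park Blvd"},
    {"cust_id": "19101", "name": "Carl Davis", "street": "258 Main"},
    {"cust_id": "27645", "name": "Elliot Flynn", "street": "96 Avenue"},
]
orders = [
    {"cust_id": "27645", "item_no": "80-2382", "order_date": "20 June 2004"},
    {"cust_id": "27645", "item_no": "86-4538", "order_date": "10 October 2005"},
]
fake_identities = [("Auguste Renoir", "23 Mars"), ("Claude Monet", "24 Venus"), ("Pablo Picasso", "25 Saturn")]

id_map = build_id_map(customers)
masked_customers = mask_customers(customers, id_map, fake_identities)
masked_orders = mask_orders(orders, id_map)
```

Because the ID mapping can be used to reverse the masking, in practice it would need to be protected as carefully as the original data or discarded once the test data set has been produced.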


As volume, variety and velocity increase the complexity of data infrastructures, scaling test environments becomes a significant problem. It isn’t unusual for Fortune 500 companies to spend up to USD30 million building a single test lab, and many of these organizations have dozens of labs. Add in rising wages, and testing costs begin to spiral out of control.

Figure: Complex IT landscapes make setting up test labs extremely costly. Components shown in a heterogeneous environment: routing services, third-party services, collaboration, web/Internet, portals, content providers, business partners, archives, shared services, messaging services, EJB, file systems, enterprise service bus, mainframe, directory identity and data warehouse.

The power of enterprise-scale data lifecycle management

Effective data lifecycle management benefits both IT and business stakeholders:

• Increasing margin: Lower infrastructure and capital costs, improved productivity and reduced application defects during the development lifecycle.

• Reducing risks: Reduced application downtime, minimized service and performance disruptions, and adherence to data retention requirements.

• Promoting business agility: Improved time-to-market, increased application performance and improved quality of applications through realistic test data.

With InfoSphere Optim, organizations gain a single data lifecycle management solution that can scale to meet enterprise needs. Whether they implement InfoSphere Optim for a single application, data warehouse or big data environment, organizations can streamline data lifecycle management with a consistent strategy. The unique relationship engine in InfoSphere Optim provides a single point of control to guide data processing activities such as archiving, subsetting and retrieving data.

Enhance data warehouse agility with IBM InfoSphere

InfoSphere Optim solutions help organizations meet requirements for information integration and governance and address challenges exacerbated by the increasing volume, variety and velocity of data. By archiving old data from huge data warehouse environments, businesses can improve response times and reduce costs by reclaiming valuable storage capacity. By creating realistic, rightsized data sources for testing, they can enhance the accuracy of testing and identify problems early in the testing cycle. And by implementing data masking capabilities, they can protect sensitive data and help ensure compliance with privacy regulations.

As a result, organizations gain more control of their IT budget while simultaneously helping their big data and data warehouse environments run more efficiently and reducing the risk of exposure of sensitive data.

InfoSphere Optim supports major big data and data warehouse environments, including IBM PureData™ for Analytics, IBM PureData for Transactions, IBM InfoSphere BigInsights™, Teradata, Oracle and popular Hadoop distributions. It also supports enterprise databases and operating systems, including IBM DB2®, Oracle Database, Sybase, Microsoft SQL Server, IBM Informix®, IBM IMS™, IBM Virtual Storage Access Method (VSAM), Microsoft Windows, UNIX, Linux and IBM z/OS®. In addition, InfoSphere Optim supports key enterprise resource planning (ERP) and customer relationship management (CRM) applications such as Oracle E-Business Suite, PeopleSoft Enterprise, JD Edwards EnterpriseOne, Siebel, Amdocs CRM and the SAP ERP and CRM applications, as well as many custom applications.

The value of test data management at a US insurance company

With 42 high-volume back-end systems needed to generate a full end-to-end system test, a US insurance company could not confidently launch new features. Testing in production was becoming the norm. In fact, claims could not be processed in certain states because of application defects that the teams skipped over during the testing process. IT was consuming an increasing number of resources, yet application quality was declining rapidly.

After implementing a process to govern test data management, the insurance company reduced the costs of testing by USD400,000 per year. Today, the company can easily refresh 42 test systems from across the organization in record time while finding defects in advance.

The business value from implementing test data management included:

• 41 percent less labor required over 12 months

• Cost savings of approximately USD500,000 per year

• 44 percent fewer untested scenarios


Why InfoSphere?

As the foundation of the IBM big data platform, InfoSphere provides market-leading functionality across all the capabilities of information integration and governance. It is designed to handle the challenges of big data by providing optimal scale and performance for massive data volumes, agile and rightsized integration and governance for the increasing velocity of data, and support for a wide variety of data types and big data systems. InfoSphere helps make big data and analytics projects successful by delivering the confidence to act on insight.

InfoSphere capabilities include:

• Metadata, business glossary and policy management: Define metadata, business terminology and governance policies with IBM InfoSphere Business Information Exchange.

• Data integration: Handle all integration requirements, including batch data transformation and movement (InfoSphere Information Server), real-time replication (InfoSphere Data Replication) and data federation (InfoSphere Federation Server).

• Data quality: Parse, standardize, validate and match enterprise data with InfoSphere Information Server for Data Quality.

• Master data management: Act on a trusted view of your customers, products, suppliers, locations and accounts with InfoSphere MDM.

• Data lifecycle management: Manage data throughout its lifecycle, from requirements through retirement, with InfoSphere Optim test data automation and database archiving capabilities.

• Data security and privacy: Continuously monitor data access and protect repositories from data breaches, and support compliance with IBM InfoSphere Guardium®. Ensure sensitive data is masked and protected with InfoSphere Optim.


Additional resources

Ready to get started? Take a self-service InfoSphere Optim Business Value Assessment and show the ROI results to your big data project owner.

To learn more about InfoSphere Optim, check out these resources:

• Manage the Data Lifecycle of Big Data Environments

• IBM InfoSphere Optim solutions for data warehouses

• Demo: IBM InfoSphere Optim Data Growth Solution

• Demo: IBM InfoSphere Optim Test Data Management Solution

To learn more about the IBM approach to information integration and governance for big data, please contact your IBM representative or IBM Business Partner, or visit: ibm.com/software/data/information-integration-governance


IMM14126-USEN-00

jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at “Copyright and trademark information” at ibm.com/legal/copytrade.shtml

Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both.

Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both.

UNIX is a registered trademark of The Open Group in the United States and other countries.

This document is current as of the initial date of publication and may be changed by IBM at any time. Not all offerings are available in every country in which IBM operates.

THE INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS” WITHOUT ANY WARRANTY, EXPRESS OR IMPLIED, INCLUDING WITHOUT ANY WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND ANY WARRANTY OR CONDITION OF NON-INFRINGEMENT. IBM products are warranted according to the terms and conditions of the agreements under which they are provided.

The client is responsible for ensuring compliance with laws and regulations applicable to it. IBM does not provide legal advice or represent or warrant that its services or products will ensure that the client is in compliance with any law or regulation.


1 Yuhanna, Noel. “Your Enterprise Data Archiving Strategy.” Forrester. February 2011. ftp://ftp.boulder.ibm.com/software/data/sw-library/data-management/optim/papers/your-enterprise-data-archiving-strategy.pdf