M6A. TDWI Data Virtualization: Solving Complex Data Integration Challenges. Mark Peco


European TDWI Conference

with BARC@TDWI-Track June 22 – 24, 2015

MOC Munich / Germany

TDWI Data Virtualization:

Solving Complex Data Integration Challenges

Mark Peco


COURSE OBJECTIVES

You will learn:

• How models are used to define and frame analytic needs

• Data virtualization definitions and terminology

• Business case and technical rationale for data virtualization

• Key concepts and foundational principles of virtualization: views, services, etc.

• Data virtualization life cycle, capabilities, and processes

• How to extend the data warehouse with virtualization

• How virtualization enables federation and enterprise data integration

• How virtualization is applied to big data and cloud data challenges

• How companies use virtualization to solve business problems and drive business agility

The Data Warehousing Institute takes pride in the educational soundness and technical accuracy of all of our courses. Please send us your comments—we’d like to hear from you. Address your feedback to:

[email protected]

Publication Date: August 2013

© Copyright 2013 by The Data Warehousing Institute. All rights reserved. No part of this document may be reproduced in any form, or by any means, without written permission from The Data Warehousing Institute.


The Oxford Dictionary defines virtual as “not physically existing as such but made by software to appear to do so.” Virtual data, then, is a data structure that appears to exist but does not exist as a physically stored set of data. Data virtualization (DV) includes the processes and technologies that are used to create virtual data.

Wikipedia describes data virtualization as “the presentation of data as an abstract layer, independent of underlying database systems, structures, and storage.” This definition captures two key elements of data virtualization:

• abstraction

• decoupling (removal of dependencies)

FROM THE EXPERTS

The facing page shows two definitions from recognized experts in the subject of data virtualization. Key concepts in Rick van der Lans’s definition include:

• virtualization as a process

• data consumers

• hidden technology

Judith Davis and Robert Eve define virtualization from a purposeful perspective, with the purpose encompassing:

• integration of disparate data

• reach across internal and external data sources

• complete information

• high-quality information

• actionable information


Data Virtualization Basics


Virtualization vs. Materialization

BUSINESS AND TECHNICAL PERSPECTIVES

The physical form and location in which data is stored is generally of little interest to data consumers, and it can be a real barrier to finding the data that is needed. When data is found, its physical structure is typically optimized for database and application performance. These are important technology considerations, but they make data harder to understand and to access.

ABSTRACT vs. PHYSICAL

The technology view of data is necessarily physical, working with data locations and database structures. The business view of data is more abstract, working with views that vary depending on the processes and circumstances in which data is applied. Both needs are readily met when data is materialized (managed physically) for technology purposes and virtualized (managed abstractly) for business purposes. Data structures provide the means to map from material to abstract. Physical models describe materialized data structures. Logical and conceptual models describe virtualized data structures.


VIRTUAL vs. MATERIAL DATA INTEGRATION

The facing page illustrates the distinction between materialized and virtualized data with two examples that integrate the same disparate data toward similar but distinct goals. These examples are typical of the data integration challenges for a data warehouse, operational data store, or master data hub.

The goal of materialization is a single source of rationalized data, where source implies a physical database.

The goal of virtualization is a single view of rationalized data, where view implies a logical, but not physically instantiated, data structure.
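The difference between a single source and a single view can be sketched with Python's built-in sqlite3 module. The table and view names (crm_customers, customer_copy, customer_view) are hypothetical, and a single in-memory database stands in for what would normally be several disparate sources. This is a minimal illustration, not a description of how any particular virtualization platform works.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE crm_customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO crm_customers VALUES (1, 'Acme'), (2, 'Globex')")

# Materialization: copy the rows into a second physical table.
conn.execute("CREATE TABLE customer_copy AS SELECT id, name FROM crm_customers")

# Virtualization: define a view; no rows are stored for it.
conn.execute("CREATE VIEW customer_view AS SELECT id, name FROM crm_customers")

# A change at the source after the copy was taken.
conn.execute("INSERT INTO crm_customers VALUES (3, 'Initech')")


def count_rows(relation):
    """Count the rows visible through a table or view."""
    return conn.execute(f"SELECT COUNT(*) FROM {relation}").fetchone()[0]


materialized_rows = count_rows("customer_copy")  # still the old snapshot
virtual_rows = count_rows("customer_view")       # reflects the source as it is now
print(materialized_rows, virtual_rows)
```

The materialized copy stays at its snapshot until it is reloaded, while the view always resolves against the current source rows.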


Virtualization vs. Synchronization

MATCHING MULTIPLE DATABASES

Synchronization is a somewhat different data integration requirement than the consolidation work performed for data warehouses. The purpose of synchronization is to keep multiple databases aligned in time and state – to maintain multiple copies of the same data where the data values are consistent across all copies at all times.

Common use cases for synchronization include MDM synchronization of master reference data, geographically distributed data, local copy with cloud-hosted master, and application integration such as CRM to ERP alignment.

Synchronization rules may be defined in a variety of forms:

• Master-slave priority – where one database is always the master copy and changes are pushed to all other copies.

• Most recent transaction priority – where every transaction that occurs in any database is propagated to all other copies in chronological sequence.

• Rule-based data selection – where a complex business rule is used to determine from which database a value is pushed to other copies of a data item.

Synchronization is not a form of virtualization because each distinct copy of the data is materialized. It is practical, however, to sometimes replace or supplement synchronization with virtual data services.
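As one illustration of the rules above, most recent transaction priority can be sketched in a few lines of Python. The Change record, the synchronize function, and the sample keys are hypothetical names chosen for this sketch, and it assumes that timestamps are comparable across databases, which real systems must work to guarantee.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    timestamp: int  # assumed to be comparable across all databases
    key: str
    value: str
    origin: str     # database where the transaction occurred


def synchronize(copies, changes):
    """Propagate every change to every copy in chronological sequence."""
    for change in sorted(changes, key=lambda c: c.timestamp):
        for copy in copies.values():
            copy[change.key] = change.value
    return copies


copies = {"crm": {}, "erp": {}}
changes = [
    Change(2, "cust-1:phone", "555-0200", origin="erp"),
    Change(1, "cust-1:phone", "555-0100", origin="crm"),
]
synchronize(copies, changes)
print(copies)
```

Every copy ends up holding the value from the latest transaction. Each copy is still a materialized data store, which is why synchronization is not itself virtualization.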


Virtualization vs. Federation

INTEGRATED YET AUTONOMOUS

According to Rick van der Lans, the term federation “refers to combining autonomously operating objects” and data federation is “combining autonomous data stores to form one large data store.” [1] On that basis, van der Lans provides the definition shown on the facing page.

FEDERATION AS PART OF DATA INTEGRATION

Data integration, as we’ve discussed thus far, encompasses two techniques: materialization, which creates physical sources of integrated data, and virtualization, which creates non-material views of integrated data.

Federation is a subset of virtualization – a specific use of virtualization to create views that rationalize multiple, disparate, and autonomous data stores. As van der Lans says, “Not all forms of data virtualization imply federation … but federation always results in virtualization.” [2]

PRINCIPLES OF FEDERATION

Four principles capture the essence of data federation:

• Virtualization: Data federation is a form of data virtualization.

• Heterogeneity: Data federation works across multiple data types, data structures, data storage technologies, and data access methods.

• Autonomy: Each data store integrated through data federation is also able to operate independently and be applied for uses outside the scope of federation.

• On-Demand: Integration is triggered by a consumer request. Data access and integration occur only when the consumer asks for data.
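The four principles can be seen together in a small Python sketch. It assumes two hypothetical sources, a relational orders database and a delimited customer extract, and a hypothetical function federated_order_totals. It illustrates the idea only and does not represent any vendor's federation engine.

```python
import csv
import io
import sqlite3

# Source 1: a relational store (hypothetical orders database).
orders_db = sqlite3.connect(":memory:")
orders_db.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
orders_db.executemany(
    "INSERT INTO orders VALUES (?, ?)", [(1, 120.0), (1, 30.0), (2, 75.0)]
)

# Source 2: a delimited file (hypothetical customer extract).
CUSTOMERS_CSV = "id,name\n1,Acme\n2,Globex\n"


def federated_order_totals():
    """Combine both sources only when a consumer asks for the data."""
    names = {
        int(row["id"]): row["name"]
        for row in csv.DictReader(io.StringIO(CUSTOMERS_CSV))
    }
    totals = orders_db.execute(
        "SELECT customer_id, SUM(amount) FROM orders "
        "GROUP BY customer_id ORDER BY customer_id"
    ).fetchall()
    return [(names[customer_id], total) for customer_id, total in totals]


result = federated_order_totals()
print(result)
```

Neither source is copied or altered (autonomy), the sources use different storage and access methods (heterogeneity), the combined result exists only as the function's return value (virtualization), and nothing is integrated until the function is called (on-demand).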

[1] Clearly Defining Data Virtualization, Data Federation, and Data Integration, van der Lans, http://www.b-eye-network.com/view/14815

[2] Clearly Defining Data Virtualization, Data Federation, and Data Integration, van der Lans, http://www.b-eye-network.com/view/14815


History and Evolution

A TIMELINE VIEW OF DATA INTEGRATION

Prior to the late 1980s, data integration was typically a handcrafted point solution to a specific problem or need. In 1988, IBM researchers Barry Devlin and Paul Murphy coined the term business data warehouse. Bill Inmon popularized warehousing and moved it from experimental to mainstream in the early 1990s with his book Building the Data Warehouse. Early data warehouses (and most data warehouses today) were clearly physical. The concept of a virtual data warehouse surfaced frequently but struggled to gain wide acceptance. Still, the issues of “lift and shift” data duplication were cause for concern, and the concept of virtualization persisted. As enterprise information integration (EII) tools matured to become data virtualization tools, and as new kinds of data and new expectations pushed the limits of batch ETL, data virtualization gained acceptance.

Change is driving adoption of data virtualization today: change in data types and change in business expectations about information velocity and business agility. Data virtualization doesn’t replace ETL, but it is an essential part of the integration toolbox. Today ETL is familiar and comfortable for most data integrators. They look to data virtualization only when ETL can’t get the job done: when batch ETL is too slow, the data sources are difficult to access, or the data types are challenging.

Change will continue to drive the evolution. Look at the timeline on the facing page. Note the sparseness on the left and the progressively increasing density as you move from left to right. The pattern is indicative of accelerating change and accelerating challenges for data integrators and data providers.

In time, data virtualization will become familiar and comfortable. Expect that in the future virtualization will take center stage. We will choose data virtualization first, and turn to ETL and materialization when data virtualization doesn’t meet the needs, for example when transformations are highly complex or when history must persist beyond its lifespan in source systems.

The reality is that ETL and data virtualization are not competing technologies – they are complementary. Data virtualization adds a new tool to the data integrator’s toolbox. Technology decisions should be based on requirements – choosing the best tool for the job.


Why Data Virtualization?


Business Agility

SHAPING YOUR BUSINESS FUTURE

Business agility is the popular term that describes the capabilities of a business to quickly respond to changing conditions. In a complex, competitive, and continuously changing business environment, it is easy to understand why agility is important. But knowing that it matters doesn’t make it easy to achieve. Judith Davis and Robert Eve state the case clearly: “While the importance of business agility is well understood, achieving it is a difficult and ongoing challenge. The key to success is information. Armed with the right information, business decision makers can better evaluate their environment and decide how to adapt it for future success.” [1]

There are two key messages in this quote:

• Business agility depends on information.

• Business agility is about shaping the future of your business.

DIMENSIONS OF AGILITY

Davis and Eve continue to describe three aspects of agility, all of which must be satisfied to achieve true business agility:

• Decision agility describes the speed at which informed decisions can be made.

• Time-to-solution agility describes the cycle time from recognizing a business need to delivery of the information services that are needed to respond to the need.

• Resource agility describes the ability of information services organizations to adjust people, projects, and priorities to quickly respond to business pressures.


The Data Virtualization Business Case

VIRTUALIZATION ENABLES AGILITY

The business case for technology initiatives is typically based in financials: ROI, TCO, payback time, etc. The data virtualization business case is less finance oriented than outcome oriented. Perhaps that is because the decision to pursue data virtualization isn’t truly a technology initiative. It is a business initiative to improve the speed and quality of actionable information.

Data virtualization enables business agility, actionability, information speed, and information quality with:

• Rapid data integration, which results in quicker time-to-solution for business information needs.

• More information opportunities with reach into the new types and greater volumes of data that are available today.

• More robust business analysis through more types of data and more extensive data integration.

• More complete information through reach to new data types and greater data volumes.

• Better quality information that is translated into business syntax and context instead of being delivered in systems and data storage context.

• Simplified data governance by reducing the number of replicated and redundant data stores that must be reconciled.

• Clear connection of information and its value with the time and resources used to get information.

• Less costly information infrastructure by reducing costly “lift and shift” processes and databases.


The Data Virtualization Technical Case

BUSINESS ALIGNED TECHNOLOGY

While data virtualization is motivated by business agility, it does have substantial technical implications, enabling IT organizations to become more responsive to continuous and quickly changing needs for business information. Data virtualization enables fast and effective delivery of business information by:

• Making data integration easier to achieve both in scope and timeliness of information.

• Providing a platform for rapid, iterative development where information requirements can be discovered and change is not a barrier to quick delivery.

• Reducing development cycles (time to solution) by eliminating the need to design and develop redundant data stores and processes to “lift and shift” data.

• Making developers more productive with a development environment that focuses on business-perspective information delivery instead of detailed mechanics of data manipulation.

• Supporting the discovery-driven requirements and test-driven development needs of agile development projects.

• Breaking down the barriers of integrating structured and unstructured data into a single consumer view of information.

• Providing fast, easy access to cloud-hosted databases of all types.

• Meeting performance expectations and SLAs through query performance optimization.

• Reducing the maintenance and management overhead of data integration systems.

• Working together with ETL-based integration in a way that allows each technology to do what it does best.

• Extending the data integration toolbox with a new tool that doesn’t demand radical change and readily supports systematic migration.


The Data Virtualization Foundation


Views

WINDOWS INTO COMPLEX DATA

The central component of a data virtualization platform is a view. Views are purposeful means of seeing complex data in simplified and specific context that is matched to the viewer’s perspective. Depending on the specific virtualization platform being used and the data types involved, the views may be SQL-based, XML-based, services-based, etc.

MULTIPLE VIEWS

The earlier statement that “views are purposeful means …” captures an important consideration. Much of the value of views is that they are not “one size fits all” data structures. Data integrators work with three distinct kinds of views – three purposes for working with data:

• Connection views serve the purpose of accessing data sources, corresponding to extraction in ETL processing. These are the windows through which we see the content of disparate data sources. Connection views may include a degree of normalization and rationalization.

• Integration views are used to combine and connect data from disparate sources, corresponding with transformation in ETL processing. Integration views show data relationships, resolve inconsistencies, rationalize data formats and values, and improve data quality.

• Consumer views are the business-oriented windows into data, with some correspondence to load functions of ETL processing. A significant difference from ETL is that load simply places data into a different collection of database tables; consumer views are more similar to publishing of business information.
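The three kinds of views can be sketched as layered SQL views using Python's built-in sqlite3 module. All table, view, and column names here are hypothetical, and both "sources" live in one in-memory database for brevity. A real platform would connect to separate systems.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Two disparate sources with different naming and formats (hypothetical).
CREATE TABLE erp_cust (CUST_NO INTEGER, CUST_NM TEXT, CTRY TEXT);
CREATE TABLE web_users (user_id INTEGER, full_name TEXT, country_code TEXT);
INSERT INTO erp_cust VALUES (1, 'ACME CORP', 'DE');
INSERT INTO web_users VALUES (7, 'Globex Ltd', 'de');

-- Connection views: access each source, with light normalization of names.
CREATE VIEW conn_erp AS
  SELECT CUST_NO AS source_id, CUST_NM AS name, CTRY AS country FROM erp_cust;
CREATE VIEW conn_web AS
  SELECT user_id AS source_id, full_name AS name, country_code AS country
  FROM web_users;

-- Integration view: combine the sources and rationalize value formats.
CREATE VIEW int_customer AS
  SELECT 'erp' AS source, source_id, name, UPPER(country) AS country FROM conn_erp
  UNION ALL
  SELECT 'web' AS source, source_id, name, UPPER(country) AS country FROM conn_web;

-- Consumer view: a business-oriented window onto the integrated data.
CREATE VIEW german_customers AS
  SELECT name AS customer_name FROM int_customer WHERE country = 'DE';
""")

customers = [
    row[0]
    for row in conn.execute(
        "SELECT customer_name FROM german_customers ORDER BY customer_name"
    )
]
print(customers)
```

Each layer is defined only in terms of the layer below it, so a change to a source's column names is absorbed in its connection view and does not reach consumers.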


Query Optimization

SPEED OF INFORMATION

Optimization is an essential element of any data virtualization platform. Unlike the ETL-based warehouse, the data is not stored in ready-to-access data marts. It resides only at the source until requested by a consumer’s query. The good news: you haven’t done a lot of work to integrate lots of data that is never accessed. The bad news: when data is requested, all of the work from access, through integration, to information delivery must be performed in real time. Thus, optimization is a must; not only must the work be performed but it must be done fast.

To enable business and IT agility, a data virtualization tool must:

• Recognize that the network is a bottleneck. Effectively optimize to minimize network traffic without information loss.

• Recognize that each data source has its unique and technology-specific access methods. For example, avoid using generic access language such as ANSI SQL. Instead translate to use the SQL native to each relational data source.

• Push as much work as practical to the data source. Use source-specific features when practical to minimize the work done via virtualization views and to reduce the volume of data moving across the network.
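The effect of pushing work to the source can be sketched by counting how many rows leave the source in each approach. The sales table and both function names are hypothetical, and an in-memory SQLite database stands in for a remote source. Real optimizers make this choice automatically and with far more sophistication.

```python
import sqlite3

source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE sales (region TEXT, amount REAL)")
source.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 10.0), ("south", 20.0), ("north", 5.0), ("east", 7.5)],
)


def total_without_pushdown(region):
    """Fetch every row, then filter and aggregate in the virtualization layer."""
    rows = source.execute("SELECT region, amount FROM sales").fetchall()
    total = sum(amount for row_region, amount in rows if row_region == region)
    return total, len(rows)  # (result, rows moved from the source)


def total_with_pushdown(region):
    """Let the source filter and aggregate; only one row comes back."""
    rows = source.execute(
        "SELECT SUM(amount) FROM sales WHERE region = ?", (region,)
    ).fetchall()
    return rows[0][0], len(rows)


naive = total_without_pushdown("north")
pushed = total_with_pushdown("north")
print(naive, pushed)
```

Both approaches return the same total, but the pushed-down query moves a single row instead of the whole table, which matters when the network is the bottleneck.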


Data Services

DATA SERVICES AND INFORMATION SERVICES

Extending from data virtualization to data and information services is a logical and natural progression. Adding a services layer to virtualized data has several distinct advantages.

• Right-time integrated data is easily available to your newer SOA-based applications.

• SOA-based, “no-hub” master data management (MDM) is enabled, with source data continuing to reside at the source and with fast, real-time, multi-directional data integration.

• Data services maximize opportunity to create reusable data objects that encapsulate both business-rule-based and integration-based behaviors.

• Information services achieve an exceptional level of consumer friendliness for information access.

• Data and information services enable data and information mashups, enhancing the self-service capabilities of consumers to meet their own needs.
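A data service over virtualized data can be sketched as a function that reads from a view and returns JSON. The view customer_profile and the function get_customer_profile are hypothetical names, and an actual services layer would add transport, security, and error handling that are omitted here.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE crm_customers (id INTEGER PRIMARY KEY, name TEXT, segment TEXT);
INSERT INTO crm_customers VALUES (1, 'Acme', 'enterprise');
CREATE VIEW customer_profile AS
  SELECT id AS customer_id, name AS customer_name, segment FROM crm_customers;
""")
conn.row_factory = sqlite3.Row  # rows can then be converted to dictionaries


def get_customer_profile(customer_id):
    """Data service: return one customer from the virtual view as JSON."""
    row = conn.execute(
        "SELECT * FROM customer_profile WHERE customer_id = ?", (customer_id,)
    ).fetchone()
    return json.dumps(dict(row) if row else None)


response = get_customer_profile(1)
print(response)
```

Consumers call the service without knowing which tables or systems sit behind the view, so the same reusable object can serve several applications.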


A “Bird’s-Eye” View

FROM DATA TO INFORMATION AT HIGH SPEED

The role of data virtualization is to combine data from disparate sources of many types and to move that data at high speed to be delivered as high quality, integrated, consistent, relevant, and timely business information. As with all complex problems, complexity becomes manageable when divided into logical parts. The logical parts of data virtualization include:

• A source layer implementing connection views. The source layer connects with many different data structures and types using a variety of access languages. The source layer is the point at which decoupling – separating data use from data storage – is achieved. When the source layer understands both read and write functions of each source, then multi-directional integration becomes possible.

• An integration layer implementing integration views. Here data perspective shifts from data-storage syntax to business syntax. This layer delivers the transformation and federation capabilities to represent data relationships and data combinations not apparent in the individual data sources. Transformation capabilities are also applied for consistent representation of data and for data quality improvement.

• A business layer implementing consumer views. This layer shifts the perspective from business syntax to data usage. The views at this layer present data in accessible and understandable forms that make it readily available both for business user consumption and for rapid application development activities such as prototyping and agile projects.

• An application layer implementing data services, consumer views, or a combination of the two. Both single-point-of-interface and usage-specific information capabilities are supported here. The application layer spans data access capabilities ranging from SOA applications to virtual data marts.


Virtualize or Materialize?


Decision Factors

COMPLEX DECISIONS

Deciding whether to integrate data materially, virtually, or as a hybrid is a complex process that involves many decision variables and is made on a case-by-case basis:

• Time to solution is the speed at which a data integration solution is needed. Greater urgency indicates virtualization.

• Cost sensitivity is a budget-driven variable. Exceptionally limited budget indicates virtualization.

• Requirements stability is concerned with clarity and constancy of data integration requirements. Clear and stable requirements are suited to materialization; uncertain and volatile requirements fit virtualization.

• Replication constraints consider privacy and policy limits to creating multiple copies of data. Use virtualization when constraints are strong.

• Organizational personality describes a cultural continuum that ranges from cautious and risk-averse to adventurous. Highly cautious organizations are better suited to tried-and-true methods such as ETL.

• Source system availability is essential to virtualization. Limited availability makes on-demand integration difficult to achieve.

• Source system load considers the processing capacity of source systems to take on additional query demand. For source systems with little headroom, demands of virtualization may exceed capacity.

• Data cleansing needs may inhibit use of virtualization. Messy data that requires complex cleansing algorithms is a poor fit for virtualization.

• Transformation complexity considers the structures, dependencies, and quantities of business and data rules that must be applied to integrate data. Highly complex transformations are better suited to materialization than to virtualization.

• Application focus ranges from operational and real-time decision support to time-series analysis and data mining. The real consideration here is the amount of history that is needed in integrated data. When history needs exceed that which is available in source systems at any point in time, then materialization is necessary.

• Data format influences the choice with multi-dimensional and other non-SQL target data structures better suited to materialization than to virtualization.

• Target data freshness has similar influence. Real-time and very low latency data are virtualization friendly and difficult to achieve with ETL and materialization.

• Data volume per query must be considered. Processing large amounts of data with each query is not ideal for data virtualization.
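One way to keep these factors in view during a case-by-case decision is a simple tally, sketched below in Python. The factor names, the equal weighting, and the tally function are assumptions made purely for illustration. The course text gives no scoring method, and some factors (such as history needs that exceed what sources retain) act as hard constraints rather than votes.

```python
# Each factor maps to the approach the text associates with it when the
# condition is true. Equal weighting is an assumption for illustration only.
FACTOR_HINTS = {
    "urgent_time_to_solution": "virtualize",
    "very_limited_budget": "virtualize",
    "volatile_requirements": "virtualize",
    "strong_replication_constraints": "virtualize",
    "real_time_freshness_needed": "virtualize",
    "limited_source_availability": "materialize",
    "little_source_headroom": "materialize",
    "complex_cleansing": "materialize",
    "complex_transformations": "materialize",
    "history_beyond_source_retention": "materialize",
    "non_sql_target_format": "materialize",
    "large_data_volume_per_query": "materialize",
}


def tally(conditions):
    """Count how many true conditions point toward each approach."""
    votes = {"virtualize": 0, "materialize": 0}
    for factor, is_true in conditions.items():
        if is_true:
            votes[FACTOR_HINTS[factor]] += 1
    return votes


votes = tally({
    "urgent_time_to_solution": True,
    "volatile_requirements": True,
    "complex_transformations": True,
    "large_data_volume_per_query": False,
})
print(votes)
```

A mixed result like this one is a prompt to consider a hybrid design, not a verdict.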


Business Considerations Discussion

Think about your organization and your BI systems and projects. Where do you fit for each of the five factors shown here?

Is the answer the same for all integration needs or do answers change depending on data subjects, data sources, or data integration projects?


Module 2

Data Integration Architecture

Topics

Integration Architecture Concepts

Reference Architectures

Integration Architecture Examples


Integration Architecture Concepts


Integration Architecture Defined

ARCHITECTURE

Architecture defines the roles, structure, relationships, and rules by which a collection of components constitutes a cohesive whole – the glue that bonds individual parts into a system. Architecture is an early-stage design activity that precedes detailed design, specification, and construction. Effective architecture ensures that the things we build:

• Are suited to the purposes for which they are intended

• Comply with regulations and standards

• Fit gracefully into their environment

• Are sustainable through their expected lifespan

• Are aesthetically pleasing

These principles hold true for architecture of many things – buildings, bridges, information systems, and more.

DATA INTEGRATION ARCHITECTURE

Data integration architecture defines the roles, structure, relationships, and rules to aggregate a collection of data integration components into a data integration system.

The facing page illustrates generic data integration architecture comprising these components:

• Disparate data sources – The non-integrated data that is the target of data integration activity. The scope of data types ranges from highly structured relational data to unstructured, web, cloud, and “big data” sources.

• Data access methods – The means by which integration technologies connect to data sources. These methods encompass all of the common data access protocols.

• Data integration technologies – The classes of tools that are available to automate and execute data integration tasks: data replication, data virtualization, extract-transform-load (ETL), and enterprise application integration (EAI).

• Data integration techniques – The methods, processes, and products that are used to combine, connect, and rationalize disparate data as a unified data resource: propagation, transformation, consolidation, and federation.

• Integrated data applications – The business and information systems that access and use integrated data.

• Integration management – The essential components for managing integration system internals: quality, metadata, and systems management.


Data Sources, Middleware, and Data Consumers

MORE THAN TECHNOLOGY

The previous view of data integration architecture is a technology-focused perspective that begins with databases and ends with systems. A more holistic view extends the architecture to include the very important elements of business activities and business people.

With these elements included, a three-layer view of data integration is a useful perspective:

• Data sources include the technical components described earlier – the databases containing structured and unstructured data. But the real sources are the business activities where data is created (planning, management, and day-to-day business functions) and the people (planners, managers, and staff) who perform those activities.

• Data consumers include the applications described earlier, ranging from domain-specific systems to analytics. But the ultimate consumers are the business activities that are informed by data – strategic, tactical, and operational – and the executives, managers, and staff who perform those activities.

• Middleware is the technology that bridges from data sources to data consumers. Middleware includes all of the technological components for data access, integration technologies and techniques, and integration management. ETL, EAI, and data virtualization platforms are all types of data integration middleware.


You Have It (Whether Defined or Not)

LEGACY ARCHITECTURE

You probably already have data integration architecture even if you don’t recognize it as such. Anyone with a data warehouse, an operational data store, or even application interfaces has integration components with rules, roles, relationships, and purpose to combine data from multiple sources and present unified views of data. The architecture may not be elegant and it may not be documented, but it does exist.

It is important to know what you have, to begin there, and then to ask what you need and how to extend, expand, and evolve the existing architecture for the new challenges of data growth and business agility.


Reference Architectures


Forrester’s Data Architecture Reference Model

WHY REFERENCE ARCHITECTURE?

Reference architecture captures the core concepts of components and relationships for a particular type of system or collection of systems. The purpose is to provide guidance for development of specific architecture for a targeted organizational and technical environment. Data integration reference architecture, then, captures the essential components and relationships of data integration systems, providing a framework and guidance to define and develop specific data integration architectures.

FORRESTER DATA MANAGEMENT ARCHITECTURE

The facing page illustrates Forrester’s Data Management Reference Architecture. [1] Note the substantial presence of data virtualization among the architectural components. As indicated by the check marks in the diagram, data virtualization supports virtualized data access, derived data stores, and data integration (though the older term EII is used to refer to virtualization as a data rationalization component).

Although the terminology and visualization are somewhat different, this reference architecture is quite similar to the earlier illustration of generic data integration architecture. It is also interesting that while the title refers to data management, the core of the architecture is focused on data integration, perhaps suggesting that integration is the predominant challenge in data management today.

[1] Forrester’s Data Management Reference Architecture, Yuhanna, Leganza, Karel, Evelson, Kobelius, & Owens,


Forrester’s IaaS Architecture

FORRESTER INFORMATION SERVICES ARCHITECTURE

In addition to data management architecture, Forrester offers a reference architecture for Information as a Service (IaaS). Distributed data access, integration middleware, and SOA-based delivery are the core elements of this architecture, and all of them depend upon data virtualization technology.

This reference architecture can provide especially useful guidance to extend and evolve data integration architecture for those organizations pursuing service-oriented master data management (MDM) and 360° views of enterprise data.

[1] The Forrester Wave™: Information-As-A-Service, Q1 2010, Yuhanna & Gilpin, http://www.forrester.com/search?N=10001+20042&range=504001&tmtxt=+IaaS#/The+Forrester+Wave+InformationAsAService+Q1+2010/ quickscan/-/E-RES55204


Gartner’s Data Services Layer Architecture

GARTNER’S SERVICES LAYERS

Gartner also offers a service-oriented reference architecture that is geared to Data as a Service (DaaS). The Gartner model consists of eight layers that progress from data sources to business processes. Business processes, services, and applications constitute the business components. The data services layers work to access and transform data and to map it into semantic and business context. These layers are enabled by data virtualization technology. Data sources are the bottom layer of the stack, representing the wide variety of disparate data types that are today’s integration challenge. Supporting structure includes optimization and data governance processes.


IBM’s BI Reference Architecture

IBM’S VIEW OF BUSINESS INTELLIGENCE

IBM’s Business Intelligence Reference Architecture puts data integration into BI context. Note the focus on connecting data consumers and data sources. In the IBM model, however, the bridge is a combination of data warehousing and business analytics, which is a common BI perspective. In this architecture data virtualization isn’t highly visible. It would fit well in the column of integration processes (ETL, data quality, data integration), with the ability to bypass the data stores column immediately to the left of integration processes.

Integration Architecture Examples

Example 1 – Ministry Social Services Logical Architecture

THREE-LAYER ARCHITECTURE

The example on the facing page is drawn from a case study of Compassion International described in the book Data Virtualization.¹

The data virtualization system is designed to integrate data from multiple, complex sources including ERP, EDW, application databases, and cloud-hosted databases. Progression from source views of data to consumer views depends on

• multi-layer architecture,

• models to describe the data in source and in business contexts,

• mapping and rules to drive data transformations.

The data virtualization layer encompasses three sub-layers: source transform views, canonical objects, and consumer views – each considered to be a collection of “building blocks.” The characteristics of the building blocks as described by Davis and Eve are:

• They are actual views that can be queried, not just logical objects.

• They look more like source system views at the bottom of the diagram and become increasingly business-oriented as you move upward.

• They encapsulate standard and reusable business logic.

• They can be cached for performance optimization.

• Each is documented in wiki form, accessible to end users and to developers.
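
The three sub-layers can be illustrated with a minimal sketch using Python's built-in sqlite3 module. This is not the case study's implementation: the table, column, and view names (erp_donor, src_donor, canonical_donor, donors_by_country) are hypothetical, and a single in-memory database stands in for the real source systems. It only shows that each building block is a view that can be queried, and that the views become more business-oriented from bottom to top.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical source table standing in for an ERP extract.
    CREATE TABLE erp_donor (DNR_ID INTEGER, DNR_NM TEXT, CTRY_CD TEXT);
    INSERT INTO erp_donor VALUES (1, ' alice smith ', 'us'), (2, 'BOB JONES', 'de');

    -- Layer 1: source transform view (shaped like the source, lightly cleaned).
    CREATE VIEW src_donor AS
        SELECT DNR_ID, TRIM(DNR_NM) AS DNR_NM, UPPER(CTRY_CD) AS CTRY_CD
        FROM erp_donor;

    -- Layer 2: canonical object (common names, reusable business logic).
    CREATE VIEW canonical_donor AS
        SELECT DNR_ID AS donor_id, DNR_NM AS donor_name, CTRY_CD AS country_code
        FROM src_donor;

    -- Layer 3: consumer view (business-oriented, ready for reporting).
    CREATE VIEW donors_by_country AS
        SELECT country_code, COUNT(*) AS donor_count
        FROM canonical_donor
        GROUP BY country_code;
""")

# Every layer is an actual view, so each one can be queried directly.
rows = conn.execute(
    "SELECT country_code, donor_count FROM donors_by_country ORDER BY country_code"
).fetchall()
canonical_name = conn.execute(
    "SELECT donor_name FROM canonical_donor WHERE donor_id = 1"
).fetchone()[0]
print(rows, canonical_name)
```

Because no layer stores data, a change to the cleaning rule in the source transform view is picked up by the canonical and consumer views on the next query.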

Integration Architecture Examples

Example 2 – Energy Industry Logical Architecture

FOUR-LAYER ARCHITECTURE

This example, also drawn from a Data Virtualization¹ case study of a Global 50 energy company, views the data virtualization layer as four distinct sub-layers:

• Source connections access the disparate data sources and provide data to the conforming layer.

• The conforming layer transforms data to conform to a common data model.

• The common semantic layer is a collection of views into the common data model.

• The business demand layer publishes views and services that are used by data consumers to access data.

This data virtualization layer also includes a data storage component that improves performance by staging data for fast retrieval.

The energy company’s IT executive expresses a key concept of this architecture: “The [consuming] application does not go directly to the system of record [the data source] but rather to the record of reference, which is the data virtualization layer.”²
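
The four sub-layers can be read as a chain of functions in which each layer consumes only the layer beneath it. The sketch below is an illustration under stated assumptions, not the energy company's design: the source readers, field names, and the cubic-metre-to-barrel factor are hypothetical stand-ins, and the data storage component is omitted.

```python
# Hypothetical source systems; each returns rows in its own native shape.
def read_oracle_wells():
    return [{"WELL_NO": "W-1", "BBL_PER_DAY": 120.0}]

def read_sap_wells():
    return [{"well": "W-2", "output_m3_day": 10.0}]

M3_TO_BBL = 6.2898  # approximate barrels per cubic metre (assumed for the example)

# Layer 1: source connections access the disparate data sources.
def source_connections():
    return {"oracle": read_oracle_wells(), "sap": read_sap_wells()}

# Layer 2: the conforming layer maps every source to one common data model.
def conforming_layer(raw):
    conformed = [
        {"well_id": r["WELL_NO"], "barrels_per_day": r["BBL_PER_DAY"]}
        for r in raw["oracle"]
    ]
    conformed += [
        {"well_id": r["well"],
         "barrels_per_day": round(r["output_m3_day"] * M3_TO_BBL, 1)}
        for r in raw["sap"]
    ]
    return conformed

# Layer 3: the common semantic layer is a view into the common data model.
def semantic_production_view(conformed):
    return sorted(conformed, key=lambda r: r["well_id"])

# Layer 4: the business demand layer publishes what data consumers call.
def total_production():
    view = semantic_production_view(conforming_layer(source_connections()))
    return round(sum(r["barrels_per_day"] for r in view), 1)

print(total_production())
```

A consuming application calls only total_production(). It never touches the source readers, which is the "record of reference" idea expressed in the quote above.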

¹ Data Virtualization, pp. 115-125, Davis and Eve, 2011 (www.datavirtualizationbook.com)

² Data Virtualization, p. 118, Davis and Eve, 2011 (www.datavirtualizationbook.com)

Integration Architecture Examples

Example 3 – Energy Industry Technical Architecture

TECHNOLOGY ENABLED SERVICES

Extending from logical to technical architecture shows the technologies involved in data virtualization and the service role of each. This extends the view of architecture from the “what” to the “how” of virtualizing data, and shifts perspective from logical layers to technology services.

• The variety of data sources includes Oracle, SQL Server, SAP, Web Services, and local interfaces for inbound data.

• Embarcadero Studio is used to manage and maintain the common data model.

• IBM Netezza Data Warehouse Appliance implements the data storage component.

• Microsoft technology is used for data mapping.

• Cisco Data Virtualization implements the virtualization services.

• A variety of technologies are used by BI and data integration

Integration Architecture Examples

Example 4 – Financial Services Logical Architecture

VIRTUALIZATION SERVICES ARCHITECTURE

This example illustrates a logical services perspective of data virtualization architecture. Drawn from a Denodo case study¹ of a debt collection company, the architecture connects corporate systems and business processes with disparate web data sources through interaction of data acquisition, data transformation, and data virtualization services. The nature of the data sources – LinkedIn, Facebook, Twitter, etc. – is of particular interest here. Working with external and primarily unstructured data is different from working with internal and structured data, especially for data acquisition and transformation. Acquisition uses a combination of web services and web extraction techniques. Transformation must filter, prioritize, and normalize data to be passed to virtualization services. The virtualization services publish the views, services, and interfaces through which consumers access the data.

Virtualize or Materialize?

Data Source Considerations Discussion

Think about your organization and your BI systems and projects. Where do you fit for each of the four factors shown here?

Is the answer the same for all integration needs or do answers change depending on data subjects, data sources, or data integration projects?

Module 3

Data Virtualization in Integration Architecture

Topic Page

Virtualization in Data Integration Projects 3-2

Data Warehousing Use Cases 3-4

Data Federation Use Cases 3-16

MDM and EIM Use Cases 3-28

More Data Virtualization Applications 3-36

Virtualization in Data Integration Projects

Data Virtualization Use Cases

VIRTUALIZATION OPPORTUNITIES

The opportunities for data virtualization in projects are many and don’t necessarily imply long-term virtualization in production applications. It is common to virtualize in development and materialize for production. The facing page illustrates many types of integration systems from data warehousing to cloud data integration, and many uses of virtualization in projects from prototyping to production.

Data Warehousing Use Cases

Data Warehouse Augmentation

EXTENDING THE EXISTING DATA WAREHOUSE

Traditional data warehousing systems are designed to provide integration of structured data through extract, transform and load (ETL) processes. The output of ETL processing is integrated data that is physically stored in a relational database and made available for downstream reporting and access. Depending on the data architecture, additional data stores such as data marts may exist to optimize the information delivery functions. The batch nature of ETL processing necessitates some latency of warehouse data.

CHALLENGES

Long-term success and sustainability of a data warehouse is based on the ability to adapt and evolve to meet continuously changing information needs. The time and effort required to bring in additional source data is a significant challenge for existing data warehouses. The challenge and the complexities increase when the new requirements include unstructured data. Real-time data requirements bring additional challenges in ETL-based data warehousing processes.

Abundance of unstructured data and the impact of big data technologies bring both opportunities and challenges. The emergence of big data in a modern business context – especially social media data – creates opportunity to analyze and better understand customer perceptions and behaviors. But with the opportunity comes complexity – unstructured social data is not a quick and easy fit into a traditional data warehouse.

OPPORTUNITY ENABLED BY VIRTUALIZATION

Data Virtualization can be applied to complement and augment an existing data warehouse with virtual views to meet new information requirements. Unstructured data, cloud data, and real-time data integration can be implemented without extensive and disruptive changes to the core data model and ETL processing.

Speed of delivery and speed of data are accelerated with virtualization. Leveraging new and existing data sources more rapidly advances business agility. Unstructured data is integrated with structured data and new reporting applications are quickly implemented.

[Figure: Data Warehouse Federation. Data virtualization provides federated views across multiple data warehouses, each loaded by its own ETL processing.]

Data Warehousing Use Cases

Data Warehouse Federation

MULTIPLE DATA WAREHOUSES

Many organizations have multiple data warehouses for a variety of reasons. Mergers and acquisitions, independent departmental initiatives for data integration, and purchased or hosted applications with data warehouse components are among the most common causes. Whatever the causes may be, the result is typically new silos of data without having achieved full enterprise integration.

CHALLENGES

Enterprise reporting and robust analytics require data integration across the enterprise, but physical integration of multiple data warehouses is time-consuming and costly. It is especially challenging in dynamic and volatile environments where the rate of data and systems change may exceed the capacity for continuous data warehouse alignment.

The real challenge is to deliver an integrated view from many different data warehouses to support enterprise-wide information needs quickly, efficiently, and cost-effectively.

OPPORTUNITY ENABLED BY VIRTUALIZATION

Data virtualization provides the means to meet these challenges quickly, efficiently, and cost-effectively. Each individual data warehouse continues to operate independently, serving the users and purposes for which it is designed. Simultaneously, the warehouses can be federated through views that support new uses and provide an enterprise-wide perspective.

Data virtualization is the means to achieve federation. Recall the earlier quote from Rick van der Lans (page 1-10): “federation always results in virtualization.”
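
As a small-scale illustration of federation, the sketch below uses SQLite's ATTACH DATABASE to stand in for a virtualization layer sitting over two independent warehouses. The schema names (dw_emea, dw_amer), the sales tables, and the figures are hypothetical. A commercial data virtualization tool would federate across different database products; SQLite is used here only because it ships with Python.

```python
import sqlite3

# The main connection plays the role of the virtualization layer.
conn = sqlite3.connect(":memory:")

# Two independent in-memory databases stand in for separate data warehouses.
conn.execute("ATTACH DATABASE ':memory:' AS dw_emea")
conn.execute("ATTACH DATABASE ':memory:' AS dw_amer")

conn.executescript("""
    CREATE TABLE dw_emea.sales (region TEXT, revenue REAL);
    CREATE TABLE dw_amer.sales (region TEXT, revenue REAL);
    INSERT INTO dw_emea.sales VALUES ('DE', 100.0), ('FR', 50.0);
    INSERT INTO dw_amer.sales VALUES ('US', 200.0);

    -- Federated view: SQLite allows only TEMP views to span attached databases.
    CREATE TEMP VIEW enterprise_sales AS
        SELECT 'EMEA' AS warehouse, region, revenue FROM dw_emea.sales
        UNION ALL
        SELECT 'AMER' AS warehouse, region, revenue FROM dw_amer.sales;
""")

total = conn.execute("SELECT SUM(revenue) FROM enterprise_sales").fetchone()[0]
row_count = conn.execute("SELECT COUNT(*) FROM enterprise_sales").fetchone()[0]
print(total, row_count)
```

Each warehouse keeps its own tables and load processes. Only the view definition knows that the enterprise total spans both.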

Data Warehousing Use Cases

Hub and Virtual Spoke

SUSTAINABLE AND SCALABLE DATA MARTS

Hub and spoke data warehouse architecture uses a central data warehouse – the hub – as the point of data integration, then produces multiple data marts – the spokes – to support information needs of various workgroups. Though strength of integration is high, the workloads for development and for operation are also high because each data mart has unique ETL processing.

CHALLENGES

As demand for information accelerates, the demand for new data marts grows. Data mart proliferation brings redundancies and inconsistencies that degrade strength of integration and decay data quality. Parallel growth of workload and loss of quality is clearly not a sustainable approach to data warehousing.

OPPORTUNITY ENABLED BY VIRTUALIZATION

Data virtualization offers the opportunity to create virtual data marts. These “virtual spokes” can be deployed quickly, without the increased workload of additional ETL processing, and with substantially reduced risk of data quality issues. Many new data mart requirements, and many changes to existing data marts, can be met without increased development, processing, and administrative workload. When changes to an existing data mart are implemented by virtualizing the mart, overall workload may actually be reduced.

Virtualization enables the concept of disposable data marts: new data marts can be created easily, without the need to build new physical data stores. This is a particularly powerful technique in highly volatile business and systems environments.
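
A virtual spoke can be sketched as nothing more than a view over the hub. The example below is a minimal illustration with sqlite3. The sales_fact table and the mart_marketing view are hypothetical names, and a real hub would be a full warehouse rather than a single table.

```python
import sqlite3

hub = sqlite3.connect(":memory:")
hub.executescript("""
    -- Hypothetical hub table standing in for the central data warehouse.
    CREATE TABLE sales_fact (dept TEXT, product TEXT, amount REAL);
    INSERT INTO sales_fact VALUES
        ('marketing', 'widget', 10.0),
        ('marketing', 'gadget', 5.0),
        ('finance',   'widget', 7.5);

    -- A virtual spoke: no ETL job and no new physical data store.
    CREATE VIEW mart_marketing AS
        SELECT product, SUM(amount) AS total_amount
        FROM sales_fact
        WHERE dept = 'marketing'
        GROUP BY product;
""")

mart_rows = hub.execute(
    "SELECT product, total_amount FROM mart_marketing ORDER BY product"
).fetchall()

# New hub data is visible in the mart immediately because nothing was copied.
hub.execute("INSERT INTO sales_fact VALUES ('marketing', 'widget', 2.5)")
refreshed = hub.execute(
    "SELECT total_amount FROM mart_marketing WHERE product = 'widget'"
).fetchone()[0]

# A disposable mart is dropped without touching any hub data.
hub.execute("DROP VIEW mart_marketing")
hub_row_count = hub.execute("SELECT COUNT(*) FROM sales_fact").fetchone()[0]
print(mart_rows, refreshed, hub_row_count)
```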

Data Warehousing Use Cases

Complement ETL

RESOLVING ETL INCOMPATIBILITY

Most older data warehouses acquire, process, and load data through ETL processing. The data sources are typically structured data that is readily suited to relational database management systems. As technology has grown and evolved, many new data sources are not well suited to traditional ETL processing.

CHALLENGES

Efforts to add new data sources to existing ETL processes are challenged when the ETL technology lacks interfaces and access methods for the data that is needed. Common examples of data source and ETL incompatibility include ERP-embedded databases and web services data. Modifying existing ETL processes to access these sources is complex and likely to introduce bugs and performance problems into previously stable processes.

OPPORTUNITY ENABLED BY VIRTUALIZATION

Data virtualization is an effective way to remove or reduce data source to ETL incompatibilities. Using a virtualization tool you can pre-process problem data sources, creating views that are readily accessible by your ETL technology. Changes to existing ETL processes, and the risks inherent in those changes, are substantially reduced. The complexities and incompatibilities are managed externally while integrity of the ETL process is maintained.
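
The pre-processing idea can be sketched as a function that presents a nested web-service payload as flat rows. The payload and its field names are hypothetical, and a virtualization tool would publish the result as a relational view rather than a Python generator. The point is only that the flattening logic lives outside the ETL process.

```python
import json

# Hypothetical payload standing in for a web-service response that the
# existing ETL tool cannot read natively.
payload = json.loads("""
{"orders": [
  {"id": 1, "customer": {"name": "Acme"},
   "lines": [{"sku": "A", "qty": 2}, {"sku": "B", "qty": 1}]},
  {"id": 2, "customer": {"name": "Globex"},
   "lines": [{"sku": "A", "qty": 5}]}
]}
""")

def order_lines_view(doc):
    """Flatten nested orders into rows that an ETL job can treat as a table."""
    for order in doc["orders"]:
        for line in order["lines"]:
            yield (order["id"], order["customer"]["name"], line["sku"], line["qty"])

flat_rows = list(order_lines_view(payload))
for row in flat_rows:
    print(row)
```

The existing ETL job then reads (order_id, customer_name, sku, qty) rows like any other tabular source, and its own logic stays unchanged.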

Data Warehousing Use Cases

Data Warehouse Prototyping

REQUIREMENTS DISCOVERY

Collecting data warehouse requirements is an iterative process that takes refinement. Requirements analysis for warehousing is a continuous process of discovery about business information needs and about hidden characteristics of source data. The warehouse requirements analyst quickly discovers that

• Users can’t always tell you what they need.

• Data models and documentation are rarely complete, current, and accurate.

Recent interest in agile methods for data warehouse development raises the stakes for warehouse prototyping.

CHALLENGES

The key to effective prototyping is speed – fast response to discovery and change. Rapid prototyping is good prototyping, but physical warehouse development has many barriers to fast changes. When you prototype a data warehouse each cycle of discovery brings changes to database schema, to data transformation logic, and potentially to choice of data sources. These are programming changes – labor intensive, and the antithesis of rapid.

OPPORTUNITY ENABLED BY VIRTUALIZATION

Data virtualization can substantially reduce the challenges of change when prototyping a data warehouse. Without physical database schema, and with much of integration and transformation logic embedded in canonical models, cycles of prototyping are executed much faster. New requirements and new data sources can be quickly integrated into existing virtual structures. Ultimately, when discovery diminishes and requirements stabilize, you can migrate from a virtual to a physical data warehouse for runtime efficiencies and historical data retention.

The prototyping opportunity, of course, applies to extending existing data warehouses as well as building new data warehouses. Prototyping can be performed at any level from EDW to data marts.
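
The prototype-then-materialize cycle can be sketched with sqlite3 as follows. The table and view names (src_orders, proto_orders, dw_orders) and the size_band rule are hypothetical. The sketch shows a requirement change handled by redefining a view, followed by materialization once the design is stable.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE src_orders (order_id INTEGER, amount_cents INTEGER);
    INSERT INTO src_orders VALUES (1, 1250), (2, 800);

    -- Prototype cycle 1: a virtual structure, no physical warehouse schema.
    CREATE VIEW proto_orders AS
        SELECT order_id, amount_cents / 100.0 AS amount
        FROM src_orders;
""")

# Prototype cycle 2: a new requirement changes only the view definition.
db.executescript("""
    DROP VIEW proto_orders;
    CREATE VIEW proto_orders AS
        SELECT order_id,
               amount_cents / 100.0 AS amount,
               CASE WHEN amount_cents >= 1000 THEN 'large' ELSE 'small' END
                   AS size_band
        FROM src_orders;
""")

# Once requirements stabilize, materialize the view as a physical table.
db.execute("CREATE TABLE dw_orders AS SELECT * FROM proto_orders")
materialized = db.execute("SELECT * FROM dw_orders ORDER BY order_id").fetchall()
print(materialized)
```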

Data Warehousing Use Cases

Data Warehouse Migration

MOVING THE DATA WAREHOUSE

Sometimes you simply need to move a data warehouse from one platform to another. Cost savings, performance gains, increased accessibility, enhanced mobility, and other motivators drive warehouse migrations. The migration may be from one DBMS to another, from database server to appliance, from row-based to columnar, from locally hosted to cloud, or any of many other variations.

CHALLENGES

The big challenge in moving a data warehouse is the demand for continued operation and availability. Migration is not an event but a process – one that involves rebuilding of both databases and reporting processes. It occurs over a period of days and weeks, perhaps even months. Yet business information needs and reporting requirements continue to occur on a day-to-day basis. It is not practical to shut down to migrate.

OPPORTUNITIES ENABLED BY VIRTUALIZATION

Data virtualization remediates the challenges of warehouse migration. Create a virtual reporting layer to decouple reporting from the physical data structure. You first build and test the reporting layer mapped to the original data warehouse. Next you build the data warehouse on the new platform. And finally you map the virtual reporting layer to the new data warehouse. The result is a step-by-step process with smooth transition that insulates warehouse users from the impacts of migration.

The physical data warehouse moves from one platform to another, but the virtual reporting layer is more than a temporary solution to a migration problem. Keeping the virtual layer in place increases the flexibility and agility with which new reporting requirements can be met.
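
The three migration steps can be sketched with a small reporting layer that maps logical names to physical tables. Everything here is a hypothetical illustration: the ReportingLayer class, the table names, and the two in-memory SQLite databases standing in for the old and new platforms.

```python
import sqlite3

def make_warehouse(table_name):
    """Build a stand-in warehouse; physical table naming differs per platform."""
    wh = sqlite3.connect(":memory:")
    wh.execute(f"CREATE TABLE {table_name} (region TEXT, revenue REAL)")
    wh.execute(f"INSERT INTO {table_name} VALUES ('north', 10.0), ('south', 20.0)")
    return wh

class ReportingLayer:
    """Virtual layer: reports use logical names, never physical tables."""

    def __init__(self, warehouse, mapping):
        self.remap(warehouse, mapping)

    def remap(self, warehouse, mapping):
        self.warehouse = warehouse
        self.mapping = mapping

    def total(self, logical_name, column):
        # Table and column names come from the trusted mapping, not user input.
        table = self.mapping[logical_name]
        query = f"SELECT SUM({column}) FROM {table}"
        return self.warehouse.execute(query).fetchone()[0]

# Step 1: build and test the layer against the original warehouse.
layer = ReportingLayer(make_warehouse("SALES_LEGACY"), {"sales": "SALES_LEGACY"})
before = layer.total("sales", "revenue")

# Step 2: build the warehouse on the new platform.
new_warehouse = make_warehouse("fact_sales")

# Step 3: remap the layer; the report call itself is unchanged.
layer.remap(new_warehouse, {"sales": "fact_sales"})
after = layer.total("sales", "revenue")
print(before, after)
```

The report code calls layer.total("sales", "revenue") before and after the cutover, which is what insulates users from the move.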

Data Federation Use Cases

Federated Views

VIEWS ACROSS DATABASES

Database views are a powerful tool when working with relational data. They can be used to hide data structure complexities, to apply business-friendly naming, to join data from multiple tables, and to get maximum advantage from query optimizers. A SQL view, however, is limited to work within the boundaries of a single database.

CHALLENGES

As the volume and variety of data increase, every organization experiences a corresponding increase in the number of databases that it manages simultaneously: some data in packaged applications, some in ERP systems, some in legacy databases, some in warehouses, and some in the cloud. The segmentation of databases is driven by technology and by the history and evolution of applications. But information needs often cut horizontally across vertically segmented databases. Database views could readily satisfy many of the information needs if they only worked across multiple databases.

VIRTUALIZATION AND VIEWS

Data virtualization eliminates the boundary constraint of views contained within a single database by enabling federated views. Virtualization can combine data from multiple relational databases, Excel spreadsheets, XML, and other formats into a single view that is readily consumed by applications that are unaware of the multi-database data sourcing. The advantages of views now work across multiple databases.
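
A federated view across formats can be sketched in plain Python. The three sources below (a relational table, a CSV text standing in for a spreadsheet export, and an XML document) and all of their field names are hypothetical. The function joins them at query time and stores nothing, so the consumer sees one result set and is unaware of the multi-source origin.

```python
import csv
import io
import sqlite3
import xml.etree.ElementTree as ET

# Source 1: a relational database.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customer (id INTEGER, name TEXT)")
db.executemany("INSERT INTO customer VALUES (?, ?)", [(1, "Acme"), (2, "Globex")])

# Source 2: CSV text standing in for a spreadsheet export.
BUDGET_CSV = "customer_id,budget\n1,1000\n2,2500\n"

# Source 3: an XML document.
CONTACTS_XML = (
    "<contacts>"
    "<contact customer_id='1' email='a@acme.example'/>"
    "<contact customer_id='2' email='g@globex.example'/>"
    "</contacts>"
)

def customer_overview():
    """Federated view: join three sources at query time, storing nothing."""
    budgets = {
        int(r["customer_id"]): float(r["budget"])
        for r in csv.DictReader(io.StringIO(BUDGET_CSV))
    }
    emails = {
        int(c.get("customer_id")): c.get("email")
        for c in ET.fromstring(CONTACTS_XML).iter("contact")
    }
    return [
        {"name": name, "budget": budgets.get(cid), "email": emails.get(cid)}
        for cid, name in db.execute("SELECT id, name FROM customer ORDER BY id")
    ]

overview = customer_overview()
for row in overview:
    print(row)
```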

Data Federation Use Cases

Data Services

SOA-BASED INTEGRATION

Sometimes views aren’t enough. The new world of service-oriented architectures and systems needs new approaches to data integration. Wrapping data sources with a services layer is a common way to get started with service-oriented integration. But wrappers are labor-intensive to build and require maintenance with every database change.

CHALLENGES

Making integrated data available to service-oriented architecture (SOA)-based systems is uniquely challenging. Most data integration systems are predicated on relational technologies and optimized for SQL access. But SQL views don’t do the job for web-services applications, service-oriented master data management, etc. When SOAP, REST, WSDL, JMS, etc. are the right protocols and standards, SQL isn’t a satisfactory substitute.

VIRTUALIZATION AND SERVICES

Some data virtualization tools include data services capabilities that make it practical to combine all of the core data integration functions – multiple sources, abstraction, transformation, etc. – with popular SOA-based data delivery formats.
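
As a rough illustration of publishing a view as a service, the sketch below wraps a SQL view in a minimal WSGI application that returns JSON. The view name v_customer and the customer_service function are hypothetical, and real data virtualization tools generate such services rather than requiring hand-written code.

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE crm_cust (CUST_NO INTEGER, CUST_NM TEXT);
    INSERT INTO crm_cust VALUES (1, 'Acme'), (2, 'Globex');

    -- The abstraction: a business-friendly view over the source table.
    CREATE VIEW v_customer AS
        SELECT CUST_NO AS customer_id, CUST_NM AS customer_name FROM crm_cust;
""")

def customer_service(environ, start_response):
    """Minimal WSGI app that delivers the view as a JSON document."""
    rows = db.execute(
        "SELECT customer_id, customer_name FROM v_customer ORDER BY customer_id"
    )
    body = json.dumps(
        [{"customer_id": cid, "customer_name": name} for cid, name in rows]
    ).encode("utf-8")
    start_response(
        "200 OK",
        [("Content-Type", "application/json"), ("Content-Length", str(len(body)))],
    )
    return [body]

# Call the service in-process, without opening a network socket.
captured = {}

def fake_start_response(status, headers):
    captured["status"] = status
    captured["headers"] = dict(headers)

response = json.loads(b"".join(customer_service({}, fake_start_response)))
print(captured["status"], response)
```

To expose the same function over HTTP, it could be passed to `wsgiref.simple_server.make_server` from the Python standard library. Consumers then receive JSON and never issue SQL.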

Data Federation Use Cases

Data Mashups

NEW WAYS TO PRESENT DATA

The most common form of mashup is a web application that combines existing components from many sources for visual presentation in a new context. A data mashup combines and aggregates data from a variety of different sources. Data mashups are an effective way to meet new needs for information when the data already exists and the effort required is to present that data in new combinations and new visual formats.

CHALLENGES

Web mashups are enabled by the published APIs of web applications that make their data and functions easy to access and integrate. This is the key to fast mashup based on the idea of assembling from existing components instead of building from scratch. The challenge of data mashups is that most corporate data sources do not have readily accessible APIs to support the mashup process.

VIRTUALIZATION AND MASHUP

The data virtualization tools that enable SOA approaches are also enablers of data mashups. The same protocols, services, and data delivery formats that are used for virtualized services fill the role of data APIs for quick and easy access to data.

Data Federation Use Cases

Caches

FREQUENTLY ACCESSED DATA

While data virtualization is an effective technique for many data integration needs, it may have visible impact on performance of source systems and databases. One advantage of physical integration in a warehouse is that the source system is isolated from access by analysis and reporting applications.

CHALLENGES

A virtual data integration system increases the number of ways that operational data is accessible and useful. When the potential becomes reality – a shift from accessible and useful to accessed and used – the operational databases may experience performance challenges. If query optimization isn’t enough, then you may need to duplicate the data, creating a copy of frequently accessed data as a way to isolate the source database.

VIRTUALIZATION AND DUPLICATION

When you need to create a copy of frequently accessed data, two options are possible – replication and caching. A virtualization tool with cache capability does the job with lower overhead and greater flexibility than full database replication. Database replication simply creates copies of tables; integration follows replication. Caching can store copies of virtual views and services; integration is retained in the copy. The database replica is static until updates are pushed to the copy. Caches can be automatically and periodically refreshed to synchronize with the source.
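
The caching behavior described above can be sketched as a wrapper that stores the result of an already-integrated view and refreshes it after a maximum age. The CachedView class, the max_age parameter, and the injectable clock are hypothetical and are used here only to make the refresh logic visible and testable. Commercial tools offer their own cache policies.

```python
import time

class CachedView:
    """Cache a virtual view's result and refresh it after max_age seconds."""

    def __init__(self, query_fn, max_age, clock=time.monotonic):
        self.query_fn = query_fn
        self.max_age = max_age
        self.clock = clock
        self._rows = None
        self._loaded_at = None

    def rows(self):
        now = self.clock()
        expired = self._loaded_at is None or now - self._loaded_at >= self.max_age
        if expired:
            self._rows = self.query_fn()  # the only point where the source is hit
            self._loaded_at = now
        return self._rows

source_hits = 0

def integrated_view():
    """Stand-in for an integrated virtual view over operational sources."""
    global source_hits
    source_hits += 1
    return [("north", 10.0), ("south", 20.0)]

fake_now = [0.0]  # controllable clock so the example runs instantly
cache = CachedView(integrated_view, max_age=60, clock=lambda: fake_now[0])

cache.rows()
cache.rows()                     # served from the cache, source untouched
hits_while_fresh = source_hits

fake_now[0] = 61.0               # cache is now older than max_age
cache.rows()                     # triggers a refresh from the source
hits_after_expiry = source_hits
print(hits_while_fresh, hits_after_expiry)
```

Because the cache stores the output of the view rather than raw tables, the integration work is retained in the copy, which is the distinction from table-level replication drawn above.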
