M6A
European TDWI Conference
with BARC@TDWI-Track June 22 – 24, 2015
MOC Munich / Germany
TDWI Data Virtualization:
Solving Complex Data Integration Challenges
Mark Peco
© TDWI. All rights reserved. Reproductions in whole or in part are prohibited except by written permission. DO NOT COPY
COURSE OBJECTIVES
You will learn:
• How models are used to define and frame analytic needs
• Data virtualization definitions and terminology
• Business case and technical rationale for data virtualization
• Key concepts and foundational principles of virtualization (views, services, etc.)
• Data virtualization life cycle, capabilities, and processes
• How to extend the data warehouse with virtualization
• How virtualization enables federation and enterprise data integration
• How virtualization is applied to big data and cloud data challenges
• How companies use virtualization to solve business problems and drive business agility
The Data Warehousing Institute takes pride in the educational soundness and technical accuracy of all of our courses. Please send us your comments—we’d like to hear from you. Address your feedback to:
Publication Date: August 2013
© Copyright 2013 by The Data Warehousing Institute. All rights reserved. No part of this document may be reproduced in any form, or by any means, without written permission from The Data Warehousing Institute.
TABLE OF CONTENTS
Module 1 Data Virtualization Concepts and Principles
Module 2 Data Integration Architecture
Module 3 Data Virtualization in Integration Architecture
Module 4 Data Virtualization Platforms
Module 5 Implementing Data Virtualization
Module 6 Getting Started with Data Virtualization
Appendix A Data Virtualization Case Studies
Module 1
Data Virtualization Concepts and Principles
Topic
Data Virtualization Basics
Why Data Virtualization?
The Data Virtualization Foundation
Data Virtualization Basics
Data Virtualization Defined
WHAT IT MEANS TO BE VIRTUAL
The Oxford Dictionary defines virtual as “not physically existing as such but made by software to appear to do so.” Virtual data, then, is a data structure that appears to exist but does not exist as a physically stored set of data. Data virtualization (DV) includes the processes and technologies that are used to create virtual data.
Wikipedia describes data virtualization as “the presentation of data as an abstract layer, independent of underlying database systems, structures, and storage.” This definition captures two key elements of data virtualization:
• abstraction
• decoupling (removal of dependencies)
FROM THE EXPERTS
The facing page shows two definitions from recognized experts in the subject of data virtualization. Key concepts in Rick van der Lans’s definition include:
• virtualization as a process
• data consumers
• hidden technology
Judith Davis and Robert Eve define virtualization from a purposeful perspective, with the purpose encompassing:
• integration of disparate data
• reach across internal and external data sources
• complete information
• high-quality information
• actionable information
Virtualization vs. Materialization
BUSINESS AND TECHNICAL PERSPECTIVES
The physical form and location in which data is stored are generally of little interest to data consumers, and they can be a real barrier to finding the data that is needed. When data is found, its physical structure is typically optimized for database and application performance. These are important technology considerations, but they make data harder to understand and to access.
ABSTRACT vs. PHYSICAL
The technology view of data is necessarily physical, working with data locations and database structures. The business view of data is more abstract, working with views that vary depending on the processes and circumstances in which data is applied. Both needs are readily met when data is materialized (managed physically) for technology purposes and virtualized (managed abstractly) for business purposes. Data structures provide the means to map from material to abstract. Physical models describe materialized data structures. Logical and conceptual models describe virtualized data structures.
VIRTUAL vs. MATERIAL DATA INTEGRATION
The facing page illustrates the distinction between materialized and virtualized with examples seeking to integrate the same disparate data with similar but different goals. These examples are typical of the data integration challenges for a data warehouse, operational data store, or master data hub.
The goal of materialization is a single source of rationalized data, where source implies a physical database.
The goal of virtualization is a single view of rationalized data, where view implies a logical, but not physically instantiated data structure.
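The difference between a single source and a single view can be sketched in a few lines of Python using the standard sqlite3 module. This is only an illustration under simplifying assumptions: one in-memory SQLite database stands in for what would normally be separate source systems, and the table and view names (crm_customers, erp_customers, customer_copy, customer_view) are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Two hypothetical source tables standing in for disparate systems.
    CREATE TABLE crm_customers (id INTEGER, name TEXT);
    CREATE TABLE erp_customers (id INTEGER, name TEXT);
    INSERT INTO crm_customers VALUES (1, 'Acme');
    INSERT INTO erp_customers VALUES (2, 'Globex');

    -- Materialization: a physical copy of the integrated data.
    CREATE TABLE customer_copy AS
        SELECT id, name FROM crm_customers
        UNION ALL
        SELECT id, name FROM erp_customers;

    -- Virtualization: a view that stores no rows of its own.
    CREATE VIEW customer_view AS
        SELECT id, name FROM crm_customers
        UNION ALL
        SELECT id, name FROM erp_customers;
""")

# A new row arrives in one source after integration was set up.
conn.execute("INSERT INTO crm_customers VALUES (3, 'Initech')")


def count_rows(name):
    """Count the rows currently visible through a table or view."""
    return conn.execute(f"SELECT COUNT(*) FROM {name}").fetchone()[0]


materialized_count = count_rows("customer_copy")  # copy is unchanged until reloaded
virtual_count = count_rows("customer_view")       # view reflects the source immediately
```

The materialized copy still holds the rows it was loaded with and must be refreshed by a separate process, while the view is evaluated against the sources each time it is queried.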
Virtualization vs. Synchronization
MATCHING MULTIPLE DATABASES
Synchronization is a somewhat different data integration requirement than the consolidation work performed for data warehouses. The purpose of synchronization is to keep multiple databases aligned in time and state – to maintain multiple copies of the same data where the data values are consistent across all copies at all times.
Common use cases for synchronization include MDM synchronization of master reference data, geographically distributed data, local copy with cloud-hosted master, and application integration such as CRM to ERP alignment.
Synchronization rules may be defined in a variety of forms:
• Master-slave priority – where one database is always the master copy and changes are pushed to all other copies.
• Most recent transaction priority – where every transaction that occurs in any database is propagated to all other copies in chronological sequence.
• Rule-based data selection – where a complex business rule is used to determine from which database a value is pushed to other copies of a data item.
Synchronization is not a form of virtualization because each distinct copy of the data is materialized. It is practical, however, to sometimes replace or supplement synchronization with virtual data services.
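As a rough illustration of the most recent transaction priority rule, the following Python sketch propagates changes to every copy in chronological order. The Change record, the synchronize function, and the sample keys are hypothetical, and the sketch assumes that timestamps are comparable across databases, which real systems cannot take for granted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    timestamp: int  # assumed comparable across all databases
    origin: str     # database where the transaction occurred
    key: str
    value: str


def synchronize(copies, changes):
    """Apply every change to every copy in chronological sequence."""
    for change in sorted(changes, key=lambda c: c.timestamp):
        for copy in copies.values():
            copy[change.key] = change.value
    return copies


# Each copy is a physically separate (materialized) store.
copies = {"crm": {}, "erp": {}}
changes = [
    Change(2, "erp", "cust-1/phone", "555-0200"),
    Change(1, "crm", "cust-1/phone", "555-0100"),
]
synchronize(copies, changes)
```

Both copies end with the value from the later transaction. Note that each copy still exists physically, which is why synchronization is not virtualization.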
Virtualization vs. Federation
INTEGRATED YET AUTONOMOUS
According to Rick van der Lans, the term federation “refers to combining autonomously operating objects” and data federation is “combining autonomous data stores to form one large data store.”1 On that basis, van der Lans provides the definition shown on the facing page.
FEDERATION AS PART OF DATA INTEGRATION
Data integration, as we’ve discussed thus far, encompasses two techniques: materialization, which creates physical sources of integrated data, and virtualization, which creates non-material views of integrated data.
Federation is a subset of virtualization: a specific use of virtualization to create views that rationalize multiple, disparate, and autonomous data stores. As van der Lans says, “Not all forms of data virtualization imply federation … but federation always results in virtualization.”2
PRINCIPLES OF FEDERATION
Four principles capture the essence of data federation:
• Virtualization: Data federation is a form of data virtualization.
• Heterogeneity: Data federation works across multiple data types, data structures, data storage technologies, and data access methods.
• Autonomy: Each data store integrated through data federation is also able to operate independently and be applied for uses outside the scope of federation.
• On-Demand: Integration is triggered by a consumer request. Data access and integration occur only when the consumer asks for data.
1 Clearly Defining Data Virtualization, Data Federation, and Data Integration, van der Lans, http://www.b-eye-network.com/view/14815
2 Clearly Defining Data Virtualization, Data Federation, and Data Integration, van der Lans, http://www.b-eye-network.com/view/14815
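The four principles of federation can be illustrated with a small Python sketch. It is a toy under stated assumptions: an in-memory SQLite database and a plain dictionary stand in for two heterogeneous, autonomous data stores, and the names sales_db, support_tickets, and customer_summary are hypothetical.

```python
import sqlite3

# Autonomous store 1: a relational database that also serves its own applications.
sales_db = sqlite3.connect(":memory:")
sales_db.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
sales_db.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("Acme", 120.0), ("Globex", 80.0), ("Acme", 30.0)],
)

# Autonomous store 2: a non-relational structure standing in for,
# say, a document store or a web service (heterogeneity).
support_tickets = {"Acme": 2, "Initech": 5}


def customer_summary(customer):
    """Integrate on demand: sources are read only when a consumer asks."""
    total = sales_db.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM orders WHERE customer = ?",
        (customer,),
    ).fetchone()[0]
    # Nothing is copied or stored; the combined result exists only in the response.
    return {
        "customer": customer,
        "order_total": total,
        "open_tickets": support_tickets.get(customer, 0),
    }


summary = customer_summary("Acme")
```

Neither store is modified or copied, each remains usable on its own, and the integrated result is assembled only at request time.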
History and Evolution
A TIMELINE VIEW OF DATA INTEGRATION
Prior to the late 1980s, data integration was typically a handcrafted point solution to a specific problem or need. In 1988, IBM researchers Barry Devlin and Paul Murphy coined the term information warehouse. Bill Inmon popularized warehousing and moved it from experimental to mainstream in the early 1990s with his book Building the Data Warehouse. Early data warehouses (and most data warehouses today) were clearly physical. The concept of a virtual data warehouse surfaced frequently but struggled to gain wide acceptance. Still, the issues of “lift and shift” data duplication were cause for concern, and the concept of virtualization persisted. As enterprise information integration (EII) tools matured to become data virtualization tools, and as new kinds of data and new expectations pushed the limits of batch ETL, data virtualization gained acceptance.
Change is driving adoption of data virtualization today: change in data types and change in business expectations about information velocity and business agility. Data virtualization doesn’t replace ETL, but it is an essential part of the integration toolbox. Today ETL is familiar and comfortable for most data integrators. They look to data virtualization only when ETL can’t get the job done: when batch ETL is too slow, the data sources are difficult to access, or the data types are challenging.
Change will continue to drive the evolution. Look at the timeline on the facing page. Note the sparseness on the left and progressively increasing density as you move from left to right. The pattern is indicative of accelerating change and accelerating challenges for data integrators and data providers.
In time, data virtualization will become familiar and comfortable. Expect in the future that virtualization will take center stage. We will choose data virtualization first, and turn to ETL and materialization when data virtualization doesn’t meet the needs, for example for highly complex transformations or the need to persist history beyond its lifespan in source systems.
The reality is that ETL and data virtualization are not competing technologies – they are complementary. Data virtualization adds a new tool to the data integrator’s toolbox. Technology decisions should be based on requirements – choosing the best tool for the job.
Why Data Virtualization?
Business Agility
SHAPING YOUR BUSINESS FUTURE
Business agility is the popular term that describes the capabilities of a business to quickly respond to changing conditions. In a complex, competitive, and continuously changing business environment, it is easy to understand why agility is important. But knowing that it matters doesn’t make it easy to achieve. Judith Davis and Robert Eve state the case clearly: “While the importance of business agility is well understood, achieving it is a difficult and ongoing challenge. The key to success is information. Armed with the right information, business decision makers can better evaluate their environment and decide how to adapt it for future success.”1
There are two key messages in this quote:
• Business agility depends on information.
• Business agility is about shaping the future of your business.
DIMENSIONS OF AGILITY
Davis and Eve continue to describe three aspects of agility, all of which must be satisfied to achieve true business agility:
• Decision agility describes the speed at which informed decisions can be made.
• Time-to-solution agility describes the cycle time from recognizing a business need to delivery of the information services that are needed to respond to the need.
• Resource agility describes the ability of information services organizations to adjust people, projects, and priorities to quickly respond to business pressures.
The Data Virtualization Business Case
VIRTUALIZATION ENABLES AGILITY
The business case for technology initiatives is typically based in financials: ROI, TCO, payback time, etc. The data virtualization business case is less finance oriented than outcome oriented. Perhaps that is because the decision to pursue data virtualization isn’t truly a technology initiative. It is a business initiative to improve the speed and quality of actionable information.
Data virtualization enables business agility, actionability, information speed, and information quality with:
• Rapid data integration, which results in quicker time-to-solution for business information needs.
• More information opportunities with reach into the new types and greater volumes of data that are available today.
• More robust business analysis through more types of data and more extensive data integration.
• More complete information through reach to new data types and greater data volumes.
• Better quality information that is translated to business syntax and context instead of being delivered in systems and data storage context.
• Simplified data governance by reducing the number of replicated and redundant data stores that must be reconciled.
• Clear connection of information and its value with time and resources used to get information.
• Less costly information infrastructure by reducing costly “lift and shift” processes and databases.
The Data Virtualization Technical Case
BUSINESS ALIGNED TECHNOLOGY
While data virtualization is motivated by business agility, it does have substantial technical implications, enabling IT organizations to become more responsive to continuous and quickly changing needs for business information. Data virtualization enables fast and effective delivery of business information by:
• Making data integration easier to achieve both in scope and timeliness of information.
• Providing a platform for rapid, iterative development where information requirements can be discovered and change is not a barrier to quick delivery.
• Reducing development cycles (time to solution) by eliminating the need to design and develop redundant data stores and processes to “lift and shift” data.
• Making developers more productive with a development environment that focuses on business-perspective information delivery instead of detailed mechanics of data manipulation.
• Supporting the discovery-driven requirements and test-driven development needs of agile development projects.
• Breaking down the barriers of integrating structured and unstructured data into a single consumer view of information.
• Providing fast, easy access to cloud-hosted databases of all types.
• Meeting performance expectations and SLAs through query performance optimization.
• Reducing the maintenance and management overhead of data integration systems.
• Working together with ETL-based integration in a way that allows each technology to do what it does best.
• Extending the data integration toolbox with a new tool that doesn’t demand radical change and readily supports systematic migration.
The Data Virtualization Foundation
Views
WINDOWS INTO COMPLEX DATA
The central component of a data virtualization platform is a view. Views are purposeful means of seeing complex data in simplified and specific context that is matched to the viewer’s perspective. Depending on the specific virtualization platform being used and the data types involved, the views may be SQL-based, XML-based, services-based, etc.
MULTIPLE VIEWS
The earlier statement that “views are purposeful means …” captures an important consideration. Much of the value of views is that they are not “one size fits all” data structures. Data integrators work with three distinct kinds of views, reflecting three purposes for working with data:
• Connection views serve the purpose of accessing data sources, corresponding to extraction in ETL processing. These are the windows through which we see the content of disparate data sources. Connection views may include a degree of normalization and rationalization.
• Integration views are used to combine and connect data from disparate sources, corresponding with transformation in ETL processing. Integration views show data relationships, resolve inconsistencies, rationalize data formats and values, and improve data quality.
• Consumer views are the business-oriented windows into data, with some correspondence to load functions of ETL processing. A significant difference from ETL is that load simply places data into a different collection of database tables; consumer views are more similar to publishing of business information.
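The three kinds of views can be sketched as layered SQL views. The example below uses Python's standard sqlite3 module with a single in-memory database standing in for separate sources. All table, column, and view names are hypothetical, and the unit mismatch (dollars versus cents) is invented purely to give the integration layer something to rationalize.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Two disparate sources with inconsistent column names and units.
    CREATE TABLE src_a (cust TEXT, amount_dollars REAL);
    CREATE TABLE src_b (client TEXT, amount_cents INTEGER);
    INSERT INTO src_a VALUES ('acme', 100.0);
    INSERT INTO src_b VALUES ('ACME', 2500), ('Globex', 4000);

    -- Connection views: access each source, with light rationalization.
    CREATE VIEW conn_a AS
        SELECT cust AS customer, amount_dollars AS amount FROM src_a;
    CREATE VIEW conn_b AS
        SELECT client AS customer, amount_cents / 100.0 AS amount FROM src_b;

    -- Integration view: combine sources and resolve inconsistencies.
    CREATE VIEW int_sales AS
        SELECT UPPER(customer) AS customer, amount FROM conn_a
        UNION ALL
        SELECT UPPER(customer) AS customer, amount FROM conn_b;

    -- Consumer view: business-oriented presentation of the integrated data.
    CREATE VIEW sales_by_customer AS
        SELECT customer, SUM(amount) AS total_sales
        FROM int_sales
        GROUP BY customer;
""")

rows = conn.execute(
    "SELECT customer, total_sales FROM sales_by_customer ORDER BY customer"
).fetchall()
```

Each layer is defined only in terms of the layer beneath it, so a change to a source's structure is absorbed by its connection view and does not reach the consumer.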
Query Optimization
SPEED OF INFORMATION
Optimization is an essential element of any data virtualization platform. Unlike the ETL-based warehouse, the data is not stored in ready-to-access data marts. It resides only at the source until requested by a consumer’s query. The good news: you haven’t done a lot of work to integrate lots of data that is never accessed. The bad news: when data is requested, all of the work from access, through integration, to information delivery must be performed in real time. Thus, optimization is a must; not only must the work be performed but it must be done fast.
To enable business and IT agility, a data virtualization tool must:
• Recognize that the network is a bottleneck. Effectively optimize to minimize network traffic without information loss.
• Recognize that each data source has its unique and technology-specific access methods. For example, avoid relying on a generic access language such as ANSI SQL. Instead, translate to the SQL dialect native to each relational data source.
• Push as much work as practical to the data source. Use source-specific features when practical to minimize the work done via virtualization views and to reduce the volume of data moving across the network.
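The effect of pushing work to the source can be shown with a small Python sketch. It is not how any particular platform's optimizer works: an in-memory SQLite database plays the role of a remote source, the number of rows fetched stands in for network traffic, and the function names are hypothetical.

```python
import sqlite3

# A stand-in for a remote source holding 1,000 orders, 100 of them in EMEA.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount REAL)")
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(i, "EMEA" if i % 10 == 0 else "AMER", float(i)) for i in range(1, 1001)],
)


def without_pushdown(region):
    """Fetch every row, then filter inside the virtualization layer."""
    fetched = source.execute("SELECT id, region, amount FROM orders").fetchall()
    result = [row for row in fetched if row[1] == region]
    return len(fetched), result


def with_pushdown(region):
    """Delegate the filter to the source so only matching rows move."""
    fetched = source.execute(
        "SELECT id, region, amount FROM orders WHERE region = ?", (region,)
    ).fetchall()
    return len(fetched), fetched


moved_naive, rows_naive = without_pushdown("EMEA")
moved_pushed, rows_pushed = with_pushdown("EMEA")
```

Both approaches return the same answer, but the pushed-down query moves one tenth of the rows in this example. Joins and aggregations benefit in the same way when the source is able to perform them.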
Data Services
DATA SERVICES AND INFORMATION SERVICES
Extending from data virtualization to data and information services is a logical and natural progression. Adding a services layer to virtualized data has several distinct advantages.
• Right-time integrated data is easily available to your newer SOA-based applications.
• SOA-based, “no-hub” master data management (MDM) is enabled, with source data continuing to reside at the source and with fast, real-time, multi-directional data integration.
• Data services maximize opportunity to create reusable data objects that encapsulate both business-rule-based and integration-based behaviors.
• Information services achieve an exceptional level of consumer friendliness for information access.
• Data and information services enable data and information mashups, enhancing the self-service capabilities of consumers to meet their own needs.
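A data service that encapsulates both integration behavior and a business rule might look like the following Python sketch. Everything here is an assumption made for illustration: the customer_totals view, the customer_service function, and the key-account threshold are hypothetical, and a real service would typically be exposed over a protocol such as REST or SOAP rather than called as a local function.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer TEXT, amount REAL);
    INSERT INTO orders VALUES ('Acme', 900.0), ('Acme', 300.0), ('Globex', 50.0);

    -- Integration behavior lives in a view, not in the consuming application.
    CREATE VIEW customer_totals AS
        SELECT customer, SUM(amount) AS total FROM orders GROUP BY customer;
""")

KEY_ACCOUNT_THRESHOLD = 1000.0  # hypothetical business rule


def customer_service(customer):
    """Return integrated data plus a rule-derived attribute as JSON."""
    row = conn.execute(
        "SELECT total FROM customer_totals WHERE customer = ?", (customer,)
    ).fetchone()
    total = row[0] if row else 0.0
    return json.dumps({
        "customer": customer,
        "total": total,
        "key_account": total >= KEY_ACCOUNT_THRESHOLD,
    })


response = json.loads(customer_service("Acme"))
```

Because consumers call the service rather than the underlying tables, the view definition and the business rule can change without changes to the consuming applications.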
A “Bird’s-Eye” View
FROM DATA TO INFORMATION AT HIGH SPEED
The role of data virtualization is to combine data from disparate sources of many types and to move that data at high speed to be delivered as high quality, integrated, consistent, relevant, and timely business information. As with all complex problems, complexity becomes manageable when divided into logical parts. The logical parts of data virtualization include:
• A source layer implementing connection views. The source layer connects with many different data structures and types using a variety of access languages. The source layer is the point at which decoupling (separating data use from data storage) is achieved. When the source layer understands both read and write functions of each source, multi-directional integration becomes possible.
• An integration layer implementing integration views. Here data perspective shifts from data-storage syntax to business syntax. This layer delivers the transformation and federation capabilities to represent data relationships and data combinations not apparent in the individual data sources. Transformation capabilities are also applied for consistent representation of data and for data quality improvement.
• A business layer implementing consumer views. This layer shifts the perspective from business syntax to data usage. The views at this layer present data in accessible and understandable forms that make it readily available for business user consumption. The same views also support rapid application development activities such as prototyping and agile projects.
• An application layer implementing data services, consumer views, or a combination of the two. Both single-point-of-interface and usage-specific information capabilities are supported here. The application layer spans data access capabilities ranging from SOA applications to virtual data marts.
Virtualize or Materialize?
Decision Factors
COMPLEX DECISIONS
Deciding whether to integrate data materially, virtually, or as a hybrid is a complex process that involves many decision variables and that is made on a case-by-case basis:
• Time to solution is the speed at which a data integration solution is needed. Greater urgency indicates virtualization.
• Cost sensitivity is a budget-driven variable. Exceptionally limited budget indicates virtualization.
• Requirements stability is concerned with clarity and constancy of data integration requirements. Clear and stable requirements are suited to materialization; uncertain and volatile requirements fit virtualization.
• Replication constraints consider privacy and policy limits to creating multiple copies of data. Use virtualization when constraints are strong.
• Organizational personality describes a cultural continuum that ranges from cautious and risk-averse to adventurous. Highly cautious organizations are better suited to tried-and-true methods such as ETL.
• Source system availability is essential to virtualization. Limited availability makes on-demand integration difficult to achieve.
• Source system load considers the processing capacity of source systems to take on additional query demand. For source systems with little headroom, demands of virtualization may exceed capacity.
• Data cleansing needs may inhibit use of virtualization. Messy data that requires complex cleansing algorithms is a poor fit for virtualization.
• Transformation complexity considers the structures, dependencies, and quantities of business and data rules that must be applied to integrate data. Highly complex transformations are better suited to materialization than to virtualization.
• Application focus ranges from operational and real-time decision support to time-series analysis and data mining. The real consideration here is the amount of history that is needed in integrated data. When history needs exceed that which is available in source systems at any point in time, then materialization is necessary.
• Data format influences the choice with multi-dimensional and other non-SQL target data structures better suited to materialization than to virtualization.
• Target data freshness has similar influence. Real-time and very low latency data are virtualization friendly and difficult to achieve with ETL and materialization.
• Data volume per query must be considered. Processing large amounts of data with each query is not ideal for data virtualization.
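One informal way to work through these factors for a given integration need is to tally which way each observed condition leans. The Python sketch below is only a discussion aid. The factor keys are hypothetical labels for the conditions described above, and giving every factor equal weight is an assumption of the sketch, not something this course prescribes. In particular, a need for history beyond what source systems retain is described above as making materialization necessary, so it should be treated as decisive rather than as one vote.

```python
# Direction each observed condition leans, following the factor list above.
LEANINGS = {
    "urgent_time_to_solution": "virtualize",
    "very_limited_budget": "virtualize",
    "volatile_requirements": "virtualize",
    "strong_replication_constraints": "virtualize",
    "real_time_freshness": "virtualize",
    "risk_averse_culture": "materialize",
    "limited_source_availability": "materialize",
    "little_source_headroom": "materialize",
    "complex_cleansing": "materialize",
    "complex_transformations": "materialize",
    "history_beyond_sources": "materialize",
    "non_sql_target_format": "materialize",
    "large_volume_per_query": "materialize",
}


def tally(observed):
    """Count how many observed conditions lean toward each approach."""
    counts = {"virtualize": 0, "materialize": 0}
    for factor in observed:
        counts[LEANINGS[factor]] += 1
    return counts


example = tally(
    ["urgent_time_to_solution", "volatile_requirements", "complex_cleansing"]
)
```

A mixed tally like this one is a signal to consider a hybrid approach, for example virtualizing most sources while materializing only the data that needs complex cleansing.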
Business Considerations Discussion
Think about your organization and your BI systems and projects. Where do you fit for each of the five factors shown here?
Is the answer the same for all integration needs or do answers change depending on data subjects, data sources, or data integration projects?
Module 2
Data Integration Architecture
Topic
Integration Architecture Concepts
Reference Architectures
Integration Architecture Examples
Integration Architecture Concepts
Integration Architecture Defined
ARCHITECTURE
Architecture defines the roles, structure, relationships, and rules by which a collection of components constitutes a cohesive whole – the glue that bonds individual parts into a system. Architecture is an early-stage design activity that precedes detailed design, specification, and construction. Effective architecture ensures that the things we build:
• Are suited to the purposes for which they are intended
• Comply with regulations and standards
• Fit gracefully into their environment
• Are sustainable through their expected lifespan
• Are aesthetically pleasing
These principles hold true for architecture of many things – buildings, bridges, information systems, and more.
DATA INTEGRATION ARCHITECTURE
Data integration architecture defines the roles, structure, relationships, and rules to aggregate a collection of data integration components into a data integration system.
The facing page illustrates a generic data integration architecture comprising these components:
• Disparate data sources – The non-integrated data that is the target of data integration activity. The scope of data types ranges from highly structured relational data to unstructured, web, cloud, and “big data” sources.
• Data access methods – The means by which integration technologies connect to data sources. These methods encompass all of the common data access protocols.
• Data integration technologies – The classes of tools that are available to automate and execute data integration tasks: data replication, data virtualization, extract-transform-load (ETL), and enterprise application integration (EAI).
• Data integration techniques – The methods, processes, and products that are used to combine, connect, and rationalize disparate data as a unified data resource: propagation, transformation, consolidation, and federation.
• Integrated data applications – The business and information systems that access and use integrated data.
• Integration management – The essential components for managing integration system internals: quality, metadata, and systems management.
Data Sources, Middleware, and Data Consumers
MORE THAN TECHNOLOGY
The previous view of data integration architecture is a technology-focused perspective that begins with databases and ends with systems. A more holistic view extends the architecture to include the very important elements of business activities and business people.
With these elements included, a three-layer view of data integration is a useful perspective:
• Data sources include the technical components described earlier – the databases containing structured and unstructured data. But the real sources are the business activities where data is created (planning, management, and day-to-day business functions) and the people (planners, managers, and staff) who perform those activities.
• Data consumers include the applications described earlier, ranging from domain-specific systems to analytics. But the ultimate consumers are the business activities that are informed by data (strategic, tactical, and operational) and the executives, managers, and staff who perform those activities.
• Middleware is the technology that bridges from data sources to data consumers. Middleware includes all of the technological components for data access, integration technologies and techniques, and integration management. ETL, EAI, and data virtualization platforms are all types of data integration middleware.
You Have It (Whether Defined or Not)
LEGACY ARCHITECTURE
You probably already have data integration architecture even if you don’t recognize it as such. Anyone with a data warehouse, an operational data store, or even application interfaces has integration components with rules, roles, relationships, and purpose to combine data from multiple sources. The architecture may not be elegant and it may not be documented, but it does exist.
Maybe not elegant, maybe not documented, maybe you don’t recognize it as architecture. But you have components with roles, relationships, and purpose to present unified views of data.
It is important to know what you have, to begin there, and then to ask what you need and how to extend, expand, and evolve the existing architecture for the new challenges of data growth and business agility.
Reference Architectures
Forrester’s Data Architecture Reference Model
WHY REFERENCE ARCHITECTURE?
A reference architecture captures the core concepts of components and relationships for a particular type of system or collection of systems. The purpose is to provide guidance for development of a specific architecture for a targeted organizational and technical environment. A data integration reference architecture, then, captures the essential components and relationships of data integration systems, providing a framework and guidance to define and develop specific data integration architectures.
FORRESTER DATA MANAGEMENT ARCHITECTURE
The facing page illustrates Forrester’s Data Management Reference Architecture.1 Note the substantial presence of data virtualization as architectural components. As indicated by the check marks in the diagram, data virtualization supports virtualized data access, derived data stores, and data integration (though the older term EII is used to refer to virtualization as a data rationalization component).
Although the terminology and visualization are somewhat different, this reference architecture is quite similar to the earlier illustration of generic data integration architecture. It is also interesting that while the title refers to data management, the core of the architecture is focused on data integration, perhaps suggesting that integration is the predominant challenge in data management today.
1 Forrester’s Data Management Reference Architecture, Yuhanna, Leganza, Karel, Evelson, Kobelius, & Owens,
Forrester’s IaaS Architecture
FORRESTER INFORMATION SERVICES ARCHITECTURE
In addition to data management architecture, Forrester offers a reference architecture for Information as a Service (IaaS). Distributed data access, integration middleware, and SOA-based delivery are the core elements of this architecture – core characteristics that depend upon data virtualization technology to enable them.
This reference architecture can provide especially useful guidance to extend and evolve data integration architecture for those organizations pursuing service-oriented master data management (MDM) and 360° views of enterprise data.
1 The Forrester Wave™: Information-As-A-Service, Q1 2010, Yuhanna & Gilpin,
http://www.forrester.com/search?N=10001+20042&range=504001&tmtxt=+IaaS#/The+Forrester+Wave+InformationAsAService+Q1+2010/ quickscan/-/E-RES55204
Gartner’s Data Services Layer Architecture
GARTNER’S SERVICES LAYERS
Gartner also offers a service-oriented reference architecture that is geared to Data as a Service (DaaS). The Gartner model consists of eight layers that progress from data sources to business processes. Business processes, services, and applications constitute the business components. The data services layers work to access and transform data and to map it into semantic and business context. These layers are enabled by data virtualization technology. Data sources are the bottom layer of the stack, representing the wide variety of disparate data types that are today’s integration challenge. Supporting structure includes optimization and data governance processes.
IBM’s BI Reference Architecture
IBM’S VIEW OF BUSINESS INTELLIGENCE
IBM’s Business Intelligence Reference Architecture puts data integration into BI context. Note the focus on connecting data consumers and data sources. But in the IBM model the bridge is a combination of data warehousing and business analytics – a common BI perspective. In this architecture data virtualization isn’t highly visible. It would fit well in the column of integration processes – ETL, data quality, data integration – with the ability to bypass the data stores column immediately to the left of integration processes.
Integration Architecture Examples
Example 1 – Ministry Social Services Logical Architecture
THREE-LAYER ARCHITECTURE
The example on the facing page is drawn from a case study of Compassion International described in the book Data Virtualization.1
The data virtualization system is designed to integrate data from multiple, complex sources including ERP, EDW, application databases, and cloud-hosted databases. Progression from source views of data to consumer views depends on
• multi-layer architecture,
• models to describe the data in source and in business contexts,
• mapping and rules to drive data transformations.
The data virtualization layer encompasses three sub-layers: source transform views, canonical objects, and consumer views – each considered to be a collection of “building blocks.” The characteristics of the building blocks as described by Davis and Eve are:
• They are actual views where query is possible – not just logical objects.
• They look more like source system views at the bottom of the diagram and become increasingly business-oriented as you move upward.
• They encapsulate standard and reusable business logic.
• They can be cached for performance optimization.
• Each is documented in a wiki form accessible to end-users and to developers.
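As a rough illustration of the three sub-layers, the sketch below stacks SQLite views in Python. It is not the Compassion International implementation; the table, column, and view names (erp_donor, src_donor, canonical_donor, uk_donors) are hypothetical, and SQLite merely stands in for a data virtualization platform.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Stand-in for a physical source table (hypothetical ERP extract).
    CREATE TABLE erp_donor (DNR_ID INTEGER, DNR_NM TEXT, CTRY_CD TEXT);
    INSERT INTO erp_donor VALUES (1, ' ada lovelace ', 'gb'),
                                 (2, 'Grace Hopper', 'US');

    -- Sub-layer 1: source transform view, still source-shaped but cleaned.
    CREATE VIEW src_donor AS
        SELECT DNR_ID, TRIM(DNR_NM) AS DNR_NM, UPPER(CTRY_CD) AS CTRY_CD
        FROM erp_donor;

    -- Sub-layer 2: canonical object with business-friendly names.
    CREATE VIEW canonical_donor AS
        SELECT DNR_ID AS donor_id, DNR_NM AS donor_name, CTRY_CD AS country_code
        FROM src_donor;

    -- Sub-layer 3: consumer view shaped for one audience.
    CREATE VIEW uk_donors AS
        SELECT donor_id, donor_name
        FROM canonical_donor
        WHERE country_code = 'GB';
""")


def query(sql):
    """Run a query against any layer; every layer is an actual, queryable view."""
    return conn.execute(sql).fetchall()
```

Each view can be queried directly, for example `query("SELECT * FROM uk_donors")`, which mirrors the point that the building blocks are actual views and not just logical objects.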
Example 2 – Energy Industry Logical Architecture
FOUR-LAYER ARCHITECTURE
This example, also drawn from a Data Virtualization1 case study of a Global 50 energy company, views the data virtualization layer as four distinct sub-layers:
• Source connections access the disparate data sources and provide data to the conforming layer.
• The conforming layer transforms data to conform to a common data model.
• The common semantic layer is a collection of views into the common data model.
• The business demand layer publishes views and services that are used by data consumers to access data.
This data virtualization layer also includes a data storage component that improves performance by staging data for fast retrieval.
The energy company’s IT executive expresses a key concept of this architecture: “The [consuming] application does not go directly to the system of record [the data source] but rather to the record of reference, which is the data virtualization layer.”2
1 Data Virtualization, pp. 115-125, Davis and Eve, 2011 (www.datavirtualizationbook.com)
2 Data Virtualization, p. 118, Davis and Eve, 2011 (www.datavirtualizationbook.com)
Example 3 – Energy Industry Technical Architecture
TECHNOLOGY ENABLED SERVICES
Extending from logical to technical architecture shows the technologies involved in data virtualization and the service roles of each. This extends the view of architecture from the “what” to the “how” of virtualizing data, and shifts perspective from logical layers to technology services.
• The variety of data sources includes Oracle, SQL Server, SAP, Web Services, and local interfaces for inbound data.
• Embarcadero Studio is used to manage and maintain the common data model.
• IBM Netezza Data Warehouse Appliance implements the data storage component.
• Microsoft technology is used for data mapping.
• Cisco Data Virtualization implements the virtualization services.
• A variety of technologies are used by BI and data integration
Example 4 – Financial Services Logical Architecture
VIRTUALIZATION SERVICES ARCHITECTURE
This example illustrates a logical services perspective of data virtualization architecture. Drawn from a Denodo case study1 of a debt collection company, the architecture connects corporate systems and business processes with disparate web data sources through interaction of data acquisition, data transformation, and data virtualization services. The nature of the data sources – LinkedIn, Facebook, Twitter, etc. – is of particular interest here. Working with external and primarily unstructured data is different from working with internal and structured data, especially for data acquisition and transformation. Acquisition uses a combination of web services and web extraction techniques. Transformation must filter, prioritize, and normalize data to be passed to virtualization services. The virtualization services publish the views, services, and interfaces through which consumers access the data.
Virtualize or Materialize?
Data Source Considerations Discussion
Think about your organization and your BI systems and projects. Where do you fit for each of the four factors shown here?
Is the answer the same for all integration needs or do answers change depending on data subjects, data sources, or data integration projects?
Module 3
Data Virtualization in Integration Architecture
Topic Page
Virtualization in Data Integration Projects 3-2
Data Warehousing Use Cases 3-4
Data Federation Use Cases 3-16
MDM and EIM Use Cases 3-28
More Data Virtualization Applications 3-36
Virtualization in Data Integration Projects
Data Virtualization Use Cases
VIRTUALIZATION OPPORTUNITIES
The opportunities for data virtualization in projects are many and don’t necessarily imply long-term virtualization in production applications. It is common to virtualize in development and materialize for production. The facing page illustrates many types of integration systems from data warehousing to cloud data integration, and many uses of virtualization in projects from prototyping to production.
Data Warehousing Use Cases
Data Warehouse Augmentation
EXTENDING THE EXISTING DATA WAREHOUSE
Traditional data warehousing systems are designed to provide integration of structured data through extract, transform and load (ETL) processes. The output of ETL processing is integrated data that is physically stored in a relational database and made available for downstream reporting and access. Depending on the data architecture, additional data stores such as data marts may exist to optimize the information delivery functions. The batch nature of ETL processing necessitates some latency of warehouse data.
CHALLENGES
Long-term success and sustainability of a data warehouse is based on the ability to adapt and evolve to meet continuously changing information needs. The time and effort required to bring in additional source data is a significant challenge for existing data warehouses. The challenge and the complexities increase when the new requirements include unstructured data. Real-time data requirements bring additional challenges in ETL-based data warehousing processes.
Abundance of unstructured data and the impact of big data technologies bring both opportunities and challenges. The emergence of big data in a modern business context – especially social media data – creates opportunity to analyze and better understand customer perceptions and behaviors. But with the opportunity comes complexity – unstructured social data is not a quick and easy fit into a traditional data warehouse.
OPPORTUNITY ENABLED BY VIRTUALIZATION
Data virtualization can be applied to complement and augment an existing data warehouse with virtual views to meet new information requirements. Unstructured data, cloud data, and real-time data integration can be implemented without extensive and disruptive changes to the core data model and ETL processing.
Speed of delivery and speed of data are accelerated with virtualization. Leveraging new and existing data sources more rapidly advances business agility. Unstructured data is integrated with structured data and new reporting applications are quickly implemented.
Data Warehouse Federation
[Figure: Data Virtualization layer with Federated Views over multiple Data Warehouses, each loaded by its own ETL Processing]
MULTIPLE DATA WAREHOUSES
Many organizations have multiple data warehouses for a variety of reasons. Mergers and acquisitions, independent departmental initiatives for data integration, and purchased or hosted applications with data warehouse components are among the most common causes. Whatever the causes may be, the result is typically new silos of data without having achieved full enterprise integration.
CHALLENGES
Enterprise reporting and robust analytics require data integration across the enterprise, but physical integration of multiple data warehouses is time-consuming and costly. It is especially challenging in dynamic and volatile environments where the rate of data and systems change may exceed the capacity for continuous data warehouse alignment.
The real challenge is to deliver an integrated view from many different data warehouses to support enterprise-wide information needs quickly, efficiently, and cost-effectively.
OPPORTUNITY ENABLED BY VIRTUALIZATION
Data virtualization provides the means to meet these challenges quickly, efficiently, and cost-effectively. Each individual data warehouse continues to operate independently, serving the users and purposes for which it is designed. Simultaneously, the warehouses can be federated through views that support new uses and provide enterprise-wide perspective.
Data virtualization is the means to achieve federation. Recall the earlier quote from Rick van der Lans (page 1-10): “federation always results in virtualization.”
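A minimal sketch of warehouse federation follows, assuming two independent warehouses that are simulated here as separate SQLite databases attached to one connection. The database names (dw_emea, dw_amer), the sales table, and the enterprise_sales view are hypothetical. A TEMP view is used because SQLite allows only temporary views to reference tables in more than one attached database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Two independent warehouses, each a separate database.
    ATTACH DATABASE ':memory:' AS dw_emea;
    ATTACH DATABASE ':memory:' AS dw_amer;

    CREATE TABLE dw_emea.sales (region TEXT, revenue REAL);
    CREATE TABLE dw_amer.sales (region TEXT, revenue REAL);
    INSERT INTO dw_emea.sales VALUES ('DE', 120.0), ('FR', 80.0);
    INSERT INTO dw_amer.sales VALUES ('US', 300.0);

    -- Federated view: no data is copied; each warehouse keeps operating as is.
    CREATE TEMP VIEW enterprise_sales AS
        SELECT 'EMEA' AS warehouse, region, revenue FROM dw_emea.sales
        UNION ALL
        SELECT 'AMER' AS warehouse, region, revenue FROM dw_amer.sales;
""")


def revenue_by_warehouse():
    """Enterprise-wide perspective served from the federated view."""
    return conn.execute(
        "SELECT warehouse, SUM(revenue) FROM enterprise_sales "
        "GROUP BY warehouse ORDER BY warehouse"
    ).fetchall()
```

Consumers query enterprise_sales as if it were one table, while the underlying warehouses remain separate and continue to serve their original users.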
Hub and Virtual Spoke
SUSTAINABLE AND SCALABLE DATA MARTS
Hub and spoke data warehouse architecture uses a central data warehouse – the hub – as the point of data integration, then produces multiple data marts – the spokes – to support information needs of various workgroups. Though strength of integration is high, the workloads for development and for operation are also high because each data mart has unique ETL processing.
CHALLENGES
As demand for information accelerates, the demand for new data marts grows. Data mart proliferation brings redundancies and inconsistencies that degrade strength of integration and decay data quality. Parallel growth of workload and loss of quality is clearly not a sustainable approach to data warehousing.
OPPORTUNITY ENABLED BY VIRTUALIZATION
Data virtualization offers the opportunity to create virtual data marts. These “virtual spokes” can be deployed quickly, without the increased workload of additional ETL processing, and with substantially reduced risk of data quality issues. Many new data mart requirements, and many changes to existing data marts, can be met without increased development, processing, and administrative workload. When changes to an existing data mart are implemented by virtualizing the mart, overall workload may actually be reduced.
Virtualization enables the concept of disposable data marts and creation of new data marts easily, and without need to build new physical data stores. This is a particularly powerful technique in highly volatile business and systems environments.
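The idea of a disposable virtual spoke can be sketched as a view created over the hub and later dropped, with no ETL job and no physical data store. This is an illustration only; the fact_orders table and the helper functions create_virtual_mart and drop_virtual_mart are hypothetical names, and SQLite stands in for the hub warehouse.

```python
import sqlite3

hub = sqlite3.connect(":memory:")
hub.executescript("""
    CREATE TABLE fact_orders (order_id INTEGER, dept TEXT, amount REAL, order_date TEXT);
    INSERT INTO fact_orders VALUES
        (1, 'marketing', 50.0, '2015-01-10'),
        (2, 'finance',   75.0, '2015-01-11'),
        (3, 'marketing', 25.0, '2015-02-01');
""")


def create_virtual_mart(conn, name, dept):
    """Create a department data mart as a view over the hub; no data is copied."""
    # View definitions cannot take bound parameters, so inputs are validated first.
    if not (name.isidentifier() and dept.isalnum()):
        raise ValueError("name and dept must be simple identifiers")
    conn.execute(
        f"CREATE VIEW {name} AS "
        "SELECT substr(order_date, 1, 7) AS month, SUM(amount) AS total "
        f"FROM fact_orders WHERE dept = '{dept}' "
        "GROUP BY month"
    )


def drop_virtual_mart(conn, name):
    """Dispose of the mart; the hub data is untouched."""
    if not name.isidentifier():
        raise ValueError("name must be a simple identifier")
    conn.execute(f"DROP VIEW {name}")
```

A mart for a new workgroup is then one call, such as `create_virtual_mart(hub, "mart_marketing", "marketing")`, and removing it later leaves the hub unchanged.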
Complement ETL
RESOLVING ETL INCOMPATIBILITY
Most older data warehouses acquire, process, and load data through ETL processing. The data sources are typically structured data that is readily suited to relational database management systems. As technology has grown and evolved, many new data sources are not well suited to traditional ETL processing.
CHALLENGES
Efforts to add new data sources to existing ETL processes are challenged when the ETL technology lacks interfaces and access methods for the data that is needed. Common examples of data source and ETL incompatibility include ERP-embedded databases and web services data. Modifying existing ETL processes to access these sources is complex and likely to introduce bugs and performance problems into previously stable processes.
OPPORTUNITY ENABLED BY VIRTUALIZATION
Data virtualization is an effective way to remove or reduce incompatibilities between data sources and ETL. Using a virtualization tool you can pre-process problem data sources, creating views that are readily accessible by your ETL technology. Changes to existing ETL processes, and the risks inherent in those changes, are substantially reduced. The complexities and incompatibilities are managed externally while integrity of the ETL process is maintained.
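The pre-processing idea can be sketched as a small adapter that presents nested web services data as flat rows, so that an unchanged, table-oriented ETL step can load it. Everything here is illustrative: the JSON payload, the order_lines_view function, and the stg_order_lines staging table are assumptions, and a real virtualization tool would publish the view itself.

```python
import json
import sqlite3

# Hypothetical payload, standing in for a web service response.
SERVICE_PAYLOAD = """
{"orders": [
  {"id": 1, "customer": {"name": "Acme"},
   "lines": [{"sku": "A", "qty": 2}, {"sku": "B", "qty": 1}]},
  {"id": 2, "customer": {"name": "Globex"},
   "lines": [{"sku": "A", "qty": 5}]}
]}
"""


def order_lines_view(payload):
    """Present the nested service data as flat, relational-style rows."""
    for order in json.loads(payload)["orders"]:
        for line in order["lines"]:
            yield (order["id"], order["customer"]["name"], line["sku"], line["qty"])


def etl_load(rows):
    """Existing ETL step: it only understands flat rows and is left unchanged."""
    staging = sqlite3.connect(":memory:")
    staging.execute(
        "CREATE TABLE stg_order_lines "
        "(order_id INTEGER, customer TEXT, sku TEXT, qty INTEGER)"
    )
    staging.executemany("INSERT INTO stg_order_lines VALUES (?, ?, ?, ?)", rows)
    return staging
```

Calling `etl_load(order_lines_view(SERVICE_PAYLOAD))` loads the service data without any change to the load step; the source complexity stays in the adapter.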
Data Warehouse Prototyping
REQUIREMENTS DISCOVERY
Collecting data warehouse requirements is an iterative process that takes refinement. Requirements analysis for warehousing is a continuous process of discovery about business information needs and about hidden characteristics of source data. The warehouse requirements analyst quickly discovers that
• Users can’t always tell you what they need.
• Data models and documentation are rarely complete, current, and accurate.
Recent interest in agile methods for data warehouse development raises the stakes for warehouse prototyping.
CHALLENGES
The key to effective prototyping is speed – fast response to discovery and change. Rapid prototyping is good prototyping, but physical warehouse development has many barriers to fast changes. When you prototype a data warehouse each cycle of discovery brings changes to database schema, to data transformation logic, and potentially to choice of data sources. These are programming changes – labor intensive, and the antithesis of rapid.
OPPORTUNITY ENABLED BY VIRTUALIZATION
Data virtualization can substantially reduce the challenges of change when prototyping a data warehouse. Without physical database schema, and with much of integration and transformation logic embedded in canonical models, cycles of prototyping are executed much faster. New requirements and new data sources can be quickly integrated into existing virtual structures. Ultimately, when discovery diminishes and requirements stabilize, you can migrate from a virtual to a physical data warehouse for runtime efficiencies and historical data retention.
The prototyping opportunity, of course, applies to extending existing data warehouses as well as building new data warehouses. Prototyping can be performed at any level from EDW to data marts.
Data Warehouse Migration
MOVING THE DATA WAREHOUSE
Sometimes you simply need to move a data warehouse from one platform to another. Cost savings, performance gains, increased accessibility, enhanced mobility, and other motivators drive warehouse migrations. The migration may be from one DBMS to another, from database server to appliance, from row-based to columnar, from locally hosted to cloud, or many other variations.
CHALLENGES
The big challenge in moving a data warehouse is the demand for continued operation and availability. Migration is not an event but a process – one that involves rebuilding of both databases and reporting processes. It occurs over a period of days and weeks, perhaps even months. Yet business information needs and reporting requirements continue to occur on a day-to-day basis. It is not practical to shut down to migrate.
OPPORTUNITIES ENABLED BY VIRTUALIZATION
Data virtualization remediates the challenges of warehouse migration. Create a virtual reporting layer to decouple reporting from the physical data structure. You first build and test the reporting layer mapped to the original data warehouse. Next you build the data warehouse on the new platform. And finally you map the virtual reporting layer to the new data warehouse. The result is a step-by-step process with smooth transition that insulates warehouse users from the impacts of migration.
The physical data warehouse moves from one platform to another, but the virtual reporting layer is more than a temporary solution to a migration problem. Keeping the virtual layer in place increases the flexibility and agility with which new reporting requirements can be met.
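The three migration steps can be sketched with a small reporting layer that holds the mapping from logical report names to physical queries. The ReportingLayer class, the report name sales_by_region, and both warehouse schemas (SLS_FCT and sales) are hypothetical, and two in-memory SQLite databases stand in for the old and new platforms.

```python
import sqlite3


class ReportingLayer:
    """Virtual reporting layer: reports use logical names, never physical tables."""

    def __init__(self, conn, mapping):
        self.remap(conn, mapping)

    def remap(self, conn, mapping):
        # Point the logical reports at a (possibly different) physical warehouse.
        self._conn = conn
        self._mapping = mapping

    def report(self, name):
        return self._conn.execute(self._mapping[name]).fetchall()


# Step 1: build and test the layer against the original warehouse.
old_dw = sqlite3.connect(":memory:")
old_dw.executescript("""
    CREATE TABLE SLS_FCT (RGN TEXT, AMT REAL);
    INSERT INTO SLS_FCT VALUES ('north', 10.0), ('south', 20.0);
""")
layer = ReportingLayer(old_dw, {
    "sales_by_region": "SELECT RGN, SUM(AMT) FROM SLS_FCT GROUP BY RGN ORDER BY RGN",
})
before = layer.report("sales_by_region")

# Step 2: build the warehouse on the new platform (different physical schema).
new_dw = sqlite3.connect(":memory:")
new_dw.executescript("""
    CREATE TABLE sales (region TEXT, amount REAL);
    INSERT INTO sales VALUES ('north', 10.0), ('south', 20.0);
""")

# Step 3: remap the layer; report consumers see no change.
layer.remap(new_dw, {
    "sales_by_region": "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region",
})
after = layer.report("sales_by_region")
```

Because consumers only ever call `layer.report("sales_by_region")`, the switch in step 3 is invisible to them, and the layer remains useful after the migration is complete.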
Data Federation Use Cases
Federated Views
VIEWS ACROSS DATABASES
Database views are a powerful tool when working with relational data. They can be used to hide data structure complexities, to apply business-friendly naming, to join data from multiple tables, and to get maximum advantage from query optimizers. A SQL view, however, is limited to work within the boundaries of a single database.
CHALLENGES
As volume and variety of data increase, every organization experiences a corresponding increase in the number of databases that it manages simultaneously – some data in packaged applications, some in ERP systems, some in legacy databases, some in warehouses, some in the cloud, etc. The segmentation of databases is driven by technology and by the history and evolution of applications. But information needs often cut horizontally across vertically segmented databases. Database views could readily satisfy many of the information needs if they only worked across multiple databases.
VIRTUALIZATION AND VIEWS
Data virtualization eliminates the boundary constraint of views contained within a single database by enabling federated views. Virtualization can combine data from multiple relational databases, Excel spreadsheets, XML, and other formats into a single view that is readily consumed by applications that are unaware of the multi-database data sourcing. The advantages of views now work across multiple databases.
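A federated view across formats can be sketched in plain Python, assuming three small sources: a relational table, a CSV extract (used here as a stand-in for a spreadsheet), and an XML document. The sample data and the customer_performance function are hypothetical; a virtualization tool would expose the same combination as a queryable view.

```python
import csv
import io
import sqlite3
import xml.etree.ElementTree as ET

# Source 1: relational database.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE customers (id INTEGER, name TEXT);
    INSERT INTO customers VALUES (1, 'Acme'), (2, 'Globex');
""")

# Source 2: spreadsheet-style data, represented as CSV text.
CSV_TARGETS = "customer_id,target\n1,1000\n2,500\n"

# Source 3: XML document.
XML_ORDERS = (
    "<orders>"
    "<order customer='1' amount='400'/>"
    "<order customer='1' amount='250'/>"
    "<order customer='2' amount='600'/>"
    "</orders>"
)


def customer_performance():
    """Federated view: one row per customer with actual and target revenue."""
    targets = {
        int(row["customer_id"]): float(row["target"])
        for row in csv.DictReader(io.StringIO(CSV_TARGETS))
    }
    actuals = {}
    for order in ET.fromstring(XML_ORDERS).iter("order"):
        cid = int(order.get("customer"))
        actuals[cid] = actuals.get(cid, 0.0) + float(order.get("amount"))
    return [
        (name, actuals.get(cid, 0.0), targets.get(cid))
        for cid, name in db.execute("SELECT id, name FROM customers ORDER BY id")
    ]
```

The caller of `customer_performance()` receives a single result set and does not need to know that three differently formatted sources were involved.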
Data Services
SOA-BASED INTEGRATION
Sometimes views aren’t enough. The new world of service-oriented architectures and systems needs new approaches to data integration. Wrapping data sources with a services layer is a common way to get started with service-oriented integration. But wrappers are labor-intensive to build and require maintenance with every database change.
CHALLENGES
Making integrated data available to service-oriented architecture (SOA)-based systems is uniquely challenging. Most data integration systems are predicated on relational technologies and optimized for SQL access. But SQL views don’t do the job for web-services applications, service-oriented master data management, etc. When SOAP, REST, WSDL, JMS, etc. are the right protocols, SQL isn’t a satisfactory substitute.
VIRTUALIZATION AND SERVICES
Some data virtualization tools include data services capabilities that make it practical to combine all of the core data integration functions – multiple sources, abstraction, transformation, etc. – with popular SOA-based data delivery formats.
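To make the idea of a data service concrete, the sketch below publishes a view as a REST-style JSON resource using a plain WSGI callable from the Python standard library conventions. It is a simplified stand-in for what a virtualization tool generates; the v_customer view, the /customers path, and the app function are hypothetical.

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.row_factory = sqlite3.Row  # rows can be converted to dictionaries
db.executescript("""
    CREATE TABLE customer (id INTEGER, name TEXT, segment TEXT);
    INSERT INTO customer VALUES (1, 'Acme', 'enterprise'), (2, 'Globex', 'smb');
    -- The integrated view that the service delivers.
    CREATE VIEW v_customer AS SELECT id, name, segment FROM customer;
""")


def app(environ, start_response):
    """WSGI application: serves the view as JSON at /customers."""
    headers = [("Content-Type", "application/json")]
    if environ.get("PATH_INFO") != "/customers":
        start_response("404 Not Found", headers)
        return [b'{"error": "not found"}']
    rows = [dict(row) for row in db.execute("SELECT * FROM v_customer ORDER BY id")]
    start_response("200 OK", headers)
    return [json.dumps(rows).encode("utf-8")]
```

For a quick local trial the callable can be served with `wsgiref.simple_server.make_server("", 8000, app).serve_forever()`; consumers then request the resource over HTTP instead of issuing SQL.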
Data Mashups
NEW WAYS TO PRESENT DATA
The most common form of mashup is a web application that combines existing components from many sources for visual presentation in a new context. A data mashup combines and aggregates data from a variety of different sources. Data mashups are an effective way to meet new needs for information when the data already exists and the effort required is to present that data in new combinations and new visual formats.
CHALLENGES
Web mashups are enabled by the published APIs of web applications that make their data and functions easy to access and integrate. This is the key to fast mashup based on the idea of assembling from existing components instead of building from scratch. The challenge of data mashups is that most corporate data sources do not have readily accessible APIs to support the mashup process.
VIRTUALIZATION AND MASHUP
The data virtualization tools that enable SOA approaches are also enablers of data mashups. The same protocols, services, and data delivery formats that are used for virtualized services fill the role of data APIs for quick and easy access to data.
Caches
FREQUENTLY ACCESSED DATA
While data virtualization is an effective technique for many data integration needs, it may have a visible impact on the performance of source systems and databases. One advantage of physical integration in a warehouse is that the source system is isolated from access by analysis and reporting applications.
CHALLENGES
A virtual data integration system increases the number of ways that operational data is accessible and useful. When the potential becomes reality – a shift from accessible and useful to accessed and used – the operational databases may experience performance challenges. If query optimization isn’t enough, then you may need to duplicate the data, creating a copy of frequently accessed data as a way to isolate the source database.
VIRTUALIZATION AND DUPLICATION
When you need to create a copy of frequently accessed data, two options are possible – replication and caching. A virtualization tool with cache capability does the job with lower overhead and greater flexibility than full database replication. Database replication simply creates copies of tables; integration follows replication. Caching can store copies of virtual views and services; integration is retained in the copy. The database replica is static until updates are pushed to the copy. Caches can be automatically and periodically refreshed to synchronize with the source.
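The caching behavior can be sketched as a wrapper that stores the result of an integrated view and refreshes it once a time-to-live has elapsed. This is an assumption-laden simplification: the CachedView class is hypothetical, and it refreshes lazily on access, whereas a virtualization tool may refresh on a schedule.

```python
import time


class CachedView:
    """Keeps a copy of a view's rows and refreshes it after ttl_seconds."""

    def __init__(self, fetch, ttl_seconds, clock=time.monotonic):
        self._fetch = fetch        # callable that runs the integrated view
        self._ttl = ttl_seconds
        self._clock = clock        # injectable so the behavior can be tested
        self._rows = None
        self._loaded_at = None
        self.source_hits = 0       # how often the source systems were queried

    def rows(self):
        now = self._clock()
        expired = self._loaded_at is None or now - self._loaded_at >= self._ttl
        if expired:
            # Only here do we touch the source; all other reads use the copy.
            self._rows = self._fetch()
            self._loaded_at = now
            self.source_hits += 1
        return self._rows
```

Wrapping an expensive federated query, for example `CachedView(run_view, ttl_seconds=300)` where run_view is any function returning rows, means reporting traffic reads the cached copy and the operational source is queried at most once per refresh interval.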