Managing Data in Motion
Data Integration Best Practice Techniques and Technologies
April Reeve
ELSEVIER
AMSTERDAM • BOSTON • HEIDELBERG • LONDON • NEW YORK • OXFORD • PARIS • SAN DIEGO • SAN FRANCISCO • SINGAPORE • SYDNEY • TOKYO
Morgan Kaufmann is an imprint of Elsevier
Contents
Foreword xv
Acknowledgements xvii
Biography xix
Introduction xxi
PART 1 INTRODUCTION TO DATA INTEGRATION
Chapter 1 The Importance of Data Integration 3
  The natural complexity of data interfaces 3
  The rise of purchased vendor packages 4
  Key enablement of big data and virtualization 5
Chapter 2 What Is Data Integration? 7
  Data in motion 7
  Integrating into a common format—transforming data 7
  Migrating data from one system to another 8
  Moving data around the organization 9
  Pulling information from unstructured data 11
  Moving process to data 12
Chapter 3 Types and Complexity of Data Integration 15
  The differences and similarities in managing data in motion and persistent data 15
  Batch data integration 16
  Real-time data integration 16
  Big data integration 17
  Data virtualization 17
Chapter 4 The Process of Data Integration Development 19
  The data integration development life cycle 19
  Inclusion of business knowledge and expertise 20
PART 2 BATCH DATA INTEGRATION
Chapter 5 Introduction to Batch Data Integration 25
  What is batch data integration? 25
  Batch data integration life cycle 26
Chapter 6 Extract, Transform, and Load 29
  What is ETL? 29
  Profiling 30
  Extract 30
  Staging 31
  Access layers 32
  Transform 33
  Simple mapping 33
  Lookups 33
  Aggregation and normalization 33
  Calculation 34
  Load 34
Chapter 7 Data Warehousing 37
  What is data warehousing? 37
  Layers in an enterprise data warehouse architecture 38
  Operational application layer 38
  External data 38
  Data staging areas coming into a data warehouse 39
  Data warehouse data structure 40
  Staging from data warehouse to data mart or business intelligence 40
  Business Intelligence Layer 40
  Types of data to load in a data warehouse 41
  Master data in a data warehouse 41
  Balance and snapshot data in a data warehouse 42
  Transactional data in a data warehouse 43
  Events 43
  Reconciliation 43
  Interview with an expert: Krish Krishnan on data warehousing and data integration 44
Chapter 8 Data Conversion 51
  What is data conversion? 51
  Data conversion life cycle 51
  Data conversion analysis 52
  Best practice data loading 52
  Improving source data quality 53
  Mapping to target 53
  Configuration data 54
  Testing and dependencies 55
  Private data 55
  Proving 56
  Environments 56
Chapter 9 Data Archiving 59
  What is data archiving? 59
  Selecting data to archive 60
  Can the archived data be retrieved? 60
  Conforming data structures in the archiving environment 61
  Flexible data structures 61
  Interview with an expert: John Anderson on data archiving and data integration 62
Chapter 10 Batch Data Integration Architecture and Metadata 67
  What is batch data integration architecture? 67
  Profiling tool 67
  Modeling tool 68
  Metadata repository 69
  Data movement 69
  Transformation 70
  Scheduling 71
  Interview with an expert: Adrienne Tannenbaum on metadata and data integration 73
PART 3 REAL TIME DATA INTEGRATION
Chapter 11 Introduction to Real-Time Data Integration 77
  Why real-time data integration? 77
  Why two sets of technologies? 78
Chapter 12 Data Integration Patterns 79
  Interaction patterns 79
  Loose coupling 79
  Hub and spoke 80
  Synchronous and asynchronous interaction 83
  Request and reply 83
  Publish and subscribe 84
  Two-phase commit 84
  Integrating interaction types 85
Chapter 13 Core Real-Time Data Integration Technologies 87
  Confusing terminology 87
  Enterprise service bus (ESB) 88
  Interview with an expert: David S. Linthicum on ESB and data integration 89
  Service-oriented architecture (SOA) 90
  Extensible markup language (XML) 92
  Interview with an expert: M. David Allen on XML and data integration 92
  Data replication and change data capture 95
  Enterprise application integration (EAI) 97
  Enterprise information integration (EII) 97
Chapter 14 Data Integration Modeling 99
  Canonical modeling 99
  Interview with an expert: Dagna Gaythorpe on canonical modeling and data integration 100
  Message modeling 103
Chapter 15 Master Data Management 105
  Introduction to master data management 105
  Reasons for a master data management solution 105
  Purchased packages and master data 106
  Reference data 107
  Masters and slaves 107
  External data 110
  Master data management functionality 110
  Types of master data management solutions—registry and data hub 111
Chapter 16 Data Warehousing with Real-Time Updates 113
  Corporate information factory 113
  Operational data store 113
  Master data moving to the data warehouse 116
  Interview with an expert: Krish Krishnan on real-time data warehousing updates 116
Chapter 17 Real-Time Data Integration Architecture and Metadata 119
  What is real-time data integration metadata? 119
  Modeling 120
  Profiling 120
  Metadata repository 120
  Enterprise service bus—data transformation and orchestration 121
  Technical mediation 122
  Business content 122
  Data movement and middleware 123
  External interaction 123
PART 4 BIG, CLOUD, VIRTUAL DATA
Chapter 18 Introduction to Big Data Integration 127
  Data integration and unstructured data 127
  Big data, cloud data, and data virtualization 127
Chapter 19 Cloud Architecture and Data Integration 129
  Why is data integration important in the cloud? 129
  Public cloud 129
  Cloud security 130
  Cloud latency 131
  Cloud redundancy 132
Chapter 20 Data Virtualization 135
  A technology whose time has come 135
  Business uses of data virtualization 137
  Business intelligence solutions 137
  Integrating different types of data 137
  Quickly add or prototype adding data to a data warehouse 137
  Present physically disparate data together 138
  Leverage various data and models triggering transactions 138
  Data virtualization architecture 138
  Sources and adapters 138
  Mappings and models and views 138
  Transformation and presentation 139
Chapter 21 Big Data Integration 141
  What is big data? 142
  Big data dimension—volume 142
  Massive parallel processing—moving process to data 142
  Hadoop and MapReduce 143
  Integrating with external data 144
  Visualization 144
  Big data dimension—variety 145
  Types of data 145
  Integrating different types of data 145
  Interview with an expert: William McKnight on Hadoop and data integration 145
  Big data dimension—velocity 146
  Streaming data 147
  Sensor and GPS data 147
  Social media data 147
  Traditional big data use cases 147
  More big data use cases 148
  Health care 148
  Logistics 148
  National security 149
  Leveraging the power of big data—real-time decision support 149
  Triggering action 149
  Speed of data retrieval from memory versus disk 150
  From data analytics to models, from streaming data to decisions 150
  Big data architecture 151
  Operational systems and data sources 151
  Intermediate data hubs 151
  Business intelligence tools 152
  Data virtualization server 153
  Batch and real-time data integration tools 153
  Analytic sandbox 153
  Risk response systems/recommendation engines 153
  Interview with an expert: John Haddad on Big Data and data integration 154
Chapter 22 Conclusion to Managing Data in Motion 157
  Data integration architecture 157
  Why data integration architecture? 157
  Data integration life cycle and expertise 158
  Security and privacy 158
  Data integration engines 160
  Operational continuity 160
  ETL engine 160
  Enterprise service bus 161
  Data virtualization server 161
  Data movement 162
  Data integration hubs 162
  Master data 163
  Data warehouse and operational data store 164
  Enterprise content management 164
  Data archive 164
  Metadata management 164
  Data discovery 165
  Data profiling 165
  Data modeling 165
  Data flow modeling 165
  Metadata repository 166
  The end 166
References 167
Index 169