... 50102GC20 Production 2.0 May 1999 M08762
Data Warehousing
Fundamentals
Authors Chon S. Chua Richard Green Technical Contributors and Reviewers Jackie Collins Jennifer Jacoby Mike Schmitz John Haydu Russ Pitts Lauran Serhal Brian Pottle Donna Corrigan Patricia Moll Harry Penbert SuiWah Chan Joel Barkin Steve Dressler Publisher Tony McGettigan
Copyright Oracle Corporation, 1999. All rights reserved.
This documentation contains proprietary information of Oracle Corporation. It is provided under a license agreement containing restrictions on use and disclosure and is also protected by copyright law. Reverse engineering of the software is prohibited. If this documentation is delivered to a U.S. Government Agency of the Department of Defense, then it is delivered with Restricted Rights and the following legend is applicable:
Restricted Rights Legend
Use, duplication or disclosure by the Government is subject to restrictions for commercial computer software and shall be deemed to be Restricted Rights software under Federal law, as set forth in subparagraph (c) (1) (ii) of DFARS 252.227-7013, Rights in Technical Data and Computer Software (October 1988). This material or any portion of it may not be copied in any form or by any means without the express prior written permission of Oracle Corporation. Any other copying is a violation of copyright law and may result in civil and/or criminal penalties.
If this documentation is delivered to a U.S. Government Agency not within the Department of Defense, then it is delivered with “Restricted Rights,” as defined in FAR 52.227-14, Rights in Data-General, including Alternate III (June 1987). The information in this document is subject to change without notice. If you find any problems in the documentation, please report them in writing to Education Products, Oracle Corporation, 500 Oracle Parkway, Box SB-6, Redwood Shores, CA 94065. Oracle Corporation does not warrant that this document is error-free. Data Warehouse Method—A Methodology for Designing Data Warehouse, SQL*Loader, PL/SQL, Pro*C, Oracle7, Oracle8, and Oracle8i, Distributed Option, Parallel Query Option, Parallel Server Option, Media Server, Spatial Data Option, ConText Option, Video Server, Text Server, WebServer, Oracle Universal Server ROLAP Option, Express Server, Web-enabled Express Server, SQL*Net, Developer/2000, Relational Access Manager, Discoverer, Designer/2000, SQL*Bridge, Transparent Gateway Developer’s Kit, Procedural Gateway Developer’s Kit, Express, Express Analyzer, Express Objects, Sales Analyzer, and Financial Analyzer are product names, trademarks, or registered trademarks of Oracle Corporation.
All other products or company names are used for identification purposes only and may be trademarks of their respective owners.
...
Data Warehousing Fundamentals iii
...Contents
Preface
Profile xi
Related Publications xiv
Typographic Conventions xv
Lesson 1: Introduction
Course Objectives 1-3
Agenda 1-5
Questions About You 1-9
Lesson 2: Meeting a Business Need
Overview 2-3
Unsuitability of OLTP Systems for Complex Analysis 2-5
Management Information Systems and Decision Support 2-7
Data Extract Processing 2-9
Business Drivers for Data Warehouses 2-15
Current Situation and Growth of Data Warehousing 2-19
Typical Uses of a Data Warehouse 2-21
Summary 2-23
Practice 2-1 2-25
Lesson 3: Defining Data Warehouse Concepts and Terminology
Overview 3-3
Data Warehouse Definition 3-5
Data Warehouse Properties 3-7
Data Warehouse Terminology 3-21
Components of a Data Warehouse 3-25
Oracle Warehouse Vision, Products, and Services 3-31
Summary 3-41
Practice 3-1 3-43
Lesson 4: Driving Implementation Through a Methodology
Overview 4-3
Warehouse Development Approaches 4-5
... Contents
Oracle Data Warehouse Method 4-15
DWM Fundamental Elements 4-19
Oracle Warehouse Technology Initiative (WTI) 4-57
Summary 4-61
Practice 4-1 4-63
Lesson 5: Planning for a Successful Warehouse
Overview 5-3
Managing Financial Issues 5-5
Obtaining Business Commitment 5-9
Managing a Warehouse Project 5-15
Identifying Planning Phases 5-29
Identifying Warehouse Strategy Phase Deliverables 5-31
Identifying Project Scope Phase Deliverables 5-35
Summary 5-41
Practice 5-1 5-43
Lesson 6: Analyzing User Query Needs
Overview 6-3
Types of Users 6-5
Gathering User Requirements 6-7
Managing User Data Access 6-9
Security 6-21
OLAP 6-25
Query Access Architectures 6-47
Summary 6-51
Practice 6-1 6-53
Lesson 7: Modeling the Data Warehouse
Overview 7-3
...
Data Warehousing Fundamentals v
...Contents
Summary 7-41
Practice 7-1 7-43
Lesson 8: Choosing a Computing Architecture
Overview 8-3
Architecture Requirements 8-5
The Hardware Architecture 8-7
Database Server Requirements 8-29
Parallel Processing 8-33
Summary 8-39
Practice 8-1 8-41
Lesson 9: Planning Warehouse Storage
Overview 9-3
The Server Data Architecture 9-5
Protecting the Database 9-17
Summary 9-27
Practice 9-1 9-29
Lesson 10: Building the Warehouse
Overview 10-3
Extracting, Transforming, and Transporting Data 10-5
Extracting Data 10-13
Examining Data Sources 10-15
Extraction Techniques 10-23
Extraction Tools 10-35
Summary 10-39
Practice 10-1 10-41
Lesson 11: Transforming Data
Overview 11-3
Importance of Data Quality 11-5
Transformation 11-13
Transforming Data: Problems and Solutions 11-17
... Contents
Transformation Tools 11-53
Summary 11-57
Practice 11-1 11-59
Lesson 12: Transportation: Loading Warehouse Data
Overview 12-3
Transporting Data into the Warehouse 12-5
Building the Transportation Process 12-11
Transporting the Data 12-15
Postprocessing of Loaded Data 12-25
Summary 12-39
Practice 12-1 12-41
Lesson 13: Transportation: Refreshing Warehouse Data
Overview 13-3
Capturing Changed Data 13-5
Limitations of Methods for Applying Changes 13-25
Purging and Archiving Data 13-33
Final Tasks 13-39
Selecting ETT Tools 13-43
Summary 13-51
Practice 13-1 13-53
Lesson 14: Leaving a Metadata Trail
Overview 14-3
Defining Warehouse Metadata 14-5
Developing a Metadata Strategy 14-11
Examining Types of Metadata 14-19
Metadata Management Tools 14-33
Common Warehouse Metadata 14-35
...
Data Warehousing Fundamentals vii
...Contents
Business Intelligence 15-5
Multidimensional Query Techniques 15-7
Categories of Business Intelligence Tools 15-9
Data Mining in a Warehouse Environment 15-19
Oracle Data Mining Partners 15-33
Summary 15-35
Practice 15-1 15-37
Lesson 16: Web-Enabling the Warehouse
Overview 16-3
Accessing the Warehouse Over the Web 16-5
Common Web Data Warehouse Architecture 16-9
Issues in Deploying a Data Warehouse on the Web 16-11
Evaluating Web-Based Tools 16-19
Summary 16-23
Practice 16-1 16-25
Lesson 17: Managing the Data Warehouse
Overview 17-3
Managing the Transition to Production 17-5
Managing Growth 17-19
Managing Backup and Recovery 17-33
Identifying Data Warehouse Performance Issues 17-45
Summary 17-51
Appendix A: Practice Solutions
Practice 2-1 A-2 Practice 3-1 A-4 Practice 4-1 A-7 Practice 5-1 A-11 Practice 6-1 A-12 Practice 7-1 A-13 Practice 8-1 A-14 Practice 9-1 A-15
... Contents Practice 10-1 A-18 Practice 11-1 A-20 Practice 12-1 A-21 Practice 13-1 A-23 Practice 14-1 A-24 Practice 15-1 A-26 Practice 16-1 A-28 Glossary
...
10
... Lesson 10: Building the Warehouse
Copyright Oracle Corporation, 1999. All rights reserved.
Overview
Project Management (Methodology, Maintaining Metadata) Defining DW Concepts & Terminology Planning for a Successful Warehouse Analyzing User Query Needs Choosing a Computing Architecture Modeling the Data Warehouse Planning Warehouse Storage ETT (Building the Warehouse) ETT (Building the Warehouse) Meeting a Business Need Supporting End User Access Managing the Data Warehouse
Copyright Oracle Corporation, 1999. All rights reserved.
Objectives
After completing this lesson, you should be able to do the following:
•
Outline the extraction, transformation, and transportation processes for building a data warehouse•
Identify extraction issues•
Explain how to examine data sources•
Identify extraction techniques•
List tools that can be used to extract data from sources...
Data Warehousing Fundamentals 10-3
...Overview
Overview
In this lesson, you explore the sources of data for the data warehouse data. You consider how the extraction and transformation processes take data from source systems and change it into data that is acceptable to the users of the data warehouse. The lesson also describes typical data anomalies and looks at ways to eliminate them. Note that the “ETT (Building the Warehouse)” block is highlighted in the overview slide on the facing page.
Objectives
After completing this lesson, you should be able to do the following:
• Outline the extraction, transformation, and transportation processes for building a
data warehouse.
• Identify extraction issues.
• Explain how to examine data sources.
• Identify extraction techniques.
... Lesson 10: Building the Warehouse
Copyright Oracle Corporation, 1999. All rights reserved.
Extraction/Transformation/Transportation
Processes (ETT)
•
Extract source data•
Transform/clean data•
Index and summarize•
Load data into WH•
Detect changes•
Refresh data Programs Tools ETT Operational systems Warehouse Browser:http:// Hollywoo d Hollywoo d + +XX C ustom ers: a recor of a s X X + + C ustom ers: Browser:http:// Hol lywood Hol lywood Browser:http:// Hollywoo d Hollywoo d X X + + Gateways...
Data Warehousing Fundamentals 10-5
...Extracting, Transforming, and Transporting Data
Extracting, Transforming, and Transporting Data
Extraction, Transformation, and Transportation Tasks
Before considering this lesson’s focus on extraction, you should be aware that extraction, transformation, and transportation (sometimes called ETT) describes the series of processes that:
• Extract data from source systems
• Transform and clean up the data
• Index the data
• Summarize the data
• Load data into the warehouse
• Detect the changes made to source data required for the warehouse
• Restructure keys
• Maintain the metadata
• Refresh the warehouse with updated data
You can use custom programming, gateways between database systems, and internally developed tools or vendor tools to carry out the ETT processes.
... Lesson 10: Building the Warehouse
Copyright Oracle Corporation, 1999. All rights reserved.
ETT Processes
•
Must result in data that is relevant, useful, high-quality, accurate, and accessible•
Require a large proportion of warehouse development time and resourcesWarehouse Operational systems Relevant Clean up Consolidate Restructure ETT Useful Quality Accurate Accessible
...
Data Warehousing Fundamentals 10-7
...Extracting, Transforming, and Transporting Data
ETT Processes
ETT Importance The extraction, transformation, and transportation processes are
absolutely fundamental in ensuring that the data resident in the warehouse is:
• Relevant and useful to the business users
• High quality
• Accurate
• Easy to access so that the warehouse is used efficiently and effectively by the
business users
ETT Cost Building the ETT process is potentially one of the biggest tasks of
building a warehouse; it is complex and time-consuming. In some implementations, it can take more than half of the total warehouse implementation effort.
Note: Extraction is covered by this lesson; transformation and transportation are
... Lesson 10: Building the Warehouse
Copyright Oracle Corporation, 1999. All rights reserved.
Data Staging Area
•
The construction site for the warehouse•
Required by most implementations•
Composed of ODS, flat files, or relational server tables•
Frequently configured as multitier stagingExtract Transform Operational system Transport (Load) Warehouse Data staging area
...
Data Warehousing Fundamentals 10-9
...Extracting, Transforming, and Transporting Data
The Data Staging Area
Ralph Kimball is one of the most widely recognized experts in the field of data warehousing. Kimball calls the data staging area the construction site for the
warehouse. This is where much of the data transformation and cleansing takes place. A staging area is a typical requirement of warehouse implementations. It may be an operational data store environment, a set of flat files, a series of tables in a relational database server, or proprietary data structures used by data staging tools.
You may employ multitier staging that reconciles data before and after the
transformation process and before data is loaded into the warehouse. As many as three tiers are possible, from the operational server to the staging area and then to the warehouse server.
Note: Some ETT tools stage data internally and do not require a separate staging area.
If you are using the Oracle server and in-house developed tools, data is typically transformed after it is bulk-loaded (using SQL*Loader) into the staging area—the database tables. PL/SQL is often used to transform the data. You may also use gateways and replication techniques.
... Lesson 10: Building the Warehouse
Copyright Oracle Corporation, 1999. All rights reserved.
Remote Staging Model
Data staging area within the warehouse environment
Extract, transform, transport Transform Operational system Transport (Load) Data staging area Warehouse Warehouse environment Oper. envt.
Data staging area in its own environment, avoiding negative impact on the warehouse environment
Extract, transform, transport Transform Operational system Transport (Load) Data staging area Warehouse Staging envt.
Oper. envt. Warehouse envt.
Copyright Oracle Corporation, 1999. All rights reserved.
Onsite Staging Model
Extract Transform Operational system Transport (Load) Data staging area Warehouse
Operational environment WH envt.
Data staging area within the operational environment, possibly affecting the operational system
...
Data Warehousing Fundamentals 10-11
...Extracting, Transforming, and Transporting Data
Possible Staging Models
Choosing a Model The model you choose depends upon operational and warehouse
requirements, system availability, connectivity bandwidth, gateway access, and volume of data to be moved or transformed.
Remote Staging Model You may choose to extract the data from the operational
environment and transport it into the warehouse environment for transformation processing. You may optionally execute some transformation processing during the extraction and transportation from operational to warehouse environment. You would then execute the bulk of transformation processing in the warehouse environment’s staging area.
On-site Staging Model Alternatively, you may choose to perform the cleansing,
transformation, and summarization processes locally in the operational environment and then extract to the staging area. This model may conflict with the day-to-day working of the operational system. If chosen, this model’s process should be executed when the operational system is idle or less heavily used.
... Lesson 10: Building the Warehouse
Copyright Oracle Corporation, 1999. All rights reserved.
Extracting Data
•
Routines developed to select fields from source•
Various data formats•
Rules, audit trails, error correction facilitiesTransform Operational
databases Data staging
area
Warehouse database
Browser:http:// Hollywood
Hollywood Custom + +ers:XX
a recor of a s X X + + Custom ers: Browser:http:// Hollywood Hollywood Browser:http:// Hollywood Hollywood X X + + Data mapping
Copyright Oracle Corporation, 1999. All rights reserved.
Source Systems
•
Production•
Archive•
Internal•
External Browser: http:// Hol lywood Hol lywood X X + + Cust om ers: Browser:http:// Hollywood Hollywood X X + + a rec orof a s X X + + Custo mers: Browser:http:// Hollywood Hollywood 12345.00 12780.00 2345787.00 87877.98 5678.00 100% 110% 230% 200% -10% ABC CO GMBH LTD GBUK INC FFR ASSOC MCD CO...
Data Warehousing Fundamentals 10-13
...Extracting Data
Extracting Data
The process of data extraction takes selected data fields that pertain to the subject area maintained by the data warehouse. The data may come from a variety of source systems, and the data may exist in a variety of formats.
The extraction routines are developed to account for the variety of systems from which data is taken. These routines contain data or business rules, as well as audit trails and error correction facilities.
Source Systems The source systems mentioned may be in the form of data
existing in:
• Production operational systems
• Archives
• Internal files not directly associated with company operational systems, such as
individual spreadsheets and workbooks
• External data from outside the company
Extraction Routines The routines created for extraction are specifically developed
to account for the variety of systems from which data is taken. The routines contain data or business rules, audit trails, and error correction facilities. The routines take into account the frequency with which data is to be extracted.
... Lesson 10: Building the Warehouse
Copyright Oracle Corporation, 1999. All rights reserved.
• Operating system platforms
• Hardware platforms
• File systems
• Database systems and vertical applications
Production Data IMS DB2 VSAM NonStop SQL Oracle Sybase Rdb SAP Shared Medical Systems
Dun and Bradstreet Financials Hogan Financials Oracle Financials Browser:http:// Hollywood Hollywood + +XX C ustom ers: a recor of a s X X + + C ustomers: Browser:http:// Hollywood Hollywood Browser:http:// Hollywood Hollywood X X + +
Copyright Oracle Corporation, 1999. All rights reserved.
• Historical data
• Useful for analysis over long periods of time
• Useful for first-time load
• May require unique transformations
Archive Data
Operational databases
Warehouse database
...
Data Warehousing Fundamentals 10-15
...Examining Data Sources
Examining Data Sources
Production Data Production data may come from a multitude of different sources:
• Operating system platforms
• Hardware platforms
• File systems (flat files)
• Database systems, for example, Oracle, DB2, dBase, Informix, ISAM, NonStop
SQL, Rdb, and TurboImage
• Vertical applications, such as Oracle Financials, SAP, PeopleSoft, Baan, and Dun
and Bradstreet
Archive Data Archive data may be useful to the enterprise in supplying historical
data. Historical data is needed if analysis over long periods of time is to be achieved. Archive data is not used consistently as a source for the warehouse; for example, it would not be used for regular data refreshes. However, for the initial implementation of a data warehouse (and the first-time load), archived data is an important source of historical data.
You need to consider this carefully when planning the data warehouse. How much historical data do you have available for the data warehouse? How much effort is necessary to transform it into an acceptable format?
The data warehouse may need some careful and unique transformations, and clear details of the changes must be maintained in metadata.
... Lesson 10: Building the Warehouse
Copyright Oracle Corporation, 1999. All rights reserved.
Internal Data
•
Planning, sales, and marketing organization data•
Maintained by:– Spreadsheets (structured) – Documents (unstructured)
•
Treated like any other source dataPlanning Marketing Accounting 12345.00 12780.00 2345787.00 87877.98 5678.00 100% 110% 230% 200% -10% ABC CO GMBH LTD GBUK INC FFR ASSOC MCD CO Warehouse database 12345.00 12780.00 2345787.00 87877.98 5678.00 100% 110% 230% 200% -10% ABC CO GMBH LTD GBUK INC FFR ASSOC MCD CO 12345.00 12780.00 2345787.00 87877.98 5678.00 100% 110% 230% 200% -10% ABC CO GMBH LTD GBUK INC FFR ASSOC MCD CO
...
Data Warehousing Fundamentals 10-17
...Examining Data Sources
Internal Data
Internal data may be information prepared by planning, sales, or marketing
organizations that contains data such as budgets, forecasts, or sales quotas. The data contains figures (numbers) that are used across the enterprise for comparison
purposes. The data is maintained using software packages such as spreadsheets and word processors and uploaded into the warehouse.
Internal data is treated like any other source system data. It must be transformed, documented in metadata, and mapped between the source and target databases.
... Lesson 10: Building the Warehouse
Copyright Oracle Corporation, 1999. All rights reserved.
•
Information from outside the organization•
Issues of frequency, format, and predictability•
Described and tracked using metadataExternal Data
Barron’s Dun and Bradstreet Purchased databases Wall Street Journal Economic forecasts Competitive information Warehousing databases A.C. Nielsen, IRI, IMS,...
Data Warehousing Fundamentals 10-19
...Examining Data Sources
External Data
External data is important if you want to compare the performance of your business against others. There are many sources for external data:
• Periodicals and reports
• External syndicated data feeds (Some warehouses rely regularly on this as a
source)
• Competitive analysis information
• Newspapers
• Purchased marketing, competitive, and customer related data
• Free data from the Web
Issues You must consider the following issues with external data:
• Frequency: There is no real pattern like that of internal data. Constant monitoring
is required to determine when it is available.
• Format: The data may be different in format than internal data, and the granularity
of the data may be an issue. In order to make it useful to the warehouse a certain amount of reformatting may be required. In addition, you may find that external data, particularly that available on the Web, comes with digital audio data, picture image data, and digital video data. These present an interesting challenge in storage and speed of access.
• Predictability: External data is not predictable; it can come from any source at any
time, in any format, on any medium.
Tracked Using Metadata Metadata (described earlier as descriptive data about
data) plays an invaluable role in the registration, access, and control of external data. The metadata should provide the warehouse manager with as much information about the external data as possible, averting the need to examine the data closely.
Note: ETT decisions and strategies can evolve over time throughout the life of the
warehouse. It may be prudent to track those strategies and decisions, so that you can always explain the algorithmic logic or business rules used at different times with current, recent, or archived data.
... Lesson 10: Building the Warehouse
Copyright Oracle Corporation, 1999. All rights reserved.
Mapping
•
Defines which operational attributes to use•
Defines how to transform the attributes for thewarehouse
•
Defines where the attributes exist in the warehouse•
Mapping tools are availableFile A
F1 123
F2 Bloggs
F3 10/12/56
Staging File One Number USA123
Name Mr. Bloggs
DOB 10-Dec-56
Metadata
File A Staging File One
F1 Number
F2 Name
...
Data Warehousing Fundamentals 10-21
...Examining Data Sources
Mapping Data
Once you have determined your business subjects for the warehouse, you need to determine the required attributes from the source systems.
On an attribute-by-attribute basis you must determine how the source data maps into the data warehouse, and what, if any, transformation rules to apply. This is known as
mapping. There are mapping tools available.
Mapping information should be maintained in metadata that is server (RDBMS) resident, for ease of access, maintenance, and clarity.
... Lesson 10: Building the Warehouse
Copyright Oracle Corporation, 1999. All rights reserved.
•
Programs: C, COBOL, PL/SQL•
Gateways: transparent database access•
In-house development is popular•
Tools– High initial cost – Ongoing automation – Data cleanup
...
Data Warehousing Fundamentals 10-23
...Extraction Techniques
Extraction Techniques
You can extract data from different source systems to the warehouse in different ways:
• Programmatically, using procedural languages such as COBOL, C, C++, or
Procedural SQL
• Using a gateway to access data sources. This method is acceptable only for small
amounts of data; otherwise, the network traffic becomes unacceptably high.
• In-house developed tools that:
– Store a physical definition of the source and warehouse data
– Create data dictionaries
– Generate data conversion programs
– Clean and transform the data
– Allow selective retrieval
– Maintain metadata
Note: In-house development is an ongoing activity that may become a resources black
hole. You need local knowledge to support all of the file formats.
• Using a vendor’s data extraction tool
Although it is expensive, an extraction tool:
– Provides ongoing automation of the data extraction process
– Supports data cleanup
More than 50% of companies use their own in-house development teams to develop data extraction programs. The extraction process may access different host systems media, such as fiche, optical, tape, CD, and disk formats.
... Lesson 10: Building the Warehouse
Copyright Oracle Corporation, 1999. All rights reserved.
Sources and Targets
OLAP Data marts
Data analysis
Data mining
...
Data Warehousing Fundamentals 10-25
...Extraction Techniques
Sources and Targets
To summarize, the data for the warehouse is a complex mixture of structured and unstructured data from different source systems. It all needs to be moved in a clean and integrated state into the warehouse.
Note: The same process is performed for current data that is to reside in an operational
... Lesson 10: Building the Warehouse
Copyright Oracle Corporation, 1999. All rights reserved.
Designing Extraction Processes
•
Analysis:– Sources, technologies – Data types, quality, owners
•
Design options:– Manual, custom, gateway, third-party – Replication, full, or delta refresh
•
Design issues:– Batch window, volumes, data currency – Automation, skills needed, resources
...
Data Warehousing Fundamentals 10-27
...Extraction Techniques
Designing Extraction Processes
When designing your extraction processes, consider the analysis issues, the design options available to you, and the design issues.
Analysis
• The sources and technologies used
• Existing data feeds and redo logs
• Data types (EBCDIC or ASCII)
• Data quality and ownership
• Data volumes
• Operational schedule in the source environment
• Spare processing capacity in the source environment
Design Options
• Manual data entry
• Custom programs
• Gateway technologies
• Replication techniques
• Third party tools
• Full refresh or delta changes
Design Issues
• Batch window
• Data volumes
• Data currency (how up-to-date the data is to be)
• Degree of automation required
• Technology skills needed
... Lesson 10: Building the Warehouse
Copyright Oracle Corporation, 1999. All rights reserved.
Maintaining Extraction Metadata
•
Source location, type, structure•
Access method•
Privilege information•
Temporary storage•
Failure procedures•
Validity checks...
Data Warehousing Fundamentals 10-29
...Extraction Techniques
Maintaining Extraction Metadata
It is essential to maintain a “metadata trail” of information about all ETT processes, including the extraction process. This information is important for warehouse enhancement and performance improvements.
The quality of metadata is critical for every aspect of the warehouse; attention must be paid to its control, management, and change.
Extraction metadata includes:
• The source location, type, contact, and structure information
• The access method
• The privilege information
• The extraction temporary storage information
• The extraction failure and validity check procedures information
• Information about how to handle missing data
Extraction metadata also contains information about the frequency of program execution and maps the source data to the target database.
... Lesson 10: Building the Warehouse
Copyright Oracle Corporation, 1999. All rights reserved.
Possible ETT Failures
•
A missing source file•
A system failure•
Inadequate metadata•
Poor mapping information•
Inadequate storage planning•
A source structural change•
No contingency plan...
Data Warehousing Fundamentals 10-31
...Extraction Techniques
Possible ETT Failures
ETT processes are vital to the warehouse, and they must succeed. ETT may fail for any of the following reasons:
• Extraction routines must specify the name and location of the source data. A
missing file may cause the extraction to fail. You must therefore ensure that exception and error handling routines are included.
• If there is a system or media failure during the process, the process may fail
entirely. You must start again or you may, depending upon system settings, be able to continue from the point of failure.
• Metadata that inadequately describes the source to destination mapping and rules
will cause ETT to fail; for example, when an unexpected value is found.
• Without the space for temporary data, staging data, and sorting operations, ETT
fails.
• Any changes to the source systems that are not documented in metadata will cause
extraction to fail.
• Contingency plans are needed, including mechanisms for correcting or reapplying
processing.
• If data is not validated correctly, the quality of extraction and the success of
transformation cannot be guaranteed. This translates to a data warehouse that may contain dirty data at the end of the load.
... Lesson 10: Building the Warehouse
Copyright Oracle Corporation, 1999. All rights reserved.
Maintaining ETT Quality
•
ETT must be: – Tested – Documented– Monitored and reviewed
...
Data Warehousing Fundamentals 10-33
...Extraction Techniques
Maintaining ETT Quality
Any failure of the ETT processes affects data quality, the importance of which cannot be underestimated. Inaccurate data leads to inaccurate analysis results, which lead to bad business decisions. The result of poor data quality is a lack of confidence in the system to deliver the solution.
Testing the Process You should test the proposed ETT techniques to ensure that
volumes can be physically moved within the load window constraints and network capabilities.
Documenting the Process You must communicate and document the proposed load
processes with the operations organization to ensure their agreement and commitment to this important process.
Monitoring and Reviewing the Process You should ensure that the load is
constantly monitored and reviewed, and revise metrics where needed. Warehouse data volumes grow rapidly, and metrics for load and data granularity need regular revision. The grain of the warehouse affects query capabilities and the warehouse size.
... Lesson 10: Building the Warehouse
Copyright Oracle Corporation, 1999. All rights reserved.
Extraction Tools
Mapping information
Update metadata JCL files
Map Source Data to Intermediate File Store
Map Source Data to Intermediate File Store
Sales and Marketing Customer Name
Char Varchar 20 Unique name
Copyright Oracle Corporation, 1999. All rights reserved.
•
Base functionality•
Interface features•
Metadata repository•
Open API•
Metadata access•
Repository utilities•
Input and output processing•
Cleansing, reformatting, and auditing•
References•
Training requirements...
Data Warehousing Fundamentals 10-35
...Extraction Tools
Extraction Tools
Extraction tools normally have a GUI front end that allows you to enter the individual field mappings from source to target systems. The tools normally:
• Generate the required code for the mapping, whether COBOL, C, or any other
language
• Create the necessary job control and scheduling files for the specific platform
• Create and manage changes to the metadata
Selection Criteria
The warehouse uses a host of different tools for extraction, modeling, management, and access. A tools selection committee must ensure that every tool selected meets identified requirements. This is usually a rigorous process.
If you decide to buy an extraction tool, consider the following fundamental issues:
• Base functionality
• Interface features and functionality
• The metadata repository and the attributes stored in the repository
• Open API
• Access to metadata by end users
• The effectiveness of the way that the tool presents the information
• Repository utilities such as scheduling, name, and address management
• Data extraction inputs and outputs
• Data cleansing, reformatting, and auditing features
Ask the tool vendor for customer references, so that you can ask those customers to describe their goals, successes, and failures with the product.
Consider the training required for the extraction tool. The complexity of the available extraction products varies, as does the ability of your staff. Training may be required for a few days or weeks.
... Lesson 10: Building the Warehouse
Copyright Oracle Corporation, 1999. All rights reserved.
WTI Partner ETT Tools
•
Carleton•
Constellar•
Evolutionary Technologies•
Informatica•
Information Builders•
Oracle EDMS, Toolkits, OADW•
Prism Solutions•
Sagent...
Data Warehousing Fundamentals 10-37
...Extraction Tools
WTI Partner ETT Tools
The choice of ETT techniques and tools is often driven by the quality of the source data.
WTI Partner Product
Carleton Corp Carleton Passport, Carleton Passport Development
Workbench
Constellar Constellar Hub
Evolutionary Technologies ETI Development Workbench, ETI Extract Tool Suite
Informatica Corporation PowerMart (Designer, Server, and Manager)
Information Builders, Inc. EDA Copy Manager
Oracle EDMS (Extraction and Transformation Template)
Toolkits OADW
Prism Solutions, Inc. Prism Change Manager, Prism Development
Workbench, Prism Warehouse Manager
Sagent Data Mart Suites
... Lesson 10: Building the Warehouse
Copyright Oracle Corporation, 1999. All rights reserved.
Summary
This lesson discussed the following topics:
•
ETT processes are essential and consume a large proportion of warehouse resources and time•
The extraction process acquires source data•
You may encounter many data sources•
There are many data extraction issues•
ETT Tools should be considered...
Data Warehousing Fundamentals 10-39
...Summary
Summary
This lesson discussed the following topics:
• ETT processes are essential and consume a large proportion of warehouse
resources and time
• The extraction process acquires source data
• You may encounter many data sources
• There are many data extraction issues
... Lesson 10: Building the Warehouse
Copyright Oracle Corporation, 1999. All rights reserved.
Practice 10-1 Overview
This practice covers the following topics:
• Answering a series of short questions
...
Data Warehousing Fundamentals 10-41
...Practice 10-1
Practice 10-1
Please answer the following questions.
1 The acronym ETT stands for _________________________________________.
2 Name at least four potential sources of production data for the warehouse.
_____________________ _____________________ _____________________ _____________________
3 Name at least five potential sources of external data for the warehouse.
___________________________________________ ___________________________________________ ___________________________________________ ___________________________________________ ___________________________________________
4 Identify whether the following statements are true or false.
Question True False
Archive data is never used in a data warehouse; it is too old.
External data is one of the easiest types of data to incorporate into the warehouse.
Mapping data is a process whereby you eliminate data inconsistencies.
Gateways are great mechanisms for transferring large volumes of data into the warehouse.
Extraction tools are expensive.
... Lesson 10: Building the Warehouse
...
11
... Lesson 11: Transforming Data
Copyright Oracle Corporation, 1999. All rights reserved.
Overview
Project Management (Methodology, Maintaining Metadata) Defining DW Concepts & Terminology Planning for a Successful Warehouse Analyzing User Query Needs Choosing a Computing Architecture Modeling the Data Warehouse Planning Warehouse Storage ETT (Building the Warehouse) ETT (Building the Warehouse) Meeting a Business Need Supporting End User Access Managing the Data Warehouse
Copyright Oracle Corporation, 1999. All rights reserved.
Objectives
After completing this lesson, you should be able to do the following:
• Explain the importance of quality data
• Define the term “transformation”
• Identify transformation issues
• Describe techniques for transforming data
...
Data Warehousing Fundamentals 11-3
...Overview
Overview
The last lesson introduced extraction, transformation, and transportation. The lesson then focused on extraction issues.
In this lesson, you explore how the transformation process transforms data from source systems into data suitable for end user query and analysis applications. Note that the “ETT (Building the Warehouse)” block is highlighted in the overview slide on the facing page.
Objectives
At the end of this lesson, you should be able to:
• Explain the importance of quality data
• Define the term “transformation”
• Identify transformation issues
• Describe techniques for transforming data
... Lesson 11: Transforming Data
Copyright Oracle Corporation, 1999. All rights reserved.
Importance of Data Quality
Summit Sports Hollywood Speedy Pizza Browser: http:// Hollywood Hollywood X X + + Custom ers: a recor of a s X X + + C ustom ers: Browser:http:// Hollywood Hollywood Browser:http:// Hollywood Hollywood X X + +
Copyright Oracle Corporation, 1999. All rights reserved.
Benefits of Quality Data
• Clean data is essential for:
– Targeting customers
– Determining buying patterns
– Identifying householders: private and commercial
– Matching customers – Identify historical data
...
Data Warehousing Fundamentals 11-5
...Importance of Data Quality
Importance of Data Quality
Importance of Quality Data
The importance of quality data in the data warehouse cannot be overemphasized. Although data anomalies are bound to exist in source systems, if they are allowed to get into the data warehouse this leads to inaccurate information, which further leads to inaccurate reports and bad business decisions. The overall result is a lack of
confidence in the system to deliver the solution and a data warehouse that either is not used or requires substantial improvement and management buy-in.
Quality data is the key to a successful warehouse; it is better to have no data at all than
bad data.
Benefits of Quality Data
All dirty data must be eliminated from the staging area, to ensure you can query the warehouse to:
• Target the right audience for marketing communication
• Determine that a particular customer buys related products
• Determine that a group of people form a family, each of whom is a potential
customer (householding)
• Identify that an organization is part of a larger enterprise (commercial
householding)
• Identify that a customer is now part of another organization, because of acquisition
or take over
• Match customers where there are many different records for the same customer.
(For example, the different components of health care, such as the hospital, the pharmacy, and the doctor have their own records, or a patient may be treated by different physicians in the same hospital.)
• Identify the age of data and its history
Note: The terms scrubbing, cleaning, cleansing, and data reengineering are used
... Lesson 11: Transforming Data
Copyright Oracle Corporation, 1999. All rights reserved.
Standards
• Define a quality strategy
• Decide on optimal data-quality level
Copyright Oracle Corporation, 1999. All rights reserved.
Quality Improvements
•
Consider modifying rules for operational data•
Document the sources•
Create a data stewardship program•
Design the cleanup process carefully...
Data Warehousing Fundamentals 11-7
...Importance of Data Quality
Standards
A data-quality strategy must be defined early on in the development cycle. It is imperative that you have one in place.
The strategy defines the optimal level of data quality that provides the value required for the business. For example, there is little point in seeking a low data inconsistency rate at great expense if the benefit to the business is not tangible.
Improving Operational Data Quality
You may need to consider making changes over time to the operational system in order to improve the quality of data for the warehouse:
• Some of the validation and integrity rules that are applied to current operational
data may need to be modified or enhanced.
• You may need to document previously undocumented sources, enlist the help of
users who know the business data, and consider creating a “data stewardship” program.
• You should carefully examine the cleanup processes that you employ in
transforming the extracted data.
• The initial data cleanup routines may be different from the routines applied to
subsequent data refreshes.
Correcting data can be tedious, time-consuming, and expensive. Consider any modifications in a phased approach rather than fixing all problems in one attempt.
... Lesson 11: Transforming Data
Copyright Oracle Corporation, 1999. All rights reserved.
Guidelines
•
Operational data should not be used directly in the warehouse•
Operational data must be cleaned for each increment•
Operational data is not simply fixed by modifying applications...
Data Warehousing Fundamentals 11-9
...Importance of Data Quality
Guidelines
Do not assume that because the data in the operational system suits you at the operational level, it is going to be appropriate, suitable, and of a sufficiently high quality for the data warehouse.
• The operational system contains no aging information.
• There are many examples of disparity in the data.
• There are many different meanings applied to data.
• Good operational data when merged may become poor data warehouse data.
Do not assume it is acceptable to clean up data after the pilot run of the first increment or implementation.
• The credibility of the data warehouse or data mart suffers.
• Postimplementation cleanups are more costly and the risk is higher than during the
pilot run.
• The programs needed to handle the multitude of problems are very complex and
would need to be rewritten after cleanup.
Do not assume that fixing applications at the point of entry (operational system) is going to satisfy quality and clean up the data for the future.
• It is often too time-consuming and costly to continually implement changes at that
level.
• Changes cannot be implemented quickly enough to keep up with constantly
changing operational requirements.
The cost in time and resources in reengineering the existing legacy data may be too high.
... Lesson 11: Transforming Data
Copyright Oracle Corporation, 1999. All rights reserved.
Solutions
•
Conventional COBOL, 4GL•
Specialized tools•
Customized conversion process•
Business expertsInvestigation Conditioning Standardization
Integration
Copyright Oracle Corporation, 1999. All rights reserved.
Management
Poor data quality
•
Own•
Take responsibility•
Resolve problems•
Data quality manager...
Data Warehousing Fundamentals 11-11
...Importance of Data Quality
Solutions
Use conventional COBOL or 4GL programs or purchase a specialized tool to capture and eradicate anomalies prior to data load. It is often very difficult to predict all possible variants.
You may consider designing a process in-house to assure the quality of the data entering the data warehouse. The process must involve:
• Data investigation: Parsing, lexical analysis, and pattern investigation
• Data conditioning and standardization: Moving the data into fixed fields,
standardizing names and addresses
• Data integration: Building unique keys and integrating the data
You should involve the business experts in the entire warehouse ETT process. Management
You must manage the quality of the data, processes, and rules, and put people in place to manage them. Someone must own, be directly responsible for, and resolve the issue of poor data quality. This person is often known as the data quality manager.
Note: At some sites there is a person or a group responsible for name and address
... Lesson 11: Transforming Data
Copyright Oracle Corporation, 1999. All rights reserved.
Transformation
Transformation eliminates operational data anomalies
•
Cleans•
Standardizes•
Presents subject-oriented dataExtract Transform Operational system Warehouse Clean up Consolidate Restructure Data staging area Transport (Load)
Copyright Oracle Corporation, 1999. All rights reserved.
Source Data Anomalies
•
No unique key•
Data naming and coding anomalies•
Data meaning anomalies between groups•
Spelling and text inconsistencies90328575 Oracle Corp 100 NE 1st Street, Tampa 90328575 Oracle 100 NE. First St., Tampa 90238475 Oracle Services 100 North East 1st St., FLA 90233479 Oracle Limited 100 N.E. 1st St.
90233489 Oracle Computing 15 Main Road, Ft. Lauderdale 90234889 Oracle Corp. UK 15 Main Road, Ft. Lauderdale, FLA 90345672 Oracle Corp UK Ltd 181 North Street, Key West, FLA
...
Data Warehousing Fundamentals 11-13
...Transformation
Transformation
Transformation involves a number of tasks, the most important being to eliminate all anomalies. Cleaning also includes eliminating formatting differences, assigning data types, defining consistent units of measure, and determining encoded structures. Along with these tasks, another objective is to ensure that the data is presented in a subject-oriented fashion.
Reasons for Data Anomalies
One of the causes of inconsistencies within internal data is that in-house system development takes place over many years, often with different software and development standards for each implementation.
There may be no consistent policy for the software used in the corporate environment. Systems may be upgraded or changed over the years. Each system may represent data in different ways.
Source Data Anomalies
Many potential problems can exist with source data:
• No unique key for individual records
• Anomalies within data fields, such as differences between naming and coding
(data type) conventions
• Differences in the interpreted meaning of the data by different user groups
• Spelling errors and other textual inconsistencies (this is particularly relevant in the
... Lesson 11: Transforming Data
Copyright Oracle Corporation, 1999. All rights reserved.
Transformation Routines
•
Cleaning data•
Eliminating inconsistencies•
Adding elements•
Merging data•
Integrating data...
Data Warehousing Fundamentals 11-15
...Transformation
Transformation Routines
One reason for the inconsistencies with internal data is that in-house system development takes place over many years and often uses different software and standards for each implementation.
• Cleaning the data, also referred to as data cleansing or scrubbing
• Adding an element of time to the data, if it does not already exist
• Translating the formats of external and purchased data into something meaningful
for the warehouse
• Merging rows or records in files
• Integrating all the data into files and formats to be loaded into the warehouse
Transformation should be performed:
• Before the data is loaded into the warehouse
• In parallel (On larger databases, there is not enough time to perform this process as
a single threaded process.)
The transformation process should be self-documenting, should generate summary statistics, and should process exceptions.
... Lesson 11: Transforming Data
Copyright Oracle Corporation, 1999. All rights reserved.
Transforming Data: Problems and
Solutions
Multipart keys Country code Sales territory Product number Salesperson code Product code = 12M65431345...
Data Warehousing Fundamentals 11-17
...Transforming Data: Problems and Solutions
Transforming Data
:
Problems and Solutions
Multipart Keys Problem
Many older operational systems used record key structures that had a built-in meaning. To allow for decision support reporting, these keys must be broken down into atomic
values.
In the example, the key contains four atomic values.
Key Code:12M65431345 Where:
12 is the country code M is the sales territory 65431 is the product code 345 is the salesperson
Solution The program or tools you use must be capable of identifying on a
character-by-character (or position-by-position) basis the individual values, length of value, and the meaning of the resulting information. In the example quoted it is important that the code can extract the M and know that this is a territory code that identifies “Midwest,” “Manchester,” or “Moscow.”
You may need to build a series of transforms to evaluate the results fully. For example, these steps may be appropriate:
1 Extract third character position.
2 Evaluate the character against a master lookup table.
3 Evaluate the meaning of M.
... Lesson 11: Transforming Data
Copyright Oracle Corporation, 1999. All rights reserved.
If field not in (‘m’,1,’male’) then …
else if field is NULL then …
Transforming Data
•
Multiple encoding•
Must pick up erroneous datam , f 1 , 0 male, female m, f m, f mle, female 1 , NULL
Copyright Oracle Corporation, 1999. All rights reserved.
Transforming Data
•
Multiple local standards•
Tools or filters to preprocesscm inches cm DD/MM/YY MM/DD/YY DD-Mon-YY 1,000 GBP FF 9,990 USD 600
...
Data Warehousing Fundamentals 11-19
...Transforming Data: Problems and Solutions
Multiple Encoding Problem
Some systems may represent values in different ways.
For example, some systems may use M to denote “male” and F to denote “female”, while others use 1 and 0, or even NULL values.
Solution The program must be capable of identifying all the distinct possibilities
and program for exceptions. For example, your program considers a male might be either M, or NULL, or Male, but it does not take into account spurious and bad entries such as Man, Mle, N/A.
Your program must be capable of picking up the spurious and bad entries and changing the values to something appropriate, such as:
1 Select all M, or NULL, or Male.
2 Place all other records into a file for reprocessing.
3 Interpret records to be reprocessed and determine from other related values in the
record whether the person is male or female.
4 Change value accordingly, and reprocesses rows selecting newly marked records.
Multiple Local Standards Problem
This is particularly relevant for values entered in different countries.
For example, some countries use imperial measurements and others metric; currencies and date formats differ; currency values and character sets may vary; and numeric precision values may differ.
Currency values are often stored in two formats, a local currency such as sterling, French francs, or Australian dollars, and a global currency such as U.S. dollars.
Solution Typically, you use tools or filters to preprocess this data into a suitable
format for the database, with the logic needed to interpret and reconstitute a value. You might employ steps similar to those identified for multiple encoding.
You may consider revising source applications to eliminate these inconsistencies early on.
... Lesson 11: Transforming Data
Copyright Oracle Corporation, 1999. All rights reserved.
Multiple Files Problem
•
Added complexity of multiple source files•
Start simple Extracted data Multiple source files Logic to detect correct sourceCopyright Oracle Corporation, 1999. All rights reserved.
Transforming Data from Multiple Files
0 2 4 6 8 10 12 14 16 2 3 4 5 6 Sources to be Incorporated
Conflict and integration points
File
File
File
File File File
File File
...
Data Warehousing Fundamentals 11-21
...Transforming Data: Problems and Solutions
Multiple Files Problem
The source of information may be one file for one condition, and a set of files for another. Logic (normally procedural) must be in place to detect the right source. The complexity of integrating data is greatly increased according to the number of data sources being integrated.
For example, if you are integrating data from two sources, there is a single point of integration where conflicts must be sorted. Integrate from three sources, and there are three points of conflict. Four sources provide six conflict points. The problem is exponential.
Solution This is a complex problem that requires the use of tools or
well-documented transformation mechanisms.
Try not to integrate all the sources in the first instance. Start with two or three and then enhance the program to incorporate more sources. Build on your learning experiences.
... Lesson 11: Transforming Data
Copyright Oracle Corporation, 1999. All rights reserved.
Missing Values Problem
Solution
•
Ignore•
Wait•
Mark rows•
Extract when time-stampedIf NULL then field = ‘A’
A
Copyright Oracle Corporation, 1999. All rights reserved.
Duplicate Value Problem
Solution
•
SQL self-join techniques•
RDMBS constraint utilities ACME Inc ACME Inc ACME Inc ACME Inc SELECT …FROM table_a, table_b
WHERE table_a.key (+) = table_b.key UNION
SELECT …
FROM table_a, table_b
...
Data Warehousing Fundamentals 11-23
...Transforming Data: Problems and Solutions
Missing Values Problem
Null, missing, and default values are always an issue. NULL values may be valid entries where NULLs are allowed; otherwise, NULLs indicate missing values.
Solution You must examine each occurrence of the condition to determine validity
and decide whether these occurrences must be transformed; that is, identify whether a NULL is valid or invalid (missing data). You may choose to:
• Ignore the missing data. If the volume of records is relatively small, it may have
little impact overall.
• Wait to extract the data until you are sure that missing values are entered from the
operational system.
• Mark rows when extracted, so that on the next extract you can select only those
rows not previously extracted. It does involve the overhead of SELECT and UPDATE, and if the extracted data forms the basis of a summary table, these need re-creating.
• Extract data only when it is time-stamped as completed, rather than by business
cycle.
Duplicate Value Problem
You need to eliminate duplicate values, which invariably exist. This can be time-consuming, although it is a simple task to perform.
Solution You can use standard SQL self-join techniques or RDBMS constraint
... Lesson 11: Transforming Data
Copyright Oracle Corporation, 1999. All rights reserved.
Solution
•
CTAS•
SQL*LoaderElement Names Problem
Customer Browser:http:// Hollywoo d Hollywoo d X X + + Cust om ers: a rec orof a s X X + + Cust om ers: Browser:http:// Hol lywood Hol lywood Browser:http:// Hollywood Hollywood X X + + 12345.00 12780.00 2345787.00 87877.98 5678.00 100% 110% 230% 200% -10% ABC CO GMBH LTD GBUK INC FFR ASSOC MCD CO Customer Client Contact Name
Copyright Oracle Corporation, 1999. All rights reserved.
Element Meaning Problem
Customer’s name All customer details All details except name a recorof a s X X + + Custom ers: Browser: http:// Hollywoo d Hollywood Customer_detail