
Data Warehousing With Limited Database Operations

By Muhammad Ahmad Shahzad

Managing Principal Consultant, PMP, Collabera


Data warehousing is a very expensive solution and should be designed to fit requirements. This paper focuses on the need to reduce database operations while implementing a data warehouse. A data warehouse project can be divided into the following phases: extraction, initialization, transformation and scrubbing, categorization, loading, summarization and reporting. Most of these phases are not directly database-bound and can be handled using flat-file manipulation. This paper outlines the different phases of data warehousing and shows how to limit database operations in these phases and after implementation.

The Need for Limiting Database Operations

Developing data warehouses is a lengthy and expensive process. One of the most important decisions a project manager faces is to what extent database operations should be used to implement the data warehouse. A data warehouse project can be divided into phases: extraction, initialization, transformation and scrubbing, categorization, loading, summarization and reporting. Existing database servers are capable of implementing all of these phases. Servers such as Oracle 8i and SQL Server 7.0 have special features tuned to implement data warehouses more efficiently, and with the help of database design tools, fast loads and OLAP reporting front ends, the process of data warehouse implementation has improved considerably.

The most prominent issue with implementing all of the above-mentioned phases using database operations is that databases were originally designed for transactional systems rather than reporting applications. Features like data consistency, integrity, normalization, concurrency and rollbacks, which are crucial for transactional systems, are far less relevant when implementing data warehouses. Most commercially available database servers, which have evolved in the direction of very large databases, still support online transactional systems and therefore include these features. They are not irrelevant in data warehouses, but they are of much lower importance than in transactional systems.

To understand why most database features do not directly apply to data warehouses, we must look at the basic characteristics of a data warehouse. A data warehouse is a subject-oriented, integrated, non-volatile and time-variant collection of data in support of management’s decisions. It is about molding data into information and storing this information by subject rather than by application. Essentially, data warehousing is the “warehousing” of data outside operational systems, and this has not significantly changed with the evolution of data warehousing systems. A prime reason for this separation is that the evaluation and analysis done by analysts require complex, analytic queries, which degrade the performance of operational systems. Most data warehouses have a three-tier architecture: the first tier is the transfer of data from transactional systems to the database server, the second tier is the database server itself, and the third tier consists of client-side analysis tools.

If we closely analyze this architecture, it is clear that the database itself is just one part of the overall picture. Many operations in data warehousing are not directly based on the database. Though the database is very important and its design plays a pivotal role in the overall implementation, many database operations can be avoided because of the overhead they carry. Data being loaded into the warehouse lands in big “chunks” of hundreds of megabytes, or even gigabytes, and once loaded into the database it will not change. Most of these applications are designed with some type of batch window for loading data, during which some portion of the data warehouse is unavailable. This gives some slack for implementation, but the challenge of minimizing this window remains.

Before discussing the phases of data warehousing in detail, it must be mentioned that the data entering the warehouse is hardly ever acceptable for loading into the database in its raw format. Data comes from diverse sources with different hardware, software and databases. Often it is dirty data in terms of redundancy, key management, referential integrity and inconsistency of data elements. This is an issue that must be addressed.

The remainder of this paper discusses each phase of data warehousing and how to implement it with limited database operations.

Extraction and Initialization

Data in the warehouse comes either from the transactional databases for which the data warehouse is implemented, or from the data warehouse itself. For unloading data from databases, a fast-unload utility can be used rather than SQL operations. These fast-unload utilities unload the data into data files, using control files to specify the format. Extracting data from transactional databases is understandable, but why extract data from the existing data warehouse? The reason becomes clearer in the later phases, but in short it is to minimize database operations for lookups such as primary key and foreign key management. The importance of metadata in the implementation of data warehouses is beyond doubt: metadata files are used to recognize and manage data files. In a wisely organized warehouse, there are different sets of metadata files for different phases, and as the files evolve from ‘dirty’ to loadable, the metadata itself gets closer to the database design.

As mentioned previously, data coming into the warehouse from transactional systems typically needs a lot of grooming. The biggest issue is making sense of dirty data. There are various reasons for this problem, and almost all warehouse designers have to face it. There are also primary key and referential integrity management and the removal of inconsistencies from the data. Almost all of these problems are handled in the transformation and scrubbing phase; the initial preparation is done in this phase, as is the first level of validation. The files coming into a data warehouse may or may not follow a standard format. If they do not follow any format, or if there is more than one format across different transactional systems, these files have to be converted to a standard warehouse format: the Unified Warehouse File (UWF) format.

Following is an overview of the UWF format:

• Each object/table will have two files: one is the data file and the other is the definition (metadata) file.

• There should exist only one set of files for each object.

• One file should have data for only one object.

• The definition file should specify the type of file (fixed length or variable length), delimitation information, data types, default values and characteristics of fields, including information such as ‘is nullable’, ‘primary key’, ‘reference key’ or ‘partition key’. Most of the time it is also useful to include the object’s type in the definition file, showing whether the object is a fact table, a dimension table or a summary table. A sketch of what such a definition file might contain is given after this list.
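As an illustration only, here is one way a UWF definition file could be represented and read. The JSON layout, field names and the load_definition helper are assumptions made for this sketch, not part of any standard the paper defines.

```python
import json

# Hypothetical UWF definition (metadata) file for a PATIENT object,
# expressed here as JSON purely for illustration.
EXAMPLE_DEFINITION = """
{
  "object_name": "PATIENT",
  "object_type": "dimension",
  "file_type":   "delimited",
  "delimiter":   "|",
  "fields": [
    {"name": "PATIENT_ID",  "type": "number",  "nullable": false,
     "primary_key": true},
    {"name": "HOSPITAL_ID", "type": "char(4)", "nullable": false,
     "partition_key": true},
    {"name": "SEX",         "type": "char(1)", "nullable": true, "default": "U"},
    {"name": "BIRTH_DATE",  "type": "date",    "nullable": true}
  ]
}
"""

def load_definition(text: str) -> dict:
    """Parse a UWF definition file and index its fields by name."""
    definition = json.loads(text)
    definition["fields_by_name"] = {f["name"]: f for f in definition["fields"]}
    return definition

if __name__ == "__main__":
    meta = load_definition(EXAMPLE_DEFINITION)
    print(meta["object_name"], list(meta["fields_by_name"]))
```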

Once the files are reformatted into UWF format, the rest of the programs and phases handle just one kind of file format. Often it is useful to perform the first level of validation in the extraction phase. This validation includes checks such as the existence of mandatory information, valid numbers, valid dates and valid partition keys. These are very primitive checks, but they are essential.
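Building on the assumed definition layout sketched above, a minimal first-level validation routine might look like the following; validate_record and the YYYYMMDD date convention are hypothetical choices for this example.

```python
from datetime import datetime

def validate_record(record: str, meta: dict, delimiter: str = "|") -> list:
    """First-level validation of one delimited UWF record against its definition:
    mandatory values present, numbers parse, dates parse."""
    errors = []
    values = record.rstrip("\n").split(delimiter)
    for field, value in zip(meta["fields"], values):
        name = field["name"]
        mandatory = (not field.get("nullable", True)) or field.get("partition_key", False)
        if value == "":
            if mandatory:
                errors.append(f"{name}: mandatory value missing")
            continue
        if field["type"] == "number":
            try:
                float(value)
            except ValueError:
                errors.append(f"{name}: '{value}' is not a valid number")
        elif field["type"] == "date":
            try:
                datetime.strptime(value, "%Y%m%d")
            except ValueError:
                errors.append(f"{name}: '{value}' is not a valid date (YYYYMMDD)")
    return errors
```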

After converting files into UWF format and passing the first level of validation, the files of transactional data are ready for the transformation and scrubbing phase.

The final step in the extraction phase is unloading from the warehouse itself. The transformation phase needs primary key and foreign key information, so for all objects, lookup files are maintained in some kind of database lookup area. Transformation uses these files to generate new primary keys and maintain referential integrity. If domain information is implemented in the database, then foreign key management will handle it; otherwise, there should be domain file management in place. These domain files hold a target value and the set of source values that map to it. A small sketch of a domain file for sex is given below.
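For example, a domain file for sex might hold mappings like the following; the specific source codes and the map_sex helper are illustrative assumptions, not values from the original paper.

```python
# Hypothetical domain file for SEX: every source-system spelling or code
# maps to the single target value stored in the warehouse.
SEX_DOMAIN = {
    "M": "M", "MALE": "M", "1": "M",
    "F": "F", "FEMALE": "F", "2": "F",
    "U": "U", "UNKNOWN": "U", "": "U",
}

def map_sex(source_value: str) -> str:
    """Translate a source value to the warehouse domain value."""
    return SEX_DOMAIN.get(source_value.strip().upper(), "U")
```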

Transformation and Scrubbing

This phase is the backbone of data warehousing: if it is implemented properly, based on the overall requirements and design, the project is roughly 70% complete, and the transformation and scrubbing phase accounts for about 65-80% of the overall development effort of a data warehouse. This phase becomes even more important if we want to implement the data warehouse without a staging database. There is no need for a physically separate staging database to store data before scrubbing and transforming it. The data coming from the extraction phase is in the form of flat files and should remain as flat files after this phase. All of the processing should be done on flat files, since flat-file manipulation is faster and can be handled in parallel.


The overall goal of the transformation and scrubbing phase is to present data in a loadable format. The load phase should not perform any file manipulation; all of the manipulation should be part of this phase. As a reminder, the data coming into this phase has already been reformatted into UWF format.

Following are the main tasks of this phase:

• Remove redundant information;

• Resolve inconsistencies in the data;

• Primary key management; and

• Foreign key management.

These tasks may seem simple, but the solutions are complex and vary from data warehouse to data warehouse. The complexity of the first two tasks is a function of how dirty the data is: if the data extracted from a legacy system is very dirty, accomplishing them will take a lot of work, whereas if the legacy systems were designed properly and integrating them is not a problem, there is not much to resolve in the transformation phase. Key management is directly related to the physical database design of the data warehouse.

Redundant Data

Ideally, data coming into the data warehouse should be clean. But the world is not ideal, and this is especially true for big systems. Data to be loaded into the data warehouse often contains redundant data, for any of the following reasons:

• Redundant data exists in the transactional or legacy system;

• The redundancy was introduced while extracting data from the legacy system;

• The extraction windows of two loads overlap; and

• The same data file is sent twice.

Whatever the reason may be, the solution is complex if it has to be implemented in the transformation phase. Some of these issues should be resolved on the transactional-system side, or in the way data moves between the two systems; but this is not always possible, and most of the time it is much more expensive to handle before the transformation phase. Thus, data warehouse developers have to resolve this issue themselves.


The redundancy problem can be sub-divided into two classes, and the solution for each is different. The first class is redundancy within the same batch load. The second is redundant data being re-sent: the data may not be duplicated within the file, but it has already been loaded into the data warehouse and would become redundant if loaded again. The first is a self-contained problem; the solution for the second needs some kind of database feedback. There can be scenarios in which both exist simultaneously in one batch load; in that case, the first class takes priority over the second.

First Class of Redundant Data

As the problem is self-contained, the solution lies in file manipulation. Generic utilities can be written to manipulate files that are already in UWF format. Such a utility can open a UWF file, read the primary key information from its definition, scan the data for duplicates and remove the redundant records. The scheme for removing duplicates should be decided in discussion between the data architects, data warehouse designers and business users. Sometimes it is good practice to inform the concerned parties about the duplication so that they may modify their systems to minimize the problem.
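A minimal sketch of such a de-duplication utility, assuming delimited UWF data files and the hypothetical definition structure sketched earlier; keeping the first occurrence is only one possible survivorship rule.

```python
# Hypothetical first-class de-duplication: remove records whose primary key
# has already been seen in the same batch file (keep-first policy).
def deduplicate_file(data_path: str, out_path: str, meta: dict,
                     delimiter: str = "|") -> int:
    key_positions = [i for i, f in enumerate(meta["fields"])
                     if f.get("primary_key")]
    seen, removed = set(), 0
    with open(data_path) as src, open(out_path, "w") as dst:
        for line in src:
            values = line.rstrip("\n").split(delimiter)
            key = tuple(values[i] for i in key_positions)
            if key in seen:
                removed += 1          # duplicate within this batch
                continue
            seen.add(key)
            dst.write(line)
    return removed
```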

The database-oriented alternative is to first load the data into a staging database and then manipulate it to remove the redundant rows. Anyone who has manipulated huge amounts of data using queries knows how expensive this is: no matter how it is done, it will involve at least a few full-table scans to remove the duplicates, and a full table scan is the most expensive kind of scan among database operations.


Second Class of Redundant Data

This class approaches the redundancy problem as duplication between the data to be loaded and the data already loaded in the database. Its solution needs feedback from the database. For example, the extraction phase generates a file called the Primary Key File (PKF) for each table. This file holds the candidate key information from the transactional system and the mapped primary key of the data warehouse. For a patient table, for instance, the PKF would map each source candidate key (qualified by hospital IDs such as AHBC and AHBB) to the warehouse’s primary key.

The PKF file is also in UWF format. A generic utility can be developed to compare the transactional-system candidate keys in the data file against the PKF file and check whether a particular key has already been loaded. The output of this utility is a file containing only the data that has not already been loaded.
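A sketch of this second-class filter under the same assumptions; the column positions of the candidate key in the data file and in the PKF are parameters here because the real layouts are warehouse-specific.

```python
# Hypothetical second-class de-duplication: drop records whose source
# candidate key already appears in the Primary Key File (PKF), i.e. records
# loaded into the warehouse by an earlier batch.
def filter_new_records(data_path: str, pkf_path: str, out_path: str,
                       data_key_cols: list, pkf_key_cols: list,
                       delimiter: str = "|") -> int:
    # Build the set of candidate keys already known to the warehouse.
    loaded = set()
    with open(pkf_path) as pkf:
        for line in pkf:
            values = line.rstrip("\n").split(delimiter)
            loaded.add(tuple(values[i] for i in pkf_key_cols))

    kept = 0
    with open(data_path) as src, open(out_path, "w") as dst:
        for line in src:
            values = line.rstrip("\n").split(delimiter)
            if tuple(values[i] for i in data_key_cols) in loaded:
                continue              # already in the warehouse
            dst.write(line)
            kept += 1
    return kept
```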

An alternative approach is to load this data into a staging database and compare it at load time against the data warehouse. The problem is that this again requires a lot of database operations. The most likely implementation would open a cursor over the staging data and compare each record with the data in the warehouse tables; a row-by-row traversal over a cursor of millions of records, compared against another table of millions of records, is a very expensive database operation.


Third Class of Redundant Data

Both of the above-mentioned problems can exist in the same load: there is redundant data within a data file, and part of it has already been loaded. The solution is quite simple: pass the file through the first step to remove redundancy within the file, and then through the second step to produce a file containing only new data.

Resolving Inconsistency in Data

Inconsistency is a very common problem in data warehousing, where data comes from different legacy or transactional systems and each has its own methods of abbreviation or its own domain IDs. A simple example is marital status: each source system may use its own set of codes for the same statuses.

The solution is very simple: maintain, for each source system, domain files that map that system’s values to the desired warehouse value. The incoming file is matched against these domain files, and the output file carries the required value for each item.

The complexity lies in creating these domain files. If all possible combinations of data items and their domain values are known in advance, building such a file is not a problem; sometimes, however, they are not known beforehand. Consider, for example, a health care data warehouse and a domain file for allergies. Even assuming there are no typos in the transactional systems, there are so many allergies that it is quite difficult to build a complete domain file at the start. In cases like this the domain file must ‘evolve’ until it holds all the required values: the utility that maps domain values has to be written so that whenever it sees a new item value it appends it to the domain file. If the domain value for that item is simple enough to generate, the utility should generate it; otherwise a business user has to be involved.
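A sketch of such an evolving domain mapper, assuming a domain file is a simple delimited list of source-value/target-value pairs and using a placeholder target for values a business user still has to resolve; all names here are illustrative.

```python
def load_domain(domain_path: str, delimiter: str = "|") -> dict:
    """Read a domain file of source|target pairs into a mapping."""
    domain = {}
    with open(domain_path) as f:
        for line in f:
            source, target = line.rstrip("\n").split(delimiter)
            domain[source] = target
    return domain

def map_with_domain(value: str, domain: dict, domain_path: str,
                    delimiter: str = "|") -> str:
    """Translate value; append unknown values for business-user review."""
    if value in domain:
        return domain[value]
    # New value: record it with a placeholder target and remember it so the
    # same unknown value is appended only once per batch.
    with open(domain_path, "a") as f:
        f.write(f"{value}{delimiter}UNRESOLVED\n")
    domain[value] = "UNRESOLVED"
    return "UNRESOLVED"
```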

The aforementioned scenarios do not deal with truly dirty data. What if there are typos in the data coming into the warehouse? They matter most when domain information is being generated from descriptive fields. Taking the same allergy example: what if the incoming data is not code-based, but a descriptive field containing the name of the allergy? Building a domain file for such a field is expensive; however, most of the time it is required in order to map data from different systems, and business users must be involved to resolve the discrepancies.


Primary Key Management

This step needs the PKF file mentioned previously. The keys of the transactional systems have to be mapped to the keys of the data warehouse. The PKF file is generated in the extraction phase, and this step uses its values to generate new primary keys; after passing through it, the data files carry warehouse-based keys. If it is important to retain the transactional system’s primary key information, this process does not have to remove it: dimension tables can be created to map the keys so that the required data can still be queried.
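A sketch of this key mapping over flat files, assuming the PKF has been read into a dictionary of candidate key to warehouse key and that new surrogate keys come from a running counter; the layout and names are assumptions.

```python
# Hypothetical primary key management: replace the source candidate key in
# each record with a warehouse surrogate key, allocating new keys for
# candidates not yet present in the PKF.
def assign_primary_keys(data_path: str, out_path: str, pkf: dict,
                        key_col: int, next_key: int,
                        delimiter: str = "|") -> int:
    with open(data_path) as src, open(out_path, "w") as dst:
        for line in src:
            values = line.rstrip("\n").split(delimiter)
            candidate = values[key_col]
            if candidate not in pkf:
                pkf[candidate] = str(next_key)   # new warehouse key
                next_key += 1
            values[key_col] = pkf[candidate]
            dst.write(delimiter.join(values) + "\n")
    return next_key          # caller persists the PKF and the new counter
```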

Foreign Key Management

This step also needs the PKF file. In theory, foreign keys are the primary keys of the referenced table, so the logic is very similar to primary key handling. The difference is that foreign key management is a simple mapping from the transactional system’s primary keys to the data warehouse’s primary keys; this step does not generate any new keys. A problem arises when the key information is not in the PKF files. There are two approaches: discard the record, or substitute a default or dummy value. Whichever approach is chosen, data should not be loaded while its reference data is unavailable. Referential integrity is generally not implemented in the database of a data warehouse, as it would slow down database operations, so handling foreign keys becomes part of this foreign key management step.
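A sketch of the corresponding foreign key mapping, with the discard-or-dummy policy as a parameter; fk_cols and the PKF dictionaries are assumed structures, not a prescribed format.

```python
# Hypothetical foreign key management: translate each foreign key column from
# the source system's key to the warehouse key found in the referenced
# table's PKF. Records with unresolved references are either discarded or
# given a dummy key, depending on the chosen policy.
def map_foreign_keys(data_path: str, out_path: str, fk_cols: dict,
                     discard_unresolved: bool = True, dummy_key: str = "-1",
                     delimiter: str = "|") -> int:
    discarded = 0
    with open(data_path) as src, open(out_path, "w") as dst:
        for line in src:
            values = line.rstrip("\n").split(delimiter)
            ok = True
            for col, pkf in fk_cols.items():     # column index -> PKF mapping
                if values[col] in pkf:
                    values[col] = pkf[values[col]]
                elif discard_unresolved:
                    ok = False
                    break
                else:
                    values[col] = dummy_key
            if ok:
                dst.write(delimiter.join(values) + "\n")
            else:
                discarded += 1
    return discarded
```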

Loading

This is obviously a database-oriented phase: a data file is loaded into its table. By this time the data has been cleaned and is in a ready-to-load form, and a simple fast-load utility can load it. As this paper is focused on non-database operations in data warehouses, we will not discuss this phase further.

Summarization of Data

Most of the time, business users view data at a summarized level. Summarization can be done at different levels; for example, if time is the basis of summarization, data can be summarized by week, month, quarter and year. There can also be multidimensional summarization in data warehouses. Depending on the amount of data, it may be more efficient to first unload the data into flat files, summarize it as required, and then reload it into the different summary tables.
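A minimal sketch of flat-file summarization, assuming a delimited detail file with a YYYYMMDD date column and a numeric measure column, rolled up to calendar months; the column positions are illustrative.

```python
from collections import defaultdict

# Hypothetical flat-file summarization: roll a detail file up to monthly
# totals before loading it into a summary table.
def summarize_by_month(detail_path: str, out_path: str,
                       date_col: int, measure_col: int,
                       delimiter: str = "|") -> None:
    totals = defaultdict(float)
    with open(detail_path) as src:
        for line in src:
            values = line.rstrip("\n").split(delimiter)
            month = values[date_col][:6]          # YYYYMMDD -> YYYYMM
            totals[month] += float(values[measure_col])
    with open(out_path, "w") as dst:
        for month in sorted(totals):
            dst.write(f"{month}{delimiter}{totals[month]}\n")
```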

Most of the time, data warehouses are implemented in levels: at the base there is a centralized enterprise data warehouse (EDW), and above it there are data marts. OLAP tools and decision support systems communicate with the data marts rather than the EDW. Between the EDW and the data marts there is a layer of extraction, transformation and load, and this layer handles the summarization of data. Another approach is to summarize the data during transformation, before it is even loaded into the EDW, and load this summarized data directly into the data marts.


Updating Data Warehouses

Ideally, once data is loaded into the data warehouse there should not be any changes to it: in a data warehouse, data should not be modified after it is loaded. A data warehouse is, by definition, a time-variant system, and there was a purpose in defining it so: data warehouses are so large that in-place changes through database operations are prohibitively expensive. In reality, however, there comes a time when data has to be updated in the warehouse or some portion of it has to be removed.

Deleting Data from Data Warehouses

It is very expensive to use database operations to delete data from tables. Imagine removing a couple of thousand records from a table of a million rows, especially if the records are scattered randomly. Our observation has been that if more than about one percent of the rows of a table with over a million records has to be removed, it is cheaper to unload the table, remove the records from the flat file and reload the table. Thus, some logic must be built into the deletion process; the exact threshold is an empirical decision and varies considerably from environment to environment.
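A sketch of the unload-filter-reload approach for deletions, assuming the table has already been unloaded to a delimited flat file and the keys to delete are known; the fast unload and fast load steps are tool-specific and only indicated as comments.

```python
# Hypothetical deletion by unload-filter-reload:
#   1. fast-unload the table to a flat file (tool-specific, not shown)
#   2. filter out the records to delete
#   3. truncate the table and fast-load the filtered file (not shown)
def filter_out_keys(unloaded_path: str, out_path: str,
                    keys_to_delete: set, key_col: int,
                    delimiter: str = "|") -> int:
    removed = 0
    with open(unloaded_path) as src, open(out_path, "w") as dst:
        for line in src:
            values = line.rstrip("\n").split(delimiter)
            if values[key_col] in keys_to_delete:
                removed += 1            # record marked for deletion
                continue
            dst.write(line)
    return removed
```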

Updating Data in Data Warehouses

Conceptually, updating data is a combination of deletion and insertion, so the solution is clear: the records to be updated are removed from the database using the above-mentioned approach, and the new versions are then loaded as part of a new load.

Summary

Data warehousing is a very expensive solution and should be designed to fit requirements. Though most commercial databases are tuned to implement very large databases such as data warehouses, some of their features restrict efficient implementation of a data warehouse through database operations alone. One approach is to limit the database operations and use flat-file manipulation instead. Manipulating flat files is less expensive than doing the same work inside the database; it also shrinks the batch window for loading data, which in turn reduces the downtime for querying. This allows a data warehouse to remain an effective, time- and cost-efficient tool for your organization.

About Collabera

Collabera is one of the fastest growing end-to-end information technology services and solutions firms worldwide. We work with leading Global 2000 firms in the Financial Services, Manufacturing & Retail, Technology, and Communications & Media domains. Collabera delivers highly responsive and innovative solutions that bridge our clients' Execution Gaps through our proprietary methodologies - Momentum workshops, ABS (Asset Based Services) and our global delivery model - helping them experience accelerated value.


With revenues of over USD 300 million, Collabera employs over 4,000 professionals across 22 locations and seven world-class delivery centers in the US, Europe, India and China. Our global delivery model leverages a best-in-class quality system (including SEI CMMI Level 5 and ISO 27001 security certifications) to deliver a full portfolio of services that includes Application Development and Management, Independent Testing, Outsourced Product Development, Enterprise Software Solutions, and Business Intelligence & Data Warehousing.

