Speeding ETL Processing in Data Warehouses White Paper

(1)

(2)

High-Performance Aggregations and Joins for Faster

Data Warehouse Processing

Data Processing Challenges... 1

Joins and Aggregates are Critical to Data Warehouse Processing ... 1

Aggregations are a Key Component of Data Warehouse Processing ... 2

Using High-Performance Aggregations for Preprocessing ... 2

Pre-Calculated Aggregations Speed Queries ... 2

Using Multi-Level Hierarchal Aggregations... 3

Joins Combine Data for Reports and Downstream Processing... 4

Using High-Performance Joins for Changed Data Capture (CDC) and Other Preprocessing... 4

Case Studies Demonstrate Elapsed Time Savings Using DMExpress ADM High-Performance Aggre-gations and Joins ... 5

Scenario #1: Slow Java Processing ... 5

The ADM Solution... 6

Scenario #2: Slow ETL Application... 6

The ADM Solution... 6

(3)

Data Processing Challenges

Staying competitive in today’s real-time business world demands the capability to process ever-increasing volumes of data at lightning speed. In today's fast-changing, competitive environment, a complaint frequently heard by data warehouse users is that access to time-critical data is too slow. Shrinking batch windows and data volume that increases exponentially are placing increasing demands on data warehouses to deliver instantly-available information.

Additionally, data warehouses must be able to consistently generate accurate results. But achieving accuracy and speed with large, diverse sets of data can be challenging. Data management and governance are thus of paramount importance, a trend that has forced industries to invest billions of dollars in data infrastructure updates. The majority of these expenditures has been devoted to improving access to, and the quality of, real-time business intelligence (BI) and other mission-critical operations, including data marts, data mining, online data stores, and online transaction processing (OLTP) systems.

One of the primary factors affecting data warehouse performance is the underlying physical distribution of the data in the databases, which must be consolidated in various ways. For example, a query may rely on tables built from multiple heterogeneous sources, involving a variety of transformations. An inability to perform data-intensive operations efficiently will inevitably impede business analyses and subsequent decision-making.

Joins and Aggregates are Critical to Data Warehouse Processing

Various operations can be used to optimize data manipulation and thus accelerate data warehouse

processes. Two operations in particular, joins and aggregations, play an integral role during preprocessing as well in manipulating and consolidating data in a data warehouse.

For processing data for BI analytics and other mission-critical applications, the usual go-to software products include various extract, transform, load (ETL) and data warehousing solutions. However, these products do not always provide the speed required to deliver data on time. Often an additional product can contribute to speeding the processing of high volumes of data. One such product, Syncsort’s DMExpress, including its Advanced Data Management (ADM) component, significantly reduces processing time of data-intensive operations through the use of high-performance joins and high-performance aggregations.

Using ADM, operations such as queries, multi-level hierarchal aggregations, and changed data capture (CDC), also referred to as delta processing, can be accelerated. The result: critical business reports and analyses finish many hours and sometimes even days faster, allowing more frequent access to data for timely reports and analysis.

Deriving useful information from raw data within a data warehouse involves joining factual and dimensional information before aggregating it to produce a report or output for downstream analysis. However, joins and aggregations are data-intensive and time-intensive, which prevent data warehouses with insufficiently fast ETL software from performing them frequently; but frequency is key to providing data for timely business

(4)

Page 2

analysis. ADM provides a solution. It speeds data-intensive joins and aggregations, making it possible to perform them much more frequently. ADM high-speed aggregations and joins exploit patented algorithms, parallel processing technology, and dynamic optimization to accelerate applications and reduce computing resources. And these savings increase dramatically as data volumes grow.

The following shows a job built from join and aggregate tasks in the DMExpress management interface.

Aggregations are a Key Component of Data Warehouse

Processing

Aggregations play a key role at various stages of data warehouse processing. From preprocessing of data prior to its entering the data warehouse, to dimensional data analysis used for conducting queries and generating reports, aggregations are critical to efficiently preparing, configuring, and analyzing large volumes of data.

Using High-Performance Aggregations for Preprocessing

By aggregating data prior to loading it into the data warehouse, queries, database loads, and other

downstream processing can be performed much faster; DMExpress ADM can quickly summarize factual data to the minimum level of granularity required by the data warehouse. A common application removes

redundant transaction data from multiple sources for faster queries and database loads.

Pre-Calculated Aggregations Speed Queries

Data warehouse experts agree that aggregates are the best way to speed data warehouse queries. According to data warehousing expert Ralph Kimball, “Every data warehouse should contain precalculated and prestored aggregation tables” (The Data Warehouse Toolkit, 2nd edition, p. 356).

Join Aggregate

Figure 1. Using DMExpress ADM High-Speed Join and Aggregation to Generate a Product Sales Report.

(5)

Aggregate operations yield the greatest performance benefits when they are used for input to data analysis; for example, to aggregate daily sales data for each quarter and each sales region. If these queries run at consistent times and if the aggregation application is fast enough, then it is often possible to generate the aggregations and store them in tables within the database before the query is run. The query can then use these aggregate tables to compute the results, thus avoiding the processing necessary to create the aggregations. In the case of a six million row query, reducing the number of rows read by creating

aggregations across dimensions can vastly accelerate processing time. A query answered from base-level data can take hours and involve millions of data records and millions of calculations. With pre-calculated aggregates, the same query can be answered in seconds with just a few records and calculations.

Using Multi-Level Hierarchal Aggregations

Sophisticated aggregation schemes recognize dimensional hierarchies and build higher-level aggregations from more granular aggregations. For instance, a daily aggregation of sales for a region could be used to build a monthly aggregation of sales by region, and thereby avoid an aggregation of the more granular daily sales totals. Such aggregate-awareness can significantly increase speed and enhance reporting capacity.

Multi-level hierarchal aggregations (Figure 2) can be efficiently performed with ADM. Using the aggregate functionality and implementing a processing scheme that involves working with smaller and smaller datasets, ADM can perform hierarchical aggregation and then use DMExpress merge functionality to bring the hierarchically-aggregated data together to create a final sales report.

(6)

Page 4

Aggregations can also be used to replace fact data with rolled up versions of themselves. For example, you can use this approach to create a monthly product-sales-by-store summary of an original fact table. All of the daily sales records are aggregated into monthly records, which reduces the size of the fact table. Yet dimensional analysis by time at the month, quarter, and year levels can still be performed. ADM can also perform multi-level hierarchal aggregation to help build cubes for more advanced dimensional data analyses.

Joins Combine Data for Reports and Downstream Processing

Joins are used to combine information from two or more data sources, such as database tables, and place it into a new data source suitable for downstream processing or reports. Joins are particularly powerful because they enable rapidly changing data to be organized into categories for subsequent report preparation through the use of matching keys. Joins are used to pre-process data, to improve the efficiency of queries, and to accelerate changed data capture (CDC), or delta processing.

Using High-Performance Joins for Changed Data Capture (CDC) and Other

Preprocessing

To reduce the time needed to retrieve information, data must be preprocessed into the proper form for the dimensional data warehouse. High-speed joins are critical to this process, which can include lookups of legacy values for appropriate replacement, cleansing and validating data, identifying and eliminating non-matching or unnon-matching values, and pre-aggregation. Using ADM high-performance join for data-intensive operations, descriptive information can be combined with factual data late in a processing sequence so that storage and throughput requirements are minimized.

CDC is an increasingly important pre-processing function. By loading only new, updated, and deleted records into the data warehouse, CDC significantly conserves time and resources when carrying out data-intensive processes (Figure 3).

(7)

Rather than replacing the information in the data warehouse with the data in the entire online transactional database, a join will match the primary key of the previously loaded record with its corresponding new record and then compare the data portions of the two records to determine if they’ve changed. In this way, only added, deleted, and altered records are updated, which significantly reduces elapsed time of database loads. By using a high-performance join for CDC, data warehouse updates can be performed with far greater efficiency.

Case Studies Demonstrate Elapsed Time Savings Using

DMExpress ADM High-Performance Aggregations and Joins

Because of the significant benefits they provide, aggregations and joins are widely used in nearly every industrial sector where large volumes of data must be analyzed including the banking, pharmaceutical, retail, and telecommunications industries. Below are two examples from the financial industry where ADM high-performance aggregations and joins were used to significantly accelerate data processing.

Scenario #1: Slow Java Processing

A large investment bank wanted to improve the performance of a lengthy product/customer data management process that originally involved five major procedures, all written in Java. Due to recent increases in data volumes, legacy Java-based applications were consuming excessive system resources and time. It thus became necessary to upgrade the bank’s data processing applications in order to reduce elapsed time for routine customer management processes. The processes included data extractions, joins, merges, reorganizations, and loads in a networked UNIX environment. The bank’s existing solution, using programs written in Java, required 30-32 hours to complete.

Figure 4. Investment Bank Uses DMExpress with ADM to Reduce Data Processing Time by 90%.

(8)

Page 6

The ADM Solution

The Java procedures were converted to ADM high-performance join and aggregation features, which eliminated the need for a separate database load step. As a result of switching to DMExpress ADM, elapsed time for customer management data processing was reduced by over 90%, from 30-32 hours to just 2-3 hours (Figure 4). The bank now completes its routine customer management processes faster and therefore can perform them more frequently.

Scenario #2: Slow ETL Application

At another leading financial institution, a client profitability application was designed to examine customer account and product data and subsequently determine the rate and amount that the company's products earned for its investors. The company needed to create aggregates and order the data, which consisted of thirty-six, 1GB regular time period files and five, 1.8 GB summary time period files. DataStage was used to process the customer financial data and then generate financial files by time periods as well as by product, customer, and organization. DataStage's aggregation and ordering features did not provide the performance and functionality that the company needed. With DataStage alone the process was taking 72 hours.

The ADM Solution

To reduce total processing time, the company selected DMExpress with ADM to process the aggregates and ordering. Once the aggregates were completed, the database was brought down and the database

administrator ran several statistics with the aggregated files and other DataStage files. The data was then loaded into the data warehouse and different divisions from around the world queried the information using Oracle Reports and Oracle Discoverer. When the ETL process was run using ADM, the previous data was still up so that queries could be performed on the same machine while running their ETL process. With ADM, the elapsed time was reduced by 33% to 48 hours, cutting processing time by one entire day (Figure 5).

Figure 5. Bank Cuts Processing Time by 33% Using DMExpress with ADM

(9)

DMExpress: An Enterprise Solution

DMExpress is the high-speed data management solution for ETL, data warehousing, BI, and other mission-critical applications. DMExpress replaces long-running processes so applications run faster. This can save many hours, even days, especially when processing large amounts of data. DMExpress can be used to speed processing at many points in the enterprise management structure (Figure 6).

Figure 6. DMExpress Speed is Utilized at Many Points in the Data Management Structure.

(10)

Page 8

DMExpress leverages nearly 40 years of research and development in high-performance software. It is needed wherever you have data-intensive applications. It speeds processes like ETL, staging data for a data warehouse, and database loading by up to 90%. The ultimate business benefit of DMExpress is twofold:

•

It provides rapid availability of data for critical operations such as BI (including CRM, ERP, SCM, and

CPM), data marts, data mining, web log processing, online data stores, and OLTP systems.

•

Its speed and efficiency minimize resource consumption, making it possible to consolidate hardware for significant cost savings.

For more information about DMExpress ADM and high-performance aggregations and joins, contact DMExpress sales at 201-930-8200 or visit http://www.syncsort.com/products/dmx/features/adm.htm.

(11)

Speeding ETL Processing in Data Warehouses White Paper

High-Performance Aggregations and Joins for Faster

Data Warehouse Processing

Data Processing Challenges

Joins and Aggregates are Critical to Data Warehouse Processing

Aggregations are a Key Component of Data Warehouse

Processing

Using High-Performance Aggregations for Preprocessing

Pre-Calculated Aggregations Speed Queries

Using Multi-Level Hierarchal Aggregations

Joins Combine Data for Reports and Downstream Processing

Using High-Performance Joins for Changed Data Capture (CDC) and Other

Preprocessing

Case Studies Demonstrate Elapsed Time Savings Using

DMExpress ADM High-Performance Aggregations and Joins

Scenario #1: Slow Java Processing

The ADM Solution

Scenario #2: Slow ETL Application

The ADM Solution

DMExpress: An Enterprise Solution

•

•

50 Tice Boulevard

Woodcliff Lake, NJ 07677

www.syncsort.com

201-930-8200