Data Integrator Performance Optimization Guide



Trademarks Business Objects owns the following U.S. patents, which may cover products that are offered and licensed by Business Objects: 5,555,403; 6,247,008 B1; 6,578,027 B2; 6,490,593; and 6,289,352. Business Objects and the Business Objects logo, BusinessObjects, Crystal Reports, Crystal Xcelsius, Crystal Decisions, Intelligent Question, Desktop Intelligence, Crystal Enterprise, Crystal Analysis, WebIntelligence, RapidMarts, and BusinessQuery are trademarks or registered trademarks of Business Objects in the United States and/or other countries. All other names mentioned herein may be trademarks of their respective owners. Copyright © 2007 Business Objects. All rights reserved.

Third-party contributors

Business Objects products in this release may contain redistributions of software licensed from third-party contributors. Some of these individual components may also be available under alternative licenses. A partial listing of third-party contributors that have requested or permitted acknowledgments, as well as required notices, can be found at: http://www.businessobjects.com/thirdparty


Introduction

About this guide

Welcome to the Data Integrator Performance Optimization Guide.

Data Integrator allows you to create jobs that extract, transform, and load data from database management systems into a data warehouse. After successfully designing and testing these jobs, you might wonder:

• With my current hardware and software environment, how can I configure the Data Integrator system to achieve optimal performance? What rules do I need to follow?
• How many CPUs and how much physical and virtual memory do I need? What might cause the system to bottleneck?
• How will the performance of Data Integrator be affected by options such as join ordering, caching tables, bulk loading, parallel execution, and grid computing?

These questions expose potential configuration and tuning issues with database servers and operating systems, as well as with Data Integrator. This book provides guidelines for these issues. Sections include:

• Environment Test Strategy
• Measuring Data Integrator Performance
• Tuning Overview
• Maximizing push-down operations
• Using Caches
• Using Parallel Execution
• Distributing Data Flow Execution
• Using Bulk Loading
• Other Tuning Techniques

Who should read this guide

All technical information associated with Data Integrator assumes the following:

• You are an application developer, consultant, or database administrator working on data extraction, data warehousing, or data integration.
• You understand your source and target data systems, DBMS, legacy systems, business intelligence, and messaging concepts.
• You understand your organization’s data needs.
• If you are interested in using this product to design real-time processing, you are familiar with:
  • DTD and XML Schema formats for XML files
  • Publishing Web Services (WSDL, HTTP/S and SOAP protocols, etc.)
• You are familiar with Data Integrator installation environments: Microsoft Windows or UNIX.

Business Objects information resources

Consult the Data Integrator Getting Started Guide for:

• An overview of Data Integrator products and architecture
• Data Integrator installation and configuration information
• A list of product documentation and a suggested reading path

After you install Data Integrator, you can view technical documentation from many locations. To view documentation in PDF format, you can:

• Select Start > Programs > Business Objects > Data Integrator > Data Integrator Documentation and choose:
  • Release Notes
  • Release Summary
  • Technical Manuals
• Select one of the following from the Designer’s Help menu:
  • Release Notes
  • Release Summary
  • Technical Manuals
• Select Help from the Data Integrator Administrator

You can also view and download PDF documentation, including Data Integrator documentation for previous releases (including Release Summaries and Release Notes), by visiting the Business Objects documentation Web site at http://support.businessobjects.com/documentation/.


Overview

This section covers suggested methods of tuning source and target database applications, their operating systems, and the network used by your Data Integrator environment. It also introduces key Data Integrator job execution options.

This section contains the following topics:

• The source OS and database server
• The target OS and database server
• The network
• Data Integrator Job Server OS and job options

To test and tune Data Integrator jobs, work with all four of these components in the order shown above.

In addition to the information in this chapter, you can use your UNIX or Windows operating system and database server documentation for specific techniques, commands, and utilities that can help you measure and tune the Data Integrator environment.

The source OS and database server

Tune the source operating system and database to quickly read data from disks.

Operating system

Make the input and output (I/O) operations as fast as possible. The read-ahead protocol, offered by most operating systems, can greatly improve performance. This protocol allows you to set the size of each I/O operation. Its default value is usually 4 to 8 kilobytes, which is too small. Set it to at least 64 kilobytes on most platforms.
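
The effect of the I/O size can be estimated with simple arithmetic. The following Python sketch is illustrative only: the 1 GB table size and the helper name io_operations are assumptions, and it counts sequential read operations without modelling any particular operating system.

```python
KB = 1024


def io_operations(total_bytes: int, io_size_bytes: int) -> int:
    """Return how many I/O operations are needed to read total_bytes
    when each operation transfers io_size_bytes (rounded up)."""
    if io_size_bytes <= 0:
        raise ValueError("io_size_bytes must be positive")
    return -(-total_bytes // io_size_bytes)  # integer ceiling division


# Hypothetical 1 GB source table.
table_bytes = 1024 * 1024 * KB

for size_kb in (4, 8, 64):
    ops = io_operations(table_bytes, size_kb * KB)
    print(f"{size_kb:>2} KB per I/O: {ops:,} operations")
```

Under these assumptions, moving from 4 KB to 64 KB per operation reduces the number of read operations by a factor of 16.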

Database

Tune your database on the source side to perform SELECTs as quickly as possible.

In the database layer, you can improve the performance of SELECTs in several ways, such as the following:

• Create indexes on appropriate columns, based on your Data Integrator data flows.
• Increase the size of each I/O from the database server to match the OS read-ahead I/O size.
• Increase the size of the shared buffer to allow more data to be cached in the database server.
• Cache tables that are small enough to fit in the shared buffer. For example, if jobs access the same piece of data on a database server, then cache that data. Caching data on database servers will reduce the number of I/O operations and speed up access to database tables.

See your database server documentation for more information about techniques, commands, and utilities that can help you measure and tune the source databases in your Data Integrator jobs.

The target OS and database server

Tune the target operating system and database to quickly write data to disks.

Operating system

Make the input and output operations as fast as possible. For example, asynchronous I/O, offered by most operating systems, can greatly improve performance. Turn on asynchronous I/O.

Database

Tune your database on the target side to perform INSERTs and UPDATEs as quickly as possible.

In the database layer, there are several ways to improve the performance of these operations.

Here are some examples from Oracle:

• Turn off archive logging
• Turn off redo logging for all tables
• Tune rollback segments for better performance
• Place redo log files and data files on a raw device if possible
• Increase the size of the shared buffer

See your database server documentation for more information about techniques, commands, and utilities that can help you measure and tune the target databases in your Data Integrator jobs.

The network

When reading and writing data involves going through your network, its ability to efficiently move large amounts of data with minimal overhead is very important. Do not underestimate the importance of network tuning (even if you have a very fast network with lots of bandwidth).

Set network buffers to reduce the number of round trips to the database servers across the network. For example, adjust the size of the network buffer in the database client so that each client request completely fills a small number of network packets.

Data Integrator Job Server OS and job options

Tune the Job Server operating system and set job execution options to improve performance and take advantage of self-tuning features of Data Integrator.

Operating system

Data Integrator jobs are multi-threaded applications. Typically a single data flow in a job initiates one al_engine process that in turn initiates at least 4 threads.

For maximum performance benefits:

• Consider a design that will run one al_engine process per CPU at a time.
• Tune the Job Server OS so that Data Integrator threads spread to all available CPUs.

For more information, see “Checking system utilization” on page 21.
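
The guideline of one al_engine process per CPU, with at least four threads per process, can be turned into a rough planning figure. The Python sketch below is only an illustration: the function names are hypothetical and the numbers are lower bounds derived from the two figures quoted above, not measurements of a real Job Server.

```python
import os

# Figure quoted above: one al_engine process initiates at least 4 threads.
MIN_THREADS_PER_ENGINE = 4


def suggested_concurrent_engines(cpu_count: int) -> int:
    """One al_engine process per CPU at a time (never fewer than one)."""
    return max(1, cpu_count)


def minimum_thread_count(engine_count: int) -> int:
    """Lower bound on threads started by engine_count al_engine processes."""
    return engine_count * MIN_THREADS_PER_ENGINE


cpus = os.cpu_count() or 1  # os.cpu_count() can return None
engines = suggested_concurrent_engines(cpus)
print(f"{cpus} CPUs: about {engines} concurrent al_engine processes, "
      f"at least {minimum_thread_count(engines)} threads")
```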

Data Integrator jobs

You can tune Data Integrator job execution options after:

• Tuning the database and operating system on the source and the target computers
• Adjusting the size of the network buffer
• Confirming that your data flow design seems optimal

You can tune the following execution options to improve the performance of Data Integrator jobs:

• Monitor sample rate
• Collect statistics for optimization and Use collected statistics

Setting Monitor sample rate

During job execution, Data Integrator writes information to the monitor log file and updates job events after processing the number of rows specified in Monitor sample rate. The default value is 1000. Increase Monitor sample rate to reduce the number of calls to the operating system to write to the log file.

When setting Monitor sample rate, you must weigh the performance improvements gained by making fewer calls to the operating system against your ability to view more detailed statistics during job execution. With a higher Monitor sample rate, Data Integrator collects more data before calling the operating system to open the file, and performance improves. However, with a higher Monitor sample rate, more time passes before you can view statistics during job execution.

In production environments when your jobs transfer large volumes of data, Business Objects recommends that you increase Monitor sample rate to 50,000.

Note: If you use a virus scanner on your files, exclude the Data Integrator log from the virus scan. Otherwise, the virus scan analyzes the Data Integrator log repeatedly during job execution, which degrades performance.
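
The trade-off behind Monitor sample rate can be illustrated with a short calculation. This Python sketch assumes one monitor log update per sample interval for a single data flow object; the 100 million row volume and the function name monitor_log_updates are hypothetical.

```python
def monitor_log_updates(total_rows: int, sample_rate: int) -> int:
    """Approximate number of monitor log updates for one data flow object,
    assuming one update every sample_rate rows."""
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")
    return total_rows // sample_rate


rows = 100_000_000  # hypothetical production volume

for rate in (1_000, 50_000):
    updates = monitor_log_updates(rows, rate)
    print(f"Monitor sample rate {rate:>6,}: about {updates:,} log updates")
```

Under these assumptions, raising the rate from the default of 1000 to 50,000 cuts the number of log updates by a factor of 50, at the cost of coarser statistics while the job runs.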

Collecting statistics for self-tuning

Data Integrator provides a self-tuning feature to determine the optimal cache type (in-memory or pageable) to use for a data flow.

To take advantage of this self-tuning feature

1. When you first execute a job, select the option Collect statistics for optimization to collect statistics, which include the number of rows and the width of each row. Ensure that you collect statistics with data volumes that represent your production environment. This option is not selected by default.
2. The next time you execute the job, select Use collected statistics. This option is selected by default.
3. When changes occur in data volumes, re-run your job with Collect statistics for optimization to ensure that Data Integrator has the most current statistics to optimize cache types.


Measuring Data Integrator Performance

Overview

This chapter contains the following topics:

• Data Integrator processes and threads
• Measuring performance of Data Integrator jobs

Data Integrator processes and threads

Data Integrator uses processes and threads to execute jobs that extract data from sources, transform the data, and load data into a data warehouse. The number of concurrently executing processes and threads affects the performance of Data Integrator jobs.

Data Integrator processes

The processes Data Integrator uses to run jobs are:

• al_jobserver
  The al_jobserver initiates one process for each Job Server configured on a computer. This process does not use much CPU power because it is only responsible for launching each job and monitoring the job’s execution.
• al_engine
  For batch jobs, an al_engine process runs when a job starts and for each of its data flows. Real-time jobs run as a single process.
  The number of processes a batch job initiates also depends upon the number of:
  • parallel work flows
  • parallel data flows
  • sub data flows

For an example of the Data Integrator monitor log that displays the processes, see “Analyzing log files for task duration” on page 25.

Data Integrator threads

A data flow typically initiates one al_engine process, which creates one thread per data flow object. A data flow object can be a source, transform, or target. For example, two sources, a query, and a target could initiate four threads.

If you are using parallel objects in data flows, the thread count will increase to approximately one thread for each source or target table partition. If you set the Degree of parallelism (DOP) option for your data flow to a value greater than one, the thread count per transform will increase. For example, a DOP of 5 allows five concurrent threads for a Query transform.

To run objects within data flows in parallel, use the following Data Integrator features:
• Table partitioning
• File multithreading
• Degree of parallelism for data flows

For more information, see “Using Parallel Execution” on page 67.
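
The thread counts described in this section can be approximated before a job is run. The Python sketch below is a rough model only: the DataFlowShape class and estimate_threads function are hypothetical names, and the formula simply applies "one thread per object", "one thread per table partition", and "DOP threads per transform" as stated above. Actual counts depend on the job design and environment.

```python
from dataclasses import dataclass


@dataclass
class DataFlowShape:
    """Hypothetical description of a data flow, used only for estimation."""
    sources: int
    transforms: int
    targets: int
    partitions_per_table: int = 1   # 1 means the tables are not partitioned
    degree_of_parallelism: int = 1  # DOP option of the data flow


def estimate_threads(shape: DataFlowShape) -> int:
    """Approximate thread count for one al_engine process."""
    table_threads = (shape.sources + shape.targets) * shape.partitions_per_table
    transform_threads = shape.transforms * shape.degree_of_parallelism
    return table_threads + transform_threads


# Two sources, a query, and a target, as in the example in the text.
basic = DataFlowShape(sources=2, transforms=1, targets=1)
with_dop = DataFlowShape(sources=2, transforms=1, targets=1, degree_of_parallelism=5)

print(estimate_threads(basic))
print(estimate_threads(with_dop))
```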

Measuring performance of Data Integrator jobs

You can use several techniques to measure performance of Data Integrator jobs:

• Checking system utilization
• Analyzing log files for task duration
• Reading the Monitor Log for execution statistics
• Reading the Performance Monitor for execution statistics
• Reading Operational Dashboards for execution statistics

Checking system utilization

The number of Data Integrator processes and threads concurrently executing affects the utilization of system resources (see “Data Integrator processes and threads” on page 20).

Check the utilization of the following system resources:

• CPU
• Memory
• Disk
• Network

To monitor these system resources, use the following tools:

For UNIX:
• top or a third-party utility (such as glance for HPUX)

For Windows:

Depending on the performance of your jobs and the utilization of system resources, you might want to adjust the number of Data Integrator processes and threads. The following sections describe different situations and suggest Data Integrator features to adjust the number of processes and threads for each situation.

CPU utilization

Data Integrator is designed to maximize the use of CPUs and memory available to run the job.

The total number of concurrent threads a job can run depends upon job design and environment. Test your job while watching multi-threaded Data Integrator processes to see how much CPU and memory the job requires. Make needed adjustments to your job design and environment and test again to confirm improvements.

For example, if you run a job and see that the CPU utilization is very high, you might decrease the DOP value or run fewer parallel jobs or data flows. Otherwise, CPU thrashing might occur. For more information about DOP, see “Using Parallel Execution” on page 67.

For another example, if you run a job and see that only half a CPU is being used, or if you run eight jobs on an eight-way computer and CPU usage is only 50%, you can interpret this CPU utilization in several ways:

• One interpretation might be that Data Integrator is able to push most of the processing down to source and/or target databases.
• Another interpretation might be that there are bottlenecks in the database server or the network connection. Bottlenecks on database servers do not allow readers or loaders in jobs to use Job Server CPUs efficiently. To determine bottlenecks, examine:
  • Disk service time on database server computers
    Disk service time typically should be below 15 milliseconds. Consult your server documentation for methods of improving performance. For example, having a fast disk controller, moving database server log files to a raw device, and increasing log size could improve disk service time.
  • Number of threads per process allowed on each database server operating system. For example:
    • On HPUX, the number of kernel threads per process is configurable. The CPU to thread ratio defaults to one-to-one. Business Objects recommends setting the number of kernel threads per CPU to between 512 and 1024.
    • On Solaris and AIX, the number of threads per process is not configurable. The number of threads per process depends on system resources. If a process terminates with a message like “Cannot create threads,” you should consider tuning the job. For example, use the Run as a separate process option to split a data flow or use the Data_Transfer transform to create two sub data flows to execute sequentially. Since each sub data flow is executed by a different Data Integrator al_engine process, the number of threads needed for each will be 50% less than in your previous job design.
      If you are using the Degree of parallelism option in your data flow, reduce the number for this option in the data flow Properties window.
  • Network connection speed
    Determine the rate that your data is being transferred across your network.
    • If the network is a bottleneck, you might change your job execution distribution level from sub data flow to data flow or job to execute the entire data flow on the local Job Server. For more information, see “Using grid computing to distribute data flows execution” on page 102.
    • If the capacity of your network is much larger, you might retrieve multiple rows from source databases using fewer requests. See “Using array fetch size” on page 154.
• Yet another interpretation might be that the system is under-utilized. In this case, you might increase the value for the Degree of parallelism option and increase the number of parallel jobs and data flows. For more information, see Chapter 7: Using Parallel Execution.

Data Integrator memory

For memory utilization, you might have one of the following different cases:

• Low amount of physical memory.
  In this case, you might take one of the following actions:
  • Add more memory to the Job Server.
  • Redesign your data flow to run memory-consuming operations in separate sub data flows that each use a smaller amount of memory, and distribute the sub data flows over different Job Servers to access memory on multiple machines. For more information, see “Splitting a data flow into sub data flows” on page 88.
  • Redesign your data flow to push down memory-consuming operations to the database. For more information, see “Push-down operations” on page 36.
    For example, if your data flow reads data from a table, joins it to a file, and then groups it to calculate an average, the group by operation might be occurring in memory. If you stage the data after the join and before the group by into a database on a different computer, then when a sub data flow reads the staged data and continues with the group processing, it can utilize memory from the database server on a different computer. This situation optimizes your system as a whole.
    For information about how to stage your data, see “Data_Transfer transform” on page 96. For more information about distributing sub data flows to different computers, see “Using grid computing to distribute data flows execution” on page 102.

• Large amount of memory but it is under-utilized.
  In this case, you might cache more data. Caching data can improve the performance of data transformations because it reduces the number of times the system must access the database.
  Data Integrator provides two types of caches: in-memory and pageable. For more information, see “Caching data” on page 56.

• Paging occurs.
  Pageable cache is the default cache type for data flows. On Windows and Linux, the virtual memory available to the al_engine process is 1.5 gigabytes (500 megabytes of virtual memory is reserved for other engine operations, totaling 2GB). On UNIX, Data Integrator limits the virtual memory for the al_engine process to 3.5 gigabytes (500MB is reserved for other engine operations, totaling 4GB). If more memory is needed than these virtual memory limits, Data Integrator starts paging to continue executing the data flow.
  If your job or data flow requires more memory than these limits, you can take advantage of one of the following Data Integrator features to avoid paging:
  • Split the data flow into sub data flows that can each use the amount of memory set by the virtual memory limits.
    Each data flow or each memory-intensive operation within a data flow can run as a separate process that uses separate memory from each other to improve performance and throughput. For more information, see “Splitting a data flow into sub data flows” on page 88.
  • Push down memory-intensive operations to the database server so that less memory is used on the Job Server computer. For more information, see “Push-down operations” on page 36.
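
A quick estimate can indicate whether a cached table is likely to exceed the virtual memory limits quoted above. The following Python sketch is a simplification under stated assumptions: it multiplies row count by row width (the two statistics the guide says are collected for optimization), treats one gigabyte as 2^30 bytes, and ignores any engine overhead. The function names and the example figures are hypothetical.

```python
GB = 1024 ** 3

# Limits quoted in the text above for the al_engine process.
ENGINE_MEMORY_LIMIT_BYTES = {
    "windows_linux": int(1.5 * GB),
    "unix": int(3.5 * GB),
}


def estimated_cache_bytes(row_count: int, row_width_bytes: int) -> int:
    """Rough cache size: number of rows times width of each row."""
    return row_count * row_width_bytes


def likely_to_page(row_count: int, row_width_bytes: int, platform: str) -> bool:
    """True if the rough estimate exceeds the platform limit."""
    limit = ENGINE_MEMORY_LIMIT_BYTES[platform]
    return estimated_cache_bytes(row_count, row_width_bytes) > limit


# Hypothetical lookup table: 20 million rows of 100 bytes each.
for platform in ("windows_linux", "unix"):
    print(platform, likely_to_page(20_000_000, 100, platform))
```

If the estimate exceeds the limit, splitting the data flow or pushing the operation down to the database, as described above, are the options to consider.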

Analyzing log files for task duration

The trace log shows the progress of an execution through each component (object) of a job. The following sample Trace log shows a separate Process ID (Pid) for the Job, data flow, and each of the two sub data flows.

This sample log contains messages about sub data flows, caches, and statistics. For more information about sub data flows, see “Splitting a data flow into sub data flows” on page 88. For more information about caches and statistics, see “Caching data” on page 56.

For information about accessing, viewing, and managing Data Integrator logs, see “Log” on page 124 of the Data Integrator Reference Guide.

Reading the Monitor Log for execution statistics

The Monitor log file indicates how many rows Data Integrator produces or loads for a job. By viewing this log during job execution, you can observe the progress of row-counts to determine the location of bottlenecks. You can use the Monitor log to answer questions such as the following:

• What transform is running at any moment?
• How many rows have been processed so far?
  The frequency that the Monitor log refreshes the statistics is based on Monitor sample rate. For more information, see “Setting Monitor sample rate” on page 17.
• How long does it take to build the cache for a lookup or comparison table? How long does it take to process the cache?
  If it takes a long time to build the cache, use persistent cache. For more information, see “Using persistent cache” on page 60.
• How long does it take to sort?
  If it takes a long time to sort, you can redesign your data flow to push down the sort operation to the database. For more information, see “Push-down operations” on page 36.
• How much time elapses before a blocking operation sends out the first row?
  If your data flow contains resource-intensive operations after the blocking operation, you can add Data_Transfer transforms to push down the resource-intensive operations. For more information, see “Data_Transfer transform for push-down operations” on page 43.

You can view the Monitor log from the following tools:

• The Designer, as the job executes, when you click the Statistics icon.
• The Administrator of the Management Console, when you click the Monitor link for a job from the Batch Job Status page.

The following sample Monitor log in the Designer shows the path for each object in the job, the number of rows processed and the elapsed time for each object. The Absolute time is the total time from the start of the job to when Data Integrator completes the execution of the data flow object.

For information about accessing, viewing, and managing Data Integrator logs, see “Log” on page 124 of the Data Integrator Reference Guide.

Reading the Performance Monitor for execution statistics

The Performance Monitor displays execution information for each work flow, data flow, and sub data flow within a job. You can display the execution times in a graph or a table format. You can use the Performance Monitor to answer questions such as the following:

• Which data flows might be bottlenecks?
• How much time did a data flow or sub data flow take to execute?
• How many rows did the data flow or sub data flow process?
• How much memory did a specific data flow use?

Note: Memory statistics (Cache Size column) display in the Performance Monitor only if you select the Collect statistics for monitoring option when you execute the job.

The following sample Performance Monitor shows the following information:

• The Query_Lookup transform used 110 kilobytes of memory.
• The first sub data flow processed 830 rows, and the second sub data flow processed 35 rows.

To view the Performance Monitor

1. Access the Management Console with one of the following methods:

  • In the Designer top menu bar, click Tools and select Data Integrator Management Console.
  • Click Start > Programs > BusinessObjects Data Integrator >
2. On the launch page, click Administrator.
3. Select Batch > repository.
4. On the Batch Job Status page, find a job execution instance.
5. Under Job Information for an instance, click Performance Monitor.

For more information, see “Monitoring and tuning cache types” on page 62.

Reading Operational Dashboards for execution statistics

Operational dashboard reports contain job and data flow execution information for one or more repositories over a given time period (for example, the last day or week). You can use operational statistics reports to answer some of the following questions:

• Are jobs executing within the allotted time frames?
• How many jobs succeeded or failed over a given execution period?
• How is the execution time for a job evolving over time?

To compare execution times for the same job over time

1. Access the metadata reporting tool with one of the following methods:

  • In the Designer top menu bar, click Tools and select Data Integrator
  • In the Windows Start menu, click Programs > Business Objects > Data Integrator > Data Integrator Management Console.

2. On the launch page, click Dashboards.

3. Look at the graphs in Job Execution Statistic History or Job Execution Duration History to see if performance is increasing or decreasing.

4. On the Job Execution Duration History, if there is a specific day that looks high or low compared to the other execution times, click that point on the graph to view the Job Execution Duration graph for all jobs that ran that day.

5. Click View all history to compare different executions of a specific job or data flow.


6. On the Job Execution History tab, you can select a specific job and number of days.

7. On the Data Flow Execution History tab, you can select a specific job and number of days, as well as search for a specific data flow.

For more information about Operational Dashboards, see Chapter 4, “Operational Dashboard Reports,” in the Data Integrator Management Console: Metadata Reports Guide.


Tuning Overview

Overview

This chapter presents an overview of the different Data Integrator tuning options, with cross-references to subsequent chapters for more details.

Strategies to execute Data Integrator jobs

To maximize performance of your Data Integrator jobs, Business Objects recommends that you use the following tuning strategies:

• Maximizing push-down operations to the database server
• Improving Data Integrator throughput
• Using advanced Data Integrator tuning options

Maximizing push-down operations to the database server

Data Integrator generates SQL SELECT statements to retrieve the data from source databases. Data Integrator automatically distributes the processing workload by pushing down as much as possible to the source database server.

Pushing down operations provides the following advantages:

• Use the power of the database server to execute SELECT operations (such as joins, Group By, and common functions such as decode and string functions). Often the database is optimized for these operations.
• Minimize the amount of data sent over the network. Fewer rows can be retrieved when the SQL statements include filters or aggregations.

You can also do a full push down from the source to the target, which means Data Integrator sends SQL INSERT INTO... SELECT statements to the target database. The following features enable a full push down:
• Data_Transfer transform
• Database links and linked datastores
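
The benefit of pushing filters and aggregations down to the database can be seen in a small, self-contained experiment. The Python sketch below uses an in-memory SQLite database purely as a stand-in for a source database; the orders table, its data, and the queries are hypothetical and are not the SQL that Data Integrator generates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("east", 10.0), ("east", 30.0), ("west", 5.0), ("west", 15.0), ("north", 8.0)],
)

# Without push-down: every row crosses the connection, and the client
# filters and aggregates the data itself.
all_rows = conn.execute("SELECT region, amount FROM orders").fetchall()
client_totals = {}
for region, amount in all_rows:
    if region != "north":
        client_totals[region] = client_totals.get(region, 0.0) + amount

# With push-down: the database applies the filter and the Group By,
# so only the aggregated rows are returned.
pushed_rows = conn.execute(
    "SELECT region, SUM(amount) FROM orders "
    "WHERE region <> 'north' GROUP BY region"
).fetchall()
pushed_totals = dict(pushed_rows)

print(f"rows fetched without push-down: {len(all_rows)}")
print(f"rows fetched with push-down:    {len(pushed_rows)}")
conn.close()
```

Both approaches produce the same totals, but the pushed-down query returns fewer rows to the client.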

Improving Data Integrator throughput

Use the following Data Integrator features to improve throughput:

• Using caches for faster access to data
  You can improve the performance of data transformations by caching as much data as possible. By caching data in memory, you limit the number of times the system must access the database. For more information, see Chapter 6: Using Caches.
• Bulk loading to the target
  Data Integrator supports database bulk loading engines including the Oracle bulk load API. You can have multiple bulk load processes running in parallel. For more information, see Chapter 8: Using Bulk Loading.
• Other tuning techniques
  • Source-based performance options
    • “Join ordering” on page 150
    • “Minimizing extracted data” on page 154
    • “Using array fetch size” on page 154
  • Target-based performance options
    • “Loading method” on page 155
    • “Rows per commit” on page 156
  • Job design performance options
    • “Loading only changed data” on page 157
    • “Minimizing data type conversion” on page 157
    • “Minimizing locale conversion” on page 157
    • “Precision in operations” on page 158
  For more information, see Chapter 10: Other Tuning Techniques.

Using advanced Data Integrator tuning options

If your jobs have CPU-intensive and memory-intensive operations, you can use the following advanced tuning features to improve performance:

• Parallel processes - Individual work flows and data flows can execute in parallel if you do not connect them in the Designer workspace.
• Parallel threads - Data Integrator supports partitioned source tables, partitioned target tables, and degree of parallelism. These options allow you to control the number of instances for a source, target, and transform that can run in parallel within a data flow. Each instance runs as a separate thread and can run on a separate CPU. For more information, see Chapter 7: Using Parallel Execution.
• Server groups and distribution levels - You can group Job Servers on different computers into a logical Data Integrator component called a server group. A server group automatically measures resource availability on each Job Server in the group and distributes scheduled batch jobs to the computer with the lightest load at runtime. This functionality also provides a hot backup method. If one Job Server in a server group is down, another Job Server in the group processes the job. For more information, see “Using Server Groups” on page 43 of the Data Integrator Management Console: Administrator Guide.
  You can distribute the execution of data flows or sub data flows within a batch job across multiple Job Servers within a Server Group to better balance resource-intensive operations. For more information, see “Using grid computing to distribute data flows execution” on page 102.

Chapter 5: Maximizing push-down operations

About this chapter

For SQL sources and targets, Data Integrator creates database-specific SQL statements based on the data flow diagrams in a job. Data Integrator generates SQL SELECT statements to retrieve the data from source databases. To optimize performance, Data Integrator pushes down as many SELECT operations as possible to the source database and combines as many operations as possible into one request to the database. Data Integrator can push down SELECT operations such as joins, Group By, and common functions such as decode and string functions.

Data flow design influences the number of operations that Data Integrator can push to the database. Before running a job, you can view the SQL that Data Integrator generates and adjust your design to maximize the SQL that is pushed down to improve performance.

You can use database links and the Data_Transfer transform to push down more operations.

This chapter discusses:

•

Push-down operations

•

Push-down examples

•

Viewing SQL

•

Data_Transfer transform for push-down operations

•

Database link support for push-down operations across datastores

Push-down operations

By pushing down operations to the source database, Data Integrator reduces the number of rows and operations that the engine must retrieve and process, which improves performance. When determining which operations to push to the database, Data Integrator examines the database and its environment.

Full push-down operations

The Data Integrator optimizer always first tries to do a full push-down operation. A full push-down operation is when all Data Integrator transform operations can be pushed down to the databases and the data streams directly from the source database to the target database. Data Integrator sends SQL INSERT INTO... SELECT statements to the target database where the SELECT retrieves data from the source.

Data Integrator does a full push-down operation to the source and target databases when the following conditions are met:

•

All of the operations between the source table and target table can be pushed down

•

The source and target tables are from the same datastore or they are in datastores that have a database link defined between them.

You can also use the following features to enable a full push-down from the source to the target:

•

Data_Transfer transform (see “Data_Transfer transform for push-down operations” on page 43).

•

Database links (see “Database link support for push-down operations across datastores” on page 49).

When all other operations in the data flow can be pushed down to the source database, the auto-correct loading operation is also pushed down for a full push-down operation to the target. Data Integrator sends an SQL MERGE INTO target statement that implements the Ignore columns with value and Ignore columns with null options.

Partial push-down operations

When a full push-down operation is not possible, Data Integrator still pushes down the SELECT statement to the source database. Operations within the SELECT statement that Data Integrator can push to the database include:

•

Aggregations - Aggregate functions, typically used with a GROUP BY clause, always produce a data set smaller than or the same size as the original data set.

•

Distinct rows - When you select Distinct rows from the Select tab in the query editor, Data Integrator outputs only unique rows.

•

Filtering - Filtering can produce a data set smaller than or equal to the original data set.

•

Joins - Joins typically produce a data set smaller than or similar in size to the original tables. Data Integrator can push down joins when either of the following conditions exists:

•

The source tables are in the same datastore

•

The source tables are in datastores that have a database link defined between them

•

Ordering - Ordering does not affect data-set size. Data Integrator can efficiently sort data sets that fit in memory.

•

Projection - Projection is the subset of columns that you map on the Mapping tab in the query editor. Projection normally produces a smaller data set because it only returns columns needed by subsequent operations in a data flow.

•

Functions - Most Data Integrator functions that have equivalents in the underlying database are appropriately translated. These functions include decode, aggregation, and string functions.

Operations that cannot be pushed down

Data Integrator cannot push some transform operations to the database. For example:

•

Expressions that include Data Integrator functions that do not have database correspondents

•

Load operations that contain triggers

•

Transforms other than Query

•

Joins between sources that are on different database servers that do not have database links defined between them.

Similarly, Data Integrator cannot always combine operations into single requests. For example, when a stored procedure contains a COMMIT statement or does not return a value, Data Integrator cannot combine the stored procedure SQL with the SQL for other operations in a query. Data Integrator can only push operations supported by the DBMS down to that DBMS. Therefore, for best performance, try not to intersperse Data Integrator transforms among operations that can be pushed down to the database.

Push-down examples

The following are typical push-down scenarios.

Scenario 1: Combine transforms to push down

When determining how to push operations to the database, Data Integrator first collapses all the transforms into the minimum set of transformations expressed in terms of the source table columns. Next, Data Integrator pushes all possible operations on tables of the same database down to that DBMS.


For example, the following data flow extracts rows from a single source table.

The first query selects only the rows in the source where column A contains a value greater than 100. The second query refines the extraction further, reducing the number of columns returned and further reducing the qualifying rows.

Data Integrator collapses the two queries into a single command for the DBMS to execute. The following command uses AND to combine the WHERE clauses from the two queries:

SELECT A, MAX(B), C
FROM source
WHERE A > 100 AND B = C
GROUP BY A, C

Data Integrator can push down all the operations in this SELECT statement to the source DBMS.
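The collapse of two chained queries into one statement can be checked on a small data set. The following is a minimal sketch, not Data Integrator output: it uses Python's built-in sqlite3 module as a stand-in DBMS, the sample rows are invented for illustration, and the ORDER BY clauses are added only so the two results can be compared deterministically.

```python
import sqlite3

# In-memory SQLite database standing in for the source DBMS (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source (A INTEGER, B INTEGER, C INTEGER)")
conn.executemany(
    "INSERT INTO source VALUES (?, ?, ?)",
    [(50, 1, 1), (150, 2, 2), (150, 3, 3), (150, 9, 2), (200, 4, 4), (200, 5, 7)],
)

# The two queries applied one after the other, written as a nested SELECT.
chained = conn.execute(
    """
    SELECT A, MAX(B), C
    FROM (SELECT * FROM source WHERE A > 100) AS query1
    WHERE B = C
    GROUP BY A, C
    ORDER BY A, C
    """
).fetchall()

# The collapsed statement: both WHERE clauses combined with AND.
collapsed = conn.execute(
    """
    SELECT A, MAX(B), C
    FROM source
    WHERE A > 100 AND B = C
    GROUP BY A, C
    ORDER BY A, C
    """
).fetchall()

print(chained == collapsed)
print(collapsed)
```

Both forms return the same rows for this data, which is why a single request to the database can replace the two separate query steps.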

Scenario 2: Full push down from the source to the target

If the source and target are in the same datastore, Data Integrator can do a full push-down operation where the INSERT into the target uses a SELECT from the source. In the sample data flow in scenario 1, a full push down passes the following statement to the database:

INSERT INTO target (A, B, C)
SELECT A, MAX(B), C
FROM source
WHERE A > 100 AND B = C
GROUP BY A, C


If the source and target are not in the same datastore, Data Integrator can also do a full push-down operation if you use one of the following features:

•

Add a Data_Transfer transform before the target. For more information, see “Data_Transfer transform for push-down operations” on page 43.

•

Define a database link between the two datastores. For more information, see “Database link support for push-down operations across datastores” on page 49.
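To see what a full push-down means in practice, the sketch below runs the scenario's INSERT INTO... SELECT against a stand-in database. It assumes Python's built-in sqlite3 module in place of a real source and target datastore, and the sample rows are invented; the point is only that no row passes through the client program between the read and the load.

```python
import sqlite3

# One in-memory database holds both tables, mimicking a source and target
# that live in the same datastore (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE source (A INTEGER, B INTEGER, C INTEGER);
    CREATE TABLE target (A INTEGER, B INTEGER, C INTEGER);
    INSERT INTO source VALUES (50, 1, 1), (150, 2, 2), (150, 9, 2), (200, 4, 4);
    """
)

# A single statement filters, aggregates, and loads inside the database.
conn.execute(
    """
    INSERT INTO target (A, B, C)
    SELECT A, MAX(B), C
    FROM source
    WHERE A > 100 AND B = C
    GROUP BY A, C
    """
)
conn.commit()

rows = conn.execute("SELECT A, B, C FROM target ORDER BY A").fetchall()
print(rows)
```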

Scenario 3: Full push down for auto correct load to the target

For an Oracle target, if you specify the Auto correct load option, Data Integrator can do a full push-down operation where the SQL statement is a MERGE into the target with a SELECT from the source. In the sample data flow in scenario 1, a full push down passes the following statement to the database when the source and target are in the same datastore:

MERGE INTO target s
USING (SELECT A, MAX(B) AS B, C FROM source WHERE A > 100 AND B = C GROUP BY A, C) n
ON (s.A = n.A)
WHEN MATCHED THEN UPDATE SET s.B = n.B, s.C = n.C
WHEN NOT MATCHED THEN INSERT (A, B, C) VALUES (n.A, n.B, n.C)

Scenario 4: Partial push down to the source

If the data flow contains operations that cannot be passed to the DBMS, Data Integrator optimizes the transformation differently than in the previous scenarios. For example, if Query1 used the condition func(A) > 100, where func is a Data Integrator custom function, then Data Integrator generates two commands:

•

The first query becomes the following command which the source DBMS executes:

SELECT A, B, C FROM source WHERE B = C

•

The second query becomes the following command which Data Integrator executes because func cannot be pushed to the database:

SELECT A, MAX(B), C FROM Query1 WHERE func(A) > 100 GROUP BY A, C
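The split between database work and engine work in this scenario can be sketched as follows. This is an illustration under stated assumptions, not the Data Integrator engine: sqlite3 stands in for the source DBMS, plain Python stands in for the engine, and the body of func is hypothetical (any function with no database equivalent would do).

```python
import sqlite3


def func(a):
    """Hypothetical custom function that the database cannot evaluate."""
    return a * 2 - 100


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source (A INTEGER, B INTEGER, C INTEGER)")
conn.executemany(
    "INSERT INTO source VALUES (?, ?, ?)",
    [(90, 1, 1), (150, 2, 2), (150, 7, 7), (150, 9, 2), (200, 4, 4)],
)

# Command 1: executed by the source database; only the B = C filter is pushed down.
pushed = conn.execute("SELECT A, B, C FROM source WHERE B = C").fetchall()

# Command 2: executed outside the database, because func cannot be pushed down.
# Filter with func, then compute MAX(B) per (A, C) group in memory.
groups = {}
for a, b, c in pushed:
    if func(a) > 100:
        key = (a, c)
        groups[key] = max(groups.get(key, b), b)

result = sorted((a, max_b, c) for (a, c), max_b in groups.items())
print(len(pushed), result)
```

Every row that satisfies B = C has to leave the database before func can be applied, which is the cost that a full push-down avoids.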

Viewing SQL

Before running a job, you can view the SQL code that Data Integrator generates for table sources in data flows. By examining the SQL code, you can verify that Data Integrator generates the commands you expect. If necessary, you can alter your design to improve the data flow.

Note: This option is not available for R/3 data flows.

To view SQL

1. Validate and save data flows.

2. Open a data flow in the workspace.

3. Select Display Optimized SQL from the Validation menu.

Alternatively, you can right-click a data flow in the object library and select Display Optimized SQL.

The Optimized SQL window opens and shows a list of datastores and the optimized SQL code for the selected datastore. By default, the Optimized SQL window selects the first datastore, as the following example shows:

Data Integrator only shows the SELECT generated for table sources and INSERT INTO... SELECT for targets. Data Integrator does not show the SQL generated for SQL sources that are not table sources, such as:

•

Lookup function

•

Key_generation function

•

Key_Generation transform

4. Select a name from the list of datastores on the left to view the SQL that this data flow applies against the corresponding database or application. The following example shows the optimized SQL for the second datastore, which illustrates a full push-down operation (INSERT INTO... SELECT). This data flow uses a Data_Transfer transform to create a transfer table that Data Integrator loads directly into the target. For more information, see “Data_Transfer transform for push-down operations” on page 43.

In the Optimized SQL window you can:

•

Use the Find button to perform a search on the SQL displayed.

•

Use the Save As button to save the text as a .sql file.

If you try to use the Display Optimized SQL command when there are no SQL sources in your data flow, Data Integrator alerts you. Examples of non-SQL sources include:

•

R/3 data flows

•

Message sources

•

File sources

•

IDoc sources

If a data flow is not valid when you click the Display Optimized SQL option, Data Integrator alerts you.

Note: The Optimized SQL window displays the existing SQL statement in the repository. If you changed your data flow, save it so that the Optimized SQL window displays your current SQL statement.

Data_Transfer transform for push-down operations

Use the Data_Transfer transform to move data from a source or from another transform into the target datastore and enable a full push-down operation (INSERT INTO... SELECT) to the target. You can use the Data_Transfer transform to push down resource-intensive operations that occur anywhere within a data flow to the database. Resource-intensive operations include joins, GROUP BY, ORDER BY, and DISTINCT.

Examples of push-down with Data_Transfer

Scenario 1: Push down an operation after a blocking operation

You can place a Data_Transfer transform after a blocking operation to enable Data Integrator to push down a subsequent operation. A blocking operation is an operation that Data Integrator cannot push down to the database, and prevents (“blocks”) operations after it from being pushed down. For examples of blocking operations, see “Operations that cannot be pushed down” on page 38.

For example, you might have a data flow that groups sales order records by country and region, and sums the sales amounts to find which regions are generating the most revenue. The following diagram shows that the data flow contains a Pivot transform to obtain orders by Customer ID, a Query transform that contains a lookup_ext function to obtain sales subtotals, and another Query transform to group the results by country and region.

Because the Pivot transform and the lookup_ext function are before the query with the GROUP BY clause, Data Integrator cannot push down the GROUP BY operation. The following Optimized SQL window shows the SELECT statement that Data Integrator pushes down to the source database:


However, if you add a Data_Transfer transform before the second Query transform and specify a transfer table in the same datastore as the target table, Data Integrator can push down the GROUP BY operation.
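The staging idea behind this use of Data_Transfer can be sketched outside the product. In the example below, sqlite3 stands in for the target datastore, a Python dictionary lookup stands in for the blocking operation, and the table names transfer_tbl and sales_by_region, the lookup data, and the order rows are all hypothetical. Rows are processed in the client, staged in a transfer table, and then grouped by the database itself.

```python
import sqlite3

# Hypothetical lookup performed outside the database (the blocking step).
REGION_BY_CUSTOMER = {1: ("US", "West"), 2: ("US", "East"), 3: ("DE", "South")}

# Hypothetical order rows: (customer_id, amount).
orders = [(1, 100.0), (2, 40.0), (1, 60.0), (3, 25.0), (2, 10.0)]

target_db = sqlite3.connect(":memory:")
target_db.executescript(
    """
    CREATE TABLE transfer_tbl (country TEXT, region TEXT, amount REAL);
    CREATE TABLE sales_by_region (country TEXT, region TEXT, total REAL);
    """
)

# Client side: apply the blocking step row by row, then stage the rows
# in a transfer table that lives in the same database as the target.
staged = [REGION_BY_CUSTOMER[cust] + (amount,) for cust, amount in orders]
target_db.executemany("INSERT INTO transfer_tbl VALUES (?, ?, ?)", staged)

# Database side: the GROUP BY now runs as INSERT INTO... SELECT in the database.
target_db.execute(
    """
    INSERT INTO sales_by_region (country, region, total)
    SELECT country, region, SUM(amount)
    FROM transfer_tbl
    GROUP BY country, region
    """
)
target_db.commit()

totals = target_db.execute(
    "SELECT country, region, total FROM sales_by_region ORDER BY country, region"
).fetchall()
print(totals)
```

The trade-off is an extra write to the transfer table in exchange for letting the database, rather than the client, do the grouping.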

The following Data_Transfer Editor window shows that the transfer type is table and the transfer table is in the same datastore as the target table.


For more information, see “Data_Transfer” on page 273 of the Data Integrator Reference Guide.

The following Optimized SQL window shows that Data Integrator pushed down the GROUP BY to the transfer table TRANS2.

Scenario 2: Using Data_Transfer tables to speed up auto correct loads

Auto correct loading ensures that the same row is not duplicated in a target table, which is useful for data recovery operations. However, an auto correct load prevents a full push-down operation from the source to the target when the source and target are in different datastores.

For large loads where auto-correct is required, you can put a Data_Transfer transform before the target to enable a full push down from the source to the target. Data Integrator generates an SQL MERGE INTO target statement that implements the Ignore columns with value and Ignore columns with null options if they are selected on the target editor.

For example, the following data flow loads sales orders into a target table which is in a different datastore from the source.

The following target editor shows that the Auto correct load option is selected. The Ignore columns with null and Use merge options are also selected in this example.

The following Optimized SQL window shows the SELECT statement that Data Integrator pushes down to the source database.


When you add a Data_Transfer transform before the target and specify a transfer table in the same datastore as the target, Data Integrator can push down the auto correct operation.
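The effect of an auto correct load that runs entirely in the target database can be approximated as follows. This is a sketch under assumptions: SQLite has no MERGE statement, so its INSERT ... ON CONFLICT DO UPDATE (upsert, SQLite 3.24 or later) is swapped in as a stand-in, and the table names, key column, and rows are hypothetical.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript(
    """
    CREATE TABLE transfer_tbl (order_id INTEGER PRIMARY KEY, status TEXT);
    CREATE TABLE target (order_id INTEGER PRIMARY KEY, status TEXT);
    INSERT INTO target VALUES (1, 'open'), (2, 'open');
    INSERT INTO transfer_tbl VALUES (2, 'shipped'), (3, 'open');
    """
)


def auto_correct_load(conn):
    """Update rows that already exist in the target and insert the rest."""
    # "WHERE true" avoids a parsing ambiguity when an upsert follows a SELECT.
    conn.execute(
        """
        INSERT INTO target (order_id, status)
        SELECT order_id, status FROM transfer_tbl WHERE true
        ON CONFLICT (order_id) DO UPDATE SET status = excluded.status
        """
    )
    conn.commit()


# Running the load twice simulates a recovery rerun; no row is duplicated.
auto_correct_load(db)
auto_correct_load(db)

rows = db.execute("SELECT order_id, status FROM target ORDER BY order_id").fetchall()
print(rows)
```

Because the transfer table and the target are in the same database, the whole correction is one statement and no rows travel back to the client.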

The following Optimized SQL window shows the MERGE statement that Data Integrator will push down to the target.

Database link support for push-down operations across datastores

This section covers how to use database links in Data Integrator. This section contains the following topics:

•

Software support

•

Example of push-down with linked datastores

•

Generated SQL statements

•

Tuning performance at the data flow or Job Server level

Various database vendors support one-way communication paths from one database server to another. Data Integrator refers to communication paths between databases as database links. The datastores in a database link relationship are called linked datastores. For more details, see “Linked datastores” on page 107 of the Data Integrator Designer Guide.

Data Integrator uses linked datastores to enhance its performance by pushing down operations to a target database using a target datastore. Pushing down operations to a database not only reduces the amount of information that needs to be transferred between the databases and Data Integrator but also allows Data Integrator to take advantage of DBMS capabilities, such as various join algorithms.

With support for database links, Data Integrator pushes processing down from different datastores, which can refer to the same or different database types. Linked datastores allow a one-way path for data. For example, if you import a database link from target database B and link datastore B to datastore A, Data Integrator pushes the load operation down to database B, not to database A.

Software support

Data Integrator supports push-down operations using linked datastores on all Windows and Unix platforms. It supports DB2, Oracle, and MS SQL Server databases.

To take advantage of linked datastores

1. Create a database link on a database server that you intend to use as a target in a Data Integrator job.

The following database software is required:

•

For DB2, use the DB2 Information Integrator (previously known as Relational Connect) software and make sure that the database user has privileges to create and drop a nickname.

To end users and client applications, data sources appear as a single collective database in DB2. Users and applications interface with the database managed by the information server. Therefore, configure an information server and then add the external data sources. DB2 uses nicknames to identify remote tables and views.

See the DB2 database manuals for more information about how to create links for DB2 and non-DB2 servers.

•

For Oracle, use the Transparent Gateway for DB2 and MS SQL Server.

See the Oracle database manuals for more information about how to create database links for Oracle and non-Oracle servers.

•

For MS SQL Server, no special software is required.

Microsoft SQL Server supports access to distributed data stored in multiple instances of SQL Server and heterogeneous data stored in various relational and non-relational data sources using an OLE database provider. SQL Server supports access to distributed or heterogeneous database sources in Transact-SQL statements by qualifying the data sources with the names of the linked server where the data sources exist.


2. Create a database datastore connection to your target database. For more information, see “Importing database links” on page 60 of the Data Integrator Reference Guide.

Example of push-down with linked datastores

Linked datastores enable a full push-down operation (INSERT INTO... SELECT) to the target if all the sources are linked with the target. The sources and target can be in datastores that use the same database type or different database types.
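As a rough analogy only, the sketch below shows one statement that reads from a second database and loads a table in the first. It uses SQLite's ATTACH DATABASE as a stand-in for a database link, which is an assumption made for illustration; real database links in DB2, Oracle, and MS SQL Server are configured differently, and the table names and rows here are hypothetical.

```python
import sqlite3

# The main connection plays the role of the target database.
conn = sqlite3.connect(":memory:")

# A second in-memory database attached under the name remote_src stands in
# for a source reached through a database link (analogy, not real link syntax).
conn.execute("ATTACH DATABASE ':memory:' AS remote_src")

conn.executescript(
    """
    CREATE TABLE remote_src.employee (ename TEXT, deptno INTEGER);
    CREATE TABLE department (deptno INTEGER, dname TEXT);
    CREATE TABLE emp_join (ename TEXT, dname TEXT);
    INSERT INTO remote_src.employee VALUES ('Ann', 10), ('Bob', 20);
    INSERT INTO department VALUES (10, 'Sales'), (20, 'Support');
    """
)

# One statement, issued to the target side, joins across both databases
# and loads the result.
conn.execute(
    """
    INSERT INTO emp_join (ename, dname)
    SELECT e.ename, d.dname
    FROM remote_src.employee AS e
    JOIN department AS d ON d.deptno = e.deptno
    """
)
conn.commit()

rows = conn.execute("SELECT ename, dname FROM emp_join ORDER BY ename").fetchall()
print(rows)
```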

The following diagram shows an example of a data flow that will take advantage of linked datastores:

The data flow joins three source tables from different database types:

•

ora_source.HRUSER1.EMPLOYEE on \\oracle_server1

•

ora_source_2.HRUSER2.PERSONNEL on \\oracle_server2

•

mssql_source.DBO.DEPARTMENT on \\mssql_server3

Data Integrator loads the join result into the target table ora_target.HRUSER3.EMP_JOIN on \\oracle_server1.

In this data flow, the user (HRUSER3) created the following database links in the Oracle database oracle_server1.

Database Link Name | Local (to database link location) Connection Name | Remote (to database link location) Connection Name | Remote User
orasvr2 | oracle_server1 | oracle_server2 | HRUSER2
tg4msql | oracle_server1 | mssql_server | DBO

To enable a full push-down operation, database links must exist from the target database to all source databases and links must exist between the following datastores:

•

ora_target and ora_source

•

ora_target and ora_source2

•

ora_target and mssql_source

Data Integrator executes this data flow query as one SQL statement in oracle_server1:

INSERT INTO HRUSER3.EMP_JOIN (FNAME, ENAME, DEPTNO, SAL, COMM)
SELECT psnl.FNAME, emp.ENAME, dept.DEPTNO, emp.SAL, emp.COMM
FROM HRUSER1.EMPLOYEE emp, HRUSER2.PERSONNEL@orasvr2 psnl, oracle_server1.mssql_server.DBO.DEPARTMENT@tg4msql dept;

Generated SQL statements

To see how Data Integrator optimizes SQL statements, use Display Optimized SQL from the Validation menu when a data flow is open in the workspace.

•

For DB2, Data Integrator uses nicknames to refer to remote table references in the SQL display.

•

For Oracle, Data Integrator uses the following syntax to refer to remote table references: <remote_table>@<dblink_name>.

•

For SQL Server, Data Integrator uses the following syntax to refer to remote table references: <linked_server>.<remote_database>.<remote_user>.<remote_table>.

For more information, see “Viewing SQL” on page 41.

Tuning performance at the data flow or Job Server level

You might want to turn off linked-datastore push downs in cases where you do not notice performance improvements. For example, the underlying database might not process operations from different data sources well. Data Integrator pushes down Oracle stored procedures and external functions. If these are in a job that uses database links, they do not reduce the expected performance gains. However, Data Integrator does not push down functions imported from other databases (such as DB2). In this case, although you may be using database links, Data Integrator cannot push the processing down.

Test your assumptions about individual job designs before committing to a large development effort using database links.

For a data flow

On the data flow properties dialog, Data Integrator enables the Use database links option by default to allow push downs using linked datastores. If you do not want Data Integrator to use linked datastores in a data flow to push down processing, deselect the check box.

For a Job Server

You can also disable linked datastores at the Job Server level. However, the Use database links option, at the data flow level, takes precedence. See “Changing Job Server options” on page 329 of the Data Integrator Designer Guide for more information.
