Join ordering

•

Minimizing extracted data

•

Using array fetch size

•

Target-based performance options

•

Loading method

•

Rows per commit

•

Job design performance options

•

Loading only changed data

•

Minimizing data type conversion

•

Minimizing locale conversion

•

Precision in operations

These techniques require that you monitor the change in performance carefully as you make small adjustments. For more information about monitoring, see Chapter 3: Measuring Data Integrator Performance.

Source-based performance options

Data Integrator reads sources in preparation for the join in an order determined by a “depth first traversal” of a tree starting from the root node.

Controlling this order—referred to as the join order in this description—can

Other Tuning Techniques Source-based performance options

10

often have a profound effect on the performance of producing the join result.

The join order has no effect on the actual result produced. This section describes:

•

How join ranks affect join order

•

How join ranks are used in outer joins

How join ranks affect join order

Consider these examples without join ranks set:

You can use join ranks to control the order in which the sources are joined.

Data Integrator joins the child nodes in the order of decreasing join ranks. The join rank associated with a node is the maximum join rank of any node further down the branch.

For example, suppose you are joining four tables in an outer join—tables J through M—and you assign each a different join rank.

In this join tree, there is only one branch to traverse. Table A is read then joined with Table B. The results from that join are joined with Table C.

In this join tree, the longest branch is traversed first. Table J is read, then joined with Table L.

The results are joined with Table M. The results are then joined with Table K.

Table C Table B Table A

Table K Table L

Table J

Table M

Other Tuning Techniques

Source-based performance options

10

In this join tree, the join rank of Table L is considered 3 because the join rank of Table M is greater than the original join rank of Table L. However, the join rank of Table K (5) is still larger than the join rank of Table L. The join order is as follows: Table J is read and joined with Table K. The results are joined with Table L. The result of that join is joined with Table M.

Note: If the value of Join rank is 0 and you collected cache statistics for the sources involved in the join, the optimizer determines the join ranks based on the cache statistics.

How join ranks are used in outer joins

Join ordering is determined by the type of join you choose to use in Data Integrator. There are two types of joins:

•

Normal — Created by drawing connection lines from multiple sources in a data flow to the same query and then entering an integer as a Join rank value for each source using source editors. Data Integrator orders join operations using these values.

For a normal join between three or more tables, Data Integrator internally implements a series of two-way joins. First, it joins the two highest ranked tables together using the join ranking rules outlined above. Next, Data Integrator joins the result of the first join with the third highest join ranked table. This process continues until all tables are included in the join.

•

Outer Join — Created by taking a normal join and then using the Outer Join tab of the query to specify the outer and inner sources. Data Integrator processes the outer source as if it was assigned the highest join rank.

For an Outer join, Data Integrator ignores the Join rank values unless you are using three or more tables and a pair of joins share the same outer join table.

Other Tuning Techniques Source-based performance options

10

For example, if table A is an outer join to tables B and C, the join rank is calculated by taking the maximum join rank value of the two tables in each outer join pair (A and B and A and C). The pair with the higher join rank becomes the outer loop.

Data Integrator implements joins as nested loop joins. The source with the higher join rank or the one specified as an Outer Source in the Outer Join tab becomes an outer loop. If all the ranks are equal or not set (defaults to 0), then Data Integrator picks an inner and outer source at random. During job execution, Data Integrator reads the source in the outer loop one time; and reads the source in the inner loop for each row in the outer loop. Performance improves when fewer rows are read.

The number of output rows for a join depends on the type of join you use in a data flow.

•

For a Normal join, the output is limited to the number of rows that meet join conditions.

•

For an Outer join, the output is equal to or greater than the number of rows in the source specified as the Outer Source.

Using an Outer join is often preferable with real-time jobs because often you want the whole message passed on whether or not conditions a join looks for in the inner source exist.

In fact, if you do not use an Outer Join to order joins for a message, Data Integrator will still process the message as if it has the highest join rank.

The message source editor lacks a Join rank option because Data Integrator automatically gives it the highest join rank.

For more information about outer joins and the other options available in a query transform, see “Query” on page 332 of the Data Integrator Reference Guide.

Tips

For a join between two tables, assign a higher join rank value to the larger table and, if possible, cache the smaller table.

For a join between a table and file:

•

If the file is small and can be cached, then assign it a lower join rank value and cache it.

Other Tuning Techniques

Source-based performance options

10

•

If you cannot cache the file, then assign it a higher join rank value so that it becomes the “outer table” in the join.

For a join between two files, assign a higher rank value to the larger file and, if possible, cache the smaller file.

In document Data Integrator Performance Optimization Guide (Page 150-154)