• No results found

Enter a number for Degree of parallelism

Degree of parallelism

3. Enter a number for Degree of parallelism

Using Parallel Execution Parallel execution in data flows

7

The default value for Degree of parallelism is 0. If you set an individual data flow’s Degree of parallelism to this default value, then you can control it using a Global_DOP value which affects all data flows run by a given Job Server. If you use any other value for a data flow’s Degree of parallelism, it overrides the Global_DOP value.

You can use the local and global DOP options in different ways. For example:

If you want to globally set all data flow DOP values to 4, but one data flow is too complex and you do not want it to run in parallel, you can set the Degree of parallelism for this data flow locally. From the data flow’s Properties window, set this data flow’s Degree of parallelism to 1. All other data flows will replicate and run transforms in parallel after you set the Global_DOP value to 4. The default for the Global_DOP value is 1.

If you want to set the DOP on a case-by-case basis for each data flow, set the value for each data flow’s Degree of parallelism to any value except zero.

For information about how to set the Global_DOP value, see “Changing Job Server options” on page 329 of the Data Integrator Designer Guide.

4. Click OK.

Degree of parallelism and joins

If your Query transform joins sources, DOP determines the number of times the join replicates to process a parallel subset of data.

This section describes two scenarios:

DOP and executing a join as a single process

DOP and executing a join as multiple processes DOP and executing a join as a single process

The following figures show runtime instances of a data flow that contains a join with a DOP of 1, and the same data flow with a DOP of 2. You use join ranks to define the outer source and inner source (see “Join ordering” on page 150). In both data flows, the inner source is cached in memory (see

“Caching joins” on page 62).

Using Parallel Execution Parallel execution in data flows

7

Figure 7-5 :Runtime instance of a join where DOP =1

Figure 7-6 :Runtime instance of a join where DOP = 2

With a DOP greater than one, Data Integrator inserts an internal Round Robin Split (RRS) that transfers data to each of the replicated joins. The inner source is cached once, and each half of the outer source joins with the cached data in the replicated joins. The replicated joins execute in parallel, and the results merge into a single stream by an internal Merge transform.

DOP and executing a join as multiple processes

When you select the Run JOIN as a separate process in the Query transform, you can split the execution of a join among multiple processes.

Data Integrator creates a sub data flow for each separate process.

The following figure shows a runtime instance of a data flow that contains a join with a DOP of 2 and the Run JOIN as a separate process option selected.

Figure 7-7 :Runtime instance of a join that runs as multiple processes and DOP = 2

Using Parallel Execution Parallel execution in data flows

7

The data flow becomes four sub data flows (indicated by the blue dotted and dashed line in the figure):

The first sub data flow uses an internal hash algorithm to split the data.

The next two sub data flows are the replicated joins that run as separate processes.

The last sub data flow merges the data and loads the target.

Tip: If DOP is greater than one, select either job or data flow for the

Distribution level option when you execute the job. If you execute the job with the value sub data flow for Distribution level, the Hash Split sends data to the replicated queries that might be executing on different job servers. Because the data is sent on the network between different job servers, the entire data flow might be slower. For more information about job distribution levels, see “Using grid computing to distribute data flows execution” on page 102.

Degree of parallelism and functions

In Data Integrator, you can set stored procedures and custom functions to replicate with the transforms in which they are used. To specify this option, select the Enable parallel execution check box on the function’s Properties window. If this option is not selected and you add the function to a transform, the transform will not replicate and run in parallel even if its parent data flow has a value greater than 1 set for Degree of parallelism.

When enabling functions to run in parallel, verify that:

Your database will allow a stored procedure to run in parallel

A custom function set to run in parallel will improve performance All built-in functions, except the following, replicate if the transform they are used in replicates due to the DOP value:

avg()

min()

count()

previous_row_value()

count_distinct()

print()

exec ()

raise_exception()

get_domain_description()

raise_exception_ext()

gen_row_num()

set_env()

gen_row_num_by_group()

sleep()

is_group_changed()

smtp_to()

Using Parallel Execution Parallel execution in data flows

7

Setting functions to run in parallel

Use the Enable parallel execution option to set functions to run in parallel when the transforms in which they are used execute in parallel.

To enable stored procedures to run in parallel