WebSphere
®DataStage
and
QualityStage
Parallel
Job
Advanced
Developer
Guide
Version8.0.1WebSphere
®DataStage
and
QualityStage
Parallel
Job
Advanced
Developer
Guide
Version8.0.1Note
Beforeusingthisinformationandtheproductthatitsupports,besuretoreadthegeneralinformationunder“Noticesand trademarks”onpage899.
Contents
Chapter
1.
Introduction
.
.
.
.
.
.
.
. 1
Terminology . . . 1
Chapter
2.
Job
design
tips
.
.
.
.
.
.
. 3
WebSphereDataStageDesignerinterface . . . 3
Processinglargevolumesofdata . . . 3
Modulardevelopment . . . 3
Designingforgoodperformance. . . 4
Combiningdata . . . 4
Sortingdata . . . 5
Defaultandexplicittypeconversions . . . 5
UsingTransformerstages . . . 7
UsingSequentialFilestages . . . 7
UsingDatabasestages . . . 8
Databasesparselookupvs.join . . . 8
DB2databasetips. . . 8
Oracledatabasetips . . . 9
TeradataDatabaseTips . . . 9
Chapter
3.
Improving
performance
.
.
. 11
Understandingaflow . . . 11
Scoredumps . . . 11
Examplescoredump . . . 11
Tipsfordebugging . . . 12
Performancemonitoring . . . 12 Jobmonitor . . . 12 Iostat . . . 13 Loadaverage. . . 13 Runtimeinformation . . . 13 Performancedata . . . 14
OS/RDBMSspecifictools. . . 15
Performanceanalysis . . . 15
Selectivelyrewritingtheflow . . . 16
Identifyingsuperfluousrepartitions . . . 16
Identifyingbufferingissues . . . 16
Resourceestimation . . . 17
Creatingamodel . . . 17
Makingaprojection . . . 19
Generatingaresourceestimationreport . . . . 20
Examplesofresourceestimation . . . 20
Resolvingbottlenecks . . . 23
Choosingthemostefficientoperators. . . 23
Partitionerinsertion,sortinsertion. . . 24
CombinableOperators. . . 24
DiskI/O . . . 24
Ensuringdataisevenlypartitioned . . . 25
Buffering . . . 25
Platformspecifictuning . . . 25
Tru64 . . . 25
HP-UX . . . 26
AIX . . . 26
Diskspacerequirementsofpost-release7.0.1data sets . . . 26
Chapter
4.
Link
buffering
.
.
.
.
.
.
. 27
Bufferingassumptions. . . 27
Controllingbuffering . . . 28
Bufferingpolicy . . . 28
Overridingdefaultbufferingbehavior . . . . 28
Operatorswithspecialbufferingrequirements. . 29
Chapter
5.
Specifying
your
own
parallel
stages
.
.
.
.
.
.
.
.
.
.
.
.
.
.
. 31
Definingcustomstages . . . 31
Definingcustomstages . . . 35
Definingbuildstages . . . 38
Buildstagemacros . . . 43
Informationalmacros . . . 43
Flow-controlmacros . . . 43
Inputandoutputmacros . . . 43
TransferMacros . . . 44
Howyourcodeisexecuted . . . 44
Inputsandoutputs . . . 44
ExampleBuildStage . . . 46
Definingwrappedstages . . . 46
Examplewrappedstage . . . 50
Chapter
6.
Environment
Variables
.
.
. 51
Buffering . . . 55 APT_BUFFER_FREE_RUN . . . 55 APT_BUFFER_MAXIMUM_MEMORY . . . . 56 APT_BUFFER_MAXIMUM_TIMEOUT . . . . 56 APT_BUFFER_DISK_WRITE_INCREMENT. . . 56 APT_BUFFERING_POLICY . . . 56 APT_SHARED_MEMORY_BUFFERS . . . 56
BuildingCustomStages . . . 56
DS_OPERATOR_BUILDOP_DIR . . . 56 OSH_BUILDOP_CODE . . . 57 OSH_BUILDOP_HEADER . . . 57 OSH_BUILDOP_OBJECT . . . 57 OSH_BUILDOP_XLC_BIN . . . 57 OSH_CBUILDOP_XLC_BIN . . . 57 Compiler . . . 57 APT_COMPILER . . . 57 APT_COMPILEOPT . . . 57 APT_LINKER . . . 58 APT_LINKOPT . . . 58 DB2Support . . . 58 APT_DB2INSTANCE_HOME . . . 58 APT_DB2READ_LOCK_TABLE. . . 58 APT_DBNAME . . . 58 APT_RDBMS_COMMIT_ROWS . . . 58 DB2DBDFT . . . 58 Debugging . . . 58 APT_DEBUG_OPERATOR . . . 59 APT_DEBUG_MODULE_NAMES . . . 59 APT_DEBUG_PARTITION . . . 59 APT_DEBUG_SIGNALS . . . 59 APT_DEBUG_STEP. . . 59 APT_DEBUG_SUBPROC . . . 60
APT_EXECUTION_MODE . . . 60 APT_PM_DBX . . . 60 APT_PM_GDB . . . 60 APT_PM_LADEBUG . . . 60 APT_PM_SHOW_PIDS . . . 60 APT_PM_XLDB . . . 61 APT_PM_XTERM . . . 61 APT_SHOW_LIBLOAD . . . 61 Decimalsupport. . . 61 APT_DECIMAL_INTERM_PRECISION . . . . 61 APT_DECIMAL_INTERM_SCALE. . . 61 APT_DECIMAL_INTERM_ROUND_MODE . . 61 DiskI/O . . . 61 APT_BUFFER_DISK_WRITE_INCREMENT. . . 61 APT_CONSISTENT_BUFFERIO_SIZE. . . 61 APT_EXPORT_FLUSH_COUNT . . . 62 APT_IO_MAP/APT_IO_NOMAPand APT_BUFFERIO_MAP/ APT_BUFFERIO_NOMAP . . . 62 APT_PHYSICAL_DATASET_BLOCK_SIZE . . . 62
GeneralJobAdministration . . . 62
APT_CHECKPOINT_DIR. . . 62 APT_CLOBBER_OUTPUT . . . 62 APT_CONFIG_FILE . . . 62 APT_DISABLE_COMBINATION . . . 62 APT_EXECUTION_MODE . . . 63 APT_ORCHHOME . . . 63 APT_STARTUP_SCRIPT . . . 63 APT_NO_STARTUP_SCRIPT . . . 63 APT_STARTUP_STATUS . . . 64 APT_THIN_SCORE. . . 64 JobMonitoring . . . 64 APT_MONITOR_SIZE. . . 64 APT_MONITOR_TIME . . . 64 APT_NO_JOBMON. . . 64 APT_PERFORMANCE_DATA . . . 64 Lookupsupport . . . 64 APT_LUTCREATE_MMAP . . . 64 APT_LUTCREATE_NO_MMAP . . . 65 Miscellaneous . . . 65 APT_COPY_TRANSFORM_OPERATOR. . . . 65 APT_DATE_CENTURY_BREAK_YEAR . . . . 65 APT_EBCDIC_VERSION . . . 65 APT_IMPEXP_ALLOW_ZERO_LENGTH_FIXED_NULL65 APT_IMPORT_REJECT_STRING_FIELD_OVERRUNS65 APT_INSERT_COPY_BEFORE_MODIFY. . . . 66 APT_OLD_BOUNDED_LENGTH . . . 66 APT_OPERATOR_REGISTRY_PATH . . . 66 APT_PM_NO_SHARED_MEMORY . . . 66 APT_PM_NO_NAMED_PIPES . . . 66 APT_PM_SOFT_KILL_WAIT . . . 66 APT_PM_STARTUP_CONCURRENCY . . . . 66 APT_RECORD_COUNTS. . . 66 APT_SAVE_SCORE. . . 67 APT_SHOW_COMPONENT_CALLS . . . 67 APT_STACK_TRACE . . . 67 APT_WRITE_DS_VERSION . . . 67 OSH_PRELOAD_LIBS. . . 67 Network . . . 67 APT_IO_MAXIMUM_OUTSTANDING . . . . 68 APT_IOMGR_CONNECT_ATTEMPTS . . . . 68 APT_PM_CONDUCTOR_HOSTNAME . . . . 68 APT_PM_NO_TCPIP . . . 68 APT_PM_NODE_TIMEOUT. . . 68 APT_PM_SHOWRSH . . . 68 APT_PM_USE_RSH_LOCALLY. . . 68 NLSSupport . . . 68 APT_COLLATION_SEQUENCE . . . 68 APT_COLLATION_STRENGTH . . . 69 APT_ENGLISH_MESSAGES. . . 69 APT_IMPEXP_CHARSET. . . 69 APT_INPUT_CHARSET . . . 69 APT_OS_CHARSET . . . 69 APT_OUTPUT_CHARSET . . . 69 APT_STRING_CHARSET. . . 69 OracleSupport . . . 69 APT_ORACLE_LOAD_DELIMITED . . . 69 APT_ORACLE_LOAD_OPTIONS . . . 70 APT_ORACLE_NO_OPS . . . 70 APT_ORACLE_PRESERVE_BLANKS. . . 70 APT_ORA_IGNORE_CONFIG_FILE_PARALLELISM70 APT_ORA_WRITE_FILES . . . 70 APT_ORAUPSERT_COMMIT_ROW_INTERVAL APT_ORAUPSERT_COMMIT_TIME_INTERVAL . 70 Partitioning . . . 71 APT_NO_PART_INSERTION . . . 71 APT_PARTITION_COUNT . . . 71 APT_PARTITION_NUMBER. . . 71
Readingandwritingfiles. . . 71
APT_DELIMITED_READ_SIZE. . . 71 APT_FILE_IMPORT_BUFFER_SIZE . . . 71 APT_FILE_EXPORT_BUFFER_SIZE . . . 72 APT_IMPORT_PATTERN_USES_FILESET . . . 72 APT_MAX_DELIMITED_READ_SIZE . . . . 72 APT_PREVIOUS_FINAL_DELIMITER_COMPATIBLE72 APT_STRING_PADCHAR . . . 72 Reporting . . . 72 APT_DUMP_SCORE . . . 72 APT_ERROR_CONFIGURATION . . . 72 APT_MSG_FILELINE . . . 74 APT_PM_PLAYER_MEMORY . . . 74 APT_PM_PLAYER_TIMING. . . 74 APT_RECORD_COUNTS. . . 74 OSH_DUMP . . . 75 OSH_ECHO . . . 75 OSH_EXPLAIN . . . 75 OSH_PRINT_SCHEMAS . . . 75 SASSupport . . . 75 APT_HASH_TO_SASHASH . . . 75 APT_NO_SASOUT_INSERT . . . 75 APT_NO_SAS_TRANSFORMS . . . 75 APT_SAS_ACCEPT_ERROR. . . 75 APT_SAS_CHARSET . . . 76 APT_SAS_CHARSET_ABORT . . . 76 APT_SAS_COMMAND . . . 76 APT_SASINT_COMMAND . . . 76 APT_SAS_DEBUG . . . 76 APT_SAS_DEBUG_IO . . . 76 APT_SAS_DEBUG_LEVEL . . . 76 APT_SAS_DEBUG_VERBOSE . . . 76 APT_SAS_NO_PSDS_USTRING . . . 77 APT_SAS_S_ARGUMENT . . . 77
APT_SAS_SCHEMASOURCE_DUMP. . . 77 APT_SAS_SHOW_INFO . . . 77 APT_SAS_TRUNCATION . . . 77 Sorting . . . 77 APT_NO_SORT_INSERTION . . . 77 APT_SORT_INSERTION_CHECK_ONLY . . . 77 TeradataSupport . . . 78 APT_TERA_64K_BUFFERS . . . 78 APT_TERA_NO_ERR_CLEANUP . . . 78 APT_TERA_NO_SQL_CONVERSION . . . . 78 APT_TERA_NO_PERM_CHECKS . . . 78 APT_TERA_SYNC_DATABASE. . . 78 APT_TERA_SYNC_PASSWORD . . . 78 APT_TERA_SYNC_USER. . . 78 TransportBlocks. . . 78 APT_AUTO_TRANSPORT_BLOCK_SIZE . . . 79 APT_LATENCY_COEFFICIENT . . . 79 APT_DEFAULT_TRANSPORT_BLOCK_SIZE . . 79 APT_MAX_TRANSPORT_BLOCK_SIZE/ APT_MIN_TRANSPORT_BLOCK_SIZE . . . . 79
Guidetosettingenvironmentvariables . . . 79
Environmentvariablesettingsforalljobs . . . 79
Optionalenvironmentvariablesettings . . . . 79
Chapter
7.
Operators
.
.
.
.
.
.
.
.
. 81
StagetoOperatorMapping . . . 81
Changeapplyoperator. . . 84
Dataflowdiagram . . . 84
Properties . . . 85
Schemas . . . 86
Syntaxandoptions. . . 86
Example . . . 89
Changecaptureoperator . . . 91
Dataflowdiagram . . . 91
Keyandvaluefields . . . 91
Syntaxandoptions. . . 92
Example1:alloutputresults . . . 95
Example2:droppingoutputresults . . . 96
Compareoperator . . . 97
Dataflowdiagram . . . 97
Properties . . . 97
Syntaxandoptions. . . 99
Example1:runningthecompareoperatorin parallel . . . 101
Example2:runningthecompareoperator sequentially . . . 101
Copyoperator . . . 101
Dataflowdiagram . . . 102
Properties . . . 102
Syntaxandoptions . . . 102
PreventingWebSphereDataStagefrom removingacopyoperator . . . 103
Example1:Thecopyoperator. . . 104
Example2:runningthecopyoperator sequentially . . . 104
Diffoperator . . . 105
Dataflowdiagram . . . 105
Properties . . . 105
Transferbehavior . . . 106
Syntaxandoptions . . . 107
Example1:generalexample . . . 110
Example2:DroppingOutputResults . . . . 111
Encodeoperator . . . 112
Dataflowdiagram . . . 112
Properties . . . 112
Syntaxandoptions . . . 113
EncodingWebSphereDataStagedatasets . . . 113
Example . . . 114
Filteroperator . . . 114
Dataflowdiagram . . . 114
Properties . . . 115
Syntaxandoptions . . . 115
Jobmonitoringinformation. . . 116
Expressions . . . 117
Inputdatatypes . . . 118
Example1:comparingtwofields. . . 119
Example2:testingforanull . . . 120
Example3:evaluatinginputrecords. . . 120
Jobscenario:mailinglistforawineauction . . 120
Funneloperators . . . 122
Dataflowdiagram . . . 122
Properties . . . 122
Funneloperator . . . 123
Sortfunneloperators. . . 123
Generatoroperator . . . 126
Dataflowdiagram . . . 127
Properties . . . 127
Syntaxandoptions . . . 127
Usingthegeneratoroperator . . . 128
Example1:usingthegeneratoroperator . . . 129
Example2:executingtheoperatorinparallel 130 Example3:usinggeneratorwithaninputdata set . . . 130
Definingtheschemafortheoperator . . . . 131
Timestampfields . . . 136
Headoperator . . . 137
Dataflowdiagram . . . 138
Properties . . . 138
Syntaxandoptions . . . 138
Example1:headoperatordefaultbehavior . . 139
Example2:extractingrecordsfromalargedata set . . . 139
Example3:locatingasinglerecord . . . 140
Lookupoperator . . . 140
Dataflowdiagrams . . . 140
Properties . . . 141
Syntaxandoptions . . . 142
Partitioning . . . 145
Create-onlymode . . . 145
Example1:singlelookuptablerecord . . . . 146
Example2:multiplelookuptablerecord . . . 146
Example3:interestratelookupexample . . . 147
Example4:handlingduplicatefieldsexample 148 Mergeoperator. . . 149
Dataflowdiagram . . . 150
Properties . . . 150
Syntaxandoptions . . . 150
Mergingrecords . . . 152
Understandingthemergeoperator . . . 154
Example1:updatingnationaldatawithstate data . . . 157
Jobscenario:galacticindustries . . . 161
Missingrecords . . . 163
Modifyoperator . . . 165
Dataflowdiagram . . . 165
Properties . . . 166
Syntaxandoptions . . . 166
Transferbehavior . . . 167
Avoidingcontiguousmodifyoperators . . . . 167
Performingconversions . . . 167
Allowedconversions . . . 198
pcompressoperator . . . 200
Dataflowdiagram . . . 200
Properties . . . 200
Syntaxandoptions . . . 200
Compresseddatasets . . . 201
Example . . . 202
Peekoperator . . . 203
Dataflowdiagram . . . 203
Properties . . . 203
Syntaxandoptions . . . 204
Usingtheoperator . . . 205
PFTPoperator . . . 206
Dataflowdiagram . . . 206
Operatorproperties . . . 207
Syntaxandoptions . . . 207
Restartability . . . 212
Remdupoperator . . . 213
Dataflowdiagram . . . 213
Properties . . . 213
Syntaxandoptions . . . 214
Removingduplicaterecords . . . 215
Usingoptionstotheoperator . . . 215
Usingtheoperator . . . 217
Example1:usingremdup . . . 217
Example2:usingthe-lastoption. . . 217
Example3:case-insensitivestringmatching . . 217
Example4:usingremdupwithtwokeys . . . 218
Sampleoperator . . . 218
Dataflowdiagram . . . 218
Properties . . . 218
Syntaxandoptions . . . 219
Examplesamplingofadataset . . . 220
Sequenceoperator. . . 221
Dataflowdiagram . . . 221
Properties . . . 221
Syntaxandoptions . . . 222
ExampleofUsingthesequenceOperator . . . 222
Switchoperator . . . 223
Dataflowdiagram . . . 223
Properties . . . 224
Syntaxandoptions . . . 224
Jobmonitoringinformation. . . 228
Examplemetadataandsummarymessages . . 228
Customizingjobmonitormessages . . . 229
Tailoperator. . . 229
Dataflowdiagram . . . 229
Properties . . . 229
Syntaxandoptions . . . 230
Example1:tailoperatordefaultbehavior . . . 230
Example2:tailoperatorwithbothoptions . . 230
Transformoperator . . . 231
Runningyourjobonanon-NFSMPP . . . . 231
Dataflowdiagram . . . 231
Properties . . . 231
Syntaxandoptions . . . 232
Transferbehavior . . . 240
Thetransformationlanguage . . . 241
ThetransformationlanguageversusC . . . . 277
Usingthetransformoperator . . . 278
Example1:student-scoredistribution . . . . 278
Example2:student-scoredistributionwitha lettergradeaddedtoexample. . . 281
Example3:student-scoredistributionwitha classfieldaddedtoexample . . . 284
Example4.studentrecorddistributionwithnull scorevaluesandareject. . . 287
Example5.studentrecorddistributionwithnull scorevalueshandled . . . 289
Example6.studentrecorddistributionwith vectormanipulation . . . 292
Example7:studentrecorddistributionusing sub-record . . . 296
Example8:externalCfunctioncalls. . . 299
Writerangemapoperator. . . 301
Dataflowdiagram . . . 301
Properties . . . 301
Syntaxandoptions . . . 302
Usingthewriterangeoperator. . . 303
Chapter
8.
The
import/export
library
305
Recordschemas . . . 306Example1:importschema . . . 306
Example2:exportschema . . . 306
Fieldandrecordproperties. . . 307
Completeandpartialschemas. . . 308
Implicitimportandexport . . . 312
Errorhandlingduringimport/export . . . . 315
ASCIIandEBCDICconversiontables . . . . 316
Importoperator . . . 322
Dataflowdiagram . . . 322
Properties . . . 322
Syntaxandoptions . . . 323
Howtoimportdata . . . 330
Example1:importingfromasingledatafile 332 Example2:importingfrommultipledatafiles 333 Exportoperator . . . 334
Dataflowdiagram . . . 334
Properties . . . 335
Syntaxandoptions . . . 336
Howtoexportdata . . . 340
Example1:datasetexporttoasinglefile . . . 344
Example2:DataSetExporttoMultiplefiles . . 345
Import/exportproperties . . . 346
Settingproperties . . . 346
Properties . . . 348
Properties:referencelisting. . . 355
Chapter
9.
The
partitioning
library
.
. 401
Theentirepartitioner. . . 401
Usingthepartitioner . . . 402
Properties . . . 402
Syntax. . . 403
Thehashpartitioner . . . 403
Specifyinghashkeys . . . 404
Example . . . 404
Usingthepartitioner . . . 405
Dataflowdiagram . . . 405
Properties . . . 405
Syntaxandoption. . . 406
Themoduluspartitioner. . . 407
Dataflowdiagram . . . 407
Properties . . . 408
Syntaxandoption. . . 408
Example . . . 408
Therandompartitioner . . . 409
Usingthepartitioner . . . 410
Dataflowdiagram . . . 410
Properties . . . 410
Syntax. . . 411
TherangePartitioner. . . 411
Considerationswhenusingrangepartitioning 412 Therangepartitioningalgorithm. . . 412
Specifyingpartitioningkeys . . . 412
Creatingarangemap . . . 413
Example:configuringandusingrange partitioner . . . 415
Usingthepartitioner . . . 415
Dataflowdiagram . . . 416
Properties . . . 416
Syntaxandoptions . . . 416
Writerangemapoperator. . . 418
Dataflowdiagram . . . 418
Properties . . . 419
Syntaxandoptions . . . 419
Usingthewriterangeoperator. . . 420
Themakerangemaputility . . . 421
Syntaxandoptions . . . 421
Usingthemakerangemaputility . . . 423
Theroundrobinpartitioner . . . 423
Usingthepartitioner . . . 424
Dataflowdiagram . . . 424
Properties . . . 424
Syntax. . . 424
Thesamepartitioner . . . 425
Usingthepartitioner . . . 425
Dataflowdiagram . . . 425
Properties . . . 426
Syntax. . . 426
Chapter
10.
The
collection
library
.
.
. 427
Theorderedcollector. . . 427
Orderedcollecting. . . 427
Properties . . . 428
Syntax. . . 428
Theroundrobincollector . . . 428
Roundrobincollecting . . . 429
Properties . . . 429
Syntax. . . 430
Thesortmergecollector . . . 430
Understandingthesortmergecollector . . . . 430
Dataflowdiagram . . . 430
Specifyingcollectingkeys . . . 431
Properties . . . 433
Syntaxandoptions . . . 433
Chapter
11.
The
restructure
library
435
Theaggtorecoperator . . . 435Outputformats. . . 435
Properties . . . 436
Syntaxandoptions . . . 436
Example1:theaggtorecoperatorwithoutthe toplevelkeysoption . . . 438
Example2:theaggtorecoperatorwithmultiple keyoptions . . . 438
Example3:Theaggtorecoperatorwiththe toplevelkeysoption . . . 439
Thefield_exportoperator . . . 440
Dataflowdiagram . . . 440
Properties . . . 441
Syntaxandoptions . . . 441
Example . . . 442
Thefield_importoperator . . . 443
Dataflowdiagram . . . 443
Properties . . . 444
Syntaxandoptions . . . 444
Example . . . 445
Themakesubrecoperator . . . 447
Dataflowdiagram . . . 447
Properties . . . 448
Transferbehavior . . . 448
Subrecordlength . . . 448
Syntaxandoptions . . . 449
Themakevectoperator . . . 450
Dataflowdiagram . . . 450
Properties . . . 450
TransferBehavior . . . 451
Non-consecutivefields . . . 451
Syntaxandoption. . . 451
Example1:Themakevectoperator . . . 452
Example2:Themakevectoperatorwithmissing inputfields . . . 452
ThepromotesubrecOperator . . . 453
DataFlowDiagram . . . 453
Properties . . . 454
SyntaxandOption . . . 454
Example . . . 454
ThesplitsubrecOperator . . . 455
DataFlowDiagram . . . 455
Properties . . . 455
SyntaxandOption . . . 456
Example . . . 456
Thesplitvectoperator . . . 457
Dataflowdiagram . . . 457
Properties . . . 458
Syntaxandoption. . . 458
Example . . . 458
Thetagbatchoperator . . . 459
Taggedfieldsandoperatorlimitations . . . . 459
Operatoractionandtransferbehavior . . . . 460
Dataflowdiagram . . . 460
Properties . . . 461
Inputdatasetrequirements . . . 462
Syntaxandoptions . . . 462
Example1:Thetagbatchoperator,simple flatteningoftagcases . . . 464
Example2:Thetagbatchoperator,missingand duplicatecases . . . 465
Example3:Thetagbatchoperatorwithmultiple keys . . . 466
Thetagswitchoperator . . . 467
Dataflowdiagram . . . 467
Properties . . . 468
Inputandoutputinterfaceschemas . . . 468
Thecaseoption. . . 468
Usingtheoperator . . . 468
Syntaxandoptions . . . 469
Example1:Thetagswitchoperator,default behavior . . . 470
Example2:thetagswitchoperator,onecase chosen. . . 471
Chapter
12.
The
sorting
library
.
.
.
. 475
Thetsortoperator . . . 475
Configuringthetsortoperator. . . 477
Usingasorteddataset . . . 477
Specifyingsortingkeys . . . 478
Dataflowdiagram . . . 479
Properties . . . 479
Syntaxandoptions . . . 480
Example:usingasequentialtsortoperator . . 483
Example:usingaparalleltsortoperator . . . 484
Performingatotalsort . . . 485
Example:performingatotalsort . . . 487
Thepsortoperator . . . 489
Performingapartitionsort . . . 489
Configuringthepartitionsortoperator . . . . 491
Usingasorteddataset . . . 491
DataFlowDiagram . . . 493
Properties . . . 493
Syntaxandoptions . . . 494
Example:usingasequentialpartitionsort operator . . . 496
Example:usingaparallelpartitionsortoperator 497 Performingatotalsort . . . 498
Rangepartitioning. . . 500
Example:PerformingaTotalSort. . . 501
Chapter
13.
The
join
library
.
.
.
.
. 505
Dataflowdiagrams . . . 505
Properties . . . 506
Transferbehavior . . . 507
Inputdatasetrequirements . . . 507
Memoryuse. . . 507
Jobmonitorreporting . . . 507
Comparisonwithotheroperators. . . 507
Inputdatausedintheexamples . . . 508
innerjoinoperator . . . 508
Syntaxandoptions . . . 509
Example . . . 509
leftouterjoinoperator. . . 510
Syntaxandoptions . . . 510
Example . . . 511
rightouterjoinoperator . . . 512
Syntaxandoptions . . . 512
Example . . . 513
fullouterjoinoperator. . . 514
Syntaxandoptions . . . 514
Example . . . 515
Chapter
14.
The
ODBC
interface
library
.
.
.
.
.
.
.
.
.
.
.
.
.
. 517
AccessingODBCfromWebSphereDataStage. . . 517
NationalLanguageSupport . . . 517
ICUcharactersetoptions . . . 518
MappingbetweenODBCandICUcharactersets 518 Theodbcreadoperator . . . 518
Dataflowdiagram . . . 519
Properties . . . 519
Syntaxandoptions . . . 519
Operatoraction. . . 521
Columnnameconversion . . . 521
Datatypeconversion. . . 522
Externaldatasourcerecordsize . . . 522
Readingexternaldatasourcetables . . . 522
Joinoperations . . . 523
Example1:readinganexternaldatasourcetable andmodifyingafieldname . . . 523
Theodbcwriteoperator . . . 524
Writingtoamultibytedatabase . . . 525
Dataflowdiagram . . . 525
Properties . . . 525
Operatoraction. . . 525
Wheretheodbcwriteoperatorruns . . . 526
Syntaxandoptions . . . 528
Example1:writingtoanexistingexternaldata sourcetable . . . 530
Example2:creatinganexternaldatasourcetable 531 Example3:writingtoanexternaldatasource tableusingthemodifyoperator . . . 532
Otherfeatures . . . 533
Theodbcupsertoperator . . . 533
Dataflowdiagram . . . 533
Properties . . . 534
Operatoraction. . . 534
Syntaxandoptions . . . 535
Example . . . 536
Theodbclookupoperator . . . 537
Dataflowdiagram . . . 538
Properties . . . 539
Syntaxandoptions . . . 539
Example . . . 541
Chapter
15.
The
SAS
interface
library
543
UsingWebSphereDataStagetorunSAScode . . 543WritingSASprograms . . . 543
UsingSASonsequentialandparallelsystems 544 PipelineparallelismandSAS . . . 545
ConfiguringyoursystemtousetheSAS interfaceoperators. . . 545
RepresentingSASandnon-SASDatain
DataStage . . . 547
GettinginputfromaSASdataset . . . 548
GettinginputfromaWebSphereDataStagedata setoraSASdataset . . . 549
Convertingbetweendatasettypes . . . 550
ConvertingSASdatatoWebSphereDataStagedata 551 aWebSphereDataStageexample . . . 552
ParallelizingSASsteps . . . 553
ExecutingPROCstepsinparallel. . . 559
SomepointstoconsiderinparallelizingSAS code . . . 563
UsingSASwithEuropeanlanguages . . . 564
UsingSAStodoETL. . . 565
TheSASinterfaceoperators . . . 566
SpecifyingacharactersetandSASmode . . . 566
ParallelSASdatasetsandSASInternational 568 Specifyinganoutputschema . . . 569
Controllingustringtruncation. . . 569
Generatingaproccontentsreport . . . 570
WebSphereDataStage-insertedpartitionandsort components . . . 570
Longnamesupport . . . 570
Environmentvariables . . . 571
Thesasinoperator. . . 572
Dataflowdiagram . . . 573
Properties . . . 573
Syntaxandoptions . . . 573
Thesasoperator . . . 577
Dataflowdiagram . . . 577
Properties . . . 577
Syntaxandoptions . . . 578
Thesasoutoperator . . . 582
Dataflowdiagram . . . 583
Properties . . . 583
Syntaxandoptions . . . 583
Thesascontentsoperator . . . 585
Dataflowdiagram . . . 585
Properties . . . 586
Syntaxandoptions . . . 586
Examplereports . . . 587
Chapter
16.
The
Oracle
interface
library
.
.
.
.
.
.
.
.
.
.
.
.
.
. 589
AccessingOraclefromWebSphereDataStage. . . 589
Changinglibrarypaths . . . 589
Preservingblanksinfields . . . 589
Handling#and$charactersinOraclecolumn names. . . 590
NationalLanguageSupport . . . 590
ICUcharactersetoptions . . . 590
MappingbetweenICUandOraclecharactersets 591 Theorareadoperator. . . 591
Dataflowdiagram . . . 591
Properties . . . 592
Operatoraction. . . 592
Wheretheorareadoperatorruns . . . 593
Columnnameconversion . . . 593
Datatypeconversion. . . 594
Oraclerecordsize . . . 594
Targetingthereadoperation . . . 594
Joinoperations . . . 595
Syntaxandoptions . . . 595
Example1:readinganOracletableand modifyingafieldname . . . 598
Example2:readingfromanOracletablein parallelwiththequeryoption. . . 599
Theorawriteoperator . . . 599
Writingtoamultibytedatabase . . . 599
Dataflowdiagram . . . 600
Properties . . . 600
Operatoraction. . . 600
Datatypeconversion. . . 602
Writemodes. . . 602
Matchedandunmatchedfields . . . 603
Syntaxandoptions . . . 604
Example1:writingtoanexistingOracletable 610 Example2:creatinganOracletable . . . 611
Example3:writingtoanOracletableusingthe modifyoperator . . . 611
Theoraupsertoperator . . . 612
Dataflowdiagram . . . 613
Properties . . . 613
OperatorAction . . . 613
Associatedenvironmentvariables . . . 614
Syntaxandoptions . . . 615
Example . . . 617
Theoralookupoperator . . . 617
Dataflowdiagram . . . 619
Properties . . . 619
Syntaxandoptions . . . 619
Example . . . 621
Chapter
17.
The
DB2
interface
library
623
ConfiguringWebSphereDataStageaccess . . . . 623EstablishingaremoteconnectiontoaDB2 server . . . 624
Handling#and$charactersinDB2column names. . . 624
Usingthe-padcharoption . . . 625
RunningmultipleDB2interfaceoperatorsina singlestep . . . 625
NationalLanguageSupport . . . 626
Specifyingcharactersettings . . . 626
Preventingcharacter-setconversion . . . 626
Thedb2readoperator . . . 627
Dataflowdiagram . . . 627
Properties . . . 627
Operatoraction. . . 627
ConversionofaDB2resultsettoaWebSphere DataStagedataset. . . 628
Targetingthereadoperation . . . 629
Specifyingopenandclosecommands . . . . 630
Syntaxandoptions . . . 631
Example1:readingaDB2tablewiththetable option. . . 633
Example2:readingaDB2tablesequentially withthe-queryoption . . . 634
Example3:readingatableinparallelwiththe -queryoption . . . 634
Thedb2writeanddb2loadoperators . . . 635
Properties . . . 635
Actionsofthewriteoperators. . . 636
HowWebSphereDataStagewritesthetable:the defaultSQLINSERTstatement . . . 636
FieldconventionsinwriteoperationstoDB2 637 Datatypeconversion. . . 637
Writemodes. . . 638
Matchedandunmatchedfields . . . 639
Syntaxandoptions . . . 639
db2loadspecialcharacteristics. . . 646
Example1:AppendingDatatoanExistingDB2 Table . . . 647
Example2:writingdatatoaDB2tablein truncatemode . . . 648
Example3:handlingunmatchedWebSphere DataStagefieldsinaDB2writeoperation . . . 649
Example4:writingtoaDB2tablecontainingan unmatchedcolumn . . . 650
Thedb2upsertoperator . . . 651
Partitioningfordb2upsert . . . 651
Dataflowdiagram . . . 651
Properties . . . 651
Operatoraction. . . 652
Syntaxandoptions . . . 653
Thedb2partoperator. . . 655
Syntaxandoptions . . . 656
Example . . . 657
Thedb2lookupoperator. . . 658
Dataflowdiagram . . . 659
Properties . . . 659
Syntaxandoptions . . . 659
Example . . . 661
ConsiderationsforreadingandwritingDB2tables 662 Datatranslationanomalies . . . 662
Usinganodemap. . . 662
Chapter
18.
The
Informix
interface
library
.
.
.
.
.
.
.
.
.
.
.
.
.
. 665
ConfiguringtheINFORMIXuserenvironment . . 665
ReadoperatorsforInformix . . . 665
Dataflowdiagram . . . 666
Readoperatoraction . . . 666
Executionmode . . . 667
Columnnameconversion . . . 667
Datatypeconversion. . . 667
Example1:ReadingalldatafromanInformix table . . . 668
WriteoperatorsforInformix . . . 669
Dataflowdiagram . . . 670
Operatoraction. . . 670
Executionmode . . . 670
Columnnameconversion . . . 670
Datatypeconversion. . . 671
Writemodes. . . 671
MatchingWebSphereDataStagefieldswith columnsofInformixtable . . . 672
Limitations . . . 672
Example2:Appendingdatatoanexisting Informixtable . . . 673
Example3:writingdatatoanINFORMIXtable Example4:HandlingunmatchedWebSphere DataStagefieldsinanInformixwriteoperation . 675
Example5:WritingtoanINFORMIXtablewith anunmatchedcolumn . . . 676
hplreadoperator . . . 677
Specialoperatorfeatures . . . 677
Establishingaremoteconnectiontothehplread operator . . . 677
Dataflowdiagram . . . 678
Propertiesofthehplreadoperator . . . 678
Syntaxandoptions . . . 679
Example . . . 680
hplwriteoperatorforInformix. . . 680
Specialoperatorfeatures . . . 680
Dataflowdiagram . . . 681
Propertiesofthehplwriteoperator . . . 681
Syntaxandoptions . . . 681
Examples. . . 683
infxreadoperator . . . 683
DataFlowDiagram . . . 683
Properties . . . 684
SyntaxandOptions . . . 684
Example . . . 685
infxwriteoperator . . . 686
Dataflowdiagram . . . 686
Propertiesoftheinfxwriteoperator . . . 686
Syntaxandoptions . . . 686
Examples. . . 688
xpsreadoperator . . . 688
Dataflowdiagram . . . 689
Propertiesofthexpsreadoperator . . . 689
Syntaxandoptions . . . 689
Example . . . 691
xpswriteoperator . . . 691
Dataflowdiagram . . . 691
Propertiesofthexpswriteoperator . . . 692
Syntaxandoptions . . . 692
Examples. . . 693
Chapter
19.
The
Teradata
interface
library
.
.
.
.
.
.
.
.
.
.
.
.
.
. 695
Nationallanguagesupport . . . 695
Teradatadatabasecharactersets . . . 695
Japaneselanguagesupport . . . 695
SpecifyingaWebSphereDataStageustring characterset. . . 696
Terareadoperator . . . 696
Dataflowdiagram . . . 697
Properties . . . 697
Specifyingthequery . . . 697
Columnnameanddatatypeconversion . . . 698
terareadrestrictions . . . 699
SyntaxandOptions . . . 699
TerawriteOperator . . . 701
Dataflowdiagram . . . 702
Properties . . . 702
ColumnNameandDataTypeConversion. . . 702
Correctingloaderrors . . . 703
Writemodes. . . 704
Restrictions . . . 705
SyntaxandOptions . . . 705
Chapter
20.
The
Sybase
interface
library
.
.
.
.
.
.
.
.
.
.
.
.
.
. 709
AccessingSybasefromWebSphereDataStage. . . 709
Sybaseclientconfiguration . . . 709
NationalLanguageSupport . . . 710
TheasesybasereadeandsybasereadeOperators . . 710
Dataflowdiagram . . . 710
Properties . . . 710
OperatorAction . . . 711
WhereasesybasereadeandsybasereadeRun . . 711
Columnnameconversion . . . 711
Datatypeconversion. . . 711
Targetingthereadoperation . . . 712
JoinOperations. . . 713
SyntaxandOptions . . . 713
Example1:ReadingaSybaseTableand ModifyingaFieldName . . . 715
TheasesybasewriteandsybasewriteOperators . . 716
WritingtoaMultibyteDatabase . . . 716
Dataflowdiagram . . . 716
Properties . . . 717
OperatorAction . . . 717
WhereasesybasewriteandsybasewriteRun . . 717
DataconventionsonwriteoperationstoSybase 718 Datatypeconversion. . . 718
WriteModes . . . 719
Matchedandunmatchedfields . . . 719
SyntaxandOptions . . . 720
Example1:WritingtoanExistingSybaseTable 723 Example2:CreatingaSybaseTable . . . 724
Example3:WritingtoaSybaseTableUsingthe modifyOperator . . . 725
TheasesybaseupsertandsybaseupsertOperators 726 Dataflowdiagram . . . 727
Properties . . . 727
OperatorAction . . . 727
SyntaxandOptions . . . 728
Example . . . 730
TheasesybaselookupandsybaselookupOperators 731 Dataflowdiagram . . . 732
Properties . . . 732
SyntaxandOptions . . . 732
Example . . . 737
Chapter
21.
The
SQL
Server
interface
library
.
.
.
.
.
.
.
.
.
.
.
.
.
. 739
AccessingSQLServerfromWebSphereDataStage 739 UNIX . . . 739
Windows. . . 739
NationalLanguageSupport . . . 739
Thesqlsrvrreadoperator . . . 739
Dataflowdiagram . . . 740
Properties . . . 740
Operatoraction. . . 740
Wherethesqlsrvrreadoperatorruns . . . . 741
Columnnameconversion . . . 741
Datatypeconversion. . . 741
SQLServerrecordsize . . . 742
Targetingthereadoperation . . . 742
Joinoperations . . . 742
Syntaxandoptions . . . 743
Example1:ReadingaSQLServertableand modifyingafieldname . . . 744
Thesqlsrvrwriteoperator . . . 745
Writingtoamultibytedatabase . . . 745
Dataflowdiagram . . . 746
Properties . . . 746
Operatoraction. . . 746
Wherethesqlsrvrwriteoperatorruns . . . . 746
DataconventionsonwriteoperationstoSQL Server. . . 747
Writemodes. . . 748
Syntaxandoptions . . . 749
Example1:WritingtoanexistingSQLServer table . . . 751
Example2:CreatingaSQLServertable . . . 752
Example3:WritingtoaSQLServertableusing themodifyoperator . . . 753
Thesqlsrvrupsertoperator . . . 754
Dataflowdiagram . . . 754
Properties . . . 754
Operatoraction. . . 754
Syntaxandoptions . . . 755
Example . . . 757
Thesqlsrvrlookupoperator. . . 758
Dataflowdiagram . . . 759
Properties . . . 759
Syntaxandoptions . . . 760
Example . . . 762
Chapter
22.
The
iWay
interface
library
763
AccessingiWayfromWebSphereDataStage . . . 763NationalLanguageSupport . . . 763
Theiwayreadoperator . . . 763
Dataflowdiagram . . . 764
Properties . . . 764
Operatoraction. . . 764
Datatypeconversion. . . 764
Syntaxandoptions . . . 765
Example:ReadingatableviaiWay . . . 767
Theiwaylookupoperator . . . 767
Dataflowdiagram . . . 769
Properties . . . 769
Syntaxandoptions . . . 769
Example:lookingupatableviaiWay . . . . 772
Chapter
23.
The
Netezza
Interface
Library
.
.
.
.
.
.
.
.
.
.
.
.
.
. 773
Netezzawriteoperator . . . 773
Netezzadataloadmethods. . . 773
nzloadmethod . . . 773
Externaltablemethod . . . 773
Writemodes. . . 773
Limitationsofwriteoperation. . . 774
Charactersetlimitations. . . 774
Badinputrecords . . . 774
Syntaxfornzwriteoperation . . . 775
Chapter
24.
The
Classic
Federation
interface
library
.
.
.
.
.
.
.
.
.
. 777
AccessingthefederateddatabasefromWebSphere DataStage . . . 777
Nationallanguagesupport . . . 777
Internationalcomponentsforunicodecharacter setparameter . . . 778
MappingbetweenfederatedandICUcharacter sets. . . 778
Readoperationswithclassicfedread. . . 778
classicfedreadproperties. . . 779
classicfedreadoperatorsyntax. . . 779
Columnnameconversion . . . 781
Datatypeconversion. . . 781
Readingexternaldatasourcetables . . . 782
Writeoperationswithclassicfedwrite . . . 782
Matchedandunmatchedfields . . . 783
classicfedwritesyntax . . . 783
Writingtomultibytedatabases . . . 786
Insertandupdateoperationswithclassicfedupsert 787 classicfedupsertProperties . . . 788
classicfedupsertsyntax . . . 788
Exampleofafederatedtablewhena classicfedupsertoperationisperformed . . . 789
LookupOperationswithclassicfedlookup . . . . 790
classicfedlookupproperties. . . 791
classicfedlookupsyntax . . . 792
Exampleofaclassicfedlookupoperation . . . 793
Chapter
25.
WebSphere
DataStage
development
kit
(job
control
interfaces)
.
.
.
.
.
.
.
.
.
.
.
.
. 795
WebSphereDataStagedevelopmentkit . . . 795
Thedsapi.hheaderfile . . . 795
Datastructures,resultdata,andthreads . . . 795
WritingWebSphereDataStageAPIprograms 796 BuildingaWebSphereDataStageAPI application . . . 796 Redistributingapplications . . . 797 APIfunctions . . . 797 DSAddEnvVar . . . 798 DSAddProject . . . 799 DSCloseJob . . . 800 DSCloseProject . . . 800 DSDeleteEnvVar . . . 801 DSDeleteProject . . . 801 DSFindFirstLogEntry. . . 802 DSFindNextLogEntry. . . 803 DSGetProjectList . . . 804 DSGetJobInfo . . . 805 DSGetLastError. . . 807 DSGetLastErrorMsg . . . 807 DSGetLinkInfo . . . 808 DSGetLogEntry. . . 809 DSGetNewestLogId . . . 810 DSGetParamInfo . . . 811 DSGetProjectInfo . . . 812 DSGetProjectList . . . 813 DSGetReposInfo . . . 813 DSGetReposUsage. . . 814 DSGetStageInfo. . . 815 DSGetProjectList . . . 817 DSListEnvVars . . . 818 DSListProjectProperties . . . 819 DSLockJob . . . 820 DSLogEvent. . . 820 DSMakeJobReport. . . 821 DSOpenJob . . . 822 DSOpenProject . . . 823 DSRunJob . . . 823 DSSetEnvVar . . . 824 DSSetGenerateOpMetaData. . . 825 DSSetJobLimit . . . 826 DSSetParam. . . 827 DSSetProjectProperty. . . 828 DSSetServerParams . . . 829 DSStopJob . . . 830 DSUnlockJob . . . 830 DSWaitForJob . . . 831 DataStructures. . . 831 DSCUSTINFO . . . 832 DSJOBINFO. . . 832 DSLINKINFO . . . 835 DSLOGDETAIL . . . 836 DSLOGEVENT. . . 837 DSPARAM . . . 837 DSPARAMINFO . . . 839 DSPROJECTINFO . . . 840 DSREPOSINFO. . . 841 DSREPOSUSAGE . . . 842 DSSTAGEINFO. . . 842 DSLINKINFO . . . 844 Errorcodes . . . 845
WebSphereDataStageBASICInterface . . . 850
DSAttachJob. . . 851 DSCheckRoutine . . . 852 DSDetachJob . . . 852 DSExecute . . . 853 DSGetCustInfo . . . 853 DSIPCPageProps . . . 854 DSGetJobInfo . . . 854 DSGetJobMetaBag. . . 857 DSGetLinkInfo . . . 857 DSGetLinkMetaData . . . 858 DSGetLogEntry. . . 859 DSGetLogSummary . . . 860 DSGetNewestLogId . . . 861 DSGetParamInfo . . . 861 DSGetProjectInfo . . . 863 DSGetStageInfo. . . 863 DSGetStageLinks . . . 865 DSGetStagesOfType . . . 866 DSGetStagesTypes. . . 866 DSGetVarInfo . . . 867 DSLogEvent. . . 867 DSLogFatal . . . 868 DSLogInfo . . . 868 DSLogToController . . . 869 DSLogWarn. . . 869
DSMakeJobReport. . . 869 DSMakeMsg. . . 870 DSPrepareJob . . . 871 DSRunJob . . . 871 DSSendMail. . . 872 DSSetGenerateOpMetaData. . . 873 DSSetJobLimit . . . 873 DSSetParam. . . 874 DSSetUserStatus . . . 874 DSStopJob . . . 875 DSTransformError. . . 875 DSTranslateCode . . . 876 DSWaitForFile . . . 876 DSWaitForJob . . . 877
JobStatusMacros . . . 877
Commandlineinterface. . . 878
CommandsforcontrollingWebSphere DataStagejobs . . . 878
CommandsforadministeringWebSphere DataStage . . . 888
Commandsforsearchingjobs . . . 892
XMLschemasandsamplestylesheets . . . 896
Accessing
information
about
the
product
.
.
.
.
.
.
.
.
.
.
.
.
.
. 897
ContactingIBM . . . 897
Accessibledocumentation . . . 897
Providingcommentsonthedocumentation . . . 897
Notices
and
trademarks
.
.
.
.
.
.
. 899
Notices . . . 899
Trademarks . . . 901
Chapter
1.
Introduction
Thismanual isintended fortheIBM®WebSphere®DataStage®user whohasmastered thebasics of paralleljobdesignandnowwantstoprogressfurther.
Themanualcoversthefollowingtopics:
v JobDesign Tips.Thischaptercontainsmiscellaneous tipsaboutdesigningparalleljobs, fromuseofthe
WebSphereDataStageDesignerinterface tohandlinglargevolumesofdata.
v ImprovingPerformance.Thischapterdescribes methodsbywhichyoucanevaluatetheperformanceof
yourparalleljobdesigns andcomeupwith strategiesforimprovingthem.
v LinkBuffering. Thischaptercontainsanin-depthdescriptionofwhenandhow WebSphereDataStage
buffersdatawithin ajob,andhow youcanchangetheautomaticsettingsifrequired.
v SpecifyingYourOwnParallelStages.ThischapterdescribetheinterfaceWebSphereDataStageprovides
fordefiningyourown paralleljobstagetypes.
v EnvironmentVariables.Thischapterlistalltheenvironmentvariablesthatareavailablefor affecting
thesetupandoperationofparalleljobs.
v Operators.Anumberofchaptersdescribetheoperatorsthatunderlietheparallel jobstages.
v
WebSphereDataStageDevelopmentKit(JobControlInterfaces).Thischapterliststhevariousinterfaces
thatenableyoutorunandcontrolWebSphereDataStagejobswithoutusingtheWebSphereDataStage Directorclient.
Terminology
Becauseofthetechnicalnatureof someof thedescriptionsinthis manual,itsometimestalksabout detailsof theengine thatdrivesparalleljobs.Thisinvolvestheuseoftermsthatmaybeunfamiliarto ordinaryparallel jobusers.
v Operators.TheseunderliethestagesinaWebSphereDataStagejob.Asinglestagemaycorrespondtoa
singleoperator, oranumberofoperators,dependingonthepropertiesyouhaveset, andwhetheryou havechosento partitionorcollect orsortdataontheinputlinktoa stage.Atcompilation,WebSphere DataStageevaluatesyour jobdesignandwillsometimesoptimizeoperatorsoutif theyarejudgedto besuperfluous,or insertotheroperatorsiftheyareneededforthelogicofthejob.
v OSH.ThisisthescriptinglanguageusedinternallybytheWebSphereDataStageparallelengine.
v Players.Playersaretheworkhorseprocessesina paralleljob.Thereisgenerallyaplayer foreach
operatoroneachnode.Playersarethechildrenofsectionleaders;thereisonesectionleaderper processingnode.Sectionleadersarestartedbytheconductor processrunningontheconductornode (theconductornodeisdefinedintheconfigurationfile).
Chapter
2.
Job
design
tips
Thischaptergivessomehintsandtipsforthegooddesignofparalleljobs.
WebSphere
DataStage
Designer
interface
Thefollowingaresometipsforsmooth useof theWebSphereDataStageDesignerwhenactuallylaying outyourjobonthecanvas.
v Tore-arrangeanexistingjobdesign,orinsert newstagetypesintoanexistingjobflow,first disconnect
thelinksfromthestagetobechanged,then thelinkswillretainanymetadataassociatedwith them. v ALookupstagecanonlyhaveoneinputstream,oneoutputstream, and,optionally,onerejectstream.
Dependingonthetypeoflookup,itcanhaveseveralreferencelinks.Tochangetheuseofparticular Lookuplinksinanexistingjobflow,disconnectthelinksfromtheLookupstage andthenright-click to changethelinktype, forexample,Stream toReference.
v TheCopystageisagoodplaceholderbetweenstagesifyouanticipatethatnew stagesorlogicwillbe
neededinthefuturewithoutdamagingexistingpropertiesandderivations.Wheninsertinga new stage,simplydragtheinputandoutputlinksfromtheCopyplaceholdertothenewstage.Unlessthe ForcepropertyissetintheCopystage,WebSphereDataStage optimizestheactualcopyoutat runtime.
Processing
large
volumes
of
data
Theabilitytoprocesslargevolumesofdatainashortperiodoftimedependsonall aspectsoftheflow andtheenvironmentbeingoptimizedformaximumthroughputand performance.Performancetuning andoptimizationareiterativeprocessesthatbeginwithjobdesignandunittests, proceedthrough
integrationandvolumetesting, andcontinuethroughouttheproductionlifecycleoftheapplication.Here aresomeperformancepointers:
v Whenwriting intermediateresultsthatwillonlybeshared betweenparallel jobs,alwayswriteto
persistentdatasets(usingData Setstages).Youshouldensurethatthedataispartitioned,andthatthe partitions,andsortorder,areretainedateverystage.AvoidformatconversionorserialI/O.
v DataSetstagesshouldbeusedtocreaterestartpointsintheevent thata joborsequenceneedstobe
rerun.But,becausedatasetsareplatformandconfigurationspecific,theyshouldnotbeusedfor long-termbackupand recoveryofsourcedata.
v Dependingonavailablesystem resources,itmaybepossibletooptimizeoverallprocessingtimeatrun
timebyallowingsmallerjobsto runconcurrently.However, caremustbe takentoplanforscenarios whensourcefilesarrivelaterthanexpected,orneedtobereprocessed intheeventofa failure. v Parallelconfigurationfilesallowthedegreeofparallelismand resourcesusedbyparalleljobstobe set
dynamicallyat runtime.Multipleconfigurationfilesshouldbe usedtooptimizeoverallthroughput andtomatchjobcharacteristicstoavailablehardwareresourcesindevelopment,test,and production modes.
Theproperconfigurationofscratchand resourcedisksandtheunderlyingfilesystemandphysical hardwarearchitecturecansignificantlyaffectoveralljobperformance.
WithinclusteredETLand databaseenvironments,resource-pool namingcanbe usedtolimit processing tospecificnodes, includingdatabase nodeswhenappropriate.
Modular
development
Youshouldaimtousemodulardevelopment techniquesinyourjobdesigns inordertomaximizethe reuseofparalleljobs andcomponentsandsave yourselftime.
v Usejobparametersinyourdesignandsupplyvaluesat runtime.Thisallowsa singlejobdesignto
processdifferentdataindifferentcircumstances,ratherthanproducingmultiple copiesofthesame job withslightlydifferentarguments.
v UsingjobparametersallowsyoutoexploittheWebSphereDataStageDirector’smultiple invocation
capability.Youcanrunseveralinvocationsofa jobatthesametimewithdifferentruntimearguments. v Useshared containerstosharecommonlogicacrossanumber ofjobs.Rememberthatshared
containersare insertedwhena jobiscompiled.Ifthesharedcontainer ischanged,thejobsusingitwill needrecompiling.
Designing
for
good
performance
Hereare sometipsfordesigninggoodperformance intoyourjobfromtheoutset.
Avoid
unnecessary
type
conversions.
Becarefultousepropersourcedatatypes,especiallyfromOracle.Youcansetthe
OSH_PRINT_SCHEMASenvironmentvariable toverifythatruntimeschemasmatchthejobdesign columndefinitions.
Ifyouare usingstage variablesona Transformerstage,ensurethattheirdatatypesmatchtheexpected resulttypes.
Use
Transformer
stages
sparingly
and
wisely
Donothavemultiplestageswherethefunctionalitycouldbeincorporatedintoa singlestage,anduse otherstagetypestoperformsimpletransformation operations(see″UsingTransformerStages″formore guidance).
Increase
sort
performance
where
possible
Carefuljobdesigncanimprovetheperformanceofsortoperations,bothinstandalone Sortstagesandin on-linksortsspecifiedintheInputspagePartitioningtabofotherstagetypes.See″Sorting Data″for guidance.
Remove
unneeded
columns
Removeunneededcolumns asearly aspossiblewithinthejobflow.Everyadditionalunusedcolumn requiresadditionalbuffermemory,whichcanimpactperformanceand makeeachrowtransfer fromone stagetothenextmoreexpensive.Ifpossible,whenreadingfromdatabases,useaselectlisttoreadjust thecolumnsrequired,ratherthantheentiretable.
Avoid
reading
from
sequential
files
using
the
Same
partitioning
method.
Unlessyouhavespecifiedmorethanonesourcefile,this willresult intheentirefilebeingreadintoa singlepartition, makingtheentiredownstreamflowrunsequentiallyunlessyouexplicitlyrepartition(see ″UsingSequential FileStages″formoretipsonusingSequential filestages).
Combining
data
Thetwomajor waysofcombiningdataina WebSphereDataStagejobare viaaLookupstageora Join stage.Howdoyoudecidewhichonetouse?
LookupandJoinstagesperformequivalentoperations:combiningtwoormoreinputdatasetsbasedon oneormorespecifiedkeys.Whenoneunsortedinputisverylargeor sortingisnotfeasible,Lookupis preferred.Whenallinputsare ofmanageablesize orarepre-sorted,Joinisthepreferredsolution.
TheLookupstageismostappropriatewhenthereferencedataforallLookupstagesinajobissmall enoughtofitintoavailable physicalmemory.Eachlookupreferencerequiresacontiguousblockof physicalmemory.TheLookupstagerequiresallbutthefirstinput(theprimaryinput)tofitintophysical memory.
IfthereferencetoalookupisdirectlyfromaDB2®or Oracletableandthenumberofinputrowsis significantlysmallerthanthereferencerows, 1:100ormore,a SparseLookupmaybe appropriate. IfperformanceissuesarisewhileusingLookup,consider usingtheJoinstage.TheJoin stagemustbe usedif thedatasetsarelargerthanavailable memoryresources.
Sorting
data
Lookatjobdesignsand trytoreorderthejobflowtocombineoperationsaroundthesamesortkeysif possible,andcoordinateyour sortingstrategy withyourhashingstrategy.Itissometimespossibleto rearrangetheorderofbusinesslogicwithina jobflowtoleveragethesamesortorder,partitioning,and groupings.
Ifdatahasalreadybeenpartitionedand sortedonaset ofkeycolumns,specifythe″don’tsort,
previouslysorted″option forthekeycolumns intheSortstage.Thisreducesthecostofsortingand takes greateradvantageof pipelineparallelism.
Whenwritingtoparalleldatasets, sortorderand partitioningarepreserved.Whenreadingfromthese datasets,trytomaintain thissortingif possiblebyusingSamepartitioningmethod.
Thestablesort optionismuchmoreexpensivethannon-stablesorts, andshouldonlybe usedif thereis aneedtomaintain roworderotherthanasneededtoperformthesort.
Theperformanceofindividual sortscanbe improvedbyincreasingthememoryusageperpartitionusing theRestrictMemoryUsage(MB)optionoftheSortstage.Thedefaultsettingis20MBperpartition. Note thatsortmemoryusagecanonlybespecifiedforstandalone Sortstages,itcannotbechangedforinline (onalink)sorts.
Default
and
explicit
type
conversions
Whenyouaremappingdatafromsourcetotarget youmayneedtoperform datatypeconversions.Some conversionshappenautomatically,andthesecantakeplaceacrosstheoutputmappingofanyparallel job stagethathasan inputandanoutputlink.Other conversionsneeda functiontoexplicitlyperform the conversion.Thesefunctionscanbe calledfromaModifystage oraTransformerstage,andarelisted in AppendixB ofWebSphereDataStageParallelJobDeveloperGuide.(Modifyisthepreferredstage forsuch conversions-see″UsingTransformerStages″.)
Thefollowingtableshowswhichconversionsareperformedautomaticallyand whichneedtobe explicitlyperformed.″d″ indicatesautomatic(default)conversion,″m″indicatesthatmanualconversion isrequired, ablanksquareindicatesthatconversionisnotpossible:
Source Field
DestinationField
int8 uint8 int16 uint16 int32 uint32 int64 uint64 sfloat dfloat
int8 d d d d d d d d dm
uint8 d d d d d d d d d
int16 dm d d d d d d d d
Source Field
DestinationField
int8 uint8 int16 uint16 int32 uint32 int64 uint64 sfloat dfloat
int32 dm d d d d d d d d uint32 d d d d d d d d d int64 dm d d d d d d d d uint64 d d d d d d d d d sfloat dm d d d d d d d d dfloat dm d d d d d d d d decimal dm d d d dm d dm dm d dm string dm d dm d d dm d d d dm ustring dm d dm d d dm d d d dm raw m m date m m m m time m m m time stamp m m m SourceField
decimal string ustring raw date time timestamp
int8 d dm dm m m m uint8 d d d int16 d dm dm uint16 d dm dm int32 d dm dm m m uint32 d m m m int64 d d d uint64 d d d sfloat d d d dfloat dm dm dm m m decimal dm dm string dm d m m m ustring dm d m m raw date m m m time m m dm timestamp m m m m
d=defaultconversion;m =modifyoperatorconversion; blank=noconversionneededorprovided Youshouldalso notethefollowingpoints abouttype conversion:
v TheenvironmentvariableAPT_STRING_PADCHARcanbe usedtochangethedefaultpadcharacter
fromanASCIINULL(0x0)toanothercharacter;forexample,anASCII space(Ox20) oraunicode space(U+0020).
v Asanalternatesolution,thePadStringfunction canbeusedtopada variable-length(Varchar) stringto
aspecified lengthusinga specifiedpadcharacter.NotethatPadStringdoesnotworkwithfixed-length (Char)stringtypes.Youmust firstconvertChartoVarcharbeforeusingPadString.
Using
Transformer
stages
Ingeneral,it isgoodpracticenottousemore Transformerstagesthanyouhaveto.Youshouldespecially avoidusingmultipleTransformerstageswherethelogiccanbe combinedintoa singlestage.
Itisoftenbetter touseotherstagetypesforcertaintypesofoperation: v Usea CopystageratherthanaTransformerforsimpleoperationssuchas:
– Providingajobdesignplaceholderonthecanvas.(Providedyoudo notset theForcepropertyto
TrueontheCopystage,thecopywillbe optimizedoutofthejobatruntime.) – Renamingcolumns.
– Droppingcolumns.
– Implicittype conversions(see″DefaultandExplicit TypeConversions″).
Notethat,ifruntimecolumn propagationisdisabled,youcanalsouseoutputmappingonastage torename,drop,or convertcolumns onastagethathasbothinputsand outputs.
v UsetheModifystageforexplicit typeconversion(see″DefaultandExplicit TypeConversions″)and
nullhandling.
v Wherecomplex,reusablelogicisrequired,orwhereexistingTransformer-stagebased jobflowsdonot
meetperformance requirements,consider buildingyour owncustomstage(seeChapter5,“Specifying yourown parallelstages,”onpage31,)
v Usea BASICTransformerstage whereyouwanttotakeadvantageofuser-definedfunctionsand
routines.
Using
Sequential
File
stages
Certainconsiderationsapplywhenreadingand writingfixed-lengthfieldsusingtheSequentialFilestage. v Ifreadingcolumnsthathaveaninherentlyvariable-widthtype (forexample,integer, decimal,or
varchar)then youshouldsettheFieldWidth propertytospecifytheactualfixed-widthoftheinput column.DothisbyselectingEdit Row...fromtheshortcutmenufora particularcolumninthe Columnstab, andspecifythewidthintheEditColumn MetaData dialogbox.
v Ifwritingfixed-widthcolumnswith typesthatare inherentlyvariable-width,then settheFieldWidth
propertyandthePadcharproperty intheEditColumnMetaDatadialogboxtomatchthewidthof theoutputcolumn.
Otherconsiderations areasfollows: v
Ifacolumn isnullable, youmust definethenullfieldvalue andlengthintheEditColumnMetaData
dialogbox.
v Becareful whenreadingdelimited,bounded-lengthvarcharcolumns(i.e.,varcharswiththelength
optionset).Ifthesourcefilehasfields whicharelongerthanthemaximumvarchar length,theseextra charactersaresilently discarded.
v AvoidreadingfromsequentialfilesusingtheSamepartitioning method.Unlessyouhavespecified
morethanonesourcefile,thiswillresult intheentirefilebeingread intoasinglepartition, makingthe entiredownstreamflowrunsequentiallyunlessyouexplicitlyrepartition.
Using
Database
stages
Thebestchoiceistouseconnectorstagesifavailable foryourdatabase.Thenext bestchoicearethe Enterprise’databasestagesasthesegive maximumparallelperformanceand featureswhencompared to ’plug-in’stages.TheEnterprisestagesare:
v DB2/UDBEnterprise v Informix ®Enterprise v OracleEnterprise v TeradataEnterprise v SQLServerEnterprise v SybaseEnterprise v ODBCEnterprise v iWayEnterprise v NetezzaEnterprise
YoushouldavoidgeneratingtargettablesinthedatabasefromyourWebSphereDataStagejob(i.e.,using theCreatewritemodeonthedatabasestage)unlesstheyareintended fortemporarystorageonly.Thisis becausethismethoddoesnotallowyouto,forexample,specifytargettablespace, andyoumay
inadvertentlyviolatedata-management policiesonthedatabase.If youwanttocreateatableona target databasefromwithin ajob,usetheOpencommandpropertyonthedatabasestagetoexplicitlycreatethe tableandallocatetablespace,oranyotheroptionsrequired.
TheOpencommandpropertyallowsyoutospecifyacommand(forexamplesomeSQL)thatwillbe executedbythedatabasebefore itprocessesanydatafromthestage.Thereisalso aClosepropertythat allowsyouto specifya commandtoexecuteafter thedatafromthestagehasbeenprocessed.(Notethat, whenusinguser-definedOpenandClosecommands,youmayneedtoexplicitly specifylockswhere appropriate.)
Database
sparse
lookup
vs.
join
Dataread byanydatabasestagecanserveasthereferenceinputtoaLookupstage.By default,this referencedataisloadedintomemorylikeanyotherreferencelink.
WhendirectlyconnectedasthereferencelinktoaLookupstage,bothDB2/UDBEnterpriseandOracle Enterprisestagesallowthelookup typetobechangedto Sparseand sendindividualSQLstatementsto thereferencedatabaseforeachincoming Lookuprow.SparseLookupisonlyavailablewhenthedatabase stageisdirectlyconnectedtothereferencelink, withnointermediate stages.
ItisimportanttonotethattheindividualSQLstatementsrequiredbya SparseLookupare anexpensive operationfromaperformanceperspective.Inmostcases,it isfastertouseaWebSphereDataStageJoin stagebetweentheinputandDB2referencedatathanitistoperforma SparseLookup.
Forscenarioswherethenumber ofinputrowsissignificantlysmaller(1:100ormore)thanthenumberof referencerowsinaDB2or Oracletable,aSparseLookupmaybeappropriate.
DB2
database
tips
Ifavailable,usetheDB2connestor. Otherwise,always usetheDB2Enterprisestageinpreferencetothe DB2/APIpluginstageforreadingfrom,writingto,andperforminglookups againsta DB2Enterprise ServerEditionwiththeDatabasePartitioningFeature(DBF).TheDB2 Enterprisestageisdesignedfor maximumperformanceand scaleabilityagainstverylargepartitionedDB2UNIX® databases.
TheDB2/APIpluginshouldonlybeusedtoreadfromand writetoDB2onother,non-UNIXplatforms. Youmight,forexample,useittoaccessmainframeeditionsthroughDB2Connect™.
Write
vs.
load
TheDB2EnterprisestageoffersthechoicebetweenSQLmethods(insert,update, upsert,delete)orfast loadermethodswhenwritingtoaDB2database.Thechoicebetweenthesemethodsdependsonthe requiredperformance,database logusage, andrecoverabilityconsiderationsasfollows:
v Thewritemethod(usinginsert,update,upsert,ordelete)communicatesdirectlywithDB2database
nodesto executeinstructionsin parallel.Alloperationsare loggedtotheDB2databaselog,andthe targettable(s) canbeaccessedbyotherusers.Timeandrow-basedcommitintervalsdeterminethe transactionsizeandavailabilityofnew rowstootherapplications.
v
TheloadmethodrequiresthattheuserrunningthejobhasDBADMprivilegeonthetargetdatabase.
Duringaloadoperationanexclusive lockisplacedontheentireDB2tablespaceintowhichthedatais beingloaded,andsothistablespacecannotbe accessedbyanyoneelsewhiletheloadistakingplace. Theloadisalsonon-recoverable:iftheloadoperationisterminatedbeforeitiscompleted,thecontents ofthetableareunusable andthetablespaceisleft intheloadpending state.Ifthis happens,the WebSphereDataStagejobmustbere-run withthestagesetto truncatemodetocleartheloadpending state.
Oracle
database
tips
Whendesigning jobsthatuseOracle sourcesortargets, notethattheparallelengine willuseits interpretationoftheOraclemetadata(forexample,exactdatatypes)based oninterrogationof Oracle, overridingwhatyoumayhavespecifiedintheColumnstab. Forthis reasonitisbesttoimportyour OracletabledefinitionsusingtheImport→ OrchestrateSchemaDefinitionscommandfromhe WebSphereDataStageDesigner. ChoosetheDatabasetableoption andfollowtheinstructionsfromthe wizard.
Loading
and
indexes
WhenyouusetheLoadwritemethodinanOracle Enterprisestage,youareusingtheParallel Direct Pathloadmethod.Ifyouwanttousethismethodtowritetablesthathaveindexesonthem(including indexesautomaticallygeneratedbyprimarykeyconstraints),youmust specifytheIndexModeproperty (youcansetit toMaintenanceorRebuild).Analternativeistosettheenvironmentvariable
APT_ORACLE_LOAD_OPTIONSto″OPTIONS(DIRECT=TRUE,PARALLEL=FALSE).Thisallowsthe loadingofindexed tableswithoutindexmaintenance,buttheloadisperformedsequentially.
YoucanusetheupsertwritemethodtoinsertrowsintoanOracle tablewithoutbypassingindexesor constraints.InordertoautomaticallygeneratetheSQLneeded,set theUpsertModepropertyto Auto-generatedand identifythekeycolumn(s) ontheColumns tabbyselectingtheKeycheckboxes.
Teradata
Database
Tips
YoucanusetheAdditionalConnections OptionspropertyintheTeradataEnterprisestage(whichisa dependentofDBOptionsMode)tospecifydetailsaboutthenumberofconnectionstoTeradata.The possiblevaluesofthisare:
v sessionsperplayer.ThisdeterminesthenumberofconnectionseachplayerinthejobhastoTeradata.
Thenumbershouldbeselectedsuchthat:
(sessions per player * number of nodes * players per node) = total requested sessions
Thedefaultvalueis2.Setting thistoolowonalargesystem canresult insomanyplayersthatthejob failsdue toinsufficientresources.
v requestedsessions.Thisisanumberbetween1 andthenumberofvprocsin thedatabase.Thedefault
Chapter
3.
Improving
performance
Thischapterisintendedtohelpresolve anyperformanceproblems.Itassumesthatbasicstepstoassure performancehavebeentaken: asuitableconfigurationfilehasbeenset up(seeinWebSphereDataStage ParallelJobDeveloperGuide),reasonableswapspaceconfiguredetc.andthatyouhavefollowedthedesign guidelineslaiddowninChapter2,“Jobdesigntips,”onpage3.
Understanding
a
flow
Inordertoresolveanyperformance issuesit isessentialtohaveanunderstandingoftheflowof WebSphereDataStagejobs.
Score
dumps
Tohelpunderstanda jobflowwesuggest youtakea scoredump.Dothis bysettingthe
APT_DUMP_SCOREenvironment variabletrueandrunningthejob(APT_DUMP_SCOREcanbeset in theAdministratorclient,undertheParallel →Reportingranch).Thiscausesa reporttobe produced whichshowstheoperators,processesanddatasetsinthejob.Thereportincludesinformationabout: v Whereandhowdataisrepartitioned.
v WhetherWebSphereDataStagehadinsertedextraoperatorsintheflow.
v Thedegreeofparallelismeachoperatorrunswith,and onwhichnodes.
v Informationaboutwhere dataisbuffered.
Thedumpscoreinformationisincludedinthejoblogwhenyourunajob.
Thescoredumpisparticularlyusefulinshowing youwhereWebSphereDataStageisinsertingadditional componentsinthejobflow. InparticularWebSphereDataStagewilladdpartitionand sortoperators wherethelogicof thejobdemands it.Sorts inparticularcanbedetrimentaltoperformance andascore dumpcanhelpyoutodetectsuperfluousoperatorsandamendthejobdesigntoremovethem.
Example
score
dump
Thefollowingscoredumpshowsa flowwithasingle dataset,whichhasa hashpartitioner,partitioning onkey″a″.Itshowsthreeoperators:generator, tsort,and peek.Tsortand peekare ″combined″,indicating thattheyhavebeenoptimizedintothesame process.All theoperatorsinthis flowarerunningonone node.
##I TFSC 004000 14:51:50(000) <main_program> This step has 1 data set:
ds0: {op0[1p] (sequential generator)
eOther(APT_HashPartitioner { key={ value=a }
})->eCollectAny
op1[2p] (parallel APT_CombinedOperatorController:tsort)}
It has 2 operators:
op0[1p] {(sequential generator)
on nodes (
lemond.torrent.com[op0,p0] )}
op1[2p] {(parallel APT_CombinedOperatorController:
(tsort) (peek) ) on nodes ( lemond.torrent.com[op1,p0] lemond.torrent.com[op1,p1] )}