73843431-Ab-Initio-1

(1)

(2)

INTRODUCTION

Ab Initio is a Latin phrase that translates to

“from first principles” or “from the beginning”.

(3)

INTRODUCTION

- Ab Initio is a general-purpose data processing platform for enterprise-class, mission-critical applications such as data warehousing, clickstream processing, data movement, data transformation and analytics.

- Supports integration of arbitrary data sources and programs, and provides complete metadata management across the enterprise.

- Proven best-of-breed ETL solution.

- Applications of Ab Initio:

• ETL for data warehouses, data marts and operational data sources.

• Parallel data cleansing and validation.

• Parallel data transformation and filtering.

• High performance analytics

(4)

ARCHITECTURE

Ab Initio comprises three core components:

- The Co>Operating System

- Enterprise Meta>Environment (EME)

- Graphical Development Environment (GDE)

Supplementary components:

- Continuous>Flow

- Plan>It

- Re>source

- Data>Profiler

(5)

(6)

ARCHITECTURE

[Figure: Ab Initio architecture layers. Labels shown: Native Operating System (UNIX, Windows NT); Ab Initio Co>Operating System; Component Library, User-defined Components, Third Party Components; Application Development Environments (Graphical, C++, Shell); Applications; Ab Initio Metadata Repository.]

(7)

OPERATING ENVIRONMENTS

Ab Initio has been developed to support a wide variety of operating environments.

The Co>Operating System and EME are available in many flavors and can be installed on:

• Windows NT based systems

• AIX

• HP-UX

• Linux

• Solaris

• Tru64 UNIX

The GDE part of Ab Initio is currently available for Windows-based operating systems only.

(8)

CO>OPERATING SYSTEM

The Co>Operating System and the Graphical Development Environment (GDE) form the basis of Ab Initio software. They interact with each other to create and run an Ab Initio application:

The GDE provides a canvas on which you manipulate icons that represent data, programs, and the connections between them. The result looks like a data flow diagram, and represents the solution to a data manipulation problem.

The GDE communicates this solution to the Co>Operating System as a Korn shell script.

(9)

ENTERPRISE META>ENVIRONMENT

Enterprise Meta>Environment (EME) is an Ab Initio repository and environment for storing and managing metadata. It can store both business and technical metadata. EME metadata can be accessed from the Ab Initio GDE, a web browser, or the Ab Initio Co>Operating System command line (air commands).

When you work with the EME, you have files of two kinds in two locations:

• A personal work area that you specify, located in some filesystem

Essentially this is a formalized directory structure that usually only you have access to. This is where you work on files.

• A system storage area

This is the EME system area where every version that you save of the files you work on is permanently preserved, organized in projects. You

(and other users) have indirect access to this area through the GDE and the

(10)

FROM THE BEGINNING DAY - 2

(11)

GRAPHICAL DEVELOPMENT ENVIRONMENT

GDE is a graphical application for developers, used for designing and running Ab Initio graphs.

• The ETL process in Ab Initio is represented by Ab Initio graphs. Graphs are formed by components (from the standard components library or custom), flows (data streams) and parameters.

• A user-friendly front end for designing Ab Initio ETL graphs.

• Ability to run and debug Ab Initio jobs and trace execution logs.

• The GDE graph compilation process results in the generation of a UNIX shell script which may be executed on a machine without

(12)

GDE

[Figure: GDE main window. Labeled areas: Main, Tools, Run, Edit, Phases, Debug, Menu bar, Sandbox, Components, Application Output, Graph Design area.]

(13)

THE GRAPH

An Ab Initio application is called a graph: a graphical representation of datasets, programs, and the connections between them. Within the framework of a graph, you can arrange and rearrange the datasets, programs, and connections and specify the way they operate. In this way, you can build a graph to perform any data manipulation you want.

(14)

THE GRAPH

A graph is a data flow diagram that defines the various processing stages of a task and the streams of data as they move from one stage to another. In a graph, a component represents a stage and a flow represents a data stream. In addition, there are parameters for specifying various aspects of graph behavior.

Graphs are built in the GDE by dragging and dropping components, connecting them with flows, and then defining values for parameters. You run, debug, and tune your graph in the GDE. In the process of building a graph, you are developing an Ab Initio application, and thus graph development is called graph programming. When you are ready to deploy your application, you save the graph as a script that you can run from the command line.

(15)

THE GRAPH

A graph is composed of components and flows:

• Components represent datasets (dataset components) and the programs (program components) that operate on them.

  - Dataset components are frequently referred to as file components when they represent files, or table components when they represent database tables.

  - Components have ports through which they read and write data.

  - Ports have metadata attached to them that specifies how to interpret the data passing through them.

• Program components have parameters that you can set to control the specifics of how the program operates.

• Flows represent the connections between ports. Data moves from one component to another through flows.
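The relationship between components, ports, parameters and flows can be sketched as a small data model. This is only an illustration in Python, not Ab Initio code: the class names, the port names (`read`, `in`, `out`, `write`), the `state` field and the example graph are all assumptions made for the sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Port:
    name: str            # illustrative port name, e.g. "in" or "out"
    record_format: str   # metadata describing the records on this port


@dataclass
class Component:
    name: str
    kind: str                                        # "dataset" or "program"
    ports: dict = field(default_factory=dict)        # port name -> Port
    parameters: dict = field(default_factory=dict)   # program settings

    def add_port(self, name, record_format):
        self.ports[name] = Port(name, record_format)
        return self


@dataclass
class Flow:
    source: tuple   # (component name, port name)
    target: tuple   # (component name, port name)


# A hypothetical three-component graph: file -> filter -> file.
fmt = "record string(18) last_name; string(2) state; end;"
customers = Component("customers", "dataset").add_port("read", fmt)
keep_ny = Component("keep_ny", "program",
                    parameters={"select_expr": "state == 'NY'"})
keep_ny.add_port("in", fmt).add_port("out", fmt)
result = Component("result", "dataset").add_port("write", fmt)

flows = [
    Flow(("customers", "read"), ("keep_ny", "in")),
    Flow(("keep_ny", "out"), ("result", "write")),
]
```

Each flow connects one port to another, and the record format attached to a port tells the receiving component how to interpret the bytes it reads.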

(16)

THE GRAPH

A graph can run on one or many processors. The processors can be on one or many computers.

All the computers that participate in running a graph must have the Co>Operating System installed.

When you run a graph from the GDE, the GDE connects to the Co>Operating System on the run host: the computer that hosts the Co>Operating System installation that controls the execution of the graph. The GDE transmits a Korn shell script that contains all the information the Co>Operating System needs to execute the programs and processes represented by the graph, which involves:

• Managing all connections between the processors or computers that execute various parts of the graph

• Program execution

• Transferring data from one program to the next

• Transferring data from one processor or computer to another

• Execution in phases

• Restoring the graph to its state at the last completed checkpoint in case of a failure

(17)

SETTING UP GDE

Before using the GDE to run a graph or plan, a connection must be established between the GDE and the installation of the Co>Operating System on which you want to run your graph. The computer that hosts this installation is called the run host. Typically, the GDE is on one computer and the run host is a different computer. But even if the GDE is local to the run host, you must still establish the connection.

Use the Host Settings dialog (in the GDE, Run > Settings > Host tab) to specify the information the GDE needs to log in to the run host.

The GDE creates a host settings file with the name you specified. The host settings file contains the information you specified, such as the name of the run host, login name or username, password, Co>Operating System version and location, and the type of shell you log in with.

(18)

DATA PROCESSING

Generally, data processing involves the following tasks:

• Collection of data from various sources.

• Cleansing and standardizing data and its datatypes.

• Data integrity checks.

• Profiling of data.

• Staging it for further operations.

• Joining with various datasets.

• Transforming data based on business rules.

• Conforming data validity.

• Translating data to compatible types of target systems.

• Loading data to target systems.

• Tying out source and target data for quality and quantity.

• Anomaly, rejection and error handling.

• Reporting process updates.

• Metadata updates.

(19)

COMMON TASKS

• Selecting: The SELECT clause of an SQL statement.

• Filtering: Conditions in the WHERE clause of an SQL statement.

• Sorting: Specifications in an ORDER BY clause.

• Transformation: Various transformation functions used on the fields in the SELECT clause.

• Aggregation: Summarization over a set of records (SUM, AVG, MIN, etc.).

• Switch: Conditional operations using CASE statements.

• String Operations: String selection and manipulation (CONCATENATE, SUBSTRING, etc.).

• Mathematical Operations: Addition, division, etc.

• Inquiry: Tests for attributes and content of a field of a record.

• Joins: Specifications using JOIN clauses (SQL Server, ANSI) or the WHERE clause (Oracle).

• Lookups: Getting relevant fields from dimensions.

• Rollups: Summarized data grouped using the GROUP BY clause of an SQL statement.

(20)

COMPONENTS

Components are classified based on their functionality. The following are a few important sets of components.

(21)

(22)

COMPONENTS

The dataset components represent records or act on records, as follows:

INPUT FILE represents records read as input to a graph from one or more serial files or from a multifile.

INPUT TABLE unloads records from a database into an Ab Initio graph, allowing you to specify as the source either a database table, or an SQL statement that selects records from one or more tables.

INTERMEDIATE FILE represents one or more serial files or a multifile of intermediate results that a graph writes during execution, and saves for your review after execution.

LOOKUP FILE represents one or more serial files or a multifile of records small enough to be held in main memory, letting a transform function retrieve records much more quickly than it could if they were stored on disk.

OUTPUT FILE represents records written as output from a graph into one or multiple serial files or a multifile.

OUTPUT TABLE loads records from a graph into a database, letting you specify the records’ destination either directly as a single database table, or through an SQL statement that inserts records into one or more tables.

READ MULTIPLE FILES sequentially reads from a list of input files.

READ SHARED reduces the disk read rate when multiple graphs (as opposed to multiple components in the same graph) are reading the same very large file.

(23)

COMPONENTS

The database components are the following:

• CALL STORED PROCEDURE calls a stored procedure that returns multiple result sets. The stored procedure can also take parameters.

• INPUT TABLE unloads records from a database into an Ab Initio graph, allowing you to specify as the source either a database table, or an SQL statement that selects records from one or more tables.

• JOIN WITH DB joins records from the flow or flows connected to its input port with records read directly from a database, and outputs new records containing data based on, or calculated from, the joined records.

• MULTI UPDATE TABLE executes multiple SQL statements for each input record.

• OUTPUT TABLE loads records from a graph into a database, letting you specify the records’ destination either directly as a single database table, or through an SQL statement that inserts records into one or more tables.

• RUN SQL executes SQL statements in a database and writes confirmation messages to the log port.

• TRUNCATE TABLE deletes all the rows in a database table, and writes confirmation messages to the log port.

• UPDATE TABLE executes UPDATE, INSERT, or DELETE statements in embedded SQL format to modify a table in a database, and writes status information to the log port.

(24)

COMPONENTS

MULTI REFORMAT changes the record format of records flowing through from one to 20 pairs of in and out ports by dropping fields, or by using DML expressions to add fields, combine fields, or transform the data in the records.

NORMALIZE generates multiple output records from each input record; you can specify the number of output records, or the number of output records can depend on a field or fields in each input record. NORMALIZE can separate a record with a vector field into several individual records, each containing one element of the vector.

REFORMAT changes the record format of records by dropping fields, or by using DML expressions to add fields, combine fields, or transform the data in the records.

ROLLUP generates records that summarize groups of records. ROLLUP gives you more control over record selection, grouping, and aggregation than AGGREGATE. ROLLUP can process either grouped or ungrouped input. When processing ungrouped input, ROLLUP maximizes performance by keeping intermediate results in main memory.

SCAN generates a series of cumulative summary records (such as successive year-to-date totals) for groups of records. SCAN can process either grouped or ungrouped input. When processing ungrouped input, SCAN maximizes performance by keeping intermediate results in main memory.

SCAN WITH ROLLUP performs the same operations as SCAN; it generates a summary record for each input group.
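The difference between ROLLUP, SCAN and NORMALIZE is easiest to see on a few records. The following Python sketch only imitates the behavior described above; it is not DML, and the function names, the field names and the sample data are assumptions. The input is assumed to be already grouped by the key.

```python
from itertools import accumulate, groupby
from operator import itemgetter

# Hypothetical input, already grouped (sorted) by the key field "region".
sales = [
    {"region": "east", "amount": 10},
    {"region": "east", "amount": 5},
    {"region": "west", "amount": 7},
]


def rollup(records, key, field):
    """One summary record per group (ROLLUP-like)."""
    return [
        {key: k, "total": sum(r[field] for r in group)}
        for k, group in groupby(records, key=itemgetter(key))
    ]


def scan(records, key, field):
    """One cumulative record per input record (SCAN-like)."""
    out = []
    for k, group in groupby(records, key=itemgetter(key)):
        running = accumulate(r[field] for r in group)
        out.extend({key: k, "running_total": t} for t in running)
    return out


def normalize(record, vector_field):
    """One output record per element of a vector field (NORMALIZE-like)."""
    rest = {f: v for f, v in record.items() if f != vector_field}
    return [{**rest, vector_field: element} for element in record[vector_field]]
```

Here `rollup` emits one record per region, `scan` emits one record per input record with a running total that restarts for each region, and `normalize` fans a single record out into one record per vector element.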

(25)

COMPONENTS

The Miscellaneous components perform a variety of tasks:

ASSIGN KEYS assigns a value to a surrogate key field in each record on the in port, based on the value of a natural key field in that record, and then sends the record to one or two of three output ports.

BUFFERED COPY copies records from its in port to its out port without changing the values in the records. If the downstream flow stops, BUFFERED COPY copies records to a buffer and defers outputting them until the downstream flow resumes.

DOCUMENTATION provides a facility for documenting a transform.

GATHER LOGS collects the output from the log ports of components for analysis of a graph after execution.

LEADING RECORDS copies records from input to output, stopping after the given number of records.

META PIVOT converts each input record into a series of separate output records: one separate output record for each field of data in the original input record.

REDEFINE FORMAT copies records from its input to its output without changing the values in the records. You can use REDEFINE FORMAT to change or rename fields in a record format without changing the values in the records.

REPLICATE arbitrarily combines all the records it receives into a single flow and writes a copy of that flow to each of its output flows.

RUN PROGRAM runs an executable program.

THROTTLE copies records from its input to its output, limiting the rate at which records are processed.

RECIRCULATE and COMPUTE CLOSURE components can only be used together. These two components calculate the complete set of direct and derived relationships among a set of input key-pairs; in other words, the transitive closure of the relationship within the set.
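To make "transitive closure" concrete, here is a minimal Python sketch of the computation on a set of key-pairs. It illustrates the concept only; it says nothing about how RECIRCULATE and COMPUTE CLOSURE are implemented, and the function name is an assumption.

```python
def transitive_closure(pairs):
    """Return all direct and derived (a, b) relationships."""
    closure = set(pairs)
    while True:
        # Derive (a, d) whenever (a, b) and (b, d) are both known.
        derived = {
            (a, d)
            for (a, b) in closure
            for (c, d) in closure
            if b == c
        }
        if derived <= closure:   # nothing new: the closure is complete
            return closure
        closure |= derived
```

For the direct pairs A-B, B-C and C-D, the closure also contains the derived pairs A-C, B-D and A-D. The repeated passes until nothing new appears mirror the idea of feeding derived pairs back in until the set stops growing.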

(26)

COMPONENTS

The transform components modify or manipulate records by using one or several transform functions. Some of these are multistage transform components that modify records in up to five stages: input selection, temporary initialization, processing, finalization, and output selection. Each stage is written as a DML transform function.

AGGREGATE generates records that summarize groups of records. AGGREGATE can process either grouped or ungrouped input. When processing ungrouped input, AGGREGATE maximizes performance by keeping intermediate results in main memory.

DEDUP SORTED separates one specified record in each group of records from the rest of the records in the group. DEDUP SORTED requires grouped input.

DENORMALIZE SORTED consolidates groups of related records into a single record with a vector field for each group, and optionally computes summary fields for each group. DENORMALIZE SORTED requires grouped input.

FILTER BY EXPRESSION filters records according to a specified DML expression.

FUSE applies a transform to corresponding records from each input flow. The transform is first applied to the first record on each flow, then to the second record on each flow, and so on. The result of the transform is sent to the out port.

JOIN performs inner, outer, and semi-joins on multiple flows of records. JOIN can process either sorted or unsorted input. When processing unsorted input, JOIN maximizes performance by loading input records into main memory.

MATCH SORTED combines and performs transform operations on multiple flows of

(27)

COMPONENTS

The sort components sort and merge records, and perform related tasks:

CHECKPOINTED SORT sorts and merges records, inserting a checkpoint between the sorting and merging phases.

FIND SPLITTERS sorts records according to a key specifier, and then finds the ranges of key values that divide the total number of input records approximately evenly into a specified number of partitions.

PARTITION BY KEY AND SORT repartitions records by key values and then sorts the records within each partition; the number of input and output partitions can be different.

SAMPLE selects a specified number of records at random from one or more input flows. The probability of any one input record appearing in the output flow is the same; it does not depend on the position of the record in the input flow.

SORT orders and merges records.

SORT WITHIN GROUPS refines the sorting of records already sorted according to one key specifier: it sorts the records within the groups formed by the first sort according to a second key specifier.
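Two of the descriptions above, FIND SPLITTERS and SORT WITHIN GROUPS, can be imitated in a few lines of Python. This is a sketch of the idea only; the function names, the record layout (plain dictionaries) and the exact way boundaries are chosen are assumptions, not the components' actual algorithms.

```python
from itertools import groupby
from operator import itemgetter


def find_splitters(records, key, partitions):
    """Sort the key values, then pick boundary keys that cut the
    input into roughly equal parts (FIND SPLITTERS-like)."""
    keys = sorted(r[key] for r in records)
    step = len(keys) / partitions
    return [keys[int(i * step)] for i in range(1, partitions)]


def sort_within_groups(records, major_key, minor_key):
    """Input is already sorted by major_key; sort each group of equal
    major_key values by minor_key (SORT WITHIN GROUPS-like)."""
    out = []
    for _, group in groupby(records, key=itemgetter(major_key)):
        out.extend(sorted(group, key=itemgetter(minor_key)))
    return out
```

With nine records keyed 1 to 9 and three partitions, `find_splitters` returns the boundary keys 4 and 7, so each range holds three records. `sort_within_groups` never moves a record out of its group, which is why it is cheaper than a full sort on both keys.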

(28)

COMPONENTS

The partition components distribute records to multiple flow partitions or multiple straight flows to support data parallelism or component parallelism.

BROADCAST arbitrarily combines all the records it receives into a single flow and writes a copy of that flow to each of its output flow partitions.

PARTITION BY EXPRESSION distributes records to its output flow partitions according to a specified DML expression.

PARTITION BY KEY distributes records to its output flow partitions according to key values.

PARTITION BY PERCENTAGE distributes a specified percent of the total number of input records to each output flow.

PARTITION BY RANGE distributes records to its output flow partitions according to the ranges of key values specified for each partition.

PARTITION BY ROUND-ROBIN distributes records evenly to each output flow.

PARTITION WITH LOAD BALANCE distributes records to its output flow partitions, writing more records to the flow partitions that consume records faster.
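The three most common partitioning strategies above can be sketched in Python as follows. The function names are assumptions, the CRC-based hash is only a stand-in for whatever the Co>Operating System uses internally, and the boundary rule in `partition_by_range` is one possible convention.

```python
import bisect
import zlib


def partition_round_robin(records, n):
    """Deal records out to n partitions in turn."""
    parts = [[] for _ in range(n)]
    for i, record in enumerate(records):
        parts[i % n].append(record)
    return parts


def partition_by_key(records, key, n):
    """Send records with equal key values to the same partition."""
    parts = [[] for _ in range(n)]
    for record in records:
        h = zlib.crc32(str(record[key]).encode("utf-8"))  # stand-in hash
        parts[h % n].append(record)
    return parts


def partition_by_range(records, key, splitters):
    """splitters is an ascending list of boundary keys; a record goes to
    the partition whose key range contains its key value."""
    parts = [[] for _ in range(len(splitters) + 1)]
    for record in records:
        parts[bisect.bisect_right(splitters, record[key])].append(record)
    return parts
```

Round-robin gives the most even spread but scatters equal keys. Partitioning by key keeps equal keys together, which key-based components running in parallel depend on, at the cost of possible skew.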

(29)

COMPONENTS

The departition components combine multiple flow partitions or multiple straight flows into a single flow to support data parallelism or component parallelism.

CONCATENATE appends multiple flow partitions of records one after another.

GATHER combines records from multiple flow partitions arbitrarily.

INTERLEAVE combines blocks of records from multiple flow partitions in round-robin fashion.

MERGE combines records from multiple flow partitions that have all been sorted according to the same key specifier and maintains the sort order.
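The ordering guarantees of CONCATENATE, INTERLEAVE and MERGE differ, and a small Python sketch makes the difference visible. The function names and the `block_size` parameter are assumptions; GATHER is omitted because its output order is arbitrary by definition.

```python
import heapq
from itertools import chain, islice


def concatenate(partitions):
    """All of partition 0, then all of partition 1, and so on."""
    return list(chain.from_iterable(partitions))


def interleave(partitions, block_size=1):
    """Take a block of records from each partition in turn."""
    iterators = [iter(p) for p in partitions]
    out = []
    while iterators:
        still_active = []
        for it in iterators:
            block = list(islice(it, block_size))
            if block:
                out.extend(block)
                still_active.append(it)
        iterators = still_active
    return out


def merge(partitions, key):
    """Combine partitions that are each sorted by key, keeping the order."""
    return list(heapq.merge(*partitions, key=key))
```

Only `merge` preserves a global sort order, and only if every input partition is already sorted on the same key.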

(30)

COMPONENTS

The compress components compress data or expand compressed data:

• DEFLATE and INFLATE work on all platforms.

• DEFLATE reduces the volume of data in a flow and INFLATE reverses the effects of DEFLATE.

• COMPRESS and UNCOMPRESS are available on Unix and Linux platforms; not on Windows.

• COMPRESS reduces the volume of data in a flow and UNCOMPRESS reverses the effects of COMPRESS.

• COMPRESS cannot output more than 2 GB of data on Linux

(31)

COMPONENTS

The MVS dataset components perform as follows:

MVS INPUT FILE reads an MVS dataset as an input to your graph.

MVS INTERMEDIATE FILE represents MVS datasets that contain intermediate results that your graph writes and saves for review.

MVS LOOKUP FILE contains shared MVS data for use with the DML lookup functions; it allows access to records according to a key.

MVS OUTPUT FILE represents records written as output from a graph into an MVS dataset.

(32)

COMPONENTS

The EME (Enterprise Meta>Environment) components are in the $AB_HOME/Connectors > EME folder. They transfer data into and out of the EME.

LOAD ANNOTATION VALUES attaches annotation values to objects in an EME datastore, and validates the annotation for the object to which it is to be attached.

LOAD CATEGORY stores annotation values in EME datastores for objects that are members of a given category.

LOAD FILE DATASET creates a file dataset in an EME datastore, and attaches its record format.

LOAD MIMEOBJ creates a MIME object in an EME datastore with the specified MIME type.

LOAD TABLE DATASET creates a table dataset in an EME datastore and attaches its record format.

LOAD TYPE creates a DML record object in an EME datastore.

REMOVE OBJECTS AND ANNOTATIONS allows you to delete objects and annotation values from an EME datastore.

UNLOAD CATEGORY unloads the annotation rules of all objects that are

(33)

COMPONENTS

The FTP (File Transfer Protocol) components transfer data, as follows:

FTP FROM transfers files of records from a computer not running the Co>Operating System to a computer running the Co>Operating System.

FTP TO transfers files of records to a computer not running the Co>Operating System from a computer running the Co>Operating System.

SFTP FROM transfers files of records from a computer not running the Co>Operating System to a computer running the Co>Operating System, using the sftp or scp utilities to connect to a Secure Shell (SSH) server on the remote machine and transfer the files via the encrypted connection provided by SSH.

SFTP TO transfers files of records from a computer running the Co>Operating System to a computer not running the Co>Operating System, using the sftp or scp utilities to connect to a Secure Shell (SSH) server on the remote machine and transfer the files via the encrypted connection provided by SSH.

(34)

COMPONENTS

The validate components test, debug, and check records, and produce data for testing Ab Initio graphs:

CHECK ORDER tests whether records are sorted according to a key specifier.

COMPARE CHECKSUMS compares two checksums generated by COMPUTE CHECKSUM. Typically, you use COMPARE CHECKSUMS to compare checksums generated from two sets of records, each set computed from the same data by a different method, in order to check the correctness of the records.

COMPARE RECORDS compares records from two flows one by one.

COMPUTE CHECKSUM calculates a checksum for records.

GENERATE RANDOM BYTES generates a specified number of records, each consisting of a specified number of random bytes. Typically, the output of GENERATE RANDOM BYTES is used for testing a graph. For more control over the content of the records, use GENERATE RECORDS.

GENERATE RECORDS generates a specified number of records with fields of specified lengths and types. You can let GENERATE RECORDS generate random values within the specified length and type for each field, or you can control various aspects of the generated values. Typically, you use the output of GENERATE RECORDS to test a graph.
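The COMPUTE CHECKSUM and COMPARE CHECKSUMS pairing described above follows a familiar pattern: reduce each record set to a digest, then compare the digests. The Python sketch below shows that pattern with SHA-256; the source does not say which checksum algorithm the components use, so the algorithm, the function names and the string-record layout are all assumptions.

```python
import hashlib


def compute_checksum(records):
    """Reduce a sequence of string records to a single digest."""
    digest = hashlib.sha256()
    for record in records:
        data = record.encode("utf-8")
        # A length prefix keeps record boundaries part of the checksum.
        digest.update(len(data).to_bytes(8, "big"))
        digest.update(data)
    return digest.hexdigest()


def compare_checksums(first, second):
    """True when two record sets produced the same digest."""
    return first == second
```

A typical use is to compute the same result by two different methods and confirm that `compare_checksums(compute_checksum(a), compute_checksum(b))` is true. Note that this sketch is order-sensitive: the same records in a different order give a different digest.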

(35)

COMPONENTS

• Where do the actual data files represented by INPUT FILE components reside?

• Where should the Co>Operating System write the files represented by OUTPUT FILE components, or do those files already exist?

• If a file represented by an OUTPUT FILE component does exist, should the graph append the data it produces to the existing data in the file, or should the new data overwrite the old?

• What field of the data should a program component use as a key when processing the data?

• What is the record format attached to a particular port?

• How much memory should a component use for processing before it starts writing temporary files to disk?

• How many partitions do you want to divide the data into at any particular point in the graph?

• What are the locations of the processors you want to use to execute various parts of the graph?

• What is the location of the Co>Operating System you want to use to control the execution of the graph?

(36)

COMPONENTS

Configuring Components:

A component comprises settings that dictate how it affects the data flowing through it. These settings are viewed and set on the following tabs of the Properties window:

• Description

• Access

• Layout

• Parameters

• Ports

(37)

COMPONENT PROPERTIES

DESCRIPTION:

• Specify labels

• Component-specific selections

• Data locations

• Partitions

• Comments

(38)

COMPONENT PROPERTIES

ACCESS:

Specification of file handling methods:

• Creation/deletion

• Rollback settings

• Exclusive access

• File protection

(39)

COMPONENT PROPERTIES

LAYOUT:

• The locations of files.

• The number and locations of the partitions of multifiles.

• The number of the partitions of program components and the locations where they execute.

A layout is one of the following:

• A URL that specifies the location of a serial file

• A URL that specifies the location of the control partition of a multifile

• A list of URLs that specify the locations of:

  - The partitions of an ad hoc multifile

  - The working directories of a program component

Every component in a graph, both dataset and program components, has a layout. Some graphs use one layout throughout; others use several layouts and repartition data when needed for processing by a greater or lesser number of processors.

(40)

COMPONENT PROPERTIES

PARAMETERS:

• Set various component-specific configurations

• Specify transform settings

• Handle rejects, logs and thresholds

• Apply filters

• Specify keys for sort and partition

• Specify the URL for transforms

• Parameter interpretation

(41)

COMPONENT PROPERTIES

PORTS:

• Assign record formats for various ports

• Specify the URL of the record format

• Interpretation settings for any

(42)

EDITORS

EDITORS:

Configuring components using the various tabs involves changing settings for:

• Transforms

• Filters

• Keys

• Record Formats

• Expressions

These settings are created using:

• Text editors

(43)

EDITORS: RECORD FORMAT

The Record Format Editor enables you to easily create and edit record formats.

record
  decimal(6) cust_id;
  string(18) last_name;
  string(16) first_name;
  string(26) street_addr;
  string(2) state;
  string(1) newline;
end;
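The record format above describes a fixed-width record of 69 characters. As an illustration of what such a format means for the raw data, the following Python sketch slices one line into the six fields. It assumes single-byte characters, space-padded strings and a text-encoded decimal; the function and constant names are made up for the example.

```python
# (field name, width in characters), in the order given by the record format.
FIELDS = [
    ("cust_id", 6),
    ("last_name", 18),
    ("first_name", 16),
    ("street_addr", 26),
    ("state", 2),
    ("newline", 1),
]
RECORD_LENGTH = sum(width for _, width in FIELDS)


def parse_record(line):
    """Split one fixed-width line into a dictionary of field values."""
    if len(line) != RECORD_LENGTH:
        raise ValueError(f"expected {RECORD_LENGTH} characters, got {len(line)}")
    record = {}
    offset = 0
    for name, width in FIELDS:
        record[name] = line[offset:offset + width]
        offset += width
    record["cust_id"] = int(record["cust_id"])   # decimal(6) read as a number
    for name in ("last_name", "first_name", "street_addr", "state"):
        record[name] = record[name].rstrip()     # drop the padding
    return record
```

Because every field has a fixed width, no delimiters are needed: the position of each value in the line is determined entirely by the format.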

(44)

(45)

EDITORS: EXPRESSION EDITOR

Fields:

Displays the input record formats available for use in an expression.

Operators:

Displays the built-in DML operators.

Functions:

Displays the built-in DML functions.

(46)

EDITORS: KEY SPECIFIER

Key Specifier Editor:

Sort sequences and their details:

phonebook: Digits are treated as the lowest-value characters, followed by the letters of the alphabet in the order AaBbCcDd..., followed by spaces. All other characters, such as punctuation, are ignored. The order of digits is 0, 1, 2, 3, 4, 5, 6, 7, 8, 9.

index: The same as phonebook ordering, except that punctuation characters are not ignored; they have lower values than all other characters. The order of punctuation characters is the machine sequence.

machine: Uses character code values in the sequence in which they are arranged in the character set of the string.

• For ASCII-based character sets, digits are the lowest-value characters, followed by uppercase letters, followed by lowercase letters.

• For EBCDIC character sets, lowercase letters are the lowest-value characters, followed by uppercase, followed by digits.

• For Unicode, the order is from the lowest character code value to the highest.

custom: Uses the user-defined sort order. You construct a custom sequence modifier by naming groups of characters, or by naming the characters themselves
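The effect of the sort sequence on the result can be seen by sorting the same strings with two different key functions. The Python sketch below implements the phonebook and machine rules literally as they are worded above, character by character, for ASCII input; the real collation may be more refined, and the function names are assumptions.

```python
import string

# Digits first, then AaBbCc..., then space; anything else is ignored.
_PHONEBOOK_ORDER = (
    string.digits
    + "".join(u + l for u, l in zip(string.ascii_uppercase,
                                    string.ascii_lowercase))
    + " "
)
_RANK = {ch: i for i, ch in enumerate(_PHONEBOOK_ORDER)}


def phonebook_key(s):
    """Rank each character by the phonebook order, skipping punctuation."""
    return [_RANK[ch] for ch in s if ch in _RANK]


def machine_key(s):
    """Rank each character by its character code."""
    return [ord(ch) for ch in s]
```

Sorting `["apple", "Zebra", "O'Neil", "Oak", "9lives"]` with `machine_key` puts all uppercase-initial names before `apple`, while `phonebook_key` places `apple` right after `9lives` and ignores the apostrophe in `O'Neil`.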

(47)

EDITORS: TRANSFORM EDITOR

out :: reformat(in) =
begin
  out.* :: in.*;
  out.score :: if (in.income > 1000) 1 else 2;
end;

(48)

COMPONENTS

INPUT FILE

Represents records read as input to a graph from one or more serial files or from a multifile.

Location: Datasets

Key Settings:

• Data Location URL

• Ports

OUTPUT FILE

Represents records written as output from a graph into one or more serial files or a multifile.

Location: Datasets

Key Settings:

• Data Location URL

• Ports

Note: When the target of an OUTPUT FILE component is a particular file (such as /dev/null, NUL, a named pipe, or some other special file), the Co>Operating System never deletes and re-creates that file, nor does it ever truncate it.

(49)

COMPONENTS

Sorts and merges records. SORT can be used to order records before you send them to a component that requires grouped or sorted records.

Location: Sort

Key Parameters:

• key: The field to be used as the sort key

• max-core: Amount of memory to be used per partition.

Note:

• Use the Key Specifier Editor to add or modify keys, sort order and sequences.

• Sort order is affected by the sort sequence used. The character set used can also affect the sort order.

(50)

COMPONENTS

FILTER BY EXPRESSION

Filters records according to a DML expression.

Location: Transform

Key Settings:

• select_expr: The expression defining the filter key

Note: Use the Expression Editor to add or modify the DML expression.

(51)

COMPONENTS

REFORMAT

Changes the format of records by dropping fields, or by using DML expressions to add fields, combine fields, or transform the data in the records.

• count: Number of transformed outputs, n.

• transformn: The rules defining the transformation

• select: The expression specifying any filters for input

• reject-threshold: Defines the rule to abort the process.

Note: Use the Expression Editor to add or modify the DML expression for select. Use the Transform Editor to add or modify transform rules.

(52)

TRANSFORM FUNCTIONS

A transform function is a collection of rules that specifies how to produce result records from input records.

• The exact behavior of a transform function depends on the component using it.

• Transform functions express record reformatting logic.

• Transform functions encapsulate a wide variety of computational knowledge that cleanses records, merges records, and aggregates records.

• Transform functions perform their operations on the data records flowing into the component and write the resulting data records to the out flow.

Example:

The purpose of the transform function in the REFORMAT component is to construct data records flowing out of the component by:

• Using all the fields in the record format of the data records flowing into the component

• Creating a new field containing a title (Mr. or Ms.) using the gender field of the data records flowing into the component

• Creating a new field containing a score computed by dividing the income field by 10
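The three rules of this example can be written out as a plain Python function to show what the transform does to each record. This is an analogy, not DML: the function name, the use of dictionaries for records and the gender codes `"M"` and `"F"` are assumptions.

```python
def reformat(record):
    """Build one output record from one input record."""
    out = dict(record)                                       # keep all input fields
    out["title"] = "Mr." if record["gender"] == "M" else "Ms."  # assumes "M"/"F" codes
    out["score"] = record["income"] / 10                     # income divided by 10
    return out
```

For an input record with `gender` equal to `"F"` and `income` equal to 2500, the output keeps every input field and adds `title` set to `"Ms."` and `score` set to 250.0. The component calls the transform once per record, so the function never needs to see more than one record at a time.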

(53)

TRANSFORM FUNCTIONS

Transform functions (or transforms) drive nearly all data transformation and computation in Ab Initio graphs. Typical simple transforms can:

• Extract the year from a date.

• Combine first and last names into a full name.

• Determine the amount owed in sales tax from a transaction amount.

• Validate and cleanse existing fields.

• Delete bad values or reject records containing invalid data.

• Set default values.

• Standardize field formats.

• Merge or aggregate records.

A transform function is a collection of business rules, local variables, and statements. The transform expresses the connections between the rules, variables, and statements, as well as the

(54)

TRANSFORM EDITOR

Creating Transforms

Transforms consist of:

• Rules

• Variables

• Statements

Relationships between them are created using the Transform Editor.

The Transform Editor has two views:

• Grid view: Rules are created by dragging and dropping fields from the input flow, functions and operators.

(55)

TRANSFORM EDITOR

[Figure: Transform Editor in grid view, showing the Input Fields, Transform Function and Output Fields panes.]

(56)

TRANSFORM EDITOR

Text View:

Typically, you use the text view of the Transform Editor to enter or edit transforms using DML. In addition, Ab Initio software supports the following text alternatives:

• You can enter the DML into a standalone text file, then include that file in the transform's package.

• You can select Embed and enter the DML code into the Value box on the Parameters tab of the component Properties dialog.

(57)

TRANSFORM FUNCTIONS

Untyped transforms:

out :: trans1(in) =
begin
  out.x :: in.a;
  out.y :: in.b + 1;
  out.z :: in.c + 2;
end;

Typed transforms:

decimal(12) out :: add_one(decimal(12) in) =
begin
  out :: in + 1;
end;

record integer(4) x, y; end out :: fiddle(in1, double in2) =
begin
  out.x :: size_of(in1);
  out.y :: in2 + in2;
end;

(58)

TRANSFORM EDITOR

Rule:

A rule, or business rule, is an instruction in a transform function that directs the construction of one field in an output record. Rules can express everything from simple reformatting logic for field values to complex computations.

Rules are created in the Expression Editor, which opens when you right-click the rule line.

(59)

TRANSFORM EDITOR

Prioritized Rules:

• Priorities can be optionally assigned to the rules for a particular output field.

• They are evaluated in order of priority, starting with the assignment of lowest-numbered priority and proceeding to assignments of higher-numbered priority.

• The last rule evaluated will be the one with blank priority, which places it after all others in priority.

• A single output field can have multiple rules attached to it.

• Prioritized rules are always evaluated in the ascending order of priority.

In Text View:

out :: average_part_cost(in) =
begin
  out :1: if (in.num_parts > 0) in.total_cost / in.num_parts;
  out :2: 0;
end;
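The fallback behavior of prioritized rules can be sketched in Python. This is an illustration under a stated assumption: a rule counts as failed when it yields no value (None here), and the next rule in priority order is then tried. Real DML semantics around NULL values and errors are richer than this. The names first_successful and cost_per_part are hypothetical.

```python
def first_successful(rules, rec):
    """Try rules in ascending priority; return the first value that is not None."""
    # A blank priority is modeled as None and sorts after every numbered priority.
    ordered = sorted(rules, key=lambda r: (r[0] is None, r[0] or 0))
    for _priority, rule in ordered:
        value = rule(rec)
        if value is not None:
            return value
    return None


def cost_per_part(rec):
    # Mirrors rule 1: yields no value when num_parts is not positive.
    if rec["num_parts"] > 0:
        return rec["total_cost"] / rec["num_parts"]
    return None


average_part_cost = [
    (2, lambda rec: 0),      # fallback rule, priority 2
    (1, cost_per_part),      # preferred rule, priority 1
]
```

With num_parts greater than zero the first rule supplies the value; otherwise the second rule supplies 0.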

(60)

TRANSFORM FUNCTIONS

Local Variable:

A local variable is a variable declared within a transform function. You can use local variables to simplify the structure of rules or to hold values used by multiple rules.

• To declare (or modify) local variables, use the Variables Editor.

• To initialize local variables, drag-and-drop the variable from the Variables tab to the Output pane.

Alternatively, enter the equivalent DML code in the text view of the Transform Editor. For example:

out :: rollup(in) =
begin
  let string(7) myvar = "a";
  let decimal(2) mydec = 2.2;
end;

Statements:

A statement can assign a value to a local variable, a global variable, or an output field; define processing logic; or control the number of iterations of another

(61)

COMPONENTS

INPUT TABLE unloads records from a database into a graph, allowing you to specify as the source either a database table or an SQL statement that selects records from one or more tables.

Location: Database / Datasets

Key Settings:

• Config File: Location of the file defining key parameters to access the database.

• Source: Choose to use a table or a SQL statement.

Note: Database settings and other prerequisites need to be completed before the config file can be created.

(62)

COMPONENTS

JOIN reads data from two or more input ports, combines records with matching keys according to the transform you specify, and sends the transformed records to the output port. Additional ports allow you to collect rejected and unused records. JOIN can have as many as 20 input ports.

Key Settings:

• Count: An integer from 2 to 20 specifying the total number of inputs (in ports) to join.

• sorted-input: Specifies whether the input data is sorted.

• Key: Name(s) of the field(s) in the input records that must have matching values for JOIN to call the transform function.

• Transform: Transform function specifying the resultant fields from the join.

(63)

COMPONENTS

Key Settings (continued):

• Join-type: Specifies the join method: Inner Join, Outer Join, or Explicit.

• Record-requiredn: Dependent setting of join-type.

• Dedupn: Removes duplicate records on the specified port.

• Selectn: Acts as a component-level filter for the port.

• Override-keyn: Alternative name(s) for the key field(s) for a particular inn port.

In the parameter names above, n stands for the number of the in port that the setting applies to.
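The matching behavior of JOIN can be sketched in Python for the two-input case. This is a simplified in-memory illustration, not the component's implementation: it does not model sorted input, dedup, select, or the Explicit join type, and it represents a missing partner record as None. The function names join and combine and the sample data are hypothetical.

```python
from collections import defaultdict


def join(in0, in1, key, transform, join_type="inner"):
    """Combine records from two inputs whose key fields match."""
    index = defaultdict(list)
    for rec in in1:
        index[rec[key]].append(rec)      # group in1 records by key value

    out, matched = [], set()
    for left in in0:
        rights = index.get(left[key], [])
        for right in rights:
            out.append(transform(left, right))
        if rights:
            matched.add(left[key])
        elif join_type == "outer":
            out.append(transform(left, None))    # in0 record with no partner
    if join_type == "outer":
        # in1 records whose key never appeared on in0.
        for k, rights in index.items():
            if k not in matched:
                out.extend(transform(None, r) for r in rights)
    return out


def combine(cust, txn):
    """Transform: build the output record from the matched pair."""
    return {
        "id": (cust or txn)["id"],
        "name": cust["name"] if cust else None,
        "amount": txn["amount"] if txn else None,
    }


customers = [{"id": 1, "name": "Lee"}, {"id": 2, "name": "Kim"}]
transactions = [{"id": 1, "amount": 30}, {"id": 3, "amount": 5}]
inner = join(customers, transactions, "id", combine)
outer = join(customers, transactions, "id", combine, join_type="outer")
```

The inner join keeps only the pair with a matching id, while the outer join also passes unmatched records from either input to the transform.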

(64)

COMPONENTS

JOIN WITH DB joins records from the flow or flows connected to its in port with records read directly from a database, and outputs new records containing data based on, or calculated from, the joined records.

Location: Database

Key Settings:

• DBConfigfile: A database configuration file. Specifies the database to connect to.

• select_sql: The SELECT statement to perform for each input record.

(65)

SPECIAL TRANSFORMS

AGGREGATE generates records that summarize groups of records.

ROLLUP is the newer version of the AGGREGATE component. ROLLUP offers better control over record selection, grouping and aggregation.

(66)

VECTORS

A vector is an array of elements, indexed from 0. An element can be a single field or an entire record. Vectors are often used to provide a logical grouping of information.

An array is a collection of elements that are logically grouped for ease of access. Each element has an index by which the element can be referenced (read or written).

Example:

char myarray[10] = [a, b, c, d, e, f, g, h, i, j];

The above is an example of an array with 10 elements of type char. An element of this array is addressed as

myarray[0]

The above expression returns the first element, 'a'. Indices run from 0 to n-1, where n is the declared number of array elements.

(67)

SPECIAL COMPONENTS

NORMALIZE generates multiple output records from each of its input records. You can directly specify the number of output records for each input record, or the number of output records can depend on some calculation.

Location: Transform

Key Settings:

• Transform: Either the name of the file containing the types and transform functions, or a transform string.
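The one-to-many behavior of NORMALIZE can be sketched in Python. This is an illustration only, assuming each input record carries a vector field and that one output record is produced per vector element. The parameter names length and transform, the function name normalize, and the sample data belong to this sketch.

```python
def normalize(records, length, transform):
    """Emit length(rec) output records for each input record."""
    for rec in records:
        for i in range(length(rec)):     # number of outputs is computed per record
            yield transform(rec, i)


orders = [
    {"order_id": 7, "items": ["bolt", "nut", "washer"]},
    {"order_id": 8, "items": []},        # produces no output records
]
rows = list(normalize(
    orders,
    length=lambda rec: len(rec["items"]),
    transform=lambda rec, i: {"order_id": rec["order_id"], "item": rec["items"][i]},
))
```

Here the number of output records depends on a calculation (the vector length); passing a constant length function would model a fixed number of outputs per input.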

(68)

SPECIAL COMPONENTS

DENORMALIZE SORTED consolidates groups of related records by key into a single output record with a vector field for each group, and optionally computes summary fields in the output record for each group.

Location: Transform

Key Settings:

• Key: Specifies the name(s) of the key field(s) the component uses to define groups of records.

• Transform: Specifies either the name of the file containing the transform function, or a transform string.
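The grouping behavior of DENORMALIZE SORTED can be sketched in Python with itertools.groupby, which, like the component, relies on records with the same key arriving next to each other. This is an illustration only; the function name, the output field names, and the sample data are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def denormalize_sorted(records, key, field):
    """Collapse each run of records sharing `key` into one record with a vector."""
    out = []
    for k, group in groupby(records, key=itemgetter(key)):
        values = [rec[field] for rec in group]
        # One output record per group: the vector plus an optional summary field.
        out.append({key: k, field + "s": values, "total": sum(values)})
    return out


# Input is assumed to be sorted (or at least grouped) by the key already.
payments = [
    {"cust": "A", "amount": 10},
    {"cust": "A", "amount": 5},
    {"cust": "B", "amount": 7},
]
summary = denormalize_sorted(payments, "cust", "amount")
```

This is the inverse of the NORMALIZE idea: many input records per key become one output record holding a vector.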

(69)

PHASES AND PARALLELISM

Parallelism and multifiles

• Graphs can be scaled to accommodate any amount of data by introducing parallelism (doing more than one thing at the same time). Data parallelism, in particular, separates the data flowing through the graph into divisions called partitions. Partitions can be sent to as many processors as needed to produce the result in the desired length of time.

• Ab Initio multifiles store partitioned data wherever convenient (on different disks, on different computers in various locations, and so on) and yet manage all the partitions as a single entity.

Phases and checkpoints

• Resources can be regulated by dividing a graph into stages, called phases, each of which must complete before the next begins.

• Checkpoints are used at the end of any phase to guard against loss in case of failure. If a problem occurs, you can return the state of the graph to the last successfully completed checkpoint and then rerun it without having to start over from the beginning.

(70)

PARALLELISM

The power of Ab Initio software to process large quantities of data is based on its use of parallelism: doing more than one thing at the same time.

The Co>Operating System uses three types of parallelism:

• Component parallelism

• Pipeline parallelism

• Data parallelism

(71)

COMPONENT PARALLELISM

Component parallelism occurs when program components execute simultaneously on different branches of a graph.

In the graph above, the CUSTOMERS and TRANSACTIONS datasets are unloaded and sorted on separate branches, then merged into a dataset named MERGED INFORMATION.

Component parallelism scales with the number of branches of a graph: the more branches a graph has, the greater the component parallelism. If a graph has only one branch, component parallelism cannot occur.
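The two-branch structure described above can be sketched in Python with a thread pool. This illustrates only the shape of component parallelism (two independent branches submitted at once, then merged); it says nothing about how the Co>Operating System schedules components, and Python threads do not by themselves guarantee a speedup. The sample data and the function name sort_branch are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
import heapq


def sort_branch(dataset):
    """One branch of the graph: sort a single dataset."""
    return sorted(dataset)


customers = [3, 1, 2]
transactions = [6, 4, 5]

with ThreadPoolExecutor(max_workers=2) as pool:
    # The two branches are independent, so they can run at the same time.
    fut_customers = pool.submit(sort_branch, customers)
    fut_transactions = pool.submit(sort_branch, transactions)
    # The merge step needs both sorted branches before it can finish.
    merged_information = list(
        heapq.merge(fut_customers.result(), fut_transactions.result())
    )
```

With only one branch there would be a single task to submit, which matches the statement that a one-branch graph has no component parallelism.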

(72)

PIPELINE PARALLELISM

Both SCORE and SELECT read records as they become available and write each record immediately after processing it. After SCORE finishes scoring the first CUSTOMER record and sends it to SELECT, SELECT determines the destination of the record and sends it on. At the same time, SCORE reads the second CUSTOMER record. The two processing stages of the graph run concurrently; this is pipeline parallelism.
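The record-at-a-time hand-off between SCORE and SELECT can be sketched with Python generators. Generators interleave the two stages in a single thread, so this models the streaming behavior (a downstream stage starts before the upstream stage has finished all records) rather than true concurrency. The scoring formula, the threshold, and the sample data are hypothetical.

```python
events = []   # records the order in which the stages touch each record


def score(customers):
    for rec in customers:
        events.append(("score", rec["id"]))
        yield {**rec, "score": rec["income"] / 10}    # hand one record downstream


def select(scored):
    for rec in scored:
        events.append(("select", rec["id"]))
        # Decide the destination of the record as soon as it arrives.
        yield rec["id"], ("high" if rec["score"] >= 5000 else "low")


customers = [{"id": 1, "income": 40000}, {"id": 2, "income": 90000}]
routed = list(select(score(customers)))
```

The events list shows record 1 passing through both stages before record 2 is scored, which is the streaming pattern that lets the stages of a real pipeline overlap.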

Pipeline parallelism occurs when several connected program components on the same branch of a graph execute

(73)

DATA PARALLELISM

Data parallelism occurs when a graph separates data into multiple divisions, allowing multiple copies of program components to operate on the data in all the divisions simultaneously.

The divisions of the data and the copies of the program components that create data parallelism are called partitions, and a component partitioned in this way is called a parallel component.

If each partition of a parallel program component runs on a separate processor, the increase in the speed of processing is almost directly proportional to the number of partitions.
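Data parallelism can be sketched in Python as one function applied to several partitions at once. This is an illustration only: the round-robin split, the depth of 3, and the doubling function are all hypothetical, and a thread pool stands in for separate processors.

```python
from concurrent.futures import ThreadPoolExecutor


def partition_round_robin(records, depth):
    """Deal records into `depth` partitions, one record at a time."""
    parts = [[] for _ in range(depth)]
    for i, rec in enumerate(records):
        parts[i % depth].append(rec)
    return parts


def component(partition):
    """The same program logic runs on every partition."""
    return [x * 2 for x in partition]


records = list(range(10))
partitions = partition_round_robin(records, 3)
with ThreadPoolExecutor(max_workers=3) as pool:
    # One copy of the component per partition.
    results = list(pool.map(component, partitions))
```

Each partition is processed independently, so the result stays partitioned until something downstream gathers it.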

(74)

PARTITIONS


• FLOW PARTITIONS

When you divide a component into partitions, you divide the flows that connect to it as well. These divisions are called flow partitions.

• PORT PARTITIONS

The port to which a partitioned flow connects is partitioned as well, with the same number of port partitions as the flow connected to it.

• DEPTH OF PARALLELISM

The number of partitions of a component, flow, port, graph, or section of a graph determines its depth of parallelism.

(75)

PARTITIONS

PARALLEL FILES

Sometimes you will want to store partitioned data in its partitioned state in a parallel file for further parallel processing. If you locate the partitions of the parallel file on different disks, the parallel components in a graph can all read from the file at the same time, rather than being limited by all having to take turns reading from the same disk. These parallel files are called multifiles.

(76)

MULTIFILE SYSTEM

Multifiles are parallel files composed of individual files, which may be located on separate disks or systems. These individual files are the partitions of the multifile. Understanding the concept of multifiles is essential when you are developing parallel applications that use files, because the parallelization of data drives the parallelization of the application.

Data parallelism makes it possible for Ab Initio software to process large amounts of data very quickly. In order to take full advantage of the power of data parallelism, you need to store data in its parallel state. In this state, the partitions of the data are typically located on different disks on various machines. An Ab Initio multifile is a parallel file that allows you to manage all the partitions of the data, no matter where they are located, as a single entity.

Multifiles are organized by using a Multifile System, which has a directory tree structure that allows you to work with multifiles and the directory structures that contain them in the same way you would work with serial files. Using an Ab Initio multifile system, you can apply familiar file management operations to all partitions of a multifile from a central point of administration, no matter how diverse the machines on which they are located. You do this by referencing the control partition of the multifile with a single Ab Initio URL.

(77)

MULTIFILE SYSTEM

Multifile

• An Ab Initio multifile organizes all partitions of a multifile into one single virtual file that you can reference as one entity.

• An Ab Initio multifile can only exist in the context of a multifile system.

• Multifiles are created by creating a multifile system, and then either outputting parallel data to it with an Ab Initio graph, or using the m_touch command to create an empty multifile in the multifile system.

Ad Hoc Multifiles

• A parallel dataset with partitions that are an arbitrary set of serial files containing similar data.

• Created explicitly by listing a set of serial files as partitions, or by using a shell expression that expands at runtime to a list of serial files.

• The serial files listed can be anything from a serial dataset divided into multiple serial files to any set of serial files containing similar data.

(78)

MULTIFILE SYSTEM

- Visualize a directory tree containing subdirectories and files.

- Now imagine 3 identical copies of the same tree located on several disks, and number them 0 to 2. These are the data partitions of the multifile system.

- Then add one more copy of the tree to serve as the control partition.

- This Multifile system will be referenced on GDE using Ab Initio URLs.
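The picture above (one control tree plus identical data trees) can be simulated in Python with ordinary directories. This sketch does not call any Ab Initio command and makes no claim about what the control partition actually stores; it only mirrors the idea that one operation is applied to every partition. All function and directory names are hypothetical.

```python
import tempfile
from pathlib import Path


def make_mfs(root, depth):
    """Create one control directory and `depth` data partition directories."""
    control = root / "control"
    data = [root / f"data-{i}" for i in range(depth)]
    for d in [control, *data]:
        d.mkdir(parents=True)
    return control, data


def mfs_mkdir(control, data, name):
    """Create the same subdirectory in every partition."""
    for d in [control, *data]:
        (d / name).mkdir()


def mfs_touch(control, data, relpath):
    """Create the same empty file in every partition."""
    for d in [control, *data]:
        (d / relpath).touch()


with tempfile.TemporaryDirectory() as tmp:
    control, data = make_mfs(Path(tmp), 3)
    mfs_mkdir(control, data, "cust")
    mfs_touch(control, data, "cust/t.out")
    partition_files = sorted(
        p.relative_to(tmp).as_posix() for p in Path(tmp).rglob("t.out")
    )
```

One logical file, t.out, ends up as a file in each of the four trees, which is why the trees must stay identical in structure.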

(79)

MULTIFILE SYSTEM

Creating the Multifile system

MFS

m_mkfs //pluto.us.com/usr/ed/mfs3 \
  //pear.us.com/p/mfs3-0 \
  //pear.us.com/q/mfs3-1 \
  //plum.us.com/D:/mfs3-2

Multidirectory

m_mkdir //pluto.us.com/usr/ed/mfs3/cust

Multifile

m_touch //pluto.us.com/usr/ed/mfs3/cust/t.out \

(80)

ASSIGNMENT

Creating a Multidirectory

• Open the command console.

• Enter the following command

(81)

LAYOUTS

A layout of a component specifies:

• The locations of files: A URL specifying the location of file.

• The number and locations of the partitions of multifiles: A URL that specifies the location of the control partition of a multifile.

Every component in a graph — both dataset and program components — has a layout. Some graphs use one layout throughout; others use several layouts. The layouts you choose can be critical to the success or failure of a graph. For a layout to be effective, it must fulfill the following requirements:

• The Co>Operating System must be installed on the computers specified by the layout.

• The run host must be able to connect to the computers specified by the layout.

• The working directories in the layout must exist.

• The permissions in the directories of the layout must allow the graph to write files there.

• The layout must allow enough space for the files the graph needs to write there.

During execution, a graph writes various files in the layouts of some or all of the components in it.

(82)

(83)

FLOWS

Flows indicate the type of data transfer between the components

(84)

FLOWS

Straight Flow:

A straight flow connects components with the same depth of parallelism, including serial components, to each other. If the components are serial, the flow is serial. If the components are parallel, the flow has the same depth of parallelism as the

(85)

FLOWS

Fan Out flow:

A fan-out flow connects a component with a lesser number of partitions to one with a greater number of partitions; in other words, it follows a one-to-many pattern. The component with the greater depth of parallelism determines the depth of parallelism of a fan-out flow.

NOTE: You can only use a fan-out flow when the result of dividing the greater number of partitions by the lesser number of partitions is an integer. If this is not the case, you must use an all-to-all flow.

(86)

FLOW

Fan In flow:

A fan-in flow connects a component with a greater depth of parallelism to one with a lesser depth; in other words, it follows a many-to-one pattern. As with a fan-out flow, the component with the greater depth of parallelism determines the depth of parallelism of a fan-in flow.

(87)

FLOW

All to All flow:

An all-to-all flow connects components with the same or different degrees of parallelism in such a way that each output port partition of one component is connected to each input port partition of the other component.
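The four flow types can be summarized as a small Python decision function over the depths of parallelism of the two connected components. This is a sketch, not a rule taken from the product: the divisibility condition is stated in the slides only for fan-out and is assumed here to apply to fan-in by symmetry, and the function always picks the simplest flow even though an all-to-all flow can also connect components of equal depth. The name flow_type is hypothetical.

```python
def flow_type(src_depth, dst_depth):
    """Pick the simplest flow that can connect two components."""
    if src_depth == dst_depth:
        return "straight"
    if src_depth < dst_depth and dst_depth % src_depth == 0:
        return "fan-out"       # one-to-many, whole-number ratio required
    if src_depth > dst_depth and src_depth % dst_depth == 0:
        return "fan-in"        # many-to-one; divisibility assumed by symmetry
    return "all-to-all"        # ratio is not a whole number
```

For example, connecting 2 partitions to 4 gives a fan-out flow, while connecting 2 partitions to 3 falls through to all-to-all.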

(88)

SANDBOX

A sandbox is a special directory (folder) containing a certain minimum number of specific subdirectories for holding Ab Initio graphs and related files. These subdirectories have standard names that indicate their function. The sandbox directory itself can have any name; its properties are recorded in various special and hidden files that lie at its top.

(89)

SANDBOX

Parts of a graph

• Data files (.dat)

• The Transactions input dataset and the Transaction Data output dataset. If these are multifiles, the actual datasets will occupy multiple files.

• Record formats (.dml)

• There are two separate record formats in the graph, one for the input file, the other for the output. These record formats could be embedded in the Input File and Output File components themselves.

• Transforms (.xfr)

• The transform function can be embedded in the component, but (as with the record formats)

• Graph file (.mp)

• The graph itself is stored complete as an .mp file.

• Deployed script (.ksh)

• If deployed as a script, the graph will also exist as a .ksh file, which has to be stored somewhere.

(90)

SANDBOX

Sandbox Parameters:

Sandbox parameters are variables which are visible to any component in any graph which is stored in that sandbox. Here are some examples of sandbox parameters:

• $PROJECT_DIR

• $DML

• $RUN

Graphs refer to the sandbox subdirectories by using sandbox

(91)

SANDBOX

(92)

SANDBOX

The default sandbox parameters in a GDE-created sandbox are these eight:

PROJECT_DIR: absolute path to the sandbox directory
DML: relative sandbox path to the dml subdirectory
XFR: relative sandbox path to the xfr subdirectory
RUN: relative sandbox path to the run subdirectory
DB: relative sandbox path to the db subdirectory
MP: relative sandbox path to the mp subdirectory
RESOURCE: relative sandbox path to the resource subdirectory
PLAN: relative sandbox path to the plan subdirectory

(93)

GRAPH PARAMETERS

• Graph parameters are associated with individual graphs and are private to them.

• They affect the execution only of the graph for which they are defined.

• All the specifiable values of a graph, including each component's parameters (as well as other values such as URLs, file protections, and record formats), comprise that graph's parameters.

• Graph parameters are part of the graph they are associated with and are checked in or out along with the graph.

• Graph parameters are used when a graph needs to be reused or to facilitate runtime changes to certain settings of components used in the graph.

• These are created using the Graph Parameter editor.

• Values for the graph parameters can be stored in a file (a pset) and supplied when the graph is executed.

(94)

GRAPH PARAMETERS

(95)

PARAMETER SETTING

Name:

• Value: Unique identifier of the parameter.

• Description: String. First character must be a letter; remaining characters can be a combination of letters, digits, and underscores. Spaces are not allowed.

• Example: OUT_DATA_MFS

Scope:

• Value: Local or Formal.

• Description:

• Local: parameter receives its value from the Value column.

• Formal: parameter receives its value at runtime from the command line. When run from GDE, a dialog appears prompting input values.

Kind:

• Sandbox

• Value: Unspecified or Keyword.

• Description:

• Unspecified: parameter values are directly assigned on command line.

• Keyword: The value for a keyword parameter is identified by a keyword preceding it. The keyword is the name of the parameter.

• Graph

• Value: Positional, Keyword or Environment.

• Description:

• Positional: The value for a positional parameter is identified by its position on the command line.

• Keyword: same as Sandbox setting.

(96)

PARAMETER SETTING

Type:

• Value: Common Project, Dependent, Switch or String.

• Description:

• Common Project: to include other shared, or common, project values within the current sandbox.

• Dependent: parameters whose values depend on the value of a switch parameter.

• Switch: purpose of a switch parameter is to allow you to change your sandbox's context:

• String: a normal string representing record formats, file paths, etc.

Dependent on:

• Value: Name of the switch column.

Value:

• Value: The parameter's value, consistent with its type.

Interpretation:

• Value: Constant, $ substitution, ${} substitution and Shell.

• Description: Determines how the string value is evaluated

Required:

• Value: Required or Optional.

• Description: Specifies whether the value is required or optional.

Export:

• Value: Checked /Unchecked

• Description: Specifies if the parameter /value needs to be exported to environment

(97)

ORDER OF EVALUATION

When you run a graph, parameters are evaluated in the following order:

1. The host setup script is run.

2. Common (that is, included) project (sandbox) parameters are evaluated.

3. Project (sandbox) parameters are evaluated.

4. The project-start.ksh script is run.

5. Formal parameters are evaluated.

6. Graph parameters are evaluated.

(98)

GRAPH PARAMS V/S SANDBOX PARAMS

• Graph parameters are visible only to the particular graph to which they belong

• Sandbox parameters are visible to all the graphs stored in a particular sandbox

• Graph parameters are created by Edit>Parameters on the Graph window

• Sandbox parameters are automatically created (defaults) and can be edited by Project>Edit Sandbox or by editing the .air-project-parameters file in the root directory of the sandbox

• Graph parameters are set after Sandbox parameters. If a graph parameter and a sandbox parameter share the same name, the graph parameter has higher precedence.
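The precedence rule above can be sketched in Python with collections.ChainMap, which looks a name up in the first mapping and falls back to the next. This models only the name-shadowing behavior, not how the GDE evaluates parameters; the parameter values are hypothetical.

```python
from collections import ChainMap

sandbox_params = {"RUN": "./run", "OUT_DATA_MFS": "/mfs/sandbox_default"}
graph_params = {"OUT_DATA_MFS": "/mfs/graph_specific"}

# Maps are searched left to right, so a graph parameter shadows a
# sandbox parameter that has the same name.
effective = ChainMap(graph_params, sandbox_params)
```

Looking up OUT_DATA_MFS returns the graph's value, while RUN, which the graph does not define, still resolves from the sandbox.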

(99)

PARAMETER INTERPRETATION

The value of a parameter often contains references to other parameters. This attribute specifies how you want such references to be handled in this parameter. The possibilities are:

$ substitution:

This specifies that if the value for a parameter contains the name of another parameter preceded by a dollar sign ($), the dollar sign and name are replaced with the value of the parameter that's referred to. In other words, $parameter is replaced by the value of parameter. $parameter is said to be "a $ reference" or "dollar-sign reference to parameter".

${} substitution:

${} substitution is similar to $ substitution, except that it additionally requires that you surround the name of the referenced parameter with curly braces: {}. If you do not use curly braces, no substitution occurs, and the dollar sign (and the name that follows it) is taken literally. You should use ${} substitution for parameters that are likely to have values that contain $ as a character, such as names of relational database tables.

constant:

The GDE uses the parameter's specified value literally, as is, with no further interpretation. If the value contains any special characters, such as $, the GDE surrounds them with single quotes in the deployed shell script, so that they are protected from any further interpretation by the shell.

shell:

The shell specified in the Host Settings dialog for the graph, usually the Korn shell, interprets the parameter's value. Note that the value is not a full shell command; rather, it is the equivalent of the value assigned in the definition of a shell variable.

PDL:

The Parameter Definition Language is available as an interpretation method only for the :eme value of a dependent graph parameter. PDL (Parameter Definition Language) is a simple set of notations for expressing the values of parameters in components, graphs, and projects.
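The difference between constant, $ substitution, and ${} substitution can be sketched in Python with regular expressions. This is a simplified illustration, not the GDE's algorithm: shell and PDL interpretation are not modeled, the braced form under $ substitution is not modeled, and leaving references to unknown names untouched is an assumption of the sketch. The function name interpret and the sample values are hypothetical.

```python
import re


def interpret(value, params, mode):
    """Resolve parameter references in `value` according to `mode`."""
    if mode == "constant":
        return value                          # taken literally, as is
    if mode == "$":
        pattern = r"\$(\w+)"                  # matches $NAME
    elif mode == "${}":
        pattern = r"\$\{(\w+)\}"              # matches ${NAME} only
    else:
        raise ValueError("mode not modeled in this sketch: " + mode)

    def replace(match):
        name = match.group(1)
        # Assumption: references to unknown names are left unchanged.
        return params.get(name, match.group(0))

    return re.sub(pattern, replace, value)


params = {"PROJECT_DIR": "/proj"}
```

Under ${} substitution a bare $PRICE in the value survives untouched, which is why that mode suits values that legitimately contain a dollar sign.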

(100)

LOOKUP

A lookup file is a file of data records that is small enough to fit in main memory, letting a transform function retrieve records much more quickly than it could if they were stored on disk. Lookup files associate key values with corresponding data values to index records and retrieve them.

Ab Initio’s lookup components are dataset components with special features.

Data in lookups are accessed using specific functions inside transforms.

Indexes are used to access the data when called by lookup functions.

The data is loaded into memory for quick access.

A common use for a lookup file is to hold data frequently used by a transform component. For example, if the data records of a lookup file contain numeric product codes and their corresponding product names, a component can rapidly search the lookup file to translate codes to names or names to codes.
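The product-code example above can be sketched in Python with dictionaries acting as in-memory indexes. This shows only the idea of keyed retrieval from memory; it does not use Ab Initio's lookup functions, and all names and data are hypothetical.

```python
def build_lookup(records, key):
    """Index records in memory by the value of one field."""
    return {rec[key]: rec for rec in records}


products = [
    {"code": 101, "name": "bolt"},
    {"code": 102, "name": "nut"},
]
by_code = build_lookup(products, "code")    # translate codes to names
by_name = build_lookup(products, "name")    # translate names to codes


def code_to_name(code):
    """Return the product name for a code, or None when there is no match."""
    rec = by_code.get(code)
    return rec["name"] if rec else None
```

A transform would call something like code_to_name once per record, which stays fast because every retrieval is a dictionary access rather than a disk read.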