MULTIFILE SYSTEM

Multifiles are parallel files composed of individual files, which may be located on separate disks or systems. These individual files are the partitions of the multifile. Understanding the concept of multifiles is essential when you are developing parallel applications that use files, because the parallelization of data drives the parallelization of the application.

Data parallelism makes it possible for Ab Initio software to process large amounts of data very quickly. In order to take full advantage of the power of data parallelism, you need to store data in its parallel state. In this state, the partitions of the data are typically located on different disks on various machines. An Ab Initio multifile is a parallel file that allows you to manage all the partitions of the data, no matter where they are located, as a single entity.

Multifiles are organized by using a Multifile System, which has a directory tree structure that allows you to work with multifiles and the directory structures that contain them in the same way you would work with serial files. Using an Ab Initio multifile system, you can apply familiar file management operations to all partitions of a multifile from a central point of administration, no matter how diverse the machines on which they are located. You do this by referencing the control partition of the multifile with a single Ab Initio URL.

Multifile

• An Ab Initio multifile organizes all of its partitions into a single virtual file that you can reference as one entity.

• An Ab Initio multifile can only exist in the context of a multifile system.

• Multifiles are created by creating a multifile system, and then either outputting parallel data to it with an Ab Initio graph, or using the m_touch command to create an empty multifile in the multifile system.

Ad Hoc Multifiles

• A parallel dataset with partitions that are an arbitrary set of serial files containing similar data.

• Created explicitly by listing a set of serial files as partitions, or by using a shell expression that expands at runtime to a list of serial files.

• The serial files listed can be anything from a serial dataset divided into multiple serial files to any set of serial files containing similar data.

MULTIFILE SYSTEM

- Visualize a directory tree containing subdirectories and files.

- Now imagine 3 identical copies of the same tree located on several disks, and number them 0 to 2. These are the data partitions of the multifile system.

- Then add one more copy of the tree to serve as the control partition.

- This multifile system is referenced in the GDE using Ab Initio URLs.
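The picture above can be mimicked with ordinary local directories. The following Python sketch is only a toy model, not Ab Initio code: the class name ToyMultifileSystem, the directory names control and data-N, and the idea of looping over local folders are all assumptions made for illustration. It shows how one operation issued at a central point is applied to the control tree and to every data partition tree.

```python
import tempfile
from pathlib import Path


class ToyMultifileSystem:
    """Toy model (hypothetical): one control tree plus N identical data trees."""

    def __init__(self, root, n_data_partitions):
        root = Path(root)
        # Directory names are illustrative assumptions, not Ab Initio conventions.
        self.control = root / "control"
        self.data = [root / f"data-{i}" for i in range(n_data_partitions)]
        for tree in self.all_trees():
            tree.mkdir(parents=True, exist_ok=True)

    def all_trees(self):
        """Return the control tree followed by the data partition trees."""
        return [self.control, *self.data]

    def mkdir(self, rel_path):
        """Create the same subdirectory in every tree."""
        for tree in self.all_trees():
            (tree / rel_path).mkdir(parents=True, exist_ok=True)

    def touch(self, rel_path):
        """Create the same empty file in every tree."""
        for tree in self.all_trees():
            (tree / rel_path).touch()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        mfs = ToyMultifileSystem(root, 3)   # data partitions numbered 0 to 2
        mfs.mkdir("cust")
        mfs.touch("cust/t.out")
        for tree in mfs.all_trees():
            print(tree / "cust" / "t.out")
```

In a real multifile system the trees may live on different machines and the Co>Operating System does this bookkeeping; the sketch only conveys the one-operation, many-trees idea.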


Creating the Multifile System (MFS)

    m_mkfs //pluto.us.com/usr/ed/mfs3 \
        //pear.us.com/p/mfs3-0 \
        //pear.us.com/q/mfs3-1 \
        //plum.us.com/D:/mfs3-2

Multidirectory

    m_mkdir //pluto.us.com/usr/ed/mfs3/cust

Multifile

    m_touch //pluto.us.com/usr/ed/mfs3/cust/t.out

ASSIGNMENT

Creating a Multidirectory

• Open the command console.

• Enter the following command

LAYOUTS

A layout of a component specifies:

• The locations of files: A URL specifying the location of the file.

• The number and locations of the partitions of multifiles: A URL that specifies the location of the control partition of a multifile.

Every component in a graph — both dataset and program components — has a layout. Some graphs use one layout throughout; others use several layouts. The layouts you choose can be critical to the success or failure of a graph. For a layout to be effective, it must fulfill the following requirements:

• The Co>Operating System must be installed on the computers specified by the layout.

• The run host must be able to connect to the computers specified by the layout.

• The working directories in the layout must exist.

• The permissions in the directories of the layout must allow the graph to write files there.

• The layout must allow enough space for the files the graph needs to write there.
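Three of the requirements above (the directory exists, it is writable, and it has enough free space) can be checked for a directory on the local machine. The Python sketch below is an illustration only: check_layout_dir is a hypothetical helper, not part of Ab Initio, and it does not cover the first two requirements (Co>Operating System installation and connectivity from the run host), which depend on the remote environment.

```python
import os
import shutil
from typing import List


def check_layout_dir(path: str, needed_bytes: int) -> List[str]:
    """Return a list of problems found for one local working directory.

    Hypothetical helper: it only inspects the local filesystem.
    """
    problems = []
    if not os.path.isdir(path):
        # Nothing else can be checked if the directory is missing.
        problems.append("working directory does not exist")
        return problems
    if not os.access(path, os.W_OK | os.X_OK):
        problems.append("directory is not writable")
    if shutil.disk_usage(path).free < needed_bytes:
        problems.append("not enough free space")
    return problems


if __name__ == "__main__":
    # Check the current directory for 1 MiB of free space.
    print(check_layout_dir(".", 1024 * 1024))
```

An empty list means the three local checks passed; it does not guarantee that a graph will run successfully with that layout.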

During execution, a graph writes various files in the layouts of some or all of the components in it.

FLOWS

Flows indicate the type of data transfer between the components.


Straight Flow:

A straight flow connects components with the same depth of parallelism, including serial components, to each other. If the components are serial, the flow is serial. If the components are parallel, the flow has the same depth of parallelism as the components.


Fan Out flow:

A fan-out flow connects a component with a lesser number of partitions to one with a greater number of partitions; in other words, it follows a one-to-many pattern. The component with the greater depth of parallelism determines the depth of parallelism of a fan-out flow.

NOTE: You can only use a fan-out flow when the result of dividing the greater number of partitions by the lesser number of partitions is an integer. If this is not the case, you must use an all-to-all flow.


Fan In flow:

A fan-in flow connects a component with a greater depth of parallelism to one with a lesser depth; in other words, it follows a many-to-one pattern. As with a fan-out flow, the component with the greater depth of parallelism determines the depth of parallelism of a fan-in flow.

NOTE: You can only use a fan-in flow when the result of dividing the greater number of partitions by the lesser number of partitions is an integer. If this is not the case, you must use an all-to-all flow.
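The divisibility rule from the notes can be expressed as a small helper. This Python sketch is illustrative only: choose_flow is a hypothetical name, not an Ab Initio function, and returning the simplest permitted flow is an assumption, since a graph designer may have other reasons to pick a different flow type.

```python
def choose_flow(src_partitions: int, dst_partitions: int) -> str:
    """Suggest a flow type from the partition counts of two components.

    Hypothetical helper that applies the rule from the notes:
    fan-out and fan-in need greater / lesser to be an integer.
    """
    if src_partitions < 1 or dst_partitions < 1:
        raise ValueError("partition counts must be at least 1")
    if src_partitions == dst_partitions:
        return "straight"
    greater = max(src_partitions, dst_partitions)
    lesser = min(src_partitions, dst_partitions)
    if greater % lesser != 0:
        # The ratio is not an integer, so only all-to-all is allowed.
        return "all-to-all"
    return "fan-out" if src_partitions < dst_partitions else "fan-in"


if __name__ == "__main__":
    for src, dst in [(1, 1), (2, 6), (6, 2), (4, 6)]:
        print(src, dst, choose_flow(src, dst))
```

For example, 2 and 6 partitions divide evenly (6 / 2 = 3), so a fan-out or fan-in flow is possible, while 4 and 6 do not (6 / 4 = 1.5), so the flow has to be all-to-all.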


All to All flow:

An all-to-all flow connects components with the same or different degrees of parallelism in such a way that each output port partition of one component is connected to each input port partition of the other component.
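The connection pattern described above is the full cross product of output partitions and input partitions. A minimal Python sketch, assuming partitions are simply numbered from 0 (the function name all_to_all_connections is hypothetical):

```python
from itertools import product
from typing import List, Tuple


def all_to_all_connections(n_out: int, n_in: int) -> List[Tuple[int, int]]:
    """List every (output partition, input partition) pair.

    Each output port partition is paired with each input port partition,
    so the result has n_out * n_in entries.
    """
    return list(product(range(n_out), range(n_in)))


if __name__ == "__main__":
    # A 2-way component feeding a 3-way component needs 2 * 3 connections.
    for out_part, in_part in all_to_all_connections(2, 3):
        print(f"output partition {out_part} -> input partition {in_part}")
```

Because the number of connections is the product of the two partition counts, it grows quickly as the depth of parallelism increases.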


SANDBOX

A sandbox is a special directory (folder) containing a certain minimum number of specific subdirectories for holding Ab Initio graphs and related files. These subdirectories have standard names that indicate their function. The sandbox directory itself can have any name; its properties are recorded in various special and hidden files that lie at its top.


Parts of a graph

• Data files (.dat)

  • The Transactions input dataset and the Transaction Data output dataset. If these are multifiles, the actual datasets will occupy multiple files.

• Record formats (.dml)

  • There are two separate record formats in the graph, one for the input file, the other for the output. These record formats could be embedded in the Input File and Output File components themselves.

• Transforms (.xfr)

  • The transform function can be embedded in the component, but (as with the record formats)

• Graph file (.mp)

  • The graph itself is stored complete as an .mp file.

• Deployed script (.ksh)

  • If deployed as a script, the graph will also exist as a .ksh file, which has to be stored somewhere.


Sandbox Parameters:

Sandbox parameters are variables which are visible to any component in any graph which is stored in that sandbox. Here are some examples of sandbox parameters:

• $PROJECT_DIR

• $DML

• $RUN

Graphs refer to the sandbox subdirectories by using sandbox parameters.


The default sandbox parameters in a GDE-created sandbox are these eight:

PROJECT_DIR: absolute path to the sandbox directory
DML: relative sandbox path to the dml subdirectory
XFR: relative sandbox path to the xfr subdirectory
RUN: relative sandbox path to the run subdirectory
DB: relative sandbox path to the db subdirectory
MP: relative sandbox path to the mp subdirectory
RESOURCE: relative sandbox path to the resource subdirectory
PLAN: relative sandbox path to the plan subdirectory
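To see how a reference such as $DML could turn into a concrete path, the following Python sketch models the eight parameters as a dictionary and expands them in a string. It is a toy model under stated assumptions: the way the GDE actually stores and evaluates parameter values is not described here, and the sandbox path /home/ed/sandbox and the file name transactions.dml are hypothetical.

```python
from pathlib import PurePosixPath
from string import Template

# Subdirectory names follow the list above; storing them as plain
# relative names is an assumption made for this sketch.
DEFAULT_SUBDIRS = {
    "DML": "dml", "XFR": "xfr", "RUN": "run", "DB": "db",
    "MP": "mp", "RESOURCE": "resource", "PLAN": "plan",
}


def sandbox_parameters(project_dir: str) -> dict:
    """Build the eight default parameters for a sandbox directory."""
    params = {"PROJECT_DIR": project_dir}
    params.update(DEFAULT_SUBDIRS)
    return params


def resolve(params: dict, name: str) -> str:
    """Return the absolute path that a parameter stands for."""
    if name == "PROJECT_DIR":
        return params[name]
    return str(PurePosixPath(params["PROJECT_DIR"]) / params[name])


def expand(text: str, params: dict) -> str:
    """Replace $NAME references in text with resolved absolute paths."""
    resolved = {name: resolve(params, name) for name in params}
    return Template(text).substitute(resolved)


if __name__ == "__main__":
    params = sandbox_parameters("/home/ed/sandbox")  # hypothetical path
    print(expand("$DML/transactions.dml", params))   # hypothetical file name
```

Referring to files through parameters rather than hard-coded paths is what lets a graph keep working when the sandbox is checked out to a different location.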
