Multifiles are parallel files composed of individual files, which may be located on separate disks or systems. These individual files are the
partitions of the multifile. Understanding the concept of multifiles is essential
when you are developing parallel applications that use files, because the parallelization of data drives the parallelization of the application.
Data parallelism makes it possible for Ab Initio software to process large amounts of data very quickly. In order to take full advantage of the power of
data parallelism, you need to store data in its parallel state. In this state, the
partitions of the data are typically located on different disks on various
machines. An Ab Initio multifile is a parallel file that allows you to manage all the partitions of the data, no matter where they are located, as a single
entity.
Multifiles are organized by using a Multifile System, which has a directory tree structure that allows you to work with multifiles and the directory
structures that contain them in the same way you would work with serial
files. Using an Ab Initio multifile system, you can apply familiar file
management operations to all partitions of a multifile from a central point of administration — no matter how diverse the machines on which they are
located. You do this by referencing the control partition of the multifile with a single Ab Initio URL.
MULTIFILE SYSTEM
Multifile
• An Ab Initio multifile organizes all partitions of a multifile into one single virtual file that you can reference as one entity.
• An Ab Initio multifile can only exist in the context of a multifile system.
• To create a multifile, first create a multifile system, and then
either write parallel data to it from an Ab Initio graph or use the m_touch command to create an empty multifile in it.
Ad Hoc Multifiles
• An ad hoc multifile is a parallel dataset whose partitions are an arbitrary set of
serial files containing similar data.
• Created explicitly by listing a set of serial files as partitions, or
by using a shell expression that expands at runtime to a list of serial files.
• The serial files listed can be anything from a serial dataset divided into multiple serial files to any set of serial files containing similar data.
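The second creation method, a pattern that expands at run time to a list of serial files, can be illustrated outside Ab Initio. The Python sketch below is only an analogy: the file names and the expand_partitions helper are hypothetical, and it does not call any Ab Initio API.

```python
import glob
import os
import tempfile


def expand_partitions(pattern):
    """Expand a shell-style pattern into a sorted list of serial files.

    Each matching file plays the role of one partition of an ad hoc
    multifile. Sorting makes the partition order deterministic.
    """
    return sorted(glob.glob(pattern))


# Hypothetical data: three serial files holding similar records.
workdir = tempfile.mkdtemp()
for name in ("sales_a.dat", "sales_b.dat", "sales_c.dat"):
    with open(os.path.join(workdir, name), "w") as handle:
        handle.write("record\n")

partitions = expand_partitions(os.path.join(workdir, "sales_*.dat"))
partition_names = [os.path.basename(p) for p in partitions]
print(partition_names)
```

The list is only known once the pattern is expanded, so adding a fourth matching file would change the number of partitions the next time the pattern is evaluated.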
MULTIFILE SYSTEM
- Visualize a directory tree containing subdirectories and files.
- Now imagine 3 identical copies of the same tree
located on several disks, and number them 0 to 2. These are the data partitions of the multifile system.
- Then add one more copy of the tree to serve as the
control partition.
- This multifile system is referenced in the GDE by using Ab Initio URLs.
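The picture above can be mimicked on a single machine as a toy model. The sketch below assumes hypothetical directory names (control, data-0, data-1, data-2) and a hypothetical make_toy_mfs helper. A real multifile system is created with m_mkfs and normally spreads its partitions over separate disks or hosts, so this only shows the "identical copies of one tree" idea.

```python
import os
import tempfile


def make_toy_mfs(root, tree, data_partitions=3):
    """Create one control directory plus numbered data directories,
    each holding an identical copy of the same directory tree."""
    names = ["control"] + [f"data-{i}" for i in range(data_partitions)]
    created = {}
    for name in names:
        base = os.path.join(root, name)
        for subdir in tree:
            os.makedirs(os.path.join(base, subdir), exist_ok=True)
        created[name] = base
    return created


root = tempfile.mkdtemp()
layout = make_toy_mfs(root, tree=["cust", "orders/2024"])
partition_names = sorted(layout)
print(partition_names)
```

Every partition directory ends up with the same cust and orders/2024 subdirectories, which is what lets a path inside the control partition stand for the matching path in every data partition.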
Creating the multifile system (MFS)
m_mkfs //pluto.us.com/usr/ed/mfs3 \
    //pear.us.com/p/mfs3-0 \
    //pear.us.com/q/mfs3-1 \
    //plum.us.com/D:/mfs3-2
Multidirectory
m_mkdir //pluto.us.com/usr/ed/mfs3/cust
Multifile
m_touch //pluto.us.com/usr/ed/mfs3/cust/t.out
ASSIGNMENT
Creating a Multidirectory
• Open the command console.
• Enter the following command
LAYOUTS
The layout of a component specifies:
• The location of a file: a URL specifying the location of the file.
• The number and locations of the partitions of a multifile: a URL that specifies the location of the control partition of the multifile.
Every component in a graph — both dataset and program components — has a layout. Some graphs use one layout throughout; others use several layouts. The layouts you choose can be critical to the success or failure of a graph. For a layout to be effective, it must fulfill the following requirements:
• The Co>Operating System must be installed on the computers specified by the layout.
• The run host must be able to connect to the computers specified by the layout.
• The working directories in the layout must exist.
• The permissions in the directories of the layout must allow the graph to write files there.
• The layout must allow enough space for the files the graph needs to write there.
During execution, a graph writes various files in the layouts of some or all of the components in it.
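Three of the requirements above (the directory exists, it is writable, and it has enough free space) can be checked ahead of a run for a directory on the local machine. The sketch below is a hypothetical preflight helper, not part of Ab Initio. It does not check that the Co>Operating System is installed or that the run host can reach remote computers.

```python
import os
import shutil
import tempfile


def check_local_layout(directory, required_bytes):
    """Return a list of problems found for one local layout directory."""
    problems = []
    if not os.path.isdir(directory):
        problems.append("working directory does not exist")
        return problems  # the remaining checks need the directory
    if not os.access(directory, os.W_OK | os.X_OK):
        problems.append("directory is not writable")
    if shutil.disk_usage(directory).free < required_bytes:
        problems.append("not enough free space")
    return problems


existing = tempfile.mkdtemp()
missing = os.path.join(existing, "no-such-dir")

ok_report = check_local_layout(existing, required_bytes=1)
missing_report = check_local_layout(missing, required_bytes=1)
print(ok_report, missing_report)
```

An empty list means none of the three local checks failed. The required_bytes figure is an estimate the caller has to supply, since the source does not say how much space a graph needs.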
FLOWS
Flows indicate the type of data transfer between the components.
Straight Flow:
A straight flow connects components with the same depth of
parallelism, including serial components, to each other. If the
components are serial, the flow is serial. If the components are parallel, the flow has the same depth of parallelism as the components.
Fan Out flow:
A fan-out flow connects a component with a lesser number of partitions to one with a greater number of partitions — in other words, it follows a one-to-many pattern. The component with the
greater depth of parallelism determines the depth of parallelism of a fan-out flow.
NOTE: You can only use a fan-out flow when the result of dividing the greater
number of partitions by the lesser number of partitions is an integer. If this is not the case, you must use an all-to-all flow.
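The divisibility rule in the note can be written as a small check. The sketch below is a hypothetical helper, and it makes one assumption the text does not state: that each source partition feeds a contiguous block of target partitions. Only the requirement that the division yields an integer comes from the note.

```python
def fan_out_targets(source_partitions, target_partitions):
    """Map each source partition to the target partitions it feeds.

    Raises ValueError when the counts do not divide evenly, which is the
    case where the text says an all-to-all flow is required instead.
    Assumption: targets are assigned to sources in contiguous blocks.
    """
    if source_partitions < 1 or target_partitions < source_partitions:
        raise ValueError("fan-out needs 1 <= source <= target partitions")
    if target_partitions % source_partitions != 0:
        raise ValueError("counts do not divide evenly; use an all-to-all flow")
    width = target_partitions // source_partitions
    return {
        src: list(range(src * width, (src + 1) * width))
        for src in range(source_partitions)
    }


mapping = fan_out_targets(2, 6)
print(mapping)  # {0: [0, 1, 2], 1: [3, 4, 5]}
```

With 2 source partitions and 5 target partitions the division leaves a remainder, so the helper raises an error instead of returning a mapping.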
Fan In flow:
A fan-in flow connects a component with a greater depth of
parallelism to one with a lesser depth — in other words, it follows a many-to-one pattern. As with a fan-out flow, the component with the greater depth of parallelism determines the depth of parallelism of a fan-in flow.
NOTE: You can only use a fan-in flow when the result of dividing the greater
number of partitions by the lesser number of partitions is an integer. If this is not the case, you must use an all-to-all flow.
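Taken together, the rules for straight, fan-out, and fan-in flows amount to a short decision procedure. The sketch below is a hypothetical helper that names a flow type fitting two depths of parallelism. It assumes a serial component counts as depth 1, and it does not cover cases where an all-to-all flow is chosen even though another flow type would fit.

```python
def eligible_flow(upstream_depth, downstream_depth):
    """Name the flow type that fits two depths of parallelism.

    Assumption: a serial component has depth 1.
    """
    if upstream_depth < 1 or downstream_depth < 1:
        raise ValueError("depth of parallelism must be at least 1")
    if upstream_depth == downstream_depth:
        return "straight"
    greater = max(upstream_depth, downstream_depth)
    lesser = min(upstream_depth, downstream_depth)
    if greater % lesser != 0:
        # The division does not yield an integer.
        return "all-to-all"
    return "fan-out" if upstream_depth < downstream_depth else "fan-in"


results = [eligible_flow(1, 1), eligible_flow(2, 4),
           eligible_flow(6, 3), eligible_flow(4, 6)]
print(results)  # ['straight', 'fan-out', 'fan-in', 'all-to-all']
```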
All to All flow:
An all-to-all flow connects components with the same or different
degrees of parallelism in such a way that each output port partition of one component is connected to each input port partition of the other component.
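Because every output port partition connects to every input port partition, the number of connections is the product of the two partition counts. The sketch below enumerates them with a hypothetical helper. Partition numbering from 0 is an assumption that follows the numbering used for multifile system partitions.

```python
from itertools import product


def all_to_all_connections(output_partitions, input_partitions):
    """List every (output partition, input partition) pair."""
    return list(product(range(output_partitions), range(input_partitions)))


connections = all_to_all_connections(2, 3)
print(len(connections))  # 6
```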
SANDBOX
A sandbox is a special directory (folder) containing a certain
minimum number of specific subdirectories for holding Ab Initio graphs and related files. These subdirectories have standard
names that indicate their function. The sandbox directory itself can have any name; its properties are recorded in various special and hidden files at its top level.
Parts of a graph
• Data files (.dat)
• The Transactions input dataset and the Transaction Data output dataset. If these are multifiles, the actual datasets will occupy multiple files.
• Record formats (.dml)
• There are two separate record formats in the graph, one for the input file, the other for the output. These record formats could be embedded in the Input File and Output File components themselves.
• Transforms (.xfr)
• The transform function can be embedded in the component, but (as with the record formats)
• Graph file (.mp)
• The graph itself is stored complete as an .mp file.
• Deployed script (.ksh)
• If deployed as a script, the graph will also exist as a .ksh file, which has to be stored somewhere.
Sandbox Parameters:
Sandbox parameters are variables which are visible to any
component in any graph which is stored in that sandbox. Here are some examples of sandbox parameters:
• $PROJECT_DIR
• $DML
• $RUN
Graphs refer to the sandbox subdirectories by using sandbox parameters.
The default sandbox parameters in a GDE-created sandbox are these eight:
• PROJECT_DIR: absolute path to the sandbox directory
• DML: relative sandbox path to the dml subdirectory
• XFR: relative sandbox path to the xfr subdirectory
• RUN: relative sandbox path to the run subdirectory
• DB: relative sandbox path to the db subdirectory
• MP: relative sandbox path to the mp subdirectory
• RESOURCE: relative sandbox path to the resource subdirectory
• PLAN: relative sandbox path to the plan subdirectory
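Since seven of the eight parameters are paths relative to the sandbox, each one resolves to a full location once PROJECT_DIR is known. The sketch below shows that resolution with a hypothetical helper and a hypothetical sandbox path. It assumes each subdirectory is named after its parameter in lowercase, as the list above suggests, and it is not how the GDE itself stores or evaluates parameters.

```python
import posixpath

# Assumption: each subdirectory is named after its parameter in lowercase.
RELATIVE_PARAMETERS = {
    "DML": "dml", "XFR": "xfr", "RUN": "run", "DB": "db",
    "MP": "mp", "RESOURCE": "resource", "PLAN": "plan",
}


def resolve_parameters(project_dir):
    """Return all eight parameters as absolute paths under project_dir."""
    resolved = {"PROJECT_DIR": project_dir}
    for name, relative in RELATIVE_PARAMETERS.items():
        resolved[name] = posixpath.join(project_dir, relative)
    return resolved


# Hypothetical sandbox location.
params = resolve_parameters("/home/ed/sandboxes/demo")
print(params["DML"])  # /home/ed/sandboxes/demo/dml
```

Keeping every path relative to PROJECT_DIR means that moving the sandbox only requires changing that one value.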