Artifact Software Library ADTransform User Manual. Creating Simple ETVL Workflows with ADTransform

Chapter 2: Installation
• Setting up ADTransform
• Platforms and Requirements
• Installing ADTransform
• Starting A Project
• ADTransform data folder structure
• Integrating ADTransform into a Compound Workflow
• Adding Pre-Configured Workflows

Chapter 3: Configuration
• Configuring Workflow Phases
• Configuring a Workflow Task
• Configuring The Environment
• Configuring The Mailer
• Configuring Error Handling
• Internal Data Structure

Chapter 4: General Processing
• Configuring Pre and Post-Processing
• Reading Properties Files

Chapter 5: File Manipulation
• Manipulating Remote Files
• Sending files to remote sites with FTP
• Getting files from a remote site with FTP
• Sending files to remote sites with sFTP
• Getting files with sFTP
• Copy a File
• Copy All Files
• Move a File
• Move All Files
• Renaming A Local File
• Deleting A File
• Deleting all Files in a Folder

Chapter 6: Input and Output
• Input and Output Tasks
• Reading and Writing CSV files
• Reading files
• Writing HTML files

Chapter 7: Transformation
• Configuring Transformations
• Creating a new Empty Data Store
• Concatenating Data Stores
• Cloning a Data Store
• Creating New Columns
• Renaming Columns
• Adding The Row Number
• Changing the Case of Values
• Concatenate Columns
• Setting A Default Value
• Replacing Characters in a Text Column
• Simple Mapping Transformation
• Multi-column Mapping Transformation
• Reformatting a Date
• Changing Minutes to Hours and Minutes

Chapter 8: Filtering
• Filtering on single column

Chapter 9: Validation
• Configuring Validations
• Validating the Type of a Field
• Validating the Length of a Field
• Validating the Relationship between Two Fields
• Validating the Relationship between Two Numeric Values
• Comparing Two Dates in a Data Store
• Validating a Field using a Lookup
• Validating Unique Values
• Validating the Values of a Field

Chapter 10: Reporting
• A Report from a Single Data Store
• HTML Reports
• Tabular HTML Reports

Chapter 11: Flow Control
• Flow Control

Chapter 12: Fully Configured Workflows
• Fully Configured Implementations

Chapter 13: Writing Custom Plugins

Appendix A
• LMS Implementation Use Case

Abstract

This manual is written for the person who is responsible for the process of preparing the data for an automated process or developing a workflow consisting of multiple computer systems whose data formats are not identical even though the data flowing between systems has the same meaning.

It assumes that the reader is familiar with programming, simple configuration of Spring beans, batch scripting and general server technology. It does not require knowledge of Java or a deep understanding of Spring unless the user intends to extend the system by adding new functionality.

For a high-level overview of the system, please refer to the ADTransform website at http://adtransform.artifact-software.com.

Notice

Notice to the Reader

Topics:

• Trademarks

This information was developed for products and services offered in the U.S.A.

Some of the trademarks mentioned in this product appear for identification purposes only. Not responsible for direct, indirect, incidental or consequential damages resulting from any defect, error or failure to perform.

No other warranty expressed or implied. This supersedes all previous notices.

COPYRIGHT LICENSE:

This information contains sample application programs in source language, which illustrate programming techniques on various operating platforms. You may not copy, modify, and distribute these sample programs in any form without permission from Artifact Software Inc.

These examples have not been thoroughly tested under all conditions. Artifact Software, therefore, cannot guarantee or imply reliability, serviceability, or function of these programs.

Trademarks

The following terms are trademarks of Artifact Software Inc. in Canada, other countries, or both: ADTransform®

The following terms are trademarks of other companies:

Spring is a registered trademark of VMware and/or its affiliates.

Linux is a trademark of Linus Torvalds in the United States, other countries, or both.

Other company, product, or service names may be trademarks or service marks of others.

Preface

Preface for the ADTransform User Manual

Chapter 1: Introduction

Topics:

• What is ADTransform?

• What Data Types are Supported?

• ADTransform Workflow

• ADTransform Workflow Structures

• Introduction to Logging and the Audit Trail

This chapter lays out the basic concepts underlying the operation of ADTransform.

What is ADTransform?

ADTransform is an advanced ETVL (Extract, Transform, Validate and Load) utility that has been extended to include reporting and other useful features that facilitate the creation of user-friendly workflow applications. It is a stand-alone batch application designed to modify and transform data provided by any number of input sources, validate the resulting records, then report on the data and output it in the format required for further processing.

The original use case was to simplify the preparation of data to be imported into an LMS. However, there are many use cases where data in one format has to be manipulated, validated and output in a different format.

All of the transformation and validation is done through configuration of plugins, so it is very flexible and does not require Java programming.

It can do many types of data transformation. Many standard transformation plugins come with the package such as:

• Set defaults

• Map values based on secondary data - job titles to job roles, department numbers or cost codes to organizational ids

• Copy columns from one data stream to another - create a list of organizations from information in the personnel data.

• Create new columns in a data stream

• Create new data streams and populate them.

Custom transformation plugins can be written and configured into the workflow.

Validation plugins include:

• Check for required values/columns

• Uniqueness

• Table look-up - the organization code in a person record must exist in the organization data, etc.

• Date format checking

• Value comparison - one value must be greater than or equal to another.

• Length constraints

• Value constraints - timezone codes, currency, state, etc., must match a list or a table of valid values.

Custom validations can be added. For example, a validation plugin could be written to validate a discount based on a price and a date-dependent rate.

The standard plugins give very clear error messages. These tell the person exactly which records failed validation and what error was detected. This is done in terms that a non-IT person can understand.

222 rows were validated for uniqueness in column 'NAME' in datastore 'organizations' with 13 duplicate Keys.

Key '212' was found in rows : 171, 113

Key '250' was found in rows : 163, 120

. . .

This is a big improvement over an SQL exception that may have little or no information about why the record could not be processed and no information about which rows were involved.

This can save days of testing imports and refreshing databases that have had wrong or partial updates applied.
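The kind of check behind the uniqueness message shown above can be sketched in a few lines. This is only an illustration written in Python, not the code of the standard plugin; the function names, the sample rows and the 1-based row numbering are assumptions made for the example.

```python
from collections import defaultdict

def find_duplicate_keys(rows, column):
    """Return {key: [row numbers]} for every key that occurs more than once."""
    seen = defaultdict(list)
    # Row numbers start at 1 here; this numbering is an assumption of the sketch.
    for row_number, row in enumerate(rows, start=1):
        seen[row[column]].append(row_number)
    return {key: numbers for key, numbers in seen.items() if len(numbers) > 1}

def uniqueness_report(rows, column, datastore):
    """Build report lines in the style of the message shown in the manual."""
    duplicates = find_duplicate_keys(rows, column)
    lines = [
        f"{len(rows)} rows were validated for uniqueness in column '{column}' "
        f"in datastore '{datastore}' with {len(duplicates)} duplicate keys."
    ]
    for key, numbers in duplicates.items():
        lines.append(f"Key '{key}' was found in rows : {', '.join(map(str, numbers))}")
    return lines

# Hypothetical sample data: every value is held as text.
organizations = [{"NAME": "212"}, {"NAME": "250"}, {"NAME": "212"}, {"NAME": "300"}]
report = uniqueness_report(organizations, "NAME", "organizations")
```

The point of the sketch is that the report names the offending key and the exact rows, which is what lets a non-IT person correct the source data without reading a stack trace.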

the provided data input/output plugins do not meet your needs.

Web services or an SQL query can be used to extract the data from the originating system directly. Multiple input and output routines can be configured into the program so that data can be collected in a number of ways and output in several formats. There are standard plugins for CSV, Excel and HTML. A report could be configured into the data output tasks to produce formatted reports of the data. JasperReports is supported in a standard plugin.

Graphs can also be produced. The standard plugins can produce static organization charts using GraphViz (www.graphviz.org) as well as dynamic charts that can be interrogated in your browser to view the data associated with an on-screen object.

What Data Types are Supported?

Internally ADTransform supports both named data elements and tables of rows of data.

There are two data structures that can be used to hold data during the execution of the workflow. Named data can be used to hold single-valued items such as the e-mail address of the person to whom a report should be sent. These are typically read from properties files, and there is a standard plugin that allows you to load many values into named data elements.

The names of data values can be any string of alphanumeric characters. Do not use names that start with "SYSTEM" since these are reserved for ADTransform use, and overwriting them may cause errors that are hard to diagnose.

The datastore is a tabular structure that consists of rows and columns. Datastores and their columns have case-sensitive names.

Text, numbers and dates are all held as text strings. However, validation can still check for the validity of dates or compare numbers.
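To see why values held as text can still be validated as dates or numbers, consider the following minimal sketch. It is written in Python purely for illustration and does not show how the ADTransform plugins are implemented; the date format and the column names MAX_SEATS and MIN_SEATS are hypothetical.

```python
from datetime import datetime

def is_valid_date(text, date_format="%Y-%m-%d"):
    """Return True when the text parses as a date in the given format."""
    try:
        datetime.strptime(text, date_format)
        return True
    except ValueError:
        return False

def is_greater_or_equal(left, right):
    """Compare two text values as numbers rather than character by character."""
    return float(left) >= float(right)

# Hypothetical row: every value is a text string, as in a datastore.
row = {"HIRED_ON": "2014-02-30", "MAX_SEATS": "10", "MIN_SEATS": "9"}

date_ok = is_valid_date(row["HIRED_ON"])                               # not a real calendar date
numbers_ok = is_greater_or_equal(row["MAX_SEATS"], row["MIN_SEATS"])   # numeric comparison
text_ok = row["MAX_SEATS"] >= row["MIN_SEATS"]                         # plain text comparison
```

Comparing "10" and "9" as plain text gives the wrong answer because "1" sorts before "9", which is why a validation has to interpret the text before comparing it.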

ADTransform Workflow

Each application will have a unique structure and list of tasks that are needed to achieve the transformation. For convenience, we discuss workflows in terms of phases that have a number of steps. Although a workflow can be specified as a single phase or divided up into as many phases as you want, it is often convenient to structure the typical workflow in 7 phases.

• Pre-process

• Input

• Transformation

• Validation

• Output

• Reporting

• Post-process

Each of these phases is made up of smaller steps that perform a simple single task such as:

• Get a file from a remote location

• Read a spreadsheet

• Read a database table or the results of a query

• Set defaults or add columns based on a lookup in another file

• Validate that a column has no duplicate keys

• Output a CSV file or write to a database

• Send a file to another location using FTP.

There could be a large number of steps in a single processing stream. Each of these steps is easy to specify using the operations provided by the plugins available in ADTransform.

ADTransform Workflow Structures

There is an infinite number of ways to structure a workflow. This section discusses a couple of different approaches that can be used as inspiration for your own workflow.

ADTransform does not impose a structure on the way you organize your tasks. We often talk about workflows in terms of 7 phases but there is no need to follow this pattern.

We will first present the basic 7-phase model and then show an alternative way to structure the same workflow that may be more convenient depending on the nature of your application.

In the 7 phase case, the workflow is structured into phases that cross all of the various data stores that will be used in the workflow.

• a pre-process phase in which general set-up can be done, directories cleared, files from previous runs archived, etc.,

• an input phase in which the data is read,

• a transformation phase in which the information in the data store or data stores is converted to the desired form,

• a validation phase in which all of the relevant fields are validated according to the requirements for the output,

• a reporting phase in which reports are created to assist in the analysis of the particular data that is being transformed and converted,

• an output phase in which the internal data stores are written in the desired output format,

• a post-processing phase in which reports may be e-mailed, directories archived, files uploaded to external systems or any other general housekeeping done that is required to complete the workflow.

In order to understand the diagrams you need to know a little bit about the symbols used. Within each phase, the individual tasks are identified as the small squares. In this Human Resources example, we are dealing with an organization file (On), location file (Ln), job types file (Jn) and a person file (pn).

In the input phase the M1 task reads in mapping files to be used in the transformation and validation phases. The other tasks read in each of the source data files.

In the transformation phase there are three tasks that transform the organization file and four tasks that transform the person file.

In the validation phase there is no validation of the location or job types files, but there are several validation tasks for the organization and person files.

In the reporting phase there is a general task G1 that outputs the error log and the audit trail, as well as reports from the organization, location, job types and person files.

In the post-processing phase there is again a general task and tasks related to each file.

In the alternative structure, the tasks have been rearranged so that each of the grey boxes represents the work done on one file type. The location file is read in, transformed, validated and reported and then the job types file, organization file and persons file are similarly processed.

The second-to-last block outputs the logs.

The pre and post-processing phases stay the same as in the original flow.

You can create a custom structure for your workflow that makes it easy to see where each task is specified and easy to maintain the workflow as requirements change.

The key point is to ensure that each task happens in the right sequence with respect to its dependencies. If there are transformations that must be done to the locations file prior to validation of the persons file then the relevant locations tasks must be in the phases that will be processed prior to the phase that validates the person file.
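The sequencing rule can be checked mechanically once the dependencies are written down. The following Python sketch is only an illustration of the idea and is not part of ADTransform; the task names and the dependency table are hypothetical.

```python
def out_of_order(sequence, depends_on):
    """Return (task, prerequisite) pairs where the prerequisite runs too late."""
    position = {task: index for index, task in enumerate(sequence)}
    return [
        (task, prerequisite)
        for task, prerequisites in depends_on.items()
        for prerequisite in prerequisites
        if position[prerequisite] > position[task]
    ]

# Hypothetical dependency: persons cannot be validated until locations are transformed.
depends_on = {"validate_persons": ["transform_locations"]}

good = ["read_locations", "transform_locations", "read_persons", "validate_persons"]
bad = ["read_persons", "validate_persons", "read_locations", "transform_locations"]

good_problems = out_of_order(good, depends_on)
bad_problems = out_of_order(bad, depends_on)
```

Whichever structure you choose, phase-oriented or file-oriented, a check of this kind passes as long as every prerequisite task appears earlier in the overall sequence.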

Introduction to Logging and the Audit Trail

This section describes the logging and audit strategy of ADTransform.

ADTransform and the plugins work together to provide a flexible framework that supports an Audit Trail, an Error Log report and detailed logging for programmers of custom plugins.

The Audit Trail is a high-level summary of the activities of each step. One would expect to find messages such as "Excel file reader - Added 711 rows of data from 'input/lms.xls' to 'person_data'." and "Convert Start date to Saba format - Successfully transformed 711 date values in column 'HIRED_ON'." and "Username uniqueness validator - 711 rows were validated for uniqueness in column 'USERNAME' in datastore 'person_data'.".

information required to fix data or configuration problems. The information should be expressed in terms that the person using the workflow will understand.

For example, one might find a message such as "Record 8 of 'call_details' has a value '11001' in 'extention_number' which is not in the extension_master." or "Record 3708 of the employee_master has a termination_date earlier than the hire_date."

The Error Log and Audit Trail would normally be e-mailed to the end-user in charge of the overall process. This should be done as the last step in the workflow since any tasks done after these files are produced will not appear in the reports. A copy of these reports is normally retained in a log folder that is archived with the inputs and outputs.

Custom plugins can also append detailed messages to an external log file using Apache commons-logging. This programmer's log can provide very detailed information about the internal processing of each record. This log would not normally be sent to the end-user.

Audit Trail

The audit trail provides a summary of each operation performed during the workflow. This gives the workflow planner a lot of flexibility in satisfying audit requirements. The Audit Trail holds the following information:

• Step : The step number of the plugin that made the audit entry.

• Plugin : The user-specified name of the task that made the audit entry.

• Message : The audit message that the plugin recorded.

The Audit Trail is maintained in a datastore that can be treated like other datastores so that it can be reported, e-mailed, saved to a remote server, etc. using the standard plugins.

The ADTransform API includes access to the audit trail manager, which allows plugins to append information to the audit trail.

Error Log

ADTransform supports detailed logging of activities in a format that is easy for end-users to understand. The standard plugins use this to write a detailed description of errors that are found during the processing of the workflow.

The Error Log records the following information for each entry made in the error log:

• Step : The step identifier of the plugin that made the entry.

• Plugin : The name of the plugin that made the entry.

• Message : The message that the plugin recorded.

• Level : The severity of the condition being reported.

Typical values are "DEBUG","INFO", "WARNING", "ERROR" and "FATAL". The use of "DEBUG"

programmers log.

The plugins can write as many detailed, multi-line message entries as are required to describe the problems encountered in the processing. The messages in the Error Log should be written in a way that the end-user can understand what caused the errors and what records, files, etc. need to be examined and fixed.

The standard plugins follow this guideline.

The Error Log is maintained in a datastore that can be treated like other datastores so that it can be reported, e-mailed, saved to a remote server, etc. using the standard plugins.

A standard report supplied with ADTransform uses font sizes and colours to flag entries with "FATAL", "ERROR" or "WARNING" severity.
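Because the Error Log is an ordinary table of Step, Plugin, Message and Level values, selecting the entries that deserve attention is a simple filter. The Python sketch below is an illustration only and is not the standard report; the ordering of the severity levels and the sample entries are assumptions made for the example.

```python
# Severity levels named in this manual, listed from least to most severe.
# Treating them as an ordered scale is an assumption of this sketch.
LEVELS = ["DEBUG", "INFO", "WARNING", "ERROR", "FATAL"]

def entries_at_or_above(error_log, minimum="WARNING"):
    """Return the entries whose Level is at least as severe as `minimum`."""
    threshold = LEVELS.index(minimum)
    return [entry for entry in error_log if LEVELS.index(entry["Level"]) >= threshold]

# Hypothetical Error Log rows using the four columns described above.
error_log = [
    {"Step": "input1", "Plugin": "Person file Excel reader",
     "Message": "Added 711 rows of data.", "Level": "INFO"},
    {"Step": "validation2", "Plugin": "Username uniqueness validator",
     "Message": "Key '212' was found in rows : 171, 113", "Level": "ERROR"},
]

flagged = entries_at_or_above(error_log)
```

With the default threshold only the "ERROR" entry is returned, which corresponds to the entries that a report would highlight.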

Chapter 2: Installation

Topics:

• Setting up ADTransform

• Platforms and Requirements

• Installing ADTransform

• Starting A Project

• ADTransform data folder structure

• Integrating ADTransform into a Compound Workflow

• Adding Pre-Configured Workflows

This chapter describes the installation of ADTransform and its integration into an existing workflow.

Setting up ADTransform

ADTransform is easy to install.

ADTransform is delivered as an installer for use on MS-Windows or Linux. The adtransform-installer file will install the files required to run ADTransform, the Project Initiation program used to start a new project as well as the sample templates.

Platforms and Requirements

ADTransform will run on Microsoft Windows and on Unix-like platforms, including the variants of Linux. A Java run-time environment of version 8 or greater must be installed on the computer where ADTransform will be used.

Custom plugins may be more restrictive in the environments that they support. This should be noted in their Release Notes or User Documentation.

Installing ADTransform

ADTransform is delivered as an installer that will create a working ADTransform structure on the target workstation.

The installer is a jar file that can be run using Java. In MS-Windows this can usually be accomplished by double-clicking on the jar file. This will initiate an installer session that asks for the locations where you want the program and templates to be installed.

The installer includes working sample projects called templates that can be used for testing the installation or as a base for your project.

Starting A Project

Once ADTransform and any extra template packages are installed, you can select and activate the template that you want to use by running the "projectInstaller".

It is installed in the ADTransform main directory. This will be projectInstaller.bat for Windows or projectInstaller.sh for other operating systems. The projectInstaller will build a complete project structure in the destination folder that you specify.

Once the project structure is set up, you can modify the configuration files, replace the data files and run the ADTransform process. The workflow is initiated by executing the "startup" file that is created in the project directory. This will be startup.bat for Windows or startup.sh for other operating systems.

ADTransform data folder structure

A normal ADTransform project uses a number of folders to hold various types of files. These can be overridden in the configuration for a plugin if the structure is not suitable for your needs.

This is the structure used in the sample projects that are distributed with the ADTransform application and for any application that Artifact Software distributes. You can modify this structure or replace it with your own layout. A typical project uses the following files and folders.

• inputdata : The default folder where ADTransform will look for input files.

• outputdata : The default folder where the final output files will be written.

• outputreports : The default folder where reports will be written.

• config : This folder and its sub-folder hierarchy contain the configuration files that control the process.

• mappings : The default folder to place mapping files that are required by your transformations or validation.

• logs : This directory is where the application writes its log files. This directory should be empty after installation.

• archive : At the completion of successful validation, you may want to have ADTransform clean out the input directory after copying the original input files to an archive sub-folder labeled by the date and time in the format YYYY-MM-DD_HH_MM.

• appContext.xml : Required configuration and setup. In most cases you do not want to manipulate this file. It refers to the masterConfig.xml file by name. If you change the name of the master configuration file to another name, you will have to update this file.

• masterConfig.xml : This lists the configuration files that determine the flow of the processing. They are processed in the order that they are listed in this file.

• config : These are configuration files where you specify each of the steps. For convenience, it is recommended that you break your operations into phases. You can have as many files as you want. They will be processed in the order that they are included as references in the masterConfig.xml. This also contains the CSS stylesheet files that are used when viewing any HTML reports that you generate.

• reports : This contains the report specification files for reporting plugins that require external report specifications. For example, compiled JasperReports (*.jasper) will be saved there.

• startup.bat : The Windows batch start-up file.

• startup.sh : The Linux/Unix batch start-up file.

• Repository : This contains the specification of the repository and any temporary files created during the processing. It should not be modified.

All relative references to files have the top level of the data structure as their root. For example, an output file would be specified as "outputdata/foobar.csv".

If you want to add sub-folders or change the names of the folders, you will have to change the references in the configuration files to match your new structure.
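The two path conventions described in this section, relative references rooted at the top of the project structure and archive sub-folders named in the format YYYY-MM-DD_HH_MM, can be illustrated with a short Python sketch. It is not ADTransform code; the project root "/opt/projects/hr_import" and the function names are hypothetical.

```python
from datetime import datetime
from pathlib import PurePosixPath

def resolve(project_root, relative_reference):
    """Resolve a relative file reference against the top level of the project."""
    return PurePosixPath(project_root) / relative_reference

def archive_folder(project_root, run_time):
    """Name an archive sub-folder by date and time in the format YYYY-MM-DD_HH_MM."""
    return PurePosixPath(project_root) / "archive" / run_time.strftime("%Y-%m-%d_%H_%M")

# Hypothetical project location and run time.
root = "/opt/projects/hr_import"
output_file = resolve(root, "outputdata/foobar.csv")
archive = archive_folder(root, datetime(2014, 3, 9, 7, 5))
```

Here output_file is "/opt/projects/hr_import/outputdata/foobar.csv" and archive is "/opt/projects/hr_import/archive/2014-03-09_07_05". Zero-padded fields keep the archive folders in chronological order when they are sorted by name.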

Integrating ADTransform into a Compound Workflow

ADTransform can be easily integrated into an existing workflow or run on its own. ADTransform is often run as part of the batch stream that extracts data from an application and imports it into another system.

ADTransform can, of course, be run manually in either a test or development environment. Batch scripts for Windows and Linux/Unix are provided with the system and you can either execute the scripts as part of your workflow or copy the commands into the script controlling your end-to-end workflow.

The easiest way to put ADTransform into production is to use the installer to set up the system on an MS-Windows (Vista, Windows 7 or Windows 8) workstation, then configure and test the ADTransform workflow there. Once it works, transfer the entire directory structure to the server where the rest of the workflow is running. ADTransform must be installed on the server and the projectInstaller must be run to get the correct folder structure and the correct startup file installed in the project directory.

Adding Pre-Configured Workflows

Pre-configured workflows are referred to as "templates". They are packaged in individual installation packages that are installed after the main ADTransform Core is installed.

When asked for an installation directory, select the directory where the ADTransform core package is installed. In Windows this will normally be C:\Program Files\ADTransform. The installer will install the templates in the "templates" sub-directory.

Once the templates are installed, you can select and activate the one that you want to use by running the "projectInstaller" that is found in the ADTransform main directory. This will be projectInstaller.bat for Windows or projectInstaller.sh for other operating systems. This will build a complete project structure in the destination folder that you specify.

Once the project structure is set up, you can modify the configuration files, replace the data files and run the ADTransform process. The workflow is initiated by executing the "startup" file that is created in the project directory. This will be startup.bat for Windows or startup.sh for other operating systems.

Chapter 3: Configuration

Topics:

• Configuring Workflow Phases

• Configuring a Workflow Task

• Configuring The Environment

• Internal Data Structure

The following sections describe technical details about customizing a workflow with the steps that are required to accomplish all of the ETVL tasks.

It is intended for readers with an understanding of IT terms. For some of the tasks, it assumes a basic familiarity with XML files.

Configuring Workflow Phases

A workflow is described in one or more phases. Each phase contains one or more tasks. The structure of the phases is completely user configurable.

The phases are defined as Java beans with an id which must be unique and a class that must be "com.artifact_software.adt.model.PhaseContainerImpl".

The masterConfig.xml file has 2 main sections. The first imports the phase definition files that you have set up. The second section adds the name of each bean defining a phase to the list of phases to include in the workflow. The names of the import files do not have to match the names of the phases.

The following example of a masterConfig.xml has 8 phases defined to make up the workflow. The file for each phase is referenced in an import statement. The phase definition is referenced in the "phaseList", by the bean name which appears in the phase file.

The name of each phase is completely arbitrary.

It is common to split some of the phases into smaller workflows to make it easier to design and test your configuration. For example, if you have 3 types of input files, you might use a single input phase to read all the files but create a separate transformation phase for each type of data being transformed and validated. Validation is often split and there is no reason why you could not validate one input file before you transform another that depends on data in the first file.

Configuring a Workflow Task

Configuring an ADTransform task is very simple. There are some common parameters that appear in each step which are described below. In addition, each plugin will require information that is specific to the task that it performs. This is described in the documentation for each plugin.

configured, you simply have to fill in the mailer configuration, the input and output file patterns, etc. as described in the Environment Configuration section.

If you want to create your own workflow, you can use one of the existing configurations as a model in conjunction with this documentation to create a workflow configuration that matches your needs. Each workflow step is specified as a Java bean. This allows the workflow to be configured without Java programming and assembled by ADTransform when it starts up.

The bean specification of every workflow step includes the following:

• id : The identification of the step. This must be a unique name for the step and will be used for flow control when you want to jump to a particular step in your workflow.

• class : This is the name of the class that performs the operation that you want executed in this step.

Every bean should have 2 parameters regardless of the step's functionality:

• pluginId : This mandatory property is the name of the step that you want to see in an error log or audit trail entry. It can be any short string that describes the plugin in a way that makes sense to someone reading the configuration file or the output reports.

• usageDescription : This is an optional short description that describes what the plugin is doing in terms that a person reading a configuration file understands.

<bean id="validation2" class="com.artifact_software.adt.plugin.validation.FieldLengthValidator">
    <property name="pluginId" value="Username Required validator"/>
    <property name="usageDescription" value="Checks that all rows have an entry for the person's username"/>
    <!-- the plugin-specific properties are not shown -->
</bean>

In addition, each plugin will have as many properties as required to specify the details of the function to be performed.

For example, there is a plugin that reads Excel spreadsheets. Its parameters are simply a list of file specifications. Each file specification includes

• the name of the internal data store that you will use in subsequent steps,

• the location of the file to read,

• a flag indicating the presence of a header row,

• the number of rows to skip before starting to read the data,

• the number of rows to skip at the beginning of the file before looking for the header,

• the number of rows to skip after the header before reading data rows,

• the number of rows to drop at the end of the file,

• the number of rows to read once the data rows are found.

<bean id="ExcelPersonInput"
      class="com.artifact_software.adt.plugin.fileManagement.ExcelFileReader">
    <property name="pluginId" value="Person file Excel reader"></property>
    <property name="usageDescription" value="Reads the person data"/>
    <property name="fileSet">
        <set>
            <bean id="PersonFIleSpec"
                  class="com.artifact_software.adt.plugin.fileManagement.specification.ExcelFileSpecification">
                <property name="headerRowRequired" value="true"/>
                <!-- the remaining file specification properties are not shown -->
            </bean>
        </set>
    </property>
</bean>
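The effect of the row-selection parameters in a file specification can be pictured with a small Python sketch. It is an illustration only and does not show the Excel reader's implementation; the parameter names, the order in which trailing rows are dropped and the row limit is applied, and the sample sheet are all assumptions.

```python
def select_rows(rows, skip_before_header=0, header_row=True,
                skip_after_header=0, drop_at_end=0, max_rows=None):
    """Return (header, data_rows) after applying the row-selection settings."""
    remaining = rows[skip_before_header:]        # rows to skip before the header
    header = None
    if header_row and remaining:
        header, remaining = remaining[0], remaining[1:]
    remaining = remaining[skip_after_header:]    # rows to skip after the header
    if drop_at_end:
        remaining = remaining[:-drop_at_end]     # rows to drop at the end of the file
    if max_rows is not None:
        remaining = remaining[:max_rows]         # number of data rows to read
    return header, remaining

# Hypothetical sheet: a title line, the header, a units line, two data rows and a total line.
sheet = [
    ["Personnel extract"],
    ["USERNAME", "HIRED_ON"],
    ["(text)", "(date)"],
    ["asmith", "2014-01-06"],
    ["bjones", "2014-02-03"],
    ["Total: 2"],
]

header, data = select_rows(sheet, skip_before_header=1, skip_after_header=1, drop_at_end=1)
```

With these settings the title, units and total lines are discarded, leaving the header row and the two data rows.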

Configuring The Environment

You can specify some shared information that sets the environment in which ADTransform operates. This is usually done in one or more tasks in an early phase, typically through property files that set individual named values that can be used later.

Shared information can be specified that can be used in multiple tasks rather than specifying the same information each time it is needed. For example, one can set the configuration options for the mail functions once and use it as many times as required to send files or reports.

Configuring The Mailer

In order to mail logs and reports, ADTransform needs to have its mail configuration set up.

ADTransform can mail its log files to one or more recipients. You need to set the following properties for the mailing subsystem. This is normally done in an early phase, usually the pre-processing phase, before the first time a file or report is to be mailed.

You can set this up using either of two plugins that are included with the basic system.

1. Use the PropertySetter plugin, which takes a list of property names and values and sets those values as internal ADTransform properties, or

2. Use the PropertyFileReader plugin to read in your property names and values from a properties file.

Regardless of the method used to supply the properties, you must specify the following information to set up your mailer configuration.

mailing_enabled use values true or false to tell ADTransform plugins whether or not mailing is available.

mailing_use_authorization use values true or false to tell the mailer whether or not to attempt to use authentication.

mailing_authorized_username a user name with authorization to send mail through the SMTP host server being used

mailing_authorized_password the password for the user specified

mailing_port the port number for the mail server (usually 25)

mailing_smtphost the SMTP host address of the mail server being used

mailing_recipients a comma-separated list of email addresses to which the log files will be sent, for example:

[email protected] ,[email protected]

mailing_from the address you wish to appear as the "from" address eg. [email protected]

mailing_subject the subject line you wish to appear as the "subject" of the email. Note: the placeholder {0} may be used in the subject line. This specific placeholder will be replaced with the date at the time the ADTransform application started its current run. Example: "log files for ADTransform process run at {0}"

mailing_body_message The text you wish to appear in the body of the mail message. Note: the placeholder {0} may be used in the body. This specific placeholder will be replaced with the date at the time the ADTransform application started its current run. Example: "log files for ADTransform process run at {0}"

You can override these properties in individual steps so that different subject headers and messages can be specified as well as different lists of addressees.
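As a minimal sketch, the mailer properties could be supplied with the PropertyFileReader plugin described under Reading Properties Files. Only the property names come from the list above; the bean id, the file name "mailer.properties" and every value shown in the comment are hypothetical placeholders.

```xml
<!-- Sketch only: reads the mailer settings from a properties file.
     Hypothetical contents of mailer.properties (all values are placeholders):
       mailing_enabled=true
       mailing_use_authorization=true
       mailing_authorized_username=mailUser
       mailing_authorized_password=secretpassword
       mailing_port=25
       mailing_smtphost=smtp.example.com
       mailing_recipients=admin@example.com,operator@example.com
       mailing_from=adtransform@example.com
       mailing_subject=log files for ADTransform process run at {0}
       mailing_body_message=log files for ADTransform process run at {0}
-->
<bean id="readMailerProperties"
      class="com.artifact_software.adt.plugin.fileManagement.PropertyFileReader">
  <property name="propertyFileName" value="mailer.properties"/>
</bean>
```

Placing a task like this in the pre-processing phase would make the settings available before the first file or report is mailed.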

Configuring Error Handling

If errors occur in steps in the process, you can have them ignored or have them cause the workflow to halt. Error handling is controlled by two properties that can be set and reset during your workflow. You can set these in either of two ways.

1. Use the PropertySetter plugin, which takes a list of names and values and sets those values as internal ADTransform properties, or

2. Use the PropertyFileReader plugin, which reads in your names and values from a properties file.

runOnError If any step in the process fails with an error condition and the "runOnError" property is set to false, the process will jump to the step that is labeled "FatalError". This step must be included in your workflow configuration. You will normally configure this to produce the Error Log and mail it to the administrator and then halt.

finishOnError A plugin configuration can specify whether the task is to run to completion or stop as soon as an error is reported in a record. This can be used to determine whether a validation stops when it finds the first error or continues to the end of its process, identifying all of the errors that it is checking in the data stream. To make it stop after the first error, set the "finishOnError" parameter value to "false" in the parameters of the plugin.

"finishOnError" defaults to "true", which is normally the behavior that you want. Normally, you will not specify this parameter for plugins.

Related concepts

Validation on page 67

Internal Data Structure

Once the data is loaded by an input plugin it is available to the subsequent workflow processes in a standard format that is independent of the input format.

For each type of data read in, a Data Store is created that has the data stored in a series of rows each with a set of columns that have a name and a cell value.

Plugins typically operate on one or more columns and process all of the rows in the Data Store. Transformations can create or destroy entire objects. They can insert or remove columns or rows. Of course, they can also modify individual values.

There are also named values, referred to as "system properties", that can be used to store values that you want to pass between tasks.


Chapter 4: General Processing

Topics:

• Configuring Pre and Post-Processing
• Reading Properties Files

The general tasks are used to set up the environment at the start of a workflow, to move files locally or between the local workstation/server and remote servers, or to clean up at the end.


Configuring Pre and Post-Processing

A wide range of tasks can be performed before the data is read and after the data is output. These can include:

• setting values or properties used by the system,
• creating or deleting directories,
• moving files between directories,
• retrieving files from remote systems,
• sending files to remote systems,
• e-mailing files and reports.

These tasks are performed by plugins that can be included as required.

Reading Properties Files

Property files contain single values for each property rather than rows. These are stored as system properties and can be referenced in custom plugins by requesting their values as system properties. The properties are read from property files that have each property on a single line, with a name and value separated by the "=" character. The following file would set the system properties "companyName", "address", "city", "state" and "zip". They could be used in subsequent plugins.

companyName=Artifact Software
address=135 Main Street
city=Washington
state=DC
zip=12345

It is possible to read in multiple properties files. If duplicate names are found in the files, the last value specified will be the only one saved.

The ability to read property files is provided by the "com.artifact_software.adt.plugin.fileManagement.PropertyFileReader" class.

The following parameter is required to read a properties file.

propertyFileName The name of the file to be read and processed.

Example showing a file company.properties being read in:

<bean id="addColumns"
      class="com.artifact_software.adt.plugin.fileManagement.PropertyFileReader">
  <property name="propertyFileName" value="company.properties"/>
</bean>


Chapter 5: File Manipulation

Topics:

• Manipulating Remote Files
• Managing Local Files

ADTransform can manipulate both local and remote files. These activities normally appear in a pre-processing or post-processing phase. In the pre-processing phase, the tasks are usually related to cleaning folders to remove files that were output in previous runs, downloading files or moving files. In the post-processing phase, these functions are used to send files to remote systems for further processing, to clean up the working disk folder or to archive files.


Manipulating Remote Files

Files can be retrieved from and sent to remote servers.

Sending files to remote sites with FTP

Files can be uploaded onto remote FTP servers.

FTP uploads are done using the "com.artifact_software.adt.plugin.fileManagement.FTP" class.

operation The literal text "PUT" indicating that you want to move files from the local machine to the remote server.

ftpUsername The username to be used on the FTP server

ftpPassword The password of the user on the remote server

ftpRemoteServer The hostname of the remote server.

ftpRemoteFolder The folder on the remote server where the files will reside.

ftpLocalFolder The folder on the local machine where the files reside.

protocol The literal text "FTP"

fileNames The list of files to be uploaded.

This will upload all the files in the list provided.
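A minimal sketch of an upload configuration, built only from the parameters listed above. The bean id and pluginId are hypothetical, and setting fileNames as a Spring list of values is an assumption; the remaining values follow the description below.

```xml
<!-- Sketch only: upload two files to a remote server with FTP. -->
<bean id="FTPUploadSketch"
      class="com.artifact_software.adt.plugin.fileManagement.FTP">
  <property name="pluginId" value="FTPFileUpload"/>
  <property name="operation" value="PUT"/>
  <property name="protocol" value="FTP"/>
  <property name="ftpUsername" value="ftpUser"/>
  <property name="ftpPassword" value="secretpassword"/>
  <property name="ftpRemoteServer" value="ftp.example.com"/>
  <property name="ftpRemoteFolder" value="archive"/>
  <property name="ftpLocalFolder" value="outputdata"/>
  <!-- Assumption: fileNames is configured as a list of values. -->
  <property name="fileNames">
    <list>
      <value>fileOne.csv</value>
      <value>fileTwo.csv</value>
    </list>
  </property>
</bean>
```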

This configuration will upload "fileOne.csv" and "fileTwo.csv", which are in the "outputdata" folder on the local machine, to the "archive" folder on the "ftp.example.com" FTP server. It will log in as "ftpUser" with the password "secretpassword".

Getting files from remote sites with FTP

Files can be downloaded from FTP servers.

FTP downloads are done using the "com.artifact_software.adt.plugin.fileManagement.FTP" class.

operation The literal text "GET" indicating that you want to move files from the remote server to the local machine.


ftpRemoteFolder The folder on the remote server where the files reside.

ftpLocalFolder The folder on the local machine where the files will be put.

protocol The literal text "FTP"

fileNames The list of files to be downloaded.

This will download all the files in the list provided.

<bean id="FTPTransfer"

class="com.artifact_software.adt.plugin.fileManagement.FTP"> <property name="pluginId" value="FTPFileTransfer"/>

This configuration will download "fileOne.csv" and "fileTwo.csv", which are in the "archive" folder on the remote "ftp.example.com" FTP server, to the "restoredArchive" folder on the local machine. It will log in as "ftpUser" with the password "secretpassword".
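A fuller sketch of the download configuration described above. The bean id is hypothetical. The parameter names ftpUsername, ftpPassword and ftpRemoteServer are taken from the upload parameter list for the same class and are assumed to apply to downloads as well, and the list form of fileNames is likewise an assumption.

```xml
<!-- Sketch only: download two files from a remote server with FTP. -->
<bean id="FTPDownloadSketch"
      class="com.artifact_software.adt.plugin.fileManagement.FTP">
  <property name="pluginId" value="FTPFileTransfer"/>
  <property name="operation" value="GET"/>
  <property name="protocol" value="FTP"/>
  <!-- Assumption: login parameters are named as in the upload task. -->
  <property name="ftpUsername" value="ftpUser"/>
  <property name="ftpPassword" value="secretpassword"/>
  <property name="ftpRemoteServer" value="ftp.example.com"/>
  <property name="ftpRemoteFolder" value="archive"/>
  <property name="ftpLocalFolder" value="restoredArchive"/>
  <!-- Assumption: fileNames is configured as a list of values. -->
  <property name="fileNames">
    <list>
      <value>fileOne.csv</value>
      <value>fileTwo.csv</value>
    </list>
  </property>
</bean>
```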

Sending files to Remote sites with sFTP

Files can be sent to remote servers using the secure FTP protocol.

sFTP uploads are done using the "com.artifact_software.adt.plugin.fileManagement.SFTP" class.

operation The literal text "PUT" indicating that you want to move files from the local machine to the remote server.

ftpUsername The username to be used on the sFTP server

ftpRemoteFolder The folder on the remote server where the files will reside.

ftpLocalFolder The folder on the local machine where the files reside.

protocol The literal text "sFTP"


This will upload all the files in the list provided.
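A minimal sketch of an sFTP upload configuration. The bean id and pluginId are hypothetical. The parameters ftpPassword, ftpRemoteServer and fileNames do not appear in the sFTP parameter list above; they are assumed here to match the FTP plugin, and the list form of fileNames is also an assumption. The values follow the description below.

```xml
<!-- Sketch only: upload two files to a remote server with sFTP. -->
<bean id="sFTPUploadSketch"
      class="com.artifact_software.adt.plugin.fileManagement.SFTP">
  <property name="pluginId" value="sFTPFileUpload"/>
  <property name="operation" value="PUT"/>
  <property name="protocol" value="sFTP"/>
  <property name="ftpUsername" value="ftpUser"/>
  <!-- Assumption: password and server use the FTP plugin's names. -->
  <property name="ftpPassword" value="secretpassword"/>
  <property name="ftpRemoteServer" value="sftp.example.com"/>
  <property name="ftpRemoteFolder" value="archive"/>
  <property name="ftpLocalFolder" value="fileToSave"/>
  <!-- Assumption: fileNames is configured as a list of values. -->
  <property name="fileNames">
    <list>
      <value>fileOne.csv</value>
      <value>fileTwo.csv</value>
    </list>
  </property>
</bean>
```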

This configuration will upload "fileOne.csv" and "fileTwo.csv", which are in the "fileToSave" folder on the local machine, to the "archive" folder on the "sftp.example.com" sFTP server. It will log in as "ftpUser" with the password "secretpassword".

Getting files with sFTP

Files can be retrieved from a remote server using the secure FTP protocol.

sFTP downloads are done using the "com.artifact_software.adt.plugin.fileManagement.SFTP" class.

operation The literal text "GET" indicating that you want to move files from the remote server to the local machine.

ftpUsername The username to be used on the sFTP server

ftpRemoteFolder The folder on the remote server where the files reside.

ftpLocalFolder The folder on the local machine where the files will be put.

protocol The literal text "sFTP"

fileNames The list of files to be downloaded.

This will download all the files in the list provided.

<bean id="sFTPTransfer"

class="com.artifact_software.adt.plugin.fileManagement.SFTP"> <property name="pluginId" value="FTPFileTransfer"/>

<value="fileOne.csv"/> <value="fileTwo.csv"/>


</bean>

This configuration will download "fileOne.csv" and "fileTwo.csv", which are in the "archive" folder on the remote "sftp.example.com" sFTP server, to the "restoredArchive" folder on the local machine. It will log in as "ftpUser" with the password "secretpassword".
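A fuller sketch of the sFTP download described above. The bean id is hypothetical. The parameters ftpPassword and ftpRemoteServer are not in the sFTP parameter list and are assumed to match the FTP plugin; the list form of fileNames is also an assumption.

```xml
<!-- Sketch only: download two files from a remote server with sFTP. -->
<bean id="sFTPDownloadSketch"
      class="com.artifact_software.adt.plugin.fileManagement.SFTP">
  <property name="pluginId" value="FTPFileTransfer"/>
  <property name="operation" value="GET"/>
  <property name="protocol" value="sFTP"/>
  <property name="ftpUsername" value="ftpUser"/>
  <!-- Assumption: password and server use the FTP plugin's names. -->
  <property name="ftpPassword" value="secretpassword"/>
  <property name="ftpRemoteServer" value="sftp.example.com"/>
  <property name="ftpRemoteFolder" value="archive"/>
  <property name="ftpLocalFolder" value="restoredArchive"/>
  <!-- Assumption: fileNames is configured as a list of values. -->
  <property name="fileNames">
    <list>
      <value>fileOne.csv</value>
      <value>fileTwo.csv</value>
    </list>
  </property>
</bean>
```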

Managing Local Files

Files within the ADTransform local workstation or server can be manipulated.

fileOrDirectoryPath path to the file or directory that will be processed. Relative paths are based on the directory from which ADTransform is run.

outputFileOrDirectoryPath path to an output file or directory where the files or directories specified in the fileOrDirectoryPath will be placed in the case of copy, copyContent, move or moveContent commands.

operation the operation to perform on the contents specified by the fileOrDirectoryPath. Command values are case sensitive.

createIfRequired optional true/false parameter specifying whether directories or files in output locations should be created if they do not already exist in the file system. If not specified, it defaults to true.

fileTypeFilter optionally restricts the type of file on which the operation will be performed. For example, a value of "log" or ".log" would perform the specified operation only on files ending with the character string "log" or of the .log file type, leaving any other files in the directory untouched.

The operation parameter accepts the following values:

create creates a new file or directory with the path provided by the fileOrDirectoryPath.

copy copies the specific file at the fileOrDirectoryPath to the outputFileOrDirectoryPath.

copyContents copies the contents of a directory specified by the fileOrDirectoryPath to the outputFileOrDirectoryPath.

delete deletes a file or directory specified by the fileOrDirectoryPath.

deleteContents deletes the contents of a directory specified by the fileOrDirectoryPath.

move similar to the copy command except that a move command removes the original files or directories after copying them to the target location.

moveContent similar to copyContent except that a moveContent command removes the original files or subdirectories.

Related tasks

Copy a File on page 40
Copy a file.

Copy All Files on page 40
Copy all files in a folder to another location.

Move a File on page 41
A single file can be moved from one folder to another.

Move All Files on page 41
Move all files in a folder to another location.

Renaming A Local File on page 41
ADTransform can rename files using the move operation for a single file.

Deleting A File on page 42
A file can be easily deleted.

Deleting all Files in a Folder on page 42
Deletes all of the files in the specified folder.

Copy a File

Copy a file.

Copying files is done using the "com.artifact_software.adt.plugin.fileManagement.FileSystemHandler" class.

This will copy the specified file to a new location.

<bean id="CopySourceFile"

class="com.artifact_software.adt.plugin.fileManagement.FileSystemHandler"> <property name="pluginId" value="CopyInputFile"/>

This configuration will copy "input.csv" in the "inputdata" folder to the "archive" folder.
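A fuller sketch of the copy described above, using the parameters listed under Managing Local Files. The bean id is hypothetical, and the exact path values are assumptions based on the folder and file names in the description.

```xml
<!-- Sketch only: copy a single file into the archive folder. -->
<bean id="CopySourceFileSketch"
      class="com.artifact_software.adt.plugin.fileManagement.FileSystemHandler">
  <property name="pluginId" value="CopyInputFile"/>
  <!-- Operation values are case sensitive. -->
  <property name="operation" value="copy"/>
  <!-- Relative paths are based on the directory from which ADTransform is run. -->
  <property name="fileOrDirectoryPath" value="inputdata/input.csv"/>
  <property name="outputFileOrDirectoryPath" value="archive"/>
</bean>
```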

Related concepts

Managing Local Files

Copy All Files

Copy all files in a folder to another location.

Copying files is done using the "com.artifact_software.adt.plugin.fileManagement.FileSystemHandler" class. This will copy all the csv files in the specified folder to another folder.

<bean id="CopyAllSourceFiles"

class="com.artifact_software.adt.plugin.fileManagement.FileSystemHandler"> <property name="pluginId" value="CopySourceFiles"/>


This configuration will copy all the files in "inputdata" with "csv" type to the folder "archive".

Related concepts

Move a File

A single file can be moved from one folder to another.

Moving files is done using the "com.artifact_software.adt.plugin.fileManagement.FileSystemHandler" class. This will move the specified file to a new location.

<bean id="MoveSourceFile"

class="com.artifact_software.adt.plugin.fileManagement.FileSystemHandler"> <property name="pluginId" value="MoveSourceFiles"/>

This configuration will move "input.csv" to the folder "archive_folder".

Related concepts

Move All Files

Move all files in a folder to another location.

Moving files is done using the "com.artifact_software.adt.plugin.fileManagement.FileSystemHandler" class.

<bean id="MoveAllSourceFiles"

class="com.artifact_software.adt.plugin.fileManagement.FileSystemHandler"> <property name="pluginId" value="MoveSourceFiles"/>

</bean>

This configuration will move all the files in "inputdata" with "csv" type to the folder "archive".

Related concepts

Renaming A Local File

ADTransform can rename files using the move operation for a single file.

To rename a file in place, just specify a "Move File" with the same directory path and a new filename.
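A minimal sketch of a rename, assuming the FileSystemHandler parameters described under Managing Local Files. The bean id, pluginId and both file names are hypothetical.

```xml
<!-- Sketch only: rename a file in place using the move operation. -->
<bean id="RenameSourceFileSketch"
      class="com.artifact_software.adt.plugin.fileManagement.FileSystemHandler">
  <property name="pluginId" value="RenameInputFile"/>
  <property name="operation" value="move"/>
  <!-- Same directory path, new file name (both names are hypothetical). -->
  <property name="fileOrDirectoryPath" value="inputdata/input.csv"/>
  <property name="outputFileOrDirectoryPath" value="inputdata/input_processed.csv"/>
</bean>
```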

Related concepts


Deleting A File

A file can be easily deleted.

Deleting files is done using the "com.artifact_software.adt.plugin.fileManagement.FileSystemHandler" class. This will delete the file specified.

<bean id="deleteSourceFiles"

class="com.artifact_software.adt.plugin.fileManagement.FileSystemHandler"> <property name="pluginId" value="DeleteSourceFiles"/>

This will delete the file "input.csv" from the "inputdata" folder.

Related concepts

Deleting all Files in a Folder

Deletes all of the files in the specified folder.

Deleting files is done using the "com.artifact_software.adt.plugin.fileManagement.FileSystemHandler" class. This will delete all of the files in the specified folder. The fileTypeFilter can be used to reduce the scope of the deletion to a single type of file.

<bean id="DeleteAllSourceFiles"

class="com.artifact_software.adt.plugin.fileManagement.FileSystemHandler"> <property name="pluginId" value="DeleteSourceFiles"/>

</bean>

This configuration will delete all the files with "csv" type from the "inputdata" folder.
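A fuller sketch of the deletion described above, assuming the parameters listed under Managing Local Files. The bean id is hypothetical.

```xml
<!-- Sketch only: delete every csv file in the inputdata folder. -->
<bean id="DeleteAllSourceFilesSketch"
      class="com.artifact_software.adt.plugin.fileManagement.FileSystemHandler">
  <property name="pluginId" value="DeleteSourceFiles"/>
  <!-- deleteContents removes the contents of the directory, not the directory itself. -->
  <property name="operation" value="deleteContents"/>
  <property name="fileOrDirectoryPath" value="inputdata"/>
  <!-- Limits the deletion to files of the csv type. -->
  <property name="fileTypeFilter" value="csv"/>
</bean>
```

Leaving out fileTypeFilter would widen the deletion to every file in the folder, so it is worth setting the filter explicitly when other files must survive.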

Related concepts


Chapter 6: Input and Output

Topics:

• Input and Output Tasks

Reading and writing data is an important part of ADTransform. A number of different file formats can be read and written.

Regardless of the format read in, once the data is read in, it is held in datastores which are treated as rows of data with named columns. This provides a consistent way to manipulate data in the subsequent transformation, validation, reporting and output operations. The plugin that reads the data is responsible for capturing the column names as well as the actual rows of information.

Input and output plugins could be written to read and write a wide variety of files and the input type does not have any relationship to the output. A CSV file could be read, transformed, validated and written out as a formatted report by a plugin that invokes a standard report writer. A data base could be queried to create the internal objects with the final results written back into a set of database tables. Data output also includes the ability to

Standard input and output plugins included in ADTransform handle the most common tasks. This capability can be extended through custom plugins to communicate with external applications that expose Web services or a remote Application Programming Interface (API).


Input and Output Tasks

Input and output are basic functions for any workflow.

The plugins that can read and write data can require different configurations depending on the sources and destinations that are required.

The plugins that read and write files all depend on file specifications that describe the files to be read or written. These are the basic properties that are shared by most of the plugins.

dataStoreName The data store to be written or filled with data.

fileName The name of the file to be read or written

dateLocation Indicates whether a date is to be added to a filename on output. "N" - no date added; "P" - prefix date to the fileName; "S" - put date at the end of the fileName

dateFormat String pattern describing the date format to be used if a date is to be added to the filename. Follows the standard Java date pattern syntax.

Each plugin may require additional information.

Reading and Writing CSV files

CSV (Comma Separated Values) files can be read into data stores for processing, and data stores can be output as CSV files for processing in other systems.

CSV files can be read by the "com.artifact_software.adt.plugin.fileManagement.CSVFileReader" class. The CSV reader can handle very complex data structures including extra lines at the beginning and end of the file as well as gaps between the header and the first line of data. This can be very useful if you have data files that have an explanation or comments at the beginning of the file and have summary information at the end that you do not want to read. It can also read a fixed number of lines from the data section of the file in order to build test data or to extract parts of the file without reading the whole file.

Writing of CSV files is done by the "com.artifact_software.adt.plugin.fileManagement.CSVFileWriter" class. The Wikipedia entry on CSV describes the characteristics of the CSV file.

Reading and writing use the CSV File Specification to describe the name and characteristics of the file to be read or written.

dataStoreName The name of the datastore where the data read in will be saved or the name of the data store containing the data to be written.

fileName The name of the CSV file to be read or created.

columnNames This is the list of column names that are to be written into the output file. This is a list of values that match the name of a column in the data store that is being written.

columnSeparator Although the name implies that fields in the file are separated by commas, it is possible to use other characters to separate fields.

quoteCharacter The text fields in a CSV file may be surrounded by a quotation character, which is usually a single or double quote.


then you need to prefix it with an escape character that tells the CSV reader to treat the following character as part of the string rather than as an end of string marker.

ignoreLeadingWhiteSpace Setting this to true instructs the reader to ignore any spaces or tabs at the start of a field.

encoding The default encoding is UTF-8. If you need to read or write a file using another encoding such as ISO-8859-1, you can specify it using the encoding property.

dropFirst Normally processing starts at the first row in the file, but lines can be ignored by setting this property to the number of rows to ignore at the start of the file. This is used to skip over any introductory text or irrelevant data before the header row, or before the first row of data if there is no header. If omitted, it defaults to 0.

headerRowRequired CSV files may or may not include a header row. If this property is set to true, the header row will be read or written. Otherwise, the first row will be considered to contain data on input, and the output file will not include a header row with the column names.

skipRows This is the number of records to skip between the header and the first row of data. This is only available in the Advanced reader. It is used to ignore extra rows such as blank lines after the header or to jump partway through the data before registering data. It defaults to 0 if not specified.

readRows This is the number of data records to register. This is only available in the Advanced reader. It is used to read a fixed number of rows. If it is not specified or is set to a negative value, all the rows after those skipped at the top of the file will be read up to the end of the file, less any rows excluded by a positive "dropLast" value. By setting skipRows and readRows you can read a fixed number of consecutive rows from the middle of the CSV file.

dropLast This is the number of records to drop at the end of the file. This is only available in the Advanced reader. It is used to ignore extra rows such as totals or dates at the end of the file. If 0 or not specified no lines are dropped.

When data is written, the column names in the data store will be written into the header row as field names. On input, the data read in will appear as rows in a data store and the columns will be given the names from the header row. If there is no header in the input file, the columns will be named "column1", "column2", and so on.


The reader and writer will process a number of files in a single task. Each file is described in its own specification so the files can have different specifications of header rows, quote characters, etc.

<bean id="LocationsFileSpec"

class="com.artifact_software.adt.plugin.fileManagement.specification.CSVFileSpecification"> <property name="headerRowRequired" value="true"/>

The file specifications are configured as a fileSet property, which is a Set of CSVFileSpecifications. This configuration instructs ADTransform to read three CSV files into the datastores "locations", "job_types" and "organizations". The first two use the "|" symbol as a field separator and do not have quotes around the text field values. They all include a header row with the column names. The third file has quotes around each text field and uses the "\" to signal quotes that are part of the data to be read.

<bean id="readallCVSfiles"

class="com.artifact_software.adt.plugin.fileManagement.CSVFileReader"> <property name="pluginId" value="input CSV files"/>

<bean id="JobTypeFileSpec"

<bean id="InternalOrgsFileSpec"

This configuration is similar to the example above and instructs ADTransform to read three CSV files into the datastores "locations", "job_types" and "organizations". They all use the "|" symbol as a field separator and do not have quotes around the text field values. They all include a header row with the column names. In addition, while reading the "locations.csv" file, the first 3 rows will be skipped before the header is read and the last 2 rows in the file will be dropped.

In the second file, the header will be read from the first line and then 1 data row will be skipped and up to 110 rows will be processed.


rows of summary data at the end to be ignored.
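A minimal sketch of the first two file specifications described above, using the row-selection properties documented earlier. It assumes CSVFileSpecification is also the specification class used with the Advanced reader. The enclosing reader bean is left out because this section does not name the Advanced reader class. The file name "job_types.csv" is hypothetical, as the text names only the data store.

```xml
<!-- Sketch only: file specifications that select rows from two CSV files. -->
<set>
  <bean id="LocationsFileSpec"
        class="com.artifact_software.adt.plugin.fileManagement.specification.CSVFileSpecification">
    <property name="dataStoreName" value="locations"/>
    <property name="fileName" value="locations.csv"/>
    <property name="columnSeparator" value="|"/>
    <property name="headerRowRequired" value="true"/>
    <!-- Skip 3 rows before the header and drop the last 2 rows of the file. -->
    <property name="dropFirst" value="3"/>
    <property name="dropLast" value="2"/>
  </bean>
  <bean id="JobTypeFileSpec"
        class="com.artifact_software.adt.plugin.fileManagement.specification.CSVFileSpecification">
    <property name="dataStoreName" value="job_types"/>
    <!-- Hypothetical file name. -->
    <property name="fileName" value="job_types.csv"/>
    <property name="columnSeparator" value="|"/>
    <property name="headerRowRequired" value="true"/>
    <!-- Read the header, skip 1 data row, then read up to 110 rows. -->
    <property name="skipRows" value="1"/>
    <property name="readRows" value="110"/>
  </bean>
</set>
```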

<bean id="JobTypeFileSpec"

<bean id="InternalOrgsFileSpec"

The following configuration will write a CSV file from the data store called "people". A header row will be written. It will use "|" to separate fields. Text fields will not be enclosed in a quote character.

<bean id="WritePeopleCSVFile"

class="com.artifact_software.adt.plugin.fileManagement.CSVFileWriter"> <property name="pluginId" value="Write person data"/>

<bean id="PeopleFileSpec"

        <property name="fileName" value="output/int_person.csv"/>
        <property name="columnNames">
          <list>
            <value>ID</value>
            <value>USERNAME</value>
            <value>LNAME</value>
            <value>FNAME</value>
            <value>MNAME</value>
            <value>GENDER</value>
            <value>LOCALE</value>
            <value>TIMEZONE</value>
            <value>COMPANY</value>
            <value>STATUS</value>
            <value>MANAGER</value>
            <value>ADDRESS</value>
            <value>CITY</value>
            <value>STATE</value>
            <value>JOBTYPE</value>
            <value>JOB_TITLE</value>
            <value>EMAIL</value>
          </list>
        </property>
      </bean>
    </set>
  </property>
</bean>

Reading files

Excel spreadsheets can be read into data stores f