Creating a differencing graph
OUTPUT RECORDS SENT TO OUTPUT
Trash Unchanged records (that is, records that are identical in both input files).
Different Data
file Records showing changes between the Old Data and New Data datasets. Each record includes the key field(s), and an old_field and new_field field for each non-key field:
z For non-key fields that have not changed, these fields are blank.
z For non-key fields that have changed, the old_ field and new_ field fields contain the old and new values of the field.
Updates file Records from the New Data file that have keys matching those of records in the Old Data file but have one or more differing non-key fields; in other words, these are updated versions of existing records.
Adds file Records from the New Data file whose keys do not appear in a record in the Old Data file.
Deletes file Records from the Old Data file whose keys do not appear in a record in the New Data file.
AB INITIO CONFIDENTIAL AND PROPRIETARY Finding differences 69 For difference.mp, the record format of the input files is as follows:
record
date("YYYYMMDD") date_key;
string(8) str_key;
decimal(8) dec_field;
little endian integer(8) int_field;
string("\n") str_field;
end
The key is formed by the first two fields:
{date_key; str_key};
2.To determine which records have been added or deleted, and to collect the records that exist in both the old and new datasets, use a JOIN component.
Connect the JOIN component to the two PARTITION BY KEY AND SORT components, name it appropriately (in difference.mp, this is Check for Matching Keys), and configure it as follows:
a.Set the join_type parameter to Inner Join (matching records required on all inputs). b.Set the key parameter to the same value as the key for the PARTITION BY KEY AND SORT
components you used in Step 1.
c. To collect records that have been deleted (records that are present in the old dataset but not in the new one), connect the unused0 port to an output dataset, and name the dataset appropriately (in difference.mp, this is the Deletes dataset).
d.To collect records that have been added (records that are present in the new dataset but not in the old one), connect the unused1 port to an output dataset, and name the dataset appropriately (in difference.mp, this is the Adds dataset).
If necessary, choose View > Optional Ports to show all optional ports. Then connect the unused0 and unused1 ports to the output datasets, and deselect Optional Ports to hide the unneeded ports.
70 Finding differences AB INITIO CONFIDENTIAL AND PROPRIETARY
e. For the out port, create an output record format that includes a field for both the old and the new values of each field, so that you can later compare the values to see which ones have changed. In the difference.mp graph, this is difference_both.dml:
record
f. To collect records that exist in both the old and the new datasets, edit the transform parameter such that it writes all matching old and new fields from each input file to the out port. The difference.mp graph does this by using the file combine.xfr, which contains the following transform:
out :: join(old, new) = begin
out.date_key :: old.date_key;
out.str_key :: old.str_key;
out.old_dec_field :: old.dec_field;
out.new_dec_field :: new.dec_field;
out.old_int_field :: old.int_field;
out.new_int_field :: new.int_field;
out.old_str_field :: old.str_field;
out.new_str_field :: new.str_field;
end;
AB INITIO CONFIDENTIAL AND PROPRIETARY Finding differences 71 3.To find the differences among records that exist in both the old and the new datasets, use a
REFORMAT component.
Connect the REFORMAT component to the JOIN from Step 2, name it appropriately (in difference.mp, this is Split Out Records with Differences), and configure it as follows:
a.Set reject_threshold to Never abort.
b.Create a transform that compares the new and the old value for each field, sends records that do not have differences to a TRASH component, and rejects records that do have differences. (You will use the “rejected” records to generate two of the output datasets.)
The difference.mp graph does this by using the transform diff_records.xfr: out :: reformat(in) =
begin
out.date_key :: in.date_key;
out.str_key :: in.str_key;
out.dec_field :: if (in.old_dec_field == in.new_dec_field) in.old_dec_field;
out.int_field :: if (in.old_int_field == in.new_int_field) in.old_int_field;
out.str_field :: if (in.old_str_field == in.new_str_field) in.old_str_field;
end;
NOTE: This transform works because the fields in the output record format (specified in difference_input.dml) do not have default values. For fields with default values, you would have to use something like the following to ensure that records with differences would be sent to the reject port:
out.dec_field ::
if (in.old_dec_field == in.new_dec_field) in.old_dec_field
else force_error("Old and new values are different");
72 Finding differences AB INITIO CONFIDENTIAL AND PROPRIETARY
4.Replicate the flow from the reject port of the Split Out Records with Differences component (these are the records with differences).
5.Use one REFORMAT component to output a dataset containing the differences between the old and the new data.
Name the REFORMAT and the output dataset appropriately and code the transform of the REFORMAT as shown below. (In difference.mp, the REFORMAT is Leave Only Different Fields, the output dataset is Different data, and the transform is diff_fields.xfr.)
a. Output the old and new values of the key fields.
b. Compare the old and new values of the other key fields. If they are different, output each value. If they are the same, output a blank field.
out :: reformat(in) = begin
out.date_key :: in.date_key;
out.str_key :: in.str_key;
out.old_dec_field :: if (in.old_dec_field != in.new_dec_field) in.old_dec_field else "";
out.new_dec_field :: if (in.old_dec_field != in.new_dec_field) in.new_dec_field else "";
out.old_int_field :: if (in.old_int_field != in.new_int_field) in.old_int_field else "";
out.new_int_field :: if (in.old_int_field != in.new_int_field) in.new_int_field else "";
out.old_str_field :: if (in.old_str_field != in.new_str_field) in.old_str_field else "";
out.new_str_field :: if (in.old_str_field != in.new_str_field) in.new_str_field else "";
end;
The above transform sends the following data to the output file Different Data:
AB INITIO CONFIDENTIAL AND PROPRIETARY Finding differences 73 6.Use another REFORMAT component to output a dataset containing the updated records. Name
the REFORMAT and the output dataset appropriately (in differences.mp, they are named Output Updated Records and Updates) and code the transform of the REFORMAT to assign the new value of each field to the updated records. In differences.mp, the transform is diff_only_new_fields.xfr:
out :: reformat(in) = begin
out.date_key :: in.date_key;
out.str_key :: in.str_key;
out.dec_field :: in.new_dec_field;
out.int_field :: in.new_int_field;
out.str_field :: in.new_str_field;
end;
74 Finding differences AB INITIO CONFIDENTIAL AND PROPRIETARY