Chapter 7. Building a golden record
7.3 Sample process for building a virtual record
7.3.2 Workflow for arriving at a golden view
Figure 7-2 shows the steps that are involved in building a golden view using InfoSphere MDM Standard Edition.
Figure 7-2 InfoSphere MDM Standard Edition core workflow
The workflow for arriving at a golden view with InfoSphere MDM SE includes the following major steps:
– Step 1: Creating a workspace
– Step 2: Configuring the model
– Step 3: Entity creation
– Step 4: Designing an algorithm
– Step 5: Setting up a server instance
– Step 6: Deploying the configuration
– Step 7: Importing the data
– Step 8: Weight generation
– Step 9: Bulk comparison
– Step 10: Threshold analysis
Step 1: Creating a workspace
The workspace is technically an Eclipse project that facilitates the creation and design of algorithms.
Step 2: Configuring the model
In this step, you create the basic member model, which consists of a member, attribute types, attributes, and sources (a source is the name of the actual external system where the data resides). Alternatively, you can import a predefined configuration into the workspace by using the domain templates, which provide predefined models for party, patient, provider, and so on.
Step 3: Entity creation
In InfoSphere MDM SE, an entity is a distinct type of master data, such as a person or organization, that shares a common linking identifier, called an Entity Identifier (EID), across all the matched master data records. Figure 7-3 shows the relationship between the entity, members, and the algorithm associated with the member model.
Step 4: Designing an algorithm
Algorithm design involves deriving the match and link algorithm by using the predefined functions that are available in the workbench.
The following matching process is used:
Optimize data for statistical comparisons:
– Operate against raw data, no cleansing required.
– Normalize and compact data, create derived data layer. Source data remains intact.
– Apply phonetic equivalences, tokenization, nicknames, and so on in the matching process.
Find all the potential matches:
– Cast a wide net: partial matches, reversals, anonymous values, and so on.
– The final candidate list is a combination (OR) of multiple bucket roles.
– Each bucket role is created from functions of parts of a single attribute, combinations of multiple attributes, or both.
Score accurately using probabilistic statistics:
– Compare attributes one-by-one and produce a weighted score (likelihood ratio) for each pair of records.
– Frequency weights specific to your business.
Figure 7-4 shows the InfoSphere MDM SE matching approach with the sample data in the buckets and bucket hashes. Buckets group the data based on attributes in the model, for example name and phone number, or name and zip code. Bucket hashes are unique IDs created for a particular bucketing criterion. The end result is a set of master data entities that are formed and ready for threshold analysis, where the upper and lower thresholds are fixed.
Figure 7-4 InfoSphere MDM matching process
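The bucketing idea described above can be illustrated with a minimal Python sketch. It is not the InfoSphere MDM implementation: the role names (NAME+PHONE, NAME+ZIP), the normalize function, the use of SHA-1, and the sample members are all assumptions chosen for illustration. It shows only how records that share any bucket hash become candidate pairs, so the candidate list is the OR of the bucket roles.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

def normalize(value):
    """Derive a compact comparison form; the source value itself is not changed."""
    return "".join(ch for ch in value.upper() if ch.isalnum())

def bucket_hash(role, *parts):
    """Hash a (hypothetical) bucket role name plus its normalized parts into a short ID."""
    key = role + "|" + "|".join(normalize(p) for p in parts)
    return hashlib.sha1(key.encode("utf-8")).hexdigest()[:12]

def bucket_hashes(member):
    """Return the hashes for two illustrative bucket roles."""
    return {
        bucket_hash("NAME+PHONE", member["last_name"], member["phone"]),
        bucket_hash("NAME+ZIP", member["last_name"], member["zip"]),
    }

def candidate_pairs(members):
    """Members that share ANY bucket hash become candidate pairs (OR of all roles)."""
    buckets = defaultdict(set)
    for member_id, member in members.items():
        for h in bucket_hashes(member):
            buckets[h].add(member_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

# Made-up sample data for illustration only.
members = {
    1: {"last_name": "Smith", "phone": "555-0100", "zip": "10001"},
    2: {"last_name": "SMITH", "phone": "(555) 0100", "zip": "94105"},
    3: {"last_name": "Smith", "phone": "555-0199", "zip": "94105"},
    4: {"last_name": "Jones", "phone": "555-0100", "zip": "10001"},
}
pairs = candidate_pairs(members)
print(sorted(pairs))
```

In this sketch, members 1 and 2 meet in the name-and-phone bucket and members 2 and 3 meet in the name-and-zip bucket, while member 4 shares no bucket with anyone and is never compared.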
Step 5: Setting up a server instance
This step involves connecting the workbench to an InfoSphere MDM instance. The connection requires the MDM instance server details, such as the host and port.
Step 6: Deploying the configuration
Deploy the configuration to the InfoSphere MDM instance as shown in Figure 7-5. This deploys the model that is developed in the workbench. The deployment includes core configuration, algorithms, string data, and weight data.
Figure 7-5 Deploy configuration
Figure 7-6 shows the steps that are completed at this point.
Step 7: Importing the data
After the data is brought into a standard format by using an ETL tool, use the InfoSphere MDM SE utility, mpxdata, to massage, create, and import the data.
The mpxdata utility uses raw data to build member unload files (.unl), generate comparison strings, assign bucket hashes, and create binary files. Figure 7-7 shows the input file, configuration, and the associated job.
Figure 7-7 mpxdata job
Step 8: Weight generation
Weight generation is a process within the Master Data Engine that enables the generation and assignment of weight values to attributes. Weight assignments enable determination of a match or non-match between members. After the weight generation process, the configuration must be deployed.
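To make the idea of frequency-based weights concrete, the following Python sketch computes an agreement weight per attribute value as the base-10 logarithm of a likelihood ratio. This is a generic probabilistic-matching illustration, not the Master Data Engine's weight generation algorithm: the function name frequency_weights, the assumed agreement probability of 0.95 for true matches, and the sample name list are all hypothetical.

```python
import math
from collections import Counter

def frequency_weights(values, m_probability=0.95):
    """Return an agreement weight per value: log10(m / u).

    m is the assumed chance that two records for the same person agree on
    this attribute; u is approximated here by the relative frequency of the
    value, so rare values earn a higher weight than common ones.
    """
    counts = Counter(values)
    total = len(values)
    return {
        value: math.log10(m_probability / (count / total))
        for value, count in counts.items()
    }

# Made-up distribution of normalized last names.
last_names = ["SMITH"] * 50 + ["JONES"] * 30 + ["BROWN"] * 19 + ["ZELENKA"]
weights = frequency_weights(last_names)

for name in sorted(weights, key=weights.get, reverse=True):
    print(f"{name:8s} {weights[name]:.2f}")
```

Agreement on a rare name such as ZELENKA contributes far more evidence of a match than agreement on SMITH, which is the reason the weights depend on the frequencies in your own data.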
Step 9: Bulk comparison
This process matches and links the entities in bulk by using the bulk cross match (BXM) utility. The BXM process is made up of two primary jobs: compare members in bulk (mpxcomp) and link entities (mpxlink). After the compare and link jobs run, the data must be loaded into the database.
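The two-stage compare-then-link pattern can be sketched in Python as follows. This is not how mpxcomp or mpxlink work internally: the per-attribute weights, the candidate pairs, the linking threshold, and the choice of transitive linking through a union-find structure are assumptions made for illustration only.

```python
# Hypothetical agreement and disagreement weights per attribute.
AGREE = {"last_name": 4.0, "phone": 5.0, "zip": 2.0}
DISAGREE = {"last_name": -3.0, "phone": -1.0, "zip": -1.0}

def compare(a, b):
    """Compare stage: sum the weight of each attribute, one attribute at a time."""
    return sum(AGREE[f] if a[f] == b[f] else DISAGREE[f] for f in AGREE)

def link_entities(member_ids, scored_pairs, threshold):
    """Link stage: members joined by a pair scoring at or above the threshold
    receive the same entity ID (here, the smallest member ID in the group)."""
    parent = {m: m for m in member_ids}

    def find(m):
        while parent[m] != m:
            parent[m] = parent[parent[m]]  # path compression
            m = parent[m]
        return m

    for (a, b), score in scored_pairs.items():
        if score >= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[max(root_a, root_b)] = min(root_a, root_b)
    return {m: find(m) for m in member_ids}

# Made-up, already normalized members and candidate pairs.
members = {
    1: {"last_name": "SMITH", "phone": "5550100", "zip": "10001"},
    2: {"last_name": "SMITH", "phone": "5550100", "zip": "94105"},
    3: {"last_name": "SMITH", "phone": "5550199", "zip": "94105"},
    4: {"last_name": "JONES", "phone": "5550100", "zip": "10001"},
}
candidates = [(1, 2), (2, 3), (1, 4)]

scored_pairs = {(a, b): compare(members[a], members[b]) for a, b in candidates}
entity_ids = link_entities(members, scored_pairs, threshold=5.0)
print(scored_pairs)
print(entity_ids)
```

Here members 1, 2, and 3 end up with the same entity ID even though members 1 and 3 were never compared directly, because both link to member 2. Whether a real deployment should link transitively is a design decision that this sketch simply assumes.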
Step 10: Threshold analysis
The objective of this step is to arrive at a threshold score above which two matched members can be considered the same; in other words, the two matching records represent the same real-world entity. The following tools are used in this step:
Pair manager: A GUI tool that helps in arriving at the threshold.
Threshold calculator: A tool in the workbench that helps in arriving at the threshold.
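As a small illustration of how an upper and a lower threshold partition pair scores, consider the Python sketch below. The threshold values, the sample scores, and the three outcome labels are assumptions; in particular, sending pairs that fall between the thresholds to manual review is a common convention assumed here, not something taken from the product.

```python
from collections import Counter

def classify(score, lower, upper):
    """Place a pair score into one of three bands defined by two thresholds."""
    if score >= upper:
        return "match"
    if score >= lower:
        return "review"  # assumed: a person inspects these pairs
    return "non-match"

# Made-up pair scores and thresholds.
pair_scores = [12.3, 8.0, 6.5, 3.1, -2.0]
LOWER, UPPER = 5.0, 9.0

outcomes = [classify(s, LOWER, UPPER) for s in pair_scores]
summary = Counter(outcomes)
print(outcomes)
print(dict(summary))
```

Raising the upper threshold moves more pairs into the review band and reduces false links, while lowering the lower threshold sends more borderline pairs to review instead of rejecting them outright.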
Figure 7-8 shows the status of the InfoSphere MDM SE core configuration workflow after step 10.
Figure 7-8 InfoSphere MDM Standard Edition workflow status
At this point, a golden view of the record or entity is created; the rest of the process is for fine tuning and report generation for analysis.
Figure 7-9 shows the final view of the enterprise application integration with InfoSphere MDM. Enterprise applications can use the InfoSphere MDM SDK or web services to query and view the golden record.