Process-Driven and
Context-Powered Data Quality
Methodology
A MIOsoft White Paper
MIOvantage
www.MIOvantage.comTABLE OF CONTENTS
TABLE OF CONTENTS ... 3
CONTEXT DATA QUALITY PROVIDES DRAMATIC COST REDUCTION ... 4
BUSINESS BENEFITS OF CONTEXT ... 4
BUSINESS PROCESSES DRIVE EFFECTIVE DATA QUALITY ... 4
KEY POINTS ... 5
PROCESSES DETERMINE EXPECTATIONS ... 5
CONTEXT ALLOWS COMPLEX DATA QUALITY MEASURES ... 7
INTEGRATION CHECKPOINTS ... 7 FORMALIZATION ... 10 EXAMPLE ... 11 CONTEXT CLUSTERING ... 12 FORMALIZATION ... 13 EXAMPLE ... 13 EXPECTED VALUES ... 14 FORMALIZATION ... 15 EXAMPLE ... 16
COMPARISON OF ACTUAL AND EXPECTED DATA ... 17
CATEGORIZATION OF FINDINGS ... 18
FORMALIZATION ... 18
ROOT CAUSE ANALYSIS ... 18
BIG DATA QUALITY REQUIRES CONTEXT ... 20
CONTEXT DATA QUALITY PROVIDES DRAMATIC COST REDUCTION
Companies know that business success requires high-quality data.
MIOsoft’s data quality assurance method uses context to leverage the insights of a process’s information by clustering all related data. Context provides a direct way to discover the root causes of issues and perform real-time transformation, directly reducing data-quality-related support and cleansing costs.
Business Benefits of Context
Reduces support costs and increases customer satisfaction by preventing problems pro-actively, in many cases before they are encountered by the customer.
Finds all problems in the process, even unanticipated ones.
Reduces analysis and cleansing costs by finding the root causes of data quality prob-lems.
Monitors and controls side effects of data cleansing.
Continuous data quality analysis offers real-time KPIs for monitoring and analyzing cus-tomer behavior.
BUSINESS PROCESSES DRIVE EFFECTIVE DATA QUALITY
‘Fit for use’ is the most common definition of data quality. In this paper, we will discuss a meth-odology which allows us to define the actual use and value of our data, then measure whether the data is fit for this use.
First the context is chosen, according to the process to be analyzed: for example, an order or a customer. We define checkpoints along the process’s data flow, especially at critical points of the process and after data transactions. Checkpoints can be between, across, and within the systems that belong to the process. At all checkpoints, we can leverage context to quickly de-termine the expected data values and compare them to the actual data values found in or be-tween the systems.
Key Points
Traditional approaches to data quality only discover and treat the symptoms of underly-ing problems.
Finding inaccurate data across or between systems using the process helps identify the root cause of the problem so that it can be eliminated.
Context allows the data to easily be understood with respect to differing systems and processes.
Appropriate context technology for data quality provides three main features: check-points that can be set throughout the process when data is at rest, captures that can be set between checkpoints to record data in motion, and advanced relationship discovery techniques to combine all related data in one context.
PROCESSES DETERMINE EXPECTATIONS
We use the process to compute the expected data so that we can later measure whether or not the actual data meets these criteria. Computing the expected values is critical, since this gov-erns the output. Although this may seem straightforward, in reality companies often do not know their expected data values in detail. By performing an analysis of a company’s imple-mented business processes, we can understand the data’s purpose, expected use, and expected values.
It is important to be aware that in many companies, especially large enterprises, some pro-cesses do not work as described in their documentation. Workarounds are frequently intro-duced and are usually not accurately or fully documented.
Choice of processes to be analyzed (defines the
context)
Definition of checkpoints and captures in process
data flow
Computation of expected data for all checkpoints
and captures
CONTEXT ALLOWS COMPLEX DATA QUALITY MEASURES
We use context to examine complex processes.
A context is a group of related data. For example, a purchase context could consist of the pur-chased product’s information, the customer’s information, and payment information. Using contexts in data quality has several advantages.
Having all process-related data in one context makes finding the expected data and comparing it to the actual data simpler: all needed information is easily available. By using relationship discovery techniques, we can also find data that we did not expect to be related, or data that we did not expect to exist in the system.
The process of forming contexts by collecting and visualizing all related data creates a data pool where potential data problems can be exposed. This allows us to quickly see data quality issues, even unanticipated ones.
To demonstrate how this works, we will discuss a revenue assurance project we implemented for one of the world’s largest telecommunications companies.
CRM- and Order System Which product is ordered? Billing System Line Billing System Internet Legacy Billing System Line Legacy Product has Internet part
Customer is migrated? Product has line part
YES No Rating System Product needs to be rated? YES End No
The basis of any revenue assurance program is a single question: Did we bill an order correctly? Managers require a quick answer to this question, but the complexity of the answer can vary depending on the complexity of the business processes. An abstracted model of our example business process is shown above.
INTEGRATION CHECKPOINTS
We define checkpoints at critical data locations along the business process. Checkpoints are places where we check whether the data flow contains the correct data at a given time. To do
this, we unload one or more snapshots of the checkpoint, which consist of the relevant attrib-utes and the snapshot time, and compare them.
In order to generate a snapshot, and to make two snapshots comparable, we need specialized data integration techniques.
In our revenue assurance example, one of the sources, an Oracle database system, was very large. We decided to unload the system’s CRM data with a delta process and deliver the data without any table relations. Realities and compromises like these are common, as is an initially incomplete understanding of the data’s relationships. Therefore, one benefit of appropriate context technology is the ability to discover the relationships in the data: for example, connect-ing contracts to customer data.
In the figure below, we can see an example of the complexity of key connections for our reve-nue assurance example. Note that the product system is used to transform keys in order to make two systems compatible and comparable.
In order to compare the CRM and the Rating system, we need to transfer the CRM product data into rating data. We can do this using the product system. In green, we can see the stored data of the respective systems. Data integration with MIOsoft’s tools can be developed very quickly and easily. In the user interface, we see the data we need to integrate on the right-hand side
CRM- and Order System Asset Internet Line Billing Product Billing System Asset Billing System Internet Lecacy Internet Internet Product Billing System Line Legacy Billing Price Rating System Rating Internet Key Asset Key Billing – Price Key
Product System Price Product InternetProduct Rating Price Key Rating Key Price Key Rating Key InternetProduct Rating Key
and our data quality model on the left-hand side. We can connect the data to the model’s at-tributes with two clicks.
CRM- and Order System
Billing System
Internet Legacy Rating System
Billing System
Line Legacy Rating System
Billing System Line CRM- and Order System CRM- and Order System Starting Time Time of Data Flow Starting Time + Time
that Data Flow 1 needs to reach the
Rating system (Rating time 1) Data Flow 1 Data Flow 2 Data Flow 3 Starting Time + Time that Data Flow
3 needs to reach Billing System Line (Billing Line time 3)
In the previous figure, we see our example process’s data flows and the time at which the snap-shots are taken. We need to be careful to choose the right time to unload, since we have to maximize the synchronization of the compared data and minimize the use of data that is out of sync.
For the definition of a snapshot, we assume that, at every checkpoint, it is possible to unload the relevant data in one table.
In order to implement this, we must use strong data integration techniques. The data integra-tion technologies must handle joins, data transformaintegra-tion and standardizaintegra-tion, and all kinds of standard and non-standard data formats.
Checkpoints are used as master checkpoints or data checkpoints. Master checkpoints are locat-ed at the start of a data flows. Usually, there will only be one logical master checkpoint in a business process, but this is not required. Data checkpoints are located at any non-starting point of a data flow.
In addition to the checkpoints, we also perform captures. In general, captures are unloads of a bus or a queue. Captures allow investigation of the data flow between checkpoints over a peri-od of time. Because of this, captures are helpful when finding root causes. We will discuss this concept in more detail later.
Formalization
A formalization of our integration definition allows us to be more explicit. We define a set I of elements x1, x2,..xn to be I = {x1,x2,..xn}.
Let A be a table of data with rows (records) and columns (attributes) unloaded or creat-ed at time t.
Let C be the set of columns of A; C= {column1, column2,..columnN } Let R be the set of rows of A; R= { row1,row2,…rowM }
Such that A (t)= [C][R]
A is therefore a set of elements where each element (cell) knows its position (column,row). Elements can be of various types like bool, string, date, etc.
We define the subset a of A with selected columns and all rows to be a (t)= [Ca] where Ca is a subset of C.
and the subset b of A with selected columns and selected rows to be b (t)= [Cb][Rb] where Cb is a subset of C and Rb is a subset of R. Let A be the table of system A with
A(time of unload)= [CsystemA][RsystemA].
Then we can define the snapshot of system A (starting time of data flow + time of data flow arriving at system A)
S(A)=[attribute1, attribute2,…attributeN]
where {attribute1, attribute2,…attributeN} is the subset of relevant columns of Csys-temA.
We define a fragment of a checkpoint such that looking at one row of a snapshot we get a fragment.
Fragment X of Snapshot system A = [attribute1, attribute2,…attributeN][row x] We define a Capture K between snapshot A and B to be
K of snapshots A and B
(starting time + time of data flow arriving at system A; starting time + time of data flow arriving at system B) = [attribute1, attribute2,…attributeN]
where {attribute1, attribute2,…attributeN} are a subset of CBus. Again, we get a fragment by looking at just one row.
Fragment X of Capture K = [attribute1, attribute2,…attributeN][row x].
Example
In our revenue assurance example, we can choose the CRM system as the master checkpoint since all data flows begin in the CRM system. Data checkpoints are located at every system with data at rest (Billing Internet, Billing Line, Billing Line Legacy, and Rating).
Snapshot CRM (starting time) =
[customer nb, asset_nb, internet_product_key, product_name, internet_key, billing_nb, price_key, billing_price_key, rating_key ]
Snapshot Billing Internet (starting time + Billing Internet Time 1) =
[customer nb, internet_product_key,product_name, internet_key, rating_key] Snapshot Billing Line (starting time + Billing Line Time 3) =
[asset_nb]
Snapshot Billing Line Legacy (starting time + Billing Line Legacy Time 2) = [billing_price_key, billing_nb, price_key, rating_key]
Snapshot Rating (starting time + Rating Time 1) = [rating_key]
Looking at our example, we would place a capture on the data flow between the CRM system and the Billing Internet system. In the case of an inconsistency, we can locate exactly where the error happened by looking at the messages sent from the CRM system to the Billing Internet system.
Capture of Snapshots CRM and Billing Internet
(starting time;starting time + Billing Internet Time 1) = [customer nb, asset nb, internet_product_key]
CONTEXT CLUSTERING
For our methodology, we have to carefully choose which contexts we are interested in, keeping in mind our desired result and the business process we are analyzing. To get the correct con-text, we have to discover all relationships between fragments. In particular, we are interested in clustering related fragments to form contextually-related groupings.
For that we have several techniques:
Key Connection (equal): The easiest is a simple key connection, which means that two
fragments have one key in common.
Transitive Matching: Also relatively straightforward is using transitive closure to match
related fragments. With transitive closure, two fragments can match without being a di-rect match. For example, if FragmentA matches FragmentB, Fragment B matches Frag-mentC, and Fragment A does not match FragFrag-mentC, transitive closure brings Fragmen-tA, FragmentB, and FragmentC together in one context.
Distance-Based Clustering: For non-key data we need to be more tolerant than
byte-by-byte comparison allows. Instead, we can used distance-based clustering. For instance, for names there are several mathematical algorithms such as Q-Gram and phonetic matching that can assign a “closeness” comparison value between two names.
In order to produce a context, we define rules for fragment clustering. The table above shows an example rule set. Two data records (fragments) are clustered in the same context when
At-tribute2 is equal and Attribute3 is equal after applying the Q-Gram algorithm, or when Attrib-ute6 is equal, or when Attribute1 is equal and Attribute2 is equal after applying a two-typo al-gorithm.
If we were using relational data integration techniques, this would be much more time-consuming and complex. We would have to use the right joins to match all data sources, and would need to decide ahead of time which data we need later. With MIOsoft’s method we can include all data in the context, since it is not much more work and performance is mostly unaf-fected.
A context also serves as a kind of data pool. We can use it to connect all data we are interested in with all the necessary relationships. With all related fragments at available at one glance, we can recognize patterns across complex data types with good performance.
In our revenue assurance example, using context was particularly powerful. Architectural changes to the source systems happened frequently and quickly; data quality rules and calcula-tions changed almost every month. These changes did not necessarily reach the DQ department quickly enough, and many workarounds were also discovered. Context made it easy to discover and communicate these changes between departments.
Context can also be leveraged to determine whether the defined business process is reflected in the data. We do this by predicting how data should look at our checkpoints and checking to see if the actual data matches our prediction.
Formalization
Using the clustering rule table, we get a context which consists of all fragments of all snapshots that meet the rule table criteria.
Context CO (Ruletable) = {fragment1, fragment2,..,fragmentN}
Example
In our revenue assurance example, we used the customer as our context. The Customer cluster-ing rule is shown below.
EXPECTED VALUES
In order to know whether our integrated data is correctly stored, we have to define what should be in our context. This definition is derived from the implemented process; however, this step can be very complex, because business rules for every detail of the data flows in the process must be specified.
Expected values can be computed for snapshots and for captures. The master snapshot is the basis of the computation that determines the expected data values of all snapshots. The ex-pected data values are themselves formed as snapshots. For every data snapshot, we want to convert the master snapshot into the expected data snapshots using the business rules.
In the end, we have a virtual construct describing which data values we expect to find at each data checkpoint. The capture itself is the basis for the computation of our expected data values for captures.
Business Rule
Process message data CRM Data Flow 1
Capture between CRM and Billing Internet CRM- and
Order System
Formalization
Let f be an algorithm f(attribute1) which describes the business rule transforming the value of attribute1 in the master snapshot into the value of attribute1 in the expected snapshot. Busi-ness rules can be lookups in other tables or more complex functions.
Then we can define the expected snapshot as Expected Snapshot system A=
[f(attribute1), f(attribute2), … f(attributeN)] where the set of
{attribute1,attribute2,..attributeN} is the set of relevant columns of the data snapshot. Let fc be an algorithm fc(attributex) which describes the business rule transforming attributex of the source snapshot into the attributey of the capture.
Then we can define the expected capture as Expected Capture of system A and B=
[f(attribute1), f(attribute2), … f(attributeN)] where the set of
{attribute1,attribute2,..attributeN} is the set of relevant columns of the source snap-shot.
Example
In our example we get the following expected snapshots: Expected Snapshot BillingInternet DF1 =
[BRinternet1(customer nb), BRinternet1(internet_product_key), BRinternet1(product_name, internet_key), BRinternet1( rating_key)] Expected Snapshot Rating =
[BRrating1(rating_key)]
Expected Snapshot Billing Line Legacy =
[BRbillingline2(billing_price_key), BRbillingline2(billing_nb), BRbillingline2(price_key), BRbillingline2(rating_key)]
Expected Snapshot Rating = [BRrating2(rating_key)]
Expected Snapshot Billing Line = [BRbillingline3(asset_nb)]
And the following expected capture:
Expected Capture of CRM and BillingInternet =
BRcrm(product_name), BRcrm(internet_key), BRcrm(billing_nb), BRcrm(price_key), BRcrm(billing_price_key), BRcrm(rating_key)]
COMPARISON OF ACTUAL AND EXPECTED DATA
Once we have the expected data values, we have to check whether all data snapshots are the same as the expected snapshots. To do this, we have to define which columns are relevant for the connection of the expected data to the actual data. Then, we can easily compare the ex-pected snapshot with the data snapshots.
We may also find actual snapshots that do not match to any expected snapshot, and vice versa. We saw this in our revenue assurance example:
The actual and expected values were compared and categorized. In our example, we found two different problems. In data flow 1, we found that rating data was missing. We additionally found actual data from unexpected sources.
After the categorization of the found data quality problems, we can analyze our results. This analysis is saved so it can be compared to future results. This is important for finding the source of the problem and can be done by monitoring in a dashboard, or by other reporting and anal-ysis functions.
Moreover, we can generate and unload data and use it to perform manual or automatic data cleansing in the original systems.
CATEGORIZATION OF FINDINGS
For categorization, we distinguish between contexts that are ‘OK’ and contexts that have a data quality problems. ‘OK’ means that all equations in our previous formalization are true; contexts with a data quality problem have at least one equation with a false result. Since we know exact-ly what value is expected, we can even say how far apart the actual and expected values are and categorize using this distance.
In the revenue assurance case, we assigned a cost to every snapshot and its expected snap-shots. By comparing the expected and actual cost and computing the delta, we could calculate missed or overcharged revenue. This is the most important outcome of our revenue assurance problem and drove the whole project. We can easily categorize overbilling (negative delta val-ues) and underbilling (positive delta valval-ues) and can prioritize and organize cleansing according to the problem’s category.
Formalization
We can define our comparison as a simple Boolean test for some context x: Context x is ‘OK’ IFF Expected Snapshot system A = snapshot system A
ROOT CAUSE ANALYSIS
Sometimes just finding the data problem shows its root cause, but often this is not the case, and we need to investigate inside the context.
Because a context has all related data at one glance, we have a platform from which to find the root problem. We can measure whether the expected message was sent from the source sys-tem, whether the message reached the next syssys-tem, and how the system used the message to update its data.
Expected Order 1 Order 2 Order 3 Data Flow 1 Billing Internet Legacy $12,99 Rating $4,67 Data Flow 2 Billing Line Legacy $23,50 Data Flow 3 Billing Line $4,67 Actual Order 1 Order 2 Order 3 Data Flow 1 Billing Internet Legacy $12,99 Data Flow 2 Billing Line Legacy $22,00 Data Flow 3 Billing Line $4,67 Rating $3,80 Billing Line Legacy $23,50 comparison No Actual Data ! Missing Rating Data! Rating $3,80 Message DF1 Message DF2 Message DF2 Message DF2 Message DF2
We can see this clearly by looking at our revenue assurance example. In data flow 1, we expect a message going from the Billing Internet system to the Rating system, but this message is not in the actual capture. This is why the data is missing in the Rating snapshot. In data flow 2, we see that all messages we expect are actually captured, so the Rating snapshot is correct.
To find the root cause of missing or additional snapshot data, we search for the message we expected to build the snapshot data and identify where something went wrong. Here, we found that the root cause was that the Billing Internet system did not send the expected message to the Rating system.
BIG DATA QUALITY REQUIRES CONTEXT
Big data is often understood to be high-volume data with no structure. In reality, the variety and complexity of data requires the use of big data methods to deliver more business benefit. MIOsoft’s context-powered and process-driven data quality method provides the opportunity to show data quality failures for multiple very complex data structures. With this method, even unreachable data like data in motion or unforeseen special business cases can be considered and can play an important role in decision-making. In particular, the contextual consolidation of data at rest and data in motion enables a better understanding of the data flow and process optimization.
Big data also must be cost-effective, using approaches such as highly distributable technologies and cost-effective commodity hardware. Some contextual technologies, such as those from MIOsoft, are particularly well designed to scale up.
Furthermore, leveraging contextual systems allows reduction of time-to-market.
Time-consuming tasks such as data integration with extra ETL tools or manual matching are often not needed when a contextual system allows and understands contextual modeling in a first-order way.