• No results found

Mining patterns in the Hub Entity

ural Defin

Hub Example the examples

4.12 Mining patterns in the Hub Entity

The Hub table brings together previously disassociated business keys. It represents lists of these business keys in a single common table. For example: a list of all part numbers that appear across the enterprise. Patterns can be mined from the single list of business keys. By coagulating the business keys from multiple source systems into a single component, it becomes possible to extract business value and meaning.

Hubs can be mined for the following information:

• Entry patterns and format masks

• Source System Key creation/generation patterns over time

• Possible ontological relationships or hierarchies within keys

By mining the Hubs’ data it is possible to discover practical associations and ties across business keys. Hierarchies and ontologies can be discovered which translates into added business value. The results always need to be checked against the business to see if they are false positives. Complex inter-relationships across the internal data patterns and shifts in entry can be discovered. It is interesting to note that the longer the business key, the more likely it is to make these discoveries.

Entry patterns and format masks can also be established. The percentage of data that meets particular patterns can be assigned. The greater the percentage, the more likely the business rule is out there somewhere being utilized. It is possible to tie strength and confidence ratings to

percentages of data meeting specific patterns that have been discovered. Just as with the last case, the more data involved in the discovery, the higher the confidence that the discovered pattern is an applicable business pattern.

Source system key creation (or broken business requirements) can be discovered as the data set is loaded. Confidence ratings increase as additional data (new keys) arrives to demonstrate that the business rule truly is broken. For example, the business requirement is: contracts always create new customer keys, but when the data is loaded – the pattern states otherwise.

Contracts are responsible for the creation, and inflow of customer accounts, the negotiation of these accounts before the organization can begin building product for the customers. The data set in the Hub shows that 40% of the new business keys are being created by a financial system. Further discovery shows that it takes 20 days before the customers are synchronized and moved into the contracts system. The business then needs to ask the following questions about their business processes:

• What does this say about the business requirement?

• Can Finance negotiate contracts with the customer?

• How long does it take for the data to move from the contracts system to the financial system?

• How much is it costing the business to NOT have the customers created in contracts first?

What happens when the business key for a certain customer changes when it is passed from Finance to Contracts? What if the programmatic code that changes the key does not record the

“from-to”, or the business user does not record the “from-to” change when they key it in to the contracts system? What impact to the business does this have? It can be huge, it can be costly, and it can range from the $10 dollar mark to the $10 million dollar mark.

Mining Hub keys for patterns can be a powerful way to validate the data against the business requirements. It provides insight into the gap between the “vision” that the business assumes it’s operating under, and the “reality” of their operational systems, coupled with the business process in place today. This is the fundamental idea behind process improvement, monitoring and measuring.

Super Charge Your Data Warehouse Page 73 of 152

© Dan Linstedt 2010-2011, all rights reserved http://LearnDataVault.com 4.13 Process of Building a Hub Table

The process is simplistic in nature; however it requires a consistent check with the business users, business application, and source system data set. At the end of the day the business application collecting the data has the overriding decision. It is the responsibility of the Data Vault to enable reproduction of the source system as-it-stood as of a specific point in time; otherwise the

commutative property is broken, and the system of record that exists within the Data Vault is compromised.

1) Find the business key

a. Go to the business users and watch how they interact with the operational systems. View their print-outs, application screens. Locate the “find” mechanisms, headers of reports, and dimensional groups they use in their MS Excel Spreadsheets.

b. Determine which business keys are truly used in which business units. Do NOT worry or consider HOW to define the business keys, leave that to the business users later in the project.

c. Locate the business keys in the source system by examining the record join/find code.

d. Look for business keys in the dusty old data model that is supposed to represent the source system, look for the primary keys and secondary unique indexes.

e. Pry open the physical data stores on the source systems, look for alternate unique indexes and primary keys.

2) Validate the Business Keys

a. Check with the business units, balance the data sets and unique indexes that are physically printed or seen by the business users. Eliminate those keys that are internal only. Many times the internal keys are there for performance reasons.

b. Validate the business key data by profiling the data set. Discover the consistency, actual uniqueness; develop metrics against the business keys, their patterns, and their associations to other records in other systems.

3) Check Business keys against multiple source systems

a. Develop profiling patterns across multiple source systems that are within scope, discover where the collisions are. Work on resolving the multiple entry patterns that occur. Again, the focus is not to define these keys, but rather simply to identify the business keys.

4) Finally, build the Hub

a. Define the systems that feed the Hub. Develop data flows that identify potential collisions.

b. Define what to do in case of a collision. Get this answer from the business users by ASKING them to define which system is the first master, the second master, the third master and so on.

c. Implement loading paradigms from a staging area to the Hub in the Data Vault

d. Profile the results to produce metrics and measurements about the patterns of the data sets.

e. Publish the results to the entire IT team, the business users, and anyone interested in the Data Warehouse. BEGIN the data quality improvement process as early as possible.

5) Validate the results in the Hub

a. Reconstruct the LIST of business keys for each system, and balance the keys against each of the source systems to ensure integrity has not been lost.

These are the fundamental steps to building a single Hub within the Data Vault model.