• No results found

Data Warehouse design

N/A
N/A
Protected

Academic year: 2021

Share "Data Warehouse design"

Copied!
16
0
0

Loading.... (view fulltext now)

Full text

(1)

Data Warehouse design

Design of Enterprise Systems University of Pavia

(2)

DATA QUALITY

(3)

Data Quality

 The quality of the data in a data warehouse determines the reputation and value of that data warehouse.

 If customers perceive the data in a data warehouse to be misleading or just plain wrong, they won’t use the data warehouse.

 If customers can find or create a superior source of data elsewhere, they will abandon a data warehouse altogether.

 Data Quality, the perceived reputation and value, of a data warehouse, therefore, is vital to the success of a data warehouse.

(4)

Deming’s definition of Quality

 Deming’s definition of Quality begins with a distinction between Features and

Quality.

– Features are the buzzers and whistles included with a product that do not

specifically address the core purpose of a product.

 Typically, people see Features first. For example: – Leather seats in a car

– Mother of Pearl inlays in a guitar

– Graphical user interface (GUI) point-and-click buttons and scroll bars in a spreadsheet

 Deming’s point is not that Features have no value. But their value is subjective and, therefore, impossible to measure.

 So, what is Quality?

– Deming’s answer is based on expectations and whether or not those expectations are achieved.

– Deming defined Quality as the success or failure to achieve a customer’s expectations.

(5)

Data Quality Service Level Agreement

 Customer expectations of a data warehouse, therefore, must be expressed in

well-defined and quantifiable terms. For example:

– Completeness

• Not Quantifiable: Most of the stores should be present in the data warehouse.

• Quantifiable: 95 percent of the stores should be present in the data warehouse.

– Latency

• Not Quantifiable: The warehouse data should be in before the staff arrives. • Quantifiable: The warehouse data should be in by 6:30 a.m.

– Accuracy

• Not Quantifiable: The warehouse data should equal what actually happens in the warehouse.

• Quantifiable: The total inventory movement of each product in a warehouse reported by the data warehouse should be within 2 percent (+/−) the total movement of each product as reported by the warehouse end-of-day

(6)

Deming’s Statistical Process Control

 The processes of the data warehouse are the focus of Deming’s Process Control method, which measures the process by measuring the product — by

measuring its data.

– A data warehouse process that is in control will generate data that meets expectations all day long.

– A data warehouse process that is not in control will generate nonsense data.  By measuring each individual ETL process, the data warehouse team can

monitor the health of the ETL application.

– If a single process is out of control, the ETL process can stop, allowing time

to adjust the errant ETL process.

– The time and effort spent to adjust an out-of-control ETL process costs less

than the rework that would happen had the process not been stopped.

– When all ETL processes are back in control, the data in the data warehouse will meet the expectations of the customer.

(7)

Process Measurement Methods

 The options listed below should be chosen based on the corporate

management and culture surrounding the data warehouse.

 Data Quality methods that work well in one setting may fail in another setting.  So, involve members of the business to gain their feedback, approval, and buy-in.  Methods and strategies:

– Data Stewardship

– Post-Load Audit and Report Errant Data

– Plug in a Default Value and Report Errant Data – Reject a Record and Report the Errant Record – Recycle the Data: In Place and Report Errant Data

(8)

Data Stewardship

 Data Stewardship is a strategy.

 A member of each business area included in the data warehouse is engaged to participate in the data warehouse as a Data Steward.

– This person must have a complete understanding of the business area, its

data, processes, and people.

– A Data Steward must also be able to understand the concepts of the data warehouse.

– Finally, a Data Steward must be able and willing to engage in the general oversight of the data warehouse.

 In general, a Data Steward is not only a liaison between the business and the data warehouse, but a champion of both the business area and the data

warehouse.

 A Data Steward participates in all communications between the business area and the data warehouse.

(9)

Post-Load Audit and Report Errant Data

 After data has been loaded into a data warehouse, programmatically query the data in the data warehouse.

 The SQL should test and validate that the data in a data warehouse conforms to

the expectations of the data warehouse, which are outlined in the Target

System Analysis.

 Errant data should be reported to members of the business area who are interested in, or using, that subject area from which the errant data originated, and the data warehouse team.

 The individual data elements are not the focus.

(10)

Plug in a Default Value and Report Errant Data

 When an errant data element is encountered, the portion of the data element that does not conform to its expectations can be replaced by a default value

 The default value may have a specific meaning (e.g., no known value, rejected value, etc.).

 The data is allowed to proceed toward the data warehouse with the defaulted values in place of the errant values.

 This method is also known as a Soft Reject.

 Errant data should be reported to members of the business area who are interested in, or using, that subject area from which the errant data originated, and the data warehouse team.

 The individual data elements are not the focus.

(11)

Reject a Record and Report the Errant Record

 When an errant data element is encountered, the entire record or row may be

discarded altogether.

 The remainder of the data is allowed to proceed toward the data warehouse, but without the rejected record or row.

 This method is also known as a Hard Reject.

 This method can be applied also to a whole dataset

 Errant data should be reported to members of the business area who are interested in, or using, that subject area from which the errant data originated, and the data warehouse team.

 The individual data elements are not the focus.

(12)

Data Recycle: In Place and Report Errant Data

 A data element that is subject to remediation can be recorded into two fields, rather than just one.

– The first field is the original errant data value. – The second field is the defaulted data value.

– When a correct data value becomes available, that correct data value will

overwrite the second defaulted data field.

– The first field, containing the original data value, is used only to find the correct data value.

– The second data field, containing the correct data value, is the data field that is visible to the data warehouse customers.

 This strategy requires communication between the data warehouse team and the Data Stewards.

 When the business area management ceases to participate in the remediation of errant data

– the remainder of the data is still allowed to add value to the data warehouse

(13)

Data Recycle: Recycle Wheel and Report Errant Data

 A data element that is subject to remediation can be held in a separate table, away from the data warehouse.

– This separate table is often called a Recycle Wheel

– When a correct data value becomes available, that correct data value will

overwrite the errant data field in the Recycle Wheel.

– Then, the corrected record or row of data in the Recycle Wheel is forwarded

to the ETL application to be included in the next iteration of data going to

the data warehouse.

 This strategy requires communication between the data warehouse team and the Data Stewards.

 The business person who is assigned the task of finding the correct data value must be included in the data warehouse team and its meetings and discussions.  The cost of a Recycle Wheel is the storage and maintenance cost of the table

(14)

DATA QUALITY TOOLS

(15)

Market definition

 The data quality tools market includes vendors that offer stand-alone software products to address the core functional requirements of the discipline, which are:

– Profiling: The analysis of data to capture statistics (metadata) that provide insight into the quality of data and help to identify data quality issues.

– Parsing and standardization: The decomposition of text fields into component parts and the formatting of values into consistent layouts based on industry standards, local standards (for example, postal authority standards for address data), user-defined business rules, and knowledge bases of values and patterns. – Generalized "cleansing": The modification of data values to meet domain

restrictions, integrity constraints or other business rules that define when the quality of data is sufficient for an organization.

– Matching: Identifying, linking or merging related entries within or across sets of data.

– Monitoring: Deploying controls to ensure that data continues to conform to business rules that define data quality for the organization.

– Enrichment: Enhancing the value of internally held data by appending related attributes from external sources (for example, consumer demographic attributes and geographic descriptors).

– Connectivity/adapters: The ability to interact with a range of different data structure types.

(16)

Market analysis

 Leaders exhibit a clear understanding and vision of where the market is headed,

including recognition of non-customer data quality issues and delivery of enterprise-level data quality implementations.

 Challengers have an established presence, credibility and viability, but may demonstrate strength only in a specific domain (e.g. only customer name and address cleansing).  Visionaries demonstrate a strong

understanding of current and future market trends (e.g. the importance of ongoing

monitoring of data quality). They lack market presence, brand recognition, customer base.  Niche Players often lack strength in rapidly

evolving functional areas. In addition, they may focus solely on a specific market

segment (such as midsize businesses), limited geographic areas or a single domain (e.g. customer data).

References

Related documents

Because the lower bore seems to generate no radiation, the end correction should be negligible, then its fundamental fre- quency is calculated as 345,000/(220 × 2) = 784 Hz if the

We address remaining concerns that the sample of women interviewed pre- and post-arrival could di¤er on dimensions that drive fertility preferences using three strategies:

We have also begun new ways of expanding to the larger community; including offering an environmental science community education class, hosting a rain barrel workshop, and

Although the effects of career-related social support have been demonstrated for a general population, the question remains how social support at school or at work

Black powder burns essentially at the same rate in the open (unconfined) as when in a gun. The burning rate of smokeless powder increases with increasing pressure. If

En efecto, así como los libertarianos ven en cual- quier forma de intervención del Estado una fuente inevitable de interferencias arbitrarias –con la excepción de aquella acción

Players can create characters and participate in any adventure allowed as a part of the D&D Adventurers League.. As they adventure, players track their characters’

In order to gather knowledge about the current state of the art the, a literature review on the following topics was made: reverse logistics processes, information system support