Phase.IV:.Data.Source.Selection

where PriorityD (D), PriorityU (U), and PriorityQ (Q) are the domain, users, and query priorities. As a result of this step we obtain a set of queries ordered by priority. The form QRY_USR is updated in order to complete the Global priority field. These priorities are a tool for solving conflicting requirements.

For example, in our case study, query Q1 has priority 5 for user U2 (with pri-ority 1), who belongs to domain D1 (with priority 5). Thus, the global priority for query Q1 is 135.

• Data.requirements.matrix: This is the integrated requirements form. This form is used for exchanging information with data producer users. The form MAT_REQ_DATA is filled out.

Example 4: Figure 1 shows a portion of the form MAT_REQ_DATA for our case study. Each triple domain-user-query has associated with it a set of data quality dimensions and values for these dimensions. Note that this form summarizes information obtained during the previous phases.

In this phase, data sources are studied in order to determine if they fulfill the infor-mation requirements collected in phases I to III. The outputs of the phase are (1) a query evaluation report and (2) a data source selection order for each data element.

This process is central to our methodology. As far as we are aware of, this is the

8 Vaisman

first proposal addressing this topic in a quantitative fashion. Now we describe the steps of the phase.

• Analysis.of.data.sources: Meetings with data producers are carried out (with the help of the documents produced so far), where the set of data sources and the quality of their data are documented. Also, information on source availability is collected. Two forms, MAS_DS_P and MAS_DS_L, are used for physical Figure 1. Data requirements matrix

Origin (dom-usr)ID Data (dictionary)Description

Business Proc

essAggregationRange (description)

Ranking (description)TimelinessAccuracyConsistency descriptioncompleteness descriptionvalue

SURVEY value

FORM

Data Quality-Based Requirements Elicitation for Decision Support Systems

and logical data sources, respectively. Each form contains a data source iden-tifier, values for data source availability, and a source priority defined by the data producer user. In the case of logical data sources, for each data element the corresponding expression for obtaining the data must be specified. The following actions are taken: (a) the data producer user determines the prior-ity criteria for data source usage, based on experience and technical issues.

Priority ranges between 1 and 5. (b) The requirements engineer finds out if a physical source contains the required data; if so, it is registered in the form MAS_DS_P. (c) If a combination of fields yields some of the required data, this combination is considered a logical data source, and it is registered in the form MAS_DS_L. In our case study we have three physical data sources,

• Data.source.quality:.This step consists of three tasks that can be performed in parallel. The goal is obtaining the quality of the data source for each data-source combination. The data provider informs quality characteristics of the data source and a mapping for the required fields (i.e., where is the required data located, and under which name?).

• Data.source.quality.(data):.The form DS_QTY_I is completed. This form contains, for each query, for each data element in the query, and for each data source, the following information: (a) mapping: field in the data source containing the data element, or field to which a function must be applied. For instance, the month of a sale could be obtained as month(date). (b) Aggrega-tion: tells if the aggregated data is or is not present in the source, or can be computed from the data in the source. (c) Accuracy. (d) Timeliness. The last two dimensions apply to our case study, but can be replaced by a different set of dimensions if the problem at hand requires it. In our case study, the record

<Q2, sales, B, amount, NO, 70, 5min> tells that for query Q2, data element sales can be obtained from data source B (where it is in nonaggregated form), with 70% accuracy and with a timeliness of 5 minutes.

• Data.source.quality.(consistency):.The form DS_QTY_II is completed, with the consistency characteristics of the data source. There is one entry for each data source, containing an evaluation of the source’s consistency. In our case study, consistency condition Q2C above is accomplished with a 100% preci-sion by data source A, and 90% precision by data source B.

• Data.source.quality.(completeness): The form DS_QTY_III is filled analo-gously to form DS_QTY_II, addressing completeness instead of consistency.

In our case study, the completeness condition stating that all customers regis-80 Vaisman

tered since 2001 must be in the database is accomplished with 100% precision by data source A, and 99% precision by data source B.

• Data.source.quality.assessment: The goal of this step is the integration, in a single data source assessment matrix, of the three essential components of the methodology: (a) data requirements; (b) quality requirements; and (c) data sources. The output of the process is, for each data element, the best data source for obtaining it, and a range with the qualification for each data source. The Global Data Source Performance is computed, using a procedure that adapts the QFD methodology.

Example.5:.The data source quality assessment matrix for our running case study is depicted in Figure 2. We only show the data element “sales”

and two queries: Q1 (from user U2) and Q4 (from user U3). The infor-mation gathered so far is:

a. Query Q1

. Priorities. of. quality. dimensions:. accuracy:. 5, consistency: 4, completeness: 3, timeliness: 5.

Global priority of the query: 135 (as explained in Phase III).

. Aggregations.required: month and salesman.

b. Query Q4

. Priorities.of.quality.dimensions: accuracy: 4, consistency: 5, com-pleteness: 1, timeliness: 3.

Global priority of the query: 31

. Aggregations.required:.Country, province, city, neighborhood.

Finally, the data producer user provided the following informa-tion:

Available data sources: A, B, and C (c in Figure 2), with priorities 5,4,1 respectively, as explained previously (b in Figure 2).

Each matrix block is composed as follows: (1) Consumer users’ requirements: data (h), query ID, quality dimensions (i), aggregations (j), global priority of the query (from Phase II), and quality dimension priorities given by the users in Phase III;

(2) Data producer users’ information, obtained in the previous step of this phase:

a submatrix indicating requirements fulfillment for each available data source.

According to the degree of fulfillment, a value is given (1, 3, or 9, d in Figure 2), using the following criteria: “1” is given if the condition is not fulfilled, “3” if the condition is not fulfilled, but can be computed from the data in the source; and “9”

if the condition is fulfilled. For the sake of brevity we do not extend on how to

de-Data Quality-Based Requirements Elicitation for Decision Support Systems 8

termine these values. (3) Data source performance for each query (e in Figure 2);

(4) Global data source performance (f in Figure 2).

The data source performance for each query is computed as:

PerfLocal(S,Q,D).=.∑ (pri_i* rel_i),

where

Figure 2. Quality assessment matrix

82 Vaisman

PerfLocal.(S,Q,D):.Data source performance of source S for data D in query Q

Pri_i: Data, quality, and aggregations priorities, for data D in query Q

Rel_i: Degree of fulfillment of data source S for query Q and data element D

The global data source performance is computed as:

PerfGlobal(F,Q) = ∑ HierGlobal(Qj) * PerfLocal(S,Q ,D)

For all queries Qj involving data element D, and given a set F.of a data source and a set of queries Q. HierGlobal (Qj): Global priority of query Qj.

Example.6:.For the table in Example 5, the local performance for data source A and query Q1 is computed as 5 * 1 + 5 * 1 + 5 * 1 + 5 * 1 + 4 * 1 + 3 * 1 + 5 * 1 + 5 * 1 + 5 * 1 = 42. The global performance for source A is: 135 * 42 + 31 * 144 = 10134.

• Data.source.selection: Although the final source selection is beyond the scope of the methodology, a document is generated, with a ranking of data sources for each data. This document will be used in the final data source selection process. For our case study, the ranking is 1: data source B (global performance 48,468), 2: data source C (global performance 27,702), and 3: data source A (global performance 10,134).

In document Data Warehouses and OLAP (Page 104-109)