
Schema Matching Using Regular Expressions

5.3 Instance-based Schema Matching

5.3.2 Schema Matching Using Regular Expressions

In this section, we introduce our approach to instance-based schema matching, which uses regular expressions grouped into pattern classes. We define one pattern class for each statistical concept introduced in Section 5.2.1 and defined in the SDMX Content-Oriented Guidelines [SDM09], i.e. one pattern class for the geographical dimension, one for the temporal dimension, and so on. These classes serve as background knowledge during the matching process; each contains multiple regular expressions that represent instance values of its statistical concept. A correspondence between two schema elements is considered if their instances can be expressed by regular expressions from the same pattern class.

In the following subsections, we describe how such pattern classes are defined and present our approach for finding similar schema elements, including the implemented algorithm.

Pattern Classes as Background Knowledge

For our approach, we assume that there exist several pattern classes, each containing multiple patterns that describe a particular statistical concept (see Section 5.2.1), e.g. dates, age groups or geographical codes. Each pattern is described as a regular expression. Table 5.8 presents an excerpt of instance values that can appear in a schema element with temporal coverage, together with their corresponding patterns. More patterns for this statistical concept can be found in [SDM09].

Instance Value    Regular Expression
2010              [0-9]{4}
2010              [0-2][0-9]{3}
10-2010           [0-9]{2}-[0-9]{4}
28.10.2010        [0-9]{2}.[0-9]{2}.[0-9]{4}
2010-28-10        [0-9]{4}-[0-3][0-9].[0-1][0-9]

Table 5.8: Overview of instance values and their corresponding regular expressions for temporal information.
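
To illustrate how the patterns in Table 5.8 relate to instance values, the following minimal sketch tests a value against each pattern. It assumes that a pattern must describe the whole instance value (hence `re.fullmatch`); the names `TEMPORAL_PATTERNS` and `matching_patterns` are hypothetical and not part of the original approach.

```python
import re

# Patterns copied from Table 5.8. The dots are left unescaped as in the table,
# so "." matches any single character, not only a literal dot.
TEMPORAL_PATTERNS = [
    r"[0-9]{4}",
    r"[0-2][0-9]{3}",
    r"[0-9]{2}-[0-9]{4}",
    r"[0-9]{2}.[0-9]{2}.[0-9]{4}",
    r"[0-9]{4}-[0-3][0-9].[0-1][0-9]",
]


def matching_patterns(value, patterns=TEMPORAL_PATTERNS):
    """Return every pattern that describes the whole instance value."""
    return [p for p in patterns if re.fullmatch(p, value)]


for value in ["2010", "10-2010", "28.10.2010", "2010-28-10", "28/10/2010"]:
    print(value, matching_patterns(value))
```

Because of the unescaped dots, a value such as 28/10/2010 is also accepted by the fourth pattern; writing `\.` instead would restrict it to literal dots.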

In the case of geographical codes, defining patterns is more complicated than for dates. Although there are international standards such as ISO 3166-1 and ISO 3166-2 or the NUTS classification, the derived patterns are very generic, e.g. two upper-case characters for countries such as DE, FR, ES or US. Such patterns may therefore also match instances of schema elements that do not contain geographical codes, which implies that lower weightings have to be assigned to them. Table 5.9 presents patterns for geographical codes. The first three patterns refer to the ISO standards, while the other three correspond to entries of the NUTS classification: DEA encodes the federal state of North Rhine-Westphalia at NUTS level 1, DEA2 describes the administrative district of Cologne at NUTS level 2, and DEA22 refers to the independent city of Bonn at NUTS level 3.

Table 5.10 presents possible instance values and derived patterns for age groups. This example illustrates the problem that arises when there is no standardized way of encoding such information in statistical data. The entries are very heterogeneous, and the only pattern element that is certain in most cases is the encoding of a numeric range such as ## - ##, where each ## denotes a specific age. Nevertheless, as the example shows, there can be other encodings such as Y_LT15, which denotes the age group of all people younger than 15.

Instance Value    Regular Expression
DE                [A-Z]{2}
DEU               [A-Z]{3}
DE-NW             [A-Z]{2}-[A-Z0-9]{0,3}
DEA               [A-Z]{2}[A-Z0-9]
DEA2              [A-Z]{2}[A-Z0-9]{2}
DEA22             [A-Z]{2}[A-Z0-9]{3}

Table 5.9: Overview of instance values and their corresponding regular expressions for geographical code lists.
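
The following sketch illustrates why such generic patterns call for lower weightings: values that are not geographical codes at all can still be described by them. The dictionary `GEO_PATTERNS`, its labels and the function `matching_labels` are hypothetical names chosen for illustration. The ISO 3166-2 pattern is left out here, and whole-value matching is an assumption.

```python
import re

# Patterns from Table 5.9 for ISO 3166-1 (alpha-2, alpha-3) and NUTS levels 1-3.
# The labels are illustrative only.
GEO_PATTERNS = {
    "iso_alpha2": r"[A-Z]{2}",
    "iso_alpha3": r"[A-Z]{3}",
    "nuts1": r"[A-Z]{2}[A-Z0-9]",
    "nuts2": r"[A-Z]{2}[A-Z0-9]{2}",
    "nuts3": r"[A-Z]{2}[A-Z0-9]{3}",
}


def matching_labels(value):
    """Return the labels of all patterns that describe the whole value."""
    return [label for label, rx in GEO_PATTERNS.items() if re.fullmatch(rx, value)]


# "DEA" is a NUTS level 1 code, but the ISO alpha-3 pattern accepts it as well.
print(matching_labels("DEA"))
# "EUR" is a currency code, yet both three-character patterns accept it.
print(matching_labels("EUR"))
```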

5 Data Matching for Published Linked Open Social Science Data

Instance Value    Regular Expression
Y_LT15            [A-Z]_[A-Z]{2}[0-9]{2}
Y30-49            [A-Z][0-9]{1,2}-[0-9]{1,2}
30-49             [0-9]{1,2}-[0-9]{1,2}

Table 5.10: Overview of instance values and their corresponding regular expressions for age groups.

Finding Similar Schema Elements

For our approach, we denote the two given data sets as M and N, and the given set of pattern classes as C. For each pattern class C_x ∈ C, which consists of regular expressions, a match between two schema elements S_M ∈ M and S_N ∈ N is detected if at least one instance of S_M and at least one instance of S_N can be expressed by a regular expression from the same pattern class C_x. A weighting Ω with a value between 0 and 1 expresses the probability of the match.

Let C be the set of pattern classes with

\[ C = \{C_1, C_2, C_3, \ldots, C_n\} \]

Then each class C_x ∈ C is itself a set of tuples, each consisting of a regular expression and an additional weighting ω. The regular expressions describe the patterns for representing a particular statistical concept x (e.g. age groups), and the weighting ω is a value between 0 and 1 that determines how appropriately the regular expression represents x:

\[ C_x = \{(\mathit{regex}, \omega) \mid \mathit{regex} \text{ matches } x,\ 0 < \omega < 1\} \]

The additional weighting ω is included to rank multiple regular expressions by how appropriately they represent the statistical concept of the class, e.g. the concept of age groups. For example, C = {C_date, C_age, C_geo} is a set of pattern classes that represent a date, an age reference and a geographical location. An example of such a pattern class for dates is C_date = {([0-9]{2}.[0-9]{2}.[0-9]{4}, 0.9), ([0-9]{2}-[0-9]{4}, 0.8)}, using patterns from Table 5.8.
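
A pattern class can be represented as a simple list of weighted patterns. The sketch below is one possible representation, not the implementation used in this work: the names `WeightedPattern`, `C_DATE` and `instance_weight` are hypothetical. It further assumes that a pattern must describe the whole instance value and that, if several patterns match, the one with the highest ω is used.

```python
import re
from typing import List, NamedTuple


class WeightedPattern(NamedTuple):
    regex: str
    weight: float  # omega, with 0 < weight < 1


# The example pattern class for dates.
C_DATE = [
    WeightedPattern(r"[0-9]{2}.[0-9]{2}.[0-9]{4}", 0.9),
    WeightedPattern(r"[0-9]{2}-[0-9]{4}", 0.8),
]


def instance_weight(value: str, pattern_class: List[WeightedPattern]) -> float:
    """Return omega of the highest-weighted pattern that matches, else 0.0."""
    ranked = sorted(pattern_class, key=lambda p: p.weight, reverse=True)
    for pattern in ranked:
        if re.fullmatch(pattern.regex, value):
            return pattern.weight
    return 0.0


print(instance_weight("28.10.2010", C_DATE))
print(instance_weight("10-2010", C_DATE))
print(instance_weight("DE", C_DATE))
```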

As we intend to calculate a confidence value that considers all instances within a schema element, we calculate, for each C_x ∈ C, the average weighting of every schema element S_M ∈ M and S_N ∈ N. Whenever an instance of a schema element can be expressed by a (regex, ω) ∈ C_x, the value of ω is added to the sum of the weightings whose regular expressions matched earlier instances, resulting in the final sum Σω over all instances. The average is obtained by normalizing this sum by the total number of instances in the schema element. For each S_M and S_N, this is

\[ \mathrm{avg}(S_M) = \frac{\sum \omega}{|\text{Instances in } S_M|} \]

\[ \mathrm{avg}(S_N) = \frac{\sum \omega}{|\text{Instances in } S_N|} \]

For C_date from the example above, let a schema element DateM ∈ M have the instances 28.10.2010 and 10-2010. The first instance can be expressed by the first regular expression in C_date, and the second instance by the second one. Accordingly, the average weighting is avg(DateM) = (0.9 + 0.8) / 2 = 0.85.

All schema elements whose average weighting is not 0 are collected in a set together with their average weighting. We define these sets as M_x and N_x for each C_x. Each schema element forms a tuple with its average weighting:

\[ M_x = \{(S_M, \mathrm{avg}(S_M)) \mid \exists (\mathit{regex}, \omega) \in C_x : \mathit{regex} \text{ matches at least one instance of } S_M\} \]

\[ N_x = \{(S_N, \mathrm{avg}(S_N)) \mid \exists (\mathit{regex}, \omega) \in C_x : \mathit{regex} \text{ matches at least one instance of } S_N\} \]
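
The averaging and filtering steps can be sketched as follows. This is an illustration under stated assumptions rather than the implemented algorithm: a data set is modelled as a dictionary from schema element names to lists of instance values, a pattern must describe the whole instance, and an instance contributes the weight of the highest-weighted pattern that matches it. The names `average_weight`, `weighted_elements` and the sample data are hypothetical.

```python
import re

# Example pattern class for dates as (regex, omega) pairs.
C_DATE = [(r"[0-9]{2}.[0-9]{2}.[0-9]{4}", 0.9), (r"[0-9]{2}-[0-9]{4}", 0.8)]


def average_weight(instances, pattern_class):
    """Sum omega over all matched instances and divide by the instance count."""
    if not instances:
        return 0.0
    ranked = sorted(pattern_class, key=lambda pair: pair[1], reverse=True)
    total = 0.0
    for value in instances:
        # Unmatched instances contribute 0 to the sum.
        total += next((w for rx, w in ranked if re.fullmatch(rx, value)), 0.0)
    return total / len(instances)


def weighted_elements(dataset, pattern_class):
    """Build M_x (or N_x): schema elements with a non-zero average weighting."""
    result = {}
    for name, instances in dataset.items():
        avg = average_weight(instances, pattern_class)
        if avg > 0:
            result[name] = avg
    return result


# Hypothetical data set M with a date element and a country-code element.
M = {"DateM": ["28.10.2010", "10-2010"], "GeoM": ["DE", "FR", "ES"]}
M_date = weighted_elements(M, C_DATE)
print(M_date)
```

Only DateM ends up in `M_date`, with an average weighting of about 0.85 (up to floating-point rounding); GeoM is dropped because none of its instances is matched.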

In the example, the schema element DateM contains instances matched by regular expressions in C_date. Thus, it is an element of M_date, whereas another schema element such as GeoM, containing strings for country codes, would probably not be matched by any regular expression in C_date and would therefore not be an element of M_date. Finally, we calculate the Cartesian product M_x × N_x and derive the set Matches_x from it, where a triple (S_M, S_N, Ω) defines a match between S_M and S_N with the probability Ω computed from the average weightings:

\[ \mathit{Matches}_x = \{(S_M, S_N, \Omega) \mid (S_M, \mathrm{avg}(S_M)) \in M_x,\ (S_N, \mathrm{avg}(S_N)) \in N_x,\ \Omega = \mathrm{avg}(S_M) \cdot \mathrm{avg}(S_N)\} \]

In addition to DateM ∈ M from our example, we assume that data set N contains a different schema element DvalN = {19.11.2009}. Regarding the pattern class C_date, N_date contains this schema element as (DvalN, 0.9), analogous to (DateM, 0.85) ∈ M_date. Consequently, Matches_date contains the triple (DateM, DvalN, 0.765), since Ω = 0.85 · 0.9 = 0.765. Thus, a match with a specific confidence has been found.
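
The final step, combining M_x and N_x into weighted matches, can be sketched with a Cartesian product. The function name `matches` and the dictionary representation of M_x and N_x (schema element name mapped to its average weighting) are assumptions made for this illustration.

```python
from itertools import product


def matches(m_x, n_x):
    """Pair every element of M_x with every element of N_x.

    Omega is the product of the two average weightings.
    """
    return [
        (s_m, s_n, avg_m * avg_n)
        for (s_m, avg_m), (s_n, avg_n) in product(m_x.items(), n_x.items())
    ]


# Average weightings taken from the running example.
M_date = {"DateM": 0.85}
N_date = {"DvalN": 0.9}

matches_date = matches(M_date, N_date)
for s_m, s_n, omega in matches_date:
    print(s_m, s_n, round(omega, 3))
```

Since every pair from M_x × N_x yields a triple, the size of the result grows with |M_x| · |N_x|; a threshold on Ω could be applied afterwards to keep only the most confident matches, although such a threshold is not part of the definition above.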