Schema Matching Using Regular Expressions

5.3 Instance-based Schema Matching

5.3.2 Schema Matching Using Regular Expressions

In this section, we introduce our approach to instance-based schema matching, which uses regular expressions organized in pattern classes. We define one pattern class for each statistical concept introduced in Section 5.2.1 and defined in the SDMX Content-Oriented Guidelines [SDM09], i.e. one pattern class for the geographical dimension, one for the temporal dimension and so on. These classes serve as background knowledge during the matching process; each contains multiple regular expressions that represent instance values of its statistical concept. A correspondence between two schema elements is considered if their instances can be expressed by regular expressions from the same pattern class.

In the following subsection, we describe how such pattern classes containing multiple regular expressions are defined and present our approach for finding similar schema elements including the implemented algorithm.

Pattern Classes as Background Knowledge

For our approach, we assume that several pattern classes exist, each containing multiple patterns that describe a particular statistical concept (see Section 5.2.1), e.g. dates, age groups or geographical codes. Each pattern is described as a regular expression. Table 5.8 presents an excerpt of instance values that can appear in a schema element with temporal coverage, together with their corresponding patterns. More patterns for this statistical concept can be found in [SDM09].

Instance Value    Regular Expression
2010              [0-9]{4}
2010              [0-2][0-9]{3}
10-2010           [0-9]{2}-[0-9]{4}
28.10.2010        [0-9]{2}.[0-9]{2}.[0-9]{4}
2010-28-10        [0-9]{4}-[0-3][0-9].[0-1][0-9]

Table 5.8: Overview of instance values and their corresponding regular expressions for temporal information.
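The patterns in Table 5.8 can be checked against their instance values with a few lines of Python. This is a minimal sketch rather than the implementation used in this work: the names `TEMPORAL_PATTERNS` and `matches` are hypothetical, and matching the whole value with `re.fullmatch` is an assumption. The patterns are copied from the table unchanged, so the unescaped `.` still matches any character.

```python
import re

# (instance value, pattern) pairs copied from Table 5.8.
TEMPORAL_PATTERNS = [
    ("2010", r"[0-9]{4}"),
    ("2010", r"[0-2][0-9]{3}"),
    ("10-2010", r"[0-9]{2}-[0-9]{4}"),
    ("28.10.2010", r"[0-9]{2}.[0-9]{2}.[0-9]{4}"),
    ("2010-28-10", r"[0-9]{4}-[0-3][0-9].[0-1][0-9]"),
]


def matches(pattern: str, value: str) -> bool:
    """Return True if the whole value is described by the pattern (assumption)."""
    return re.fullmatch(pattern, value) is not None


for value, pattern in TEMPORAL_PATTERNS:
    print(f"{value:12} {pattern:32} {matches(pattern, value)}")
```

Because the dot is not escaped, the fourth pattern also accepts a value such as 28-10-2010. Writing the separator as `\.` would restrict it to dotted dates, at the cost of deviating from the pattern as printed in the table.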

In the case of geographical codes, the definition of patterns is more complicated than for dates. Although there are international standards such as ISO 3166-1 and ISO 3166-2 or the NUTS classification, the derived patterns are very generic, e.g. two uppercase characters for countries such as DE, FR, ES or US. Such patterns may therefore also match entries of unrelated schema elements, which implies that lower weightings have to be assigned to them. Table 5.9 presents patterns for geographical codes. The first three patterns refer to the ISO standards, while the other three correspond to entries of the NUTS classification: DEA encodes the federal state of North Rhine-Westphalia at NUTS level 1, DEA2 describes the administrative district of Cologne at NUTS level 2 and DEA22 refers to the independent city of Bonn at NUTS level 3.

Table 5.10 presents possible instance values and derived patterns for age groups. This example illustrates the problem that occurs if there is no standardized way of encoding such information in statistical data. The entries are very heterogeneous, and the only element of a pattern that is certain in most cases is the encoding of a numeric range like ## - ##, where each ## denotes a specific age. Nevertheless, as the example shows, there can be other encodings like Y_LT15, which denotes the age group of all people younger than 15.

Instance Value    Regular Expression
DE                [A-Z]{2}
DEU               [A-Z]{3}
DE-NW             [A-Z]{2}-[A-Z0-9]{1,3}
DEA               [A-Z]{2}[A-Z0-9]
DEA2              [A-Z]{2}[A-Z0-9]{2}
DEA22             [A-Z]{2}[A-Z0-9]{3}

Table 5.9: Overview of instance values and their corresponding regular expressions for geographical code lists.
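The generic nature of these patterns can be made concrete with a short sketch. It lists, for a given value, every pattern from Table 5.9 that describes it. The labels in `GEO_PATTERNS` and the helper `matching_labels` are hypothetical names chosen for illustration, whole-value matching is an assumption, and the hyphenated ISO 3166-2 pattern is left out for brevity.

```python
import re

# Patterns from Table 5.9; the dictionary keys are illustrative labels only.
GEO_PATTERNS = {
    "ISO 3166-1 alpha-2": r"[A-Z]{2}",
    "ISO 3166-1 alpha-3": r"[A-Z]{3}",
    "NUTS level 1": r"[A-Z]{2}[A-Z0-9]",
    "NUTS level 2": r"[A-Z]{2}[A-Z0-9]{2}",
    "NUTS level 3": r"[A-Z]{2}[A-Z0-9]{3}",
}


def matching_labels(value: str) -> list[str]:
    """Return the labels of all patterns that describe the whole value."""
    return [
        label
        for label, pattern in GEO_PATTERNS.items()
        if re.fullmatch(pattern, value)
    ]


for value in ["DE", "DEU", "DEA", "DEA22", "SEX"]:
    print(value, matching_labels(value))
```

In this sketch DEU and DEA are both described by the alpha-3 pattern and the NUTS level 1 pattern, and so is an unrelated three-letter value such as SEX. This is the kind of ambiguity that motivates assigning lower weightings to such patterns.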

5 Data Matching for Published Linked Open Social Science Data

Instance Value    Regular Expression
Y_LT15            [A-Z]_[A-Z]{2}[0-9]{2}
Y30-49            [A-Z][0-9]{1,2}-[0-9]{1,2}
30-49             [0-9]{1,2}-[0-9]{1,2}

Table 5.10: Overview of instance values and their corresponding regular expressions for age groups.

Finding Similar Schema Elements

For our approach, we denote the two given data sets as M and N and the given set of pattern classes as C. For each pattern class C_x ∈ C, consisting of regular expressions, a match between two schema elements S_M ∈ M and S_N ∈ N is detected if at least one instance from S_M and one from S_N can be expressed by a regular expression from the same pattern class C_x. A weighting Ω expresses the probability of the match with a value between 0 and 1.

Let C be the set of pattern classes with

$$C = \{C_1, C_2, C_3, \ldots, C_n\}$$

then each class C_x ∈ C is itself a set comprising tuples of regular expressions and an additional weighting ω. The regular expressions describe the patterns for representing a particular statistical concept x (e.g. age groups) and the weighting ω is a value between 0 and 1 determining how appropriately the regular expression represents x:

$$C_x = \{(\mathit{regex}, \omega) \mid \mathit{regex} \text{ matches } x,\ 0 < \omega < 1\}$$

This additional weighting ω was included to rank multiple regular expressions by their appropriateness for representing the statistical concept of the class, e.g. the concept of age groups. For example, C = {C_date, C_age, C_geo} is a set of pattern classes representing dates, age references and geographical locations. An example of such a pattern class is C_date = {([0-9]{2}.[0-9]{2}.[0-9]{4}, 0.9), ([0-9]{2}-[0-9]{4}, 0.8)} for dates, with the patterns taken from Table 5.8.
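One possible in-memory representation of a pattern class is a list of (regex, ω) tuples. The following sketch is an illustration under assumptions, not the data structure used in the implementation: `WeightedPattern`, `PATTERN_CLASSES` and `weight_of` are hypothetical names, and returning the highest ω when several patterns match is one reading of ranking patterns by appropriateness.

```python
import re
from typing import NamedTuple, Optional


class WeightedPattern(NamedTuple):
    regex: str     # regular expression describing instance values
    weight: float  # ω: how appropriately the regex represents the concept


# C_date from the example; the key "date" is an illustrative label.
C_DATE = [
    WeightedPattern(r"[0-9]{2}.[0-9]{2}.[0-9]{4}", 0.9),
    WeightedPattern(r"[0-9]{2}-[0-9]{4}", 0.8),
]
PATTERN_CLASSES = {"date": C_DATE}


def weight_of(instance: str, pattern_class: list) -> Optional[float]:
    """Return the highest ω among matching patterns, or None if none match."""
    weights = [
        p.weight for p in pattern_class if re.fullmatch(p.regex, instance)
    ]
    return max(weights) if weights else None


print(weight_of("28.10.2010", C_DATE))  # 0.9
print(weight_of("10-2010", C_DATE))     # 0.8
print(weight_of("DE", C_DATE))          # None
```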

As we intend to calculate a confidence value that considers all instances within a schema element, we calculate, for each C_x ∈ C, the average weighting of every schema element S_M ∈ M and S_N ∈ N. As soon as an instance of a schema element can be expressed by a (regex, ω) ∈ C_x, the value of ω is added to the sum of all weightings whose regular expressions previously matched another instance, resulting in the final Σω over all instances. The average is calculated by normalizing this sum by the total number of instances in the schema element. For each S_M and S_N, this is

$$\mathrm{avg}(S_M) = \frac{\sum \omega}{|\text{Instances in } S_M|}$$

$$\mathrm{avg}(S_N) = \frac{\sum \omega}{|\text{Instances in } S_N|}$$

For C_date from the example above, let a schema element Date_M ∈ M have the instances 28.10.2010 and 10-2010. Then the first instance can be expressed by the first regular expression in C_date (ω = 0.9), and the second instance by the second one (ω = 0.8). Accordingly, the average weighting is avg(Date_M) = (0.9 + 0.8) / 2 = 0.85.
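The averaging step can be sketched as follows. The function name `avg_weighting` is hypothetical, and two details are assumptions that the text leaves open: each instance contributes at most one ω (that of the highest-weighted matching pattern), and an empty schema element has an average of 0. A dotted date such as 28.10.2010 and the value 10-2010 serve as instances.

```python
import re

# C_date as (regex, ω) tuples, taken from the example above.
C_DATE = [
    (r"[0-9]{2}.[0-9]{2}.[0-9]{4}", 0.9),
    (r"[0-9]{2}-[0-9]{4}", 0.8),
]


def avg_weighting(instances, pattern_class):
    """Sum ω over the matched instances and normalize by the instance count."""
    if not instances:
        return 0.0  # assumption: an empty schema element has no weighting
    ranked = sorted(pattern_class, key=lambda pair: pair[1], reverse=True)
    total = 0.0
    for instance in instances:
        for regex, weight in ranked:
            if re.fullmatch(regex, instance):
                total += weight
                break  # assumption: count each instance only once
    return total / len(instances)


date_m = ["28.10.2010", "10-2010"]
print(round(avg_weighting(date_m, C_DATE), 2))  # 0.85
```

Instances that match no pattern still count in the denominator, so a schema element in which only half of the values look like dates receives half the weighting.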

All schema elements whose average weighting is not 0 are collected in a set together with their average weighting. We define these sets as M_x and N_x for each C_x. Each schema element forms a tuple with its aggregated weight:

$$M_x = \{(S_M, \mathrm{avg}(S_M)) \mid \exists (\mathit{regex}, \omega) \in C_x : \mathit{regex} \text{ matches at least one instance of } S_M\}$$

$$N_x = \{(S_N, \mathrm{avg}(S_N)) \mid \exists (\mathit{regex}, \omega) \in C_x : \mathit{regex} \text{ matches at least one instance of } S_N\}$$

In the example, the schema element Date_M contains instances matched by a regular expression in C_date. Thus, it is an element of M_date, whereas another schema element like Geo_M, containing strings for country codes, would probably not be matched by any regular expression in C_date and would therefore not be an element of M_date. Finally, we calculate the Cartesian product Matches_x = M_x × N_x, where a triple (S_M, S_N, Ω) defines a match between an S_M and an S_N with the probability Ω computed from the average weightings.

$$\mathit{Matches}_x = \{(S_M, S_N, \Omega) \in M_x \times N_x \mid \Omega = \mathrm{avg}(S_M) \cdot \mathrm{avg}(S_N)\}$$

In addition to Date_M ∈ M from our example, we assume that data set N contains a different schema element Dval_N = {19.11.2009}. Regarding the pattern class C_date, N_date contains this schema element as (Dval_N, 0.9), analogous to (Date_M, 0.85) ∈ M_date. Consequently, Matches_date contains the triple (Date_M, Dval_N, 0.765), since Ω = 0.85 · 0.9 = 0.765. Thus, a match with a specific confidence has been found.
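The complete procedure for a single pattern class can be sketched in a few lines. This is an illustrative reading of the definitions above, not the implemented algorithm: the function names, the representation of a data set as a dictionary from schema element name to instance list, and the rule that each instance contributes the highest matching ω are all assumptions. The element Geo_M with country codes is added to show that unmatched elements drop out.

```python
import re
from itertools import product

# C_date as (regex, ω) tuples, taken from the running example.
C_DATE = [
    (r"[0-9]{2}.[0-9]{2}.[0-9]{4}", 0.9),
    (r"[0-9]{2}-[0-9]{4}", 0.8),
]


def avg_weighting(instances, pattern_class):
    """Average ω per instance; unmatched instances contribute 0."""
    weights = [
        max((w for rx, w in pattern_class if re.fullmatch(rx, inst)),
            default=0.0)
        for inst in instances
    ]
    return sum(weights) / len(weights) if weights else 0.0


def weighted_elements(dataset, pattern_class):
    """Build M_x or N_x: schema elements with a non-zero average weighting."""
    result = {}
    for name, instances in dataset.items():
        avg = avg_weighting(instances, pattern_class)
        if avg > 0:
            result[name] = avg
    return result


def find_matches(dataset_m, dataset_n, pattern_class):
    """Return the triples (S_M, S_N, Ω) of Matches_x for one pattern class."""
    m_x = weighted_elements(dataset_m, pattern_class)
    n_x = weighted_elements(dataset_n, pattern_class)
    return [(sm, sn, m_x[sm] * n_x[sn]) for sm, sn in product(m_x, n_x)]


M = {"Date_M": ["28.10.2010", "10-2010"], "Geo_M": ["DE", "FR"]}
N = {"Dval_N": ["19.11.2009"]}
for sm, sn, omega in find_matches(M, N, C_DATE):
    print(sm, sn, round(omega, 3))  # Date_M Dval_N 0.765
```

Running the same function once per pattern class in C would yield one set of matches per statistical concept.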

In document Methods for Matching of Linked Open Social Science Data (Page 158-161)