
2.3 Tree Matching

2.3.2 Partial Tree Alignment (PTA)

The PTA algorithm [79–81] is part of DEPTA (Data Extraction based on Partial Tree Alignment). The alignment is partial because a node in Ti is inserted into Ts only if the location for the insertion can be uniquely determined in Ts. Otherwise, the node is not inserted and is left unaligned. This technique is commonly used to align the data items in each data region because, in some web pages, the object details or data records are not represented in a contiguous segment of the HTML code. A data region is the area of a web page that contains similar data records. In this case, data items from all data records are rearranged before they are integrated into the database.

PTA aligns multiple trees by progressively growing a seed tree. The seed tree, denoted by Ts, is the tree which has the maximum number of data fields. The reason for choosing this seed tree is clear: it is more likely than any other tree to have a good alignment with the data fields in the other data records. After that, for each Ti (i ≠ s), the algorithm tries to find a matching node in Ts for each node in Ti. When a match is found for node Ti[j], a link is created from Ti[j] to Ts[k] to indicate its match in the seed tree. If a match cannot be found for node Ti[j], the algorithm attempts to expand the seed tree by inserting Ti[j] into Ts. The expanded seed tree Ts is then used in subsequent matching.
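The seed selection and the match links described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the DEPTA implementation: each tree is reduced to the list of its child tag names, `choose_seed` and `match_links` are hypothetical helper names, and `difflib.SequenceMatcher` from the standard library stands in for the tree matching step.

```python
from difflib import SequenceMatcher

def choose_seed(trees):
    """Return the index s of the tree with the most data fields (first one on ties)."""
    return max(range(len(trees)), key=lambda i: len(trees[i]))

def match_links(seed, tree):
    """Return {j: k}, meaning node tree[j] is linked to seed[k].

    Assumption: nodes are compared by tag name only, and SequenceMatcher
    is used as a stand-in for the tree matching step.
    """
    sm = SequenceMatcher(None, seed, tree, autojunk=False)
    links = {}
    for k, j, size in sm.get_matching_blocks():
        for d in range(size):
            links[j + d] = k + d
    return links

# Hypothetical data records, each flattened to its child tags.
trees = [["img", "a"], ["img", "a", "span", "b"], ["a", "b"]]
s = choose_seed(trees)                      # 1: the tree with four fields
links_0 = match_links(trees[s], trees[0])   # {0: 0, 1: 1}
links_2 = match_links(trees[s], trees[2])   # {0: 1, 1: 3}
```

Nodes of Ti that do not appear as keys in the returned dictionary are the unmatched nodes that the algorithm then tries to insert into Ts.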

To clarify the PTA algorithm, we first describe the alignment of two trees and then introduce the alignment of multiple trees.

Partial Alignment of Two Trees

To show how the PTA algorithm works, we first consider how nodes in Ti can be aligned with nodes in Ts, that is, the partial alignment of two trees. After Ts and Ti are matched, some nodes in Ti can be aligned with their corresponding nodes in Ts because they match one another. The nodes in Ti that are not matched should then be inserted into Ts, as they may contain optional data items. There are two possible situations when inserting a new node Ti[j] into the seed tree Ts, depending on whether a location in Ts can be uniquely determined for inserting Ti[j]. If it can, the node is inserted; otherwise, it is not inserted and is left unaligned. The location for the insertion is determined according to the following cases.

30 Theoretical Background

Fig. 2.8 Expanding the seed tree: (a) and (b) unique expansion and (c) insertion ambiguity [81].

1. If Ti[j] ... Ti[m] have two neighbouring siblings in Ti, one on the left and the other on the right, that are matched with two consecutive siblings in Ts, then Ti[j] ... Ti[m] can be inserted between those two siblings in Ts. Figure 2.8a shows such a situation, giving one part of Ts and one part of Ti.

2. If Ti[j] ... Ti[m] have only one left neighbouring sibling x in Ti, and x matches the rightmost node x in Ts, then Ti[j] ... Ti[m] can be inserted after node x in Ts. Figure 2.8b illustrates this case.

3. If, on the other hand, a unique location cannot be determined for the unmatched nodes in Ti to be inserted into Ts, the nodes are left out. This is shown in Figure 2.8c: the unmatched node x in Ti could be inserted into Ts either between nodes a and b or between nodes b and e. In this situation, we do not insert this node into Ts.
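The three cases above can be sketched in Python for two-level trees. This is an illustrative sketch under stated assumptions rather than the published implementation: trees are lists of child tag names, `match_links` and `partial_align` are hypothetical names, `difflib.SequenceMatcher` stands in for the tree matching step, and the mirrored form of case 2 (a run whose only neighbour is a right sibling matching the leftmost node of Ts) is added by symmetry and is not spelled out in the text above.

```python
from difflib import SequenceMatcher

def match_links(seed, other):
    """Map each matched index j in `other` to its index k in `seed` (tags only)."""
    sm = SequenceMatcher(None, seed, other, autojunk=False)
    links = {}
    for k, j, size in sm.get_matching_blocks():
        for d in range(size):
            links[j + d] = k + d
    return links

def partial_align(seed, other):
    """Return (expanded seed, nodes of `other` left unaligned)."""
    links = match_links(seed, other)
    # Collect maximal runs other[j..m] of unmatched nodes.
    runs, j = [], 0
    while j < len(other):
        if j in links:
            j += 1
            continue
        m = j
        while m + 1 < len(other) and (m + 1) not in links:
            m += 1
        runs.append((j, m))
        j = m + 1
    grown, unaligned = list(seed), []
    # Work from right to left so that earlier insert positions stay valid.
    for j, m in reversed(runs):
        run = other[j:m + 1]
        left = links.get(j - 1)    # seed index matched by the left sibling, or None
        right = links.get(m + 1)   # seed index matched by the right sibling, or None
        if left is not None and right is not None and right == left + 1:
            grown[right:right] = run       # case 1: between two consecutive siblings
        elif left == len(seed) - 1 and right is None:
            grown.extend(run)              # case 2: after the rightmost node
        elif left is None and right == 0:
            grown[0:0] = run               # mirrored case 2 (assumption)
        else:
            unaligned[0:0] = run           # case 3: ambiguous, not inserted
    return grown, unaligned

case1 = partial_align(["a", "b", "e"], ["b", "c", "d", "e"])  # (['a','b','c','d','e'], [])
case2 = partial_align(["a", "b"], ["b", "c", "d"])            # (['a','b','c','d'], [])
case3 = partial_align(["a", "b", "e"], ["a", "x", "e"])       # (['a','b','e'], ['x'])
```

The third call mirrors the ambiguity described in case 3: x sits between a and e in Ti, but a and e are not consecutive in Ts, so x is returned as unaligned.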

Partial Alignment of Multiple Trees

The alignment algorithm is based on tree matching and uses only HTML tags to compare pairs of nodes. Figure 2.9 illustrates the multiple tree alignment algorithm.


Fig. 2.9 Partial Tree Alignment with multiple trees [81].

The algorithm starts with three trees, Ts, T2 and T3, which are to be aligned. The tree which has the maximum number of data items becomes the seed tree, Ts. The algorithm is based on the matching of two trees.

Step 1 A DOM tree is established for each data record. The tree consists of two levels: parent and child.

Step 2 The DOM tree that holds the maximum number of data items (nodes) is taken as the seed tree. Figure 2.9(a) shows the seed tree (Ts).

Step 3 Let Ts be the seed tree and let Ti (i ≠ s) denote each of the other trees in the data region. Each Ti is matched with Ts until the end of Ti is reached.

There are two possible situations:

(i) If the location for adding a new node can be resolved, the node in Ti is inserted into Ts. Figure 2.9 (c), (d) and (f) show this situation. The seed tree Ts is expanded.

(ii) If the location for inserting a new node cannot be uniquely determined (as shown in Figure 2.9 (a), (b) and (d)), there is more than one possible place to insert the new node; here, nodes X and I could be added after node D, J or K. Therefore, T2 cannot be fully aligned with Ts at this point and is moved to R, the set of trees to be processed again.

Step 4 When all Ti have been processed, the trees that still contain unaligned nodes are processed again. If their nodes still cannot be aligned with Ts, the process ends and the unmatchable data items are moved to a single column.
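Steps 1 to 4 can be sketched in Python as a loop that grows the seed and retries the trees that could not be fully aligned. This is a rough sketch under stated assumptions, not the DEPTA code: trees are lists of child tag names, `grow_seed` and `partial_tree_alignment` are hypothetical names, and the opcodes of `difflib.SequenceMatcher` stand in for tree matching. Under this stand-in, an `insert` opcode means the unmatched nodes sit between two consecutive matched siblings (or at either end of Ts), so the location is unique, while a `replace` opcode means Ts has unmatched nodes at the same position, so the location is ambiguous. The cap on the number of passes is a safety guard added for this sketch.

```python
from difflib import SequenceMatcher

def grow_seed(seed, tree):
    """Insert nodes of `tree` into `seed` where the location is unique.

    Returns (new_seed, number_inserted, number_left_unaligned).
    """
    sm = SequenceMatcher(None, seed, tree, autojunk=False)
    new_seed, inserted, unaligned = [], 0, 0
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        new_seed.extend(seed[i1:i2])       # seed nodes are always kept
        if op == "insert":                 # unique location: expand the seed
            new_seed.extend(tree[j1:j2])
            inserted += j2 - j1
        elif op == "replace":              # ambiguous location: leave unaligned
            unaligned += j2 - j1
    return new_seed, inserted, unaligned

def partial_tree_alignment(trees):
    """Return (final seed, trees that still have unaligned nodes)."""
    s = max(range(len(trees)), key=lambda i: len(trees[i]))   # Step 2
    seed = list(trees[s])
    pending = [t for i, t in enumerate(trees) if i != s]
    for _ in range(len(trees)):            # cap on passes (assumption)
        retry, grew = [], False
        for tree in pending:               # Step 3
            seed, inserted, unaligned = grow_seed(seed, tree)
            grew = grew or inserted > 0
            if unaligned:
                retry.append(tree)         # plays the role of R
        pending = retry
        if not pending or not grew:        # Step 4: stop when retrying cannot help
            break
    return seed, pending

# Hypothetical records: "x" is ambiguous for the second tree until the
# third tree has expanded the seed.
result = partial_tree_alignment([["b", "d", "e", "g"], ["b", "x", "e"], ["d", "x", "e"]])
# result == (['b', 'd', 'x', 'e', 'g'], [])
```

In this example the second tree is set aside on the first pass and aligns cleanly on the second pass, because the third tree has inserted x between d and e in the meantime.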

The complexity of this algorithm in Big O notation is O(k²), where k is the number of trees. Big O notation, also called Landau’s symbol, is used in computer science to describe the performance or complexity of an algorithm. The letter O denotes the order, or rate of growth, of a function, so T(k) = O(k²) means that T(k) grows at most on the order of k².
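For reference, the standard formal definition behind this notation can be written as follows, where the constants c and k_0 are symbols introduced here for illustration:

```latex
\[
T(k) = O(k^2) \iff \exists\, c > 0,\ k_0 > 0 :\quad T(k) \le c\,k^2 \quad \forall\, k \ge k_0
\]
```

In words, beyond some number of trees k_0, the cost T(k) is bounded above by a constant multiple of k².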

Incorrect alignment of data items happens in the two following situations:

1. Data items of the same attribute are incorrectly aligned into different columns because they are enclosed by tags with different tag names.

2. Data items of different attributes are incorrectly aligned into the same columns because they are enclosed by tags with the same tag name.

These two kinds of incorrect alignment show that the tag name is a significant feature for the precision of partial tree alignment.
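Both situations can be reproduced with a small Python sketch that compares nodes by tag name only. The records, the attribute labels (`photo`, `title`, `price`, `date`) and the function name `aligned_pairs` are hypothetical and serve only as ground truth for the illustration; `difflib.SequenceMatcher` again stands in for the tree matching step.

```python
from difflib import SequenceMatcher

def aligned_pairs(rec_a, rec_b):
    """Align two records by tag name only.

    Each node is a (tag, attribute) pair. The attribute is hidden from the
    matcher and is returned only so the outcome can be inspected.
    """
    tags_a = [tag for tag, _ in rec_a]
    tags_b = [tag for tag, _ in rec_b]
    sm = SequenceMatcher(None, tags_a, tags_b, autojunk=False)
    pairs = []
    for a, b, size in sm.get_matching_blocks():
        for d in range(size):
            pairs.append((rec_a[a + d][1], rec_b[b + d][1]))
    return pairs

rec_a = [("img", "photo"), ("b", "title"), ("span", "price")]
rec_b = [("img", "photo"), ("strong", "title"), ("span", "date")]
pairs = aligned_pairs(rec_a, rec_b)
# pairs == [('photo', 'photo'), ('price', 'date')]
```

The two title items are not aligned because their tags differ (situation 1), while price and date end up in the same column because both are enclosed in a span tag (situation 2).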