
In document Query-Time Data Integration (Page 40-42)

2.3 Top-k Consistent Entity Augmentation

2.3.5 Extension: Genetic Algorithm

There is a large body of literature dedicated to the set covering problem, in which various optimization methods apart from the greedy approach are studied (see Section 2.6). The genetic approach seems especially viable for our specific versions of the problem, as it intrinsically generates a pool of solutions from which k can be picked, and both consistency and diversity of the results can be modeled intuitively. Specifically, consistency can be modeled as part of the fitness function, and diversity can be introduced through a suitable population replacement strategy. In general, to apply the genetic framework to a problem, we need to define

• a representation of individuals and their genomes,

• the fitness function,

• a crossover and a mutation function,

• a method for population initialization,

• the strategy for parent selection,

• and finally a mechanism for the replacement of individuals in each generation.

We will discuss all of these factors in the following. While our approach is inspired partly by (Beasley and Chu, 1996), our problem domain makes different choices for almost all of these decisions necessary. Obviously, we can represent an individual’s genome as the set of data sources it is comprised of, i.e., as a bit vector where the n-th bit is set if the n-th candidate source was used in the cover. We can then use the objective function 2.6 as the fitness function for individuals. The population is initialized by creating covers with the Greedy algorithm until all candidate sources have been used in at least one cover. In this use case, the Greedy algorithm is modified to strongly favor unused data sources in later iterations to quickly produce such an initial population. This can be trivially achieved by assigning an increasing weight to the redundancy term over the iterations of the Greedy Algorithm 1. This initialization ensures that all the possible genetic material, i.e., all possible data sources, is represented in the starting population.
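The bit-vector genome and the weighted Greedy initialization can be sketched as follows. This is a minimal illustration, not the thesis implementation: all names (`init_population`, `coverage`, etc.) are hypothetical, and the concrete penalty schedule is an assumption standing in for the increasing redundancy weight described above.

```python
def init_population(candidates, entities, coverage, pop_size):
    """Greedy-based initialization sketch: build pop_size covers,
    growing a redundancy penalty for already-used sources each round
    so that all candidate sources appear in the starting population.
    coverage[d] maps a source to the set of entities it covers."""
    population, used = [], set()
    for it in range(pop_size):
        weight = 1.0 + it                    # growing redundancy penalty (assumed schedule)
        uncovered = set(entities)
        genome = [0] * len(candidates)       # bit vector: bit n set if source n is used
        while uncovered:
            def score(i):
                gain = len(coverage[candidates[i]] & uncovered)
                return gain - (weight if candidates[i] in used else 0.0)
            best = max((i for i in range(len(candidates)) if genome[i] == 0),
                       key=score, default=None)
            if best is None or not coverage[candidates[best]] & uncovered:
                break                        # no remaining source adds coverage
            genome[best] = 1
            used.add(candidates[best])
            uncovered -= coverage[candidates[best]]
        population.append(genome)
    return population
```

Later covers pay an increasing price for reusing sources, so sources skipped by the plain Greedy algorithm still enter the gene pool.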

The most interesting step is the crossover function, as combining two sets of data sources does not necessarily again yield a valid cover. If the wrong genes are passed to the descendant solution, the solution may not be feasible, i.e., there may be uncovered entities, or if too many genes are passed, the cover may not be minimal. In (Beasley and Chu, 1996), genes shared by both parents are passed on definitively, and those only present in one parent are passed with a probability corresponding to their weight, which leads to potentially infeasible or non-minimal solutions as mentioned above. To correct this, another step of pruning redundant genes and adding random genes from the pool that cover missing entities was proposed.
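One plausible reading of the probabilistic inheritance described above can be sketched as follows; this is an illustration under the assumption that non-shared genes are inherited with a probability proportional to the owning parent's fitness, not a verbatim reproduction of the operator in (Beasley and Chu, 1996).

```python
import random

def fusion_crossover(p1, p2, fitness1, fitness2):
    """Crossover sketch for bit-vector genomes: shared genes are
    passed on definitively; a gene present in only one parent is
    inherited with probability proportional to that parent's fitness
    (assumed interpretation). May yield infeasible or non-minimal
    covers, which is why a fill-and-prune step must follow."""
    total = fitness1 + fitness2
    child = []
    for g1, g2 in zip(p1, p2):
        if g1 == g2:
            child.append(g1)                 # shared gene: inherited definitively
        else:
            p = fitness1 / total if g1 else fitness2 / total
            child.append(1 if random.random() < p else 0)
    return child
```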

Algorithm 3 Top-k consistent set covering: Genetic

function Genetic-TopK-Covers(k, s, E, D)
    U ← 0_{|E|×|D|}                              ▷ Usage matrix
    P ← Init-GreedyTopKCovers(k · s, E, D, U)
    for i ∈ 0..k · s do
        c1, c2 ← pickRandom(P)                   ▷ Select parents
        ▷ Tournament vs. most similar solution
        w1, l1 ← tournament(c1, arg max_{x ∈ P\c2} sim(c1, x))
        w2, l2 ← tournament(c2, arg max_{x ∈ P\c1} sim(c2, x))
        c+ ← Cover(E, c1 ∪ c2, U)                ▷ As in Alg. 1
        Mutate(c+)                               ▷ Randomized flipping of genes
        FillAndPrune(c+)                         ▷ As in (Beasley and Chu, 1996)
        removed ← minscore(l1, l2)
        P ← (P \ removed) ∪ c+
        U ← UpdateUsage(U, c+, removed)          ▷ As in Alg. 1
    C ← Select(k, P)                             ▷ As in Alg. 2
    return C

However, this solution does not take the consistency of the generated descendant into account. Instead, we again use the basic Greedy approach as a building block, but use the shared genes as the starting point and the union of the parents' genes as the set of candidates, instead of the whole set D. We use the same mutation strategy, i.e., randomly flipping bits in the genome, and subsequently use the prune-and-fill approach from (Beasley and Chu, 1996) to remove datasets that may have become redundant and to add datasets that fill holes created by mutation. The fill operation can be realized using the Greedy algorithm with the incomplete cover as the starting point, i.e., by computing Cover(E, D, U) after mutation if necessary. The prune operation alternates between computing the set of redundant datasets rd = {d | cov(c \ d) = E} and removing arg min_{d ∈ rd} sim_A(d, c) until no redundancy is left in the cover.
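The prune loop can be sketched as below. This is a minimal illustration; `sim_to_cover` stands in for sim_A and is a hypothetical helper, as are the other names.

```python
def prune(cover, coverage, entities, sim_to_cover):
    """Prune sketch: alternate between computing the redundant sources
    rd = {d | cov(c minus d) = E} and removing the redundant source
    least similar to the cover, until no redundancy remains."""
    cover, entities = set(cover), set(entities)
    while True:
        redundant = {d for d in cover
                     if set().union(*(coverage[x] for x in cover - {d})) >= entities}
        if not redundant:
            return cover
        # drop the redundant source that fits the cover worst
        cover.remove(min(redundant, key=lambda d: sim_to_cover(d, cover)))
```

Recomputing rd after every removal is necessary because deleting one source can make a previously redundant source essential again.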

Diversity is achieved through several mechanisms: in the crossover step, since the Greedy building block uses a global usage matrix U throughout the run of the genetic algorithm; in the mutation step, which introduces randomness into the population; and finally in the parent selection and replacement strategy. Here, we select two parents randomly to create a descendant, but before crossover, we pair each of them with its most similar solution in the population and use the better one by score, while the least fit solution is replaced in the next generation. Algorithm 3 gives an overview of our genetic consistent top-k set covering approach and all the mentioned mechanics.
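One generation of this selection and replacement strategy can be sketched as follows. All names are illustrative, and `make_child` is a hypothetical hook standing in for the Greedy crossover, mutation, and fill-and-prune pipeline; crossing the tournament winners is one reading of "use the better one by score", which is an assumption of this sketch.

```python
import random

def evolve_step(population, fitness, similarity, make_child):
    """One generation sketch: pick two parents at random, pair each
    with its most similar other solution, keep the tournament winner
    as a parent, and replace the least fit loser with the child."""
    c1, c2 = random.sample(population, 2)

    def tournament(c, others):
        # rival = most similar solution to c (excluding c itself)
        rival = max((x for x in others if x is not c),
                    key=lambda x: similarity(c, x))
        return (c, rival) if fitness(c) >= fitness(rival) else (rival, c)

    w1, l1 = tournament(c1, [x for x in population if x is not c2])
    w2, l2 = tournament(c2, [x for x in population if x is not c1])
    child = make_child(w1, w2)
    loser = min((l1, l2), key=fitness)       # least fit loser is replaced
    return [child if x is loser else x for x in population]
```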

[Figure 2.5: REA System Architecture — Web Table Index (Lucene-based; indexes schema, content, metadata); Web Table Store (disk-based K/V store for Web tables and metadata indexes); API (JSON & REST-based); Entity Augmentation (top-k consistent set covering algorithms); Data Source Management (schema & instance matching; similarity measures & -combination, match selection); Knowledge Repository (WordNet, Alexa domain popularity, term frequencies); Candidate Source Selection (relevance scoring, consistency scoring)]

Figure 2.5: REA System Architecture

Having defined the general algorithms in this section, the next step is to instantiate them for the problem of Web table-based entity augmentation by introducing the Web table-specific relevance and similarity functions rel : D → [0, 1] and sim : D × D → [0, 1]. We will therefore now describe our novel Web table-based entity augmentation system REA.