
In document Query-Time Data Integration (Page 52-55)

2.5 Evaluation

2.5.5 Evaluating Cover Quality

In this section, we demonstrate where our novel entity augmentation methods differentiate themselves from the baseline approaches: consistency and diversity, as well as minimality of the number of sources used. First, consider Figure 2.9, which depicts the five indicators consistency, diversity, relevance score, minimality, and coverage by rank, from 0, the first augmentation, to 9, the last one in the k = 10 case. The plotted values are averages over the whole set of test queries. Since the ValueGrouping baseline algorithm can only produce a single result using majority voting between all possible values for an entity, it is plotted as a single larger triangle instead of a line.

In Figure 2.9, the coverage threshold θCov (see Section 2.5.1) is set to 1.0, demanding full covers and representing the default case. We can see that the baseline algorithms, while producing results with very high average relevance scores, do so at the expense of using more different sources per cover, which is visible in the Minimality and Consistency indicators. Here, the benefit of viewing the augmentation problem as a form of the set cover problem becomes apparent: the set cover-based algorithms can create solutions that are much smaller,

[Figure: five panels plotting Relevance Score, Minimality, Consistency, Diversity, and Coverage against rank (1–10) for TopRanked, ValGroup, Greedy, Greedy*, and Genetic.]

Figure 2.9: Cover quality indicators by rank, Full covers scenario: θCov=1.0 and θCons=0.0

without losing noticeably in relevance. It is also quite noticeable that the basic Greedy approach, while constructing smaller solutions, is not able to outperform the baseline in terms of consistency or score. The Greedy* algorithm, by exploring more possible solutions, is able to improve on the baseline, while the Genetic approach clearly beats the baselines with respect to minimality and consistency: in comparison to TopRanked, it produces results that are 21% more consistent and achieve an 87% better minimality (0.28 vs. 0.52) on average, while maintaining the average relevance score of the data sources used. Note that these are averages over all ranks; the gains on the first ranks are even higher. Comparing the first cover created to the ValueGrouping baseline yields a 65% improvement in minimality (0.58 to 0.81) and 46% in consistency (0.39 to 0.57).
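To make the minimality effect concrete, the following sketch contrasts a rank-ordered baseline with plain greedy set covering on invented data. The source names, entity sets, and relevance scores are hypothetical, and neither function is the thesis's actual implementation.

```python
# Illustrative sketch (not the thesis implementation): why set-cover-based
# selection yields smaller covers than a rank-ordered baseline.
# Source names, entity sets, and relevance scores are invented.

entities = set(range(10))
# candidate sources: name -> (covered entities, relevance score)
sources = {
    "A": ({0, 1, 2, 3, 4}, 0.90),
    "B": ({0, 1}, 0.95),
    "C": ({2, 3}, 0.93),
    "D": ({5, 6, 7, 8, 9}, 0.85),
    "E": ({4, 5}, 0.92),
}

def top_ranked(entities, sources):
    """Baseline: take sources in descending relevance until everything is covered."""
    cover, uncovered = [], set(entities)
    for name, (ents, _score) in sorted(sources.items(), key=lambda kv: -kv[1][1]):
        if uncovered & ents:
            cover.append(name)
            uncovered -= ents
        if not uncovered:
            break
    return cover

def greedy_set_cover(entities, sources):
    """Greedy set cover: always pick the source covering the most uncovered entities."""
    cover, uncovered = [], set(entities)
    while uncovered:
        name = max(sources, key=lambda n: len(sources[n][0] & uncovered))
        if not sources[name][0] & uncovered:
            break  # remaining entities cannot be covered at all
        cover.append(name)
        uncovered -= sources[name][0]
    return cover

print(top_ranked(entities, sources))        # ['B', 'C', 'E', 'D'] -- four sources
print(greedy_set_cover(entities, sources))  # ['A', 'D'] -- a two-source cover
```

The high-relevance sources each cover only a few entities, so the baseline accumulates four of them, while the greedy cover needs only two, mirroring the minimality gap in Figure 2.9.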

Now consider the diversity plots, which show at each rank the diversity of the result set from the first rank up to and including the respective rank. The plots show that the algorithms behave differently with respect to using similar datasets for further solutions after the initial ones have been generated. The basic greedy approach moves through the similarity space of the candidate sources as it fills its usage matrix U (see Algorithm 1), penalizing used candidates and those similar to them. Since it creates at most k solutions, it is likely to pick new, unpenalized candidates in each iteration, which explains the strong diversity of the augmentations generated. The Greedy* approach, on the other hand, creates a multiple of the necessary solutions and will often “wrap around” the similarity space of candidates, reconsidering used sources in new combinations. The same is even more true for the Genetic approach, which combines existing good solutions into new ones, promoting the survival of useful data sources.
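The penalization idea behind Algorithm 1 can be sketched as follows. The Jaccard similarity proxy, the penalty weight alpha, and the data are assumptions standing in for the thesis's usage matrix U and its similarity measure.

```python
# Hedged sketch of the penalized greedy idea behind Algorithm 1: after each
# emitted cover, used sources and sources similar to them are penalized so the
# next cover drifts to a different region of the candidate space. The Jaccard
# similarity proxy and the penalty weight alpha are assumptions, not the
# thesis's exact usage matrix U or similarity measure.

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def penalized_greedy_covers(entities, sources, k, alpha=2.0):
    """Produce up to k covers, discounting previously used or similar sources."""
    penalty = {name: 0.0 for name in sources}  # plays the role of one row of U
    covers = []
    for _ in range(k):
        cover, uncovered = [], set(entities)
        while uncovered:
            # gain = newly covered entities minus the accumulated penalty
            name = max(sources,
                       key=lambda n: len(sources[n] & uncovered) - alpha * penalty[n])
            if not sources[name] & uncovered:
                break
            cover.append(name)
            uncovered -= sources[name]
        covers.append(cover)
        for used in cover:  # penalize the used sources and everything similar
            for n in sources:
                penalty[n] += jaccard(sources[used], sources[n])
    return covers

sources = {"A": {0, 1, 2}, "B": {3, 4, 5}, "C": {1, 2, 3}, "D": {0, 4, 5}}
print(penalized_greedy_covers(set(range(6)), sources, k=2))
# [['A', 'B'], ['C', 'D']] -- the second cover avoids the penalized sources
```

After the first cover uses A and B, their penalties (and those of partially overlapping sources) make C and D the better picks, so the second cover uses entirely different sources.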

[Figure: Diversity by rank for k = 3, k = 5, and k = 10, for TopRanked, Greedy, Greedy*, and Genetic.]

Figure 2.10: Diversity by Rank and by factor k

While the benefit of reusing good partial solutions is obvious from the other measures, it leads to more uniform results across the first entries of the top-k list. Note, however, that all algorithms converge to a similar diversity level as more solutions are added to the top-k list. At k = 10, the Genetic approach, which dominates the baselines in all other aspects, reaches similar levels of diversity when looking at the whole result list, even though it produces more similar solutions on the first few ranks. Measured over the whole result list, however, it produces covers that are still 10% more diverse than the TopRanked baseline. To summarize this experiment, the Greedy set covering approach produces the most diverse result list of all approaches, but is no improvement on the baseline in the other dimensions. The Genetic algorithm, however, greatly improves consistency and minimality, without sacrificing relevance.

Still, in this experiment it appears as if it only reaches high levels of diversity with a larger number of requested solutions. However, this is in fact an artifact of our result selection strategy, which balances the various dimensions when picking and ranking results. Consider Figure 2.10, which shows only the diversity scores, but for various levels of k. We observe that even if k is set to a lower value, the Genetic approach still reaches the same level of diversity when calculated over the whole result set. This implies that the Genetic approach does have a tendency to produce more similar results on the first few ranks, but compensates for this in later ranks.
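The cumulative diversity curves can be reproduced in miniature with the following sketch. The average pairwise Jaccard distance over the covers' source sets is an assumed stand-in for the diversity measure used in the evaluation.

```python
# Sketch of the cumulative diversity curves in Figures 2.9 and 2.10: at rank r
# the prefix of the top-k list up to and including r is scored. The average
# pairwise Jaccard distance between the covers' source sets is an assumed
# stand-in for the evaluation's diversity measure.
from itertools import combinations

def jaccard_distance(a, b):
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0

def diversity_by_rank(covers):
    """covers: one set of source ids per rank, best first."""
    curve = []
    for r in range(2, len(covers) + 1):  # diversity needs at least two covers
        prefix = covers[:r]
        pairs = list(combinations(prefix, 2))
        curve.append(sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs))
    return curve

# Invented example: the first two covers reuse the same sources, later ones
# diverge, so the curve starts at 0 and recovers in later ranks.
print(diversity_by_rank([{"A", "B"}, {"A", "B"}, {"C", "D"}, {"E", "F"}]))
# approximately [0.0, 0.667, 0.833]
```

This reproduces the qualitative shape discussed above: similar covers on the first ranks depress early diversity, but the prefix average recovers as distinct covers join the list.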

Next, consider Figures 2.11, depicting a relaxed scenario, and 2.12, depicting a best-effort-coverage scenario. In these figures we demonstrate the useful trade-off between coverage and cover quality that our algorithms make possible: by lowering θCov, we allow a certain share of uncovered entities where necessary to keep consistency above θCons. The baseline algorithms do not include such early termination, and their results from the first experiment are included as a constant frame of reference. In the relaxed scenario (Figure 2.11), we allow 10% uncovered entities, if necessary to keep consistency above 0.4, and in the best-effort-coverage scenario (Figure 2.12) we allow the algorithms to leave as much uncovered as necessary to keep consistency above 0.4. Note that θCons and θCov are guidelines for the covering process, not hard limits. Allowing 10% uncovered entities can dramatically improve the results in all areas. The Minimality score of the TopRanked baseline is doubled by the Genetic approach, while consistency also improves by 49%. Interestingly, the results are even more diverse in

[Figure: five panels plotting Relevance Score, Minimality, Consistency, Diversity, and Coverage against rank (1–10) for TopRanked, ValGroup, Greedy, Greedy*, and Genetic.]

Figure 2.11: Cover quality indicators by rank, Relaxed scenario: θCov=0.9 and θCons=0.4

this case, improving 12% over the default case for the Genetic algorithm. These two observations can be explained by the fact that certain entities can only be covered by data sources that have no fitting “partner” sources, which reduces consistency if they have to be covered in one cover, while other entities are coverable by only a small number of sources, which then have to be included in almost every cover, reducing diversity. These “unfavorable” data sources are skipped by our algorithms if the coverage requirement is relaxed.

In the third covering experiment, shown in Figure 2.12, we allow aggressive optimization of the result regarding consistency by sacrificing coverage completely. We can see that perfectly consistent and diverse results can mostly only be reached by not using more than one source: both Greedy* and Genetic mostly return single datasets as the sole augmentation on the first few ranks. Note how in this configuration the algorithms select almost completely different datasets for every augmentation, as the diversity stays above 0.8 throughout the higher ranks; to do so, however, they have to choose smaller and smaller datasets, which is visible in the coverage quickly falling to unusable levels.
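The threshold-guided early termination discussed in these experiments can be sketched as follows. The pairwise compatibility scores and the mean-based consistency definition are illustrative assumptions, not the thesis's exact formulation.

```python
# Sketch of threshold-guided early termination: the greedy loop stops adding
# sources once the next-best addition would push consistency below theta_cons,
# provided coverage has already reached theta_cov. The compatibility scores
# and the mean-based consistency are illustrative assumptions.
from itertools import combinations

def consistency(cover, compat):
    pairs = list(combinations(sorted(cover), 2))
    if not pairs:
        return 1.0  # a single-source cover is treated as fully consistent
    return sum(compat[p] for p in pairs) / len(pairs)

def bounded_greedy(entities, sources, compat, theta_cov=1.0, theta_cons=0.4):
    cover, uncovered = set(), set(entities)
    while uncovered:
        best = max(sources, key=lambda n: len(sources[n] & uncovered))
        if not sources[best] & uncovered:
            break
        coverage_so_far = 1 - len(uncovered) / len(entities)
        if coverage_so_far >= theta_cov and consistency(cover | {best}, compat) < theta_cons:
            break  # prefer a consistent partial cover over a full, inconsistent one
        cover.add(best)
        uncovered -= sources[best]
    return cover, 1 - len(uncovered) / len(entities)

entities = set(range(6))
sources = {"A": {0, 1, 2}, "B": {3, 4}, "C": {5}}
compat = {("A", "B"): 0.8, ("A", "C"): 0.1, ("B", "C"): 0.1}
print(bounded_greedy(entities, sources, compat, theta_cov=1.0))  # covers all six entities
print(bounded_greedy(entities, sources, compat, theta_cov=0.5))  # leaves entity 5 uncovered
```

With θCov = 1.0 the guard never fires and the incompatible source C is included; lowering θCov lets the algorithm skip the "unfavorable" source and hold consistency above θCons, which is exactly the trade-off visible in Figures 2.11 and 2.12.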

To summarize, not only does the set covering-based approach to top-k entity augmentation improve the quality of the results in all measured dimensions, but the parameters θCov and θCons also allow the algorithms to be tuned to varying use cases.
