2.5 Evaluation
2.5.1 Experimental Setup
This section introduces our experimental setup, specifically the dataset and queries we use, the baseline algorithms, as well as the indicators we measure.
Dataset All queries are run against the full Dresden Web Table Corpus, a corpus of 125M
Web tables we extracted from a public Web crawl, as described in (Eberius et al., 2015a) and published on http://wwwdb.inf.tu-dresden.de/misc/dwtc.
Queries To enable comparable results, we used similar domains and queried attributes as in
related work (Yakout et al., 2012; Zhang and Chakrabarti, 2013): companies from the Forbes 2000 list⁷ with the attributes revenue, employees and founded, countries with the attributes population, population growth and area, and finally largest cities based on the Mondial database⁸, with the attribute population. Unless stated otherwise, the default query set used for the experiments has 4 distinct queries for each domain-attribute pair, with 20 entities (|E| = 20) each, and k = 10. For the company and city domains we also differentiate between top and random entities, i.e., four queries from the top 100 of the Forbes list, and four with randomly picked entities. In addition to the standard queries, we also vary |E| and k for the performance experiments.
⁷ http://www.forbes.com/global2000/
Precision Assessment To establish whether an augmentation value returned by any algorithm is correct, one possible approach is to manually compile a Gold Standard, i.e., the set of correct answers for the test queries. However, even seemingly simple facts such as population counts can refer to different years, or may use different ways of measurement, e.g., different ways of defining city borders for the city domain. Furthermore, there may be slight variations of the queried attribute, such as “under 25 years” for the attribute “population”, which may still be of interest for an exploratory query. As discussed in Section 2.2, we aim at embracing this variety and return top-k results, instead of a single-truth value.
We therefore evaluate precision in terms of relevance of the returned results to the query, instead of a binary correctness decision. To establish relevance, we asked a group of human judges to classify datasets. Specifically, for each entity in each result set we asked the judges to evaluate whether the data source used to cover the entity was relevant to the query, and whether the entity was matched correctly. While we collected judgments to cover all the domains and attributes introduced above, to keep the amount of manual work manageable, the number of augmentations created was smaller than in the automatic evaluation (only up to k = 5).
Indicators Due to the nature of the cover problem, it is not enough to measure relevance and
coverage of the query results to assess the quality of the proposed algorithms. We further need to determine measures such as consistency and minimality of the individual augmentations, as well as the diversity of the result list. We will now discuss the indicators measured in our evaluation in detail.
• Coverage: We measure the percentage of the queried entities that are augmented with a value. However, we do not primarily aim at measuring the quality of the Web tables corpus or the schema and instance matching system used. Instead, our evaluation focuses on the quality of the covering algorithms, which is orthogonal to these issues. For this reason, unless stated otherwise, we only measure the indicators for entities for which our combination of table corpus and matching system returns at least one (potentially irrelevant) result value. We do, however, also provide absolute coverage percentages for reference and comparison with other entity augmentation systems in Section 2.5.3.
• Precision: We measure the percentage of entities for which a relevant value was retrieved, as described above in the discussion of Gold Standards. This considers augmented values for each entity individually, i.e., an augmentation with relevance 1.0 indicates that each entity was augmented with a value that was judged relevant with respect to the query keyword. This does not necessarily imply that the augmentation is consistent.
• Relevance Score: For this indicator, we measure the average relevance score each individual dataset was assigned by the rel function. This score, however, is not to be confused with the actual relevance as measured by human judges, defined above. As explained in detail in Section 2.4.2, this score measures the confidence of the schema and instance match between query and data source, as well as the quality of the source. Thus, a higher value is better, as higher scored datasets are more likely to contain relevant results. It is calculated exactly as the relevance cover quality score defined in Equation 2.8.
• Minimality: We measure the number of datasets used in a cover in relation to the number of entities it covers. This corresponds to the minimality cover quality score as defined in Equation 2.9. The measure is purposefully not related to the set of all entities E, as a low coverage would otherwise be rewarded with a high Minimality. As we established in Section 2.1, a smaller number of sources, and thus a higher Minimality score, is better.
• Consistency: We measure consistency as the average similarity between the datasets that were used to create the respective cover. A higher value means that the respective augmentation was created from more similar datasets. This corresponds to the consistency cover quality score as defined in Equation 2.10. While the consistency score is calculated fully automatically based on the similarity function sim, we also perform an additional manual evaluation of consistency and diversity using manual tagging, described in Section 2.5.5.
• Diversity: This indicator is measured not for a single cover, but for a set of covers. We use the commonly employed method of measuring the average distance between all pairs of items in the result list. In our case, this corresponds to the diversity cover quality score as defined in Equation 2.11.
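To make the indicators above more concrete, the following is a minimal sketch of how coverage, precision, consistency and diversity could be computed. It is not the implementation used in the evaluation: the exact scores are those of Equations 2.8 to 2.11, which may weight or normalize differently. The function names, the Jaccard-based `jaccard` helper standing in for sim, and the representation of datasets and covers as sets are all assumptions made for illustration.

```python
from itertools import combinations
from statistics import mean


def jaccard(a, b):
    """Hypothetical stand-in for the similarity function sim: set overlap."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def coverage(queried, augmented):
    """Fraction of queried entities that received a value."""
    queried = set(queried)
    return len(queried & set(augmented)) / len(queried)


def precision(judgments):
    """Fraction of entities whose value was judged relevant.

    `judgments` maps each augmented entity to True/False (assumed format).
    """
    return sum(1 for ok in judgments.values() if ok) / len(judgments)


def consistency(cover, sim=jaccard):
    """Average pairwise similarity between the datasets of one cover."""
    pairs = list(combinations(cover, 2))
    if not pairs:
        return 1.0  # assumption: a single-dataset cover is fully consistent
    return mean([sim(a, b) for a, b in pairs])


def diversity(covers, dist=lambda a, b: 1.0 - jaccard(a, b)):
    """Average pairwise distance between all covers in a result list."""
    pairs = list(combinations(covers, 2))
    if not pairs:
        return 0.0  # assumption: a single cover has no diversity
    return mean([dist(a, b) for a, b in pairs])
```

In this sketch consistency takes the datasets of a single cover, whereas diversity takes the whole top-k result list, mirroring the distinction drawn in the list above.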
Relaxed Coverage To study the trade-offs between coverage and the other dimensions, we
introduce early break parameters θCov and θCons to all of our algorithms. Specifically, we allow our algorithms to stop iterating if at least a fraction θCov of the entities is covered and the cover consistency would drop below θCons by continuing. For example, in a relaxed scenario with θCov = 0.9 and θCons = 0.4, the algorithms are allowed to leave up to 10% of the entities uncovered if the consistency score would otherwise fall below 0.4. Unless stated otherwise, however, θCov is set to 1.0. In other words, the default setting aims for full coverage of the queried entities regardless of consistency.
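The early break condition can be read as a simple predicate over the current state of a cover. The sketch below illustrates that reading only; the function name and its arguments are hypothetical, and how the algorithms estimate the consistency after the next step is not modeled here.

```python
def may_stop_early(covered_fraction, consistency_if_continued,
                   theta_cov, theta_cons):
    """Return True if the algorithm is allowed to stop iterating.

    covered_fraction: share of queried entities covered so far (0.0 to 1.0).
    consistency_if_continued: consistency score the cover would have after
        adding the next dataset (assumed to be supplied by the caller).
    """
    enough_coverage = covered_fraction >= theta_cov
    consistency_would_drop = consistency_if_continued < theta_cons
    return enough_coverage and consistency_would_drop


# Relaxed scenario from the text: theta_cov = 0.9, theta_cons = 0.4.
relaxed = may_stop_early(0.9, 0.35, theta_cov=0.9, theta_cons=0.4)

# Default setting: theta_cov = 1.0, so a partial cover never stops early.
default = may_stop_early(0.95, 0.1, theta_cov=1.0, theta_cons=0.4)
```

With θCov = 1.0 the predicate can only hold once every entity is covered, at which point the cover is complete anyway, which matches the stated default of full coverage regardless of consistency.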
Other Parameters If not mentioned differently, all parameters are kept constant throughout
all the experiments. This is especially important for the underlying Web table retrieval and matching system that comes with a set of typical parameters of its own, e.g., threshold for string distances used in the matching step, or number of raw candidates retrieved from the Web table index. As this system is not the focus of the evaluation, we choose reasonable defaults and keep them constant. For the consistent set covering algorithms themselves we study the influence of |E| and k, i.e., the number of entities in the query and the number of covers to create. The search space factor s, used in the Greedy* and Genetic approaches (Algorithms 2 and 3), determining the number of solutions to create or the number of generations respectively, is kept constant at 10 in the evaluation for space reasons.
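One way to keep such defaults constant while varying only |E| and k is a single immutable configuration object. The sketch below is illustrative: the class and field names are hypothetical, only the default values (|E| = 20, k = 10, s = 10, θCov = 1.0) are taken from the text, and the varied values of |E| are made up for the example. No default for θCons is stated, so it is left unset.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class ExperimentConfig:
    num_entities: int = 20             # |E|, entities per query
    k: int = 10                        # number of covers to create
    search_space_factor: int = 10      # s, used by Greedy* and Genetic
    theta_cov: float = 1.0             # full coverage by default
    theta_cons: Optional[float] = None  # no default given in the text


DEFAULT = ExperimentConfig()

# Vary only |E| for a performance run; the values here are hypothetical.
performance_runs = [replace(DEFAULT, num_entities=n) for n in (10, 20, 40)]
```

Because the dataclass is frozen, a run cannot silently change a parameter; every variation is an explicit copy made with `replace`.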