• No results found

5.2 Case Study

5.2.4 Evaluation Results

5.2.4.1 Domain Coverage

We evaluated the efficiency of the materialization strategies described in Sec- tion 5.2.1 in terms of domain coverage for each attribute with respect to the number of queries required to achieve a given coverage value. The goal is to assess the ability of each materialization strategy to quickly explore the data and domain-space of a data source. To perform the evaluation, we cre- ated a database composed of ˜1M real estate offers crawled from an existing Real Estate Web site. Experiments were performed by considering the access pattern realEstateByLocation, which takes a real estate type and a geo- graphical location as the input attributes.

The real estate type input attribute was populated using a dictionary strategy, which used a static, pre-populated list of attribute values such as ’rental’ or ’for sale’ as the input dictionary. The location input attribute was populated

5.2. Case Study 109

Figure 5.8: Experimental Results for SPSS domain coverage.

by reseeding strategy; moreover a matching location output attribute return- ing locations was mapped as an appropriate input attribute value provider. A domain of postcodes found in the collected real estate database was used as the location input attribute domain. The domain size of the postcodes re- seeding input attribute was approximately 11000, a randomly selected subset of which was used as a starting seed dictionary.

For each of the materialization (result generation) strategies 10 runs were performed; to avoid biases in the evaluation, the input attributes conforming to the reseeding input strategy were initialized at each run by a randomly se- lected subset initDict of size 100. The resulting domain coverage increase had been averaged between runs for each strategy and observed in 10% increments (with regards to the overall domain size).

5.2.4.2 Coverage relative to the full materialization

We evaluated the efficiency of the materialization SPSS strategy described in the previous subsection in terms of result coverage for each query with regards to the number of queries required to achieve a given coverage value. The setup of the experiment was the same as in the domain coverage experi-

5.2. Case Study 110

ment. We used a database composed of ˜1M real estate offers crawled from an existing Real Estate Web site. Experiments were performed by considering the access pattern realEstateByLocation, which takes a real estate type and a geographical location as the input attributes. While obtaining the origi-

Figure 5.9: Experimental Results SPSS - Total Result Coverage - page size 10.

nal materialization the set of queries used for the discovery was serialized in a DBMS. During the experiment a subset of 80% randomly selected queries of the original query set were executed in batches of 500 against the origi- nal materialization. After each batch the difference between the experiment materialization data and the original materialization was taken. Two sets of experiments were performed each one with 10 runs. First experiment was performed with chunk size 10 - minimum page size, while the other experi- ment was performed with chunk size 100 - maximum page size. The resulting result coverage increase had been averaged between runs for each strategy and observed in 10% increments (with regards to the overall result size).

5.2.4.3 Evaluation Results Discussion

We compared the performance of Breadth First Search (BFS) and Depth First Search (DFS) algorithms, each featuring two different child/sibling insertion

5.2. Case Study 111

Figure 5.10: Experimental Results SPSS - Total Result Coverage - page size 100.

policies. The results were obtained against a database of 1M real estate offers crawled from several existing Real Estate Web sites. Figure 5.8 depicts the change of coverage in respect to the number of issued queries.

We also performed the real estate database coverage experiments with varying result set sizes. Figures 5.9 and 5.10 depict the change of coverage in respect to the number of issued queries when result set size was set to 10 and 100 results tuples respectively.

Figure 5.8provides the results of our comparison. Breadth-first algorithms are able to retrieve a wider coverage on the input by requiring less service in- vocations. For the tested scenario, the first breadth-first algorithm is able to achieve a 65% coverage of the input domain after 5000 queries (values in line with the daily limit imposed by several Web data source API providers), thus, requiring 25% less queries than the second best breadth-first algorithm and 50% less queries than the best performing depth-first algorithm and al- most a 5-fold improvement over the worst, depth-first based algorithm. The depth-first algorithm, instead, proves very unsuited for the goal at hand. The experimental results match our intuition.

5.2. Case Study 112

The variance in the performance between BFS and DFS algorithms matches our intuition. DFS algorithms focus on executing all pages of the given query attribute value, thus, narrowing the scope of the potential do- main discovery. The queried attribute value may suffer from poor connectivity within the domain. Another reason might be the feature of the query whereas rk matched top-k tuples of the query, thus, restricting the new domain value

discovery. Conversely we observed the situation where the query is too specific to return any values.

In conclusion, if the query is neither in top-k range nor too exact, the discovery of domain values of an attribute with unknown domain is enabled. Contrasting to DFS, BFS switches to queries with new values as soon as discovered, thus, widening the discovery scope. The behaviour also limits the negative influence of poorly connected values to the discovery rate. BFS algorithm is still suspect to the value discovery optimal situation where the query is neither in top-k nor exact as elaborated above.

Results presented in this study case match work done by [Jin 2011]. Here authors employed breadth and depth search algorithms alongside their im- proved algorithms against several hidden web databases to ascertain their domain value discovery performance.

Figure 5.9 provides the results of our second comparison. The results of both result set size experiments match the behaviour of the strategies observed in the domain coverage experiments. Breadth-first algorithms are able to retrieve a wider coverage on the input by requiring less service invocations. For the tested scenario, the first breadth-first algorithm is able to achieve a 25% coverage of the total result domain after 46000 queries, thus, requiring 18% less queries than the second best breadth-first algorithm and 25% less queries than the best performing depth-first algorithm and almost a 3-fold improvement over the worst, depth-first based algorithm.

Noticeable is the drastic difference in the number of executed queries to achieve the full result, in case of the first experiments with chunk size 10 - 120000 queries were issued while in the case of second experiments with chunk size 100, just 18000 queries were issued, almost a 9-fold decrease. This result correlates research results presented in work by [Wu 2006], where sim-

5.3. Chapter Summary 113

ilar decrease in result size resulted in 50-60% drop of the domain discovery performance.

In summary the case study case illustrates the importance of the engage- ment of the appropriate domain value discovery algorithm. Failure to perform in this phase of the materialization is likely to result in, from the number of used queries, an expansive materialization procedure. This also increases the risk of the materialization process not achieving the desired result as data acquisition phase may fail to discover new values for all the possible queries needed to achieve the assigned task.

5.3

Chapter Summary

In the previous two chapters we presented service and materialization models and formally defined them.

This chapter brought a case study that illustrated input discovery as a chal- lenge within the materialization data acquisition dimension. The case study presents our approach and experiment results proving the validity of this ap- proach. It reflects on the related research, it outlines similarities and findings that are to be proved of essence in the materialization process optimization. In this chapter we also described the architecture of the materialization module used in the empirical study throughout the research.

In the next chapter we will address the first materialization formulation step - finding a feasible solution for the materialization task from the given set of access patterns and mapped services.

Chapter 6

Modelling Feasible Solutions

6.1

Introduction

In this Chapter we focus on materialization feasibility analysis. Feasibility analysis presents a critical part of the formulation of the materialization solu- tion as it delivers a combination of access patterns that meet required coverage requirements.

The service materialization formulation and its feasibility analysis are ex- pressed in terms of Service Description Framework and Service materialization model as presented in Chapter 4.

The analysis of feasibility determines which combination of access patterns produces wanted materialization considering the available input dictionaries. It also provides an estimate of the materialization coverage by determining a maximum number of materialization calls, or boundedness, for the given feasible solution.

We present the transformation of the SDF into a network layout from which the feasibility model is derived and analysis performed upon. We for- mally define the feasibly analysis model and its relation to the service materi- alization model. Then we elaborate on the actual feasibility study that leads to the feasibility analysis model. On top of this, the model is presented as a working algorithm to be extended in the materialization process. Finally, to evaluate the effectiveness of the feasibility analysis we experiment with a set of materialization tasks of varying Access Pattern complexity.