The different approaches presented in section 4.3 present differing possibilities of evaluating the respec- tive algorithms. However, they either do not publish their collection of test web pages (like [57, 198]) or they state web pages that are not existing anymore or have undergone a complete redesign (like [136]). Thus, it is not possible to achieve comparable data, making it necessary to create an own evaluation set. Further, rating the segmentation just in categories “perfect”, “satisfactory”, “fair” and “bad” (as done in [44]) is not satisfactory, as such ratings are highly subjective and heavily depend on the ratio of errors and the length of a page. For example, ten segmentation errors in a short web page containing few segments are more significant than ten segmentation errors in a very long web page consisting of a lot of segments. Thus, an own methodology to evaluate HYRECA is presented in section 4.5.1. The evaluation design is explained in section 4.5.2, with the results being discussed in section 4.5.3.
4.5.1 Corpus Design
One design goal of HYRECA is that it should work without being limited to specific web pages or a certain web genre (i.e. a certain type of page like a blog post, a wiki page or an academic homepage). Therefore, a corpus that aims to serve as an evaluation foundation has to reflect these different types of web pages. The evaluation corpus consists of 48 web pages manually assembled from five different categories of web pages (Blogs, Company homepages, Web shops, News sites and Miscellaneous pages). A full listing of the URLs can be found in appendix B.2. While this distribution originates on the idea to measure the quality and applicability of HYRECA on a broad spectrum of pages, it casts the research question whether the algorithm works equally well with the different genres. Thus, the following categories are taken into account:
Blogs This category is a set consisting of the main page and one single blog post page of the five most popular blogs from Technorati14. The blog posts contain a different number of comments.
Company homepages This category contains home pages of the ten biggest companies according to Fortune 500 in 200715(e.g. BP, Walmart and Daimler).
Web shops This set consists of different web shop pages from eBay, Amazon, Dell and ASOS. News sites This category contains pages from popular German and English news sites.
Miscellaneous pages This category is a collection of web pages that are interesting because they have certain properties that make them challenging for a segmentation approach. Included are wikis, forums, academic homepages and pages that are very lean on markup and heavily rely on visual representation.
The respective web pages were downloaded including all assets (e.g. linked images, advertisements, CSS and JavaScripts) and stored locally. The downloaded HTML was cleaned using Tidy16 to prevent parsing and rendering errors due to invalid markup. HYRECA was executed and all found segments were visually highlighted. Two versions of segment visualization were provided, one highlighting the segments’ outlines and one setting a unique background colour for each segment. Further, for each processed web page, the number of segments detected by the pattern and visual algorithm was stored. Four of the pages could not be processed by Cobra, the HTML rendering engine used, due to excessive 14 http://technorati.com/blogs/top100?type=faves, retrieved 2008-07-08
15 http://money.cnn.com/magazines/fortune/global500/2007/, retrieved 2008-06-09 16 http://tidy.sourceforge.net, retrieved 2008-07-27
use of JavaScript and Flash, thus these were excluded from the further evaluation, resulting in a corpus size of 44 web pages.
4.5.2 Evaluation Design
Five different error classes were identified that can occur on automatic segmentation in the given sce- nario:
Missing segments are the parts of a page that should be marked as an own segment but are not recog- nized by the algorithm.
Superfluous segments are segments that are found but are not segments according to the definition given in section 4.3.1. For example, this could be a nested segment that adds no textual informa- tion.
Incomplete segments are segments that are principally correct but there is a part of the segment not included. An example is a textual part of the page that lacks a heading.
Too big segments are segments that are too big in respect to the segment definition, e.g. a segment spanning two comments in a blog.
Wrong segments are segments that are completely wrong and do not match any of the criteria above. Evaluating a segmentation algorithm needs human judgements as a reference. Thus, five participants (between the ages of 23 and 28, all male including three students of Information Science, one student of Education and one member of research staff) were introduced to the definition of coherent segments. The participants were instructed to examine each processed web page and check whether the found segments are correct or can be classified as one of the above–mentioned errors. The participants were given a short introduction to HYRECA and shortly trained to interpret the coloured output. On inspecting the segmented web pages, they were allowed to switch between the three visualizations, the original web page and the two versions highlighting the segments found by HYRECA. Finally, the participants were asked to count the occurrences of each of these errors. On average, an evaluation took about three hours for each participant.
The rating values (i.e. errors found per segmented web page) were converted into the ratio of error
re in relation to all segments found on the web page for each web page, rater and error class. The ratios were aggregated in five ordinal classes denoting the severity of error in fine–granular steps of 2.5%, i.e. category 1 as0.0<= re<= 0.025 (“completely correct segmentation to 2.5% of all segments were wrong”), category 2 as 0.025 < re <= 0.05, category 3 as 0.05 < re <= 0.075, category 4 as 0.075 < re <= 0.1 and category 5 as re > 0.1. This categorization is based on the observation that the participants found an error rate of more than 10% (i.e. 10% of all found segments were erroneous, which is category 5) not to be acceptable anymore.
4.5.3 Results of the Evaluation
The data gathered in the evaluation has been evaluated with regard to the participants’ agreement on their ratings and the quality of the segmentation approach.
Agreement between Raters
With the five categories defined above, the agreement p that denotes the agreement in percent of the cases can be calculated (see equation 4.1), with n being the number of raters, k the number of categories indexed by j= 1, . . . k, and nj being the number of ratings assigned by all raters to the jth category.
p= Pz j=1n 2 j − nj n(n − 1) (4.1)
The agreements of the participants of the evaluation can thus be measured for each error class per web page. A weighted mean agreement of¯p= 0.70 (with standard deviation of 0.05) resulted for all of
the 44 web pages taken into account in this evaluation.
Genre ¯p Standard deviation
Blogs 0.74 0.20 Companies 0.69 0.26 Shops 0.60 0.19 News 0.72 0.23 Miscellaneous 0.73 0.17 Weighted total 0.70 0.05
Table 4.1:The mean agreements¯pper genre
This value shows that, albeit the raters agree in many ratings, that the task of rating coherent segments is subjective. Especially when incorrect segments were identified, there were different perceptions how the problematic segments should have been partitioned. Another source of disagreement were the error classes, e.g. the error classes superfluous and completely wrong were often confounded.
A per–genre value of agreement is given in table 4.1. Here, especially the genre shops proved to be difficult to rate, probably due to the participants’ uncertainty how product presentations were to be segmented.
Correct Retrieval of Segments
Table 4.2 shows the different types of counted errors, averaged over all pages. The low number of segments marked as too big and the high number of superfluous segments indicate that HYRECA is tending to produce segments that are too fine–grained. Simultaneously, the high number of missing segments shows that the approach presented here is improvable, but just adjusting the thresholds that are used to differentiate valid segments from too fine–grained fragments will not work, as this would in turn increase the number of superfluous segments.
Precision and recall are metrics for measuring the quality of IR and extraction algorithms that are
commonly used. Precision is the proportion of retrieved items that are actually relevant, recall is the proportion of relevant items that are retrieved [8]. Applied to page segmentation, precision is the proportion of detected segments that are actually applicable to the segment definition and recall is the proportion of segments that are correctly detected. In order to calculate precision and recall, the results that are based on single web pages are averaged per genre. The following notations are used:
Type of error Mean Standard deviation
Superfluous segments 1.66 2.52
Missing segments 1.07 1.92
Incomplete segments 0.93 2.34
Too big segments 0.20 0.77
Completely wrong segments 0.24 1.92
Table 4.2:Average errors per web resource reported by participants of the study, listed by error class.
nd The total number of segments in a web page detected by HYRECA.
na The total number of segments in a web page as given by the participants. This number is estimated based on the ratings of the participants.
nc The correctly identified segments of a web page.
em The average number of missing segments.
es The average number of superfluous segments.
ei The average number of incomplete segments.
eb The average number of segments that are too big.
ew The average number of segments that are completely wrong.
The actual amount of segments nain a web page is an average of the participants’ ratings, because the exact number of segments is subjective and there is no absolute gold standard. Thus, nais approximated by the number of detected segments plus the average number of missing segments with the average number of superfluous segments subtracted:
na= nd+ em− es (4.2)
Further, the number of correctly detected segments ncis the number of detected segments without all erroneous segments:
nc= nd− (es+ ei+ eb+ ew) (4.3)
With equations 4.2 and 4.3, precision and recall can be defined as
pr ecision= nc
nd r ecal l=
nc
na (4.4)
Using equation 4.4, the quality of HYRECA can be calculated per web page and thus be averaged to an appraisal of the quality in the different genres (see table 4.3). The weighted average of precision is0.86, recall is 0.88. These results are feasible when considering the fuzziness of the manual segmentation task. Notably, the miscellaneous category does perform below average. This is because this category is a collection of assembled web pages, some of which were selected because they pose challenges to a segmentation and some that do not expose a very refined structure, e.g. wiki pages that largely consist of plain paragraphs and therefore provide only few pattern structures.
Genre Pages Precision Recall Blogs 10 0.95 0.96 Companies 7 0.80 0.84 Shops 8 0.93 0.95 News 9 0.93 0.94 Miscellaneous 10 0.70 0.73 Weighted total 44 0.86 0.88
Table 4.3:Precision and recall of correct segmentation
0.0 0.2 0.4 0.6 0.8 1.0
blog company shop news misc
Mean Ratio of Detected
Segments [%]
Web Page Collection Category
Pattern Detection Visual Analysis Standard deviation
Figure 4.12:Average ratio of segments found by the pattern detection vs. the visual approach per genre. Note: The ID and class heuristics have been omitted in this diagram, as they yield constant results per genre.
Further, the ratio of segments recognized by the pattern detection vs. segments detected by the visual analysis are compared in figure 4.12 for each genre. The standard deviation shows that the ratios are affected by large deviations between the web pages in each genre. Therefore, applying both approaches seems to be a good strategy to provide a good detection of segments in a diverse range of web pages. A limitation of this evaluation is that it does not consider the order of the applied detection steps, e.g. the ratios in the figure would probably different if the visual approach would be applied first. This, however, would have necessitated the participants to rate the segmentation of each web page twice, but this could not be expected of the participants due to expenditure of time.