Results - Document representation for efficient search engines

The sizes of the representations of Wt10g and Wt100g collections produced by the baseline and CTS are illustrated in the “Compressed” rows of Table 3.3. As expected CTS admits a small compression loss over zlib. However, both schemes substantially reduce the size of the text to 30% of the original uncompressed collections and by about half of the parsed version. The “Word map” row holds the size of the CTS data-structure required to recover unspelt words at snippets construction time. The size of this data-structure grows at almost the same rate as that of the collection, mainly due to the appearance of new terms, names, numbers, URLs, and dates [Williams and Zobel, 2005]. While we anticipated that the exclusion of terms with frequency less than three from the model would slow the rate at which it grew,

3.5. RESULTS 59 wt10g wt100g Parsed 6,041(58.99%) 46,664(45.34%) Baseline(zlib) Compressed (Mb) 2,534(24.76%) 19,875(19.31%) Document Map (Mb) 17.75 194.82 CTS Compressed (Mb) 2,987(29.16%) 23,903(23.37%) Word map (Mb) 63.69 427.34 Document Map (Mb) 12.91 141.69

Table 3.3: Total storage space (Mb) for documents for the two test collections once parsed and then when uncompressed. The percentages are calculated against the size of the unprocessed and uncompressed collection size. The Word map rows show the amount of RAM occupied by the word models.

this seems to have had little effect.

For each document, CTS and the baseline also store information about its location on disk: a file path in the case of the baseline and a 64-bit offset in the case of CTS. We store this information as part of the Document Map — a data-structure, used by a retrieval system to store document metadata (see page 26). The increase of the Document Map size as a result of adding the document location information is shown in the “Document Map” rows. The CTS_{offsets always require eight additional bytes per document. Therefore given the number} of documents to be compressed, the increase in the size of Document Map can be predicted accurately. However, with regards to the baseline system, the size of the Document Map can be much larger, depending on the directory structure chosen. In the our implementation we used a three-tier directory structure. The directory and the file names were all five bytes long. Figure 3.4 indicates the mean running time to generate snippets for the queries in the two Excite query-logs. The x-axis indicates the group of 100 queries; for example 20 holds queries in position 2000 to 2100. The y-axis shows the mean running time of the 100 queries within each group in seconds.

A clearly noticeable pattern here is the drop-effect, where the average query snippet generation time falls substantially after the first 5, 000 or so queries are processed. This is due to the operating system caching disk blocks and pre-fetching data ahead of specific read

0 10 20 30 40 50 60 70 80 90 100 Queries grouped in 100s 0.0 0.1 0.2 0.3 0.4 0.5 0.6 Time (seconds) Baseline system CTS system

(a) Exc1-97 timings

0 10 20 30 40 50 60 70 80 90 100 Queries grouped in 100s 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 Time (seconds) (b) Exc1-99 timings

Figure 3.4: Time to generate snippets for the top 20 documents per query, averaged over buckets of 100 queries, for the 10,000 Exc1-97 and Exc1-99 queries.

requests. To verify that disk access is the source of the drop-effect, we repeated the previous experiment. However, this time we excluded the Fetch component from the timings. All that this experiment measured was the amount of time spent uncompressing documents and processing their content in memory. If the drop-effect is caused by any other aspect of the

3.5. RESULTS 61 0 10 20 30 40 50 60 70 80 90 100 Queries grouped in 100s 0.000 0.005 0.010 0.015 0.020 0.025 0.030 Time (seconds) Baseline system CTS system

Figure 3.5: Time to generate snippets for the queries in the Exc1-99 log. The plotted timings exclude disk access (Fetch component), and, as in Figure 3.4, they are averaged over groups of 100 queries.

summarisation other than disk access, then the results of this experiment should also exhibit similar pattern as Figure 3.4. Figure 3.5 illustrates the mean snippets generation time for the Exc1-99 queries, excluding the Fetch component. The drop-effect is no longer apparent. Similar trends were also observed in timings of the Exc1-97 results. This suggests that the operating system is buffering a larger amount of data, and this is often enough to make an appreciable difference in snippet generation speed. Part of above caching is due to the spatial locality of disk references generated by the query stream: repeated queries will already have their document files cached in memory; and similarly different queries that return the same documents will benefit from document caching.

One can also observe that the drop-effect of the CTS system is much smaller than the baseline. This is caused by the fact that prior to the timings, CTS system opens the collection file to read preliminary metadata. If memory is flushed after CTS has read this metadata, we found that the average snippet generation time, for the first 15–20 query groups (in Figure 3.4) almost doubles. This increase however drops sharply that flushing RAM appears to have little effect on the timings of the last 5, 000 queries.

The time spent generating snippets per query, averaged over the final 5, 000 Excite queries once caching effects have dissipated, are shown in Table 3.4. Once the drop-effect has stabi-

Exc1-97 Exc1-99

Baseline 89.45 334.56

CTS _32.92 _136.18

Reduction in time 63.20% 59.29%

Table 3.4: Average turn-around time to generate snippets for a query (in milliseconds) once the caching effects have dissipated (averaged over the last 5,000 queries of the Excite logs).

lized, CTS is 60% faster than the Baseline system. There are several factors which contribute to these gains. First, unlike the baseline, CTS does not have the overhead of opening and closing individual files. A single seek to the start of a document and a read are all that are required to fetch a document into memory. Second, since words are represented as integer codewords, matching a query term to a word in a document usually requires a single test while the baseline employs a character by character string comparison. Finally, the baseline makes use of a compression scheme with decompression speed slower than that used by CTS. Figure 3.6 shows the difference in disk access time between using a baseline system that stores documents in individual files and another that stores documents indexed in a single repository (as with the CTS system). The plot illustrates the amount of time spent reading documents (Fetch component) for the Exc1-99 data-set. Similar results were also obtained using the Exc1-97 data-set. The ‘Baseline-single-file system’ plot shows the average time the baseline takes to fetch documents when the collection is stored as single file, while the ‘Baseline system’ plot shows the average time the baseline spends fetching documents where each document in the collection is stored in an individual file. The Baseline-single-file system plot also exhibits similar cache drop-effect as CTS.

Table 3.5 shows the break down of the snippet generation times reported in Table 3.4 into the three main components of the snippet generation process: Fetch, Decode and Score. This table highlights that a large portion of snippet generation time, over 60% of time in CTS_{and more than 80% in the baseline, is spent accessing disk. Moreover, this disk-access} cost increases with the size of the collection, as the bigger the collection, the larger storage space its compressed form requires, the longer distances the disk head has to seek on average. The advantage of using a byte-oriented compression scheme which allows fast decompression as well as the integer representation of symbols which minimizes the number of comparisons when generating snippets, are also apparent in the Decode and Score columns. Here, CTS

3.5. RESULTS 63 0 10 20 30 40 50 60 70 80 90 100 Queries grouped in 100s 0.0 0.3 0.6 0.9 Time (seconds) Baseline system CTS system Baseline-single-file system

Figure 3.6: Average time (in milliseconds) spent fetching documents for a query from disk for the Exc1-99 data-set. The Baseline-single-file system stores the baseline files in a single file.

Fetch Decode Score Exc1-97 Baseline 74.78 8.75 5.91 CTS _20.79 _7.63 _4.49 Exc1-99 Baseline 313.06 13.22 8.29 CTS _119.89 _9.81 _6.48

Table 3.5: Break-down of the snippets generation time (in milliseconds) for the final 5,000 queries in each log into the three main tasks: fetch, decode and score. The “Fetch” column holds the time elapsed opening files, seeking and reading document. The “Decode” shows the time spent uncompressing fetched documents. Finally the “Score” column is the time spent scoring and ranking sentences, as well as reverse mapping them to their string content.

is on average 20% faster when processing a document once in memory.

Finally, we examine the impact document size has on snippet generation speed. In par- ticular, we aim to establish if the snippet generation time drops with the size of documents summarised. The motivation being that if smaller documents incur the same amount of

Fetch Decode Score Exc1-97 Baseline 74.04 6.51 1.86 CTS _21.16 _1.65 _0.91 Exc1-99 Baseline 303.02 10.02 2.08 CTS _119.90 _3.49 _1.52

Table 3.6: Break-down of the time spent generating snippets (in milliseconds) for queries whose documents were less than half the average document size retrieved in the original query results.

processing time as larger documents, then it is obvious that investing in alternative means of compacting documents may not be fruitful.

From the results used to generate Table 3.5, we excluded all queries that contained documents larger than half the average compressed document size in the corresponding data- set. All that remained were results of 701 queries for in the Exc1-97 set, we refer to as Exc1-97-small, and 432 queries in the set, which we call Exc1-99-small. We found that the average document size in Exc1-97-small and Exc1-99-small are 68%–72% smaller than the average document size in the corresponding data-set. Similarly, as shown in Table 3.6, the average Score and Decode time for the small document is a third compared to the original data-set results shown in Table 3.5. The reduction in the Fetch time however are minimal. Table 3.6 also shows that the average time spent decompressing a document using the baseline system does not drop as much as that of CTS. This is primarily because the initialisation of the data-structures necessary to complete the decompression makes up a considerable portion of the decoding time. While the zlib implementation of the baseline may be optimised for the task at hand, given that the Fetch phase dominates the baseline performance, we did not invest time in tuning the Decode phase of the baseline.

The experimental findings thus far have shown that, compared to the defined baseline, CTS_{considerably reduces the cost of accessing disk-resident documents, as well as the cost} of processing them in memory. The results also suggest that this cost can be further reduced if the size of documents is smaller. Next, we look at strategies of optimising the CTS system.

In document Document representation for efficient search engines (Page 70-77)