
Structure of thesis

This monograph is structured as a series of chapters, each of which covers a different topic.

While each chapter is designed for the most part to stand on its own, the chapters often build on or refer to concepts introduced in previous chapters.

This monograph has 11 chapters in total; the remaining 10 are outlined below.

Chapter 2: State of Existing Research

Chapter 2 looks at the current state of research into document signatures and related fields, from early punched card categorisation systems to modern locality-sensitive hashing approaches. There are a number of different approaches to document signatures, several of which appear to have been developed independently and to have reached similar conclusions by different paths. Even so, it is possible to trace the origins of certain concepts used in signature search, many of which were created for entirely different purposes. This chapter attempts to tie the most prominent research together and show the different areas that have received attention over the years.

Chapter 3: Document Signatures

Chapter 3 reviews the theoretical underpinnings of document signatures, how they are searched and the nature of the compromises that must be made between search efficiency and quality.

While the different document signature and locality-sensitive hashing approaches were often developed separately and are justified in different ways, this chapter attempts to show how they can all be derived by combining the vector space model [Salton et al., 1975] of information retrieval with dimensionality reduction. The term weighting approaches used in ranking algorithms are also shown to be effective at resolving the inter-term cross-talk that results from the dimensionality reduction.
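The combination of a weighted term vector with dimensionality reduction can be illustrated with a minimal sketch. This is a generic sign-of-random-projection scheme written for illustration only, not the TOPSIG implementation; the 64-bit width, the SHA-256 seeding and the names `term_vector` and `signature` are all assumptions.

```python
import hashlib

SIGNATURE_BITS = 64  # assumed width, chosen only to keep the example small


def term_vector(term, bits=SIGNATURE_BITS):
    """Return a pseudo-random +1/-1 vector for a term, derived from a hash of the term."""
    digest = b""
    counter = 0
    while len(digest) * 8 < bits:
        digest += hashlib.sha256(f"{term}:{counter}".encode()).digest()
        counter += 1
    return [1 if (digest[i // 8] >> (i % 8)) & 1 else -1 for i in range(bits)]


def signature(term_weights, bits=SIGNATURE_BITS):
    """Sum the weighted term vectors and keep only the sign of each dimension."""
    totals = [0.0] * bits
    for term, weight in term_weights.items():
        for i, sign in enumerate(term_vector(term, bits)):
            totals[i] += weight * sign
    # Bit i is set when dimension i ends up positive.
    return sum(1 << i for i, total in enumerate(totals) if total > 0)


# Hypothetical term weights (for example, TF-IDF scores) for one document.
doc_signature = signature({"signature": 2.5, "search": 1.0, "hamming": 1.8})
```

Because every term contributes to every bit position, terms interfere with one another in the reduced space; giving heavier weights to more informative terms lets them dominate the sign of each dimension.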

Chapter 4: Software Implementation

Testing the effectiveness of signature approaches and working out ways they can be refined requires the use of a software search engine; however, at the time of this research, there were no publicly accessible fully-featured search engines that used signature files. While some open source locality-sensitive hashing implementations were available, these were very limited in capability and were found to be unsuited for general-purpose searching. This research looks at addressing this issue through the creation of a new search engine, TOPSIG, which is available as an open source search platform. Chapter 4 discusses the implementation of the TOPSIG search engine, the design decisions associated with its development, the modular structure and the approaches it takes in the indexing and searching of signature files.

Chapter 5: Parallel Processing

Performance is an important consideration for modern search engines and one of the original criticisms of signature file approaches was related to this. Technology has, however, moved on since much of the original research into signature files was conducted, and multi-core and multi-processor systems are now virtually ubiquitous. Chapter 5 looks into the indexing and searching of signature files from the perspective of parallelism, as well as issues associated with multithreading information retrieval tasks for performance reasons and in general how to make optimal use of the available hardware.

Chapter 6: Evaluation and Refinement

Chapter 6 examines how effective document signatures are at retrieving documents for users, specifically within the realm of ad hoc search (that is, searching for documents with short, user-supplied conjunctive text queries). While a “vanilla” search engine implemented using the signature approaches described in earlier chapters is capable of performing searches, the quality initially leaves a lot to be desired. This chapter looks at how search engines are evaluated, as well as at different approaches and refinements that can be implemented to improve the effectiveness of ad hoc signature search. Some optimisations, while effective at improving results, come with side-effects such as an increase in memory usage or a reduction in signature searching performance, and these are examined with a view to determining whether certain compromises are worthwhile.

Chapter 7: Relevance Feedback

Chapter 7 looks at relevance feedback, the concept of using feedback from the user on the usefulness of the results they have received thus far to return additional useful results to the user. This chapter covers the evaluation of relevance feedback approaches considering a prototype implementation of focused relevance feedback in TOPSIG.

Chapter 8: Document Similarity

Chapter 8 looks at document similarity: the ability to use document signatures as a proxy for the original documents when making similarity determinations. It also covers related tasks such as the automated classification or tagging of new documents by topic, and examines how Hamming distance correlates with document similarity and with shared topics and tags.
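As a minimal sketch of using signatures as a proxy for documents, the Hamming distance between two signatures is the number of bit positions in which they differ, and documents can be ranked by that distance. Signatures are assumed here to be stored as Python integers; the function names and the toy 4-bit values are illustrative only.

```python
def hamming_distance(a, b):
    """Count the bit positions in which two signatures differ."""
    return bin(a ^ b).count("1")


def rank_by_distance(query_sig, doc_sigs):
    """Return document ids ordered from the nearest signature to the farthest."""
    return sorted(doc_sigs, key=lambda doc_id: hamming_distance(query_sig, doc_sigs[doc_id]))


# Toy 4-bit signatures, chosen only to keep the arithmetic easy to follow.
doc_sigs = {"a": 0b1111, "b": 0b1000, "c": 0b0000}
ranking = rank_by_distance(0b1001, doc_sigs)  # "b" differs in 1 bit, "a" and "c" in 2
```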

Chapter 9: Inverted Signature Slice Lists

Chapter 9 looks at the inverted signature slice list (ISSL), a new approach to solving the “Hamming distance problem” of scalability in signature searches. This chapter describes the approach, its performance characteristics and limitations, as well as how the approach compares to other attempts to solve the same issues of signature scalability.
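The general idea of indexing signatures by slice, rather than comparing a query against every signature, can be sketched as follows. This is an illustration under stated assumptions and not the ISSL design described in Chapter 9: the 16-bit slice width, the 64-bit signatures, the one-bit error tolerance, the scoring rule and all function names are assumptions made for the example.

```python
from collections import defaultdict

SLICE_BITS = 16  # assumed slice width
NUM_SLICES = 4   # assumed 64-bit signatures


def slices(sig):
    """Split a signature into fixed-width slices, lowest bits first."""
    mask = (1 << SLICE_BITS) - 1
    return [(sig >> (i * SLICE_BITS)) & mask for i in range(NUM_SLICES)]


def build_index(doc_sigs):
    """For each slice position, map each slice value to the documents containing it."""
    index = [defaultdict(list) for _ in range(NUM_SLICES)]
    for doc_id, sig in doc_sigs.items():
        for pos, value in enumerate(slices(sig)):
            index[pos][value].append(doc_id)
    return index


def candidates(index, query_sig):
    """Score documents whose slices match the query exactly or with one bit flipped."""
    scores = defaultdict(int)
    for pos, value in enumerate(slices(query_sig)):
        for doc_id in index[pos].get(value, []):
            scores[doc_id] += SLICE_BITS  # exact slice match
        for bit in range(SLICE_BITS):
            for doc_id in index[pos].get(value ^ (1 << bit), []):
                scores[doc_id] += SLICE_BITS - 1  # one-bit mismatch
    return sorted(scores, key=scores.get, reverse=True)


query = 0x0123456789ABCDEF
doc_sigs = {
    "same": query,
    "near": query ^ 1,               # one bit differs
    "far": ~query & (2 ** 64 - 1),   # every bit differs
}
result = candidates(build_index(doc_sigs), query)
```

Only the lists for slice values close to the query's slices are visited, so documents far from the query ("far" above) are never touched. The trade-off is that a document whose differences are spread across every slice can be missed unless the per-slice error tolerance is raised.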

Chapter 10: Duplicate Sub-Document Identification

Chapter 10 looks at the use of document signatures for detecting duplicate sub-documents; that is, passages of identical text appearing in two or more documents. This task often arises in applications such as plagiarism detection. The chapter looks at the effectiveness of using Hamming distance to identify duplicate sub-documents, how its effectiveness can be improved, and the performance and limitations of TOPSIG with respect to this task.

Chapter 11: Conclusions and Recommendations

Chapter 11 is the denouement to this monograph and consists of a summary of the research performed, the contributions made during the course of this research and topics of interest with respect to future research in this area.

1.3 Contributions

The primary contributions of this thesis are:

• An assessment of document signatures and their effectiveness, applicability and efficiency across different situations. This is primarily covered in Chapters 3 and 6, but is an underlying theme throughout the entire monograph.

• The TOPSIG open source signature search platform, which is a highly optimised and fully-featured search engine capable of performing many different search tasks and is usable both as a search engine in its own right and as a component of a larger system. The platform is introduced in Chapter 4.

• An approach for efficient whole-document nearest neighbour searching with signatures that gets around fundamental scalability limitations inherent to locality-sensitive hashing approaches. This is covered in Chapter 9.

Secondary contributions include:

• A relevance feedback evaluation platform that is capable of communicating with an external server and exchanging topic and relevance information, and evaluating the capability of compatible relevance feedback modules while maintaining the secrecy of the collection characteristics. This is covered in Chapter 7.

• A detailed discussion and demonstration of parallel processing techniques for information retrieval, the performance implications of locking and synchronisation, and how these affect the implementation of high performance information retrieval systems. This is covered in Chapter 5.

• An assessment of the literature showing how document signatures, locality-sensitive hashes and Zatocoding are intertwined approaches with similar theoretical underpinnings.

• An examination of locality-sensitive hashing Hamming distances and how they correlate with document similarity and categorisation. This is covered in Chapter 8.

• An investigation into scalable sub-document nearest-neighbour searching and the applicability of document signatures for plagiarism detection tasks. This is covered in Chapter 10.