Content Indexing - Content Discoverability Techniques

2. State of the Art

2.5 Content Discoverability Techniques

2.5.2 Content Indexing

The indexing mechanism is a critical component of any IR-based system, which provides a formalised, simplified and machine usable representation of content contained within each resource (Salton, 1989). In adaptive systems, the indexing mechanism varies de- pending upon the nature of the content resources being incorporated in each individual system9. In adaptive systems that rely on a closed corpus of content, the content is struc- tured and annotated by the content author (or a domain expert), therefore indexing this content is relatively straightforward, as it does not require an automatic approach for this task (Aroyo et al., 2004). However, manual indexing of content resources is a labour-

8_{http://80legs.coms}

intensive task. On the other hand, adaptive systems that rely on open corpus content require an automatic approach to index the harvested content.

Content indexing, in adaptive systems, can be classified into two main approaches: doc- ument-level (De Bra & Calvi, 1998) and fragment-level (Weal et al., 2007; Levacher 2014).

2.5.2.1 Document-level Indexing

In document-level approaches, the harvested resources are indexed in their native form (usually as HTML web pages) as one-size-fits-all document level granularity resources. For example, within the KBS Hyperbook (Fröhlich et al., 1998; Henze and Nejdl, 2001), once identified by system users, open corpus resources are indexed in a content repository. Each resource is assigned to a knowledge concept of the application domain such as the "if" or "while" concepts in a programming language ontology. These assigned concepts are used for indexing the content resources in the storage module. After that, links between existing (closed corpus) resources and the newly indexed (open corpus) resources are generated automatically based upon the concept that each new resource was assigned. The indexed documents are then adapted and presented based on concepts that represent the user’s goals. Another example is the ML-Tutor system (Smith and Blandford, 2003). ML-Tutor utilises an IR mechanism to collect content resources from the open web. The harvested documents are then indexed along with the prominent key- words in each document. A document is then presented, in its native form, to the user based on the similarity between the document’s keyword vector and the keyword vector in the user’s model.

2.5.2.2 Fragment-level Indexing

On the other hand, fragment-level approaches focus primarily on content where the ad- aptation is performed at a finer level of granularity (Bunt et al., 2007). In these approaches, the harvested resources are processed and segmented into coherent fragments. These fragments are then indexed in the content repository. ArtEquAKT (Millard et al., 2003; Weal et al., 2007) for example, utilises information extraction and knowledge man- agement techniques to create dynamic biographies of artists from content available on the web. The system relies on a traditional search engine to harvest content resources that comprise biographical information about artists. The harvested resources are then fragmented into paragraphs (and sentences) that are analysed syntactically to identify whether

it contains any relevant information about the artist requested by individual users. These fragments are then associated with the relevant instances in a Knowledge Base (KB) and indexed in a MYSQL database to be used later in generating biography pages. Figure 2.7 shows how ArtEquAKT extracts the different fragments from webpages to create dynamic biographies.

Figure 2.7 The ArtEquAKT system

Another example is PMCC (Steichen, 2012), that delivers a personalised content to individual users using open corpus content as fragments of text. The system utilises an automatic metadata extraction technique to enrich the harvested content. This enriched content is then fragmented using a wrapper-based content fragmentation approach (Bunt, Carenini and Conati, 2007) to identify regions of pages in order to produce individual fragments of content. These fragments are then presented to the users based on their knowledge in user models.

Slicepedia (Levacher et al., 2014) applies the Densitometric Content Fragmentation approach (Kohlschütter & Nejdl, 2008) to fragment the harvested open corpus content into segments based on their HTML structure. The fragments (called slices) are then indexed in a content repository along with concepts that represent the topics covered in each fragment. The indexed fragments can then be retrieved based on a request from an arbitrary adaptive system. The architecture of Slicepedia is depicted in Figure 2.8.

Both approaches (document-level and fragment-level) are limited in their ability to pro- vide different levels of granularity for the indexed content. Document-level indexing approaches limit the extent to which these resources can be modified or recomposed to- gether. This in turn hinders the indexed content from being reused in multiple systems. Furthermore, as pointed by (Bunt et al., 2007), presenting the incorporated open corpus resources in their native form allows more content to be “visible to the user. However,

the more content is shown, the higher the chance of generating information overload and reducing attention to the most relevant information, defeating one of the very reasons for having adaptive systems in the first place”.

Figure 2.8 Slicepedia Architecture Pipeline

For fragment-level indexing approaches, relying upon the original structure of the content resource (HTML structure or paragraphs structure) to produce fragments and hence index them, does not reflect the needs or preferences of individual users or applications. This is because the structure posed by each resource reflects the needs and the point of view of its author. While each adaptive system has its own content requirements (based on its users), relying upon such structure does not reflect these requirements. Furthermore, in these approaches, the indexing process means that the final structure of each content item is already built. This limits the capability of these approaches to change the structure according to the individual user or application needs.

Another limitation of both approaches is the conceptual coverage associated with each indexed content item (weather a full document or a fragment). In other words, these approaches annotate individual content items with a small number of concepts (or some- times one concept) that represent the topic(s) covered by that content item. However, a content item could be associated with many different concepts with different relevancy levels that represent to what extent each concept is covered by that content item. By only considering a few concepts that are deemed most relevant and ignoring the less relevant concepts, these approaches ignore a range of possible interpretations of that content item. This in turn hinders this content item from being properly discovered and presented to users according to their needs, and from being reused in other systems.

Hence, there is a need to index content in its smallest granularity, without only considering the original structure that has been built by the author of the content resource. Ideally, content should be indexed at a range of granularity levels and associated with all relevant concepts at each of these levels. The research in this thesis hence tries to overcome the limitations of these aforementioned approaches and tries to build a structural hierarchy out of text documents without relying on their original structure. After building such a structure, the documents are indexed in a manner that facilitates ease of discovery and reuse.

In document Using NLP Techniques to Enhance Content Discoverability and Reusability for Adaptive Systems (Page 55-59)