

6.1.1 System Description

We now describe the architecture of our system and then explain the two key features in detail.

The architecture of our system (Figure 6.1) consists of four stages:

Figure 6.1: Architecture of the math search system.

Stage 1 We use our math corpus (described in Section 4.3.1) as the resource collection for our system. This collection can be replaced by periodic crawling if the system is later scaled up for public use; for now, our math corpus is a convenient collection since it already contains a variety of math resources for the 27 chosen topics.

Stage 2 We then apply our machine-learned classifiers to categorize the resources, compute their readability using our iterative approach, and link math concepts to their expression representations.

Stage 3 We employ Lucene¹, a freely available text search engine library, to index² the resources together with the results from categorization and linking.

Stage 4 Users access the organized resources through our search interface.
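Stages 3 and 4 can be illustrated with a toy example. The Python sketch below is not the Lucene-based implementation described here; it is a minimal in-memory stand-in, assuming that each resource carries its category labels as stored fields. The names `Resource`, `TinyIndex`, `tokenize` and the field names are hypothetical.

```python
import re
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Resource:
    """One categorized resource (output of Stage 2); field names are hypothetical."""
    doc_id: int
    text: str
    resource_type: str  # e.g. "Exercises" or "Textbook"
    readability: int    # discrete level on a 5-point scale


def tokenize(text):
    # Lowercased alphanumeric tokens; a crude stand-in for a real analyzer.
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyIndex:
    """Toy inverted index that keeps category labels next to each document."""

    def __init__(self):
        self.postings = defaultdict(set)  # token -> set of doc ids
        self.docs = {}                    # doc id -> Resource

    def add(self, resource):
        # Stage 3: index the text together with the categorization results.
        self.docs[resource.doc_id] = resource
        for token in tokenize(resource.text):
            self.postings[token].add(resource.doc_id)

    def search(self, query, resource_type=None):
        # Stage 4: all query tokens must match; results can be filtered by
        # resource type and are sorted by readability level.
        tokens = tokenize(query)
        if not tokens:
            return []
        ids = set.intersection(*(self.postings.get(t, set()) for t in tokens))
        hits = [self.docs[i] for i in ids]
        if resource_type is not None:
            hits = [r for r in hits if r.resource_type == resource_type]
        return sorted(hits, key=lambda r: (r.readability, r.doc_id))
```

Storing the labels alongside the indexed text is what lets the interface filter and sort without re-running the classifiers at query time.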

Feature 1: Automated categorization of resource type, information type and readability. The webpages in the collection are of various types. To facilitate filtering in the downstream user interface, we apply supervised learning to classify the webpages into one of the 12 categories listed in Table 6.1. Although by no means exhaustive, these categories are designed to meet the common resource needs of math seekers as discovered in our user study (see Table 2.1 for a list of resource needs).

Table 6.1

Resource Type         Definition
Concept Information   Explanatory texts on math concepts.
Exercises             Exercises on math concepts.
Discussion            Forum discussions on math concepts.
Paper                 Scholarly articles that describe research on math concepts.
Visualization         Applets, figures and diagrams that visualize aspects of math concepts.
Textbook              Textbooks on math concepts.
Tool                  Software packages that facilitate the application of math concepts.
Course                Courses on math concepts.
Journal               Journals on math concepts.
Research Community    Events, conferences and researchers related to the research on math concepts.
Hub                   Compiled links to resources on math concepts.
Others                Any other types of resources.

¹ http://lucene.apache.org/

² Normalization and stemming are done via the StandardAnalyzer class. The same holds for all other systems mentioned in this chapter.

We have annotated 1,068 webpages from our math corpus as our training data: all webpages for 10 concepts and those in the top 30 search results for the remaining 17 concepts, with irrelevant webpages discarded. We then extract three classes of features from the webpages: token (e.g., n-grams), webpage (e.g., URL tokens and content length) and formatting (e.g., whether a word is in bold/italics). Since we were able to clearly associate a single resource type with each webpage, we train a multi-class CRF classifier on our annotated data instead of multiple one-against-all classifiers. We then apply the resulting classifier to the remaining webpages to determine their resource type.
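To make the three feature classes concrete, the following Python sketch extracts one example of each from a webpage. It is an illustration under simplifying assumptions, not the feature extractor used in this work: the function name `webpage_features`, the feature-name prefixes, and the regex-based HTML handling are all hypothetical, and classifier training is left out.

```python
import re
from collections import Counter
from urllib.parse import urlparse

WORD = re.compile(r"[a-z0-9]+")
# Matches simple emphasis tags without attributes (an assumption of this sketch).
EMPH = re.compile(r"<(?:b|strong|i|em)>(.*?)</(?:b|strong|i|em)>", re.I | re.S)


def webpage_features(url, html):
    """Return a feature dict for one webpage; feature names are illustrative."""
    emphasized = EMPH.findall(html)
    text = re.sub(r"<[^>]+>", " ", html)  # strip tags to get plain text
    tokens = WORD.findall(text.lower())

    features = Counter()
    # Token features: unigram and bigram counts.
    for tok in tokens:
        features["uni=" + tok] += 1
    for first, second in zip(tokens, tokens[1:]):
        features["bi=" + first + "_" + second] += 1
    # Webpage features: URL tokens and content length (in tokens).
    parsed = urlparse(url)
    for tok in WORD.findall((parsed.netloc + parsed.path).lower()):
        features["url=" + tok] = 1
    features["length"] = len(tokens)
    # Formatting features: words that appear in bold or italics.
    for span in emphasized:
        for tok in WORD.findall(span.lower()):
            features["emph=" + tok] = 1
    return dict(features)
```

A dictionary of named features like this is a common input format for sequence-labeling and classification toolkits, which is why the sketch stops at feature extraction.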

The resource types of the search results are displayed together with the context of the matched keywords in the search interface. Users can also use the resource type filter to restrict the results to a specific type.

Among different types of resources, concept information resources commonly contain the most types of information (e.g., definitions, exercises, examples and proofs) sought by math seekers³. To save them the trouble of reading through the resources to locate such information, we further categorize the sentences of the concept information resources into the five categories listed in Table 6.2.

Table 6.2: Math information types for classification.

Information Type   Definition
Definition         Sentences that contain definitions of math concepts.
Exercises          Sentences that contain exercises on math concepts.
Examples           Sentences that contain examples on math concepts.
Proof              Sentences that contain proofs on math concepts.
Others             Any other types of sentences.

The categorization process is similar to that of resource type. We have annotated all the sentences in the 112 concept information webpages on Bayes’ theorem, complex numbers and modular arithmetic as our training data. We then extract five classes of features from the sentences: token (e.g., n-grams), sentence (e.g., length and position of the sentence), formatting (e.g., whether a word is in bold/italics), concepts (e.g., appearances of math concepts), and expressions (e.g., appearances of math expressions). We also train a hard, multi-class CRF classifier for this categorization on our annotated data and apply it to all the sentences in the webpages that have been categorized as concept information resources by the resource type classifier⁴.
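The sentence, concept and expression feature classes can be sketched as follows. This is a hedged illustration only: the function name `sentence_features`, the relative-position encoding, the substring test for concepts and the operator-based expression heuristic are assumptions made for the example, and token and formatting features are omitted for brevity.

```python
import re


def sentence_features(sentences, index, concepts):
    """Features for sentences[index]; names and detection rules are illustrative."""
    sentence = sentences[index]
    lowered = sentence.lower()
    tokens = re.findall(r"[a-z0-9]+", lowered)
    return {
        # Sentence features: length in tokens and relative position (0.0 to 1.0).
        "length": len(tokens),
        "rel_position": index / max(len(sentences) - 1, 1),
        # Concept feature: does any known concept name occur in the sentence?
        "has_concept": int(any(c.lower() in lowered for c in concepts)),
        # Expression feature: crude heuristic, an operator between two
        # non-space characters (will miss or over-match some cases).
        "has_expression": int(bool(re.search(r"\S\s*[=<>+^]\s*\S", sentence))),
    }
```

For example, with the concept list `["Bayes' theorem"]`, a sentence such as "It states that P(A|B) = P(B|A)P(A)/P(B)." would set `has_expression` but not `has_concept`.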

The sentences belonging to the first four information types can be viewed directly in the search results. Filtering on information type is also available for users to focus on sentences containing specific types of information.

Last but not least, math resources targeted at different audiences are written at different levels of readability. To help users pick out the ones that are suited to their level of knowledge, we compute the readability scores for the resources as described in Chapter 4. The final scores are used directly for sorting and converted into a discrete 5-point scale for display and filtering.
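The conversion rule is not spelled out here, so the sketch below shows one plausible way to discretize continuous scores: equal-width binning over the observed score range. Both the binning scheme and the function name `to_five_point` are assumptions for illustration.

```python
def to_five_point(scores):
    """Map raw readability scores to levels 1..5 using equal-width bins (assumed)."""
    if not scores:
        return []
    lowest, highest = min(scores), max(scores)
    if highest == lowest:
        # All scores identical: place every resource in the middle level.
        return [3 for _ in scores]
    width = (highest - lowest) / 5
    # min(..., 4) keeps the maximum score in the top bin rather than a sixth one.
    return [min(int((s - lowest) / width), 4) + 1 for s in scores]
```

Sorting would still use the raw scores, so two resources that share a displayed level can be ordered relative to each other.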

Feature 2: Automated linking of keywords to their expression representations. We link math concepts to their expression representations as [...] the top five linked expressions are displayed. They can then choose any of the linked expressions for expression retrieval.

³ See Table 2.1 for the list of information needs.

⁴ Unlike our work mentioned in Chapter 3, we do not employ joint inference to combine this categorization with resource type categorization. The reason is that the categorization of other types of resources would not benefit from information type categorization or vice versa, since we only target the information from the concept information resources.

Figure 6.2 shows the search interface of our system. It demonstrates how Resource Categorization and Text-to-Construct Linking are employed: the categorization labels (resource type, information type and readability) are shown as part of the search results. Tools for filtering and sorting are provided on the left of the interface. The linked expressions are displayed above the search results with checkboxes for users to choose which one(s) to search with.

In document Towards generic domain-specific information retrieval (Page 147-151)