Thesis Overview - A knowledge acquisition tool to assist case authoring from texts.

learning?

The following objectives were then obtained from the research questions:

• to design a case structure onto which unstructured textual reports will be mapped;

• to design and implement a semi-automated tool to map unstructured text to the case structure obtained above;

• to develop a keyword hierarchy to enable structuring of cases with different levels of specificity; and

• to design and implement a case-based tool to assist in the design of SmartHouse solutions.

However, in the SmartHouse domain, individual words have often been found insufficient to capture the meaning of important expressions. For example, the phrase "intercom operation" does not mean an intercom or an operation but rather the way to operate an intercom. Phrases containing two or more words have been found to be more informative. Consequently, the keyword hierarchy mentioned in the objectives has been revised to a concept hierarchy, where a concept is a group of words and phrases. Furthermore, in order to deal better with the problems of polysemy and synonymy, which are inherent in the English language, the vocabulary is restricted by harmonising the text.

1.5 Thesis Overview

This thesis will develop a case authoring tool to enable the extraction of knowledge from semi-structured text. Hierarchically structured cases allow comparison at different levels of abstraction. Thus the extracted knowledge will be assembled into structured cases. The task of creating structured cases from the documents is divided into the following steps:

1. Harmonising the text in order to obtain a uniform vocabulary;

2. Representing the documents with only those terms that actually describe the problem (key phrases);

3. Using the representative terms to create a hierarchically structured conceptual model that reflects important features in the domain and their relationships; and

4. Mapping each document onto concepts in the conceptual model in order to create structured cases.

Each of these steps is explained in detail in the following chapters. The SmartCAT-T system architecture is illustrated in Figure 1.3. Documents in the collection are used to create synonym mappings for the domain by making use of WordNet. A text harmonisation module transforms the documents into harmonised texts by utilising the synonym mappings and Google. The harmonisation module is also used to meaningfully interpret queries during problem-solving. Background knowledge is used to discover disability and disease terms pertaining to each harmonised document. A document’s disability and disease terms, Latent Semantic Indexing (LSI), and background knowledge are all utilised in identifying a document’s key problem descriptors that are not disability or disease terms.

All the discovered key problem descriptors together with the document’s disability and disease terms are used to represent the document. Formal Concept Analysis is applied to the documents’ representatives to obtain a conceptual model with which structured cases are created.

A survey of the literature pertinent to this research is carried out in Chapter 2. The survey examines techniques that are related to the different modules that comprise this project’s case authoring goals and identifies gaps in current research in order to justify the work carried out in this project.

In Chapter 3, a description of the harmonisation procedure is given. The SmartHouse reports are harmonised in order to create a uniform vocabulary and hence overcome the problems of polysemy and synonymy during the later stages of knowledge extraction. Central to the harmonisation is an unsupervised approach to Word Sense Disambiguation. WordNet is used to assign each polysemous word a sense that is applicable in the domain.
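The effect of harmonisation on a text can be illustrated with a minimal sketch. It assumes the synonym mappings have already been built and simply rewrites every known variant to one canonical term; it does not perform the WordNet-based sense disambiguation described in Chapter 3. The mapping entries and the function name `harmonise` are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical synonym mapping (variant -> canonical term).
# In the thesis these mappings are derived with the help of WordNet;
# the entries here are invented for illustration only.
SYNONYM_MAP = {
    "roaming": "wandering",
    "straying": "wandering",
    "wandering": "wandering",
    "door entry system": "intercom",
    "intercom": "intercom",
}


def harmonise(text, synonym_map):
    """Rewrite every known variant in `text` to its canonical term.

    Longer variants are tried first so that multi-word phrases win
    over any single words they contain. The text is lower-cased.
    """
    variants = sorted(synonym_map, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(v) for v in variants) + r")\b"
    )
    return pattern.sub(lambda m: synonym_map[m.group(1)], text.lower())
```

The same function could be applied to a query at problem-solving time, so that queries and reports share one vocabulary.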

Chapter 4 presents the extraction of key phrases, which comprise disease terms, disability terms and other key problem descriptors. Disease terms are identified by making use of a list of known diseases, and disability terms are obtained by making use of the commonly occurring complaint-description terms difficulty, difficulties, impaired, problem, and impairment. The system makes use of background knowledge to identify other terms that are not disability or disease terms, but are key problem descriptors. The key phrases are used as representations for the corresponding sub-problems.
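The two look-ups described above can be sketched roughly as follows. The trigger words are those named in the text; the disease list, the function names, and the fixed-size context window are assumptions made for illustration and are not taken from Chapter 4.

```python
import re

# Complaint-description terms named in the text.
TRIGGERS = {"difficulty", "difficulties", "impaired", "problem", "impairment"}

# Hypothetical stand-in for the list of known diseases.
KNOWN_DISEASES = ["dementia", "multiple sclerosis"]


def find_disease_terms(text, known_diseases):
    """Return the known diseases that occur in `text` as whole phrases."""
    lowered = text.lower()
    return [
        disease
        for disease in known_diseases
        if re.search(r"\b" + re.escape(disease) + r"\b", lowered)
    ]


def find_disability_candidates(text, window=2):
    """Return each trigger word with up to `window` words on either side.

    The window is an assumption; it only yields candidate phrases
    that would still need filtering.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    candidates = []
    for i, token in enumerate(tokens):
        if token in TRIGGERS:
            start = max(0, i - window)
            candidates.append(" ".join(tokens[start:i + window + 1]))
    return candidates
```

For instance, `find_disability_candidates("She has impaired hearing.", window=1)` yields the single candidate `"has impaired hearing"`.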

Chapter 5 examines how Formal Concept Analysis can be employed to create a conceptual model. The sub-problems in the text and their representative (key) phrases are used to create a formal context to which Formal Concept Analysis is applied in order to obtain formal concepts and a conceptual model. The mapping of sub-problems to their respective solution descriptions is also carried out. Phrase overlaps between sub-problem and solution texts are used to identify the solution description texts for each sub-problem. These mappings are later used to match the appropriate solution to each authored problem part in order to complete the creation of cases.
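As a rough illustration of how formal concepts arise from a formal context, the sketch below treats sub-problems as objects and key phrases as attributes, and enumerates every (extent, intent) pair by closing the object intents under intersection. This is a textbook brute-force construction suited to small contexts, not necessarily the algorithm used in Chapter 5; the context `CONTEXT` and the function name `formal_concepts` are hypothetical.

```python
# Hypothetical formal context: sub-problem -> set of key phrases.
CONTEXT = {
    "p1": {"dementia", "wandering"},
    "p2": {"dementia", "falls"},
    "p3": {"hearing impairment"},
}


def formal_concepts(context):
    """Return all formal concepts of `context` as (extent, intent) pairs.

    Every intent is either the full attribute set or an intersection
    of object intents, so the intents are generated first and the
    matching extent is then looked up for each of them.
    """
    all_attributes = frozenset().union(*context.values()) if context else frozenset()
    intents = {all_attributes}
    for attrs in context.values():
        # Intersect the new object intent with every intent found so far.
        intents |= {intent & frozenset(attrs) for intent in intents}

    concepts = []
    for intent in intents:
        # The extent holds every object that has all attributes of the intent.
        extent = frozenset(
            obj for obj, attrs in context.items() if intent <= attrs
        )
        concepts.append((extent, intent))
    # Order from most general (smallest intent) to most specific.
    return sorted(concepts, key=lambda c: (len(c[1]), sorted(c[1])))
```

On the toy context, the concept with intent `{"dementia"}` has extent `{"p1", "p2"}`, which is the kind of shared, more general node from which a conceptual model can be built.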

In Chapter 6, a case authoring tool, SmartCAT (Smart Case Authoring Tool), is presented. SmartCAT uses the sub-problem representations to obtain a conceptual model. It then uses the model to create cases and to organise the cases into a structure based on relationships between concepts in the conceptual model. However, the cases so obtained have a flat structure, which does not allow for comparison at different levels of problem abstraction.

Chapter 7 presents a description of the SmartCAT-T case authoring tool. SmartCAT-T changes the nature of the key phrase representations and consequently reduces dependencies between the various objects in the formal context. It then creates more redundancy in the resulting context in order to overcome multiple inheritance in the subsequent conceptual model. SmartCAT-T uses the tree-structured conceptual model to create hierarchically structured cases. The created cases can be compared at various levels of abstraction and are not prone to the effects of multiple inheritance, as are the cases created using SmartCAT.
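One way to picture comparison at various levels of abstraction is to count how far down a tree two concepts share a path from the root. The sketch below is only an illustration under that assumption: the tree, the concept names, and the function `shared_depth` are hypothetical, and the thesis may use a different measure.

```python
# Hypothetical tree-structured conceptual model (child -> parent).
PARENT = {
    "mobility problem": "problem",
    "sensory problem": "problem",
    "difficulty walking": "mobility problem",
    "difficulty climbing stairs": "mobility problem",
    "hearing impairment": "sensory problem",
}


def path_to_root(node, parent):
    """Return the list of nodes from `node` up to the root."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def shared_depth(a, b, parent):
    """Count the nodes shared by the root-to-node paths of `a` and `b`.

    A larger value means the two concepts agree at a more specific
    level of the hierarchy.
    """
    path_a = path_to_root(a, parent)[::-1]  # root first
    path_b = path_to_root(b, parent)[::-1]
    depth = 0
    for x, y in zip(path_a, path_b):
        if x != y:
            break
        depth += 1
    return depth
```

Because each node in a tree has exactly one parent, the root-to-node path is unique, which is what makes this kind of level-by-level comparison well defined.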

Chapter 8 presents an evaluation of the SmartCAT-T system. The quality of the case content is measured against original documents. An evaluation is also carried out to estimate the contribution of the various modules that comprise SmartCAT-T to the effectiveness of the tool. The effect of employing the case structure in retrieval is also evaluated against employing a flat structure. SmartCAT-T is also benchmarked against a high-standard Information Retrieval tool.

Chapter 9 investigates the extent to which the techniques developed in this research can be applied to other domains. This is achieved by subjecting the techniques to the domain of air and marine safety investigation and carrying out a short evaluation to determine the usefulness of the cases created by the SmartCAT-T approach, but using documents in this domain.

A summary of the contributions of this research and possible directions for future work are presented in Chapter 10.

1.6 Summary

This thesis will develop a tool that assists case authoring from semi-structured textual reports in the SmartHouse domain. Consequently a CBR tool will be developed to assist in matching the needs of the elderly and people with disabilities with appropriate SmartHouse technology. SmartHouse problem-solving experiences are used to create cases in order to populate the case-base. However, the reports were recorded as free-form textual semi-structured reports. This kind of text does not allow for effective case comparison in order to retrieve the most appropriate cases in a new problem situation. Therefore, a structured representation of the reports is required. This representation can be acquired by extracting knowledge from the textual reports and presenting it in a structured form.

Developing a CBR system when the knowledge is embedded in textual sources typically presents the so-called case acquisition bottleneck. Access to domain experts and time are important aspects that need to be addressed (Zaluski et al. 2003). Case-based reasoners that work with textual sources have addressed this problem in different ways. The means employed depend on the nature of the text and the (intended) application of the system. In this project, the term case authoring is used to refer to the identifying, eliciting, representing, and indexing of cases (Weber et al. 2000).

SmartHouse device recommendation is a form of decision support like that used in recommender systems and industrial troubleshooting. Case authoring involves steps such as the identification of a feature vocabulary and the possible values the features can have (feature-value pairs), feature organisation, case indexing and determination of relevant solutions for each problem structure. Organising feature vocabularies makes TCBR systems feasible and effective because the organisations can be used for traversal in search of new solutions and for comparison at different levels of problem specificity.

The next chapter presents a survey of the pertinent literature. It provides an evaluation of related work in the field and justifies the need for this research by identifying the gaps it will fill. It illustrates and gives examples of case representations where knowledge was originally available as text. The survey demonstrates that the methods used to elicit knowledge from text depend on the nature of the text (whether it is structured, semi-structured, or unstructured), the availability of additional knowledge, and the desired case structure. Finally, it makes the case for the chosen methodology in order to obtain the targeted case representation structure.

Chapter 2
