This section presents a general discussion of data preprocessing in the context of data mining. Specific details of the preprocessing of the evaluation data sets used in the work described in this thesis are presented in later sections. Although the raw questionnaire data from which knowledge is to be extracted is expected to be in an electronic format (transcribed and/or collected electronically), it still needs to be preprocessed so that it can be translated into a format that permits the application of classification techniques. Raw data typically contains noisy elements such as corrupted, redundant or incomplete data (data that features missing values), and data not relevant to the knowledge extraction process. Data preprocessing is the first step of the KDD process [41]. It can be time-consuming, but is of great importance for improving the quality of the data input to the data mining step. In the context of questionnaire data, and assuming that both the tabular data and the free text elements are of interest, the two need to be preprocessed and represented separately. The following two subsections present a general description of how preprocessing is typically carried out with respect to both tabular and free text data.
3.2.1 Tabular Data
Although this thesis is focused on summarising the free text element of questionnaires, the preprocessing of the tabular element is also considered (although to a lesser extent) in this chapter because, in addition to often being present in questionnaires, it can provide additional valuable information with respect to the text summarisation process that could enrich the free text to be summarised: (i) by including relevant tabular attributes as words in the free text during the classification process and (ii) by using the output of mining tabular data in conjunction with the output of the free text classification. As in the case of tabular data in general, once collected, tabular questionnaire data is typically stored electronically; in some cases bespoke storage systems are used which may have some built-in data analysis functionality. For data mining purposes it is usually necessary to transform the data into a format compatible with the data mining software to be adopted. Common formats include CSV, SSV, XML, TAB and ARFF, amongst others. Han et al. [41] consider that tabular data preprocessing includes techniques such as:
• Data cleaning: Removal of noise and fixing of inconsistencies.
• Data integration: Integration of data from different sources.
• Data transformation: Transformation of data into more suitable formats for the data mining process.
• Data reduction (Feature selection): Reduction of the size of the data set, taking into account the most important information, so as to decrease the overall anticipated computational cost.
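The cleaning, transformation and reduction steps listed above can be illustrated with a minimal sketch. It is not the procedure used in this thesis: the CSV content, the column names and the rule of discarding attributes that take a single value are all assumptions made for illustration, and data integration is omitted because only one source is involved.

```python
import csv
import io

# Hypothetical questionnaire export; column names and values are invented.
RAW_CSV = """respondent,age,satisfaction,branch
1,34,high,Leeds
2,,low,Leeds
1,34,high,Leeds
3,51,LOW,Leeds
"""


def preprocess(raw_csv):
    rows = list(csv.DictReader(io.StringIO(raw_csv)))

    # Data cleaning: drop exact duplicates and records with missing values.
    seen, cleaned = set(), []
    for row in rows:
        key = tuple(row.items())
        if key in seen or any(value == "" for value in row.values()):
            continue
        seen.add(key)
        cleaned.append(row)
    if not cleaned:
        return []

    # Data transformation: normalise case and convert age to an integer.
    for row in cleaned:
        row["satisfaction"] = row["satisfaction"].lower()
        row["age"] = int(row["age"])

    # Data reduction (assumed rule): drop attributes that take a single
    # value across all records, since they cannot separate the classes.
    constant = [name for name in cleaned[0]
                if len({row[name] for row in cleaned}) == 1]
    return [{k: v for k, v in row.items() if k not in constant}
            for row in cleaned]


records = preprocess(RAW_CSV)
```

In this toy input the record with a missing age and the repeated record are discarded, and the branch attribute is removed because every remaining record shares the same value.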
3.2.2 Free Text
There are many ways to represent free text documents for data mining purposes; two representations that frequently appear in the literature are: (i) the Vector Space Model (VSM) and (ii) Latent Semantic Indexing (LSI) [23]. In the VSM, free text documents are represented as vectors (one per document); in this context the terms contained in the vectors can be words or phrases. Where words are used this is referred to as the bag-of-words representation; where phrases are used this is referred to as the bag-of-phrases representation. More specifically, the bag-of-words representation refers to an unordered representation of a text document based on the single words that appear in it. In the bag-of-phrases representation, on the other hand, a text document is represented by the phrases that appear in it. LSI creates an index of the words contained in a text document and makes use of Singular Value Decomposition (SVD) to reduce the dimensionality of the data by generating a mapping of the relationships between words and documents. Words and documents that are closely related to each other are placed near one another in this mapping. SVD has been used extensively in the area of statistics and in this context is used to identify how strongly the words and the documents are related to each other, focusing on those that have a close association. Words in a query are then used to retrieve the related documents based on how near they are to the location represented in the mapping.
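The VSM described above can be sketched in a few lines. This is a minimal illustration only: the two example documents are invented, raw term counts are used as vector entries, and a "phrase" is assumed to be any run of two consecutive words, which is just one of several possible definitions.

```python
from collections import Counter


def tokens(text):
    # Naive whitespace tokenisation on lower-cased text.
    return text.lower().split()


def phrases(text, n=2):
    # Assumption: a phrase is any run of n consecutive words.
    words = tokens(text)
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def vectorise(documents, extract):
    # Build a shared vocabulary, then one term-count vector per document.
    bags = [Counter(extract(doc)) for doc in documents]
    vocabulary = sorted(set().union(*bags))
    return vocabulary, [[bag[term] for term in vocabulary] for bag in bags]


docs = ["staff were helpful", "staff were rude"]  # invented examples
word_vocab, word_vectors = vectorise(docs, tokens)
phrase_vocab, phrase_vectors = vectorise(docs, phrases)
```

Each document becomes one vector over the shared vocabulary; passing `tokens` gives a bag-of-words representation and passing `phrases` gives a bag-of-phrases representation.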
The free text contained in questionnaire data can be viewed as a special form of free text in that it features certain characteristics that are typically not found in more standard forms of free text. More specifically, questionnaire free text: (i) tends to be unstructured, (ii) frequently includes misspelled words, (iii) features poor grammar, and (iv) often includes abbreviations and acronyms specific to the domain. As such, questionnaire free text has similarities with e-mail correspondence, mobile phone texts and tweets, as opposed to the free text found in (say) newspaper articles or document collections. The advantage of analysing the free text element of questionnaires, as already noted, is that this free text tends to be much more informative with regard to respondent opinions than the tabular element of questionnaire data.
The bag-of-words representation was adopted to represent the textual data in the research described in this thesis. The reasons for using this representation were directly related to the nature of questionnaire free text (lack of structure, misspelled words, poor grammar, abbreviations and acronyms), whereby it was not viable to use a representation based on syntax or semantics such as the bag-of-phrases representation or LSI. The main disadvantage of the bag-of-words representation is that the relationship (ordering) between the words is lost. However, this was not considered significant given the nature of the questionnaire free text under consideration.
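The loss of word ordering can be seen in a small sketch. The two sentences are invented for illustration; they differ in meaning but produce identical bags of words.

```python
from collections import Counter


def bag_of_words(text):
    # Unordered multiset of lower-cased, whitespace-separated words.
    return Counter(text.lower().split())


a = bag_of_words("the nurse ignored the patient")
b = bag_of_words("the patient ignored the nurse")
same_bag = a == b  # True: the representation cannot tell the two apart
```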
Text preprocessing tends to adopt some of the tabular data preprocessing techniques together with some additional techniques directed specifically at textual data. Miner et al. [74] present a list of basic text preprocessing steps:
1. Scope of documents selection: Deciding whether to use an entire free text document collection, or just some part of it that is considered to be most relevant to a specific text mining application.
2. Tokenization: Breaking the text down into single words or tokens, typically using white spaces and punctuation symbols as delimiters.
3. Stop word removal: Stop words are common words that are considered to be not significant given a desired application (for example articles such as “the” or “a” are often considered to be stop words) and are thus frequently removed.
4. Stemming: Stemming refers to processing text so that, where appropriate, words are reduced to their stem base or root by removing prefixes and suffixes; for example the words “operated”, “operating” and “operates” share the same stem base “operat”.
5. Spelling normalisation: The process of automatically correcting misspelled words by using dictionary-based approaches, fuzzy matching algorithms or word clustering.
6. Sentence boundary detection: Breaking down the text into sentences.
7. Case normalisation: Homogenising the given text, which is typically written in mixed case, into lower or upper case.
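Most of the steps listed above can be chained into a small pipeline, sketched below under stated assumptions. The stop word list, the dictionary used for spelling normalisation and the suffix list are invented for illustration; the stemmer is a crude suffix stripper rather than an established algorithm, the 0.8 similarity cutoff is arbitrary, and scope selection is omitted because it is a decision about which documents to use rather than a transformation of the text.

```python
import difflib
import re

# All three resources below are assumed, illustrative values.
STOP_WORDS = {"the", "a", "an", "and", "was", "were", "is", "to", "of"}
DICTIONARY = ["staff", "helpful", "operated", "operating", "machine", "quickly"]
SUFFIXES = ("ing", "ed", "es", "s")  # crude suffix list, not a real stemmer


def split_sentences(text):
    # Sentence boundary detection: split after ., ! or ? followed by space.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    # Tokenization: white space and punctuation act as delimiters.
    return re.findall(r"[A-Za-z]+", sentence)


def correct(word):
    # Spelling normalisation: fuzzy match against the dictionary.
    if word in DICTIONARY or word in STOP_WORDS:
        return word
    match = difflib.get_close_matches(word, DICTIONARY, n=1, cutoff=0.8)
    return match[0] if match else word


def stem(word):
    # Stemming: strip the first matching suffix, keeping at least 3 letters.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def preprocess(text):
    result = []
    for sentence in split_sentences(text):
        words = [w.lower() for w in tokenize(sentence)]  # case normalisation
        words = [correct(w) for w in words]
        words = [w for w in words if w not in STOP_WORDS]  # stop word removal
        result.append([stem(w) for w in words])
    return result


processed = preprocess("The staff were helpfull. The machine operated quickly!")
```

The order of the steps is a design choice made for this sketch: spelling is corrected before stop word removal and stemming so that misspelled words are matched against whole dictionary entries rather than against stems.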
The specific preprocessing associated with each of the data sets used for evaluation purposes in this thesis is discussed in the following three subsections.