

Fully automated approaches, however, lack an essential part of qualitative research, that of interpretation. As different participants may use different terms or even phrases to refer to the same latent concept, an objectivist approach that relies purely on semantics will evidently fail to capture the relevant concepts.

In this chapter, we propose a partially automated approach which combines traditional content analysis techniques (Strauss and Corbin, 1998) with computational approaches to assess the semantic similarity between documents (Salton et al., 1975). In identifying the concepts that form the basis for computing the similarity between narratives, the proposed approach combines existing domain-specific knowledge with open-coding procedures that aim to identify constructs not captured by existing measurement models. Domain-specific knowledge, for instance within the field of user experience, may be found in psychometric scales measuring perceived product qualities and emotional responses to products. Thus, only concepts that are relevant to the given context are considered in computing the similarity between narratives. At the same time, the process is automated in that concepts identified in a small set of narratives may be used to assess the similarity among the full set of narratives. This results in an iterative process of coding and visualization of the obtained insights.

Next, we describe the application of a fully automated procedure that relies on a vector-space model (Salton et al., 1975), Latent Semantic Analysis (LSA), and motivate the proposed adaptations of this approach towards a semi-automated one.

6.2 Automated approaches to semantic classification

A number of automated approaches exist for the assessment of semantic similarity between documents (for an extensive review see Kaur and Hornof, 2005; Cohen and Widdows, 2009). These approaches rely on the principle that the semantic similarity between two documents relates to the degree of term co-occurrence in these documents (Deerwester et al., 1990). In this sense, every document may be characterized as an n-dimensional vector where each element of the vector depicts the number of times that a given term appears in the document. The similarity between documents may then be computed in a high-dimensional geometrical space defined by these vectors.
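
To make this representation concrete, a minimal Python sketch is given below; the toy documents and variable names are illustrative assumptions, not material from the study.

\begin{verbatim}
from collections import Counter

# Toy document pool (illustrative only, not data from the study).
documents = [
    "the camera feels intuitive and the menu is simple",
    "the menu is confusing but the camera is fast",
]

# Index all terms that appear anywhere in the pool.
vocabulary = sorted({term for doc in documents for term in doc.split()})

def term_vector(doc):
    """Represent a document as an n-dimensional vector of term counts."""
    counts = Counter(doc.split())
    return [counts[term] for term in vocabulary]

for doc in documents:
    print(term_vector(doc), "<-", doc)
\end{verbatim}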

Latent Semantic Analysis (LSA) (Deerwester et al., 1990), also known as Latent Semantic Indexing within the field of Information Retrieval, is one of the most popular vector-space approaches to semantic similarity measurement. It has been shown to reflect human semantic similarity judgments quite accurately (Landauer and Dumais, 1997) and has been successfully applied in a number of contexts, such as identifying navigation problems in web sites (Katsanos et al., 2008) and structuring and identifying trends in academic communities (Larsen et al., 2008a).

LSA starts by indexing all $n$ terms that appear in a pool of $m$ documents, and constructs an $n \times m$ matrix $A$ where each element $a_{i,j}$ depicts the number of times that term $i$ appears in document $j$. As matrix $A$ is high-dimensional and sparse, LSA employs Singular Value Decomposition (SVD) to reduce the dimensionality of the matrix and thus identify the principal latent dimensions in the data. Semantic similarity can then be computed in this reduced-dimensionality space, which depicts a latent semantic space. Below, we describe in detail the procedure as applied in this chapter.

6.2.1 LSA procedure

Term indexing

Term-indexing techniques may vary from simple "bag-of-words" approaches, which discard the syntactic structure of the document and only index the full list of words that appear in it, to natural-language algorithms that identify the part of speech, e.g. the probability that a term is a noun or a verb, in inferring the essence of a word (Berry et al., 1999). LSA typically discards syntactic information and treats each document as a pool of terms. However, it applies two pre-processing procedures in order to enhance the quality of the indexing.

Firstly, a number of words, called stop-words, such as prepositions, pronouns and conjunctions, are commonly found in documents yet carry no semantic information for the comprehension of the document theme (Fox, 1989). Such words are excluded from further analysis as they do not provide meaningful information and are likely to distort the similarity measure. We used the list of stop-words provided by Fox (1989).

Secondly, the remaining terms are reduced to their root words through stemming algorithms. For instance, the terms "usability" and "usable" are reduced to the term "usabl", thus allowing the indexing of multiple forms of a word under one dimension in the vector-space model. We employed Porter’s (1980) algorithm for stemming.
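
As a brief illustration of these two pre-processing steps, the sketch below uses NLTK's implementation of Porter's stemmer together with a tiny, purely illustrative stop-word list (the chapter itself uses the full list of Fox, 1989).

\begin{verbatim}
from nltk.stem import PorterStemmer   # assumes the nltk package is installed

# Tiny illustrative stop-word list; the chapter uses the full Fox (1989) list.
STOP_WORDS = {"the", "and", "is", "it", "its", "to", "a", "of", "but", "was", "me"}

stemmer = PorterStemmer()

def index_terms(document):
    """Bag-of-words indexing: lowercase, drop stop-words, stem the remaining terms."""
    tokens = document.lower().split()
    return [stemmer.stem(token) for token in tokens if token not in STOP_WORDS]

print(index_terms("The camera menu was usable and its usability surprised me"))
# -> ['camera', 'menu', 'usabl', 'usabl', 'surpris']
#    'usable' and 'usability' collapse to the single stem 'usabl'.
\end{verbatim}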

Normalizing impact of terms

The first step in the procedure has resulted in an $n \times m$ matrix $A$ where each element $a_{i,j}$ depicts the number of times that the stemmed term $i$ appears in document $j$. The frequencies of different terms will vary substantially across documents. This introduces an undesired bias: terms that are frequent across a large set of documents receive higher weight than terms that appear in only a few documents, even though such widespread terms have limited discriminatory power and are thus not very informative. One term-weighting criterion that counterbalances this inherent bias is the term-frequency inverse-document-frequency (TFIDF) (Salton and Buckley, 1988):

\[
a^{\mathrm{weighted}}_{i,j} = a_{i,j} \cdot \log(nDocs / nDocs_i) \tag{6.1}
\]

which weights the frequency $a_{i,j}$ by the logarithm of the ratio of the total number of documents $nDocs$ to the number of documents $nDocs_i$ in which term $i$ appears. Thus, terms that appear in a large number of documents, and therefore have little discriminatory power, receive a lower weight in the final matrix.
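
A minimal NumPy sketch of this weighting follows; the toy matrix and the function name are illustrative assumptions, not part of the original analysis.

\begin{verbatim}
import numpy as np

def tfidf_weight(A):
    """Apply the weighting of Eq. 6.1 to an n x m term-document count matrix A."""
    n_docs = A.shape[1]
    # Number of documents in which each term appears at least once.
    n_docs_i = np.count_nonzero(A > 0, axis=1)
    idf = np.log(n_docs / np.maximum(n_docs_i, 1))   # guard against empty rows
    return A * idf[:, np.newaxis]

# Toy 3-term x 4-document matrix (illustrative counts only).
A = np.array([[2, 1, 1, 3],    # appears in all documents -> weight log(4/4) = 0
              [0, 4, 0, 0],    # appears in one document  -> weight log(4/1)
              [1, 0, 2, 0]])   # appears in two documents -> weight log(4/2)
print(tfidf_weight(A))
\end{verbatim}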

Dimensionality reduction

Matrix $A$ is sparse and high-dimensional. Moreover, certain groups of terms may display similar distributions across the different documents, thus reflecting a single latent variable. LSA attempts to approximate $A$ by a matrix of lower rank. Singular Value Decomposition is used to decompose matrix $A$ into three matrices $U$, $S$, $V$ such that

\[
A = U S V^{T} \tag{6.2}
\]

Matrices $U$ and $V$ are orthonormal and $S$ is a diagonal matrix that contains the singular values of $A$. The singular values are ordered by decreasing size in $S$; thus, by keeping only the first $k \times k$ submatrix of $S$, we approximate $A$ by its best-fitting matrix of rank $k$:

\[
A_k = U_{n \times k} \, S_{k \times k} \, V_{m \times k}^{T} \tag{6.3}
\]
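
A brief sketch of this dimensionality-reduction step using NumPy's SVD routine is given below; the random matrix stands in for a real weighted term-document matrix, and the function name is an assumption of the sketch.

\begin{verbatim}
import numpy as np

def rank_k_approximation(A, k):
    """Approximate A by its best rank-k fit via truncated SVD (Eqs. 6.2-6.3)."""
    # Singular values are returned in decreasing order, as required.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
    return A_k, U[:, :k], s[:k], Vt[:k, :]

# Usage on a random stand-in for the weighted n x m term-document matrix.
A_weighted = np.random.rand(50, 20)
A_k, U_k, s_k, Vt_k = rank_k_approximation(A_weighted, k=5)
print(np.linalg.matrix_rank(A_k))   # 5
\end{verbatim}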

Computing document similarity

The similarity between different documents or different terms may then be computed on the reduced-dimensionality approximation of $A$. Matrices 6.4 and 6.5 constitute the $m \times m$ and $n \times n$ covariance matrices for the documents and the terms, respectively. The proximity matrices for documents and terms are then derived by transforming 6.4 and 6.5 into correlation matrices.

\[
S_R = A_k^{T} A_k = V_{m \times k} \, S_{k \times k}^{2} \, V_{m \times k}^{T} \tag{6.4}
\]
\[
A_k A_k^{T} = U_{n \times k} \, S_{k \times k}^{2} \, U_{n \times k}^{T} \tag{6.5}
\]

Each element $s_{i,j}$ represents the similarity between documents (or terms) $i$ and $j$.

The proximity matrix is normalized to the range (0, 1) and transformed into a distance matrix with each element $d_{i,j} = 1 - |s_{i,j}|$.
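
The sketch below puts these last two steps together: it derives the document covariance matrix of Eq. 6.4 from the truncated SVD factors, converts it to a correlation (proximity) matrix, and applies the distance transformation. The exact normalization details and the random stand-in data are assumptions of this sketch.

\begin{verbatim}
import numpy as np

def document_distances(s_k, Vt_k):
    """Document-by-document distance matrix from the rank-k LSA factors."""
    V_k = Vt_k.T                                   # m x k document factors
    cov = V_k @ np.diag(s_k ** 2) @ V_k.T          # A_k^T A_k = V S^2 V^T (Eq. 6.4)
    std = np.sqrt(np.diag(cov))
    proximity = cov / np.outer(std, std)           # correlation (proximity) matrix
    return 1.0 - np.abs(proximity)                 # d_ij = 1 - |s_ij|

# Self-contained usage on a random stand-in for the weighted term-document matrix.
A = np.random.rand(50, 20)                         # 50 terms x 20 documents
U, s, Vt = np.linalg.svd(A, full_matrices=False)
D = document_distances(s[:5], Vt[:5, :])           # k = 5 latent dimensions
print(D.shape)                                     # (20, 20): one distance per document pair
\end{verbatim}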

6.2.2 Limitations of LSA in the context of qualitative content analysis

Latent Semantic Analysis has been shown to adequately approximate human judgments of semantic similarity in a number of contexts (Landauer et al., 2003; Katsanos et al., 2008; Larsen et al., 2008a). However, one may expect a number of drawbacks when compared to traditional content analysis procedures as applied by researchers. First, LSA assumes homogeneity in the style of writing across documents; thus, the extent to which different words occur in one document rather than another denotes a difference in content between the two documents. This assumption has been shown to hold in contexts of formal writing such as web pages (Katsanos et al., 2008) or abstracts of academic papers (Larsen et al., 2008a), but it is not expected to hold for qualitative research data such as interview transcripts or self-provided essays in diary studies, as the vocabulary and verbosity of documents may vary substantially across participants.

Second, LSA computes the similarity between documents based on the co-occurrence of all terms that appear in the pool of documents. In the analysis of qualitative data, however, one is interested only in a small set of words that refer to the phenomenon under study. As a result, words that are of minimal interest to the researchers may overshadow the semantic relations that the researchers seek to identify.

Third, LSA lacks an essential part of qualitative research, that of interpretation. As different participants may use different terms or even phrases to refer to the same latent concept, an objectivist approach that relies purely on semantics will evidently fail to capture the relevant concepts. Ideally, automated vector-space models would be applied to meta-data resulting from qualitative open-coding procedures (Strauss and Corbin, 1998). In the next section we propose such a semi-automated approach to semantic classification.