3.2 Latent Semantic Analysis
3.2.2 Introduction to Singular Value Decomposition
Singular Value Decomposition (SVD) is a method from the field of linear algebra. The purpose of SVD is to diagonalize any t × d matrix A. The diagonalization corresponds to a transition to a new coordinate system (Lang and Pucker, 1998). This transition brings forth the latent semantic structure of a document set.
To explain the effect of SVD, I will use an example from image compression, where SVD is used to optimize the relation between image quality and file size. Figure 3.2 shows pictures of a clown with different quality.
Figure 3.2: SVD in image compression: View the m × n image as a matrix. The rank of
this matrix was reduced from r = 200 to k = 1, 2, 5, 15, 50. Hardly any difference is visible between the rank k = 50 approximation of the image and the original, but the file size is reduced from m × n to k(m + n). (Source: Persson (2007))
Any picture can be seen as a matrix, where each cell contains a number that corresponds to a colour or a grayscale value. The upper left image in figure 3.2 shows the original picture, a matrix of rank r = 200. The rank (r) of a matrix is the number of linearly independent rows, which always equals the number of linearly independent columns. SVD is used to reduce the rank and thereby the file size of the image. If the rank is reduced to k = 1 or k = 2 the image of the clown is not recognizable. The clown can be recognized in the rank 15 (k = 15) approximation but the image is blurred. At rank k = 50 hardly any difference between the approximation and the original image can be detected, but the file size is reduced from m × n to k(m + n).
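The rank reduction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses NumPy, the function name `rank_k_approximation` is hypothetical, and a random matrix stands in for the image (it is not the clown picture, so the errors it prints say nothing about figure 3.2).

```python
import numpy as np

def rank_k_approximation(image, k):
    """Return a rank-k approximation of a 2-D array via truncated SVD."""
    # Compact SVD: U is m x r, s holds r singular values (descending), Vt is r x n
    U, s, Vt = np.linalg.svd(image, full_matrices=False)
    # Keep the k largest singular values; scaling the columns of U by s folds
    # the singular values into the left factor
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

# Synthetic stand-in for a grayscale image (assumption, not real image data)
rng = np.random.default_rng(0)
m, n = 60, 40
image = rng.random((m, n))

for k in (1, 5, 20, 40):
    approx = rank_k_approximation(image, k)
    error = np.linalg.norm(image - approx)   # Frobenius norm of the residual
    stored = k * (m + n)                     # numbers held by the two factors
    print(f"k={k:2d}  error={error:.3f}  stored={stored}  original={m * n}")
```

The count k(m + n) assumes that the singular values are folded into one of the two factors, as the sketch does. The saving only pays off while k(m + n) is smaller than m × n.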
The clown can be recognized because SVD emphasizes the most essential features and information while unimportant details are suppressed. On the highest level of abstraction (rank-1 approximation) only the very basic structure of the image is depicted. SVD ranks the features by importance for the image. By reducing the rank to k, only the first k features are kept.
One aims to find the optimal rank approximation where all and only the important information is shown. If the important features are not all captured the picture cannot be recognized, whereas if too many features are kept the data structure is unnecessarily large.
In the example in section 3.2.1 SVD finds concepts and relations between terms in the term-by-document matrix B and ranks them by importance. Only the k most important concepts are kept, and the term similarity is calculated on the basis of the reduced matrix. The benefit of this reduction to the optimal rank-k approximation is that the term similarity calculation is only based on the most characteristic features of the document collection at hand. The noise that blurs the clear view on the hidden relations is suppressed.
Mathematically speaking the characteristic concepts of a term-by-document matrix are its eigenvectors. First I will explain Eigenvalue Decomposition (EVD) for square matrices from which SVD for rectangular matrices is derived. The goal of EVD is to find eigenvectors $\vec{x}$ that point in the same direction as $A\vec{x}$, i.e., vectors that satisfy equation 3.6.
$$A\vec{x} = \lambda\vec{x} \qquad (3.6)$$
Here λ is an eigenvalue, which determines the scaling of the corresponding eigenvector $\vec{x}$. For example the eigenvectors for the following matrix C of rank 3 are $\vec{x}_1$, $\vec{x}_2$ and $\vec{x}_3$:
$$C = \begin{pmatrix} 2 & 0 & 0 \\ 0 & 9 & 0 \\ 0 & 0 & 4 \end{pmatrix} \;\Rightarrow\; \vec{x}_1 = \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix},\; \vec{x}_2 = \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix},\; \vec{x}_3 = \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix}, \qquad \lambda_1 = 9,\; \lambda_2 = 4,\; \lambda_3 = 2$$
In the case of a diagonal matrix the eigenvectors are the canonical unit vectors, i.e., the vectors spanning the coordinate system. Equation 3.6 can be solved by subtracting $\lambda\vec{x}$ from both sides to obtain:
$$(A - \lambda I)\vec{x} = 0 \qquad (3.7)$$
Here I is the unit matrix, a diagonal matrix whose main diagonal consists only of ones. If this equation has a non-trivial solution, then $B = A - \lambda I$ is not invertible, which means there is no $B^{-1} = (A - \lambda I)^{-1}$ that fulfils $B^{-1}B = BB^{-1} = I$. From that it follows that the determinant of $(A - \lambda I)$ has to be 0:
$$\det(A - \lambda I) = 0 \qquad (3.8)$$
For a detailed derivation of this transformation see Strang (2003). With this equation the eigenvalues λ can be calculated, since $\det(A - \lambda I)$ is a polynomial of r-th order in λ.
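As a small numerical check of the determinant condition, the following Python sketch computes the characteristic polynomial of the example matrix C and recovers its eigenvalues as the roots. It assumes NumPy; `np.poly` applied to a square matrix returns the coefficients of det(λI − C), which has the same roots as det(C − λI).

```python
import numpy as np

# Example matrix C from the text
C = np.diag([2.0, 9.0, 4.0])

# Coefficients of the characteristic polynomial, highest power first:
# (lambda - 2)(lambda - 9)(lambda - 4) = lambda^3 - 15 lambda^2 + 62 lambda - 72
coeffs = np.poly(C)

# The eigenvalues are the roots of that polynomial; they are real for this
# symmetric matrix, so only the real part is kept before sorting
eigenvalues = np.sort(np.roots(coeffs).real)[::-1]

for lam in eigenvalues:
    # det(C - lambda I) vanishes (up to rounding) at every eigenvalue
    assert np.isclose(np.linalg.det(C - lam * np.eye(3)), 0.0)

print(coeffs)
print(eigenvalues)
```

For larger matrices, root finding on the characteristic polynomial is numerically fragile, so libraries normally compute eigenvalues by other means (for example `np.linalg.eig`). The sketch only mirrors the derivation in the text.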
The procedure described here is called Eigenvalue Decomposition (EVD). It can only be applied to certain classes of square matrices. In IR most term-by-document matrices are rectangular, hence the generalization for rectangular matrices, Singular Value Decomposition (SVD), is used. SVD and EVD are related. EVD decomposes a square matrix C into two submatrices Q and Λ, where the columns of Q are the eigenvectors and the eigenvalues are listed in descending order on the diagonal of Λ:
$$C = Q\Lambda Q^{-1} \qquad (3.9)$$
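The decomposition of a square matrix into eigenvectors and eigenvalues can be tried out with the following Python sketch. It assumes NumPy, and the 2 × 2 symmetric matrix is a hypothetical example chosen so that the eigenvectors are not simply the unit vectors.

```python
import numpy as np

# Hypothetical symmetric matrix (not taken from the text)
C = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# Columns of Q are the eigenvectors of C
eigenvalues, Q = np.linalg.eig(C)

# Reorder so that the eigenvalues are listed in descending order
order = np.argsort(eigenvalues)[::-1]
eigenvalues, Q = eigenvalues[order], Q[:, order]
Lambda = np.diag(eigenvalues)

# Multiplying the factors back together recovers C
reconstructed = Q @ Lambda @ np.linalg.inv(Q)

print(eigenvalues)
print(np.allclose(reconstructed, C))
```

Each column of Q satisfies the eigenvector equation, which in matrix form reads CQ = QΛ.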
In contrast to a square matrix, a rectangular matrix has two sets of eigenvectors, the right singular vectors and the left singular vectors. SVD decomposes any rectangular t × d matrix A into three submatrices T, S and D (figure 3.3). The left singular vectors are represented by T, the right singular vectors by D.
$$\underset{t \times d}{A} = \underset{t \times r}{T}\;\underset{r \times r}{S}\;\underset{r \times d}{D^{T}}$$
Figure 3.3: Singular Value Decomposition: A is a t × d matrix, where t is the number of index terms, d the number of documents indexed, and r the rank of the matrix A.
Any rectangular matrix A is squared by multiplying it by $A^T$. The eigenvectors of $A_T = AA^T$ are the left singular vectors of A and the eigenvectors of $A_D = A^TA$ are the right singular vectors of A. The eigenvectors and the eigenvalues for these auxiliary matrices can be calculated by EVD as described above.
Singular values are the square roots of the common eigenvalues of $A_T$ and $A_D$ and are written in descending order in S. The eigenvectors in T and D are ordered correspondingly.
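The relation between SVD and the eigenvalues of the two auxiliary square matrices can be checked numerically. The Python sketch below assumes NumPy and uses a hypothetical 4 × 3 term-by-document matrix. The names `eig_AT` and `eig_AD` are chosen here to mirror the two auxiliary matrices.

```python
import numpy as np

# Hypothetical term-by-document matrix: 4 terms (rows), 3 documents (columns)
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 2.0]])

# Compact SVD: A = T @ diag(s) @ Dt, singular values in descending order
T, s, Dt = np.linalg.svd(A, full_matrices=False)
D = Dt.T

# Eigenvalues of the symmetric auxiliary matrices; eigvalsh returns them in
# ascending order, so they are reversed here
eig_AT = np.linalg.eigvalsh(A @ A.T)[::-1]   # t x t matrix
eig_AD = np.linalg.eigvalsh(A.T @ A)[::-1]   # d x d matrix

r = np.linalg.matrix_rank(A)

# The squared singular values match the r largest eigenvalues of both matrices
print(np.allclose(s**2, eig_AT[:r]), np.allclose(s**2, eig_AD[:r]))

# Columns of T and D are eigenvectors of A A^T and A^T A respectively
print(np.allclose((A @ A.T) @ T, T * s**2), np.allclose((A.T @ A) @ D, D * s**2))
```

The larger of the two auxiliary matrices has additional eigenvalues that are zero up to rounding. These do not correspond to singular values of A.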
Only when the term-by-document matrix is decomposed into these three submatrices is it possible to reduce the number of dimensions of the semantic space and thereby the number of concepts (or features). In that case only the first k singular values in S and the corresponding vectors in T and D are kept. This number of remaining dimensions (k) is a crucial value for the performance of any LSA-based application. If too many dimensions are kept, the latent semantic structure cannot be revealed because the documents and words are not projected near enough to each other and too much noise is left. If k is too small then too many words and/or documents will be superimposed on one another, destroying the latent semantic structure.
$$\underset{t \times d}{A_k} = \underset{t \times k}{T_k}\;\underset{k \times k}{S_k}\;\underset{k \times d}{D_k^{T}}$$
Figure 3.4: Reduced Singular Value Decomposition: A is a t × d matrix, where t is the number of index terms, d the number of documents indexed, r the rank of the matrix A, and k the number of dimensions kept.
for a document collection. The derived concepts or topics of the document collection are depicted in $D_k$, and the word distribution patterns in $T_k$. In spatial terms the rows of the matrices $T_k$ and $D_k$ are the coordinates of points representing the terms and documents in reduced k-dimensional space. The matrix $S_k$ is used to rescale the axes in order to be able to compare different objects to each other.
Depending on the type of similarity calculation required, the submatrices are multiplied with S. For term-to-term similarity calculation the vectors of the matrix $C_{T_k} = T_kS_k$ are used. To compare documents with each other the distances between the vectors of the matrix $C_{D_k} = D_kS_k$ are calculated. If the task is to compare a term to a document the vectors of the matrices $C_{TD1_k} = T_kS_k^{1/2}$ and $C_{TD2_k} = D_kS_k^{1/2}$ are used to calculate the cosine similarity. For my experiments I use the scaled space $C_{D_k} = D_kS_k$. Since I explore the potential of LSA in the field of sentence clustering for MDS, I will call this space the clustering space. With each k (number of remaining dimensions) a different clustering space is formed.
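How the scaled spaces are built from a truncated SVD can be sketched as follows. The example is a minimal illustration, not the setup used in the experiments: it assumes NumPy, a hypothetical 5 × 4 term-by-sentence matrix and k = 2, and the helper `cosine_similarity_matrix` is a name introduced here.

```python
import numpy as np

def cosine_similarity_matrix(X):
    """Pairwise cosine similarity between the rows of X (rows must be non-zero)."""
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    return unit @ unit.T

# Hypothetical term-by-sentence matrix: 5 terms (rows), 4 sentences (columns)
A = np.array([[1.0, 1.0, 0.0, 0.0],
              [1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 1.0],
              [0.0, 0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0, 0.0]])
k = 2  # number of dimensions kept (assumed value)

# Truncated SVD: keep the first k singular values and vectors
T, s, Dt = np.linalg.svd(A, full_matrices=False)
T_k, S_k, D_k = T[:, :k], np.diag(s[:k]), Dt[:k, :].T

C_T = T_k @ S_k   # one row per term, used for term-to-term similarity
C_D = D_k @ S_k   # one row per sentence, the clustering space

# Spaces for term-to-document comparison; S_k is diagonal, so the element-wise
# square root equals the matrix square root
C_TD1 = T_k @ np.sqrt(S_k)
C_TD2 = D_k @ np.sqrt(S_k)

term_similarity = cosine_similarity_matrix(C_T)
sentence_similarity = cosine_similarity_matrix(C_D)
print(np.round(sentence_similarity, 2))
```

Repeating the last steps with a different k yields a different clustering space, so in practice k is a parameter to be tuned.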
Yu et al. (2002) summarize the advantages of LSA as follows:
“The SVD algorithm preserves as much information as possible about the relative distances between the document vectors, while collapsing them down into a much smaller set of dimensions. In this collapse, information is lost, and content words are superimposed on one another. Information loss sounds like a bad thing, but here it is a blessing. What we are losing is noise from our original term-document matrix, revealing similarities that were latent in the document collection. Similar things become more similar, while dissimilar things remain distinct. This reductive mapping is what gives LSI its seemingly intelligent behaviour of being able to correlate semantically related terms. We are really exploiting a property of natural language, namely that words with similar meaning tend to occur together.”
In contrast to the VSM the dimensions of the vector, which represents a sentence, do not correspond to a term but rather to a concept or a word usage pattern. Thus the similarity calculation for sentence clustering using LSA is based not only on word matching but also on latent semantic relations of the terms.