Algorithm for Textual Data Visualization
Ángela Blanco
Universidad Pontificia de Salamanca [email protected]
Spain
Manuel Martín-Merino
Universidad Pontificia de Salamanca [email protected]
Spain
Contents
1. Introduction
2. The Torgerson Multidimensional Scaling Algorithm
3. A Semi-supervised Multidimensional Scaling Algorithm
4. Experimental results
5. Conclusions and future research trends
Introduction (I)
The Torgerson MDS algorithm is a popular visualization technique that helps to discover the underlying structure of high dimensional data.
An interesting application is the visualization of the semantic relations among terms or documents in textual databases.
However, the Torgerson MDS algorithm proposed in the literature suffers from a low discriminant power due to:
• The unsupervised nature.
• The ‘curse of dimensionality’.
Introduction (II)
Several search engines provide a categorization for a subset of documents.
Problem overview
[Diagram: terms t1, ..., tn in the space of terms (R^d); space of documents (R^n) grouped into semantic classes C1, ..., Ck; relation between terms and documents; terms are usually not categorized; Torgerson MDS map f.]
Goal: To generate a visual representation of term relationships
taking advantage of the document class labels.
Introduction (III)
Our approach:
Define a semi-supervised similarity between terms that considers the document class labels.
• It should reflect whether two terms are related to the same semantic topics.
• It should reflect the semantic proximities between terms.
Incorporate the semi-supervised similarity into the Torgerson MDS algorithm. This will preserve the nice properties of the optimization problem.
Torgerson MDS Algorithm (I)
The Torgerson MDS algorithm looks for an object configuration in a low dimensional space such that the interpattern distances are approximately preserved.
Properties for text mining problems:
It is based on an efficient linear algebraic operation (SVD).
The optimization problem does not have local minima.
For certain similarities it is equivalent to LSI.
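The properties above can be illustrated with a minimal sketch of classical (Torgerson) MDS. It assumes the input is a symmetric matrix of dissimilarities and uses a symmetric eigendecomposition (`numpy.linalg.eigh`) in place of the SVD mentioned on the slide; for the symmetric, double-centered matrix used here the two give the same configuration up to sign. The function name `torgerson_mds` and the toy data are illustrative, not taken from the original work.

```python
import numpy as np

def torgerson_mds(D, n_components=2):
    """Classical MDS: embed objects so that pairwise distances approximate D."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                # double-centered inner products
    eigvals, eigvecs = np.linalg.eigh(B)       # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    top_vals = np.clip(eigvals[order], 0.0, None)  # guard against tiny negatives
    return eigvecs[:, order] * np.sqrt(top_vals)

# Toy example: four points at the corners of a unit square.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
Y = torgerson_mds(D, n_components=2)
D_map = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
```

Because the solution comes from a single eigendecomposition, there is no iterative search and hence no local minima. In the toy example the recovered distances `D_map` match `D` up to numerical precision, since the data are already two dimensional.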
Torgerson MDS Algorithm (II)
Drawbacks:
Low discriminant power: Due to the unsupervised nature, different topics in the textual collection overlap significantly in the word map.
It is affected by the ‘curse of dimensionality’.
Semi-supervised MDS algorithm (I)
Goal: To improve the discriminant power of the Torgerson MDS algorithm, which works in the space of terms, by taking into account a classification available in the space of documents.
The association between the terms ($t_i$) and the document class labels ($C_k$) is evaluated by the Mutual Information $I_0(t_i; C_k)$.
A supervised measure is defined that becomes large for terms that are correlated with the same categories:
$$
s_1(t_i, t_j) = \frac{\sum_k I_0(t_i; C_k)\, I_0(t_j; C_k)}{\sqrt{\sum_k \left(I_0(t_i; C_k)\right)^2}\,\sqrt{\sum_k \left(I_0(t_j; C_k)\right)^2}} \qquad (1)
$$
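Equation (1) is the cosine between the vectors of term-category associations. The sketch below computes it for all term pairs at once, assuming the associations have already been estimated and stored in a matrix `M` with one row per term and one column per category. How the Mutual Information is estimated is not specified on the slide, so the values in `M` are made up for illustration, and the handling of all-zero rows is an assumption.

```python
import numpy as np

def supervised_similarity(M):
    """Return the matrix of s_1(t_i, t_j) values from equation (1).

    M[i, k] holds the association between term i and category k.
    """
    M = np.asarray(M, dtype=float)
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    norms[norms == 0.0] = 1.0      # assumption: all-zero rows get similarity 0
    U = M / norms                  # unit-length association vectors
    return U @ U.T                 # cosine between every pair of rows

# Hypothetical associations for three terms and two categories.
M = np.array([[2.0, 0.0],
              [4.0, 0.0],
              [0.0, 3.0]])
S1 = supervised_similarity(M)
```

Terms 0 and 1 are associated only with the first category, so their similarity is 1, while term 2 is associated only with the second category and has similarity 0 with both.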
Semi-supervised MDS Algorithm (II)
The supervised measure reflects only the semantic categories of the textual collection, not the term relationships, which are the interesting part for visualization purposes.
Therefore, a semi-supervised similarity should be defined that reflects both the semantic categories and the term relationships inside each class.
$$
s(t_i, t_j) = \lambda\, s_{sup}(t_i, t_j) + (1 - \lambda)\, s_{unsup}(t_i, t_j) \qquad (2)
$$
λ controls whether the word map better reflects the semantic categories (λ large) or the semantic relations among terms (λ small).
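The effect of λ in equation (2) can be seen in a small sketch. It assumes both similarities are available as matrices over the same set of terms; the function name, the range check on λ, and the numeric values are illustrative assumptions.

```python
import numpy as np

def semi_supervised_similarity(S_sup, S_unsup, lam):
    """Convex combination of supervised and unsupervised similarities, eq. (2)."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam is assumed to lie in [0, 1]")
    S_sup = np.asarray(S_sup, dtype=float)
    S_unsup = np.asarray(S_unsup, dtype=float)
    return lam * S_sup + (1.0 - lam) * S_unsup

# Hypothetical similarities for two terms.
S_sup = np.array([[1.0, 0.0],
                  [0.0, 1.0]])      # the terms belong to different categories
S_unsup = np.array([[1.0, 0.5],
                    [0.5, 1.0]])    # but they co-occur moderately often

S_mixed = semi_supervised_similarity(S_sup, S_unsup, lam=0.25)
```

With λ = 0.25 the off-diagonal similarity is 0.25 · 0 + 0.75 · 0.5 = 0.375. Setting λ = 1 recovers the purely supervised measure and λ = 0 the purely unsupervised one.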
Properties of the Semi-supervised Similarity
Fig. 1: Cosine similarity histogram (x-axis: cos(x,y), from 0.0 to 1.0; y-axis: frequency).
Fig. 2: Semi-supervised similarity histogram (x-axis: s(x,y), from 0.0 to 1.0; y-axis: frequency).
The histogram of the semi-supervised similarity is smoother. It is more robust to the ‘curse of dimensionality’.
Word maps will better reflect the term relationships.
Working with partially labeled documents
When only a small fraction of the documents is labeled, we proceed as follows:
Documents are categorized in a semi-supervised way using Transductive SVM.
The semi-supervised measures can now be computed in the usual way.
The Torgerson MDS algorithm is applied to obtain a word map.
Experimental results (I)
The semi-supervised algorithm has been applied to the visualization of the semantic relations among terms.
Evaluation of the visualization algorithms:
• The mapping algorithm is applied to generate the word map.
• A clustering algorithm is run in the map grouping the terms into 7 groups.
• Finally, the partition induced by the map is compared with the
classes induced by the thesaurus.
Experimental results (II)
The agreement between the partition induced by the mapping algorithm and the thesaurus has been evaluated through several objective functions:
• F measure (F).
• Entropy measure (E): Small values suggest little overlapping among different topics in the word map.
• Mutual Information (I): Informs particularly about the position
of the more specific terms in the word map.
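The slides do not give the formulas for these measures. As a rough sketch, the code below uses definitions that are common in the clustering literature: the F measure takes, for each thesaurus class, the best F score over all clusters and averages it weighted by class size, and the entropy measure averages the class entropy inside each cluster weighted by cluster size. Dividing the entropy by the logarithm of the number of classes is an assumption made here so that values fall in [0, 1]; the authors' exact definitions may differ.

```python
import numpy as np

def contingency(classes, clusters):
    """Count matrix N[i, j]: items of class i that fall in cluster j."""
    _, ci = np.unique(np.asarray(classes), return_inverse=True)
    _, kj = np.unique(np.asarray(clusters), return_inverse=True)
    N = np.zeros((ci.max() + 1, kj.max() + 1))
    np.add.at(N, (ci, kj), 1)
    return N

def f_measure(classes, clusters):
    """Class-size-weighted best F score of each class over all clusters."""
    N = contingency(classes, clusters)
    n_i = N.sum(axis=1, keepdims=True)          # class sizes
    n_j = N.sum(axis=0, keepdims=True)          # cluster sizes
    precision, recall = N / n_j, N / n_i
    denom = precision + recall
    F = np.divide(2 * precision * recall, denom,
                  out=np.zeros_like(N), where=denom > 0)
    return float((n_i.ravel() / N.sum()) @ F.max(axis=1))

def entropy_measure(classes, clusters):
    """Cluster-size-weighted class entropy, normalized by log(number of classes)."""
    N = contingency(classes, clusters)
    n_j = N.sum(axis=0)
    P = N / n_j                                 # class distribution per cluster
    logP = np.log(P, out=np.zeros_like(P), where=P > 0)
    H = -(P * logP).sum(axis=0)
    q = N.shape[0]
    norm = np.log(q) if q > 1 else 1.0
    return float((n_j / N.sum()) @ H / norm)

# Toy labels: a perfect partition and a fully mixed one.
thesaurus = [0, 0, 1, 1]
perfect = [5, 5, 9, 9]
mixed = [0, 1, 0, 1]
F_perfect, E_perfect = f_measure(thesaurus, perfect), entropy_measure(thesaurus, perfect)
F_mixed, E_mixed = f_measure(thesaurus, mixed), entropy_measure(thesaurus, mixed)
```

Under these definitions a map whose clusters coincide with the thesaurus classes gives F = 1 and E = 0, while clusters that mix both classes evenly give F = 0.5 and E = 1.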
Experimental results (III)
Method                        F     E     I
Torgerson MDS                 0.46  0.55  0.17
Least square MDS              0.53  0.52  0.16
Torgerson MDS (Average)       0.69  0.43  0.27
Torgerson MDS (Maximum)       0.77  0.36  0.31
Least square MDS (Average)    0.70  0.42  0.27
Least square MDS (Maximum)    0.76  0.38  0.31
The primary conclusions are the following:
The semi-supervised techniques significantly reduce the overlap among the different topics in the word map.
The widely used F measure is significantly improved.
The maximum semi-supervised measure gives the largest increase in the discriminant power of the word maps.
Experimental results (IV)
[Word map: terms such as SEMICONDUCTOR, OPTICAL, FIBER, KOHONEN, SOM, BAYESIAN and PERCEPTRON plotted on axes x and y, with three regions labeled "Semiconductor devices and optical cables", "Unsupervised learning" and "Supervised learning".]
Fig. 1