
A Partially Supervised Metric Multidimensional Scaling Algorithm for Textual Data Visualization


(1)


Ángela Blanco

Universidad Pontificia de Salamanca, Spain

Manuel Martín-Merino

Universidad Pontificia de Salamanca, Spain

(2)

Contents

1. Introduction

2. The Torgerson Multidimensional Scaling Algorithm

3. A Semi-supervised Multidimensional Scaling Algorithm

4. Experimental results

5. Conclusions and future research trends

(3)

Introduction (I)

The Torgerson MDS algorithm is a popular visualization technique that helps to discover the underlying structure of high dimensional data.

An interesting application is the visualization of the semantic relations among terms or documents in textual databases.

However, the Torgerson MDS algorithm proposed in the literature suffers from a low discriminant power due to:

• The unsupervised nature.

• The ‘curse of dimensionality’.

(4)

Introduction (II)

Several search engines provide a categorization for a subset of documents.

[Figure: Problem overview. Labels recovered from the diagram: terms t1, ..., tn; space of terms (R^d); space of documents; R^n; semantic classes C1, ..., Ck; relation between terms and documents; f; Torgerson MDS map. Annotation: terms are usually not categorized.]

Goal: To generate a visual representation of term relationships that takes advantage of the document class labels.

(5)

Introduction (III)

Our approach:

Define a semi-supervised similarity between terms that considers the document class labels.

• It should reflect whether two terms are related to the same semantic topics.

• It should reflect the semantic proximities between terms.

Incorporate the semi-supervised similarity into the Torgerson MDS algorithm. This will preserve the nice properties of the optimization problem.

(6)

Torgerson MDS Algorithm (I)

The Torgerson MDS algorithm looks for an object configuration in a low dimensional space such that the interpattern distances are approximately preserved.

Properties for text mining problems:

It is based on an efficient linear algebraic operation (SVD).

The optimization problem does not have local minima.

For certain similarities it is equivalent to LSI.
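The following is a minimal sketch of classical (Torgerson) MDS, assuming the input is a symmetric matrix of pairwise dissimilarities. The function name `torgerson_mds` and the toy data are illustrative and do not come from the slides. The sketch uses a symmetric eigendecomposition, which for the double-centered matrix plays the same role as the SVD mentioned above.

```python
import numpy as np

def torgerson_mds(dissim, n_components=2):
    """Classical (Torgerson) MDS on a symmetric dissimilarity matrix."""
    d = np.asarray(dissim, dtype=float)
    n = d.shape[0]
    # Double-center the squared dissimilarities: B = -1/2 * J D^2 J
    j = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * j @ (d ** 2) @ j
    # eigh handles symmetric matrices and returns eigenvalues in ascending order
    eigvals, eigvecs = np.linalg.eigh(b)
    order = np.argsort(eigvals)[::-1][:n_components]
    # Clip negative eigenvalues, which can appear for non-Euclidean input
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale

# Toy check: four planar points should be recovered up to rotation/reflection
points = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0], [3.0, 4.0]])
dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
coords = torgerson_mds(dist, n_components=2)
recovered = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
```

Because the solution comes from a single eigendecomposition, there is no iterative optimization and therefore no dependence on an initial configuration.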

(7)

Torgerson MDS Algorithm (II)

Drawbacks:

Low discriminant power: Due to the unsupervised nature, different topics in the textual collection overlap significantly in the word map.

It is affected by the ‘curse of dimensionality’.

(8)

Semi-supervised MDS algorithm (I)

Goal: To improve the discriminant power of the Torgerson MDS algorithm, which works in the space of terms, by taking into account a classification in the space of documents.

The association between the terms ($t_i$) and the document class labels ($C_k$) is evaluated by the Mutual Information $I'(t_i; C_k)$.

A supervised measure is defined that becomes large for terms that are correlated with the same categories:

$$
s_1(t_i, t_j) = \frac{\sum_k I'(t_i; C_k)\, I'(t_j; C_k)}{\sqrt{\sum_k \left(I'(t_i; C_k)\right)^2}\, \sqrt{\sum_k \left(I'(t_j; C_k)\right)^2}} \qquad (1)
$$
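As a rough illustration, Eq. (1) can be read as the cosine between the vectors of term-class association scores. The sketch below assumes those scores are already available as a matrix `assoc` with one row per term and one column per class. The function name, the matrix, and the convention of assigning similarity 0 to terms with an all-zero row are assumptions made for the example, not part of the slides.

```python
import numpy as np

def supervised_similarity(assoc):
    """Cosine between term-class association vectors.

    assoc[i, k] is assumed to hold the association score of term i
    with class k (the mutual information term in Eq. (1)).
    """
    a = np.asarray(assoc, dtype=float)
    norms = np.linalg.norm(a, axis=1, keepdims=True)
    # Assumption: a term with an all-zero row gets similarity 0 to every term
    unit = np.divide(a, norms, out=np.zeros_like(a), where=norms > 0)
    return unit @ unit.T

# Hypothetical scores for 3 terms and 3 classes
assoc = np.array([[2.0, 0.0, 0.0],
                  [4.0, 0.0, 0.0],
                  [0.0, 1.0, 1.0]])
s_sup = supervised_similarity(assoc)
```

In this toy input the first two terms are associated with the same single class, so their similarity is maximal, while the third term shares no class with them.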

(9)

Semi-supervised MDS Algorithm (II)

The supervised measure reflects only the semantic categories of the textual collection, not the relationships among terms, which are what matters for visualization purposes.

Therefore, a semi-supervised similarity should be defined that reflects both the semantic categories and the term relationships inside each class.

$$
s(t_i, t_j) = \lambda\, s_{sup}(t_i, t_j) + (1 - \lambda)\, s_{unsup}(t_i, t_j) \qquad (2)
$$

λ controls whether the word map better reflects the semantic categories (λ large) or the semantic relations among terms (λ small).
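A small sketch of the convex combination in Eq. (2), assuming both similarity matrices have the same shape and that λ lies in [0, 1]. The function name and the numeric values are made up for illustration.

```python
import numpy as np

def semi_supervised_similarity(s_sup, s_unsup, lam):
    """Convex combination of supervised and unsupervised similarities."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    s_sup = np.asarray(s_sup, dtype=float)
    s_unsup = np.asarray(s_unsup, dtype=float)
    if s_sup.shape != s_unsup.shape:
        raise ValueError("similarity matrices must have the same shape")
    return lam * s_sup + (1.0 - lam) * s_unsup

# Hypothetical similarities between two terms
s_sup = np.array([[1.0, 0.2], [0.2, 1.0]])
s_unsup = np.array([[1.0, 0.8], [0.8, 1.0]])

# lam = 0.25 leans towards the unsupervised term relationships
s_mixed = semi_supervised_similarity(s_sup, s_unsup, lam=0.25)
```

Setting `lam=1.0` returns the supervised measure alone and `lam=0.0` the unsupervised one, which matches the two extremes described above.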

(10)

Properties Semi-supervised Similarity

Fig. 1: Cosine similarity histogram (x-axis: cos(x,y), from 0.0 to 1.0; y-axis: frequency).

Fig. 2: Semi-supervised similarity histogram (x-axis: s(x,y), from 0.0 to 1.0; y-axis: frequency).

The histogram of the semi-supervised similarity is smoother. The similarity is more robust to the ‘curse of dimensionality’.

Word maps will reflect better the term relationships.

(11)

Working with partially labeled documents

When only a small fraction of the documents is labeled, we proceed as follows:

Documents are categorized in a semi-supervised way using Transductive SVM.

The semi-supervised measures can now be computed in the usual way.

The Torgerson MDS algorithm is applied to obtain a word map.

(12)

Experimental results (I)

The semi-supervised algorithm has been applied to the visualization of the semantic relations among terms.

Evaluation of the visualization algorithms:

• The mapping algorithm is applied to generate the word map.

• A clustering algorithm is run on the map, grouping the terms into 7 groups.

• Finally, the partition induced by the map is compared with the classes induced by the thesaurus.

(13)

Experimental results (II)

The agreement between the partition induced by the mapping algorithm and the thesaurus has been evaluated through several objective functions:

• F measure (F).

• Entropy measure (E): Small values suggest little overlapping among different topics in the word map.

• Mutual Information (I): Informs particularly about the position of the more specific terms in the word map.
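The slides do not give formulas for these measures. The sketch below uses one common textbook definition of the clustering F measure (class-size-weighted best F over clusters) and of the entropy measure (cluster-size-weighted class entropy, natural logarithm, unnormalized). Both definitions, the function names, and the toy labels are assumptions and may differ from what the authors used.

```python
import numpy as np

def contingency(classes, clusters):
    """Count matrix n[i, j]: items of class i assigned to cluster j."""
    _, ci = np.unique(np.asarray(classes).ravel(), return_inverse=True)
    _, cj = np.unique(np.asarray(clusters).ravel(), return_inverse=True)
    table = np.zeros((ci.max() + 1, cj.max() + 1))
    np.add.at(table, (ci, cj), 1)
    return table

def f_measure(classes, clusters):
    """Class-size-weighted maximum F score over clusters (assumed definition)."""
    n = contingency(classes, clusters)
    recall = n / n.sum(axis=1, keepdims=True)
    precision = n / n.sum(axis=0, keepdims=True)
    denom = precision + recall
    f = np.divide(2 * precision * recall, denom,
                  out=np.zeros_like(n), where=denom > 0)
    return float((n.sum(axis=1) / n.sum()) @ f.max(axis=1))

def entropy_measure(classes, clusters):
    """Cluster-size-weighted entropy of the class distribution (assumed definition)."""
    n = contingency(classes, clusters)
    p = n / n.sum(axis=0, keepdims=True)
    # Treat 0 * log(0) as 0
    plogp = np.where(p > 0, p * np.log(np.where(p > 0, p, 1.0)), 0.0)
    per_cluster = -plogp.sum(axis=0)
    return float((n.sum(axis=0) / n.sum()) @ per_cluster)

# Toy labels: thesaurus classes versus clusters found in the map
classes = [0, 0, 1, 1]
f_perfect = f_measure(classes, ["a", "a", "b", "b"])
e_perfect = entropy_measure(classes, ["a", "a", "b", "b"])
f_merged = f_measure(classes, ["a", "a", "a", "a"])
e_merged = entropy_measure(classes, ["a", "a", "a", "a"])
```

With these definitions a partition that matches the classes exactly gives the highest F and zero entropy, while merging both classes into one cluster lowers F and raises the entropy.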

(14)

Experimental results (III)

                              F      E      I
Torgerson MDS                 0.46   0.55   0.17
Least square MDS              0.53   0.52   0.16
Torgerson MDS (Average)       0.69   0.43   0.27
Torgerson MDS (Maximum)       0.77   0.36   0.31
Least square MDS (Average)    0.70   0.42   0.27
Least square MDS (Maximum)    0.76   0.38   0.31

The primary conclusions are the following:

The semi-supervised techniques significantly reduce the overlap among the different topics in the word map.

The widely used F measure is significantly improved.

The maximum semi-supervised measure in particular increases the discriminant power of the word maps.

(15)

Experimental results (IV)

Fig. 1: Word map generated by the semi-supervised MDS algorithm (axes: x, y). Labeled regions: Semiconductor devices and optical cables; Unsupervised learning; Supervised learning.

(16)

Conclusions and future research trends

We have proposed a semi-supervised version of the Torgerson MDS algorithm.

The new algorithm has been applied to the analysis of the semantic relations among terms in textual databases.

The experimental results suggest that the proposed algorithm significantly improves on the discriminant power of mapping techniques that rely solely on unsupervised measures.

Future research will focus on the development of new semi-supervised dimension reduction techniques.
