
TKE 2014

Document Retrieval for Large Scale Content Analysis using Contextualized Dictionaries

Gregor Wiedemann
Andreas Niekler

NLP Group | Department of Computer Science
University of Leipzig
Augustusplatz 10
04109 Leipzig

(2)

Outline

1) Motivation
2) Dictionary creation with topic models
3) Contextualizing dictionaries
4) Retrieval with dictionaries

(3)

Motivation

Content analysis in: social science, political science, business intelligence, media studies, ...

- How is European identity framed in newspapers?
- How (often) do policy makers refer to concepts of social or distributive justice?
- Is there a neoliberal economization of political justifications in the public policy debate?

(4)

How to find relevant documents

Information Retrieval: obtaining documents relevant to an information need by querying a collection
- standard query: small keyword set

Puzzling question: How can analysts represent their (rather abstract) information need?
- small keyword set?

Idea: compilation of a reference collection of paradigmatic documents
- paradigmatic document = document known to contain information /

(5)

Example use case

Political science study on „neoliberalism“:
Is there a neoliberal economization of political justifications in the public policy debate?

Target collection: 400,000 newspaper articles from DIE ZEIT (1949-2011)

Reference collection: 36 works of self-declared neoliberals (Mont Pelerin Society), e.g. Milton Friedman, F.A. Hayek etc.

(6)

Approach

3 steps of retrieval for content analysis purposes:
1) extraction of ranked dictionary from reference collection
2) extraction of co-occurrence data from reference collection
3) relevancy scoring of documents in target

(7)

dictionary creation

dictionary: automatically or manually compiled set of (several hundred) keywords representing conceptual / domain knowledge
- rank information: terms are of unequal importance

automatic dictionary creation: term extraction task from reference collection
- e.g. TF/IDF, log-likelihood (LL) measure of frequencies between two
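A minimal sketch of automatic dictionary creation by TF-IDF ranking, assuming the reference collection is already tokenized and stemmed. The function name `tfidf_dictionary`, the toy documents, and the particular TF-IDF variant (collection-wide term frequency times log inverse document frequency) are illustrative assumptions and are not taken from the slides.

```python
import math
from collections import Counter

def tfidf_dictionary(docs, size=3):
    """Rank the terms of a tokenized collection by a simple TF-IDF score."""
    n_docs = len(docs)
    tf = Counter()  # collection-wide term frequency
    df = Counter()  # number of documents containing the term
    for tokens in docs:
        tf.update(tokens)
        df.update(set(tokens))
    # ASSUMPTION: score = tf * log(N / df); other TF-IDF variants are equally valid.
    scores = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
    # Sort by descending score, break ties alphabetically for a stable ranking.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:size]

# Toy reference collection (hypothetical, not the study data).
docs = [
    ["preis", "markt", "preis"],
    ["markt", "freiheit"],
    ["markt", "preis", "steuer"],
]
dictionary = tfidf_dictionary(docs, size=3)
```

A term that occurs in every document ("markt" here) receives a score of zero, which is why a comparison against a second corpus, as with the LL-measure, is often preferred for small reference collections.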

(8)

dictionary creation via topic models

Topic Models: statistical models (e.g. LDA, Blei et al. 2003) to extract latent semantic structure from collections
- distribution over $K$ topics in documents
- distribution over words in topics $p(w \mid z_k)$

idea: term probability can be used to score weight of dictionary terms

$tw_n = \log(tf(w_n)) \sum_{k=1}^{K} \ldots$
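The weighting idea can be sketched as follows. The summand of the formula is not recoverable from the slide, so this sketch ASSUMES it is the term probability $p(w_n \mid z_k)$ weighted by a topic probability $p(z_k)$; the function name `topic_model_term_weights` and all numbers are hypothetical and stand in for the output of a fitted topic model.

```python
import math

def topic_model_term_weights(term_freq, topic_word_prob, topic_prob):
    """Compute tw_n = log(tf(w_n)) * sum_k p(w_n | z_k) * p(z_k).

    ASSUMPTION: the summand p(w_n | z_k) * p(z_k) is a guess at the
    truncated formula, not the authors' confirmed definition.
    """
    weights = {}
    for term, tf in term_freq.items():
        mixture = sum(
            p_z * p_w_given_z.get(term, 0.0)
            for p_z, p_w_given_z in zip(topic_prob, topic_word_prob)
        )
        weights[term] = math.log(tf) * mixture
    return weights

# Hand-made toy values in place of a fitted LDA model (not real data).
term_freq = {"preis": 100, "freiheit": 10, "und": 1000}
topic_word_prob = [
    {"preis": 0.30, "freiheit": 0.01, "und": 0.02},  # p(w | z_1)
    {"preis": 0.02, "freiheit": 0.25, "und": 0.02},  # p(w | z_2)
]
topic_prob = [0.6, 0.4]                               # p(z_1), p(z_2)

weights = topic_model_term_weights(term_freq, topic_word_prob, topic_prob)
ranked_terms = sorted(weights, key=weights.get, reverse=True)
```

In this toy setting the very frequent but topically unspecific term "und" ends up at the bottom of the ranking, because its high log frequency is multiplied by a low topic-conditional probability.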

(9)

example study

Term                     Weight
einkomm [income]         0.353946
preis [price]            0.344253
gut [goods]              0.293837
polit [political]        0.289046
zeit [time]              0.27682
hoh [high]               0.240263
kost [cost]              0.231548
regeln [rules]           0.221033
mensch [human]           0.217523
offent [public]          0.215913
person [person]          0.212456
regier [government]      0.210173
wert [value]             0.208864
inflation [inflation]    0.201939
analys [analysis]        0.200244
bestimmt [certain]       0.199011
allgemein [common]       0.198007
...                      ...

Probability   Top 10 Words
0.0961        mensch, freiheit, gesellschaft, gesetz, regeln, allgemein, grupp, ziel, bestimmt, regier
              [human, freedom, society, law, rules, common, group, aim, certain, govern]
0.0851        einkomm, gut, zeit, haushalt, konsum, kost, straftat, preis, wert, gleichung
              [income, goods, time, budget, consume, cost, offense, price, value, equation]
0.0638        steu, gut, offent, period, steuersatz, beschrank, staatlich, einnahm, steuerzahl, besteuer
              [tax, goods, public, period, tax rate, constraint, state, revenue, tax payer, taxation]
0.0939        polit, analys, regeln, okonom, theori, modell, verhalt, ansatz, frag, polit
              [political, analysis, rules, economic, theory, model, behaviour, approach, question, politics]
...           ...

(10)

contextualizing dictionaries

- need for more subtle meaning representation in content analysis IR
- co-occurrence data captures meaning of terms (distributional semantics hypothesis)

Term-Term-Matrix C:
- computation of significant co-occurrences of dictionary terms from reference collection
- sentence window
- dictionary of length N
- Dice measure (values between 0 and 1) reflects syntagmatic relations
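The construction of the term-term matrix can be sketched like this, using the standard Dice coefficient $2 \cdot n_{ab} / (n_a + n_b)$ over sentence counts. The function name `dice_cooccurrence`, the sparse dictionary-of-pairs representation, and the toy sentences are assumptions for illustration; any significance threshold the authors may apply is not shown.

```python
from collections import Counter
from itertools import combinations

def dice_cooccurrence(sentences, dictionary):
    """Sentence-window Dice scores for all co-occurring pairs of dictionary terms."""
    vocab = set(dictionary)
    sent_freq = Counter()  # n_a: sentences containing term a
    pair_freq = Counter()  # n_ab: sentences containing both a and b
    for tokens in sentences:
        present = sorted(vocab.intersection(tokens))
        sent_freq.update(present)
        pair_freq.update(combinations(present, 2))
    matrix = {}
    for (a, b), n_ab in pair_freq.items():
        dice = 2.0 * n_ab / (sent_freq[a] + sent_freq[b])
        matrix[(a, b)] = dice  # store both directions: the matrix is symmetric
        matrix[(b, a)] = dice
    return matrix

# Toy tokenized sentences (hypothetical, not the study data).
sentences = [
    ["eltern", "kind", "schule"],
    ["eltern", "kind"],
    ["schule", "gut"],
    ["kind", "alter"],
]
C = dice_cooccurrence(sentences, ["eltern", "kind", "schule", "gut"])
```

Pairs that never share a sentence are simply absent from the sparse matrix, and terms outside the dictionary ("alter" above) are ignored.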

(11)

contextualizing dictionaries

filtering for reference corpus specific co-occurrences
- computation of significant co-occurrences of dictionary terms from (large) randomly composed corpus (e.g. Leipzig Corpora Collection)
- term-term matrix D

$C' = \max(C - D, 0)$
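The filtering step amounts to an element-wise subtraction clipped at zero. A minimal sketch on sparse matrices stored as dictionaries of term pairs follows; the function name `filter_cooccurrences` and the numeric values are hypothetical, and dropping zero entries is a storage choice of this sketch.

```python
def filter_cooccurrences(C, D):
    """Return C' = max(C - D, 0), computed per term pair."""
    filtered = {}
    for pair, c_value in C.items():
        # Pairs missing from the background matrix D count as 0 there.
        value = max(c_value - D.get(pair, 0.0), 0.0)
        if value > 0.0:
            filtered[pair] = value  # keep the result sparse
    return filtered

# Hypothetical scores: reference collection (C) vs. background corpus (D).
C = {("eltern", "kind"): 0.40, ("eltern", "schule"): 0.10, ("eltern", "altruismus"): 0.06}
D = {("eltern", "kind"): 0.10, ("eltern", "schule"): 0.25}
C_prime = filter_cooccurrences(C, D)
```

Pairs that are at least as strongly associated in general language as in the reference collection vanish, while pairs unknown to the background corpus keep their full score.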

(12)

Example study

example term             C                       C'
öffentlich [public]      gut 0.207               gut 0.201
                         privat 0.116            ausgabe 0.094
                         meinung 0.114           meinung 0.087
                         ausgabe 0.102           privat 0.072
                         schule 0.063            theorie 0.055
beitrag [contribution]   leisten 0.185           leisten 0.066
                         wichtig 0.036           insbesondere 0.032
                         insbesondere 0.035      größen 0.028
                         sozial 0.033            sozial 0.027
                         größen 0.032            buch 0.027
eltern [parents]         kind 0.371              kind 0.266
                         alter 0.084             alter 0.084
                         schule 0.082            humankapital 0.073
                         humankapital 0.073      schule 0.062
                         altruismus 0.061        altruismus 0.061

(13)

Retrieval with dictionaries

Vector Space Model

(14)

Retrieval with dictionaries

Applying length normalization

Applying contextual similarity
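The scoring formulas of these slides are not preserved in the text, so the following is only a sketch of how a dictionary-based vector space score with length normalization and a contextual component might be combined. Everything here is an ASSUMPTION: the additive mix `tf(w, s) + alpha * context`, the logarithmic length normalization, and the names `score_document`, `term_weight`, and `cooc`.

```python
import math

def score_document(sentences, term_weight, cooc, alpha=0.5):
    """Score one document given as a list of tokenized sentences.

    ASSUMPTION: illustrative formula, not the authors' definition.
    term_weight maps dictionary terms to weights, cooc maps term pairs
    to filtered co-occurrence scores (C').
    """
    score = 0.0
    n_tokens = 0
    for tokens in sentences:
        n_tokens += len(tokens)
        types = set(tokens)
        for w in types:
            if w not in term_weight:
                continue
            unigram = tokens.count(w)  # tf(w, s)
            # Contextual evidence: co-occurrence scores with the other terms in s.
            context = sum(cooc.get((w, v), 0.0) for v in types if v != w)
            score += term_weight[w] * (unigram + alpha * context)
    # ASSUMPTION: damp the advantage of long documents with log(1 + length).
    return score / math.log(1 + n_tokens) if n_tokens else 0.0

# Hypothetical toy inputs.
term_weight = {"steuer": 0.5, "preis": 0.3}
cooc = {("steuer", "preis"): 0.2, ("preis", "steuer"): 0.2}
doc = [["steuer", "preis", "hoch"], ["preis", "sinkt"]]

score_vsm = score_document(doc, term_weight, cooc, alpha=0.0)      # unigram evidence only
score_context = score_document(doc, term_weight, cooc, alpha=1.0)  # with contextual evidence
```

With `alpha=0` the contextual term drops out and the function reduces to a length-normalized weighted term count, which matches the role of α as a mixing parameter in the evaluation slides.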

(16)

Example study

score    length  year  title
347.22   685     1977  Pro und kontra Mehrwertsteuer [Pros and cons of VAT]
321.81   662     1973  Oelkrise und Konjunktur [Oil crisis and economy]
290.48   705     1966  Energie muß billig sein [Energy has to be cheap]
289.34   687     1977  Die Steuern senken [Lower the taxes]
287.26   845     1964  Korrektur der Einkommensteuer [Correction of income tax]
281.07   687     1971  Die Bauern im Nacken [The farmers breathing down the neck]
279.74   884     1965  Was ist uns die Mark wert? [What is the „Mark“ worth to us?]
272.75   682     1970  Steuern mit der Steuer [Governing with taxes]
264.82   719     1971  Ohne Abkühlung keine Stabilität [No stability without slowdown]
262.81   671     1973  Das sicherste Mittel [The most secure instrument]
261.33   707     1972  Entlastung – wovon? [Relief – of what?]
254.97   676     1979  Das Fernsehen und die Angst [Television and fear]
254.93   704     2011  Nicht ernst gemeint: die Quote [Not meant seriously: the quota]
251.53   457     1977  Eine Konfliktstrategie der Union [A conflict strategy of the Union]

(17)

Evaluation I

2 purposes for evaluation:
1) determining optimal $\alpha$
2) assessing quality of
   - $score_{context}$ vs. $score_{VSM}$
   - topic models vs. tf-idf for dictionary creation

no gold standard data set for this retrieval task: 2 alternative approaches

Approach 1: Generating pseudorels by data fusion (Nuray/Can 2006):
- create set of „pseudorelevant“ documents from best ranked documents of most distinctive retrieval systems
- consider tf-idf / topic model + different $\alpha$ values as different „systems“
- evaluate „mean average
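The pseudo-relevance idea can be sketched as follows: pool the top-ranked documents of several systems, fuse them into a pseudo-relevant set, and score each system's ranking by average precision against that set. The Borda-style fusion used here is a simplification chosen for brevity; it is one of several fusion methods and is not necessarily the variant used in the study. The names `borda_pseudorels` and `average_precision` and the toy rankings are hypothetical.

```python
from collections import defaultdict

def borda_pseudorels(rankings, pool_depth, n_pseudorels):
    """Fuse the top pool_depth documents of each system by Borda count."""
    points = defaultdict(float)
    for ranking in rankings:
        for position, doc_id in enumerate(ranking[:pool_depth]):
            points[doc_id] += pool_depth - position  # best rank earns most points
    ordered = sorted(points, key=lambda d: (-points[d], d))
    return set(ordered[:n_pseudorels])

def average_precision(ranking, relevant):
    """Average of precision values at the ranks of relevant documents."""
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0

# Toy rankings of three hypothetical "systems" (e.g. different alpha values).
systems = {
    "sysA": ["d1", "d2", "d3", "d4"],
    "sysB": ["d2", "d1", "d5", "d3"],
    "sysC": ["d6", "d2", "d1", "d3"],
}
pseudorels = borda_pseudorels(list(systems.values()), pool_depth=3, n_pseudorels=2)
ap_per_system = {name: average_precision(r, pseudorels) for name, r in systems.items()}
```

Systems whose rankings agree with the fused pool receive a high average precision, so the comparison rewards consensus rather than verified relevance; this is the known trade-off of evaluating without a gold standard.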

(18)

Evaluation I

4 most distinctive systems:
- $D_{tf\text{-}idf}$ + $score_{VSM}$ [$\alpha = 0$]
- $D_{tf\text{-}idf}$ + $score_{context}$ [$tf(w,s) = 0$]
- $D_{TopicModel}$ + $score_{VSM}$ [$\alpha = 0$]
- $D_{TopicModel}$ + $score_{context}$ [$tf(w,s) = 0$]

54 documents as

(19)

Evaluation II

Approach 2: Precision at k
- evaluation of example retrieval by domain experts
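Precision at k is the share of the top k retrieved documents that the experts judged relevant. A minimal sketch, with the function name `precision_at_k` and the toy judgments as illustrative assumptions:

```python
def precision_at_k(ranking, judged_relevant, k):
    """Fraction of the top k ranked documents that were judged relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranking[:k]
    return sum(1 for doc_id in top_k if doc_id in judged_relevant) / k

# Hypothetical ranking and expert judgments.
ranking = ["d3", "d1", "d7", "d2", "d9"]
judged_relevant = {"d1", "d2", "d3"}
p_at_3 = precision_at_k(ranking, judged_relevant, 3)
p_at_5 = precision_at_k(ranking, judged_relevant, 5)
```

Because only the top k documents need to be judged, this measure keeps the manual effort for domain experts small, at the cost of saying nothing about recall.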

Conclusion

1. improvement of retrieval results by mixing of unigram and co-occurrence information from reference collections
2. further improvement by dictionary extraction with topic models

approach enables domain experts to query large collections for texts representing rather abstract domain knowledge

(20)

Literature

- Alsumait, L., Barbara, D., Gentle, J., Domeniconi, C.: Topic significance ranking of LDA generative models. In: ECML/PKDD '09: Part I. pp. 67-82 (2009)
- Biemann, C., Heyer, G., Quasthoff, U., Richter, M.: The Leipzig Corpora Collection. Monolingual corpora of standard size. In: Corpus Linguistics 2007 (2007)
- Billhardt, H., Borrajo, D., Maojo, V.: Using term co-occurrence data for document indexing and retrieval. In: Proceedings of the 22nd IRSG. pp. 105-117 (2000)
- Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. Journal of Machine Learning Research 3, 993-1022 (2003)
- Bordag, S.: A comparison of co-occurrence and similarity measures as simulations of context. In: Proceedings of the 9th CICLing. pp. 52-63 (2008)
- Krippendorff, K.: Content analysis: An introduction to its methodology. SAGE, 3rd edn. (2013)
- Nuray, R., Can, F.: Automatic ranking of information retrieval systems using data fusion. Information Processing & Management 42(3), 595-614 (2006)
- Peat, H., Willett, P.: The limitations of term co-occurrence data for query expansion in document retrieval
