Document Retrieval for Large Scale Content Analysis using Contextualized Dictionaries

(1)

TKE 2014

Gregor Wiedemann

Andreas Niekler

NLP Group | Department of Computer Science

University of Leipzig

Augustusplatz 10

04109 Leipzig

(2)

Outline

1) Motivation
2) Dictionary creation with topic models
3) Contextualizing dictionaries
4) Retrieval with dictionaries

(3)

Motivation

Content analysis: social science, political science, business intelligence, media studies, ...

● How is European identity framed in newspapers?
● How (often) do policy makers refer to concepts of social or distributive justice?
● Is there a neoliberal economization of political justifications in the public policy debate?

(4)

How to find relevant documents

● Information Retrieval:
  ― obtaining documents relevant to an information need by querying a collection
  ― standard query: small keyword set
● Puzzling question:
  ― How can analysts represent their (rather abstract) information need?
  ― small keyword set
  ― Idea: compilation of a reference collection of paradigmatic documents
  ― paradigmatic document = document knowingly containing information /

(5)

Example use case

● Political science study on "neoliberalism"
  ― Is there a neoliberal economization of political justifications in the public policy debate?
● Target collection
  ― 400,000 newspaper articles from DIE ZEIT (1949-2011)
● Reference collection
  ― 36 works of avowed neoliberals (Mont Pelerin Society)
  ― e.g. Milton Friedman, F.A. Hayek etc.

(6)

Approach

● 3 steps of retrieval for content analysis purposes
  ― 1) extraction of a ranked dictionary from the reference collection
  ― 2) extraction of co-occurrence data from the reference collection
  ― 3) relevancy scoring of documents in the target collection

(7)

Dictionary creation

● dictionary
  ― automatically or manually compiled set of (several hundred) keywords representing conceptual / domain knowledge
● rank information ← unequal importance of terms
● automatic dictionary creation:
  ― term extraction task from reference collection
  ― e.g. TF/IDF, LL-measure of frequencies between two
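As an illustration of the term extraction task, the following minimal sketch ranks terms of a tokenized reference collection by their summed TF-IDF score. The function name `tfidf_dictionary`, the `top_n` parameter, and the toy documents are assumptions made for this example. The slides do not specify which TF-IDF variant was used.

```python
import math
from collections import Counter

def tfidf_dictionary(reference_docs, top_n=5):
    """Rank terms of a tokenized reference collection by summed TF-IDF.

    Hypothetical helper: a plain tf * log(N / df) variant, not necessarily
    the weighting used in the study.
    """
    n_docs = len(reference_docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in reference_docs for term in set(doc))
    scores = Counter()
    for doc in reference_docs:
        for term, tf in Counter(doc).items():
            scores[term] += tf * math.log(n_docs / df[term])
    # Highest-scoring terms first; the rank carries the term importance.
    return scores.most_common(top_n)

# Toy reference collection (made-up, already stemmed tokens).
docs = [
    ["preis", "preis", "gut"],
    ["preis", "steuer"],
    ["gut", "steuer", "steuer", "steuer"],
]
ranked = tfidf_dictionary(docs, top_n=3)
```

The returned list of `(term, score)` pairs can serve as a ranked dictionary, with the score acting as the term weight.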

(8)

Dictionary creation via topic models

● Topic Models:
  ― statistical models (e.g. LDA, Blei et al. 2003) to extract latent semantic structure from collections
    — distribution over K topics in documents
    — distribution over words in topics p(w | z_k)
● idea: term probability can be used to score weight of dictionary terms

  $tw_n = \log(tf(w_n)) \sum_{k=1}^{K} \dots$
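The weighting idea can be sketched in a few lines. The summand of the formula is not recoverable from the slide, so this sketch assumes it is the topic-word probability p(w_n | z_k); that choice, the function name `term_weights`, and the toy numbers are all assumptions, not the authors' definition.

```python
import math

def term_weights(term_freq, topic_word_prob):
    """Compute tw_n = log(tf(w_n)) * sum_k p(w_n | z_k).

    ASSUMPTION: the summand p(w_n | z_k) is a guess, since the slide
    truncates the formula after the summation sign.
    term_freq:        dict term -> frequency in the reference collection (>= 1)
    topic_word_prob:  list of K dicts, each mapping term -> p(term | z_k)
    """
    weights = {}
    for term, tf in term_freq.items():
        prob_sum = sum(topic.get(term, 0.0) for topic in topic_word_prob)
        weights[term] = math.log(tf) * prob_sum
    return weights

# Made-up frequencies and topic-word probabilities for K = 2 topics.
tf = {"preis": 100, "zeit": 1}
topics = [{"preis": 0.02, "zeit": 0.01}, {"preis": 0.01}]
weights = term_weights(tf, topics)
```

Under this assumption a term that occurs only once receives weight 0, because log(1) = 0, regardless of its topic probabilities.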

(9)

Example study

Term                     Weight
einkomm [income]         0.353946
preis [price]            0.344253
gut [goods]              0.293837
polit [political]        0.289046
zeit [time]              0.27682
hoh [high]               0.240263
kost [cost]              0.231548
regeln [rules]           0.221033
mensch [human]           0.217523
offent [public]          0.215913
person [person]          0.212456
regier [government]      0.210173
wert [value]             0.208864
inflation [inflation]    0.201939
analys [analysis]        0.200244
bestimmt [certain]       0.199011
allgemein [common]       0.198007
...                      ...

Probability  Top 10 words
0.0961       mensch, freiheit, gesellschaft, gesetz, regeln, allgemein, grupp, ziel, bestimmt, regier
             [human, freedom, society, law, rules, common, group, aim, certain, govern]
0.0851       einkomm, gut, zeit, haushalt, konsum, kost, straftat, preis, wert, gleichung
             [income, goods, time, budget, consume, cost, offense, price, value, equation]
0.0638       steu, gut, offent, period, steuersatz, beschrank, staatlich, einnahm, steuerzahl, besteuer
             [tax, goods, public, period, tax rate, constraint, state, revenue, tax payer, taxation]
0.0939       polit, analys, regeln, okonom, theori, modell, verhalt, ansatz, frag, polit
             [political, analysis, rules, economic, theory, model, behaviour, approach, question, politics]
...          ...

(10)

Contextualizing dictionaries

● need for more subtle meaning representation in content analysis IR
● co-occurrence data captures meaning of terms (distributional semantics hypothesis)
● Term-Term-Matrix C:
  ― computation of significant co-occurrences of dictionary terms from reference collection
  ― sentence window
  ― dictionary of length N →
  ― Dice measure (0;1) reflects syntagmatic relations
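A minimal sketch of how such a term-term matrix could be computed is shown here. It counts sentence-level co-occurrences of dictionary terms and scores each pair with the Dice coefficient 2·n_ab / (n_a + n_b). The function name `dice_matrix` and the toy sentences are assumptions for illustration, and the significance filtering mentioned on the slide is left out.

```python
from collections import Counter
from itertools import combinations

def dice_matrix(sentences, dictionary):
    """Dice coefficient for each pair of dictionary terms sharing a sentence.

    Hypothetical helper: returns a sparse matrix as a dict keyed by
    alphabetically ordered term pairs. No significance test is applied.
    """
    vocab = set(dictionary)
    term_count = Counter()  # number of sentences containing a term
    pair_count = Counter()  # number of sentences containing both terms
    for sentence in sentences:
        present = sorted(vocab.intersection(sentence))
        term_count.update(present)
        pair_count.update(combinations(present, 2))
    return {
        (a, b): 2 * n_ab / (term_count[a] + term_count[b])
        for (a, b), n_ab in pair_count.items()
    }

# Made-up tokenized sentences from a reference collection.
sentences = [
    ["eltern", "kind"],
    ["eltern", "schule"],
    ["kind", "spielt"],
    ["eltern", "kind", "schule"],
]
C = dice_matrix(sentences, ["eltern", "kind", "schule"])
```

Pairs that never share a sentence are simply absent from the result, which corresponds to a zero entry in the matrix.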

(11)

Contextualizing dictionaries

● filtering for reference corpus specific co-occurrences
  ― computation of significant co-occurrences of dictionary terms from (large) randomly composed corpus (e.g. Leipzig Corpora Collection)
  ― → term-term matrix D

  $C' = \max(C - D, 0)$
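The filtering step is an entry-wise operation and can be sketched as follows. The sparse dict representation, the function name `filter_cooccurrences`, and all numeric values are assumptions chosen for this example.

```python
def filter_cooccurrences(C, D):
    """Entry-wise C' = max(C - D, 0) for sparse matrices stored as dicts.

    Hypothetical helper: pairs missing from the background matrix D are
    treated as 0, so their reference-corpus value is kept unchanged.
    """
    return {pair: max(c - D.get(pair, 0.0), 0.0) for pair, c in C.items()}

# Made-up Dice values: C from the reference collection, D from a background corpus.
C = {("a", "b"): 0.5, ("a", "c"): 0.1, ("b", "c"): 0.25}
D = {("a", "b"): 0.2, ("a", "c"): 0.3}
C_prime = filter_cooccurrences(C, D)
```

A pair that is at least as strongly associated in the background corpus as in the reference collection ends up with weight 0, so only reference-specific co-occurrences keep a positive score.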

(12)

Example study

Example term             C                        C'
öffentlich [public]      gut 0.207                gut 0.201
                         privat 0.116             ausgabe 0.094
                         meinung 0.114            meinung 0.087
                         ausgabe 0.102            privat 0.072
                         schule 0.063             theorie 0.055
beitrag [contribution]   leisten 0.185            leisten 0.066
                         wichtig 0.036            insbesondere 0.032
                         insbesondere 0.035       größen 0.028
                         sozial 0.033             sozial 0.027
                         größen 0.032             buch 0.027
eltern [parents]         kind 0.371               kind 0.266
                         alter 0.084              alter 0.084
                         schule 0.082             humankapital 0.073
                         humankapital 0.073       schule 0.062
                         altruismus 0.061         altruismus 0.061

(13)

Retrieval with dictionaries

● Vector Space Model

(14)

Retrieval with dictionaries

● Applying length normalization
● Applying contextual similarity
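The scoring formulas of these slides are not preserved in the text, so the following is only a generic sketch of dictionary-based vector space scoring with length normalization. It assumes the score is the dot product of dictionary term weights and document term frequencies, divided by the Euclidean norm of the term-frequency vector. Both choices and the name `vsm_score` are assumptions, and the contextual similarity component is not covered.

```python
import math
from collections import Counter

def vsm_score(doc_tokens, dictionary_weights, normalize=True):
    """Dot product of dictionary weights and term frequencies.

    ASSUMPTION: cosine-style normalization by the Euclidean norm of the
    document's term-frequency vector; the authors' normalization is unknown.
    """
    tf = Counter(doc_tokens)
    score = sum(weight * tf[term] for term, weight in dictionary_weights.items())
    if normalize and tf:
        score /= math.sqrt(sum(count * count for count in tf.values()))
    return score

# Two weights taken (rounded) from the example dictionary; the document is made up.
weights = {"preis": 0.344, "einkomm": 0.354}
doc = ["preis", "preis", "einkomm", "zeit"]
raw = vsm_score(doc, weights, normalize=False)
normalized = vsm_score(doc, weights)
```

Without normalization, long documents collect higher scores simply because they contain more tokens; dividing by a length-dependent factor counteracts that bias.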

(16)

Example study

Score   Length  Year  Title
347.22  685     1977  Pro und kontra Mehrwertsteuer [Pros and cons of VAT]
321.81  662     1973  Oelkrise und Konjunktur [Oil crisis and economy]
290.48  705     1966  Energie muß billig sein [Energy has to be cheap]
289.34  687     1977  Die Steuern senken [Lower the taxes]
287.26  845     1964  Korrektur der Einkommensteuer [Correction of income tax]
281.07  687     1971  Die Bauern im Nacken [The farmers breathing down their necks]
279.74  884     1965  Was ist uns die Mark wert? [What is the Mark worth to us?]
272.75  682     1970  Steuern mit der Steuer [Governing with taxes]
264.82  719     1971  Ohne Abkühlung keine Stabilität [No stability without slowdown]
262.81  671     1973  Das sicherste Mittel [The most secure instrument]
261.33  707     1972  Entlastung – wovon? [Relief – of what?]
254.97  676     1979  Das Fernsehen und die Angst [Television and fear]
254.93  704     2011  Nicht ernst gemeint: die Quote [Not meant seriously: the quota]
251.53  457     1977  Eine Konfliktstrategie der Union [A conflict strategy of the Union (CDU/CSU)]

(17)

Evaluation I

● 2 purposes for evaluation
  ― 1) determining optimal α
  ― 2) assessing quality of
    - score_context vs. score_VSM
    - topic models vs. tf-idf for dictionary creation
● no gold standard data set for this retrieval task
  → 2 alternative approaches
● Approach 1: Generating pseudorels by data fusion (Nuray/Can 2006):
  ― create set of "pseudorelevant" documents from best ranked documents of most distinctive retrieval systems
  ― consider tf-idf / topic model + different α values as different "systems"
  ― evaluate "mean average
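The pseudo-relevance idea can be illustrated with a small sketch: rankings of several systems are fused, the top fused documents are treated as pseudo-relevant, and each system is then scored by average precision against that set. Borda count is used here as one simple fusion rule; the slides do not state which rule was applied. The function names and the toy rankings are assumptions.

```python
from collections import defaultdict

def borda_fusion(rankings, top_k):
    """Fuse ranked lists of document ids; return the top_k ids as a set.

    ASSUMPTION: Borda count, where a document at position p in a list of
    length n earns n - p points (p starts at 0).
    """
    points = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, doc_id in enumerate(ranking):
            points[doc_id] += n - position
    # Sort by points (descending), breaking ties by document id.
    fused = sorted(points, key=lambda d: (-points[d], d))
    return set(fused[:top_k])

def average_precision(ranking, relevant):
    """Average of the precision values at each rank where a relevant doc occurs."""
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0

# Made-up rankings produced by three hypothetical "systems".
systems = [
    ["d1", "d2", "d3", "d4"],
    ["d2", "d1", "d4", "d3"],
    ["d2", "d3", "d1", "d4"],
]
pseudorels = borda_fusion(systems, top_k=2)
ap_scores = [average_precision(ranking, pseudorels) for ranking in systems]
```

Systems whose rankings agree with the fused consensus receive a higher average precision, which gives a relative comparison without manually judged documents.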

(18)

Evaluation I

● 4 most distinctive systems:
  ― D_tf-idf + score_VSM [α=0]
  ― D_tf-idf + score_context [tf(w,s)=0]
  ― D_TopicModel + score_VSM [α=0]
  ― D_TopicModel + score_context [tf(w,s)=0]
● → 54 documents as

(19)

Evaluation II

● Approach 2: Precision at k
  ― evaluation of example retrieval by domain experts
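Precision at k is the share of relevant documents among the k top-ranked results. A minimal sketch follows, with made-up document ids and expert judgments; the function name `precision_at_k` is an assumption.

```python
def precision_at_k(ranking, relevant, k):
    """Fraction of the k top-ranked documents that were judged relevant."""
    top = ranking[:k]
    return sum(1 for doc_id in top if doc_id in relevant) / k

# Made-up ranking and expert judgments.
ranking = ["a", "b", "c", "d"]
judged_relevant = {"a", "c"}
p_at_2 = precision_at_k(ranking, judged_relevant, 2)
p_at_3 = precision_at_k(ranking, judged_relevant, 3)
```

Because experts only need to judge the k top-ranked documents of each retrieval run, this measure stays feasible for very large target collections.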



Conclusion

1. improvement of retrieval results by mixing unigram and co-occurrence information from reference collections
2. further improvement by dictionary extraction with topic models

The approach enables domain experts to query large collections for texts representing rather abstract domain knowledge.

(20)

Literature

● Alsumait, L., Barbara, D., Gentle, J., Domeniconi, C.: Topic significance ranking of LDA generative models. In: ECML/PKDD '09: Part I. pp. 67-82 (2009)
● Biemann, C., Heyer, G., Quasthoff, U., Richter, M.: The Leipzig Corpora Collection. Monolingual corpora of standard size. In: Corpus Linguistics 2007 (2007)
● Billhardt, H., Borrajo, D., Maojo, V.: Using term co-occurrence data for document indexing and retrieval. In: Proceedings of the 22nd IRSG. pp. 105-117 (2000)
● Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. Journal of Machine Learning Research 3, 993-1022 (2003)
● Bordag, S.: A comparison of co-occurrence and similarity measures as simulations of context. In: Proceedings of the 9th CICLing. pp. 52-63 (2008)
● Krippendorff, K.: Content analysis: An introduction to its methodology. SAGE, 3 edn. (2013)
● Nuray, R., Can, F.: Automatic ranking of information retrieval systems using data fusion. Information Processing & Management 42(3), 595-614 (2006)
● Peat, H., Willett, P.: The limitations of term co-occurrence data for query expansion in document retrieval
