TKE 2014
Document Retrieval for Large Scale Content Analysis using Contextualized Dictionaries
Gregor Wiedemann | [email protected]
Andreas Niekler | [email protected]
NLP Group | Department of Computer Science
University of Leipzig
Augustusplatz 10
04109 Leipzig
Outline
1) Motivation
2) Dictionary creation with topic models
3) Contextualizing dictionaries
4) Retrieval with dictionaries
Motivation
content analysis in: social science, political science, business intelligence, media studies, ...
● How is European identity framed in newspapers?
● How (often) do policy makers refer to concepts of social or distributive justice?
● Is there a neoliberal economization of political justifications in the public policy debate?
How to find relevant documents
Information Retrieval:
― obtaining documents relevant to an information need by querying a collection
― standard query: small keyword set
Puzzling question:
― How can analysts represent their (rather abstract) information need?
― small keyword set
― Idea: compilation of a reference collection of paradigmatic documents
― paradigmatic document = document known to contain information /
Example use case
Political science study on „neoliberalism“
― Is there a neoliberal economization of political justifications in the public policy debate?
Target collection
― 400,000 newspaper articles from DIE ZEIT (1949-2011)
Reference collection
― 36 works of avowed neoliberals (Mont Pelerin Society)
― e.g. Milton Friedman, F.A. Hayek etc.
Approach
3 steps of retrieval for content analysis purposes
― 1) extraction of a ranked dictionary from the reference collection
― 2) extraction of co-occurrence data from the reference collection
― 3) relevancy scoring of documents in the target collection
dictionary creation
dictionary
― automatically or manually compiled set of (several hundred) keywords representing conceptual / domain knowledge
― rank information ← unequal importance of terms
automatic dictionary creation:
― term extraction task from reference collection
― e.g. TF/IDF, LL-measure of frequencies between two
dictionary creation via topic models
Topic Models:
― statistical models (e.g. LDA, Blei et al. 2003) to extract latent semantic structure from collections
    ― distribution over K topics in documents
    ― distribution over words in topics p(w | z_k)
idea: term probability can be used to score weight of dictionary terms
$tw_n = \log(tf(w_n)) \sum_{k=1}^{K} \dots$
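The weighting idea can be sketched in a few lines of Python. The formula on the slide breaks off after the sum sign, so the summand used here, p(w | z_k) summed over all K topics, is an assumption made purely for illustration. The function names and the toy numbers are hypothetical as well.

```python
import math


def term_weights(term_freq, topic_word_probs):
    """Weight each term by log(tf) times its topic probabilities summed over topics."""
    weights = {}
    for term, tf in term_freq.items():
        # ASSUMPTION: the summand is p(w | z_k); the slide formula is cut off
        # after the sum sign, so this choice is illustrative only.
        prob_sum = sum(topic.get(term, 0.0) for topic in topic_word_probs)
        weights[term] = math.log(tf) * prob_sum
    return weights


def ranked_dictionary(weights, size):
    """Return the `size` highest-weighted (term, weight) pairs."""
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:size]


# Toy input (hypothetical): corpus frequencies and two topics with p(w | z_k).
term_freq = {"preis": 100, "steu": 50, "zeit": 10}
topic_word_probs = [
    {"preis": 0.20, "steu": 0.05, "zeit": 0.01},
    {"preis": 0.02, "steu": 0.30, "zeit": 0.01},
]

weights = term_weights(term_freq, topic_word_probs)
dictionary = ranked_dictionary(weights, 2)
print(dictionary)
```

The log factor dampens very frequent terms, so a term that is prominent within a topic can outrank a term that is merely common.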
example study

Term                    Weight
einkomm [income]        0.353946
preis [price]           0.344253
gut [goods]             0.293837
polit [political]       0.289046
zeit [time]             0.27682
hoh [high]              0.240263
kost [cost]             0.231548
regeln [rules]          0.221033
mensch [human]          0.217523
offent [public]         0.215913
person [person]         0.212456
regier [government]     0.210173
wert [value]            0.208864
inflation [inflation]   0.201939
analys [analysis]       0.200244
bestimmt [certain]      0.199011
allgemein [common]      0.198007
...                     ...

Probability  Top 10 Words
0.0961       mensch, freiheit, gesellschaft, gesetz, regeln, allgemein, grupp, ziel, bestimmt, regier
             [human, freedom, society, law, rules, common, group, aim, certain, govern]
0.0851       einkomm, gut, zeit, haushalt, konsum, kost, straftat, preis, wert, gleichung
             [income, goods, time, budget, consume, cost, offense, price, value, equation]
0.0638       steu, gut, offent, period, steuersatz, beschrank, staatlich, einnahm, steuerzahl, besteuer
             [tax, goods, public, period, tax rate, constraint, state, revenue, tax payer, taxation]
0.0939       polit, analys, regeln, okonom, theori, modell, verhalt, ansatz, frag, polit
             [political, analysis, rules, economic, theory, model, behaviour, approach, question, politics]
...          ...
contextualizing dictionaries
need for more subtle meaning representation in content analysis IR
co-occurrence data captures meaning of terms (distributional semantics hypothesis)
Term-Term-Matrix C:
― computation of significant co-occurrences of dictionary terms from reference collection
― sentence window
― dictionary of length N → N × N matrix C
― Dice measure (0;1) reflects syntagmatic relations
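A minimal sketch of the Dice computation over a sentence window, assuming sentences are already tokenized and stemmed. The function name `dice_matrix` and the toy sentences are hypothetical. The slide speaks of significant co-occurrences; no significance filter is applied here, so every co-occurring pair receives a value.

```python
from collections import Counter
from itertools import combinations


def dice_matrix(sentences, dictionary):
    """Dice coefficient 2 * n_ab / (n_a + n_b), counted per sentence."""
    vocab = set(dictionary)
    term_count = Counter()   # number of sentences containing a term
    pair_count = Counter()   # number of sentences containing both terms
    for sentence in sentences:
        present = sorted(vocab.intersection(sentence))
        term_count.update(present)
        pair_count.update(combinations(present, 2))

    matrix = {a: {b: 0.0 for b in dictionary} for a in dictionary}
    for (a, b), n_ab in pair_count.items():
        dice = 2.0 * n_ab / (term_count[a] + term_count[b])
        matrix[a][b] = dice
        matrix[b][a] = dice  # the measure is symmetric
    return matrix


# Toy sentences (hypothetical), one token list per sentence.
sentences = [
    ["eltern", "kind", "schule"],
    ["kind", "alter"],
    ["eltern", "kind"],
    ["schule", "alter"],
]
C = dice_matrix(sentences, ["eltern", "kind", "schule", "alter"])
print(C["eltern"]["kind"])
```

Counting each term at most once per sentence keeps the values between 0 and 1.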
contextualizing dictionaries
filtering for reference corpus specific co-occurrences
― computation of significant co-occurrences of dictionary terms from (large) randomly composed corpus (e.g. Leipzig Corpora Collection)
― → term-term matrix D
$C' = \max(C - D, 0)$
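The filtering step is an element-wise subtraction clipped at zero. A minimal sketch with nested dictionaries as sparse matrices follows; the function name is hypothetical, and the D values are made up for illustration (they are not given on the slides).

```python
def filter_cooccurrences(C, D):
    """Element-wise C' = max(C - D, 0); pairs missing from D count as 0."""
    return {
        a: {b: max(value - D.get(a, {}).get(b, 0.0), 0.0) for b, value in row.items()}
        for a, row in C.items()
    }


# C values for "eltern" as in the example study; "buch" and all D values
# are hypothetical and serve only to show the subtraction and the clipping.
C = {"eltern": {"kind": 0.371, "schule": 0.082, "buch": 0.010}}
D = {"eltern": {"kind": 0.105, "schule": 0.020, "buch": 0.050}}

C_prime = filter_cooccurrences(C, D)
print(C_prime)
```

Pairs that co-occur at least as strongly in general language as in the reference collection drop to zero, so only reference-specific context remains.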
Example study

example term: öffentlich [public]
  C:  gut 0.207, privat 0.116, meinung 0.114, ausgabe 0.102, schule 0.063
  C': gut 0.201, ausgabe 0.094, meinung 0.087, privat 0.072, theorie 0.055

example term: beitrag [contribution]
  C:  leisten 0.185, wichtig 0.036, insbesondere 0.035, sozial 0.033, größen 0.032
  C': leisten 0.066, insbesondere 0.032, größen 0.028, sozial 0.027, buch 0.027

example term: eltern [parents]
  C:  kind 0.371, alter 0.084, schule 0.082, humankapital 0.073, altruismus 0.061
  C': kind 0.266, alter 0.084, humankapital 0.073, schule 0.062, altruismus 0.061
Retrieval with dictionaries
Vector Space Model
Retrieval with dictionaries
Applying length normalization
Applying contextual similarity
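The scoring formulas themselves are not given in this text, so the following is only a generic vector space sketch under stated assumptions: the query is the weighted dictionary, a document is its term-frequency vector, and length normalization divides by the Euclidean norm of that vector. The function names and toy documents are hypothetical, and the contextual similarity component is not modeled.

```python
import math
from collections import Counter


def vsm_score(tokens, term_weights):
    """Dictionary-weighted term frequency sum, normalized by document vector length."""
    tf = Counter(tokens)
    raw = sum(weight * tf[term] for term, weight in term_weights.items())
    # ASSUMPTION: Euclidean norm of the tf vector as length normalization.
    norm = math.sqrt(sum(count * count for count in tf.values()))
    return raw / norm if norm else 0.0


def rank_documents(documents, term_weights):
    """Return document ids sorted by descending score (ties broken by id)."""
    scored = [(vsm_score(tokens, term_weights), doc_id) for doc_id, tokens in documents.items()]
    return [doc_id for _, doc_id in sorted(scored, key=lambda pair: (-pair[0], pair[1]))]


# Two dictionary weights from the example study; the documents are toy data.
term_weights = {"einkomm": 0.353946, "preis": 0.344253}
documents = {
    "a": ["einkomm", "preis", "einkomm"],
    "b": ["wetter", "sport"],
    "c": ["preis", "wetter", "wetter", "wetter"],
}

ranking = rank_documents(documents, term_weights)
print(ranking)
```

Without normalization, long documents would collect high scores simply by containing more tokens.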
Example study

score    length  year  title
347.22   685     1977  Pro und kontra Mehrwertsteuer [Pros and cons of VAT]
321.81   662     1973  Oelkrise und Konjunktur [Oil crisis and economy]
290.48   705     1966  Energie muß billig sein [Energy has to be cheap]
289.34   687     1977  Die Steuern senken [Lower the taxes]
287.26   845     1964  Korrektur der Einkommensteuer [Correction of income tax]
281.07   687     1971  Die Bauern im Nacken [The farmers at the neck]
279.74   884     1965  Was ist uns die Mark wert? [What is the „Mark“ worth to us?]
272.75   682     1970  Steuern mit der Steuer [Governing with taxes]
264.82   719     1971  Ohne Abkühlung keine Stabilität [No stability without slowdown]
262.81   671     1973  Das sicherste Mittel [The most secure instrument]
261.33   707     1972  Entlastung – wovon? [Relief – of what?]
254.97   676     1979  Das Fernsehen und die Angst [Television and fear]
254.93   704     2011  Nicht ernst gemeint: die Quote [Not meant seriously: the quota]
251.53   457     1977  Eine Konfliktstrategie der Union [A conflict strategy of the EU]
Evaluation I
2 purposes for evaluation
― 1) determining optimal α
― 2) assessing quality of
    - score_context vs. score_VSM
    - topic models vs. tf-idf for dictionary creation
no gold standard data set for this retrieval task
→ 2 alternative approaches
Approach 1: Generating pseudorels by data fusion (Nuray/Can 2006):
― create set of „pseudorelevant“ documents from best ranked documents of most distinctive retrieval systems
― consider tf-idf / topic model + different α values as different „systems“
― evaluate „mean average
Evaluation I
4 most distinctive systems:
― D_tf-idf + score_VSM [α=0]
― D_tf-idf + score_context [tf(w,s)=0]
― D_TopicModel + score_VSM [α=0]
― D_TopicModel + score_context [tf(w,s)=0]
→ 54 documents as
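The pseudo-relevance idea can be sketched as follows: fuse the top ranks of several systems, treat the best fused documents as pseudorelevant, and score each system against that set. The sketch uses a Borda count as the fusion method, which is an assumption for illustration and not necessarily the variant used in the study. System names, rankings, and cut-offs are hypothetical.

```python
from collections import defaultdict


def borda_fusion(rankings, depth):
    """Fuse rankings: rank r (0-based) within the top `depth` earns depth - r points."""
    points = defaultdict(int)
    for ranking in rankings:
        for position, doc_id in enumerate(ranking[:depth]):
            points[doc_id] += depth - position
    return sorted(points, key=lambda doc_id: (-points[doc_id], doc_id))


def average_precision(ranking, relevant):
    """Mean of precision values at the ranks where a relevant document occurs."""
    hits, precision_sum = 0, 0.0
    for position, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / position
    return precision_sum / len(relevant) if relevant else 0.0


# Toy rankings (hypothetical) from three retrieval "systems".
rankings = {
    "sys_a": ["d1", "d2", "d3", "d4"],
    "sys_b": ["d2", "d1", "d5", "d3"],
    "sys_c": ["d4", "d2", "d1", "d6"],
}

fused = borda_fusion(list(rankings.values()), depth=3)
pseudorels = set(fused[:2])  # best fused documents act as pseudo-relevant set
ap = {name: average_precision(r, pseudorels) for name, r in rankings.items()}
print(fused, ap)
```

Averaging the per-system values over several queries or settings would give a mean average precision per system.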
Evaluation II
Approach 2: Precision at k
― evaluation of example retrieval by domain experts
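Precision at k is the share of the k best-ranked documents that the experts judge relevant. A minimal sketch, with hypothetical document ids and judgments:

```python
def precision_at_k(ranking, judged_relevant, k):
    """Fraction of the top-k ranked documents that were judged relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranking[:k]
    return sum(1 for doc_id in top_k if doc_id in judged_relevant) / k


# Toy ranking and expert judgments (hypothetical).
ranking = ["d1", "d2", "d3", "d4", "d5"]
judged_relevant = {"d1", "d3", "d4"}

p_at_2 = precision_at_k(ranking, judged_relevant, 2)
p_at_5 = precision_at_k(ranking, judged_relevant, 5)
print(p_at_2, p_at_5)
```

Only the top k documents need expert judgments, which keeps the manual effort bounded even for a very large target collection.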
Conclusion
1. improvement of retrieval results by mixing of unigram and co-occurrence information from reference collections
2. further improvement by dictionary extraction with topic models
approach enables domain experts to query large collections for texts representing rather abstract domain knowledge
Literature
Alsumait, L., Barbara, D., Gentle, J., Domeniconi, C.: Topic significance ranking of LDA generative models. In: ECML/PKDD '09: Part I. pp. 67-82 (2009)
Biemann, C., Heyer, G., Quasthoff, U., Richter, M.: The Leipzig Corpora Collection. Monolingual corpora of standard size. In: Corpus Linguistics 2007 (2007)
Billhardt, H., Borrajo, D., Maojo, V.: Using term co-occurrence data for document indexing and retrieval. In: Proceedings of the 22nd IRSG. pp. 105-117 (2000)
Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. Journal of Machine Learning Research 3, 993-1022 (2003)
Bordag, S.: A comparison of co-occurrence and similarity measures as simulations of context. In: Proceedings of the 9th CICLing. pp. 52-63 (2008)
Krippendorff, K.: Content analysis: An introduction to its methodology. SAGE, 3rd edn. (2013)
Nuray, R., Can, F.: Automatic ranking of information retrieval systems using data fusion. Information Processing & Management 42(3), 595-614 (2006)
Peat, H., Willett, P.: The limitations of term co-occurrence data for query expansion in document retrieval