Constructing a Class-Based Lexical Dictionary using Interactive Topic Models

(1)

Constructing a Class-Based Lexical Dictionary

using Interactive Topic Models

Kugatsu Sadamitsu, Kuniko Saito, Kenji Imamura and Yoshihiro Matsuo

NTT Cyber Space Laboratories, NTT Corporation 1-1 Hikarinooka, Yokosuka-shi, Kanagawa, 239-0847, Japan

{sadamitsu.kugatsu, saito.kuniko, imamura.kenji, matsuo.yoshihiro}@lab.ntt.co.jp

Abstract

This paper proposes a new method of constructing arbitrary class-based related word dictionaries on interactive topic models; we assume that each class is described by a topic. We propose a new semi-supervised method that uses the simplest topic model yielded by the standard EM algorithm; model calculation is very rapid. Furthermore our approach allows a dictionary to be modified interactively and the final dictionary has a hierarchical structure.

This paper makes three contributions. First, it proposes a word-based semi-supervised topic model. Second, we apply the semi-supervised topic model to interactive learning; this approach is called the Interactive Topic Model. Third, we propose a score function; it extracts the related words that occupy the middle layer of the hierarchical structure. Experiments show that our method can appropriately retrieve the words belonging to an arbitrary class.

Keywords:Interactive Topic Models, Interactive Unigram Mixtures, Lexical dictionary

1. Introduction

Many NLP applications, such as recommendation engines, demand a class-based related word dictionary. The related words are broadly similar word set about one class, e.g. for the class ”Car”, the related word set includes{Toyota, engine, hybrid}et cetera. However, class definitions are of-ten changed in response to application or user needs. One application assigns wordSUV and wordmotorcycleto the same class (related), while another differentiates them (not related). Although the hierarchical or graph based struc-ture (e.g. WordNet) or multi-labeling (e.g. Wikipedia) can provide a rough solution, there is no universal definition of hierarchical structure or multi-labeling that can satisfy ev-ery application.

This paper describes a way of constructing arbitrary class-based word dictionaries on semi-supervised topic models trained against non-annotated text corpus; we assume that each class is described by a topic. Because unsupervised topic models lack control by human intent, we adapt semi-supervised learning using a few semi-supervised words and in-teractive learning.

This paper makes three contributions. First, it proposes the word-based semi-supervised topic model. In a previ-ous study about semi-supervised methods (Nigam et al., 2000), document labels are given as supervised data. Be-cause our purpose is constructing a dictionary, it is more naturally that we use words as clues for classes. Our semi-supervised method uses the simplest topic model yielded by the standard EM algorithm, so model calculation is very rapid.

Second, we apply the semi-supervised topic model to in-teractive learning. After learning topic models that include the defined class “Finance”, the user might think that “ Fi-nance” topic should be split into “Bank” and “Insurance”. Our interactive methods allow such user’s interaction. Fur-thermore, our models not only modify the output of topic

Figure 1: The abstract illustration of hierarchical topic structure and correspondence sections. The words in col-ored boxes are supervised words and the words in white boxes are extracted words by our method.

models, but also construct a hierarchical structure.

Third, we propose a score function; it extracts the related words that occupy the bottom or middle layer of the hier-archical structure. Experiments show that our method can appropriately retrieve the words belonging to an arbitrary class. Figure 1 illustrates the abstraction of our topic mod-els and the relevant sections.

2. Construct arbitrary class-based word

dictionaries using Interactive Topic

Models

(2)

(3)

(4)

Table 1: Results: precision of “Food” class.

class Feedstuff Sugar Milling Food oil Alcohol Bread/Snack Ham Flavoring Dairy Others

#seed 4 7 6 3 4 17 8 17 4 30

method L1 L2 L1 L2 L1 L2 L1 L2 L1 L2 L1 L2 L1 L2 L1 L2 L1 L2 L1 L2

@10 1.0 1.0 0.5 0.7 0.9 0.8 0.3 0.2 1.0 1.0 1.0 1.0 0.4 0.9 0.3 0.8 0.2 0.9 0.3 1.0

@50 0.9 1.0 - 0.64 0.88 0.88 - - 0.9 1.0 1.0 0.9 - 0.68 - 0.86 - 0.9 - 1.0

@100 0.7 0.95 - 0.52 0.54 0.54 - - 0.7 0.95 0.75 0.9 - - - 0.73 - 0.5 - 0.85

Table 2: Results: precision of “Finance” class.

class City Bank Local Bank Trust Bank Brokerage Insurance Lease Leased immovables Sold immovables

#seed 4 95 3 16 3 29 13 24

method L1 L2 L1 L2 L1 L2 L1 L2 L1 L2 L1 L2 L1 L2 L1 L2

@10 0.4 1.0 0.5 1.0 0.6 1.0 0.8 0.9 0.9 0.7 0.0 0.0 1.0 0.3 1.0 0.4

@50 - 1.0 - 0.7 0.72 0.8 0.56 0.88 0.98 0.84 - - 0.4 - 0.8

-@100 - 0.95 - 0.75 0.61 0.55 0.58 0.69 0.79 0.62 - - - - 0.6

-Figure 2: The experimental settings of semi-supervised Un-igram Mixtures with a single layer structure (upper side) and Interactive Unigram Mixtures with a hierarchical struc-ture (lower side).

function, at first, we define the parent variance and child variance as

σp(v) = (p(v|z)−p(v|zp))2 (13) σc(v) =

∑

zc

(p(v|zc)−p(v|z))2, (14)

where z is the focusing middle layer topic and zp is the

parent topic of topiczandzc is the child topic ofz. The

parent variance represents the characteristics of differencez from the parent topiczpwhile the child variance represents

the common characteristics of topic zand its child topics zc.

Finally, we use the score function,

scorem(v, z) = σp σc

, (15)

for ranking and extracting related words for middle layer topics. This function give high score to the word whose pa-rameter is far from parent topic papa-rameter (large variance) and which appear equally in the child topics (small vari-ance).

3. Experiments and Results

In this section, we examine the dictionaries constructed by the proposed topic model with pre-defined structure. For

Table 3: Results: precision of middle layer classes.

class Food Finance

#seed 100 187

method BL Var. BL Var.

@10 0.0 0.7 0.0 0.5

@50 - 0.74 - 0.4

@100 - 0.62 - 0.0

confirming the success of the proposed interactive update process, we use supervised data whose hierarchical struc-ture consists of 1st and 2nd layer classes. We compare SSUM with a single layer structure (L1) and IUM with a hi-erarchical layer structure (L2). This experimental settings are listed in Figure 2.

The supervised resource is our in-house thesaurus about Japanese companies; it has a hierarchical style with 1st and 2nd layer classes. The 1st layer of the thesaurus has 22 classes and the 2nd layer has 135 classes extended from 17 classes in the 1st layer. A few supervised words are given for all topics belonging to 2nd layer topics. The numbers of supervised words are partially listed in the 2nd row of Table 1 and Table 2 (#seed) as examples of 2nd layer classes in-cluded in 1st layer classes; “Food” and “Finance”. The su-pervised words in the first layer are duplicated and merged from second layer topics. For SSUM with a single layer, the 1st layer from the thesaurus is ignored for the IUM situa-tion and only 2nd layer topics are modeled using 2nd layer supervised words. In the case of using IUM, the 1st layer is modeled with SSUM and the 2nd layer is modeled with IUM. This process is reasonable as the interactive update of the 2nd layer given that the 1st layer topics have exces-sively coarse grain and the intent to separate the 2nd layer classes.

(5)

(6)

InProceedings of the COLING-ACL Conference, pages 409–413.

Thomas Hofmann. 1999. Probabilistic latent semantic in-dexing. InProceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, pages 50–57.

Yuening Hu and Jordan Boyd-graber. 2011. Interactive Topic Modeling. InProceedings of the 49th ACL-HLT, pages 248–257.

Satoru Ikehara, Masahiro Miyazaki, Satoshi Shirai, Akio Yokoo, Hiromi Nakaiwa, Kentaro Ogura, Yoshifumi Ooyama, and Yoshihiko Hayashi, editors. 1997. Ni-hongo Goi-Taikei - a Japanese Lexicon (in Japanese). Iwanami Shoten.

Masaaki Nagata, Yumi Shibaki, and Kazuhide Yamamoto. 2010. Using goi-taikei as an upper ontology to build a large-scale japanese ontology from wikipedia. In Pro-ceedings of the 6th Workshop on Ontologies and Lexical Resources, COLING, pages 11–18.

Kamal Nigam, Andrew K Mccallum, Sebastian Thrun, and Tom Mitchell. 2000. Text Classification from Labeled and Unlabeled Documents using EM. Machine Learn-ing, 39(2):103–134.

Simone Paolo Ponzetto and Michael Strube. 2007. Deriv-ing a large scale taxonomy from Wikipedia. In Proceed-ings of the 22nd Conference on Advancement of Artificial intelligence (AAAI), pages 1440–1445.

Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. 2007. YAGO: A core of semantic knowledge unifying WordNet and Wikipedia. InProceedings of the 16th International World Wide Web (WWW) Conference, pages 697–706.