Integrating Semantic Knowledge to Tackle Zero shot Text Classification

(1)

1031

Integrating Semantic Knowledge to Tackle Zero-shot Text Classification

Jingqing Zhang∗ Data Science Institute Imperial College London

London, UK

Piyawat Lertvittayakumjorn∗ Department of Computing

Imperial College London London, UK

{jingqing.zhang15,pl1515,y.guo}@imperial.ac.uk

Yike Guo Data Science Institute Imperial College London

London, UK

Abstract

Insufficient or even unavailable training data of emerging classes is a big challenge of many classification tasks, including text classifica-tion. Recognising text documents of classes that have never been seen in the learning stage, so-calledzero-shot text classification, is there-fore difficult and only limited previous works tackled this problem. In this paper, we pro-pose a two-phase framework together with data augmentation and feature augmentation to solve this problem. Four kinds of semantic knowledge (word embeddings, class descrip-tions, class hierarchy, and a general knowl-edge graph) are incorporated into the pro-posed framework to deal with instances of un-seen classes effectively. Experimental results show that each and the combination of the two phases achieve the best overall accuracy com-pared with baselines and recent approaches in classifying real-world texts under the zero-shot scenario.

1 Introduction

As one of the most fundamental problems in ma-chine learning, automatic classification has been widely studied in several domains. However, many approaches, proven to be effective in tradi-tional classification tasks, cannot catch up with a dynamic and open environment where new classes can emerge after the learning stage (Romera-Paredes and Torr,2015). For example, the number of topics on social media is growing rapidly, and the classification models are required to recognise the text of the new topics using only general in-formation (e.g., descriptions of the topics) since labelled training instances are unfeasible to ob-tain for each new topic (Lee et al., 2011). This scenario holds in many real-world domains such

∗

Piyawat Lertvittayakumjorn and Jingqing Zhang con-tributed equally to this project.

as object recognition and medical diagnosis (Xian et al.,2017;World Health Organization,1996).

Zero-shot learning (ZSL) for text classification aims to classify documents of classes which are absent from the learning stage. Although it is challenging for a machine to achieve, humans are able to learn new concepts by transferring knowl-edge from known to unknown domains based on high-level descriptions and semantic representa-tions (Thrun and Pratt,1998). Therefore, without labelled data of unseen classes, a zero-shot learn-ing framework is expected to exploit supportive semantic knowledge (e.g., class descriptions, rela-tions among classes, and external domain knowl-edge) to generally infer the features of unseen classes using patterns learned from seen classes.

(2)

A collection of classifiers

Phase 1: Coarse-grained Classification

Data augmentation

A traditional classifier

Phase 2: Fine-grained Classification

A zero-shot classifier Feature

augmentation 𝑦"_#∈ 𝒞_𝒮

𝑦"_#∉ 𝒞_𝒮 if

𝑥_# 𝑦"_#

Refinement

𝑣_*#

𝑣*#; 𝑣,; 𝑣*,,#

A classifier for𝑐/∈ 𝒞𝒮

A classifier for𝑐|𝒞𝒮|∈ 𝒞𝒮

[image:2.595.75.529.60.212.2]

𝑣_*#

Figure 1: The overview of the proposed framework with two phases. The coarse-grained phase judges if an input documentxicomes from seen or unseen classes. The fine-grained phase finally decides the classyˆi. All notations are defined in section2.1-2.2.

the testing set. Hence, these methods are not work-ing under a strict zero-shot scenario.

To tackle the zero-shot text classification prob-lem, this paper proposes a novel two-phase frame-work together with data augmentation and fea-ture augmentation (Figure 1). In addition, four kinds of semantic knowledge including word em-beddings, class descriptions, class hierarchy, and a general knowledge graph (ConceptNet) are ex-ploited in the framework to effectively learn the unseen classes. Both of the two phases are based on convolutional neural networks (Kim, 2014). The first phase called coarse-grained classifica-tion judges if a document is from seen or un-seen classes. Then, the second phase, named

fine-grained classification, finally decides its class.

Note that all the classifiers in this framework are trained using labelled data of seen classes (and augmented text data) only. None of the steps learns from the labelled data of unseen classes.

The contributions of our work can be sum-marised as follows.

• We propose a novel deep learning based two-phase framework, including coarse-grained and fine-grained classification, to tackle the zero-shot text classification problem. Un-like some previous works, our framework does not require semantic correspondence be-tween classes in a training stage and classes in an inference stage. In other words, the seen and unseen classes can be clearly different.

• We propose a novel data augmentation tech-nique called topic translation to strengthen the capability of our framework to detect doc-uments from unseen classes effectively.

• We propose a method to perform feature augmentation by using integrated semantic knowledge to transfer the knowledge learned from seen to unseen classes in the zero-shot scenario.

In the remainder of this paper, we firstly explain our proposed zero-shot text classification frame-work in section2. Experiments and results, which demonstrate the performance of our framework, are presented in section3. Related works are dis-cussed in section4. Finally, section 5 concludes our work and mentions possible future work.

2 Methodology

2.1 Problem Formulation

Let CS and CU be disjoint sets of seen and

unseen classes of the classification respec-tively. In the learning stage, a training set

{(x1, y1), . . . ,(xn, yn)} is given where xi is the

i-th document containing a sequence of words

[w₁i, w₂i, . . . , w_ti] andyi ∈ CS is the class of xi.

In the inference stage, the goal is to predict the class of each document,yˆi, in a testing set which

has the same data format as the training set ex-cept that yi comes from CS ∪ CU. Note that (i)

every class comes with a class label and a class de-scription (Figure2a); (ii)a class hierarchy show-ing superclass-subclass relationships is also pro-vided (Figure2b);(iii)the documents from unseen classes cannot be observed to train the framework.

2.2 Overview and Notations

(3)

education

research university

Related To

At Location At Location

Related To

Used For

science Related To

Used For

Related To

book reading Used For

Related To

Has Prerequisite

Causes

Used For Used For

Class labels Class descriptions

Company an organization that provides services, or that makes or sells goods for money

Educational Institution

a place where people of different ages gain an education.

Artist someone who makes paintings, sculptures etc

(a)

(b)

(c)

Used For Thing

Agent Work Place

Village Building Natural Person Organisation Written _work Film Album

Ofﬁce

holder Athlete Artist Company Educational

[image:3.595.74.292.61.312.2]

institution

Figure 2: Illustrations of semantic knowledge inte-grated into our framework: (a) class labels and class descriptions (b) class hierarchy and (c) a subgraph of the general knowledge graph (ConceptNet).

(Figure1). The first phase, coarse-grained classifi-cation, predicts whether an input document comes from seen or unseen classes. We also apply a data augmentation technique in this phase to help the classifiers be aware of the existence of unseen classes without accessing their real data. Then the second phase, fine-grained classification, fi-nally specifies the class of the input document. It uses either a traditional classifier or a zero-shot classifier depending on the coarse-grained predic-tion given by Phase 1. Also, feature augmentapredic-tion based on semantic knowledge is used to provide additional information which relates the document and the unseen classes to generalise the zero-shot reasoning.

We use the following notations in Figure1and throughout this paper.

• The list of embeddings of each word in the document xi is denoted by viw = [v_wi₁, v_wi₂, . . . , vi_w_t].

• The embedding of each class label c is de-noted byvc,∀c∈ CS∪ CU. It is assumed that

each class has a one-word class label. If the class label has more than one word, a similar one-word class label is provided to findvc.

• As augmented features, the relationship

vec-torv_wi_j_,c shows the degree of relatedness be-tween the wordwj and the classcaccording

to semantic knowledge. Hence, the list of re-lationship vectors between each word in xi

and each classc∈ CS∪ CUis denoted byvw,ci = [v_wi₁_,c, vi_w₂_,c, . . . , v_wi_t_,c]. We will explain the construction method in section2.4.1.

2.3 Phase 1: Coarse-grained Classification

Given a documentxi, Phase 1 performs a binary

classification to decide whetheryˆi ∈ CS oryˆi ∈/ CS. In this phase, each seen classcs ∈ CS has its

own CNN classifier (with a subsequent dense layer and a sigmoid output) to predict the confidence thatxicomes from the classcs, i.e.,p(ˆyi =cs|xi).

The classifier usesv_wi as an input and it is trained using a binary cross entropy loss with all docu-ments of its class in the training set as positive ex-amples and the rest as negative exex-amples.

For a test document xi, this phase computes

p(ˆyi=cs|xi)for every seen classcsinCS. If there

exists a class cs such that p(ˆyi = cs|xi) > τs,

it predictsyˆi ∈ CS; otherwise, yˆi ∈ C/ S. τs is a

classification threshold for the classcs, calculated

based on the threshold adaptation method from (Shu et al.,2017).

2.3.1 Data Augmentation

During the learning stage, the classifiers in Phase 1 use negative examples solely from seen classes, so they may not be able to differentiate the positive class from unseen classes. Hence, when the names of unseen classes are known in the inference stage, we try to introduce them to the classifiers in Phase 1 via augmented data so they can learn to reject the instances likely from unseen classes. We do data augmentation by translating a document from its original seen class to a new unseen class using analogy. We call this processtopic translation.

In the word level, we translate a word w in a document of classc to a corresponding word w0

(4)

candidates ofw0:

w0 = argmax x∈V

cos(x, c0) cos(x, w) cos(x, c) +

whereV is a vocabulary set andcos(a, b)is a co-sine similarity score between the vectors of worda

and wordb. Also,is a small number (i.e., 0.001) added to prevent division by zero.

In the document level, we follow Algorithm 1 to translate a document of class c into the topic of another class c0. To explain, we translate all nouns, verbs, adjectives, and adverbs in the given document to the target class, word-by-word, using the word-level analogy. The word to replace must have the same part of speech as the original word and all the replacements in one document are 1-to-1 relations, enforced by replace dict in Algorithm 1. With this idea, we can create augmented doc-uments for the unseen classes by topic-translation from the documents of seen classes in the training dataset. After that, we can use the augmented doc-uments as additional negative examples for all the CNNs in Phase 1 to make them aware of the tone of unseen classes.

Algorithm 1:Document-level topic translation

Input :a documentxi= [wi1, w2i, . . . , wti],

an original class labelc, a target class labelc0

Output:a translated documentx0i 1 replace dict =dict();x0i= []; 2 foreachw∈xido

3 ifis valid pos(w)then 4 ifw /∈replace dictthen

5 cands = solve analogy(w,c,c0,

top k=20);

6 forj= 0tolen(cands)-1do

7 ifcands[j]∈/replace dict.values()∧ pos of(w)∈pos list(cands[j])

then

8 replace dict[w] = cands[j];

9 break;

10 ifj== len(cands)then 11 x0i.append(w);

12 continue;

13 x0_i.append(replace dict[w]);

14 else

15 x0_i.append(w); 16 returnx0i

2.4 Phase 2: Fine-grained Classification

Phase 2 decides the most appropriate classyˆi for

xi using two CNN classifiers: a traditional

classi-fier and a zero-shot classiclassi-fier as shown in Figure 1. Ifyˆi ∈ CSpredicted by Phase 1, the traditional

classifier will finally select a classcs ∈ CS asyˆi.

Otherwise, ifyˆi ∈ C/ S, the zero-shot classifier will

be used to select a classcu ∈ CU asyˆi.

The traditional classifier and the zero-shot clas-sifier have an identical CNN-based structure fol-lowed by two dense layers but their inputs and outputs are different. The traditional classifier is a multi-class classifier (|CS|classes) with a softmax

output, so it requires only the word embeddings

vi_w as an input. This classifier is trained using a cross entropy loss with a training dataset whose examples are from seen classes only.

In contrast, the zero-shot classifier is a binary classifier with a sigmoid output. Specifically, it takes a text documentxi and a class c as inputs

and predicts the confidence p(ˆyi = c|xi).

How-ever, in practice, we utilisevi_w to representxi,vc

to represent the classc, and also augmented fea-tures vi

w,c to provide more information on how

intimate the connections between words and the classcare. Altogether, for each wordwj, the

clas-sifier receives the concatenation of three vectors (i.e., [vi_w_j;vc;v_wi_j_,c]) as an input. This classifier is trained using a binary cross entropy loss with a training data from seen classes only, but we ex-pect this classifier to work well on unseen classes thanks to the distinctive patterns ofv_w,ci in positive examples of every class. This is how we transfer knowledge from seen to unseen classes in ZSL.

2.4.1 Feature Augmentation

The relationship vectorvwj,c contains augmented

features we input to the zero-shot classifier. vwj,c

shows how the wordwjand the classcare related

considering the relations in a general knowledge graph. In this work, we use ConceptNet providing general knowledge of natural language words and phrases (Speer and Havasi,2013). A subgraph of ConceptNet is shown in Figure 2c as an illustra-tion. Nodes in ConceptNet are words or phrases, while edges connecting two nodes show how they are related either syntactically or semantically.

We firstly represent a class c as three sets of nodes in ConceptNet by processing the class hi-erarchy, class label, and class description ofc. (1)

the class nodesis a set of nodes of the class

(5)

Con-ceptNet nodes for this class are:

(1) educational institution, educational, institution (2) organization, agent

(3) place, people, ages, education.

To construct vwj,c, we consider whether the

wordwjis connected to the members of the three

sets above within K hops by particular types of relations or not1. For each of the three sets, we construct a vector with3K+ 1dimensions.

• v[0] = 1ifwjis a node in that set; otherwise,

v[0] = 0.

• fork= 0, . . . , K−1:

– v[3k+ 1] = 1if there is a node in the set whose shortest path to wj isk+ 1.

Otherwise,v[3k+ 1] = 0.

– v[3k+ 2]is the number of nodes in the set whose shortest path towj isk+ 1.

– v[3k+3]isv[3k+2]divided by the total number of nodes in the set.

Thus, the vector associated to each set shows how

wj is semantically close to that set. Finally, we

concatenate the constructed vectors from the three sets to becomevwj,cwith3×(3K+1)dimensions.

3 Experiments

3.1 Datasets

We used two textual datasets for the experiments. The vocabulary size of each dataset was limited by 20,000 most frequent words and all numbers were excluded. (1) DBpedia ontology dataset (Zhang et al., 2015) includes 14 non-overlapping classes and textual data collected from Wikipedia. Each class has 40,000 training and 5,000 testing sam-ples. (2) The20newsgroupsdataset2_{has 20}

top-ics each of which has approximately 1,000 docu-ments. 70% of the documents of each class were randomly selected for training, and the remaining 30% were used as a testing set.

3.2 Implementation Details3

In our experiments, two different rates of unseen classes, 50% and 25%, were chosen and the corre-sponding sizes ofCSandCU are shown in Table1.

For each dataset and each unseen rate, the random

1

In this paper, we only consider the most common types of positive relations which areRelatedTo, IsA,PartOf, and AtLocation. They cover∼60% of all edges in ConceptNet.

2

http://qwone.com/∼jason/20Newsgroups/

3_{Code: https://github.com/JingqingZ/KG4ZeroShotText.}

selection of (CS,CU) were repeated ten times and

these ten groups were used by all the experiments with this setting for a fair comparison. All doc-uments fromCU were removed from the training

set accordingly. Finally, the results from all the ten groups were averaged.

In Phase 1, the structure of each classifier was identical. The CNN layer had three filter sizes [3, 4, 5] with 400 filters for each filter size and the subsequent dense layer had 300 units. For data augmentation, we used gensim with an implemen-tation of 3COSMUL(Reh˚uˇrek and Sojka,ˇ 2010) to solve the word-level analogy (line 5 in Algorithm 1). Also, the numbers of augmented text docu-ments per unseen class for every setting (if used) are indicated in Table1. These numbers were set empirically considering the number of available training documents to be translated.

In Phase 2, the traditional classifier and the zero-shot classifier had the same structure, in which the CNN layer had three filter sizes [2, 4, 8] with 600 filters for each filter size and the two intermediate dense layers had 400 and 100 units respectively. For feature augmentation, the max-imum path length K in ConceptNet was set to 3 to create the relationship vectors4. The DBpe-dia ontology5 was used to construct a class hier-archy of the DBpedia dataset. The class hierar-chy of the 20newsgroups dataset was constructed based on the namespaces initially provided by the dataset. Meanwhile, the classes descriptions of both datasets were picked from Macmillan Dictio-nary6as appropriate.

For both phases, we used 200-dim GloVe vec-tors7 for word embeddings vw and vc

(Penning-ton et al., 2014). All the deep neural networks were implemented with TensorLayer (Dong et al., 2017a) and TensorFlow (Abadi et al.,2016).

Dataset Unseen

rate | CS| | CU|

#Augmented docs percu

DBpedia 25% 11 3 12,000 (14 classes) 50% 7 7 8,000

20news 25% 15 5 4,000

[image:5.595.308.533.592.656.2]

(20 classes) 50% 10 10 3,000

Table 1: The rates of unseen classes and the numbers of augmented documents (per unseen class) in the ex-periments

4

Based on our observation, most of the related words stay within 3 hops from the class nodes in ConceptNet.

5

http://mappings.dbpedia.org/server/ontology/classes/

6

https://www.macmillandictionary.com/

(6)

3.3 Baselines and Evaluation Metrics

We compared each phase and the overall frame-work with the following approaches and settings.

Phase 1: Proposed by (Shu et al.,2017),DOC

is a state-of-the-art open-world text classification approach which classifies a new sample into a seen class or “reject” if the sample does not belong to any seen classes. The DOC uses a single CNN and a 1-vs-rest sigmoid output layer with thresh-old adjustment. Unlike DOC, the classifiers in the proposed Phase 1 work individually. However, for a fair comparison, we used DOC only as a binary classifier in this phase (yˆi∈ CS oryˆi ∈ C/ S).

Phase 2: To see how well the augmented

fea-turevw,c work in ZSL, we ran the zero-shot

clas-sifier withdifferent combinations of inputs. Par-ticularly, five combinations of vw, vc, and vw,c

were tested with documents from unseen classes only (traditional ZSL).

The whole framework: (1) Count-based

modelselected the class whose label appears most

frequently in the document asyˆi. (2)Label

simi-larity (Sappadla et al., 2016) is an unsupervised

approach which calculates the cosine similarity between the sum of word embeddings of each class label and the sum of word embeddings of every n-gram (n = 1,2,3) in the document. We adopted this approach to do single-label classifi-cation by predicting the class that got the highest similarity score among all classes. (3)RNN

Au-toEncoder was built based on a Seq2Seq model

with LSTM (512 hidden units), and it was trained to encode documents and class labels onto the same latent space. The cosine similarity was ap-plied to select a class label closest to the document on the latent space. (4) RNN+FC refers to the architecture 2 proposed in (Pushp and Srivastava, 2017). It used an RNN layer with LSTM (512 hid-den units) followed by two hid-dense layers with 400 and 100 units respectively. (5)CNN+FCreplaced the RNN in the previous model with a CNN, which has the identical structure as the zero-shot classi-fier in Phase 2. Both RNN+FC and CNN+FC pre-dicted the confidencep(ˆyi = c|xi) givenvw and

vc. The class with the highest confidence was

se-lected asyˆi.

For Phase 1, we used the accuracy for binary classification (y,yˆi ∈ CS or y,yˆi ∈ C/ S) as an

evaluation metric. In contrast, for Phase 2 and the whole framework, we used the multi-class classi-fication accuracy (yˆi=yi) as a metric.

Artist Building Office holder Written work 0.975

[image:6.595.315.518.60.195.2]

0.980 0.985 0.990 0.995 1.000

Figure 3: The distributions of confidence scores of pos-itive examples from four seen classes of DBpedia in Phase 1.

3.4 Results and Discussion

The evaluation of Phase 1(coarse-grained

classi-fication) checks if eachxiwas correctly delivered

to the right classifier in Phase 2. Table 3 shows the performance of Phase 1 with and without aug-mented data compared with DOC. Considering test documents from seen classes only, our frame-work outperformed DOC on both datasets. In addition, the augmented data improved the accu-racy of detecting documents from unseen classes clearly and led to higher overall accuracy in every setting. Despite no real labelled data from unseen classes, the augmented data generated by topic translation helped Phase 1 better detect documents from unseen classes. Table4 shows some exam-ples of augmented data from the DBpedia dataset. Even if they are not completely understandable, they contain the tone of the target classes.

Although Phase 1 provided confidence scores for all seen classes, we could not use them to pre-dictyˆi directly since the distribution of scores of

positive examples from different CNNs are dif-ferent. Figure 3 shows that the distribution of confidence scores of the class “Artist” had a no-ticeably larger variance and was clearly different from the class “Building”. Hence, even ifp(ˆyi =

“Building”|xi) > p(ˆyi = “Artist”|xi), we cannot

conclude thatxi is more likely to come from the

class “Building”. This is whya traditional

clas-sifier in Phase 2 is necessary.

Regarding Phase 2, fine-grained classification is in charge of predicting yˆi and it employs two

classifiers which were tested separately. Assum-ing Phase 1 is perfect, the classifiers in Phase 2 should be able to find the right class. The purpose of Table 5 is to show that the traditional CNN

(7)

Dataset Unseen

rate yi Count-based

Label Similarity (Sappadla et al.,2016)

RNN Autoencoder

RNN + FC (Pushp and Srivastava,

2017)

CNN +

FC Ours

DBpedia

25%

seen 0.322 0.377 0.250 0.895 0.985 0.975 unseen 0.372 0.426 0.267 0.046 0.204 0.402 overall 0.334 0.386 0.254 0.713 0.818 0.852

50%

20news

25%

50%

seen 0.219 0.293 0.275 0.709 0.684 0.767

unseen 0.196 0.266 0.126 0.052 0.126 0.168 overall 0.207 0.280 0.200 0.381 0.405 0.469

Table 2: The accuracy of the whole framework compared with the baselines.

Dataset

Unseen rate yi DOC

Ours w/o aug.

Ours w/ aug.

DBpedia 25%

seen 0.980 0.982 0.982

unseen 0.471 0.388 0.536

overall 0.871 0.855 0.886

DBpedia 50%

seen 0.983 0.986 0.987

unseen 0.384 0.345 0.512

overall 0.684 0.666 0.749

20news 25%

seen 0.800 0.838 0.831 unseen 0.573 0.431 0.577

overall 0.745 0.754 0.770

20news 50%

seen 0.824 0.856 0.843 unseen 0.562 0.419 0.603

[image:7.595.79.521.61.230.2]

overall 0.694 0.639 0.724

Table 3: The accuracy of Phase 1 with and without aug-mented data compared with DOC .

Animal (Original)

Mitra perdulca is a species of sea snail a marine gastropod mollusk in the family Mitridae the miters or miter snails.

Animal → Plant

Arecaceae perdulca is a flowering of port aster a naval mollusk gastropod in the fabaceae Clusiaceae the tiliaceae or rock-ery amaryllis.

Animal → Athlete

Mira perdulca is a swimmer of sailing sprinter an Olympian limpets gastropod in the basketball Middy the miters or miter skater.

Table 4: Examples of augmented data translated from a document of the original class “Animal” into two target classes “Plant” and “Athlete”.

Besides, given test documents from unseen classes only, the performance of the zero-shot

classifier in Phase 2is shown in Table 6. Based

on the construction method, vw,c quantified the

relatedness between words and the class but, un-likevw andvc, it did not include detailed

seman-tic meaning. Thus, the classifier using vw,c only

could not find out the correct unseen class and nei-ther[vw;vw,c]and[vc;vw,c]could do. On the other

Dataset DBpedia 20news Input\Unseen rate 50% 25% 50% 25%

[image:7.595.75.287.270.415.2]

vw 0.993 0.992 0.878 0.861

Table 5: The accuracy of the traditional classifier in Phase 2 given documents from seen classes only.

Dataset DBpedia 20news Inputs\Unseen rate 50% 25% 50% 25%

Random guess 0.143 0.333 0.100 0.200

vw,c 0.154 0.443 0.104 0.210 [vc;vw,c] 0.163 0.400 0.099 0.215 [vw;vw,c] 0.266 0.460 0.122 0.307 [vw;vc] 0.381 0.711 0.274 0.431 [vw;vc;vw,c] 0.418 0.754 0.302 0.500

Table 6: The accuracy of the zero-shot classifier in Phase 2 given documents from unseen classes only.

hand, the combination of[vw;vc], which included semantic embeddings of both words and the class label, increased the accuracy of predicting unseen classes clearly. However, the zero-shot classifier fed by the combination of all three types of inputs

[vw;vc;vw,c]achieved the highest accuracy in all

settings. It asserts that the integration of semantic knowledge we proposed is an effective means for knowledge transfer from seen to unseen classes in the zero-shot scenario.

Last but most importantly, we compared the

whole framework with four baselines as shown

in Table2. First, the count-based model is a rule-based model so it failed to predict documents from seen classes accurately and resulted in unpleasant overall results. This was similar to the label simi-larity approach even though it had higher degree of flexibility. Next, the RNN Autoencoder was trained without any supervision sinceyˆi was

[image:7.595.307.528.351.441.2]

(8)

the implicit semantic relatedness between classes caused the failure of the RNN Autoencoder. Be-sides, the CNN+FC and RNN+FC had same in-puts and outin-puts and it was clear that CNN+FC performed better than RNN+FC in the experi-ment. However, neither CNN+FC nor RNN+FC was able to transfer the knowledge learned from seen to unseen classes. Finally, our two-phase framework has competitive prediction accuracy on unseen classes while maintaining the accuracy on seen classes. This made it achieve the highest overall accuracy on both datasets and both unseen rates. In conclusion, by using integrated seman-tic knowledge, the proposed two-phase framework with data and feature augmentation is a promising step to tackle this challenging zero-shot problem.

Furthermore, another benefit of the framework is high flexibility. As the modules in Figure1has less coupling to one another, it is flexible to im-prove or customise each of them. For example, we can deploy an advanced language understanding model, e.g., BERT (Devlin et al.,2018), as a tradi-tional classifier. Moreover, we may replace Con-ceptNet with a domain-specific knowledge graph to deal with medical texts.

4 Related Work

4.1 Zero-shot Text Classification

There are a few more related works to discuss be-sides recent approaches we compared with in the experiments (explained in section 3.3). Dauphin et al. (2013) predicted semantic utterance of texts by mapping class labels and text samples into the same semantic space and classifying each sam-ple to the closest class label. Nam et al. (2016) learned the embeddings of classes, documents, and words jointly in the learning stage. Hence, it can perform well in domain-specific classification, but this is possible only with a large amount of training data. Overall, most of the previous works exploited semantic relationships between classes and documents via embeddings. In contrast, our proposed framework leverages not only the word embeddings but also other semantic knowledge. While word embeddings are used to solve analogy for data augmentation in Phase 1, the other seman-tic knowledge sources (in Figure2) are integrated into relationship vectors and used as augmented features in Phase 2. Furthermore, our framework does not require any semantic correspondences be-tween seen and unseen classes.

4.2 Data Augmentation in NLP

In the face of insufficient data, data augmentation has been widely used to improve generalisation of deep neural networks especially in computer vi-sion (Krizhevsky et al., 2012) and multimodality (Dong et al.,2017b), but it is still not a common practice in natural language processing. Recent works have explored data augmentation in NLP tasks such as machine translation and text clas-sification (Saito et al., 2017;Fadaee et al., 2017; Kobayashi, 2018), and the algorithms were de-signed to preserve semantic meaning of an orig-inal document by using synonyms (Zhang and Le-Cun,2015) or adding noises (Xie et al.,2017), for example. In contrast, our proposed data augmen-tation technique translates a document from one meaning (its original class) to another meaning (an unseen class) by analogy in order to substitute un-available labelled data of the unseen class.

4.3 Feature Augmentation in NLP

Apart from improving classification accuracy, fea-ture augmentation is also used in domain adapta-tion to transfer knowledge between a source and a target domain (Pan et al.,2010b;Fang and Chi-ang,2018;Chen et al.,2018). An early research paper applying feature augmentation in NLP is Daume III(2007) which targeted domain adapta-tion on sequence labelling tasks. After that, fea-ture augmentation was used in several NLP tasks such as cross-domain sentiment classification (Pan et al., 2010a), multi-domain machine translation (Clark et al., 2012), semantic argument classifi-cation (Batubara et al., 2018), etc. Our work is different from previous works not only that we ap-plied this technique to zero-shot text classification but also that we integrated many types of semantic knowledge to create the augmented features.

5 Conclusion and Future Work

(9)

baselines and recent approaches in all settings. In the future, we plan to extend our framework to do multi-label classification with a larger amount of data, and also study how semantic units defined by linguists can be used in the zero-shot scenario.

Acknowledgments

We would like to thank Douglas McIlwraith, Nontawat Charoenphakdee, and three anonymous reviewers for helpful suggestions. Jingqing and Piyawat would also like to thank the support from the LexisNexisR

Risk Solutions HPCC SystemsR

academic program and Anandamahidol Founda-tion, respectively.

References

Martin Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. Ten-sorflow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265– 283.

Dina Khaira Batubara, Moch Arif Bijaksana, and Adi-wijaya. 2018.On feature augmentation for semantic argument classification of the quran english trans-lation using support vector machine. Journal of Physics: Conference Series, 971(1):012043.

Zitian Chen, Yanwei Fu, Yinda Zhang, Yu-Gang Jiang, Xiangyang Xue, and Leonid Sigal. 2018. Semantic feature augmentation in few-shot learning. CoRR, abs/1804.05298.

Jonathan H Clark, Alon Lavie, and Chris Dyer. 2012. One system, many domains: Open-domain statisti-cal machine translation via feature augmentation. In

Proceedings of the Conference of the Association for Machine Translation in the Americas.

Hal Daume III. 2007. Frustratingly easy domain adap-tation. In Proceedings of the 45th Annual Meet-ing of the Association of Computational LMeet-inguistics, pages 256–263. Association for Computational Lin-guistics.

Yann N Dauphin, Gokhan Tur, Dilek Hakkani-Tur, and Larry Heck. 2013. Zero-shot learning for semantic utterance classification. arXiv preprint arXiv:1401.0509.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understand-ing. arXiv preprint arXiv:1810.04805.

Hao Dong, Akara Supratak, Luo Mai, Fangde Liu, Axel Oehmichen, Simiao Yu, and Yike Guo. 2017a. Tensorlayer: a versatile library for efficient deep learning development. InProceedings of the 2017 ACM on Multimedia Conference, pages 1201–1204. ACM.

Hao Dong, Jingqing Zhang, Douglas McIlwraith, and Yike Guo. 2017b. I2t2i: Learning text to image syn-thesis with textual data augmentation. InImage Pro-cessing (ICIP), 2017 IEEE International Conference on, pages 2015–2019. IEEE.

Marzieh Fadaee, Arianna Bisazza, and Christof Monz. 2017. Data augmentation for low-resource neural machine translation. arXiv preprint arXiv:1705.00440.

Wen-Chieh Fang and Yi-Ting Chiang. 2018. A dis-criminative feature mapping approach to heteroge-neous domain adaptation. Pattern Recognition Let-ters, 106:13–19.

Rob Fergus, Hector Bernal, Yair Weiss, and Antonio Torralba. 2010. Semantic label sharing for learning with many categories. InComputer Vision – ECCV 2010, pages 762–775, Berlin, Heidelberg. Springer Berlin Heidelberg.

Y. Fu, T. Xiang, Y. Jiang, X. Xue, L. Sigal, and S. Gong. 2018. Recent advances in zero-shot recog-nition: Toward data-efficient understanding of vi-sual content. IEEE Signal Processing Magazine, 35(1):112–125.

Yoon Kim. 2014. Convolutional neural net-works for sentence classification. arXiv preprint arXiv:1408.5882.

Sosuke Kobayashi. 2018. Contextual augmentation: Data augmentation by words with paradigmatic re-lations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Tech-nologies, Volume 2 (Short Papers), pages 452–457. Association for Computational Linguistics.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hin-ton. 2012. Imagenet classification with deep con-volutional neural networks. In Advances in neural information processing systems, pages 1097–1105.

Christoph H Lampert, Hannes Nickisch, and Stefan Harmeling. 2009. Learning to detect unseen object classes by between-class attribute transfer. InIEEE Conference on Computer Vision and Pattern Recog-nition, pages 951–958. IEEE.

(10)

Omer Levy and Yoav Goldberg. 2014. Linguistic reg-ularities in sparse and explicit word representations. InProceedings of the eighteenth conference on com-putational natural language learning, pages 171– 180.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jef-frey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Jinseok Nam, Eneldo Loza Menc´ıa, and Johannes F¨urnkranz. 2016. All-in text: Learning document, label, and word representations jointly. InThirtieth AAAI Conference on Artificial Intelligence.

Mohammad Norouzi, Tomas Mikolov, Samy Bengio, Yoram Singer, Jonathon Shlens, Andrea Frome, Greg S Corrado, and Jeffrey Dean. 2013. Zero-shot learning by convex combination of semantic embed-dings. arXiv preprint arXiv:1312.5650.

Sinno Jialin Pan, Xiaochuan Ni, Jian-Tao Sun, Qiang Yang, and Zheng Chen. 2010a. Cross-domain sen-timent classification via spectral feature alignment. InProceedings of the 19th international conference on World wide web, pages 751–760. ACM.

Sinno Jialin Pan, Qiang Yang, et al. 2010b. A survey on transfer learning. IEEE Transactions on knowledge and data engineering, 22(10):1345–1359.

Jeffrey Pennington, Richard Socher, and Christo-pher D. Manning. 2014. Glove: Global vectors for word representation. InEmpirical Methods in Nat-ural Language Processing (EMNLP), pages 1532– 1543.

Pushpankar Kumar Pushp and Muktabh Mayank Sri-vastava. 2017. Train once, test anywhere: Zero-shot learning for text classification. arXiv preprint arXiv:1712.05972.

Radim ˇReh˚uˇrek and Petr Sojka. 2010. Software Frame-work for Topic Modelling with Large Corpora. In

Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Val-letta, Malta. ELRA.

Bernardino Romera-Paredes and Philip Torr. 2015. An embarrassingly simple approach to zero-shot learn-ing. InInternational Conference on Machine Learn-ing, pages 2152–2161.

Itsumi Saito, Jun Suzuki, Kyosuke Nishida, Kugatsu Sadamitsu, Satoshi Kobashikawa, Ryo Masumura, Yuji Matsumoto, and Junji Tomita. 2017. Improv-ing neural text normalization with data augmenta-tion at character-and morphological levels. In Pro-ceedings of the Eighth International Joint Confer-ence on Natural Language Processing (Volume 2: Short Papers), volume 2, pages 257–262.

Prateek Veeranna Sappadla, Jinseok Nam, Eneldo Loza Menc´ıa, and Johannes F¨urnkranz. 2016. Using se-mantic similarity for multi-label zero-shot classifica-tion of text documents. InProc. European Sympo-sium on Artificial Neural Networks, Computational Intelligence and Machine Learning.

Lei Shu, Hu Xu, and Bing Liu. 2017. Doc: Deep open classification of text documents. InProceedings of the 2017 Conference on Empirical Methods in Nat-ural Language Processing, pages 2911–2916. Asso-ciation for Computational Linguistics.

Richard Socher, Milind Ganjoo, Christopher D. Man-ning, and Andrew Y. Ng. 2013. Zero-shot learning through cross-modal transfer. InProceedings of the 26th International Conference on Neural Informa-tion Processing Systems - Volume 1, NIPS’13, pages 935–943, USA. Curran Associates Inc.

Robert Speer and Catherine Havasi. 2013. Conceptnet 5: A large semantic network for relational knowl-edge. InThe Peoples Web Meets NLP, pages 161– 176. Springer.

Sebastian Thrun and Lorien Pratt. 1998. Learning to learn: Introduction and overview. In Learning to learn, pages 3–17. Springer.

Xiaolong Wang, Yufei Ye, and Abhinav Gupta. 2018. Zero-shot recognition via semantic embeddings and knowledge graphs. InProceedings of the IEEE Con-ference on Computer Vision and Pattern Recogni-tion, pages 6857–6866.

World Health Organization. 1996. Infectious diseases kill over 17 million people a year. Malaria Weekly, 3:11–16.

Yongqin Xian, Christoph H Lampert, Bernt Schiele, and Zeynep Akata. 2017. Zero-shot learning-a com-prehensive evaluation of the good, the bad and the ugly. arXiv preprint arXiv:1707.00600.

Ziang Xie, Sida I Wang, Jiwei Li, Daniel L´evy, Aiming Nie, Dan Jurafsky, and Andrew Y Ng. 2017. Data noising as smoothing in neural network language models. arXiv preprint arXiv:1703.02573.

Xiang Zhang and Yann LeCun. 2015. Text understand-ing from scratch.arXiv preprint arXiv:1502.01710.

Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text clas-sification. In Advances in neural information pro-cessing systems, pages 649–657.