Talla at SemEval-2018 Task 7: Hybrid Loss Optimization for Relation Classification using Convolutional Neural Networks


Proceedings of the 12th International Workshop on Semantic Evaluation (SemEval-2018), pages 863–867


Bhanu Pratap, Daniel Shank, Oladipo Ositelu, Byron V. Galbraith

Talla Inc.

Boston, MA, USA

{bhanu, daniel, ladi, byron_}@talla.com

Abstract

This paper describes our approach to SemEval-2018 Task 7: given an entity-tagged text from the ACL Anthology corpus, identify and classify pairs of entities that have one of six possible semantic relationships. Our model consists of a convolutional neural network leveraging pre-trained word embeddings, unlabeled ACL abstracts, and multiple window sizes to automatically learn useful features from entity-tagged sentences. We also experiment with a hybrid loss function, a combination of cross-entropy loss and ranking loss, to boost the separation in classification scores. Lastly, we include WordNet-based features to further improve the performance of our model. Our best model achieves an F1 (macro) score of 74.2 and 84.8 on subtasks 1.1 and 1.2, respectively.

1 Introduction

Classifying the relationship between entities is an important natural language processing (NLP) task that serves as a building block for a variety of NLP applications such as knowledge base construction and question-answering tasks. SemEval-2018 Task 7 (Gábor et al., 2018) provided entity-tagged texts from the ACL Anthology corpus and asked participants to identify and classify entity pairs into one of six semantic relationships.

Our approach to this problem consisted of selecting two architectures shown to be successful in this problem domain (Zeng et al., 2014; Nguyen and Grishman, 2015) and adapting them to this particular task. We found that pre-trained word embeddings, a combined loss function, and WordNet features added at a later stage of our model were all effective for this problem.

Figure 1: An example of a sentence-level relation instance. Sentences marked with entity positions were the input to our model.

2 System Description

Convolutional neural networks (CNNs) have been shown to significantly outperform other methods for relation classification (Zeng et al., 2014; dos Santos et al., 2015; Nguyen and Grishman, 2015). Our approach was inspired by Nguyen and Grishman (2015) and dos Santos et al. (2015), both systems being minimally dependent on explicit feature engineering. While Nguyen and Grishman (2015) relied solely on their model architecture to automatically extract useful features, we also included additional features based on part-of-speech tags and WordNet hypernyms. Following dos Santos et al. (2015), we trained our model on a hybrid objective function, a combination of cross-entropy loss and ranking loss. Finally, we also trained our model in two stages to utilize large amounts of unlabeled ACL corpus abstracts (Bird et al., 2008). We describe these stages in detail in Section 4.1.

Each abstract was first split into sentences. For each sentence, we then formed training examples by taking all combinations of pairs of entities annotated in the sentence. If a pair was annotated with a relation label, we labeled the sentence with this relation. Otherwise, we labeled the sentence with an artificial class called OTHER. Figure 1 shows an example of a training instance in our sentence-level dataset.
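The pairing step above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `make_instances`, the entity identifiers, and the dictionary format for gold annotations are all assumptions made for the example.

```python
from itertools import combinations

def make_instances(entity_ids, labeled_pairs):
    """Build one training instance per pair of entities in a sentence.

    entity_ids: entity identifiers in order of appearance (hypothetical format).
    labeled_pairs: dict mapping (first_id, second_id) -> relation label.
    Pairs without an annotation receive the artificial class OTHER.
    """
    instances = []
    for e1, e2 in combinations(entity_ids, 2):
        label = labeled_pairs.get((e1, e2), "OTHER")
        instances.append((e1, e2, label))
    return instances

# Toy sentence with three annotated entities and one gold relation.
sentence_entities = ["E1", "E2", "E3"]
gold = {("E1", "E2"): "USAGE"}
instances = make_instances(sentence_entities, gold)
```

With three entities the sentence yields three instances, only one of which carries a real relation label; this is why the OTHER class tends to dominate a dataset built this way.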

For an input sentence, each word was mapped to a word vector to form a sentence matrix. These sentence matrices were provided as inputs to the CNN. The output of this layer was then fed into a softmax layer to classify the relationship between two entities. Section 3 provides a detailed description of the underlying CNN.

3 Convolutional Neural Network for Relation Classification

Our model consists of a preprocessing and feature generation step followed by a 2D convolutional layer with max pooling and then a fully connected layer with softmax output.

3.1 Preprocessing and Feature Generation

The input to our model was a raw sentence marked with entity positions. This raw sentence was first converted into a real-valued sentence matrix by tokenizing the sentence and then replacing each token with a corresponding word embedding. We used three different look-up tables: publicly available pre-trained word embeddings, randomly initialized word-position embeddings, and randomly initialized part-of-speech tag embeddings. Following Collobert et al. (2011), the final word embedding for each token in the sentence was a concatenation of these three embeddings.

For pre-trained word embeddings, we evaluated word2vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014), and Numberbatch (Speer and Chin, 2016), ultimately choosing word2vec as the best performer for this task.

The specific process used to generate the feature vector for a given token in a sentence is as follows. Let $n$ be the number of tokens in a sentence $x = [x_1, x_2, ..., x_n]$, with $x_{i_1}$ and $x_{i_2}$ being the two head words of the two entities in relationship $r$. Relative positions between a token $x_i$ and the two entities are given by $(i - i_1)$ and $(i - i_2)$. These positions are also mapped into real-valued vectors using a position embeddings look-up table $W_p$. Also, the $W_e$ and $W_t$ embeddings look-up tables are used to map each word and its part-of-speech tag into a real-valued vector. Finally, $x_i$ is transformed into a vector $v_i = [e_i; u_{i1}; u_{i2}; t_i]^T$, where $v_i$ is the concatenation of vectors $e_i$, $u_{i1}$, $u_{i2}$, $t_i$; $e_i$ is the word vector mapped using the $W_e$ look-up table; $u_{i1}$ and $u_{i2}$ are the word-position vectors mapped using the $W_p$ look-up table; and $t_i$ is the word-part-of-speech vector mapped using the $W_t$ look-up table. The dimension $d$ of the final input vector is given by $d = d_e + 2 d_p + d_t$, where $d_e$, $d_p$ and $d_t$ are the dimensions of the pre-trained word embeddings, word-position embeddings and word-part-of-speech tag embeddings. As a result of these look-up operations, the raw sentence $x = [x_1, x_2, ..., x_n]$ is transformed into a real-valued sentence matrix $\mathbf{x} = [\mathbf{x}_1, \mathbf{x}_2, ..., \mathbf{x}_n]$ of size $d \times n$.
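The look-up and concatenation described above can be sketched in plain Python. This is an illustrative sketch under stated assumptions: the toy dimensions, the helper names `make_table` and `sentence_matrix`, and the uniform random initialisation are all hypothetical, and real pre-trained word vectors are replaced here by random ones.

```python
import random

def make_table(keys, dim, rng):
    """Randomly initialised look-up table: key -> list of `dim` floats."""
    return {k: [rng.uniform(-0.1, 0.1) for _ in range(dim)] for k in keys}

def sentence_matrix(tokens, tags, i1, i2, W_e, W_p, W_t):
    """Return the sentence matrix as a list of n columns, each of length d."""
    columns = []
    for i, (token, tag) in enumerate(zip(tokens, tags)):
        e_i = W_e[token]      # word vector
        u_i1 = W_p[i - i1]    # position relative to the first entity head
        u_i2 = W_p[i - i2]    # position relative to the second entity head
        t_i = W_t[tag]        # part-of-speech vector
        columns.append(e_i + u_i1 + u_i2 + t_i)  # list concatenation
    return columns

rng = random.Random(0)
tokens = ["parser", "trained", "on", "treebank"]
tags = ["NOUN", "VERB", "ADP", "NOUN"]
n = len(tokens)
d_e, d_p, d_t = 4, 2, 3  # toy sizes, not the values used in the paper

W_e = make_table(sorted(set(tokens)), d_e, rng)
W_p = make_table(range(-(n - 1), n), d_p, rng)  # every possible relative offset
W_t = make_table(sorted(set(tags)), d_t, rng)

# Entity heads at token positions 0 ("parser") and 3 ("treebank").
x = sentence_matrix(tokens, tags, i1=0, i2=3, W_e=W_e, W_p=W_p, W_t=W_t)
```

Each column has length $d_e + 2 d_p + d_t$ (11 in this toy setting), and the matrix has one column per token.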

3.2 2D Convolution with Max Pooling

We used multiple window sizes to extract features corresponding to various n-grams. Let $w$ be a window size and $n_w$ be the number of unique window sizes. A filter $\mathbf{f} = [\mathbf{f}_1, \mathbf{f}_2, ..., \mathbf{f}_w]$ is a weight matrix where $\mathbf{f}_i$ is a column vector of size $d = d_e + 2 d_p + d_t$. A convolutional operation is then performed using $\mathbf{x}$ and $\mathbf{f}$ to produce a feature map $\mathbf{s} = [s_1, s_2, ..., s_{n-w+1}]$ as:

$$s_i = g\left(\sum_{j=0}^{w-1} \mathbf{f}_{j+1}^{T} \mathbf{x}_{j+i}^{T} + b\right)$$

where $b$ and $g$ are the bias and the ReLU activation function (Nair and Hinton, 2010), respectively. This convolutional operation was repeated for different filters and window sizes, and then a max pooling strategy (Zhang and Wallace, 2017) was applied to extract only one feature (the one with the highest activation) from each feature map. That is, for each feature map $\mathbf{s}$, a max function was applied to produce a single value: $p_f = \max(\mathbf{s})$.

3.3 Classification Layer

We took all the individually selected features from the max pooling operation and concatenated them together, producing $\mathbf{z} = [p_1, p_2, ..., p_m]$, where $m$ is the number of feature maps and $p_i$ is the pooled value for the $i$th feature map. A random proportion of the input vector $\mathbf{z}$ was set to zero for regularization purposes to produce a drop-out version $\mathbf{z}_d$ of the input vector $\mathbf{z}$. The vector $\mathbf{z}_d$ was then fed into a dense layer followed by a softmax operation to produce the final classification probability for a relation class $r$ as:

$$\mathbf{o} = C \mathbf{z}_d + \mathbf{b}$$

$$p(r \mid \theta) = \frac{e^{o_r}}{\sum_{k=1}^{L} e^{o_k}}$$

where $\mathbf{o}$ is the output of the dense layer, $\mathbf{b}$ is a bias term, $L$ is the number of relation categories, and $C$ is a weight matrix of size $(n_w m \times L)$, with $n_w$ being the number of unique window sizes.
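The classification layer can be sketched as dropout, an affine map, and a softmax. This is a minimal sketch, not the authors' TensorFlow code: the function names are hypothetical, the weights are random, and dropout is implemented as plain zeroing without the rescaling that some libraries apply.

```python
import math
import random

def dropout(z, rate, rng):
    """Set each component of z to zero with probability `rate`."""
    return [0.0 if rng.random() < rate else v for v in z]

def dense(z, C, b):
    """Affine layer; C has one row per input feature and one column per class."""
    return [sum(z[i] * C[i][k] for i in range(len(z))) + b[k] for k in range(len(b))]

def softmax(o):
    """Numerically stable softmax over the class scores."""
    m = max(o)
    exps = [math.exp(v - m) for v in o]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)
z = [2.0, 1.0, 0.5, 0.0]   # pooled features (toy values)
num_classes = 3            # toy value, not the task's label set
C = [[rng.uniform(-1, 1) for _ in range(num_classes)] for _ in z]
b = [0.0] * num_classes

z_d = dropout(z, rate=0.5, rng=rng)
probs = softmax(dense(z_d, C, b))
```

Subtracting the maximum score before exponentiating does not change the result of the softmax but avoids overflow for large scores.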


3.4 Additional Features

In addition to the output of the CNN layer, we explored a variety of additional features derived from the input sentences.

Part-of-speech features, pos: We randomly initialized embeddings for each part-of-speech tag and used these embeddings as additional input to our network. Part-of-speech tags for raw sentences were generated using spaCy¹.

WordNet hypernym features, hyp: We incorporated WordNet hypernyms using the implementation² provided by Ciaramita and Altun (2006).

Semantic similarity between two entities, sim: We computed the cosine similarity between the word embeddings of the head words of the two entities in a relation instance.

REVERSE flag feature, rev: We applied an indicator function on the REVERSE flag of the relationship instance.

We fed the hyp, sim and rev features as additional inputs to the classification layer, while the pos features were provided as input to the convolution layer.
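A small sketch of the sim and rev features, and of how extra features might be appended to the pooled CNN output before the classification layer. The function names and the encoding of the hypernym feature as a plain list of numbers are assumptions for illustration only.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def augment(z, hyp, sim, reverse_flag):
    """Append hyp, sim and rev features to the pooled CNN features z."""
    rev = 1.0 if reverse_flag else 0.0  # indicator on the REVERSE flag
    return z + hyp + [sim, rev]

head1 = [1.0, 1.0]   # toy embedding of the first entity's head word
head2 = [1.0, 0.0]   # toy embedding of the second entity's head word
sim = cosine_similarity(head1, head2)
features = augment([2.0, 0.5], hyp=[0.0, 1.0], sim=sim, reverse_flag=True)
```

The augmented vector is simply longer than the pooled vector, so only the input size of the dense layer needs to change.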

4 Training Methods

We evaluated three different loss functions for training our model: cross-entropy loss, ranking loss (dos Santos et al., 2015), and a weighted combination of the two, where

$$L_{\text{combined}} = \alpha L_{\text{ranking}} + (1 - \alpha) L_{\text{cross-entropy}}$$

with $\alpha$ as a weighting parameter. The combined loss function was determined to be the most effective.
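The weighted combination can be sketched for a single training example as below. The ranking term follows the pairwise ranking loss of dos Santos et al. (2015), which penalises a low score for the gold class and a high score for the best-scoring wrong class. Treating the raw class scores as the input to both terms, and using a `scale` argument for the ranking loss scaling factor, are assumptions of this sketch; the default margins and α are the values listed in Table 2.

```python
import math

def cross_entropy(scores, gold):
    """Negative log-probability of the gold class under a softmax over scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[gold]

def ranking_loss(scores, gold, m_pos=2.5, m_neg=0.5, scale=1.0):
    """Pairwise ranking loss against the highest-scoring incorrect class."""
    s_pos = scores[gold]
    s_neg = max(s for k, s in enumerate(scores) if k != gold)
    return (math.log1p(math.exp(scale * (m_pos - s_pos)))
            + math.log1p(math.exp(scale * (m_neg + s_neg))))

def combined_loss(scores, gold, alpha=0.1, m_pos=2.5, m_neg=0.5, scale=1.0):
    """alpha * ranking loss + (1 - alpha) * cross-entropy loss."""
    return (alpha * ranking_loss(scores, gold, m_pos, m_neg, scale)
            + (1.0 - alpha) * cross_entropy(scores, gold))

good = combined_loss([3.0, 0.0, 0.0], gold=0)   # gold class scored highest
bad = combined_loss([0.0, 3.0, 0.0], gold=0)    # a wrong class scored highest
```

Setting `alpha` to 0 or 1 recovers the pure cross-entropy and pure ranking losses, which correspond to the other two configurations evaluated.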

4.1 Two-Staged Training

To make use of unlabeled data for fine-tuning word-position and word-part-of-speech embeddings, we trained our model in two stages following Severyn and Moschitti (2015): a distant training stage and a supervised training stage.

¹ spacy.io

² sourceforge.net/projects/supersensetag/

Distant Training: We first created a distantly supervised training dataset using unlabeled ACL corpus abstracts (Bird et al., 2008) based on the naive assumption that two entities have the same relationship across all aligned sentences. By aligned sentences, we mean all sentences which have exactly two entities. In order to create distantly supervised training data based on the above assumption, we performed the following operations:

i) All the sentences in the ACL corpus were indexed in an IR system. Here we used Whoosh.

ii) For each relation instance in the labeled training data, the top 40 sentences which contained both the entity texts in the relation were returned from the IR system.

iii) Result sentences in which the distance (number of characters) between the two entity texts was greater than 170 were removed. This number was derived from distance statistics from the given labeled datasets.

iv) The remaining sentences were labeled with the same relationship as the relation instance in (ii).
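Steps (ii) to (iv) can be sketched as a retrieve, filter, and label routine. This sketch makes several assumptions: the IR system is replaced by a simple substring scan in corpus order (no ranking), the character distance is measured between the start offsets of the two entity texts, and the function name `distant_examples` is hypothetical.

```python
def distant_examples(corpus_sentences, entity1, entity2, relation,
                     top_k=40, max_chars=170):
    """Label corpus sentences that mention both entities with `relation`."""
    # (ii) retrieve up to top_k sentences containing both entity texts
    hits = [s for s in corpus_sentences if entity1 in s and entity2 in s][:top_k]
    labeled = []
    for sentence in hits:
        # (iii) drop sentences where the entities are too far apart
        gap = abs(sentence.index(entity1) - sentence.index(entity2))
        if gap <= max_chars:
            # (iv) inherit the relation of the labeled instance
            labeled.append((sentence, relation))
    return labeled

close = "The parser was trained on the treebank."
far = "The parser " + "x" * 200 + " treebank"
unrelated = "The parser is fast."

examples = distant_examples([close, far, unrelated], "parser", "treebank", "USAGE")
```

Only the first sentence survives: the second mentions both entities but more than 170 characters apart, and the third mentions only one of them.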

Following the above steps, we created distant datasets of around 1600 and 11000 training instances for subtasks 1.1 and 1.2, respectively. We also verified that there was no overlap between these generated distant datasets and the provided test datasets.

Once the distantly supervised training data was created, we trained our model on these datasets to fine-tune only the word-position and part-of-speech tag embeddings, while keeping the word embeddings fixed.

Supervised Training: In the second stage, we initialized our model with the fine-tuned embeddings from the distantly supervised training stage and then trained our model using the provided labeled training data. In this stage we also trained the word embeddings, but froze them for the first 10 epochs to prevent any large updates.
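The freezing schedule across the two stages can be summarised as a small helper that reports which embedding tables receive gradient updates. The function name, the table names, and the 0-based epoch counting are assumptions of this sketch.

```python
def trainable_tables(stage, epoch, freeze_epochs=10):
    """Return the embedding tables updated at a given stage and epoch.

    stage: "distant" or "supervised".
    epoch: 0-based epoch index within the stage (an assumption of this sketch).
    """
    if stage == "distant":
        # Only position and part-of-speech embeddings are fine-tuned.
        return {"position", "pos_tag"}
    if stage == "supervised":
        tables = {"position", "pos_tag"}
        if epoch >= freeze_epochs:
            tables.add("word")  # word embeddings unfrozen after the first 10 epochs
        return tables
    raise ValueError(f"unknown stage: {stage!r}")

schedule = [sorted(trainable_tables("supervised", e)) for e in (0, 9, 10)]
```

In a framework such as TensorFlow, the returned set would determine which variables are passed to the optimizer at each epoch.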

5 Experiments and Results

The class labels for subtasks 1.1 and 1.2 are highly imbalanced (Table 1). To compensate for this imbalance, we trained our models for subtasks 1.1 and 1.2 jointly on a combined dataset and used class weights to weight our loss function.

5.1 Resources and Hyperparameters

We chose all the hyperparameters based on the model performance on our validation set. All experiments below use the hyperparameters shown in Table 2.


Class            Train 1.1   Test 1.1   Train 1.2   Test 1.2
USAGE            0.39        0.49       0.38        0.35
PART-WHOLE       0.19        0.20       0.16        0.16
MODEL-FEATURE    0.27        0.19       0.14        0.21
COMPARE          0.08        0.06       0.03        0.01
RESULT           0.06        0.06       0.10        0.08
TOPIC            0.01        0.01       0.19        0.19
n                1228        355        1249        355

Table 1: Distribution of classes within datasets. Here, n is the total number of instances in a dataset.

Parameter                               Value
Window Sizes                            2, 3, 4, 5
Number of Filters                       25
Word-Embeddings Size (d_e)              300
Word-Position Embeddings Size (d_p)     25
Positive Class Margin (m+)              2.5
Negative Class Margin (m−)              0.5
λ                                       1.0
α                                       0.1
Learning rate                           0.01

Table 2: Hyperparameters used in all experiments.

Our final model is a soft-voting ensemble of the best models obtained using 10-fold stratified cross-validation. All the models were implemented using TensorFlow⁴. We trained our models using a stochastic gradient descent optimizer with momentum (Sutskever et al., 2013). Lastly, based on our experiments, we chose the 300-dimensional word2vec pre-trained word embeddings trained on the Google News corpus.
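Soft voting averages the class probabilities predicted by each fold's model and picks the class with the highest mean. The sketch below assumes equal weights for all models, which the text does not specify; the function name and the toy probabilities are hypothetical.

```python
def soft_vote(prob_lists):
    """Average per-model class probabilities and return (mean_probs, predicted_class)."""
    n_models = len(prob_lists)
    n_classes = len(prob_lists[0])
    mean_probs = [sum(p[k] for p in prob_lists) / n_models for k in range(n_classes)]
    predicted = max(range(n_classes), key=mean_probs.__getitem__)
    return mean_probs, predicted

# Three toy models voting over two classes.
fold_predictions = [
    [0.6, 0.4],
    [0.4, 0.6],
    [0.1, 0.9],
]
mean_probs, predicted = soft_vote(fold_predictions)
```

Unlike majority (hard) voting, a single confident model can outweigh several uncertain ones, because the probabilities themselves are averaged rather than the predicted labels.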

5.2 Evaluation

Model                              F1-Macro Subtask 1.1   F1-Macro Subtask 1.2
CNN                                72.8                   84.1
CNN+pos                            72.1                   82.8
CNN+pos+hyp                        74.4                   82.4
CNN+pos+hyp+sim                    73.7                   84.8 (a)
CNN+pos+hyp+sim+rev                74.2 (b)               84.7
CNN+pos+hyp+sim+rev (2-staged)     73.9                   84.7
Overall Best (DS3Lab)              81.7                   90.4

(a) our official submission, ranked second
(b) our official submission, ranked fifth

Table 3: Performance of our final model on Subtasks 1.1 and 1.2. Additional features are incrementally added to our plain CNN model. Here, (2-staged) refers to the results of our experiments using the two-staged training method.

Table 3 shows the results of our ablation studies using different feature sets on subtasks 1.1 and 1.2. It shows that the simple similarity (sim) feature helps in the case of subtask 1.2, while it degrades the performance in the case of subtask 1.1. Similarly, WordNet features with part-of-speech features boosted the performance only on subtask 1.1. Fine-tuning using the two-staged training approach did not yield any performance gain on either subtask.

⁴ https://www.tensorflow.org/

We also evaluated the effect of loss function choice. Table 4 shows the results of our final model trained on different loss functions. A combination of ranking loss and cross-entropy loss does yield a performance boost.

Loss Function         F1-Macro Subtask 1.1   F1-Macro Subtask 1.2
Cross Entropy Loss    72.7                   83.9
Ranking Loss          70.6                   81.5
Combined Loss         74.2                   84.7

Table 4: Effect of Loss Functions

While we did not provide a formal submission for subtask 2, we evaluated our approach on it given the labeled test data. Table 5 shows the results of our experiments on subtask 2 (relation extraction). While this method did not outperform the top submissions, it still demonstrated competitive results.

Model                    F1-Macro Subtask 2 (ANY)
CNN                      35.0
CNN+pos                  36.6
CNN+pos+hyp              34.7
Overall Best (UWNLP)     50.0

Table 5: Performance of our final model on Subtask 2 (Relation Extraction).

6 Conclusion


References

Steven Bird, Robert Dale, Bonnie Dorr, Bryan Gibson, Mark Joseph, Min-Yen Kan, Dongwon Lee, Brett Powley, Dragomir Radev, and Yee Fan Tan. 2008. The ACL Anthology reference corpus: A reference dataset for bibliographic research in computational linguistics. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC-08), Marrakech, Morocco. European Language Resources Association (ELRA). ACL Anthology Identifier: L08-1005.

Massimiliano Ciaramita and Yasemin Altun. 2006. Broad-coverage sense disambiguation and information extraction with a supersense sequence tagger. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, EMNLP '06, pages 594–602, Stroudsburg, PA, USA. Association for Computational Linguistics.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. J. Mach. Learn. Res., 12:2493–2537.

Kata Gábor, Davide Buscaldi, Anne-Kathrin Schumann, Behrang QasemiZadeh, Haïfa Zargayouna, and Thierry Charnois. 2018. SemEval-2018 Task 7: Semantic relation extraction and classification in scientific papers. In Proceedings of International Workshop on Semantic Evaluation (SemEval-2018), New Orleans, LA, USA.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, NIPS'13, pages 3111–3119, USA. Curran Associates Inc.

Vinod Nair and Geoffrey E. Hinton. 2010. Rectified linear units improve restricted Boltzmann machines. In ICML, pages 807–814. Omnipress.

Thien Huu Nguyen and Ralph Grishman. 2015. Relation extraction: Perspective from convolutional neural networks.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, pages 1532–1543.

Cícero Nogueira dos Santos, Bing Xiang, and Bowen Zhou. 2015. Classifying relations by ranking with convolutional neural networks. In ACL (1), pages 626–634. The Association for Computer Linguistics.

Aliaksei Severyn and Alessandro Moschitti. 2015. UNITN: Training deep convolutional neural network for Twitter sentiment classification. In SemEval@NAACL-HLT, pages 464–469. The Association for Computer Linguistics.

Robert Speer and Joshua Chin. 2016. An ensemble method to produce high-quality word embeddings. CoRR, abs/1604.01692.

Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. 2013. On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on Machine Learning - Volume 28, ICML'13, pages III–1139–III–1147. JMLR.org.

Daojian Zeng, Kang Liu, Siwei Lai, Guangyou Zhou, and Jun Zhao. 2014. Relation classification via convolutional deep neural network. In COLING 2014, 25th International Conference on Computational Linguistics, Proceedings of the Conference: Technical Papers, August 23-29, 2014, Dublin, Ireland, pages 2335–2344.