Research on Relation Extraction Method Based on Multi-channel Convolution and BiLSTM Model


1^st Tao Sun*

Qilu University of Technology (Shandong Academy of Sciences)

2^nd Dong Wang Qilu University of Technology (Shandong Academy of Sciences)

Abstract—Deep learning methods have achieved good results in relation extraction research and have received widespread attention. However, existing deep learning methods use a single word vector model, which cannot fully utilize the rich semantic information and syntactic structure in the corpus. In addition, the high parameter dimension causes information overload, and context information is not fully exploited. To address these problems, this paper proposes a multi-channel relation extraction framework that maps the corpus with multiple word vector models, each forming one channel. Features are extracted by a neural network that fuses a convolutional neural network, BiLSTM, and an attention mechanism, and a classifier then completes the relation extraction task. Experimental results on the SemEval 2010 Task 8 data set show that this method not only acquires rich semantic information from the corpus, but also learns local features better and makes better use of contextual information. Compared with other methods, the method in this paper achieves competitive performance.

Index Terms—Relation extraction, Multi-channel, Convolutional neural network, BiLSTM, Attention

I. INTRODUCTION

In recent years, unstructured data has become an important source of knowledge, and information extraction technology has become an important way to obtain information from unstructured data. Information extraction includes two sub-topics: concept extraction and relation extraction. The task of relation extraction is to mine the semantic relationship between entities from unstructured text to form (entity, relationship, entity) triples. For example, the sentence "Global digital music sales grow as the music industry develops new business models" contains two entities, "music industry" and "business models", with a "product" relationship between them, represented by the triple (music industry, product, business models).

Most relation extraction tasks can be cast as classification tasks. In recent years, neural network models have become the mainstream method of relation extraction. Most existing methods use single word vectors, but single word vectors cannot fully represent the semantic information in the corpus, which limits the amount of information input to the neural network model.

For relation extraction tasks, Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) are the usual choices. A convolutional neural network is very effective at capturing local features thanks to its diverse convolution kernels, but it cannot handle the long-distance dependence between two entities, so it makes insufficient use of context information. The recurrent neural network and its improved variant, the Long Short-Term Memory network (LSTM), can account for dependencies between distant words, and their memory function helps in modeling sequences, but they are weaker at representing and exploiting local features and do not capture the internal structure of sentences well. At the same time, generally speaking, the more parameters a model has, the stronger its expressive ability, but the larger the amount of information it stores, which causes the problem of information overload.

Our idea comes from the field of image processing [1]. An image is composed of three channels (red, green and blue). Each channel describes the image independently, and each channel is processed independently by a neural network, with no interaction between channels. Based on this intuition, this paper proposes a model that integrates a multi-channel convolutional neural network, a bidirectional long short-term memory network, and an attention mechanism, with the aim of better performing relation extraction tasks. Its contributions are as follows:

• Aiming at the problem that existing models cannot fully capture all the semantic information of sentences and limit the total amount of information input, this paper proposes a multi-channel framework that uses multiple word vector mappings to obtain different vector representations of the same sentence.

• Aiming at the problem that an existing single neural network model cannot both make comprehensive use of context information and fully grasp local features, this paper uses the CNN+BiLSTM model in each channel to solve this problem.

• Aiming at the problem of information overload in the neural network model, this paper introduces the attention mechanism and focuses on key information to solve this problem.

The rest of this paper is arranged as follows: Section 2 discusses related work on relation extraction, Section 3 introduces the model proposed in this paper, Section 4 describes the experimental process and analyzes the experimental results, and Section 5 is the conclusion.

2020 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom)

II. RELATED WORK

At present, relation extraction has important application value in many fields, including the construction of knowledge graphs, automatic question answering systems, machine translation, and search engines [2]. Especially for the construction of knowledge graphs, relation extraction plays a pivotal role, which has attracted a large number of scholars and researchers and has produced many new methods and models.

The common approach to relation extraction is to treat it as a classification problem: based on named entity recognition, a classifier is used to distinguish the relationship types. After deep neural networks succeeded in image processing, researchers began to apply neural network models to relation extraction. Relation extraction models based on neural networks can learn text features automatically, without manual feature construction. The following subsections review this work from several aspects.

A. Word vector model

Most current work feeds the neural network model with word vectors. As an important tool for representing natural language, the word vector model is of self-evident importance. Word vector technology converts words in natural language into vectors, such that similar words are represented by similar vectors. The Word2Vec [3] method is an open source toolkit for obtaining word vectors released by Google in 2013. It is a shallow neural network model with two variants, CBOW and Skip-gram. In CBOW, the input is the word vectors of the context of a certain feature word, and the output is the word vector of that feature word. Skip-gram is just the opposite: the input is the word vector of a certain feature word, and the output is the word vectors of the context corresponding to that feature word. The core idea of both is to obtain the vectorized representation of a word through its context. The GloVe [4] (global vectors for word representation) method was proposed by Pennington et al. of Stanford University. The model first constructs the co-occurrence matrix of words from the corpus, and then fits an approximate relationship between the word vectors and the co-occurrence matrix to obtain the word vector representation. The model combines the global information of the text with local context information.
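The difference between the two Word2Vec training schemes described above can be illustrated with a minimal sketch that only builds the (input, output) training examples, without any training. The function names `context_windows`, `cbow_examples`, and `skipgram_examples`, as well as the window size, are illustrative assumptions and are not taken from the Word2Vec toolkit.

```python
def context_windows(tokens, window=2):
    """Yield (center_word, context_words) for every position in the sentence."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield center, left + right


def cbow_examples(tokens, window=2):
    # CBOW: the context words are the input, the center word is the output.
    return [(ctx, center) for center, ctx in context_windows(tokens, window)]


def skipgram_examples(tokens, window=2):
    # Skip-gram: the center word is the input, each context word is an output.
    return [(center, w)
            for center, ctx in context_windows(tokens, window)
            for w in ctx]


sentence = "the company fabricates plastic chairs".split()
cbow = cbow_examples(sentence, window=1)
skip = skipgram_examples(sentence, window=1)
```

With a window of 1, the CBOW example for "company" pairs the context `["the", "fabricates"]` with the target "company", while Skip-gram emits one pair per context word.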

Both methods share a problem: a word can have different meanings in different contexts, and neither model can represent this. ELMo [5] (Embeddings from Language Models) was therefore proposed in 2018. It is built from stacked multi-layer LSTMs and follows a pre-training and fine-tuning scheme. It learns not only the complex characteristics of word usage but also how that usage varies across contexts; that is, the word vector produced by ELMo is not static but changes with the context. However, the parallel computing capability of this model is poor, and its forward and backward LSTMs are trained as separate language models whose outputs are only concatenated, so neither direction can take the other into account when predicting a word. Google therefore released the BERT [6] model in 2018. It uses the Transformer for encoding and considers the context on both sides when predicting a word. Its workflow is similar to that of ELMo: it is first pre-trained on a large-scale corpus and then applied to downstream tasks with relatively lightweight fine-tuning. As an improvement inspired by cloze tests, it masks 15% of the words and uses the remaining 85% to predict them, so that bidirectional context features can be used safely.

B. Method based on convolutional neural network

Zeng [7] et al. used a Convolutional Neural Network (CNN) to extract lexical-level and sentence-level features for relation extraction, and introduced a position vector. On the ACE2005 data set, F1 exceeded the kernel function method by 9%, but this method cannot capture syntactic and semantic information. Building on this work, Nguyen [8] et al. added convolution kernels of multiple sizes as filters to the convolutional layer, demonstrating the effectiveness of multi-size convolutional neural networks in relation extraction. However, these methods may still fail to capture sentence structure and other useful information. Xu [9] et al. proposed the depCNN model, which uses a convolutional neural network to learn a relation representation on the shortest path between two entities and significantly improves the extraction results. Wang [10] et al. tried a multi-layer attention mechanism in a convolutional neural network to highlight the contribution of sentence components to relation labels, at the cost of a considerably more complex structure. Lee [11] et al. input features such as word vectors, relative word positions, and word attributes into a CNN model and pre-divided relations into three categories: synonym relations, hypernym-hyponym relations, and non-existent relations. Experiments showed that introducing word position information can improve classification performance. Qian Xiaomei [12] et al. proposed a deep convolutional neural network model based on dense connections for relation extraction, which obtains richer semantic information from input sentences while alleviating vanishing gradients in deep neural networks. Wang [13] et al. used BERT to continue pre-training on the context of the original sentence, and then used a convolutional neural network with an attention mechanism to extract high-level features from the sentence, also achieving a higher F1 value.

Although convolutional neural networks perform well on relation extraction tasks, the convolution kernel cannot be made very large, so it is difficult to model long sentences. In other words, a convolutional neural network captures local features very well but cannot resolve the long-distance dependence between the two entities. As a result, researchers turned to recurrent neural networks to model sentences.


C. Method based on recurrent neural network

Zhang et al. proposed using an RNN instead of a CNN to model relation instances, with simple position labels instead of position vectors to make better use of the context of the entities. Socher [14] et al. used recurrent neural networks (RNN) for relation extraction. This method can effectively consider the syntactic structure of the sentence, but cannot take into account the position and semantic information of the two entities in the sentence. Hochreiter [15] et al. proposed an improved RNN model, the long short-term memory network (LSTM). Xu [16] et al. were the first to use LSTM to capture sentence information for relation extraction. Their method uses four channels and learns along the shortest dependency path, and achieved good results. To address the problem that a unidirectional LSTM cannot completely extract context information, Zhang et al. proposed using BiLSTM to extract the bidirectional hidden state output of sentences. Although this method achieved certain results, it still requires a large number of manually defined features. Building on the work of Zhang et al., Zhou [17] et al. proposed the Att-BiLSTM model, which adds an attention mechanism: the weight of each word's contribution to the sentence representation is calculated through a specific vector, and the resulting sentence vector is used for classification.

Miwa [18] et al. stacked a bidirectional tree-structured LSTM on top of a bidirectional sequence LSTM for joint entity and relation extraction. The bidirectional LSTM first identifies all entities, all entity pairs are then traversed, and the bidirectional tree-structured LSTM outputs the relationship between them. To address the problem of noise, Yan [19] et al. used a segmented LSTM combined with a convolutional neural network to encode sentences, and then used a bidirectional LSTM for relation extraction. Experiments showed that the model has significant anti-noise performance. Geng [20] et al. first used a bidirectional LSTM with an attention mechanism to identify word features, and then fed them into a bidirectional tree-structured LSTM. This method achieved good results on both data sets.

Methods based on recurrent neural networks are slightly lacking in the representation and use of local features and cannot grasp the internal structure of the sentence well. In addition, the network structure itself is complex and the training period is long.

D. Method based on attention mechanism

The attention mechanism [21] (Attention) is a model proposed by Treisman and Gelade to simulate the human brain. It optimizes a model by calculating a probability distribution that highlights the impact of key input information on the model output. Lin [22] et al. introduced an attention mechanism on top of convolutional neural networks for relation extraction. Unlike Zeng [23] et al., who train only on the relational sentence with the highest probability, this method makes full use of all sentences that contain the two entities. Qin [24] et al. proposed an attention mechanism based on entity pairs, which feeds the entity pair information into the attention layer as prior knowledge. They also built a bidirectional GRU [25] network to reduce the computational complexity of the bidirectional LSTM network, and finally combined the attention weights with the output of the bidirectional GRU network to obtain a vector representation for classification. Building on Qin et al., Sun [26] et al. set the attention layer network to LSTM, using a fine-grained bidirectional LSTM with an entity-pair attention mechanism to extract key information and a coarse-grained bidirectional LSTM to extract sentence-level features. The two networks of different granularity complement each other, and good experimental results were achieved.

III. A MODEL OF RELATION EXTRACTION BASED ON MULTI-CHANNEL

To better complete the task of relation extraction, this paper proposes a multi-channel framework. Existing deep learning methods for relation extraction all use a specific single word vector model. Their performance depends on the ability of the word vector to represent natural language, but a single word vector model can only represent part of the semantic information, which limits the total amount of information input. To solve this problem, we use different word vector models to map the corpus into different vector matrices that form multiple channels, which to a certain extent remedies the incomplete grasp of the semantic information of sentences. A fusion of CNN and BiLSTM, an improved RNN model, is then used to extract features from the matrices produced by the different word vector models. Because CNN is based on local connections, that is, each neuron is connected to only a limited number of neurons in the previous layer, CNN works well on local features. RNN is designed for time series information: the result at a given moment is predicted from the features of the previous moment and of the current moment, so the context is used. LSTM not only has the functions of RNN but also addresses the vanishing or exploding gradients that RNN may suffer from. Modeling with CNN and bidirectional LSTM together can therefore both handle local features well and make full use of context information. At the same time, the attention mechanism selectively filters important information out of a large amount of information and focuses on it by giving it a larger weight, so to a certain extent it can compute dependencies directly, regardless of the distance between words. It can learn the internal structure of sentences and also helps optimize the model.

Our multi-channel relation extraction model is shown in Figure 1. It consists of a word embedding layer, a convolutional layer, a BiLSTM layer, a pooling layer, an attention layer, and the final classification.

First, in the word embedding layer, we use two different word vector models to map the corpus, forming a multi-channel network structure. The input then passes through the convolutional layer and the BiLSTM layer for feature learning and extraction, and then through the pooling layer, which performs feature fusion and piecewise max pooling. The attention mechanism is used to assign different weights to the features, and finally the softmax function is used for classification to complete the relation extraction task.

Fig. 1. Multi-channel relation extraction model

A. Word embedding layer

Word embedding, also known as word vector, is essentially a text representation method. It represents each word as a continuous real-valued vector, and after training, similarity in the vector space can be used to express the semantic similarity of text. Text representations are divided into discrete representations (such as one-hot representation, the bag-of-words model, and the N-gram model), distributed representations (such as the co-occurrence matrix), and neural network representations (such as NNLM and Word2Vec). Compared with discrete and distributed representations, neural network representations can express more semantic information. At present, most methods that use neural networks for relation extraction use word vectors as the input of the network.

In the word embedding layer, we map the original text corpus into a vector matrix. When a single word vector model is used for this mapping, the semantic information acquired is insufficient, because a single word vector model always maps the original corpus with a bias. To address this, we start from the word embedding layer and map the original corpus with multiple word vector models to obtain more comprehensive semantic information.

To obtain more semantic information at the word embedding layer, we need a good grasp of both the local and the global information of the original corpus. Earlier we introduced two word vector models, Word2Vec and GloVe, each with its own advantages. Word2Vec is a prediction-based method that better portrays local information; GloVe is a counting-based method that makes better use of global information. Fusing the word vector representations obtained from these two models can therefore be expected to capture more semantic information.

We use the two pre-trained word vector models above to map the input. Each word is mapped to a d-dimensional vector, and each natural-language sentence is mapped to a vector matrix. W1 denotes the word vectors trained by Word2Vec and W2 denotes the word vectors trained by GloVe. W1 maps the input sentence to the first channel, and W2 maps the input sentence to the second channel.

$$r = W^{wd} V \tag{1}$$

where $W^{wd}$ is the trained word vector matrix, with $W_1, W_2 \in \mathbb{R}^{d \times |v|}$; $d$ is the dimension of the word vector; $|v|$ is the size of the dictionary; $V \in \mathbb{R}^{|v| \times n}$ is the bag-of-words representation of the input sentence; and $n$ is the length of the input sentence.
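As a minimal sketch of Eq. (1), the two channels can be built by multiplying each word vector matrix with the same sentence matrix. The example assumes that each column of V is the one-hot vector of one word, and it uses small random matrices as stand-ins for the trained Word2Vec and GloVe vectors; the vocabulary, dimensions, and the helper `one_hot_matrix` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "company", "fabricates", "plastic", "chairs"]
d = 4  # word vector dimension (toy value; the paper's settings use 300)

# Random stand-ins for the trained matrices W1 (Word2Vec) and W2 (GloVe).
W1 = rng.normal(size=(d, len(vocab)))
W2 = rng.normal(size=(d, len(vocab)))


def one_hot_matrix(tokens, vocab):
    """Build V of shape (|v|, n): column j is the one-hot vector of token j."""
    V = np.zeros((len(vocab), len(tokens)))
    for j, tok in enumerate(tokens):
        V[vocab.index(tok), j] = 1.0
    return V


tokens = ["the", "company", "fabricates", "chairs"]
V = one_hot_matrix(tokens, vocab)

r1 = W1 @ V  # channel 1: shape (d, n)
r2 = W2 @ V  # channel 2: shape (d, n)
```

Each column of `r1` and `r2` is simply the word vector of the corresponding token, so in practice the product is usually replaced by a table lookup.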

B. Convolution layer

The second layer of the network is the convolution layer. To extract contextual information about the words in a sentence and better capture local features, we use convolution kernels of different sizes to extract sequence features of different granularity. The input of the convolutional layer is the matrix r obtained from the word embedding layer, and an extracted feature $c_i$ is expressed as:

$$c_i = \sigma\left(W^{1} \cdot r_{i:i+h-1} + b\right) \tag{2}$$

where $W^{1} \in \mathbb{R}^{d \times h}$ is the convolution kernel; $h$ is the height of the convolution kernel; $b$ is the bias term; $r_{i:i+h-1}$ is the sequence of word vectors from $i$ to $i+h-1$, that is, the local filtering window composed of $h$ words; and $\sigma$ is the activation function, for which we use the ReLU function.

As the convolution kernel slides over the sequence, the result of the convolution is $c = [c_1, c_2, c_3, \ldots, c_{n-h+1}]$.
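The sliding-window computation of Eq. (2) can be sketched for a single kernel as follows. The sketch assumes the sentence matrix has shape (d, n), the kernel has shape (d, h), and the product in Eq. (2) is the element-wise product summed over the window; the function name `conv_feature_map` and the toy values are illustrative only.

```python
import numpy as np


def conv_feature_map(r, W, b):
    """Slide one kernel W (d, h) over the sentence matrix r (d, n).

    Returns the feature sequence c of length n - h + 1, with ReLU applied.
    """
    d, n = r.shape
    h = W.shape[1]
    c = np.empty(n - h + 1)
    for i in range(n - h + 1):
        window = r[:, i:i + h]                 # the h word vectors starting at i
        pre_activation = np.sum(W * window) + b
        c[i] = max(0.0, float(pre_activation))  # ReLU
    return c


r = np.arange(12, dtype=float).reshape(3, 4)  # toy sentence: d = 3, n = 4
W = np.ones((3, 2))                           # toy kernel of height h = 2
c = conv_feature_map(r, W, b=-30.0)
```

In the multi-kernel setting described in the text, the same function would be applied once per kernel and per kernel height, and the resulting feature sequences stacked.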


C. BiLSTM layer

LSTM (long short-term memory) is a modified recurrent neural network (RNN). An RNN encounters great difficulty when dealing with nodes that are far apart in the time series, because computing the connections between distant nodes involves repeated matrix multiplications, which causes the gradient to vanish or explode. The gating units and linear connections introduced by LSTM solve this problem well. However, LSTM cannot encode from back to front, so bidirectional LSTM (BiLSTM) is introduced. BiLSTM is composed of a forward LSTM and a reverse LSTM with the same structure; because they read the vector sequence in opposite orders, BiLSTM can better capture the semantic information of the entire sentence.

An LSTM unit is composed of a memory unit and three gates (forgetting gate F, memory gate I, and output gate O). By forgetting old information and remembering new information, information that is useful for subsequent computation is passed on and useless information is discarded. For this layer, the input is the sequence $c = [c_1, c_2, c_3, \ldots, c_{n-h+1}]$. At time t, the input is $c_t$, and the hidden layer state vector at time t-1 is $h_{t-1}$. The forgetting gate, memory gate, and output gate are calculated as follows.

Calculate the forgetting gate $F_t$ and select the information to be forgotten:

$$F_t = \sigma\left(W_f \cdot [h_{t-1}, c_t] + b_f\right) \tag{3}$$

Calculate the memory gate $I_t$, which selects the information to be remembered, and the new information $G_t$ (the current feature vector) after the transformation:

$$I_t = \sigma\left(W_i \cdot [h_{t-1}, c_t] + b_i\right) \tag{4}$$

$$G_t = \tanh\left(W_g \cdot [h_{t-1}, c_t] + b_g\right) \tag{5}$$

Then update the cell state $C_t$ from the cell state $C_{t-1}$ at the previous moment, the information to be forgotten in the forgetting gate, the information to be retained in the memory gate, and the current feature information $G_t$:

$$C_t = F_t \cdot C_{t-1} + I_t \cdot G_t \tag{6}$$

Calculate the output gate $O_t$ and the current hidden layer state $h_t$:

$$O_t = \sigma\left(W_o \cdot [h_{t-1}, c_t] + b_o\right) \tag{7}$$

$$h_t = O_t \cdot \tanh(C_t) \tag{8}$$

where $\sigma$ represents the sigmoid activation function, and $W$ and $b$ are the corresponding parameter matrices and bias terms.

The above is the working process of an LSTM unit.

Multiple LSTM units constitute the LSTM network. Through such a forward LSTM network, we obtain a sequence $h_a = [h_1, h_2, h_3, \ldots, h_{n-h+1}]$. To better capture the semantic information of the whole sentence, we also build a reverse LSTM network, whose working process and network structure are the same as those of the forward LSTM network, which gives $h_b$. These two vectors are spliced together to get $h$, which is the final output of our BiLSTM layer:

$$h = h_a \oplus h_b \tag{9}$$
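The gate computations of Eqs. (3)-(8) and the splicing of Eq. (9) can be sketched in NumPy as follows. This is a forward-pass illustration only, under the assumptions that each time step receives a feature vector, that the parameters are randomly initialized rather than trained, and that the backward sequence is re-reversed so both directions align by time step; the helper names `init_params`, `lstm_step`, `run_lstm`, and `bilstm` are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_params(input_dim, hidden_dim, rng):
    # One (weight, bias) pair per gate: f = forget, i = memory, g = candidate, o = output.
    return {k: (rng.normal(scale=0.1, size=(hidden_dim, hidden_dim + input_dim)),
                np.zeros(hidden_dim))
            for k in "figo"}


def lstm_step(c_t, h_prev, C_prev, p):
    x = np.concatenate([h_prev, c_t])        # [h_{t-1}, c_t]
    F = sigmoid(p["f"][0] @ x + p["f"][1])   # Eq. (3)
    I = sigmoid(p["i"][0] @ x + p["i"][1])   # Eq. (4)
    G = np.tanh(p["g"][0] @ x + p["g"][1])   # Eq. (5)
    C = F * C_prev + I * G                   # Eq. (6)
    O = sigmoid(p["o"][0] @ x + p["o"][1])   # Eq. (7)
    h = O * np.tanh(C)                       # Eq. (8)
    return h, C


def run_lstm(seq, p, hidden_dim):
    h, C = np.zeros(hidden_dim), np.zeros(hidden_dim)
    outputs = []
    for c_t in seq:
        h, C = lstm_step(c_t, h, C, p)
        outputs.append(h)
    return np.stack(outputs)


def bilstm(seq, p_fwd, p_bwd, hidden_dim):
    h_a = run_lstm(seq, p_fwd, hidden_dim)              # forward pass
    h_b = run_lstm(seq[::-1], p_bwd, hidden_dim)[::-1]  # backward pass, re-aligned
    return np.concatenate([h_a, h_b], axis=1)           # Eq. (9): splice h_a and h_b


rng = np.random.default_rng(0)
seq = rng.normal(size=(5, 3))  # 5 time steps, 3 features per step (toy values)
p_fwd, p_bwd = init_params(3, 4, rng), init_params(3, 4, rng)
h = bilstm(seq, p_fwd, p_bwd, hidden_dim=4)
```

Each row of `h` holds the forward and backward hidden states for one time step, so its width is twice the hidden size.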

D. Pooling layer

To make full use of the richer semantic information obtained through multi-channel processing, the pooling layer fuses the feature vectors extracted by the convolutional layer and the BiLSTM layer, and then pools the fused feature vectors.

Because of the multi-channel design, the feature vectors output by the convolutional layer and the bidirectional LSTM layer have two different representations, one per channel. In this step, we fuse the two feature vectors. The commonly used fusion methods are point-wise addition and concatenation. Comparing the two, the former is a special form of the latter: addition is equivalent to concatenation followed by a fixed, prior combination of the vectors, so concatenation can cover everything addition can express, and the combination can be learned during training. Concatenation can also alleviate vanishing gradients to some extent. We therefore adopt vector concatenation (splicing) for feature fusion. The two obtained feature vectors are spliced through the following formula:

$$M = h_1 \otimes h_2 \tag{10}$$

where $h_1$ and $h_2$ are the feature vectors from the two channels and $\otimes$ denotes the splicing operation.
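The claim that point-wise addition is a special form of concatenation can be checked numerically with a small sketch: adding two vectors gives the same result as concatenating them and then applying the fixed matrix [I, I], whereas a learned layer after concatenation is free to choose other weights. The feature values below are hypothetical.

```python
import numpy as np

h1 = np.array([1.0, 2.0, 3.0])    # features from channel 1 (toy values)
h2 = np.array([0.5, -1.0, 4.0])   # features from channel 2 (toy values)

# Fusion by splicing, as in Eq. (10): both channels are kept side by side.
M = np.concatenate([h1, h2])

# Point-wise addition equals concatenation followed by the fixed map [I, I].
k = h1.shape[0]
A = np.hstack([np.eye(k), np.eye(k)])
added = A @ M
```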

The pooling process further refines the previously learned features, extracting the main features while reducing the complexity of the neural network. There are two common pooling methods, mean pooling and max pooling. Mean pooling takes the average value of a region as the pooling result, and max pooling takes the largest value in the region. Max pooling, on the one hand, yields a fixed-length sentence representation and, on the other hand, picks out the strongest features while ignoring the less obvious ones.

To make full use of the semantic information obtained earlier and to capture the structural information and fine-grained features between the two entities, we adopt piecewise max pooling: instead of keeping a single maximum, we keep the maximum of each segment. In the original corpus, the two target entities divide each sentence into three segments; max pooling is carried out on each segment separately, and the results are then combined. The corresponding input $M_i$ is divided by the two target entities into three segments $\{M_{i1}, M_{i2}, M_{i3}\}$. The piecewise max pooling operation can be expressed as:

$$P_{ij} = \max(M_{ij}), \quad j = 1, 2, 3 \tag{11}$$


The three pooled values are then combined into the vector $P_i = \{P_{i1}, P_{i2}, P_{i3}\}$, and the final output P of the pooling layer is obtained after piecewise max pooling has been applied to all the feature vectors.
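Piecewise max pooling for one feature sequence can be sketched as below. The exact segment boundaries are an assumption here (each entity position is included in the segment it closes), as is the choice to return 0 for an empty segment; the function name `piecewise_max_pool` and the toy values are illustrative.

```python
import numpy as np


def piecewise_max_pool(M_i, e1_pos, e2_pos):
    """Split a 1-D feature sequence at the two entity positions and max-pool each part."""
    first, second = sorted((e1_pos, e2_pos))
    segments = [
        M_i[:first + 1],            # up to and including the first entity
        M_i[first + 1:second + 1],  # after the first entity, up to the second
        M_i[second + 1:],           # after the second entity
    ]
    # An empty segment (entity at the sentence boundary) contributes 0 in this sketch.
    return np.array([seg.max() if seg.size else 0.0 for seg in segments])


M_i = np.array([0.1, 0.7, 0.3, 0.9, 0.2, 0.4])
P_i = piecewise_max_pool(M_i, e1_pos=1, e2_pos=3)
```

Compared with a single global maximum (0.9 here), the three-value result keeps one strongest feature per segment, which preserves coarse positional information relative to the entities.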

E. Attention layer

The source of inspiration for the attention mechanism can be attributed to people’s physiological perception of the environment. For example, our visual system tends to select part of the information in the image to focus on analysis and ignore the irrelevant information in the image. Adding the attention mechanism to our model can not only improve the performance of the model and increase the interpretability of the neural network, but also alleviate the performance degradation caused by the increase of input sequence length and the low computational efﬁciency caused by the sequential processing of input in the neural network.

The attention mechanism can be understood as selectively screening out the important information from a large amount of information and focusing on these contents, while ignoring most of the unimportant information. In other words, it can identify which components of the input sentence have a greater impact on the classification of relations and assign them corresponding weights according to the impact size. The process of focusing is reflected in the calculation of the weight coefficient. The greater the weight is, the higher the attention will be. The input of the attention layer is P, and the calculation process is as follows:

$$e_t = \tanh(M_t) \tag{12}$$

$$a_t = \mathrm{softmax}(e_t) \tag{13}$$

$$s = \sum_{t=1}^{T} a_t M_t \tag{14}$$

We pass the output P of the pooling layer through the tanh function and apply softmax to obtain the attention weights a. All vectors in the sequence are then weighted by these attention weights and summed to obtain the final text feature vector s.
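Eqs. (12)-(14) can be sketched as follows. Because the equations do not state how the scores are reduced to one weight per position, the sketch assumes an element-wise reading: tanh is applied to every entry, softmax is taken over the sequence positions separately for each feature dimension, and the weighted vectors are summed. The function name `attention_pool` and the input values are hypothetical.

```python
import numpy as np


def attention_pool(M):
    """M has shape (T, k): T sequence positions, k features per position."""
    e = np.tanh(M)                                   # Eq. (12)
    w = np.exp(e - e.max(axis=0, keepdims=True))     # shift for numerical stability
    a = w / w.sum(axis=0, keepdims=True)             # Eq. (13): softmax over positions
    s = (a * M).sum(axis=0)                          # Eq. (14): weighted sum
    return a, s


M = np.array([[0.0, 1.0],
              [2.0, 1.0],
              [0.0, 1.0]])
a, s = attention_pool(M)
```

In the first feature dimension the middle position has the largest value and therefore receives the largest weight; in the second dimension all positions are equal, so the weights are uniform.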

F. Model training

Because the model has a large number of parameters, over-fitting often occurs during training. To prevent it, we adopt two strategies.

The Dropout strategy sets hidden node values to 0 with a certain probability in each training batch; that is, feature detectors are ignored with a certain probability. This reduces the number of parameters in use in the original model and, to a certain extent, the interaction between feature detectors, so over-fitting can be significantly reduced. We add Dropout after the input layer and the pooling layer.
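A minimal sketch of the Dropout strategy follows. It assumes the common "inverted dropout" variant, in which the kept units are rescaled during training so that no change is needed at test time, and it reads the value 0.65 as the drop probability; both points are assumptions, since the text does not specify them.

```python
import numpy as np


def dropout(x, rate, rng, training=True):
    """Zero each unit with probability `rate` during training (inverted dropout)."""
    if not training:
        return x
    keep = rng.random(x.shape) >= rate
    return x * keep / (1.0 - rate)   # rescale kept units to preserve the expected value


rng = np.random.default_rng(0)
x = np.ones(10000)
y = dropout(x, rate=0.65, rng=rng)
```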

Regularization adds penalty terms to the loss function of the model to reduce the complexity of the model itself, which also reduces over-fitting during training. In this paper, L2 regularization is used: the L2 norm term $\|\theta\|_F^2$ is added to the loss function, which can be expressed as:

$$J(\theta) = -\sum_{i=1}^{m} t_i \log(y_i) + \lambda \|\theta\|_F^2 \tag{15}$$

where the first term is the cross entropy and $\lambda$ is the regularization coefficient. In this paper, the Adam algorithm is used for optimization.
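The regularized loss of Eq. (15) can be sketched as below, assuming one-hot target vectors and a list of parameter arrays whose squared entries are summed for the penalty. The function name `regularized_loss`, the small constant added inside the logarithm, and the toy inputs are illustrative assumptions.

```python
import numpy as np


def regularized_loss(y_pred, t_onehot, params, lam):
    """Cross entropy over all samples plus lam times the squared Frobenius norm."""
    eps = 1e-12  # avoids log(0)
    cross_entropy = -np.sum(t_onehot * np.log(y_pred + eps))
    l2_penalty = sum(np.sum(p ** 2) for p in params)
    return cross_entropy + lam * l2_penalty


y_pred = np.array([[0.5, 0.5],
                   [1.0, 0.0]])        # predicted class probabilities (toy values)
t_onehot = np.array([[1.0, 0.0],
                     [1.0, 0.0]])      # true classes
params = [np.array([[1.0, 2.0],
                    [3.0, 4.0]])]      # squared Frobenius norm = 30
J = regularized_loss(y_pred, t_onehot, params, lam=0.001)
```

Here the cross entropy is log 2 (only the first sample is uncertain) and the penalty adds 0.001 × 30 = 0.03.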

The vector y after Dropout processing is input into the final softmax classifier as the final sentence feature:

$$z = W^{2} y \tag{16}$$

where $W^{2} \in \mathbb{R}^{l \times m}$ is the transition matrix, $l$ is the number of categories to be classified, $m$ is the number of convolution kernels, and the output $z$ is an $l$-dimensional vector. After softmax, the i-th dimension represents the probability of belonging to the i-th class.
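The final classification step can be sketched as a linear map, as in Eq. (16), followed by softmax. The matrix and feature values below are hypothetical, and the function name `classify` is illustrative.

```python
import numpy as np


def classify(y, W2):
    """Return class probabilities for the sentence feature vector y."""
    z = W2 @ y                     # Eq. (16): l-dimensional score vector
    w = np.exp(z - z.max())        # shift for numerical stability
    return w / w.sum()             # softmax


W2 = np.array([[1.0, 0.0],
               [0.0, 1.0],
               [1.0, 1.0]])        # l = 3 classes, m = 2 features (toy values)
y = np.array([1.0, 2.0])
probs = classify(y, W2)
predicted_class = int(np.argmax(probs))
```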

IV. EXPERIMENTAL RESULTS AND ANALYSIS

A. Data set and its introduction

To evaluate our proposed multi-channel framework, this paper uses the relation extraction data set SemEval 2010 Task 8 from the semantic evaluation conference SemEval. The data set contains 10,717 sentences, of which 8,000 are training samples and 2,717 are test samples, and the entities and entity relationships in each sentence have been annotated. Sample data are shown in Table 1.

TABLE I
SAMPLE DATA SETS

Sample number    Sample content
8001             The most common <e1>audits</e1> were about <e2>waste</e2> and recycling.
8002             The <e1>company</e1> fabricates plastic <e2>chairs</e2>.
8003             The school <e1>master</e1> teaches the lesson with a <e2>stick</e2>.
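Reading the entity annotations shown in the samples can be sketched with a regular expression. The function name `parse_sample` is hypothetical, and the sketch assumes each sentence contains exactly one well-formed `<e1>…</e1>` span and one `<e2>…</e2>` span.

```python
import re

# Matches <e1>...</e1> or <e2>...</e2>; the backreference \1 forces matching tag numbers.
ENTITY_PATTERN = re.compile(r"<e([12])>(.*?)</e\1>")


def parse_sample(text):
    """Return ({'e1': ..., 'e2': ...}, sentence_without_tags)."""
    entities = {f"e{num}": span for num, span in ENTITY_PATTERN.findall(text)}
    plain = re.sub(r"</?e[12]>", "", text)
    return entities, plain


sample = "The <e1>company</e1> fabricates plastic <e2>chairs</e2>."
entities, plain = parse_sample(sample)
```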

There are 9 relationship types between the two entities e1 and e2, plus 1 Other type. The relationship types and their distribution in the data set are listed in Table 2.

B. Model parameter settings and evaluation indicators

In the experiments of this paper, to better obtain the semantic information of the corpus, we use convolution kernels of different heights for multi-channel processing. The specific experimental hyperparameter settings are shown in Table 3.

TABLE II
INTRODUCTION TO DATA SETS

Relation types    Distribution (proportion)
C-E               1003 (12.54%)
C-W               941 (11.76%)
C-C               540 (6.75%)
E-D               845 (10.56%)
E-O               716 (8.95%)
I-A               504 (6.30%)
M-C               690 (8.63%)
M-T               634 (7.92%)
P-P               717 (8.96%)
Others            1410 (17.63%)

TABLE III
HYPERPARAMETER SETTINGS

Parameter                          Value
Word vector dimension d            300
Height of convolution kernel h     3, 4, 5
Number of convolution kernels m    90*3
LSTM number of units               100
Learning rate                      0.001
Dropout                            0.65
L2 regularization penalty term     0.001

For each relation extraction task, the precision, recall, and F1 value are generally used to comprehensively evaluate the extraction results. The three measures are defined as follows:

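Precision = \frac{TP}{TP + FP} (17)

Recall = \frac{TP}{TP + FN} (18)

F1 = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall} (19)

Among them, for a given relationship type, TP (true positive) is the number of samples of that type that are correctly predicted as that type; FP (false positive) is the number of samples of other types that are wrongly predicted as that type; TN (true negative) is the number of samples of other types that are correctly predicted as not belonging to that type; FN (false negative) is the number of samples of that type that are wrongly predicted as another type.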
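Equations (17) to (19) can be sketched per relationship type as follows. The toy gold and predicted labels are invented for illustration, and the macro-averaging helper is an assumption: the paper does not state how the per-type values are combined into a single score.

```python
def per_class_prf(gold, pred, label):
    """Precision, recall and F1 for one relationship type (Eqs. 17-19)."""
    tp = sum(g == label and p == label for g, p in zip(gold, pred))
    fp = sum(g != label and p == label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

def macro_f1(gold, pred, labels):
    """Unweighted mean of per-type F1 values (assumed averaging scheme)."""
    return sum(per_class_prf(gold, pred, lab)[2] for lab in labels) / len(labels)

# Toy labels, invented for the example.
gold = ["C-E", "C-E", "E-D", "Other", "C-E", "E-D"]
pred = ["C-E", "E-D", "E-D", "C-E", "C-E", "Other"]
p, r, f1 = per_class_prf(gold, pred, "C-E")
avg_f1 = macro_f1(gold, pred, ["C-E", "E-D"])
```

For the C-E type in this toy example there are two true positives, one false positive and one false negative, so precision, recall and F1 all equal 2/3. TN does not appear in any of the three formulas.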

C. Analysis of results

In this paper, we use multiple sets of experiments to verify that the proposed multi-channel framework improves relation extraction. We first carried out a comparative experiment within our own model.

The first set of experiments used only the Word2Vec word vector model to embed the words in the data set, and then used the CNN+ATT+BiLSTM model constructed in this paper to extract relations; the second set of experiments used only the GloVe word vector model to embed the words in the data set.

The third set of experiments used random initialization for the input mapping. The last set of experiments used the multi-channel method proposed in this paper. All other parameters were kept the same across the experiments. The experimental results are shown in Table 4.

TABLE IV COMPARISON BETWEEN THE MULTI-CHANNEL MODEL AND THE NORMAL CHANNEL MODEL

Word embedding layer settings Precision(%) Recall(%) F1(%)

Only use Word2Vec 82.3 87.1 84.4

Only use GloVe 86.6 83.5 85.1

Random initialization 75.2 75.4 75.3

Use both channels 87.1 86.3 86.7

From the above experimental results, it can be seen that random initialization gives the lowest results (F1 of 75.3%). When a word vector model is used for the word embedding layer, whether Word2Vec or GloVe, the classification performance of the model improves greatly over random initialization (F1 of 84.4% and 85.1%, respectively). This shows that word vectors play an important role in the model's understanding of natural language. Secondly, comparing the first and second sets of experiments, the two word vector models give slightly different results, but the difference is not large, which suggests that different word vectors capture somewhat different semantic information. Finally, comparing the first two sets of experiments with the fourth, the model using both channels at the same time achieves a higher F1 (86.7%) than either word vector model alone, which suggests that using two channels together captures the syntactic structure and semantic information of the corpus better and achieves better results on the relation extraction task.

In order to verify the stability of the model and explore the influence of the number of iterations on the F1 value, we tracked the F1 value obtained by the above four methods over different numbers of iterations.

As shown in Figure 2, during the first 15 iterations the F1 value rises steadily. After 15 iterations it converges and stabilizes. There are some fluctuations, but their range is small, which indicates that the model is stable. Figure 2 also shows that the proposed multi-channel approach achieves a higher F1 than the other three methods, indicating that it completes the relation extraction task better.

Fig. 2. Curve of F1-values changing with the number of epochs

This experiment also reports the classification results for the nine relationship types under the multi-channel model in this paper, as shown in Figure 3. It can be seen from the figure that the F1 value differs significantly across relationship types. Among them, C-E and E-D are significantly better than the other categories, while I-A and other types perform poorly.

To a certain extent this is related to the distribution of categories in the data set, but analyzing the data set also reveals how the data themselves affect the classification results. For example, in the C-E category the F1 value reached 95.41%, a relatively high level. We found through statistics that these sentences usually contain some high-frequency vocabulary accompanied by certain prepositions, and so have relatively clear structural characteristics. Therefore, this category performs well. In contrast, in the poorly performing categories, although high-frequency vocabulary often appears, there are still quite a few sentences that only use prepositions; that is, in these sentences the high-frequency vocabulary does not accompany the prepositions. The sentence structure features of these categories are not obvious enough, which is one of the reasons for the relatively low F1 value. Of course, the way a neural network judges the relationship type is more complicated, and the above is only one possible explanation. The features extracted by a neural network are multifaceted and may include features that are hard to interpret and difficult to extract manually, which is also an advantage of neural networks.

Fig. 3. Comparison of F1-values of each relation type

In order to better evaluate our proposed multi-channel relationship extraction framework, we selected the following published methods for comparison:

• SVM: integrates various features of syntax, morphology and frame semantics;

• Multi-CNN: Use multi-scale convolution kernel to extract features plus max pooling;

• BiLSTM+ATT: Fusion of BiLSTM and attention mechanism;

• Multi-channel CNN: Divide the sentence into different parts and treat each part separately.

The experimental results are shown in Table 5.

TABLE V COMPARISON WITH PUBLISHED RESULTS

Model F1(%)

SVM 82.2

Multi-CNN 82.8

BiLSTM+ATT 84.0

Multi-channel CNN 84.6

Ours 86.7

In our experiment, we chose the best-performing SVM model among the traditional methods. This model requires manual extraction of 12 features, such as part-of-speech tags and dependency parses, which is laborious and time-consuming. The method used in this paper is based on a deep learning model. Features are extracted automatically, which avoids this workload. The Multi-CNN method uses multi-scale convolution kernels to convolve the corpus, but it cannot accurately connect the context information. For example, if there are multiple entities in a sentence and there is a relationship between every two entities, then a convolutional neural network alone is not enough.

The BiLSTM+ATT method combines BiLSTM with an attention mechanism to extract relations from the corpus, but it cannot make the most of local information. For example, in a long sentence, two related entities may be separated by only a few words, and BiLSTM alone cannot fully capture such a relationship. The multi-channel CNN method divides the sentence into several parts, and each part is processed by the same neural network model without affecting the others. This method allows a fuller interpretation of the meaning of each word in the sentence, but it still cannot extract complex relationships in complex sentences. Since the concatenated vector in our framework is 600-dimensional, we set the input vector dimension of the above deep learning methods to 600.
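One way to read the 600-dimensional input is as the concatenation of a 300-dimensional Word2Vec vector and a 300-dimensional GloVe vector for each word. The sketch below illustrates that reading only; the function name `multi_channel_embed`, the zero vector for out-of-vocabulary words, and the random stand-in vectors are assumptions, and the paper may combine the channels at a different stage of the model.

```python
import numpy as np

def multi_channel_embed(tokens, w2v, glove, dim=300):
    """Concatenate two word vectors per token into one (len(tokens), 2*dim) matrix."""
    rows = []
    for tok in tokens:
        a = w2v.get(tok, np.zeros(dim))     # assumed fallback: zero vector
        b = glove.get(tok, np.zeros(dim))
        rows.append(np.concatenate([a, b]))
    return np.stack(rows)

rng = np.random.default_rng(0)
vocab = ["the", "company", "fabricates", "plastic", "chairs"]
w2v = {w: rng.normal(size=300) for w in vocab}     # stand-in for Word2Vec vectors
glove = {w: rng.normal(size=300) for w in vocab}   # stand-in for GloVe vectors
E = multi_channel_embed(vocab + ["unseen"], w2v, glove)
```

Giving the single-channel baselines a 600-dimensional input, as described above, keeps the input size the same across the compared models.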

The above experimental results show that the multi-channel relation extraction framework proposed in this paper improves on relation extraction models that use a single word vector. From this we can draw several conclusions. First, the multi-channel neural network model is better than the ordinary neural network model, because different channels provide more comprehensive and richer semantic information about the corpus, so that the neural network model in each channel can learn more diverse features. The multi-channel framework therefore has stronger representation capabilities for natural language. Secondly, comparing our single-channel model (i.e., CNN+BiLSTM+ATT) with the BiLSTM+ATT model, we can conclude that the convolutional neural network performs well at local feature extraction, although on its own it does not make good use of the contextual information in the corpus. Therefore, we integrate BiLSTM, the convolutional neural network and the attention mechanism: the convolutional neural network extracts local features, while BiLSTM captures contextual information and makes full use of it. Finally, the multi-channel framework proposed in this paper is extensible. More channels can be added, and the channels are computed in parallel, which greatly improves computational efficiency. Moreover, the multi-channel method can also be used to represent traditional semantic features, which provides a basis for better integration of traditional methods.

V. CONCLUSION

This paper proposes a multi-channel relation extraction model in which each channel uses a CNN+ATT+BiLSTM neural network. First, multi-channel refers to using the Word2Vec and GloVe word vector models to map the corpus and then merging the two vectors to obtain more semantic information. Second, in the neural network model of each channel, we combine CNN and BiLSTM to make full use of CNN's grasp of local features and BiLSTM's grasp of contextual information. On the SemEval 2010 Task 8 data set, the proposed multi-channel relation extraction framework achieves an F1 of 86.7%, an improvement over the other methods compared in this paper. As a next step, we will consider introducing more channels to further improve the performance of the model.

ACKNOWLEDGMENT

This work is supported by the Project of Shandong Natural Foundation (ZR2017LF019), China, and the National Key Research and Development Program of China (2019YFB1404704).

REFERENCES

[1] Y. Chen, K. Wang, W. Yang, Y. Qin, and P. Chen, “A multi-channel deep neural network for relation extraction,” IEEE Access, vol. PP, no. 99, pp. 1–1, 2020.

[2] J. Li, G. Huang, J. Chen, and Y. Wang, “Dual cnn for relation extraction with knowledge-based attention and word embeddings,” Computational Intelligence and Neuroence, vol. 2019, pp. 1–10, 2019.

[3] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efﬁcient estimation of word representations in vector space,” Computer ence, 2013.

[4] J. Pennington, R. Socher, and C. Manning, “Glove: Global vectors for word representation,” in Conference on Empirical Methods in Natural Language Processing, 2014.

[5] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer, “Deep contextualized word representations,” in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), M. A. Walker, H. Ji, and A. Stent, Eds. Association for Computational Linguistics, 2018, pp. 2227–2237.

[Online]. Available: https://doi.org/10.18653/v1/n18-1202

[6] J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), J. Burstein, C. Doran, and T. Solorio, Eds. Association for Computational Linguistics, 2019, pp.

4171–4186. [Online]. Available: https://doi.org/10.18653/v1/n19-1423 [7] D. Zeng, K. Liu, S. Lai, G. Zhou, and J. Zhao, “Relation classiﬁcation

via convolutional deep neural network,” in COLING 2014, 25th International Conference on Computational Linguistics, Proceedings of the Conference: Technical Papers, August 23-29, 2014, Dublin, Ireland, J. Hajic and J. Tsujii, Eds. ACL, 2014, pp. 2335–2344. [Online].

Available: https://www.aclweb.org/anthology/C14-1220/

[8] D. Zhang and D. Wang, “Relation classiﬁcation via recurrent neural network,” Computer ence, 2015.

[9] K. Xu, Y. Feng, S. Huang, and D. Zhao, “Semantic relation classiﬁcation via convolutional neural networks with simple negative sampling,”

Computer Science, vol. 71, no. 7, pp. 941–9, 2015.

[10] L. Wang, Z. Cao, G. D. Melo, and Z. Liu, “Relation classiﬁcation via multi-level attention cnns,” in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2016.

[11] J. Y. Lee, F. Dernoncourt, and P. Szolovits, “Mit at semeval-2017 task 10: Relation extraction with convolutional neural networks,” arXiv:

Computation and Language, 2017.

[12] D. Zeng, J. Zeng, and D. Yuan, “Using cost-sensitive ranking loss to improve distant supervised relation extraction,” in China National Con- ference on Chinese Computational Linguistics International Symposium on Natural Language Processing Based on Naturally Annotated Big Data, 2017.

[13] Y. Wang, X. Xin, and P. Guo, “Relation extraction via attention-based cnns using token-level representations,” in 2019 15th International Conference on Computational Intelligence and Security (CIS), 2019.

[14] R. Socher, B. Huval, C. D. Manning, and A. Y. Ng, “Semantic composi- tionality through recursive matrix-vector spaces,” in Joint Conference on Empirical Methods in Natural Language Processing & Computational Natural Language Learning, 2012.

[15] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.

[16] X. Yan, L. Mou, G. Li, Y. Chen, and Z. Jin, “Classifying relations via long short term memory networks along shortest dependency paths,”

computer science, 2015.

[17] J. Luo, J. Du, B. Nie, W. Xiong, L. Liu, J. He, and S. O. Computer,

“Tcm text relationship extraction model based on bidirectional lstm and gbdt,” Application Research of Computers, 2019.

[18] M. Miwa and M. Bansal, “End-to-end relation extraction using lstms on sequences and tree structures,” in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers.

The Association for Computer Linguistics, 2016. [Online]. Available:

https://doi.org/10.18653/v1/p16-1105

[19] D. Yan and B. Hu, “Shared representation generator for relation extrac- tion with piecewise-lstm convolutional neural networks,” IEEE Access, pp. 1–1, 2019.

[20] Z. Q. Geng, G. F. Chen, Y. M. Han, G. Lu, and F. Li, “Semantic relation extraction using sequential and tree-structured lstm with attention,”

Information ences, vol. 509, 2019.

[21] D. Talsma and M. G. Woldorff, “Selective attention and multisensory integration: Multiple phases of effects on the evoked brain activity,”

Journal of Cognitive Neuroscience, vol. 17, no. 7, pp. 1098–1114, 2005.

[22] Y. Lin, S. Shen, Z. Liu, H. Luan, and M. Sun, “Neural relation extraction with selective attention over instances,” in Meeting of the Association for Computational Linguistics, 2016.

[23] D. Zeng, K. Liu, Y. Chen, and J. Zhao, “Distant supervision for relation extraction via piecewise convolutional neural networks,” in Conference on Empirical Methods in Natural Language Processing, 2015.

[24] P. Qin, W. Xu, and J. Guo, “Designing an adaptive attention mechanism for relation classiﬁcation,” in International Joint Conference on Neural Networks, 2017.

[25] K. Cho, B. Van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using rnn encoder-decoder for statistical machine translation,” arXiv: Computation and Language, 2014.

[26] Y. Sun, Y. Cui, J. Hu, and W. Jia, “Relation classiﬁcation using coarse and ﬁne-grained networks with SDP supervised key words selection,” in Knowledge Science, Engineering and Management - 11th International Conference, KSEM 2018, Changchun, China, August 17-19, 2018, Proceedings, Part I, ser. Lecture Notes in Computer Science, W. Liu, F. Giunchiglia, and B. Yang, Eds., vol. 11061. Springer, 2018, pp.

514–522. [Online]. Available: https://doi.org/10.1007/978-3-319-99365- 2 46