Aristotle University of Thessaloniki Faculty of Sciences
School of Informatics MSc Artificial Intelligence
Transformer-based Zero-shot Entity Linking
Σύνδεση Οντοτήτων χωρίς παραδείγματα σε Βάσεις Γνώσης μέσω Μετασχηματιστή
Master Thesis of
Eleni Partalidou
Supervisor: Grigorios Tsoumakas, Associate Professor
March 26, 2021
DECLARATION
I confirm that this master’s thesis is my own work and I have documented all sources and material used.
Eleni Partalidou March 26, 2021
Acknowledgements
First of all, I would like to thank Prof. Grigorios Tsoumakas for giving me the opportunity to work on this thesis.
I would also like to thank PhD student Despina Christou for guiding and supporting me throughout the thesis.
Without your support, completing this thesis would have been considerably more difficult.
Abstract
This thesis is an introduction to Zero-shot Entity Linking, a Natural Language Processing (NLP) task in which a model assigns a unique identity to named entities mentioned in text, relying mainly on language comprehension. The basic steps of the task and its applications are presented.
State-of-the-art Zero-shot Entity Linking methods use Transformers, a recently introduced deep learning model that exploits Attention to process information, and the standard methodologies are introduced. Within this thesis, several experiments were conducted concerning the way the model extracts a set of candidate entities and assigns probabilities to them, the form in which information is fed to a Transformer, whether the entity type of a mention should be used or ignored, and further aspects that are analyzed in later chapters. The experiments show that the recall of the model, that is, the fraction of mentions whose correct entity description appears in a very small candidate set (50 candidates), is about 84%. They also indicate that good results depend on proper methods for computing the similarity between a document mention and an entity description. Finally, the evaluation results are discussed and conclusions are drawn.
Keywords: Zero-shot Entity Linking, Candidate Generation, Candidate Ranking, Transformers, BERT, Knowledge Base, Named Entity Recognition
Summary (Περίληψη)
This thesis provides an introduction to Zero-shot Entity Linking against Knowledge Bases, a Natural Language Processing task in which a model assigns a unique identifier to named entities detected in text, with a focus on natural language understanding. First, the basic steps of the task are presented, along with its applications to real-world problems. State-of-the-art methods for Zero-shot Entity Linking use Transformers, a recent class of deep learning models that exploit techniques for locating the useful information during processing, and an introduction to them is given. Within the thesis, many experiments were conducted concerning the way a model extracts candidates and assigns probabilities to them, the form the information should have when it is given to Transformers, whether the type of the named entity should be used or ignored, as well as other experiments that are analyzed further.
The experiments gave good results and showed that the recall of a model that contains the correct entity within a small candidate set (50 candidates) is about 84%. Furthermore, the experiments indicate the need for suitable methods that compute the similarity between a document containing the mention and a description of the entity, so that good results are obtained. Finally, the results are discussed and some conclusions are drawn.
Keywords: Zero-shot Entity Linking against Knowledge Bases, Candidate Generation, Candidate Ranking, Transformer, Knowledge Base, Named Entity Recognition
Contents
Acknowledgements
Abstract
Summary (Περίληψη)
1 Introduction
  1.1 Motivation
  1.2 Approaches and Objectives
  1.3 Thesis Outline
2 Foundations
  2.1 Entity Linking
    2.1.1 Description
    2.1.2 Candidate Generation
    2.1.3 Candidate Ranking
  2.2 Zero-shot Entity Linking
    2.2.1 Description
    2.2.2 Relationship to other EL tasks
  2.3 Transformer-based models
    2.3.1 Description
    2.3.2 BERT
  2.4 Knowledge Bases
    2.4.1 DBpedia
    2.4.2 Wikidata
    2.4.3 Wikipedia
    2.4.4 Others
3 Related Work
  3.1 Entity Linking
  3.2 Zero-shot Entity Linking
  3.3 Candidate Generation
  3.4 Candidate Ranking
4 Implementation
  4.1 Benchmark Dataset
    4.1.1 Description
    4.1.2 Data Construction
    4.1.3 Data usage
  4.2 Models
    4.2.1 BLINK Overview
    4.2.2 Bi-encoder
    4.2.3 Cross-encoder
  4.3 Pre-processing, Train and Test
    4.3.1 Pre-processing
    4.3.2 Training
    4.3.3 Testing
5 Experiments
  5.1 Setup
    5.1.1 Hardware
    5.1.2 Software
    5.1.3 Hyper-parameter settings
    5.1.4 Metrics
  5.2 Results
    5.2.1 Usage of different sentence representations
    5.2.2 Addition of entity type of the mention
  5.3 Discussion
    5.3.1 Comparison of similarity methods with less candidates
    5.3.2 Comparison of similarity methods with less candidates (+ entity type)
    5.3.3 Summary
6 Conclusions and Outlook
  6.1 Conclusion
  6.2 Outlook
Acronyms
List of Tables
List of Figures
Bibliography
Chapter 1
Introduction
1.1 Motivation
Over the last decades the Internet has become a global means of communication and the need for digital information is now imperative. Text content is being produced and put in knowledge bases, like Wikipedia. Natural Language Processing (NLP) is a research field that covers the need to use this information and helps us understand natural language. It is a fusion of Linguistics and Artificial Intelligence that aims to build computer systems that interact with digital documents and find correlations between different types of text. However, this process is extremely difficult due to the complexity of human language [Collobert et al., 2011b].
Entity Linking (EL) is a subtask of NLP that concerns the connection of a named entity mention in a document to an element of a large collection of records, a Knowledge Base (KB). It is a critical task that aims at a deep understanding of context: the named entities in a text are linked to KB entries, and the domain of each linked entry in turn helps to interpret the surrounding text. Moreover, Entity Linking (EL) contributes to the reinforcement of the Semantic Web, whose goal is to make Internet data machine-readable.
In Entity Linking (EL) it is taken for granted that the system has at some point observed all the mentions it will meet, which is not always true. The idea behind Zero-shot Entity Linking is that a model should be able to work with all types of mentions without in-domain labeled data. In other words, the task has two key properties:
• It is Zero-shot, as no mentions of the test entities have been observed during training.
• It relies on language understanding alone, since no other resources are available.
Numerous approaches that implement Zero-shot Entity Linking perform extremely well on different topics [Yao et al., 2020, Li et al., 2020a].
In summary, two goals motivate this work: investigating the Zero-shot Entity Linking problem by building on previous work on fine-tuning, and analyzing this approach in order to find the best-performing deep learning configuration. Even though existing systems such as [Wu et al., 2019] already address these issues, an approach that combines the best input representation with well-suited similarity metrics has yet to be found.
1.2 Approaches and Objectives
The objective of this thesis is to gain insight into the process and results of experimenting with different parameters for Zero-shot Entity Linking with deep learning. The dedicated dataset that is used to conduct experiments originates from Wikipedia and contains documents from different domains. For more information on the origin and creation of the dataset, see Chapter 4. The necessary steps towards this objective are expressed through the research questions below.
1. Which representation of the document works best?
Vector representations other than the default are tried in order to get the best results. Examples include the vector of the first token of the document ([CLS]), the average or the sum over the whole document, and the vectors of the special tokens that are placed at the beginning, at the end and around a mention.
2. What is the ideal number of candidates for the candidate set?
We try a range of candidate set sizes in order to find the size that includes the correct candidate with a high recall value and does not burden the Entity Linking (EL) system.
3. Which similarity method yields the best results?
We use different methods, such as the cosine similarity, the Euclidean distance and the dot product, to compare a document that includes a mention with a description of a named entity.
4. How important is the type of the named entity in the document?
We obtain the entity type of each mention and run experiments to show whether the entity type is valuable for the model.
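Research questions 1 and 3 can be made concrete with a small sketch. The following is a minimal illustration and not the implementation used in the experiments: the function names, the toy token vectors and the use of plain Python lists in place of BERT outputs are assumptions made for readability.

```python
import math


def pool(token_vectors, method="cls"):
    """Collapse per-token vectors into a single document vector."""
    if method == "cls":
        # Use the vector of the first token, which is [CLS] in BERT inputs.
        return list(token_vectors[0])
    summed = [sum(dim) for dim in zip(*token_vectors)]
    if method == "sum":
        return summed
    if method == "mean":
        return [value / len(token_vectors) for value in summed]
    raise ValueError(f"unknown pooling method: {method}")


def dot(u, v):
    """Dot product: larger means more similar."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine similarity: dot product of the length-normalised vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def euclidean_distance(u, v):
    """Euclidean distance: smaller means more similar."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Toy data: three token vectors for the mention document, one entity vector.
tokens = [[1.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
entity_vec = [3.0, 4.0]

for method in ("cls", "mean", "sum"):
    doc_vec = pool(tokens, method)
    print(method, dot(doc_vec, entity_vec), cosine(doc_vec, entity_vec),
          euclidean_distance(doc_vec, entity_vec))
```

Note that the dot product and the cosine similarity grow with similarity while the Euclidean distance shrinks, so a ranking based on the distance has to sort in the opposite direction.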
1.3 Thesis Outline
This section briefly presents the chapters of this thesis. After this introduction, the thesis is organized as follows:
Chapter 2 explains the foundations needed to follow this work. In particular, it breaks down the steps of Entity Linking (EL), namely Candidate Generation (CG) and Candidate Ranking (CR), introduces the Zero-shot technique and gives a full description of Transformers and of BERT, a basic model of this type. Moreover, the most common knowledge bases are presented.
Chapter 3 presents prior work on the topics involved in this thesis. First, the state of the art in the field of Zero-shot Entity Linking is described. Then, the current progress with regard to Candidate Generation (CG) and Candidate Ranking (CR) is shown.
Chapter 4 presents the implementation steps, in particular the structure of the dataset and the type of models used for fine-tuning and Zero-shot Entity Linking. Moreover, the pre-processing steps that had to be done before training and testing are explained.
Chapter 5 shows all the experiments and interprets the respective results. We experimented with different methods in order to identify the best-fitting model.
Finally, in Chapter 6, we present the conclusions drawn from the results with respect to the involved limitations. Additionally, we outline future work that would have suited the topic of this thesis but did not fit in its scope due to time restrictions.
Chapter 2
Foundations
2.1 Entity Linking
2.1.1 Description
Definition
Entity Linking (EL) refers to the process of linking entity mentions in a text document with their corresponding entities in a collection of structured information known as a Knowledge Base (KB). It has many applications in a variety of NLP tasks, including information extraction, information retrieval, knowledge base population and question answering. Entity Linking (EL) can be quite a difficult task when a named entity has many forms (abbreviations, short names) and the system has to identify and link each form correctly.
Another problem is entity ambiguity, meaning that an entity mention could possibly denote different named entities (the word "Paris" can refer to the city or the celebrity).
A formal definition of Entity Linking (EL) [Hu et al., 2019]: Given a Knowledge Base (KB) with a set of $r$ named entities $E = \{e_0, e_1, \dots, e_{r-1}\}$ and a list of documents with $n$ mentions $M = \{m_0, m_1, \dots, m_{n-1}\}$, an Entity Linking (EL) system maps each mention $m_i$, $i \in \{0, \dots, n-1\}$, to a named entity $e_j$, $j \in \{0, \dots, r-1\}$, from the Knowledge Base (KB). We assume that the mentions are known and Named Entity Recognition (NER) has been applied to the input as a previous step (End-to-End).
It must be noted that some mentions might have no corresponding entity in the Knowledge Base (KB). In such a case the mention is called unlinkable.
There is a rare case where a Knowledge Base (KB) may not even be available when performing Entity Linking (EL). The problem then is solved as an entity co-reference resolution problem, where the mentions from one or multiple documents are clustered as named entities based on the context they appear in. This case is out of the scope of the thesis and will not be described extensively.
An example of the task
An example of Entity Linking (EL) is depicted in Figure 2.1. The system reads the context surrounding a mention in a document, extracts candidate entities from the Knowledge Base (KB) and decides which entity the mention string most likely refers to. In this example, Michael Jordan could be the football player, the basketball player or the mycologist, but all these options are rejected in favor of the machine learning researcher because of the phrases "professor", "machine learning" and "artificial intelligence" in the document [Shen et al., 2014].
Figure 2.1: An illustration for the entity linking task
Components
A typical Entity Linking (EL) system consists of the following steps:
• Candidate Generation (CG): For each entity mention the Entity Linking (EL) system locates entities that are related to the topic of the document.
• Candidate Ranking (CR): If the candidate entity set has more than one entity, the system must choose the most probable entity using some criteria.
Applications
Entity Linking (EL) is essential to many different tasks:
• Information extraction: Named entities can be ambiguous or have different forms in documents. For that reason, if the entities are linked to a proper record from a Knowledge Base (KB), it is easier to understand them semantically and proceed to further exploitation [Lin et al., 2012].
• Information retrieval: Web search engines can benefit from the semantic information that Entity Linking (EL) provides. Mapping documents to a database helps to return better results [Hasibi et al., 2016].
• Content analysis: Algorithms that deal with text analysis are well suited to work together with Entity Linking (EL) systems. Recommendation systems are part of this family and their main focus is to collect and present to users documents that match their needs. If a system can connect an article with a database, the recommendations are likely to be more relevant. This technique also applies to social media, where two users may write posts about the same entity [Huang et al., 2018].
• Question answering: With a question answering system users find information about a topic they are interested in and communicate with a bot directly to get answers. Here, Entity Linking (EL) can play an important role when the user wants to learn more about a named entity, for example a famous person or a city. Moreover, a system that depends on a Knowledge Base (KB) should be prepared to meet the requirements of the active user [Li et al., 2020a].
• Knowledge base population: In knowledge base population the goal is the enhancement of the information in a Knowledge Base (KB) using text. Reading information from a document with a mention and collecting attributes for the linked entity (for example the age of a person entity) can boost the strength of that Knowledge Base (KB) [Dredze et al., 2010].
2.1.2 Candidate Generation
Candidate Generation (CG) is the first step in Entity Linking (EL). In this step, the system for each entity mention m tries to filter out irrelevant entities and keep a representative set of candidate named entities. To assess the quality of a Candidate Generation (CG) approach, a recall metric is used and defined as follows:
$$\mathrm{CGRecall} = \frac{\text{number of mentions whose candidate set contains the true entity}}{\text{number of all mentions}}$$
The higher the recall in the CG step, CGRecall, the higher the probability of finding the correct entity for each mention, which is pivotal for improving the accuracy of Entity Linking (EL) tasks.
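The recall definition above can be computed directly once the candidate sets are available. The sketch below is a minimal illustration; the function name, the entity identifiers and the cut-off parameter `k` (the number of candidates kept per mention) are assumptions introduced for the example.

```python
def cg_recall(candidate_sets, gold_entities, k):
    """Fraction of mentions whose top-k candidates contain the true entity."""
    hits = sum(
        1
        for candidates, gold in zip(candidate_sets, gold_entities)
        if gold in candidates[:k]
    )
    return hits / len(gold_entities)


# Toy data: one ranked candidate list and one gold entity per mention.
candidate_sets = [
    ["e1", "e7", "e3"],
    ["e2", "e9", "e4"],
    ["e5", "e6", "e8"],
]
gold_entities = ["e3", "e2", "e0"]

print(cg_recall(candidate_sets, gold_entities, k=1))
print(cg_recall(candidate_sets, gold_entities, k=3))
```

Increasing `k` can only keep or raise the recall, which is the trade-off between candidate set size and cost discussed in the next paragraph.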
Candidate Generation (CG) can become a very challenging task, especially in cases where the size of the Knowledge Base (KB) is large. First, the target entity may be absent from the candidate set, in which case the Candidate Ranking (CR) step cannot recover it. Increasing the size of the candidate set is a solution to this problem. However, with big candidate sets, systems struggle with memory and time and thus such sets are impractical for online applications [Guo et al., 2013].
Moreover, large scale systems comprised of document mentions and entity descriptions are usually imbalanced. A typical system may have to link few mentions to an enormous amount of named entities, resulting in the extraction of a candidate set that is not representative of a mention.
2.1.3 Candidate Ranking
Candidate Ranking (CR) is a very important step in Entity Linking (EL). The entity set that is extracted from the Candidate Generation (CG) step usually contains more than one component. For that reason, there is a need to sort the entity set $E_m$ based on the relevance of each entity $e_i$ to the mention $m$ and assign probabilities that denote how likely it is that $m$ is referring to $e_i$.
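One common way to turn relevance scores into probabilities is a softmax over the candidate set. The sketch below assumes this choice for illustration; the text above does not prescribe it, and the function names and scores are hypothetical.

```python
import math


def softmax(scores):
    """Map arbitrary real scores to probabilities that sum to 1."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(score - shift) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def rank_candidates(scored_candidates):
    """Return (entity, probability) pairs sorted from most to least likely."""
    names = list(scored_candidates)
    probabilities = softmax([scored_candidates[name] for name in names])
    return sorted(zip(names, probabilities), key=lambda pair: pair[1], reverse=True)


# Hypothetical relevance scores of three candidate entities for one mention.
ranking = rank_candidates({"entity_a": 1.0, "entity_b": 3.0, "entity_c": 2.0})
for name, probability in ranking:
    print(name, round(probability, 3))
```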
Candidate Ranking (CR) techniques can be categorized according to the availability of the correct entity in the dataset into the following methods:
• Supervised ranking methods: A training data set is provided with the target entity for each document mention and the objective is to learn how to rank the candidate set based on the targets.
• Unsupervised ranking methods: An unlabeled corpus is available without any targets and the candidates can be represented in a vector space to find similarities with the mention.
Another way to categorize Candidate Ranking (CR) methods, based on the treatment of named entities in the given document, is as follows:
• Independent ranking methods: Using these approaches, all the named entities in the document are independent and the mention doesn’t rely on them to extract a candidate set. The context of the mention is compared to the linked text from each candidate prior to ranking [Zheng et al., 2010].
• Collective ranking methods: It is assumed that a document is comprised of relevant entities that belong to the same or similar subject topic [Hoffart et al., 2011].
• Collaborative ranking methods: These methods collect documents from a Knowledge Base (KB) to find similar mentions and use their context to identify similar entities of each mention [Chen and Ji, 2011].
2.2 Zero-shot Entity Linking
2.2.1 Description
Traditional Entity Linking (EL) techniques assume that all the named entities that the system should be aware of exist in both the train and the test set.
The idea behind Zero-shot Entity Linking is that mentions must be linked to unseen entities without in-domain labeled data. No alias tables or frequency information is available and the Entity Linking (EL) system relies exclusively on language understanding. Formally, if $E_{train}$ and $E_{test}$ denote the knowledge bases for training and testing, it is required that $E_{train} \cap E_{test} = \emptyset$ [Wu et al., 2019]. Zero-shot settings are typically adopted when labeled data is too expensive to obtain. Therefore, there is a need to develop Entity Linking (EL) systems that can generalize to unseen specialized entities.
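The requirement that the training and test knowledge bases share no entity can be checked mechanically before any experiment is run. A minimal sketch, in which the function name and the entity identifiers are illustrative assumptions:

```python
def is_zero_shot_split(train_entities, test_entities):
    """True if no entity appears in both the training and the test KB."""
    return set(train_entities).isdisjoint(test_entities)


# Hypothetical entity identifiers from two different domains.
train_kb = {"world_a/entity_1", "world_a/entity_2"}
test_kb = {"world_b/entity_1", "world_b/entity_2"}

print(is_zero_shot_split(train_kb, test_kb))
```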
Figure 2.2: Zero-shot Entity Linking using Wikias dataset. Multiple training and test domains (worlds) are shown.
Zero-shot Entity Linking poses two fundamental challenges. First, the system has to understand the context in order to resolve new entities (topics), which demonstrates the importance of language comprehension. Moreover, as some of the named entities that exist in testing have not been at the system’s disposal during training, the system must adapt to new document mentions and entity descriptions.
2.2.2 Relationship to other EL tasks
In this section the relationship between prior EL tasks and this newly introduced approach is discussed.
• Standard EL: Both methodologies have in common that they use the mentions during training and it is assumed that the document mentions and entity descriptions are available. However, standard EL systems utilize frequency statistics and meta-data.
• Cross-Domain EL: Older approaches try to link entity mentions to different types of text, like news articles, whilst both approaches use Wikipedia articles for training (Wikification) [Hachey et al., 2013].
• Linking to any DB: Former approaches attempt to generalize for unseen entities, but differ in that they use techniques like alias tables to reduce the size of the candidate set or look at structured data to disambiguate natural language.
2.3 Transformer-based models
2.3.1 Description
Transformers first appeared in the NLP community in 2017 and resemble Recurrent Neural Networks (RNN) in that they have been designed to process sequential data. They have been used in various NLP tasks, such as text translation and summarization. The main difference between an RNN and a Transformer is that the latter does not need to process the elements of the sequence one after the other. As a result, a great advantage of Transformers is that their computations can be parallelized, which motivates NLP specialists to use them.
A Transformer consists of an encoder-decoder structure, where the encoder maps an input sequence $(x_0, x_1, \dots, x_{n-1})$ to a sequence of continuous representations $z = (z_0, z_1, \dots, z_{n-1})$ and the decoder uses $z$ to generate an output sequence $(y_0, y_1, \dots, y_{n-1})$. At each step the model consumes the previously generated outputs as inputs.
The outstanding part of a Transformer is the Attention mechanism. Using Attention, the goal is to map a query and a set of keys-values to an output computed as a weighted sum of the values. The weights depict the similarity of the keys with the query [Vaswani et al., 2017]. The Attention mechanism has many applications in the Transformer architecture:
• Both the encoder and the decoder are built from Multi-Head Attention. Instead of using a single Attention function, the keys, values and queries are projected linearly h times and Attention is applied to each projection in parallel. In the encoder-decoder attention layers, the queries come from the previous decoder layer and the memory keys and values come from the output of the encoder.
• The encoder contains self-attention layers with keys, values and queries from the output of the previous layer of the encoder.
• The decoder also contains self-attention layers, in which each position attends to the information processed up to that point.
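The description of Attention as a weighted sum of values can be illustrated with a single-head sketch. It follows the scaled dot-product form of [Vaswani et al., 2017], in which the query-key dot products are divided by the square root of the key dimension and normalized with a softmax; the function names, the plain-list representation and the toy vectors are assumptions, and the learned linear projections of Multi-Head Attention are omitted.

```python
import math


def softmax(scores):
    """Normalise scores into weights that sum to 1."""
    shift = max(scores)
    exps = [math.exp(score - shift) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors."""
    d_k = len(keys[0])
    outputs = []
    for query in queries:
        # Similarity of the query with every key, scaled by sqrt(d_k).
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
            for key in keys
        ]
        weights = softmax(scores)
        # Output is the weighted sum of the value vectors.
        outputs.append([
            sum(weight * value[j] for weight, value in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


# Self-attention: queries, keys and values all come from the same sequence X.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for row in attention(X, X, X):
    print([round(value, 3) for value in row])
```

When all keys are identical the weights are uniform, so the output reduces to the plain average of the values.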
Figure 2.3: The transformer - model architecture
Self-attention is the case where the Query, Key and Value are all generated from the same input sequence X, and it is important for several reasons. Compared to convolutional and recurrent layers, the computational complexity of a self-attention layer is typically lower. Additionally, as mentioned above, the computations in self-attention can be parallelized. Equally important is the path length: the shorter the path a signal has to traverse between two positions in the model, the easier it is to learn dependencies across long sequences.
Transformers have played an important role in Entity Linking (EL). The context of an entity in a sequence and the application of the Attention feature of a Transformer to find correlations between the tokens can make the algorithm better understand the entity and improve the performance of the model for the Entity Linking (EL) task.
2.3.2 BERT
In language modeling, where the goal is to understand and predict a missing word based on the context (for example ’The woman went to the store and bought a X of shoes.’), most architectures (RNN, LSTM) process the text sequence from one direction to the other. Bidirectional Encoder Representations from Transformers (BERT) gains a deeper understanding of the language context, because it processes a sequence in both directions at the same time [Devlin et al., 2018]. BERT is built on the Transformer encoder, is open source and achieves state-of-the-art results in different NLP tasks, like sentiment analysis and question answering [Li et al., 2020a].
The main reason why this bidirectional technique is so successful is that it is context-based. In a context-based representation, the embedding of a word is computed from the sentence in which it appears, so the same word can receive different vectors in different contexts. Vectors that are part of context-free representations like Word2vec [Mikolov et al., 2013] are computed once from a large corpus with no specific domain and may not always be informative (the word "bank" has the same vector in all cases). Consequently, it is very helpful to use text from both the left and the right side of a missing word to compute the correct vector.
BERT processes the input before it extracts a vector in different stages:
• Token embeddings: A [CLS] token is added to the input at the beginning of the first sentence and a [SEP] token is inserted at the end of each sentence.
• Segment embeddings: A marker indicating the sentences is added to each token. This allows the encoder to distinguish between sentences.
• Positional embeddings: A positional embedding is added to each token to indicate its position in the sentence.
Figure 2.4: Input Representation of BERT
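The three input stages listed above can be sketched for a pair of sentences. This is a minimal illustration of the bookkeeping only: the function name is hypothetical, the sentences are assumed to be tokenized already, and the lookup of the actual embedding vectors is left out.

```python
def build_bert_input(sentence_a, sentence_b):
    """Build tokens, segment ids and position ids for a sentence pair."""
    # Token stage: [CLS] at the start, [SEP] after each sentence.
    tokens = ["[CLS]"] + sentence_a + ["[SEP]"] + sentence_b + ["[SEP]"]
    # Segment stage: 0 for the first sentence (with [CLS] and its [SEP]),
    # 1 for the second sentence (with its [SEP]).
    segment_ids = [0] * (len(sentence_a) + 2) + [1] * (len(sentence_b) + 1)
    # Position stage: the index of every token in the full sequence.
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids


tokens, segment_ids, position_ids = build_bert_input(
    ["my", "dog", "is", "cute"], ["he", "likes", "playing"]
)
for token, segment, position in zip(tokens, segment_ids, position_ids):
    print(position, segment, token)
```

In the model itself, each of the three id sequences indexes its own embedding table and the three resulting vectors are summed per token.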
The input is converted to a list of vectors, which is then processed by the neural network. BERT follows 2 strategies during training:
Masked Language Model (MLM): In this technique, BERT randomly masks words in the sentence and tries to predict them using the context of each mask from both sides. About 15% of the tokens are selected for prediction. However, the special token [MASK] never appears during fine-tuning, so a model trained only on [MASK] inputs would learn to predict only when [MASK] is present. So, from this 15%:
• 80% of the tokens are replaced with [MASK]
• 10% are replaced with a random word
• 10% are left unchanged
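The masking scheme above can be written down in a few lines. The sketch below is a simplified illustration rather than BERT's actual pre-processing code: the function name, the toy vocabulary and the per-token random draw are assumptions, and special tokens are simply skipped.

```python
import random

SPECIAL_TOKENS = ("[CLS]", "[SEP]")


def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Return (masked tokens, labels); labels are None where nothing is predicted."""
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if token in SPECIAL_TOKENS or rng.random() >= select_prob:
            continue  # token is not selected for prediction
        labels[i] = token  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            masked[i] = "[MASK]"           # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random word
        # remaining 10%: keep the original token unchanged
    return masked, labels


rng = random.Random(0)  # fixed seed so that the example is repeatable
sentence = ["[CLS]", "the", "woman", "bought", "a", "pair", "of", "shoes", "[SEP]"]
masked, labels = mask_tokens(sentence, ["store", "city", "went"], rng)
print(masked)
print(labels)
```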
Next Sentence Prediction (NSP): Using this strategy, the BERT model aims at understanding the relationship between 2 sentences. During this process, the model takes as input 2 sentences and it learns to predict if the second sentence follows the first one. The sequence is fed to the model which assigns to the second sentence the value IsNext if there is a sequential relationship or NotNext if it doesn’t follow the first sentence.
Figure 2.5: An example that uses the 2 BERT training strategies
BERT models can be trained on an enormous amount of data or used as pre-trained models, which are then fine-tuned on smaller task-specific datasets. Fine-tuning has advantages over training from scratch. The weights in the pre-trained model carry information from a large text collection and thus the training process takes less time. If we were to build a model from scratch, we would need an enormous amount of text data and lots of resources to do so.
Fine-tuning is a form of transfer learning [Elnaggar et al., 2018]. It can be implemented in different ways:
• We can further train the entire pre-trained model on our dataset and feed the output to a softmax layer (if it is desired). In this case, the error is back-propagated through the entire architecture and the weights of the model are updated based on the new dataset. This approach is followed in this thesis.
• We can keep the weights of initial layers of the model frozen while we retrain only the higher layers. The optimal number of layers to be frozen can be investigated in this case.
• We can even freeze all the layers of the model and attach a few neural network layers of our own and train this new model. Note that the weights of only the attached layers will be updated during model training.
An important issue when fine-tuning BERT is the choice of the sequence length. The model doesn’t need to process the whole document; instead, a fixed sequence length is set for all inputs. The maximum sequence length that is supported by default by the model is 512 tokens. When an input is shorter than the sequence length, it is padded with the [PAD] token. In Entity Linking (EL), the fixed-length part of the document that is fed to BERT should contain the mention and the context surrounding it to boost the performance.
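Selecting a fixed-length input that keeps the mention and its surrounding context can be sketched as follows. This is one simple way to do it, assumed for illustration: the function name is hypothetical, the context budget is split evenly between the left and the right side, and unused budget on one side is not handed to the other.

```python
def mention_window(tokens, mention_start, mention_end, max_len):
    """Cut a window around tokens[mention_start:mention_end] and pad to max_len."""
    mention = tokens[mention_start:mention_end]
    budget = max_len - len(mention) - 2  # reserve room for [CLS] and [SEP]
    if budget < 0:
        raise ValueError("max_len is too small to hold the mention")
    left_budget = budget // 2
    right_budget = budget - left_budget
    left = tokens[max(0, mention_start - left_budget):mention_start]
    right = tokens[mention_end:mention_end + right_budget]
    window = ["[CLS]"] + left + mention + right + ["[SEP]"]
    # Pad short inputs so that every sequence has exactly max_len tokens.
    return window + ["[PAD]"] * (max_len - len(window))


document = "a b c d e f g h".split()
print(mention_window(document, 3, 5, max_len=8))   # mention is "d e"
print(mention_window(["x", "y"], 0, 1, max_len=6))  # short input gets padded
```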
Fine-tuning a BERT model can be applied to different tasks. In Entity Linking (EL) BERT is used for sequence classification and the goal is to classify a document to a possible candidate.
Figure 2.6: Fine-tuning BERT for NLP tasks
2.4 Knowledge Bases
Knowledge bases play a significant role in the enhancement of an NLP system, as the latter is able to keep up with real-time structured data. Ideally, we want knowledge bases to cover a wide field of knowledge in different domains that is valid and up to date. Most knowledge bases depend on Wikipedia, a widely used encyclopedia, which is supported by a big user community. In this section, different knowledge bases that an Entity Linking (EL) system can use are discussed.
2.4.1 DBpedia
DBpedia is a multilingual Knowledge Base (KB) with structured data from Wikipedia. In DBpedia, information is extracted by posing queries (for example, find all the rivers in Italy that are longer than 100 miles), and the KB is connected to the Web to be kept up to date. DBpedia is supported by the DBpedia user community, which creates mappings between information in Wikipedia and the DBpedia ontologies [Lehmann et al., 2015]. It covers many subjects and is used in different areas, like data integration, Named Entity Recognition (NER), theme detection and document ranking. Many modern applications use DBpedia as the prime Knowledge Base (KB). Wikipedia articles consist mostly of free text, but they also contain various other types of content, like images, geographical data, external links and links to other language versions of the article. The DBpedia extraction framework extracts information from those types and converts it to a rich and structured Knowledge Base (KB).
In order to do that, the framework follows some steps:
• Input: Wikipedia pages are read from an external source.
• Parsing: Each Wikipedia page is transformed into an Abstract Syntax Tree by the wiki parser.
• Extraction: DBpedia offers an extractor to get information from the Abstract Syntax Tree, like targets or geo-locations. The result of the extraction phase are RDF statements.
• Output: The collected RDF statements are written to a sink.
Figure 2.7: Overview of DBpedia extraction framework
DBpedia is comprised of datasets that are used for Natural Language Processing (NLP) tasks. The Lexicalization dataset, which is one of them, provides access to alternative names for entities and themes and computes scores to estimate the strength of the relationship between a name and a given URL. This scoring method helps to discern the common entities from the rare ones and mirrors their ambiguity, meaning that explicit entities usually don’t describe many concepts.
The DBpedia ontology is comprised of 320 classes and 1650 properties (footnote 1). The DBpedia community no longer adds new concepts, as the Knowledge Base (KB) is already broad enough, but only new properties. When DBpedia is used for the Entity Linking (EL) task, the URIs of the pages are used as identifiers.
2.4.2 Wikidata
In March 2012 the development of Wikidata started, a free Knowledge Base (KB) that can be read by both humans and machines [Vrandečić and Krötzsch, 2014]. Wikidata provides data for all the languages that are supported by Wikipedia. The main components of Wikidata are:
• Free usage: Copying, editing, distributing and displaying data from Wikidata is allowed without having to apply for permission.
• Collaboration: Data is inserted into the Knowledge Base (KB) by editors, who also decide on the rules of the content.
• Multilingualism: Any editing or re-usage of data is applied to all the supported languages.
• Being a secondary database: Wikidata records statements and their sources for knowledge diversity and so data is verified.
• Collecting structured data: Structured data allows easier usage from users and also is better understood by machines.
• Support for Wikimedia wikis: Data and links are maintainable to all the languages.
• Anyone can use it: Anyone can use Wikidata in many ways by using its application programming interface.
The Knowledge Base (KB) Wikidata collects information in items that contain labels and descriptions. The items are described through statements that apply to a specific context. Many statements from different sources and dates can be defined and some items are connected to each other when necessary. Because many statements can describe the same item or property, these are ranked according to their relevance to the user's needs and their validity.
¹ https://wiki.dbpedia.org/
Figure 2.8: Items and their data are interconnected
At the time of writing, Wikidata contains close to 1 billion items and 8 thousand properties. Wikidata also offers a Linked Data interface as well as regular RDF dumps of all its data.
2.4.3 Wikipedia
Wikipedia has long been the default choice for automatic information extraction because of its availability and the encyclopedic knowledge it contains [Nakayama et al., 2010]. MediaWiki.org maintains a list of innovative applications that process wiki content. Many researchers who have used Wikipedia as their Knowledge Base (KB) have exploited a wide range of article features, such as inner links, anchor texts, category links or redirect pages, to implement NLP tasks. Inner links have been used to create an association thesaurus that connects strongly related articles belonging to the same or similar concepts. Anchor texts are well suited to word sense disambiguation and synonym extraction: if the same anchor text links to many different articles, the term is ambiguous. Category links show the domain of an article, so they are used in text categorization. Additional software projects that depend on Wikipedia construct Wikipedia ontologies to detect semantic relationships. These projects have to process natural language and for that reason they use Part-of-Speech tagging techniques.
Also, a Wikipedia API has been implemented for Web applications. Java Wikipedia Library (JWPL) is a Java-Based API that provides access to the Wikipedia API.
2.4.4 Others
YAGO
YAGO² is a Knowledge Base (KB) similar to DBpedia in that every Wikipedia article becomes an entity. One of the main differences between them is that DBpedia stays closer to Wikipedia and provides RDF versions of its content, whereas YAGO attempts to extract fewer relations than DBpedia in exchange for greater precision and more consistent knowledge. YAGO is backed by WordNet, so it contains more classes than DBpedia, and it also integrates objects and attributes in interactive rules. Both knowledge bases are connected to and complement each other (DBpedia suggests YAGO as an alternative solution).
Freebase
Freebase³ is a graph database that extracts structured information from Wikipedia and converts it to RDF. Both Freebase and DBpedia provide dumps, APIs and endpoints to access wiki data. However, there are basic differences between the two databases. Freebase draws on many sources to cover a wide field of knowledge. It depends on the GraphD database, which stores meta content for every fact. Because Freebase can be edited by any user, previous versions of it are backed up. Lastly, Freebase is mainly run by Google and focuses on areas relevant to it, providing a higher coverage than DBpedia.
² https://yago-knowledge.org/
³ https://developers.google.com/freebase
Chapter 3
Related Work
3.1 Entity Linking
Traditional Entity Linking (EL) systems tend to rely on a hand-designed set of features to calculate similarities between mentions and entities, as well as correlations between entities. The Entity Linking (EL) task can be broken into two steps: Candidate Generation (CG) and Candidate Ranking (CR). In the most common form, similarity measures such as name string comparison are used to extract a suitable candidate set, and supervised [Csomai and Mihalcea, 2008] or unsupervised [Bunescu and Pasca, 2006] methods are used to rank the candidate entities.
Some models take into account the correlation between different entities in the same document, which can improve overall performance. The most popular method has been to construct an undirected graph of the candidate entities and use graph mining algorithms to find the correct entity. Several models attempt to find correlations between nodes of the graph [Hoffart et al., 2011], [Alhelbawy and Gaizauskas, 2014]. These collective approaches appear to outperform the traditional ones.
In recent years, Entity Linking (EL) has advanced by leveraging unlabeled documents from a Knowledge Base (KB). The Entity Linking (EL) model uses the entities from the candidate set as latent variables and learns to choose entities based on the context of the mention and on coherence with other entities in the document [Le and Titov, 2019]. Another work offers a Human-In-The-Loop annotation method, where a recommendation system suggests potential concepts for a mention [Klie et al., 2020]. In a more recent approach, the semantic information of a document is merged into the entity embeddings, so that the linking model can learn contextual commonality [Hou et al., 2020].
3.2 Zero-shot Entity Linking
Earlier works have pointed out the importance of building Entity Linking (EL) systems that can generalize to unknown named entities. The approaches of [Sil et al., 2012] and [Wang et al., 2015] are very close to Zero-shot Entity Linking, but attempt to reduce the size of the candidate set. [Gupta et al., 2017] selected mentions with multiple entity candidates to enhance the system's performance. Further approaches perform word disambiguation using keys from a dictionary [Chaplot and Salakhutdinov, 2018], but fall short because some words are ambiguous and the system is not able to generalize properly. A more recent technique proposed by [Logeswaran et al., 2019] relies on language comprehension in order to generalize.
3.3 Candidate Generation
Approaches to Candidate Generation (CG) have mainly been based on string comparison between the entity mention and the name of the entity in the Knowledge Base (KB) [Phan et al., 2017]. This kind of comparison breaks down when the entity mention contains spelling mistakes.
Other prior work has used alias tables that list the possible named entities for a given mention, which restricts the Entity Linking (EL) system to a relatively small candidate set. In recent years, many systems have used frequency information to estimate the probability that a mention links to a given entity. Most commonly, traditional information retrieval methods such as TF-IDF have been followed. TF-IDF aims to capture how important a mention string is to a candidate document. Its value depends on the number of times the mention appears in the document and on the frequency of the mention in the whole corpus. BM25, a bag-of-words method that uses keywords from each document to calculate the similarity between the mention string and the candidate document, has been used as an alternative to TF-IDF ([Sil et al., 2012], [Murty et al., 2018]).
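To make the frequency-based retrieval described above concrete, the following is a minimal sketch of BM25 scoring for Candidate Generation. It is an illustration under stated assumptions, not the implementation used in the cited works: the function names, the parameter values k1 = 1.5 and b = 0.75, lowercased whitespace tokenization and the toy entity descriptions are all choices made for this example.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score every tokenized document against the query with Okapi BM25."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(doc) for doc in docs_tokens) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for doc in docs_tokens:
        df.update(set(doc))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in set(query_tokens):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


def top_k_candidates(mention, entity_descriptions, k=2):
    """Return the indices of the k descriptions that best match the mention."""
    docs = [text.lower().split() for text in entity_descriptions]
    scores = bm25_scores(mention.lower().split(), docs)
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return ranked[:k], scores


# Toy entity descriptions (hypothetical, for illustration only).
descriptions = [
    "warner was a communications technician",
    "winnie tyler was the wife of jacob tyler",
    "a space station in orbit",
]
candidates, scores = top_k_candidates("Winnie Tyler", descriptions, k=2)
```

Only the second description shares tokens with the mention, so it is ranked first; the other two receive a score of zero, which shows why purely lexical retrieval struggles when the mention and the description use different words.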
Neural-based approaches have also been used for the detection of relevant entities from a piece of text. Recurrent Neural Networks (RNN) have been used for text understanding for decades, handling text as a sequence of tokens, where tokens are fed to the model at each moment of time (state). In RNNs, the positions of the tokens in the document matter and the neurons of the network use the output from the previous state as input for decision making [Salehinejad et al., 2017].
For this task the following models have been proposed [Humeau et al., 2019]:
• A bi-encoder applies self-attention to the input and to the candidate label separately, maps both to a common feature space and uses a function (dot product, cosine, non-linear) to compute their similarity. The candidate encodings can be computed once and cached, even for a big candidate set, and thus the predictions are fast.
• A cross-encoder applies full attention to the input and the candidate label, merges them and passes them through a non-linear function. With this approach, the label is able to interact with, and be compared to, each token of the context, which leads to high evaluation metrics. However, the model has to compute the representations of every input from scratch and is therefore slow at test time.
• A poly-encoder aims to get the best of both the bi-encoder and the cross-encoder. A candidate label is represented as a single vector, as in a bi-encoder, which allows the candidates to be cached for fast inference. Meanwhile, the context is attended jointly with the candidate, as in the cross-encoder, allowing the extraction of more information.
Figure 3.1: Diagrams of the three model architectures. (a) Bi-encoder, (b) Cross-encoder (c) Poly-encoder
3.4 Candidate Ranking
Multiple studies on Candidate Ranking (CR) have also compared the name string of the mention with that of each candidate. Entity popularity has been used as a feature by [Hoffart et al., 2011], under the assumption that some entities in the candidate set are more frequent than others, so the mention is more likely to refer to a popular candidate. Another very important component of a named entity is its type, for example whether the entity is a location, a person or an organization. Exploiting the entity type can help to identify which candidates are more similar to the mention and boost the performance of the Candidate Ranking (CR) system. The entity type of the mention and of the candidates may be missing in some cases and can then be retrieved using a Named Entity Recognizer [Li et al., 2020b]. Additionally, the textual context of both the document mention and each entity description has been exploited for Candidate Ranking (CR).
Supervised ranking methods based on binary classification [Zhang et al., 2010] implement an algorithm that decides, for each candidate, whether the mention should be linked to it, and the candidate with the highest probability is selected. Learning-to-rank methods, in turn, focus on finding correlations between the candidates. Another approach is to create a new dataset with many training records in which the mention is replaced with similar words, to boost the performance of the system.
Unsupervised ranking methods calculate distances between the mention and each candidate using a Vector Space Model (VSM) [Zhang et al., 2010]. Also, some researchers have treated the Candidate Ranking (CR) problem as an information retrieval based ranking problem, where each candidate entity generates a search query for each mention.
More recent approaches ([He et al., 2013], [Sun et al., 2015], [Yamada et al., 2016], [Ganea and Hofmann, 2017], and [Kolitsas et al., 2018]) use neural networks to model the mention and its context. The approach of [Logeswaran et al., 2019] implements a transformer-based architecture that performs deep cross-attention. Two variants of this approach are:
• Pool-Transformer: a network that implements two deep transformers that extract single vector representations for the mention and the entity. Single vector representations have also been used in prior works [Gupta et al., 2017].
• Cand-Pool-Transformer: an architecture that also uses two Transformer encoders. It differs in that it attends to individual tokens of the mention and its context.
Chapter 4
Implementation
4.1 Benchmark Dataset
4.1.1 Description
The main dataset that was used to study the Zero-shot Entity Linking problem was constructed using documents from Wikias¹. Wikias are community-written encyclopedias, each specializing in a particular subject such as a fictional universe from a book or film series. In Wikias, labeled mentions can be automatically extracted based on hyperlinks. The documents have rich context that NLP algorithms can use for language understanding.
Each encyclopedia contains many unique entities that belong to a specific theme, which makes it a great dataset to evaluate domain generalization of Entity Linking (EL) systems.
The data in the Wikias dataset come from 16 Wikias, 8 of which are used for training (American Football, Doctor Who, Fallout, Final Fantasy, Military, Pro Wrestling, Star Wars, World of Warcraft), 4 are used for validation (Coronation Street, Elder Scrolls, Ice Hockey, Muppets) and 4 are used for testing (Forgotten Realms, Lego, Star Trek, YuGiOh). Each domain has a large number of entities, which is shown in Table 4.1.
The entities are mentioned in the documents. The training set has 49,275 labeled mentions. To examine the in-domain generalization performance, two held-out sets of 5,000 mentions each, seen and unseen, are constructed. They are composed of mentions that link only to entities that were seen or unseen during training, respectively. The validation and test sets have 10,000 mentions each (all of which are unseen).
¹ https://github.com/lajanugen/zeshel

Set    World              Entities
Train  American Football   31,929
Train  Doctor Who          40,281
Train  Fallout             16,992
Train  Final Fantasy       14,044
Train  Military           104,520
Train  Pro Wrestling       10,133
Train  Star Wars           87,056
Train  World of Warcraft   27,677
Val    Coronation Street   17,809
Val    Muppets             21,344
Val    Ice Hockey          28,684
Val    Elder Scrolls       21,712
Test   Forgotten Realms    15,603
Test   Lego                10,076
Test   Star Trek           34,430
Test   YuGiOh              10,031
Table 4.1: Number of Entities per world
World              Train mentions  Eval Seen mentions  Eval Unseen mentions
American Football            3898                 410                   333
Coronation Street               0                   0                  1464
Doctor Who                   8334                 819                   702
Elder Scrolls                   0                   0                  4275
Fallout                      3286                 337                   256
Final Fantasy                6041                 629                   527
Forgotten Realms                0                   0                  1200
Ice Hockey                      0                   0                  2233
Lego                            0                   0                  1199
Military                    13063                1356                  1408
Muppets                         0                   0                  2028
Pro Wrestling                1392                 151                   111
Star Trek                       0                   0                  4227
Star Wars                   11824                1143                  1563
World of Warcraft            1437                 155                   100
YuGiOh                          0                   0                  3374
Table 4.2: Number of Mentions per world
Since the task is already quite difficult, we assume that the target entity exists in the dataset. The dataset also provides the information needed to locate mentions in a document, so mention recognition or clustering methods are not considered.
4.1.2 Data Construction
Entity Dictionary
Documents/pages from the Wikias can be utilized as both context mentions and entity descriptions. Each document connects to an entity and the collection of documents comprises the entity dictionary. The documents are organized as documents/<wikia>.json, each line of which represents a document/entity from a wikia, and has the following format:
{"document_id": "000523A4D586C293", "title": "Warner", "text": "Warner Warner was a communications technician aboard Nerva Beacon ..."}
{"document_id": "0009247003C7CB16", "title": "Winnie Tyler", "text": "Winnie Tyler Winnie Tyler was Jacob Tyler's wife ..."}
Description of fields:
• document_id: Unique identifier of document/entity
• title: Document title
• text: Document content/description of entity
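As an illustration of the format above, the entity dictionary can be built by reading one JSON record per line. This is a minimal sketch: the function name load_entity_dictionary and the in-memory sample lines are assumptions made for this example, and in practice the lines would come from an opened documents/<wikia>.json file.

```python
import json


def load_entity_dictionary(lines):
    """Map each document_id to its title and text (one JSON record per line)."""
    entities = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        entities[record["document_id"]] = {
            "title": record["title"],
            "text": record["text"],
        }
    return entities


# Two records in the documented format, with shortened text fields.
sample_lines = [
    '{"document_id": "000523A4D586C293", "title": "Warner", "text": "Warner Warner was ..."}',
    '{"document_id": "0009247003C7CB16", "title": "Winnie Tyler", "text": "Winnie Tyler Winnie Tyler was ..."}',
]
entity_dict = load_entity_dictionary(sample_lines)
```

Because a file object iterates over its lines, the same function can be called as load_entity_dictionary(open(path, encoding="utf-8")) for a real wikia file.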
Mentions
The mentions are organized in the following files: { train, heldout_train_seen, heldout_train_unseen, val, test }.json. Each file has the following format:
{"mention_id": "0004DD84239096E0", "context_document_id": "99897223FC75151C", "corpus": "forgotten_realms", "start_index": 12, "end_index": 15, "text": "Vault of Gnashing Teeth", "label_document_id": "A0604B5BC7C4CAAF", "category": "LOW_OVERLAP"}
{"mention_id": "B7C96945B4EB593F", "context_document_id": "F2845988E1AC5BC0", "corpus": "military", "start_index": 178, "end_index": 182, "text": "C . W . Jessen", "label_document_id": "33B704F9C31B1D57", "category": "LOW_OVERLAP"}
Description of fields:
• mention_id: Unique mention identifier
• start_index, end_index: Location of mention text in the source document, assuming white-space tokenization (0-based indexing, start and end positions inclusive)
• text: Mention phrase
• category: Type of mention
• corpus: Source world of the mention
• context_document_id: Identifier of mention’s context
• label_document_id: Identifier of the document that describes the mention
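The indexing convention of start_index and end_index can be sketched as follows. The helper name mention_span is hypothetical. The context string is the opening of the "Umbara 2" document used as an example in Section 4.1.3, in which punctuation marks are separate white-space tokens.

```python
def mention_span(context_text, start_index, end_index):
    """Split a document into left context, mention and right context.

    Indices are 0-based over white-space tokens and both ends are inclusive.
    """
    tokens = context_text.split()
    left = tokens[:start_index]
    mention = tokens[start_index:end_index + 1]
    right = tokens[end_index + 1:]
    return left, mention, right


context = (
    'Umbara 2 Umbara 2 is the second episode of an online series labeled as '
    '" The Yoda Chronicles " . Synopsis . The episode begins where '
    'the previous episode left off .'
)
left, mention, right = mention_span(context, 26, 28)
mention_text = " ".join(mention)
```

With start_index 26 and end_index 28, the recovered span is the three-token mention "the previous episode", which matches the text field of the corresponding mention record.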
The category field distinguishes mentions in the dataset based on token overlap between mentions and the corresponding entity title in documents as follows:
• High Overlap: The title field is exactly like the mention text
• Multiple Categories: The title field is the mention text followed by a disambiguation phrase
• Ambiguous substring: The mention is a substring of the title
• Low Overlap: The rest of the mentions
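The four categories can be read as a simple decision rule, sketched below. This is only an interpretation of the definitions above and not the script that produced the dataset labels: the function name is hypothetical, case-insensitive comparison is an assumption, a disambiguation phrase is assumed to be a parenthesised suffix of the title, and all label strings other than LOW_OVERLAP (which appears in the records above) are assumed names.

```python
def overlap_category(mention_text, entity_title):
    """Assign an overlap category to a (mention, entity title) pair."""
    mention = mention_text.strip().lower()
    title = entity_title.strip().lower()
    if title == mention:
        return "HIGH_OVERLAP"
    # Assumption: a disambiguation phrase looks like "Title (phrase)".
    if title.startswith(mention + " (") and title.endswith(")"):
        return "MULTIPLE_CATEGORIES"
    if mention in title:
        return "AMBIGUOUS_SUBSTRING"
    return "LOW_OVERLAP"


examples = [
    ("Warner", "Warner"),
    ("Warner", "Warner (technician)"),  # hypothetical title
    ("Tyler", "Winnie Tyler"),
    ("the previous episode", "Umbara 1"),
]
labels = [overlap_category(m, t) for m, t in examples]
```

The order of the checks matters: the exact match and the disambiguation pattern are tested before the substring test, because both are special cases of it.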
4.1.3 Data usage
In this thesis the documents are used either as context of mentions or as entity descriptions. Each JSON record refers to the document that uses the mention text with the context_document_id field and to the document that describes the mention with label_document_id. For example, the following sample:
{"mention_id": "0046D0B545F1A899", "context_document_id": "02D5AD6366E36BBA", "corpus": "lego", "start_index": 26, "end_index": 28, "text": "the previous episode", "label_document_id": "AABE7679AC9F8700", "category": "LOW_OVERLAP"}
is described by the following record from the documents data:
{"title": "Umbara 1", "text": "Umbara 1 Umbara 1 is the first episode of an online series labeled as \" The Yoda Chronicles \" . Synopsis . The episode starts with Master Yoda , Ahsoka Tano , and Pong Krell fending of lasers from the Sniper Droideka , Commando Droid Captain , and an Umbaran Soldier . They are fighting the enemies until Yoda makes a ramp , launching the droideka into the air . It surprisingly lands on a cliff ledge , firing a bolt . Soon afterwards , though , an AT - RT fires and destroys the droideka . The 501st Clone Trooper who destroyed the droideka waves at the others , and the group jumps onto the vehicle . Two Z - 95 Headhunder s fly in . One is hit by another droideka , causing the 501st Clone Pilot inside to eject from the cockpit . The other Z - 95 fires the missile , destroying the droids . The droideka survived , about to fire on the heroes . But the AT - RT kicks the droid away , causing it to hit an Umbaran MHC . The cannon walks up , and the episode ends with the three Jedi igniting their lightsabers .", "document_id": "AABE7679AC9F8700"}
and is located in the text of the following document record:
{"title": "Umbara 2", "text": "Umbara 2 Umbara 2 is the second episode of an online series labeled as \" The Yoda Chronicles \" . Synopsis . The episode begins where the previous episode left off . The Umbaran MHC is attacking Yoda , Ahsoka Tano , Pong Krell , and the clones . A Z - 95 Headhunder launches a missile at the Umbaran MHC , but the missile just bounces off the MHC and embeds itself in a cliff while the Umbaran MHC keeps on firing at the jedi and clones , destroying their AT - RT . On top of a cliff one one side of the canyon , Commando Droid s and Sniper Droideka s also begin firing at the heroes . A 501st Clone Pilot from the adjacent cliff looks on in dismay though his electrobinoculars , but notices that the missile that embeded itself in the cliff is right under the droids . He fires at the missile , causing it and the portion of the cliff it was in to explode . This sends rubble , as well as the droids , falling into the canyon , but the MHC still proceeds to attack the jedi . Pong Krell builds a wall out of LEGO bricks to try and stop it , but the Umbaran MHC simply destroys the wall , causing the 501st Legion Clone Trooper to flee in panic . Yoda , however , has an idea . Using the force , he firmly holds a white 1x1 plate on the ground in the MHC 's path . The MHC trips on this and is destroyed , causing the jedi to celebrate , while the 501st Legion Clone Trooper who fled earlier on feels embarrassed .", "document_id": "02D5AD6366E36BBA"}
As a result, the label_document_id field refers to the unique entities that are mentioned in the dataset.
4.2 Models
4.2.1 BLINK Overview
The model that is used in this thesis for the Zero-shot Entity Linking problem is based on the model that BLINK² provides. BLINK is a Python library for Entity Linking (EL) that uses Wikipedia as the target Knowledge Base (KB). In a nutshell, BLINK uses a two-stage approach for Entity Linking (EL), based on fine-tuned BERT³ architectures. In the first stage, BLINK performs retrieval in a dense space defined by a bi-encoder. In the second stage, each retrieved candidate is examined more carefully with a cross-encoder, which concatenates the mention and entity text.
Figure 4.1: High level description of the zero-shot entity linking solution.
BLINK achieves state-of-the-art results on multiple datasets. For this thesis, the bi-encoder and the cross-encoder model of the library were studied.
4.2.2 Bi-encoder
As mentioned before, a bi-encoder allows for fast, real-time inference, as the candidate representations can be cached. The input context and the entity description are encoded into vectors:
$y_m = \mathrm{red}(T_1(t_m)), \quad y_e = \mathrm{red}(T_2(t_e))$
where $t_m$ is the representation of the mention, $t_e$ is the representation of the entity, and $T_1$ and $T_2$ are two transformers. The function red(.) reduces the sequence of vectors into one. By default, it returns the last-layer output of the [CLS] token.
² https://github.com/facebookresearch/BLINK
³ https://github.com/google-research/bert
The representation tm is composed of the mention text surrounded by 2 special tokens that denote the existence of the mention, the context from the left and the right side of the mention text plus the special tokens that BERT inserts. Specifically, the construction of the mention is:
[CLS] ctxt_l [M_s] mention [M_e] ctxt_r [SEP]
where [M_s] and [M_e] are the special tokens that tag the mention, ctxt_l and ctxt_r are the word-piece tokens of the context before and after the mention and [CLS] and [SEP] are BERT's special tokens.
The representation $t_e$ is composed of the entity title, the special token [ENT] separating the entity's title and description, plus the special tokens that BERT inserts:
[CLS] title [ENT] description [SEP]
For simplicity, both input representations are truncated to a maximum length. For document mentions, tokens from both the left and the right side of the mention are used in cases where the mention doesn't appear at the beginning or the end of the document; otherwise the first max_length or the last max_length tokens are used. For entity descriptions, the beginning of the document is used.
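The construction and truncation of the two inputs can be sketched as follows. This is a simplified illustration, not BLINK's code: it works on white-space tokens instead of BERT word-pieces, the function names are hypothetical, and splitting the remaining budget evenly between left and right context (handing unused room to the other side) is an assumption about how the window is balanced.

```python
def build_mention_input(left, mention, right, max_length=32):
    """Build [CLS] ctxt_l [Ms] mention [Me] ctxt_r [SEP] within max_length tokens."""
    mention = mention[: max(max_length - 4, 0)]  # room for 4 special tokens
    budget = max_length - 4 - len(mention)
    left_quota = budget // 2
    right_quota = budget - left_quota
    # A mention near a document boundary leaves spare room for the other side.
    if len(left) < left_quota:
        right_quota += left_quota - len(left)
        left_quota = len(left)
    elif len(right) < right_quota:
        left_quota += right_quota - len(right)
    left_part = left[-left_quota:] if left_quota > 0 else []
    right_part = right[:right_quota]
    return ["[CLS]"] + left_part + ["[Ms]"] + mention + ["[Me]"] + right_part + ["[SEP]"]


def build_entity_input(title, description, max_length=32):
    """Build [CLS] title [ENT] description [SEP], keeping the start of the description."""
    title = title[: max(max_length - 3, 0)]  # room for 3 special tokens
    budget = max(max_length - 3 - len(title), 0)
    return ["[CLS]"] + title + ["[ENT]"] + description[:budget] + ["[SEP]"]


left = "The episode begins where".split()
mention = "the previous episode".split()
right = "left off .".split()
mention_input = build_mention_input(left, mention, right, max_length=10)
start_input = build_mention_input([], mention, right, max_length=10)
entity_input = build_entity_input(
    ["Umbara", "1"], "Umbara 1 is the first episode".split(), max_length=8
)
```

In the second call the mention opens the document, so the whole context budget goes to the right side, which mirrors the boundary cases described above.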
The score of the entity candidate $e_i$ given a mention $m$ is computed by the dot-product:
$s(m, e_i) = y_m \cdot y_{e_i}$
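The following sketch shows why this score allows fast retrieval: the entity vectors can be computed once and cached, so that scoring a new mention needs only dot products. The three-dimensional vectors, the entity identifiers and the function names are made up for this example; real encodings would come from the two transformers.

```python
def dot(u, v):
    """Dot product of two equally long vectors."""
    return sum(a * b for a, b in zip(u, v))


def retrieve_top_k(mention_vec, cached_entity_vecs, k=2):
    """Rank cached entity vectors by s(m, e) = y_m . y_e and keep the best k."""
    scores = {eid: dot(mention_vec, vec) for eid, vec in cached_entity_vecs.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [(eid, scores[eid]) for eid in ranked[:k]]


# Hypothetical cached entity encodings y_e and one mention encoding y_m.
cached = {
    "e1": [1.0, 0.0, 0.0],
    "e2": [0.5, 0.5, 0.0],
    "e3": [0.0, 0.0, 1.0],
}
y_m = [0.9, 0.2, 0.0]
top = retrieve_top_k(y_m, cached, k=2)
```

Here e1 scores 0.9 and e2 scores 0.55, so these two entities form the candidate set that would be passed on to the ranking stage.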
The network is trained to maximize the score of the correct entity with respect to the entities of the same batch.
For each training pair $(m_i, e_i)$ in a batch of $B$ pairs, the loss is computed as:
$\mathcal{L}(m_i, e_i) = -s(m_i, e_i) + \log \sum_{j=1}^{B} \exp(s(m_i, e_j))$
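As a worked illustration of this loss, the sketch below evaluates it on a small matrix of scores, where scores[i][j] stands for s(m_i, e_j) and the correct entity of mention i is e_i. The score values and function names are invented for the example, the log-sum-exp is computed in a numerically stable way, and averaging over the batch is an assumption about how the per-pair losses are combined.

```python
import math


def in_batch_loss(scores, i):
    """Loss of pair (m_i, e_i): -s(m_i, e_i) + log sum_j exp(s(m_i, e_j))."""
    row = scores[i]
    peak = max(row)  # subtract the maximum before exponentiating for stability
    log_sum_exp = peak + math.log(sum(math.exp(s - peak) for s in row))
    return -row[i] + log_sum_exp


def batch_loss(scores):
    """Average of the per-pair losses over the batch (assumed reduction)."""
    return sum(in_batch_loss(scores, i) for i in range(len(scores))) / len(scores)


# Hypothetical scores for a batch of B = 3 pairs.
scores = [
    [5.0, 1.0, 0.0],  # m_0 clearly prefers its own entity e_0
    [0.5, 4.0, 1.0],  # m_1 prefers e_1
    [0.0, 0.0, 0.0],  # m_2 cannot tell the entities apart
]
confident = in_batch_loss(scores, 0)
uninformed = in_batch_loss(scores, 2)
mean_loss = batch_loss(scores)
```

When all scores in a row are equal, the loss reduces to log B, here log 3. It approaches zero as the score of the correct entity dominates the scores of the other entities in the batch.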
4.2.3 Cross-encoder
The cross-encoder takes as input the concatenation of the mention representation and the entity representation, with the [CLS] token removed from the latter. This way, the model has deep cross-attention between the context and the entity description. Formally, let $y_{m,e}$ denote the context-candidate embedding:
$y_{m,e} = \mathrm{red}(T_{\mathrm{cross}}(t_{m,e}))$
where $t_{m,e}$ is the input representation of mention and entity, $T_{\mathrm{cross}}$ is a transformer and red(.) is the same function defined in the previous subsection.
To score a candidate for a context mention, a linear layer $W$ is applied to the embedding $y_{m,e}$:
$s_{\mathrm{cross}}(m, e) = y_{m,e} W$
The network is trained using a softmax loss to maximize $s_{\mathrm{cross}}(m_i, e_i)$ for the correct entity, given a set of entity candidates.
4.3 Pre-processing, Train and Test
4.3.1 Pre-processing
The dataset had to be converted into a form from which the models can extract representative vectors. For that reason, a tokenizer was used to split each document into a list of tokens and to give each token a respective id. The list of token ids that was extracted for each document was fed to a BERT model, which converts it to a list of tensors that reflect semantic similarities between linguistic items in the dataset.
The BERT tokenizer relies on WordPiece, a subword segmentation technique closely related to Byte Pair Encoding (BPE) [Sennrich et al., 2015], which is described here. BPE adapts a simple data compression technique that iteratively replaces the most frequent pair of bytes in a sequence with a single, unused byte. Instead of merging frequent pairs of bytes, characters or character sequences are merged. Each word is represented as a sequence of characters, plus a special end-of-word symbol ‘·’.
All the symbol pairs are counted and each occurrence of the most frequent pair is merged. Each merge operation produces a new symbol which represents a character n-gram. Frequent character n-grams (or whole words) are eventually merged into a single symbol, thus BPE requires no shortlist.
The final symbol vocabulary size is equal to the size of the initial vocabulary, plus the number of merge operations. The reason this technique is used is to split unknown tokens into subwords, so that words the algorithm already knows can be associated with unknown words that are similar to them, which are consequently given proper token ids.
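The merge procedure can be sketched in a few lines, following the description in [Sennrich et al., 2015]. This is an illustrative re-implementation and not the tokenizer used in the experiments: the function names and the toy word frequencies are assumptions, and ties between equally frequent pairs are broken by the order in which the pairs are first seen.

```python
import re
from collections import Counter


def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Replace every occurrence of the symbol pair with one merged symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    merged = "".join(pair)
    return {pattern.sub(lambda _: merged, word): freq for word, freq in vocab.items()}


def learn_bpe(word_freqs, num_merges):
    """Learn merge operations from a dictionary of word frequencies."""
    # Each word becomes space-separated characters plus the end-of-word symbol.
    vocab = {" ".join(word) + " ·": freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab


word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}  # toy corpus
merges, vocab = learn_bpe(word_freqs, num_merges=3)
```

On this toy corpus the three learned merges build the suffix "est·" step by step, so a rare word ending in "est" would share that symbol with "newest" and "widest" instead of being mapped to an unknown token.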
4.3.2 Training
At first the model needed to get fine-tuned to learn the inserted special tokens. The dataset was constructed as a Tensor Dataset with each record containing the context, the description, the world id it belongs to and the label id.
from torch.utils.data import TensorDataset

data = {
    "context_vecs": context_vecs,
    "cand_vecs": cand_vecs,
    "label_idx": label_idx,
    "src": src_vecs,
}
tensor_data = TensorDataset(context_vecs, cand_vecs, src_vecs, label_idx)
The records were split into batches for training and evaluation. In training, for each batch, the training loss was computed and back propagation was performed. An Adam optimizer [Loshchilov and Hutter, 2017] and a Linear scheduler were implemented. In evaluation, the validation loss and the average accuracy were computed.
import torch


def train(train_dataloader, reranker, gradient_accumulation_steps):
    reranker.model.train()
    total_loss = 0
    for step, batch in enumerate(train_dataloader):
        # Report progress every 50 batches.
        if step % 50 == 0 and not step == 0:
            print('Batch {:>5,} of {:>5,}.'.format(step, len(train_dataloader)))
        context_input, candidate_input, _, _ = batch
        loss, _ = reranker(context_input, candidate_input)
        if gradient_accumulation_steps > 1:
            loss = loss / gradient_accumulation_steps
        total_loss = total_loss + loss.item()
        loss.backward()
        # Update the weights only after the gradients of several batches have been accumulated.
        if (step + 1) % gradient_accumulation_steps == 0:
            torch.nn.utils.clip_grad_norm_(reranker.model.parameters(), 1.0)
            reranker.optimizer.step()
            reranker.scheduler.step()
            reranker.optimizer.zero_grad()
    avg_loss = total_loss / len(train_dataloader)
    return avg_loss
def evaluate(val_dataloader, reranker, eval_batch_size):
    reranker.model.eval()
    results = {}
    eval_accuracy = 0.0
    nb_eval_examples = 0
    nb_eval_steps = 0
    total_loss = 0.0
    for step, batch in enumerate(val_dataloader):
        context_input, candidate_input, _, _ = batch
        with torch.no_grad():
            # A single forward pass yields both the loss and the in-batch scores.
            loss, logits = reranker(context_input, candidate_input)
        total_loss = total_loss + loss.item()
        logits = logits.detach().cpu().numpy()
        # The gold entity of the i-th mention is the i-th candidate of the batch.
        # The actual batch size is used, since the last batch may be smaller than eval_batch_size.
        label_ids = torch.arange(context_input.size(0)).numpy()
        # accuracy() is a helper defined elsewhere; it is assumed to return the number of correct predictions.
        tmp_eval_accuracy = accuracy(logits, label_ids)
        eval_accuracy += tmp_eval_accuracy
        nb_eval_examples += context_input.size(0)
        nb_eval_steps += 1
    normalized_eval_accuracy = eval_accuracy / nb_eval_examples
    print("Eval accuracy: %.5f" % normalized_eval_accuracy)
    results["normalized_accuracy"] = normalized_eval_accuracy
    avg_loss = total_loss / len(val_dataloader)
    return avg_loss
At each epoch we checked whether the current validation loss was smaller than the best validation loss so far. If so, the parameters of the model were saved.
if valid_loss < best_valid_loss:
    best_valid_loss = valid_loss
    reranker.save_model()