I. Entity Linking with Wikipedia
4. Cross-Lingual Entity Linking System
4.4. Entity Clustering
The major problem of EL is name disambiguation. Linking a query to a KB (or Wikipedia) entity can be treated as named entity disambiguation with Wikipedia as the reference inventory. For entities without a KB link (a.k.a. NIL entities), NIL clustering automatically groups query mentions with the same referent, so that all queries within a cluster refer to the same target (sense). The NIL clustering task is similar to the people name disambiguation task in SemEval-2007 [6], which consists of clustering a set of documents that mention an ambiguous person name according to their actual reference targets. NIL clustering introduces more types of NEs, e.g. location and organization, and brings the additional obstacle of resolving name variations.
The first step of clustering is to group entities with identical names into one coarse group. Based on observations on development data, we assume that all the entities in the same coarse group share identical strings, so we ignore the rare cases of different names referring to the same entity (e.g. the queries “Ford Motor Co.” and “Ford” refer to the same company) and of cross-lingual references (e.g. the queries “Hyderabad” and “海得拉巴” in Chinese). Entities within a coarse group can represent different senses of the name; for example, Washington can mean a person, a city or a state in different contexts. The next step is to split them into fine-grained sense clusters, which are taken as the final NIL clusters. We use bag-of-words (BOW) features from the passages surrounding the entity mentions in the background documents to represent the intended sense of the entity.
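The coarse grouping step can be sketched as follows. This is a minimal illustration that assumes queries arrive as (query id, surface string) pairs; the function name `coarse_groups` and the query ids are hypothetical and not taken from the system described here.

```python
from collections import defaultdict


def coarse_groups(queries):
    """Group (query_id, name) pairs by their exact surface string."""
    groups = defaultdict(list)
    for query_id, name in queries:
        # Exact string match only: name variations are deliberately not merged.
        groups[name].append(query_id)
    return dict(groups)


# Hypothetical queries for illustration.
queries = [
    ("Q1", "Washington"),
    ("Q2", "Ford"),
    ("Q3", "Washington"),
    ("Q4", "Ford Motor Co."),
]
groups = coarse_groups(queries)
```

Under the identical-string assumption, “Ford” and “Ford Motor Co.” end up in separate coarse groups, which is exactly the rare case the text chooses to ignore.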
An important issue in clustering is to determine the number of NIL sense clusters. Since the number varies from entity to entity, it is difficult to train a clustering model that adapts to all entities. Moreover, a single background document for a query mention does not offer sufficient information to disambiguate the mention it contains. Instead of clustering background documents, we retrieve more relevant documents from the source collections. This large set of documents can be used to estimate a meaningful distribution of word occurrences for each fine-grained sense. As in the KB retrieval model, Indri is called to retrieve the top 1000 documents by searching with each query string. These top relevant documents form a larger document collection for queries with identical surface strings. We use a hierarchical clustering algorithm to cluster the relevant documents, build a partition of disjoint clusters by cutting the hierarchy, and assign each query mention to the most similar sense cluster.
Hierarchical clustering algorithms are either top-down or bottom-up, called divisive and agglomerative clustering respectively. We cluster the relevant documents with the Hierarchical Agglomerative Clustering (HAC) algorithm with single linkage, which has been shown to be effective for clustering ambiguous person names [125].
HAC⁴ initially assigns each document to its own cluster, and pairs of clusters are iteratively merged to form a hierarchy, which provides a view of the semantic senses of the entity at different levels of cluster-wise similarity. The merging of a pair of clusters is determined by the combination similarity, which is defined by two criteria: the measure of distance between document vectors, such as Euclidean distance, squared Euclidean distance, Manhattan distance, maximum distance or cosine similarity;⁵ and the linkage criterion, which specifies the cluster similarity as a function of the pairwise similarities between documents from different clusters. Common linkage strategies lead to single linkage clustering, complete linkage clustering, group-average clustering and centroid clustering [62]. Single linkage clustering defines the similarity of two clusters as that of their most similar members. Figure 4.3 illustrates that the nearest pair of nodes is taken as the most similar one, and its similarity determines the similarity of the clusters.
Figure 4.3. The demonstration of the single linkage criterion for cluster similarity used in HAC algorithms (node groups labelled Cluster 1 and Cluster 2).
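The single linkage criterion and one agglomerative merge step can be illustrated with a small pure-Python sketch. It works with Euclidean distance rather than similarity, so the "most similar members" become the closest pair. The function names and the toy 2-D points are assumptions made for illustration; they are not the SciPy routines used in the system.

```python
import math
from itertools import combinations


def single_linkage_distance(cluster_a, cluster_b):
    """Cluster distance under single linkage: distance of the closest member pair."""
    return min(math.dist(x, y) for x in cluster_a for y in cluster_b)


def merge_closest_pair(clusters):
    """One agglomerative step: merge the two clusters with the smallest distance."""
    i, j = min(
        combinations(range(len(clusters)), 2),
        key=lambda pair: single_linkage_distance(clusters[pair[0]], clusters[pair[1]]),
    )
    merged = clusters[i] + clusters[j]
    # Keep all untouched clusters and append the merged one.
    return [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]


# Every document starts in its own cluster (toy 2-D vectors).
clusters = [[(0.0, 0.0)], [(0.0, 1.0)], [(5.0, 0.0)], [(5.0, 3.0)]]
clusters = merge_closest_pair(clusters)
```

Repeating `merge_closest_pair` until one cluster remains would yield the full hierarchy; recording the distance at each merge gives the levels at which the hierarchy can later be cut.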
We convert the hierarchy into disjoint clusters by cutting it according to a specification of the final clusters [62]. We specify either the number of clusters or the number of documents per cluster to determine the cutting point that produces the corresponding result. Each document is represented as a BOW vector d of TF-IDF values. Based on validation on development data, we use two different strategies for cutting a hierarchy into a set of flat clusters, which are described in Section 4.6.1.
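As a rough sketch of building and cutting the hierarchy with SciPy, the example below uses `scipy.cluster.hierarchy.linkage` with single linkage and `fcluster` with the `maxclust` criterion. The document vectors are toy stand-ins for TF-IDF vectors, and the choice of three clusters is an assumption for illustration. Only the "number of clusters" specification is shown; the strategy based on the number of documents per cluster is not reproduced here.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Toy document vectors: hypothetical TF-IDF weights over a 3-term vocabulary.
doc_vectors = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.9, 0.1],
    [0.0, 0.0, 1.0],
])

# Build the single-linkage hierarchy over Euclidean distances.
hierarchy = linkage(doc_vectors, method="single", metric="euclidean")

# Cut the hierarchy so that no more than `num_clusters` flat clusters remain.
num_clusters = 3  # assumed value, chosen only for this example
labels = fcluster(hierarchy, t=num_clusters, criterion="maxclust")
```

Each entry of `labels` is the flat cluster id of the corresponding row in `doc_vectors`; here the first two and the next two documents fall into shared clusters, and the last document forms its own.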
⁴ We use the implementation in SciPy: http://www.scipy.org/
⁵ http://en.wikipedia.org/wiki/Hierarchical_clustering
Each cluster is considered an accumulation of terms relevant to a single entity sense; therefore, the background document that shares more terms with a cluster is more likely to share its sense. The centroid µ_C of a sense cluster C can be seen as an approximation of its semantic context,
$$\mu_C = \frac{1}{|C|} \sum_{d \in C} d, \tag{4.2}$$
thus a mention occurring in a context similar to the centroid is likely to relate to the same sense as the cluster C.
Figure 4.4. The assignment of clusters based on the distance between a document and each centroid (sense clusters C_1 to C_4 shown, with d_m ∈ C_2).
We assign the query to the nearest sense cluster by measuring the Euclidean distance between its document and each cluster centroid in the vector space, as depicted in Figure 4.4. Given a query mention m and its background document d_m, its cluster is set as follows,

$$\arg\min_{C} \lVert d_m - \mu_C \rVert_2 = \arg\min_{C} \sqrt{\sum_i \left(d_m^i - \mu_C^i\right)^2} \tag{4.3}$$
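Equations (4.2) and (4.3) can be sketched together in a few lines of Python. The function names, cluster labels and 2-D vectors below are hypothetical and serve only to make the computation concrete; real document vectors would be high-dimensional TF-IDF vectors.

```python
import math


def centroid(cluster):
    """Mean vector of the documents in a cluster, as in Eq. (4.2)."""
    size = len(cluster)
    return [sum(component) / size for component in zip(*cluster)]


def assign_to_cluster(doc_vector, clusters):
    """Label of the cluster whose centroid is nearest in Euclidean distance, as in Eq. (4.3)."""
    return min(
        clusters,
        key=lambda label: math.dist(doc_vector, centroid(clusters[label])),
    )


# Toy sense clusters (hypothetical labels and vectors).
sense_clusters = {
    "C1": [[1.0, 0.0], [3.0, 0.0]],
    "C2": [[0.0, 4.0], [0.0, 6.0]],
}
d_m = [0.5, 4.0]  # background document vector of the query mention
nearest = assign_to_cluster(d_m, sense_clusters)
```

Here the centroids are (2, 0) for C1 and (0, 5) for C2, so d_m is closer to C2 and the query is assigned to that sense cluster.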