Given a collection of event links, we describe a procedure to determine estimates of $w_{\text{terms}}$ and $w_{\text{time}}$. Practically, this consists of the following steps:
1. Select a single feature and use it to score candidates for all training instances.
2. Multiply the scores of the top $s$ candidates for each instance by their temporal features, and learn the optimal weights $w_{\text{time}}$ for their linear combination.
3. Use the initial feature multiplied by the learnt temporal weighting to score candidates for all training instances.
4. Multiply the scores of the top $s$ candidates for each instance by their term overlap features, and learn the optimal weights $w_{\text{terms}}$ for their linear combination.
This can be seen as a single iteration of a process in which each weight vector is updated in turn using linear modelling while fixing the other, as shown in Algorithm 1. This approach allows us to use standard linear modelling despite learning the product of two linear spaces; we also benefit from drawing a new set of top candidates given a partial model. The models may be optimised in a classification or structured learning paradigm.
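To make explicit why each update reduces to standard linear modelling, the following worked step assumes that the scoring function is the product of a linear term-overlap model and a linear temporal model. The feature-vector names $\phi_{\text{terms}}$ and $\phi_{\text{time}}$ and the scalar $a(c, r)$ are introduced here for illustration only.

```latex
% Assumed bilinear form; \phi_{terms} and \phi_{time} are hypothetical names
% for the term-overlap and temporal feature vectors of candidate c for reference r.
\begin{align*}
\sigma(c, r; w_{\text{terms}}, w_{\text{time}})
  &= \bigl(w_{\text{terms}}^{\top} \phi_{\text{terms}}(c, r)\bigr)\,
     \bigl(w_{\text{time}}^{\top} \phi_{\text{time}}(c, r)\bigr) \\
  &= w_{\text{time}}^{\top} \bigl[ a(c, r)\, \phi_{\text{time}}(c, r) \bigr],
  \qquad a(c, r) = w_{\text{terms}}^{\top} \phi_{\text{terms}}(c, r).
\end{align*}
```

Under this assumption, holding $w_{\text{terms}}$ fixed turns $a(c, r)$ into a constant per candidate, so the score is linear in $w_{\text{time}}$ over the rescaled temporal features; the symmetric argument applies when $w_{\text{time}}$ is held fixed.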
In a classification approach, each input reference is represented by a single positive instance, corresponding to the true target, and a sample of negative instances corresponding to alternative candidates. Training adjusts the weights according to loss over the predicted classification of all sampled instances. In contrast, a structured learning approach only considers the loss with respect to the single best candidate per reference given each setting of $w$. Preliminary experiments with structured learning under a zero-one loss substantially underperformed the classification approach, which we pursue further.⁸

Given: training references $R$, corresponding targets $Y$, candidates $C$, batch loss function $L$, maximum iterations $t_{\max}$, $w^{0}_{\text{terms}} \neq 0$
1: $t \leftarrow 0$
2: repeat
3:     $w^{t+1}_{\text{time}} \leftarrow \arg\min_{w} L(\sigma(C, R; w^{t}_{\text{terms}}, w), Y)$
4:     $w^{t+1}_{\text{terms}} \leftarrow \arg\min_{w} L(\sigma(C, R; w, w^{t+1}_{\text{time}}), Y)$
5:     $t \leftarrow t + 1$
6: until convergence or $t = t_{\max}$
7: return $w^{t}_{\text{terms}}, w^{t}_{\text{time}}$
Algorithm 1: Learning weights for time and terms features
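The alternating update of Algorithm 1 can be sketched in code as follows. This is a minimal illustration rather than our implementation: it assumes the score is the product of a linear term-overlap model and a linear temporal model, it omits the re-sampling of top candidates between the two updates and the convergence test, and all names (`Candidate`, `rescale`, `alternate`, `centroid_fit`) are hypothetical. The `centroid_fit` learner is a toy stand-in for whichever linear model is used.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
# Hypothetical layout: (term-overlap features, temporal features, label),
# where label is 1 for the true target and 0 otherwise.
Candidate = Tuple[Vector, Vector, int]
Fit = Callable[[List[Vector], List[int]], Vector]


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def score(w_terms: Vector, w_time: Vector, cand: Candidate) -> float:
    """Assumed bilinear score: product of two linear models."""
    terms_feats, time_feats, _ = cand
    return dot(w_terms, terms_feats) * dot(w_time, time_feats)


def rescale(fixed_w: Vector, fixed_feats: Vector, free_feats: Vector) -> Vector:
    """Multiply the free-side features by the fixed side's scalar score."""
    scale = dot(fixed_w, fixed_feats)
    return [scale * x for x in free_feats]


def alternate(cands: List[Candidate], w_terms0: Vector, fit: Fit,
              t_max: int = 1) -> Tuple[Vector, Vector]:
    """Update w_time, then w_terms, t_max times (no convergence test)."""
    w_terms, w_time = list(w_terms0), []
    labels = [label for _, _, label in cands]
    for _ in range(t_max):
        # Line 3 of Algorithm 1: fix w_terms, learn w_time.
        x_time = [rescale(w_terms, tf, mf) for tf, mf, _ in cands]
        w_time = fit(x_time, labels)
        # Line 4 of Algorithm 1: fix the new w_time, learn w_terms.
        x_terms = [rescale(w_time, mf, tf) for tf, mf, _ in cands]
        w_terms = fit(x_terms, labels)
    return w_terms, w_time


def centroid_fit(xs: List[Vector], ys: List[int]) -> Vector:
    """Toy placeholder learner: positive centroid minus negative centroid."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    return [sum(p[j] for p in pos) / len(pos) - sum(n[j] for n in neg) / len(neg)
            for j in range(len(xs[0]))]


cands = [([1.0, 0.0], [1.0, 0.0], 1), ([0.5, 1.0], [0.0, 1.0], 0)]
w_terms, w_time = alternate(cands, [1.0, 0.0], centroid_fit)
```

Passing the learner as a function keeps the alternation independent of the choice between logistic regression, least squares or an SVM.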
A few practical considerations pertain to treating the learning task as binary classification and the broader iterative update approach:
Candidate sampling. We sample a fixed $s$ candidates for each reference $r$, or as many as have non-zero scores if it is fewer. We select the $s$ candidates $c$ with the highest $\sigma(c, r)$ given the current model parameters (i.e. $\sigma(c, r; w^{t}_{\text{terms}}, w^{t}_{\text{time}})$ for line 3 and $\sigma(c, r; w^{t}_{\text{terms}}, w^{t+1}_{\text{time}})$ for line 4 of Algorithm 1), and force the inclusion of the true candidate. Where the true candidate has a score of 0, we discard the instance and its candidates during training, but account for the instance in evaluation. The data is highly skewed to the negative class, yet we find that undersampling degrades performance.
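The sampling rule just described might be sketched as below. The function name and the dictionary representation of scores are hypothetical, and one detail is an assumption: the text does not say how the true candidate is forced in, so this sketch lets it replace the lowest-ranked sampled candidate when it falls outside the top $s$.

```python
from typing import Dict, List, Optional


def sample_candidates(scores: Dict[str, float], true_id: str,
                      s: int) -> Optional[List[str]]:
    """Return up to s candidate ids for one reference, or None to discard it."""
    if s < 1:
        raise ValueError("s must be at least 1")
    # Instances whose true candidate scores 0 are dropped from training.
    if scores.get(true_id, 0.0) == 0.0:
        return None
    # Only candidates with non-zero scores are eligible.
    nonzero = [c for c, v in scores.items() if v != 0.0]
    top = sorted(nonzero, key=lambda c: scores[c], reverse=True)[:s]
    # Assumption: the true candidate displaces the weakest sampled candidate.
    if true_id not in top:
        top[-1] = true_id
    return top
```

For example, with scores `{"a": 0.9, "b": 0.5, "c": 0.2, "d": 0.0}` and `s = 2`, a true target of `"c"` yields `["a", "c"]`, while a true target of `"d"` causes the instance to be discarded.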
Loss function and optimisation. We employ logistic regression to optimise $w$, although we use these weights in a linear scoring function, rather than for log-linear prediction.⁹ Each candidate is represented as a feature vector $x_i$, assigning $y_i = 1$ to the true link target and $y_i = 0$ otherwise, and we minimise the regularised logistic loss:¹⁰

\[
w^{*}_{\lambda,\ell,\beta} = \arg\min_{w} \min_{w_\beta} \sum_i \log\left\{ 1 + e^{-y_i (x_i w^{\top} + \beta w_\beta)} \right\} + \lambda \left( |w_\beta|^{\ell} + \sum_j |w_j|^{\ell} \right)^{\frac{1}{\ell}}
\]
⁸ Exact inference of the best candidate given each $w$ may be expensive, so in these experiments we employed a similar approximation to the classification task, sampling a fixed number of alternative candidates to compare to the true candidate. We evaluated the structured perceptron and subgradient structured SVM. While other loss functions (perhaps based on candidate similarity) may be more appropriate, the outcomes of our structured prediction and undersampling experiments accord in highlighting the importance of negative examples.
⁹ In early experiments we empirically determine that this finds better solutions in terms of our task objective than $\ell_2$-regularised least-squares regression, and marginally outperforms a linear SVM model.
¹⁰ This optimisation assumes the role of $L$ in Algorithm 1, such that $x_i w^{\top}$ corresponds to $\sigma(C, R; w)$ but
We use $\ell = 2$ regularisation since our feature space is dense and not large, and select the regularisation coefficient $\lambda$ through cross-validated grid search for each model. The logistic regression implementation we use from LIBLINEAR (Fan et al., 2008) regularises the bias (or intercept) term $w_\beta$, corresponding to an additional feature of fixed value $\beta$ for all instances; we find $\beta = 10$ is suitable for our data. The optimisation problem is solved using their trust region Newton method (Lin et al., 2008).
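As an illustration of how the bias enters this objective, the sketch below evaluates the regularised logistic loss for a given $w$ and $w_\beta$. It only computes the objective value and does not reproduce LIBLINEAR's solver. The function name is hypothetical, and one assumption is made explicit: labels given as 0/1 are mapped to -1/+1 before entering the exponent, as is conventional for this form of the logistic loss.

```python
import math
from typing import List, Sequence


def regularised_logistic_loss(w: Sequence[float], w_beta: float,
                              xs: List[Sequence[float]], ys: Sequence[int],
                              lam: float, beta: float = 10.0,
                              ell: int = 2) -> float:
    """Logistic loss plus lam times the ell-norm of (w, w_beta)."""
    total = 0.0
    for x, y in zip(xs, ys):
        y_pm = 1.0 if y == 1 else -1.0  # assumption: map {0, 1} to {-1, +1}
        # The bias acts as an extra feature of fixed value beta with weight w_beta.
        margin = y_pm * (sum(wj * xj for wj, xj in zip(w, x)) + beta * w_beta)
        # Numerically stable evaluation of log(1 + exp(-margin)).
        if margin > 0:
            total += math.log1p(math.exp(-margin))
        else:
            total += -margin + math.log1p(math.exp(margin))
    # The bias weight is penalised together with the feature weights.
    penalty = (abs(w_beta) ** ell + sum(abs(wj) ** ell for wj in w)) ** (1.0 / ell)
    return total + lam * penalty
```

Because $w_\beta$ is penalised like any other weight, a larger $\beta$ lets the same effective intercept $\beta w_\beta$ be reached with a smaller, less penalised $w_\beta$.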
Initial weights and iteration. Algorithm 1 at first updates $w_{\text{time}}$ given an initial $w^{0}_{\text{terms}}$. Although the opposite is possible, we choose this ordering since a large number of candidates have $f_{\text{terms}}(c, r) = 0$ (the same being untrue of $f_{\text{time}}$), eliminating many candidates from the sampling procedure. The initial term feature weights are set to 0, except for a single predetermined feature, which is set to 1. We select this feature to maximise average recall of the correct target among the top $s$ candidates. Assigning non-zero weights to multiple term features makes inferring the top candidates much more costly; in this vein, we set the maximum number of iterations $t_{\max} = 1$ and thus update $w_{\text{time}}$ and $w_{\text{terms}}$ once each, leaving an investigation of iterative updates to future work.
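The choice of the single initial term feature can be sketched as follows. The data layout and function names are hypothetical, and the sketch assumes that candidates are ranked by the raw value of one feature at a time, with zero-valued candidates excluded as in the sampling procedure.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: (candidate id -> term feature vector, true target id).
Instance = Tuple[Dict[str, List[float]], str]


def recall_at_s(instances: List[Instance], j: int, s: int) -> float:
    """Fraction of instances whose true target is in the top s by feature j."""
    hits = 0
    for cands, true_id in instances:
        nonzero = [c for c in cands if cands[c][j] != 0.0]
        ranked = sorted(nonzero, key=lambda c: cands[c][j], reverse=True)
        hits += int(true_id in ranked[:s])
    return hits / len(instances)


def initial_term_weights(instances: List[Instance], n_features: int,
                         s: int) -> List[float]:
    """One-hot initial weights on the feature with the best average recall."""
    best = max(range(n_features), key=lambda j: recall_at_s(instances, j, s))
    return [1.0 if j == best else 0.0 for j in range(n_features)]
```

With a one-hot initial vector, the first ranking depends on a single feature, which is what keeps the initial candidate retrieval cheap.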
We apply this estimation technique to evaluate our model in the next chapter.
6.5 Conclusion
We have described a preliminary system intended to perform event linking as a retrieval task. The system is intended to generate the most likely candidates given an event reference, among which a more precise disambiguation process may select a final target. It scores each candidate according to its time of publication and term overlap with the query text, with particular components to focus on three key aspects of the event linking task:
Entities and event description. A target candidate should be preferred if it mentions the entities, location, time and general description of the event that are indicated in the input reference. We thus extract and differentially weight different types of terms in the reference context and candidate story (Section 6.2.2).
News discourse structure. Not every event mentioned in a candidate news story is being reported there, but some constructs within the text, such as the opening sentence, are likely to be indicative of novelty. This leads to the use of weighted zones identified in the candidate documents (Section 6.2.1).
Temporality. Since event linking targets the article that first reports an event, we prefer documents that introduce new content (Section 6.2.3). The system also prefers candidates published shortly after the likely date of an event’s occurrence, or assumes that recently-reported events are salient, thus preferring candidates recent to the reference time (Section 6.3).
By and large, our system takes naïve approaches to these components. This allows the system to be easily replicated, providing a benchmark for the task and evaluating its feasibility. As a framework, the system is also extensible to introducing, for example, different query formulation methods, leaving many open directions for future improvement and nuance.
We have described a method for estimating the system’s parameters, $w_{\text{time}}$ and $w_{\text{terms}}$, from annotated event links. However, our manual annotations are likely too few to learn an accurate estimate. In the following chapter we instead propose inferring parameters from noisy hyperlink data – avoiding the cost of high-effort event linking corpus annotation – and evaluate this system on our gold-standard event links.
Chapter 7
Evaluating event linking with noisy training
We set out to determine to what extent the system described in Chapter 6 effectively performs the event linking task. However, this entails determining its parameters from training data, for which the manually annotated event link corpus may not suffice.
In general, linguistic and media expertise is scarce, which makes producing a statistically sufficient annotated corpus costly. This is particularly applicable to the annotation task described in Chapter 5, which is very time consuming and therefore yielded only 229 distinct event links from 150 documents, too few to train the model of the last chapter. In previous work we have exploited hyperlinks in Wikipedia to automatically generate training data for multilingual named entity recognition (Nothman et al., 2013). Similarly, we assume that some portion of hyperlinks on the world wide web must correspond to event links: an author may link to a reporting news article when referring to an event. If such a subset is identified, it may be used to train an event linker; in general we believe hyperlinks are under-exploited by the NLP community as indicators of event coreference. This goes hand in hand with the suggestion in Section 4.2 that the output of an event linking system might be used for hypertext construction.
In this work, we go further to suggest learning from noisy, “silver standard” training data. Such data are automatically sampled, not manually verified as valid event links. Thus we experiment with two corpora of hyperlinks to online news with minimal filtering, under the assumption that links to news are often event-oriented. We quantify this assumption with respect to the set of hyperlinks within online content from the same publisher as our manually annotated corpus. The second corpus consists of citations in English Wikipedia targeting that same online news archive.
After further detailing the extraction of such hyperlink corpora in the following section, this chapter performs an event linking system evaluation with the following purposes:
1. to validate the event linking task and establish its difficulty through a performance benchmark;
2. to ascertain whether knowledge from noisy hyperlink data can assist event linking;
3. to identify aspects of Chapter 6’s system that are most effective; and
4. to diagnose aspects of the task and manually annotated data of Chapter 5 for which the system and its noisy training are not sufficient.
The metrics for quantitative evaluation and system optimisation are outlined in Section 7.2 before detailing experimental results of development and testing in Section 7.3. We then analyse portions of our annotated corpus that our system is most and least successful at replicating (Section 7.4) before concluding with a discussion of areas where our system and training methods might be enhanced (Section 7.5).