7.1 Learning from hyperlink corpora
7.1.1 Hyperlinks within online news
In-text hyperlinks, formerly a distinguishing property of the blogosphere, are increasingly used by online news sources (Coddington, 2014). Sampling articles from news and blog sites shows that 91% of links from news sites are publication-internal, with over half of all links targeting reference material and 30% targeting news reports (Coddington, 2012). This increased presence of hyperlinks from and to news reports presents a resource for cross-document event coreference identification, given the centrality of event reference to news. At the same time, news archives may benefit from this technology, which could augment new and archival content with such links.² This section assesses the event orientation of existing hyperlinks internal to an online news source, with an eye to their use in learning event linking. Using links from and to the same archive minimises the work involved in acquiring and processing data. In selecting a corpus of news-internal hyperlinks to exploit, we require:
• a quantity of such hyperlinks sufficient to learn a model from a sub-sample;
• a complete archive spanning multiple years, accounting for reference to long-past events;
• the ability to identify an archival story given its URL; and
• accessible meta-data such as the date of publication for each archival story.
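
The last two requirements amount to resolving a hyperlink target to an archival record that carries a publication date. The following is a minimal sketch of such a lookup, not the tooling used for this work: the names `Story`, `Archive`, `normalise` and `link_gap_days` are hypothetical, and the normalisation (lower-cased host, query string and fragment dropped, trailing slash stripped) is an assumption about what makes two URLs identify the same story.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, Optional
from urllib.parse import urlsplit


@dataclass(frozen=True)
class Story:
    """Hypothetical archival record: a URL plus its publication date."""
    url: str
    published: date


def normalise(url: str) -> str:
    """Reduce a URL to host + path (assumed sufficient to identify a story)."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    return host + parts.path.rstrip("/")


class Archive:
    """Index of archival stories keyed by normalised URL."""

    def __init__(self) -> None:
        self._by_url: Dict[str, Story] = {}

    def add(self, story: Story) -> None:
        self._by_url[normalise(story.url)] = story

    def resolve(self, href: str) -> Optional[Story]:
        """Return the archival story a hyperlink points to, if it is indexed."""
        return self._by_url.get(normalise(href))


def link_gap_days(archive: Archive, source: Story, href: str) -> Optional[int]:
    """Days between the target's publication and the source's, or None if unresolved."""
    target = archive.resolve(href)
    if target is None:
        return None
    return (source.published - target.published).days
```

With publication dates available for both ends of a link, the period a link spans can be read off directly, which is what makes reference to long-past events measurable.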
The inclusion of URL information for each article in the New York Times Annotated Corpus (Sandhaus, 2008) makes nytimes.com suitable, although we have not found it to mark in-text hyperlink anchors. WikiNews³ is an attractive option because of its free availability and the culture of hyperlinking common to WikiMedia projects, yet we find that few links targeting its own articles are in-text, and meta-data such as time of publication was not consistently encoded at the time of investigation. We elect to obtain an archive of Fairfax Digital (fd) content spanning from late 2008 to early 2013, incorporating almost 1.05M articles from a number of Australian news web sites.⁴ For the period covered, this collection is the online parallel of the print archive used for annotation in Chapter 5, incorporating some additional content from mastheads belonging to localities other than Sydney.
² A number of sites utilise automatic insertion of hyperlinks (Coddington, 2014), albeit to reference portals on specific topics rather than individual articles.
³ http://en.wikinews.org
⁴ This includes http://www.{smh,theage,brisbanetimes,watoday,canberratimes}.com.au which all provide access to the same set of assets.
1. ✔ The body . . . was found under a plastic sheet . . .
2. ✔ . . . that the global hacking incident won’t affect them.
3. ✔ . . . 29-year-old Andy Marshall died this week after . . .
4. ✔ Ellison was last year named best-paid executive
5. ✖ . . . the Mitsubishi i-MiEV, which is now on sale . . .
6. ✖ . . . according to the QS World University Rankings.
7. ✖ Read our previous Brisbane’s Best: CBD lunches for $10 | Chips | Bakery | Mexican
Figure 7.1: Examples of hyperlinks (underlined) in context from the fd corpus: those marked with ✔ are annotated as event references anchored at the word in italics; those with ✖ are marked for discarding.
To analyse the relevance of such hyperlinks to event linking, we label a sample of 350 instances from articles published in 2011 – including news, features, review, etc. – after first removing duplicate targets and source sentences. Without confirming that the link targets are canonical with respect to the event linking definition, we aim to identify clearly negative instances, being links that are:
• not within prose (e.g. among a list of links);
• instructive in their anchor text (e.g. click here);
• references to the target document, rather than to the situation it reports (e.g. This paper revealed on Monday . . . ; similarly); or
• based on references to non-events and non-linkable events (e.g. a link to a review of the mentioned entity).
The annotator is presented with each randomly sampled hyperlink within a sentence of context,⁵ and decides whether the text and its relationship with its target approximates an event link. If so, the annotator also selects a single word as the event link anchor.
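
One way to picture the resulting annotation is as a record holding the sentence, the hyperlink span, a label, and, for positive instances only, a single anchor word. The sketch below is an illustration under assumptions rather than the annotation tool's actual format: the `Label` and `Annotation` names, the token-offset representation and the validation rules are all hypothetical, with the negative labels mirroring the four kinds of clearly negative link listed above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Tuple


class Label(Enum):
    EVENT_LINK = "approximates an event link"
    NOT_IN_PROSE = "not within prose"
    INSTRUCTIVE_ANCHOR = "instructive anchor text"
    DOCUMENT_REFERENCE = "reference to the target document"
    INAPPROPRIATE_REFERENT = "non-event or non-linkable event"


@dataclass(frozen=True)
class Annotation:
    tokens: List[str]             # the sentence of context, tokenised (assumed)
    link_span: Tuple[int, int]    # [start, end) token offsets of the hyperlink
    target_url: str
    label: Label
    anchor: Optional[int] = None  # index of the single anchor word, positives only

    def __post_init__(self) -> None:
        start, end = self.link_span
        if not 0 <= start < end <= len(self.tokens):
            raise ValueError("link span must lie within the sentence")
        if self.label is Label.EVENT_LINK:
            # A positive instance carries exactly one anchor word.
            if self.anchor is None or not 0 <= self.anchor < len(self.tokens):
                raise ValueError("an event link needs an anchor word in the sentence")
        elif self.anchor is not None:
            raise ValueError("only event links carry an anchor word")

    @property
    def anchor_in_link(self) -> bool:
        """True if the anchor word falls inside the hyperlinked text."""
        if self.anchor is None:
            return False
        start, end = self.link_span
        return start <= self.anchor < end
```

Storing the anchor as an offset into the whole sentence, rather than into the link text, leaves room for anchor words that lie outside the hyperlinked span.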
About half of the sample is labelled positively; of the others, 30% are not within prose, 20% are instructional or a direct reference to the target document, and the remainder involve an inappropriate referent, often a non-event or aggregate fact.⁶
Examples of annotated hyperlinks are shown in Figure 7.1. The listed positive examples illustrate diverse types of event, and links spanning periods from one day (2) to eleven months (4). The hyperlink is often anchored to a phrase including an event predicate, but may merely span the name of a focal event participant as in (3); in 76 (43%) of the positive instances we marked an anchor word within the given hyperlink.
The negative examples shown target: a review of the mentioned product (5); a mentioned publication available as an article on the fd site (6); and related content in a navigation feature (7). It may therefore be hard to distinguish hyperlinked entity names that are events from those that are not, while we expect they are less harmful in learning an event linking model than repetitive, non-syntactic navigation links like (7).
⁵ By default the annotator does not see the full source or target articles, but may open them separately.
⁶ These proportions are estimated from a manual grouping of 50 negative instances.
Other instances break our intuition that an event reference will be linked to an article reporting, or at least mentioning, that event having happened:
(48) a. Mr Price, fellow reporter Melissa Mallett and producer Aaron Wakeley were dismissed by the television network in August after it was revealed the network faked two live helicopter crosses to the Sunshine Coast where police and volunteers were searching for the remains of Daniel Morcombe.
     b. The mobilisation against her was reminiscent of the controversy generated after Adshel pulled down Rip’n’Roll safe-sex ads from Brisbane bus shelters in response to complaints.
     c. Work on redevelopment at the site began in March, with five World War II air raid shelters planned to have been restored along with the timber wharf
The hyperlink in Example 48a points to an article that was published prior to the mentioned dismissal, reporting the initial controversy. In 48b, the target reports the advertisements being reinstated after their removal generated controversy, rather than focally reporting the referenced event; yet it also happens to be the first article to report the advertisement’s removal. While event linking requires the event to be reported as having happened or happening, the target of Example 48c reports that the work “will begin today”. Thus we find that hyperlinks referencing events display a variety of relationships between the mentioned event and the target news report.
Although we have considered using machine learning to distinguish the two classes of hyperlink, we have not yet surpassed the performance of a rule-based approach to removing non-in-text links. Using the rules below, we eliminate 60% of the negative instances in held-out data without sacrificing any recall of positive instances.
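
The two figures quoted here correspond to the share of negative instances a filter removes and the share of positive instances it retains. As a hedged illustration of how they can be computed over labelled held-out data, the sketch below uses a hypothetical `evaluate_filter` helper and an arbitrary `keep` predicate; it does not reproduce the evaluation code or the rules used in this work.

```python
from typing import Callable, Iterable, Tuple, TypeVar

T = TypeVar("T")


def evaluate_filter(
    instances: Iterable[Tuple[T, bool]],
    keep: Callable[[T], bool],
) -> Tuple[float, float]:
    """Return (share of negatives removed, recall of positives).

    `instances` pairs each hyperlink with its gold label (True = positive);
    `keep` is any filter deciding whether a hyperlink is retained.
    """
    pos_total = pos_kept = neg_total = neg_removed = 0
    for item, is_positive in instances:
        kept = bool(keep(item))
        if is_positive:
            pos_total += 1
            pos_kept += kept
        else:
            neg_total += 1
            neg_removed += not kept
    if pos_total == 0 or neg_total == 0:
        raise ValueError("need at least one positive and one negative instance")
    return neg_removed / neg_total, pos_kept / pos_total
```

A filter that removes 3 of 5 negatives while keeping every positive would score (0.6, 1.0) under this definition, the same pattern as the result reported above.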