
Chapter 6 Document geolocation and toponym resolution

6.2 Toponym resolution techniques

Speriosu (2013) developed a number of methods for toponym resolution, some of which applied document geolocation as one component. It is important to note that these methods are unsupervised from the perspective of toponyms, in that they do not rely on an existing corpus marked up with disambiguated toponyms. (WISTR is a partial exception, in that it synthesizes toponym annotations from a document-geolocation corpus. WISTR∗ in §6.2.2 is even more of an exception.)

Some of these methods rely on outside knowledge:

• A corpus, such as Wikipedia, annotated with document-level geocoordinates.

• A document geolocator trained from such a corpus.

• A gazetteer listing known toponyms and possible candidates for their location.

The following are the methods relevant to this dissertation:

• TRIPDL directly uses a document geolocator to produce a probability distribution over a cell grid covering the Earth. Each of the possible candidates for a given toponym in the document is then assigned a probability in line with the document-geolocation probability of the cell containing the candidate, and the highest-probability candidate is chosen for the toponym.

• WISTR is a stronger method that uses document geolocations of Wikipedia in an indirect fashion. A named-entity recognizer is run on a Wikipedia page, and any toponym in the page that has a candidate within 10km of the document’s location is considered to resolve to that candidate. The textual context around those toponyms is used as features to train classifiers to disambiguate those toponyms to their resolved candidates.

For example, the Wikipedia page on Widgery Wharf in Portland, Maine contains various mentions of the toponym Portland, and the gazetteer entry for Portland contains the candidate Portland, Maine, whose location (presumably the city center) is within 10km of the location of Widgery Wharf. Thus, the text surrounding each mention of Portland in this article serves as classifier features to disambiguate a mention of Portland in some other article to Portland, Maine. Combined with appropriate features to identify other Portlands (for example, mentions of Portland in the article on the Portland Youth Philharmonic in Portland, Oregon), a strong classifier can be created.

• TRAWL interpolates between WISTR and TRIPDL, and weights the result by a factor that biases in favor of higher-level administrative entities when e.g. disambiguating between a city and a country of the same name.

• SPIDER is a weighted-minimum-distance resolver that seeks to implement the heuristics of spatial minimality (different toponyms in a text tend to be near each other) and one sense per discourse (multiple instances of a toponym in a text tend to refer to the same location). At its core is a basic minimum-distance resolver, which resolves each toponym to the candidate that is, on average, closest to all other toponyms. (More specifically, for each toponym, it chooses the candidate that minimizes the sum of the distances to the closest candidate of each other toponym.) This has the effect of clumping all toponyms in a document together.

SPIDER builds on top of this basic resolver by attaching a weight to each candidate of a toponym, reflecting its prominence in the corpus. The minimum-distance algorithm is then modified so that all distances computed are divided by the weights of the candidates involved (since smaller distances are better). Furthermore, multiple iterations are run, and at the end of each iteration, the weights are recomputed, reflecting the proportion of times a given candidate has been resolved across the entire corpus.

Geolocator →          Naive Bayes uniform, 1°              Hierarchical k-d
Toponym resolver ↓    Mean (km)  Median (km)  Precision    Mean (km)  Median (km)  Precision
RANDOM                2397       933          23.4%        2397       933          23.4%
TRIPDL                1014       26           57.2%        1235*      38*          51.7%*
TRAWL                 1825       419          42.3%        827*       15*          70.5%*
WISTR                 665        0            74.5%        665        0            74.5%
SPIDER                675        0            74.7%        675        0            74.7%
TRAWL+SPIDER          673        0            74.8%        243*       0            82.0%*
WISTR+SPIDER          422        0            82.5%        422        0            82.5%

Table 6.1: Dev set performance on CWAR using various toponym resolution methods. Values marked with an asterisk are those that have changed from left to right (the others remain the same because their method doesn’t use a geolocator).

across an individual document and the entire corpus. (TRIPDL takes advantage of an entire document’s context through the use of a document geolocator, but still resolves each toponym independently.)

• WISTR+SPIDER and TRAWL+SPIDER use WISTR and TRAWL, respectively, to initialize the weights of SPIDER. The underlying idea is that the weights in SPIDER can be viewed as a set of prior distributions, one per toponym, over the candidates of that toponym. Both WISTR and TRAWL output probability distributions over the candidates of each toponym and use outside knowledge sources to do so, and thus can be used to initialize SPIDER’s weights more intelligently than the uniform initialization that SPIDER uses by itself. These combined methods are generally stronger than either of the component methods standing alone.
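The SPIDER description above can be made concrete with a minimal Python sketch. It is an illustration under stated assumptions, not Speriosu's implementation: the function names, the data layout (candidates keyed by toponym string and position in the gazetteer list), the choice to divide each distance by the product of the two candidates' weights, and the small floor that keeps never-chosen candidates from getting a zero weight are all assumptions made here. Keying candidates by toponym string within a document gives one sense per discourse for free.

```python
import math
from collections import Counter

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def resolve_document(candidates, weights):
    """Weighted minimum-distance resolution of one document.

    candidates: {toponym: [(lat, lon), ...]} with non-empty candidate lists.
    weights:    {(toponym, candidate_index): weight}, all weights > 0.
    Returns {toponym: index of the chosen candidate}.
    """
    resolved = {}
    for topo, cands in candidates.items():
        best_idx, best_cost = 0, math.inf
        for i, cand in enumerate(cands):
            cost = 0.0
            for other, other_cands in candidates.items():
                if other == topo:
                    continue
                # Distance to the closest candidate of each other toponym,
                # divided by the weights involved (assumed: their product).
                cost += min(
                    haversine_km(cand, oc) / (weights[(topo, i)] * weights[(other, j)])
                    for j, oc in enumerate(other_cands)
                )
            if cost < best_cost:
                best_idx, best_cost = i, cost
        resolved[topo] = best_idx
    return resolved

def recompute_weights(candidates_by_doc, resolutions, floor=1e-3):
    """New weight = proportion of times a candidate was chosen across the corpus.

    Assumes a toponym has the same candidate list in every document.
    The floor is an assumption that avoids division by zero later on.
    """
    chosen, totals = Counter(), Counter()
    for res in resolutions.values():
        for topo, idx in res.items():
            chosen[(topo, idx)] += 1
            totals[topo] += 1
    weights = {}
    for cands in candidates_by_doc.values():
        for topo, cs in cands.items():
            for i in range(len(cs)):
                weights[(topo, i)] = max(chosen[(topo, i)] / totals[topo], floor)
    return weights

def spider(candidates_by_doc, iterations=10):
    """Run the iterative resolver from uniform initial weights."""
    weights = {(t, i): 1.0
               for cands in candidates_by_doc.values()
               for t, cs in cands.items()
               for i in range(len(cs))}
    resolutions = {}
    for _ in range(iterations):
        resolutions = {doc: resolve_document(cands, weights)
                       for doc, cands in candidates_by_doc.items()}
        weights = recompute_weights(candidates_by_doc, resolutions)
    return resolutions, weights
```

In this sketch, the combined methods would differ only in the initial `weights` dictionary passed into the first iteration: instead of uniform values, it would hold the per-candidate probabilities output by WISTR or TRAWL.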

6.2.1 Baseline toponym resolution results

As described in §2.4, I redid the CWAR dataset to include coordinates for all of the 56,000+ distinct toponym types originally annotated in the corpus, as opposed to only the 2,000 or so types that were assigned coordinates in Speriosu (2013)’s work. I reran the methods described above, producing the updated results shown in Table 6.1.

In addition, I modified the code that implements these resolvers to allow for the use of the new document geolocation techniques described in this dissertation. This allowed for new variants of TRIPDL, TRAWL and TRAWL+SPIDER, which were run with an underlying hierarchical k-d tree classifier geolocator trained on ENWIKI13 using the optimal settings found in §4.2.3. These results are shown on the right half of Table 6.1. Using a hierarchical classifier does not help with TRIPDL (which is one of the weaker methods in any case; its precision drops from 57.2% to 51.7%), but definitely does with TRAWL and TRAWL+SPIDER, making the latter the strongest method for mean (243 vs. 422 for WISTR+SPIDER), and very nearly as strong for precision as WISTR+SPIDER (82.0% vs. 82.5%). This suggests that a better geolocator can improve the performance of a geolocation-based toponym resolver, a result that is perhaps expected but nonetheless pleasing.
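To make concrete how the geolocator's output enters a resolver such as TRIPDL, the following Python sketch scores each candidate of a toponym by the document-level probability of the grid cell containing it. It assumes a uniform 1° grid (as in the left half of Table 6.1) whose cells are indexed by the floor of latitude and longitude; a hierarchical k-d tree geolocator would need a different cell lookup. The function names, the data layout, and the renormalization over candidates are assumptions made for illustration.

```python
import math

def cell_of(lat, lon, cell_degrees=1.0):
    """Index of the uniform grid cell containing the point (lat, lon)."""
    return (math.floor(lat / cell_degrees), math.floor(lon / cell_degrees))

def tripdl_resolve(candidates, cell_probs, cell_degrees=1.0):
    """Choose the candidate whose cell has the highest document-level probability.

    candidates: non-empty list of (name, lat, lon) for one toponym.
    cell_probs: {cell_index: probability} produced by a document geolocator.
    Returns (chosen candidate, per-candidate probabilities).
    """
    scores = [cell_probs.get(cell_of(lat, lon, cell_degrees), 0.0)
              for _, lat, lon in candidates]
    total = sum(scores)
    if total > 0:
        # Assumed: renormalize cell probabilities over this toponym's candidates.
        probs = [s / total for s in scores]
    else:
        # No candidate falls in a cell with probability mass: fall back to uniform.
        probs = [1.0 / len(candidates)] * len(candidates)
    best = max(range(len(candidates)), key=probs.__getitem__)
    return candidates[best], probs
```

Because each toponym is scored only against the document-level distribution, every toponym in the document is still resolved independently of the others.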

6.2.2 New method WISTR∗ (variant of WISTR)

WISTR, as described above, identifies toponyms in Wikipedia using a named entity recognizer (NER) and disambiguates them by looking for candidates that are very close (10km or closer) to the document’s geolocation. This procedure would be unnecessary if Wikipedia were directly marked up with toponyms and their resolutions, and in fact we can synthesize exactly such toponyms by making use of the hyperlinks between Wikipedia articles. The idea is that

1. we can identify any stretch of text that is linked to a geolocated article and is also found in the gazetteer as a toponym;

2. we can resolve the toponym by finding the candidate in the gazetteer that is closest to the linked article’s geocoordinate, provided the distance does not exceed a threshold (I use 100km, or 500km for candidates that are identified in the gazetteer as states or higher-level administrative entities due to potential disagreements between Wikipedia and the gazetteer in identifying the “representative point” of such a region);

3. we can identify further stretches of the same text in the article as toponyms, with the same resolution (this is necessary because typically only the first mention of a given item in an article is linked).
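The three steps above can be sketched in Python as follows. This is a minimal illustration rather than the code used for the experiments: the gazetteer layout, the `HIGH_LEVEL` labels, the function names, and the use of plain substring search to find further mentions are all assumptions. The 100km and 500km thresholds are the ones stated in step 2.

```python
import math

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

# Hypothetical labels for states and higher-level administrative entities.
HIGH_LEVEL = {"state", "country"}

def resolve_link(anchor_text, target_coord, gazetteer,
                 threshold_km=100.0, admin_threshold_km=500.0):
    """Steps 1 and 2: resolve a link's anchor text to a gazetteer candidate.

    gazetteer: {toponym: [(candidate_name, (lat, lon), kind), ...]}.
    target_coord: (lat, lon) of the linked article, or None if not geolocated.
    Returns the candidate name, or None if the link yields no toponym.
    """
    cands = gazetteer.get(anchor_text)
    if not cands or target_coord is None:
        return None
    name, coord, kind = min(cands, key=lambda c: haversine_km(c[1], target_coord))
    limit = admin_threshold_km if kind in HIGH_LEVEL else threshold_km
    return name if haversine_km(coord, target_coord) <= limit else None

def annotate_article(text, links, article_coords, gazetteer):
    """Step 3: give every mention of a resolved anchor text the same resolution.

    links: list of (anchor_text, target_title) found in the article.
    article_coords: {article_title: (lat, lon)} for geolocated articles.
    Returns a sorted list of (character_offset, toponym, candidate_name).
    """
    annotations, seen = [], set()
    for anchor, target in links:
        if anchor in seen:
            continue
        seen.add(anchor)
        resolved = resolve_link(anchor, article_coords.get(target), gazetteer)
        if resolved is None:
            continue
        start = text.find(anchor)
        while start != -1:
            annotations.append((start, anchor, resolved))
            start = text.find(anchor, start + len(anchor))
    return sorted(annotations)
```

Note that the article containing the link does not itself need a geocoordinate; only the linked article does.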

By doing this procedure, we can identify toponyms both more precisely (since we eliminate NER errors) and in greater number (since we no longer rely on toponyms being close to the article’s own geocoordinate). Finally, we can make use of articles that are not themselves geolocated, which comprise more than 80% of the total, and nearly 95% of those in the Civil War Portal subsection (§2.4).

Corpus        Source
CWARPORTAL    The Civil War Portal subsection of ENWIKI13
TOPOWIKI13    All of ENWIKI13

Table 6.2: New toponym resolution corpora for use with WISTR∗, derived from part or all of ENWIKI13 using a new and better method to identify and resolve toponyms in Wikipedia.

Method                 Corpus                 Mean (km)   Precision (%)
WISTR                  ENWIKI13               850         69.5
WISTR+SPIDER           ENWIKI13               107         89.5
WISTR∗                 TOPOWIKI13             713         80.8
WISTR∗+SPIDER          TOPOWIKI13             85          91.3
WISTR∗                 CWARPORTAL             183         86.8
WISTR∗+SPIDER          CWARPORTAL             61          92.0
WISTR/WISTR∗           ENWIKI13+CWARPORTAL    463         83.1
WISTR/WISTR∗+SPIDER    ENWIKI13+CWARPORTAL    87          91.1

Table 6.3: Results for WISTR and WISTR∗ on CWAR.

Using this new procedure, I create two new toponym resolution corpora from ENWIKI13 (see Table 6.2).

I then create a variant of WISTR, which I term WISTR∗, that directly relies on the toponyms in these corpora rather than finding toponyms in the former, more roundabout fashion. Table 6.3 shows the results of running on the CWAR corpus. In addition, I investigate results using a combination of WISTR∗ features from CWARPORTAL, and WISTR features extracted from all of ENWIKI13.

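Interestingly, WISTR∗ results are noticeably better using only CWARPORTAL than using TOPOWIKI13 (the entire Wikipedia): the mean error falls from 713 km to 183 km and precision rises from 80.8% to 86.8% (Table 6.3). This demonstrates the importance of in-domain data.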

6.2.3 Variants of SPIDER

I modified SPIDER to incorporate a document-level geotag when it is available. Such a geotag is typically available in the toponym-resolution portion of co-training (§6.3), as the toponym resolver is fed documents that have already been annotated by the document geolocator. There are two ways to do this:

WEIGHTED This method sets the initial weight of each candidate of a toponym to be inversely related to the candidate’s distance from the document-level geotag.

ADDTOPO This method modifies SPIDER to add an additional toponym corresponding to the document-level geotag, effectively containing only one possible candidate, which resolves to the location of the document-level geotag. This biases SPIDER in favor of resolving other toponyms nearby, in order to satisfy the spatial minimality component of the algorithm.

ADDTOPO can be combined with any of the WISTR variants, but WEIGHTED cannot, because its settings for the initial weights would conflict with the WISTR settings.
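The difference between the two variants can be illustrated with a short Python sketch. The text above states only that WEIGHTED makes the initial weights inversely related to distance; the specific form 1/(1 + distance in km), the per-toponym normalization, the pseudo-toponym name, and the function names are assumptions chosen for illustration.

```python
import math

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def weighted_initial_weights(candidates, geotag):
    """WEIGHTED: initial weights that decrease with distance from the geotag.

    candidates: {toponym: [(lat, lon), ...]}; geotag: (lat, lon).
    Returns {(toponym, candidate_index): weight}, normalized per toponym.
    The 1 / (1 + d) form is an assumed instance of "inversely related".
    """
    weights = {}
    for topo, cands in candidates.items():
        raw = [1.0 / (1.0 + haversine_km(c, geotag)) for c in cands]
        total = sum(raw)
        for i, r in enumerate(raw):
            weights[(topo, i)] = r / total
    return weights

# Hypothetical name for the extra toponym that stands in for the geotag.
DOC_GEOTAG = "<doc-geotag>"

def add_geotag_toponym(candidates, geotag):
    """ADDTOPO: add a pseudo-toponym whose only candidate is the geotag.

    The initial weights are left untouched, so they remain free to be set
    by another method. Returns a new dict; the input is not modified.
    """
    extended = dict(candidates)
    extended[DOC_GEOTAG] = [geotag]
    return extended
```

The sketch shows why only ADDTOPO composes with a WISTR-based initialization: it changes the set of toponyms that the minimum-distance step sees, whereas WEIGHTED occupies the same initial-weight slot that WISTR would fill.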

In document Text-based document geolocation and its application to the digital humanities (Page 147-152)