Contributions - Text-based document geolocation and its application to the digital humanities

This dissertation includes the following contributions to the field of natural language processing:

• An investigation of various effective methods for supervised geolocation of a test document, i.e. associating the document with a particular set of latitude/longitude coordinates on the Earth. I consider methods that divide the Earth’s surface into rectangular grid cells—either of constant degree size or using an adaptive k-d tree (Bentley, 1975)—and find the single best grid cell, relying exclusively on the text of a document. Many of these methods are simple to implement and fast to run, but give comparable accuracy to more complicated and slower Bayesian methods. These methods are fast enough to be scaled up to a large amount of training material, even with a fine-scale grid mesh, and easy to parallelize. Among these methods are Naive Bayes, KL divergence, Average Cell Probability (which involves inverting unigram distributions to determine a distribution of cells for a given word) and a few different baselines.

• An application and careful analysis of the performance of these methods, along with various smoothing techniques as well as the information gain ratio (IGR) feature selection technique of Han et al. (2014), to several different corpora: Three corpora of Twitter user feeds of dis- tinct natures; dumps of the English, German and Portuguese versions of Wikipedia, processed

by custom-written software; and a large set of image tags corresponding to geotagged Flickr images. I consider a number of different evaluation metrics and show that none of the geolocation techniques consistently outperforms Naive Bayes on all the corpora, including IGR, which performs well on the Twitter corpora for which is was designed, but not on the other corpora.

• The application of logistic regression to geolocation. Contrary to the claims of Han et al. (2014), I show that logistic regression can be more accurate than Naive Bayes and KL divergence (including variants incorporating feature selection) and fast enough to run on large corpora. Logistic regression itself very effectively picks out words with high geographic sig- nificance. In addition, because logistic regression does not assume feature independence, complex and overlapping features of various sorts can be employed.

• A new method for supervised geotagging, which involves a hierarchical discriminative classifier that creates multiple individual classifiers at different grid resolutions and combines them to achieve better results than could be done using a classifier at a single level. This method scales well to large training sets and greatly improves results across a wide variety of corpora. In fact, I am able to achieve state-of-the-art results on all of the large corpora I evaluate on. Im- portantly, this is the first method that improves upon straight uniform-grid Naive Bayes on all of these corpora, in contrast with k-d trees (Roller et al., 2012) and the current state-of-the-art technique for Twitter users of geographically-salient feature selection Han et al. (2014). • The development of a new NLP task, text-based document geolocation of historical corpora,

to assist the application of document geolocation techniques to the digital humanities. I apply my techniques to two new corpora (see below), establishing a baseline for further research and showing how good accuracy can be achieved even with text-only documents and with relatively little training material. I further show how domain adaptation techniques that make use of the large amount of geographic annotation available in the English Wikipedia can dra- matically reduce the amount of annotated training data required to achieve equivalent levels of performance. I create learning curves to investigate the minimal amount of annotation re-

quired to achieve a given level of performance, to assist further researchers in deciding how much money and effort to spend on annotation.

• Two new annotated historical corpora for use with the new NLP task I developed (see above). These two corpora are of significantly different size and subject matter—a 19th-century Amer- ican travel log (Western Wilds by John Beadle) and a large set of primary-source documents from the American Civil War archives (War of the Rebellion). The annotations on the Civil War archives are in the form of polygons or multipoints when appropriate, allowing for more sophisticated analyses than can be achieved with typical single-point annotations. I put a sig- inificant amount of work into cleaning up the entire set of Civil War archives (not just the annotated portion) and dividing it into individual documents—some 255,000 in all, spread over 126 volumes. This alone should be of great benefit to digital humanities researchers in- terested in further work on War of the Rebellion. All the data will be released publicly, along with the source code required to process the data and detailed instructions on how to operate it.

• A new technique for creating geographic topic models, based on David Blei’s dynamic topic models (Blei and Lafferty, 2006) and applied to the above Civil War archives. This involves identifying, using a domain expert, a set of regions corresponding to coherent theaters of war. These are then linearized and treated similarly to the timeslices of a standard dynamic topic model. This has the effect of creating topics that vary over geography, allowing for broad- ranging variationist analyses to be performed.

• A new geolocation technique for text-only corpora involving co-training between document geolocation and toponym resolution, building on the toponym resolution methods previously investigated by Speriosu (2013). This has the effect of introducing external information into the process in the form of a gazetteer of locations, yielding the potential to significantly in- crease the geolocation accuracy beyond what can be extracted from the text alone. This demonstrates superior results on some metrics, and can serve as a branching-off point for further research in this area.

• A program that implements the methods described above.

• Processed versions of all the corpora I use for evaluation (to the extent this is legally possible), and programs for recreating them. (This includes the various modern and toponym-resolution corpora I make use of, not just the two new historical corpora I annotated.) Similarly processed versions from other sources can also be created (e.g. different dumps of Wikipedia, different sets of tweets). These programs are written in a modular fashion, so that the components that do various types of processing can be reused in other programs needing such processing, and new components can be created to do similar operations, e.g. as might be required to analyze the dump of a foreign-language Wikipedia.

In document Text-based document geolocation and its application to the digital humanities (Page 33-37)