6.3 Block representations

6.3.1 Preprocessing

All of the block representation techniques in this section require some preprocessing to obtain a basic representation of the text within each block on a newspaper page. We discuss the preprocessing steps below, and, unless specified otherwise, we employ all steps for every block representation technique.

Extraction

The blocks extracted in Section 5.7 are represented as two (𝑥, 𝑦) coordinates indicating the top-left and bottom-right corners on a page. The ocr output is in an html format with the coordinates of each character, as well as a grouping of characters for each word. In order to extract the text for a given block, we first calculate for each word (as grouped by the ocr engine) its top-left and bottom-right coordinates. Then, for each text block, we find all the words whose boundaries are wholly or partially within the bounds of the text block. Words can only be assigned to a single text block. In cases where a word overlaps with more than one block, the block with the largest overlapping area takes precedence.
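
The assignment rule described above can be sketched as follows. This is a minimal illustration rather than the implementation used in this work: the names `Box`, `overlap_area` and `assign_word` are hypothetical, coordinates are assumed to grow rightwards and downwards from the top-left of the page, and ties are assumed to go to the earlier block.

```python
from typing import NamedTuple, Optional, Sequence


class Box(NamedTuple):
    """Axis-aligned rectangle given by its top-left and bottom-right corners."""
    left: float
    top: float
    right: float
    bottom: float


def overlap_area(a: Box, b: Box) -> float:
    """Area of the intersection of two boxes, or 0.0 if they do not overlap."""
    width = min(a.right, b.right) - max(a.left, b.left)
    height = min(a.bottom, b.bottom) - max(a.top, b.top)
    if width <= 0 or height <= 0:
        return 0.0
    return width * height


def assign_word(word: Box, blocks: Sequence[Box]) -> Optional[int]:
    """Index of the block with the largest overlap, or None if no block overlaps."""
    best_index, best_area = None, 0.0
    for index, block in enumerate(blocks):
        area = overlap_area(word, block)
        if area > best_area:  # strict comparison: the earlier block wins a tie
            best_index, best_area = index, area
    return best_index


blocks = [Box(0, 0, 100, 50), Box(100, 0, 200, 50)]
word = Box(90, 10, 130, 20)  # 10 units wide inside block 0, 30 units inside block 1
print(assign_word(word, blocks))  # 1
```

Because each word receives at most one index, a word that straddles two blocks is never counted twice.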

Tokenisation

Tokenisation is the process of splitting a text string into individual units or tokens, such as words and punctuation, for subsequent processing. A simple method could split a string every time a space is encountered. For example, the string “My name is John.” could be split into four tokens: [My, name, is, John.]. In this simple method, the full stop was left on the last token. In ir and nlp applications, it is typically desired to have punctuation in separate tokens. A more advanced method for tokenisation would tokenise the above sentence into five tokens: [My, name, is, John, .]. Other examples of more complex tokenisation include splitting contractions such as “isn’t” into tokens “is” and “n’t”.
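
The difference between the two methods can be illustrated with a short sketch. The regular expression below is an assumption chosen for illustration; it separates punctuation into its own tokens but does not split contractions, and it is not the tokeniser used in our experiments.

```python
import re

text = "My name is John."

# Simple method: split on whitespace, so punctuation stays attached to words.
simple_tokens = text.split()

# Punctuation-aware method: \w+ matches a run of word characters and
# [^\w\s] matches a single punctuation character as its own token.
punct_tokens = re.findall(r"\w+|[^\w\s]", text)

print(simple_tokens)  # ['My', 'name', 'is', 'John.']
print(punct_tokens)   # ['My', 'name', 'is', 'John', '.']
```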

We tokenise the ocr text into words and discard punctuation using the Natural Language Toolkit (nltk, Bird et al., 2009) “Punkt” tokeniser (Kiss and Strunk, 2006). We use the pre-trained model included in nltk for applying this tokeniser over English text.

The ordering of words within a text block is not used by our clustering techniques below, but it is necessary for proper tokenisation. While we have shown the ordering of text within an entire page of digitised text to be poor (see Section 5.1.1), the ordering of text within paragraphs tends to be quite good. For application of the Punkt tokeniser, we maintain the ordering of text within text blocks using the ocr output. This ordering is discarded in the vectorisation step below.

Case removal

Words with the same spelling but different casing often share the same meaning. For example, “white” and “White” may both refer to the colour white, with the latter capitalised because it occurred at the beginning of a sentence. However, this is not always the case: “White” might be a proper noun referring to a person with the surname White.

Treating words with different casing as two distinct tokens in the vocabulary hinders our goal of attaining dense block representations. Typically, the lowercase forms of common words occur much more frequently than their uppercase counterparts. This would make it much more difficult to infer meaning from the presence of “White” than from the presence of “white”. In our processing, we remove case to increase the density of block representations.
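
As a small illustration of the effect on density, the sketch below counts tokens before and after case removal. It assumes that removing case means converting every token to lowercase, and the token list is invented for the example.

```python
from collections import Counter

tokens = ["White", "paint", "and", "white", "walls", "White"]

cased_counts = Counter(tokens)
lowered_counts = Counter(token.lower() for token in tokens)

# With case preserved, the evidence for the term is split across two entries.
print(cased_counts["White"], cased_counts["white"])  # 2 1
# After lowercasing, a single entry carries all three occurrences.
print(lowered_counts["white"])  # 3
# The vocabulary shrinks from five distinct tokens to four.
print(len(cased_counts), len(lowered_counts))  # 5 4
```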

Stop word removal

Stop words are terms that are filtered out of a document because they do not aid later processing steps. For example, function words such as “the”, “and”, and “is” do not convey meaning when processed individually. For our goal of article segmentation, these words would not provide utility in determining which text blocks belong to the same article. Note that this is typically only the case in “bag-of-words” models (those that ignore the ordering of words), such as our models for block representations in this chapter. In more advanced nlp techniques, such as sentence parsing, these words are essential. In our experiments, we remove from text blocks the stop words found in the standard nltk English stop words list.
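
Stop word filtering amounts to a set membership test per token, as the sketch below shows. The `STOP_WORDS` set here is a short hypothetical stand-in used only for illustration; the nltk English list is considerably longer.

```python
# Hypothetical stand-in for a stop word list; not the nltk English list.
STOP_WORDS = frozenset({"the", "and", "is", "a", "of", "in"})


def remove_stop_words(tokens):
    """Return the tokens that are not in STOP_WORDS, preserving their order."""
    return [token for token in tokens if token not in STOP_WORDS]


tokens = ["the", "mayor", "is", "in", "the", "town", "hall"]
print(remove_stop_words(tokens))  # ['mayor', 'town', 'hall']
```

The filter assumes that tokens have already been lowercased, so it would be applied after case removal.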

Stemming

Stemming is the process of reducing words to their grammatical roots. For example, the word “processing” could be reduced to “process” by removing the “-ing” suffix. We employ a simple suffix-stripping stemming algorithm, as done by Aiello and Pegoretti. We use the nltk implementation of the English Snowball stemmer (Porter, 1980).
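
To convey the general idea of suffix stripping, the toy sketch below removes one of a handful of suffixes from a word. It is deliberately not the Snowball algorithm, which applies a much larger set of ordered rules and conditions; the suffix list, the minimum stem length and the function name `strip_suffix` are all assumptions made for this illustration.

```python
# Hypothetical suffix list, checked in this order; not the Snowball rule set.
SUFFIXES = ("ing", "ed", "es", "s")
MIN_STEM_LENGTH = 3  # assumed lower bound so that short words are left alone


def strip_suffix(word: str) -> str:
    """Remove the first applicable suffix, or return the word unchanged."""
    for suffix in SUFFIXES:
        if not word.endswith(suffix):
            continue
        if suffix == "s" and word.endswith("ss"):
            continue  # keep words such as "process" intact
        stem = word[: -len(suffix)]
        if len(stem) >= MIN_STEM_LENGTH:
            return stem
    return word


words = ["processing", "processed", "processes", "process", "is"]
print([strip_suffix(w) for w in words])
# ['process', 'process', 'process', 'process', 'is']
```

All four inflected forms map to the same stem, so they contribute to a single vocabulary entry.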

Vectorisation

We represent each text block on a page as a count vector. A count vector is a vector of length equal to the vocabulary size for a given corpus, with the value of each element in the vector corresponding to the number of times a particular term appears in that text block. For example, consider a page with two text blocks. The first block contains the text “my name is john”, and the second block contains the text “hello john”. The total vocabulary consists of 5 words: “hello”, “is”, “john”, “my”, and “name”. The first element of a count vector with this vocabulary would represent the number of times “hello” appeared in the block, while the second element in the count vector would represent the number of times “is” appeared in the block, and so on. Hence, the first block would have count vector [0, 1, 1, 1, 1], and the second block would have count vector [1, 0, 1, 0, 0].
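
The worked example above can be reproduced with a few lines of code. This is a minimal sketch, assuming whitespace-separated tokens and an alphabetically ordered vocabulary; the helper name `count_vector` is hypothetical.

```python
from collections import Counter

blocks = ["my name is john", "hello john"]
tokenised = [block.split() for block in blocks]

# Vocabulary: every distinct term across the blocks, in alphabetical order.
vocabulary = sorted({token for tokens in tokenised for token in tokens})


def count_vector(tokens, vocabulary):
    """Count how often each vocabulary term occurs in one block."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


vectors = [count_vector(tokens, vocabulary) for tokens in tokenised]
print(vocabulary)  # ['hello', 'is', 'john', 'my', 'name']
print(vectors)     # [[0, 1, 1, 1, 1], [1, 0, 1, 0, 0]]
```

Since the vector is built from counts alone, the ordering of words within each block has no effect on the result.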