

Figure 5.10: Source document on left and desired output on right, with bounding boxes for each of the articles on the page. Excerpt from the SMH 1988-03-09, p. 72.

5.5 Annotation task

To evaluate the effectiveness of our distant supervision technique, we developed an annotation task for humans to manually validate a portion of our automatically created article segmentation dataset. The goal given to annotators was to identify the minimal rectangular bounding boxes that enclosed all the text of each article on a newspaper page. For example, in Figure 5.10, (a) shows the source newspaper page and (b) the desired result, with boxes surrounding each of the articles.
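As an illustration of what "minimal rectangular bounding box" means here, the following is a minimal sketch that computes the tightest axis-aligned rectangle around a set of character boxes. The `Box` type, its field names, and the coordinate convention (origin at the top-left of the page) are assumptions made for this example and are not taken from our dataset format.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle; origin assumed at the top-left of the page."""
    left: float
    top: float
    right: float
    bottom: float


def minimal_bounding_box(char_boxes: Iterable[Box]) -> Box:
    """Return the smallest rectangle that encloses every character box."""
    boxes = list(char_boxes)
    if not boxes:
        raise ValueError("an article must contain at least one character")
    return Box(
        left=min(b.left for b in boxes),
        top=min(b.top for b in boxes),
        right=max(b.right for b in boxes),
        bottom=max(b.bottom for b in boxes),
    )


# Three hypothetical character boxes belonging to one article.
chars = [Box(10, 20, 18, 32), Box(19, 20, 27, 32), Box(10, 34, 18, 46)]
article_box = minimal_bounding_box(chars)
print(article_box)
```

Any rectangle smaller than the one returned would cut off at least one character, which is the property annotators were asked to preserve when adjusting a box.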

We provided the bounding boxes generated from the distant supervision technique as a starting point for each of the annotators. The annotators could then modify bounding boxes to achieve the required article segmentation. Although this had the potential of influencing annotators’ decisions, it greatly reduced the time required to annotate each page.

In cases where our distant supervision technique did not identify an article on a page, the annotator could add bounding boxes. Similarly, where articles were incorrectly identified on a page, the annotator could remove the unneeded bounding boxes.

The guidelines provided to the annotators were more detailed than the task definition (Chapter 4), with numerous explicit examples to facilitate high inter-annotator agreement. A copy of the guidelines has been attached as Appendix A.

Three annotators corrected dev at various stages. All three annotators were graduate-level students, fluent in English, and familiar with the newspapers they were annotating.

5.5.1 Annotation interface

We developed a task-specific, web-based annotation interface. Existing tools were considered, such as Aletheia (Clausner et al., 2011a), which is commonly used for page segmentation. At the time, however, Aletheia was limited to computers running Microsoft Windows and presented an interface with many features that were not necessary for this annotation task. We built the interface to be as simple as possible, minimising distractions so that annotations could be completed efficiently.

The interface, shown in Figure 5.11, provided an overlay of OCR text on top of the greyscale image of the original newspaper page. We implemented only the following features, which we deemed either imperative or time saving, in order to minimise clutter and allow for efficient annotation.

Move edges or corners: To move the edge of a bounding box, annotators could click on the side that they intended to resize and drag in or out. It was also possible to resize two sides at once by dragging a corner. This is an intuitive interaction design pattern that exists in many applications. The mouse pointer turned into an arrow when hovering over an edge or corner that could be resized. This arrow was necessary as two edges of different boxes that are close to each other could be difficult to differentiate.

Move boxes: To move an entire bounding box, the annotator could click and drag on the title of the box, which would appear in the centre of the box when the mouse hovered over it. This functionality was seldom used as it was usually quicker and easier to move an article’s bounding box by moving the individual edges or corners as described above.


Figure 5.11: Our annotation interface running in a web browser. The OCR text was overlaid on top of the original newspaper image. Simple controls were placed at the top of the page. Excerpt from the SMH 1987-01-21, p. 7.

Create boxes: Annotators were given the option to create new bounding boxes using an Add New Box button at the top of the interface. Clicking this button would create a new bounding box in the middle of the page that could be resized or moved just like an existing box. This functionality was necessary as the automatic alignment was not perfect and could miss entire articles on a page.

Delete boxes: Annotators could remove boxes in cases where articles had been incorrectly recognised. Once a box enclosed zero characters it would be removed from the interface and marked for deletion in the dataset.

Page navigation: Annotators could navigate between pages using previous and next buttons at the top of the page. The navigation bar also showed the number of pages remaining to be annotated.

Automatic saving: The interface automatically saved all changes, without the need for the annotator to manually initiate a save operation.

Colourisation: The interface automatically coloured the characters enclosed by each article’s bounding box with a unique colour for each article. Characters that were not enclosed by any bounding box remained black. This feature allowed annotators to quickly identify characters that had not been included in a bounding box.

Newspaper image: By default, an image of the newspaper scan appeared as a semi-transparent background of each page, with the OCR characters aligned with the background image. The image usually helped annotators to identify article boundaries quickly; however, at times it was distracting, especially when the OCR text was not correctly aligned. In these cases, annotators could temporarily hide the background image using a toggle button in the navigation bar at the top of the page.
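The colourisation and deletion behaviour described in the feature list can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the code of our interface: the `Char` and `ArticleBox` types, the `colourise` function, the use of a character's centre point to decide enclosure, the first-match rule for overlapping boxes, and the HSL colour strings are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Char:
    """An OCR character, located by its centre point (an assumption)."""
    text: str
    x: float
    y: float


@dataclass(frozen=True)
class ArticleBox:
    article_id: str
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, ch: Char) -> bool:
        # Enclosure is tested on the character centre (hypothetical rule).
        return self.left <= ch.x <= self.right and self.top <= ch.y <= self.bottom


def colourise(chars: List[Char], boxes: List[ArticleBox]) -> Tuple[List[str], List[str]]:
    """Return one colour per character and the ids of boxes enclosing no characters."""
    n = len(boxes)
    # Spread hues evenly so each article gets a distinct colour (for n <= 360).
    colour_of = {b.article_id: f"hsl({360 * i // n}, 80%, 40%)" for i, b in enumerate(boxes)}
    enclosed = {b.article_id: 0 for b in boxes}

    char_colours = []
    for ch in chars:
        colour = "black"  # characters outside every box stay black
        for box in boxes:
            if box.contains(ch):
                enclosed[box.article_id] += 1
                if colour == "black":
                    # With overlapping boxes, the first match sets the colour.
                    colour = colour_of[box.article_id]
        char_colours.append(colour)

    # A box that encloses zero characters is marked for deletion.
    to_delete = [article_id for article_id, count in enclosed.items() if count == 0]
    return char_colours, to_delete


chars = [Char("a", 5, 5), Char("b", 50, 50)]
boxes = [ArticleBox("a1", 0, 0, 10, 10), ArticleBox("a2", 100, 100, 120, 120)]
char_colours, to_delete = colourise(chars, boxes)
print(char_colours, to_delete)
```

In this sketch the second character lies outside both boxes and therefore stays black, while the box `a2` encloses nothing and is returned for deletion, mirroring the way an annotator could delete a box by shrinking it until it held no characters.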
