2.3.7 Recent work in layout analysis

The ICDAR page segmentation competitions have been quite successful in advancing the state of the art in page segmentation. The combination of a well-defined task, large datasets, and consistent evaluation has enabled meaningful progress to be accurately gauged over a long period of time. Methods continue to be published in the literature, even outside the context of the competitions, indicating that the task is still popular and not yet solved for complex documents.

Rangoni et al. (2012) propose a method for labelling logical structures in document images (logical layout analysis) using a dynamic perceptive neural network. Their system utilises the output of geometric layout analysis and the output of OCR, combining geometric, morphological and semantic information to apply logical labels. It also considers the context in the labels of surrounding blocks using several cycles of recognition and correction. The work by Rangoni et al. highlights how semantic information from recognised text can be a useful input to logical layout analysis and how an iterative approach can incorporate contextual information to improve the accuracy of logical labels.

Zirari et al. (2013) present an efficient method to separate the textual and non-textual components of a document image. Their approach models the document image as a graph: nodes in the graph are pixels and edges are added for pixels that are connected. Each connected component (i.e. each connected subgraph) is classified as either textual or non-textual by analysing the size and alignment of connected components. Zirari et al. assume that textual regions contain aligned components of very similar size (characters), while non-textual regions contain components of varying size and alignment. They evaluated their approach on two datasets, the UW-III and ICDAR 2009 page segmentation competition datasets. Zirari et al. showed their method to perform better than the previous state-of-the-art approach by Bukhari et al. (2010), demonstrating that efficient rule-based systems can outperform machine learning approaches.
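The graph view of a binarised page described above can be illustrated with a small sketch. It is not the method of Zirari et al.: the 8-connectivity, the median-height rule, and the `height_tolerance` parameter are assumptions made for illustration, and the alignment test is omitted for brevity.

```python
from collections import deque
from statistics import median


def connected_components(image):
    """Return the 8-connected foreground components of a binary image.

    `image` is a list of rows of 0/1 values; each component is a list of
    (row, col) pixels. Pixels are graph nodes, adjacency gives the edges.
    """
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first search over the implicit pixel graph.
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components


def component_height(pixels):
    """Height in pixels of a component's bounding box."""
    ys = [y for y, _ in pixels]
    return max(ys) - min(ys) + 1


def classify_components(components, height_tolerance=0.5):
    """Hypothetical rule: 'text' if the height is close to the median height."""
    heights = [component_height(p) for p in components]
    typical = median(heights)
    return ["text" if abs(h - typical) <= height_tolerance * typical else "non-text"
            for h in heights]


# Toy page: three character-like blobs of equal height and one large blob.
image = [[0] * 16 for _ in range(8)]
for left in (0, 3, 6):
    for r in range(1, 4):
        for c in range(left, left + 2):
            image[r][c] = 1
for r in range(0, 7):
    for c in range(10, 15):
        image[r][c] = 1

components = connected_components(image)
labels = classify_components(components)
```

On this toy page the three equally sized blobs are labelled as text and the large blob as non-text. A fixed tolerance of this kind would need tuning on real pages, where font sizes vary within a region.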

Wei et al. (2013) compare three different classifiers, namely support vector machines (SVM), multilayer perceptrons (MLP), and Gaussian mixture models (GMM), for the task of page segmentation of historical document images. Pixels are classified into one of four categories: periphery, background, text and decoration (e.g. historiated initials). The features for classification include pixel coordinates, RGB colour values, and minimum/maximum pixel values within horizontal and vertical neighbourhoods. Wei et al. evaluate the three classifiers and a combined classifier on three datasets of historical manuscripts containing 127 pages in total. The SVM, MLP, and GMM achieved accuracies of 91.86%, 92.64% and 85.63% respectively over the combined datasets. The combined classifier took the majority vote label from each of the three classifiers and achieved an accuracy of 92.43%. The SVM and MLP performed better than the GMM across all datasets, and the MLP classifier outperformed all other models, including the combined approach. Wei et al. offer useful insight into the types of features and models that work well for pixel classification segmentation approaches. However, the poor performance in classifying decorative pixels suggests improvements are needed to handle graphical content.
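A majority-vote combiner of the kind described above can be sketched in a few lines. The toy predictions below are invented for illustration. The source does not say how three-way disagreements were resolved, so the `fallback_index` rule is an assumption.

```python
from collections import Counter

LABELS = ("periphery", "background", "text", "decoration")


def majority_vote(predictions, fallback_index=0):
    """Return the label chosen by more than half of the classifiers.

    If no label has a strict majority, fall back to the classifier at
    `fallback_index` (an assumed tie-breaking rule, not from the source).
    """
    label, votes = Counter(predictions).most_common(1)[0]
    if votes > len(predictions) / 2:
        return label
    return predictions[fallback_index]


def accuracy(predicted, truth):
    """Fraction of pixels whose predicted label matches the ground truth."""
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)


# Invented per-pixel predictions from three classifiers, for four pixels.
svm = ["text", "text", "background", "decoration"]
mlp = ["text", "background", "background", "text"]
gmm = ["background", "background", "periphery", "periphery"]
truth = ["text", "background", "background", "text"]

# Resolve three-way disagreements in favour of the second classifier.
combined = [majority_vote(votes, fallback_index=1) for votes in zip(svm, mlp, gmm)]
```

A combiner like this can only match or beat its best member when the members make different errors. That is consistent with the reported result, where the combined classifier scored slightly below the MLP alone.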

Chen et al. (2015b) propose a solution for page segmentation in historical handwritten documents. They also treat the problem as a pixel classification task, with the same output labels as Wei et al. (2013). However, instead of relying on hand-crafted features, Chen et al. use a convolutional autoencoder (Hinton and Salakhutdinov, 2006) to learn feature representations automatically. They use these features to train an SVM classifier and evaluate the performance of their system on the same three datasets as Wei et al. (2013). Chen et al. show that learning feature representations using a convolutional autoencoder performs substantially better than hand-crafted approaches such as those used in Wei et al. (2013).

Wei et al. (2017) explore using deep learning to fine-tune features for document layout analysis. As in Chen et al. (2015b), Wei et al. use stacked autoencoders to learn input features for classification. However, Wei et al. also employ fine-tuning to adjust the weights of the encoder layers of the network to the specific task of pixel classification for page segmentation. They compare the features generated by the stacked autoencoders with and without fine-tuning using a feature selection method. Wei et al. showed that fine-tuning features resulted in higher accuracy in pixel classification. However, most of the features, fine-tuned or not, were redundant, and the earlier autoencoder layers tended to have more useful features. Wei et al. highlight the shortcomings of features learned using deep neural networks and demonstrate the value of fine-tuning such features for specific tasks.

Melinda et al. (2017) propose a system for document layout analysis in newspapers using multi-Gaussian fitting. They fit a selected number of Gaussians to the histogram of connected component heights in a binarised image. Melinda et al. manually select the number of Gaussians according to the expected number of categories for components (e.g. body text, titles, graphics). The result is a set of split points to distinguish connected components into their respective categories. A limitation of this approach is that it requires manual selection of the number of Gaussians, so newspaper pages that contain varying font sizes could cause this technique to fail.
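The idea of deriving split points from Gaussians fitted to component heights can be sketched as follows. This is a sketch under stated assumptions, not the procedure of Melinda et al.: it fits a one-dimensional mixture with expectation-maximisation, and it places each split where the higher component becomes more likely than the lower one. The heights are invented toy data, and `fit_gaussians`, `split_points` and `categorise` are hypothetical helper names.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_gaussians(values, k, iterations=50, min_var=1e-3):
    """Fit k one-dimensional Gaussians with EM.

    Returns a list of (mean, variance, weight) tuples sorted by mean.
    """
    data = sorted(values)
    n = len(data)
    # Initialise the means at evenly spaced quantiles of the data.
    means = [data[int((i + 0.5) * n / k)] for i in range(k)]
    overall_mean = sum(data) / n
    overall_var = max(sum((x - overall_mean) ** 2 for x in data) / n, min_var)
    variances = [overall_var] * k
    weights = [1.0 / k] * k
    for _ in range(iterations):
        # E-step: responsibility of each component for each height.
        resp = []
        for x in data:
            dens = [w * gaussian_pdf(x, m, v)
                    for w, m, v in zip(weights, means, variances)]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean and variance of each component.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj < 1e-12:
                continue
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            variances[j] = max(
                sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj,
                min_var)
    return sorted(zip(means, variances, weights))


def split_points(fitted, steps=1000):
    """Scan between adjacent means for the point where the upper component wins."""
    splits = []
    for (m1, v1, w1), (m2, v2, w2) in zip(fitted, fitted[1:]):
        split = m2
        for i in range(steps + 1):
            x = m1 + (m2 - m1) * i / steps
            if w2 * gaussian_pdf(x, m2, v2) >= w1 * gaussian_pdf(x, m1, v1):
                split = x
                break
        splits.append(split)
    return splits


def categorise(height, splits):
    """Index of the height category (0 is the smallest)."""
    return sum(height >= s for s in splits)


# Invented component heights: body text near 10 px, titles near 30 px.
heights = [9, 10, 10, 11, 10, 9, 11, 10, 28, 30, 31, 29, 30]
fitted = fit_gaussians(heights, k=2)
splits = split_points(fitted)
```

The number of components `k` must be chosen in advance, which is the limitation noted above. A page with a third font size would be forced into one of the two categories.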

Chen et al. (2017) present a system for page segmentation of handwritten historical document images that utilises a convolutional neural network (CNN) to classify each pixel on a page according to its type. In contrast to previous CNN approaches to page segmentation, Chen et al. propose a shallow CNN consisting of only a single convolutional layer. They evaluate their approach on the same datasets as in Chen et al. (2015b), and the same four output labels: periphery, background, text and decoration. Chen et al. show their simple CNN to outperform substantially more complicated network architectures.
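To make the phrase "a single convolutional layer" concrete, the sketch below runs the forward pass of a network with one convolutional layer followed by a linear classifier over the four labels. Only the single convolutional layer and the label set come from the text above. The patch size, filter count, ReLU activation, softmax output and random weights are all assumptions, and no training is shown.

```python
import math
import random

LABELS = ("periphery", "background", "text", "decoration")


def conv2d_valid(patch, kernel):
    """'Valid' 2D cross-correlation, the operation CNN libraries call convolution."""
    k = len(kernel)
    size = len(patch) - k + 1
    out = []
    for r in range(size):
        row = []
        for c in range(size):
            acc = 0.0
            for i in range(k):
                for j in range(k):
                    acc += patch[r + i][c + j] * kernel[i][j]
            row.append(acc)
        out.append(row)
    return out


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def forward(patch, kernels, dense_weights, dense_bias):
    """One convolutional layer with ReLU, flattened into a linear softmax layer."""
    features = []
    for kernel in kernels:
        feature_map = conv2d_valid(patch, kernel)
        features.extend(max(0.0, v) for row in feature_map for v in row)
    scores = [sum(w * f for w, f in zip(weights, features)) + b
              for weights, b in zip(dense_weights, dense_bias)]
    return softmax(scores)


# Hypothetical sizes and untrained random weights, for illustration only.
rng = random.Random(0)
patch_size, kernel_size, n_filters = 5, 3, 2
patch = [[rng.random() for _ in range(patch_size)] for _ in range(patch_size)]
kernels = [[[rng.uniform(-1, 1) for _ in range(kernel_size)]
            for _ in range(kernel_size)] for _ in range(n_filters)]
n_features = n_filters * (patch_size - kernel_size + 1) ** 2
dense_weights = [[rng.uniform(-1, 1) for _ in range(n_features)] for _ in LABELS]
dense_bias = [0.0] * len(LABELS)

# Class probabilities for the pixel at the centre of the patch.
probabilities = forward(patch, kernels, dense_weights, dense_bias)
predicted = LABELS[probabilities.index(max(probabilities))]
```

In a pixel-labelling setting, a forward pass like this would be applied to a patch centred on each pixel, and the most probable label would be assigned to that pixel.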

Průša and Fujiyoshi (2017) propose a top-down method for document layout analysis using two-dimensional context-free grammars. They used PDF documents with words and images as terminal entities, obtaining words by merging adjoining characters with the same font and size. They manually create a two-dimensional grammar to describe a document image with two types of rules: horizontal and vertical. The grammar also requires spatial constraints. They detail a top-down parsing algorithm that can build a derivation tree over the set of terminal elements. Průša and Fujiyoshi argue that their 2D grammar is sufficiently expressive to encapsulate the structure of PDF documents. However, they do not detail the domain of the documents they use in their experiments. Therefore, it is not clear how well their approach would generalise to other document types such as newspapers.

Alhéritière et al. (2017) propose a method for page segmentation that, instead of relying on pixel-based segmentation, analyses the line segments from which each connected component is composed. They classify connected components as either text, separators or images, based on the lengths and orientations of their composite line segments. Alhéritière et al. evaluate their approach using the ICDAR 2009 page segmentation competition dataset, showing their approach to be superior to a previous state-of-the-art method (Felhi et al., 2014). However, a limitation of Alhéritière et al.'s approach is that the thresholds used to distinguish segment types have to be manually tuned for a given dataset.
