
Multimodal Documents and their Components

2.4 Page as signal

In this section we move a step further away from human interpretation of the pages under analysis. Although it may appear strange to talk of the page as a ‘signal’, this view is actually precisely analogous to the treatment in linguistics of speech as an acoustic signal that can be analysed instrumentally.

⁸ It does, however, suggest a particularly important role for thematic organisation in the sense of Martin (1992), thereby supporting Djonov’s (2007) recent application of thematic organisation to websites.

Just as it is possible to describe the acoustic features of a speech signal in terms of its component frequencies, energy, etc., we can do the same with the ‘visual signal’ that a page produces. We can then apply signal processing techniques to this signal just as we can to sound.

This approach is explored most actively in the research and development community concerned with document recognition. This very active area is, in many respects, a logical outgrowth of earlier efforts to automate the acquisition of information from printed documents available in paper form by optical character recognition (OCR). This technology, now available with most scanners, is used to process the visual input signal so as to derive from the patterns of dark and light on the page the actual characters used in a document. An obvious motivation for this development was the fact that it is much more convenient (and much less wasteful of computer memory) to represent the characters directly. Once the characters are stored as internal computer codes, we can edit the scanned text with text editors, run spell checking and correction programs over it, re-format the text and so on. None of this is possible with a simple representation of the ‘picture’ seen.

The relevance of this for our current concerns is the subsequent interest of the document recognition community in going considerably further than simply recognising the characters on a page. Currently the goal is to produce full document recognition, including layout analysis, in order to show which bits of a document belong together and which not. That is, rather than simply returning a string of characters that might involve the text body of the page as well as headings and figure captions simply placed (if we were lucky) in the sequence they were encountered on the page, document recognition now attempts to recognise the structure of a page also. This means that any text recognised is appropriately allocated to elements such as figures and captions, headlines with body text, multiple columns of an article, and so on.

This direction of research is given considerable extra impetus by the obvious commercial relevance of providing automatic recognition procedures of this kind. Decomposing a page appropriately supports several tasks currently under investigation including:

• information extraction, whereby only parts of documents relevant for some query are retrieved by a search engine,

• automatic document reformatting, or re-purposing, where a document’s layout is altered to fit different sizes and types of displays (consider, for example, the problems of sending a large scanned page to someone who has to view the result on a mobile phone with a small display),

• automatic document classification, where depending on the kind of document that is found, that document can be routed to different groups of potential readers; this latter capability relies on the close relationship between layout and genre.

The primary goal of this research area is to provide fully automatic capture of scanned documents where the output is a structured representation of the distinct elements on the page. This is clearly very similar to our aims in this chapter and so it is worthwhile considering the progress and techniques that have been developed in this field.

Visual processing filters

The precise principles by which signal processing of this kind is performed will not be our concern; detailed overviews of the range of approaches taken can be found in, for example, Dori, Doermann, Shin, Haralick, Phillips, Buchman and Ross (1997) and Okun, Doermann and Pietikäinen (1999). We can nevertheless illustrate some of the mechanisms involved very simply. We begin with the approach set out in Reichenberger, Rondhuis, Kleinz and Bateman (1995), where we discussed several alternative renditions of a page in order to consider relations between intent and form.

In this study we also needed to find more objective ways of determining potential carriers of functional relationships so as to avoid the problems of conflating function and visual clues discussed above.

Figure 2.9 Successive reduction of resolution in a page image (from: Reichenberger et al. 1995)

One of the primary methods that we suggested for finding such elements was to progressively reduce the resolution of a representation of the page as an image. This procedure removes detail from the page since, for example, instead of representing a page with a matrix of 1000×1000 dots, we reduce the resolution to 100×100 dots. Each ‘dot’ of the reduced image must then do the job of 100 ‘dots’ (a 10×10 block) of the unreduced image, which can be achieved in several ways, for example by taking the average of the ‘brightness’ of the 100 dots in the original. The idea behind such a procedure is that it makes visible larger visual groupings of elements on a page. We have seen above in our introduction to visual perception that grouping elements in close proximity has important effects on processing and is also used in computational models of perception (cf. Sarkar and Boyer 1993). Making this grouping ‘visible’ was then seen as a useful step towards uncovering the visual decomposition of pages into parts. The result of a succession of such reductions in resolution is shown in Figure 2.9. Here we can readily see that what is revealed visually is the larger layout ‘blocks’ of the page.
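The averaging variant of this reduction can be sketched in a few lines of Python. This is only an illustration and not the procedure actually used in Reichenberger et al. (1995): the function name `reduce_resolution`, the representation of a page as a list of rows of brightness values, and the toy 4×4 page are all assumptions made for the example.

```python
def reduce_resolution(image, factor):
    """Average non-overlapping factor x factor blocks of a grayscale image.

    `image` is a list of rows of brightness values (an assumed format);
    its height and width are assumed to be exact multiples of `factor`.
    """
    height, width = len(image), len(image[0])
    reduced = []
    for top in range(0, height, factor):
        row = []
        for left in range(0, width, factor):
            # Collect every original dot covered by one dot of the reduced image.
            block = [image[y][x]
                     for y in range(top, top + factor)
                     for x in range(left, left + factor)]
            row.append(sum(block) / len(block))
        reduced.append(row)
    return reduced


# A toy 4x4 "page": 0 = black ink, 255 = white paper (hypothetical data).
page = [
    [0,   0,   255, 255],
    [0,   0,   255, 255],
    [255, 255, 255, 255],
    [255, 255, 0,   0],
]
small = reduce_resolution(page, 2)  # each new dot stands in for 4 old ones
```

Called with a factor of 10 on a 1000×1000 image, the same function would give the 100×100 reduction described above, each output value standing in for 100 original dots.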

Figure 2.10 Using discriminability at different resolutions to reveal layout elements (from: Reichenberger et al. 1995)

We then suggested further that the granularity at which certain elements became indistinguishable suggested particular candidates for consideration as elements of the page. This is shown in the graph of Figure 2.10. Here the vertical axis shows the resolution (100% is full resolution) and the entries in the graph show at which scale particular elements can be distinguished from one another. Thus, at the lowest resolution, 1.56%, we can only barely distinguish the large blocks labelled A, B and C in the rightmost rendition of Figure 2.9. As the resolution increases, more details become visible, first various pictures, then blocks of texts (labelled according to their subject matter in the graph), then lines and finally individual words.

This simple mechanism therefore provides a first indication of how we can find elements of a page that are in some sense really ‘there’ without appeal to functional interpretations.

To make the process of document recognition more reliable, researchers have developed a host of techniques that go beyond simple reductions in resolution. These have progressively included principles of visual perception so as to lead to decompositions that correspond ever more closely with those made by readers. We can easily show the principles involved here since most reasonably sophisticated image processing software nowadays already provides processing options for images that can be considered as ways of getting at perceptually relevant ‘views’ of a page. These processing options are termed ‘filters’ in that they operate like a filter used with a camera, changing the colour balance, the degree of sharpness, etc. of the image received.

Filters also have a precise mathematical definition in that they perform a specified transformation of the information provided as input to produce a corresponding output. A simple filter might, for example, remove all the colour information or reduce the resolution of an image in the manner seen above.
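Seen in this way, a filter is simply a function from images to images, and filters can be chained. The following sketch illustrates the idea under stated assumptions: images are lists of rows of `(r, g, b)` tuples, brightness is taken as the plain mean of the three channels, and the names `remove_colour`, `binarise` and `compose` are invented for the example rather than taken from any particular image processing package.

```python
def remove_colour(image):
    """Map each (r, g, b) pixel to one brightness value (plain mean, an assumption)."""
    return [[sum(pixel) / 3 for pixel in row] for row in image]


def binarise(image, threshold=128):
    """Mark dark pixels as 1 and light pixels as 0 in a brightness image."""
    return [[1 if value < threshold else 0 for value in row] for row in image]


def compose(*filters):
    """Chain several filters, applied left to right, into a single filter."""
    def combined(image):
        for apply_filter in filters:
            image = apply_filter(image)
        return image
    return combined


# A hypothetical one-row colour image: a red pixel next to a white pixel.
colour_row = [[(255, 0, 0), (255, 255, 255)]]
ink_mask = compose(remove_colour, binarise)(colour_row)
```

Each function specifies a transformation of its input and nothing else, which is what allows such steps to be combined freely.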

More sophisticated filters can locate places where there are sudden changes in image quality, such as brightness, look for places of homogeneous visual ‘texture’, and much more. Figure 2.11, for example, shows what might occur with a combination of resolution reduction and ‘contour’ finding of this kind when applied to a newspaper page that we will use as an example at several places below. The figure shows how these different filters make particular features of the page more or less visible. That these differences are more visible to us in the diagram also means that they are more accessible to automatic recognition. If we have a generally gray area and within that a much darker, almost black area, then this can be automatically segmented as a distinct element. More generally, if we have two areas that are each relatively homogeneous along some dimensions but different from one another along others, then we have good grounds for distinguishing them as distinct page elements.

We see this in the example shown in Figure 2.11 in several respects: the upper filter view on the right (a) provides evidence for a segmentation of a central element as well as several bordering headline elements; the lower right filter view (b) also shows a clear distinction between the homogeneous lines of text and the line drawing (a cartoon) in the middle of the text; and filter (c) in the middle at the bottom picks out element boundaries, including in this case the columns making up the main body of the article.
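The case of a dark area inside a generally gray one can be made concrete with a small sketch. It is not the filter set used to produce Figure 2.11; the function names, the thresholds and the toy image are assumptions chosen purely for illustration.

```python
def brightness_edges(row, min_jump):
    """Indices i where brightness changes by at least min_jump between i-1 and i."""
    return [i for i in range(1, len(row))
            if abs(row[i] - row[i - 1]) >= min_jump]


def dark_bounding_box(image, threshold):
    """Inclusive (left, top, right, bottom) box around pixels darker than threshold.

    Returns None when no pixel is dark enough.
    """
    coords = [(x, y)
              for y, row in enumerate(image)
              for x, value in enumerate(row)
              if value < threshold]
    if not coords:
        return None
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return (min(xs), min(ys), max(xs), max(ys))


# Hypothetical image: a gray area (128) containing an almost black patch (10).
page = [
    [128, 128, 128, 128, 128],
    [128, 10,  10,  128, 128],
    [128, 10,  10,  128, 128],
    [128, 128, 128, 128, 128],
]
edges_in_second_row = brightness_edges(page[1], min_jump=50)
patch = dark_bounding_box(page, threshold=64)
```

The first function is a crude one-dimensional stand-in for ‘contour’ finding; the second segments the dark patch as a candidate page element by reporting its bounding box.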

Figure 2.11 Alternative visual filters applied to an extract of a newspaper page

There are also useful issues to be discussed concerning just what kind of structure is (a) made visible by such a process and (b) most useful for document recognition. Clearly one aim is to bring these together. In document recognition, the goal is to find the so-called logical structure of the documents processed. Logical structure corresponds approximately to the rhetorical clusters of Schriver and similar organisations of page elements discussed above. The main idea is that logical organisation captures the dependencies and connections between the content expressed on a page without recording details of visual layout, formatting and so on. Within document analysis, documents are accordingly considered from two perspectives, which we will also build on substantially below:

• first, documents are seen in terms of their content, captured in the logical or semantic organisation, and

• second, documents are seen in terms of their geometrical properties, which include positioning information of blocks and details concerning typeface and font size for text and line widths or other visual features for graphics.

Both perspectives admit of structuring: logical structure captures conceptual organisation and, according to some authors (e.g., Doermann, Rosenfeld and Rivlin 1997), intended reading order, while geometric structure includes layout.

Once we can combine visual processing filters that give us candidate elements on the page, we can proceed further to consider their inter-relationships, working from geometric structure towards hypotheses concerning their logical structure. When one element is geometrically completely contained within another, as in the central cartoon shown in the newspaper page for instance, then we have a clear indication of a structural relationship in the content also. The situation is more complex when we have a collection of elements within some area that may need to be inter-related, such as the columns within an article as shown in the figure.
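The simple case of geometric containment is easy to state over bounding boxes. The sketch below assumes that each candidate element has already been reduced to a rectangle with the Y coordinate growing downwards; the `Box` type, the function names and the coordinates are hypothetical and do not describe the actual newspaper page.

```python
from typing import NamedTuple


class Box(NamedTuple):
    left: float
    top: float
    right: float
    bottom: float


def contains(outer, inner):
    """True when `inner` lies completely within `outer` (shared borders allowed)."""
    return (outer.left <= inner.left and outer.top <= inner.top
            and inner.right <= outer.right and inner.bottom <= outer.bottom)


def containment_pairs(boxes):
    """List (container, contained) name pairs among the given boxes.

    Note that two identical boxes would each be reported as containing the other.
    """
    return [(a, b)
            for a, box_a in boxes.items()
            for b, box_b in boxes.items()
            if a != b and contains(box_a, box_b)]


# Hypothetical elements: a headline above an article that encloses a cartoon.
elements = {
    "headline": Box(0, 0, 100, 10),
    "article": Box(0, 15, 100, 95),
    "cartoon": Box(40, 40, 60, 60),
}
pairs = containment_pairs(elements)
```

Each pair found in this way is only a hypothesis about logical structure; sibling elements such as columns need the kind of grouping procedure discussed next.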

XY-trees

A common technique for automatically capturing structural relationships within collections of layout elements is provided by XY-trees. XY-tree construction works by successively searching for clean divisions of the page along the horizontal and vertical axes, cycling down through ever smaller portions of the page as necessary until the page is exhausted and no further decomposition is possible. On each cycle, the process projects each visually distinguishable block on the page onto either the horizontal axis (X-axis) or the vertical axis (Y-axis), recording the extent that it occupies there. If, after doing this, there is a ‘gap’ left anywhere on the axis being considered, i.e., an interval onto which no block has been mapped, then it is possible to divide the portion of the page being considered at that point. Dividing the page is represented by growing ‘nodes’ in the XY-tree, each node corresponding to an area on the page that can be segmented from the others. This process then repeats, considering each of the divided sections of the block as its own unit for further decomposition.
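The projection and gap-finding step of a single cycle can be sketched as follows. The sketch assumes that blocks are given as `(left, top, right, bottom)` rectangles and that blocks which merely touch leave no gap; the function names and the coordinates are illustrative assumptions only.

```python
def occupied_intervals(extents):
    """Merge (start, end) extents into maximal occupied intervals on one axis."""
    merged = []
    for start, end in sorted(extents):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous interval: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(interval) for interval in merged]


def gaps(extents):
    """Unoccupied intervals lying between the occupied ones."""
    occupied = occupied_intervals(extents)
    return [(previous_end, next_start)
            for (_, previous_end), (next_start, _) in zip(occupied, occupied[1:])]


# Hypothetical page: a drawing above two text columns.
blocks = {
    "drawing": (0, 0, 100, 40),
    "column 1": (0, 50, 45, 100),
    "column 2": (55, 50, 100, 100),
}
x_gaps = gaps([(left, right) for left, _, right, _ in blocks.values()])
y_gaps = gaps([(top, bottom) for _, top, _, bottom in blocks.values()])
```

For this layout the X-axis projection is fully occupied, so no cut is possible there, whereas the Y-axis projection leaves one gap between the drawing and the columns.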

A concrete illustration of this process in action is shown in Figure 2.12, where we consider the very simple layout for the Canada book that we used above in Figure 2.5. The initial step is always to consider the entire page as the first area to be decomposed and to look at the X-axis (i.e., horizontally); this initialisation step is recorded in the XY-tree by growing the tree’s root node (labelled 1 in the figure). The X-axis runs across the page and so we consider the horizontal extent of all the elements found on the page, mapping these onto the horizontal axis (shown at the bottom of Figure 2.12(a)).

Since the drawing at the top of the page and the columns in the lower half of the page together occupy all of the horizontal axis, we can find no point of separation. This means that we cannot find any segment of the X-axis that is ‘unoccupied’ for this page area and so cannot split the page along this axis at this time. To record this fact, the XY-tree ‘grows’ a single node (labelled 2 in the figure) descending from the root of the tree. The number of child nodes descending from any node in an XY-tree represents the number of parts that we can decompose a page area into at that point: here there is only one node, indicating no further decomposition.

We then continue down to this node in the tree and consider whether we can make a clean split of units on the Y-axis, that is, running down the page. In this case, we map the vertical extents of each element on the page to the corresponding axis in order to see if there are any segments of the axis left uncovered (Figure 2.12(b)). This time there is a gap to be found: the drawing in the upper half of the page and the text columns in the lower half, whose own vertical extents overlap one another, do not overlap and so a gap is left on the Y-axis between them. This means that it would be possible to extend an uninterrupted line of whitespace horizontally between these two portions of the page at the point indicated. To represent this division into two sub-areas of the page, we grow the XY-tree further; each division of the XY-tree means that there is some continuous whitespace between the areas represented by the nodes with respect to the part of the page represented by the parent node. In this case, we have two child nodes: one corresponding to the vertical extent of the drawing (labelled 3 in the figure) and the other corresponding to the vertical extent of the text columns (labelled 4).

(a) X-axis cut: no unoccupied space and so no tree division

(b) Y-axis cut: division into top and bottom halves

Figure 2.12 Consecutive construction of an XY-tree for the page layout shown in Figure 2.5 above

Next we consider each of these two nodes in turn and their possible partitions, this time cycling back to consider the X-axis. The first node, corresponding to the drawing (labelled 3), is complete: it has no further subelements and so is a ‘terminal’ node, or leaf, of the tree. We can derive this information automatically whenever (a) a node in the XY-tree has only a single child node and (b) that child node also has only a single child node. When there is no branching twice in a row, this means that we have considered possible decompositions for a particular area of the page from both the X-axis and the Y-axis; no further information can be obtained by repeating the process and so we can consider the area as indecomposable.

Turning to the node corresponding to the two text columns (labelled 4), we can see that this node can be split. When we map the horizontal extent of the two columns onto the X-axis, we find an unoccupied portion of the X-axis between them. This means that the region can be split neatly by running an uninterrupted line of whitespace up between the columns. Note that this was not possible on the first cycle through (Figure 2.12(a)) because at that point we were not considering the area of the page corresponding to node 4 but rather that corresponding to node 1, i.e., the entire page, including the drawing. So we now obtain two further nodes, one for each text column (labelled 5 and 6 in Figure 2.12(c)). We could take the process further and, when considering the Y-axis again, divide each of the columns up into lines, but for present purposes the illustration is probably sufficient.⁹

This manner of representing the layout of pages works well for regular compositions and has been used in a broad variety of automatic systems. Combined with other techniques, it shows how an abstract representation of page layout can be derived relatively easily. Adding information concerning how large the whitespace gaps between blocks must be before they are considered, i.e., incorporating Gestalt considerations of proximity, is one way of directing the tree construction process to respect more accurately the groupings of elements on the page that would be perceived by a human reader.
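The whole construction, including a minimum gap size, can be sketched recursively. This is a simplified illustration rather than any published XY-tree implementation: blocks are assumed to be `(left, top, right, bottom)` rectangles with hypothetical coordinates, an area holding a single block is treated directly as a leaf, and the names `split_groups`, `xy_tree` and `min_gap` are inventions of the example.

```python
def split_groups(blocks, axis, min_gap):
    """Partition blocks into groups separated by gaps wider than min_gap on one axis."""
    lo, hi = (0, 2) if axis == "x" else (1, 3)
    groups, reach = [], None
    for name, box in sorted(blocks.items(), key=lambda item: item[1][lo]):
        if reach is None or box[lo] - reach > min_gap:
            groups.append({})          # a wide enough gap: start a new group
            reach = box[hi]
        else:
            reach = max(reach, box[hi])
        groups[-1][name] = box
    return groups


def xy_tree(blocks, axis="x", min_gap=0, stalled=False):
    """Build a nested dict describing the alternating X and Y cuts."""
    if len(blocks) == 1:
        return {"leaf": sorted(blocks)}
    other = "y" if axis == "x" else "x"
    groups = split_groups(blocks, axis, min_gap)
    if len(groups) == 1:
        if stalled:
            # No cut on either axis in a row: the area is indecomposable.
            return {"leaf": sorted(blocks)}
        return {"cut": axis,
                "children": [xy_tree(blocks, other, min_gap, stalled=True)]}
    return {"cut": axis,
            "children": [xy_tree(group, other, min_gap) for group in groups]}


# Hypothetical coordinates for a drawing above two text columns.
page = {
    "drawing": (0, 0, 100, 40),
    "column 1": (0, 50, 45, 100),
    "column 2": (55, 50, 100, 100),
}
tree = xy_tree(page)
coarse = xy_tree(page, min_gap=15)  # gaps of 10 units are now ignored
```

With the default settings the result mirrors the walk-through above: a failed X cut, a Y cut separating the drawing from the columns, and an X cut separating the columns. Raising `min_gap` above the width of the whitespace makes the same page indecomposable, which is the kind of control that proximity considerations would exercise.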

The need for this can be seen clearly in the two alternative layouts shown in Figure 2.13. Because the process of XY-tree construction always begins with the X-axis, layouts (a) and (b) both receive the same XY-tree description as shown to the right of the figure. In both cases the simple procedure finds a ‘gap’ on the X-axis and so can grow the XY-tree on the first cycle, even though it is arguable that for layout (b) it is the much larger gap on the Y-axis that should be given priority perceptually. Incorporating Gestalt considerations here would allow distinct trees for (a) and (b) to be constructed. The tree shown on the right of the figure would then only hold for layout (a); the tree for (b) would need instead to group node (1) together with node (2) and node (3) together with node (4).
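One possible way of letting gap size decide the first cut is sketched below. The actual geometry of Figure 2.13 is not reproduced here: the two layouts are hypothetical two-by-two arrangements of four blocks, and the rule of preferring the axis with the widest gap is just one assumed reading of ‘Gestalt considerations’, with invented function names.

```python
def widest_gap(blocks, axis):
    """Width of the widest unoccupied interval between blocks on one axis (0 if none)."""
    lo, hi = (0, 2) if axis == "x" else (1, 3)
    widest, reach = 0, None
    for box in sorted(blocks.values(), key=lambda b: b[lo]):
        if reach is not None and box[lo] > reach:
            widest = max(widest, box[lo] - reach)
        reach = box[hi] if reach is None else max(reach, box[hi])
    return widest


def preferred_first_cut(blocks):
    """Axis with the widest gap; ties fall back to 'x', the fixed starting axis."""
    return "y" if widest_gap(blocks, "y") > widest_gap(blocks, "x") else "x"


def groups_along(blocks, axis):
    """Names of the blocks in each group obtained by cutting at every gap on the axis."""
    lo, hi = (0, 2) if axis == "x" else (1, 3)
    groups, reach = [], None
    for name, box in sorted(blocks.items(), key=lambda item: item[1][lo]):
        if reach is None or box[lo] > reach:
            groups.append([])
            reach = box[hi]
        else:
            reach = max(reach, box[hi])
        groups[-1].append(name)
    return [sorted(group) for group in groups]


# Hypothetical layouts, blocks given as (left, top, right, bottom).
layout_a = {"1": (0, 0, 40, 45), "2": (60, 0, 100, 45),
            "3": (0, 50, 40, 100), "4": (60, 50, 100, 100)}
layout_b = {"1": (0, 0, 45, 30), "2": (50, 0, 100, 30),
            "3": (0, 70, 45, 100), "4": (50, 70, 100, 100)}

first_a = groups_along(layout_a, preferred_first_cut(layout_a))
first_b = groups_along(layout_b, preferred_first_cut(layout_b))
```

Under these assumptions the first layout is still cut on the X-axis, while the second is cut on the Y-axis first, grouping block 1 with block 2 and block 3 with block 4.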

Further extensions concern the kind of information on the page that is accepted for motivating divisions: the ‘Modified XY-trees’ (MXY) of Cesarini, Marinai, Soda and Gori (1999), for example, trigger divisions off connecting ‘lines’ and similar explicit framing visuals, thereby incorporating another range of perceptually significant criteria within the general approach. Extensions implementing such additions are now available for the XY-tree technique and are used both for comparing different layouts and for automatically learning and classifying layouts typical for particular kinds of documents (e.g., Cesarini, Lastri, Marinai and Soda 2001, Mao, Nie and Thoma 2005), a task very relevant to the discussion of multimodal genres that we take up in Chapter 5.

⁹ If we were to continue, however, note that the caption of the figure would be the first

In document John Bateman. Multimodality and Genre (Page 85-94)