TEXT EXTRACTION FROM DEGRADED DOCUMENT IMAGES


Rachid Hedjam, Reza Farrahi Moghaddam and Mohamed Cheriet

Synchromedia Laboratory for Multimedia Communication in Telepresence, École de Technologie Supérieure, Montréal (QC), H3C 1K3 Canada

{rachid.hedjam,reza.farrahi}@synchromedia.ca, [email protected]

ABSTRACT

In this work, a robust segmentation method for text extraction from historical document images is presented. The method is based on Markovian-Bayesian clustering on local graphs at both the pixel and regional scales. It consists of three steps. In the first step, an over-segmented map of the input image is created. The resulting map provides rich and accurate semi-mosaic fragments. In the second step, similar and adjoining sub-regions of this map are merged together to form accurate text shapes. The output of the second step is processed in the final step, in which the segmentation is obtained by clustering with a fixed number of classes. The method makes significant use of local and spatial correlation and coherence, both within the image and between stroke parts, and is therefore very robust with respect to degradation. The resulting segmented text is smooth, and weak connections and loops are preserved thanks to the robust nature of the method. The output can be used in subsequent skeletonization processes, which require preservation of the text topology to achieve high performance. The method is tested on real degraded document images with promising results.

Index Terms— Document image, Image segmentation, Image binarization, MRF, Graph-partitioning.

1. INTRODUCTION

Digital archiving of ancient and historical documents is a new but expanding trend in heritage preservation and studies. The archived images need to be enhanced and restored regardless of the quality of the original documents. The quality of historical documents is usually very low, and their images suffer from heavy degradation. The degradation in historical document images is normally physical and of various types, such as fading of ink, presence of interfering patterns (bleed-through, etc.), deterioration of the cellulose structure, etc. Therefore, before any processing and feature extraction, proper preprocessing is needed. For skeleton-based methods, for example, the preprocessing stage can be limited to segmentation of the text. Many methods, such as global thresholding [1], local thresholding [2], statistical approaches [3], and multi-level classifiers [4], have been used for enhancement and segmentation of historical documents. For example, in [4], multi-level classifiers that use some prior information have been developed to help restore the true text. In global thresholding methods, a single threshold value is used for the entire image; this can give good results if the text and the background are well separated in terms of pixel intensities. However, in the presence of gray-level degradation over the image (e.g., shadows, non-uniform illumination, or defects in some areas of the document), local thresholding methods [5, 6] are needed to adapt to the degradation and text changes.
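As a minimal sketch of the global thresholding idea described above, the following Python function selects a single threshold for the whole image with Otsu's criterion [14], that is, by maximizing the between-class variance of the gray-level histogram. The function name `otsu_threshold` and the toy pixel values are illustrative assumptions and are not taken from the original work.

```python
def otsu_threshold(pixels, levels=256):
    """Return the gray level t that maximizes the between-class variance.

    Pixels with value <= t form the dark class, the others the bright class.
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels in the dark class
    sum0 = 0  # sum of gray levels in the dark class
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0:
            continue  # dark class still empty
        if w1 == 0:
            break     # bright class empty from here on
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_t, best_var = t, var_between
    return best_t


# Toy bimodal data: dark "ink" values and bright "paper" values.
pixels = [10] * 50 + [12] * 50 + [200] * 30 + [205] * 30
t = otsu_threshold(pixels)
text_mask = [p <= t for p in pixels]  # dark pixels are taken as text
```

Because the same value of `t` is applied everywhere, a faded stroke whose gray level lies above `t` is lost, which is the limitation that motivates the local methods discussed next.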

An example of local and adaptive thresholding is presented in [7], which is based on edge detection and uses information around boundaries. Recently, in [8], a new adaptive approach has been proposed which combines several state-of-the-art binarization methods while also considering the edge information of the gray-scale source image.

All local thresholding methods are ineffective when the extent of the degradation is smaller than their characteristic length. The characteristic length is usually a fraction of the distance between the baselines of the text. Therefore, degradation and intensity changes that are very local and confined to a small part of a stroke cannot be captured by local and adaptive methods. These parts usually appear as holes or discontinuities in the final segmentation. Therefore, to achieve a segmentation suitable for skeletonization, a robust method that is able to preserve very weak and local connections and strokes is needed. The goal of this work is to introduce such a method, using robust segmentation and merging of regions on the document image.

The proposed method (see Algorithm 1) for text segmentation of historical and degraded document images is discussed in detail in the following sections. The basic idea is to obtain a robust and continuous extraction of shapes and strokes from the image. To achieve robust performance, the image is first divided into as many coherent regions as possible. A Markovian method, which is resilient to local noise and benefits from the spatial correlation of the image data, is used. Many small degradations, for example small cracks or holes, are captured by this method. However, large variations in the stroke intensity and discontinuities appear as separate small regions. In the second step, the small regions are merged based on their similarity and their spatial connection. After this step, the weak parts of the strokes, which have variable intensities, are attached to strong stroke shapes. Although the intensity of these regions is very low, their similarity to the stroke shapes, and especially the very local nature of the merging step, help in connecting these regions to other stroke shapes. Finally, the number of classes of the regions on the document image is reduced to two (or more, depending on the type of degradation the document suffers from). This step, called binarization, extracts the text of the document image as one of the final classes. Because the weak regions are merged with the strong shapes in the second step, the complete stroke shapes pass through the third step and appear in the final segmentation. Therefore, the resulting binarization and segmentation is continuous and smooth even when the original image suffers from high degradation.

The paper is organized as follows. In section 2, an MRF over-segmentation into regions is presented. The regions merging and binarization methods are respectively discussed in sections 3 and 4. The application of the proposed method and numerical results are provided in section 5. Finally, the conclusions are discussed in section 6.

2. MRF OVER-SEGMENTATION

Since our method is based on dividing the input image into as many coherent sub-regions as possible and then merging the proper sub-regions together to obtain the correct text shapes, we need an over-segmented map of the input image. In contrast to normal applications of segmentation methods, which try to reduce the number of classes, in our case we are looking for an over-classification of the input. At the same time, however, we want local correlation inside each sub-region. Traditional segmentation methods, such as Markovian-Bayesian segmentation [9] or watershed segmentation [10], can be used for this goal. Because of the uniform nature of the stroke intensities in document images, Markovian-Bayesian segmentation provides a more accurate regional segmentation. To obtain the over-segmented image from the input document image, Markovian-Bayesian segmentation is used with a large number of classes. In this way, a class label is assigned to each pixel of the image based on its properties and its relationship to its neighbors. The properties employed at this stage are very local. The input image is modeled by an MRF prior model defined on a graph whose nodes correspond to pixels, each connected to its 4 nearest neighbors. For the likelihood model, we take a Gaussian law to describe the intensity distribution within each class, and the parameters of this distribution are estimated by an iterative method called iterative conditional estimation (ICE) [11]. The result therefore consists of many small, coherent and homogeneous regions. The large number of possible classes (for example, 8 different class labels are used to cluster the pixels) ensures the over-segmentation behavior.

MRF model

Let us now consider a pair of random fields $Z = (X, Y)$, with $Y = \{Y_s, s \in S\}$ the observation field located on a lattice $S$ of $N$ sites $s$ (pixels), and $X = \{X_s, s \in S\}$ the label field (class labels). Each $Y_s$ takes its values in $\lambda_{obs} = \{0, \ldots, 255\}$ and each $X_s$ in the set of classes $\{c_0, \ldots, c_k\}$.

The segmentation process consists in estimating the label field X from the observation Y. It can be viewed as a statistical labeling problem within a global Bayesian formulation in which the following posterior distribution has to be maximized [9]:

$$P_{X|Y}(x|y) \propto \exp\{-U(x, y)\} \qquad (1)$$

This is the maximum a posteriori (MAP) estimation. In the standard Ising-type case, the corresponding posterior energy to be minimized is:

$$U(x, y) = \sum_{s \in S} \Psi_s(x_s, y_s) + \sum_{s,t} \beta_{st}\,[1 - \delta(x_s, x_t)] \qquad (2)$$

where $\delta(x, y)$ is the Kronecker delta function, equal to 1 if $x = y$ and 0 otherwise, and $\beta_{st} = \beta_1, \beta_2, \beta_3$, or $\beta_4$ depending on the relative location of the pair of neighboring sites (in our case we take only the north, east, west and south neighbors and assume $\beta_i = 1$ for $i = 1, \ldots, 4$). In this energy setting, the first term expresses the adequacy between observations and labels (likelihood model), whereas the second one encodes the a priori information. For the likelihood model, we take a Gaussian law to describe the intensity distribution within each class, as follows:

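$$P_{Y|X}(y|x) = \prod_{s \in S} \left[ \frac{1}{\sqrt{2\pi}\,\sigma_{x_s}} \exp\left( -\frac{(y_s - \mu_{x_s})^2}{2\sigma_{x_s}^2} \right) \right] \qquad (3)$$

We note that: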

$$-\ln P_{Y|X}(y|x) = \sum_{s \in S} \Psi_s(x_s, y_s) \qquad (4)$$

and the parameters $\mu$ and $\sigma$ (mean and standard deviation, respectively) of this distribution are estimated by an iterative method called iterative conditional estimation (ICE) [11].
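The energy of equation (2) can be illustrated with a short Python sketch. It assumes that $\Psi_s$ is the negative logarithm of the Gaussian factor of equation (3), so that minimizing $U$ maximizes the posterior of equation (1), and it uses $\beta = 1$ on the 4-neighborhood as stated above. The optimizer shown here, one sweep of iterated conditional modes (ICM), is a stand-in chosen for brevity; the text does not specify which optimizer is used. The class parameters are fixed by hand rather than estimated with ICE, and all function names and the toy image are hypothetical.

```python
import math


def data_term(y, mu, sigma):
    """Psi_s: negative log of the Gaussian likelihood of intensity y."""
    return (y - mu) ** 2 / (2.0 * sigma ** 2) + math.log(math.sqrt(2.0 * math.pi) * sigma)


def neighbors(r, c, rows, cols):
    """Yield the north, south, west and east sites that lie inside the lattice."""
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        rr, cc = r + dr, c + dc
        if 0 <= rr < rows and 0 <= cc < cols:
            yield rr, cc


def posterior_energy(labels, image, mus, sigmas, beta=1.0):
    """U(x, y) of equation (2); each neighboring pair is counted once."""
    rows, cols = len(image), len(image[0])
    energy = 0.0
    for r in range(rows):
        for c in range(cols):
            k = labels[r][c]
            energy += data_term(image[r][c], mus[k], sigmas[k])
            # Only the south and east pairs are visited to avoid double counting.
            if r + 1 < rows and labels[r + 1][c] != k:
                energy += beta
            if c + 1 < cols and labels[r][c + 1] != k:
                energy += beta
    return energy


def icm_sweep(labels, image, mus, sigmas, beta=1.0):
    """One in-place ICM pass: each site takes its locally cheapest label."""
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            def local_cost(k):
                prior = sum(beta for rr, cc in neighbors(r, c, rows, cols)
                            if labels[rr][cc] != k)
                return data_term(image[r][c], mus[k], sigmas[k]) + prior
            labels[r][c] = min(range(len(mus)), key=local_cost)
    return labels


# Toy 3x4 image with a dark left part and a bright right part (assumed values).
image = [[20, 25, 200, 210],
         [22, 180, 205, 198],
         [18, 24, 202, 207]]
mus, sigmas = [20.0, 200.0], [10.0, 10.0]
labels = [[0] * 4 for _ in range(3)]
before = posterior_energy(labels, image, mus, sigmas)
icm_sweep(labels, image, mus, sigmas)
after = posterior_energy(labels, image, mus, sigmas)
```

Each ICM update can only keep or lower the energy, because the local cost of a site contains exactly the terms of $U$ that depend on its label. In an over-segmentation setting, `mus` and `sigmas` would hold one entry per class (for example 8) instead of the two used in this toy case.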

Also, in order to have only connected regions, the regions are re-labeled at the end of clustering so that each simply connected sub-region has a unique region label. The final goal of this step is to subdivide the image as much as possible into sub-regions, which will form the shapes and the background regions in the next step.
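The re-labeling described above can be sketched as a flood fill over the class-label map. The sketch below assumes 4-connectivity, which matches the 4-neighbor MRF graph but is not stated explicitly for this step; the function name `relabel_connected` is hypothetical.

```python
from collections import deque


def relabel_connected(labels):
    """Give every 4-connected component of equal class label its own region id.

    Returns the region map and the number of regions found.
    """
    rows, cols = len(labels), len(labels[0])
    regions = [[-1] * cols for _ in range(rows)]
    next_id = 0
    for r0 in range(rows):
        for c0 in range(cols):
            if regions[r0][c0] != -1:
                continue  # pixel already belongs to a region
            # Breadth-first flood fill from an unvisited seed pixel.
            regions[r0][c0] = next_id
            queue = deque([(r0, c0)])
            while queue:
                r, c = queue.popleft()
                for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    rr, cc = r + dr, c + dc
                    if (0 <= rr < rows and 0 <= cc < cols
                            and regions[rr][cc] == -1
                            and labels[rr][cc] == labels[r][c]):
                        regions[rr][cc] = next_id
                        queue.append((rr, cc))
            next_id += 1
    return regions, next_id


class_map = [[0, 0, 1],
             [1, 0, 1],
             [1, 1, 0]]
region_map, n_regions = relabel_connected(class_map)
```

In this toy map, the two separate groups of class 1 and the isolated class 0 pixel in the corner each receive their own region id, so two class labels yield four sub-regions.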


Figure 1 shows an example of over-segmentation. Figure 1(a) shows the input image, and the sub-regions are shown in Figure 1(b). It can easily be seen from the figure that the weak parts of the strokes, although not connected to the stroke shapes, form separate sub-regions which will be merged into the text shapes in the second step below.

Fig. 1. An example of the over-segmentation step. The original degraded image and its corresponding over-segmentation map of sub-regions are shown in (a) and (b), respectively. The output of the second step, which consists of text shapes obtained after merging of sub-regions, and the final binarization are shown in (c) and (d), respectively.

3. REGIONS MERGING

Once the map of sub-regions (over-segmentation map) is obtained, similar neighboring sub-regions are merged together to form complete text shapes. The criteria for merging are the degree of homogeneity between sub-regions and, more importantly, their being neighbors. Stroke parts, although they can have different intensities, show similar patterns of intensity distribution and homogeneity. In this work, the intensity distribution is characterized by the textural similarity of the sub-regions, and the Bhattacharyya distance, which measures the similarity of the discrete probability distributions of two neighboring regions [12], is used to compute the distance between the features of two different sub-regions. Assume that $\{h(n, x)\}_{n=0,\ldots,N_b-1}$ is the normalized histogram of region $x$ and $\{h(n, y)\}_{n=0,\ldots,N_b-1}$ the normalized histogram of region $y$, where $N_b$ is the number of bins used for computation of the histogram. The Bhattacharyya distance between these two histograms is defined as:

$$D_B[h(x), h(y)] = \begin{cases} \left[ 1 - \sum_{n=0}^{N_b-1} \sqrt{h(n, x)\,h(n, y)} \right]^{1/2} & \text{if } x \text{ and } y \text{ are neighbors,} \\ 0 & \text{otherwise.} \end{cases} \qquad (5)$$

Having all distances between sub-regions, each sub-region is merged with one of its neighboring regions. The neighbor selected for merging is the one with the lowest distance to the target sub-region, provided that this distance is lower than a threshold value k = 0.15. After the merging step, the map of the document image consists of many continuous shapes and regions that represent text strokes regardless of the presence of weak-intensity regions on the strokes. An example of the merging process is shown in Figure 1(c). As can be seen from the figure, the text shapes are well merged and all weak connections are preserved. The basic mechanism of the merging step is shown in Figure 2. In Figure 2(a), a sample over-segmented map is presented. On this figure, the weights between different sub-regions, which are computed based on equation (5), are shown with different thicknesses.

The text regions, which usually have more similarities, have lower distances and therefore higher similarity weights. The result of merging for Figure 2(a) is shown in Figure 2(b).

The final step, described in the next section, is devoted to reducing the number of labels to two (for simple binarization), or to a slightly higher value, in order to achieve the segmentation of the document text.
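The merging rule of this section can be sketched in Python as follows. The sketch computes the Bhattacharyya distance between two normalized histograms and, for a given sub-region, returns the neighbor with the lowest distance if that distance is below the threshold k = 0.15. The dictionary-based region adjacency, the function names and the example histograms are assumptions made for illustration; how the histograms are built from the texture features is not covered here.

```python
import math


def bhattacharyya_distance(hx, hy):
    """Square root of one minus the Bhattacharyya coefficient of two histograms."""
    coeff = sum(math.sqrt(a * b) for a, b in zip(hx, hy))
    return math.sqrt(max(0.0, 1.0 - coeff))  # clamp guards against rounding


def closest_mergeable_neighbor(region, histograms, adjacency, threshold=0.15):
    """Return the neighbor with the lowest distance, or None if none is below threshold."""
    best, best_dist = None, threshold
    for other in adjacency[region]:
        d = bhattacharyya_distance(histograms[region], histograms[other])
        if d < best_dist:
            best, best_dist = other, d
    return best


# Hypothetical normalized histograms (4 bins) of three mutually adjacent regions:
# regions 0 and 1 are stroke parts, region 2 is background.
histograms = {
    0: [0.50, 0.50, 0.00, 0.00],
    1: [0.48, 0.52, 0.00, 0.00],
    2: [0.00, 0.00, 0.50, 0.50],
}
adjacency = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
targets = {r: closest_mergeable_neighbor(r, histograms, adjacency) for r in adjacency}
```

Here the two stroke regions select each other, while the background region finds no neighbor below the threshold and stays unmerged. Only adjacent regions are ever compared, which reflects the local nature of the merging step.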

Fig. 2. The basic similarity between regions represented as a graph. (a) Weighted graph: the thin white edges describe a strong connection between text regions; the thin black edges describe the strong connection between background regions, and the thick white edges describe the weak connection between text regions and background regions. (b) Result of merging.

4. BINARIZATION

To extract the text from the background, another classification step is needed in order to reduce the number of classes to just two. One of these classes represents the text (usually shown in black on the final segmentation map), and the other represents the background and interfering patterns and is shown in white on the segmentation output. As discussed before, thanks to the merging process in step two, the weak parts of the strokes and the text regions with a high degree of intensity variation are connected together and to the strong portions of the strokes. Therefore, they can easily be differentiated from the complex background and other unwanted information. In other words, applying any simple clustering technique to the output of the second step can provide a good segmentation of the text. In this work, we use the K-means clustering technique [13] with two classes. The average gray level of each region obtained in step two is used as its feature. The result of the binarization step for the example of Figure 1 is shown in Figure 1(d). It can be seen that the true text is accurately segmented and that, despite the presence of many weak stroke parts, the extracted text is continuous and all connections and loops are preserved. In the following, we give an overview of the proposed algorithm.

Algorithm 1: The proposed algorithm.

Data: Degraded image u_in, over-segmentation map u_overseg, merged regions map u_merg.
Result: Binarized image u_out.

1. MRF over-segmentation of u_in: u_overseg ⇐ MRF(u_in)
2. Region merging based on texture similarity: u_merg ⇐ Merging(u_overseg)
3. Binarization with K-means clustering into 2 classes: u_out ⇐ Kmeans(u_merg, 2)
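Step 3 of Algorithm 1 can be sketched as a one-dimensional K-means with two clusters applied to the average gray level of each merged region. The initialization at the minimum and maximum values, the function names and the example region means are assumptions made for this illustration; the darker cluster is taken as text, in line with the convention that text is shown in black.

```python
def two_means(values, iterations=100):
    """1-D K-means with two clusters; returns (dark_center, bright_center)."""
    lo, hi = float(min(values)), float(max(values))
    for _ in range(iterations):
        dark = [v for v in values if abs(v - lo) <= abs(v - hi)]
        bright = [v for v in values if abs(v - lo) > abs(v - hi)]
        if not bright:
            break  # all values are identical, nothing to split
        new_lo = sum(dark) / len(dark)
        new_hi = sum(bright) / len(bright)
        if new_lo == lo and new_hi == hi:
            break  # assignments are stable
        lo, hi = new_lo, new_hi
    return lo, hi


def binarize_regions(region_means):
    """Map each region id to True (text, dark cluster) or False (background)."""
    lo, hi = two_means(list(region_means.values()))
    return {rid: abs(m - lo) <= abs(m - hi) for rid, m in region_means.items()}


# Hypothetical average gray levels of five merged regions.
region_means = {0: 35.0, 1: 60.0, 2: 190.0, 3: 210.0, 4: 48.0}
is_text = binarize_regions(region_means)
```

Because the clustering operates on whole regions rather than on pixels, a weak stroke fragment that was merged into a strong stroke in step 2 inherits the label of that stroke.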

5. EXPERIMENTAL RESULTS AND DISCUSSION

Figure 3 shows the performance of the method on a sub-image from the Google dataset. The results of two other methods are also shown in the figure for the sake of comparison. Otsu's method [14], which is a global thresholding method, and the local thresholding method of Sauvola [2] are used to segment the input image. Because of the variations in the text intensity, neither the global nor the adaptive method is able to preserve weak connections and provide a continuous and smooth output. As can be seen from the figure, the outputs of both methods suffer from cuts and false holes, which have a very negative effect on the performance of subsequent skeletonization processes. On the other hand, the output of the proposed method is continuous, and very weak connections are preserved thanks to the very local and correlation-based nature of the method.

Fig. 3. An example of degraded shapes with variable intensities. (a) shows the original image. (b) shows the output of Otsu's method. (c) shows the output of Sauvola's method. (d) shows the output of the proposed method, which is continuous and smooth.

The final example, shown in Figure 4, suffers from the bleed-through effect. As can be seen from the input image (Figure 4(a)), the degree of bleed-through is high, and in some regions the intensity of the interfering patterns is very close to the true text intensity. The outputs of Otsu's method and Sauvola's method are shown in Figures 4(b) and 4(c). Neither result succeeds in removing the bleed-through, and the segmented texts suffer from a high degree of interfering patterns. The proposed method, whose result is shown in Figure 4(d), is not only able to remove the bleed-through interfering patterns, but also provides continuous strokes which are ready for subsequent processing.

Fig. 4. An example of the bleed-through effect. (a) Original image. (b) Otsu's method. (c) Sauvola's method. (d) The proposed method.

An additional objective evaluation was also performed to quantify the efficiency of the proposed method. We compared the results using the well-known F-measure, taking as input the binarized images produced by the proposed method and by Sauvola's method, Otsu's method and the MRF binarization method [9]. Figures 5 and 6 show that the proposed method performs well compared with the others in terms of noise cleaning and weak-ink recovery, owing to the homogeneity of the segmented regions, which are locally merged. It obtains the highest F-measure in Figure 6 and is within 0.2 points of the best score (Otsu's method) in Figure 5. We can observe that, although Otsu's and Sauvola's methods are sometimes able to provide a good binarization, they are unable to clean some artifacts (noise). Thanks to its Markov regularization of regions, the proposed method can merge these artifacts into the larger regions to which they belong.
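For reference, the F-measure can be computed from a binarized image and its ground truth as the harmonic mean of precision and recall over text pixels. The sketch below uses this standard definition, which is assumed here because the exact formula is not given in the text, and reports the value in percent. The function name and the toy masks are illustrative.

```python
def f_measure(predicted, ground_truth):
    """F-measure in percent for flat sequences of booleans (True = text pixel)."""
    pairs = list(zip(predicted, ground_truth))
    tp = sum(1 for p, g in pairs if p and g)          # text found as text
    fp = sum(1 for p, g in pairs if p and not g)      # background found as text
    fn = sum(1 for p, g in pairs if not p and g)      # text missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 100.0 * 2.0 * precision * recall / (precision + recall)


predicted = [True, True, True, False, False, False, True, False]
truth = [True, True, False, False, False, True, True, False]
score = f_measure(predicted, truth)
```

In this toy case there are three true positives, one false positive and one false negative, so precision and recall are both 0.75. Broken strokes lower the recall, while residual noise and bleed-through lower the precision.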

Fig. 5. F-measure (Fm) evaluation. From top to bottom: original image; ground truth; proposed method (Fm=96.46); Sauvola's method [2] (Fm=95.47); Otsu's method [14] (Fm=96.65); MRF (2 classes) method [9] (Fm=70.27).

Fig. 6. F-measure (Fm) evaluation. From top to bottom: original image; ground truth; proposed method (Fm=92.47); Sauvola's method [2] (Fm=85.90); Otsu's method [14] (Fm=88.88); MRF (2 classes) method [9] (Fm=92.30).

6. CONCLUSION

Using a combination of Markovian-Bayesian clustering and local merging of sub-regions, a robust and fast segmentation and binarization method for degraded and historical document images is introduced. The method is able to preserve very weak connections and edges, which are very important in subsequent skeletonization steps that rely mainly on the topological features of shapes. To achieve such performance, the input image is processed in three steps. In the first step, by applying a Markovian-Bayesian clustering technique, an over-segmented map of the input consisting of coherent sub-regions is created. The sub-regions are small enough to distinguish between the different parts of strokes with variable intensities and background regions, and at the same time large enough to absorb very small variations and noise. Then, in the second step, by restricting merging to neighboring sub-regions only, the map of complete text shapes is obtained. For higher performance, the sub-regions are compared based on their homogeneity features. On the new map, the weak parts of the strokes and text shapes are connected to the strong parts because of their similarity in homogeneity and their adjacency. To finalize the segmentation and binarization process, in the third and last step, the number of classes and labels on the document image is reduced to two.

Acknowledgments

The authors would like to thank the NSERC of Canada for their financial support and Juma Al Majid Center (Dubai) for providing valuable datasets.

7. REFERENCES

[1] João Marcelo Monte da Silva, Rafael Dueire Lins, Fernando Mário Junqueira Martins, and Rosita Wachenchauzer, “A new and efficient algorithm to binarize document images removing back-to-front interference,” Journal of Universal Computer Science, vol. 14, no. 2, pp. 299–313, Jan. 2008.

[2] J. Sauvola and M. Pietikäinen, “Adaptive document image binarization,” Pattern Recognition, vol. 33, no. 2, pp. 225–236, Feb. 2000.

[3] Anna Tonazzini, Emanuele Salerno, and Luigi Bedini, “Fast correction of bleed-through distortion in grayscale documents by a blind source separation technique,” International Journal on Document Analysis and Recognition, vol. 10, no. 1, pp. 17–25, June 2007.

[4] Reza Farrahi Moghaddam and Mohamed Cheriet, “RSLDI: Restoration of single-sided low-quality document images,” Pattern Recognition, to appear in Special Issue on Handwriting Recognition, 2009.

[5] J. Bernsen, “Dynamic thresholding of grey-level image,” Proc. of the Eighth International Conference on Pattern Recognition, 1986.

[6] B. Gatos, I. Pratikakis, and S.J. Perantonis, “Adaptive degraded document image binarization,” Pattern Recognition, vol. 39, no. 3, pp. 317–327, Mar. 2006.

[7] Q. Chen, Q-S. Sun, P.A. Heng, and D-S. Xia, “A double thresholding image binarization method based on edge detector,” Pattern Recognition, vol. 41, 2008.

[8] B. Gatos, I. Pratikakis, and S.J. Perantonis, “Improved document image binarization by using a combination of multiple binarization techniques and adapted edge information,” in ICPR08, 2008, pp. 1–4.

[9] Stuart Geman and Donald Geman, “Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 6, no. 6, pp. 721–741, 1984.

[10] S. Mukhopadhyay, S. Mukhopadhyay, and B. Chanda, “Multiscale morphological segmentation of gray-scale images,” Image Processing, IEEE Transactions on, vol. 12, no. 5, pp. 533–549, 2003.

[11] W. Pieczynski, “Statistical image segmentation,” Machine Graphics and Vision, vol. 1, no. 1/2, pp. 261–286, 1992.

[12] Max Mignotte, “Segmentation by fusion of histogram-based k-means clusters in different color spaces,” Image Processing, IEEE Transactions on, vol. 17, no. 5, pp. 780–787, 2008.

[13] J. B. MacQueen, “Some methods for classification and analysis of multivariate observations,” Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability, 1967.

[14] N. Otsu, “A threshold selection method from gray-level histograms,” Systems, Man and Cybernetics, IEEE Transactions on, vol. 9, pp. 62–66, 1979.