An algorithm for Text Line Segmentation in Handwritten Skewed and Overlapped Devanagari Script

(1)

International Journal of Emerging Technology and Advanced Engineering

Website: www.ijetae.com (ISSN 2250-2459,ISO 9001:2008 Certified Journal, Volume 4, Issue 5, May 2014)

An algorithm for Text Line Segmentation in Handwritten

Skewed and Overlapped Devanagari Script

Rahul Garg

1

, Naresh Kumar Garg

2

1

PG Scholar, 2Assistant Professor, GZS-PTU Campus, Bathinda, Punjab Abstract- Text line segmentation is a very crucial step in

optical character recognition. Poor line segmentation leads to wrong results in recognition. In printed text, line segmentation is quite easy but in handwritten text, it is quite difficult due to problems like overlapping, touching of characters and also due to different writing style of a writer. In this paper we have discussed a new algorithm that can perform line segmentation in handwritten text. This algorithm mainly deals with skewed text but also with overlapping and touching of characters. This algorithm is based on projection profile technique. We have applied this algorithm on many of document images and it has given promising results.

Keywords- Handwritten Text, Overlapping, Projection profile, Skew, Text Line Segmentation, Text blocks.

I. INTRODUCTION

Handwritten Hindi text recognition is an important area of Optical Character Recognition (OCR) [1]. Improper line segmentation leads to faulty results in word or character segmentation and recognition stage. Hence before these, proper line segmentation must be done. Text line segmentation in printed text is much more easier than the handwritten text an in printed text, there is a equal font size and equal line spacing between the text lines which makes the text line segmentation very much easier, but in handwritten text document, text line segmentation is not that much easy due to presence of some problems like touching and overlapping of characters. Also in handwritten text, while writing, the writer does not use the same font size. At some point, the font size is bigger (usually in starting of line) and at some point the font size is smaller (usually in the end of line). In handwritten text, there is no equal line spacing which makes the text skewed and hence that makes the text line segmentation much more difficult. The presence of upper and lower modifiers also makes line segmentation difficult in Devanagari Script. The main aim of this paper is to provide an algorithm that can deal with these problems and can segment line properly from a hand written text. Our algorithm is based on projection profile method.

Projection profile methods are based on top-down algorithms which are one of the most successful methods in machine printed text [1]. This method is also good for handwritten text if there is a sufficient gap between two text lines. We have successfully applied this method on many documents.

II.LITERATURE REVIEW

A lot of research work has been done by various researchers in field of handwritten text line segmentation in OCR. A wide variety of handwritten text line segmentation methods are reported in the literature. The various existing handwritten text line segmentation methods can be categorized as projection profile based, Hough transform based method, smearing, grouping, graph based, and many more[1].

Projection profile based methods are good for those documents in which there is a significant gap between two text lines. Here the projections of the pixels are used to segment the text lines. These methods are mostly suited for straight lines and fails when there are variable skew angles in a text line. Some researchers modifies this method and uses partial projection profile. The input document image is divided into some vertical strips and horizontal projections of each strip is used for text line segmentation. Bar-Yosuf et al. [2], also used piecewise projection profile and divides the document image into vertical strips and then finds the local minima of projection profile of each strip and then groups them to form final text line segmentation.

In [3], a novel text line segmentation approach is proposed. Here line separating path's coordinates are determined in each step by three stage algorithm that combines (1) a local minima search technique, (2) a technique based on following the contour of the foreground component and (3) piecewise projection profile.

(2)

International Journal of Emerging Technology and Advanced Engineering

Website: www.ijetae.com (ISSN 2250-2459,ISO 9001:2008 Certified Journal, Volume 4, Issue 5, May 2014) The fuzzy RLSA measure is calculated for each pixel

and a gray scale image is calculated using this measure which is then binarized and then the text lines are extracted from the newly created image[5].

In Hough transform based methods, for the text line segmentation, gravity centers or minima points of the connected components are used. In [6], the text line detection method based on Hough transform for unconstrained handwritten text is used. Hough transform method fails when there are multiple skew angles in the same text line.

In [7], a graph model is used that describes the possible locations for segmenting neighbouring characters, and then an average longest path algorithm is applied to identify the globally optimal segmentation.

In grouping method, using bottom-up strategy, the connected components of the black pixels are grouped together to form alignments or groups. In [8], an approach based on grouping of black pixels is used for text line segmentation.

Jindal et al. [9], have worked on recognition of degraded printed Gurumukhi script. They have also used partial projection profile based method. In this the whole document is divided into strips and the proposed algorithm is applied for segmenting horizontally overlapping lines and associating small strips to their respective lines[10,11].

In [12], a piecewise painting technique is proposed for line segmentation in unconstrained handwritten text.

III. DATABASE

A database has been constructed by taking handwritten data from 30 different writers. All the images are scanned on 600 dpi. All the experiments are conducted on this database. Total number of text lines in the database are 180. No pre-processing like noise removal, skew removal etc is performed on the database. An image from our database is shown in figure 1.

Figure 1. Part of database

IV. CHARACTERISTICS OF DEVANAGARI SCRIPT

Devanagari is the script used for writing Hindi, Nepali, Sanskrit and Marathi languages. It is written from left to right of a page. There is no upper case and lower case concept in Devanagari Script. There are 33 consonants and 14 vowels in this script.

V. LINE SEGMENTATION

We have proposed an algorithm for handwritten text line segmentation. This algorithm is based on projection profile method. Here we have used piecewise projection profile. Segmentation is done by using the gap between the text lines. We have divided the whole document image into 6,7 and 8 strips and have calculated results on many different document images. This algorithm efficiently deal with skewed text. It can also promisingly deal with overlapped and touched text lines.

We have made following assumptions about the data: 1)The minimum height of a constant in a text line is

10 pixels.

2)The average height (AVGHYT) of a text line is between 20 to 40 pixels.

3)The maximum height of a text line (consonant + modifier) is 25 pixels.

4)If a text block is of less than 10 pixels, it would be merged with previous or next text block depending upon condition.

5)If a text block is of more size than AVGHYT+10 pixels, it must be broken into two parts.

Procedure for text line segmentation: We have initialized the following arrays:

 m_MyImage: It contains pre-processed binarized image.

 HProfiles: It is array that stores the total number of black pixels per row.

 START: It is a 2-D array and contains the starting address of each text block in each strip.

 END: It is a 2-D array and contains ending address of each text block in each strip.

Step-1: Read the input image and then binarize it and store it in a 2-D array m_MyImage.

Step-2: Get information about the height (h) and width of the image (w) which is the size of the 2-D array.

(3)

International Journal of Emerging Technology and Advanced Engineering

Step-4: For each strip follow these steps:

i. if HProfiles[y]<=0 or HProfiles[y]==1 or HProfiles[y]==2, assign RED color to that pixel and assign that value of HProfiles[y] to a new variable p.

ii. if HProfiles[y]>p, put the starting pixel address into START array, now a while loop is called that processes all the pixels till the HProfiles[y]==0. Put the last pixel address into END array.

iii. Now we have got the starting and ending addresses of all the text blocks. By using a loop, display the first text block of all the stipes and then after some gap display second text blocks.

Step-5: Repeat the step-4 until full text document image is displayed.

During segmentation with this algorithm, we face certain problems due to some factors like overlapping and touching of characters. Also sometimes the writer left some gap between diacritics and character or between diacritics and header line due to which small text blocks gets created which leads to improper text line segmentation and hence we get faulty results.

Overlapping and Touching of Characters:

[image:3.595.329.537.134.258.2]

Due to overlapping and touching of characters, there remains no significant gap between the text lines and hence two or more text lines comes in a same text block which leads to wrong results. Hence this bigger text block must be divided into 2 parts at a point where the pixel density is minimum. In figure 2, overlapping and touching of characters is shown. Touching is shown in green circles and overlapping in red circles.

Figure 2. Overlapping and touching components in text lines.

Gap between diacritics and header line:

[image:3.595.55.278.515.628.2]

Sometimes due to poor handwriting, the writer left some gap between diacritics and character or between diacritics and header line due to which small text blocks gets created which leads to improper text line segmentation and hence leads to wrong results as shown in figure 3.

Figure 3. Showing ( in red circles) space between diacritics and header line.

To deal with these problems, following steps are added after the 4th step of in above algorithm.

Step-1: Calculate the average height of first strip. Size of text block = END[0][0]-START[0][0].

Total size of all text blocks in a strip, s=s+END[0][j]-START[0][i].

where 0 represnts first strip and i,j= 0,1.. upto 1 less than number of text lines. Initially S=0.

Average height (AVGHYT)= s/n, where n= number of text blocks.

Step-2: Calculate the height(HYT) of each text block.

Step-3: if HYT<10 pixels, it means the height of block is very less means either it is a diacritics or broken character due to poor scanning and hence it must be merged with the previous or next block.

 if it is a lower diacritics then it is merged with pervious block.

 if it is a upper diacritics then it is merged with next block.

 if it is a broken character, then also it is merged with previous block.

Step-4: if HYT>= (AVGHYT+10), it means this text block has a very large height hence there must be a touching or overlapping characters.

Here then minima (MIN) is calculated means a point where pixel density is minimum. A while loop is called which processes all the pixels till HProfile[y]==MIN. At a point of MIN, the text block is divided into 2 parts. Starting address of block is stored in START and ending address is stored in END array.

VI. RESULTS

(4)

International Journal of Emerging Technology and Advanced Engineering

Website: www.ijetae.com (ISSN 2250-2459,ISO 9001:2008 Certified Journal, Volume 4, Issue 5, May 2014) If the text have suffucient gap between text lines and

the document is properly scanned then the accuracy in line segmentation will be very much. Here by using this algorithm, skewed text lines are segmented correctly however on overlapped and touched text lines, this algorithm have not given excellent results but have given good results. Here we have divided the docuement image into 6,7 and 8 strips and has collected results on differnent text images and have compared their results. We have got maximum accuracy by dividing the text image into 6 strips. We have got main problem that diacritics have not segmented properly here when there is overlapping in text lines. Some results are shown in figure 4a and 4b.

Figure 4a. Result of segmentation

[image:4.595.313.544.150.424.2]

Figure 4b. Result of segmentation

TABLE I.

Accuracy of segmentation by dividing text document image into 6 strips.

TOTAL LINES

LINES CORRECTLY SEGMENTED

PECENTAGE OF

ACCURACY

180 169 93.9

TABLE II.

TOTAL LINES

PECENTAGE OF

ACCURACY

180 164 91.1

TABLE III.

TOTAL LINES

PECENTAGE OF

ACCURACY

180 167 92.8

VII. DISCUSSIONS

From the above results of line segmentation, it is clear that the above algorithm can work well for handwritten text even if the text is skewed but it doesnot work well if there is too much overlapping or touching. The lines with broken diaritics have also segmented very efficiently by this algorithm.

In future the study may be carried out in following directions:

The above algorithm can be applied on other languages like Gurmukhi, Bangla, Sanskrit etc.

Some other technique may be applied for the segmentation of touching and overlapping lines

REFERENCES

[1 ] N. K. Garg, L. Kaur and M. K. Jindal, "A New Method for Line

Segmentation of Handwritten Hindi Text", In: Proceedings of the 7th Internation IEEE Conference on Human Technology: New Generations (ITNG), pp. 392-397, 2010

[2 ] I. B. Yosef, N. Hagbi, K. Kedem, I. Dinstein, “Line Segmentation

for Degraded Handwritten Historical Documents,” Proc. 10th ICDAR, pp. 1161-1165, 2009.

[3 ] Y. Gao, X. Ding, C. Liu, "A Multi-scale Text Line Segmentation

Method in Freestyle Handwritten Documents", International

Conference on Document Analysis and Recognition, pp. 643-647, 2011.

[4 ] Z. Shi, V. Govindaraju, “Line Separation for Complex Document

[image:4.595.57.270.308.733.2]

(5)

International Journal of Emerging Technology and Advanced Engineering

[5 ] N. Modi and K. Jindal, “Text line Detection and Segmentation in

Handwritten Gurumukhi Scripts,” International Journal of Advance Research in Computer Science and Software Engineering, Vol. 3, pp. 1075-1080, 2013.

[6 ] G. Louloudis, B. Gatos, I. Pratikakis and K. Halatsis” A Block Based Hough Transform Mapping for Text Line Detection in Handwritten Documents”, Proceedings of the Tenth International Workshop on Frontiers in Handwriting Recognition, pp. 515-520, 2006.

[7 ] D. Salvi, J. Zhou, J. Waggoner, S. Wang, ”Handwritten Text

Segmentation using Average Longest Path Algorithm,” Applications of Computer Vision(WACV), IEEE Workshop, pp. 505-512, 2013.

[8 ] L. Likforman-Sulem and C. Faure, "Extracting Text lines in Handwritten Documents by Perceptual Grouping", Advances in handwriting and drawing : a multidisciplinary approach, pp. 21-38, 1994.

[9 ] M. K. Jindal, G. S. Lehal and R. K. Sharma, “On Segmentation of

Touching Characters and Overlapping lines in Degraded Printed Gurmukhi Script”, International Journal of Image and Graphics (IJIG), World Scientific Publishing Company, Vol. 9, No. 3, pp. 321-353, 2009.

[10 ]M. K. Jindal, R. K. Sharma and G. S. Lehal, “Segmentation of Horizontally Overlapping Lines in Printed Indian Scripts”, International Journal of Computational Intelligence Research (IJCIR), Research India Publications, Vol. 3, No. 4, pp. 277-286, 2007.

[11 ]N. K. Garg, L. Kaur and M. K. Jindal, "Segmentation of

Handwritten Hindi Text", International Journal of Computer Applications (IJCA), Vol. 1, 2010, pp. 19-23.

[12 ]A. Alaei, P. Nagabhushan, U. Pal, “Piece-wise painting technique