International Journal of Emerging Technology and Advanced Engineering
Website: www.ijetae.com (ISSN 2250-2459, Volume 2, Issue 4, April 2012)
Word Level Handwritten and Printed Text Separation
Based on Shape Features
Upasana Patil¹, Masarath Begum²
¹,²Department of Computer Science, GND Engineering College, Bidar, India
¹[email protected]
Abstract— In this paper, we present a method for discriminating handwritten and printed text in document images based on shape features. Separating handwritten from printed text in a document image is essential for optimizing OCR accuracy and for activating the appropriate OCR engine. It reduces the search space of the OCR and also facilitates the retrieval of handwritten and printed text from document images. We used the IAM dataset 3.0: 74 pages were segmented with morphological transformations, yielding 10768 words, of which 2000 were used for experimentation. The method achieved an average accuracy of 98.57% with only seven features. The proposed method is simple, has promising discrimination accuracy, and has lower time complexity than [10].
Keywords—Document Image Analysis, OCR, Shape Features, Handwritten and Printed Text.
I. INTRODUCTION
Integration of new technologies and inventions leads us towards the paperless office and the paperless society. Document image analysis is one of the important steps in automating offices. Every office activity involves papers in the form of petitions, application forms, reports, letters and accounts. In most situations we come across numerous documents containing a mixture of handwritten and printed text, for example railway reservation forms, bank cheques and memorandums. The interlacing of handwritten and printed text often occurs at word level, line level and paragraph level. Recognizing such documents is a challenging task for OCR designers. To optimize OCR accuracy, handwritten and printed text must be separated before the OCR engine is activated. This separation reduces the search space of the OCR and also facilitates the retrieval of handwritten and printed text documents. Thus, the problem of automatic discrimination between handwritten and printed text in document images may be addressed at three levels: paragraph level separation, line level separation and word level separation. In this paper, word level handwritten and printed text separation is carried out.
In Section 2, we present a survey of the literature. The details of the dataset used for experimentation are presented in Section 3. In Section 4, the feature extraction methodology is given; Sections 5 and 6 describe the classification of text words and the results, respectively. The conclusion is drawn in Section 8.
II. LITERATURE SURVEY
There exists some research publications on
J. Koyama et al. [9] introduced a scheme for distinguishing between handwritten and machine-printed characters, named the spectrum-domain local fluctuation detection (SDLFD) method. They transform local regions of the document into the frequency domain and feed the feature values to an optimized multilayer perceptron to obtain the likelihood of handwriting. Lincoln Feria et al. [10] developed a scheme to discriminate between printed and handwritten text in documents. They extracted the following features: deviation of width, height and area; density; vertical projection variance; major horizontal projection difference; pixel distribution; bottom line; black pixels of each line; vertical edges; and major vertical edge. Mining rules were used for classification.
III. DATASET
IAM dataset 3.0 [32] is considered for experimentation. A word-level dataset is created from 74 pages containing handwritten and printed text. These pages are segmented using simple morphological operations, yielding 10768 word images. Of these, 1000 printed and 1000 handwritten text words are used for training and testing. Sample images used for experimentation are shown in Figures 1 to 5.
Figure 1 Input image of IAM database
Figure 2 Binarized image of IAM database
Figure 3 Image after line removal
Figure 4 Sample segmented handwritten text words
Figure 5 Sample segmented printed text words
IV. FEATURE EXTRACTION
Documents are binarized using Otsu's threshold selection method [23]. Noise is removed using basic morphological operations, and long lines are removed from the document images by connected component analysis. Before designing the feature extractor, we closely examined the shapes of the connected components of handwritten and printed text words and noticed differences in roundness, stroke orientation, compactness and aspect ratio. We capture these visually discriminating differences with simple shape features such as circularity, compactness, area, major axis, minor axis and aspect ratio. The following features are calculated on every connected component of a word:
1. Area: It is defined as the actual number of pixels in the region.
3. Form factor: $\text{Form factor} = \dfrac{4\pi \times \text{Area}}{\text{Perimeter}^{2}}$
4. Major axis: It is defined as the length (in pixels) of the major axis of the ellipse that has the same normalized second central moments as the region.
5. Minor axis: It is defined as the length (in pixels) of the minor axis of the ellipse that has the same normalized second central moments as the region.
6. Roundness: $\text{Roundness} = \dfrac{4 \times \text{Area}}{\pi \times \text{MajorAxis}^{2}}$
7. Compactness: $\text{Compactness} = \dfrac{\sqrt{(4/\pi) \times \text{Area}}}{\text{MajorAxis}}$
A feature vector of seven features is derived by computing the mean of each of the above-mentioned features, as defined below:
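1. $F_1 = \dfrac{1}{N}\sum_{i=1}^{N} \text{Area}_i$
2. $F_2 = \dfrac{1}{N}\sum_{i=1}^{N} \text{Perimeter}_i$
3. $F_3 = \dfrac{1}{N}\sum_{i=1}^{N} \text{Form factor}_i$
4. $F_4 = \dfrac{1}{N}\sum_{i=1}^{N} \text{MajorAxis}_i$
5. $F_5 = \dfrac{1}{N}\sum_{i=1}^{N} \text{MinorAxis}_i$
6. $F_6 = \dfrac{1}{N}\sum_{i=1}^{N} \text{Roundness}_i$
7. $F_7 = \dfrac{1}{N}\sum_{i=1}^{N} \text{Compactness}_i$

where N is the number of connected components of a word.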
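The feature definitions above can be illustrated with a minimal Python sketch. It assumes that area, perimeter and the two axis lengths of each connected component have already been measured by some region-properties routine; the function names and the tuple input format are hypothetical and not taken from the paper. The form factor, roundness and compactness use the standard definitions with π.

```python
import math

def component_features(area, perimeter, major_axis, minor_axis):
    """Return the seven shape features of one connected component,
    in the order used for F1..F7."""
    form_factor = 4.0 * math.pi * area / perimeter ** 2
    roundness = 4.0 * area / (math.pi * major_axis ** 2)
    compactness = math.sqrt(4.0 * area / math.pi) / major_axis
    return [area, perimeter, form_factor, major_axis, minor_axis,
            roundness, compactness]

def word_feature_vector(components):
    """Average each feature over the N connected components of one word.

    `components` is a list of (area, perimeter, major_axis, minor_axis)
    tuples, one per connected component (hypothetical input format).
    """
    rows = [component_features(*c) for c in components]
    n = len(rows)
    # zip(*rows) groups the values of each feature across components.
    return [sum(column) / n for column in zip(*rows)]

# Sanity check on an ideal disc of radius 5: form factor, roundness
# and compactness of a perfect circle should all equal 1.
r = 5.0
disc = (math.pi * r ** 2, 2 * math.pi * r, 2 * r, 2 * r)
features = word_feature_vector([disc, disc])
```

For a perfect circle the three ratio features are all 1, so values well below 1 indicate elongated or irregular components.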
V. CLASSIFICATION OF TEXT WORDS
K-nearest neighbour (KNN) is a supervised learning algorithm. It determines the k nearest neighbours of a query instance by computing the distance (the Euclidean distance metric is used) from the query instance to the training samples. The prediction for the query instance is the simple majority class among these k nearest neighbours. The experiment is carried out by varying the number of neighbours (K = 1, 3, 5, 7), and the performance of the algorithm is optimal when K = 1.
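As a rough illustration of the classifier described above, the following sketch implements majority-vote KNN with the Euclidean distance in plain Python. The function name, the toy two-dimensional feature vectors and the labels are assumptions for demonstration only; the paper uses seven-dimensional feature vectors.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, query, k=1):
    """Predict the label of `query` by majority vote among its k nearest
    training samples under the Euclidean distance."""
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): two well-separated clusters.
train_x = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
train_y = ["printed", "printed", "handwritten", "handwritten"]

label_1nn = knn_predict(train_x, train_y, [0.2, 0.1], k=1)
label_3nn = knn_predict(train_x, train_y, [5.5, 5.0], k=3)
```

Because the Euclidean distance is dominated by features with large numeric ranges (area, for instance), feature scaling may matter in practice; the paper does not state whether any normalization was applied.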
Proposed algorithm
Input: Gray scale image from the IAM database
Output: Classified handwritten and printed text words
1. Input a gray scale image from the IAM database.
2. Pre-process the input document image: binarize it using Otsu's method and remove speckles using morphological opening.
3. Apply connected component processing for line removal.
4. Apply horizontal dilation with a line structuring element to join the connected components of a word.
5. Crop each segmented word to its bounding box and save it into a folder.
6. Extract the shape features of the connected components of a word as defined above.
7. Compute the average of the shape features as discussed above and form a seven-feature vector.
8. Classify the handwritten and printed text words using a KNN classifier with K=1 and two-fold cross validation.
9. Stop.
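Step 4 of the algorithm can be sketched as follows. This is a simplified pure-Python illustration of binary dilation with a horizontal line structuring element, assuming the image is a list of rows of 0/1 values; the function name and the element width of 5 are hypothetical, since the paper does not report the width it used.

```python
def dilate_horizontal(image, width):
    """Binary dilation with a 1 x `width` line structuring element.

    `image` is a list of rows of 0/1 values and `width` is assumed odd.
    An output pixel is 1 if any input pixel within width // 2 columns
    in the same row is 1.
    """
    half = width // 2
    out = []
    for row in image:
        n = len(row)
        out.append([
            1 if any(row[max(0, c - half):min(n, c + half + 1)]) else 0
            for c in range(n)
        ])
    return out

# Two strokes at columns 0 and 3 are close enough to merge into one
# component; the stroke at column 9 stays separate.
row = [1, 0, 0, 1, 0, 0, 0, 0, 0, 1]
dilated = dilate_horizontal([row], width=5)[0]
# dilated == [1, 1, 1, 1, 1, 1, 0, 1, 1, 1]
```

The width of the structuring element controls which gaps are bridged: it should exceed the typical spacing between characters of a word but stay below the spacing between words.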
VI. RESULTS AND DISCUSSION
TABLE 1
CLASSIFICATION AVERAGE ACCURACY IN PERCENTAGE WITH KNN AND K=1
Text          Accuracy
Handwritten   97.14%
Printed       100.00%
Average       98.57%
TABLE 2
CONFUSION MATRIX
Text Handwritten Printed
Handwritten 971 29
Printed 0 1000
TABLE 3
COMPARATIVE ANALYSIS WITHOUT CROSS VALIDATION
                       Lincoln Feria et al. [10]   Proposed Method
Dataset Used           IAM DB3.0                   IAM DB3.0
Classification         Word Level                  Word Level
No. of Features Used   09                          07
Average accuracy       97.82%                      98.57%
TABLE 4
COMPARATIVE ANALYSIS WITH CROSS VALIDATION
                       Lincoln Feria et al. [10]           Proposed Method
Dataset Used           IAM DB3.0                           IAM DB3.0
Classification         Word Level                          Word Level
No. of Features Used   09                                  07
Cross Validation       K-fold cross validation with k=10   K-fold cross validation with k=10
Average accuracy       100%                                100%
From Tables 1 and 2 it is clear that 2.9% of handwritten words are misclassified as printed, whereas no printed words are misclassified. Tables 3 and 4 show that the proposed method matches or exceeds the accuracy of [10] while using fewer features (seven instead of nine). The method presented in [10] employs projection profile features, which have high time complexity; the proposed method avoids them and is therefore simpler and less time-consuming.
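The per-class figures can be recomputed directly from the confusion matrix in Table 2, as in the short sketch below (the dictionary layout and function name are illustrative assumptions). Computed this way, the handwritten accuracy is 97.10% and the average is 98.55%, marginally different from the 97.14% and 98.57% reported in Table 1; the small gap may come from averaging over cross validation folds, but that explanation is an assumption.

```python
# Confusion matrix from Table 2: rows are actual classes,
# columns are predicted classes.
confusion = {
    "handwritten": {"handwritten": 971, "printed": 29},
    "printed": {"handwritten": 0, "printed": 1000},
}

def per_class_accuracy(matrix):
    """Fraction of each actual class that was predicted correctly."""
    return {
        actual: row[actual] / sum(row.values())
        for actual, row in matrix.items()
    }

rates = per_class_accuracy(confusion)
average = sum(rates.values()) / len(rates)
handwritten_error = confusion["handwritten"]["printed"] / sum(
    confusion["handwritten"].values()
)
```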
VII. ACKNOWLEDGMENT
We are very grateful to the reviewers for their valuable comments. We also extend our gratitude to Dr. Mallikarjun Hangarge, Srikanth Doddamani and Rajmohan Pardeshi of Karnatak College, Bidar for their valuable suggestions during this work.
VIII. CONCLUSION
In this work, we present an algorithm for automatic discrimination between handwritten and printed text using simple shape features. The algorithm is robust to noise and has lower time complexity than [10]. Lincoln Feria et al. [10] used 9 features and achieved a classification accuracy of 97.82% without cross validation, whereas the proposed method achieved 98.57% with only seven features. In future, we will extend this method to the character level and reduce the feature dimension through feature selection, with the aim of reaching 100% classification accuracy. The classification of handwritten and printed characters is particularly important for documents available in India: as India is a multilingual country, we come across various documents that combine different scripts at character level. Hence, the extension of the proposed method to character level separation is justifiable.
REFERENCES
[1] S. Imade, S. Tatsuta, and T. Wada, Segmentation and Classification for Mixed Text/Image Documents Using Neural Network, Proceedings of the Second International Conference on Document Analysis and Recognition, 20-22 Oct., pp. 930 - 934, 1993.
[2] K. Kuhnke, L. Simoncini, and Z. M. Kovacs-V, A System for Machine-Written and Hand-Written Character Distinction, Proceedings of the Third International Conference on Document Analysis and Recognition, v. 2, 14 - 16 Aug., pp. 811 - 814, 1995.
[3] S. Violante, R. Smith, and M. Reiss, A Computationally Efficient Technique for Discriminating Between Hand-Written and Printed Text, IEEE Colloquium on Document Image Processing and Multimedia Environments, 2 Nov., pp. 17/1 - 17/7, 1995.
[4] U. Pal, and B. B. Chaudhuri, Automatic separation of machine-printed and hand-written text lines, ICDAR ’99. Proceedings of the Fifth International Conference on Document Analysis and Recognition, pp. 645-648, 1999.
[5] U. Pal, and B. B. Chaudhuri, Machine-printed and Hand-written Text Line Identification, Pattern Recognition Letters, v. 22, n 3 - 4, pp. 431- 441, 2001.
[7] Y. Zheng, H. Li, and D. Doermann, Machine Printed Text and Handwriting Identification in Noisy Document Images, IEEE Transactions on Pattern Analysis and Machine Intelligence, v. 26, n 3, pp. 337 - 353, 2004.
[8] E. Kavallieratou, S. Stamatatos, and H. Antonopoulou, Machine-Printed from Handwritten Text Discrimination, IWFHR-9 2004, 9th Intern. Workshop on Frontiers in Handwriting Recognition, 26-29 Oct., pp. 312 - 316, 2004.
[9] J. Koyama, M. Kato, and A. Hirose, Local-spectrum-based distinction between handwritten and machine-printed characters, 15th IEEE International Conference on Image Processing, 12-15 Oct., pp. 1021 -1024, 2008.
[10] Lincoln Feria and Angel Sanchez, Automatic Discrimination between Printed and Handwritten Text in Documents, XXII Brazilian Symposium on Computer Graphics and Image Processing, pp. 261 - 267, DOI: 10.1109/SIBGRAPI.2009.40, 2009.
[11] J. E. B. Santos, B. Dubuisson, and F. Bortolozzi, A Non Contextual Approach for Textual Element Identification on Bank Cheque Images, IEEE International Conference on Systems, Man and Cybernetics, v. 4, pp 6 - 9, 2002.
[12] J. E. B. Santos, B. Dubuisson, and F. Bortolozzi, Characterizing and Distinguishing Text in Bank Cheque Images, Proceedings XV SIBGRAPI,pp. 203 - 209, 2002.
[13] F. Farooq, K. Sridharan, and V. Govindaraju, Identifying Handwritten Text in Mixed Documents, ICPR 2006, 18th International Conference on Pattern Recognition, v. 2, pp. 1142 - 1145, 2006.
[14] J. Franke, and M. Oberlander, Writing Style Detection by Statistical Combination of Classifiers in Form Reader Applications, Proceedings of the 2nd Intern. Conference on Document Analysis and Recognition, pp. 581 - 584, 1993.
[15] S. N. Srihari, Y. C. Shin, V. Ramanaprasad, and D. S. Lee, A System to Read Names and Addresses on Tax Forms, Proceedings of the IEEE, v. 84, n 7, pp. 1038 - 1049. DOI: 10.1109/5.503302, 1996.
[16] Y. Zheng, H. Li, and D. Doermann, The Segmentation and Identification of Handwriting in Noisy Document Images, Document Analysis Systems V, Lecture Notes in Computer Science, v. 2423, pp. 95 - 105, 2002.
[17] Y. Zheng, H. Li, and D. Doermann, Text Identification in Noisy Document Images Using Markov Random Field, Proceedings of the Seventh International Conference on Document Analysis and Recognition, v. 1, pp. 599 - 603, 2003.
[18] E. Kavallieratou, and S. Stamatatos, Discrimination of Machine-Printed from Handwritten Text Using Simple Structural Characteristics, Proceedings of the 17th International Conference on Pattern Recognition, ICPR 2004, v. 1, 23 - 26 Aug., pp.437 - 440, 2004.
[19] U. Marti, and H. Bunke, The IAM-database: an English Sentence Database for Off-line Handwriting Recognition, Int. Journal on Document Analysis and Recognition, v. 5, n1, pp. 39-46, 2002.
[20] U. Marti, and H. Bunke, A full English sentence database for offline handwriting recognition, ICDAR ’99, Proceedings of the Fifth International Conference on Document Analysis and Recognition, pp. 705-708, 1999.
[21] U. Marti, and H. Bunke, Handwritten Sentence Recognition, Proceedings. 15th International Conference on Pattern Recognition, v. 3, pp. 463-466, 2000
[22] M. Zimmermann, and H. Bunke, Automatic Segmentation of the IAM Off-line Database for Handwritten English Text, Proceedings of the 16th International Conference on Pattern Recognition, v. 4, pp. 35 - 39, 2002.
[23] N. Otsu, A Threshold Selection Method from Gray-Level Histograms, IEEE Transactions on Systems, Man and Cybernetics, v. 9, n 1, pp. 62 - 66, 1979.
[24] Gonzalez and Woods, Digital Image Processing Using MATLAB, second edition.
[25] Mohamed Cheriet, Nawwaf Kharma, Cheng-Lin Liu, and Ching Y. Suen, Character Recognition Systems: A Guide for Students and Practitioners, John Wiley & Sons, 2007.
[26] Frank Y. Shih, Image Processing and Pattern Recognition: Fundamentals and Techniques, IEEE and John Wiley & Sons, 2010.
[27] Frank Y. Shih, Image Processing and Mathematical Morphology: Fundamentals and Applications, CRC Press, 2010.
[28] Lawrence O'Gorman and Rangachar Kasturi, Document Image Analysis, IEEE Computer Society Executive Briefings, ISBN 0-8186-7802-X, Library of Congress Number 97-17283.
[29] Stefano Ferilli, Automatic Digital Document Processing and Management: Problems, Algorithms and Techniques, Springer-Verlag London Limited, 2011.
[30] Chris Solomon and Toby Breckon, Fundamentals of Digital Image Processing: A Practical Approach with Examples in Matlab, first edition, John Wiley & Sons, Ltd., 2011.
[31] Uvais Qidwai and C. H. Chen, Digital Image Processing: An Algorithmic Approach with MATLAB, first edition, CRC Press, 2009.