A Comparative Study of Classification Algorithm for Printed Telugu Character Recognition


Miss M. Sharmila Devi
Assistant Professor, RGM College of Engineering and Technology, Kurnool, India

Mr. Y. Gangadhar
Associate Professor, Kuppam Engineering College, Chittoor, India

Dr. V. S. Giridhar Akula
Professor & Principal, Avanthi’s Scientific Technological & Research Academy, Hyderabad, India

Dr. G. V. Prabhakar Rao
Professor & HOD, RGM College of Engineering and Technology, Kurnool, India

Abstract: Optical character recognition (OCR) refers to a process whereby printed documents are transformed into ASCII files for the purpose of compact storage, editing, fast retrieval, and other file manipulations through the use of a computer. OCR plays a key role in the fields of digital image processing, pattern recognition, and image restoration. The key motivation for the development of OCR systems is the need to cope with the enormous flood of paper, in the form of documents, bank cheques, commercial forms, government records, credit card imprints, and mail, generated by an expanding technological society. The previous work was primarily designed for Telugu; it used the Uniform Sampling Method as the basis for extraction of low-level, structural, and stroke-type features and the nearest neighbor classifier for classification, with an accuracy rate of 96%. The objective of the current project is to improve the accuracy using different types of hybrid classifiers. The proposed work uses the nearest neighbor procedure and the k-means algorithm to obtain better performance.

Keywords: Uniform Sampling, Discrete Cosine Transform, Threshold, Gray Level, K-Means, Nearest Neighbor Classifier.

I. INTRODUCTION TO OCR

Character recognition is a subset of the pattern recognition area. However, it was character recognition that gave the incentive for making pattern recognition and image analysis mature fields of science. Optical character recognition (OCR) is the process of translating scanned images of typewritten text into machine-editable information. In the early 1950s, David Shepard was issued U.S. Patent Number 2,663,758 for "Gismo," the first machine to convert printed material into machine language.

The origins of character recognition can be traced back to 1870. This was the year that C. R. Carey of Boston, Massachusetts invented the retina scanner, an image transmission system using a mosaic of photocells. Two decades later the Polish P. Nipkow invented the sequential scanner, which was a major breakthrough both for modern television and for reading machines. During the first decades of the 20th century several attempts were made to develop devices to aid the blind through experiments with OCR. However, the modern version of OCR did not appear until the middle of the 1940s with the development of the digital computer. The motivation for development from then on was the possible applications within the business world.

II. PROCESSING STAGES IN OCR

A typical optical character recognition process works through the following stages:

1. Scanning and digitization.
2. Preprocessing.
3. Segmentation.
4. Feature extraction.
5. Classification.
6. Post-processing.

A. Scanning & Digitization

The document should first be scanned and digitized using a scanner and saved in a standard format so that it can be used to recognize characters in the later stages.

B. Pre-Processing

The importance of the preprocessing stage of a character recognition system lies in its ability to remedy some of the problems that may arise from the factors presented in the section above. The use of preprocessing techniques may thus enhance a document image, preparing it for the next stage in a character recognition system. An effective preprocessing stage is essential for achieving higher recognition rates; effective preprocessing algorithms make the OCR system more robust, mainly through accurate image enhancement, noise removal, image thresholding, skew detection/correction, page segmentation, character segmentation, character normalization, and morphological techniques.

C. Segmentation

Segmentation is the isolation of characters or words. The majority of optical character recognition algorithms segment the words into isolated characters, which are recognized individually; that is, the text is separated into lines, words, and basic symbols.

D. Feature-Extraction

The purpose of Feature extraction is the measurement of those attributes of patterns that are most pertinent to a given classification task. The task of the human expert is to select or invent features that allow effective and efficient recognition of patterns.

E. Classification

The classification stage is the main decision making stage of an OCR system and uses the features extracted in the previous stage to identify the text segment according to preset rules.

F. Post-Processing

The output of the classification stage is converted into ASCII, ISCII, or another standard coding scheme so that words, sentences, and paragraphs can be reconstructed from it. A well-structured dictionary may also be used to resolve ambiguities in recognition.

III. PROPOSED WORK

Telugu is one of the oldest and most popular languages of India, spoken by more than 66 million people, especially in South India. Developing optical character recognition systems for Telugu text, and improving their accuracy, is an area of current research.

Commercial OCR packages are already available for languages like English. Considerable work has also been done for languages like Japanese and Chinese [1]. Recently, work has been done on the development of OCR systems for Indian languages. This includes work on recognition of Devanagari characters [4], Bengali characters [5], Kannada characters [6], and Tamil characters [7]. Some more recent work on Indian languages is also reported [8, 9, 10, 11, 12]. Work on Telugu character recognition is not substantial [13, 14]. In this project, recognition of characters is done using the K-means algorithm. A comparison of recognition by the nearest neighbor classifier and by K-means is also presented.

A. Thresholding & Binarization

Through the scanning process a digital image of the original document is captured. In OCR, optical scanners are used, which generally consist of a transport mechanism plus a sensing device that converts light intensity into gray levels. Printed documents usually consist of black print on a white background. Hence, when performing OCR, it is common practice to convert the multilevel image into a bi-level image of black and white. This process is known as thresholding.

Thresholding creates binary images from gray-level ones by turning all pixels below some threshold to zero and all pixels above that threshold to one. (What you do with pixels at the threshold does not matter, as long as you are consistent.)

If g(x, y) is a thresholded version of f(x, y) at some global threshold T, then

$$
g(x, y) =
\begin{cases}
1 & \text{if } f(x, y) \ge T \\
0 & \text{if } f(x, y) < T
\end{cases}
$$

In our present work, we have implemented a simplified form of thresholding in which we assume a threshold value of 128 for all documents. The reason is that pixel values in a gray-scale image range from 0 to 255, and 128 lies midway between these two numbers. We now present our binarization algorithm.

Binarization Algorithm

Algorithm Binarize (Image)
// Input: Image refers to the digitized image matrix
Begin
    // Get the size of the image matrix
    [row, col] = size(Image);
    // Traverse all the elements of the image matrix and
    // compare them with the threshold value
    for i = 1 to row do
        for j = 1 to col do
            if (Image(i, j) < 128)
                Image(i, j) = 0;
            else
                Image(i, j) = 1;
            end
        end
    end
End.

The above algorithm was applied to a variety of document images and was found to work satisfactorily. A sample test case is provided in the figure below.
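As an illustration only, the fixed-threshold binarization above can be sketched in Python. The function name `binarize`, the list-of-rows image representation, and the sample pixel values are assumptions made for this sketch and are not part of the original implementation.

```python
def binarize(image, threshold=128):
    """Return a bi-level copy of a gray-level image given as a list of rows.

    Pixels below the threshold become 0; all others become 1,
    mirroring the comparison used in Algorithm Binarize.
    """
    return [[0 if pixel < threshold else 1 for pixel in row] for row in image]


# Hypothetical 2 x 3 gray-level image with values in 0..255.
gray = [
    [12, 200, 255],
    [128, 127, 90],
]
print(binarize(gray))  # [[0, 1, 1], [1, 0, 0]]
```

Unlike the pseudocode, this sketch returns a new matrix instead of overwriting the input, which keeps the gray-level image available for later inspection.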

a) Original Image:

b) Binarized Image:

B. Noise Filtering


Noise removal can be done using either spatial filters or frequency domain filters. The commonly used spatial filters are Mean filters and adaptive filters. Wiener filters and geometric filters are the commonly used frequency filters.

The present work assumes a clean document as input for the OCR system. A suitable noise filtering technique can be incorporated into the pre-processing stage in order to account for noisy documents.

Deskewing

Scanned documents often become skewed (slanted) during scanning because of misfeeds or other alignment errors. Skew is the amount of rotation necessary to return an image to horizontal and vertical alignment, and it is measured in degrees. Deskewing is a process whereby skew is removed by rotating an image by the same amount as its skew but in the opposite direction. This results in a horizontally and vertically aligned image where the text runs across the page rather than at an angle. Skew appears visually as a slope of the text lines with respect to the x-axis, and it mainly concerns the orientation of the text lines.

Without skew, lines of the document are horizontal or vertical, depending on the language. Skew can even be intentionally designed to emphasize important details in a document. However, this effect is unintentional in many real cases, and it should be eliminated because it dramatically reduces the accuracy of subsequent processes such as page segmentation and optical character recognition (OCR). Therefore, skew detection is one of the primary tasks to be solved in document image analysis systems, and it is necessary for aligning a document image before further processing.

In the present work we have used the deskewing method due to [Gatos et al. 1997]. Due to text skew, each horizontal text line intersects a predefined set of vertical lines at non-horizontal positions. Using only the pixels on these vertical lines we construct a correlation matrix and evaluate the skew angle of the document with high accuracy. This method has the following advantages.



Efficiency

Instead of using all the image pixels, this method uses only those pixels lying on certain vertical lines defined in the image. This improves the calculation time for skew detection when compared with the Hough transform and other cross-correlation based methods.



Accuracy

This method extracts the document skew with high accuracy. This can be further improved by using more than two vertical lines.



Robustness

The results of the proposed method are robust to the presence of graphics in the document.

When an image is not aligned correctly, optical character recognition (OCR) is more difficult and becomes slower and less accurate. Deskewing the documents beforehand can make the OCR process faster and more accurate.

C. Segmentation

Segmentation is the isolation of characters or words. The majority of optical character recognition algorithms segment the words into isolated characters, which are recognized individually. This segmentation is usually performed by isolating each connected component, that is, each connected black area.

The ultimate goal of the segmentation phase is the extraction of the basic symbols. Prior to this, the binary preprocessed image of the text is used for segregation of lines. Once the lines are segregated, words are segregated within each line, and segments (basic symbols) are identified within each word. Thereafter, a region-growing algorithm is used to extract the primitives.

Segregation of lines:

The first step in the segmentation phase is segregation of lines. Since 1 represents a text pixel in a binarized image, the presence of consecutive zero rows in the image matrix identifies the spacing between lines. The number of consecutive zero rows separating two lines must be greater than or equal to a pre-specified number, which we call the line-threshold (row-threshold).

If the number of consecutive zero rows is greater than or equal to the line-threshold, then the next non-zero row indicates the beginning of a new line, whereas the previous non-zero row indicates the end of the previous line. This is depicted in the figure below.
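The line-threshold rule can be illustrated with a short Python sketch. The function name `find_lines`, the default threshold of 2, and the small test matrix are hypothetical; rows are indexed from 0 here, and the image is assumed to be a list of rows of 0/1 values in which 1 marks a text pixel.

```python
def find_lines(image, line_threshold=2):
    """Return (first_row, last_row) index pairs, one per text line.

    A run of at least line_threshold consecutive all-zero rows
    separates two lines; shorter runs stay inside the current line.
    """
    lines = []
    start = None      # first row of the line currently being built
    last_text = None  # most recent non-zero row seen so far
    for r, row in enumerate(image):
        if any(row):
            if start is None:
                start = r
            elif r - last_text - 1 >= line_threshold:
                # The gap is wide enough: close the previous line, open a new one.
                lines.append((start, last_text))
                start = r
            last_text = r
    if start is not None:
        lines.append((start, last_text))
    return lines


page = [
    [0, 0, 0],
    [1, 0, 1],
    [0, 1, 0],
    [0, 0, 0],
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
    [0, 1, 1],
]
print(find_lines(page, line_threshold=2))  # [(1, 2), (5, 7)]
```

Because the rule only counts empty rows, the same function could in principle be applied to the columns of a single line (for example via `zip(*line_rows)`) with a column threshold in place of the row threshold.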

The following algorithm demarcates the first line in the document image. Prior to the application of the algorithm we clip all the margins from the document image. This is achieved by deleting all extreme zero-rows and columns from the image matrix.

Algorithm for Clipping Margins from Bottom, Top, Left, and Right:

Clipping Bottom Margin:

Step 1: Input: Give any document image as input for clipping the bottom margin.

Step 2: Get the size of the image.

    [m, n] = size(image);

Step 3: Initializations

    xend = m;
    row_index = xend;
    clipmargin = 0;
    zero = zeros(1, n);

Step 4: Loop through the rows, starting from the bottom, until a non-empty row is found

    while (clipmargin == 0)
        if (image(row_index, :) == zero)
            row_index = row_index - 1
        else
            xstart = row_index + 1
            clipmargin = 1
        end // if
    end // while

Step 5: Delete empty rows from the bottom

    image(xstart:xend, :) = [ ]   // delete the rows from xstart to xend
    final_image = image           // assign the image to a new variable final_image
    imshow(final_image)           // display the image after clipping the bottom margin

Example:

a) Original Image

b) After Clipping from bottom

c) After all clippings the image is
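For illustration, clipping all four margins at once can be sketched in Python as follows. This is not the step-by-step procedure above but an equivalent formulation under the assumption that the image is a list of equal-length rows of 0/1 values; the name `clip_margins` and the sample matrix are hypothetical.

```python
def clip_margins(image):
    """Remove all-zero rows and columns from the four edges of a 0/1 image.

    Blank rows or columns that lie between text pixels are kept.
    """
    text_rows = [r for r, row in enumerate(image) if any(row)]
    if not text_rows:
        return []  # blank page: nothing survives the clipping
    top, bottom = text_rows[0], text_rows[-1]
    text_cols = [c for c in range(len(image[0])) if any(row[c] for row in image)]
    left, right = text_cols[0], text_cols[-1]
    return [row[left:right + 1] for row in image[top:bottom + 1]]


page = [
    [0, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],  # interior blank row, kept
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
print(clip_margins(page))  # [[1, 0], [0, 0], [1, 1]]
```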

Segregation of Words:

After a line has been segmented from the group of lines, each word in that line is segmented using consecutive zero columns; that is, the number of zero columns separating two words is checked against a pre-specified value called the word-threshold.

If the number of consecutive zero columns is greater than or equal to the word-threshold, then the next non-zero column indicates the beginning of a new word, whereas the previous non-zero column indicates the end of the previous word. This process is shown in the figure below.

Therefore, with a proper choice of word-threshold, the demarcation of words is done in the same manner as the demarcation of lines, except that columns are scanned instead of rows.

Segregation of basic symbols:

The segments are identified within each word in a similar manner by scanning each word column-wise. In our implementation, a zero column identifies the end of a segment, and the segment is thereby extracted from the given word or line.

IV. ALGORITHMS USED FOR COMPARATIVE

A. Nearest-Neighbor Classifier

The Nearest Neighbor classifier is the most traditionally used classifier in the classification phase and is simple to implement. Two forms of this classifier are:

Edited NN

If we want to modify the NN classifier, the first goal must be to reduce the error. This can be achieved by editing the training samples as follows: for a given data set, perform NN classification among the samples and store only the correctly classified ones. An unknown sample is then classified according to the class of the nearest stored sample. The editing operation can be repeated to obtain a better set of stored samples.

Condensed NN

By storing only samples around the class boundary, we can reduce the number of stored samples. Again, an unknown sample is classified by the class of the nearest stored sample. The performance of this algorithm is close to that of the NN classifier, yet it saves storage space and computation time.

We have implemented the edited form of NN, where Euclidean distance is used as the distance measure. That is,

$$
D_e(x_i, x_j) = \left( \sum_{k=1}^{n} \left( x_{ik} - x_{jk} \right)^2 \right)^{1/2}
$$

where x_i is the unknown sample, x_j is the stored sample, and k runs over the n components of the feature vectors.

Each stored sample represents a class. The unknown sample belongs to the class of the stored sample that is closest to it.
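A minimal Python sketch of nearest-neighbor classification with Euclidean distance, together with one editing pass, is given below. It assumes that "NN classification among the samples" means classifying each training sample by its nearest other sample (a leave-one-out reading, which is an assumption of this sketch). The function names, the two-dimensional feature vectors, and the labels "ka" and "ga" are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def nearest_label(sample, stored):
    """Return the label of the stored (vector, label) pair closest to sample."""
    return min(stored, key=lambda item: euclidean(sample, item[0]))[1]


def edit_training_set(training):
    """Keep only samples whose nearest other sample carries the same label."""
    kept = []
    for i, (vec, label) in enumerate(training):
        others = training[:i] + training[i + 1:]
        if others and nearest_label(vec, others) == label:
            kept.append((vec, label))
    return kept


training = [
    ((0.0, 0.0), "ka"), ((0.1, 0.0), "ka"), ((0.0, 0.1), "ka"),
    ((1.0, 1.0), "ga"), ((1.1, 1.0), "ga"), ((1.0, 1.1), "ga"),
    ((0.4, 0.4), "ga"),  # mislabeled sample lying close to the "ka" group
]
stored = edit_training_set(training)
unknown = (0.3, 0.3)
print(nearest_label(unknown, training))  # ga (pulled by the mislabeled sample)
print(nearest_label(unknown, stored))    # ka (after editing)
```

In this toy data the editing pass discards only the mislabeled sample, so the unknown vector is assigned to the surrounding class instead.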

Classification by using K-means Algorithm

The classification stage uses the features extracted in the feature extraction stage to identify the text segment according to preset rules. Classification is usually accomplished by comparing the feature vectors corresponding to the input character with the representative(s) of each character class, using a distance metric.
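One way to obtain class representatives for such a comparison is to cluster the training feature vectors with K-means and use the resulting centroids. The Python sketch below is illustrative only: it takes the first k samples as initial centroids, uses Euclidean distance, and assumes at least k samples are available. The function names, the iteration cap `max_iter`, and the sample data are assumptions, and how cluster indices map to character classes is left open.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(samples, k, max_iter=100):
    """Cluster samples into k groups; return (centroids, assignments)."""
    centroids = [tuple(s) for s in samples[:k]]  # first k samples as initial centroids
    assignments = None
    for _ in range(max_iter):
        # Group every sample with the centroid at minimum distance.
        new_assignments = [
            min(range(k), key=lambda c: euclidean(s, centroids[c]))
            for s in samples
        ]
        if new_assignments == assignments:
            break  # stable: no sample changed its group
        assignments = new_assignments
        # Recompute the centroid coordinates of every non-empty cluster.
        for c in range(k):
            members = [s for s, a in zip(samples, assignments) if a == c]
            if members:
                centroids[c] = mean(members)
    return centroids, assignments


def classify(sample, centroids):
    """Return the index of the centroid nearest to sample."""
    return min(range(len(centroids)), key=lambda c: euclidean(sample, centroids[c]))


samples = [(0.0, 0.0), (1.0, 1.0), (0.2, 0.0), (0.0, 0.2), (1.2, 1.0), (1.0, 1.2)]
centroids, assignments = kmeans(samples, k=2)
print(assignments)                       # [0, 1, 0, 0, 1, 1]
print(classify((0.9, 0.8), centroids))   # 1
```

An unknown feature vector is compared with k centroids rather than with every stored sample, so the number of distance computations per classification depends on k and not on the size of the training set.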

B. K-means Algorithm

The basic step of k-means clustering is simple. In the beginning, we determine the number of clusters K and assume the centroid (center) of each cluster. Any random objects can be taken as the initial centroids, or the first K objects in sequence can serve as the initial centroids.

Algorithm

The K-means algorithm repeats the three steps below until convergence, that is, until the grouping is stable (no object moves to another group):

1. Determine the centroid coordinates.
2. Determine the distance of each object to the centroids.
3. Group the objects based on minimum distance.

The figure below describes the k-means clustering algorithm. The step-by-step k-means clustering algorithm is:

Step 1. Begin with a decision on the value of k = number of clusters.

Step 2. Put any initial partition that classifies the data into k clusters. You may assign the training samples randomly, or systematically as follows:

1. Take the first k training samples as single-element clusters.
2. Assign each of the remaining (N-k) training samples to the cluster with the nearest centroid. After each assignment, recompute the centroid of the gaining cluster.

Step 3. Take each sample in sequence and compute its distance from the centroid of each of the clusters. If a sample is not currently in the cluster with the closest centroid, switch this sample to that cluster and update the centroids of the cluster gaining the sample and the cluster losing the sample.

Step 4. Repeat step 3 until convergence is achieved, that is, until a pass through the training samples causes no new assignments.

If the number of data points is less than the number of clusters, then we assign each data point as the centroid of a cluster. Each centroid will have a cluster number. If the number of data points is greater than the number of clusters, then for each data point we calculate the distance to all centroids and take the minimum distance. The data point is said to belong to the cluster whose centroid is at minimum distance from it.

V. RESULTS ANALYSIS

A. Analysis

A brief analysis of the entire work is recorded in a document that lists the feature vectors of the characters together with their classifier codes and ISCII codes, along with the characters corresponding to those codes.

Document Generation

The characters that are recognized correctly and those that are not recognized by the Nearest Neighbor classifier and by k-means are also clearly presented in the dictionary.

Difference of NN & K-mean

The Nearest-Neighbor classifier is one of the most traditionally used classifiers and is simple to implement, although it has some drawbacks.

Demerits:

1. The classifier is not able to exploit the full potential of the feature extraction technique.
2. Its time complexity is higher.

In the present work it is observed that the Nearest-Neighbor classifier does not recognize some characters properly, for example:

Symbol 'aye'

The symbol ‘aye’ above is not recognized by the Nearest-Neighbor classifier, and neither are the characters that contain ‘aye’ as a part.

B. Accuracy

In [1-4], characters were recognized using the Nearest Neighbor classifier, and the accuracy reached only up to 96%. Here, using K-means in the classification stage, the accuracy increased up to 98%. Thus, using the K-means algorithm in the classification stage lets us recognize the characters effectively and efficiently.

Time Analysis

Timing Functions:

The time taken to recognize the characters is computed by using two timing functions, tic/toc and cputime.

Tic & Toc

A stopwatch timer.

Syntax:

    tic
    // any statements
    toc
    t = toc

Description: tic starts a stopwatch timer. toc prints the elapsed time since tic was used. t = toc returns the elapsed time in t.

CPU Time

Elapsed CPU time.

Syntax:

    cputime

Description: cputime returns the total CPU time (in seconds) used by MATLAB from the time it was started. This number can overflow the internal representation and wrap around.

For the above text, the time taken by NN to recognize all the characters: 27.64 sec

Time taken by K-means to recognize all the characters: 15.02 sec

Checking Output against Input:

After the characters are recognized, we have to check whether each character was recognized correctly. In the existing system, the resulting output image is checked manually to determine whether the characters were recognized correctly. In this project, an algorithm is developed to view the output image and the input image together so that the accuracy of the recognizer can be checked visually.

For the sample text below, the output is displayed so that the characters can be checked visually.

Sample text:

The CPU time consumed by a block of statements can be measured with cputime as follows:

    t = cputime;
    // statements
    e = cputime - t;

Here e holds the time taken by the statements to execute. The main drawback of the NN classifier is its time complexity: it takes more time to recognize characters than the K-means classifier.
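For readers working outside MATLAB, the same two kinds of measurement (elapsed wall-clock time and CPU time) can be sketched in Python with the standard `time` module. The helper name `timed` and the workload (`sum` over a range) are hypothetical stand-ins for a recognition routine; no figures from this paper are reproduced by this sketch.

```python
import time


def timed(func, *args):
    """Run func(*args) and return (result, wall_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()  # stopwatch start, similar in role to tic
    cpu_start = time.process_time()   # CPU time so far, similar in role to t = cputime
    result = func(*args)
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return result, wall, cpu


# Stand-in workload; a real run would call the recognizer instead.
result, wall, cpu = timed(sum, range(1_000_000))
print(f"wall: {wall:.3f} s, cpu: {cpu:.3f} s")
```

Wall-clock time includes any waiting on disk or display, whereas CPU time counts only processor time used by the process, so the two values may differ for the same run.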


Sample Test Document (scanned image) Results

When recognizing characters with the Nearest Neighbor classifier and with K-means, it is observed that the K-means algorithm takes less time than the Nearest Neighbor classifier. The computing time for recognition was measured for a number of documents; in each case, K-means recognized the characters in a shorter time. We now present a test document in the Telugu language.

Test Document: The following is a test document taken after scanning from “Sanathana Sarathi”.

Case i: In this case we used Uniform Sampling as the feature extraction method, and the Nearest Neighbor and K-means algorithms for classification. The following are examples of the two different cases.

Example 1:

The screenshots show the time taken by the Nearest Neighbor classifier to recognize the text given above.

Time taken by Nearest Neighbor to recognize: 9.625 seconds

In the above case we used Uniform Sampling as the feature extraction method and Nearest Neighbor as the classifier.

Time taken by K-means to recognize: 5.390 seconds

In this case we used Uniform Sampling as the feature extraction method and the K-means algorithm for classification.

stage to recognize characters. However, when the DCT extraction method is used with the Nearest Neighbor classifier, recognition takes longer than with DCT and k-means. Comparing Uniform Sampling and DCT across the classification algorithms, Uniform Sampling is the better of the two.

The following examples show the time taken by DCT with Nearest Neighbor and with K-means to recognize characters.

Even for 4 characters, DCT takes a larger amount of time.

Example 1:

Case i: DCT with Nearest Neighbor

Time taken to recognize: 8.172 seconds

Case ii : DCT with k-means

Case iii : Uniform sampling with Nearest neighbor

Case iv: Uniform sampling with K-means

VI. CONCLUSION

The existing system was primarily designed for Telugu, and the accuracy of the recognizer was low. Every technique discussed so far has performed better than the existing system. We implemented the classification of input feature vectors into predetermined character classes using the k-means algorithm, and we compared and analyzed the time taken for recognition by Uniform Sampling with the Nearest Neighbor classifier and by Uniform Sampling with the k-means algorithm. Bugs occurring in the segmentation part of the existing system [3] were fixed. We also developed an algorithm to view the output image and the input image so that the accuracy of the recognizer can be checked visually. The accuracy rate increased compared with the existing system: with the Nearest Neighbor classifier, the accuracy reached only up to 96% (in the previous works), whereas in the proposed work, using K-means in the classification stage, the accuracy increased up to 98%. Thus, using the K-means algorithm in the classification stage lets us recognize the characters effectively and efficiently.

REFERENCES

(Hindi)’, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 19, (1997 b).

[2] Chaudhuri B. B. and Pal U., ‘A Complete Printed BANGLA OCR System’, Pattern Recognition, Vol. 31, No. 5, (1998), pp. 521-549.

[3] Sachwani, Praveen, ‘An OCR System for Printed Dravidian Scripts Using Uniformly Sampled Feature Extraction Method’, M.Tech Project, Sri Sathya Sai Institute of Higher Learning, 2001.

[4] C. Vasantha Lakshmi, Ritu Jain, and C. Patvardhan, ‘OCR of Printed Telugu Text with High Recognition Accuracies’, Dayalbagh Educational Institute.

[5] ‘OCR of Printed Telugu Text with High Recognition Accuracies’.

[6] Chaudhuri B. B., Kumar O. A., and Ramana K. V., ‘Automatic Generation and Recognition of Telugu Script Characters’, Jour. Instn. Electronics and Telecom. Engrs., Vol. 37, No. 5&6, (1991).

[7] Cho-Huak Teh and Roland T. Chin, ‘On Image Analysis by the Methods of Moments’, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 10, No. 4, (July 1988).

[8] Umapada Pal, ‘Handwritten Character Recognition of Popular South Indian Scripts’, Computer Vision and Pattern Recognition Unit, Indian Statistical Institute, Kolkata, India.

[9] Daniel S. Le, George R. Thoma, and Harry Wechsler, ‘Automated Page Orientation and Skew Angle Detection for Binary Document Images’, Pattern Recognition, Vol. 27, No. 10, (1994), pp. 1325-1344.

[10] Gatos B., Papamarkos N., and Chamzas C., ‘Skew Detection and Text Line Position Determination in Digitized Documents’, Pattern Recognition, Vol. 30, No. 9, (1997), pp. 1505-1519.

[11] Glauberman M. H., ‘Character Recognition for Business Machines’, Electronics, (February 1956), pp. 132-136.

[12] Gonzalez R. C. and Woods R. E., ‘Digital Image Processing’, Addison-Wesley Publishing Company, (1988).

[13] Mohamed Cheriet, Nawwaf Kharma, and Cheng-Lin Liu, ‘Character Recognition Systems: A Guide for Students and Practitioners’, 2007.

AUTHOR’S PROFILE

Miss M. Sharmila Devi is presently working as Asst. Professor in RGM College of Engineering and Technology, Kurnool-518501, A.P., India. She received B.Tech and M.Tech degrees in Computer Science and Engineering from JNTUA. She has published 1 paper in an international journal and 3 papers in national and international conferences. Her areas of interest include Digital Image Processing, Data Mining, Soft Computing, and Computer Networks.

Mr. Y. Gangadhar is presently working as Assoc. Professor in Kuppam Engineering College, Chittoor-517425, A.P., India. He received B.Tech and M.Tech degrees in Computer Science and Engineering from JNTUA. He has published 6 papers in national and international journals and 6 papers in national and international conferences. His areas of interest include Digital Image Processing, Data Mining, Soft Computing, and Computer Networks.

Dr. V. S. Giridhar Akula is presently working as Professor and Principal in Avanthi’s Scientific Technological & Research Academy, Hyderabad. He received B.E., M.Tech., and Ph.D. degrees in Computer Science and Engineering from JNTUA. Dr. Giridhar has written 6 textbooks and published 25 papers in national and international journals. He acts as an editor and reviewer for many national and international journals. His areas of interest include Digital Image Processing, Computer Networks, Computer Graphics, and Artificial Intelligence.