International Journal of Research in Information Technology (IJRIT)

www.ijrit.com ISSN 2001-5569

Performance Evaluation of Smoothing Techniques for Arabic Character Recognition

Mr. Ahmed F. Raaid¹, Dr. Hisham S. Rafid²

¹PhD candidate, Department of Computer Science, University of Baghdad, Baghdad, Iraq

[email protected]

²Professor, Department of Computer Science, University of Baghdad, Baghdad, Iraq

[email protected]

Abstract

Optical character recognition (OCR) can benefit from an N-gram language model for detecting and correcting errors in its output. An N-gram language model assigns a probability to any N-gram that occurs in the training corpus, but it cannot estimate a probability for words that are missing from that corpus. Smoothing techniques are used to avoid assigning a probability of zero to such unseen words. Every smoothing technique has strengths and weaknesses, and researchers usually choose one according to the application at hand. Hence, the goal of this research is to evaluate the performance of these techniques on an Arabic dataset in order to select the most suitable one for this language. The experimental results show that Katz Backoff performs best among the smoothing techniques evaluated.

Keywords: Smoothing Techniques Evaluation, Character Recognition, Error Rate

1. Introduction

Optical character recognition (OCR) is the process of extracting information such as text, words, or car plate numbers from images [1]. Errors in OCR output text are either non-word errors (NWE) or real-word errors (RWE) [2]. An NWE is a word produced by the OCR process that does not exist in the vocabulary, while an RWE is a word produced by the OCR process that exists in the vocabulary but does not fit the surrounding phrase [3, 4]. Arabic usually has a higher OCR error rate than English.

The OCR error rate is high because of the unique properties of the Arabic script. For instance, Arabic letters are joined, and neighboring letters overlap vertically in writing [5-7]. These characteristics cause a large error rate during the OCR process. Furthermore, the OCR error rate rises when the images are old or when their scanning resolution is less than 100 dpi [8, 9].

The N-gram language model achieves a significant reduction in OCR error rate compared to other models and techniques [10-12]. This is because it can detect and correct both NWE and RWE, while the others can only detect and correct NWE. An N-gram language model is a statistical model that depends on the frequency of word sequences in a large corpus. It can be used to measure the probability of a single word or a single phrase [13]. The model is called a unigram, bigram, or trigram model when N equals one, two, or three respectively. It measures the probability of any word using Eq. (1) below [13]:

$$P(W_k \mid W_{k-N+1}^{k-1}) = \frac{C(W_{k-N+1}^{k-1}\, W_k)}{C(W_{k-N+1}^{k-1})} \qquad (1)$$

In Eq. (1), the probability of a word W_k at position k is calculated from the preceding part of the sentence. The value N is the order of the language model, and C is the frequency of a phrase in the corpus. Although the N-gram language model gives good results in OCR post-processing, it has a major weakness: it cannot assign a probability to words that are missing from the training corpus [13]. This weakness is overcome by a process called smoothing.
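As a minimal sketch of the count-ratio estimate described for Eq. (1), the following Python fragment builds N-gram counts from a toy corpus and shows that an unseen trigram receives a probability of zero. The function names and the English toy corpus are illustrative assumptions and are not taken from the paper; Arabic tokens would be handled in the same way.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every n-gram (stored as a tuple) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def mle_probability(word, history, counts_n, counts_hist):
    """Count-ratio estimate: C(history + word) / C(history)."""
    hist = tuple(history)
    denom = counts_hist[hist]
    if denom == 0:
        return 0.0  # the history itself never occurs in the corpus
    return counts_n[hist + (word,)] / denom

# Toy corpus, for illustration only (hypothetical data).
corpus = "the cat sat on the mat the cat ate".split()
tri = ngram_counts(corpus, 3)
bi = ngram_counts(corpus, 2)

p_seen = mle_probability("sat", ["the", "cat"], tri, bi)    # trigram occurs in corpus
p_unseen = mle_probability("ran", ["the", "cat"], tri, bi)  # trigram never occurs
print(p_seen, p_unseen)
```

The zero returned for the unseen trigram is the weakness that smoothing is meant to address.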

Smoothing estimates a probability for unknown words even though they are unseen in the corpus [14]. The main smoothing techniques are Laplace smoothing, Linear Interpolation, Katz Backoff, and Kneser-Ney smoothing. Each smoothing technique has advantages and disadvantages, and the appropriate technique depends on where it will be used [13]. Therefore, the purpose of this paper is to compare the performance of smoothing techniques and to identify the technique with the greatest ability to detect and correct OCR errors for the Arabic language.

This study consists of five main sections. The first section presents the background of this research. The second section explains in detail the operation, advantages, and disadvantages of each smoothing technique. The third section presents the experimental design, the metrics, and the testing dataset. The fourth section describes the development of a prototype to test the compared techniques and reports the results of the experiments. Finally, the fifth section contains the summary of the research and future directions for this work.

2. Main Comparative Techniques

Four main smoothing techniques are discussed in this section. The goal is to show how each technique produces its probability estimates.

2.1 Laplace Smoothing (LS)

The discounting category consists of four main techniques: Laplace smoothing, Witten-Bell, absolute discounting, and Good-Turing [13]. The idea of this category is to estimate the probability of a missing N-gram by taking a small amount of probability from the existing N-grams. This process decreases the accuracy of the language model, because the resulting values are not true probabilities. However, it avoids giving a probability of zero to any unseen N-gram [13, 15]. The working principle of the discounting techniques is simple, but most researchers no longer use them [11, 13]. Since all discounting techniques share a similar working principle, this paper chooses Laplace smoothing (LS) as one of the compared techniques. The effect of Laplace smoothing on Eq. (1) is shown in Eq. (2). The change to Eq. (1) is needed to avoid giving a probability of zero to any missing N-gram [13].

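$$\hat{P}(W_k \mid W_{k-N+1}^{k-1}) = \frac{C(W_{k-N+1}^{k-1}\, W_k) + \beta}{C(W_{k-N+1}^{k-1}) + \beta D} \qquad (2)$$

Note that the term P^ refers to the estimated probability calculated by this technique. The value “D” represents all N1 grams in a specific corpus. The value of “β” should be greater than zero and less than one.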

2.2 Linear Interpolation (LI)

This technique estimates the probability of a missing N-gram from the hierarchy of lower-order N-grams [11, 13]. In other words, linear interpolation uses several N-gram orders to estimate the probability of a missing N-gram. Therefore, if an N-gram is missing, the language model can rely on the other N-gram orders to estimate its probability. For illustration, Eq. (3) gives the statistical expression of a trigram language model using this technique. It shows that the probability is always measured as a weighted sum of the unigram, bigram, and trigram probabilities, as shown below [13]:

$$\hat{P}(W_k \mid W_{k-2} W_{k-1}) = \alpha_1 P(W_k) + \alpha_2 P(W_k \mid W_{k-1}) + \alpha_3 P(W_k \mid W_{k-2} W_{k-1}) \qquad (3)$$

Note that the individual values of α1, α2, and α3 should each be greater than zero and less than one, and they must sum to one. This paper assigns values of 0.2, 0.3, and 0.5 to the unigram, bigram, and trigram respectively, because higher-order N-grams are more accurate than lower-order ones. The weakness of this technique is that it can give wrong results in some situations. For instance, assume that the words “X” and “Y” cannot occur together in the same sentence and that the individual probability of “X” is large. A bigram language model will then assign the high probability of “X” to the phrase “X Y”, even though “X” and “Y” cannot occur together [16].
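The weighted sum can be sketched in a few lines of Python. The weights 0.2, 0.3, and 0.5 are the ones stated in the text; the function names, the helper `ratio`, and the toy corpus are assumptions introduced only for illustration.

```python
from collections import Counter

def ratio(numerator, denominator):
    """Safe division that returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0

def interpolated_probability(word, prev2, prev1, uni, bi, tri,
                             alphas=(0.2, 0.3, 0.5)):
    """Weighted sum of unigram, bigram, and trigram count-ratio estimates."""
    a1, a2, a3 = alphas
    p_uni = ratio(uni[word], sum(uni.values()))
    p_bi = ratio(bi[(prev1, word)], uni[prev1])
    p_tri = ratio(tri[(prev2, prev1, word)], bi[(prev2, prev1)])
    return a1 * p_uni + a2 * p_bi + a3 * p_tri

# Toy corpus, for illustration only (hypothetical data).
corpus = "the cat sat on the mat the cat ate".split()
uni = Counter(corpus)
bi = Counter(zip(corpus, corpus[1:]))
tri = Counter(zip(corpus, corpus[1:], corpus[2:]))

# "on the cat" never occurs as a trigram, but "the cat" and "cat" do,
# so the lower-order terms still contribute a non-zero probability.
p = interpolated_probability("cat", "on", "the", uni, bi, tri)
print(p)
```

Because every order always contributes, a frequent word can receive a noticeable probability even in a context where it never appears, which is the weakness described above.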

2.3 Katz Backoff (KB)

This technique is also based on the hierarchy of lower-order N-grams [11, 13]. The idea is that if an N-gram is missing, its probability is measured using the (N-1)-gram, and if that is also missing, the technique switches to the (N-2)-gram, and so on. In other words, it measures the probability of the lower-order N-gram only when the probability of the higher-order N-gram is zero [13, 17]. This technique is commonly used by researchers to correct OCR errors. Eq. (4) gives the statistical expression of a trigram language model using this technique [17].

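$$\hat{P}(W_k \mid W_{k-2} W_{k-1}) = \begin{cases} P(W_k \mid W_{k-2} W_{k-1}) & \text{if } C(W_{k-2} W_{k-1} W_k) > 0 \\ \alpha_2\, P(W_k \mid W_{k-1}) & \text{else if } C(W_{k-1} W_k) > 0 \\ \alpha_1\, P(W_k) & \text{otherwise} \end{cases} \qquad (4)$$

Note that the individual values of α1 and α2 should be greater than zero and less than one [12]. This research assigns values of 0.2 and 0.3 to the unigram and bigram respectively. KB suffers from the same limitation as LI. For instance, assume that the words “A”, “B”, and “C” cannot occur together in the same sentence and that the probability of “A B” is large. A trigram language model will then assign the high probability of “A B” to the phrase “A B C”, even though “A”, “B”, and “C” cannot occur together.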
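The back-off behaviour described in this subsection can be sketched as below. This is a simplified fixed-weight back-off that follows the description in the text (weights 0.2 and 0.3 for the unigram and bigram); the full Katz method additionally uses discounted probabilities and computed back-off weights, which are not modelled here. The function name and the toy corpus are assumptions for illustration.

```python
from collections import Counter

def backoff_probability(word, prev2, prev1, uni, bi, tri,
                        alpha_uni=0.2, alpha_bi=0.3):
    """Use the trigram if it was seen, else the weighted bigram, else the weighted unigram."""
    if tri[(prev2, prev1, word)] > 0:
        return tri[(prev2, prev1, word)] / bi[(prev2, prev1)]
    if bi[(prev1, word)] > 0:
        return alpha_bi * bi[(prev1, word)] / uni[prev1]
    return alpha_uni * uni[word] / sum(uni.values())

# Toy corpus, for illustration only (hypothetical data).
corpus = "the cat sat on the mat the cat ate".split()
uni = Counter(corpus)
bi = Counter(zip(corpus, corpus[1:]))
tri = Counter(zip(corpus, corpus[1:], corpus[2:]))

p_trigram = backoff_probability("sat", "the", "cat", uni, bi, tri)  # trigram seen
p_bigram = backoff_probability("cat", "on", "the", uni, bi, tri)    # backs off once
p_unigram = backoff_probability("mat", "the", "cat", uni, bi, tri)  # backs off twice
print(p_trigram, p_bigram, p_unigram)
```

Unlike linear interpolation, only one N-gram order contributes to each estimate, and lower orders are consulted only when the higher-order count is zero.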

2.4 Kneser-Ney (KN)

The idea of this technique is that a word that occurs in many different phrases, even if rarely, is a better candidate than a word that occurs frequently but only in a few similar phrases [13, 18]. For instance, assume that the phrase (“A B __”) has two candidate words, “C” and “D”, to fill it, and that “C” is more frequent than “D”. The previous techniques LS, LI, and KB will select “C” to complete the phrase. However, if “C” occurs only after the word “E”, while “D” occurs in many different phrases, KN will select “D” rather than “C”. KN adds this context information to the KB equation or to the LI equation, because it builds on one of them [13].

KN can give wrong results in some situations [18]. For instance, assume that the phrase (“I want Thao food”) has two candidate words, “Thai” and “Chinese”, to replace the incorrect word “Thao”, and that “Chinese” occurs in more distinct phrases than “Thai”. KN will then select “Chinese” to complete the phrase, even though “Thai” is the better choice [13]. This research implements KN by adding context information to the KB equation.
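The context information that KN relies on can be illustrated with the continuation probability, which counts how many distinct words precede a candidate rather than how often the candidate occurs. The sketch below shows only this component, not the full KN estimate and not the authors' combination with the KB equation; the function name and the artificial token sequence are assumptions for illustration.

```python
from collections import Counter

def continuation_probability(word, bigram_counts):
    """Share of distinct bigram types that end in `word`."""
    distinct_left_contexts = sum(1 for (_, right) in bigram_counts if right == word)
    return distinct_left_contexts / len(bigram_counts)

# Artificial data: "C" is frequent but only ever follows "E",
# while "D" is less frequent but follows three different words.
tokens = "E C E C E C E C A D B D F D".split()
uni = Counter(tokens)
bi = Counter(zip(tokens, tokens[1:]))

freq_c, freq_d = uni["C"], uni["D"]
cont_c = continuation_probability("C", bi)
cont_d = continuation_probability("D", bi)
print(freq_c, freq_d, cont_c, cont_d)
```

Raw frequency favours “C”, while the continuation probability favours “D”, which mirrors the example of the candidates “C” and “D” given above.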

3. Evaluation Process Setting

This section presents the experimental design, the metrics used, and the details of the testing dataset.

3.1 Experimental Design

Fig. 1 shows how each experiment is conducted. There are five experiments, and all of them are implemented and tested.

Fig. 1 shows that the Tesseract engine is used to extract text from the testing images. This study uses Tesseract because most researchers use it when developing and evaluating OCR post-processing methods [19-21]. The first experiment does not perform smoothing, because its goal is to measure the OCR error rate without a smoothing process. The LS, LI, KB, and KN techniques are implemented in experiments 2, 3, 4, and 5 respectively. A trigram language model is used in all experiments except experiment one, and each of these experiments applies a different smoothing technique.

Fig. 1 Experiment details

Fig. 1 also shows that a tokenization operation is necessary to divide the text into an array of words [10]. After that, the alignment process is performed. Its goal is to place the OCR text in parallel with the standard text. Alignment is important because the length of the OCR output may differ from the length of the standard text [22, 23], since the OCR process may misrecognize, delete, or insert some characters [22]. The idea of the alignment task is that each symbol in the OCR output text is placed at the same position as the equal symbol in the standard text. This task is approximate because wrong alignments can occur. Furthermore, the probability of a wrong alignment increases when the number of sequences that need alignment is more than two, or when the number of characters in each sequence is large [21, 24]. An example of an alignment task is shown in Fig. 2.
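The alignment task can be sketched with a character-level edit-distance (Levenshtein) table followed by a back trace. This is a minimal illustration under stated assumptions: the function name `align`, the gap symbol "-", and the English example strings are hypothetical, and ties between equally good alignments are broken in favour of the diagonal move.

```python
def align(ocr, truth, gap="-"):
    """Align two strings with Levenshtein distance and a back trace."""
    n, m = len(ocr), len(truth)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ocr[i - 1] == truth[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + cost,  # match or substitution
                             dist[i - 1][j] + 1,         # extra character in OCR text
                             dist[i][j - 1] + 1)         # character missing from OCR text
    # Back trace from the bottom-right corner to recover the alignment.
    out_ocr, out_truth = [], []
    i, j = n, m
    while i > 0 or j > 0:
        cost = 1
        if i > 0 and j > 0:
            cost = 0 if ocr[i - 1] == truth[j - 1] else 1
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + cost:
            out_ocr.append(ocr[i - 1]); out_truth.append(truth[j - 1])
            i -= 1; j -= 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            out_ocr.append(ocr[i - 1]); out_truth.append(gap)
            i -= 1
        else:
            out_ocr.append(gap); out_truth.append(truth[j - 1])
            j -= 1
    return "".join(reversed(out_ocr)), "".join(reversed(out_truth)), dist[n][m]

aligned_ocr, aligned_truth, distance = align("recogntion", "recognition")
print(aligned_ocr)
print(aligned_truth)
print(distance)
```

The two returned strings have equal length, so each position of the OCR text can be compared directly with the same position of the standard text. The table has (n+1) by (m+1) cells, which is why the approach becomes slow for long sequences.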

The standard Levenshtein algorithm with back trace is used to perform the alignment process [25]. This algorithm is well suited to finding the difference between two sequences [3, 10]. Furthermore, it is chosen in this study because many researchers have used it in different fields. However, this algorithm is slow when the sequences are long [10, 26, 27].

Lastly, four metrics are measured: character error rate (CER), word error rate (WER), real-word error rate (RWER), and non-word error rate (NWER). This study uses them because most researchers use WER, CER, NWER, and RWER to calculate the OCR error rate [2, 28, 29]. The mathematical expressions for these metrics are given in Eqs. (5), (6), (7), and (8) [10, 22, 30].
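As a hedged sketch of how such metrics are commonly computed, the fragment below uses the widely used edit-distance definitions of CER and WER and a simple vocabulary lookup to separate non-word from real-word errors. These definitions are assumptions and may differ from the paper's Eqs. (5) to (8); the function names, the toy vocabulary, and the requirement that the two word lists are already aligned position by position are also assumptions.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hypothesis) + 1))
    for i, r in enumerate(reference, start=1):
        curr = [i]
        for j, h in enumerate(hypothesis, start=1):
            curr.append(min(prev[j - 1] + (r != h),  # match or substitution
                            prev[j] + 1,             # deletion
                            curr[j - 1] + 1))        # insertion
        prev = curr
    return prev[-1]

def cer(reference, ocr_output):
    """Character error rate: character edits divided by reference length."""
    return edit_distance(reference, ocr_output) / len(reference)

def wer(reference, ocr_output):
    """Word error rate: word edits divided by number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, ocr_output.split()) / len(ref_words)

def error_types(ref_words, ocr_words, vocabulary):
    """Count (non-word, real-word) errors in two position-aligned word lists."""
    non_word = real_word = 0
    for ref, ocr in zip(ref_words, ocr_words):
        if ocr == ref:
            continue
        if ocr in vocabulary:
            real_word += 1   # valid word, but not the intended one
        else:
            non_word += 1    # not a word of the vocabulary at all
    return non_word, real_word

reference = "I want Thai food"
ocr_output = "I want Thao food"
vocabulary = {"I", "want", "Thai", "Chinese", "food"}  # hypothetical lexicon

print(cer(reference, ocr_output), wer(reference, ocr_output))
print(error_types(reference.split(), ocr_output.split(), vocabulary))
```

Dividing the two counts returned by `error_types` by the number of reference words would give one possible form of NWER and RWER under the same assumptions.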

3.2 Testing Images

It is difficult to find a standard Arabic testing dataset because each researcher uses a different number of images in testing. Furthermore, some researchers use testing datasets that contain only a single word in each image, while others use a single sentence in each image [2, 10, 30]. This study generates testing images using the same steps performed by [30]. The dataset is large and contains 101259 Arabic characters. The text within the images was taken at random from the Internet. Furthermore, the dataset consists of eight different fonts and six different sizes. The text of the dataset is considered the standard (ground truth) against which the metrics of this study are measured. The testing images are scanned at 310 dpi in grey scale.

Fig. 2 Alignment process

4. Testing Results

Fig. 3 shows the testing results of this research, where the symbol “NS” denotes experiment one (no smoothing). From Fig. 3, it can be seen that the OCR error rate is high for the Arabic language in all experiments. In addition, the values of WER, NWER, RWER, and CER for KB are the best compared to those of the other techniques.

Fig. 3 Testing Results

5. Conclusion and Future Directions

In this paper, a practical evaluation of the main smoothing techniques is presented. All smoothing techniques were tested on the same testing images. A large Arabic testing dataset was used in order to increase the validity of the evaluation. The testing results show that KB is the best among the smoothing techniques evaluated. A future direction of this study is to enhance one of the current smoothing techniques in order to reduce the OCR error rate for the Arabic language. Furthermore, there is a need to merge several existing OCR post-processing techniques in order to benefit from the individual advantages of each technique.

References

[1] D. N. Barnes, "The Text Contains its Own Lexicon: Extracting a Spelling Reference in the Presence of OCR Errors," 2011.

[2] Y. Bassil and M. Alwani, "Ocr post-processing error correction algorithm using google online spelling suggestion," arXiv preprint arXiv:1204.0191, 2012.

[3] J. F. Daðason, "Post-Correction of Icelandic OCR Text," (Master's thesis, University of Iceland, Reykjavik, Iceland), 2012.

[4] K. Kukich, "Techniques for automatically correcting words in text," ACM Computing Surveys (CSUR), vol. 24, pp. 377-439, 1992.

[5] I. Aljarrah, O. Al-Khaleel, K. Mhaidat, M. a. Alrefai, A. Alzu'bi, and M. Rabab'ah, "Automated System for Arabic Optical Character Recognition with Lookup Dictionary," Journal of Emerging Technologies in Web Intelligence, vol. 4, pp. 362-370, 2012.

[6] M. Labidi, M. Khemakhem, and M. Jemni, "Grid’5000 Based Large Scale OCR Using the DTW Algorithm: Case of the Arabic Cursive Writing," Recent Advances in Document Recognition and Understanding, p. 73, 2011.

[7] M. Oujaoura, R. El Ayachi, M. Fakir, B. Bouikhalene, and B. Minaoui, "Zernike moments and neural networks for recognition of isolated Arabic characters," International Journal of Computer Engineering Science, vol. 2, pp. 17-25, 2012.

[8] H. Al-Rashaideh, "Preprocessing phase for Arabic Word Handwritten Recognition," Информационные процессы, vol. 6, 2006.

[9] M. S. Khorsheed, "Off-line Arabic character recognition–a review," Pattern analysis & applications, vol. 5, pp. 31-45, 2002.

[10] I. Q. Habeeb, S. A. Yusof, and F. B. Ahmad, "Two Bigrams Based Language Model for Auto Correction of Arabic OCR Errors," JDCTA: International Journal of Digital Content Technology and its Applications, vol. 8, pp. 72 - 80, February 28 2014.

[11] A. Islam and D. Inkpen, "Real-word spelling correction using Google Web 1T n-gram with backoff," in Natural Language Processing and Knowledge Engineering, 2009. NLP-KE 2009. International Conference on, 2009, pp. 1-8.

[12] V. Gupta, M. Lennig, and P. Mermelstein, "A language model for very large-vocabulary speech recognition," Computer Speech & Language, vol. 6, pp. 331-344, 1992.

[13] D. Jurafsky and J. H. Martin, Speech and language processing: An introduction to natural language processing, computational linguistics, and speech recognition, 2nd ed.: Pearson Education India, 2009.

[14] D. Jurafsky and J. H. Martin, Speech and language processing: An introduction to natural language processing, computational linguistics, and speech recognition: Pearson Education India, 2000.

[15] W. Gale and K. Church, "What is wrong with adding one," Corpus-based research into language, pp. 189-198, 1994.

[16] C. D. Manning and H. Schütze, Foundations of statistical natural language processing: MIT press, 1999.

[17] S. Katz, "Estimation of probabilities from sparse data for the language model component of a speech recognizer," Acoustics, Speech and Signal Processing, IEEE Transactions on, vol. 35, pp. 400-401, 1987.

[18] B.-J. P. Hsu, "Language modeling for limited-data domains," Massachusetts Institute of Technology, 2009.

[19] C. Patel, A. Patel, and D. Patel, "Optical character recognition by open source OCR tool tesseract: A case study," International Journal of Computer Applications, vol. 55, pp. 50-56, 2012.

[20] Google Inc. (2015, January 02). Tesseract-ocr v3.02. Available: https://code.google.com/p/tesseract-ocr/

[21] W. B. Lund, D. J. Kennard, and E. K. Ringger, "Combining multiple thresholding binarization values to improve OCR output," in IS&T/SPIE Electronic Imaging, 2013, pp. 86580R-86580R-11.

[22] I. Q. Habeeb, S. A. Yusof, and F. B. Ahmad, "Improving Optical Character Recognition Process for Low Resolution Images," IJACT: International Journal of Advancements in Computing Technology, vol. 6, pp. 13 - 21, May 30 2014.

[23] W. B. Lund, D. D. Walker, and E. K. Ringger, "Progressive alignment and discriminative error correction for multiple OCR engines," in Document Analysis and Recognition (ICDAR), 2011 International Conference on, 2011, pp. 764-768.

[24] C. Notredame, "Recent evolutions of multiple sequence alignment algorithms," PLoS computational biology, vol. 3, p. e123, 2007.

[25] X. Cai, "Approximate Sequence Alignment," Peking University, 2013.

[26] K. U. Schulz and S. Mihov, "Fast string correction with Levenshtein automata," International Journal on Document Analysis and Recognition, vol. 5, pp. 67-85, 2002.

[27] P. Mitankin, "Universal levenshtein automata. building and properties," Master’s thesis, Sofia University, Bulgaria, 2005.

[28] W. B. Lund, E. K. Ringger, and D. D. Walker, "How well does multiple OCR error correction generalize?," in IS&T/SPIE Electronic Imaging, 2013, pp. 90210A-90210A-13.

[29] W. B. Lund and E. K. Ringger, "Improving optical character recognition through efficient multiple system alignment," in Proceedings of the 9th ACM/IEEE-CS joint conference on Digital libraries, 2009, pp. 231-240.

[30] M. S. M. El-Mahallawy, "A large scale HMM-based omni font-written OCR system for cursive scripts," (PhD thesis, Cairo University, Cairo, Egypt), 2008.