
3.6 Database Implementation Methodology

3.6.4 Ground Truth

Ground truth (GT) refers to the information that describes the attributes of each entry in the database. These ground truth annotations may include the number of words, the number of PAWs, the character sequence, the font type, the font size and so on. GT data plays a vital role in recognition system development by providing the information about the written text that is needed for text recognition. The availability of an electronic corpus facilitates the automatic generation of GT files.


In this database each word image is accompanied by GT data in XML file format describing the image at word level. Figure 3.12 shows an example of an XML file at word level. The following GT information is available for each word image:

• Database name “Quranic_MP_Database”
• Lexicon word reference identifier
• Arabic word
• Number of PAWs
• Number of letters
• Word image file name
• Corpus name “The Holy Qura'n”
• Word identifier (refers to its location in the corpus)
• Writing instrument
• Binarisation
• Resolution
• Subset name
• Font identifier
• Font name
• Font size
• Font style


Figure 3.12: GT in XML file format for the word (بنورهم)
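As a rough illustration of how a word-level GT record of this kind might be generated automatically from corpus text, the following Python sketch builds an XML entry containing a few of the fields listed above. The tag names, the sample word, the file name, the lexicon identifier and the font values are all hypothetical and are not taken from the database. The PAW splitting rule is also a simplified assumption: a PAW is closed after any letter that does not join to its successor, and the input is expected to carry no diacritics.

```python
import xml.etree.ElementTree as ET

# Letters that join only to the preceding letter, so a PAW ends after them.
# Simplified, assumed rule set; not taken from the thesis.
RIGHT_JOINING = set("اأإآٱدذرزوؤة")
HAMZA = "ء"  # joins to neither side, so it forms a PAW of its own


def split_paws(word):
    """Split an undiacritised Arabic word into pieces of Arabic word (PAWs)."""
    paws, current = [], ""
    for ch in word:
        if ch == HAMZA:
            if current:
                paws.append(current)
            paws.append(ch)
            current = ""
        else:
            current += ch
            if ch in RIGHT_JOINING:
                paws.append(current)
                current = ""
    if current:
        paws.append(current)
    return paws


def build_word_gt(word, image_file, lexicon_id, font_name, font_size):
    """Build a word-level GT record; all tag names are hypothetical."""
    root = ET.Element("WordGT", {"database": "Quranic_MP_Database"})
    ET.SubElement(root, "Corpus").text = "The Holy Qura'n"
    ET.SubElement(root, "LexiconWordId").text = str(lexicon_id)
    ET.SubElement(root, "ArabicWord").text = word
    ET.SubElement(root, "NumberOfPAWs").text = str(len(split_paws(word)))
    ET.SubElement(root, "NumberOfLetters").text = str(len(word))
    ET.SubElement(root, "ImageFile").text = image_file
    ET.SubElement(root, "Font", {"name": font_name, "size": str(font_size)})
    return root


# Hypothetical sample values, used only to exercise the functions.
record = build_word_gt("بنورهم", "word_0001.bmp", 17, "Arial", 14)
xml_text = ET.tostring(record, encoding="unicode")
```

Deriving the letter and PAW counts from the corpus string, rather than typing them by hand, is what makes automatic GT generation practical once an electronic corpus is available.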

Another GT data file is provided for each subset folder that contains word-image samples. It comprises the main information about all words in the subset and acts as a lookup table for fast searching by one of the keywords: file name, word, or lexicon word-id. An example of an XML file at subset level is given in Figure 3.13. The following GT information is available in each subset-level file:

• Header information
  o Database name “Quranic_MP_Database”
  o Corpus name “The Holy Qura'n”
  o Writing instrument
  o Binarisation
  o Resolution
  o Subset name
  o Font identifier
  o Font name
  o Font size
  o Font style
• For each word image
  o Word image file name
  o Arabic word
  o Lexicon word reference identifier
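The lookup role of the subset-level file can be sketched as follows. This is a minimal Python illustration, assuming a hypothetical XML layout: the tag names, attribute names, file names and identifiers are invented for the example and do not reproduce the actual file in Figure 3.13. It parses the per-word entries once and builds one dictionary per search keyword.

```python
import xml.etree.ElementTree as ET

# Hypothetical subset-level GT file; tag and attribute names are assumptions.
SUBSET_XML = """
<SubsetGT database="Quranic_MP_Database" subset="sample_subset">
  <Word file="word_0001.bmp" lexiconId="17">بنورهم</Word>
  <Word file="word_0002.bmp" lexiconId="42">الله</Word>
  <Word file="word_0003.bmp" lexiconId="17">بنورهم</Word>
</SubsetGT>
"""


def build_indexes(xml_text):
    """Return three dictionaries keyed by file name, word and lexicon word-id."""
    by_file, by_word, by_lexicon_id = {}, {}, {}
    for node in ET.fromstring(xml_text.strip()).iter("Word"):
        entry = {
            "file": node.get("file"),
            "word": node.text,
            "lexicon_id": int(node.get("lexiconId")),
        }
        # A file name identifies exactly one image.
        by_file[entry["file"]] = entry
        # The same word or lexicon id may occur in several images.
        by_word.setdefault(entry["word"], []).append(entry)
        by_lexicon_id.setdefault(entry["lexicon_id"], []).append(entry)
    return by_file, by_word, by_lexicon_id


by_file, by_word, by_lexicon_id = build_indexes(SUBSET_XML)
```

After the single parsing pass, each search is a dictionary access, for example `by_word["بنورهم"]` for all images of one word.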

The database files, both images and GT, are stored in a directory structure organised by subset, as depicted in Figure 3.14.


3.7 Summary

Database implementation is a nontrivial task; a database is not a simple collection of text images. This chapter presents a complete procedure for building a database that treats the word as the major unit for text recognition. The procedure is used to build a machine-printed text database in general, and one for Arabic script in particular. The same procedure can be applied to construct a handwritten word database with some modifications to the collection form.

A number of points that influence database implementation are discussed. The constraints that must be satisfied for the database to fulfil the requirements of this study are described. The methodologies used to implement other related databases are reviewed.

The text database implementation methodology is described in detail. It presents an automated system that creates the database from a corpus. Fully computerized systems provide efficiency, accuracy, timeliness, security and economy. The automated system starts with form creation, followed by word segmentation and labelling, and then ground truth generation.

The form layout is designed in a way that simplifies line and word segmentation based on pixel location. Forms are designed using Microsoft Word and filled with words selected from The Holy Qur'an. The scanned forms are pre-processed to eliminate distortions at the edges and to detect and correct skew using a Hough transform approach. The Run-Length Smoothing Algorithm is used to help in cropping the word body. The file names of the tidy binary word images follow a special format that represents the word attributes. Finally, ground truth files in XML format are supplied for each entry and each subset folder in the database.
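The two pre-processing steps named in this summary can be illustrated with a short Python sketch. It is a simplified stand-in written under stated assumptions, not the implementation used for the database: `estimate_skew` performs a coarse Hough-style vote over candidate angles (the angle range, step and scoring rule are assumptions), and `rlsa_horizontal` applies only the horizontal pass of run-length smoothing to one binary row with an assumed threshold.

```python
import math


def estimate_skew(points, max_angle=10.0, step=0.5):
    """Hough-style vote over candidate angles (in degrees).

    For each angle, every foreground pixel votes for the line of that
    direction passing through it; the angle whose votes are the most
    concentrated is returned.
    """
    best_angle, best_score = 0.0, -1
    steps = int(round(2 * max_angle / step))
    for i in range(steps + 1):
        angle = -max_angle + i * step
        theta = math.radians(angle)
        bins = {}
        for x, y in points:
            # Signed distance from the origin to the line through (x, y)
            # whose direction is `angle`.
            rho = int(round(y * math.cos(theta) - x * math.sin(theta)))
            bins[rho] = bins.get(rho, 0) + 1
        score = sum(count * count for count in bins.values())
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle


def rlsa_horizontal(row, threshold):
    """Fill background runs (0) of length <= threshold that lie between
    two foreground pixels (1) in a single image row."""
    out = list(row)
    last_fg = None
    for i, pixel in enumerate(row):
        if pixel == 1:
            if last_fg is not None and 0 < i - last_fg - 1 <= threshold:
                for j in range(last_fg + 1, i):
                    out[j] = 1
            last_fg = i
    return out


# Synthetic baseline tilted by 5 degrees, and one row with two gaps.
baseline = [(x, round(x * math.tan(math.radians(5.0)))) for x in range(200)]
skew = estimate_skew(baseline)
smoothed = rlsa_horizontal([1, 0, 0, 1, 0, 0, 0, 0, 1], threshold=2)
```

In this sketch the short gap is closed while the long one is preserved, which is the property that lets smoothing merge the parts of one word without merging neighbouring words. A full RLSA would normally combine horizontal and vertical passes.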

Note that the algorithms were developed using a variety of languages: MATLAB (R2009a/64-bit), Delphi 7, and Microsoft Visual Basic 2010.


CHAPTER 4

HMM/DCT HOLISTIC WHOLE WORD RECOGNISER

4.1 Introduction

Although other languages, such as Persian and Urdu, use Arabic letters, Arabic character recognition has not reached the same level of maturity as that of other languages, especially English. This is attributed to a number of issues: a lack of fundamental interaction between researchers in this field, as well as a deficiency of supporting infrastructure, including Arabic text databases, electronic language corpora, and supporting staff. Consequently, with rare exceptions, each researcher has their own system and database. Accordingly, it is very difficult to give comparative results for the proposed methods due to the absence of standard benchmark databases. In addition to these issues, the complexity of the Arabic script presents an additional challenge to building an Arabic character recogniser; more details can be found in [11].

In this work, the segmentation problem is avoided by considering the word as the major unit. The popular block-based DCT is applied to extract word features. The features of the entire word are fed to the recogniser to identify it without segmentation. The system is built on HMMs, where each word is represented by a separate model. The HMMs applied to the task are discrete 1D-HMMs built using the Hidden Markov Model Toolkit (HTK) [28]. Vector quantization is used to generate a discrete observation symbol density. In the recognition phase, the recogniser produces an N-best lattice of word hypotheses. A database of true scanned typewritten Arabic word images in five different fonts is built for this research.
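The front end outlined above, block-based DCT features followed by vector quantization into discrete symbols, can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the recogniser's actual configuration: the block size of 8, the choice of the top-left 3 x 3 low-frequency coefficients, the toy image and the two-entry codebook are all hypothetical.

```python
import math


def dct2(block):
    """Orthonormal 2D DCT-II of a square block given as a list of rows."""
    n = len(block)

    def alpha(k):
        return math.sqrt((1.0 if k == 0 else 2.0) / n)

    out = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            total = 0.0
            for x in range(n):
                for y in range(n):
                    total += (block[x][y]
                              * math.cos((2 * x + 1) * u * math.pi / (2 * n))
                              * math.cos((2 * y + 1) * v * math.pi / (2 * n)))
            out[u][v] = alpha(u) * alpha(v) * total
    return out


def block_features(image, size=8, keep=3):
    """One feature vector per non-overlapping block: the top-left
    keep x keep low-frequency DCT coefficients, flattened row by row."""
    vectors = []
    for r in range(0, len(image) - size + 1, size):
        for c in range(0, len(image[0]) - size + 1, size):
            block = [row[c:c + size] for row in image[r:r + size]]
            coeffs = dct2(block)
            vectors.append([coeffs[u][v]
                            for u in range(keep) for v in range(keep)])
    return vectors


def quantise(vector, codebook):
    """Return the index of the nearest codeword (squared Euclidean distance)."""
    distances = [sum((a - b) ** 2 for a, b in zip(vector, codeword))
                 for codeword in codebook]
    return distances.index(min(distances))


# Toy 8 x 16 binary image: a block of ones followed by a block of zeros.
image = [[1] * 8 + [0] * 8 for _ in range(8)]
vectors = block_features(image)
codebook = [[0.0] * 9, [8.0] + [0.0] * 8]  # hypothetical two-entry codebook
symbols = [quantise(v, codebook) for v in vectors]
```

The resulting list of codebook indices is the kind of discrete observation sequence that a discrete HMM consumes; in practice the codebook would be trained from the feature vectors rather than written by hand.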