Proceeding of the 3rd International Conference on Informatics and Technology, 2009

DOCUMENT PRESENTATION ENGINE FOR INDIAN OCR: A DOCUMENT LAYOUT ANALYSIS APPLICATION

ABSTRACT

Today office automation is spreading across all fields. Computers are used for fast data processing and for maintaining large amounts of data, but much previously processed data exists only as printed documents. There are two ways to reuse such data: retype it into the computer, or scan the document and use OCR (Optical Character Recognition) to convert the document image into editable text. There are more than 1000 languages and 14 scripts used by 112 million people in India [1, 2], so an OCR system for Indian scripts is needed and is under development. An OCR pipeline scans the document and then performs noise cleaning, skew detection and correction, text/non-text classification, text line detection and segmentation, word segmentation, character segmentation and identification, and output file generation. The main contribution of this paper is a method for maintaining the layout of the document. At present, OCR systems produce a plain text file as output without preserving the document layout. OCR is an error-prone process, and the errors remaining in OCRed text can cause serious problems in reading and understanding if the output does not reflect the original image representation. This paper presents the use of XML to generate an OpenOffice document file, following the Open Document standard approved by OASIS (Organization for the Advancement of Structured Information Standards) on Feb. 1, 2007 [3]. The main feature of the proposed solution is that it is script independent, so it can be applied to all Indian scripts.

Keywords: OCRed document, Indian scripts, document layout.

1.0 INTRODUCTION

In the field of computer science, text and images are the main sources of information. A human can understand information better if it is presented well; for example, if the data is presented together with its images in a well-organized manner, a reader can understand it more easily, whereas a poor presentation obscures the information. The OCR process is error prone, and it is time consuming and expensive to manually proofread OCR results. The errors remaining in OCRed text can cause serious problems in reading and understanding if the output does not refer to the original image representation. Document representation after OCR is therefore an important task: a document becomes hard to read and understand if its layout is not maintained as in the original image. Present systems scan the document image and place the text and images one after another without maintaining the layout [2]. This paper gives a short discussion of the OCR processing steps and a detailed discussion of the proposed system. The following figure gives an overview of the OCR process.

Figure 1: Flow of the OCR process

As the figure shows, the first step of OCR is two-tone conversion, which converts the image into a binary image; skew is then detected and corrected. Noise cleaning is performed on the skew-corrected image. The text/non-text classification technique classifies the document image into text and image regions. The image part is extracted from the document image for further processing, and the remaining text part is passed to text line detection. After the text lines are detected, each line is processed, individual characters are generated, and the final output file is produced. At this stage, however, there is a problem: the layout of the document image is not maintained in the output file. The main contribution of this paper is a method for maintaining the layout of the output document.

2.0 PROCESSING OF OCR

The OCR is a combination of multiple processes, as shown in Figure 1 above. The first step is to acquire the document image using a scanner or a camera. The acquired image may contain many colour combinations, but the OCR processes a binary image, so the image is converted to binary: a global threshold value is computed and every pixel is mapped to black or white accordingly. Noise is then removed from the input image, most commonly with a morphological component removal technique. Figure 2 shows an input image before and after noise removal.

Figure 2: Noisy image, noise-cleaned image, skewed image, and skew-corrected image
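The paper does not give the threshold-selection rule, so the following is only a minimal sketch of global thresholding in Java, assuming the mean grey level of the page is used as the threshold; Otsu's method or any other global estimator could be substituted.

import java.awt.image.BufferedImage;

// Minimal global-threshold binarization sketch (threshold = mean grey level).
// The actual threshold-selection rule used by the OCR system is not specified in the paper.
public class GlobalBinarizer {

    public static BufferedImage binarize(BufferedImage input) {
        int w = input.getWidth(), h = input.getHeight();
        long sum = 0;

        // First pass: estimate a global threshold from the mean luminance.
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                sum += luminance(input.getRGB(x, y));
            }
        }
        int threshold = (int) (sum / ((long) w * h));

        // Second pass: map every pixel to pure black or pure white.
        BufferedImage binary = new BufferedImage(w, h, BufferedImage.TYPE_BYTE_BINARY);
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int v = luminance(input.getRGB(x, y)) < threshold ? 0x000000 : 0xFFFFFF;
                binary.setRGB(x, y, v);
            }
        }
        return binary;
    }

    private static int luminance(int rgb) {
        int r = (rgb >> 16) & 0xFF, g = (rgb >> 8) & 0xFF, b = rgb & 0xFF;
        return (r + g + b) / 3;   // simple average; a weighted luma would also work
    }
}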

After noise removal, the resultant image is taken as input to the skew detection and correction technique. Skew is caused by improper alignment of the document paper during scanning: if the document is not placed properly on the scanner, the scanned image can be skewed, and a skewed image may cause text detection to fail. It is therefore necessary to detect and correct the skew; Figure 2 shows a skewed image and the result after correction. After the preprocessing steps of noise removal and skew correction, the image is segmented into two categories, text and non-text, by the text/non-text classification. Since a document image generally contains text, images, and tables, this module treats the image and table regions as non-text areas and the remaining regions as text areas [13, 14]. At this stage the non-text areas are extracted from the image and stored, and the remaining part of the image is used to detect the text. Figure 3 shows the input image, the text/non-text classification, and the extracted text area; red boundaries mark the non-text areas of the image.
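The paper does not state which skew-estimation technique is used. The sketch below shows one common alternative, a brute-force projection-profile search that picks the rotation angle maximizing the variance of the row sums; it operates on a boolean foreground map and is only an illustration, not the authors' implementation.

// Illustrative projection-profile skew estimation (not necessarily the method used in the paper).
// image[y][x] == true marks a foreground (black) pixel of the binarized page.
public class SkewEstimator {

    // Search angles in [-maxDeg, +maxDeg] in stepDeg increments and return the best estimate.
    public static double estimateSkewDegrees(boolean[][] image, double maxDeg, double stepDeg) {
        double bestAngle = 0.0, bestScore = Double.NEGATIVE_INFINITY;
        for (double a = -maxDeg; a <= maxDeg; a += stepDeg) {
            double score = profileVariance(image, Math.toRadians(a));
            if (score > bestScore) {
                bestScore = score;
                bestAngle = a;
            }
        }
        return bestAngle;   // rotate the page by -bestAngle to deskew
    }

    // Variance of the horizontal projection profile after (virtually) shearing by `rad`.
    private static double profileVariance(boolean[][] image, double rad) {
        int h = image.length, w = image[0].length;
        int bins = h + w;                        // generous range for sheared row indices
        int[] profile = new int[2 * bins + 1];
        double sin = Math.sin(rad);

        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                if (image[y][x]) {
                    int row = (int) Math.round(y - x * sin) + bins;  // small-angle shear
                    profile[row]++;
                }
            }
        }
        double mean = 0;
        for (int c : profile) mean += c;
        mean /= profile.length;
        double var = 0;
        for (int c : profile) var += (c - mean) * (c - mean);
        return var / profile.length;
    }
}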

Figure 3: An input image, text/non-text classification, and the extracted text area

Figure 3 above shows the complete text part of the document image, which is used to detect the text. After classification, each text block is identified and text lines are detected, as shown in the following image for a single text block.


Figure 5: Detected text lines.

After the text lines are detected, word segmentation, character segmentation, template matching, and character replacement are performed. Since standard techniques exist for these processes and are discussed elsewhere [5, 12], only a brief introduction is given here for the sake of completeness. Word segmentation uses a basic feature of the script, the white space between words: each word is segmented and passed on for further processing, namely character segmentation. Character segmentation again uses the same basic feature of the scripts, the white space or gap between characters. The output of this process is a set of individual character images, for each of which the matching character is searched and substituted.
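As described above, words are separated wherever the vertical projection of a text line contains a sufficiently wide run of empty columns. The sketch below illustrates that idea on a boolean line image; the gap threshold is an assumed parameter, since the paper does not give one.

import java.util.ArrayList;
import java.util.List;

// Whitespace-gap word segmentation for a single text line, following the idea in Section 2.
// line[y][x] == true marks a foreground pixel; minGap is an assumed tuning parameter.
public class WordSegmenter {

    // Returns [start, end) column ranges, one per detected word.
    public static List<int[]> segmentWords(boolean[][] line, int minGap) {
        int h = line.length, w = line[0].length;
        boolean[] hasInk = new boolean[w];
        for (int x = 0; x < w; x++) {
            for (int y = 0; y < h; y++) {
                if (line[y][x]) { hasInk[x] = true; break; }
            }
        }

        List<int[]> words = new ArrayList<>();
        int start = -1, gap = 0;
        for (int x = 0; x < w; x++) {
            if (hasInk[x]) {
                if (start < 0) start = x;   // a new word begins
                gap = 0;
            } else if (start >= 0) {
                gap++;
                if (gap >= minGap) {        // gap wide enough: close the current word
                    words.add(new int[]{start, x - gap + 1});
                    start = -1;
                    gap = 0;
                }
            }
        }
        if (start >= 0) words.add(new int[]{start, w});  // trailing word
        return words;
    }
}

Character segmentation can reuse the same routine with a smaller gap threshold, since the paper states that the same white-space feature is used at character level.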

3.0 OVERVIEW OF PROBLEM

The present OCR systems available for Indian scripts can convert a document image into editable text and produce a text file, as discussed above. They perform the pre-processing, classify the text and non-text areas, detect the text in the segmented text areas, and generate its equivalent text; the editable text and the images are then placed in the text file. This works fine if the document image is single column. But what happens if the document image is multicolumn, as shown in the following image?

Figure 6: Multicolumn document image (left) and the corresponding disordered block representation (right)

As shown in Figure 6 (left), there are five text blocks and two image blocks. If the 1st block is simply followed by blocks 3, 2, 4, and 5 rather than each block appearing at its original location, the document is not easy to understand. So it is important to maintain the layout of the document as it was in the input document image. Figure 6 (right) shows the output file of the OCR systems available now; it illustrates a serious document-understanding problem. The document image in Figure 6 (left) is understandable because the reader knows which column to read after finishing one column, but in Figure 6 (right) the document presentation is lost and the reader is not able to understand it. This paper presents a simple approach that takes the text output of the OCR together with the block coordinate information and produces an output file in the odt format in which the layout is maintained as it was in the input document image.

4.0 PROPOSED SOLUTION

The problem of layout retention for OCRed document images is solved using the Open Document Format specification v1.1, approved by OASIS in Feb. 2007. It specifies the XML support needed to generate the document and handle the document elements: the document contains the document roots, document metadata, body elements and document types, application settings, scripts, font face declarations, styles, page styles, and layout [3]. Using Java, a simple parser is designed that defines the document elements in XML and compresses them into the document file. The generated OpenDocument Text file can be viewed in Linux as well as in Windows using OpenOffice [3] or Microsoft Office (via plug-ins).
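The paper only states that the Java parser writes the document elements in XML and compresses them into the .odt package. As a rough, hedged sketch of that packaging step, the code below zips a mimetype entry, a content.xml, and a minimal manifest into filename.odt using java.util.zip; the ODF specification expects the mimetype entry to be the first, uncompressed entry, so it is stored rather than deflated here. The XML strings are placeholders, not the authors' actual parser output.

import java.io.FileOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Sketch: pack content.xml and a manifest into an OpenDocument Text (.odt) container.
public class OdtPackager {

    public static void write(String odtPath, String contentXml) throws Exception {
        String mime = "application/vnd.oasis.opendocument.text";
        String manifest =
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n" +
            "<manifest:manifest xmlns:manifest=\"urn:oasis:names:tc:opendocument:xmlns:manifest:1.0\">\n" +
            " <manifest:file-entry manifest:media-type=\"" + mime + "\" manifest:full-path=\"/\"/>\n" +
            " <manifest:file-entry manifest:media-type=\"text/xml\" manifest:full-path=\"content.xml\"/>\n" +
            "</manifest:manifest>\n";

        try (ZipOutputStream zip = new ZipOutputStream(new FileOutputStream(odtPath))) {
            // 1. mimetype: first entry, STORED (uncompressed), as recommended by the ODF spec.
            byte[] mimeBytes = mime.getBytes(StandardCharsets.UTF_8);
            ZipEntry mimeEntry = new ZipEntry("mimetype");
            mimeEntry.setMethod(ZipEntry.STORED);
            mimeEntry.setSize(mimeBytes.length);
            CRC32 crc = new CRC32();
            crc.update(mimeBytes);
            mimeEntry.setCrc(crc.getValue());
            zip.putNextEntry(mimeEntry);
            zip.write(mimeBytes);
            zip.closeEntry();

            // 2. content.xml: the document body produced by the parser.
            zip.putNextEntry(new ZipEntry("content.xml"));
            zip.write(contentXml.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();

            // 3. META-INF/manifest.xml: lists the package contents.
            zip.putNextEntry(new ZipEntry("META-INF/manifest.xml"));
            zip.write(manifest.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
    }
}

Storing the mimetype first and uncompressed lets office applications identify the package type from its first bytes without unpacking it.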

Algorithm and Flow Chart

1. Read the block information file. If no blocks are there, then exit.
2. Create the support files.
3. Select the font for the specific language.
4. Create the Content.xml file.
5. Write the root information in the content file.
6. Read the block coordinate information.
7. If (image != 0) then include the element meta-data for image and text;
   else if (text != 0) then include the element meta-data for text.
8. For I = 0 to N, where N is the number of image blocks:
   8.1 Read the coordinate information.
   8.2 Insert the image tag with the image and its coordinate information.
9. For I = 0 to M, where M is the number of text blocks:
   9.1 Read the coordinate information.
   9.2 Put the tag for the text frame with the coordinate information.
   9.3 Insert the text in the above written tag.
10. Write the closing tag in the content file.
11. Generate the output file using the jar utility.

The above algorithm and flow chart show how the document is generated and how the elements are placed in it. Every document contains the document root tag, which holds meta-data and static information; each XML file has specific root tags that give information such as the version, date, and type of document, as well as a footer tag. The user can choose the language of the OCR, because the technique proposed here is script independent and works for all languages. The subsequent processing depends on the block coordinate information file, which is used to decide whether each block is an image block or a text block. If the document is not blank, the declaration data is written in the content file after the root information; this declaration data defines whether there are image blocks or text blocks. The document elements are then inserted into the document using their coordinate information. For each image, the drawing step is repeated using its coordinate values, its width and height, and the path of the image to insert. Once the image blocks have been inserted, a text frame is inserted for each text block and the text information is written inside the text frame tag; this is repeated for every available text block. Finally, the document leaf tag, the "footer tag", is included; it is static and the same for all documents.
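To make the element-insertion steps concrete, here is a condensed and hedged sketch of how image frames and text frames might be emitted into content.xml using the ODF draw:frame element. The anchor type, the unit of the block coordinates (taken here to be 1/100 mm and converted to cm), and the abbreviated namespace list are assumptions, since the paper does not reproduce its parser's exact output.

import java.util.Locale;

// Hedged sketch of building the <office:body> part of content.xml for one page.
// Coordinate unit (assumed 1/100 mm here) and the minimal namespace set are illustrative only.
public class ContentXmlBuilder {

    private final StringBuilder xml = new StringBuilder();

    public ContentXmlBuilder() {
        xml.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n")
           .append("<office:document-content")
           .append(" xmlns:office=\"urn:oasis:names:tc:opendocument:xmlns:office:1.0\"")
           .append(" xmlns:text=\"urn:oasis:names:tc:opendocument:xmlns:text:1.0\"")
           .append(" xmlns:draw=\"urn:oasis:names:tc:opendocument:xmlns:drawing:1.0\"")
           .append(" xmlns:svg=\"urn:oasis:names:tc:opendocument:xmlns:svg-compatible:1.0\"")
           .append(" xmlns:xlink=\"http://www.w3.org/1999/xlink\" office:version=\"1.1\">\n")
           .append("<office:body><office:text>\n");
    }

    // Step 8 of the algorithm: one anchored frame per image block.
    public void addImageBlock(double x, double y, double w, double h, String href) {
        xml.append("<text:p>")
           .append(frameOpen(x, y, w, h))
           .append("<draw:image xlink:href=\"").append(href)
           .append("\" xlink:type=\"simple\" xlink:show=\"embed\"/>")
           .append("</draw:frame></text:p>\n");
    }

    // Step 9 of the algorithm: one text frame per text block, filled with the OCRed text.
    public void addTextBlock(double x, double y, double w, double h, String ocrText) {
        xml.append("<text:p>")
           .append(frameOpen(x, y, w, h))
           .append("<draw:text-box><text:p>").append(ocrText).append("</text:p></draw:text-box>")
           .append("</draw:frame></text:p>\n");
    }

    private String frameOpen(double x, double y, double w, double h) {
        return String.format(Locale.US,
            "<draw:frame text:anchor-type=\"paragraph\" svg:x=\"%.2fcm\" svg:y=\"%.2fcm\""
            + " svg:width=\"%.2fcm\" svg:height=\"%.2fcm\">",
            x / 1000, y / 1000, w / 1000, h / 1000);   // assumed: input values are 1/100 mm
    }

    // Steps 10-11: close the document; the returned string is what gets packaged into the .odt.
    public String build() {
        xml.append("</office:text></office:body></office:document-content>\n");
        return xml.toString();
    }
}

The string returned by build() could then be handed to a packaging step such as the OdtPackager sketch shown earlier to produce filename.odt.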

5.0 TESTING AND RESULTS

In this section a reference input image is taken and the result of each step of the algorithm is described.


Figure 7: A document image which will be OCRed.

Pre-processing is performed on the image, the text and non-text blocks are classified, and the block coordinate information is stored in a file. The text blocks are processed further and the equivalent text is produced. The document image given above has four image blocks and three text blocks; their corresponding block coordinate information, converted to the document size, is given below.

image 4
5734.05 10919.968 17416.018 17105.884
6755.892 151.892 17393.92 4012.946
249.936 154.94 6470.904 4121.912
228.092 4129.024 10721.086 10816.082
text 3
10870.946 4215.892 17375.886 1234.43
202.946 10997.946 5592.826 345.765
305.054 17347.946 17445.99 4567.87

Each image block and each text block is described by four coordinate values. Using these coordinate values, the text frames are drawn, the text from the text files is inserted, and the image paths are defined together with their coordinates, height, and width. After all elements of the document have been defined in the content file, the jar/zip utility is used to compress all the files into filename.odt. The output file opened in OpenOffice has the correct layout, as shown in the following image.
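Assuming the block-information file has exactly the layout shown above (a keyword, a block count, then four numbers per block), a small parser like the following could recover the block rectangles; the meaning of the four values (taken here as the two corner points x1, y1, x2, y2) is an assumption, as the paper does not define their order.

import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Scanner;

// Hedged parser for the block-coordinate listing shown above.
// Assumes: "<type> <count>" followed by four numbers per block; the four numbers are
// interpreted here as two corner points (x1, y1, x2, y2), which is an assumption.
public class BlockInfoParser {

    public static class Block {
        public final String type;                 // "image" or "text"
        public final double x1, y1, x2, y2;
        Block(String type, double x1, double y1, double x2, double y2) {
            this.type = type; this.x1 = x1; this.y1 = y1; this.x2 = x2; this.y2 = y2;
        }
    }

    public static List<Block> parse(String listing) {
        List<Block> blocks = new ArrayList<>();
        try (Scanner in = new Scanner(listing)) {
            in.useLocale(Locale.US);              // accept '.' as the decimal separator
            while (in.hasNext()) {
                String type = in.next();          // "image" or "text"
                int count = in.nextInt();         // number of blocks of this type
                for (int i = 0; i < count; i++) {
                    blocks.add(new Block(type,
                            in.nextDouble(), in.nextDouble(),
                            in.nextDouble(), in.nextDouble()));
                }
            }
        }
        return blocks;
    }
}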


6.0 CONCLUSION AND FUTURE WORKS

The approach presented in this paper has been implemented and tested with different Indian scripts as well as the Roman script. It works very well, and its quality depends on the accuracy of the text/non-text classification process. Although it works well, there is still much to do: the OCR produces plain text output and does not recognize whether text is bold, italic, underlined, or a heading, so many features remain to be implemented in the OCR.

REFERENCES

[1] http://tdil.mit.gov.in/resource_centre.html

[2] http://indiansaga.com/languages/index.html

[3] http://docs.oasis-open.org/office/v1.1/OS/OpenDocument-v1.1.pdf

[4] Thomas M. Breuel, "High Performance Document Layout Analysis", vol. 3, pp. 61-64, 2002.

[5] Gaurav Gupta, Shobhit Niranjan, Ankit Shrivastava, R. M. K. Sinha, "Document Layout Analysis and its Application in OCR", 10th IEEE International Enterprise Distributed Object Computing Conference Workshops, 2006.

[6] Jignesh Dholakia, Atul Negi, S. Rama Mohan, "Zone Identification in Gujarati Text", Proceedings of the 8th International Conference on Document Analysis and Recognition (ICDAR), IEEE, 2005.

[7] Suryaprakash Kompalli, Sankalp Nayak, Srirangaraj Setlur, "Challenges in OCR of Devanagari Documents", Proceedings of the 8th International Conference on Document Analysis and Recognition (ICDAR'05), IEEE, 2005.

[8] T. Pavlidis and J. Zhou, "Page Segmentation and Classification", Graphical Models and Image Processing, vol. 54, pp. 484-496, 1992.

[9] A. Kolcz, J. Alspector, M. Augusteijn, R. Carlson, G. V. Popescu, "A Line-Oriented Approach to Word Spotting in Handwritten Documents", Pattern Analysis and Applications, vol. 3, no. 2, 2000, pp. 153-168.

[10] L. Likforman-Sulem, A. Zahour, B. Taconet, "Text Line Segmentation of Historical Documents: A Survey", International Journal on Document Analysis and Recognition (IJDAR), vol. 9, no. 2-4, April 2007, pp. 123-138.

[11] Udo Miletzki, "Character Recognition in Practice Today and Tomorrow", IEEE, 1997.

[12] Tao Hong and Sargur N. Srihari, "Representing OCRed Documents in HTML", IEEE, 1997.

[13] Z. Shi, S. Setlur, and V. Govindaraju, "Text Extraction from Gray Scale Historical Document Images Using Adaptive Local Connectivity Map", Eighth International Conference on Document Analysis and Recognition, Seoul, Korea, 2005, pp. 794-798.

[14] Umesh Kumar, "Text Line Detection in Multicolumn Document Image and Document Presentation Engine", M.Tech Thesis, GGSIPU, New Delhi, 2008.
