The Development of Multimedia-Multilingual Document Storage, Retrieval and Delivery System for E-Organization (STREDEO PROJECT)

Asanee Kawtrakul, Kajornsak Julavittayanukool, Mukda Suktarachan, Patcharee Varasrai, Nathavit Buranapraphanont, Chaiwat Ketsuwan, Duangpen Jetpipattanapong,

Prakorn Santiwatt, Nattakan Pengphon

Natural Language Processing and Intelligent Information System Technology Research Laboratory

Department of Computer Engineering, Faculty of Engineering, Kasetsart University

Bangkok, Thailand 10900

Email: [email protected]

Abstract

This paper introduces a new project called STREDEO: The Development of Multimedia-Multilingual Document Storage, Retrieval and Delivery System for E-Organization. STREDEO aims to provide a system for multimedia, multilingual document management covering storage, retrieval and delivery. The project is divided into seven subprojects: The Development of Multimedia and Multilingual Document Storage System (MUU-DOC), The Development of Document Image Processing System for Indexing (DIM), The Development of Web-based Intelligent Information Retrieval System (WIRE), The Development of Automatic Document Clustering and Delivery System (CLUD), The Development of Multimedia Query Processing System: Speech, Text and Handwriting Text (MUL-Q), The Development of Linguistic Knowledge Acquisition System and Natural Language Processing Techniques (KANAL), and A Very Large Scale Multimedia Database Management Design and Integrating System (INTEGRATE).

Keywords: E-Organization, Natural Language Processing, Document Image Processing for Indexing, Automatic Document Indexing, Automatic Document Clustering, Very Large Scale Hypermedia Document Storage and Delivery System, Web-based Intelligent Search Engine, Linguistic Knowledge Base, Knowledge Acquisition System

1. Introduction

Information technology is expanding very rapidly, particularly in the field of communication and networking, and the volume of information is likely to continue growing exponentially. This creates the need to collect extremely large amounts of information in different languages and media. Table 1 shows estimates of the amount of information produced in different media and their growth rates.

Table 1: Worldwide production of original content, stored digitally using standard compression methods, in terabytes circa 1999 [9].

Storage Medium        Type of Content        Upper Estimate     Lower Estimate     Growth Rate
                                             (Terabytes/Year)   (Terabytes/Year)   (%)
Paper                 Book                   8                  1                  2
                      Newspaper              25                 2                  -2
                      Periodicals            12                 1                  2
                      Office document        195                19                 2
                      Total                  240                23                 2
Film                  Picture                410,000            41,000             5
                      Movie                  16                 16                 3
                      X-Rays                 17,200             17,200             2
                      Total                  427,216            58,216             4
Optical information   CDs songs              58                 6                  3
                      CDs data               3                  3                  2
                      DVDs                   22                 22                 100
                      Total                  83                 31                 70
Magnetic information  Camcorder Tape         300,000            300,000            5
                      PC Disk Drives         766,000            7,660              100
                      Departmental Servers   460,000            161,000            100
                      Enterprise Servers     167,000            109,000            100
                      Total                  1,693,000          635,660            55
Grand total                                  2,120,539          693,930            50

The above information can be useful to a wide range of users, from organizations to individuals. However, with such a large amount of information available, problems such as excessively long search times or system instability are easily encountered. Consequently, there is a need to organize and manage this information, which includes organizing and managing the storage system, the retrieval system and the delivery system.

Information technology is already applied to storage and retrieval systems [11]. Examples include large-scale multimedia document storage [1], [6], [12], automatic indexing systems [2], [4] and automatic document clustering systems [3], [8], [10]. However, these systems were created for English and are not directly applicable to Thai, because Thai has unique characteristics such as the absence of spaces between words and ambiguity between nouns and noun phrases [7], [12].
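To illustrate why the absence of spaces between words matters for indexing, the following minimal sketch segments an unspaced string by greedy longest matching against a word list. It is only an illustration, not the method used in STREDEO: the function name, the toy lexicon and the use of an unspaced English string as a stand-in for Thai text are all assumptions made for this example.

```python
def segment_longest_match(text, lexicon):
    """Greedy left-to-right segmentation.

    At each position, take the longest lexicon entry that starts there.
    Characters that start no known word are emitted as single-character tokens.
    """
    max_len = max(len(word) for word in lexicon)
    tokens = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first, then shorter ones.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in lexicon:
                match = candidate
                break
        if match is None:
            match = text[i]  # unknown character
        tokens.append(match)
        i += len(match)
    return tokens


# Hypothetical toy lexicon; an unspaced English string stands in for Thai text.
LEXICON = {"inter", "net", "internet", "add", "address", "dress"}

print(segment_longest_match("internetaddress", LEXICON))
```

The same string could also be split as "inter", "net", "address", which is a valid sequence of lexicon words. A greedy matcher simply picks one reading, which is why word boundary ambiguity in unspaced text generally calls for linguistic knowledge beyond a word list.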

The STREDEO project aims to develop this technology and apply it to Thai storage, retrieval and delivery systems. The project will help support offices that use only electronic documents and eliminate the use of paper, which benefits the environment. In addition, it can facilitate services and the exchange of information both within and outside Thailand.

2. STREDEO Overview

Figure 1 shows an overview of STREDEO. There are two types of input: text and image. Text can also be collected by a web robot. A document image is converted into text (not necessarily of high quality) before indexing, and the image itself is then kept in the document warehouse. Once there is information in the document warehouse, the system goes on to perform document clustering and delivery. If a user enters a natural language query as text, handwriting or speech, the system retrieves the relevant information or documents for the user.

The STREDEO project is divided into seven subprojects, which are described in the following subsections.

[Figure 1: Overview of STREDEO. Inputs: web documents collected from the Internet by a web robot, electronic office documents, and books or papers captured by a scanner as image documents. Components: Document Image Processing System (shallow conversion of document images to text), Multimedia Document Storage System (automatic indexing and storing into the multimedia document warehouse), Intelligent Search Engine (retrieval processing), Speech and Text Query Processing System (user queries in natural language and speech), Document Clustering and Delivering System (delivery to users in Divisions A to D), and Linguistic Knowledge Base and Acquisition System (very large corpora, toolkit, thesaurus, parser).]

2.1. The Development of Multimedia and Multilingual Document Storage System (MUU-DOC)

MUU-DOC is an important subsystem of the project. Its main function is to analyze the information in a document or document image for indexing and storing. Figure 2 shows the scope of MUU-DOC.

[Figure 2 components: input documents (electronic office documents, electronic text documents, image documents); document image processing system; automatic analyzing and storing; automatic indexing (morphological analysis, noun phrase analysis, index representation); document warehouse.]

Figure 2: The Development of Multimedia and Multilingual Document Storage System (MUU-DOC)

2.2. The Development of Document Image Processing System for Indexing (DIM)

Text from books or papers that is to be indexed and stored in the corpus would otherwise have to be typed manually, which is a time-consuming and tedious task. DIM is the part of MUU-DOC that roughly analyzes and recognizes the text in scanned document images, making the indexing of large numbers of documents more convenient. It greatly reduces the time and human effort spent typing text, which speeds up feeding data into the Multimedia and Multilingual Document Storage System.

DIM has four main tasks.

- Image improving, to solve scanning problems.

- Layout analysis, to distinguish between text images and picture images.

- Character segmentation, to separate characters that are connected because of scanning or because of the font.

- Character recognition, to convert the segmented character images into text.

Converting an image of typed Thai characters into a text document uses syntactic methods, fuzzy logic and feature extraction. To make the system more practical, this subproject is designed to focus not only on character recognition but also on image processing and character segmentation. Figure 3 shows an example of such a system.

[Figure 3 stages: image document -> image improving -> layout analysis -> line segmentation -> character segmentation -> character recognition -> text document]

Figure 3: Document image processing system
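The line and character segmentation stages can be illustrated with a projection profile, a common technique in which the ink pixels of a binarized image are summed along rows or columns and cuts are made wherever the sum drops to zero. The sketch below is an assumption made for illustration, not the algorithm used in DIM: the function names and the tiny test image are hypothetical, and a plain projection cannot separate touching characters, which is exactly the case DIM's character segmentation has to handle.

```python
def ink_runs(profile):
    """Return (start, end) index pairs, end exclusive, of consecutive non-zero entries."""
    runs = []
    start = None
    for i, ink in enumerate(profile):
        if ink > 0 and start is None:
            start = i                    # a run of ink begins
        elif ink == 0 and start is not None:
            runs.append((start, i))      # the run ended at the previous index
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


def segment_lines(image):
    """Split a binary image (list of rows, 1 = ink) into text lines by row sums."""
    return ink_runs([sum(row) for row in image])


def segment_characters(image, top, bottom):
    """Split one text line (rows top..bottom-1) into character spans by column sums."""
    band = image[top:bottom]
    return ink_runs([sum(column) for column in zip(*band)])


# Hypothetical 6x6 page: two text lines separated by a blank row.
PAGE = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 0],
    [0, 0, 1, 0, 1, 0],
]

for top, bottom in segment_lines(PAGE):
    print((top, bottom), segment_characters(PAGE, top, bottom))
```

In the second text line of the toy page the two shapes touch, so the column profile yields a single span covering both. Separating such connected characters requires additional analysis beyond this sketch.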

2.3. The Development of Web-based Intelligent Information Retrieval System (WIRE)

The growth of information technology and of Internet use is causing the number of electronic documents to increase exponentially. Consequently, searching for information is a non-trivial problem. It is necessary to create a web-based intelligent information retrieval system, which is called WIRE.

WIRE is a prototype system capable of searching for information in bilingual (Thai-English) text. It is divided into two parts: a query processing system and a searching system. The query processing system processes the user's query by transforming it into a multilevel query, with levels such as word level, phrase level and sentence level. For example, if the query is “What is an internet address?”, the query processing system generates a multilevel query with “internet address” at the phrase level and “networking” at the conceptual level. In addition, the query processing system allows a user to phrase a query in different ways, for example “address of internet” or “address on internet”, and still obtain the same result.

Since the query processing system produces multilevel queries, the searching system must also be capable of searching at multiple levels. This is done by starting the search at the phrase level, then the word level and finally the conceptual level.
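The multilevel idea can be sketched as follows. This is a simplified illustration under stated assumptions rather than WIRE's actual design: the stop-word list, the concept table, the document representation and the function names are all hypothetical, a phrase is treated as an unordered set of content words so that differently worded queries normalize to the same form, and results are ranked by the first level at which a document matches.

```python
# Hypothetical resources, assumed for this example only.
STOP_WORDS = {"what", "is", "an", "a", "the", "of", "on", "in"}
CONCEPTS = {frozenset({"internet", "address"}): "networking"}


def build_query(text):
    """Turn a free-form query into phrase-, word- and concept-level forms."""
    tokens = [t.strip("?.,!").lower() for t in text.split()]
    content = frozenset(t for t in tokens if t and t not in STOP_WORDS)
    return {
        "phrase": content,                # all content words together
        "words": content,                 # any single content word
        "concept": CONCEPTS.get(content), # broader concept, if known
    }


def search(query, documents):
    """Search at phrase level, then word level, then conceptual level."""

    def matches(level, doc):
        if level == "phrase":
            return bool(query["phrase"]) and query["phrase"] <= doc["tokens"]
        if level == "word":
            return bool(query["words"] & doc["tokens"])
        return query["concept"] is not None and query["concept"] in doc["concepts"]

    ranked, seen = [], set()
    for level in ("phrase", "word", "concept"):
        for doc_id, doc in documents.items():
            if doc_id not in seen and matches(level, doc):
                ranked.append((doc_id, level))
                seen.add(doc_id)
    return ranked


# Hypothetical indexed documents.
DOCS = {
    "d1": {"tokens": {"every", "host", "has", "an", "internet", "address"},
           "concepts": {"networking"}},
    "d2": {"tokens": {"postal", "address", "format"}, "concepts": {"mail"}},
    "d3": {"tokens": {"routing", "protocols"}, "concepts": {"networking"}},
}

print(search(build_query("What is an internet address?"), DOCS))
print(search(build_query("address of internet"), DOCS))
```

Both queries normalize to the same multilevel form, so they return the same ranking: the document containing the full phrase first, then a document sharing a single word, then a document related only through the concept.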


2.4. The Development of Automatic Document Clustering and Delivery System (CLUD)

As mentioned above, the growth of information technology and of Internet use is causing the number of electronic documents to increase exponentially, so there is a need to arrange electronic documents into groups. Done by hand, this task is time-consuming, ineffective and very tedious. Therefore, there should be a system that can cluster electronic documents automatically, effectively and accurately [10]. In addition to document clustering, the system also provides the capability to forward documents to the right users.
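As a rough illustration of clustering followed by delivery, the sketch below groups documents by the cosine similarity of their term-frequency vectors using single-pass clustering, then forwards each cluster to users whose interest profile is similar to the cluster centroid. The choice of algorithm, the threshold value, the function names and the sample data are assumptions made for this example and are not taken from CLUD.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term-frequency vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def cluster(docs, threshold=0.3):
    """Single-pass clustering: join the first similar cluster, else start a new one."""
    clusters = []  # each cluster: {"centroid": Counter, "members": [doc ids]}
    for doc_id, text in docs.items():
        vec = vectorize(text)
        for c in clusters:
            if cosine(vec, c["centroid"]) >= threshold:
                c["members"].append(doc_id)
                c["centroid"].update(vec)  # accumulate term counts
                break
        else:
            clusters.append({"centroid": Counter(vec), "members": [doc_id]})
    return clusters


def deliver(clusters, profiles, threshold=0.3):
    """Map each user to the documents of clusters matching the user's interests."""
    return {
        user: [doc_id
               for c in clusters
               if cosine(vectorize(interests), c["centroid"]) >= threshold
               for doc_id in c["members"]]
        for user, interests in profiles.items()
    }


# Hypothetical documents and user interest profiles.
DOCUMENTS = {
    "a": "thai word segmentation dictionary",
    "b": "thai word segmentation corpus",
    "c": "speech recognition acoustic model",
    "d": "speech recognition vocabulary",
}
PROFILES = {"alice": "thai segmentation", "bob": "speech recognition"}

CLUSTERS = cluster(DOCUMENTS)
print([c["members"] for c in CLUSTERS])
print(deliver(CLUSTERS, PROFILES))
```

Single-pass clustering is used here only because it is short; its result depends on document order and on the threshold, so a production system would likely need a more robust method and language-aware tokenization.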

2.5. The Development of Multimedia Query Processing System: Speech, Text and Handwriting Text (MUL-Q)

Today, all input queries are entered using a keyboard. To make the system friendlier, MUL-Q is proposed as a multimedia query processing system that allows users to use speech and handwriting as input queries to STREDEO. This subproject is limited to recognizing discontinuous speech with domain-based vocabularies. Another form of query is handwriting. Handwritten character recognition (HCR) is more difficult than OCR, and this subproject is therefore limited to processing only neat handwriting.

2.6. The Development of Linguistic Knowledge Acquisition System and Natural Language Processing Techniques (KANAL)

Research in natural language processing is important to the development of document processing because it enables a better understanding of human language. This subproject aims to develop a linguistic knowledge acquisition system and natural language processing techniques to support document processing in indexing, clustering and querying.

2.7. A Very Large Scale Multimedia Database Management Design and Integrating System (INTEGRATE)

Developing software and databases for very large-scale multimedia always involves many problems, for example connecting the modules together and controlling the schedule and quality of each module. Since the STREDEO project has seven subprojects, such problems will occur without good planning. The objective of this subproject is therefore to design and develop the software architecture, plan the development direction, and provide plug-in modules, testing and maintenance services via the network by applying software engineering techniques.

3. Conclusion

Information technology today has shown that there is a need to store, query, search, retrieve and deliver large amounts of electronic information efficiently and accurately. This paper introduces the STREDEO project, which will deal with the growing number of electronic documents. The STREDEO project consists of seven subprojects. The first subproject, MUU-DOC, focuses on multimedia and multilingual document storage. The second subproject, DIM, focuses on a document image processing system for indexing. The third subproject, WIRE, focuses on web-based intelligent information retrieval. The fourth subproject, CLUD, focuses on automatic document clustering and delivery. The fifth subproject, MUL-Q, focuses on multimedia query processing of speech, text and handwritten text. The sixth subproject, KANAL, focuses on linguistic knowledge acquisition and natural language processing techniques. The last subproject, INTEGRATE, focuses on very large scale multimedia database management design and the integration of STREDEO.

4. References

[1] Andres, F. 2000, “Active Hypermedia Delivery System and PHASME Information Engine”, in Proceedings of AdInfo2000, First International Symposium on Advanced Informatics 1, pp. 37-44.

[2] Chengxing, Z. 1995, “Evaluation of syntactic phrase indexing-CLARIT NLP”, Track Report, Text Retrieval Conference 4, New York, p. 325.

[3] Cohen, W. W. 1996, “Learning rules that classify e-mail”, in Proceedings of the 1996 AAAI Spring Symposium on Machine Learning in Information Access 3, pp. 18-25.

[4] Dik, L. 1997, “Information storage and retrieval”, 3nd ed., Prentice Hall Publishing Company, New York, 420 p.

[5] Kawtrakul, A. et al. 2000, “Multi-Feature Extraction for Printed Thai Character”, SNLP 2000 Symposium of Natural Language Processing.

[6] Kawtrakul, A. et al. 2000, “Toward on Enhancement of Textual Database Retrieval by Using NLP Technique”, NECTEC Technical Journal, Vol. 11 No. 7, March-June 2000.

[7] Kawtrakul, A. and Thumkanon, C. 1997, “A statistical Approach for Thai Morphological”, International Conference, China.

[8] Lang, K. 1995, “NewsWeeder: Learning to filter netnews”, in Proceedings of ICML-95, 12th International Conference on Machine Learning 12, pp. 331-339.

[9] Lyman, P. and Varian, H. R. 1999, “How much Information?” [online] http://www.sims.berkeley.edu/how-much-info

[10] Sebastiani, F. 1999, “A Tutorial on Automated Text Categorisation”, in Analia Amandi and Alejandro Zunino (eds.), Proceedings of ASAI-99, 1st Argentinian Symposium on Artificial Intelligence, Buenos Aires, AR, pp. 7-35.

[11] Frakes, W. B. and Baeza-Yates, R. 1992, “Information Retrieval: Data Structures & Algorithms”, Prentice Hall, Englewood Cliffs, New Jersey, 504 p.

[12] Kawtrakul, A., Andres, F., Ono, K. et al. 2000, “The Implementation of VLSHDS Project for Thai Document Retrieval”, in Proceedings of the First International Symposium on Advanced Informatics, Tokyo, Japan.
