A Document Management System Based on an OODB

Ching-Ming Chao

Department of Computer and Information Science Soochow University

Taipei 100, Taiwan, R.O.C. E-mail: [email protected]

Abstract

Efficient document management is extremely important because modern information systems produce and access a tremendous volume of documents. This paper describes a document management system. The system stores SGML documents in an ObjectStore object-oriented database and can store different types of documents within one database by accommodating multiple DTDs. We create one object type for all DTDs and store each DTD as an object of that type. We create an object type for each element definition in a DTD and store each element of an SGML document as an object. This database representation is advantageous for declarative queries and fine-grained modification of documents. The system supports automatic creation of object types and automatic insertion of documents into the database. Two different interfaces are provided for the user to retrieve, modify, and delete documents. The system supports declarative queries on documents, which can be made with respect to their contents or structure.

Key Words: Document Management, Object-Oriented Database, SGML

1. Introduction

Structured documents are central to a wide class of applications such as software engineering, digital libraries, and information retrieval. The ever-increasing volume of structured documents produced by modern information systems makes efficient document management extremely important. Recognizing and marking up the internal structure of structured documents helps to increase the efficiency of document retrieval in document management systems. SGML [4] has been widely used for marking up structured documents. It is a standard markup language for document description, designed specifically to enable text interchange so that marked-up documents can be used and exchanged among different systems and platforms. It can also be used to add logical structure information to documents, which gives them greater applicability. Researchers have recognized that the management of structured documents can benefit notably from database support. A current trend is to employ object-oriented database technology in the management of structured documents.

In this paper, we report our work on developing an SGML document management system. The system stores SGML documents in an ObjectStore object-oriented database and supports insertion, modification, retrieval, and deletion of documents. With regard to the database representation of documents, we create an object type for each element definition in a document type definition (DTD). In this way, each element of an SGML document is stored as an object. This storage representation is advantageous for declarative queries and fine-grained modification of documents. The system is capable of storing, within one database, different types of documents by accommodating multiple DTDs.

The system includes three primary components: a DTD parser, an SGML parser, and a query processor. We assume SGML documents and associated DTDs have been created and validated by an authoring tool. The DTD parser accepts a new DTD and automatically generates object types that correspond to the elements defined in the DTD. The SGML parser accepts an SGML document instance and automatically inserts the document into the database by instantiating appropriate objects that correspond to the elements in the document. The query processor is responsible for retrieval, modification, and deletion of documents. The system supports declarative queries on documents, which can be made with respect to their contents or structure.

The rest of this paper is organized as follows. In Section 2 we review previous work on storage and retrieval of structured documents (in particular, SGML documents). In Section 3 we briefly introduce the syntactic structure of SGML documents and investigate the issue of representing SGML documents in an object-oriented database. In Section 4 we present the document management system. Section 5 concludes this paper and suggests future research directions.

2. Related Work

In this section we briefly review previous work on storage and retrieval of structured documents (in particular, SGML documents). These approaches are distinguished principally by the ways in which documents are stored and accessed.

Schouten [8] used the relational data model to design an SGML document database. Because of the hierarchical and intricate structure of SGML documents, relational databases with flat tables and scalar data types are inappropriate for storing these documents for at least two reasons. First, mapping SGML documents into relational tables is a complicated and unnatural process and may lose some structural information. Second, because a document is scattered over several tables, retrieving a document from the database requires several join operations and is therefore inefficient.

VERSO [3], developed at INRIA in France, is an object-oriented database system for SGML documents. It is built on top of the O2 object-oriented database management system to exploit its sophisticated type system and extensible query language O2SQL. Using an extended version of the Euroclid SGML parser, VERSO maps DTDs into O2 schema, and document instances into corresponding objects. This requires the extension of the O2 data model to union types and ordered tuples. It also extends the O2 query language O2SQL for document retrieval.

HyperStorM [2], which stands for Hypermedia Document Storage and Modeling, is a project developed at GMD-IPSI in Germany. It is built on the VODAK object-oriented database management system. The Structured Document Database component [1] of HyperStorM investigates various object-oriented technologies for structured documents. It suggests a hybrid database-internal representation for documents. That is, some elements are represented by individual database objects, while others (the flat elements) are not. This representation is subject to configuration for the particular document type. It also proposes the concept of a query template as a declarative access mechanism for SGML documents.

Ozsu et al. [6] developed an object-oriented multimedia database management system that can store and manage SGML/HyTime compliant multimedia documents. The system is capable of storing and managing different types of documents in one database. This is accomplished by dynamically creating object types according to element definitions in each DTD. The system also has tools to automatically insert marked-up documents into the database and provides facilities for querying these documents with respect to their contents and with respect to their structure.

Sengupta and Dillon [9] proposed an approach to the representation of SGML documents that is different from those mentioned above. They argued that converting SGML documents into database formats is unnatural and may lose information. Their system puts a set of SGML documents in a repository and poses queries on these documents. A query language based on the SQL standard and a query interface based on the QBE interface are also proposed.

3. An Object-Oriented Document Database

In this section, we investigate the issue of representing SGML documents in an object-oriented database. Before doing that, we first have to understand the basic concepts and syntactic structure of SGML documents.

An SGML document is composed of three parts: an SGML declaration, a document type definition (DTD), and a document instance (DI). The SGML declaration defines the character set and any special SGML features used in the document. If it is absent, the default will be used. The document type definition of a document defines the structure and the rules for marking up the document instance. Many documents can share the same DTD. Therefore, the DTD is most often stored separately from the document instance, which makes the document itself more concise and makes the DTD sharable by different documents. The document instance contains the content and tags of the document, including a reference to its DTD. It is marked up according to the rules defined by the DTD. Figure 1 shows the document instance of a memo document and Figure 2 shows its DTD.

In Figure 3 we formally specify the syntactic structure of SGML documents in the OMT notation [7]. An SGML document has a name, an optional SGML declaration, one or more DTDs, and an element. An element may contain text data and/or component elements and may have any number of attributes. An element and its attributes are defined by their corresponding definitions in the DTD. A DTD has a name, any number of entity definitions, notation definitions, and public text, and at least one element definition. An element definition contains either a content model or a declared content and an optional exception list. A model group is recursively defined. An element definition may be associated with the definition of its attributes. We will not go into further details of the syntactic structure of SGML documents. The interested reader is referred to [5].

<! DOCTYPE Memo SYSTEM “C:\Memo.dtd”>
<Memo>
<To> All Employees </To>
<From> The President </From>
<Body> <P> In the last year, our company earnings increased 100%. It is good news. Please remember: <Q> “Working hard is the best policy.” </Q> I hope our company will be better tomorrow. </P> </Body>
<Close> Isaac Newton </Close>
</Memo>

Figure 1. A Document Instance

<! -- DTD for simple memoranda -- >
<! ELEMENT Memo -- ((To & From), Body, Close?) >
<! ELEMENT To -O (#PCDATA) >
<! ELEMENT From -O (#PCDATA) >
<! ELEMENT Body -O (P*) >
<! ELEMENT P -O (#PCDATA | Q)* >
<! ELEMENT Q -- (#PCDATA) >
<! ELEMENT Close -O (#PCDATA) >
<! ATTLIST Memo status (confiden | public) public>
<! -- End of DTD -- >

Figure 2. A Document Type Definition

[Figure 3 (OMT diagram, not reproduced): the syntactic structure of SGML documents, relating the classes SGML Document, SGML Declaration, DTD, Element, Attribute, Entity Definition, Notation Definition, Public Text, Element Definition, Attribute Definition, and Group.]

Now that we have examined the syntactic structure of SGML documents, we discuss how to represent SGML documents in an object-oriented database. First let us discuss how to store document type definitions. A DTD may include entity definitions, element definitions, and attribute definitions. An entity definition defines a symbolic name for any type of data. ENTITY is the keyword for an entity definition, followed by the symbolic name and the data of the entity. An element definition defines the structure of an element of a document. ELEMENT is the keyword for an element definition, followed by the element name, a two-character tag omission indicator, and a content model or declared content. An element may have an attribute definition that defines one or more attributes of the element. ATTLIST is the keyword for an attribute definition, followed by the element name and one or more attribute declarations. Each attribute declaration contains the name, all possible values, and the default value of the attribute.
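As a minimal sketch of how element and attribute definitions of this shape could be extracted from a DTD, the Python fragment below scans the declarations of the memo DTD in Figure 2 with regular expressions. The function name `parse_dtd`, the dictionary layout, and the patterns are assumptions made for illustration only; this is not the system's DTD parser, and it ignores entity definitions, comments, and exception lists.

```python
import re

# Declarations copied from the memo DTD of Figure 2 (comments left out).
MEMO_DTD = """
<! ELEMENT Memo -- ((To & From), Body, Close?) >
<! ELEMENT To -O (#PCDATA) >
<! ELEMENT From -O (#PCDATA) >
<! ELEMENT Body -O (P*) >
<! ELEMENT P -O (#PCDATA | Q)* >
<! ELEMENT Q -- (#PCDATA) >
<! ELEMENT Close -O (#PCDATA) >
<! ATTLIST Memo status (confiden | public) public>
"""

# Keyword, element name, two-character tag omission indicator, content model.
ELEMENT_RE = re.compile(r"<!\s*ELEMENT\s+(\w+)\s+([-O])\s*([-O])\s+(.*?)\s*>", re.S)
# Keyword, element name, attribute name, enumerated values, default value.
ATTLIST_RE = re.compile(r"<!\s*ATTLIST\s+(\w+)\s+(\w+)\s+\(([^)]*)\)\s+(\w+)\s*>", re.S)


def parse_dtd(text):
    """Collect element and attribute definitions into plain dictionaries."""
    elements = {}
    for name, start, end, model in ELEMENT_RE.findall(text):
        elements[name] = {
            "omit_start": start == "O",   # "O" means the tag may be omitted
            "omit_end": end == "O",
            "content_model": model,
        }
    attributes = {}
    for elem, attr, values, default in ATTLIST_RE.findall(text):
        attributes.setdefault(elem, {})[attr] = {
            "values": [v.strip() for v in values.split("|")],
            "default": default,
        }
    return {"elements": elements, "attributes": attributes}


dtd = parse_dtd(MEMO_DTD)
print(sorted(dtd["elements"]))
print(dtd["attributes"]["Memo"]["status"])
```

The patterns only cover enumerated attribute values such as `(confiden | public)`; other declared value types would need additional cases.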

In addition to storing each DTD in a file, we also create an object type for all DTDs and store each DTD as an object of that type. The structure of the object type for all DTDs is shown in Figure 4, which is drawn in the OMT notation.

Now let us discuss how to store document instances. We create an object type for each element definition in a DTD. In this way, each element of an SGML document is stored as an object. This storage representation is advantageous for declarative queries and fine-grained modification of documents. Our system is capable of storing, within one database, different types of documents by accommodating multiple DTDs. This is accomplished by creating different object types for different DTDs. For example, the object types created for element definitions in the memo DTD are shown in Figure 5.

[Figure 4 (OMT diagram, not reproduced): the DTD object type has a name and is composed of entity definitions (name, data), one or more element definitions (name, tag omission, content model), and attribute definitions (name, all values, default value).]

Figure 4. Object Types for All DTDs

[Figure 5 (diagram, not reproduced): Memo is composed of To, From, Body, and Close; Body contains P, and P contains Q; To, From, P, Q, and Close hold text.]

Figure 5. Object Types for Elements in Memo DTD
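The idea of one object type per element definition can be sketched in Python by creating classes dynamically. This is an in-memory illustration only: it does not use the ObjectStore API, and the names `ElementObject` and `generate_types`, as well as the convention of prefixing each type with its DTD name, are assumptions introduced here.

```python
class ElementObject:
    """Hypothetical common base for all generated element types."""

    def __init__(self, *children, **attributes):
        self.children = list(children)      # text strings or other ElementObjects
        self.attributes = dict(attributes)

    def text(self):
        """Concatenate the text data of this element and its sub-elements."""
        parts = []
        for child in self.children:
            parts.append(child if isinstance(child, str) else child.text())
        return " ".join(parts)


def generate_types(dtd_name, element_names):
    """Create one class per element definition, prefixed with the DTD name."""
    return {
        name: type(f"{dtd_name}_{name}", (ElementObject,), {"element_name": name})
        for name in element_names
    }


memo_types = generate_types("Memo", ["Memo", "To", "From", "Body", "P", "Q", "Close"])
letter_types = generate_types("Letter", ["To"])   # a second, hypothetical DTD

to = memo_types["To"]("All Employees")
memo = memo_types["Memo"](to, memo_types["From"]("The President"), status="public")
print(type(to).__name__, "|", memo.text())
```

Because each type name carries its DTD name, an element called To in the memo DTD and an element called To in another DTD become distinct types, which is one way several DTDs could coexist in a single database.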

4. A Document Management System

A document management system must support the functionality of storing and accessing documents. The system we developed can be used to define an ObjectStore object-oriented database and store SGML documents in the database. In addition, it supports retrieval, modification, and deletion of SGML documents. The system architecture of our document management system is shown in Figure 6.

[Figure 6 (architecture diagram, not reproduced): DTDs pass through the DTD parser and the type generator, and documents pass through the SGML parser and the instance generator, into the object database managed by the OODBMS; the query processor accesses the same database.]

The system includes three primary components: a DTD parser, an SGML parser, and a query processor. The DTD parser accepts a new DTD and automatically transforms the DTD into a collection of object type definitions. Each object type definition corresponds to an element definition in the DTD. The type generator is responsible for creating these object types in the database. The SGML parser accepts an SGML document, parses the document, and breaks the document instance into elements. The instance generator is responsible for automatically storing the document instance in the database by instantiating objects of appropriate object types. Each object corresponds to an element of the document instance. The query processor is responsible for retrieval, modification, and deletion of documents.
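The division of work between the SGML parser and the instance generator can be illustrated with a small Python sketch that breaks a document instance into elements and builds one object per element. It assumes a fully tagged instance with no omitted tags and no attributes, which is much simpler than real SGML parsing; the names `Element` and `build_instance` are hypothetical, and the memo text is abbreviated from Figure 1.

```python
import re


class Element:
    """One object per element of the document instance."""

    def __init__(self, name):
        self.name = name
        self.children = []   # nested Element objects and text strings


# A start tag, an end tag, or a run of text between tags.
TOKEN_RE = re.compile(r"<(/?)(\w+)>|([^<]+)")


def build_instance(source):
    """Turn a fully tagged document instance into a tree of Element objects."""
    root = None
    stack = []
    for closing, name, text in TOKEN_RE.findall(source):
        if text:
            if text.strip() and stack:
                stack[-1].children.append(text.strip())
        elif closing:
            if not stack or stack[-1].name != name:
                raise ValueError(f"unexpected end tag </{name}>")
            stack.pop()
        else:
            node = Element(name)
            if stack:
                stack[-1].children.append(node)
            else:
                root = node
            stack.append(node)
    if stack:
        raise ValueError(f"missing end tag for <{stack[-1].name}>")
    return root


MEMO = """<Memo><To>All Employees</To><From>The President</From>
<Body><P>It is good news. <Q>Working hard is the best policy.</Q> I hope our
company will be better tomorrow.</P></Body><Close>Isaac Newton</Close></Memo>"""

memo = build_instance(MEMO)
print(memo.name, [child.name for child in memo.children])
```

In a fuller version each start tag would instantiate the object type generated for that element definition rather than a generic `Element`.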

The system provides a graphical user interface. Figure 7 shows the main menu of the system. The main menu contains seven menu items: File, Edit, View, Parser, Database, Window, and Help. When a menu item is selected, a pull-down menu containing several commands is displayed.

Figure 7. Main Menu of the Document Management System

The File menu includes commands for opening a file, closing a file, saving a file, saving a file under another name, previewing a file, setting up the print format of a file, printing a file, and exiting the program. The Edit menu includes commands for undoing (and redoing) the previous command, cutting, copying, pasting, and searching and replacing the content of a file. The View menu includes commands to display (or hide) the toolbar and the status bar. The Parser menu includes commands for invoking the DTD parser and the SGML parser. The Database menu is used for retrieving, modifying, and deleting documents. The Window menu includes commands for opening a new window and displaying opened windows in a cascade or tile arrangement. Finally, the Help menu provides on-line help for using this SGML document management system.

The system provides two different interfaces for the user to retrieve, modify, and delete SGML documents: one is command-driven and the other is form-driven. In the command-driven interface, the user enters statements in an Object SQL-like language.

The statement for retrieving documents takes the form

SELECT elements FROM DTD WHERE condition

where the DTD specifies the DTD of the collection of documents to be searched, the condition specifies a condition to be satisfied by the retrieved documents, and the elements specifies the elements of the documents to be displayed. For example, the following statement

SELECT * FROM Memo

WHERE Memo.To contains ‘All Employees’ and Memo.From contains ‘The President’

retrieves all documents of the Memo DTD in which the element To contains ‘All Employees’ and the element From contains ‘The President’.
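To make the semantics of the retrieval statement concrete, the following Python sketch evaluates the example query over a toy in-memory collection. It does not parse the statement language; the condition is written directly as a Python function. The representation of a document as a dictionary keyed by dotted element paths, and the names `contains` and `select`, are assumptions made for illustration.

```python
def contains(doc, path, phrase):
    """True if the element addressed by the dotted path contains the phrase."""
    return phrase in doc.get(path, "")


def select(database, dtd, condition, elements=None):
    """Return the requested elements of every document of `dtd` meeting `condition`."""
    results = []
    for doc in database.get(dtd, []):
        if condition(doc):
            if elements is None:             # corresponds to SELECT *
                results.append(dict(doc))
            else:
                results.append({p: doc[p] for p in elements if p in doc})
    return results


# Toy collection: documents are grouped by the name of their DTD.
database = {
    "Memo": [
        {"Memo.To": "All Employees", "Memo.From": "The President", "Memo.Close": "Isaac Newton"},
        {"Memo.To": "All Managers", "Memo.From": "The President", "Memo.Close": "Isaac Newton"},
    ]
}

hits = select(
    database,
    "Memo",
    lambda d: contains(d, "Memo.To", "All Employees")
    and contains(d, "Memo.From", "The President"),
)
print(len(hits), hits[0]["Memo.To"])
```

Passing a list such as `["Memo.Close"]` as `elements` would correspond to naming specific elements in the SELECT clause instead of `*`.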

The statement for modifying documents takes the form

UPDATE DTD

elements modification

WHERE condition

where the DTD specifies the DTD of the collection of documents to be updated, the condition specifies the condition to be satisfied by the updated documents, and the elements modification specifies how the elements of the documents are to be modified. There are two ways to modify an element: one is to replace the whole element and the other is to replace only part of the element. For example, the following statement

UPDATE Memo

replace ‘President’ in Memo.From by ‘Chair’

WHERE Memo.From contains ‘The President’

modifies all documents of the Memo DTD in which the element From contains ‘The President’ by replacing ‘President’ in the element From by ‘Chair’.
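The partial-replacement form of the update statement can be sketched in the same spirit. The Python fragment below applies the example modification to a toy collection; the function name `update_replace` and the dictionary representation of documents are hypothetical, and the statement itself is not parsed.

```python
def update_replace(database, dtd, path, old, new, condition):
    """Replace `old` by `new` inside one element of every matching document."""
    changed = 0
    for doc in database.get(dtd, []):
        if path in doc and condition(doc):
            doc[path] = doc[path].replace(old, new)   # only part of the element changes
            changed += 1
    return changed


database = {
    "Memo": [
        {"Memo.To": "All Employees", "Memo.From": "The President"},
        {"Memo.To": "All Employees", "Memo.From": "The Treasurer"},
    ]
}

changed = update_replace(
    database,
    "Memo",
    "Memo.From",
    "President",
    "Chair",
    lambda d: "The President" in d.get("Memo.From", ""),
)
print(changed, database["Memo"][0]["Memo.From"])
```

Replacing a whole element, the other way of modifying mentioned above, would simply assign a new value to `doc[path]` instead of calling `replace`.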

The statement for deleting documents takes the form

DELETE FROM DTD WHERE condition

where the DTD specifies the DTD of the collection of documents to be deleted and the condition specifies the condition to be satisfied by the deleted documents. For example, the following statement

DELETE FROM Memo

WHERE Memo/status = ‘confiden’

deletes all documents of Memo DTD in which the value of the attribute status in the element Memo is ‘confiden’.
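A deletion conditioned on an attribute value could look like the Python sketch below. It assumes, for illustration, that attributes are kept in a per-document dictionary keyed as `Memo/status` and that a missing attribute falls back to the default `public` declared by the ATTLIST in Figure 2; the names `status_of` and `delete_where` are hypothetical.

```python
STATUS_DEFAULT = "public"   # default value declared by the ATTLIST in the memo DTD


def status_of(doc):
    """Value of the status attribute of Memo, or the declared default if absent."""
    return doc["attributes"].get("Memo/status", STATUS_DEFAULT)


def delete_where(database, dtd, condition):
    """Remove every document of `dtd` that satisfies `condition`; return the count."""
    documents = database.get(dtd, [])
    kept = [doc for doc in documents if not condition(doc)]
    database[dtd] = kept
    return len(documents) - len(kept)


database = {
    "Memo": [
        {"attributes": {"Memo/status": "confiden"}, "Memo.To": "Board"},
        {"attributes": {"Memo/status": "public"}, "Memo.To": "All Employees"},
        {"attributes": {}, "Memo.To": "All Employees"},   # status defaults to public
    ]
}

removed = delete_where(database, "Memo", lambda d: status_of(d) == "confiden")
print(removed, len(database["Memo"]))
```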

In the form-driven interface, the user first selects the statement (retrieve, modify, or delete) as well as the DTD of the documents to be accessed. For different statements and DTDs, the system provides different statement-and-DTD-specific forms to the user. The user only has to fill in the information to execute the statement.

5. Conclusion

In this paper we described an SGML document management system based on an object-oriented database. The system stores SGML documents in an ObjectStore object-oriented database. We create an object type for all DTDs and store each DTD as an object of that object type. We create an object type for each element definition in a DTD and store each element of an SGML document as an object. This database representation is advantageous for declarative queries and fine-grained modification of documents. The system supports automatic creation of object types and insertion of documents into the database. It provides two different interfaces for the user to retrieve, modify, and delete documents. Currently it supports only declarative queries. We plan to add a navigational access function to make the system useful in the WWW environment.

References

[1] Bohm, K. and Aberer, K., “HyperStorM - Administering Structured Documents Using Object-Oriented Database Technology,” in Proceedings of 1996 ACM SIGMOD International Conference on Management of Data, Montreal, Canada, p. 547 (1996).

[2] Bohm, K., Aberer, K., Neuhold, E.J. and Yang, X., “Structured Document Storage and Refined Declarative and Navigational Access Mechanisms in HyperStorM,” The VLDB Journal, Vol. 6, No. 4, pp. 296-311 (1997).

[3] Christophides, V., Abiteboul, S., Cluet, S. and Scholl, M., “From Structured Documents to Novel Query Facilities,” SIGMOD Record, Vol. 23, No. 2, pp. 313-324 (1994).

[4] Goldfarb, C.F., The Standard Generalized Markup Language (ISO 8879), International Organization for Standardization, Geneva (1986).

[5] Maler, E. and El Andaloussi, J., Developing SGML DTDs: From Text to Model to Markup, Prentice Hall PTR, Upper Saddle River, New Jersey (1996).

[6] Ozsu, M.T., Iglinski, P., Szafron, D., El-Medani, S. and Junghanns, M., “An Object-Oriented SGML/HyTime Compliant Multimedia Database Management System,” in Proceedings of 1997 ACM Multimedia Conference, Seattle, Washington, USA (1997).

[7] Rumbaugh, J., Blaha, M., Premerlani, W., Eddy, F. and Lorensen, W., Object-Oriented Modeling and Design, Prentice Hall, Englewood Cliffs, New Jersey (1991).

[8] Schouten, H., SGML*CASE: The Storage of Documents in Databases, The Netherlands Ministry for Agriculture and Fisheries, Wageningen (1989).

[9] Sengupta, A. and Dillon, A., “Extending SGML to Accommodate Database Functions: A Methodological Overview,” Journal of the American Society for Information Science, Vol. 48, No. 7, pp. 629-637 (1997).

Manuscript Received: Apr. 12, 2000 Accepted: Nov. 23, 2000
