ORIGINAL ARTICLE
An automatic mark-up approach for structured document
retrieval in engineering design
S. Liu, C. A. McMahon, M. J. Darlington, S. J. Culley, P. J. Wild
Accepted: 22 March 2007
© Springer-Verlag London Limited 2007
Abstract Information and knowledge retrieval has been recognized as a key issue in engineering design. A great deal of design-related information used and generated within engineering companies is formally recorded in documents. These documents become more useful if they are structured in a consistent way so that they can be retrieved and their contents accessed more effectively. Achieving useful structure in electronic documents relies on embedding some sort of mark-up or coding that is computer-understandable. Manual mark-up is time-consuming and costly. This paper proposes a knowledge engineering approach to automatic document mark-up employing XML (the eXtensible Mark-up Language) to ‘tag’ explicitly the structural information. The focus here is on long and complex engineering documents. A three-level model is explored to achieve automatic semantic mark-up using a set of document decomposition schemes. The model includes a strategic level which identifies document typographical features based on such things as styles, inference or templates; a tactical level to define the rules to realize semantic mark-up according to the document features; and an operational level to perform the computational implementation of the mark-up rules. By making document structure explicit, information retrieval can be made more focused by returning not just whole documents but the document components that are most relevant or of most interest to the engineering designer, and information relevant to the designer’s need both with respect to document structure and content, not content alone. In addition, interpretation of useful structure by the human user can be hardwired into documents, which allows us to move closer to true semantic level retrieval.
Keywords Knowledge engineering approach · Automatic mark-up · Structured document retrieval · Engineering design · Document decomposition · XML
S. Liu (*), C. A. McMahon, M. J. Darlington, S. J. Culley, P. J. Wild
Innovative Manufacturing Research Centre, Department of Mechanical Engineering, University of Bath, BA2 7AY Bath, UK
e-mail: [email protected]
DOI 10.1007/s00170-007-1342-z
1 Introduction
Engineering design is an information and knowledge intensive process that consists of many tasks such as conceptual design, detailed design, engineering analysis, process design, performance evaluation and so on. In each of these tasks, engineering designers often record their own design ideas, solutions and results in documents. In the meantime, they access and retrieve information from numerous documents to make decisions. Indeed, research shows that engineering designers spend as much as 30% of their working time on searching and accessing information [1].
Engineering information can be categorised into three types from the human point of view: structured information (e.g., engineering drawings, process plans, supplier records, bills of materials), semi-structured information (written documents such as meeting minutes, reports, articles, graphs) and non-structured information (dialogues and sketches, etc.) [2]. While structured information may be stored in computer records that facilitate computer interpretation and manipulation, the computer interpretability of semi-structured and unstructured information is very limited, and semantic interpretation of such information is generally only possible by human action.
It has been realised that the value of documents can be greatly increased through mark-up to make the information more accessible [3]. Once documents are marked up, a computer system is able to exploit the implicit semantics in the mark-up, thus allowing for operations that are closer to the semantic level, which helps engineers to find the most ‘meaningful’ information for their design tasks. Therefore, the authors of this paper suggest that engineering documents be classified into three categories according to the nature of their mark-up, and the following computer-centric definitions are adopted:
– Documents without mark-up to specify content, such as plain text documents, are considered to be unstructured.
– Documents marked up with HTML (Hypertext Mark-up Language) or other similar mark-up languages, but without a schema or DTD (document type definition) to define the full meaning of the documents, are considered to be semi-structured.
– Documents marked up with a schema or DTD that accurately defines the meaning and structure of the documents are considered to be structured.
It is the vocabulary and lexicons—the basis of a schema or a DTD—that allow the document to be sorted, classified, taken apart, re-assembled, or transformed as appropriate. Therefore, ideally, all documents should be structured.
Computational mark-up can be applied to documents only when they are in electronic form; therefore, paper documents must be converted before mark-up occurs. This transformation can be made either by direct keyed-entry or by scanning followed by optical character recognition (OCR) [4]. Document mark-up can be undertaken at the macro-level (dealing with the global visual and logical structure of a document), the micro-level (used for marking single words or word groups), or at the symbol-level [3]. Mark-up can be done manually, semi-automatically or automatically according to how much human effort is required. Manual mark-up is the most accurate but most labour-intensive; here the human reads the document, identifies and gives appropriate labels to structural elements embedded within the documents, and tags the document accordingly. Where there are many documents, such as is often the case in the engineering design context, the manual method is clearly inappropriate on grounds of effort and cost. Fully automatic mark-up is clearly preferable since it not only reduces manual intervention but also gives a more coherent representation of the documents. The literature on automatic mark-up is quite limited. For example, [5] and [6] both explore automatic mark-up, but with different methods and targeting different types of documents.
To better meet the challenges of structured document retrieval and access, advances in automatic semantic mark-up, especially with a specific DTD or vocabulary appropriate to a particular context or domain, are highly desirable. This paper addresses the automatic mark-up of logical content elements as well as physical structural elements that comply with a specified vocabulary for engineering documents based on a set of document decomposition schemes. The principal purpose of the automatic mark-up is to help engineers extract the document fragments most relevant to their information needs, rather than to return whole documents.
The remaining parts of this paper are organized as follows: Sect. 2 introduces the document decomposition strategy and vocabulary development. Section 3 presents the process of automatic mark-up. A three-level model of automatic mark-up with XML is discussed in Sect. 4. Section 5 investigates the application of the automatic mark-up system to a fragment retrieval web service. Finally, Sect. 6 draws some conclusions.
2 Vocabulary development through document decomposition
It is well recognized that engineering documents are very diverse in their physical form, their structure and their content. Because of this it is non-trivial to develop a DTD or vocabulary that embraces the generality of engineering documents. Based on an analysis of the characteristics of engineering documents [1] and an empirical study on how engineers use documents [7], the authors took an approach to mark up (as illustrated in Fig. 1), which allows documents to be decomposed structurally from multiple viewpoints each of which has its own vocabulary of ‘labels’.
Document decomposition has been seen as an important but difficult task because of the diversity, dynamics and heterogeneity of engineering documents. Document decomposition can be made not only from different viewpoints and across different dimensions but also within those dimensions at different levels of granularity and abstraction. Decomposition can take place not only based on what a document says (content) but also what is said about it (context). Closely connected with this is decomposition according to explicit and implicit content.
Based on the above considerations, the authors have developed eleven decomposition schemes to interpret engineering documents comprehensively. Six of them are illustrated in Fig. 1: the physical structure decomposition scheme (PSDS), the convention-based logical content decomposition scheme (LCDS-CB), the post hoc analysis logical content decomposition scheme (LCDS-PHA), the media-type decomposition scheme (MTDS), the document context decomposition scheme (DCDS) and the technical description decomposition scheme (TDDS). Detailed discussion of these decomposition schemes can be found in the authors’ earlier publication [8]. Accordingly, a vocabulary has been developed that contains the elements and their definitions within the decomposition schemes. The results from the empirical study of engineers’ behaviour in using documents have also been taken into account to make sure that the vocabulary will meet engineers’ requirements. Once we have the vocabulary, we have the foundation to define a DTD or a schema for semantic mark-up. The creation of DTDs for the decomposition schemes has been discussed elsewhere [9].
3 Process of automatic mark-up and key controls
The importance of the conversion of other document formats into XML has been addressed in [10]. The automatic mark-up with XML of legacy documents in other formats, such as Word and PDF, includes three main steps: format conversion, semantic mark-up and creation of an output hierarchy. Figure 2 is the process model represented in IDEF0 (Integrated DEFinition method) [11], which shows the main activities and information flow (inputs I, outputs O, controls C and mechanisms M) within the process. Format conversion (activity A1) takes in legacy documents (non-structured text documents or semi-structured documents such as in Word, PDF, Excel, etc.) and generates ppXML (pre-processed XML) documents (i.e., stylistic XML but with non-interpreted tags). The ppXML documents are well-formed XML but not all the tags of the document content have specific meaning; therefore, they are still semi-structured.
At the semantic mark-up stage, ppXML documents will be annotated by computers under two significant controls: mark-up rules (C1) and the DTD (C3). Meaningful tags compliant with the vocabulary will be placed for corresponding elements. It is essential that an XML document conforms to its special syntax, for example, all elements must be nested properly in the hierarchy, which can be realised through activity A3. The final output of the automatic mark-up process should be well formed with meaningful tags.
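The nesting requirement mentioned above can be illustrated with a minimal Python sketch. It is not part of the authors' Java-based system; the element names are hypothetical, and the check covers well-formedness only, since Python's standard library does not validate against a DTD.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Hypothetical element names, used only for illustration.
properly_nested = (
    "<document><section><heading>1 Introduction</heading></section></document>"
)
# The heading is closed after its parent section, so the nesting is broken.
badly_nested = (
    "<document><section><heading>1 Introduction</section></heading></document>"
)

print(is_well_formed(properly_nested))
print(is_well_formed(badly_nested))
```

Checking that the tags also conform to the vocabulary (control C3) would require a validating parser in addition to this test.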
The success of the above automatic mark-up process is greatly dependent on the mark-up rules (Control C1) that will instruct the computer to interpret the document content. How to develop and implement mark-up rules is discussed in the following section.
4 A knowledge engineering approach to automatic mark-up
Generally, there are two types of approaches to developing mark-up systems: a knowledge engineering approach and a machine learning approach. In machine learning, an annotated corpus of documents is needed. The learning system learns rules from the training data and applies them to mark up documents. In contrast, mark-up rules to be applied by the system can be constructed by knowledge engineers. In this case, experienced knowledge engineers’ implicit expertise can be transferred to the system by ‘hardwiring’ rules to achieve better results. Especially when appropriate knowledge (e.g., lexicons) and human resources (i.e., knowledge engineers / rule writers) are available, the knowledge engineering approach is to be preferred. The authors of this paper took the knowledge engineering approach because they intended to make use of their own knowledge in the understanding of engineering document decomposition and engineers’ behaviour in using documents based on the experience accumulated over many years of research. In addition they have defined and developed an XML DTD for the vocabulary of engineering documents.
Fig. 1 Document decomposition and the vocabulary development
A three-level model has been proposed, as shown in Fig. 3, which includes a strategic level, a tactical level and an operational level. Figure 3 summarizes the idea of the development and implementation of mark-up rules. At the highest level, decisions are made on the strategy, that is, on the kind of provenance (cue) that the mark-up rules will be based on: for example, style based, template based, or inference based. The figure illustrates some typical style information embedded in a common Word document. For example, a document title may be written in Arial 14pt bold; first-level headings may be specified to be written in Times New Roman 13pt bold, and the body text in Times New Roman 12pt roman. Using the CambridgeDocs software xDoc XML Converter [12], when the Word document is converted into a ppXML document, the style information is captured and represented as attribute/value pairs.
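A style-based strategy of this kind can be sketched in a few lines of Python. The ppXML fragment below is an assumption made for illustration: the actual output format of the xDoc XML Converter is not given in this paper, so the element name p and the attributes font, size and weight are hypothetical, as are the target tag names.

```python
import xml.etree.ElementTree as ET

# Hypothetical ppXML: style information held as attribute/value pairs.
PPXML = """<doc>
  <p font="Arial" size="14" weight="bold">An automatic mark-up approach</p>
  <p font="Times New Roman" size="13" weight="bold">1 Introduction</p>
  <p font="Times New Roman" size="12" weight="normal">Engineering design is ...</p>
</doc>"""

# Style cue -> semantic tag, following the style example in the text.
STYLE_RULES = {
    ("Arial", "14", "bold"): "title",
    ("Times New Roman", "13", "bold"): "heading",
    ("Times New Roman", "12", "normal"): "paragraph",
}


def apply_style_rules(ppxml: str) -> ET.Element:
    """Rename stylistic elements to semantic ones where a style rule matches."""
    root = ET.fromstring(ppxml)
    # Materialise the list first because tags are renamed during the loop.
    for element in list(root.iter("p")):
        key = (element.get("font"), element.get("size"), element.get("weight"))
        semantic_tag = STYLE_RULES.get(key)
        if semantic_tag is not None:
            element.tag = semantic_tag
            element.attrib.clear()  # style cues are no longer needed
    return root


marked_up = apply_style_rules(PPXML)
print(ET.tostring(marked_up, encoding="unicode"))
```

Elements whose style matches no rule are left untouched, so they remain visible for later rule refinement.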
At the tactical level, specific mark-up rules are developed to realize the strategies set at the strategic level. Three types of semantic mark-up rules have been defined: simple rules, compound rules and advanced rules. A simple rule tests a single element. When simple rules are used together, they are defined as compound rules and are conjoined using Boolean logic (AND, OR, NOT), comparison rules (contains, equals, less than, greater than, true, false) and shorthand rules (element name with attributes). The advanced rules make use of sequences of elements, for example, grammar sequences, start-to-end sequences and XPath expressions.
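One possible reading of these rule types is sketched below in Python. It is not the authors' rule syntax, which this paper does not show: rules are modelled as predicates over an element, the helper names (attr_equals, text_contains, all_of, any_of, negate) and the sample attributes are hypothetical, and the path expression uses only the limited XPath subset supported by Python's standard library.

```python
import xml.etree.ElementTree as ET
from typing import Callable

Rule = Callable[[ET.Element], bool]


def attr_equals(name: str, value: str) -> Rule:
    """Simple comparison rule: attribute 'equals' value."""
    return lambda el: el.get(name) == value


def text_contains(fragment: str) -> Rule:
    """Simple comparison rule: element text 'contains' fragment."""
    return lambda el: fragment.lower() in (el.text or "").lower()


def all_of(*rules: Rule) -> Rule:
    """Compound rule: Boolean AND."""
    return lambda el: all(rule(el) for rule in rules)


def any_of(*rules: Rule) -> Rule:
    """Compound rule: Boolean OR."""
    return lambda el: any(rule(el) for rule in rules)


def negate(rule: Rule) -> Rule:
    """Compound rule: Boolean NOT."""
    return lambda el: not rule(el)


doc = ET.fromstring(
    """<doc>
  <p size="13" weight="bold">1 Introduction</p>
  <p size="12" weight="normal">Body text.</p>
  <p size="13" weight="bold">References</p>
</doc>"""
)

is_bold = attr_equals("weight", "bold")
is_heading = all_of(is_bold, attr_equals("size", "13"))
is_references_heading = all_of(is_heading, text_contains("references"))
is_body_text = negate(is_bold)

paragraphs = doc.findall("p")
print([is_heading(p) for p in paragraphs])

# Advanced rule expressed as a path expression over the whole tree.
bold_elements = doc.findall(".//p[@weight='bold']")
print(len(bold_elements))
```

Building compound rules from small predicates keeps each rule readable and lets a rule set be refined one condition at a time.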
Programming for implementation of these rules is undertaken at the operational level. The authors employed Java and the XML DOM (document object model) to develop a software system, named AutoMarker, which consists of two modules that provide the automatic mark-up functionality for four chosen decomposition schemes, as illustrated in Fig. 4. In module 1, automatic mark-up is undertaken against three decomposition schemes: the physical structure decomposition scheme (PSDS), the document context decomposition scheme (DCDS) and the media-type decomposition scheme (MTDS). Once document elements have been marked up with the vocabulary from the above three decomposition schemes, module 2 uses the information to detect and tag the elements against the convention-based logical content decomposition scheme (LCDS-CB), for example to mark a section as an introduction, a case study, or a conclusion.
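The second step, labelling already-marked sections with logical content types, might look as follows. This is a hedged Python sketch rather than AutoMarker's Java code: it relies on heading keywords only, whereas the experiments reported later also use sequencing cues, and the section and heading element names and the logical attribute are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Labels taken from the convention-based logical content scheme.
LOGICAL_LABELS = ["introduction", "background", "case study", "discussion", "conclusion"]


def label_sections(root: ET.Element) -> ET.Element:
    """Add a 'logical' attribute to sections whose heading names a known label."""
    for section in root.iter("section"):
        heading = section.find("heading")
        if heading is None or not heading.text:
            continue
        # Drop leading numbering such as "4.1 " before matching.
        title = re.sub(r"^[\d.\s]+", "", heading.text).lower()
        for label in LOGICAL_LABELS:
            if label in title:  # "conclusions" also matches "conclusion"
                section.set("logical", label)
                break
    return root


document = ET.fromstring(
    """<document>
  <section><heading>1 Introduction</heading><paragraph>Some text.</paragraph></section>
  <section><heading>5 Application of the automatic mark-up system</heading></section>
  <section><heading>6 Conclusions</heading><paragraph>Closing text.</paragraph></section>
</document>"""
)

label_sections(document)
print([section.get("logical") for section in document.iter("section")])
```

Sections whose headings match no label keep no logical attribute, which makes unrecognised cases easy to find when refining the rules.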
If the document users are not satisfied with the marked-up output, the required amendments should be identified and the mark-up rules refined, or correction rules added to guide the program code. It is an advantage of the knowledge engineering approach that effective rule sets can be constructed iteratively and transparently. Knowledge engineers can start with relatively simple rule sets, evaluate the results and refine them stepwise as their understanding of the documents improves.
5 Application of the automatic mark-up system
In the authors’ earlier study of engineering document management, a fragment retrieval web service was developed and used to organise information for engineers [8]. Figure 5 illustrates the application of the automatic mark-up system, i.e., the AutoMarker, to the web-based document fragment retrieval system to automate the structured document management process. In the existing fragment retrieval web service, there are four modules to fulfil the functions of document mark-up, fragment extraction, fragment classification and fragment retrieval (navigation and presentation). The latter three functions are performed automatically either by programming or by using an existing system (in this instance, Waypoint [13]).
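To make the idea of fragment retrieval concrete, the Python sketch below returns only those paragraphs that satisfy both a structural condition and a content condition. It does not reproduce the web service or Waypoint; the element names, the logical attribute and the function retrieve_fragments are assumptions introduced for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical marked-up document.
MARKED_UP = """<document>
  <title>Gear design report</title>
  <section logical="introduction">
    <heading>1 Introduction</heading>
    <paragraph>This report covers gear fatigue.</paragraph>
  </section>
  <section logical="conclusion">
    <heading>5 Conclusions</heading>
    <paragraph>Fatigue life met the target.</paragraph>
    <paragraph>Noise levels need further work.</paragraph>
  </section>
</document>"""


def retrieve_fragments(xml_text: str, logical: str, keyword: str) -> list:
    """Return paragraphs inside sections of one logical type that mention keyword."""
    root = ET.fromstring(xml_text)
    hits = []
    for section in root.findall(f".//section[@logical='{logical}']"):
        for paragraph in section.iter("paragraph"):
            # Collapse whitespace so fragments read cleanly.
            text = " ".join("".join(paragraph.itertext()).split())
            if keyword.lower() in text.lower():
                hits.append(text)
    return hits


print(retrieve_fragments(MARKED_UP, "conclusion", "fatigue"))
```

A content-only search for "fatigue" would also return the introductory paragraph; the structural condition is what narrows the result to the fragment of interest.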
In that experimental system, document mark-up was done manually, with all the attendant time and effort, thereby impeding the automation of the whole information management process. Integrating the AutoMarker automatic mark-up system into the existing fragment retrieval system removes the need for manual mark-up.
Experiments have also been undertaken with AutoMarker to evaluate the knowledge engineering automatic mark-up approach proposed in Sect. 4. A set of design conference papers was selected as source documents for the experiments because these papers were uniformly within the engineering design context and were prepared with a formally defined Word document template. They were taken in by the AutoMarker system as original inputs. The outputs of the system, i.e., the target XML documents, were checked and validated by the dedicated XML software Altova XMLSpy [14]. Table 1 summarises the experimental results.
Fig. 3 The three-level model for automatic mark-up
Column 1 in the table lists the structure types that have been identified through decomposition schemes in the source documents. Example elements from each corresponding decomposition scheme are given in column 2. For example, the physical structure decomposition scheme considers elements like title, heading, section, paragraph, lists and so on. Authors, references, keywords and acknowledgement are information about the document; therefore, they are included in the document context decomposition scheme. The media-type decomposition scheme looks into a document through elements like table, graphics, image and text. Typical elements in the logical content decomposition scheme include introduction, background, case study, discussion and conclusion. For the above elements in a document to be correctly recognised and marked up, provenance or cues of the elements have to be conveyed from the source document to the ppXML document through the document template. The main cues used in this work include stylistic information, containment information, content information and sequencing information, as seen from column 3. AutoMarker can then pick up the cues from column 3 and implement them within corresponding mark-up rules, which are shown in column 4. Columns 5, 6 and 7 present the results of the experiments. They show that all the elements from column 2 are correctly picked up by the AutoMarker system, and the target XML documents are both well formed and validated, following the validation process undertaken through a dedicated XML software tool, in this case Altova XMLSpy. AutoMarker’s most notable success is the ability to pick up such a variety of elements across the physical structure decomposition scheme, the document context decomposition scheme, the media-type decomposition scheme and the convention-based logical content decomposition scheme.
Even though the above experiments were carried out with conference papers, the mark-up approach and rules are general provided the design information of the document template can be captured and recognised by the AutoMarker. Changes to the design information of the template can be passed to the AutoMarker through programming parameters. Therefore, the proposed knowledge engineering automatic mark-up approach is not restricted to a specific type of document, and has much wider applicability. More importantly, the approach provides the capability for engineering designers to consistently mark up legacy documents in a way that is compatible with a well-developed engineering lexicon and DTDs.
Table 1 Summary of the experiments of using AutoMarker
| Structure types identified in the source documents | Element examples in the source documents | Main cues conveyed through template to ppXML | Mark-up rules applied to the elements | Elements correctly marked-up | Output XML documents well formed? | Output XML documents validated? |
|---|---|---|---|---|---|---|
| Physical structure decomposition scheme | Title, section heading, subsection heading, section, subsection, paragraph, list | Stylistic information, containment information | Shorthand rules, comparison rules, XPath expression | All | Yes | Yes |
| Document context decomposition scheme | Authors, references, keywords, acknowledgement | Stylistic information | Boolean logic, comparison rules | All | Yes | Yes |
| Media-type decomposition scheme | Table caption, figure caption, table data, figure | Stylistic information | Comparison rules, Boolean logic | All | Yes | Yes |
| Logical content decomposition scheme, convention-based | Introduction, background, case study, discussion, conclusion | Content information, sequencing information | Start_to_end, XPath expression, grammar sequence, comparison rules | All | Yes | Yes |
Fig. 5 Application of AutoMarker to the fragment retrieval web service
The research reported in this paper distinguishes itself from other related work in that it explored an automatic mark-up approach that can correctly interpret engineering document elements both at macro level and micro level, across multiple dimensions and viewpoints, as shown in Table 2. From the table, we can see that most other existing work deals with documents outside the engineering domain, and is restricted to micro-level mark-up, such as from sentence level down to clause, phrase and word level. The authors believe that the automatic mark-up approach proposed in this paper represents a first attempt at focusing on engineering document mark-up in this way.
6 Conclusions
This paper has introduced a knowledge engineering approach to automatic document mark-up so that more focused information retrieval can be realised to facilitate the engineering design process. A three-level model of automatic mark-up with XML has been discussed and applied to a number of decomposition schemes. The significance of the decomposition of engineering documents into predefined elements and the study of structured document retrieval arises with the ever-increasing importance of collaborative design and information-supported co-ordinated decision making. It is well recognised that the co-operation and co-ordination between design team members relies heavily on consistent communication and information access. This paper reports on work which is a first attempt at an automatic mark-up system compliant with decomposition schemes to support structured document retrieval in the engineering design domain.
The main contribution of this work is that it has proposed an automatic mark-up approach which achieves the following:
(1) Integrates lexicons and DTDs that represent the content of engineering documents through decomposition schemes
(2) Can provide mark-up of document elements both at micro-level (from sentences down to words) and macro-level (from paragraph up to subsections and sections)
(3) Can interpret logical content elements exemplified by such things as introduction, background and conclusion
Table 2 Comparison with other related work
| Research work | Mark-up approach | Document types / Domains | Mark-up levels | Element types |
|---|---|---|---|---|
| ICID project at Uni. of Bath | Knowledge engineering | Engineering design documents | Macro- & micro-level | Logical content elements, title, author, keyword.... |
| Uni. of College Dublin [6] | Self-organising map (SOM) | Business letters | Micro-level | Address, date, salutation |
| Uni. of Duisburg–Essen [17] | Knowledge engineering | Encyclopaedia of arts (text) | Micro-level | Name, birth place, date of birth, profession |
| Uni. of Western Ontario [15] | Inductive machine learning | Taxonomic descriptions of plants (flora) | Sentence / Clause | Seed cones, roots, stems, buds |
| Uni. of Nevada [4] | Optical character recognition (OCR), page zoning | Printing material, technical documents | Sentences, title, author | Not for logical content elements |
Table 3 A hybrid approach for future work on remaining seven decomposition schemes
| No. | Decomposition schemes | Example elements | Mark-up strategy | Mark-up level |
|---|---|---|---|---|
| 1 | Logical content decomposition scheme, post hoc analysis | Aim, explanation | Machine learning | Macro & micro |
| 2 | Technical description decomposition scheme | Instruction, procedure | Machine learning | Macro & micro |
| 3 | Linguistic / Grammatical decomposition scheme | Clause, verb phrase, verb | Natural language processing (NLP) | Micro |
| 4 | Semantic structure decomposition scheme | Topic, predicate | Natural language processing (NLP) | Micro |
| 5 | Semantic category decomposition scheme | Event, action, manner | Natural language processing (NLP) | Micro |
| 6 | Process / Function decomposition scheme | Input, output | To be identified | To be identified |
| 7 | Temporal decomposition scheme | Date, stage / phase | To be identified | To be identified |
Further work on automatic mark-up exploring all eleven decomposition schemes will be done using a hybrid approach which combines knowledge engineering and machine learning supported, perhaps, by Natural Language Processing (NLP) [16]. Table 3 outlines possible future work relating to the remaining seven decomposition schemes.
Acknowledgement The research reported in this paper was undertaken as part of the ‘Study of Document Structure and Information Use Patterns in Engineering Information Management’ (ICID) project at the Engineering Innovative Manufacturing Research Centre (EIMRC) at the University of Bath. It was funded by the UK Engineering and Physical Sciences Research Council (EPSRC) under grant GR/R67507/01.
References
1. Lowe A, McMahon CA, Culley SJ (2004) Characterising the requirements of engineering information systems. Int J Inf Manage 24:401–422
2. Gardoni M, Frank C, Vernadat F (2005) Knowledge capitalisation based on textual and graphical semi-structured and non-structured information: case study in an industry research centre at EADS. Comput Ind 56:231–243
3. Liu S, McMahon CA, Culley SJ (2008) A review of structured document retrieval (SDR) technology to improve information access performance in engineering document management. Comput Ind 59(1):3–16
4. Taghva K, Beckley R, Cooms J (2006) The effects of OCR on the extraction of private information. Document Analysis Systems VII. Proceedings Lect Notes Comput Sci 3872:348–357
5. Feldman R, Rosenfeld B, Fresko M (2006) TEG - a hybrid approach to information extraction. Knowl Inform Syst 9(1):1–18
6. Akhtar S, Reilly RG, Dunnion J (2003) Auto-tagging of text documents into XML. Text, Speech and Dialogue. Proceedings Lecture Notes in Artificial Intelligence 2807:20–26
7. Wild PJ, McMahon CA, Culley SJ, Darlington MJ, Liu S (2006) Towards a method for profiling engineering documentation. Proceeding of the 9th International Conference of Design, Dubrovnik, May 15–18th
8. Liu S, McMahon CA, Darlington MJ, Culley SJ, Wild PJ (2006) A computational framework for retrieval of document fragments based on decomposition schemes in engineering information management. Adv Eng Informat 20:401–413
9. Liu S, McMahon CA, Darlington MJ, Culley SJ, Wild PJ (2007) EDCMS: a content management system for engineering documents. Int J Autom Comput 5(1):56–70
10. Liu S, McMahon CA, Darlington MJ, Culley SJ, Wild PJ (2006) An approach for document fragment retrieval and its formatting issues in engineering information management. Lect Notes Comput Sci 3981:279–287
11. IDEF0, Integrated DEFinition methods, http://www.idef.com/idef0.html
12. CambridgeDocs, http://www.cambridgedocs.com/index.htm
13. McMahon CA, Lowe A, Culley SJ, Corderoy M, Crossland R, Shan T, Stewart D (2004) Waypoint - an integrated search and retrieval system for engineering documents. J Comput Inform Sci Eng 4(4):329–338
14. Altova XMLSpy, http://www.altova.com/
15. Cui H (2005) MARTT: a general approach to automatic mark-up of taxonomic descriptions with XML. Communications of the AIS. Also available on http://cais-acsi.ca/proceedings/2005/cui_2005.pdf
16. Friedman C, Hripcsak G, Shagina L, Liu HF (1999) Representing information using natural language processing and XML. J Am Med Inform Assoc 6(1):76–87
17. Abolhassani M, Fuhr N, Govert N (2003) Information extraction and automatic mark-up for XML documents. Intelligent Search on XML Data, Lect Notes Comput Sci 2818:159–174