DATA INTEGRATION ISSUES WITHIN ND INFORMATION MODEL
FOR URBAN PLANNING
Hongxia Wang, Andy Hamilton
Research Institute for Built and Human Environment, University of Salford, Salford, M7 9NU
E-mail: [email protected]
ABSTRACT: Decision making in urban planning needs to consider the physical structure of the city along with economic, social and environmental factors. An integrated nD information model is therefore needed to provide accurate and comprehensive information for effective decision support. Data integration is vital for the nD information model and is also an important issue for the ongoing Intelcities project, which aims to produce an integrated open source city platform based on integrated spatial and other datasets. This paper investigates the issues related to data integration. First, the requirements of data integration for urban planning are analysed. Subsequently, metadata and data standards are discussed separately with regard to their impact on, and role in, integration. Several data integration technologies, including gateways, federated databases, data warehouses and mediator/wrapper systems, are then discussed. Finally, a hybrid integration framework is designed to support data integration within the nD urban information model.
Key words: Integration, Metadata, Mediator/wrapper, ND modelling, Standard
1. INTRODUCTION
Urban environments have become such complex organisms that no single person or agency has all the data or knowledge needed to make responsible decisions. Urban planning decisions are made through a complex process involving various stakeholders such as planners, developers, politicians, designers, engineers, transport and utility service providers and individual citizens. Participants in the planning process rely on many types of "information," including both the formal analytic reports and quantitative measures and the understandings and meanings attached to planning issues and activities (Innes 1998). Planning information integration is a method for overcoming the barriers to information so that stakeholders have easy access to what is relevant. The increased access to relevant information aided by the implementation of a planning information system can ultimately lead to increases in the quality of plans, the number of alternatives generated, and the quality of decisions (Shiffer 1992). The information required in planning is always a mix of the spatial, aspatial and non-spatial, a blend of the qualitative and the quantitative, covering a wide range of physical, social and economic attributes, many of which are non-comparable with one another (Harris and Batty 1993). For example, finding the right location for a primary school requires geospatial, infrastructure, environmental, housing, and population data.
Many decisions depend on information that can only be obtained by combining various data sources. An integrated multi-dimensional information model is needed for effective decision support. The concept of the nD urban information model was presented in (Wang 2004a), as shown in Figure 1. This concept is influenced by the 3D to nD modelling research project (Lee 2003). The nD urban information model should integrate multi-dimensional urban aspects such as economy, society and environment with a 3D urban model plus a temporal dimension.
Figure 1. Concept of nD urban information model (3D urban model combined with time, social, environmental and economic dimensions)
Data integration is the problem of combining data residing at different sources, and providing the user with a unified view of these data (Lenzerini 2002). Data integration is a vital element for this nD information model. It is also an important issue for the ongoing Intelcities project, which aims to produce an integrated open source city platform based on integrated spatial and other datasets.
However, integrating datasets from multiple dimensions so that they can be queried seamlessly to extract the required information is not a straightforward task. In fact, integration of data from multiple sources is one of the major challenges and opportunities in data management today. To make the problem more tractable, data integration can be broken down into several interrelated issues.
This paper investigates and analyses these data integration issues. Section 2 analyses the characteristics of data sources for urban planning. Metadata and data standards are then discussed separately in sections 3 and 4, where their impact on, and role in, integration are investigated. Several data integration technologies, including gateways, federated databases, data warehouses and mediator/wrapper systems, are introduced in section 5, where a hybrid integration framework is also designed. Finally, future research work is discussed.
2. CHARACTERISTICS OF DATA SOURCES FOR URBAN PLANNING
A good understanding of the status of data sources within the urban planning domain is necessary for integration. Due to the complexity of the urban environment, planning-related datasets are usually held in different locations, on different platforms and under different schemas. The sources of these datasets are distributed across the network and across different organizations and agencies. Planners may not have administrative control over them, and may be unable to modify their structure or write data to them. The sources can change their information without warning. Heterogeneity, autonomy and distribution are the three defining features of these data sources.
2.1 Heterogeneity of the data sources
Data heterogeneity refers to the incompatibilities that may occur among distinct datasets. Each data source might model the world in its own way. The representation of data with similar semantics might be quite different in each data source. For example, each might use different naming conventions to refer to the same real-world object. Moreover, the sources may contain conflicting data. In general, heterogeneity problems can be divided into three levels (Bishr 1998a; Fileto 2001; Ubbo Visser 2001):
• Syntactic heterogeneity
Syntactic heterogeneity refers to discrepancies in the representation of semantically equivalent information. Distinct data sources may use different data models, data types and formats. For example, databases may use different data model paradigms, such as relational or object-oriented models. Syntactic conflicts must be solved from the higher to the lower abstraction levels, taking advantage of abstraction to obtain systematic solutions.
• Schematic heterogeneity
As its name implies, schematic heterogeneity arises when different data sources use different schemas. For example, objects in one database may be considered properties in another, or object classes may have different aggregation or generalization hierarchies, although they describe the same real-world facts. Schema integration resolves such conflicts by generating a mediated schema that characterizes a set of data sources.
• Semantic heterogeneity
Semantic heterogeneity refers to disagreement about the meaning, interpretation or intended use of the same or related data (Sheth 1990). It may include naming conflicts (synonyms, homonyms) and scaling and precision conflicts. For example, the same schema label “NAME” may denote the name of a building, a road or an organization in different sources. Sharing a common ontology among the data source providers, or one that covers the data sources, may resolve semantic heterogeneity.
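The naming and scaling conflicts described above can be illustrated with a small sketch. It assumes a hand-written shared vocabulary standing in for a common ontology; the source names, labels, concept names and sample values are all hypothetical and only serve to show the idea.

```python
# Hypothetical shared vocabulary: (source, local label) -> agreed concept.
# The same label "NAME" means different things in different sources.
SHARED_VOCABULARY = {
    ("buildings_db", "NAME"): "building_name",
    ("roads_db", "NAME"): "road_name",
    ("buildings_db", "HEIGHT_FT"): "building_height_m",
}

# Hypothetical unit conversions that resolve scaling conflicts.
UNIT_CONVERSIONS = {
    ("buildings_db", "HEIGHT_FT"): lambda feet: round(feet * 0.3048, 2),
}


def to_shared_terms(source, record):
    """Rename local labels to shared concepts and normalise units."""
    unified = {}
    for label, value in record.items():
        key = (source, label)
        concept = SHARED_VOCABULARY.get(key)
        if concept is None:
            continue  # no agreed meaning for this label, so it is left out
        convert = UNIT_CONVERSIONS.get(key, lambda v: v)
        unified[concept] = convert(value)
    return unified


building = to_shared_terms("buildings_db", {"NAME": "Maxwell Building", "HEIGHT_FT": 100})
road = to_shared_terms("roads_db", {"NAME": "Chapel Street"})
print(building)
print(road)
```

After translation the two "NAME" values can no longer be confused, because each is stored under a concept whose meaning has been agreed in advance.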
2.2 Autonomy of the data sources
Autonomy indicates how independent the data sources are from the other sources and from the integrated system. Most planning departments have various existing systems and legacy data. In general terms, legacy systems are environment-dependent and self-contained. The advisable integration strategy is to leave the legacy systems and data unchanged. Furthermore, planning departments need to share relevant data held by other partners, who cannot be forced to act in certain ways. As a natural consequence, these partners can also change their data without any announcement to the outside world.
For economic and organizational reasons, planning applications will not be built from scratch. In order to preserve the autonomy of the data sources, we argue that data integration should not mean physically consolidating data into a higher-level, standardised data pool.
2.3 Distribution of the data sources
Distribution refers to the physical distribution of data over multiple sites. Data sources that have to be integrated do not always reside in the same place. They are likely to be on different hardware platforms and operating systems and may only be accessible through certain network protocols. When creating an integrated system, the designers should take into account the problems of communicating with the distributed data sources.
Attempts have been made to implement data integration and improve interoperability, and they can be classified into several groups: (a) metadata, (b) standardization, and (c) technical approaches. These groups are described below.
3. METADATA
Metadata is an efficient means of managing, structuring and mining information.
3.1 What is Metadata?
Metadata is not a new term; the underlying concepts have been in use for a very long time. For example, an old-style library card catalogue is a conventional, well-established type of metadata. It includes the ID, title, author, publisher, etc. There are various definitions of metadata:
• Metadata is structured information that describes, explains, locates, or otherwise makes it easier to retrieve, use, or manage an information resource (NISO 2004).
• Metadata is data associated with objects which relieve their potential users of having to have full advance knowledge of their existence or characteristics. A user might be a program or a person (Dempsey 1998).
• Metadata is “structured data about data that supports the discovery, use, authentication, and administration of information objects” (Greenberg 2001).
A short definition of metadata is "structured data about data".
The meaning of metadata is developing with the emergence of new technologies. Initially, in the database community, metadata was called the data dictionary or the catalog that describes the data (Date 1999). With the integration of data warehouses, data mining, Web-based documents and multimedia, the content of metadata started expanding. Metadata should include not only information about the data but also information on access methods, integrity and security constraints, schemas and mappings. Metadata guides the schema transformation and integration process in handling heterogeneity. Metadata is also needed to migrate legacy databases, implement data mining, and support decision making and visualization. In short, metadata is a central component that is common to all of these technologies (Thuraisingham 2000).
It should be clarified that metadata is itself a kind of structured data. Metadata can either be integrated with the document it describes or be held in a separate file.
3.2 Why Metadata?
Describing a data source with metadata allows it to be understood by both humans and machines in ways that promote integration and interoperability. The two main functions of metadata are:
• The provision of the means to establish the existence of a resource and the means to obtain or access it;
• The provision of an indication regarding the suitability of a resource, obtained through the documentation of its content, quality, and features.
Essentially, metadata helps decision makers, researchers, and managers to find and use data. It also provides essential information for data integration.
3.3 Dublin Core (DC)
In order to share and integrate data effectively, it is essential that data providers and data users choose common metadata elements to describe a dataset.
The Dublin Core (DC) is the most widely used metadata standard for resource discovery. Developed by the Dublin Core Metadata Initiative (http://dublincore.org/), the Dublin Core is intended to be simple to use, and general enough to be applied to resources in any discipline. It has been approved as an ANSI standard (Z39.85-2001) and an ISO standard (15836).
Dublin Core has proposed sets of standard metadata elements for the description of media objects (Diane 2001). The metadata elements included in the Dublin Core (DC) proposal have been compiled from work initiated in the library science field. The core DC attribute set consists of 15 elements that can be organized according to the type of metadata captured:
Table 1. Dublin Core elements
Metadata type | DC elements
Semantic | Title, Subject, Description, Coverage, Relation, Source
Context | Creator, Contributor, Publisher, Date, Rights
Structural | Type, Format, Language, Identifier
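As an illustration of how the grouping in Table 1 might be used in practice, the following minimal sketch checks that a dataset description uses only the 15 Dublin Core elements and groups them by metadata type. The sample record, its values and the function name are hypothetical.

```python
# The 15 Dublin Core elements, grouped by metadata type as in Table 1.
DC_ELEMENTS_BY_TYPE = {
    "Semantic": ["Title", "Subject", "Description", "Coverage", "Relation", "Source"],
    "Context": ["Creator", "Contributor", "Publisher", "Date", "Rights"],
    "Structural": ["Type", "Format", "Language", "Identifier"],
}


def group_by_type(record):
    """Group a flat Dublin Core record by metadata type, rejecting unknown elements."""
    known = {e for elements in DC_ELEMENTS_BY_TYPE.values() for e in elements}
    unknown = sorted(set(record) - known)
    if unknown:
        raise ValueError(f"Not Dublin Core elements: {unknown}")
    return {
        metadata_type: {e: record[e] for e in elements if e in record}
        for metadata_type, elements in DC_ELEMENTS_BY_TYPE.items()
    }


# Hypothetical description of a housing dataset.
housing_record = {
    "Title": "Housing stock by ward",
    "Subject": "housing",
    "Creator": "Hypothetical planning department",
    "Date": "2004",
    "Format": "text/csv",
    "Identifier": "dataset-001",
}
grouped = group_by_type(housing_record)
print(grouped["Structural"])
```

A record that uses an element outside the shared set is rejected, which reflects the requirement that data providers and data users agree on common metadata elements.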
Because metadata is important in organising information resources, governments have also begun using it to create and maintain knowledge through complex information systems with central metadata components. The UK government employs Dublin Core as the metadata standard in its e-Government Interoperability Framework (eGIF 2004).
4. DATA STANDARDIZATION
Data standardization is the process of normalising data for easier information representation and exchange between different systems. It aims to promote data integration and interoperability across a wide variety of platforms.
4.1 Related standards
Many industrial, national and international standards have been developed or are under development. Within the urban planning domain, the related standards are listed in Table 2.
Table 2. Related standards
Standard name | Organization | Description
eGIF (e-Government Interoperability Framework) | UK government | Defines the technical policies and specifications governing information flows across government and the public sector.
GML (Geography Markup Language) | OGC (Open Geospatial Consortium, Inc.) | XML-based schema for the modelling, transport, and storage of geospatial information.
GovML (Governmental Mark-up Language) | EC | An XML vocabulary to be used when delivering online public services to citizens. It enhances interoperability as it supports the flow of data between web service “consumers” and repositories of distributed services.
IFC (Industry Foundation Classes) | IAI (International Alliance for Interoperability) | A universal model intended as a basis for collaborative work in the building industry, and consequently to improve communication, productivity, delivery time, cost, and quality throughout the design, construction, operation and maintenance life cycle of buildings.
Foundation classes for GIS) | | Demonstration of the concept of using the Industry Foundation Classes (IFC) model as the specification for the exchange of limited but meaningful information between GIS and AEC CAD systems and vice versa.
RDF (Resource Description Framework) (http://www.w3.org/RDF/) | W3C (World Wide Web Consortium) | A framework for describing and interchanging metadata. The resource is modelled as individual objects of a data model. Properties are used to describe and define a resource. The properties of RDF can easily be expressed in XML.
X3D (eXtensible 3D) | Web3D Consortium | An XML-enabled 3D file format to enable real-time communication of 3D data across applications. X3D extends and upgrades the geometry and behaviour capabilities of VRML using XML. X3D is the successor of VRML.
ebXML (electronic business Mark-up Language) (http://www.ebxml.org) | UN & OASIS (United Nations & Organization for the Advancement of Structured Information Standards) | A technical framework that will enable XML to be utilised in a consistent manner for the exchange of all electronic business data, in order to create a single global electronic market.
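To give a flavour of what an XML-based geospatial standard such as GML offers to integration, the sketch below reads a point location from a small GML fragment using Python's standard library. The `gml:Point` and `gml:pos` elements and the namespace URI follow GML 3; the enclosing `School` feature, its name and the coordinates are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace used by GML 3.0/3.1 documents.
GML_NS = {"gml": "http://www.opengis.net/gml"}

# Hypothetical feature with a GML point geometry (British National Grid).
SAMPLE = """<School xmlns:gml="http://www.opengis.net/gml">
  <name>Hypothetical Primary School</name>
  <gml:Point srsName="EPSG:27700">
    <gml:pos>381000 398500</gml:pos>
  </gml:Point>
</School>"""


def read_point(xml_text):
    """Extract the feature name, reference system and coordinates."""
    root = ET.fromstring(xml_text)
    point = root.find("gml:Point", GML_NS)
    pos = point.find("gml:pos", GML_NS)
    x, y = (float(value) for value in pos.text.split())
    return {"name": root.findtext("name"), "srs": point.get("srsName"), "x": x, "y": y}


location = read_point(SAMPLE)
print(location)
```

Because the geometry is encoded in a shared, documented vocabulary, any system that understands GML can read the location without knowing how the originating database stores it.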
4.2 Selecting standards
The development of standards never stops, and many data standards are already available. The problem is how to choose and apply them within the urban planning domain. There are two opposing problems in selecting standards:
• Standards are not universal enough to cover all of the functions and capabilities that the users of software applications demand.
The Intelcities project has adopted the e-GIF standards. The e-GIF, an initiative of the UK government, aims to define the essential requirements for e-Governance. It provides much-needed integration at the interface and information interchange level, via XML and other standards, but it does not address the simulation and visualization problems that matter in the urban built environment. To describe the most important man-made object, the building, related standards from the construction industry such as the IFC model may need to be employed.
• Standards are too detailed and complex to be used for simple software applications.
IFCs (IAI 2002) provide an object-oriented representation for the life-cycle information of the building model in the AEC industry. The IFC model represents not just building components such as walls, doors, beams, ceilings and furniture, but also more abstract concepts such as schedules, activities, spaces, organization and construction costs. IFC is clearly an excellent standard for the building industry. For urban planning and development applications, however, it may be too detailed and complex, since these applications have no need for most of the building information in the IFC model.
Standards are clearly fundamental to the sharing of data across international boundaries. The integration and interoperability problem would go away if every system always used the same data standards. However, this is an ideal that cannot be achieved in the real world. Constructing and maintaining a single, integrated standard is a very difficult problem.
Moreover, standards do not address the interoperability problems of converting existing data into the selected standard format or integrating data from different sources (Garton 2001). Therefore, data standardization is not the whole solution.
5. DATA INTEGRATION METHODS
A number of proven and well-established methods exist for integrating heterogeneous data sources, including gateways, federated databases, data warehouses and mediator/wrapper systems (Fileto 2001; Kajan 2002).
• Gateway
A gateway is a piece of middleware that allows an application running in one DBMS to access data maintained by another DBMS. The best-known gateways are ODBC (Open DataBase Connectivity) and JDBC (Java DataBase Connectivity). Gateways are usually available only for DBMSs that employ the same data modelling paradigm, and they do not provide location or interface transparency. Hence, gateways are not versatile and offer no support for establishing a homogeneous view of heterogeneous data.
• Data warehouse
The data warehousing approach, as described by Voisard and Juergens (Voisard 1999), implies accumulation of data in a few well-defined and tightly connected data stores, where information integration is “pre-computed”. In order to integrate data from multiple sources, this approach extracts the data from these sources, transforms it into a common schema, and loads it into a single, unified database for the enterprise. While efficient for a relatively small number of datasets, this approach is not readily extensible to a larger number of datasets with semi-structured and ad hoc data. For data sources that change frequently, the cost of shipping incremental updates to the warehouse and inserting them correctly is high. The major cost of this approach lies in keeping the warehouse up to date (Wiederhold 2002).
• Federated database
A federated database system (FDBS) is a collection of cooperating but autonomous component database systems (DBSs) (Sheth 1990). The component DBSs are integrated to various degrees. A federated database shares a schema and enables distributed search. In a database federation, the information needed to answer a query is gathered directly from the data sources in response to the posted query. Hence, the results are up to date with respect to the contents of the data sources at the time the query is posted. Integration of data is achieved through interoperation at the level of communication technology. The result mirrors the sources exactly, and semantic relationships or mismatches have to be handled by the applications.
• Mediator/Wrapper system
Mediator-based systems are constructed from a large number of relatively autonomous sources of data and services, communicating with each other to provide “on-demand” information integration (Wiederhold 1994). Wrappers encapsulate the details of each data source, allowing data access using a common data model. Mediators offer an integrated view of the data supplied by a collection of sources through wrappers. A mediator transforms requests posed according to the integrated view into requests to the data sources, and integrates the results. Structural and syntactic heterogeneity may be solved by mediation.
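The mediator/wrapper idea can be made concrete with a minimal sketch. It assumes two hypothetical sources, an in-memory SQLite table and a CSV flat file, with invented table names, column labels and figures. Each wrapper hides the access details of its source and returns rows in a common model (plain dictionaries keyed by `ward`), while the mediator answers a query by asking every wrapper at query time and merging the results.

```python
import csv
import io
import sqlite3


class RelationalWrapper:
    """Exposes rows of a relational table as dictionaries in the common model."""

    def __init__(self, connection):
        self.connection = connection

    def fetch(self):
        cursor = self.connection.execute("SELECT ward, population FROM census")
        return [{"ward": ward, "population": population} for ward, population in cursor]


class FlatFileWrapper:
    """Exposes rows of a CSV flat file, renaming its local column labels."""

    def __init__(self, text):
        self.text = text

    def fetch(self):
        reader = csv.DictReader(io.StringIO(self.text))
        return [{"ward": row["WardName"], "dwellings": int(row["Dwellings"])} for row in reader]


class Mediator:
    """Offers one integrated view; data stay in the sources until queried."""

    def __init__(self, wrappers):
        self.wrappers = wrappers

    def query(self, ward):
        result = {"ward": ward}
        for wrapper in self.wrappers:
            for row in wrapper.fetch():
                if row["ward"] == ward:
                    result.update(row)
        return result


# Hypothetical sources and values, for illustration only.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE census (ward TEXT, population INTEGER)")
connection.executemany("INSERT INTO census VALUES (?, ?)", [("Ordsall", 14000), ("Kersal", 12000)])
housing_csv = "WardName,Dwellings\nOrdsall,6500\nKersal,5200\n"

mediator = Mediator([RelationalWrapper(connection), FlatFileWrapper(housing_csv)])
print(mediator.query("Ordsall"))
```

Nothing is copied into a central store, so the sources remain autonomous, and a change in either source is visible at the next query. A realistic mediator would push the selection down to each source rather than fetching every row.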
Urban planning datasets can be well-structured (databases), semi-structured (such as XML and HTML documents), or unstructured (such as binary files and documents). Based on the above analysis, a mediator/wrapper system is the better solution for integrating urban planning datasets. Here we suggest a hybrid solution based on the mediator/wrapper architecture, as shown in Figure 2.
This integration framework uses the mediator/wrapper method to integrate multiple autonomous, distributed and heterogeneous data sources. Users can easily access the integrated, comprehensive information through a unified query interface. Because the datasets are not pulled into a higher-level repository, legacy applications can continue to work as normal, and the cost of data management does not increase.
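Figure 2. A hybrid data integration framework (figure labels: User interfaces; Legacy applications; Services of SOAP/CORBA/DCOM; Mediator; a Wrapper with Metadata for each source: ORDatabase (Geospatial data, GML), OODatabase (CAD, IFC, DXF), RDatabase (Housing, population), Flat file (document, multimedia), Web-based file)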
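In the framework, each wrapped source is accompanied by metadata. One way such metadata might be used, sketched here under stated assumptions rather than as the framework's actual design, is to let the mediator decide which sources are relevant to a request before contacting them. The catalogue entries, identifiers and subject terms below are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceDescription:
    """Minimal metadata about one wrapped source (hypothetical fields)."""
    identifier: str
    subjects: frozenset
    format: str


# Hypothetical catalogue loosely mirroring the kinds of source in Figure 2.
CATALOGUE = [
    SourceDescription("geo_ordb", frozenset({"geospatial"}), "GML"),
    SourceDescription("cad_oodb", frozenset({"buildings"}), "IFC"),
    SourceDescription("census_rdb", frozenset({"housing", "population"}), "relational table"),
]


def select_sources(required_subjects, catalogue=CATALOGUE):
    """Return identifiers of sources whose metadata covers any required subject."""
    required = set(required_subjects)
    return [source.identifier for source in catalogue if source.subjects & required]


# Siting a primary school needs, among other things, geospatial and population data.
print(select_sources({"geospatial", "population"}))
```

Sources whose metadata does not match the request are never queried, which keeps network traffic to distributed sources down.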
6. SUMMARY
In this paper, we have discussed the data integration issues for the nD urban information model. The complex and diverse characteristics of urban planning datasets make integration a difficult problem. Metadata can provide essential information for data integration. Data standards contribute to easier information representation and exchange between different systems and help to support integration and interoperability across institutional borders. Research in the database community has led to several data integration methods. We have suggested a hybrid mediator/wrapper framework for urban planning data integration. This approach can resolve syntactic and structural heterogeneity.
However, some important issues, such as ontology and semantic integration, are not covered in this paper, and the discussion of some integration issues is not detailed enough. Future research work will include investigating and addressing these issues further and applying the research findings to the Intelcities project.
7. REFERENCES
Bishr, Y. (1998a). "Overcoming the Semantic and Other Barriers to GIS Interoperability." The International Journal of GIS 12(3).
Date, C. J. (1999). Introduction to Database Systems (7th ed.), Addison-Wesley Longman.
Dempsey, L., Heery, R. (1998). "Metadata: a current view of practice and issues." Journal of Documentation 54 (2): 145-172.
Diane, H. (2001). Using Dublin Core, Dublin Core Metadata Initiative. 2004.
eGIF (2004). e-Government Interoperability Framework, UK government.
Fileto, R. (2001). Issues on Interoperability and Integration of Heterogeneous Geographical Data. III Workshop Brasileiro de GeoInformatics, Rio de Janeiro, Brazil.
Garton, M. a. T., G. (2001). Data integration issues for a Farm Decision Support System. GIS Research in the UK 9th Annual Conference, Glamorgan.
Greenberg, J. (2001). "A quantitative categorical analysis of metadata elements in image-applicable metadata schemas." Journal of the American Society for Information Science and Technology 52(11): 917 - 924.
Harris, B. and M. Batty (1993). "Locational Models, Geographic Information Systems and Planning Support Systems." Journal of Planning Education and Research 12: 184-198.
IAI (2002). International Alliance for Interoperability. 2004.
Innes, J. E. (1998). "Information in Communicative Planning." Journal of the American Planning Association 64(1): 52-63.
Kajan, L. S. a. S. D. (2002). Framework for Semantic GIS Interoperability. Ser. Math. Inform. 17.
Lee, A., Marshall-Ponting, A., Aouad, G., Song, W., Fu, C., Cooper, R., Betts, M., Kagioglou and Fischer, M. (2003). "Developing a Vision of nD-Enabled Construction."
Lenzerini, M. (2002). Data Integration: A Theoretical Perspective. ACM PODS, Madison, Wisconsin, USA.
NISO (2004). Understanding metadata. Bethesda, USA, National Information Standards Organization Press.
Sheth, A., Larson, L. (1990). "Federated database systems for managing distributed, heterogeneous and autonomous databases." ACM Computing Surveys 22: 183-236.
Shiffer, M. J. (1992). "Towards a Collaborative Planning System." Environment and Planning B: Planning and Design 19: 709-722.
Thuraisingham, B. (2000). Web data management and electronic commerce, CRC Press LLC.
Ubbo Visser, H. S., Holger Wache, Thomas Vögele (2001). Using Environmental Information Efficiently: Sharing Data and Knowledge from Heterogeneous Sources. Environmental Information Systems in Industry and Public Administration. C. R. a. S. Patig, PA: Idea Group Publishing: 41-73.
Voisard, A., and Juergens, M. (1999). Geographic Information Extraction: Querying or Quarrying? Interoperating Geographic Information Systems. M. E. M. Goodchild, R. Fegeas and C. Kottman. New York, Kluwer Academic Publishers.
Wang, H. a. H., A. (2004a). The conceptual framework of ND urban information model. 2nd CIB Student Chapters International Symposium. Beijing, China: 625-634.
Wiederhold, G. (1994). Interoperation, Mediation and Ontologies. Symposium on Fifth Generation Computer Systems, Tokyo, Japan.