DESIGN OF A SPATIAL DATA WAREHOUSE BASED ON AN INTEGRATED NON-SPATIAL DATABASE AND GEO-NON-SPATIAL INFORMATION
Abdulvahit Torun
Harita Genel Komutanlığı (General Command of Mapping) (GCM), Kartografya Dairesi, 06100 Cebeci, Ankara, Türkiye
Accessing multiple, distributed, heterogeneous and autonomous information sources storing spatial or non-spatial data and integration of those data sources has been an important issue as distributed data processing and management became available with today’s technology. Common solutions to data integration in database area are the data integration, schema integration and software developing approaches. The first approach is implemented as a data warehouse (DW) or an integration as a central database (DB). The data in DW has been cleansed, integrated, and pre-processed and infrastructures have been built surrounding DW for efficient data analysis. Main processes to construct a spatial data warehouse are; collecting data from information sources (data warehousing), organizing data in a geo-spatial data store (warehouse organization) and querying and visualizing data. This study covers implementation of first and design of the following two phases of developing spatial data warehouse. In data warehousing phase, required data about population information, information about administrative hierarchy, location information and administrative boundaries are extracted from relevant data sources. After introducing data warehousing step, model of latter two phases are given in this paper.
1. INTRODUCTION
There are multiple, distributed, heterogeneous and autonomous information sources storing spatial data within a single company or property of various companies. The information sources vary from legacy systems to operational spatial databases. The need for gathering or defining associations among those data sources motivates to integrate them as a spatial data warehouse (DW) or as a database using integration techniques. Spatial data sets are integrated either through a part of data warehouse or distributed/federated database (by means of wrappers and mediators).
DW is a subject-oriented, integrated, time-varying, non-volatile collection of data repository, which has been extracted and integrated from heterogeneous and autonomous distributed data sources. DW is used primarily in organizational decision-making. The data in DW has been cleansed, integrated, and pre-processed and infrastructures have been built surrounding DW for efficient data analysis. Main processes to construct a spatial data warehouse are; collecting data from information sources (data warehousing), organizing data in a geo-spatial data store (warehouse organization) and querying and visualizing data [Ezeife, 2001] [Voisard, et.al., 2002].
In warehousing phase, resolving formats, data integration, data cleansing, consistency-checking tasks are accomplished. Warehouse organization step comprises re-organizing and aggregating the data as base tables, summary tables and metadata. Querying and visualizing a data spatial warehouse considering spatial characteristics of data requires essential tools for analytical processes and visualization of the information both for organizational and ad hoc users.
Most of DWs store large amount of data, which is often used for summarization-based on-line analytical processing (OLAP). ROLAP (relational OLAP) and MOLAP (multidimensional OLAP) are two basic OLAP architectures. DWs are usually based on relational technologies but OLAP uses a multidimensional view of aggregate data to provide quick access to strategic information for further analysis. OLAP enables analysts, managers and executive to gain insight into data through fast, consistent, interactive access to a wide variety of possible views of information. However OLAP and DWs are complementary. OLAP transforms raw data so that it reflects the real dimensionality of the enterprise as understood by the user [Niemi, 2002].
Materialized views pre-compute and store (materialize) aggregates from the base data. The data is grouped using categories from the dimensions tables, which corresponds to the subjects of interest (dimensions) of the organization. Storing all possible aggregates poses storage space problems and increases maintenance cost, since all stored aggregates need to be refreshed as updates are being made to the source data sources. .
2. DATA SOURCES AND DATABASE INTEGRATION
In most of the DB integration cases, the tendency is management of legacy systems and re-use of data in recent years. The DBs, which are distributed geographically, in different models, in different environment and with different semantics are needed to be integrated to use the available data. This problem is called as ‘DB integration’. DBMSs and applications are designed considering incorporate data and process sharing to ease autonomy. Main efforts are spend in three directions; ‘schema integration’, ‘federated DB’ and ‘multi-database language’ [Elmagarmid et.al., 1999] [Bobak, 1996]. 2.1 Data Warehousing
In warehousing phase resolving formats, data integration, data cleansing, consistency-checking tasks are done. New data is added to DW due to format and transformation definitions in metadata. Before adding the data, consistency checks are performed in order to guarantee that data is loaded once and correctly.
2.2 Database Integration Strategy
A hybrid integration technique is applied in order to develop Populated Places DB of Turkey (PPDB_T) [Torun, 2000] [Torun, 2002_1]. Data integration method –integrating several DBs into one unique DB- and schema integration method are used for integration. Data insertion from the component DBs is done by running the programs written in API of PPDB_T. Schema and Data integration is accomplished by using ladder technique in which first of all, schema of newly designed PPDB_T of General Command of Mapping (GCM) is integrated with the data collected for populated places which are stored in a plain table. Then, schema of PPDB_T is extended to integrate it with State Statistics Institute (SSI) DB schema (Figure 1.b).
For integrating different data sets describing the same phenomena, firstly a correct understanding of the semantics of the existing data should be developed. Schemas are transformed into a common data model. For instance SSI DB schema into SQL and DB instance into dbf are converted. Then, an accurate correlation among structures (schemas) is established to avoid comparing different type of objects within the same category. The inter-schema correspondence at metalevel and inter-DB correspondence at data level are identified and described. Finally, integration is described precisely to prevent merging irrelevant data together. The conflicts are solved semi-automatically. Integrated schema is generated on top of contributing data sources. Schema integration is done by using modified 5 level integration architecture for the purpose of single DB generation (Figure 1.a) [Ozsu, et.al., 1999].
Modified 5 Level Schema Integration Architecture
1. Local Schema: Local schema of plain table, SSI DB and DCT are made available by the Component DBs.
2. Component Schema: Representation of Local schema in canonical (standard) data Model is defined by using SQL. Canonical data Model is employed for unifying divergent local schemas in a single schema.
3. Export Schema: Export Schemas are defined from component schemas based on integrated global schema of PPDB_T.
4. Schema Integration: Export Schemas are integrated to form a single schema. Since the desired final DB is not a federated one, the integrated schema is not a Federated Schema, either. Firstly, two component schemas are integrated. Then, the third one is added to the integrated schema. This method is called as ladder technique. Data integration follows schema integration.
5. External Schema: Upon the integrated schema different conceptual/external schemas are defined for different purposes and usage [Torun, 2002_2].
2.3 Developing Populated Places Database of Turkey as a base for Spatial Data warehouse Data warehousing phase is implemented as follows. Population information and information about administrative hierarchy are extracted from State Statistics Institute (SSI) DB and from Ministry of Interior ‘Populated Places Book’ respectively. Location information and administrative boundaries are imported from topo-maps at scale 1:25000 and Digital Chart of Turkey (DCT) at scale 1:1000000 of General Command of Mapping respectively. Integration of those four data sources is accomplished by using ladder technique to construct Populated Places DB of Turkey (PPDB_T) as a central DB at Cartography Department of GCM.
Population information and relevant statistics are collected and stored by SSI. In order to maintain population information up-to-date for map production and for monitoring the changes of populated places, DB integration is necessary to share the legacy data by SSI instead of re-constructing the same content.
Defining Associations
Turkey has a hierarchical administrative system, which looks like a balanced tree structure. The sequence of residential entities (populated places) from top to bottom is province, district, sub-district, village and suburb respectively.
Schema 1 Schema 2 Schema 3
Declaration of Correspondence Resolution of Conflicts Schema Fusion Integration Rules Semantic and Structural Conflicts
Plain Table SSI DB PPDB_T
DCT I_SDB
PPDB_T
PPDB_T
Figure : 1.a Schema Integration, 1.b. DB integration using ladder technique
2.4 Database Integration
Schema Integration
Source DBs can be integrated by means of schema integration or using standardized data or schema models. Schema integration is a practical method such that the data is integrated as a logical DB with a global schema. With a given a set of local DB schemas belonging to an individual DBMS, an integrated schema which subsumes those local schemas is created by synthesizing the schemas.
First of all, database schema of PPDB_T is designed by means of actual and future needs. DB schema of PPDB_T is designed considering the administrative binding of populated places. Then, relevant attributes of SSI DB are added to the PPDB_T schema in addition to a foreign key, which provides PPDB_T to connect SSI DB. The final step is designing an Integrated spatial DB (I_SDB) with the three component DBs; PPDB_T, DCT and spatial DB derived from a subset of PPDB_T.
Data Integration
There are three main data integration techniques. Firstly, a very basic approach is providing a global catalogue of accessible information sources to user to allow him doing integration by himself without attempting any integration. The user should search, find, decide and find tools to select, extract, integrate and query the data. Secondly, a step further is integrating the source DBs as one single DB by putting the data together. The data and applications should be converted into the integrated DB. Thirdly, subsets of source databases are extracted due to main application needs.
Data integration process is done in four major steps by using the first two techniques. Data integration is accomplished by applying ladder technique. Firstly, DB schema of PPDB_T is designed considering the available non-spatial digital information sources and further needs. Secondly, PPDB_T is populated with the data from plain table. Therefore, the plain table is mapped into PPDB_T. Thirdly, PPDB_T schema is extended in order to import population information from exported data of SSI DB (Figure 2.a).
Derived Spatial DB (d_SDB) is based on a common schema for both PPDB_T and spatial data processor –for the time being ESRI-ArcInfo- that will import the data. Export schema is created by means of a non-spatial predicate that cuts the DB both vertically –a subset of attributes- and horizontally –a subset of tuples-. Primary key is repeated in every fragment in vertical fragmentation. Thus, disjointness is valid only on non-primary key attributes in vertical fragmentation (Devogele, et.al, 1998). The exported non-spatial data is mapped into spatial format to generate d_SDB. These are done by developing a tiny software which extracts data from PPDB_T into a common model, transforms the common model into a spatial DB and visa-versa. Different languages are employed for spatial and non-spatial definitions and manipulations. DCT contains a set of spatial data classes such as administrative boundaries, hydrology, transportation, elevation, populated places (only provinces and sub-provinces), physiography. The relationship among PPDB_T and d_SDB is preserved by keeping the same primary key –Populated Place ID- in corresponding relations of both DBs.
Problems of Integration
Constructing an integrated DB from existing DBs yield some problems due lack of interoperability among DBs. GCM and SSI have different non-spatial data to an extent for the same context. The data differs because of partly semantic but mainly schematic and format (syntax) discrepancy. Updateness is the main reason for this kind of anomalies. If one of the DBs stores a different name than the current one for a populated place the corresponding populated places cannot be matched till the mistake is removed or association is built. This process prevents scalability of the resultant schema and DB
instance. Each time there happens a change, the integrated schema and integrated DB should be updated.
3. SPATIAL DATA WAREHOUSE MODEL
Data analysis applications typically aggregate data across many dimensions looking for anomalies or unusual patterns. They categorize data values and trends, extract statistical information, and contrast one category with another. Data analysis is done four steps;
• A query that extracts relevant data from a large database is formulated,
• The aggregated data from the database into file or table is extracted
• The results are visualized by using graphic or mapping tools,
• The results are analyzed and new queries are formulated [Gray et al., 1997].
Data integration and consistency checks are accomplished. The coming stage will comprise organizing DW, defining and developing analytical operations and generating tools to extend integrated global schema of DW and DW instance.
3.1 Spatial Data Warehouse Organization
Data is organized in a spatial data store in this stage. DW does not allow the user to delete or update the data in the store. However, the growth of DW is managed either transferring into cheaper media or generating summary tables of less used data. Row data is re-structured and aggregated in base tables. Aggregation rules are defined in metadata. Users are sometimes interested in aggregated values such as population of a region or province, which is calculated by summing the population of populated places at lower hierarchies. Getting the result of a query on the fly based on operational DB may not be efficient. On the other hand, the more data is aggregated, the faster queries are responded, but update and storage are increased. A balance among query response time and update time should be considered.
PPDB_T
SSI DB PPDB_T
Schema Plain Table for
Populated Places of GCM Integrated (extended) Schema for PPDB_T Data Integration integration rules Modified SSI DB PPDB_T imigration SSI DB Schema Query Interface Data Warehouse DB Integration Extracter (DB API) Metadata Source
DB1 Data warehouse Source
Query Engine
Source DB2
Source DB...
Figure : 2.a DB intefration for PPDB_T, 2.b. Hybrid model for Spatial DW
A hybrid approach is employed to provide efficient and rich analytical analysis to the user (Figure 2.b). A part of data is organized in a DW while other part of data is extracted from the integrated DB during query processing. In the metadata information about ‘where and how to get the relevant data’ is stored. A warehouse is useful for data related to whole field. This is useful in geomarketing applications where data, being socio-cultural behaviors or statistical data. For the case of spatial data warehouse, resultant maps of some queries are stored in the DW for quick response and other queries are responded on the fly as a graphic result. Therefore dimension of data cube is extended by adding related maps of each dimension considering aggregate hierarchies.
The dimension determined in the actual DB are geometry, time, type of populated places. The geometry dimension comprises geometric boundary of populated places from region, province, district, sub-district and village. Some other geometric hierarchies can be defined according to a property like height zones, earthquake danger zones, fruitful soil zones etc…
3.2 Analytical Operations on Datawarehouse
Analytical Operations applied to warehouses for are aggregation (consolidation, roll up), roll down (drill down, drill through), selection (screening, filtering, dicing), slicing, pivoting (rotation). Aggregation is summarization of data for the higher level of hierarchy. Roll down is navigation among levels of data ranging from higher-level summary (up) to lower level summary or detailed data (down). Selection is taking a subset of data by means of a criterion which is evaluated against the data or members of a dimension in order to restrict the set of retrieved data. Slicing is selection of all data satisfying a condition along a particular dimension. Pivoting is changing the dimensional orientation of cube [Vassiliadis, 2000] [Vassiliadis, 2002].
A cube is a group of data cells arranged by the dimensions of the data. A dimension is defined as a structural attribute of a cube that is a list of members, all of which are of a similar type in the users perception of the data. Each dimension has an associated hierarchy of levels of aggregated data. For instance time can be detailed as year, month, week, day, hour. Measures (variables, metrics, facts) represent the real world values. A single datapoint that occurs at the intersection is defined by selecting one member from each dimension in a multi-dimensional array (cube).
Although relational technologies provide some non-procedural tools for analytical operations, routines should be provided for complex calculations. Operators defined in SQL in relational DBMS are group by (), count (), sum (), minimum (), maximum (), average (), median (), standard deviation (), variance () etc... However, an algebra is defined on cube having the operations group by, cube, roll up [Gray, et.al., 1997].
Visualizing the Results by Means of Graphic and cartographic Tools
Visualization tools display trends, clusters and differences by using graphic tools to make the user to understand the metaphor easily. Visualization tools render the results as 2D or 3D graphs in addition to tools of cartography, which are used for mapping statistical distributions. In order to visualize dynamic characteristics dynamic visual variables are used in addition to visual variables. and Cartographic tools in GIS software provide most of the mapping tools. Results of queries against spatial data warehouses are visualized as a value, an array (table), graphic (2D and 3D), maps (2D, 3D, multimedia-animation, sound etc…).
3.3 Extending and Maintaining Spatial DB as a source of Spatial Data Warehouse
After founding a spatial DW, the data is updated, changed or model is extended in two ways. Firstly, source sends data regularly or after balk changes. Secondly, warehouse asks for data at certain time periods [Voisard, et.al, 2002].
Since, mapping from data sources into PPDB_T is initially done automatically, the tools for mapping are available for further data injection into the DB. Moreover, there are tools to check consistency considering administrative hierarchy and to compare different versions of PPDB_T. The application has tools for entering, manipulating and updating data in addition to intelligent query generator based on administrative hierarchy and standard topographic map indexes.
4. CONCLUSIONS
The concepts about database integration are briefed in the text. Data and schema integration techniques are given related with developing an integrated DB comprising populated places, administeral hierarchies, population information which is called as PPDB_T (Populated Places DB of Turkey). Finally, hybrid model for integration processes is defined within the application phase. Syntactic conflicts are removed by using common formats for all data sources. Schematic conflicts are resolved by defining an integrated global schema and removing redundant and repeated information. Since the semantic meanings of administrative units are unique among governmental bodies, almost no semantic conflicts are met. Naming differences are removed by defining unique names or synonyms.
Organizing DW and defining analytical operations phases are designed. A hybrid method is going to be used to develop spatial DB. The spatial DW will comprise a set of maps, which are generated for the frequent user queries considering aggregation hierarchies.
Acknowledgements. Populated Places Database of Turkey is designed and developed in Cartography Department of General Command of Mapping, Turkey.
REFERENCES
Bobak, A.R. “Distributed and Multi-Database Systems”, Artech House, Boston, pp 121-138 (1996). Devogele, T., C.parent, S.Spaccapietra, “On Spatial Database Integration”, International Journal of
Geographic Information Science, 12(4), pp. 335-352 (1998).
Elmagarmid, A. et.al., “Management of Heterogeneous And Autonomous Database Systems”, Morgan Kaufmann Publishers, San Francisco, pp.2-32 (1999).
Ezeife, C.I., “Selecting and materializing horizontally partitioned warehouse views”, Data & Knowledge Engineering 36, pp 185-210, Elsevier Science (2001).
Gray, J. et.al., “Data Cube: A Relational Aggregation Operator Generalizing Group By, Cross-tab, and Sub-Totals”, data Mining and Knowledge Discovery 1, pp 29-53, Kluwer Academic Publishers (1997).
Hepner, P., “Integrating Heterogenous Databases”: An Overview, http://citeseer.nj.nec.com/cs, (1995).(accessed Dec. 2001).
Niemi, T., “Constructing OLAP Cubes Based on Queries”, http://citeseer.nj.nec.com/cs (accessed Aug. 2002).
Özsu, M.T., P. Valduriez, “Principles of Distributed Database Systems”, Prentice Hall, New Jersey, pp. 75-101 (1999).
Torun, A., “Populated Places DB Project”, General Command of Mapping. Internal Report, Cartography Department, General Command of Mapping, Turkey (2000).
Torun, A., “Designing Populated Places Database of Turkey (PPDB_T) by Using Relational Model, Harita Dergisi, 128 (2002_1).
Torun, A., Using Schema and Data Integration Techniques to Integrate Spatial and non-Spatial Data: Developing Populated Places DB of Turkey (PPDB_T), ISPRS Comm. IV, Canada (2002_2). Vassiliadis, P., “Data Warehouse Modeling and Quality Issues”, Ph.D. Thesis, Nat. Tech. Univ. of
Athens, Greece (2000).
Vassiliadis, P., “Modeling Multidimensional Databases, Cubes and Cube Operations”,
http://citeseer.nj.nec.com/cs (accessed Aug. 2002).
Voisard, A., M. Jürgens, “Geospatial Information Extraction: Querying or Quarrying”,