Chapter 3 Architecture Overview
3.3 Data integration space
The main component of this space is the mediator, which is responsible for decomposing user queries into subqueries over the underlying data sources and integrating the corresponding results. Originally, our data integration system was designed to adopt only the virtual approach to data integration. However, in [Batista et al. 2003] we extended the system’s
execution. The extended architecture combines features of both data integration approaches, supporting the execution of virtual and materialized queries. This minimizes the impact of the most common problems of each approach. Portions of data that are frequently unavailable and relatively static may be materialized in a data repository, while the more dynamic data are accessed through virtual queries.
Another distinguishing feature of the extended architecture is the use of a local cache, i.e., a repository that stores prepared answers for the queries most frequently submitted to the data integration system. More details about the specification and implementation of the modules responsible for optimizing query response time can be found in [Batista 2003]. In the following, we describe the components of the data integration space in more detail.
Mediator
A Mediator is a software device supporting an integrated view of several data sources. A schema for the integrated view is available from the Mediator, thus allowing queries to be made against that schema. The mediation schema contains all the elements needed to answer users' queries that can be computed from the data sources. Each element in the mediation schema is associated with a mediation query, which is responsible for its computation.
The Mediator is composed of two sub-modules:
− Query Manager
Whenever a query is submitted to the data integration system, the Query Manager first analyzes the query to determine where the data relevant to its answer are stored. If the query results are stored in the cache, the Query Manager returns them immediately. Otherwise, it must identify where the elements that compose the query results are stored. These elements can be virtual or materialized. Virtual elements are accessed from the data sources, and materialized elements are obtained directly from the data warehouse. After retrieving the data, the Query Manager composes the results and returns the answer to the user.
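As an illustration only, the lookup order described above can be sketched in Python. The function name `answer_query`, the dictionary-based cache and warehouse, and the `fetch_virtual` callback are assumptions introduced here, not part of the system's specification. Composition of results is reduced to simple concatenation, which is also an assumption.

```python
from typing import Callable, Dict, List

Rows = List[dict]


def answer_query(
    query: str,
    elements: List[str],
    cache: Dict[str, Rows],
    materialized: Dict[str, Rows],
    fetch_virtual: Callable[[str], Rows],
) -> Rows:
    """Hypothetical sketch of the Query Manager's lookup order."""
    # 1. Cached answers are returned immediately.
    if query in cache:
        return cache[query]

    # 2. Otherwise, locate each element needed by the query.
    result: Rows = []
    for element in elements:
        if element in materialized:
            # Materialized elements come from the data warehouse.
            result.extend(materialized[element])
        else:
            # Virtual elements are fetched from the data sources.
            result.extend(fetch_virtual(element))

    # 3. Compose the results (here: plain concatenation, an assumption).
    return result
```

In this sketch the data sources are contacted only for elements that are neither cached nor materialized, which is the behaviour the extended architecture aims for.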
− Source Manager
One of the tasks of the Source Manager is to interact with the data sources, sending queries to the source wrappers and returning the corresponding results to the Query Manager. The Source Manager also monitors the sources in order to determine whether there are portions of data that may be materialized in the data warehouse [Harinarayan et al. 1996, Gupta et al. 1997, Theodoratos et al. 1997]. This procedure involves analyzing the data
criteria, for example: availability and update frequency of the data source. More details about the criteria adopted to select the portions of data to be materialized may be found in [Batista 2003].
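A minimal sketch of how the two example criteria could be combined is given below. The actual selection criteria are those of [Batista 2003] and are not reproduced here. The `SourceStats` record, the function `materialization_candidates`, and both threshold values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SourceStats:
    name: str
    availability: float      # fraction of monitoring probes answered, in [0, 1]
    updates_per_day: float   # observed update frequency


def materialization_candidates(
    stats: List[SourceStats],
    max_availability: float = 0.9,     # hypothetical threshold
    max_updates_per_day: float = 1.0,  # hypothetical threshold
) -> List[str]:
    """Select sources that are often unavailable and rarely updated."""
    return [
        s.name
        for s in stats
        if s.availability <= max_availability
        and s.updates_per_day <= max_updates_per_day
    ]
```

A source must satisfy both conditions in this sketch: data that are frequently unavailable but also highly dynamic would become stale quickly once materialized.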
Cache Manager
The Cache Manager is responsible for maintaining a locally stored cache with respect to space availability, replacement policies and content refreshment [Dar 1996]. Another task of the Cache Manager is to periodically access the Query Log, to check the frequencies of submitted queries and to identify whether there are new query results to be stored in the cache. In this case, the Query Manager will compute the new queries and store their results in the cache. All cached queries must also be recomputed periodically in order to refresh the cache contents.
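The two periodic tasks of the Cache Manager can be sketched as follows. This is an illustrative sketch under simplifying assumptions: the Query Log is modelled as a plain sequence of query strings, the cache capacity is counted in number of queries, and the names `queries_to_cache` and `refresh_cache` are hypothetical.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, List

Rows = List[dict]


def queries_to_cache(
    query_log: Iterable[str], cached: Iterable[str], capacity: int
) -> List[str]:
    """Return the most frequent logged queries that are not cached yet."""
    counts = Counter(query_log)
    most_frequent = [q for q, _ in counts.most_common(capacity)]
    already_cached = set(cached)
    return [q for q in most_frequent if q not in already_cached]


def refresh_cache(
    cache: Dict[str, Rows], recompute: Callable[[str], Rows]
) -> Dict[str, Rows]:
    """Recompute every cached query to refresh the cache contents."""
    return {query: recompute(query) for query in cache}
```

Here `recompute` stands in for the Query Manager, which is the module that actually evaluates the queries.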
Data Warehouse Manager
The main role of this module is the maintenance of the data warehouse, that is, the materialization of the data suggested by the Source Manager and the refreshment of the materialized data. Data warehouse maintenance policies were investigated in previous works [Gupta et al. 1995, Widom 1995, Rundensteiner et al. 2000]. These policies address the problem of keeping the data warehouse consistent with the contents of the sources. As already discussed, this can be done either by overwriting the existing contents or by appending new data to the existing data in the data warehouse.
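The two refreshment options mentioned above can be contrasted with a short sketch. The warehouse is modelled as a dictionary from element names to rows, and the function `refresh_element` and its `mode` parameter are assumptions made for this example only.

```python
from typing import Dict, List

Rows = List[dict]


def refresh_element(
    warehouse: Dict[str, Rows],
    element: str,
    fresh_rows: Rows,
    mode: str = "replace",
) -> None:
    """Refresh one materialized element in place (hypothetical sketch)."""
    if mode == "replace":
        # Discard the existing contents and store the new ones.
        warehouse[element] = list(fresh_rows)
    elif mode == "append":
        # Keep the existing contents and add the new data to them.
        warehouse.setdefault(element, []).extend(fresh_rows)
    else:
        raise ValueError(f"unknown refresh mode: {mode!r}")
```

In both modes `fresh_rows` would be produced by re-executing the mediation query associated with the element.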
The Data Integration Space has two additional data repositories: the data warehouse, which stores data that are frequently unavailable and relatively static, and the cache, which stores prepared answers for the queries most frequently submitted to the integration system. It is important to note that the cache and the data warehouse are constructed in two different phases:
− The data warehouse is built during the initial construction of the integrated view, and thereafter it is maintained by the Data Warehouse Manager. The materialization of data to be stored in the data warehouse is done through the execution of mediation queries. During the refreshment of the data warehouse, the elements must be recomputed, i.e., the corresponding mediation queries must be re-executed.
− Initially, at system startup, the cache is empty. The cache is populated at runtime as queries are submitted to the system. This is done based on query frequency, i.e., the queries posed to the system more frequently will have their