Data Warehousing in Environmental Digital Libraries

Richard D. Holowczak, Nabil R. Adam, Francisco J. Artigas, and Irfan Bora

The purpose of this article is to present what we view as a new application of data warehousing in non-traditional domains such as environmental monitoring and research, biomedical applications, and other types of digital libraries (DL). While the work we present here focuses on an environmental data warehouse, the ideas are equally applicable to other DL environments.

Data warehousing has been embraced by the professional IT community with many successful implementations reported in the retail, insurance, finance, and other industries [7]. For the most part, traditional data warehouses share the following common features. First, virtually all the data propagated from the operational systems and data sources to the data warehouse is either numeric or character-based. The data warehouse database contains mostly aggregated and summarized data in numeric format. Next, operational systems and data sources are identified early on in the warehouse development cycle. Once the warehouse is operational, new data sources are often difficult to integrate into the data warehouse schema. Quite often the warehouse database is designed around a core set of business needs that are decided upon in advance. The set of queries that can be addressed by the warehouse might therefore be fixed.

By contrast, data warehouses to support DLs must contend with a variety of multimedia data types from a collection of component systems that may grow over time.

Richard Holowczak ([email protected]) is an associate professor of computer information systems and the director of the Subotnick Financial Services Center in the Zicklin School of Business at Baruch College, City University of New York.

Nabil R. Adam ([email protected]) is the Founding Director of the Center for Information Management Integration and Connectivity (CIMIC) at Rutgers University, and is the Director of the Meadowlands Environment Research Institute.

Francisco Artigas ([email protected]) is a research associate professor at Rutgers University Center for Information Management Integration and Connectivity (CIMIC).

Irfan Bora ([email protected]) currently serves as the Chief Fiscal Officer of the New Jersey Meadowlands Commission, and also directs the Commission’s information systems area.

This work was supported in part by the Professional Staff Congress of the City University of New York.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

Such a data warehouse must also be robust enough to adapt to new queries and ways of querying, and should not only store aggregated and summarized data, but also any value-added data derived from data sources. However, as in the case of the traditional data warehouse, copies of the source data may not generally be stored in the data warehouse database. In addition, subsets or disjoint sets of aggregated, summarized data should be easily distributed to facilitate localized user requirements such as search. Thus, DLs do not rely on a single point source for access to DL data warehouse data.

A number of interesting research issues follow from the application of traditional data warehousing approaches to DLs. What are appropriate data models for data warehouses in DLs? What interoperability and data heterogeneity issues arise in data warehouses for DLs? What levels of aggregation are meaningful for multimedia data types such as maps, images, and video? What methods can we employ to allow a data warehouse to effectively evolve as new data sources become available? How can we allow a data warehouse to effectively evolve so users can push their findings back to the DL? What methods will allow a data warehouse to effectively respond to queries unanticipated when the data model was developed? How can we use a data warehouse in a DL to support data mining? Finally, how do all of these issues impact the user’s ability to search and retrieve aggregated, summarized, and value-added DL data?

The Digital Meadowlands

We are presently undertaking a project to develop an environmental DL for a 32-square-mile governmental jurisdiction known as the New Jersey Hackensack Meadowlands District (HMD), which is located in northeastern New Jersey. This DL, called the Digital Meadowlands (DM), will support environmental research, education, and government planning activities in the service of the New Jersey Meadowlands Commission, the agency responsible for HMD land use planning, implementation of zoning controls, regional solid waste management, and environmental protection.

The HMD is among the most environmentally assaulted and developed areas in the northeastern United States. The heavy infrastructure of industry, commercial and transportation corridors, and high-density residential developments coexists with a degraded estuary and urban wetlands. Undeveloped wetland areas are under intensive development pressure [2]. This configuration creates a complex and highly impacted landscape under continuous pressure to comply with opposing demands for development and conservation.

Digital Meadowlands (DM) is a state-of-the-art environmental monitoring system that allows users to query critical parameters such as land use and ambient air and water quality, and to visualize the results in graphical form. DM will allow users with different levels of expertise to explore problems facing their communities. Data types stored in DM are summarized in Table 1.

One potential use of Digital Meadowlands is to identify and monitor known contaminated areas. From the 1960s to the 1970s, polychlorinated biphenyls (PCBs) were common industrial wastes dumped in the Hackensack River watershed and surrounding areas. Now a suspected carcinogen, this contaminant saturates the riverbed and accumulates over time in fatty tissues of fish and shellfish that may later be consumed by humans. Anglers from the area may want to know which areas are known or suspected to be contaminated with PCBs and to what degree. They may also want to access current public advisories on fish consumption and important health information available from the state.

A more sophisticated user may wish to use street addresses along with an interactive geocoding application to find the location of old industrial sites known to have generated PCB waste. The user may wish to inspect current high-resolution images to ascertain if the plants still exist and the nature of current land use. The user may examine interactive maps and historical reports to explore the exact sampling locations where contaminated fish were extracted that exceeded the U.S. Food and Drug Administration safety threshold for PCBs. Based on this information, the user may choose to propose new sampling locations.

An even more sophisticated user may look at remediation sites along the river where sediments were or are being disturbed. He or she may examine water and sediment data reports from before and after the remediation. He or she may wish to look up the new bathymetry map and water level data to determine if contaminated sediments are permanently submerged or are exposed during low tide. The user may also wish to interact with digital elevation models of the river and water flow measurements using a multidimensional hydrodynamic model to simulate PCB migration and sediment transport under different remediation scenarios. Annotated images of output from such models might then be stored for later use by others.

Table 1. Data types stored in DM.

| Data Type | Data Source | Growth Rate |
|---|---|---|
| Satellite Imagery | Direct download, NASA, 3rd parties | 3 GB/mo. |
| Digital Maps and environmental models | Internal digitization, 3rd parties | 10 MB/mo. |
| Environmental monitoring data | Automated monitoring stations and specific studies | 2 MB/mo. |
| Engineering Reports and scientific papers | Internal and contracted studies | 10 MB/mo. |
| Video | Internal documentaries, 3D Fly-bys and news media | 1 GB/mo. |
| Web pages and images | Internal documentation and photos | 20 MB/mo. |

Research Challenges and Approaches

We now turn our attention to the research questions outlined earlier. We view the overall warehousing issue in terms of the underlying data and the metadata that describe it. In a warehouse, metadata implies the datatypes and domains of data items as well as the lineage of the data that includes the algorithms used to transform and process data. Both the data and the metadata can be centralized or distributed. In the case of the Digital Meadowlands, we envision a system that integrates metadata and maintains it in a centralized repository.

In traditional data warehouses, the data warehouse data model is designed according to a specific business model to serve the needs of a known user community. The user community is aligned with the business model, and thus its needs coincide with the needs of the business. By contrast, the contents of a DL follow no single static business model, as the user community is constantly in flux and presents a wide variety of needs. For DLs, several data warehouse modeling challenges are present:

• To design a robust data warehouse schema capable of modeling the diverse requirements of a DL. Such a data model should provide a foundation to build multiple views of the warehouse according to the needs of diverse user communities.

• To allow the data warehouse model to evolve gracefully as new data sources come online, and as the needs of the user community evolve.

Traditional data warehouses have employed multidimensional data models implemented in either a proprietary multidimensional database management system (DBMS), or in a relational DBMS. In such models, important aspects of the business are recorded in a set of dimensions, each with its own set of unique identifiers. A dimension contains records that represent varying levels of granularity. For example, a time dimension contains day, week, month, quarter, and year. Facts are then associated with some combination of identifiers for each of the dimensions.

The most common model in use is called the star schema, in which multiple dimension tables are related to a single fact table. In a snowflake schema, the dimension tables are normalized into additional related tables. In a constellation schema, there are several fact tables that share some dimension tables [7]. The advantages of this type of data model are that the aggregates can be pre-computed and stored in the warehouse for fast retrieval, and additional levels of granularity can be added without disrupting the existing schema or applications. However, adding dimensions to an existing warehouse is a difficult task.
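To make the star schema concrete, the following minimal sketch builds one fact table and two dimension tables in an in-memory SQLite database and rolls the facts up from day to month. The table names, column names, and sales figures are hypothetical and chosen only for illustration; they do not come from any system described in this article.

```python
import sqlite3

# Hypothetical star schema: one fact table related to two dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_time  (time_id INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER);
CREATE TABLE dim_store (store_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE fact_sales (
    time_id  INTEGER REFERENCES dim_time(time_id),
    store_id INTEGER REFERENCES dim_store(store_id),
    amount   REAL
);
""")

# Made-up sample rows; each time record carries several levels of granularity.
conn.executemany("INSERT INTO dim_time VALUES (?, ?, ?, ?)", [
    (1, "2001-01-15", "2001-01", 2001),
    (2, "2001-01-16", "2001-01", 2001),
    (3, "2001-02-01", "2001-02", 2001),
])
conn.executemany("INSERT INTO dim_store VALUES (?, ?)", [(1, "S1"), (2, "S2")])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", [
    (1, 1, 100.0), (2, 1, 50.0), (3, 1, 25.0), (1, 2, 10.0),
])


def monthly_totals(connection):
    """Aggregate daily facts to the month level of the time dimension."""
    return connection.execute("""
        SELECT s.name, t.month, SUM(f.amount)
        FROM fact_sales f
        JOIN dim_time  t ON t.time_id  = f.time_id
        JOIN dim_store s ON s.store_id = f.store_id
        GROUP BY s.name, t.month
        ORDER BY s.name, t.month
    """).fetchall()


print(monthly_totals(conn))
```

In a production warehouse the result of such a query would typically be pre-computed and stored, which is the fast-retrieval advantage noted above. Adding a new granularity level (for example, a quarter column) touches only the time dimension, whereas adding a whole new dimension requires changing the fact table itself.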

The data warehouse component of DM will include a multidimensional data model that supports the environmental focus of DM. Our dimensions include time, data source, and region of interest (ROI). The ROI dimension is represented as a series of arbitrary, possibly overlapping points, polygons and three-dimensional regions. ROIs can be aggregated and summarized into higher-level ROIs. For example, the wetlands acreage for several municipalities can be aggregated into the wetlands acreage for a county. Users are free to designate their own ROIs by either combining existing ROIs or by specifying a point, polygon or three-dimensional space. For example, the users in the above scenario may designate an ROI for an area of high PCB concentration, and then share this ROI with others for their input and use. A physical measurement for an ROI taken at a given time period from a given source forms the basis for the facts.
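The ROI roll-up described above can be sketched as follows. This is only an illustration under simplifying assumptions: ROIs are identified by name rather than by geometry, the place names and acreage values are invented, and `Fact`, `ROI_MEMBERS`, `aggregate`, and `define_roi` are hypothetical names rather than parts of the DM implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    """One measurement for an ROI, taken in a time period, from a source."""
    roi: str
    period: str
    source: str
    measure: str
    value: float


# Hypothetical composition of higher-level ROIs from member ROIs.
# Because membership is a set per ROI, user-defined ROIs may overlap.
ROI_MEMBERS = {
    "County X": {"Town A", "Town B"},
    "County Y": {"Town C"},
}

# Invented sample facts (not real measurements).
FACTS = [
    Fact("Town A", "2001", "survey", "wetlands_acres", 120.0),
    Fact("Town B", "2001", "survey", "wetlands_acres", 80.0),
    Fact("Town C", "2001", "survey", "wetlands_acres", 40.0),
]


def aggregate(roi, measure, period, facts=FACTS, members=ROI_MEMBERS):
    """Sum a measure over the member ROIs of `roi` (or over `roi` itself)."""
    leaves = members.get(roi, {roi})
    return sum(
        f.value
        for f in facts
        if f.roi in leaves and f.measure == measure and f.period == period
    )


def define_roi(name, member_rois, members=ROI_MEMBERS):
    """Register a user-designated ROI built by combining existing ROIs."""
    members[name] = set(member_rois)


print(aggregate("County X", "wetlands_acres", "2001"))
define_roi("PCB study area", ["Town B", "Town C"])
print(aggregate("PCB study area", "wetlands_acres", "2001"))
```

Simple summation is only meaningful when the member ROIs of a higher-level ROI do not themselves overlap spatially. A geometry-based model of points, polygons, and three-dimensional regions would need to account for shared area before aggregating.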

Distributed database systems, federated database systems and data warehouses all must deal with issues of interoperability and data heterogeneity. As described previously, DM consists of a variety of disparate data types from a wide range of sources. Interoperability issues include how multiple systems can exchange queries and data. In general, interoperability problems can be overcome through the use of wrappers or other interfaces. This is typical in cases where a single entity likely controls all of the component systems. However, in a DL, there may not be a single owner for all data sources. A far more likely scenario is that each data source is owned by a different management entity. Thus it may be difficult to impose a given solution on all sources, especially in cases where modification to the sources is required.
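As a rough illustration of the wrapper idea, the sketch below places a thin adapter in front of each source so that the warehouse loader sees one common record shape while the sources themselves stay unmodified. The two source formats, their field names, and the class names are assumptions made for this example only.

```python
import csv
import io
from abc import ABC, abstractmethod


class SourceWrapper(ABC):
    """Common interface the warehouse loader relies on."""

    @abstractmethod
    def readings(self):
        """Yield dicts with the keys 'station', 'parameter', and 'value'."""


class CsvStationWrapper(SourceWrapper):
    """Wraps a hypothetical source that exports CSV text."""

    def __init__(self, text):
        self.text = text

    def readings(self):
        for row in csv.DictReader(io.StringIO(self.text)):
            yield {
                "station": row["station"],
                "parameter": row["param"],
                "value": float(row["val"]),
            }


class RecordFeedWrapper(SourceWrapper):
    """Wraps a hypothetical source that delivers records with other field names."""

    def __init__(self, records):
        self.records = records

    def readings(self):
        for rec in self.records:
            yield {
                "station": rec["site_id"],
                "parameter": rec["measure"],
                "value": float(rec["reading"]),
            }


def load_all(wrappers):
    """Collect uniform records from every wrapped source."""
    return [reading for w in wrappers for reading in w.readings()]


csv_source = CsvStationWrapper("station,param,val\nST1,pH,7.2\n")
feed_source = RecordFeedWrapper([{"site_id": "ST2", "measure": "pH", "reading": "6.9"}])
print(load_all([csv_source, feed_source]))
```

The point of the design is that all translation lives in the wrapper, which the DL operator can write and maintain, so a source owned by another organization does not have to change.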

DL data comes in a number of modalities including text, images, audio, and video. In an environmental DL, we also consider maps in vector-based formats, numeric data, and environmental models. These data types have the same, if not more complex, data heterogeneity issues. For example, how should a video of a PCB study area be integrated with a 3D representation of a hydrodynamic model? The interoperability and data heterogeneity challenges are to:

• Develop interoperability methods that allow free exchange of data between sources and the warehouse, while requiring minimal modification of the sources.

• Develop a model of DL content that encompasses all data types, and provides methods to convert content from one modality to another.

A wide array of approaches in the literature addresses interoperability and data heterogeneity issues. Approaches range from loosely coupled federated systems to tightly coupled heterogeneous distributed database management systems. There are, however, two dimensions of this problem that we wish to emphasize. First, heterogeneity of multimedia data has not been addressed in as much depth as traditional numeric and character data have been. Several approaches, such as mediators or the IBM data pyramid, allow the specification of conversion functions between data in different formats. For example, [1] describes oblets that render content according to client capabilities, thus reducing the load on a central server. The goal of these approaches is to store only one copy of the multimedia object, and convert the object into the needed format on the fly, when it is queried. While economizing storage space, such approaches may suffer when fast response times are required or for applications that process large data quantities such as data mining. Rapid responses to large collections of objects are also required with data warehousing.

Second, a number of brute force integration approaches require modification of the component systems so they can participate in a federation or distributed database. The CORBA specification, and more recently XML-based Web services, have done much to guide the development of interoperable applications. However, we note that legacy systems or systems beyond immediate management, such as a DL, may be difficult to integrate, as the owners of the source may not have the resources or the desire to customize their system to fully participate in the DL.
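The general idea of storing one copy of an object and converting it on demand can be sketched with a small registry of conversion functions. This is not the mechanism used by mediators, the IBM data pyramid, or the oblets of [1]; the formats, function names, and the breadth-first search for a conversion chain are all assumptions made for illustration.

```python
from collections import deque

# Hypothetical registry: (source_format, target_format) -> conversion function.
CONVERTERS = {}


def register(src, dst):
    """Decorator that records a conversion function between two formats."""
    def decorator(fn):
        CONVERTERS[(src, dst)] = fn
        return fn
    return decorator


@register("csv", "rows")
def csv_to_rows(text):
    return [line.split(",") for line in text.strip().splitlines()]


@register("rows", "html")
def rows_to_html(rows):
    body = "".join(
        "<tr>" + "".join(f"<td>{cell}</td>" for cell in row) + "</tr>"
        for row in rows
    )
    return f"<table>{body}</table>"


def conversion_path(src, dst):
    """Breadth-first search for a chain of registered conversions."""
    queue = deque([[src]])
    seen = {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for (a, b) in CONVERTERS:
            if a == path[-1] and b not in seen:
                seen.add(b)
                queue.append(path + [b])
    return None


def convert(obj, src, dst):
    """Convert the single stored copy into the requested format at query time."""
    path = conversion_path(src, dst)
    if path is None:
        raise ValueError(f"no conversion from {src} to {dst}")
    for a, b in zip(path, path[1:]):
        obj = CONVERTERS[(a, b)](obj)
    return obj


print(convert("a,b\n1,2", "csv", "html"))
```

Every call to `convert` repeats the full chain of conversions, which illustrates the response-time concern raised above: for workloads that touch many objects, such as data mining, caching or pre-computing the converted forms trades storage for speed.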

Data warehouses traditionally store summarized and abstracted data. Summaries are normally in numeric form, such as the total sales of a product for a given time period. Therefore, data in the data warehouse database is predominantly numeric. In the case of an environmental DL, users may only be interested in a high-level summary of an environmental situation. As discussed, a number of multimedia data types will be present in a DL. Summarizing or aggregating such data is not straightforward. For example, how does one summarize a collection of videos or a collection of maps? One example of research in this direction is the Informedia project at Carnegie Mellon, which demonstrates several new techniques for summarizing video stories using text summaries and video digests [3].

DL objects can be combined with or related to one another in multiple ways. In the environmental domain, it is common to combine vector maps representing spatial characteristics with satellite images or other data sources to form new insights into the environmental characteristics of a given area. The challenges with such approaches include:

• Developing models for representation of DL objects, and summaries of DL objects in a variety of data types and modalities.

• Developing efficient methods for aggregating, summarizing, and otherwise adding value to DL objects.

Traditional data warehouses are used primarily as read-only or query-only systems. That is, following a data load event during which the data warehouse is unavailable to users as summarized operational data is loaded, users query the data warehouse and act on the results. The discovery of new relationships among query results cannot be pushed back into the data warehouse for storage. For example, by querying the data warehouse, a user discovers that when promotion X is in effect, product Y sales increase in store S1 for the first three days of the promotion, before dropping off. In comparison, sales in store S2 are flat until days four through seven, when they increase steadily. The user might act externally (to the DW) on this discovery, but he or she has no way to record this discovery back in the data warehouse.

It is insufficient for DLs to act solely as repositories for data retrieval. DLs must foster experimentation and exploration of relationships between DL objects. By the same token, discovery of relationships or value-added methods for manipulating data should be stored in a repository for future inspection both by the original user and other users, and for future processing and possible combination with other DL objects. In our environmental DL scenario, a user may wish to associate their ROI with a particular set of annotated maps and scientific studies and then store this ROI to share with other users. The data warehouse appears to be the most appropriate repository for these value-added objects and discovered relationships. If we consider query/retrieval of DL objects as a “pull” mechanism, then storing value-added objects to the data warehouse might be considered as a “push” activity. Thus the challenges become:

• Providing users with tools and interfaces to combine, associate, and operate on objects in the DL, and then providing methods to store these new derivative objects in the data warehouse.

• Providing data warehouse data models capable of accepting such objects pushed back from the users.
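One way to picture the "push" activity is a store that accepts derived objects together with a record of the DL objects they were built from, so that other users can later find them. The sketch below is only a toy model under that assumption; `DerivedObject`, `Warehouse`, and all identifiers and sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DerivedObject:
    """A value-added object pushed back by a user, with its provenance."""
    object_id: str
    creator: str
    kind: str             # for example "roi" or "annotation"
    derived_from: tuple   # ids of the DL objects this object builds on
    note: str = ""


class Warehouse:
    """Toy store that supports both push (store) and pull (retrieve)."""

    def __init__(self):
        self._derived = {}

    def push(self, obj):
        """Store a derived object; refuse to overwrite an existing id."""
        if obj.object_id in self._derived:
            raise ValueError(f"object id already in use: {obj.object_id}")
        self._derived[obj.object_id] = obj

    def pull_related(self, source_id):
        """Return every derived object that builds on the given DL object."""
        return [o for o in self._derived.values() if source_id in o.derived_from]


warehouse = Warehouse()
warehouse.push(DerivedObject(
    object_id="roi-001",
    creator="user-a",
    kind="roi",
    derived_from=("map-17", "study-42"),
    note="Area of suspected high PCB concentration",
))
print([o.object_id for o in warehouse.pull_related("map-17")])
```

Recording `derived_from` alongside each pushed object is what lets a later query for a map or study also surface the ROIs and annotations other users attached to it.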

Little work has been published explicitly on the topic of folding query results back into a data warehouse for other users to locate and manipulate. One related example is [4], in which the authors discuss an approach to mining scientific data for the purpose of automatically generating metadata describing the data sets. This derivative metadata can then be used to support future searches.

Data warehouses provide an excellent basis for performing data mining due to their “clean” data sets and coherent structure. Data mining tools for numeric data types are now coming into the mainstream. Data mining tools for multimedia and spatial data types represent a ripe field for research [8, 5].

The more general topic of scientific data mining has also received attention [6]. Phenomena-oriented data mining attempts to discover scientific data set phenomena that can be characterized and stored for later retrieval. This approach has been applied to NASA’s EOSDIS satellite data archives to provide methods to effectively search for data corresponding to certain events.

Conclusion

In this paper we distinguish between traditional and nontraditional data warehousing, such as that encountered in DLs. We illustrate the differences between the two types of data warehousing by focusing on a specific application in the area of environmental digital libraries. We discuss some of the challenges and issues that need to be addressed when dealing with nontraditional data warehousing and suggest possible approaches to some of these challenges. Our current prototype implementation of Digital Meadowlands includes all of the data sources we have described, and is accessible through a Web portal. As we continue our research, DM will be enhanced with the features we have presented here.

References

1. Adam, N., Atluri, V., Adiwijaya, I., Banerjee, S., and Holowczak, R. A dynamic manifestation approach for providing universal access to digital library objects. IEEE Transactions on Knowledge and Data Engineering 13, 4 (July/Aug. 2001) 705–716.

2. Environmental Protection Agency. Draft environmental impact statement on the special area management plan for the Hackensack Meadowlands District, NJ. Technical report, EPA, 1995.

3. Christel, M.G. Visual digests for news video libraries. In Proceedings of the ACM Multimedia ’99 Conference.

4. Graves, S.J., Hinke, T.H., and Kansal, S. Metadata: The golden nuggets of mining data. In Proceedings of the First IEEE Metadata Conference (April 1996).

5. Han, J. and Kamber, M. Data Mining: Concepts and Techniques. Morgan Kaufmann (Aug. 2000).

6. Hinke, T.H., Rushing, J., Ranganath, H., and Graves, S.J. Target-independent mining for scientific data. In Proceedings of the Third International Conference of Knowledge Discovery and Data Mining (Aug. 1997).

7. Inmon, W.H. Building the Data Warehouse. QED Publishing Group, 2nd edition, 1996.

8. Koperski, K., Adhikary, J., and Han, J. Knowledge discovery in spatial databases: Progress and challenges. In Proceedings of the SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (1996).