IMPLEMENTING EXPLORATORY SPATIAL DATA ANALYSIS METHODS FOR MULTIVARIATE HEALTH STATISTICS


Daniel B. Haug¹, Alan M. MacEachren², Francis P. Boscoe¹, David Brown¹, Mark Marra¹, Colin Polsky¹, and Jaishree Beedasy¹

Proceedings of GIS/LIS '97 Annual Conference and Exposition. October 28-30, 1997. Cincinnati, Ohio, pp. 205-213.

1. Department of Geography, 302 Walker, Penn State University, University Park, PA; e-mail: [email protected]

2. Department of Geography & Population Research Institute

Abstract:

This paper reports on the development of prototype software designed for exploratory visualization of geographically referenced health statistics. The software prototype provides a number of interactive methods for exploring relationships between risk factors and mortality rates and how they are distributed in space. The use of geographically referenced mortality data to detect disease "hot spots" can be traced, at least, to Dr. John Snow’s 1854 map of cholera deaths in London, which allowed him to hypothesize that a particular water pump was the source of the epidemic. While the use of traditional static maps for cluster identification continues to be important, with a major new atlas of mortality in the U.S. just published, dynamic exploratory data analysis and visualization techniques have the potential to further enhance detection of "hot spots". Our prototype implements a number of exploratory data visualization techniques within existing geographic information systems software (ArcView GIS, ESRI, Redlands, CA). These techniques, including scatterplot brushing, interactive data classification, focusing, and representational methods for multivariate display, can help the analyst identify disease "hot spots" and facilitate data exploration that may lead to hypotheses about causal links between mortality and potential risk factors. This paper will discuss the technical issues surrounding the implementation of these visualization techniques and the ways in which these techniques may be employed in existing GIS software. Finally, it will examine possible enhancements to the methods implemented here.

Introduction:

The U. S. Centers for Disease Control and Prevention, National Center for Health Statistics (NCHS) recently commissioned a study designed to pinpoint specific conceptual and implementational issues in interface design for Geographic Visualization (GVis) and exploratory data analysis (EDA) of relationships between risk factors and mortality rates. This paper reports on one component of that project, the implementation of some exploratory data analysis operations in ArcView GIS (ESRI, Redlands CA).

Epidemiology has long been concerned with finding the causes of (i.e., the risk factors associated with) given diseases. To that end, statistical techniques are often used to confirm or refute hypothesized relationships. However, the process of forming hypotheses regarding relationships between risk factors and diseases should not be left to statistical techniques. Development of statistical methods for finding patterns and relationships that have a spatial component is an active area of research. However, lacking generally accepted statistical methods for identifying patterns and relationships in georeferenced data, analysts have often turned to geographic visualization as a tool for hypothesis generation (Mason et al., 1975; Pickle et al., 1987; Pickle et al., 1990; Croner et al., 1992; Winn et al., 1981). Using visualization, hypotheses are generated by visually searching a map for disease clusters, or "hot spots", and then looking for potential risk factors having a similar geographic distribution. The recent publication of the NCHS Atlas of Mortality is a testament to the acceptance of visualization as a technique for identifying disease clusters.


Geographic Visualization has emerged as an effort to enhance the traditional use of maps by extending methods associated with visualization in scientific computing (ViSC) and EDA to the analysis of geographically referenced data. GVis, then, combines the work in ViSC and EDA with the principles of cartography to produce multivariate and spatio-temporal representations that allow the analyst to address problems that were difficult or impossible to deal with using static, univariate, or non-visual representations.

We begin with a discussion of the goals of the project, and software tools selected for initial prototype development. This is followed by a brief outline of the philosophical framework within which the interface was developed. The next section is a thorough explanation of the implementation of the following operations: creating scatterplots with independently transformed axes, geographic brushing, focusing, bivariate mapping through texture overlay, and dynamic visual data classification through scatterplots. The paper concludes with a discussion of the limitations of these operations as they are implemented here, as well as research questions that have been raised in this exercise.

Project overview:

This research was commissioned by the NCHS to explore the capabilities and problems associated with a highly interactive data analysis and visualization system applied to health statistics. The goals of this project were to construct a prototype interface that could be employed in empirical testing of representation and interaction techniques applied to analyzing multiple georeferenced variables at several time intervals. Because of the limitations of the software packages available at the project outset, we chose to develop two separate modular prototypes. The first was developed in ArcView ® and the second in MacroMedia Director ®, which both provide fairly robust object oriented scripting languages. The ArcView prototype was chosen to explore multivariate data-driven analysis of geographic data. However, ArcView provides poor support for animation or the development of custom user controls (such as slider bars). Director, on the other hand, emphasizes animation, and is highly flexible in facilitating design of user controls. Its weaknesses lie in the area of data-driven analysis of geographic representations. As a result, ArcView hosts the data-driven analysis and provides images of geographically referenced data that Director can animate and manipulate with a more flexible user interface. Further discussion of the Director side of this research can be found at the project web site listed at the end of this article.

Prototype Development:

The development of this prototype followed a hierarchical interface design approach proposed by Howard and MacEachren (1996). This approach has three levels at which interface goals must be addressed. The first is the conceptual level, which asks what and whom the system is for. This is followed by development of operational level goals outlining specific tasks that must be accomplished to achieve conceptual level goals. In addition, this level requires the formalization of these tasks as operations on information. At the implementational level, decisions are made about how to implement those operations and represent them to the user. The goals for each of these three levels of interface design are discussed in more detail below. The focus of this paper, however, is the implementational level, and therefore this section will be emphasized. For a more in-depth discussion of the full system design process, see the project web site listed at the end of this article.

Conceptual Level

The conceptual level of computer interface design, as proposed by Howard and MacEachren, should pinpoint what the system is for, who the expected system users are, what needs are met by the system, and what the results of working with the system should be (1996, p.61). The system that we set out to develop was designed for epidemiologists, whose concerns are with health statistics analysis. This can be rephrased


analysis of attributes in space and time at multiple scales, and d) addressing how specific data characteristics and data processing methods play a role in the perception and identification of spatial and spatio-temporal patterns. For the purposes of this paper, we will only address the first of these goals, which serves as the base from which to identify specific data analysis tasks and associated operations to achieve them at the operational level. Operations relating to the other three goals are implemented in the MacroMedia Director module of this prototype, and are not discussed in this paper.

Operational Level

The operations considered here address the first conceptual goal listed above. That goal, to facilitate spatial pattern analysis, can be broken down into the following four operational sub-goals: a) highlight high and low values, b) enhance visibility of regional clusters, c) relate data in geographic and attribute space, and d) explore associations between mortality and risk factors.

Implementational Level

The operational level goals enumerated above were implemented in a number of customized functions. These functions, discussed at length in the following section, include the creation of scatterplots, brushing, focusing, bivariate mapping through texture overlay, and dynamic data classification using scatterplots.

The ArcView user interface allows users to write their own scripts in a language called Avenue, and then incorporate those scripts directly into the interface using pulldown menus, buttons, or tools. While pulldown menus and buttons are fairly self-explanatory, tools are functions that are activated through the use of the cursor in the active window. For instance, there is a built-in zooming tool for the "view" window which allows the user to pick the lower left and upper right corners of a section of the view, and then sets the extent of that view to those points. Figure 1, at right, illustrates custom buttons and tools implemented in this prototype.

The remainder of this section describes the implementation of each of the functions listed above, the operational level goals that these functions are designed to meet, and how they have been integrated into the ArcView user interface.

Creating Scaleable and Linkable Scatterplots

By providing a tool for visualizing bivariate relationships, scatterplots meet the operational level goal of analyzing relationships between mortality and risk factors. However, it became apparent at early stages of prototype development that the graphing capabilities built into ArcView were not robust enough to handle other operations that we were interested in implementing (e.g. brushing and dynamic data classification). Creating user-manipulable scatterplots, then, is a precursor to further addressing the four operational goals outlined above. We needed to be able to treat each point in a scatterplot as an entity linked to a corresponding feature in the map, and be able to manipulate that point and have those manipulations reflected on the map. We felt that the best way to accomplish this would be to simply create a view and "map" each scatterplot point into a location in data space and store it as an entity in a shape file.

Scatterplot creation was accomplished by simply reading two fields of data from the polygon attribute file associated with the layer being mapped (in this case health service areas (HSAs)), and assigning the values in one field to the X axis and the values in the other to the Y axis. When the script has processed the list of HSAs, the result is a point theme in the scatterplot with one point corresponding to each HSA. The points in this scatterplot theme are sorted in the same order as the HSAs reside in their polygon coverage. This common order is critical for implementation of geographic brushing (discussed below).
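The point-creation step can be sketched outside ArcView. The following Python fragment is only an illustration of the idea, not the Avenue script used in the prototype; the field names, the sample values, and the ScatterPoint type are hypothetical. The property it preserves is the one the text relies on: point i in the scatterplot corresponds to record i in the attribute table.

```python
from typing import NamedTuple


class ScatterPoint(NamedTuple):
    record_index: int  # position of the source feature in the attribute table
    x: float
    y: float


def build_scatter_points(records, x_field, y_field):
    """Create one point per record, in the same order as the records."""
    return [
        ScatterPoint(i, float(rec[x_field]), float(rec[y_field]))
        for i, rec in enumerate(records)
    ]


# Hypothetical attribute table: one row per health service area (HSA).
hsa_table = [
    {"hsa_id": 101, "smoking_pct": 22.5, "lung_cancer_rate": 61.0},
    {"hsa_id": 102, "smoking_pct": 30.1, "lung_cancer_rate": 78.4},
    {"hsa_id": 103, "smoking_pct": 18.9, "lung_cancer_rate": 49.7},
]

points = build_scatter_points(hsa_table, "smoking_pct", "lung_cancer_rate")
```

Because the record index is carried with each point, any selection made on the points can be applied to the polygons by position alone.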


Once the point shape file is created, X and Y axis lines and labels can be added. However, it is often necessary to transform the data values on one or both axes in order to create a visual representation that is interpretable. For instance, if the X axis variable has a range of values from 0 to 1000, and the Y axis variable has a range from 0 to 1, it will be impossible to distinguish variation in the Y axis without transforming one or the other axis in order to make the ranges comparable. Transformation can also be useful in searching for non-linear relationships between variables in the scatterplot. Because of this, we provided the analyst with the ability to perform any one of a wide variety of transformations on either axis. For an example of a scatterplot with a transformed Y axis, see Figure 2.
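As a rough illustration of per-axis transformation, the Python sketch below applies a named transformation to the values of one axis. The paper does not list which transformations the prototype offers, so the three shown here (none, log10, sqrt) are assumptions chosen for the example, as are the function and table names.

```python
import math

# Hypothetical set of named transformations; the prototype's actual list
# is not given in the text.
TRANSFORMS = {
    "none": lambda v: v,
    "log10": lambda v: math.log10(v),  # requires v > 0
    "sqrt": lambda v: math.sqrt(v),    # requires v >= 0
}


def transform_axis(values, kind):
    """Return the values of one axis after applying the named transformation."""
    func = TRANSFORMS[kind]
    return [func(v) for v in values]


# An X variable spanning three orders of magnitude is compressed to 0..3,
# which is easier to plot against a Y variable that ranges from 0 to 1.
x_raw = [1.0, 10.0, 100.0, 1000.0]
x_plot = transform_axis(x_raw, "log10")
```

Each axis is transformed independently, so the X axis can be log-scaled while the Y axis is left untouched.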

Transformation of axes requires relatively simple calculations, but it is important to store information about how each axis has been transformed. This particular function was implemented through the use of object tags associated with each axis line. Each object in the Avenue class structure allows the user to instantiate any other object as an attribute associated with it. This attribute is known as an object tag. Throughout this implementation, we relied on object tags to allow objects to carry information about themselves that would be needed later. In addition, since the object tag attribute can be instantiated with any object, we were able to use a list object to carry multiple attributes in the object tag. In this case, an object tag of each axis line was instantiated with a list. The first element in the list was a text string identifying the line as the X or Y axis, and the second element in the list was the type of transformation that had been performed. We will deal more explicitly with the importance of storing this information below.
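The object tag pattern described above can be mimicked in Python with an ordinary attribute. This is a sketch of the pattern only; the AxisLine class and the helper names are hypothetical stand-ins and do not reproduce the Avenue API.

```python
class AxisLine:
    """Stand-in for a graphic line; object_tag may hold any object."""

    def __init__(self, start, end):
        self.start = start
        self.end = end
        self.object_tag = None


def tag_axis(line, axis_name, transform_name):
    """Store [axis identifier, transformation type] in the line's tag."""
    line.object_tag = [axis_name, transform_name]


def axis_transform(graphics, axis_name):
    """Find the graphic tagged as the given axis and return its transformation."""
    for graphic in graphics:
        tag = graphic.object_tag
        if isinstance(tag, list) and len(tag) == 2 and tag[0] == axis_name:
            return tag[1]
    return None


x_axis = AxisLine((0, 0), (10, 0))
y_axis = AxisLine((0, 0), (0, 10))
tag_axis(x_axis, "X", "none")
tag_axis(y_axis, "Y", "log10")
```

Later operations only need the list of graphics in a view to recover how an axis was transformed; no separate bookkeeping structure is required.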

One of the reasons for creating scatterplots was the ability to construct scatterplot matrices by creating multiple scatterplots of related data and arranging them on the computer screen in a simple matrix. In order for this to be effective, however, it is necessary to have the same axes on each scatterplot. For instance, when comparing the juxtaposition of death rates due to lung cancer and a risk factor such as smoking for different time periods, the rates of smoking and the death rates may decline over time. In order for different scatterplots to be comparable, the axes scales and ranges on each plot must be identical. Therefore the data range for all possible death rates in all possible years must be used to determine the axis range. In order to solve this problem we have allowed the user to pick multiple fields for the range of values that will be used in setting up each axis. A simple example of a 1 X 2 scatterplot matrix and its associated map can be seen in Figure 3.
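A minimal sketch of the shared-range calculation follows, assuming the attribute table is available as a list of records. The field names and rates are invented for the example. The range is taken over every field the user selects, so that all plots in the matrix share one axis.

```python
def shared_axis_range(records, fields):
    """Return (min, max) over all listed fields so every plot uses one range."""
    values = [float(rec[field]) for rec in records for field in fields]
    return min(values), max(values)


# Hypothetical death rates for the same areas in two time periods.
table = [
    {"rate_1980": 61.0, "rate_1990": 55.2},
    {"rate_1980": 78.4, "rate_1990": 70.9},
    {"rate_1980": 49.7, "rate_1990": 44.1},
]

axis_min, axis_max = shared_axis_range(table, ["rate_1980", "rate_1990"])
```

Here the axis runs from the lowest 1990 rate to the highest 1980 rate, so a decline over time shows up as a shift of the point cloud rather than being hidden by rescaling.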

One of the side effects of giving the user flexibility in creating a scatterplot is that this process can be somewhat cumbersome. For instance, when the user wishes to create a scatterplot, they will first choose the field that will be used as the X axis, and then choose all fields that will be used to set the range of values for which the axis will be scaled. These same parameters will then be chosen for the Y axis. Finally the user will be presented with the range of possible values for each axis, and then prompted to choose an appropriate transformation for each axis. In addition, the user is provided with the opportunity to enter a name for the scatterplot and a label for each axis. The result in ArcView is a new view with the user-provided name of the scatterplot. Within the view is a theme with a set of points and four graphic objects: one line and one label for each axis. This script is integrated into the ArcView interface using a button.

Linked Brushing

Brushing is essentially the ability to highlight certain entities in one visual display and have them appear highlighted in a corresponding display as well. It supports the operational goal of relating data in geographic and attribute space. Geographic brushing has been implemented in a number of other prototype visualization packages in geography as a link between a scatterplot and a map (Monmonier and Gluck, 1994; Dykes, 1997), but to our knowledge, this is the first such implementation in a commercially available GIS package.


In this prototype, brushing has been implemented so that not only are a map and a scatterplot linked, but the map and any scatterplot that was created based on that map are linked. This provides the opportunity for multidimensional brushing of a scatterplot matrix, where entities can be added or subtracted from a highlighted set using visual selection in any scatterplot or in the map display.

Brushing was implemented by taking advantage of object tags and ArcView selection bitmaps. A selection bitmap is a list of bits (ones or zeros) ordered identically to the records in a data table. Each bit, then, is a flag indicating whether or not a given record in the data table has been selected. Since the points in the scatterplots are ordered identically to the features in the map that they are associated with, the same selection bitmap can be applied to any scatterplot as well as to the map. Since every scatterplot is a view object, just like the map, all related scatterplots and the map that they are related to are given a unique identifier in the object tag. Using this identifier, the brushing script knows which views to apply the bitmap to.

The brushing tool is implemented in this prototype as a user-defined rectangular box that allows the user to select the points on a scatterplot, or features on the map, that are of interest. The script then finds all views that have the same identifier as an object tag, and applies the selection bitmap to the data theme. Elements can also be added to or subtracted from the selected set using different views. For instance, if the user had three separate scatterplots and wanted to see the geographic distribution of outliers on all three plots, she could select the outliers in one plot, move to the next plot and add the outliers from that plot, and then move to the third plot and add those outliers. This final selection set would be highlighted in all three plots as well as the associated map. Figure 3 is an example of scatterplot brushing where the points selected in the lower left scatterplot are also highlighted in the other scatterplot and the map.
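The linking logic can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the prototype's Avenue code: LinkedView, indices_in_box, brush, and the mode names are hypothetical, a plain list of booleans stands in for the selection bitmap, and the link_id attribute plays the role of the identifier stored in each view's object tag.

```python
class LinkedView:
    """Stand-in for a map or scatterplot view with a selection bitmap."""

    def __init__(self, name, link_id, n_records):
        self.name = name
        self.link_id = link_id  # shared identifier, as held in the object tag
        self.selection = [False] * n_records


def indices_in_box(points, xmin, ymin, xmax, ymax):
    """Return the indices of points falling inside the brushing rectangle."""
    return {
        i for i, (x, y) in enumerate(points)
        if xmin <= x <= xmax and ymin <= y <= ymax
    }


def brush(views, source, picked, mode="new"):
    """Update the bitmap from the source view and copy it to all linked views."""
    current = source.selection
    if mode == "new":
        bitmap = [i in picked for i in range(len(current))]
    elif mode == "add":
        bitmap = [flag or (i in picked) for i, flag in enumerate(current)]
    elif mode == "subtract":
        bitmap = [flag and (i not in picked) for i, flag in enumerate(current)]
    else:
        raise ValueError(f"unknown mode: {mode}")
    for view in views:
        if view.link_id == source.link_id:
            view.selection = list(bitmap)
    return bitmap


hsa_map = LinkedView("map", "hsa", 4)
plot_a = LinkedView("plot A", "hsa", 4)
plot_b = LinkedView("plot B", "hsa", 4)
other = LinkedView("unrelated", "counties", 4)
views = [hsa_map, plot_a, plot_b, other]

points_a = [(1.0, 1.0), (2.0, 9.0), (3.0, 2.0), (9.0, 9.5)]
points_b = [(0.5, 8.0), (4.0, 4.0), (5.0, 5.0), (6.0, 6.0)]

# Select outliers in plot A, then add an outlier found in plot B.
brush(views, plot_a, indices_in_box(points_a, 0, 8, 10, 10), mode="new")
brush(views, plot_b, indices_in_box(points_b, 0, 7, 1, 9), mode="add")
```

After the two brushing steps, the map and both linked plots carry the same bitmap, while the view with a different identifier is left untouched. The scheme works only because every linked view stores its records in the same order.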

Focusing and Dynamic Classification

Focusing is an EDA method of highlighting extreme high and extreme low data values, thus addressing the operational goal of highlighting high and low values. Focusing, as implemented here, allows the user to slide a break point along the X axis of a scatterplot, and have the shaded regions of a two-class map reflect which side of that break point they are on. This also addresses the operational goal of facilitating spatial cluster recognition by allowing the user to search for spatial clusters of extreme data values.

Dynamic classification is the natural extension of focusing. It allows the user to insert more than one break point into the scatterplot, and to then manipulate those break points so that they are reflected in the classification of the map. Dynamic classification through scatterplots, then, allows the user to find geographic clusters of like values by "focusing" on value ranges that are not in the extreme high and low ends. This technique also allows the user to set class breaks at visually intuitive points in the scatterplot (if any exist). Focusing and dynamic classification have the added advantage of circumventing problems that can be caused by aspatial data classification, where the choice of class breaks may hide geographic clusters. The analyst is no longer limited to the standard choices for breaking data into classes (equal interval, quantile, or even Jenks' Optimal, which is also aspatial), and can explore the effects that different class breaks have on the geographic patterns in the data.
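Both operations reduce to assigning each value to a class given a list of break points; focusing is the special case of a single break. The Python sketch below illustrates this with the standard bisect module. The function name, the sample rates, and the convention that a value equal to a break falls in the upper class are assumptions made for the example.

```python
from bisect import bisect_right


def classify(values, breaks):
    """Assign each value a class index 0..len(breaks) from the break points.

    Assumed convention: a value equal to a break belongs to the upper class.
    """
    ordered = sorted(breaks)
    return [bisect_right(ordered, v) for v in values]


rates = [12.0, 48.5, 50.0, 73.2]

# Focusing: a single break yields a two-class map.
two_class = classify(rates, [50.0])

# Dynamic classification: several user-placed breaks.
three_class = classify(rates, [30.0, 60.0])
```

Dragging a break point amounts to calling the function again with the changed break list and re-shading the map from the returned class indices.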

The first step in implementing focusing and dynamic classification in ArcView was creating initial class breaks, and having them applied to the classification of the appropriate data theme in the map, as well as in the scatterplot. ArcView 3.0 has a number of built in data classification techniques. We took advantage of these techniques by allowing the user to choose any of them to determine the initial class breaks. Information about the classification scheme, including the number of class breaks, the scatterplot axis on which they occur, and the type of classification method being initially applied, is collected from the user using pop-up dialog boxes. The classification scheme is then applied to the scatterplot, and, using object tags to distinguish which map view in the ArcView project the scatterplot was associated with, the classification scheme is then applied to the map as well. The class breaks are represented on the scatterplot as lines perpendicular to the axis along which the data is being classified. Each one of these lines is a graphic element in the view, and carries information in its object tag identifying it as a class break. Scatterplot classification is implemented in the ArcView user interface as a button, and will only activate when the active document is a scatterplot view.

Once the data classification has been set up, the user can take advantage of the focusing/dynamic classification tool that we have implemented. This tool allows the user to click and drag any of the class breaks to a new position. When the user "lets go" of the class break by releasing the mouse button, the class break returns its new position on the axis that it classifies. This value is then used to change the class breaks in the legend of the map (see Figure 4). However, since it is possible that the axis of the scatterplot has been transformed, the value returned from the class break may need to be inversely transformed to its original scale. This is accomplished by checking the object tag of the appropriate axis of the scatterplot, and retrieving the information about the transformation applied.
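The inverse step can be illustrated as follows. This Python sketch assumes the axis tag is a two-element list of axis name and transformation type, as described earlier in the paper, and that the available transformations are the hypothetical none, log10, and sqrt used purely as examples; the function name is also invented.

```python
# Inverse of each assumed transformation, keyed by the name kept in the tag.
INVERSE = {
    "none": lambda v: v,
    "log10": lambda v: 10.0 ** v,
    "sqrt": lambda v: v * v,
}


def break_in_data_units(plotted_position, axis_tag):
    """Convert a dragged break's plotted position back to original data units."""
    _axis_name, transform = axis_tag
    return INVERSE[transform](plotted_position)


# A break released at 2.0 on a log10-transformed X axis corresponds to a
# class break of 100 in the map legend.
legend_break = break_in_data_units(2.0, ["X", "log10"])
```

The map legend always works in original data units, so the conversion is applied once, at the moment the break is released.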

Because it is possible to have multiple scatterplots with different classification schemes at any given time, labels are added to the map to indicate which scatterplot and classification scheme it currently represents. These labels, and the classification of the map are updated whenever the user applies the focusing/dynamic classification or brushing tool in any given scatterplot.

Bivariate Mapping

As mentioned above, scatterplots are good tools for viewing relationships between variables in data space. Similarly, bivariate mapping is a tool for viewing the relationships between variables in geographic space. With focusing and dynamic classification being implemented in scatterplots, in order to use both axes as criteria for classifying the map, some form of bivariate representational scheme needed to be implemented. MacEachren and Brewer (in press) developed a symbolization method for visually separable bivariate representation using texture overlay. Their method allows a color scheme to be overlaid with a hatch pattern of adjacent black and white lines. This representational method is equally discernible on light or dark colors, and is thus particularly suited to the generation of bivariate maps with a complex color scheme representing one variable, and the texture overlay representing the presence or absence of another variable.

Because of the interactivity of this prototype, and the belief that a visually separable representation would be useful, we chose to use the texture overlay method to represent the data on the Y axis of the scatterplot. This was done by adding a second theme to both the scatterplot and the map. On the scatterplot, this theme is classified along the Y axis, and is symbolized using different colors because the texture overlay cannot be implemented with point data. On the map the theme is symbolized using the texture, and can be turned on or off as the user desires. Since the representation method chosen for the Y axis is binary, only focusing can be implemented on that axis. The implementation of this dynamic classification of a bivariate map through the use


Focusing using the Y axis and the texture overlay is illustrated in Figure 5.
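The combination of a multi-class color scheme on the X variable with a binary texture overlay on the Y variable can be sketched as below. This is a simplified Python illustration, not the prototype's symbolization code: the color names, sample values, and function name are hypothetical, and the choice to hatch areas at or above the Y break (rather than below it) is an assumption.

```python
from bisect import bisect_right


def bivariate_symbols(x_values, y_values, x_breaks, y_break, colors):
    """Return (color, hatched) per area: color from X class, hatch from Y focus."""
    ordered = sorted(x_breaks)
    if len(colors) != len(ordered) + 1:
        raise ValueError("need exactly one color per X class")
    symbols = []
    for x, y in zip(x_values, y_values):
        color = colors[bisect_right(ordered, x)]
        hatched = y >= y_break  # assumed: overlay marks the upper Y class
        symbols.append((color, hatched))
    return symbols


symbols = bivariate_symbols(
    x_values=[15.0, 24.0, 33.0],
    y_values=[40.0, 75.0, 90.0],
    x_breaks=[20.0, 30.0],
    y_break=70.0,
    colors=["light", "medium", "dark"],
)
```

Because the overlay is either present or absent, the Y variable can carry only one break, which is why only focusing is available on that axis.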

Discussion:

The prototype described has been developed as a test vehicle for experimenting with the incorporation of exploratory data analysis methods within a GIS environment that provides flexible access to georeferenced data. A next step in the project is to use what we have learned here, and in the complementary spatio-temporal analysis component of the project cited above, to build a single integrated environment for exploratory spatio-temporal data analysis. In relation to this goal, our discussion focuses on limitations identified in the current prototype and related enhancements that should be considered for subsequent development.

There are a number of limitations in the visualization of relationships between variables that can be improved upon. Among these, the implementation of dynamic classification in scatterplots could be improved by allowing the interactive addition and removal of class breaks. This feature would allow the user to add new class breaks without moving any existing breaks. As the system exists now, the number of class breaks can only be changed by re-applying the classification tool, which applies standard non-spatial classification techniques (thus eliminating any class breaks a user had already identified). A second possible enhancement to the implementation of scatterplots would be the ability to interactively change the variable and the transformation of the data plotted along either axis. This would provide the user with the ability to quickly and easily explore different transformations of a given variable, or to page through different variables. With the current implementation, once a variable and a transformation are specified for an axis of a scatterplot, they cannot be changed. Finally, it would improve the visualization of relationships between variables to implement one or more bivariate color schemes modeled on those proposed by Brewer (1994). Using attributes of color to depict both variables (rather than our present use of color for one and texture overlay for the other) would provide the capability for more complex bivariate analysis by allowing more than two classes for the second variable being mapped. It would also provide the user with the opportunity to choose a more visually integral mapping scheme, which should enhance any relationships between the variables (MacEachren et al., in press).

The changes and additions suggested above focus on improving the implementation of a single representation form (choropleth maps). Part of the power of dynamic visualization tools, however, is their ability to allow analysts to examine data from multiple, often unusual, perspectives. Dorling (1992 and 1996) has been a strong advocate of cartograms as an alternative perspective. Cartograms dynamically linked to choropleth maps have been demonstrated to be quite effective (Dykes, 1997). As Dykes notes, a dynamic link between a cartogram and a choropleth map provides the user with all of the advantages of both a cartogram and a choropleth map. Users can identify well known geographic features on the choropleth map, but search for cluster patterns on the cartogram without being unduly influenced by the size of the areal units.

This paper has reported on the development and integration of Exploratory Data Analysis (EDA) and Geographic Visualization (GVis) techniques within a commercially available GIS package. The prototype described has undergone only informal evaluation thus far. We plan to begin more formal assessment of some of these techniques in the fall of 1997. Those interested in learning more about the project and monitoring our progress on system assessment and subsequent system development are encouraged to consult the project web page at:

http://www.geog.psu.edu/MacEachren/MacEachrenHTML/NCHS.html

Acknowledgement: The research reported here was supported in part through a contract from the U. S. Centers for Disease Control National Center for Health Statistics (DHHS, OASH, DAM #9630348). We are grateful for that support and would like to thank Linda Pickle and her colleagues for all of their assistance.


References

Brewer, C.A. (1994). Color use guidelines for mapping and visualization. In A. M. MacEachren and D. R. F. Taylor (eds.), Visualization in Modern Cartography. Oxford, UK, Pergamon, pp. 123-147.

Croner, C.M., Pickle, L.W., Wolf, D.R., and White, A.A. (1992). A GIS approach to hypothesis generation in epidemiology, in ASPRS/ACSM/RT ’92, Washington D.C., August 3-8, 1992, ASPRS/ACSM, pp. 275-283.

Dorling, D. (1992). Visualizing people in time and space. Environment and Planning B: Planning and Design, 19: 613-637.

Dorling, D. (1996). Area cartograms: their use and creation. Concepts and Techniques in Modern Geography, 59. University of East Anglia, Norwich.

Dykes, J. (1997). Exploring spatial data representation with dynamic graphics. Computers & Geosciences, special issue on Exploratory Cartographic Visualization, 23(4): 345-370.

Howard, D. and MacEachren, A. M. (1996). Interface design for geographic visualization: Tools for representing reliability. Cartography and Geographic Information Systems, 23(2): 59-77.

MacEachren, A.M., Brewer, C.A. and Pickle, L. (in press). Visualizing georeferenced data: Representing reliability of health statistics. Environment and Planning A.

Mason, T.J., McKay, F.W., Hoover, R., Blot, W.J., and Fraumeni, J.F.J. (1975). Atlas of Cancer Mortality for U.S. Counties: 1950-1969. Washington D.C., USGPO.

Monmonier, M. and Gluck, M. (1994). Focus groups for design improvement in dynamic cartography. Cartography and Geographic Information Systems, 21(1): 37-47.

Pickle, L.W., Mason, T.J., Howard, N., Hoover, R., and Fraumeni, J.F.J. (1987). Atlas of U.S. Cancer Mortality among Whites: 1950-1980. Washington D.C., USGPO.

Pickle, L.W., Mason, T.J., Howard, N., Hoover, R., and Fraumeni, J.F.J. (1990). Atlas of U.S. Cancer Mortality among Nonwhites: 1950-1980. Washington D.C., USGPO.

Winn, D.M., Blot, W.J., Shy, C.M., Pickle, L.W., Toledo, A., and Fraumeni, J.F.J. (1981). Snuff dipping and oral cancer among women in the southern United States. New England Journal of Medicine, 304: 745-749.