
Screening Multivariate Categorical Data

Heike Hofmann

Augsburg University, Computeroriented Statistics And Data Analysis (Cosada), Universitätsstraße 2,

68135 Augsburg, Germany


Abstract

Graphical methods for categorical data are not well developed, especially in comparison with what is available for quantitative variables. In data mining it is often the case that the variables to be analysed are primarily categorical, and many data mining procedures produce results defining groupings of the data, i.e. yet more categorical variables. The goal of this paper is to review data mining data sets and results graphically in order to better detect and understand patterns and structure.

One possibility for visualising multidimensional data is the mosaic plot proposed by Hartigan & Kleiner (1981). Enhanced with interactive features such as querying, re-ordering of variables and variable categories, and grouping, these plots become a very powerful and easy-to-use tool for analysing and understanding multivariate categorical data.

Introduction

The mass of information in large databases with a complex structure of variables often makes a reduction to categorical data necessary (as a first step towards understanding). Graphics can be used at different steps of the analytic process, and they are used above all to display final results. This paper, however, wants to point out how graphics can also be applied as a tool for the data analysis itself. The beginnings of such methods already exist in data mining tools such as the commercial MineSet by Silicon Graphics (I will refer to the MineSet software as the data mining standard for visualisation in the following, as there was a well-organized web page with nice examples and demos of the software, see www.sgi.com/software/mineset/). More advanced tools for graphical data analysis can be found in Explorative Data Analysis, EDA (Wilhelm, Unwin & Theus 1996). Manet, a software package for interactive graphical exploration, which is currently being developed at the Department for Computeroriented Statistics and Data Analysis at Augsburg University, will be my standard package for questions regarding graphical exploration tools.

Basics of a graphical representation

Why Graphics?

The power of graphics lies in the ability of the human mind to grasp an idea that is explained graphically a lot faster than mere numbers - in terms of the over-used proverb - more than a thousand times faster.

Where does this come from? The exact numbers are not of interest during the process of understanding, at least not at first. This makes a graphical display, where it is virtually impossible to produce exact numbers due to the limited precision of drawing or the resolution of printer or monitor, the best choice for this task, as it helps to concentrate on essentials. In the context of categorical data the use of points and distances between points does not make much sense for visualisation. Instead, the basis of the graphical display is the equivalence between a displayed area and a number in the data set.

In order to ensure visual comparability it is important to choose objects for the representation which are geometrically as simple as possible. For interactive graphics, therefore, barcharts (Playfair 1786) or variations of barcharts, such as spine plots (Hummel 1996), are basically the only applicable way of representing categorical data.
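The equivalence between area and count can be made concrete with a small sketch of spine plot geometry. This is only an illustration and not Manet's implementation: the function name `spine_plot` and the toy data are assumptions. Each bar gets a width proportional to its category count, and the highlighted part gets a height equal to the highlighted share within that category.

```python
from collections import Counter

def spine_plot(categories, highlighted, total_width=1.0):
    """Return one (category, width, highlight_height) tuple per category.

    Width is proportional to the category count; the highlight height is the
    share of highlighted cases inside the category. The highlighted area
    (width * height) therefore equals the highlighted share of all cases.
    """
    counts = Counter(categories)
    hits = Counter(c for c, h in zip(categories, highlighted) if h)
    n = len(categories)
    return [(c, total_width * counts[c] / n, hits[c] / counts[c])
            for c in sorted(counts)]

# Hypothetical toy data: odor of six mushrooms, highlighted = poisonous.
odor = ["none", "none", "none", "foul", "foul", "almond"]
poisonous = [False, False, True, True, True, False]
bars = spine_plot(odor, poisonous)
for category, width, height in bars:
    print(f"{category:7s} width={width:.3f} highlighted={height:.3f}")
```

Because all bars share the same height, highlighted proportions can be compared directly across categories, while the widths still carry the counts.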

High-dimensional ways of representation

Graphical data analysis provides different methods for including higher-dimensional dependencies. Mainly there are two: either it is possible to display an arbitrary number of variables in one plot (limited only by technical aspects), or several low-dimensional plots are linked and the high-dimensional relationship between them is derived from the use of selection and highlighting.

The multidimensional analogue of barcharts and spine plots, in numbers a contingency table, is the mosaic plot by Hartigan & Kleiner (1981). Here, with each further dimension, the bars are divided alternately horizontally and vertically according to the number of data points with the corresponding attributes. Interactive versions of mosaics are found in Theus (1996) or Hofmann (1999).
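The alternating subdivision can be sketched as a short recursion. The following is a minimal illustration under stated assumptions, not the construction used in Manet or in the cited papers: the function name `mosaic`, the unit-square coordinates and the toy rows are all hypothetical.

```python
from collections import Counter

def mosaic(rows, variables, x=0.0, y=0.0, w=1.0, h=1.0, split_x=True, path=()):
    """Recursively split the rectangle (x, y, w, h), one variable per level.

    rows is a list of dicts, variables the ordered list of keys to split on.
    Returns (path, x, y, w, h) tiles, one per non-empty cell; the area of a
    tile is proportional to the number of rows in that cell.
    """
    if not variables:
        return [(path, x, y, w, h)]
    var, rest = variables[0], variables[1:]
    counts = Counter(r[var] for r in rows)
    tiles, offset = [], 0.0
    for level in sorted(counts):
        share = counts[level] / len(rows)
        subset = [r for r in rows if r[var] == level]
        if split_x:   # divide along the x axis: widths follow the counts
            tiles += mosaic(subset, rest, x + offset * w, y, share * w, h,
                            False, path + (level,))
        else:         # divide along the y axis: heights follow the counts
            tiles += mosaic(subset, rest, x, y + offset * h, w, share * h,
                            True, path + (level,))
        offset += share
    return tiles

# Hypothetical toy data with two categorical variables.
rows = ([{"odor": "none", "spore": "white"}] * 3
        + [{"odor": "none", "spore": "brown"}]
        + [{"odor": "foul", "spore": "white"}] * 4)
tiles = mosaic(rows, ["odor", "spore"])
areas = {path: w * h for path, x, y, w, h in tiles}
```

Changing the order of `variables` changes the hierarchy of the splits, which is why re-ordering variables matters so much for mosaics.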

Interactive methods

Interactive methods let graphics become real tools of data analysis. An interactive system as defined by Theus (1996) basically consists of interactive querying and warnings, besides selection, highlighting and linking of graphics. Interactive querying means the output of context-sensitive information. This operation is normally triggered by a movement of the mouse or by point & click.

Warnings are given to the user whenever something within the graphic is not visible, due to the screen resolution for example. If, for example, a group of points is so small that its visual equivalent would be of height zero, making it disappear totally, a flag is set at this place as a warning. Selected points are treated in the same way, which provides a far-reaching method of tracing small groups or even individuals in different displays.
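The warning rule described above can be sketched in a few lines. This is an assumption-laden illustration: the function name `bar_heights_with_flags`, the rounding to whole pixels and the toy counts are hypothetical and do not describe how Manet draws its flags.

```python
def bar_heights_with_flags(counts, plot_height_px):
    """Scale counts to whole pixels; flag non-empty bars that would vanish."""
    # Guard against an empty or all-zero table to avoid division by zero.
    largest = max(max(counts.values(), default=0), 1)
    result = {}
    for category, count in counts.items():
        pixels = round(plot_height_px * count / largest)
        # A flag is needed when there is data but nothing would be drawn.
        result[category] = (pixels, count > 0 and pixels == 0)
    return result

# Hypothetical counts; category "d" is too small to show at 100 pixels.
counts = {"a": 3500, "b": 2100, "c": 40, "d": 1}
bars = bar_heights_with_flags(counts, plot_height_px=100)
```

The same check can be applied to the highlighted part of each bar, so that a single selected case remains traceable across linked displays.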

Advanced interactive methods

Depending on the kind of graphic, there exist different possibilities for interactive extensions. Barcharts or spine plots, for example, are useful for re-organizing the category structure within a variable, i.e. grouping small categories. Also, sorting can be carried out very easily, either by hand to get a more sensible ordering than a standard lexicographic one (only think of `first', `second', `third' or `low', `middle', `high') or with an automated re-ordering according to a specific criterion, highlighting proportions within bars for example.
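Both kinds of re-ordering can be sketched as follows. The function names `order_manually` and `order_by_highlight` and the counts are hypothetical, and the sketch assumes every category contains at least one case.

```python
def order_manually(levels, wanted):
    """Put levels in a user-given order; unlisted levels go last, sorted."""
    rank = {level: i for i, level in enumerate(wanted)}
    return sorted(levels, key=lambda lv: (rank.get(lv, len(wanted)), lv))

def order_by_highlight(counts, highlighted):
    """Sort levels by decreasing share of highlighted cases within each bar."""
    return sorted(counts,
                  key=lambda lv: highlighted.get(lv, 0) / counts[lv],
                  reverse=True)

levels = sorted(["low", "middle", "high"])            # lexicographic default
by_hand = order_manually(levels, ["low", "middle", "high"])

# Hypothetical counts per category and highlighted cases within them.
counts = {"low": 10, "middle": 20, "high": 5}
highlighted = {"low": 1, "middle": 10, "high": 4}
automatic = order_by_highlight(counts, highlighted)
```

The lexicographic default places `high` before `low`, which is exactly the kind of ordering the manual variant is meant to repair.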

Due to the strictly hierarchical construction of mosaics, there has to be a powerful method for re-ordering variables as well as a simple way of including further variables or excluding them. One interactive method which is applicable to all kinds of graphics is hotselection. This means that only selected points are considered in the visualisation. Step by step (or rather click by click) one is able to focus more and more on the point of interest. Another term for this is logical zooming; in data mining (DM) it is called drill-down. If the data set also has continuous attributes, methods for categorizing them interactively have proved useful. DM software such as MineSet or Keso (Siebes 1996) contains procedures which produce results defining groupings of the data. Manet, on the other hand, provides a tool for defining such groupings interactively. A combination of those two approaches therefore allows checking the importance of a particular split of a continuous variable by examining neighbouring splits by hand.
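Examining neighbouring splits can be imitated numerically. The sketch below is an assumption, not a procedure from MineSet, Keso or Manet: it cross-tabulates a two-group split of a continuous variable against a class label and counts the cases outside the majority class of their group, for a proposed cut point and for its two neighbours. All names and values are hypothetical.

```python
def split_table(values, labels, cut):
    """Cross-tabulate the grouping 'value <= cut' against the class labels."""
    table = {}
    for v, lab in zip(values, labels):
        key = ("<=" if v <= cut else ">", lab)
        table[key] = table.get(key, 0) + 1
    return table

def misclassified(table):
    """Number of cases outside the majority class of their group."""
    errors = 0
    for side in ("<=", ">"):
        group = [n for (s, _), n in table.items() if s == side]
        if group:
            errors += sum(group) - max(group)
    return errors

# Hypothetical continuous attribute with a two-class label.
values = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5]
labels = ["a", "a", "a", "b", "a", "b", "b", "b"]

# Compare an automatically proposed cut (2.75) with its neighbours.
errors = {cut: misclassified(split_table(values, labels, cut))
          for cut in (2.25, 2.75, 3.25)}
```

In this toy case both neighbouring cuts separate the classes better than the proposed one, which is the kind of finding that moving a split by hand in an interactive display would reveal.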

This and various other examples make clear that data mining and methods of graphical data analysis fit together very well.

An example

Figures 1 and 2 show an example of two software solutions for the same problem, namely the classification of mushrooms into edible and poisonous ones. The dataset is available at the UCI Repository of Machine Learning Databases and Domain Theories via

ftp://ftp.ics.uci.edu/pub/machine-learning-databases/.

It contains 8124 instances of poisonous or edible mushrooms described with altogether 22 nominal attributes, such as cap shape and color, gill attachment, color of the spores, habitat and population density. For more information see

ftp://ftp.ics.uci.edu/pub/machine-learning-databases/agaricus-lepiota.names.

Figure 1: Zoom into a decision table produced automatically by the MineSet software.

[Figure 2, axis labels: odor (almond, anise, creosote, fishy, foul, musty, none, pungent, spicy) against spore print color (black, brown, buff, chocolate, green, orange, purple, white, yellow); selection: odor = none, spore print color = white; query labels: habitat, population (values shown: woods, several, leaves, clustered).]

Figure 2: Hotselection in mosaic plots of the software Manet. On the left-hand side a variation of a mosaic plot, where all bins are of the same size, is displayed. Empty combinations are marked by a small circle. On the right, the number of mushrooms within a specific combination is also visible. Highlighted are poisonous mushrooms.

Each mushroom is uniquely determined by these attributes, so a discrimination between edible and poisonous mushrooms is possible. One task for mining systems is to determine which attributes are needed for the discrimination, another is to visualize the result.
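The drill-down of Figure 2 (odor = none, spore print color = white, then a look at habitat) can be sketched on a few rows. The rows below are hypothetical toy data and are not taken from the UCI file; the function names `hot_select` and `class_counts` and the short key `spore` are likewise assumptions.

```python
from collections import Counter

def hot_select(rows, **conditions):
    """Keep only rows matching every attribute=value condition (drill-down)."""
    return [r for r in rows if all(r[k] == v for k, v in conditions.items())]

def class_counts(rows, by, target="class"):
    """Count the target classes within each combination of the `by` attributes."""
    table = Counter()
    for r in rows:
        table[tuple(r[a] for a in by) + (r[target],)] += 1
    return table

# Hypothetical toy rows, not taken from the real data set.
rows = [
    {"odor": "none", "spore": "white", "habitat": "woods", "class": "edible"},
    {"odor": "none", "spore": "white", "habitat": "leaves", "class": "poisonous"},
    {"odor": "none", "spore": "white", "habitat": "woods", "class": "edible"},
    {"odor": "foul", "spore": "white", "habitat": "woods", "class": "poisonous"},
    {"odor": "none", "spore": "brown", "habitat": "woods", "class": "edible"},
]

selected = hot_select(rows, odor="none", spore="white")
table = class_counts(selected, by=["habitat"])
```

When every remaining cell contains a single class, as in this toy selection, the attributes used so far are sufficient for the discrimination within the selected subset.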

Conclusion

Techniques for exploring high-dimensional categorical data via a graphical display have been discussed. It has been shown that interactive methods for graphical data analysis fit very well into the concept of data mining. Even now, numerous common features are apparent, although different technical terms are used for the same concepts. DM may benefit from interactive graphics and vice versa. One way of combining the two could be the automation of certain processes. An automatically determined ordering of variables within a mosaic plot, for example, leads to a kind of discriminant analysis.
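One way such an automatic ordering might look is a greedy search: at each step, add the variable that leaves the fewest cases outside the majority class of their mosaic cell. This is a sketch of one possible criterion chosen for illustration, not a method proposed in the paper; the names `impurity` and `greedy_order` and the toy rows are hypothetical.

```python
from collections import Counter, defaultdict

def impurity(rows, variables, target):
    """Number of cases outside the majority class of their mosaic cell."""
    cells = defaultdict(Counter)
    for r in rows:
        cells[tuple(r[v] for v in variables)][r[target]] += 1
    return sum(sum(c.values()) - max(c.values()) for c in cells.values())

def greedy_order(rows, variables, target):
    """Repeatedly append the variable that leaves the fewest impure cases."""
    chosen, remaining = [], sorted(variables)
    while remaining:
        best = min(remaining,
                   key=lambda v: impurity(rows, chosen + [v], target))
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Hypothetical toy rows: (odor, spore, cap, class).
data = [
    ("none", "white", "flat", "edible"),
    ("none", "white", "convex", "edible"),
    ("none", "green", "flat", "poisonous"),
    ("foul", "white", "flat", "poisonous"),
    ("foul", "green", "convex", "poisonous"),
    ("none", "white", "flat", "edible"),
    ("foul", "white", "convex", "poisonous"),
]
rows = [dict(zip(("odor", "spore", "cap", "class"), d)) for d in data]
order = greedy_order(rows, ["cap", "odor", "spore"], "class")
```

Variables that come first in such an ordering are those that discriminate best between the classes, which is the sense in which the ordering resembles a discriminant analysis.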

References

Hartigan J.A. & Kleiner B. (1981) Mosaic for Contingency Tables, Computer Science and Statistics, Proceedings of the 13th Symposium on the Interface, 268-273.

Hofmann H. (1999) Interactive Mosaic Plots, to be published in Metrika.

Hummel, Jürgen (1996) Linked Bar Charts: Analysing Categorical Data Graphically, Computational Statistics, vol. 11, Issue 1, pp. 23-33.

Playfair, William (1786) The Commercial and Political Atlas, London.

Siebes, A. (1996) Data mining and the Keso project, in SOFSEM'96: Theory and Practice of Informatics, vol. 1175 of Lecture Notes in Computer Science, pp. 161-177, Springer.

Theus, Martin (1996) Theorie und Anwendung interaktiver statistischer Graphik, Wißner Verlag, Augsburg.

Tufte, Edward (1983) The Visual Display of Quantitative Information, Graphics Press, Cheshire, Connecticut.

Unwin A., Hawkins G., Hofmann H. & Siegl B. (1996) Interactive Graphics for Data Sets with Missing Values - MANET, Journal of Computational and Graphical Statistics, 4 (6).

Wilhelm, Adalbert, Unwin, Antony R. & Theus, Martin (1996) Software for Interactive Statistical Graphics - A Review, in Advances in Statistical Software 5, Softstat 95, ed. Frank Faulbaum, Gustav Fischer Verlag, Stuttgart.


Abstract (French version, translated)

The development of graphical methods for categorical data is not very advanced, particularly in comparison with everything that is available for quantitative data.

In data mining the variables are often categorical from the outset. In addition, many data mining methods produce groupings of the data as their result, that is, yet more categorical variables. The aim of this article is to present graphical methods that allow us to better discover interactions and understand the structure of the data.

One of these methods for visualising multidimensional data is the mosaic plot proposed by Hartigan & Kleiner (1981). Enriched with interactive features such as querying, re-ordering of variables, re-ordering of the categories of a variable, and grouping of data points, these mosaic plots become a very powerful and easy-to-use tool for the analysis of categorical data.
