Screening Multivariate Categorical Data
Heike Hofmann
Augsburg University, Computeroriented Statistics And Data Analysis (Cosada), Universitätsstraße 2,
68135 Augsburg, Germany
Abstract
Graphical methods for categorical data are not well developed, especially in comparison with what is available for quantitative variables. In data mining the variables to be analysed are often primarily categorical, and many data mining procedures produce results defining groupings of the data, i.e. yet more categorical variables. The goal of this paper is to review data mining data sets and results graphically to better detect and understand patterns and structure.
One possibility for visualising multidimensional data is the mosaic plot proposed by Hartigan & Kleiner (1981). Enhanced with interactive features such as querying, re-ordering of variables and variable categories, and grouping, these plots become a very powerful and easy-to-use tool for analysing and understanding multivariate categorical data.
Introduction
The mass of information in large databases with a complex structure of variables often makes a reduction to categorical data necessary (as a first attempt towards understanding). Graphics can be used during different steps of the analytic process, and they are used especially to display final results. This paper, however, aims to point out how graphics can also be applied as a tool for the data analysis itself. The beginnings of such methods already exist in data mining tools such as the commercial MineSet by Silicon Graphics (I will refer to the MineSet software as the data mining standard for visualisation in the following, as there was a well-organized web page provided with nice examples and demos of the software, see www.sgi.com/software/mineset/). More advanced tools for graphical data analysis can be found in Exploratory Data Analysis, EDA (Wilhelm, Unwin & Theus 1996). Manet, a software for interactive graphical exploration, which is currently being developed at the Department for Computeroriented Statistics and Data Analysis at Augsburg University, will be my standard package for questions regarding graphical exploration tools.
Basics of a graphical representation
Why Graphics?
The power of graphics lies in the ability of the human mind to grasp an idea that is explained graphically a lot faster than one conveyed by mere numbers; in terms of the over-used proverb, more than a thousand times faster.
Where does this come from? Exact numbers are not of interest during the process of understanding, at least not at first. This makes a graphical display, where it is virtually impossible to read off exact numbers due to the limited precision of drawing or the resolution of printer or monitor, the best choice for this task, as it helps to concentrate on essentials. In the context of categorical data the use of points and distances between points does not make much sense for visualisation. Instead, the basis of the graphical display is the equivalence between displayed area and a number in the data set.
In order to ensure visual comparability it is important to choose objects for the representation which are geometrically as simple as possible. For interactive graphics, therefore, barcharts (Playfair 1786) or variations of barcharts, such as spine plots (Hummel 1996), are basically the only applicable way of representing categorical data.
High-dimensional ways of representation
Graphical data analysis provides different methods for including higher-dimensional dependencies. Mainly there are two: either an arbitrary number of variables is displayed in one plot (limited only by technical aspects), or several low-dimensional plots are linked and the high-dimensional relationship between them is derived from the use of selection and highlighting.
The multidimensional analogue of barcharts and spine plots (in numerical terms, a contingency table) is the mosaic plot by Hartigan & Kleiner (1981). Here, with each further dimension, the bars are divided alternately horizontally and vertically according to the number of data points with the corresponding attributes. Interactive versions of mosaics are found in Theus (1996) or Hofmann (1999).
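The alternating subdivision can be illustrated with a small sketch. It is not the implementation used in Manet; the function name `mosaic_rectangles`, the unit-square layout and the choice to cut the width first are assumptions made for illustration, and the records are made up.

```python
from collections import Counter

def mosaic_rectangles(records, n_vars):
    """Map each observed category combination to (x, y, width, height)
    inside the unit square, so that area is proportional to its count."""
    rects = {}

    def split(rows, depth, prefix, x, y, w, h):
        if depth == n_vars or not rows:
            rects[prefix] = (x, y, w, h)
            return
        counts = Counter(row[depth] for row in rows)
        total = len(rows)
        offset = 0.0
        for level in sorted(counts):
            share = counts[level] / total
            subset = [row for row in rows if row[depth] == level]
            if depth % 2 == 0:
                # Even depth: divide the current tile along its width.
                split(subset, depth + 1, prefix + (level,),
                      x + offset * w, y, share * w, h)
            else:
                # Odd depth: divide the current tile along its height.
                split(subset, depth + 1, prefix + (level,),
                      x, y + offset * h, w, share * h)
            offset += share

    split(list(records), 0, (), 0.0, 0.0, 1.0, 1.0)
    return rects

# Made-up records with two categorical variables.
records = [("a", "x")] * 3 + [("a", "y")] + [("b", "x")] * 4
rects = mosaic_rectangles(records, n_vars=2)
for combination, (x, y, w, h) in sorted(rects.items()):
    print(combination, "area =", w * h)
```

Each tile's area equals the relative frequency of its category combination, which is the area-to-number equivalence described earlier. Changing the order of the variables in the records changes the hierarchy and therefore the picture.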
Interactive methods
Interactive methods let graphics become real tools of data analysis. An interactive system as defined by Theus (1996) basically consists of interactive querying and warnings, besides selection, highlighting and linking of graphics. Interactive querying means the output of context-sensitive information. This operation is normally triggered by a movement of the mouse or by point & click.
Warnings are given to the user whenever something within the graphic is not visible, due to the screen resolution for example. If, for example, a group of points is so small that its visual equivalent would be of height zero, making it disappear totally, a flag is set at this place as a warning. Selected points are treated in the same way, which provides a far-reaching method of tracing small groups or even individuals in different displays.
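The warning mechanism can be sketched as follows. This is only an illustration under simple assumptions (bar heights scaled linearly to the largest category and truncated to whole pixels); the function name `bar_heights_with_flags` and the counts are hypothetical and not taken from Manet.

```python
def bar_heights_with_flags(counts, plot_height_px):
    """Return {category: (height_in_pixels, flagged)}.

    A category is flagged when it contains data but its bar would be
    drawn with height zero at the given resolution."""
    max_count = max(counts.values(), default=0)
    result = {}
    for category, count in counts.items():
        if max_count == 0:
            height = 0
        else:
            # Truncate to whole pixels, as a display would.
            height = int(plot_height_px * count / max_count)
        flagged = count > 0 and height == 0
        result[category] = (height, flagged)
    return result

# Made-up counts: one category is too small to be visible at 200 pixels.
counts = {"white": 4000, "green": 3, "buff": 0}
bars = bar_heights_with_flags(counts, plot_height_px=200)
print(bars)
```

The empty category is not flagged, since nothing is hidden there; only the small but non-empty group receives a warning.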
Advanced interactive methods
Depending on the kind of graphic, there exist different possibilities for interactive extensions. Barcharts or spine plots, for example, are useful for re-organizing the category structure within a variable, i.e. grouping small categories. Also, sorting can be carried out very easily, either by hand to get a more sensible ordering than a standard lexicographic one (only think of `first', `second', `third' or `low', `middle', `high') or with an automated re-ordering according to a specific criterion, highlighting proportions within bars for example.
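An automated re-ordering by highlighting proportion might look like the following sketch. The function name `order_by_highlight_share` and the counts are assumptions for illustration only.

```python
def order_by_highlight_share(totals, highlighted):
    """Order categories by the proportion of highlighted cases, largest first."""
    def share(category):
        total = totals[category]
        return highlighted.get(category, 0) / total if total else 0.0
    return sorted(totals, key=share, reverse=True)

# Made-up counts per category and highlighted counts within each category.
totals = {"high": 10, "low": 20, "middle": 40}
highlighted = {"high": 5, "low": 2, "middle": 30}

lexicographic = sorted(totals)
by_share = order_by_highlight_share(totals, highlighted)
print(lexicographic)
print(by_share)
```

The lexicographic order places `high' before `low' before `middle', which is neither the natural order nor informative; a manual ordering or the criterion-based one is usually easier to read.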
Due to the strictly hierarchical construction of mosaics, there has to be a powerful method for re-ordering variables as well as a simple way of including further variables or excluding them. One interactive method which is applicable to all kinds of graphics is hotselection. This means that only selected points are considered in the visualisation. Step by step (or rather click by click) one is able to focus more and more on the point of interest. Another term for this is logical zooming; in data mining (DM) it is called drill-down. If the data set also has continuous attributes, methods for categorizing them interactively have proved useful. DM software such as MineSet or Keso (Siebes 1996) contains procedures which produce results defining groupings of the data. Manet, on the other hand, provides a tool for defining such groupings interactively. A combination of those two approaches therefore allows checking the importance of a particular split of a continuous variable by examining neighbouring splits by hand.
This and various other examples make clear that data mining and methods of graphical data analysis fit together very well.
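Hotselection, or drill-down, amounts to restricting every subsequent display to the currently selected cases. A minimal sketch, assuming records stored as dictionaries; the helper `hotselect` and the four records are hypothetical and not part of any of the packages mentioned:

```python
from collections import Counter

def hotselect(records, **conditions):
    """Keep only the records that match every attribute=value condition."""
    return [r for r in records
            if all(r.get(key) == value for key, value in conditions.items())]

# Made-up records using attribute names from the mushroom example.
records = [
    {"odor": "none", "spore": "white", "habitat": "woods"},
    {"odor": "none", "spore": "white", "habitat": "leaves"},
    {"odor": "none", "spore": "brown", "habitat": "woods"},
    {"odor": "foul", "spore": "white", "habitat": "woods"},
]

# Click by click: first restrict to one odor, then to one spore print color.
step1 = hotselect(records, odor="none")
step2 = hotselect(step1, spore="white")

# Any further plot is computed from the selection only.
habitat_counts = Counter(r["habitat"] for r in step2)
print(len(step1), len(step2), dict(habitat_counts))
```

Because each step works on the result of the previous one, the order of the clicks does not matter for the final selection, only for the intermediate views.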
An example
Figure 1: Zoom into a decision table produced automatically by the MineSet software.
[Figure 2 axis labels: odor (almond, anise, creosote, fishy, foul, musty, none, pungent, spicy); spore print color (buff, orange, purple, yellow, brown, black, white, green, chocolate); habitat (woods, leaves); population (several, clustered); selection: odor = none, spore print color = white.]
Figure 2: Hotselection in mosaic plots of the software Manet. On the left-hand side a variation of a mosaic plot, where all bins are of the same size, is displayed. Empty combinations are marked by a small circle. On the right the number of mushrooms within a specific combination is also visible. Poisonous mushrooms are highlighted.
Figures 1 and 2 show an example of two software solutions for the same problem, namely the classification of mushrooms into edible and poisonous ones. The dataset is available from the UCI Repository of Machine Learning Databases and Domain Theories via
ftp://ftp.ics.uci.edu/pub/machine-learning-databases/.
It contains 8124 instances of poisonous or edible mushrooms described with altogether 22 nominal attributes, such as cap shape and color, gill attachment, color of the spores, habitat and population density. For more information see
ftp://ftp.ics.uci.edu/pub/machine-learning-databases/agaricus-lepiota.names.
Each mushroom is uniquely determined by these attributes; a discrimination between edible and poisonous mushrooms is therefore possible. One task for mining systems is to determine which attributes are needed for the discrimination, another is to visualize the result.
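The first of these tasks can be made concrete with a brute-force sketch: a set of attributes discriminates if no combination of its values occurs with both classes. This is only an illustration, not the procedure used by MineSet or Manet; the function names and the four records are made up and are not taken from the mushroom data.

```python
from itertools import combinations

def discriminates(records, attributes, target):
    """True if equal values on `attributes` always imply an equal `target`."""
    seen = {}
    for r in records:
        key = tuple(r[a] for a in attributes)
        if seen.setdefault(key, r[target]) != r[target]:
            return False
    return True

def smallest_discriminating_sets(records, attributes, target):
    """Return all attribute subsets of minimal size that discriminate."""
    for size in range(1, len(attributes) + 1):
        hits = [subset for subset in combinations(attributes, size)
                if discriminates(records, subset, target)]
        if hits:
            return hits
    return []

# Made-up toy records; the values are NOT from the real dataset.
toy = [
    {"odor": "none",   "spore": "white", "class": "edible"},
    {"odor": "none",   "spore": "green", "class": "poisonous"},
    {"odor": "foul",   "spore": "white", "class": "poisonous"},
    {"odor": "almond", "spore": "white", "class": "edible"},
]
result = smallest_discriminating_sets(toy, ["odor", "spore"], "class")
print(result)
```

The search is exponential in the number of attributes, so with 22 attributes it is only practical for small subset sizes; an interactive, graphical search replaces this enumeration with the analyst's judgement.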
Conclusion
Techniques for exploring high-dimensional categorical data via a graphical display have been discussed. It has been shown that interactive methods for graphical data analysis fit very well into the concept of data mining. Even now, numerous common features are apparent, although different technical terms are used for the same concepts. DM may benefit from interactive graphics, and vice versa. One of these approaches could be the automation of certain processes. An automatically generated ordering of variables within a mosaic plot, for example, leads to a kind of discriminant analysis.
References
Hartigan J.A. & Kleiner B. (1981) Mosaics for Contingency Tables, Computer Science and Statistics, Proceedings of the 13th Symposium on the Interface, 268-273.
Hofmann H. (1999) Interactive Mosaic Plots, to be published in Metrika.
Hummel, Jürgen (1996) Linked Bar Charts: Analysing Categorical Data Graphically, Computational Statistics, vol. 11, Issue 1, pp. 23-33.
Playfair, William (1786) The Commercial and Political Atlas, London.
Siebes, A. (1996) Data mining and the Keso project, in SOFSEM'96: Theory and Practice of Informatics, vol. 1175 of Lecture Notes in Computer Science, pp. 161-177, Springer.
Theus, Martin (1996) Theorie und Anwendung interaktiver statistischer Graphik, Wißner Verlag, Augsburg.
Tufte, Edward (1983) The Visual Display of Quantitative Information, Graphics Press, Cheshire, Connecticut.
Unwin A., Hawkins G., Hofmann H. & Siegl B. (1996) Interactive Graphics for Data Sets with Missing Values - MANET, Journal of Computational and Graphical Statistics, 4 (6).
Wilhelm, Adalbert, Unwin, Antony R. & Theus, Martin (1996) Software for Interactive Statistical Graphics - A Review, in Advances in Statistical Software 5, Softstat 95, ed. Frank Faulbaum, Gustav Fischer Verlag, Stuttgart.
Abstract
The development of graphical methods for categorical data is not very advanced, particularly in comparison with everything that is available for quantitative data.
In data mining, the variables are often categorical from the outset. Moreover, many data mining methods produce groupings of the data as their result, that is, yet more categorical variables. The goal of this article is to present graphical methods that allow us to better discover interactions and understand the structure of the data.
One of these methods for visualising multidimensional data is the mosaic plot proposed by Hartigan & Kleiner (1981). Enriched with interactive features such as querying, re-ordering of variables, re-ordering of the categories of a variable, and grouping of data points, these mosaic plots become a very powerful and easy-to-use tool for analysing categorical data.