Contemporary statistical data visualization

Junji Nakano

The Institute of Statistical Mathematics, Tachikawa, Tokyo, Japan

WWW home page: http://jasp.ism.ac.jp/~nakanoj/

Abstract. Statistical graphics have been an important statistical technique since the beginning of the history of statistics. Recent advances in computer technology have changed statistical graphics considerably: interactive and dynamic operations have become important, and graphics for large amounts of data are strongly required. This paper describes some of these developments with examples.

1 Introduction

Graphics are useful at every stage of statistical data analysis. Statistical data are usually expressed as a table of numbers and characters that records the measured values for each individual. It is usually difficult to grasp the characteristics of the data just by looking at such a table. Graphics that express the values in the table properly help with this task, because human visual perception captures patterns in figures easily.

Many statistical graphics have been used since the beginning of the history of statistics. Among them, the bar chart, the pie chart and the line graph are taught in elementary education and are used frequently in newspapers and TV programs. The histogram and the scatter diagram are important topics in basic statistics and are also widely used in real life.

Nowadays, graphics are usually drawn by computers. For example, Excel, a widely used spreadsheet program, can produce these statistical graphics easily and attractively, and is used daily for writing reports and presentation materials. More advanced statistical graphics have been developed using recent advances in computer technology. They are used for visually analyzing large amounts of complex data, and the process is sometimes referred to as “visual data mining”.

In this paper, we introduce several techniques of contemporary statistical graphics using free software products.

2 Basic interactive operations

As modern statistical graphics are drawn by computers, it is easy to redraw them many times. For effective data analysis, it is important that such redrawing can be done easily through interactive operations.

For example, consider drawing a histogram. The bin width and the left end point (anchor point) are required to define it. When histograms were drawn by hand, usually only one histogram was drawn after choosing appropriate values for them. For this purpose, several formulae were proposed to obtain an appropriate bin width or number of classes.

If we use a computer and appropriate software, we can change the bin width and the anchor point interactively and draw many histograms repeatedly. Among them, we can choose the histograms that capture the features of the data well. This means that deciding the “best” bin width and anchor point is less important, and interactive operations are recommended for extracting information from data.
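The effect of the two parameters can be tried without any graphics software. The following is a minimal sketch in Python that counts values per bin for a given anchor point and bin width; the function name `histogram` and the sample values are assumptions made for illustration and are not taken from Mondrian or from the 2004 cars data.

```python
import math
from collections import Counter


def histogram(values, anchor, width):
    """Count values per bin [anchor + k*width, anchor + (k+1)*width)."""
    counts = Counter(math.floor((v - anchor) / width) for v in values)
    # Report each non-empty bin by its left end point.
    return {anchor + k * width: counts[k] for k in sorted(counts)}


# Made-up horsepower values, not taken from the 2004 cars data.
horsepower = [73, 103, 108, 115, 150]

# Same anchor point, two different bin widths.
for width in (10, 50):
    print(width, histogram(horsepower, anchor=50, width=width))
```

Calling the function repeatedly with different `anchor` and `width` values imitates, in a crude way, what an interactive histogram does when a slider is moved.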

Such interactive operations require appropriate software. Fortunately, much statistical data visualization software is freely available.

Mondrian (http://rosuda.org/Mondrian/) is one example. It is written in Java and runs on several operating systems.

As an example, we use the 2004 cars data (http://rosuda.org/GOLD/data/2004Cars.txt), which include 14 variables for 428 car models. Fig. 1 shows two histograms of Horsepower, both with anchor point 50, with bin widths 10 and 50. They give considerably different impressions.

Fig. 1. Examples of histograms (Left: bin width 10, Right: bin width 50)

2.1 Several interactive operations

A histogram is a basic graphic for expressing the characteristics of one continuous variable. For visualizing two continuous variables, a scatter diagram is useful.

As an example, a scatter diagram of Horsepower and Highway Miles Per Gallon is drawn in Fig. 2 together with a histogram of Horsepower. In addition, a bar chart of Type, a categorical variable, is drawn. When we click the bar for Sports Car with the mouse, the individuals contained in the bar are selected and highlighted in the other graphics (Fig. 2). In recent statistical visualization software such as Mondrian, selecting individuals, highlighting them, and linking several graphics are provided as basic interactive operations and are used effectively for data analysis.

Fig. 2. Examples of selection, highlighting, and linked graphics

2.2 Scatter diagram matrix

A scatter diagram is used to examine the relation between two real-valued variables. When we have three or more real-valued variables, a scatter diagram matrix is useful. Fig. 3 is an example of a scatter diagram matrix, in which four variables (Suggested Retail Price, Horsepower, Highway Miles Per Gallon and Weight) are used. Six cars whose Highway Miles Per Gallon values are large are selected and highlighted in the scatter diagrams. It is clear that the selected cars are cheap, have small Horsepower and are light in Weight.

2.3 Zooming

It is sometimes useful to zoom in on a part of a graphic. The left graphic of Fig. 4 is a scatter diagram of artificial data (http://stats.math.uni-augsburg.de/Mondrian/Data/Pollen.txt) for two variables (Ridge and Nub). The right graphic of Fig. 4 is obtained by zooming in on the center of the left graphic. It is clearly seen that the points in the center of these data form the characters “EUREKA”.

Fig. 3. Example of a scatter diagram matrix

3 Parallel coordinate plot

A parallel coordinate plot (PCP; some authors call it parallel coordinates) is a statistical graphic that directly displays multivariate data, mainly for real-valued variables. It was proposed by Inselberg (1985) and was used by Wegman (1990) for statistical data analysis.

Fig. 5 is an example of a PCP that displays all the variables of the 2004 cars data in Mondrian. The axes for the variables are arranged in parallel. In Mondrian, each coordinate axis is vertical and the axes are arranged from left to right (note that the axes themselves are not drawn here).

Fig. 5. PCP of 2004 cars data

We ignore the three categorical variables (Vehicle Name, Type, Drive) for the moment and concentrate on the real-valued variables. The top of an axis shows the maximum and the bottom shows the minimum of the variable. The value of each variable is plotted on its axis, and the values belonging to the same individual are connected by line segments. Thus each individual is expressed by one broken line. In Fig. 5, the individual whose price is the highest is selected and highlighted.

3.1 Correlation between variables

A linear relation between adjacent variables appears as a particular pattern of line segments in a PCP. As an example, consider ten individuals with three variables X1, X2, and X3, whose PCP and scatter diagram matrix are shown in the left and the center of Fig. 6. It is clear that the correlation coefficient between X1 and X2 is 1 and that between X2 and X3 is -1. We sometimes use the same scale for all the axes in a PCP (the right figure of Fig. 6). When the correlation is -1, the line segments meet at one point. When the correlation is 1, the line segments become parallel or nearly parallel.
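Why these patterns arise can be checked with a short calculation. As an assumption for illustration only, place two adjacent axes at horizontal positions t = 0 and t = 1, and suppose that the plotted heights u and v of an individual on the two axes satisfy an exact linear relation v = c + bu; the symbols t, u, v, b and c are introduced here and do not appear in the original text.

```latex
% Height of the segment of an individual with plotted values (u, v):
y_u(t) = (1 - t)\,u + t\,v
       = u\,\bigl(1 - t\,(1 - b)\bigr) + t\,c ,
\qquad 0 \le t \le 1 .

% The height is the same for every u where the coefficient of u vanishes:
1 - t^{*}(1 - b) = 0
\quad\Longrightarrow\quad
t^{*} = \frac{1}{1 - b}, \qquad y(t^{*}) = \frac{c}{1 - b} \qquad (b \ne 1).
```

For b < 0 the crossing position t* lies strictly between the two axes, so all segments pass through one point; for b = -1 it lies exactly midway. For b = 1 no crossing exists and every segment has the same slope c, so the segments are parallel. For positive b other than 1 the common point falls outside the strip between the axes, which is why the segments look only nearly parallel.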

Fig. 6. PCP and a scatter diagram matrix for linearly correlated variables

It is clear from Fig. 5 that Suggested Retail Price and Dealer Cost are positively correlated, and that Highway Miles Per Gallon and Weight have a negative correlation coefficient. Variable axes must be adjacent for the linear relationship between them to be seen. Therefore, it is preferable that the positions of axes can be changed interactively. In Mondrian, this is done by an Alt + mouse drag operation (on Microsoft Windows).

The operation of reversing the direction of an axis is sometimes useful. Line segments between two negatively correlated variables become parallel if one of the axes is reversed.

Fig. 7 is obtained from Fig. 5 by applying these two operations repeatedly to make interpretation easier. The two rightmost variables concern fuel consumption and are reversed so that lower positions on the axis indicate better fuel economy. The next four variables from the right concern the size of cars and show that big cars require much fuel. The next three variables concern the engine, and the next two concern price. Six individuals that require little fuel are selected and highlighted.

In addition, α-blending is applied in Fig. 7. This means that a constant transparency is given to each line segment, so that places where many segments overlap appear darker. This procedure is useful when the number of individuals is large and almost the whole area would otherwise be painted over in a single solid color.

3.2 Interactive operations of PCP

To see relations between two variables, a scatter diagram matrix is more comprehensible than a PCP. When the number of variables is large, for example more than 20, each scatter diagram in a scatter diagram matrix becomes too small to see. A PCP, however, remains readable in that case. A PCP can also show each individual more clearly than a scatter diagram matrix. An individual is expressed by one connected broken line in a PCP, but by scattered points in the many scatter diagrams of a scatter diagram matrix. When the number of individuals is not too large, they are easily distinguished in a PCP, but not in a scatter diagram matrix.

If the number of individuals is large, it becomes difficult to distinguish individuals even in a PCP. Interactive operations help to avoid this problem: for example, we can select and highlight a single individual (Fig. 5) or several individuals whose values of a certain variable are close (Fig. 7).

In Mondrian, a categorical variable can also be plotted on an axis. The categories are arranged at equal intervals on the parallel coordinate axis, from bottom to top in alphabetical order of the category values. For instance, the variable Drive takes three values, AWD, front and rear, which are placed in this order from the bottom in Fig. 7. It is clear from Fig. 7 that the six individuals with good fuel economy are all front drive sedans.

4 Mosaic plot

A mosaic plot is a graphic for expressing a multidimensional contingency table, in which individuals described by two or more categorical variables are aggregated. In a mosaic plot, the number of individuals in each cell of the table is expressed as the area of a rectangle placed as in the original contingency table. The mosaic plot was first proposed by Hartigan and Kleiner (1981) and was investigated in detail by M. Friendly (who calls it a mosaic display) in a manual (2000) for the SAS software. Several interactive software products were developed by the group at Augsburg University (A. Unwin, H. Hofmann, and M. Theus). Mondrian is one of them and is used here again. The data analyzed in this section are the famous Titanic data (http://stats.math.uni-augsburg.de/Mondrian/Data/Titanic.txt). Four variables (Class, Age, Sex and Survived) are recorded for each of the 2201 people who were on board the Titanic. All variables are categorical: Class takes First, Second, Third or Crew, Age takes Adult or Child, Sex takes Male or Female, and Survived takes Yes or No.

4.1 Bar chart, spine plot and mosaic plot

We usually draw bar charts for categorical data like Titanic data.

Fig. 8. Bar charts and spine plots

The left graphics of Fig. 8 are bar charts of the Titanic data. They are linked: the bar for Yes of Survived is selected, and the corresponding parts of the other bar charts are highlighted. They show that the survival rate is high for First, Child and Female. In a bar chart, a count is expressed by the height of a bar. If we express the count by the width of a bar instead, the graphics in the right of Fig. 8 are obtained. They are called spine plots and express the survival rate more clearly than bar charts. A spine plot in which the values of another variable are highlighted is almost the same as a mosaic plot for two variables.

In a mosaic plot, the total number of individuals is first expressed as a rectangle. Then this rectangle is split into smaller rectangles. The first variable is used to split the rectangle vertically according to the number of individuals that take each categorical value, just as in a spine plot. The second variable is used to split the resulting rectangles horizontally according to the number of individuals. The third variable is used to split the rectangles vertically again, and the fourth variable is used to split them horizontally. In this way, rectangles are split alternately in the vertical and horizontal directions.
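The alternating splitting rule can be written as a short recursion. The following Python sketch computes the rectangle of every observed cell in the unit square; the function name `mosaic`, the record layout and the toy data are assumptions for illustration, and this is not how Mondrian is implemented. For brevity, categories are processed in sorted order and cells with no individuals are simply omitted.

```python
from collections import Counter


def mosaic(records, variables, x=0.0, y=0.0, w=1.0, h=1.0,
           vertical=True, path=()):
    """Return {category path: (x, y, width, height)} for observed cells."""
    if not variables:
        return {path: (x, y, w, h)}
    var, rest = variables[0], variables[1:]
    counts = Counter(r[var] for r in records)
    total = len(records)
    cells = {}
    offset = 0.0  # fraction of the current rectangle already used
    for level in sorted(counts):
        share = counts[level] / total
        subset = [r for r in records if r[var] == level]
        if vertical:
            # Vertical cut: the width is divided in proportion to counts.
            cells.update(mosaic(subset, rest, x + offset * w, y,
                                share * w, h, False, path + (level,)))
        else:
            # Horizontal cut: the height is divided in proportion to counts.
            cells.update(mosaic(subset, rest, x, y + offset * h,
                                w, share * h, True, path + (level,)))
        offset += share
    return cells


# Toy data, not the Titanic data.
people = ([{"Sex": "F", "Survived": "Yes"}] * 3
          + [{"Sex": "F", "Survived": "No"}] * 1
          + [{"Sex": "M", "Survived": "No"}] * 4)
cells = mosaic(people, ["Sex", "Survived"])
```

Because every cut divides a rectangle in proportion to counts, the area of each final rectangle equals the share of individuals in that cell, which is the defining property described above.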

Fig. 9. Mosaic plot

Fig. 9 is a mosaic plot of the Titanic data. Class is used first to split vertically, Age is used second to split horizontally, Sex is used third to split vertically again, and finally Survived is used to split horizontally.

The order of the variables is displayed at the top of the graphics window. In Mondrian, if we put the mouse cursor on a specific cell while pressing the Ctrl key, a pop-up window is displayed to show information about the cell. Fig. 9 shows that the biggest group is Crew, Adult, Male and No, whose total is 670. There is no one who is both Crew and Child, and some cells for Child in the First and Second classes are also empty. These rectangles have zero area but are expressed by red marks to show that their area is 0.

4.2 Describing data by a mosaic plot

The spine plots in Fig. 8 show that Female, First class and Child had high survival rates. How high is the survival rate of people who satisfy several of these conditions together? This information is shown in the mosaic plot in Fig. 9. Figs. 8 and 9 are linked, and Fig. 8 is used to select the people who survived. It is clear that the survival rate of Child is 100% except in the Third class. Adult Female passengers also survived at a high rate, again except in the Third class.

Interactive operations are important for a mosaic plot. As the order of the variables used for splitting the rectangle strongly affects the appearance of the plot, we often change the order. Mondrian can perform this operation easily. The upper left figure of Fig. 10 shows a mosaic plot in which Class, Sex and Survived are used in this order and Age is ignored. This information is shown at the top of the window. The variables used and their order are shown inside the parentheses, and the ignored variable is shown inside the brackets.

Fig. 10. Variations of mosaic plot

In addition, there are several variations of the mosaic plot. The upper right of Fig. 10 is called a fluctuation diagram, in which the size of each cell is displayed by a rectangle of similar shape arranged on a grid. In the lower left figure, the size is expressed by the height of a bar; this is called multiple bar charts.

A multidimensional contingency table is used to test the mutual independence of categorical variables. This is shown visually in the lower right of Fig. 10. Under the hypothesis of independence, the expected sizes are determined by the marginal distributions and form a regular grid. The red rectangles show where the observed size falls short of the expectation, and the blue rectangles show where it exceeds it. This figure is called an expected value display of a mosaic plot, and is used for inference on the log-linear model of the multidimensional contingency table.

5 Dynamic projection plots

Recently, visualization of phenomena has been performed in various fields. Among these applications, three dimensional visualization of physical phenomena is the most successful. For example, flows of liquid or activities in a human brain can be seen intuitively by using a dynamic three dimensional display, because they are essentially dynamic three dimensional phenomena.

In statistics, on the other hand, data are mainly given as a table in which individuals are arranged as rows and variables are arranged as columns. The dimension of the data is considered to be the number of variables, which is usually more than three and nowadays sometimes more than 100. Thus three dimensional visualization is less important for such data.

However, human intuition works well for three dimensional visualization because we live in a three dimensional world. Therefore, there have been attempts to use three dimensional displays for visualizing statistical data. A three dimensional scatter diagram is one of them and has some popularity. At present, we have to use a two dimensional screen to show a pseudo three dimensional display by means of dynamic movement. We can think of a scatter diagram in three dimensions as being projected onto a suitable two dimensional plane, with the projection plane moved slowly. This idea is sometimes effective for data with more than three dimensions as well.

The data analysis system GGobi (http://www.GGobi.org/) is available for such dynamic projection. GGobi is free software with a long history going back to XGobi, which has been developed since around 1990. In this section, we explain the use of dynamic statistical graphics based on projection using GGobi.

The data analyzed here are the Italian olive oils data included in the GGobi package. They are available in the data directory as the file olive.csv and contain 572 olive oils collected from 3 regions of Italy (the variable Region takes the values 1 (South), 2 (Sardinia) and 3 (North)) and 9 areas expressed by Area (4 in South, 2 in Sardinia, and 3 in North). They also contain data for 8 fatty acids (palmitic, palmitoleic, stearic, oleic, linoleic, linolenic, arachidic, eicosenoic). The purpose of the analysis is to capture the features of the olive oils in the nine areas, and to guess the area of an oil from its fatty acid values.

5.1 One dimensional projection

When the data are multidimensional, the easiest way to look at them is to investigate each variable separately. When a variable takes real values, the basic graphic is a histogram. In GGobi, however, the Average Shifted Histogram (ASH) is used instead of a histogram. In this graphic, the horizontal axis expresses the value of the variable, and the vertical axis expresses the average of the heights of several histograms that share a fixed class width but are drawn with the anchor point shifted several times. In a histogram, we know only that an individual lies somewhere in a class. An ASH is a reasonably smoothed histogram, and the location of each individual is indicated exactly in it. The resolution of the smoothing is determined by the class width.
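The definition of the ASH given above can be followed literally in a few lines. The following Python sketch evaluates the average, over several shifted histograms with the same class width, of the count in the class that contains a point x. The function name `ash` and its parameters are assumptions for illustration; GGobi's own computation may be organized differently.

```python
import math


def ash(values, x, anchor, width, shifts):
    """Average over `shifts` shifted histograms of the count in x's class."""
    total = 0
    for k in range(shifts):
        # Each histogram has the same class width but a shifted anchor point.
        a = anchor + k * width / shifts
        class_of_x = math.floor((x - a) / width)
        total += sum(1 for v in values
                     if math.floor((v - a) / width) == class_of_x)
    return total / shifts


# Made-up values, not taken from the olive oils data.
values = [1.0, 1.5, 3.0]
print(ash(values, 2.5, anchor=0.0, width=2.0, shifts=1))  # ordinary histogram
print(ash(values, 2.5, anchor=0.0, width=2.0, shifts=2))  # two shifted ones
```

With `shifts=1` the function reduces to an ordinary histogram count; increasing `shifts` gives a smoother curve while the class width still controls the resolution.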

Fig. 11. ASH of eicosenoic

Fig. 11 is the ASH of eicosenoic. South is highlighted by selecting it in the bar chart on the right. As the values from South are high compared with those of the other two regions, South is easily separated. In an ASH, a vertical line segment is drawn for each individual to show its location clearly.

Next, we consider the separation of the other two regions, so we exclude the data from South. The ASH of linoleic is drawn and the individuals from North are selected in Fig. 12. The two regions seem to be separated by this variable. However, it is not desirable that there is no gap between the two regions and that many individuals lie around the boundary. Therefore, we consider several variables at the same time.

Fig. 12. ASH of linoleic (data from South is excluded)

One way to look at many variables at the same time is to consider a linear combination of the variables, as in principal component analysis. Let the data matrix be $X = [x_{ij}]$ $(i = 1, 2, \dots, n;\ j = 1, 2, \dots, p)$ and the coefficient vector be $a = [a_1, a_2, \dots, a_p]^{\top}$, and consider $y = Xa$ and the distribution of $y$. This means that $n$ data points in the $p$ dimensional space are projected onto the straight line specified by $a$, where $a_1^2 + a_2^2 + \cdots + a_p^2 = 1$. If we change the elements of $a$ under this condition, we change the straight line onto which the data are projected. In GGobi, the ASH of $y$ is displayed as an animation while $a$ is changed continuously.

This graphic is called a 1D Tour. In particular, when the straight line is moved randomly and continuously, it is called a Grand Tour. We watch the animation and manipulate the value of $a$ manually to find an appropriate projection that shows a clear separation between regions. Finally, for example, Fig. 13 is obtained. In this case, a large coefficient (0.962) is given to linoleic, oleic (0.158) and linolenic (0.181) are used a little, and the other variables are hardly used (the absolute values of their coefficients are less than 0.1).
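One frame of a 1D Tour is simply a projection onto a unit vector. The following Python sketch normalizes a coefficient vector so that its squared elements sum to 1 and then computes y = Xa; the function name `project_1d` and the small data matrix are assumptions for illustration and are not part of GGobi.

```python
import math


def project_1d(X, a):
    """Project each row of X onto the direction a, rescaled to unit length."""
    norm = math.sqrt(sum(c * c for c in a))
    unit = [c / norm for c in a]  # now a_1^2 + ... + a_p^2 = 1
    return [sum(x * c for x, c in zip(row, unit)) for row in X]


# Made-up data matrix with n = 2 individuals and p = 2 variables.
X = [[3.0, 4.0], [1.0, 0.0]]
y = project_1d(X, [3.0, 4.0])
print(y)
```

A tour repeats this calculation for a sequence of slowly changing directions and redraws the ASH of y each time.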

Fig. 13. 1D Tour

5.2 Two dimensional projection

As with the second principal component in principal component analysis, we can use two linear combinations to project onto a two dimensional plane. Consider $A = [a_{ij}]$ $(i = 1, 2, \dots, p;\ j = 1, 2)$ instead of $a$ in a 1D Tour, and $y = XA$, where the two columns of $A$ are orthogonal and have length 1. As $y$ expresses $n$ individuals in a two dimensional plane, we can draw a scatter diagram. When $A$ is moved randomly and continuously, the scatter diagram also moves as an animation. This is called a Grand Tour of a 2D Tour. A snapshot is shown in Fig. 14. North and Sardinia are distinguished by different colors, and areas are distinguished by different glyphs (the marks used for data points). It is clear that the areas are separated considerably well in this figure.
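The conditions on A (two orthogonal columns of length 1) can be met by orthonormalizing two random vectors. The following Python sketch draws one such random projection plane with the Gram-Schmidt procedure and projects the data onto it. The function names are assumptions for illustration; a real Grand Tour also interpolates smoothly between successive planes, which is not shown here.

```python
import math
import random


def orthonormal_frame(p, rng):
    """Random p x 2 matrix whose columns are orthogonal unit vectors."""
    u = [rng.gauss(0, 1) for _ in range(p)]
    v = [rng.gauss(0, 1) for _ in range(p)]
    nu = math.sqrt(sum(c * c for c in u))
    u = [c / nu for c in u]
    # Gram-Schmidt step: remove from v its component along u.
    dot = sum(a * b for a, b in zip(u, v))
    v = [b - dot * a for a, b in zip(u, v)]
    nv = math.sqrt(sum(c * c for c in v))
    v = [c / nv for c in v]
    return [[u[i], v[i]] for i in range(p)]


def project_2d(X, A):
    """Compute the n x 2 matrix XA as a list of [horizontal, vertical] pairs."""
    return [[sum(x * A[i][j] for i, x in enumerate(row)) for j in range(2)]
            for row in X]


rng = random.Random(0)          # fixed seed so the frame is reproducible
A = orthonormal_frame(8, rng)   # p = 8, as for the eight fatty acids
points = project_2d([[1.0] * 8, [0.0] * 8], A)  # made-up data rows
```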

Fig. 14. 2D Tour

By watching a 2D Grand Tour, we may reach an appropriate projection if we are lucky. It is preferable to search for a projection automatically according to a specific criterion. Projection pursuit is a method for this and is implemented in GGobi. It calculates a certain index for a projection and moves the projection plane in the direction in which the index becomes larger. The index measures some kind of non-normality, because normally distributed data carry information about only the mean and the variance. Five indices (Holes, Central Mass, LDA, Gini-C, Entropy-C) can be used in GGobi, although many other indices have been studied for projection pursuit. For example, Holes indicates that there are few data points in the center and several clusters around the rim, and Central Mass indicates that many individuals lie around the center. Projection pursuit is also available in a 1D Tour. A tour that uses an index is called a Guided Tour.

It is clear that a 2D Tour is a kind of extension of the three dimensional scatter diagram.

GGobi provides another two dimensional dynamic graphic, in which two independent 1D Tours are used as the horizontal and vertical axes of a scatter diagram; it is called a 2x1D Tour. When the sets of variables used for the two axes are disjoint, this graphic can be regarded as a visualization of canonical correlation analysis. Fig. 15 is a snapshot of it. A linear combination of oleic, linoleic and linolenic is used for the horizontal axis, and a linear combination of palmitic, palmitoleic, stearic and arachidic is used for the vertical axis.

Fig. 15. 2x1D Tour

6 Generalized association plot

Colors can show important information in statistical data visualization. For example, we use a color to distinguish the selected group of individuals from other individuals in the software products explained so far.

Colors can express even more information about data. Generalized Association Plots (GAP) is a software product that uses colors intensively. The software is freely available from http://gap.stat.sinica.edu.tw/Software/GAP/index.htm.

In GAP, the data are arranged just like a usual data table, but each value is expressed by a color rather than by numbers and characters. The basic idea of GAP is to obtain a useful table by permuting the order of variables and individuals so that similar variables and similar individuals are placed in the same “neighborhood”.

We use the example data set SAMOANS9550.txt included in the GAP software. This data set records 50 symptoms as variables for 69 schizophrenic and 26 bipolar disorder patients. All variables are recorded as 0, 1, ..., 5. The first 30 variables assess positive symptoms and the next 20 variables assess negative symptoms.

6.1 Data presentation

The first figure produced by GAP is Fig. 16. The original data are expressed by the lower center rectangle, which is divided into 95 × 50 small rectangles whose colors express the data values. White expresses 0, black expresses 5, and four shades of gray express 1, 2, 3 and 4 by their darkness. This monotone color spectrum is appropriate for these data.

Fig. 16. Original data representation

The upper square consists of 50 × 50 small squares whose colors show the values of the correlation coefficients between pairs of variables. As a correlation coefficient takes a value between -1 and 1, -1 is expressed by the deepest blue, 0 by white, and 1 by the deepest red. In other words, a positive correlation coefficient is expressed by red of some depth and a negative correlation coefficient by blue. This color spectrum is appropriate for expressing correlation coefficients.

The right square consists of 95 × 95 small squares expressing the correlation coefficients between individuals. The color spectrum is the same as that of the upper square.

The variable UNIQID shows the name of the disease: schizophrenic is shown in black and bipolar disorder is shown in white in the lower left part of Fig. 16.

We can use several color spectra according to the data. When the variables are categorical, a color spectrum using different pastel colors of the same depth may be useful.

The upper square describes the proximity between variables and the right square describes the proximity between individuals. We use Pearson’s correlation coefficient as the proximity measure in this example. Other proximity measures are also available, for example, Euclidean distance and Kendall’s tau for real-valued variables, and Goodman–Kruskal’s tau for categorical variables.

6.2 Seriation of data

Representing the data (and the proximities) by colors is not enough to grasp their characteristics. It is useful to permute the order of variables and individuals so that similar ones are placed near each other. Classical cluster analysis, which builds a dendrogram, can be used for this purpose and is available in GAP. The result of the cluster analysis is shown in Fig. 17. It is easier to interpret, because the bipolar disorder patients are arranged in the upper part and the schizophrenic patients in the lower part of the data table. In addition, positive symptoms are on the left and negative symptoms are on the right. This last observation is not shown directly in Fig. 17, but can be seen by a mouse operation.

Fig. 17. Result of cluster analysis

Cluster analysis techniques are useful, but are sometimes unstable, especially for visualization. GAP provides a more stable method, the elliptical seriation method. The result of the seriation is shown in Fig. 18, and may be interpreted more easily.

6.3 Partitions of data matrix and sufficient graph

In GAP, seriated data is divided into several groups for interpretation. One technique for making this operation easy is proposed and is available in the software.

Fig. 18. Result of elliptical seriation

After we have obtained groups of variables and of individuals, the values are replaced by a representative value of each group, such as the mean or the median. The resulting graphic is called the sufficient graph in GAP (Fig. 19). The name means that this graph conveys “sufficient” information about the original data table.
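The replacement step can be sketched as follows. This Python example assumes that the groups of individuals (rows) and of variables (columns) have already been decided, and replaces every value by the mean of its block; the function name `sufficient_table` and the small table are assumptions for illustration and do not reproduce GAP's implementation.

```python
from collections import defaultdict
from statistics import mean


def sufficient_table(table, row_groups, col_groups, summary=mean):
    """Replace each value by the summary of its (row group, column group)."""
    blocks = defaultdict(list)
    for i, row in enumerate(table):
        for j, value in enumerate(row):
            blocks[(row_groups[i], col_groups[j])].append(value)
    level = {key: summary(values) for key, values in blocks.items()}
    return [
        [level[(row_groups[i], col_groups[j])] for j in range(len(row))]
        for i, row in enumerate(table)
    ]


# Made-up scores on the 0..5 scale; rows are individuals, columns variables.
table = [[0, 2, 5],
         [2, 4, 5],
         [5, 5, 0]]
result = sufficient_table(table, row_groups=[0, 0, 1], col_groups=[0, 0, 1])
print(result)
```

Passing `summary=statistics.median` instead of the default gives the median-based version mentioned in the text.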

7 Aggregation and data visualization

It often happens that we are more interested in groups of data than in each individual. In database systems, this is realized by the online analytical processing (OLAP) technique, which is the same as the pivot table operations in Microsoft Excel.

Symbolic data analysis (SDA) also handles a set of groups of individuals in the original data (Billard and Diday, 2006). In SDA, values of a variable can be more complex than the traditional data such as real numbers and categorical values. Typical symbolic data can take intervals, histograms and bar charts as variable values.

This section deals with aggregation in data visualization by combining interactive operations, OLAP and SDA. To demonstrate our idea, we use our own software, which is still under development.

7.1 Symbolic data as aggregation results

Fig. 19. Sufficient graph

Symbolic data often arise from aggregation operations. We consider “aggregation” to be an operation that summarizes a group of individuals by several statistics. In SDA, we may use the interval given by the range of the values, or the histogram of the variable. It is clear that an interval carries more information than a mean value, and a histogram carries more information than an interval. The number of values required to record them is considerably smaller than that required for the original individuals.

7.2 Aggregation in database by OLAP

A database is a computer technology for storing a large amount of data as compactly as possible in a unified form, and for retrieving the required data as quickly as possible. The best-known database model is the relational database, which expresses data in the form of one or more tables. Online analytical processing (OLAP) is a technology for capturing the structure of the data as a whole by aggregating the data under various groupings. In OLAP, groups are formed according to the values of categorical variables, and simple aggregation functions such as summation and average are applied to each group of data.
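The grouping and aggregation step described here can be imitated in a few lines. The following Python sketch groups records by the values of chosen categorical variables and applies an aggregation function to one measure; the function name `aggregate` and the records are assumptions for illustration (the values are not taken from the 2004 cars data), and a real OLAP system would run such a query inside the database.

```python
from collections import defaultdict
from statistics import mean


def aggregate(records, group_by, measure, func=mean):
    """Group records by categorical variables and summarize one measure."""
    groups = defaultdict(list)
    for record in records:
        key = tuple(record[name] for name in group_by)
        groups[key].append(record[measure])
    return {key: func(values) for key, values in groups.items()}


# Made-up records; the values are not taken from the 2004 cars data.
cars = [
    {"Type": "Sedan", "Drive": "front", "Horsepower": 100},
    {"Type": "Sedan", "Drive": "front", "Horsepower": 140},
    {"Type": "Sedan", "Drive": "rear", "Horsepower": 200},
    {"Type": "Sports", "Drive": "rear", "Horsepower": 300},
]

means = aggregate(cars, ["Type", "Drive"], "Horsepower")
totals = aggregate(cars, ["Type"], "Horsepower", func=sum)
# An interval summary in the spirit of symbolic data analysis.
ranges = aggregate(cars, ["Type"], "Horsepower",
                   func=lambda values: (min(values), max(values)))
```

Replacing `func` by a function that returns an interval or a histogram turns the same grouping step into a producer of symbolic data, which is the connection between OLAP and SDA drawn in this section.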

7.3 Graphics for aggregation

Based on the considerations above, we conclude that aggregation operations are important in the visualization of large data sets.

We propose to show aggregation results in the same graphical framework that is used for expressing the values of each individual, such as a parallel coordinate plot or a scatter diagram, with some modifications, because the original data and the grouped data are then easily compared.

Fig. 20. Extended PCP displaying all individuals

We have modified the parallel coordinate plot to realize flexible aggregation operations. Our user interface enables us to specify groups in an OLAP-like manner. The results of aggregation are expressed by visual objects that illustrate SDA-like complex values.

Fig. 20 is an example to show all the individual data of 2004 cars data as a usual parallel coordinate plot.

For dividing raw data into groups, our system puts variable axes in three divided areas: an ignored variables area, an aggregation specification area and a data description area, from the left to right of Fig. 20. Variables can be moved to or from each area freely by drag-and-drop operation using a mouse.

Variable axes placed in the ignored variables area are not used at all when drawing the broken lines for individuals.

The aggregation specification area is the place where variables are used to specify groups. Categorical variables are mainly put here, as stacked boxes. The units of aggregation can be specified by the Cartesian product of dimensions. In Fig. 20, the 3 values of the variable Drive are stacked on each of the 6 values of the variable Type, so the total number of groups is 18 (3 of them contain no individuals).

In the data description area, individuals are originally displayed just as in a usual parallel coordinate plot (Fig. 20). When we perform aggregation based on the groups specified in the aggregation specification area, only the aggregated data are displayed and the raw data disappear. The default aggregation results are the means of the groups (Fig. 21). Other available one dimensional results are summations and medians. In addition, more detailed aggregation results are available as a box plot or a histogram on each axis (Fig. 22). Clearly, visibility becomes worse when we draw the resulting graphics for all groups at the same time. Thus, we can select the groups whose resulting graphics are shown on the axes. We note that our system has multiple selectors and can specify several groups at the same time.
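How the Cartesian product defines the groups, including empty ones, can be illustrated as follows. This Python sketch enumerates every combination of category values and counts the individuals in each; the function name `group_counts`, the category values and the records are hypothetical and are not taken from our software or from the 2004 cars data.

```python
from collections import Counter
from itertools import product


def group_counts(records, variables, levels):
    """Count individuals in every cell of the Cartesian product of levels."""
    observed = Counter(tuple(r[v] for v in variables) for r in records)
    cells = product(*(levels[v] for v in variables))
    # Cells that never occur in the data are kept with a count of 0.
    return {cell: observed.get(cell, 0) for cell in cells}


# Hypothetical category values and records for illustration.
levels = {"Type": ["Sedan", "Sports"], "Drive": ["AWD", "front", "rear"]}
cars = [
    {"Type": "Sedan", "Drive": "front"},
    {"Type": "Sedan", "Drive": "front"},
    {"Type": "Sedan", "Drive": "AWD"},
    {"Type": "Sports", "Drive": "rear"},
]

counts = group_counts(cars, ["Type", "Drive"], levels)
empty = [cell for cell, n in counts.items() if n == 0]
```

With 6 values of Type and 3 values of Drive, the same enumeration yields the 18 groups mentioned above.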

Fig. 21. Displaying means of groups.

Fig. 22. Histograms of a group.

Groups can be expressed in a similar way on a scatter diagram. The left of Fig. 23 is a usual scatter diagram of the variables Length and Horsepower. We aggregate them into the same groups as in Fig. 22 and show the group means in the right of Fig. 23. More detailed distribution information for two groups is expressed by histograms. The location (mean) of each group is shown as it is, and the scales of the histograms are reduced (to 1/10 here) for ease of perception. The distributions of the two variables in the two groups can be compared visually.

We have used Jasplot software (Nakano, Yamamoto and Honda, 2008) to realize our experimental software.

7.4 Remarks on aggregation

Fig. 23. Scatter diagram of data and extended scatter diagram of groups

If the number of individuals is huge, even a high-speed computer requires considerable time to perform interactive operations on graphics. Aggregation can be one remedy for this problem. Aggregation means summarizing the information in a group of data by fewer values than all the variable values of its individuals.

Our experimental implementation of an extended parallel coordinate plot and a scatter diagram seems to be useful as an extended OLAP. Obtained groups can be analyzed by using symbolic data analysis techniques.

8 Concluding remarks

Visualization is strongly required in all fields of human activity, because it is a useful interface for exchanging knowledge among people working in different fields. Images can be understood easily and intuitively by human beings, whereas languages (and equations) are difficult to understand and often differ greatly between fields. Statistical data visualization is important for the same reasons.

As statistical data are highly multidimensional, we have to develop appropriate techniques to handle them on a two dimensional screen (or on a three dimensional screen in the near future). As statistical techniques are being used more and more often in various fields, new and better data visualization techniques should be developed.

References

1. Billard, L. and Diday, E.: Symbolic Data Analysis: Conceptual Statistics and Data Mining. John Wiley, New York. (2006)

2. Cook, D. and Swayne, D. F.: Interactive and Dynamic Graphics for Data Analysis: With R and GGobi. Springer, Berlin. (2007)

4. Nakano, J., Yamamoto, Y. and Honda, K.: Programming statistical data visualization in the Java language. In: Chen, C-H., Härdle, W. and Unwin, A. (Eds.): Handbook of Data Visualization. Springer, Berlin, 725–756. (2008)

5. Theus, M. and Urbanek, S.: Interactive Graphics for Data Analysis: Principles and Examples. Chapman & Hall/CRC Computer Science & Data Analysis. (2008)

6. Unwin, A., Theus, M. and Hofmann, H.: Graphics of Large Datasets: Visualizing a Million. Springer, Berlin. (2006)

7. Wegman, E.J.: Hyperdimensional data analysis using parallel coordinates. Journal of the American Statistical Association 85, 664–675. (1990)

8. Wu, H-M., Tzeng, S. and Chen, C-H.: Matrix Visualization. In: Chen, C-H., Härdle, W. and Unwin, A. (Eds.): Handbook of Data Visualization. Springer, Berlin, 681–708. (2008)