Generalized association plots (GAP): Dimension free information
visualization environment for multivariate data structure
Chun-houh Chen, Shun-Chuan Chang, Yueh-Yun Chi, and Chih-Wen Ou-Young
Academia Sinica, Taiwan
Abstract: GAP is a dimension free visualization environment for multivariate data structure. Given a multivariate data set, GAP first computes proximity matrices for the variables and for the subjects. It then searches for seriations (permutations) that rearrange these two matrices to satisfy certain properties. The doubly sorted raw data matrix and the two sorted proximity matrices are then projected through appropriate color spectra to create matrix maps. These three maps should be cross-examined to identify important information structure embedded in the raw data matrix. GAP has a unique seriation algorithm for identifying matrix permutations with a better global sense, and a clustering algorithm for finding multiple clustering patterns that coexist in proximity matrices. Extensions have been added to GAP for analyzing more complicated formats such as categorical, longitudinal, and multiple-set structures.
1 CONVERGING SEQUENCE OF CORRELATION MATRICES
The computational kernel of GAP is a converging sequence of iteratively generated correlation matrices (Figure 1). This sequence of correlation matrices is used to dynamically identify the seriation and the clustering for a given proximity matrix.
[Figure 1 panels: D, R(0), R(1), R(2), R(3), R(4), R(5), R(6), R(7), R(8), R(9), R(∞). Panel labels: ρ(D)=50, ρ(0)=49, ρ(1)=49, ρ(2)=40, ρ(3)=20, ρ(4)=7, ρ(5)=3, ρ(6)=2, ρ(7)=2, ρ(8)=2, ρ(9)=2, ρ(∞)=1. Axis labels are the 50 symptom codes AH1-AH6, DL1-DL12, BE1-BE4, TH1-TH8, NA1-NA7, NB1-NB4, NC1-NC3, ND1-ND4, NE1-NE2.]
Figure 1. Converging Sequence of Iteratively Generated Correlation Matrices.
(a). Sorted Color Maps for the Correlation Matrices; (b). First Two Eigen-Vectors for the Correlation Matrices.
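The sequence in Figure 1 can be sketched in a few lines. This is a minimal illustration, not the GAP implementation: it assumes that R(0) is the correlation matrix of the columns of the proximity matrix D and that each later R(n+1) is the correlation matrix of the columns of R(n), which the text does not state explicitly. The function names and the 4 x 4 matrix D are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def column_correlations(matrix):
    """Correlation matrix of the columns of a matrix given as a list of rows."""
    cols = list(zip(*matrix))
    return [[pearson(a, b) for b in cols] for a in cols]

def correlation_sequence(proximity, steps):
    """Return [R(0), R(1), ..., R(steps)].

    Assumption (not stated in the text): R(0) = corr(D) and
    R(n+1) = corr(R(n)), both taken over columns.
    """
    seq = [column_correlations(proximity)]
    for _ in range(steps):
        seq.append(column_correlations(seq[-1]))
    return seq

# Hypothetical proximity matrix with two loose groups, {0, 1} and {2, 3}.
D = [
    [1.0, 0.8, 0.1, 0.2],
    [0.8, 1.0, 0.3, 0.1],
    [0.1, 0.3, 1.0, 0.7],
    [0.2, 0.1, 0.7, 1.0],
]
sequence = correlation_sequence(D, steps=5)
for row in sequence[-1]:
    print([round(v, 3) for v in row])
```

In this toy case the entries move toward +1 within each group and toward -1 between the groups, which is the kind of sharpening that the panels of Figure 1 suggest.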
2 BASIC PRINCIPLE OF GAP
There are three major pieces of information contained in any multivariate data set with n subjects and p variables: 1. the linkage among the n subject points in the p-dimensional space; 2. the linkage among the p variable vectors in the n-dimensional space; and 3. the interaction linkage between the sets of subjects and variables. We use a psychosis disorder data set with 95 patients and 50 symptoms to illustrate the framework of a standard GAP analysis (Figure 2). GAP integrates the following four major steps to extract and summarize the information embedded in a multivariate data set.
2.1 Raw Data and Proximity Matrix Maps with Suitable Color Projection. The raw data matrix is denoted as Da in Figure 2a. A gray spectrum is applied to project numerical values into gray dots of different intensities. The correlation matrix is calculated as the proximity matrix Va for the 50 symptoms. The correlation matrix is also used as the proximity matrix Sa for the 95 patients. A diverging blue-red color scheme is used to represent the bi-directional property of the correlation coefficients.
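As a rough sketch of this step, the code below computes the two correlation-based proximity matrices and maps values to colors. It is an illustration only. The function names and the toy data are hypothetical, and the color conventions are assumptions: larger raw values are drawn darker, and correlations run from blue at -1 through white at 0 to red at +1.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def proximity_matrices(data):
    """Return (V, S) for a subjects-by-variables matrix given as a list of rows.

    V holds correlations between variables (columns),
    S holds correlations between subjects (rows).
    """
    rows = [list(r) for r in data]
    cols = [list(c) for c in zip(*data)]
    V = [[pearson(a, b) for b in cols] for a in cols]
    S = [[pearson(a, b) for b in rows] for a in rows]
    return V, S

def gray_level(value, lo, hi):
    """Map a raw value in [lo, hi] (hi > lo) to an RGB gray; larger is darker (assumed)."""
    t = (value - lo) / (hi - lo)
    g = round(255 * (1 - t))
    return (g, g, g)

def diverging_color(r):
    """Map a correlation in [-1, 1] to RGB: blue at -1, white at 0, red at +1 (assumed)."""
    r = max(-1.0, min(1.0, r))
    fade = round(255 * (1 - abs(r)))
    return (255, fade, fade) if r >= 0 else (fade, fade, 255)

# Hypothetical data: 4 subjects (rows) measured on 3 variables (columns).
data = [
    [1, 2, 3],
    [2, 4, 5],
    [5, 3, 1],
    [4, 1, 0],
]
V, S = proximity_matrices(data)
variable_map = [[diverging_color(r) for r in row] for row in V]
subject_map = [[diverging_color(r) for r in row] for row in S]
data_map = [[gray_level(v, 0, 5) for v in row] for row in data]
```

The three nested lists play the roles of the maps Va, Sa and Da; any plotting library can render them as images.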
[Figure 2 panels (a)-(d): raw data maps Da, Db, Dc, Dd; variable proximity maps Va, Vb, Vc, Vd; subject proximity maps Sa, Sb, Sc, Sd.]
Figure 2. (a). Original Raw Data and Proximity Matrices Maps; (b). Sorted Data Matrix Map and Proximity Matrix Maps; (c). Clustering for Variables and Subjects; (d). Sufficient Graph with Three Multivariate Linkages.
2.2 The Sorted Matrix Maps with the Principle of Geometry. The next step is to find
proper seriations for Va and Sa respectively. Seriation is a data analytic tool for finding a
permutation, or ordering, of a set of objects using a data matrix. The seriations are then applied to
rearrange the two correlation matrices Va and Sa into Vb and Sb in Figure 2b. The same
seriations are also used to reorder the raw data matrix Da into Db. Va and Vb differ little
because Va is already grouped by the symptom tables. However, there is a
dramatic change from Sa to Sb because the patients were admitted in a random order. A
clear latent structure appears in Db: a band of dark gray dots runs from the upper right corner to the
lower left corner.
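Once the two seriations are known, applying them is a matter of indexing. The sketch below shows only that step. The search for the seriations is not shown, and the orders, matrices and function names are hypothetical.

```python
def permute_square(matrix, order):
    """Apply one seriation to both the rows and the columns of a proximity matrix."""
    return [[matrix[i][j] for j in order] for i in order]

def permute_data(data, row_order, col_order):
    """Apply the subject seriation to rows and the variable seriation to columns."""
    return [[data[i][j] for j in col_order] for i in row_order]

# Hypothetical raw data: 2 subjects (rows) by 3 variables (columns).
Da = [
    [1, 2, 3],
    [4, 5, 6],
]
# Hypothetical proximity matrix for the 3 variables.
Va = [
    [1.0, 0.2, 0.9],
    [0.2, 1.0, 0.3],
    [0.9, 0.3, 1.0],
]
subject_order = [1, 0]       # assumed to come from a seriation of Sa
variable_order = [2, 0, 1]   # assumed to come from a seriation of Va

Vb = permute_square(Va, variable_order)
Db = permute_data(Da, subject_order, variable_order)
```

Here the reordering places variables 2 and 0, the most strongly correlated pair, next to each other in Vb, and the same column order carries over to Db.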
2.3 Partitioned Matrix Maps with near Stationary Iterations. After the seriations have been applied to the matrix maps, the potential groups of variables and clusters of subjects are identified using dynamic clustering procedures (see Section 1), as displayed in Figure 2c. The sorted proximity matrix maps for variables and subjects are then partitioned into squares on the diagonal for the within-group mean correlation structure and rectangles off the diagonal for the between-group mean correlation relationships. The doubly sorted raw data matrix map is also partitioned into rectangles that represent the mean interaction effect of each subject-cluster on every variable-group.
2.4 The Sufficient Graph with Three Multivariate Linkages. In order to extract and summarize the visualized information in Figure 2b, we can further convert these matrix maps into a simplified version. Illustrated in Figure 2d are the mean-structure maps of the three matrices for raw data and proximities. These three mosaic displays in Figure 2d contain the principal structural information embedded in the original data set. The mean function in Figure 2d can be replaced with any statistic for displaying the desired information structure. We shall name these three mosaic displays the sufficient graph for a multivariate data set. The sufficient graph is then used to answer the three multivariate problems raised by the psychiatrist. The fifty symptoms are divided into five symptom-groups with different within-group and between-group structures. The ninety-five patients are grouped into three clusters. The general behavior of these three patient-clusters on each of the five symptom-groups can now be easily comprehended.
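The block summaries behind the partitioned maps and the sufficient graph can be sketched as follows. This is a minimal illustration under the assumption that the groups are already given as lists of row and column indices; the function name, the groups and the toy data are hypothetical. The `stat` argument reflects the remark that the mean can be replaced with any statistic.

```python
from statistics import mean

def block_summary(matrix, row_groups, col_groups, stat=mean):
    """Summarize each (row group, column group) block of a matrix with one value.

    For the raw data map, row_groups are subject-clusters and col_groups are
    variable-groups. For a proximity map, pass the same grouping twice; the
    diagonal blocks then include the unit self-correlations.
    """
    return [
        [stat([matrix[i][j] for i in rg for j in cg]) for cg in col_groups]
        for rg in row_groups
    ]

# Hypothetical doubly sorted data: 4 subjects (rows) by 3 variables (columns).
Db = [
    [1, 2, 4],
    [3, 2, 0],
    [5, 6, 6],
    [7, 6, 6],
]
subject_clusters = [[0, 1], [2, 3]]   # assumed clustering of the subjects
variable_groups = [[0], [1, 2]]       # assumed grouping of the variables

mean_map = block_summary(Db, subject_clusters, variable_groups)
max_map = block_summary(Db, subject_clusters, variable_groups, stat=max)
```

Each entry of `mean_map` corresponds to one rectangle of the mosaic display for the raw data; applying the same function to the two proximity matrices gives the other two mosaics.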
3 EXTENSIONS OF GAP
Section 2 introduced a typical GAP procedure for a data set with continuous variables measured at a single time point. We have developed several modules of GAP for handling many possible variants of data structure.
3.1 Categorical GAP. Figure 3 displays four matrix maps of different data types. Categorical GAP with dual scaling can be used to visualize the information in binary and nominal data matrices.
[Figure 3 panel titles: Continuous, Ordinal, Binary, Categorical. Legend values: -1 0 1; 1 2 3 4 5 6 7; 0 1; A C G T.]
Figure 3. Information Visualization for Data Matrices with Different Data Types.
3.2 Longitudinal GAP. A data profile may be measured at multiple time points, as illustrated in Figure 4. In that case there is an extra piece of linkage for subjects and variables over time. This extra linkage makes the visualization of longitudinal multivariate data profiles a much harder statistical challenge than the single-profile GAP procedure. Longitudinal GAP is designed for visualizing data profiles with longitudinal structure.
[Figure 4 panels: t1, t2, t3.]
Figure 4. Composed Color Profile with Seriation for Data Matrices at Multiple Time Points
3.3 Canonical GAP. Given two data sets with different sets of variables for the same subjects, Canonical GAP is used to explore the relationship between them (Figure 5).
[Figure 5 panels: X, Y, Corr(X), Corr(Y), Corr(X,Y), Corr(Y,X), Corr((XY)'). Axis labels are the variable codes BPRS1-BPRS16 and the 50 symptom codes.]
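The correlation blocks named in Figure 5 can be computed directly when X and Y share the same subjects. The sketch below covers only that computation, not the seriation or display steps of Canonical GAP. The function names and the toy data are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def cross_correlation(A, B):
    """Correlations between the columns of A (rows of the result) and the
    columns of B (columns of the result). A and B must share their subjects."""
    if len(A) != len(B):
        raise ValueError("A and B must have the same number of subjects")
    a_cols = list(zip(*A))
    b_cols = list(zip(*B))
    return [[pearson(a, b) for b in b_cols] for a in a_cols]

# Hypothetical data: the same 4 subjects measured on two variable sets.
X = [[1, 4], [2, 3], [3, 2], [4, 1]]
Y = [[2, 1], [4, 0], [6, 1], [8, 0]]

corr_x = cross_correlation(X, X)    # Corr(X)
corr_y = cross_correlation(Y, Y)    # Corr(Y)
corr_xy = cross_correlation(X, Y)   # Corr(X,Y)
corr_yx = cross_correlation(Y, X)   # Corr(Y,X), the transpose of Corr(X,Y)

# Correlation matrix of the combined variable set, assembled from the blocks.
combined = [rx + rxy for rx, rxy in zip(corr_x, corr_xy)] + \
           [ryx + ry for ryx, ry in zip(corr_yx, corr_y)]
```

The off-diagonal blocks `corr_xy` and `corr_yx` carry the between-set relationship, while the diagonal blocks are the within-set proximity matrices.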
3.4 Cartography GAP. When each subject in a multivariate categorical data set is affiliated with a district on a map, Cartography GAP extends the power of Categorical GAP to systematically identify a suitable color scheme for coloring the districts on the map (Figure 6).