The Sorted Binary Plot: A New Technique for Exploratory Data Analysis

G. Alvin Mead

Special Projects Department, BOC Group, Inc., Murray Hill, NJ 07974

The sorted binary plot is a graphical method for identifying and displaying patterns in multivariate data sets. The construction requires calculation of the median for each variable measured, followed by subtraction of the medians from the values for each sample. The signs of the residuals represent a binary number for each sample. The list of binary numbers is sorted and converted to a graph by assigning distinctive symbols to 1 and 0. The sorting operation causes samples with the same binary number to form clusters. The method can be extended to three or more quantiles. It has been applied to several kinds of data.

KEY WORDS: Classification; Cluster analysis; Graphical methods; Multivariate analysis; Numerical taxonomy.

1. INTRODUCTION

Our laboratory recently ran a number of mass spectrometer measurements on process-effluent samples with the intent of detecting suspected fluctuations in the process. A hypothetical model that predicted quantitative changes in the effluent composition had been suggested.

The precision in the first measurements was inadequate for detecting the predicted effect in any single measurement. As more measurements were made and larger groups of data became available, bivariate graphical methods showed distinct patterns in both time series and composition relations.

In our efforts to squeeze more information out of the measurements, we developed a new graphical method for displaying multivariate data. This method showed significant patterns that were inconsistent with the suggested process model, indicated deficiencies in the measurement techniques, and later led to a revised model.

2. CONSTRUCTION OF THE PLOT

The object of this procedure, like that of many other graphical methods, is to convert a large table of numbers into a visual pattern in which significant features stand out.

The first step is to prepare a table of residuals by subtracting the median values for each variable from the data vectors for the individual samples. Only the signs of the residuals are used to make the plot. These constitute a binary number (or word) for each sample in which + is 1 and - is 0. Zero residuals have arbitrarily been treated as negative. Sorting the list of binary numbers causes equal numbers to form clusters.
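The first step can be sketched in a few lines of Python. This is a minimal illustration rather than the program used for the plots in this article; the function name `binary_words` and the small data table are hypothetical, and samples are assumed to be stored as rows with one column per variable.

```python
from statistics import median

def binary_words(samples):
    """Return one tuple of 0/1 digits per sample (rows = samples, columns = variables)."""
    medians = [median(column) for column in zip(*samples)]
    # A strictly positive residual becomes 1; a zero or negative residual
    # becomes 0, following the convention that zero residuals count as negative.
    return [tuple(1 if x - m > 0 else 0 for x, m in zip(row, medians))
            for row in samples]

# Hypothetical data: five samples measured on three variables.
data = [
    [1.0, 5.0, 2.0],
    [3.0, 1.0, 9.0],
    [2.0, 4.0, 4.0],
    [5.0, 2.0, 8.0],
    [4.0, 3.0, 1.0],
]

words = binary_words(data)
# Tuples of equal length compare digit by digit, which is the same as
# numerical order for binary words of fixed width.
order = sorted(range(len(data)), key=lambda i: words[i])
# order is [1, 0, 2, 4, 3]; samples 0 and 2 share the word (0, 1, 0)
# and therefore end up side by side as a cluster of two.
```

Because the sort is stable, samples with identical words keep their original relative order inside a cluster.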

Sorting in numerical order, however, can have the undesirable side effect of producing rather pronounced diagonal patterns in the display. These are purely artifacts and have nothing to do with the underlying data structure.

The diagonal patterns can be largely eliminated by a second sort, arranging the binary words in order of the sum of digits (SOD). For example, for 101000, SOD = 2; for 101001, SOD = 3; and so on. The second sort puts samples with more properties above median toward one end of the plot and samples with fewer properties above median toward the other end. Samples that differ only in one digit tend to be close together (although not necessarily adjacent). On the other hand, two successive numbers in an ascending binary sequence may differ in all but one digit and may thus represent samples with most properties quite different.
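A minimal sketch of the SOD sort follows. It assumes that ties within one SOD value are left in numerical order, which is what a stable second sort applied after the numerical sort would give; the text does not spell out the tie rule, and the function names are illustrative only.

```python
def sod(word):
    """Sum of digits of a binary word written as a string such as '101000'."""
    return sum(int(digit) for digit in word)

def sod_sort(words):
    # Primary key: sum of digits. Secondary key: the word itself, which for
    # equal-length strings of 0s and 1s is the same as numerical order.
    return sorted(words, key=lambda w: (sod(w), w))

examples = ["101001", "000111", "101000", "111000", "010000"]
ordered = sod_sort(examples)
# ordered is ['010000', '101000', '000111', '101001', '111000']
```

Identical words still sort next to each other, so the clusters found by the numerical sort are preserved.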

In some cases, the variables can have a natural order of importance. In such a situation, it would make sense to arrange the variables in this order from top to bottom in the plot and sort only by numerical order, omitting the sort by SOD.

The number and size of clusters are independent of the arrangement of variables and are unaffected by the SOD sort.

Qualitative data can be used if a binary representation makes sense (for example, better or worse, red or blue, dead or alive, and so forth). Data can be a mixture of analog and binary data. More precisely, all of the data used in the plot are binary, since analog data are first converted to binary form. If the original data are binary, the actual values should be used to construct the plot.

The display is printed with distinctive characters for 1 and 0. Identification of samples and variables can be added to suit. In the plots shown here, a line called “Count” has been added at the bottom to mark the clusters clearly. This is simply a toggle that changes direction whenever a binary number differs from the one preceding it in the sorted series.
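The display step, including the Count toggle, might be sketched as follows. The symbols (`#`, `.`, `^`, `v`), the label width, and the function names are assumptions made for this illustration; the layout with variables as rows and samples as columns follows the figures.

```python
def count_toggle(sorted_words, marks=("^", "v")):
    """One mark per sample; the mark flips whenever a word differs from its predecessor."""
    line = []
    state = 0
    previous = None
    for word in sorted_words:
        if previous is not None and word != previous:
            state = 1 - state
        line.append(marks[state])
        previous = word
    return "".join(line)

def render(sorted_words, labels, one="#", zero="."):
    """Text plot: one row per variable, one column per sample, Count line last."""
    rows = []
    for j, label in enumerate(labels):
        digits = "".join(one if word[j] == 1 else zero for word in sorted_words)
        rows.append(f"{label:<6}{digits}")
    rows.append(f"{'Count':<6}{count_toggle(sorted_words)}")
    return "\n".join(rows)

# Hypothetical sorted words for six samples and two variables.
sorted_words = [(0, 0), (0, 1), (0, 1), (1, 1), (1, 1), (1, 1)]
plot = render(sorted_words, ["V1", "V2"])
# The Count line reads ^vv^^^: a single sample, a pair, and a triplet.
```

Runs of the same mark in the Count line correspond to clusters of identical words.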

3. APPLICATIONS TO REAL DATA

3.1 Mass Spectrometer Data

In a recent investigation of possible effects of a process change on effluent composition, predicted variations were sought in a group of components, identified as N1-N9, having distinct mass peaks. An operating time period was selected, and collection of samples was started. Samples were analyzed as they arrived. Contrary to our wishful thinking, this approach did not result in randomization of the analyses.

Mass spectrometer data were tabulated as ratios to ingredient N6, which is present in fairly large quantity and is supposedly unaffected by the process changes.

The plot of the binary sign data is shown in Figure 1a. The sampling location is identified at the top. The three locations sampled are designated A, B, and C. As standards, some samples were prepared simulating the mixture in the process effluent. These are designated S.

To get some idea of the number of clusters that might be observed by chance, the median-sorting process was simulated by generating arrays of bits having the same numbers of rows and columns as the data arrays. Each row, corresponding to a variable, is a bit string having an equal number of 1s and 0s. These represent the signs of residuals from the median. The complete array is generated by randomly ordering the bits in each row. An example of a plot from a simulated array is given in Figure 1b. The simulated data set obviously has fewer and smaller clusters than the real data.
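The simulation can be sketched as below. This is an illustration under stated assumptions, not the original program: for an odd number of samples the row is given one more 0 than 1s (the median sample has a zero residual, which counts as negative), the cluster-size tabulation includes a size-0 entry for binary words that no sample occupies, and all function names are hypothetical.

```python
import random
from collections import Counter

def simulate_words(n_samples, n_vars, rng):
    """Random bit array: each variable row has equal 1s and 0s in shuffled order."""
    rows = []
    for _ in range(n_vars):
        ones = n_samples // 2
        bits = [1] * ones + [0] * (n_samples - ones)
        rng.shuffle(bits)
        rows.append(bits)
    # Transpose so that each tuple is the binary word of one sample.
    return list(zip(*rows))

def cluster_size_table(words, n_vars):
    """Map cluster size -> number of distinct words shared by that many samples."""
    occupancy = Counter(words)
    table = Counter(occupancy.values())
    table[0] = 2 ** n_vars - len(occupancy)  # words that no sample occupies
    return dict(table)

def average_table(n_samples, n_vars, n_sims, seed=0):
    """Average the cluster-size table over n_sims simulated arrays."""
    rng = random.Random(seed)
    totals = Counter()
    for _ in range(n_sims):
        words = simulate_words(n_samples, n_vars, rng)
        totals.update(cluster_size_table(words, n_vars))
    return {size: totals[size] / n_sims for size in sorted(totals)}

# Same shape as the first experimental data set: 74 samples by 8 variables.
averages = average_table(74, 8, n_sims=200)
```

Two checks hold for every simulated array: the table entries sum to 2 raised to the number of variables, and the entries weighted by cluster size sum to the number of samples.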

Table 1 shows median results for size and number of clusters in a thousand simulations for this data set and for the others presented here.

The experimental data set in Figure 1a shows some highly pronounced nonrandom features. Unfortunately, none of the patterns corresponds to the hypothetical effect, which would cause N1 and N2 to be higher than median with all of the other ingredients lower.

The real explanation for the clusters turned out to be instrument bias. The bias was traced to automatic switching of amplifiers in the mass spectrometer. The effect was intermittent, probably because of environmental temperature changes.

This source of bias was eliminated by operating with only a single amplifier. This amplifier, however, is not sensitive enough to measure N8 and N9. Other modifications were made, and considerable improvement in the overall precision resulted.

A plot for a group of measurements made with the modified procedure is shown in Figure 2.

[Figure 1 plots not reproduced. In both panels the rows are the mass-peak ratios to N6, followed by the Count line.]

Figure 1. (a) First Series of Mass Spectrometer Measurements Treated by Median Sort; (b) Random Numbers Treated by the Same Procedure. A, B, and C are sampling locations; S is standard.

Table 1. Data Simulations With Median Sort (average of 1,000 simulations)

Number in   First experimental   Second experimental   Crime data   Mortality data   Cars
cluster     data set (74 x 8)*   data set (57 x 6)*    (48 x 7)*    (51 x 10)*       (74 x 12)*

0           190.7                25.0                  87.1         974.0            4,022.5
1           57.2                 24.8                  34.3         48.9             72.9
2           7.5                  10.9                  5.9          1.0              .34
3           .58                  2.7                   .56          .018             .002
4           .029                 .46                   .033         0                0
5           .001                 .06                   .002

* Samples by variables.

All of the samples were collected from a single process location, and a larger number of reference standards was used in the comparison. In Figure 2, the standards are marked S as in Figure 1a, and the real samples are marked R. Again a pronounced clustering pattern is evident, different from the earlier pattern. There are very significant differences between the standard mixture and the real samples.

Analysis of the numerical differences between the sample data and the reference data led to a simple model that fits the data quantitatively. The effect described by this model is in almost exactly the reverse direction to the original hypothesis about what was thought to be happening in the process.

3.2 U.S. Census Data

To see if the technique would produce interesting patterns when applied to other kinds of data, we analyzed some of the tabulations prepared by the U.S. Bureau of the Census (1982). A sorted plot for crime data in 1981 is shown in Figure 3. A striking feature in the gross pattern is that a large group of states has crime rates higher than the median in all categories and another large group has crime rates lower than the median in all categories. The large groups at the high end and the low end may indicate differences in the criminal population, the mechanism of law enforcement, crime-reporting procedures, or the availability of something to steal.

There is a cluster of four and a cluster of three toward the low end of the plot. The quartet (Wisconsin, Indiana, Iowa, and Montana) are not close together. They are, however, all part of a large low-crime group having either one or none of the crime categories above median. The triplet (Kentucky, Mississippi, and Arkansas) is close to being geographically contiguous; all border on the Mississippi River. These three are higher than median in murder rate and in no other category. They form a contiguous group with several other states including North Carolina, Alabama, Virginia, Illinois, Tennessee, and South Carolina that are higher than the median in murder and only a few other categories.

In some cases, it may be informative to use three or more quantiles for sorting the data. Figure 3b is an example showing the crime data sorted by three quantiles. There are still large groups at the ends. The gross pattern still shows that the country can be divided into a large group of high-crime states, another large group of low-crime states, and a relatively small group in the middle. Inspection of the details confirms that the wild West really is, but there are also some hot spots on the Atlantic coast (south of New Hampshire), the Gulf of Mexico, the Mississippi, and the Great Lakes.
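A three-quantile version of the coding step might look like the sketch below. The cut points are not specified in the text; this illustration assumes terciles as computed by Python's `statistics.quantiles`, and it treats a value equal to a cut point as belonging to the lower class, by analogy with the treatment of zero residuals in the binary case. The function name is hypothetical.

```python
from statistics import quantiles

def ternary_words(samples):
    """Return one tuple of digits 0, 1, or 2 per sample (rows = samples)."""
    # Two cut points per variable divide its values into three classes.
    cuts = [quantiles(column, n=3) for column in zip(*samples)]
    # The digit is the number of cut points that the value strictly exceeds.
    return [tuple(sum(x > c for c in cut) for x, cut in zip(row, cuts))
            for row in samples]

# Hypothetical data: nine samples, two variables running in opposite directions.
samples = [[i, 10 - i] for i in range(1, 10)]
words = ternary_words(samples)
order = sorted(range(len(samples)), key=lambda i: words[i])
# The first three samples get (0, 2), the middle three (1, 1), the last three (2, 0).
```

The sorting and display steps are unchanged, except that three distinctive symbols are needed instead of two.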

Mortality data from the U.S. census for 1979, sorted by medians, are shown in Figure 4a. Even though this pattern is more random than the crime data, there appears to be some structure.

[Figure 2 plot not reproduced. Rows: N1/N6, N2/N6, N3/N6, N4/N6, N5/N6, N7/N6, Count.]

[Figure 3 plots not reproduced. Columns are states, labeled by abbreviated names; rows: MURDER, RAPE, ROBBERY, ASSAULT, BURGLARY, LARCENY, MVTHEFT.]

Figure 3. (a) U.S. Census Crime Data; (b) Same Data With Ternary Sort Instead of Binary Sort.

There are two triplets and six pairs, a moderately improbable observation according to the simulations shown in Table 1. One of the triplets, consisting of Missouri, Kansas, and Arkansas, is geographically contiguous. The states in the other (Virginia, Utah, and Hawaii) do not have any obvious geographical relation or other similarities.

Because this data set has some unusual properties, the states were sorted only by numerical order and not by SOD. The first five causes of death account for almost 90% of the total, and they (loosely) form a geometric progression with a ratio of about one-half. The order of the variables is thus important, and the order of the binary (or ternary) numbers corresponds closely to the order of total death rate.

[Figure 4 plots not reproduced. Panel (a) is the median sort and panel (b) the ternary sort of the mortality data. Columns are states, labeled by abbreviated names; rows: HEART, CANCER, STROKE, ACCID, PULM, PNEU, DIAB, LIVER, ATHERO, SUICIDE, Count.]

The sorted ternary plot, Figure 4b, shows perhaps a little more clearly than Figure 4a that there is a group of low-total-death-rate states that have relatively high deaths by accident and suicide.

3.3 Automobile Data

Statistics on automobiles in the United States are published widely. For another illustration, the procedure was applied to a tabulation that was used by Chambers, Cleveland, Kleiner, and Tukey (1983) to illustrate other graphical methods of comparison. The plot is shown in Figure 5. A key to the cars is given in Table 2. Several features stand out, especially a number of large clusters. According to the simulations in Table 1, all of these are highly improbable.

Table 2. Car Identification (1979 model year)

Make and model         Code  Make and model         Code

AMC Concord            AMC   Lincoln Versailles     LIV
AMC Pacer              AMP   Mazda GLC              MAZ
AMC Spirit             AMS   Mercury Bobcat         MEB
Audi 5000              AU5   Mercury Cougar         MCO
Audi Fox               AUF   Mercury Cougar XR7     MC7
BMW 320                BMS   Mercury Marquis        MAR
Buick Century          BUC   Mercury Monarch        MON
Buick Electra          BUE   Mercury Zephyr         MEZ
Buick LeSabre          BUI   Olds Cutlass           OLC
Buick Opel             BUO   Olds Cutlass Supreme   OCS
Buick Regal            BUR   Olds Delta 88          OL8
Buick Riviera          BRI   Olds 98                OL9
Buick Skylark          BUS   Olds Omega             OLO
Cadillac DeVille       CAD   Olds Starfire          OLS
Cadillac Eldorado      CAE   Olds Toronado          OLT
Cadillac Seville       CAS   Peugeot 604SL          PEU
Chevrolet Chevette     CHC   Plymouth Arrow         PLA
Chevrolet Impala       CHI   Plymouth Champ         PLC
Chevrolet Malibu       CHM   Plymouth Horizon       PLH
Chevrolet Monte Carlo  CMC   Plymouth Sapporo       PLS
Chevrolet Monza        CMO   Plymouth Volare        PLV
Chevrolet Nova         CHN   Pontiac Catalina       POC
Datsun 200-SX          D20   Pontiac Firebird       POF
Datsun 210             D21   Pontiac Grand Prix     POG
Datsun 510             D51   Pontiac LeMans         POL
Datsun 810             D81   Pontiac Phoenix        POP
Dodge Colt             DOC   Pontiac Sunbird        POS
Dodge Diplomat         DOD   Renault Le Car         REN
Dodge Magnum           DOM   Subaru                 SUB
Dodge St. Regis        DOS   Toyota Celica          TCL
Fiat Strada            FIS   Toyota Corolla         TCR
Ford Fiesta            FOF   Toyota Corona          TCN
Ford Mustang           FOM   VW Dasher              VDS
Honda Accord           HOA   VW Rabbit              VRB
Honda Civic            HOC   VW Rabbit Diesel       VRD
Lincoln Continental    LIC   VW Scirocco            VSC
Lincoln Mark V         LIM   Volvo 260              VOL

Many of the factors evaluated are highly correlated. There are seven factors related to size, including head room (Head); rear seat length (Rear); trunk volume (Trunk); weight, length, and turning radius (Turn); and engine displacement (Disp). All of these could probably have been lumped into one factor and given a value of either big or small. A more pronounced pattern could be obtained by doing this. The frequency-of-repair records (lines REP78 and REP77 in Fig. 5; a positive is better than median) appear to correlate with each other but not with anything else. There is little if any correlation with price.

In this case, the gross pattern is probably more interesting than the clusters of identical samples. Four main groupings by price and mileage (Mile) can be seen. To some extent, this is caused by the nature of the sorting procedure, but compare these groupings with the mortality data in Figure 4a. (The sort for SOD has been omitted here also.) Within each of these groupings, most of the cars appear to be either big or small. That is, there is a group of cars with better than median mileage; most of them are small, and all of them have higher than median gear ratio (Gear). There is next (reading from right to left) a group of big, expensive cars having lower than median gear ratios. As would be expected, most of the cars with higher than median gear ratios have better than median gas mileage, but the correlation is not perfect.

There are then two large groups of the lower priced cars, one of which is composed mostly of big cars and the other of small cars. There are, however, a few exceptions within the two subdivisions of the lower priced cars. Further analysis of the exceptions might be of interest.

Any number of speculations can be offered about the underlying causes for observed patterns. Several of the cars in the sample may differ from each other in name only. There may be established principles for automobile design that cause some of the factors to be related in a consistent way. Automobile manufacturers may consciously design and price cars to appeal to selected market segments.

4. DISCUSSION

A wide variety of methods has been used to construct displays of multivariate data, such as the star, tree, profile, and draftsman’s plot described by Chambers et al. (1983), and other pictorial approaches, including Chernoff’s stylized faces, described by Barnett (1981).

This procedure seems to have some advantages over the other pictorial methods. The display is compact and readily comprehensible. There is little quantitative detail. For data sets with as many as 100 samples and 10 variables, however, it is possible to show some information about every sample with respect to every variable in a single plot. Every sample and variable can be labeled and identified in the plot, with more or less eyestrain, depending on the size of the array and the complexity of the labeling.

Cluster structure is influenced by correlations in the data, but there is no simple relationship between correlation coefficients and cluster structure. High correlation between variables has the effect of reducing the number of variables. Addition of a single outlier to a collection of random data, however, will cause high values of correlation coefficients to appear without noticeable effect on the cluster structure.
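The outlier effect can be checked with a small numerical experiment. The sketch below is only an illustration: the sample size, the outlier value of 100, and the use of uniform random numbers are arbitrary choices, and the Pearson coefficient is written out by hand to keep the example self-contained.

```python
import random
from statistics import median

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def signs(column):
    """Binary digits for one variable: 1 above the median, 0 otherwise."""
    m = median(column)
    return [1 if value > m else 0 for value in column]

rng = random.Random(1)
x = [rng.random() for _ in range(50)]
y = [rng.random() for _ in range(50)]

# Append one extreme sample, far outside the range of the random data.
x_out, y_out = x + [100.0], y + [100.0]

r_before = pearson(x, y)
r_after = pearson(x_out, y_out)

# Compare the binary words of the original 50 samples before and after.
words_before = list(zip(signs(x), signs(y)))
words_after = list(zip(signs(x_out), signs(y_out)))[:50]
changed = sum(a != b for a, b in zip(words_before, words_after))
```

The correlation coefficient jumps to nearly 1, while the outlier moves each median by at most one order statistic, so at most one original sample per variable changes its digit.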

On the other hand, clusters can be generated in a data set simply by replicating data for one or more of the samples. There may be little effect on correlation coefficients, but the clusters will show up in the sorted plot.

The procedure does not distinguish between correlation caused by groups of data and correlation resulting from smooth distribution of data. The binary-sort procedure splits data at the median and assigns 1 or 0 to points above and below the median; no other consideration is given to the distribution of the points. Clumps of data on each side of the median can thus produce the same cluster appearance as uniform correlations in the data. To elucidate the underlying associations indicated by the graphical clusters, the binary-sort procedure should be combined with other graphical methods such as the draftsman’s plot and quantitative estimates of correlation.

This method is probably more suitable than most of the other pictorial methods for identification of clusters. This identification process can be carried out entirely by a computer following simple, unequivocal rules. The display can be modified to suit the intended application of the data. The numerical procedure for forming clusters could be used with very large data sets, but the cluster characteristics of the resulting binary (ternary, etc.) number list would have to be shown by a frequency tabulation or histogram rather than by a binary number plot.
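For a data set too large to plot, the frequency tabulation mentioned above could be produced along the following lines. This is a sketch with hypothetical function names; it assumes the binary (or ternary) words have already been computed as tuples of digits.

```python
from collections import Counter

def word_frequencies(words):
    """Count how many samples share each word, keyed by the word as a string."""
    return Counter("".join(str(digit) for digit in word) for word in words)

def size_histogram(frequencies):
    """Map cluster size -> number of distinct words having that many samples."""
    return dict(sorted(Counter(frequencies.values()).items()))

# Hypothetical words for six samples and two variables.
words = [(1, 0), (1, 0), (0, 1), (1, 1), (1, 0), (0, 1)]
freq = word_frequencies(words)
largest = freq.most_common(2)       # [('10', 3), ('01', 2)]
histogram = size_histogram(freq)    # {1: 1, 2: 1, 3: 1}
```

The tabulation needs only one pass over the samples and memory proportional to the number of distinct words, so no sort of the full list is required.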

Construction of the display is conceptually simple. As with some of the other pictorial methods, it is easy to explain what was done to the original data to construct the plot.

ACKNOWLEDGMENTS

This work was supported in large part by contract with the U.S. Air Force. I thank Mark Nicolich of Rutgers University for many helpful comments. I also thank the editors and two referees for several constructive suggestions.

REFERENCES

Barnett, V. (1981), Interpreting Multivariate Data, New York: John Wiley.

Chambers, J. M., Cleveland, W. S., Kleiner, B., and Tukey, P. A. (1983), Graphical Methods for Data Analysis, Boston: Duxbury Press.

U.S. Bureau of the Census (1982), Statistical Abstract of the United States: 1982-1983 (103rd ed.), Washington, DC: Author.