An Exploratory Data Analysis Approach to Qualitative Response Modelling Using SAS/IML(R) and SAS/GRAPH(R) Software


Merwyn L. Elliott, Ross Hightower, Caleb Chan

Statistical Services Laboratory, Georgia State University

ABSTRACT:

A graphical exploratory data analysis alternative to qualitative response modelling is presented which can either be a precursor to or a replacement for discriminant analysis and logistic regression. In taking an exploratory approach, analysts have the opportunity to investigate their data through color graphics using multivariate plots and scatter diagrams in SAS/GRAPH. Underlying relationships, often overlooked in traditional confirmatory modelling, can be discovered in a multivariate exploratory analysis.

As an application example, a data set containing both failed and not failed industrial companies and selected financial ratios is used. The classification problem is to distinguish between failed and not failed firms given their financial ratios.

Discriminating variables are selected using a series of schematic side-by-side box and whisker plots, from which the degree to which each variable "pulls apart" the two groups of the classification variable is judged visually.

Classification is accomplished by first computing Mahalanobis distance metrics from each observation vector to the centroid for each group using PROC IML. Next, three-dimensional graphics generated from PROC G3D are used to present the centroids for each group, actual observation vectors, and symbol-coded classifications into either failed or not failed categories based on minimum D² distances. (Note: In our work we use color graphics extensively. Our discussion here, given restrictions for the proceedings, does not involve color. The attached code to generate classification plots does include liberal use of color.) The analysis culminates in a picture which tells the classification story and enhances the analyst's understanding of the classification process. Using the ROTATE option in PROC G3D, the analyst can "walk around" the multivariate classification plot and view the relationships from various angles to get a better "feel" for the data. PROC GREPLAY is used to assemble graphs for more efficient presentation (multiple graphs per page).

BACKGROUND:

During the 1970s, the basic foundations for exploratory data analysis (EDA) were laid by Tukey, Mosteller and others (see Tukey 1977). The basic idea in EDA is to investigate one's data to see what can be done before one measures how well one has done it. Heavy emphasis is placed on looking at pictures, not so much to see what you already know as to discover things of which you were previously unaware. Rather than replacing confirmatory techniques, EDA procedures were proposed to be used side-by-side with them. Much of the early work in EDA involved univariate and bivariate analyses on small data sets without computer assistance.

Since then, according to W. S. Cleveland in a recent article in the Journal of the American Statistical Association (Cleveland 1987), the force behind the tremendous increase in activity in statistical graphics has been the computer graphics revolution. High quality hardware and software are now available at low cost, and increasingly powerful systems are penetrating the data analyst's workplace. Statisticians are exploiting the new medium to produce new tools for the data analyst.

APPLICATION:

As an application area to demonstrate a new computer graphics based approach to EDA for qualitative response modelling, we have selected a prototype failing company study, an area which continues to receive considerable attention in business and economic research. Since interest centers on a qualitative outcome variable (fail, not fail) and predictor variables are typically financial ratios, qualitative response models such as discriminant analysis or logistic regression are common analytical tools. Two purposes are typically pursued: 1) determine the predictor variables' ability to discriminate and 2) develop a predictive or classificatory model to predict the likelihood of failure.

An exploratory data analysis is a useful precursor to failing company studies in that the analyst can investigate the data visually to see what sorts of relationships exist among the variables. In what follows we shall demonstrate: 1) variable selection using schematic side-by-side plots, and 2) classification using multivariate plots of Mahalanobis distances.

VARIABLE SELECTION:

In selecting candidate variables for a discriminant analysis it is important that the variables adequately separate or "pull apart" the grouping variable. This separation property can be viewed by using schematic side-by-side plots. These are box and whisker plots for failed and healthy firms side-by-side on each variable and are produced by PROC SPLOT. The four financial ratios selected for demonstration purposes are 1) Return on Assets, 2) Current Ratio, 3) Cash Flow and 4) Current Assets to Net Sales Ratio. All of the variables except Current Assets to Net Sales Ratio seem to be reasonably good discriminators. Therefore the Current Assets to Net Sales Ratio variable will be deleted from further analysis.
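The side-by-side comparison described above could be sketched roughly as follows. This is an illustration only, not the authors' code: it substitutes PROC BOXPLOT for PROC SPLOT, and it assumes a data set DATA1 holding the four ratios under the names CF, ROR, CR and CANS together with a grouping variable FAIL (1 = failed, 2 = not failed), as in the program listing at the end of the paper. The output data set name SORTED is hypothetical.

```sas
/* Sketch only: PROC BOXPLOT stands in for PROC SPLOT.             */
/* Assumes DATA1 contains CF, ROR, CR, CANS and FAIL (1 or 2).     */
/* SORTED is a hypothetical name for the sorted copy.              */
proc sort data=data1 out=sorted;
  by fail;                      /* PROC BOXPLOT expects groups in order */
run;

proc boxplot data=sorted;
  /* one schematic box-and-whisker display per ratio, split by FAIL */
  plot (cf ror cr cans)*fail / boxstyle=schematic;
run;
```

A ratio whose two boxes overlap almost completely is a weak candidate for discrimination; one whose boxes are clearly offset is a stronger one.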

CLASSIFICATION:

Description: First, the three discriminating variables selected in the previous step were standardized using PROC STANDARD. The purpose of this adjustment is to remove unit and scale differences so that relationships among the discriminators can be more meaningfully visualized. Next, using PROC IML, Mahalanobis D² distance measures are computed from each observation vector to the centroid for each group as

$$D^2_{ig} = (X_i - \bar{X}_g)' \, C_g^{-1} \, (X_i - \bar{X}_g)$$

where

$D^2_{ig}$ = Mahalanobis distance for the ith observation from the centroid for the gth group,

$X_i$ = ith observation vector,

$\bar{X}_g$ = centroid for the gth group,

$C_g^{-1}$ = inverse of the covariance matrix for the gth group.

An observation vector is classified into the group with the shortest "statistical" distance (D²). If the distance to the failed group is shortest, the plot symbol is coded with flags (or crosses if it is misclassified); otherwise it is coded with balloons (or stars if it is misclassified).
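The quantities computed in the program listing at the end of the paper go slightly beyond the plain distance and can be written out as below. This is a reading of the code, not a formula stated in the text. The symbol $p_g$ for the observed proportion of group g (PER1 and PER2 in the listing) and the asterisk marking the adjusted distance are notation introduced here.

```latex
% X_i: observation vector; \bar{X}_g: centroid of group g;
% C_g: covariance matrix of group g; p_g: observed proportion of group g
\begin{align*}
D^{*2}_{ig} &= (X_i - \bar{X}_g)'\, C_g^{-1}\, (X_i - \bar{X}_g)
               + \ln\lvert C_g\rvert - 2\ln p_g, \\
P(g \mid X_i) &= \frac{\exp\!\left(-\tfrac{1}{2} D^{*2}_{ig}\right)}
                      {\sum_{h=1}^{2} \exp\!\left(-\tfrac{1}{2} D^{*2}_{ih}\right)}.
\end{align*}
```

If the two covariance determinants and the two group proportions were equal, the added terms would be the same for both groups and the rule would reduce to choosing the group with the smaller Mahalanobis distance.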

Finally, PROC G3D is used to generate the Classification Plots. These three-dimensional plots present the basic information needed for classification. Actual failed firms have "flag" plot symbols and not failed firms have "balloon" symbols. "Pyramid" and "cube" designate not failed and failed centroids, respectively. Balloons/stars indicate that an observation was "closest" to the not failed group and flags/crosses symbols denote that the observation was closest to the failed group. Accordingly, a flag/cross symbol has been classified as failed and a balloon/star symbol received a not failed classification. Therefore, a star would be a failed firm misclassified as not failed and a cross would be a not failed firm classified as a failed firm.

Our choices of symbols (and colors) were governed by the empirical work of Lewandowsky and Spence (Lewandowsky and Spence 1989). They determined that accuracy and response latency of observers of strata in graphics were positively affected by (color and) shape.

Discussion: Figure 1 and Figure 2 present the Classification Plots, each viewed from a different angle set by the ROTATE option in PROC G3D. Figure 1 shows a 0 degree rotation and Figure 2 shows a 90 degree rotation. In practice several plots would be produced at different degrees of rotation. The rotation can be performed counterclockwise around the origin in any degree increments. Such rotation allows the analyst to "walk around" the plot, very much like one would walk around a piece of sculpture to fully appreciate its form. Some possible interpretations that might be drawn from these figures are that:

1) there is one misclassified not failed firm (one cross),

2) three failed firms are classified as not failed (three stars),

3) the Current Ratio (vertical axis) appears to be the best discriminating variable (more separation/less overlap along this axis), and Cash Flow is probably a slightly better discriminator than Rate of Return,

4) the covariance matrices are probably unequal (imaginary ellipsoids drawn around the clouds of points for each group would not be of the same size and orientation), suggesting nonlinear classification rules (separate covariance matrix estimates, as we have used here),

5) the not failed firms seem to exhibit more variability than the failed firms (particularly along the Cash Flow dimension), and

6) one might infer that relatively low Cash Flow does not necessarily mean failure as long as the Current Ratio and Rate of Return are relatively high. A low Current Ratio almost certainly means a firm's demise.
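The "walk around" described above can be approximated in a single step by giving the ROTATE= option a list of angles, so that PROC G3D draws one plot per angle. The sketch below is illustrative only: it assumes the data set DATA5 and the variables COLORVAL and SHAPE built in the program listing at the end of the paper, and the 30 degree increment is an arbitrary choice.

```sas
/* Sketch only: draw the classification plot at several rotations. */
/* Assumes DATA5, COLORVAL and SHAPE as built in the listing.      */
/* The 30 degree step is an arbitrary, illustrative increment.     */
proc g3d data=data5;
  scatter cf*ror=cr / rotate=0 to 90 by 30   /* one graph per angle */
                      color=colorval
                      shape=shape;
  title1 'CLASSIFICATION PLOT AT SEVERAL ROTATIONS';
run;
quit;
```

The resulting graphs could then be arranged several to a page with PROC GREPLAY, as noted in the abstract.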

Figure 1: 3D Classification Plot with 0 Degree Rotation

Figure 2: 3D Classification Plot with 90 Degrees Rotation

Legend for both figures:

Actual: Failed/Flags, Not Failed/Balloons

Misclassified: Failed/Crosses, Not Failed/Stars

Cube: Failed Group Centroid; Pyramid: Not Failed Group Centroid

The Program:

/* READ THE FINANCIAL RATIOS AND THE FAILURE INDICATOR */
DATA DATA1;
  INFILE '\SAS\STAT\FAILED.DAT';
  INPUT CF ROR CR CANS FAIL;
RUN;

/* STANDARDIZE THE THREE DISCRIMINATORS TO MEAN 0, STD 1 */
PROC STANDARD DATA=DATA1 M=0 S=1 OUT=DATA2;
  VAR CF ROR CR;
RUN;

/* SORT DATA2 (THE DATA SET MERGED BELOW) BY GROUP */
PROC SORT DATA=DATA2;
  BY FAIL;
RUN;

/* GROUP PERCENTAGES, USED AS PRIOR PROBABILITIES */
PROC FREQ DATA=DATA2;
  TABLES FAIL / OUT=XXX;
RUN;

DATA DATA3;
  MERGE DATA2 XXX;
  BY FAIL;
  IF FAIL=1 THEN PER1=PERCENT/100;
  IF FAIL=2 THEN PER1=1-PERCENT/100;
  PER2=1-PER1;
RUN;

/* CALCULATE DISTANCES */
PROC IML;
  USE DATA2;
  READ ALL VAR{CF ROR CR} INTO X;
  READ ALL VAR{CF ROR CR} WHERE(FAIL=1) INTO X1;
  READ ALL VAR{CF ROR CR} WHERE(FAIL=2) INTO X2;

  M=NROW(X); M1=NROW(X1); M2=NROW(X2);
  UNIT=J(M,1,1); UNIT1=J(M1,1,1); UNIT2=J(M2,1,1);
  DSQR1=UNIT; DSQR2=UNIT;
  MEAN=(1/M)*UNIT`*X;
  MEAN1=(1/M1)*UNIT1`*X1;
  MEAN2=(1/M2)*UNIT2`*X2;

  /* DISTANCE OF EVERY OBSERVATION TO THE GROUP 1 CENTROID */
  XD1=X-UNIT*MEAN1;
  XD11=X1-UNIT1*MEAN1;
  C1=(1/(M1-1))*(XD11`*XD11);
  START LOOP1;
    DO J=1 TO M;
      DSQR1[J]=XD1[J,]*INV(C1)*XD1[J,]`+LOG(DET(C1));
    END;
  FINISH;
  RUN LOOP1;

  /* DISTANCE OF EVERY OBSERVATION TO THE GROUP 2 CENTROID */
  XD2=X-UNIT*MEAN2;
  XD22=X2-UNIT2*MEAN2;
  C2=(1/(M2-1))*(XD22`*XD22);
  START LOOP2;
    DO J=1 TO M;
      DSQR2[J]=XD2[J,]*INV(C2)*XD2[J,]`+LOG(DET(C2));
    END;
  FINISH;
  RUN LOOP2;

  /* WRITE THE CENTROIDS, FLAGGED AS FAIL=3 AND FAIL=4 */
  MEANVAR={CF ROR CR FAIL};
  FAIL1={3};
  MEAN11=MEAN1||FAIL1;
  CREATE CENT1 FROM MEAN11 [COLNAME=MEANVAR];
  APPEND FROM MEAN11;
  CLOSE CENT1;

  FAIL2={4};
  MEAN22=MEAN2||FAIL2;
  CREATE CENT2 FROM MEAN22 [COLNAME=MEANVAR];
  APPEND FROM MEAN22;
  CLOSE CENT2;

  /* WRITE THE TWO DISTANCE COLUMNS */
  VAR={DSQR1 DSQR2};
  DSQR=DSQR1||DSQR2;
  CREATE DIST FROM DSQR [COLNAME=VAR];
  APPEND FROM DSQR;
  CLOSE DIST;
QUIT;

/* CLASSIFY OBSERVATIONS */
DATA DATA4;
  /* SHAPE IS DECLARED HERE SO THAT 'BALLOON' IS NOT TRUNCATED */
  LENGTH COLORVAL SHAPE $ 7;
  MERGE DATA3 DIST;
  DSQR1=DSQR1-2*LOG(PER1);
  DSQR2=DSQR2-2*LOG(PER2);
  POST1=EXP(-DSQR1/2)/(EXP(-DSQR1/2)+EXP(-DSQR2/2));
  POST2=EXP(-DSQR2/2)/(EXP(-DSQR1/2)+EXP(-DSQR2/2));
  IF DSQR1<=DSQR2 THEN COLORVAL='RED';
  IF DSQR1>DSQR2 THEN COLORVAL='GREEN';
  IF FAIL=1 THEN SHAPE='FLAG';
  IF FAIL=2 THEN SHAPE='BALLOON';
RUN;

DATA TEMP;
  SET DATA4;
  IF COLORVAL='RED' THEN CLASS=1;
  IF COLORVAL='GREEN' THEN CLASS=2;
RUN;

/* ACTUAL GROUP BY ASSIGNED CLASS */
PROC FREQ;
  TABLES FAIL*CLASS / NOPERCENT;
RUN;

/* ADD THE TWO CENTROIDS TO THE PLOT DATA */
DATA DATA5;
  SET DATA4 CENT1 CENT2;
  IF FAIL=3 THEN COLORVAL='RED';
  IF FAIL=4 THEN COLORVAL='GREEN';
  IF FAIL=3 OR FAIL=4 THEN SHAPE='PILLAR';
RUN;

PROC G3D DATA=DATA5;
  SCATTER CF*ROR=CR / ROTATE=0 COLOR=COLORVAL SHAPE=SHAPE;
  TITLE1 'ROTATION: 0 DEGREE';
RUN;

REFERENCES:

Tukey, John W. (1977), Exploratory Data Analysis, Reading, MA: Addison-Wesley.

Cleveland, W. S. (1987), "Research in Statistical Graphics," Journal of the American Statistical Association, 82, 419-423.

Lewandowsky, Stephan and Spence, Ian (1989), "Discriminating Strata in Scatterplots," Journal of the American Statistical Association, 84, 682-688.

For further information contact:

Merwyn L. Elliott
Department of Decision Sciences
Georgia State University
University Plaza
Atlanta, GA 30303

SAS(R), SAS/GRAPH(R), and SAS/IML(R) are registered trademarks of SAS Institute, Inc., Cary, NC.