Reexpressing for Straight Data - Ratner - Statistical and Machine-Learning Data Mining

The ladder of powers is a method of reexpressing variables to straighten a bulging relationship between two continuous variables, say X and Y. Bulges in the data can be depicted as one of four shapes, as displayed in Figure 8.3. When the X-Y relationship has a bulge similar to any one of the four shapes, both the ladder of powers and the bulging rule, which guides the choice of “rung” in the ladder, are used to straighten out the bulge. Most data have bulges. However, when kinks or elbows characterize the data, another approach is required, which is discussed later in the chapter.

8.6.1 Ladder of Powers

Going up-ladder of powers means reexpressing a variable by raising it to a power p greater than 1. (Remember that a variable raised to the power of 1 is still that variable; $X^{1} = X$, and $Y^{1} = Y$.) The most common p values used are 2 and 3. Sometimes values higher up-ladder and in-between values like 1.33 are used. Accordingly, starting at p = 1, the data miner goes up-ladder, resulting in reexpressed variables, for X and Y, as follows:

Starting at $X^{1}$: $X^{2}, X^{3}, X^{4}, X^{5}, \ldots$

Starting at $Y^{1}$: $Y^{2}, Y^{3}, Y^{4}, Y^{5}, \ldots$

Some variables reexpressed going up-ladder have special names. Corresponding to power values 2 and 3, they are called X squared and X cubed, respectively. Similarly, for the Y variables, they are called Y squared and Y cubed, respectively.

Going down-ladder of powers means reexpressing a variable by raising it to a power p that is less than 1. The most common p values are ½, 0, -½, and -1. Sometimes, values lower down-ladder and in-between values like 0.33 are used. Also, for negative powers, the reexpressed variable now sports a negative sign (i.e., is multiplied by -1); the reason for this is theoretical and beyond the scope of this chapter. Accordingly, starting at p = 1, the data miner goes down-ladder, resulting in reexpressed variables for X and Y, as follows:

Starting at $X^{1}$: $X^{1/2}, X^{0}, -X^{-1/2}, -X^{-1}, \ldots$

Starting at $Y^{1}$: $Y^{1/2}, Y^{0}, -Y^{-1/2}, -Y^{-1}, \ldots$

[Figure 8.3: four bulge shapes, one per quadrant. Quadrant 1: X up, Y up. Quadrant 2: X down, Y up. Quadrant 3: X down, Y down. Quadrant 4: X up, Y down.]

Some reexpressed variables going down-ladder have special names. Corresponding to values ½, -½, and -1, they are called the square root of X, negative reciprocal square root of X, and negative reciprocal of X, respectively. Similarly, for the Y variables, they are called square root of Y, negative reciprocal square root of Y, and negative reciprocal of Y, respectively. The rung p = 0 needs special handling: raising a variable to the power 0 yields the constant 1, which carries no information, so the reexpression for p = 0 is conveniently defined as log to base 10. Thus, on the ladder, $X^{0}$ denotes $\log_{10} X$, and $Y^{0}$ denotes $\log_{10} Y$.
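As an illustration, the rungs described in this section can be collected into one small function. This is a minimal sketch, not code from the book: the name `reexpress` is hypothetical, and the restriction to strictly positive values is an assumption made here so that roots, reciprocals, and logs are all defined.

```python
import math

def reexpress(values, p):
    """Reexpress values at rung p of the ladder of powers.

    Assumes strictly positive inputs (an assumption of this sketch).
    """
    if any(v <= 0 for v in values):
        raise ValueError("this sketch assumes strictly positive values")
    if p == 0:
        # The p = 0 rung is defined as log to base 10.
        return [math.log10(v) for v in values]
    # Negative powers carry a minus sign, as described in the text.
    sign = -1.0 if p < 0 else 1.0
    return [sign * v ** p for v in values]

x = [1.0, 4.0, 100.0]
for p in (3, 2, 1, 0.5, 0, -0.5, -1):
    print(p, reexpress(x, p))
```

For positive data, the minus sign on the negative rungs keeps the reexpressed values in the same order as the originals, so every rung in the loop above returns an increasing sequence.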

8.6.2 Bulging Rule

The bulging rule states the following:

1. If the data have a shape similar to that in the first quadrant, then the data miner tries reexpressing by going up-ladder for X, Y, or both.

2. If the data have a shape similar to that shown in the second quadrant, then the data miner tries reexpressing by going down-ladder for X or up-ladder for Y.

3. If the data have a shape similar to that in the third quadrant, then the data miner tries reexpressing by going down-ladder for X, Y, or both.

4. If the data have a shape similar to that in the fourth quadrant, then the data miner tries reexpressing by going up-ladder for X or down-ladder for Y.
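The four cases of the bulging rule amount to a lookup from quadrant to ladder direction. The sketch below encodes them that way. The names `BULGING_RULE` and `candidate_powers` are hypothetical, and the candidate rungs are simply the most common p values named in Section 8.6.1, not an exhaustive list.

```python
# Direction on the ladder suggested for each variable, by quadrant.
BULGING_RULE = {
    1: {"X": "up", "Y": "up"},
    2: {"X": "down", "Y": "up"},
    3: {"X": "down", "Y": "down"},
    4: {"X": "up", "Y": "down"},
}

# Most common rungs in each direction (p = 0 stands for log base 10).
UP_RUNGS = (2, 3)
DOWN_RUNGS = (0.5, 0, -0.5, -1)

def candidate_powers(quadrant):
    """Return the rungs worth trying for X and for Y in a given quadrant."""
    directions = BULGING_RULE[quadrant]
    return {
        var: list(UP_RUNGS if direction == "up" else DOWN_RUNGS)
        for var, direction in directions.items()
    }

print(candidate_powers(2))
```

The rule only narrows the search; which variable to reexpress, and how far to go, remains a judgment call for the data miner.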

Reexpressing is an important, yet fallible, part of EDA detective work. While it will typically result in straightening the data, it might result in a deterioration of information. Here is why: Reexpression (going down too far) has the potential to squeeze the data so much that its values become indistinguishable, resulting in a loss of information. Expansion (going up too far) can potentially pull apart the data so much that the new far-apart values lie within an artificial range, resulting in a spurious gain of information.
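A small numeric illustration of squeezing and expansion, using three made-up values (hypothetical data, chosen only to make the effect visible): the gaps between neighboring values shrink sharply at p = -1 and explode at p = 3.

```python
values = [10.0, 100.0, 1000.0]        # hypothetical original values

down = [-1.0 / v for v in values]     # p = -1: negative reciprocal
up = [v ** 3 for v in values]         # p = 3: cube

def gaps(seq):
    """Differences between consecutive values."""
    return [b - a for a, b in zip(seq, seq[1:])]

print("original gaps:", gaps(values))
print("p = -1 gaps:  ", gaps(down))   # the larger values are squeezed together
print("p = 3 gaps:   ", gaps(up))     # the larger values are pulled far apart
```

At p = -1 the gap between the two largest values is roughly a tenth of the gap between the two smallest, while at p = 3 it is about a thousand times larger.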

Thus, reexpressing requires a careful balance between straightness and soundness. Data miners can always go to the extremes of the ladder by exerting their will to obtain a little more straightness, but they must be mindful of a consequential loss of information. Sometimes, it is evident when one has gone too far up/down on the ladder; there is a power p after which the relationship either does not improve noticeably or inexplicably bulges in the opposite direction due to a corruption of information. I recommend using discretion to avoid overstraightening and its potential deterioration of information. In addition, I caution that extreme reexpressions are sometimes due to the extreme values of the original variables. Thus, always check the maximum and minimum values of the original variables to make sure they are reasonable before reexpressing the variables.
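One way to see the point at which one has gone too far is to track a straightness measure rung by rung. The sketch below uses the Pearson correlation coefficient for that purpose, which is a choice made here for illustration. The data are hypothetical (Y is exactly the base-10 log of X, giving a quadrant-2 bulge), and the helper names `reexpress` and `pearson_r` are not from the book.

```python
import math

def reexpress(values, p):
    """Rung p of the ladder for strictly positive values (log10 at p = 0)."""
    if p == 0:
        return [math.log10(v) for v in values]
    sign = -1.0 if p < 0 else 1.0     # negative powers carry a minus sign
    return [sign * v ** p for v in values]

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: Y rises quickly and then flattens (quadrant-2 bulge).
x = [1.0, 10.0, 100.0, 1000.0, 10000.0]
y = [0.0, 1.0, 2.0, 3.0, 4.0]

# Check the extremes of the original variable before reexpressing.
print("min X:", min(x), "max X:", max(x))

# Quadrant 2 suggests going down-ladder for X; record r at each rung.
r_by_rung = {p: pearson_r(reexpress(x, p), y) for p in (1, 0.5, 0, -0.5, -1)}
for p, r in r_by_rung.items():
    print(f"p = {p:>4}: r = {r:.3f}")

best_p = max(r_by_rung, key=r_by_rung.get)
print("straightest rung:", best_p)
```

With these data, r climbs as X moves down the ladder to p = 0 and then falls again at p = -½ and p = -1, which is the pattern of going past the useful rung.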

8.6.3 Measuring Straight Data

The correlation coefficient, discussed in detail in Chapter 2, measures the strength of the straight-line or linear relationship between two variables, X and Y. However, there is an additional assumption to consider.

In Chapter 2, I referred to a “linear assumption,” in that the underlying relationship between X and Y is linear. The second assumption is an implicit one: The (X, Y) data points are at the individual level. When the (X, Y) points are analyzed at an aggregate level, such as in the logit plot and other plots presented in this chapter, the correlation coefficient based on “big” points tends to produce a “big” r value, which serves as a gross estimate of the individual-level r value. The aggregation of data diminishes the idiosyncrasies of the individual (X, Y) points, thereby increasing the resolution of the relationship, for which the r value also increases. Thus, the correlation coefficient on aggregated data serves as a gross indicator of the strength of the original X-Y relationship at hand. There is a drawback of aggregation: It often produces r values without noticeable differences because the power of the distinguishing individual-level information is lost.
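The effect of aggregation on r can be seen in a small simulation. This is a sketch under stated assumptions: the data are synthetic (a linear relationship plus heavy noise), and the "big" points are means within ten equal-sized slices of X, which is one simple form of aggregation and not necessarily the one used in the logit plot.

```python
import math
import random

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Synthetic individual-level data: Y = X + noise (hypothetical parameters).
rng = random.Random(0)
x = [rng.uniform(0, 10) for _ in range(1000)]
y = [xi + rng.gauss(0, 5) for xi in x]

# Aggregate: sort by X, cut into 10 equal-sized slices, average each slice.
pairs = sorted(zip(x, y))
n_bins = 10
size = len(pairs) // n_bins
bin_x, bin_y = [], []
for i in range(n_bins):
    chunk = pairs[i * size:(i + 1) * size]
    bin_x.append(sum(a for a, _ in chunk) / len(chunk))
    bin_y.append(sum(b for _, b in chunk) / len(chunk))

r_individual = pearson_r(x, y)
r_aggregate = pearson_r(bin_x, bin_y)
print(f"individual-level r: {r_individual:.3f}")
print(f"aggregate-level r:  {r_aggregate:.3f}")
```

Averaging within slices cancels much of the individual-level noise, so the r value on the ten "big" points comes out well above the individual-level r. This is why r values on aggregated data should be read as gross indicators only.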
