Example of a User-Defined Composite

Consider again the houses data set. Suppose that the analyst had reason to believe that the four variables total rooms, total bedrooms, population, and households were highly correlated with each other and not with other variables. The analyst could then construct the following user-defined composite:

W = a'Z = a_1(total rooms) + a_2(total bedrooms) + a_3(population) + a_4(households),

with a_i = 1/4, i = 1, ..., 4, so that composite W represented the mean of the four (standardized) variables.
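As a concrete sketch (not from the text), composite W might be computed in Python with pandas; the DataFrame name houses and the column names below are hypothetical stand-ins for the actual variables.

```python
import pandas as pd

def block_group_size(houses: pd.DataFrame) -> pd.Series:
    """Composite W: the mean of the four standardized size variables."""
    cols = ["total_rooms", "total_bedrooms", "population", "households"]  # hypothetical names
    z = (houses[cols] - houses[cols].mean()) / houses[cols].std()  # standardize each variable
    return z.mean(axis=1).rename("W")  # equal weights a_i = 1/4 give the row mean
```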

The conceptual definition of composite W is "block group size," a natural and straightforward concept. It is unlikely that all block groups are of equal size; differences in block group size may therefore account for part of the variability in the other variables. We might expect large block groups to have large values for all four variables, and small block groups to have small values for all four variables.

The analyst should seek support in the research or business literature for the conceptual definition of the composite. The evidence for the existence and relevance of the user-defined composite should be clear and convincing. For example, for composite W, the analyst may cite the National Academy of Sciences study by Hope et al. [11], which states that block groups in urban areas average 5.3 square kilometers in size, whereas block groups outside urban areas average 168 square kilometers in size. Since we may not presume that block groups inside and outside urban areas have exactly the same characteristics, block group size could conceivably be associated with differences in block group characteristics, including median housing value, the response variable. Further, the analyst could cite the U.S. Census Bureau's notice in the Federal Register [12] that population density was much lower for block groups whose size was greater than 2 square miles. Hence, block group size may be considered a "real" and relevant concept to be used in further analysis downstream.

SUMMARY

Dimension reduction methods have the goal of using the correlation structure among the predictor variables to accomplish the following:

• To reduce the number of predictor components
• To help ensure that these components are independent
• To provide a framework for interpretability of the results

In this chapter we examined the following dimension reduction methods:

• Principal components analysis
• Factor analysis
• User-defined composites

Principal components analysis (PCA) seeks to explain the correlation structure of a set of predictor variables using a smaller set of linear combinations of these variables. These linear combinations are called components. The total variability of a data set produced by the complete set of m variables can often be accounted for primarily by a smaller set of k linear combinations of these variables, which would mean that there is almost as much information in the k components as there is in the original m variables. Principal components analysis can sift through the correlation structure of the predictor variables and identify the components underlying the correlated variables. Then the principal components can be used for further analysis downstream, such as in regression analysis, classification, and so on.
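As an illustrative sketch (not part of the original text), principal components can be extracted with scikit-learn; the matrix X below is synthetic stand-in data for the m predictors.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))             # synthetic stand-in: n = 1000 records, m = 8 predictors
X_std = StandardScaler().fit_transform(X)  # standardize so PCA reflects the correlation structure
pca = PCA()                                # extract all m components for inspection
scores = pca.fit_transform(X_std)          # n x m matrix of component scores
print(pca.explained_variance_)             # eigenvalues, one per component
print(pca.explained_variance_ratio_.cumsum())  # cumulative proportion of variance explained
```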

The first principal component may be viewed in general as the single best summary of the correlations among the predictors. Specifically, this particular linear combination of the variables accounts for more variability than any other conceivable linear combination. The second principal component, Y2, is the second-best linear combination of the variables, on the condition that it is orthogonal to the first principal component. Two vectors are orthogonal if they are mathematically independent, have no correlation, and are at right angles to each other. The second component is derived from the variability that is left over once the first component has been accounted for. The third component is the third-best linear combination of the variables, on the condition that it is orthogonal to the first two components. The third component is derived from the variance remaining after the first two components have been extracted. The remaining components are defined similarly.
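Continuing the sketch above, this orthogonality can be checked numerically: the correlation matrix of the component scores should be, up to rounding error, the identity matrix.

```python
corr = np.corrcoef(scores, rowvar=False)  # correlations among the component scores
assert np.allclose(corr, np.eye(corr.shape[0]), atol=1e-8)  # off-diagonal entries are ~0
```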

The criteria used for deciding how many components to extract are the following:

• Eigenvalue criterion
• Proportion of variance explained criterion
• Minimum communality criterion
• Scree plot criterion

The eigenvalue criterion states that each component should explain at least one variable's worth of the variability; therefore, only components with eigenvalues greater than 1 should be retained. For the proportion of variance explained criterion, the analyst simply selects the components one by one until the desired proportion of variability explained is attained. The minimum communality criterion states that enough components should be extracted so that the communality of each variable exceeds a certain threshold (e.g., 50%). The scree plot criterion is this: The maximum number of components that should be extracted is just prior to where the plot begins to straighten out into a horizontal line.
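Using the pca object fitted in the earlier sketch, the eigenvalue and proportion of variance explained criteria might be applied as follows; the 85% target is illustrative only.

```python
eigenvalues = pca.explained_variance_
k_eigenvalue = int((eigenvalues > 1.0).sum())  # eigenvalue criterion: retain eigenvalues > 1

cum_var = pca.explained_variance_ratio_.cumsum()
k_proportion = int(np.searchsorted(cum_var, 0.85)) + 1  # smallest k explaining >= 85% of variance
```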

Part of the PCA output takes the form of a component matrix, with cell entries called the component weights. These component weights represent the partial correlation between a particular variable and a given component. For a component weight to be considered of practical significance, it should exceed ±0.50 in magnitude. Note that the component weight represents the correlation between the component and the variable; thus, the squared component weight represents the amount of the variable's total variability that is explained by the component. Hence, this threshold value of ±0.50 requires that at least 25% of the variable's variance be explained by a particular component.
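Continuing the same sketch, scikit-learn does not report the component matrix directly, but the component weights can be recovered by scaling each eigenvector by the square root of its eigenvalue.

```python
weights = pca.components_.T * np.sqrt(pca.explained_variance_)  # m x m component (weight) matrix
meaningful = np.abs(weights) > 0.50  # weights of practical significance
```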

PCA does not extract all the variance from the variables, only that proportion of the variance that is shared by several variables. Communality represents the proportion of variance of a particular variable that is shared with other variables. The communalities represent the overall importance of each of the variables in the PCA as a whole. Communality values are calculated as the sum of squared component weights for a given variable. Communalities less than 0.5 can be considered to be too low, since this would mean that the variable shares less than half of its variability in common with the other variables.
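The communalities then follow directly from the weights matrix in the sketch above, summing squared weights over the k retained components (here using k_eigenvalue from the earlier criterion).

```python
k = k_eigenvalue                                   # number of components retained
communalities = (weights[:, :k] ** 2).sum(axis=1)  # proportion of shared variance per variable
too_low = communalities < 0.5                      # flags variables sharing under half their variance
```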

Factor analysis is related to principal components analysis, but the two methods have different goals. Principal components analysis seeks to identify orthogonal linear combinations of the variables, to be used either for descriptive purposes or to substitute a smaller number of uncorrelated components for the original variables. In contrast, factor analysis represents a model for the data, and as such is more elaborate.

Unfortunately, the factor solutions provided by factor analysis are not invariant to transformations. Hence, the factors uncovered by the model are in essence nonunique, without further constraints. The Kaiser–Meyer–Olkin measure of sampling adequacy and Bartlett's test of sphericity are used to determine whether a sufficient level of correlation exists among the predictor variables to apply factor analysis.
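As a sketch under the assumption that the third-party factor_analyzer package is available (it is not mentioned in the text), both diagnostics can be computed from the predictor matrix X of the earlier sketch:

```python
from factor_analyzer.factor_analyzer import (
    calculate_bartlett_sphericity,
    calculate_kmo,
)

chi_square, p_value = calculate_bartlett_sphericity(X)  # Bartlett's test of sphericity
kmo_per_variable, kmo_overall = calculate_kmo(X)        # KMO measure of sampling adequacy
# Common rules of thumb: proceed with factor analysis when p_value is small
# and kmo_overall is above roughly 0.5.
```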

Factor loadings are analogous to the component weights in principal components analysis and represent the correlation between the ith variable and the jth factor. To assist in the interpretation of the factors, factor rotation may be performed. Factor rotation corresponds to a transformation (usually, orthogonal) of the coordinate axes, leading to a different set of factor loadings. Often, the first factor extracted represents a "general factor" and accounts for much of the total variability. The effect of factor rotation is to redistribute the variability explained among the second, third, and subsequent factors.

Three methods for orthogonal rotation are quartimax rotation, varimax rotation, and equimax rotation. Quartimax rotation tends to rotate the axes so that the variables have high loadings for the first factor and low loadings thereafter. Varimax rotation maximizes the variability in the loadings for the factors, with a goal of working toward the ideal column of zeros and ones for each variable. Equimax rotation seeks a compromise between the previous two methods. Oblique rotation methods are also available, in which the factors may be correlated with each other.
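A rotated factor solution might be obtained with the same factor_analyzer package; the choice of three factors below is purely illustrative.

```python
from factor_analyzer import FactorAnalyzer

fa = FactorAnalyzer(n_factors=3, rotation="varimax")  # orthogonal varimax rotation
fa.fit(X)
rotated_loadings = fa.loadings_  # factor loadings after rotation
```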

A user-defined composite is simply a linear combination of several variables into a single composite measure. Combining multiple variables into a single measure in this way provides a way to diminish the effect of measurement error. User-defined composites enable the analyst to embrace the range of model characteristics while retaining the benefits of a parsimonious model. Analysts should ensure that the conceptual definition for their user-defined composites is grounded in prior research or established practice. The variables comprising the user-defined composite should be highly correlated with each other and uncorrelated with other variables used in the analysis.
