5.2 Proposed Methodology

5.2.3 Our Algorithm

First, we center and scale $X$, resulting in the matrix $X_s$. We perform traditional factor analysis on $X_s$ and extract the factor loadings $v_{f1}, \ldots, v_{fk}$ for $k$ factors. We perform soft thresholding on the $k$ factors and normalize the results to yield $v_1, \ldots, v_k$. Next, for each factor, we update $u$ based on $X_s$ and $v$. Namely, $u_1 = X_s v_1 / \|X_s v_1\|$, and for $j = 2, \ldots, k$, $u_j = \epsilon_j$, where $\tilde{y}_j = X_s v_j$, $\tilde{X}_j = (u_1, \ldots, u_{j-1})$ denotes the first $j-1$ columns of $U$, and $\epsilon_j$ is the set of residuals from the least squares equation

$$\tilde{y}_j = \tilde{X}_j \beta_j + \epsilon_j \tag{5.6}$$
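As an illustration, the update of $u$ described above can be sketched in Python with NumPy. This is a minimal sketch under stated assumptions, not the implementation used for the reported results: the function name `update_scores` is hypothetical, the loadings are assumed to be supplied as the columns of a $p \times k$ array, and the least squares fit is done with `numpy.linalg.lstsq` without an intercept, matching (5.6) as written.

```python
import numpy as np

def update_scores(Xs, V):
    """Compute u_1, ..., u_k from scaled data Xs (n x p) and loadings V (p x k).

    Hypothetical helper: u_1 is the normalized projection Xs v_1, and each
    later u_j is the residual from regressing Xs v_j on the earlier columns.
    """
    n, k = Xs.shape[0], V.shape[1]
    U = np.zeros((n, k))
    y1 = Xs @ V[:, 0]
    U[:, 0] = y1 / np.linalg.norm(y1)        # u_1 = Xs v_1 / ||Xs v_1||
    for j in range(1, k):
        y = Xs @ V[:, j]                     # y~_j = Xs v_j
        X_prev = U[:, :j]                    # X~_j = (u_1, ..., u_{j-1})
        beta, *_ = np.linalg.lstsq(X_prev, y, rcond=None)
        U[:, j] = y - X_prev @ beta          # residuals from (5.6)
    return U
```

Because each $u_j$ is a least squares residual, it is orthogonal to all earlier columns of $U$, which is the property this step is designed to deliver.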

The following algorithm summarizes the method.

1. Find the initial estimate for $v$.
   (a) Scale and center $X$ to yield $X_s$.
   (b) Perform traditional factor analysis on $X_s$.
   (c) Extract the factor loadings $v_{f1}, \ldots, v_{fk}$.
   (d) $v_j = S(v_{fj}, \Delta_j) / \|S(v_{fj}, \Delta_j)\|_2$ for $j = 1, \ldots, k$.

2. Update $u$ based on $X_s$ and $v$.
   (a) $u_1 \leftarrow X_s v_1 / \|X_s v_1\|_2$.
   (b) For $j = 2, \ldots, k$, $u_j = \epsilon_j$ is the set of residuals from the fit of the data to the model in (5.6).

Note that the thresholds $\Delta_1, \ldots, \Delta_k$ may vary for each component.
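The thresholding and normalization of the loadings can be sketched as follows. We assume here that $S(a, \Delta)$ is the usual elementwise soft-thresholding operator, $\mathrm{sign}(a)\,(|a| - \Delta)_+$, which this section does not define explicitly. The function names and the small loading matrix are hypothetical, and the factor loadings are assumed to come from any standard factor analysis routine.

```python
import numpy as np

def soft_threshold(a, delta):
    """Assumed operator S(a, delta) = sign(a) * max(|a| - delta, 0), elementwise."""
    return np.sign(a) * np.maximum(np.abs(a) - delta, 0.0)

def sparse_loadings(Vf, deltas):
    """Threshold column j of Vf (p x k) at deltas[j], then scale it to unit L2 norm."""
    V = np.zeros_like(Vf, dtype=float)
    for j, delta in enumerate(deltas):
        s = soft_threshold(Vf[:, j], delta)
        norm = np.linalg.norm(s)
        if norm == 0.0:
            raise ValueError(f"threshold {delta} zeroes out every loading in column {j}")
        V[:, j] = s / norm
    return V

# Illustrative loadings for p = 4 variables and k = 2 factors (made-up numbers).
Vf = np.array([[0.90, 0.10],
               [0.80, -0.20],
               [0.05, 0.70],
               [-0.10, 0.60]])
V = sparse_loadings(Vf, deltas=[0.5, 0.4])   # a different threshold per component
```

In this toy example each component keeps only its two largest loadings, and passing a list of thresholds mirrors the fact that $\Delta_j$ may differ across components.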

Finally, we use the formula in (109) to calculate the percentage of variance explained by this procedure. Namely, the cumulative percentage of variance explained by the first $j$ components is given by $\mathrm{tr}(X_j' X_j)$, where $V_j = [v_1, \ldots, v_j]$ is the $p \times j$ matrix of the first $j$ loading vectors and $X_j = X_s V_j (V_j' V_j)^{-1} V_j'$ is the $n \times p$ projection of $X_s$ onto the subspace generated by $V_j$.
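A short sketch of this calculation is given next. The function name is hypothetical, and dividing $\mathrm{tr}(X_j' X_j)$ by the total variation $\mathrm{tr}(X_s' X_s)$ to obtain a proportion is our assumption about how the trace is turned into a percentage.

```python
import numpy as np

def cumulative_variance_explained(Xs, V):
    """Proportion of variance explained by the first j loading vectors, j = 1..k.

    Xs is the scaled n x p data matrix and V the p x k loading matrix.
    Normalizing by tr(Xs' Xs) is an assumption made for this sketch.
    """
    total = np.trace(Xs.T @ Xs)
    out = []
    for j in range(1, V.shape[1] + 1):
        Vj = V[:, :j]
        # X_j = Xs V_j (V_j' V_j)^{-1} V_j', the projection onto span(V_j)
        Xj = Xs @ Vj @ np.linalg.solve(Vj.T @ Vj, Vj.T)
        out.append(np.trace(Xj.T @ Xj) / total)
    return np.array(out)
```

Using the projection rather than summing per-component variances matters because the sparse loading vectors need not be orthogonal, so their individual contributions can overlap.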

Unlike the method of (147), our procedure does not iterate between these steps. Consequently, the percentage of variance explained will be less than that explained by their method. However, the trade-off is that the results of our procedure should be sparser and lend themselves to more concise interpretations, which are of interest in the factor analysis setting. In addition, the lack of iteration results in greater computational efficiency.

5.3 Simulations

This section describes simulations that demonstrate our method and compare it to a number of alternatives. Samples of size $n = 1000$ were generated with $p = 30$ covariates. For each individual, $k = 5$ factors were generated as independent normal random variables with mean zero and standard deviations 10, 20, 30, 30, 30. That is, for all $i$, $F_i = (F_{1i}, \ldots, F_{5i})'$ is a $5 \times 1$ normal random vector with $E[F_{ji}] = 0$ and $\mathrm{Var}[F_i] = \mathrm{Diag}(100, 400, 900, 900, 900)$. This corresponds to having one factor with a small amount of noise, a second factor with increased noise, and 3 additional factors that are noisier than the first two. The full matrix of all factors is given by the $n \times k$ matrix $F$.
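For concreteness, the factor matrix could be generated as in the following sketch. The seed and variable names are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)                  # arbitrary seed, for reproducibility only
n, k = 1000, 5
sds = np.array([10.0, 20.0, 30.0, 30.0, 30.0])  # factor standard deviations

# Row i holds F_i'; column j is drawn independently from N(0, sds[j]^2).
F = rng.normal(loc=0.0, scale=sds, size=(n, k))
```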

We generated the $p \times k$ normally distributed loading matrix $L$. Loadings were independent and each had standard deviation 0.1. The mean $\mu_{ij} = E[L_{ij}]$ varied based on the row $i$ and column $j$ of the matrix $L$. Namely, $\mu_{ij} = 0.8$ if $j = 1$ and $i = 1, \ldots, 6$; $\mu_{ij} = 0.7$ if $j = 2$ and $i = 7, \ldots, 12$; $\mu_{ij} = 0.6$ if $j = 3$ and $i = 13, \ldots, 18$; $\mu_{ij} = 0.5$ if $j = 4$ and $i = 19, \ldots, 24$; or $\mu_{ij} = 0.4$ if $j = 5$ and $i = 25, \ldots, 30$. In other words, the first 6 elements of column 1 have mean 0.8, the second 6 elements of column 2 have mean 0.7, the third 6 elements of column 3 have mean 0.6, the fourth 6 elements of column 4 have mean 0.5, and the last 6 elements of column 5 have mean 0.4; all other elements have mean zero. The standard deviation of 0.1 for all elements means that all of the loadings are nonzero: some are essentially just noise, and others are larger and more meaningful but have a small amount of noise as well.
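The block structure of the loading means can be written compactly, as in this sketch. Variable names and the seed are arbitrary, and Python's zero-based indexing shifts the row ranges by one relative to the text.

```python
import numpy as np

rng = np.random.default_rng(1)            # arbitrary seed
p, k = 30, 5
block_means = [0.8, 0.7, 0.6, 0.5, 0.4]   # mean of the 6 "meaningful" loadings per column

# M[i, j] = mu_ij: rows 6j, ..., 6j+5 of column j get block_means[j]; all others are 0.
M = np.zeros((p, k))
for j, mu in enumerate(block_means):
    M[6 * j:6 * (j + 1), j] = mu

# Independent normal loadings with mean M and standard deviation 0.1.
L = rng.normal(loc=M, scale=0.1)
```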

The elements of the $n \times 1$ error vector $\epsilon$ were also independent and normally distributed with mean 0 and variance 1. As in the methodology section, $X = F L' + \epsilon$.

For each individual, a binary outcome $Y_i$ was generated based on the scaled factors, $F_s$. We used

$$\operatorname{logit} P(Y = 1) = \tilde{F} \alpha$$

where $\alpha$ is a $(k+1) \times 1$ vector of ones, $F_s$ is an $n \times k$ matrix of the factors, scaled to have mean zero and unit variance, and $\tilde{F} = (1, F_s)$ is an $n \times (k+1)$ matrix with the first column consisting of all 1's and the remaining entries identical to $F_s$. Here, $Y = (Y_1, \ldots, Y_n)'$.
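The outcome model can be illustrated with the sketch below. The function name is hypothetical, and scaling by the sample standard deviation (`ddof=1`) is an assumption, since the text only states that the factors are scaled to unit variance.

```python
import numpy as np

def simulate_outcome(F, rng):
    """Draw binary outcomes from logit P(Y = 1) = F~ alpha, with alpha a vector of ones."""
    # Scale each factor to mean zero and unit variance (sample SD is an assumption).
    Fs = (F - F.mean(axis=0)) / F.std(axis=0, ddof=1)
    F_tilde = np.column_stack([np.ones(F.shape[0]), Fs])   # F~ = (1, Fs)
    alpha = np.ones(F_tilde.shape[1])                      # (k + 1) vector of ones
    eta = F_tilde @ alpha                                  # linear predictor
    prob = 1.0 / (1.0 + np.exp(-eta))                      # inverse logit
    Y = rng.binomial(1, prob)                              # one Bernoulli draw per individual
    return Y, prob
```

With $\alpha$ equal to a vector of ones, every scaled factor contributes equally and positively to the log-odds, which is why positive associations are the intended result in the comparisons that follow.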

We compared the performance of our method with traditional factor analysis, three applications of (147), and two applications of (157). The different applications of the competing methods varied based on the level of sparsity specified. We used c(0.06, 0.16, 0.1, 0.5, 0.5) and c(7, 4, 4, 1, 1) in the (157) function, which corresponded to low and high sparsity, respectively. The (147) applications varied based on $1$, $\sqrt{p}/2$, and $\sqrt{p}$, which correspond to high, medium, and low sparsity, where $p$ denotes the number of variables under consideration. For our method, we also investigated soft threshold values of 0.4, 0.5, 0.6, 0.7, and 0.8 for all components, as well as one scenario with increasing thresholds for each component and one scenario with decreasing thresholds for the components.

Simulations were run independently 1000 times. For each run, we fit logistic regression models of the outcome based on each component separately for each method and recorded the parameter estimate and corresponding p-value. We recorded the average percent variance explained, the average parameter estimate (Table 5.1), and the average p-value (Table 5.2) for each of the five components. We also noted the number of variables with nonzero loadings for each component and each method (Table 5.3). We refer to our proposed method as Sparse Factor Analysis (SFA).
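The per-component logistic regression can be illustrated with a small hand-rolled fit. This is only a sketch: the text does not say which software was used, the function name is hypothetical, the fit uses plain Newton-Raphson iterations with an intercept, and the p-value is the two-sided Wald test based on the normal approximation.

```python
import math
import numpy as np

def logistic_fit(u, y, n_iter=25):
    """Fit logit P(Y = 1) = b0 + b1 * u; return the slope and its Wald p-value."""
    X = np.column_stack([np.ones(len(u)), u])
    beta = np.zeros(2)
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-(X @ beta)))
        W = mu * (1.0 - mu)
        grad = X.T @ (y - mu)                 # score vector
        hess = X.T @ (X * W[:, None])         # observed information
        beta = beta + np.linalg.solve(hess, grad)
    # Recompute the information at the final estimate for the standard error.
    mu = 1.0 / (1.0 + np.exp(-(X @ beta)))
    info = X.T @ (X * (mu * (1.0 - mu))[:, None])
    se = math.sqrt(np.linalg.inv(info)[1, 1])
    z = beta[1] / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))   # equals 2 * (1 - Phi(|z|))
    return beta[1], p_value
```

In a simulation loop, a function like this would be called once per component and per method in each run, with the slopes and p-values then averaged over runs.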

Our method captured a large percentage of the variance, around 71-82%, depending on the thresholds used. Traditional factor analysis and the non-sparse methods of (147) and (157) explained 91%. Sparse applications of (147) and (157) explained far less variance (<40%).

For all thresholds, parameter estimates for all components were comparable in magnitude to those for factor analysis but much larger than for the competing methods. P-values were strongly significant for all methods. The larger parameter estimates indicate that the factor scores from our method were more strongly related to the outcome than were the scores from the competing methods. The associations were positive for our method, as intended, and often negative for the other methods. Moreover, our method was clearly sparser than traditional factor analysis and all but the most sparse versions of (147) and (157). We retained an average of about 6 variables per component with nonzero loadings, compared to all or nearly all of the 30 variables for these other methods. This was favorable, as we had generated only 6 variables to be meaningful for each component and generated all others to represent noise.

In conclusion, the simulations show that for various thresholds, our method results in sparse factor scores that still explain a majority of the variance and are correlated with the outcome of interest. These were exactly the properties that were desired from a practical standpoint.
