Penalization Methods for Group Identification and Variable Selection in Models with Correlated Predictors.

(1)

Abstract

SHARMA, DHRUV BHUSHAN. Penalization Methods for Group Identification and Variable Selection in Models with Correlated Predictors. (Under the direction of Howard D. Bondell, Ph.D.)

This dissertation consists of two projects related to the study and development of

penalization methods for group identification and variable selection in models with

cor-related predictors. Statistical procedures for variable selection have become integral

elements in any analysis. Successful procedures are characterized by high predictive

ac-curacy, yielding interpretable models while retaining computational efficiency. Penalized

methods that perform coefficient shrinkage have been shown to be successful in many

cases. Models with correlated predictors are particularly challenging to tackle and these

are the main focus of this dissertation. In the first part of this dissertation we focus

on developing a penalization method for regression models. We propose a penalization

procedure that performs variable selection while clustering groups of predictors

automati-cally. The oracle properties of this procedure including consistency in group identification

are also studied. An efficient algorithm based on a quadratic approximation is proposed.

The procedure compares favorably with existing selection approaches in both prediction

accuracy and model discovery, while retaining its computational efficiency.

In the second part we focus on variable selection in high dimensional binary

classifi-cation problems. Gene selection in studies of disease classificlassifi-cation using gene expression

data is a challenging problem due to the “high dimensional low sample size” nature of the

data. Support vector machines are a classification tool with successful classification

per-formance in studies of “high dimensional low sample size” data that have recently been

(2)

are difficult to analyze since many genes that predict disease are often also correlated. To

this end we propose a penalization approach that smoothes coefficients together, setting

them as equal, while eliminating redundant variables, thus aiding in the classification of

disease. This approach is shown to be superior in many cases where genes are correlated.

Additional advantages of using this method over existing methods include the data

adap-tive nature of the penalty and the computational conveniences of the method, including

an easily applicable algorithm. The procedure compares favorably with existing selection

approaches in both classification accuracy and model discovery in simulation studies and

the analysis of microarray gene expression cancer classification data, while retaining its

(3)

Penalization Methods for Group Identification and Variable Selection in Models with Correlated Predictors

by

Dhruv Bhushan Sharma

A dissertation submitted to the Graduate Faculty of North Carolina State University

in partial fulfillment of the requirements for the Degree of

Doctor of Philosophy

Statistics

Raleigh, North Carolina

2010

APPROVED BY:

Leonard A. Stefanski, Ph.D Brian J. Reich, Ph.D

Hao Helen Zhang, Ph.D Co-chair of Advisory Committee

(4)

Dedication

(5)

Biography

Dhruv was born in Chennai (Madras), the coastal city in south-east India. He attended

Rishi Valley School, a boarding school in the hills of southern India for eight years before

studying Statistics at Loyola College in Chennai. He moved to NC State University

in 2004 to pursue a Ph.D in Statistics. He moves to Boston to pursue a postdoc in

(6)

Acknowledgements

This dissertation would not have been possible without the support of my advisors,

Howard Bondell and Helen Zhang. I will always be grateful and indebted to them. I

want to thank Len Stefanski for introducing me to academic research by giving me the

opportunity to help edit his book in 2005. I want to thank Sujit Ghosh for encouraging

me to present our project at the conference in Salt Lake City in 2007. It was a valuable

and enjoyable experience at a time when I was more involved in course work. Getting

a dissertation is an academic endeavor that is possible only under the right conditions.

These conditions are made possible only with the dedication of our teachers, so I would

like to thank all my teachers, from my past to the present, for the knowledge they have

shared with me.

Bill Swallow, who recruited me to attend NC State, has been a wonderful person to

talk to for guidance and otherwise. He has always been there to talk things over lunch or

a cup of coffee, even after officially retiring from the department. Jason Osborne taught

me applied statistics my first semester here and has been more than just a teacher. We

met at a party in my first week at Raleigh and I’ve enjoyed getting to know him socially

through these years. Jason and I drove to Charlotte to see Mark Knopfler perform in

a very memorable concert. Jeff Thompson has been an inspiration for his attitude to

teaching and a great mentor. I met Terry Byron my first day at NC State and we have

been friends throughout my many years here. His excellent tech advice and ability to

make all my computer problems disappear will constantly remind me of how privileged

I am.

I wish to acknowledge the contributions of my friends. They have shared their lives

(7)

Hamilton is one of the most interesting conversationists I know, most of our conversations

were far more educational than school work. McKay Curtis encouraged me constantly

with his intellect, wisdom and humor. Chris Franck has shared many highs and lows of

graduate school with me, always with his great sense of humor. I have learned many new

things, statistics or otherwise, from my conversations with Ian Fiske. Aijing Starr has

been a very positive influence on me during the latter stages of this process. I hope to

someday be as cheerful as her. I thank them all.

I would like to thank my family for all the wonderful opportunities that were offered

to me. I received a great education and had the support and encouragement to pursue

(8)

List of Tables

Table 2.1 Results for Example 1 and Example 2 . . . 44

Table 2.5 Results for Real Data Examples . . . 48

Table 3.1 Results for SVM Simulation Study . . . 64

(11)

List of Figures

Figure 2.1 Illustration to represent the flexibility of the PACS approach over the OSCAR approach in terms of the shaded constraint regions in the (β1,β2) plane. . . 20

Figure 2.2 Graphical representation to represent the flexibility of the PACS approach over the OSCAR approach in the (β1,β2) plane. All

fig-ures represent correlation of ρ = 0.85. The top panel has OLS solution ˜βOLS = (1,2) while the bottom panel has ˜βOLS = (0.1,2).

(12)

Chapter 1 Introduction

The collection of large quantities of data is becoming increasingly common with advances

in technical research. These changes have opened up many new statistical challenges;

the availability of high dimensional data often brings with it large quantities of noise and

redundant information. Separating the noise from the meaningful signal has led to the

development of new statistical techniques for variable selection. For instance we may wish

to predict the plasma concentrations of beta-carotene based on personal characteristics

and dietary factors; many of which may not help in predicting the plasma levels. This

is a regression problem since the response is quantitatively measured. We may also wish

to predict the type of leukemia cancer based on an individual’s genetic characteristic;

many of which may not help in predicting the disease. This problem can be viewed

as a classification problem since we wish to classify the type of disease. Both types of

(13)

1.1 Regression

1.1.1 Background

Consider the usual linear regression model setup with n observations and p predictors

given by

y=Xβ+,

where is a random vector with E() = 0 and V ar() = σ2_I_{. Let the response vector}

be y= (y1, ..., yn)T, the jth predictor of the design matrix X be xj=(x1j, ..., xnj)T for

j=1,...,p and the vector of coefficients be β = (β1, ..., βp)T. Assume the data is centered

and the predictors are standardized to have unitL2 norm, so that

n

X

i=1

yi = 0, n

X

i=1

xij = 0

and

n

X

i=1

x2_ij = 1 for all j=1,...,p. This allows the predictors to be put on a comparative

scale, and the intercept may be omitted. The ordinary least squares (OLS) coefficient

estimates are the minimizers w.r.t. β of the residual sums-of-squares, given by

ky−

p

X

j=1

xjβjk2. (1.1)

When we have large quantities of data, many of the predictors have little to no effect on

the response making variable selection a necessary step in the analysis. The reasons for

this as described in Hastie et al. (2001) are the following:

• Prediction accuracy: the least squares estimates often have low bias but large

variance. Improvements can be made by shrinking or setting some of the coefficients

(14)

• Interpretation: with a large number of predictors, we are interested in eliminating

the redundant predictors from the model and retaining a smaller subset of the

predictors that exhibit the strongest effect on the response.

Subset selection (see Miller, 2002) is a conventional method of performing variable

se-lection. In subset selection, a subset of the predictors are included in the model while

the rest are eliminated and least squares regression is used to estimate the coefficients

of the retained predictors. Thus if we were looking at all subsets of a model with p

predictors, we could have 2p _{possible models to choose from, where each predictor is}

either included or excluded from the model. As we can see from this description, subset

selection becomes increasingly expensive as p increases. Alternatives to this process of

searching through all possible subsets, such as Forward stepwise selection which starts

with an empty model and sequentially adds predictors into the model that most improves

the fit or Backward stepwise selection which starts with a full model and sequentially

eliminates the least significant predictors from the model have been studied extensively.

Hybrid stepwise methods that consider both forward and backward selection have also

been studied.

Subset selection methods are easy to interpret but are a discrete search process,

whereby variables are either included or excluded from the model, often exhibiting high

variance. Penalization techniques for variable selection in regression models are an

alter-native that are more continuous and do not suffer as much from this variability. They

have become increasingly popular as they perform variable selection while simultaneous

(15)

1.1.2 Penalization Techniques

A penalization technique can be described as follows. Suppose Pλ(·) is a penalty on the

coefficient vector andλis its corresponding non-negative penalization parameter. From a

loss and penalty setup, estimates from a penalization scheme for the least squares model

are given as the minimizers of

ky−

p

X

j=1

xjβjk2+Pλ(β). (1.2)

The most commonly used penalization scheme for regression is ridge regression (Hoerl

and Kennard, 1970), whose estimates are given as the minimizers of

ky−

p

X

j=1

xjβjk2+λ p

X

j=1

β_j2. (1.3)

When predictors are correlated, their least square coefficient estimates are poorly

deter-mined and have high variance. Ridge regression was used to control for this variability

by imposing a penalty on the size of the coefficients and pulling them closer to zero. In

ridge regression, the penalization parameter λ controls the amount that the coefficients

are shrunk towards zero as λ increases. The L2-norm penalty on the coefficients, also

known as the ridge penalty, introduces bias in the estimation process by shrinking least

squares estimates towards zero, but improves on the least squares estimates in terms

of having lower variance. Although ridge regression does not perform variable selection

(since it does not set coefficients exactly zero), it acts as a precursor to other

penaliza-tion techniques for variable selecpenaliza-tion with the introducpenaliza-tion of continuous shrinkage of

coefficients.

(16)

(LASSO, Tibshirani, 1996) opened the doors to performing variable selection via

penal-ization. The LASSO replaces the L2-norm penalty on the coefficients with an L1-norm

penalty, which exactly sets coefficients to zero, thus performing variable selection. From

a loss and penalty setup, estimates from the LASSO for the least squares model are given

as the minimizers of

ky−

p

X

j=1

xjβjk2+λ p

X

j=1

|βj|. (1.4)

As the penalization parameter λ increases, more coefficient estimates are encouraged

to be continuously shrunk towards zero. The LASSO estimates can be computed via

quadratic programming, using a shooting algorithm by Fu (1998), or using the efficient

least angle regression (LARS) algorithm by Efron et al. (2004). The use of the L1-norm

penalty in the LASSO has been the foundation to many other penalization techniques

for variable selection.

Another early approach is the nonnegative garrote (Breiman, 1995) which shrinks the

OLS solution by a factor cj ≥ 0 for j=1,...,p. If ˜β is the OLS estimate of β, then the

nonnegative garrote estimates minimizew.r.t. c= (c1, ..., cp)

ky−

p

X

j=1

xjβ˜jcjk2+λ p

X

j=1

cj, (1.5)

and give the estimate ˆβj = ˆcjβj˜ for j=1,...,p. Small values of cj shrink the estimates

closer towards zero, eventually setting them to zero when cj = 0. The nonnegativity of

cjensures that the nonnegative garrote estimates have the same sign as their least-squares

estimates.

(17)

LASSO and the ridge regression properties that controlled for the variability attributed

to correlated predictors. The penalty, called the elastic net penalty, penalizes both the

L1-norm and the L2-norm of the coefficients. The elastic net estimates are given as the

minimizers of

ky−

p

X

j=1

xjβjk2+λ1

p

X

j=1

|βj|+λ2

p

X

j=1

β_j2. (1.6)

Although the elastic net can be computed via an efficient modified LARS algorithm, it

still suffers from the computational burden of having two tuning parameters.

Tibshirani et al. (2005) proposed the fused LASSO for studying models with predictors

that can be ordered in some meaningful manner. The fused LASSO estimates are given

as the minimizers of

ky−

p

X

j=1

xjβjk2+λ1

p

X

j=1

|βj|+λ2

p

X

j=2

|βj−βj−1|. (1.7)

The fused lasso shrinks adjacent coefficients towards each other, possibly setting them as

equal and also shrinks coefficients towards zero. The octagonal shrinkage and clustering

algorithm for regression (OSCAR, Bondell and Reich, 2008) deals with correlated groups

of predictors. If correlated predictors have similar effects (up to a change in sign) on the

response, the OSCAR aims to identify these groups by assigning identical coefficients

to each element in the group up to a change in sign, while simultaneously eliminating

extraneous predictors. The OSCAR estimates can be defined as the solution to

ky−

p

X

j=1

xjβjk2+λ1

p

X

j=1

|βj|+λ2

X

1≤j<k≤p

max{|βj|,|βk|}. (1.8)

(18)

The first, λ1, penalizes the L1 norm of the coefficients and encourages sparseness. The

second, λ2, penalizes the pair-wise L∞ norm of the coefficients encouraging equality of

coefficients.

The group LASSO (Yuan and Lin, 2006) is a penalization method that takes

pre-specified subsets of groups of predictors and removes them from the model as a group.

Suppose there areG pre-specified groups of predictors andXj is ann bymj forj=1,...,G

is the design matrix whose columns correspond to the predictors in the group. If βj

is an mj length vector of coefficients, Kj is a mj by mj positive definite matrix and

kβjkKj = (βjTKjβj)1/2 for j=1,...,G, the group LASSO estimates are given as

ky−

p

X

j=1

Xjβjk2+λ1

G

X

j=1

kβjkKj. (1.9)

Yuan and Lin (2006) also studied this grouping nature in other penalties like the

nonneg-ative garotte. Unlike the OSCAR penalty, the group LASSO does not identify groups, it

takes them as pre-specified inputs.

For all penalization methods, the tuning parameter is selected using various methods.

In a data rich environment, a validation set could be used to compute error. If no

validation set is available, techniques such as cross-validation or an information criterion

such as Mallow’s Cp (Mallows, 1973) or Akaike Information Criterion (AIC, Akaike,

1974) or Bayesian Information Criterion (BIC, Schwarz, 1978) could be used to select

the model.

1.1.3 Oracle Procedures

The smoothly clipped absolute deviation (SCAD, Fan and Li, 2001) penalty modifies the

(19)

intro-duces an additional tuning parameter a ≥ 2. From a loss and penalty setup, estimates

from the SCAD for the least squares model are given as the minimizers of

ky−

p

X

j=1

xjβjk2+ p

X

j=1

Pλ(|β|) (1.10)

where the penalty Pλ(|β|) is given by

Pλ(|β|) =

     

    

λ|β| if |β| ≤λ,

−|β|2−2₂₍_aaλ₋₁₎|β|+λ2 if λ <|β| ≤aλ

(a+1)λ2

2 if |β|> aλ

(1.11)

Fan and Li (2001) introduced a framework for viewing successful penalization

dures for variable selection. Fan and Li (2001) discussed the concept of “oracle”

proce-dures and the general theoretical framework for the asymptotic behavior of penalization

methods. An oracle procedure is one that should consistently identify the correct model

and achieve the optimal estimation accuracy. Suppose A = {i : βi 6= 0, i = 1, ..., p}

denotes the set of indices for which β is truly non-zero andAn denotes the set of indices

that are estimated to be non-zero. Consider β∗, a vector of length po ≤p that denotes

the vector of the non-zero regression coefficients derived from A. Let A∗ be the po ×p

full rank matrix such thatβ∗ =A∗β. For example, if a coefficient in β is not in the true

model, then every element of the column of A∗ corresponding to it would equal 0. If ˆβ

are the estimates from the penalization procedure, then the oracle properties are

• Consistency in structure identification: limnP(An =A) = 1.

(20)

where Σ is the asymptotic covariance matrix of the non-zero regression coefficients if

the subset of true predictors were known. The SCAD is an oracle procedure. The main

drawback of the SCAD over the LASSO is that the SCAD penalty is non-convex, making

computing the SCAD estimates challenging.

Since the introduction of the concept of an oracle procedure, adaptive weighting has

become a successful technique of constructing oracle procedures. Consider the adaptive

LASSO (Zou, 2006), whose estimates are given as the minimizers of

ky−

p

X

j=1

xjβjk2+λ p

X

j=1

|βj|

|β˜j|γ

, (1.12)

for γ > 0 and where ˜β is a consistent estimator of β. The adaptive LASSO is an oracle

procedure. The reasoning behind the adaptive weights in the penalty is to ensure that

estimates of larger coefficients are penalized less while those that are truly zero have large

penalization. The adaptive LASSO was separately studied in other contexts including

survival models by Zhang and Lu (2007) and least absolute deviation (LAD) models

by Wang et al. (2007). We note here that the nonnegative garotte is equivalent to the

adaptive LASSO with γ = 1 coupled with a sign restriction (Yuan and Lin, 2007). In

the analysis of variance (ANOVA) context the collapsing and shrinkage in analysis of

variance (CAS-ANOVA, Bondell and Reich, 2009) studied oracle properties for their

proposed method in the ANOVA context.

1.2 Classification with Support Vector Machines

Support vector machines (SVMs, Vapnik 1996, Hastie et al. 2001) are a popular tool for

(21)

features of training data {(xi, yi), i = 1, ..., n} where the input xi ∈ Rp, and the output yi ∈ {−1,+1} is binary. A classification rule δ is a mapping from x to{−1,+1}, where

a label δ(x) is assigned to the input. Under the 0-1 loss conditioning on x, the risk of δ

isR(δ) =E(I[y6=δ(x)]|x) = P(y 6=δ(x)|x). Minimum risk is the Bayes risk, minimized by the Bayes rule given by

arg max_c_∈{−1_,_+1}P(y=c|x). (1.13)

The standard linear SVM finds a hyperplaneα+βT_x₍_α_{is a constant and}_β _{= (}_β

1, ..., βp)T

is a directional vector) that creates the largest margin between training points for the

two classes or separates the two classes by the largest distance, if the two classes are

separable. If the two classes are not separable, suppose B is a pre-specified positive

constant that controls the overlap between the two classes and letξi ≥0 for i=1,...,n be

slack variables, then the SVM can be viewed as an optimization problem given by

max

α,β

1

βT_β subject toyi(α+β T_x

i)≥1−ξi,i=1,...,n, n

X

i=1

ξi ≤B. (1.14)

The SVM has an equivalent formulation as a penalized hinge loss with a 2-norm penalty.

Suppose λ is a non-negative tuning parameter, then the linear SVM fits a classifier

ˆ

f(x) = ˆα+ ˆβT_x _{and the classification rule sgn( ˆ}_f₍_x_{)) by minimizing} _w.r.t. ₍_{α, β}₎ n

X

i=1

[1−yi(α+βTxi)]++λ

p

X

j=1

β_j2. (1.15)

Here z+ = max(0, z) and L(yf) = [1−yf]+ is known as the hinge loss function. The

L2-norm penalty associated with the non-negative tuning parameter, λ, is known as a

(22)

as the weight decay in neural networks. The tuning parameter that corresponds to the

best model is usually chosen by cross-validation or by using an independent validation

set.

Support vector machines are known for building classifiers with superior classification

accuracy in “high dimension low sample size” situations where the number of features

exceeds the number of observations orpn. SVMs derive their name from the property

of the classifiers to be based on only a subset of the training data, called “support vectors”.

As with other linear methods, SVMs can be made more flexible by expanding the feature

space to include non-linear feature spaces. To this end suppose_D={h1(·), ..., hq(·)}is a

dictionary of basis functions,α is a constant, β = (β1, ..., βq)T is a directional vector and λis a tuning parameter, then the non-linear SVM fits a classifier ˆf(x) = ˆα+ ˆβT_h₍_x_{) and}

classification rule sgn( ˆf(x)) by minimizing w.r.t. (α, β)

n

X

i=1

[1−yi(α+βTh(xi))]++λ

q

X

j=1

β_j2.

This expansion often aids in fitting classifiers with better classification accuracy than their

linear counterparts. Although non-linear SVMs are more flexible than linear SVMs, we

note that when interested in variable selection, it is easier to distinguish features in linear

SVMs and so we restrict our study to the linear case. The formulation of the SVM as a

penalized hinge loss was noted for its similarities to penalized methods in regression and

the use of penalization for variable selection. We explore and expand on this theme of

(23)

1.3 Dissertation Outline

In this dissertation we propose two penalization procedures for variable selection. In

Chapter 2, we introduce a consistent group identification and variable selection

proce-dure for regression models. It is well known that models with correlated predictors are

particularly challenging to tackle and methods that identify or group these correlated

predictors automatically are desired. To this end we propose a penalization procedure

that performs variable selection while clustering groups of predictors automatically. The

oracle properties of this procedure including consistency in group identification are also

studied. An efficient algorithm based on a quadratic approximation is proposed. The

pro-cedure compares favorably with existing selection approaches in both prediction accuracy

and model discovery, while retaining its computational efficiency.

In Chapter 3, we study a penalization approach for variable selection in SVM binary

classification models of “high dimensional low sample size” nature. SVMs have been

successfully used in studies of disease classification using gene expression data where

the goal is to simultaneously classify disease and perform gene selection. Such studies

are difficult to analyze since many genes that predict disease are often also correlated.

We propose a penalization approach that smoothes coefficients together, setting them as

equal, while eliminating redundant variables, thus aiding in the classification of disease.

This approach is shown to be superior in many cases where genes are correlated.

Addi-tional advantages of using this method over existing methods include the data adaptive

nature of the penalty and the computational conveniences of the method. The procedure

compares favorably with existing selection approaches in both prediction accuracy and

model discovery in simulation studies and the analysis of microarray gene expression

(24)

Chapter 2 Consistent Group Identification and

Variable Selection in Regression Models

2.1 Introduction

The collection of large quantities of data is becoming increasingly common with advances

in technical research. These changes have opened up many new statistical challenges;

the availability of high dimensional data often brings with it large quantities of noise and

redundant information. Separating the noise from the meaningful signal has led to the

development of new statistical techniques for variable selection.

Consider the usual linear regression model setup withn observations andp predictors

given by

y=Xβ+,

(25)

be y= (y1, ..., yn)T, the jth predictor of the design matrix X be xj=(x1j, ..., xnj)T for

j=1,...,p and the vector of coefficients be β = (β1, ..., βp)T. Assume the data is centered

and the predictors are standardized to have unitL2 norm, so that

n

X

i=1

yi = 0, n

X

i=1

xij = 0

and

n

X

i=1

x2_ij = 1 for all j=1,...,p. This allows the predictors to be put on a comparative scale, and the intercept may be omitted.

In a high dimensional environment this model could include many predictors that do

not contribute to the response or have an indistinguishable effect on the response, making

variable selection and the identification of the true underlying model an important and

necessary step in the statistical analysis. To this end, consider the true underlying linear

regression model structure given by

y=Xoβ∗+,

whereXo is then×po true design matrix obtained by removing unimportant predictors

and combining columns of predictors with indistinguishable coefficients and β∗ is the

corresponding true coefficient vector of length po. For example, if the coefficients of

two predictors are truly equal in magnitude, we would combine these two columns of

the design matrix by their sum and if a coefficient were truly zero, we would exclude

the corresponding predictor. In practice, the discovery of the true model raises two

issues; the exclusion of unimportant predictors and the combination of predictors with

indistinguishable coefficients. Most existing variable selection approaches can exclude

unimportant predictors but fail to combine predictors with indistinguishable coefficients.

In this paper we explore a variable selection approach that achieves both goals.

Penalization schemes for regression such as ridge regression (Hoerl and Kennard,

(26)

coefficient vector andλis its corresponding non-negative penalization parameter. From a

loss and penalty setup, estimates from a penalization scheme for the least squares model

are given as the minimizers of

ky−

p

X

j=1

xjβjk2+Pλ(β).

Penalization techniques for variable selection in regression models have become

increas-ingly popular, as they perform variable selection while simultaneous estimating the

co-efficients in the model. Examples include nonnegative garrote (Breiman, 1995), least

absolute shrinkage and selection operator (LASSO, Tibshirani, 1996), smoothly clipped

absolute deviation (SCAD, Fan and Li, 2001), elastic net (Zou and Hastie, 2005), fused

LASSO (Tibshirani et al., 2005), adaptive LASSO (Zou, 2006) and group LASSO (Yuan

and Lin, 2006). Although these approaches are popular, they fail to combine predictors

with indistinguishable coefficients. Recent additions to the literature including

octago-nal shrinkage and clustering algorithm for regression (OSCAR, Bondell and Reich, 2008)

and collapsing and shrinkage in analysis of variance (CAS-ANOVA, Bondell and Reich,

2009) were studied to address this limitation in the linear regression and ANOVA context

respectively.

Supervised clustering is an approach to address the combined effect of predictors

with indistinguishable coefficients. The process aims to identify meaningful groups of

predictors that form predictive clusters; such as a group of highly correlated predictors

that have a unified effect on the response. The OSCAR is a penalization procedure for

simultaneous variable selection and supervised clustering. It identifies a predictive cluster

by assigning identical coefficients to each element in the group up to a change in sign,

(27)

defined as the solution to

ky−

p

X

j=1

xjβjk2+λ1

p

X

j=1

|βj|+λ2

X

1≤j<k≤p

max{|βj|,|βk|}. (2.1)

The OSCAR penalty contains two penalization parameters λ1 and λ2, for λ1, λ2 ≥ 0.

The first, λ1, penalizes the L1 norm of the coefficients and encourages sparseness. The

second, λ2, penalizes the pair-wise L∞ norm of the coefficients encouraging equality of

coefficients.

Although the OSCAR performs effectively in practice, we note that the procedure

has certain limitations. The method defined in (2.1) is implemented via quadratic

pro-gramming of order p2_{, which becomes increasingly expensive as} _p _{increases. Another}

limitation of the OSCAR is that it is not an oracle procedure. An oracle procedure

(Fan and Li, 2001) is one that should consistently identify the correct model and achieve

the optimal estimation accuracy. That is, asymptotically, the procedure performs as

well as performing standard least-squares analysis on the correct model, were it known

beforehand.

Adaptive weighting is a successful technique to constructing oracle procedures. Zou

(2006) showed oracle properties for the adaptive LASSO in linear models and generalized

linear models (GLMs) by incorporating data dependent in the penalty. Oracle properties

for adaptive LASSO was separately studied in other contexts including survival models by

Zhang and Lu (2007) and least absolute deviation (LAD) models by Wang et al. (2007).

The reasoning behind these weights is to ensure that estimates of larger coefficients are

penalized less while those that are truly zero have unbounded penalization. Note that

all of these procedures define an oracle procedure solely based on selecting variables, not

(28)

must also consistently identify the the group of indistinguishable coefficients.

Bondell and Reich (2009) showed oracle properties for the full selection and grouping

structure of the CAS-ANOVA procedure in the ANOVA context, using similar arguments

of adaptive weighting. In this paper, we show that the OSCAR penalty does not lend

itself intuitively to data adaptive weighting. Weighting the pairwise L∞ norm by an

initial estimate, as is the typical case, fails in the following simple scenario. Suppose

two coefficients in the pair-wise term are small but should be grouped, then an initial

estimate of this quantity would shrink both to zero instead of setting them as equal.

These limitations are the main motivations of the methods in this chapter. Our

goal in this chapter is to find an oracle procedure for simultaneous group identification

and variable selection, and to address the limitations of existing methods, including the

computational burden. The remainder of this chapter proceeds as follows. Section 2.2

proposes a penalization procedure for simultaneous grouping and variable selection along

with an algorithm efficient for computation. Section 2.3 studies theoretical properties

of the procedure. Section 2.4 discusses extensions of the method to GLMs. Section 2.5

contains a simulation study and the analysis of real examples. A discussion concludes in

Section 2.6.

2.2 Pairwise Absolute Clustering and Sparsity

2.2.1 Methodology

In this section, we consider a new penalization scheme for simultaneous group

identi-fication and variable selection. We introduce an alternative to the pairwise L∞ norm

(29)

coefficients with opposite signs are desired to be grouped together in the presence of high

negative correlation, leaving the problem equivalent to a sign change of the predictors.

In our setup, the equality of coefficients is achieved by penalizing the pairwise

differ-ences and pairwise sums of coefficients. In particular, we propose a penalization scheme

with non-negative weights, w, whose estimates are the minimizers of

ky− p

X

j=1

xjβjk2+λ{ p

X

j=1

wj|βj|+

X

1≤j<k≤p

w_jk(−)|βk−βj| +

X

1≤j<k≤p

w_jk₍₊₎|βj+βk|}. (2.2)

The penalty in (2.2) consists of a weighted L1 norm of the coefficients that

encour-ages sparseness and a penalty on the differences and sums of pairs of coefficients that

encourages equality of coefficients. The weighted penalty on the differences of pairs of

co-efficients encourages coco-efficients with the same sign to be set as equal, while the weighted

penalty on the sums of pairs of coefficients encourages coefficients with opposite sign to

be set as equal in magnitude. The weights are pre-specified non-negative numbers, and

their choices will be studied in Section 2.2.2. We call this penalty Pairwise Absolute

Clustering and Sparsity (PACS) penalty and for the remainder of this article we will

refer to the PACS procedure in the form given in (2.2). The PACS objective function is

a convex function since it is a sum of convex functions; in particular, if XT_X _{is full rank}

then it is strictly convex.

We note here that a specific case of the PACS turns out to be an equivalent

repre-sentation for the OSCAR. It can be shown that

max{|βj|,|βk|}=

1

2{|βk−βj| +|βj +βk|}.

(30)

minimizers of

ky− p

X

j=1

xjβjk2+λ{ p

X

j=1

c|βj|+

X

1≤j<k≤p

0.5(1−c)|βj−βk| +

X

1≤j<k≤p

0.5(1−c)|βj+βk|}.

Hence the OSCAR can be regarded as a special case of the PACS.

The PACS formulation enjoys certain advantages over the original formulation of the

OSCAR as given in (2.1). Like the OSCAR it can be computed via quadratic

program-ming. However, it can also be computed using a local quadratic approximation of the

penalty, which is not directly applicable to the original formulation of the OSCAR. The

latter strategy is superior to quadratic programming in that quadratic programming

be-comes expensive in computation and is not feasible for large and even moderate numbers

of parameters, while the latter strategy continues to be feasible in these cases, and can

be conveniently implemented in standard software.

Furthermore the choice of weighting scheme in the penalty offers the possibility of

subjectivity. With a proper reformulation, we show later that it needs only one tuning

parameter. The OSCAR has two tuning parameters and this reduction in tuning

param-eters vastly improves on the cost of computation. This flexibility in choice of weights

allows us to explore data adaptive weighting. This is explored in Section 2.3 where we

show that a data adaptive PACS has the oracle property for variable selection and group

identification.

Figure 2.1 is an illustration of the flexibility of the PACS approach over the OSCAR

approach in terms of the constraint region in the (β1, β2) plane. In Figure 2.1 (a), we

see that the OSCAR constraint region has the same shape in all four quadrants. Note

that although the shape varies with c in 0≤c ≤1, it always remains symmetric across

(31)

(a) OSCAR constraint re-gion for a fixed value ofc.

(b) PACS constraint region with one weighting scheme.

(c) PACS constraint region with a different weighting scheme.

Figure 2.1: Illustration to represent the flexibility of the PACS approach over the OS-CAR approach in terms of the shaded constraint regions in the (β1,β2) plane.

different choices of weights are seen in Figures 2.1 (b) and 2.1 (c). As we see from these

figures, they have very different shapes and suggest the flexibility of the PACS approach

over the OSCAR approach.

2.2.2 Choosing the Weights

In this section we study different strategies for choosing the weights. The choice of weights

offers the possibility of subjectivity which come in various forms. Three choices will be

examined in detail: weights determined by a predictor scaling scheme, data adaptive

weights for oracle properties, and an approach to incorporate variable correlation into

the weights.

Scaling of the PACS Penalty

The weights for the PACS could be determined via standardization. For any penalization

(32)

done equally. In penalized regression this is done by standardization, for example, each

of the columns of the design matrix has unit L2 norm. In a penalization scheme such as

the LASSO, upon standardization, each column of the design matrix contributes equally

to the penalty. When the penalty is based on a norm, rescaling a column of the design

matrix is equivalent to weighting coefficients by the inverse of this scaling factor (Bondell

and Reich, 2009). We will now determine the weights in (2.2) via a standardization

scheme.

Standardization for the PACS is not a trivial matter since it must incorporate the

pairwise differences and sums of coefficients. In fact one needs to consider an

over-parameterized design matrix that includes the pairwise coefficient differences and sums.

Let the vector of pairwise coefficient differences of length d = p(p−1)/2 be given by

τ = {τjk : 1 ≤ j < k ≤ p}, where τjk = βk −βj. Similarly, let the vector of pairwise

coefficient sums of length d be given by γ ={γjk : 1≤j < k ≤p}, where γjk =βk+βj.

Letθ= [βT τT γT]T be the coefficient vector of lengthq =p2 for this over-parameterized model. We have θ = M β, where M is a matrix of dimension q × p given by M =

[Ip DT(−) DT(+)]

T_{, with} _D

(−) being a d×p matrix of ±1 that creates η from β and D(+)

being a d ×p matrix of +1 that creates γ from β. The corresponding design matrix

for this over-parameterized design is an n×q matrix such that Zθ =Xβ for all β, i.e.

ZM =X.

Note that Z is not uniquely defined; possible choices include Z = [X 0n×2d] and Z =XM∗, where M∗ is any left inverse of M. In particular, choose Z = XM−, where

M− denotes the Moore-Penrose generalized inverse of M.

Proposition 1. The Moore-Penrose generalized inverse of M isM−= ₍₂_p1₋₁₎[Ip D(−)T DT(+)].

Proof of Proposition 1. Consider the matrix M− = ₍₂_p1₋₁₎[Ip DT(−) DT(+)]. Then M

(33)

1

(2p−1)[Ip D

T

(−)D(−) DT₍₊₎D(+)]. Via direct calculation, one obtains D(−)T D(−)=pIp−1p1Tp

andDT

(+)D(+) = (p−2)Ip+ 1p1

T

p, soM

−_M ₌_I

p. To show thatM− is the Moore-Penrose

generalized inverse of M, it suffices to show

1. M−M M− =M−.

2. M M−M =M.

3. M−M is symmetric.

4. M M− is symmetric.

Clearly (1), (2) and (3) follow directly from the fact that M−M =Ip. For (4),

M M− =



    

Ip D₍₋₎T D₍₊₎T D(−) D(−)DT(−) D(−)DT₍₊₎

D(+) D(+)DT(−) D(+)DT₍₊₎



    

.

Clearly,M M− is symmetric, andM−is the Moore-Penrose generalized inverse ofM.

The above proposition allows the determination of the weights via the standardization

in the over-parameterized design space. The resulting matrix Z determined using the

Moore-Penrose generalized inverse of M is an appropriate design matrix for the

over-parametrization. We propose to use the L2 norm of the corresponding column of Z

for standardization as the weights in (2.2). In particular, these weights are wj = 1, wjk(−) =

p

2(1−rjk) and wjk(+) =

p

2(1 +rjk) for 1 ≤ j < k ≤ p, where rjk is the

(34)

Data Adaptive Weights

PACS with appropriately chosen data adaptive weights will be shown to be an oracle

procedure. Suppose ˜βis a√n-consistent estimator of β, such as the ordinary least squares

weights allow for less penalization when the coefficients, their pairwise differences, or

their pairwise sums are larger in magnitude and penalized in an unbounded manner

when they are truly zero.

We note that for α = 1, the adaptive weights belong to a class of scale equivariant

weights, as long as the initial estimator ˜βis scale equivariant. When the weights are scale

equivariant, the resulting solution to the optimization problem is also scale equivariant.

In such cases the design matrix does not need to be standardized beforehand, since the

resulting solution obtained after transforming back is the same as the solution obtained

when the design matrix is not standardized. Due to this simplicity, for the remainder of

this chapter we setα = 1.

In practice, the initial estimates can be obtained using OLS or other shrinkage

esti-mates like ridge regression estiesti-mates. In studies with collinear predictors, we notice that

using ridge estimates for the adaptive weights perform better than those given by OLS

estimates. The choice of ridge estimate controls the collinearity by smoothing the

coeffi-cients. Ridge estimates selected by AIC are used for adaptive weights in all applications

in this paper. Ridge estimates chosen by AIC provide slight regularization while not

over shrinking, as opposed to BIC. Note that, since the original data are standardized,

the ridge estimates are equivariant. Any other choice would require using the scaling in

(35)

Incorporating Correlation

The choice of weights is subjective and in particular, one may wish to assist the grouping

of predictors based on the correlation between predictors. This approach is explored

in Tutz and Ulbricht (2009) in the ridge regression context. To this end, consider

in-corporating correlation in the weighting scheme scheme such as those given by wj = 1,

wjk(−) = (1− rjk)−1 and wjk(+) = (1 + rjk)−1 for 1 ≤ j < k ≤ p. Intuitively, these

weights more heavily penalize the differences in coefficients when they are highly

posi-tively correlated and heavily penalize their sums when they are highly negaposi-tively

corre-lated. Though such weighting will not discourage uncorrelated predictors to have equal

coefficients, they will encourage pairs of highly correlated predictors to have equal

co-efficients. Adaptive weights that incorporate these correlations terms are then given by

wj = |βj˜|−1_, _wjk

(−) = (1− rjk)−1|βk˜ −βj˜|−1 and wjk(+) = (1 +rjk)−1|βk˜ + ˜βj|−1 for

1≤j < k≤p.

2.2.3 Geometric Interpretation of PACS Penalty

The geometric interpretation of the constrained least squares solutions illustrate the

flexibility of the PACS penalty over the OSCAR penalty in accurately selecting the

estimates. Aside from a constant, the contours of the least squares loss function are

given by (β−β˜OLS)TXTX(β−β˜OLS) = K. These contours are ellipses centered at the

OLS solution. Since the predictors are standardized, when p = 2 the principal axis of

the contours are at ±45◦ to the horizontal. As the contours are in terms of XTX, as opposed to (XTX)−1, positive correlation would yield contours that are at −45◦ whereas negative correlation give the reverse. In the (β1, β2) plane, intuitively, the solution in the

(36)

Figure 2.2 illustrates the flexibility of the PACS approach over the OSCAR approach

for a specific high correlation (ρ = 0.85) setup. In Figures 2.2 (a) and 2.2 (b), we see

that when the OLS solutions for β1 and β2 are close to each other, a specific case of

the OSCAR penalty sets the OSCAR solutions as equal, ˆβ1 = ˆβ2, while the adaptive

PACS penalty with weights given in Section 2.2.2 also sets ˆβ1 = ˆβ2. In Figures 2.2 (c)

and 2.2 (d) we see the same OSCAR and adaptive PACS penalty functions find different

solutions when the OLS solution for β1 is close to 0 and not close to that of β2. Here

the OSCAR solutions remain equal to each other while the adaptive PACS sets ˆβ1 = 0.

Thus the OSCAR solution is more dependent on the correlation of the predictors, and

does not easily adapt to the different least squares solutions.

2.2.4 Computation and Implementation

Algorithm

The PACS estimates of (2.2) can be computed via quadratic programming with (p(p+1))

parameters and (2p(p-1)+1) linear constraints. As the number of predictors increases,

quadratic programming becomes more computationally expensive and hence is not

fea-sible for large, and even moderate p. In the following, we suggest an alternative

compu-tation technique that is more cost efficient. The algorithm is based on a local quadratic

approximation of the penalty similar to that used in Fan and Li (2001). The penalty of

the PACS in (2.2) can be locally approximated by a quadratic function. Suppose for a

(37)

the first order expansion with respect toβ2

j, we have

P(|βj|) = P((βj2)

1

2)≈P(|β₀_j|) +P0(|β₀_j|) 1

2(β

2 0j)

−1 2(β2

j −β

2 0j).

ThusP(|βj|)≈P(|β0j|) + 1₂P

0_(|_β

0j|)

|β0j| (β

2

j −β02j) for j = 1, ..., p and in the case of the PACS

|βj| ≈ β2

j

2|β0j|+K, whereK is a term not involvingβj and thus does not play a role in the

optimization. Similarly, we approximate the penalty terms on the differences and sums

of the coefficients. Thus both the loss function and penalty are quadratically expressed,

and hence there is a closed form solution.

An iterative algorithm with closed form updating expressions follows from these local

quadratic approximations. Suppose, at step t of the algorithm, the current value, β(t), is used to construct the following diagonal matrices; Iw(t) = diag

wj/|β(_jt)|, I_w(t₍₋₎) = diagwjk(−)/|β

(t)

k −β

(t)

j |

and I_w(t₍₊₎) = diagwjk(+)/|β (t)

j +β

(t)

k |

. The value of the

solu-tion at step t+1 using these constructed matrices is given as

ˆ

β(t+1) = (XTX+ λ 2(I

(t)

w +D(−)T I (t)

w(−)D(−)+D

T

(+)I (t)

w(+)D(+))) −

XTy.

The algorithm follows as:

1. Specify an initial starting value, β(0) _{= ˆ}_β(0) _{close to the minimizer of (2.2).}

2. For step t+1, construct Iw(t), I_w(t₍₋₎) and I_w(t₍₊₎) and compute ˆβ(t+1).

3. Let t =t+ 1. Go to step 2 until convergence criterion is met.

Proposition 2. Assume XTX is full rank, then the optimization problem in (2.2) is strictly convex, and the proposed algorithm is guaranteed to converge to the unique

(38)

that case it is guaranteed to converge to a local minimizer.

The proof of Proposition 2 is due to the fact that it is an MM (majorize/minimize)

algorithm (Hunter and Li, 2005). Note that for all applications in this chapter, the

convergence criterion is met when max(|βˆ(t+1)₋_βˆ(t)_|₎_/₍_|_βˆ(t)_|_{+ 10}−7₎_<₁₀−4_.

Once the estimates are computed for a given λ, the next step is to select λ that

corresponds to the best model. Cross-validation or an information criterion like AIC or

BIC could be used for model selection. When using an information criterion, we need an

estimate of the degrees of freedom. The degrees of freedom for the PACS is given as the

number of unique absolute nonzero estimates as in the case of the OSCAR (Bondell and

Reich, 2008). We use BIC to perform model selection in all examples in this paper.

2.3 Oracle Properties

The use of data adaptive weights in penalization assist in obtaining oracle properties for

structure identification. Suppose ˜β is a √n-consistent estimator of β. Consider weights

for this data adaptive PACS given by w∗, where w∗ incorporates the weighting scheme

one could use the correlation terms of Section 2.2.2 or any other choice satisfying the

condition that vj →cj, vjk(−) →cjk(−) and vjk(+)→cjk(+) with 0< cj, cjk(−), cjk(+)<∞

for all 1≤j < k ≤p. The adaptive PACS estimates are given as the minimizers of

ky− p

X

j=1

xjβjk2+λn{ p

X

j=1

w∗_j|βj|+

X

1≤j<k≤p

w∗_jk₍₋₎|βk−βj| +

X

1≤j<k≤p

w∗_jk₍₊₎|βk+βj|}. (2.3)

(39)

and group identification. Consider the overparameterization,θ=M βfrom Section 2.2.2.

Let A = {i : θi 6= 0, i = 1, ..., q}, q = p2, denote the set of indices for which θ is truly

non-zero. LetAn denote the set of indices that are estimated to be non-zero.

Consider β∗, a vector of length po ≤ p that denotes the oracle parameter vector

derived fromA. Let A∗ be thepo×pfull rank matrix such thatβ∗ =A∗β. For example,

suppose the first two coefficients of β are truly equal in magnitude, then the first two

elements of the first row of A∗ would be equal to 0.5 and the remaining elements in the

first row would equal 0. If a coefficient inβ is not in the true model, then every element

of the column of A∗ corresponding to it would equal 0. We assume the regression model;

y=Xoβ∗+,

where is a random vector with E() = 0 and V ar() = σ2_I_{. Further assume that}

1

nX T

oXo → C, where Xo is the design matrix collapsed to the correct oracle structure

determined byA, and C is a positive definite matrix. The following theorem shows that

the adaptive PACS has the oracle property.

Theorem 1. (Oracle properties). Suppose λn → ∞ and √λn_n → 0, then the adaptive

PACS estimates must satisfy the following:

• Consistency in structure identification: limnP(An=A) = 1.

• Asymptotic normality: √n(A∗βˆ−A∗β)→dN(0, σ2C−1).

Remark 1. We note here a correction to Theorem 1 in Bondell and Reich (2009). The

proof of that theorem is also under the assumption that λn→ ∞ and √λn_n →0.

Proof of Theorem 1.1. We first prove for asymptotic normality. Let β0 denotes the true

(40)

ˆ

u= arg minVn(u), where ˆβ =β0+√uˆ_n and

Vn(u) =u0(1

nX T

X)u−2 0

X

√

nu+ λn

√

nP(u)

with

P(u) =

p

X

j

w_j∗√n(|βj0+

uj

√

n| − |βj0|)

+ X

1≤j<k≤p

w∗_jk₍₋₎√n(|βk0−βj0−

uk−uj

√

n | − |βk0−βj0|)

+ X

1≤j<k≤p

w∗_jk₍₊₎√n(|βj0+βk0−

uj_√+uk

n | − |βj0+βk0|)

Consider the limiting behavior of √λn

nP(u). By using arguments as in Zou, 2006 and

Bondell and Reich, 2009, _√λn

nP(u) will go to zero for the correct structure and diverge

under the incorrect structure. If βj0 6= 0, βk0 6= 0 and βj0 6=βk0, then w∗j →p cj|βj0|−1,

w∗_jk₍₋₎ →p cjk(−)|βk0 −βj0|−1 and w_jk∗ ₍₊₎ →p cjk(+)|βj0 +βk0|−1. Also

√

n(|βj0 +

uj

√

n| −

|βj0|) → uj sgn(βj0),

√

n(|βk0 −βj0−

uk_√−uj

n | − |βk0−βj0|) → (uk−uj) sgn(βk0 −βj0)

and √n(|βj0 +βk0 −

uj_√+uk

n | − |βj0 +βk0|) → (uj +uk) sgn(βj0 +βk0). By Slutsky’s

theorem, we have √λn

nP(u) →p 0. If βj0 = 0 then

√

n(|βj0 +

uj

√

n| − |βj0|) = |uj| and λn

√

nw

∗

j = λnvj(|

√

nβˆj|)−1 where

√

nβˆj =Op(1), so √λn_nP(u)→ ∞ unless λnuˆj → 0, since λn → ∞. Similarly if βj0 = βk0,

√

n(|βk0 −βj0 − uj −uk

√

n | − |βk0 −βj0|) = |uk−uj| and λn

√

nw

∗

jk(−) = λnvjk(−)(|

√

n(βk0 −βj0)|)−1 where

√

n( ˆβk −βˆj) = Op(1), so √λn_nP(u) → ∞

unless |uk −uj| → 0, since λn → ∞. Furthermore, if Ac ⊆ Acn then Vn(u) = Vno(uo),

where Vo

n(uo) is the value of the objective function obtained after collapsing the design

matrix X to the correct oracle structure determined by A, i.e., X = Xo and uo is the

(41)

and 0

Xo

√

n →d W ∼N(0, σ

2_C_{). Thus, we have} _Vn₍_u₎_→_V₍_u_{) for every} _u_{, where}

V(u)=       

u0_oCuo−2u

0

oW if Ac ⊆ Acn,

∞ otherwise.

Hence the asymptotic normality follows by noting that Vn(u) is convex and the unique

minimizer of V(u) is C−1W.

Proof of Theorem 1.1. We now show for consistency in structure identification. Note

that the asymptotic normality implies that P(Ac

n∩ A 6=∅)→0. To complete the proof

of consistency, we will show that P(An ∩ Ac 6= ∅) → 0. The proof is demonstrated

through contradictions in 3 separate scenarios.

In the first scenario, letA∗ ₌_{_j _:_β

j 6= 0}be the set of indices for coefficients that are

truly non-zero and letA∗

nbe the set of indices for coefficients that are estimated to be

non-zero. Then ifλn→ ∞and ˜β is a

√

n-consistent estimator ofβ,P(A∗

n∩A

∗c₆₌_∅₎_→_{0. We}

note that the asymptotic normality implies thatP(A∗c n∩A

∗ ₆₌_∅₎_→_{0. Let}_B

n=A∗n∩A

∗c

be the set of indices corresponding to the coefficients that should be set to zero but were

set to nonzero. We will show that P(Bn 6= ∅) → 0 i.e., that the true zeros will be set

to zero with probability tending to one. Suppose Bn is not empty and also suppose

0 ≤ βˆ1 ≤ ... ≤ βˆp. Also let ˆβm be the largest coefficient indexed by Bn. Since ˆβm 6= 0,

(2.3) is differentiable w.r.t. βm and ˆβm satisfies

2

√

nx

0

m(y−Xβˆ) = λn

√

n{w

∗

m+

X

1≤j<m

w∗_jm₍₋₎− X

m<k≤p

w_mk∗ ₍₋₎

+ X

1≤j<m

w∗_jm₍₋₎+ X

m<k≤p

w∗_mk₍₋₎}, (2.4)

(42)

since ˜β is a √n-consistent estimator of β, we have √nβ˜m = Op(1). Therefore w

∗

m

√

n = vm(

√

n|β˜m|)−1 =Op(1). Consider w∗_jm₍₋₎

√

n =vjm(−)(

√

n|β˜m−β˜j|)−1 forj < m. If ˆβj is also

indexed byBn, thenβj = 0,

√

n( ˜βm−β˜j) =Op(1) and w∗_jm₍₋₎

√

n =Op(1). If ˆβj is not indexed

byBn, then|β˜m−β˜j| → |βj|and w∗_jm₍₋₎

√

n →0. Hence w_jm∗ ₍₋₎

√

n =Op(1) for j < m. Consider w∗_mk₍₋₎

√

n =vmk(−)(

√

n|β˜k−β˜m|)−1, since ˆβk is not indexed byBn, we have|β˜k−β˜m| → |βk|

and hence w ∗

mk_√(−)

n →0 for m < k. Consider w∗

mj_√(+)

n =vmj(+)(

√

n|β˜m+ ˜βj|)−1 forj 6=m. If

ˆ

βj is also indexed by Bn, then βj = 0,

√

n( ˜βm+ ˜βj) =Op(1) and w∗

mj_√(+)

n =Op(1). If ˆβj is

not indexed by Bn, then |βm˜ + ˜βj| → |βj| and w∗

mj_√(+)

n → 0. Hence w∗

mj_√(+)

n =Op(1) for all j 6=m. So the right hand side of (2.4) is λn×Op(1) =Op(λn), while the left hand side

is Op(1) by the asymptotic normality. Now since λn → ∞, we have a contradiction and P(Bn 6=∅)→0.

For the second scenario, let A(−) = {(j, k) : βj 6= βk} be the set of indices for

differences in coefficients that are truly non-zero and let An(−) be the set of indices for

these differences that are estimated to be non-zero. Then if λn → ∞ and ˜β is a

√

n

-consistent estimator of β, P(An(−) ∩ Ac(−) 6= ∅) → 0. We note that the asymptotic

normality implies that P(Ac

n(−) ∩ A(−) 6= ∅) → 0. Let Bn = An(−) ∩ Ac(−) be the set

of indices corresponding to the differences in coefficients that should be set to zero but

were set to nonzero. We will show that P(Bn 6= ∅)→ 0 i.e., that the true zeros will be

set to zero with probability tending to one. Suppose Bn is not empty and also suppose

0 ≤ βˆ1 ≤ ... ≤ βˆp. Bn = Bn1 ∪ Bn2, where Bn1 corresponds to the set {(βj = βk =

0)∩( ˆβj −βˆk 6= 0)} and Bn2 corresponds to the set {(βj = βk 6= 0)∩( ˆβj −βˆk 6= 0)}. P(Bn1 6=∅)→0 follows directly from the proof in the first scenario. We shall now prove

P(Bn2 6=∅)→0.

Suppose ˆβm is the largest coefficient indexed by Bn2 i.e. m = max{k : (j, k) ∈

(43)

Consider the reparameterization given by Xβ =Hη, where ηj =βj −βq for j 6= q and ηj =βq for j =q. By assumption ˆηm 6= 0 and ηm = 0, and that ˆηm−ηˆj ≥0 for all (j,m)

∈ Bn2. |βj| = ηj +ηq for j 6=q and |βj| = ηq for j = q for j ≤ k at the solution. Also

for j ≤ k at the solution, we have |βk−βj|= ηk−ηj for (j 6=q, k 6=q), |βk−βj| = ηk

for (j =q ≤k) and |βk−βj| =−ηj for (j ≤k =q). Also for j ≤k at the solution, we

have|βk+βj|=ηk+ηj+ 2ηq for (j 6=q, k =6 q), |βk+βj|=ηk+ 2ηq for (j =q ≤k) and

|βk+βj|=ηj + 2ηq for (j ≤k =q). For this parametrization, for j ≤k at the solution,

ˆ

η = arg min

η∈Bn2

ky−Hηk2₊ λn

√

n{

X

j6=q

w_j∗(ηj +ηq) +wq∗ηq+

X

j6=q,k6=q,j<k

w∗_jk₍₋₎(ηk−ηj)

+ X

j=q≤k

w∗_qk₍₋₎ηk+

X

j≤k=q

w∗_jq₍₋₎ηk+

X

j6=q,k6=q,j<k

w∗_jk₍₊₎(ηk+ηj+ 2ηq)

+ X

j=q≤k

w∗_qk₍₊₎(ηk+ 2ηq) +

X

j≤k=q

w_jq∗₍₋₎(ηj + 2ηq)}. (2.5)

Since ˆηm 6= 0, (2.5) is differentiablew.r.t. ηm at the solution. Consider this derivative

in the neighborhood of the solution on which the coefficients that are set equal remain

equal. That is, the differences remain zero. In this neighborhood, the terms involving

(k,m) ∈ Ac

n(−) ∩ A

c

(−) can be omitted as these terms will vanish in the objective

func-tion. A contradiction will be obtained in this neighborhood of the solufunc-tion. Due to the

differentiability, the solution ˆη must satisfy

2

√

nh

0

m(y−Hηˆ) = λn

√

n{w

∗

m+

X

k6=m,(k,m)∈A₍−)

(−1)[m≤k]w_km∗ ₍₋₎+ X

k6=m,(k,m)∈Bn2

w_km∗ ₍₋₎

+ X

k6=m,(k,m)∈A(−)

w∗_km₍₊₎+ X

k6=m,(k,m)∈Bn2

w∗_km₍₊₎}, (2.6)

wherehm is themth column ofH. Note that each term in the sums on the right hand side

are nonnegative. Consider w_√m∗

n = vm(

√

n|β˜m|)−1. If βm = 0, we have w∗

m

√

(44)

and if βm 6= 0, we have |β˜m| → |βm| and w

∗

m

√

n → 0. Consider (−1)

[m≤k]w

∗

km_√(−)

n =

(−1)[m≤k]_v

km(−)(

√

n|β˜m − β˜k|)−1 for all k 6= m,(k, m) ∈ A(−). Since (k,m)∈ A(−),

|β˜m −β˜k|−1 = Op(1) and (−1)[m≤k] w∗_km₍₋₎

√

n → 0 for all k 6= m,(k, m) ∈ A(−).

Con-sider w ∗

km_√(−)

n = vkm(−)(

√

n|β˜k−β˜m|)−1 for all k =6 m,(k, m) ∈ Bn2. Since (k, m) ∈ Bn2,

(√n|β˜k −β˜m|)−1 = Op(1) and

w_km∗ ₍₋₎

√

n = Op(1) for all k 6= m,(k, m) ∈ Bn2. Consider w∗

km_√(+)

n = vkm(+)(

√

n|β˜k + ˜βm|)−1 for all k 6= m,(k, m) ∈ A(−). Since (k, m) ∈ A(−),

|βk˜ + ˜βm|−1 ₌ _O

p(1) and w∗

km_√(+)

n = Op(

1 √

n) for all k 6= m,(k, m) ∈ A(−). Consider w∗_km₍₊₎

√

n = vkm(+)(

√

n|β˜k + ˜βm|)−1 for all k 6= m,(k, m) ∈ Bn2. Since (k, m) ∈ Bn2,

(√n|β˜k+ ˜βm|)−1 = Op(1) and

w∗_km₍₊₎

√

n = Op(1) for all k 6= m,(k, m)∈ Bn2. So the right

hand side of (2.6) isλnOp(1) =Op(λn), while the left hand side isOp(1) by the asymptotic

normality. Now sinceλn→ ∞, we have a contradiction and P(Bn2 6=∅)→0.

For the third scenario, letA(+)={(j, k) :βj 6=−βk}be the set of indices for sums of

coefficients that are truly non-zero and let An(+)be the set of indices for these sums that

are estimated to be non-zero. Then if λn→ ∞ and ˜β is a √n-consistent estimator ofβ,

P(An(+)∩ Ac₍₊₎ 6=∅)→0. Let Bn =An(+)∩ Ac₍₊₎ be the set of indices corresponding to

the sums in coefficients that should be set to zero but were set to nonzero. We will show

thatP(Bn6=∅)→0i.e., that the true zeros will be set to zero with probability tending to

one. The proof that P(Bn6=∅)→0 follows directly from the proof in the first scenario.

Suppose βj ≥ 0 and βk ≥ 0, then Bn is the set for which {βj +βk = 0,βˆj + ˆβk 6= 0}.

Since this set is contained in {βj = 0,βˆj 6= 0} ∪ {βk = 0,βˆk6= 0}, it suffices to show the

property for the first scenario.

Together the three scenarios show that P(An∩ Ac)→0. As stated earlier, the proof

of asymptotic normality implies that P(Ac

(45)

2.4 Extension to Generalized Linear Models

We now study an extension of the PACS to GLMs and present a framework to show

that an adaptive PACS is an oracle procedure. We consider estimation of penalized

log likelihoods using an adaptive PACS penalty, where the likelihood belongs to the

exponential family whose density is of the form f(y|x, β) = h(y) exp(y(xT_β ₋_φ₍_xT_β₎₎₎

(see McCullagh and Nelder, 1989). Suppose ˜βis the maximum likelihood estimate (MLE)

in the GLM, then the adaptive PACS estimates are given as the minimizers of

n X

i=1

(−yi(xTiβ) +φ(x T

i β)) +λn{ p X

j=1

w∗_j|βj|+ X

1≤j<k≤p

w∗_jk₍₋₎|βk−βj| + X

1≤j<k≤p

w_jk∗ ₍₊₎|βk+βj|},

where the weights are given as in Section 2.3. We assume the following regularity

conditions:

(A) The Fisher information matrix forβ∗,I(β∗) =E[φ00(xT_oβ∗)xoxTo], is positive definite.

(B) There is a sufficiently large open set O that contains β∗ such that for all vectors of

length po contained in O , |φ000(xT_oβ∗)| ≤M(xo) <∞ and E[M(xo)|xojxokxol]<∞

for all 1≤j, k, l≤p.

Theorem 2. (Oracle properties for GLMs). Assume conditions (A) and (B) and suppose

λn → ∞ and √λn_n →0, then the adaptive PACS estimates must satisfy the following:

1. Consistency in structure identification: limnP(An=A) = 1.

2. Asymptotic normality: √n(A∗βˆ−A∗β)→dN(0, I(β∗)−1).

The proof of Theorem 2 follows the proof of Theorem 1 after a Taylor expansion of the

Penalization Methods for Group Identification and Variable Selection in Models with Correlated Predictors.

Abstract

Dedication

Biography

Acknowledgements

Table of Contents

List of Tables

List of Figures

Chapter 1

Introduction

1.1

Regression

1.1.1

Background

1.1.2

Penalization Techniques

1.1.3

Oracle Procedures

1.2

Classification with Support Vector Machines

1.3

Dissertation Outline

Chapter 2

Consistent Group Identification and

Variable Selection in Regression Models

2.1

Introduction

2.2

Pairwise Absolute Clustering and Sparsity

2.2.1

Methodology

2.2.2

Choosing the Weights

2.2.3

Geometric Interpretation of PACS Penalty

2.2.4

Computation and Implementation

2.3

Oracle Properties

2.4

Extension to Generalized Linear Models