Regression Models and Generalized Maxent - Generalized Maximum Entropy, Convexity and Machine L

form $\hat{y} = \langle \beta, x \rangle$, where $x$ is a vector of observations. This is done with the aid of historical observations denoted $y$ and a data matrix $X$, an $M \times R$ matrix of $M$ observations of $R$ regressors.

The traditional approach to regression is based on the method of least squares. In undergraduate econometrics, students learn about the assumptions that give this method a maximum likelihood interpretation: iid observations, Gaussian noise, and so on. Later they learn methods adapted to other statistical assumptions. Often these methods amount to changing the likelihood, the regularizer, or both. In recent years a number of approaches to regression have emphasized l1-regularization. These include the Dantzig selector, the LASSO, and other variants; ridge regression, by contrast, uses a squared l2 penalty. See Hastie et al. (2001) for a survey. One advantage of l1-regularization is that it promotes sparsity in the resulting predictor, which is potentially useful for reducing the number of regressors in an era of ever-larger data sets. This form of regularization is also popular in the related area of compressed sensing. Recent surveys of the field include Candes (2006); DeVore (2007); Donoho (2006); Tsaig and Donoho (2006).

This section presents a short review of selected regression techniques, some old, and a new one, the Dantzig Selector, proposed in Candes and Tao (2007). A pattern of construction will become clear. Following that, a variation based on generalized entropy is introduced. The Dantzig selector can be seen as a limiting case of the new model. [Experiments and regularization paths should be mentioned here.]

In order to fit regression into an ‘entropic’ setting a little work is required. Normally the vector $\beta$ is not restricted to the non-negative orthant. To use the generalized maxent framework, which for most generalized entropies produces a non-negative vector, we can employ a ‘±’ trick described in Kivinen and Warmuth (1997) and other places. This is simply a linear map that separates the positive and negative coefficients in $\beta$. Let

$$p = \begin{pmatrix} \beta^+ \\ \beta^- \end{pmatrix},$$

where $\beta^+$ and $\beta^-$ denote the positive and negative parts of $\beta$, so that $\beta = \beta^+ - \beta^-$ and $p \ge 0$. Then we can write the fitted values as

$$y \approx X\beta = X_\pm p,$$

where $X_\pm = [X, -X]$. A side effect of this trick is a doubling of the length of the parameter vector. Whether this fact is significant or trivial depends on the application.
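The ± trick can be sketched in a few lines of NumPy. The helper names `split_pm` and `double_design` are hypothetical, chosen here for illustration; the doubled matrix is assumed to be $[X, -X]$, which is what the identity $X\beta = X_\pm p$ requires.

```python
import numpy as np

def split_pm(beta):
    """Stack the positive and negative parts of beta into one non-negative vector p."""
    beta = np.asarray(beta, dtype=float)
    beta_plus = np.maximum(beta, 0.0)    # positive part, entrywise
    beta_minus = np.maximum(-beta, 0.0)  # negative part, entrywise
    return np.concatenate([beta_plus, beta_minus])

def double_design(X):
    """Build the M x 2R matrix [X, -X] that acts on p."""
    return np.hstack([X, -X])

# Small illustrative example with made-up data
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
beta = np.array([1.5, -2.0, 0.0])

p = split_pm(beta)        # length 2R, all entries >= 0
X_pm = double_design(X)   # shape (M, 2R)
```

Here `X @ beta` and `X_pm @ p` agree, and `p[:3] - p[3:]` recovers `beta`, at the cost of a parameter vector twice as long.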

5.4.1 Review of Least Squares

Least squares: The best known linear predictor is the least squares estimator, which will be denoted $\beta_{ls}$. The least squares estimator gets its name from the fact that it solves the following problem:

$$\min_{\beta} \; \tfrac{1}{2}\|y - X\beta\|_2^2.$$

Assuming X is ‘tall’ (M > R) and full (column) rank the solution is given by:

$$\beta_{ls} = (X^T X)^{-1} X^T y.$$

Fitted values are given by:

$$y_{ls} = X (X^T X)^{-1} X^T y.$$

The least squares predictor provides a familiar and important point of comparison for almost any other predictor one may wish to analyze.
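As a minimal sketch of these formulas, the following NumPy snippet computes the least squares estimator on synthetic data (the data, sizes, and variable names are assumptions made for illustration). It solves the normal equations rather than forming the inverse explicitly, and compares the result with `np.linalg.lstsq`.

```python
import numpy as np

# Synthetic 'tall' problem: M > R, full column rank with probability one
rng = np.random.default_rng(1)
M, R = 50, 4
X = rng.normal(size=(M, R))
beta_true = np.array([2.0, -1.0, 0.5, 0.0])
y = X @ beta_true + 0.1 * rng.normal(size=M)

# Normal equations: (X^T X) beta = X^T y
beta_ls = np.linalg.solve(X.T @ X, X.T @ y)

# Fitted values y_ls = X beta_ls
y_ls = X @ beta_ls

# Cross-check with the library least squares routine
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```

The residual `y - y_ls` is orthogonal to the columns of `X`, which is the property the lemma below relies on.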

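An important property of $\beta_{ls}$ is the following fact.

Lemma 5.7 The following holds for all $\beta$:

$$\|X(\beta - \beta_{ls})\|_2^2 \ge \|y - X\beta\|_2^2 - \|y - y_{ls}\|_2^2.$$

Proof

$$\begin{aligned}
\|X(\beta - \beta_{ls})\|_2^2 &= \|X\beta - y + y - X\beta_{ls}\|_2^2 \\
&= \langle X\beta - y + y - X\beta_{ls},\; X\beta - y + y - X\beta_{ls} \rangle \\
&= \langle X\beta - y,\, X\beta - y \rangle + \langle y - y_{ls},\, y - y_{ls} \rangle - 2\langle y - X\beta,\, y - y_{ls} \rangle.
\end{aligned}$$

For the last term, write $y - X\beta = (y - y_{ls}) + X(\beta_{ls} - \beta)$. The normal equations give $X^T(y - y_{ls}) = 0$, so

$$\langle y - X\beta,\, y - y_{ls} \rangle = \|y - y_{ls}\|_2^2 + \langle \beta_{ls} - \beta,\, X^T(y - y_{ls}) \rangle = \|y - y_{ls}\|_2^2.$$

Substituting gives $\|X(\beta - \beta_{ls})\|_2^2 = \|y - X\beta\|_2^2 - \|y - y_{ls}\|_2^2$, so the bound holds, in fact with equality.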
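A quick numerical check of Lemma 5.7 can be sketched as follows. The data are random and purely illustrative, and the function name `gap` is hypothetical. Because the least squares residual is orthogonal to the columns of $X$, the two sides of the lemma should agree up to rounding error for every $\beta$ tried.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 5))
y = rng.normal(size=40)

# Least squares estimator and fitted values
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
y_ls = X @ beta_ls

def gap(beta):
    """Left side minus right side of the inequality in Lemma 5.7."""
    lhs = np.sum((X @ (beta - beta_ls)) ** 2)
    rhs = np.sum((y - X @ beta) ** 2) - np.sum((y - y_ls) ** 2)
    return lhs - rhs

# Evaluate the gap at 100 random coefficient vectors
gaps = [gap(rng.normal(size=5)) for _ in range(100)]
max_abs_gap = max(abs(g) for g in gaps)
```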

5.4.2 Three Regressions

Many other regression techniques have been proposed. In this section we briefly cover three, enough to highlight a pattern of construction.

In Ridge regression the following problem is solved:

$$\min_{\beta} \; \tfrac{1}{2}\|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2.$$

The Lasso uses a one-norm regularizer instead:

$$\min_{\beta} \; \tfrac{1}{2}\|y - X\beta\|_2^2 + \lambda \|\beta\|_1.$$
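The two penalized problems can be sketched in NumPy under the scaling used above (a factor of one half on the squared error and no factor on the penalty). With that scaling, setting the ridge gradient to zero gives $(X^T X + 2\lambda I)\beta = X^T y$. For the Lasso, the sketch uses proximal gradient descent (ISTA) with soft thresholding; this is one standard solver chosen here for brevity, not a method named in the text. All function names and the synthetic data are assumptions.

```python
import numpy as np

def ridge(X, y, lam):
    """Minimize 0.5*||y - X b||^2 + lam*||b||_2^2 via its stationarity condition."""
    R = X.shape[1]
    return np.linalg.solve(X.T @ X + 2.0 * lam * np.eye(R), X.T @ y)

def soft_threshold(v, tau):
    """Entrywise proximal operator of tau*||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def lasso_ista(X, y, lam, n_iter=2000):
    """Minimize 0.5*||y - X b||^2 + lam*||b||_1 by proximal gradient steps."""
    step = 1.0 / np.linalg.norm(X, 2) ** 2   # 1 / largest eigenvalue of X^T X
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = -X.T @ (y - X @ beta)          # gradient of the squared-error term
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta

# Synthetic data with a sparse coefficient vector
rng = np.random.default_rng(3)
X = rng.normal(size=(60, 5))
y = X @ np.array([3.0, 0.0, 0.0, -2.0, 0.0]) + 0.1 * rng.normal(size=60)

lam = 5.0
beta_ls = np.linalg.solve(X.T @ X, X.T @ y)
beta_ridge = ridge(X, y, lam)
beta_lasso = lasso_ista(X, y, lam)
```

Comparing `beta_ridge` and `beta_lasso` with `beta_ls` shows the shrinkage each penalty induces: ridge reduces the two-norm of the coefficients, while the Lasso reduces the one-norm and tends to set small coefficients exactly to zero.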

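Both of these methods are covered in Hastie et al. (2001). The next is newer. The Dantzig Selector (Candes and Tao, 2007) is the solution to:

$$\min_{\beta} \; \|\beta\|_1 \quad \text{subject to} \quad \|X^T(y - X\beta)\|_\infty < \epsilon,$$

where $\epsilon > 0$ is a given tolerance.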

The regularizer is an l1-norm, but the statistical loss is merely a constraint on the largest error, a rotated one at that. Two distinctive aspects of this method are the use of the sup-norm and the fact that its inspiration comes from results in compressed sensing. One feature of that literature is that there is always something that corresponds to the ‘true’ model. Candes and Tao also show that, under low noise and certain conditions on the covariance of the regressors, the method can recover the exact true set of regressors with high probability. This result is interesting, but our concern at the moment is to see how the procedure might or might not relate to generalized maxent.
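Because both the objective and the constraint are piecewise linear, the Dantzig selector can be posed as a linear program. The sketch below does so with the ± trick from earlier in the section and `scipy.optimize.linprog`. Several things are assumptions: the tolerance is called `eps`, the strict inequality is relaxed to a non-strict one (as any LP solver requires), and the function name `dantzig_selector` and the synthetic data are hypothetical.

```python
import numpy as np
from scipy.optimize import linprog

def dantzig_selector(X, y, eps):
    """min ||b||_1  s.t.  ||X^T (y - X b)||_inf <= eps, written as an LP in p = [b+; b-]."""
    R = X.shape[1]
    X_pm = np.hstack([X, -X])                  # doubled design matrix [X, -X]
    A = X.T @ X_pm                             # maps p to X^T X b
    b = X.T @ y
    # Two-sided constraint -eps <= b - A p <= eps as stacked upper bounds
    A_ub = np.vstack([A, -A])
    b_ub = np.concatenate([b + eps, eps - b])
    c = np.ones(2 * R)                         # sum of p bounds ||b||_1 from above
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    p = res.x
    return p[:R] - p[R:]                       # recover b = b+ - b-

# Synthetic data with a sparse coefficient vector
rng = np.random.default_rng(4)
X = rng.normal(size=(80, 6))
y = X @ np.array([2.0, 0.0, 0.0, -1.5, 0.0, 0.0]) + 0.1 * rng.normal(size=80)

eps = 1.0
beta_ds = dantzig_selector(X, y, eps)
beta_ls = np.linalg.solve(X.T @ X, X.T @ y)
```

The least squares estimator makes the constraint vector exactly zero, so the program is feasible for any positive `eps`, and the selector's one-norm can be no larger than that of `beta_ls`.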

In the table below, the three objective functions are summarized. In addition the statistical loss (L(β,y)) is restated in terms of the deviation from the least squares predictor (L(β,βls)), which is obviously a data-dependent quantity.
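The restatement in terms of the least squares predictor can be sketched from two identities. This is a derivation under the section's assumptions ($X$ tall and of full column rank), not a reproduction of the table; the symbol $\epsilon$ for the Dantzig tolerance is an assumed name.

```latex
% Squared-error loss (ridge and Lasso): Pythagorean decomposition around y_{ls},
% using X^T (y - y_{ls}) = 0
\tfrac{1}{2}\|y - X\beta\|_2^2
  = \tfrac{1}{2}\|X(\beta - \beta_{ls})\|_2^2 + \tfrac{1}{2}\|y - y_{ls}\|_2^2

% Dantzig constraint: substitute X^T y = X^T X \beta_{ls}
X^T(y - X\beta) = X^T X(\beta_{ls} - \beta)
  \quad\Longrightarrow\quad
\|X^T(y - X\beta)\|_\infty = \|X^T X(\beta - \beta_{ls})\|_\infty
```

In the first identity the term $\tfrac{1}{2}\|y - y_{ls}\|_2^2$ does not depend on $\beta$, so for ridge and the Lasso the loss can be replaced by $\tfrac{1}{2}\|X(\beta - \beta_{ls})\|_2^2$ without changing the minimizer. In the second, the Dantzig constraint becomes a sup-norm bound on $X^T X(\beta - \beta_{ls})$.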

The table summarizes what we have so far:

In document Generalized Maximum Entropy, Convexity and Machine Learning (Page 102-104)