Consider the data (Y, X) where Y = Yn×J = (yij) = (y1y2· · · yJ) and
X = Xn×p = (xik) = (x1· · · xp), and yj; 1 ≤ j ≤ J and xk; 1 ≤ k ≤ p
are the vectors of n observations made on the response variables and the pre- dictors respectively. Since we have several response variables in these data, the univariate regression model (3.2.1) is extended to the following multivariate multiple regression model
where B = Bp×J = (bij) = (b1b2· · · bp)T and E = En×J = (ε1ε2· · · εJ)
with bj and εj respectively denote the values of the regression coefficients and
the error terms related to the jth response variable. Also suppose that the errors are Gaussian. We normalise both response variables and predictors. Suppose ¯xk = 1/nPni=1xik; k = 1,· · · , p, then predictors are centered by
subtracting the column means ¯x1,· · · , ¯xp of the predictor matrix from their
corresponding column and standardised by dividing the centered columns by their standard deviations. The response variables yj = (y1j,· · · , ynj)T; j =
1· · · , J are centralised by subtracting the mean ¯y = 1 J
PJ
j=1yj from the
corresponding column and standardised by dividing the centered column by the standard deviation.
In order to conduct screening with the aim of reducing the dimension, we marginally fit a multivariate regression to each of (Yn×J, xk), 1 ≤ k ≤ p as
follows
Y= xkbk+ ˜E, k = 1,· · · , p. (3.3.2)
where ˜E contains the error terms. To obtain the corresponding least square estimates ˆbk of bk, 1≤ k ≤ p, we consider the following optimisation
ˆ
bk = argmin
bk∈RJ
kY − xkbkk2F,
where k.kF denotes the Frobenius norm of matrices. Least square estimates
minimise the following expression
kY − xkbkk2F = tr (Y − xkbk)T(Y− xkbk) ,
where tr (.) is the trace of a matrix. Differentiating the above equation with respect to the J-dimensional coefficient vector bk and setting it equal to zero,
we get 0 = ∂ ∂bk tr (Y − xkbk)T(Y− xkbk) = −2xTk(Y− xkbk),
therefore the marginal least square estimate corresponding to a single predictor xk is of the form
ˆ
bk= (xTkxk)−1xTkY, k = 1,· · · , p, (3.3.3)
which is the k-th row of the estimated coefficient matrix ˆBp×J. To distinguish
predictors that have the highest effect on the response variable, we calculate the squared Euclidean norm of estimated coefficient vectors kˆb1k22,· · · , kˆbpk22
where kˆbkk22 = ˆb T kbˆk = YTxk(xTkxk)−2xTkY = (xTkxk)−2xTkYYTxk. (3.3.4)
Since response variables are centralized, the response sample covariance is of the form ˆS = 1JPJ
j=1yjyTj = J1YY
T. Substituting ˆS in the Equation (3.3.4),
the squared Euclidean norm of each coefficient is obtained by
kˆbkk22 = J(xTkxk)−2xTkSxˆ k. (3.3.5)
Predictors with larger squared Euclidean norms are selected as important and will be included in the following reduced model
Mδ ={1 ≤ k ≤ p : kˆbkk2
2 is among the first δ largest of all}, (3.3.6)
where δ is a pre-specified cutoff point. Since the errors are assumed to be Gaussian, the least square estimates and maximum likelihood estimates are equivalent. Therefore, we refer to the above procedure as likelihood-based marginal screening (LMS). In implementing the LMS, if δ is chosen to be large, the probability that the true model is included in the reduced model Mδ is high. However, the reduced model will not be parsimonious and the computational cost might be expensive. Zhong and Zhu (2015), Fan and Lv (2008) and Zhu et al. (2011), in different works, empirically show that setting
the cutoff point to [n/ log n] in their proposed screening methods gives good results in simulation studies. In another work by Hall and Miller (2009), p/2 was chosen as the cutoff point. This choice may contain too many irrelevant or false positives in cases that dimension of variables is high. Since there is no universal method to choose the cutoff point, this value is mostly user specified and varies based on areas that theses methods are applied. For example, con- sider the analysis of gene expression data in cancer biology. In such datasets, usually the number of variables (genes) exceeds tens of thousands while the sample size is very small. Suppose we apply variable screening on such data to determine the important variables (genes) which play a role in a particular type of cancer. Since only a small fraction of genes may be responsible for the disease (Moosa et al., 2016), choosing p/2 as cutoff point may not be a good choice for such data, while [n/ log n] seems to be a more reasonable choice.
Although the above screening is suitable for multivariate regressions, there is a crucial drawback in conducting such screening. When the regression model is built upon more than one response variable, covariance structure of Yn×J
is simply ignored by applying the above screening approach. To address this issue, in the following sections we introduce our proposed screening procedure that applies to the cases with the multivariate response variable and also takes into account the response covariance structure.