

In document Choi_unc_0153D_18151.pdf (Page 102-105)

CHAPTER 3. SCISSOR: SHAPE CHANGES IN SELECTING SAMPLE OUT-

3.5 Detecting shape changes

3.5.2 Local shape change detection

We now propose a second-step procedure for more challenging situations in which outlying features cannot be distinguished using a low-rank representation. As mentioned earlier, features whose signals are not strong enough to dominate as the dimension increases may not be captured by the first few PCs. These features then remain in the residuals, which still suffer from high dimensionality, so simply applying the projection depth idea or another conventional outlier detection algorithm to the residuals may not be appropriate. To address this issue, the second step uses available knowledge about the outlying structures to extract the critical information from the many irrelevant features in an overwhelming number of dimensions.

As stated earlier, shape variations missing from the low-rank approximation often exhibit changes in a limited region of coverage. This is roughly because the signal level required for a latent local abnormality to be distinctively expressed need not be as high as that required for a global abnormality. Such local shape changes are characterized by loss or gain in all or part of an exon or intron. Because widespread noise over the whole domain makes local variations hard to find, it is advantageous to deal with the relevant local regions separately. An intuitive and simple way of doing this is to slide a window along the domain and catch potential outliers within each window area, as follows.

We first define a window direction as a unit vector whose entries corresponding to a given window area are all equal to a constant and whose remaining entries are all zero. A group of window directions can then be a useful source for the projection depth function to measure local abnormalities. The sparsity of a window direction helps to reduce the impact of noise and thus to separate meaningful outliers from inliers. It is important to choose an appropriate window size, because windows that are too small can be sensitive to overly fine variations and windows that are too large can be vulnerable to noise accumulation. We propose a window size of 50 to 200 because most meaningful local variations appear as changes in expression of about these lengths. In addition to windows of a specific length whose union spans the transcript of a given gene, we include windows each of which corresponds to a whole exon or a whole intron. The window directions for these windows help to capture whole-exon or whole-intron changes more accurately. We denote the collection of these window directions by w.
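The construction of window directions described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the author's implementation: the function name `window_directions`, the `step` parameter (how far consecutive windows are shifted), and the half-open `(start, end)` encoding of exon or intron regions are all hypothetical choices made for the example.

```python
import numpy as np

def window_directions(d, width=100, step=50, regions=()):
    """Return an (m, d) array whose rows are unit-norm window directions.

    d       : number of positions (length of each residual vector)
    width   : sliding-window length (the text suggests 50 to 200)
    step    : shift between consecutive windows (hypothetical parameter)
    regions : extra half-open (start, end) spans, e.g. whole exons or introns
    """
    spans = [(s, s + width) for s in range(0, d - width + 1, step)]
    # Add one last window so that the union of windows covers the whole domain.
    if spans and spans[-1][1] < d:
        spans.append((d - width, d))
    spans.extend(regions)
    H = np.zeros((len(spans), d))
    for i, (start, end) in enumerate(spans):
        # Constant entries inside the window, zero elsewhere, Euclidean norm 1.
        H[i, start:end] = 1.0 / np.sqrt(end - start)
    return H
```

For instance, `window_directions(300, width=100, step=50, regions=[(0, 120)])` yields five sliding windows plus one region-based direction. Overlapping windows are an assumption of this sketch; the text only requires that the union of the windows span the transcript.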

The basic idea of the second step is the same as that of the first step, in the sense that we find a direction that maximizes the one-dimensional outlyingness for each data point. However, the directions involved in the projection depth function are given some structure, which helps to accentuate local abnormality. That is, we take the set v (the collection of direction vectors that the depth function will examine) to be w. Let $I_2$ be the set of sample indices remaining after excluding the first-step outliers, and let $\check{X}$ denote the residual matrix whose columns $\check{X}_j$ correspond to the remaining samples. Then the second-step PO scores can be written as

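$$o_2(\check{X}_j \mid \check{X}) = \sup_{h} \frac{h^{T}\check{X}_j - \mathrm{Med}(h^{T}\check{X})}{\mathrm{MAD}(h^{T}\check{X})} \quad \text{s.t. } h \in v,\ \varphi(h^{T}\check{X}) \le \rho, \tag{3.5.15}$$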

where $\varphi$ and $\rho$ specify the normality condition introduced in (3.5.14). We can obtain the MOD by choosing the direction giving the largest value among $v \setminus v_0$, where $v_0$ is the set of directions $h_k \in v$ with $\varphi(h_k^{T}\check{X}) > \rho$. The MODs maximizing (3.5.15) can be used for interpreting the local abnormality. For a given outlier, the corresponding MOD indicates the specific region where the shape aberration occurs as well as the type (loss or gain) of the aberration involved.
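To make the score in (3.5.15) concrete, the sketch below evaluates it over a finite set of window directions. Several pieces are assumptions rather than the author's code: the function name `local_po_scores` is hypothetical, `phi` is a user-supplied stand-in for the normality statistic of (3.5.14) (which is not reproduced in this section), the MAD is left unscaled, and the standardized projection is kept signed, as the formula appears in the text.

```python
import numpy as np

def local_po_scores(X, H, phi=None, rho=np.inf):
    """Signed second-step PO score of every sample over admissible directions.

    X   : (d, n) residual matrix, one column per remaining sample
    H   : (m, d) array of window directions, one per row
    phi : hypothetical callable mapping the n projected values of one
          direction to a normality statistic (stand-in for (3.5.14))
    rho : threshold; directions with phi(...) > rho are skipped
    Returns (scores, mod_index), both of length n.
    """
    P = H @ X                                   # row k holds h_k^T X
    med = np.median(P, axis=1, keepdims=True)
    mad = np.median(np.abs(P - med), axis=1, keepdims=True)  # unscaled MAD
    usable = mad[:, 0] > 0                      # skip degenerate directions
    if phi is not None:
        usable &= np.array([phi(p) <= rho for p in P], dtype=bool)
    Z = np.full(P.shape, -np.inf)               # excluded directions never win
    Z[usable] = (P[usable] - med[usable]) / mad[usable]
    mod_index = np.argmax(Z, axis=0)            # maximizing direction (MOD) per sample
    scores = Z[mod_index, np.arange(X.shape[1])]
    return scores, mod_index
```

The returned `mod_index` points back to the window that produced each sample's score, which is how the maximizing direction localizes an aberration. If an absolute value of the standardized projection is intended instead, the line filling `Z` would need `np.abs`, with the sign of the projection read off separately.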

In contrast to the PO scores $o_1$ from the global shape change detection procedure, the distribution of the statistic $o_2$ is unknown, so the cutoff value used to determine outliers should be chosen carefully. For detecting outliers in data from an unknown distribution, it is common to use a box plot and follow the rule that a point beyond the upper fence (Q3 + 1.5×IQR) is considered an outlier. However, this rule does not reflect the potential skewness of the distribution, because a single scale measure such as the IQR or the MAD does not account for asymmetry. This may result in false discoveries on one side of the distribution or mask actual outliers on the other side. Rousseeuw et al. (2016) proposed an alternative to single robust scale measures that accounts for the potential asymmetry of a distribution: they suggested applying a robust scale estimator separately to each of the two subsamples above and below the median. We adopt this idea for the second-step PO scores $o_2$, which often show a right-skewed distribution.

We obtain two scale estimates of $\{o_2(\check{X}_j \mid \check{X})\}$, denoted by $s_L$ and $s_R$ for the left and right sides, respectively. The proposed cutoff is then chosen by the following rule:

$$c_2 = \mathrm{Med}(o_2) + s_R \times \Phi^{-1}(1 - \alpha).$$

We look only at the right side of the distribution because we are interested in detecting data points with abnormally large outlyingness.
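The cutoff rule can be illustrated with a short sketch. The text does not specify which robust scale estimator is applied to the right-hand subsample, so the choice here is an assumption made purely for illustration: the median of the deviations above the median, multiplied by 1.4826 so that it estimates the standard deviation when the scores are Gaussian. The function name `right_side_cutoff` is likewise hypothetical.

```python
import numpy as np
from statistics import NormalDist

def right_side_cutoff(scores, alpha=0.01):
    """Cutoff Med(o2) + s_R * Phi^{-1}(1 - alpha) for a set of PO scores.

    s_R is an assumed stand-in estimator: 1.4826 times the median deviation
    of the scores lying above the median.
    """
    scores = np.asarray(scores, dtype=float)
    med = np.median(scores)
    right_dev = scores[scores > med] - med      # deviations on the right side only
    s_right = 1.4826 * np.median(right_dev)     # consistent with sigma under normality
    return med + s_right * NormalDist().inv_cdf(1.0 - alpha)
```

Samples would then be flagged with `scores >= right_side_cutoff(scores, alpha)`. Because only the right-hand deviations enter `s_right`, a long right tail widens the cutoff without being diluted by a tight left side.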

To sum up, the local shape change algorithm is proposed as follows:

1. Collect a set of window directions.

2. Get the residual vectors $\check{X}_j$ for $j \in I_2$. For each $\check{X}_j$, obtain the second-step PO score $o_2(\check{X}_j \mid \check{X})$ based on (3.5.15), with v being the collection of window directions considered. Here, $\check{X}$ is the matrix whose columns are $\check{X}_j$ for $j \in I_2$, but it does not need to contain the $j$th sample when the $j$th sample is inspected.

3. Collect the PO scores and declare a data point an outlier if $o_2(\check{X}_j \mid \check{X}) \ge c_2$ for a pre-determined level $\alpha$.
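Step 2 notes that the reference matrix need not contain the jth sample when that sample is inspected. One possible reading, shown here for the projections onto a single direction, is a leave-one-out standardization in which the median and MAD are computed from the other samples only. This is a hedged sketch: the function name `loo_standardize` is hypothetical, the MAD is unscaled, and a non-zero MAD is assumed.

```python
import numpy as np

def loo_standardize(p):
    """Standardize each projected value against the other samples only.

    p : 1-D array of projections h^T X_j onto one direction, one per sample.
    Assumes the leave-one-out MAD is never zero.
    """
    p = np.asarray(p, dtype=float)
    z = np.empty_like(p)
    for j in range(p.size):
        rest = np.delete(p, j)                  # drop the sample being inspected
        med = np.median(rest)
        mad = np.median(np.abs(rest - med))     # unscaled MAD of the others
        z[j] = (p[j] - med) / mad
    return z
```

Excluding the inspected sample keeps an extreme value from influencing its own reference median and MAD, which matters most when the number of remaining samples is small.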
