Outlier Nomination - Outlier Detection and Multicollinearity in Sequential Variable Selection:

It would not be advisable to recommend every observation nominated during the branching process, as this generally leads to unreasonably high false positive rates. Neither is it feasible to simply record all influence values and pick the largest, since influence measures tend to inflate as p increases. However, the influence values should provide important information for identifying extreme observations.

The second issue is addressed by “standardizing” the influence before storing it for later comparison. Storage occurs in two modes: running average and maximum influence values. Let D represent the vector of influence values, D∗ the vector of values to be stored, and c the cutoff value. At each step, the influence values are adjusted for storage such that

$$D^{*} = \frac{(D - c)_{+}}{3 \cdot \mathrm{IQR}(D)}$$

where (·)+ = max(·, 0). Several ideas coalesce in this equation. Most influence measures are not associated with any probability distribution, so the cutoff value is generally the best available estimate for the expected change in D. Since only those observations deemed sufficiently influential at a given point are of interest, the cutoff also makes a reasonable pivot point. The positive-part operator is justified because an observation may appear particularly “uninfluential” at a given step simply because a key variable is missing from the active set. Thus all observations not nominated are set to 0 for purposes of averaging rather than incorporating the negative values. The IQR is used for scaling since there is no reason to believe that influence measures will behave symmetrically (rather, they are highly likely to be quite skewed), and Mirkin [51] recommends using 1.5·IQR(x) in place of sd(x) for skewed data as a more stable measure of variation. Gelman [28] recommends dividing by 2·sd(x) for comparison purposes in regression situations; combining the two recommendations gives the divisor 2 × 1.5·IQR(D) = 3·IQR(D).
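The standardization and the two storage modes described above can be sketched in base R. This is only an illustration under stated assumptions, not the laron implementation: the function name `standardize_influence`, the influence values, and the cutoff are all hypothetical.

```r
# Standardize one step's influence values: positive part of (D - cutoff),
# scaled by 3 * IQR(D). Hypothetical helper, not part of laron.
standardize_influence <- function(D, cutoff) {
  pmax(D - cutoff, 0) / (3 * IQR(D))
}

# Hypothetical influence values for 5 observations at two steps.
steps <- list(
  c(0.1, 0.2, 0.3, 0.4, 5.0),
  c(0.2, 0.1, 0.5, 0.3, 9.0)
)
cutoff <- 1  # hypothetical cutoff value

n <- length(steps[[1]])
running_avg <- numeric(n)  # running average of standardized influence
running_max <- numeric(n)  # maximum standardized influence seen so far

for (s in seq_along(steps)) {
  d_star <- standardize_influence(steps[[s]], cutoff)
  running_avg <- running_avg + (d_star - running_avg) / s  # incremental mean
  running_max <- pmax(running_max, d_star)
}

running_avg
running_max
```

Observations below the cutoff contribute 0 at every step, so only the fifth observation accumulates a nonzero average and maximum here. The sketch does not guard against IQR(D) being 0, which a real implementation would have to handle.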

The main laron function automatically performs this outlier nomination process; however, the nomination may be changed afterwards with the outliers(fit, ...) function, without repeating the time-consuming branch-building process.

4.2.1 Nomination Criteria

Once all branches have been created, there is a collection of cases that have been removed to form new branches. This section addresses the options the user may specify to select the most interesting cases from among that group. All of the options described in this section may be used as arguments for the original laron call or the outliers function.

Instead of storing the influence measure of every observation at every step, laron only keeps track of the maximum and the average influence measures calculated during the process (after the standardization process described above). The infType argument can be used to specify which to use. The default is infType = "average", since the maximum is more likely to be unstable as p approaches n.

Other measures may also be used to nominate cases for investigation, controlled by the nominator argument. Three general measures can be passed to this argument: "influence" (the default); "selection.ratio" (or "sr"), which indicates the proportion of branches that omitted the observation; and "ranker", which is equal to the influence multiplied by the selection ratio plus one. The ranker appears to outperform the influence measure alone in high dimensional settings as a method for outlier nomination.
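A minimal base R sketch of the selection ratio and the ranker follows. The data are hypothetical, and the ranker formula is an assumption: the description "influence multiplied by the selection ratio plus one" is read here as influence × (selection ratio + 1), which the text does not confirm.

```r
# omitted[b, i] is TRUE when branch b omitted observation i.
# Hypothetical data: 4 branches, 3 observations.
omitted <- rbind(
  c(TRUE,  FALSE, FALSE),
  c(TRUE,  TRUE,  FALSE),
  c(TRUE,  FALSE, FALSE),
  c(FALSE, TRUE,  FALSE)
)

# Hypothetical standardized influence values for the same 3 observations.
influence <- c(12.0, 0.8, 0.1)

# Selection ratio: proportion of branches that omitted each observation.
selection_ratio <- colMeans(omitted)

# ASSUMED reading of the ranker: influence * (selection ratio + 1).
ranker <- influence * (selection_ratio + 1)

data.frame(influence, selection_ratio, ranker)
```

Under this reading, an observation that no branch omitted keeps its influence unchanged, while one omitted by every branch has its influence doubled.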

It is possible to control the cutoff point for nomination, which can be set manually or determined using one of three possible probability methods. The cutoff argument takes a vector of up to 4 cutoff values. These values are used to assign the relative significance of an observation. For example, we can assign specific cutoffs for the average Cook’s distance for the LARON fit to the simulated data set sim1 above.

> outliers(fit, cutoff = c(5, 10, 50, 100))
Most Influential Cases:
   Subject  Avg Cook’s Distance  Selection.Ratio
1:      91              367.357            0.938  ***
2:      31                1.281            0.562
3:      71                1.072            0.688
---
Influence Measure: Inf *** 100 ** 50 * 10 . 5 0

Alternatively, we could determine that any observation with sufficiently large selection ratio (say, > 0.6) should be nominated for investigation. In this case, 4 observations are nominated instead of only 1.

> outliers(fit, nominator="sr", cutoff = c(0.6, 0.8, 0.9))
Most Influential Cases:
   Subject  Avg Cook’s Distance  Selection.Ratio
1:      91             367.3574            0.938  **
2:      71               1.0716            0.688  .
3:      72               0.4527            0.875  *
4:      92               0.0967            0.688  .
Selection Ratio : Inf ** 0.9 * 0.8 . 0.6 0
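The mapping from a cutoff vector to significance symbols in the two outputs above can be mimicked with findInterval from base R. This is a sketch of the idea rather than the package's code: `significance_codes` is a hypothetical name, and treating a value equal to a cutoff as exceeding it is an assumption.

```r
# Assign a significance symbol according to how many cutoffs a value reaches.
# Hypothetical helper, not part of laron.
significance_codes <- function(values, cutoffs) {
  cutoffs <- sort(cutoffs)
  stopifnot(length(cutoffs) >= 1, length(cutoffs) <= 4)
  symbols <- c("", ".", "*", "**", "***")
  # findInterval counts the cutoffs that are <= each value (0 to 4).
  symbols[findInterval(values, cutoffs) + 1]
}

# Average Cook's distances from the first output above.
significance_codes(c(367.357, 1.281, 1.072), c(5, 10, 50, 100))

# Selection ratios from the second output above.
significance_codes(c(0.938, 0.688, 0.875, 0.688), c(0.6, 0.8, 0.9))
```

With these inputs the symbols agree with the printed tables: only subject 91 is flagged under the influence cutoffs, while all four subjects receive a symbol under the selection ratio cutoffs.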

If only a single cutoff value is specified and the influence is used, relative significance is assigned according to whether only the maximum or both the average influence and the maximum exceed the cutoff.

> outliers.laron(fit, cutoff = 100, infType = "max")
Most Influential Cases:
   Subject  Max Cook’s Distance  Selection.Ratio
1:      91              624.173            0.938  **
2:      47                4.670            0.000
3:      31                2.726            0.562
---
** Average > 100   * Max > 100
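The single-cutoff rule shown in the legend above can be written as a small base R function. The name `single_cutoff_codes` is hypothetical; the input values are the average and maximum Cook’s distances reported for subjects 91 and 31 in the outputs above.

```r
# "**" when both average and maximum exceed the cutoff,
# "*" when only the maximum does, "" otherwise.
# Hypothetical helper, not part of laron.
single_cutoff_codes <- function(average, maximum, cutoff) {
  ifelse(average > cutoff & maximum > cutoff, "**",
         ifelse(maximum > cutoff, "*", ""))
}

# Subjects 91 and 31: averages from the first output, maxima from the last.
single_cutoff_codes(average = c(367.357, 1.281),
                    maximum = c(624.173, 2.726),
                    cutoff = 100)
```

Subject 91 receives "**" because both its average and its maximum exceed 100, while subject 31 receives no symbol.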

An alternative is to determine cutoffs using probability inequalities. Most influence measures do not follow known probability distributions, so we rely instead on several theorems to make conservative nominations. Chebychev’s inequality [40], the default option, provides a maximal tail probability for random variables with unknown distributions. Suppose random variable X ∼ F, where F is an unknown distribution with finite mean (µ) and variance (σ²). Then Chebychev’s inequality states

$$P\left(\frac{|X - \mu|}{\sigma} \ge k\right) \le \frac{1}{k^{2}}.$$

Another built-in option is to use Gaussian quantiles as cutoff points, though it is generally not recommended that this be used except with average influence measures which may reasonably be expected to behave Normally due to the wonders of the Central Limit Theorem.

It is important to note that all of these inequalities (as we have used them) rely on the assumption that all nominator values are drawn from the same probability distribution. Although this is likely not the case for the influence measures (which generally depend on the size of the design matrix to determine the probability distribution), the standardization appears to make their distributions sufficiently similar as to allow these inequalities to work well in practice.

To select a probabilistic method in the laron or outliers function call, the argument probType can take values "chebychev" (the default) or "normal" for Gaussian quantile cutoffs. Partial matches are also acceptable. The probability values are by default set to probs = c(0.001, 0.01, 0.05, 0.1), and can take only up to 4 values for significance levels.
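To see how conservative the two options are relative to each other, the default probabilities can be converted into multiples of the standard deviation. The sketch below is an illustration only: the text does not state how laron turns probs into cutoffs, so the two-sided Chebychev bound and the one-sided Gaussian upper tail used here are assumptions.

```r
probs <- c(0.001, 0.01, 0.05, 0.1)  # default significance levels from the text

# Chebychev: P(|X - mu| / sigma >= k) <= 1 / k^2,
# so setting 1 / k^2 = p gives k = 1 / sqrt(p).
k_chebychev <- 1 / sqrt(probs)

# Gaussian: standard Normal quantile with upper-tail probability p
# (one-sided tail is an assumption).
k_normal <- qnorm(probs, lower.tail = FALSE)

round(data.frame(probs, k_chebychev, k_normal), 3)
```

For every probability the Chebychev multiplier is larger than the Gaussian one (for p = 0.01 it is 10 standard deviations, against roughly 2.33), which is the sense in which Chebychev-based nominations are conservative.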

4.2.2 Reading Outlier Plots

There are three outlier-specific graphs (visible in Figure 4.1) which can be obtained by plot(laron.fit, which.graphs = c(3, 6, 8)) or plot(outliers(laron.fit)).

The first outlier plot gives the half-normal probability plot of the selected nominator value. Half-normal plots are commonly used for data that are known to be strictly positive, as here. Observations that are near the top right are considered suspect, particularly if they come after any large vertical gaps. Nominated points are labeled with their name or subject ID.
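For readers unfamiliar with half-normal plots, the construction can be sketched in base R. This is not the plotting code used by laron: `half_normal_points` is a hypothetical helper, the nominator values are made up, and the plotting positions from ppoints are one common choice among several.

```r
# Pair sorted values with half-normal quantiles, i.e. quantiles of |Z|
# for Z standard Normal. Hypothetical helper, not part of laron.
half_normal_points <- function(values) {
  n <- length(values)
  ord <- order(values)
  data.frame(
    id = ord,                                      # original observation index
    theoretical = qnorm(0.5 + 0.5 * ppoints(n)),   # half-normal quantiles
    observed = values[ord]                         # sorted nominator values
  )
}

nominator <- c(0.1, 0.3, 0.2, 0.0, 0.4, 6.5)  # hypothetical nominator values
pts <- half_normal_points(nominator)

plot(pts$theoretical, pts$observed,
     xlab = "Half-normal quantiles", ylab = "Nominator value")

# Label the largest point with its observation index.
n <- nrow(pts)
text(pts$theoretical[n], pts$observed[n], labels = pts$id[n], pos = 2)
```

In this made-up example the sixth observation sits in the top right corner, separated from the rest by a large vertical gap, which is the pattern the text describes as suspect.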

The second plot is a scatterplot of the selection ratio against the observation ID. In this plot, by default, observations are labeled if their selection ratio exceeds 0.9. Any point in the upper part of the graph may warrant further investigation.

The third plot examines the selection ratio and influence value simultaneously, and provides a visual representation of observations likely to be nominated using the "ranker" nominator. As in the first plot, points near the upper right corner ought to be investigated further.

In document Outlier Detection and Multicollinearity in Sequential Variable Selection: A Least Angle Regression-Based Approach (Page 98-103)