1.4 Regression diagnostics
1.4.3 Review of methods for identifying multiple outliers in regression
In this subsection, I clarify the methods for identifying multiple outliers in regression analysis that appear in the literature as “single-step”, “backward-search” and “forward-search” methods. A single-step method identifies outliers in one step. A backward search starts from the set of all observations and, in each step, examines the “outlyingness” of all included observations and excludes the most extreme one. A forward search starts from a subset of observations that excludes all outliers and, in each step, reincludes an observation according to a certain rule. The word “outlyingness” refers to the plausibility that an observation is an outlier, and has been used by Atkinson, Riani and Cerioli [6] and Hadi and Simonoff [45].
The method for identifying a single outlier, which is based on the deletion of a single observation, was introduced in the previous section. The algebra of deleting a single observation can be generalized to the deletion of a set of observations, and hence the deletion residuals calculated from multiple deletions can be used to test for multiple outliers. Multiple deletion is a single-step method, which is also described in Cook and Weisberg [25] and Atkinson [1]. When the number of outliers is small, for example two or three, the number of deletion residuals that need to be calculated is moderate, $\binom{m}{2}$ or $\binom{m}{3}$. However, in real data sets the number of outliers is usually small but unknown. When $m$ observations are included in the data set, $\sum_{i=1}^{m'} \binom{m}{i}$ hypotheses need to be tested, where $m'$ is a moderate number smaller than $m$. If the atypical observations outnumber the typical observations, then the typical observations become atypical. Therefore, the number of outliers is usually smaller than or equal to $m/2$. Data coming from two clusters of equal size are an example of $m_0 = m_1$, where $m_0$ and $m_1$ denote the numbers of typical and atypical observations, respectively [6]. When $m$ and $m_1$ are both large, testing all $\sum_{i=1}^{m'} \binom{m}{i}$ hypotheses requires massive computation time.
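To give a sense of how quickly this count grows, the following short Python sketch evaluates the sum of binomial coefficients for a few values of $m$ and $m'$. The function name `n_deletion_tests` and the example sizes are illustrative assumptions and do not come from the cited literature.

```python
from math import comb


def n_deletion_tests(m: int, m_max: int) -> int:
    """Number of subsets of size 1, ..., m_max that can be deleted from m observations."""
    return sum(comb(m, i) for i in range(1, m_max + 1))


# Two or three suspected outliers among m = 20 observations: a moderate count.
print(comb(20, 2), comb(20, 3))   # 190 1140
print(n_deletion_tests(20, 3))    # 1350

# Unknown number of outliers, up to m' = 10 among m = 100 observations.
print(n_deletion_tests(100, 10))
```

Even for a data set of moderate size, the last count is far beyond what can be enumerated routinely, which is the computational objection to single-step multiple deletion.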
Since the single-step multiple deletion of Cook and Weisberg [25] and Atkinson [1] is time consuming, one may consider using multi-step methods instead. It is straightforward to use single-case diagnostics backward, that is, to include all observations for estimation and identify potential outliers from most to least extreme (Prescott [70], Tietjen, Moore, and Beckman [95]). Such methods are backward search methods. Hadi and Simonoff [44] gave a review of early work on multi-step methods, but they did not distinguish forward methods from backward methods. The methods using single-case diagnostics backward are criticized for suffering from the “masking” problem [48]: multiple outliers may hide each other’s effects, because including the other outliers in parameter estimation may lead to a small residual for a given outlier [4, 5, 45]. Hawkins [49] suggested a forward search that excludes all possible outliers and then tests the excluded observations sequentially for reinclusion. However, neither the starting data set nor the test statistic for reinclusion was specified in [49]. On the other hand, the classical least squares estimator is criticized for its lack of robustness, since a single outlier can have a large effect on the estimates. Rousseeuw [78] proposed the least median of squares estimator, which minimizes the median of the squared residuals, whereas the least squares estimator minimizes the sum of the squared residuals.
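The contrast between the two criteria can be illustrated with a minimal Python sketch. It assumes simulated data with a cluster of six outliers and approximates the least median of squares fit by searching over random elemental subsets, a common approximation rather than the exact estimator of [78]. The function names, the number of trials and the data are all hypothetical choices made for illustration.

```python
import numpy as np


def ols_fit(X, y):
    """Least squares: minimizes the sum of the squared residuals."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


def lms_fit(X, y, n_trials=2000, seed=0):
    """Approximate least median of squares via random elemental subsets.

    Each trial fits p randomly chosen observations exactly and evaluates
    the median of the squared residuals over all observations.
    """
    rng = np.random.default_rng(seed)
    m, p = X.shape
    best_beta, best_crit = None, np.inf
    for _ in range(n_trials):
        idx = rng.choice(m, size=p, replace=False)
        try:
            beta = np.linalg.solve(X[idx], y[idx])
        except np.linalg.LinAlgError:
            continue  # skip degenerate elemental subsets
        crit = np.median((y - X @ beta) ** 2)
        if crit < best_crit:
            best_beta, best_crit = beta, crit
    return best_beta, best_crit


# Simulated data (an assumption): true line y = 1 + 2x, with the six
# observations at the largest x values replaced by a cluster near y = 2.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 30)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x + rng.normal(scale=0.2, size=x.size)
y[-6:] = 2.0 + rng.normal(scale=0.2, size=6)

beta_ols = ols_fit(X, y)
beta_lms, crit = lms_fit(X, y)
print("least squares slope:", beta_ols[1])
print("approximate LMS slope:", beta_lms[1])
```

In this setting the least squares slope is pulled towards the outlying cluster, while the slope of the approximate least median of squares fit stays close to the value used to generate the typical observations.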
Recently, Atkinson [2] combined the least median of squares with the forward search, and this method is introduced in detail in the books [3] and [6]. This forward search starts from a small, robustly selected subset of observations which excludes all outliers. Then the size of the subset used for estimation increases sequentially, and in each step, the least extreme observation among those excluded is tested for outlyingness before being reincluded. The search stops if the least extreme observation among those excluded is declared significant at that step. A drawback of Atkinson’s forward search is the difficulty in finding the distribution of the test statistic which is used for testing outlyingness in each step. Also, an adjustment for multiplicity is needed for this step-wise procedure if all outliers are tested simultaneously. Atkinson and Riani [5] introduced some methods to approximate this distribution. They also addressed the simultaneous inference of their methods in that article. There are also other forward search methods, such as those of Hadi [44] and Hadi and Simonoff [45]. All the methods mentioned so far are frequentist methods. Many graphical methods and influence statistics, such as plots of leverage and Cook’s D, are not discussed in this thesis. These methods are introduced in many standard books, for example, [1, 25, 26, 72].
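The general shape of such a forward search can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the procedure of [2]: the starting subset is taken from an approximate least median of squares fit over random elemental subsets, one observation is added per step, and no formal stopping test is applied, since the reference distribution of the statistic is exactly the difficulty noted above. Instead, the sketch records the smallest deletion-type residual among the excluded observations at each step. All function names, the starting size and the simulated data are hypothetical.

```python
import numpy as np


def robust_start(X, y, n_start, n_trials=500, seed=0):
    """Indices of the n_start observations closest to an approximate LMS fit."""
    rng = np.random.default_rng(seed)
    m, p = X.shape
    best_r2, best_crit = None, np.inf
    for _ in range(n_trials):
        idx = rng.choice(m, size=p, replace=False)
        try:
            beta = np.linalg.solve(X[idx], y[idx])
        except np.linalg.LinAlgError:
            continue  # skip degenerate elemental subsets
        r2 = (y - X @ beta) ** 2
        crit = np.median(r2)
        if crit < best_crit:
            best_r2, best_crit = r2, crit
    return np.argsort(best_r2)[:n_start]


def forward_search(X, y, n_start):
    """Add the least extreme excluded observation at each step.

    Returns the order of entry and, for each step, the smallest absolute
    deletion-type residual among the excluded observations.
    """
    m, p = X.shape
    subset = [int(i) for i in robust_start(X, y, n_start)]
    order, trajectory = [], []
    while len(subset) < m:
        Xs, ys = X[subset], y[subset]
        beta, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
        resid = ys - Xs @ beta
        s2 = resid @ resid / (len(subset) - p)  # residual variance in the subset
        XtX_inv = np.linalg.inv(Xs.T @ Xs)
        outside = np.setdiff1d(np.arange(m), subset)
        Xo = X[outside]
        h = np.einsum("ij,jk,ik->i", Xo, XtX_inv, Xo)  # x_i' (X'X)^{-1} x_i
        d = np.abs(y[outside] - Xo @ beta) / np.sqrt(s2 * (1.0 + h))
        j = int(np.argmin(d))
        order.append(int(outside[j]))
        trajectory.append(float(d[j]))
        subset.append(int(outside[j]))
    return order, trajectory


# Simulated data (an assumption): true line y = 1 + 2x with three shifted observations.
rng = np.random.default_rng(1)
x = np.linspace(0.0, 10.0, 30)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x + rng.normal(scale=0.2, size=x.size)
outliers = [5, 14, 22]
y[outliers] += 15.0

order, trajectory = forward_search(X, y, n_start=6)
print("last three to enter:", order[-3:])
print("statistic over the last four steps:", np.round(trajectory[-4:], 2))
```

With gross outliers of this kind, the shifted observations are expected to enter last, and the recorded statistic jumps sharply at the step where the first of them would be included. Deciding how large a jump is significant is the distributional problem addressed in [5].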
Outliers usually have strong effects on parameter estimation and can lead to wrong models. If one is interested in minimizing the effects of outliers, then the robustness of the estimates needs to be studied. The primary purpose of the thesis is to construct an efficient method to identify outliers rather than to study the robustness of estimates. The choice of explanatory variables may also influence the results of deletion diagnostics. Since the main objective of the thesis is the identification of outliers, the effects of variable selection are not discussed in this thesis.