Linear regression models are used to analyze data from many fields of study. These data are often contaminated and contain outliers. In some cases, the outliers are more interesting than the rest of the data. The main purpose of this chapter is to identify outliers in linear regression models. Methods based on deletion residuals are powerful for identifying single outliers. Under the null hypothesis, the deletion residual of an observation is shown to follow a Student's t distribution when this observation is the only suspected outlier in a given dataset. However, a real dataset usually contains more than one outlier, and multiple outliers may hide each other's effects, which is called the “masking” problem. Hence, simply deleting observations sequentially is not suitable, and the effects of all observations need to be taken into account simultaneously. When there is more than one outlier and the distribution of the outliers is assumed to have a mean shift, I have proved that the marginal distribution of the deletion residual of an observation is a doubly noncentral t distribution. Hence, the marginal distribution of the squared deletion residual is a doubly noncentral F distribution.
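As a concrete illustration of the deletion residuals discussed above, the following is a minimal sketch for the special case of simple linear regression (one predictor plus an intercept). The function name `deletion_residuals` and the restriction to a single predictor are assumptions made for illustration; the chapter itself treats the general linear model.

```python
import math

def deletion_residuals(x, y):
    """Deletion (externally studentized) residuals for y = b0 + b1*x + error."""
    n = len(x)
    p = 2  # number of regression coefficients: intercept and slope
    if n <= p + 1:
        raise ValueError("need at least p + 2 observations")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    sse = sum(e * e for e in resid)
    out = []
    for xi, e in zip(x, resid):
        # Leverage of this observation in the full fit.
        h = 1.0 / n + (xi - x_bar) ** 2 / sxx
        # Error variance estimated with this observation deleted,
        # obtained from the full fit without refitting.
        s2_deleted = (sse - e * e / (1.0 - h)) / (n - p - 1)
        out.append(e / math.sqrt(s2_deleted * (1.0 - h)))
    return out
```

When a single observation is the only suspected outlier, its deletion residual can be compared with quantiles of the Student's t distribution with n - p - 1 degrees of freedom.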
The outlier identification problem can be viewed as a multiple testing problem. In this chapter, a multiple testing method to identify outliers was proposed. The proposed method is a Bayesian method in which the test statistics are deletion residuals and the prior distributions of $\pi_0$ and $\mu_i$ are Beta and normal, respectively. An importance sampling method was used to compute the marginal posterior probability that an observation is an outlier given its own deletion residual. This posterior probability was used to measure the outlyingness of an observation. Decisions can then be made by incorporating certain decision rules.
In the last section of this chapter, the proposed Bayesian method was applied to simulated datasets. The simulation parameters were varied over a set of values, and various priors were employed to study how the priors affect the posterior. First, the proposed Bayesian method was applied to two single datasets; the ROC curve was plotted and the area under the ROC curve (AUC) was calculated for each dataset and each combination of parameters. Second, the method was applied to two sets of data, each consisting of 1000 datasets; the ROC curve of the average TPR versus FPR was plotted and the average AUC over the 1000 replicates was calculated for each set of data and each prior. In both simulation studies, the resulting AUC values are high for various choices of priors, indicating that the proposed method can identify a majority of the outliers with tolerable error. The ROC curves and AUC values obtained with different priors are similar, indicating that the posterior probability is not sensitive to the chosen priors. Finally, a factorial design analysis was used to compare the AUC over a wider range of simulation and prior parameters. The results of the factorial design analysis show that the priors do not affect the marginal posterior probability $P(H_i = 1 \mid r_i^{*2})$ as long as the sample size is not too small.
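The ROC and AUC summaries used in the simulation studies can be sketched as follows. This is an illustrative implementation, not the code used for the reported results; the function names `roc_curve` and `auc` are hypothetical, and the scores stand in for the posterior probabilities of being an outlier.

```python
def roc_curve(scores, labels):
    """Return (fpr, tpr) lists obtained by sweeping a threshold over the scores.

    labels: 1 for a true outlier, 0 otherwise; a larger score means more outlying.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one outlier and one non-outlier")
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    k = 0
    while k < len(order):
        # Handle tied scores together so that ties produce a diagonal segment.
        s = scores[order[k]]
        while k < len(order) and scores[order[k]] == s:
            if labels[order[k]] == 1:
                tp += 1
            else:
                fp += 1
            k += 1
        fpr.append(fp / n_neg)
        tpr.append(tp / n_pos)
    return fpr, tpr

def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2.0
               for i in range(1, len(fpr)))
```

An AUC of 1 means every true outlier receives a higher score than every non-outlier, while an AUC of 0.5 corresponds to scores that carry no information about outlier status.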
When calculating the posterior probabilities $P(H_i = 1 \mid r_i^{*2})$, I used Patnaik’s approximation to calculate the density of the doubly noncentral F distribution. To examine the accuracy of Patnaik’s approximation, I also proposed an algorithm to calculate the density of the doubly noncentral F distribution and compared the results obtained by the two methods. For the first simulated dataset, the maximum differences between the densities computed by the two methods are not larger than 0.071, and the maximum differences [...] Patnaik’s approximation is also faster than the proposed algorithm. The results show that the doubly noncentral F density calculated by Patnaik’s approximation is not too far from the true density, and that computation time can be saved by using Patnaik’s approximation.
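For reference, a Patnaik-type approximation to the doubly noncentral F density can be sketched as below. The text does not state the formula, so this uses the textbook two-moment form, in which each noncentral chi-square variable with n degrees of freedom and noncentrality λ (mean n + λ) is replaced by a scaled central chi-square. The function names, and the assumption that this matches the parameterization used in the chapter, are mine.

```python
import math

def central_f_pdf(x, d1, d2):
    """Density of the central F distribution; d1 and d2 may be fractional."""
    if x <= 0.0:
        return 0.0
    log_pdf = (0.5 * d1 * math.log(d1) + 0.5 * d2 * math.log(d2)
               + (0.5 * d1 - 1.0) * math.log(x)
               - 0.5 * (d1 + d2) * math.log(d2 + d1 * x)
               + math.lgamma(0.5 * (d1 + d2))
               - math.lgamma(0.5 * d1) - math.lgamma(0.5 * d2))
    return math.exp(log_pdf)

def patnaik_dnc_f_pdf(x, n1, n2, lam1, lam2):
    """Patnaik-type approximation to the doubly noncentral F density.

    n1, n2: numerator and denominator degrees of freedom.
    lam1, lam2: numerator and denominator noncentrality parameters.
    """
    # Matching the first two moments of each noncentral chi-square with a
    # scaled central chi-square gives fractional degrees of freedom f1, f2.
    f1 = (n1 + lam1) ** 2 / (n1 + 2.0 * lam1)
    f2 = (n2 + lam2) ** 2 / (n2 + 2.0 * lam2)
    # The ratio is then approximately scale * F(f1, f2).
    scale = (1.0 + lam1 / n1) / (1.0 + lam2 / n2)
    return central_f_pdf(x / scale, f1, f2) / scale
```

When both noncentrality parameters are zero, the approximation reduces exactly to the central F density, which gives a simple sanity check.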
I used an importance sampling method to calculate the marginal posterior probability $P(H_i = 1 \mid r_i^{*2})$, where I chose the joint prior as the importance function. As mentioned in Section 3.3.3, this choice may be problematic: the estimates would be poor if some importance ratios were much larger than the others. Gelman et al. [37] suggest examining the distribution of the sampled importance ratios to detect possible problems. I plotted the histogram of the logarithms of the importance ratios for some simulated datasets and did not find any exceedingly large ratio. Other importance functions may be considered, and a comparison of different importance functions could be carried out in the future. For example, Gibbs sampling methods could be used to approximate the joint posterior distribution of the parameters.
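The idea of using the joint prior as the importance function, together with the check on the importance ratios, can be sketched on a deliberately simplified toy model. Here the doubly noncentral F likelihood of the chapter is replaced by a normal mean-shift likelihood for a single statistic z, purely so that the example is self-contained. The function names, the hyperparameters `a`, `b`, `m`, `s`, and the effective sample size summary are illustrative assumptions, not part of the proposed method.

```python
import math
import random

def normal_pdf(x, mean=0.0, sd=1.0):
    u = (x - mean) / sd
    return math.exp(-0.5 * u * u) / (sd * math.sqrt(2.0 * math.pi))

def posterior_outlier_prob(z, a, b, m, s, n_draws=20000, seed=0):
    """Estimate P(H = 1 | z) by importance sampling from the prior.

    Toy model (an assumption, not the chapter's model):
      pi0 ~ Beta(a, b), mu ~ N(m, s^2), P(H = 0 | pi0) = pi0,
      z | H = 0 ~ N(0, 1), z | H = 1, mu ~ N(mu, 1).
    """
    rng = random.Random(seed)
    numerator = 0.0
    ratios = []
    for _ in range(n_draws):
        pi0 = rng.betavariate(a, b)
        mu = rng.gauss(m, s)
        lik_null = pi0 * normal_pdf(z)
        lik_alt = (1.0 - pi0) * normal_pdf(z, mean=mu)
        # With the prior as importance function, the importance ratio
        # is simply the likelihood of z at the sampled parameters.
        ratios.append(lik_null + lik_alt)
        numerator += lik_alt
    prob = numerator / sum(ratios)
    # Log ratios for a histogram check; zero ratios (underflow) are skipped.
    log_ratios = [math.log(w) for w in ratios if w > 0.0]
    # Effective sample size: small values signal a few dominant ratios.
    ess = sum(ratios) ** 2 / sum(w * w for w in ratios)
    return prob, log_ratios, ess
```

A histogram of `log_ratios` with a long right tail, or an effective sample size far below `n_draws`, would suggest that the prior is a poor importance function for the dataset at hand.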
The proposed method measures the outlyingness of the ith observation by the marginal posterior probability that the ith observation is an outlier given its own deletion residual. Hence, the information in the other observations is not included in this marginal posterior, and the outlyingness of all observations is not tested simultaneously. Future work could calculate the joint posterior probabilities $P(H = h \mid r_1^{*2}, \ldots, r_m^{*2})$, where $h \in \{0, 1\}^m$, or the marginal posterior probability $P(H_i = 1 \mid r_1^{*2}, \ldots, r_m^{*2})$, rather than the marginal posterior probability $P(H_i = 1 \mid r_i^{*2})$. Computing either $P(H = h \mid r_1^{*2}, \ldots, r_m^{*2})$ or $P(H_i = 1 \mid r_1^{*2}, \ldots, r_m^{*2})$ requires the calculation of the joint distribution of $r_1^{*2}, \ldots, r_m^{*2}$, the squares of the deletion residuals, which marginally have doubly noncentral t distributions. The joint distribution describes the dependence structure of $r_1^{*2}, \ldots, r_m^{*2}$, and thus the two posteriors given all the deletion residuals contain more information about the data than $P(H_i = 1 \mid r_i^{*2})$. So the proposed method is expected to improve if the outlyingness of all observations is measured simultaneously with $P(H = h \mid r_1^{*2}, \ldots, r_m^{*2})$.
Chapter 4
Amino acid sequence similarity of viral to human proteomes (An application of the Bayesian method proposed in Chapter 3)
4.1 Introduction
An immune system protects its host against disease by identifying and killing pathogens. An autoimmune disease, however, may cause an immune system to fail to kill pathogens and instead attack normal cells. It has been proposed that some autoimmune reactions are related to similarities in proteins between a virus and a host [66]. The amino acid sequence similarity of viral proteomes to the human proteome can be studied because protein sequences of the human proteome and of a number of viral proteomes are available in databanks [56]. Kanduc et al. [56] examined thirty viral proteomes for amino acid sequence similarity to the human proteome and revealed “a massive, indiscriminate, unexpected pentapeptide overlapping between viral and human proteomes”, where a pentapeptide is a peptide consisting of five amino acids. They also performed a linear regression analysis to determine whether a linear relationship exists between the level of viral overlaps with the human proteome and the length of the viral proteomes. Their results show that the level of overlaps does have a strong linear correlation with the viral proteome length (with a coefficient of determination $R^2 = 0.9497$). They also reported that “among the examined viruses, human T-lymphotropic virus 1, Rubella virus, and hepatitis C virus present the highest number of viral overlaps to the human proteome” [56, p.1755]. Recall from Section 1.5 that these three viruses are referred to as the Kanduc et al. identified (KI) viruses. One interesting question that arises from this linear regression analysis is whether some viral proteomes share significantly higher or lower levels of overlaps with the human proteome than the level predicted by the linear regression model, i.e., whether there are outliers in the level of viral overlaps with the human proteome. Are the KI viruses the interesting viruses?
The Bayesian method proposed in Chapter 3 is used to identify outliers in this dataset. I received the data published in Kanduc et al. [56] from Kusalik [58]. The results confirm the report of Kanduc et al. [56] when only the viruses whose proteomes are not too large (fewer than 10,000 pentapeptides) are analyzed. However, the full dataset used in [56] contains four viruses (Human herpesvirus 4, Human herpesvirus 6, Variola virus, and Human herpesvirus 5) with extremely large proteomes (greater than 32,009 pentapeptides), and these four viruses, if included, are more likely to be the outliers. Recall from Section 1.5 that these four viruses are referred to as the four extremely large (FEL) viruses. When all thirty viral proteomes are analyzed, the KI viruses still do not have a greater posterior probability of being an outlier, given their deletion residuals, than the other viruses among the remaining 26.
The dataset examined in [56] is described in Section 4.2. The results from the analysis of the full dataset are presented in Section 4.3.1, and the results from the analysis of the dataset excluding the FEL viruses are presented in Section 4.3.2. The conclusion is given in the last section.