Computation and Examples - Nonparametric Statistical Methods Using R

4.8 Correlation

4.8.4 Computation and Examples

We illustrate the R function cor.test in the following example.

Example 4.8.1 (Baseball Data, 2010 Season). Datasets of major league baseball statistics can be downloaded at the site baseballguru.com. For this example, we investigate the relationship between the batting average of a full-time player and the number of home runs that he hits. By full-time we mean that the batter had at least 450 official at bats during the season. These data are in the npsm dataset bb2010. Figure 4.13 displays the scatterplot of home run production versus batting average for full-time players. Based on this plot there appears to be an increasing monotone relationship between batting average and home run production, although the relationship is not very strong.

In the next code segment, the R analyses (based on cor.test) of Pearson’s, Spearman’s, and Kendall’s measures of association are displayed.

> with(bb2010,cor.test(ave,hr))

        Pearson's product-moment correlation

data:  ave and hr
t = 2.2719, df = 120, p-value = 0.02487
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.02625972 0.36756513
sample estimates:
      cor
0.2030727

> with(bb2010,cor.test(ave,hr,method="spearman"))

        Spearman's rank correlation rho

data:  ave and hr
S = 234500, p-value = 0.01267
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho
0.2251035

> with(bb2010,cor.test(ave,hr,method="kendall"))

        Kendall's rank correlation tau

data:  ave and hr
z = 2.5319, p-value = 0.01134
alternative hypothesis: true tau is not equal to 0
sample estimates:
      tau
0.1578534

FIGURE 4.13

Scatterplot of home runs (vertical axis) versus batting average (horizontal axis) for players who have at least 450 at bats during the 2010 Major League Baseball Season. Plot title: Batting statistics for full-time players, 2010.

For each of the methods the output contains a test statistic and associated p-value as well as the point estimate of the measure of association. The output for Pearson's method also contains a confidence interval (95% by default). For example, the results from Pearson's analysis give r = 0.203 and a p-value of 0.025.
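The quantities just described can also be extracted from the object that cor.test returns, which is convenient when building summary tables. The following is a minimal sketch on simulated data; the vectors x and y and the seed are assumptions made for illustration and are not the bb2010 data. It uses only the estimate, p.value, and conf.int components of the returned object.

```r
# Simulated stand-in for two continuous variables (illustration only).
set.seed(4813)
x <- rnorm(40)
y <- 0.4 * x + rnorm(40)

pearson  <- cor.test(x, y)                       # default method = "pearson"
spearman <- cor.test(x, y, method = "spearman")
kendall  <- cor.test(x, y, method = "kendall")

# Point estimates and p-values are available for all three methods.
estimates <- c(pearson$estimate, spearman$estimate, kendall$estimate)
pvalues   <- c(pearson  = pearson$p.value,
               spearman = spearman$p.value,
               kendall  = kendall$p.value)
print(round(estimates, 3))
print(round(pvalues, 4))

# Only Pearson's analysis carries a confidence interval.
print(pearson$conf.int)
print(is.null(spearman$conf.int))  # TRUE
print(is.null(kendall$conf.int))   # TRUE
```

The estimates are named cor, rho, and tau, matching the labels in the printed output.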

While all three methods show a significant positive association between home run production and batting average, the results for Spearman's and Kendall's procedures are somewhat stronger (p-values of 0.013 and 0.011, respectively) than that of Pearson's (p-value of 0.025). Based on the scatterplot, Figure 4.13, there are several outliers in the dataset which may have impaired Pearson's r. On the other hand, Spearman's $r_S$ and Kendall's $\hat{\tau}_K$ are robust to the effects of the outliers.

The output for Spearman's method contains the value of $r_S$ and the p-value of the test. It also reports the statistic

$$S = \sum_{i=1}^{n} [R(X_i) - R(Y_i)]^2.$$

Although it can be shown that

$$r_S = 1 - \frac{6S}{n^3 - n},$$

the statistic S does not readily show the strength of the association, let alone the sign of the monotonicity. Hence, in addition, we advocate forming the z statistic or the t-approximation of expression (4.38). The latter gives the value of 2.53 with an approximate p-value of 0.0127. This p-value agrees with the p-value calculated by cor.test and the value of the standardized test statistic is readily interpreted. See Exercise 4.9.16.
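As a quick arithmetic check, the reported values can be reproduced from the printed output alone. The sketch below uses the standard identity $r_S = 1 - 6S/(n^3 - n)$ and assumes that expression (4.38) is the usual t-approximation $t = r_S\sqrt{n-2}/\sqrt{1-r_S^2}$ with $n - 2$ degrees of freedom; that form is an assumption here, since the expression itself is not shown in this section. The sample size n = 122 is inferred from the 120 degrees of freedom in Pearson's output.

```r
# Values copied from the printed output above (S is rounded as printed).
S  <- 234500
df <- 120          # degrees of freedom from Pearson's output
n  <- df + 2       # so n = 122 players

# Recover Spearman's estimate from S.
r_s <- 1 - 6 * S / (n^3 - n)

# Assumed form of the t-approximation, referred to a t distribution
# with n - 2 degrees of freedom (two-sided p-value).
t_stat <- r_s * sqrt(n - 2) / sqrt(1 - r_s^2)
p_val  <- 2 * pt(-abs(t_stat), df = n - 2)

print(round(c(r_s = r_s, t = t_stat, p = p_val), 4))
```

Unlike S, the standardized statistic carries the sign of the association and can be read on a familiar scale.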

As with the R output for Spearman's procedure, the output for Kendall's procedure includes $\hat{\tau}_K$ and the p-value of the associated test. The results of the analysis based on Kendall's procedure indicate that there is a significant monotone increasing relationship between batting average and home run production, similar to the results for Spearman's procedure. The estimate of association is smaller than that of Spearman's, but recall that they are estimating different parameters. When cor.test computes an exact p-value (by default, when there are fewer than 50 pairs and no ties), R reports, instead of a z-test statistic, the test statistic T, which is the number of pairs that are monotonically increasing (concordant). In the absence of ties, it is related to $\hat{\tau}_K$ by the expression

$$\hat{\tau}_K = \frac{2T}{\binom{n}{2}} - 1.$$

The statistic T does not lend itself easily to interpretation of the test. Even the sign of monotonicity is missing. As with Spearman's procedure, we recommend also computing the standardized test statistic; see Exercise 4.9.17.
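The link between T and Kendall's estimate can be checked numerically. The sketch below counts the concordant pairs directly for a small tie-free sample and compares $2T/\binom{n}{2} - 1$ with the value returned by cor. The data vectors are made up for illustration, and the relation is assumed to hold only when there are no ties.

```r
# Small made-up sample with no ties in either variable (illustration only).
x <- c(1.2, 2.5, 3.1, 4.8, 5.0, 6.7, 7.3, 8.9)
y <- c(2.0, 1.1, 4.2, 3.9, 6.5, 8.8, 5.3, 7.4)
n <- length(x)

# All pairs (i, j) with i < j; a pair is concordant when x and y
# move in the same direction.
idx <- combn(n, 2)
concordant <- (x[idx[2, ]] - x[idx[1, ]]) * (y[idx[2, ]] - y[idx[1, ]]) > 0
T_stat <- sum(concordant)

# Kendall's estimate recovered from T (valid without ties).
tau_from_T <- 2 * T_stat / choose(n, 2) - 1

# Compare with R's estimate and with the statistic reported by the exact test.
tau_R <- cor(x, y, method = "kendall")
T_R   <- unname(cor.test(x, y, method = "kendall")$statistic)
print(c(T = T_stat, T_R = T_R, tau_from_T = tau_from_T, tau_R = tau_R))
```

With only eight tie-free pairs, cor.test uses the exact test, so its reported statistic is T rather than z.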

In general, a confidence interval yields a sense of the strength of the relationship. For example, a "quick" standard error is the length of a 95% confidence interval divided by 4. The function cor.test does not compute confidence intervals for Spearman's and Kendall's methods. We have written an R function, cor.boot.ci, which obtains a percentile bootstrap confidence interval for each of the three measures of association discussed in this section. Let B = [X, Y] be the matrix with the samples of $X_i$'s in the first column and the samples of $Y_i$'s in the second column. Then the bootstrap scheme resamples the rows of B with replacement to obtain a bootstrap sample of size n. This is performed $n_{BS}$ times. For each bootstrap sample, the estimate of the measure of association is obtained. These bootstrap estimates are collected and the α/2 and (1 − α/2) percentiles of this collection form the confidence interval. The default arguments of the function are:

> args(cor.boot.ci)
function (x, y, method = "spearman", conf = 0.95, nbs = 3000)
NULL

Besides Spearman's procedure, bootstrap percentile confidence intervals are computed for ρ and $\tau_K$ by using, respectively, the arguments method="pearson" and method="kendall". Note that (1 − α) is the confidence level and the default number of bootstrap samples is set at 3000. We illustrate this function in the next example.
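To make the resampling scheme concrete, here is a minimal base-R sketch of a percentile bootstrap for a correlation coefficient. The function name boot_cor_ci_sketch is hypothetical and this is not the source of cor.boot.ci; it only mirrors the description above (resample rows with replacement, recompute the estimate, take the α/2 and 1 − α/2 percentiles). The simulated data are likewise an assumption for illustration.

```r
# Hypothetical helper: percentile bootstrap CI for a correlation coefficient.
boot_cor_ci_sketch <- function(x, y, method = "spearman",
                               conf = 0.95, nbs = 3000) {
  n <- length(x)
  est <- numeric(nbs)
  for (b in seq_len(nbs)) {
    # Resample row indices so that each (x, y) pair stays together.
    rows <- sample.int(n, n, replace = TRUE)
    est[b] <- cor(x[rows], y[rows], method = method)
  }
  alpha <- 1 - conf
  quantile(est, probs = c(alpha / 2, 1 - alpha / 2))
}

# Simulated data (illustration only); fewer resamples to keep the run short.
set.seed(2010)
x <- rnorm(60)
y <- 0.5 * x + rnorm(60)
ci <- boot_cor_ci_sketch(x, y, nbs = 500)
print(ci)
```

Resampling indices rather than the two vectors separately is the key design point: it preserves the pairing, and hence the dependence, between the two variables.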

Example 4.8.2 (Continuation of Example 4.8.1). The code segment below obtains a 95% percentile bootstrap confidence interval for Spearman's $\rho_S$.

> library(boot)

> with(bb2010,cor.boot.ci(ave,hr))

2.5% 97.5%

0.05020961 0.39888150

The following code segment computes percentile bootstrap confidence intervals for Pearson’s and Kendall’s methods.

> with(bb2010,cor.boot.ci(ave,hr,method='pearson'))

2.5% 97.5%

0.005060283 0.400104126

> with(bb2010,cor.boot.ci(ave,hr,method='kendall'))

2.5% 97.5%

0.02816001 0.28729659

TABLE 4.1

Estimates and Confidence Intervals for the Three Methods.

The first three columns contain the results for the original data, while the last three columns contain the results for the changed data.

             Original Data          Outlier Data
             Est   LBCI  UBCI       Est2  LBCI2 UBCI2
Pearson's    0.20  0.00  0.40       0.11  0.04  0.36
Spearman's   0.23  0.04  0.40       0.23  0.05  0.41
Kendall's    0.16  0.03  0.29       0.16  0.04  0.29

To show the robustness of Spearman's and Kendall's procedures, we changed the home run production of the 87th batter from 32 to 320, mimicking a typographical error. Table 4.1 compares the results for all three procedures on the original and changed data.⁶

Note that the results for Spearman's and Kendall's procedures are essentially the same on the original dataset and the dataset with the outlier. For Pearson's procedure, though, the estimate changes from 0.20 to 0.11. Also, the confidence interval has been affected.
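The same kind of experiment is easy to repeat on any dataset. The sketch below uses a small deterministic, made-up sample (not the bb2010 data) in which one response is multiplied by 100 to mimic a gross recording error, and then compares the three estimates before and after the change.

```r
# Made-up, tie-free data with a strong increasing trend (illustration only).
x <- 1:30
y <- x + 0.3 * rep(c(-1, 1), times = 15)

# Mimic a gross recording error in one response value.
y_bad <- y
y_bad[15] <- 100 * y[15]

methods <- c("pearson", "spearman", "kendall")
before <- sapply(methods, function(m) cor(x, y,     method = m))
after  <- sapply(methods, function(m) cor(x, y_bad, method = m))

# Rows: original data, changed data, and the change in each estimate.
print(round(rbind(before, after, change = after - before), 3))
```

Because the rank-based estimates depend on the outlier only through its rank, a single wild value can move them by a bounded amount, whereas Pearson's r has no such protection.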

4.9 Exercises

4.9.1. Obtain a scatterplot of the telephone data. Overlay the least squares and R fits.

4.9.2. Write an R function which, given the results of a call to rfit, returns the diagnostic plots: Studentized residuals versus fitted values, with ±2 horizontal lines for outlier identification; normal q−q plot of the Studentized residuals, with ±2 horizontal lines for outlier identification; histogram of residuals; and a boxplot of the residuals.

4.9.3. Consider the free fatty acid data.

(a) For the Wilcoxon fit, obtain the Studentized residual plot and q−q plot of the Studentized residuals. Comment on the skewness of the errors.

(b) Redo the analysis of the free fatty acid data using the bent scores (bentscores1). Compare the summary of the regression coefficients with those from the Wilcoxon fit. Why is the bent score fit more precise (smaller standard errors) than the Wilcoxon fit?

⁶ These analyses were run in a separate step, so they may differ slightly from those already reported.

4.9.4. Using the baseball data, calculate a large sample confidence interval for the slope parameter when regressing weight on height. Compare the results to those obtained using the bootstrap discussed in Section 4.6.

4.9.5. Consider the following data:

x 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

y −7 0 5 9 −3 −6 18 8 −9 −20 −11 4 −1 7 5

Consider the simple regression model: $Y = \beta_0 + \beta_1 x + e$.

(a) For Wilcoxon scores, write R code which obtains a sensitivity curve of the rfit of the estimate of $\beta_1$, where the sensitivity curve is the difference in the estimates of $\beta_1$ between perturbed data and the original data.

(b) For this exercise, use the above data as the original data. Let $\hat{\beta}_1$ denote the Wilcoxon estimate of slope based on the original data. Then obtain 9 perturbed datasets using the following sequence of replacements to $y_{15}$: −995, −95, −25, −5, 5, 10, 30, 100, 1000. Let $\hat{\beta}_{1j}$ be the Wilcoxon fit of the jth perturbed dataset for j = 1, 2, . . . , 9. Obtain the sensitivity curve, which is the plot of $\hat{\beta}_{1j} - \hat{\beta}_1$ versus the jth replacement value for $y_{15}$.

4.9.6. For the simple regression model, the estimator of slope proposed by Theil (1950) is defined as the median of the pairwise slopes:

$$\hat{\beta}_T = \mathrm{med}\{b_{ij}\}, \quad \text{where } b_{ij} = (y_j - y_i)/(x_j - x_i) \text{ for } i < j.$$

(a) Write an R function which takes as input a vector of response variables and a vector of explanatory variables and returns the Theil estimate.

(b) For a simple regression model where the predictor is a continuous variable, write an R function which computes the bootstrap percentile confidence interval for the slope parameter based on Theil's estimate.

4.9.7. Consider the data of Example 4.4.1.

(a) Obtain a scatterplot of the data.

(b) Obtain the Wilcoxon fit of the linear trend model. Overlay the fit on the scatterplot. Obtain the Studentized residual plot and normal q−q plots. Identify any outliers and comment on the quality of fit.

(d) Estimate the mean of the response when time has the value 16 and find the 95% confidence interval for it which was discussed in Section 4.4.4.

4.9.8. Bowerman et al. (2005) present a dataset concerning the value of a home (x) and the upkeep expenditure (y). The data are in qhic. The variable x is in $1000’s of dollars while the y variable is in $10’s of dollars.

(a) Obtain a scatterplot of the data.

(b) Use Wilcoxon Studentized residual plots, values of $\hat{\tau}$, and values of the robust R² to decide whether a linear or a quadratic model fits the data better.

(c) Based on your model, estimate the expected expenditures (with a 95% confidence interval) for a house that is worth $155,000.

(d) Repeat (c) for a house worth $250,000.

4.9.9. Rewrite the aligned.test function to take an additional design matrix as its third argument instead of group/treatment membership. That is, for the model $Y = \alpha \mathbf{1} + X_1 \beta_1 + X_2 \beta_2 + e$, test the hypothesis $H_0: \beta_2 = 0$.

4.9.10. Hettmansperger and McKean (2011) discuss a dataset in which the dependent variable is the cloud point of a liquid, a measure of degree of crystallization in a stock, and the independent variable is the percentage of I-8 in the base stock. For the readers' convenience, the data can be found in the dataset cloud in the package npsm.

(a) Scatterplot the data. Based on the plot, is a simple linear regression model appropriate?

(b) Show by residual plots of the fits that the linear and quadratic polynomials are not appropriate but that the cubic model is.

(c) Use the R function polydeg, with a super degree set at 5, to determine the degree of the polynomial. Compare with Part (b).

4.9.11. Devore (2012) discusses a dataset on energy. The response variable is the energy output in watts while the independent variable is the temperature difference in degrees K. A polynomial fit is suggested. The data are in the dataset energy.

(a) Scatterplot the data. What degree of polynomial seems suitable?

(b) Use the R function polydeg, with a super degree set at 6, to determine the degree of the polynomial.

4.9.12. Consider the weather dataset, weather, discussed in Example 4.7.4. One of the variables is mean average temperature for the month of January (meantmp).

(a) Obtain a scatterplot of the mean average temperature versus the year. Determine the warmest and coldest years.

(b) Obtain the loess fit of the data. Discuss the fit in terms of years (were there warm trends or cold trends?).

4.9.13. As in the last problem, consider the weather dataset, weather. One of the variables is total snowfall (in inches), totalsnow, for the month of January.

(a) Scatterplot total snowfall versus year. Determine the years of maximal and minimal snowfalls.

(b) Obtain the local LS and robust loess fits of the data. Compare the fits.

(d) Obtain a boxplot of the residuals found in Part (c). Identify the outliers by year.

4.9.14. In the discussion of Figure 4.7, the nonparametric regression fit by ksmooth detects an artificial valley. Obtain the locally robust loess fit of this dataset (poly) and compare it with the ksmooth fit.

4.9.15. Using the baseball data, obtain the scatterplot of home run production versus RBIs. Then compute Pearson's, Spearman's, and Kendall's analyses for these variables. Comment on the plot and analyses.

4.9.16. Write an R function which computes the t-test version of Spearman's procedure and returns it along with the corresponding p-value and the estimate of $\rho_S$.

4.9.17. Repeat Exercise 4.9.16 for Kendall’s procedure.

4.9.18. Create a graphic similar to Figure 4.10.

4.9.19. Recall that, in general, the three measures of association estimate different parameters. Consider bivariate data $(X_i, Y_i)$ generated as follows:

$$Y_i = X_i + e_i, \quad i = 1, 2, \ldots, n,$$

where $X_i$ has a standard Laplace (double exponential) distribution, $e_i$ has a standard $N(0, 1)$ distribution, and $X_i$ and $e_i$ are independent.

(a) Write an R script which generates this bivariate data. The supplied R function rlaplace(n) generates n iid Laplace variates. For n = 30, compute such a bivariate sample. Then obtain the scatterplot and the association analyses based on the Pearson’s, Spearman’s, and Kendall’s procedures.

(b) Next write an R script which simulates the generation of these bivariate samples and collects the three estimates of association. Run this script for 10,000 simulations and obtain the sample averages of these estimates, their corresponding standard errors, and approximate 95% confidence intervals. Comment on the results.

4.9.20. The electronic memory game Simon was first introduced in the late 1970s. In the game there are four colored buttons which light up and produce a musical note. The device plays a sequence of light/note combinations and the goal is to play the sequence back by pressing the buttons. The game starts with one light/note and progressively adds one each time the player correctly recalls the sequence.⁷

Suppose the game were played by a set of statistics students in two classes (time slots). Each student played the game twice and recorded his or her longest sequence. The results are in the dataset simon.

Regression toward the mean is the phenomenon that if an observation is extreme on the first trial it will be closer to the average on the second trial. In other words, students who scored higher than average on the first trial would tend to score lower on the second trial, and students who scored low on the first trial would tend to score higher on the second.

(a) Obtain a scatterplot of the data.

(b) Overlay an R fit of the data. Use Wilcoxon scores. Also overlay the line y = x.

(d) Do these data suggest a regression toward the mean effect?

⁷ The game is implemented on the web. The reader is encouraged to use his or her favorite search engine and try it out.
