Principal Component Analysis

1. Show part 2 of Proposition 2.1.

2. For n = 100, 200, 500 and 1,000, simulate data from the true distribution referred to in Example 2.3. Separately for each n, carry out parts (a)–(c).

(a) Calculate the sample eigenvalues and eigenvectors.

(b) Calculate the two-dimensional PC data, and display them.

(c) Compare your results with those of Example 2.3 and comment.

(d) For n = 100, repeat the simulation 100 times, and save the eigenvalues you obtained from each simulation. Show a histogram of the eigenvalues (separately for each eigenvalue), and calculate the sample means of the eigenvalues. Compare the means to the eigenvalues of the true covariance matrix and comment.
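One possible way to organise the simulation in Problem 2 is sketched below. The covariance matrix `Sigma` is a placeholder chosen purely for illustration and is not the distribution of Example 2.3; the helper name `sample_pca` is also hypothetical. The sketch assumes NumPy is available and stores observations as rows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder covariance matrix: an assumption for illustration only,
# NOT the true distribution of Example 2.3.
Sigma = np.array([[2.4, -0.5, 0.0],
                  [-0.5, 1.0, 0.0],
                  [0.0, 0.0, 0.2]])
mu = np.zeros(3)

def sample_pca(n):
    """Simulate n observations; return eigenvalues, eigenvectors, 2-D PC scores."""
    X = rng.multivariate_normal(mu, Sigma, size=n)   # n x d, rows are observations
    Xc = X - X.mean(axis=0)                          # centre the data
    S = Xc.T @ Xc / (n - 1)                          # sample covariance matrix
    lam, Gamma = np.linalg.eigh(S)                   # eigh returns ascending order
    order = np.argsort(lam)[::-1]                    # re-order to descending
    lam, Gamma = lam[order], Gamma[:, order]
    scores = Xc @ Gamma[:, :2]                       # first two PC scores
    return lam, Gamma, scores

for n in (100, 200, 500, 1000):
    lam, Gamma, scores = sample_pca(n)
    print(n, np.round(lam, 3))

# Part (d): repeat the n = 100 simulation 100 times and average the eigenvalues.
lams = np.array([sample_pca(100)[0] for _ in range(100)])
mean_lam = lams.mean(axis=0)
true_lam = np.sort(np.linalg.eigvalsh(Sigma))[::-1]
print("mean sample eigenvalues:", np.round(mean_lam, 3))
print("true eigenvalues:       ", np.round(true_lam, 3))
```

The columns of `lams` can be passed to any histogram routine for the displays requested in part (d).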

3. Let X satisfy the assumptions of Theorem 2.5. Let A be a non-singular matrix of size d × d, and let a be a fixed d-dimensional vector. Put T = AX + a, and derive expressions for

(a) the mean and covariance matrix of T, and

(b) the mean and covariance matrix of the PC vectors of T.

4. For each of the fourteen subjects of the HIV flow cytometry data described in Example 2.4,

(a) calculate the first and second principal component scores for all observations,

(b) display scatterplots of the first and second principal component scores and

(c) show parallel coordinate views of the first and second PCs together with their respective density estimates.

Based on your results of parts (a)–(c), determine whether there are differences between the HIV+ and HIV− subjects.

5. Give an explicit proof of part 2 of Theorem 2.12.

6. The MATLAB commands svd, princomp and pcacov can be used to find eigenvectors and singular values or eigenvalues of the data X or its covariance matrix S. Examine and explain the differences between these MATLAB functions, and illustrate with data that are at least of size 3 × 5.

7. Let X and pj satisfy the assumptions of Theorem 2.12, and let Hj be the matrix of (2.9) for j ≤ d. Let λk be the eigenvalues of Σ.

(a) Show that Hj satisfies

$\Sigma H_j = \lambda_j H_j$, $H_j \Sigma H_j = \lambda_j H_j$ and $\operatorname{tr}(H_j H_j) = 1$.

(b) Assume that Σ has rank d. Put $H = \sum_{k=1}^{d} \lambda_k H_k$. Show that $H = \Gamma \Lambda \Gamma^T$, the spectral decomposition of Σ as defined in (1.19) of Section 1.5.2.

8. Prove Theorem 2.17.

9. Let X be d × n data with d > n. Let S be the sample covariance matrix of X. Put $Q = (n-1)^{-1} X^T X$.

(a) Determine the relationship between the spectral decompositions of S and Q.

(b) Relate the spectral decomposition of Q to the singular value decomposition of X.

(c) Explain how Q instead of S can be used in a principal component analysis of HDLSS data.
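The relationships asked for in Problem 9 can be checked numerically before they are proved. The sketch below uses synthetic data with arbitrary sizes and assumes that X has been centred, so that S = (n − 1)^{-1} X X^T; both the data and the centring convention are assumptions of this illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 50, 8                               # HDLSS setting: d > n (sizes are arbitrary)
X = rng.standard_normal((d, n))            # synthetic d x n data, columns are observations
Xc = X - X.mean(axis=1, keepdims=True)     # centre each variable

S = Xc @ Xc.T / (n - 1)                    # d x d sample covariance matrix
Q = Xc.T @ Xc / (n - 1)                    # n x n dual matrix

lam_S = np.sort(np.linalg.eigvalsh(S))[::-1]
lam_Q, V = np.linalg.eigh(Q)
order = np.argsort(lam_Q)[::-1]
lam_Q, V = lam_Q[order], V[:, order]

# Singular values of the centred data
sv = np.linalg.svd(Xc, compute_uv=False)

r = n - 1                                  # rank of the centred data in the generic case
# The nonzero eigenvalues of S and Q agree and equal sv^2 / (n - 1).
print(np.allclose(lam_S[:r], lam_Q[:r]))
print(np.allclose(lam_Q[:r], sv[:r] ** 2 / (n - 1)))

# Eigenvectors of S recovered from those of Q: eta_j = Xc v_j / sqrt((n - 1) lam_j)
Eta = Xc @ V[:, :r] / np.sqrt((n - 1) * lam_Q[:r])
print(np.allclose(S @ Eta, Eta * lam_Q[:r]))
```

Working with the n × n matrix Q avoids an eigendecomposition of the much larger d × d matrix S, which is the practical point of part (c).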

10. Let X be d × n data. Explain the difference between scaling and sphering of the data. Compare the PCs of raw, scaled and sphered data theoretically, and illustrate with an example of data with d ≥ 3 and n ≥ 5.
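A small numerical illustration of the two transformations in Problem 10 is sketched below. It assumes that scaling means dividing each centred variable by its sample standard deviation and that sphering means multiplying the centred data by S^{-1/2}; the mixing matrix `A` and the sample size are arbitrary choices for this example.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 3, 200
A = np.array([[3.0, 0.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.5, 0.2, 0.1]])             # arbitrary mixing matrix (assumption)
X = A @ rng.standard_normal((d, n))         # synthetic d x n data
Xc = X - X.mean(axis=1, keepdims=True)
S = np.cov(X)                               # rows are variables

# Scaling: divide each variable by its standard deviation.
D_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(S)))
X_scaled = D_inv_sqrt @ Xc

# Sphering: multiply by S^{-1/2}, built from the spectral decomposition of S.
lam, Gamma = np.linalg.eigh(S)
S_inv_sqrt = Gamma @ np.diag(lam ** -0.5) @ Gamma.T
X_sphered = S_inv_sqrt @ Xc

print(np.round(np.cov(X_scaled), 3))    # correlation matrix of X: unit diagonal
print(np.round(np.cov(X_sphered), 3))   # identity matrix
```

Scaled data keep their correlations, so their PCs are those of the correlation matrix, whereas sphered data have identity covariance and hence no preferred PC directions.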

11. Consider the breast cancer data and the Dow Jones returns.

(a) Calculate the first eight principal components for the Dow Jones returns and show PC score plots of these components.

(b) Repeat part (a) for the raw and scaled breast cancer data.

(c) Comment on the figures obtained in parts (a) and (b).

12. Under the assumptions of Theorem 2.20, give an explicit expression for a test statistic which tests whether the jth eigenvalue λj of a covariance matrix with rank r is zero. Apply this test statistic to the abalone data to determine whether the smallest eigenvalue could be zero. Discuss your inference results.

13. Give a proof of Corollary 2.27.

14. For the abalone data, derive the classical Linear Regression solution, and compare it with the solution given in Example 2.19:

(a) Determine the best linear least squares model, and calculate the least squares residuals.

(b) Determine the best univariate predictor, and calculate its least squares residuals.

(c) Consider and calculate a ridge regression model as in (2.39).

(d) Use the covariance matrix Cp of (2.34) and the PC-based estimator of (2.43) to determine the least-squares residuals of this model.

(e) Compare the different approaches, and comment on the results.

15. Give an explicit derivation of the regression estimator βP shown in the diagram in Section 2.8.2, and determine its relationship with βW.

16. For the abalone data of Example 2.19, calculate the least-squares estimate βLS, and carry out tests of significance for all explanatory variables. Compare your results with those obtained from Principal Component Regression and comment.

17. Consider the model given in (2.45) and (2.46). Assume that the matrix of latent variables F is of size p × n and p is smaller than the rank of X. Show that F is the matrix which consists of the first p sphered principal component vectors of X.

18. For the income data described in Example 3.7, use the nine variables described there with the income variable as response. Using the first 1,000 records, calculate and compare the regression prediction obtained with least squares and principal components. For the latter case, examine κ = 1, ..., 8, and discuss how many predictors are required for good prediction.
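The comparison in Problem 18 could be organised along the following lines. The sketch uses synthetic predictors and a synthetic response rather than the income data, regresses the centred response on the first κ PC scores, and reports residual sums of squares; all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 1000, 8
X = rng.standard_normal((n, d)) @ rng.standard_normal((d, d))   # synthetic predictors
beta = rng.standard_normal(d)
y = X @ beta + 0.5 * rng.standard_normal(n)                      # synthetic response

Xc, yc = X - X.mean(axis=0), y - y.mean()

# Least squares fit on all d predictors
b_ls, *_ = np.linalg.lstsq(Xc, yc, rcond=None)
rss_ls = np.sum((yc - Xc @ b_ls) ** 2)

# Eigenvectors of the sample covariance matrix, in descending eigenvalue order
lam, Gamma = np.linalg.eigh(np.cov(Xc, rowvar=False))
Gamma = Gamma[:, np.argsort(lam)[::-1]]

rss_pcr = []
for kappa in range(1, d + 1):
    W = Xc @ Gamma[:, :kappa]                  # first kappa PC scores
    g, *_ = np.linalg.lstsq(W, yc, rcond=None)
    rss_pcr.append(np.sum((yc - W @ g) ** 2))
    print(kappa, round(rss_pcr[-1], 1))
print("least squares:", round(rss_ls, 1))
```

With all eight PCs the fit coincides with least squares, so the question of interest is how quickly the residual sum of squares approaches that value as κ grows.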

Problems for Part I

Canonical Correlation Analysis

19. Prove part 3 of Proposition 3.1.

20. For the Swiss bank notes data of Example 2.5, use the length and height variables to define X[1] and the distances of frame and the length of the diagonal to define X[2].

(a) For X[1] and X[2], calculate the between covariance matrix S12 and the matrix of canonical correlations C.

(b) Calculate the left and right eigenvectors p and q of C, and comment on the weights for each variable.

(c) Calculate the three canonical transforms. Compare the weights of these canonical projections with the weights obtained in part (b).

(d) Calculate and display the canonical correlation scores, and comment on the strength of the correlation.
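The computations in Problem 20 follow a fixed sequence, sketched below on synthetic data rather than the Swiss bank notes. The sketch assumes the sample matrix of canonical correlations is C = S1^{-1/2} S12 S2^{-1/2} and that the canonical transforms are obtained by applying S1^{-1/2} and S2^{-1/2} to the left and right singular vectors of C; the helper `inv_sqrt` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d1, d2 = 500, 2, 3
Z = rng.standard_normal((n, 1))                        # shared latent signal (synthetic)
X1 = Z @ np.array([[1.0, 0.5]]) + rng.standard_normal((n, d1))
X2 = Z @ np.array([[0.8, -0.3, 0.2]]) + rng.standard_normal((n, d2))

def inv_sqrt(S):
    """Inverse square root of a symmetric positive definite matrix."""
    lam, G = np.linalg.eigh(S)
    return G @ np.diag(lam ** -0.5) @ G.T

X1c, X2c = X1 - X1.mean(axis=0), X2 - X2.mean(axis=0)
S1 = X1c.T @ X1c / (n - 1)
S2 = X2c.T @ X2c / (n - 1)
S12 = X1c.T @ X2c / (n - 1)                  # between covariance matrix

C = inv_sqrt(S1) @ S12 @ inv_sqrt(S2)        # matrix of canonical correlations
P, upsilon, Qt = np.linalg.svd(C, full_matrices=False)

Phi = inv_sqrt(S1) @ P                       # canonical transforms for X1 (columns)
Psi = inv_sqrt(S2) @ Qt.T                    # canonical transforms for X2 (columns)

U = X1c @ Phi                                # canonical correlation scores
V = X2c @ Psi
corr_first = np.corrcoef(U[:, 0], V[:, 0])[0, 1]
print(np.round(upsilon, 3), round(corr_first, 3))
```

The sample correlation of the first pair of scores equals the largest singular value of C, which gives a quick check on the implementation.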

21. Give a proof of Proposition 3.8.

22. Consider (3.19). Find the norms of the canonical transforms ϕk and ψk, and derive the two equalities. Determine the singular values of $\Sigma_{12}$, and state the relationship between the singular values of $\Sigma_{12}$ and C.

23. Consider the abalone data of Examples 2.13 and 2.17. Use the three length measurements to define X[1] and the weight measurements to define X[2]. The abalone data have 4,177 records. Divide the data into four subsets which consist of the observations 1–1,000, 1,001–2,000, 2,001–3,000 and 3,001–4,177. For each of the data subsets and for the complete data

(a) calculate the eigenvectors of the matrix of canonical correlations,

(b) calculate and display the canonical correlation scores, and

(c) compare the results from the subsets with those of the complete data and comment.

24. Prove part 3 of Theorem 3.10.

25. For the Boston housing data of Example 3.5, calculate and compare the weights of the eigenvectors of C and those of the canonical transforms for all four CCs and comment.

26. Let T[1] and T[2] be as in Theorem 3.11.

(a) Show that the between covariance matrix of T[1] and T[2] is $A_1 \Sigma_{12} A_2^T$.

(b) Prove part 3 of Theorem 3.11.

27. Consider all eight variables of the abalone data of Example 2.19.

(a) Determine the correlation coefficients between the number of rings and each of the other seven variables. Which variable is most strongly correlated with the number of rings?

(b) Consider the set-up of X[1] and X[2] as in Problem 23. Let X[1a] consist of X[1] and the number of rings. Carry out a canonical correlation analysis for X[1a] and X[2]. Next let X[2a] consist of X[2] and the number of rings. Carry out a canonical correlation analysis for X[1] and X[2a].

(c) Let X[1] be the number of rings, and let X[2] be all other variables. Carry out a canonical correlation analysis for this set-up.

(d) Compare the results of parts (a)–(c).

(e) Compare the results of part (c) with those obtained in Example 2.19 and comment.

28. A canonical correlation analysis for scaled data is described in Section 3.5.3. Give an explicit expression for the canonical correlation matrix of the scaled data. How can we interpret this matrix?

29. For ρ = 1, 2, consider random vectors $X^{[\rho]} \sim (\mu_\rho, \Sigma_\rho)$. Assume that $\Sigma_1$ has full rank d1 and spectral decomposition $\Sigma_1 = \Gamma_1 \Lambda_1 \Gamma_1^T$. Let W[1] be the d1-dimensional PC vector of X[1]. Let C be the matrix of canonical correlations, and assume that rank(C) = d1. Further, let U be the d1-variate CC vector derived from X[1]. Show that there is an orthogonal matrix E such that

U = EW[1], (4.53)

where W[1] is the vector of sphered principal components. Give an explicit expression for E. Is E unique?

30. For ρ = 1, 2, consider $X^{[\rho]} \sim (\mu_\rho, \Sigma_\rho)$. Fix k ≤ d1 and ℓ ≤ d2. Let W(k,ℓ) be the (k + ℓ)-variate vector whose first k entries are those of the k-dimensional PC vector W[1] of X[1] and whose remaining entries are those of the ℓ-dimensional PC vector W[2] of X[2].

(a) Give an explicit expression for W(k,ℓ), and determine its covariance matrix in terms of the covariance matrices of the X[ρ].

(b) Derive explicit expressions for the canonical correlation matrix Ck,ℓ of W[1] and W[2] and the corresponding scores Uk and Vℓ in terms of the corresponding properties of the X[ρ].

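31. Let $X = \left(X^{[1]}, X^{[2]}\right)^T \sim N(\mu, \Sigma)$ with Σ as in (3.1), and assume that $\Sigma_2^{-1}$ exists. Consider the conditional random vector T = (X[1] | X[2]). Show that

(a) $E\,T = \mu_1 + \Sigma_{12}\Sigma_2^{-1}(X^{[2]} - \mu_2)$, and

(b) $\operatorname{var}(T) = \Sigma_1 - \Sigma_{12}\Sigma_2^{-1}\Sigma_{12}^T$.

Hint: Consider the joint distribution of A(X − μ), where

$$A = \begin{pmatrix} I_{d_1 \times d_1} & -\Sigma_{12}\Sigma_2^{-1} \\ 0_{d_2 \times d_1} & I_{d_2 \times d_2} \end{pmatrix}.$$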
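As a starting point for Problem 31, the hint can be expanded as follows. This is only a sketch, and it assumes that Σ in (3.1) has diagonal blocks Σ1, Σ2 and off-diagonal blocks Σ12 and its transpose.

```latex
\begin{aligned}
A(X-\mu) &=
\begin{pmatrix}
X^{[1]}-\mu_1-\Sigma_{12}\Sigma_2^{-1}\bigl(X^{[2]}-\mu_2\bigr) \\
X^{[2]}-\mu_2
\end{pmatrix}, \\[4pt]
A\,\Sigma\,A^{T} &=
\begin{pmatrix}
\Sigma_1-\Sigma_{12}\Sigma_2^{-1}\Sigma_{12}^{T} & 0 \\
0 & \Sigma_2
\end{pmatrix}.
\end{aligned}
```

Because A(X − μ) is normal and its two blocks are uncorrelated, they are independent, so conditioning on X[2] leaves the distribution of the first block unchanged. Parts (a) and (b) then follow by solving the first block for X[1].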

32. Consider the hypothesis tests for Principal Component Analysis of Section 2.7.1 and for Canonical Correlation Analysis of Section 3.6.

(a) List and explain the similarities and differences between them, including how the eigenvalues or singular values are interpreted in each case.

(b) Explain how each hypothesis test can be used.

33. (a) Split the abalone data into two groups as in Problem 23. Carry out appropriate tests of independence at the α = 2 per cent significance level, first for all records and then for the first 100 records only. In each case, state the degrees of freedom of the test statistic. State the conclusions of the tests.

(b) Suppose that you have carried out a test of independence for a pair of canonical correlation scores with n = 100, and suppose that the null hypothesis of this test was accepted at the α = 2 per cent significance level. Would the conclusion of the test remain the same if the significance level changed to α = 5 per cent? Would you expect the same conclusion as in the first case for n = 500 and α = 2 per cent? Justify your answer.

34. Let X and Y be data matrices in d and q dimensions, respectively, and assume that $XX^T$ is invertible. Let C be the sample matrix of canonical correlations for X and Y. Show that (3.37) holds.

35. For the income data, compare the strength of the correlation resulting from PCR and CCR by carrying out the following steps. First, find the variable in each group which is best predicted by the other group, where ‘best’ refers to the absolute value of the correlation coefficient. Then carry out analyses analogous to those described in Example 3.8. Finally, interpret your results.

36. For κ ≤ r, prove the equality of the population expressions corresponding to (3.38) and (3.42). Hint: Consider κ = 1, and prove the identity.

Discriminant Analysis

37. For the Swiss bank notes data, define class C1 to be the genuine notes and class C2 to be the counterfeits. Use Fisher’s rule to classify these data. How many observations are misclassified? How many genuine notes are classified as counterfeits, and how many counterfeits are regarded as genuine?
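A two-class version of Fisher's rule, as needed in Problem 37, might be sketched as follows. The data are synthetic rather than the Swiss bank notes. The sketch assumes that W is the sum of the two class covariance matrices, that the discriminant direction is proportional to W^{-1}(m1 − m2), and that an observation is assigned to the class whose projected mean is closer; the function name `fisher_rule` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
n1, n2, d = 100, 100, 4
X1 = rng.standard_normal((n1, d))                                    # class 1 (synthetic)
X2 = rng.standard_normal((n2, d)) + np.array([3.0, 0.0, 1.0, 0.0])   # class 2, shifted mean

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
# Within-class variability: sum of the two class covariance matrices (assumption)
W = np.cov(X1, rowvar=False) + np.cov(X2, rowvar=False)
eta = np.linalg.solve(W, m1 - m2)      # direction proportional to W^{-1}(m1 - m2)
eta /= np.linalg.norm(eta)             # normalise to unit length

def fisher_rule(x):
    """Assign each row of x to the class whose projected mean is closer."""
    z = x @ eta
    return np.where(np.abs(z - m1 @ eta) <= np.abs(z - m2 @ eta), 1, 2)

pred1, pred2 = fisher_rule(X1), fisher_rule(X2)
print("class 1 assigned to class 2:", int(np.sum(pred1 == 2)))
print("class 2 assigned to class 1:", int(np.sum(pred2 == 1)))
```

The two printed counts correspond to the two kinds of misclassification the problem asks about.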

38. Explain and highlight the similarities and differences between the best directions chosen in Principal Component Analysis, Canonical Correlation Analysis and Discriminant Analysis. What does each of the directions capture? Demonstrate with an example.

39. Consider the wine recognition data of Example 4.5. Use all observations and two cultivars at a time for parts (a) and (b).

(a) Apply Fisher’s rule to these data. Compare the performance on these data with the classification results reported in Example 4.5.

(b) Determine the leave-one-out performance based on Fisher’s rule. Compare the results with those obtained in Example 4.5 and in part (a).

40. Theorem 4.6 is a special case of the generalised eigenvalue problem referred to in Section 3.7.4. Show how Principal Component Analysis and Canonical Correlation Analysis fit into this framework, and give an explicit form for the matrices A and B referred to in Section 3.7.4. Hint: You may find the paper by Borga, Knutsson, and Landelius (1997) useful.

41. Generate 500 random samples from the first distribution given in Example 4.3. Use the normal linear rule to calculate decision boundaries for these data, and display the boundaries together with the data in a suitable plot.

42. (a) Consider two classes $C_\ell = N(\mu_\ell, \sigma^2)$ with ℓ = 1, 2 and μ1 < μ2. Determine a likelihood-based discriminant rule $r_{\mathrm{norm}}$ for a univariate X as in (4.14) which assigns X to class C1.

(b) Give a proof of Theorem 4.10 for the multivariate case. Hint: Make use of the relationship

$$2X^T \Sigma^{-1}(\mu_1 - \mu_2) > \mu_1^T \Sigma^{-1} \mu_1 - \mu_2^T \Sigma^{-1} \mu_2. \qquad (4.54)$$

43. Consider the breast cancer data. Calculate and compare the leave-one-out error based on both the normal rule and Fisher’s rule. Are the conclusions the same as for the classification error?

44. Consider classes $C_\nu = N(\mu_\nu, \Sigma)$ with 1 ≤ ν ≤ κ and κ ≥ 2, which differ in their means but have the same covariance matrix.

(a) Show that the likelihood-based rule (4.26) is equivalent to the rule defined by

$$\left(X - \tfrac{1}{2}\mu_\ell\right)^T \Sigma^{-1} \mu_\ell = \max_{1 \le \nu \le \kappa} \left(X - \tfrac{1}{2}\mu_\nu\right)^T \Sigma^{-1} \mu_\nu.$$

(b) Deduce that the two rules $r_{\mathrm{norm}}$ of (4.25) and $r_{\mathrm{norm1}}$ of (4.26) are equivalent if the random vectors have a common covariance matrix.

(c) Discuss how the two rules apply to data, and illustrate with an example.

45. Consider the glass identification data set from the Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets/Glass+Identification). The glass data consist of 214 observations in nine variables. These observations belong to seven classes.

(a) Use Fisher’s rule to classify the data.

(b) Combine classes in the following way. Regard building windows as one class, vehicle windows as one class, and the remaining three as a third class. Perform classification with Fisher’s rule on these data, and compare your results with those of part (a).

46. Consider two classes C1 and C2 which have different means but the same covariance matrix. Let $r_F$ be Fisher’s discriminant rule for these classes.

(a) Show that B and W equal the expressions given in (4.16).

(b) Show that η given by (4.16) is the maximiser of $W^{-1}B$.

(c) Determine an expression for the decision function hF which is given in terms of η as in (4.16), and show that it differs from the normal linear decision function by a constant. What is the value of this constant?

47. Starting from the likelihood function of the random vector, give an explicit proof of Corollary 4.17.

48. Use the first two dimensions of Fisher’s iris data and the normal linear rule to determine decision boundaries, and display the boundaries together with the data. Hint: For calculation of the rule, use the sample means and the pooled covariance matrix.

49. Prove part 2 of Theorem 4.19.

50. Consider two classes given by Poisson distributions with different values of the parameter λ. Find the likelihood-based discriminant rule which assigns a random variable the value 1 if L1(X) > L2(X).
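A derived rule for Problem 50 can be checked against a direct comparison of the two likelihoods. In the sketch below the parameter values λ1 = 2 and λ2 = 5 are illustrative assumptions, and the function names are hypothetical. Taking logarithms of L1(x) > L2(x) gives x log(λ1/λ2) > λ1 − λ2, so the two likelihoods are equal at x = (λ1 − λ2)/log(λ1/λ2).

```python
import math

def poisson_loglik(x, lam):
    """Log-likelihood of a Poisson(lam) observation x (x a non-negative integer)."""
    return x * math.log(lam) - lam - math.lgamma(x + 1)

def poisson_rule(x, lam1, lam2):
    """Return 1 if L1(x) > L2(x), and 2 otherwise."""
    return 1 if poisson_loglik(x, lam1) > poisson_loglik(x, lam2) else 2

def threshold(lam1, lam2):
    """Value of x at which the two likelihoods are equal."""
    return (lam1 - lam2) / math.log(lam1 / lam2)

lam1, lam2 = 2.0, 5.0          # illustrative parameter values (assumption)
print(threshold(lam1, lam2))
print([poisson_rule(x, lam1, lam2) for x in range(8)])
```

For λ1 < λ2 the logarithm is negative, so the inequality reverses and small counts are assigned to class 1.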

51. Prove part 2 of Theorem 4.21.

52. Consider the abalone data. The first variable, sex, which we have not considered previously, has three groups, M, F and I (for infant). It is important to distinguish between the mature abalone and the infant abalone, and it is therefore natural to divide the data into two classes: observations M and F belong to one class, and observations with label I belong to the second class.

(a) Apply the normal linear and the quadratic rules to the abalone data.

(b) Apply the rule based on the regularised covariance matrix Sν(α) of Friedman (1989) (see Section 4.7.3) to the abalone data, and find the optimal α.

(c) Compare the results of parts (a) and (b) and comment.

53. Consider the Swiss bank notes data, with the two classes as in Problem 37.

(a) Classify these data with the nearest-neighbour rule for a range of values k. Which k results in the smallest classification error?

(b) Classify these data with the logistic regression approach. Discuss which values of the probability are associated with each class.

(c) Compare the results in parts (a) and (b), and also compare these with the results obtained in Problem 37.

54. Consider the three classes of the wine recognition data.

(a) Apply Principal Component Discriminant Analysis to these data, and determine the optimal number p of PCs for Fisher’s rule.

(b) Repeat part (a), but use the normal linear rule.

(c) Compare the results of parts (a) and (b), and include in the comparison the results obtained in Example 4.7.