• No results found

Koch I. Analysis of Multivariate and High-Dimensional Data 2013

N/A
N/A
Protected

Academic year: 2021

Share "Koch I. Analysis of Multivariate and High-Dimensional Data 2013"

Copied!
532
0
0

Loading.... (view fulltext now)

Full text

(1)
(2)
(3)

Analysis of Multivariate and High-Dimensional Data

‘Big data’ poses challenges that require both classical multivariate methods and contemporary techniques from machine learning and engineering. This modern text integrates the two strands into a coherent treatment, drawing together theory, data, computation and recent research.

The theoretical framework includes formal definitions, theorems and proofs which clearly set out the guaranteed ‘safe operating zone’ for the methods and allow users to assess whether data are in or near the zone. Extensive examples showcase the strengths and limitations of different methods in a range of cases: small classical data; data from medicine, biology, marketing and finance; high-dimensional data from bioinformatics; functional data from proteomics; and simulated data. High-dimension low sample size data get special attention. Several data sets are revisited repeatedly to allow comparison of methods. Generous use of colour, algorithms, MATLAB code and problem sets completes the package. The text is suitable for graduate students in statistics and researchers in data-rich disciplines.

IN G E KO C H is Associate Professor of Statistics at the University of Adelaide, Australia.

(4)

PROBABILISTIC MATHEMATICS Editorial Board

Z. Ghahramani (Department of Engineering, University of Cambridge) R. Gill (Mathematical Institute, Leiden University)

F. P. Kelly (Department of Pure Mathematics and Mathematical Statistics, University of Cambridge)

B. D. Ripley (Department of Statistics, University of Oxford) S. Ross (Department of Industrial and Systems Engineering,

University of Southern California)

M. Stein (Department of Statistics, University of Chicago)

This series of high-quality upper-division textbooks and expository monographs covers all aspects of stochastic applicable mathematics. The topics range from pure and applied statistics to probability theory, operations research, optimization, and mathematical programming. The books contain clear presentations of new developments in the field and also of the state of the art in classical methods. While emphasizing rigorous treatment of theoretical methods, the books also contain applications and discussions of new techniques made possible by advances in computational practice.

A complete list of books in the series can be found atwww.cambridge.org/statistics. Recent titles include the following:

11. Statistical Models, by A. C. Davison

12. Semiparametric Regression, by David Ruppert, M. P. Wand and R. J. Carroll 13. Exercises in Probability, by Lo¨ıc Chaumont and Marc Yor

14. Statistical Analysis of Stochastic Processes in Time, by J. K. Lindsey 15. Measure Theory and Filtering, by Lakhdar Aggoun and Robert Elliott 16. Essentials of Statistical Inference, by G. A. Young and R. L. Smith 17. Elements of Distribution Theory, by Thomas A. Severini

18. Statistical Mechanics of Disordered Systems, by Anton Bovier

19. The Coordinate-Free Approach to Linear Models, by Michael J. Wichura 20. Random Graph Dynamics, by Rick Durrett

21. Networks, by Peter Whittle

22. Saddlepoint Approximations with Applications, by Ronald W. Butler 23. Applied Asymptotics, by A. R. Brazzale, A. C. Davison and N. Reid

24. Random Networks for Communication, by Massimo Franceschetti and Ronald Meester 25. Design of Comparative Experiments, by R. A. Bailey

26. Symmetry Studies, by Marlos A. G. Viana

27. Model Selection and Model Averaging, by Gerda Claeskens and Nils Lid Hjort 28. Bayesian Nonparametrics, edited by Nils Lid Hjort et al.

29. From Finite Sample to Asymptotic Methods in Statistics, by Pranab K. Sen, Julio M. Singer and Antonio C. Pedrosa de Lima

30. Brownian Motion, by Peter M¨orters and Yuval Peres 31. Probability (Fourth Edition), by Rick Durrett 33. Stochastic Processes, by Richard F. Bass

34. Regression for Categorical Data, by Gerhard Tutz

35. Exercises in Probability (Second Edition), by Lo¨ıc Chaumont and Marc Yor

(5)

Analysis of Multivariate and

High-Dimensional Data

Inge Koch

(6)

Cambridge University Press is part of the University of Cambridge. It furthers the University’s mission by disseminating knowledge in the pursuit of education, learning and research at the highest international levels of excellence.

www.cambridge.org

Information on this title:www.cambridge.org/9780521887939

c

 Inge Koch 2014

This publication is in copyright. Subject to statutory exception and to the provisions of relevant collective licensing agreements,

no reproduction of any part may take place without the written permission of Cambridge University Press.

First published 2014 Printed in the United States of America

A catalog record for this publication is available from the British Library. Library of Congress Cataloging in Publication Data

Koch, Inge, 1952–

Analysis of multivariate and high-dimensional data / Inge Koch. pages cm

ISBN 978-0-521-88793-9 (hardback) 1. Multivariate analysis. 2. Big data. I. Title.

QA278.K5935 2013 519.535–dc23 2013013351 ISBN 978-0-521-88793-9 Hardback

Additional resources for this publication atwww.cambridge.org/9780521887939

Cambridge University Press has no responsibility for the persistence or accuracy of URLs for external or third-party Internet websites referred to in this publication

and does not guarantee that any content on such websites is, or will remain, accurate or appropriate.

(7)
(8)
(9)

Contents

List of Algorithms page xiii

Notation xv

Preface xix

I CLASSICAL METHODS

1 Multidimensional Data 3

1.1 Multivariate and High-Dimensional Problems 3

1.2 Visualisation 4

1.2.1 Three-Dimensional Visualisation 4

1.2.2 Parallel Coordinate Plots 6

1.3 Multivariate Random Vectors and Data 8

1.3.1 The Population Case 8

1.3.2 The Random Sample Case 9

1.4 Gaussian Random Vectors 11

1.4.1 The Multivariate Normal Distribution and the Maximum

Likelihood Estimator 11

1.4.2 Marginal and Conditional Normal Distributions 13 1.5 Similarity, Spectral and Singular Value Decomposition 14

1.5.1 Similar Matrices 14

1.5.2 Spectral Decomposition for the Population Case 14

1.5.3 Decompositions for the Sample Case 16

2 Principal Component Analysis 18

2.1 Introduction 18

2.2 Population Principal Components 19

2.3 Sample Principal Components 22

2.4 Visualising Principal Components 27

2.4.1 Scree, Eigenvalue and Variance Plots 27

2.4.2 Two- and Three-Dimensional PC Score Plots 30

2.4.3 Projection Plots and Estimates of the Density of the Scores 31

2.5 Properties of Principal Components 34

2.5.1 Correlation Structure of X and Its PCs 34

2.5.2 Optimality Properties of PCs 37

(10)

2.6 Standardised Data and High-Dimensional Data 42

2.6.1 Scaled and Sphered Data 42

2.6.2 High-Dimensional Data 47

2.7 Asymptotic Results 55

2.7.1 Classical Theory: Fixed Dimension d 55

2.7.2 Asymptotic Results when d Grows 57

2.8 Principal Component Analysis, the Number of Components

and Regression 62

2.8.1 Number of Principal Components Based on the Likelihood 62

2.8.2 Principal Component Regression 65

3 Canonical Correlation Analysis 70

3.1 Introduction 70

3.2 Population Canonical Correlations 71

3.3 Sample Canonical Correlations 76

3.4 Properties of Canonical Correlations 82

3.5 Canonical Correlations and Transformed Data 88

3.5.1 Linear Transformations and Canonical Correlations 88

3.5.2 Transforms with Non-Singular Matrices 90

3.5.3 Canonical Correlations for Scaled Data 98

3.5.4 Maximum Covariance Analysis 100

3.6 Asymptotic Considerations and Tests for Correlation 100

3.7 Canonical Correlations and Regression 104

3.7.1 The Canonical Correlation Matrix in Regression 105

3.7.2 Canonical Correlation Regression 108

3.7.3 Partial Least Squares 109

3.7.4 The Generalised Eigenvalue Problem 113

4 Discriminant Analysis 116

4.1 Introduction 116

4.2 Classes, Labels, Rules and Decision Functions 118

4.3 Linear Discriminant Rules 120

4.3.1 Fisher’s Discriminant Rule for the Population 120

4.3.2 Fisher’s Discriminant Rule for the Sample 123

4.3.3 Linear Discrimination for Two Normal Populations or Classes 127 4.4 Evaluation of Rules and Probability of Misclassification 129

4.4.1 Boundaries and Discriminant Regions 129

4.4.2 Evaluation of Discriminant Rules 131

4.5 Discrimination under Gaussian Assumptions 136

4.5.1 Two and More Normal Classes 136

4.5.2 Gaussian Quadratic Discriminant Analysis 140

4.6 Bayesian Discrimination 143

4.6.1 Bayes Discriminant Rule 143

4.6.2 Loss and Bayes Risk 146

4.7 Non-Linear, Non-Parametric and Regularised Rules 148

(11)

Contents ix

4.7.2 Logistic Regression and Discrimination 153

4.7.3 Regularised Discriminant Rules 154

4.7.4 Support Vector Machines 155

4.8 Principal Component Analysis, Discrimination and Regression 157

4.8.1 Discriminant Analysis and Linear Regression 157

4.8.2 Principal Component Discriminant Analysis 158

4.8.3 Variable Ranking for Discriminant Analysis 159

Problems for Part I 165

II FACTORS AND GROUPINGS

5 Norms, Proximities, Features and Dualities 175

5.1 Introduction 175

5.2 Vector and Matrix Norms 176

5.3 Measures of Proximity 176

5.3.1 Distances 176

5.3.2 Dissimilarities 178

5.3.3 Similarities 179

5.4 Features and Feature Maps 180

5.5 Dualities forX and XT 181

6 Cluster Analysis 183

6.1 Introduction 183

6.2 Hierarchical Agglomerative Clustering 185

6.3 k-Means Clustering 191

6.4 Second-Order Polynomial Histogram Estimators 199

6.5 Principal Components and Cluster Analysis 207

6.5.1 k-Means Clustering for Principal Component Data 207 6.5.2 Binary Clustering of Principal Component Scores and Variables 210

6.5.3 Clustering High-Dimensional Binary Data 212

6.6 Number of Clusters 216

6.6.1 Quotients of Variability Measures 216

6.6.2 The Gap Statistic 217

6.6.3 The Prediction Strength Approach 219

6.6.4 Comparison of ˆk-Statistics 220

7 Factor Analysis 223

7.1 Introduction 223

7.2 Population k-Factor Model 224

7.3 Sample k-Factor Model 227

7.4 Factor Loadings 228

7.4.1 Principal Components and Factor Analysis 228

7.4.2 Maximum Likelihood and Gaussian Factors 233

7.5 Asymptotic Results and the Number of Factors 236

(12)

7.6.1 Principal Component Factor Scores 239

7.6.2 Bartlett and Thompson Factor Scores 240

7.6.3 Canonical Correlations and Factor Scores 241

7.6.4 Regression-Based Factor Scores 242

7.6.5 Factor Scores in Practice 244

7.7 Principal Components, Factor Analysis and Beyond 245

8 Multidimensional Scaling 248

8.1 Introduction 248

8.2 Classical Scaling 249

8.2.1 Classical Scaling and Principal Coordinates 251

8.2.2 Classical Scaling withStrain 254

8.3 Metric Scaling 257

8.3.1 Metric Dissimilarities and Metric Stresses 258

8.3.2 MetricStrain 261

8.4 Non-Metric Scaling 263

8.4.1 Non-Metric Stress and the Shepard Diagram 263

8.4.2 Non-MetricStrain 268

8.5 Data and Their Configurations 268

8.5.1 HDLSS Data and theX and XTDuality 269

8.5.2 Procrustes Rotations 271

8.5.3 Individual Differences Scaling 273

8.6 Scaling for Grouped and Count Data 274

8.6.1 Correspondence Analysis 274

8.6.2 Analysis of Distance 279

8.6.3 Low-Dimensional Embeddings 282

Problems for Part II 286

III NON-GAUSSIAN ANALYSIS

9 Towards Non-Gaussianity 295

9.1 Introduction 295

9.2 Gaussianity and Independence 296

9.3 Skewness, Kurtosis and Cumulants 297

9.4 Entropy and Mutual Information 299

9.5 Training, Testing and Cross-Validation 301

9.5.1 Rules and Prediction 302

9.5.2 Evaluating Rules with the Cross-Validation Error 302

10 Independent Component Analysis 305

10.1 Introduction 305

10.2 Sources and Signals 307

10.2.1 Population Independent Components 307

10.2.2 Sample Independent Components 308

10.3 Identification of the Sources 310

(13)

Contents xi 10.4.1 Independence, Uncorrelatedness and Non-Gaussianity 314

10.4.2 Approximations to the Mutual Information 317

10.5 Estimation of the Mixing Matrix 320

10.5.1 An Estimating Function Approach 321

10.5.2 Properties of Estimating Functions 322

10.6 Non-Gaussianity and Independence in Practice 324

10.6.1 Independent Component Scores and Solutions 324

10.6.2 Independent Component Solutions for Real Data 326

10.6.3 Performance of I for Simulated Data 331

10.7 Low-Dimensional Projections of High-Dimensional Data 335 10.7.1 Dimension Reduction and Independent Component Scores 335

10.7.2 Properties of Low-Dimensional Projections 339

10.8 Dimension Selection with Independent Components 343

11 Projection Pursuit 349

11.1 Introduction 349

11.2 One-Dimensional Projections and Their Indices 350

11.2.1 Population Projection Pursuit 350

11.2.2 Sample Projection Pursuit 356

11.3 Projection Pursuit with Two- and Three-Dimensional

Projections 359

11.3.1 Two-Dimensional Indices:QE,QCandQU 359

11.3.2 Bivariate Extension by Removal of Structure 361

11.3.3 A Three-Dimensional Cumulant Index 363

11.4 Projection Pursuit in Practice 363

11.4.1 Comparison of Projection Pursuit and Independent Component

Analysis 364

11.4.2 From a Cumulant-Based Index to FastICA Scores 365

11.4.3 The Removal of Structure and FastICA 366

11.4.4 Projection Pursuit: A Continuing Pursuit 371

11.5 Theoretical Developments 373

11.5.1 Theory Relating toQR 373

11.5.2 Theory Relating toQUandQD 374

11.6 Projection Pursuit Density Estimation and Regression 376

11.6.1 Projection Pursuit Density Estimation 376

11.6.2 Projection Pursuit Regression 378

12 Kernel and More Independent Component Methods 381

12.1 Introduction 381

12.2 Kernel Component Analysis 382

12.2.1 Feature Spaces and Kernels 383

12.2.2 Kernel Principal Component Analysis 385

12.2.3 Kernel Canonical Correlation Analysis 389

12.3 Kernel Independent Component Analysis 392

12.3.1 TheF-Correlation and Independence 392

(14)

12.3.3 Comparison of Non-Gaussian and Kernel Independent Components

Approaches 396

12.4 Independent Components from Scatter Matrices (aka Invariant

Coordinate Selection) 402

12.4.1 Scatter Matrices 403

12.4.2 Population Independent Components from Scatter Matrices 404 12.4.3 Sample Independent Components from Scatter Matrices 407 12.5 Non-Parametric Estimation of Independence Criteria 413 12.5.1 A Characteristic Function View of Independence 413 12.5.2 An Entropy Estimator Based on Order Statistics 416 12.5.3 Kernel Density Estimation of the Unmixing Matrix 417

13 Feature Selection and Principal Component Analysis Revisited 421

13.1 Introduction 421

13.2 Independent Components and Feature Selection 423

13.2.1 Feature Selection in Supervised Learning 423

13.2.2 Best Features and Unsupervised Decisions 426

13.2.3 Test of Gaussianity 429

13.3 Variable Ranking and Statistical Learning 431

13.3.1 Variable Ranking with the Canonical Correlation Matrix C 432 13.3.2 Prediction with a Selected Number of Principal Components 434 13.3.3 Variable Ranking for Discriminant Analysis Based on C 438 13.3.4 Properties of the Ranking Vectors of the Naive C when d Grows 442

13.4 Sparse Principal Component Analysis 449

13.4.1 The Lasso, SCoTLASS Directions and Sparse Principal Components 449 13.4.2 Elastic Nets and Sparse Principal Components 453 13.4.3 Rank One Approximations and Sparse Principal Components 458 13.5 (In)Consistency of Principal Components as the Dimension Grows 461 13.5.1 (In)Consistency for Single-Component Models 461 13.5.2 Behaviour of the Sample Eigenvalues, Eigenvectors and Principal

Component Scores 465

13.5.3 Towards a General Asymptotic Framework for Principal

Component Analysis 471

Problems for Part III 476

Bibliography 483

Author Index 493

Subject Index 498

(15)

List of Algorithms

3.1 Partial Least Squares Solution page 110 4.1 Discriminant Adaptive Nearest Neighbour Rule 153 4.2 Principal Component Discriminant Analysis 159 4.3 Discriminant Analysis with Variable Ranking 162 6.1 Hierarchical Agglomerative Clustering 187 6.2 Mode and Cluster Tracking with the SOPHE 201

6.3 The Gap Statistic 218

8.1 Principal Coordinate Configurations in p Dimensions 253 10.1 Practical Almost Independent Component Solutions 326 11.1 Non-Gaussian Directions from Structure Removal and FastICA 367 11.2 An M-Step Regression Projection Index 378 12.1 Kernel Independent Component Solutions 395 13.1 Independent Component Features in Supervised Learning 424 13.2 Sign Cluster Rule Based on the First Independent Component 427 13.3 An IC1-Based Test of Gaussianity 429

13.4 Prediction with a Selected Number of Principal Components 434 13.5 Naive Bayes Rule for Ranked Data 447 13.6 Sparse Principal Components Based on the Elastic Net 455 13.7 Sparse Principal Components from Rank One Approximations 459 13.8 Sparse Principal Components Based on Variable Selection 463

(16)
(17)

Notation

a= [a1···ad]T Column vector inRd

A, AT, B Matrices, with ATthe transpose of A Ap×q, Br×s Matrices A and B of size p× q and r × s

A= (ai j) Matrix A with entries ai j

A=a1···ap  = ⎡ ⎢ ⎣ a•1 .. . a•q ⎤ ⎥

⎦ Matrix A of size p × q with columns ai, rows a• j

Adiag Diagonal matrix consisting of the diagonal entries of A

0k× k×  matrix with all entries 0

1k Column vector with all entries 1

Id×d d× d identity matrix Ik× Ik×k 0k×(−k) if k≤ , and I× 0(k−)×  if k≥ 

X, Y d-dimensional random vectors

X|Y Conditional random vector X given Y

X=X1... Xd

T

d-dimensional random vector with entries (or variables) Xj

X =X1X2···Xn



d× n data matrix of random vectors Xi, i≤ n

Xi=



Xi1··· Xid

T

Random vector fromX with entries Xi j

X• j=X1 j··· Xn j



1× n row vector of the jth variable of X

μ = EX Expectation of a random vector X, also denoted byμX

X Sample mean

X Average sample class mean σ2= var(X) Variance of random variable X

 = var(X) Covariance matrix of X with entriesσj k andσj j= σ2j

S Sample covariance matrix ofX with entries si j and sj j= s2j

R= (ρi j), RS= (ρi j) Matrix of correlation coefficients for the population and sample

Qn, Qd Dual matrices Qn= XXTand Qd= XTX diag Diagonal matrix with entriesσ2j obtained from

Sdiag Diagonal matrix with entries s2j obtained from S

 = T Spectral decomposition of

(18)

k=



η1···ηk



d× k matrix of (orthogonal) eigenvectors of , k ≤ d k= diag(λ1,...,λk) Diagonal k× k matrix with diagonal entries the

eigenvalues of, k ≤ d

S=  T Spectral decomposition of S with eigenvalues λj and

eigenvectorsj

β3(X), b3(X) Multivariate skewness of X and sample skewness ofX

β4(X), b4(X) Multivariate kurtosis of X and sample kurtosis ofX

f , F Multivariate probability density and distribution functions φ, Standard normal probability density and distribution

functions

f , fG Multivariate probability density functions; f and fGhave

the same mean and covariance matrix, and fGis

Gaussian

L(θ) or L(θ|X) Likelihood function of the parameterθ, given X

X∼ (μ,) Random vector with meanμ and covariance matrix 

XN(μ,) Random vector from the multivariate normal distribution with meanμ and covariance matrix 

X ∼Sam(X, S) Data – with sample mean X and sample covariance matrix S

Xcent Centred data



X1− X ··· Xn− X



, also written asX − X

X, XS Sphered vector and data−1/2(X− μ), S−1/2(X − X)

Xscale, Xscale Scaled vector and datadiag−1/2(X − μ), Sdiag−1/2

X − X

X , X (Spatially) whitened random vector and data

W(k)=W1···Wk

T

Vector of first k principal component scores W(k)=W

•1...W•kT k× n matrix of first k principal component scores

Pk= Wkηk Principal component projection vector

P•k= ηkW•k d× n matrix of principal component projections

F, F Common factor for population and data in k-factor model

S,S Source for population and data in Independent Component Analysis

O= {O1,..., On} Set of objects corresponding to dataX =



X1X2...Xn

 {O, } Observed data, consisting of objects Oi and dissimilarities

ikbetween pairs of objects

f, f(X), f(X) Feature map, feature vector and feature data cov (X, T) dX× dT (between) covariance matrix of X and T

12= cov(X[1], X[2]) d1× d2(between) covariance matrix of X[1]and X[2]

S12= cov(X[1],X[2]) d1× d2sample (between) covariance matrix of d× n data

X[], for = 1,2

C= 1−1/2122−1/2 Canonical correlation matrix of X[]∼ (μ,), for = 1,2

C= Pϒ QT Singular value decomposition of C with singular valuesυj

and eigenvectors pj, qj



C= P ϒ QT Sample canonical correlation matrix and its singular value decomposition

R[C,1]= CCT, R[C,2]= CTC Matrices of multivariate coefficients of determination, with C the canonical correlation matrix

(19)

Notation xvii

U(k), V(k) Pair of vectors of k-dimensional canonical correlations

ϕk,ψk kth pair of canonical (correlation) transforms

U(k),V(k) k× n matrices of k-dimensional canonical correlation data νth class (or cluster)

r Discriminant rule or classifier

h, hβ Decision function for a discriminant rule r (hβ depends onβ) b, w Between-class and within-class variability

E (Classification) error

P(X,k) k-cluster arrangement ofX

A◦ B Hadamard or Schur product of matrices A and B tr ( A) Trace of a matrix A

det ( A) Determinant of a matrix A dir (X) Direction (vector) X/ X of X

· , · p, (Euclidean) norm, p-norm of a vector or matrix

X tr Trace norm of X given by[tr()]1/2

A Frob Frobenius norm of a matrix A given by [tr ( A AT)]1/2

(X,Y) Distance between vectors X and Y (X,Y) Dissimilarity of vectors X and Y

a(α, β) Angle between directionsα and β: arccos(αTβ)

H,I, J, K Entropy, mutual information, negentropy and Kullback-Leibler divergence

Q Projection index

n d Asymptotic domain, d fixed and n→ ∞

nd Asymptotic domain, d, n→ ∞ and d = O(n)

nd Asymptotic domain, d, n→ ∞ and n = o(d)

(20)
(21)

Preface

This book is about data in many – and sometimes very many – variables and about analysing such data. The book attempts to integrate classical multivariate methods with contemporary methods suitable for high-dimensional data and to present them in a coherent and transparent framework. Writing about ideas that emerged more than a hundred years ago and that have become increasingly relevant again in the last few decades is exciting and challenging. With hindsight, we can reflect on the achievements of those who paved the way, whose methods we apply to ever bigger and more complex data and who will continue to influence our ideas and guide our research. Renewed interest in the classical methods and their extension has led to analyses that give new insight into data and apply to bigger and more complex problems. There are two players in this book: Theory and Data. Theory advertises its wares to lure Data into revealing its secrets, but Data has its own ideas. Theory wants to provide elegant solutions which answer many but not all of Data’s demands, but these lead Data to pose new challenges to Theory. Statistics thrives on interactions between theory and data, and we develop better theory when we ‘listen’ to data. Statisticians often work with experts in other fields and analyse data from many different areas. We, the statisticians, need and benefit from the expertise of our colleagues in the analysis of their data and interpretation of the results of our analysis. At times, existing methods are not adequate, and new methods need to be developed.

This book attempts to combine theoretical ideas and advances with their application to data, in particular, to interesting and real data. I do not shy away from stating theorems as they are an integral part of the ideas and methods. Theorems are important because they summarise what we know and the conditions under which we know it. They tell us when methods may work with particular data; the hypotheses may not always be satisfied exactly, but a method may work nevertheless. The precise details do matter sometimes, and theorems capture this information in a concise way.

Yet a balance between theoretical ideas and data analysis is vital. An important aspect of any data analysis is its interpretation, and one might ask questions like: What does the analysis tell us about the data? What new insights have we gained from a particular analysis? How suitable is my method for my data? What are the limitations of a particular method, and what other methods would produce more appropriate analyses? In my attempts to answer such questions, I endeavour to be objective and emphasise the strengths and weaknesses of different approaches.

(22)

Who Should Read This Book?

This book is suitable for readers with various backgrounds and interests and can be read at different levels. It is appropriate as a graduate-level course – two course outlines are suggested in the section ‘Teaching from This Book’. A second or more advanced course could make use of the more advanced sections in the early chapters and include some of the later chapters. The book is equally appropriate for working statisticians who need to find and apply a relevant method for analysis of their multivariate or high-dimensional data and who want to understand how the chosen method deals with the data, what its limitations might be and what alternatives are worth considering.

Depending on the expectation and aims of the reader, different types of backgrounds are needed. Experience in the analysis of data combined with some basic knowledge of statistics and statistical inference will suffice if the main aim involves applying the methods of this book. To understand the underlying theoretical ideas, the reader should have a solid background in the theory and application of statistical inference and multivariate regression methods and should be able to apply confidently ideas from linear algebra and real analysis. Readers interested in statistical ideas and their application to data may benefit from the theorems and their illustrations in the examples. These readers may, in a first journey through the book, want to focus on the basic ideas and properties of each method and leave out the last few more advanced sections of each chapter. For possible paths, see the models for a one-semester course later in this preface.

Researchers and graduate students with a good background in statistics and mathemat-ics who are primarily interested in the theoretical developments of the different topmathemat-ics will benefit from the formal setting of definitions, theorems and proofs and the careful dis-tinction of the population and the sample case. This setting makes it easy to understand what each method requires and which ideas can be adapted. Some of these readers may want to refer to the recent literature and the references I provide for theorems that I do not prove.

Yet another broad group of readers may want to focus on applying the methods of this

book to particular data, with an emphasis on the results of the data analysis and the new

insight they gain into their data. For these readers, the interpretation of their results is of prime interest, and they can benefit from the many examples and discussions of the analysis for the different data sets. These readers could concentrate on the descriptive parts of each method and the interpretative remarks which follow many theorems and need not delve into the theorem/proof framework of the book.

Outline

This book consists of three parts. Typically, each method corresponds to a single chapter, and because the methods have different origins and varied aims, it is convenient to group the chapters into parts. The methods focus on two main themes:

1. Component Analysis, which aims to simplify the data by summarising them in a smaller number of more relevant or more interesting components

(23)

Preface xxi

2. Statistical Learning, which aims to group, classify or regress the (component) data by partitioning the data appropriately or by constructing rules and applying these rules to new data.

The two themes are related, and each method I describe addresses at least one of the themes. The first chapter in each part presents notation and summarises results required in the following chapters. I give references to background material and to proofs of results, which may help readers not acquainted with some topics. Readers who are familiar with the topics of the three first chapters in their part may only want to refer to the notation. Properties or theorems in these three chapters are called Results and are stated without proof. Each of the main chapters in the three parts is dedicated to a specific method or topic and illustrates its ideas on data.

PartIdeals with the classical methods Principal Component Analysis, Canonical Cor-relation Analysis and Discriminant Analysis, which are ‘musts’ in multivariate analysis as they capture essential aspects of analysing multivariate data. The later sections of each of these chapters contain more advanced or more recent ideas and results, such as Principal Component Analysis for high-dimension low sample size data and Principal Component Regression. These sections can be left out in a first reading of the book without greatly affecting the understanding of the rest of PartsIandII.

PartIIcomplements PartIand is still classical in its origin: Cluster Analysis is similar to Discriminant Analysis and partitions data but without the advantage of known classes. Fac-tor Analysis and Principal Component Analysis enrich and complement each other, yet the two methods pursue distinct goals and differ in important ways. Classical Multidimensional Scaling may seem to be different from Principal Component Analysis, but Multidimensional Scaling, which ventures into non-linear component analysis, can be regarded as a generali-sation of Principal Component Analysis. The three methods, Principal Component Analysis, Factor Analysis and Multidimensional Scaling, paved the way for non-Gaussian component analysis and in particular for Independent Component Analysis and Projection Pursuit.

PartIIIgives an overview of more recent and current ideas and developments in com-ponent analysis methods and links these to statistical learning ideas and research directions for high-dimensional data. A natural starting point are the twins, Independent Component Analysis and Projection Pursuit, which stem from the signal-processing and statistics com-munities, respectively. Because of their similarities as well as their resulting analysis of data, we may regard both as non-Gaussian component analysis methods. Since the early 1980s, when Independent Component Analysis and Projection Pursuit emerged, the concept of independence has been explored by many authors. Chapter12showcases Independent Component Methods which have been developed since about 2000. There are many different approaches; I have chosen some, including Kernel Independent Component Analysis, which have a more statistical rather than heuristic basis. The final chapter returns to the beginning – Principal Component Analysis – but focuses on current ideas and research directions: feature selection, component analysis of high-dimension low sample size data, decision rules for such data, asymptotics and consistency results when the dimension increases faster than the sample size. This last chapter includes inconsistency results and concludes with a new and general asymptotic framework for Principal Component Analysis which covers the different asymptotic domains of sample size and dimension of the data.

(24)

Data and Examples

This book uses many contrasting data sets: small classical data sets such as Fisher’s four-dimensional iris data; data sets of moderate dimension (up to about thirty) from medical, biological, marketing and financial areas; and big and complex data sets. The data sets vary in the number of observations from fewer than fifty to about one million. We will also generate data and work with simulated data because such data can demonstrate the perfor-mance of a particular method, and the strengths and weaknesses of particular approaches more clearly. We will meet high-dimensional data with more than 1,000 variables and high-dimension low sample size (HDLSS) data, including data from genomics with dimen-sions in the tens of thousands, and typically fewer than 100 observations. In addition, we will encounter functional data from proteomics, for which each observation is a curve or profile.

Visualising data is important and is typically part of a first exploratory step in data analysis. If appropriate, I show the results of an analysis in graphical form.

I describe the analysis of data in Examples which illustrate the different tools and method-ologies. In the examples I provide relevant information about the data, describe each analysis and give an interpretation of the outcomes of the analysis. As we travel through the book, we frequently return to data we previously met. The Data Index shows, for each data set in this book, which chapter contains examples pertaining to these data. Continuing with the same data throughout the book gives a more comprehensive picture of how we can study a particular data set and what methods and analyses are suitable for specific aims and data sets. Typically, the data sets I use are available on the Cambridge University Press website

www.cambridge.org/9780521887939.

Use of Software and Algorithms

I use MATLAB for most of the examples and make generic MATLAB code available on the Cambridge University Press website. Readers and data analysts who prefer R could use that software instead of MATLAB. There are, however, some differences in implementation between MATLAB and R, in particular, in the Independent Component Analysis algorithms. I have included some comments about these differences in Section11.4.

Many of the methods in Parts I and II have a standard one-line implementation in MATLAB and R. For example, to carry out a Principal Component Analysis or a likelihood-based Factor Analysis, all that is needed is a single command which includes the data to be analysed. These stand-alone routines in MATLAB and R avoid the need for writing one’s own code. The MATLAB code I provide typically includes the initial visualisation of the data and, where appropriate, code for a graphical presentation of the results.

This book contains algorithms, that is, descriptions of the mathematical or computational steps that are needed to carry out particular analyses. Algorithm4.2, for example, details the steps that are required to carry out classification for principal component data. A list of all algorithms in this book follows the Contents.

Theoretical Framework

I have chosen the conventional format with Definitions, Theorems, Propositions and Corol-laries. This framework allows me to state assumptions and conclusions precisely and in an

(25)

Preface xxiii

easily recognised form. If the assumptions of a theorem deviate substantially from proper-ties of the data, the reader should recognise that care is required when applying a particular result or method to the data.

In the first part of the book I present proofs of many of the theorems and propositions. For the later parts, the emphasis changes, and in Part IIIin particular, I typically present theorems without proof and refer the reader to the relevant literature. This is because most of the theorems in PartIIIrefer to recent results. Their proofs can be highly technical and complex without necessarily increasing the understanding of the theorem for the general readership of this book.

Many of the methods I describe do not require knowledge of the underlying distribution. For this reason, my treatment is mostly distribution-free. At times, stronger properties can be derived when we know the distribution of the data. (In practice, this means that the data come from the Gaussian distribution.) In such cases, and in particular in non-asymptotic situations I will explicitly point out what extra mileage knowledge of the Gaussian distri-bution can gain us. The development of asymptotic theory typically requires the data to come from the Gaussian distribution, and in these theorems I will state the necessary dis-tributional assumptions. Asymptotic theory provides a sound theoretical framework for a method, and in my opinion, it is important to detail the asymptotic results, including the precise conditions under which these results hold.

The formal framework and proofs of theorems should not deter readers who are more interested in an analysis of their data; the theory guides us regarding a method’s appropri-ateness for specific data. However, the methods of this book can be used without dipping into a single proof, and most of the theorems are followed by remarks which explain aspects or implications of the theorems. These remarks can be understood without a deep knowledge of the assumptions and mathematical details of the theorems.

Teaching from This Book

This book is suitable as a graduate-level textbook and can be taught as a one-semester course with an optional advanced second semester. Graduate students who use this book as their textbook should have a solid general knowledge of statistics, including statistical inference and multivariate regression methods, and a good background in real analysis and linear alge-bra and, in particular, matrix theory. In addition, the graduate student should be interested in analysing real data and have some experience with MATLAB or R.

Each part of this book ends with a set of problems for the preceding chapters. These problems represent a mixture of theoretical exercises and data analysis and may be suitable as exercises for students.

Some models for a one-semester graduate course are

• A ‘classical’ course which focuses on the following sections of Chapters2to7

Chapter2, Principal Component Analysis: up to Section2.7.2

Chapter3, Canonical Correlation Analysis: up to Section3.7

Chapter4, Discriminant Analysis: up to Section4.8

Chapter6, Cluster Analysis: up to Section6.6.2

(26)

• A mixed ‘classical’ and ‘modern’ course which includes an introduction to Indepen-dent Component Analysis, based on Sections10.1to 10.3and some subsections from Section 10.6. Under this model I would focus on a subset of the classical model, and leave out some or all of the following sections

Section4.7.2to end of Chapter4

Section6.6. Section7.6.

Depending on the choice of the first one-semester course, a more advanced one-semester graduate course could focus on the extension sections in PartsIandIIand then, depending on the interest of teacher and students, pick and choose material from Chapters 8 and 10 to 13.

Choice of Topics and How This Book Differs from Other Books on Multivariate Analysis

This book deviates from classical and newer books on multivariate analysis (MVA) in a number of ways.

1. To begin with, I have only a single chapter on background material, which includes summaries of pertinent material on the multivariate normal distribution and relevant results from matrix theory.

2. My emphasis is on the interplay of theory and data. I develop theoretical ideas and apply them to data. In this I differ from many books on MVA, which focus on either data analysis or treat theory.

3. I include analysis of high-dimension low sample size data and describe new and very recent developments in Principal Component Analysis which allow the dimension to grow faster than the sample size. These approaches apply to gene expression data and data from proteomics, for example.

Many of the methods I discuss fall into the category of component analysis, including the newer methods in PartIII. As a consequence, I made choices to leave out some topics which are often treated in classical multivariate analysis, most notably ANOVA and MANOVA. The former is often taught in undergraduate statistics courses, and the latter is an interest-ing extension of ANOVA which does not fit so naturally into our framework but is easily accessible to the interested reader.

Another more recent topic, Support Vector Machines, is also absent. Again, it does not fit naturally into our component analysis framework. On the other hand, there is a growing number of books which focus exclusively on Support Vector Machines and which remedy my omission. I do, however, make the connection to Support Vector Machines and cite relevant papers and books, in particular in Sections4.7and12.1.

And Finally ...

It gives me great pleasure to thank the people who helped while I was working on this monograph. First and foremost, my deepest thanks to my husband, Alun Pope, and sons, Graeme and Reiner Pope, for your love, help, encouragement and belief in me throughout the years it took to write this book. Thanks go to friends, colleagues and students who provided

(27)

Preface xxv

valuable feedback. I specially want to mention and thank two friends: Steve Marron, who introduced me to Independent Component Analysis and who continues to inspire me with his ideas, and Kanta Naito, who loves to argue with me until we find good answers to the problems we are working on. Finally, I thank Diana Gillooly from Cambridge University Press for her encouragement and help throughout this journey.

(28)
(29)

Part I

(30)
(31)

1

Multidimensional Data

Denken ist interessanter als Wissen, aber nicht als Anschauen (Johann Wolfgang von Goethe, Werke – Hamburger Ausgabe Bd. 12, Maximen und Reflexionen, 1749–1832). Thinking is more interesting than knowing, but not more interesting than looking at.

1.1 Multivariate and High-Dimensional Problems

Early in the twentieth century, scientists such asPearson(1901),Hotelling(1933) andFisher

(1936) developed methods for analysing multivariate data in order to

• understand the structure in the data and summarise it in simpler ways;

• understand the relationship of one part of the data to another part; and

• make decisions and inferences based on the data.

The early methods these scientists developed are linear; their conceptual simplicity and elegance still strike us today as natural and surprisingly powerful. Principal Component Analysis deals with the first topic in the preceding list, Canonical Correlation Analysis with the second and Discriminant Analysis with the third. As time moved on, more complex methods were developed, often arising in areas such as psychology, biology or economics, but these linear methods have not lost their appeal. Indeed, as we have become more able to collect and handle very large and high-dimensional data, renewed requirements for linear methods have arisen. In these data sets essential structure can often be obscured by noise, and it becomes vital to

reduce the original data in such a way that informative and interesting structure in the data is preserved while noisy, irrelevant or purely random variables, dimensions or features are removed, as these can adversely affect the analysis.

Principal Component Analysis, in particular, has become indispensable as a dimension-reduction tool and is often used as a first step in a more comprehensive analysis.

The data we encounter in this book range from two-dimensional samples to samples that have thousands of dimensions or consist of continuous functions. Traditionally one assumes that the dimension d is small compared to the sample size n, and for the asymptotic theory, n increases while the dimension remains constant. Many recent data sets do not fit into this framework; we encounter

(32)

• data whose dimension is comparable to the sample size, and both are large;

high-dimension low sample size (HDLSS) data whose dimension d vastly exceeds the sample size n, so d n; and

• functional data whose observations are functions.

High-dimensional and functional data pose special challenges, and their theoretical and asymptotic treatment is an active area of research. Gaussian assumptions will often not be useful for high-dimensional data. A deviation from normality does not affect the applicabil-ity of Principal Component Analysis or Canonical Correlation Analysis; however, we need to exercise care when making inferences based on Gaussian assumptions or when we want to exploit the normal asymptotic theory.

The remainder of this chapter deals with a number of topics that are needed in subsequent chapters. Section1.2looks at different ways of displaying or visualising data, Section1.3

introduces notation for random vectors and data and Section1.4discusses Gaussian random vectors and summarises results pertaining to such vectors and data. Finally, in Section1.5

we find results from linear algebra, which deal with properties of matrices, including the spectral decomposition. In this chapter, I state results without proof; the references I provide for each topic contain proofs and more detail.

1.2 Visualisation

Before we analyse a set of data, it is important to look at it. Often we get useful clues such as skewness, bi- or multi-modality, outliers, or distinct groupings; these influence or direct our analysis. Graphical displays are exploratory data-analysis tools, which, if appropriately used, can enhance our understanding of data. The insight obtained from graphical displays is more subjective than quantitative; for most of us, however, visual cues are easier to understand and interpret than numbers alone, and the knowledge gained from graphical displays can complement more quantitative answers.

Throughout this book we use graphic displays extensively and typically in the examples. In addition, in the introduction to non-Gaussian analysis, Section 9.1of Part III, I illus-trate with the simple graphical displays of Figure9.1the difference between interesting and purely random or non-informative data.

1.2.1 Three-Dimensional Visualisation

Two-dimensional scatterplots are a natural – though limited – way of looking at data with three or more variables. As the number of variables, and therefore the dimension, increases, sequences of two-dimensional scatterplots become less feasible to interpret. We can, of course, still display three of the d dimensions in scatterplots, but it is less clear how one can look at more than three dimensions in a single plot.

We start with visualising three data dimensions. These arise as three-dimensional data or as three specified variables of higher-dimensional data. Commonly the data are displayed in a default view, but rotating the data can better reveal the structure of the data. The scat-terplots in Figure1.1display the 10,000 observations and the three variables CD3, CD8 and CD4 of the five-dimensional HIV+ and HIV− data sets, which contain measurements of blood cells relevant to HIV. The left panel shows the HIV+ data and the right panel the

(33)

1.2 Visualisation 5 100 300 500 200 400 100 300 500 100 300 500 0 200 400 100 300 500

Figure 1.1 HIV+data (left) and HIVdata (right) of Example2.4with variables CD3, CD8 and CD4. −200 0 200200 0−200 −200 0 200 −200 −100 0 100 200 −2000 200 400 −200 −100 0 100 200

Figure 1.2 Orthogonal projections of the five-dimensional HIV+data (left) and the HIV−data (right) of Example2.4.

HIV−data. There are differences between the point clouds in the two figures, and an impor-tant task in the analysis of such data is to exhibit and quantify the differences. The data are described in Example2.4of Section2.3.

It may be helpful to present the data in the form of movies or combine a series of different views of the same data. Other possibilities include projecting the five-dimensional data onto a smaller number of orthogonal directions and displaying the lower-dimensional projected data as in Figure1.2. These figures, again with HIV+in the left panel and HIV−in the right panel, highlight the cluster structure of the data in Figure1.1. We can see a smaller fourth cluster in the top right corner of the HIV−data, which seems to have almost disappeared in the HIV+data in the left panel.

We return to these figures in Section2.4, where I explain how to find informative pro-jections. Many of the methods we explore use projections: Principal Component Analysis, Factor Analysis, Multidimensional Scaling, Independent Component Analysis and Projec-tion Pursuit. In each case the projecProjec-tions focus on different aspects and properties of the data.

(34)

5 6 7 3 2 4 2 4 6 5 6 7 4 3 2 0.5 1 1.5 2 2.5 567 2 4 6 0.5 1 1.5 2 2.5 2 3 4 2 4 6 0.5 1 1.5 2 2.5

Figure 1.3 Three species of the iris data: dimensions 1, 2 and 3 (top left), dimensions 1, 2 and 4

(top right), dimensions 1, 3 and 4 (bottom left) and dimensions 2, 3 and 4 (bottom right).

Another way of representing low-dimensional data is in a number of three-dimensional scatterplots – as seen in Figure1.3– which make use of colour and different plotting symbols to enhance interpretation. We display the four variables of Fisher’s iris data – sepal length, sepal width, petal length and petal width – in a sequence of three-dimensional scatterplots. The data consist of three species: red refers to Setosa, green to Versicolor and black to Virginica. We can see that the red observations are well separated from the other two species for all combinations of variables, whereas the green and black species are not as easily separable. I describe the data in more detail in Example4.1of Section4.3.2.

1.2.2 Parallel Coordinate Plots

As the dimension grows, three-dimensional scatterplots become less relevant, unless we know that only some variables are important. An alternative, which allows us to see all variables at once, is to followInselberg(1985) and to present the data in the form of parallel

coordinate plots. The idea is to present the data as two-dimensional graphs. Two different

versions of parallel coordinate plots are common. The main difference is an interchange of the axes. In vertical parallel coordinate plots – see Figure1.4 – the variable numbers are represented as values on the y-axis. For a vector X= [X1,..., Xd]T we represent the first

variable X1by the point (X1, 1) and the j th variable Xj by (Xj, j ). Finally, we connect the

d points by a line which goes from (X1, 1) to (X2, 2) and so on to (Xd, d). We apply the same

rule to the next d-dimensional datum. Figure1.4shows a vertical parallel coordinate plot for Fisher’s iris data. For easier visualisation, I have used the same colours for the three species as in Figure1.3, so red refers to the observations of species 1, green to those of species 2 and black to those of species 3. The parallel coordinate plot of the iris data shows that the data fall into two distinct groups – as we have also seen in Figure1.3, but unlike the previous figure it tells us that dimension 3 separates the two groups most strongly.

(35)

1.2 Visualisation 7 0 1 2 3 4 5 6 7 8 1 2 3 4

Figure 1.4 Iris data with variables represented on the y-axis and separate colours for the three

species as in Figure1.3. 2 4 6 8 10 12 14 200 400 600 800 1000

Figure 1.5 Parallel coordinate view of the illicit drug market data of Example2.14.

Instead of the three colours shown in the plot of the iris data, different colours can be used for each observation, as in Figure1.5. In a horizontal parallel coordinate plot, the x -axis represents the variable numbers 1,...,d. For a datum X = [X1··· Xd]T, the first variable

gives rise to the point (1, X1) and the j th variable Xj to ( j , Xj). The d points are connected

by a line, starting with (1, X1), then (2, X2), until we reach (d, Xd). Because we typically

identify variables with the x -axis, we will more often use horizontal parallel coordinate plots. The differently coloured lines make it easier to trace particular observations.

Figure1.5shows the 66 monthly observations on 15 features or variables of the illicit drug market data which I describe in Example2.14in Section2.6.2. Each observation (month) is displayed in a different colour. I have excluded two variables, as these have much higher val-ues and would obscure the valval-ues of the remaining variables. Looking at variable 5, heroin overdose, the question arises whether there could be two groups of observations correspond-ing to the high and low values of this variable. The analyses of these data throughout the book will allow us to look at this question in different ways and provide answers.

(36)

Interactive graphical displays and movies are also valuable visualisation tools. They are beyond the scope of this book; I refer the interested reader to Wegman (1991) or

Cook and Swayne(2007).

1.3 Multivariate Random Vectors and Data

Random vectors are vector-valued functions defined on a sample space. We consider a single random vector and collections of random vectors. For a single random vector we assume that there is a model such as the first few moments or the distribution, or we might assume that the random vector satisfies a ‘signal plus noise’ model. We are then interested in deriving properties of the random vector under the model. This scenario is called the population case.

For a collection of random vectors, we assume the vectors to be independent and iden-tically distributed and to come from the same model, for example, have the same first and second moments. Typically we do not know the true moments. We use the collection to construct estimators for the moments, and we derive properties of the estimators. Such prop-erties may include how ‘good’ an estimator is as the number of vectors in the collection grows, or we may want to draw inferences about the appropriateness of the model. This scenario is called the sample case, and we will refer to the collection of random vectors as the data or the (random) sample.

In applications, specific values are measured for each of the random vectors in the collec-tion. We call these values the realised or observed values of the data or simply the observed

data. The observed values are no longer random, and in this book we deal with the observed

data in examples only. If no ambiguity exists, I will often refer to the observed data as data throughout an example.

Generally, I will treat the population case and the sample case separately and start with the population. The distinction between the two scenarios is important, as we typically have to switch from the population parameters, such as the mean, to the sample parameters, in this case the sample mean. As a consequence, the definitions for the population and the data are similar but not the same.

1.3.1 The Population Case

Let X= ⎡ ⎢ ⎣ X1 .. . Xd ⎤ ⎥ ⎦

be a random vector from a distribution F:Rd → [0,1]. The individual Xj, with j ≤ d,

are random variables, also called the variables, components or entries of X, and X is

d-dimensional or d-variate. We assume that X has a finite d-dimensional mean or expected

valueEX and a finite d × d covariance matrix var(X). We write

μ = EX and  = var(X) = E



(X− μ)(X − μ)T 

(37)

1.3 Multivariate Random Vectors and Data 9

The entries ofμ and  are

μ = ⎡ ⎢ ⎣ μ1 .. . μd ⎤ ⎥ ⎦ and  = ⎛ ⎜ ⎜ ⎜ ⎝ σ2 1 σ12 ··· σ1d σ21 σ22 ··· σ2d .. . ... . .. ... σd1 σd2 ··· σd2 ⎞ ⎟ ⎟ ⎟ ⎠, (1.1) where σ2j = var(Xj) and σj k = cov(Xj, Xk). More rarely, I write σj j for the diagonal

elementsσ2j of.

Of primary interest in this book are the mean and covariance matrix of a random vector rather than the underlying distribution F. We write

X∼ (μ,) (1.2)

as shorthand for a random vector X which has meanμ and covariance matrix .

If X is a d-dimensional random vector and A is a d× k matrix, for some k ≥ 1, then ATX

is a k-dimensional random vector. Result1.1lists properties of ATX.

Result 1.1 Let X∼ (μ,) be a d-variate random vector. Let A and B be matrices of size

d× k and d × , respectively.

1. The mean and covariance matrix of the k-variate random vector ATX are

ATXATμ, AT A . (1.3) 2. The random vectors ATX and BTX are uncorrelated if and only if ATB = 0k×, where

0k×is the k× n matrix all of whose entries are 0.

Both these results can be strengthened when X is Gaussian, as we shall see in Corollary1.6.

1.3.2 The Random Sample Case

Let X1,...,Xn be d-dimensional random vectors. Unless otherwise stated, we assume that

the Xi are independent and from the same distribution F:Rd → [0,1] with finite mean μ

and covariance matrix. We omit reference to F when knowledge of the distribution is not required.

In statistics one often identifies a random vector with its observed values and writes

Xi = xi. We explore properties of random samples but only encounter observed values of

random vectors in the examples. For this reason, I will typically write X =X1X2···Xn



, (1.4)

for the sample of independent random vectors Xi and call this collection a random sample

or data. I will use the same notation in the examples as it will be clear from the context whether I refer to the random vectors or their observed values. If a clarification is necessary, I will provide it. We also write

X = ⎡ ⎢ ⎢ ⎢ ⎣ X11 X21 ··· Xn1 X12 X22 ··· Xn2 .. . ... . .. ... X1d X2d ··· Xnd ⎤ ⎥ ⎥ ⎥ ⎦= ⎡ ⎢ ⎢ ⎢ ⎣ X•1 X•2 .. . X•d ⎤ ⎥ ⎥ ⎥ ⎦. (1.5)

(38)

The i th column ofX is the ith random vector Xi, and the j th row X• j is the j th variable

across all n random vectors. Throughout this book the first subscript i in Xi j refers to the

i th vector Xi, and the second subscript j refers to the j th variable.

A Word of Caution. The dataX are d ×n matrices. This notation differs from that of some

authors who write data as an n× d matrix. I have chosen the d × n notation for one main reason: The population random vector is regarded as a column vector, and it is therefore more natural to regard the random vectors of the sample as column vectors. An important consequence of this notation is the fact that the population and sample cases can be treated the same way, and no additional transposes are required.1

For data, the meanμ and covariance matrix  are usually not known; instead, we work with the sample mean X and the sample covariance matrix S and sometimes write

X ∼Sam(X, S) (1.6) in order to emphasise that we refer to the sample quantities, where

X= 1 n n

i=1 Xi and (1.7) S= 1 n− 1 n

i=1 (Xi− X)(Xi− X)T. (1.8)

As before,X ∼ (μ,) refers to the data with population mean μ and covariance matrix . The sample mean and sample covariance matrix depend on the sample size n. If the depen-dence on n is important, for example, in the asymptotic developments, I will write Sninstead

of S, but normally I will omit n for simplicity.

Definitions of the sample covariance matrix use n−1or (n−1)−1in the literature. I use the (n− 1)−1version. This notation has the added advantage of being compatible with software environments such as MATLAB and R.

Data are often centred. We writeXcentfor the centred data and adopt the (unconventional)

notation Xcent≡ X − X =  X1− X ... Xn− X  . (1.9)

The centred data are of size d× n. Using this notation, the d × d sample covariance matrix S becomes

S= 1 n− 1

X − X X − X T. (1.10) In analogy with (1.1), the entries of the sample covariance matrix S are sj k, and

sj k= 1 n− 1 n

i=1 (Xi j− mj)(Xik− mk), (1.11)

with X= [m1,...,md]T, and mjis the sample mean of the j th variable. As for the population,

we write s2j or sj jfor the diagonal elements of S.

1 Consider a∈ Rd; then the projection of X onto a is aTX. Similarly, the projection of the matrixX onto a is done elementwise for each random vector Xiand results in the 1× n vector aTX. For the n × d matrix

(39)

1.4 Gaussian Random Vectors 11 1.4 Gaussian Random Vectors

1.4.1 The Multivariate Normal Distribution and the Maximum Likelihood Estimator

Many of the techniques we consider do not require knowledge of the underlying distribu-tion. However, the more we know about the data, the more we can exploit this knowledge in the derivation of rules or in decision-making processes. Of special interest is the Gaussian distribution, not least because of the Central Limit Theorem. Much of the asymptotic the-ory in multivariate statistics relies on normality assumptions. In this preliminary chapter, I present results without proof. Readers interested in digging a little deeper might find

Mardia, Kent, and Bibby(1992) andAnderson(2003) helpful. We write

XN(μ,) (1.12)

for a random vector from the multivariate normal distribution with meanμ and covari-ance matrix. We begin with the population case and consider transformations of X and relevant distributional properties.

Result 1.2 Let XN(μ,) be d-variate, and assume that −1exists.

1. Let X= −1/2(X− μ); then XN(0, Id×d), where Id×d is the d× d identity matrix.

2. Let X2= (X−μ)T−1(X−μ); then X2∼ χd2, theχ2distribution in d degrees of freedom. The first part of the result states the multivariate analogue of standardising a normal ran-dom variable. The result is a multivariate ranran-dom vector with independent variables. The second part generalises the square of a standard normal random variable. The quantity X2 is a scalar random variable which has, as in the one-dimensional case, aχ2-distribution, but this time in d degrees of freedom.

From the population we move to the sample. Fix a dimension d≥ 1. Let XiN(μ,) be

independent d-dimensional random vectors for i= 1,...,n with sample mean X and sample covariance matrix S. We define Hotelling’s T2by

T2= n X− μ TS−1 X− μ . (1.13) Further let ZjN(0,) for j = 1,...,m be independent d-dimensional random vectors,

and let W = m

j=1 ZjZTj (1.14)

be the d× d random matrix generated by the Zj; then W has the Wishart distribution

W (m,) with m degrees of freedom and covariance matrix , where m is the number of summands and is the common d ×d covariance matrix. Result1.3lists properties of these sample quantities.

Result 1.3 Let XiN(μ,) be d-dimensional random vectors for i = 1,...,n. Let S be

(40)

1. The sample mean X satisfies XN(μ,/n).

2. Assume that n> d. Let T2be given by (1.13). It follows that n− d

(n− 1)dT

2∼ F d,n−d,

the F distribution in d and n− d degrees of freedom.

3. For n observations Xi and their sample covariance matrix S there exist n−1 independent

random vectors ZjN(0,) such that

S= 1 n− 1 n−1

j=1 ZjZTj,

and (n− 1)S has a W ((n − 1),) Wishart distribution.

From the distributional properties we turn to estimation of the parameters. For this we require the likelihood function.

Let XN(μ,) be d-dimensional. The multivariate normal probability density

function f is f (· ) = (2π)−d/2det ()−1/2exp  −1 2(· −μ) T−1 (· −μ)  , (1.15)

where det () is the determinant of . For a sample X =X1X2···Xn



of independent random vectors from the normal distribution with the same mean and covariance matrix, the joint probability density function is the product of the functions (1.15).

If attention is focused on a parameterθ of the distribution, which we want to estimate, we define the normal or Gaussian likelihood (function) L as a function of the parameterθ of interest conditional on the data

L(θ|X) = (2π)−nd/2det ()−n/2exp  −1 2 n

i=1 (Xi− μ)T−1(Xi− μ)  . (1.16)

Assume that the parameter of interest is the mean and the covariance matrix, soθ = (μ,); then, for the normal likelihood, the maximum likelihood estimator (MLE) ofθ, denoted by θ, is θ = (μ, ), where μ = 1 n n

i=1 Xi = X and   = 1 n n

i=1 Xi− X Xi− X T =n− 1 n S. (1.17)

Remark. In this book we distinguish between  and S, the sample covariance matrix

defined in (1.8). For details and properties of MLEs, see chapter 7 of Casella and Berger

References

Related documents

• Follow up with your employer each reporting period to ensure your hours are reported on a regular basis?. • Discuss your progress with

The updated Field Implementation ECO 103 shall be applied to all Infinity Voting Panels with serial numbers below 5000 in a manner approved as part of the EAC certification of

This essay asserts that to effectively degrade and ultimately destroy the Islamic State of Iraq and Syria (ISIS), and to topple the Bashar al-Assad’s regime, the international

19% serve a county. Fourteen per cent of the centers provide service for adjoining states in addition to the states in which they are located; usually these adjoining states have

Standardization of herbal raw drugs include passport data of raw plant drugs, botanical authentification, microscopic & molecular examination, identification of

Field experiments were conducted at Ebonyi State University Research Farm during 2009 and 2010 farming seasons to evaluate the effect of intercropping maize with

Al-Hazemi (2000) suggested that vocabulary is more vulnerable to attrition than grammar in advanced L2 learners who had acquired the language in a natural setting and similar

National Conference on Technical Vocational Education, Training and Skills Development: A Roadmap for Empowerment (Dec. 2008): Ministry of Human Resource Development, Department