Applying Multivariate Kernel Density Estimation to Kernel CVBF

3. MULTIVARIATE KERNEL DENSITY ESTIMATION

3.5 Applying Multivariate Kernel Density Estimation to Kernel CVBF

This discussion raises two important questions. First, which of the three estimation schemes should be used in the multivariate kernel CVBF method? Perhaps we should consider only the full bandwidth matrix class, since it gives the best estimate of the underlying density function. However, its computational cost may be prohibitive, and in the interest of simplicity the scalar bandwidth matrix may be preferred. Another possibility is to take the advice of Wand and Jones (1993) and re-scale the data using the sample covariance matrix before considering a restricted bandwidth matrix class. This way, we can improve our density estimate while still taking advantage of the reduced number of smoothing parameters.
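The re-scaling idea can be sketched in a few lines. The following is a minimal illustration, not the implementation used in this work: it assumes a Gaussian kernel, and the function names (`sphere`, `scalar_kde`, `sphered_kde`) are hypothetical. The data are transformed so that their sample covariance is the identity, a scalar bandwidth h is applied on the transformed scale, and the estimate is mapped back to the original scale. On the original scale this is equivalent to using the bandwidth matrix H = h^2 S, where S is the sample covariance matrix.

```python
import numpy as np


def sphere(data):
    """Center the data and rescale so the sample covariance is the identity."""
    centered = data - data.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    # Symmetric inverse square root of the sample covariance matrix.
    eigvals, eigvecs = np.linalg.eigh(cov)
    inv_sqrt = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T
    return centered @ inv_sqrt, inv_sqrt


def scalar_kde(points, train, h):
    """Gaussian kernel density estimate with H = h^2 I, evaluated at `points`."""
    d = train.shape[1]
    diffs = points[:, None, :] - train[None, :, :]
    sq_dist = np.sum(diffs ** 2, axis=2)
    norm_const = (2.0 * np.pi * h ** 2) ** (-d / 2.0)
    return norm_const * np.exp(-0.5 * sq_dist / h ** 2).mean(axis=1)


def sphered_kde(points, train, h):
    """Estimate on the original scale after sphering (assumed Gaussian kernel)."""
    z_train, inv_sqrt = sphere(train)
    z_points = (points - train.mean(axis=0)) @ inv_sqrt
    # Change of variables: multiply by the Jacobian of the sphering map.
    return scalar_kde(z_points, z_train, h) * abs(np.linalg.det(inv_sqrt))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sample = rng.multivariate_normal([0.0, 0.0], [[2.0, 0.8], [0.8, 1.0]], size=200)
    # The bandwidth h = 0.5 is an arbitrary illustrative value.
    print(sphered_kde(sample[:3], sample, h=0.5))
```

Only one smoothing parameter has to be chosen, yet the resulting kernel is oriented along the sample covariance structure rather than being spherical on the original scale.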

The choice of bandwidth matrix class may also differ depending on the data dimension. When d = 2, the computational cost may be inconsequential regardless of bandwidth matrix class, so using H ∈ F may be preferred. As the dimension increases, the respective computation times grow, owing to the curse of dimensionality, until eventually H ∈ D and/or H ∈ S are the only feasible options.
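To make the growth in the number of smoothing parameters concrete, the short sketch below counts the free parameters in each class. It assumes the usual reading of the labels: F is the class of full symmetric positive definite matrices, D the class of diagonal matrices, and S the class of scalar multiples of the identity. The function name is hypothetical.

```python
def n_bandwidth_params(d, matrix_class):
    """Number of free smoothing parameters for a d x d bandwidth matrix."""
    counts = {
        "F": d * (d + 1) // 2,  # full symmetric matrix
        "D": d,                 # diagonal matrix
        "S": 1,                 # scalar multiple of the identity
    }
    return counts[matrix_class]


if __name__ == "__main__":
    for d in (2, 5, 10):
        row = {c: n_bandwidth_params(d, c) for c in ("F", "D", "S")}
        print(d, row)
```

For d = 2 the full class needs only 3 parameters, but for d = 10 it needs 55, while the diagonal class needs 10 and the scalar class still needs 1.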

The second important question is, what are the practical limits on the number of dimensions for which the kernel CVBF method works reasonably well? Of course, the answer to this question depends on the number of observations. Regardless of which bandwidth matrix class we consider, when the data dimension becomes moderately large, accurate estimation of the true density function will become difficult (if not impossible). This could play a pivotal role in determining plausible dimensions for application of the kernel CVBF method. If the kernel model never fits the data well, then we will always favor the null model, which makes for a miserable goodness-of-fit test.

Both of these questions will be answered in the next chapter where we consider how to extend the univariate kernel CVBF method of Hart and Choi (2016) to multivariate data.

4. TESTING MULTIVARIATE GOODNESS-OF-FIT USING KERNEL CROSS-VALIDATION BAYES FACTORS

The goal in this chapter is to combine the contents of Chapters 2 and 3 to extend the univariate CVBFK technique of Hart and Choi (2016) to test goodness-of-fit for data in any dimension. Section 4.1 begins with a description of the overall CVBFK methodology when applied to multivariate data, since slight modifications of the univariate approach must be made. Next, Section 4.2 contains the necessary details for constructing and computing the alternative marginal likelihoods using each of the three bandwidth matrix classes. To compare the performance of these three constructions, we carry out simulations in Section 4.3 in which we test for multivariate normality. Throughout this chapter we consider only tests for multivariate normality, since the multivariate normal distribution is by far the most common distributional assumption in multivariate analysis and inference. However, keep in mind that the CVBFK methods can be applied to test any d-dimensional parametric model.

In Section 4.4, we explore the location-scale invariance of the kernel CVBF method and make the necessary modifications to ensure that the resulting conclusions are independent of changes in location and scale. To implement the kernel CVBF method in practice, we need to choose the training set size m and the number of random splits N. Section 4.5 describes modifications to the calibration scheme in Subsection 2.2 for finding m, as well as a small simulation to explain our recommendation for the choice of N for multivariate data. Arguably the most important property of any model selection technique using Bayes factors is consistency (Definition 1), which will be assessed in Section 4.6 for the scalar bandwidth construction. Section 4.6 also includes a description of a Divide and Conquer scheme for increasing the computational efficiency of the kernel CVBF method in large samples without compromising the overall conclusions.
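The roles of m and N can be illustrated with a small sketch that generates N random partitions of n observations into a training set of size m and a validation set of size n - m. This shows only the splitting step. How the Bayes factors from the individual splits are computed and combined is described in the sections referenced above and is not reproduced here. The function name and the values of n, m, and N in the example are hypothetical.

```python
import numpy as np


def random_splits(n, m, n_splits, seed=None):
    """Yield (training, validation) index arrays for random data splits.

    Each split puts m of the n observations in the training set and the
    remaining n - m in the validation set.
    """
    if not 0 < m < n:
        raise ValueError("training set size m must satisfy 0 < m < n")
    rng = np.random.default_rng(seed)
    for _ in range(n_splits):
        perm = rng.permutation(n)
        yield perm[:m], perm[m:]


if __name__ == "__main__":
    # Illustrative values only: n = 10 observations, m = 4, N = 3 splits.
    for train_idx, valid_idx in random_splits(10, 4, 3, seed=1):
        print(sorted(train_idx), sorted(valid_idx))
```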

As described in Subsection 1.1.1, there are a few commonly used frequentist tests for goodness-of-fit. Section 4.7 contains a power study for these frequentist tests along with a few kernel CVBF constructions. It is here that we make a final recommendation as to which kernel CVBF construction to use in practice, after examining their respective performances in terms of power and Type I error rates. However, it will be clear early on in this chapter that the computational burden is far too great for the unconstrained and diagonal bandwidth matrix constructions. One topic that is almost synonymous with multivariate analysis is the curse of dimensionality, which we briefly introduced in Section 3.4. In Section 4.8, we describe how the curse of dimensionality impacts the kernel CVBF methods, in particular their applicability to data beyond moderate dimensions. We also provide possible approaches by which goodness-of-fit can be assessed in higher dimensional data.

Sections 4.2 to 4.8 are all focused on the formulation, properties, and overall performance of the three kernel CVBF constructions. To see how we can assess multivariate goodness-of-fit in practice, Section 4.9 examines testing bivariate normality for Academic Performance Index (API) scores in California schools. In this example we carry out all the calibration steps and illustrate the importance of choosing m appropriately. An interesting application of the kernel CVBF method based on the scalar bandwidth matrix is in checking the normality assumptions in random effects models. Some simple modifications to the method must be made, which are described in Section 4.10. Then, using gene expression data from five rats, we will apply the kernel CVBF method to check the assumptions while also implementing some of the dimension reduction and Divide and Conquer techniques described in this chapter. Lastly, an overall summary of this chapter is given in Section 4.11.

In document Goodness-of-Fit Testing Using Cross-Validation Bayes Factors (Page 49-53)