An Adaptive Test of Independence with Analytic Kernel Embeddings

(1)

An Adaptive Test of Independence with Analytic Kernel Embeddings

Wittawat Jitkrittum1 Zoltán Szabó2 Arthur Gretton1

Abstract

A new computationally efficient dependence mea-sure, and an adaptive statistical test of independence, are proposed. The dependence measure is the differ-ence between analytic embeddings of the joint distri-bution and the product of the marginals, evaluated at a finite set of locations (features). These features are chosen so as to maximize a lower bound on the test power, resulting in a test that is data-efficient, and that runs in linear time (with respect to the sample sizen). The optimized features can be interpreted as evidence to reject the null hypothesis, indicating regions in the joint domain where the joint distri-bution and the product of the marginals differ most. Consistency of the independence test is established, for an appropriate choice of features. In real-world benchmarks, independence tests using the optimized features perform comparably to the state-of-the-art quadratic-time HSIC test, and outperform competing

O(n)and_O(nlogn)tests.

1. Introduction

We consider the design of adaptive, nonparametric statistical tests of dependence: that is, tests of whether a joint distri-butionPxyfactorizes into the product of marginalsPxPy

with the null hypothesis thatH0 :X andY are indepen-dent. While classical tests of dependence, such as Pearson’s correlation and Kendall’sτ, are able to detect monotonic relations between univariate variables, more modern tests can address complex interactions, for instance changes in variance ofX with the value ofY. Key to many recent tests is to examine covariance or correlation between data features. These interactions become significantly harder to detect, and the features are more difficult to design, when the data reside in high dimensions.

Zoltán Szabó’s ORCID ID: 0000-0001-6183-7603. Arthur Gret-ton’s ORCID ID: 0000-0003-3169-7624. 1_{Gatsby Unit,}

Univer-sity College London, UK.2CMAP, École Polytechnique, France. Correspondence to: Wittawat Jitkrittum <[email protected]>. Proceedings of the34th

A basic nonlinear dependence measure is the Hilbert-Schmidt Independence Criterion (HSIC), which is the Hilbert-Schmidt norm of the covariance operator between feature mappings of the random variables (Gretton et al.,

2005;2008). Each random variableX andY is mapped to a respective reproducing kernel Hilbert space_Hk and Hl. For sufficiently rich mappings, the covariance operator

norm is zero if and only if the variables are independent. A second basic nonlinear dependence measure is the smoothed difference between the characteristic function of the joint distribution, and that of the product of marginals. When a particular smoothing function is used, the statistic corre-sponds to the covariance between distances ofXandY vari-able pairs (Feuerverger,1993;Székely et al.,2007;Székely & Rizzo,2009), yielding a simple test statistic based on pairwise distances. It has been shown bySejdinovic et al.

(2013) that the distance covariance (and its generalization to semi-metrics) is an instance of HSIC for an appropriate choice of kernels. A disadvantage of these feature covari-ance statistics, however, is that they require quadratic time to compute (besides in the special case of the distance co-variance with univariate real-valued variables, whereHuo & Székely(2016) achieve an_O(nlogn)cost). Moreover, the feature covariance statistics have intractable null distribu-tions, and either a permutation approach or the solution of an expensive eigenvalue problem (e.g.Zhang et al.,2011) is required for consistent estimation of the quantiles. Several approaches were proposed byZhang et al.(2017) to obtain faster tests along the lines of HSIC. These include comput-ing HSIC on finite-dimensional feature mappcomput-ings chosen as random Fourier features (RFFs) (Rahimi & Recht,2008), a block-averaged statistic, and a Nyström approximation to the statistic. Key to each of these approaches is a more efficient computation of the statistic and its threshold under the null distribution: for RFFs, the null distribution is a finite weighted sum ofχ2_{variables; for the block-averaged} statistic, the null distribution is asymptotically normal; for Nyström, either a permutation approach is employed, or the spectrum of the Nyström approximation to the kernel matrix is used in approximating the null distribution. Each of these methods costs significantly less than the_O(n2₎_cost of the full HSIC (the cost is linear inn, but also depends quadratically on the number of features retained). A poten-tial disadvantage of the Nyström and Fourier approaches is that the features are not optimized to maximize test power,

(2)

but are chosen randomly. The block statistic performs worse than both, due to the large variance of the statistic under the null (which can be mitigated by observing more data). In addition to feature covariances, correlation measures have also been developed in infinite dimensional feature spaces: in particular,Bach & Jordan(2002);Fukumizu et al.(2008) proposed statistics on the correlation operator in a repro-ducing kernel Hilbert space. While convergence has been established for certain of these statistics, their computa-tional cost is high at_O(n3), and test thresholds have relied on permutation. A number of much faster approaches to testing based on feature correlations have been proposed, however. For instance,Dauxois & Nkiet(1998) compute statistics of the correlation between finite sets of basis func-tions, chosen for instance to be step functions or low order B-splines. The cost of this approach is_O(n). This idea was extended byLopez-Paz et al.(2013), who computed the canonical correlation between finite sets of basis func-tions chosen as random Fourier features; in addition, they performed a copula transform on the inputs, with a total cost of_O(nlogn). Finally, space partitioning approaches have also been proposed, based on statistics such as the KL divergence, however these apply only to univariate vari-ables (Heller et al.,2016), or to multivariate variables of low dimension (Gretton & Györfi,2010) (that said, these tests have other advantages of theoretical interest, notably distribution-independent test thresholds).

The approach we take is most closely related to HSIC on a finite set of features. Our simplest test statistic, the Finite Set Independence Criterion (FSIC), is an average of covari-ances of analytic functions (i.e., features) defined on each ofX andY. A normalized version of the statistic (NFSIC) yields a distribution-independent asymptotic test threshold. We show that our test is consistent, despite a finite number of analytic features being used, via a generalization of ar-guments inChwialkowski et al.(2015). As in recent work on two-sample testing byJitkrittum et al.(2016), our test isadaptivein the sense that we choose our features on a held-out validation set to optimize a lower bound on the test power. The design of features for independence testing turns out to be quite different to the case of two-sample testing, however: the task is to find correlated featurepairs on the respective marginal domains, rather than attempting to find a single, high-dimensional feature representation on thetensor productof the marginals, as we would need to do if we were comparing distributionsPxyandQxy. While the use of coupled feature pairs on the marginals entails a smaller feature space dimension, it introduces significant complications in the proof of the lower bound, compared with the two-sample case. We demonstrate the performance of our tests on several challenging artificial and real-world datasets, including detection of dependence between music and its year of appearance, and between videos and captions.

In these experiments, we outperform competing linear and

O(nlogn)time tests.

2. Independence Criteria and Statistical Tests

We introduce two test statistics: first, the Finite Set Inde-pendence Criterion (FSIC), which builds on the principle that dependence can be measured in terms of the covari-ance between data features. Next, we propose a normalized version of this statistic (NFSIC), with a simpler asymptotic distribution whenPxy = PxPy. We show how to select features for the latter statistic to maximize a lower bound on the power of its corresponding statistical test.

2.1. The Finite Set Independence Criterion

We begin by recalling the Hilbert-Schmidt Independence Criterion (HSIC) as proposed inGretton et al.(2005), since our unnormalized statistic is built along similar lines. Con-sider two random variablesX _{∈ X ⊆}Rdx _and_Y _{∈ Y ⊆} Rdy_{. Denote by}_Pxy_{the joint distribution between}_X_and_Y_; PxandPyare the marginal distributions ofXandY. Let_⊗ denote the tensor product, such that(a_⊗b)c=a_hb, c_i. As-sume thatk:_{X × X →}Randl:Y × Y →Rare positive

definite kernels associated with reproducing kernel Hilbert spaces (RKHS)_HkandHl, respectively. Letk · kHSbe the

norm on the space of_Hl→ HkHilbert-Schmidt operators.

Then, HSIC betweenXandY is defined as

HSIC(X, Y) =µxy−µx⊗µy 2 HS =E(x,y),(x0_,_y0₎[k(x,x0)l(y,y0)] +_ExEx0[k(x,x0)]EyEy0[l(y,y0)] −2E(x,y)[Ex0[k(x,x0)]_E_y0[l(y,y0)]], (1) whereEx :=Ex∼Px,Ey :=Ey∼Py,Exy :=E(x,y)∼Pxy,

andx0_{is an independent copy of}_x_{. The mean embedding of} Pxybelongs to the space of Hilbert-Schmidt operators from Hl to Hk, µxy :=

R

X ×Yk(x,·)⊗l(y,·) dPxy(x,y) ∈ HS(_Hl,Hk), and the marginal mean embeddings areµx:=

R

Xk(x,·) dPx(x) ∈ Hk andµy :=

R

Yl(y,·) dPy(y) ∈ Hl(Smola et al.,2007).Gretton et al.(2005, Theorem 4)

show that if the kernels kandl are universal (Steinwart & Christmann,2008) on compact domains_X and_Y, then

HSIC(X, Y) = 0if and only ifXandY are independent. Alternatively,Gretton(2015) shows that it is sufficient for each ofkandlto be characteristic to their respective do-mains (meaning that distribution embeddings are injective in each marginal domain: seeSriperumbudur et al.(2010)). Given a joint sampleZn = {(xi,yi)}ni=1 ∼ Pxy, an em-pirical estimator of HSIC can be computed in_O(n2)time by replacing the population expectations in (1) with their corresponding empirical expectations based onZn.

We now propose our new linear-time dependence mea-sure, the Finite Set Independence Criterion (FSIC). Let

(3)

X ⊆ Rdx and Y ⊆ Rdy be open sets. Let µxµy(x,y) :=µx(x)µy(y)The idea is to seeµxy(v,w) = Exy[k(x,v)l(y,w)], µx(v) = Ex[k(x,v)]andµy(w) =

Ey[l(y,w)]as smooth functions, and consider a new dis-tance betweenµxyandµxµyinstead of a Hilbert-Schmidt distance as in HSIC (Gretton et al.,2005). The new mea-sure is given by the average of squared differences be-tweenµxyandµxµy, evaluated atJ random test locations

VJ:=_{(vi,wi)_}J_i₌₁_{⊂ X × Y}. FSIC2(X, Y) := 1 J J X i=1 u2(vi,wi) = 1 Jkuk 2 2, where u(v,w) :=µxy(v,w)₋µx(v)µy(w) =_Exy[k(x,v)l(y,w)]−Ex[k(x,v)]Ey[l(y,w)], (2) = covxy[k(x,v), l(y,w)], u := (u(v1,w1), . . . , u(vJ,wJ))>, and {(vi,wi)}Ji=1 are realizations from an absolutely continuous distribution (wrt the Lebesgue measure).

Our first result in Proposition 2 states that FSIC(X, Y)

almost surely defines a dependence measure for the random variablesX andY, provided that the kernelskandlsatisfy some conditions summarized in AssumptionA.

Definition 1 (A0 kernels). Let X be an open set inRd.

A positive definite kernelk : _{X × X →} _Ris said to be

A0 if it is bounded (i.e., there exists B ∈ Rsuch that sup_x_,_x0_∈Xk(x,x0) ≤ B), analytic (Chwialkowski et al., 2015) and vanishes at infinity: for allv _{∈ X}, f(x) := k(x,v)is bounded, real analytic on_X, and for all >0the set_{x_{| |}f(x)_{| ≥}_}is compact.1

Assumption A. The kernelsk:_{X × X →}_Randl: _{Y ×} Y → RareA0 (assumed to be bounded by Bk andBl

respectively), characteristic (Sriperumbudur et al.,2010, Definition 6), and translation invariant i.e., there existˇk

andˇlsuch thatk(x,x0) = ˇk(x₋x0)for allx,x0_{∈ X}, and

l(y,y0) = ˇl(y₋y0)for ally,y0_{∈ Y}.

Proposition 2(FSIC is a dependence measure). Assume that assumption A holds, and that the test locations

VJ = _{(vi,wi)_}J

i=1 are drawn from an absolutely con-tinuous distributionη. Then,η-almost surely, it holds that

FSIC(X, Y) = √1

Jkuk2 = 0if and only ifX andY are

independent.

Proof. We will prove the forward direction. The back-ward direction is obvious. Letu := µxy ₋µx_⊗µy, a member of the RKHS_Hk× Hlassociated with the

prod-uct kernelg((x,y),(v,w)) := k(x,v)l(y,w). Since k

1

A related class is the set ofC0kernels (Sriperumbudur et al., 2010, Section 2). A kernelkis said to beC0if it is bounded with

k(·,v)∈C0(X)for allv∈ XwhereC0(X)contains the set of functions that vanish at infinity. Note that ifkisA0, then it isC0.

andlareC0-kernels, characteristic and translation invari-ant, Gretton (2015, Theorem 2) implies that u = 0 if and only ifPxy = PxPy. Since for allv _{∈ X},w _{∈ Y},

x _7→ k(x,v)and y _7→ l(y,w) are real analytic, it fol-lows fromKrantz & Parks(2002, Proposition 2.2.10) that

(x,y)_7→k(x,v)l(y,w)is analytic on_{X × Y}. Sincegis analytic, and bounded, Lemma9ensures thatuis a real analytic function. It is known that ifu ₆= 0, the set of rootsRu :=_{(v,w) _|u(v,w) = 0_}has Lebesgue mea-sure zero (Mityagin,2015). Hence, it is sufficient to draw

(v,w)from an absolutely continuous distribution to have

(v,w)_∈/ Ruη-almost surely, and consequently ifXandY

are dependent, thenFSIC(X, Y)>0,η-almost surely. Examples of kernelskandlwhich satisfy AssumptionA

are the Gaussian kernels. FSIC usesµxyas a proxy forPxy, andµxµy as a proxy forPxPy. Proposition 2states that, to detect the dependence betweenXandY, it is sufficient to evaluate the difference of the population joint embed-dingµxyand the embedding of the product of the marginal

distributionsµxµyat a finite number of locations (defined

by VJ). The intuitive explanation of this property is as

follows. IfPxy = PxPy, thenu(v,w) = 0everywhere,

andFSIC(X, Y) = 0for anyVJ. IfPxy 6=PxPy, thenu

will not be a zero function. Using the same argument as inChwialkowski et al.(2015), sincekandl are analytic,

uis also analytic, and the set of roots Ru := _{(v,w) _| u(v,w) = 0_}has Lebesgue measure zero (Mityagin,2015). Thus, it is sufficient to draw(v,w)from an absolutely con-tinuous distribution to have(v,w)_∈/ Ruη-almost surely, and henceFSIC(X, Y)>0. We note that a characteristic kernel which is not analytic may produceusuch thatRuhas a positive Lebesgue measure. In this case, there is a positive probability that(v,w)_∈Ru, resulting in a potential fail-ure to detect the dependence. The required AssumptionA

only imposes conditions separately on each of the marginal kernelskandl. In particular, there is no condition on the product kernelg((x,y),(v,w)) :=k(x,v)l(y,w)which would have been much more restrictive.

Plug-in Estimator Assume that we observe a joint sample Zn := {(xi,yi)}ni=1

i.i.d.

∼ Pxy. Un-biased estimators of µxy(v,w) and µxµy(v,w)

are µˆxy(v,w) := 1n Pn i=1k(xi,v)l(yi,w) and [ µxµy(v,w) := _n₍_n1₋₁₎P n i=1 P j6=ik(xi,v)l(yj,w),

respectively. A straightforward empirical estimator of

FSIC2is then given by

\ FSIC2₍_Z n) = 1 J J X i=1 ˆ u(vi,wi)2, ˆ u(v,w) := ˆµxy(v,w)−µ[xµy(v,w) (3) = 2 n(n−1) X i<j h(v,w)((xi,yi),(xj,yj)), (4) where h(v,w)((x,y),(x0,y0)) := 12(k(x,v) −

(4)

k(x0,v))(l(y,w) ₋ l(y0,w)). For conciseness, we defineuˆ := (ˆu1, . . . ,uJ)ˆ > ∈ RJ whereuiˆ := ˆu(vi,wi)

so thatFSIC\2₍_Z_{n) =} 1

Juˆ >_u_ˆ_. \

FSIC2_{can be efficiently computed in}_O_((dx_+dy_{)J n)}_time which is linear inn[see (3) which does not have nested double sums], assuming that the runtime complexity of evaluatingk(x,v)is_O(dx)and that ofl(y,w)is_O(dy). SinceFSICsatisfiesFSIC(X, Y) = 0 _⇐⇒ X _⊥Y, in principle its empirical estimator can be used as a test statistic for an independence test proposing a null hypothesisH0: “XandY are independent” against an alternativeH1:“X

andY are dependent.” The null distribution (i.e., distribu-tion of the test statistic assuming thatH0is true) is challeng-ing to obtain, however, and depends on the unknownPxy.

This prompts us to consider a normalized version ofFSIC

whose asymptotic null distribution takes a more convenient form. We first derive the asymptotic distribution ofuˆ in Proposition3, which we use to derive the normalized test statistic in Theorem4. As a shorthand, we writez:= (x,y),

t:= (v,w),covzis covariance,Vzstands for variance.

Proposition 3 (Asymptotic distribution of uˆ). Define

u := (u(t1), . . . , u(tJ))>, k(˜ x,v) := k(x,v) − Ex0k(x0,v), and ˜l(y,w) := l(y,w) − _E_y0l(y0,w).

Let Σ = [Σij] ∈ RJ×J be the positive semi-definite

matrix with entries Σij = covz(ˆu(ti),u(ˆ tj)) = Exy[˜k(x,vi)˜l(y,wi)˜k(x,vj)˜l(y,wj)]−u(ti)u(tj). Then, under both H0 and H1, for any fixed test locations

{t1, . . . ,tJ} for which Σ is full rank, and 0 < Vz[htj(z)]< ∞forj = 1, . . . , J, it holds that

√_n(ˆ_u −

u)_{→ N}d (0,Σ).

Proof. For a fixed_{t1, . . . ,tJ},uˆis a one-sample

second-order multivariate U-statistic with a U-statistic kernelht. Thus, by Lehmann (1999, Theorem 6.1.6) and Kowal-ski & Tu (2008, Section 5.1, Theorem 1), it follows di-rectly that √n(ˆu₋u) _{→ N}d (0,Σ) where we note that

Exy[˜k(x,v)˜l(y,w)] =u(v,w).

Recall from Proposition2thatu=0holds almost surely un-derH0. The asymptotic normality described in Proposition

3implies thatnFSIC\2₌ n Juˆ

>_u_ˆ_{converges in distribution} to a sum ofJ dependent weightedχ2 _{random variables.} The dependence comes from the fact that the coordinates

ˆ

u1. . . ,uJˆ ofuˆ all depend on the sampleZn. This null

dis-tribution is not analytically tractable, and requires a large number of simulations to compute the rejection threshold

Tαfor a given significance valueα.

2.2. Normalized FSIC and Adaptive Test

For the purpose of an independence test, we will consider a normalized variant ofFSIC\2_{, which we call} _NFSIC_\2_,

whose tractable asymptotic null distribution isχ2_(J₎_{, the} chi-squared distribution with J degrees of freedom. We then show that the independence test defined byNFSIC\2_is consistent. These results are given in Theorem4.

Theorem 4(Independence test based onNFSIC\2_is consis-tent). LetΣˆ be a consistent estimate ofΣbased on the joint sampleZn, whereΣis defined in Proposition3. Assume

thatVJ={(vi,wi)}Ji=1∼ηwhereηis absolutely contin-uous wrt the Lebesgue measure. TheNFSIC\2_{statistic is} defined asˆλn :=nˆu>Σˆ +γnI

−1

ˆ

uwhereγn _≥0is a regularization parameter. Assume that

1. AssumptionAholds.

2. Σis invertibleη-almost surely. 3. limn→∞γn= 0.

Then, for anyk, landVJ satisfying the assumptions, 1. UnderH0,ˆλn

d

→χ2_(J)_as_n

→ ∞. 2. UnderH1, for anyr _∈ _R,limn→∞P

ˆ λn≥r

= 1 η-almost surely. That is, the independence test based on

\

NFSIC2_{is consistent.}

Proof (sketch) . UnderH0,nˆu>( ˆΣ+γnI)−1uˆ

asymptot-ically followsχ2_(J)_because√_nˆ_u_{is asymptotically} nor-mally distributed (see Proposition3). Claim 2 builds on the result in Proposition2stating thatu₆=0underH1; it follows using the convergence ofuˆtou. The full proof can be found in AppendixD.

Theorem4states that ifH1holds, the statistic can be arbi-trarily large asnincreases, allowingH0to be rejected for any fixed threshold. Asymptotically the test thresholdTαis given by the(1₋α)-quantile ofχ2_(J₎_{and is independent} ofn. The assumption on the consistency ofΣˆ is required to obtain the asymptotic chi-squared distribution. The regu-larization parameterγnis to ensure that( ˆΣ+γnI)−1can

be stably computed. In practice,γnrequires no tuning, and can be set to be a very small constant. We emphasize thatJ

need not increase withnfor test consistency.

The next proposition states that the computational com-plexity of theNFSIC\2_{estimator is linear in both the input} dimension and sample size, and that it can be expressed in terms of the K=[Kij] = [k(vi,xj)] _∈ RJ×n,L = [Lij] = [l(wi,yj)]_∈RJ×nmatrices. In contrast to typical

kernel methods, a large Gram matrix of sizen_×nis not needed to computeNFSIC\2_.

Proposition 5(An empirical estimator ofNFSIC\2₎_. _Let

1n := (1, . . . ,1)> ∈ Rn. Denote by◦the element-wise

(5)

1. uˆ =(K◦L)1n n−1 −

(K1n)◦(L1n) n(n−1) .

2. A consistent estimator forΣisΣˆ = ΓΓ_n> where

Γ:= (K₋n−1K1n1>n)◦(L−n− 1_L1 n1>n)−uˆ b₁> n, ˆ ub=n−1(K_◦L)1n−n−2(K1n)◦(L1n).

Assume that the complexity of the kernel evaluation is lin-ear in the input dimension. Then the test statisticλnˆ = nˆu>Σˆ +γnI

−1

ˆ

ucan be computed in_O(J3+J2n+ (dx+dy)J n)time.

Proof (sketch). Claim 1 foruˆ is straightforward. The ex-pression forΣˆ in claim 2 follows directly from the asymp-totic covariance expression in Proposition3. The consis-tency ofΣˆ can be obtained by noting that the finite sample bound forP(kΣˆ−ΣkF > t)decreases asnincreases. This

is implicitly shown in Appendix E.2.2and its following sections.

Although the dependency of the estimator onJis cubic, we empirically observe that only a small value ofJis required (see Section3). The number of test locationsJ relates to the number of regions in_{X × Y}ofpxyandpxpythat differ

(see Figure1).

Theorem4asserts the consistency of the test for any test locationsVJdrawn from an absolutely continuous

distribu-tion. In practice,VJcan be further optimized to increase the test power for a fixed sample size. Our final theoretical result gives a lower bound on the test power ofNFSIC\2_i.e., the probability of correctly rejectingH0. We will use this lower bound as the objective function to determineVJand the kernel parameters. Let_{k · k}F be the Frobenius norm. Theorem 6 (A lower bound on the test power). Let

NFSIC2_{(X, Y}_{) :=} _λn _:=_n_u>_Σ−1_u_{. Let}

Kbe a kernel class fork,_Lbe a kernel class forl, and_Vbe a collection with each element being a set ofJ locations. Assume that

1. There exist finite Bk and Bl such that

sup_k_∈Ksup_x_,_x0_∈X|k(x,x0)| ≤ Bk and sup_l_∈Lsup_y_,_y0_∈Y|l(y,y0)| ≤Bl.

2. ˜c:= sup_k_∈Ksup_l_∈Lsup_V J∈VkΣ

−1

kF <∞.

Then, for anyk_{∈ K}, l_{∈ L}, VJ ∈ V, andλn ≥r, the test

power satisfiesP ˆ λn_≥r_≥L(λn)where L(λn) = 1₋62e−ξ1γn2(λn−r)2/n₋_2e−b0.5nc(λn−r)2/[ξ2n2] −2e−[(λn−r)γn(n−1)/3−ξ3n−c3γn2n(n−1)] 2 /[ξ4n2(n−1)]_,

b·cis the floor function,ξ1 := ₃2c21 1J2B∗

,B∗is a constant depending on onlyBkandBl,ξ2:= 72c22J B2,B :=BkBl,

ξ3 := 8c1B2J,c3 := 4B2J˜c2,ξ4 := 28B4J2c21, c1 :=

4B2_J√_J_˜_c_{, and}_c

2 := 4B

√

J˜c. Moreover, for sufficiently large fixedn,L(λn)is increasing inλn.

We provide the proof in Appendix E. To put Theorem 6 into perspective, assume that _K =

n (x,v)_7→exp₋kx₂−_σv2k2 x |σ2 x∈[σ2x,l, σ2x,u] o =: _Kg for some 0 < σ2 x,l < σ 2 x,u < ∞ and L = n (y,w)_7→exp₋ky₂−_σw2k2 y |σ2 y∈[σ2y,l, σ2y,u] o =: _Lg

for some 0 < σ_y,l2 < σ2y,u < ∞ are Gaussian kernel

classes. Then, in Theorem 6, B = Bk = Bl = 1,

and B∗ = 2. The assumption ˜c < _∞ is a techni-cal condition to guarantee that the test power lower bound is finite for all θ defined by the feasible sets

K,_L, and _V. Let _V,r :=

VJ | kvik2,kwik2 ≤ rand_kvi−vjk22+kwi−wjk22≥, for alli6=j . If we set_K=_Kg,_L=_Lg,and_V =_V,rfor some, r >0, then ˜

c <_∞as_Kg,Lg,andV,rare compact. In practice, these

conditions do not necessarily create restrictions as they almost always hold implicitly. We show in AppendixCthat the objective function used to chooseVJ will discourage

any two locations to be in the same neighborhood.

Parameter TuningLet θ be the collection of all tuning parameters of the test. Ifk_{∈ K}g andl ∈ Lg(i.e.,

Gaus-sian kernels), then θ = _{σ2

x, σy2, VJ}. The test power

lower bound L(λn)in Theorem6is a function ofλn = nu>Σ−1_u_{which is the population counterpart of the test} statisticˆλn. As in FSIC, it can be shown thatλn = 0if and only ifXareY are independent (from Proposition2). According to Theorem6, for a sufficiently largen, the test power lower bound is increasing inλn. One can therefore think ofλn(a function ofθ) as representing how easily the test rejectsH0given a problemPxy. The higher theλn, the

greater the lower bound on the test power, and thus the more likely it is that the test will rejectH0when it is false. In light of this reasoning, we propose to set θ by maxi-mizing the lower bound on the test power i.e., set θ to

θ∗= arg maxθL(λn). Assume thatnis sufficiently large so that λn _7→ L(λn) is an increasing function. Then,

arg maxθL(λn) = arg maxθλn. That this procedure is also valid under H0 can be seen as follows. Under H0,

θ∗= arg maxθ0will be arbitrary. Since Theorem6 guaran-tees thatλnˆ _→d χ2(J)asn_{→ ∞}for anyθ, the asymptotic null distribution does not change by usingθ∗. In practice,

λnis a population quantity which is unknown. We propose dividing the sampleZninto two disjoint sets: training and

test sets. The training set is used to computeλˆn(an estimate

ofλn) to optimize forθ∗, and the test set is used for the

ac-tual independence test with the optimizedθ∗. The splitting is to guarantee the independence ofθ∗and the test sample to avoid overfitting.

(6)

(a)µˆxy(v,w) (b)µ[xµy(v,w)

(c)Σb(v,w) (d) Statisticλˆn(v,w)

Figure 1: Illustration ofNFSIC\2_.

To better understand the behaviour ofNFSIC\2_{, we} visual-izeµxy(ˆ v,w),µxµy([ v,w)andΣˆ(v,w)as a function of

one test location(v,w)on a simple toy problem. In this problem,Y =₋X+ZwhereZ_{∼ N}(0,0.32)is an inde-pendent noise variable. As we consider only one location

(J = 1),Σˆ(v,w)is a scalar. The statistic can be written asλnˆ =n(µˆxy(v,w)−\µxµy(v,w))

2

ˆ

Σ(v,w) . These components are shown in Figure1, where we use Gaussian kernels for both

X andY, and the horizontal and vertical axes correspond tov_∈Randw∈R, respectively.

Intuitively,u(ˆ v,w) = ˆµxy(v,w)₋µxµy([ v,w)captures

the difference of the joint distribution and the product of the marginals as a function of (v,w). Squaringu(ˆ v,w)

and dividing it by the variance shown in Figure1cgives the statistic (also the parameter tuning objective) shown in Fig-ure1d. The latter figure illustrates that the parameter tuning objective function can be non-convex: non-convexity arises since there are multiple ways to detect the difference be-tween the joint distribution and the product of the marginals. In this case, the lower left and upper right regions equally indicate the largest difference. A convex objective would not be able to capture this phenomenon.

3. Experiments

In this section, we empirically study the performance of the proposed method on both toy (Section3.1) and real problems (Section 3.2). We are interested in challeng-ing problems requirchalleng-ing a large number of samples, where a quadratic-time test might be computationally infeasi-ble. Our goal is not to outperform a quadratic-time test with a linear-time test uniformly overalltesting problems. We will find, however, that our test does outperform the quadratic-time test in some cases. Code is available at

https://github.com/wittawatj/fsic-test. We compare the proposed NFSIC with optimization (NFSIC-opt) to five multivariate nonparametric tests. TheNFSIC\2 test without optimization (NFSIC-med) acts as a baseline, allowing the effect of parameter optimization to be clearly

seen. For pedagogical reason, we consider the original HSIC test ofGretton et al.(2005) denoted by QHSIC, which is a quadratic-time test. Nyström HSIC (NyHSIC) uses a Nys-tröm approximation to the kernel matrices ofXandY when computing the HSIC statistic. FHSIC is another variant of HSIC in which a random Fourier feature approximation (Rahimi & Recht,2008) to the kernel is used. NyHSIC and FHSIC are studied inZhang et al.(2017) and can be com-puted in_O(n), with quadratic dependency on the number of inducing points in NyHSIC, and quadratic dependency on the number of random features in FHSIC. Finally, the Randomized Dependence Coefficient (RDC) proposed in

Lopez-Paz et al.(2013) is also considered. The RDC can be seen as the primal form (with random Fourier features) of the kernel canonical correlation analysis ofBach & Jordan

(2002) on copula-transformed data. We consider RDC as a linear-time test even though preprocessing by an empirical copula transform costs_O((dx+dy)nlogn).

We use Gaussian kernel classes _Kg and Lg for both X and Y in all the methods. Except NFSIC-opt, all other tests use full sample to conduct the independence test, where the Gaussian widths σx and σy are set ac-cording to the widely used median heuristic i.e., σx = median (_{kxi−xjk2|1≤i < j ≤n}), andσy is set in the same way using_{yi}ni=1. TheJlocations for NFSIC-med are randomly drawn from the standard multivariate normal distribution in each trial. For a sample of sizen, NFSIC-opt uses half the sample for parameter tuning, and the other disjoint half for the test. We permute the sample 300 times in RDC2 _{and HSIC to simulate from the null}

distribution and compute the test threshold. The null distri-butions for FHSIC and NyHSIC are given by a finite sum of weightedχ2₍₁₎_{random variables given in Eq. 8 of}_Zhang

et al.(2017). Unless stated otherwise, we set the test thresh-old of the two NFSIC tests to be the(1₋α)-quantile of

χ2(J). To provide a fair comparison, we setJ = 10, use 10 inducing points in NyHSIC, and 10 random Fourier features in FHSIC and RDC.

Optimization of NFSIC-optThe parameters of NFSIC-opt areσx, σy,andJlocations of size(dx+dy)J. We treat all

the parameters as a long vector inR2+(dx+dy)J_{and use}

gra-dient ascent to optimizeˆλn/2. We observe that initializing VJby randomly pickingJpoints from the training sample

yields good performance. The regularization parameterγn

in NFSIC is fixed to a small value, and is not optimized. It is worth emphasizing that the complexity of the optimization procedure is still linear-time.3

2

We use a permutation test for RDC, following the au-thors’ implementation (https://github.com/lopezpaz/

randomized_dependence_coefficient, referred

com-mit: b0ac6c0).

3

Our claim on linear runtime (with respect to n) is for the gradient ascent procedure to find a local optimum forθ. We do not

(7)

100 200 dxanddy 100 101 102 Time (s) (a) SG(α= 0.05) 50 100 150 200 250 dxanddy 0.04 0.06 T yp e-I error (b) SG(α= 0.05) 1 2 3 4 5 6

ωin 1 + sin(ωx) sin(ωy) 0.0 0.5 1.0 T est p ow er (c) Sin 1 2 3 4 5 6 dx 0.0 0.5 1.0 T est p ow er 1 2 3 4 5 6 dx 0.0 0.5 1.0 T est p ow er NFSIC-opt_NFSIC-med QHSIC NyHSIC FHSIC RDC (d) GSign

Figure 2: (a): Runtime. (b): Probability of rejectingH0as problem parameters vary. Fixn= 4000. Since FSIC, NyHFSIC and RDC rely on a

finite-dimensional kernel approximation, these tests are consistent only if both the number of features increases withn. By constrast, the proposed NFSIC requires onlynto go to in-finity to achieve consistency i.e.,Jcan be fixed. We refer the reader to AppendixCfor a brief investigation of the test power vs. increasingJ. The test power does not necessarily monotonically increase withJ.

3.1. Toy Problems

We consider three toy problems.

1. Same Gaussian (SG).The two variables are indepen-dently drawn from the standard multivariate normal distri-bution i.e.,X _{∼ N}(0,Idx)andY ∼ N(0,Idy)whereId

is thed_×didentity matrix. This problem represents a case in whichH0holds.

2. Sinusoid (Sin).Letpxybe the probability density ofPxy.

In the Sinusoid problem, the dependency ofXandY is char-acterized by(X, Y)_∼pxy(x, y)∝1 + sin(ωx) sin(ωy),

where the domains of_X,_Y = (₋π, π)andω is the fre-quency of the sinusoid. As the frefre-quencyωincreases, the drawn sample becomes more similar to a sample drawn fromUniform((₋π, π)2₎_{. That is, the higher}_ω_{, the harder} to detect the dependency betweenXandY. This problem was studied inSejdinovic et al.(2013). Plots of the density for a few values ofωare shown in Figures6and7in the appendix. The main characteristic of interest in this problem is the local change in the density function.

3. Gaussian Sign (GSign). In this problem, Y = |Z_|Qdx

i=1sgn(Xi),where X ∼ N(0,Idx), sgn(·) is the

sign function, andZ_{∼ N}(0,1)serves as a source of noise. The full interaction ofX= (X1, . . . , Xdx)is what makes

the problem challenging. That is,Y is dependent on X, yet it is independent of any proper subset of_{X1, . . . , Xd}. Thus, simultaneous consideration of all the coordinates of

Xis required to successfully detect the dependency. We fixn= 4000and vary the problem parameters. Each problem is repeated for 300 trials, and the sample is redrawn each time. The significance levelαis set to 0.05. The

re-claim a linear runtime to find a global optimum.

sults are shown in Figure2. It can be seen that in the SG problem (Figure2b) whereH0holds, all the tests achieve roughly correct type-I errors atα = 0.05. In particular, we point out that NFSIC-opt’s rejection rate is well con-trolled as the sample used for testing and the sample used for parameter tuning are independent. The rejection rate would have been much higher had we done the optimization and testing on the same sample (i.e., overfitting). In the Sin problem, NFSIC-opt achieves high test power for all consideredω= 1, . . . ,6, highlighting its strength in detect-ing local changes in the joint density. The performance of NFSIC-med is significantly lower than that of NFSIC-opt. This phenomenon clearly emphasizes the importance of the optimization to place the locations at the relevant regions in

X ×Y. RDC has a remarkably high performance in both Sin and GSign (Figure2c,2d) despite no parameter tuning. The ability to simultaneously consider interacting features of NFSIC-opt is indicated by its superior test power in GSign, especially at the challenging settings ofdx= 5,6.

NFSIC vs. QHSIC.We observe that NFSIC-opt outper-forms the quadratic-time QHSIC in these two problems. QHSIC is defined as the RKHS norm of the witness func-tionu(see (2)). Intuitively, one can think of the RKHS norm as taking into account all the locations(v,w). By contrast, the proposed NFSIC evaluates the witness function atJlocations. If the differences inpxyandpxpyare local (e.g., Sin problem), or there are interacting features (e.g., GSign problem), then only small regions in the space of

(X, Y)are relevant in detecting the difference ofpxyand

pxpy. In these cases, pinpointing exact test locations by the optimization of NFSIC performs well. On the other hand, taking into account all possible test locations as done implicitly in QHSIC also integrates over regions where the difference between pxy andpxpy is small, resulting in a

weaker indication of dependence. Whether QHSIC is better than NFSIC depends heavily on the problem, and there is no one best answer. If the difference betweenpxyandpxpy

is large only in localized regions, then the proposed linear time statistic has an advantage. If the difference is spatially diffuse, then QHSIC has an advantage. No existing work has proposed a procedure to optimally tune kernel param-eters for QHSIC; by contrast, NFSIC has a clearly defined objective for parameter tuning.

(8)

103 ₁₀4 ₁₀5 Sample sizen 100 102 Time (s) (a) SG.dx=dy= 250. 104 ₁₀5 Sample sizen 0.04 0.06 T yp e-I error (b) SG.dx=dy= 250. 103 ₁₀4 ₁₀5 Sample sizen 0.0 0.5 1.0 T est p ow er (c) Sin.ω= 4. 103 ₁₀4 ₁₀5 Sample sizen 0.0 0.5 1.0 T est p ow er 1 2 3 4 5 6 dx 0.0 0.5 1.0 T est p ow er NFSIC-opt_NFSIC-med QHSIC NyHSIC FHSIC RDC (d) GSign.dx= 4.

Figure 3: (a) Runtime. (b): Probability of rejectingH0asnincreases in the toy problems. To investigate the sample efficiency of all the tests, we fix

dx=dy = 250in SG,ω= 4in Sin,dx= 4in GSign, and

increasen. Figure3shows the results. The quadratic depen-dency onnin QHSIC makes it infeasible both in terms of memory and runtime to considernlarger than 6000 (Fig-ure3a). By constrast, although not the most time-efficient, NFSIC-opt has the highest sample-efficiency for GSign, and for Sin in the low-sample regime, significantly outperform-ing QHSIC. Despite the small additional overhead from the optimization, we are yet able to conduct an accurate test withn = 105_{, dx} ₌_dy _{= 250}_{in less than}₁₀₀ _seconds. We observe in Figure3bthat the two NFSIC variants have correct type-I errors across all sample sizes. We recall from Theorem4that the NFSIC test with random test locations will asymptotically rejectH0if it is false. A demonstration of this property is given in Figure3c, where the test power of NFSIC-med eventually reaches 1 withnhigher than105_.

3.2. Real Problems

We now examine the performance of our proposed test on real problems.

Million Song Data (MSD) We consider a subset of the Million Song Data4₍_{Bertin-Mahieux et al.}_,₂₀₁₁_{), in which}

each song(X)out of 515,345 is represented by 90 features, of which 12 features are timbre average (over all segments) of the song, and 78 features are timbre covariance. Most of the songs are western commercial tracks from 1922 to 2011. The goal is to detect the dependency between each song and its year of release(Y). We setα = 0.01, and repeat for 300 trials where the full sample is randomly subsampled ton points in each trial. Other settings are the same as in the toy problems. To make sure that the type-I error is correct, we use the permutation approach in the NFSIC tests to compute the threshold. Figure 4bshows the test powers asnincreases from 500 to 2000. To simulate the case whereH0holds in the problem, we permute the sample to break the dependency ofXandY. The results are shown in Figure5in the appendix.

Evidently, NFSIC-opt has the highest test power among all 4

Million Song Data subset: https://archive.ics.

uci.edu/ml/datasets/YearPredictionMSD. 500 1000 1500 2000 Sample sizen 0.00 0.01 0.02 T yp e-I error

NFSIC-opt NFSIC-med QHSIC NyHSIC FHSIC RDC

500 1000 1500 2000 Sample sizen 0.5 1.0 T est p ow er (a) MSD problem. 2000 4000 6000 8000 Sample sizen 0.5 1.0 T est p ow er

(b) Videos & Captions problem.

Figure 4: Probability of rejectingH0asnincreases in the two real problems.α= 0.01.

the linear-time tests for all the sample sizes. Its test power is second to only QHSIC. We recall that NFSIC-opt uses half of the sample for parameter tuning. Thus, atn= 500, the actual sample for testing is 250, which is relatively small. The fact that there is a vast power gain from 0.4 (NFSIC-med) to 0.8 (NFSIC-opt) atn= 500suggests that the optimization procedure can perform well even at a lower sample sizes.

Videos and CaptionsOur last problem is based on the VideoStory46K5 _{dataset (}_{Habibian et al.}_, ₂₀₁₄_). _The

dataset contains 45,826 Youtube videos (X) of an aver-age length of roughly one minute, and their corresponding text captions (Y)uploaded by the users. Each video is represented as adx= 2000dimensional Fisher vector en-coding of motion boundary histograms (MBH) descriptors ofWang & Schmid(2013). Each caption is represented as a bag of words with each feature being the frequency of one word. After filtering only words which occur in at least six video captions, we obtaindy = 1878words. We

examine the test powers asnincreases from2000to8000. The results are given in Figure 4. The problem is suffi-ciently challenging that all linear-time tests achieve a low power atn= 2000. QHSIC performs exceptionally well on this problem, achieving a maximum power throughout. NFSIC-opt has the highest sample efficiency among the linear-time tests, showing that the optimization procedure is also practical in a high dimensional setting.

5

VideoStory46K dataset:https://ivi.fnwi.uva.nl/

(9)

Acknowledgement

We thank the Gatsby Charitable Foundation for the financial support. The major part of this work was carried out while Zoltán Szabó was a research associate at the Gatsby Com-putational Neuroscience Unit, University College London.

References

Anderson, Theodore W. An Introduction to Multivariate Statistical Analysis. Wiley, 2003.

Bach, Francis R. and Jordan, Michael I. Kernel indepen-dent component analysis. Journal of Machine Learning Research, 3:1–48, 2002.

Bertin-Mahieux, Thierry, Ellis, Daniel P.W., Whitman, Brian, and Lamere, Paul. The million song dataset. In International Conference on Music Information Retrieval (ISMIR), 2011.

Chwialkowski, Kacper P., Ramdas, Aaditya, Sejdinovic, Dino, and Gretton, Arthur. Fast Two-Sample Testing with Analytic Representations of Probability Measures. InAdvances in Neural Information Processing Systems (NIPS), pp. 1981–1989. 2015.

Dauxois, Jacques and Nkiet, Guy Martial. Nonlinear canon-ical analysis and independence tests.The Annals of Statis-tics, 26(4):1254–1278, 1998.

Feuerverger, Andrey. A consistent test for bivariate depen-dence. International Statistical Review, 61(3):419–433, 1993.

Fukumizu, Kenji, Gretton, Arthur, Sun, Xiaohai, and Schölkopf, Bernhard. Kernel measures of conditional dependence. InAdvances in Neural Information Process-ing Systems (NIPS), pp. 489–496, 2008.

Gretton, Arthur. A simpler condition for consistency of a kernel independence test. Technical report, 2015. URL

http://arxiv.org/abs/1501.06103.

Gretton, Arthur and Györfi, László. Consistent nonparamet-ric tests of independence. Journal of Machine Learning Research, 11:1391–1423, 2010.

Gretton, Arthur, Bousquet, Olivier, Smola, Alex, and Schölkopf, Bernhard. Measuring Statistical Dependence with Hilbert-Schmidt Norms. InAlgorithmic Learning Theory (ALT), pp. 63–77. 2005.

Gretton, Arthur, Fukumizu, Kenji, Teo, Choon H., Song, Le, Schölkopf, Bernhard, and Smola, Alex J. A Kernel Statistical Test of Independence. InAdvances in Neural Information Processing Systems (NIPS), pp. 585–592. 2008.

Habibian, Amirhossein, Mensink, Thomas, and Snoek, Cees GM. Videostory: A new multimedia embedding for few-example recognition and translation of events. In ACM International Conference on Multimedia, pp. 17–26, 2014.

Heller, Ruth, Heller, Yair, Kaufman, Shachar, Brill, Barak, and Gorfine, Malka. Consistent distribution-free k -sample and independence tests for univariate random variables. Journal of Machine Learning Research, 17 (29):1–54, 2016.

Huo, Xiaoming and Székely, Gábor J. Fast computing for distance covariance.Technometrics, 58(4):435–447, 2016.

Jitkrittum, Wittawat, Szabó, Zoltán, Chwialkowski, Kacper, and Gretton, Arthur. Interpretable Distribution Features with Maximum Testing Power. InAdvances in Neural Information Processing Systems (NIPS), pp. 181–189. 2016.

Kowalski, Jeanne and Tu, Xin M. Modern Applied U-Statistics. John Wiley & Sons, 2008.

Krantz, Steven G. and Parks, Harold R. A Primer of Real Analytic Functions. Springer Science & Business Media, 2002.

Lehmann, Eric L. Elements of Large-Sample Theory. Springer Science & Business Media, 1999.

Lopez-Paz, David, Hennig, Philipp, and Schölkopf, Bern-hard. The Randomized Dependence Coefficient. In Ad-vances in Neural Information Processing Systems (NIPS), pp. 1–9. 2013.

Mityagin, Boris. The Zero Set of a Real Analytic Function. December 2015. arXiv: 1512.07276.

Rahimi, Ali and Recht, Benjamin. Random features for large-scale kernel machines. InAdvances in Neural In-formation Processing Systems (NIPS), pp. 1177–1184. 2008.

Sejdinovic, Dino, Sriperumbudur, Bharath, Gretton, Arthur, and Fukumizu, Kenji. Equivalence of distance-based and RKHS-based statistics in hypothesis testing. The Annals of Statistics, 41(5):2263–2291, 2013.

Serfling, Robert J.Approximation Theorems of Mathemati-cal Statistics. John Wiley & Sons, 2009.

Smola, Alex, Gretton, Arthur, Song, Le, and Schölkopf, Bernhard. A Hilbert space embedding for distributions. In International Conference on Algorithmic Learning Theory (ALT), pp. 13–31, 2007.

(10)

Sriperumbudur, Bharath K., Gretton, Arthur, Fukumizu, Kenji, Schölkopf, Bernhard, and Lanckriet, Gert R. G. Hilbert Space Embeddings and Metrics on Probability Measures. Journal of Machine Learning Research, 11: 1517–1561, 2010.

Steinwart, Ingo and Christmann, Andreas.Support vector machines. Springer Science & Business Media, 2008. Székely, Gábor J. and Rizzo, Maria L. Brownian distance

covariance.The Annals of Applied Statistics, 3(4):1236– 1265, 2009.

Székely, Gábor J., Rizzo, Maria L., and Bakirov, Nail K. Measuring and testing dependence by correlation of dis-tances.The Annals of Statistics, 35(6):2769–2794, 2007. van der Vaart, Aad. Asymptotic Statistics. Cambridge

Uni-versity Press, 2000.

Wang, Heng and Schmid, Cordelia. Action recognition with improved trajectories. InIEEE International Conference on Computer Vision (ICCV), pp. 3551–3558, 2013. Zhang, Kun, Peters, Jonas, Janzing, Dominik, and

Schölkopf, Bernhard. Kernel-based conditional indepen-dence test and application in causal discovery. In Confer-ence on Uncertainty in Artificial IntelligConfer-ence (UAI), pp. 804–813, 2011.

Zhang, Qinyi, Filippi, Sarah, Gretton, Arthur, and Sejdi-novic, Dino. Large-Scale Kernel Methods for Indepen-dence Testing. Statistics and Computing, pp. 1–18, 2017.

(11)

Supplementary Material

A. Type-I Errors

In this section, we show that all the tests have correct type-I errors (i.e., the probability of rejectH0when it is true) in real problems. We permute the joint sample so that the dependency is broken to simulate cases in whichH0holds. The results are shown in Figure5.

500 1000 1500 2000 Sample sizen 0.00 0.01 0.02 T yp e-I error

NFSIC-opt NFSIC-med QHSIC NyHSIC FHSIC RDC

500 1000 1500 2000 Sample sizen 0.00 0.01 0.02 T yp e-I error

(a) MSD problem (permuted sample).

2000 4000 6000 8000 Sample sizen 0.01 0.02 T yp e-I error

(b) Videos & Captions problem (permuted sample).

Figure 5: Probability of rejectingH0asnincreases.α= 0.01.

B. Redundant Test Locations

Here, we provide a simple illustration to show that two locationst1 = (v1,w1)andt2 = (v2,w2)which are too close to each other will reduce the optimization objective. We consider the Sinusoid problem described in Section3.1with

ω= 1, and useJ = 2test locations. In Figure6,t1is fixed at the red star, whilet2is varied along the horizontal line. The objective valueˆλnas a function oft2is shown in the bottom figure. It can be seen thatλnˆ decreases sharply whent2is in the neighborhood oft1. This property implies that two locations which are too close will not maximize the objective function (i.e., the second feature contains no additional information when it matches the first). ForJ >2, the objective sharply decreases if any two locations are in the same neighborhood.

−

2.5

0.0

2.5 x

−

2.5

0.0

2.5 y

ω

= 1

.

00

0.00

0.25

0.50

0.75

1.00

1.25

1.50

1.75

2.00

x −2 0 2 y −2 0 2 t2 175 200 225 250 ˆλ(n t1 , t2 )

Figure 6: Plot of optimization objective values as locationt2moves along the green line. The objective sharply drops when the two locations are in the same neighborhood.

C. Test Power vs.

J

It might seem intuitive that as the number of locationsJ increases, the test power should also increase. Here, we empirically show that this statement isnotalways true. Consider the Sinusoid toy example described in Section3.1withω= 2(also see the left figure of Figure7). By construction,XandY are dependent in this problem. We run NFSIC test with a sample size ofn= 800, varyingJ from1to600. For each value ofJ, the test is repeated for 500 times. In each trial, the sample is redrawn and theJ test locations are drawn fromUniform((₋π, π)2₎_{. There is no optimization of the test locations. We use} Gaussian kernels for bothXandY, and use the median heuristic to set the Gaussian widths to 1.8. Figure7shows the test power asJ increases.

(12)

−

2 .

5

0 .

0

2 .

5 x

−

2 .

5

0 .

0

2 .

5 y

ω

= 2.00

0 .

00

0 .

25

0 .

50

0 .

75

1 .

00

1 .

25

1 .

50

1 .

75

2 .

00

10

200

400

600 J

0 .

5

1 .

0 T

est

p

ow

er

Figure 7: The Sinusoid problem and the plot of test power vs. the number of test locations.

We observe that the test power does not monotonically increase asJincreases. WhenJ = 1, the difference ofpxyandpxpy

cannot be adequately captured, resulting in a low power. The power increases rapidly to roughly 0.6 atJ = 10, and stays at 1 until aboutJ = 100. Then, the power starts to drop sharply whenJ is higher than400in this problem.

Unlike random Fourier features, the number of test locations in NFSIC is not the number of Monte Carlo particles used to approximate an expectation. There is a tradeoff: if the test locations are in key regions (i.e., regions in which there is a big difference betweenpxyandpxpy), then they increase power; yet the statistic gains in variance (thus reducing test power) as

J increases. As can be seen in Figure7, there are eight key regions (in blue) that can reveal the difference ofpxyandpxpy. Using an unnecessarily highJnot only makes the covariance matrixΣˆ harder to estimate accurately, it also increases the computation as the complexity onJ is_O(J3).

We note that NFSIC is not intended to be used with a largeJ. In practice, it should be set to be large enough so as to capture the key regions as stated. As a practical guide, with optimization of the test locations, a good starting point isJ = 5or10.

D. Proof of Theorem

4

Recall Theorem4,

Theorem 4(Independence test based onNFSIC\2_{is consistent)}_. _Let_Σˆ _{be a consistent estimate of}_Σ_{based on the joint}

sampleZn, whereΣis defined in Proposition3. Assume thatVJ={(vi,wi)}Ji=1∼ηwhereηis absolutely continuous wrt the Lebesgue measure. TheNFSIC\2_{statistic is defined as}ˆ_λ

n :=nˆu> ˆ Σ+γnI −1 ˆ uwhereγn≥0is a regularization

parameter. Assume that 1. AssumptionAholds.

2. Σis invertibleη-almost surely. 3. limn→∞γn= 0.

Then, for anyk, landVJ satisfying the assumptions,

1. UnderH0,ˆλn _→d χ2(J)asn_{→ ∞}. 2. UnderH1, for anyr_∈R,limn→∞P

ˆ

λn _≥r= 1η-almost surely. That is, the independence test based onNFSIC\2 is consistent.

Proof. Assume thatH0holds. The consistency ofΣˆ and the continuous mapping theorem imply that

ˆ

Σ+γnI

−1 _p

→Σ−1

which is a constant. Letabe a random vector inRJfollowingN(0,Σ). Byvan der Vaart(2000, Theorem 2.7 (v)), it follows

that √_nˆ_u_, ˆ Σ+γnI −1 d →

a,Σ−1whereu= 0almost surely by Proposition2, and√nˆu_{→ N}d (0,Σ)by Proposition

3. Sincef(x,S) :=x>Sxis continuous,f √_nˆ_u_, ˆ Σ+γnI −1 _d →f(a,Σ−1). Equivalently,nˆu>Σˆ +γnI −1 ˆ u_→d a>Σ−1a_∼χ2_(J₎_by_Anderson₍₂₀₀₃_{, Theorem 3.3.3). This proves the first claim.}

(13)

The proof of the second claim has a very similar structure to the proof of Proposition 2 ofChwialkowski et al.(2015). Assume that H1 holds. Then, u 6= 0 almost surely by Proposition 2. Since k andl are bounded, it follows that

|ht(z,z0)| ≤2BkBlfor anyz,z0(see (8)), and we have thatuˆ

a.s.

→ ubySerfling(2009, Section 5.4, Theorem A). Thus,

ˆ

u>_Σˆ ₊_γ_n_I−1_u_ˆ₋r n

d

→u>_Σ−1_u_{by the continuous mapping theorem, and the consistency of}_Σ_ˆ_{. Consequently,}

lim n→∞P ˆ λn_≥r = 1₋ lim n→∞P ˆ u>Σˆ +γnI −1 ˆ u₋ r n <0 (a) = 1₋_P u>Σ−1u<0(b) = 1,

where at(a)we use the Portmanteau theorem (van der Vaart,2000, Lemma 2.2 (i)) guaranteeing thatxn d

→xif and only if

P(xn< t)→P(x < t)for all continuity points oft7→P(x < t). Step(b)is justified by noting that the covariance matrix

Σis positive definite so thatu>Σ−1_u_>₀_{, and}_t

7→P(u>Σ−1u< t)(a step function) is continuous at0.

E. Proof of Theorem

6

Recall Theorem6,

Theorem 6(A lower bound on the test power). LetNFSIC2(X, Y) :=λn:=nu>Σ−1u. Let_Kbe a kernel class fork,_L be a kernel class forl, and_Vbe a collection with each element being a set ofJ locations. Assume that

1. There exist finiteBkandBlsuch thatsupk∈Ksupx,x0_∈X|k(x,x0)| ≤Bkandsupl∈Lsupy,y0_∈Y|l(y,y0)| ≤Bl.

2. ˜c:= sup_k_∈Ksup_l_∈Lsup_V_J_∈V_kΣ−1_kF <∞.

Then, for anyk_{∈ K}, l_{∈ L}, VJ_{∈ V}, andλn_≥r, the test power satisfiesP

ˆ λn_≥r_≥L(λn)where L(λn) = 1−62e−ξ1γ 2 n(λn−r)2/n₋_2e−b0.5nc(λn−r)2/[ξ2n2] −2e−[(λn−r)γn(n−1)/3−ξ3n−c3γ2nn(n−1)] 2 /[ξ4n2(n−1)]_, b·cis the floor function,ξ1 := ₃2_c21

1J2B∗

,B∗is a constant depending on onlyBk andBl,ξ2:= 72c22J B2,B :=BkBl,

ξ3 := 8c1B2J,c3 := 4B2J˜c2,ξ4 := 28B4J2c21,c1 := 4B2J

√

J˜c, andc2 := 4B

√

Jc˜. Moreover, for sufficiently large fixedn,L(λn)is increasing inλn.

Overview of the proof We first derive a probabilistic bound for_|λnˆ ₋λn_|/n. The bound is in turn upper bounded by an expression involving_kuˆ₋u_k2andkΣˆ −ΣkF. The differencekuˆ−uk2can be bounded by applying the bound for U-statistics given inSerfling(2009, Theorem A, p. 201). For_kΣˆ₋Σ_kF, we decompose it into a sum of smaller components,

and bound each term with a product variant of the Hoeffding’s inequality (Lemma8).L(λn)is obtained by combining all the bounds with the union bound.

E.1. Notations

Let_hA,B_i_F := tr(A>_B₎_{denote the Frobenius inner product, and}_k_A_k F :=

p

tr(A>_A₎_{be the Frobenius norm. Write}

z:= (x,y)to denote a pair of points from_{X × Y}. We writet:= (v,w)to denote a pair of test locations from_{X × Y}. For brevity, an expectation over(x,y)(i.e.,E(x,y)∼Pxy) will be written asEzorExy. Define

˜

k(x,v) :=k(x,v)₋_Ex0k(x0,v),

and˜l(y,w) :=l(y,w)₋_Ey0l(y0,w). LetB₂(r) :={x| kxk₂≤r}be a closed ball with radiusrcentered at the origin.

Similarly, defineBF(r) :={A| kAkF ≤r}to be a closed ball with radiusrofJ×Jmatrices under the Frobenius norm.

Denote the max operation by(x1, . . . , xm)+= max(x1, . . . , xm).

For a product of marginal mean embeddingsµx(v)µy(w), we writeµxµy(_[ v,w) := _n₍_n1₋₁₎Pn

i=1 P

j6=ik(xi,v)l(yj,w)

to denote the unbiased plug-in estimator, and writeµx(ˆ v)ˆµy(w) := _n1Pn

i=1k(xi,v) 1 n Pn j=1l(yj,w)which is a biased estimator. Defineuˆb₍_v_,_w_{) := ˆ}_µ xy(v,w)−µˆx(v)ˆµy(w)so thatuˆb := ˆub(t1), . . . ,uˆb(tJ) >

where the superscriptb

(14)

E.2. Proof

We will first derive a bound for_P(_|ˆλn₋λn_{| ≥}t), which will then be reparametrized to get a bound for the target quantity

P(ˆλn ≥r). We closely follow the proof inJitkrittum et al.(2016, Section C.1) up to (12), then we diverge. We start by

considering_|ˆλn−λn|/n. |ˆλn₋λn_|/n= uˆ >_{( ˆ}_Σ₊_γn_I₎−1_u_ˆ −u>Σ−1u = ˆ u>Σˆ +γnI −1 ˆ u₋u>(Σ+γnI) −1 u+u>(Σ+γnI) −1 u₋u>Σ−1u ≤ ˆ u>Σˆ +γnI −1 ˆ u₋u>(Σ+γnI) −1 u + u >₍_Σ₊_γ nI) −1 u₋u>Σ−1u := (_F)₁+ (_F)₂.

We next bound(_F1)and(_F2)separately.

(_F)1= ˆ u>Σˆ +γnI −1 ˆ u₋u>(Σ+γnI)−1u = ˆ u>Σˆ +γnI −1 ˆ u₋uˆ>(Σ+γnI)−1uˆ+ û>(Σ+γnI)−1uˆ₋u>(Σ+γnI)−1u ≤ ˆ u>Σˆ +γnI −1 ˆ u₋uˆ>(Σ+γnI)− 1 ˆ u + uˆ >₍_Σ₊_γ nI)− 1 ˆ u₋u>(Σ+γnI)− 1 u = ˆ uuˆ>,Σˆ +γnI −1 −(Σ+γnI) −1 F + D ˆ uuˆ>₋uu>,(Σ+γnI) −1E F ≤ kuûˆ>_kFk( ˆΣ+γnI)−1−(Σ+γnI)−1kF +kuûˆ>−uu>kFk(Σ+γnI)−1kF =_kuûˆ>_kFk( ˆΣ+γnI)−1[(Σ+γnI)−( ˆΣ+γnI)](Σ+γnI)−1kF+kuûˆ>−uuˆ >+ ûu>−uu>kFk(Σ+γnI)−1kF (a) ≤ kuûˆ>_kFk( ˆΣ+γnI)−1kFkΣ−ΣˆkFkΣ−1kF+kuûˆ>−uuˆ >+ ûu>−uu>kFkΣ−1kF (b) ≤ √ J γn kuˆk 2 2kΣ−ΣˆkFkΣ−1kF+ kuˆ(û−u)>kF +k(û−u)u>kF kΣ−1_kF ≤ √ J γn kuˆk 2 2kΣ−ΣˆkFkΣ−1kF+ (kuˆk2+kuk2)kuˆ−uk2kΣ−1kF, (5)

where at(a)we used_k(Σ+γnI)−1kF ≤ kΣ−1kF, at(b)we usedk( ˆΣ+γnI)−1kF ≤ √ J_k( ˆΣ+γnI)−1k2≤ √ J /γn. For(_F)2, we have (_F)2= u >₍_Σ₊_γn_I₎−1 u₋u>Σ−1u = uu>,(Σ+γnI)−1₋Σ−1_F ≤ kuu>_kFk(Σ+γnI)−1−Σ−1kF =_ku_k2₂_k(Σ+γnI)−1[Σ₋(Σ+γnI)]Σ−1_kF ≤γn_ku_k2₂_k(Σ+γnI)−1_kFkΣ−1kF (a) ≤ γnkuk22kΣ −1 k2 F, (6)

where at(a)we used_k(Σ+γnI)−1_kF ≤ kΣ−1kF.

Combining (5) and (6), we have uˆ >_{( ˆ}_Σ₊_γn_I₎−1_u_ˆ −u>Σ−1u

(15)

≤ √ J γn k ˆ u_k2_kΣ₋Σˆ_kFkΣ−1kF+ (kuˆk2+kuk2)kuˆ−uk2kΣ−1kF+γnkuk22kΣ −1 k2 F. (7) Bounding_kuˆ_k2

2andkuk22 Here, we show that by the boundedness of the kernelskandl, it follows thatkuˆk22is bounded. Recall that sup_x_,_x0_∈X|k(x,x0)| ≤ Bk, sup_y_,_y0|l(y,y0)| ≤ Bl, our notationt = (v,w) for the test locations, and

zi:= (xi,yi). We first show that the U-statistic corehis bounded. |ht((x,y),(x0,y0))|= 1 2(k(x,v)−k(x 0_,_v_))(l(_y_,_w₎ −l(y0,w)) ≤1₂(_|k(x,v)_|+_|k(x0,v)_|) (_|l(y,w)_|+_|l(y0,w)_|) ≤2BkBl:= 2B, (8)

where we defineB :=BkBl. It follows that

kuˆ_k2₂= J X m=1   2 n(n₋1) X i<j htm(zi,zj)   2 ≤ J X m=1 [2BkBl]2= 4B2J, (9) ku_k2₂= J X m=1 [_EzEz0h_t m(z,z 0_)]2 ≤4B2J. (10)

Using the upper bounds on_kuˆ_k2

2,kuk22,(7) and the definition of˜c, we have uˆ >_{( ˆ}_Σ₊_γn_I₎−1_ˆ u₋u>Σ−1u ≤ √ J γn 4B2J˜c_kΣ₋Σˆ_kF + 4B √ Jc˜_kuˆ₋u_k2+ 4B2J˜c2γn =: c1 γnkΣ− ˆ Σ_kF+c2kuˆ−uk2+c3γn, (11) where we definec1:= 4B2J √ J˜c,c2:= 4B √

J˜c, andc3:= 4B2J˜c2. This upper bound implies that

|ˆλn₋λn_{| ≤} c1 γnnkΣ−

ˆ

Σ_kF+c2nkuˆ−uk2+c3nγn. (12) We will separately upper bound_kΣ₋Σˆ_kF andkuˆ−uk2, and combine them with a union bound.

E.2.1. BOUNDINGkuˆ₋u_k2

Lett∗= arg maxt∈{t1,...,tJ}|u(ˆ t)−u(t)|. Recall thatu= (u(t1), . . . , u(tJ)) >_{= (u} 1, . . . , uJ)>. kuˆ₋u_k2= sup b∈B2(1) hb,uˆ₋u_i₂_≤ sup b∈B2(1) J X j=1 |bj||u(ˆ tj)−u(tj)| ≤ |u(ˆ t∗)₋u(t∗)_| sup b∈B2(1) J X j=1 |bj_| (a) ≤ √J_|u(ˆ t∗)₋u(t∗)_| sup b∈B2(1) kb_k2 =√J_|u(ˆ t∗)₋u(t∗)_|, (13) where at(a)we used_ka_k1 ≤

√

J_ka_k2for anya ∈RJ. From (13), it can be seen that boundingkuˆ−uk2amounts to bounding the difference of a U-statisticu(ˆ t∗)(see (4)) to its expectationu(t∗). Combining (13) and (12), we have

|_λnˆ ₋_λn_{| ≤} c1 γnnkΣ− ˆ Σ_kF+c2n √ J_|u(ˆ t∗)₋u(t∗)_|+c3nγn. (14)

(16)

E.2.2. BOUNDINGkΣˆ ₋Σ_kF

The plan is to writeΣˆ = ˆS₋uˆbuˆb>,Σ =S₋uu>,so that_kΣˆ ₋Σ_kF ≤ kSˆ−SkF+kuˆbuˆb>−uu>kF and bound

separately_kSˆ₋S_kFandkuˆbuˆb>−uu>kF.

Recall thatΣij =η(ti,tj), η(t,t0) =Exy[ ˜k(x,v)˜l(y,w)−u(v,w) ˜k(x,v0)˜l(y,w0)−u(v0,w0)]wherek(˜ x,v) =

k(x,v)₋Ex0k(x0,v), and˜l(y,w) =l(y,w)−_E_y0l(y0,w). Its empirical estimator (see Proposition5) isΣijˆ = ˆη(ti,tj)

where ˆ η(t,t0) = 1 n n X i=1 [ k(xi,v)l(yi,w)−uˆb(v,w) k(xi,v0)l(yi,w0)−ˆub(v0,w0) ] = 1 n n X i=1 k(xi,v)l(yi,w)k(xi,v0)l(yi,w0)₋uˆb(v,w)ˆub(v0,w0), k(x,v) := k(x,v) ₋ 1 n Pn i=1k(xi,v), and l(y,w) := l(y,w) − 1_nP n i=1l(yi,w). We note that _n1Pn

i=1k(xi,v)l(yi,w) = uˆb(v,w). We define Sˆ ∈ RJ×J such that Sˆij :=

1

n

Pn

m=1k(xm,vi)l(ym,wi)k(xm,vj)l(yi,wj), and define similarly its population counterpart S such that Sij :=Exy[˜k(x,v)˜l(y,w)˜k(x,v0)˜l(y,w0)]. We have ˆ Σ= ˆS₋uˆbuˆb>, Σ=S₋uu>, kΣˆ ₋Σ_kF =kSˆ−S−(ˆubuˆb>−uu>)kF (15) ≤ kSˆ₋S_kF+kuˆbuˆb>−uu>kF. (16) With (16), (14) becomes |ˆλn₋λn_{| ≤} c1n γn k ˆ S₋S_kF+ c1n γn k ˆ ubuˆb>₋uu>_kF+c2n √ J_|u(ˆ t∗)₋u(t∗)_|+c3nγn. (17) We will further separately bound_kSˆ₋S_kF andkuˆbuˆb>−uu>kF.

E.2.3. BOUNDINGkuˆbuˆb>₋uu>_kF

kuˆbuˆb>₋uu>_kF =kuˆbuˆb>−uˆbu>+ ûbu>−uu>kF ≤ kuˆb(ûb₋u)>_kF+k(ûb−u)u>kF =_kuˆb_k2kuˆb−uk2+kuˆb−uk2kuk2

≤4B√J_kuˆb₋u_k2, where we used (10) and the fact that_kuˆb

k2≤2B

√

J which can be shown similarly to (9) as

kuˆb_k2₂= J X m=1 [ˆµxy(vm,wm)−µˆx(vm)ˆµy(wm)] 2 = J X m=1   1 n2 n X i=1 n X j=1 htm(zi,zj)   2 ≤ J X m=1 [2BkBl] 2 = 4B2J.

Let(˜v,w˜) := ˜t= arg maxt∈{t1,...,tJ}|uˆ b₍_t₎ −u(t)_|. We bound_kuˆb −u_k2by kuˆb₋u_k2 (a) ≤ √J_|uˆb(˜t)₋u(˜t)_| =√Jµˆxy(˜t)−µˆx(˜v)ˆµy( ˜w)−u(˜t) =√Jµˆxy(˜t)−µ[xµy(˜t) +µ[xµy(˜t)−µˆx(˜v)ˆµy( ˜w)−u(˜t) ≤√Jµˆxy(˜t)−µ[xµy(˜t)−u(˜t) + √ Jµ_[xµy(˜t)−µˆx(˜v)ˆµy( ˜w)

(17)

=√Ju(˜ˆ t)−u(˜t) + √ Jµ_[xµy(˜t)−µˆx(˜v)ˆµy( ˜w) , (18)

where at(a)we used the same reasoning as in (13). The biasµ_[xµy(˜t)−µˆx(˜v)ˆµy( ˜w)

in the second term can be bounded as µ_[xµy(˜t)−µˆx(˜v)ˆµy( ˜w) = 1 n(n₋1) n X i=1 X j6=i k(xi,v˜)l(yj,w˜)− 1 n2 n X i=1 n X j=1 k(xi,˜v)l(yj,w˜) = 1 n(n₋1) n X i=1 n X j=1 k(xi,v˜)l(yj,w˜)− 1 n(n₋1) n X i=1 k(xi,v˜)l(yi,w˜)₋ 1 n2 n X i=1 n X j=1 k(xi,v˜)l(yj,w˜) = 1₋ n n₋1 ₁ n2 n X i=1 n X j=1 k(xi,v˜)l(yj,w˜) + 1 n(n₋1) n X i=1 k(xi,v˜)l(yi,w˜) ≤ 1₋ n n₋1 1 n2 n X i=1 n X j=1 k(xi,v˜)l(yj,w˜) + 1 n(n₋1) n X i=1 k(xi,v˜)l(yi,w˜) ≤ _nB −1 + B n₋1 = 2B n₋1.

Combining this upper bound with (18), we have

kuˆbuˆb>₋uu>_kF ≤4BJ u(˜ˆ t)−u(˜t) + 8B2J n₋1. (19) With (19), (17) becomes |ˆλn₋λn_{| ≤}c1n γn k ˆ S₋S_kF+ 4BJ c1n γn u(˜ˆ t)−u(˜t) + c1n γn 8B2J n₋1 +c2n √ J_|u(ˆ t∗)₋u(t∗)_|+c3nγn. (20) E.2.4. BOUNDINGkSˆ₋S_kF

Recall that VJ = _{t1, . . . ,tJ}, Sijˆ = ˆS(ti,tj) = _n1P n

m=1k(xm,vi)l(ym,wi)k(xm,vj)l(ym,wj), and Sij = S(ti,tj) =Exy[˜k(x,vi)˜l(y,wi)˜k(x,vj)˜l(y,wj)]. Let(t(1),t(2)) = arg max(s,t)∈VJ×VJ|S(ˆ s,t)−S(s,t)|.

k_Sˆ₋_S_k_F ₌ _sup B∈BF(1) D B,Sˆ₋SE F ≤ sup B∈BF(1) J X i=1 J X j=1 |Bij||Sˆij−Sij| ≤ ˆ S(t(1),t(2))₋S(t(1),t(2)) sup B∈BF(1) J X i=1 J X j=1 |Bij_| (a) ≤ J ˆ S(t(1),t(2))₋S(t(1),t(2)) sup B∈BF(1) kB_kF =J ˆ S(t(1),t(2))₋S(t(1),t(2)) , (21)

where at(a)we usedPJ

i=1 PJ

j=1|Aij| ≤JkAkFfor any matrixA∈RJ×J. We arrive at |λnˆ ₋λn_{| ≤} c1J n γn ˆ S(t(1),t(2))₋S(t(1),t(2)) + 4BJ c1n γn u(˜ˆ t)−u(˜t) +c1n γn 8B2_J n₋1 +c2n √ J_|u(ˆ t∗)₋u(t∗)_|+c3nγn. (22)

(18)

E.2.5. BOUNDING ˆ S(t,t0₎₋_S(_t_,_t0₎

Having an upper bound for ˆ S(t,t0)₋S(t,t0)

will allow us to bound (22). To keep the notations uncluttered, we will define the following shorthands.

Expression Shorthand k(x,v) a k(x,v0) a0 k(xi,v) ai k(xi,v0) a0_i Ex∼Pxk(x,v) ˜a Ex∼Pxk(x,v 0₎ _˜_a0 1 n Pn i=1k(xi,v) a 1 n Pn i=1k(xi,v0) a 0 Expression Shorthand l(y,w) b l(y,w0) b0 l(yi,w) bi l(yi,w0) b0i Ey∼Pyl(y,w) ˜b Ey∼Pyl(y,w 0₎ ˜_b0 1 n Pn i=1l(yi,w) b 1 n Pn i=1l(yi,w0) b 0

We will also use _· to denote a empirical expectation over x, or y, or (x,y). The argument under _· will deter-mine the variable over which we take the expectation. For instance, aa0 ₌ 1

n Pn i=1k(xi,v)k(xi,v0) and aba0 = 1 n Pn

i=1k(xi,v)l(yi,w)k(xi,v0), and so on. We define in the same way for the population expectation usinge· i.e., f

aa0₌

Ex[k(x,v)k(x,v0)]andabag0=_Exy[k(x,v)l(y,w)k(x,v0)]. With these shorthands, we can rewriteS(ˆ t,t0)andS(t,t0)as

ˆ S(t,t0) = 1 n n X i=1

(ai₋a)(bi₋b)(a0_i₋a0)(b0_i₋b0), S(t,t0) =_Exy h (a₋a)(b˜ ₋˜b)(a0₋a˜0)(b0₋˜b0)i. By expandingS(t,t0), we have S(t,t0) =Exy

+aba0b0₋aba0˜b0₋abã0b0+abã0˜b0 −a˜ba0b0+a˜ba0˜b0+a˜bã0b0₋a˜bã0˜b0 −ãba0b0+ ãba0˜b0+ ãbã0b0₋ãbã0˜b0 + ã˜ba0b0₋a˜˜ba0˜b0₋a˜˜bã0˜b0+ ã˜bã0˜b0 = +aba^0_b0₋ g aba0˜_b0 −abbg0ã0+ab˜ea0˜b0 −aa]0_b0˜_b₊_aa_f0˜_b˜_b0₊ f ab0_˜_a0_˜_b −ã˜bã0˜b0 −ag0bb0ã+af0bã˜b0+ ãã0bbf0−a˜˜bã0˜b0 +ag0b0ã˜b−ã˜bã0˜b0−ã˜bã0˜b0+ã˜bã0˜b0 = +aba^0_b0₋ g aba0˜_b0 −abbg0ã0+ab˜ea0˜b0 −aa]0_b0˜_b₊_aa_f0˜_b˜_b0₊ f ab0_˜_a0˜_b₊_g_a0_b0_a_˜˜_b −ag0bb0ã+af0bã˜b0+ ãã0bbf0−3ã˜bã0˜b0.

The expansion ofS(ˆ t,t0)can be done in the same way. By the triangle inequality, we have ˆ S(t,t0)₋S(t,t0) ≤ aba 0_b0₋_aba_^0_b0 + aba 0_b0₋ g aba0˜_b0 + abb 0_a0 −abbg0˜a0 + aba 0_b0 −ab˜ea0˜b0 aa 0_b0_b₋_aa_]0_b0˜_b + aa 0_{b b}0₋ f aa0˜_b˜_b0 + ab 0_a0_b −abf0˜a0˜b + a 0_b0_ab₋ g a0_b0_˜_a˜_b