A Wavelet-Based Independence Test for Functional Data with an Application to MEG Functional Connectivity
Rui Miao, Xiaoke Zhang (Department of Statistics, George Washington University), and Raymond K. W. Wong (Department of Statistics, Texas A&M University)
September 25, 2020
Abstract
Measuring and testing the dependency between multiple random functions is often an important task in functional data analysis. In the literature, a model-based method relies on a model which is subject to the risk of model misspecification, while a model-free method only provides a correlation measure which is inadequate to test independence. In this paper, we adopt the Hilbert-Schmidt Independence Criterion (HSIC) to measure the dependency between two random functions. We develop a two-step procedure by first pre-smoothing each function based on its discrete and noisy measurements and then applying the HSIC to the recovered functions. To ensure compatibility between the two steps, so that the effect of the pre-smoothing error on the subsequent HSIC is asymptotically negligible, we propose to use wavelet soft-thresholding for pre-smoothing and Besov-norm-induced kernels for the HSIC. We also provide the corresponding asymptotic analysis. The superior numerical performance of the proposed method over existing ones is demonstrated in a simulation study. Moreover, in a magnetoencephalography (MEG) data application, the functional connectivity patterns identified by the proposed method are more anatomically interpretable than those identified by existing methods.
Key Words: Reproducing kernel Hilbert space; Besov spaces; Permutation test; Human Connectome Project.
1 Introduction

In recent decades, functional data analysis (FDA) has developed rapidly due to a huge and increasing number of datasets collected in the form of curves, surfaces and volumes. General introductions to the subject may be found in a few monographs (e.g., Ramsay and Silverman, 2005; Ferraty and Vieu, 2006). In many scientific fields, measurements are taken from multiple random functions per subject, and the dependency between these functions is of interest. For instance, neuroscientists are interested in functional connectivity patterns between signals at multiple brain regions, which are measured over time in functional magnetic resonance imaging data. It is thus an important task in FDA to measure such dependency and to further test its significance. Among extensive relevant research endeavors, most dependency test methods can be categorized as either model-based or model-free.

A model-based method typically infers the dependency between multiple functions by first assuming a functional regression model (see, e.g., Morris, 2015, for a survey) which characterizes their structural relationship, and then testing the significance of the assumed model. See examples of model-based methods by Guo (2002), Huang et al. (2002), Shen and Faraway (2004) and Antoniadis and Sapatinas (2007) for concurrent/varying-coefficient models, and by Kokoszka et al. (2008) and Chen et al. (2020) for function-on-function regression models. The main disadvantage of a model-based method is its reliance on correct model specification. If the model is misspecified, the inference is not well grounded and might be inaccurate.

A model-free method can avoid the misspecification issue associated with model-based methods since it typically quantifies the dependency between random functions by a correlation measure, without assuming any particular model.
As a natural extension of the canonical correlation for multivariate data, the functional canonical correlation is a popular correlation measure for functional data (e.g., Leurgans et al., 1993; He et al., 2003; Eubank and Hsing, 2008; Shin and Lee, 2015). However, it is plagued by the involvement of inverting a covariance operator, which is an ill-posed problem and often requires proper regularization. The dynamical correlation (Dubin and Müller, 2005; Sang et al., 2019) and the temporal correlation (Zhou et al., 2018) are two functional correlation measures free of the aforementioned inverse problem. The former measures the angle between two random functions in the $L^2$ space. The latter essentially computes the Pearson correlation between two random functions at each time point and then averages all pointwise Pearson correlations over the time domain. However, since uncorrelatedness does not imply independence, these functional correlations are insufficient to test independence. Recently a few model-free approaches have been developed to test mean independence for functional data (e.g., Patilea et al., 2016; Lee et al., 2020), but they can only test this weaker notion of independence.

In this paper we develop a model-free independence test for functional data. Under the reproducing kernel Hilbert space (RKHS) framework, we propose to use the Hilbert-Schmidt Independence Criterion (HSIC, e.g., Gretton et al., 2005, 2008) to measure the dependency between two random functions. An appealing property is that HSIC is zero if and only if the two random functions are independent. However, the application of HSIC requires fully observed and noiseless functional data, while in practice functional data are always discretely measured and contaminated by noise. To tackle this problem, one may perform a two-step procedure: first pre-smooth the data, and then apply HSIC to the resulting functions. Clearly, pre-smoothing will affect the performance of HSIC.
Indeed, the functional distance with respect to which the asymptotic convergence of the pre-smoothing procedure is measured is crucial, as HSIC is fundamentally based on a functional distance. Some common pre-smoothing procedures do not have existing convergence results for the required functional distance, and hence may not be compatible; namely, the pre-smoothing error may have a profound effect on the subsequent HSIC. See Section 3 for more discussion. In this work, we carefully design our procedure to ensure that the two steps are compatible. Our choice for the first step is wavelet soft-thresholding (e.g., Donoho et al., 1995), while the HSIC is based on Besov-norm-induced kernels. We can show that these choices are theoretically compatible if the functional data are sufficiently densely measured. See Section 4 for details. This work is motivated by the Human Connectome Project (HCP), from which various brain imaging datasets are publicly accessible. The application of our method to a magnetoencephalography (MEG) dataset from the HCP is capable of identifying anatomically interpretable functional connectivity patterns. Therefore, the proposed method provides an important tool to study functional connectivity between different brain regions.

The main contribution of this paper is three-fold. First, we generalize HSIC to Besov spaces, a larger class of functions than the Sobolev spaces that are popular in RKHS modeling. For random functions whose sample paths belong to Besov spaces, we show that the Besov sequence norm of their wavelet coefficients can induce a characteristic kernel, which is required by HSIC. Second, for dense functional data, we develop the asymptotic distribution of the empirical HSIC based on functions pre-smoothed by wavelet soft-thresholding. Since the asymptotic distribution involves many unknown quantities, we suggest a permutation test in practice and prove that not only can the test control the Type I error probability but it is also consistent.
The theoretical results show that the two steps in our proposed procedure are compatible. Finally, we propose a data-adaptive approach to tuning the smoothness parameter of the Besov norm needed to induce the kernel for HSIC. It is numerically shown that this approach is able to enhance the sensitivity of HSIC in detecting dependencies at high frequencies.

The rest of the paper proceeds as follows. Section 2 provides a brief introduction to HSIC. The two-step procedure for the proposed wavelet-based HSIC test is given in Section 3. Its asymptotic properties are presented in Section 4. Section 5 provides a method for smoothness parameter selection. The numerical performance of the proposed method is illustrated in a simulation study in Section 6 and in an MEG functional connectivity study in Section 7, where it is also compared with representative existing methods.

2 Hilbert-Schmidt Independence Criterion

In this section we give a brief introduction to HSIC. Let $X$ and $Y$ be two random functions whose sample paths belong to $\mathcal{X}$ and $\mathcal{Y}$ respectively, and let $\mathcal{H}(\kappa_X)$ and $\mathcal{H}(\kappa_Y)$ be the RKHSs equipped with kernels $\kappa_X$ and $\kappa_Y$ defined on $\mathcal{X}\times\mathcal{X}$ and $\mathcal{Y}\times\mathcal{Y}$ respectively.

HSIC requires that both $\kappa_X$ and $\kappa_Y$ be characteristic, in the sense that two probability measures satisfy $P = P'$ if and only if $P_{\kappa_Z}(P) = P_{\kappa_Z}(P')$, where $P_{\kappa_Z}(P) = E_P\{\kappa_Z(Z,\cdot)\}$ for a random function $Z \in \mathcal{Z}$ which follows $P$, and $(Z,\mathcal{Z}) = (X,\mathcal{X})$ or $(Y,\mathcal{Y})$. A characteristic kernel may be induced by a semi-metric of strong negative type (see Definition S3 and Proposition S1 in the Appendix). Denote the joint probability measure of $X$ and $Y$ by $P_{XY}$ and their marginal probability measures by $P_X$ and $P_Y$ respectively. Since $\kappa_X$ and $\kappa_Y$ are characteristic, $P_X$ and $P_Y$ are fully characterized by $P_{\kappa_X}(P_X) = E_{P_X}\{\kappa_X(X,\cdot)\}$ and $P_{\kappa_Y}(P_Y) = E_{P_Y}\{\kappa_Y(Y,\cdot)\}$ respectively. Let $P_{\kappa_X\otimes\kappa_Y}(P_{XY}) = E_{P_{XY}}\{(\kappa_X\otimes\kappa_Y)((X,Y),(*,\cdot))\}$, where the tensor product kernel $\kappa_X\otimes\kappa_Y$ is defined by $(\kappa_X\otimes\kappa_Y)((x,y),(x',y')) = \kappa_X(x,x')\,\kappa_Y(y,y')$ for $x,x'\in\mathcal{X}$ and $y,y'\in\mathcal{Y}$. Sejdinovic et al. (2013) showed that $X$ and $Y$ are independent, i.e., $P_{XY} = P_X P_Y$, if and only if $P_{\kappa_X\otimes\kappa_Y}(P_{XY}) = P_{\kappa_X}(P_X)\,P_{\kappa_Y}(P_Y)$, although $\kappa_X\otimes\kappa_Y$ is not characteristic for all probability measures on $\mathcal{X}\times\mathcal{Y}$. Therefore, to test the independence between $X$ and $Y$, it suffices to study the difference between $P_{\kappa_X\otimes\kappa_Y}(P_{XY})$ and $P_{\kappa_X}(P_X)\,P_{\kappa_Y}(P_Y)$. Since $P_{\kappa_X}(P_X)\in\mathcal{H}(\kappa_X)$, $P_{\kappa_Y}(P_Y)\in\mathcal{H}(\kappa_Y)$ and $P_{\kappa_X\otimes\kappa_Y}(P_{XY})\in\mathcal{H}(\kappa_X\otimes\kappa_Y)$, where $\mathcal{H}(\kappa_X\otimes\kappa_Y)$ is the RKHS equipped with $\kappa_X\otimes\kappa_Y$, HSIC may be used to measure this difference under the norm of $\mathcal{H}(\kappa_X\otimes\kappa_Y)$.

Definition 1 (HSIC). Suppose that $\int\kappa_X(x,x)\,dP_X(x) < \infty$ and $\int\kappa_Y(y,y)\,dP_Y(y) < \infty$. The HSIC of $P_{XY}$ is defined by
$$\gamma(P_{XY},\kappa_X,\kappa_Y) = \|P_{\kappa_X\otimes\kappa_Y}(P_{XY}) - P_{\kappa_X}(P_X)\,P_{\kappa_Y}(P_Y)\|^2_{\mathcal{H}(\kappa_X\otimes\kappa_Y)} = \iint \kappa_X(x,x')\,\kappa_Y(y,y')\,d(P_{XY}-P_XP_Y)(x,y)\,d(P_{XY}-P_XP_Y)(x',y').$$

In practice, with $\{(X_i,Y_i) : i = 1,\ldots,n\}$ being independently and identically distributed (i.i.d.) copies of $(X,Y)$, the sample versions of $P_{\kappa_X}(P_X)$, $P_{\kappa_Y}(P_Y)$ and $P_{\kappa_X\otimes\kappa_Y}(P_{XY})$ are defined by $P_{\kappa_X}(P_{n,X}) = n^{-1}\sum_{i=1}^n \kappa_X(X_i,\cdot)$, $P_{\kappa_Y}(P_{n,Y}) = n^{-1}\sum_{i=1}^n \kappa_Y(Y_i,\cdot)$, and $P_{\kappa_X\otimes\kappa_Y}(P_{n,XY}) = n^{-1}\sum_{i=1}^n (\kappa_X\otimes\kappa_Y)((X_i,Y_i),(*,\cdot))$. Obviously $P_{\kappa_X}(P_{n,X})\in\mathcal{H}(\kappa_X)$, $P_{\kappa_Y}(P_{n,Y})\in\mathcal{H}(\kappa_Y)$ and $P_{\kappa_X\otimes\kappa_Y}(P_{n,XY})\in\mathcal{H}(\kappa_X\otimes\kappa_Y)$, so we can obtain a sample version of HSIC as follows.

Definition 2 (Empirical HSIC). Under the same setting as Definition 1, the empirical HSIC, which is an estimator of HSIC, is defined by
$$\gamma(P_{n,XY},\kappa_X,\kappa_Y) = \|P_{\kappa_X\otimes\kappa_Y}(P_{n,XY}) - P_{\kappa_X}(P_{n,X})\,P_{\kappa_Y}(P_{n,Y})\|^2_{\mathcal{H}(\kappa_X\otimes\kappa_Y)} = \iint \kappa_X(x,x')\,\kappa_Y(y,y')\,d(P_{n,XY}-P_{n,X}P_{n,Y})(x,y)\,d(P_{n,XY}-P_{n,X}P_{n,Y})(x',y').$$

By Sejdinovic et al. (2013), the empirical HSIC can be rewritten as $\gamma(P_{n,XY},\kappa_X,\kappa_Y) = n^{-2}\,\mathrm{tr}(\Gamma_X H \Gamma_Y H)$, where $\Gamma_X = (\kappa_X(X_i,X_j))_{1\le i,j\le n}$ and $\Gamma_Y = (\kappa_Y(Y_i,Y_j))_{1\le i,j\le n}$ are Gram matrices, and $H = I_n - n^{-1}\mathbf{1}_n\mathbf{1}_n^\top$ is the centering matrix with the $n\times n$ identity matrix $I_n$ and $\mathbf{1}_n = (1,\ldots,1)^\top$ of dimension $n$.

3 The Proposed Wavelet-Based HSIC Test

Suppose that the bivariate functional data $\{(X_i,Y_i) : i = 1,\ldots,n\}$ collected from $n$ subjects are i.i.d. copies of a pair of random functions $(X,Y)$ which, without loss of generality, is defined on the domain $[0,1]\times[0,1]$, and that $X$ and $Y$ belong to function spaces $\mathcal{X}$ and $\mathcal{Y}$ respectively. In many applications such as brain imaging analysis, the measurements of each function are sampled at a discrete and regular grid and subject to noise contamination. Hence we assume that the observations are $\{(\tilde X_{il}, \tilde Y_{il}) := (X_i(T_l) + e_{X,il},\, Y_i(T_l) + e_{Y,il}) : i = 1,\ldots,n;\ l = 1,\ldots,m\}$, where $\{T_l = (l-1)/m : l = 1,\ldots,m\}$ is a regular grid with $m = 2^{J+1}$ for some integer $J > 0$. The noise variables $\{e_{X,il}\}$ and $\{e_{Y,il}\}$, with respective variances $\delta_X^2$ and $\delta_Y^2$, are independent of each other and of $\{(X_i,Y_i) : i = 1,\ldots,n\}$. Our goal is to formulate an HSIC-based test for the independence between $X$ and $Y$ via $\{(\tilde X_{il}, \tilde Y_{il}) : i = 1,\ldots,n;\ l = 1,\ldots,m\}$. For simplicity we assume that all functions share the same measurement grid and that $m = 2^{J+1}$, but the proposed method is applicable with minor modifications if the grid is irregular, the functions are measured at different grids, or $m \neq 2^{J+1}$ (see Remark 1).

Due to the success of existing HSIC-based independence tests for multivariate data, it is tempting to treat the discretized observations as multivariate data and directly apply existing methods. However, there are two issues with this approach. First, in order to capture enough information, $m$ should be large, which naturally leads to high-dimensional data. Without reasonable structure across these $m$ dimensions, HSIC does not perform well. In the FDA literature, modeling the sample paths with a certain form of smoothness has been shown to be an empirically successful strategy in many applications, so it is beneficial to incorporate smoothness structure into the design of a tailor-made HSIC method. Second, the discretized observations are contaminated by noise. Hence the raw observations are not "smooth", although the noiseless ones are.

The proposed method is directly based on the definition of HSIC (Definition 1) applied to random functions. Clearly, such an application requires the trajectories of all random functions to be fully observed and noiseless. A natural idea is to perform pre-smoothing to recover these trajectories, followed by an application of HSIC to the recovered random functions. However, the compatibility of these two steps is generally unclear.
Namely, it is non-trivial to know whether the pre-smoothing error (measured in a certain norm) would have a profound effect on the subsequent HSIC-based test. For instance, if the sample paths of all random functions are assumed to belong to a Sobolev space, it is seemingly reasonable to pre-smooth each trajectory by a smoothing spline, followed by the HSIC based on Sobolev-norm-induced kernels. However, the compatibility of the two steps is unknown since there is no theoretical result to guarantee that the pre-smoothing error under a Sobolev norm converges to zero, although the corresponding results with respect to the $L^2$ or empirical norm exist.

To address this compatibility issue, we propose to use HSIC based on Besov-norm-induced kernels for testing independence, under the assumption that the sample paths of all random functions belong to Besov spaces, a larger class of functions than Sobolev spaces. To recover each trajectory, we adopt wavelet soft-thresholding (Donoho et al., 1995), a successful pre-smoothing technique in Besov spaces. Its theoretical compatibility with the proposed HSIC is given in Section 4. In the rest of this section, we first briefly introduce wavelets (e.g., Ogden, 1997; Vidakovic, 2009; Morettin et al., 2017) together with other related results, and then give the details of the proposed two-step procedure.

Following the Cohen-Daubechies-Jawerth-Vial (CDJV) construction (Cohen et al., 1993), let the father and mother wavelets be $\phi, \psi \in C^R[0,1]$ respectively, with $D$ vanishing moments (e.g., Daubechies, 1992), where $C^R[0,1]$ is the space of all functions on $[0,1]$ with $R$-th order continuous derivatives. We consider a Besov space $B^\alpha_{p,q}[0,1]$ with norm $\|\cdot\|_{B^\alpha_{p,q}}$, whose smoothness parameter $\alpha$ satisfies $1/p < \alpha < \min\{R, D\}$ such that $B^\alpha_{p,q}[0,1]$ can be embedded continuously in $C[0,1]$. The definitions of $B^\alpha_{p,q}[0,1]$ and its norm $\|\cdot\|_{B^\alpha_{p,q}}$ are given in Section A.2 in the Appendix. Then for any function $f \in B^\alpha_{p,q}[0,1] \cap L^2[0,1]$ and a fixed coarse scale $L$, we have the following decomposition:
$$f(t) = \sum_{k=0}^{2^L-1} \xi_k \{2^{L/2}\phi(2^L t - k)\} + \sum_{j \ge L} \sum_{k=0}^{2^j-1} \theta_{j,k} \{2^{j/2}\psi(2^j t - k)\}, \quad t \in [0,1]. \qquad (1)$$

Denote $\theta_{j,k} = \xi_{2^j+k}$ for $0 \le j < L$ and $0 \le k < 2^j$, and $\theta_{-1,0} = \xi_0$. Based on the wavelet coefficients of $f$, $\theta^f = ((\theta^f_{-1})^\top, (\theta^f_0)^\top, \ldots, (\theta^f_L)^\top, (\theta^f_{L+1})^\top, \ldots)^\top$, where $\theta^f_j = (\theta_{j,0}, \theta_{j,1}, \ldots, \theta_{j,2^j-1})^\top$ and $\theta^f_{-1} = \theta_{-1,0}$, the Besov sequence norm $\|\cdot\|_{b^\alpha_{p,q}}$ (e.g., Donoho et al., 1995; Johnstone and Silverman, 2005) is defined by
$$\|\theta^f\|_{b^\alpha_{p,q}} = \Big(\sum_{j \ge -1} 2^{jsq}\,\|\theta^f_j\|_p^q\Big)^{1/q}, \qquad s = \alpha + 1/2 - 1/p,$$
where $\|\cdot\|_p$ refers to the $\ell^p$-norm for vectors. Denote the corresponding sequence space by $b^\alpha_{p,q} = \{a : \|a\|_{b^\alpha_{p,q}} < \infty\}$. Note that the two norms $\|\cdot\|_{B^\alpha_{p,q}}$ and $\|\cdot\|_{b^\alpha_{p,q}}$ are equivalent (e.g., DeVore and Lorentz, 1993; Donoho et al., 1995), and obviously $b^\alpha_{p,q} \subset b^\beta_{p,q}$ if $\beta \le \alpha$.

We can show that such Besov sequence norms induce characteristic kernels, as required by HSIC.

Theorem 1. For $0 < q \le p \le 2$ and $\alpha > 1/p$, let the semi-metric be $\rho_{b^\alpha_{p,q}}(f,g) = \|\theta^f - \theta^g\|^q_{b^\alpha_{p,q}}$ for $f, g \in B^\alpha_{p,q}[0,1]$, where $\theta^f$ and $\theta^g$ are the wavelet coefficients of $f$ and $g$ respectively. The kernel induced by $\rho_{b^\alpha_{p,q}}$, namely $\kappa(z,z') = \rho_{b^\alpha_{p,q}}(z,0) + \rho_{b^\alpha_{p,q}}(z',0) - \rho_{b^\alpha_{p,q}}(z,z')$ for $z, z' \in B^\alpha_{p,q}[0,1]$, is symmetric, positive definite and characteristic.

The proof of Theorem 1 is given in Section B.1 in the Appendix. By Theorem 1, we can define HSIC properly based on kernels induced by Besov sequence norms. For simplicity, hereafter we focus on the popular choice $p = q = 2$. Accordingly we abbreviate $B^\alpha_{2,2}[0,1]$ and $b^\alpha_{2,2}$ to $B^\alpha$ and $b^\alpha$ respectively.

Under the setting in Section 3.1, we assume $X \in B^{\alpha_X}$ and $Y \in B^{\alpha_Y}$, where $1/2 < \alpha_X, \alpha_Y < \min\{R, D\}$. To test the independence between $X$ and $Y$ based on their discretely measured and noisy observations, we propose to first denoise each function and then apply HSIC to the recovered functions. The two-step procedure is explicitly stated as follows.

Step 1.
By the decomposition (1) and the resolution limitation due to the finite number of measurements $m = 2^{J+1}$ taken for each subject, we first obtain the initial wavelet coefficient estimates for each $X_i$, denoted by $\theta^{\tilde X_i} = ((\theta^{\tilde X_i}_{-1})^\top, (\theta^{\tilde X_i}_0)^\top, \ldots, (\theta^{\tilde X_i}_J)^\top)^\top$, via the discrete wavelet transformation with coarse scale $L_X$. The coarse scale $L_X$ may be selected by cross-validation or domain knowledge. Then we denoise each $\theta^{\tilde X_i}$ by wavelet soft-thresholding (Donoho et al., 1995). Explicitly, the soft-thresholded estimates $\theta^{\hat X_i} = ((\theta^{\hat X_i}_{-1})^\top, (\theta^{\hat X_i}_0)^\top, \ldots, (\theta^{\hat X_i}_J)^\top)^\top$ are obtained by $\theta^{\hat X_i}_j = \theta^{\tilde X_i}_j$ for $j = -1, \ldots, L_X - 1$, and
$$\theta^{\hat X_i}_j = \Big\{\mathrm{sgn}(\theta^{\tilde X_i}_{j,k})\Big(|\theta^{\tilde X_i}_{j,k}| - \delta_X\sqrt{(2\log m)/m}\Big)_+ : k = 0, \ldots, 2^j - 1\Big\}, \quad j = L_X, \ldots, J,$$
where $(x)_+ = \max\{x, 0\}$ and $\delta_X$ is the noise standard deviation. We denote each denoised function by $\hat X_i$, $i = 1, \ldots, n$. To estimate $\delta_X$, we adopt the robust estimator (Donoho et al., 1995) $\hat\delta_X = \mathrm{median}\{\sqrt{m}\,|\theta^{\tilde X_i}_{J,k}| : k = 0, \ldots, 2^J - 1\}/\mathrm{median}(|W|)$, where $W$ is a standard normal random variable. We can similarly obtain $\theta^{\hat Y_i}$ and $\hat Y_i$, $i = 1, \ldots, n$.

Step 2. Since the soft-thresholded wavelet coefficient estimates satisfy $\theta^{\hat X_i} \in b^{\alpha_X} \subset b^{\beta_X}$ and $\theta^{\hat Y_i} \in b^{\alpha_Y} \subset b^{\beta_Y}$, $i = 1, \ldots, n$, for any $\beta_X < \alpha_X$ and $\beta_Y < \alpha_Y$, we may apply HSIC to the denoised functions, where the kernels $\kappa_X$ and $\kappa_Y$ are induced by $\rho_{b^{\beta_X}}$ and $\rho_{b^{\beta_Y}}$ respectively, as defined in Theorem 1. Explicitly, we have
$$\gamma(P_{n,\hat X\hat Y}, \kappa_X, \kappa_Y) = n^{-2}\,\mathrm{tr}(\Gamma_{\hat X} H \Gamma_{\hat Y} H),$$
where $\Gamma_{\hat X} = \big(\|\theta^{\hat X_i}\|^2_{b^{\beta_X}} + \|\theta^{\hat X_j}\|^2_{b^{\beta_X}} - \|\theta^{\hat X_i} - \theta^{\hat X_j}\|^2_{b^{\beta_X}}\big)_{1\le i,j\le n}$ and $\Gamma_{\hat Y} = \big(\|\theta^{\hat Y_i}\|^2_{b^{\beta_Y}} + \|\theta^{\hat Y_j}\|^2_{b^{\beta_Y}} - \|\theta^{\hat Y_i} - \theta^{\hat Y_j}\|^2_{b^{\beta_Y}}\big)_{1\le i,j\le n}$.

By adopting $\rho_{b^{\beta_X}}$ and $\rho_{b^{\beta_Y}}$ with $\beta_X < \alpha_X$ and $\beta_Y < \alpha_Y$ to construct the kernels, we are able to make the pre-smoothing step theoretically compatible with the HSIC.
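To make the two steps concrete, here is a minimal numerical sketch. It is illustrative rather than an implementation of the method above: it substitutes the Haar wavelet for the CDJV construction, applies the universal threshold $\sigma\sqrt{2\log m}$ to raw (unnormalized) DWT coefficients with a MAD-type estimate of $\sigma$, fixes $p = q = 2$ so that kernel entries are squared Besov sequence norms, and indexes detail levels from 0 at the coarsest retained level. All function names are hypothetical.

```python
import numpy as np

def dwt_haar(x):
    """Orthonormal Haar DWT of a length-2^(J+1) signal.
    Returns (scaling coefficient array, list of detail levels, coarse to fine)."""
    a, details = np.asarray(x, float), []
    while a.size > 1:
        d = (a[0::2] - a[1::2]) / np.sqrt(2.0)
        a = (a[0::2] + a[1::2]) / np.sqrt(2.0)
        details.append(d)
    return a, details[::-1]  # reorder details coarse -> fine

def soft_threshold_curve(y):
    """Step 1 (sketch): soft-threshold the detail coefficients of one noisy
    trajectory at the universal threshold sigma * sqrt(2 log m)."""
    m = y.size
    a, details = dwt_haar(y)
    sigma = np.median(np.abs(details[-1])) / 0.6745  # MAD-type noise estimate
    lam = sigma * np.sqrt(2.0 * np.log(m))
    thr = [np.sign(d) * np.maximum(np.abs(d) - lam, 0.0) for d in details]
    return a, thr

def besov_sq_norm(a, details, beta):
    """Squared b^beta_{2,2} sequence norm: sum_j 2^(2*j*beta) ||theta_j||_2^2,
    using s = beta when p = q = 2; the coarse block gets weight 1."""
    out = np.sum(a ** 2)
    for j, d in enumerate(details):
        out += 4.0 ** (j * beta) * np.sum(d ** 2)
    return out

def hsic_from_curves(X, Y, beta_x, beta_y):
    """Step 2 (sketch): Gram matrices from the Besov-norm-induced kernel,
    then the empirical HSIC n^{-2} tr(Gx H Gy H)."""
    def gram(curves, beta):
        coef = [soft_threshold_curve(c) for c in curves]
        n = len(coef)
        G = np.empty((n, n))
        for i in range(n):
            ai, di = coef[i]
            for j in range(n):
                aj, dj = coef[j]
                cross = besov_sq_norm(ai - aj, [u - w for u, w in zip(di, dj)], beta)
                G[i, j] = (besov_sq_norm(ai, di, beta)
                           + besov_sq_norm(aj, dj, beta) - cross)
        return G
    Gx, Gy = gram(X, beta_x), gram(Y, beta_y)
    n = Gx.shape[0]
    H = np.eye(n) - 1.0 / n  # centering matrix
    return np.trace(Gx @ H @ Gy @ H) / n ** 2
```

Since the induced kernel reduces to twice a weighted inner product of coefficient vectors, the resulting Gram matrices are positive semi-definite and the statistic is nonnegative.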
As revealed in Lemma 1 and Theorem 2 in Section 4 below, if the observations of all functions are sufficiently dense, the denoising error due to wavelet soft-thresholding is asymptotically negligible in the asymptotic distribution of the HSIC. This is a key benefit of using wavelets and Besov norms for pre-smoothing.

In Section 4, the asymptotic distribution of $\gamma(P_{n,\hat X\hat Y}, \kappa_X, \kappa_Y)$ is developed in Theorem 2 under the independence hypothesis. Despite its theoretical appeal, the asymptotic distribution unfortunately involves many unknown quantities. Therefore, we suggest using permutations to perform the independence test which, as shown in Theorem 3, can control the Type I error probability and is also consistent.

Remark 1.
Since denoising is performed separately for each function and subject, the proposed method is applicable when the functions of different subjects are not measured at the same grid. For $m \neq 2^{J+1}$ at fixed but possibly irregular designs, resampling or linear interpolation may be applied if the original measurement resolution is sufficiently high (e.g., Kovac and Silverman, 2000). For random designs with different numbers of measurements per subject, one may apply the method by Cai and Brown (1999) or by Pensky and Vidakovic (2001).

4 Asymptotic Properties

In this section we show that the proposed two-step procedure leads to an asymptotically valid test, which addresses the compatibility issue raised in Section 3. Explicitly, we first provide the rate of convergence of the denoising error involved in Step 1 in Lemma 1, then the asymptotic distribution of the HSIC $\gamma(P_{n,\hat X\hat Y}, \kappa_X, \kappa_Y)$ in Step 2 in Theorem 2, and finally the asymptotic properties of the permutation test in Theorem 3. Hereafter, the kernels $\kappa_X$ and $\kappa_Y$ are induced by $\rho_{b^{\beta_X}}$ and $\rho_{b^{\beta_Y}}$ respectively.

Lemma 1.
Let $Z = X$ or $Y$. Assume that the noise $e_{Z,il} \sim N(0, \delta_Z^2)$, $i = 1, \ldots, n$, $l = 1, \ldots, m$, that $\beta_Z < \alpha_Z$, and that $\|\theta^{Z_i}\|_{b^{\alpha_Z}} \le C_Z$ for a constant $C_Z > 0$. Then as $m \to \infty$,
$$\|\theta^{\hat Z_i} - \theta^{Z_i}\|_{b^{\beta_Z}} = O_p\big\{m^{-r_{Z,1}/2}(\log m)^{r_{Z,1}/2 + r_{Z,2}}\big\},$$
uniformly for $i = 1, \ldots, n$, where $r_{Z,1} = (\alpha_Z - \beta_Z)/(\alpha_Z + 1/2)$ and $r_{Z,2} = (1/2)\cdot I(0 \le \beta_Z \le 1/2)$.

Lemma 1 is a special case of Theorem 4 in Donoho et al. (1995), so its proof is omitted. All assumptions in Lemma 1 are standard in the literature on wavelet soft-thresholding (e.g., Donoho et al., 1995; Johnstone and Silverman, 2005). Lemma 1 indicates that the denoising error converges to zero if the functional data are sufficiently densely measured.

Since the HSIC is constructed based on the kernels induced by $\rho_{b^{\beta_X}}$ and $\rho_{b^{\beta_Y}}$, the same norms used to evaluate the denoising error in Lemma 1, the compatibility between the pre-smoothing by wavelet soft-thresholding and the HSIC is theoretically guaranteed. As shown in Theorem 2, the effect of the denoising error on the distribution of the HSIC is asymptotically negligible for dense functional data.

To develop the asymptotic distribution of $\gamma(P_{n,\hat X\hat Y}, \kappa_X, \kappa_Y)$, we further define the centered kernel for $\kappa_X$ by $\check\kappa_X(X, X') = \langle \kappa_X(X,\cdot) - \mu_{\kappa_X}(P_X),\, \kappa_X(X',\cdot) - \mu_{\kappa_X}(P_X)\rangle_{\mathcal{H}(\kappa_X)}$, where $\mu_{\kappa_X}(P_X) = \int \kappa_X(x,\cdot)\,dP_X(x)$. Furthermore, define an integral kernel operator $S_{\check\kappa_X} : \mathcal{H}(\kappa_X) \to \mathcal{H}(\kappa_X)$ by $S_{\check\kappa_X}(g) = \int_{\mathcal{X}} \check\kappa_X(x,\cdot)\,g(x)\,dP_X$ for any $g \in \mathcal{H}(\kappa_X)$. An integral kernel operator $S_{\check\kappa_Y}$ for $Y$ can be similarly defined.

Theorem 2.
Under the same assumptions of Lemma 1, if $m$ satisfies
$$m^{-r_{Z,1}/2}(\log m)^{r_{Z,1}/2 + r_{Z,2}} = o(n^{-1/2}), \qquad (2)$$
where $r_{Z,1} = (\alpha_Z - \beta_Z)/(\alpha_Z + 1/2)$ and $r_{Z,2} = (1/2)\cdot I(0 \le \beta_Z \le 1/2)$ for $Z = X$ or $Y$, then
$$n\,\gamma(P_{n,\hat X\hat Y}, \kappa_X, \kappa_Y) \rightsquigarrow \begin{cases} \sum_{r=1}^{\infty}\sum_{s=1}^{\infty} \mu_r \nu_s N_{rs}^2, & \text{if } X \text{ and } Y \text{ are independent}, \\ \infty, & \text{otherwise}, \end{cases}$$
where $\rightsquigarrow$ represents weak convergence, $N_{rs} \sim N(0,1)$, $r, s \ge 1$, are i.i.d., and $\{\mu_r : r \ge 1\}$ and $\{\nu_s : s \ge 1\}$ are the eigenvalues of $S_{\check\kappa_X}$ and $S_{\check\kappa_Y}$ respectively.

The proof of Theorem 2 is given in Section B.2 in the Appendix. The asymptotic distribution of $\gamma(P_{n,\hat X\hat Y}, \kappa_X, \kappa_Y)$ in Theorem 2 is the same as that for fully observed $\{(X_i, Y_i) : X_i \in B^{\beta_X}, Y_i \in B^{\beta_Y}, i = 1, \ldots, n\}$ (Sejdinovic et al., 2013). The requirement (2) ensures that the error due to wavelet soft-thresholding is asymptotically negligible under the $b^{\beta_Z}$ norm if the measurements are sufficiently dense. In general, for fixed $\alpha_Z$ and $\beta_Z$, the order of $m$ should be higher than $n^{1/r_{Z,1}}$, which, for example, is $n^{7/4}$ if $(\alpha_Z, \beta_Z) = (3, 1)$ and $n^{5/3}$ if $(\alpha_Z, \beta_Z) = (2, 1/2)$.

Since the asymptotic distribution of $\gamma(P_{n,\hat X\hat Y}, \kappa_X, \kappa_Y)$ when $X$ and $Y$ are assumed independent involves many unknown quantities, in practice we perform the test by permutation. As shown in Theorem 3, the permutation test can control the Type I error probability and is also consistent.

Theorem 3 (Permutation Test). Let the level of significance be $\alpha \in (0, 1)$. If the null hypothesis that $X$ and $Y$ are independent is true, the permutation test of $\gamma(P_{n,\hat X\hat Y}, \kappa_X, \kappa_Y)$ based on a finite number of permutations rejects the null hypothesis with probability at most $\alpha$. If the alternative hypothesis that $X$ and $Y$ are dependent is true and the assumptions of Lemma 1 and (2) hold, the permutation test of $\gamma(P_{n,\hat X\hat Y}, \kappa_X, \kappa_Y)$ based on $B \ge 1/\alpha - 1$ permutations is consistent, i.e., $P(\hat p_{\hat X\hat Y} \le \alpha) \to 1$ as $n \to \infty$, where $\hat p_{\hat X\hat Y}$ is the p-value.

The proof of Theorem 3 is given in Section B.3 in the Appendix. Theorem 3 shows that the proposed permutation test is also theoretically compatible with the wavelet soft-thresholding denoising.
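In practice, the permutation test can be carried out directly on the Gram matrices: permuting the subject index of $Y$ corresponds to permuting the rows and columns of $\Gamma_{\hat Y}$. A minimal sketch follows; the function names are hypothetical, and the observed statistic is included in the reference set, which yields the finite-sample level control described in Theorem 3.

```python
import numpy as np

def empirical_hsic(gram_x, gram_y):
    """Empirical HSIC  n^{-2} tr(Gx H Gy H)  with the centering matrix H."""
    n = gram_x.shape[0]
    H = np.eye(n) - 1.0 / n
    return np.trace(gram_x @ H @ gram_y @ H) / n ** 2

def hsic_permutation_pvalue(gram_x, gram_y, B=199, seed=0):
    """p-value from B random permutations of the Y sample index."""
    rng = np.random.default_rng(seed)
    n = gram_x.shape[0]
    observed = empirical_hsic(gram_x, gram_y)
    exceed = 1  # count the observed statistic itself
    for _ in range(B):
        perm = rng.permutation(n)
        # permuting subjects of Y = permuting rows and columns of its Gram matrix
        if empirical_hsic(gram_x, gram_y[np.ix_(perm, perm)]) >= observed:
            exceed += 1
    return exceed / (B + 1)
```

With B = 199 and level 0.05, rejection requires the observed statistic to rank among the top 10 of the 200 values, in line with the requirement B >= 1/alpha - 1 in Theorem 3.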
5 Selection of the Smoothness Parameters

The proposed method in Section 3 requires a proper choice of the tuning parameters $\beta_X$ and $\beta_Y$. In this section we first discuss their roles in dependency detection and then propose a data-adaptive selection method for them.

In Section 4, Lemma 1 seems to imply that, given $\alpha_X$ and $\alpha_Y$, the best choice is $\beta_X = \beta_Y = 0$, because the corresponding denoising error attains the best rate of convergence. However, this choice of $\beta_X$ and $\beta_Y$ may result in poor dependency detection, especially when the dependency between $X$ and $Y$ originates from their high frequency bands.

For illustration, by Definition 1 and the kernel construction in Step 2 of Section 3, we have the following decomposition:
$$\gamma(P_{XY}, \kappa_X, \kappa_Y) = \gamma\Big(P_{XY},\, \sum_{j_X \ge -1} 2^{2\beta_X j_X}\kappa^{(j_X)}_X,\, \sum_{j_Y \ge -1} 2^{2\beta_Y j_Y}\kappa^{(j_Y)}_Y\Big) = \sum_{j_X \ge -1}\sum_{j_Y \ge -1} \gamma\big(P_{XY},\, 2^{2\beta_X j_X}\kappa^{(j_X)}_X,\, 2^{2\beta_Y j_Y}\kappa^{(j_Y)}_Y\big),$$
where $\kappa^{(j_Z)}_Z(z, z') = \|\theta^z_{j_Z}\|^2 + \|\theta^{z'}_{j_Z}\|^2 - \|\theta^z_{j_Z} - \theta^{z'}_{j_Z}\|^2$ for $j_Z \ge -1$, with $(z, Z, \mathcal{Z}) = (x, X, \mathcal{X})$ or $(y, Y, \mathcal{Y})$ and the Euclidean norm $\|\cdot\|$. Apparently $\gamma(P_{XY}, 2^{2\beta_X j_X}\kappa^{(j_X)}_X, 2^{2\beta_Y j_Y}\kappa^{(j_Y)}_Y)$ measures the contribution to the HSIC of the dependency at levels $j_X$ and $j_Y$ of $X$ and $Y$ respectively, which is zero if and only if $X$ and $Y$ are independent at levels $j_X$ and $j_Y$. If $\beta_X = \beta_Y = 0$, the scaling factors $2^{2\beta_X j_X} = 2^{2\beta_Y j_Y} = 1$ for all $j_X \ge -1$ and $j_Y \ge -1$, which makes it hard to detect the dependency between $X$ and $Y$ at high frequencies, since the dependency contributions contained at high frequencies are very likely to be overwhelmed by the independent signals at low frequencies. Therefore, we aim to select $\beta_X$ and $\beta_Y$ such that the dependency contributions at high frequencies, if any, are detectable.

The idea of the proposed tuning method is to balance the dependency contributions to the HSIC at all frequency scales such that they are approximately the same. To lessen the computational burden, a marginal selection algorithm is proposed, in the sense that the optimal $\beta_X$ is selected only based on $X$ without reliance on $Y$. Note that, by the Cauchy-Schwarz inequality, the dependency contribution at each $j_X, j_Y \ge -1$ satisfies
$$\gamma\big(P_{XY},\, 2^{2\beta_X j_X}\kappa^{(j_X)}_X,\, 2^{2\beta_Y j_Y}\kappa^{(j_Y)}_Y\big) \le 2^{2(\beta_X j_X + \beta_Y j_Y)}\sqrt{\gamma(P_X, \kappa^{(j_X)}_X)\,\gamma(P_Y, \kappa^{(j_Y)}_Y)},$$
where $\gamma(P_Z, \kappa^{(j_Z)}_Z)$, $j_Z \ge -1$, denotes the HSIC of $Z$ with itself under $\kappa^{(j_Z)}_Z$, which is essentially a distance variance (Székely et al., 2007) with $(z, Z, \mathcal{Z}) = (x, X, \mathcal{X})$ or $(y, Y, \mathcal{Y})$ (Sejdinovic et al., 2013). Thus we propose to select $\beta_X$ by balancing $2^{2\beta_X j_X}\sqrt{\gamma(P_X, \kappa^{(j_X)}_X)}$ over all $j_X \ge -1$. If $2^{2\beta_X j_X}\sqrt{\gamma(P_X, \kappa^{(j_X)}_X)} \approx C$ for some constant $C > 0$, then $2\beta_X j_X + (1/2)\log_2 \gamma(P_X, \kappa^{(j_X)}_X) \approx \log_2 C$, so $\beta_X$ may be selected as the estimated slope of a linear regression on the pairs $(-j_X, (1/4)\log_2 \gamma(P_X, \kappa^{(j_X)}_X))$. In practice we estimate $\gamma(P_X, \kappa^{(j_X)}_X)$ by its sample version $\gamma(P_{n,\hat X}, \kappa^{(j_X)}_X)$ for each $j_X \ge -1$, but its accuracy is poor at very high frequencies due to noise contamination. Thus we only consider levels up to $j^*_X = \max\{j_X \ge L_X : \gamma(P_{n,\hat X}, \kappa^{(j_X)}_X) \ge \gamma(P_{n,\hat e_X}, \kappa^{(j_X)}_X)\}$, where $\hat e_X = \tilde X - \hat X$ is the residual, such that the distance variances at all $j_X \le j^*_X$ are not smaller than that of the residual. If a known frequency band is of interest in the context of a study, e.g., the alpha band of brain signals, one may alternatively select $\beta_X$ by balancing $2^{2\beta_X j_X}\sqrt{\gamma(P_X, \kappa^{(j_X)}_X)}$ over that frequency band. Last, we remark that the computational benefit of the proposed marginal approach to tuning parameter selection is substantial when many tests have to be performed, as in the functional connectivity analysis (Section 7).

6 Simulation Study

In this section we evaluate the numerical performance of our proposed wavelet-based HSIC method, wavHSIC, in terms of both controlling the Type I error probability and statistical power. We also compare it with a few representative existing methods, including:

(a) Pearson Correlation (Pearson). It is a one-sample t-test based on the Fisher-Z transformed correlation coefficients of all subjects. The correlation coefficient for each subject is obtained by applying the Pearson correlation formula to the bivariate time series of the subject, without adjusting for any possible dependence within the time series. It is a popular functional connectivity measure in neuroscience (e.g., He et al., 2012).

(b) Dynamical Correlation (dnm, Dubin and Müller, 2005). It is defined as the expectation of the cosine of the $L^2$ angle between the standardized versions of the two random functions.

(c) Global Temporal Correlation (gtemp, Zhou et al., 2018). It is the integral of the Pearson correlation obtained at each time point.

(d) Bias-Corrected Distance Covariance (dCov-c, Székely and Rizzo, 2013). It is a t-test designed to correct the bias of the distance covariance for high-dimensional multivariate data. We apply it by treating the discrete measurements of the two random functions as multivariate data. If the bias is not corrected, it is equivalent to wavHSIC with $\beta_X = \beta_Y = 0$.

(e) FPCA-Based Distance Covariance (FPCA, Kosorok, 2009). The distance covariance (Székely et al., 2007) is applied to the top FPC scores which cumulatively account for 95% of the variation of each random function. When all FPC scores are used, it is equivalent to wavHSIC with $\beta_X = \beta_Y = 0$.

(f) Functional Linearity Test (KMSZ, Kokoszka et al., 2008). It is an approximate chi-squared test for the nullity of the coefficient function under an assumed functional linear model between the two random functions. The model fitting requires a satisfactory approximation of each random function by its top FPC scores, and we select those which cumulatively account for 95% of the variation of each random function.

The first five methods in the comparison are model-free, while the last is one of the most popular model-based methods in the FDA literature.
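For concreteness, the simplest of these baselines, the global temporal correlation in (c), can be sketched as follows for curves observed on a common grid. The $n \times m$ array layout and the function name are assumptions of this sketch, and the time integral is approximated by an average over grid points.

```python
import numpy as np

def global_temporal_correlation(X, Y):
    """Average over grid points of the cross-subject Pearson correlation
    between X(., t) and Y(., t); X and Y are n x m arrays (subjects x grid)."""
    Xc = X - X.mean(axis=0)  # center across subjects at each time point
    Yc = Y - Y.mean(axis=0)
    num = (Xc * Yc).sum(axis=0)
    den = np.sqrt((Xc ** 2).sum(axis=0) * (Yc ** 2).sum(axis=0))
    return float(np.mean(num / den))  # approximate the time integral
```

Because it averages ordinary correlations, this measure can vanish for dependent but uncorrelated curves, which is exactly the limitation that motivates the HSIC-based test.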
Permutation is used to obtain the p-value for testing independence for wavHSIC, dnm, gtemp and FPCA.

We generated 199 simulated datasets, where the specific choice of 199 prevents the empirical Type I and Type II error probabilities from coinciding with the level of significance 0.05. In each simulated dataset, n = 50 or 200 independent subjects with bivariate functions {(X_i(t), Y_i(t)) : t ∈ [0, 1], i = 1, ..., n} were generated, where for the i-th subject, X_i(t) = Σ_{k=1}^{16} η_{ik} φ_k(t) and Y_i(t) = Σ_{k=1}^{16} ζ_{ik} φ_k(t + 0.2), with φ_{2k−1}(t) = √2 cos(2πkt) and φ_{2k}(t) = √2 sin(2πkt) for k = 1, ..., 8. We considered three settings with different dependency structures of the bivariate functional data, controlled by the FPC scores {(η_{ik}, ζ_{ik}) : k = 1, ..., 16, i = 1, ..., n}. Below, the score variances v_k decay polynomially in k.

• Setting 1. We generated η_{ik} ∼ N(0, v_k), k = 1, ..., 16, and ζ_{ik} ∼ N(0, v_k), k = 1, ..., 16, independently.

• Setting 2. With ρ_k = 0 for k = 1, ..., 8 and ρ_k equal to a positive constant ρ for k = 9, ..., 16, we generated

(η_{ik}, ζ_{ik})^⊤ ∼ N( (0, 0)^⊤, [ v_k, ρ_k v_k ; ρ_k v_k, v_k ] ).

• Setting 3. For k = 1, ..., 8, η_{ik} ∼ N(0, v_k) was generated independently of ζ_{ik} ∼ N(0, v_k). For k = 9, ..., 16, η_{ik} ∼ N(0, v_k) and ζ_{ik} = η_{ik}^2 − E(η_{ik}^2).

Apparently X and Y are independent in Setting 1 and dependent in Settings 2 and 3. In Setting 2, the FPC scores of X and Y are linearly correlated but only at high spectral frequencies, while in Setting 3 they are linearly uncorrelated but dependent, again only at high spectral frequencies, so it is more difficult for all methods to detect dependency in Setting 3 than in Setting 2.

Both functions are measured at m = 64 or 256 equidistant points on the time domain [0, 1], contaminated with noises at signal-to-noise ratio (SNR) 4 or 8, and each curve was denoised using the R package wavethresh. For wavHSIC, we chose the CDJV wavelet basis functions with vanishing moment D = 10 for both X and Y, which leads to α_X = α_Y ≈ 2.902 (Daubechies, 1992). The tuning parameters β_X and β_Y were selected by the method in Section 5. For wavHSIC, dnm, gtemp and FPCA, which compute p-values by permutation, we always used 199 permutations. The results are given in Tables 1–3 for the three settings respectively.

Table 1 shows that all methods are almost always able to control the Type I error probability when the two random functions are truly independent. Relatively, dCov-c seems more likely to detect spurious dependency when (n, m) = (200, 256), and KMSZ is very conservative when n = 50.

Table 1: Empirical Type I error probabilities for the seven methods under Setting 1. The last two columns provide the medians of the selected β_X and β_Y for wavHSIC.

n    m    SNR   Pearson  dnm     gtemp   dCov-c  FPCA    KMSZ    wavHSIC  β_X   β_Y
50   64   4     0.0503   0.0452  0.0553  0.0503  0.0302  0.0151  0.0151   0.87  0.75
50   64   8     0.0553   0.0503  0.0402  0.0503  0.0302  0.0050  0.0302   0.81  0.70
50   256  4     0.0402   0.0603  0.0653  0.0603  0.0452  0.0050  0.0553   0.76  0.64
50   256  8     0.0603   0.0503  0.0603  0.0603  0.0402  0.0101  0.0603   0.80  0.62
200  64   4     0.0503   0.0653  0.0553  0.0653  0.0603  0.0452  0.0402   0.87  0.76
200  64   8     0.0503   0.0603  0.0302  0.0754  0.0452  0.0251  0.0251   0.82  0.71
200  256  4     0.0603   0.0553  0.0553  0.0955  0.0452  0.0452  0.0452   0.76  0.64
200  256  8     0.0503   0.0653  0.0402  0.0854  0.0452  0.0452  0.0553   0.76  0.62
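As noted above, 199 replicates were used so that empirical error probabilities, being multiples of 1/199, can never equal the nominal level 0.05 exactly. The entries of Table 1 are consistent with rounded multiples of 1/199; the table values below are copied from Table 1.

```python
from fractions import Fraction

# No multiple of 1/199 equals 0.05 = 1/20 exactly, since 20k = 199 has no
# integer solution.
assert all(Fraction(k, 199) != Fraction(1, 20) for k in range(200))

# Several Table 1 entries match rounded rejection counts out of 199 replicates.
for count, reported in [(10, 0.0503), (9, 0.0452), (11, 0.0553), (3, 0.0151)]:
    assert round(count / 199, 4) == reported
```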
Tables 2 and 3 show that the statistical powers of all methods typically improve when one of n, m and SNR increases under Setting 2, but not necessarily under Setting 3, except for KMSZ and wavHSIC. This demonstrates, to some extent, the difficulty of detecting dependency under Setting 3. Except for wavHSIC, all model-free methods have very low powers in all scenarios under either Setting 2 or 3, which indicates their poor performance in detecting linear dependency at high frequencies or nonlinear dependency. The performance of KMSZ is satisfactory for n = 200 under Setting 2, where the relationship between X and Y is truly linear, but it is poor for either the nonlinear dependency in Setting 3 or small samples.

Tables 2 and 3 also demonstrate the appealing performance of wavHSIC. It is always the most powerful method, and substantially better than the other methods. Only the powers of KMSZ are comparable with those of wavHSIC, when the sample size n = 200 is large and the linearity assumption is valid under Setting 2. For fixed (n, m, SNR), the medians of the selected parameters β_X and β_Y for wavHSIC are always similar between Settings 2 and 3, since they were tuned marginally, regardless of the dependency structure. On average, both β_X and β_Y were considerably away from zero, which confirms the need and benefit of choosing them properly to enhance the detection sensitivity of wavHSIC.

Table 2: Empirical powers for the seven methods under Setting 2. The last two columns provide the medians of the selected β_X and β_Y for wavHSIC.

n    m    SNR   Pearson  dnm     gtemp   dCov-c  FPCA    KMSZ    wavHSIC  β_X   β_Y
50   64   4     0.0553   0.0603  0.0603  0.0603  0.0452  0.0603  0.2563   0.87  0.75
50   64   8     0.0553   0.0653  0.0603  0.0653  0.0603  0.2111  0.7487   0.81  0.70
50   256  4     0.0653   0.0553  0.0553  0.0955  0.0603  0.4422  0.8392   0.76  0.65
50   256  8     0.0603   0.0503  0.0603  0.1055  0.0754  0.5779  0.9397   0.78  0.62
200  64   4     0.0804   0.0603  0.0653  0.1005  0.0704  0.9799  0.9849   0.87  0.76
200  64   8     0.0854   0.0854  0.0653  0.1407  0.1106  1.0000  1.0000   0.82  0.71
200  256  4     0.1156   0.1307  0.0603  0.1809  0.1508  1.0000  1.0000   0.76  0.64
200  256  8     0.1256   0.1256  0.0653  0.2312  0.1608  1.0000  1.0000   0.75  0.62

Table 3: Empirical powers for the seven methods under Setting 3. The last two columns provide the medians of the selected β_X and β_Y for wavHSIC.

n    m    SNR   Pearson  dnm     gtemp   dCov-c  FPCA    KMSZ    wavHSIC  β_X   β_Y
50   64   4     0.0653   0.0653  0.0503  0.0854  0.0754  0.0452  0.3367   0.88  0.78
50   64   8     0.0603   0.0754  0.0603  0.0854  0.0754  0.0201  0.4673   0.83  0.75
50   256  4     0.0503   0.0653  0.0754  0.1005  0.0905  0.0251  0.3266   0.75  0.70
50   256  8     0.0503   0.0804  0.0653  0.1055  0.0905  0.0101  0.3920   0.78  0.67
200  64   4     0.0553   0.0704  0.0603  0.0704  0.0402  0.2161  0.7085   0.87  0.78
200  64   8     0.0503   0.0603  0.0704  0.0804  0.0452  0.1457  0.8492   0.82  0.74
200  256  4     0.0503   0.0553  0.0804  0.0603  0.0452  0.1859  0.8643   0.76  0.69
200  256  8     0.0603   0.0704  0.0653  0.0653  0.0452  0.1508  0.9045   0.75  0.65
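The data-generating mechanism of the three simulation settings above can be sketched as follows. Since the exact variance-decay and correlation constants are not reproduced here, the decay exponent 1.5 and the high-frequency correlation ρ = 0.5 are illustrative assumptions; noise-free curves are generated, whereas the study adds measurement noise at the stated SNR and removes it by wavelet thresholding.

```python
import numpy as np

def fourier_basis(t, K=16):
    """phi_{2k-1}(t) = sqrt(2)cos(2*pi*k*t), phi_{2k}(t) = sqrt(2)sin(2*pi*k*t)."""
    cols = []
    for k in range(1, K // 2 + 1):
        cols.append(np.sqrt(2.0) * np.cos(2 * np.pi * k * t))
        cols.append(np.sqrt(2.0) * np.sin(2 * np.pi * k * t))
    return np.column_stack(cols)                      # shape (len(t), K)

def simulate(setting, n=200, m=64, decay=1.5, rho=0.5, seed=0):
    """FPC scores for Settings 1-3; dependence enters only for k >= 9."""
    rng = np.random.default_rng(seed)
    K = 16
    sd = (np.arange(1, K + 1) ** (-decay)) ** 0.5     # score standard deviations
    eta = rng.standard_normal((n, K)) * sd
    zeta = rng.standard_normal((n, K)) * sd
    hi = slice(8, K)                                  # high-frequency scores k = 9..16
    if setting == 2:    # linearly correlated high-frequency scores, corr = rho
        zeta[:, hi] = rho * eta[:, hi] + np.sqrt(1 - rho ** 2) * zeta[:, hi]
    elif setting == 3:  # uncorrelated but dependent: centered squares
        zeta[:, hi] = eta[:, hi] ** 2 - sd[hi] ** 2
    t = np.linspace(0.0, 1.0, m, endpoint=False)
    X = eta @ fourier_basis(t, K).T                   # X_i(t)
    Y = zeta @ fourier_basis((t + 0.2) % 1.0, K).T    # Y_i(t): phase-shifted basis
    return X, Y, eta, zeta

X2, Y2, eta2, zeta2 = simulate(2, n=4000)
X3, Y3, eta3, zeta3 = simulate(3, n=4000)
```

In Setting 2 the high-frequency score pairs are linearly correlated (correlation about ρ), while in Setting 3 the correlation of (η_{ik}, ζ_{ik}) for k ≥ 9 is essentially zero even though ζ_{ik} is a deterministic function of η_{ik}, which is exactly why linear-correlation-based methods struggle there.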
In this section we applied our proposed method to study human brain functional connectivity using the MEG dataset collected by the HCP. MEG measures the magnetic fields generated by human neuronal activities with a high temporal resolution. Before source reconstruction, the signals from all MEG sensors outside the head were preprocessed following the HCP MEG pipeline, and the preprocessed data are publicly accessible from the HCP website. To obtain the electric activity signals from cortex regions, we applied the source reconstruction procedure for the MEG signals using the linearly constrained minimum variance beamforming method in the MATLAB package
FieldTrip. To study the functional dependency between cortex regions under motor activities, we focused on the motor task trials in which subjects moved their right hands. There are 61 subjects with 75.38 trials per subject on average. Within each trial, the signal was recorded about every 2 ms. Since a subject typically finished the previous movement and received a new cue between times −0.25 and 0 (in seconds, relative to the cue) of the next trial, we considered the time domain [−0.25, 0.75]. We obtained 8,004 signal curves in the cerebral cortex according to the atlas provided by Glasser et al. (2016), and each signal was denoised by the empirical Bayes soft-thresholding method in the R package wavethresh.

We applied the proposed method wavHSIC to perform an independence test for every pair of the MEG signals. To implement wavHSIC, we chose the CDJV wavelet basis functions with vanishing moment D = 4, and the tuning parameter β was selected by the method in Section 5. For comparison, we also provided the results for the model-based test KMSZ and two model-free tests,
Pearson and
FPCA. FPCA was based on the top FPC scores which cumulatively account for 95% of the variation of each signal. The p-value for testing the independence between each pair of signals was obtained by 1,999 permutations for wavHSIC and
FPCA.

The empirical cumulative distribution functions for the p-values of the four methods are given in Figure 1, which shows that wavHSIC is more sensitive in detecting connectivity than the other methods. To evaluate and compare the four methods in the presence of multiple testing, we set the same discovery rate of 60% to control the number of edges, or the sparsity, of each brain connectivity network, which is important in evaluating the reliability of brain network metrics (e.g., Van Wijk et al., 2010; Tsai, 2018). In this analysis, we focus on the sensorimotor areas 4, 3a, 3b, 1 and 2 on the left and right hemispheres, as illustrated in Figure 2 (c), which are most related to the motor task trials (Glasser et al., 2016). With a controlled discovery rate, we expect an excellent connectivity detection method to identify plenty of edges within these areas.

Figure 2 provides the functional connectivity networks within these sensorimotor areas obtained by the four methods. The nodes in each area were ordered from the superior-medial cortex to the inferior-lateral cortex following the atlas "atlas MMP1.0 4k.mat" in
FieldTrip. Compared with
KMSZ and wavHSIC , Pearson and
FPCA are substantially less sensitive in detecting functional connectivity, and their corresponding networks are less structured. This demonstrates the superior performance of both
KMSZ and wavHSIC in identifying connectivity patterns within these areas, which are anatomically connected and functionally related to the motor task trials. Different from the overall homogeneous pattern in the network for
KMSZ, several structured dark strips appear in the network obtained by wavHSIC. This indicates that wavHSIC can identify two sub-areas in each sensorimotor area, the top left (TL) and bottom right (BR) corners respectively of each colored square in Figure 2 (c), such that the signals within all ten TL sub-areas or within all ten BR sub-areas are strongly connected, while the connectivities between TL and BR sub-areas are generally weak. According to Glasser et al. (2016), the five BR sub-areas in the same hemisphere correspond to the face and eye portions, while the five TL sub-areas correspond to the upper limb, trunk and lower limb portions. Note that the motor task involved in this dataset is raising the right hand, so the connectivity patterns detected by wavHSIC are intuitively and anatomically interpretable.
Figure 1: Empirical cumulative distribution functions of the p-values for testing the independence between every pair of the 8,004 signals for each method.

(a) Pearson  (b) FPCA  (c) Brain Cortex  (d) KMSZ  (e) wavHSIC  (f) Label

Figure 2: Functional connectivity networks of the five sensorimotor areas in the left and right hemispheres. In the adjacency matrices in (a), (b), (d) and (e), obtained by the four methods respectively, a bright entry indicates significant dependency between the corresponding signal pair while a dark one indicates otherwise. The black subregion in (c) corresponds to the face and eye portions and the rest of the colored area corresponds to the upper limb, trunk and lower limb portions.
Acknowledgements
The research of Xiaoke Zhang is partially supported by the US National Science Foundation(NSF) under grant DMS-1832046. The research of Raymond K. W. Wong is partially supportedby the US NSF under grants DMS-1806063, DMS-1711952 and CCF-1934904.
A Appendix: Background Materials
A.1 Distance-Induced Characteristic Kernels
Characteristic kernels are required to construct the HSIC for two random functions under the RKHS framework. Such a kernel can be generated by a semi-metric of strong negative type.

Definition S3 (Strong Negative Type Semi-Metric). A semi-metric ρ : Z × Z → [0, ∞) defined on a non-empty set Z is of negative type if Σ_{i=1}^n Σ_{j=1}^n α_i α_j ρ(z_i, z_j) ≤ 0 for all z_1, ..., z_n ∈ Z and α_1, ..., α_n ∈ R such that Σ_{i=1}^n α_i = 0, n ≥ 2. Furthermore, it is of strong negative type if, for any two probability measures P and P′ on Z such that ∫_Z ρ(z, z_0) dP(z), ∫_Z ρ(z, z_0) dP′(z) < ∞ for some z_0 ∈ Z, we have ∫ ρ d{(P − P′) × (P − P′)} = 0 if and only if P = P′.

Proposition S1 shows that a kernel induced by a strong negative type semi-metric is characteristic.
Proposition S1.
Let ρ be a semi-metric defined on Z and z_0 ∈ Z. The induced kernel

κ_ρ(z, z′) = ρ(z, z_0) + ρ(z′, z_0) − ρ(z, z′),  z, z′ ∈ Z,

is symmetric and positive definite. Moreover, κ_ρ is characteristic if and only if ρ is of strong negative type.

Obviously distance-induced kernels are symmetric. For the proof of Proposition S1, see Lemma 2.1 of Berg et al. (1984) for the positive definiteness, and Lyons (2013) and Sejdinovic et al. (2013) for the characteristic property. Since the set Z of interest often contains zero, in this paper we always set z_0 = 0 for any distance-induced kernel κ_ρ, for simplicity and convenience.

A.2 Besov Spaces and Norms
The Besov space is a generalization of the Sobolev space, which is widely used in nonparametric regression under the RKHS framework. A Besov space B^α_{p,q}[0, 1], with p, q, α > 0, contains all functions whose Besov norm ‖·‖_{B^α_{p,q}} is finite. Explicitly, with any integer r ≥ 1, define the r-th order difference of a function f by

Δ_h^r(f, x) = Σ_{k=0}^{r} (r choose k) (−1)^{r−k} f(x + kh),

and its r-th order modulus of continuity by

ω_r(f, t)_p = sup_{0 ≤ h ≤ t} ‖Δ_h^r(f, ·)|_{[0, 1−rh]}‖_{L^p},

where Δ_h^r(f, ·)|_{[0, 1−rh]} represents Δ_h^r(f, ·) restricted to [0, 1 − rh] and ‖·‖_{L^p} is the L^p norm. Then the Besov norm of f is defined by ‖f‖_{B^α_{p,q}} = ‖f‖_{L^p} + |f|_{B^α_{p,q}}, where

|f|_{B^α_{p,q}} = [ ∫_0^∞ { ω_r(f, t)_p / t^α }^q dt/t ]^{1/q}.

For the same α, the Besov norms generated by different values of r > α are equivalent; throughout we take r = ⌊α⌋ + 1, where ⌊α⌋ is the greatest integer less than or equal to α.

The Besov norm (semi-norm) generalizes some traditional smoothness measures, such as the Sobolev semi-norm |·|_{W^k_p} given by

|f|_{W^k_p} = ( ∫ |D^k f|^p dx )^{1/p},  1 ≤ p ≤ ∞,

where D^k is the k-th order weak-derivative operator.

B Appendix: Technical Proofs
B.1 Proof of Theorem 1
We first list two lemmas on some properties of negative type semi-metrics, which will be neededin the proof of Theorem 1.
Definition S4 (Radial Positive Definite Function). A real function F defined on R_+ is called radial positive definite on the semi-metric space (Z, ρ) if F is continuous and

Σ_{j=1}^n Σ_{k=1}^n F(ρ(z_j, z_k)) c_j c_k ≥ 0,

for all choices of n ≥ 1 points z_1, ..., z_n ∈ Z and real numbers c_1, ..., c_n. We denote the set of all radial positive definite functions by RPD(Z).

Lemma S2. The following hold in any semi-metric space Z.
(a) RPD(Z) is never empty.
(b) If F_1, F_2 ∈ RPD(Z), then F_1 · F_2 ∈ RPD(Z).
(c) If F_j ∈ RPD(Z) and 0 ≤ c_j < ∞, j = 1, ..., n, then Σ_{j=1}^n c_j F_j ∈ RPD(Z).
(d) If F_j ∈ RPD(Z), j = 1, 2, ..., and the F_j converge point-wise to a continuous limit F, then F ∈ RPD(Z).
(e) For the spaces (L_p, ‖·‖_p) and (ℓ_p, ‖·‖_p) with 0 < p ≤ 2, the function exp(−t^α) is radial positive definite for 0 < α ≤ p.

Lemma S2 is a combination of Theorems 4.4 and 4.10 of Wells and Williams (2012).

Lemma S3 (Theorem 4.5, Wells and Williams (2012)). In a semi-metric space (Z, ρ), the following are equivalent:
(a) ρ is of negative type;
(b) the function exp(−λt) belongs to RPD(Z, ρ) for all λ > 0;
(c) (Z, ρ^{1/2}) is embeddable in a Hilbert space.

Proof of Theorem 1. By Proposition S1, it suffices to prove that ρ_{b^α_{p,q}} is of strong negative type. Lemmas S2 (e) and S3 (a) ensure that ¯ρ_j(f, g) := ‖θ_j^f − θ_j^g‖_p^q, j = −1, 0, 1, ..., are of negative type for q ≤ p ≤ 2, i.e., F_j(t) = exp(−2^{sjq} t) belongs to RPD(Z, ¯ρ_j). For any finite product, by Lemma S2 (b),

Π_{j=−1}^{n} F_j(¯ρ_j) = exp( − Σ_{j=−1}^{n} 2^{sjq} ¯ρ_j )   (3)

belongs to RPD(Z). Lemma S2 (d) ensures that the continuous point-wise limit of (3) as n → ∞, i.e., exp(−ρ_{b^α_{p,q}}), belongs to RPD(Z). Therefore, by Lemma S3, ρ_{b^α_{p,q}} is of negative type on Z. The separability of B^α_{p,q}[0, 1] ensures that ρ_{b^α_{p,q}} is of strong negative type.

B.2 Proof of Theorem 2
We first present a lemma that will be used to prove Theorem 2.
Lemma S4.
Let {(X_i(·), Y_i(·))}_{i=1}^n be i.i.d. fully observed random samples from a probability measure P_XY = P_X P_Y defined on X ⊗ Y. Then as n → ∞,

nγ(P_{n,XY}, κ_X, κ_Y) ⇝ Σ_{r=1}^∞ Σ_{s=1}^∞ μ_r ν_s N_{rs}^2,   (4)

where N_{rs} ∼ N(0, 1), r, s ∈ N, are i.i.d. and {μ_r}_{r=1}^∞ and {ν_s}_{s=1}^∞ are the eigenvalues of the integral kernel operators S_{κ̌_X} and S_{κ̌_Y}, respectively. If P_XY ≠ P_X P_Y, then nγ(P_{n,XY}, κ_X, κ_Y) → ∞ in probability as n → ∞.

Lemma S4 is exactly Theorem 33 of Sejdinovic et al. (2013), which provides the weak convergence result of the HSIC for fully observed random functions.
Proof of Theorem 2.
According to Lemma S4, it suffices to prove that the difference between the HSIC based on the original curves {X_i(·), Y_i(·)}_{i=1}^n and the HSIC based on the denoised curves {X̂_i, Ŷ_i}_{i=1}^n is o_p(1/n), where {X̂_i, Ŷ_i}_{i=1}^n are obtained by Step 1 in Section 3. Throughout this proof, write ‖·‖ for ‖·‖_{H(κ_X ⊗ κ_Y)}. By Definition 1,

n |γ(P_{n,XY}, κ_X, κ_Y) − γ(P_{n,X̂Ŷ}, κ_X, κ_Y)|
= n^{−1} | ‖κ_X^⊤ H κ_Y‖^2 − ‖κ̂_X^⊤ H κ̂_Y‖^2 |
= n^{−1} | ‖κ_X^⊤ H κ_Y‖ − ‖κ̂_X^⊤ H κ̂_Y‖ | ( ‖κ_X^⊤ H κ_Y‖ + ‖κ̂_X^⊤ H κ̂_Y‖ )
≤ n^{−1} ‖κ_X^⊤ H κ_Y − κ̂_X^⊤ H κ̂_Y‖ ( ‖κ_X^⊤ H κ_Y‖ + ‖κ̂_X^⊤ H κ̂_Y‖ )
≤ 2 n^{−1/2} ‖κ_X^⊤ H κ_Y − κ̂_X^⊤ H κ̂_Y‖ · n^{−1/2} ‖κ_X^⊤ H κ_Y‖ + n^{−1} ‖κ_X^⊤ H κ_Y − κ̂_X^⊤ H κ̂_Y‖^2,

where κ_X^⊤ = [κ_X(·, X_1), ..., κ_X(·, X_n)], κ_Y^⊤ = [κ_Y(·, Y_1), ..., κ_Y(·, Y_n)], κ̂_X^⊤ = [κ_X(·, X̂_1), ..., κ_X(·, X̂_n)] and κ̂_Y^⊤ = [κ_Y(·, Ŷ_1), ..., κ_Y(·, Ŷ_n)]. By (4),

n^{−1/2} ‖κ_X^⊤ H κ_Y‖ ⇝ ( Σ_{r=1}^∞ Σ_{s=1}^∞ μ_r ν_s N_{rs}^2 )^{1/2} = O_p(1),   (5)

so it suffices to prove that ‖κ_X^⊤ H κ_Y − κ̂_X^⊤ H κ̂_Y‖ = o_p(n^{1/2}).
Notice that ‖κ_X^⊤ H κ_Y − κ̂_X^⊤ H κ̂_Y‖ (with ‖·‖ the norm of H(κ_X ⊗ κ_Y)) can be bounded as follows:

‖κ_X^⊤ H κ_Y − κ̂_X^⊤ H κ̂_Y‖
= ‖κ_X^⊤ H (κ_Y − κ̂_Y) + (κ_X − κ̂_X)^⊤ H κ̂_Y‖
≤ ‖κ_X^⊤ H (κ_Y − κ̂_Y)‖ + ‖(κ_X − κ̂_X)^⊤ H κ̂_Y‖
≤ ‖κ_X^⊤ H (κ_Y − κ̂_Y)‖ + ‖(κ_X − κ̂_X)^⊤ H κ_Y‖ + ‖(κ_X − κ̂_X)^⊤ H (κ_Y − κ̂_Y)‖.

The squares of the three terms equal, respectively,

tr( Γ̌_X ⟨κ_Y − κ̂_Y, κ_Y^⊤ − κ̂_Y^⊤⟩_{H(κ_Y)} ),
tr( Γ̌_Y ⟨κ_X − κ̂_X, κ_X^⊤ − κ̂_X^⊤⟩_{H(κ_X)} ),
tr( ⟨κ_X − κ̂_X, κ_X^⊤ − κ̂_X^⊤⟩_{H(κ_X)} H ⟨κ_Y − κ̂_Y, κ_Y^⊤ − κ̂_Y^⊤⟩_{H(κ_Y)} H ),

where Γ̌_X = H Γ_X H and Γ̌_Y = H Γ_Y H are the centered Gram matrices. Using the fact that, for symmetric positive semi-definite matrices A and B,

tr(AB) = vec(A)^⊤ vec(B) ≤ ‖A‖_F ‖B‖_F = ( tr(A^2) tr(B^2) )^{1/2} ≤ tr(A) tr(B),   (*)

the three traces are bounded by

tr(Γ̌_X) tr( ⟨κ_Y − κ̂_Y, κ_Y^⊤ − κ̂_Y^⊤⟩_{H(κ_Y)} ),
tr(Γ̌_Y) tr( ⟨κ_X − κ̂_X, κ_X^⊤ − κ̂_X^⊤⟩_{H(κ_X)} ),
tr( ⟨κ_X − κ̂_X, κ_X^⊤ − κ̂_X^⊤⟩_{H(κ_X)} ) tr( ⟨κ_Y − κ̂_Y, κ_Y^⊤ − κ̂_Y^⊤⟩_{H(κ_Y)} ),

respectively, which are all o_p(n) by the following two facts, where (Z, Ẑ) stands for (X, X̂) or (Y, Ŷ); hence ‖κ_X^⊤ H κ_Y − κ̂_X^⊤ H κ̂_Y‖ = o_p(n^{1/2}).

• tr(Γ̌_Z) = O_p(n), because ∫ κ̌_Z(z, z) dP_Z(z) < ∞, which is ensured by the assumptions in Lemma 1.
• tr( ⟨κ_Z − κ̂_Z, κ_Z^⊤ − κ̂_Z^⊤⟩_{H(κ_Z)} ) = Σ_{i=1}^n ‖κ_Z(·, Z_i) − κ_Z(·, Ẑ_i)‖^2_{H(κ_Z)} = o_p(1), because

‖κ_Z(·, Z_i) − κ_Z(·, Ẑ_i)‖^2_{H(κ_Z)} = κ_Z(Z_i, Z_i) + κ_Z(Ẑ_i, Ẑ_i) − 2κ_Z(Z_i, Ẑ_i)
= 2‖Z_i‖_{b^{β_Z}} + 2‖Ẑ_i‖_{b^{β_Z}} − 2( ‖Z_i‖_{b^{β_Z}} + ‖Ẑ_i‖_{b^{β_Z}} − ‖Z_i − Ẑ_i‖_{b^{β_Z}} )
= 2‖Z_i − Ẑ_i‖_{b^{β_Z}},

and ‖Z_i − Ẑ_i‖_{b^{β_Z}} = o_p(n^{−1}), i = 1, ..., n, is ensured by Lemma 1.

B.3 Proof of Theorem 3
We first introduce a few notations. To perform a permutation test, let S(n) = {σ_1, ..., σ_{n!}} be the group of permutations of {1, ..., n}. For a permutation σ randomly selected from S(n), let γ(P^σ_{n,X̂Ŷ}, κ_X, κ_Y) = n^{−2} tr(Γ_{X̂} H Γ_{Ŷ}(σ) H), where Γ_{Ŷ}(σ) is generated from Γ_{Ŷ} with its rows and columns permuted according to σ. Let R be the rank of γ(P_{n,X̂Ŷ}, κ_X, κ_Y) among all possible permuted HSICs. Then we reject H_0 : P_XY = P_X P_Y if p_{X̂Ŷ} = R/n! ≤ α, where p_{X̂Ŷ} denotes the p-value of the permutation test enumerating all permutations and α is the level of significance.

In practice, it is impractical to consider all permutations from S(n). Hence we use a Monte Carlo approximation by randomly choosing B permutations σ_1, ..., σ_B ∈ S(n)\{id}, where id refers to no permutation, and calculating γ(P_{n,X̂Ŷ}, κ_X, κ_Y), γ(P^{σ_1}_{n,X̂Ŷ}, κ_X, κ_Y), ..., γ(P^{σ_B}_{n,X̂Ŷ}, κ_X, κ_Y). With a notational abuse, let R be the rank of γ(P_{n,X̂Ŷ}, κ_X, κ_Y) among these B + 1 values, and we reject H_0 if p̂_{X̂Ŷ} = R/(B + 1) ≤ α, where p̂_{X̂Ŷ} is the p-value of the permutation test enumerating a finite sample of size B from S(n).

If the value of γ(P_{n,X̂Ŷ}, κ_X, κ_Y) is repeated several times in {γ(P^{σ_1}_{n,X̂Ŷ}, κ_X, κ_Y), ..., γ(P^{σ_B}_{n,X̂Ŷ}, κ_X, κ_Y)} with B ≤ n!, the rank R of γ(P_{n,X̂Ŷ}, κ_X, κ_Y) is determined in one of the following two ways proposed by Rindt et al. (2020).

• Breaking ties at random: R is distributed uniformly over the ranks of the γ(P^σ_{n,X̂Ŷ}, κ_X, κ_Y) that share the value of γ(P_{n,X̂Ŷ}, κ_X, κ_Y);

• Breaking ties conservatively: R is the largest among the ranks of the γ(P^σ_{n,X̂Ŷ}, κ_X, κ_Y) that share the value of γ(P_{n,X̂Ŷ}, κ_X, κ_Y).

Next we list two lemmas which will be useful to prove Theorem 3.

Lemma S5.
For σ randomly selected from S(n), γ(P^σ_{n,X̂Ŷ}, κ_X, κ_Y) → 0 in probability as n → ∞.

Lemma S5 is a direct application of Theorem 3 of Rindt et al. (2020) with d = 2.

Lemma S6. Suppose that the alternative hypothesis H_1 : P_XY ≠ P_X P_Y is true and the noises are i.i.d. Let {t^{(1)}_n(D̂) ≥ ··· ≥ t^{(n!)}_n(D̂)} be the ordered values of the HSIC computed on all permutations of the denoised curves, {γ(P^{σ_1}_{n,X̂Ŷ}, κ_X, κ_Y), ..., γ(P^{σ_{n!}}_{n,X̂Ŷ}, κ_X, κ_Y)}. Let a = ⌊n!α⌋ for any level of significance α ∈ (0, 1). Then t^{(a)}_n(D̂) → 0 in probability as n → ∞.

Lemma S6 is a direct application of Theorem 4 of Rindt et al. (2020) with d = 2.

Proof of Theorem 3.
Denote the fully observed dataset by D = {(X_i, Y_i) : i = 1, ..., n} and the denoised dataset by D̂ = {(X̂_i, Ŷ_i) : i = 1, ..., n}. For a permutation σ ∈ S(n), denote the permuted datasets by σ(D) and σ(D̂), resulting in the permuted HSICs γ(P^σ_{n,XY}, κ_X, κ_Y) and γ(P^σ_{n,X̂Ŷ}, κ_X, κ_Y), respectively.

If H_0 : P_XY = P_X P_Y is true, then for any σ ∈ S(n), D and σ(D) have the same distribution, and D̂ and σ(D̂) have the same distribution, due to the facts that the noises across subjects are i.i.d. and that the denoising procedure in Section 3 is applied separately to each subject. For B permutations σ_1, ..., σ_B randomly selected from S(n)\{id}, (D, σ_1(D), ..., σ_B(D)) is an exchangeable vector, and thus (γ(P_{n,X̂Ŷ}, κ_X, κ_Y), γ(P^{σ_1}_{n,X̂Ŷ}, κ_X, κ_Y), ..., γ(P^{σ_B}_{n,X̂Ŷ}, κ_X, κ_Y)) is exchangeable. By breaking ties at random, each entry is equally likely to have any given rank, so the rank of γ(P_{n,X̂Ŷ}, κ_X, κ_Y) is uniformly distributed on {1, ..., B + 1}. Therefore the Type I error rate can be controlled for any level of significance α ∈ (0, 1).

If H_1 : P_XY ≠ P_X P_Y is true, then, by the definition of t^{(a)}_n(D̂) in Lemma S6, we reject H_0 : P_XY = P_X P_Y if γ(P_{n,X̂Ŷ}, κ_X, κ_Y) > t^{(a)}_n(D̂). For any α ∈ (0, 1),

lim_{n→∞} P(p_{X̂Ŷ} ≤ α) ≥ lim_{n→∞} P( γ(P_{n,X̂Ŷ}, κ_X, κ_Y) > t^{(a)}_n(D̂) ) = 1,

since γ(P_{n,X̂Ŷ}, κ_X, κ_Y) → γ(P_XY, κ_X, κ_Y) > 0 as n → ∞ by the proof of Theorem 2, while t^{(a)}_n(D̂) → 0 in probability by Lemma S6. For a finite number B of permutations, the p-value satisfies p̂_{X̂Ŷ} = (1 + U)/(B + 1), where U ∼ Binomial(B, p_{X̂Ŷ}). If U = 0, then p̂_{X̂Ŷ} = 1/(B + 1) ≤ α (provided B ≥ 1/α − 1) and we reject the null hypothesis. Since p_{X̂Ŷ} → 0 in probability, for any ε_1, ε_2 > 0 and n large enough we have P(p_{X̂Ŷ} ≤ ε_1) ≥ 1 − ε_2. Hence, for n large enough,

P( p̂_{X̂Ŷ} ≤ α ) ≥ P( p̂_{X̂Ŷ} = 1/(B + 1) | p_{X̂Ŷ} ≤ ε_1 ) P( p_{X̂Ŷ} ≤ ε_1 ) ≥ (1 − ε_1)^B (1 − ε_2),

which can be made arbitrarily close to 1 by letting ε_1, ε_2 → 0.

References
Antoniadis, A. and T. Sapatinas (2007). Estimation and inference in functional mixed-effectsmodels.
Computational Statistics & Data Analysis 51 (10), 4793–4813.Berg, C., J. P. R. Christensen, and P. Ressel (1984).
Harmonic analysis on semigroups: theoryof positive definite and related functions , Volume 100. Springer.Cai, T. T. and L. D. Brown (1999). Wavelet estimation for samples with random uniform design.
Statistics & Probability Letters 42 (3), 313–321.Chen, F., Q. Jiang, Z. Feng, and L. Zhu (2020). Model checks for functional linear regressionmodels based on projected empirical processes.
Computational Statistics & Data Analysis 144 ,106897.Cohen, A., I. Daubechies, B. Jawerth, and P. Vial (1993). Multiresolution analysis, waveletsand fast algorithms on an interval.
Comptes rendus de l’Acad´emie des sciences. S´erie 1,Math´ematique 316 (5), 417–421.Daubechies, I. (1992).
Ten lectures on wavelets , Volume 61. SIAM.DeVore, R. A. and G. G. Lorentz (1993).
Constructive approximation , Volume 303. Springer-Verlag Berlin Heidelberg.Donoho, D. L., I. M. Johnstone, G. Kerkyacharian, and D. Picard (1995). Wavelet shrinkage:asymptopia?
Journal of the Royal Statistical Society: Series B (Methodological) 57 (2), 301–337.Dubin, J. A. and H.-G. M¨uller (2005). Dynamical correlation for multivariate longitudinal data.
Journal of the American Statistical Association 100 (471), 872–881.Eubank, R. L. and T. Hsing (2008). Canonical correlation for stochastic processes.
StochasticProcesses and their Applications 118 (9), 1634–1661.Ferraty, F. and P. Vieu (2006).
Nonparametric functional data analysis: theory and practice .Springer, New York.Glasser, M. F., T. S. Coalson, E. C. Robinson, C. D. Hacker, J. Harwell, E. Yacoub, K. Ugurbil,J. Andersson, C. F. Beckmann, M. Jenkinson, S. M. Smith, and D. C. Van Essen (2016, Aug).A multi-modal parcellation of human cerebral cortex.
Nature 536 (7615), 171–178.Gretton, A., O. Bousquet, A. Smola, and B. Sch¨olkopf (2005). Measuring statistical dependencewith hilbert-schmidt norms. In
International conference on algorithmic learning theory , pp.63–77. Springer.Gretton, A., K. Fukumizu, C. H. Teo, L. Song, B. Sch¨olkopf, and A. J. Smola (2008). A kernelstatistical test of independence. In
Advances in neural information processing systems, pp. 585–592.

Guo, W. (2002). Functional mixed effects models.
Biometrics 58 (1), 121–128.He, G., H.-G. M¨uller, and J.-L. Wang (2003). Functional canonical analysis for square integrablestochastic processes.
Journal of Multivariate Analysis 85 (1), 54–77.He, J., O. Carmichael, E. Fletcher, B. Singh, A.-M. Iosif, O. Martinez, B. Reed, A. Yonelinas,and C. DeCarli (2012). Influence of functional connectivity and structural mri measures onepisodic memory.
Neurobiology of Aging 33 (11), 2612–2620.Huang, J. Z., C. O. Wu, and L. Zhou (2002). Varying-coefficient models and basis functionapproximations for the analysis of repeated measurements.
Biometrika 89 (1), 111–128.Johnstone, I. M. and B. W. Silverman (2005). Empirical bayes selection of wavelet thresholds.
Annals of Statistics 33 (4), 1700–1752.Kokoszka, P., I. Maslova, J. Sojka, and L. Zhu (2008). Testing for lack of dependence in thefunctional linear model.
Canadian Journal of Statistics 36 (2), 207–222.Kosorok, M. R. (2009). Discussion of: Brownian distance covariance.
Annals of Applied Statis-tics 3 (4), 1270–1278.Kovac, A. and B. W. Silverman (2000). Extending the scope of wavelet regression methods bycoefficient-dependent thresholding.
Journal of the American Statistical Association 95 (449),172–183.Lee, C., X. Zhang, and X. Shao (2020). Testing conditional mean independence for functionaldata.
Biometrika, In press.

Leurgans, S. E., R. A. Moyeed, and B. W. Silverman (1993). Canonical correlation analysis when the data are curves.
Journal of the Royal Statistical Society. Series B (Methodological) 55 (3),725–740.Lyons, R. (2013). Distance covariance in metric spaces.
The Annals of Probability 41 (5), 3284–3305.Morettin, P. A., A. Pinheiro, and B. Vidakovic (2017).
Wavelets in functional data analysis .Springer.Morris, J. S. (2015). Functional regression.
Annual Review of Statistics and Its Application 2 (1),321–359.Ogden, R. T. (1997).
Essential wavelets for statistical applications and data analysis . SpringerScience & Business Media.Patilea, V., C. S´anchez-Sellero, and M. Saumard (2016). Testing the predictor effect on afunctional response.
Journal of the American Statistical Association 111 (516), 1684–1695.Pensky, M. and B. Vidakovic (2001). On non-equally spaced wavelet regression.
Annals of theInstitute of Statistical Mathematics 53 (4), 681–690.Ramsay, J. and B. Silverman (2005).
Functional data analysis. Springer, New York.

Rindt, D., D. Sejdinovic, and D. Steinsaltz (2020). Consistency of permutation tests for HSIC and dHSIC. arXiv preprint arXiv:2005.06573.

Sang, P., L. Wang, and J. Cao (2019). Weighted empirical likelihood inference for dynamical correlations.
Computational Statistics & Data Analysis 131 , 194–206.Sejdinovic, D., B. Sriperumbudur, A. Gretton, and K. Fukumizu (2013). Equivalence of distance-based and rkhs-based statistics in hypothesis testing.
Annals of Statistics 41 (5), 2263–2291.Shen, Q. and J. Faraway (2004). An f test for linear models with functional responses.
StatisticaSinica 14 , 1239–1257.Shin, H. and S. Lee (2015). Canonical correlation analysis for irregularly and sparsely observedfunctional data.
Journal of Multivariate Analysis 134, 1–18.

Székely, G. J. and M. L. Rizzo (2013). The distance correlation t-test of independence in high dimension.
Journal of Multivariate Analysis 117 , 193–213.Sz´ekely, G. J., M. L. Rizzo, and N. K. Bakirov (2007, 12). Measuring and testing dependenceby correlation of distances.
Annals of Statistics 35 (6), 2769–2794.Tsai, S.-Y. (2018). Reproducibility of structural brain connectivity and network metrics usingprobabilistic diffusion tractography.
Scientific Reports 8 (1), 1–12.Van Wijk, B. C., C. J. Stam, and A. Daffertshofer (2010). Comparing brain networks of differentsize and connectivity density using graph theory.
PLOS ONE 5 (10), e13701.Vidakovic, B. (2009).
Statistical modeling by wavelets , Volume 503. John Wiley & Sons.Wells, J. H. and L. R. Williams (2012).
Embeddings and extensions in analysis, Volume 84. Springer Science & Business Media.

Zhou, Y., S.-C. Lin, and J.-L. Wang (2018). Local and global temporal correlations for longitudinal data.