The Annals of Statistics
Institute of Mathematical Statistics, 2006
EFFICIENT INDEPENDENT COMPONENT ANALYSIS
By Aiyou Chen and Peter J. Bickel
Bell Labs and University of California, Berkeley
Independent component analysis (ICA) has been widely used for blind source separation in many fields such as brain imaging analysis, signal processing and telecommunication. Many statistical techniques based on M-estimates have been proposed for estimating the mixing matrix. Recently, several nonparametric methods have been developed, but in-depth analysis of asymptotic efficiency has not been available. We analyze ICA using semiparametric theories and propose a straightforward estimate based on the efficient score function by using B-spline approximations. The estimate is asymptotically efficient under moderate conditions and exhibits better performance than standard ICA methods in a variety of simulations.
1. Introduction.
Independent component analysis (ICA) aims to separate independent blind sources from their observed linear mixtures without any prior knowledge. This technique has been widely used in the past decade to extract useful features from observed data in many fields such as brain imaging analysis, signal processing and telecommunication. Hyvarinen, Karhunen and Oja [16] described a variety of applications of ICA. For example, Vigario, Jousmaki, Hamalainen, Hari and Oja [25] used ICA to separate artifacts from magnetoencephalography (MEG) data, without the burden of modeling the process that generated the artifacts.

Standard ICA represents an m × 1 random vector X (e.g., an instantaneous MEG image) as a linear mixture of m mutually independent random variables S_1, ..., S_m (e.g., artifacts and other brain activities), where the distribution of each S_i is totally unknown. That is, for S = (S_1, ..., S_m)^T and some m × m nonsingular matrix W,

    X = W^{-1} S.    (1.1)

Received July 2003; revised February 2006.
Supported in part by NSF Grant DMS-01-04075 and Lucent Technologies.
Supported in part by NSF Grant DMS-01-04075.
AMS 2000 subject classifications.
Primary 62G05; secondary 62H12.
Key words and phrases.
Independent component analysis, semiparametric models, efficient score function, asymptotically efficient, generalized M-estimator, B-splines.
This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in The Annals of Statistics, 2006, Vol. 34, No. 6, 2825–2855. This reprint differs from the original in pagination and typographic detail.
Here, W^{-1} is called the mixing matrix. Given n i.i.d. observations X_1, ..., X_n from the distribution of X, the aim is to estimate W and, thus, to separate each component of S = WX such that the components are maximally mutually independent. W is called the unmixing matrix. This can be seen as a projection pursuit problem [14] in which m directions are sought such that the corresponding projections are most mutually independent.

It was shown by Comon [9] that W is identifiable up to scaling and permutation of its rows if at most one S_i is Gaussian [18]. Model (1.1) can be viewed as a semiparametric model with parameters (W, r_1, ..., r_m), where r_i is the probability density function (PDF) of S_i. Our interest centers on W; (r_1, ..., r_m) are nuisance parameters.

ICA was motivated by neurophysiological problems in the early 1980's (see [16]) and two classes of methods have been proposed to estimate W. One class involves specifying a particular parametric model for each r_i and then optimizing contrast functions that involve (W, r_1, ..., r_m). Primary examples of this approach are maximum likelihood (ML) (e.g., [19, 21]) or, equivalently, minimum mutual information (e.g., [9]), minimizing high-order correlation between components of WX (e.g., [7]) and maximizing the non-Gaussianity of WX (e.g., [15]). A second class of methods views ICA as a semiparametric model and assumes nothing about the distributions of the components S_i. Thus, two distinct goals can be formulated: (i) to find estimates Ŵ of W that are consistent or, even better, √n-consistent, that is, Ŵ = W + O_p(n^{-1/2}); and (ii) to find procedures that achieve the information bound, that is, estimates of W which are asymptotically normal and have smallest variance-covariance matrix among all estimates that are uniformly asymptotically normal in a suitable sense; see [5]. Amari [1] formally demonstrated that to achieve the information bound in this situation, a method must estimate the densities of the sources. In fact, it can even be shown [6] that for any fixed estimating equation corresponding to maximizing an objective function, there is a possible distribution of sources for which the global maximizer is inconsistent, despite the consistency of a local solution near the truth.

Recently, some nonparametric methods to estimate W have appeared. For example, Bach and Jordan [3] proposed: (i) to reduce the dimension of the data by using a kernel representation and (ii) to choose W so as to minimize the empirical generalized variance among the components of WX. Hastie and Tibshirani [13] proposed maximizing the penalized likelihood as a function of (W, r_1, ..., r_m), and Vlassis and Motomura [26] proposed maximizing the likelihood by using Gaussian kernel density estimation. Various performance analyses have been carried out using simulations. The Vlassis–Motomura and Hastie–Tibshirani methods are of the same type as ours, but these papers do not provide a method for tuning the procedures and nothing has been proven about their asymptotic properties. Samarov and Tsybakov [22] proposed and analyzed a √n-consistent estimate of W under mild conditions. Chen and Bickel [8] analyzed the method of Eriksson and Koivunen [12] based on characteristic functions and showed it to be consistent under the minimal identifiability conditions and √n-consistent under additional mild conditions. This paper concerns the construction of efficient estimates. We develop an efficient estimator by using efficient score functions after starting the algorithm at a consistent point based on the PCFICA algorithm of Chen and Bickel [8].

The outline of the paper is as follows. In Section 2 we analyze ICA as a semiparametric model and propose a new method to estimate W using the efficient score function. The main theorem is given in Section 3. Numerical studies are given in Section 4. Technical details are provided in Sections 5 and 6.

Notation. In this paper, W denotes an m × m matrix, and W_i and W_{ij} denote the ith row and the (i, j)th element of W, respectively. A^T denotes the transpose of a matrix A and A^{-T} denotes the transpose of A^{-1}. For any matrix A with column vectors {a_i : 1 ≤ i ≤ k}, ‖A‖_F = √(tr(A^T A)) and vec(A) = (a_1^T, a_2^T, ..., a_k^T)^T, a column vector created from A. Define the sup-norm as |f|_∞ = sup_{t ∈ R} |f(t)|. X_i denotes the ith random sample from the distribution of X. The population (empirical) law of X is denoted by P (P_n). The ith elements of X and S are also written X_i and S_i; the meaning will be clear from context. Denote the vector of density functions (r_1, ..., r_m) by r_m. A vector or matrix of functions is denoted in boldface. For a vector of functions B, BB^T(x) shall be used as an abbreviation of B(x)[B(x)]^T.
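To fix ideas, model (1.1) can be simulated directly. The sketch below (with hypothetical sizes and a hypothetical W, not taken from the paper) draws independent non-Gaussian sources, rescales each to absolute median 1 (the identifiability convention adopted later in Section 2.2), and forms the observed mixtures X = W^{-1}S:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 1000, 3

# independent non-Gaussian sources; rescale each to absolute median 1
S = rng.laplace(size=(n, m))
S /= np.median(np.abs(S), axis=0)

# a nonsingular unmixing matrix W; the mixing matrix is A = W^{-1}
W = np.eye(m) + 0.3 * rng.normal(size=(m, m))
A = np.linalg.inv(W)

# observed mixtures: each row is X_i = A S_i, i.e. X = W^{-1} S
X = S @ A.T
```

Applying W to each observation recovers the sources exactly in this noiseless model; the statistical problem is that neither W nor the source densities are known.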
2. Semiparametric inference.
In this section, we first briefly review the salient features of estimation in semiparametric models and then show how to solve an approximate efficient score equation for estimating W in the ICA model.

2.1. Efficient estimates for semiparametric models.
Given a semiparametric model, X_1, ..., X_n i.i.d. {P_(θ,η) : θ ∈ Ω ⊂ R^d, η ∈ E}, where E is a subset of a function space, estimates θ̂_n of θ are called regular if √n(θ̂_n − θ) converges in law uniformly in P_(θ_n,η_n), where (θ_n, η_n) converges to (θ, η) in a smooth way. Then if there is a regular estimate that is uniformly best (call it θ*_n), it must have the form

    θ*_n = θ + (1/n) Σ^n_{i=1} l̃(X_i, θ, η) + o_p(n^{-1/2})    (2.1)

under P_(θ,η). The function l̃ is called the efficient influence function in [5]. When η = (η_1, ..., η_{d′}) is a Euclidean parameter, l̃ is, under regularity conditions, the influence function of the ML estimator (MLE) of θ. That is, if l̇ = (∂l/∂θ^T, ∂l/∂η_1, ..., ∂l/∂η_{d′})^T, where l is the log-likelihood function of a single observation and I(θ, η) ≡ E[l̇ l̇^T(X, θ, η)] is the Fisher information matrix, then l̃ is the first d coordinates of the vector I^{-1}(θ, η) l̇. An alternative formulation is to begin by defining the efficient score function l* = (l*_1, ..., l*_d)^T with

    l*_k = ∂l/∂θ_k − Σ^{d′}_{j=1} a_{jk}(θ, η) ∂l/∂η_j,

where a_{jk}(θ, η) minimizes E(∂l/∂θ_k(X, θ, η) − Σ^{d′}_{j=1} a_{jk}(θ, η) ∂l/∂η_j(X, θ, η))^2. That is, l* is the projection of ∂l/∂θ(X, θ, η) onto the orthocomplement of span{∂l/∂η_j(X, θ, η) : 1 ≤ j ≤ d′}. Then

    l̃ = (E[l* l*^T(X, θ, η)])^{-1} l*.

When η is infinite-dimensional, the generalization of span{∂l/∂η_j(X, θ, η) : 1 ≤ j ≤ d′} is the tangent space. That is defined to be the closed linear span, in L_2(P_(θ,η)), of

    {∂l/∂λ(X, θ, η(λ))|_{λ=0} : η(0) = η and λ → η(λ) defines a smooth one-dimensional submodel {P_(θ,η(λ)) : |λ| < 1}}.

Now, l* is again obtained by projection onto the orthocomplement of this span. An extensive discussion of tangent spaces and the geometric interpretation of formulas such as the one above is given in [5], Chapters 2 and 3.

For many canonical semiparametric models, including ICA, l* can be computed; we sketch the argument in the Appendix. Suppose that for each θ, an estimate η̂(θ) is available and is at least consistent. Then the usual Taylor expansions suggest that the solution of the generalized estimating equation

    Σ^n_{i=1} l*(X_i, θ, η̂(θ)) = 0    (2.2)

will have influence function l̃ and, hence, be efficient. These heuristics and others are discussed in Chapter 7 of [5]. Of course, more than consistency is needed, and after calculating l* in our case, validating that (2.2) leads to (2.1) for a suitable η̂(θ) is the subject of Sections 3, 5, 6 and the Appendix. Note that if η̂(θ) maximizes Σ^n_{i=1} l(X_i, θ, η), then (2.2) simply gives the profile maximum likelihood estimate discussed in [20]. In that case, (2.2) simplifies, becoming equivalent to

    Σ^n_{i=1} ∂l/∂θ(X_i, θ, η̂(θ)) = 0.

Unfortunately, such η̂(θ) do not exist in the ICA model. Using l* instead of ∂l/∂θ in the estimating equation (2.2) permits a less demanding choice of η̂(θ). These issues are discussed in detail in [5], Chapter 7. In this paper, we simply show that a θ̂ solving (2.2) for a particular η̂(θ) does indeed satisfy (2.1) in a sufficiently uniform sense. Optimality of θ̂ then follows from the general theory given in Chapter 3 of [5].

This technique is different from the quasi-ML method, which belongs to the first class of methods in the ICA literature reviewed in Section 1. That approach is to guess some shape η_0 for η and then use ordinary ML. Of course, if η_0 is true, then the resulting estimate is asymptotically Gaussian and has smaller variance than the θ̂ we discuss. But, if η_0 is false, then the estimate can be inconsistent. The ICA algorithms used for comparison in Section 4, such as FastICA [Hyvarinen and Oja (1997)] and extended infomax [19], are of this type. Closest to ours in spirit among these is the method of Pham and Garrat [21]. They use parametric models such as logsplines (see Section 2.3) for the nuisance parameters. However, they propose solving the score equations rather than (2.2). More importantly, they do not suggest increasing the model dimension with n, do not give a method for selecting the number of knots of the splines and, hence, are subject to the inconsistency we have discussed.

The remainder of Section 2 shows how to implement the idea given in (2.2) for the ICA model. Technical analysis is carried out in Section 3.

2.2. Further notation and assumptions.
Let W_P be a nonsingular unmixing matrix such that S = W_P X has m mutually independent components. Without loss of generality, assume that det(W_P) > 0. For any row vector w ∈ R^m, let f_w denote the PDF of wX and φ_w denote the density score function defined by φ_w(t) = −(∂/∂t) log f_w(t) · I(f_w(t) > 0), where I(·) is an indicator function.

In model (1.1), the order and scaling of rows of W or components of S must be constrained for W to be identifiable. For scaling, we take each S_i to have absolute median 1, that is, P(|S_i| ≤ 1) = 1/2 or, equivalently,

    2 ∫_{−1}^{1} r_i(s) ds = 1.    (2.3)

Even after this choice, the correct unmixing matrix is determined only up to 2^m m! choices due to sign changes and row permutations. This ambiguity can be resolved in various ways, but we avoid being specific by assuming that a consistent starting value is available for W_P, say PCFICA of Chen and Bickel [8]. Let κ(s) = 2I(|s| ≤ 1) − 1. Then (2.3) is equivalent to

    ∫ κ(S_i) dP = 0.

Equation (2.2), for our case, can be written

    Σ^n_{i=1} l*(X_i, W, Φ̂_W) = 0,
Table 1
Algorithm EFFICA
1. Input data {X_1, ..., X_n}, an initial estimate Ŵ^(0) and B-spline basis functions B^(k) ≡ (B^(k)_1, ..., B^(k)_{n_k})^T for k = 1, ..., m;
2. For j = 0, 1, ... until convergence:
   1) Set S^(j)_k ≡ {Ŵ^(j)_k X_i : i = 1, ..., n} for k = 1, ..., m;
   2) Set, for k = 1, ..., m, φ̂^(j)_k = (γ^(j)_k)^T B^(k), where γ^(j)_k = (Ê^(j)_k[B^(k)(B^(k))^T])^{-1} Ê^(j)_k[D^(k)], D^(k) is the derivative function of B^(k) and Ê^(j)_k is the empirical expectation w.r.t. S^(j)_k;
   3) Set Φ̂_W at W = Ŵ^(j): Φ̂_W(s) = (φ̂^(j)_1(s_1), ..., φ̂^(j)_m(s_m))^T;
   4) Set, for k = 1, ..., m, û^(j)_k = Ê^(j)_k[φ̂^(j)_k ψ] and v̂^(j)_k = Ê^(j)_k[ψ], where ψ(s) = 2sI(|s| ≤ 1); (σ̂^(j)_k)^2 = Ê^(j)_k[ϕ], where ϕ(s) = s^2;
      α̂^(j)_k = −(1 − û^(j)_k) v̂^(j)_k / ((σ̂^(j)_k)^2 − (v̂^(j)_k)^2),
      β̂^(j)_k = (1 − û^(j)_k)(σ̂^(j)_k)^2 / ((σ̂^(j)_k)^2 − (v̂^(j)_k)^2);
   5) Set M^(j)(s), an m × m function matrix with elements
      M^(j)_{kk′}(s) = −φ̂^(j)_k(s_k) s_{k′} for k ≠ k′,
      M^(j)_{kk}(s) = α̂^(j)_k s_k + β̂^(j)_k (2I(|s_k| ≤ 1) − 1);
   6) Set l̂*^(j)(x) = vec(M^(j)(Ŵ^(j)x)[Ŵ^(j)]^{-T}),* where vec(M) vectorizes M;
   7) Set e^(j)_n = (1/n) Σ^n_{i=1} l̂*^(j)(X_i) and Σ^(j)_n = (1/n) Σ^n_{i=1} l̂*^(j) l̂*^(j)T(X_i);
   8) Update Ŵ^(j+1) = Ŵ^(j) + [Σ^(j)_n]^{-1} e^(j)_n.

*Here, l̂*^(j)(x) = l*(x, W, Φ̂_W) at W = Ŵ^(j), where Φ̂_W can be identified as an estimate of the "nuisance parameter" Φ_W.

where Φ_W is the parameter defined by (2.9) and Φ̂_W is an estimate of it. We give pseudo code in Table 1 for iteratively solving this equation. The expressions appearing in the pseudo code are developed in the rest of this section.

2.3. Efficient score function of W. The likelihood function of X under (1.1) can be expressed as

    p_X(x, W, r_m) = |det(W)| Π^m_{i=1} r_i(W_i x).

The parameter of interest is W, while r_m are nuisance parameters. For simplicity, assume E[S_i] = 0.

Let φ_i(s_i) = −(∂/∂s_i) log r_i(s_i) · I(r_i(s_i) > 0) be the density score function associated with r_i and define Φ by Φ(s) = (φ_1(s_1), ..., φ_m(s_m))^T, where s = (s_1, ..., s_m)^T. Then the score function of W, l̇_W(x) ≡ (∂/∂vec(W)) log(p_X(x, W, r_m)), is equal to

    l̇_W(x) = vec{(I_{m×m} − Φ(s)s^T) W^{-T}},

where s = Wx and I_{m×m} is an m × m identity matrix. Thus, minimal regularity conditions for efficient estimation are that each r_i should be absolutely continuous, W nonsingular, E[φ_i^2(S_i)] < ∞ and E[S_i^2] < ∞.

Using the devices of tangent space and projection mentioned in Section 2.1 (see calculation details in the Appendix), the efficient score can be expressed as

    l*(x, W, Φ) = vec(M(Wx) W^{-T}),    (2.4)

where M(s) is an m × m function matrix with elements

    M_{ij}(s) = −φ_i(s_i) s_j, for 1 ≤ i ≠ j ≤ m,    (2.5)
    M_{ii}(s) = α_i s_i + β_i κ(s_i), for i = 1, ..., m,    (2.6)

and

    α_i = −(1 − u_i) v_i / (σ_i^2 − v_i^2), β_i = (1 − u_i) σ_i^2 / (σ_i^2 − v_i^2), σ_i^2 = E[S_i^2],    (2.7)
    v_i = E[2 S_i I(|S_i| ≤ 1)], u_i = E[2 S_i φ_i(S_i) I(|S_i| ≤ 1)].    (2.8)

Most of these formulas were derived in [2], but in a different context. We repeat these in our own notation for completeness. By the convolution theorem on semiparametric models (see [5]), the information bound for regular estimators of W is (E[l* l*^T(X, W, Φ)])^{-1}. It is obvious that the efficient score function depends on r_m only through the density score functions (φ_1, ..., φ_m). Next, we describe how to perform the estimation by using the efficient score function.

2.4. The ICA estimate.
Let

    Φ_W = (φ_{W_1}, ..., φ_{W_m})^T    (2.9)

and assume that a starting estimate Ŵ^(0) is available which is consistent for W_P. We shall show how to construct an estimate Φ̂_W of Φ_W for W close to W_P and then solve

    ∫ l*(X, W, Φ̂_W) dP_n = 0

to obtain an efficient estimator of W_P. Here, Φ̂_W is a data-dependent function of W and, thus, l*(X, W, Φ̂_W) is an approximation to the efficient score function given by (2.4).

For each k ∈ {1, ..., m}, choose a sieve for φ̂_{W_k} as follows. Let [b_{nk}, b̄_{nk}] ⊂ R be a subset of supp(r_k) containing most of the mass of r_k. For an integer n_k, set n_k + 4 equally spaced points {b_{nk} + (i − 1)δ_{nk} : 1 ≤ i ≤ n_k + 4} as knots, where δ_{nk} depends on n_k through

    δ_{nk} = (b̄_{nk} − b_{nk}) / (n_k + 3),

and then construct n_k cubic B-spline basis functions, as in the Appendix. Here, n_k is chosen by cross-validation as described in Section 2.5 below. Denote the basis functions as B^(k)_n ≡ (B^(k)_{n1}, ..., B^(k)_{nn_k})^T, where the superscript (k) denotes the association with S_k and the subscript n denotes the dependence on the sample size. Given the random sample {W_k X_i : 1 ≤ i ≤ n} from the density function f_{W_k}, the density score function φ_{W_k} can be estimated by

    φ̂_{W_k} = [γ_n(W_k)]^T B^(k)_n,    (2.10)

where γ_n(W_k) is given in Table 5 (see Section 2.5 for details). Then define Φ̂_W(s) ≡ (φ̂_{W_1}(s_1), ..., φ̂_{W_m}(s_m))^T. To avoid further complications, it is assumed that both [b_{nk}, b̄_{nk}] and n_k are fixed using Ŵ^(0). That is, the n_k + 4 knot locations are fixed.

Now, replace the efficient score function l*(X, W, Φ) defined in (2.4)–(2.8) by its profile form l*(X, W, Φ̂_W), where α_i, β_i and σ_i^2 defined in (2.7) and (2.8) are replaced with plug-in estimates

    α̂_i = −(1 − û_i) v̂_i / (σ̂_i^2 − v̂_i^2), β̂_i = (1 − û_i) σ̂_i^2 / (σ̂_i^2 − v̂_i^2), σ̂_i^2 = ∫ (W_i X)^2 dP_n,    (2.11)

where, with Y = W_i X, û_i = ∫ 2Y φ̂_{W_i}(Y) I(|Y| ≤ 1) dP_n and v̂_i = ∫ 2Y I(|Y| ≤ 1) dP_n. Define

    e_n(W) = ∫ l*(X, W, Φ̂_W) dP_n and e(W) = ∫ l*(X, W, Φ_W) dP    (2.12)

and let Ŵ be a solution of

    e_n(W) = 0,    (2.13)

if it exists. Let l̂*(x, W) ≡ l*(x, W, Φ̂_W) and ė_n(W) ≡ (∂/∂vec(W)) e_n(W). Note that if Ŵ → W_P, then −ė_n(Ŵ) and ∫ l̂* l̂*^T(X, Ŵ) dP_n have the same limit,

    −(∂e(W)/∂vec(W))|_{W_P} = E[l* l*^T(X, W_P, Φ_P)],

with probability converging to 1, as demonstrated later in Section 5. The final estimator Ŵ is defined as the limiting value of the following approximate Newton–Raphson iteration:

    Ŵ^(j+1) = Ŵ^(j) + [∫ l̂* l̂*^T(X, Ŵ^(j)) dP_n]^{-1} e_n(Ŵ^(j)), j = 0, 1, ....    (2.14)

We shall show that this limit exists with probability tending to 1. Note that this method does not require the calculation of the Hessian matrix ė_n(W). The convergence and asymptotic properties of (2.14) are developed in Section 3. Call Ŵ ≡ Ŵ^(∞) defined by (2.14) the EFFICA estimate. This is summarized in Table 1 and will be used for the simulation in Section 4. The density score estimation, as well as how to choose the number of knots by cross-validation (mentioned above), is provided in the next subsection.

2.5.
Estimating a density score function by B-spline approximations.
Let φ = −r′/r be the density score associated with a univariate PDF r. Let G be a linear space with differentiable basis functions B = (B_1, ..., B_N)^T such that each B_i r vanishes at infinity. An estimator of φ in G can then be obtained by minimizing, with respect to γ ∈ R^N, the mean square error

    c(γ) = ∫_R (φ(s) − γ^T B(s))^2 r(s) ds.

Using integration by parts, we obtain

    c(γ) = γ^T E_r[BB^T] γ − 2 γ^T E_r[B′] + E_r[φ^2],

where B′ is the derivative of B and E_r indicates expectation under r. Thus, the optimal γ is γ_φ = (E_r[BB^T])^{-1} E_r[B′] and the best approximation of φ in G, in the sense of mean square error, is φ_G = γ_φ^T B. This method was proposed by Jin [17] as a variant of Cox's [10] penalized estimators. Given a random sample of size n from the density function r, γ_φ can be estimated by combinations of empirical moments. So, a natural estimator of φ is given by φ̂_G = γ̂_φ^T B, where

    γ̂_φ = (Ê_r[BB^T])^{-1} Ê_r[B′],    (2.15)

where Ê_r is the empirical expectation corresponding to E_r.

B-spline basis functions are popular choices for G. In general, the support of r is unknown and we need to choose a working interval [b_n, b̄_n], in which knots are distributed, for the construction of the basis functions. The basic rule for adaptation is that [b_n, b̄_n] → supp(r) very slowly as n → ∞. Here, b_n and b̄_n are selected as the α_n and 1 − α_n empirical quantiles, where α_n → 0. The number of basis functions, N, is an additional empirical smoothing parameter. One can use cross-validation to choose N as follows:
1. Split the sample randomly into two halves, say I_1 and I_2;
2. For N = 1, 2, ..., use I_1 to estimate γ_φ ∈ R^N by (2.15), say γ̂_φ(I_1), and use I_2 to evaluate c(γ̂_φ(I_1)) empirically, but omitting the last term E_r[φ^2], say ĉ_{I_2|I_1}(N). Similarly calculate ĉ_{I_1|I_2}(N);
3. Select N as the largest value such that {ĉ_{I_2|I_1}(N) + ĉ_{I_1|I_2}(N)} strictly decreases until N.

Jin [17] used a similar method in the i.i.d. case we have discussed, proved its validity and showed that N = O(n^δ) under weak smoothness assumptions, where δ ∈ (0, 1/6) depends on tail properties of r.
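The estimator (2.15) is easy to implement numerically. The sketch below is an illustration, not the authors' code: it uses cubic B-splines with a clamped, equally spaced knot vector on a working interval [lo, hi] (a convenience choice; Section 2.4 uses equally spaced knots), and samples falling outside the interval contribute zero basis values, mimicking the truncation used in Section 3.

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_score(sample, n_basis, lo, hi):
    """Estimate phi = -r'/r on [lo, hi] via (2.15):
    gamma_hat = (E_n[B B^T])^{-1} E_n[B'],  phi_hat = gamma_hat^T B."""
    k = 3  # cubic B-splines
    # clamped knot vector of length n_basis + k + 1 over [lo, hi]
    t = np.r_[[lo] * k, np.linspace(lo, hi, n_basis - k + 1), [hi] * k]

    def design(x):
        """Basis functions and their derivatives evaluated at x."""
        B = np.zeros((len(x), n_basis))
        Bp = np.zeros((len(x), n_basis))
        for j in range(n_basis):
            c = np.zeros(n_basis)
            c[j] = 1.0
            bj = BSpline(t, c, k, extrapolate=False)
            B[:, j] = np.nan_to_num(bj(x))              # 0 outside [lo, hi]
            Bp[:, j] = np.nan_to_num(bj.derivative()(x))
        return B, Bp

    B, Bp = design(np.asarray(sample, dtype=float))
    gram = B.T @ B / len(sample)      # E_n[B B^T]
    rhs = Bp.mean(axis=0)             # E_n[B']
    gamma = np.linalg.solve(gram, rhs)
    return lambda x: design(np.atleast_1d(np.asarray(x, float)))[0] @ gamma
```

For a standard normal sample the true score is φ(s) = s, so the fitted function should be close to the identity on the interior of the working interval.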
3. Asymptotic properties.
We are given Ŵ^(0) (e.g., the PCFICA estimate) such that for some ε_n > 0,

    P(‖Ŵ^(0) − W_P‖_F ≤ ε_n) → 1, n → ∞,    (3.1)

where ε_n satisfies ε_n → 0, √n ε_n → ∞. Let

    Ω_n = {W ∈ R^{m×m} : ‖W − W_P‖_F < ε_n}.    (3.2)

Define φ_{W_k,n}(x) ≡ φ_{W_k}(x) I(x ∈ [b_{nk}, b̄_{nk}]) and consider the following conditions for 1 ≤ k, i, j ≤ m:

C1: W_P is nonsingular.
C2: E[S_k] = 0, E[S_k^2] < ∞, med(|S_k|) = 1 and E[φ_k^2(S_k)] < ∞.
C3: |r_k|_∞ < ∞, |r′_k|_∞ < ∞, sup_{t∈R} |t r′_k(t)| < ∞.
C4: The uniform law of large numbers (ULLN) holds for {φ_{W_k}(W_k X) X_i : W ∈ Ω_n}, for {φ_{W_k}^2(W_k X) X_i^2 : W ∈ Ω_n} and for {φ′_{W_k}(W_k X) W_i X X_j : W ∈ Ω_n}.
C5: For some positive c_1, c_2, r_k(t) ≥ c_1 δ_{nk} if t ∈ [b_{nk}, b̄_{nk}]; otherwise r_k(t) ≤ c_2 δ_{nk}.
C6: sup_{W∈Ω_n} |φ_{W_k,n}|_∞ δ_{nk} = O(1) and sup_{W∈Ω_n} |φ′′′_{W_k,n}|_∞ δ_{nk}^3 = o(1).
C7: ε_n δ_{nk}^{-1}(b̄_{nk} − b_{nk}) = o(1).

[Note: ULLN holds for G_n iff sup_{g∈G_n} |∫ g(X) d(P_n − P)| = o_p(1); see, e.g., [24].]

Conditions C1–C3 are simplified regularity conditions. C1 and the finite moments in C2 are among the minimal regularity conditions for considering efficiency, as mentioned in Section 2.3. Setting the absolute median to unity in C2 is a simple and minimal condition to make the scales of the unmixing matrix identifiable [9]. It should be clear that the zero mean assumption in C2 is in no way crucial to the general argument, as the mean can be estimated adaptively, but it serves to keep algebraic complication to a minimum. C3 assumes some smoothness of the density score function φ_k for each hidden component, which is needed if it is to be well approximated by B-splines.

Conditions C4–C7 are technical conditions that we believe are far from necessary, but they are reasonably easy to check and enable construction of a more compact proof. As an easy example, if |φ_k|_∞ < ∞ and |r′′_k/r_k|_∞ < ∞ for k = 1, ..., m, then by (A.1) in the Appendix, sup_{W∈Ω_n} |φ_{W_k}|_∞ < ∞ and, by (A.2), sup_{Ω_n} |φ′_{W_k}|_∞ < ∞. Thus C4 holds. C5 and C6 require that the tail of r_k be not too wiggly. C6 also implies that δ_{nk} → 0. C7 requires that the initial value be reasonably close to the truth and that the domain and the number of knots of the B-splines [i.e., n_k = (b̄_{nk} − b_{nk}) δ_{nk}^{-1} − 3] do not grow so quickly that we lose control of the approximation to Φ_W.

Here is the main theorem:

Theorem 3.1.
In the ICA model (1.1), if (3.1) and C1–C7 hold for i, j, k = 1, ..., m, i ≠ k and j ≠ k, then with probability converging to 1, the algorithm (2.14) has a limit Ŵ^(∞) and

    √n vec(Ŵ^(∞) − W_P) = I_eff^{-1} √n ∫ l*(X, W_P, Φ_P) dP_n + o_P(1),    (3.3)

where I_eff = ∫ l* l*^T(X, W_P, Φ_P) dP. That is, Ŵ^(∞) is Fisher efficient. Further,

    √n vec(Ŵ^(∞) W_P^{-1} − I_{m×m}) →_d N(0, Ī_eff^{-1}),

where Ī_eff = ∫ vec(M) vec(M)^T dP does not depend on W_P and M is given by (2.5)–(2.8), with s replaced by S.

The proof of Theorem 3.1 is given in later sections and the Appendix.
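For concreteness, one approximate Newton step of the iteration (2.14) can be sketched as follows. This is an illustration only, not the authors' implementation: it assumes the per-component density score estimates φ̂_k are supplied externally (e.g., by the B-spline estimator of Section 2.5), computes the plug-in constants of (2.11) empirically, and forms the efficient score (2.4)–(2.6) at each observation.

```python
import numpy as np

def efficient_score_step(X, W, phis):
    """One approximate Newton step of iteration (2.14).

    X    : (n, m) array of observations (rows X_i).
    W    : (m, m) current unmixing matrix estimate.
    phis : list of m vectorized callables; phis[k](s) estimates the
           density score -r_k'(s)/r_k(s) of source k.
    """
    n, m = X.shape
    S = X @ W.T                                  # current source estimates
    Winv_T = np.linalg.inv(W).T

    # plug-in estimates of sigma_k^2, v_k, u_k as in (2.7)-(2.8)/(2.11)
    phiS = np.column_stack([phis[k](S[:, k]) for k in range(m)])
    inside = (np.abs(S) <= 1.0).astype(float)
    sigma2 = np.mean(S ** 2, axis=0)
    v = np.mean(2.0 * S * inside, axis=0)
    u = np.mean(2.0 * S * phiS * inside, axis=0)
    alpha = -(1.0 - u) * v / (sigma2 - v ** 2)
    beta = (1.0 - u) * sigma2 / (sigma2 - v ** 2)

    # efficient score l*(X_i) = vec(M(S_i) W^{-T}) per observation
    L = np.empty((n, m * m))
    for i in range(n):
        M = -np.outer(phiS[i], S[i])                      # off-diagonal (2.5)
        kappa = 2.0 * inside[i] - 1.0                     # kappa(s) = 2 I(|s|<=1) - 1
        np.fill_diagonal(M, alpha * S[i] + beta * kappa)  # diagonal (2.6)
        L[i] = (M @ Winv_T).ravel()

    e_n = L.mean(axis=0)                         # estimating equation value
    Sigma_n = (L.T @ L) / n                      # empirical information matrix
    return W + np.linalg.solve(Sigma_n, e_n).reshape(m, m)
```

In practice one iterates this step, refreshing the score estimates φ̂_k at each update as in Table 1, until ‖e_n‖ is negligible.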
4. Numerical studies and some computational issues.
Two groups of experiments are implemented to test the empirical performance of EFFICA. Data are generated from known source distributions listed in Table 2 with a known mixing matrix W_P^{-1}. The boundaries for B-spline approximation of the density score functions are taken as b_{nk} = max(q_n(0), q_n(0.01) − Δ_n) and b̄_{nk} = min(q_n(1), q_n(0.99) + Δ_n), where q_n(·) denotes the empirical quantile and Δ_n = c · √(log log n). We used c = 5 in the simulation. The number of knots is key for EFFICA and is chosen by the cross-validation method described in Section 2.5.

In the first group of experiments, two hidden components are used, with W_P = [2, 1; 2, ·] and n = 1000.

In the second group of experiments, the number of hidden components is increased to m = 4, 8 and 12; the m hidden components are chosen with distributions [0], [1], ..., [m − 1] in Table 2 and W_P = I_{m×m}. The experiments are replicated 100, 100 and 50 times for m = 4, 8 and 12, respectively, with n = 4000.
Comparisons are made with five existing ICA algorithms: the FastICA algorithm with the options "symmetric" and "tanh" [15], which is equivalent to quasi-ML using a tanh distribution for each hidden source (note: the FastICA code uses logistic for tanh); the JadeICA algorithm [7]; the extended infomax algorithm [19]; the KernelICA-Kgv algorithm [3]; and the PCFICA algorithm. The PCFICA estimate is used as the initial value for both EFFICA and KernelICA-Kgv. Due to the existence of multiple local solutions, PCFICA uses three starting values, one from FastICA and the others random. The performance of each algorithm is measured by both the Frobenius error, that is, d_F(Ŵ, W_P) = ‖Ŵ W_P^{-1} − I_{m×m}‖_F after suitable rescaling and permutation of the rows of both Ŵ and W_P, and the so-called Amari error d_A(Ŵ, W_P) (e.g., [3]),

    d_A(V, W) = (1/(2m)) Σ^m_{i=1} (Σ^m_{j=1} |a_{ij}| / max_j |a_{ij}| − 1) + (1/(2m)) Σ^m_{j=1} (Σ^m_{i=1} |a_{ij}| / max_i |a_{ij}| − 1),

where V, W are rescaled into V̄, W̄ such that each row of V̄ and W̄ has norm 1, and a_{ij} = (V̄ W̄^{-1})_{ij}. The Amari error lies in [0, m − 1], is invariant under rescaling and row permutation of V and W, and is equal to zero if and only if V and W represent the same row components.

For each experiment in the first group of simulations, with T = 400 replications, Table 3 reports the average Amari error and the square root of the mean square error, √MSE, with

    MSE = (1/T) Σ^T_{i=1} (d_F^{(i)})^2,

where d_F^{(i)} denotes the Frobenius error for the ith replication. For the second group of simulations, Figure 1 shows the boxplots of the Amari errors and Table 4 reports √MSE.

From the simulation results, in some cases some parametric ICA algorithms work very well and even outperform EFFICA. For example, FastICA
Table 2
Source distributions used in the simulations

[0] N(0,1)
[4] t(5)
[5] logistic(0,1)
[6] Weibull(3,1)
[8] exp(1) + U(0,1)
[11] mixture Gaussians: multimodal
[12] mixture Gaussians: unimodal
[13] exp(1) vs normal(0,1)
[14] lognormal(1,1) vs normal(0,1)
[15] Weibull(3,1) vs exp(1)

Table 3
2 × mean Amari errors (and 2 × √MSE in brackets) for six ICA methods using m = 2 sources and sample size n = 1000. Source distributions for row k are given by [k] in Table 2. The bold numbers represent the best performance in each experiment.
(columns: pdf, Fast, Jade, ExtImax, Pcf, Kgv, EFFICA)

works best in case 5, where the hidden sources have logistic distributions. This is not surprising, as we pointed out in Section 2.1 that a simple quasi-MLE can outperform an efficient estimator when the value of the nuisance parameter used by the quasi-MLE is close to the truth. But, in most experiments, the parametric methods (FastICA, JADE, ExtImax) behave worse than the nonparametric methods (PCFICA, Kgv, EFFICA), and EFFICA has both the smallest Amari errors and the smallest Frobenius errors, while Kgv, which we conjecture can be efficient after appropriate regularization, is the best in the cases of mixture Gaussians. The three nonparametric ICA algorithms require heavier computation, but their performance is better than that of the parametric methods.

Table 4
2 × √MSE for ICA algorithms with the same simulations as in Figure 1

Case     Fast   Jade   ExtImax   PCF    Kgv    EFFICA
m = 4    0.82   1.31   2.71      0.60   0.51   0.··
m = 8    7.··   ···    ···       ···    ···    ···
m = 12   9.··   ···    ···       ···    ···    ···

All of the ICA algorithms used in the simulation except EFFICA are based on contrast functions which empirically measure the dependence or non-Gaussianity among {W_1 X, ..., W_m X} and, thus, they are invariant with respect to the choice of W_P for both error metrics d_F(Ŵ, W_P) and d_A(Ŵ, W_P). We note that prewhitening, which is used for data preprocessing by these algorithms, can reduce such invariance, although it does not cause inconsistency [8]. Theorem 3.1 implies that EFFICA is asymptotically invariant with respect to W_P. Figure 2 compares m = 8 with two different unmixing matrices, W_P = I_{m×m} and W_P = I_{m×m} + V, where V_{jk} = j/m + (k − 1)/m for 1 ≤ j, k ≤ m. We performed many other simulations with different W_P and obtained similar results. We observe that the Frobenius error boxplots do change somewhat with different W_P, but EFFICA is more robust than the other ICA algorithms. We believe that the main reasons are (i) none of the ICA algorithms are convex and, thus, all may suffer from local solutions, and (ii) EFFICA does not use prewhitening for preprocessing, while the others do.
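As a reference implementation of the error metric used above, the Amari error d_A can be computed as follows (a sketch, with the row normalization described in Section 4):

```python
import numpy as np

def amari_error(V, W):
    """Amari error d_A(V, W): rescale rows of V and W to unit norm,
    form a = |V_bar @ inv(W_bar)|, and average the row/column criteria."""
    Vb = V / np.linalg.norm(V, axis=1, keepdims=True)
    Wb = W / np.linalg.norm(W, axis=1, keepdims=True)
    a = np.abs(Vb @ np.linalg.inv(Wb))
    m = a.shape[0]
    row = (a.sum(axis=1) / a.max(axis=1) - 1.0).sum()
    col = (a.sum(axis=0) / a.max(axis=0) - 1.0).sum()
    return (row + col) / (2.0 * m)
```

The metric is zero exactly when V equals W up to rescaling, sign changes and row permutation, which is why it is the natural loss for ICA comparisons.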
5. Proof of Theorem 3.1.
In this section, we prove Theorem 3.1. Note that solving (2.13) can be viewed as a generalized M-estimator (GM-estimator). The existence/uniqueness, convergence and asymptotic linearity of GM-estimators have been studied in [5] (the Iteration Theorem in Appendix A.10.2, page 517). The idea of our proof is to use the Iteration Theorem.

Suppose that M_n(θ, P_n) is a functional of θ ∈ Ω (a subset of a finite-dimensional Euclidean space) and P_n, but is not necessarily linear in P_n. The subscript n in M_n allows the existence of a possible smoothing or sieve parameter dependent on n. The zero of M_n(θ, P_n) w.r.t. θ is called a generalized M-estimator. Let M(θ, P) ≡ M_∞(θ, P). We review the conditions for the Iteration Theorem.

[GM1] θ_P ∈ Ω is the unique solution of M(θ, P) = 0 in Ω.
[GM2] M_n(θ_P, P_n) = ∫ ψ_{θ_P}(X) dP_n + o_p(n^{-1/2}) for some ψ_{θ_P} ∈ L_2(P).
[GM3] M(θ, P) is differentiable w.r.t. θ in a neighborhood of θ_P and ∂M(θ_P, P)/∂θ is nonsingular.

Fig. 1.
Fig. 1. Boxplots of Amari errors for ICA algorithms: in each case (left, middle, right), the $m$ hidden components are generated from pdfs $[0], \dots, [m-1]$ in Table 1. The x-labels represent the ICA algorithms: F-FastICA, J-JadeICA, X-extended infomax, P-PCFICA, K-Kgv, E-EFFICA. The sample sizes are 4000 for all of the experiments and the numbers of replications are …, … and … for $m = 4$, $m = 8$ and $m = 12$, respectively.
Fig. 2. Frobenius errors ($\times$…) with $m = 8$ and $n = 4000$, where the left panel is for $W_P = I_{m\times m}$ and the right panel is for $W_P = I_{m\times m} + V$, with $V_{jk} = j/m + (k-1)/m$. Each experiment is replicated … times.
For our efficient score equation $M_n(\theta, P_n) = e_n(W)$ defined in (2.12), condition [U] of [5] becomes:

[U] $\sup_{W \in \Omega_n} |\dot e_n(W) - \dot e(W_P)| = o_P(1)$.

Theorem 5.1 ([5]).
Suppose that [GM1], [GM2] and [GM3] hold with $M_n(\theta, P_n) = e_n(W)$ and that [U] holds. If the starting point satisfies $P(|\hat W^{(0)} - W_P| < \varepsilon_n) \to 1$, then with probability converging to 1, $e_n(W)$ in (2.13) has a unique root $\hat W^{(\infty)}$, which is also the limit of the sequence defined by (2.14), except with $\int \hat l^*\hat l^{*T}(X, \hat W^{(j)})\,dP_n$ replaced by $-\dot e_n(\hat W^{(j)})$, and $\hat W^{(\infty)}$ is asymptotically linear with influence function $-[\dot e(W_P)]^{-1}l^*(\cdot, W_P, \Phi_P)$.

Theorem 5.1 is called the
Iteration Theorem in [5]. Note that the sequence limit defined by the Iteration Theorem uses the exact Newton–Raphson, whereas we use an approximate Newton–Raphson, as in (2.14). To make up the difference, we need the following condition [V], which is verified by Proposition 6.4 in Section 6:

[V] $\sup_{W \in \Omega_n} |\int \hat l^*\hat l^{*T}(X, W)\,dP_n - \int l^*l^{*T}(X, W_P, \Phi_{W_P})\,dP| = o_P(1)$.

Theorem 3.1 can now be proved as follows.

Proof of Theorem 3.1.
It is obvious that [GM1] holds under the conditions of Theorem 3.1, since the estimating equation is built from the efficient score function. [GM2], [GM3] and [U] are verified by Propositions 6.1, 6.2 and 6.3 below, respectively. Thus, the conclusion of the Iteration Theorem applies here. By Proposition 6.4, condition [V] holds. Further, by Proposition 6.2,
$$\dot e(W_P) = -E[l^*l^{*T}(X, W_P, \Phi_P)].$$
Thus, we have
$$\sup_{W \in \Omega_n}\Big|\dot e_n(W) + \int \hat l^*\hat l^{*T}(x, W)\,dP_n\Big| = o_P(1).$$
Then, following the contraction arguments of [5] (pages 317–319), the iteration given in (2.14) has the same limit as the one obtained by replacing $\int \hat l^*\hat l^{*T}(X, \hat W^{(j)})\,dP_n$ by $-\dot e_n(\hat W^{(j)})$, with probability converging to 1. Thus, (3.3) holds. The second result follows from (3.3) directly by using (2.4) and the devices of the Kronecker product and the vec operator. □
6. Propositions 6.1–6.4.
This section verifies conditions [GM2], [GM3], [U] and [V]. For convenience, we list all of the notation used in the following proofs in Table 5, for $k \in \{1, \dots, m\}$ and $W \in \Omega_n$. Note that all of the lemmas used in this section are given in the Appendix. For simplicity of notation, we will often write $\delta_{nk}$ as $\delta_n$.

Table 5
List of all notation used in the proofs

$P$, $P_n$: population and empirical law of $X$
$W$, $W_k$, $W_{ij}$: $m \times m$ matrix, its $k$th row, its $(i,j)$th element
$W_P$, $W_{Pk}$, $W_{Pij}$: unmixing matrix, its $k$th row, its $(i,j)$th entry
$r_k$: pdf of $S_k$
$\phi_k = -r_k'/r_k$: density score function of $S_k$
$\Phi_P = (\phi_1, \dots, \phi_m)^T$: function vector
$f_{W_k}$: pdf of $W_kX$ ($f_{W_{Pk}} \equiv r_k$)
$\phi_{W_k} = -f_{W_k}'/f_{W_k}$: score function of $W_kX$ ($\phi_{W_{Pk}} \equiv \phi_k$)
$\phi_{k,n}$, $\phi_{W_k,n}$: truncations of $\phi_k$, $\phi_{W_k}$ on $[b_{nk,1}, b_{nk,2}]$
$\Phi_W = (\phi_{W_1}, \dots, \phi_{W_m})^T$: function vector
$\hat\Phi_W = (\hat\phi_{W_1}, \dots, \hat\phi_{W_m})^T$: function vector
$l^*(X, W, \Phi)$: efficient score function defined by (2.4)
$e(W) = \int l^*(X, W, \Phi_W)\,dP$: expectation
$e_n(W) = \int l^*(X, W, \hat\Phi_W)\,dP_n$: empirical expectation
$B_n^{(k)} = (B_{n1}^{(k)}, \dots, B_{nn_k}^{(k)})^T$: B-splines defined on $[b_{nk,1}, b_{nk,2}]$
$A_n(W_k) = \int B_n^{(k)}B_n^{(k)T}(W_kX)\,dP_n$: used in the coefficients of $\hat\phi_{W_k}$
$D_n(W_k) = \int (B_n^{(k)})'(W_kX)\,dP_n$: used in the coefficients of $\hat\phi_{W_k}$
$\gamma_n(W_k) = A_n(W_k)^{-1}D_n(W_k)$: coefficients of $\hat\phi_{W_k}$
$A(W_k) = \int B_n^{(k)}B_n^{(k)T}(W_kX)\,dP$: used in the coefficients of $\hat{\hat\phi}_{W_k}$
$D(W_k) = \int (B_n^{(k)})'(W_kX)\,dP$: used in the coefficients of $\hat{\hat\phi}_{W_k}$
$\gamma(W_k) = A(W_k)^{-1}D(W_k)$: coefficients of $\hat{\hat\phi}_{W_k}$
$\mathcal G_n^{(k)} = \{a^TB_n^{(k)} : a \in \mathbb R^{n_k}\}$: closed linear span of the B-splines
$\hat\phi_{W_k} = \gamma_n(W_k)^TB_n^{(k)}$: estimator of $\phi_{W_k}$ in $\mathcal G_n^{(k)}$, defined by (2.10)
$\hat{\hat\phi}_{W_k} = \gamma(W_k)^TB_n^{(k)}$: approximation of $\phi_{W_k}$ in $\mathcal G_n^{(k)}$, defined by (A.3)
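For a single fixed direction, the coefficient formula $\gamma_n(W_k) = A_n(W_k)^{-1}D_n(W_k)$ is an ordinary linear solve: it exploits the integration-by-parts identity $E[B(Y)\phi(Y)] = E[B'(Y)]$ (valid because B-splines have compact support), so the score is estimated without estimating the density. The sketch below is our own one-dimensional illustration, not the paper's implementation; the sample, interval and knot spacing are hypothetical. For a standard Gaussian sample the true score is $\phi(t) = t$.

```python
import numpy as np

def bspline_basis(x, knots, degree=3):
    """All B-spline basis functions of `degree` at points x (Cox-de Boor recursion).
    `knots` must be strictly increasing and extended beyond the working interval."""
    x = np.asarray(x, float)
    B = np.array([(x >= knots[i]) & (x < knots[i + 1])
                  for i in range(len(knots) - 1)], dtype=float)
    for k in range(1, degree + 1):
        Bn = np.zeros((len(knots) - k - 1, x.size))
        for i in range(len(knots) - k - 1):
            d1 = knots[i + k] - knots[i]
            d2 = knots[i + k + 1] - knots[i + 1]
            Bn[i] = (x - knots[i]) / d1 * B[i] + (knots[i + k + 1] - x) / d2 * B[i + 1]
        B = Bn
    return B  # shape: (len(knots) - degree - 1, len(x))

def bspline_deriv(x, knots, degree=3):
    """First derivatives of the same basis, via the B-spline derivative formula."""
    Bl = bspline_basis(x, knots, degree - 1)  # one extra lower-degree function
    D = np.zeros((len(knots) - degree - 1, np.size(x)))
    for i in range(len(knots) - degree - 1):
        D[i] = (degree / (knots[i + degree] - knots[i]) * Bl[i]
                - degree / (knots[i + degree + 1] - knots[i + 1]) * Bl[i + 1])
    return D

# Score estimation: gamma = A_n^{-1} D_n with
#   A_n = (1/n) sum_t B(x_t) B(x_t)^T  and  D_n = (1/n) sum_t B'(x_t),
# the empirical version of the normal equations E[B B^T] a = E[B'] = E[B phi].
rng = np.random.default_rng(0)
x = rng.standard_normal(20000)                  # hypothetical sample; true score phi(t) = t
delta = 0.5
knots = np.arange(-4.0, 4.0 + delta / 2, delta)  # equally spaced, extended knots
Bx = bspline_basis(x, knots)
A_n = Bx @ Bx.T / x.size
D_n = bspline_deriv(x, knots).mean(axis=1)
gamma = np.linalg.solve(A_n, D_n)

def phi_hat(t):
    """Estimated score at a scalar t."""
    return float(gamma @ bspline_basis(np.atleast_1d(t), knots)[:, 0])
```

In the interior of the interval, `phi_hat(t)` tracks the true score $t$ closely; near the boundary, where the density is small, the estimate is noisier, which is why the paper truncates to $[b_{nk,1}, b_{nk,2}]$.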
Proposition 6.1. Under the conditions of Theorem 3.1, we have
$$e_n(W_P) = \int l^*(X, W_P, \Phi_P)\,dP_n + o_P(n^{-1/2}).$$

Proof.
Recall the definition of $e_n(W)$ given by (2.12) and that of $l^*(x, W, \Phi)$ given by (2.4)–(2.8). It is sufficient to show that for $1 \le i \ne j \le m$, $\hat\alpha_i - \alpha_i = o_P(1)$ and $\hat\beta_i - \beta_i = o_P(1)$, where $(\alpha_i, \beta_i)$ and $(\hat\alpha_i, \hat\beta_i)$ are defined in (2.7) and (2.11), respectively, and
$$\int \hat\phi_{W_{Pi}}(S_i)S_j\,dP_n = \int \phi_{W_{Pi}}(S_i)S_j\,dP_n + o_P(n^{-1/2}), \qquad (6.1)$$
where $S_i = W_{Pi}X$ and $S_j = W_{Pj}X$. The first two claims are not hard to verify with the law of large numbers and Lemma A.7. Here, we will just show (6.1). Observe that
$$\Big|\int \hat\phi_{W_{Pi}}(S_i)S_j\,dP_n - \int \phi_{W_{Pi}}(S_i)S_j\,dP_n\Big| \le \Big|\int [\hat\phi_{W_{Pi}}(S_i) - \hat{\hat\phi}_{W_{Pi}}(S_i)]S_j\,dP_n\Big| + \Big|\int [\hat{\hat\phi}_{W_{Pi}}(S_i) - \phi_{i,n}(S_i)]S_j\,dP_n\Big| + \Big|\int (\phi_i(S_i) - \phi_{i,n}(S_i))S_j\,dP_n\Big| = [1] + [2] + [3].$$
In the following, we show that each of [1], [2] and [3] is $o_P(n^{-1/2})$. First, by Lemma A.2(7) and Lemma A.3,
$$[1] = \Big|\int (\gamma_n(W_{Pi}) - \gamma(W_{Pi}))^TB_n^{(i)}(S_i)S_j\,dP_n\Big| \le \|\gamma_n(W_{Pi}) - \gamma(W_{Pi})\|\,\Big\|\int B_n^{(i)}(S_i)S_j\,dP_n\Big\| = o_P(1)O_P(n^{-1/2}).$$
Further, $E([2]^2) = \frac1n E[(\hat{\hat\phi}_{W_{Pi}}(S_i) - \phi_{W_{Pi},n}(S_i))^2]E(S_j^2)$. By Lemma A.6, $|\hat{\hat\phi}_{W_{Pi}} - \phi_{W_{Pi},n}|_\infty \le c\delta_n^{5/2}|\phi'''_{W_{Pi},n}|_\infty$. Thus, by C6,
$$[2] = n^{-1/2}\delta_n^{5/2}|\phi'''_{W_{Pi},n}|_\infty O_P(1) = o_P(n^{-1/2}).$$
For [3], since $P(S_i \notin [b_{ni,1}, b_{ni,2}]) \to 0$, we have
$$E([3]^2) = \frac1n E[\phi_i^2(S_i)I(S_i \notin [b_{ni,1}, b_{ni,2}])]E(S_j^2) = o\Big(\frac1n\Big).$$
So $[3] = o_P(n^{-1/2})$. □

Proposition 6.2.
Under conditions C1, C2 and C4 of Theorem 3.1, $e(W)$ is differentiable w.r.t. $W$ in a neighborhood of $W_P$ and
$$\dot e(W_P) = -E\{l^*l^{*T}(X, W_P, \Phi_P)\}$$
is nonsingular.

Proof.
Let $T_w(\cdot) = \frac{\partial}{\partial w}\{\phi_w(\cdot)\}$ for any nonzero row vector $w \in \mathbb R^m$. By (A.1) in the Appendix, after exchanging the order of differentiation and integration, we have $E[T_w(wX)] = 0$. Then by (2.5), we have, for $k = 1, \dots, m$,
$$E\Big\{\frac{\partial}{\partial W_k}\{l^*(X, W, \Phi_W)\}\Big|_{W_P}\Big\} = E\Big\{\frac{\partial}{\partial W_k}\{l^*(X, W, \Phi_P)\}\Big|_{W_P}\Big\}.$$
Since the left-hand side of the above is $\dot e(W_P)$, by Lemma A.8 the right-hand side is equal to
$$\dot e(W_P) = -E\{l^*l^{*T}(X, W_P, \Phi_P)\}. \qquad (6.2)$$
Note that the elements of $M$ in (2.5)–(2.6) are linearly independent, and that this is also true for the elements of $l^*(\cdot, W_P, \Phi_P)$. Thus, $\dot e(W_P)$ must be nonsingular. □

Proposition 6.3.
Under the conditions of Theorem 3.1, for $i, j, k = 1, \dots, m$, we have
$$\sup_{\Omega_n}\Big|\int \hat\phi_{W_i}(W_iX)X_k\,dP_n(X) - \int \phi_{Pi}(W_{Pi}X)X_k\,dP\Big| = o_P(1) \qquad (6.3)$$
and, for $i \ne j$,
$$\sup_{\Omega_n}\Big|\int \frac{\partial}{\partial W_i}\{\hat\phi_{W_i}(W_iX)\}W_jX\,dP_n - \int \frac{\partial}{\partial W_i}\{\phi_{Pi}(W_{Pi}X)\}W_{Pj}X\,dP\Big| = o_P(1). \qquad (6.4)$$
Proof.
We omit the superscript $(i)$ in $B_n^{(i)}$ henceforth. By the Cauchy–Schwarz inequality,
$$\Big\|\int B_n(W_iX)X_k\,dP_n\Big\|^2 \le \int X_k^2\,dP_n\int \sum_{l=1}^{n_i}[B_{nl}(W_iX)]^2\,dP_n.$$
Since $\sum_{l=1}^{n_i}[B_{nl}(W_iX)]^2 < 1$,
$$\sup_{\Omega_n}\Big\|\int B_n(W_iX)X_k\,dP_n\Big\| = O_P(1), \qquad (6.5)$$
and by Lemma A.2(7), $\sup_{\Omega_n}\|\gamma_n(W_i) - \gamma(W_i)\| = o_P(1)$, so
$$\sup_{\Omega_n}\Big|\int \hat\phi_{W_i}(W_iX)X_k\,dP_n(X) - \int \hat{\hat\phi}_{W_i}(W_iX)X_k\,dP_n\Big| = \sup_{\Omega_n}\Big|(\gamma_n(W_i) - \gamma(W_i))^T\int B_n(W_iX)X_k\,dP_n\Big| \le \sup_{\Omega_n}\|\gamma_n(W_i) - \gamma(W_i)\|\sup_{\Omega_n}\Big\|\int B_n(W_iX)X_k\,dP_n\Big\| = o_P(1)O_P(1). \qquad (6.6)$$
Further, by Lemma A.6, $\sup_{\Omega_n}|\hat{\hat\phi}_{W_i} - \phi_{W_i,n}|_\infty \le \sup_{\Omega_n}c|\phi'''_{W_i,n}|_\infty\delta_n^{5/2}$, so
$$\sup_{\Omega_n}\Big|\int \hat{\hat\phi}_{W_i}(W_iX)X_k\,dP_n(X) - \int \phi_{W_i,n}(W_iX)X_k\,dP_n\Big| \le \sup_{\Omega_n}c|\phi'''_{W_i,n}|_\infty\delta_n^{5/2}\int |X_k|\,dP_n = o_P(1) \quad \text{(by C6)}. \qquad (6.7)$$
By C4, the ULLN holds for $\{\phi_{W_i}(W_iX)X_k : W \in \Omega_n\}$ and, by Lemma A.1, $\sup_{\Omega_n}P(W_iX \notin [b_{ni,1}, b_{ni,2}]) = o(1)$, so
$$\sup_{\Omega_n}\Big|\int \phi_{W_i}(W_iX)X_kI(W_iX \notin [b_{ni,1}, b_{ni,2}])\,dP_n\Big| = o_P(1). \qquad (6.8)$$
Recall that $\phi_{W_i,n}(x) = \phi_{W_i}(x)I(x \in [b_{ni,1}, b_{ni,2}])$. From (6.6)–(6.8), we obtain
$$\sup_{\Omega_n}\Big|\int \hat\phi_{W_i}(W_iX)X_k\,dP_n(X) - \int \phi_{W_i}(W_iX)X_k\,dP_n(X)\Big| = o_P(1). \qquad (6.9)$$
Now, by C4,
$$\sup_{\Omega_n}\Big|\int \phi_{W_i}(W_iX)X_k\,d(P_n - P)\Big| = o_P(1), \qquad (6.10)$$
and by continuity,
$$\sup_{\Omega_n}\Big|\int \phi_{W_i}(W_iX)X_k\,dP - \int \phi_{W_{Pi}}(W_{Pi}X)X_k\,dP\Big| = o(1). \qquad (6.11)$$
(6.3) then follows from (6.9)–(6.11).

In the following, we prove (6.4). Note that
$$\frac{\partial}{\partial W_{ik}}\{\hat\phi_{W_i}(W_iX)\} = \frac{\partial}{\partial W_{ik}}\{\gamma_n^T(W_i)\}B_n(W_iX) + \hat\phi'_{W_i}(W_iX)X_k. \qquad (6.12)$$
It suffices to show that the following hold:
$$[4]\quad \sup_{\Omega_n}\Big|\int \hat\phi'_{W_i}(W_iX)X_kW_jX\,dP_n(X) - \int \phi'_{Pi}(W_{Pi}X)X_kW_{Pj}X\,dP\Big| = o_P(1);$$
$$[5]\quad \sup_{\Omega_n}\Big|\int \frac{\partial}{\partial W_{ik}}\{\gamma_n^T(W_i)\}B_n(W_iX)W_jX\,dP_n(X)\Big| = o_P(1).$$
Similarly to (6.3), the uniform convergence in [4] can be verified using conditions C4, C6 and C7 and Lemmas A.1, A.2 and A.6. Further, the left-hand side of [5] is bounded by
$$\sup_{\Omega_n}\Big\{\Big\|\frac{\partial}{\partial W_{ik}}\gamma_n(W_i)\Big\|\cdot\Big\|\int B_n(W_iX)W_jX\,dP_n\Big\|\Big\} = O_P(\delta_n^{-5/2}n_i^{1/2}\cdot\varepsilon_n\delta_n^{-1}n_i^{1/2}) = o_P(1),$$
where the first equality follows from Lemmas A.2 and A.4 and the second follows from C7. Thus, [5] holds and, hence, (6.4) is proved. □

Proposition 6.4.
Under the conditions of Theorem 3.1, condition [V] holds; that is,
$$\sup_{W \in \Omega_n}\Big|\int \hat l^*\hat l^{*T}(X, W)\,dP_n - \int l^*l^{*T}(X, W_P, \Phi_{W_P})\,dP\Big| = o_P(1).$$

Proof.
By checking the elements of $\hat l^*\hat l^{*T}(x, W)$, it suffices to show that for $1 \le i, j, k, k' \le m$ and $i \ne j$,
$$\sup_{\Omega_n}\Big|\int \hat\phi_{W_i}(W_iX)\hat\phi_{W_j}(W_jX)X_kX_{k'}\,dP_n - \int \phi_{Pi}(W_{Pi}X)\phi_{Pj}(W_{Pj}X)X_kX_{k'}\,dP\Big| = o_P(1),$$
$$\sup_{\Omega_n}\Big|\int \hat\phi_{W_i}(W_iX)X_kX_{k'}\,dP_n(X) - \int \phi_{Pi}(W_{Pi}X)X_kX_{k'}\,dP\Big| = o_P(1)$$
and
$$\sup_{\Omega_n}\Big|\int \hat\phi_{W_i}(W_iX)X_k\kappa(W_jX)\,dP_n(X) - \int \phi_{Pi}(W_{Pi}X)X_k\kappa(W_{Pj}X)\,dP\Big| = o_P(1).$$
Each of these can be verified using Lemmas A.1, A.2 and A.6 and conditions C4, C6 and C7, with arguments similar to those used in proving (6.3). □
7. Conclusion.
In this paper, we viewed the classical ICA model within the framework of semiparametric models and obtained an asymptotically efficient estimator of the unmixing matrix by solving an approximate efficient score equation. The main difference between this new method and popular parametric ICA methods is that we estimate the density score functions of the hidden sources adaptively. A variety of simulations have illustrated the statistical efficiency of this estimator in comparison with state-of-the-art ICA algorithms.

APPENDIX
A.1. Some useful formulas.
Let $v = wW_P^{-1}$. Then $wX = vS$. If $v_k \ne 0$ for some $k \in \{1, \dots, m\}$, then by the classical convolution formula, we have
$$f_w(t) = \int \frac{1}{|v_k|}r_k\Big(\frac{t - \sum_{j\ne k}v_js_j}{v_k}\Big)\prod_{j\ne k}r_j(s_j)\,ds_j = E\Big\{\frac{1}{|v_k|}r_k\Big(\frac{t - \sum_{j\ne k}v_jS_j}{v_k}\Big)\Big\}.$$
Since $f_w(t)$ is a marginal density function of $(vS, S_j : 1 \le j \ne k \le m)$, by a standard formula (see, e.g., [4]),
$$\phi_w(t) = -\frac{1}{v_k}E\Big\{\frac{r_k'}{r_k}\Big(\frac{t - \sum_{j\ne k}v_jS_j}{v_k}\Big)\Big|\,vS = t\Big\} = \frac{1}{v_k}E[\phi_k(S_k)\,|\,vS = t] \qquad (A.1)$$
and further calculation gives
$$\frac{\partial}{\partial t}\phi_w(t) = \phi_w^2(t) - \frac{1}{v_k^2}E\Big\{\frac{r_k''}{r_k}\Big(\frac{t - \sum_{j\ne k}v_jS_j}{v_k}\Big)\Big|\,vS = t\Big\}. \qquad (A.2)$$
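As a quick sanity check of (A.1) (our own worked example, not in the original), take all sources standard Gaussian, so that $\phi_k(s) = s$:

```latex
% For S ~ N(0, I_m) we have vS ~ N(0, \|v\|^2) and
% E[S_k \mid vS = t] = (v_k/\|v\|^2)\, t, so (A.1) gives
\phi_w(t) \;=\; \frac{1}{v_k}\,E[\phi_k(S_k)\mid vS = t]
          \;=\; \frac{1}{v_k}\cdot\frac{v_k}{\|v\|^2}\,t
          \;=\; \frac{t}{\|v\|^2},
% which is exactly -f_w'(t)/f_w(t) for the N(0, \|v\|^2) density f_w.
```

The same computation differentiated in $t$ recovers (A.2), since for this example $\phi_w'(t) = 1/\|v\|^2$ and $r_k''/r_k(s) = s^2 - 1$.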
A.2. Calculation of the efficient score.
To formulate the tangent space defined in Section 2.1 for each nuisance parameter $r_i$, taking the smooth submodel $\{r_i(\cdot;t) = r_i(\cdot)e^{th_i(\cdot)} : |t| < 1\}$ for some $h_i \in L_2(P)$, we have
$$\lim_{t\to 0}\frac{\partial}{\partial t}\{\log p_X(x, W, r_1, \dots, r_i(\cdot;t), \dots, r_m)\} = h_i(W_ix).$$
Since $r_i(\cdot;t)$ needs to be a probability density function satisfying the mean and absolute median assumptions, $h_i$ must satisfy $E[h_i(S_i)] = 0$, $E[h_i(S_i)S_i] = 0$ and $E[h_i(S_i)\kappa(S_i)] = 0$, but is otherwise arbitrary. Thus, the tangent space for $r_i$ can be expressed as
$$TS_i = \{h_i(W_ix) \in L_2(P) : E[h_i(S_i)] = 0,\ E[h_i(S_i)S_i] = 0,\ E[h_i(S_i)\kappa(S_i)] = 0\}.$$
Note that the tangent spaces $\{TS_i : 1 \le i \le m\}$ are perpendicular to each other since the $S_i$ are mutually independent. Thus, any projection onto the tangent space of $(r_1, \dots, r_m)$ is equal to the sum of the partial projections onto each $TS_i$. The efficient score of $W$ becomes
$$l^*(\cdot, W, \Phi) = \dot l_W - \sum_{i=1}^m\pi(\dot l_W\,|\,TS_i),$$
where $\pi(\cdot\,|\,L)$ denotes the projection operator in $L_2(P_{(W,r_1,\dots,r_m)})$ onto $L$. Since each off-diagonal entry of $I_{m\times m} - \Phi(S)S^T$ is perpendicular to all $TS_i$, and each diagonal entry of it is perpendicular to all but one $TS_i$, $l^*$ can be obtained as in (2.4)–(2.8) of Section 2.3 by using the fact that $TS_i^\perp = \overline{\operatorname{span}}\{1, S_i, \kappa(S_i)\}$.

A.3. Some properties of cubic B-splines.
Let $\xi_1 < \xi_2 < \cdots < \xi_N$ be fixed knots. The first-order B-spline basis functions based on these knots can be expressed as
$$B_i^1(x) = I(x \in [\xi_i, \xi_{i+1})), \qquad i = 1, \dots, N-1,$$
and the $k$th-order B-spline basis functions can be obtained recursively ($k \ge 2$) by
$$B_i^k(x) = \frac{x - \xi_i}{\xi_{i+k-1} - \xi_i}B_i^{k-1}(x) + \frac{\xi_{i+k} - x}{\xi_{i+k} - \xi_{i+1}}B_{i+1}^{k-1}(x),$$
for $i = 1, \dots, N-k$. Each $B_i^k(x)$ is differentiable w.r.t. $x$ up to order $k-2$, and
$$\frac{d}{dx}B_i^k(x) = \frac{k-1}{\xi_{i+k-1} - \xi_i}B_i^{k-1}(x) - \frac{k-1}{\xi_{i+k} - \xi_{i+1}}B_{i+1}^{k-1}(x).$$
We use the 4th-order, so-called cubic, B-splines $\{B_i^4 : 1 \le i \le N-4\}$ with equally spaced knots, that is, $\xi_{i+1} - \xi_i = \delta$ ($i = 1, \dots, N-1$) for some algorithm-determined $\delta$. For simplicity, the superscript in $B_i^4$ is omitted below. The following properties of cubic B-splines will be used frequently in proving the lemmas below (see [11, 23] for details):

(I) $0 \le B_i(x) < 1$; $B_i(x)B_j(x) = 0$ if $|i-j| > 3$.
(II) $|\frac{d}{dx}B_i(x)| < c\delta^{-1}$, $|\frac{d^2}{dx^2}B_i(x)| < c\delta^{-2}$.
(III) $\sum_{i=1}^N[B_i(x)]^2 < 1$; $\sum_{i=1}^N[\frac{d}{dx}B_i(x)]^2 < c\delta^{-2}$.
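The first-order basis, the recursion and the derivative formula above translate directly into code. The following sketch (our own check, with arbitrary knot values) evaluates the cubic ($k = 4$) basis by the recursion and verifies the derivative formula against a central finite difference, along with property (I) and the standard partition of unity (not listed above, but a classical B-spline fact on the interior of the knot range).

```python
import numpy as np

def bspline_ord(x, xi, k, i):
    """Order-k B-spline B_i^k(x) via the Cox-de Boor recursion (0-indexed knots xi)."""
    if k == 1:
        return float(xi[i] <= x < xi[i + 1])
    left = (x - xi[i]) / (xi[i + k - 1] - xi[i]) * bspline_ord(x, xi, k - 1, i)
    right = (xi[i + k] - x) / (xi[i + k] - xi[i + 1]) * bspline_ord(x, xi, k - 1, i + 1)
    return left + right

def bspline_ord_deriv(x, xi, k, i):
    """d/dx B_i^k(x) via the derivative formula (valid for k >= 2)."""
    return ((k - 1) / (xi[i + k - 1] - xi[i]) * bspline_ord(x, xi, k - 1, i)
            - (k - 1) / (xi[i + k] - xi[i + 1]) * bspline_ord(x, xi, k - 1, i + 1))

delta = 0.5
xi = np.arange(0.0, 6.0 + delta, delta)   # equally spaced knots (arbitrary example)
n_basis = len(xi) - 4                     # number of cubic (order-4) splines

x0 = 2.3                                  # interior point, away from the knots
vals = [bspline_ord(x0, xi, 4, i) for i in range(n_basis)]

# derivative formula vs. central finite difference
h = 1e-5
for i in range(n_basis):
    fd = (bspline_ord(x0 + h, xi, 4, i) - bspline_ord(x0 - h, xi, 4, i)) / (2 * h)
    assert abs(bspline_ord_deriv(x0, xi, 4, i) - fd) < 1e-6
```

At any interior point, exactly four cubic splines are nonzero (each has support over four knot intervals), which is what makes the matrices $A_n(W_k)$ banded.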
A.4. Supporting lemmas for Propositions 6.1–6.4.

In this subsection, we prove all of the lemmas used in the proofs of Propositions 6.1–6.4. Recall that for each $\phi_k$ ($k = 1, \dots, m$), we have an interval $[b_{nk,1}, b_{nk,2}]$ and $n_k$ cubic B-spline basis functions defined on it using equally spaced knots, say $B_n^{(k)} = (B_{n1}^{(k)}, \dots, B_{nn_k}^{(k)})^T$, as in Section 2.4. Thus, we have constructed a sequence of sieves $\mathcal G_n^{(k)}$ using $B_n^{(k)}$ as basis functions. For any $W \in \Omega_n$, we have a class of estimates $\hat\phi_{W_k} \in \mathcal G_n^{(k)}$ of $\phi_{W_k}$, as defined in Section 2.4. For convenience, the superscript of $B_n^{(k)}$ is often omitted below.

Let $\Omega_n^{(k)} = \{W_k : W \in \Omega_n\}$ for $k = 1, \dots, m$. We also need an intermediate approximation function $\hat{\hat\phi}_{W_k} \in \mathcal G_n^{(k)}$, defined as follows. For $w \in \Omega_n^{(k)}$,
$$\hat{\hat\phi}_w = \gamma(w)^TB_n^{(k)}, \qquad (A.3)$$
where $\gamma(w) = A(w)^{-1}D(w)$ with $A(w) = \int B_n^{(k)}B_n^{(k)T}(wX)\,dP$ and $D(w) = \int [B_n^{(k)}]'(wX)\,dP$. Note that the subscript $w$ of $\hat{\hat\phi}_w$ should always be associated with $\Omega_n^{(k)}$ for some $k \in \{1, \dots, m\}$; similarly for $\hat\phi_w$.

In the following, $c$ denotes a constant (depending only on the population law $P$) whose exact value may vary from place to place (even within a single line). For a column vector $x \in \mathbb R^m$, $\|x\| = \sqrt{x^Tx}$. For an $m\times m$ real matrix $A$, $\|A\|_1 = \max_{1\le i\le m}\|A_i\|$, $\|A\|_2 = \max_{x\in\mathbb R^m, \|x\|=1}\|Ax\|$ and $\|A\|_F = \sqrt{\operatorname{tr}(A^TA)}$; thus, $\|A\|_2 \le \|A\|_F$.

The following Lemmas A.1–A.8 hold under the conditions of Theorem 3.1. Jin [17] obtained results similar to Lemma A.2 and Lemmas A.5–A.7 concerning the B-spline approximation, but in generally different settings.

Lemma A.1.
For sufficiently large $n$,
$$\sup_{w\in\Omega_n^{(k)}}|f_w|_\infty < \infty, \qquad \sup_{w\in\Omega_n^{(k)}}|f_w'|_\infty < \infty,$$
$$\inf_{w\in\Omega_n^{(k)}}\min_{t\in[b_{nk,1},b_{nk,2}]}f_w(t) \ge c\delta_n \quad\text{and}\quad \sup_{w\in\Omega_n^{(k)}}P(wX \notin [b_{nk,1}, b_{nk,2}]) = o(1).$$

Proof.
The first two inequalities follow easily from C3. The remaining two are proved as follows. For any $w \in \Omega_n^{(k)}$, $\|w - W_{Pk}\| \le \varepsilon_n$. If we let $v = wW_P^{-1}$, then $|v_j| \to 0$ for $1 \le j \ne k \le m$ and $v_k \to 1$ as $n \to \infty$. Since $f_w(t) = E[\frac{1}{|v_k|}r_k(\frac{t - \sum_{j\ne k}v_jS_j}{v_k})]$, we can consider the right-hand side as a function (say $h$) of $v$. By a first-order Taylor expansion,
$$|f_w(t) - r_k(t)| \le \varepsilon_n\|W_P^{-1}\|\Big(\sum_{j=1}^m\sup_{w\in\Omega_n^{(k)}}\Big|\frac{\partial}{\partial v_j}h(v)\Big|\Big) \le c\varepsilon_n,$$
where, by direct calculation using C3, $|\frac{\partial}{\partial v_j}h(v)|$ is uniformly bounded over $w \in \Omega_n^{(k)}$ and $t \in \mathbb R$. Recall that $r_k(t) \ge c\delta_n$ for $t \in [b_{nk,1}, b_{nk,2}]$ and $\varepsilon_n \ll \delta_n$; thus,
$$\inf_{w\in\Omega_n^{(k)}}\min_{t\in[b_{nk,1},b_{nk,2}]}f_w(t) \ge c\delta_n.$$
Finally,
$$P(wX \in [b_{nk,1}, b_{nk,2}]) = \int_{[b_{nk,1},b_{nk,2}]}f_w(t)\,dt \ge \int_{[b_{nk,1},b_{nk,2}]}(r_k(t) - c\varepsilon_n)\,dt = P(S_k \in [b_{nk,1}, b_{nk,2}]) - c\varepsilon_n(b_{nk,2} - b_{nk,1}).$$
Since $\varepsilon_n(b_{nk,2} - b_{nk,1}) = o(1)$ and $P(S_k \in [b_{nk,1}, b_{nk,2}]) \to 1$, we have $\inf_{w\in\Omega_n^{(k)}}P(wX \in [b_{nk,1}, b_{nk,2}]) \to 1$. □

Recall the definitions of $\hat\phi_{W_k}$, $\gamma_n(W_k)$, $A_n(W_k)$ and $D_n(W_k)$ in Section 2.4 and that of $\gamma(w) = [A(w)]^{-1}D(w)$ in (A.3). The lemmas below give their uniform convergence rates.

Lemma A.2.
The following hold for $k, i = 1, \dots, m$:

(1) $\|D(w)\| \le c\delta_nn_k^{1/2}$ and $c_1\delta_n \le \operatorname{eig}(A(w)) \le c_2\delta_n$ for $w \in \Omega_n^{(k)}$;
(2) $\sup_{w\in\Omega_n^{(k)}}\|D_n(w) - D(w)\| = O_P((n_k\log n_k)^{1/2}(n\delta_n)^{-1/2})$;
(3) $\sup_{w\in\Omega_n^{(k)}}\|D_n(w)\| = O_p(\delta_nn_k^{1/2})$;
(4) $\sup_{w\in\Omega_n^{(k)}}\|A_n(w) - A(w)\| = O_P((\delta_n\log n_k)^{1/2}n^{-1/2})$;
(5) $\sup_{w\in\Omega_n^{(k)}}\|A_n^{-1}(w)\| = O_P(\delta_n^{-1})$;
(6) $\sup_{w\in\Omega_n^{(k)}}\|\gamma_n(w)\| = O_p(n_k^{1/2}\delta_n^{-1})$;
(7) $\sup_{\Omega_n^{(k)}}\|\gamma_n(w) - \gamma(w)\| = o_P(1)$;
(8) $\sup_{\Omega_n^{(k)}}\|\frac{\partial}{\partial w_i}\{A_n(w)\}\| = O_P(\delta_n^{-1/2})$;
(9) $\sup_{\Omega_n^{(k)}}\|\frac{\partial}{\partial w_i}\{D_n(w)\}\| = O_P(\delta_n^{-2})$;
(10) $\sup_{\Omega_n^{(k)}}\|\frac{\partial}{\partial w_i}\{\gamma_n(w)\}\| = o_P(\delta_n^{-5/2}n_k^{1/2})$.

Proof.
The strategy is as follows. First, (3) is implied by (1) and (2). Second, (6) is implied by (3) and (5). Third, since
$$\frac{\partial}{\partial w_i}\{\gamma_n(w)\} = -A_n^{-1}(w)\frac{\partial}{\partial w_i}\{A_n(w)\}\gamma_n(w) + A_n^{-1}(w)\frac{\partial}{\partial w_i}\{D_n(w)\},$$
(10) is implied by (5), (6), (8) and (9). Hence, it suffices to prove (1), (2), (4), (5), (7), (8) and (9). Further, the proofs of (2) and (4) are similar, as are the proofs of (8) and (9); thus, the proofs of (4) and (9) are omitted.

Proof of (1).
By taking derivatives of the cubic B-splines, $B'_{ni}(t) = \delta_n^{-1}(\tilde B_{ni}(t) - \tilde B_{n,i+1}(t))$, where the $\tilde B_{ni}$ are the third-order B-splines defined on the same knots, $i = 1, \dots, n_k$. We then have
$$|D_i(w)| = \delta_n^{-1}\Big|\int(\tilde B_{n,i}(t) - \tilde B_{n,i+1}(t))f_w(t)\,dt\Big| = \delta_n^{-1}\Big|\int \tilde B_{n,i}(t)(f_w(t) - f_w(t + \delta_n))\,dt\Big| \le |f_w'|_\infty\int \tilde B_{n,i}(t)\,dt \le \delta_n|f_w'|_\infty.$$
So the first result holds by Lemma A.1. The second result follows from Lemma 5.1 of [17]. □
Proof of (2).
Note that
$$P\Big(\sup_{w\in\Omega_n^{(k)}}\|D_n(w) - D(w)\| \ge t\Big) = P\Big(\sup_{w\in\Omega_n^{(k)}}\Big\|\int B_n'(wX)\,d(P_n - P)\Big\| \ge t\Big) \le \sum_{i=1}^{n_k}P\Big(\sup_{w\in\Omega_n^{(k)}}\Big|\int B'_{n,i}(wX)\,d(P_n - P)\Big| \ge \frac{t}{\sqrt{n_k}}\Big).$$
For a fixed pair $(i,k)$, let $\mathcal F_n = \{g_w(x) = B'_{n,i}(wx) : w \in \Omega_n^{(k)}\}$, a class of functions indexed by $\Omega_n^{(k)}$. Then for $w, v \in \Omega_n^{(k)}$, $|g_w(x) - g_v(x)| \le c\delta_n^{-2}\|w - v\|\|x\|$ by property (II) of cubic B-splines. Now, the index set $\Omega_n^{(k)}$ can be covered by $N = c\varepsilon_n^mu^{-m}$ balls of radius $u$ and, for any $w, v$ in the same ball, $E[(g_w(X) - g_v(X))^2] \le c\delta_n^{-4}u^2$. Further, by property (II) of cubic B-splines, $\sup_{g_w\in\mathcal F_n}|g_w|_\infty \le c\delta_{nk}^{-1}$. Then, for $c\max(\delta_n^{-1}, \varepsilon_n\delta_n^{-2}) \le a \le c\sqrt n$,
$$P\Big(\sup_{w\in\Omega_n^{(k)}}\sqrt n\Big|\int B'_{n,i}(wX)\,d(P_n - P)\Big| \ge a\Big) \le c\exp(-ca^2\delta_n).$$
This follows from Theorem 5.11 of van de Geer ([24], page 75), which generalizes Hoeffding's inequality and gives the uniform convergence rate for a class of functions in terms of its size, measured by so-called bracketing entropy; see [24] for details. Note that $\varepsilon_n \ll \delta_n$ by C7, so for $t \ge cn_k^{1/2}n^{-1/2}\delta_n^{-1}$ we have
$$P\Big(\sup_{w\in\Omega_n^{(k)}}\|D_n(w) - D(w)\| \ge t\Big) \le n_k\exp(-ct^2n\delta_nn_k^{-1}).$$
Thus,
$$\sup_{\Omega_n^{(k)}}\|D_n(w) - D(w)\| = O_p\Big(\sqrt{\frac{n_k\log n_k}{n\delta_n}}\Big). \qquad\square$$

Proof of (5).
Since $A_n^{-1} = (A + A_n - A)^{-1} = A^{-1}(I + (A_n - A)A^{-1})^{-1}$, by (1) and (4) (omitting the index $w$) we have
$$\sup_{w\in\Omega_n^{(k)}}\|A_n - A\|\|A^{-1}\| = o_p(1);$$
hence,
$$\sup_{w\in\Omega_n^{(k)}}\|A_n^{-1}\| \le \sup_{w\in\Omega_n^{(k)}}\|A^{-1}\|(1 - \|A_n - A\|\|A^{-1}\|)^{-1} = O_P(\delta_n^{-1}).$$
Here, we use the matrix norm inequality $\|(I + A)^{-1}\| \le (1 - \|A\|)^{-1}$ for any square matrix $A$ with $\|A\| < 1$, where $I$ is the identity matrix. □

Proof of (7).
Since, by (1)–(5),
$$\sup_{w\in\Omega_n^{(k)}}\|\gamma_n(w) - \gamma(w)\| = \sup_{w\in\Omega_n^{(k)}}\|A^{-1}(D_n - D) - A_n^{-1}(A_n - A)A^{-1}D_n\| = O_p\Big(\delta_n^{-1}\sqrt{\frac{n_k\log n_k}{n\delta_n}} + \delta_n^{-1}\sqrt{\frac{\delta_n\log n_k}{n}}\,\delta_n\sqrt{n_k}\,\delta_n^{-1}\Big) = o_P(1),$$
where the last equality follows from C7. □

Proof of (8).
Note that the partial derivative is (omitting $(i)$ from $B_n^{(i)}$)
$$\frac{\partial}{\partial w_k}\{A_n(w)\}_{jj'} = \int(B_{nj}B'_{nj'} + B'_{nj}B_{nj'})(wX)X_k\,dP_n.$$
By the Cauchy–Schwarz inequality,
$$\Big|\int(B_{nj}B'_{nj'} + B'_{nj}B_{nj'})(wX)X_k\,dP_n\Big|^2 \le \int(B_{nj}B'_{nj'} + B'_{nj}B_{nj'})^2(wX)\,dP_n\int X_k^2\,dP_n.$$
Similarly to the proof of (2), we can calculate the uniform convergence rate
$$\sup_{1\le j,j'\le n_i,\,|j-j'|\le 3}\ \sup_{w\in\Omega_n^{(i)}}\int(B_{nj}B'_{nj'} + B'_{nj}B_{nj'})^2(wX)\,d(P_n - P) = o_p(1).$$
Further, from Lemma A.1, $\sup_{w\in\Omega_n^{(k)}}|f_w|_\infty$ is bounded, so after algebraic expansion we have
$$\sup_{w\in\Omega_n^{(i)}}\int(B_{nj}B'_{nj'} + B'_{nj}B_{nj'})^2(wX)\,dP \le c\delta_n^{-1}.$$
Thus, $|\frac{\partial}{\partial w_k}\{A_n(w)\}_{jj'}| \le c\delta_n^{-1/2}$. For cubic B-splines, $B_{nj}(x)B'_{nj'}(x) \equiv 0$ if $|j - j'| > 3$; thus, each row of $\frac{\partial}{\partial w_k}\{A_n(w)\}$ has at most seven nonzero elements. So,
$$\sup_{\Omega_n^{(i)}}\Big\|\frac{\partial}{\partial w_k}\{A_n(w)\}\Big\| \le c\sup_{\Omega_n^{(i)}}\max_{j,j'}\Big|\frac{\partial}{\partial w_k}\{A_n(w)\}_{jj'}\Big| = O_p(\delta_n^{-1/2}). \qquad\square$$

Lemma A.3. $\|\int B_n^{(i)}(S_i)S_j\,dP_n\| = O_P(n^{-1/2})$, where $S_i = W_{Pi}X$, $1 \le i \ne j \le m$.

Proof. (We omit $(i)$ in $B_n^{(i)}$, $B_{nk}^{(i)}$.) Since $S_i$ and $S_j$ are independent and $E(S_j) = 0$,
$$E\Big(\Big\|\int B_n(S_i)S_j\,dP_n\Big\|^2\Big) = \frac1n E\Big(\sum_{k=1}^{n_i}B_{nk}^2(S_i)S_j^2\Big) \le \frac1n E(S_j^2). \qquad\square$$

Lemma A.4. $\sup_{\Omega_n}\|\int B_n^{(i)}(W_iX)W_jX\,dP_n\| = O_p(\varepsilon_n\delta_n^{-1}n_i^{1/2})$ for $1 \le i \ne j \le m$.

Proof.
Similarly to the proof of Lemma A.2(2), we have
$$\sup_{\Omega_n}\Big\|\int B_n(W_iX)W_jX\,d(P_n - P)\Big\| = O_P\Big(\sqrt{\frac{\delta_nn_i\log n_i}{n}}\Big).$$
Further, note that $|B_{nk}(x) - B_{nk}(y)| \le c\delta_n^{-1}|x - y|$ and that $\int B_{nk}(W_{Pi}X)W_{Pj}X\,dP = E[B_{nk}(S_i)]E[S_j] = 0$, so for any $W \in \Omega_n$,
$$\Big\|\int B_n(W_iX)W_jX\,dP\Big\|^2 = \sum_{k=1}^{n_i}\Big|\int(B_{nk}(W_iX)W_jX - B_{nk}(W_{Pi}X)W_{Pj}X)\,dP\Big|^2 \le \sum_{k=1}^{n_i}(c\delta_n^{-1}E[\|X\|^2]\|W_i - W_{Pi}\|\|W_j\| + E\|X\|\|W_j - W_{Pj}\|)^2 \le c\varepsilon_n^2\delta_n^{-2}n_i.$$
Thus, $\sup_{\Omega_n}\|\int B_n^{(i)}(W_iX)W_jX\,dP_n\| = O_P(\sqrt{\delta_nn_i\log n_i/n} + \varepsilon_n\delta_n^{-1}n_i^{1/2})$. □

Lemma A.5. $E\{(\hat{\hat\phi}_{W_i}(W_iX) - \phi_{W_i,n}(W_iX))^2\} \le c\delta_n^6|\phi'''_{W_i,n}|_\infty^2$.

Proof.
Let $d(\phi_{W_i,n}, \mathcal G_n) = \inf_{h\in\mathcal G_n}|\phi_{W_i,n} - h|_\infty$. By a Jackson-type theorem [11], $d(\phi_{W_i,n}, \mathcal G_n) \le c\delta_n^3|\phi'''_{W_i,n}|_\infty$. Thus, the result follows from
$$E\{(\hat{\hat\phi}_{W_i}(W_iX) - \phi_{W_i,n}(W_iX))^2\} = \inf_{h\in\mathcal G_n^{(i)}}E\{(h(W_iX) - \phi_{W_i,n}(W_iX))^2\} \le d(\phi_{W_i,n}, \mathcal G_n)^2. \qquad\square$$

Lemma A.6. $|\hat{\hat\phi}_{W_i} - \phi_{W_i,n}|_\infty \le c\delta_n^{5/2}|\phi'''_{W_i,n}|_\infty$; $|\hat{\hat\phi}'_{W_i} - \phi'_{W_i,n}|_\infty \le c\delta_n^{3/2}|\phi'''_{W_i,n}|_\infty$.

Proof.
By Theorem XII.4 of de Boor [11], there exists a quasi-interpolant $\tilde\phi_{W_i}(t) = a^TB_n(t)$, for some $a \in \mathbb R^{n_i}$, such that $\tilde\phi_{W_i}$ simultaneously approximates $\phi_{W_i,n}$ and its first derivative to optimal order; that is,
$$|\tilde\phi_{W_i} - \phi_{W_i,n}|_\infty \le c|\phi'''_{W_i,n}|_\infty\delta_n^3 \quad\text{and}\quad |\tilde\phi'_{W_i} - \phi'_{W_i,n}|_\infty \le c|\phi'''_{W_i,n}|_\infty\delta_n^2.$$
So $E(\tilde\phi_{W_i}(W_iX) - \phi_{W_i,n}(W_iX))^2 \le c|\phi'''_{W_i,n}|_\infty^2\delta_n^6$. Combining this with Lemma A.5, we have
$$E(\hat{\hat\phi}_{W_i} - \tilde\phi_{W_i})^2(W_iX) \le 2E(\tilde\phi_{W_i} - \phi_{W_i,n})^2(W_iX) + 2E(\hat{\hat\phi}_{W_i} - \phi_{W_i,n})^2(W_iX) \le c|\phi'''_{W_i,n}|_\infty^2\delta_n^6.$$
Let $\operatorname{coef}(\tilde\phi_{W_i})$ and $\operatorname{coef}(\hat{\hat\phi}_{W_i})$ be the coefficient vectors of $B_n$ in $\tilde\phi_{W_i}$ and $\hat{\hat\phi}_{W_i}$, respectively. Then
$$E(\hat{\hat\phi}_{W_i} - \tilde\phi_{W_i})^2(W_iX) = E[((\operatorname{coef}(\tilde\phi_{W_i}) - \operatorname{coef}(\hat{\hat\phi}_{W_i}))^TB_n)^2(W_iX)] \ge \lambda_n\|\operatorname{coef}(\tilde\phi_{W_i}) - \operatorname{coef}(\hat{\hat\phi}_{W_i})\|^2,$$
where $\lambda_n$ is the minimum eigenvalue of $A(W_i) = E[B_nB_n^T(W_iX)]$. By Lemma A.2, $\lambda_n \ge c\delta_n$. Thus,
$$|\hat{\hat\phi}_{W_i} - \tilde\phi_{W_i}|_\infty \le \|\operatorname{coef}(\tilde\phi_{W_i}) - \operatorname{coef}(\hat{\hat\phi}_{W_i})\| \le c|\phi'''_{W_i,n}|_\infty\delta_n^{5/2}.$$
Hence, $\sup_{\Omega_n}|\hat{\hat\phi}_{W_i} - \phi_{W_i,n}|_\infty \le \sup_{\Omega_n}c|\phi'''_{W_i,n}|_\infty\delta_n^{5/2}$. Further, observing that $|B'_{nk}|_\infty \le c\delta_n^{-1}$, we have
$$|\hat{\hat\phi}'_{W_i} - \tilde\phi'_{W_i}|_\infty \le c\|\operatorname{coef}(\tilde\phi_{W_i}) - \operatorname{coef}(\hat{\hat\phi}_{W_i})\|\delta_n^{-1} \le c|\phi'''_{W_i,n}|_\infty\delta_n^{3/2}.$$
Thus, $|\hat{\hat\phi}'_{W_i} - \phi'_{W_i,n}|_\infty \le |\tilde\phi'_{W_i} - \phi'_{W_i,n}|_\infty + |\hat{\hat\phi}'_{W_i} - \tilde\phi'_{W_i}|_\infty \le c|\phi'''_{W_i,n}|_\infty\delta_n^{3/2}$. □

Lemma A.7. $\int(\hat\phi_{W_{Pk}}(S_k) - \phi_k(S_k))^2\,dP_n = o_p(1)$.

Proof.
Observe that
$$\tfrac13\int(\hat\phi_{W_{Pk}}(S_k) - \phi_k(S_k))^2\,dP_n \le \int(\hat\phi_{W_{Pk}}(S_k) - \hat{\hat\phi}_{W_{Pk}}(S_k))^2\,dP_n + \int(\hat{\hat\phi}_{W_{Pk}}(S_k) - \phi_{k,n}(S_k))^2\,dP_n + \int\phi_k^2(S_k)I(S_k \notin [b_{nk,1}, b_{nk,2}])\,dP_n.$$
The remainder of the proof is similar to the use of (6.1) in proving Proposition 6.1, using Lemma A.2 and Lemma A.6. □
Lemma A.8.
Let $\{p(\cdot, \theta, \eta) : \theta \in \Omega \subset \mathbb R^d, \eta \in \mathcal E\}$ be a parametric or semiparametric model, where $\theta$ is the parameter of interest. Suppose that moderate regularity conditions are satisfied and that $l^*(\cdot, \theta, \eta)$ is the efficient score function of $\theta$, as defined in [5]. Then
$$\int\frac{\partial}{\partial\theta}l^*(X, \theta, \eta)\,dP_{(\theta,\eta)} = -\int l^*l^{*T}(X, \theta, \eta)\,dP_{(\theta,\eta)}.$$

Proof.
We only prove this for the parametric case $\mathcal E \subset \mathbb R^m$. Let $I(\theta, \eta)$ be the information matrix of $(\theta, \eta)$, with blocks $I_{ij}$ ($i, j = 1, 2$). By classical likelihood theory (e.g., Proposition 2.4.1 of [5]),
$$l^*(\cdot, \theta, \eta) = \dot l_1 - (I_{12}I_{22}^{-1})(\theta, \eta)\dot l_2.$$
Here, $\dot l_1$ and $\dot l_2$ are the partial derivatives of $l(\cdot, \theta, \eta) \equiv \log p(\cdot, \theta, \eta)$ w.r.t. $\theta$ and $\eta$, respectively. Similarly, $\ddot l_{ij}$ ($i, j = 1, 2$) are defined as the second-order derivatives of $l(\cdot, \theta, \eta)$ w.r.t. $(\theta, \eta)$. Thus,
$$\frac{\partial}{\partial\theta}l^*(X, \theta, \eta) = \ddot l_{11} - (I_{12}I_{22}^{-1})(\theta, \eta)\ddot l_{21} - \frac{\partial}{\partial\theta}\{(I_{12}I_{22}^{-1})(\theta, \eta)\}\dot l_2.$$
Since $\int\dot l_2(X, \theta, \eta)\,dP_{(\theta,\eta)} = 0$, we have
$$\int\frac{\partial}{\partial\theta}l^*(X, \theta, \eta)\,dP_{(\theta,\eta)} = \int\ddot l_{11}\,dP_{(\theta,\eta)} - (I_{12}I_{22}^{-1})(\theta, \eta)\int\ddot l_{21}\,dP_{(\theta,\eta)}.$$
Since the information matrix satisfies $I_{ij} = -\int\ddot l_{ij}(X, \theta, \eta)\,dP_{(\theta,\eta)}$, $i, j = 1, 2$, and $\int l^*l^{*T}\,dP = I_{11} - I_{12}I_{22}^{-1}I_{21}$ (see Proposition 2.4.1 of [5], page 32), the result follows. For the semiparametric case, the reader is referred to [5] for a generalization of this proof. □

Acknowledgments.
This paper was presented at the Joint Statistical Meetings in San Francisco, USA, August 2003. The results were reported in Chen's Ph.D. thesis. We would like to thank Francis Bach at UC Berkeley for sharing his simulation code and Scott Vander Wiel for helpful comments. We also wish to thank the Editors and four referees for useful suggestions that have helped significantly to improve the paper.

REFERENCES

[1] Amari, S. (2002). Independent component analysis and method of estimating functions. IEICE Trans. Fundamentals E85-A.
[2] Amari, S. and Cardoso, J.-F. (1997). Blind source separation: semiparametric statistical approach. IEEE Trans. Signal Process.
[3] Bach, F. and Jordan, M. (2002). Kernel independent component analysis. J. Machine Learning Res.
[4] Bickel, P. and Doksum, K. (2001). Mathematical Statistics I, 2nd ed. Prentice-Hall.
[5] Bickel, P., Klaassen, C., Ritov, Y. and Wellner, J. (1998). Efficient and Adaptive Estimation for Semiparametric Models. Springer, New York.
[6] Cardoso, J.-F. (1998). Blind signal separation: statistical principles. Proc. IEEE.
[7] Cardoso, J.-F. (1999). High-order contrasts for independent component analysis. Neural Comput.
[8] Chen, A. and Bickel, P. J. (2005). Consistent independent component analysis and prewhitening. IEEE Trans. Signal Process.
[9] Comon, P. (1994). Independent component analysis, a new concept? Signal Process.
[10] Cox, D. D. (1985). A penalty method for nonparametric estimation of the logarithmic derivative of a density function. Ann. Inst. Statist. Math.
[11] de Boor, C. (2001). A Practical Guide to Splines, revised ed. Springer. MR1900298
[12] Eriksson, J. and Koivunen, V. (2003). Characteristic-function based independent component analysis. Signal Process.
[13] Hastie, T. and Tibshirani, R. (2002). Independent component analysis through product density estimation. Technical report, Dept. Statistics, Stanford Univ.
[14] Huber, P. J. (1985). Projection pursuit. Ann. Statist.
[15] Hyvärinen, A. (1999). Fast and robust fixed-point algorithms for independent component analysis. IEEE Trans. Neural Networks.
[16] Hyvärinen, A., Karhunen, J. and Oja, E. (2001). Independent Component Analysis. Wiley, New York.
[17] Jin, K. (1992). Empirical smoothing parameter selection in adaptive estimation. Ann. Statist.
[18] Kagan, A., Linnik, Y. and Rao, C. (1973). Characterization Problems in Mathematical Statistics. Wiley, New York. MR0346969
[19] Lee, T.-W., Girolami, M. and Sejnowski, T. (1999). Independent component analysis using an extended infomax algorithm for mixed subgaussian and supergaussian sources. Neural Comput.
[20] Murphy, S. and van der Vaart, A. (2000). On profile likelihood. J. Amer. Statist. Assoc.
[21] Pham, D. T. and Garrat, P. (1997). Blind separation of mixture of independent sources through a quasi-maximum likelihood approach. IEEE Trans. Signal Process.
[22] Samarov, A. and Tsybakov, A. (2002). Nonparametric independent component analysis. Bernoulli.
[23] Truong, Y. K., Kooperberg, C. and Stone, C. J. (2005). Statistical Modeling with Spline Functions: Methodology and Theory. Springer, New York.
[24] van de Geer, S. (2000). Applications of Empirical Process Theory. Cambridge Univ. Press. MR1739079
[25] Vigario, R., Jousmaki, V., Hamalainen, M., Hari, R. and Oja, E. (1998). Independent component analysis for identification of artifacts in magnetoencephalographic recordings. In Advances in Neural Information Processing Systems.
[26] Vlassis, N. and Motomura, Y. (2001). Efficient source adaptivity in independent component analysis. IEEE Trans. Neural Networks.
600 Mountain Avenue, Room 2c-281Murray Hill, New Jersey 07974USAE-mail: [email protected]
367 Evans Hall, Berkeley, California 94720, USA. E-mail: