The Annals of Statistics
Institute of Mathematical Statistics, 2006
EFFICIENT INDEPENDENT COMPONENT ANALYSIS
By Aiyou Chen and Peter J. Bickel
Bell Labs and University of California, Berkeley
Independent component analysis (ICA) has been widely used for blind source separation in many fields such as brain imaging analysis, signal processing and telecommunication. Many statistical techniques based on M-estimates have been proposed for estimating the mixing matrix. Recently, several nonparametric methods have been developed, but in-depth analysis of asymptotic efficiency has not been available. We analyze ICA using semiparametric theories and propose a straightforward estimate based on the efficient score function by using B-spline approximations. The estimate is asymptotically efficient under moderate conditions and exhibits better performance than standard ICA methods in a variety of simulations.
1. Introduction.
Independent component analysis (ICA) aims to separate independent blind sources from their observed linear mixtures without any prior knowledge. This technique has been widely used in the past decade to extract useful features from observed data in many fields such as brain imaging analysis, signal processing and telecommunication. Hyvarinen, Karhunen and Oja [16] described a variety of applications of ICA. For example, Vigario, Jousmaki, Hamalainen, Hari and Oja [25] used ICA to separate artifacts from magnetoencephalography (MEG) data, without the burden of modeling the process that generated the artifacts.

Standard ICA represents an m × 1 random vector X (e.g., an instantaneous MEG image) as a linear mixture of m mutually independent random variables S_1, ..., S_m (e.g., artifacts and other brain activities), where the distribution of each S_i is totally unknown. That is, for S = (S_1, ..., S_m)^T and some m × m nonsingular matrix W,

    X = W^{-1} S.    (1.1)

Received July 2003; revised February 2006.
Supported in part by NSF Grant DMS-01-04075 and Lucent Technologies.
Supported in part by NSF Grant DMS-01-04075.
AMS 2000 subject classifications.
Primary 62G05; secondary 62H12.
Key words and phrases.
Independent component analysis, semiparametric models, efficient score function, asymptotically efficient, generalized M-estimator, B-splines.
This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in The Annals of Statistics, 2006, Vol. 34, No. 6, 2825–2855. This reprint differs from the original in pagination and typographic detail.
Here, W^{-1} is called the mixing matrix. Given n i.i.d. observations X_1, ..., X_n from the distribution of X, the aim is to estimate W and, thus, to separate each component of S = WX such that the components are maximally mutually independent. W is called the unmixing matrix. This can be seen as a projection pursuit problem [14] in which m directions are sought such that the corresponding projections are most mutually independent.

It was shown by Comon [9] that W is identifiable up to scaling and permutation of its rows if at most one S_i is Gaussian [18]. Model (1.1) can be viewed as a semiparametric model with parameters (W, r_1, ..., r_m), where r_i is the probability density function (PDF) of S_i. Our interest centers on W; (r_1, ..., r_m) are nuisance parameters.

ICA was motivated by neurophysiological problems in the early 1980's (see [16]) and two classes of methods have been proposed to estimate W. One class involves specifying a particular parametric model for each r_i and then optimizing contrast functions that involve (W, r_1, ..., r_m). Primary examples of this approach are maximum likelihood (ML) (e.g., [19, 21]) or, equivalently, minimum mutual information (e.g., [9]), minimizing high-order correlation between components of WX (e.g., [7]) and maximizing the non-Gaussianity of WX (e.g., [15]). A second class of methods views ICA as a semiparametric model and assumes nothing about the distributions of the components S_i. Thus, two distinct goals can be formulated: (i) to find estimates Ŵ of W that are consistent or, even better, √n-consistent, that is, Ŵ = W + O_p(n^{-1/2}); and (ii) to find procedures that achieve the information bound, that is, estimates of W which are asymptotically normal and have smallest variance-covariance matrix among all estimates that are uniformly asymptotically normal in a suitable sense; see [5]. Amari [1] formally demonstrated that to achieve the information bound in this situation, a method must estimate the densities of the sources. In fact, it can even be shown [6] that for any fixed estimating equation corresponding to maximizing an objective function, there is a possible distribution of sources for which the global maximizer is inconsistent, despite the consistency of a local solution near the truth.

Recently, some nonparametric methods to estimate W have appeared. For example, Bach and Jordan [3] proposed: (i) to reduce the dimension of the data by using a kernel representation and (ii) to choose W so as to minimize the empirical generalized variance among the components of WX. Hastie and Tibshirani [13] proposed maximizing the penalized likelihood as a function of (W, r_1, ..., r_m), and Vlassis and Motomura [26] proposed maximizing the likelihood by using Gaussian kernel density estimation. Various performance analyses have been carried out using simulations. The Vlassis–Motomura and Hastie–Tibshirani methods are of the same type as ours, but these papers do not provide a method for tuning the procedures and nothing has been proven about their asymptotic properties. Samarov and Tsybakov [22] proposed and analyzed a √n-consistent estimate of W under mild conditions. Chen and Bickel [8] analyzed the method of Eriksson and Koivunen [12] based on characteristic functions and showed it to be consistent under the minimal identifiability conditions and √n-consistent under additional mild conditions. This paper concerns the construction of efficient estimates. We develop an efficient estimator by using efficient score functions after starting the algorithm at a consistent point based on the PCFICA algorithm of Chen and Bickel [8].

The outline of the paper is as follows. In Section 2 we analyze ICA as a semiparametric model and propose a new method to estimate W using the efficient score function. The main theorem is given in Section 3. Numerical studies are given in Section 4. Technical details are provided in Sections 5 and 6.

Notation. In this paper, W denotes an m × m matrix, and W_i and W_{ij} denote the ith row and the (i, j)th element of W, respectively. A^T denotes the transpose of a matrix A and A^{-T} denotes the transpose of A^{-1}. For any matrix A with column vectors {a_i : 1 ≤ i ≤ k}, ‖A‖_F = √(tr(A^T A)) and vec(A) = (a_1^T, a_2^T, ..., a_k^T)^T, a column vector created from A. Define the sup-norm as |f|_∞ = sup_{t ∈ R} |f(t)|. X_i denotes the ith random sample from the distribution of X. The population (empirical) law of X is denoted by P (P_n). The ith elements of X and S are also written X_i and S_i; the meaning will be clear from context. Denote the vector of density functions (r_1, ..., r_m) by r_m. A vector or matrix of functions is denoted in boldface. For a vector of functions B, BB^T(x) shall be used as an abbreviation of B(x)[B(x)]^T.
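To fix ideas, model (1.1) can be simulated directly. The sketch below (with hypothetical sizes and a hypothetical W, not taken from the paper) draws independent non-Gaussian sources, rescales each to absolute median 1 (the identifiability convention adopted later in Section 2.2), and forms the observed mixtures X = W^{-1}S:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 1000, 3

# independent non-Gaussian sources; rescale each to absolute median 1
S = rng.laplace(size=(n, m))
S /= np.median(np.abs(S), axis=0)

# a nonsingular unmixing matrix W; the mixing matrix is A = W^{-1}
W = np.eye(m) + 0.3 * rng.normal(size=(m, m))
A = np.linalg.inv(W)

# observed mixtures: each row is X_i = A S_i, i.e. X = W^{-1} S
X = S @ A.T
```

Applying W to each observation recovers the sources exactly in this noiseless model; the statistical problem is that neither W nor the source densities are known.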
2. Semiparametric inference.
In this section, we first briefly review the salient features of estimation in semiparametric models and then show how to solve an approximate efficient score equation for estimating W in the ICA model.

2.1. Efficient estimates for semiparametric models.
Given a semiparametric model, X_1, ..., X_n i.i.d. {P_(θ,η) : θ ∈ Ω ⊂ R^d, η ∈ E}, where E is a subset of a function space, estimates θ̂_n of θ are called regular if √n(θ̂_n − θ) converges in law uniformly in P_(θ_n,η_n), where (θ_n, η_n) converges to (θ, η) in a smooth way. Then if there is a regular estimate that is uniformly best (call it θ*_n), it must have the form

    θ*_n = θ + (1/n) Σ^n_{i=1} l̃(X_i, θ, η) + o_p(n^{-1/2})    (2.1)

under P_(θ,η). The function l̃ is called the efficient influence function in [5]. When η = (η_1, ..., η_{d′}) is a Euclidean parameter, l̃ is, under regularity conditions, the influence function of the ML estimator (MLE) of θ. That is, if l̇ = (∂l/∂θ^T, ∂l/∂η_1, ..., ∂l/∂η_{d′})^T, where l is the log-likelihood function of a single observation and I(θ, η) ≡ E[l̇ l̇^T(X, θ, η)] is the Fisher information matrix, then l̃ is the first d coordinates of the vector I^{-1}(θ, η) l̇. An alternative formulation is to begin by defining the efficient score function l* = (l*_1, ..., l*_d)^T with

    l*_k = ∂l/∂θ_k − Σ^{d′}_{j=1} a_{jk}(θ, η) ∂l/∂η_j,

where a_{jk}(θ, η) minimizes E(∂l/∂θ_k(X, θ, η) − Σ^{d′}_{j=1} a_{jk}(θ, η) ∂l/∂η_j(X, θ, η))^2. That is, l* is the projection of ∂l/∂θ(X, θ, η) onto the orthocomplement of span{∂l/∂η_j(X, θ, η) : 1 ≤ j ≤ d′}. Then

    l̃ = (E[l* l*^T(X, θ, η)])^{-1} l*.

When η is infinite-dimensional, the generalization of span{∂l/∂η_j(X, θ, η) : 1 ≤ j ≤ d′} is the tangent space. That is defined to be the closed linear span, in L_2(P_(θ,η)), of

    {∂l/∂λ(X, θ, η(λ))|_{λ=0} : η(0) = η and λ → η(λ) defines a smooth one-dimensional submodel {P_(θ,η(λ)) : |λ| < 1}}.

Now, l* is again obtained by projection onto the orthocomplement of this span. An extensive discussion of tangent spaces and the geometric interpretation of formulas such as the one above is given in [5], Chapters 2 and 3.

For many canonical semiparametric models, including ICA, l* can be computed; we sketch the argument in the Appendix. Suppose that for each θ, an estimate η̂(θ) is available and is at least consistent. Then the usual Taylor expansions suggest that the solution of the generalized estimating equation

    Σ^n_{i=1} l*(X_i, θ, η̂(θ)) = 0    (2.2)

will have influence function l̃ and, hence, be efficient. These heuristics and others are discussed in Chapter 7 of [5]. Of course, more than consistency is needed, and after calculating l* in our case, validating that (2.2) leads to (2.1) for a suitable η̂(θ) is the subject of Sections 3, 5, 6 and the Appendix. Note that if η̂(θ) maximizes Σ^n_{i=1} l(X_i, θ, η), then (2.2) simply gives the profile maximum likelihood estimate discussed in [20]. In that case, (2.2) simplifies, becoming equivalent to

    Σ^n_{i=1} ∂l/∂θ(X_i, θ, η̂(θ)) = 0.

Unfortunately, such η̂(θ) do not exist in the ICA model. Using l* instead of ∂l/∂θ in the estimating equation (2.2) permits a less demanding choice of η̂(θ). These issues are discussed in detail in [5], Chapter 7. In this paper, we simply show that a θ̂ solving (2.2) for a particular η̂(θ) does indeed satisfy (2.1) in a sufficiently uniform sense. Optimality of θ̂ then follows from the general theory given in Chapter 3 of [5].

This technique is different from the quasi-ML method, which belongs to the first class of methods in the ICA literature reviewed in Section 1. That approach is to guess some shape η_0 for η and then use ordinary ML. Of course, if η_0 is true, then the resulting estimate is asymptotically Gaussian and has smaller variance than the θ̂ we discuss. But, if η_0 is false, then the estimate can be inconsistent. The ICA algorithms used for comparison in Section 4, such as FastICA [Hyvarinen and Oja (1997)] and extended infomax [19], are of this type. Closest to ours in spirit among these is the method of Pham and Garrat [21]. They use parametric models such as logsplines (see Section 2.3) for the nuisance parameters. However, they propose solving the score equations rather than (2.2). More importantly, they do not suggest increasing the model dimension with n, do not give a method for selecting the number of knots of the splines and, hence, are subject to the inconsistency we have discussed.

The remainder of Section 2 shows how to implement the idea given in (2.2) for the ICA model. Technical analysis is carried out in Section 3.

2.2. Further notation and assumptions.
Let W_P be a nonsingular unmixing matrix such that S = W_P X has m mutually independent components. Without loss of generality, assume that det(W_P) > 0. For any row vector w ∈ R^m, let f_w denote the PDF of wX and φ_w denote the density score function defined by φ_w(t) = −(∂/∂t) log f_w(t) · I(f_w(t) > 0), where I(·) is an indicator function.

In model (1.1), the order and scaling of rows of W or components of S must be constrained for W to be identifiable. For scaling, we take each S_i to have absolute median 1, that is, P(|S_i| ≤ 1) = 1/2 or, equivalently,

    2 ∫_{−1}^{1} r_i(s) ds = 1.    (2.3)

Even after this choice, the correct unmixing matrix is determined only up to 2^m m! choices due to sign changes and row permutations. This ambiguity can be resolved in various ways, but we avoid being specific by assuming that a consistent starting value is available for W_P, say PCFICA of Chen and Bickel [8]. Let κ(s) = 2I(|s| ≤ 1) − 1. Then (2.3) is equivalent to

    ∫ κ(S_i) dP = 0.

Equation (2.2), for our case, can be written

    Σ^n_{i=1} l*(X_i, W, Φ̂_W) = 0,
Table 1
Algorithm EFFICA
1. Input data {X_1, ..., X_n}, an initial estimate Ŵ^(0) and B-spline basis functions B^(k) ≡ (B^(k)_1, ..., B^(k)_{n_k})^T for k = 1, ..., m;
2. For j = 0, 1, ... until convergence:
   1) Set S^(j)_k ≡ {Ŵ^(j)_k X_i : i = 1, ..., n} for k = 1, ..., m;
   2) Set, for k = 1, ..., m, φ̂^(j)_k = (γ^(j)_k)^T B^(k), where γ^(j)_k = (Ê^(j)_k[B^(k)(B^(k))^T])^{-1} Ê^(j)_k[D^(k)], D^(k) is the derivative function of B^(k) and Ê^(j)_k is the empirical expectation w.r.t. S^(j)_k;
   3) Set Φ̂_W at W = Ŵ^(j): Φ̂_W(s) = (φ̂^(j)_1(s_1), ..., φ̂^(j)_m(s_m))^T;
   4) Set, for k = 1, ..., m, û^(j)_k = Ê^(j)_k[φ̂^(j)_k ψ] and v̂^(j)_k = Ê^(j)_k[ψ], where ψ(s) = 2sI(|s| ≤ 1); (σ̂^(j)_k)^2 = Ê^(j)_k[ϕ], where ϕ(s) = s^2;
      α̂^(j)_k = −(1 − û^(j)_k) v̂^(j)_k / ((σ̂^(j)_k)^2 − (v̂^(j)_k)^2),
      β̂^(j)_k = (1 − û^(j)_k)(σ̂^(j)_k)^2 / ((σ̂^(j)_k)^2 − (v̂^(j)_k)^2);
   5) Set M^(j)(s), an m × m function matrix with elements
      M^(j)_{kk′}(s) = −φ̂^(j)_k(s_k) s_{k′} for k ≠ k′,
      M^(j)_{kk}(s) = α̂^(j)_k s_k + β̂^(j)_k (2I(|s_k| ≤ 1) − 1);
   6) Set l̂*^(j)(x) = vec(M^(j)(Ŵ^(j)x)[Ŵ^(j)]^{-T}),* where vec(M) vectorizes M;
   7) Set e^(j)_n = (1/n) Σ^n_{i=1} l̂*^(j)(X_i) and Σ^(j)_n = (1/n) Σ^n_{i=1} l̂*^(j) l̂*^(j)T(X_i);
   8) Update Ŵ^(j+1) = Ŵ^(j) + [Σ^(j)_n]^{-1} e^(j)_n.

*Here, l̂*^(j)(x) = l*(x, W, Φ̂_W) at W = Ŵ^(j), where Φ̂_W can be identified as an estimate of the "nuisance parameter" Φ_W.

where Φ_W is the parameter defined by (2.9) and Φ̂_W is an estimate of it. We give pseudo code in Table 1 for iteratively solving this equation. The expressions appearing in the pseudo code are developed in the rest of this section.

2.3. Efficient score function of W. The likelihood function of X under (1.1) can be expressed as

    p_X(x, W, r_m) = |det(W)| Π^m_{i=1} r_i(W_i x).

The parameter of interest is W, while r_m are nuisance parameters. For simplicity, assume E[S_i] = 0.

Let φ_i(s_i) = −(∂/∂s_i) log r_i(s_i) · I(r_i(s_i) > 0) be the density score function associated with r_i and define Φ by Φ(s) = (φ_1(s_1), ..., φ_m(s_m))^T, where s = (s_1, ..., s_m)^T. Then the score function of W, l̇_W(x) ≡ (∂/∂vec(W)) log(p_X(x, W, r_m)), is equal to

    l̇_W(x) = vec{(I_{m×m} − Φ(s)s^T) W^{-T}},

where s = Wx and I_{m×m} is an m × m identity matrix. Thus, minimal regularity conditions for efficient estimation are that each r_i should be absolutely continuous, W nonsingular, E[φ_i^2(S_i)] < ∞ and E[S_i^2] < ∞.

Using the devices of tangent space and projection mentioned in Section 2.1 (see calculation details in the Appendix), the efficient score can be expressed as

    l*(x, W, Φ) = vec(M(Wx) W^{-T}),    (2.4)

where M(s) is an m × m function matrix with elements

    M_{ij}(s) = −φ_i(s_i) s_j, for 1 ≤ i ≠ j ≤ m,    (2.5)
    M_{ii}(s) = α_i s_i + β_i κ(s_i), for i = 1, ..., m,    (2.6)

and

    α_i = −(1 − u_i) v_i / (σ_i^2 − v_i^2), β_i = (1 − u_i) σ_i^2 / (σ_i^2 − v_i^2), σ_i^2 = E[S_i^2],    (2.7)
    v_i = E[2 S_i I(|S_i| ≤ 1)], u_i = E[2 S_i φ_i(S_i) I(|S_i| ≤ 1)].    (2.8)

Most of these formulas were derived in [2], but in a different context. We repeat these in our own notation for completeness. By the convolution theorem on semiparametric models (see [5]), the information bound for regular estimators of W is (E[l* l*^T(X, W, Φ)])^{-1}. It is obvious that the efficient score function depends on r_m only through the density score functions (φ_1, ..., φ_m). Next, we describe how to perform the estimation by using the efficient score function.

2.4. The ICA estimate.
Let

    Φ_W = (φ_{W_1}, ..., φ_{W_m})^T    (2.9)

and assume that a starting estimate Ŵ^(0) is available which is consistent for W_P. We shall show how to construct an estimate Φ̂_W of Φ_W for W close to W_P and then solve

    ∫ l*(X, W, Φ̂_W) dP_n = 0

to obtain an efficient estimator of W_P. Here, Φ̂_W is a data-dependent function of W and, thus, l*(X, W, Φ̂_W) is an approximation to the efficient score function given by (2.4).

For each k ∈ {1, ..., m}, choose a sieve for φ̂_{W_k} as follows. Let [b_{nk}, b̄_{nk}] ⊂ R be a subset of supp(r_k) containing most of the mass of r_k. For an integer n_k, set n_k + 4 equally spaced points {b_{nk} + (i − 1)δ_{nk} : 1 ≤ i ≤ n_k + 4} as knots, where δ_{nk} depends on n_k through

    δ_{nk} = (b̄_{nk} − b_{nk}) / (n_k + 3),

and then construct n_k cubic B-spline basis functions, as in the Appendix. Here, n_k is chosen by cross-validation as described in Section 2.5 below. Denote the basis functions as B^(k)_n ≡ (B^(k)_{n1}, ..., B^(k)_{nn_k})^T, where the superscript (k) denotes the association with S_k and the subscript n denotes the dependence on the sample size. Given the random sample {W_k X_i : 1 ≤ i ≤ n} from the density function f_{W_k}, the density score function φ_{W_k} can be estimated by

    φ̂_{W_k} = [γ_n(W_k)]^T B^(k)_n,    (2.10)

where γ_n(W_k) is given in Table 5 (see Section 2.5 for details). Then define Φ̂_W(s) ≡ (φ̂_{W_1}(s_1), ..., φ̂_{W_m}(s_m))^T. To avoid further complications, it is assumed that both [b_{nk}, b̄_{nk}] and n_k are fixed using Ŵ^(0). That is, the n_k + 4 knot locations are fixed.

Now, replace the efficient score function l*(X, W, Φ) defined in (2.4)–(2.8) by its profile form l*(X, W, Φ̂_W), where α_i, β_i and σ_i^2 defined in (2.7) and (2.8) are replaced with plug-in estimates

    α̂_i = −(1 − û_i) v̂_i / (σ̂_i^2 − v̂_i^2), β̂_i = (1 − û_i) σ̂_i^2 / (σ̂_i^2 − v̂_i^2), σ̂_i^2 = ∫ (W_i X)^2 dP_n,    (2.11)

where, with Y = W_i X, û_i = ∫ 2Y φ̂_{W_i}(Y) I(|Y| ≤ 1) dP_n and v̂_i = ∫ 2Y I(|Y| ≤ 1) dP_n. Define

    e_n(W) = ∫ l*(X, W, Φ̂_W) dP_n and e(W) = ∫ l*(X, W, Φ_W) dP    (2.12)

and let Ŵ be a solution of

    e_n(W) = 0,    (2.13)

if it exists. Let l̂*(x, W) ≡ l*(x, W, Φ̂_W) and ė_n(W) ≡ (∂/∂vec(W)) e_n(W). Note that if Ŵ → W_P, then −ė_n(Ŵ) and ∫ l̂* l̂*^T(X, Ŵ) dP_n have the same limit,

    −(∂e(W)/∂vec(W))|_{W_P} = E[l* l*^T(X, W_P, Φ_P)],

with probability converging to 1, as demonstrated later in Section 5. The final estimator Ŵ is defined as the limiting value of the following approximate Newton–Raphson iteration:

    Ŵ^(j+1) = Ŵ^(j) + [∫ l̂* l̂*^T(X, Ŵ^(j)) dP_n]^{-1} e_n(Ŵ^(j)), j = 0, 1, ....    (2.14)

We shall show that this limit exists with probability tending to 1. Note that this method does not require the calculation of the Hessian matrix ė_n(W). The convergence and asymptotic properties of (2.14) are developed in Section 3. Call Ŵ ≡ Ŵ^(∞) defined by (2.14) the EFFICA estimate. This is summarized in Table 1 and will be used for the simulation in Section 4. The density score estimation, as well as how to choose the number of knots by cross-validation (mentioned above), is provided in the next subsection.

2.5.
Estimating a density score function by B-spline approximations.
Let φ = −r′/r be the density score associated with a univariate PDF r. Let G be a linear space with differentiable basis functions B = (B_1, ..., B_N)^T such that each B_i r vanishes at infinity. An estimator of φ in G can then be obtained by minimizing, with respect to γ ∈ R^N, the mean square error

    c(γ) = ∫_R (φ(s) − γ^T B(s))^2 r(s) ds.

Using integration by parts, we obtain

    c(γ) = γ^T E_r[BB^T] γ − 2 γ^T E_r[B′] + E_r[φ^2],

where B′ is the derivative of B and E_r indicates expectation under r. Thus, the optimal γ is γ_φ = (E_r[BB^T])^{-1} E_r[B′] and the best approximation of φ in G, in the sense of mean square error, is φ_G = γ_φ^T B. This method was proposed by Jin [17] as a variant of Cox's [10] penalized estimators. Given a random sample of size n from the density function r, γ_φ can be estimated by combinations of empirical moments. So, a natural estimator of φ is given by φ̂_G = γ̂_φ^T B, where

    γ̂_φ = (Ê_r[BB^T])^{-1} Ê_r[B′],    (2.15)

where Ê_r is the empirical expectation corresponding to E_r.

B-spline basis functions are popular choices for G. In general, the support of r is unknown and we need to choose a working interval [b_n, b̄_n], in which knots are distributed, for the construction of the basis functions. The basic rule for adaptation is that [b_n, b̄_n] → supp(r) very slowly as n → ∞. Here, b_n and b̄_n are selected as the α_n and 1 − α_n empirical quantiles, where α_n → 0. The number of basis functions, N, is an additional empirical smoothing parameter. One can use cross-validation to choose N as follows:
1. Split the sample randomly into two halves, say I_1 and I_2;
2. For N = 1, 2, ..., use I_1 to estimate γ_φ ∈ R^N by (2.15), say γ̂_φ(I_1), and use I_2 to evaluate c(γ̂_φ(I_1)) empirically, but omitting the last term E_r[φ^2], say ĉ_{I_2|I_1}(N). Similarly calculate ĉ_{I_1|I_2}(N);
3. Select N as the largest value such that {ĉ_{I_2|I_1}(N) + ĉ_{I_1|I_2}(N)} strictly decreases until N.

Jin [17] used a similar method in the i.i.d. case we have discussed, proved its validity and showed that N = O(n^δ) under weak smoothness assumptions, where δ ∈ (0, 1/6) depends on tail properties of r.
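The estimator (2.15) is easy to implement numerically. The sketch below is an illustration, not the authors' code: it uses cubic B-splines with a clamped, equally spaced knot vector on a working interval [lo, hi] (a convenience choice; Section 2.4 uses equally spaced knots), and samples falling outside the interval contribute zero basis values, mimicking the truncation used in Section 3.

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_score(sample, n_basis, lo, hi):
    """Estimate phi = -r'/r on [lo, hi] via (2.15):
    gamma_hat = (E_n[B B^T])^{-1} E_n[B'],  phi_hat = gamma_hat^T B."""
    k = 3  # cubic B-splines
    # clamped knot vector of length n_basis + k + 1 over [lo, hi]
    t = np.r_[[lo] * k, np.linspace(lo, hi, n_basis - k + 1), [hi] * k]

    def design(x):
        """Basis functions and their derivatives evaluated at x."""
        B = np.zeros((len(x), n_basis))
        Bp = np.zeros((len(x), n_basis))
        for j in range(n_basis):
            c = np.zeros(n_basis)
            c[j] = 1.0
            bj = BSpline(t, c, k, extrapolate=False)
            B[:, j] = np.nan_to_num(bj(x))              # 0 outside [lo, hi]
            Bp[:, j] = np.nan_to_num(bj.derivative()(x))
        return B, Bp

    B, Bp = design(np.asarray(sample, dtype=float))
    gram = B.T @ B / len(sample)      # E_n[B B^T]
    rhs = Bp.mean(axis=0)             # E_n[B']
    gamma = np.linalg.solve(gram, rhs)
    return lambda x: design(np.atleast_1d(np.asarray(x, float)))[0] @ gamma
```

For a standard normal sample the true score is φ(s) = s, so the fitted function should be close to the identity on the interior of the working interval.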
3. Asymptotic properties.
We are given Ŵ^(0) (e.g., the PCFICA estimate) such that for some ε_n > 0,

    P(‖Ŵ^(0) − W_P‖_F ≤ ε_n) → 1, n → ∞,    (3.1)

where ε_n satisfies ε_n → 0, √n ε_n → ∞. Let

    Ω_n = {W ∈ R^{m×m} : ‖W − W_P‖_F < ε_n}.    (3.2)

Define φ_{W_k,n}(x) ≡ φ_{W_k}(x) I(x ∈ [b_{nk}, b̄_{nk}]) and consider the following conditions for 1 ≤ k, i, j ≤ m:

C1: W_P is nonsingular.
C2: E[S_k] = 0, E[S_k^2] < ∞, med(|S_k|) = 1 and E[φ_k^2(S_k)] < ∞.
C3: |r_k|_∞ < ∞, |r′_k|_∞ < ∞, sup_{t∈R} |t r′_k(t)| < ∞.
C4: The uniform law of large numbers (ULLN) holds for {φ_{W_k}(W_k X) X_i : W ∈ Ω_n}, for {φ_{W_k}^2(W_k X) X_i^2 : W ∈ Ω_n} and for {φ′_{W_k}(W_k X) W_i X X_j : W ∈ Ω_n}.
C5: For some positive c_1, c_2, r_k(t) ≥ c_1 δ_{nk} if t ∈ [b_{nk}, b̄_{nk}]; otherwise r_k(t) ≤ c_2 δ_{nk}.
C6: sup_{W∈Ω_n} |φ_{W_k,n}|_∞ δ_{nk} = O(1) and sup_{W∈Ω_n} |φ′′′_{W_k,n}|_∞ δ_{nk}^3 = o(1).
C7: ε_n δ_{nk}^{-1}(b̄_{nk} − b_{nk}) = o(1).

[Note: ULLN holds for G_n iff sup_{g∈G_n} |∫ g(X) d(P_n − P)| = o_p(1); see, e.g., [24].]

Conditions C1–C3 are simplified regularity conditions. C1 and the finite moments in C2 are among the minimal regularity conditions for considering efficiency, as mentioned in Section 2.3. Setting the absolute median to unity in C2 is a simple and minimal condition to make the scales of the unmixing matrix identifiable [9]. It should be clear that the zero mean assumption in C2 is in no way crucial to the general argument, as the mean can be estimated adaptively, but it serves to keep algebraic complication to a minimum. C3 assumes some smoothness of the density score function φ_k for each hidden component, which is needed if it is to be well approximated by B-splines.

Conditions C4–C7 are technical conditions that we believe are far from necessary, but they are reasonably easy to check and enable construction of a more compact proof. As an easy example, if |φ_k|_∞ < ∞ and |r′′_k/r_k|_∞ < ∞ for k = 1, ..., m, then by (A.1) in the Appendix, sup_{W∈Ω_n} |φ_{W_k}|_∞ < ∞ and, by (A.2), sup_{Ω_n} |φ′_{W_k}|_∞ < ∞. Thus C4 holds. C5 and C6 require that the tail of r_k be not too wiggly. C6 also implies that δ_{nk} → 0. C7 requires that the initial value be reasonably close to the truth and that the domain and the number of knots of the B-splines [i.e., n_k = (b̄_{nk} − b_{nk}) δ_{nk}^{-1} − 3] do not grow so quickly that we lose control of the approximation to Φ_W.

Here is the main theorem:

Theorem 3.1.
In the ICA model (1.1), if (3.1) and C1–C7 hold for i, j, k = 1, ..., m, i ≠ k and j ≠ k, then with probability converging to 1, the algorithm (2.14) has a limit Ŵ^(∞) and

    √n vec(Ŵ^(∞) − W_P) = I_eff^{-1} √n ∫ l*(X, W_P, Φ_P) dP_n + o_P(1),    (3.3)

where I_eff = ∫ l* l*^T(X, W_P, Φ_P) dP. That is, Ŵ^(∞) is Fisher efficient. Further,

    √n vec(Ŵ^(∞) W_P^{-1} − I_{m×m}) →_d N(0, Ī_eff^{-1}),

where Ī_eff = ∫ vec(M) vec(M)^T dP does not depend on W_P and M is given by (2.5)–(2.8), with s replaced by S.

The proof of Theorem 3.1 is given in later sections and the Appendix.
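For concreteness, one approximate Newton step of the iteration (2.14) can be sketched as follows. This is an illustration only, not the authors' implementation: it assumes the per-component density score estimates φ̂_k are supplied externally (e.g., by the B-spline estimator of Section 2.5), computes the plug-in constants of (2.11) empirically, and forms the efficient score (2.4)–(2.6) at each observation.

```python
import numpy as np

def efficient_score_step(X, W, phis):
    """One approximate Newton step of iteration (2.14).

    X    : (n, m) array of observations (rows X_i).
    W    : (m, m) current unmixing matrix estimate.
    phis : list of m vectorized callables; phis[k](s) estimates the
           density score -r_k'(s)/r_k(s) of source k.
    """
    n, m = X.shape
    S = X @ W.T                                  # current source estimates
    Winv_T = np.linalg.inv(W).T

    # plug-in estimates of sigma_k^2, v_k, u_k as in (2.7)-(2.8)/(2.11)
    phiS = np.column_stack([phis[k](S[:, k]) for k in range(m)])
    inside = (np.abs(S) <= 1.0).astype(float)
    sigma2 = np.mean(S ** 2, axis=0)
    v = np.mean(2.0 * S * inside, axis=0)
    u = np.mean(2.0 * S * phiS * inside, axis=0)
    alpha = -(1.0 - u) * v / (sigma2 - v ** 2)
    beta = (1.0 - u) * sigma2 / (sigma2 - v ** 2)

    # efficient score l*(X_i) = vec(M(S_i) W^{-T}) per observation
    L = np.empty((n, m * m))
    for i in range(n):
        M = -np.outer(phiS[i], S[i])                      # off-diagonal (2.5)
        kappa = 2.0 * inside[i] - 1.0                     # kappa(s) = 2 I(|s|<=1) - 1
        np.fill_diagonal(M, alpha * S[i] + beta * kappa)  # diagonal (2.6)
        L[i] = (M @ Winv_T).ravel()

    e_n = L.mean(axis=0)                         # estimating equation value
    Sigma_n = (L.T @ L) / n                      # empirical information matrix
    return W + np.linalg.solve(Sigma_n, e_n).reshape(m, m)
```

In practice one iterates this step, refreshing the score estimates φ̂_k at each update as in Table 1, until ‖e_n‖ is negligible.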
4. Numerical studies and some computational issues.
Two groups of experiments are implemented to test the empirical performance of EFFICA. Data are generated from known source distributions listed in Table 2 with a known mixing matrix W_P^{-1}. The boundaries for B-spline approximation of the density score functions are taken as b_{nk} = max(q_n(0), q_n(0.01) − Δ_n) and b̄_{nk} = min(q_n(1), q_n(0.99) + Δ_n), where q_n(·) denotes the empirical quantile and Δ_n = c · √(log log n). We used c = 5 in the simulation. The number of knots is key for EFFICA and is chosen by the cross-validation method described in Section 2.5.

In the first group of experiments, two hidden components are used, with W_P = [2, 1; 2, ·] and n = 1000.

In the second group of experiments, the number of hidden components is increased to m = 4, 8 and 12; the m hidden components are chosen with distributions [0], [1], ..., [m − 1] in Table 2 and W_P = I_{m×m}. The experiments are replicated 100, 100 and 50 times for m = 4, 8 and 12, respectively, with n = 4000.
Comparisons are made with five existing ICA algorithms: the FastICA algorithm with the options "symmetric" and "tanh" [15], which is equivalent to quasi-ML using a tanh distribution for each hidden source (note: the FastICA code uses logistic for tanh); the JadeICA algorithm [7]; the extended infomax algorithm [19]; the KernelICA-Kgv algorithm [3]; and the PCFICA algorithm. The PCFICA estimate is used as the initial value for both EFFICA and KernelICA-Kgv. Due to the existence of multiple local solutions, PCFICA uses three starting values, one from FastICA and the others random. The performance of each algorithm is measured by both the Frobenius error, that is, d_F(Ŵ, W_P) = ‖Ŵ W_P^{-1} − I_{m×m}‖_F after suitable rescaling and permutation of the rows of both Ŵ and W_P, and the so-called Amari error d_A(Ŵ, W_P) (e.g., [3]),

    d_A(V, W) = (1/(2m)) Σ^m_{i=1} (Σ^m_{j=1} |a_{ij}| / max_j |a_{ij}| − 1) + (1/(2m)) Σ^m_{j=1} (Σ^m_{i=1} |a_{ij}| / max_i |a_{ij}| − 1),

where V, W are rescaled into V̄, W̄ such that each row of V̄ and W̄ has norm 1, and a_{ij} = (V̄ W̄^{-1})_{ij}. The Amari error lies in [0, m − 1], is invariant under rescaling and row permutation of V and W, and is equal to zero if and only if V and W represent the same row components.

For each experiment in the first group of simulations, with T = 400 replications, Table 3 reports the average Amari error and the square root of the mean square error, √MSE, with

    MSE = (1/T) Σ^T_{i=1} (d_F^{(i)})^2,

where d_F^{(i)} denotes the Frobenius error for the ith replication. For the second group of simulations, Figure 1 shows the boxplots of the Amari errors and Table 4 reports √MSE.

From the simulation results, in some cases some parametric ICA algorithms work very well and even outperform EFFICA. For example, FastICA
Table 2
Source distributions used in the simulations

[0] N(0,1)
[4] t(5)
[5] logistic(0,1)
[6] Weibull(3,1)
[8] exp(1) + U(0,1)
[11] mixture Gaussians: multimodal
[12] mixture Gaussians: unimodal
[13] exp(1) vs normal(0,1)
[14] lognormal(1,1) vs normal(0,1)
[15] Weibull(3,1) vs exp(1)

Table 3
2 × mean Amari errors (and 2 × √MSE in brackets) for six ICA methods using m = 2 sources and sample size n = 1000. Source distributions for row k are given by [k] in Table 2. The bold numbers represent the best performance in each experiment.
(columns: pdf, Fast, Jade, ExtImax, Pcf, Kgv, EFFICA)

works best in case 5, where the hidden sources have logistic distributions. This is not surprising, as we pointed out in Section 2.1 that a simple quasi-MLE can outperform an efficient estimator when the value of the nuisance parameter used by the quasi-MLE is close to the truth. But, in most experiments, the parametric methods (FastICA, JADE, ExtImax) behave worse than the nonparametric methods (PCFICA, Kgv, EFFICA), and EFFICA has both the smallest Amari errors and the smallest Frobenius errors, while Kgv, which we conjecture can be efficient after appropriate regularization, is the best in the cases of mixture Gaussians. The three nonparametric ICA algorithms require heavier computation, but their performance is better than that of the parametric methods.

Table 4
2 × √MSE for ICA algorithms with the same simulations as in Figure 1

Case     Fast   Jade   ExtImax   PCF    Kgv    EFFICA
m = 4    0.82   1.31   2.71      0.60   0.51   0.··
m = 8    7.··   ···    ···       ···    ···    ···
m = 12   9.··   ···    ···       ···    ···    ···

All of the ICA algorithms used in the simulation except EFFICA are based on contrast functions which empirically measure the dependence or non-Gaussianity among {W_1 X, ..., W_m X} and, thus, they are invariant with respect to the choice of W_P for both error metrics d_F(Ŵ, W_P) and d_A(Ŵ, W_P). We note that prewhitening, which is used for data preprocessing by these algorithms, can reduce such invariance, although it does not cause inconsistency [8]. Theorem 3.1 implies that EFFICA is asymptotically invariant with respect to W_P. Figure 2 compares m = 8 with two different unmixing matrices, W_P = I_{m×m} and W_P = I_{m×m} + V, where V_{jk} = j/m + (k − 1)/m for 1 ≤ j, k ≤ m. We performed many other simulations with different W_P and obtained similar results. We observe that the Frobenius error boxplots do change somewhat with different W_P, but EFFICA is more robust than the other ICA algorithms. We believe that the main reasons are (i) none of the ICA algorithms are convex and, thus, all may suffer from local solutions, and (ii) EFFICA does not use prewhitening for preprocessing, while the others do.
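As a reference implementation of the error metric used above, the Amari error d_A can be computed as follows (a sketch, with the row normalization described in Section 4):

```python
import numpy as np

def amari_error(V, W):
    """Amari error d_A(V, W): rescale rows of V and W to unit norm,
    form a = |V_bar @ inv(W_bar)|, and average the row/column criteria."""
    Vb = V / np.linalg.norm(V, axis=1, keepdims=True)
    Wb = W / np.linalg.norm(W, axis=1, keepdims=True)
    a = np.abs(Vb @ np.linalg.inv(Wb))
    m = a.shape[0]
    row = (a.sum(axis=1) / a.max(axis=1) - 1.0).sum()
    col = (a.sum(axis=0) / a.max(axis=0) - 1.0).sum()
    return (row + col) / (2.0 * m)
```

The metric is zero exactly when V equals W up to rescaling, sign changes and row permutation, which is why it is the natural loss for ICA comparisons.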
5. Proof of Theorem 3.1.
In this section, we prove Theorem 3.1. Note that solving (2.13) can be viewed as a generalized M-estimator (GM-estimator). The existence/uniqueness, convergence and asymptotic linearity of GM-estimators have been studied in [5] (the Iteration Theorem in Appendix A.10.2, page 517). The idea of our proof is to use the Iteration Theorem.

Suppose that M_n(θ, P_n) is a functional of θ ∈ Ω (a subset of a finite-dimensional Euclidean space) and P_n, but is not necessarily linear in P_n. The subscript n in M_n allows the existence of a possible smoothing or sieve parameter dependent on n. The zero of M_n(θ, P_n) w.r.t. θ is called a generalized M-estimator. Let M(θ, P) ≡ M_∞(θ, P). We review the conditions for the Iteration Theorem.

[GM1] θ_P ∈ Ω is the unique solution of M(θ, P) = 0 in Ω.
[GM2] M_n(θ_P, P_n) = ∫ ψ_{θ_P}(X) dP_n + o_p(n^{-1/2}) for some ψ_{θ_P} ∈ L_2(P).
[GM3] M(θ, P) is differentiable w.r.t. θ in a neighborhood of θ_P and ∂M(θ_P, P)/∂θ is nonsingular.

Fig. 1.
Fig. 1. Boxplots of Amari errors for ICA algorithms: in each case (left, middle, right), the $m$ hidden components are generated from pdfs $[0], \dots, [m-1]$ in Table 1. The x-labels represent the ICA algorithms: F-FastICA, J-JadeICA, X-extended infomax, P-PCFICA, K-Kgv, E-EFFICA. The sample sizes are 4000 for all of the experiments and the numbers of replications are …, … and … for $m = 4$, $m = 8$ and $m = 12$, respectively.
Fig. 2. Frobenius errors ($\times$…) with $m = 8$ and $n = 4000$, where the left panel is for $W_P = I_{m\times m}$ and the right panel is for $W_P = I_{m\times m} + V$, with $V_{jk} = j/m + (k-1)/m$. Each experiment is replicated … times.
For our efficient score equation $M_n(\theta, P_n) = e_n(W)$ defined in (2.12), condition [U] of [5] becomes:

[U] $\sup_{W \in \Omega_n} |\dot e_n(W) - \dot e(W_P)| = o_P(1)$.

Theorem 5.1 ([5]).
Suppose that [GM1], [GM2] and [GM3] hold with $M_n(\theta, P_n) = e_n(W)$ and that [U] holds. If the starting point satisfies $P(|\hat W^{(0)} - W_P| < \varepsilon_n) \to 1$, then with probability converging to 1, $e_n(W)$ in (2.13) has a unique root $\hat W^{(\infty)}$, which is also the limit of the sequence defined by (2.14), except with $\int \hat l^*\hat l^{*T}(X, \hat W^{(j)})\,dP_n$ replaced by $-\dot e_n(\hat W^{(j)})$, and $\hat W^{(\infty)}$ is asymptotically linear with influence function $-[\dot e(W_P)]^{-1}l^*(\cdot, W_P, \Phi_P)$.

Theorem 5.1 is called the
Iteration Theorem in [5]. Note that the sequence limit defined by the Iteration Theorem uses the exact Newton–Raphson, whereas we use an approximate Newton–Raphson, as in (2.14). To make up the difference, we need the following condition [V], which is verified by Proposition 6.4 in Section 6:

[V] $\sup_{W \in \Omega_n} |\int \hat l^*\hat l^{*T}(X, W)\,dP_n - \int l^*l^{*T}(X, W_P, \Phi_{W_P})\,dP| = o_P(1)$.

Theorem 3.1 can now be proved as follows.

Proof of Theorem 3.1.
It is obvious that [GM1] holds under the conditions of Theorem 3.1, since the estimating equation is built from the efficient score function. [GM2], [GM3] and [U] are verified by Propositions 6.1, 6.2 and 6.3 below, respectively. Thus, the conclusion of the Iteration Theorem applies here. By Proposition 6.4, condition [V] holds. Further, by Proposition 6.2,
$$\dot e(W_P) = -E[l^*l^{*T}(X, W_P, \Phi_P)].$$
Thus, we have
$$\sup_{W \in \Omega_n}\Big|\dot e_n(W) + \int \hat l^*\hat l^{*T}(x, W)\,dP_n\Big| = o_P(1).$$
Then, following the contraction arguments of [5] (pages 317–319), the iteration given in (2.14) has the same limit as the one obtained by replacing $\int \hat l^*\hat l^{*T}(X, \hat W^{(j)})\,dP_n$ by $-\dot e_n(\hat W^{(j)})$, with probability converging to 1. Thus, (3.3) holds. The second result follows from (3.3) directly by using (2.4) and the devices of the Kronecker product and the vec operator. □
6. Propositions 6.1–6.4.
This section verifies conditions [GM2], [GM3], [U] and [V]. For convenience, we list all of the notation used in the following proofs in Table 5, for $k \in \{1, \dots, m\}$ and $W \in \Omega_n$. Note that all of the lemmas used in this section are given in the Appendix. For simplicity of notation, we will often write $\delta_{nk}$ as $\delta_n$.

Table 5
List of all notation used in the proofs

$P$, $P_n$: population and empirical law of $X$
$W$, $W_k$, $W_{ij}$: $m \times m$ matrix, its $k$th row, its $(i,j)$th element
$W_P$, $W_{Pk}$, $W_{Pij}$: unmixing matrix, its $k$th row, its $(i,j)$th entry
$r_k$: pdf of $S_k$
$\phi_k = -r_k'/r_k$: density score function of $S_k$
$\Phi_P = (\phi_1, \dots, \phi_m)^T$: function vector
$f_{W_k}$: pdf of $W_kX$ ($f_{W_{Pk}} \equiv r_k$)
$\phi_{W_k} = -f_{W_k}'/f_{W_k}$: score function of $W_kX$ ($\phi_{W_{Pk}} \equiv \phi_k$)
$\phi_{k,n}$, $\phi_{W_k,n}$: truncations of $\phi_k$, $\phi_{W_k}$ on $[b_{nk,1}, b_{nk,2}]$
$\Phi_W = (\phi_{W_1}, \dots, \phi_{W_m})^T$: function vector
$\hat\Phi_W = (\hat\phi_{W_1}, \dots, \hat\phi_{W_m})^T$: function vector
$l^*(X, W, \Phi)$: efficient score function defined by (2.4)
$e(W) = \int l^*(X, W, \Phi_W)\,dP$: expectation
$e_n(W) = \int l^*(X, W, \hat\Phi_W)\,dP_n$: empirical expectation
$B_n^{(k)} = (B_{n1}^{(k)}, \dots, B_{nn_k}^{(k)})^T$: B-splines defined on $[b_{nk,1}, b_{nk,2}]$
$A_n(W_k) = \int B_n^{(k)}B_n^{(k)T}(W_kX)\,dP_n$: used in the coefficients of $\hat\phi_{W_k}$
$D_n(W_k) = \int (B_n^{(k)})'(W_kX)\,dP_n$: used in the coefficients of $\hat\phi_{W_k}$
$\gamma_n(W_k) = A_n(W_k)^{-1}D_n(W_k)$: coefficients of $\hat\phi_{W_k}$
$A(W_k) = \int B_n^{(k)}B_n^{(k)T}(W_kX)\,dP$: used in the coefficients of $\hat{\hat\phi}_{W_k}$
$D(W_k) = \int (B_n^{(k)})'(W_kX)\,dP$: used in the coefficients of $\hat{\hat\phi}_{W_k}$
$\gamma(W_k) = A(W_k)^{-1}D(W_k)$: coefficients of $\hat{\hat\phi}_{W_k}$
$\mathcal G_n^{(k)} = \{a^TB_n^{(k)} : a \in \mathbb R^{n_k}\}$: closed linear span of the B-splines
$\hat\phi_{W_k} = \gamma_n(W_k)^TB_n^{(k)}$: estimator of $\phi_{W_k}$ in $\mathcal G_n^{(k)}$, defined by (2.10)
$\hat{\hat\phi}_{W_k} = \gamma(W_k)^TB_n^{(k)}$: approximation of $\phi_{W_k}$ in $\mathcal G_n^{(k)}$, defined by (A.3)
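For a single fixed direction, the coefficient formula $\gamma_n(W_k) = A_n(W_k)^{-1}D_n(W_k)$ is an ordinary linear solve: it exploits the integration-by-parts identity $E[B(Y)\phi(Y)] = E[B'(Y)]$ (valid because B-splines have compact support), so the score is estimated without estimating the density. The sketch below is our own one-dimensional illustration, not the paper's implementation; the sample, interval and knot spacing are hypothetical. For a standard Gaussian sample the true score is $\phi(t) = t$.

```python
import numpy as np

def bspline_basis(x, knots, degree=3):
    """All B-spline basis functions of `degree` at points x (Cox-de Boor recursion).
    `knots` must be strictly increasing and extended beyond the working interval."""
    x = np.asarray(x, float)
    B = np.array([(x >= knots[i]) & (x < knots[i + 1])
                  for i in range(len(knots) - 1)], dtype=float)
    for k in range(1, degree + 1):
        Bn = np.zeros((len(knots) - k - 1, x.size))
        for i in range(len(knots) - k - 1):
            d1 = knots[i + k] - knots[i]
            d2 = knots[i + k + 1] - knots[i + 1]
            Bn[i] = (x - knots[i]) / d1 * B[i] + (knots[i + k + 1] - x) / d2 * B[i + 1]
        B = Bn
    return B  # shape: (len(knots) - degree - 1, len(x))

def bspline_deriv(x, knots, degree=3):
    """First derivatives of the same basis, via the B-spline derivative formula."""
    Bl = bspline_basis(x, knots, degree - 1)  # one extra lower-degree function
    D = np.zeros((len(knots) - degree - 1, np.size(x)))
    for i in range(len(knots) - degree - 1):
        D[i] = (degree / (knots[i + degree] - knots[i]) * Bl[i]
                - degree / (knots[i + degree + 1] - knots[i + 1]) * Bl[i + 1])
    return D

# Score estimation: gamma = A_n^{-1} D_n with
#   A_n = (1/n) sum_t B(x_t) B(x_t)^T  and  D_n = (1/n) sum_t B'(x_t),
# the empirical version of the normal equations E[B B^T] a = E[B'] = E[B phi].
rng = np.random.default_rng(0)
x = rng.standard_normal(20000)                  # hypothetical sample; true score phi(t) = t
delta = 0.5
knots = np.arange(-4.0, 4.0 + delta / 2, delta)  # equally spaced, extended knots
Bx = bspline_basis(x, knots)
A_n = Bx @ Bx.T / x.size
D_n = bspline_deriv(x, knots).mean(axis=1)
gamma = np.linalg.solve(A_n, D_n)

def phi_hat(t):
    """Estimated score at a scalar t."""
    return float(gamma @ bspline_basis(np.atleast_1d(t), knots)[:, 0])
```

In the interior of the interval, `phi_hat(t)` tracks the true score $t$ closely; near the boundary, where the density is small, the estimate is noisier, which is why the paper truncates to $[b_{nk,1}, b_{nk,2}]$.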
Proposition 6.1. Under the conditions of Theorem 3.1, we have
$$e_n(W_P) = \int l^*(X, W_P, \Phi_P)\,dP_n + o_P(n^{-1/2}).$$

Proof.
Recall the definition of $e_n(W)$ given by (2.12) and that of $l^*(x, W, \Phi)$ given by (2.4)–(2.8). It is sufficient to show that for $1 \le i \ne j \le m$, $\hat\alpha_i - \alpha_i = o_P(1)$ and $\hat\beta_i - \beta_i = o_P(1)$, where $(\alpha_i, \beta_i)$ and $(\hat\alpha_i, \hat\beta_i)$ are defined in (2.7) and (2.11), respectively, and
$$\int \hat\phi_{W_{Pi}}(S_i)S_j\,dP_n = \int \phi_{W_{Pi}}(S_i)S_j\,dP_n + o_P(n^{-1/2}), \qquad (6.1)$$
where $S_i = W_{Pi}X$ and $S_j = W_{Pj}X$. The first two claims are not hard to verify with the law of large numbers and Lemma A.7. Here, we will just show (6.1). Observe that
$$\Big|\int \hat\phi_{W_{Pi}}(S_i)S_j\,dP_n - \int \phi_{W_{Pi}}(S_i)S_j\,dP_n\Big| \le \Big|\int [\hat\phi_{W_{Pi}}(S_i) - \hat{\hat\phi}_{W_{Pi}}(S_i)]S_j\,dP_n\Big| + \Big|\int [\hat{\hat\phi}_{W_{Pi}}(S_i) - \phi_{i,n}(S_i)]S_j\,dP_n\Big| + \Big|\int (\phi_i(S_i) - \phi_{i,n}(S_i))S_j\,dP_n\Big| = [1] + [2] + [3].$$
In the following, we show that each of [1], [2] and [3] is $o_P(n^{-1/2})$. First, by Lemma A.2(7) and Lemma A.3,
$$[1] = \Big|\int (\gamma_n(W_{Pi}) - \gamma(W_{Pi}))^TB_n^{(i)}(S_i)S_j\,dP_n\Big| \le \|\gamma_n(W_{Pi}) - \gamma(W_{Pi})\|\,\Big\|\int B_n^{(i)}(S_i)S_j\,dP_n\Big\| = o_P(1)O_P(n^{-1/2}).$$
Further, $E([2]^2) = \frac1n E[(\hat{\hat\phi}_{W_{Pi}}(S_i) - \phi_{W_{Pi},n}(S_i))^2]E(S_j^2)$. By Lemma A.6, $|\hat{\hat\phi}_{W_{Pi}} - \phi_{W_{Pi},n}|_\infty \le c\delta_n^{5/2}|\phi'''_{W_{Pi},n}|_\infty$. Thus, by C6,
$$[2] = n^{-1/2}\delta_n^{5/2}|\phi'''_{W_{Pi},n}|_\infty O_P(1) = o_P(n^{-1/2}).$$
For [3], since $P(S_i \notin [b_{ni,1}, b_{ni,2}]) \to 0$, we have
$$E([3]^2) = \frac1n E[\phi_i^2(S_i)I(S_i \notin [b_{ni,1}, b_{ni,2}])]E(S_j^2) = o\Big(\frac1n\Big).$$
So $[3] = o_P(n^{-1/2})$. □

Proposition 6.2.
Under conditions C1, C2 and C4 of Theorem 3.1, $e(W)$ is differentiable w.r.t. $W$ in a neighborhood of $W_P$ and
$$\dot e(W_P) = -E\{l^*l^{*T}(X, W_P, \Phi_P)\}$$
is nonsingular.

Proof.
Let $T_w(\cdot) = \frac{\partial}{\partial w}\{\phi_w(\cdot)\}$ for any nonzero row vector $w \in \mathbb R^m$. By (A.1) in the Appendix, after exchanging the order of differentiation and integration, we have $E[T_w(wX)] = 0$. Then by (2.5), we have, for $k = 1, \dots, m$,
$$E\Big\{\frac{\partial}{\partial W_k}\{l^*(X, W, \Phi_W)\}\Big|_{W_P}\Big\} = E\Big\{\frac{\partial}{\partial W_k}\{l^*(X, W, \Phi_P)\}\Big|_{W_P}\Big\}.$$
Since the left-hand side of the above is $\dot e(W_P)$, by Lemma A.8 the right-hand side is equal to
$$\dot e(W_P) = -E\{l^*l^{*T}(X, W_P, \Phi_P)\}. \qquad (6.2)$$
Note that the elements of $M$ in (2.5)–(2.6) are linearly independent, and that this is also true for the elements of $l^*(\cdot, W_P, \Phi_P)$. Thus, $\dot e(W_P)$ must be nonsingular. □

Proposition 6.3.
Under the conditions of Theorem 3.1, for $i, j, k = 1, \dots, m$, we have
$$\sup_{\Omega_n}\Big|\int \hat\phi_{W_i}(W_iX)X_k\,dP_n(X) - \int \phi_{Pi}(W_{Pi}X)X_k\,dP\Big| = o_P(1) \qquad (6.3)$$
and, for $i \ne j$,
$$\sup_{\Omega_n}\Big|\int \frac{\partial}{\partial W_i}\{\hat\phi_{W_i}(W_iX)\}W_jX\,dP_n - \int \frac{\partial}{\partial W_i}\{\phi_{Pi}(W_{Pi}X)\}W_{Pj}X\,dP\Big| = o_P(1). \qquad (6.4)$$
Proof.
We omit the superscript $(i)$ in $B_n^{(i)}$ henceforth. By the Cauchy–Schwarz inequality,
$$\Big\|\int B_n(W_iX)X_k\,dP_n\Big\|^2 \le \int X_k^2\,dP_n\int \sum_{l=1}^{n_i}[B_{nl}(W_iX)]^2\,dP_n.$$
Since $\sum_{l=1}^{n_i}[B_{nl}(W_iX)]^2 < 1$,
$$\sup_{\Omega_n}\Big\|\int B_n(W_iX)X_k\,dP_n\Big\| = O_P(1), \qquad (6.5)$$
and by Lemma A.2(7), $\sup_{\Omega_n}\|\gamma_n(W_i) - \gamma(W_i)\| = o_P(1)$, so
$$\sup_{\Omega_n}\Big|\int \hat\phi_{W_i}(W_iX)X_k\,dP_n(X) - \int \hat{\hat\phi}_{W_i}(W_iX)X_k\,dP_n\Big| = \sup_{\Omega_n}\Big|(\gamma_n(W_i) - \gamma(W_i))^T\int B_n(W_iX)X_k\,dP_n\Big| \le \sup_{\Omega_n}\|\gamma_n(W_i) - \gamma(W_i)\|\sup_{\Omega_n}\Big\|\int B_n(W_iX)X_k\,dP_n\Big\| = o_P(1)O_P(1). \qquad (6.6)$$
Further, by Lemma A.6, $\sup_{\Omega_n}|\hat{\hat\phi}_{W_i} - \phi_{W_i,n}|_\infty \le \sup_{\Omega_n}c|\phi'''_{W_i,n}|_\infty\delta_n^{5/2}$, so
$$\sup_{\Omega_n}\Big|\int \hat{\hat\phi}_{W_i}(W_iX)X_k\,dP_n(X) - \int \phi_{W_i,n}(W_iX)X_k\,dP_n\Big| \le \sup_{\Omega_n}c|\phi'''_{W_i,n}|_\infty\delta_n^{5/2}\int |X_k|\,dP_n = o_P(1) \quad \text{(by C6)}. \qquad (6.7)$$
By C4, the ULLN holds for $\{\phi_{W_i}(W_iX)X_k : W \in \Omega_n\}$ and, by Lemma A.1, $\sup_{\Omega_n}P(W_iX \notin [b_{ni,1}, b_{ni,2}]) = o(1)$, so
$$\sup_{\Omega_n}\Big|\int \phi_{W_i}(W_iX)X_kI(W_iX \notin [b_{ni,1}, b_{ni,2}])\,dP_n\Big| = o_P(1). \qquad (6.8)$$
Recall that $\phi_{W_i,n}(x) = \phi_{W_i}(x)I(x \in [b_{ni,1}, b_{ni,2}])$. From (6.6)–(6.8), we obtain
$$\sup_{\Omega_n}\Big|\int \hat\phi_{W_i}(W_iX)X_k\,dP_n(X) - \int \phi_{W_i}(W_iX)X_k\,dP_n(X)\Big| = o_P(1). \qquad (6.9)$$
Now, by C4,
$$\sup_{\Omega_n}\Big|\int \phi_{W_i}(W_iX)X_k\,d(P_n - P)\Big| = o_P(1), \qquad (6.10)$$
and by continuity,
$$\sup_{\Omega_n}\Big|\int \phi_{W_i}(W_iX)X_k\,dP - \int \phi_{W_{Pi}}(W_{Pi}X)X_k\,dP\Big| = o(1). \qquad (6.11)$$
(6.3) then follows from (6.9)–(6.11).

In the following, we prove (6.4). Note that
$$\frac{\partial}{\partial W_{ik}}\{\hat\phi_{W_i}(W_iX)\} = \frac{\partial}{\partial W_{ik}}\{\gamma_n^T(W_i)\}B_n(W_iX) + \hat\phi'_{W_i}(W_iX)X_k. \qquad (6.12)$$
It suffices to show that the following hold:
$$[4]\quad \sup_{\Omega_n}\Big|\int \hat\phi'_{W_i}(W_iX)X_kW_jX\,dP_n(X) - \int \phi'_{Pi}(W_{Pi}X)X_kW_{Pj}X\,dP\Big| = o_P(1);$$
$$[5]\quad \sup_{\Omega_n}\Big|\int \frac{\partial}{\partial W_{ik}}\{\gamma_n^T(W_i)\}B_n(W_iX)W_jX\,dP_n(X)\Big| = o_P(1).$$
Similarly to (6.3), the uniform convergence in [4] can be verified using conditions C4, C6 and C7 and Lemmas A.1, A.2 and A.6. Further, the left-hand side of [5] is bounded by
$$\sup_{\Omega_n}\Big\{\Big\|\frac{\partial}{\partial W_{ik}}\gamma_n(W_i)\Big\|\cdot\Big\|\int B_n(W_iX)W_jX\,dP_n\Big\|\Big\} = O_P(\delta_n^{-5/2}n_i^{1/2}\cdot\varepsilon_n\delta_n^{-1}n_i^{1/2}) = o_P(1),$$
where the first equality follows from Lemmas A.2 and A.4 and the second follows from C7. Thus, [5] holds and, hence, (6.4) is proved. □

Proposition 6.4.
Under the conditions of Theorem 3.1, condition [V] holds; that is,
$$\sup_{W \in \Omega_n}\Big|\int \hat l^*\hat l^{*T}(X, W)\,dP_n - \int l^*l^{*T}(X, W_P, \Phi_{W_P})\,dP\Big| = o_P(1).$$

Proof.
By checking the elements of $\hat l^*\hat l^{*T}(x, W)$, it suffices to show that for $1 \le i, j, k, k' \le m$ and $i \ne j$,
$$\sup_{\Omega_n}\Big|\int \hat\phi_{W_i}(W_iX)\hat\phi_{W_j}(W_jX)X_kX_{k'}\,dP_n - \int \phi_{Pi}(W_{Pi}X)\phi_{Pj}(W_{Pj}X)X_kX_{k'}\,dP\Big| = o_P(1),$$
$$\sup_{\Omega_n}\Big|\int \hat\phi_{W_i}(W_iX)X_kX_{k'}\,dP_n(X) - \int \phi_{Pi}(W_{Pi}X)X_kX_{k'}\,dP\Big| = o_P(1)$$
and
$$\sup_{\Omega_n}\Big|\int \hat\phi_{W_i}(W_iX)X_k\kappa(W_jX)\,dP_n(X) - \int \phi_{Pi}(W_{Pi}X)X_k\kappa(W_{Pj}X)\,dP\Big| = o_P(1).$$
Each of these can be verified using Lemmas A.1, A.2 and A.6 and conditions C4, C6 and C7, with arguments similar to those used in proving (6.3). □
7. Conclusion.
In this paper, we viewed the classical ICA model within the framework of semiparametric models and obtained an asymptotically efficient estimator of the unmixing matrix by solving an approximate efficient score equation. The main difference between this new method and popular parametric ICA methods is that we estimate the density score functions of the hidden sources adaptively. A variety of simulations have illustrated the statistical efficiency of this estimator in comparison with state-of-the-art ICA algorithms.

APPENDIX
A.1. Some useful formulas.
Let $v = wW_P^{-1}$. Then $wX = vS$. If $v_k \ne 0$ for some $k \in \{1, \dots, m\}$, then by the classical convolution formula, we have
$$f_w(t) = \int \frac{1}{|v_k|}r_k\Big(\frac{t - \sum_{j\ne k}v_js_j}{v_k}\Big)\prod_{j\ne k}r_j(s_j)\,ds_j = E\Big\{\frac{1}{|v_k|}r_k\Big(\frac{t - \sum_{j\ne k}v_jS_j}{v_k}\Big)\Big\}.$$
Since $f_w(t)$ is a marginal density function of $(vS, S_j : 1 \le j \ne k \le m)$, by a standard formula (see, e.g., [4]),
$$\phi_w(t) = -\frac{1}{v_k}E\Big\{\frac{r_k'}{r_k}\Big(\frac{t - \sum_{j\ne k}v_jS_j}{v_k}\Big)\Big|\,vS = t\Big\} = \frac{1}{v_k}E[\phi_k(S_k)\,|\,vS = t] \qquad (A.1)$$
and further calculation gives
$$\frac{\partial}{\partial t}\phi_w(t) = \phi_w^2(t) - \frac{1}{v_k^2}E\Big\{\frac{r_k''}{r_k}\Big(\frac{t - \sum_{j\ne k}v_jS_j}{v_k}\Big)\Big|\,vS = t\Big\}. \qquad (A.2)$$
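As a quick sanity check of (A.1) (our own worked example, not in the original), take all sources standard Gaussian, so that $\phi_k(s) = s$:

```latex
% For S ~ N(0, I_m) we have vS ~ N(0, \|v\|^2) and
% E[S_k \mid vS = t] = (v_k/\|v\|^2)\, t, so (A.1) gives
\phi_w(t) \;=\; \frac{1}{v_k}\,E[\phi_k(S_k)\mid vS = t]
          \;=\; \frac{1}{v_k}\cdot\frac{v_k}{\|v\|^2}\,t
          \;=\; \frac{t}{\|v\|^2},
% which is exactly -f_w'(t)/f_w(t) for the N(0, \|v\|^2) density f_w.
```

The same computation differentiated in $t$ recovers (A.2), since for this example $\phi_w'(t) = 1/\|v\|^2$ and $r_k''/r_k(s) = s^2 - 1$.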
A.2. Calculation of the efficient score.
To formulate the tangent space defined in Section 2.1 for each nuisance parameter $r_i$, taking the smooth submodel $\{r_i(\cdot;t) = r_i(\cdot)e^{th_i(\cdot)} : |t| < 1\}$ for some $h_i \in L_2(P)$, we have
$$\lim_{t\to 0}\frac{\partial}{\partial t}\{\log p_X(x, W, r_1, \dots, r_i(\cdot;t), \dots, r_m)\} = h_i(W_ix).$$
Since $r_i(\cdot;t)$ needs to be a probability density function satisfying the mean and absolute median assumptions, $h_i$ must satisfy $E[h_i(S_i)] = 0$, $E[h_i(S_i)S_i] = 0$ and $E[h_i(S_i)\kappa(S_i)] = 0$, but is otherwise arbitrary. Thus, the tangent space for $r_i$ can be expressed as
$$TS_i = \{h_i(W_ix) \in L_2(P) : E[h_i(S_i)] = 0,\ E[h_i(S_i)S_i] = 0,\ E[h_i(S_i)\kappa(S_i)] = 0\}.$$
Note that the tangent spaces $\{TS_i : 1 \le i \le m\}$ are perpendicular to each other since the $S_i$ are mutually independent. Thus, any projection onto the tangent space of $(r_1, \dots, r_m)$ is equal to the sum of the partial projections onto each $TS_i$. The efficient score of $W$ becomes
$$l^*(\cdot, W, \Phi) = \dot l_W - \sum_{i=1}^m\pi(\dot l_W\,|\,TS_i),$$
where $\pi(\cdot\,|\,L)$ denotes the projection operator in $L_2(P_{(W,r_1,\dots,r_m)})$ onto $L$. Since each off-diagonal entry of $I_{m\times m} - \Phi(S)S^T$ is perpendicular to all $TS_i$, and each diagonal entry of it is perpendicular to all but one $TS_i$, $l^*$ can be obtained as in (2.4)–(2.8) of Section 2.3 by using the fact that $TS_i^\perp = \overline{\operatorname{span}}\{1, S_i, \kappa(S_i)\}$.

A.3. Some properties of cubic B-splines.
Let $\xi_1 < \xi_2 < \cdots < \xi_N$ be fixed knots. The first-order B-spline basis functions based on these knots can be expressed as
$$B_i^1(x) = I(x \in [\xi_i, \xi_{i+1})), \qquad i = 1, \dots, N-1,$$
and the $k$th-order B-spline basis functions can be obtained recursively ($k \ge 2$) by
$$B_i^k(x) = \frac{x - \xi_i}{\xi_{i+k-1} - \xi_i}B_i^{k-1}(x) + \frac{\xi_{i+k} - x}{\xi_{i+k} - \xi_{i+1}}B_{i+1}^{k-1}(x),$$
for $i = 1, \dots, N-k$. Each $B_i^k(x)$ is differentiable w.r.t. $x$ up to order $k-2$, and
$$\frac{d}{dx}B_i^k(x) = \frac{k-1}{\xi_{i+k-1} - \xi_i}B_i^{k-1}(x) - \frac{k-1}{\xi_{i+k} - \xi_{i+1}}B_{i+1}^{k-1}(x).$$
We use the 4th-order, so-called cubic, B-splines $\{B_i^4 : 1 \le i \le N-4\}$ with equally spaced knots, that is, $\xi_{i+1} - \xi_i = \delta$ ($i = 1, \dots, N-1$) for some algorithm-determined $\delta$. For simplicity, the superscript in $B_i^4$ is omitted below. The following properties of cubic B-splines will be used frequently in proving the lemmas below (see [11, 23] for details):

(I) $0 \le B_i(x) < 1$; $B_i(x)B_j(x) = 0$ if $|i-j| > 3$.
(II) $|\frac{d}{dx}B_i(x)| < c\delta^{-1}$, $|\frac{d^2}{dx^2}B_i(x)| < c\delta^{-2}$.
(III) $\sum_{i=1}^N[B_i(x)]^2 < 1$; $\sum_{i=1}^N[\frac{d}{dx}B_i(x)]^2 < c\delta^{-2}$.
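The first-order basis, the recursion and the derivative formula above translate directly into code. The following sketch (our own check, with arbitrary knot values) evaluates the cubic ($k = 4$) basis by the recursion and verifies the derivative formula against a central finite difference, along with property (I) and the standard partition of unity (not listed above, but a classical B-spline fact on the interior of the knot range).

```python
import numpy as np

def bspline_ord(x, xi, k, i):
    """Order-k B-spline B_i^k(x) via the Cox-de Boor recursion (0-indexed knots xi)."""
    if k == 1:
        return float(xi[i] <= x < xi[i + 1])
    left = (x - xi[i]) / (xi[i + k - 1] - xi[i]) * bspline_ord(x, xi, k - 1, i)
    right = (xi[i + k] - x) / (xi[i + k] - xi[i + 1]) * bspline_ord(x, xi, k - 1, i + 1)
    return left + right

def bspline_ord_deriv(x, xi, k, i):
    """d/dx B_i^k(x) via the derivative formula (valid for k >= 2)."""
    return ((k - 1) / (xi[i + k - 1] - xi[i]) * bspline_ord(x, xi, k - 1, i)
            - (k - 1) / (xi[i + k] - xi[i + 1]) * bspline_ord(x, xi, k - 1, i + 1))

delta = 0.5
xi = np.arange(0.0, 6.0 + delta, delta)   # equally spaced knots (arbitrary example)
n_basis = len(xi) - 4                     # number of cubic (order-4) splines

x0 = 2.3                                  # interior point, away from the knots
vals = [bspline_ord(x0, xi, 4, i) for i in range(n_basis)]

# derivative formula vs. central finite difference
h = 1e-5
for i in range(n_basis):
    fd = (bspline_ord(x0 + h, xi, 4, i) - bspline_ord(x0 - h, xi, 4, i)) / (2 * h)
    assert abs(bspline_ord_deriv(x0, xi, 4, i) - fd) < 1e-6
```

At any interior point, exactly four cubic splines are nonzero (each has support over four knot intervals), which is what makes the matrices $A_n(W_k)$ banded.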
A.4. Supporting lemmas for Propositions 6.1–6.4.

In this subsection, we prove all of the lemmas used in the proofs of Propositions 6.1–6.4. Recall that for each $\phi_k$ ($k = 1, \dots, m$), we have an interval $[b_{nk,1}, b_{nk,2}]$ and $n_k$ cubic B-spline basis functions defined on it using equally spaced knots, say $B_n^{(k)} = (B_{n1}^{(k)}, \dots, B_{nn_k}^{(k)})^T$, as in Section 2.4. Thus, we have constructed a sequence of sieves $\mathcal G_n^{(k)}$ using $B_n^{(k)}$ as basis functions. For any $W \in \Omega_n$, we have a class of estimates $\hat\phi_{W_k} \in \mathcal G_n^{(k)}$ of $\phi_{W_k}$, as defined in Section 2.4. For convenience, the superscript of $B_n^{(k)}$ is often omitted below.

Let $\Omega_n^{(k)} = \{W_k : W \in \Omega_n\}$ for $k = 1, \dots, m$. We also need an intermediate approximation function $\hat{\hat\phi}_{W_k} \in \mathcal G_n^{(k)}$, defined as follows. For $w \in \Omega_n^{(k)}$,
$$\hat{\hat\phi}_w = \gamma(w)^TB_n^{(k)}, \qquad (A.3)$$
where $\gamma(w) = A(w)^{-1}D(w)$ with $A(w) = \int B_n^{(k)}B_n^{(k)T}(wX)\,dP$ and $D(w) = \int [B_n^{(k)}]'(wX)\,dP$. Note that the subscript $w$ of $\hat{\hat\phi}_w$ should always be associated with $\Omega_n^{(k)}$ for some $k \in \{1, \dots, m\}$; similarly for $\hat\phi_w$.

In the following, $c$ denotes a constant (depending only on the population law $P$) whose exact value may vary from place to place (even within a single line). For a column vector $x \in \mathbb R^m$, $\|x\| = \sqrt{x^Tx}$. For an $m\times m$ real matrix $A$, $\|A\|_1 = \max_{1\le i\le m}\|A_i\|$, $\|A\|_2 = \max_{x\in\mathbb R^m, \|x\|=1}\|Ax\|$ and $\|A\|_F = \sqrt{\operatorname{tr}(A^TA)}$; thus, $\|A\|_2 \le \|A\|_F$.

The following Lemmas A.1–A.8 hold under the conditions of Theorem 3.1. Jin [17] obtained results similar to Lemma A.2 and Lemmas A.5–A.7 concerning the B-spline approximation, but in generally different settings.

Lemma A.1.
For sufficiently large $n$,
$$\sup_{w\in\Omega_n^{(k)}}|f_w|_\infty < \infty, \qquad \sup_{w\in\Omega_n^{(k)}}|f_w'|_\infty < \infty,$$
$$\inf_{w\in\Omega_n^{(k)}}\min_{t\in[b_{nk,1},b_{nk,2}]}f_w(t) \ge c\delta_n \quad\text{and}\quad \sup_{w\in\Omega_n^{(k)}}P(wX \notin [b_{nk,1}, b_{nk,2}]) = o(1).$$

Proof.
The first two inequalities follow easily from C3. The remaining two are proved as follows. For any $w \in \Omega_n^{(k)}$, $\|w - W_{Pk}\| \le \varepsilon_n$. If we let $v = wW_P^{-1}$, then $|v_j| \to 0$ for $1 \le j \ne k \le m$ and $v_k \to 1$ as $n \to \infty$. Since $f_w(t) = E[\frac{1}{|v_k|}r_k(\frac{t - \sum_{j\ne k}v_jS_j}{v_k})]$, we can consider the right-hand side as a function (say $h$) of $v$. By a first-order Taylor expansion,
$$|f_w(t) - r_k(t)| \le \varepsilon_n\|W_P^{-1}\|\Big(\sum_{j=1}^m\sup_{w\in\Omega_n^{(k)}}\Big|\frac{\partial}{\partial v_j}h(v)\Big|\Big) \le c\varepsilon_n,$$
where, by direct calculation using C3, $|\frac{\partial}{\partial v_j}h(v)|$ is uniformly bounded over $w \in \Omega_n^{(k)}$ and $t \in \mathbb R$. Recall that $r_k(t) \ge c\delta_n$ for $t \in [b_{nk,1}, b_{nk,2}]$ and $\varepsilon_n \ll \delta_n$; thus,
$$\inf_{w\in\Omega_n^{(k)}}\min_{t\in[b_{nk,1},b_{nk,2}]}f_w(t) \ge c\delta_n.$$
Finally,
$$P(wX \in [b_{nk,1}, b_{nk,2}]) = \int_{[b_{nk,1},b_{nk,2}]}f_w(t)\,dt \ge \int_{[b_{nk,1},b_{nk,2}]}(r_k(t) - c\varepsilon_n)\,dt = P(S_k \in [b_{nk,1}, b_{nk,2}]) - c\varepsilon_n(b_{nk,2} - b_{nk,1}).$$
Since $\varepsilon_n(b_{nk,2} - b_{nk,1}) = o(1)$ and $P(S_k \in [b_{nk,1}, b_{nk,2}]) \to 1$, we have $\inf_{w\in\Omega_n^{(k)}}P(wX \in [b_{nk,1}, b_{nk,2}]) \to 1$. □

Recall the definitions of $\hat\phi_{W_k}$, $\gamma_n(W_k)$, $A_n(W_k)$ and $D_n(W_k)$ in Section 2.4 and that of $\gamma(w) = [A(w)]^{-1}D(w)$ in (A.3). The lemmas below give their uniform convergence rates.

Lemma A.2.
The following hold for $k, i = 1, \dots, m$:

(1) $\|D(w)\| \le c\delta_nn_k^{1/2}$ and $c_1\delta_n \le \operatorname{eig}(A(w)) \le c_2\delta_n$ for $w \in \Omega_n^{(k)}$;
(2) $\sup_{w\in\Omega_n^{(k)}}\|D_n(w) - D(w)\| = O_P((n_k\log n_k)^{1/2}(n\delta_n)^{-1/2})$;
(3) $\sup_{w\in\Omega_n^{(k)}}\|D_n(w)\| = O_p(\delta_nn_k^{1/2})$;
(4) $\sup_{w\in\Omega_n^{(k)}}\|A_n(w) - A(w)\| = O_P((\delta_n\log n_k)^{1/2}n^{-1/2})$;
(5) $\sup_{w\in\Omega_n^{(k)}}\|A_n^{-1}(w)\| = O_P(\delta_n^{-1})$;
(6) $\sup_{w\in\Omega_n^{(k)}}\|\gamma_n(w)\| = O_p(n_k^{1/2}\delta_n^{-1})$;
(7) $\sup_{\Omega_n^{(k)}}\|\gamma_n(w) - \gamma(w)\| = o_P(1)$;
(8) $\sup_{\Omega_n^{(k)}}\|\frac{\partial}{\partial w_i}\{A_n(w)\}\| = O_P(\delta_n^{-1/2})$;
(9) $\sup_{\Omega_n^{(k)}}\|\frac{\partial}{\partial w_i}\{D_n(w)\}\| = O_P(\delta_n^{-2})$;
(10) $\sup_{\Omega_n^{(k)}}\|\frac{\partial}{\partial w_i}\{\gamma_n(w)\}\| = o_P(\delta_n^{-5/2}n_k^{1/2})$.

Proof.
The strategy is as follows. First, (3) is implied by (1) and (2). Second, (6) is implied by (3) and (5). Third, since
$$\frac{\partial}{\partial w_i}\{\gamma_n(w)\} = -A_n^{-1}(w)\frac{\partial}{\partial w_i}\{A_n(w)\}\gamma_n(w) + A_n^{-1}(w)\frac{\partial}{\partial w_i}\{D_n(w)\},$$
(10) is implied by (5), (6), (8) and (9). Hence, it suffices to prove (1), (2), (4), (5), (7), (8) and (9). Further, the proofs of (2) and (4) are similar, as are the proofs of (8) and (9); thus, the proofs of (4) and (9) are omitted.

Proof of (1).
By taking derivatives of the cubic B-splines, $B'_{ni}(t) = \delta_n^{-1}(\tilde B_{ni}(t) - \tilde B_{n,i+1}(t))$, where the $\tilde B_{ni}$ are the third-order B-splines defined on the same knots, $i = 1, \dots, n_k$. We then have
$$|D_i(w)| = \delta_n^{-1}\Big|\int(\tilde B_{n,i}(t) - \tilde B_{n,i+1}(t))f_w(t)\,dt\Big| = \delta_n^{-1}\Big|\int \tilde B_{n,i}(t)(f_w(t) - f_w(t + \delta_n))\,dt\Big| \le |f_w'|_\infty\int \tilde B_{n,i}(t)\,dt \le \delta_n|f_w'|_\infty.$$
So the first result holds by Lemma A.1. The second result follows from Lemma 5.1 of [17]. □
Proof of (2).
Note that
$$P\Big(\sup_{w\in\Omega_n^{(k)}}\|D_n(w) - D(w)\| \ge t\Big) = P\Big(\sup_{w\in\Omega_n^{(k)}}\Big\|\int B_n'(wX)\,d(P_n - P)\Big\| \ge t\Big) \le \sum_{i=1}^{n_k}P\Big(\sup_{w\in\Omega_n^{(k)}}\Big|\int B'_{n,i}(wX)\,d(P_n - P)\Big| \ge \frac{t}{\sqrt{n_k}}\Big).$$
For a fixed pair $(i,k)$, let $\mathcal F_n = \{g_w(x) = B'_{n,i}(wx) : w \in \Omega_n^{(k)}\}$, a class of functions indexed by $\Omega_n^{(k)}$. Then for $w, v \in \Omega_n^{(k)}$, $|g_w(x) - g_v(x)| \le c\delta_n^{-2}\|w - v\|\|x\|$ by property (II) of cubic B-splines. Now, the index set $\Omega_n^{(k)}$ can be covered by $N = c\varepsilon_n^mu^{-m}$ balls of radius $u$ and, for any $w, v$ in the same ball, $E[(g_w(X) - g_v(X))^2] \le c\delta_n^{-4}u^2$. Further, by property (II) of cubic B-splines, $\sup_{g_w\in\mathcal F_n}|g_w|_\infty \le c\delta_{nk}^{-1}$. Then, for $c\max(\delta_n^{-1}, \varepsilon_n\delta_n^{-2}) \le a \le c\sqrt n$,
$$P\Big(\sup_{w\in\Omega_n^{(k)}}\sqrt n\Big|\int B'_{n,i}(wX)\,d(P_n - P)\Big| \ge a\Big) \le c\exp(-ca^2\delta_n).$$
This follows from Theorem 5.11 of van de Geer ([24], page 75), which generalizes Hoeffding's inequality and gives the uniform convergence rate for a class of functions in terms of its size, measured by so-called bracketing entropy; see [24] for details. Note that $\varepsilon_n \ll \delta_n$ by C7, so for $t \ge cn_k^{1/2}n^{-1/2}\delta_n^{-1}$ we have
$$P\Big(\sup_{w\in\Omega_n^{(k)}}\|D_n(w) - D(w)\| \ge t\Big) \le n_k\exp(-ct^2n\delta_nn_k^{-1}).$$
Thus,
$$\sup_{\Omega_n^{(k)}}\|D_n(w) - D(w)\| = O_p\Big(\sqrt{\frac{n_k\log n_k}{n\delta_n}}\Big). \qquad\square$$

Proof of (5).
Since $A_n^{-1} = (A + A_n - A)^{-1} = A^{-1}(I + (A_n - A)A^{-1})^{-1}$, by (1) and (4) (omitting the index $w$) we have
$$\sup_{w\in\Omega_n^{(k)}}\|A_n - A\|\|A^{-1}\| = o_p(1);$$
hence,
$$\sup_{w\in\Omega_n^{(k)}}\|A_n^{-1}\| \le \sup_{w\in\Omega_n^{(k)}}\|A^{-1}\|(1 - \|A_n - A\|\|A^{-1}\|)^{-1} = O_P(\delta_n^{-1}).$$
Here, we use the matrix norm inequality $\|(I + A)^{-1}\| \le (1 - \|A\|)^{-1}$ for any square matrix $A$ with $\|A\| < 1$, where $I$ is the identity matrix. □

Proof of (7).
Since, by (1)–(5),
$$\sup_{w\in\Omega_n^{(k)}}\|\gamma_n(w) - \gamma(w)\| = \sup_{w\in\Omega_n^{(k)}}\|A^{-1}(D_n - D) - A_n^{-1}(A_n - A)A^{-1}D_n\| = O_p\Big(\delta_n^{-1}\sqrt{\frac{n_k\log n_k}{n\delta_n}} + \delta_n^{-1}\sqrt{\frac{\delta_n\log n_k}{n}}\,\delta_n\sqrt{n_k}\,\delta_n^{-1}\Big) = o_P(1),$$
where the last equality follows from C7. □

Proof of (8).
Note that the partial derivative is (omitting $(i)$ from $B_n^{(i)}$)
$$\frac{\partial}{\partial w_k}\{A_n(w)\}_{jj'} = \int(B_{nj}B'_{nj'} + B'_{nj}B_{nj'})(wX)X_k\,dP_n.$$
By the Cauchy–Schwarz inequality,
$$\Big|\int(B_{nj}B'_{nj'} + B'_{nj}B_{nj'})(wX)X_k\,dP_n\Big|^2 \le \int(B_{nj}B'_{nj'} + B'_{nj}B_{nj'})^2(wX)\,dP_n\int X_k^2\,dP_n.$$
Similarly to the proof of (2), we can calculate the uniform convergence rate
$$\sup_{1\le j,j'\le n_i,\,|j-j'|\le 3}\ \sup_{w\in\Omega_n^{(i)}}\int(B_{nj}B'_{nj'} + B'_{nj}B_{nj'})^2(wX)\,d(P_n - P) = o_p(1).$$
Further, from Lemma A.1, $\sup_{w\in\Omega_n^{(k)}}|f_w|_\infty$ is bounded, so after algebraic expansion we have
$$\sup_{w\in\Omega_n^{(i)}}\int(B_{nj}B'_{nj'} + B'_{nj}B_{nj'})^2(wX)\,dP \le c\delta_n^{-1}.$$
Thus, $|\frac{\partial}{\partial w_k}\{A_n(w)\}_{jj'}| \le c\delta_n^{-1/2}$. For cubic B-splines, $B_{nj}(x)B'_{nj'}(x) \equiv 0$ if $|j - j'| > 3$; thus, each row of $\frac{\partial}{\partial w_k}\{A_n(w)\}$ has at most seven nonzero elements. So,
$$\sup_{\Omega_n^{(i)}}\Big\|\frac{\partial}{\partial w_k}\{A_n(w)\}\Big\| \le c\sup_{\Omega_n^{(i)}}\max_{j,j'}\Big|\frac{\partial}{\partial w_k}\{A_n(w)\}_{jj'}\Big| = O_p(\delta_n^{-1/2}). \qquad\square$$

Lemma A.3. $\|\int B_n^{(i)}(S_i)S_j\,dP_n\| = O_P(n^{-1/2})$, where $S_i = W_{Pi}X$, $1 \le i \ne j \le m$.

Proof. (We omit $(i)$ in $B_n^{(i)}$, $B_{nk}^{(i)}$.) Since $S_i$ and $S_j$ are independent and $E(S_j) = 0$,
$$E\Big(\Big\|\int B_n(S_i)S_j\,dP_n\Big\|^2\Big) = \frac1n E\Big(\sum_{k=1}^{n_i}B_{nk}^2(S_i)S_j^2\Big) \le \frac1n E(S_j^2). \qquad\square$$

Lemma A.4. $\sup_{\Omega_n}\|\int B_n^{(i)}(W_iX)W_jX\,dP_n\| = O_p(\varepsilon_n\delta_n^{-1}n_i^{1/2})$ for $1 \le i \ne j \le m$.

Proof.
Similarly to the proof of Lemma A.2(2), we have
$$\sup_{\Omega_n}\Big\|\int B_n(W_iX)W_jX\,d(P_n - P)\Big\| = O_P\Big(\sqrt{\frac{\delta_nn_i\log n_i}{n}}\Big).$$
Further, note that $|B_{nk}(x) - B_{nk}(y)| \le c\delta_n^{-1}|x - y|$ and that $\int B_{nk}(W_{Pi}X)W_{Pj}X\,dP = E[B_{nk}(S_i)]E[S_j] = 0$, so for any $W \in \Omega_n$,
$$\Big\|\int B_n(W_iX)W_jX\,dP\Big\|^2 = \sum_{k=1}^{n_i}\Big|\int(B_{nk}(W_iX)W_jX - B_{nk}(W_{Pi}X)W_{Pj}X)\,dP\Big|^2 \le \sum_{k=1}^{n_i}(c\delta_n^{-1}E[\|X\|^2]\|W_i - W_{Pi}\|\|W_j\| + E\|X\|\|W_j - W_{Pj}\|)^2 \le c\varepsilon_n^2\delta_n^{-2}n_i.$$
Thus, $\sup_{\Omega_n}\|\int B_n^{(i)}(W_iX)W_jX\,dP_n\| = O_P(\sqrt{\delta_nn_i\log n_i/n} + \varepsilon_n\delta_n^{-1}n_i^{1/2})$. □

Lemma A.5. $E\{(\hat{\hat\phi}_{W_i}(W_iX) - \phi_{W_i,n}(W_iX))^2\} \le c\delta_n^6|\phi'''_{W_i,n}|_\infty^2$.

Proof.
Let $d(\phi_{W_i,n}, \mathcal G_n) = \inf_{h\in\mathcal G_n}|\phi_{W_i,n} - h|_\infty$. By a Jackson-type theorem [11], $d(\phi_{W_i,n}, \mathcal G_n) \le c\delta_n^3|\phi'''_{W_i,n}|_\infty$. Thus, the result follows from
$$E\{(\hat{\hat\phi}_{W_i}(W_iX) - \phi_{W_i,n}(W_iX))^2\} = \inf_{h\in\mathcal G_n^{(i)}}E\{(h(W_iX) - \phi_{W_i,n}(W_iX))^2\} \le d(\phi_{W_i,n}, \mathcal G_n)^2. \qquad\square$$

Lemma A.6. $|\hat{\hat\phi}_{W_i} - \phi_{W_i,n}|_\infty \le c\delta_n^{5/2}|\phi'''_{W_i,n}|_\infty$; $|\hat{\hat\phi}'_{W_i} - \phi'_{W_i,n}|_\infty \le c\delta_n^{3/2}|\phi'''_{W_i,n}|_\infty$.

Proof.
By Theorem XII.4 of de Boor [11], there exists a quasi-interpolant $\tilde\phi_{W_i}(t) = a^TB_n(t)$, for some $a \in \mathbb R^{n_i}$, such that $\tilde\phi_{W_i}$ simultaneously approximates $\phi_{W_i,n}$ and its first derivative to optimal order; that is,
$$|\tilde\phi_{W_i} - \phi_{W_i,n}|_\infty \le c|\phi'''_{W_i,n}|_\infty\delta_n^3 \quad\text{and}\quad |\tilde\phi'_{W_i} - \phi'_{W_i,n}|_\infty \le c|\phi'''_{W_i,n}|_\infty\delta_n^2.$$
So $E(\tilde\phi_{W_i}(W_iX) - \phi_{W_i,n}(W_iX))^2 \le c|\phi'''_{W_i,n}|_\infty^2\delta_n^6$. Combining this with Lemma A.5, we have
$$E(\hat{\hat\phi}_{W_i} - \tilde\phi_{W_i})^2(W_iX) \le 2E(\tilde\phi_{W_i} - \phi_{W_i,n})^2(W_iX) + 2E(\hat{\hat\phi}_{W_i} - \phi_{W_i,n})^2(W_iX) \le c|\phi'''_{W_i,n}|_\infty^2\delta_n^6.$$
Let $\operatorname{coef}(\tilde\phi_{W_i})$ and $\operatorname{coef}(\hat{\hat\phi}_{W_i})$ be the coefficient vectors of $B_n$ in $\tilde\phi_{W_i}$ and $\hat{\hat\phi}_{W_i}$, respectively. Then
$$E(\hat{\hat\phi}_{W_i} - \tilde\phi_{W_i})^2(W_iX) = E[((\operatorname{coef}(\tilde\phi_{W_i}) - \operatorname{coef}(\hat{\hat\phi}_{W_i}))^TB_n)^2(W_iX)] \ge \lambda_n\|\operatorname{coef}(\tilde\phi_{W_i}) - \operatorname{coef}(\hat{\hat\phi}_{W_i})\|^2,$$
where $\lambda_n$ is the minimum eigenvalue of $A(W_i) = E[B_nB_n^T(W_iX)]$. By Lemma A.2, $\lambda_n \ge c\delta_n$. Thus,
$$|\hat{\hat\phi}_{W_i} - \tilde\phi_{W_i}|_\infty \le \|\operatorname{coef}(\tilde\phi_{W_i}) - \operatorname{coef}(\hat{\hat\phi}_{W_i})\| \le c|\phi'''_{W_i,n}|_\infty\delta_n^{5/2}.$$
Hence, $\sup_{\Omega_n}|\hat{\hat\phi}_{W_i} - \phi_{W_i,n}|_\infty \le \sup_{\Omega_n}c|\phi'''_{W_i,n}|_\infty\delta_n^{5/2}$. Further, observing that $|B'_{nk}|_\infty \le c\delta_n^{-1}$, we have
$$|\hat{\hat\phi}'_{W_i} - \tilde\phi'_{W_i}|_\infty \le c\|\operatorname{coef}(\tilde\phi_{W_i}) - \operatorname{coef}(\hat{\hat\phi}_{W_i})\|\delta_n^{-1} \le c|\phi'''_{W_i,n}|_\infty\delta_n^{3/2}.$$
Thus, $|\hat{\hat\phi}'_{W_i} - \phi'_{W_i,n}|_\infty \le |\tilde\phi'_{W_i} - \phi'_{W_i,n}|_\infty + |\hat{\hat\phi}'_{W_i} - \tilde\phi'_{W_i}|_\infty \le c|\phi'''_{W_i,n}|_\infty\delta_n^{3/2}$. □

Lemma A.7. $\int(\hat\phi_{W_{Pk}}(S_k) - \phi_k(S_k))^2\,dP_n = o_p(1)$.

Proof.
Observe that
$$\tfrac13\int(\hat\phi_{W_{Pk}}(S_k) - \phi_k(S_k))^2\,dP_n \le \int(\hat\phi_{W_{Pk}}(S_k) - \hat{\hat\phi}_{W_{Pk}}(S_k))^2\,dP_n + \int(\hat{\hat\phi}_{W_{Pk}}(S_k) - \phi_{k,n}(S_k))^2\,dP_n + \int\phi_k^2(S_k)I(S_k \notin [b_{nk,1}, b_{nk,2}])\,dP_n.$$
The remainder of the proof is similar to the use of (6.1) in proving Proposition 6.1, using Lemma A.2 and Lemma A.6. □
Lemma A.8.
Let $\{p(\cdot, \theta, \eta) : \theta \in \Omega \subset \mathbb R^d, \eta \in \mathcal E\}$ be a parametric or semiparametric model, where $\theta$ is the parameter of interest. Suppose that moderate regularity conditions are satisfied and that $l^*(\cdot, \theta, \eta)$ is the efficient score function of $\theta$, as defined in [5]. Then
$$\int\frac{\partial}{\partial\theta}l^*(X, \theta, \eta)\,dP_{(\theta,\eta)} = -\int l^*l^{*T}(X, \theta, \eta)\,dP_{(\theta,\eta)}.$$

Proof.
We only prove this for the parametric case $\mathcal E \subset \mathbb R^m$. Let $I(\theta, \eta)$ be the information matrix of $(\theta, \eta)$, with blocks $I_{ij}$ ($i, j = 1, 2$). By classical likelihood theory (e.g., Proposition 2.4.1 of [5]),
$$l^*(\cdot, \theta, \eta) = \dot l_1 - (I_{12}I_{22}^{-1})(\theta, \eta)\dot l_2.$$
Here, $\dot l_1$ and $\dot l_2$ are the partial derivatives of $l(\cdot, \theta, \eta) \equiv \log p(\cdot, \theta, \eta)$ w.r.t. $\theta$ and $\eta$, respectively. Similarly, $\ddot l_{ij}$ ($i, j = 1, 2$) are defined as the second-order derivatives of $l(\cdot, \theta, \eta)$ w.r.t. $(\theta, \eta)$. Thus,
$$\frac{\partial}{\partial\theta}l^*(X, \theta, \eta) = \ddot l_{11} - (I_{12}I_{22}^{-1})(\theta, \eta)\ddot l_{21} - \frac{\partial}{\partial\theta}\{(I_{12}I_{22}^{-1})(\theta, \eta)\}\dot l_2.$$
Since $\int\dot l_2(X, \theta, \eta)\,dP_{(\theta,\eta)} = 0$, we have
$$\int\frac{\partial}{\partial\theta}l^*(X, \theta, \eta)\,dP_{(\theta,\eta)} = \int\ddot l_{11}\,dP_{(\theta,\eta)} - (I_{12}I_{22}^{-1})(\theta, \eta)\int\ddot l_{21}\,dP_{(\theta,\eta)}.$$
Since the information matrix satisfies $I_{ij} = -\int\ddot l_{ij}(X, \theta, \eta)\,dP_{(\theta,\eta)}$, $i, j = 1, 2$, and $\int l^*l^{*T}\,dP = I_{11} - I_{12}I_{22}^{-1}I_{21}$ (see Proposition 2.4.1 of [5], page 32), the result follows. For the semiparametric case, the reader is referred to [5] for a generalization of this proof. □

Acknowledgments.
This paper was presented at the Joint Statistical Meetings in San Francisco, USA, August 2003. The results were reported in Chen's Ph.D. thesis. We would like to thank Francis Bach at UC Berkeley for sharing his simulation code and Scott Vander Wiel for helpful comments. We also wish to thank the Editors and four referees for useful suggestions that have helped significantly to improve the paper.

REFERENCES

[1] Amari, S. (2002). Independent component analysis and method of estimating functions. IEICE Trans. Fundamentals E85-A.
[2] Amari, S. and Cardoso, J.-F. (1997). Blind source separation: semiparametric statistical approach. IEEE Trans. Signal Process.
[3] Bach, F. and Jordan, M. (2002). Kernel independent component analysis. J. Machine Learning Res.
[4] Bickel, P. and Doksum, K. (2001). Mathematical Statistics I, 2nd ed. Prentice-Hall.
[5] Bickel, P., Klaassen, C., Ritov, Y. and Wellner, J. (1998). Efficient and Adaptive Estimation for Semiparametric Models. Springer, New York.
[6] Cardoso, J.-F. (1998). Blind signal separation: statistical principles. Proc. IEEE.
[7] Cardoso, J.-F. (1999). High-order contrasts for independent component analysis. Neural Comput.
[8] Chen, A. and Bickel, P. J. (2005). Consistent independent component analysis and prewhitening. IEEE Trans. Signal Process.
[9] Comon, P. (1994). Independent component analysis, a new concept? Signal Process.
[10] Cox, D. D. (1985). A penalty method for nonparametric estimation of the logarithmic derivative of a density function. Ann. Inst. Statist. Math.
[11] de Boor, C. (2001). A Practical Guide to Splines, revised ed. Springer. MR1900298
[12] Eriksson, J. and Koivunen, V. (2003). Characteristic-function based independent component analysis. Signal Process.
[13] Hastie, T. and Tibshirani, R. (2002). Independent component analysis through product density estimation. Technical report, Dept. Statistics, Stanford Univ.
[14] Huber, P. J. (1985). Projection pursuit. Ann. Statist.
[15] Hyvärinen, A. (1999). Fast and robust fixed-point algorithms for independent component analysis. IEEE Trans. Neural Networks.
[16] Hyvärinen, A., Karhunen, J. and Oja, E. (2001). Independent Component Analysis. Wiley, New York.
[17] Jin, K. (1992). Empirical smoothing parameter selection in adaptive estimation. Ann. Statist.
[18] Kagan, A., Linnik, Y. and Rao, C. (1973). Characterization Problems in Mathematical Statistics. Wiley, New York. MR0346969
[19] Lee, T.-W., Girolami, M. and Sejnowski, T. (1999). Independent component analysis using an extended infomax algorithm for mixed subgaussian and supergaussian sources. Neural Comput.
[20] Murphy, S. and van der Vaart, A. (2000). On profile likelihood. J. Amer. Statist. Assoc.
[21] Pham, D. T. and Garrat, P. (1997). Blind separation of mixture of independent sources through a quasi-maximum likelihood approach. IEEE Trans. Signal Process.
[22] Samarov, A. and Tsybakov, A. (2002). Nonparametric independent component analysis. Bernoulli.
[23] Truong, Y. K., Kooperberg, C. and Stone, C. J. (2005). Statistical Modeling with Spline Functions: Methodology and Theory. Springer, New York.
[24] van de Geer, S. (2000). Applications of Empirical Process Theory. Cambridge Univ. Press. MR1739079
[25] Vigario, R., Jousmaki, V., Hamalainen, M., Hari, R. and Oja, E. (1998). Independent component analysis for identification of artifacts in magnetoencephalographic recordings. In Advances in Neural Information Processing Systems.
[26] Vlassis, N. and Motomura, Y. (2001). Efficient source adaptivity in independent component analysis. IEEE Trans. Neural Networks.
600 Mountain Avenue, Room 2c-281Murray Hill, New Jersey 07974USAE-mail: [email protected]
367 Evans Hall, Berkeley, California 94720, USA. E-mail: