Dynamic Large Spatial Covariance Matrix Estimation in Application to Semiparametric Model Construction via Variable Clustering: the SCE approach
DDynamic Large Spatial Covariance Matrix Estimation inApplication to Semiparametric Model Construction viaVariable Clustering: the SCE approach
Song Song ∗ September 11, 2018
Abstract
To better understand the spatial structure of large panels of economic and financial timeseries and provide a guideline for constructing semiparametric models, this paper first consid-ers estimating a large spatial covariance matrix of the generalized m -dependent and β -mixingtime series (with J variables and T observations) by hard thresholding regularization as longas log J X ∗ ( T ) /T = O (1) (the former scheme with some time dependence measure X ∗ ( T )) orlog J/T = O (1) (the latter scheme with the mixing coefficient β mix = O{ ( J δ (cid:48) √ log J T ) − } , δ (cid:48) >
0. We quantify the interplay between the estimators’ consistency rate and the time dependencelevel, discuss an intuitive resampling scheme for threshold selection, and also prove a generalcross-validation result justifying this. Given a consistently estimated covariance (correlation) ma-trix, by utilizing its natural links with graphical models and semiparametrics, after “screening”the (explanatory) variables, we implement a novel forward (and backward) label permutation pro-cedure to cluster the “relevant” variables and construct the corresponding semiparametric model,which is further estimated by the groupwise dimension reduction method with sign constraints.We call this the SCE (screen - cluster - estimate) approach for modeling high dimensional datawith complex spatial structure. Finally we apply this method to study the spatial structure oflarge panels of economic and financial time series and find the proper semiparametric structurefor estimating the consumer price index (CPI) to illustrate its superiority over the linear models.
Keywords : Time Series, Covariance Estimation, Regularization, Sparsity, Thresholding, Semi-parametrics, Graphical Model, Variable Clustering
JEL classification : C13, C14, C32, E30, G10 ∗ University of California, Berkeley. Email: [email protected] a r X i v : . [ s t a t . M L ] J un Introduction
Recent breakthroughs in technology have created an urgent need for high-dimensional data analysistools. Examples include economic and financial time series, genetic data, brain imaging, spectroscopicimaging, climate data and many others. To model high dimensional data, especially large panels ofeconomic and financial time series as our focus here, it is very important to begin with understand-ing the “spatial” structure (over the space of variables instead of from a geographic point of view;also used in future for convenience) instead of simply assuming any specific type of parametric (e.g.linear) model first. Estimation of large spatial covariance matrix plays a fundamental role here sinceit can indicate a predictive relationship that can be exploited in practice. It is also very important innumerous other areas of economics and finance, including but not limited to handling heteroscedas-ticity of high dimensional econometric models, risk management of large portfolios, setting confidenceintervals (or interval forecasts) on linear functions of the means of the components, variable groupingvia graphs, dimension reduction by principal component analysis (PCA) and classification by linearor quadratic discriminant analysis (LDA and QDA). In recent years, many application areas wherethese tools are used have dealt with very high-dimensional datasets with relatively small sample size,e.g. the typically low frequency macroeconomic data.It is well known by now that the empirical covariance matrix for samples of size T from a J -variateGaussian distribution, N J ( µ, Σ J ) is not a good estimator of the population covariance if J is large. If J/T → c ∈ (0 ,
1) and the covariance matrix Σ J = I (the identity), then the empirical distribution ofthe eigenvalues of the sample covariance matrix ˆΣ J follow the Marˆcenko-Pastur Law (Marˆcenko andPastur (1967)) and the eigenvalues are supported on ((1 − √ c ) , (1 + √ c ) ). Thus, the larger J/T is,the more spread out the eigenvalues are.Therefore, alternative estimators for large covariance matrices have attracted a lot of attentionrecently. Two broad classes of covariance estimators have emerged. One is to remedy the samplecovariance matrix and construct a better estimate by using approaches such as banding, taperingand thresholding. The other is to reduce dimensionality by imposing some structure on the datasuch as factor models, Fan et al. (2008). Among the first class, regularizing the covariance matrix bybanding or tapering relies on a natural ordering among variables and assumes that variables far apartin the ordering are only weakly correlated, Wu and Pourahmadi (2003), Bickel and Levina (2008b),Cai and Zhou (2011) among others. However, there are many applications, such as large panels ofmacroeconomic and financial time series, gene expression arrays and other spatial data, where there is2o total ordering on the plane and no defined notion of distance among variables at all. These existingapplications require estimators to be invariant under variable permutations such as regularizing thecovariance matrix by thresholding, El Karoui (2008) and Bickel and Levina (2008a). In this paper, weconsider thresholding of the sample spatial covariance matrix for high dimensional time series, whichextends the existing work from the iid to the dependent scenarios. Under the time series setup, a veryimportant question to ask is: how the time dependence will affect the estimate’s consistency? This isthe first question this paper is going to answer.For time series, there have been two recent works by Bickel and Gel (2011) and Xiao and Wu(2011) about banding and tapering the large autocovariance matrices for univariate time series. Butour goal here is to better understand the spatial structure of high dimensional time series, so we needa consistent estimate of the large spatial covariance matrix, especially under a mixture of serial cor-relation (temporal dynamics), high dimensional (spatial) dependence structure and moderate samplesize (relative to dimensionality).
As mentioned at the very beginning, when the spatial structure of the high dimensional data (time se-ries) is complex, instead of simply assuming any specific type of parametric (e.g. linear) model first, wecould adopt the flexible nonparametric approach. Due to the “curse of dimensionality” disadvantageof full nonparametrics, various semiparametric models have been considered to maintain flexibility inmodeling while attempting to deal with the “curse of dimensionality” problem. However, most of theprior semiparametric works were carried out under some (prefixed) specific classes of semiparametricmodels without discussing which ones might be closer to the actual data structure. More specifically,to model some dependent variable y (or x J ) using explanatory variables x , x , . . . , x J − (very large J − x ’s to avoid overfitting. • E ( y ) = g ( x β + x β + x β + x β + . . . + x J − β J − ), where g is an unknown univariate linkfunction, and β , β , β , . . . , β J − are unknown parameters that belong to the parameter space. • E ( y ) = g ( x ) + g ( x ) + g ( x ) + . . . + g J − ( x J − ), where g , g , g , . . . , g J − are the unknownfunctions to be estimated nonparametrically.This approach encounters limitations from the following three perspectives. First , when thedimensionality J − → ∞ , the prefixed assumption itself becomes more and more questionable.3s the single index model or the additive one closer to the actual data structure? Or maybe someother type of semiparametric structures is more suitable? We do not know. And this becomes morechallenging when the sample size T is small (with respect to dimensionality). Second , this - prefixing some specific semiparametric classes first and then selecting variablesaccordingly - approach is also challenged by another character of high dimensional economic andfinancial time series: strong spatial dependence (near-collinearity). Under near-collinearity, we expectvariable selection to be unstable and very sensitive to minor perturbation of the data. In this sense,we do not expect variable selection to provide results that lead to clearer economic interpretation thanprincipal components or ridge regression. This is actually due to the fact that although comparedwith the information criteria based L and ridge regression type L regularization methods, the Lassotype L variable selection techniques (Tibshirani (1996)) could deal with large J and require weakerassumptions on the design matrix x (composed of x , . . . , x J − ), it still requires the following (as oneof many similar requirements) restricted eigenvalue (RE) assumptions from Bickel et al. (2009): thereexists a positive number κ = κ ( s ) such that min (cid:110) | x (cid:62) ∆ | √ T | ∆ R | : |R| (cid:54) s, ∆ ∈ R J − \{ } , (cid:107) ∆ R c (cid:107) (cid:54) (cid:107) ∆ R (cid:107) (cid:111) (cid:62) κ, where |R| denotes the cardinality of the set R , R c denotes the complement of the set of indices R ,and ∆ R denotes the vector formed by the coordinates of the vector ∆ w.r.t. the index set R . It isessentially a restriction on the eigenvalues of the Gram matrix Ψ T = x (cid:62) x/T as a function of sparsity s . To see this, recall the definitions of restricted eigenvalue and restricted correlation in Bickel et al.(2009): ψ min ( u ) = min z ∈ R J − (cid:54) M ( z ) (cid:54) u z (cid:62) Ψ T z | z | , (cid:54) z (cid:54) J − ,ψ max ( u ) = max z ∈ R J − (cid:54) M ( z ) (cid:54) u z (cid:62) Ψ T z | z | , (cid:54) z (cid:54) J − ,ψ m ,m = max (cid:110) f (cid:62) x (cid:62) I x I f T | f | | f | : I (cid:92) I = ∅ , | I i | (cid:54) m i , f i ∈ R I i \{ } , i = 1 , (cid:111) , where | I i | denotes the cardinality of I i and x I i is the T × | I i | submatrix of x obtained by removingfrom x the columns that do not correspond to the indices in I i . Lemma 4.1 in Bickel et al. (2009)shows that if the restricted eigenvalue of the Gram matrix Ψ T satisfies ψ min (2 s ) > ψ s, s for someinteger 1 (cid:54) s (cid:54) ( J − /
2, Assumption RE holds. Under this condition, the Lasso type estimate’svarious oracle inequalities could be derived, e.g. Bickel et al. (2009), where the upper bounds typicallynegatively depend on κ . From an economic point of view, this in fact requires that the dependencecan not be too strong, which, unfortunately, is often unsatisfied for large panels of macroeconomicand financial data. 4 hird , when the proposed high dimensional semiparametric model has a complex structure, findinga proper penalty term and the corresponding estimation method for variable selection in general mightbe very difficult, since, ideally, the penalty should depend not only on the coefficients, but also onthe (shapes of the) unknown nonparametric link functions. Several examples of regularizing highdimensional semiparametric models could be found in Chapter 5 and 8 of B¨uhlmann and van de Geer(2011), Ravikumar et al. (2009) among others.To this end, developing a specific model free high dimensional spatial structure first and thenconstructing the right class of semiparametric models seems important. Specifically speaking, given x , x , . . . , x J − and y (or x J ), we try to find the index sets A , A , . . . , A S (possibly with overlappingelements) such that y could be well approximated by: S (cid:88) s =1 g s (cid:16) | A s | (cid:88) l =1 β sl x l ∈A s (cid:17) def = S (cid:88) s =1 g s (cid:16) β (cid:62) s x A s (cid:17) , (1)where • | · | denotes the cardinality of the set · ; S is the number of index sets A , A , . . . , A S and alsothe number of the unknown univariate nonparametric link functions g , . . . , g S ; • x A s def = ( x l , l ∈ A s ) is a vector of regressors w.r.t. the index set A s ; β sl , (cid:54) s (cid:54) S, (cid:54) l (cid:54) |A s | are the unknown parameters in the parametric space; β s = ( β s , . . . , β s |A s | ); • ∀ j (cid:54) = l, s (cid:54) = t , x j ∈ A s , x l ∈ A t , x j and x l are (conditionally) independent given other x ’s.If K def = |A (cid:83) . . . (cid:83) A S | (cid:28) J − K is still possibly not moderate), we could strike abalance between dimension reduction and flexibility of modeling. Model (1) is very general andincludes the single index model if S = 1 (Ichimura (1993)), the additive model if |A | = |A | = . . . = |A S | = 1 (Hastie and Tibshirani (1990)), the partial linear model if S = 2, g is the identity functionand |A | = 1 (Speckman (1988)) and the partial linear single index model if S = 2 and g is theidentity function (Ahn and Powell (1993), Carroll et al. (1995), Yu and Ruppert (2002)). Model (1)could also be viewed as an extension of the multiple index model (Stoker (1986), Ichimura and Lee(1991), Horowitz (1998), Xia (2008)) and can be further generalized if the RHS is µ { E ( y ) } where µ is some known link function. As also considered by Li et al. (2010), if A , A , . . . , A S are disjoint (nooverlapping elements), then for each group of variables x A s , we could say g s denotes (the only) oneindex. Thus according to Li et al. (2010), model (1) is identifiable as every subspace of every group x A s is identifiable and could be solved efficiently by the grouping dimension reduction method in Liet al. (2010), where they primarily assume that the grouping information is available. An immediate5uestion is that given x , . . . , x J − ( J − → ∞ ), how can we extract S groups of “relevant” x ’s withcorresponding index sets A , . . . , A S and |A (cid:83) . . . (cid:83) A S | (cid:28) J −
1? This is the second question thispaper is going to answer. From now on, we mainly study the case where A , A , . . . , A S are disjoint,although in Section 4 we will also present the method generating overlapping index sets.Before moving on, let us study the differences among various semiparametric models from thegraphical point of view. If we use a vertex in the graph to represent a relevant variable, a solid edge ina “block” to represent linear relationship among variables inside, a bandy edge (connecting a “block”with the dependent variable y ) to represent a nonparametric link function, a crossed vertex to representan “unrelated” ones, then we can visualize different semiparametric models through correspondinggraphs. For instance, we can get Figure 1 (left) for the single index model; Figure 1 (right) for theadditive model; Figure 2 (left) for the (more general) multiple index model, among many others.As we can see, the underlying difference among various semiparametric models is where to allocatethe nonparametric link function and linearity through clustering variables. Consequently, assumingthat all the variables have been included (complete graph), if we can find the corresponding type ofgraphs, we can construct the right class of semiparametric models. Sparse concentration matrices areof special interest in graphical models because zero partial correlations help establish independenceand conditional independence relations in the context of graphical models and thus imply a graphicalstructure. For example, if we have a sparse covariance matrix for y, x , . . . , x as the one in Figure 2(right), we know that x , . . . , x are “relevant” to y , and due to the “block” structure w.r.t. x , x , x and x , x , we can construct the following class of semiparametric models as a specific case of (1): E ( y ) = g ( x β + x β + x β ) + g ( x β + x β ) + g ( x β ) . (2)Now we have found the links among semiparametrics, graphical models and sparse large spatialcovariance matrix. Thus consistently estimating the large sparse covariance matrix first and clusteringthe (explanatory) variables (or forming a block diagonal structure for the corresponding partition ofthe covariance matrix) are the key focuses. In this article, we assume that the grouping structure (orthe corresponding covariance matrix) and parametric coefficients β s are both time invariant to simplythe study.Another related and potential application of clustering variables comes from group regularization(e.g. group Lasso, Yuan and Lin (2006)) in the modern sparsity analysis. Huang and Zhang (2009)show that, if the underlying structure is strongly group-sparse, group Lasso is more robust to noisedue to the stability associated with group structure and thus requires a smaller sample size to meet thesparse eigenvalue condition required in modern sparsity analysis. However, other than the situations,e.g. multi-task learning, where we have clear background knowledge about how to group variables, in6 igure 1: g Y X – X – X – X – X – X Figure 2: g Y g g X X X X X X Figure 1: Left: E ( y ) = g ( x β + x β + x β ), where g is an unknown univariate link function, and β , β , β are unknown indices which belong to the parameter space. Right: E ( y ) = g ( x ) + g ( x ) + g ( x ), where g , g and g are the unknown functions to be estimated nonparametrically. Figure 3: g g g Y X – X – X X – X X – X – X – X Figure 3: g g g Y X – X – X X – X X – X – X – X Figure 3: g g g Y X – X – X X – X X – X – X – X Figure 3: g g g Y X – X – X X – X X – X – X – X x • • • x • • • x • • • x • • x • • x • x x x Figure 3: Left: E ( y ) = g ( x β + x β + x β ) + g ( x β + x β ) + g ( x β ), where g , g and g arethe unknown functions to be estimated nonparametrically. It includes the partial linear single indexmodel when a link function, say g , is identity; the single index model when g , g , g are the univariatelink functions; the generalized linear model when g and g are not included; Right: a sample of blockdiagonal structure after label permutation to the regularized large spatial covariance matrix.6 y • • • • • • x • • • x • • • x • • • x • • x • • x • x x x Figure 2: Left: E ( y ) = g ( x β + x β + x β ) + g ( x β + x β ) + g ( x β ), where g , g and g arethe unknown functions to be estimated nonparametrically. It includes the partial linear single indexmodel when a link function, say g , is identity; the single index model when g , g , g are the univariatelink functions; the generalized linear model when g and g are not included; Right: a sample of blockdiagonal structure after label permutation to the regularized large spatial covariance matrix.6 x • • • x • • • x • • • x • • x • • x • x • x • x • y • • • • • • • Figure 2: Left: E ( y ) = g ( x β + x β + x β ) + g ( x β + x β ) + g ( x β ), where g , g and g arethe unknown functions to be estimated nonparametrically. It includes the partial linear single indexmodel when a link function, say g , is identity; the single index model when g , g , g are the univariatelink functions; the generalized linear model when g and g are not included; Right: a sample of blockdiagonal structure after label permutation to the regularized large spatial covariance matrix.7 Figure 2: Left: E ( y ) = g ( x β + x β + x β ) + g ( x β + x β ) + g ( x β ), where g , g and g are theunknown functions to be estimated nonparametrically. Right: a sample of block diagonal structureafter label permutation to the regularized large spatial covariance matrix.7eneral, it is hard to tell how to properly group the variables to make use of group regularization. Anexample could be found in a paper in preparation with Bickel, Song and Bickel (2011), where theydiscuss three types of estimates for large vector auto regression w.r.t. different grouping methods. Tothis end, proper “grouping” of the variables is also significant.For the semiparametric modeling in econometrics, people usually “group” the variables in a “ruleof thumb” way. For example, to model the consumer price index (CPI - all items), they mightsubjectively group the variables “CPI - apparel & upkeep; transportation; medical care; commodities;durables; services” in the first group, “CPI - all items less food; all items less shelter; all items lessmedical care” in the second group; “ Producer Price Index (PPI) - Finished Goods; Finished ConsumerGoods; Intermed Mat. Supplies & Components; Crude Materials” in the third group, “Implicit PriceDeflator (of Personal Consumption Expenditures) PCE - all items; durables; nondurables; services”in the fourth group, all other variables in the last group. Is this way of grouping closest to the actualdata structure? Why not put “CPI - Durables; PCE - Durables” in one group and “CPI - Services;PCE - Services” in another group? We are going to provide a procedure of grouping these variablesfrom a data-driving approach.In summary, the novelty of this article lies in the following two aspects. First , under the highdimensional time series situation, we show consistency (and the explicit rate of convergence) of thethreshold estimator in the operator norm, uniformly over the class of matrices that satisfy our no-tion of sparsity as long as log J X ∗ ( T ) /T = O (1) (for the generalized m -dependent time series; themeaning of X ∗ ( T ) is presented later) or log J/T = O (1) (for the β -mixing process with the mixingcoefficient β mix = O{ ( J δ (cid:48) √ log J T ) − } , δ (cid:48) >
0. Furthermore, we quantify the interplay between theestimators’ consistency rate and the time dependence level, which is novel in this context. Thereare various arguments showing that convergence in the operator norm implies convergence of eigen-values of eigenvectors, El Karoui (2008) and Bickel and Levina (2008a), so this norm is particularlyappropriate for various applications. We also discuss an intuitive resampling scheme for thresholdselection for high dimensional time series, and prove a general cross-validation result that justifies thisapproach.
Second , we propose a SCE (screen - cluster - estimate) approach for modeling high dimen-sional data with complex spatial structure. Specifically, given a consistently estimated large spatialcovariance (correlation) matrix, by utilizing its natural links with graphical models and semiparamet-rics and using the correlation (or covariance for the standardized observations) between variables as ameasure of similarity, after “screening” the (explanatory) variables, we propose a novel forward (andbackward) label permutation procedure to cluster the “relevant” (explanatory) variables (or to forma block diagonal structure for the regularized large spatial matrix) and construct the corresponding8emiparametric model, which is further estimated by the groupwise dimension reduction method (Liet al. (2010)) with sign constraints.It is noteworthy that the “screening” in Step 1, “clustering” in Step 2, and the “sign constraints”in Step 3 here are crucial for applying the groupwise dimension reduction method of Li et al. (2010)in the high dimensional situation.
First , their method requires the use of the high dimensional kernelfunction, which faces some limitations when J (cid:29) T . The Step 1 here help reduce the dimensionalityfrom J ( (cid:29) T ) to a more manageable level. Second , they primarily assume that the grouping infor-mation is available from the background knowledge, which is often not available from the typically(spatially) unordered high dimensional data sets. Although they also proposed an information crite-rion based grouping method, this - “trying” many different combinations of grouping - approach isvery computationally intensive and less practical. The Step 2 here provides this grouping informationfrom a data driven approach with feasible computation.
Third , without adding the “sign constraints”,the signs of the estimated parametric coefficients might violate the economic laws (details presentedin Section 5). Overall, together with Li et al. (2010)’s very timely and stimulating work, we providean integrated approach for modeling high dimensional data with complex spatial structure.The rest of the article is organized as follows. In the next section, we present the main notationsof the thresholding estimator. The estimates’ properties are presented in Section 3. In Section 4we state the details of the SCE procedure and in Section 5 apply it to study the spatial structure oflarge panels of macroeconomic and financial times series and find the proper semiparametric structurefor estimating the consumer price index (CPI). Section 6 contains concluding remarks with a briefdiscussion. All technical proofs are sketched in the appendix.
We start by setting up notations and corresponding concepts for covariance matrix Σ, which aremostly from Bickel and Levina (2008b) and Bickel and Levina (2008a). We write λ max (Σ) = λ (Σ) (cid:62) . . . (cid:62) λ J (Σ) = λ min (Σ) for the eigenvalues of a matrix Σ. Following the notations of Bickel andLevina (2008b) and Bickel and Levina (2008a), we define that, for any 0 (cid:54) r, s (cid:54) ∞ and a J × J matrix Σ, (cid:107) Σ (cid:107) ( r,s ) def = sup {(cid:107) Σ x (cid:107) s : (cid:107) x (cid:107) r = 1 } , where (cid:107) x (cid:107) rr = (cid:80) Jj =1 | x j | r . In particular, we write (cid:107) Σ (cid:107) = (cid:107) Σ (cid:107) (2 , = max (cid:54) j (cid:54) J | λ j (Σ) | , which is the operator norm for a symmetric matrix. We also usethe Frobenius matrix norm, (cid:107) Σ (cid:107) F = (cid:80) i,j σ ij = tr (ΣΣ (cid:62) ). Dividing it by a factor J brings (cid:107) Σ (cid:107) F /J ,which is the average of a set of eigenvalues, while the operator norm (cid:107) Σ (cid:107) (2 , means the maximum of9he same set of eigenvalues. Bickel and Levina (2008a) defines the thresholding operator by T s (Σ) def = [ m ij ( | m ij | (cid:62) s )] , which we refer to as Σ thresholded at s . Notice that T s preserves symmetry; it is invariant underpermutations of variable labels; and if (cid:107) T s − T (cid:107) (cid:54) ε and λ min (Σ) > ε , it preserves positive definiteness.We study the properties of the following uniformity class of covariance matrices invariant underpermutations U τ ( q, c ( J ) , M ) def = { Σ : σ ii (cid:54) M, J (cid:88) j =1 | σ ij | q (cid:54) c ( J ) , ∀ i } , (cid:54) q < . We will mainly write c for c ( J ) in the future. Suppose that we observe T J -dimensional observa-tions X , . . . , X T with E X = 0 (without loss of generality), and E ( XX (cid:62) ) = Σ, which is independentof t . We consider the sample covariance matrix byˆΣ def = T − T (cid:88) t = i ( X t − ¯ X )( X t − ¯ X ) (cid:62) def = [ˆ σ ij ] , (3)with ¯ X = T − (cid:80) Tt =1 X t .Let us first recall the fractional cover theory based definition, which was introduced by Janson(2004) and can be viewed as a generalization of m -dependency. Given a set T and random variables V t , t ∈ T , we say: • A subset T (cid:48) of T is independent if the corresponding random variables { V t } t ∈T (cid:48) are independent. • A family {T j } j of subsets of T is a cover of T if (cid:83) j T j = T . • A family { ( T j , w j ) } j of pairs ( T j , w j ), where T j ⊆ T and w j ∈ [0 ,
1] is a fractional cover of T if (cid:80) j w j T j (cid:62) T , i.e. (cid:80) j : t ∈T j w j (cid:62) t ∈ T . • A (fractional) cover is proper if each set T j in it is independent. • X ( T ) is the size of the smallest proper cover of T , i.e. the smallest m such that T is the unionof m independent subsets. • X ∗ ( T ) is the minimum of (cid:80) j w j over all proper fractional covers { ( T j , w j ) } j .Notice that, in spirit of these notations, X ( T ) and X ∗ ( T ) depend not only on T but also on thefamily { V t } t ∈T . Further note that X ∗ ( T ) (cid:62) T = ∅ ) and that X ∗ ( T ) = 1 if and only if thevariables V t , t ∈ T are independent, i.e. X ∗ ( T ) is a measure of the dependence structure of { V t } t ∈T .10or example, if V t only depends on V t − , . . . , V t − k but is independent of all { V s } s U − P( V )) | ; V ∈ V} be a measure of dependence between U and V , whichhas been defined by Kolmogorov and first appeared in the paper by Volkonskii and Rozanov (1959).By its definition, the closer to 0 β is, the more independent the time series is. For examples ofthe β -mixing process, we refer to Doukhan (1994). Through this article, we use β mix to denote the β -mixing coefficient for notational convenience. We have the following two results which parallel those in Bickel and Levina (2008b) and Bickel andLevina (2008a). THEOREM 3.1 (Dependence level affects consistency?) Suppose for all i, j , | X ti X tj | def = | V t | (cid:54) M t holds with a high probability and (cid:80) Tt =1 M t /T is bounded by some constant C (cid:48) . Then, uniformlyon U τ ( q, c ( J ) , M ) , for sufficiently large M (cid:48) also depending on C (cid:48) , if s T = M (cid:48) ( C (cid:48) ) (cid:114) log J X ∗ ( T ) T and log J X ∗ ( T ) /T = O (1) , then (cid:107) T s T ( ˆΣ) − Σ (cid:107) = O P (cid:104) c ( J ) (cid:110) log J X ∗ ( T ) T (cid:111) (1 − q ) / (cid:105) J − (cid:107) T s T ( ˆΣ) − Σ (cid:107) F = O P (cid:104) c ( J ) (cid:110) log J X ∗ ( T ) T (cid:111) − q/ (cid:105) X ∗ ( T ))increases, or in other words, the rate is maximized when X ∗ ( T ) = 1, same as what Bickel and Levina(2008a) shows for the i.i.d case. When X ∗ ( T ) reaches T , it will be offset by T in the denominator.The intuition behind is clear: if dependence is strong, then additional information brought by a“new” observation will be effectively less, i.e. the overall information from T observations will be lesscorrespondingly, which will result in a slower consistency rate. On the other hand, according to thelog J X ∗ ( T ) /T = O (1) requirement, when the dependence level X ∗ ( T ) increases, J must decrease and T must increase to retain the same amount of information.A very natural question to ask next is: to what extent, the degree of dependence (in terms of β -mixing coefficients) is allowed, while the consistency rate is still the same as the i.i.d. case, i.e. tostudy the relationship among high dimensionality R , moderate sample size T and β -mixing coefficient β mix . ASSUMPTION 3.1 A1 ∀ t , E X ti X tj = 0 A2 ∃ σ , ∀ n, m , m − E ( X ni X nj + . . . + X n + m,i X n + m,j ) (cid:54) σ A3 ∀ t , | X ti X tj | (cid:54) M THEOREM 3.2 (Balance “ J, T, β ” to achieve “good” consistency rate) Assume the β -mixingsequence { X ti X tj } Tt =1 satisfies Assumption 3.1 ∀ i, j with a high probability. Then, uniformly on U τ ( q, c ( J ) , Σ) , for sufficiently large M (cid:48) also depending on σ , M , if s T = M (cid:48) ( σ , M ) (cid:113) log JT , log J/T = O (1) and the β -mixing coefficient β mix = O{ ( J δ (cid:48) √ log J T ) − } , δ (cid:48) > , we have: (cid:107) T s T ( ˆΣ) − Σ (cid:107) = O P (cid:110) c ( J ) (cid:16) log JT (cid:17) (1 − q ) / (cid:111) J − (cid:107) T s T ( ˆΣ) − Σ (cid:107) F = O P (cid:110) c ( J ) (cid:16) log JT (cid:17) − q/ (cid:111) As we can see, when dimensionality J increases, since the β -mixing coefficient is controlled by O{ ( J δ (cid:48) √ log J T ) − } , δ (cid:48) > 0, the dependence level must decrease at the rate of J − (skipping theslow varying logs). When J is very large, this means “nearly” independent, which again confirms theresult from the previous theorem. Choices of threshold play a fundamental role in implementing this estimation procedure. We choosean optimal threshold by a cross-validation procedure as in Bickel and Levina (2008a) and Bickel and12 hoice of Threshold via Cross Validation (cid:0) Classic CV not works (ordered over time)(cid:16)Thresholding(cid:17) Set Ω | (cid:16)Population(cid:17) Set Ω (cid:0) Randomly choose continuous sets as Ω S Ω and split (cid:0) ˆ s minimizing empirical loss:arg min s N (cid:0)1 N X v = kT s ( ˆΣ ) (cid:0) ˆΣ k (cid:0) s minimizing oracle loss:arg min s E kT s ( ˆΣ ) (cid:0) ˆΣ k Figure 3: Illustration of the cross-validation method.Levina (2008b). In particular, we divide the data set Ω of size T into two consecutive segments, Ω andΩ of size T and T respectively, where T is typically about T / 3. Then we compare the regularized(via thresholding) “target” quantity T s ( ˆΣ ,v ), estimated from Ω , with the “target” quantity ˆΣ ,v ,estimated from Ω . Hence ˆΣ ,v can be viewed as a proxy to the population “target” quantity Σ. Thesubindex v in T s ( ˆΣ ,v ) and ˆΣ ,v indicates values from the v th split from a total of N repeats. Theoptimal threshold is then selected as a minimizer (w.r.t. s ) of the empirical loss function over N repeats, i.e. arg min s N − N (cid:88) v =1 (cid:107) T s ( ˆΣ ,v ) − ˆΣ ,v (cid:107) F . (4)Similarly, the oracle threshold is then selected as a minimizer w.r.t. s of the oracle loss function over N repeats, i.e. arg min s E (cid:107) T s ( ˆΣ ,v ) − ˆΣ ,v (cid:107) F . (5)Since the data are observed in time, the order of X t is of importance, and hence a random splitof Ω to Ω and Ω is not appropriate in a time series context. Alternatively, we randomly select aconsecutive segment of size T + T as Ω (cid:83) Ω from the data set Ω first, and then take the first thirdof Ω (cid:83) Ω as Ω ( T ≈ T ) and the remaining two thirds as Ω . Figure 3 provides an illustrationfor the cross-validation procedure. We repeat this N times as before. Our goal now is to show thatthe rates of convergence for the empirical loss function and the oracle loss function are of the sameorder and hence, asymptotically the empirical threshold ˆ s performs as well as the oracle threshold s selection.Our theoretical justification is based on adapting the results on the optimal threshold selectionin Bickel and Levina (2008a) and the optimal band selection in Bickel and Gel (2011) to a case ofoptimal choice of a threshold for high dimensional β -mixing time series. Let W , . . . , W n , . . . , W n + B be J × E W . Let (cid:107) x (cid:107) v = max p =1 ,...,P | v (cid:48) p x | , x ∈ R J , v p ∈ R J , (cid:107) v p (cid:107) = 113nd ¯ W B = B − (cid:80) Bp =1 W n + p . Then the empirical and oracle estimates based on W k are defined asˆ µ e def = arg min p =1 ,...,P | ¯ W B − ˆ µ p | (6)ˆ µ o def = arg min p =1 ,...,P | E W − ˆ µ p | (7)respectively, where ˆ µ p is estimated using W , . . . , W n .We use Theorem 3 in Bickel and Levina (2008a) as Lemma 3.1 here, which states a result onasymptotic relation between the empirical and oracle estimates ˆ µ e and ˆ µ o . LEMMA 3.1 (Theorem 3 in Bickel and Levina (2008a)) If the following assumptions (A4, A5,A6) are satisfiedA4 | ˆ µ o − E W | = Ω P ( r n ) ;A5 E max p =1 ,...,P (cid:107) ( v j , W − µ ) (cid:107) (cid:54) Cρ ( P ) for v p ∈ R J , (cid:107) v p (cid:107) = 1 ;A6 ρ ( P n ) = O ( r n ) ,then we have | ˆ µ e − ¯ W B | = | ˆ µ o − E W | { O (1) } = Ω P ( r n ) . Without loss of generality, assume that the number of repeats N = 1. Notice that the empiricalestimates T s ( ˆΣ ,v ) and ˆΣ ,v play the role of ˆ µ p and ¯ W here respectively. Hence, if we can verify theconditions of Lemma 3.1, we can apply it to justify the choice of a threshold by cross-validation andshow that such regularized covariance matrix of high dimensional time series { X t } , with an empiricalselected threshold, asymptotically coincides with the regularized estimate selected by oracle. To thisend, we also need the auxiliary Lemma 3.2. LEMMA 3.2 Assume that v t is white noise satisfying E v t = 0 , E v t = σ and E | v t | β (cid:54) C < ∞ for β > . Let (cid:107) V (cid:107) F = 1 . For the β -mixing process satisfying the conditions of Theorem 3.2, we have P (cid:16) J − | tr ( V ˆΣ B − V Σ) | (cid:62) s (cid:17) (cid:54) K exp( − K s B ) J − E max p =1 ,...,P (cid:16) | tr { v j ˆΣ B − E ( v j Σ) }| (cid:17) (cid:54) C ( q, c , M ) (cid:112) log P /B with some constants K and K . THEOREM 3.3 (Consistency of Cross Validation) Let ˆ s and s o be the threshold selected fromminimizing the empirical and oracle loss functions (4) and (5) respectively. Then under the conditionsof Theorem 3.2 and O P = Ω P , if B T = T ε ( T, J ) , log P = O { T q/ c ( J ) J − (log J ) − q/ ε ( T, J ) } , basedon Lemma 3.1 and 3.2, then (cid:107) T ˆ s ( ˆΣ) − Σ (cid:107) F = (cid:107) T s o ( ˆΣ) − Σ (cid:107) F { O P (1) } . The Screen - Cluster - Estimate (SCE) Procedure To circumvent the problems in semiparametric modeling for high dimensional data with complex spa-tial structure, in the following three subsections, we state the three-step SCE procedure for construct-ing and estimating semiparametric models from a large number of unordered explanatory variableswith a moderate sample size. J × J (dependent variable y ( x J ) and all explanatory variables x , . . . , x J − ) largecovariance (Spearman’s correlation) matrix using hard thresholding as T ˆ s ( ˆΣ) def = [˜ σ ij ], and onlykeep and consider the (say K ) x ’s with nonzero correlation entries with y for following steps.Without loss of generality, we rename the K x ’s as x , x , . . . , x K .Since all observations are standardized first, the previously considered covariance matrix is actuallythe (Pearson’s) correlation coefficient matrix. However, at the “screening” step, we estimate andthreshold the large Spearman’s rank correlation matrix, where Spearman’s rank correlation between x i and x j is defined as: ρ x i ,x j = Cov { F i ( x i ) , F j ( x j ) } (cid:112) Var { F i ( x i ) } Var { F j ( x j ) } , (8)and F i and F j are the cumulative distribution functions of x i and x j respectively. It can be seenthat the population version of Spearman’s rank correlation is just the classic Person’s correlationbetween F i ( x i ) and F j ( x j ). Here we consider Spearman’s rank correlation instead of the Pearson’scorrelation coefficient is because the latter one is sensitive only to a linear relationship between twovariables, while the former one is more robust than the Pearson’s correlation - that is, more sensitive tononlinear relationships. It could be viewed as a non-parametric measure of correlation and especiallysuitable for the non and semiparametric situations we consider here. It assesses how well an arbitrarymonotonic function could describe the relationship between two variables. Specifically speaking, itmeasures the extent to which, as one variable increases, the other variable tends to increase, withoutrequiring that increase to be represented by a linear relationship. If, as the one variable increases, theother decreases, the rank correlation coefficients will be negative. Similar to the consistency resultstowards the large spatial thresholding covariance (correlation) matrix studied here, Xu and Bickel(2010) established those for the large Spearman’s rank correlation matrix (for the i.i.d. case).At step 1, via hard thresholding, we single out the important predictors by using their Spearman’srank correlations with the response variable y and eliminate all explanatory variables that are “irrel-evant” to y . In light of equation (1), we actually get an estimate for A (cid:83) . . . (cid:83) A S . Thus we could15educe the feature space significantly from J to a lower dimensional and more manageable space.Correlation learning is a specific case of independent learning, which ranks the features according tothe marginal utility of each feature. The computational expediency and stability are prominentlyfeatured in independent learning. This kind of idea is frequently used in applications (Guyon andElisseeff (2003)) and recently has been carefully studied for its theoretical properties by Fan and Lv(2008) using Pearson’s correlation for variable screening of linear models; Huang et al. (2008) , whoproposed the use of marginal bridge estimators to select variables for sparse high dimensional regres-sion models; Fan et al. (2011) using the marginal strength of the marginal nonparametric regressionfor variable screening of additive models; Hall and Miller (2009) using the generalized correlation forvariable selection of linear models.It is also worthy noticing that the threshold is a global measure (implicitly) depending on all J variables. If we remove some x ’s from the original explanatory variables set, the threshold value willbe changed correspondingly. Thus the “relevant” and “irrelevant” regressors will also change. Motivated by the fact that in a block diagonal matrix, the nonzero entries along the diagonal aredenser than those in the off-diagonal region and the assumption w.r.t. equation (1): “ ∀ j (cid:54) = l , x j ∈ A j , x l ∈ A l , x j and x l are (conditionally) independent given other x ’s”, we define the following “averagednon-zero” score S A for a index set A : S A def = (cid:80) i,j ∈A (˜ σ ij (cid:54) = 0) / |A| . Here we do not distinguishbetween the positive and negative values of ˜ σ ij since they could also be reflected by the correspondinglinear coefficients as in equation (1).2 Perform the label permutation procedure for x , . . . , x K to form clusters of (explanatory) vari-ables (or A , . . . , A S ) by utilizing the “averaged non-zero” score S A .2.1 Rank (in decreasing order) and relabel all x , . . . , x k , . . . , x K according to (cid:80) (cid:54) j (cid:54) K (˜ σ kj (cid:54) = 0)to obtain the “new” x , . . . , x K . Always assume x is in the first block (index set) A ;2.2 Forward Include x k (2 (cid:54) k (cid:54) K ) in the first index set ( x k ∈ A ) if S A (cid:83) { x j } (cid:62) S A , andcontinue searching until the K th variable x K . Without loss of generality (otherwise justrelabel them), we assume x , x , . . . , x k − ∈ A .2.3a (For the case of no overlapping indices among A , . . . , A S )Given A formed in the last step, perform Steps 2.1 and 2.2 again for the variables not inthe set A , i.e. { , , . . . , K }\A and construct A .16.3b Backward (replace Step 2.3a, for the case allowing overlapping indices among A , . . . , A S )Given A , perform Step 2.1 again for the variables not in the set A , i.e. { , , . . . , K }\A and start to construct A , for example, x k ∈ A . Let x ∈ A (cid:84) A only if S { x } (cid:83) A (cid:62) S A and continue searching until x k − . Notice that it is impossible for all x , . . . , x k − ∈ A because of the way we construct A in the forward step. Continue to construct A as inthe forward step by selecting variables w.r.t. { , , . . . , K }\A (cid:83) { x k } .2.4 Continue this procedure until all variables x , . . . , x K have been included into some indexset(s) of A , . . . , A S , where S is the number of selected index sets and A (cid:83) . . . (cid:83) A S = { , , . . . , K } . Given these, construct the corresponding semiparametric models by equation(1).At Step 2, if we can permute the variables’ labels to have a block diagonal structure for thepartition of the consistently estimated covariance matrix, such as the one in Figure 2 (right), we canconstruct the corresponding class of semiparametric models as specific cases of equation (1). By thisstep, we are grouping the (explanatory) variables into highly correlated groups, which are, however,weakly correlated with each other. This “independence” property (between the “new” predictors β (cid:62) s x A s , (cid:54) s (cid:54) S ) is actually also required for the groupwise dimension reduction method of Li et al.(2010) for estimation. Except that we use the Spearman’s rank correlation instead of the Pearson’scorrelation for “screening”, a second difference between this work and Fan and Lv (2008) is that weconsider the covariance (correlation) matrix for all x , . . . , x J − , y variables instead of just between y and x , . . . , x J − , which is because we need to further group the relevant explanatory variables forsemiparametric model construction at the second step.A very important feature of the proposed label permutation procedure is that it is based on thethresholding regularized covariance matrix instead of the sample one. A related work which tries todiscover the ordering of the variables through the metric multi-dimensional scaling method could befound in Wagaman and Levina (2009) for the i.i.d. Gaussian case. Their ultimate goal is to improvecovariance matrix estimation rather than order the variables itself. Thus by utilizing the discovered“order” based on the sample covariance matrix, they estimate the large covariance matrix throughbanding regularization to enjoy the benefits brought by ordering. But in the case of large panelsof economic and financial variables as we consider here, our ultimate goal is to cluster the variablesto construct the proper semiparametric models instead of “ordering”. For example, in the multipleindex model, the order of the first index (or first cluster of variables) and the second index (or secondcluster of variables) and the order of variables inside each “cluster” are both unimportant.17his case is also related to the hierarchical clustering, k-means algorithm and correlation clusteringproblem in computer science (Demaine and Immorlica (2003), Bansal et al. (2004)), which aims topartition a weighted graph with positive and negative edge weights so that negative edges are brokenup and positive edges are kept together. However, the correlation clustering algorithm is also based onthe sample correlation(s), and has also been shown to be NP-hard. Thus, as a key difference with otherworks in the literature, instead of using the sample covariance (or correlation) matrix for orderingand clustering as Wagaman and Levina (2009), Demaine and Immorlica (2003) and Bansal et al.(2004) did, we implement thresholding regularization for the sample covariance matrix and screeningfirst and then find the corresponding groups through the stepwise label permutation procedure. It issimpler to be implemented than their’s, since the thresholding regularized covariance matrix only haslimited number of nonzero entries. By doing so, w.r.t. the regression setup, we also simultaneouslyextract the “relevant” explanatory variables for y (Step 1). Thus, we actually combine dimensionreduction and variable clustering , which is especially suitable for modeling high dimensional data viasemiparametric methods.This procedure is computationally simple for a typical J (cid:54) 150 macroeconomic and financial dataset since the thresholding regularization procedure removes J − − K “irrelevant” variables first, andthen rank the remaining K ones before entering this label permutation procedure. Thus we avoid theNP-hard correlation clustering problem based on the sample covariance matrix. x , . . . , x K and y are all positive, however,some of their corresponding parametric coefficients are estimated to be negative (details presented laterin Table 2). This means that the consumer price index negatively depends on them, which is unlikelyto be true from an economic point of view. Since we have disjoint groups of variables here and giventhe meaning of the Spearman’s rank correlation, ideally, the sign of the corresponding parametriccoefficient estimate w.r.t. x k , (cid:54) k (cid:54) K should be the same as the sign of the correspondingSpearman’s rank correlation. This motivates us to add the sign constraint , as a refinement, to thegroupwise dimension reduction method developed in Li et al. (2010) to secure the sign consistency .18et us first consider a simple linear regression model E y = x β + , . . . , + x K β K def = x (cid:62) β withthe constraint β , . . . , β K (cid:62) 0. The linear coefficient β could be estimated as the minimizer of (cid:107) y − x (cid:62) β (cid:107) / − (cid:80) Kk =1 λ k β k with the corresponding nonnegative Lagrange multipliers λ k ’s, 1 (cid:54) k (cid:54) K .If we denote diag[ λ , . . . , λ K ] by Λ, ˆ β = ( x (cid:62) x ) − ( x (cid:62) y + Λ) def = ˆ β OLS + ( x (cid:62) x ) − Λ. Intuitively, incase some entry of ˆ β OLS , say the k th, is negative, which contradicts the initial requirement β k (cid:62) x (cid:62) x ) − λ k plays the role of adding a positive increment to it, s.t. ˆ β k (cid:62) σ kJ to denote the Spearman’s rank correlation estimate between x k and y ( x J ) extracted from T ˆ s ( ˆΣ) and add the sign constraint sign (˜ σ kJ ) β k (cid:62) β s here corresponds to their β g ), a simple calculation shows that we justneed replace their estimation equation (15) for β def = ( β , . . . , β K ) (cid:62) by ( β here corresponds to their ζ ):ˆ β = (cid:110) T (cid:88) i =1 T (cid:88) j =1 R ij R ij (cid:62) K h ( V j − V i ) (cid:111) − (cid:110) n (cid:88) i =1 n (cid:88) j =1 ( Y j − a i ) K h ( V j − V i ) R ij + Λ (cid:48) (cid:111) , = ˆ ζ + (cid:110) T (cid:88) i =1 T (cid:88) j =1 R ij R ij (cid:62) K h ( V j − V i ) (cid:111) − Λ (cid:48) (9)where Λ (cid:48) is a K × K diagonal matrix diag[ λ sign (˜ σ J ) , . . . , λ K sign (˜ σ KJ )]; { λ k , (cid:54) k (cid:54) K } are thehyperparameters; ˆ ζ and other variables are the same as in equation (15) of Li et al. (2010). In general,selection of λ ’s requires minimizing some loss function. Motivated by the discussion above for thesimple linear regression case, when sign ( ˆ ζ k ) is the same as sign (˜ σ kJ ), we simply choose λ k = 0,otherwise choose λ k to be the minimum (positive) value s.t. ˆ β k = 0. By our experience, this workswell and the convergence of the iterative estimation procedure is achieved within 19 iteration steps(10 − as the tolerance) for modeling CPI, which is to be presented in Section 5. Then by the propertyof the convex minimization problem, if a local minimum exists, it is also a global minimum.Overall, similar to Fan and Lv (2008)’s “screen first; fit later” approach for modeling high di-mensional data, ours could be considered as the “screen first; group second; fit third” approach.Alternatively, Bickel et al. (2009) and Meinshausen and B¨uhlmann (2006) consider the “fit first;screen later” approach. In general, a great deal of work is needed to compare “screen first; fit later”type of methods with “fit first; screen later” types of method in terms of consistency and oracleproperties. But when the spatial structure is complex (thus we need deviate from linearity), in termsof semiparametric modeling, as we have discussed in Section 1, the later one might face several mainlimitations, while ours, together with the estimation method modified from Li et al. (2010), as aspecial case of the former one, could circumvent these issues and would be faster when dealing withhigher dimensionality. 19 20 40 60 80 100 12020406080100120 102030405060 20 40 60 80 100 12020406080100120 102030405060 Figure 4: Sample and regularized covariance matrices (after multiplying each entry’s value by 100). We use the dataset of Stock and Watson (2005). This dataset contains 131 monthly macro indicatorscovering a broad range of categories including income, industrial production, capacity, employmentand unemployment, consumer prices, producer prices, wages, housing starts, inventories and orders,stock prices, interest rates for different maturities, exchange rates, money aggregates and so on. Thetime span is from January 1959 to December 2003. We apply logarithms to most of the series exceptthose already expressed in rates. The series are transformed to obtain stationarity by taking (the1st or 2nd order) differences of the raw data series (or the logarithm of the raw series). Then allobservations are standardized. CPI Table 1: Partition of the T s ( ˆΣ ,v ) w.r.t. CP I and the 15 “relevant” variables and the diagonal entriesdenote the indices (in the original data set) of the corresponding variables.20ariable Meaning Coefficients Coefficients P W F CSA Producer Price Index: Finished Goods 0 . 150 0 . P W IM SA PPI: Finished Consumer Goods 0 . 741 0 . P W CM SA PPI: Intermed. Mat. Supplies & Components 0 . 074 0 . P U CPI-U: Transportation 0 . 307 0 . P U C CPI-U: Commodities 0 . 290 0 . P U XF CPI-U: All Items Less Food 0 . 397 0 . P U XHS CPI-U: All Items Less Shelter − . 034 0 P U XM CPI-U: All Items Less Medical Care 0 . 246 0 . GM DC PCE,IMPL PR DEFL:PCE − . 146 0 GM DCN PCE,IMPL PR DEFL:PCE; Nondurables − . 061 0 P U CD CPI-U: Durables − . 639 0 . GM DCD PCE,IMPL PR DEFL:PCE; Durables − . 770 0 . P U S CPI-U: Services − . 994 0 . GM DCS PCE,IMPL PR DEFL:PCE; Services 0 . 111 0 . P U CPI-U: Apparel & Upkeep 1 1Table 2: Detailed meanings of the variables with the corresponding parametric coefficients’ estimatesusing the groupwise dimension reduction method without (3rd column) and with (4th column) signconstraints.Figure 4 contains plots of the sample and thresholding regularized Spearman’s rank correlationmatrices based on the “optimal” threshold 0 . 13 selected by the cross validation procedure discussedin subsection 3.2 with T = 120 , T = 240. The variables of special interest include the consumer priceindex (CPI) as a measure of prices and an economic indicator. The annual percentage change in CPIis used as a measure of inflation. CPI can be used to index (i.e., adjust for the effect of inflation)the real value of wages, salaries and pensions, and also for regulating prices and deflating monetarymagnitudes to show changes in real values. Besides being a deflator of other economic series, it is alsoa means of adjusting dollar values. Thus CPI is one of the most closely watched national economicstatistics. To this end, we use modeling of CPI to illustrate our method. Table 1 displays the partitionof the T s ( ˆΣ ,v ) w.r.t. the variables relevant to CPI. The detailed meanings of these variables (andcorresponding three digits indices in the data set provided by Stock and Watson (2005)) are givenin Table 2. By the forward (and backward) procedure discussed in Section 4, we find the followingindex sets for constructing semiparametric models for modeling CPI. The one without overlapping is21resented on the LHS, and that allowing overlapping is presented on the RHS. A = 108 − , , , − , A (cid:48) = 108 − , , , − , A = 119 , A (cid:48) = 121 − , A = 120 , A (cid:48) = 121 − , A = 115 A (cid:48) = 121 − , A (cid:48) = 121 , , A (cid:48) = 120 , , A (cid:48) = A , and the main difference between these two methods comes from A − A and A (cid:48) − A (cid:48) , i.e. how to allocate the 119 , , , , − − y variable: CPI-U: All Items (82-84=100, SA)except one item (food, shelter or medical care) or the implicit price deflator (of personal consumptionexpenditures).Due to the identification and estimation problems we discussed before, from now on, we mainlyconcentrate on the disjoint index sets case and suggest the following semiparametric model for mod-eling CPI: E ( CP I ) = g (cid:16) β P W F CSA + β P W IM SA + β P W CM SA + β P U + β P U C + β P U XF + β P U XHS + β P U XM + β GM DC + β GM DCN (cid:17) + g (cid:16) β P U CD + β GM DCD (cid:17) + g (cid:16) β P U S + β GM DCS (cid:17) + g (cid:16) P U (cid:17) , (10) where g , . . . , g are the unknown link functions to be estimated nonparametrically and β , . . . , β are unknown parameters which belong to the parameter space. The variables P U CD and GM DCD denote the consumer price index and implicit price deflator (of personal consumption expenditures)for durable goods respectively. Thus A could be interpreted as the index set for durable goods.Very similarly, A and A could be interpreted as the index sets for service, and apparel and upkeeprespectively which are also important factors affecting consumer price index. All common factorsstrongly associated with CPI are included in A . Compared with the linear, additive or single indexmodels, the model (10) actually combines flexibility in statistical modeling and interpretability froman economic point of view, while being kept close to the data’s complex spatial structure.We further employ the groupwise dimension reduction method in Li et al. (2010) to estimate(10) and present the parametric coefficients’ estimate in Table 1 (3rd column). Contradicting to22he background knowledge of economics and the positive Spearman’s rank correlations between CPIand P U XHS , GM DC , GM DCN , P U CD , GM DCD , P U S shown in Table 2, theircorresponding parametric coefficients are estimated to be negative . This means that the consumerprice index negatively depends on them, which is unlikely to be true. Finally we apply the modifiedprocedure with sign constraints to estimate (10) again and present the corresponding parametriccoefficients’ estimates in the last column of Table 2 with the explained variation 85 . β , β and β are estimated positively, β , β and β are estimated to be 0, which means P U XHS , GM DC and GM DCN could be eliminated from the model. To compare with thelinear models and see the advantages of semiparametrics, we also consider the linear model using allother 130 variables (except CPI itself). The explained variation w.r.t. the LARS estimate (least angleregression, developed by Efron et al. (2004)) is 80 . . 9% and 97 . 6% respectively, whilethe corresponding LARS estimates’ is 99 . 6% and 86 . . 4% for CPI and 82 . 0% for FFRthrough considering the (flexible and proper) semiparametrics. The improvement for EMPL is notsignificant since the LARS estimate has already performed quite well.CPI EMPL FFR R (SCE) 85 . 8% 99 . 9% 97 . R (LARS) 80 . 4% 99 . 6% 86 . In this paper, we consider estimating a large spatial covariance matrix of the generalized m dependentand β -mixing time series (with J variables and T observations) by hard thresholding regularization.We quantify the interplay between the estimators’ consistency rate and the time dependence level,discuss an intuitive resampling scheme for threshold selection, and prove a general cross-validationresult that justifies this approach. Given a consistently estimated large sparse covariance matrix,23y utilizing the natural links among graphical models, semiparametrics and large spatial covariancematrix, we propose a novel forward (and backward) label permutation procedure to form a blockdiagonal structure for it and construct the corresponding low dimensional semiparametric model.Finally we apply this method to study the spatial structure of large panels of economic and financialtime series to find the proper semiparametric structure for estimating the consumer price index (CPI)and present its superiority over the linear models. Choice of Threshold Concerning the choice of threshold in the context of time series analysis, if we are mainly targetingestimation performance of the corresponding semiparametric models instead of minimizing the lossfunctions (6) and (7) related to the covariance matrix estimation, we might directly consider min-imizing the estimation error based on the selected semiparametric model, for example (2), s.t. theprediction performance might be optimized. Other Measures of Dependence for Screening The information given by a Pearson’s correlation coefficient is not enough to define the dependencestructure between random variables. Except the Spearman’s rank correlation we used here, distancecorrelation, Szkely et al. (2007) and Brownian covariance (correlation), Sz´ekely and Rizzo (2009) werealso introduced to address the deficiency of Pearson’s correlation that it can be zero for dependentrandom variables; zero distance correlation and zero Brownian correlation imply independence. Thecorrelation ratio is able to detect almost any functional dependency, and the entropy-based mutualinformation/total correlation is capable of detecting even more general dependencies. We want topoint out that the Step 1 of the SCE procedure could be very easily extended to these measuresabove and the threshold value could be selected by the cross-validation procedure similarly.It is also noteworthy that Fan et al. (2011) considers the independence screening procedure byranking the explanatory variable’s importance according to the descent order of the residual sumof squares of the componentwise nonparametric regressions or the marginal strength of the marginalnonparametric regression. By doing that, they (implicitly) assume that the true semiparametric struc-ture is additive, which is different from our ultimate goal here: construct the proper semiparametricstructure. Theoretical Study of the Screening Step Noticing that F i ( x i ) and F j ( x j ) in the formula (8) follow the uniform distribution on [0 , ρ x i ,x j = 12 E { F i ( x i ) F j ( x j ) } − 3. Similar to the “sure independence screening”property of Fan and Lv (2008) using Pearson’s correlation for variable screening of linear models, tostudy the theoretical property of the screening step here based on the Spearman’s rank correlation,24arallel to the equation (20) “ ω = X (cid:62) y = X (cid:62) Xβ + X (cid:62) ε ” of Fan and Lv (2008), we could define ω = ( ω , . . . , ω J − ) ω j = 12 F j ( x j ) F J ( x J ) − def = 12 F j ( x j ) F y ( y ) − , (cid:54) j (cid:54) J − 1= 12 F j ( x j ) F y (cid:110) S (cid:88) s =1 g s ( β (cid:62) s x A s ) + ε (cid:111) − , (11)where ε is the (conditional) mean-zero error term from approximating y by (cid:80) Ss =1 g s ( β (cid:62) s x A s ) in (1).Similar to the idea of the (group) MAVE method of Xia et al. (2002), Li et al. (2010), we notice that ∂g s ( β (cid:62) s x A s ) /∂x A s = g (cid:48) s ( β (cid:62) s x A s ) β s , provided by g (cid:48) s ( β (cid:62) s x A s ) is well defined. Thus applying the Taylor expansion to (cid:80) Ss =1 g s ( β (cid:62) s x A s ) + ε at x (cid:48) will help linearize it as: a + S (cid:88) s =1 g (cid:48) s ( β (cid:62) s x (cid:48)A s ) β (cid:62) s ( x − x (cid:48) ) A s + O{ S (cid:88) s =1 ( x − x (cid:48) ) (cid:62)A s ( x − x (cid:48) ) A s } + ε = a + S (cid:88) s =1 b s β (cid:62) s ( x − x (cid:48) ) A s + ε. Therefore, we could rewrite (11) as12 F j ( x j ) F y (cid:110) a + S (cid:88) s =1 b s β (cid:62) s ( x − x (cid:48) ) A s + ε (cid:111) − . (12)Studying the property of (12) will be the main focus. However, due to the presence of the cumulativedensity functions F j and F y here, this is expected to be much more complex than the Pearson’scorrelation case. We hope that the other people could further investigate this. Acknowledgement This work was partially motivated during a conversation with Prof LixingZhu in Hong Kong in Feb, 2010. The author is very grateful to Prof Peter Bickel, Prof Lixing Zhu,Prof Lexin Li, Mu Cai and Ying Xu for very interesting discussions and comments on this and relatedtopics. In particular, I would like to thank Prof Peter Bickel for sponsoring my stay at the Universityof California, Berkeley. Proof of Theorem 3.1 The proof of this theorem is based on the ones of Theorem 1 and 2 in Bickeland Levina (2008a) up to a modification of the bound on P { max i,j | ˆ σ ij − σ ij | (cid:62) s } , as remarked by their25ubsection 2.3. By the definition of ˆ σ ij in (3) and the assumption that for all i and j , | X ti X tj | (cid:54) M t holds with a high probability, applying the (extended) Mcdiarmid inequality, see Theorem 2 . (cid:80) Tt =1 | X ti X tj | yields:P { max i,j | ˆ σ ij − σ ij | (cid:62) s } (cid:54) J exp (cid:26) − s T X ∗ ( T ) (cid:80) t M t (cid:27) = exp { (2 − M (cid:48) ) log J } , where s T = M (cid:48) (cid:112) log J X ∗ ( T ) /T with sufficiently large M (cid:48) also depending on C (cid:48) with (cid:80) Tt =1 M t C (cid:48) /T (cid:54) C (cid:48) and log J X ∗ ( T ) /T = O (1). Since equation (10) in Bickel and Levina (2008a) holds, others gothrough verbatim. This completes the proof. (cid:3) Proof of Theorem 3.2 The proof of this theorem is also based on the ones of Theorem 1 and2 in Bickel and Levina (2008a) up to a modification of the bound on P { max i,j | ˆ σ ij − σ ij | (cid:62) s } .Assume the β -mixing sequence { X ti X tj } Tt =1 to satisfy Assumption 3.1 ∀ i, j , applying the Bernsteintype inequality for β -mixing random variables { X ti X tj } Tt =1 , see Theorem 4 of Doukhan (1994)[P.36],yields that, ∀ ε > θ def = ε / 4) and ∀ < q (cid:54) | T (cid:88) t =1 X ti X tj | (cid:62) sT ) (cid:54) (cid:104) − (1 − ε )3(1 + θ ) s T { θ ) σ + qM sT } (cid:105)(cid:124) (cid:123)(cid:122) (cid:125) def = A + 2 (1 + θ ) β mix q (cid:124) (cid:123)(cid:122) (cid:125) def = B . (13)To make J ( A + B ) arbitrarily small, we choose s T = M (cid:48) (cid:113) log JT with sufficiently large M (cid:48) alsodepending on ε, σ , M , log J/T = O (1), q = 3(1 + θ ) σ / ( M sT ), and β mix = O{ ( J δ (cid:48) √ log J T ) − } with δ (cid:48) > 0. Thus A and B are bounded by exp( − M (cid:48) log J ) and J − (2+ δ (cid:48) ) respectively, which can bearbitrarily close to 0. This completes the proof. (cid:3) Proof of Lemma 3.2 SinceP (cid:16) J − | tr ( V ˆΣ B − V Σ) | (cid:62) s (cid:17) (cid:54) P (cid:16) J − | tr ( B − B (cid:88) p =1 V X p X (cid:62) p − V Σ) | (cid:62) s (cid:17) , and tr ( X p X (cid:62) p ) = tr ( X (cid:62) p X p ), tr ( (cid:80) Bp =1 V X p X (cid:62) p − V Σ) (cid:54) J (cid:80) Bp =1 ¯ X p with ¯ X p = J − (cid:80) Jj =1 X pj , applyingthe same inequality as in (13) to (cid:80) Bp =1 ¯ X p leads toP (cid:16) J − | tr ( V ˆΣ B − V Σ) | (cid:62) s (cid:17) (cid:54) K exp( − K s B )with some constants K and K .Consequently we also haveP (cid:16) J − max p =1 ,...,P | tr ( V p ˆΣ B − V p Σ) | (cid:62) s (cid:17) (cid:54) (0 (cid:54) s (cid:54) x ) + K P exp( − K s B ) ( s > x ) . (14)26f we integrate (14), i.e. J − E max p =1 ,...,P (cid:16) | tr { v j ˆΣ B − E ( v j Σ) }| (cid:17) (cid:54) x + K P (cid:90) ∞ x exp( − K s B ) ds, (15)and minimize the RHS of (15) over x as P → ∞ , we find that the minimizer satisfies x = C ( q, c , M ) (cid:112) log P /B { O (1) } . Hence J − E max p =1 ,...,P (cid:16) | tr { v j ˆΣ B − E ( v j Σ) }| (cid:17) (cid:54) C ( q, c , M ) (cid:112) log P /B. (cid:3) Proof of Theorem 3.3 Based on Lemma 3.2, we conclude that ρ ( P ) from the second condition ofLemma 3.1 satisfies ρ ( P ) (cid:54) C ( q, c , M ) J log P /B.