# Data-adaptive Dimension Reduction for US Mortality Forecasting

DData-adaptive Dimension Reduction for US MortalityForecasting

Lingyu He , Fei Huang , and Yanrong Yang *31 Hunan University, China The University of New South Wales, Australia The Australian National University, Australia

February 9, 2021

Abstract

Forecasting accuracy of mortality data is important for the management of pension funds andpricing of life insurance in actuarial science. Age-speciﬁc mortality forecasting in the US poses achallenging problem in high dimensional time series analysis. Prior attempts utilize traditional di-mension reduction techniques to avoid the curse of dimensionality, and then mortality forecastingis achieved through features’ forecasting. However, a method of reducing dimension pertinent toideal forecasting is elusive. To address this, we propose a novel approach to pursue features thatare not only capable of representing original data well but also capturing time-serial dependenceas most as possible. The proposed method is adaptive for the US mortality data and enjoys goodstatistical performance. As a comparison, our method performs better than existing approaches,especially in regard to the Lee-Carter Model as a benchmark in mortality analysis. Based on fore-casting results, we generate more accurate estimates of future life expectancies and prices of lifeannuities, which can have great ﬁnancial impact on life insurers and social securities comparedwith using Lee-Carter Model. Furthermore, various simulations illustrate scenarios under whichour method has advantages, as well as interpretation of the good performance on mortality data.

Keywords:

Dimension reduction; high dimensional time series; life expectancy; mortality forecasting;principal component analysis.

The age-speciﬁc human mortality data consists of observations on either the death numbers or thedeath rates of a population under each age, measured for each historical year. Accurate forecastingof mortality data plays a crucial role in demography and actuarial science. For instance, the lifeexpectancy and present value of life annuity are highly related to the future mortality rates. Accordingto the life table published by Social Security Administration (2019), from 2016 to 2095, the lifeexpectancy, which is the average remaining years of life, for a male aged 66 in the US will rise from * Correspondence to: Dr. Yanrong Yang, Research School of Finance, Actuarial Studies and Statistics, College of Busi-ness and Economics, Australian National University, Canberra, ACT 2601, Australia. Email: [email protected] a r X i v : . [ s t a t . A P ] F e b . . .

94 to $16 . . − + ).Modelling and forecasting mortality data pose a challenge for traditional statistical analysis and mul-tivariate time series analysis, as the dimension 91 is comparable to the sample size (or time length)86. This high dimensional setting incurs curse of dimensionality. Dimension reduction is a remedymethod that extracts representative features or patterns of available high dimensional data. Statisticalanalysis on extracted features and recovery of corresponding inference on original data are commontechniques in high dimensional data analysis. However, optimal features for speciﬁc statistical infer-ence are rarely studied. This paper contributes to seeking linear features to attain optimal forecastingof the US mortality data. Roughly speaking, a linear feature is linear combination of annually deathrates over the total 91 ages, which is a univariate time series that summarizes a 91-dimensional timeseries linearly. Before introducing the formal statistical model, we analyze the US mortality dataqualitatively and interpret the features in pursuit intuitively.Table 1: The Log Central Death Rates of the US Historical data Forecasts1933 1934 1935 . . . 2018 2019 2020 . . .0 − . − . − .

789 . . . . . . ? ? ?1 − . − . − .

720 . . . . . . ? ? ?2 − . − . − .

486 . . . . . . ? ? ?3 − . − . − .

816 . . . . . . ? ? ?4 − . − . − .

031 . . . . . . ? ? ?5 − . − . − .

210 . . . . . . ? ? ?. . . . . . . . . . . . . . . . . . ? ? ?90+ . . . . . . . . . . . . . . . ? ? ?

We consider the logarithms of the death rates, because this transformation makes the positive-valued original data spread over total real-value set, which results in easier statistical modelling and2

Years

Log c en t r a l dea t h r a t e s Ages

Figure 1: The Log Central Death Rates, years1933 − + . −7.5−5.0−2.5 0 25 50 75 Ages

Log c en t r a l dea t h r a t e s Years

Figure 2: The Log Central Death Rates, ages 0 to90 + for years 1933 − . .

041 per $1, which is a remarkable improvement.The rest of the paper is organized as follows. In Section 2, the details of the model are described,including the estimation and forecasting methods. A practical algorithm is provided as well. To assistunderstanding the proposed method, in Appendix A of the supplemetary, we discuss the relationshipand differences of our method compared with static PCA and dynamic PCA methods. The asymptoticproperties of the proposed method are presented in Section 3 and the corresponding proof is in Section7. Simulation studies demonstrated the advantages of our method are presented in Section 4 andAppendix B in the supplementary material. In Section 5, we compare the forecasting performance4f our method with conventional methods on the US age-speciﬁc mortality rates. Finally, We applyour method to conduct a long term forecasting on the US population and compare the estimated lifeexpectancy and price of life annuity with those obtained by Lee-Carter model. The conclusion ispresented in Section 6. It is noteworthy that all the appendices are in a supplementary material.The notations in this paper are summarized here. For an p × n matrix CCC , we denote its transposeas

CCC (cid:62) , the square root of the maximum eigenvalue of

CCCCCC (cid:62) as (cid:107) CCC (cid:107) , and the square root of the smallestnonzero eigenvalue of

CCCCCC (cid:62) as (cid:107) CCC (cid:107) min . For a k × k matrix FFF , λ i ( FFF ) indicates the i -th largest eigenvalueof the matrix FFF . For a non-symmetric matrix

SSS , we use σ j ( SSS ) to denote the singular value of the matrix SSS , which corresponds to the j -th largest eigenvalue of the matrix SSSSSS (cid:62) . III p represents p − dimensionalidentity matrix. All vectors are column vectors. The notation a (cid:16) b means that a = O ( b ) and b = O ( a ) . i . p . −→ denotes convergence in probability. We use P , T → ∞ to denote that P and T go to inﬁnity jointly. Let mmm t = ( m , t , m , t , . . . , m P , t ) (cid:62) be the US age-speciﬁc death rates in year t , where m p , t is thedeath rate for age p in year t with p = , , . . . , P and t = , , . . . , T . The historical mortality data isavailable annually from the year 1933 to the year 2018 for ages from 0 to 90 + . For high dimensionaltime series { mmm t , t = , , . . . , T } , the time-serial length and the dimension are T =

86 and P = mmm t which is denoted by yyy t = ( ln ( m , t ) , ln ( m , t ) , . . . , ln ( m P , t )) (cid:62) . It is worth noting that building a model ismuch easier on yyy t than that on mmm t because mmm t take non-negative values.In this section, we ﬁrst introduce a two-style factor model on the historical mortality data. Sec-ondly we provide a two-step estimation method for the model. Finally, the forecasting procedure isprovided based on the model. As analyzed in last section, death rates for all the 91 ages possess common features that drivecommon time-serial trend and common variation, respectively. This leads us to the following two-style factor model: for any t = , , . . . , T ( T = ) , yyy t = BBBkkk ( ) t + uuu t , (2.1) uuu t = AAAkkk ( ) t + εεε t , (2.2)5here kkk ( ) t = (cid:16) k ( ) t , k ( ) t , . . . , k ( ) r t (cid:17) (cid:62) is an r × r < P , which represents commontemporal trends; BBB = ( bbb , bbb , . . . , bbb r ) is a P × r unknown deterministic coefﬁcients matrix; similarly, kkk ( ) t = (cid:16) k ( ) t , k ( ) t , . . . , k ( ) r t (cid:17) (cid:62) is an r × r < P , which indicates common variationamong all ages; AAA = ( aaa , aaa , . . . , aaa r ) is the corresponding P × r unknown deterministic coefﬁcientsmatrix; and εεε t is an error component.Here we assume r and r are both unknown positive integers. Once P is much larger than ( r + r ) , an effective dimension reduction is achieved because the original time series yyy t is driven by amuch lower dimensional time series (cid:16) kkk ( ) t , kkk ( ) t (cid:17) . We also call model (2.1) and (2.2) the factor modelswith kkk ( ) t and kkk ( ) t being common factors, respectively. Thus BBB and

AAA are the corresponding factorloadings, respectively. Factor model is a popular dimension reduction model in high dimensionalstatistics, which is investigated in huge amounts of literature, including Bai (2002), Lam et al. (2011),and Lam and Yao (2012).It is noted that this two-style factor model involves two kinds of common factors kkk ( ) t and kkk ( ) t ,which represent common temporal trends and common variations among all the p ages, respectively.These two kinds of common factors are necessary in producing good forecasting results.Because all elements in the model are unknown, including factor loadings and common factors,we should impose identiﬁcation conditions to make the model well-deﬁned. First, we assume that therank of factor loadings BBB and

AAA are equal to r and r , respectively. Otherwise, the two parts BBBkkk ( ) t and AAAkkk ( ) t can be represented in terms of factor models with even lower dimension. Moreover, as factorsand factor loadings are all unknown, for any r × r invertible matrix HHH , if we substitute the factormodel part (cid:16)

BBB , kkk ( ) t (cid:17) with (cid:16) BBBHHH , HHH − kkk ( ) t (cid:17) , the term BBBkkk ( ) t is unchanged. It is also true for the term AAAkkk ( ) t . To avoid such matters, we impose the following assumption. Assumption 1.

Orthogonal Condition. BBB (cid:62)

BBB = III r , AAA (cid:62) AAA = III r , where III r and III r are r × r andr × r identity matrices, respectively. Under Assumption 1, the factor loading

BBB and the common factor kkk ( ) t are determined up to anorthogonal matrix HHH , and the same for the pair (cid:16)

AAA , kkk ( ) t (cid:17) . In this way, Assumption 1 provides identiﬁ-cation conditions between common factors and the corresponding factor loadings. It is also a commonidentiﬁcation condition for factor models used in literature including Bai (2002), Lam and Yao (2012).Secondly, we consider the identiﬁcation between the two kinds of common factors. As mentionedearlier, the two kinds of factor parts represent common temporal trends and common variations ofthe data, respectively. Intuitively, we think the ﬁrst common factor part possesses stronger time se-rial dependence than the second factor part. Formally, we use the auto-covariance to distinguishthe two parts, which is reasonable since auto-covariance can describe strength of time-serial depen-6ence. Before introducing Assumption 2, we specify some notations. ΣΣΣ ( ) k ( (cid:96) ) : = cov (cid:16) kkk ( ) t , kkk ( ) t + (cid:96) (cid:17) and ΣΣΣ ( ) k ( (cid:96) ) : = cov (cid:16) kkk ( ) t , kkk ( ) t + (cid:96) (cid:17) are auto-covariance of kkk ( ) t with lag (cid:96) and that of kkk ( ) t with lag (cid:96) , respec-tively. For any matrix CCC , let || CCC || be the square root of the maximum eigenvalue of CCCCCC (cid:62) and || CCC || min be the square root of the smallest nonzero eigenvalue of CCCCCC (cid:62) . Assumption 2.

Identiﬁcation between kkk ( ) t and kkk ( ) t . (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ΣΣΣ ( ) k ( (cid:96) ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:16) P − δ (cid:16) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ΣΣΣ ( ) k ( (cid:96) ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) min , (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ΣΣΣ ( ) k ( (cid:96) ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:16) P − δ (cid:16) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ΣΣΣ ( ) k ( (cid:96) ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) min , where ≤ δ < δ ≤ . Assumption 2 imposes different orders for eigenvalues of the auto-covariance matrices for the twokinds of common factors. The order P − δ for kkk ( ) t is larger than the order P − δ for kkk ( ) t as δ < δ ,which ensures that the time-serial dependence of kkk ( ) t is stronger than that of kkk ( ) t . In view of this,Assumption 2 identiﬁes the two kinds of factor parts via their time-serial dependence. In other words,the ﬁrst factor model part extracts common factors with stronger time-serial dependence, which alsotakes higher priority in the forecasting. This kind of identiﬁcation condition is utilized in Lam andYao (2012).Thirdly, we distinguish the second kind of factor part from the error component of the model. Af-ter extracting the common temporal trends in the ﬁrst part of the model, the aim on better forecastingstimulates us to pursue further necessary features in the data. Compared to the factor kkk ( ) t , the factor kkk ( ) t has weaker time-serial dependence. It has little interest for forecasting improvement. However,it implies large amounts of common variations of the data. Neglecting it will result in unsatisﬁedmodel ﬁtting of the original data. As better model ﬁtting also plays an important role in improvingforecasting, we would like to keep them in the dimension reduction as well. Assumption 3.

Identiﬁcation between kkk ( ) t and εεε t . T ∑ Tt = kkk ( ) t kkk ( ) (cid:62) t i . p . −→ ΣΣΣ ( ) k ( ) > , as P , T → ∞ . Here ΣΣΣ ( ) k ( ) is a deterministic r × r positive deﬁnitematrix. Assumption 3 is a common condition on factor model in the sense that the factors represent mostvariations of the data. It is used in Bai (2002).At last, we impose some conditions on the error component.

Assumption 4.

Error components.1. E ( ε it ) = . { εεε t : t ≥ } is strictly stationary.2. ∑ Pi = ∑ Pj = ∑ Tt = ∑ Ts = (cid:12)(cid:12) E (cid:0) ε it ε js (cid:1)(cid:12)(cid:12) = O ( PT ) and ∑ Pi = ∑ j (cid:54) = i (cid:12)(cid:12) σ ε , i j (cid:12)(cid:12) = O ( P ) , where σ ε , i j : = E (cid:0) ε it ε jt (cid:1) . Condition 2 of Assumption 4 ensures that only weak cross-sectional dependence and time-serialdependence exist in the error component. This condition indicates that no obvious common variationand common temporal trends are involved in the error component.7n summary, Assumptions 1-4 create a well-deﬁned two-style factor model (2.1) and (2.2). Next,we will consider how to estimate the two kinds of common factors for further forecasting.

Based on the identiﬁcation between kkk ( ) t and kkk ( ) t , the factor kkk ( ) t will play a leading role in the auto-covariance matrix ΣΣΣ y ( (cid:96) ) : = cov ( yyy t , yyy t + (cid:96) ) , with (cid:96) being a positive integer. To see this point clearly, wedo some calculations under the case of kkk ( ) t and kkk ( ) t being independent. It follows from (2.1) and (2.2)that ΣΣΣ y ( (cid:96) ) = BBB

ΣΣΣ ( ) k ( (cid:96) ) BBB (cid:62) + AAA

ΣΣΣ ( ) k ( (cid:96) ) AAA (cid:62) + ΣΣΣ ε ( (cid:96) ) , (2.3)where ΣΣΣ ε ( (cid:96) ) = cov ( εεε t , εεε t + (cid:96) ) .With Assumption 2 and Assumption 4, BBB

ΣΣΣ ( ) k ( (cid:96) ) BBB (cid:62) is the leading term of

ΣΣΣ y ( (cid:96) ) in the sense ofspectral norm. As auto-covariance matrices are not symmetric, we consider the matrix LLL ( (cid:96) ) : = ΣΣΣ y ( (cid:96) ) ΣΣΣ y ( (cid:96) ) (cid:62) . (2.4)It is easy to obtain that the columns of BBB are the eigenvectors of the matrix

BBB

ΣΣΣ ( ) k ( (cid:96) ) ΣΣΣ ( ) k ( (cid:96) ) (cid:62) BBB (cid:62) corresponding to its non-zero eigenvalues. In fact, if

CCC = ( bbb , . . . , bbb P − r ) is a P × ( P − r ) matrixfor which ( BBB , CCC ) forms a P × P orthogonal matrix, that is CCC (cid:62)

BBB = and CCC (cid:62)

CCC = III P − r , then we have (cid:16) BBB

ΣΣΣ ( ) k ( (cid:96) ) ΣΣΣ ( ) k ( (cid:96) ) (cid:62) BBB (cid:62) (cid:17)

CCC = . That is, the columns of CCC are eigenvectors of

BBB

ΣΣΣ ( ) k ( (cid:96) ) ΣΣΣ ( ) k ( (cid:96) ) (cid:62) BBB (cid:62) corresponding to zero eigenvalues.Furthermore, the matrix

BBB

ΣΣΣ ( ) k ( (cid:96) ) ΣΣΣ ( ) k ( (cid:96) ) (cid:62) BBB (cid:62) is the leading term of the matrix

LLL ( (cid:96) ) in the sense ofspectral norm. Hence the columns of BBB are close to the eigenvectors of the matrix

LLL ( (cid:96) ) correspondingto non-zero eigenvalues, approximately.In terms of the analysis above, the eigendecomposition of LLL ( (cid:96) ) provides a recovery method ofthe factor loading matrix BBB . Note that we use (cid:96) = BBB is not sensitive to (cid:96) and the correlation is often at its strongest at the small time lag (Lam andYao (2012)). Besides, after analyzing the US mortality data, we also ﬁnd (cid:96) = yyy t − BBBkkk ( ) t = AAAkkk ( ) t + εεε t . (2.5)Under Assumption 3, (2.5) is a classical factor model which can be estimated by the standard (static)PCA. See Fan et al. (2013).In summary, the estimation of the two-style factor model has two dimension reduction steps. Theﬁrst step is to extract features which has good forecasting behaviors by a dynamic PCA procedure.The second step is to extract features, that retain variations as most as possible for each age byperforming static PCA. Next let us discuss in details about the two steps. The ﬁrst step

Firstly, we assume that { yyy t } t = , ,..., T is covariance stationary and consider the following matrix LLL = ΣΣΣ yyy ( ) ΣΣΣ yyy ( ) (cid:62) , where ΣΣΣ yyy ( ) = cov ( yyy t , yyy t + ) . As LLL is a symmetric matrix, it can be decomposed as LLL = QQQ

ΛΛΛ

QQQ (cid:62) .The P × P matrix QQQ consists of the orthogonal eigenvectors of

LLL in the columns and the columnsare arranged such that the corresponding eigenvalues are in descending order. ΛΛΛ is a P × P diagonalmatrix with eigenvalues of LLL as the diagonal elements in descending order. As QQQ is an orthogonalmatrix, we have

QQQ (cid:62)

QQQ = QQQQQQ (cid:62) = III .Let µµµ yyy = E ( yyy t ) , and bbb i be the i th column of QQQ , which is the eigenvector corresponding to the i th largest eigenvalue of LLL . Without loss of generality, we assume µµµ yyy = for convenience. Then bysome simple rearrangement, we have yyy t = r ∑ i = bbb i bbb (cid:62) i yyy t + P ∑ i = r + bbb i bbb (cid:62) i yyy t . (2.6)Let k ( ) it = bbb (cid:62) i yyy t , and uuu t = ∑ Pi = r + bbb i bbb (cid:62) i yyy t . Then we can rewrite equation (2.6) as yyy t = r ∑ i = bbb i k ( ) it + uuu t . (2.7)Then the linear combination k ( ) it , i = , , . . . , r , are the features representing the time-serial trends,which are supposed to have good forecasting behaviors.9 he second step The second step is equivalent to do a static PCA on uuu t in equation (2.7). Let ΣΣΣ uuu ( ) = var ( uuu t ) , thenthe desired matrix for the second step is: LLL = ΣΣΣ uuu ( ) ΣΣΣ uuu ( ) (cid:62) . We conducting eigendecomposition on

LLL and let aaa i be the eigenvector corresponding to the i th largesteigenvalue of LLL . Then similar to that in the ﬁrst step, uuu t can be expressed as: uuu t = r ∑ i = aaa i k ( ) it + εεε t , (2.8)where k ( ) it = aaa (cid:62) i uuu t and εεε t = ∑ Pi = r + aaa i aaa (cid:62) i uuu t . k ( ) it = aaa (cid:62) i uuu t , i = , , . . . , r is the features extracted fromthe second step, which capture most of the common variations.Finally combining equation (2.7) and (2.8), we have: yyy t = r ∑ i = bbb i k ( ) it + r ∑ i = aaa i k ( ) it + εεε t . We can choose r and r such that r + r < P and E ( (cid:107) εεε (cid:62) t εεε t (cid:107) ) is small enough.Replace the matrices with their sample version, we get the sample estimation of the model (cid:101) yyy t = (cid:98) r ∑ i = (cid:98) bbb i (cid:98) k ( ) it + (cid:98) r ∑ i = (cid:98) aaa i (cid:98) k ( ) it , t = , . . . , T (2.9)which is a low-dimensional representation of the original data via two-step dimension reduction. Recall that after the two-step dimension reduction, we get the estimation (2.9). Following Leeand Carter (1992), we can forecast yyy T + h by forecasting the features k ( ) i , T + h and k ( ) i , T + h ﬁrst. In or-der to get the forecasts (cid:98) k ( ) i , T + h and (cid:98) k ( ) i , T + h , we model { (cid:98) k ( ) it : i = , , . . . , r } t = , ,..., T and { (cid:98) k ( ) it : i = , , . . . , r } t = , ,..., T with standard time series models and conduct h -steps ahead forecasting withthem. Then together with (2.9), the h -steps ahead forecasting for yyy T + h is (cid:101) yyy T + h = r ∑ i = (cid:98) bbb i (cid:98) k ( ) i , T + h + r ∑ i = (cid:98) aaa i (cid:98) k ( ) i , T + h , where (cid:98) k ( ) i , T + h and (cid:98) k ( ) i , T + h are predicted values of the features in h years after time T , h = , , . . . .10onsequently, instead of conducting P forecasting models, we only need (cid:98) r + (cid:98) r < P forecastingmodels. In our simulations and application on the US mortality data, we choose ARIMA ( p , d , q ) models to forecast the time series, and we use BIC to choose the parameters p , d , q for each model. The practical procedure for the two-steps dimension reduction and forecasting is summarized inAlgorithm 1.

Algorithm 1: Data-adaptive Dimension Reduction for Mortality ForecastingInput:

Data

YYY = [ yyy , . . . , yyy T ] ∈ R P × T ; Desired rank ≤ P . Output:

Low-dimensional representation of

YYY ; h − steps ahead forecasts of yyy T , h = 1,2, . . . . Dimension Reduction Step 1 : Compute the sample mean yyy = T − ∑ Tt = yyy t ; Compute the sample auto-covariance matrix (cid:98)

ΣΣΣ yyy ( ) = T − ∑ T − t = ( yyy t + − yyy )( yyy t − yyy ) (cid:62) ; Compute sample matrix for the ﬁrst step (cid:98)

LLL = (cid:98) ΣΣΣ yyy ( ) (cid:98) ΣΣΣ yyy ( ) (cid:62) ; Conduct eigendecomposition on (cid:98)

LLL and get (cid:98) bbb , . . . , (cid:98) bbb (cid:98) r , the eigenvectors corresponding to thelargest (cid:98) r eigenvalues of (cid:98) LLL ; Compute the ﬁrst sets of features (cid:98) k ( ) it = (cid:98) bbb (cid:62) i ( yyy t − yyy ) , i = , . . . , (cid:98) r , t = , . . . , T ; Dimension Reduction Step 2 : Compute (cid:98) uuu t = ( yyy t − yyy ) − ∑ (cid:98) r i = (cid:98) bbb i (cid:98) k ( ) it ; Compute sample the covariance matrix of (cid:98) uuu t , (cid:98) ΣΣΣ uuu ( ) = T ∑ Tt = (cid:98) uuu t (cid:98) uuu (cid:62) t ; Compute sample matrix for second step (cid:98)

LLL = (cid:98) ΣΣΣ uuu ( ) (cid:98) ΣΣΣ uuu ( ) (cid:62) ; Conduct eigendecomposition on (cid:98)

LLL and get (cid:98) aaa , . . . , (cid:98) aaa (cid:98) r , the eigenvectors corresponding to thelargest (cid:98) r eigenvalues of (cid:98) LLL ; Compute the second sets of features (cid:98) k ( ) it = (cid:98) aaa (cid:62) i (cid:98) uuu t , i = , . . . , (cid:98) r , t = , . . . , T ; Estimation result : Compute (cid:98) yyy t = yyy + ∑ (cid:98) r i = (cid:98) bbb i (cid:98) k ( ) it + ∑ (cid:98) r i = (cid:98) aaa i (cid:98) k ( ) it , t = , . . . , T , and the estimated low-dimensionalrepresentation of YYY is (cid:98) YYY = [ (cid:98) yyy , . . . , (cid:98) yyy T ] ; Forecasting Step : Fit (cid:98) k ( ) it , i = , . . . , (cid:98) r , t = , . . . , T and (cid:98) k ( ) it , i = , . . . , (cid:98) r , t = , . . . , T with standardARIMA(p,d,q) models respectively; Compute (cid:98) k ( ) i , T + h , (cid:98) k ( ) j , T + h , the h − step ahead forecasts of the features, with the ﬁtted ARIMAmodels; i = , . . . , (cid:98) r ; j = , . . . , (cid:98) r ; Compute the h − steps ahead forecasts after yyy T by (cid:98) yyy T + h = yyy + ∑ (cid:98) r i = (cid:98) bbb i (cid:98) k ( ) i , T + h + ∑ (cid:98) r i = (cid:98) aaa i (cid:98) k ( ) i , T + h .In our simulations and analysis of the US mortality data, we estimate the values of r and r bythe criterion, (cid:98) r = argmin ≤ i ≤ R (cid:98) λ i + (cid:98) λ i , (2.10)where (cid:98) λ i , i = , , . . . , R are the eigenvalues of (cid:98) LLL or (cid:98) LLL in descending order, and max ( r , r ) < R < P .This criterion is justiﬁed in Lam and Yao (2012) and Ahn and Horensten (2013) for auto-covariance11atrices and covariance matrices on high dimensional data, respectively. As mentioned in Lam andYao (2012), in practice, the parameter R is chosen as min ( P , T ) . It is worthy being mentioned thatthe number of nonzero eigenvalues of the sample matrices (cid:98) LLL and (cid:98) LLL is no larger than min ( P , T ) . In this section, we establish the rates of convergence for the two-steps estimators of the factorloadings. Additional to the Assumptions 1-4, we impose the following assumptions for the asymptotictheory.

Assumption 5.

Relation between the error and the factors. εεε t independent of kkk ( ) t and kkk ( ) t . Remark 1.

For simplicity of techniques and without loss of generality, the Assumption 5 assumesindependent relationship between the error component and the two kinds of factors.

Assumption 6.

Relation between kkk ( ) t and kkk ( ) t . Suppose that (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ΣΣΣ ( ) k ( (cid:96) ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:16) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ΣΣΣ ( ) k ( (cid:96) ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) min , (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ΣΣΣ ( ) k ( (cid:96) ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = O (cid:16) P − δ (cid:17) , where ΣΣΣ ( ) k ( (cid:96) ) = cov (cid:16) kkk ( ) t + (cid:96) , kkk ( ) t (cid:17) , ΣΣΣ ( ) k ( (cid:96) ) = cov (cid:16) kkk ( ) t + (cid:96) , kkk ( ) t (cid:17) and δ is deﬁned in As-sumption 2. Remark 2.

The order of the eigenvalues of

ΣΣΣ ( ) k ( (cid:96) ) is not speciﬁed in Assumption 6. The reasonis that the information involved in ΣΣΣ ( ) k participate in the recovery of the factor kkk ( ) t . The order of (cid:12)(cid:12)(cid:12)(cid:12) ΣΣΣ k ( (cid:96) ) (cid:12)(cid:12)(cid:12)(cid:12) is restricted in order to make it not involved in the leading term when recovering kkk ( ) t . Assumption 7.

Dimension Condition. PT → c ∈ ( , ∞ ) . Remark 3.

The setting of the dimension P and the sample size T being comparable is under consider-ation because the number of ages is comparable to the length of time series for the US mortality data.Note that when P and T are on the same order, the estimators for the eigenvalues and the eigenvectorsmay be no longer consistent. See Lam and Yao (2012), Ahn and Horensten (2013). However, the ratiobased estimators for r and r can still work well. Assumption 8. { (cid:16) kkk ( ) t , kkk ( ) t , εεε t (cid:17) : t ≥ } is strictly stationary with ﬁnite fourth moments. Theorem 1.

In addition to Assumptions 1 - 8, we assume thatP − δ T = o ( ) , as P , T → ∞ . (3.1) Then we have the following convergent rates (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:98)

BBB − BBB (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = O p (cid:18) T / (cid:19) , (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:98) AAA − AAA (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = O p (cid:18) T / (cid:19) . (3.2)12 emark 4. Two kinds of factors are both strong factors in the sense of auto-correlation and variance,respectively. It is reasonable to obtain fast rates of convergence for both of them. In view of this,our proposed two-steps estimators have good statistical performance, which is an advantage forforecasting improvement. Based on identiﬁcation condition between factors and factor loadings, BBB isdetermined up to an orthogonal matrix. Due to technical proofs (some techniques in Lemma 3), theestimator (cid:98)

BBB here is the estimator up to an identity matrix.

In this section, we use simulated data to illustrate the advantages of our method. For descriptiveconvenience, we use “SWPCA” to represent our method, “CPCA” to represent the static PCA methodwhich was described in Section A, and “DPCA” to represent the dynamic PCA method described inSection A with (cid:96) = We generate three examples according to the following two-factors model: yyy t = bbbk t + aaaw t + εεε t , where aaa and bbb are two independent P × U ( , ) and εεε t is a P × N ( , . ) . For all the three examples, { k t } t = , ,..., T is generated from AR ( ) model with coefﬁcient0 .

8, while { w t } t = , ,..., T are different for each example.• For Example 1 , elements in { w t } t = , ,..., T are independently generated from standard normaldistribution N ( , ) , which indicates the series of w t are independent;13 For Example 2 , we add time-serial dependence to the feature w t , hence { w t } t = , ,..., T is gener-ated from AR ( ) model with coefﬁcient 0 . Example 3 , we increase the dependence in the series of w t and generate it from AR ( ) model with coefﬁcient 0 . Firstly, we show the variance and serial dependence (lag 1 auto-covariance) of the ﬁrst estimatedfactor of the three methods, respectively. The variance and serial dependence of the ﬁrst estimatedfactor are computed as follows:Time variance (cid:16)(cid:98) k t (cid:17) = T − ∑ Tt = (cid:16)(cid:98) k t − T ∑ Tj = (cid:98) k j (cid:17) , Time dependence (cid:16)(cid:98) k t (cid:17) = T − ∑ T − t = (cid:16)(cid:98) k t − T ∑ Tj = (cid:98) k j (cid:17) (cid:16)(cid:98) k t + − T ∑ Tj = (cid:98) k j (cid:17) , where (cid:98) k t is the estimated ﬁrst feature at time t . Especially, for our method, we compare the estimatedﬁrst feature from the ﬁrst step as it is the feature which intends to improve the forecasting power.Besides, we also report the sum of the aforementioned quantities:Mix ( (cid:98) k t ) = Time variance ( (cid:98) k t ) + Time dependence ( (cid:98) k t ) . Secondly, we investigate the dependence and the variation remained in the error terms as follows:Time variance (cid:0)(cid:98) εεε · t (cid:1) = P ∑ Pp = (cid:18) T − ∑ Tt = (cid:16)(cid:98) ε pt − T ∑ Tj = (cid:98) ε p j (cid:17) (cid:19) , Time dependence (cid:0)(cid:98) εεε · t (cid:1) = T ( T − ) ∑ Tt = ∑ Tt = , t (cid:54) = t (cid:12)(cid:12) cov (cid:0)(cid:98) εεε · t , (cid:98) εεε · t (cid:1)(cid:12)(cid:12) , Cross-sectional variance (cid:0)(cid:98) εεε p · (cid:1) = T ∑ Tt = (cid:18) P − ∑ Pp = (cid:16)(cid:98) ε pt − P ∑ Pj = (cid:98) ε jt (cid:17) (cid:19) , Cross-sectional dependence (cid:0)(cid:98) εεε p · (cid:1) = P ( P − ) ∑ Pp = ∑ Pp = , p (cid:54) = p (cid:12)(cid:12) cov (cid:0)(cid:98) εεε p · , (cid:98) εεε p · (cid:1)(cid:12)(cid:12) , where (cid:98) ε pt is the the error term for age p at time t , (cid:98) εεε · t is error terms for all ages at time t , (cid:98) εεε p · is the14rror terms across all time for age p , and cov (cid:0)(cid:98) εεε · t , (cid:98) εεε · t (cid:1) = P ∑ Pp = (cid:16)(cid:98) ε pt − P ∑ Pj = (cid:98) ε jt (cid:17) (cid:16)(cid:98) ε pt − P ∑ Pj = (cid:98) ε jt (cid:17) , cov (cid:0)(cid:98) εεε p · , (cid:98) εεε p · (cid:1) = T ∑ Tt = (cid:16)(cid:98) ε p t − T ∑ Tj = (cid:98) ε p j (cid:17) (cid:16)(cid:98) ε p t − T ∑ Tj = (cid:98) ε p j (cid:17) . To evaluate the forecasting performance, we show the the 1 step and 5 steps ahead root mean squarederror, which is computed byFRMSE ( h ) = (cid:32) ∑ h − i = (cid:107) (cid:98) yyy T − i − yyy T − i (cid:107) hP (cid:33) / where h = , (cid:98) yyy T − i is obtained by forecasting with { yyy , yyy , . . . , yyy T − h } , and yyy T − i is the true value in the forecasting horizon. We try different sets of ( P , T ) : ( , ) , ( , ) , ( , ) , ( , ) , ( , ) , as wewould like to evaluate the performance under the situations that P and T are comparable. The resultsare shown in Table 2 to Table 5.From Table 2, we can see that the CPCA method provides feature with the largest variance, whilethe ﬁrst step of our method (SWPCA) captures the feature with the largest lag 1 auto-covariance,which summarizes the most of the time serial dependence of the original data. These simulatedresults corroborate the analysis in Section A.On the other hand, we can see that the DPCA method provides a feature with the largest sum ofvariance and lag 1 auto-covariance. Our method utilizes the same information with the DPCA, whilewe have two steps. When we compare the feature from our ﬁrst step with that of the DPCA method,it is not surprising that DPCA one has larger Mix ( (cid:98) k t ) . However, our second step provides featuresthat capture the remaining variance, which is a necessary supplement for the ﬁrst step to ensure thatthe ﬁnal sets of features provide good ﬁtting to the original data.From Table 3 and 4, we can see that our method always provides the error terms with the smallesttime and cross-sectional variance and dependence. It shows that SWPCA can capture most of thetime-serial dependence and variation information of all the ages among the three methods. The bettermodel ﬁtting performance of our method is supported by these results.Finally in Table 5, we show the the 1 step and 5 steps ahead root mean square errors for the15able 2: Variance and Dependence of (cid:98) k t Time variance ( (cid:98) k t ) Time dependence ( (cid:98) k t ) Mix ( (cid:98) k t ) ( P , T ) CPCA DPCA SWPCA CPCA DPCA SWPCA CPCA DPCA SWPCAExample 1 (AR(1) 0.8 + N(0,1)) ( , ) . .

008 48 .

750 29 .

015 29 . . . . . ( , ) . .

330 51 .

577 31 .

843 32 . . . . . ( , ) . .

263 103 .

799 64 .

119 65 . . . . . ( , ) . .

037 107 .

091 67 .

532 68 . . . . . ( , ) . .

619 214 .

682 135 .

760 137 . . . . . ( , ) . .

909 48 .

941 29 .

409 29 . . . . . ( , ) . .

986 51 .

430 31 .

958 32 . . . . . ( , ) . .

466 104 .

384 65 .

591 66 . . . . . ( , ) . .

838 107 .

249 68 .

630 69 . . . . . ( , ) . .

278 216 .

268 139 .

463 141 . . . . . ( , ) . .

498 50 .

033 31 .

392 31 . . . . . ( , ) . .

889 53 .

942 35 .

265 35 . . . . . ( , ) . .

778 106 .

789 69 .

386 69 . . . . . ( , ) . .

024 108 .

342 71 .

475 72 . . . . . ( , ) . .

200 220 .

766 147 .

018 148 . . . . . three examples. Overall, our method (SWPCA) has the smallest FRMSE for all the examples whilethe CPCA method performs the worst on the forecasting. The phenomenon tells us that the featuresextracted via the auto-covariance matrix are better than the ones from the covariance matrix, in theview of the forecasting accuracy. Moreover, the SWPCA has smaller forecasting error than the DPCA,which indicates that, extracting the different types of features sequentiality can beneﬁt the forecastingmore than mixing them together.In summary, our methods (SWPCA) provides a dimension reduction method that gives moreaccurate forecasts for the high dimensional time series data simulated in this section. In the AppendixB, more special simulated cases are presented. In this section, we apply our proposed method on age-speciﬁc mortality data of the US. The datais the mortality data of the US population in the Human Mortality Database (HMD) (18) obtained inDecember 2020. HMD contains original calculations of death rates and life tables for the populationsin 40 countries and areas, as well as the input data used in constructing those tables. The data we16able 3: Variance across Time and Sections of the error terms

Time Variance ( (cid:98) εεε · t ) Cross-sectional Variance ( (cid:98) εεε p · ) ( P , T ) CPCA DPCA SWPCA CPCA DPCA SWPCAExample 1 (AR(1) 0.8 + N(0,1)) ( , ) .

141 0 . . .

145 0 . . ( , ) .

146 0 . . .

148 0 . . ( , ) .

147 0 . . .

151 0 . . ( , ) .

151 0 . . .

154 0 . . ( , ) .

152 0 . . .

156 0 . . Example 2 (AR(1) 0.8 + AR(1) 0.05) ( , ) .

142 0 . . .

145 0 . . ( , ) .

148 0 . . .

150 0 . . ( , ) .

147 0 . . .

151 0 . . ( , ) .

150 0 . . .

153 0 . . ( , ) .

152 0 . . .

155 0 . . Example 3 (AR(1) 0.8 + AR(1) 0.2)) ( , ) .

144 0 . . .

148 0 . . ( , ) .

150 0 . . .

152 0 . . ( , ) .

151 0 . . .

154 0 . . ( , ) .

154 0 . . .

157 0 . . ( , ) .

155 0 . . .

159 0 . . Table 4: Covariance across Time and Sections of error terms

Time dependence ( (cid:98) εεε · t ) Cross-sectional dependence ( (cid:98) εεε p · ) ( P , T ) CPCA DPCA SWPCA CPCA DPCA SWPCAExample 1 (AR(1) 0.8 + N(0,1)) ( , ) .

067 0 . . .

072 0 . . ( , ) .

069 0 . . .

074 0 . . ( , ) .

069 0 . . .

075 0 . . ( , ) .

071 0 . . .

076 0 . . ( , ) .

072 0 . . .

078 0 . . Example 2 (AR(1) 0.8 + AR(1) 0.05) ( , ) .

067 0 . . .

072 0 . . ( , ) .

070 0 . . .

075 0 . . ( , ) .

069 0 . . .

075 0 . . ( , ) .

071 0 . . .

076 0 . . ( , ) .

072 0 . . .

078 0 . . Example 3 (AR(1) 0.8 + AR(1) 0.2)) ( , ) .

069 0 . . .

074 0 . . ( , ) .

071 0 . . .

076 0 . . ( , ) .

072 0 . . .

077 0 . . ( , ) .

074 0 . . .

079 0 . . ( , ) .

074 0 . . .

080 0 . . ( P , T ) SWPCA CPCA DPCA SWPCA CPCA DPCAExample 1 (AR(1) 0.8 + N(0,1)) ( , ) . .

829 0 . . .

053 1 . ( , ) . .

793 0 . . .

004 1 . ( , ) . .

812 0 . . .

054 1 . ( , ) . .

814 0 . . .

035 1 . ( , ) . .

819 0 . . .

996 0 . ( , ) . .

850 0 . . .

049 1 . ( , ) . .

818 0 . . .

049 1 . ( , ) . .

826 0 . . .

028 1 . ( , ) . .

810 0 . . .

998 0 . ( , ) . .

807 0 . . .

993 0 . ( , ) . .

809 0 . . .

045 1 . ( , ) . .

812 0 . . .

037 1 . ( , ) . .

771 0 . . .

039 1 . ( , ) . .

825 0 . . .

018 1 . ( , ) . .

803 0 . . .

017 1 . originally obtained from HMD includes the annual age-sex-speciﬁc information of the number ofexposures to risk, the number of deaths, and the central death rate, for ages from 0 to 110+ (age 100and above) during the period from 1933 to 2018. We focus our analysis on the age-speciﬁc centraldeath rates of the total sex population. As the mortality data for advanced ages are measured sparselywhich is mentioned in Lee and Carter (1992), death rates for the older age groups (from age 91 to110+) are summarized and incorporated into a modiﬁed death rate denoted as age 90 + . In view of this,the annual age-speciﬁc death rates under study consist of a matrix data with 86 yearly observations(1933-2018) for 91 ages (0-90 + ). Following Lee and Carter (1992), we consider the log transformedcentral death rates [ ln ( m p , t )] P × T , where P = , T =

86, for modeling purposes. By doing so, we canguarantee that the estimated and predicted central death rates are non-negative. We show the bettermodel ﬁtting and forecasting performance of our proposed method compared to the static PCA anddynamic PCA in this section. Moreover, we explain by examining the factor loadings of the featuresthat the two-style factor model and its two-steps dimension reduction estimation are necessary on themortality data. At last, we illustrate with two applications that improving the accuracy of the predicteddeath rates is crucial. 18 .1 Stationarity

As the log central death rates are not stationary time series, we modify the ﬁrst estimation step inour method (SWPCA) to deal with the non-stationary issue, which is summarized in Algorithm 2.

Algorithm 2: Modiﬁed First Step of the Dimension ReductionInput:

Data

YYY = [ yyy , . . . , yyy T ] ∈ R P × T ; yyy t = ( ln ( m , t ) , ln ( m , t ) , . . . , ln ( m P , t )) (cid:62) . Output: (cid:98) k ( ) it , which is described in the ﬁrst step of our method (see 2.7). Dimension Reduction Step 1 : Compute ddd t = yyy t − yyy t − and ddd = T − ∑ T − t = ddd t , t = , , . . . , T ; Compute (cid:98)

ΣΣΣ ddd ( ) = T − T − ∑ t = ( ddd t + − ddd )( ddd t − ddd ) (cid:62) and (cid:98) ΣΣΣ ddd ( ) (cid:98) ΣΣΣ ddd ( ) (cid:62) ; Compute (cid:98) bbb i : P × i th largest eigenvalue of (cid:98) ΣΣΣ ddd ( ) (cid:98) ΣΣΣ ddd ( ) (cid:62) ,where i = , , . . . , (cid:98) r . (cid:98) r is chosen by the method introduced in Section 2; Compute yyy = T − ∑ Tt = yyy t and (cid:98) k ( ) it = (cid:98) bbb (cid:62) i ( yyy t − yyy ) .The difference of Algorithm 2 and the ﬁrst step in Algorithm 1 is, instead of obtaining (cid:98) bbb i by theeigenvectors of (cid:98) ΣΣΣ yyy ( ) (cid:98) ΣΣΣ yyy ( ) (cid:62) , we get it from (cid:98) ΣΣΣ ddd ( ) (cid:98) ΣΣΣ ddd ( ) (cid:62) . Because yyy t is not stationary, (cid:98) ΣΣΣ yyy ( ) (cid:98) ΣΣΣ yyy ( ) (cid:62) is not a good estimator for the population lag 1 auto-covariance of yyy t .We now explain the reason for using (cid:98) ΣΣΣ ddd ( ) (cid:98) ΣΣΣ ddd ( ) (cid:62) . From ddd t = yyy t − yyy t − , we have yyy t = yyy t − + ddd t = yyy t − + ddd t − + ddd t = · · · = ∑ ti = − ∞ ddd i . With the coefﬁcients bbb i , ddd i can be expressed as ddd i = ∑ Pj = bbb j ϕ i j where ϕ i j = bbb (cid:62) j ddd i , thus yyy t = t ∑ i = − ∞ ddd i = t ∑ i = − ∞ (cid:32) P ∑ j = bbb j ϕ i j (cid:33) = P ∑ j = (cid:32) bbb j t ∑ i = − ∞ ϕ i j (cid:33) = P ∑ j = bbb j ψ t j , where ψ t j = ∑ ti = − ∞ ϕ i j . Thus when performing the dimension reduction, the coefﬁcients to form alow-dimensional representation of uuu t should be the same as those of yyy t . If ddd t is stationary, (cid:98) ΣΣΣ ddd ( ) (cid:98) ΣΣΣ ddd ( ) (cid:62) is a good estimator for ΣΣΣ ddd ( ) ΣΣΣ ddd ( ) (cid:62) . Then eigenvectors of (cid:98) ΣΣΣ ddd ( ) (cid:98) ΣΣΣ ddd ( ) (cid:62) are better estimators of thefactor loadings than those of (cid:98) ΣΣΣ yyy ( ) (cid:98) ΣΣΣ yyy ( ) (cid:62) . We did stationary tests on the lag 1 differenced series ofeach age separately, and more than 72% of the ages have stationary results under signiﬁcant level 0 . uuu t is stationary, but for this dataset, it is better than the original logcentral death rates.Due to the same reason, we make the same modiﬁcation for the static PCA and the dynamicPCA methods when comparing on the US mortality data. Thus from now on, “CPCA” and “DPCA”refers to the static PCA and dynamic PCA which deal with the non-stationary issue, respectively.19n addition, for comparison purpose, we also apply the static PCA method without considering thenon-stationary issue, which is exactly the same as the method in Lee and Carter (1992) and we call it“Lee-Carter” in the following sections. Let us have a further discussion about the suitability of the SWPCA for the US mortality data.We examine the variance and time serial dependence of the central death rates of the US. Becausewe modiﬁed the ﬁrst step of the SWPCA according to Section 5.1, instead of examining the originaldata, we check the ﬁrst difference of log central death rates for each age. That is, for each age p , we compute the variance and lag 1 autocorrelation (representing the time serial dependence) of d p , , d p , , . . . , d p , T , where d p , t = log ( m p , t ) − log ( m p , t − ) , p = , , . . . , + , and T =

86. The resultsare shown in Figure 3.In Figure 3, the top red plot shows the variances of d p , · for age p = , , . . . , + , and the bottomblue line shows the lag 1 autocorrelation of d p , · . From the plot, we see that the variances of ages from5 to 13 are larger than those of ages from 25 to 40, while the lag 1 autocorrelations of ages from 5 to13 are smaller than those of ages from 25 to 40. This is the same structure with the Example 5 in thesimulation (in Appendix B). In addition, we have seen previously that the death rates of all ages havesimilar patterns, which indicates that information from part of the ages can be borrowed to help withthe forecasting of other ages. Thus the ﬁrst step of the proposed method would like to use informationfrom the ages with powerful forecasting ability, such as ages 25 to 40, to help with the forecastingof other ages with weak correlations, ages 5 to 13 for instance. On the other hand, the parts withpowerful forecasting ability do not contain sufﬁcient variations. For example, most of the variationis contained in younger ages while they do not all have large correlations. Therefore, the secondstep of our method utilizes static PCA to help retain sufﬁcient variation of the original data, which isnecessary for the ﬁnal recovery for forecasting. As a result, SWPCA is particularly suitable for theUS mortality data. In the next section, we examine the model ﬁtting performance of our method andillustrate the necessity of both steps using the estimated factor loadings in the following section. In this section, we check the performance of the SWPCA on ﬁtting the original data. We applySWPCA, Lee-Carter, CPCA, and DPCA on the logarithm of central rates of death and compare theﬁtting performances using the root mean square error (RMSE). Based on the criteria in Section 2, allthe methods choose only one feature. For SWPCA, we have (cid:98) r = (cid:98) r =

1. Table 6 and Table 720 a r i an c eLag au t o c o rr e l a t i on Ages v a l ue Figure 3: Variance and Time serial dependence of agesshow the RMSEs of the four methods for selected ages (5, 25, 50, 65 and 85) and years (1933, 1953,1993, and 2018) respectively, along with the overall RMSE of the whole data. From the tables, theRMSEs of the SWPCA are the lowest among the other three methods, which shows that SWPCAﬁts the data the best. As we described before, the two steps of the SWPCA guarantee that it capturessufﬁcient variations of the original data and results in a good model ﬁtting. In addition, we see that theRMSEs of Lee-Carter are smaller than those of CPCA. It implies that the ﬁtting performance on thelog central death rates is worse for static PCA if we revise the method to deal with non-stationarity.This phenomenon may be caused by special characteristics of the mortality data, which is interestingto explore further. Table 6: RMSE, for some speciﬁc ages

Age SWPCA Lee-Carter CPCA DPCA5 . .

062 0 .

304 0 . . .

126 0 .

191 0 . . .

063 0 .

108 0 . . .

086 0 .

126 0 . . .

067 0 .

078 0 . . .

083 0 .

151 0 . We can also visualize the ﬁtting performances of the four methods via plots. Figure 4 shows theactual and ﬁtted log central rates of death for selected ages (5, 25, 50, 65 and 85) over all historicalyears from 1933 to 2018, while Figure 5 shows the actual and ﬁtted log central rates of death forselected years (1933, 1953, 1993 and 2018) over all ages from 0 to 90 + . The black lines representthe actual log central rates of death; the red, light blue, green and blue dashed lines show the ﬁtted21able 7: RMSE, for some speciﬁc years Year SWPCA Lee-Carter CPCA DPCA1933 . .

153 0 .

186 0 . . .

092 0 .

160 0 . . .

080 0 .

142 0 . . .

140 0 .

275 0 . . .

083 0 .

151 0 . − − − − Years

Log D ea t h R a t e s actualSWPCALee−CarterCPCADPCA Figure 4: Log Death Rates, 1933 − − − − − Ages

Log D ea t h R a t e s actualSWPCALee−CarterCPCADPCA Figure 5: Log Death Rates, ages 0 to 90 + foryears 1933, 1953, 1993, 2018; Actual and Fitted.log central rates of death using SWPCA, Lee-Carter, CPCA, and DPCA respectively. From Figure 4,we see that the SWPCA captures the time-serial patterns well for all selected ages even when there iscurvature, such as the paths of ages 25, 65 and 85. However, the Lee-Carter, CPCA and DPCA failedto recover the time-serial dependence appropriately and hence provide worse ﬁtting results than theSWPCA. From Figure 5, we see that the four methods provide similar ﬁttings for ages 0 to 20. Forages 20 to 40, the mortality patterns changed and SWPCA shows a better ﬁtting performance thanthe other three methods. The four models perform similarly again for ages 40 and above with theSWPCA’s ﬁtting performance slightly better than the other three, especially for the year 2018. Hence,it shows that the SWPCA captures both time-serial (time dimension) dependence and cross-sectional(age dimension) variation well and exhibits advantages over the other three methods especially whenthe mortality trends change. 22 .4 Forecasting performance comparison In this section, we show that our method can forecast the central death rates of the US moreaccurately than the other methods. We compare the forecasting performance of SWPCA, Lee-Carter,CPCA, DPCA, and the univariate ARIMA using rolling window out-of-sample forecasting. Theunivariate ARIMA model is ﬁtted with time series of each age independently and the model structureis selected based on Bayesian information criterion (BIC); we refer to this as “individual” model. Theindividual model is included for comparison, as we would like to show that conducting dimensionreduction before the forecasting is necessary, especially in the long term.We use data of years from 2007 to 2018 as the test set and the historical data of previous years asthe training set for modeling and testing purposes. Table 8 shows the forecasting root mean squareerrors (FRMSEs) of 1 to 25 steps ahead forecasts using the ﬁve methods. For each forecast, we have12 rolling window sub-training sets for the 12 test years and the values presented in the table arethe averages of the 12 rolling window FRMSEs. Figure 6 plots the results shown in Table 8. Wesee that as the length of prediction steps increases, the performance of all methods get worse. Thisis because the longer-term forecasting is always harder and contains more uncertainty. The individ-ual model has the best forecasting accuracy when h ≤

11, while performs worse in the long-termcompared with the SWPCA and the Lee-Carter. This is because the individual model focuses oncapturing the mortality pattern of each age vector, which ignores the dependence among differentages and overlooks the cross-sectional common information. So, in the short term, individual fac-tors dominate the forecasting performance, and the individual model performs best. However, thelong-term mortality forecasting provides important assumptions for various actuarial practices andgovernment policymaking, such as life insurance and annuities pricing and reserving, asset liabil-ity management of pension funds, and the solvency analysis of social securities. In the long term,different ages share similar drivers of the mortality variation, such as technology innovation, healthimprovement, wars, and epidemics. So common factors dominate the forecasting performance in thelong term, and dimension reduction plays a crucial role in recovering the common information fromthe high dimensional mortality data. Comparing the four dimension reduction methods (SWPCA,Lee-Carter, CPCA, and DPCA), we ﬁnd that the SWPCA has the smallest FRMSEs for all h , andhence the best out-of-sample forecasting performance. The empirical analysis shows that SWPCAsuccessfully extracts features with powerful forecasting ability and provides a good representation torecover the mortality forecasting from the features’ forecasting.23able 8: Comparison of the forecasting performance on the US data: Rolling Window FRMSE h SWPCA Lee-Carter CPCA DPCA Individual1 0 .

090 0 .

122 0 .

235 0 . . .

101 0 .

131 0 .

239 0 . . .

113 0 .

141 0 .

245 0 . . .

125 0 .

150 0 .

251 0 . . .

134 0 .

158 0 .

258 0 . . .

142 0 .

164 0 .

263 0 . . .

149 0 .

170 0 .

268 0 . . .

155 0 .

174 0 .

272 0 . . .

156 0 .

176 0 .

275 0 . .

10 0 .

160 0 .

179 0 .

278 0 . .

11 0 .

162 0 .

182 0 .

280 0 . . . .

186 0 .

283 0 .

302 0 . . .

191 0 .

286 0 .

305 0 . . .

197 0 .

290 0 .

309 0 . . .

202 0 .

294 0 .

313 0 . . .

209 0 .

298 0 .

317 0 . . .

215 0 .

302 0 .

321 0 . . .

220 0 .

305 0 .

324 0 . . .

224 0 .

306 0 .

325 0 . . .

227 0 .

305 0 .

325 0 . . .

228 0 .

303 0 .

322 0 . . .

233 0 .

302 0 .

321 0 . . .

243 0 .

306 0 .

326 0 . . .

259 0 .

318 0 .

337 0 . . .

274 0 .

328 0 .

348 0 . . .

194 0 .

284 0 .

301 0 . . .

191 0 .

286 0 .

305 0 . l l l l l l l l l l l l l l l l l l l l l l l l l h steps F R M SE l SWPCALee.CarterCPCADPCAIndividual

Figure 6: Comparison of the forecasting performance on the US data: Rolling Window FRMSE24

Years V a l ue s a ft e r t he f i r s t s t ep Ages

Figure 7: Estimation of u t on the US mortality Data Recall that the two-style factor model intends to capture two kinds of common features for mor-tality data among all ages: common temporal trends and common variation. Now we would like toanalysis the necessity and the behavior of the proposed model on the mortality data under study.Figure 7 provides the estimation of { u p , t : t ≥ } for all p = , . . . , P , which is the residual ofthe ﬁrst step. After extracting the common temporal trends in the ﬁrst step, it is expected that thereis relatively weak common temporal trend existed in the residual u p , t . Compared with Figure 1 thatillustrates the time-trends in original mortality data, Figure 7 indeed demonstrates weak commontime-trends for all ages, in view of different time-tendency for the young ages from that of the oldages.Next, we investigate the extracted features from the two kinds of factor models, respectively.As analyzed earlier, the estimation for the two parts is based on the eigendecomposition of the twomatrices (cid:98) LLL and (cid:98) LLL , respectively. The ﬁrst row of Figure 8 shows all the eigenvalues of the twomatrices. The spikeness is obvious and the ratio-based statistic will estimate r and r as 1 intuitively.Then the bottom row of Figure 8 provides the eigenvectors of (cid:98) LLL and (cid:98) LLL corresponding to their largesteigenvalues respectively. By comparing them, factor loadings from the two parts of factor models arequite different from each other.A natural question may arise: other than conducting the second factor modelling, whether is itenough to keep two factors in the ﬁrst step? Figure 9 shows that the second eigenvector of (cid:98) LLL isdifferent from the ﬁrst eigenvector extracted by the second step, which ensures the necessity of thesecond factor modelling. Roughly speaking, the second principal component (PC) in the ﬁrst steprepresents weaker common temporal trend than the ﬁrst PC, but stronger than the left PCs. However,25 llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll . . . . . eigenvalues of the first step Index v a l ue s lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll eigenvalues of the second step Index v a l ue s lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll . . . . . first eigenvector of the first step , first eigenvalue is 0.0023 ages v a l ue s lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll − . − . − . . . . . . first eigenvector of the second step , first eigenvalue is 9.7494 ages v a l ue s Figure 8: 1 st Principal Component in Two Stepsthe aim of our second step is to pursue features possessing most common variation of the residualafter the ﬁrst step. Although the extracted factor in the second step also has weaker common temporaltrend than that is extracted in the ﬁrst step, it is not the second PC of the ﬁrst step in view of Figure9. Furthermore, in terms of eigenvalues of the matrix (cid:98)

LLL in Figure 10, the second eigenvalue is notseparable with others well except the ﬁrst one. This phenomenon indicates that keeping the secondPC may not increase sufﬁciently large amount of common temporal trends. In this case, the increasedﬂexibility of keeping the 2nd PC will make this method undeserved. lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll − . − . . . . . second eigenvector of the first step , second eigenvalue is 3.21270925427776e−05 ages v a l ue s lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll − . − . − . . . . . . first eigenvector of the second step , first eigenvalue is 9.7494 ages v a l ue s Figure 9: 2 nd PC in Step 1 and 1 st PC in Step 226 llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll . . . . . eigenvalues of the first step Index v a l ue s llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll . + . − . − . − . − . − . − eigenvalues of the first step (exclude the largest one) Index v a l ue s Figure 10: Eigenvalues of Step 1

In this section, we use the forecasts of mortality to perform two applications: predicting thelife expectancies and pricing the life annuities. The life expectancy describes the expected averageremaining number of years prior to death for a person reached a speciﬁc age. Usually it can bereported in two different forms based on the mortality rates (period and cohort). The period lifeexpectancy for a given year of each age is calculated based on the mortality rates for that single year,while the cohort life expectancy is estimated based on the mortality rates for the series of years inwhich the person will actually reach each succeeding age if the individual survives (The Board ofTrustees of the Federal OASDI Trust Funds (2019)). For example, according to Table V.A4 and V.A5in the 2019 report of The Board of Trustees of the Federal OASDI Trust Funds (2019), a male inthe US aged 65 in year 2018 is expected to live another 18 . . x at year T ( e x , T ) as follows, e x , T = w − x − ∑ t = t p x , T , where w is the assumed maximum age, and t p x , T = ∏ t − j = (cid:0) − q x + j , T (cid:1) is the probability that a personaged x at year T will survive to age x + t . For the period life expectancy, q x + j , T = m x + j , T , and for the27ohort life expectancy, q x + j , T = m x + j , T + j , where m x , t is the death rate of a person aged x at year t from the mortality table. In addition, for simplicity, we assume 1 − m + , T represents the probabilitythat a person age 90 will survive to the maximum age w . Further, we calculate the present value of thelife annuity ( PV x , T ) for an individual purchased at age x in year T and beginning to make payments$1 annually after age 66 until death or aged 90 (which one happens ﬁrst) as below: PV x , T = ∑ − xt = t p x , T / ( + i ) t if x ≥ PV , T +( − x ) / ( + i ) − x if x < , where i =

2% is the interest rate, t p x , T = ∏ t − j = (cid:0) − q x + j , T (cid:1) and q x + j , T = m x + j , T + j , which is on acohort basis and the same with the calculation for the cohort life expectancy. We let the life annuitiesend at age 90 for simplicity as the mortality rates for extreme older ages need more detailed analysis,which is beyond the scope of this paper. The age 66 is the retirement age for most individuals in theUS. Hence, for an individual younger than 66, PV x , T is the price for a deferred life annuity. Similarcalculation can be ﬁnd in Warshawsky (1988), McCarthy and Mitchell (2001), and Cunningham et al.(2012).In order to compare the out-of-sample performance of our method (SWPCA) and the Lee-Cartermodel, we deﬁne the data for years 1933 to 1988 as the training set and the data for the last 30 years(1989 to 2018) as the test set. We ﬁrst forecast the mortality rates of the test set with the trainingset using the SWPCA and the Lee-Carter method, respectively. Then we calculate e x , T (cohort andperiod) and PV x , T using the actual mortality rates as well as the forecasting mortality rates from thetwo methods, respectively.With more accurate mortality forecasts, how much can the SWPCA method improve the predic-tion of the life expectancies and the pricing of the life annuities? Table 9 shows the forecast meansquared error (FMSE) and the forecast mean absolute error (FMAE) for the SWPCA method and theLee-Carter model, which are computed asFMSE = N ∑ x ∑ t ( ˆ y x , t − y x , t ) , FMAE = N ∑ x ∑ t | ˆ y x , t − y x , t | , where ˆ y x , t is the estimated value (computed with forecast death rates from the SWPCA or Lee-Carter), y x , t is the true value (computed with actual death rates), N is the number of estimates (it is different forthe period and cohort life expectancies). It can be seen from the table that for all the three applications,the estimations from the SWPCA have smaller FMSEs and FMAEs comparing with those from Lee-28 Years L i f e e x pe c t an cy f o r peop l e aged ( c oho r t ) trueSWPCALee−Carter Figure 11: Comparison of the predicted lifeexpectancies from the SWPCA and Lee-Carterwith the true values (cohort)

Years L i f e e x pe c t an cy f o r peop l e aged ( pe r i od ) trueSWPCALee−Carter Figure 12: Comparison of the predicted lifeexpectancies from the SWPCA and Lee-Carterwith the true values (period)Carter method. Particularly, from the FMAEs of the present values of life annuities, we can seethat, on average, the pricing error is $0 .

154 for Lee-Carter and only $0 .

041 for SWPCA with annualpayment $1. The better performance of SWPCA is lead by the more accurately mortality forecasting.Table 9: FMSE and FMAE of life expectancies (cohort and period) and present values of annuities(annual payment $1 and interest rate 2%)

FMSE FMAEperiod lifeexpectancy cohort lifeexpectancy pv of lifeannuity period lifeexpectancy cohort lifeexpectancy pv of lifeannuityLee-Carter 0 .

768 0 .

111 0 .

040 0 .

790 0 .

251 0 . .

109 0 .

009 0 .

004 0 .

263 0 .

072 0 . Figure 11 and Figure 12 show the cohort and period life expectancies for an individual aged 66at different years. The red line is the value computed from historical death rates, the green one isthe value computed with the forecast from the SWPCA and the blue line is that from the Lee-Cartermethod. From Figure 11, we see that the three lines are close to each other before 1970, which isdue to the less forecasts involved in the calculation for those years. After 1970, when involving moreforecasts, the Lee-Carter method tends to estimate the life expectancies lower while the SWPCA isclose to the true value with slightly higher estimations for some years. From Figure 12, we see thatthe output of SWPCA is always more close to the the true values while both the SWPCA and theLee-Carter tend to underestimate the true values for the second half of the time horizon.Table 10 exhibits the life expectancies (cohort and period) and the present values of annuitieswith annual payment $1 and interest rate 2% for some selected ages and years (for some years and29ges there are no forecast involves, hence we use ∗ to mark them). We can see that for the lifeexpectancies, almost all the values from the SWPCA are closer to the true values than those fromthe Lee-Carter method. The Lee-Carter method tends to price much lower than the empirical truevalues. The absolute pricing errors are around $0 .

20 to $0 .

40 per $1 of the life annuity. On theother hand, SWPCA provides very accurately pricing with a maximum $0 .

02 error per $1 annualpayment. Although the difference looks very small, it is indeed a big risk for life insurers or socialsecurity. To illustrate the ﬁnancial impact on the industry, we can consider the pricing for individualsaged 45 in year 1970. The price from the Lee-Carter method is $0 .

27 lower and from the SWPCAis $0 .

01 lower per $1 compared with the empirical true price. Suppose the annual payment for anindividual is $10000 and the number of people purchased this insurance is 50000. Then accordingto the Lee-Carter method, the insurance company will have a $135 million shortfall (135million = . × × period life expectancy cohort life expectancy pv of life annuity(year, age) true Lee-Carter SWPCA true Lee-Carter SWPCA true Lee-Carter SWPCA(1950, 25) 45 .

81 * * 50 .

12 49 .

60 50 .

09 5 .

72 5 .

53 5 . .

59 * * 40 .

84 40 .

32 40 .

81 6 .

97 6 .

74 6 . .

19 * * 31 .

97 31 .

43 31 .

94 8 .

49 8 .

22 8 . .

59 * * 23 .

85 23 .

27 23 .

81 10 .

36 10 .

02 10 . .

95 15 .

23 15 .

83 16 .

50 15 .

87 16 .

47 12 .

62 12 .

21 12 . .

59 9 .

40 9 .

87 10 .

05 9 .

63 10 .

09 8 .

61 8 .

29 8 . This paper focus on forecasting the US mortality data with a two-steps dimension reductionmethod. Particularly, we analyzed the data structure of the age-speciﬁc central death rates of theUS and proposed a new dimension reduction method especially suitable for forecasting this kind ofdata. We have found that the death rates for all the ages have similar patterns, which indicates com-mon time-serial trend can be extracted to improve the forecasting of the data. In addition, variationsamong the death rates of all the ages is also crucial to provide more accurate ﬁtting and beneﬁt the30orecasting. We make use of those characteristics to proposed the new method and ﬁnd that thismethod can provide better forecasting results comparing with static PCA and dynamic PCA methods.To the best of our knowledge, this is the ﬁrst work to especially consider the forecasting ability ofdimension reduction. The novel dimension reduction method (SWPCA) can be seen as a two-stylefactor model, with estimations from the stepwise combination of static PCA (used in the Lee-Cartermodel) and dynamic PCA. It extracts two kinds of features that represent the common temporal trendand common variations receptively, which are both helpful for improving the forecasting accuracy.We simulated examples with the two-style factor model and we can clearly see that the SWPCAoutperforms the other considered methods.The detailed empirical analysis shows that the method is suitable and necessary for the mortalitydata in the US. Moreover, we ﬁnd that the better forecasting of mortality from our method can improvethe prediction of the corresponding life expectancy and life annuity. Hence the forecasting results ofthe SWPCA can be used to conduct important decisions in Actuarial science, such as providing advicefor social security, pricing life insurances, and making the decision on required future cash reserves.Furthermore, we ﬁnd in the long-term forecasting, recovering the mortality forecasting via features’forecasting is preferred than that via age-individually.

This section contains proof of Theorem 1, as well as some lemmas that are used in these proofs.Before introducing the proofs, we provide some notations. For a k × k matrix FFF , λ i ( FFF ) indicates the i -th largest eigenvalue of the matrix FFF . For a non-symmetric matrix

SSS , we use σ j ( SSS ) to denote thesingular value of the matrix SSS , which corresponds to the j -th largest eigenvalue of the matrix SSSSSS (cid:62) .Let || FFF || be the square root of the maximum eigenvalue of FFF FFF (cid:62) and || FFF || min be the square root ofthe smallest nonzero eigenvalue of the matrix FFF FFF (cid:62) . The notation a (cid:16) b means that a = O ( b ) and b = O ( a ) . Useful Lemmas

We will introduce four lemmas that will be used in the proofs of Theorem 1. Lemma 1, Lemma 2and Lemma 3 are available results on eigenvalues of matrices under various decomposition. Lemma4 provides the orders of eigenvalues of the matrix

LLL and LLL , and the proof follows up the statementof Lemma 4. Lemma 1 (Weyl’s Theorem) . Let { λ i ( SSS ) : i = , . . . , P } be eigenvalues of the matrix SSS in descending rder and { λ i ( JJJ ) : i = , . . . , P } be eigenvalues of the matrix JJJ in descending order. Then | λ i ( SSS ) − λ i ( JJJ ) | ≤ || SSS − JJJ || . (7.1) Lemma 2 (Lemma S.1 of Lam and Yao (2012)) . Let FFF be a k × k symmetric matrix such thatFFF = GGG HHHHHH (cid:62)

DDD (7.2) with GGG : k × k , DDD : k × k and λ k ( GGG ) > λ ( DDD ) . Note that k + k = k. Then for ≤ j ≤ k , ≤ λ j ( DDD ) − λ k + j ( FFF ) ≤ λ (cid:0) HHHHHH (cid:62) (cid:1) λ k ( GGG ) − λ j ( DDD ) . (7.3) Lemma 3 (Lemma 3 of Lam et al. (2011)) . Suppose FFF and FFF + EEE are P × P symmetric matrices andthat QQQ = (

QQQ , QQQ ) , where QQQ has size P × k and QQQ has size P × ( P − k ) , is an orthogonal matrix suchthat span ( QQQ ) is an invariant subspace for the matrix FFF , that is, FFF × span ( QQQ ) ⊂ span ( FFF ) . Partitionthe matrices QQQ (cid:62) FFF QQQ and QQQ T EEEQQQ as follows.QQQ (cid:62)

FFF QQQ = DDD DDD , QQQ (cid:62)

EEEQQQ = EEE EEE (cid:62) EEE EEE . (7.4) If sep ( DDD , DDD ) : = min λ ∈ Λ ( DDD ) , µ ∈ Λ ( DDD ) | λ − µ | > , where Λ ( DDD ) denotes the set of eigenvalues of thematrix DDD and || EEE || ≤ sep ( DDD , DDD ) / , then there exists a matrix PPP : ( P − k ) × k with || PPP || ≤ || EEE || sep ( DDD , DDD ) (7.5) such that the columns of the matrix (cid:98) QQQ = ( QQQ + QQQ PPP ) (cid:0) III + PPP (cid:62)

PPP (cid:1) − / deﬁne an orthogonal basis fora subspace that is invariant for the matrix FFF + EEE.

Lemma 4.

Under Assumptions 1-8, we have λ j ( LLL ) (cid:16) P − δ , j = , . . . , r . (7.6) λ r + j ( LLL ) (cid:16) P − δ , j = , . . . , r ; (7.7) λ r + r + i ( LLL ) = o p (cid:16) P − δ (cid:17) , i = , . . . , P − ( r + r ) . (7.8) λ i ( LLL ) (cid:16) P , i = , . . . , r . (7.9)32 roof of Lemma 4. Recall the two-style factor model yyy t = BBBkkk ( ) t + AAAkkk ( ) t + εεε t . (7.10)From the expression (7.10), the population covariance matrix of yyy t has the following decomposition ΣΣΣ y ( ) = BBBMMM + AAAMMM + ΣΣΣ ε ( ) , (7.11)where MMM = ΣΣΣ ( ) k ( ) BBB (cid:62) + ΣΣΣ ( ) k ( ) AAA (cid:62) , MMM = ΣΣΣ ( ) k ( ) AAA (cid:62) + ΣΣΣ ( ) k ( ) BBB (cid:62) . Based on Lemma 1, we can evaluate the j -th eigenvalue of LLL below, j = , . . . , r , λ j ( LLL ) = σ j ( ΣΣΣ y ( )) ≥ (cid:2) σ j ( BBBMMM ) − σ ( AAAMMM + ΣΣΣ ε ( )) (cid:3) ≥ (cid:2) σ j ( BBBMMM ) − σ ( AAAMMM ) − σ ( ΣΣΣ ε ( )) (cid:3) = (cid:2) σ j ( MMM ) − σ ( MMM ) − σ ( ΣΣΣ ε ( )) (cid:3) (cid:16) σ ( MMM ) ≥ (cid:104) σ j (cid:16) ΣΣΣ ( ) k ( ) BBB (cid:62) (cid:17) − σ (cid:16) ΣΣΣ ( ) k ( ) AAA (cid:62) (cid:17)(cid:105) (cid:16) σ j (cid:16) ΣΣΣ ( ) k ( ) BBB (cid:62) (cid:17) = σ j (cid:16) ΣΣΣ ( ) k ( ) (cid:17) ≥ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ΣΣΣ ( ) k ( ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = P − δ , where the ﬁrst and second inequalities use Lemma 1; the second equality uses the matrices BBB and

AAA being orthonormal assumed in Assumption 1; and the last inequality and equality both utilizeAssumption 2.Hence, the ﬁrst r largest eigenvalues of the matrix LLL have the order of P − δ .Now we consider the order of the left p − r eigenvalues of the matrix LLL . In terms of Weyl’sinequality in Lemma 1, we use the eigenvalues of the matrix (cid:101) LLL = (cid:101) ΣΣΣ y ( ) (cid:101) ΣΣΣ y ( ) to approximate theeigenvalues of LLL , where (cid:101) ΣΣΣ y ( ) = BBBMMM + AAAMMM . In fact, (cid:12)(cid:12)(cid:12) λ r + j ( LLL ) − λ r + j (cid:16)(cid:101) LLL (cid:17)(cid:12)(cid:12)(cid:12) ≤ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) LLL − (cid:101) LLL (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ || BBBMMM + AAAMMM + ΣΣΣ ε ( ) || · || ΣΣΣ ε ( ) || + || BBBMMM + AAAMMM || · || ΣΣΣ ε ( ) || = o (cid:16) P − δ (cid:17) , (7.12)where the last equality uses Assumption 2.Now we evaluate the order of λ r + j (cid:16)(cid:101) LLL (cid:17) . Note that the rank of (cid:101) LLL is no larger than r + r . So,33hen j > r + r , λ r + j (cid:16)(cid:101) LLL (cid:17) =

0. Hence, next we investigate the case of j = , . . . , r .Decompose (cid:101) LLL in the following way. (cid:101) LLL = (cid:18) BBB AAA (cid:19)

MMM MMM (cid:18) MMM (cid:62) MMM (cid:62) (cid:19) BBB (cid:62)

AAA (cid:62) = (cid:18) BBB AAA (cid:19)

MMM MMM (cid:62) MMM MMM (cid:62) MMM MMM (cid:62) MMM MMM (cid:62) BBB (cid:62)

AAA (cid:62) . (7.13)Because BBB (cid:62)

CCC (cid:62) (cid:18)

BBB CCC (cid:19) = III , (7.14)we have λ j ( LLL ) = λ j ( MMM ) , where MMM = MMM MMM (cid:62) MMM MMM (cid:62) MMM MMM (cid:62) MMM MMM (cid:62) . (7.15)It follows from Lemma 3 and Assumption 2 that λ r + j ( MMM ) ≤ λ j (cid:16) MMM MMM (cid:62) (cid:17) (cid:16) P − δ , (7.16)and λ r + j ( MMM ) ≥ σ j ( MMM ) − σ (cid:0) MMM MMM (cid:62) (cid:1) σ r ( MMM ) − σ j ( MMM ) (cid:16) P − δ , (7.17)where the last (cid:16) above uses the fact that σ r ( MMM ) (cid:16) P − δ , σ j ( MMM ) (cid:16) P − δ and σ (cid:0) MMM MMM (cid:62) (cid:1) = O (cid:16) P − ( δ + δ ) (cid:17) .Combining (7.16) and (7.17), we can get λ r + j ( MMM ) (cid:16) P − δ , j = , . . . , r . (7.18)34hen it follows from (7.12), (7.18) and Assumption 2 that λ r + j ( LLL ) (cid:16) P − δ , j = , . . . , r ; (7.19) λ r + r + i ( LLL ) = o p (cid:16) P − δ (cid:17) , i = , . . . , P − ( r + r ) . (7.20)Finally, the order of λ i ( LLL ) can be derived from Proposition 2.1 of Fan et al. (2013) directly. Proof of Theorem 1

Proof of Theorem 1.

Let

EEE ( ) L = (cid:98) LLL − LLL with (cid:98) LLL = (cid:98) ΣΣΣ y ( ) (cid:98) ΣΣΣ y ( ) (cid:62) and LLL = ΣΣΣ y ( ) ΣΣΣ y ( ) (cid:62) . First weevaluate the order of (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) EEE ( ) L (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) . In terms of simple calculations, we have (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) EEE ( ) L (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:98) ΣΣΣ y ( ) − ΣΣΣ y ( ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:98) ΣΣΣ y ( ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) · (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:98) ΣΣΣ y ( ) − ΣΣΣ y ( ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) . (7.21)In terms of (7.6) in Lemma 4, we have (cid:12)(cid:12)(cid:12)(cid:12) ΣΣΣ y ( ) (cid:12)(cid:12)(cid:12)(cid:12) (cid:16) P − δ . From (7.11), we can get (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:98) ΣΣΣ y ( ) − ΣΣΣ y ( ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:98) MMM − MMM (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:98) MMM − MMM (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:98) ΣΣΣ ε ( ) − ΣΣΣ ε ( ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) , (7.22)where (cid:98) MMM = (cid:98) ΣΣΣ ( ) k ( ) BBB (cid:62) + (cid:98) ΣΣΣ ( ) k ( ) AAA (cid:62) , (cid:98) MMM = (cid:98) ΣΣΣ ( ) k ( ) AAA (cid:62) + (cid:98) ΣΣΣ ( ) k ( ) BBB (cid:62) , with (cid:98) ΣΣΣ ( ) k ( ) , (cid:98) ΣΣΣ ( ) k ( ) , (cid:98) ΣΣΣ ( ) k ( ) and (cid:98) ΣΣΣ ( ) k ( ) are the sample covariances corresponding to the popula-tion covariances ΣΣΣ ( ) k ( ) , ΣΣΣ ( ) k ( ) , ΣΣΣ ( ) k ( ) and ΣΣΣ ( ) k ( ) , respectively.Hence, we evaluate (7.22) further (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:98) ΣΣΣ y ( ) − ΣΣΣ y ( ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:98) ΣΣΣ ( ) k ( ) − ΣΣΣ ( ) k ( ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:98) ΣΣΣ ( ) k ( ) − ΣΣΣ ( ) k ( ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:98) ΣΣΣ ( ) k ( ) − ΣΣΣ ( ) k ( ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:98) ΣΣΣ ( ) k ( ) − ΣΣΣ ( ) k ( ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:98) ΣΣΣ ε ( ) − ΣΣΣ ε ( ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = O p (cid:32) P − δ T / (cid:33) + O p (cid:32) P − δ T / (cid:33) + O p (cid:18) PT (cid:19) = O p (cid:32) max (cid:32) PT , P − δ T / (cid:33)(cid:33) , (7.23)35here the last second equality uses (A8) of Lam et al. (2011) which demonstrates (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:98) ΣΣΣ ε ( ) − ΣΣΣ ε ( ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = O p (cid:0) PT (cid:1) .Then it follows from (7.21) and (7.23) that (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) EEE ( ) L (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = O p (cid:32) max (cid:32) P T , P − δ T , P − δ T , P − δ T / (cid:33)(cid:33) (7.24) = O p (cid:32) P − δ T / (cid:33) , (7.25)where the last equality uses the assumption that P δ = o (cid:16) T / (cid:17) .Now we use Lemma 3 to get the order of estimated factor loadings. In Lemma 3, let FFF and

EEE be LLL and (cid:98) LLL − LLL , respectively. Let k in Lemma 3 equal to r . Then we have, from (7.6), sep ( DDD , DDD ) (cid:16) P − δ , (7.26)where the deﬁtion of sep ( · , · ) is provided in Lemma 3. Then EEE ( ) L and sep ( DDD , DDD ) satisﬁes (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) EEE ( ) L (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = o p ( sep ( DDD , DDD )) ≤ sep ( DDD , DDD ) . (7.27)Hence Lemma 3 tells us that, there exists a matrix PPP : ( P − r ) × r such that || PPP || ≤ sep ( DDD , DDD ) · (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:16) EEE ( ) L (cid:17) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) EEE ( ) L (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) sep ( DDD , DDD ) (7.28)and then (cid:98) BBB = (

BBB + BBB c PPP ) (cid:0) III + PPP (cid:62)

PPP (cid:1) − / is an estimator of BBB with

BBB c being QQQ in Lemma 3. In viewof this, the rate of convergence for (cid:98) BBB can be calculated as (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:98)

BBB − BBB (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ( BBB + BBB c PPP ) (cid:0) III + PPP T PPP (cid:1) − / − BBB (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:20) ( BBB + BBBPPP ) − BBB (cid:16)

III + PPP (cid:62)

PPP (cid:17) / (cid:21) (cid:16) III + PPP (cid:62)

PPP (cid:17) − / (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:18) BBB (cid:20)

III − (cid:16) III + PPP (cid:62)

PPP (cid:17) / (cid:21) + BBB c PPP (cid:19) (cid:16)

III + PPP (cid:62)

PPP (cid:17) − / (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) III − (cid:16) III + PPP (cid:62)

PPP (cid:17) / (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + || PPP || ≤ || PPP || , (7.29)where the last second equality uses the fact that BBB and

BBB c are orthonormal; and the last equality uses36he fact that (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) III − (cid:16) III + PPP (cid:62)

PPP (cid:17) / (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = − (cid:16) + λ min (cid:16) PPP (cid:62)

PPP (cid:17)(cid:17) / (7.30) ≤ λ / (cid:16) PPP (cid:62)

PPP (cid:17) . (7.31)Therefore, by (7.29) and (7.28), we obtain (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:98) BBB − BBB (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = O P (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) EEE ( ) L (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) sep ( DDD , DDD ) = O p (cid:18) √ T (cid:19) . (7.32)For the second factor model part, the estimation is to conduct principal component analysis on theresidual of the ﬁrst step, i.e. estimating the factor model (cid:98) uuu t = AAAkkk ( ) t + ηηη t , t = , , . . . , T , (7.33)where (cid:98) uuu t = yyy t − (cid:98) BBB (cid:98) kkk ( ) t , ηηη t is the new error component in the estimation at the second step.In order to derive the rate of convergence for (cid:98) AAA , we also utilize Lemma 3. Now let

FFF and

EEE inLemma 3 are

LLL and EEE ( ) L : = (cid:98) LLL − LLL , respectively. Let k in Lemma 3 equal to r .First, we evaluate (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) EEE ( ) L (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) . Based on (7.33), we have (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) EEE ( ) L (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:98) ΣΣΣ (cid:98) u ( ) − ΣΣΣ u ( ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + || ΣΣΣ u ( ) || · (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:98) ΣΣΣ (cid:98) u ( ) − ΣΣΣ u ( ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) , (7.34)where ΣΣΣ u ( ) is the population covariance matrix of uuu t and (cid:98) ΣΣΣ (cid:98) u ( ) is the sample covariance matrix of (cid:98) uuu t . Based on Assumption 3 and Proposition 2.1 of Fan et al. (2013), we know that || ΣΣΣ u ( ) || (cid:16) p . Forthe term (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:98) ΣΣΣ (cid:98) u ( ) − ΣΣΣ u ( ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) , we evaluate its order as follows. (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:98) ΣΣΣ (cid:98) u ( ) − ΣΣΣ u ( ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:98) ΣΣΣ (cid:98) u ( ) − (cid:98) ΣΣΣ u ( ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:98) ΣΣΣ u ( ) − ΣΣΣ u ( ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ T (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:98) BBB (cid:98)

KKK ( ) − BBBKKK ( ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + T (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) BBBKKK ( ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) · (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:98) BBB (cid:98)

KKK ( ) − BBBKKK ( ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:98) ΣΣΣ u ( ) − ΣΣΣ u ( ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) , (7.35)where (cid:98) KKK ( ) = (cid:18)(cid:98) kkk ( ) , (cid:98) kkk ( ) , . . . , (cid:98) kkk ( ) T (cid:19) and KKK ( ) = (cid:16) kkk ( ) , kkk ( ) , . . . , kkk ( ) T (cid:17) .37rom Assumption 2 and (7.32), it can be derived that1 √ T (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:98) BBB (cid:98)

KKK ( ) − BBBKKK ( ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = O p (cid:32) P / − δ / √ T (cid:33) , √ T (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) BBBKKK ( ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = O p (cid:16) P / − δ / (cid:17) . (7.36)Similar to (7.23), we can also get (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:98) ΣΣΣ u ( ) − ΣΣΣ u ( ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:98) ΣΣΣ ( ) k ( ) − ΣΣΣ ( ) k ( ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:98) ΣΣΣ ε ( ) − ΣΣΣ ε ( ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = O p (cid:18) PT / (cid:19) + O p (cid:18) PT (cid:19) . (7.37)In view of (7.36), (7.37) and (7.35), we can get (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:98) ΣΣΣ (cid:98) u ( ) − ΣΣΣ u ( ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = O P (cid:32) p − δ √ T (cid:33) + O p (cid:18) P √ T (cid:19) + O p (cid:18) PT (cid:19) = O p (cid:18) P √ T (cid:19) . (7.38)The order of (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) EEE ( ) L (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) is obtained from (7.34) and (7.38), i.e. (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) EEE ( ) L (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = O p (cid:18) P √ T (cid:19) . (7.39)Moreover, it follows from Proposition 2.1 of Fan et al. (2013) that sep ( DDD , DDD ) (cid:16) P . (7.40)Here DDD in Lemma 3 is the diagonal matrix corresponding to the orthogonal matrix AAA . Then we canget from Lemma 3 that (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:98)

AAA − AAA (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = O p (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) EEE ( ) L (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) sep ( DDD , DDD ) = O p (cid:18) √ T (cid:19) . (7.41)38 eferences Ahn, S. C. and Horensten, A. R. (2013), ‘Eigenvalue ratio test for the number of factors’,

Economet-rica , 1203–1227.Anderson, T. W. (2003), An Introduction to Multivariate Statistical Analysis , 3 edn, Wiley.Bai, J. (2002), ‘Determine the number of factors in approximate factor models’,

Econometrica (1), 191–221.Booth, H. and Tickle, L. (2008), ‘Mortality modelling and forecasting: A review of methods’, Annalsof Actuarial Science (1-2), 3–43.Brillinger, D. R. (1975), Time Series: Data Analysis and Theory , Holt, Rinehart, and Winston.Chang, J., Guo, B. and Yao, Q. (2018), ‘Principal component analysis for second-order stationaryvector time series’,

The Annals of Statistics (5), 2094–2124.Cunningham, R., Herzog, T. and London, R. (2012), Models for Quantifying Risk , ACTEX Academicseries, Actex Publications.Fan, J., Liao, Y. and Mincheva, M. (2013), ‘Large covariance estimation by thresholding principalorthogonal complements’,

Journal of the Royal Statistical Society: Series B (4), 603–680.Hollmann, F. W., Mulder, T. J. and Kallan, J. E. (1999), Methodology & Assumptions for the Popula-tion Projections of the United States: 1999 to 2010 , US Department of Commerce, Bureau of theCensus, Population Division . . . .H¨ormann, S., Kidzi ´nski, Ł. and Hallin, M. (2015), ‘Dynamic functional principal components’,

J. R.Stat. Soc. B , 319–348.Jolliffe, I. T. (2002), Principal component analysis , Springer Series in Statistics, second edn,Springer-Verlag, New York.Lam, C. and Yao, Q. (2012), ‘Factor modeling for high-dimensional time series: inference for thenumber of factors’,

The Annals of Statistics (2), 694–726.Lam, C., Yao, Q. and Bathia, N. (2011), ‘Estimation of latent factors for high-dimensional timeseries’, Biometrika (4), 901–918.Lee, R. D. and Carter, L. R. (1992), ‘Modeling and forecasting US mortality’, Journal of the AmericanStatistical Association (419), 659–671. 39cCarthy, D. and Mitchell, O. S. (2001), Assessing the impact of mortality assumptions on annuityvaluation: Cross-country evidence , Pension Research Council, the Wharton School, University ofPennsylvania.Social Security Administration (2019), ‘Period life tables’. [ ; accessed 5-June-2019].The Board of Trustees of the Federal OASDI Trust Funds (2019),

The 2019 annual report of the Boardof Trustees of the Federal Old-Age and Survivors Insurance and Federal Disability Insurance TrustFunds , U.S. government publishing ofﬁce.University of California, Berkeley (USA) and Max Planck Institute for Demographic Research (Ger-many) (n.d.), ‘

Human Mortality Database ’. [Accessed: 2018.07.10].Warshawsky, M. (1988), ‘Private annuity markets in the united states: 1919-1984’, Journal of Riskand Insurance pp. 518–528. 40 upplementary material to “Data-adaptive Dimension Reduction forUS Mortality Forecasting”

Lingyu He , Fei Huang Yanrong Yang Hunan University, China The University of New South Wales, Australia The Australian National University, AustraliaIn this material, there are two parts. Part A discusses the relation between the proposed methodand other available methods in literature. Part B provide additional simulations to demonstrate theadvantage of our method in ﬁnite sample performance.

A Relationship with existing methods

The methods which forecast mortality via features’ forecasting date back to the Lee-Carter model(14). For general comparison, consider the following one factor model y x , t = ln ( m x , t ) = a x + b x k t + u x , t , where a x is a constant for each x , k t is an unobserved time series (the feature summarizing the originalhigh-dimensional time series), b x is the loading of the feature k t to each age x , and u x , t is the errorterm. One can recover the h − steps ahead forecasting of y x , t via the forecasting of k t . Therefore, howwe extract the feature k t by dimension reduction is the main difference. A.1 Static PCA method

The most popular method for mortality modelling, the Lee-Carter model, utilizes static PCA toestimate k t . The extracted feature k t , which performs the most important role in the forecasting, is theﬁrst principal component that explains the most of the variance of the original data. Mathematically,the static PCA solves the following objectivemax bbb var ( bbb (cid:62) yyy t ) to get (cid:101) k t = (cid:101) bbb (cid:62) ( yyy t − ¯ yyy ) , t = , . . . , T . This solution has the smallest average squared reconstructionerror E ( (cid:107) uuu (cid:62) t uuu (cid:107) ) (Brillinger, 1975). It, however, does not seem to have enough forecasting ability.1or example, suppose yyy t = ( y t , y t , . . . , y pt ) (cid:62) , t = , . . . , T , where y t and y t have huge variancesand very weak time serial dependence, while the rest have small variances but strong time serialdependence. Performing static PCA on yyy t will get a feature which puts most of the loading on y t and y t . The forecasting based on this feature will heavily depend on the pattern of y t and y t andnot make use of the strong serial dependence information contained in the rest, which may lead to amisleading forecasting.On the other hand, the ﬁrst step of our proposed method extracts k t from the auto-covariancematrix, which contains most of the time-serial dependence information. Thus it is expected to havestronger forecasting ability comparing to the feature extracted by the static PCA. A.2 Dynamic PCA method

There are several dynamic PCA methods, which usually involve the auto-covariance matrices andalso utilize the time-serial dependence information, including Brillinger (1975), Lam et al. (2011),H¨ormann et al. (2015), and Chang et al. (2018). Those methods can be used to extract feature k t aswell. For comparison purpose with our method, we consider one of them described in the following.Deﬁne ΣΣΣ yyy ( (cid:96) ) = cov ( yyy t , yyy t + (cid:96) ) , (cid:96) = , , , . . . , and consider the nonnegative deﬁnite matrix LLL = (cid:96) ∑ (cid:96) = ΣΣΣ yyy ( (cid:96) ) ΣΣΣ yyy ( (cid:96) ) (cid:62) . (A.1)With matrix (A.1), the coefﬁcients of the feature k t can be estimated by the eigenvector of the samplematrix (cid:98) LLL corresponding to its largest eigenvalue. This is similar to Brillinger (1975) and H¨ormannet al. (2015) by assigning different weights on those covariances, while in Lam et al. (2011), theyexclude

ΣΣΣ yyy ( ) . If (cid:96) =

0, it is the same with the static PCA. If (cid:96) = LLL can be seen as the mixing ofthe two steps in our method. If (cid:96) > LLL aggregates more lagged covariances than our method.There are similarities and advantages of our method compared to the aforementioned dynamicPCA methods. On one hand, the ﬁrst step of our method is motivated by the dynamic PCA that auto-covariance matrices are used to obtain forecasting ability for the features. While from the empiricaland simulating studies, we ﬁnd the lag 1 auto-covariance is enough for the mortality data and datawith similar structure. To make the method simple and easy to apply, our method only involves themost useful auto-covariance. If the data structure changes, one can include more lagged covarianceswhich adapts to the data. On the other hand, we intend to maximize the forecasting ability instead ofbalancing several characteristics of the features. The dynamic PCA provides only one set of featureswhich mix the information of the temporal trend and the variation, while our proposed method extracts2 y z DPCA SWPCA 1SWPCA 2

Figure 13: The directions of the features; A comparison between the dynamic PCA and our method.two types of features separately via a two-steps procedure. The ﬁrst type of features represent thetemporal trend, which beneﬁts the forecasting, and the second type ones capture variations which aregood for model ﬁtting as well as the forecasting. The features are linear combinations of the originaldata and the coefﬁcients of them represent the directions which the original data are projected to.Therefore, we can visualize the features by the directions. Figure 13 shows a simple example of theestimated directions for our method and the dynamic PCA. Data is generated the same as in

Example P = T =

20 . The red arrow is the direction of the feature forthe dynamic PCA (DPCA), and the blue ones are those for the ﬁrst step and the second step of ourmethod (SWPCA), respectively. It is clear that they project the data into different directions and thered one can be seen as a direction which mixes the other two. In addition, if we take a detailed lookat the matrix

LLL , we ﬁnd that it is actually hard to tell and explain what information is contained in thismatrix, as it is a mix of several auto-covariances and the variance matrix. Our method, on the otherhand, is stepwise, thus it has a clear goal for each step.

B Additional simulations

In this part, we provide more simulation studies as a supplementary to Section 4. Specially,

Example

Example

Example (cid:96) =

1. We also consider comparing with the method given in Lam et al. (2011),and we use “DPCA ( (cid:96) ) ” to represent it, with (cid:96) = , ,

10. “DPCA ( ) ” is the same with the ﬁrst stepin our method, while “DPCA ( ) ” and “DPCA ( ) ” aggregate more auto-covariances but exclude thevariance matrix. Data generating • Example 4 { yyy t } t = , ,..., T : P × yyy t = aaa + bbbk t + εεε t , where aaa is a P × bbb is a P × { k t } t = , ,..., T is generated from AR ( ) model with coefﬁcient 0 . εεε t is a P × Example 5

Construct yyy t = ( yyy ( ) (cid:62) t , yyy ( ) (cid:62) t ) (cid:62) . { yyy ( ) t } t = , ,..., T : ( dP ) × yyy ( ) t = bbbk t + εεε ( ) t . bbb is a ( dP ) × U ( , ) , { k t } t = , ,..., T is generated from AR ( ) model with coefﬁcient 0 .

8, and εεε ( ) t is a ( dP ) × N ( , . ) . { yyy ( ) t } t = , ,..., T : (( − d ) P ) × yyy ( ) t = aaaw t + εεε ( ) t , where aaa is a (( − d ) P ) × U ( , ) , { w t } t = , ,..., T is gen-4rated from N ( , . ) , and εεε ( ) t is a (( − d ) P ) × N ( , . ) .We call { yyy ( ) t } t = , ,..., T the dependent part as it has autocorrelations within observations (timedimension), and { yyy ( ) t } t = , ,..., T the independent part as it has independent generated observa-tions, while the variance of it is larger than that of { yyy ( ) t } t = , ,..., T . The parameter d is theproportion of the dependent part among the whole dataset, with possible values among ( , ) .As a result of this design, the whole data consists of two part. The dependent part has strongserial dependence with relatively small variance and the independent part has very weak de-pendence with relatively large variance. This is a special case for the two-style factor model inexample 1 to 3 in which we can set 0s in the coefﬁcient vector aaa and bbb to get the Example 5 .• Example 6

The data structure is the same with

Example d = .

4, except that { k t } t = , ,..., T is gener-ated from AR ( ) model with coefﬁcient 0 . εεε ( ) t and εεε ( ) t are ( dP ) × N ( , . ) , and { w t } t = , ,..., T is generated from N ( , ) . The maindifference is we enlarge the variations in Example

5, comparing with

Example

6. The purposeis to show that keeping sufﬁcient information of the variation is necessary.Because in

Example yyy t consists of the dependent part and independent part, we alsoreport the FRMSE for the two parts separately, in addition to the overall FRMSE deﬁned in Section4. Rewrite (cid:98) yyy T − i as ( (cid:98) yyy ( ) (cid:62) T − i , (cid:98) yyy ( ) (cid:62) T − i ) (cid:62) and yyy T − i as ( yyy ( ) (cid:62) T − i , yyy ( ) (cid:62) T − i ) (cid:62) , then:Dependent FRMSE ( h ) = (cid:32) ∑ h − i = (cid:107) (cid:98) yyy ( ) T − i − yyy ( ) T − i (cid:107) hPd (cid:33) / , Independent FRMSE ( h ) = (cid:32) ∑ h − i = (cid:107) (cid:98) yyy ( ) T − i − yyy ( ) T − i (cid:107) hP ( − d ) (cid:33) / , where d is the proportion of yyy ( ) t among yyy t . Results

We try different sets of ( P , T ) : ( , ) , ( , ) , ( , ) , ( , ) , ( , ) , as wewould like to evaluate the performance under situations P and T are comparable. The results areshowed in Table 11 to Table 16.From Table 11, we can see that the CPCA provides feature (cid:98) k t with the largest variance, while theﬁrst step of our method (SWPCA) captures (cid:98) k t with the largest lag 1 auto-covariance. In addition, in5able 11: Variance and Dependence of (cid:98) k t Time variance ( (cid:98) k t ) Time dependence ( (cid:98) k t ) Mix ( (cid:98) k t ) ( P , T ) CPCA DPCA SWPCA CPCA DPCA SWPCA CPCA DPCA SWPCAExample 4 ( , ) . .

723 6 .

073 1 .

201 1 . . . . . ( , ) . .

917 4 .

595 1 .

061 1 . . . . . ( , ) . .

811 6 .

094 0 .

957 1 . . . . . ( , ) . .

911 4 .

567 0 .

944 1 . . . . . ( , ) . .

814 6 .

088 0 .

757 1 . . . . . d = . ( , ) . .

935 21 .

005 11 .

916 13 . . . . . ( , ) . .

494 21 .

389 12 .

132 14 . . .

990 37 .

805 3 . ( , ) . .

217 43 .

802 25 .

852 30 . . .

780 77 . . ( , ) . .

898 45 .

228 27 .

808 33 . . .

539 80 . . ( , ) . .

398 89 .

976 55 .

774 66 . . .

692 159 . . Example

5, our method has slightly larger Mix ( (cid:98) k t ) than DPCA, which shows that under certain datastructure the dimension reduction of our ﬁrst step is enough to represent sufﬁcient information.From Table 12 and 13, we can see that our method always provides the error terms with thesmallest time and cross-sectional variance and dependence.Table 12: Variance across Time and Ages of error terms Time Variance ( (cid:98) εεε · t ) Cross-sectional Variance ( (cid:98) εεε p · ) ( P , T ) CPCA DPCA SWPCA CPCA DPCA SWPCAExample 4 ( , ) .

778 0 . . .

794 0 . . ( , ) .

775 0 . . .

782 0 . . ( , ) .

796 0 . . .

804 0 . . ( , ) .

778 0 . . .

782 0 . . ( , ) .

803 0 . . .

807 0 . . Example 5 ( d = . ( , ) .

071 0 . . .

094 0 . . ( , ) .

056 0 . . .

067 0 . . ( , ) .

055 0 . . .

064 0 . . ( , ) .

046 0 . . .

050 0 . . ( , ) .

046 0 . . .

051 0 . . Table 14 and 15, show the 1 step and 5 steps ahead forecasting root mean square errors for

Exam-ple d = . , . , .

3, respectively. Overall, SWPCA performs better than the other two, as ithas the smallest overall FRMSE for all the cases. Checking the Dependent FRMSE and IndependentFRMSE separately, we can ﬁnd that SWPCA performs even better for the dependent part. As d de-6reasing, the Dependent FRMSE of SWPCA increases the least, although all the Dependent FRMSEsincrease. And for the independent part, SWPCA is better when d = . , .

4, but performs almost thesame with others when d = .

3. This result shows that SWPCA extracts features with more fore-casting power from the dependent part and uses it to help improve the forecasting of the independentpart. However, when the proportion of the dependent part is small, such as d = .

3, forecasting theindependent part cannot be blessed that much from the dependent part. Therefore, there will be verylittle difference among the three methods when comparing the performance for the independent partwhen d = .

3. But SWPCA will always provide better forecasting for the dependent part, which leadsto better overall forecasting for all cases.Table 13: Covariance across Time and Ages of error terms

Time dependence ( (cid:98) εεε · t ) Cross-sectional dependence ( (cid:98) εεε p · ) ( P , T ) CPCA DPCA SWPCA CPCA DPCA SWPCAExample 4 ( , ) .

107 0 . . .

108 0 . . ( , ) .

106 0 . . .

086 0 . . ( , ) .

077 0 . . .

077 0 . . ( , ) .

075 0 . . .

061 0 . . ( , ) .

054 0 . . .

055 0 . . Example 5 ( d = . ( , ) .

025 0 . . .

032 0 . . ( , ) .

016 0 . . .

015 0 . . ( , ) .

013 0 . . .

014 0 . . ( , ) .

007 0 . . .

007 0 . . ( , ) .

007 0 . . .

007 0 . . Table 16 shows the 1 step and 5 steps ahead forecasting root mean square error of SWPCA com-pared to DPCA ( (cid:96) ) , (cid:96) = , ,

10, for

Example

6. The reason for comparing DPCA ( (cid:96) ) separately is thatit contains different information. The DPCA we compared with in Example yyy t ) with SWPCA, while DPCA ( (cid:96) ) aggregates moredependent information (lag 1 to lag (cid:96) auto-covariances) but discards the variance var ( yyy t ) . In addition,DPCA ( ) is equivalent to only conduct the ﬁrst step of SWPCA.In Table 16, we can see that SWPCA and DPCA ( ) perform better than DPCA ( ) and DPCA ( ) for most ( P , T ) cases. This shows that involving more lagged auto-covariances does not always pro-vide more useful information for forecasting under certain situations. The performance of DPCA ( ) is worse than SWPCA with ( P , T ) = ( , ) , ( , ) , ( , ) for 1 step ahead forecasting and ( P , T ) = ( , ) for 5 step ahead forecasting, and similar for other cases. These results show thatwhen variation is large, it is necessary to conduct the second step in the SWPCA in order to achieve7able 14: 1 Step Ahead Forecasting RMSE Dependent FRMSE ( ) Independent FRMSE ( ) Overall FRMSE ( )( P , T ) CPCA DPCA SWPCA CPCA DPCA SWPCA CPCA DPCA SWPCAExample 5 ( d = . ( , ) .

623 0 . . .

771 0 . . .

757 0 . . ( , ) .

573 0 . . .

746 0 . . .

715 0 . . ( , ) .

568 0 . . .

760 0 . . .

719 0 . . ( , ) .

572 0 . . .

782 0 . . .

732 0 . . ( , ) .

537 0 . . .

759 0 . . .

705 0 . . Example 5 ( d = . ( , ) .

654 0 . .

584 0 . .

743 0 .

741 0 .

759 0 . . ( , ) .

603 0 . . .

727 0 . . .

732 0 . . ( , ) .

631 0 . . .

767 0 . . .

766 0 . . ( , ) .

575 0 . . .

778 0 . . .

752 0 . . ( , ) .

570 0 . . .

739 0 . . .

719 0 . . Example 5 ( d = . ( , ) .

708 0 . . .

761 0 . . .

798 0 . . ( , ) .

687 0 . .

572 0 . . . .

782 0 . . ( , ) .

695 0 . .

556 0 . .

774 0 .

775 0 .

805 0 . . ( , ) .

682 0 . .

533 0 . .

747 0 .

743 0 .

772 0 . . ( , ) .

680 0 . .

549 0 . .

720 0 .

720 0 .

759 0 . . more accurate forecasting. 8able 15: 5 Steps Ahead Forecasting RMSE Dependent FRMSE ( ) Independent FRMSE ( ) Overall FRMSE ( )( P , T ) CPCA DPCA SWPCA CPCA DPCA SWPCA CPCA DPCA SWPCAExample 5 ( d = . ( , ) . .

839 0 . .

891 0 . . .

899 0 . . ( , ) .

833 0 . . .

876 0 . . .

883 0 . . ( , ) .

823 0 . . .

885 0 . . .

883 0 . . ( , ) .

802 0 . . .

872 0 . . .

866 0 . . ( , ) .

790 0 . . .

862 0 . . .

852 0 . . Example 5 ( d = . ( , ) .

864 0 . . .

876 0 . . .

901 0 . . ( , ) .

826 0 . . .

875 0 . . .

885 0 . . ( , ) .

815 0 . . .

870 0 . . .

874 0 . . ( , ) .

787 0 . . .

866 0 . . .

859 0 . . ( , ) .

792 0 . . .

865 0 . . .

861 0 . . Example 5 ( d = . ( , ) .

877 0 . .

835 0 . . . .

910 0 . . ( , ) .

860 0 . .

821 0 . . . .

883 0 . . ( , ) .

890 0 . .

833 0 . . . .

887 0 . . ( , ) .

831 0 . . .

854 0 . . .

871 0 . . ( , ) .

851 0 . . .

867 0 . . .

888 0 . . Table 16: 1 Step and 5 Steps Ahead Forecasting RMSE, Example 6

Overall FRMSE ( ) Overall FRMSE ( )( P , T ) DPCA ( ) DPCA ( ) DPCA ( ) SWPCA DPCA ( ) DPCA ( ) DPCA ( ) SWPCA ( , ) .

371 1 .

379 1 . . . .

518 1 . . ( , ) .

312 1 .

332 1 . .

309 1 .

495 1 .

496 1 . . ( , ) .

389 1 .

403 1 . . .

490 1 .

490 1 . . ( , ) . .

371 1 . .

349 1 . .

462 1 .

462 1 . ( , ) . .

355 1 . .

329 1 . .

489 1 . .481