A Bayesian cohort component projection model to estimate adult populations at the subnational level in data-sparse settings

Abstract

Accurate estimates of subnational populations are important for policy formulation and monitoring population health indicators. For example, estimates of the number of women of reproductive age are important to understand the population at risk of maternal mortality and unmet need for contraception. However, in many low-income countries, data on population counts and components of population change are limited, and so levels and trends subnationally are unclear. We present a Bayesian constrained cohort component model for the estimation and projection of subnational populations. The model builds on a cohort component projection framework, incorporates census data and estimates from the United Nations' World Population Prospects, and uses characteristic mortality schedules to obtain estimates of population counts and the components of population change, including internal migration. The data required as inputs to the model are minimal and available across a wide range of countries, including most low-income countries. The model is applied to estimate and project populations by county in Kenya for 1979-2019, and validated against the 2019 Kenyan census.


Monica Alexander (University of Toronto) and Leontine Alkema (University of Massachusetts, Amherst)

The work was supported by the Bill & Melinda Gates Foundation. We thank Gregory Guranich for assistance with R programming.

arXiv preprint [stat.AP]

Introduction

Reliable estimates of demographic and health indicators at the subnational level are essential for monitoring trends and inequalities over time. As part of monitoring progress towards global health targets such as the Sustainable Development Goals (SDGs), there has been increasing recognition of the substantial differences that can occur across regions within a country (World Health Organization (WHO) 2016b; Lim et al. 2016; He et al. 2017). As such, analysis of national-level trends is often inadequate, and subnational patterns should be considered in order to fully understand likely future trajectories. Indeed, estimates and projections of important indicators such as child mortality and contraceptive use are now being published at the subnational level (New et al. 2017; Wakefield et al. 2019).

To effectively measure health indicators of interest, we need to be able to accurately estimate the size of the population at risk. In order to convert the rate of incidence of a particular demographic or health outcome to the number of people affected by that outcome, we need a good estimate of the denominator of those rates. As such, population counts are essential knowledge for policy planning and resource allocation purposes. However, even something as seemingly simple as the number of people in an area of a certain age is relatively unknown in many countries, particularly low-income countries that do not have well-functioning vital registration systems. And as previously reported outcomes show, differences in estimates of the population at risk can have a large effect on the resulting estimates of key indicators. For example, in 2017 the United Nations Inter-agency Group for Child Mortality Estimation (UN-IGME) and the Institute for Health Metrics and Evaluation (IHME) both published estimates of under-five child mortality in countries worldwide (GBD 2016 Mortality Collaborators (IHME) 2017; UN-IGME 2017). However, estimates for 2016 differed markedly, with IHME's estimate being 642,000 deaths lower than the UN-IGME estimate. The main reason for the discrepancy was the different sets of estimates of live births: IHME assumed there were 128.8 million live births in 2016, which was 12.2 million lower than the 141 million used by UN-IGME.

Data on population counts by age and sex at the subnational level vary substantially by country, and data availability and quality are often worst in countries where outcomes are also relatively poor. For example, many low-income countries may only have one or two historical censuses available, and very little data available on internal migration or mortality rates at the subnational level. This situation is in stark contrast to many high-income countries where multiple data sources on population counts, mortality and migration may exist. These varying data availability contexts both present challenges in estimates of population and the components of population change. In data-rich contexts, the challenge is to reconcile multiple data sources that may be measuring the same outcome. In data-sparse contexts, the challenge is to obtain reasonable estimates without many observations. In both cases traditional demographic models are often utilized, which often center around a cohort component projection framework and take advantage of the fact that patterns in populations often exhibit strong regularities across age and time. However, these classical methods do not give any indication of uncertainty around the estimates or projections, and incorporating information from different data sources often requires ad hoc adjustments to ensure consistency. To overcome these limitations, we propose a method that builds on classical demographic estimation of populations by incorporating these techniques within a probabilistic framework.

In particular, we present a Bayesian constrained cohort component model to estimate subnational adult populations, focusing on women of reproductive age (WRA), i.e. women aged 15-49. This subgroup forms the population at risk for many important health indicators such as fertility rates, maternal mortality, and measures of contraceptive prevalence. The model presented embeds a cohort component projection setup in a Bayesian framework, allowing uncertainty in data and population processes to be taken into account. At a minimum, the model uses data on population and migration counts from censuses, as well as national-level information on mortality and population trends, taken from the UN World Population Prospects (UNPD 2019a). As data requirements are relatively small, the methodology is applicable across a wide range of countries, and overcomes limitations of previous subnational cohort component methods, which require relatively large amounts of data. Estimates and projections of population by age are produced, as well as estimates of subnational mortality schedules and in- and out-migration flows. As such, results from the model help to understand the population at risk of demographic and health outcomes at the subnational level, but also to understand drivers of population change and how these may in turn affect trends in indicators of interest.

The remainder of this paper is structured as follows. The next section gives a brief overview of existing methods of subnational population estimation, and outlines the contributions of the model proposed here. We then describe the main data sources typically available for subnational population estimation in low-income countries, using counties in Kenya as an example, followed by a detailed description of the proposed methodology. We then present results of fitting the model to data in Kenya and validate its out-of-sample projections against the 2019 census. Finally, possible extensions are discussed.

Methods to estimate population at the subnational level are similar to estimation methods at the national level. However, there are several notable challenges of subnational population estimation that do not exist at a country level. Firstly, migration flows are more important at the subnational level. While migration flows are often assumed to be negligible at the national level, they are usually larger as a proportion of total population size at the regional level. In addition, migration flows at the subnational level are also often more difficult to estimate. Any particular region could have net in- or out-migration, and flows to and from different regions can differ markedly in magnitude. Secondly, when estimating subnational populations, it is important to ensure the sum of all regions agrees with national estimates produced elsewhere. In practice, this usually involves a process of calibration against a known national population so that they match the total. Lastly, data quality and availability is often poorer at the subnational level. Populations at the regional level are smaller and data are often more volatile, and data on key indicators of mortality and internal migration is often lacking or unreliable.
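
As a hedged illustration of the calibration step mentioned above, the following Python sketch rescales regional estimates pro rata so that they sum to a given national total. Pro rata scaling is only one simple option, chosen here for illustration; the function name and all numbers are hypothetical and are not taken from any of the methods cited.

```python
def calibrate_to_national(regional, national_total):
    """Scale regional estimates pro rata so that they sum to national_total.

    `regional` maps a region name to its uncalibrated population estimate.
    """
    current_total = sum(regional.values())
    if current_total <= 0:
        raise ValueError("regional estimates must sum to a positive number")
    factor = national_total / current_total
    return {region: count * factor for region, count in regional.items()}


# Hypothetical regional estimates that sum to 200,000.
regional_estimates = {"A": 120_000.0, "B": 60_000.0, "C": 20_000.0}

# Hypothetical national total published elsewhere.
calibrated = calibrate_to_national(regional_estimates, national_total=220_000.0)

for region, count in calibrated.items():
    print(region, round(count))
```

Pro rata scaling preserves each region's share of the total, so it cannot correct errors in the relative sizes of regions.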

Perhaps the simplest and least data-intensive methods of subnational population estimation involve interpolation and extrapolation of regional shares of the total population (Swanson and Tayman 2012). Given two (or more) censuses, one can calculate the relevant shares of the population by age, sex and region and see how they have changed over time. Intercensal estimations of populations assume constant increase (or decrease) over time. Projection of populations into the future can then be made based on assumptions of constant levels or trends in shares. For example, the U.S. Census Bureau produces subnational population estimates for the majority of countries worldwide (U.S. Census Bureau 2017). The methods used to produce such estimates involve making assumptions such as constant or logistic growth, and iteratively calculating population proportions by age, sex and region such that they match the country's total populations (Leddy 2017).

The most commonly used methods of population estimation and projection are cohort component methods. These center on the demographic accounting identity, which states that the population size ($P$) at time $t$ is equal to the population size at $t - 1$, plus births ($B$) and in-migrants ($I$), minus deaths ($D$) and out-migrants ($O$) (Wachter 2014):

$$P_t = P_{t-1} + B_{t-1} + I_{t-1} - D_{t-1} - O_{t-1}. \tag{1}$$

The above equation is for a total population, but the same accounting equation holds for each age group separately (where births only affect the first age group). The cohort component method of population projection (Leslie 1945) takes a baseline population with a certain age structure and survives it forward based on age-specific mortality, fertility and migration rates. Cohort component methods are important because they allow for overall population change to be related to the main components of that change. By estimating population size based on the components of fertility, mortality and migration, the method allows changes in these components to be taken into account. However, cohort component methods are more data-intensive than extrapolation methods, which is particularly an issue at the subnational level. For developing countries in particular, where well-functioning vital registration systems do not exist, sufficient data on mortality, fertility and migration are often lacking.

Other methods of subnational estimation involve building regression models which relate other variables of interest to changes in population over time. For example, one could regress the ratio of census populations (area of interest / total population) against the ratio of some other indicator, e.g. births, deaths, voters, school enrollments (see Swanson and Tayman (2012) for a detailed review). However, given the lack of data available in many developing countries (on population counts, let alone other indicators of growth), these methods have limited use in our context.

These traditional methods of population estimation are deterministic and do not account for random variation in demographic processes and possible measurement errors that may exist in the data.
In practice, the population data that are available in developing countries are often sparse and may suffer from various types of errors. When estimating and projecting population sizes through time, it is particularly important in developing country contexts to give some indication of the level of uncertainty around those estimates, based on stochastic error, measurement error and uncertainties in the underlying modeling process.

The use of Bayesian methods in demography has become increasingly common, as it provides a useful framework to incorporate different data sources in the same model, account for various types of uncertainty, and allow for information exchange across time and space (Bijak and Bryant 2016). Bayesian methods have been used to model and forecast national populations (Raftery et al. 2012; UNPD 2019a), fertility (Alkema et al. 2011), mortality (Alexander and Alkema 2018; Alkema and New 2014; Girosi and King 2008) and migration (Bijak 2008). In terms of estimating the full demographic accounting identity, Wheldon et al. (2013) propose a method for the reconstruction of past populations. The model embeds the demographic accounting equation within a Bayesian hierarchical framework, using information from available censuses to reconstruct historical populations via a cohort component projection framework. The authors show the method works well to estimate populations and quantify uncertainty in a wide range of countries with varying data availability (Wheldon et al. 2016). The method presented in Wheldon et al. (2013) is designed for population reconstruction at the national level, and as such, accounting for internal migration is not an issue. In addition, their method relies upon and calibrates to national population estimates produced as part of the UN World Population Prospects.

In the field of subnational estimation, Bayesian methods have also been used in many different contexts. For subnational mortality estimation, many researchers have used Bayesian hierarchical frameworks to share information about mortality trends across space and time, in contexts where the available data are both reliable (Congdon, Shouls, and Curtis 1997; Alexander, Zagheni, and Barbieri 2017) and sparse (Schmertmann and Gonzaga 2018). For subnational fertility estimation, Sevcikova, Raftery, and Gerland (2018) propose a Bayesian model that produces estimates and projections of subnational total fertility rates (TFRs) that are consistent with national estimates of TFR produced by the UN. Building from the local level up, Schmertmann et al. (2013) propose a method which uses empirical Bayesian methods to smooth volatile fertility data at the regional level, before modeling using a Brass relational model variant.

In terms of population estimation at the subnational level, John Bryant and colleagues have shown how the demographic accounting equation can be placed within a Bayesian framework to account for and reconcile different data sources on population counts and the components of population change (Bryant and Graham 2013; Bryant and Zhang 2018). Bryant and Zhang (2018) show how the underlying demographic processes can be captured through a process or system model, and different types of uncertainty around data inputs are captured through data models. The focus of Bryant and Graham (2013) is producing subnational population estimates for New Zealand, reconciling and incorporating information about the population from sources such as censuses, and school and voting enrollments. The approach that we take in this paper is similar to the Bryant et al. approach, in that we model population change with a process model, the components of which are described by system models, and different sources of information are combined through the use of data models. However, whereas Bryant et al. try to overcome challenges of combining multiple data sources that may be measuring the same outcome, we are trying to overcome the challenges of estimating subnational populations in contexts where there are extremely limited amounts of data available.

There is an increasing amount of work using geo-located data and satellite imagery to estimate population sizes and flows in developing countries (Wardrop et al. 2019; Leasure et al. 2020). Led by the WorldPop project at the University of Southampton (WorldPop 2018), researchers have used information from satellite imagery to identify areas of settlements, and combined this information with census data to obtain highly granular population density estimates across Africa (Linard et al. 2012; Leasure et al. 2020). While this work contributes to information about subnational populations, the focus and goals of this estimation work are different to our goals in this paper. In particular, the goal of the WorldPop work is primarily to obtain estimates of total population and population density at a very granular level, rather than obtaining population estimates by age and sex. The results have then been combined with data on age- and sex-distributions from censuses (or more recent surveys) to map the distribution of populations by age and sex. However, little attention is paid to how age distributions across regions change over time. But changes in age distributions are important in understanding broader population change and how this will impact global health indicators of interest. In addition, our approach is grounded in understanding the main components of demographic change (mortality and migration) over time and how they affect population sizes, rather than just estimating the population size as a single outcome.

The methodology proposed in this paper incorporates a cohort component projection model into a Bayesian hierarchical framework to understand changes in population structures over time. It allows estimates to be driven by available data and for uncertainty to be incorporated around estimates and projections. The approach has similarities with methodologies described in Wheldon et al. (2013) (but with a focus on subnational estimation) and in Bryant et al. (2013; 2018) (but with a focus on data-sparse situations).

In particular, we introduce a framework to estimate subnational population counts and components of population change that relies on a minimal amount of data that is available for the vast majority of countries worldwide. Observations on subnational population counts and internal migration movements are taken from censuses, but no information on subnational mortality patterns is required. We instead use a mortality model approach based on principal components derived from national mortality schedules. Using principal components for demographic modeling and forecasting first gained popularity after Lee and Carter used the technique as a basis for forecasting US mortality rates (Lee and Carter 1992). More recently, principal components have been increasingly used in demographic modeling, in both fertility and mortality settings (Schmertmann et al. 2014; Clark 2016; Alexander, Zagheni, and Barbieri 2017).

While one strength of our approach is being able to estimate components of subnational population change with limited data, another strength of the proposed framework is that it can be readily extended to include other data or estimates. For example, gridded estimates produced as part of the WorldPop project could conceivably be treated as an additional data input to the model.
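
Since the cohort component framework used in this paper rests on the demographic accounting identity (Equation 1), a small numerical illustration may help fix ideas. The Python sketch below applies the identity to a total population over two periods. The function name and all flow values are hypothetical, and the example is deterministic, so it shows the accounting only and not the Bayesian treatment of uncertainty.

```python
def project_population(p_prev, births, in_migrants, deaths, out_migrants):
    """Apply the accounting identity: P_t = P_{t-1} + B + I - D - O."""
    return p_prev + births + in_migrants - deaths - out_migrants


# Hypothetical starting population and per-period flows.
population = 1_000_000
flows = [
    {"births": 30_000, "in_migrants": 5_000, "deaths": 10_000, "out_migrants": 8_000},
    {"births": 31_000, "in_migrants": 6_000, "deaths": 10_500, "out_migrants": 7_500},
]

trajectory = [population]
for period_flows in flows:
    population = project_population(population, **period_flows)
    trajectory.append(population)

print(trajectory)
```

The same identity holds within each age group, with births entering only the first age group, which is what the cohort component method exploits.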

Data

We aim to estimate female population counts for ages 15-49 per 5-year age group for subnational areas that are the second administrative level down. This data description focuses on Kenya, for which the model is applied in later sections. However, the data and methods are more broadly applicable to other countries that have similar data available. Inputs used to obtain estimates come from two main sources: micro-level data from censuses, and national population and mortality estimates from the 2019 World Population Prospects. These data sources are outlined in the following sections.

In Kenya, the first administrative units are provinces, and the second administrative units are counties. There are eight provinces, including the capital Nairobi, and 47 counties. The county boundaries have changed over time, but have been stable since the 2009 census. We aim to produce estimates of populations of women of reproductive age at the county level based on county boundaries in 2009. Within the model, we also make use of harmonized district boundaries (see description below), which are slightly larger than counties. There are a total of 35 districts. Provinces and districts are illustrated below in Fig 1.

Data inputs on subnational population counts and internal migration flows come from national censuses. The census data are available through Integrated Public Use Microdata Series (IPUMS) International (Minnesota Population Center 2017). IPUMS-International contains samples of microdata for 305 censuses over 85 different countries. The majority of countries of interest have relatively recent censuses available through IPUMS-International. Kenya has decennial censuses available from 1979 to 2009. Micro-level data are not available for the 2019 Kenyan census, although population counts by sex and five-year age group and county are available through the national statistics office. The 2019 data are reserved for model evaluation, as detailed in the Model Evaluation section.

In the micro-level IPUMS data, location of residence is reported at the first (province) and second (county) administrative levels, as well as a harmonized district level. For Kenya, the provinces are stable over time, but before 2009 the county boundaries changed. As such, we only have data at the county level for Kenya for 2009. However, we can make use of the harmonized districts for data in years prior to 2009. The districts represent slightly larger groups than the 47 Kenyan counties, which are harmonized and temporally stable (IPUMS 2018). In all cases, each 2009 county is completely contained in one unique district.

[Figure 1: Map of Kenya provinces, showing IPUMS harmonized districts. Legend (province): Central, Coast, Eastern, Nairobi, Northeastern, Nyanza, Rift Valley, Western.]

We used census data to obtain information on two different quantities: observed population counts; and observed patterns of in- and out-migration. Female population counts by five-year age groups for ages 15-49 and subnational administrative region are obtained directly from the IPUMS-International microdata. As these data are samples (most commonly 10%), the microdata are multiplied by the person weights to obtain counts by age and area.

Information on internal migration between counties and districts is also obtained from national censuses. This is based on questions about a migrant's location of residence one year ago.

Note: The sampling error introduced by considering sampled microdata is accounted for in the data model; refer to the Methods section for details.
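
The weighting step described above, in which sampled records are scaled up by their person weights to give population counts by age and area, can be sketched in Python as follows. This is a minimal illustration only: the record layout and field names (`district`, `age_group`, `person_weight`) are hypothetical and are not the IPUMS variable names, and the records are invented.

```python
from collections import defaultdict


def weighted_counts(records):
    """Sum person weights within each (district, age group) cell."""
    counts = defaultdict(float)
    for rec in records:
        counts[(rec["district"], rec["age_group"])] += rec["person_weight"]
    return dict(counts)


# Invented sample records. In a 10% sample, a weight near 10 means each
# sampled woman stands for about ten women in the full population.
sample = [
    {"district": "D1", "age_group": "15-19", "person_weight": 10.0},
    {"district": "D1", "age_group": "15-19", "person_weight": 10.0},
    {"district": "D1", "age_group": "20-24", "person_weight": 10.0},
    {"district": "D2", "age_group": "15-19", "person_weight": 9.5},
]

counts = weighted_counts(sample)
for cell, count in sorted(counts.items()):
    print(cell, count)
```

The resulting counts are estimates subject to sampling error, which is why the data model carries a sampling-error term.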

The World Population Prospects (WPP) are the official population estimates and projections produced by the United Nations. WPP is revised every two years, with the latest revision being in 2019 (UNPD 2019a). WPP estimates are produced using a combination of census and survey data, and demographic and statistical methods. Both population counts and mortality estimates from WPP are used in the model.

We use estimates from WPP 2019 in two ways. Firstly, we would like to ensure that the sum of population estimates at the regional level agrees with published estimates at the national level. National population counts produced by WPP are used as a constraint in the model, subject to uncertainty. The WPP models populations of five-year age groups every five years from 1950-2100.

Secondly, national mortality estimates produced by WPP are used as the basis of a mortality model for patterns at the regional level, capturing HIV/AIDS related patterns of mortality. WPP uses the relationship between infant mortality and the probability of dying between ages 15 and 60, i.e. ${}_{45}q_{15}$, to estimate a life table based on Coale-Demeny Model Life Tables (UNPD 2019b). We use estimates of the probability of dying between ages $x$ and $x + 5$, ${}_{5}q_{x}$.

We use census data and WPP estimates as inputs to the model. There are other available data sources that could be used as inputs. These sources and the reasons for not including them are discussed in Appendix A.

Model

In this section we describe the modeling framework to estimate female populations by five-year age group and county. The model is outlined for the situation where, like in the Kenyan case, we do not observe county-level information for every census, but we have information on larger, harmonized districts that fully encapsulate the counties. This situation is common for many low-income countries where geographic boundaries may vary over time but there exist some other stably-defined boundaries through the micro-data on IPUMS.

There are many components and several types of data going into the model at different stages. The overall model framework is summarized visually in Figure 2. We define $\eta_{a,t,c}$ to be the underlying 'true' population of women in age group $a$, year $t$ and county $c$. Our main modeling goal is to obtain estimates and projections of these quantities. The population counts follow a cohort component projection (CCP) model, which assumes population counts in the current time period are those from the previous period, after taking into account expected changes in mortality and migration. The CCP model also includes an additional age-time multiplier which captures any other variation not already captured by expected changes in mortality or migration. Our set-up allows for changes in mortality and migration to be projected forward even if there are no data on these components, and is useful in data-sparse contexts where there is limited information available on the individual components of population change.

As illustrated in Figure 2, the mortality, migration and additional age-time specific multipliers have additional 'process models' (shown on the third row), and data on population counts and migration are related to the underlying process through data models (shown in the top row).

The following sections broadly describe each component of the model. The full model specification and details can be found in Appendix B.
The model for population includes: the cohort component projection model; the data model, which relates observations of population counts to the underlying quantities of interest; and the national-level constraint.

[Figure 2: Diagram showing the main components of the Bayesian cohort component projection model. Top row, data inputs: the population data model and the migration data model, both using censuses as data sources. Middle row: the cohort component projection. Bottom row, process models: migration (a constant age distribution $\Pi$ times total in- or out-migrants $\Psi$, with total migrants modeled as a second-order random walk), mortality (modeled on the logit scale as a linear combination of a mean plus two principal components derived from WPP mortality estimates), and the multiplier (additional age-year variation, modeled as a first-order random walk over age with the mean constrained to equal zero).]

The underlying population $\eta_{a,t,c}$ can be expressed as

$$\eta_{a,t,c} = \left(\eta_{a-1,t-1,c} \cdot (1 - \gamma_{a-1,t-1,c})\right) \cdot (1 + \phi_{a-1,t-1,c}) \cdot \varepsilon_{a-1,t-1,c}, \tag{2}$$

where $\gamma_{a,t,c}$ is the expected conditional probability of death in age group $a$, year $t$ and county $c$, $\phi_{a,t,c}$ is expected net migration (that is, in- minus out-migration) as a proportion of population size, and $\varepsilon_{a,t,c}$ is an additional age-year-county multiplier. Note that this is a form of a cohort component projection framework. As mentioned previously, our main modeling goal is to obtain estimates of the $\eta_{a,t,c}$, but we are also interested in estimates of expected mortality ($\gamma_{a,t,c}$) and expected migration ($\phi_{a,t,c}$), and, if non-zero, the multipliers ($\varepsilon_{a,t,c}$).

Define $y_i$ to be the $i$th observed population count. Depending on the year of the census, $y_i$ is either observed at the county $c$ level or district $d$ level. The data model is:

$$\log y_i \mid \eta_{a,t,c} \sim \begin{cases} N\left(\log \eta_{a[i],t[i],c[i]},\ s_{y[i]}^2\right) & \text{if } t = 2009, \\ N\left(\log \sum_{c \in d[i]} \eta_{a[i],t[i],c},\ s_{y[i]}^2\right) & \text{if } t < 2009, \end{cases} \tag{3}$$

where $s_y$ is the sampling error based on the fact that the micro-data in IPUMS is a 10% sample. The second case of the above equation dictates that if we have observations prior to 2009, we can only relate these to $\eta_{a,t,c}$'s that have been summed to the district level.

We would like to ensure the county-level populations $\eta_{a,t,c}$ imply a national-level population that is consistent with previously-published estimates in WPP. To do this, we implement the following constraint in the model, which roughly corresponds to the sum of the subnational populations in any age and year being within 90-110% of WPP.
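
To make the cohort component projection step in Equation 2 concrete, the Python sketch below advances a single cohort by one age group and one period, and then checks a set of county populations against a 90-110% band around a national figure. This is an illustration under stated assumptions: the function names and all numbers are hypothetical, the inputs are treated as known rather than estimated, and the band is applied as a hard yes/no check, whereas the paper describes the constraint only as roughly corresponding to this band.

```python
def ccp_step(eta_prev, gamma_prev, phi_prev, eps_prev):
    """One step of Equation 2 for a single cohort in a single county.

    eta_prev   -- population in age group a-1 at time t-1
    gamma_prev -- probability of death for that cohort
    phi_prev   -- net migration as a proportion of population
    eps_prev   -- additional multiplier (1.0 means no extra variation)
    """
    survivors = eta_prev * (1.0 - gamma_prev)
    return survivors * (1.0 + phi_prev) * eps_prev


def within_wpp_band(county_pops, wpp_total, lower=0.9, upper=1.1):
    """Check that summed county populations fall within a band of a national total."""
    ratio = sum(county_pops) / wpp_total
    return lower <= ratio <= upper


# Hypothetical cohort: 10,000 women, 2% mortality, 5% net in-migration.
eta_next = ccp_step(eta_prev=10_000.0, gamma_prev=0.02, phi_prev=0.05, eps_prev=1.0)
print(round(eta_next, 1))

# Hypothetical second county and hypothetical national figure.
print(within_wpp_band([eta_next, 5_000.0], wpp_total=15_000.0))
```
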
Further details on the constraint and priors in the population model are given in Appendix B.

Equation 2 requires estimates of the expected conditional probability of death in each age group, year and county. As discussed in the Data section and appendix, we do not have reliable information about mortality by age at the county level, and as such we use information about mortality trends at the national level as the basis for a mortality model at the subnational level. A semi-parametric model is used to capture the shape of national mortality through age and time, while allowing for differences by county. In particular, we model county mortality on the logit scale as

$$\text{logit}(\gamma_{a,t,c}) = \alpha_{0,c} + Y_{a,0} + \beta_{t,c,1} \cdot Y_{a,1} + \beta_{t,c,2} \cdot Y_{a,2}, \tag{4}$$

where $Y_{a,0}$ is the mean age-specific logit mortality schedule of the national mortality curves and $Y_{a,1}$ and $Y_{a,2}$ are the first two principal components derived from national-level mortality schedules. Modeling on the logit scale ensures the death probabilities are between zero and one.

Principal components create an underlying structure of the model in which regularities in age patterns of human mortality can be expressed. Many different kinds of shapes of mortality curves can be expressed as a combination of the components. Incorporating more than one principal component allows for greater flexibility in the underlying shape of the mortality age schedule.

Principal components were obtained from a decomposition on a matrix which contains a set of standard mortality curves. As discussed in the Data section, we used national Kenyan life tables published in the World Population Prospects 2019. In particular, let $X$ be an $N \times G$ matrix of logit mortality rates, where $N$ is the number of years and $G$ is the number of age groups. In this case, we had $N = 16$ years (estimates every 5 years from 1950 to 2025) of $G = 7$ age groups (15-19, 20-24, ..., 45-49). The singular value decomposition of $X$ is

$$X = U D V^\top, \tag{5}$$

where $U$ is an $N \times N$ matrix, $D$ is an $N \times G$ matrix and $V$ is a $G \times G$ matrix.
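
Once a mean schedule and two principal components are available, Equation 4 is straightforward to evaluate. The Python sketch below does so for the seven five-year age groups. It is an illustration only: the schedule and component values are made-up numbers rather than the ones derived from WPP, the coefficient values are hypothetical, and the decomposition that produces the components is not shown.

```python
import math


def inv_logit(x):
    """Map a value on the logit scale back to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def county_mortality(alpha_c, beta1, beta2, y0, y1, y2):
    """Evaluate Equation 4 for every age group and return death probabilities."""
    return [
        inv_logit(alpha_c + mean + beta1 * pc1 + beta2 * pc2)
        for mean, pc1, pc2 in zip(y0, y1, y2)
    ]


# Made-up inputs for age groups 15-19, 20-24, ..., 45-49.
y0 = [-3.9, -3.7, -3.5, -3.4, -3.3, -3.1, -3.0]         # mean logit schedule
y1 = [-0.35, -0.33, -0.30, -0.28, -0.25, -0.22, -0.20]  # first component
y2 = [-0.02, -0.10, -0.18, -0.20, -0.15, -0.08, -0.03]  # second component

# Hypothetical county intercept and coefficients for one year.
gamma = county_mortality(alpha_c=0.1, beta1=0.5, beta2=1.0, y0=y0, y1=y1, y2=y2)

for age_start, prob in zip(range(15, 50, 5), gamma):
    print(age_start, round(prob, 4))
```

Because the linear combination is passed through the inverse logit, every output lies strictly between zero and one regardless of the coefficient values.
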
The first two columns of $V$ (the first two right-singular vectors of $X$) are $Y_{A,1}$ and $Y_{A,2}$.

The mean mortality schedule and the first two principal components for Kenyan national mortality curves between ages 15-49 from 1950–2020 are shown in Fig. 3. The mean logit mortality schedule shows a standard age-specific mortality curve, with mortality increasing over age. The first two principal components have demographic interpretations. The first shows the average contribution of each age to mortality improvement over time. This interpretation is similar to the $b_x$ term in a Lee-Carter model (Lee and Carter 1992). For the case of Kenya, the second principal component most likely represents the relative effect of HIV/AIDS mortality by age.

[Figure 3: Mean logit mortality schedule and first two principal components. Panels: Y_0 (mean), Y_1 (mortality improvement), Y_2 (HIV/AIDS); x-axis: age; y-axis: value.]

The county-specific mortality intercepts are modeled using a Normal distribution centered at zero:

$$\alpha_{0,c} \mid \sigma_\alpha \sim N(0, \sigma_\alpha). \tag{6}$$

The county-specific coefficients $\beta_{t,c,k}$ are modeled as fluctuations around a national mean:

$$\beta_{t,c,k} = B^{nat}_{t,k} + \delta_{t,c,k}, \tag{7}$$

$$\delta_{t,c,k} \mid \delta_{t-1,c,k}, \sigma_\delta \sim N(\delta_{t-1,c,k}, \sigma_\delta), \tag{8}$$

where $B^{nat}_{t,k}$ are the national coefficients on principal components, derived from WPP data. The county-specific fluctuations are modeled as a random walk.

Migration model

The second population change component of Equation 2 refers to the net-migration rate in a particular age group, year and county. Specifically, define the net-migration rate as

$$\phi_{a,t,c} = \frac{\psi^{in}_{a,t,c} - \psi^{out}_{a,t,c}}{\eta_{a-1,t-1,c}}, \tag{9}$$

where $\psi^{in}_{a,t,c}$ is the number of in-migrants and $\psi^{out}_{a,t,c}$ is the number of out-migrants.

For the migration component, we use observed data from the census. As such, in a similar way to the population model, we have a process model, which defines the underlying migration process for the 'true' migrant parameters, and a data model, which relates observations from the census to the underlying truth.

The model form for the number of in-migrants and out-migrants is informed by patterns observed in the raw census data. In particular, looking at the age distribution of both in- and out-migration (i.e. the proportion of total migrants who are in age group $a$) suggests that, while the overall magnitude of migration changes over time, the age patterns in migration are fairly constant (see figures in Appendix C).
This observation allowed us to simplify the expression for the number of in-migrants and out-migrants, which are modeled as

$$\phi_{a,t,c} = \frac{\psi^{in}_{a,t,c} - \psi^{out}_{a,t,c}}{\eta_{a-1,t-1,c}}, \tag{10}$$

$$\psi^{in}_{a,t,c} = \Psi^{in}_{t,c} \cdot \Pi^{in}_{a,c}, \tag{11}$$

$$\psi^{out}_{a,t,c} = \Psi^{out}_{t,c} \cdot \Pi^{out}_{a,c}, \tag{12}$$

where $\Psi^{in}_{t,c}$ and $\Psi^{out}_{t,c}$ are the total number of in- and out-migrants, respectively, and $\Pi^{in}_{a,c}$ and $\Pi^{out}_{a,c}$ are the relevant age distributions. In this way the age distributions are assumed to be constant over time while the total counts vary. We model the total counts as a second-order random walk to impose a certain level of smoothness in the counts over time. As the model is meant to capture internal migration flows in and out of each county, the sum of all in-migration flows must equal the sum of all out-migration flows. As such, we also constrain the difference between the sum of all estimated in- and out-migration flows to be close to zero. See Appendix B for further details.

Finally, we relate the observed age-specific in- and out-migration counts in the censuses, denoted $M^{in}_i$ and $M^{out}_i$, respectively, to the underlying true counts $\psi^{in}_{a,t,c}$ and $\psi^{out}_{a,t,c}$ through the following data model:

$$\log M^{in}_i \mid \psi^{in}_{a,t,c} \sim \begin{cases} N\left(\log \psi^{in}_{a[i],t[i],c[i]},\ s_{in[i]}\right) & \text{if } t[i] = 2009, \\ N\left(\log \sum_{c \in d[i]} \psi^{in}_{a[i],t[i],c},\ s_{in[i]}\right) & \text{if } t[i] < 2009, \end{cases} \tag{13}$$

$$\log M^{out}_i \mid \psi^{out}_{a,t,c} \sim \begin{cases} N\left(\log \psi^{out}_{a[i],t[i],c[i]},\ s_{out[i]}\right) & \text{if } t[i] = 2009, \\ N\left(\log \sum_{c \in d[i]} \psi^{out}_{a[i],t[i],c},\ s_{out[i]}\right) & \text{if } t[i] < 2009. \end{cases} \tag{14}$$

In a similar fashion to the data model for population, data observed prior to 2009 can only be related to the migration counts that have been summed to the district level. In addition, $s_{in}$ and $s_{out}$ are the sampling errors, reflecting the fact that the micro-data in IPUMS is a 10% sample.
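The separation of migration into a total count and a time-constant age distribution (Equations 10-12) can be illustrated with a small sketch. The function name, argument names and all numbers below are hypothetical and chosen only to make the arithmetic easy to follow; they are not estimates from the model.

```python
def net_migration_rates(total_in, total_out, age_in, age_out, pop_prev):
    """Age-specific net migration rates phi_a for one county-year.

    total_in, total_out : total in- and out-migrants (Psi^in, Psi^out)
    age_in, age_out     : age distributions (Pi^in, Pi^out), each summing to 1
    pop_prev            : eta_{a-1,t-1,c}, the population each rate applies to
    """
    if abs(sum(age_in) - 1.0) > 1e-9 or abs(sum(age_out) - 1.0) > 1e-9:
        raise ValueError("age distributions must sum to 1")
    psi_in = [total_in * p for p in age_in]       # Equation 11
    psi_out = [total_out * p for p in age_out]    # Equation 12
    # Equation 10: net migrants as a proportion of the exposed population
    return [(i - o) / n for i, o, n in zip(psi_in, psi_out, pop_prev)]

# Made-up example with three age groups
phi = net_migration_rates(
    total_in=1000.0, total_out=400.0,
    age_in=[0.5, 0.3, 0.2], age_out=[0.25, 0.25, 0.5],
    pop_prev=[10000.0, 8000.0, 5000.0],
)
```

Because only the totals vary over time, each county needs one age distribution per flow direction plus one total per year, rather than a separate parameter for every age-year cell.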
Age-time multiplier $\varepsilon_{a,t,c}$

In both the models for expected mortality and migration discussed above, constraints are imposed on the age-specific effects. In particular, the use of the SVD approach to model mortality results in mortality age patterns that are linear combinations of the mean schedule and the components of change (the $Y$'s). Additionally, the migration model assumes a constant age pattern of migration over time with varying magnitudes of in- and out-migration. We assume these forms in order to greatly reduce the number of parameters that need to be estimated in each model, such that reasonable estimates of mortality and migration rates can still be obtained in data-sparse settings.

In order to allow for county-specific age and time variation that may not have already been captured by other components, we introduced an additional age-time multiplier $\varepsilon_{a,t,c}$ in the population cohort component model (see Equation 2). We model these multipliers on the log scale, and to ensure identifiability we assume the log multipliers sum to zero over all age groups. This constraint is implemented through the re-parameterization:

$$\log \varepsilon_{A,t,c} = D'(DD')^{-1} \zeta_{A-1,t,c}, \tag{15}$$

$$\zeta_{a,t,c} \sim N(0, \sigma_\zeta), \tag{16}$$

where $D$ is a first-order difference matrix (with $D_{i,i} = -1$, $D_{i,i+1} = 1$, and $D_{i,j} = 0$ otherwise) such that $\zeta_{a,t,c} = \log \varepsilon_{a,t,c} - \log \varepsilon_{a-1,t,c}$.

The model was fitted in a Bayesian framework using the statistical software R. Samples were taken from the posterior distributions of the parameters via a Markov Chain Monte Carlo (MCMC) algorithm. This was performed using JAGS software (Plummer 2003). Standard diagnostic checks using trace plots and the $\hat{R}$ diagnostic (Gelman et al. 2020) were used to check convergence.

Best estimates of all parameters of interest were taken to be the median of the relevant posterior samples. The 95% Bayesian credible intervals were calculated by finding the 2.5% and 97.5% quantiles of the posterior samples.
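As an illustration of how point estimates and credible intervals are read off posterior samples, the sketch below computes a median and the 2.5% and 97.5% quantiles from a list of draws. It is a minimal stand-in, not the authors' R/JAGS workflow: the helper names are hypothetical, the draws are simulated rather than MCMC output, and the quantile rule (linear interpolation between order statistics) is one common convention among several.

```python
import random
import statistics

def quantile(sorted_xs, p):
    """Linear-interpolation quantile of an already sorted list, 0 <= p <= 1."""
    pos = p * (len(sorted_xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_xs) - 1)
    return sorted_xs[lo] + (pos - lo) * (sorted_xs[hi] - sorted_xs[lo])

def summarize(samples):
    """Posterior median with a 95% interval from the 2.5% and 97.5% quantiles."""
    xs = sorted(samples)
    return {
        "median": statistics.median(xs),
        "lower": quantile(xs, 0.025),
        "upper": quantile(xs, 0.975),
    }

# Simulated draws standing in for posterior samples of one population count
random.seed(1)
draws = [random.lognormvariate(10.0, 0.1) for _ in range(4000)]
summary = summarize(draws)

# A deterministic check on the integers 0..100
check = summarize(list(range(101)))
```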
In this section we illustrate some key results for population counts, mortality and migration. Additional results are presented in Appendix D.

Figure 4 shows the WRA population by province in 1979-2019. The black line and associated shaded area are the model estimates and associated 95% credible intervals. The red dots are the data from decennial censuses. Populations of WRA are increasing in every province, with the two largest provinces being Nairobi and Rift Valley. While Northeastern is the smallest province by population size, the growth rate is relatively rapid. This is likely due to the relatively high fertility rates in this province (Westoff and Cross 2006; Kenya National Bureau of Statistics 2015), whereas rapid population increases in Nairobi are driven by in-migration.

[Figure 4: Estimates of female population aged 15-49 by province, Kenya, 1979-2020. Panels: Central, Coast, Eastern, Nairobi, Northeastern, Nyanza, Rift Valley, Western; x-axis: year; y-axis: population.]

Figure 5 illustrates populations over age and time for three different counties. Note the different y-axis scales for each county. For Nairobi, populations are much larger and the presence of net in-migration far surpasses the effects of mortality, leading to an inverted-U shaped age distribution. For Wajir, a relatively rural county in the northeast, population growth seems rapid over time. For Baringo, populations are relatively small and decline regularly over age due to mortality.

[Figure 5: Estimates of female population aged 15-49 ('000) by age and year for three counties. Panels: Nairobi, Wajir, Baringo; x-axis: age group; y-axis: population ('000).]

In addition to getting estimates of population counts, we also obtain estimates of the components of population change, namely mortality and migration. In terms of mortality, there is evidence of variation across the counties. Focusing on the same three counties as above, mortality profiles are quite different, with Baringo's estimates being similar to the national mean (Figure 6).

[Figure 6: Estimates of mortality by age and year for three counties. Panels: Nairobi, Wajir, Baringo; x-axis: age group; y-axis: Pr(death); legend: year.]

In addition to mortality, there is substantial variation in patterns of migration across Kenyan counties. Figure 7 shows estimates of all migration components in the three case study counties. For total in-migration and out-migration estimates (Figure 7a), flows into and out of Nairobi are much larger, with net in-migration reaching almost 400,000 people per year. Flows into Wajir are much smaller (<10,000 people), and in 2019 Baringo had net out-migration of around 10,000. The estimated age patterns of migration for the three counties are also shown in Figure 7b. Some differences exist, with Nairobi's in-migrants much more concentrated around age 20.

[Figure 7: Estimates and 95% credible intervals of migration components for three counties (Nairobi, Wajir, Baringo). (a) Estimates of total in- and out-migration over time; x-axis: year; y-axis: population ('000); legend: flow direction (in, out). (b) Estimates of age distribution of in- and out-migration; x-axis: age; y-axis: proportion of migrants; legend: flow direction (in, out).]

Figure 8 shows the age-time multipliers $\varepsilon$ for the three example counties. For Baringo, the multipliers are essentially always zero on the log scale. This observation is true for the majority of counties (see Appendix D for plots for additional counties), which suggests that most of the patterns over age and time are captured well by the mortality and migration components. For county-years where multipliers do deviate from zero, estimates are at most around 10% of the total population magnitude, and usually between 0-0.05%. For example, for Nairobi, the estimated multiplier suggests that, after accounting for the expected mortality and migration components, in 1989 we see an additional increase around age 20 (of around 10%) and an additional decrease of around 10% at the oldest age group.

[Figure 8: Age-time specific multipliers for three counties. Panels: Nairobi, Wajir, Baringo; x-axis: age group; y-axis: log multiplier; legend: year.]

Model evaluation

A national census was run in Kenya in 2019. While the microlevel data are not yet publicly available (for example, via IPUMS), the resulting population counts by age, sex and county have been published by the Kenya National Bureau of Statistics (Kenya National Bureau of Statistics 2019). We can therefore evaluate the 2019 projections from our model against the actual counts from the 2019 census.

We extracted census population counts by age, sex and county from a PDF file containing the results, following code provided by Alexander (2020). We compared the 2019 projections from the Bayesian cohort component projection model with these counts and calculated several summary metrics. We define the relative error $e_g$ for a particular group $g$ as

$$e_g = \frac{y_{g,2019} - \hat{\eta}_{g,2019}}{y_{g,2019}}, \tag{17}$$

where $y_{g,2019}$ refers to the census-based population count for that population and $\hat{\eta}_{g,2019}$ to the model-based projection. A group $g$ can refer to an age-county or age-district group, for example. Based on the errors, we calculate mean, median, and root mean squared errors by age group and for the total population. We compared these results to the results of a similar linear extrapolation model, where the population in 2019 was estimated by applying the same proportional change seen between the 1999 and 2009 censuses. Errors are summarized over districts, as estimates by county are not possible with the linear extrapolation method (as we only have one previous set of census observations by county).

Error summaries by age group and for the total population are shown in Table 1. In general, the Bayesian model projections are within ~1% of the census populations. The magnitudes of the RMSEs for the simple linear interpolation are 3-10 times higher than those of the Bayes CCP.
The bias results suggest that the point estimate from the Bayes CCP is often slightly lower than the census observation, whereas linear interpolation substantially over-estimates population counts.

| Age group | Mean error (Interpolation) | Mean error (Bayes CCP) | Median error (Interpolation) | Median error (Bayes CCP) | RMSE (Interpolation) | RMSE (Bayes CCP) |
|---|---|---|---|---|---|---|
| 15 | -0.070 | -0.058 | 0.128 | 0.010 | 0.013 | 0.005 |
| 20 | -0.228 | -0.081 | -0.040 | -0.033 | 0.016 | 0.005 |
| 25 | -0.266 | -0.019 | 0.018 | 0.006 | 0.020 | 0.005 |
| 30 | -0.146 | 0.043 | 0.035 | 0.047 | 0.020 | 0.005 |
| 35 | -0.254 | -0.161 | -0.008 | -0.179 | 0.035 | 0.012 |
| 40 | -0.058 | -0.074 | 0.185 | -0.048 | 0.038 | 0.009 |
| 45 | -0.246 | -0.065 | 0.049 | 0.011 | 0.061 | 0.021 |
| Total population | -0.101 | -0.045 | 0.119 | 0.011 | 0.031 | 0.010 |

Table 1: Summary of errors in district population sizes by age group comparing 2019 census counts with two methods, linear interpolation and the Bayesian cohort component projection model (Bayes CCP).

We also calculated the coverage of the 90% prediction intervals of the Bayesian cohort component projection model estimates for 2019, compared to the observed 2019 census counts, and the proportion of census counts above and below the prediction intervals. If the model is well-calibrated, on average around 90% of the observed census counts should fall within the 90% prediction intervals, with 5% of observations falling above and 5% below the interval. Table 2 reports coverage by age group, and suggests that in general the coverage of the intervals matches expectations. However, in some age groups, there is a relative bias towards observations falling below the interval rather than above.

| Age group | Prop in interval | Prop above | Prop below |
|---|---|---|---|
| 15 | 0.89 | 0.02 | 0.08 |
| 20 | 0.89 | 0.00 | 0.09 |
| 25 | 0.89 | 0.04 | 0.04 |
| 30 | 0.91 | 0.06 | 0.02 |
| 35 | 0.87 | 0.01 | 0.09 |
| 40 | 0.92 | 0.04 | 0.04 |
| 45 | 0.87 | 0.04 | 0.05 |

Table 2: Proportion of 2019 census county counts falling within, above, and below the 90% prediction intervals as estimated by the Bayesian CCP model.

We also calculated the probability integral transform (PIT) to assess the consistency between the 2019 projections and observed counts.
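The validation summaries described in this section (relative errors as in Equation 17, their mean, median and RMSE, and interval coverage) can be computed along the following lines. This is a hedged sketch with hypothetical function names and made-up numbers, not the code used to produce Tables 1 and 2.

```python
import math
import statistics

def relative_errors(observed, projected):
    """e_g = (y_g - eta_hat_g) / y_g for each group g (form of Equation 17)."""
    return [(y - p) / y for y, p in zip(observed, projected)]

def error_summary(errors):
    """Mean, median and root mean squared relative error."""
    squared = [e * e for e in errors]
    return {
        "mean": statistics.fmean(errors),
        "median": statistics.median(errors),
        "rmse": math.sqrt(statistics.fmean(squared)),
    }

def coverage(observed, lower, upper):
    """Share of observations inside, above and below the prediction intervals."""
    n = len(observed)
    below = sum(y < lo for y, lo in zip(observed, lower)) / n
    above = sum(y > hi for y, hi in zip(observed, upper)) / n
    return {"within": 1.0 - below - above, "above": above, "below": below}

# Made-up census counts, projections and interval bounds for four groups
observed = [100.0, 200.0, 400.0, 500.0]
projected = [90.0, 210.0, 400.0, 450.0]
lower = [80.0, 205.0, 380.0, 460.0]
upper = [110.0, 230.0, 420.0, 520.0]

errors = relative_errors(observed, projected)
summary = error_summary(errors)
cov = coverage(observed, lower, upper)
```

With this sign convention, a positive error means the projection fell below the census count, and an observation counted as "below" lies under the lower bound of its interval.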
The PIT results are presented in Appendix E.

Discussion

In this paper we proposed a Bayesian cohort component projection framework to estimate adult subnational populations with limited amounts of data available. The model uses information on population and migration counts from censuses, as well as mortality patterns from national schedules, to reconstruct populations based on cohorts moving through time. The modeling framework also naturally extends to allow projection of populations. In addition, the model ensures the national populations implied by the sum of subnational areas agree with national pre-published UN WPP estimates.

The model was used to estimate and project populations of women of reproductive ages (WRA) for counties in Kenya over the period 1979-2019. Results suggested continued growth of WRA populations in all districts, and accelerated growth in particular in areas such as Nairobi and Northeastern. The mortality component of the modeling framework highlighted the stagnating progress through the 1990s and 2000s, largely due to HIV/AIDS, but more recent mortality declines. The estimates from the Kenyan example also highlighted substantial differences in internal migration patterns across the nation.

The model requires only inputs from national censuses and WPP estimates, which are available for the majority of countries. Thus, while the model was tested on estimation in Kenya, the methodology is applicable to a wide range of countries with very few alterations. For example, there is currently census microdata available for almost 100 countries on the IPUMS-International website.

Based on a series of validation measures, the proposed model outperformed a benchmark model of linear interpolation. In addition to having lower performance than Bayes CCP, note that with the simple interpolation method, it is not possible to get estimates by county easily, because 2009 is the first year in which the counties as they are today were recorded.
In addition, another advantage of the Bayesian model is that the population estimates also have an associated uncertainty level, and that estimating not only population counts but also mortality and migration rates allows us to better understand the drivers of population change by county.

There are several other advantages and contributions of this modeling framework to the estimation of subnational populations. The model is governed by a cohort component projection model, tracking cohorts as they move through time. This has advantages over more aggregate techniques such as interpolation and extrapolation, because it allows us to understand trends in overall population as a process governed by separate components that add or remove population. In addition, this process takes into account intercensal events such as trends in HIV/AIDS mortality and produces estimates and projections with uncertainty.

Secondly, the modeling framework proposes a parsimonious model for internal net-migration across subnational areas. In cohort component models, it is often the case that migration components are assumed to be negligible or considered to just be the residual once mortality has been taken into account. Very little data usually exists on migration patterns, and estimation of all migration components by age, region and year becomes very intensive. After observing key patterns in the data, we proposed a net-migration model which separates migration patterns into independent age and time components. The result is an age-specific net migration model with parameters that are easier to estimate when data are limited.

More broadly, one of the contributions of our proposed framework over existing work in this area is the use of mortality and migration models that have relatively strong functional forms, which allow plausible estimates to be produced even in the absence of good-quality data.
Our approach to modeling mortality through the use of characteristic age patterns is inspired by the long demographic tradition of using model life tables where information on mortality is sparse.

While we have illustrated the utility of this approach in data-limited contexts, the framework can naturally be extended to include additional sources of data. For example, if there exist observations of age-specific mortality rates at the subnational level (even at some ages), these data could be used as inputs to the mortality model. If more reliable data on internal migration flows were available, the existing migration process model (which assumes a fixed age schedule with varying magnitude over time) could be reformulated to be more flexible. In general, to be able to handle population projection in a low data availability context, the model proposed here includes mortality and migration process models that separate age and time trends into independent effects. Additional age-time specific effects were then captured by the multiplier $\varepsilon$. If more data are available, the underlying process models could be extended to better understand these age-time specific effects and how they relate to either mortality or migration.

Another possible extension of this framework is to include other total population estimates such as those from WorldPop as additional "data" that could be used to inform estimates. As such, we view this methodology and subnational population estimates produced from it as complementary to estimates produced by other efforts such as the WorldPop project.
As mentioned in Section 2, the primary goal of the WorldPop estimates is to produce extremely fine-grained estimates of total population, whereas we are more interested in understanding population patterns by age and sex and the underlying components of population change within larger subnational areas.

The incorporation of a cohort component projection model into a probabilistic setting allows for different sources of uncertainty, such as sampling and non-sampling error, to be included in the modeling process. The Bayesian hierarchical framework allows information from different data sources to be consolidated without the need for post-estimation redistribution changes, as is often the case with subnational population estimation (Swanson and Tayman 2012). In addition, it allows for increased flexibility in modeling population processes compared to traditional deterministic techniques, while still keeping the basis of an underlying demographic process.

A Other potential data sources

We use census data and WPP estimates as inputs to the model. There are other available data sources that could be used as inputs. These sources and the reasons for not including them are discussed below.

A.1 Mortality

Mortality is estimated at the subnational level based on national patterns of mortality from WPP, as well as changes in subnational population counts over time. Thus, no explicit information on subnational mortality levels is used; mortality is estimated based on likely patterns at the national level and intercensal changes in population. There are two main sources for subnational mortality data in Kenya that are not included as data inputs.

Firstly, the Demographic and Health Survey (DHS) collects information about sibling mortality histories. Adult mortality can be calculated from these data using the sibling history method, where cohorts of siblings are constructed and age-specific mortality rates are calculated based on when they died. Previous research has illustrated that sibling data produce relatively reliable estimates at the national level (Masquelier 2013). However, the DHS does not ask for the location of residence of the siblings who died, thus the data cannot be used to inform differentials in subnational mortality.

A second source of information on subnational mortality comes from a question about household deaths that was collected in the most recent census (2009). This can be used to obtain death probabilities by age. However, previous research has found that the value of q implied by household deaths is often much lower or higher than other mortality sources (Masquelier et al. 2017). Indeed, mortality information from census household deaths is excluded from other mortality analyses due to its unreliable nature (e.g. child mortality, see UN-IGME (2017)). As such, we chose to omit this information for now. Future work will investigate this data source to see if it can be used to inform age patterns of mortality by subnational region.

A.2 Migration

There are two other potential sources of information on internal migration in Kenya that are not included as data inputs.
Firstly, the census also includes a question about how many years the person has resided in their current locality of residence, referring to the district level. The question is asked in the 1999 and 2009 censuses. Based on the year of the census and the age of the respondent, as well as how many years they indicated they had lived in the current locality, the implied year and age of in-migration can be calculated. However, this method gave much lower numbers of in-migration compared to those implied by the 'location one year ago' question. As such, this information was not used in the model.

Secondly, the DHS contains some information about migration. For Kenya, it is possible to obtain information about the proportion of the population who moved to a particular province in the year before the survey. However, when compared to corresponding data from the census, there were large discrepancies, and trends in DHS proportions were erratic over time. Note that questions about migration in the DHS differ by country. The migration questions in the Kenya DHS are quite minimal; however, for other countries there may be more useful data available.

B Full Model Specification

The full model speciﬁcation is described below.

B.1 Population

B.1.1 Cohort component projection model

The underlying population by age group, year and county $\eta_{a,t,c}$ is

$$\eta_{a,t,c} = \left(\eta_{a-1,t-1,c} \cdot (1 - \gamma_{a-1,t-1,c})\right) \cdot (1 + \phi_{a-1,t-1,c}) \cdot \varepsilon_{a-1,t-1,c}, \tag{18}$$

where $\gamma_{a,t,c}$ is the conditional probability of death in age group $a$, year $t$ and county $c$, $\phi_{a,t,c}$ is net migration (that is, in- minus out-migration) as a proportion of population size and $\varepsilon_{a,t,c}$ is an additional age-year-county multiplier.

B.1.2 Data model

The data model is:log y i | η a,t,c ∼  N (cid:16) log η a [ i ] ,t [ i ] ,c [ i ] , s y [ i ] (cid:17) if t = 2009 ,N (cid:16) log P c ∈ d [ i ] ( η a [ i ] ,t [ i ] ,c [ i ] ) , s y [ i ] (cid:17) if t < , (19)where y i is i th observed population count, s y is the sampling error based on the fact that themicro-data in IPUMS is a 10% sample. The second case of the above equation dictates that if wehave observations prior to 2009, we can only relate these to η a,t,c ’s that have been summed to thedistrict level. 31 .1.3 National constraints We constrain the sum of the county populations by age and year to be within approximately 10% ofthe national estimates produced by WPP:Λ a,t < P c log η a,t,c ≤ Ω a,t , (20)log Λ a,t ∼ N (log 0 . W P P a,t , . ) T ( , log W P P a,t ) , (21)log Ω a,t ∼ N (log 1 . W P P a,t , . ) T (log W P P a,t , ) . (22) B.1.4 Priors on ﬁrst year and age group

The cohort component projection framework requires priors to be placed on populations in the ﬁrstyear and age group. We use the following priors:log η ,t,c ∼ N (log W P P ,t + log prop ,t,c , . ) , (23)log η a, ,c ∼ N (log W P P a, + log prop a, ,c , . ) , (24)where W P P a,t is the national-level population count from WPP in the relevant age group and year,and prop a,t,c is the proportion of the total population in the relevant age, year and county, whichwas calculated based on interpolating census year proportions and assuming the proportion of adistrict’s population in each county was constant at a level equal to 2009.
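To see why values are needed for both the first year and the first age group, the recursion in Equation 18 can be written as a short loop for a single county. This is a deterministic sketch with hypothetical names and made-up inputs, assuming the age-group width equals the time step so that a cohort moves from cell (a-1, t-1) to cell (a, t); in the model itself these quantities are parameters with priors rather than fixed numbers.

```python
def project_population(first_year, first_age, gamma, phi, eps):
    """Fill eta[a][t] for one county using the recursion in Equation 18.

    first_year : eta[a][0] for every age group a (length A)
    first_age  : eta[0][t] for every year t (length T)
    gamma, phi, eps : A x T nested lists holding the death probability,
                      net migration rate and multiplier for each cell
    """
    A, T = len(first_year), len(first_age)
    eta = [[0.0] * T for _ in range(A)]
    for a in range(A):
        eta[a][0] = first_year[a]      # boundary: first year
    for t in range(T):
        eta[0][t] = first_age[t]       # boundary: youngest age group
    for a in range(1, A):
        for t in range(1, T):
            prev = eta[a - 1][t - 1]   # same cohort, one step earlier
            eta[a][t] = (prev * (1.0 - gamma[a - 1][t - 1])
                         * (1.0 + phi[a - 1][t - 1]) * eps[a - 1][t - 1])
    return eta

# Made-up inputs: 3 age groups, 3 time points, constant rates
A, T = 3, 3
gamma = [[0.1] * T for _ in range(A)]
phi = [[0.05] * T for _ in range(A)]
eps = [[1.0] * T for _ in range(A)]
eta = project_population([1000.0, 900.0, 800.0], [1000.0, 1100.0, 1200.0],
                         gamma, phi, eps)
```

Every interior cell depends only on the cell diagonally before it, so the first row and first column fully determine the rest of the surface once mortality, migration and the multipliers are given.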

B.2 Mortality

The model for mortality is

$$\text{logit}\, \gamma_{a,t,c} = \alpha_{0,c} + Y_{a,0} + \beta_{t,c,1} \cdot Y_{a,1} + \beta_{t,c,2} \cdot Y_{a,2}, \tag{25}$$

where $Y_{a,0}$ is the mean age-specific logit mortality schedule of the national mortality curves and $Y_{\cdot,1}$ and $Y_{\cdot,2}$ are the first two principal components derived from national-level mortality schedules. Modeling on the logit scale ensures the death probabilities are between zero and one.

The county-specific mortality intercepts are modeled using a Normal distribution centered at zero:

$$\alpha_{0,c} \mid \sigma_\alpha \sim N(0, \sigma_\alpha).$$

The county-specific coefficients $\beta_{t,c,k}$ are modeled as fluctuations around a national mean:

$$\beta_{t,c,k} = B^{nat}_{t,k} + \delta_{t,c,k}, \tag{26}$$

$$\delta_{t,c,k} \mid \delta_{t-1,c,k}, \sigma_\delta \sim N(\delta_{t-1,c,k}, \sigma_\delta), \tag{27}$$

where $B^{nat}_{t,k}$ are the national coefficients on principal components, derived from WPP data. The county-specific fluctuations are modeled as a random walk.

B.3 Migration

B.3.1 Process model

The process model for the net-migration is: φ a,t,c = ψ ina,t,c − ψ outa,t,c η a − ,t − ,c , (28) ψ ina,t,c = Ψ int,c · Π ina,c , (29) ψ outa,t,c = Ψ outt,c · Π outa,c , (30)where Ψ int,c and Ψ outt,c are the total number of in- and out-migrants, respectively, and Π ina,c and Π outa,c are the relevant age distributions. We model the total counts as a second order random walk toimpose a certain level of smoothness in the counts over time:Ψ in ,c ∼ U (0 , y c ) , (31)log Ψ in ,c | Ψ in ,c , σ in ∼ N (log Ψ in ,c , σ in ) , (32)log Ψ t,c in | Ψ in ( t − t − ,c , σ in ∼ N (2 log Ψ int − ,c − log Ψ int − ,c , σ in ) , (33)Ψ out ,c ∼ U (0 , y k [ c ]) , (34)log Ψ out ,c | Ψ out ,c , σ out ∼ N (log Ψ out ,c , σ out ) , (35)log Ψ outt,c | Ψ out ( t − t − ,c , σ out ∼ N (2 log Ψ outt − ,c − log Ψ outt − ,c , σ out ) . (36)33here y c refers to the observed total population for county c based on the census in the ﬁrstobservation period.We place Uniform priors on the non-normalized age distributions of in- and out-migration, withequal prior probability on each age group:Π in ∗ a,c ∼ Uniform(0 , , (37)Π out ∗ a,c ∼ Uniform(0 , . (38)We then normalize the age distributions asΠ ina,c = Π in ∗ a,c P a Π in ∗ a,c , (39)Π outa,c = Π out ∗ a,c P a Π out ∗ a,c . (40) B.3.2 Data model

We relate the observed age-specific in- and out-migration counts in the censuses, denoted $M^{in}_i$ and $M^{out}_i$, respectively, to the underlying true counts $\psi^{in}_{a,t,c}$ and $\psi^{out}_{a,t,c}$ through the following data model:

$$\log M^{in}_i \mid \psi^{in}_{a,t,c} \sim \begin{cases} N\left(\log \psi^{in}_{a[i],t[i],c[i]},\ s_{in[i]}\right) & \text{if } t[i] = 2009, \\ N\left(\log \sum_{c \in d[i]} \psi^{in}_{a[i],t[i],c},\ s_{in[i]}\right) & \text{if } t[i] < 2009, \end{cases} \tag{41}$$

$$\log M^{out}_i \mid \psi^{out}_{a,t,c} \sim \begin{cases} N\left(\log \psi^{out}_{a[i],t[i],c[i]},\ s_{out[i]}\right) & \text{if } t[i] = 2009, \\ N\left(\log \sum_{c \in d[i]} \psi^{out}_{a[i],t[i],c},\ s_{out[i]}\right) & \text{if } t[i] < 2009. \end{cases} \tag{42}$$

B.3.3 Constraint

Using the fact that the sum of all internal migration for a particular age group and year should be around zero, we implement the following constraint:

$$\sum_c -0.1\, \eta_{a,t,c} < \sum_c \psi^{in}_{a,t,c} - \sum_c \psi^{out}_{a,t,c} \leq \sum_c 0.1\, \eta_{a,t,c}. \tag{43}$$

The constraint states that the difference between the sum of all in- and out-migration flows across all counties cannot be more than ±10% of the total estimated national population for that particular age group and year.
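A direct check of the balance condition in Equation 43 for one age group and year might look like the sketch below. The function name and the example numbers are hypothetical; in the fitted model the constraint acts on parameters during estimation rather than being checked after the fact.

```python
def migration_balanced(psi_in, psi_out, eta, tolerance=0.10):
    """Check the form of Equation 43 for one age group and year.

    psi_in, psi_out : in- and out-migrant counts by county
    eta             : population counts by county
    tolerance       : allowed imbalance as a share of total population
    """
    net = sum(psi_in) - sum(psi_out)
    bound = tolerance * sum(eta)
    return -bound < net <= bound

# Made-up example with three counties
eta = [10000.0, 8000.0, 5000.0]
ok = migration_balanced([500.0, 100.0, 50.0], [200.0, 300.0, 120.0], eta)
bad = migration_balanced([5000.0, 0.0, 0.0], [0.0, 0.0, 0.0], eta)
```

In the first case the net imbalance is 30 people against a bound of 2,300, so the condition holds; in the second the imbalance of 5,000 exceeds the bound.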

B.4 Age-time multiplier

We model multipliers on the log scale, and to ensure identifiability we assume the log multipliers sum to zero over age groups. This constraint is implemented through the re-parameterization:

$$\log \varepsilon_{A,t,c} = D'(DD')^{-1} \zeta_{A-1,t,c}, \tag{44}$$

$$\zeta_{a,t,c} \sim N(0, \sigma_\zeta), \tag{45}$$

where $D$ is a first-order difference matrix (with $D_{i,i} = -1$, $D_{i,i+1} = 1$, and $D_{i,j} = 0$ otherwise) such that $\zeta_{a,t,c} = \log \varepsilon_{a,t,c} - \log \varepsilon_{a-1,t,c}$.

B.5 Priors on variance parameters

All variance parameters that are estimated ($\sigma_\alpha$, $\sigma_\delta$, $\sigma_\zeta$, $\sigma_{in}$ and $\sigma_{out}$) have standard half-Normal priors placed on them, i.e. $\sigma \sim N^{+}(0, 1)$.

C Age patterns in migration data

In the Bayesian cohort component model, specifically in the migration process model, we assume the age distribution of in- and out-migrants by count is constant over time (see Equations 13 and 14). This is a somewhat strong assumption and was made to ensure identifiability of all parameters in the model in cases where we do not have very much data. While the assumption is relatively strong, it was motivated by age patterns observed in census data. Figures 9 and 10 show the proportion of all in- and out-migrants by age group for each year and district, and illustrate that the age patterns remain remarkably constant over time. For reference, the broad areas covered by the IPUMS districts are listed in Table 3.

[Figure 9 plot labels: age_group, proportion of total in-migrants, year.]

Figure 9: Observed age patterns of in-migration from Kenyan censuses, 1979-2009.

[Figure 10 plot labels: age_group, proportion of total out-migrants, year.]

Figure 10: Observed age patterns of out-migration from Kenyan censuses, 1979-2009.

District    Areas
404001001   Nairobi East, Nairobi North, Nairobi West, Westlands
404002001   Gatanga, Gatundu, Githunguri, Kiambu (Kiambaa), Kikuyu, Lari, Muranga, Nyandarua, Ruiru, Thika, Maragua
404004001   Chalbi, Laisamis, Marsabit, Moyale
404004002   Garba Tulla, Igembe, Imenti, Isiolo, Maara, Meru, Tharaka, Tigania, Meru
404004003   Embu, Kangundo, Kibwezi, Machakos, Makueni, Mbeere, Mbooni, Mwala, Nzaui, Yatta
404004004   Kitui North, Kitui South (Mutomo), Kyuso, Mwingi
404005001   Fafi, Garissa, Ijara, Lagdera
404005002   Wajir East, Wajir North, Wajir South, Wajir West
404005003   Mandera Central, Mandera East, Mandera West
404006001   Bondo, Rarieda, Siaya
404006002   Kisumu East, Kisumu West, Nyando
404006003   Homa Bay, Kuria East, Kuria West, Migori, Rachuonyo, Rongo, Suba
404002002   Nyeri North, Nyeri South
404006004   Borabu, Gucha, Gucha South, Kisii Central, Kisii South, Manga, Masaba, Nyamira, North Kisii
404007001   Turkana Central, Turkana North, Turkana South
404007002   Pokot Central, Pokot North, West Pokot
404007003   Samburu Central, Samburu East, Samburu North
404007004   Kwanza, Trans Nzoia East, Trans Nzoia West
404007005   Baringo, Baringo North, East Pokot, Koibatek, Laikipia East, Laikipia North, Laikipia West
404007006   Eldoret East, Eldoret West, Wareng, Uasin Gishu
404007007   Keiyo, Marakwet, Elgeyo Markwet
404007008   Nandi Central, Nandi East, Nandi North, Nandi South, Tinderet
404007009   Kaijiado Central, Kaijiado North, Loitoktok, Molo, Naivasha, Nakuru, Nakuru North, Kajiado
404002003   Kirinyaga
404007010   Narok North, Narok South, Trans Mara
404007011   Bomet, Buret, Kericho, Kipkelion, Sotik
404008001   Butere, Emuhaya, Hamisi, Kakamega, Lugari, Mumias, Vihiga, Butere/Mumias
404008002   Bungoma East, Bungoma North, Bungoma South, Bungoma West, Mt. Elgon
404008003   Bunyala, Busia, Samia, Teso North, Teso South
404888001   Waterbodies
404003001   Kilindini, Kilindini, Mombasa
404003002   Kinango, Kwale, Msambweni
404003003   Kaloleni, Kilifi, Malindi
404003004   Tana Delta, Tana River
404003005   Lamu
404003006   Taita, Taveta, Taita Taveta

Table 3: IPUMS district codes and areas covered
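The constancy of migrant age patterns discussed at the start of this section can be checked numerically as well as visually. The following is a minimal sketch, not the authors' code: it converts migrant counts for one district into age shares per census year, then reports the largest across-year change in any age group's share. The function names (`age_shares`, `max_share_range`) and the counts are hypothetical and made up for illustration; they are not census data.

```python
from collections import defaultdict


def age_shares(counts):
    """Convert migrant counts into age shares per year.

    counts: dict mapping (year, age_group) -> migrant count for one district.
    Returns a dict: year -> {age_group: share of that year's migrants}.
    """
    totals = defaultdict(float)
    for (year, _age), n in counts.items():
        totals[year] += n
    shares = defaultdict(dict)
    for (year, age), n in counts.items():
        shares[year][age] = n / totals[year]
    return dict(shares)


def max_share_range(shares):
    """Largest across-year range (max minus min) of any age group's share."""
    ages = {age for by_age in shares.values() for age in by_age}
    return max(
        max(by_age.get(age, 0.0) for by_age in shares.values())
        - min(by_age.get(age, 0.0) for by_age in shares.values())
        for age in ages
    )


# Made-up in-migrant counts for one hypothetical district (NOT census data).
toy_counts = {
    (1979, "15-19"): 200, (1979, "20-24"): 500, (1979, "25-29"): 300,
    (2009, "15-19"): 420, (2009, "20-24"): 1000, (2009, "25-29"): 580,
}
toy_shares = age_shares(toy_counts)
toy_range = max_share_range(toy_shares)  # small value => stable age pattern
```

In this toy example the total number of migrants doubles between the two years while the age shares move by at most one percentage point, which is the kind of pattern the constant-age-distribution assumption relies on.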

Additional results

In this section we highlight several other components that are estimated within the model; specifically the coefficients on the first and second principal components. Results are illustrated on three example counties: Nairobi, Wajir and Baringo. Additionally, we show estimates for the age-time multiplier for all counties.

Figure 11 shows estimates over time of the coefficients of the first and second principal components within the mortality model (i.e. $\beta_{t,c,1}$ and $\beta_{t,c,2}$). Broadly, the first principal component relates to overall mortality improvement, and the second relates to the effect of the HIV/AIDS epidemic. Coefficients on the first component suggest mortality improvement is relatively slow in Nairobi, and better than the national average in Wajir. Based on patterns on the second principal component, there is evidence to suggest that the effect of the HIV/AIDS epidemic was relatively small in Wajir (Figure 11). In both cases, estimates for Baringo are not significantly different from the national mean.

Figure 12 shows the estimated age-time specific multiplier for all counties. As can be seen, the estimates on the log scale are very close to zero for the majority of age groups, years and counties.

[Figure 11 panels, each showing Nairobi, Wajir and Baringo over 1980-2020 (x-axis: year; y-axis: coefficient estimate):
(a) Estimates for first component (region deviations from national mortality trends)
(b) Estimates for second component (region deviations from national HIV/AIDS mortality)]

Figure 11: County-specific deviations from national-level mortality improvements (first component) and HIV/AIDS mortality (second component) for three counties.

[Figure 12 plot labels: one panel per county; x-axis: age group; y-axis: log multiplier; legend: Year.]

Figure 12: Age-time specific multipliers for all counties.

PIT histogram

A Probability Integral Transform (PIT) histogram is a tool for evaluating the similarity between model projections and left-out observations. The predictive distributions of the projections are compared with the actual observations.

For each observation $j$ for 2019 (i.e. each population count by age group and county) we have observation $y_j$ from the 2019 census, and samples $\hat{\eta}_j^{(s)}$, $s = 1, \dots, S$, from the corresponding posterior distribution (with a total of $S$ samples). The PIT for observation $j$ was calculated as

$$\mathrm{PIT}_j = \frac{\sum_{s=1}^{S} \mathbb{1}\left(\hat{\eta}_j^{(s)} \le y_j\right)}{S}. \tag{46}$$

If the predictive distribution is well calibrated, the PIT values should be approximately uniformly distributed. Figure 13 shows the PIT histogram for 2019. The relatively high density in the middle of the distribution suggests the model is somewhat over-dispersed, and the low density towards 1 suggests the upper bound of population projections is in general too conservative.
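The PIT calculation in Equation 46 can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function name `pit_values`, the nested-list data layout, and all numbers are hypothetical. The second half uses purely synthetic draws to show what a calibrated case looks like, where the observation and the posterior samples come from the same distribution.

```python
import random


def pit_values(samples, observed):
    """PIT_j = (1/S) * sum_s 1[samples[s][j] <= observed[j]].

    samples:  list of S posterior draws, each a list of J values.
    observed: list of J held-out observations.
    """
    n_samples = len(samples)
    return [
        sum(1 for draw in samples if draw[j] <= y_j) / n_samples
        for j, y_j in enumerate(observed)
    ]


# Hand-checkable toy case: S = 4 draws, J = 2 observations (made-up numbers).
toy_samples = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
toy_observed = [2.5, 5.0]
toy_pit = pit_values(toy_samples, toy_observed)  # two of four draws <= 2.5; none <= 5.0

# Synthetic calibrated case: observations and draws share one distribution,
# so the PIT values should be spread roughly evenly over [0, 1].
random.seed(0)
J, S = 2000, 200
draws = [[random.gauss(0.0, 1.0) for _ in range(J)] for _ in range(S)]
obs = [random.gauss(0.0, 1.0) for _ in range(J)]
pits = pit_values(draws, obs)

# Ten equal-width bins; a PIT of exactly 1.0 is placed in the last bin.
counts = [0] * 10
for p in pits:
    counts[min(int(p * 10), 9)] += 1
```

A predictive distribution that is too wide would instead pile PIT values into the middle bins, and one that is too narrow would pile them into the bins nearest 0 and 1.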

Figure 13: PIT histogram comparing projected 2019 population counts with observed 2019 census counts.