Revisiting Identifying Assumptions for Population Size Estimation
Serge Aleshin-Guendel*, Mauricio Sadinle, and Jon Wakefield

Department of Biostatistics, University of Washington, Seattle, Washington, U.S.A.
Department of Statistics, University of Washington, Seattle, Washington, U.S.A.

*email: [email protected]

January 26, 2021
Abstract
The problem of estimating the size of a population based on a subset of individuals observed across multiple data sources is often referred to as capture-recapture or multiple-systems estimation. This is fundamentally a missing data problem, where the number of unobserved individuals represents the missing data. As with any missing data problem, multiple-systems estimation requires users to make an untestable identifying assumption in order to estimate the population size from the observed data. Approaches to multiple-systems estimation often do not emphasize the role of the identifying assumption during model specification, which makes it difficult to decouple the specification of the model for the observed data from the identifying assumption. We present a re-framing of the multiple-systems estimation problem that decouples the specification of the observed-data model from the identifying assumptions, and discuss how log-linear models and the associated no-highest-order interaction assumption fit into this framing. We present an approach to computation in the Bayesian setting which takes advantage of existing software and facilitates various sensitivity analyses. We demonstrate our approach in a case study of estimating the number of civilian casualties in the Kosovo war. Code used to produce this manuscript is available at https://github.com/aleshing/revisiting-identifying-assumptions.

Keywords: Capture-recapture; Missing data; Multiple-systems estimation; Sensitivity analysis.

1 Introduction
Estimating the size of a closed population is a common problem in many fields, including ecology (Otis et al., 1978), epidemiology (Hook and Regal, 1995), official statistics (Anderson and Fienberg, 1999), and human rights (Ball et al., 2002). The available data typically take the form of multiple lists which record information on a subset of individuals in a population. When there exists a mechanism to identify which individuals are the same across lists, multiple-systems estimation (MSE), also known as capture-recapture, provides an approach to estimating the population size based on the overlap of the lists (Bird and King, 2018).

MSE is at its heart a missing data problem, as we do not observe all individuals in the population of interest (see e.g. Fienberg and Manrique-Vallier, 2009; Manrique-Vallier, 2016). As is the case in any missing data problem, MSE requires users to make an untestable identifying assumption about how the observed individuals relate to the unobserved individuals in order to estimate the population size from the observed data. We believe that this requirement is not sufficiently appreciated, as it is usually conflated with model specification, which involves both making an identifying assumption and specifying a model for the observed data. See for example Fienberg (1972), who wrote “... we are assuming that the model which describes the observed data also describes the count of the unobserved individuals. We have no way of checking this assumption,” and Manrique-Vallier et al. (2013), who wrote “The arguably most basic assumption in MSE is that the noninclusion of the fully unobserved individuals ... can be represented by the same model that represents the inclusion (and noninclusion) of those we can observe in at least one list. This is a strong and untestable condition.”

This conflation of identifying assumption specification and model specification has led practitioners to perform model evaluation by comparing a suite of model fits that are the results of both fundamentally different identifying assumptions and different model specifications for the observed data (see e.g. Sadinle, 2018; Manrique-Vallier et al., 2019; Silverman, 2020). This makes it essentially impossible to disentangle whether differences in inferences are due to differences in identifying assumptions, model specifications for the observed data, or some combination.

In this article, we propose a Bayesian approach to MSE that places the identifying assumption front and center in the MSE workflow. We first revisit the framing of MSE as a missing data problem in Section 2. Section 3 reviews two common MSE models—log-linear and latent class models—through our missing data framing. In Section 4 we focus on the identifying assumption associated with log-linear models, and describe how it can be used as a building block for alternative identifying assumptions. In Section 5 we present an approach to computation in the Bayesian setting that allows us to use any identifying assumption in conjunction with any prior specification by taking advantage of existing software. This approach facilitates sensitivity analyses that examine the impact of both the prior and the identifying assumption. Finally, in Section 6 we illustrate our approach in a case study of estimating the number of civilian casualties in the Kosovo war.
Suppose we have a closed population of $N$ individuals, of which $n < N$ are observed by one or more of $K$ lists. Let $\mathcal{H} = \{0,1\}^K$ denote the possible patterns of inclusion of the individuals in the lists, let $\mathcal{H}^* = \mathcal{H} \setminus \{(0,\dots,0)\}$ denote the possible subsets of lists in which each of the $n$ observed individuals could have been observed, and let $x_i \in \mathcal{H}$ denote the subset of lists in which individual $i$ was included. For example, with $K = 3$, $x_i = (0,1,1)$ indicates that individual $i$ was observed in lists 2 and 3, but not list 1.

These data for the $N$ individuals can be gathered into a $2^K$ contingency table of list overlap, where the cells of the table are indexed by $h \in \mathcal{H}$, with counts $n_h = \sum_{i=1}^N I(x_i = h)$. We do not observe the count for cell $(0,\dots,0)$, $n_0 := n_{(0,\dots,0)} = N - n$, which records the number of individuals missing from all lists, so the observed contingency table is incomplete. Let $\boldsymbol{n} = \{n_h\}_{h \in \mathcal{H}^*}$ denote the counts of the incomplete contingency table. The unobserved cell count $n_0$, or equivalently the population size $N$, is the target of inference.

2.2 The Complete-Data Distribution

Under independent and identically distributed (i.i.d.) sampling of individuals by the lists, the $2^K$ contingency table of counts is multinomially distributed, i.e.,

$$n_0, \boldsymbol{n} \mid N, \boldsymbol{\pi} \sim \text{Multinomial}(N, \boldsymbol{\pi}), \quad (1)$$

where $\boldsymbol{\pi} = \{\pi_h\}_{h \in \mathcal{H}} \in \mathcal{S}^{2^K - 1}$ is a set of cell probabilities, and $\mathcal{S}^d = \{(a_1, \dots, a_{d+1}) \in \mathbb{R}^{d+1} \mid \sum_{i=1}^{d+1} a_i = 1,\ a_i > 0\ \forall i\}$ denotes the $d$-dimensional probability simplex. We will refer to the model in (1) as the complete-data distribution, for which the evaluation relies on knowing the complete $2^K$ contingency table of counts. In general, the parameter space for this model will be some subset of $\Theta = \{N, \boldsymbol{\pi} \mid N \in \mathbb{N}, \boldsymbol{\pi} \in \mathcal{S}^{2^K - 1}\}$, which we will refer to as the complete-data parameterization. As shown in Appendix A, when individuals are not i.i.d. sampled, but are sampled independently with cell probabilities drawn i.i.d. from some mixing distribution on $\mathcal{S}^{2^K - 1}$, we also arrive at the model in (1). This is the case for common models for heterogeneity such as the $M_h$ and $M_{th}$ models of Otis et al. (1978).

It is instructive to decompose the complete-data distribution as

$$p(n_0, \boldsymbol{n} \mid N, \boldsymbol{\pi}) = N! \prod_{h \in \mathcal{H}} \frac{\pi_h^{n_h}}{n_h!} = L_1(N, \pi_0 \mid n)\, L_2(\tilde{\boldsymbol{\pi}} \mid \boldsymbol{n}), \quad (2)$$

with $L_1(N, \pi_0 \mid n) = \binom{N}{n} \pi_0^{N-n} (1 - \pi_0)^n$ and $L_2(\tilde{\boldsymbol{\pi}} \mid \boldsymbol{n}) = n! \prod_{h \in \mathcal{H}^*} \tilde{\pi}_h^{n_h} / n_h!$, where $\pi_0 := \pi_{(0,\dots,0)} = 1 - \sum_{h \in \mathcal{H}^*} \pi_h$ is the probability of being missing from every list, and $\tilde{\pi}_h = \frac{\pi_h}{1 - \pi_0} = \frac{\pi_h}{\sum_{h' \in \mathcal{H}^*} \pi_{h'}}$ is the probability of being observed in the subset of the lists $h$ conditional on being observed in at least one list. $L_1$ is a binomial likelihood for $n$, which has been well studied in the related binomial $N$ problem literature (see e.g. Rukhin, 1975). $L_2$ is a multinomial likelihood for the observed data $\boldsymbol{n}$ conditional on their sum $n$, referred to as the conditional likelihood (Fienberg, 1972). We will refer to $\pi_0$ as the unobserved cell probability and to $\tilde{\boldsymbol{\pi}}$ as the observed cell probabilities. This decomposition hints at an alternative to the complete-data parameterization $\Theta$, namely $\Theta^* = \{N, \pi_0, \tilde{\boldsymbol{\pi}} \mid N \in \mathbb{N}, \pi_0 \in (0,1), \tilde{\boldsymbol{\pi}} \in \mathcal{S}^{2^K - 2}\}$, which we will refer to as the observed-data parameterization. The two parameterizations are equivalent, so we will work with whichever is more convenient for exposition.

2.4 Identifiability

Before performing inference in a statistical model, it is important to check that the model is identifiable. For $\theta \in \Theta^*$, let $P_\theta$ denote the complete-data distribution at the set of parameters $\theta$. Consider the following standard definition of identifiability:

Definition 1.
The statistical model $P_\Omega = \{P_\theta \mid \theta \in \Omega \subset \Theta^*\}$ is identifiable if for all $\theta_1, \theta_2 \in \Omega$, $P_{\theta_1} = P_{\theta_2}$ implies that $\theta_1 = \theta_2$. Equivalently, $P_\Omega$ is identifiable if for all $\theta_1 = \{N, \pi_0, \tilde{\boldsymbol{\pi}}\}, \theta_2 = \{N', \pi_0', \tilde{\boldsymbol{\pi}}'\} \in \Omega$, $L_1(N, \pi_0 \mid n) L_2(\tilde{\boldsymbol{\pi}} \mid \boldsymbol{n}) = L_1(N', \pi_0' \mid n) L_2(\tilde{\boldsymbol{\pi}}' \mid \boldsymbol{n})$ for all $\boldsymbol{n}$ implies that $\theta_1 = \theta_2$.

One can show that the unrestricted model $P_{\Theta^*}$ is identifiable. Since the goal is to estimate $N$, sufficiency might lead one to try to estimate $N$ and $\pi_0$ in the unrestricted model based solely on the binomial likelihood for $n$. Examining the likelihood surface for a given $n$, one finds a maximum at $N = n$ and $\pi_0 = 0$, with a ridge centered along the set $\{N \in \mathbb{N}, \pi_0 \in (0,1) \mid N(1 - \pi_0) \approx n\}$ that monotonically decreases as $N$ increases. In Figure 1 we plot this surface when $n = 100$. There is a fundamental problem in that two parameters are being estimated with one data point, which makes it impossible to construct an unbiased or consistent estimator of either $N$ or $\pi_0$ (DasGupta and Rubin, 2005; Farcomeni and Tardella, 2012). Thus the standard definition of identifiability is misleading in this setting, as it does not necessarily imply that the parameters are estimable in any traditional sense.
Figure 1: Likelihood surface of $L_1$ when $n = 100$.

We will instead use the following alternative definition of identifiability specific to MSE (Link, 2003; Holzmann et al., 2006):

Definition 2.
The statistical model $P_\Omega$ is conditionally identifiable if for all $\theta_1 = \{N, \pi_0, \tilde{\boldsymbol{\pi}}\}, \theta_2 = \{N', \pi_0', \tilde{\boldsymbol{\pi}}'\} \in \Omega$, $L_2(\tilde{\boldsymbol{\pi}} \mid \boldsymbol{n}) = L_2(\tilde{\boldsymbol{\pi}}' \mid \boldsymbol{n})$ for all $\boldsymbol{n}$ implies that $\pi_0 = \pi_0'$.

In a conditionally identifiable model, the conditional likelihood, $L_2$, identifies the unobserved cell probability, $\pi_0$. Clearly the unrestricted model $P_{\Theta^*}$ is not conditionally identifiable. Standard identifiability of the multinomial conditional likelihood tells us that we can equivalently state Definition 2 as follows: the statistical model $P_\Omega$ is conditionally identifiable if for all $\theta_1 = \{N, \pi_0, \tilde{\boldsymbol{\pi}}\}, \theta_2 = \{N', \pi_0', \tilde{\boldsymbol{\pi}}'\} \in \Omega$, $\tilde{\boldsymbol{\pi}} = \tilde{\boldsymbol{\pi}}'$ implies that $\pi_0 = \pi_0'$. Thus for a conditionally identifiable model, there exists a function $T : \tilde{\mathcal{T}} \to (0,1)$ that maps observed cell probabilities, $\tilde{\boldsymbol{\pi}}$, to unobserved cell probabilities, $\pi_0$, where $\tilde{\mathcal{T}} \subset \mathcal{S}^{2^K - 2}$. When the domain $\tilde{\mathcal{T}}$ of this function is not equal to $\mathcal{S}^{2^K - 2}$, this restricts the set of possible values for $\tilde{\boldsymbol{\pi}}$ in the model to $\tilde{\mathcal{T}}$. Any extra assumptions in the model involving $\tilde{\boldsymbol{\pi}}$ can then further restrict the set of possible values for $\tilde{\boldsymbol{\pi}}$ in the model to a set $\tilde{\mathcal{S}} \subset \tilde{\mathcal{T}}$. Thus conditionally identifiable models take the form $P_\Omega$, where $\Omega = \{N, \pi_0, \tilde{\boldsymbol{\pi}} \mid N \in \mathbb{N}, \pi_0 = T(\tilde{\boldsymbol{\pi}}), \tilde{\boldsymbol{\pi}} \in \tilde{\mathcal{S}}\}$.

When a model is not conditionally identifiable, we have no guarantees for when the parameters are estimable in any traditional sense. In particular, non-identifiability precludes consistent estimation as “there will be uncertainty in parameter estimates that is not washed out as more data are collected” (Linero, 2017). If a model $P_\Omega$ is conditionally identifiable, all parameters of the model can be consistently estimated (Sanathanan, 1972). However, we emphasize that the data needs to have been generated by a distribution in the model $P_\Omega$ for the parameters to be consistently estimable. In other words, in order to estimate the population size $N$, we need to assume a functional relationship, $T$, between the observed cell probabilities $\tilde{\boldsymbol{\pi}}$ and the unobserved cell probability $\pi_0$. This is the main idea behind MSE.

The framing in the previous section is motivated by our treatment of MSE as a missing data problem. The decomposition in (2) is related to the decomposition in the missing data literature of the complete-data distribution into the extrapolation distribution and the observed-data distribution (Hogan and Daniels, 2008). The extrapolation distribution captures how to extrapolate to the missing data given the observed data, which in our context corresponds to $L_1$. The observed-data distribution, as the name indicates, is the distribution of the observed data, which in this context corresponds to $L_2$.
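As a quick numerical sanity check on the decomposition in (2), the complete-data multinomial probability can be compared against the product $L_1 \cdot L_2$ for an arbitrary complete table; the $K = 2$ cell probabilities and counts below are made up for illustration.

```python
import math
from itertools import product

# Hypothetical K = 2 check of the decomposition in (2):
# p(n0, n | N, pi) = L1(N, pi0 | n) * L2(pi_tilde | n).
K = 2
cells = list(product([0, 1], repeat=K))                    # H = {0,1}^K
pi = {(0, 0): 0.4, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.3}  # made-up probs
counts = {(0, 0): 40, (0, 1): 22, (1, 0): 11, (1, 1): 27}  # includes n0

N = sum(counts.values())
n = N - counts[(0, 0)]                                     # number observed
pi0 = pi[(0, 0)]
pi_tilde = {h: pi[h] / (1 - pi0) for h in cells if h != (0, 0)}

# Complete-data multinomial probability
complete = math.factorial(N)
for h in cells:
    complete *= pi[h] ** counts[h] / math.factorial(counts[h])

# L1: binomial likelihood for n; L2: conditional multinomial likelihood
L1 = math.comb(N, n) * pi0 ** (N - n) * (1 - pi0) ** n
L2 = math.factorial(n)
for h in pi_tilde:
    L2 *= pi_tilde[h] ** counts[h] / math.factorial(counts[h])

assert abs(complete - L1 * L2) < 1e-9 * complete
print("complete-data probability:", complete, "  L1 * L2:", L1 * L2)
```

The equality holds exactly (up to floating point) for any table, since (2) is an algebraic identity rather than a modeling assumption.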
Following the analogy of the missing data literature, by restricting ourselves to models of the form $P_\Omega$, where $\Omega = \{N, \pi_0, \tilde{\boldsymbol{\pi}} \mid N \in \mathbb{N}, \pi_0 = T(\tilde{\boldsymbol{\pi}}), \tilde{\boldsymbol{\pi}} \in \tilde{\mathcal{S}}\}$, one is making an identifying assumption, $T$, about how $\tilde{\boldsymbol{\pi}}$ relates to $\pi_0$ in order to identify $\pi_0$.

The observed-data distribution is restricted when the set of possible values for the observed cell probabilities, $\tilde{\mathcal{S}}$, is not equal to $\mathcal{S}^{2^K - 2}$. Based on standard properties of the multinomial conditional likelihood, restrictions on the observed-data distribution are assumptions that are testable from the data. As noted in the previous section, these restrictions could be due to the domain, $\tilde{\mathcal{T}}$, of the identifying assumption (see Section 4.3 for an example), or due to extra modeling assumptions for the observed cell probabilities, $\tilde{\boldsymbol{\pi}}$ (see Section 3.1 for an example). When the observed-data distribution is not restricted by the model, i.e. $\tilde{\mathcal{S}} = \tilde{\mathcal{T}} = \mathcal{S}^{2^K - 2}$, we refer to the model as nonparametric identified (see Chapter 8 of Hogan and Daniels, 2008).

In the MSE literature, previous work has been concerned with determining when certain models are conditionally identified (see e.g. Link, 2003; Holzmann et al., 2006). Here we are concerned with determining both when and how models are conditionally identified. Since the validity of our inferences rests on the untestable identifying assumption and any restrictions on the observed-data distribution being correct, we would like to know what identifying assumption we are actually making so we can determine whether or not the assumption is plausible in a given context. Thus, in this article we will advocate for the use of models based on explicit identifying assumptions, where the observed-data distribution is only possibly restricted by the identifying assumption (i.e. $\tilde{\mathcal{S}} = \tilde{\mathcal{T}}$), so that we make as few testable assumptions as possible.
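To make the role of the identifying assumption $T$ concrete, here is a toy $K = 2$ sketch (all counts are made up) using the classical two-list independence assumption, under which $T(\tilde{\boldsymbol{\pi}}) = r / (1 + r)$ with $r = \tilde{\pi}_{(1,0)} \tilde{\pi}_{(0,1)} / \tilde{\pi}_{(1,1)}$.

```python
# Toy K = 2 example of an identifying assumption in action: two-list
# independence maps the observed cell probabilities to pi0.
# The counts below are made up for illustration.
counts = {(1, 0): 60, (0, 1): 40, (1, 1): 20}
n = sum(counts.values())
pt = {h: c / n for h, c in counts.items()}     # plug-in observed cell probs

r = pt[(1, 0)] * pt[(0, 1)] / pt[(1, 1)]
pi0 = r / (1 + r)                              # T(pi_tilde) under independence
N_hat = n / (1 - pi0)                          # plug-in estimate of N

# Reduces to the familiar n + n_(1,0) * n_(0,1) / n_(1,1)
assert abs(N_hat - (n + counts[(1, 0)] * counts[(0, 1)] / counts[(1, 1)])) < 1e-9
print(f"pi0 = {pi0:.3f}, N_hat = {N_hat:.1f}")  # N_hat = 240.0
```

The point of the framing above is that the data alone say nothing about $\pi_0$; everything past the plug-in estimate of $\tilde{\boldsymbol{\pi}}$ comes from the assumed $T$.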
In this section we describe two commonly used models, which we use to demonstrate the drawbacks of using models that either place unnecessary restrictions on the observed-data distribution or are not based on explicit identifying assumptions.

3.1 Log-Linear Models
Any set of cell probabilities, $\boldsymbol{\pi} \in \mathcal{S}^{2^K - 1}$, can be represented as $\pi_h = \mu_h / \sum_{h' \in \mathcal{H}} \mu_{h'}$, where $\log(\mu_h) = \sum_{h' \in \mathcal{H}^*} \lambda_{h'} \prod_{k=1}^K h_k^{h'_k}$, for some set of log-linear parameters $\boldsymbol{\lambda} = \{\lambda_h\}_{h \in \mathcal{H}^*} \in \mathbb{R}^{2^K - 1}$. This leads to the log-linear parameterization $\Theta_{LL} = \{N, \boldsymbol{\lambda} \mid N \in \mathbb{N}, \boldsymbol{\lambda} \in \mathbb{R}^{2^K - 1}\}$. For cells in the incomplete table $h \in \mathcal{H}^*$ such that $\sum_{k=1}^K h_k = 1$ we refer to $\lambda_h$ as a main effect; for $h \in \mathcal{H}^*$ such that $\sum_{k=1}^K h_k = \ell > 1$ we refer to $\lambda_h$ as an $\ell$-way interaction. The main effects and interactions all have interpretations as log ratios of certain cross-product ratios (see e.g. Chapter 2 of Bishop et al., 1975). Of particular interest is the $K$-way, or highest-order, interaction $\lambda_{\boldsymbol{1}}$, where $\boldsymbol{1} := (1, \dots, 1)$ and $\prod_{h \in \mathcal{H}} \pi_h^{I_{odd}(h)} / \prod_{h \in \mathcal{H}} \pi_h^{I_{even}(h)} = \exp\{(-1)^{I_{even}(\boldsymbol{1})} \lambda_{\boldsymbol{1}}\}$, where $I_{odd}(h) = I(\sum_{k=1}^K h_k \text{ is odd})$ and $I_{even}(h) = I(\sum_{k=1}^K h_k \text{ is even})$. This notation differs from Bishop et al. (1975) as we index the complete table by $\mathcal{H} = \{0,1\}^K$ rather than $\{1,2\}^K$.

The model $P_{\Theta_{LL}}$ is equivalent to the unrestricted model $P_{\Theta^*}$, so we need to restrict $\Theta_{LL}$ to identify the unobserved cell probability $\pi_0$. It is standard in this scenario to set $\lambda_{\boldsymbol{1}} = 0$, so that there is no highest-order interaction in the model. Referring to the resulting parameter space as $\Omega_{LL}$, we would like to understand the identifying assumption made by the saturated model $P_{\Omega_{LL}}$.
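One way to see what setting $\lambda_{\boldsymbol{1}} = 0$ encodes is to build cell probabilities from log-linear parameters with a zero highest-order interaction and check that $\pi_0$ then becomes a function of the observed cell probabilities alone. A hypothetical $K = 3$ sketch, with made-up $\lambda$ values:

```python
import math
from itertools import product

# Hypothetical K = 3 sketch: with lambda_1 = 0 (NHOI), pi0 is recoverable
# from the observed cell probabilities via pi0 = r / (1 + r), where r is
# the ratio of products over odd/even cells of H*. Lambda values made up.
K = 3
H = list(product([0, 1], repeat=K))
H_star = [h for h in H if h != (0, 0, 0)]

lam = {(1, 0, 0): -0.5, (0, 1, 0): 0.3, (0, 0, 1): -0.2,
       (1, 1, 0): 0.4, (1, 0, 1): -0.1, (0, 1, 1): 0.2,
       (1, 1, 1): 0.0}                            # NHOI: lambda_1 = 0

def mu(h):
    # log(mu_h) = sum of lam_{h'} over h' in H* with h' <= h componentwise
    return math.exp(sum(l for hp, l in lam.items()
                        if all(h[k] >= hp[k] for k in range(K))))

tot = sum(mu(h) for h in H)
pi = {h: mu(h) / tot for h in H}
pi0 = pi[(0, 0, 0)]
pt = {h: pi[h] / (1 - pi0) for h in H_star}       # observed cell probs

odd = lambda h: sum(h) % 2 == 1                   # I_odd(h)
r = math.prod(pt[h] for h in H_star if odd(h)) / \
    math.prod(pt[h] for h in H_star if not odd(h))

assert abs(r / (1 + r) - pi0) < 1e-9              # T(pi_tilde) recovers pi0
print("pi0 =", pi0, " T(pi_tilde) =", r / (1 + r))
```

Repeating this with `lam[(1, 1, 1)]` nonzero breaks the final assertion, which is exactly why the restriction is needed to identify $\pi_0$.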
In Appendix B, we show $P_{\Omega_{LL}}$ is nonparametric identified and that this no-highest-order interaction (NHOI) assumption corresponds to the explicit identifying assumption $T(\tilde{\boldsymbol{\pi}}) = (\tilde{\Pi}_{odd} / \tilde{\Pi}_{even}) / (1 + \tilde{\Pi}_{odd} / \tilde{\Pi}_{even})$, where $\tilde{\Pi}_{odd} = \prod_{h \in \mathcal{H}^*} \tilde{\pi}_h^{I_{odd}(h)}$ and $\tilde{\Pi}_{even} = \prod_{h \in \mathcal{H}^*} \tilde{\pi}_h^{I_{even}(h)}$, which we discuss in more detail in Section 4.

In practice there is an emphasis on achieving low variance estimates of the log-linear parameters and, consequently, $N$. To this end, rather than just setting the highest-order interaction to zero and using the saturated model, it is common to further restrict the model and set other interactions to zero. This is the case, for example, when restricting to decomposable graphical models (Madigan and York, 1997), or when only including main effects and 2-way interactions (Silverman, 2020), which can be hard to justify in practice (see e.g. Dellaportas and Forster, 1999; Whitehead et al., 2019). By restricting the observed-data distribution, we are making a testable assumption that, in addition to the untestable identifying assumption, must be correct in order for inferences to be valid. The hope is that by specifying a model with fewer parameters, the resulting estimates will have lower variance if the chosen restricted model generated the data. However, if the chosen restricted model did not generate the data, estimates of $N$ can be arbitrarily biased, and more generally can have arbitrarily poor frequentist properties (Regal and Hook, 1991; Whitehead et al., 2019). This is a classic bias-variance trade-off, which has been acknowledged since the seminal work of Fienberg (1972) (edited to match our notation): “In analyzing multiple recapture census data our aim is to fit the incomplete $2^K$ table by a log linear model with the fewest possible parameters, since the fewer parameters in an ‘appropriate’ model for estimating $n_0$, the smaller the variance of the estimate.
Thus it is not a good practice simply to use the saturated model. On the other hand, if we use a model with too few parameters, we introduce a bias into our estimate of population size that can possibly render the variance formulae of the next section meaningless.” We disagree with Fienberg (1972) in that we believe there is a clear route to take in this case: make as few testable assumptions as possible (i.e. use the saturated model $P_{\Omega_{LL}}$) in the hopes of not being arbitrarily biased. Some form of regularization can then be used within the saturated model $P_{\Omega_{LL}}$ in order to produce lower variance estimators, e.g. through priors in a Bayesian framework as discussed in Section 5.1.

3.2 Latent Class Models

Latent class models (LCMs) are typically motivated as models of multivariate categorical data that capture individual heterogeneity when the population can be stratified into $J$ classes, where lists sample individuals independently within each class (Haberman, 1979; Manrique-Vallier, 2016). Thus they are so-called $M_{th}$ models as described in Appendix A (Otis et al., 1978). Corollary 1 of Dunson and Xing (2009) shows that for any set of cell probabilities $\boldsymbol{\pi} \in \mathcal{S}^{2^K - 1}$, there exists some $J < \infty$ such that $\boldsymbol{\pi}$ can be represented as a $J$-class latent class model, i.e. $\pi_h = \sum_{j=1}^J \nu_j \prod_{k=1}^K q_{jk}^{h_k} (1 - q_{jk})^{1 - h_k}$, where $\boldsymbol{\nu} = (\nu_1, \dots, \nu_J)$ are class membership probabilities, and $\boldsymbol{q} = \{q_{jk}\}_{j=1,k=1}^{J,K}$ are class specific observation probabilities for each list. This leads to the latent class model parameterization $\Theta_{LCM} = \{N, \boldsymbol{\nu}, \boldsymbol{q}, J \mid N \in \mathbb{N}, \boldsymbol{\nu} \in \mathcal{S}^{J-1}, \boldsymbol{q} \in (0,1)^{J \times K}, J \in \mathbb{N}\}$. As $P_{\Theta_{LCM}}$ is equivalent to the unrestricted model $P_{\Theta^*}$, we need to restrict $\Theta_{LCM}$ to identify the unobserved cell probability $\pi_0$. It is common to fix the number of latent classes, $J$, in advance, to arrive at the restricted parameterization $\Omega_{LCM,J} = \{N, \boldsymbol{\nu}, \boldsymbol{q} \mid N \in \mathbb{N}, \boldsymbol{\nu} \in \mathcal{S}^{J-1}, \boldsymbol{q} \in (0,1)^{J \times K}\}$.

In Appendix A we show that $P_{\Omega_{LCM,J}}$ is conditionally identified if and only if $2J \leq K$. However, when $\Omega_{LCM,J}$ is conditionally identified we do not know what explicit identifying assumption is being made or whether the model is nonparametric identified. A recent development in MSE is the use of LCMs with $J$ large enough that $2J > K$ (Manrique-Vallier, 2016). Here LCMs suffer from the opposite problem of log-linear models: rather than making too many assumptions, and hence restricting the observed-data distribution, so few assumptions are being made that the model is not conditionally identified. In Appendix C we show through a variety of simulations that this is a practically relevant problem, as we have no guarantees for when estimates based on non-identified models are going to be accurate.
In this section we revisit the NHOI identifying assumption associated with log-linear models and discussits role in our framing of MSE. We then describe how this assumption can be used as a building block foralternative identifying assumptions.
The NHOI assumption introduced in Section 3.1 can be interpreted as follows: for any given subset of $K - 1$ lists, how individuals are associated with those $K - 1$ lists is the same whether or not they appear in the remaining $K$th list. Here the meaning of “associated with” changes as the number of lists $K$ changes. When $K = 2$ we are assuming that the odds of appearing in list 1 conditional on appearing in list 2 is equal to the odds of appearing in list 1 conditional on not appearing in list 2, and thus the lists are independent: $\pi_{(1,1)} / \pi_{(0,1)} = \pi_{(1,0)} / \pi_{(0,0)}$. When $K = 3$ we are assuming that the odds ratio for lists 1 and 2 conditional on appearing in list 3 is equal to the odds ratio for lists 1 and 2 conditional on not appearing in list 3: $\pi_{(1,1,1)} \pi_{(0,0,1)} / (\pi_{(1,0,1)} \pi_{(0,1,1)}) = \pi_{(1,1,0)} \pi_{(0,0,0)} / (\pi_{(1,0,0)} \pi_{(0,1,0)})$. When $K = 4$ we assume that certain ratios of odds ratios are equal, and so on for larger $K$.

As discussed in Section 2.5, in order to use the NHOI assumption in a given application, we need to be able to determine whether or not it is plausible. Odds and odds ratios are well understood in statistics (see e.g. Bishop et al., 1975), and thus the NHOI assumption may be of use when there are $K = 2$ or $K = 3$ lists. However, higher order measures of association like ratios of odds ratios are more obscure and harder to interpret, which makes the NHOI assumption difficult to use when there are more than $K = 3$ lists. This difficulty compounds when considering sensitivity analyses, as we explain in the next section.

Sensitivity analyses aim to gauge how sensitive inferences are to untestable assumptions, and are an important part of missing data workflows (see Chapter 9 of Hogan and Daniels, 2008). The NHOI assumption facilitates sensitivity analyses based on varying the highest-order interaction across a range of non-zero values. In particular, when fixing $\xi = \exp\{(-1)^{I_{even}(\boldsymbol{1})} \lambda_{\boldsymbol{1}}\} \in \mathbb{R}^+$, we show in Appendix B that we arrive at the explicit identifying assumption $T(\tilde{\boldsymbol{\pi}}) = (\tilde{\Pi}_{odd} / \tilde{\Pi}_{even}) / (\xi + \tilde{\Pi}_{odd} / \tilde{\Pi}_{even})$.
This generalizes the two-list sensitivity analyses of Lum and Ball (2015) and Gerritse et al. (2015). Under this identifying assumption, rather than assuming certain measures of association are equal, we are assuming one measure is $\xi$ times another. For example, when $K = 2$ we are assuming that the odds of appearing in list 1 conditional on not appearing in list 2 is $\xi$ times the odds of appearing in list 1 conditional on appearing in list 2: $\pi_{(1,0)} / \pi_{(0,0)} = \xi \pi_{(1,1)} / \pi_{(0,1)}$.

In order to perform a meaningful sensitivity analysis, one needs to be able to specify a range of values for the highest-order interaction that are plausible for a given application. Due to our understanding of odds and odds ratios, performing this sort of sensitivity analysis may be possible when there are $K = 2$ or $K = 3$ lists. When considering more than $K = 3$ lists, it can become difficult to even start thinking about whether it is plausible that $\xi$ is less than or greater than 1, let alone determine specific values of $\xi$ that are plausible.

4.3 $K'$-List Marginal No-Highest-Order Interaction Assumptions

The NHOI assumption can be used as a building block to generate other identifying assumptions. Suppose we can assume that, without loss of generality, the NHOI assumption holds for the first $1 < K' < K$ lists, marginal of the remaining $K - K'$ lists. This leads to a new identifying assumption which in general does not imply that there is no highest-order interaction for all $K$ lists. To introduce this assumption formally we need to introduce some notation. Let $\mathcal{G} = \{0,1\}^{K'}$ index the marginal $2^{K'}$ contingency table for the first $K'$ lists and let $\mathcal{G}^* = \mathcal{G} \setminus \{(0,\dots,0)\}$. For a set of cell probabilities, $\boldsymbol{\pi} \in \mathcal{S}^{2^K - 1}$, and a given cell in the marginal table, $g \in \mathcal{G}$, let $\pi_{g+} = \sum_{h \in \mathcal{H}} \pi_h I\{(h_1, \dots, h_{K'}) = g\}$ denote the probability of being observed in cell $g$ of the marginal table implied by $\boldsymbol{\pi}$.
Similarly let $\tilde{\pi}_{g+} = \sum_{h \in \mathcal{H}^*} \tilde{\pi}_h I\{(h_1, \dots, h_{K'}) = g\}$ and $\tilde{\pi}_{0+} = \sum_{h \in \mathcal{H}^*} \tilde{\pi}_h I\{(h_1, \dots, h_{K'}) = (0, \dots, 0)\}$.

Assuming that the NHOI assumption holds for the first $1 < K' < K$ lists, marginal of the remaining $K - K'$ lists, is equivalent to assuming $\prod_{g \in \mathcal{G}} \pi_{g+}^{I_{odd}(g)} / \prod_{g \in \mathcal{G}} \pi_{g+}^{I_{even}(g)} = 1$. In Appendix B we show that this $K'$-list marginal no-highest-order interaction assumption corresponds to the explicit identifying assumption $T(\tilde{\boldsymbol{\pi}}) = (\tilde{\Pi}_{odd,+} / \tilde{\Pi}_{even,+} - \tilde{\pi}_{0+}) / (1 + \tilde{\Pi}_{odd,+} / \tilde{\Pi}_{even,+} - \tilde{\pi}_{0+})$, where $\tilde{\Pi}_{odd,+} = \prod_{g \in \mathcal{G}^*} \tilde{\pi}_{g+}^{I_{odd}(g)}$ and $\tilde{\Pi}_{even,+} = \prod_{g \in \mathcal{G}^*} \tilde{\pi}_{g+}^{I_{even}(g)}$. Further, we can perform sensitivity analyses for this assumption by fixing $\prod_{g \in \mathcal{G}} \pi_{g+}^{I_{odd}(g)} / \prod_{g \in \mathcal{G}} \pi_{g+}^{I_{even}(g)} = \xi \in \mathbb{R}^+$. As we show in Appendix B, this leads to the explicit identifying assumption

$$T(\tilde{\boldsymbol{\pi}}) = \frac{\tilde{\Pi}_{odd,+} / \tilde{\Pi}_{even,+} - \xi \tilde{\pi}_{0+}}{\xi + (\tilde{\Pi}_{odd,+} / \tilde{\Pi}_{even,+} - \xi \tilde{\pi}_{0+})}. \quad (3)$$

Models that use the assumption that $\prod_{g \in \mathcal{G}} \pi_{g+}^{I_{odd}(g)} / \prod_{g \in \mathcal{G}} \pi_{g+}^{I_{even}(g)} = \xi \in \mathbb{R}^+$ are not nonparametric identified, as the domain of the identifying assumption is $\tilde{\mathcal{T}} = \{\tilde{\boldsymbol{\pi}} \in \mathcal{S}^{2^K - 2} \mid \tilde{\Pi}_{odd,+} / (\tilde{\Pi}_{even,+} \tilde{\pi}_{0+}) > \xi\}$.

A special case of this identifying assumption was originally suggested in Regal and Hook (1998) as an alternative to the NHOI assumption. They considered a data set consisting of $K = 3$ lists recording cases of spina bifida in upstate New York, where they believed that the assumption that two of the lists were marginally independent (i.e., using the 2-list marginal NHOI assumption) was more plausible than the NHOI assumption. This illustrates that there may be applications where one may be more willing to make marginal assumptions about a subset of $K'$ lists, rather than an assumption involving all $K$ lists.
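As a concrete sketch, for $K = 3$ lists with the 2-list marginal NHOI assumption on lists 1 and 2, equation (3) can be evaluated at plug-in observed cell probabilities over a grid of $\xi$ values; the counts and the $\xi$ grid below are made up for illustration.

```python
# Hypothetical K = 3 example of the K' = 2 marginal NHOI assumption:
# evaluate the identifying assumption (3) at plug-in observed cell
# probabilities for a grid of xi values. Counts are made up.
counts = {(1, 0, 0): 30, (0, 1, 0): 25, (0, 0, 1): 40, (1, 1, 0): 12,
          (1, 0, 1): 10, (0, 1, 1): 8, (1, 1, 1): 5}
n = sum(counts.values())
pt = {h: c / n for h, c in counts.items()}

def marg(g):
    # Observed marginal probability over lists 1 and 2
    return sum(p for h, p in pt.items() if (h[0], h[1]) == g)

pt0_plus = marg((0, 0))                          # observed, missed by lists 1-2
R = marg((1, 0)) * marg((0, 1)) / marg((1, 1))   # Pi_odd,+ / Pi_even,+

for xi in [0.5, 1.0, 2.0]:
    num = R - xi * pt0_plus
    if num <= 0:
        print(f"xi = {xi}: outside the domain of the assumption")
        continue
    pi0 = num / (xi + num)                       # equation (3)
    print(f"xi = {xi}: pi0 = {pi0:.3f}, N_hat = {n / (1 - pi0):.1f}")
```

With these counts the $\xi = 2$ case falls outside the domain $\tilde{\mathcal{T}}$, illustrating the testable restriction discussed above.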
Additionally, when there are $K > 3$ lists, the $K'$-list marginal NHOI assumption with $K' = 2$ or $K' = 3$ and its sensitivity analyses are much more straightforward to interpret than the highest-order interaction and its sensitivity analyses, as discussed in Sections 4.1 and 4.2.

For these reasons, we believe that the $K'$-list marginal NHOI assumption can be useful as an explicit identifying assumption in the toolbox of the MSE practitioner. However, we emphasize that there is no one-size-fits-all identifying assumption. Specification of identifying assumptions in practice should be accompanied with appropriate justification based on the context of the data. In Section 6.1 we attempt to provide such a justification for our use of the 2-list marginal NHOI assumption in an application estimating the number of civilian casualties in the Kosovo war.

In this section we describe a computational approach that allows any identifying assumption to be used in a Bayesian framework with any prior for the population size, $N$, and any prior for the observed cell probabilities, $\tilde{\boldsymbol{\pi}}$. Various sensitivity analyses are facilitated from this approach. We further give some guidance to specification of the prior for $N$.

Suppose that we are using a conditionally identified model $P_\Omega$, where $\Omega = \{N, \pi_0, \tilde{\boldsymbol{\pi}} \mid N \in \mathbb{N}, \pi_0 = T(\tilde{\boldsymbol{\pi}}), \tilde{\boldsymbol{\pi}} \in \tilde{\mathcal{S}}\}$, and we have specified independent prior distributions for $N$ and $\tilde{\boldsymbol{\pi}}$. The posterior distribution of $N$ and $\tilde{\boldsymbol{\pi}}$ is $p(N, \tilde{\boldsymbol{\pi}} \mid \boldsymbol{n}) \propto L_1(N, T(\tilde{\boldsymbol{\pi}}) \mid n)\, L_2(\tilde{\boldsymbol{\pi}} \mid \boldsymbol{n})\, p(N)\, p(\tilde{\boldsymbol{\pi}})\, I(\tilde{\boldsymbol{\pi}} \in \tilde{\mathcal{S}})$. The marginal posterior distributions of $\tilde{\boldsymbol{\pi}}$ and $N$ are

$$p(\tilde{\boldsymbol{\pi}} \mid \boldsymbol{n}) \propto p(n \mid T(\tilde{\boldsymbol{\pi}}))\, L_2(\tilde{\boldsymbol{\pi}} \mid \boldsymbol{n})\, p(\tilde{\boldsymbol{\pi}})\, I(\tilde{\boldsymbol{\pi}} \in \tilde{\mathcal{S}}), \quad (4)$$

and $p(N \mid \boldsymbol{n}) = \int p(N \mid n, T(\tilde{\boldsymbol{\pi}}))\, p(\tilde{\boldsymbol{\pi}} \mid \boldsymbol{n})\, d\tilde{\boldsymbol{\pi}}$, where $p(n \mid \pi_0) = \sum_{N=n}^\infty L_1(N, \pi_0 \mid n)\, p(N)$ and $p(N \mid n, \pi_0) = L_1(N, \pi_0 \mid n)\, p(N) / p(n \mid \pi_0)$, with $\pi_0 = T(\tilde{\boldsymbol{\pi}})$. As we discuss in Section 5.3, we can compute $p(n \mid \pi_0)$, and thus $p(N \mid n, \pi_0)$, analytically for common priors on $N$. If one has access to Markov chain Monte Carlo (MCMC) samples $\{\tilde{\boldsymbol{\pi}}^{[t]}\}_{t=1}^T$ from $p(\tilde{\boldsymbol{\pi}} \mid \boldsymbol{n})$, one can then generate MCMC samples $\{N^{[t]}\}_{t=1}^T$ from $p(N \mid \boldsymbol{n})$ via $N^{[t]} \sim p(N \mid n, T(\tilde{\boldsymbol{\pi}}^{[t]}))$. Summaries of the marginal posterior of $N$ can then be calculated based on these samples.

While computation as described in the previous section may seem straightforward, the marginal posterior $p(\tilde{\boldsymbol{\pi}} \mid \boldsymbol{n})$ depends on the specific combination of priors for $\tilde{\boldsymbol{\pi}}$ and $N$ and identifying assumption $T$. Thus we need new MCMC samples from $p(\tilde{\boldsymbol{\pi}} \mid \boldsymbol{n})$ for each new combination of priors and identifying assumption, which can be difficult both technically and computationally. Rather than develop new MCMC samplers for each combination, we will rely on a combination of existing software and a computationally cheap rejection sampler.

Letting $p_C(\tilde{\boldsymbol{\pi}} \mid \boldsymbol{n}) \propto L_2(\tilde{\boldsymbol{\pi}} \mid \boldsymbol{n})\, p(\tilde{\boldsymbol{\pi}})$ denote the marginal posterior for $\tilde{\boldsymbol{\pi}}$ using just the conditional likelihood $L_2$, we can rewrite (4) as $p(\tilde{\boldsymbol{\pi}} \mid \boldsymbol{n}) \propto p(n \mid T(\tilde{\boldsymbol{\pi}}))\, I(\tilde{\boldsymbol{\pi}} \in \tilde{\mathcal{S}})\, p_C(\tilde{\boldsymbol{\pi}} \mid \boldsymbol{n})$. This suggests a computationally cheap rejection sampler to generate samples from $p(\tilde{\boldsymbol{\pi}} \mid \boldsymbol{n})$, if we have access to MCMC samples from $p_C(\tilde{\boldsymbol{\pi}} \mid \boldsymbol{n})$ (Smith and Gelfand, 1992):

1. Generate $U \sim \text{Unif}(0,1)$ and $\tilde{\boldsymbol{\pi}} \sim p_C(\tilde{\boldsymbol{\pi}} \mid \boldsymbol{n})$ independently.
2. If $U < p(n \mid T(\tilde{\boldsymbol{\pi}}))\, I(\tilde{\boldsymbol{\pi}} \in \tilde{\mathcal{S}}) / \{\max_{\pi_0} p(n \mid \pi_0)\}$, accept $\tilde{\boldsymbol{\pi}}$. Else go back to (1).

Thus, for a given prior $p(\tilde{\boldsymbol{\pi}})$, if we want to perform prior sensitivity analyses for $N$ and/or sensitivity analyses probing the identifying assumption as discussed in Section 4.2, we can take a one-time sample from $p_C(\tilde{\boldsymbol{\pi}} \mid \boldsymbol{n})$, and then reuse this sample to generate samples from $p(\tilde{\boldsymbol{\pi}} \mid \boldsymbol{n})$ for each combination of prior for $N$ and identifying assumption.

The approach just described is only useful if we have access to MCMC samples from $p_C(\tilde{\boldsymbol{\pi}} \mid \boldsymbol{n})$. Previous work in Bayesian MSE specifies priors for $p(\tilde{\boldsymbol{\pi}})$ indirectly. In particular, most work specifies priors on reparametrizations of the cell probabilities $\boldsymbol{\pi}$, such as log-linear models or LCMs, which induce priors for $\boldsymbol{\pi}$, and thus for $\tilde{\boldsymbol{\pi}}$. Let $p_w(\boldsymbol{\pi})$ denote what we will call the working prior for $\boldsymbol{\pi}$, which induces the prior $p(\tilde{\boldsymbol{\pi}})$ we would like to use. Consider the working posterior for $\boldsymbol{\pi}$, $p_w(\boldsymbol{\pi} \mid \boldsymbol{n}) \propto \sum_{N=n}^\infty p(n_0, \boldsymbol{n} \mid N, \boldsymbol{\pi})\, p_w(\boldsymbol{\pi}) / N$, obtained using the working prior for $N$ of $p_w(N) \propto 1/N$. The posterior for $\tilde{\boldsymbol{\pi}}$ under this working prior combination is equal to $p_C(\tilde{\boldsymbol{\pi}} \mid \boldsymbol{n}) \propto L_2(\tilde{\boldsymbol{\pi}} \mid \boldsymbol{n})\, p(\tilde{\boldsymbol{\pi}})$, as $p(n \mid \pi_0) \propto 1/n$ under the working prior for $N$ (see Table 1). Thus, given MCMC samples, $\{\boldsymbol{\pi}^{[t]}\}_{t=1}^T$, drawn from $p_w(\boldsymbol{\pi} \mid \boldsymbol{n})$, letting $\tilde{\pi}_h^{[t]} = \pi_h^{[t]} / (1 - \pi_0^{[t]})$, $\{\tilde{\boldsymbol{\pi}}^{[t]}\}_{t=1}^T$ are MCMC samples drawn from $p_C(\tilde{\boldsymbol{\pi}} \mid \boldsymbol{n})$.

Thus if we want to use the prior $p(\tilde{\boldsymbol{\pi}})$ induced by a working prior $p_w(\boldsymbol{\pi})$, we can rely on a combination of existing software and a computationally cheap rejection sampler to generate draws from the posterior $p(N, \tilde{\boldsymbol{\pi}} \mid \boldsymbol{n})$ for any combination of prior for $N$ and identifying assumption, as long as the software uses the working prior $p_w(N) \propto 1/N$ (note that our prior for $N$ does not have to be $p_w(N)$).
This is the case for most existing software, including the R package conting (Overstall and King, 2014), which implements a reversible-jump MCMC sampler to target p_w(π | n) under a working prior p_w(π) induced by a prior that averages over all hierarchical log-linear models (King and Brooks, 2001), and the R package LCMCR, which implements a data augmentation Gibbs sampler to target p_w(π | n) under a working prior p_w(π) induced by a Dirichlet process prior for LCMs (Manrique-Vallier, 2016). The steps of the MCMC samplers used in these packages are model specific, and we would not be able to use them if we tried to create bespoke MCMC samplers targeting the marginal posterior in (4). We note that this approach is closely related to the working prior approach of Linero (2017), with some necessary modifications specific to MSE.

5.3 Priors for N

In Table 1 we catalog p(n | π) and p(N | n, π) under Poisson, negative-binomial, and binomial priors for N, in addition to the class of priors p(N) ∝ (N − ℓ)!/N!, where ℓ ∈ {0, 1, 2, · · ·}, suggested by Fienberg et al. (1999). This class of priors contains both the improper uniform prior, p(N) ∝ 1, when ℓ = 0, and the improper scale prior, p(N) ∝ 1/N, when ℓ = 1. If p(n | π) is not available analytically, for example when p(N) is beta-binomial, we recommend truncating the prior for N to a range {0, · · ·, N_max}, where N_max is an upper bound on the population size, in which case p(n | π) can be computed numerically.

Table 1: Catalog of p(N | n, π) and p(n | π) under common priors for N.

Prior                    p(N)                                               p(N | n, π)                              p(n | π)
Pois(M)                  M^N e^{−M}/N!                                      n + Pois(π_0 M)                          Pois((1 − π_0)M)
NB(a, M/(M + a))         (N + a − 1 choose N) {M/(M + a)}^N {a/(M + a)}^a   n + NB(n + a, π_0 M/(M + a))             NB(a, (1 − π_0)M/{(1 − π_0)M + a})
Bin(M, q)                (M choose N) q^N (1 − q)^{M − N}                   n + Bin(M − n, π_0 q/(π_0 q + 1 − q))    Bin(M, (1 − π_0)q)
Fienberg et al. (1999)   ∝ (N − ℓ)!/N!                                      n + NB(n − ℓ + 1, π_0)                   ∝ {(n − ℓ)!/n!}(1 − π_0)^{ℓ − 1}

The improper scale prior, under which p(π̃ | n) ∝ p_C(π̃ | n) I(π̃ ∈ S̃), is a common "noninformative" prior for N and has the nice property that the posterior mean of N conditional on π̃ is the Horvitz-Thompson estimator (Horvitz and Thompson, 1952), n/{1 − T(π̃)}, which is well understood in the present context (see e.g. Rukhin, 1975). Following Link (2013), we recommend using this prior in the absence of substantive knowledge about N.

When incorporating substantive knowledge about N into an informative prior for N we recommend using a negative-binomial or beta-binomial prior, as we have found Poisson and binomial priors to be more informative than we would usually like to use. For concreteness in this article we will focus on the negative-binomial prior. In Table 1, we use a common parameterization for the negative-binomial distribution in terms of the mean M and overdispersion parameter a. This parameterization arises from a Poisson-gamma mixture, where N | δ ∼ Poisson(Mδ), δ ∼ Gamma(a, a).
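As a quick numerical check of this parameterization, the following sketch simulates the Poisson-gamma mixture (the values M = 100 and a = 2 are illustrative, not taken from the case study) and confirms the mean M and overdispersed variance M + M²/a:

```python
import numpy as np

rng = np.random.default_rng(1)
M, a = 100.0, 2.0   # illustrative mean and overdispersion parameter

# N | delta ~ Poisson(M * delta) with delta ~ Gamma(a, a), which has mean 1
delta = rng.gamma(shape=a, scale=1.0 / a, size=500_000)
N = rng.poisson(M * delta)

# Negative-binomial moments: mean M, variance M + M^2 / a
assert abs(N.mean() - M) / M < 0.02
assert abs(N.var() - (M + M**2 / a)) / (M + M**2 / a) < 0.05
```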
As a → ∞ the prior approaches a Poisson prior with mean M, and as a → 0 the prior becomes increasingly overdispersed.

6 Case Study: Estimating the Number of Civilian Casualties in the Kosovo War

In this section we estimate the number of civilian casualties in the Kosovo war between March 20 and June 22, 1999, using data originally analyzed in Ball et al. (2002). The data consist of K = 4 lists with n = 4400 observed casualties, and are presented in Table 2, reproduced from Section 6 of Ball et al. (2002). Three of the lists were constructed from refugee interviews conducted separately by the American Bar Association Central and East European Law Initiative (ABA), Human Rights Watch (HRW), and the Organization for Security and Cooperation in Europe (OSCE). The fourth list was constructed from exhumation reports conducted on behalf of the International Criminal Tribunal for the Former Yugoslavia (EXH). We refer the reader to Appendix 1 of Ball et al. (2002) for a detailed description of each list.

Table 2: Kosovo dataset, reproduced from Section 6 of Ball et al. (2002).

            ABA   yes   yes   no    no
            EXH   yes   no    yes   no
HRW   OSCE
yes   yes         27    32    42    123
yes   no          18    31    106   306
no    yes         181   217   228   936
no    no          177   845   1131  ?

(The final cell, corresponding to casualties appearing on none of the four lists, is unobserved.)

After motivating our choice of identifying assumption in the next section, we will analyze the Kosovo data under a variety of priors for the population size and observed cell probabilities. Our purpose is to demonstrate the ease with which it is possible to perform inference in a model with explicit identifying assumptions and arbitrary priors, as described in Section 5.2. This facilitates prior sensitivity analyses that decouple the identifying assumption from the model used for the observed cell probabilities. Since the same identifying assumption is being used for each prior combination, it follows that differences in fits are entirely due to differences in prior specifications. Compare this, for example, to the analyses of the Kosovo data set in Silverman (2020), where the Bayesian LCM of Manrique-Vallier (2016) was compared to various Bayesian log-linear models.
The fits of these different models are not comparable to each other, as they are using different identifying assumptions and different priors for the observed cell probabilities. Note that we are not attempting to perform model selection or provide guidance on which prior specification one should use.
6.1 Identifying Assumption

For our main analysis, we will use the 2-list marginal NHOI assumption, where we will assume that the ABA and HRW lists are independent. We believe this assumption is plausible given that "there were no overt efforts by any of the researchers to exclude or include witnesses who had participated in another data collection project" (ABA/AAAS, 2000, p. 40) and that the two lists had similarly extensive geographic reach in their interviews. In particular, ABA conducted interviews in Albania, Macedonia, Kosovo, the United States, and Poland, while HRW conducted interviews in Albania, Macedonia, Kosovo, and Montenegro. ABA only conducted around 10% of its interviews in the United States and Poland, and HRW only conducted 3% of its interviews in Montenegro. Further, within Kosovo, ABA and HRW conducted interviews in similar geographic regions. For more information on where the lists conducted interviews see Appendix 1 of Ball et al. (2002).

The original analysis of the Kosovo data set in Ball et al. (2002) used the NHOI assumption. To justify this assumption for the Kosovo data, as we have K = 4 lists, we would need to reason about certain ratios of odds ratios being equal, which can be difficult, as discussed in Section 4.1 and further explained in Appendix D. Nevertheless, to further demonstrate the use of the computational approach described in Section 5.2, in Appendix D we present an analysis of the Kosovo data using the NHOI assumption.

6.2 Priors for Total Number of Casualties

We will consider two priors for the total number of casualties, N, as discussed in Section 5.3: the improper scale prior and a negative-binomial prior. To inform the negative-binomial prior, we will rely on two studies that attempted to estimate the number of casualties in the Kosovo war using different data sources than Ball et al. (2002).
Spiegel and Salama (2000) estimated there were 12000 casualties, with a 95% confidence interval whose lower endpoint was 5500; the second study's estimate was 8000 casualties. We therefore use a negative-binomial prior with mean M = 10000 (the average of the estimates from the two studies) and overdispersion parameter a reflecting the uncertainty in these estimates.

6.3 Priors for Observed Cell Probabilities

We will consider four prior specifications for the observed cell probabilities π̃: 1) a flat Dirichlet prior, i.e. π̃ ∼ Dirichlet(1, · · ·, 1), for which p_C(π̃ | n) is available in closed form; 2) the prior induced from using zero-mean normal priors for the log-linear parameters in the saturated log-linear model Ω_LL, fit using the Stan probabilistic programming language (Carpenter et al., 2017); 3) the prior induced from using the Dirichlet process prior of Manrique-Vallier (2016) for the J-class LCM Ω_{LCM,J}, with J = 10 and default hyperparameters, as implemented in the R package LCMCR; and 4) the prior induced from using the Bayesian model averaging prior of King and Brooks (2001) for the log-linear parameters in the saturated log-linear model Ω_LL, with the unit information prior on log-linear parameters, as implemented in the R package conting (Overstall and King, 2014). We note that conting uses an alternative log-linear parameterization based on sum-to-zero constraints rather than the corner-point constraints used in Section 3.1.

6.4 Results

For each combination of prior for N and π̃ we fit the corresponding model using the computational approach described in Section 5.2. In particular, we emphasize that for each prior for π̃, we only drew samples from the corresponding p_C(π̃ | n) once. The posterior density for N under each prior combination is displayed in Figure 2. Posterior means and 95% credible intervals for N under each prior combination are displayed in Table 3.
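For the flat Dirichlet specification the whole pipeline can be sketched compactly: p_C(π̃ | n) is Dirichlet with the observed counts plus one, and under the improper scale prior for N the rejection step always accepts, since p(n | π) ∝ 1/n is free of π. The following simplified re-implementation (a sketch, not the code behind the reported results) encodes the 2-list marginal NHOI assumption for ABA and HRW, with an optional ξ argument for departures from independence (ξ = 1 recovers independence):

```python
import numpy as np

rng = np.random.default_rng(2)

# Observed Kosovo counts, indexed by (ABA, EXH, HRW, OSCE) inclusion patterns
counts = {}
cols = [(1, 1), (1, 0), (0, 1), (0, 0)]                  # (ABA, EXH) column order
rows = [(1, 1), (1, 0), (0, 1), (0, 0)]                  # (HRW, OSCE) row order
table = [[27, 32, 42, 123], [18, 31, 106, 306],
         [181, 217, 228, 936], [177, 845, 1131, None]]   # None: unobserved cell
for (hrw, osce), row in zip(rows, table):
    for (aba, exh), c in zip(cols, row):
        if c is not None:
            counts[(aba, exh, hrw, osce)] = c

cells = list(counts)
n = sum(counts.values())                                  # 4400

def T(pi_tilde, xi=1.0):
    """pi_0 implied by the 2-list marginal NHOI assumption for (ABA, HRW).

    xi = 1 encodes independence; xi < 1 encodes positive dependence."""
    s = np.zeros((2, 2))                                  # observed (ABA, HRW) marginal masses
    for cell, p in zip(cells, pi_tilde):
        aba, _, hrw, _ = cell
        s[aba, hrw] += p
    r = s[0, 1] * s[1, 0] / s[1, 1] - xi * s[0, 0]
    return r / (xi + r)

# p_C(pi_tilde | n) is Dirichlet(counts + 1) under the flat Dirichlet prior, and
# with the improper scale prior p(N) prop. to 1/N the rejection step always accepts.
N_draws = []
for _ in range(4000):
    pi_tilde = rng.dirichlet([counts[c] + 1 for c in cells])
    pi0 = T(pi_tilde)
    N_draws.append(n + rng.negative_binomial(n, 1 - pi0))  # N - n | n, pi ~ NB(n, pi0)
print(round(np.mean(N_draws)))   # posterior mean, close to the Dirichlet row of Table 3
```

With ξ = 1 the posterior mean lands near the Dirichlet/improper-scale-prior entry of Table 3; setting ξ < 1 reproduces the flavor of the sensitivity analysis.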
Assuming independence of the ABA and HRW lists, under the negative-binomial prior for N and LCM prior for π̃ of Manrique-Vallier (2016), we estimate there were 9359 civilian casualties, with a 95% credible interval of [7967, 11059].
Figure 2: Posterior density of N under each combination of prior for N and π̃, using the marginal no-highest-order interaction assumption.

Table 3: Posterior means and 95% credible intervals for N under each combination of prior for N and π̃, using the marginal no-highest-order interaction assumption.

             Improper Scale Prior    Negative-Binomial
Dirichlet    9536 [8113, 11252]      9540 [8123, 11247]
Log-Linear   9764 [8277, 11549]      9766 [8288, 11550]
LCMCR        9353 [7959, 11063]      9359 [7967, 11059]
Conting      9618 [8224, 11195]      9621 [8232, 11191]

When using the LCM prior for π̃, the posterior distribution for N is not sensitive to our choice of prior for N, as it is essentially the same as when using the "noninformative" improper scale prior for N instead of the negative-binomial prior. When considering the other priors for π̃, the posterior is again not sensitive to the prior for N. Across the different priors for π̃, the posterior summaries are fairly consistent, with the posterior summaries under the LCM prior for π̃ being slightly lower than under the other priors. We note that all of the credible intervals fall within the confidence interval of Spiegel and Salama (2000).

6.5 Sensitivity Analysis

While we believe that it is plausible that the ABA and HRW lists are independent, we would also like to understand how sensitive our resulting estimates are to realistic violations of the assumption. If independence was violated, it would likely be the case that the lists are positively dependent and thus population size estimates under independence are downward biased, as is common in human rights applications (see e.g. the discussion in Section 5 of Lum and Ball, 2015).
In particular, HRW selected regions in Kosovo to conduct interviews based on reports of human rights violations from refugees and other sources (ABA/AAAS, 2000). Thus it seems possible that a casualty appearing in HRW could be more likely to appear in ABA than a casualty that did not appear in HRW.

We now perform a sensitivity analysis probing the 2-list marginal NHOI assumption, focusing on the LCM prior for π̃ of Manrique-Vallier (2016) and the negative-binomial prior for N, for concreteness. We fit the model with the identifying assumption (3), varying ξ over {1, 0.9, 0.8, 0.7}, using the computational approach described in Section 5.2. Thus in each case we are assuming that the odds of appearing in ABA conditional on not appearing in HRW is ξ times the odds of appearing in ABA conditional on appearing in HRW. We emphasize that we only drew samples from p_C(π̃ | n) under the LCM prior for π̃ of Manrique-Vallier (2016) once for both the main analysis in the previous section and the sensitivity analysis described now. The posterior density for N under each identifying assumption is displayed in Figure 3. Posterior means and 95% credible intervals for N under each identifying assumption are displayed in Table 4.

The estimates of the number of casualties N increase as the amount of assumed positive dependence increases, i.e. as ξ decreases, as expected. When ξ = 0.9, the posterior mean is 10155 with a 95% credible interval of [8607, 12038]. When ξ = 0.7, the posterior mean is 12419 with a 95% credible interval of [10451, 14816].

Figure 3: Posterior density of N under each identifying assumption, using the LCM prior for π̃ of Manrique-Vallier (2016) and the negative-binomial prior for N.

Table 4: Posterior means and 95% credible intervals for N under each identifying assumption, using the LCM prior for π̃ of Manrique-Vallier (2016) and the negative-binomial prior for N.

ξ = 1                ξ = 0.9               ξ = 0.8               ξ = 0.7
9359 [7967, 11059]   10155 [8607, 12038]   11147 [9419, 13258]   12419 [10451, 14816]

Discussion
In this article we revisited the framing of MSE as a missing data problem and proposed a Bayesian approach to MSE that places the identifying assumption front and center in the MSE workflow. A natural next step is to develop new explicit identifying assumptions, for situations where the identifying assumptions described in Section 4 cannot be justified in the context of a given data set. We believe that this is an extremely under-researched problem that will hopefully gain attention with the re-framing of MSE we present in this article.

The presentation of MSE in this article was focused on estimating the size of a single population. When the population can be stratified based on observed covariates, such as location or time, it may be desirable to estimate the population sizes within each stratum. In theory, the methodology developed in this article could be applied independently to each stratum. However, stratification can lead to sparse contingency tables, which need significant regularization through the prior for π̃. In this case, it would be desirable to develop observed-data models that borrow strength across strata.

Appendix A: Conditional Identifiability in Models for Heterogeneity
The purpose of this appendix is to show how common models for heterogeneity fit into the model described in Section 2.2, and to provide results regarding conditional identifiability in a particular family of heterogeneous models. The material presented in Sections A.2, A.3, A.4, and A.5 previously appeared in an unpublished preprint written by the first author.

A.1 Models for Heterogeneity
Consider the following heterogeneous model

π_i i.i.d.∼ Q,   x_i | π_i ind.∼ Categorical(π_i),   (A.1)

where π_i = {π_{ih}}_{h∈H} ∈ S_{2^K−1} for i = 1, . . . , N. Under this model each individual has its own set of cell probabilities, π_i, drawn from some mixing distribution Q on S_{2^K−1}. Working with the heterogeneous model in (A.1) is equivalent, after marginalizing out π_i, to working with the complete-data distribution in (1), where π := π_Q = E_Q(π_i) and E_Q denotes the expectation with respect to the mixing distribution Q. This is a consequence of the data only providing information about the first moment of the mixing distribution. Suppose Q is a family of mixing distributions on S_{2^K−1}. For Q ∈ Q, let π_{Q,0} denote the induced unobserved cell probability and π̃_Q denote the induced observed cell probabilities. The parameter space induced by the family Q, as a subset of the observed-data parameterization, can then be written as Ω_Q = {N, π_0, π̃ | N ∈ ℕ, π_0 = π_{Q,0} and π̃ = π̃_Q for some Q ∈ Q}.

The general heterogeneous model in (A.1) captures common models for heterogeneity, including the M_h and M_th models (Otis et al., 1978). The M_th model assumes the individual cell probabilities take the form π_{ih} = Π_{k=1}^K (q_{ik})^{h_k} (1 − q_{ik})^{1−h_k}, where (q_{i1}, · · ·, q_{iK}) i.i.d.∼ Q and Q is a mixing distribution on (0, 1)^K. Under this model, conditional on an individual's sampling probabilities, (q_{i1}, · · ·, q_{iK}), each individual is independently sampled by each list. The M_h model is a submodel of the M_th model that assumes that the sampling probabilities, (q_{i1}, · · ·, q_{iK}), are the same for each list, i.e. q_{i1} = · · · = q_{iK}. Thus the M_h model assumes individuals have the same probability of being sampled by each list.

[Footnote: Aleshin-Guendel, S. (2020). On the Identifiability of Latent Class Models for Multiple-Systems Estimation. arXiv preprint arXiv:2008.09865.]
After marginalizing out π_i, this enforces a symmetry where the probability of appearing in k lists is the same for each subset of k lists. We do not believe this is plausible in human population settings.

A.2 Conditional Identifiability in M_th Models
While there exists a literature characterizing identifiability in M_h models (Huggins, 2001; Link, 2003; Holzmann et al., 2006; Link, 2006), no such results exist for M_th models. The purpose of this section is to provide a mechanism for verifying whether the M_th model P_{Ω_Q} is conditionally identifiable based on moments of the mixing distributions Q ∈ Q, analogously to the results for M_h models presented in Holzmann et al. (2006). Before proving the main theorem of this section, we have the following lemma, which tells us that for any mixing distribution Q on (0, 1)^K, the induced cell probabilities, π_Q, only depend on Q through its mixed moments.

Lemma A.1.
For any h ∈ H*, π_{Q,h} = Σ_{h′∈H*} c_{h,h′} m_{Q,h′}, where c_{h,h′} = (−1)^{Σ_{k=1}^K (h′_k − h_k)} Π_{k=1}^K I(h_k ≤ h′_k) and m_{Q,h′} = E_Q(Π_{k=1}^K q_k^{h′_k}).

Proof. For all h ∈ H*, Π_{k=1}^K q_k^{h_k}(1 − q_k)^{1−h_k} = Σ_{h′∈H*} c_{h,h′} Π_{k=1}^K q_k^{h′_k} by an application of the multi-binomial theorem (a generalization of the binomial theorem). The result follows from taking the expectation over both sides with respect to Q.

We can restate Lemma A.1 in matrix form. Letting π*_Q = (π_{Q,h})_{h∈H*} and m_Q = (m_{Q,h})_{h∈H*}, we have that π*_Q = C m_Q, where C = (c_{h,h′})_{h∈H*, h′∈H*}. C is invertible as it is upper triangular with non-zero diagonal entries. We are now ready to prove Theorem A.1.

Theorem A.1.
For any two distributions Q, R on (0, 1)^K, π̃_Q = π̃_R is equivalent to m_Q = A m_R for some A > 0.

Proof. π̃_Q = π̃_R is equivalent to π*_Q/(1 − π_{Q,0}) = π*_R/(1 − π_{R,0}). Rearranging terms we have that π*_Q = π*_R (1 − π_{Q,0})/(1 − π_{R,0}), and thus π*_Q = A π*_R, where A = (1 − π_{Q,0})/(1 − π_{R,0}) > 0. Using Lemma A.1, this is equivalent to C m_Q = A C m_R, and thus m_Q = A m_R due to the invertibility of C.

The immediate consequence of Theorem A.1 is that to verify conditional identifiability of an M_th model P_{Ω_Q}, one can demonstrate that if m_Q = A m_R for some Q, R ∈ Q, then π_{Q,0} = π_{R,0}. We use this mechanism in the next section to characterize when latent class models (LCMs) are conditionally identifiable.

A.3 Conditional Identifiability of Latent Class Models
We denote the family of mixing distributions corresponding to LCMs with J classes by Q_J = {Q = Σ_{j=1}^J ν_{Q,j} Π_{k=1}^K δ_{q_{Q,jk}} | ν_{Q,j} ≥ 0, Σ_{j=1}^J ν_{Q,j} = 1, q_{Q,jk} ∈ (0, 1)}, so that P_{Ω_{Q_J}} is equivalent to P_{Ω_{LCM,J}} from Section 3.2. To provide necessary and sufficient conditions for P_{Ω_{Q_J}} to be conditionally identifiable, we restrict the family of mixing distributions to Q_J = {Q = Σ_{j=1}^J ν_{Q,j} Π_{k=1}^K δ_{q_{Q,jk}} | ν_{Q,j} ≥ 0, Σ_{j=1}^J ν_{Q,j} = 1, q_{Q,jk} ∈ (0, 1), q_{Q,jk} ≠ q_{Q,j′k} for j ≠ j′}. This restriction makes the mild assumption that each class's sampling probabilities are distinct, which simplifies the proof of Theorem A.2. Loosening this restriction could only make the conditions on J for Q_J to be identifiable stricter, and thus the conclusions we reach in Section A.6 would still stand for families where this restriction is violated.

There are J(K + 1) − 1 free parameters in Q_J; thus when P_{Ω_{Q_J}} is conditionally identifiable, J satisfies J(K + 1) − 1 ≤ 2^K − 2, as the observed cell probabilities, π̃_Q, are (2^K − 2)-dimensional. Theorem A.2 shows that J must satisfy a stricter condition for P_{Ω_{Q_J}} to be conditionally identifiable. In Section A.6 we discuss some limitations of this result.

Theorem A.2. P_{Ω_{Q_J}} is conditionally identifiable iff 2J ≤ K.

Proof. We will first show that if 2J ≤ K, then P_{Ω_{Q_J}} is conditionally identifiable. The proof of this direction is similar in spirit to the proofs of Theorem 2 in Holzmann et al. (2006) and Theorem 1 in Pezzott et al. (2019), which were both concerned with characterizing the identifiability of the M_h analogue of P_{Ω_{Q_J}}. Assume 2J ≤ K, and let Q, R ∈ Q_J such that m_Q = A m_R for some A >
0, so that we have the following system of equations:

Σ_{j=1}^J ν_{Q,j} Π_{k=1}^K q_{Q,jk}^{h_k} − A Σ_{j=1}^J ν_{R,j} Π_{k=1}^K q_{R,jk}^{h_k} = 0   (h ∈ H*).   (A.2)

Let I_Q = {j | q_{Q,j} ∉ (q_{R,1}, . . . , q_{R,J})} and I_R = {j | q_{R,j} ∉ (q_{Q,1}, . . . , q_{Q,J})}, where q_{Q,j} = (q_{Q,j1}, . . . , q_{Q,jK}) and q_{R,j} = (q_{R,j1}, . . . , q_{R,jK}). We can then rewrite (A.2) as

Σ_{j=1}^J y_j Π_{k=1}^K q_{Q,jk}^{h_k} − A Σ_{j∈I_R} ν_{R,j} Π_{k=1}^K q_{R,jk}^{h_k} = 0   (h ∈ H*),   (A.3)

where y_j = ν_{Q,j} if j ∈ I_Q and y_j = ν_{Q,j} − Aν_{R,j′} for some j′ ∈ {1, . . . , J} \ I_R otherwise. Letting m = |I_R| = |I_Q| and labelling the elements of I_R as i_1, . . . , i_m, the system of equations in (A.3) can be written in matrix form as Λy = 0, where Λ is the matrix with rows indexed by h ∈ H* given by

Λ_{h,·} = (Π_{k=1}^K q_{Q,1k}^{h_k}, · · ·, Π_{k=1}^K q_{Q,Jk}^{h_k}, Π_{k=1}^K q_{R,i_1 k}^{h_k}, · · ·, Π_{k=1}^K q_{R,i_m k}^{h_k}),

and y = (y_1, . . . , y_J, −Aν_{R,i_1}, . . . , −Aν_{R,i_m})^T. In Section A.4, we prove that Λ is full rank, and thus y = 0, for any m ∈ {0, . . . , J}. The proof of this direction concludes by examining three possible cases.

Case 1. Suppose m = 0, i.e. for each j ∈ {1, . . . , J}, there exists some j′ ∈ {1, . . . , J} such that q_{Q,j} = q_{R,j′} and ν_{Q,j} = Aν_{R,j′}. As Σ_{j=1}^J ν_{Q,j} = Σ_{j=1}^J ν_{R,j} = 1, this implies that A = 1 and thus π_{Q,0} = π_{R,0}.

Case 2. Suppose m ∈ {1, . . . , J − 1}, i.e. for each j ∈ {1, . . . , J} \ I_Q, there exists some j′ ∈ {1, . . . , J} \ I_R such that q_{Q,j} = q_{R,j′} and ν_{Q,j} = Aν_{R,j′}. Further, for each j ∈ I_Q and j′ ∈ I_R, ν_{Q,j} = ν_{R,j′} = 0. We can thus ignore the classes j ∈ I_Q and j′ ∈ I_R. As Σ_{j=1}^J ν_{Q,j} = Σ_{j=1}^J ν_{R,j} = 1, this implies that A = 1 and thus π_{Q,0} = π_{R,0}.

Case 3. Suppose m = J, i.e. for each j ∈ {1, . . . , J}, there exists no j′ ∈ {1, . . . , J} such that q_{Q,j} = q_{R,j′}. Then ν_{Q,j} = ν_{R,j} = 0 for j ∈ {1, . . . , J}, which is a contradiction.

We will now show that if 2
J > K, then P_{Ω_{Q_J}} is not conditionally identifiable. To do so we will provide explicit Q, R ∈ Q_J such that π_{Q,0} ≠ π_{R,0}, but m_Q = A m_R for A > 0. This counterexample is modified from Tahmasebi et al. (2018), who studied identifiability of families of LCMs outside of the multiple-systems estimation context. Choose J such that 2J > K. For j ∈ {1, . . . , J}, let ν_{Q,j} = (2J choose 2j)/(2^{2J−1} − 1) and ν_{R,j} = (2J choose 2j − 1)/2^{2J−1}. For j ∈ {1, . . . , J} and k ∈ {1, . . . , K}, let q_{Q,jk} = α(2j) and q_{R,jk} = α(2j − 1), where 0 < α < 1/(2J). We thus have that Q, R ∈ Q_J, where clearly Q ≠ R. In Section A.5 we prove that for these choices of Q, R, m_Q = A m_R for A > 0 with A ≠ 1, and thus π_{Q,0} ≠ π_{R,0}.

A.4 Proof that Λ is Full Rank

We will prove that Λ is full rank for any m ∈ {0, . . . , J} by proving a stronger result. Recall that K ≥ 2. Choose x_{ℓk} ∈ (0,
1) for ℓ ∈ {1, . . . , K} and k ∈ {1, . . . , K}, such that x_{ℓk} ≠ x_{ℓ′k} for ℓ ≠ ℓ′. Let X_K denote the matrix with rows indexed by h ∈ H* and entries (X_K)_{h,ℓ} = Π_{k=1}^K x_{ℓk}^{h_k}. We will show that X_K is full rank by induction on K. This implies that Λ is full rank, as J + m ≤ 2J ≤ K by assumption for any m ∈ {0, . . . , J}.

For the base case when K = 2, verifying X_2 is full rank is straightforward. Assume that X_{K−1} is full rank. Let v ∈ R^{K×1} be such that X_K v = 0. For each h ∈ {h′ ∈ H* | h′_K = 0} we have that v_K Π_{k=1}^{K−1} x_{Kk}^{h_k} = −Σ_{ℓ=1}^{K−1} v_ℓ Π_{k=1}^{K−1} x_{ℓk}^{h_k}, which implies that Σ_{ℓ=1}^{K−1} v_ℓ (x_{ℓK} − x_{KK}) Π_{k=1}^{K−1} x_{ℓk}^{h_k} = 0. For ℓ ∈ {1, . . . , K − 1}, let v′_ℓ = v_ℓ(x_{ℓK} − x_{KK}) and v′ = (v′_1, . . . , v′_{K−1}). This leads to the system of equations X_{K−1} v′ = 0. By the inductive assumption, v′ = 0. Since x_{ℓK} ≠ x_{KK} for ℓ ∈ {1, . . . , K − 1}, we have that v_ℓ = 0 for ℓ ∈ {1, . . . , K − 1}, and thus v_K = 0.

A.5 Proof of Counterexample
We will now prove that m_{Q,h} = A m_{R,h} for all h ∈ H*, where A = 2^{2J−1}/(2^{2J−1} − 1) ≠ 1. Define the function h(x) = (1 − e^{αx})^{2J} = Σ_{i=0}^{2J} (2J choose i)(−1)^i e^{αix}. For t ∈ {1, . . . , K}, we can differentiate the series representation of h to find that h^{(t)}(x) = Σ_{i=0}^{2J} (2J choose i)(−1)^i (αi)^t e^{αix} and thus h^{(t)}(x)|_{x=0} = Σ_{i=0}^{2J} (2J choose i)(−1)^i (αi)^t = Σ_{i=1}^{2J} (2J choose i)(−1)^i (αi)^t. We can alternatively differentiate the non-series representation of h using the fact that 1 ≤ t ≤ K < 2J and the chain rule for higher-order derivatives to find that h^{(t)}(x)|_{x=0} = 0. Let h ∈ H* and t = Σ_{k=1}^K h_k ∈ {1, . . . , K}. The desired result follows as

m_{Q,h} − A m_{R,h} = Σ_{j=1}^J ν_{Q,j} Π_{k=1}^K q_{Q,jk}^{h_k} − A Σ_{j=1}^J ν_{R,j} Π_{k=1}^K q_{R,jk}^{h_k}
= Σ_{j=1}^J (2J choose 2j)(2^{2J−1} − 1)^{−1} Π_{k=1}^K {α(2j)}^{h_k} − A Σ_{j=1}^J (2J choose 2j − 1)(2^{2J−1})^{−1} Π_{k=1}^K {α(2j − 1)}^{h_k}
= (2^{2J−1} − 1)^{−1} Σ_{i=1}^{2J} (2J choose i)(−1)^i (αi)^t
= (2^{2J−1} − 1)^{−1} {h^{(t)}(x)|_{x=0}} = 0.

A.6 Limitations of Theorem A.2
Theorem A.2 shows that P_{Ω_{Q_J}} is not conditionally identifiable if 2J > K by counterexample, by demonstrating two mixing distributions Q, R ∈ Q_J where π̃_Q = π̃_R but π_{Q,0} ≠ π_{R,0}. Within each latent class of Q and R, the sampling probabilities were the same, meaning Q and R can be seen as mixing distributions of an M_h model. It would be interesting in future work to see whether further restrictions on Ω_{Q_J}, for example restrictions not allowing the sampling probabilities within latent classes to be equal, lead to different results concerning conditional identifiability. Another interesting route would be to see whether results concerning generic identifiability of latent class models (Allman et al., 2009) could be applied to the multiple-systems estimation setting.

However, this does not mean Theorem A.2 is not a practically useful result. Theorem A.2 provides assumptions under which we have formal statistical guarantees for when we can estimate the parameters in P_{Ω_{Q_J}}: the parameters of P_{Ω_{Q_J}} can be consistently estimated if 2J ≤ K. When 2J > K we currently have no such guarantees. In Appendix C we demonstrate this reality across a variety of simulation studies.
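The counterexample from Sections A.3 and A.5 can be checked numerically. The sketch below takes J = 2 and K = 3 (so that 2J > K) with the illustrative choice α = 0.2 < 1/(2J), and verifies that the mixed moments of Q and R agree up to the constant A, that the observed cell probabilities coincide, and that the unobserved cell probabilities differ:

```python
import numpy as np
from itertools import product
from math import comb

J, K, alpha = 2, 3, 0.2          # 2J = 4 > K = 3, and alpha < 1/(2J)

# Class weights and (list-constant) sampling probabilities from Section A.5
nu_Q = np.array([comb(2*J, 2*j) for j in range(1, J+1)]) / (2**(2*J-1) - 1)
nu_R = np.array([comb(2*J, 2*j-1) for j in range(1, J+1)]) / 2**(2*J-1)
q_Q = np.array([alpha * 2*j for j in range(1, J+1)])
q_R = np.array([alpha * (2*j - 1) for j in range(1, J+1)])
A = 2**(2*J-1) / (2**(2*J-1) - 1)

def cell_probs(nu, q):
    """Cell probabilities of the LCM mixture over h in {0,1}^K."""
    probs = {}
    for h in product([0, 1], repeat=K):
        p = 0.0
        for j in range(len(nu)):
            p += nu[j] * np.prod([q[j]**hk * (1 - q[j])**(1 - hk) for hk in h])
        probs[h] = p
    return probs

pQ, pR = cell_probs(nu_Q, q_Q), cell_probs(nu_R, q_R)
zero = (0,) * K
piQ0, piR0 = pQ[zero], pR[zero]

# Mixed moments agree up to the constant A ...
for t in range(1, K + 1):
    assert abs(np.sum(nu_Q * q_Q**t) - A * np.sum(nu_R * q_R**t)) < 1e-12
# ... so the observed cell probabilities coincide,
for h in product([0, 1], repeat=K):
    if h != zero:
        assert abs(pQ[h]/(1 - piQ0) - pR[h]/(1 - piR0)) < 1e-12
# ... while the unobserved cell probabilities differ.
assert abs(piQ0 - piR0) > 0.05
print(round(piQ0, 3), round(piR0, 3))   # prints: 0.186 0.288
```

Any fitting procedure that matches the observed cells therefore cannot distinguish π_{Q,0} ≈ 0.186 from π_{R,0} = 0.288, which is the failure of conditional identifiability that the simulations of Appendix C exhibit empirically.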
Appendix B: Identifying Assumption Derivations
The purpose of this appendix is to derive the identifying assumptions associated with the no-highest-order interaction assumption and the K′-list marginal no-highest-order interaction assumption.

B.1 Derivation for No-Highest-Order Interaction Assumption

Recall from Section 3.1 that we have the following relationship between the cell probabilities and the highest-order interaction, λ: Π_{h∈H} π_h^{I_odd(h)} / Π_{h∈H} π_h^{I_even(h)} = exp{(−1)^{I_even(0)} λ}, where I_odd(h) = I(Σ_{k=1}^K 1 − h_k is odd) and I_even(h) = I(Σ_{k=1}^K 1 − h_k is even). Suppose we fix λ ∈ R, or equivalently ξ = exp{(−1)^{I_even(0)} λ} ∈ R_+. Under this assumption we have that Π_{h∈H} π_h^{I_odd(h)} / Π_{h∈H} π_h^{I_even(h)} = ξ. Multiplying the left-hand side by 1 = {(1 − π_0)/(1 − π_0)}^{2^{K−1}}, we find that Π̃_odd / {[π_0/(1 − π_0)] Π̃_even} = ξ, where Π̃_odd = Π_{h∈H*} π̃_h^{I_odd(h)} and Π̃_even = Π_{h∈H*} π̃_h^{I_even(h)}. Rearranging terms and solving for π_0, we find that the assumption that ξ is a fixed value corresponds to the explicit functional relationship

T(π̃) = (Π̃_odd/Π̃_even) / (ξ + Π̃_odd/Π̃_even).   (B.1)

The identifying assumption corresponding to the no-highest-order interaction assumption is recovered by setting λ = 0, or equivalently ξ = 1: T(π̃) = (Π̃_odd/Π̃_even)/(1 + Π̃_odd/Π̃_even). The observed-data distribution is not restricted by the assumption that the highest-order interaction is fixed, and thus models that use this assumption without any extra assumptions regarding the observed cell probabilities are nonparametrically identified.

B.2 Derivation for K′-list Marginal No-Highest-Order Interaction Assumption

Suppose we assume that Π_{g∈G} π_{g+}^{I_odd(g)} / Π_{g∈G} π_{g+}^{I_even(g)} = ξ, where ξ ∈ R_+ is fixed. Multiplying the left-hand side by 1 = {(1 − π_0)/(1 − π_0)}^{2^{K′−1}}, we find that Π̃_odd,+ / {[π_0/(1 − π_0) + π̃_{0+}] Π̃_even,+} = ξ, where Π̃_odd,+ = Π_{g∈G*} π̃_{g+}^{I_odd(g)} and Π̃_even,+ = Π_{g∈G*} π̃_{g+}^{I_even(g)}. Rearranging terms and solving for π_0, we find that the assumption that ξ is a fixed value corresponds to the explicit functional relationship

T(π̃) = (Π̃_odd,+/Π̃_even,+ − ξπ̃_{0+}) / {ξ + (Π̃_odd,+/Π̃_even,+ − ξπ̃_{0+})}.   (B.2)

The identifying assumption corresponding to the K′-list marginal no-highest-order interaction assumption is recovered by setting ξ = 1: T(π̃) = (Π̃_odd,+/Π̃_even,+ − π̃_{0+})/(1 + Π̃_odd,+/Π̃_even,+ − π̃_{0+}).

As noted in Section 4.3, the K′-list marginal no-highest-order interaction assumption does not imply that there is no highest-order interaction for all K lists, as Π_{h∈H} π_h^{I_odd(h)} / Π_{h∈H} π_h^{I_even(h)} = (Π̃_odd/Π̃_even) × (Π̃_odd,+/Π̃_even,+ − π̃_{0+})^{−1} ≠ 1 in general.

Appendix C: Latent Class Model Simulations
The purpose of this appendix is to conduct simulation studies demonstrating the practical implications of Theorem A.2. In particular, we present a variety of simulations exploring the frequentist properties of the Bayesian LCM of Manrique-Vallier (2016). In each example we generate 200 data sets from the model in (A.1) for a given number of lists K and a fixed parameter setting of θ ∈ Ω_{Q_J}, i.e. a fixed population size N and a J-class LCM Q ∈ Q_J. For all examples we will use N ∈ {2000, 10000, 100000}. For each simulated data set, we fit the Bayesian LCM of Manrique-Vallier (2016) as implemented in the R package LCMCR, using J latent classes (i.e. the same number that generated the data) and the default prior for ν, by running the Gibbs sampler implemented in LCMCR for 250,000 iterations, with the first 50,000 tossed for burn-in. We note that LCMCR uses the improper scale prior for N, i.e. p(N) ∝ 1/N, and a flat prior for q, i.e. q_jk i.i.d.∼ Unif(0, 1). For each parameter setting θ ∈ Ω_{Q_J} we examine the frequentist performance of the posterior median, 95% credible interval, and 50% credible interval for estimating the unobserved cell probability, π_0, through the sample mean of the posterior medians, the sample coverage of the 95% credible intervals, the sample mean of the 95% credible interval widths over the 200 replications, the sample coverage of the 50% credible intervals, and the sample mean of the 50% credible interval widths over the 200 replications.

C.1 Example 1
In this example we consider data from K = 2 lists generated from the two-class LCM Q_a with parameters given in Table 5. We also consider a second two-class LCM, Q_b, with parameters given in Table 5, constructed such that π̃_{Q_a} = π̃_{Q_b} but π_{Q_a,0} ≠ π_{Q_b,0}. Since P_{Ω_Q} is not conditionally identified when K = 2, if we try to perform estimation within P_{Ω_Q}, which contains the true data generating model, there is no guarantee that we can estimate well the cell probabilities and population size which generated the data. This example was constructed using the counterexample used to prove Theorem A.2.

Table 5: Parameters of two latent class models, Q_a and Q_b (rounded for presentation)
      ν₁   ν₂   q₁₁   q₁₂   q₂₁   q₂₂
Q_a
Q_b

The results of the simulation using data generated from the LCM Q_a are presented in Table 6. We see that the posterior median has a negative bias that does not vanish as N increases. One may have expected the posterior median to instead be estimating π_{Q_b,0} = 0.
219, since Q_a and Q_b induce the same observed-data distribution. However, the posterior median is also negatively biased for estimating π_{Q_b,0}, which suggests there are other LCMs in Q that induce very similar observed-data distributions to Q_a and Q_b but with different induced unobserved cell probabilities. While the 95% credible interval has nominal coverage when N = 2000, as N increases, coverage decreases and is no longer nominal. The 50% credible intervals have essentially 0 coverage for all settings of N, even for N = 2000 where the 95% credible interval has nominal coverage. This suggests the 95% credible interval only has nominal coverage at N = 2000 due to the wide tails of the posterior for N.

Table 6: Results of the simulation study where data was generated from the two-class latent class model Q_a.
N       Mean Posterior Median   95% CI Coverage   Mean 95% CI Width   50% CI Coverage   Mean 50% CI Width
2000    0.148                   0.955             0.332               0.000             0.029
10000   0.146                   0.730             0.316               0.000             0.023
100000  0.151                   0.265             0.167               0.055             0.037
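The flavor of that counterexample can be reproduced numerically. The following sketch (using hypothetical parameter values, not those of Table 5) constructs two two-class M_h LCMs for K = 2 lists that induce exactly the same observed-data distribution but different unobserved cell probabilities, so no amount of observed data can distinguish them.

```python
import numpy as np

def mh_lcm_cells(nu, q):
    """Cell probabilities of a two-list M_h LCM, where class j has a single
    inclusion probability q[j] shared by both lists."""
    nu, q = np.asarray(nu), np.asarray(q)
    p11 = float(np.sum(nu * q * q))
    p10 = p01 = float(np.sum(nu * q * (1 - q)))
    p00 = float(np.sum(nu * (1 - q) ** 2))  # unobserved cell probability pi_0
    return p00, p01, p10, p11

# Model A: hypothetical parameters
p00A, p01A, p10A, p11A = mh_lcm_cells([0.5, 0.5], [0.2, 0.8])
r = p11A / p01A  # since p01 = p10, this ratio pins down the observed-data distribution

# Model B: different inclusion probabilities; choose the class weight so that
# p11 / p01 matches model A, i.e. solve nu * a + (1 - nu) * b = 0 for nu
qB = np.array([0.3, 0.9])
a, b = qB ** 2 - r * qB * (1 - qB)
nuB = b / (b - a)
p00B, p01B, p10B, p11B = mh_lcm_cells([nuB, 1 - nuB], qB)

obsA = np.array([p01A, p10A, p11A]) / (1 - p00A)  # normalized observed cells
obsB = np.array([p01B, p10B, p11B]) / (1 - p00B)
# obsA and obsB agree, yet p00A and p00B differ
```

Any estimator that sees only the observed cells must return the same answer for both models, even though they imply different unobserved cell probabilities and hence different population sizes.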
C.2 Example 2
One may object to the practicality of Example 1, as it examined a two-class LCM constructed using the counterexample from the proof of Theorem A.2, and is thus an M_h LCM. We therefore now consider the following example. Manrique-Vallier (2016) presented a simulation study with K = 5 lists where data was generated from an LCM with J = 2 classes, which we reproduce in Table 7. The parameters of this LCM were based on a hypothetical population where a small proportion of people have a high probability of being observed, and a large proportion of people have a small probability of being observed, which is plausible in some human rights applications.

Table 7: Parameters of the latent class model which generated data in the simulation of Manrique-Vallier (2016).
                  Sampling probabilities, q
Class   ν       List 1   List 2   List 3   List 4   List 5
1       0.900   0.033    0.033    0.099    0.132    0.033
2       0.100   0.660    0.825    0.759    0.990    0.693

Suppose we only observed lists three and four, so that we have data from K = 2 lists generated from the two-class LCM Q with parameters given in Table 8. Under Q, π_{Q,0} = 0.704. Since P_{Ω_Q} is not conditionally identified when K = 2, if we try to perform estimation within P_{Ω_Q}, which contains the true data generating model, there is no guarantee that we can estimate well the cell probabilities and population size which generated the data. The results of the simulation using data generated from the LCM Q are presented in Table 9. We see that the posterior median has a large negative bias that does not vanish as N increases, while the mean 95% and 50% credible interval widths decrease as N increases. Further, the 95% and 50% credible intervals have essentially 0 coverage across all N.

Table 8: Parameters of latent class model Q
                  Sampling probabilities, q
Class   ν       List 3   List 4
1       0.900   0.099    0.132
2       0.100   0.759    0.990

Table 9: Results of the simulation study where data was generated from the two-class latent class model Q. Truth is π_{Q,0} = 0.704.
N       Mean Posterior Median   95% CI Coverage   Mean 95% CI Width   50% CI Coverage   Mean 50% CI Width
2000    0.285                   0.000             0.408               0.000             0.055
10000   0.283                   0.010             0.401               0.000             0.036
100000  0.285                   0.030             0.256               0.000             0.035

C.3 Example 3
In this example we present two more frequentist simulation studies based on only observing a subset of the five lists from the simulation of Manrique-Vallier (2016).

First suppose that we only observe lists two, three, and four from the simulation of Manrique-Vallier (2016), so that we have data from K = 3 lists generated from the two-class LCM Q_a with parameters given in Table 10. Under Q_a, π_{Q_a,0} = 0.681. Since P_{Ω_Q} is not conditionally identified when K = 3, if we try to perform estimation within P_{Ω_Q}, which contains the true data generating model, there is no guarantee that we can estimate well the cell probabilities and population size which generated the data. The results of the simulation using data generated from the LCM Q_a are presented in Table 11. We see that the posterior median has a slight negative bias that becomes negligible as N increases. The 95% credible intervals have over-coverage across the different settings of N. The 50% credible intervals have nominal coverage when N = 2000, but over-coverage as N increases.

Table 10: Parameters of latent class model Q_a
                  Sampling probabilities, q
Class   ν       List 2   List 3   List 4
1       0.900   0.033    0.099    0.132
2       0.100   0.825    0.759    0.990

Table 11: Results of the simulation study where data was generated from the two-class latent class model Q_a. Truth is π_{Q_a,0} = 0.681.
N       Mean Posterior Median   95% CI Coverage   Mean 95% CI Width   50% CI Coverage   Mean 50% CI Width
2000    0.622                   1.000             0.339               0.510             0.120
10000   0.667                   1.000             0.274               0.800             0.091
100000  0.682                   1.000             0.209               0.965             0.074

Suppose now we only observe lists two, three, four, and five from the simulation of Manrique-Vallier (2016), so that we have data from K = 4 lists generated from the two-class LCM Q_b with parameters given in Table 12. Under Q_b, π_{Q_b,0} = 0.658. Since P_{Ω_Q} is conditionally identified when K = 4, we know that, since P_{Ω_Q} contains the true data generating model, we can consistently estimate the cell probabilities and population size which generated the data.
The results of the simulation using data generated from the LCM Q_b are presented in Table 13. We see that the posterior median has a negative bias that becomes negligible as N increases, as expected. The 95% and 50% credible intervals have slight under-coverage when N = 2000, which becomes nominal as N increases.

Table 12: Parameters of latent class model Q_b
                  Sampling probabilities, q
Class   ν       List 2   List 3   List 4   List 5
1       0.900   0.033    0.099    0.132    0.033
2       0.100   0.825    0.759    0.990    0.693

Table 13: Results of the simulation study where data was generated from the two-class latent class model Q_b. Truth is π_{Q_b,0} = 0.658.
N       Mean Posterior Median   95% CI Coverage   Mean 95% CI Width   50% CI Coverage   Mean 50% CI Width
2000    0.631                   0.915             0.190               0.445             0.065
10000   0.653                   0.940             0.089               0.505             0.031
100000  0.658                   0.955             0.028               0.485             0.010
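As a quick check on this example, the unobserved cell probability implied by the parameters in Table 12 can be computed directly from π₀ = Σ_j ν_j ∏_k (1 − q_jk):

```python
import numpy as np

# Parameters of Q_b from Table 12 (lists 2-5 of the Manrique-Vallier (2016) simulation)
nu = np.array([0.900, 0.100])
q = np.array([[0.033, 0.099, 0.132, 0.033],
              [0.825, 0.759, 0.990, 0.693]])

# Probability of the all-zero capture history:
# pi_0 = sum_j nu_j * prod_k (1 - q_jk)
pi0 = float(np.sum(nu * np.prod(1 - q, axis=1)))
print(round(pi0, 3))  # -> 0.658
```

This agrees with the posterior medians in Table 13, which approach 0.658 as N grows, as conditional identification leads us to expect.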
C.4 Example 4
In this example we present three more frequentist simulation studies based on adding a third class to the LCM from the simulation study of Manrique-Vallier (2016), representing a small proportion of the population having a probability of being observed somewhere between the other two classes. The parameters of this new LCM are given in Table 14.

Table 14: Parameters of the latent class model which generated data in the simulation of Manrique-Vallier (2016), with a third class added.
                  Sampling probabilities, q
Class   ν       List 1   List 2   List 3   List 4   List 5
1       0.700   0.033    0.033    0.099    0.132    0.033
2       0.200   0.275    0.250    0.200    0.300    0.325
3       0.100   0.660    0.825    0.759    0.990    0.693

First suppose that we only observe lists two, three, and four from the LCM in Table 14, so that we have data from K = 3 lists generated from the three-class LCM Q_a with parameters given in Table 15. Under Q_a, π_{Q_a,0} = 0.613. Since P_{Ω_Q} is not conditionally identified when K = 3, if we try to perform estimation within P_{Ω_Q}, which contains the true data generating model, there is no guarantee that we can estimate well the cell probabilities and population size which generated the data. The results of the simulation using data generated from the LCM Q_a are presented in Table 16. We see that the posterior median has a negative bias that does not vanish as N increases. The 95% credible intervals have over-coverage across the different settings of N, while the 50% credible intervals have under-coverage across the different settings of N. Similar to Example 1 in Section C.1, this suggests the 95% credible intervals only have over-coverage due to the wide tails of the posterior for N.
Table 15: Parameters of latent class model Q_a
                  Sampling probabilities, q
Class   ν       List 2   List 3   List 4
1       0.700   0.033    0.099    0.132
2       0.200   0.250    0.200    0.300
3       0.100   0.825    0.759    0.990

Table 16: Results of the simulation study where data was generated from the three-class latent class model Q_a. Truth is π_{Q_a,0} = 0.613.
N       Mean Posterior Median   95% CI Coverage   Mean 95% CI Width   50% CI Coverage   Mean 50% CI Width
2000    0.524                   1.000             0.387               0.210             0.119
10000   0.537                   1.000             0.364               0.150             0.102
100000  0.538                   1.000             0.323               0.175             0.096

Next suppose that we only observe lists two, three, four, and five from the LCM in Table 14, so that we have data from K = 4 lists generated from the three-class LCM Q_b with parameters given in Table 17. Under Q_b, π_{Q_b,0} = 0.569. Since P_{Ω_Q} is not conditionally identified when K = 4, if we try to perform estimation within P_{Ω_Q}, which contains the true data generating model, there is no guarantee that we can estimate well the cell probabilities and population size which generated the data. The results of the simulation using data generated from the LCM Q_b are presented in Table 18. We see that the posterior median has a negative bias that decreases as N increases. While the 95% and 50% credible intervals do not have nominal coverage, coverage improves as N increases (but is still far from nominal even when N = 100000).

Table 17: Parameters of latent class model Q_b
                  Sampling probabilities, q
Class   ν       List 2   List 3   List 4   List 5
1       0.700   0.033    0.099    0.132    0.033
2       0.200   0.250    0.200    0.300    0.325
3       0.100   0.825    0.759    0.990    0.693

Table 18: Results of the simulation study where data was generated from the three-class latent class model Q_b. Truth is π_{Q_b,0} = 0.569.
N       Mean Posterior Median   95% CI Coverage   Mean 95% CI Width   50% CI Coverage   Mean 50% CI Width
2000    0.469                   0.525             0.199               0.090             0.065
10000   0.509                   0.630             0.128               0.120             0.041
100000  0.519                   0.695             0.066               0.290             0.023

Next suppose that we observe all five lists from the LCM in Table 14, so that we have data from K = 5 lists generated from the three-class LCM which we will refer to as Q_c. Under Q_c, π_{Q_c,0} = 0.536. Since P_{Ω_Q} is not conditionally identified when K = 5, if we try to perform estimation within P_{Ω_Q}, which contains the true data generating model, there is no guarantee that we can estimate well the cell probabilities and population size which generated the data. The results of the simulation using data generated from the LCM Q_c are presented in Table 19. We see that the posterior median has a negative bias that decreases as N increases. While the 95% and 50% credible intervals do not have nominal coverage, coverage improves as N increases.

Table 19: Results of the simulation study where data was generated from the three-class latent class model Q_c. Truth is π_{Q_c,0} = 0.536.
N       Mean Posterior Median   95% CI Coverage   Mean 95% CI Width   50% CI Coverage   Mean 50% CI Width
2000    0.435                   0.415             0.169               0.050             0.055
10000   0.500                   0.875             0.132               0.370             0.044
100000  0.523                   0.895             0.053               0.490             0.018
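The data-generating mechanism used throughout these examples can be sketched directly. The following is an illustrative Python translation that draws capture histories from the three-class LCM of Table 14 over all five lists (the fits in this appendix use the R package LCMCR, which is not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(0)

# Three-class LCM of Table 14 (all five lists)
nu = np.array([0.700, 0.200, 0.100])
q = np.array([[0.033, 0.033, 0.099, 0.132, 0.033],
              [0.275, 0.250, 0.200, 0.300, 0.325],
              [0.660, 0.825, 0.759, 0.990, 0.693]])

def simulate_lcm(N, nu, q, rng):
    """Draw capture histories for N individuals and keep only the observed ones."""
    z = rng.choice(len(nu), size=N, p=nu)                 # latent class memberships
    h = (rng.random((N, q.shape[1])) < q[z]).astype(int)  # list-inclusion indicators
    return h[h.sum(axis=1) > 0]                           # drop the all-zero histories

# Unobserved cell probability: pi_0 = sum_j nu_j * prod_k (1 - q_jk)
pi0 = float(np.sum(nu * np.prod(1 - q, axis=1)))

N = 100000
observed = simulate_lcm(N, nu, q, rng)
# len(observed) / N should be close to 1 - pi0
```

The fraction of simulated individuals who appear on at least one list concentrates around 1 − π₀, which is why only the observed histories carry information about π₀ through an identifying assumption.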
C.5 Takeaways
When using the model P_{Ω_{Q_J}} for multiple-systems estimation, one is relying on the assumption that the data was generated from a distribution in P_{Ω_{Q_J}}. If a practitioner is comfortable with the assumption that 2J ≤ K, then we know the model is conditionally identified, and thus this assumption is a combination of an explicit identifying assumption (which is currently unknown) and possibly some restrictions on the observed-data distribution. Due to conditional identification, practitioners have guarantees under this assumption that they can estimate the population size, and other parameters, well if their observed sample size n is large enough. However, if a practitioner is not comfortable with this assumption, and chooses to use J > K/2, they have no such guarantees, as they are using a model that is not conditionally identified.

Through the four example simulation studies in this appendix we saw examples where models that were not conditionally identified had good frequentist performance (Q_a, Q_a) and bad frequentist performance (Q, Q, Q_b, Q_c) according to some of our simulation summary measures. The good and bad frequentist performances could have been due to

• where the prior of Manrique-Vallier (2016) places mass in the parameter space Ω_{Q_J} (e.g. good frequentist performance if it places enough prior mass around the true data generating parameters),

• whether there actually exist other LCMs in Q_J that induce similar observed cell probabilities to the true data generating parameters but a different unobserved cell probability (e.g. good frequentist performance if no other LCMs exist with these properties),

• or some combination of the two previous factors.

We currently have no way to tease apart these factors and tell when a model that is not conditionally identified will have good or bad performance. This is a problem for using these models in practice, as we have no way to tell practitioners "under these assumptions the model will perform well".

We believe there are two routes forward to combat this problem, if one wants to use LCMs for multiple-systems estimation. The first option is to further study technical results for conditional identification in LCMs. For example, as we discussed in Section A.6, suppose we can prove, under further (practically relevant) restrictions on Q_J, that P_{Ω_{Q_J}} is conditionally identified for some J > K/2. We would then be able to expand the range of models we could fit under which we had guarantees that we could estimate the parameters of the model well.

The other option is to study LCMs through the framework of partial identification (Tamer, 2010; Gustafson, 2010), which was recently used in multiple-systems estimation by Sun et al. (2020) for frequentist inference for partially-identified log-linear models. This would require both: 1) a better technical understanding of which parameters, or functions of parameters, of LCMs are not identified, and 2) placing substantively meaningful priors on the non-identified parameters (i.e. priors informed by substantive knowledge concerning the population of interest and how the data was collected) if a Bayesian approach is taken. Without 1), the best we can do in a Bayesian approach is to place substantively meaningful priors on all LCM parameters, i.e. on Ω_{Q_J}. The prior for Ω_{Q_J} of Manrique-Vallier (2016) is based on the Dirichlet Process prior specification of Dunson and Xing (2009), which is a prior of technical convenience. Specifying a substantively meaningful prior for Ω_{Q_J} would require being able to specify a prior for the class membership probabilities ν and for the class-specific observation probabilities q. It is difficult to imagine a scenario in which a practitioner would have knowledge of the population of interest and how the data was collected that could be incorporated into priors for all J(K + 1) − 1 parameters (ν and q).

While we do not believe that latent class models cannot be used for multiple-systems estimation (see our application in Section 6, where we use the LCM prior of Manrique-Vallier (2016) to induce a prior for the observed cell probabilities π̃), we do believe that there needs to be further research to understand under what assumptions LCMs do and do not perform well in practice. We discuss one further area of research before concluding this section.
This section began by assuming that a practitioner assumed their data was generated by a distribution in P_{Ω_{Q_J}}. It is not clear to the authors how in practice one would choose a specific value of J. In practice, how would a practitioner choose between P_{Ω_{Q_J}} and P_{Ω_{Q_{J′}}} for J ≠ J′? What characteristics of the population being studied and the data collection process would allow one to differentiate between these two models? Research into understanding how to elicit plausible values of J would help to justify the use of the model P_{Ω_{Q_J}} in practice.

Appendix D: Kosovo Analysis Using the No-Highest-Order Interaction Assumption
The purpose of this appendix is to repeat the analysis of the Kosovo data set from Section 6 using theno-highest-order interaction assumption. We first discuss the implications of this identifying assumption inthe context of the Kosovo data set before performing the analysis.
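Before turning to the details, it may help to see the mechanics of the identifying assumptions used in this appendix in code. The following sketch (using the odd/even convention of Appendix B and hypothetical counts, not the Kosovo data) computes a plug-in estimate of the unobserved count for K = 4 lists under the ξ-indexed family of assumptions, where ξ = 1 corresponds to no highest-order interaction:

```python
from itertools import product

def estimate_unobserved(counts, xi=1.0):
    """Plug-in estimate of the unobserved count n_0 for K = 4 lists.

    `counts` maps each nonzero inclusion pattern h (a 0/1 tuple of length 4)
    to its observed count n_h.  The identifying assumption sets the ratio of
    the product of cell probabilities over patterns with an odd number of
    inclusions to the product over patterns with an even number (including
    the all-zero cell) equal to xi; xi = 1 is no highest-order interaction.
    """
    odd = even = 1.0
    for h in product([0, 1], repeat=4):
        if sum(h) == 0:
            continue                      # the unobserved cell, solved for below
        if sum(h) % 2 == 1:
            odd *= counts[h]
        else:
            even *= counts[h]
    # xi = odd / (n_0 * even)  =>  n_0 = odd / (xi * even); the two products
    # have 8 and 7 factors, so the estimate correctly scales like a count.
    return odd / (xi * even)
```

With observed total n, the implied population size estimate is n plus the returned value. Increasing ξ pulls the estimate down, matching the pattern in Table 21 below, where the posterior mean for N decreases as ξ increases.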
D.1 The No-Highest-Order Interaction Assumption
The Kosovo data set has K = 4 lists, which we will order (without loss of generality) so that the American Bar Association Central and East European Law Initiative (ABA) list is first, the Human Rights Watch (HRW) list is second, the Organization for Security and Cooperation in Europe (OSCE) list is third, and the list constructed from exhumation reports conducted on behalf of the International Criminal Tribunal for the Former Yugoslavia (EXH) is fourth. Let Odds(h₁ = 1 | h₂ = 1, h₃, h₄) = π_(1,1,h₃,h₄)/π_(0,1,h₃,h₄) denote the odds that an individual is observed in list 1, conditional on being observed in list 2 and the inclusion patterns h₃, h₄ for lists 3 and 4. For example, if h₃ = 0 and h₄ = 1, Odds(h₁ = 1 | h₂ = 1, h₃ = 0, h₄ = 1) is the odds that an individual is observed in list 1, conditional on being observed in lists 2 and 4 and not being observed in list 3. Similarly let Odds(h₁ = 1 | h₂ = 0, h₃, h₄) = π_(1,0,h₃,h₄)/π_(0,0,h₃,h₄) denote the odds that an individual is observed in list 1, conditional on not being observed in list 2 and the inclusion patterns h₃, h₄ for lists 3 and 4. We can then define OR(h₃, h₄) = Odds(h₁ = 1 | h₂ = 1, h₃, h₄)/Odds(h₁ = 1 | h₂ = 0, h₃, h₄) as the odds ratio for lists 1 and 2, conditional on the inclusion patterns h₃, h₄ for lists 3 and 4. Following Section 4.1, the no-highest-order interaction assumption assumes that OR(1, 0)/OR(0, 0) = OR(1, 1)/OR(0, 1): the effect of list 3 on the conditional odds ratio for lists 1 and 2 is the same whether or not an individual is observed in list 4.

D.2 Main Analysis
For each combination of prior for N and π̃ described in Sections 6.2 and 6.3, we fit the corresponding model using the computational approach described in Section 5.2, using the no-highest-order interaction assumption. In particular, we emphasize that for each prior for π̃, we only drew samples from the corresponding p_C(π̃ | n) once. The posterior density for N under each prior combination is displayed in Figure 4. Posterior means and 95% credible intervals for N under each prior combination are displayed in Table 20. Assuming that there is no highest-order interaction, under the negative-binomial prior for N and the LCM prior for π̃ of Manrique-Vallier (2016), we estimate there were 14071 civilian casualties, with a 95% credible interval of [9321, 21604].

Figure 4: Posterior density of N under each combination of prior for N (improper scale prior and negative-binomial) and π̃ (Dirichlet, log-linear, LCMCR, and conting), using the no-highest-order interaction assumption.

When using the LCM prior for π̃, the posterior distribution for N is somewhat sensitive to our choice of prior for N: under the "noninformative" improper scale prior for N, the posterior mean increases from 14071 to 14695, and the upper limit of the credible interval increases from 21604 to 23675. When considering the other priors for π̃, the posterior distribution for N is again somewhat sensitive to the choice of prior for N, with the posterior mean and credible interval limits larger under the improper scale prior than under the negative-binomial prior. Across the different priors for π̃, the posteriors corresponding to the Dirichlet prior, the log-linear model prior, and the LCM prior of Manrique-Vallier (2016) are in relative agreement. The posterior corresponding to the Dirichlet prior is the most diffuse of the three, and the posterior corresponding to the LCM prior of Manrique-Vallier (2016) is the most concentrated of the three. The posterior corresponding to the log-linear model prior of King and Brooks (2001), implemented in the conting package, is multimodal, which is not unexpected as it is performing Bayesian model averaging (Hoeting et al., 1999) over all hierarchical log-linear models. Due to this multimodality, point estimates (e.g. the posterior mean) may not be reliable summaries of the posterior distribution.
We note that all of the credible intervals contain the point estimate of Spiegel and Salama (2000).

Table 20: Posterior means and 95% credible intervals for N under each combination of prior for N and π̃, using the no-highest-order interaction assumption.
            Improper Scale Prior    Negative-Binomial
Dirichlet   18500 [9402, 35908]     16051 [9098, 27679]
Log-Linear  16209 [8731, 30025]     14719 [8579, 24878]
LCMCR       14695 [9423, 23675]     14071 [9321, 21604]
Conting     13000 [9202, 19971]     12694 [9175, 19299]

D.3 A Sensitivity Analysis Probing the Identifying Assumption
We now perform a sensitivity analysis probing the no-highest-order interaction assumption, focusing for concreteness on the LCM prior for π̃ of Manrique-Vallier (2016) and the negative-binomial prior for N. We fit the model with the identifying assumption described in Section 4.2, varying ξ over {1/2, 2/3, 1, 3/2, 2} (following Gerritse et al., 2015), using the computational approach described in Section 5.2. We emphasize that we only drew samples from p_C(π̃ | n) under the LCM prior for π̃ of Manrique-Vallier (2016) once for both the main analysis in the previous section and the sensitivity analysis described now. This sensitivity analysis is limited in that we followed Gerritse et al. (2015) and chose an arbitrary range of values for ξ around 1. Due to the difficulty in interpreting the highest-order interaction when there are K = 4 lists, we are not able to say with confidence whether this range of values is meaningful or not. The posterior density for N under each identifying assumption is displayed in Figure 5. Posterior means and 95% credible intervals for N under each identifying assumption are displayed in Table 21.

Table 21: Posterior means and 95% credible intervals for N under each identifying assumption, using the LCM prior for π̃ of Manrique-Vallier (2016) and the negative-binomial prior for N.
ξ = 1/2               ξ = 2/3               ξ = 1                ξ = 3/2               ξ = 2
21476 [13518, 33507]  17983 [11492, 27987]  14071 [9321, 21604]  11121 [7766, 16564]   9538 [6943, 13821]

Figure 5: Posterior density of N under each identifying assumption, using the LCM prior for π̃ of Manrique-Vallier (2016) and the negative-binomial prior for N.

The results are not very robust to misspecification of ξ in the chosen range. The credible intervals when ξ = 1/2 and ξ = 2 barely overlap. The posterior mean when ξ = 2 is 32% lower than the posterior mean when ξ = 1 (i.e. under the no-highest-order interaction assumption), and the posterior mean when ξ = 1/2 is 53% higher than the posterior mean when ξ = 1. The posterior mean decreases monotonically as ξ increases from 1/2 to 2. This lack of robustness to misspecification of ξ would be a cause for concern if the no-highest-order interaction assumption was plausible, and the deviations from the assumption in terms of ξ were also plausible, in the context of the Kosovo data set.

References
ABA/AAAS (2000). Political killings in Kosova/Kosovo, March-June 1999. Technical report, American Bar Association Central and East European Law Initiative and the American Association for the Advancement of Science.

Allman, E. S., Matias, C., Rhodes, J. A., et al. (2009). Identifiability of parameters in latent structure models with many observed variables. The Annals of Statistics.

Anderson, M. and Fienberg, S. E. (1999). Who Counts?: The Politics of Census-Taking in Contemporary America. Russell Sage Foundation.

Ball, P., Betts, W., Scheuren, F., Dudukovich, J., and Asher, J. (2002). Killings and Refugee Flow in Kosovo March-June 1999. American Association for the Advancement of Science and American Bar Association Central and East European Law Initiative.

Bird, S. M. and King, R. (2018). Multiple systems estimation (or capture-recapture estimation) to inform public policy. Annual Review of Statistics and its Application.

Bishop, Y. M. M., Fienberg, S. E., and Holland, P. W. Discrete Multivariate Analysis: Theory and Practice. Springer Science & Business Media.

Carpenter, B., Gelman, A., Hoffman, M. D., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M., Guo, J., Li, P., and Riddell, A. (2017). Stan: A probabilistic programming language. Journal of Statistical Software.

DasGupta, A. and Rubin, H. (2005). Estimation of binomial parameters when both n, p are unknown. Journal of Statistical Planning and Inference.
Biometrika
Dunson, D. B. and Xing, C. (2009). Nonparametric Bayes modeling of multivariate categorical data. Journal of the American Statistical Association.
Electronic Journal of Statistics.

Fienberg, S. E. (1972). The multiple recapture census for closed populations and incomplete 2^k contingency tables. Biometrika.
Journal of the Royal Statistical Society: Series A
AStA Advances in Statistical Analysis
Entropy
Journal of Official Statistics
The International Journal of Biostatistics.

Haberman, S. J. (1979). Analysis of Qualitative Data. Volume 2. Academic Press.

Hoeting, J. A., Madigan, D., Raftery, A. E., and Volinsky, C. T. (1999). Bayesian model averaging: a tutorial. Statistical Science, pages 382–401.

Hogan, J. W. and Daniels, M. J. (2008). Missing Data in Longitudinal Studies: Strategies for Bayesian Modeling and Sensitivity Analysis. Chapman and Hall/CRC.

Holzmann, H., Munk, A., and Zucchini, W. (2006). On identifiability in capture–recapture models. Biometrics.
Hook, E. B. and Regal, R. R. (1995). Capture-recapture methods in epidemiology: methods and limitations. Epidemiologic Reviews.
Journal of the American Statistical Association
Statistics & probability letters
American Journal of Public Health
Biometrika
Biometrika
Biometrics
Biometrics N . Ecology
Biometrika
Biometrics.

arXiv preprint arXiv:1906.04763.

Manrique-Vallier, D., Price, M. E., and Gohdes, A. (2013). Multiple systems estimation techniques for estimating casualties in armed conflicts. In Counting Civilian Casualties: An Introduction to Recording and Estimating Nonmilitary Deaths in Conflict, pages 165–182.

Otis, D. L., Burnham, K. P., White, G. C., and Anderson, D. R. (1978). Statistical inference from capture data on closed animal populations. Wildlife Monographs, pages 3–135.

Overstall, A. and King, R. (2014). conting: An R package for Bayesian analysis of complete and incomplete contingency tables. Journal of Statistical Software.
Communications in Statistics - Theory and Methods, pages 1–21.

Regal, R. R. and Hook, E. B. (1991). The effects of model selection on confidence intervals for the size of a closed population.
Statistics in Medicine
Statistics in Medicine
Sankhyā: The Indian Journal of Statistics, Series A, pages 514–522.

Sadinle, M. (2018). Bayesian propagation of record linkage uncertainty into population size estimation of human rights violations.
The Annals of Applied Statistics
The Annals of Mathematical Statistics, pages 142–152.

Silverman, B. (2020). Multiple systems analysis for the quantification of modern slavery: Classical and Bayesian approaches.
Journal of the Royal Statistical Society: Series A
The American Statistician
Spiegel, P. B. and Salama, P. (2000). War and mortality in Kosovo, 1998–99: an epidemiological testimony. The Lancet.

arXiv preprint arXiv:2008.00127.

Tahmasebi, B., Motahari, S. A., and Maddah-Ali, M. A. (2018). On the identifiability of finite mixtures of finite product measures. arXiv preprint arXiv:1807.05444.

Tamer, E. (2010). Partial identification in econometrics. Annual Review of Economics.