Applying Data Synthesis for Longitudinal Business Data across Three Countries

M. Jahangir Alam, Benoit Dostie, Jörg Drechsler, Lars Vilhuber

Abstract

Data on businesses collected by statistical agencies are challenging to protect. Many businesses have unique characteristics, and distributions of employment, sales, and profits are highly skewed. Attackers wishing to conduct identification attacks often have access to much more information about a business than about any individual. As a consequence, most disclosure avoidance mechanisms fail to strike an acceptable balance between usefulness and confidentiality protection. Detailed aggregate statistics by geography or detailed industry classes are rare, public-use microdata on businesses are virtually nonexistent, and access to confidential microdata can be burdensome. Synthetic microdata have been proposed as a secure mechanism to publish microdata, as part of a broader discussion of how to provide broader access to such data sets to researchers. In this article, we document an experiment to create analytically valid synthetic data, using the exact same model and methods previously employed for the United States, for data from two different countries: Canada (Longitudinal Employment Analysis Program (LEAP)) and Germany (Establishment History Panel (BHP)). We assess utility and protection, and provide an assessment of the feasibility of extending such an approach in a cost-effective way to other data.

Key words: business data, confidentiality, LBD, LEAP, BHP, synthetic.

Author affiliations (footnotes to the author list, in the order given): Department of Applied Economics, HEC Montréal, and Department of Economics, Truman State University, USA (ORCID: https://orcid.org/0000-0001-6478-114X). Department of Applied Economics, HEC Montréal, USA (ORCID: https://orcid.org/0000-0002-4133-2365). Institute for Employment Research, USA. Cornell University (ORCID: https://orcid.org/0000-0001-5733-8932). arXiv preprint [econ.EM].

There is growing demand for firm-level data allowing detailed studies of firm dynamics. Recent examples include Bartelsman et al. (2009), who use cross-country firm-level data to study average post-entry behavior of young firms. Sedláček et al. (2017) use the Business Dynamics Statistics (BDS) to show the role of firm size in firm dynamics. However, such studies are made difficult by the limited or restricted access to firm-level data.

Data on businesses collected by statistical agencies are challenging to protect. Many businesses have unique characteristics, and distributions of employment, sales and profits are highly skewed. Attackers wishing to conduct identification attacks often have access to much more information about a business than about any individual. It is easy to find examples of firms and establishments that are so dominant in their industry or location that they would be immediately identified if data that included their survey responses or administratively collected data were publicly released. Finally, there are also greater financial incentives to identifying the particulars of some firms and their competitors.

As a consequence, most disclosure avoidance mechanisms fail to strike an acceptable balance between usefulness and confidentiality protection. Detailed aggregate statistics by geography or detailed industry classes are rare, public-use microdata on business are virtually nonexistent, and access to confidential microdata can be burdensome.
It is not uncommon that access to establishment microdata, if granted at all, is provided through data enclaves (Research Data Centers), at headquarters of statistical agencies, or some other limited means, under strict security conditions. These restrictions on data access reduce the growth of knowledge by increasing the cost to researchers of accessing the data.

Synthetic microdata have been proposed as a secure mechanism to publish microdata (Drechsler et al., 2008; Drechsler, 2012; National Research Council, 2007; Jarmin et al., 2014), based on suggestions and methods first proposed by Rubin (1993) and Little (1993). Such data are part of a broader discussion of how to provide improved access to such data sets to researchers (Bender, 2009; Vilhuber, 2013; Abowd et al., 2004; Abowd et al., 2015). For business data, synthetic business microdata were released in the United States (Kinney et al., 2011b) and in Germany (Drechsler, 2011b) in 2011. The former data set, called the Synthetic Longitudinal Business Database (SynLBD), was released to an easily web-accessible computing environment (Abowd et al., 2010), and combined with a validation mechanism. By making disclosable synthetic microdata available through a remotely accessible data server, combined with a validation server, the SynLBD approach alleviates some of the access restrictions associated with economic data. The approach is mutually beneficial to both agencies and researchers. Researchers can access public use servers at little or no cost, and can later validate their model-based inferences on the full confidential microdata. Details about the modeling strategies used for the SynLBD can be found in Kinney et al. (2011b) and Kinney et al.
(2011a).

In this article, we document an experiment to create analytically valid synthetic data, using the exact same model and methods previously used to create the SynLBD, but applied to data from two different countries: Canada (Longitudinal Employment Analysis Program (LEAP)) and Germany (Establishment History Panel (BHP)). We describe all three countries' data in Section 2.

In Canada, the Canadian Center for Data Development and Economic Research (CDER) was created in 2011 to allow Statistics Canada to make better use of its business data holdings, without compromising security. Secure access to business microdata for approved analytical research projects is provided through a physical facility located in Statistics Canada's headquarters.

CDER implements many risk mitigation measures to alleviate the security risks specific to micro-level business data, including limits on tabular outputs, centralized vetting, and monitoring of program logs. Access to the data is done through a Statistics [...]

Footnotes: See Guzman et al. (2016) and Guzman et al. (2020) for an example of scraped, public-use microdata. For a recent overview of some, see Vilhuber et al. (2016b). See Drechsler (2011a) for a review of the theory and applications of the synthetic data methodology. Other access methods include secure data enclaves (e.g., research data centers of the U.S. Federal Statistical System, of the German Federal Employment Agency, others), and remote submission systems. We will comment on the latter in the conclusion.

[...] best synthetic data method for each file, but rather to assess the effectiveness of using a 'pre-packaged' method to cost-effectively generate synthetic data. In particular, while we could have used newer implementations of methods combined with a pre-defined or automated model (Nowok et al., 2016; Raab et al., 2018), we chose to use the exact SAS code used to create the original SynLBD.
A brief synopsis of the method, and any adjustments we made to take into account structural data differences, are described in Section 3.

We verify the analytical validity of the synthetic data files so created along a variety of measures. First, we show how well average firm characteristics (gross employment, total payroll) in the synthetic data match those from the original data. We also consider how well the synthetic data replicate various measures of firm dynamics (entry and exit rates) and job flows (job creation and destruction rates). Second, we assess whether measures of economic growth vary between both data sets using dynamic panel data models. Finally, to assess the analytical validity from a more general perspective, we compute global validity measures based on the ideas of propensity score matching as proposed by Woo et al. (2009) and Snoke et al. (2018a).

To assess how protective the newly created synthetic database is, we estimate the probability that the synthetic first year equals the true first year given the synthetic first year.

The rest of the paper is organized as follows. Section 2 describes the different data sources and summarizes which steps were taken to harmonize the data sets prior to the actual synthesis. Section 3 provides some background on the synthesis methods, limitations in the applications, and a discussion of some of the measures, which are used in Section 4 to evaluate the analytical validity of the generated data sets. Preliminary results regarding the achieved level of protection are included in Section 5. The paper concludes with a discussion of the implications of the study for future data synthesis projects.

Data

In this section, we briefly describe the structure of the three data sources.

The LBD (U.S. Census Bureau, 2015) is created from the U.S. Census Bureau's Business Register (BR) by creating longitudinal links of establishments using name and address matching. The database has information on birth, death, location, industry, firm affiliation of employer establishments, and ownership by multi-establishment firms, as well as their employment over time, for nearly all sectors of the economy from 1976 through 2015 (as of this writing). It serves as a key linkage file as well as a research data set in its own right for numerous research articles, as well as a tabulation input to the U.S. Census Bureau's Business Dynamics Statistics (U.S. Census Bureau, 2017, BDS). Other statistics created from the underlying Business Register include the County Business Patterns (U.S. Census Bureau, 2016a, CBP) and the Statistics of U.S. Businesses (U.S. Census Bureau, 2016b, SUSB). For a full description, readers should consult Jarmin et al. (2002). The key variables of interest for this experiment are birth and death dates, payroll, employment, and the industry coding of the establishment. Kinney et al. (2014b) explore a possible expansion of the synthesis methods described later to include location and firm affiliation. Note that information on payroll and employment does not come from individual-level wage records, as is the case for both the Canadian and German data sets described below, as well as for the Quarterly Workforce Indicators (Abowd et al., 2009) derived from the Longitudinal Employer-Household Dynamics (Vilhuber, 2018, LEHD) in the United States. Thus, methods that connect establishments based on labor flows (Benedetto et al., 2007; Hethey et al., 2010) are not employed. We also note that payroll is the cumulative sum of wages paid over the entire calendar year, whereas employment is measured as of March 12 of each year.

The LEAP (Statistics Canada, 2019b) contains information on annual employment for each employer business in all sectors of the Canadian economy. It covers incorporated and unincorporated businesses that issue at least one annual statement of remuneration paid (T4 slips) in any given calendar year. It excludes self-employed individuals or partnerships with non-salaried participants.

To construct the LEAP, Statistics Canada uses three sources of information: (1) T4 administrative data from the Canada Revenue Agency (CRA), (2) data from Statistics Canada's Business Register (Statistics Canada, 2019c), and (3) data from Statistics Canada's Survey of Employment, Payrolls and Hours (SEPH) (Statistics Canada, 2019a). In general, all employers in Canada provide employees with a T4 slip if they paid employment income, taxable allowances and benefits, or any other remuneration in any calendar year. The T4 information is reported to the tax agency, which in turn provides this information to Statistics Canada. The Business Register is Statistics Canada's central repository of baseline information on businesses and institutions operating in Canada. It is used as the survey frame for all business related data sets. The objective of the SEPH is to provide monthly information on the level of earnings, the number of jobs, and hours worked by detailed industry at the national and provincial levels. To do so, it combines a census of approximately one million payroll deductions provided by the CRA, and the Business Payrolls Survey, a sample of 15,000 establishments.

The core LEAP contains four variables: (1) a longitudinal Business Register Identifier (LBRID), (2) an industry classification, (3) payroll, and (4) a measure of employment. The LBRID uniquely identifies each enterprise and is derived from the Business Register. To avoid "false" deaths and births due to mergers, restructuring or changes in reporting practices, Statistics Canada uses employment flows. Similar to Benedetto et al. (2007) and Hethey et al. (2010), the method compares the cluster of workers in each newly identified enterprise with all the clusters of workers in firms from the previous year. This comparison yields a new identifier (LBRID) derived from those of the BR. The industry classification comes from the BR for single-industry firms. If a firm operates in multiple industries, information on payroll from the SEPH is used to identify the industry in which the firm pays the highest payroll. Prior to 1991, information on industry was based on the SIC, but it is currently based on the North American Industrial Classification System (NAICS). We use the information at the NAICS four-digit (industry group) level. The firm's payroll is measured as the sum of all T4s reported to the CRA for the calendar year. Employment is measured using either Individual Labour Units (ILU) or Average Labour Units (ALU). ALUs are obtained by dividing the payroll by the average annual earnings in its industry/province/class category computed using the SEPH. ILUs are a head count of the number of T4 slips issued by the enterprise, with employees working for multiple employers split proportionately across firms according to their total annual payroll earned in each firm.

For the purpose of this experiment, we exclude the public sector (NAICS 61, 62, and 91), even though it is contained in the database, because it may not be accurately captured (Statistics Canada, 2019b). Statistics Canada does not publish any statistics for those sectors.
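The two LEAP employment measures described above reduce to simple arithmetic, which the following minimal sketch illustrates. The function names and the `(worker, firm, earnings)` record layout are hypothetical and chosen only for illustration; they are not part of the LEAP or of any Statistics Canada software.

```python
from collections import defaultdict


def average_labour_units(payroll, avg_annual_earnings):
    """ALU: firm payroll divided by the average annual earnings of its
    industry/province/class category (hypothetical inputs)."""
    if avg_annual_earnings <= 0:
        raise ValueError("average annual earnings must be positive")
    return payroll / avg_annual_earnings


def individual_labour_units(t4_records):
    """ILU: every worker counts as one unit in total, split across firms
    in proportion to the earnings the worker received from each firm.

    t4_records is an iterable of (worker, firm, earnings) tuples.
    """
    records = list(t4_records)
    total_by_worker = defaultdict(float)
    for worker, _firm, earnings in records:
        total_by_worker[worker] += earnings

    ilu = defaultdict(float)
    for worker, firm, earnings in records:
        if total_by_worker[worker] > 0:  # skip workers with no earnings
            ilu[firm] += earnings / total_by_worker[worker]
    return dict(ilu)


# Toy example: worker w1 holds jobs at firms A and B, worker w2 only at A.
records = [("w1", "A", 30000.0), ("w1", "B", 10000.0), ("w2", "A", 50000.0)]
ilu = individual_labour_units(records)
alu = average_labour_units(400000.0, 50000.0)
```

In this toy example worker w1 contributes 0.75 of a unit to firm A and 0.25 to firm B, so the ILUs sum to the number of workers.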

The core database for the Establishment History Panel is the German Social Security Data (GSSD), which is based on the integrated notification procedure for the health, pension and unemployment insurances, introduced in 1973. Employers report information on all their employees. Aggregating this information via an establishment identifier yields the Establishment History Panel (Bundesagentur für Arbeit, 2013, German abbreviation: BHP). We used data from 1975 until 2008, which at the time this project started was the most current data available for research. Information for the former Eastern German States is limited to the years 1992-2008.

Due to the purpose and structure of the GSSD, some variables present in the LBD are not available on the BHP. Firm-level information is not captured, and it is thus not known whether establishments are part of a multi-establishment employer. In 1999, reporting requirements were extended to all establishments; prior to that date, only establishments that had at least one employee covered by social security on the reference date June 30 of each year were subject to filing requirements. Payroll and employment are both based on a reference date of June 30, and are thus consistent point-in-time measures. Industries are identified according to the WZ 2003 classification system (Statistisches Bundesamt, 2003) at the five digit level. We aggregated the industry information for this project using the first four digits of the coding system.

In all countries, the underlying data provide annual measures. However, SynLBD assumes a longitudinal (wide) structure of the data set, with invariant industry (and location). In all cases, the modal industry is chosen to represent the entity's industrial activity. Further adjustments made to the BHP for this project include estimating full-year payroll, creating time-consistent geographic information, and applying employment flow methods (Hethey et al., 2010) to adjust for spurious births and deaths in establishment identifiers. Drechsler et al. (2014b) provide a detailed description of the steps taken to harmonize the input data.

In both Canada and Germany, we encountered various technical and data-driven limitations. In all countries, data in the first year and last year are occasionally problematic, and such data were dropped. Both the German and the Canadian data experience some level of industry coding change, which may affect the classification of some entities. Furthermore, due to the nature of the underlying data, entities are establishments in Germany and the US, but employers in Canada.

After the various standardizations and choices made above, the data structure is intended to be comparable, as summarized in Table 1. The column "Nature" identifies the treatment of the variable in the SynLBD synthesis process.

Table 1: Variable descriptions and comparison

| Name | Type | Description | US | Canada | Germany | Nature |
|---|---|---|---|---|---|---|
| Entity identifier | Identifier | | Establishment | Employer | Establishment | Created |
| Industry code | Categorical | Various across countries | SIC3 (3-digit) | NAICS4 (4-digit) | WZ2003 (4-digit) | Unmodified |
| First year | Categorical | First year entity is observed | firstyear | firstyear | firstyear | Synthesized |
| Last year | Categorical | Last year entity is observed | lastyear | lastyear | lastyear | Synthesized |
| Year | Categorical | Year dating of annual variables | year | year | year | Derived |
| Employment | Continuous | Employment measure | Count (March 12) | ALU* (annual) | Count (June 30) | Synthesized |
| Payroll | Continuous | Payroll (annual) | Reported | Computed | Computed, adjusted | Synthesized |

* ALU = Average Labour Unit. See text for additional explanations.

Footnote: The WZ 2003 classification system is compliant with the requirements of the Statistical Classification of Economic Activities in the European Community (NACE Rev. 1.1), which is based on the International Standard Industrial Classification (ISIC Rev. 3.1).

Methodology

To create a partially synthetic database with analytic validity from longitudinal establishment data, Kinney et al. (2011a) synthesize the life-span of establishments, as well as the evolution of their employment, conditional on industry over that synthetic lifespan. Geography is not synthesized, but is suppressed from the released file (Kinney et al., 2011a). Applying this to the LBD, Kinney et al. (2011b) created the current version of the Synthetic LBD, based on the Standard Industrial Classification (SIC) and extending through 2000. Kinney et al. (2014a) describe efforts to create a new version of the Synthetic LBD, using a longer time series (through 2010) and newer industry coding (NAICS), while also adjusting and extending the models for improved analytic validity and the imputation of additional variables. In this paper, we refer to and re-use the older methodology, which we will call SynLBD. Our emphasis is on the comparability of results obtained for a given methodology across the various applications.

The general approach to data synthesis is to generate a joint posterior predictive distribution of $Y \mid X$, where $Y$ are the variables to be synthesized and $X$ are the unsynthesized variables. The synthetic data are generated by sampling new values from this distribution. In SynLBD, variables are synthesized in a sequential fashion, with categorical variables being generally processed first using a variant of Dirichlet-Multinomial models. Continuous variables are then synthesized using a normal linear regression model with kernel density-based transformation (Woodcock et al., 2009). The synthesis models are run independently for each industry. SynLBD is implemented in SAS, which is frequently used in national statistical offices.

To evaluate whether synthetic data algorithms developed in the U.S. can be adapted to generate similar synthetic data for other countries, Drechsler et al. (2014a) apply SynLBD to the German Longitudinal Business Database (GLBD). In this paper, we extend the analysis from the earlier paper, and extend the application to the Canadian context (SynLEAP).
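To make the sequential idea concrete, the following is a deliberately simplified Python sketch, not the SAS implementation used for SynLBD. It assumes a marginal Dirichlet-Multinomial draw for one categorical variable and a plain normal linear regression for one continuous variable; the kernel density-based transformation and the conditioning structure of the actual models are omitted. All function and variable names are hypothetical, and in practice such a routine would be run separately within each industry.

```python
import numpy as np


def synthesize_categorical(values, categories, rng, prior=1.0):
    """Draw category probabilities from the Dirichlet posterior implied by
    the observed counts, then sample synthetic categories from them."""
    counts = np.array([(values == c).sum() for c in categories], dtype=float)
    probs = rng.dirichlet(counts + prior)
    return rng.choice(categories, size=len(values), p=probs)


def synthesize_continuous(y, X, X_new, rng):
    """Sample from the posterior predictive of a normal linear regression
    (flat prior): draw sigma^2, then beta, then new outcomes at X_new."""
    n, k = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    beta_hat = xtx_inv @ X.T @ y
    resid = y - X @ beta_hat
    sigma2 = (resid @ resid) / rng.chisquare(n - k)
    beta = rng.multivariate_normal(beta_hat, sigma2 * xtx_inv)
    noise = rng.normal(0.0, np.sqrt(sigma2), size=X_new.shape[0])
    return X_new @ beta + noise


# Toy data for a single industry: first year of activity and log employment.
rng = np.random.default_rng(0)
categories = np.arange(1990, 1995)
first_year = rng.choice(categories, size=500)
X = np.column_stack([np.ones(500), first_year - 1990])
log_emp = 1.0 + 0.1 * X[:, 1] + rng.normal(0.0, 0.5, size=500)

# Step 1: synthesize the categorical variable.
syn_first_year = synthesize_categorical(first_year, categories, rng)
# Step 2: synthesize the continuous variable conditional on the synthetic one.
X_syn = np.column_stack([np.ones(500), syn_first_year - 1990])
syn_log_emp = synthesize_continuous(log_emp, X, X_syn, rng)
```

The ordering matters: the continuous draw is conditioned on the already synthesized categorical value, which is what "sequential" means in the description above.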

In all countries, the synthesis of certain industries failed to complete. In both Canada and the US, this number is less than 10. In Canada, they account for about 7 percent of the total number of observations (see Table 13 in the Appendix).

In the German case, our experiments were limited to only a handful of industries, due to a combination of time and software availability factors. The results should still be considered preliminary. In both countries, as outlined in Section 2, there are subtle but potentially important differences in the various variable definitions. Industry coding differs across all three countries, and the level of detail in each of the industry codings may affect the success and precision of the synthesis.

Footnotes: Kinney et al. (2014a) shift to a Classification and Regression Trees (CART) model with Bayesian bootstrap. Statistics Canada et al. (1991), when comparing the 1987 US Standard Industrial Classification (SIC) to the 1980 Canadian SIC, already pointed out that the degree of specialization, the organization of production, and the size of the respective markets differed. Thus, the density of establishments within each of the chosen categories is likely to affect the quality of the synthesis.

As noted in Section 2, entities are establishments in Germany and the US, but employers in Canada. SynLBD should work on any level of entity aggregation (see Kinney et al. (2014a) for an application to hierarchical firm data with both firm/employer and establishment level imputation). However, these differences may affect the observed density of the data within industry-year categories, and therefore the overall comparability.

Finally, due to a feature of SynLBD that we did not fully explore, synthesis of data in the last year of the data generally was of poor quality. For some industry-country pairs, this also happened in the first year. We dropped those observations.

In order to assess the outcomes of the experiment, we inspect analytical validity by various measures and also evaluate the extent of confidentiality protection. To check analytical validity, we compare basic univariate time series between the synthetic and confidential data (employment, entity entry and exit rates, job creation and destruction rates), and the distribution of entities (firms and establishments, depending on country), employment, and payroll across time by industry. For a more complex assessment, we compute a dynamic panel data model of economic (employment) growth on each data set. We computed, but do not report here, the confidence interval overlap measure (CIO) proposed by Karr et al. (2006) in all these evaluations. The CIO is a popular measure when evaluating the validity for specific analyses. It evaluates how much the confidence intervals of the original data and the synthetic data overlap. We did not find this measure to be useful in our context. Most of our analyses are based on millions of records, and observed confidence intervals were so small that confidence intervals (almost) never overlap even when the estimates between the original data and the synthetic data are quite close.

Footnote: The full parameter estimates and the computed CIO are available in our replication materials (Alam et al., 2020).

To provide a more comprehensive measure of quality of the synthetic data relative to the confidential data, we compute the pMSE (propensity score mean-squared error, Woo et al., 2009; Snoke et al., 2018b; Snoke et al., 2018a): the mean-squared error of the predicted probabilities (i.e., propensity scores) for those two databases. Specifically, the pMSE assesses how well a model can discriminate between synthetic and confidential records; the harder the two are to tell apart, the higher their distributional similarity. We follow Woo et al. (2009) and Snoke et al. (2018b) to calculate the pMSE, using the following algorithm:

1. Append the $n_1$ rows of the confidential database $X$ to the $n_2$ rows of the synthetic database $X^s$ to create $X^{comb}$ with $N = n_1 + n_2$ rows, where both $X$ and $X^s$ are in the long format.

2. Create a variable $I_{et}$ denoting membership of an observation for entity $e$, $e = 1, \dots, E$, at time point $t$, $t = 1, \dots, T$, in the component databases, $I_{et} = 1\{X^{comb}_{et} \in X^s\}$. $I_{et}$ takes on values of 1 for the synthetic database and 0 for the confidential database.

3. Fit the following generalised linear model to predict $I$:

$$P(I_{et} = 1) = g^{-1}\left(\beta_0 + \beta_1 Emp_{et} + \beta_2 Pay_{et} + Age_{et}^{T} \beta_3 + \lambda_t + \gamma_i\right), \qquad (1)$$

where $Emp_{et}$ is log employment of entity $e$ in year $t$, $Pay_{et}$ is log payroll of entity $e$ in year $t$, $Age_{et}$ is a vector of age classes of entity $e$ in year $t$, $\lambda_t$ is a year fixed effect, $\gamma_i$ is a time-invariant industry-specific effect for the industry classification $i$ of entity $e$, and $g$ is an appropriate link function (in this case, the logit link).

4. Calculate the predicted probabilities, $\hat{p}_{et}$.

5. Compute $pMSE = \frac{1}{N} \sum_{t=1}^{T} \sum_{e=1}^{E} (\hat{p}_{et} - c)^2$, where $c = n_2 / N$.

If $n_1 = n_2$, $pMSE = 0$ means every $\hat{p}_{et} = 0.5$, and the two databases are distributionally indistinguishable, suggesting high analytical validity. While the number of records in the synthetic data typically matches the number of records in the original data, i.e., $n_1 = n_2$, this does not necessarily hold in our application. Although the synthesis process ensures that the total number of entities is the same in both data sets, the years in which the entities are observed will generally differ between the original data and the synthetic data, and thus the number of records in the long format will not necessarily match between the two data sets. For this reason we follow Woo et al. (2009) and Snoke et al. (2018a) and use $c = n_2 / N$ instead of fixing $c$ at 0.5. Using this more general definition, $c$ will always be the mean of the predicted propensity scores, so that the pMSE measures the average of the squared deviations from the mean, as intended.

Since the pMSE depends on the number of predictors included in the propensity score model, Snoke et al. (2018a) derived the expected value and standard deviation for the pMSE under the null hypothesis ($pMSE_0$) that the synthesis model is correct, i.e., it matches the true data generating process (Snoke et al., 2018a, Equation 1):

$$E[pMSE_0] = \frac{(k-1)(1-c)^2 c}{N} \quad \text{and} \quad StDev[pMSE_0] = \frac{\sqrt{2(k-1)}\,(1-c)^2 c}{N},$$

where $k$ is the number of synthesized variables used in the propensity model. To measure the analytical validity of the synthetic data, they suggest looking at the pMSE ratio

$$pMSE_{ratio} = \frac{\widehat{pMSE}}{E[pMSE_0]}$$

and the standardized pMSE

$$pMSE_s = \frac{\widehat{pMSE} - E[pMSE_0]}{StDev[pMSE_0]},$$

where $\widehat{pMSE}$ is the estimated pMSE based on the data at hand. Under the null hypothesis, the pMSE ratio has an expectation of 1 and the expectation of the standardized $pMSE_s$ is zero.

Figure 1: Gross employment level (upper panels) and total payroll (lower panels) by year. Panels: (a) CanSynLBD, (b) GSynLBD. Vertical axes: gross employment (millions), total payroll (billions).
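The pMSE steps and the two derived measures from the methodology can be sketched in Python as follows. This is an illustration under stated assumptions rather than the code used for the paper: the logit model is fitted with a small hand-written Newton routine, the predictors are two generic numeric columns rather than the employment, payroll, age, year and industry terms of Equation (1), and $k$ is taken as the number of model parameters including the intercept. The names `fit_logit` and `pmse_summary` are hypothetical.

```python
import numpy as np


def fit_logit(X, y, n_iter=50, ridge=1e-8):
    """Logistic regression by Newton's method (tiny ridge on the Hessian
    for numerical stability only)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        grad = X.T @ (y - p)
        hess = (X * (p * (1.0 - p))[:, None]).T @ X + ridge * np.eye(X.shape[1])
        step = np.linalg.solve(hess, grad)
        beta += step
        if np.max(np.abs(step)) < 1e-10:
            break
    return beta


def pmse_summary(X_conf, X_syn):
    """Stack both data sets, model membership in the synthetic data, and
    return the pMSE with its ratio and standardized versions."""
    n1, n2 = len(X_conf), len(X_syn)
    N = n1 + n2
    X = np.column_stack([np.ones(N), np.vstack([X_conf, X_syn])])
    I = np.concatenate([np.zeros(n1), np.ones(n2)])  # 1 = synthetic record
    beta = fit_logit(X, I)
    p_hat = 1.0 / (1.0 + np.exp(-(X @ beta)))
    c = n2 / N
    pmse = np.mean((p_hat - c) ** 2)
    k = X.shape[1]  # assumption: parameters counted including the intercept
    expected = (k - 1) * (1 - c) ** 2 * c / N
    stdev = np.sqrt(2 * (k - 1)) * (1 - c) ** 2 * c / N
    return {
        "pmse": pmse,
        "c": c,
        "mean_p": p_hat.mean(),
        "ratio": pmse / expected,
        "standardized": (pmse - expected) / stdev,
    }


# Toy comparison: one synthetic file drawn from the same distribution as the
# confidential data, and one with a shifted mean.
rng = np.random.default_rng(1)
conf = rng.normal(0.0, 1.0, size=(2000, 2))
good = pmse_summary(conf, rng.normal(0.0, 1.0, size=(1800, 2)))
bad = pmse_summary(conf, rng.normal(0.5, 1.0, size=(1800, 2)))
```

Because the two toy files have different row counts, the example also shows why $c = n_2/N$ is used: with an intercept in the model, the mean of the fitted propensity scores equals $c$ rather than 0.5. Year and industry fixed effects would enter as additional dummy columns.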

In the following figures, the results for the Canadian data are shown in the left panels, and the German data in the right panels. In all cases, the Canadian data are reported for the entire private sector, including the manufacturing sector but excluding the public sector industries (NAICS 61, 62, and 91). German results are for two WZ2003 industries.

Figure 1 shows a comparison between the synthetic data and the original data for gross employment level (upper panels) and total payroll (lower panels) by year. While the general trends are preserved for both data sources, the results for the German synthetic data resemble the trends from the original data more closely. For the Canadian data the positive trends over time are generally overestimated. However, in both cases, levels are mostly overestimated. These patterns are not robust. When considering the manufacturing sector in Canada (Figure 8 in the Appendix), trends are better matched, but a significant negative bias is present in levels.

Figure 2: Job creation rates (upper panels) and job destruction rates (lower panels) by year. Panels: (a) CanSynLBD, (b) GSynLBD. Vertical axes: job creation rate (%), job destruction rate (%).

Key statistics commonly computed from business registers such as the LEAP or the BHP include job flows over time. Following Davis et al. (1996), job creation is defined as the sum of all employment gains from expanding firms from year $t-1$ to year $t$, including entering firms. The job destruction rate is defined as the sum of all employment losses from contracting firms from year $t-1$ to year $t$, including exiting firms. Figure 2 depicts job creation rates (upper panels) and destruction rates (lower panels). The general levels and trends are preserved for both data sources, but the time series align more closely for the German data. Even the substantial increase in job creation in 1993, which can be attributed to the integration of the data from Eastern Germany after reunification, is remarkably well preserved in the synthetic data. Still, there seems to be a small but systematic overestimation of job creation and destruction rates in both synthetic data sources. The substantial deviation in the job destruction rate in the last year of CanSynLBD is an artefact requiring further investigation.

Footnote: The results for the Canadian manufacturing sector are included in Figure 9 in the Appendix, and are comparable to the results for the entire private sector.

Entity Dynamics

To assess how well the synthetic data capture entity dynamics, we also compute entry and exit rates, i.e., how many new entities appear in the data and how many cease to exist relative to the population of entities in a specific year. Figure 3 shows that those rates are very well preserved for both data sources.

Only the (delayed) reunification spike in the entry rates in the German data is not preserved correctly. The confidential data show a large spike in entry rates in 1993. In that year, detailed information about Eastern German establishments was integrated for the first time. However, the synthetic data show increased entry rates in the two previous years. We speculate that this occurs due to incomplete data in the confidential data: establishments were successively integrated into the data starting in 1991, but many East German establishments did not report payroll and number of employees in the first two years. Thus, records exist in the original data, but the establishment size is reported as missing. Such a combination is not possible in the synthetic data. The synthesis models are constructed to ensure that whenever an establishment exists, it has to have a positive number of employees. Since entry rates are computed by looking at whether the employment information changed from missing to a positive value, most of the Eastern German establishments only exist from 1993 onwards in the original data, but from 1991 in the synthetic data.

The second, smaller spike in the entry rate in the German data occurs in 1999. In that year, employers were required to report marginally employed workers for the first time. Some establishments exclusively employ marginally employed workers, and will thus appear for the first time in the data after 1999. The synthetic data preserve this pattern.
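The job flow and entity dynamics measures discussed above can be computed from a simple entity-by-year employment panel, as in the following hedged Python sketch. The denominators are assumptions that the text does not spell out: job flow rates are expressed relative to average total employment in years $t-1$ and $t$ (the usual convention associated with Davis et al.), and entry and exit rates relative to the number of entities active in year $t$. The data layout and function names are hypothetical.

```python
def job_flow_rates(emp, year):
    """Job creation and destruction rates (%) between year-1 and year.

    emp maps entity -> {year: employment}; a missing year means the entity
    is not active, so entrants and exiters contribute their full employment.
    """
    created = destroyed = total_prev = total_curr = 0.0
    for series in emp.values():
        prev = series.get(year - 1, 0.0)
        curr = series.get(year, 0.0)
        change = curr - prev
        if change > 0:
            created += change
        else:
            destroyed -= change
        total_prev += prev
        total_curr += curr
    denom = 0.5 * (total_prev + total_curr)  # assumed denominator
    if denom == 0:
        raise ValueError("no employment in either year")
    return 100.0 * created / denom, 100.0 * destroyed / denom


def entry_exit_rates(emp, year):
    """Entry and exit rates (%) relative to entities active in `year`."""
    active_prev = {e for e, s in emp.items() if s.get(year - 1, 0.0) > 0}
    active_curr = {e for e, s in emp.items() if s.get(year, 0.0) > 0}
    if not active_curr:
        raise ValueError("no active entities in the reference year")
    entrants = active_curr - active_prev
    exiters = active_prev - active_curr
    n = len(active_curr)  # assumed denominator
    return 100.0 * len(entrants) / n, 100.0 * len(exiters) / n


# Toy panel: A expands, B exits, C enters, D contracts.
panel = {
    "A": {2000: 10.0, 2001: 12.0},
    "B": {2000: 5.0},
    "C": {2001: 3.0},
    "D": {2000: 20.0, 2001: 15.0},
}
jc_rate, jd_rate = job_flow_rates(panel, 2001)
entry_rate, exit_rate = entry_exit_rates(panel, 2001)
```

Running the same functions on the confidential and the synthetic panel, year by year, yields the pairs of time series that the figures compare.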

The SynLBD code ensures that the total number of entities that ever exist within the considered time frame matches exactly between the original data and the synthetic data. But each entity's entry and exit date are synthesized, and the total number of entities at any particular point in time may differ, and with it employment and payroll. To investigate how well the information is preserved at any given point in time, we compute the following statistic:

$$x_{its} = \frac{X_{its}}{\sum_i \sum_t X_{its}}, \qquad (2)$$

where $i$ is the index for the industry (aggregated to the two-digit level for the Canadian data), $t$ is the index for the year, and $s$ denotes the data source (original or synthetic). $X_{its} = \sum_j X_{itsj}$, $j = 1, \ldots, n_{its}$, is the variable of interest aggregated at the industry level, and $n_{its}$ is the number of entities in industry $i$ at time point $t$ in data source $s$. To compute the statistic provided in Equation (2), this number is then divided by the total of the variable of interest aggregated across all industries and years. Figure 4 plots the results from the original data against the results from the synthetic data for the number of entities, employment, and payroll. If the information is well preserved, all points should be close to the 45 degree line.

[Figure 3: Entry rates (upper panels) and exit rates (lower panels) by year. Panels: (a) CanSynLBD, (b) GSynLBD.]

Footnote: As described in Section 2, for both countries' data, corrections based on worker flows have been applied, correcting for any bias due to legal reconfiguration of economic entities.

We find that the share of entities is well preserved for both data sources, but share of employment and share of payroll vary more in the Canadian data, with an upward bias for the larger shares. It should be noted that the German data shown here and elsewhere in this paper only contain data from two industries, whereas the Canadian data contain nearly all available industry codes at the two-digit level. Thus, results from Canada are expected to be more diverse. When only considering the Canadian manufacturing sector (see Figure 10 in the Appendix), less bias is present.
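The share statistic in Equation (2) is simple to compute once the data are aggregated to industry-year cells. The sketch below is an illustration only: the record layout, the function name `cell_shares`, and all numbers are invented for the example. Each data source is processed separately, and the paired shares are what a plot like Figure 4 would place against the 45 degree line.

```python
from collections import defaultdict


def cell_shares(records):
    """Return {(industry, year): share} for one data source.

    records is an iterable of (industry, year, value) tuples, one per
    entity-year, where value is the variable of interest (for example
    employment, payroll, or 1 to count entities).
    """
    totals = defaultdict(float)
    for industry, year, value in records:
        totals[(industry, year)] += value          # X_its
    grand_total = sum(totals.values())             # sum over i and t
    return {cell: total / grand_total for cell, total in totals.items()}


# Invented entity-level values for two industries and two years.
original = [
    ("31", 2000, 120.0), ("31", 2000, 80.0), ("31", 2001, 150.0),
    ("44", 2000, 50.0), ("44", 2001, 100.0),
]
synthetic = [
    ("31", 2000, 210.0), ("31", 2001, 140.0),
    ("44", 2000, 60.0), ("44", 2001, 90.0),
]

orig_shares = cell_shares(original)
synth_shares = cell_shares(synthetic)

# Vertical distance from the 45 degree line for each industry-year cell.
gaps = {cell: synth_shares[cell] - orig_shares[cell] for cell in orig_shares}
max_gap = max(abs(g) for g in gaps.values())
```

Because every share is normalized by the grand total of its own data source, the statistic compares the distribution across cells and is insensitive to a uniform level shift between the two files.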

To assess how well the synthetic data perform in a more complex model and in the context of an analyst's modelling strategy, we simulate how a macroeconomist (the typical user of these data) might approach the problem of estimating a model for the evolution of employment if only the synthetic data are available. The analyst will consider both the literature and the data to propose a meaningful model. In doing so, a sequence of models will be proposed, and tests or theory brought to bear on their merits, potentially rejecting their appropriateness. The outcome that the analyst obtains from following that strategy using the synthetic data should not diverge substantially from the outcome they would obtain when using the (inaccessible) confidential data. The specific parameter estimates obtained, and the actual model retained, are not the goal of this exercise; the focus is on the process.

[Figure 4: Share of entities (upper panels), share of employment (middle panels), and share of payroll (lower panels) by year and industry. Panels: (a) CanSynLBD, (b) GSynLBD.]

To do so, our analyst would start by using a base model (typically OLS), and then let economic and statistical theory suggest more appropriate models. In this case, we will estimate variants of a dynamic panel data model for the evolution of employment. For each model, tests can be specified to check whether the model is an appropriate fit under a certain hypothesis. The outcome of this exercise, illustrated by Figure 5, allows us to assess whether the synthetic data capture variability in economic growth due to industry, firm age, and payroll (the key variables in the data) and whether the analyst might reasonably choose the same, or a closely related, modelling strategy.

Footnote: We do not describe these models in more detail here, referring the reader to the literature instead, in particular Arellano et al. (1995) and Blundell et al. (1998).

[Figure 5: Modelling strategy of a hypothetical analyst. The analyst specifies, in sequence, OLS, GMM, System GMM, and System GMM MA; after each of the three GMM variants, a test decides whether the model is rejected.]

The base model is an OLS specification:

$$Emp_{et} = \beta_0 + \theta\, Emp_{e,t-1} + \eta\, Pay_{et} + Age_{et}^{T} \beta + \gamma_i + \lambda_t + \varepsilon_{et} \qquad (3)$$

where $Emp_{et}$ is log employment of entity $e$ in year $t$, $Emp_{e,t-1}$ is its one year lag, $Pay_{et}$ is the logarithm of payroll of entity $e$ in year $t$, $Age_{et}$ is a vector of dummy variables for age of entity $e$ in year $t$, $\lambda_t$ is a year effect, $\gamma_i$ is a time-invariant industry-specific effect for each industry $i$, and $\varepsilon_{et}$ is the disturbance term of entity $e$ in year $t$. As $Emp_{e,t-1}$ is correlated with $\gamma_i$ because $Emp_{e,t-1}$ is itself determined by the time-invariant $\gamma_i$, OLS estimators are biased and inconsistent. To obtain consistent estimates of the parameters in the model, Arellano et al. (1991) suggest using generalized method of moments (GMM) estimation methods, as well as associated tests to assess the validity of the model. We also estimate the model using system GMM methods proposed by Arellano et al. (1995) and Blundell et al. (1998) (System GMM), as well as a variant of Equation (3) that includes a first-order moving average in the error term $\varepsilon_{et}$ (System GMM MA):

$$Emp_{et} = \beta_0 + \theta\, Emp_{e,t-1} + \eta\, Pay_{et} + Age_{et}^{T} \beta + \lambda_t + \alpha_e + \varepsilon_{et} + \varepsilon_{e,t-1} \qquad (4)$$

where $\alpha_e$ is a time-invariant entity effect, which includes any time-invariant industry effects.

The Sargan test (Hansen, 1982; Arellano et al., 1991; Blundell et al., 2001) is used to assess the validity of the over-identifying restrictions. We also compute the z-score for the m2 test for second-order autocorrelation in the first-differenced residuals, and the elasticity $\eta^{\star} = \hat{\eta} / (1 - \hat{\theta})$.
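The claim that OLS is biased when the lagged dependent variable is correlated with an omitted time-invariant effect can be checked with a small simulation. The sketch below is a toy illustration only, unrelated to the LEAP or BHP data: it drops payroll, age, and year effects, uses an entity-level effect (like $\alpha_e$) in place of the industry effect, and all parameter values (true $\theta = 0.5$, unit variances, 2000 entities, 8 years) are assumptions chosen for the example.

```python
import random


def simulate_pooled_ols(theta=0.5, n_entities=2000, n_years=8, seed=12345):
    """Simulate y_t = alpha + theta * y_{t-1} + eps_t and return the pooled
    OLS slope from regressing y_t on y_{t-1}, ignoring alpha."""
    rng = random.Random(seed)
    lagged, current = [], []
    for _ in range(n_entities):
        alpha = rng.gauss(0.0, 1.0)  # unobserved time-invariant effect
        # Start each entity near the stationary distribution of its process.
        y_prev = alpha / (1.0 - theta) + rng.gauss(0.0, 1.0) / (1.0 - theta ** 2) ** 0.5
        for _ in range(n_years):
            y = alpha + theta * y_prev + rng.gauss(0.0, 1.0)
            lagged.append(y_prev)
            current.append(y)
            y_prev = y
    mean_x = sum(lagged) / len(lagged)
    mean_y = sum(current) / len(current)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(lagged, current))
    var = sum((x - mean_x) ** 2 for x in lagged)
    return cov / var


theta_true = 0.5
theta_ols = simulate_pooled_ols(theta=theta_true)
print(f"true theta = {theta_true}, pooled OLS estimate = {theta_ols:.3f}")
```

In this setup the pooled estimate is pulled well above the true value, because the omitted effect raises both the lagged and the current outcome. This upward bias is the motivation for the GMM estimators that instrument the lagged term.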

It is important that this model is close, but not identical, to the model used to synthesize the data. In SynLBD, $Emp_{et}$ is synthesized as $f(Emp_{e,t-1}, X_{et})$ (where $X_{et}$ does not contain $Pay_{et}$), and $Pay_{et} = f(Pay_{e,t-1}, Emp_{et}, X_{et})$ (Kinney et al., 2011b, pg. 366). Thus, the model we chose is purposefully not (completely) congenial with the synthesis model, but the synthesis process of the SynLBD should preserve sufficient serial correlation in the data to be able to estimate these models.

We estimate each model and test statistics separately on confidential and synthetic data for the private sector (and for Canada, for the manufacturing sector). Detailed estimation results are reported in the Appendix. Here we focus on the two regression coefficients of major interest: $\theta$ and $\eta$, the coefficients for lagged employment and payroll, as well as the elasticity $\eta^{\star}$. Figure 6 plots the bias in the synthetic coefficients, i.e., $\theta_{synth} - \theta_{conf}$ and $\eta_{synth} - \eta_{conf}$, for all four models. While the detailed results in the Appendix confirm that all regression coefficients still have the same sign, all estimates plotted in Figure 6 show substantial bias in all models in all datasets (the OLS model for the German data being the only exception). Still, the computed elasticity $\eta^{\star}$ has very little bias in most models.

[Figure 6: Bias in estimates of coefficients on pay and lagged employment. Note: For details on the estimated coefficients, see the Appendix.]
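The bias measure can be reproduced by hand from the Appendix. The sketch below uses the OLS coefficients for the Canadian private sector reported in Table 5 (LEAP versus CanSynLBD). One point is an assumption: the elasticity is taken to have the long-run form $\eta / (1 - \theta)$, and the helper name `long_run_elasticity` is hypothetical.

```python
def long_run_elasticity(theta, eta):
    """Assumed form of the elasticity: eta / (1 - theta)."""
    return eta / (1.0 - theta)


# OLS coefficients for the Canadian private sector, copied from Table 5.
confidential = {"theta": 0.2031, "eta": 0.7847}  # LEAP
synthetic = {"theta": 0.3970, "eta": 0.5481}     # CanSynLBD

# Bias as defined in the text: synthetic minus confidential.
bias_theta = synthetic["theta"] - confidential["theta"]
bias_eta = synthetic["eta"] - confidential["eta"]
bias_sum = bias_theta + bias_eta

# Bias in the derived elasticity under the assumed functional form.
bias_elasticity = long_run_elasticity(**synthetic) - long_run_elasticity(**confidential)

print(f"bias in theta:      {bias_theta:+.4f}")
print(f"bias in eta:        {bias_eta:+.4f}")
print(f"sum of both biases: {bias_sum:+.4f}")
print(f"bias in elasticity: {bias_elasticity:+.4f}")
```

For these numbers the two coefficient biases have opposite signs and similar magnitude, so the derived elasticity moves much less than either coefficient does.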

However, we observe a striking pattern: the biases of the two regression coefficients are always symmetric, i.e., the sum of the biases of $\theta_{synth}$ and $\eta_{synth}$ is close to zero in all models (and they mostly cancel out in the computation of $\eta^{\star}$). This may simply be a feature of the modeling strategy pointed out earlier, which generates serial correlation with a slightly different structure. Another possible explanation could be that the model is poorly identified because of multicollinearity generating a ridge for the estimated coefficients. The estimated coefficients would be highly unstable in this case even in the original data, and thus it would not be surprising to find substantial differences between the coefficients from the original data and the coefficients from the synthetic data. Better understanding this phenomenon will be an interesting area of future research.

While the bias in coefficients is quite consistent across countries and models, specification tests such as the m2 and Sargan tests are not: the test statistics differ markedly between confidential and synthetic data, as the following table shows.

Model            Test          Canada                       Germany
                               Confidential   Synthetic     Confidential   Synthetic
GMM              m2            -14.5          -27.54        -2.51          -4.13
                 Sargan test   69000          15000         3600           2000
System GMM       m2            -11.43         -41.6         19.49          -8.83
                 Sargan test   77000          18000         4500           2800
System GMM MA    m2            8.2            -40.03        19.03          -11.69
                 Sargan test   28000          17000         3100           2500

Note: The Sargan test (Blundell et al., 2001; Arellano et al., 1991) is used to assess the validity of the over-identifying restrictions. The z-score is reported for the m2 test.

To compute the pMSE, we estimate Equation (1) using logit models. The estimated pMSE is 0.0121 for the Canadian data (0.0041 for the manufacturing sector) and 0.0013 for the German data (see Table 3). While these numbers may seem small, the pMSE ratio and the standardized pMSE are large, indicating that the null hypothesis that the synthetic data and the original data stem from the same data generating process should be rejected. The expected pMSE is quite sensitive to sample size N. Even small differences between the original and synthetic data will lead to large values for this test statistic. In both countries, the confidential data files are quite large (about 2 million cases for Germany and the manufacturing sector in Canada, and about 34.5 million cases for the full Canadian data sets). In practice, therefore, the null of equivalence is quite likely to be rejected given this test's very high power.

Table 3: pMSE by sector and country

Country    Sector          pMSE     pMSE ratio   Standardized pMSE
Canada     Manufacturing   0.0041   656.88       4908.17
Canada     Private         0.0121   10957.61     135525.77
Germany    Universe        0.0013   725.21       2896.85

To assess the risk of disclosure, we use a measure proposed by Kinney et al. (2011b): for each industry, we estimate the fraction of entities for which the synthetic birth year equals the true birth year, conditional on the synthetic birth year, and interpret it as a probability. Tables 14 and 15 in the Appendix show the minimum, maximum, and mean of these probabilities, by year. Figure 7 shows the maximum and average values across time, for each country. The figure shows that these probabilities are quite low except for the first year. Entry rates in the first year are much larger than in any other year due to censoring. It is therefore quite likely that the (left-censored) entry year of the synthetic record matches that of the (left-censored) original record if the synthetic entry year is the first year observed in the data. A somewhat more muted version of this effect can be seen for Germany in the years 1991 and 1992, when the lower panel of Figure 7 shows another spike. These are the years in which data from Eastern Germany were added to the database successively, leading to new sets of (left-censored) entities.

With the exception of the first year in the data, the average rate of concordance between synthetic and observed birth year of an establishment in the Canadian data is below 5%, and the maximum is never above 50%. The German data reflect results from a smaller set of industries, and while the average concordance is higher (never above 10%), the maximum is never above 6% other than during the noted entry spikes. This suggests that the synthetic lifespan of any given entity is highly unlikely to be matched to its confidential real lifespan. This is generally considered to be a high degree of confidentiality.

Figure 7: Average and maximum likelihood that synthetic birth year matches actual birth year
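The birth-year concordance measure described above reduces to a conditional fraction per industry and synthetic birth year, which can then be summarized across industries. The following sketch illustrates that computation on invented records; the function names and the tuple layout are hypothetical, and it is not the code of Kinney et al. (2011b).

```python
from collections import defaultdict


def birth_year_match_rates(entities):
    """Fraction of entities whose true birth year equals the synthetic one,
    conditional on industry and synthetic birth year.

    entities is an iterable of (industry, true_birth_year, synthetic_birth_year).
    """
    matches = defaultdict(int)
    counts = defaultdict(int)
    for industry, true_year, synth_year in entities:
        counts[(industry, synth_year)] += 1
        if true_year == synth_year:
            matches[(industry, synth_year)] += 1
    return {cell: matches[cell] / counts[cell] for cell in counts}


def summarize_by_year(rates):
    """Minimum, maximum, and mean across industries for each synthetic birth year."""
    by_year = defaultdict(list)
    for (_industry, synth_year), rate in rates.items():
        by_year[synth_year].append(rate)
    return {
        year: {"min": min(v), "max": max(v), "mean": sum(v) / len(v)}
        for year, v in by_year.items()
    }


# Invented records: (industry, true birth year, synthetic birth year).
toy = [
    ("A", 1991, 1991), ("A", 1993, 1991),
    ("A", 1995, 1995), ("A", 1996, 1995), ("A", 1994, 1995), ("A", 1997, 1995),
    ("B", 1991, 1991), ("B", 1991, 1991),
    ("B", 1995, 1995), ("B", 1992, 1995),
]
rates = birth_year_match_rates(toy)
summary = summarize_by_year(rates)
```

In the toy data the first synthetic birth year (1991) has the highest match rates, mimicking the left-censoring effect discussed in the text: when many entities share the first observed year, a synthetic entry in that year is likely to coincide with the true one.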

Note: Plot shows fraction of entities by industry for which the synthetic birth year equals the true birth year, conditional on the synthetic birth year. Plot has been rescaled to be relative to the first year observed in the data. The Canadian manufacturing sector is not shown. In the German case, we only use two industries, but we show the average of the two, rather than the values for both industries, to maintain comparability with the Canadian plot.

Conclusion

In this paper, we presented results from two projects that evaluated whether the code developed to synthesize the U.S. LBD can easily be adapted to create synthetic versions of similar data from Canada and Germany. We considered both univariate time-series comparisons as well as model-based comparisons of coefficients and model fit. In general, utility evaluations show significant differences between each country's synthetic and confidential data. Frequently-used measures such as confidence interval overlap and pMSE suggest that the synthetic data are an unreliable image of the confidential data. Less formal comparisons of specification test scores suggest that the synthetic data do not reliably lead to the same modeling decisions.

Interestingly, the utility of the German synthetic data was higher than the utility of the Canadian data in almost all dimensions evaluated. At this point we can only speculate about potential reasons. The most important difference between the two data sources is that the German data comprise only a handful of industries while almost all industries have been included in the Canadian evaluation. Given that the industries included in the German data were rather large, and synthesis models are run independently for each industry, it might have been easier to preserve the industry level statistics for the German data. We cannot exclude the possibility that the structure of the German data aligns more closely with the LBD and thus the synthesis models tuned on the LBD data provide better results on the (adjusted) BHP than on the LEAP. We note that both the LBD and the BHP are establishment-level data sets, whereas the LEAP is an employer-level data set.

We emphasize that adjustments to the original synthesis code were explicitly limited to ensuring that the code runs on the new input data. The validity of the synthetic data could possibly be improved by tuning the synthesis models to the particularities of the data at hand, such as the non-standard dynamics introduced into the German data by reunification. However, the aim of this project was to illustrate that the high investments necessary for developing the synthesis code for the LBD offered additional payoffs, as the re-use of the code substantially reduced the amount of work required to generate decent synthetic data products for other business data. One of the major criticisms of the synthetic data approach has been that investments necessary to develop useful synthesizers are substantial. This project illustrated that substantial gains can be achieved when exploiting knowledge from previous projects. With the advent of tailor-made software such as the synthpop package in R (Nowok et al., 2016), the investments for generating useful synthetic data might be further reduced in the future.

However, even without fine-tuning or customization of models, the current synthetic data have, in fact, proven useful. De facto, many deployments of synthetic data, including the Synthetic LBD in the US, have been used for model preparation by researchers in a public or lower-security environment, with subsequent remote submission of prepared code for validation against the confidential data. When viewed through the lens of such a validation system, the synthetic data prepared here would seem to have reasonable utility. While time series dynamics are not the same, they are broadly similar. Models converged in similar fashions, and while coefficients were strictly different, they were broadly similar and plausible. Specification tests did not lead to the same conclusions, but they also did not collapse or yield meaningless conclusions. Thus, we believe that the synthetic data, despite being different, have the potential to be a useful tool for analysts to prepare models without direct access to the confidential data. Vilhuber et al. (2016a) and Vilhuber (2019) come to a similar conclusion when evaluating usage of the synthetic data sets available through the Synthetic Data Server (Abowd et al., 2010), including the Synthetic LBD. A more thorough evaluation would need to explicitly measure the investment in synthetic data generation, the cost of setting up a validation structure, and the number of studies enabled through such a setup. We note that such an evaluation is non-trivial: the counter-factual in many circumstances is that no access is allowed to sensitive business microdata, or that access occurs through a secure research data system that is also costly to maintain. This study has contributed to such a future evaluation by showing that plausible results can be achieved with relatively low up-front investments.

The use of synthetic data sets to broaden access to confidential microdata is likely to increase in the near future, with increasing concerns by statistical agencies regarding the disclosure risks of releasing microdata. The resulting reduction in access to scientific microdata is overwhelmingly seen as problematic. Broadly “plausible” if not analytically valid synthetic data sets such as those described in this paper, combined with scalable remote submission systems that integrate modern disclosure avoidance mechanisms, may be a feasible mitigation strategy.

Acknowledgements

The opinions expressed here are those of the authors, and do not reflect the opinions of any of the statistical agencies involved. All results were reviewed for disclosure risks by their respective custodians, and released to the authors. Alam thanks Claudiu Motoc and Danny Leung for help with the Canadian data. Vilhuber acknowledges funding through NSF Grants SES-1131848 and SES-1042181, and an Alfred P. Sloan Grant (G-2015-13903). Alam and Dostie acknowledge funding through SSHRC Partnership Grant “Productivity, Firms and Incomes”. The creation of the Synthetic LBD was funded by NSF Grant SES-0427889.

References

ABOWD, J. M. and J. I. LANE (2004). “New Approaches to Confidentiality Protection: Synthetic Data, Remote Access and Research Data Centers”. In: Privacy in Statistical Databases. Ed. by J. DOMINGO-FERRER and V. TORRA. Vol. 3050. Lecture Notes in Computer Science. Springer, pp. 282–289.

ABOWD, J. M. and I. SCHMUTTE (2015). “Economic analysis and statistical disclosure limitation”. In: Brookings Papers on Economic Activity Fall 2015.

ABOWD, J. M., B. E. STEPHENS, L. VILHUBER, F. ANDERSSON, K. L. MCKINNEY, M. ROEMER, and S. D. WOODCOCK (2009). “The LEHD Infrastructure Files and the Creation of the Quarterly Workforce Indicators”. In: Producer Dynamics: New Evidence from Micro Data. Ed. by T. DUNNE, J. B. JENSEN, and M. J. ROBERTS. University of Chicago Press.

ABOWD, J. M. and L. VILHUBER (2010). VirtualRDC - Synthetic Data Server. Cornell University, Labor Dynamics Institute.

ALAM, M. J., B. DOSTIE, J. DRECHSLER, and L. VILHUBER (2020). Replication archive for: Applying Data Synthesis for Longitudinal Business Data across Three Countries. Code and data. Zenodo.

ARELLANO, M. and S. BOND (1991). “Some Tests of Specification for Panel Data: Monte Carlo Evidence and an Application to Employment Equations”. In: Review of Economic Studies. URL: https://EconPapers.repec.org/RePEc:oup:restud:v:58:y:1991:i:2:p:277-297.

ARELLANO, M. and O. BOVER (1995). “Another look at the instrumental variable estimation of error-components models”. In: Journal of Econometrics. URL: https://EconPapers.repec.org/RePEc:eee:econom:v:68:y:1995:i:1:p:29-51

BARTELSMAN, E., J. HALTIWANGER, and S. SCARPETTA (2009). “Measuring and Analyzing Cross-country Differences in Firm Dynamics”. In: DUNNE, T., J. B. JENSEN, and M. J. ROBERTS. Producer Dynamics: New Evidence from Micro Data. University of Chicago Press, pp. 15–76.

BENDER, S. (2009). “The RDC of the Federal Employment Agency as a part of the German RDC Movement”. In: Comparative Analysis of Enterprise Data, 2009 Conference. (Tokyo). URL: http://gcoe.ier.hit-u.ac.jp/CAED/index.html (visited on 05/05/2014).

BENEDETTO, G., J. HALTIWANGER, J. LANE, and K. MCKINNEY (2007). “Using Worker Flows in the Analysis of the Firm”. In: Journal of Business and Economic Statistics.

BLUNDELL, R. and S. BOND (1998). “Initial conditions and moment restrictions in dynamic panel data models”. In: Journal of Econometrics. URL: https://ideas.repec.org/a/eee/econom/v87y1998i1p115-143.html

BLUNDELL, R., S. BOND, and F. WINDMEIJER (2001). “Estimation in dynamic panel data models: Improving on the performance of the standard GMM estimator”. In: Nonstationary Panels, Panel Cointegration, and Dynamic Panels. Ed. by B. H. BALTAGI, T. B. FOMBY, and R. CARTER HILL. Vol. 15. Advances in Econometrics. Emerald Group Publishing Limited, pp. 53–91. DOI: 10.1016/S0731-9053(00)15003-0. URL: https://doi.org/10.1016/S0731-9053(00)15003-0 (visited on 04/30/2020).

BUNDESAGENTUR FÜR ARBEIT (2013). Establishment History Panel (BHP). [Computer file]. Nürnberg, Germany: Research Data Centre (FDZ) of the German Federal Employment Agency (BA) at the Institute for Employment Research (IAB) [distributor].

DAVIS, S. J., J. C. HALTIWANGER, and S. SCHUH (1996). Job creation and destruction. Cambridge, MA: MIT Press.

DRECHSLER, J. (2011a). Synthetic Datasets for Statistical Disclosure Control–Theory and Implementation. New York: Springer.

DRECHSLER, J. (2011b). Synthetische Scientific-Use-Files der Welle 2007 des IAB-Betriebspanels. FDZ Methodenreport 201101 de. Institute for Employment Research, Nuremberg, Germany. URL: http://ideas.repec.org/p/iab/iabfme/201101_de.html

DRECHSLER, J. (2012). “New data dissemination approaches in old Europe – synthetic datasets for a German establishment survey”. In: Journal of Applied Statistics. URL: http://ideas.repec.org/a/taf/japsta/v39y2012i2p243-265.html

DRECHSLER, J., A. DUNDLER, S. BENDER, S. RÄSSLER, and T. ZWICK (2008). “A new approach for disclosure control in the IAB establishment panel – multiple imputation for a better data access”. In: AStA Advances in Statistical Analysis.

DRECHSLER, J. and L. VILHUBER (2014a). A First Step Towards A German Synlbd: Constructing A German Longitudinal Business Database. Working Papers 14-13. Center for Economic Studies, U.S. Census Bureau. URL: https://ideas.repec.org/p/cen/wpaper/14-13.html

DRECHSLER, J. and L. VILHUBER (2014b). “A First Step Towards A German SynLBD: Constructing A German Longitudinal Business Database”. In: Statistical Journal of the IAOS: Journal of the International Association for Official Statistics. URL: http://iospress.metapress.com/content/X415V18331Q33150

GUZMAN, J. and S. STERN (2016). The State of American Entrepreneurship: New Estimates of the Quality and Quantity of Entrepreneurship for 32 US States, 1988-2014. Working Paper 22095. National Bureau of Economic Research. DOI: 10.3386/w22095.

GUZMAN, J. and S. STERN (2020). Startup Cartography. (visited on 01/26/2020).

HANSEN, L. P. (1982). “Large Sample Properties of Generalized Method of Moments Estimators”. In: Econometrica. (visited on 04/30/2020).

HETHEY, T. and J. F. SCHMIEDER (2010). Using worker flows in the analysis of establishment turnover: Evidence from German administrative data. FDZ Methodenreport 201006 en. Institute for Employment Research, Nuremberg, Germany. URL: http://ideas.repec.org/p/iab/iabfme/201006_en.html

JARMIN, R. S., T. A. LOUIS, and J. MIRANDA (2014). “Expanding The Role Of Synthetic Data At The U.S. Census Bureau”. In: Statistical Journal of the IAOS: Journal of the International Association for Official Statistics. URL: http://iospress.metapress.com/content/fl8434n4v38m4347/?p=00c99b98bf2f4701ae806ee638594915&pi=0

JARMIN, R. S. and J. MIRANDA (2002). The Longitudinal Business Database. Working Papers 02-17. Center for Economic Studies, U.S. Census Bureau. URL: https://ideas.repec.org/p/cen/wpaper/02-17.html

KARR, A. F., C. N. KOHNEN, A. OGANIAN, J. P. REITER, and A. P. SANIL (2006). “A Framework for Evaluating the Utility of Data Altered to Protect Confidentiality”. In: The American Statistician.

KINNEY, S. K., J. P. REITER, and J. MIRANDA (2014a). Improving The Synthetic Longitudinal Business Database. Working Papers 14-12. Center for Economic Studies, U.S. Census Bureau. URL: https://ideas.repec.org/p/cen/wpaper/14-12.html

KINNEY, S. K., J. P. REITER, and J. MIRANDA (2014b). “Improving The Synthetic Longitudinal Business Database”. In: Statistical Journal of the IAOS: Journal of the International Association for Official Statistics.

KINNEY, S. K., J. P. REITER, A. P. REZNEK, J. MIRANDA, R. S. JARMIN, and J. M. ABOWD (2011a). LBD Synthesis Procedures. CES Technical Notes Series 11-01. Center for Economic Studies, U.S. Census Bureau. URL: https://ideas.repec.org/p/cen/tnotes/11-01.html

KINNEY, S. K., J. P. REITER, A. P. REZNEK, J. MIRANDA, R. S. JARMIN, and J. M. ABOWD (2011b). “Towards Unrestricted Public Use Business Microdata: The Synthetic Longitudinal Business Database”. In: International Statistical Review. DOI: j.1751-5823.2011.00152.x. URL: https://ideas.repec.org/a/bla/istatr/v79y2011i3p362-384.html

LITTLE, R. J. (1993). “Statistical Analysis of Masked Data”. In: Journal of Official Statistics.

Understanding Business Dynamics: An Integrated Data System for America's Future. Ed. by J. HALTIWANGER, L. M. LYNCH, and C. MACKIE. Washington, DC: The National Academies Press.

NOWOK, B., G. RAAB, and C. DIBBEN (2016). “synthpop: Bespoke Creation of Synthetic Data in R”. In: Journal of Statistical Software, Articles.

RAAB, G. M., B. NOWOK, and C. DIBBEN (2018). “Practical Data Synthesis for Large Samples”. In: Journal of Privacy and Confidentiality. URL: https://journalprivacyconfidentiality.org/index.php/jpc/article/view/407

RUBIN, D. B. (1993). “Discussion of Statistical Disclosure Limitation”. In: Journal of Official Statistics.

American Economic Review

SNOKE, J., G. M. RAAB, B. NOWOK, C. DIBBEN, and A. SLAVKOVIC (2018a). “General and specific utility measures for synthetic data”. In: Journal of the Royal Statistical Society: Series A (Statistics in Society). eprint: https://rss.onlinelibrary.wiley.com/doi/pdf/10.1111/rssa.12358. URL: https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/rssa.12358

SNOKE, J. and A. SLAVKOVIC (2018b). “pMSE Mechanism: Differentially Private Synthetic Data with Maximal Distributional Similarity: UNESCO Chair in Data Privacy, International Conference, PSD 2018, Valencia, Spain, September 26-28, 2018, Proceedings”. In: pp. 138–159.

STATISTICS CANADA (2019a). Business Register (BR). (visited on 01/30/2020).

STATISTICS CANADA (2019b). Longitudinal Employment Analysis Program (LEAP). (visited on 01/30/2020).

STATISTICS CANADA (2019c). Survey of Employment, Payrolls and Hours (SEPH). (visited on 01/30/2020).

STATISTICS CANADA and BUREAU OF THE CENSUS (1991). Concordance between the Standard Industrial Classifications of Canada and the United States, 1980 Canadian SIC - 1987 United States SIC. Catalogue No. 12-574E. Statistics Canada. URL: http://publications.gc.ca/site/eng/9.847987/publication.html (visited on 01/30/2020).

STATISTISCHES BUNDESAMT (2003). Classification of Economic Activities, issue 2003 (WZ 2003). Statistisches Bundesamt (Federal Statistical Office) of Germany. (visited on 02/02/2020).

U.S. CENSUS BUREAU (2015). Longitudinal Business Database 1975-2015 [Data file]. Tech. rep. (visited on 01/26/2020).

U.S. CENSUS BUREAU (2016a). County Business Patterns (CBP). U.S. Census Bureau. (visited on 01/26/2020).

U.S. CENSUS BUREAU (2016b). Statistics of U.S. Businesses (SUSB). U.S. Census Bureau. (visited on 01/26/2020).

U.S. CENSUS BUREAU (2017). Business Dynamics Statistics (BDS). U.S. Census Bureau. (visited on 01/26/2020).

VILHUBER, L. (2013). Methods for Protecting the Confidentiality of Firm-Level Data: Issues and Solutions. Document 19. Labor Dynamics Institute. URL: http://digitalcommons.ilr.cornell.edu/ldi/19/

VILHUBER, L. (2018). LEHD Infrastructure S2014 files in the FSRDC. Working Papers 18-27. Center for Economic Studies, U.S. Census Bureau. URL: https://ideas.repec.org/p/cen/wpaper/18-27.html

VILHUBER, L. (2019). Utility of two synthetic data sets mediated through a validation server: Experience with the Cornell Synthetic Data Server. Presentation. Conference on Current Trends in Survey Statistics. URL: https://hdl.handle.net/1813/43883

VILHUBER, L. and J. M. ABOWD (2016a). Usage and outcomes of the Synthetic Data Server. Presentation. Meetings of the Society of Labor Economists. URL: https://hdl.handle.net/

VILHUBER, L., J. M. ABOWD, and J. P. REITER (2016b). “Synthetic establishment microdata around the world”. In: Statistical Journal of the International Association for Official Statistics.

WOO, M.-J., J. P. REITER, A. OGANIAN, and A. F. KARR (2009). “Global Measures of Data Utility for Microdata Masked for Disclosure Limitation”. In: Journal of Privacy and Confidentiality. URL: https://journalprivacyconfidentiality.org/index.php/jpc/article/view/568

WOODCOCK, S. D. and G. BENEDETTO (2009). “Distribution-preserving statistical disclosure limitation”. In: Computational Statistics & Data Analysis. DOI: https://doi.org/10.1016/j.csda.2009.05.020.

Appendix “Applying Data Synthesis for Longitudinal Business Data across Three Countries”

M. Jahangir Alam, Benoit Dostie, Jörg Drechsler, Lars Vilhuber

A Figures for the Manufacturing Sector in Canada

[Figure 8: Entity characteristics for the manufacturing sector in Canada by year. Panels: (a) Gross employment level by year (gross employment, millions); (b) Total payroll (billions).]

[Figure 9: Dynamics of job flows for the manufacturing sector in Canada by year. Panels: (a) Job creation rates (%); (b) Job destruction rates (%).]

[Figure 10: Share of entities (upper panel), share of employment (middle panel), and share of payroll (lower panel) by year and industry for the Canadian manufacturing sector.]

Appendix Tables

B.1 pMSE

Table 4: Detailed results for pMSE estimation by sector and country

Independent Variables    Canada                         Germany
Sector:                  Manufacturing   Private        All
Ln ALU                   0.158           0.7138         -0.2895
                         (0.0039)        (0.001)        (0.0033)
Ln Pay                   0.0039          -0.4426        0.2584
                         (0.0037)        (0.001)        (0.0028)
Age 3-4                  0.0392          0.0972         -0.0987
                         (0.0078)        (0.0017)       (0.007)
Age 5-7                  -0.0382         0.0477         -0.0973
                         (0.0073)        (0.0016)       (0.0066)
Age 8-12                 -0.1258         -0.0263        -0.1172
                         (0.0071)        (0.0015)       (0.0063)
Age 13 or more           -0.219          -0.1024        -0.1487
                         (0.0074)        (0.0016)       (0.0059)
N                        2243011         34638723       2121956
pseudo R-sq              0.0112          0.0318         0.0038
pMSE                     0.0041          0.0121         0.0013

Note: See Equation 1 for estimation method. An observation is an entity-year in the combined database of each country-sector combination. All specifications include time and industry fixed effects. Standard errors are in parentheses.

B.2 Regression analysis tables

Table 5: Regression coefficients (OLS) for LEAP

Independent Variables    LEAP                           CanSynLBD
                         Private        Manufacturing   Private        Manufacturing
AR(1) Coefficient        0.2031***      0.2481***       0.3970***      0.4405***
                         (0.0001)       (0.0005)        (0.0002)       (0.0007)
Ln Pay                   0.7847***      0.7300***       0.5481***      0.5228***
                         (0.0001)       (0.0005)        (0.0002)       (0.0006)
Age 3-4                  -0.1202***     -0.1717***      -0.1223***     -0.2340***
                         (0.0003)       (0.0014)        (0.0004)       (0.0016)
Age 5-7                  -0.1260***     -0.1891***      -0.1235***     -0.2507***
                         (0.0003)       (0.0014)        (0.0004)       (0.0016)
Age 8-12                 -0.1268***     -0.1973***      -0.1169***     -0.2551***
                         (0.0003)       (0.0013)        (0.0004)       (0.0016)
Age 13 or more           -0.1246***     -0.1992***      -0.1101***     -0.2577***
                         (0.0003)       (0.0014)        (0.0004)       (0.0017)
N
R

Note: In all specifications, we include both year and industry fixed effects. Standard errors are in parentheses. ***, **, and * indicate statistically significant coefficients at the 1%, 5%, and 10% levels, respectively.

Independent Variables    GLBD           GSynLBD
AR(1) Coefficient        0.4430***      0.4143***
                         (0.0007)       (0.0008)
Ln Pay                   0.4629***      0.5143***
                         (0.0006)       (0.0007)
Age 3-4                  -0.0695***     -0.0642***
                         (0.0017)       (0.0016)
Age 5-7                  -0.1066***     -0.0891***
                         (0.0017)       (0.0016)
Age 8-12                 -0.1324***     -0.1109***
                         (0.0017)       (0.0016)
Age 13 or more           -0.1880***     -0.1600***
                         (0.0016)       (0.0015)
N
R

Note: In all specifications, we include both year and industry fixed effects. Standard errors are in parentheses. ***, **, and * indicate statistically significant coefficients at the 1%, 5%, and 10% levels, respectively.

Independent Variables    LEAP                           CanSynLBD
                         Private       Manufacturing    Private       Manufacturing
AR(1) Coefficient        0.0805***     0.1189***        0.5722***     0.5425***
                         (0.0003)      (0.0018)         (0.0024)      (0.0084)
Ln Pay                   0.8991***     0.8523***        0.4101***     0.4302***
                         (0.0002)      (0.0015)         (0.0018)      (0.0067)
Age 3-4                  -0.0450***    -0.0797***       -0.2075***    -0.2972***
                         (0.0002)      (0.0014)         (0.0010)      (0.0051)
Age 5-7                  -0.0438***    -0.0860***       -0.2129***    -0.3162***
                         (0.0002)      (0.0015)         (0.0011)      (0.0059)
Age 8-12                 -0.0418***    -0.0923***       -0.2187***    -0.3294***
                         (0.0003)      (0.0017)         (0.0013)      (0.0070)
Age 13 or more           -0.0379***    -0.0898***       -0.2318***    -0.3414***
                         (0.0003)      (0.0019)         (0.0015)      (0.0080)
N

Note: In this table, m

Independent Variables    GLBD          GSynLBD
AR(1) Coefficient        0.0489***     0.6999***
                         (0.0051)      (0.0057)
Ln Pay                   0.7559***     0.2916***
                         (0.0035)      (0.0042)
Age 3-4                  -0.0070***    -0.1026***
                         (0.0012)      (0.0015)
Age 5-7                  -0.0233***    -0.1386***
                         (0.0014)      (0.0017)
Age 8-12                 -0.0473***    -0.1694***
                         (0.0015)      (0.0018)
Age 13 or more           -0.1084***    -0.2183***
                         (0.0015)      (0.0018)
N

Note: In this table, m

Independent Variables    LEAP                           CanSynLBD
                         Private       Manufacturing    Private       Manufacturing
AR(1) Coefficient        0.0978***     0.1614***        0.5111***     0.5780***
                         (0.0002)      (0.0014)         (0.0008)      (0.0041)
Ln Pay                   0.8854***     0.8161***        0.4562***     0.4022***
                         (0.0002)      (0.0012)         (0.0006)      (0.0033)
Age 3-4                  -0.0555***    -0.1097***       -0.1828***    -0.3177***
                         (0.0002)      (0.0012)         (0.0004)      (0.0028)
Age 5-7                  -0.0558***    -0.1201***       -0.1860***    -0.3408***
                         (0.0002)      (0.0013)         (0.0005)      (0.0031)
Age 8-12                 -0.0548***    -0.1298***       -0.1875***    -0.3583***
                         (0.0002)      (0.0014)         (0.0005)      (0.0036)
Age 13 or more           -0.0524***    -0.1317***       -0.1943***    -0.3747***
                         (0.0002)      (0.0016)         (0.0006)      (0.0041)
N

Note: An observation is an entity-year. In this table, m

Independent Variables    GLBD          GSynLBD
AR(1) Coefficient        0.1883***     0.6140***
                         (0.0021)      (0.0027)
Ln Pay                   0.6599***     0.3553***
                         (0.0014)      (0.0020)
Age 3-4                  -0.0292***    -0.0934***
                         (0.0011)      (0.0013)
Age 5-7                  -0.0512***    -0.1266***
                         (0.0011)      (0.0014)
Age 8-12                 -0.0791***    -0.1545***
                         (0.0011)      (0.0015)
Age 13 or more           -0.1400***    -0.2012***
                         (0.0011)      (0.0015)
N

Note: An observation is an entity-year. In this table, m

Independent Variables    LEAP                           CanSynLBD
                         Private       Manufacturing    Private       Manufacturing
AR(1) Coefficient        0.2005***     0.2821***        0.4850***     0.5737***
                         (0.0007)      (0.0040)         (0.0012)      (0.0059)
Ln Pay                   0.8044***     0.7135***        0.4760***     0.4056***
                         (0.0005)      (0.0034)         (0.0009)      (0.0046)
Age 3-4                  -0.1245***    -0.2033***       -0.1716***    -0.3158***
                         (0.0005)      (0.0032)         (0.0006)      (0.0037)
Age 5-7                  -0.1328***    -0.2264***       -0.1733***    -0.3389***
                         (0.0005)      (0.0035)         (0.0006)      (0.0043)
Age 8-12                 -0.1383***    -0.2454***       -0.1731***    -0.3560***
                         (0.0006)      (0.0039)         (0.0007)      (0.0051)
Age 13 or more           -0.1441***    -0.2586***       -0.1774***    -0.3717***
                         (0.0006)      (0.0042)         (0.0008)      (0.0058)
N

Note: An observation is a firm and a year. In this table, m

Independent Variables    GLBD          GSynLBD
AR(1) Coefficient        0.3701***     0.5268***
                         (0.0060)      (0.0048)
Ln Pay                   0.5349***     0.4202***
                         (0.0041)      (0.0036)
Age 3-4                  -0.0594***    -0.0831***
                         (0.0015)      (0.0013)
Age 5-7                  -0.0922***    -0.1105***
                         (0.0018)      (0.0015)
Age 8-12                 -0.1252***    -0.1351***
                         (0.0019)      (0.0016)
Age 13 or more           -0.1850***    -0.1802***
                         (0.0019)      (0.0017)
N

Note: An observation is a firm and a year. In this table, m

C Canada: Synthesized Observations

Table 13: Synthesized observations