Applying Data Synthesis for Longitudinal Business Data across Three Countries
M. Jahangir Alam, Benoit Dostie, Jörg Drechsler, Lars Vilhuber
ABSTRACT
Data on businesses collected by statistical agencies are challenging to protect. Many businesses have unique characteristics, and distributions of employment, sales, and profits are highly skewed. Attackers wishing to conduct identification attacks often have access to much more information about a business than they would have about any individual. As a consequence, most disclosure avoidance mechanisms fail to strike an acceptable balance between usefulness and confidentiality protection. Detailed aggregate statistics by geography or detailed industry classes are rare, public-use microdata on businesses are virtually nonexistent, and access to confidential microdata can be burdensome. Synthetic microdata have been proposed as a secure mechanism to publish microdata, as part of a broader discussion of how to provide wider access to such data sets to researchers. In this article, we document an experiment to create analytically valid synthetic data, using the exact same model and methods previously employed for the United States, for data from two different countries: Canada (Longitudinal Employment Analysis Program (LEAP)) and Germany (Establishment History Panel (BHP)). We assess utility and protection, and provide an assessment of the feasibility of extending such an approach in a cost-effective way to other data.
Key words: business data, confidentiality, LBD, LEAP, BHP, synthetic.
Author affiliations: Department of Applied Economics, HEC Montréal, and Department of Economics, Truman State University, USA (ORCID: https://orcid.org/0000-0001-6478-114X); Department of Applied Economics, HEC Montréal (ORCID: https://orcid.org/0000-0002-4133-2365); Institute for Employment Research; Cornell University (ORCID: https://orcid.org/0000-0001-5733-8932). Preprint: arXiv [econ.EM].

There is growing demand for firm-level data allowing detailed studies of firm dynamics. Recent examples include Bartelsman et al. (2009), who use cross-country firm-level data to study average post-entry behavior of young firms. Sedláček et al. (2017) use the Business Dynamics Statistics (BDS) to show the role of firm size in firm dynamics. However, such studies are made difficult by limited or restricted access to firm-level data.

Data on businesses collected by statistical agencies are challenging to protect. Many businesses have unique characteristics, and distributions of employment, sales, and profits are highly skewed. Attackers wishing to conduct identification attacks often have access to much more information about a business than they would have about any individual. It is easy to find examples of firms and establishments that are so dominant in their industry or location that they would be immediately identified if data that included their survey responses or administratively collected data were publicly released. Finally, there are also greater financial incentives to identifying the particulars of some firms and their competitors.

As a consequence, most disclosure avoidance mechanisms fail to strike an acceptable balance between usefulness and confidentiality protection. Detailed aggregate statistics by geography or detailed industry classes are rare, public-use microdata on businesses are virtually nonexistent, and access to confidential microdata can be burdensome.
It is not uncommon that access to establishment microdata, if granted at all, is provided through data enclaves (Research Data Centers), at headquarters of statistical agencies, or some other limited means, under strict security conditions. These restrictions on data access reduce the growth of knowledge by increasing the cost to researchers of accessing the data.

Synthetic microdata have been proposed as a secure mechanism to publish microdata (Drechsler et al., 2008; Drechsler, 2012; National Research Council, 2007; Jarmin et al., 2014), based on suggestions and methods first proposed by Rubin (1993) and Little (1993). Such data are part of a broader discussion of how to provide improved access to such data sets to researchers (Bender, 2009; Vilhuber, 2013; Abowd et al., 2004; Abowd et al., 2015). For business data, synthetic business microdata were released in the United States (Kinney et al., 2011b) and in Germany (Drechsler, 2011b) in 2011. The former data set, called the Synthetic Longitudinal Business Database (SynLBD), a synthetic version of the Longitudinal Business Database (LBD), was released to an easily web-accessible computing environment (Abowd et al., 2010), and combined with a validation mechanism. By making disclosable synthetic microdata available through a remotely accessible data server, combined with a validation server, the SynLBD approach alleviates some of the access restrictions associated with economic data. The approach benefits both the agency and researchers. Researchers can access public use servers at little or no cost, and can later validate their model-based inferences on the full confidential microdata. Details about the modeling strategies used for the SynLBD can be found in Kinney et al. (2011b) and Kinney et al. (2011a).

In this article, we document an experiment to create analytically valid synthetic data, using the exact same model and methods previously used to create the SynLBD, but applied to data from two different countries: Canada (Longitudinal Employment Analysis Program (LEAP)) and Germany (Establishment History Panel (BHP)). We describe all three countries' data in Section 2.

In Canada, the Canadian Center for Data Development and Economic Research (CDER) was created in 2011 to allow Statistics Canada to make better use of its business data holdings, without compromising security. Secure access to business microdata for approved analytical research projects is provided through a physical facility located in Statistics Canada's headquarters.

CDER implements many risk mitigation measures to alleviate the security risks specific to micro-level business data, including limits on tabular outputs, centralized vetting, and monitoring of program logs. Access to the data is done through a Statistics [...]

Footnote: See Guzman et al. (2016) and Guzman et al. (2020) for an example of scraped, public-use microdata.
Footnote: For a recent overview of some, see Vilhuber et al. (2016b). See Drechsler (2011a) for a review of the theory and applications of the synthetic data methodology. Other access methods include secure data enclaves (e.g., research data centers of the U.S. Federal Statistical System, of the German Federal Employment Agency, others), and remote submission systems. We will comment on the latter in the conclusion.

[...] best synthetic data method for each file, but rather to assess the effectiveness of using a 'pre-packaged' method to cost-effectively generate synthetic data. In particular, while we could have used newer implementations of methods combined with a pre-defined or automated model (Nowok et al., 2016; Raab et al., 2018), we chose to use the exact SAS code used to create the original SynLBD.
A brief synopsis of the method, and any adjustments we made to take into account structural data differences, are described in Section 3.

We verify the analytical validity of the synthetic data files so created along a variety of measures. First, we show how well average firm characteristics (gross employment, total payroll) in the synthetic data match those from the original data. We also consider how well the synthetic data replicate various measures of firm dynamics (entry and exit rates) and job flows (job creation and destruction rates). Second, we assess whether measures of economic growth vary between both data sets using dynamic panel data models. Finally, to assess the analytical validity from a more general perspective, we compute global validity measures based on the ideas of propensity score matching as proposed by Woo et al. (2009) and Snoke et al. (2018a).

To assess how protective the newly created synthetic database is, we estimate the probability that the synthetic first year equals the true first year given the synthetic first year.

The rest of the paper is organized as follows. Section 2 describes the different data sources and summarizes which steps were taken to harmonize the data sets prior to the actual synthesis. Section 3 provides some background on the synthesis methods, limitations in the applications, and a discussion of some of the measures, which are used in Section 4 to evaluate the analytical validity of the generated data sets. Preliminary results regarding the achieved level of protection are included in Section 5. The paper concludes with a discussion of the implications of the study for future data synthesis projects.
Data
In this section, we briefly describe the structure of the three data sources.
The LBD (U.S. Census Bureau, 2015) is created from the U.S. Census Bureau's Business Register (BR) by creating longitudinal links of establishments using name and address matching. The database has information on birth, death, location, industry, firm affiliation of employer establishments, and ownership by multi-establishment firms, as well as their employment over time, for nearly all sectors of the economy from 1976 through 2015 (as of this writing). It serves as a key linkage file, as a research data set in its own right for numerous research articles, and as a tabulation input to the U.S. Census Bureau's Business Dynamics Statistics (U.S. Census Bureau, 2017, BDS). Other statistics created from the underlying Business Register include the County Business Patterns (U.S. Census Bureau, 2016a, CBP) and the Statistics of U.S. Businesses (U.S. Census Bureau, 2016b, SUSB). For a full description, readers should consult Jarmin et al. (2002). The key variables of interest for this experiment are birth and death dates, payroll, employment, and the industry coding of the establishment. Kinney et al. (2014b) explore a possible expansion of the synthesis methods described later to include location and firm affiliation. Note that information on payroll and employment does not come from individual-level wage records, as is the case for both the Canadian and German data sets described below, as well as for the Quarterly Workforce Indicators (Abowd et al., 2009) derived from the Longitudinal Employer-Household Dynamics (Vilhuber, 2018, LEHD) in the United States. Thus, methods that connect establishments based on labor flows (Benedetto et al., 2007; Hethey et al., 2010) are not employed. We also note that payroll is the cumulative sum of wages paid over the entire calendar year, whereas employment is measured as of March 12 of each year.
The LEAP (Statistics Canada, 2019b) contains information on annual employment for each employer business in all sectors of the Canadian economy. It covers incorporated and unincorporated businesses that issue at least one annual statement of remuneration paid (T4 slips) in any given calendar year. It excludes self-employed individuals or partnerships with non-salaried participants.

To construct the LEAP, Statistics Canada uses three sources of information: (1) T4 administrative data from the Canada Revenue Agency (CRA), (2) data from Statistics Canada's Business Register (Statistics Canada, 2019c), and (3) data from Statistics Canada's Survey of Employment, Payrolls and Hours (SEPH) (Statistics Canada, 2019a). In general, all employers in Canada provide employees with a T4 slip if they paid employment income, taxable allowances and benefits, or any other remuneration in any calendar year. The T4 information is reported to the tax agency, which in turn provides this information to Statistics Canada. The Business Register is Statistics Canada's central repository of baseline information on businesses and institutions operating in Canada. It is used as the survey frame for all business related data sets. The objective of the SEPH is to provide monthly information on the level of earnings, the number of jobs, and hours worked by detailed industry at the national and provincial levels. To do so, it combines a census of approximately one million payroll deductions provided by the CRA, and the Business Payrolls Survey, a sample of 15,000 establishments.

The core LEAP contains four variables: (1) a longitudinal Business Register Identifier (LBRID), (2) an industry classification, (3) payroll, and (4) a measure of employment. The LBRID uniquely identifies each enterprise and is derived from the Business Register. To avoid "false" deaths and births due to mergers, restructuring, or changes in reporting practices, Statistics Canada uses employment flows. Similar to Benedetto et al. (2007) and Hethey et al. (2010), the method compares the cluster of workers in each newly identified enterprise with all the clusters of workers in firms from the previous year. This comparison yields a new identifier (LBRID) derived from those of the BR. The industry classification comes from the BR for single-industry firms. If a firm operates in multiple industries, information on payroll from the SEPH is used to identify the industry in which the firm pays the highest payroll. Prior to 1991, information on industry was based on the SIC, but it is currently based on the North American Industry Classification System (NAICS). We use the information at the NAICS four-digit (industry group) level. The firm's payroll is measured as the sum of all T4s reported to the CRA for the calendar year. Employment is measured using either the Individual Labour Unit (ILU) or the Average Labour Unit (ALU). ALUs are obtained by dividing the payroll by the average annual earnings in its industry/province/class category, computed using the SEPH. ILUs are a head count of the number of T4s issued by the enterprise, with employees working for multiple employers split proportionately across firms according to their total annual payroll earned in each firm.

For the purpose of this experiment, we exclude the public sector (NAICS 61, 62, and 91), even though it is contained in the database, because it may not be accurately captured (Statistics Canada, 2019b). Statistics Canada does not publish any statistics for those sectors.
The core database for the Establishment History Panel is the German Social Security Data (GSSD), which is based on the integrated notification procedure for the health, pension and unemployment insurances, introduced in 1973. Employers report information on all their employees. Aggregating this information via an establishment identifier yields the Establishment History Panel (Bundesagentur für Arbeit, 2013, German abbreviation: BHP). We used data from 1975 until 2008, which at the time this project started was the most current data available for research. Information for the former Eastern German States is limited to the years 1992-2008.

Due to the purpose and structure of the GSSD, some variables present in the LBD are not available on the BHP. Firm-level information is not captured, and it is thus not known whether establishments are part of a multi-establishment employer. In 1999, reporting requirements were extended to all establishments; prior to that date, only establishments that had at least one employee covered by social security on the reference date June 30 of each year were subject to filing requirements. Payroll and employment are both based on a reference date of June 30, and are thus consistent point-in-time measures. Industries are identified according to the WZ 2003 classification system (Statistisches Bundesamt, 2003) at the five digit level. We aggregated the industry information for this project using the first four digits of the coding system.
In all countries, the underlying data provide annual measures. However, SynLBD assumes a longitudinal (wide) structure of the data set, with invariant industry (and location). In all cases, the modal industry is chosen to represent the entity's industrial activity. Further adjustments made to the BHP for this project include estimating full-year payroll, creating time-consistent geographic information, and applying employment flow methods (Hethey et al., 2010) to adjust for spurious births and deaths in establishment identifiers. Drechsler et al. (2014b) provide a detailed description of the steps taken to harmonize the input data.

In both Canada and Germany, we encountered various technical and data-driven limitations. In all countries, data in the first year and last year are occasionally problematic, and such data were dropped. Both the German and the Canadian data experience some level of industry coding change, which may affect the classification of some entities. Furthermore, due to the nature of the underlying data, entities are establishments in Germany and the US, but employers in Canada.

After the various standardizations and choices made above, the data structure is intended to be comparable, as summarized in Table 1. The column "Nature" identifies the treatment of the variable in the SynLBD synthesis process.

Table 1: Variable descriptions and comparison
| Name | Type | Description | US | Canada | Germany | Nature |
|---|---|---|---|---|---|---|
| Entity | Identifier | identifier | Establishment | Employer | Establishment | Created |
| Industry code | Categorical | Various across countries | SIC3 (3-digit) | NAICS4 (4-digit) | WZ2003 (4-digit) | Unmodified |
| First year | Categorical | First year entity is observed | firstyear | firstyear | firstyear | Synthesized |
| Last year | Categorical | Last year entity is observed | lastyear | lastyear | lastyear | Synthesized |
| Year | Categorical | Year dating of annual variables | year | year | year | Derived |
| Employment | Continuous | Employment measure | Count (March 12) | ALU* (annual) | Count (June 30) | Synthesized |
| Payroll | Continuous | Payroll (annual) | Reported | Computed | Computed, Adjusted | Synthesized |

* ALU = Average Labour Unit. See text for additional explanations.

Footnote: The WZ 2003 classification system is compliant with the requirements of the Statistical Classification of Economic Activities in the European Community (NACE Rev. 1.1), which is based on the International Standard Industrial Classification (ISIC Rev. 3.1).

Methodology
To create a partially synthetic database with analytic validity from longitudinal establishment data, Kinney et al. (2011a) synthesize the life-span of establishments, as well as the evolution of their employment, conditional on industry over that synthetic lifespan. Geography is not synthesized, but is suppressed from the released file (Kinney et al., 2011a). Applying this to the LBD, Kinney et al. (2011b) created the current version of the Synthetic LBD, based on the Standard Industrial Classification (SIC) and extending through 2000. Kinney et al. (2014a) describe efforts to create a new version of the Synthetic LBD, using a longer time series (through 2010) and newer industry coding (NAICS), while also adjusting and extending the models for improved analytic validity and the imputation of additional variables. In this paper, we refer to and re-use the older methodology, which we will call SynLBD. Our emphasis is on the comparability of results obtained for a given methodology across the various applications.

The general approach to data synthesis is to generate a joint posterior predictive distribution of Y | X, where Y are variables to be synthesized and X are unsynthesized variables. The synthetic data are generated by sampling new values from this distribution. In SynLBD, variables are synthesized in a sequential fashion, with categorical variables being generally processed first using a variant of Dirichlet-Multinomial models. Continuous variables are then synthesized using a normal linear regression model with kernel density-based transformation (Woodcock et al., 2009). The synthesis models are run independently for each industry. SynLBD is implemented in SAS, which is frequently used in national statistical offices.

To evaluate whether synthetic data algorithms developed in the U.S. can be adapted to generate similar synthetic data for other countries, Drechsler et al. (2014a) apply SynLBD to the German Longitudinal Business Database (GLBD). In this paper, we extend the analysis from the earlier paper, and extend the application to the Canadian context (SynLEAP).
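The categorical step of the sequential approach described above can be illustrated with a minimal sketch. It is not the SAS implementation of SynLBD; it only shows, under simplifying assumptions, how one categorical variable (here the first year) might be drawn from a Dirichlet-multinomial posterior predictive distribution within a single industry. The function names, the symmetric prior weight `alpha`, and the toy data are hypothetical.

```python
import random
from collections import Counter

def dirichlet_draw(alphas, rng):
    """Draw one probability vector from a Dirichlet distribution via gamma variates."""
    gammas = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(gammas)
    return [g / total for g in gammas]

def synthesize_categorical(observed, categories, n_draws, alpha=1.0, rng=None):
    """Sample synthetic values from a Dirichlet-multinomial posterior predictive.

    observed   : confidential values for one industry (e.g. first years)
    categories : all admissible values of the variable
    alpha      : symmetric Dirichlet prior weight (an assumption of this sketch)
    """
    rng = rng or random.Random()
    counts = Counter(observed)
    # The posterior is Dirichlet(prior + observed counts); draw cell probabilities once ...
    probs = dirichlet_draw([alpha + counts[c] for c in categories], rng)
    # ... then draw the synthetic values from the resulting multinomial.
    return rng.choices(categories, weights=probs, k=n_draws)

# Toy confidential first years for a single (hypothetical) industry.
first_years = [1990, 1990, 1991, 1993, 1993, 1993]
years = list(range(1990, 1995))
synthetic = synthesize_categorical(first_years, years, len(first_years),
                                   rng=random.Random(42))
```

In a full sequential synthesis, a call of this kind would be repeated for each industry and followed by models for the remaining variables conditional on the values already drawn.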
In all countries, the synthesis of certain industries failed to complete. In both Canada and the US, the number of such industries is less than 10. In Canada, they account for about 7 percent of the total number of observations (see Table 13 in the Appendix).

In the German case, our experiments were limited to only a handful of industries, due to a combination of time and software availability factors. The results should still be considered preliminary. In both countries, as outlined in Section 2, there are subtle but potentially important differences in the various variable definitions. Industry coding differs across all three countries, and the level of detail in each of the industry codings may affect the success and precision of the synthesis.

As noted in Section 2, entities are establishments in Germany and the US, but employers in Canada. SynLBD should work on any level of entity aggregation (see Kinney et al. (2014a) for an application to hierarchical firm data with both firm/employer and establishment level imputation). However, these differences may affect the observed density of the data within industry-year categories, and therefore the overall comparability.

Finally, due to a feature of SynLBD that we did not fully explore, synthesis of data in the last year of the data generally was of poor quality. For some industry-country pairs, this also happened in the first year. We dropped those observations.

Footnote: Kinney et al. (2014a) shift to a Classification and Regression Trees (CART) model with Bayesian bootstrap.
Footnote: Statistics Canada et al. (1991), when comparing the 1987 US Standard Industrial Classification (SIC) to the 1980 Canadian SIC, already pointed out that the degree of specialization, the organization of production, and the size of the respective markets differed. Thus, the density of establishments within each of the chosen categories is likely to affect the quality of the synthesis.
In order to assess the outcomes of the experiment, we inspect analytical validity by various measures and also evaluate the extent of confidentiality protection. To check analytical validity, we compare basic univariate time series between the synthetic and confidential data (employment, entity entry and exit rates, job creation and destruction rates), and the distribution of entities (firms or establishments, depending on country), employment, and payroll across time by industry. For a more complex assessment, we estimate a dynamic panel data model of economic (employment) growth on each data set. We computed, but do not report here, the confidence interval overlap measure (CIO) proposed by Karr et al. (2006) in all these evaluations. The CIO is a popular measure when evaluating the validity for specific analyses. It evaluates how much the confidence intervals of the original data and the synthetic data overlap. We did not find this measure to be useful in our context. Most of our analyses are based on millions of records, and observed confidence intervals were so narrow that they (almost) never overlap, even when the estimates from the original data and the synthetic data are quite close.

Footnote: The full parameter estimates and the computed CIO are available in our replication materials (Alam et al., 2020).

To provide a more comprehensive measure of quality of the synthetic data relative to the confidential data, we compute the pMSE (propensity score mean-squared error, Woo et al., 2009; Snoke et al., 2018b; Snoke et al., 2018a): the mean-squared error of the predicted probabilities (i.e., propensity scores) for those two databases. Specifically, the pMSE assesses how well the synthetic and confidential data can be told apart: the less distinguishable they are, the higher their distributional similarity. We follow Woo et al. (2009) and Snoke et al. (2018b) to calculate the pMSE, using the following algorithm:

1. Append the $n_1$ rows of the confidential database $X$ to the $n_2$ rows of the synthetic database $X^s$ to create $X^{comb}$ with $N = n_1 + n_2$ rows, where both $X$ and $X^s$ are in the long format.
2. Create a variable $I_{et}$ denoting membership of an observation for entity $e$, $e = 1, \ldots, E$, at time point $t$, $t = 1, \ldots, T$, in the component databases, $I_{et} = 1\{X^{comb}_{et} \in X^s\}$. $I_{et}$ takes on values of 1 for the synthetic database and 0 for the confidential database.
3. Fit the following generalised linear model to predict $I$:

   $P(I_{et} = 1) = g^{-1}(\beta_0 + \beta_1 Emp_{et} + \beta_2 Pay_{et} + Age_{et}^T \beta_3 + \lambda_t + \gamma_i)$,   (1)

   where $Emp_{et}$ is log employment of entity $e$ in year $t$, $Pay_{et}$ is log payroll of entity $e$ in year $t$, $Age_{et}$ is a vector of age classes of entity $e$ in year $t$, $\lambda_t$ is a year fixed effect, $\gamma_i$ is a time-invariant industry-specific effect for the industry classification $i$ of entity $e$, and $g$ is an appropriate link function (in this case, the logit link).
4. Calculate the predicted probabilities, $\hat{p}_{et}$.
5. Compute $pMSE = \frac{1}{N} \sum_{t=1}^{T} \sum_{e=1}^{E} (\hat{p}_{et} - c)^2$, where $c = n_2 / N$.

If $n_1 = n_2$, $pMSE = 0$ means every $\hat{p}_{et} = 0.5$, and the two databases are distributionally indistinguishable, suggesting high analytical validity. While the number of records in the synthetic data typically matches the number of records in the original data, i.e., $n_1 = n_2$, this does not necessarily hold in our application. Although the synthesis process ensures that the total number of entities is the same in both data sets, the years in which the entities are observed will generally differ between the original data and the synthetic data, and thus the number of records in the long format will not necessarily match between the two data sets. For this reason we follow Woo et al. (2009) and Snoke et al. (2018a) and use $c = n_2 / N$ instead of fixing $c$ at 0.5. Using this more general definition, $c$ will always be the mean of the predicted propensity scores, so that the pMSE measures the average of the squared deviations from the mean, as intended.

Since the pMSE depends on the number of predictors included in the propensity score model, Snoke et al. (2018a) derived the expected value and standard deviation for the pMSE under the null hypothesis ($pMSE_0$) that the synthesis model is correct, i.e., it matches the true data generating process (Snoke et al., 2018a, Equation 1):

$E[pMSE_0] = \frac{(k-1)(1-c)^2 c}{N}$ and $StDev[pMSE_0] = \frac{\sqrt{2(k-1)}\,(1-c)^2 c}{N}$,

where $k$ is the number of synthesized variables used in the propensity model. To measure the analytical validity of the synthetic data, they suggest looking at the pMSE ratio

$pMSE_{ratio} = \frac{\widehat{pMSE}}{E[pMSE_0]}$

and the standardized pMSE

$pMSE_s = \frac{\widehat{pMSE} - E[pMSE_0]}{StDev[pMSE_0]}$,

where $\widehat{pMSE}$ is the estimated pMSE based on the data at hand. Under the null hypothesis, the pMSE ratio has an expectation of 1 and the expectation of the standardized $pMSE_s$ is zero.

Figure 1: Gross employment level (upper panels) and total payroll (lower panels) by year. Panels: (a) CanSynLBD, (b) GSynLBD. Vertical axes: gross employment (millions) and total payroll (billions).
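The pMSE, its null expectation and standard deviation, the pMSE ratio, and the standardized pMSE defined above can be sketched in a few lines. This is a minimal illustration, not the code used for the paper: it assumes the propensity scores have already been obtained from some fitted model such as Equation (1), and the function and argument names are hypothetical.

```python
import math

def pmse_summary(p_hat, n_synthetic, k):
    """Summarise the propensity score mean-squared error (pMSE).

    p_hat       : predicted propensity scores for all N stacked records
    n_synthetic : number of synthetic records among the N records
    k           : the k entering the null moments, as defined in the text
    """
    N = len(p_hat)
    c = n_synthetic / N                                # share of synthetic records
    pmse = sum((p - c) ** 2 for p in p_hat) / N        # estimated pMSE
    expected = (k - 1) * (1 - c) ** 2 * c / N          # E[pMSE_0]
    st_dev = math.sqrt(2 * (k - 1)) * (1 - c) ** 2 * c / N  # StDev[pMSE_0]
    return {
        "pmse": pmse,
        "ratio": pmse / expected,                      # expectation 1 under the null
        "standardized": (pmse - expected) / st_dev,    # expectation 0 under the null
    }

# Toy example: four stacked records, two of them synthetic, all scores equal to c.
summary = pmse_summary([0.5, 0.5, 0.5, 0.5], n_synthetic=2, k=3)
```

Because `c` is computed from the record counts rather than fixed at 0.5, the same function applies when the confidential and synthetic long-format files differ in size.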
In the following figures, the results for the Canadian data are shown in the left panels, and the German data in the right panels. In all cases, the Canadian data are reported for the entire private sector, including the manufacturing sector but excluding the public sector industries (NAICS 61, 62, and 91). German results are for two WZ2003 industries.
Figure 1 shows a comparison between the synthetic data and the original data for gross employment level (upper panels) and total payroll (lower panels) by year. While the general trends are preserved for both data sources, the results for the German synthetic data resemble the trends from the original data more closely. For the Canadian data the positive trends over time are generally overestimated. However, in both cases, levels are mostly overestimated. These patterns are not robust: when considering the manufacturing sector in Canada (Figure 8 in the Appendix), trends are better matched, but a significant negative bias is present in levels.

Figure 2: Job creation rates (upper panels) and job destruction rates (lower panels) by year. Panels: (a) CanSynLBD, (b) GSynLBD. Vertical axes: job creation rate (%) and job destruction rate (%).
Key statistics commonly computed from business registers such as the LEAP or the BHP include job flows over time. Following Davis et al. (1996), job creation is defined as the sum of all employment gains from expanding firms from year t−1 to t, including entry firms. Job destruction is defined as the sum of all employment losses from contracting firms from year t−1 to t, including exiting firms. Figure 2 depicts job creation rates (upper panels) and destruction rates (lower panels). The general levels and trends are preserved for both data sources, but the time series align more closely for the German data. Even the substantial increase in job creation in 1993, which can be attributed to the integration of the data from Eastern Germany after reunification, is remarkably well preserved in the synthetic data. Still, there seems to be a small but systematic overestimation of job creation and destruction rates in both synthetic data sources. The substantial deviation in the job destruction rate in the last year of CanSynLBD is an artefact requiring further investigation.

Footnote: The results for the Canadian manufacturing sector are included in Figure 9 in the Appendix, and are comparable to the results for the entire private sector.

Entity Dynamics

To assess how well the synthetic data capture entity dynamics, we also compute entry and exit rates, i.e., how many new entities appear in the data and how many cease to exist relative to the population of entities in a specific year. Figure 3 shows that those rates are very well preserved for both data sources.

Only the (delayed) reunification spike in the entry rates in the German data is not preserved correctly. The confidential data show a large spike in entry rates in 1993. In that year, detailed information about Eastern German establishments was integrated for the first time. However, the synthetic data show increased entry rates in the two previous years.
We speculate that this occurs due to incomplete data in the confidential data: establishments were successively integrated into the data starting in 1991, but many East German establishments did not report payroll and number of employees in the first two years. Thus, records existed in the original data, but the establishment size is reported as missing. Such a combination is not possible in the synthetic data. The synthesis models are constructed to ensure that whenever an establishment exists, it has to have a positive number of employees. Since entry rates are computed by looking at whether the employment information changed from missing to a positive value, most of the Eastern German establishments only exist from 1993 onwards in the original data, but from 1991 in the synthetic data.

The second, smaller spike in the entry rate in the German data occurs in 1999. In that year, employers were required to report marginally employed workers for the first time. Some establishments exclusively employ marginally employed workers, and will thus appear for the first time in the data after 1999. The synthetic data preserve this pattern.
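The job-flow and entity-dynamics measures used in the preceding sections can be sketched as follows. This is an illustration only: the text defines job creation and destruction but does not spell out the denominators, so the sketch assumes that job-flow rates are taken relative to average employment in t−1 and t (a common convention following Davis et al., 1996), that the entry rate is relative to entities active in t, and that the exit rate is relative to entities active in t−1. The function name and the toy data are hypothetical.

```python
def flow_rates(emp_prev, emp_curr):
    """Job-flow and entry/exit rates between two consecutive years.

    emp_prev, emp_curr : dicts mapping entity id -> positive employment in
                         t-1 and t; an entity missing from a dict is treated
                         as not existing in that year.
    """
    creation = 0.0
    destruction = 0.0
    for entity in set(emp_prev) | set(emp_curr):
        change = emp_curr.get(entity, 0) - emp_prev.get(entity, 0)
        if change > 0:
            creation += change       # expanding and entering entities
        else:
            destruction -= change    # contracting and exiting entities
    # Assumed denominator: average of total employment in t-1 and t.
    base = 0.5 * (sum(emp_prev.values()) + sum(emp_curr.values()))
    entrants = set(emp_curr) - set(emp_prev)
    exits = set(emp_prev) - set(emp_curr)
    return {
        "job_creation_rate": creation / base,
        "job_destruction_rate": destruction / base,
        "entry_rate": len(entrants) / len(emp_curr),  # assumed: relative to year t
        "exit_rate": len(exits) / len(emp_prev),      # assumed: relative to year t-1
    }

# Toy data: "c" exits after t-1 and "d" enters in t.
rates = flow_rates({"a": 10, "b": 5, "c": 4}, {"a": 12, "b": 3, "d": 6})
```

Applying the same function to the confidential and the synthetic file, year by year, yields the paired series compared in Figures 2 and 3.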
The SynLBD code ensures that the total number of entities that ever exist within the considered time frame matches exactly between the original data and the synthetic data. But each entity's entry and exit dates are synthesized, so the total number of entities at any particular point in time may differ, and with it employment and payroll. To investigate how well the information is preserved at any given point in time, we compute the following statistic:

$x_{its} = X_{its} / \sum_i \sum_t X_{its}$,   (2)

where $i$ is the index for the industry (aggregated to the two-digit level for the Canadian data), $t$ is the index for the year, and $s$ denotes the data source (original or synthetic). $X_{its} = \sum_j X_{itsj}$, $j = 1, \ldots, n_{its}$, is the variable of interest aggregated at the industry level, and $n_{its}$ is the number of entities in industry $i$ at time point $t$ in data source $s$. To compute the statistic provided in Equation (2), this number is then divided by the total of the variable of interest aggregated across all industries and years. Figure 4 plots the results from the original data against the results from the synthetic data for the number of entities, employment, and payroll. If the information is well preserved, all points should be close to the 45 degree line.

Footnote: As described in Section 2, for both countries' data, corrections based on worker flows have been applied, correcting for any bias due to legal reconfiguration of economic entities.

Figure 3: Entry rates (upper panels) and exit rates (lower panels) by year. Panels: (a) CanSynLBD, (b) GSynLBD. Vertical axes: entry rate (%) and exit rate (%).

We find that the share of entities is well preserved for both data sources, but share of employment and share of payroll vary more in the Canadian data, with an upward bias for the larger shares.
It should be noted that the German data shown here and elsewhere in this paper only contain data from two industries, whereas the Canadian data contains nearly all available industry codes at the two digit level. Thus, results from Canada are expected to be more diverse. When only considering the Canadian manufacturing sector (see Figure 10 in the Appendix), less bias is present.
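The share statistic in Equation (2) is a simple aggregation, and a minimal sketch may make the comparison concrete. The records, industry codes, and function name below are hypothetical; the code only illustrates summing a variable within each industry-year cell and dividing by the grand total for one data source.

```python
from collections import defaultdict


def shares(records):
    """records: iterable of (industry, year, value) rows for one data source.

    Returns {(industry, year): x_its}, the cell total divided by the total
    over all industries and years.
    """
    cell_totals = defaultdict(float)
    for industry, year, value in records:
        cell_totals[(industry, year)] += value        # X_its
    grand_total = sum(cell_totals.values())           # sum over i and t
    return {cell: total / grand_total for cell, total in cell_totals.items()}


# Hypothetical employment records (industry, year, employment).
original = [("31", 2000, 40), ("31", 2000, 10), ("31", 2001, 50),
            ("44", 2000, 60), ("44", 2001, 40)]
synthetic = [("31", 2000, 45), ("31", 2001, 55),
             ("44", 2000, 50), ("44", 2001, 50)]

orig_shares = shares(original)
synth_shares = shares(synthetic)
for cell in sorted(orig_shares):
    # Points near the 45 degree line have similar values in both columns.
    print(cell, round(orig_shares[cell], 3), round(synth_shares[cell], 3))
```

Plotting the first column against the second for every cell gives the kind of scatter shown in Figure 4.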
To assess how well the synthetic data perform in a more complex model and in the context of an analyst's modelling strategy, we simulate how a macroeconomist (the typical user of these data) might approach the problem of estimating a model for the evolution of employment if only the synthetic data are available. The analyst will consider both the literature and the data to propose a meaningful model. In doing so, a sequence of models will be proposed, and tests or theory brought to bear on their merits, potentially rejecting their appropriateness. The outcome that the analyst obtains from following that strategy using the synthetic data should not diverge substantially from the outcome they would obtain when using the (inaccessible) confidential data. The specific parameter estimates obtained, and the actual model retained, are not the goal of this exercise; the focus is on the process.

Figure 4: Share of entities (upper panels), share of employment (middle panels), and share of payroll (lower panels) by year and industry. Panels: (a) CanSynLBD, (b) GSynLBD.

To do so, our analyst would start by using a base model (typically OLS), and then let economic and statistical theory suggest more appropriate models. In this case, we will estimate variants of a dynamic panel data model for the evolution of employment. For each model, tests can be specified to check whether the model is an appropriate fit under a certain hypothesis. The outcome of this exercise, illustrated by Figure 5, allows us to assess whether the synthetic data capture variability in economic growth due to industry, firm age and payroll (the key variables in the data) and whether the analyst might reasonably choose the same, or a closely related, modelling strategy.

Footnote: We do not describe these models in more detail here, referring the reader to the literature instead, in particular Arellano et al. (1995) and Blundell et al. (1998).

Figure 5: Modelling strategy of a hypothetical analyst (flowchart; boxes: OLS, GMM, System GMM, System GMM MA; labels: "Analyst specifies", "Test: Reject?").

The base model is an OLS specification:

$$\mathrm{Emp}_{et} = \beta_0 + \theta\,\mathrm{Emp}_{e,t-1} + \eta\,\mathrm{Pay}_{et} + \mathrm{Age}_{et}^{T}\beta + \gamma_i + \lambda_t + \varepsilon_{et} \qquad (3)$$

where $\mathrm{Emp}_{et}$ is log employment of entity $e$ in year $t$, $\mathrm{Emp}_{e,t-1}$ is its one year lag, $\mathrm{Pay}_{et}$ is the logarithm of payroll of entity $e$ in year $t$, $\mathrm{Age}_{et}$ is a vector of dummy variables for age of entity $e$ in year $t$, $\lambda_t$ is a year effect, $\gamma_i$ is a time-invariant industry-specific effect for each industry $i$, and $\varepsilon_{et}$ is the disturbance term of entity $e$ in year $t$. As $\mathrm{Emp}_{e,t-1}$ is correlated with $\gamma_i$ because $\mathrm{Emp}_{e,t-1}$ is itself determined by time-invariant $\gamma_i$, OLS estimators are biased and inconsistent. To obtain consistent estimates of the parameters in the model, Arellano et al. (1991) suggest using generalized method of moments (GMM) estimation methods, as well as associated tests to assess the validity of the model. We also estimate the model using system GMM methods proposed by Arellano et al. (1995) and Blundell et al. (1998) (System GMM), as well as a variant of equation (3) that includes a first-order moving average in the error term $\varepsilon_{et}$ (System GMM MA):

$$\mathrm{Emp}_{et} = \beta_0 + \theta\,\mathrm{Emp}_{e,t-1} + \eta\,\mathrm{Pay}_{et} + \mathrm{Age}_{et}^{T}\beta + \lambda_t + \alpha_e + \varepsilon_{et} + \varepsilon_{e,t-1} \qquad (4)$$

where $\alpha_e$ is a time-invariant entity effect, which includes any time-invariant industry effects.

The Sargan test (Hansen, 1982; Arellano et al., 1991; Blundell et al., 2001) is used to assess the validity of the over-identifying restrictions. We also compute the z-score for the $m_2$ test and the elasticity $\eta^\star = \hat\eta / (1 - \hat\theta)$.
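To see why a lagged dependent variable combined with a time-invariant effect makes pooled OLS inconsistent, the following simulation may help. It is a deliberately stripped-down sketch: a single regressor (the lag), a hypothetical entity effect left in the error term, and a simple Anderson-Hsiao-style instrumental-variable estimator on first differences used as a stand-in for the GMM estimators cited above. It is not the Arellano-Bond or system GMM estimator, and all parameter values are assumptions chosen for illustration.

```python
import random


def simulate_panel(n_entities=5000, n_periods=8, theta=0.5, seed=0):
    """Simulate y_et = theta * y_e,t-1 + alpha_e + eps_et; one list of y per entity."""
    rng = random.Random(seed)
    panel = []
    for _ in range(n_entities):
        alpha = rng.gauss(0.0, 0.5)           # time-invariant entity effect
        y = alpha / (1.0 - theta)             # start at the entity's long-run mean
        series = []
        for t in range(20 + n_periods):       # the first 20 periods are burn-in
            y = theta * y + alpha + rng.gauss(0.0, 1.0)
            if t >= 20:
                series.append(y)
        panel.append(series)
    return panel


def pooled_ols_slope(panel):
    """Slope from regressing y_t on y_{t-1} with an intercept, ignoring alpha_e."""
    xs = [s[t - 1] for s in panel for t in range(1, len(s))]
    ys = [s[t] for s in panel for t in range(1, len(s))]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


def first_difference_iv_slope(panel):
    """First-difference the model (removing alpha_e) and instrument the lagged
    difference (y_{t-1} - y_{t-2}) with the level y_{t-2}."""
    num = sum(s[t - 2] * (s[t] - s[t - 1]) for s in panel for t in range(2, len(s)))
    den = sum(s[t - 2] * (s[t - 1] - s[t - 2]) for s in panel for t in range(2, len(s)))
    return num / den


panel = simulate_panel()
ols = pooled_ols_slope(panel)
iv = first_difference_iv_slope(panel)
print("true theta = 0.5, pooled OLS =", round(ols, 3), ", first-difference IV =", round(iv, 3))
```

Under these assumptions the pooled OLS slope is pushed well above the true value because the lag is positively correlated with the omitted effect, while the differenced IV estimate stays close to it. GMM estimators exploit the same idea with more moment conditions.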
It is important that this model is close, but not identical, to the model used to synthesize the data. In SynLBD, $\mathrm{Emp}_{et}$ is synthesized as $f(\mathrm{Emp}_{e,t-1}, X_{et})$ (where $X_{et}$ does not contain $\mathrm{Pay}_{et}$), and $\mathrm{Pay}_{et} = f(\mathrm{Pay}_{e,t-1}, \mathrm{Emp}_{et}, X_{et})$ (Kinney et al., 2011b, pg. 366). Thus, the model we chose is purposefully not (completely) congenial with the synthesis model, but the synthesis process of the SynLBD should preserve sufficient serial correlation in the data to be able to estimate these models.

We estimate each model and test statistics separately on confidential and synthetic data for the private sector (and for Canada, for the manufacturing sector). Detailed estimation results are reported in the Appendix. Here we focus on the two regression coefficients of major interest: $\theta$ and $\eta$, the coefficients for lagged employment and payroll, as well as the elasticity $\eta^\star$. Figure 6 plots the bias in the synthetic coefficients, i.e., $\theta_{synth} - \theta_{conf}$ and $\eta_{synth} - \eta_{conf}$, for all four models. While the detailed results in the Appendix confirm that all regression coefficients still have the same sign, all estimates plotted in Figure 6 show substantial bias in all models in all datasets (the OLS model for the German data being the only exception). Still, the computed elasticity $\eta^\star$ has very little bias in most models.

Figure 6: Bias in estimates of coefficients on pay and lagged employment
Note: For details on the estimated coefficients, see the Appendix.
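The bias calculation can be reproduced by hand from the Appendix. The sketch below uses the OLS coefficients reported in Table 5 (LEAP versus CanSynLBD) and assumes that $\eta^\star$ denotes the long-run elasticity $\hat\eta / (1 - \hat\theta)$; that definition, and the helper names, are assumptions made for this illustration.

```python
# Coefficients copied from Table 5 (OLS, LEAP vs. CanSynLBD) in the Appendix.
ESTIMATES = {
    "Private": {"conf": {"theta": 0.2031, "eta": 0.7847},
                "synth": {"theta": 0.3970, "eta": 0.5481}},
    "Manufacturing": {"conf": {"theta": 0.2481, "eta": 0.7300},
                      "synth": {"theta": 0.4405, "eta": 0.5228}},
}


def long_run_elasticity(theta, eta):
    """Assumed definition: eta_star = eta / (1 - theta)."""
    return eta / (1.0 - theta)


def bias_report(conf, synth):
    """Synthetic minus confidential, for both coefficients and for eta_star."""
    return {
        "theta": synth["theta"] - conf["theta"],
        "eta": synth["eta"] - conf["eta"],
        "eta_star": long_run_elasticity(**synth) - long_run_elasticity(**conf),
    }


for sector, est in ESTIMATES.items():
    report = bias_report(est["conf"], est["synth"])
    print(sector, {name: round(value, 4) for name, value in report.items()})
```

For the private sector, the two coefficient biases are roughly +0.19 and -0.24, while the assumed elasticity moves by only about -0.08: the opposite-signed biases largely offset each other.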
However, we observe a striking pattern: The biases of the two regression coefficients are always symmetric, i.e. the sum of the biases of $\theta_{synth}$ and $\eta_{synth}$ is close to zero in all models (and they mostly cancel out in the computation of $\eta^\star$). This may simply be a feature of the modeling strategy pointed out earlier, which generates serial correlation with a slightly different structure. Another possible explanation could be that the model is poorly identified because of multicollinearity generating a ridge for the estimated coefficients. The estimated coefficients would be highly unstable in this case even in the original data and thus it would not be surprising to find substantial differences between the coefficients from the original data and the coefficients from the synthetic data. Better understanding this phenomenon will be an interesting area of future research.

While the bias in coefficients is quite consistent across countries and models, specification tests such as the $m_2$ test and the Sargan test behave less consistently, as the following values show.

                               Canada                       Germany
Model           Test           Confidential   Synthetic     Confidential   Synthetic
GMM             m2             -14.5          -27.54        -2.51          -4.13
                Sargan test    69000          15000         3600           2000
System GMM      m2             -11.43         -41.6         19.49          -8.83
                Sargan test    77000          18000         4500           2800
System GMM MA   m2             8.2            -40.03        19.03          -11.69
                Sargan test    28000          17000         3100           2500
Note: The Sargan test (Blundell et al., 2001; Arellano et al., 1991) is used to assess the validity of the over-identifying restrictions. The m2 rows report the z-score for the $m_2$ test.

To compute the pMSE, we estimate Equation (1) using logit models. The estimated pMSE is 0.0121 for the Canadian data (0.0041 for the manufacturing sector) and 0.0013 for the German data (see Table 3). While these numbers may seem small, the pMSE ratio and the standardized pMSE are large, indicating that the null hypothesis that the synthetic data and the original data stem from the same data generating process should be rejected. The expected pMSE is quite sensitive to sample size N. Even small differences between the original and synthetic data will lead to large values for this test statistic. In both countries, the confidential data files are quite large (about 2 million cases for Germany and the manufacturing sector in Canada and about 34.5 million cases for the full Canadian data sets). In practice, therefore, the null of equivalence is quite likely to be rejected, given this test's very high power.

Table 3: pMSE by sector and country

Country   Sector          pMSE     pMSE ratio   standardized pMSE
Canada    Manufacturing   0.0041   656.88       4908.17
Canada    Private         0.0121   10957.61     135525.77
Germany   Universe        0.0013   725.21       2896.85

To assess the risk of disclosure, we use a measure proposed by Kinney et al. (2011b): For each industry, we estimate the fraction of entities for which the synthetic birth year equals the true birth year, conditional on the synthetic birth year, and interpret it as a probability. Tables 14 and 15 in the Appendix show the minimum, maximum, and mean of these probabilities, by year. Figure 7 shows the maximum and average values across time, for each country. The figure shows that these probabilities are quite low except for the first year. Entry rates in the first year are much larger than in any other year due to censoring.
It is therefore quite likely that the (left-censored) entry year of the synthetic record matches that of the (left-censored) original record if the synthetic entry year is the first year observed in the data. A somewhat more muted version of this effect can be seen for Germany in the years 1991 and 1992, when the lower panel of Figure 7 shows another spike. These are the years in which data from Eastern Germany were added to the database successively, leading to new sets of (left-censored) entities.

With the exception of the first year in the data, the average rate of concordance between synthetic and observed birth year of an establishment in the Canadian data is below 5%, and the maximum is never above 50%. The German data reflect results from a smaller set of industries, and while the average concordance is higher (never above 10%), the maximum is never above 6% other than during the noted entry spikes. This suggests that the synthetic lifespan of any given entity is highly unlikely to be matched to its confidential real lifespan. This is generally considered to be a high degree of confidentiality.

Figure 7: Average and maximum likelihood that synthetic birth year matches actual birth year
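The concordance measure behind Figure 7 can be sketched in a few lines. The example assumes a hypothetical linked file with one row per entity holding the industry, the true birth year, and the synthetic birth year; the toy records and function names are illustrative and do not come from the authors' code.

```python
from collections import defaultdict


def concordance_by_industry(records):
    """records: iterable of (industry, true_birth_year, synthetic_birth_year).

    Returns {(industry, synthetic_year): fraction of entities whose true birth
    year equals the synthetic one}, i.e. conditional on the synthetic year.
    """
    matches = defaultdict(int)
    totals = defaultdict(int)
    for industry, true_year, synth_year in records:
        key = (industry, synth_year)
        totals[key] += 1
        matches[key] += int(true_year == synth_year)
    return {key: matches[key] / totals[key] for key in totals}


def summarize_by_year(fractions):
    """Minimum, mean, and maximum over industries for each synthetic birth year."""
    by_year = defaultdict(list)
    for (_industry, year), fraction in fractions.items():
        by_year[year].append(fraction)
    return {year: (min(v), sum(v) / len(v), max(v)) for year, v in sorted(by_year.items())}


# Hypothetical toy records; 1991 plays the role of the left-censored first year.
records = [
    ("A", 1991, 1991), ("A", 1991, 1991), ("A", 1995, 1991), ("A", 1993, 1991),
    ("A", 1994, 1994), ("A", 1991, 1994), ("A", 1996, 1994), ("A", 1992, 1994),
    ("B", 1991, 1991), ("B", 1992, 1991),
    ("B", 1993, 1994), ("B", 1995, 1994),
]

summary = summarize_by_year(concordance_by_industry(records))
for year, (low, mean, high) in summary.items():
    print(year, round(low, 3), round(mean, 3), round(high, 3))
```

The three values per year correspond to the minimum, mean, and maximum columns of the kind reported in Tables 14 and 15.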
Note: The plot in Figure 7 shows the fraction of entities by industry for which the synthetic birth year equals the true birth year, conditional on the synthetic birth year. The plot has been rescaled to be relative to the first year observed in the data. The Canadian manufacturing sector is not shown. In the German case, we only use two industries, but we show the average of the two, rather than the values for both industries, to maintain comparability with the Canadian plot.

Conclusion
In this paper, we presented results from two projects that evaluated whether the code developed to synthesize the U.S. LBD can easily be adapted to create synthetic versions of similar data from Canada and Germany. We considered both univariate time-series comparisons as well as model-based comparisons of coefficients and model fit. In general, utility evaluations show significant differences between each country's synthetic and confidential data. Frequently-used measures such as confidence interval overlap and pMSE suggest that the synthetic data are an unreliable image of the confidential data. Less formal comparisons of specification test scores suggest that the synthetic data do not reliably lead to the same modeling decisions.

Interestingly, the utility of the German synthetic data was higher than the utility of the Canadian data in almost all dimensions evaluated. At this point we can only speculate about potential reasons. The most important difference between the two data sources is that the German data comprises only a handful of industries while almost all industries have been included in the Canadian evaluation. Given that the industries included in the German data were rather large, and synthesis models are run independently for each industry, it might have been easier to preserve the industry level statistics for the German data. We cannot exclude the possibility that the structure of the German data aligns more closely with the LBD and thus the synthesis models tuned on the LBD data provide better results on the (adjusted) BHP than on the LEAP. We note that both the LBD and the BHP are establishment-level data sets, whereas the LEAP is an employer-level data set.

We emphasize that adjustments to the original synthesis code were explicitly limited to ensuring that the code runs on the new input data.
The validity of the synthetic data could possibly be improved by tuning the synthesis models to the particularities of the data at hand, such as the non-standard dynamics introduced into the German data by reunification. However, the aim of this project was to illustrate that the high investments necessary for developing the synthesis code for the LBD offered additional payoffs as the re-use of the code substantially reduced the amount of work required to generate decent synthetic data products for other business data. One of the major criticisms of the synthetic data approach has been that investments necessary to develop useful synthesizers are substantial. This project illustrated that substantial gains can be achieved when exploiting knowledge from previous projects. With the advent of tailor-made software such as the synthpop package in R (Nowok et al., 2016), the investments for generating useful synthetic data might be further reduced in the future.

However, even without fine-tuning or customization of models, the current synthetic data have, in fact, proven useful. De facto, many deployments of synthetic data, including the Synthetic LBD in the US, have been used for model preparation by researchers in a public or lower-security environment, with subsequent remote submission of prepared code for validation against the confidential data. When viewed through the lens of such a validation system, the synthetic data prepared here would seem to have reasonable utility. While time series dynamics are not the same, they are broadly similar. Models converged in similar fashions, and while coefficients were strictly different, they were broadly similar and plausible. Specification tests did not lead to the same conclusions, but they also did not collapse or yield meaningless conclusions. Thus, we believe that the synthetic data, despite being different, have the potential to be a useful tool for analysts to prepare models without direct access to the confidential data.
Vilhuber et al. (2016a) and Vilhuber (2019) come to a similar conclusion when evaluating usage of the synthetic data sets available through the Synthetic Data Server (Abowd et al., 2010), including the Synthetic LBD. A more thorough evaluation would need to explicitly measure the investment in synthetic data generation, the cost of setting up a validation structure, and the number of studies enabled through such a setup. We note that such an evaluation is non-trivial: the counter-factual in many circumstances is that no access is allowed to sensitive business microdata, or that access occurs through a secure research data system that is also costly to maintain. This study has contributed to such a future evaluation by showing that plausible results can be achieved with relatively low up-front investments.

The use of synthetic data sets to broaden access to confidential microdata is likely to increase in the near future, with increasing concerns by statistical agencies regarding the disclosure risks of releasing microdata. The resulting reduction in access to scientific microdata is overwhelmingly seen as problematic. Broadly “plausible” if not analytically valid synthetic data sets such as those described in this paper, combined with scalable remote submission systems that integrate modern disclosure avoidance mechanisms, may be a feasible mitigation strategy.
Acknowledgements
The opinions expressed here are those of the authors, and do not reflect the opinions of any of the statistical agencies involved. All results were reviewed for disclosure risks by their respective custodians, and released to the authors. Alam thanks Claudiu Motoc and Danny Leung for help with the Canadian data. Vilhuber acknowledges funding through NSF Grants SES-1131848 and SES-1042181, and a grant from the Alfred P. Sloan Foundation (G-2015-13903). Alam and Dostie acknowledge funding through SSHRC Partnership Grant “Productivity, Firms and Incomes”. The creation of the Synthetic LBD was funded by NSF Grant SES-0427889.
References
ABOWD, J. M. and J. I. LANE (2004). “New Approaches to Confidentiality Protection: Synthetic Data, Remote Access and Research Data Centers”. In: Privacy in Statistical Databases. Ed. by J. DOMINGO-FERRER and V. TORRA. Vol. 3050. Lecture Notes in Computer Science. Springer, pp. 282–289.
ABOWD, J. M. and I. SCHMUTTE (2015). “Economic analysis and statistical disclosure limitation”. In: Brookings Papers on Economic Activity Fall 2015.
ABOWD, J. M., B. E. STEPHENS, L. VILHUBER, F. ANDERSSON, K. L. MCKINNEY, M. ROEMER, and S. D. WOODCOCK (2009). “The LEHD Infrastructure Files and the Creation of the Quarterly Workforce Indicators”. In: Producer Dynamics: New Evidence from Micro Data. Ed. by T. DUNNE, J. B. JENSEN, and M. J. ROBERTS. University of Chicago Press.
ABOWD, J. M. and L. VILHUBER (2010). VirtualRDC - Synthetic Data Server. Cornell University, Labor Dynamics Institute.
ALAM, M. J., B. DOSTIE, J. DRECHSLER, and L. VILHUBER (2020). Replication archive for: Applying Data Synthesis for Longitudinal Business Data across Three Countries. Code and data. Zenodo.
ARELLANO, M. and S. BOND (1991). “Some Tests of Specification for Panel Data: Monte Carlo Evidence and an Application to Employment Equations”. In: Review of Economic Studies. URL: https://EconPapers.repec.org/RePEc:oup:restud:v:58:y:1991:i:2:p:277-297.
ARELLANO, M. and O. BOVER (1995). “Another look at the instrumental variable estimation of error-components models”. In: Journal of Econometrics. URL: https://EconPapers.repec.org/RePEc:eee:econom:v:68:y:1995:i:1:p:29-51.
BARTELSMAN, E., J. HALTIWANGER, and S. SCARPETTA (2009). “Measuring and Analyzing Cross-country Differences in Firm Dynamics”. In: DUNNE, T., J. B. JENSEN, and M. J. ROBERTS. Producer Dynamics: New Evidence from Micro Data. University of Chicago Press, pp. 15–76.
BENDER, S. (2009). “The RDC of the Federal Employment Agency as a part of the German RDC Movement”. In: Comparative Analysis of Enterprise Data, 2009 Conference. (Tokyo). URL: http://gcoe.ier.hit-u.ac.jp/CAED/index.html (visited on 05/05/2014).
BENEDETTO, G., J. HALTIWANGER, J. LANE, and K. MCKINNEY (2007). “Using Worker Flows in the Analysis of the Firm”. In: Journal of Business and Economic Statistics.
BLUNDELL, R. and S. BOND (1998). “Initial conditions and moment restrictions in dynamic panel data models”. In: Journal of Econometrics. URL: https://ideas.repec.org/a/eee/econom/v87y1998i1p115-143.html.
BLUNDELL, R., S. BOND, and F. WINDMEIJER (2001). “Estimation in dynamic panel data models: Improving on the performance of the standard GMM estimator”. In: Nonstationary Panels, Panel Cointegration, and Dynamic Panels. Ed. by B. H. BALTAGI, T. B. FOMBY, and R. CARTER HILL. Vol. 15. Advances in Econometrics. Emerald Group Publishing Limited, pp. 53–91. DOI: 10.1016/S0731-9053(00)15003-0. URL: https://doi.org/10.1016/S0731-9053(00)15003-0 (visited on 04/30/2020).
BUNDESAGENTUR FÜR ARBEIT (2013). Establishment History Panel (BHP). [Computer file]. Nürnberg, Germany: Research Data Centre (FDZ) of the German Federal Employment Agency (BA) at the Institute for Employment Research (IAB) [distributor].
DAVIS, S. J., J. C. HALTIWANGER, and S. SCHUH (1996). Job creation and destruction. Cambridge, MA: MIT Press.
DRECHSLER, J. (2011a). Synthetic Datasets for Statistical Disclosure Control–Theory and Implementation. New York: Springer.
DRECHSLER, J. (2011b). Synthetische Scientific-Use-Files der Welle 2007 des IAB-Betriebspanels. FDZ Methodenreport 201101 de. Institute for Employment Research, Nuremberg, Germany. URL: http://ideas.repec.org/p/iab/iabfme/201101_de.html.
DRECHSLER, J. (2012). “New data dissemination approaches in old Europe – synthetic datasets for a German establishment survey”. In: Journal of Applied Statistics. URL: http://ideas.repec.org/a/taf/japsta/v39y2012i2p243-265.html.
DRECHSLER, J., A. DUNDLER, S. BENDER, S. RÄSSLER, and T. ZWICK (2008). “A new approach for disclosure control in the IAB establishment panel—multiple imputation for a better data access”. In: AStA Advances in Statistical Analysis.
DRECHSLER, J. and L. VILHUBER (2014a). A First Step Towards A German Synlbd: Constructing A German Longitudinal Business Database. Working Papers 14-13. Center for Economic Studies, U.S. Census Bureau. URL: https://ideas.repec.org/p/cen/wpaper/14-13.html.
DRECHSLER, J. and L. VILHUBER (2014b). “A First Step Towards A German SynLBD: Constructing A German Longitudinal Business Database”. In: Statistical Journal of the IAOS: Journal of the International Association for Official Statistics. URL: http://iospress.metapress.com/content/X415V18331Q33150.
GUZMAN, J. and S. STERN (2016). The State of American Entrepreneurship: New Estimates of the Quality and Quantity of Entrepreneurship for 32 US States, 1988-2014. Working Paper 22095. National Bureau of Economic Research. DOI: 10.3386/w22095.
GUZMAN, J. and S. STERN (2020). Startup Cartography. (visited on 01/26/2020).
HANSEN, L. P. (1982). “Large Sample Properties of Generalized Method of Moments Estimators”. In: Econometrica. (visited on 04/30/2020).
HETHEY, T. and J. F. SCHMIEDER (2010). Using worker flows in the analysis of establishment turnover: Evidence from German administrative data. FDZ Methodenreport 201006 en. Institute for Employment Research, Nuremberg, Germany. URL: http://ideas.repec.org/p/iab/iabfme/201006_en.html.
JARMIN, R. S., T. A. LOUIS, and J. MIRANDA (2014). “Expanding The Role Of Synthetic Data At The U.S. Census Bureau”. In: Statistical Journal of the IAOS: Journal of the International Association for Official Statistics. URL: http://iospress.metapress.com/content/fl8434n4v38m4347/?p=00c99b98bf2f4701ae806ee638594915&pi=0.
JARMIN, R. S. and J. MIRANDA (2002). The Longitudinal Business Database. Working Papers 02-17. Center for Economic Studies, U.S. Census Bureau. URL: https://ideas.repec.org/p/cen/wpaper/02-17.html.
KARR, A. F., C. N. KOHNEN, A. OGANIAN, J. P. REITER, and A. P. SANIL (2006). “A Framework for Evaluating the Utility of Data Altered to Protect Confidentiality”. In: The American Statistician.
KINNEY, S. K., J. P. REITER, and J. MIRANDA (2014a). Improving The Synthetic Longitudinal Business Database. Working Papers 14-12. Center for Economic Studies, U.S. Census Bureau. URL: https://ideas.repec.org/p/cen/wpaper/14-12.html.
KINNEY, S. K., J. P. REITER, and J. MIRANDA (2014b). “Improving The Synthetic Longitudinal Business Database”. In: Statistical Journal of the IAOS: Journal of the International Association for Official Statistics.
KINNEY, S. K., J. P. REITER, A. P. REZNEK, J. MIRANDA, R. S. JARMIN, and J. M. ABOWD (2011a). LBD Synthesis Procedures. CES Technical Notes Series 11-01. Center for Economic Studies, U.S. Census Bureau. URL: https://ideas.repec.org/p/cen/tnotes/11-01.html.
KINNEY, S. K., J. P. REITER, A. P. REZNEK, J. MIRANDA, R. S. JARMIN, and J. M. ABOWD (2011b). “Towards Unrestricted Public Use Business Microdata: The Synthetic Longitudinal Business Database”. In: International Statistical Review. DOI: j.1751-5823.2011.00152.x. URL: https://ideas.repec.org/a/bla/istatr/v79y2011i3p362-384.html.
LITTLE, R. J. (1993). “Statistical Analysis of Masked Data”. In: Journal of Official Statistics.
NATIONAL RESEARCH COUNCIL (2007). Understanding Business Dynamics: An Integrated Data System for America’s Future. Ed. by J. HALTIWANGER, L. M. LYNCH, and C. MACKIE. Washington, DC: The National Academies Press.
NOWOK, B., G. RAAB, and C. DIBBEN (2016). “synthpop: Bespoke Creation of Synthetic Data in R”. In: Journal of Statistical Software, Articles.
RAAB, G. M., B. NOWOK, and C. DIBBEN (2018). “Practical Data Synthesis for Large Samples”. In: Journal of Privacy and Confidentiality. URL: https://journalprivacyconfidentiality.org/index.php/jpc/article/view/407.
RUBIN, D. B. (1993). “Discussion of Statistical Disclosure Limitation”. In: Journal of Official Statistics.
SEDLÁČEK, P. and V. STERK (2017). “The Growth Potential of Startups over the Business Cycle”. In: American Economic Review.
SNOKE, J., G. M. RAAB, B. NOWOK, C. DIBBEN, and A. SLAVKOVIC (2018a). “General and specific utility measures for synthetic data”. In: Journal of the Royal Statistical Society: Series A (Statistics in Society). eprint: https://rss.onlinelibrary.wiley.com/doi/pdf/10.1111/rssa.12358. URL: https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/rssa.12358.
SNOKE, J. and A. SLAVKOVIC (2018b). “pMSE Mechanism: Differentially Private Synthetic Data with Maximal Distributional Similarity: UNESCO Chair in Data Privacy, International Conference, PSD 2018, Valencia, Spain, September 26-28, 2018, Proceedings”. In: pp. 138–159.
STATISTICS CANADA (2019a). Business Register (BR). (visited on 01/30/2020).
STATISTICS CANADA (2019b). Longitudinal Employment Analysis Program (LEAP). (visited on 01/30/2020).
STATISTICS CANADA (2019c). Survey of Employment, Payrolls and Hours (SEPH). (visited on 01/30/2020).
STATISTICS CANADA and BUREAU OF THE CENSUS (1991). Concordance between the Standard Industrial Classifications of Canada and the United States, 1980 Canadian SIC - 1987 United States SIC. Catalogue No. 12-574E. Statistics Canada. URL: http://publications.gc.ca/site/eng/9.847987/publication.html (visited on 01/30/2020).
STATISTISCHES BUNDESAMT (2003). Classification of Economic Activities, issue 2003 (WZ 2003). Statistisches Bundesamt (Federal Statistical Office) of Germany. (visited on 02/02/2020).
U.S. CENSUS BUREAU (2015). Longitudinal Business Database 1975-2015 [Data file]. Tech. rep. (visited on 01/26/2020).
U.S. CENSUS BUREAU (2016a). County Business Patterns (CBP). U.S. Census Bureau. (visited on 01/26/2020).
U.S. CENSUS BUREAU (2016b). Statistics of U.S. Businesses (SUSB). U.S. Census Bureau. (visited on 01/26/2020).
U.S. CENSUS BUREAU (2017). Business Dynamics Statistics (BDS). U.S. Census Bureau. (visited on 01/26/2020).
VILHUBER, L. (2013). Methods for Protecting the Confidentiality of Firm-Level Data: Issues and Solutions. Document 19. Labor Dynamics Institute. URL: http://digitalcommons.ilr.cornell.edu/ldi/19/.
VILHUBER, L. (2018). LEHD Infrastructure S2014 files in the FSRDC. Working Papers 18-27. Center for Economic Studies, U.S. Census Bureau. URL: https://ideas.repec.org/p/cen/wpaper/18-27.html.
VILHUBER, L. (2019). Utility of two synthetic data sets mediated through a validation server: Experience with the Cornell Synthetic Data Server. Presentation. Conference on Current Trends in Survey Statistics. URL: https://hdl.handle.net/1813/43883.
VILHUBER, L. and J. M. ABOWD (2016a). Usage and outcomes of the Synthetic Data Server. Presentation. Meetings of the Society of Labor Economists. URL: https://hdl.handle.net/ .
VILHUBER, L., J. M. ABOWD, and J. P. REITER (2016b). “Synthetic establishment microdata around the world”. In: Statistical Journal of the International Association for Official Statistics.
WOO, M.-J., J. P. REITER, A. OGANIAN, and A. F. KARR (2009). “Global Measures of Data Utility for Microdata Masked for Disclosure Limitation”. In: Journal of Privacy and Confidentiality. URL: https://journalprivacyconfidentiality.org/index.php/jpc/article/view/568.
WOODCOCK, S. D. and G. BENEDETTO (2009). “Distribution-preserving statistical disclosure limitation”. In: Computational Statistics & Data Analysis. DOI: https://doi.org/10.1016/j.csda.2009.05.020.

Appendix
“Applying Data Synthesis for Longitudinal Business Data across Three Countries”
M. Jahangir Alam, Benoit Dostie, Jörg Drechsler, Lars Vilhuber
A Figures for the Manufacturing Sector in Canada

Figure 8: Entity characteristics for the manufacturing sector in Canada by year. Panels: (a) Gross employment level by year (gross employment, millions), (b) Total payroll (billions).

Figure 9: Dynamics of job flows for the manufacturing sector in Canada by year. Panels: (a) Job creation rates (%), (b) Job destruction rates (%).

Figure 10: Share of entities (upper panel), share of employment (middle panel), and share of payroll (lower panel) by year and industry for the Canadian manufacturing sector.
B Appendix Tables
B.1 pMSE
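As a rough illustration of how a pMSE of the kind reported in this section can be obtained, the sketch below stacks original and synthetic records, fits a logit model for the probability that a record is synthetic, and averages the squared distance between the fitted propensities and the synthetic share c. It assumes the usual definition pMSE = (1/N) Σ (p̂_i − c)² and the null expectation (k − 1)(1 − c)² c / N discussed by Snoke et al. (2018a). The single covariate, the toy data, and the hand-written Newton-Raphson fit are illustrative choices, not the specification behind Table 4.

```python
import math


def fit_logit(x, y, iterations=25):
    """Newton-Raphson fit of P(y = 1) = 1 / (1 + exp(-(b0 + b1 * x)))."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p                 # gradient of the log-likelihood
            g1 += (yi - p) * xi
            h00 += w                     # entries of X'WX
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def pmse(original, synthetic):
    """Propensity-score mean squared error for one covariate."""
    x = list(original) + list(synthetic)
    y = [0] * len(original) + [1] * len(synthetic)   # 1 marks a synthetic record
    c = len(synthetic) / len(x)
    b0, b1 = fit_logit(x, y)
    p_hat = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
    return sum((p - c) ** 2 for p in p_hat) / len(x)


def expected_null_pmse(k, c, n):
    """Null expectation (k - 1) * (1 - c)**2 * c / n; k counts parameters incl. intercept."""
    return (k - 1) * (1 - c) ** 2 * c / n


original = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
identical = list(original)                   # a "perfect" synthetic copy
shifted = [value + 2.0 for value in original]

print("pMSE, identical copy:", pmse(original, identical))
print("pMSE, shifted copy:", round(pmse(original, shifted), 4))
print("null expectation, N = 12:", round(expected_null_pmse(2, 0.5, 12), 5))
```

Because the null expectation shrinks in proportion to 1/N, a fixed pMSE translates into an ever larger ratio as the sample grows, which is one way to read the very large ratios in files with millions of records.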
Table 4: Detailed results for pMSE estimation by sector and country
Independent Variables   Canada                       Germany
Sector:                 Manufacturing   Private      All
Ln ALU                  0.158           0.7138       -0.2895
                        (0.0039)        (0.001)      (0.0033)
Ln Pay                  0.0039          -0.4426      0.2584
                        (0.0037)        (0.001)      (0.0028)
Age 3-4                 0.0392          0.0972       -0.0987
                        (0.0078)        (0.0017)     (0.007)
Age 5-7                 -0.0382         0.0477       -0.0973
                        (0.0073)        (0.0016)     (0.0066)
Age 8-12                -0.1258         -0.0263      -0.1172
                        (0.0071)        (0.0015)     (0.0063)
Age 13 or more          -0.219          -0.1024      -0.1487
                        (0.0074)        (0.0016)     (0.0059)
N                       2243011         34638723     2121956
pseudo R-sq             0.0112          0.0318       0.0038
pMSE                    0.0041          0.0121       0.0013
Note: See Equation 1 for estimation method. An observation is an entity-year in the combined database of each country-sector combination. All specifications include time and industry fixed effects. Standard errors are in parentheses.

B.2 Regression analysis tables

Table 5: Regression coefficients (OLS) for LEAP

Independent Variables   LEAP                         CanSynLBD
                        Private      Manufacturing   Private      Manufacturing
AR(1) Coefficient       0.2031***    0.2481***       0.3970***    0.4405***
                        (0.0001)     (0.0005)        (0.0002)     (0.0007)
Ln Pay                  0.7847***    0.7300***       0.5481***    0.5228***
                        (0.0001)     (0.0005)        (0.0002)     (0.0006)
Age 3-4                 -0.1202***   -0.1717***      -0.1223***   -0.2340***
                        (0.0003)     (0.0014)        (0.0004)     (0.0016)
Age 5-7                 -0.1260***   -0.1891***      -0.1235***   -0.2507***
                        (0.0003)     (0.0014)        (0.0004)     (0.0016)
Age 8-12                -0.1268***   -0.1973***      -0.1169***   -0.2551***
                        (0.0003)     (0.0013)        (0.0004)     (0.0016)
Age 13 or more          -0.1246***   -0.1992***      -0.1101***   -0.2577***
                        (0.0003)     (0.0014)        (0.0004)     (0.0017)
N
R
Note: In all specifications, we include both year and industry fixed effects. Standard errors are in parentheses. ***, **, and * indicate statistically significant coefficients at the 1%, 5%, and 10% levels, respectively.
Independent Variables   GLBD         GSynLBD
AR(1) Coefficient       0.4430***    0.4143***
                        (0.0007)     (0.0008)
Ln Pay                  0.4629***    0.5143***
                        (0.0006)     (0.0007)
Age 3-4                 -0.0695***   -0.0642***
                        (0.0017)     (0.0016)
Age 5-7                 -0.1066***   -0.0891***
                        (0.0017)     (0.0016)
Age 8-12                -0.1324***   -0.1109***
                        (0.0017)     (0.0016)
Age 13 or more          -0.1880***   -0.1600***
                        (0.0016)     (0.0015)
N
R
Note: In all specifications, we include both year and industry fixed effects. Standard errors are in parentheses. ***, **, and * indicate statistically significant coefficients at the 1%, 5%, and 10% levels, respectively.
Independent Variables LEAP CanSynLBD
Private Manufacturing Private ManufacturingAR(1) Coefficient 0.0805*** 0.1189*** 0.5722*** 0.5425***(0.0003) (0.0018) (0.0024) (0.0084)Ln Pay 0.8991*** 0.8523*** 0.4101*** 0.4302***(0.0002) (0.0015) (0.0018) (0.0067)Age 3-4 -0.0450*** -0.0797*** -0.2075*** -0.2972***(0.0002) (0.0014) (0.0010) (0.0051)Age 5-7 -0.0438*** -0.0860*** -0.2129*** -0.3162***(0.0002) (0.0015) (0.0011) (0.0059)Age 8-12 -0.0418*** -0.0923*** -0.2187*** -0.3294***(0.0003) (0.0017) (0.0013) (0.0070)Age 13 or more -0.0379*** -0.0898*** -0.2318*** -0.3414***(0.0003) (0.0019) (0.0015) (0.0080) N Note: In this table, m Independent Variables GLBD GSynLBD
AR(1) Coefficient 0.0489*** 0.6999***(0.0051) (0.0057)Ln Pay 0.7559*** 0.2916***(0.0035) (0.0042)Age 3-4 -0.0070*** -0.1026***(0.0012) (0.0015)Age 5-7 -0.0233*** -0.1386***(0.0014) (0.0017)Age 8-12 -0.0473*** -0.1694***(0.0015) (0.0018)Age 13 or more -0.1084*** -0.2183***(0.0015) (0.0018) N Note: In this table, m Independent Variables LEAP CanSynLBD
Private Manufacturing Private ManufacturingAR(1) Coefficient 0.0978*** 0.1614*** 0.5111*** 0.5780***(0.0002) (0.0014) (0.0008) (0.0041)Ln Pay 0.8854*** 0.8161*** 0.4562*** 0.4022***(0.0002) (0.0012) (0.0006) (0.0033)Age 3-4 -0.0555*** -0.1097*** -0.1828*** -0.3177***(0.0002) (0.0012) (0.0004) (0.0028)Age 5-7 -0.0558*** -0.1201*** -0.1860*** -0.3408***(0.0002) (0.0013) (0.0005) (0.0031)Age 8-12 -0.0548*** -0.1298*** -0.1875*** -0.3583***(0.0002) (0.0014) (0.0005) (0.0036)Age 13 or more -0.0524*** -0.1317*** -0.1943*** -0.3747***(0.0002) (0.0016) (0.0006) (0.0041) N Note: An observation is an entity-year. In this table, m Independent Variables GLBD GSynLBD
AR(1) Coefficient        0.1883***     0.6140***
                        (0.0021)      (0.0027)
Ln Pay                   0.6599***     0.3553***
                        (0.0014)      (0.0020)
Age 3-4                 -0.0292***    -0.0934***
                        (0.0011)      (0.0013)
Age 5-7                 -0.0512***    -0.1266***
                        (0.0011)      (0.0014)
Age 8-12                -0.0791***    -0.1545***
                        (0.0011)      (0.0015)
Age 13 or more          -0.1400***    -0.2012***
                        (0.0011)      (0.0015)
N

Note: An observation is an entity-year. In this table, m

Independent Variables   LEAP                            CanSynLBD
                        Private         Manufacturing   Private         Manufacturing
AR(1) Coefficient        0.2005***       0.2821***       0.4850***       0.5737***
                        (0.0007)        (0.0040)        (0.0012)        (0.0059)
Ln Pay                   0.8044***       0.7135***       0.4760***       0.4056***
                        (0.0005)        (0.0034)        (0.0009)        (0.0046)
Age 3-4                 -0.1245***      -0.2033***      -0.1716***      -0.3158***
                        (0.0005)        (0.0032)        (0.0006)        (0.0037)
Age 5-7                 -0.1328***      -0.2264***      -0.1733***      -0.3389***
                        (0.0005)        (0.0035)        (0.0006)        (0.0043)
Age 8-12                -0.1383***      -0.2454***      -0.1731***      -0.3560***
                        (0.0006)        (0.0039)        (0.0007)        (0.0051)
Age 13 or more          -0.1441***      -0.2586***      -0.1774***      -0.3717***
                        (0.0006)        (0.0042)        (0.0008)        (0.0058)
N

Note: An observation is a firm and a year. In this table, m

Independent Variables   GLBD          GSynLBD
AR(1) Coefficient        0.3701***     0.5268***
                        (0.0060)      (0.0048)
Ln Pay                   0.5349***     0.4202***
                        (0.0041)      (0.0036)
Age 3-4                 -0.0594***    -0.0831***
                        (0.0015)      (0.0013)
Age 5-7                 -0.0922***    -0.1105***
                        (0.0018)      (0.0015)
Age 8-12                -0.1252***    -0.1351***
                        (0.0019)      (0.0016)
Age 13 or more          -0.1850***    -0.1802***
                        (0.0019)      (0.0017)
N

Note: An observation is a firm and a year. In this table, m

C Canada: Synthesized Observations
Table 13: Synthesized observations
Category
Synthesized        22.01    93.35
Not synthesized     1.57     6.65
Total              23.58   100.00
Note: Industries that are not synthesized are NAICS 4481, 4482, 4483, 4511, 4513, 4841, 4842, 5241, and 5242. We drop observations from synthesized industries when there are fewer than ten observations in a given year. We do not synthesize the public sector (NAICS 61, 62, and 91).

Confidentiality assessment
Table 14: Observed entity births given synthetic births for LEAP.
First (Birth) Year      % of Births over NAICS
Synthetic   Actual      Minimum    Mean   Maximum
Birth Year              % of Births over NAICS
Synthetic   Actual      Minimum    Mean   Maximum
1976        1976          18.34   19.77     21.20
1977        1977           1.35    1.55      1.75
1978        1978           0.97    1.50      2.02
1979        1979           1.99    2.05      2.11
1980        1980           1.15    1.61      2.07
1981        1981           0.76    1.28      1.80
1982        1982           1.29    1.39      1.48
1983        1983           1.54    1.57      1.61
1984        1984           0.99    1.03      1.07
1985        1985           0.83    1.56      2.28
1986        1986           1.36    1.79      2.21
1987        1987           1.99    2.00      2.02
1988        1988           1.18    1.49      1.81
1989        1989           1.65    1.84      2.03
1990        1990           2.44    2.79      3.14
1991        1991           7.59    9.17     10.75
1992        1992           5.19    8.81     12.42
1993        1993           3.20    3.40      3.60
1994        1994           3.50    3.93      4.35
1995        1995           2.86    3.26      3.65
1996        1996           1.89    2.62      3.35
1997        1997           3.46    3.96      4.45
1998        1998           3.58    3.68      3.78
1999        1999           5.56    5.78      6.00
2000        2000           3.19    3.64      4.10
2001        2001           3.26    3.59      3.93
2002        2002           2.04    3.00      3.97
2003        2003           2.13    3.17      4.20
2004        2004           2.57    3.24      3.91
2005        2005           1.66    2.54      3.41
2006        2006           2.15    3.06      3.97
2007        2007           2.17    2.90      3.62
2008        2008           2.37    2.42      2.47
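
One plausible reading of the statistic tabulated above is the following: within each industry, take the entities whose synthetic birth year is t, compute the percentage whose actual birth year is also t, and then report the minimum, mean, and maximum of that percentage across industries. The sketch below implements this reading only. The record layout and the function name `matching_birth_share_by_year` are hypothetical, and the exact definition used for the table may differ.

```python
from collections import defaultdict


def matching_birth_share_by_year(records):
    """Summarize, per synthetic birth year, the share of matching actual births.

    records: iterable of (industry, synthetic_birth_year, actual_birth_year)
    tuples (a hypothetical layout). Returns a dict mapping each synthetic
    birth year to (minimum, mean, maximum) of the per-industry percentage of
    entities whose actual birth year equals the synthetic one.
    """
    # counts[(year, industry)] = [synthetic births, births with matching actual year]
    counts = defaultdict(lambda: [0, 0])
    for industry, syn_year, act_year in records:
        cell = counts[(syn_year, industry)]
        cell[0] += 1
        cell[1] += int(syn_year == act_year)

    # Collect the per-industry percentages for each synthetic birth year.
    shares = defaultdict(list)
    for (year, _industry), (n_births, n_matched) in counts.items():
        shares[year].append(100.0 * n_matched / n_births)

    return {
        year: (min(vals), sum(vals) / len(vals), max(vals))
        for year, vals in sorted(shares.items())
    }
```

Under this reading, low percentages indicate that a synthetic birth year reveals little about the actual birth year of an entity in the confidential data.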