On new data sources for the production of official statistics
David Salgado and Bogdan Oancea
Dept. Methodology and Development of Statistical Production, Statistics Spain (INE), Spain
Dept. Statistics and Operations Research, Complutense University of Madrid, Spain
Dept. Business Administration, University of Bucharest, Romania
February 7, 2020
Abstract
In the past years we have witnessed the rise of new data sources for the potential production of official statistics, which, by and large, can be classified as survey, administrative, and digital data. Apart from the differences in their generation and collection, we claim that their lack of statistical metadata, their economic value, and their lack of ownership by data holders pose several entangled challenges hindering the incorporation of new data into the routine production of official statistics. We argue that every challenge must be duly overcome in the international community to bring about new statistical products based on these sources. These challenges can be naturally classified into different entangled issues regarding access to data, statistical methodology, quality, information technologies, and management. We identify the most relevant ones, which must necessarily be tackled before new data sources can be definitively considered fully incorporated into the production of official statistics.
Introduction

In October 2018, the 104th DGINS conference (DGINS, 2018), gathering all directors general of the European Statistical System (ESS), “[a]gree[d] that the variety of new data sources, computational paradigms and tools will require amendments to the statistical business architecture, processes, production models, IT infrastructures, methodological and quality frameworks, and the corresponding governance structures, and therefore invite[d] the ESS to formally outline and assess such amendments”. Certainly, this statement is valid for producing official statistics in any statistical office.

More often than not, this need for the modernisation of the production of official statistics is associated with the rise of
Big Data (e.g. DGINS, 2013). In our view, however, this need is also naturally linked to the use of administrative data (e.g. ESS, 2013) and, even earlier, to the efforts to boost the consolidation of an international industry for the production of official statistics through shared tools, common methods, approved standards, compatible metadata, joint production models and congruent architectures (HLGMOS, 2011; UNECE, 2019).

Diverse analyses can be found in the literature providing insights about the challenges of Big Data and new digital data sources, in general, for the production of official statistics (Struijs et al., 2014; Landefeld, 2014; Japec et al., 2015; Reimsbach-Kounatze, 2015; Kitchin, 2015b; Hammer et al., 2017; Giczi and Szőke, 2018; Braaksma and Zeelenberg, 2020). These analyses are mostly strategic, high-level, and top-down. In this work we undertake a bottom-up approach, mainly aiming at identifying those factors underpinning the reason why statistical offices are not yet producing outputs based on all these new data sources. Simply put: why are statistical offices not routinely producing official statistics based on these new digital data sources?

Our main thesis is that, for statistical products based on new data sources to become routinely disseminated according to updated national and international legal regulations, at least all the issues identified below must be provided with a widely acceptable solution. Should we fail to cope with the challenges behind any one of these issues, the new products cannot be achieved. Thus, we are facing an intrinsically multifaceted problem. Furthermore, we shall argue that new data sources are compelling a new role for statistical offices derived from the social, statistical, and technical complexity of the new challenges.

These challenging issues are discussed separately in each section. In section 2 we review relevant aspects of the concept of data and its implications for the production of official statistics.
In section 3 we tackle the issue of access to these new data sources. In section 4 we briefly identify issues regarding the new statistical methodology necessary to undertake the production with both the traditional and the new data. In section 5 we deal with the implications regarding the quality assurance framework. In section 6 we briefly approach the questions about the information technologies. In section 7 we pose some reflections regarding skills, human resources, and management in statistical offices. We close with some conclusions in section 8.
Data: survey, administrative, digital
The production of official statistics is a multifaceted concept. Many of these facets are affected by the nature of the data. We pose some reflections about some of them. In a statistical office three basic data sources are nowadays identified: survey, administrative, and digital. This distinction runs parallel to the historical development of data sources.

A survey is a “scientific study of an existing population of units typified by persons, institutions, or physical objects” (Lessler and Kalsbeek, 1992). This is not to be confused with the idea of sampling itself, introduced in official statistics in 1895 by A. Kiær as the representative method (Kiær, 1897), provided with a solid mathematical basis and promoted to probability sampling firstly by Bowley (1906) and definitively by Neyman (1934), further developed originally in the US Census Bureau (Deming, 1950; Hansen et al., 1966; Hansen, 1987) (see also Smith, 1976), and still in common practice by statistical offices worldwide (Bethlehem, 2009a; Brewer, 2013). It has been the preferred and traditional tool to elaborate and produce official information about any finite population. The advent of different technologies in the 20th century produced a proliferation of so-called data collection modes (CAPI, CATI, CAWI, EDI, etc.) (cf. e.g. Biemer and Lyberg, 2003), but the essence of a survey is still there.

Administrative data is “the set of units and data derived from an administrative source”, i.e. from an “organisational unit responsible for implementing an administrative regulation (or group of regulations), for which the corresponding register of units and the transactions are viewed as a source of statistical data” (OECD, 2008). Some experts (see Deliverable 1.3 of ESS, 2013) drop the notion of units to avoid potential confusion and just refer to data. All in all, these definitions refer to registers developed and maintained for administrative, not statistical, purposes.
Apart from the diverse traditions in countries for the use of these data in the production of official statistics, in the European context Regulation No. 223/2009 provides the explicit legal support for the access to this data source by national statistical offices for the development, production and dissemination of European statistics (see Art. 24 of European Parliament and Council Regulation 223/2009, 2009). Curiously enough, the Kish tablet from the Sumerian empire (ca. 3500 BC), one of the earliest examples of human writing, seems to be an administrative record for statistical purposes.

More recently, the proliferation of digital data in an increasing number of human activities has posed the natural challenge for statistical offices to use this information for the production of official statistics. The term
Big Data has polarised this debate, with the apparent abuse of the n Vs definitions (Laney, 2001; Normandeau, 2013). But the phenomenon goes beyond this characterization, extending the potentiality for statistical purposes to any sort of digital data. In parallel to administrative data, we propose to define digital data as the set of units and data derived from a digital source, i.e. from a digital information system, for which the associated databases are viewed as a source of statistical data.

Notice that the OECD (2008) does not include this as one of the types of data sources, probably because this definition of digital data may be read as falling within the more general one of administrative data above, since administrative registers are nowadays also digitalised. We shall agree on restricting administrative data to the public domain, in agreement with current practice in statistical offices and in application of EU Regulation 223/2009.

This distinction between the three sources of data runs parallel to their collection modalities: surveys are essentially collected through structured interviews administered directly to the statistical unit of interest, administrative registers are collected from public administrative units, and digital data offers an undefined variety of potential private data providers (either individual or organizational). However, we want to emphasise that the differences among these data sources are deeper than just their collection modalities. Furthermore, these differences lie at the core of many of the challenges described in the next sections.
The first determining factor for the differences in these data sources is the presence or absence of statistical metadata, i.e. metadata for statistical purposes. Not only is it relevant to understand what this means but also, especially, to identify the reason why this introduces differences. Data such as survey data, generated with statistical structural metadata, embrace variables following strict definitions directly related to target indicators and aggregates under analysis (unemployment rates, price indices, tourism statistics, . . . ). These definitions are operationalised in careful designs of questionnaire items. Data are processed using survey methodology, which provides a rigorous inferential framework connecting data sets with the target populations at stake.

On the contrary, data such as administrative and digital data, generated without statistical structural metadata, embrace variables with a faint connection with target indicators and aggregates. This impinges on their further processing in many aspects, especially regarding data quality and the inference with respect to target populations. The ultimate reason for this absence of statistical metadata is that these data are generated to provide a non-statistical transactional service (taxes, medical care benefits, financial transactions, telecommunication, . . . ). This has already been identified in the literature (Hand, 2018). In contraposition to survey data, administrative and digital data are generated before their corresponding statistical metadata. They do have metadata, but not for statistical purposes.

In our view, the key distinguishing factor, derived from this absence of statistical metadata, arises from the explicit or (mostly) implicit conception of information behind data. This plays a critical role in the statistical production process.
The concept of information gathers three complementary aspects, namely (i) syntactic aspects concerning the quantification of information, (ii) semantic problems related to meaning, and (iii) utility issues regarding the value of information (see e.g. Floridi, 2019). When considering the traditional production of official statistics, we are all aware of the substantial investment in the system of metadata providing rigorous and unambiguous definitions for each of the variables collected in a survey, work conducted prior to data collection. This provides survey data with a purposive semantic layer and noticeably increases its value (all three aspects of the concept of information meet in survey data). On the contrary, administrative data are not generated under this umbrella of statistical metadata, but their semantic content is often still close enough to the statistical definitions used in a statistical office (think e.g. of the notions of employment, taxes, education grades, etc.). Nonetheless, the quality of administrative data for statistical purposes is still an issue (see e.g. Agafiţei et al., 2015; Foley et al., 2018; Keller et al., 2018). The situation with digital data is extreme. These data are generated to provide some kind of service completely extraneous to statistical production. Thus, meaning and value must be carefully worked out for the new data to be used in the production of official statistics (only the first layer of the concept of information is present in digital data). Some proposed architectures for the incorporation of new data sources (Ricciato, 2018b; Eurostat, 2019b) reflect this situation: a non-negligible amount of preprocessing is required prior to incorporating digital data into the statistical production process.

This different informational content of data for producing official statistics will prove to have far-reaching consequences on the production methodology.
We can borrow a well-known episode in the history of science to illustrate this difference and its consequences: the Copernican scientific revolution replacing the Ptolemaic system with the Newtonian law of universal gravitation (Kuhn, 1957). The Ptolemaic system enables us to compute and predict the behaviour of any celestial body by introducing more and more computational elements such as epicycles and deferents. Newton’s law of universal gravitation also enables us to compute this behaviour, under a completely different perspective. We can assimilate the former to a purely syntactic usage of data, whereas the latter somewhat incorporates meaning (theory). This is not a black-or-white comparison, since there is some theory behind the Ptolemaic system (Aristotelian physics), but the difference in the comprehension of natural phenomena provided by the two systems is striking, even using the same set of data. In other words, in the former case we just introduce our observed astronomical data into a more or less entangled computation system, whereas in the latter case we make use of underlying assumptions providing context, meaning, and explanations for all the observed data.

In an analogous way, let us now consider the difference between a regression model and a random forest, also for the same set of (big) data. In the former, some meaning is incorporated, or at least postulated, through the choice of a functional form between regressand and regressors (linear, logistic, multinomial, etc.). In the latter, only weaker computational assumptions are made. The situation is similar to the cosmological picture above and indeed lies at the core of the dichotomy between the so-called “theory-driven” and “data-driven” approaches to data analysis (see e.g. Hand, 2019). This also runs parallel to the model-based vs. design-based inference debate (Smith, 1994), whose finally adopted solution in favour of the latter can be summarised with the following statement by Hansen et al. (1983, p. 785): “[. . .
] it seems desirable, to the extent feasible, to avoid estimates or inferences that need to be defended as judgments of the analysts conducting the survey”. Avoiding prior hypotheses about data generation is possible using probability sampling (survey data), but with new data sources this is not the case anymore. This duality has already been identified in the use of Big Data as the historical debate between rationalism and empiricism (Starmans, 2016).

Thus, as a challenging issue, we may enquire whether statistical offices should still adopt a merely computational (empiricist) point of view à la Ptolemy or should pursue theoretical (rationalistic) findings à la Newton, perhaps searching for a better system of computation and estimation. No clear position is recognised in our community yet, and this will impinge not only on the statistical methodology for new data sources but on the whole role of statistical offices in society. This change will indeed be very deep.
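To make the contrast concrete, the following sketch fits the same synthetic data set with a postulated functional form and with a purely nonparametric device. It is entirely illustrative: the data-generating process and all names are our own assumptions, and a simple local-mean smoother stands in for the random forest mentioned above, since it likewise assumes no functional form.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "observed" data: a linear signal plus noise.
x = rng.uniform(0.0, 10.0, size=500)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=500)

# Theory-driven: a functional form (linearity) is postulated a priori;
# the two fitted coefficients carry meaning (a slope and an intercept).
slope, intercept = np.polyfit(x, y, deg=1)

# Data-driven: no functional form is assumed; a prediction is just the
# mean of the observed responses in a neighbourhood of the query point.
def local_mean(x0, bandwidth=0.5):
    mask = np.abs(x - x0) < bandwidth
    return y[mask].mean()

# Both devices reproduce the data, but only the first "explains" it:
print(slope, intercept)   # close to the generating values 2 and 1
print(local_mean(5.0))    # close to 2 * 5 + 1 = 11
```

Both approaches predict equally well on these data; the difference lies in whether the fitted object carries interpretable meaning, which is precisely the Ptolemy/Newton contrast above.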
The second main difference among the three data sources arises from their economic value. Traditional survey data have little economic value for a data holder/provider in comparison with digital data. For example, when a company owning a database for an online job vacancy advertisement service is requested to provide data about its turnover, number of employees, R+D investment, etc., sharing this information does not reasonably seem as critical as sharing the whole database for official statistical production. In the case of administrative data, whose public dimension we agreed upon above, the economic value for the public administration is secondary (statistical offices are indeed part of the public administration).

This economic value entails diverse consequences for the incorporation of digital data sources into official statistics production. Data collection is clearly more demanding. On the one hand, technological challenges lie ahead regarding retrieving, preprocessing, storing, and/or transmitting these new databases. On the other hand, and more importantly, by accessing the business core of data holders, the degree to which official statistical production disrupts their business processes is certainly higher. Moreover, technical staff are usually required to access these data sources and even to preprocess and interpret them for statistical purposes (e.g. telco data). This also impinges directly on the capability profiles of official statisticians. Thus, it is strikingly different to collect (either paper or electronic) questionnaires than to access huge business databases.

The economic value of digital data constitutes a key feature which demands careful attention by national and international statistical systems. The perception of risk for e.g. settling public-private partnerships (Robin et al., 2015) runs indeed parallel to this economic value.
High economic value usually comes as a result of high investments; therefore, sharing core business data with statistical offices may easily be perceived as too high a risk. However, if these public-private partnerships are perceived as an opportunity to increase this economic value (increasing e.g. data quality, the quality of commercial statistical products, and the social dimension of private economic activities), the statistical production and the information and knowledge generation thereof can be reinforced in society.

As we shall discuss in a later section, this suggests broadening the scope of official statistical outputs from traditionally closed products embedded in statistical domains (usually according to a strict legal regulation) to some enriched intermediate high-quality datasets for further customised production by other economic and social actors (researchers, companies, NGOs, . . . ) in a variety of socioeconomic domains. This is also a deep change in National Statistical Systems.
The third main difference stems from the fact that these digital data refer to third parties, not to the data holders themselves. These third parties are clients, subscribers, etc., sharing their private information in return for a business service. Implications immediately arise. Issues about the legal support for access are obvious (see section 3), but this factor is not entirely new. In survey methodology we already have the notion of proxy respondent (see e.g. Cobb, 2018), and in administrative data, information about citizens, and not about the data-holding public institutions, is the core of this data source.

Confidentiality and privacy issues naturally arise. Already in the traditional official statistical process a whole production step is dedicated to statistical disclosure control (Hundepool et al., 2012), reducing the re-identification risk of any sampling unit while assuring the utility of the disseminated statistics. Now, the data deluge has made this risk increase, since it is more feasible to identify individual population units (Rocher et al., 2019), even though data are not personally identified anymore (in contrast to survey and administrative data). Apart from the spread of privacy-by-design statistical processes, more advanced cryptographic techniques such as secure multiparty computation (see Zhao et al., 2019, and multiple references therein) must now be taken into consideration, especially regarding data integration.

Ethical issues should also be considered. Much can be written about the ethics of requesting private information from people or enterprises in a general setting. Regarding new data sources, the debate about accessing data for the production of official statistics has received attention, in particular regarding privacy and confidentiality. Since we do not have a clear position regarding this issue, we just want to provocatively share two reflections.
Firstly, with both survey and administrative sources, data for official statistics are personal data where both people and business units are univocally identified in internal sets of microdata at statistical offices. Take for example the European Health Survey (Eurostat, 2019a). Items like the following are included in the questionnaires:
HC08
When was the last time you visited a dentist or orthodontist on your own behalf (that is, not while only accompanying a child, spouse, etc.)?
HC09
During the past four weeks ending yesterday, that is since (date), how many times did you visit a dentist or orthodontist on your own behalf?

This sensitive information is collected together with a full identification of each respondent. Another example is a historical and fundamental statistic for society: causes of death (Eurostat, 2020a). People committing suicide or dying due to alcohol abuse, for example, are clearly identified in internal sets of microdata at statistical offices. Most digital sources provide anonymous (or pseudo-anonymous) data. Notice that for the case of the European Health Survey even duly anonymised microdata are publicly shared. Have statistical offices not been careful enough so far in the application of IT security and statistical disclosure control to scrupulously protect both the privacy and the confidentiality of statistical units in their traditional statistical products? Statistical offices have made no use of this information other than for strictly statistical purposes. Even despite the risk of identifiability, should the production of official statistics revise the ethics of its activity? Even for traditional sources?

Secondly, the fast generation of digital data nowadays clearly poses an immediate question. Should or should not society elaborate accurate and timely information for those matters of public interest (CPI, GDP, unemployment rates, . . . and even potentially novel insights) taking advantage of this data deluge? In all cases this is posed in full compatibility with growing economic sectors around digital data, both for statistical and nonstatistical purposes.

Obviously, this debate is part of the social challenges behind the generation of such an amount of digital information. Statistical offices cannot stand aside and should assume their role.
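The secure multiparty computation techniques mentioned above can be given a minimal flavour with additive secret sharing, sketched below under our own toy assumptions (two data holders contributing integer counts, three computing parties). Real protocols, such as those surveyed by Zhao et al. (2019), are considerably more involved.

```python
import secrets

P = 2**61 - 1  # a large prime modulus for the arithmetic shares

def share(value, n_parties=3):
    """Split an integer into n additive shares that sum to it mod P."""
    shares = [secrets.randbelow(P) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % P)
    return shares

# Two data holders hold private counts; neither reveals its value.
holder_a, holder_b = 1234, 5678

shares_a = share(holder_a)
shares_b = share(holder_b)

# Each computing party adds only the shares it received; single shares
# are uniformly random and leak nothing about the private inputs.
partial_sums = [(sa + sb) % P for sa, sb in zip(shares_a, shares_b)]

# Only the recombined total -- the statistic of interest -- is disclosed.
total = sum(partial_sums) % P
assert total == holder_a + holder_b  # 6912
```

No single share reveals anything about either holder's count, yet the aggregate is computed exactly, which is the property that makes such techniques attractive for privacy-preserving data integration.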
Access to data

Needless to say, should statistical offices have no access to new digital data sources, no official statistical products can be offered thereof. Let us consider the increasingly common situation in which a new data source is identified to improve or refurbish an official statistical product. What lies ahead preventing a statistical office from accessing the data? We have empirically identified four sets of issues: legal issues, data characteristics, access conditions, and business decisions.
Legal issues

Legal issues constitute apparently the most evident obstacle for a statistical office to access new digital data. It is relevant to underline that access to administrative data has been explicitly included in the main European regulation behind European statistics (European Parliament and Council Regulation 223/2009, 2009). In this sense, some countries have already introduced changes in their national regulations to explicitly include these new data sources in their Statistical Acts (see e.g. ?).

Certainly, very deep legal discussions can be initiated around the interpretation and scope of the different entangled regulations in the international and national legal systems but, in our opinion, all boil down to three factors: (i) Statistical Acts, (ii) specific data source regulations, and (iii) general personal data and privacy protection regulations. Regarding Statistical Acts, two main considerations are to be taken into account. On the one hand, by and large these regulations provide legal support for statistical offices to request data from different social agents. On the other hand, more rarely, these regulations also establish legal obligations for these agents to provide the requested data, resorting to sanctions in case of nonresponse (LFEP, 1989). Regarding data sources such as mobile network data, financial transaction data, online databases, etc., there commonly exist specific regulations protecting these data and restricting their use to their specific purposes (telecommunication, finance, online transactions, etc.). These regulations may pose unsolved conflicts with the preceding Statistical Acts.
Besides, personal data and privacy protection regulations, whose implementation is usually enacted through Data Protection Agencies, increase the degree of complexity, since exceptions for statistical purposes do not explicitly clarify the type of data source to be used for the production of official statistics.

When requesting sustainable access over time, all these issues must be surmounted having in mind the perspectives of statistical offices, data holders, and statistical units (citizens and business units). Simultaneously, (i) legal support for statistical offices must be clearly stated, (ii) data holders must also be legally supported in providing data, especially about third parties (statistical units), and (iii) privacy and confidentiality of all social agents’ data must be guaranteed by law and in practice. Needless to say, the law must be an instrument to preserve rights and establish legal support for all members of society.

Data characteristics

Data ecosystems for new data sources are highly complex and of very different natures. For example, telco data are generated in a complex cellular telecommunication network for many different internal technical and business purposes. Accessing data for statistical purposes implicitly implies the identification of those subsets of data needed for statistical production. Not every piece of data is useful for statistical purposes. Moreover, raw data are not useful for these purposes and need some preprocessing. Even worse, raw digital data have an unattainable volume for usual production standards at statistical offices and require technical assistance by telco engineers. Thus, some form of preprocessed or even intermediate data may instead be required, but then details about this data processing or intermediate aggregating step need to be shared for later official statistical processing.

All in all, the characteristics of new data for the production of official statistics strongly compel the collaboration with data holders.
This is completely novel for statistical offices.
Access conditions

As a result of the complexity behind new data sources, one of the considered options to use these data for statistical purposes is in-situ access, thus avoiding the risk that data leave the information systems of the data holders. This possibility alleviates the privacy and confidentiality issues, but the operational aspect must then be tackled, since the statistical office will have to somehow access these private information systems. A second option may be to transmit the data from the data holders’ premises to the statistical offices’ information systems. No access to the private information systems is needed, but privacy and confidentiality issues must then be solved in advance, both from the legal and the operational points of view. Finally, a trusted third party may enter the scene, who will receive the data from the data holders and then, possibly after some preprocessing, will transmit them to the statistical office. The confidentiality and privacy issue remains open and part of the official statistical production process is further delegated.

A second condition comes from the exclusivity for statistical offices to access and use these data. Should there be more social agents requesting access to and use of these data sources (e.g. other public agencies, ministries, international organizations, etc.), the access conditions from the data holders’ point of view would be extremely complex. This raises a natural enquiry about the potential social leading role of statistical offices in making these data available for the public good.

A third condition revolves around the issue of intellectual property rights and/or industrial secrecy requirements. Accessing these data sources usually entails accessing core industrial processes of the data holders, who rightfully want to protect their know-how from their competitors. Statistical offices must not disrupt market competition by leaking this information from one agent to another. Guarantees must be
Guarantees must beoffered and fixed in this sense.Fourthly, new data sources will be more efficient when combined among them and with administrativeand survey data. Furthermore, in a collaborating environment with data holders it seems naturally toconsider the choice to share this data integration (e.g. considering this intermediate output as a newstatistical product). Operational aspects of this data integration step (especially regarding statisticaldisclosure control) must be tackled (e.g. with secure multiparty computation techniques (Zhao et al.,2019); see also section 6.5).Finally, as partially mentioned above, costs associated to data retrieval, access, and/or processingbrought by the complexity of these data sources must be also taken into account. Occasionally this issuedoes not arise when collaborating for research and for one-shot studies, but it stands as an issue for thelong term data provision for standard production. Let us remind the principle 1 of the UN Principlesfor Access to Data for Official Statistics (UNGWG, 2016), where this data provision is called upon freeof charge and on a voluntary basis. However, principle 6 explicitly states the “[t]he cost and effort ofproviding data access, including possible pre-processing, must be reasonable compared to the expectedpublic benefit of the official statistics envisaged”. Moreover, this is complemented by principle 3 statingthat “[w]hen data is collected from private organizations for the purpose of producing official statistics,the fairness of the distribution of the burden across the organizations has to be considered, in order toguarantee a level playing field”. Thus, these principles arise as pertinent. However, the issue of the cost isextremely intricate. Firstly, the essential principle of Official Statistics by which data provision for thesepurposes must be made completely free of charge must be respected. 
Yet, the costs associated with data extraction and data handling for statistical purposes need a careful assessment, and this depends very sensitively on the concrete situation of the data holders. Different details need consideration: staff time in data processing, hardware computing time, hardware purchase and deployment (if necessary), software development or licenses (if necessary), . . . In addition, the compensation for these costs may take different shapes, from a direct payment to an implicit contribution to a long-term collaboration partnership. In any case, notice that this compensation should not be for the data themselves, but for the data extraction and data handling. Access to the data must be granted free of charge. Furthermore, if several data holders are at stake for the same data source, equal treatment must be procured for each of them. This is a wholly new social scenario for the production of official statistics.
Business decisions

Apart from the preceding factors, apparently potential conflicts of interest and risk assessments can also advise decision-makers in private organizations not to establish partnerships with statistical offices. The conflicts of interest may arise because of the perception of a potential collision in the target markets between statistical offices and private data holders/statistical producers. Our view is that this is only apparent, that statistical products for the public good considered in National Statistical Plans are of limited profit for private producers, and that for potentially intersecting insights a collaboration will increase the value of all products. Furthermore, corporate social responsibility and activities for social good naturally invite private organizations to set up this public-private collaboration, broadening the scope of their activities to increase the economic and social value of their data, to contribute to the development of national data strategies, and to support policy making more in accordance with their information needs.

All in all, access to and use of new data sources depend on a highly entangled set of challenging factors for many public and private organizations, but offer an extraordinary opportunity to contribute to the production and dissemination of information in the present digital society. Statistical offices should strive to reshape their role to become an active actor in this new scenario.
As stated in section 2, the lack of statistical metadata for new data sources and the fact that data are generated before any planning and design impinge directly on the core of traditional survey methodology, especially (but not only) through the limited applicability of sampling designs to these new data sources. This means that an official statistician accessing a new data source cannot resort to the tools of the traditional (indeed, current) production framework to produce a new statistical output. This does not mean whatsoever that there do not exist statistical techniques to process and analyse these new data; indeed, a great deal of statistical methods exists (see e.g. Hall/CRC, 2020). We simply lack a new, extended production framework covering the methodological needs of every statistical domain for each new data source. In this section we shall focus on key methodological aspects in the production of official statistics and share some reflections on the new methods.
There exist key concepts in traditional survey methodology, such as sample representativeness, bias, and inference, which should be reassessed in the light of the new types of data. Certainly, survey methodology is limited with new data sources, but it offers a template for a new, refurbished production framework to look at: it provides modular statistical solutions for a diversity of methodological needs along the statistical process in all statistical domains (sample selection, record linkage, editing, imputation, weight calibration, variance estimation, statistical disclosure control, . . . ). Furthermore, the connection between collected samples and target populations is firmly rooted in scientific grounds through design-based inference.

When considering an inference method other than sampling strategies (sampling designs together with asymptotically unbiased linear estimators), many official statisticians immediately react by alluding to sample representativeness. This combination of sampling designs and linear estimators is indeed in the DNA of official statisticians, and some first explorations of statistical methods facing this inferential challenge still resemble these sampling strategies (Beresewicz et al., 2018). In our view, the introduction of new methods should come with an explicit treatment of these key concepts (sample representativeness, bias, etc.).

To grasp how these concepts differ between the statistical methods for survey data and for new data sources, we shall briefly give our view on the origin of the strength felt by official statisticians around these concepts in the traditional production framework. As T.M.F. Smith (1976) already pointed out, the design-based inference seminally introduced by J. Neyman (1934) allows the statistician to make inferences about the population regardless of its structure.
Also in our view, this is the essential trait of design-based methodology in Official Statistics over other alternatives, in particular over model-based inference. As M. Hansen (1987) already remarked, statistical models may provide more accurate estimates if the model is correct, thus clearly showing the dependence of the final results on our a priori hypotheses about the population in model-based settings. Sampling designs free the official statistician from making hypotheses that are sometimes difficult to justify and to communicate.

This essential trait appears in the statistical methodology through the use of (asymptotically) design-unbiased linear estimators of the form $\hat{T} = \sum_{k \in s} \omega_{ks} y_k$, where $s$ denotes the sample, $\omega_{ks}$ are the so-called sampling weights (possibly dependent on the sample $s$), and $y$ stands for the target variable used to estimate the population total $Y = \sum_{k \in U} y_k$. A number of techniques exist to deal with diverse circumstances regarding both imperfect data collection and data processing procedures, so that non-sampling errors are duly dealt with (Lessler and Kalsbeek, 1992; Särndal and Lundström, 2005). These techniques lead us to the appropriate sampling weights $\omega_{ks}(x)$, usually dependent on auxiliary variables $x$. Sampling weights are also present in the construction of the variance estimates and thus of confidence intervals for the estimates.

The interpretation of a sampling weight $\omega_{ks}(x)$ as providing the number of statistical units in the population $U$ represented by unit $k$ in the sample $s$ is widely accepted, thus settling the notion of representativeness on apparently firm grounds. This combination of sampling designs and linear estimators, complemented with this interpretation of sampling weights, stands up as a robust defensive argument against any attempt to use new statistical methodology with digital sources. Indeed, one of the first rightful questions when facing the use of digital data is how the data represent the target population.
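As a minimal sketch, the linear estimator above can be computed directly from sampled values and inclusion probabilities. All numbers below (population size, sample size, the Gaussian target variable) are hypothetical, and simple random sampling without replacement is assumed so that every inclusion probability equals n/N:

```python
import random

def horvitz_thompson_total(sample_values, inclusion_probs):
    """Design-based linear estimator T_hat = sum_k y_k / pi_k.

    Under simple random sampling without replacement of size n from N
    units, pi_k = n / N for every unit, so each sampled unit carries the
    sampling weight omega_k = N / n."""
    return sum(y / pi for y, pi in zip(sample_values, inclusion_probs))

# Hypothetical population of N = 1000 units with known target values.
random.seed(1)
N, n = 1000, 100
population = [random.gauss(50, 10) for _ in range(N)]
true_total = sum(population)

sample = random.sample(population, n)            # SRSWOR of size n
estimate = horvitz_thompson_total(sample, [n / N] * n)

print(round(true_total), round(estimate))
```

The estimate fluctuates around the true total over repeated samples; no appeal to "representativeness" is required beyond the known inclusion probabilities.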
With many new digital sources (mobile network data, web-scraped data, financial transaction data, . . . ) the question is clearly meaningful.

However, before trying to give a due response with new methodology, we believe it is of utmost relevance to be aware of the limitations of sampling design methodology in the inference exercise linking sampled data and target populations. This will help producers and stakeholders be conscious of the changes brought by new methodological proposals and view the challenges in the appropriate perspective.

Firstly, the notion of representativeness is slippery business. This concept was already analyzed in this line by Kruskal and Mosteller (1979a,b,c, 1980). Surprisingly enough, a mathematical definition is not found in classical and modern textbooks, with Bethlehem (2009a) providing an exception in terms of a distance between the empirical distributions of a target variable in the sample and in the target population. Obviously, this definition is very difficult to implement in practice (we would need to know the population distribution). Nonetheless, this has not been an obstacle to the extended, even dangerous, use of the concept of representativeness. From time to time, one can hear that the construction of linear estimators rests on $\omega_{ks}(x)$ being the number of population units represented by the sampled unit $k$, so that $\omega_{ks}(x) \cdot y_k$ amounts to the part of the population aggregate accounted for by unit $k$ in the sample $s$, and $\sum_{k \in s} \omega_{ks} \cdot y_k$ is finally the total population aggregate to estimate. A strong resistance is partially perceived in Official Statistics against any other technique not providing some similar clear-cut reasoning accounting for the representativeness of the sample.
This argument is indeed behind the restriction upon sampling weights not to be less than 1 (a weight below 1 being interpreted as a unit not even representing itself) or to be positive in sampling weight calibration procedures (see e.g. Särndal (2007)). In our view, the interpretation of a unit $k$ in a sample as representing $\omega_{ks}$ units in the population can be impossible to justify even in such a simple example as a Bernoulli sampling design of probability $\pi = \frac{1}{2}$ in a finite population of size $N = 3$: if, e.g., the sample $s$ contains two units, how should we understand that these two units, each with weight $1/\pi = 2$, represent 4 population units?

Ultimately, the goal of an estimation procedure is to provide an estimate as close as possible to the real unknown target quantity, together with a measure of its accuracy. The concept of mean square error, and its decomposition into bias and variance components (Groves, 1989), is essential here. Estimators with a lower mean square error guarantee a higher-quality estimation. No mention of representativeness is needed. Furthermore, not even the requirement of exact unbiasedness is rigorously justified: compare the estimation of a population mean using an expansion (Horvitz-Thompson) estimator and using the Hájek estimator (Hájek, 1981).

The randomization approach does allow the statistician to conduct inferences without prior hypotheses on the structure of the population, i.e. the confidence intervals and point estimates are valid for any structure of the population. But this does not entail that the estimator must necessarily be linear. Given a sample $s$ randomly selected according to a sampling design $p(\cdot)$ and the values $y$ of the target variable, a general estimator is any function $T = T(s, y)$, linear estimators being a specific family thereof (Hedayat and Sinha, 1991). Thus, what prevents us from using more complex functions, provided we search for a low mean square error? Apparently nothing.
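The comparison between the expansion and Hájek estimators invoked above can be checked with a small Monte Carlo sketch. The population values, the Bernoulli inclusion probability, and the number of replications are arbitrary choices made for illustration only:

```python
import random

def mse_comparison(seed=0, draws=2000):
    """Compare the expansion (Horvitz-Thompson) and Hajek estimators of a
    population mean under Bernoulli sampling with pi = 0.5.

    The expansion estimator (sum y/pi)/N is exactly design-unbiased but
    suffers from the random sample size; the Hajek estimator (here the
    plain sample mean, since all weights are equal) is only approximately
    unbiased yet typically attains a far lower mean square error."""
    rng = random.Random(seed)
    N, pi = 200, 0.5
    y = [rng.gauss(100, 5) for _ in range(N)]
    true_mean = sum(y) / N

    se_ht, se_hajek = 0.0, 0.0
    for _ in range(draws):
        s = [yk for yk in y if rng.random() < pi]   # Bernoulli sample
        if not s:
            continue
        ht = sum(yk / pi for yk in s) / N           # expansion estimator
        hajek = sum(s) / len(s)                     # Hajek estimator
        se_ht += (ht - true_mean) ** 2
        se_hajek += (hajek - true_mean) ** 2
    return se_ht / draws, se_hajek / draws

mse_ht, mse_hajek = mse_comparison()
print(mse_hajek < mse_ht)
```

With these settings the Hájek estimator, despite its small bias, shows a far lower mean square error because it is immune to the random sample size of the Bernoulli design.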
A linear estimator may be viewed as a homogeneous first-order approximation to an estimator $T(s, y)$, i.e. $T(s, y) \approx \sum_{k \in s} \omega_{ks} y_k$, but why not a second-order approximation $T(s, y) \approx \sum_{k \in s} \omega_{ks} y_k + \sum_{k, l \in s} \omega_{kls} y_k y_l$? Or even a complete series expansion $T(s, y) \approx \sum_{p=1}^{\infty} \sum_{k_1, \ldots, k_p \in s} \omega_{k_1 \ldots k_p s} \, y_{k_1} \cdots y_{k_p}$ (see e.g. Lehtonen and Veijanen (1998))?

However, the multivariate character of the estimation exercise at statistical offices provides a new ingredient shoring up the idea of representativeness, especially through the concept of sampling weight. Given the public dimension of Official Statistics, usually disseminated in numerous tables, numerical consistency (not just statistical consistency) is strongly requested across all disseminated tables, even among different statistical programs. For example, if a table with smoking habits is disseminated broken down by gender and another table with eating habits is also disseminated broken down by gender, the total numbers of women and men inferred from both tables must be exactly equal. Not only is this restriction of numerical consistency demanded among all disseminated statistics in a survey, but also among statistics of different surveys, especially for core variables such as gender, age, or nationality. Linear estimators can easily be made to fulfil this restriction by enforcing the so-called multipurpose property of sampling weights (Särndal, 2007), which entails that the same sampling weight $\omega_{ks}$ is used for any population quantity to estimate in a given survey. For inter-survey consistency, the calibration of sampling weights is sometimes (dangerously) used. This elementarily guarantees the numerical consistency of all marginal quantities in disseminated tables.

Notice, however, that this property has to be forced. Indeed, the different techniques to deal with non-sampling errors (e.g.
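The multipurpose property can be illustrated with a toy microdata set. The respondents, habit variables, and weights below are entirely made up; the point is only that one shared weight per unit forces the gender margins of two different disseminated tables to coincide numerically:

```python
def weighted_table(records, weights, variable):
    """Cross-tabulate a 0/1 habit variable by gender using one shared
    (multipurpose) weight per record; returns, per gender, the estimated
    number of persons and the estimated number of persons with the habit."""
    totals = {}
    for rec, w in zip(records, weights):
        g = rec["gender"]
        totals.setdefault(g, [0.0, 0.0])
        totals[g][0] += w                      # estimated persons
        totals[g][1] += w * rec[variable]      # estimated persons with habit
    return totals

# Hypothetical sample with one multipurpose weight per respondent.
sample = [
    {"gender": "F", "smokes": 0, "eats_out": 1},
    {"gender": "F", "smokes": 1, "eats_out": 0},
    {"gender": "M", "smokes": 0, "eats_out": 0},
    {"gender": "M", "smokes": 1, "eats_out": 1},
]
weights = [120.0, 80.0, 150.0, 50.0]

smoking = weighted_table(sample, weights, "smokes")
eating = weighted_table(sample, weights, "eats_out")

# Because the same weight multiplies every variable, the estimated numbers
# of women and men are numerically identical in both tables.
print(smoking["F"][0] == eating["F"][0], smoking["M"][0] == eating["M"][0])
```

Adjusting each variable separately for non-sampling errors would break exactly this numerical identity, which is why the multipurpose property has to be forced.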
non-response or measurement errors) rely on auxiliary information $x$, so that sampling weights are functions of these auxiliary covariates, $\omega_{ks} = \omega_{ks}(x)$. Forcing the multipurpose property amounts to forcing the same behaviour in terms of non-response, measurement errors, etc. (thus in terms of social desirability or satisficing response mechanisms) for all target variables in the survey. It would apparently be more rigorous to adjust the estimators for non-sampling errors on a separate basis, looking only for statistical consistency among marginal quantities. However, this is much harder to explain in the dissemination phase, and traditionally the former option is prioritized, paving the way for the representativeness discourse (now every sampled unit is thought to “truly” represent $\omega_{ks}$ population units).

Secondly, sampling designs are thought of as a life jacket against model misspecification. For example, even without a truly linear model between the target variable $y$ and covariates $x$, the GREG estimator is still asymptotically unbiased (Särndal et al., 1992). But (asymptotic) design-unbiasedness does not guarantee a high-quality estimate. A well-known example can be found in Basu's elephants story (Basu, 1971). Apart from its implications for the inferential paradigm, this story clearly shows how a poor sampling design drives us to a poor estimate, even when using exactly design-unbiased estimators. A design-based estimate is good if the sampling design is correct.

Finally, as already well known from small area estimation techniques (Rao and Molina, 2015) and from what R. Little (2012) called inferential schizophrenia, sampling designs cannot provide a full-fledged inferential solution for all possible sample sizes out of a finite population. Traditional estimates based on sampling designs show their limitations when the sample size for population domains begins to decrease dramatically.
With new digital data one expects to avoid this problem by having plenty of data, but along the same line one of the expected benefits of the new data sources is to provide information at an unprecedented space and time scale. So the problem may still remain in rare population cells.

In our view, therefore, we must keep the spirit behind representativeness in an abstract or diffuse way, together with the quest for lack of bias and for low variances, as in traditional survey methodology. But we should avoid some restrictive misconceptions and open the door to finding solutions in the quest for accurate statistics with new data sources. There exist multiple statistical methods which should be identified to make up a more general statistical production framework. Probability theory can still provide a firm connection between collected data sets and target populations of interest.

We do not dare to provide an enumeration of the statistical methods making up the new production framework. Much further empirical exploration and analysis of the new data sources are needed to furnish a solid production framework, and this will take time. However, some ideas can already be envisaged. The impossibility of using sampling designs necessarily makes us resort to statistical models, which essentially amounts to the conception of data as realizations of random variables (Lehmann and Casella, 1998). As stated above, notice that this was not the case for the inferential step in survey methodology (although it was supplementarily the case for other production steps, e.g. imputation).

The consideration of random variables as a central element immediately brings into scene the distinction between the enumerative and analytical aims of official statistical production (Deming, 1950). Let us use an adapted version of exercise 1 on page 254 of the book by Deming (1950). Consider an industrial machine producing bolts according to a given set of technical specifications (geometrical form, temperature resistance, weight, etc.).
These bolts are packed into boxes of a fixed capacity (say, N bolts), which are then distributed for retail trade. We distinguish two statistically different (though related) questions about this situation. On the one hand, we may be interested in knowing the number of defective bolts in each box. On the other hand, we may be interested in knowing the rate of production of defective bolts by the machine. Both questions are meaningful. The retailer will naturally be interested in the former question, whereas the machine owner will also be interested in the latter. Statistically, the former question amounts to the problem of estimation in a finite population (Cassel et al., 1977), while the latter is a classical inference problem (Casella and Berger, 2002). Indeed, the concept of sample in the two situations is different (see the definition of sample by Cassel et al. (1977) for a finite population setting and that by Casella and Berger (2002) for an inference problem). Notice that the use of inferential samples is not extraneous to the estimation problem in finite populations: the prediction-based approach to finite-population estimation (Valliant et al., 2000; Chambers and Clark, 2012) already treats target variables as random variables. In traditional official statistical production, the former sort of question is solved (number of unemployed people, of domestic tourists, of hectares of wheat crop, etc.). With new data sources and the need to consider data values as realizations of random variables, should Official Statistics begin considering the new questions as well?

In this line, there already exists an important avenue of Statistics and Computer Science research which Official Statistics, in our view, should incorporate into the statistical outputs included in National Statistical Plans.
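The two questions in the bolts example above can be sketched in code. The box size and the machine's defect rate below are hypothetical values chosen for the sketch:

```python
import random

def simulate_box(rng, N=1000, defect_rate=0.03):
    """One box of N bolts from a machine with a given true defect rate;
    True marks a defective bolt."""
    return [rng.random() < defect_rate for _ in range(N)]

rng = random.Random(42)
box = simulate_box(rng)

# Enumerative question (finite population): how many defective bolts are
# in THIS box? Inspecting the whole box answers it exactly.
defectives_in_box = sum(box)

# Analytic question (classical inference): what is the machine's defect
# RATE? The box is now a sample from the production process, and the
# answer is an estimate carrying sampling uncertainty.
estimated_rate = defectives_in_box / len(box)

print(defectives_in_box, round(estimated_rate, 3))
```

The same data answer both questions, but only the second one treats the observations as realizations of random variables generated by the machine.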
Traditionally, the focus of the estimation problem in finite populations has been totals of variables, providing aggregate information for a given population of units broken down into different dissemination population cells. The wealth of new digital data opens up the possibility to investigate the interaction between those population units, i.e. to investigate networks. Indeed, a recent discipline has emerged focusing on this feature of reality (see Barabási (2008) and multiple references therein). Aspects of society with public interest regarding the interaction of population units should be in the focus of production activities in statistical offices. New questions such as the representativeness of the interactions in a given data set with respect to a target population arise as a new methodological challenge in Official Statistics.

A closer look at the mathematical elements behind this network science will reveal the versatile use of graph theory (Bollobás, 2002; van Steen, 2010) to cope with complexity. As a matter of fact, the combination of probability theory and graph theory is a powerful choice to process and analyse large amounts of data. Probabilistic graphical models (Koller and Friedman, 2009), in our view, should be part of the methodological tools to produce official statistics with new data sources. They provide an adaptable framework to deal with many situations such as speech and pattern recognition, information extraction, medical diagnosis, genetics and genomics, computer vision and robotics in general, . . . This is already bringing a new set of statistical and learning techniques into production.

This immediately takes us to machine learning and artificial intelligence techniques. In this regard, we should distinguish between the inferential step connecting data and target populations and the rest of the production steps. Many tasks, old and new, can be envisaged as incorporating these recent techniques to gain efficiency.
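Returning to probabilistic graphical models, a minimal hand-rolled sketch of exact inference on a two-node model X → Y may help fix ideas. The variables and the probability values are invented for illustration only:

```python
def marginal_y(p_x, p_y_given_x):
    """Exact marginalization P(Y) = sum_x P(X=x) P(Y=y|X=x) on a two-node
    directed graphical model X -> Y with binary variables."""
    return {
        y: sum(p_x[x] * p_y_given_x[x][y] for x in p_x)
        for y in (0, 1)
    }

# Hypothetical model: X = "person is a subscriber of the mobile network",
# Y = "person generates a call detail record in the period".
p_x = {0: 0.2, 1: 0.8}
p_y_given_x = {0: {0: 1.0, 1: 0.0},   # non-subscribers generate no CDRs
               1: {0: 0.3, 1: 0.7}}   # subscribers may stay silent

print(marginal_y(p_x, p_y_given_x))
```

Larger models chain many such factors together, and the same sum-product logic (implemented efficiently) underlies inference in general probabilistic graphical models.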
Traditional activities such as data collection, coding, editing, imputation, etc. can presumably be improved with random forests, support vector machines, neural networks, natural language processing, etc. New activities such as pattern and image recognition, record deduplication, [. . . ] will also be conducted with these new techniques. Further research and innovation must be carried out in this line.

For the inferential step, however, we cannot see these new techniques as a definitive improvement. Our reasoning goes as follows. An essential ingredient in machine learning and artificial intelligence is experience (Goodfellow et al., 2016), i.e. the accumulation of past data from which the machine or the intelligent agent will learn. Learning to make inferences for a target population entails that we know and accumulate the ground truth so that algorithms can be trained and tested. The ground truth for a target population is never known. Thus, the inference step must receive the same attention as in traditional production. There may be situations in which the wealth and nature of digital data bring about the case where the whole target population is sampled (e.g. a whole national territory can be covered by satellite images to measure the extension of crops), but even in those cases the treatment of non-sampling errors must be taken into account (as already envisaged by Yates (1965)).

This incorporation of new techniques from fields like machine learning and artificial intelligence entails the need to set up a common vocabulary and understanding of many related concepts in these disciplines and in traditional statistical production. Let us focus, e.g., on the notion of bias, which arises again and again both in machine learning and in estimation theory.
In traditional finite population estimation, the bias $B(\hat{Y})$ of an estimator $\hat{Y}$ of a population total $Y$ is defined with respect to the sampling design $p(\cdot)$ as $B(\hat{Y}) = E_p(\hat{Y}) - Y$, which basically amounts to an expectation value over all possible samples. In survey methodology, estimators are (asymptotically) unbiased by construction. This notion of bias is not to be confused with the difference between the true population total $Y$ and an estimate $\hat{Y}(s)$ from the selected sample. This estimate error $\hat{Y}(s) - Y$ is never known and can be non-zero even for exactly unbiased estimators. When the prediction approach is assumed and the population total is also considered a random variable, the concept of (prediction) bias is slightly different: $B(\hat{Y}) = E_m(\hat{Y} - Y)$, where $m$ stands for the data model. These notions of population bias are not to be confused with the measurement error $y_k^{obs} - y_k^{(0)}$, where $y_k^{obs}$ stands for the raw value observed in the questionnaire and $y_k^{(0)}$ for the true value of variable $y$ for unit $k$. Indeed, in statistical learning this is very often referred to as bias, since it is the variable $y$ itself that is modelled. An effort to build a precise terminology when new techniques are used is needed in order to ensure a common understanding of the mathematical concepts at stake. Another example comes from the reference to linear regression as a “machine learning algorithm” (Goodfellow et al., 2016). New techniques bring new useful perspectives even into the traditional process, but the community of official statistics producers must make sure that communication barriers do not arise.

Finally, apart from machine learning and artificial intelligence, and in connection with the different aspects of data access and data use already mentioned in section 3, we must make special mention of data collection and data integration.
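The distinction between the design bias $E_p(\hat{Y}) - Y$ and the realized estimate error $\hat{Y}(s) - Y$ can be made concrete with a small simulation; the population and sample sizes below are arbitrary choices for the sketch:

```python
import random

def design_bias_vs_estimate_error(seed=3, draws=4000):
    """Contrast the design bias E_p(Y_hat) - Y, which (up to Monte Carlo
    noise) vanishes for the expansion estimator under SRSWOR, with the
    estimate error Y_hat(s) - Y, which is almost never zero for an
    individual sample s."""
    rng = random.Random(seed)
    N, n = 500, 50
    y = [rng.uniform(0, 10) for _ in range(N)]
    Y = sum(y)

    # Expansion estimator N/n * sum(sample) over many independent samples.
    estimates = [sum(rng.sample(y, n)) * N / n for _ in range(draws)]

    design_bias = sum(estimates) / draws - Y                    # near 0
    avg_abs_error = sum(abs(e - Y) for e in estimates) / draws  # clearly not 0
    return design_bias, avg_abs_error

bias, avg_err = design_bias_vs_estimate_error()
print(round(bias, 2), round(avg_err, 2))
```

The averaged bias is orders of magnitude smaller than the typical single-sample error, which is exactly why unbiasedness alone says little about any one published estimate.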
New digital data per se will individually provide high value to official statistical products, but it is arguably their integration and combination with survey and administrative sources that will boost the scope of future statistical products. At this moment, this integration and combination is thought to be potentially conducted only without disclosure of any of the integrated databases. This drives us necessarily to cryptology and the incorporation of cryptosystems into the production of official statistics. Notice, however, that this does not substitute the statistical disclosure control upon final outputs, which must still be conducted; now it is also at the input of the statistical process that data values are not to be disclosed. The cryptosystem must be able to carry out complex statistical processing in an undisclosed way. A lot of research in this line is needed.

All in all, new methods are to be incorporated together with the new data sources, many of them already existing in other disciplines. The challenge is to furnish a new production framework. New data and new methods necessarily bring considerations regarding quality, the technological environment, and staff capabilities and management within statistical offices.

Quality has been a distinguishing feature of official statistical production for many decades, and a lot of effort has traditionally been devoted to reaching high-quality standards in survey-based, publicly disseminated statistical products. With new data sources these high-quality standards must also be pursued.

We identify key notions in current quality systems in Official Statistics and try to understand how they are affected by the nature of the new data sources and the new needs in statistical methodology. We underline three important notions.
Firstly, the concept of quality in Official Statistics has evolved from an exclusive focus on accuracy to the present multidimensional conception in terms of (i) relevance, (ii) accuracy and reliability, (iii) timeliness and punctuality, (iv) coherence and comparability, and (v) accessibility and clarity (ESS, 2014). Current quality assurance frameworks in national and international statistical systems implement this multidimensional concept of quality (or slight variants thereof). Will new quality dimensions be needed? Will some existing quality dimensions become unnecessary? Secondly, a statistical product is understood to have a high-quality standard if it has been produced by a high-quality statistical process. How will the changes in the statistical process affect quality? Thirdly, quality is mainly conceived of as “fit for purpose” (Eurostat, 2020b). How will statistical products based on new data sources be fit for purpose? Certainly, these are not orthogonal, unrelated notions, but jointly they can offer a wide overview of the main quality issues.
Regarding the quality dimensions, we do not foresee a need to reconsider the current five-dimensional conception mentioned above. Already with traditional data, alternative, more complex multidimensional views of data quality could be found in the literature (see e.g. Wand and Wang, 1996, and multiple references therein). In our view, the nature of new data sources will certainly require a revision of the existing dimensions, especially the conceptualization and computation of some quality indicators, but not the suppression of existing dimensions or the introduction of new ones. Let us consider, as an immediately relevant example, the consequences of using model-based inference (possibly deeply integrated in complex machine learning or artificial intelligence algorithms). Parameter setting, model choice, and any form of prior hypothesis regarding the model construction must be clearly assessed and communicated. This ingredient, impinging on accuracy, comparability, accessibility, and clarity, gains in relevance with new data sources. We comment very briefly on the aforementioned quality dimensions:

• Relevance essentially addresses the current and potential statistical needs of users. This dimension is deeply entangled with our third question regarding being fit for purpose. We will deal with this dimension more extensively below.

• Accuracy is directly affected by the new methodological scenario. Inference cannot be design-based with new data sources, thus model-based estimates will gain more presence. Furthermore, since these new data sources come mostly from event-register systems, the usual reasoning on target units and target variables is not directly applicable, thus reducing the validity of the usual classification of errors (sampling, coverage, non-response, measurement, processing). These error categories are severely survey-oriented and, despite the possibility of more generic readings of the current definitions, we find it necessary to undergo a detailed revision.
Let us consider a hypothetical situation in which a statistical office has access to all call detail records (CDRs) in a country for a given time period of analysis to estimate present population counts. These network events are generated by an active usage of a mobile device. Discard children, very elderly people, imprisoned people, severely deprived homeless people, and any rather evident non-subscriber of these mobile telecommunication services. Can all CDRs be considered a sample with respect to our (remaining) target population? There are no more CDR data, yet we cannot be sure that all target individuals are included in the dataset. Indeed, there is no enumeration of the target population, and the “error[s] [. . . ] which cannot be attributed to sampling fluctuations” (ESS, 2014) cannot be clearly identified. The line distinguishing coverage and sampling errors becomes thinner (as a matter of fact, the concept of frame population loses its meaning in this new setting).

Reliability and the corresponding plan of revisions can still be approached as with traditional data sources, only potentially affected by the higher degree of breakdown and availability of the data. When dissemination cells are very small and publicly released more frequently, the variability of estimates is expected to be much higher. Thus, an assessment is needed to discern between random fluctuations due to small-sized samples and fluctuations due to real effects (e.g. population counts attending music festivals or sport events). The plan of revisions should be accommodated to the chosen degree of breakdown in the dissemination stage.

• Timeliness arises as one of the most clearly improved quality dimensions when incorporating new data sources. Indeed, with digital sources even (quasi) real-time estimates may be an important novelty.
However, this is intimately connected to the design and implementation of the new statistical production process and to the relationship with data holders. Real-time estimates entail real-time access and processing, which is usually highly disruptive and requires a higher investment in the data retrieval and data preprocessing stages, presumably on data holders' premises. Therefore, guarantees (both legal and technical) for access sustained in the long term must be provided. Once timeliness can be improved, new output release calendars can be considered in the legal regulations for each statistical domain, thus binding statistical offices to disseminate final products with the same punctuality standards.

• The role of coherence and comparability is to be reinforced with new data sources. The reconciliation among other sources, other statistical domains, and other time-frequency statistics is now more critical. Not only will the data deluge allow statistical offices to reuse the same source to produce different statistics for different statistical domains (e.g. financial transaction data for retail trade statistics, for tourism statistics, . . . ), but different sources will also possibly lead to estimates for the same phenomenon (e.g. unmanned aerial images, satellite images, administrative data, and survey data for agriculture). This is naturally connected to comparability as well, since statistical products must still be comparable between geographical areas and over time. The criticality is intensified because the wealth of statistical methods and algorithms potentially applicable to the same data can lead to multiple different results whose comparison is not immediate. This demands a closer collaboration in statistical methodology within the international community.

• Accessibility and clarity in relation to users is essential (e.g. as regards the point expressed above about the non-mathematical notion of representativeness being strongly nailed down in the world of Official Statistics).
The challenge raised by the wealth of statistical methods and machine-learning algorithms to solve a given estimation problem now stands as an extraordinary exercise in communication strategy and policy. Furthermore, this communication strategy and policy should not only embrace but also get deeply entangled with the access to and use of the new data sources. The promotion of statistical literacy will need to be strengthened.
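The CDR example discussed under the accuracy dimension above can be caricatured with a toy correction from observed devices to present population. The penetration and activity rates below are purely hypothetical auxiliary parameters, not an endorsed estimation method:

```python
def present_population(cdr_counts, penetration, device_activity_rate):
    """Toy correction from counts of distinct mobile devices observed in
    CDRs per area to present-population estimates, assuming a known market
    penetration rate and a known probability that a present individual's
    device generates at least one network event in the period."""
    return {
        area: devices / (penetration * device_activity_rate)
        for area, devices in cdr_counts.items()
    }

# Hypothetical device counts for three areas.
cdr_counts = {"A": 5400, "B": 1200, "C": 300}
estimates = present_population(cdr_counts,
                               penetration=0.9,
                               device_activity_rate=0.75)
print({a: round(v) for a, v in estimates.items()})
```

Even this caricature makes the coverage discussion tangible: the whole inference hinges on auxiliary parameters (penetration, activity) that no sampling design controls, so the usual survey error classification does not apply directly.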
Changes in the process will certainly be needed according to the new methodological ingredients mentioned in section 4. As a matter of fact, the implementation of new data sources in the production of some official statistics is already bringing the need for new business functions such as trust management, communication management, visual analyses, . . . (Bogdanovits et al., 2019; Kuonen and Loison, 2020). However, in our view the farthest-reaching element will come from the need to include data holders as active actors in the early (and not so early) stages of the production process. This will especially affect those deeply technology-dependent data sources with a clear data preprocessing need for statistical purposes. In other words, data holders have changed their role from mere input data providers, either through electronic or paper questionnaires, to data wranglers for further statistical processing.

Official statistics being a public good, it seems natural to request that this participation of data holders be reflected in quality assurance frameworks so as to assess its impact on final products. In our view, this entails far-reaching consequences and strongly imposes conditions on the partnerships between statistical offices and data holders. These conditions are two-fold, since restrictions for both the public and the private sector must be observed. For example, the statistical methodology driving us from the raw data to the final product must be openly disseminated, communicated, and available to all stakeholders as an integral element of the statistical production metadata system. Furthermore, to guarantee coherence and comparability it seems logical to share this statistical methodology among different data holders, i.e. in any preprocessing stage. However, guarantees must also be provided to avoid sensitive information leakage among different agents in the private sector, especially in highly competitive markets.
statisticaloffices cannot become malicious vectors of industrial secrecies and know-hows endangering an increas-ing economic sector based on data generation and data analytics. Another example comes from datasources providing geolocated information (e.g. to estimate population counts of diverse nature). Cur-rent data technologies allow us to reach unprecedented degrees of breakdown (e.g. providing data everysecond minute at postal code geographical level). Freely disseminating population counts at this levelof breakdown in a statistical office website would certainly ruin any business initiative to commercialiseand/or to foster private agreements to produce statistical products. Partnerships must include formulasof collaboration where both private and public interests not only can, in our opinion, coexist, but evenalso positively feedback each other.In this line of thought, synthetic data can play a strategic role, even beyond traditional qualitydimensions and traditional metadata reporting. In our view, an important aspect of the public-privatepartnerships with data holders is a deep knowledge of metadata of the new data sources. This wouldenable statistical offices to generate synthetic data with similar properties to real data. This synthetic data16an play a two-fold role. On the one hand, for all data sources, providing synthetic data together withprocess metadata will enable users and stakeholders to get acquainted with the underlying statisticalmethodology thus increasing the overall quality in the process. For example, a frame population ofsynthetic business units can be synthetically created so that the whole process from the sample selectionto the final dissemination phase and monitoring can be reproduced. On the other hand, for new datasources with those challenges in access and use reported above, methodological and quality developmentsas well as software tools can be investigated without incurring on those obstacles with real data. 
Noticethat the utility of this synthetic data will sensitively depend on their similarity with real data, thusdemanding a good knowledge of their metadata, i.e. calling for a close collaboration with data holders.
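As a rough illustration of the kind of synthetic frame generation discussed here, the sketch below draws a toy synthetic business register from a lognormal size distribution, a shape often observed for firm sizes. The distribution parameters and variable names are invented for illustration, not taken from any real register or from this paper.

```python
import random

def synthetic_business_frame(n_units, mean_log_employees=1.5, sd_log_employees=1.0, seed=42):
    """Generate a toy synthetic frame of business units.

    Employee counts are drawn from a lognormal distribution; the
    parameters are illustrative, not estimated from real metadata.
    """
    rng = random.Random(seed)
    frame = []
    for unit_id in range(1, n_units + 1):
        employees = max(1, round(rng.lognormvariate(mean_log_employees, sd_log_employees)))
        frame.append({"id": unit_id, "employees": employees})
    return frame

frame = synthetic_business_frame(1000)
```

In practice the generating distributions would be fitted to metadata agreed with the data holder, so that the synthetic frame reproduces the structure of the real data without disclosing it.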
Relevance is a quality attribute measuring the degree to which statistical information meets the needs of users and stakeholders. Thus, it is intimately related to outputs being fit for purpose. Moreover, relevance is one of the key issues in the Bucharest Memorandum (DGINS, 2018), which clearly points out the risk for public statistical systems of not incorporating new data sources into the production process (among other things).

In more mathematical terms, let us view relevance in terms of the nature of statistical outputs and aggregates. To date, most (if not all) statistical outputs are estimates of population totals $\sum_{k \in U} y_k$ or functions of population totals $f(\sum_{k \in U} y_k, \sum_{k \in U} z_k, \dots)$. They may be the total number of unemployed resident citizens, the number of domestic tourists, the number of employees in an economic sector, etc., but also volume and price indices, rates, and so on. This sort of output is basically built using estimates of quantities such as $\sum_{k \in U_d} y_k$, where $U_d$ denotes a population domain and $y_k$ stands for the fixed value of a target variable. In our view, the wealth of data now provides the opportunity to investigate a wider class of indicators. Network science (Barabási, 2008) provides a generic framework to investigate new kinds of target information, in particular that derived from the interaction between population units. Graph theory stands out as a versatile tool to pursue these ideas. If nodes represent the target population units, edges express the relationships among these population units. An illustrative example can be found in mobile network data, where edges between mobile devices can represent the communication between people and/or with telecommunication services. If the geolocation of these data is also taken into account and they are combined with other data sources (e.g. financial transaction data, also potentially geolocated), many new possibilities arise to investigate e.g. segregation, inequalities in income, access to information and other services, etc. New statistical needs naturally arise. Should statistical offices act reactively, waiting for users to express these new needs, or should they act proactively, searching for new forms of information, new indicators, and new aggregates? In our view, innovation activities and collaboration with research centres and universities should be strengthened to promote proactive initiatives.

The very fast evolution of information technologies has changed our lives. Nowadays, almost every human activity leaves a digital footprint: from searching for information on the Internet using a search engine to using a mobile phone for a simple call or paying for a product with a credit card, the traces of these activities are stored somewhere in a digital database. Accordingly, these enormous quantities of data draw the attention of statisticians, who started to consider their potential for computing new indicators. The distinct characteristics of these new data sources emphasized in the previous sections also change the IT tools needed to tackle them. While using classical survey data to produce statistical outputs does not raise special computational problems, collecting and processing the new types of data (most of the time very large in volume) require an entirely new computing environment as well as new skills for the people who work with them. In this section we briefly review the computing technologies used in official statistics for dealing with survey data and we describe the new technologies needed to handle new big data sources. We emphasize that computing technologies are evolving at an unprecedented speed, and what now seems to be the best solution could be totally outdated in a few years.
We will also provide some examples of concrete computing environments used for experimental studies in the official statistics area.

The computing technology needed for a specific type of data source is intrinsically related to the nature of that source. Survey data are structured data of reasonable size, properties that make them easy to store in traditional relational databases. The IT tools used for surveys can be classified according to the specific stage in the production pipeline, and for this purpose we will consider the GSBPM as the general framework describing the official statistics production process.

Different phases of the statistical production process, such as drawing the samples, data editing and imputation, calculation of aggregates, calibration of the sampling weights, seasonal adjustment of time series, and statistical matching or record linkage, use specialized software routines, most of the time developed in-house by some statistical agencies and then shared with the rest of the statistical community, implemented either in commercial products like SAS, SPSS or Stata or in open-source software like R or Python.

While in the past most official statistics bureaus were strongly dependent on commercial software packages like SAS or Stata, nowadays we are witnessing a major change in this field. The benefits of open-source software have been reconsidered by official statistics organizations, and more and more software packages are now ported to the R or Python ecosystems (van der Loo, 2017).

The data collection stage in the production pipeline requires specialized software. Even if paper questionnaires are still in use in several countries around the world, the main trend today is to collect survey data using electronic questionnaires (Bethlehem, 2009b; Salemink et al., 2019) by either the CAPI or the CAWI method. In both cases, specific software tools are required to design the questionnaires and to effectively collect the data. We mention here some examples of software tools in this category:

• BLAISE (CBS, 2019) is a computer-aided interviewing (CAI) system developed by CBS which is currently used worldwide in several fields, from household to business, economic or labour force surveys. According to the official web page of the software ( ), more than 130 countries use this system. It allows statisticians to create multilingual questionnaires that can be deployed on a variety of devices (both desktops and mobile devices), it is supported by all major browsers and operating systems (Windows, Android, iOS), and it has a large community of users. Moreover, BLAISE is not only a questionnaire designer and data collection tool but can also be used in all stages of data processing.

• CSPro (Census and Survey Processing System) is a freely available software framework for designing applications for both data collection and data processing. It is developed by the U.S. Census Bureau and ICF International. The software runs only on Windows systems and is used to design data collection applications that can be deployed on devices running Android or Windows. It is used by official statistics institutes, international organizations, academic institutions and even private companies in more than 160 countries ( https://census.gov/data/software/cspro.html ).

• Survey Solutions (The World Bank, 2018b) is a free CAPI, CAWI and CATI software package developed by the World Bank for conducting surveys. The software has capabilities for designing questionnaires, deploying them on mobile devices or on Web servers, collecting the data and performing different survey management tasks, and it is used in more than 140 countries (The World Bank, 2018a).

There are also other software tools for data collection, but they are used on a smaller scale, being built by statistical offices for their specific needs.

All these tools used in official statistics for data collection are built around a well-known technology: the client-server model. Even though this model dates from the 1960s and 1970s, when the foundations of the ARPANET were laid (Shapiro, 1969; Rulifson, 1969), it became very popular with the appearance and development of the Web, which transformed the client into the ubiquitous Web browser, making the entire system easier to deploy and maintain. Nowadays there is a plethora of computing technologies supporting this model: the Java and .NET platforms, PHP together with a relational database, etc. Figure 1 describes the architecture of a typical client-server application where the client is a Web browser.
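The server-side flow of such an application — edit checks on an incoming questionnaire record followed by storage in a relational database — can be sketched in a few lines. The validation rules, table layout and field names below are purely illustrative; they do not come from any of the tools mentioned above.

```python
import sqlite3

def validate(record):
    """Run preliminary edit checks of the kind performed before storage.

    The rules here are invented for illustration.
    """
    errors = []
    if not record.get("household_id"):
        errors.append("missing household_id")
    if not 0 <= record.get("household_size", -1) <= 30:
        errors.append("implausible household_size")
    return errors

def store(conn, record):
    """Persist a validated questionnaire record in a relational table."""
    conn.execute(
        "INSERT INTO responses (household_id, household_size) VALUES (?, ?)",
        (record["household_id"], record["household_size"]),
    )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE responses (household_id TEXT, household_size INTEGER)")
record = {"household_id": "H001", "household_size": 3}
if not validate(record):
    store(conn, record)
```

In a production system the validation would be split between the client (preliminary checks in the browser) and the application server (advanced checks), exactly as described for the architecture above.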
Figure 1: The client-server model of computing

The client, usually a browser running on a mobile device or a desktop, loads a questionnaire used to collect the data from households or business units. These data are subject to preliminary validation operations and then sent to the server side, where a Web server manages the communications via the HTTP/HTTPS protocol and an application server implements the logic of the information system. Usually, some advanced data validation procedures are performed before the data are sent to a relational database. From this database, the datasets are retrieved by the production units that start the processing stage.

The last stage of the production pipeline, i.e. the dissemination of the final aggregates, also requires specialized technologies. Statistical disclosure control (SDC) methods are special techniques aimed at preserving the confidentiality of the disseminated data, guaranteeing that no statistical unit can be identified. These methods are implemented in software packages, most of them in the open-source domain. We can mention here the sdcMicro (Templ et al., 2015) and sdcTable (Meindl, 2019) R packages or the tauArgus (de Wolf et al., 2014) and muArgus (Hundepool et al., 2014) Java programs.

Even for disseminating results on paper, software tools are still needed: from the classical office packages, which are easy to use by statisticians, to more complex tools like LaTeX, which requires specific skills, all paper documents are produced using IT tools. In the digital era the dissemination of statistical results has switched to Web pages, where technologies based on JavaScript libraries like D3 (Bostock et al., 2011) or R packages like ggplot2 (Wickham, 2016) are widespread.

In general, administrative sources are treated with the same software technologies as survey data, with the exception of the data collection step, which is not needed in this case.

The new data sources also bring new information technologies onto the stage of official statistics. Accidentally or not, with the beginning of the use of new data sources, a new trend has manifested itself in official statistics: the open-source software revolution has also been embraced by the world of official statistics. Two software environments have emerged as suitable for official statistics tasks: R and Python. While Python is considered to be more computationally efficient, R is better suited for statistical purposes: there are R packages for almost every statistical operation, from sampling to data visualisation. In the European Statistical System (ESS), it seems that R has gained ground against Python. Most of the national statistical organizations (statistical offices) within the ESS are making a transition from old software packages, mostly based on commercial solutions, to the R environment (Templ and Todorov, 2016; Kowarik and van der Loo, 2018).

We mention here only a few of the R packages used in statistical offices for different tasks. For drawing survey samples there are packages like sampling (Tillé and Matei, 2016) that allow one not only to use different sampling algorithms but also to calibrate the design weights.
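The simplest case these packages automate — a simple random sample without replacement, with the design weight $N/n$ attached to every sampled unit — can be sketched directly; the frame and sample size below are invented.

```python
import random

def srs_with_weights(frame_ids, n, seed=1):
    """Draw a simple random sample without replacement and attach
    the design weight N/n to every sampled unit."""
    rng = random.Random(seed)
    N = len(frame_ids)
    sampled = rng.sample(frame_ids, n)
    weight = N / n
    return [(unit, weight) for unit in sampled]

# a frame of 10000 units, a sample of 500: each unit "represents" 20 units
sample = srs_with_weights(list(range(10000)), 500)
```

For complex designs (stratification, clustering, unequal probabilities) the weights differ by unit and are subsequently calibrated, which is precisely what the packages discussed in this section provide.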
The ReGenesees package (Zadetto, 2013), developed by ISTAT, builds on the survey package (Lumley, 2004), which provides functions to compute totals, means, ratios and quantiles for the survey sample, and adds calibration and sampling variance estimation functions. Other R packages used to draw samples with a specific design are SamplingStrata (Barcaroli, 2014), FS4 (Cianchetta, 2013) and MAUSS-R (Buglielli et al., 2013). Visualising and editing the data sets can be performed with the editrules (de Jonge and van der Loo, 2018) or VIM (Kowarik and Templ, 2016) packages, while for selective editing there are packages like SeleMix (Guarnera, 2013). Imputation can be performed with the VIM (Kowarik and Templ, 2016), mice (van Buuren and Groothuis-Oudshoorn, 2011) or mi (Su et al., 2011) packages. For time series analysis and seasonal adjustment there are the x12 (Kowarik et al., 2014) and seasonal (Sax and Eddelbuettel, 2018) packages, besides the well-known JDemetra+ Java software (Grudkowska, 2017). Statistical matching and record linkage is another domain where we can find good-quality R packages: StatMatch (D'Orazio, 2019), MatchIt (Ho et al., 2011), RecordLinkage (Borg and Sariyar, 2019) and RELAIS (Scannapieco et al., 201r). Our enumeration is not intended to be exhaustive but to give the reader an idea of the capabilities of the R environment for statistical data processing. A comprehensive list of R packages used in official statistics is published at https://github.com/SNStatComp/awesome-official-statistics-software .

The new types of data sources require different technologies for the data collection step. If the data sets are to be stored inside NSIs' premises, either they are transferred from the data owners using specialized transmission lines or they are collected using specific technologies. For example, one of the most promising data sources is the Internet or, to be more specific, Web sites. Several technologies have been developed by statistical offices to collect different kinds of data (for example, prices from online retailers, enterprise characteristics from companies' Web sites, information about job vacancies from specialized portals, etc.), collectively gathered under the term web scraping techniques.

In figure 2 we depict the general organization of such a data collection approach from an IT point of view. There are several solutions used by different statistical offices to implement the main component of this system, called the scraper in the figure. Some of them are based on R packages, some on Python libraries; others are specific software solutions developed in-house or based on open-source projects. For example, rvest (Wickham, 2019) is an R package that can be used to scrape data from static HTML pages. Given a URL, it can retrieve the entire page or, if the user provides a selector on that page, only the text associated with that selector. The data obtained in this case are text, usually stored in a NoSQL database or processed according to some specific needs and transformed into structured data stored in a relational database. Similar packages such as scrapeR (Acton, 2010) or Rcrawler (Khalil, 2018) can be successfully used for static Web pages.

Most sites today are actually dynamic, and this feature raises some problems when it comes to scraping such pages. A solution often used to scrape dynamic sites with R is based on the RSelenium (Harrison, 2019) package, an R client for Selenium Remote WebDriver. It allows the user to scrape content that is dynamically generated by driving a browser natively, emulating the actions of a real user, and it can be used to automate tasks for several browsers: Firefox, Chrome, Edge, Safari or Internet Explorer. A similar client is also available for Python.

Another versatile solution for Web scraping is Python's Scrapy (Kouzis-Loukas, 2016), an application framework that allows users to write Web crawlers that extract structured data from Web sites. Examples of real-world applications of this framework in the field of official statistics are a set of projects developed by the ONS (Breton et al., 2015; Naylor et al., 2014) to collect price data from the Internet to compile price indices.

Besides these tools, we can also mention in-house software solutions for Web scraping, such as the Robot framework developed and used by CBS (CBS, 2018) or a solution based on the Apache Nutch technology (The Apache Software Foundation, 2019) used by ISTAT for an internal project on the collection of enterprise characteristics from Web sites.
Figure 2: Data collection through web scraping
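To give a rough idea of what packages like rvest do for a static page, the sketch below extracts the text of every element carrying a given CSS class, using only the Python standard library. The HTML snippet and the class name "price" are made up; in real use the page would be fetched over HTTP first.

```python
from html.parser import HTMLParser

class PriceExtractor(HTMLParser):
    """Collect the text content of every element with class="price"."""

    def __init__(self):
        super().__init__()
        self._depth = 0   # >0 while inside a price element
        self.prices = []

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "").split()
        if "price" in classes:
            self._depth += 1
        elif self._depth:
            self._depth += 1  # nested element inside a price node

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth and data.strip():
            self.prices.append(data.strip())

page = '<html><body><span class="price">19.99</span><span class="price">4.50</span></body></html>'
parser = PriceExtractor()
parser.feed(page)
print(parser.prices)  # ['19.99', '4.50']
```

The extracted strings would then be cleaned and loaded into a NoSQL or relational store, as described above; this toy parser ignores complications (void tags, malformed markup) that production scrapers must handle.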
The processing step for new data sources must take into account their specificity, especially their very large volume. This requires either parallel programming paradigms inside an ecosystem like R or Python, or dedicated IT architectures.

The simplest solution for processing large data sets is to use the parallel programming features incorporated in software systems like R or Python. They make use of the multicore and many-core architectures of current computing systems. Two paradigms have emerged in this area: shared-memory and distributed-memory architectures. These two models are depicted in figure 3.

In the first approach, a set of CPUs is interconnected with a single shared memory to which all of them have access. All modern processors are multicore and are based on an architecture very similar to the one presented in the upper part of figure 3. However, there is an important limitation of this type of architecture: all CPUs compete for access to the same memory. This severely limits the performance of a computing system, even though there are solutions that alleviate this problem to some extent.

In the second approach, several CPUs that each have their own memory are interconnected, thus forming a distributed-memory computer. This solution scales up to thousands of CPUs or even more. Tasks can be run in parallel by different CPUs holding the necessary data in their own memory, thus avoiding the memory contention problem. At certain steps of the processing algorithms it may be necessary for the CPUs to exchange data among themselves via the interconnection network or to synchronize.

Figure 3: Shared memory versus distributed memory parallelism

Both approaches are used for statistical data processing. In the following we will use R examples, but similar technologies are available for Python too. Parallel computing in the shared-memory architecture can be implemented in R via compiled extensions that rely on specific compiler support: OpenMP (OpenMP Architecture Review Board, 2018) or Intel TBB (Reinders, 2007). OpenMP, introduced in 1998 by Dagum and Menon (Dagum and Menon, 1998), is an industry standard, currently at version 5.0, and is supported by most open-source and commercial compilers. OpenMP is available in R itself if R is built with this option from the beginning, but this depends on the specific CPU and C/C++ compiler. It can also be used in R by adding C++ processing functions through the Rcpp package (Eddelbuettel and François, 2011; Eddelbuettel, 2013; Eddelbuettel and Balamuta, 2017). Intel TBB is a technology similar to OpenMP but is available only via C++. The RcppParallel package (Allaire et al., 2019) is a wrapper around the Intel TBB library, making it easily accessible to R programmers. Both technologies allow users to build processing functions that make use of all the available cores of the processor on their desktop, speeding up the computations when large data sets or computationally intensive algorithms are involved.

These technologies are somewhat compiler-dependent and not available to every user. To overcome this difficulty, base R now incorporates the parallel package, which makes the low-level operations supporting shared-memory parallelism transparent to users.
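Although the text uses R examples, the closest standard-library analogue in Python is the multiprocessing module, whose Pool.map parallelizes a function over a list of inputs much as a parallel apply does in R. A minimal sketch, with an invented stand-in computation:

```python
import multiprocessing as mp

def simulate(seed):
    """Stand-in for one computationally heavy replicate (e.g. a bootstrap draw)."""
    return (seed * 31 + 7) % 1000

if __name__ == "__main__":
    # On Unix the workers are forked, mirroring mclapply; on Windows fresh
    # processes are spawned instead, closer to parLapply-style clusters.
    with mp.Pool(processes=4) as pool:
        results = pool.map(simulate, range(8))  # parallel counterpart of lapply
    assert results == [simulate(s) for s in range(8)]
```

As with the R functions discussed next, the gains depend on the work per item dominating the cost of shipping data to and from the worker processes.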
For example, the mclapply function is the parallel version of the serial lapply: it applies a function to a series of elements, running them in parallel in separate processes, with the advantage that all the variables from the main R session are inherited by all child processes. However, the truly parallel execution of the function on different data items is implemented only on systems that support the fork system call, i.e. Unix-based systems. Windows does not support forking, so mclapply and similar functions run there in sequential mode. Nevertheless, parallelization is still possible in this case too, using cluster processing, a model in which a set of R processes runs in parallel independently. Functions like parLapply or parSapply use this model of execution to run processing functions in parallel, but they place on the user the task of sharing the variables among worker processes. Besides the parallel package, there are other R packages that implement this kind of parallelism: doMC (Analytics and Weston, 2019), doParallel (Corporation and Weston, 2018), foreach (Microsoft and Weston, 2017) and snow (Tierney et al., 2018).

Distributed-memory parallelism uses a model called message passing, described in the Message Passing Interface (MPI) standard (Forum, 1994). Widely used implementations of this standard include OpenMPI (Gabriel et al., 2004) and MPICH (Gropp, 2002). MPI involves a set of independent processes, each running on its own processor and directly accessing the data in that processor's memory. Communication between processes is achieved by sending and receiving messages. These communication operations are the main bottleneck of this model, processing speed usually being much higher than the speed of sending or receiving data through the communication network: the fewer the communication operations, the higher the speedup obtained. This model has a main advantage over the shared-memory model: it scales very well.
Thousands of processors can be added to such a computer, yielding truly impressive computing power. Developing programs that use this paradigm usually involves writing them in C or Fortran, linking them against an MPI library and running them in a specially configured environment. This is not an easy task for a statistician, but R packages like Rmpi (Yu, 2002), snow (Tierney et al., 2018) or doMPI (Weston, 2017) present a high-level interface to the user, hiding the complexity of message-passing parallel programming.

As mentioned before, if we want to integrate the new types of data sources into statistical production, the classical inferential paradigm has to change, and the new methods involve algorithms from the machine learning and artificial intelligence areas. A survey of the machine learning techniques currently used across different statistical offices can be found in (Beck et al., 2018). R packages like rpart (Therneau and Atkinson, 2019), caret (Kuhn, 2020), randomForest (Liaw and Wiener, 2002), nnet (Venables and Ripley, 2002) and e1071 (Meyer et al., 2019), or Python libraries like SciPy (Jones et al., 2001), scikit-learn (Pedregosa et al., 2011), Theano (Theano Development Team, 2016), Keras (Chollet et al., 2015) and PyTorch (Paszke et al., 2019), are among the tools best suited for statistical production. Large frameworks like TensorFlow (Abadi et al., 2015) or Apache Spark (Zaharia et al., 2016) can also be used, but they require specific skills from the computer science area and have a steep learning curve; connectors for R and Python are, however, available that make these frameworks easier to use by statisticians.

Processing methods that make use of machine learning algorithms are frequently computing-intensive. One solution to obtain reasonable running times even for large data sets is to use the parallel programming techniques and software packages already mentioned, which exploit the multicore or many-core features of commodity systems. Together with them, another parallel computing paradigm, called general-purpose computing on graphics processing units (GPGPU), first experimented with around 2000-2001 (Larsen and McAllister, 2001), can be a viable solution. Today's GPUs have much higher FLOP rates than CPUs, which comes from the internal structure of a modern GPU: it has thousands of computing units that can operate in parallel on different data items, thus achieving high throughput. A detailed discussion of this computing model is beyond the scope of the current paper; the interested reader can consult, for example, the work by Luebke et al. (2006). CUDA (Nickolls et al., 2008) and OpenCL (Stone et al., 2010) are frameworks that allow users to build applications taking advantage of the immense computing power of graphics processing units (GPUs). Usually they require applications written in C/C++ or Fortran, and one may say this is a task for a computer scientist, not a statistician, but recently several R and Python libraries have been developed to make the GPU accessible from these working environments familiar to statisticians.
We can mention gmatrix (Morris,23015), gpuR (Determen Jr., 2019; Rupp et al., 2016), gputools (Buckner et al., 2009) or cudaBayesreg (Ferreira da Silva, 2011) R packages and
PyCUDA PyOpenCL (Kl¨ockner et al., 2012),or gnumpy (Tieleman, 2010) Python libraries that can be used to speedup the computations involved by differentprocessing procedures.Dedicated systems are the other alternative when very large volumes of data need to be processed.One of the first dedicated computing systems tailored to make experiments with large data sets in officialstatistics was the UNECE Sandbox (Vale, Vale) which was a shared computing environment consistingin a cluster of 28 machines running a Linux operating system, connected through a dedicated high-speednetwork and accessible via a Web interface and SSH. This computing environment was created withsupport from the Central Statistics Office of Ireland and the Irish Centre for High-End Computing. Sev-eral large datasets where uploaded to this system from different areas: scanner data to compute priceindices, mobile phone data for tourism statistics, smart meter data for computing statistics on electricityconsumption, traffic loops data for transportation statistics, online job vacancies data and data collectedfrom social media. The software tools deployed in this environment were entirely new to the world of offi-cial statistics: Hadoop (White, 2012) for storing the data sets and performing some processings, ApacheSpark for data analytics and Pentaho (Meadows et al., 2013) for visual analytics. Together with them,the R software environment was also installed in the cluster.Hadoop is a free software framework with the aim of storing and processing very large volumes ofdata using clusters of commodity hardware. Hadoop was developed in Java and thus, Java is the mainprogramming language for this framework, but it can also be interfaced with other languages too, likeR or Python. Although it is a freely available software, there are some commercial distributions thatoffer an easy way to install and configure the software as well as technical support. 
The most widespreaddistributions are HortonWorks (that was used for the UNECE Sandbox) and Cloudera.Hadoop framework includes a high performance distributed filesystem (HDFS - Hadoop DistributedFile System), a job scheduling and cluster resource management component - YARN, and MapReducewhich is a system for parallel processing of very large data sets. MapReduce implements a distributedmodel of computation that was first developed and used by Google (Dean and Ghemawat, 2004).Briefly speaking, Hadoop provides a reliable distributed storage by means of HDFS and an analysisframework implemented using the MapReduce engine. It is a highly scalable solution being able to runon a single computer as well as on clusters of thousands of computers. Large files are splited into blocksstored on different Data Nodes, while a Name Node is responsible with operations like opening, closing orrenaming these files. MapReduce is a model of processing very large data sets on clusters of computers,first splliting the inputs in several chunks processed in parallel by the map tasks. The results of the map tasks are then forwarded to the reduce tasks that perform an aggregation operation on them. All thecomplexity of the parallel execution of these tasks are hidden from the user that sees only a simple modelof computation.Hadoop framework was very successful for handling large data sets because of its high degree of scal-ability, flexibility and fault tolerance. It can be installed on commodity hardware or on supercomputerstoo, allowing massive parallel processing. It is able to store any kind of data, structured or not, and it istolerant to hardware failures being able to send the tasks of a failed node to other live nodes. The filesare stored in HDFS using a replication schema to ensure fault tolerance. 
Starting from the idea that is iteasier to move the computations than the data, when a computing node fails, the computations are sendto another node that stores a replica of the data in the failing node.For statistical purposes, only Hadoop itself is rather difficult to be used, but when it is interfacedwith usual statistical software like R, it becomes a powerful tool in the hand of statisticians. A typicalarchitecture with Hadoop, Spark and other statistical tools is depicted in figure 4. Accessing the powerof parallel processing of Hadoop from R is achieved through an interface layer made up from specialized24 packages like
Rhipe (Rounds, 2012) or the collection gathered under the name of
RHadoop (Adler,2012).Apache Spark is also an open source distributed computing framework, and it is newer then Hadoop.It provides a faster data analytics engine than the Hadoop MapReduce because it processes all the datain-memory. While Hadoop is better suited for batch processing, Spark also supports stream processing.It can be installed on a HDFS (like in figure 4) or as a standalone software. The
SparkR (Venkataraman et al., 2019) package provides a lightweight interface to use Spark from the R environment, making it easily accessible to statisticians. Spark has libraries that implement machine learning algorithms, graph analytics algorithms, stream processing, and SQL querying. Spark and its very fast machine learning implementations have proved to be a very useful tool, especially for new data sources that require a model-based approach.
Figure 4: Hadoop infrastructure (hardware layer; middleware layer with HDFS and MapReduce; interface layer; statistical processing layer)
Statistical offices from the ESS started to implement their own in-house infrastructures to support the processing needs of new data sources. We can mention here the ISTAT Big Data IT Infrastructure, which consists of an 8-node Hadoop cluster with Apache Spark as an analytics engine and Apache Impala for querying large amounts of data (Scannapieco and Fazio, 2019), or the CBS Big Data Centre, to name only two of them. But soon after the initial enthusiasm for using new data sources, the barrier of data access and the high costs stopped further in-house development of IT infrastructures. Most of the new data sets are privately held, and data owners are reluctant to give statistical offices access to their data. Moreover, the costs of such infrastructures are high and a single organization cannot support them in the long term. In recent years we have witnessed a paradigm shift: instead of developing huge IT infrastructures in-house, using the cloud services available today at a lower cost seems to be a better solution. One of the first steps in this direction was made by the European Commission with the Big Data Test Infrastructure (https://ec.europa.eu/cefdigital/wiki/display/CEFDIGITAL/Big+Data+Test+Infrastructure), which was used in statistics for experimentation purposes during the NTTS 2019 conference and after that for the ESSnet Big Data project (https://webgate.ec.europa.eu/fpfis/mwikis/essnetbigdata/index.php/ESSnet_Big_Data). This infrastructure was built on the Amazon Web Services cloud environment with a special configuration for statistical projects. It provided an Elastic MapReduce (EMR) platform for big data processing built around the Hadoop ecosystem.
Among the tools made available to users on this platform we mention Apache Spark and Apache Flink for distributed data processing, Apache Hive and Apache Pig for querying the data, TensorFlow and Apache Mahout for machine learning applications, Apache Hue as a visual user interface, Jupyter Notebooks, R, RStudio, RShiny, and an instance of the MySQL relational database. Another innovation that helps official statistics to overcome the data access barrier is to push the computations out, at least partially (Ricciato, 2018a). Thus, instead of pulling in the data sets from private companies and processing them on in-house computing systems, the data are not moved from the data holders' premises: they stay there and are only used by official statisticians, who run commonly agreed algorithms on the private companies' computing systems and get back only some form of aggregated results. This avoids sharing the privately held microdata, which most of the time are an invaluable asset for companies and whose transfer raises complicated legal problems. However, there are concerns that official statisticians are not in control of the processing stage, so the results may be biased or their quality may not be as expected. To overcome this problem, a certification authority trusted by all parties could be involved, so that the processing algorithms would be transparent and trusted by all parties. This is one of the ideas on which the Reference Architecture for Trusted Smart Statistics proposed by Ricciato (2018a) is built. In figure 5 we show this idea: several data owners and the statistical office agree upon the algorithms for data processing, and the Certification Authority guarantees that only the agreed source code is run on the data sets.
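The certification idea can be illustrated with a minimal sketch (the function names and the plain digest comparison are ours; a real deployment would rely on code signing and trusted hardware rather than this simplified check): the authority publishes the digest of the approved source code, and the data owner refuses to execute anything that does not match it.

```python
import hashlib

# Hypothetical sketch: the certification authority publishes the digest of the
# agreed processing code; a data owner refuses to run any code whose digest differs.

AGREED_CODE = "def aggregate(records): return sum(records) / len(records)"

def certify(source_code):
    # Certification authority: publish the SHA-256 digest of the approved source.
    return hashlib.sha256(source_code.encode()).hexdigest()

def run_if_certified(source_code, published_digest, records):
    # Data owner: execute only if the received code matches the certified digest.
    if hashlib.sha256(source_code.encode()).hexdigest() != published_digest:
        raise ValueError("code was modified: refusing to run")
    namespace = {}
    exec(source_code, namespace)   # in reality: authenticated binary inside a TEE
    return namespace["aggregate"](records)

digest = certify(AGREED_CODE)
result = run_if_certified(AGREED_CODE, digest, [2, 4, 6])
print(result)   # 4.0
```

Any modification to the source, however small, changes the digest, so tampered code is rejected before it ever touches the data.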
Figure 5: Trusted computation

The simplest case is when only one data owner is involved in such a process. In this case, running an authenticated binary code in a secure (trusted) hardware environment could solve the problem of ensuring that the code executed on the data sets is exactly the code that was agreed between the parties (in our case, the data owner and the statistical office). This model can be generalized when more data holders participate: the final result can be obtained either by taking the partial outputs of an agreed function applied to each data set separately and then composing them with another function at the statistical office, or by chaining the partial results using the agreed functions, again implemented in an authenticated binary code and run in a secure hardware environment (Ricciato, 2018c). These two cases are presented in figure 6. In the upper part of the figure, each data owner provides a data set input_i that is processed by an agreed function in a secure hardware environment, and the results output_i are then fed into a function F that computes some aggregated measure. In the bottom part of the figure, the output of the first processing algorithm is sent as an input to the second algorithm, and so on. Again, an authenticated binary code and a secure hardware environment provide all that is necessary to be sure that the code executed on the data sets is the one agreed between the data owners and the statistical office. The technologies needed for such a mechanism are known and widespread. Code signing is a form of binary authentication that can be used in this case, and the Trusted Execution Environment (TEE) standard (Sabt et al., 2015) is a potential candidate, all major hardware producers (Intel, AMD, ARM) providing support for TEE implementations (Futral and Greene, 2013; Mofrad et al., 2018; Li et al., 2019).
In essence, all modern processors provide a mechanism that allows a process to run in such a way that its data are not seen by other processes or even by the operating system.

Figure 6: Trusted computations with multiple data owners

The case when the final statistical aggregates or estimates are supposed to combine data sets from different owners, and these combined data are then sent as input to a function that computes the estimates, requires more elaborate processing techniques borrowed from Privacy-Preserving Computation Techniques (PPCT), a hot research field that combines classical cryptography with distributed computing to provide protection for data owners while at the same time allowing statistical analyses to be performed (Privacy Preserving Techniques Task Team, 2019). Such techniques allow one to perform data analyses on data sets coming from different owners while the data remain opaque to all the parties involved, thus obtaining end-to-end protection of the data. Nevertheless, these techniques have their own implementation costs, regarding both hardware and software investments, that cannot be neglected. One of the PPCT proposed for use by statistical offices in cooperation with data owners is Secure Multi-Party Computation (SMPC) (Ricciato et al., 2019; Privacy Preserving Techniques Task Team, 2019). SMPC is about jointly evaluating a function that all parties agreed upon, using a set of different inputs coming from several parties, while maintaining the confidentiality of the data so that no participant can have access to the raw data provided by the others.
This technique divides the input data into random shares that give back the original data when combined, and these shares are then distributed among all the participants. The shares can then be combined to produce the desired output. Formally, SMPC deals with a set of participants p_1, p_2, ..., p_n, each of them holding a data set input_1, input_2, ..., input_n, who intend to compute a function F(input_1, input_2, ..., input_n) while keeping the inputs secret. An SMPC protocol assures all participants of input privacy (i.e., no information can be inferred by a party about the other parties' data) and of the correctness of the output. While the first attempts to develop such a computation protocol date from 1982, when a secure two-party protocol was introduced by Yao (1982) and then further developed and formulated in 1986 (Yao, 1986), SMPC is still an academic research topic nowadays, and commercial solutions for this protocol are still at an early stage. Nevertheless, as information technology is advancing at a very fast pace, this could become a viable solution for official statistics. In figure 7 we show a schematic example of this technique, where several data owners provide their data to an SMPC environment in which they are processed, and an output is delivered to the statistical office.
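The share-based idea can be illustrated with additive secret sharing for a joint sum (a toy sketch under our own simplifications; real SMPC protocols support general functions and stronger adversary models): each owner splits its input into random shares that sum to the input modulo a large prime, so no single share reveals anything, yet the shares jointly reconstruct the aggregate.

```python
import random

P = 2**61 - 1   # a large prime; all arithmetic is done modulo P

def make_shares(secret, n_parties):
    # Split a secret into n additive shares: n-1 random values plus one
    # correction term, so that the shares sum to the secret modulo P.
    shares = [random.randrange(P) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % P)
    return shares

# Three data owners, each holding a private count.
inputs = [120, 45, 300]
all_shares = [make_shares(x, 3) for x in inputs]

# Party j locally adds the j-th share of every owner; no party sees raw inputs.
partial_sums = [sum(owner_shares[j] for owner_shares in all_shares) % P
                for j in range(3)]

# The NSO combines the partial sums to recover only the aggregate.
total = sum(partial_sums) % P
print(total)   # 465
```

Each individual share is uniformly random, so a party holding one share of every input learns nothing about the individual counts; only the final combination reveals the (intended) total.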
Figure 7: Secure Multi Party Computation

Other privacy-preserving computation techniques proposed for use in official statistics are Homomorphic Encryption, Differential Privacy, and Zero-Knowledge Proofs (Privacy Preserving Techniques Task Team, 2019). However, all these techniques require further experimentation and the development of practical software implementations.
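As an example of one of these techniques, the Laplace mechanism of differential privacy releases a query result with noise of scale sensitivity/ε added to it (a standard textbook construction; the parameter values below are purely illustrative):

```python
import math
import random

def laplace_noise(scale, rng):
    # Sample Laplace(0, scale) by inverse transform from a uniform draw.
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_count(true_count, epsilon, rng):
    # A counting query has sensitivity 1: adding or removing one unit changes
    # the count by at most 1, so Laplace noise with scale 1/epsilon gives
    # epsilon-differential privacy.
    return true_count + laplace_noise(1.0 / epsilon, rng)

rng = random.Random(42)
noisy_release = dp_count(1000, epsilon=0.5, rng=rng)

# The noise has mean zero, so averaged over many hypothetical releases the
# mechanism is unbiased (a single release is what would actually be published).
mean_of_releases = sum(dp_count(1000, 0.5, rng) for _ in range(20000)) / 20000
```

Smaller ε means stronger privacy but noisier releases; the statistical office has to trade off disclosure risk against the accuracy of the published aggregate.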
This section is deliberately opinionated and provocative in order to stimulate thought and debate. Certainly, the analysis above does not pretend to be exhaustive and can be further completed with deeper and more extensive reflections on some of the items mentioned, or on new ones. In any case, success in the adoption of new solutions and changes in statistical production necessarily requires new skills and an extraordinary exercise of management. To begin with, at odds with common belief, we claim that the production of official statistics in a statistical office is an activity closer to Engineering than to Social Science and Statistics. By no means does this signify that Social Science and Statistics are only marginally needed. Experts on National Accounts, on Demography, on sampling, etc. are absolutely necessary, but in the same way that being an expert physicist in electromagnetism and the law of induction does not make you capable of producing and distributing electrical power to every dwelling in a country, knowledge in those disciplines does not guarantee the industrial production of the official statistics comprising a National Statistical Plan. This need for an engineering view of official statistical production to cope with complexity was already made evident with the advent of international production standards at the beginning of the 21st century. We are convinced that with new data sources, especially digital data, this approach is urgently required. Consequently, a new organization of the production processes brings new skills onto the scene. Some traditional skills will need to be superseded, and some others reformulated or adapted to the new production conditions. However, we view this as an integration process, not as a general disruptive substitution of techniques, procedures, and routines.
The use of information technologies and computer science needs to permeate production, and sometimes this may produce a cultural resistance to change (“statisticians do not have to program computer systems because that task belongs to another academic discipline”, say some in private). Should archaeologists avoid incorporating knowledge about carbon dating and DNA analysis into their work because these belong to other disciplines? They may not need to know how to conduct a DNA analysis themselves, operating a DNA sequencer, but certainly their renewed skills allow them to communicate openly with DNA experts and modernise their work accordingly. When all these new skills are mentioned in future prospects of Official Statistics, the focus is instinctively placed on technical or junior staff, possibly thinking of new recruitment and plans of continuous training. This is obviously an element to be considered, but we find more critical the extension of these skills, and a clear understanding of their consequences for production, among the top management of the organization. If they need to take critical decisions, they also need to clearly understand some technical and organizational details about the implications of these decisions. For example, moving away from a stove-pipe production model inefficiently divided into silos towards a standardised production model sharing methods, tools, data architecture, process design, etc. necessarily brings changes into the organization chart and the governance structure. How does it all fit together smoothly in practice? These are difficult questions rooted in technical aspects with consequences throughout the whole organization. Furthermore, in many statistical offices there are scarce resources fully devoted to production in a highly demanding environment with little room to acquire these new skills. In many cases, the computer science, ICT, and programming background is even outdated (for these same reasons).
The training modernization plans, in our view, should also consider this staff as a primary target. For example, the introduction of new distributed computing systems with object-oriented and functional programming languages is clearly necessary, but it is just as necessary to bring senior staff to the point where these training programmes are also accessible and valuable for them. With this new knowledge, they can provide highly valuable insights into the modernization process. In this line, newly recruited staff should be required to fulfill this joint profile with both computer science and statistics skills. Interestingly enough, as in other industries (e.g. finance), a lot of value can be gained from professionals with different backgrounds such as engineers, physicists, chemists, . . . because of their system modelling abilities. In any case, professional training needs to be continuous and to embrace all the staff cross-cuttingly, since technologies are now changing very fast. Management challenges do not end with human resources and new skills. With traditional survey data, the complete production process falls to statistical offices, from survey design through data collection to production and dissemination. With administrative sources, data are already generated independently of the statistical purposes, and specific agreements with other public bodies must be settled to access and use them to produce official statistics. With digital data in private hands, the new scenario portrays a more entangled situation. Data holders in the private sector will necessarily be part of the statistical production process, and this entails an extraordinary exercise of management on data, quality, metadata, trust, technology, . . .
Furthermore, in a datafied society with an increasing economic sector based on data, information, and knowledge, statistical offices need to decide which role to play in an environment with multiple actors, which turn out to be both data holders and stakeholders of a generalised statistical production. Statistical offices will never be the unique producers of statistical outputs with social interest. Which relation to these products are statistical offices to take on? Options do exist. Statistical quality certification to offer quality assurance is a possibility. The enrichment of data and/or methodologies in private production processes can also be considered. In any case, all these options entail new exercises of management and leadership.
Data sources for the production of official statistics can be grouped into survey data, administrative data, and digital data. The advent of both administrative and digital data introduces important changes in the production landscape of statistical offices. The lack of statistical metadata (data are generated prior to any consideration of statistical purposes), the economic value of data, and their ownership by third parties and not by data holders characterise these new data sources. These have implications for data access/use, for statistical methodology, for quality, for the IT environment, and for management. For every aspect, several issues need to be considered. As summary statements we can conclude the following:

• In our view, public-private partnerships stand as the preferred option to incorporate new data sources into the routine production of official statistics. These partnerships must consider aspects from all perspectives. Guarantees for privacy and confidentiality must be pursued at all costs. Official Statistics already has a tradition in this line, since design-based inference needs unit identifiability, and statistical disclosure control techniques are increasingly sophisticated. In our opinion, legislative initiatives to provide legal support, if further needed, must be undertaken from this partnership point of view. New disciplines such as cryptology need to be introduced.

• Sampling designs cannot be used to face the inferential step with the new data sources. Traditional survey methodology, however, should be seen as an inspiration to pursue accurate estimators. The notion of sample representativeness, not being a mathematical concept, is still to be understood as the search for estimators with low mean square errors (or similar figures of merit), as survey methodology actually does. Probability theory is still the best option to deal with inference.
• Machine learning and artificial intelligence seem of limited use for the inferential stage, since we never know the ground truth needed to train learning algorithms. However, this is not the case for multiple tasks along the production process. Indeed, the wealth of traditional survey data and paradata stands as an opportunity to make use of these techniques in the production process, especially regarding the lack of statistical metadata in the new data sources.

• Current quality frameworks are strongly survey-oriented. Although the quality dimensions in Official Statistics appear still to be valid, the subtleties arising from the new nature of data need to be considered both in their definitions and in the indicators derived thereof.

• Special focus should be placed on relevance. New insights can a priori be gained from the wealth of new data (e.g. investigating the interaction between population units). Thus, new statistical outputs must be devised.

• New hardware and software environments are needed to incorporate new data sources into production. Open-source software ecosystems like R or Python, together with the accompanying libraries for official statistics, seem to be the future of statistical data processing. The hardware infrastructures are changing too. While a few years ago several statistical offices built their own (in-house) computing systems, these proved to be very costly, and now we are witnessing a new trend, i.e. the use of cloud-based hardware infrastructures. These systems are usually equipped with specific big data software products like Hadoop or Apache Spark. However, in the IT field technologies are changing with an unprecedented speed, and it is difficult to predict which technology is best for statistical purposes.
• A crucial challenge to cope with the implications brought by new data sources is the integration of all the preceding facets into a renewed production system. This demands an extraordinary exercise of management and leadership. Statistical offices, in our view, should strive to assume a leading role in the new datafied society.
Acknowledgments
The views expressed in this paper are those of the authors and do not necessarily reflect the views oftheir affiliating institutions.
References
Abadi, M., A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat,I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Man´e, R. Monga,S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke,V. Vasudevan, F. Vi´egas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng (2015). TensorFlow:Large-scale machine learning on heterogeneous systems. .Acton, R. M. (2010). scrapeR: Tools for Scraping Data from HTML and XML Documents . R package version 0.1.6. https://CRAN.R-project.org/package=scrapeR .Adler, J. (2012).
R in a Nutshell: A Desktop Quick Reference (2nd ed.). O'Reilly Media. Agafiței, M., F. Gras, W. Kloek, F. Reis, and S. Vâju (2015). Measuring output quality for multisource statistics in official statistics: Some directions.
Statistical Journal of the IAOS 31 , 203–211.Allaire, J., R. Francois, K. Ushey, G. Vandenbrouck, M. Geelnard, and Intel (2019).
RcppParallel: Parallel ProgrammingTools for ’Rcpp’ . R package version 4.4.4. https://CRAN.R-project.org/package=RcppParallel .Analytics, R. and S. Weston (2019). doMC: Foreach Parallel Adaptor for ’parallel’ . R package version 1.3.6. https://cran.r-project.org/package=doMC .Barab´asi, A.-L. (2008).
Network science . Cambridge: Cambridge University Press.Barcaroli, G. (2014). SamplingStrata: An R package for the optimization of stratified sampling.
Journal of StatisticalSoftware 61 (4), 1–24.Basu, D. (1971). An Essay on the Logical Foundations of Survey Sampling, Part One*. In: DasGupta A. (eds), SelectedWorks of Debabrata Basu. Selected Works in Probability and Statistics. Springer, New York, NY.Beck, M., F. Dumpert, and J. Feuerhake (2018). Machine Learning in Official Statistics.
CoRR abs/1812.10422 .Beresewicz, M., R. Lehtonen, F. Reis, L. di Consiglio, and M. Karlberg (2018). An overview of methods for treatingselectivity in big data sources. Eurostat Statistical Working Papers KS-TC-18-004-EN-N (2018 edition). https://ec.europa.eu/eurostat/documents/3888793/9053568/KS-TC-18-004-EN-N.pdf/52940f9e-8e60-4bd6-a1fb-78dc80561943 .Bethlehem, J. (2009a). The rise of survey sampling. Statistics Netherlands Discussion Paper 09015. .Bethlehem, J. G. (2009b). The future of surveys for official statistics. , 1–15.Biemer, P. and L. Lyberg (2003).
Introduction to survey quality . New York: Wiley.Bogdanovits, F., A. Degorre, F. Gallois, B. Fischer, K. Georgiev, R. Paulussen, S. Quaresma, M. Scannapieco, D. Summa,and P. Stoltze (2019). BREAL: Big Data Reference Architecture and Layers. Business layer. Deliverable F1. ESSneton Big Data project. https://webgate.ec.europa.eu/fpfis/mwikis/essnetbigdata/images/6/65/WPF_Deliverable_F1_BREAL_Big_Data_REference_Architecture_and_Layers_v.03012020.pdf .Bollobas, B. (2002).
Modern Graph Theory . New York: Springer.Borg, A. and M. Sariyar (2019).
RecordLinkage: Record Linkage in R . R package version 0.4-11.2.Bostock, M., V. Ogievetsky, and J. Heer (2011). D3: Data-driven documents.
IEEE Trans. Visualization & Comp. Graphics (Proc. InfoVis). Bowley, A. (1906). Address to the economic science and statistics section of the British Association for the Advancement of Science. Journal of the Royal Statistical Society 69, 548–557. Braaksma, B. and K. Zeelenberg (2020). Big data in official statistics. Statistics Netherlands Discussion Paper January 2020. Breton, R., G. Clews, L. Metcalfe, N. Milliken, C. Payne, J. Winton, and A. Woods (2015). Research indices using web scraped data. Office for National Statistics.
Survey Methodology 39 , 249–262.Buckner, J., J. Wilson, M. Seligman, B. Athey, S. Watson, and F. Meng (2009, 10). The gputools package enables GPUcomputing in R.
Bioinformatics 26 (1), 134–135.Buglielli, M.T., C. De Vitiis, and G. Barcaroli (2013). MAUSS-R Multivariate Allocation of Units in Sampling Surveys. Rpackage version 1.1.Casella, G. and R. Berger (2002).
Statistical Inference . Belmont: Duxbury Press.Cassel, C.-M., C.-E. S¨arndal, and J. Wretman (1977).
Foundations of Inference in Survey Sampling . New York: Wiley.CBS (2018). Robot Framework. http://research.cbs.nl/Projects/RobotFramework/index.html .CBS (2019). Blaise 5 and complex surveys. . Online; accessed 20 January 2020.Chambers, R. and R. Clark (2012).
An introduction to model-based survey sampling with applications . Oxford: OxfordUniversity Press.Chollet, F. et al. (2015). Keras. https://github.com/fchollet/keras .Cianchetta, R. (2013). First Stage Stratification and Selection in Sampling. R package version 1.0.Cobb, C. (2018).
Answering for Someone Else: Proxy Reports in Survey Research , pp. 87–93. Springer InternationalPublishing.Corporation, M. and S. Weston (2018). doParallel: Foreach Parallel Adaptor for the ’parallel’ Package . R package version1.0.14.Dagum, L. and R. Menon (1998). OpenMP: an industry standard API for shared-memory programming.
ComputationalScience & Engineering, IEEE 5 (1), 46–55.de Jonge, E. and M. van der Loo (2018). editrules: Parsing, Applying, and Manipulating Data Cleaning Rules . R packageversion 2.9.3.de Wolf, P.-P., A. Hundepool, S. Giessing, J.-J. Salazar, and J. Castro (2014).
Tau Argus User’s Manual . StatisticsNetherlands.Dean, J. and S. Ghemawat (2004). MapReduce: Simplified data processing on large clusters. In
OSDI’04: Sixth Symposiumon Operating System Design and Implementation , San Francisco, CA, pp. 137–150.Deming, W. (1950).
Some theory of sampling . New York: Wiley.Determen Jr., C. (2019). gpuR: GPU functions for R Objects . R package version 2.0.3.DGINS (2013). Scheveningen Memorandum. https://ec.europa.eu/eurostat/documents/42577/43315/Scheveningen-memorandum-27-09-13 . Online; accessed 29 July 2019.DGINS (2018). Bucharest Memorandum. . Online; accessed 29 July2019.D’Orazio, M. (2019).
StatMatch: Statistical Matching or Data Fusion . R package version 1.3.0.Eddelbuettel, D. (2013).
Seamless R and C++ Integration with Rcpp . New York: Springer. ISBN 978-1-4614-6867-7.Eddelbuettel, D. and J. J. Balamuta (2017, aug). Extending R with C++: A Brief Introduction to Rcpp. ,e3188v1.Eddelbuettel, D. and R. Fran¸cois (2011). Rcpp: Seamless R and C++ integration.
Journal of Statistical Software 40(8), 1–18. ESS (2013). ESS.VIP Admin Data. https://ec.europa.eu/eurostat/cros/content/use-administrative-and-accounts-data-business-statistics_en. Online; accessed 29 July 2019. ESS (2014). ESS Handbook for Quality Reports. https://ec.europa.eu/eurostat/documents/3859598/6651706/KS-GQ-15-003-EN-N.pdf/18dd4bf0-8de6-4f3f-9adb-fab92db1a568. Online; accessed 25 January 2020. Regulation (EC) No 223/2009. Official Journal of the European Union L87, 31.3.2009, p. 164–173. Eurostat (2019a). European Health Statistics. https://ec.europa.eu/eurostat/web/health/overview. Eurostat (2019b). Integrating alternative data sources into official statistics: a system-design approach. Conference of European Statisticians, 67th Plenary Session, Paris, 26-28 June 2019. Online; accessed 25 January 2020. Eurostat (2020a). Database on health statistics. Technical report, Eurostat. https://ec.europa.eu/eurostat/web/health/data/database. Eurostat (2020b). Quality overview. https://ec.europa.eu/eurostat/web/quality. Online; accessed 25 January 2020. Ferreira da Silva, A. R. (2011). cudaBayesreg: Parallel implementation of a Bayesian multilevel model for fMRI data analysis.
Journal of Statistical Software 44 (4), 1–24.Floridi, L. (2019). Semantic conceptions of information. In: Edward N. Zalta (ed.), The Stanford Encyclopedia of Philos-ophy (Winter 2019 Edition). Metaphysics Research Lab, Stanford University. https://plato.stanford.edu/archives/win2019/entries/information-semantic/ .Foley, B., I. Shuttleworth, and D. Martin (2018). Administrative data quality: Investigating record-level address accuracyin the Northern Ireland Health Register.
Journal of Official Statistics 34 , 55–81.Forum, M. P. (1994). Mpi: A message-passing interface standard. Technical report, USA.Futral, W. and J. Greene (2013).
Intel Trusted Execution Technology for Server Platforms: A Guide to More SecureDatacenters (1st ed.). USA: Apress.Gabriel, E., G. E. Fagg, G. B. T. Angskun, J. J. Dongarra, J. M. Squyres, V. Sahay, P. Kambadur, B. Barrett, A. Lumsdaine,R. H. Castain, D. J. Daniel, R. L. Graham, and T. S. Woodall (2004, September). Open MPI: Goals, concept, and designof a next generation MPI implementation. In
Proceedings, 11th European PVM/MPI Users’ Group Meeting , Budapest,Hungary, pp. 97–104.Giczi, J. and K. Sz˝oke (2018, January). Official Statistics and Big Data. Intersections. East European Journal of Societyand Politics, [S.l.], v. 4, n. 1, jan. 2018.Goodfellow, I., Y. Bengio, and A. Courville (2016).
Deep Learning . MIT Press. .Gropp, W. (2002, September). Mpich2: A new start for MPI implementations. In
Proceedings of the 9th EuropeanPVM/MPI Users’ Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface , pp.97–104.Groves, R. (1989).
Survey errors and survey costs . New York: Wiley.Grudkowska, S. (2017).
JDemetra+ User Guide . Eurostat.H´ajek, J. (1981).
Sampling from a finite population . London: Marcel Dekker Inc.Hall/CRC, C. . (2020). Handbooks of modern statistical methods. . Online; accessed 25 January 2020.Hammer, C. L., D. C. Kostroch, G. Quir´os, and S. I. Group (2017). Big Data: Potential, Challenges, and StatisticalImplications. Technical report, International Monetary Fund.Hand, D. (2018). Statistical challenges of administrative and transaction data.
Journal of the Royal Statistics Society A 8 ,1–24.Hand, D. (2019, 6). What is the purpose of statistical modelling? https://hdsr.mitpress.mit.edu/pub/9qsbf3hz.Hansen, M. (1987). Some history and reminiscences on survey sampling.
Statistical Science 2 , 180–190.Hansen, M., W. Hurwitz, and W. Madow (1966).
Sample survey: methods and theory (7th ed.). New York: Wiley. Hansen, M., W. Madow, and B. Tepping (1983). An evaluation of model-dependent and probability sampling inferences in sample surveys. Journal of the American Statistical Association 78, 776–793. Harrison, J. (2019).
RSelenium: R Bindings for ’Selenium WebDriver’ . R package version 1.7.5.Hedayat, A. and B. Sinha (1991).
Design and Inference in Finite Population Sampling . Wiley.High-Level Group for the Modernisation of Official Statistics (2011, June 14-16). Strategic vision of the High-Level Groupfor strategic developments in business architecture in Statistics. In UNECE (Ed.), , pp. Item 4. .Ho, D. E., K. Imai, G. King, and E. A. Stuart (2011). MatchIt: Nonparametric preprocessing for parametric causal inference.
Journal of Statistical Software 42 (8), 1–28.Hundepool, A., P.-P. de Wolf, J. Bakker, A. Reedijk, L. Franconi, S. Polettini, A. Capobianchi, and J. Domingo (2014).
Mu Argus User’s Manual . Statistics Netherlands.Hundepool, A., J. Domingo-Ferrer, L. Franconi, S. Giessing, E. S. Nordholt, K. Spicer, and P.-P. de Wolf (2012).
StatisticalDisclosure Control . New York: Wiley.Japec, L., F. Kreuter, M. Berg, P. Biemer, P. Decker, C. Lampe, J. Lane, C. O. Neil, and A. Usher (2015). Aapor reporton big data. Technical report, American Association for Public Opinion Research.Jones, E., T. Oliphant, P. Peterson, et al. (2001–). SciPy: Open source scientific tools for Python.Keller, A., V. Mule, D. Morris, and S. Konicki (2018). A distance metric for modeling the quality of administrative recordsfor use in the 2020 U.S. Census.
Journal of Official Statistics 34, 599–624.
Khalil, S. (2018). Rcrawler: Web Crawler and Scraper. R package version 0.1.9-1.
Kiær, A. (1897). The representative method of statistical surveys. Technical report, Papers from the Norwegian Academy of Science and Letters, II The Historical-Philosophical Section 1897 No. 4.
Kitchin, R. (2015b, August). Big data and official statistics: Opportunities, challenges and risks. Statistical Journal of the IAOS 31(3), 471–481.
Klöckner, A., N. Pinto, Y. Lee, B. Catanzaro, P. Ivanov, and A. Fasih (2012). PyCUDA and PyOpenCL: A Scripting-Based Approach to GPU Run-Time Code Generation.
Parallel Computing 38(3), 157–174.
Koller, D. and N. Friedman (2009). Probabilistic Graphical Models. Cambridge (Massachusetts): MIT Press.
Kouzis-Loukas, D. (2016). Learning Scrapy. Packt Publishing Ltd.
Kowarik, A. and M. van der Loo (2018). Using R in the statistical office: the experience of Statistics Netherlands and Statistics Austria.
Romanian Statistical Review 45(1), 15–29.
Kowarik, A., A. Meraner, M. Templ, and D. Schopfhauser (2014). Seasonal adjustment with the R packages x12 and x12GUI. Journal of Statistical Software 62(2), 1–21.
Kowarik, A. and M. Templ (2016). Imputation with the R package VIM. Journal of Statistical Software 74(7), 1–16.
Kruskal, W. and F. Mosteller (1979a). Representative sampling, I: Non-scientific literature.
International Statistical Review 47, 13–24.
Kruskal, W. and F. Mosteller (1979b). Representative sampling, II: Scientific literature, excluding statistics. International Statistical Review 47, 111–127.
Kruskal, W. and F. Mosteller (1979c). Representative sampling, III: The current statistical literature. International Statistical Review 47, 245–265.
Kruskal, W. and F. Mosteller (1980). Representative sampling, IV: The history of the concept in statistics, 1895–1939. International Statistical Review 48, 169–195.
Kuhn, M. (2020). caret: Classification and Regression Training. R package version 6.0-85.
Kuhn, T. (1957).
The Copernican Revolution. Boston: Harvard University Press.
Kuonen, D. and B. Loison (2020). Production processes of official statistics and analytics processes augmented by trusted smart statistics: Friends or foes? Statistical Journal of the IAOS 35, 615–622.
Landefeld, S. (2014, October). Uses of big data for official statistics: Privacy, incentives, statistical challenges, and other issues. In Discussion Paper for the International Conference on Big Data for Official Statistics.
Laney, D. (2001). 3D data management: Controlling data volume, velocity and variety.
Larsen, E. S. and D. McAllister (2001). Fast matrix multiplies using graphics hardware. In Proceedings of the 2001 ACM/IEEE Conference on Supercomputing, SC '01, New York, NY, USA, pp. 55. Association for Computing Machinery.
Lehmann, E. and G. Casella (1998).
Theory of Point Estimation (2nd ed.). Springer.
Lehtonen, R. and A. Veijanen (1998). Logistic generalized regression estimators. Survey Methodology 24, 51–55.
Lessler, J. and W. Kalsbeek (1992). Nonsampling Error in Surveys. New York: Wiley.
Ley 12/89, de la Función Estadística Pública, de 11 de mayo de 1989 (in Spanish). BOE núm. 112, de 11 de mayo de 1989, páginas 14026-14035.
Li, W., Y. Xia, and H. Chen (2019, January). Research on ARM TrustZone.
GetMobile: Mobile Computing and Communications 22(3), 17–22.
Liaw, A. and M. Wiener (2002). Classification and regression by randomForest. R News 2(3), 18–22.
Little, R. (2012). Calibrated Bayes, an alternative inferential paradigm for official statistics. Journal of Official Statistics 28, 309–334.
LOI num 2016-1321, du 7 octobre 2016 pour une République numérique (in French). JORF.
Luebke, D., M. Harris, N. Govindaraju, A. Lefohn, M. Houston, J. Owens, M. Segal, M. Papakipos, and I. Buck (2006). GPGPU: General-purpose computation on graphics hardware. In Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, SC '06, New York, NY, USA, pp. 208. Association for Computing Machinery.
Lumley, T. (2004). Analysis of complex survey samples. Journal of Statistical Software 9(1), 1–19. R package version 2.2.
Meadows, A., A. S. Pulvirenti, and M. C. Roldán (2013).
Pentaho Data Integration Cookbook (2nd ed.). Packt Publishing.
Meindl, B. (2019). sdcTable: Methods for Statistical Disclosure Control in Tabular Data. R package version 0.30.
Meyer, D., E. Dimitriadou, K. Hornik, A. Weingessel, and F. Leisch (2019). e1071: Misc Functions of the Department of Statistics, Probability Theory Group (Formerly: E1071), TU Wien. R package version 1.7-2.
Microsoft and S. Weston (2017). foreach: Provides Foreach Looping Construct for R. R package version 1.4.4.
Mofrad, S., F. Zhang, S. Lu, and W. Shi (2018). A comparison study of Intel SGX and AMD memory encryption technology. In Proceedings of the 7th International Workshop on Hardware and Architectural Support for Security and Privacy, HASP '18, New York, NY, USA. Association for Computing Machinery.
Morris, N. (2015). Unleashing GPU Power Using R: The gmatrix Package. R package version 0.3.
Naylor, J., N. Swier, and S. Williams (2014). ONS Big Data Project – Progress Report: Qtr 2 April to June 2014. Office for National Statistics.
Neyman, J. (1934). On the two different aspects of the representative method: the method of stratified sampling and the method of purposive selection.
Journal of the Royal Statistical Society 97, 558–625.
Nickolls, J., I. Buck, M. Garland, and K. Skadron (2008, March). Scalable parallel programming with CUDA. Queue 6(2), 40–53.
Normandeau, K. (2013). Beyond volume, variety and velocity is the issue of big data veracity. http://insidebigdata.com/2013/09/12/beyond-volume-variety-velocity-issue-big-data-veracity/. Online; accessed 20 January 2020.
OECD (2008). OECD Glossary of Statistical Terms. OECD Publishing.
OpenMP Architecture Review Board (2018, November). OpenMP application program interface version 5.0.
Paszke, A., S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019). PyTorch: An Imperative Style, High-Performance Deep Learning Library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), Advances in Neural Information Processing Systems 32, pp. 8024–8035. Curran Associates, Inc.
Pedregosa, F., G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12, 2825–2830.
Privacy Preserving Techniques Task Team (2019). UN Handbook on Privacy-Preserving Computation Techniques. http://tinyurl.com/y4do5he4. Online; accessed 20 January 2020.
Rao, J. and I. Molina (2015). Small Area Estimation (2nd ed.). New York: Wiley.
Reimsbach-Kounatze, C. (2015, January). The proliferation of "big data" and implications for official statistics and statistical agencies. OECD Digital Economy Papers No. 245.
Reinders, J. (2007).
Intel Threading Building Blocks (First ed.). USA: O'Reilly & Associates, Inc.
Ricciato, F. (2018a). Towards a reference architecture for trusted smart statistics. DGINS 2018; Online; accessed 20 January 2020.
Ricciato, F. (2018b). Towards a Reference Methodological Framework for processing MNO data for Official Statistics.
Ricciato, F. (2018c). Using (not sharing!) privately held data for trusted smart statistics. https://ec.europa.eu/eurostat/cros/content/keynote-talk-mobile-tartu-2018_en. Mobile Tartu 2018; Online; accessed 20 January 2020.
Ricciato, F., A. Wirthmann, K. Giannakouris, R. Fernando, and M. Skaliotis (2019). Trusted smart statistics: Motivations and principles. Statistical Journal of the IAOS 35(4), 589–603.
Robin, N., T. Klein, and J. Jütting (2015, December). Public-private partnerships for statistics: lessons learned, future steps. PARIS21 Partnership in Statistics for Development in the 21st Century Discussion Paper No. 8.
Rocher, L., J. Hendrickx, and Y. de Montjoye (2019). Estimating the success of re-identifications in incomplete datasets using generative models. Nature Communications 10, 3069.
Rounds, J. (2012). Rhipe: R and Hadoop integrated programming environment.
Rulifson, J. (1969, June). Decode encode language (DEL). RFC 5, RFC Editor.
Rupp, K., P. Tillet, F. Rudolf, J. Weinbub, T. Grasser, and A. Jungel (2016). ViennaCL - linear algebra library for multi- and many-core architectures.
SIAM Journal on Scientific Computing.
Sabt, M., M. Achemlal, and A. Bouabdallah (2015, August). Trusted execution environment: What it is, and what it is not. Volume 1, pp. 57–64.
Salemink, I., S. Dufour, and M. van der Steen (2019). Vision paper on future advanced data collection.
Särndal, C.-E. (2007). The calibration approach in survey theory and practice. Survey Methodology 33, 99–119.
Särndal, C.-E. and S. Lundström (2005). Estimation in Surveys with Nonresponse. Chichester: Wiley.
Särndal, C.-E., B. Swensson, and J. Wretman (1992). Model Assisted Survey Sampling. New York: Springer.
Sax, C. and D. Eddelbuettel (2018). Seasonal adjustment by X-13ARIMA-SEATS in R.
Journal of Statistical Software 87(11), 1–17.
Scannapieco, M. and N. R. Fazio (2019, March). Big data architectures @ Istat. In New Techniques and Technologies for Statistics International Conference (NTTS).
Scannapieco, M., L. Tosco, L. Valentino, L. Mancini, N. Cibella, T. Tuoto, and M. Fortini (201?). RELAIS User's Guide, Version 3.0. R package version 3.0.
Shapiro, E. B. (1969, March). Network timetable. RFC 4, RFC Editor.
Smith, T. (1976). The foundations of survey sampling: a review. Journal of the Royal Statistical Society, Series A 139, 183–204.
Smith, T. (1994). Sample surveys 1975-1990: An age of reconciliation?
International Statistical Review 62, 5–19.
Starmans, R. (2016). The advent of data science: some considerations on the unreasonable effectiveness of data. In P. Bühlmann, P. Drineas, M. Kane, and M. van der Laan (Eds.), Handbook of Big Data, Handbook of Statistics, Chapter 1, pp. 3–20. Amsterdam: Chapman and Hall/CRC Press.
Stone, J. E., D. Gohara, and G. Shi (2010, May). OpenCL: A parallel programming standard for heterogeneous computing systems. Computing in Science & Engineering 12(3), 66–73.
Struijs, P., B. Braaksma, and P. J. Daas (2014, April). Official statistics and big data.
Big Data & Society 1(1), 1–6.
Su, Y.-S., A. Gelman, J. Hill, and M. Yajima (2011). Multiple Imputation with Diagnostics (mi) in R: Opening Windows into the Black Box. Journal of Statistical Software 45.
Templ, M., A. Kowarik, and B. Meindl (2015). Statistical Disclosure Control for Micro-Data Using the R Package sdcMicro. Journal of Statistical Software 67(4), 1–36.
Templ, M. and V. Todorov (2016, February). The software environment R for official statistics and survey methodology. Austrian Journal of Statistics 45(1), 97–124.
The Apache Software Foundation (2019). Nutch, a highly extensible, highly scalable Web crawler. http://nutch.apache.org/.
The World Bank (2018a). Advancing CAPI/CAWI technology with Survey Solutions. https://support.mysurvey.solutions/getting-started/overview-printable/resources/SurveySolutionsBooklet_2018oct(ENG).pdf. Online; accessed 20 January 2020.
The World Bank (2018b).
Survey Solutions CAPI/CAWI platform: Release 5.26. Washington DC: The World Bank.
Theano Development Team (2001–). Theano: A Python framework for fast computation of mathematical expressions.
Therneau, T. and B. Atkinson (2019). rpart: Recursive Partitioning and Regression Trees. R package version 4.1-15.
Tieleman, T. (2010). Gnumpy: an easy way to use GPU boards in Python. Technical Report UTML TR 2010-002, Department of Computer Science, University of Toronto.
Tierney, L., A. J. Rossini, N. Li, and H. Sevcikova (2018). snow: Simple Network of Workstations. R package version 0.4-3.
Tillé, Y. and A. Matei (2016). sampling: Survey Sampling. R package version 2.8.
Guarnera, U. and M. T. B. (2013). SeleMix: an R Package for Selective Editing. Rome, Italy: Istat. R package version 0.9.1.
UNECE (2019). High-Level Group for the Modernisation of Official Statistics. Online; accessed 29 July 2019.
United Nations Global Working Group on Big Data (2016). Recommendations for access to data from private organizations for Official Statistics. http://unstats.un.org/unsd/bigdata/conferences/2016/gwg/Item%202%20(i)%20a%20-%20Recommendations%20for%20access%20to%20data%20from%20private%20organizations%20for%20official%20statistics%20Draft%2014%20July%202016.pdf.
Vale, S. International collaboration to understand the relevance of big data for official statistics.
Statistical Journal of the IAOS 31(23).
Valliant, R., A. Dorfmann, and R. Royall (2000). Finite Population Sampling and Inference: A Prediction Approach. New York: Wiley.
van Buuren, S. and K. Groothuis-Oudshoorn (2011). mice: Multivariate imputation by chained equations in R. Journal of Statistical Software 45(3), 1–67.
van der Loo, M. (2017). Open source statistical software at the statistical office. 61st World Statistics Congress of the International Statistical Institute.
van Steen, M. (2010).
Graph Theory and Complex Networks: An Introduction. Maarten van Steen.
Venables, W. N. and B. D. Ripley (2002). Modern Applied Statistics with S (Fourth ed.). New York: Springer. ISBN 0-387-95457-0.
Venkataraman, S., X. Meng, F. Cheung, and The Apache Software Foundation (2019). SparkR: R Front End for 'Apache Spark'. R package version 2.4.4.
Wand, Y. and R. Wang (1996). Anchoring data quality dimensions in ontological foundations.
Communications of the ACM 39, 86–95.
Weston, S. (2017). snow: Simple Network of Workstations. R package version 0.2.2.
White, T. (2012). Hadoop: The Definitive Guide. O'Reilly Media, Inc.
Wickham, H. (2016). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York.
Wickham, H. (2019). rvest: Easily Harvest (Scrape) Web Pages. R package version 0.3.5.
Yao, A. C. (1982, November). Protocols for secure computations. Los Alamitos, CA, USA, pp. 160–164. IEEE Computer Society.
Yao, A. C. (1986, October). How to generate and exchange secrets. pp. 162–167.
Yates, F. (1965). Sampling Methods for Censuses and Surveys (3rd ed.). London: Charles Griffin.
Yu, H. (2002). Rmpi: Parallel statistical computing in R. R News 2(2), 10–14.
Zardetto, D. (2013). ReGenesees: an Advanced R System for Calibration, Estimation and Sampling Errors Assessment in Complex Sample Surveys.
Proceedings of the 7th International Conference on New Techniques and Technologies for Statistics (NTTS).
Zaharia, M., R. S. Xin, P. Wendell, T. Das, M. Armbrust, A. Dave, X. Meng, J. Rosen, S. Venkataraman, M. J. Franklin, et al. (2016, October). Apache Spark: A unified engine for big data processing. Communications of the ACM 59(11), 56–65.
Zhao, C., S. Zhao, M. Zhao, Z. Chen, C.-Z. Gao, H. Li, and Y. Tang (2019). Secure multi-party computation: Theory, practice and applications. Information Sciences 476, 357–372.