Ton de Waal
Statistics Netherlands
Publications
Featured research published by Ton de Waal.
Archive | 1996
Leon Willenborg; Ton de Waal
Statistical offices basically release two kinds of data, namely tabular data and microdata. Tabular data are the traditional products of statistical offices. These tables contain aggregated data. Microdata sets have only recently begun to be released. These microdata sets consist of records with information about individual entities, such as persons or business enterprises, and have generally been collected by means of a survey. In other words, each record contains the values of a number of variables for an individual entity. A microdata set is in fact the raw material used to construct tables. Formerly, microdata were used only to construct tables, which were subsequently released. Nowadays, the microdata sets themselves are also released, although usually in a somewhat adapted form. Because the disclosure risk for microdata is potentially much higher than that for tables, their release is an important reason for the increasing attention that SDC demands.
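To make the distinction concrete, the sketch below (not part of the original abstract) aggregates a small hypothetical microdata set of individual records into a frequency table of the kind a statistical office would traditionally release.

```python
# A minimal sketch of how a microdata set of individual records is aggregated
# into a releasable frequency table. The variable names (region, employed) are
# hypothetical and chosen only for illustration.
from collections import Counter

microdata = [
    {"region": "North", "employed": True},
    {"region": "North", "employed": False},
    {"region": "South", "employed": True},
    {"region": "South", "employed": True},
]

# Each cell of the table counts the records sharing a combination of values.
table = Counter((rec["region"], rec["employed"]) for rec in microdata)

for (region, employed), count in sorted(table.items()):
    print(f"{region:5s}  employed={employed!s:5s}  count={count}")
```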
The Annals of Applied Statistics | 2013
Jeroen Pannekoek; Natalie Shlomo; Ton de Waal
A common problem faced by statistical offices is that data may be missing from collected data sets. The typical way to overcome this problem is to impute the missing data. The problem of imputing missing data is complicated by the fact that statistical data often have to satisfy certain edit rules and that values of variables sometimes have to sum up to known totals. Standard imputation methods for numerical data as described in the literature generally do not take such edit rules and totals into account. In this paper we describe algorithms for imputation of missing numerical data that do take edit restrictions into account and that ensure that sums are calibrated to known totals. The methods sequentially impute the missing data, i.e., the variables with missing values are imputed one by one. To assess the performance of the imputation methods, we carry out a simulation study as well as an evaluation study based on a real dataset.
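As a rough illustration of the kind of procedure described here, the sketch below imputes missing numerical values one variable at a time and then rescales the imputed values so that each column sum equals a known total. It is a minimal stand-in, not the paper's algorithm; the data, totals, and non-negativity edit rule are hypothetical.

```python
# A minimal sketch, not the paper's algorithm: missing values in each numerical
# variable are imputed one by one (here with a simple observed mean), after which
# the imputed values are rescaled so that the column total matches a known total.
# Observed values are left untouched; imputed values stay non-negative.

data = {                      # None marks a missing value
    "turnover": [100.0, None, 250.0, None],
    "costs":    [60.0, 40.0, None, 30.0],
}
known_totals = {"turnover": 700.0, "costs": 260.0}

for var, values in data.items():
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)

    # Step 1: sequential imputation with a simple placeholder predictor.
    imputed_idx = [i for i, v in enumerate(values) if v is None]
    for i in imputed_idx:
        values[i] = mean

    # Step 2: calibrate the imputed values so the column sum equals the known total.
    deficit = known_totals[var] - sum(observed)
    current = sum(values[i] for i in imputed_idx)
    if imputed_idx and current > 0:
        factor = max(deficit, 0.0) / current
        for i in imputed_idx:
            values[i] *= factor

print(data)   # column sums now equal the known totals
```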
Journal of Official Statistics | 2013
Wieger Coutinho; Ton de Waal; Natalie Shlomo
Abstract A major challenge faced by basically all institutes that collect statistical data on persons, households or enterprises is that data may be missing in the observed data sets. The most common solution for handling missing data is imputation. Imputation is complicated owing to the existence of constraints in the form of edit restrictions that have to be satisfied by the data. Examples of such edit restrictions are that someone who is less than 16 years old cannot be married in the Netherlands, and that someone whose marital status is unmarried cannot be the spouse of the head of household. Records that do not satisfy these edits are inconsistent, and are hence considered incorrect. A further complication when imputing categorical data is that the frequencies of certain categories are sometimes known from other sources or have previously been estimated. In this article we develop imputation methods for imputing missing values in categorical data that take both the edit restrictions and known frequencies into account.
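The sketch below illustrates the general idea, not the article's actual method: a missing marital status is drawn only from categories allowed by the edit restriction quoted above (a person under 16 cannot be married), with draw probabilities taken from hypothetical known frequencies.

```python
# A minimal sketch: a missing marital status is imputed by drawing only from
# categories that satisfy the edit restrictions for that record, with draw
# probabilities proportional to known (here hypothetical) frequencies.
import random

known_freqs = {"married": 0.45, "unmarried": 0.40, "widowed": 0.15}

def allowed_categories(record):
    """Edit restriction from the text: a person under 16 cannot be married."""
    cats = list(known_freqs)
    if record["age"] < 16:
        cats.remove("married")
    return cats

def impute_marital_status(record, rng=random):
    cats = allowed_categories(record)
    weights = [known_freqs[c] for c in cats]
    return rng.choices(cats, weights=weights, k=1)[0]

record = {"age": 14, "marital_status": None}
record["marital_status"] = impute_marital_status(record)
print(record)   # never imputes "married" for a 14-year-old
```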
Handbook of Statistics | 2009
Ton de Waal
Publisher Summary Users of statistical information are nowadays demanding high-quality data on social, demographic, industrial, economic, financial, political, and cultural aspects of society with a great level of detail and produced within a short span of time. National statistical institutes (NSIs) fulfill a central role in providing such high-quality statistical information. Most NSIs face this challenge while their financial budgets are constantly diminishing. A major complicating factor is that collected data generally contain errors. The data collection stage in particular is a potential source of errors. For instance, a respondent may give a wrong answer (intentionally or not), a respondent may not give an answer (either because he or she does not know the answer or because he or she does not want to answer), and errors can be introduced at the NSI when the data are transferred from the questionnaire to the computer system. The occurrence of errors in the observed data makes it necessary to carry out an extensive process of checking the collected data and, when necessary, of correcting them. This checking and correcting process is referred to as “statistical data editing.”
Journal of Official Statistics | 2017
Laura Boeschoten; Daniel L. Oberski; Ton de Waal
Abstract Both registers and surveys can contain classification errors. These errors can be estimated by making use of a composite data set. We propose a new method based on latent class modelling to estimate the number of classification errors across several sources while taking into account impossible combinations with scores on other variables. Furthermore, the latent class model, by multiply imputing a new variable, enhances the quality of statistics based on the composite data set. The performance of this method is investigated by a simulation study, which shows that whether or not the method can be applied depends on the entropy R2 of the latent class model and the type of analysis a researcher is planning to do. Finally, the method is applied to public data from Statistics Netherlands.
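For reference, the entropy R² mentioned in the abstract can be computed from the posterior class-membership probabilities of a fitted latent class model. The sketch below uses hypothetical posteriors and does not show the model fitting itself.

```python
# A minimal sketch of the entropy R-squared referred to in the abstract, computed
# from posterior class-membership probabilities of a fitted latent class model.
# The posterior matrix below is hypothetical; fitting the model is not shown.
import math

def entropy_r2(posteriors):
    """R^2 = 1 - sum_i sum_k -p_ik*ln(p_ik) / (N * ln K); 1 means perfect separation."""
    n = len(posteriors)
    k = len(posteriors[0])
    entropy = -sum(p * math.log(p) for row in posteriors for p in row if p > 0)
    return 1.0 - entropy / (n * math.log(k))

# Hypothetical posteriors for four records over two latent classes.
posteriors = [
    [0.95, 0.05],
    [0.90, 0.10],
    [0.10, 0.90],
    [0.50, 0.50],   # a poorly classified record lowers the entropy R^2
]
print(round(entropy_r2(posteriors), 3))
```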
Archive | 2001
Leon Willenborg; Ton de Waal
Organizations conducting surveys and other forms of data collection may release the results of these exercises to third party users as “statistical products” in a variety of formats. For example, they may release tables to the public through published reports or release microdata files to academics for secondary data analysis. The problem addressed in statistical disclosure control (SDC) is that it is conceivable that a person who is given access to one of these statistical products may, through inappropriate use of the data, be able to disclose confidential information about the individual units which originally provided the data. These units might, for example, be respondents to a survey or persons completing forms for administrative purposes.
Archive | 2001
Leon Willenborg; Ton de Waal
In this chapter we consider the assessment of disclosure risk for tabular data. Disclosure risk may be defined either for the whole table or separately for each cell into which the table is organized. We shall sometimes use the term sensitivity as an alternative term for the disclosure risk of a table or cell. We suppose that a threshold may be specified as the maximum value below which the disclosure risk is deemed acceptable. Disclosure risk exceeding the threshold will call for the use of some form of SDC technique. For a measure of disclosure risk defined at the table level, we say that the table is sensitive if the disclosure risk of the table exceeds the given threshold. For a measure of disclosure risk defined at the cell level, we similarly say that a cell is sensitive if its disclosure risk is greater than the given threshold. In this book we restrict ourselves to measures of disclosure risk defined at the cell level. The objective of disclosure risk assessment will then be to determine which cells of a table are sensitive. We assume that a table containing sensitive cells may not be published. Having identified which cells are sensitive, the next step will be to treat these cells with an SDC technique such as cell suppression. This will be discussed in Chapters 8 and 9.
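To illustrate cell-level sensitivity measures of this kind, the sketch below applies two standard SDC rules, a minimum-frequency rule and an (n,k)-dominance rule. The specific rules, thresholds, and data are illustrative and are not taken from this chapter.

```python
# A minimal illustration of cell-level sensitivity measures; the rules below
# (a minimum-frequency rule and an (n,k)-dominance rule) are standard SDC
# examples, and the thresholds and contributions are hypothetical.

def sensitive_min_frequency(contributions, min_count=3):
    """Cell is sensitive if fewer than min_count units contribute to it."""
    return len(contributions) < min_count

def sensitive_dominance(contributions, n=2, k=0.85):
    """(n,k)-dominance: sensitive if the n largest contributions exceed
    a fraction k of the cell total."""
    total = sum(contributions)
    if total == 0:
        return False
    top_n = sum(sorted(contributions, reverse=True)[:n])
    return top_n > k * total

cell = [900.0, 80.0, 20.0]            # contributions of individual enterprises
print(sensitive_min_frequency(cell))  # False: three contributors
print(sensitive_dominance(cell))      # True: two largest make up 98% of the total
```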
Archive | 2000
Ton de Waal
Statistical offices have to face the problem that data collected by surveys or obtained from administrative registers generally contain errors. Another problem they have to face is that values in data sets obtained from these sources may be missing. To handle such errors and missing data efficiently, Statistics Netherlands is currently developing a software package called SLICE (Statistical Localisation, Imputation and Correction of Errors). SLICE will contain several edit and imputation modules, for example a module for automatic editing and a module for imputation based on tree models. In this paper I describe SLICE, focusing on the above-mentioned modules.
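As a generic illustration of imputation based on tree models (not the actual SLICE module), the sketch below fits a regression tree on complete records and uses it to predict the missing values; the data and the use of scikit-learn are assumptions made for illustration.

```python
# A generic sketch of tree-model imputation, in the spirit of the SLICE module
# mentioned above but not its implementation: a regression tree is trained on
# complete records and used to predict the missing values. Data are hypothetical.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([[25, 1], [40, 2], [52, 2], [33, 1], [61, 3]])   # predictors: age, household size
y = np.array([21000.0, 34000.0, np.nan, 27000.0, np.nan])     # target: income, with missing values

observed = ~np.isnan(y)
tree = DecisionTreeRegressor(max_depth=2, random_state=0)
tree.fit(X[observed], y[observed])

y_imputed = y.copy()
y_imputed[~observed] = tree.predict(X[~observed])
print(y_imputed)
```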
Archive | 2001
Leon Willenborg; Ton de Waal
The aim of this chapter is to discuss the impact of SDC techniques on the data analytic potential of microdata. There is no single correct way to define "analytic potential", since different users might analyze a given set of microdata in different, unforeseen ways. We shall begin by assuming that the purpose of the analysis is to estimate a specified set of population parameters. These might be descriptive parameters, such as means or proportions, or they might be analytic parameters, such as the coefficients of a regression model. We consider the impact of SDC techniques on the estimation of these parameters and, specifically, the impact of the SDC techniques discussed in Chapter 1.
Archive | 2001
Leon Willenborg; Ton de Waal
In the present chapter we discuss the impact of SDC techniques on the statistical quality of tables. This impact on the quality is subsumed under the heading “information loss”.