Publication


Featured research published by Jörg Drechsler.


Privacy in Statistical Databases | 2008

Accounting for Intruder Uncertainty Due to Sampling When Estimating Identification Disclosure Risks in Partially Synthetic Data

Jörg Drechsler

Partially synthetic data comprise the units originally surveyed with some collected values, such as sensitive values at high risk of disclosure or values of key identifiers, replaced with multiple draws from statistical models. Because the original records remain on the file, intruders may be able to link those records to external databases, even though values are synthesized. We illustrate how statistical agencies can evaluate the risks of identification disclosures before releasing such data. We compute risk measures when intruders know who is in the sample and when the intruders do not know who is in the sample. We use classification and regression trees to synthesize data from the U.S. Current Population Survey.
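
A minimal sketch of the kind of tree-based synthesis described above, assuming scikit-learn's DecisionTreeClassifier as a stand-in for the paper's classification and regression trees; the variables and data are made up for illustration and do not reflect the CPS application:

import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)

# Toy survey data: two quasi-identifiers and one sensitive categorical variable.
n = 1000
X = pd.DataFrame({
    "age_group": rng.integers(1, 6, n),
    "region": rng.integers(1, 4, n),
})
y = pd.Series(rng.choice(["low", "mid", "high"], n), name="income_class")

# Grow a classification tree of the sensitive variable on the predictors.
tree = DecisionTreeClassifier(min_samples_leaf=20, random_state=0).fit(X, y)

# For every record, draw a synthetic value from the class distribution observed
# in that record's leaf (predict_proba returns the leaf class proportions).
proba = tree.predict_proba(X)
synthetic_y = np.array([rng.choice(tree.classes_, p=p) for p in proba])

# Repeating this m times yields the multiple partially synthetic datasets
# needed for the usual combining rules and for risk measures like those above.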


Privacy in Statistical Databases | 2010

Using support vector machines for generating synthetic datasets

Jörg Drechsler

Generating synthetic datasets is an innovative approach for data dissemination. Values at risk of disclosure or even the entire dataset are replaced with multiple draws from statistical models. The quality of the released data strongly depends on the ability of these models to capture important relationships found in the original data. Defining useful models for complex survey data can be difficult and cumbersome. One possible approach to reduce the modeling burden for data-disseminating agencies is to rely on machine learning tools to reveal important relationships in the data. This paper contains an initial investigation to evaluate whether support vector machines could be utilized to develop synthetic datasets. The application is limited to categorical data, but extensions for continuous data should be straightforward. I briefly describe the concept of support vector machines and the necessary adjustments for synthetic data generation. I evaluate the performance of the suggested algorithm using a real dataset, the IAB Establishment Panel. The results indicate that some data utility improvements might be achievable using support vector machines. However, these improvements come at the price of an increased disclosure risk compared to standard parametric modeling, and more research is needed to find ways to reduce the risk. Some ideas for achieving this goal are provided in the discussion at the end of the paper.
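
The general idea of using an SVM as the synthesis model for a categorical variable can be sketched as follows; this is only an illustration based on sampling from predicted class probabilities, not the adjusted algorithm evaluated in the paper:

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)

# Toy data: three continuous predictors and one binary variable to synthesize.
n = 500
X = rng.normal(size=(n, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n) > 0).astype(int)

# probability=True enables Platt scaling so that predict_proba is available.
svm = SVC(kernel="rbf", probability=True, random_state=0).fit(X, y)

# Draw synthetic categories from the estimated conditional class probabilities.
proba = svm.predict_proba(X)
synthetic_y = np.array([rng.choice(svm.classes_, p=p) for p in proba])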


Privacy in Statistical Databases | 2010

Remote data access and the risk of disclosure from linear regression: an empirical study

Philipp Bleninger; Jörg Drechsler; Gerd Ronning

In the search for ways to give researchers not employed at a statistical agency easy access to data, remote data access seems to be an attractive alternative to the current standard of either altering the data substantially before release or allowing access only at designated data archives or research data centers. Data perturbation is often not accepted by researchers, since they do not trust results obtained from altered datasets, while on-site access places heavy burdens on both the researcher and the data-providing agency in terms of time and money. Remote data access or remote analysis servers that allow researchers to submit queries without actually seeing the microdata have the potential to overcome both of these disadvantages. However, even if the microdata are not available to the researcher directly, disclosure of sensitive information about individual survey respondents is still possible. In this paper we illustrate how an intruder could use commonly available background information to reveal sensitive information using simple linear regression. We demonstrate the real risks of this approach with an empirical evaluation based on a German establishment survey, the IAB Establishment Panel. Although these kinds of attacks can easily be prevented once the agency is aware of the problem, this small simulation aims to emphasize that there might be many ways to obtain sensitive information using multivariate analysis, and not all of them are obvious. Thus, agencies thinking about implementing some form of remote data access should consider carefully which queries the system should allow.
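
One textbook variant of such an attack can be sketched in a few lines: if the intruder can add a regressor that is nonzero only for the targeted record (constructed from background knowledge), the target's OLS residual is forced to zero and the returned fitted value equals the confidential value exactly. The sketch below is an illustration in this spirit, not the paper's empirical setup:

import numpy as np

rng = np.random.default_rng(7)

# Toy establishment data: two background variables known to the intruder and
# one confidential outcome held only by the agency.
n = 200
background = rng.normal(size=(n, 2))
sensitive = 5 + background @ np.array([2.0, -1.0]) + rng.normal(size=n)

# A regressor that singles out one target record, constructed from background
# knowledge (here simply a dummy for the targeted establishment).
target = 17
dummy = np.zeros(n)
dummy[target] = 1.0

# Query submitted to the remote server: OLS of the sensitive variable on an
# intercept, the background variables, and the targeting dummy.
X = np.column_stack([np.ones(n), background, dummy])
beta, *_ = np.linalg.lstsq(X, sensitive, rcond=None)

# The fitted value for the target equals its confidential value exactly,
# because the dedicated dummy forces the target's residual to zero.
print(X[target] @ beta, sensitive[target])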


Journal of Official Statistics | 2014

Disclosure Risk from Factor Scores

Jörg Drechsler; Gerd Ronning; Philipp Bleninger

Remote access can be a powerful tool for providing data access for external researchers. Since the microdata never leave the secure environment of the data-providing agency, alterations of the microdata can be kept to a minimum. Nevertheless, remote access is not free from risk. Many statistical analyses that do not seem to provide disclosive information at first sight can be used by sophisticated intruders to reveal sensitive information. For this reason the list of allowed queries is usually restricted in a remote setting. However, it is not always easy to identify problematic queries. We therefore strongly support the argument that has been made by other authors: that all queries should be monitored carefully and that any microlevel information should always be withheld. As an illustrative example, we use factor analysis, for which the output of interest, the factor loadings of the variables, seems to be unproblematic. However, as we show in the article, the individual factor scores that are usually returned as part of the output can be used to reveal sensitive information. Our empirical evaluations based on a German establishment survey emphasize that this risk is far from a purely theoretical problem.
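
The mechanism can be illustrated with a small sketch: the individual factor scores combined with the seemingly harmless loadings reconstruct a close approximation of each respondent's values, which is exactly the microlevel output the article argues should be withheld. The example below uses scikit-learn's FactorAnalysis on made-up data rather than the authors' setup:

import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(3)

# Toy data: five observed variables driven by two latent factors.
n, k = 300, 2
latent = rng.normal(size=(n, k))
true_loadings = rng.normal(size=(k, 5))
X = latent @ true_loadings + 0.3 * rng.normal(size=(n, 5))

fa = FactorAnalysis(n_components=k, random_state=0).fit(X)
scores = fa.transform(X)      # individual factor scores (the risky output)
loadings = fa.components_     # factor loadings (the seemingly harmless output)

# An intruder holding both outputs can reconstruct the microdata closely.
X_hat = scores @ loadings + fa.mean_
print(np.corrcoef(X_hat[:, 0], X[:, 0])[0, 1])   # typically close to 1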


Privacy in Statistical Databases | 2018

Some Clarifications Regarding Fully Synthetic Data

Jörg Drechsler

There has been some confusion in recent years about the circumstances under which datasets generated using the synthetic data approach should be considered fully synthetic and about which estimator should be used to obtain valid variance estimates from the synthetic data. This paper aims to provide some guidance to overcome this confusion. It offers a review of the different approaches for generating synthetic datasets and discusses their similarities and differences. It also presents the different variance estimators that have been proposed for analyzing synthetic data. Based on two simulation studies, the advantages and limitations of the different estimators are discussed. The paper concludes with some general recommendations on how to judge which synthesis strategy and which variance estimator are most suitable in which situation.
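
For orientation, the two combining rules that are typically contrasted in this literature can be written down in a few lines. The sketch below implements the standard variance estimators for partially and fully synthetic data; it does not reproduce the paper's simulation studies:

import numpy as np

def combine_synthetic(q, u, fully_synthetic=False):
    """Combine point estimates q_l and within-variances u_l from m synthetic datasets."""
    q, u = np.asarray(q, float), np.asarray(u, float)
    m = len(q)
    q_bar = q.mean()       # combined point estimate
    b_m = q.var(ddof=1)    # between-synthesis variance
    u_bar = u.mean()       # average within-synthesis variance
    if fully_synthetic:
        T = (1 + 1 / m) * b_m - u_bar   # rule for fully synthetic data; can be negative
    else:
        T = u_bar + b_m / m             # rule for partially synthetic data
    return q_bar, T

# Example with m = 5 synthetic replicates of some survey estimate.
q_bar, T = combine_synthetic(q=[10.2, 9.8, 10.5, 10.1, 9.9],
                             u=[0.50, 0.48, 0.52, 0.49, 0.51])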


Behavior Research Methods | 2018

Biases in multilevel analyses caused by cluster-specific fixed-effects imputation

Matthias Speidel; Jörg Drechsler; Joseph W. Sakshaug

When datasets are affected by nonresponse, imputation of the missing values is a viable solution. However, most imputation routines implemented in commonly used statistical software packages do not accommodate multilevel models that are popular in education research and other settings involving clustering of units. A common strategy to take the hierarchical structure of the data into account is to include cluster-specific fixed effects in the imputation model. Still, this ad hoc approach has never been compared analytically to the congenial multilevel imputation in a random slopes setting. In this paper, we evaluate the impact of the cluster-specific fixed-effects imputation model on multilevel inference. We show analytically that the cluster-specific fixed-effects imputation strategy will generally bias inferences obtained from random coefficient models. The bias of random-effects variances and global fixed-effects confidence intervals depends on the cluster size, the relation of within- and between-cluster variance, and the missing data mechanism. We illustrate the negative implications of cluster-specific fixed-effects imputation using simulation studies and an application based on data from the National Educational Panel Study (NEPS) in Germany.
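
As a point of reference, the cluster-specific fixed-effects imputation strategy discussed here simply adds one dummy variable per cluster to the imputation model. The sketch below shows a schematic single imputation of this kind on made-up two-level data (proper multiple imputation would additionally draw the model parameters):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)

# Toy two-level data: 30 clusters with random intercepts and random slopes.
clusters = np.repeat(np.arange(30), 40)
x = rng.normal(size=clusters.size)
y = (1 + rng.normal(0, 0.5, 30)[clusters]
     + (0.5 + rng.normal(0, 0.3, 30)[clusters]) * x
     + rng.normal(0, 1, clusters.size))
df = pd.DataFrame({"cluster": clusters, "x": x, "y": y})
df.loc[rng.random(len(df)) < 0.3, "y"] = np.nan   # roughly 30% missing values

# Fixed-effects imputation model: one dummy per cluster via C(cluster).
fit = smf.ols("y ~ x + C(cluster)", data=df.dropna()).fit()
miss = df["y"].isna()
sigma = np.sqrt(fit.scale)                        # residual standard deviation
df.loc[miss, "y"] = fit.predict(df.loc[miss]) + rng.normal(0, sigma, miss.sum())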


Privacy in Statistical Databases | 2014

Synthetic Longitudinal Business Databases for International Comparisons

Jörg Drechsler; Lars Vilhuber

International comparison studies on economic activity are often hampered by the fact that access to business microdata is very limited at the international level. A recently launched project tries to overcome these limitations by improving access to Business Censuses from multiple countries based on synthetic data. Starting from the synthetic version of the longitudinally edited U.S. Business Register (the Longitudinal Business Database, LBD), the idea is to create similar data products in other countries by applying the synthesis methodology developed for the LBD to generate synthetic replicates that could be distributed without confidentiality concerns. In this paper we present some first results of this project based on German business data collected at the Institute for Employment Research.


Privacy in Statistical Databases | 2012

Generating useful test data for complex linked employer-employee datasets

Matthias Dorner; Jörg Drechsler; Peter Jacobebbinghaus

When data access for external researchers is difficult or time consuming, it can be beneficial if test datasets that mimic the structure of the original data are disseminated in advance. With these test data, researchers can develop their analysis code or decide whether the data are suitable for their planned research before they go through the lengthy process of getting access at the research data center. The aim of these test data is not to provide meaningful results. Instead, it is important to maintain the structure of the data as closely as possible, including skip patterns, logical constraints between the variables, and longitudinal relationships, so that any code developed using the test data will also run on the original data without further modification. Achieving this goal can be challenging for complex datasets such as linked employer-employee datasets (LEED), where the links between the establishments and the employees also need to be maintained. Using the LEED of the Institute for Employment Research, we illustrate how useful test data can be developed for such complex datasets. Our approach mainly relies on traditional statistical disclosure control (SDC) techniques such as data swapping and noise addition for data protection. Since statistical inferences need not be preserved, high swapping rates can be applied to sufficiently protect the data. At the same time, it is straightforward to maintain the structure of the data by adding constraints to the swapping procedure.
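
A minimal sketch of the kind of constrained swapping this describes: a variable is permuted only within groups defined by structural keys, so logical relationships survive while the statistical content is effectively destroyed. The variable names below are invented for illustration and are not those of the IAB LEED:

import numpy as np
import pandas as pd

rng = np.random.default_rng(11)

def swap_within_groups(df, column, keys):
    """Return a copy of df with `column` randomly permuted within each group."""
    out = df.copy()
    out[column] = df.groupby(keys)[column].transform(
        lambda s: s.to_numpy()[rng.permutation(len(s))]
    )
    return out

# Toy establishment file: wages may only be swapped among establishments in the
# same industry and size class, so the file's structure stays intact.
estabs = pd.DataFrame({
    "industry": rng.integers(1, 5, 1000),
    "size_class": rng.integers(1, 4, 1000),
    "wage_bill": rng.lognormal(10, 1, 1000).round(2),
})
test_data = swap_within_groups(estabs, "wage_bill", ["industry", "size_class"])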


Journal of Official Statistics | 2009

Disclosure risk and data utility for partially synthetic data: an empirical study using the German IAB establishment survey

Jörg Drechsler


Survey Methodology | 2012

Combining synthetic data with subsampling to create public use microdata files for large scale surveys

Jörg Drechsler

Collaboration


Dive into Jörg Drechsler's collaborations.

Top Co-Authors


Gerd Ronning

University of Tübingen


Philipp Bleninger

Institut für Arbeitsmarkt- und Berufsforschung


Matthias Speidel

Institut für Arbeitsmarkt- und Berufsforschung


Peter Jacobebbinghaus

Institut für Arbeitsmarkt- und Berufsforschung
