Publication


Featured research published by Alan F. Karr.


Journal of the American Statistical Association | 2003

Maximum Penalized Likelihood Estimation, Vol. I: Density Estimation

Alan F. Karr

Use of Tables, and Applications), with 236 pages of text, 110 pages of tables, and a CD-ROM with software (written in Java) that implements the methods presented in the book. According to the Preface, the authors were motivated to write the book in part because of a consulting project. They “noticed that the recommended methods [for estimating proportions] are difficult to handle for practitioners and, moreover, they are inaccurate and yield often useless results.” I have a problem with these words all appearing in the same sentence. The methods that are at times useless (e.g., an interval based on a normal approximation) are not difficult to handle, and methods that can be difficult to handle (e.g., an exact confidence interval without the right software) do not yield useless results. At any rate, the authors decided to basically start from scratch and develop their own theory of estimation. This theory is based on a geometric approach and seems closely related to fiducial intervals as developed in several articles by Neyman. One important feature of the authors’ method is that it allows one to specify bounds on the possible parameter values. This of course can also be done within the existing paradigms; a frequentist can use a constrained maximum likelihood estimator, and a Bayesian can specify a prior with support on only part of the parameter space.

I think the authors are more than a little bit presumptuous when they say (p. xiv) that “the handbook represents a first step toward developing statistics as a (natural) science, and further steps must follow.” Later (on p. xvii), we learn that “it may also be looked upon as a first step in reconciling classical statistics with Bayes statistics.” Apparently this reconciliation will require the Bayesians to make all the concessions, given the authors’ reverence for Jerzy Neyman. They even display Neyman’s portrait in the user interface for their software. Besides confidence intervals (which the authors sometimes call “measurement intervals” for reasons that are unclear to me), the authors also discuss “prediction regions.” Interestingly, a prediction region does not require any data; rather, it is like an acceptance region for a null hypothesis on the Bernoulli parameter p. Again, we can specify bounds on the parameter values. I do not care much for the terminology, because it suggests that we are predicting a future result rather than testing a hypothesis.

The book is a tough read. The writing is hard to follow and somewhat overblown and confrontational, and the notation is thick. There are parts that I think are incorrect; for example, in discussing the exact confidence interval for a Bernoulli parameter in Section 3.4.1, the claim is made that an inferior interval estimator is obtained if one uses only the sample sum when the sample of individual outcomes is known. In other words, a sufficient statistic is claimed to be insufficient. This error is of no consequence, however, because the authors’ own tables and software require only the sample sum. In the end, I believe that an engineer or scientist will not be making a serious mistake if he or she uses the accompanying software to obtain a confidence interval. I base this belief on the fact that the few examples I tried yielded results quite close to Bayes credibility intervals based on a uniform prior. It is indeed interesting to have software that can output a critical region for testing a Bernoulli parameter with an interval constraint on its value.

However, as Tietjen (1986, p. 37) says, “fiducial intervals are understood by only a few and used by fewer still.” My guess is that this will also be the fate of this work.
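
The reviewer's closing observation, that the software's output came out quite close to Bayes credibility intervals based on a uniform prior, is easy to reproduce with standard tools. Below is a minimal sketch of that comparison (my own illustration, not the book's method): an exact Clopper-Pearson confidence interval for a Bernoulli proportion alongside the equal-tailed credible interval from the Beta(x + 1, n - x + 1) posterior.

```python
# Minimal sketch: exact (Clopper-Pearson) confidence interval vs. an equal-tailed
# Bayes credible interval under a uniform Beta(1, 1) prior, for x successes in n trials.
from scipy.stats import beta

def clopper_pearson(x, n, level=0.95):
    """Exact confidence interval for a Bernoulli proportion."""
    a = 1.0 - level
    lower = 0.0 if x == 0 else beta.ppf(a / 2, x, n - x + 1)
    upper = 1.0 if x == n else beta.ppf(1 - a / 2, x + 1, n - x)
    return lower, upper

def uniform_prior_credible(x, n, level=0.95):
    """Equal-tailed interval from the Beta(x + 1, n - x + 1) posterior."""
    a = 1.0 - level
    post = beta(x + 1, n - x + 1)
    return post.ppf(a / 2), post.ppf(1 - a / 2)

for x, n in [(3, 20), (0, 15), (45, 60)]:
    print((x, n), clopper_pearson(x, n), uniform_prior_credible(x, n))
```

For small to moderate n the two intervals are typically close, which is consistent with the reviewer's informal check.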


Journal of the American Statistical Association | 2010

Bayesian Multiscale Multiple Imputation With Implications for Data Confidentiality

Scott H. Holan; Daniell Toth; Marco A. R. Ferreira; Alan F. Karr

Many scientific, sociological, and economic applications present data that are collected on multiple scales of resolution. One particular form of multiscale data arises when data are aggregated across different scales both longitudinally and by economic sector. Frequently, such datasets have missing observations that can be accurately imputed, while respecting the constraints imposed by the multiscale nature of the data, using the method we propose, known as Bayesian multiscale multiple imputation. Our approach couples dynamic linear models with a novel imputation step based on singular normal distribution theory. Although our method is of independent interest, one important implication of such methodology is its potential effect on confidential databases protected by means of cell suppression. To demonstrate the proposed methodology and to assess the effectiveness of disclosure practices in longitudinal databases, we conduct a large-scale empirical study using the U.S. Bureau of Labor Statistics Quarterly Census of Employment and Wages (QCEW). During the course of our empirical investigation, we find that several of the predicted cells are accurate to within 1%, raising potential concerns for data confidentiality.
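
The imputation step mentioned in the abstract conditions on known aggregates, which is what yields a singular (degenerate) normal distribution. The sketch below is only an illustration of that idea under simplified assumptions (independent normal priors, a single sum constraint, made-up numbers), not the paper's dynamic-linear-model machinery: suppressed sub-cells are drawn from their prior conditioned on summing to the published remainder.

```python
# Minimal sketch: draw missing sub-cells from a normal prior conditioned on a known
# aggregate. The constraint makes the conditional law a singular normal; sampling is
# done by drawing unconstrained values and projecting onto the constraint.
import numpy as np

def impute_under_sum_constraint(mu, Sigma, remainder, rng):
    """Draw y ~ N(mu, Sigma) conditioned on sum(y) == remainder."""
    ones = np.ones(len(mu))
    z = rng.multivariate_normal(mu, Sigma)   # unconstrained draw
    s = Sigma @ ones
    adjust = s / (ones @ s)                  # Sigma 1 (1' Sigma 1)^{-1}
    return z + adjust * (remainder - ones @ z)

rng = np.random.default_rng(0)
total = 1000.0                                # published aggregate
observed = np.array([180.0, 240.0])           # disclosed sub-cells
mu = np.array([300.0, 280.0])                 # assumed prior means for suppressed cells
Sigma = np.diag([50.0 ** 2, 60.0 ** 2])       # assumed prior covariance
draws = np.array([impute_under_sum_constraint(mu, Sigma, total - observed.sum(), rng)
                  for _ in range(5)])
print(draws)
print(draws.sum(axis=1))                      # every draw sums to the remainder, 580
```

Repeating such draws across imputations gives a sense of how tightly a suppressed cell is pinned down by the published margins, which is precisely the confidentiality concern the abstract raises.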


Journal of the American Statistical Association | 2015

Simultaneous Edit-Imputation for Continuous Microdata

Hang J. Kim; Lawrence H. Cox; Alan F. Karr; Quanli Wang

Many statistical organizations collect data that are expected to satisfy linear constraints; as examples, component variables should sum to total variables, and ratios of pairs of variables should be bounded by expert-specified constants. When reported data violate constraints, organizations identify and replace values potentially in error in a process known as edit-imputation. To date, most approaches separate the error localization and imputation steps, typically using optimization methods to identify the variables to change followed by hot deck imputation. We present an approach that fully integrates editing and imputation for continuous microdata under linear constraints. Our approach relies on a Bayesian hierarchical model that includes (i) a flexible joint probability model for the underlying true values of the data with support only on the set of values that satisfy all editing constraints, (ii) a model for latent indicators of the variables that are in error, and (iii) a model for the reported responses for variables in error. We illustrate the potential advantages of the Bayesian editing approach over existing approaches using simulation studies. We apply the model to edit faulty data from the 2007 U.S. Census of Manufactures. Supplementary materials for this article are available online.
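
For concreteness, the sketch below illustrates the two kinds of edit rules the abstract mentions, a balance edit (components sum to a total) and a ratio edit bounded by expert-specified constants. The variable names and bounds are assumptions for illustration only, not the Census of Manufactures rules; the paper's Bayesian model goes further by giving the true values support only on the region where all such rules hold.

```python
# Minimal sketch of linear edit rules: a balance edit and a ratio edit.
# Records that fail any rule would be candidates for error localization and imputation.

def violated_edits(record, tol=1e-6):
    """Return the list of edit rules the record violates."""
    failures = []
    # Balance edit: quarterly payrolls (in $1000s) should sum to the annual total.
    quarters = (record["payroll_q1"] + record["payroll_q2"]
                + record["payroll_q3"] + record["payroll_q4"])
    if abs(quarters - record["payroll_total"]) > tol:
        failures.append("quarterly payrolls do not sum to annual total")
    # Ratio edit: payroll per employee must lie within an expert-specified band.
    ratio = record["payroll_total"] / max(record["employees"], 1)
    if not (5 <= ratio <= 500):
        failures.append("payroll per employee out of bounds")
    return failures

record = {"payroll_q1": 120, "payroll_q2": 130, "payroll_q3": 125,
          "payroll_q4": 140, "payroll_total": 600, "employees": 4}
print(violated_edits(record))   # the balance edit fails: 515 != 600
```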


Archive | 2014

Privacy, Big Data, and the Public Good: Using Statistics to Protect Privacy

Alan F. Karr

Introduction: Those who generate data – for example, official statistics agencies, survey organizations, and principal investigators, henceforth all called agencies – have a long history of providing access to their data to researchers, policy analysts, decision makers, and the general public. At the same time, these agencies are obligated ethically and often legally to protect the confidentiality of data subjects’ identities and sensitive attributes. Simply stripping names, exact addresses, and other direct identifiers typically does not suffice to protect confidentiality. When the released data include variables that are readily available in external files, such as demographic characteristics or employment histories, ill-intentioned users – henceforth called intruders – may be able to link records in the released data to records in external files, thereby compromising the agency’s promise of confidentiality to those who provided the data. In response to this threat, agencies have developed an impressive variety of strategies for reducing the risks of unintended disclosures, ranging from restricting data access to altering data before release. Strategies that fall into the latter category are known as statistical disclosure limitation (SDL) techniques. Most SDL techniques have been developed for data derived from probability surveys or censuses. Even in complete form, these data would not typically be thought of as big data, with respect to scale (numbers of cases and attributes), complexity of attribute types, or structure: most datasets are released, if not actually structured, as flat files.
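
The linkage threat described above can be made concrete with a toy example. The snippet below (my illustration, with assumed field names and records) counts how many released records match exactly one record in an external file on a set of quasi-identifiers, a crude re-identification risk measure of the kind that motivates SDL techniques.

```python
# Minimal sketch of a linkage-based disclosure risk check: a released record that
# matches exactly one external record on the quasi-identifiers is at risk.
from collections import Counter

QUASI_IDENTIFIERS = ("age", "sex", "zip3", "occupation")

def reidentification_rate(released, external):
    """Fraction of released records matching exactly one external record on the QIs."""
    key = lambda r: tuple(r[q] for q in QUASI_IDENTIFIERS)
    external_counts = Counter(key(r) for r in external)
    exposed = sum(1 for r in released if external_counts.get(key(r)) == 1)
    return exposed / len(released)

released = [{"age": 42, "sex": "F", "zip3": "277", "occupation": "statistician",
             "income": 91000}]
external = [{"age": 42, "sex": "F", "zip3": "277", "occupation": "statistician"},
            {"age": 42, "sex": "F", "zip3": "277", "occupation": "teacher"}]
print(reidentification_rate(released, external))   # 1.0: the released record is exposed
```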


Journal of the American Statistical Association | 2006

Exploratory Data Mining and Data Cleaning

Alan F. Karr

is no implied obligation to purchase the software in the future, but that a fee of about 300 euros will likely be charged for upcoming updates. The authors identify their target audience as researchers, applied scientists, and graduate students working in scientific modeling in virtually any discipline. I would add that researchers in the area of computer experiments might find some of the material of interest. For those readers who desire more information before deciding whether to purchase the book, I recommend reading the review article by three of the authors that appeared a few years back in Statistical Science (Saltelli, Tarantola, and Campolongo 2000).


Archive | 2017

Public-Use vs. Restricted-Use: An Analysis Using the American Community Survey

Satkartar K. Kinney; Alan F. Karr

Statistical agencies frequently publish microdata that have been altered to protect confidentiality. Such data retain utility for many types of broad analyses but can yield biased or insufficiently precise results in others. Research access to de-identified versions of the restricted-use data with little or no alteration is often possible, albeit costly and time-consuming. We investigate the advantages and disadvantages of public-use and restricted-use data from the American Community Survey (ACS) in constructing a comparable wage index (CWI). The public-use data were the Public Use Microdata Samples, while the restricted-use data were accessed via a Federal Statistical Research Data Center. We discuss the strengths and limitations of each data source and compare estimated CWIs and standard errors at the state and labor market levels.
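
A minimal sketch of the kind of comparison the abstract describes: relative differences in point estimates and ratios of standard errors between the two sources, by state. The figures below are placeholders for illustration, not ACS or CWI results.

```python
# Minimal sketch: compare wage-index estimates and standard errors from a public-use
# source against a restricted-use source. All numbers are hypothetical placeholders.
estimates = {
    # state: (public-use estimate, public-use SE, restricted-use estimate, restricted-use SE)
    "NC": (1.02, 0.015, 1.00, 0.009),
    "CA": (1.18, 0.011, 1.17, 0.006),
}

for state, (pu_est, pu_se, ru_est, ru_se) in estimates.items():
    rel_diff = (pu_est - ru_est) / ru_est   # shift attributable to alteration/subsampling
    se_ratio = pu_se / ru_se                # precision lost relative to restricted-use data
    print(f"{state}: relative difference {rel_diff:+.3f}, SE ratio {se_ratio:.2f}")
```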


Journal of the American Statistical Association | 2007

Partial Identification of Probability Distributions

Alan F. Karr

Statistical inference for fractal-based point processes is concentrated in the final three chapters. I found these chapters very readable and well presented. Although choosing a point process model that fits real data reasonably well is difficult, often because of the sparseness of the data and the lack of available inference methods, the authors demonstrate how certain types of fractal behavior can be detected in real datasets by comparing several suitable statistics and/or by using resampling techniques. The authors show this skillfully on the seven real-data examples considered throughout the book. By means of simulations, they also nicely compare the performance of nonparametric estimators of point process measures and of the fractal exponent. Finally, the book’s last chapter considers the accompanying example of computer network traffic in greater detail. This phenomenon has been studied extensively in the literature, and the authors nicely incorporate the existing references into the established framework of fractal-based point processes. Previously considered statistical techniques are put into action, and several fractal-based point processes are fitted to a dataset comprising 1 million consecutive Ethernet-packet arrivals.

Pedagogically, each chapter is accompanied by well-selected problems that nicely complement the presentation and facilitate a deeper understanding of the subject. It is here that the reader’s mathematical background is put to the test; many may find it beneficial that solutions to selected problems are discussed in the Appendix. Practical aspects of data analysis and simulations can be accessed from the book’s website.

Given the tremendous importance of the subject, Fractal-Based Point Processes will certainly prove valuable to scientists working in many distinct fields. Writing a good, practically oriented book on fractals and point processes suitable for a broad audience is not easy, especially considering the mathematical complexity of the concepts. It is apparent that the authors are truly at home in their field and succeed in presenting the topic in a comprehensive and systematic way. Along with the references listed in the presentation, I believe that Fractal-Based Point Processes will be of good use to anyone studying point processes with self-similarity characteristics.


Journal of the American Statistical Association | 2007

Statistical Matching: A Frequentist Theory, Practical Applications and Alternative Bayesian Approaches

Alan F. Karr

Diagnostics are revisited, and Bartlett’s test is presented in detail. Finally, introductions are provided to logistic regression, M-estimation, and total least squares. A major feature of this book is the inclusion of optional linear algebra sections and problem sets, which are given for all but one of the chapters. In addition, there is a 16-page appendix that can serve as a refresher or as an introduction to linear algebra. The goal of these sections is to demonstrate the power and flexibility of the linear algebra approach to linear models. Initially, this material is mechanical in nature. In later chapters (Chap. 7 and beyond), the scope, depth, and volume of the linear algebra sections and exercises increase sharply. This later material will be beyond the reach of the student without a prior linear algebra course, but will be a boon to the initiated undergraduate or possibly to the nonmajor graduate student. The linear algebra sections are fairly brief and do not attempt to recapitulate the entire chapter, but rather highlight two or three ideas previously introduced. These discussions advance to the point of introducing orthogonal projection operators and subspaces. A version of the Gauss–Markov theorem is subsequently presented and proven. Most arguments are coordinatized, and a full-rank design matrix is assumed.

The problem sets are particularly strong. There are, of course, many data-oriented exercises that allow the student to work with the methods discussed. These problems are rarely purely mechanical and often ask the student to think about the analysis while proceeding. Each chapter also has several theoretical or mathematical exercises, which typically are very good. These problems often significantly expand the scope in the sense that they introduce new ideas and techniques. Occasionally students are asked to produce simulation-based solutions to problems. The exercises are characterized by both diversity and breadth. Although the authors promise to indicate which problems require calculus, which require simulation, and which require “just a little more persistence,” they do not seem to follow this practice religiously. For example, problems in the latter category are indicated only in the first two chapters.

My criticisms of the text are minor. Given the vast nature of the topic area, the authors certainly needed to make some tough decisions regarding coverage. Having said that, I note that the discussions focus on regression models (quantitative covariates). The last chapter does discuss some one-way ANOVA and ANCOVA models, but goes no further. There is no discussion of two-way or higher models, or of experimental design models. These are models that the intended audience will encounter sooner rather than later. In addition, some may feel that the chapter on inference should be a bit broader. Unbiasedness is the only estimation criterion discussed carefully, and the introductions to confidence intervals and testing are done almost exclusively through examples. Testing is often a troublesome area for students, and the authors seem to break their own rule here regarding the provision of mathematical background. My review revealed a number of typographical and other errors commensurate with a first edition. In summary, Introduction to Linear Models and Statistical Inference reflects a strong appreciation of both theory and the realities of practical model building.
The book is intended for a broad audience, and the availability of an electronic version should widen its appeal by expanding its accessibility. The authors’ many years of teaching experience are evident, and the beginning or intermediate statistics student should come away with an understanding that analyzing real data requires thoughtfulness and completeness. Truly linking methods and theory in an area like linear models is not trivial for most students, but tools like this one go a long way toward bridging that gap.


Journal of the American Statistical Association | 2006

Evolution of Biological Systems in Random Media: Limit Theorems and Stability

Alan F. Karr

without “bogging them down in theoretical underpinnings.” What results is a book that discusses only a very small subset of the possible statistical methods and designs available. There is little development of the concepts on which the design of experiments is built. For example, there is little or no mention of the importance of randomizing the run order of an experiment to obtain valid and interpretable results, the distinction between replicates and repeated observations, or the role of blocking or restrictions on randomization to run an experiment more effectively. Often the reasons for choosing between design types (such as designs for first- or second-order models, or cuboidal or spherical regions of interest) are not explained. Written at a very introductory level, the book does a good job explaining the details of running an experiment, including appropriate conversion of units for factors to the unscaled notation standard for most designs, the need for confirming results with subsequent experimentation, and how some calculations are performed for simple analyses. However, if a practitioner were to rely solely on this book to plan and run an experiment, he or she would be left with an overly simplified impression of what the goals of experimentation should be and what tools are available. Alternatives such as Box, Hunter, and Hunter (2005) and Montgomery (2001) provide a much more comprehensive treatment of the topic, complete with wonderful insights into the general philosophy of industrial experimentation, including the benefits of sequential experimentation based on learning at each stage, and the power and potential of a good experiment to provide a tailored solution to a diverse set of questions.


Journal of the American Statistical Association | 2006

Random Graphs for Statistical Pattern Recognition

Alan F. Karr

arise when one does not know when and how many transitions a process makes to a given state. Chapter 9 deals with a number of applications of flowgraphs in queueing theory. The book is written at an introductory graduate level. Although the author does not assume knowledge of survival analysis or stochastic processes, courses in these areas would be helpful to the reader. A familiarity with R and MAPLE would also be helpful. The book’s strength lies in the examples the author presents. In fact, much of the text consists of techniques applied to examples, many taken from the biomedical literature. These examples help readers judge the utility of this approach and appreciate the calculations needed to implement these techniques.

The book has a few minor annoyances. First, most results are given as definitions, when clearly they are theorems or basic results. For example, Bayes’s theorem is given as a definition. Second, in many places the book seems to be a compendium of the author’s published work, and alternative approaches are ignored. For example, censored data model recognition is performed exclusively with censored data histograms when a number of perhaps better approaches, such as hazard plots, are available (Nelson 1982). The book’s focus is on parametric models for multistate systems. Although parametric models may be the norm in engineering applications, most of the recent work in biometry on multistate models has used nonparametric or semiparametric methods. These models are often based on the techniques discussed by Andersen, Borgan, Gill, and Keiding (1993).

Overall, this is an interesting book. As one who has worked on problems in the area of multistate models for a number of years, I found that it forced me to think about these models in a new way. This is a book that researchers interested in techniques for multistate models, either in reliability or biometry, should look at. I am not sure how much someone without a background in multistate models or survival analysis would benefit from this book.

Collaboration


Top co-authors of Alan F. Karr.

Daniell Toth, Bureau of Labor Statistics
Hang J. Kim, University of Cincinnati
Kenneth A. Bollen, University of North Carolina at Chapel Hill