
Publication


Featured research published by Diane Lambert.


Handbook of massive data sets | 2002

Detecting fraud in the real world

Michael H. Cahill; Diane Lambert; José C. Pinheiro; Don X. Sun

Finding telecommunications fraud in masses of call records is more difficult than finding a needle in a haystack. In the haystack problem, there is only one needle that does not look like hay, the pieces of hay all look similar, and neither the needle nor the hay changes much over time. Fraudulent calls may be rare like needles in haystacks, but they are much more challenging to find. Callers are dissimilar, so calls that look like fraud for one account look like expected behavior for another, while all needles look the same. Moreover, fraud has to be found repeatedly, as fast as fraud calls are placed, the nature of fraud changes over time, the extent of fraud is unknown in advance, and fraud may be spread over more than one type of service. For example, calls placed on a stolen wireless telephone may be charged to a stolen credit card. Finding fraud is like finding a needle in a haystack only in the sense of sifting through masses of data to find something rare. This chapter describes some issues involved in creating tools for building fraud systems that are accurate, able to adapt to changing legitimate and fraudulent behavior, and easy to use.
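
The per-account framing in this abstract (a call that is suspicious for one account may be routine for another) can be made concrete with a small sketch. The z-score form, baseline representation, and numbers below are illustrative assumptions, not the chapter's actual method.

```python
# Sketch of per-account scoring: a call is judged against that account's
# own baseline rather than a global rule. The baseline fields and score
# form are assumptions for illustration.
def fraud_score(call_duration, account_baseline):
    mean, sd = account_baseline  # this account's typical duration and spread
    return abs(call_duration - mean) / sd

quiet_account = (60.0, 30.0)     # short, regular calls are the norm
heavy_account = (1800.0, 900.0)  # long calls are normal for this account

call = 1500.0  # a 25-minute call, in seconds
print(fraud_score(call, quiet_account))  # 48.0: wildly unusual here
print(fraud_score(call, heavy_account))  # ~0.33: expected behavior
```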


Knowledge Discovery and Data Mining | 2000

Incremental quantile estimation for massive tracking

Fei Chen; Diane Lambert; José C. Pinheiro

Data such as call records, internet packet headers, or other transaction records are coming down a pipe at a ferocious rate, and we need to monitor statistics of the data. There is no reason to think that the data are normally distributed, so quantiles of the data are important to watch. The probe attached to the pipe has only limited memory, though, so it is impossible to compute the quantiles by sorting the data. The only possibility is to incrementally estimate the quantiles as the data fly by. This paper provides such an incremental quantile estimator. It resembles an exponentially weighted moving average in form, processing, and memory requirements, but it is based on stochastic approximation, so we call it an exponentially weighted stochastic approximation or EWSA. Simulations show that the EWSA outperforms other kinds of incremental estimates that also require minimal main memory, especially when extreme quantiles are tracked for patterns of behavior that change over time. Use of the EWSA is illustrated in an application to tracking call duration for a set of callers over a three-month period.
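
A minimal sketch of the kind of stochastic-approximation update the abstract describes may help; the step sizes, initialization, and fixed scale here are assumptions (the paper adapts these from the data), so this shows the idea rather than the paper's exact estimator.

```python
import random

class EWSAQuantile:
    """Incremental quantile tracker in the spirit of an exponentially
    weighted stochastic approximation; tuning details are assumptions."""

    def __init__(self, p, weight=0.05, scale=10.0):
        self.p = p        # target quantile, e.g. 0.95
        self.w = weight   # exponential weight: memory of the estimate
        self.c = scale    # step scale; the paper adapts this from the data
        self.q = None     # current quantile estimate

    def update(self, x):
        if self.q is None:
            self.q = x    # initialize at the first observation
            return self.q
        # Asymmetric steps balance exactly when a fraction p of the data
        # falls below the estimate, so q tracks the p-th quantile.
        if x > self.q:
            self.q += self.w * self.c * self.p
        else:
            self.q -= self.w * self.c * (1.0 - self.p)
        return self.q

# Track the 95th percentile of simulated call durations (mean 100 s).
tracker = EWSAQuantile(p=0.95)
for _ in range(20000):
    tracker.update(random.expovariate(1.0 / 100.0))
print(tracker.q)  # roughly 300, the true 95th percentile of this stream
```

Only the current estimate is stored, which is the point: no sorting and no buffer of past observations, so the memory footprint is constant no matter how much data flies by.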


Journal of Computational and Graphical Statistics | 1999

Fitting Trees to Functional Data, with an Application to Time-of-Day Patterns

Yan Yu; Diane Lambert

Decision trees often give simple descriptions of complex, nonlinear relationships between several predictors and a univariate or multivariate response. But if the response is a high-dimensional vector that can be thought of as a discretized function, then fitting a multivariate regression tree may be unsuccessful. This article explores two ways to fit trees to functional data. Both first reduce the dimensionality of the data and then fit a standard multivariate tree to the reduced response. In the first approach, each individual's response curve is represented as a linear combination of spline basis functions, penalizing for roughness, and then a multivariate regression tree is fit to the coefficients of the basis functions. In the second, a multivariate regression tree is fit to the first several principal component scores for the responses. The two methods are illustrated with time-of-day patterns for customers who place international calls.
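
The second approach (a tree fit to principal component scores) is straightforward to sketch with standard tools; the data shapes, component count, and tree depth below are assumptions for illustration, not the article's settings.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n, T = 500, 24                               # customers x hourly bins (assumed)
X = rng.normal(size=(n, 3))                  # customer-level predictors
curves = rng.poisson(2.0, size=(n, T)).astype(float)  # time-of-day curves

# Reduce each response curve to its first few principal component scores,
# then fit a standard multivariate regression tree to the scores.
pca = PCA(n_components=4)
scores = pca.fit_transform(curves)
tree = DecisionTreeRegressor(max_depth=3).fit(X, scores)

# Predicted scores for a customer map back to a representative curve.
pred_scores = tree.predict(X[:1])
representative_curve = pca.inverse_transform(pred_scores)
print(representative_curve.shape)  # (1, 24)
```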


Knowledge Discovery and Data Mining | 2001

Mining a stream of transactions for customer patterns

Diane Lambert; José C. Pinheiro

Transaction data can arrive at a ferocious rate in the order that transactions are completed. The data contain an enormous amount of information about customers, not just transactions, but extracting up-to-date customer information from an ever-changing stream of data and mining it in real time is a challenge. This paper describes a statistically principled approach to designing short, accurate summaries or signatures of high-dimensional customer behavior that can be kept current with a stream of transactions. A signature database can then be used for data mining and to provide approximate answers to many kinds of queries about current customers quickly and accurately, as an empirical study of the calling patterns of 96,000 wireless customers who made about 18 million wireless calls over a three-month period shows.
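
A minimal sketch of keeping a signature current with a stream, assuming an exponentially weighted update and a single tracked feature; the paper's signatures summarize high-dimensional behavior, so treat the field names and weight here as illustrative.

```python
from collections import defaultdict

W = 0.05  # weight on the newest transaction (an assumed tuning choice)

signatures = defaultdict(lambda: {"mean_duration": 0.0, "n_seen": 0})

def update_signature(customer_id, duration):
    """Fold one transaction into the customer's signature as it arrives."""
    sig = signatures[customer_id]
    if sig["n_seen"] == 0:
        sig["mean_duration"] = duration  # initialize from the first call
    else:
        # Exponentially weighted average: old behavior decays, so the
        # signature stays current as the customer's pattern changes.
        sig["mean_duration"] = (1 - W) * sig["mean_duration"] + W * duration
    sig["n_seen"] += 1

update_signature("cust-42", 180.0)
update_signature("cust-42", 95.0)
print(signatures["cust-42"])  # {'mean_duration': 175.75, 'n_seen': 2}
```

Because each signature is a few numbers rather than a transaction history, queries against the signature database stay fast even as the underlying stream grows without bound.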


Journal of the American Statistical Association | 1997

Nonparametric Maximum Likelihood Estimation from Samples with Irrelevant Data and Verification Bias

Diane Lambert; Luke Tierney

Suppose that some measurements come from a distribution F that is of interest and others come from another, irrelevant distribution G. Some measurements from F are verified and known to be from F. The other, unverified measurements may be from F or from G. Not all measurements from F are equally likely to be verified, and no measurement from G is ever verified. This model applies to measurements of low concentrations obtained using gas chromatography/mass spectroscopy, for example, as is shown in this article. How well a feature T(F) of F can be estimated when there are unverified data depends on what can be assumed about F, G, and the conditional probability v(x) of verifying a measurement of x from F. If F, G, and v are unrestricted, then more than one choice of (F, G, v) gives the same distribution p of the observable x, and thus T(F) cannot be uniquely estimated from data. But if the set of values of T(F) that correspond to a distribution p of x is small enough, then it is reasonable to try t...
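
A short simulation may make the model concrete. Everything numeric here (the mixture weight, the two distributions, and the verification curve v(x)) is an assumption chosen to illustrate why (F, G, v) need not be identifiable from the observable data.

```python
import numpy as np

rng = np.random.default_rng(1)
n, mix = 1000, 0.7                 # mix = P(a measurement comes from F)

from_f = rng.random(n) < mix       # latent: which draws are from F
x = np.where(from_f,
             rng.normal(1.0, 1.0, n),   # F: the distribution of interest
             rng.normal(4.0, 1.0, n))   # G: the irrelevant distribution

v = 1.0 / (1.0 + np.exp(x - 2.0))  # v(x): verification prob. falls with x
verified = from_f & (rng.random(n) < v)  # only F-draws can be verified

# The analyst observes only (x, verified). Unverified x may be from F or
# G, which is why T(F) need not be uniquely determined without
# restrictions on F, G, and v.
print(verified.sum(), "of", n, "measurements verified")
```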


Technometrics | 1998

Events defined by duration and severity, with an application to network reliability

Richard A. Becker; Linda Ann Clark; Diane Lambert

Communications networks are highly reliable and almost never experience widespread failures. But from time to time performance degrades and the probability that a call is blocked or fails to reach its destination jumps from nearly 0 to an unacceptable level. High but variable blocking may then persist for a noticeable period of time. Extended periods of high blocking, or events, can be caused by congestion in response to natural disasters, fiber cuts, equipment failures, and software errors, for example. Because the consequences of an event depend on the level of blocking and its persistence, lists of events at specified blocking and duration thresholds, such as 50% for 30 minutes or 90% for 15 minutes, are often maintained. Reliability parameters at specified blocking and duration thresholds, such as the mean number of events per year and mean time spent in events, are estimated from the lists of reported events and used to compare network service providers, transmission facilities, or brands of equipmen...
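
The event definition here (blocking at or above a severity threshold for at least a minimum duration) is easy to sketch; the 5-minute sampling interval and the example series are assumptions for illustration.

```python
def find_events(blocking, severity=0.5, min_duration_min=30, step_min=5):
    """Return (start_minute, duration_minutes) runs where blocking stays
    at or above `severity` for at least `min_duration_min` minutes."""
    events, start = [], None
    for i, b in enumerate(blocking + [0.0]):   # sentinel closes a final run
        if b >= severity and start is None:
            start = i                          # a high-blocking run begins
        elif b < severity and start is not None:
            if (i - start) * step_min >= min_duration_min:
                events.append((start * step_min, (i - start) * step_min))
            start = None
    return events

# Blocking jumps to 90% for 35 minutes: one event at the 50%/30-min level.
series = [0.0] * 6 + [0.9] * 7 + [0.0] * 6
print(find_events(series))  # [(30, 35)]
```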


Journal of the American Statistical Association | 2000

Statistics in the Physical Sciences and Engineering

Diane Lambert

No doubt much of the progress in statistics in the 1900s can be traced back to statisticians who grappled with solving real problems, many of which have roots in the physical sciences and engineering. For example, George Box developed response surface designs working with chemical engineers, John Tukey developed exploratory data analysis working with telecommunications engineers, and Abraham Wald developed sequential testing working with military engineers. These statisticians had a strong sense of what was important in the area of application, as well as what statistics could provide. The beginning of the 2000s is a good time to reflect on some of the current problems in the physical sciences and engineering, and how they might lead to new advances in statistics, or at the least, what statisticians can contribute to solving these problems. My hope is that this set of vignettes will convey some sense of the excitement over the opportunities for statistics and statisticians that those of us who work with physical scientists and engineers feel. The vignettes are loosely grouped by field of application, starting with earth sciences and then moving on to telecommunications, quality control, drug screening, and advanced manufacturing. Superficially, these areas have little in common, but they do share some deep similarities. Most of these areas, for example, face new opportunities and challenges because of our increasing ability to collect tremendous amounts of data. More and more often, the unit of sample size in the physical sciences and engineering is not the number of observations, but rather the number of gigabytes of space needed to store the data. Despite the tremendous advances in raw computational power, processing so much data can still be painful, and visualizing, exploring, and modeling the data can be much worse. As Doug Nychka points out, though, one advantage to working with physical scientists and engineers is that many of them have years of experience designing systems for collecting, processing, analyzing, and modeling massive sets of data, and statisticians can learn from their experience. Moreover, as Bert Gunter and Dan Holder point out, many of the advances are encouraging statisticians to collaborate not only with subject matter specialists, but also with computer scientists. Another theme in several of the vignettes is that statistical models alone are likely to be insufficient. What is needed instead are models that incorporate both scientific understanding and randomness. David Vere-Jones, for example, writes that progress in earthquake prediction requires advances in understanding the geophysics that produce earthquakes and progress in building statistical models that respect the geophysics and are appropriate for highly clustered, self-similar ...


Archive | 1999

Automated fraud management in transaction-based networks

Gerald Donald Baulier; Michael H. Cahill; Virginia Kay Ferrara; Diane Lambert


Archive | 1999

Automated and selective intervention in transaction-based networks

Gerald Donald Baulier; Michael H. Cahill; Virginia Kay Ferrara; Diane Lambert


Archive | 1997

Calculation and visualization of tabular data

Richard Alan Becker; Linda Ann Clark; Kenneth Charles Cox; Diane Lambert

Collaboration


Dive into Diane Lambert's collaboration.

Top Co-Authors

Luke Tierney

University of Minnesota
