Publication


Featured research published by Claudia Perlich.


Journal of Machine Learning Research | 2003

Tree induction vs. logistic regression: a learning-curve analysis

Claudia Perlich; Foster Provost; Jeffrey S. Simonoff

Tree induction and logistic regression are two standard, off-the-shelf methods for building models for classification. We present a large-scale experimental comparison of logistic regression and tree induction, assessing classification accuracy and the quality of rankings based on class-membership probabilities. We use a learning-curve analysis to examine the relationship of these measures to the size of the training set. The results of the study show several things. (1) Contrary to some prior observations, logistic regression does not generally outperform tree induction. (2) More specifically, and not surprisingly, logistic regression is better for smaller training sets and tree induction for larger data sets. Importantly, this often holds for training sets drawn from the same domain (that is, the learning curves cross), so conclusions about induction-algorithm superiority on a given domain must be based on an analysis of the learning curves. (3) Contrary to conventional wisdom, tree induction is effective at producing probability-based rankings, although apparently comparatively less so for a given training-set size than at making classifications. Finally, (4) the domains on which tree induction and logistic regression are ultimately preferable can be characterized surprisingly well by a simple measure of the separability of signal from noise.
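
As an illustration of the kind of learning-curve comparison the paper describes, here is a minimal sketch using scikit-learn; the synthetic dataset, model settings, and sample sizes are assumptions for demonstration and not the paper's experimental setup.

```python
# Illustrative sketch only: compare learning curves of tree induction and
# logistic regression on classification accuracy and probability-ranking
# quality (AUC) as the training set grows. Data and settings are assumed.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=20000, n_features=20, n_informative=10,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5,
                                                    random_state=0)

for n in [100, 300, 1000, 3000, 10000]:
    for name, model in [("logistic", LogisticRegression(max_iter=1000)),
                        ("tree", DecisionTreeClassifier(min_samples_leaf=5))]:
        model.fit(X_train[:n], y_train[:n])
        acc = accuracy_score(y_test, model.predict(X_test))
        auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
        print(f"n={n:5d}  {name:8s}  accuracy={acc:.3f}  AUC={auc:.3f}")
```

Plotting accuracy and AUC against n would show whether and where the two curves cross, which is the paper's central point about domain-specific conclusions.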


Knowledge Discovery and Data Mining | 2003

Aggregation-based feature invention and relational concept classes

Claudia Perlich; Foster Provost

Model induction from relational data requires aggregation of the values of attributes of related entities. This paper makes three contributions to the study of relational learning. (1) It presents a hierarchy of relational concepts of increasing complexity, using relational schema characteristics such as cardinality, and derives classes of aggregation operators that are needed to learn these concepts. (2) Expanding one level of the hierarchy, it introduces new aggregation operators that model the distributions of the values to be aggregated and (for classification problems) the differences in these distributions by class. (3) It demonstrates empirically on a noisy business domain that more-complex aggregation methods can increase generalization performance. Constructing features using target-dependent aggregations can transform relational prediction tasks so that well-understood feature-vector-based modeling algorithms can be applied successfully.
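
The core idea of transforming a relational prediction task into a feature-vector task can be sketched with simple aggregation operators; the toy customer/transaction schema below is an assumption for illustration, not the paper's business data.

```python
# Minimal sketch (assumed toy schema): turning a one-to-many relation
# (customers -> transactions) into a flat feature vector via aggregation,
# so standard feature-vector learners can be applied.
import pandas as pd

transactions = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2, 3],
    "amount":      [10.0, 25.0, 5.0, 100.0, 80.0, 7.5],
    "product":     ["a", "b", "a", "c", "c", "a"],
})

# Simple aggregation operators over the bag of related rows per customer.
features = transactions.groupby("customer_id").agg(
    n_purchases=("amount", "count"),
    total_spend=("amount", "sum"),
    mean_spend=("amount", "mean"),
    mode_product=("product", lambda s: s.mode().iloc[0]),
).reset_index()
print(features)
```

The resulting per-customer aggregates can be joined to target labels and fed to any standard learner; the paper's contribution lies in the more expressive, target-dependent aggregates that go beyond these simple summaries.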


Machine Learning | 2006

Distribution-based aggregation for relational learning with identifier attributes

Claudia Perlich; Foster Provost

Identifier attributes—very high-dimensional categorical attributes such as particular product IDs or people's names—rarely are incorporated in statistical modeling. However, they can play an important role in relational modeling: it may be informative to have communicated with a particular set of people or to have purchased a particular set of products. A key limitation of existing relational modeling techniques is how they aggregate bags (multisets) of values from related entities. The aggregations used by existing methods are simple summaries of the distributions of features of related entities: e.g., MEAN, MODE, SUM, or COUNT. This paper's main contribution is the introduction of aggregation operators that capture more information about the value distributions, by storing meta-data about value distributions and referencing this meta-data when aggregating—for example by computing class-conditional distributional distances. Such aggregations are particularly important for aggregating values from high-dimensional categorical attributes, for which the simple aggregates provide little information. In the first half of the paper we provide general guidelines for designing aggregation operators, introduce the new aggregators in the context of the relational learning system ACORA (Automated Construction of Relational Attributes), and provide theoretical justification. We also conjecture special properties of identifier attributes, e.g., they proxy for unobserved attributes and for information deeper in the relationship network. In the second half of the paper we provide extensive empirical evidence that the distribution-based aggregators indeed do facilitate modeling with high-dimensional categorical attributes, and in support of the aforementioned conjectures.
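
A simplified sketch of the distribution-based aggregation idea follows; it is a toy illustration of class-conditional distributional distances over an identifier attribute, not the ACORA implementation, and the bags and labels are invented.

```python
# Sketch of distribution-based aggregation (simplified, not ACORA itself):
# represent each case's bag of identifier values as a normalized count
# vector and use its distance to class-conditional reference distributions
# as new features.
import numpy as np

# Toy training data: each case is (bag of product ids, class label).
cases = [(["p1", "p2", "p1"], 1), (["p3", "p3"], 0),
         (["p1", "p4"], 1), (["p3", "p2"], 0)]

vocab = sorted({v for bag, _ in cases for v in bag})
idx = {v: i for i, v in enumerate(vocab)}

def count_vector(bag):
    vec = np.zeros(len(vocab))
    for v in bag:
        vec[idx[v]] += 1
    return vec / vec.sum()

# Class-conditional reference distributions (the stored meta-data).
ref = {}
for label in (0, 1):
    total = sum(count_vector(bag) for bag, y in cases if y == label)
    ref[label] = total / total.sum()

# Aggregate each case as its L1 distance to each reference distribution.
for bag, y in cases:
    v = count_vector(bag)
    feats = [np.abs(v - ref[c]).sum() for c in (0, 1)]
    print(y, [round(f, 3) for f in feats])
```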


Knowledge Discovery and Data Mining | 2009

Spatial-temporal causal modeling for climate change attribution

Aurelie C. Lozano; Hongfei Li; Alexandru Niculescu-Mizil; Yan Liu; Claudia Perlich; J. R. M. Hosking; Naoki Abe

Attribution of climate change to causal factors has been based predominantly on simulations using physical climate models, which have inherent limitations in describing such a complex and chaotic system. We propose an alternative, data-centric approach that relies on actual measurements of climate observations and human and natural forcing factors. Specifically, we develop a novel method to infer causality from spatial-temporal data, as well as a procedure to incorporate extreme value modeling into our method in order to address the attribution of extreme climate events, such as heat waves. Our experimental results on a real-world dataset indicate that changes in temperature are not solely accounted for by solar radiance, but are attributed more significantly to CO2 and other greenhouse gases. Combined with extreme value modeling, we also show that there has been a significant increase in the intensity of extreme temperatures, and that such changes in extreme temperature are also attributable to greenhouse gases. These preliminary results suggest that our approach can offer a useful alternative to the simulation-based approach to climate modeling and attribution, and provide valuable insights from a fresh perspective.
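
The general flavor of data-driven causal attribution from lagged series can be sketched with a sparse lagged regression; this is a heavily simplified, Granger-style illustration on synthetic series, not the paper's spatial-temporal method or its extreme-value component.

```python
# Highly simplified sketch (synthetic data, assumed setup): Granger-style
# attribution by regressing a target series on lagged candidate forcing
# series with a sparse (Lasso) model and inspecting which lags receive
# non-zero coefficients.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
T, lag = 500, 3
co2 = np.cumsum(rng.normal(0.1, 1.0, T))        # trending driver
solar = rng.normal(0.0, 1.0, T)                 # non-driver in this toy setup
temp = np.empty(T)                              # target responds to lagged co2
temp[0] = 0.0
temp[1:] = 0.05 * co2[:-1] + rng.normal(0.0, 0.5, T - 1)

def lagged(series, lag):
    # Column i holds the series shifted by (lag - i) steps.
    return np.column_stack([series[i:len(series) - lag + i] for i in range(lag)])

X = np.hstack([lagged(co2, lag), lagged(solar, lag)])
y = temp[lag:]

model = LassoCV(cv=5).fit(X, y)
names = [f"co2_lag{l}" for l in range(lag, 0, -1)] + \
        [f"solar_lag{l}" for l in range(lag, 0, -1)]
for name, coef in zip(names, model.coef_):
    print(f"{name:12s} {coef: .4f}")
```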


Knowledge Discovery and Data Mining | 2012

Bid optimizing and inventory scoring in targeted online advertising

Claudia Perlich; Brian Dalessandro; Rod Hook; Ori Stitelman; Troy Raeder; Foster Provost

Billions of online display advertising spots are purchased on a daily basis through real-time bidding exchanges (RTBs). Advertising companies bid for these spots on behalf of a company or brand in order to purchase them to display banner advertisements. These bidding decisions must be made in fractions of a second after the potential purchaser is informed of what location (Internet site) has a spot available and who would see the advertisement. The entire transaction must be completed in near real time to avoid delays in loading the page and to maintain a good user experience. This paper presents a bid-optimization approach that is implemented in production at Media6Degrees for bidding on these advertising opportunities at an appropriate price. The approach combines several supervised learning algorithms, as well as second-price auction theory, to determine the correct price to ensure that the right message is delivered to the right person, at the right time.
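
The pricing logic can be illustrated with a toy calculation; the numbers and the simple value model below are assumptions for illustration, not the production system's actual models or prices.

```python
# Toy sketch of the bid-pricing logic described above (assumed numbers and a
# simplified value model): in a second-price auction the economically sound
# bid is the impression's expected value, estimated here as
# p(action | user, context) * value_per_action.
def expected_value_bid(p_action: float, value_per_action: float) -> float:
    """Bid the expected value (truthful bidding is optimal in a
    second-price auction)."""
    return p_action * value_per_action

# Example: a model scores this (user, site) impression at a 0.2% action
# probability and each action is worth $2.00 to the campaign.
p_action = 0.002
value_per_action = 2.00                                   # dollars, illustrative
bid_cpm = expected_value_bid(p_action, value_per_action) * 1000  # per 1000 impressions
print(f"bid: ${bid_cpm:.2f} CPM")
```

Because bidding the expected value is incentive-compatible in a second-price auction, the learned action probability can feed directly into the price.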


Machine Learning | 2014

Machine learning for targeted display advertising: transfer learning in action

Claudia Perlich; Brian Dalessandro; Troy Raeder; Ori Stitelman; Foster Provost

This paper presents the design of a fully deployed multistage transfer learning system for targeted display advertising, highlighting the important role of problem formulation and the sampling of data from distributions different from that of the target environment. Notably, the machine learning system itself is deployed and has been in continual use for years for thousands of advertising campaigns—in contrast to the more common case where predictive models are built outside the system, curated, and then deployed. In this domain, acquiring sufficient data for training from the ideal sampling distribution is prohibitively expensive. Instead, data are drawn from surrogate distributions and learning tasks, and then transferred to the target task. We present the design of the transfer learning system. We then present a detailed experimental evaluation, showing that the different transfer stages indeed each add value. We also present production results across a variety of advertising clients from a variety of industries, illustrating the performance of the system in use. We close the paper with a collection of lessons learned from over half a decade of research and development on this complex, deployed, and intensely used machine learning system.
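
A minimal sketch of the general two-stage transfer idea follows, under the assumption of an abundant surrogate task and a scarce target task in a shared feature space; the data are synthetic and the recalibration step is a simplification, not the deployed pipeline.

```python
# Minimal sketch (synthetic data, assumed two-stage design): fit a model on
# an abundant surrogate task, then recalibrate its scores on the scarce
# target task.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Abundant surrogate-task data (e.g., site visits).
X_src, y_src = make_classification(n_samples=50000, n_features=30,
                                   weights=[0.9, 0.1], random_state=1)
# Scarce target-task data (e.g., purchases), same feature space.
X_tgt, y_tgt = make_classification(n_samples=500, n_features=30,
                                   weights=[0.95, 0.05], random_state=2)

# Stage 1: learn on the surrogate distribution.
base = LogisticRegression(max_iter=1000).fit(X_src, y_src)

# Stage 2: transfer by recalibrating the base score on target-task labels.
score_tgt = base.decision_function(X_tgt).reshape(-1, 1)
calibrator = LogisticRegression().fit(score_tgt, y_tgt)

p_target = calibrator.predict_proba(score_tgt)[:, 1]
print(p_target[:5])
```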


International Conference on Data Mining | 2005

Ranking-based evaluation of regression models

Saharon Rosset; Claudia Perlich; Bianca Zadrozny

We suggest the use of ranking-based evaluation measures for regression models, as a complement to the commonly used residual-based evaluation. We argue that in some cases, such as the case study we present, ranking can be the main underlying goal in building a regression model, and ranking performance is the correct evaluation metric. However, even when ranking is not the contextually correct performance metric, the measures we explore still have significant advantages: they are robust against extreme outliers in the evaluation set, and they are interpretable. The two measures we consider correspond closely to non-parametric correlation coefficients commonly used in data analysis (Spearman's ρ and Kendall's τ), and they both have interesting graphical representations, which, similarly to ROC curves, offer various useful views of model performance in addition to a one-number summary in the area under the curve. An interesting extension which we explore is to evaluate models on their performance in “partially” ranking the data, which we argue can better represent the utility of the model in many cases. We illustrate our methods on a case study of evaluating IT Wallet size estimation models for IBM's customers.
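
The two rank-correlation measures mentioned are readily computed; the small sketch below uses synthetic, heavy-tailed data (an assumption for illustration) to show why rank-based evaluation is robust where a residual-based metric is dominated by outliers.

```python
# Minimal sketch of ranking-based evaluation for a regression model
# (synthetic data; the paper's measures correspond closely to Spearman's
# rho and Kendall's tau, computed here with scipy).
import numpy as np
from scipy.stats import spearmanr, kendalltau

rng = np.random.default_rng(0)
y_true = rng.lognormal(mean=3.0, sigma=1.0, size=200)      # heavy-tailed target
y_pred = y_true * rng.lognormal(0.0, 0.5, size=200)        # noisy predictions

# A residual-based metric is dominated by the extreme values...
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
# ...while rank-based metrics are robust to outliers in the evaluation set.
rho, _ = spearmanr(y_true, y_pred)
tau, _ = kendalltau(y_true, y_pred)
print(f"RMSE={rmse:.1f}  Spearman rho={rho:.3f}  Kendall tau={tau:.3f}")
```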


Knowledge Discovery and Data Mining | 2013

The Microsoft academic search dataset and KDD Cup 2013

Senjuti Basu Roy; Martine De Cock; Vani Mandava; Swapna Savanna; Brian Dalessandro; Claudia Perlich; William Cukierski; Ben Hamner

KDD Cup 2013 challenged participants to tackle the problem of author name ambiguity in a digital library of scientific publications. The competition consisted of two tracks, which were based on large-scale datasets from a snapshot of Microsoft Academic Search, taken in January 2013 and including 250K authors and 2.5M papers. Participants were asked to determine which papers in an author profile are truly written by a given author (track 1), as well as to identify duplicate author profiles (track 2). Track 1 and track 2 were launched respectively on April 18 and April 20, 2013, with a common final submission deadline on June 12, 2013. For track 1 a training dataset with correct labels was disclosed at the start of the competition. This track was the most popular one, attracting submissions from 561 different teams. Track 2, which was formulated as an unsupervised learning task, received submissions from 241 participants. This paper presents details about the problem definitions, the datasets, the evaluation metrics, and the results.


IBM Systems Journal | 2007

Analytics-driven solutions for customer targeting and sales-force allocation

Richard D. Lawrence; Claudia Perlich; Saharon Rosset; J. Arroyo; M. Callahan; J. M. Collins; A. Ershov; S. Feinzig; Ildar Khabibrakhmanov; Shilpa N. Mahatma; M. Niemaszyk; Sholom M. Weiss

Sales professionals need to identify new sales prospects, and sales executives need to deploy the sales force against the sales accounts with the best potential for future revenue. We describe two analytics-based solutions developed within IBM to address these related issues. The Web-based tool OnTARGET provides a set of analytical models to identify new sales opportunities at existing client accounts and noncustomer companies. The models estimate the probability of purchase at the product-brand level. They use training examples drawn from historical transactions and extract explanatory features from transactional data joined with company firmographic data (e.g., revenue and number of employees). The second initiative, the Market Alignment Program, supports sales-force allocation based on field-validated analytical estimates of future revenue opportunity in each operational market segment. Revenue opportunity estimates are generated by defining the opportunity as a high percentile of the conditional distribution of the customer's spending, that is, what we could realistically hope to sell to this customer. We describe the development of both sets of analytical models, the underlying data models, and the Web sites used to deliver the overall solution. We conclude with a discussion of the business impact of both initiatives.
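
The "high percentile of conditional spending" idea can be sketched with quantile regression; the synthetic firmographic features, the gradient-boosting model, and the 90th percentile below are assumptions for illustration rather than the solution's actual models.

```python
# Sketch of estimating revenue opportunity as a high percentile of the
# conditional spending distribution, via quantile regression on synthetic
# firmographic features (all values assumed for illustration).
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 2000
employees = rng.lognormal(4.0, 1.0, n)                 # firmographic feature
revenue = rng.lognormal(6.0, 1.0, n)                   # firmographic feature
spend = 0.01 * revenue * rng.lognormal(0.0, 0.8, n)    # observed IT spend

X = np.column_stack([employees, revenue])

# Estimate the 90th percentile of spend conditional on firmographics:
# a proxy for "what we could realistically hope to sell to this customer".
model = GradientBoostingRegressor(loss="quantile", alpha=0.9).fit(X, spend)
opportunity = model.predict(X)
print(opportunity[:5])
```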


Data Mining and Knowledge Discovery | 2010

Medical data mining: insights from winning two competitions

Saharon Rosset; Claudia Perlich; Grzegorz Świrszcz; Prem Melville; Yan Liu

Two major data mining competitions in 2008 presented challenges in medical domains: KDD Cup 2008, which concerned cancer detection from mammography data, and the INFORMS Data Mining Challenge 2008, dealing with diagnosis of pneumonia based on patient information from hospital files. Our team won both of these competitions, and in this paper we share our lessons learned and insights. We emphasize the aspects that pertain to the general practice and methodology of medical data mining, rather than to the specifics of each modeling competition. We concentrate on three topics: information leakage and its effect on competitions and proof-of-concept projects; consideration of real-life model performance measures in model construction and evaluation; and relational learning approaches to medical data mining tasks.
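
Information leakage, the first of the three topics, can be illustrated generically: if a field that should carry no clinical signal (such as a record identifier) predicts the outcome well, the data construction has likely leaked the target. The example below is synthetic and is not drawn from either competition's data.

```python
# Generic illustration of an information-leakage check (synthetic example):
# a record identifier that should be uninformative turns out to be highly
# predictive because of how the data were assembled, a red flag for leakage.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 5000
y = rng.integers(0, 2, n)
# Leaky "identifier": positives were assigned ids from a later range.
record_id = np.where(y == 1,
                     rng.integers(50000, 100000, n),
                     rng.integers(0, 50000, n)).astype(float).reshape(-1, 1)

auc = cross_val_score(LogisticRegression(), record_id / 1e5, y,
                      scoring="roc_auc", cv=5).mean()
print(f"AUC from the identifier alone: {auc:.2f}")  # near 1.0 => leakage
```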

Collaboration


Dive into Claudia Perlich's collaborations.

Top Co-Authors

Yan Liu

University of Southern California


Troy Raeder

University of Notre Dame
