Publication


Featured research published by Daria Sorokina.


Journal of Wildlife Management | 2007

Data-mining discovery of pattern and process in ecological systems

Wesley M. Hochachka; Rich Caruana; Daniel Fink; Art Munson; Mirek Riedewald; Daria Sorokina; Steve Kelling

Most ecologists use statistical methods as their main analytical tools when analyzing data to identify relationships between a response and a set of predictors; thus, they treat all analyses as hypothesis tests or exercises in parameter estimation. However, little or no prior knowledge about a system can lead to creation of a statistical model or models that do not accurately describe major sources of variation in the response variable. We suggest that under such circumstances data mining is more appropriate for analysis. In this paper we 1) present the distinctions between data-mining (usually exploratory) analyses and parametric statistical (confirmatory) analyses, 2) illustrate 3 strengths of data-mining tools for generating hypotheses from data, and 3) suggest useful ways in which data mining and statistical analyses can be integrated into a thorough analysis of data to facilitate rapid creation of accurate models and to guide further research.


International Conference on Data Mining | 2006

Plagiarism Detection in arXiv

Daria Sorokina; Johannes Gehrke; Simeon Warner; Paul Ginsparg

We describe a large-scale application of methods for finding plagiarism in research document collections. The methods are applied to a collection of 284,834 documents collected by arXiv.org over a 14 year period, covering a few different research disciplines. The methodology efficiently detects a variety of problematic author behaviors, and heuristics are developed to reduce the number of false positives. The methods are also efficient enough to implement as a real-time submission screen for a collection many times larger.


European Conference on Machine Learning | 2007

Additive Groves of Regression Trees

Daria Sorokina; Rich Caruana; Mirek Riedewald

We present a new regression algorithm called Groves of trees and show empirically that it is superior in performance to a number of other established regression methods. A Grove is an additive model usually containing a small number of large trees. Trees added to the Grove are trained on the residual error of other trees already in the Grove. We begin the training process with a single small tree in the Grove and gradually increase both the number of trees in the Grove and their size. This procedure ensures that the resulting model captures the additive structure of the response. A single Grove may still overfit to the training set, so we further decrease the variance of the final predictions with bagging. We show that in addition to exhibiting superior performance on a suite of regression test problems, bagged Groves of trees are very resistant to overfitting.
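
The core idea in this abstract, trees trained on the residual error of the other trees in the Grove and then bagged, can be illustrated with a minimal sketch. This is not the authors' implementation: the names `fit_tree`, `fit_grove`, and `bagged_groves` are hypothetical, the tree learner is a naive greedy one, and the gradual growth in tree count and tree size described above is omitted for brevity.

```python
import random
from statistics import mean

def fit_tree(X, y, depth):
    """Greedy regression tree of at most `depth` splits per path; returns a predictor."""
    if depth == 0 or len(set(y)) <= 1:
        leaf = mean(y)
        return lambda x: leaf
    best = None  # (sum of squared errors, feature index, threshold)
    for j in range(len(X[0])):
        # Dropping the largest value keeps both sides of every split non-empty.
        for t in sorted(set(row[j] for row in X))[:-1]:
            left = [v for row, v in zip(X, y) if row[j] <= t]
            right = [v for row, v in zip(X, y) if row[j] > t]
            ml, mr = mean(left), mean(right)
            sse = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
            if best is None or sse < best[0]:
                best = (sse, j, t)
    if best is None:  # all rows identical, no split possible
        leaf = mean(y)
        return lambda x: leaf
    _, j, t = best
    li = [i for i, row in enumerate(X) if row[j] <= t]
    ri = [i for i, row in enumerate(X) if row[j] > t]
    lt = fit_tree([X[i] for i in li], [y[i] for i in li], depth - 1)
    rt = fit_tree([X[i] for i in ri], [y[i] for i in ri], depth - 1)
    return lambda x: lt(x) if x[j] <= t else rt(x)

def fit_grove(X, y, n_trees, depth, n_passes=5):
    """Additive model: each tree is refit on the residual of all other trees."""
    trees = [(lambda x: 0.0) for _ in range(n_trees)]
    for _ in range(n_passes):
        for k in range(n_trees):
            resid = [
                v - sum(t(row) for i, t in enumerate(trees) if i != k)
                for row, v in zip(X, y)
            ]
            trees[k] = fit_tree(X, resid, depth)
    return lambda x: sum(t(x) for t in trees)

def bagged_groves(X, y, n_bags, n_trees, depth, seed=0):
    """Average several Groves, each trained on a bootstrap sample (bagging)."""
    rng = random.Random(seed)
    n = len(X)
    models = []
    for _ in range(n_bags):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_grove([X[i] for i in idx], [y[i] for i in idx], n_trees, depth))
    return lambda x: mean(m(x) for m in models)

if __name__ == "__main__":
    # Toy additive response: y = x0 + 2 * x1
    X = [[0, 0], [0, 1], [1, 0], [1, 1]]
    y = [0.0, 2.0, 1.0, 3.0]
    print(fit_grove(X, y, n_trees=2, depth=1)([1, 1]))
```

On the toy additive response, two single-split trees are enough because each tree ends up modeling one additive component, which is the structure the abstract says the procedure is designed to capture.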


Knowledge Discovery and Data Mining | 2006

Mining citizen science data to predict prevalence of wild bird species

Rich Caruana; Mohamed Farid Elhawary; Art Munson; Mirek Riedewald; Daria Sorokina; Daniel Fink; Wesley M. Hochachka; Steve Kelling

The Cornell Laboratory of Ornithology's mission is to interpret and conserve the earth's biological diversity through research, education, and citizen science focused on birds. Over the years, the Lab has accumulated one of the largest and longest-running collections of environmental data sets in existence. The data sets are not only large, but also have many attributes, contain many missing values, and potentially are very noisy. The ecologists are interested in identifying which features have the strongest effect on the distribution and abundance of bird species as well as describing the forms of these relationships. We show how data mining can be successfully applied, enabling the ecologists to discover unanticipated relationships. We compare a variety of methods for measuring attribute importance with respect to the probability of a bird being observed at a feeder and present initial results for the impact of important attributes on bird prevalence.


International Conference on Machine Learning | 2008

Detecting statistical interactions with additive groves of trees

Daria Sorokina; Rich Caruana; Mirek Riedewald; Daniel Fink

Discovering additive structure is an important step towards understanding a complex multi-dimensional function because it allows the function to be expressed as the sum of lower-dimensional components. When variables interact, however, their effects are not additive and must be modeled and interpreted simultaneously. We present a new approach for the problem of interaction detection. Our method is based on comparing the performance of unrestricted and restricted prediction models, where restricted models are prevented from modeling an interaction in question. We show that an additive model-based regression ensemble, Additive Groves, can be restricted appropriately for use with this framework, and thus has the right properties for accurately detecting variable interactions.
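
The comparison of restricted and unrestricted models described in this abstract can be sketched on a toy problem with two discrete variables. This sketch swaps Additive Groves for much simpler stand-ins, which is an assumption made purely for illustration: the unrestricted model is a table of cell means, and the restricted model is an additive fit f(a) + g(b) obtained by backfitting, so it cannot represent an interaction between a and b. All function names are hypothetical, and the errors are measured on the training data rather than on held-out data.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean

def fit_unrestricted(rows, y):
    """One mean per (a, b) cell: free to model any interaction between a and b."""
    cells = defaultdict(list)
    for (a, b), v in zip(rows, y):
        cells[(a, b)].append(v)
    table = {key: mean(vals) for key, vals in cells.items()}
    return lambda a, b: table[(a, b)]

def fit_restricted(rows, y, n_passes=20):
    """Additive fit f(a) + g(b) by backfitting: cannot model an a-b interaction."""
    f = defaultdict(float)
    g = defaultdict(float)
    for _ in range(n_passes):
        groups = defaultdict(list)
        for (a, b), v in zip(rows, y):
            groups[a].append(v - g[b])  # residual after removing g
        f = defaultdict(float, {a: mean(vals) for a, vals in groups.items()})
        groups = defaultdict(list)
        for (a, b), v in zip(rows, y):
            groups[b].append(v - f[a])  # residual after removing f
        g = defaultdict(float, {b: mean(vals) for b, vals in groups.items()})
    return lambda a, b: f[a] + g[b]

def rmse(model, rows, y):
    """Root mean squared error of `model` on the given data."""
    return sqrt(mean((model(a, b) - v) ** 2 for (a, b), v in zip(rows, y)))

def interaction_gap(rows, y):
    """How much worse the restricted model is; a large gap suggests an interaction."""
    restricted = rmse(fit_restricted(rows, y), rows, y)
    unrestricted = rmse(fit_unrestricted(rows, y), rows, y)
    return restricted - unrestricted

if __name__ == "__main__":
    rows = [(0, 0), (0, 1), (1, 0), (1, 1)]
    additive = [a + 2 * b for a, b in rows]        # no interaction
    multiplicative = [a * b for a, b in rows]      # pure interaction
    print(interaction_gap(rows, additive))
    print(interaction_gap(rows, multiplicative))
```

A gap near zero means the restricted model loses nothing by ignoring the interaction, so the variables can be treated as additive. How large a gap counts as significant is a separate decision that this sketch does not address.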


Archive | 2011

Scaling Up Machine Learning: Parallel Large-Scale Feature Selection

Jeremy Kubica; Sameer Singh; Daria Sorokina

The set of features used by a learning algorithm can have a dramatic impact on the performance of that algorithm. Including extraneous features can make the learning problem harder by adding useless, noisy dimensions that lead to over-fitting and increased computational complexity. Conversely, leaving out useful features can deprive the model of important signals. The problem of feature selection is to find a subset of features that allows the learning algorithm to learn the “best” model in terms of measures such as accuracy or model simplicity. The problem of feature selection continues to grow in both importance and difficulty as extremely high-dimensional data sets become the standard in real-world machine learning tasks. Scalability can become a problem for even simple approaches. For example, common feature selection approaches that evaluate each new feature by training a new model containing that feature require learning a linear number of models each time they add a new feature. This computational cost can add up quickly when we are iteratively adding many new features. Even techniques that use relatively computationally inexpensive tests of a feature’s value, such as mutual information, require at least linear time in the number of features being evaluated. As a simple illustrative example consider the task of classifying websites. In this case the data set could easily contain many millions of examples. Just including very basic features such as text unigrams on the page or HTML tags could easily provide many thousands of potential features for the model. Considering more complex attributes such as bigrams of words
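
The scalability point in this abstract, that evaluating each candidate feature by training a model containing it costs a linear number of models per added feature, can be made concrete with a small sketch of greedy forward selection that counts the models it trains. The function `forward_select`, the `train_and_score` callback, and the toy weights are all hypothetical and stand in for a real learner; this is not the parallel method the chapter goes on to describe.

```python
def forward_select(all_features, train_and_score, k):
    """Greedy forward selection; returns the chosen features and models trained.

    `train_and_score(features)` is assumed to train a model on the given
    feature list and return a score where higher is better.
    """
    selected = []
    remaining = list(all_features)
    n_models = 0
    for _ in range(k):
        scored = []
        for feature in remaining:
            # One full model fit per candidate feature: linear in len(remaining).
            scored.append((train_and_score(selected + [feature]), feature))
            n_models += 1
        _, best_feature = max(scored)
        selected.append(best_feature)
        remaining.remove(best_feature)
    return selected, n_models

if __name__ == "__main__":
    # Hypothetical per-feature usefulness; the "model score" is just their sum.
    weights = {"f0": 0.1, "f1": 0.7, "f2": 0.3, "f3": 0.0, "f4": 0.5}
    score = lambda feats: sum(weights[f] for f in feats)
    chosen, n_models = forward_select(weights, score, k=2)
    print(chosen, n_models)
```

Selecting k features out of n this way trains n + (n - 1) + ... + (n - k + 1) models, which is why the cost grows quickly once n reaches the thousands of candidate features mentioned in the website example.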


International Conference on Data Mining | 2009

Detecting and Interpreting Variable Interactions in Observational Ornithology Data

Daria Sorokina; Rich Caruana; Mirek Riedewald; Wesley M. Hochachka; Steve Kelling

In this paper we demonstrate a practical approach to interaction detection on real data describing the abundance of different species of birds in the prairies east of the southern Rocky Mountains. This data is very noisy: predictive models built from it perform only slightly better than baseline. Previous approaches for interaction detection, including a recently proposed algorithm based on Additive Groves, often do not work well on such noisy data for a number of reasons. We describe the issues that appear when working with such data sets and suggest solutions to them. In the end, we discuss results of our analysis for several bird species.


SIAM International Conference on Data Mining | 2009

Parallel Large Scale Feature Selection for Logistic Regression

Sameer Singh; Jeremy Kubica; Scott Larsen; Daria Sorokina


Knowledge Discovery and Data Mining | 2009

Application of Additive Groves ensemble with multiple counts feature evaluation to KDD Cup'09 small data set

Daria Sorokina


Archive | 2008

Modeling additive structure and detecting interactions with groves of trees

Richard Caruana; Daria Sorokina

Collaboration


Dive into Daria Sorokina's collaborations.

Top Co-Authors


Sameer Singh

University of Washington
