Publication


Featured research published by Craig Silverstein.


international conference on management of data | 1997

Beyond market baskets: generalizing association rules to correlations

Sergey Brin; Rajeev Motwani; Craig Silverstein

One of the most well-studied problems in data mining is mining for association rules in market basket data. Association rules, whose significance is measured via support and confidence, are intended to identify rules of the type, “A customer purchasing item A often also purchases item B.” Motivated by the goal of generalizing beyond market baskets and the association rules used with them, we develop the notion of mining rules that identify correlations (generalizing associations), and we consider both the absence and presence of items as a basis for generating rules. We propose measuring significance of associations via the chi-squared test for correlation from classical statistics. This leads to a measure that is upward closed in the itemset lattice, enabling us to reduce the mining problem to the search for a border between correlated and uncorrelated itemsets in the lattice. We develop pruning strategies and devise an efficient algorithm for the resulting problem. We demonstrate its effectiveness by testing it on census data and finding term dependence in a corpus of text documents, as well as on synthetic data.
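
As a concrete illustration of the chi-squared measure described above, here is a minimal sketch that scores a pair of items from a set of market baskets, counting both presence and absence as the abstract suggests. The function name and the toy baskets are invented for the example; this is not the authors' implementation.

```python
# A minimal sketch (not the paper's implementation) of the chi-squared
# correlation measure for a pair of items in market-basket data.
# The baskets below are made-up illustrative data.

def chi_squared_pair(baskets, a, b):
    """Chi-squared statistic for the 2x2 contingency table of items a and b,
    counting both presence and absence of each item."""
    n = len(baskets)
    observed = {(x, y): 0 for x in (True, False) for y in (True, False)}
    for basket in baskets:
        observed[(a in basket, b in basket)] += 1
    # Marginal probabilities of each item appearing in a basket.
    p_a = sum(observed[(True, y)] for y in (True, False)) / n
    p_b = sum(observed[(x, True)] for x in (True, False)) / n
    chi2 = 0.0
    for x in (True, False):
        for y in (True, False):
            expected = n * (p_a if x else 1 - p_a) * (p_b if y else 1 - p_b)
            if expected > 0:
                chi2 += (observed[(x, y)] - expected) ** 2 / expected
    return chi2

baskets = [{"coffee", "doughnut"}, {"coffee"}, {"tea"}, {"coffee", "doughnut"},
           {"tea", "doughnut"}, {"coffee", "doughnut"}, {"tea"}, {"coffee"}]
print(chi_squared_pair(baskets, "coffee", "doughnut"))
# Compare against the chi-squared critical value for 1 degree of freedom
# (about 3.84 at the 95% significance level).
```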


Data Mining and Knowledge Discovery | 1998

Beyond Market Baskets: Generalizing Association Rules to Dependence Rules

Craig Silverstein; Sergey Brin; Rajeev Motwani

One of the more well-studied problems in data mining is the search for association rules in market basket data. Association rules are intended to identify patterns of the type: “A customer purchasing item A often also purchases item B.” Motivated partly by the goal of generalizing beyond market basket data and partly by the goal of ironing out some problems in the definition of association rules, we develop the notion of dependence rules that identify statistical dependence in both the presence and absence of items in itemsets. We propose measuring significance of dependence via the chi-squared test for independence from classical statistics. This leads to a measure that is upward-closed in the itemset lattice, enabling us to reduce the mining problem to the search for a border between dependent and independent itemsets in the lattice. We develop pruning strategies based on the closure property and thereby devise an efficient algorithm for discovering dependence rules. We demonstrate our algorithm's effectiveness by testing it on census data, text data (wherein we seek term dependence), and synthetic data.
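
The upward-closure property mentioned in the abstract is what makes the border search tractable: once an itemset is significantly dependent, so is every superset, so only minimal dependent itemsets need to be reported. A rough levelwise sketch of that pruning is below; the is_dependent callable stands in for a chi-squared significance test, and the helper names and toy data are assumptions, not code from the paper.

```python
# Levelwise border search enabled by upward closure: supersets of a known
# dependent itemset are dependent too, so they are pruned from testing.

from itertools import combinations

def minimal_dependent_itemsets(items, is_dependent, max_size=4):
    border = []  # minimal dependent itemsets found so far
    for size in range(2, max_size + 1):
        for candidate in combinations(sorted(items), size):
            cand = frozenset(candidate)
            # Prune: any superset of a border itemset is dependent by upward
            # closure, so it cannot be a *minimal* dependent itemset.
            if any(b <= cand for b in border):
                continue
            if is_dependent(cand):
                border.append(cand)
    return border

# Toy demo with a fake, upward-closed dependence oracle.
items = {"A", "B", "C", "D"}
fake_dependent = {frozenset({"A", "B"}), frozenset({"A", "C", "D"})}
print(minimal_dependent_itemsets(
    items, lambda s: any(d <= s for d in fake_dependent)))
```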


very large data bases | 1998

Scalable Techniques for Mining Causal Structures

Craig Silverstein; Sergey Brin; Rajeev Motwani; Jeffrey D. Ullman

Mining for association rules in market basket data has proved a fruitful area of research. Measures such as conditional probability (confidence) and correlation have been used to infer rules of the form “the existence of item A implies the existence of item B.” However, such rules indicate only a statistical relationship between A and B. They do not specify the nature of the relationship: whether the presence of A causes the presence of B, or the converse, or some other attribute or phenomenon causes both to appear together. In applications, knowing such causal relationships is extremely useful for enhancing understanding and effecting change. While distinguishing causality from correlation is a truly difficult problem, recent work in statistics and Bayesian learning provides some avenues of attack. In these fields, the goal has generally been to learn complete causal models, which are essentially impossible to learn in large-scale data mining applications with a large number of variables. In this paper, we consider the problem of determining causal relationships, instead of mere associations, when mining market basket data. We identify some problems with the direct application of Bayesian learning ideas to mining large databases, concerning both the scalability of algorithms and the appropriateness of the statistical techniques, and introduce some initial ideas for dealing with these problems. We present experimental results from applying our algorithms on several large, real-world data sets. The results indicate that the approach proposed here is both computationally feasible and successful in identifying interesting causal structures. An interesting outcome is that it is perhaps easier to infer the lack of causality than to infer causality, information that is useful in preventing erroneous decision making.
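
For readers unfamiliar with the constraint-based ideas the abstract alludes to, the basic primitive is a conditional-independence test. The sketch below checks whether two binary variables are independent given a third, using a per-stratum chi-squared statistic; it is a generic illustration, not the paper's algorithm, and the data layout and example rows are assumptions.

```python
# Generic conditional-independence check between binary variables a and b
# given c, summing a chi-squared statistic over the two strata of c.
# The rows-of-dicts layout and the sample data are invented for illustration.

def conditional_chi2(rows, a, b, c):
    total = 0.0
    for c_val in (0, 1):
        stratum = [r for r in rows if r[c] == c_val]
        n = len(stratum)
        if n == 0:
            continue
        p_a = sum(r[a] for r in stratum) / n
        p_b = sum(r[b] for r in stratum) / n
        for a_val in (0, 1):
            for b_val in (0, 1):
                obs = sum(1 for r in stratum
                          if r[a] == a_val and r[b] == b_val)
                exp = n * (p_a if a_val else 1 - p_a) * (p_b if b_val else 1 - p_b)
                if exp > 0:
                    total += (obs - exp) ** 2 / exp
    return total  # compare against a chi-squared critical value (2 d.o.f. here)

rows = [{"a": 1, "b": 1, "c": 1}, {"a": 0, "b": 0, "c": 1},
        {"a": 1, "b": 0, "c": 0}, {"a": 0, "b": 1, "c": 0}]
print(conditional_chi2(rows, "a", "b", "c"))
```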


international acm sigir conference on research and development in information retrieval | 1997

Projections for efficient document clustering

Hinrich Schütze; Craig Silverstein

Clustering is increasing in importance, but linear- and even constant-time clustering algorithms are often too slow for real-time applications. A simple way to speed up clustering is to speed up the distance calculations at the heart of clustering routines. We study two techniques for improving the cost of distance calculations, LSI and truncation, and determine both how much these techniques speed up clustering and how much they affect the quality of the resulting clusters. We find that the speed increase is significant while, surprisingly, the quality of clustering is not adversely affected. We conclude that truncation yields clusters as good as those produced by full-profile clustering while offering a significant speed advantage.
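
A minimal sketch of the truncation projection, under the assumption that document profiles are sparse term-weight maps: keep only the k highest-weighted terms of each profile before computing the cosine similarities used by the clustering routine. The helper names, the value of k, and the toy documents are invented for illustration.

```python
# Truncation projection: drop all but the k highest-weighted terms of each
# document vector, so the distance computations inside clustering touch far
# fewer dimensions. Vectors are plain term -> weight dicts here.

import math

def truncate(vector, k):
    top = sorted(vector.items(), key=lambda kv: kv[1], reverse=True)[:k]
    return dict(top)

def cosine_similarity(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

doc1 = {"cluster": 0.9, "document": 0.7, "speed": 0.2, "quality": 0.1}
doc2 = {"cluster": 0.8, "speed": 0.6, "index": 0.3, "quality": 0.2}
print(cosine_similarity(truncate(doc1, 2), truncate(doc2, 2)))
```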


SIAM Journal on Computing | 1999

Buckets, Heaps, Lists, and Monotone Priority Queues

Boris V. Cherkassky; Andrew V. Goldberg; Craig Silverstein

We introduce the heap-on-top (hot) priority queue data structure that combines the multilevel bucket data structure of Denardo and Fox with a heap. Our data structure has operation bounds superior to those of either structure taken alone. We use the new data structure to obtain an improved bound for Dijkstra's shortest path algorithm. We also discuss a practical implementation of hot queues. Our experimental results in the context of Dijkstra's algorithm show that this implementation of hot queues performs very well and is more robust than implementations based only on heap or multilevel bucket data structures.
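
To make the appeal of bucket structures concrete, here is a much simpler relative of the hot queue: a single-level bucket (Dial-style) priority queue driving Dijkstra's algorithm with positive integer weights. It only sketches why the monotone key pattern of Dijkstra suits buckets; it is not the paper's hot-queue implementation, and the example graph is made up.

```python
# Dijkstra with a single-level bucket queue for positive integer edge weights.
# Keys are extracted in monotone (nondecreasing) order, so buckets can be
# scanned left to right and never revisited.

def dijkstra_buckets(graph, source, max_edge_weight):
    # graph: {node: [(neighbor, weight), ...]} with positive integer weights
    n_buckets = max_edge_weight * len(graph) + 1
    dist = {v: float("inf") for v in graph}
    dist[source] = 0
    buckets = [[] for _ in range(n_buckets)]
    buckets[0].append(source)
    for d in range(n_buckets):
        for u in buckets[d]:
            if d != dist[u]:        # stale entry left by a later improvement
                continue
            for v, w in graph[u]:
                if d + w < dist[v]:
                    dist[v] = d + w
                    buckets[d + w].append(v)
    return dist

graph = {"s": [("a", 2), ("b", 5)], "a": [("b", 1)], "b": []}
print(dijkstra_buckets(graph, "s", max_edge_weight=5))
```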


international acm sigir conference on research and development in information retrieval | 1997

Almost-constant-time clustering of arbitrary corpus subsets

Craig Silverstein; Jan O. Pedersen

Methods exist for constant-time clustering of corpus subsets selected via Scatter/Gather browsing [3]. In this paper we expand on those techniques, giving an algorithm for almost-constant-time clustering of arbitrary corpus subsets. This algorithm is never slower than clustering the document set from scratch, and for medium-sized and large sets it is significantly faster. The algorithm makes it possible to cluster arbitrary subsets of large corpora (obtained, for instance, by a boolean search) quickly enough to be useful in an interactive setting.


symposium on computational geometry | 1997

A practical evaluation of kinetic data structures

Julien Basch; Leonidas J. Guibas; Craig Silverstein; Li Zhang

In many applications of computational geometry to modeling objects and processes in the physical world, the participating objects are in a state of continuous change. Motion is the most ubiquitous kind of continuous transformation but others, such as shape deformation, are also possible. In a recent paper, Basch, Guibas, and Hershberger [BGH97] proposed the framework of kinetic data structures (KDSs) as a way to maintain, in a completely on-line fashion, desirable information about the state of a geometric system in continuous motion or change. They gave examples of kinetic data structures for the maximum of a set of (changing) numbers, and for the convex hull and closest pair of a set of (moving) points in the plane. The KDS framework allows each object to change its motion at will according to interactions with other moving objects, the environment, etc. We implemented the KDSs described in [BGH97], as well as some alternative methods serving the same purpose, as a way to validate the kinetic data structures framework in practice. In this note, we report some preliminary results on the maintenance of the convex hull, describe the experimental setup, compare three alternative methods, discuss the value of the measures of quality for KDSs proposed by [BGH97], and highlight some important numerical issues.
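
As a toy illustration of the certificate-and-event idea behind kinetic data structures, the sketch below maintains the maximum of values moving linearly in time, recomputing certificates only when one fails. It is a naive stand-in for the structures evaluated in the paper, and the function name and motion data are invented.

```python
# Kinetic maintenance of the maximum of values x_i(t) = a_i + b_i * t.
# A "certificate" asserts the current maximum beats element i; its failure
# time is when i overtakes the maximum. We jump from one failure event to
# the next instead of re-evaluating at every time step.

def kinetic_maximum(motions, t_end):
    # motions: list of (a, b) pairs; value at time t is a + b * t
    t = 0.0
    events = []
    while t < t_end:
        values = [a + b * t for a, b in motions]
        cur = max(range(len(motions)), key=lambda i: values[i])
        events.append((t, cur))
        # Earliest future time at which some element overtakes the maximum.
        next_t = t_end
        a_c, b_c = motions[cur]
        for i, (a, b) in enumerate(motions):
            if i != cur and b > b_c:              # element i gains on cur
                cross = (a_c - a) / (b - b_c)
                if t < cross < next_t:
                    next_t = cross
        t = next_t + 1e-9  # step just past the crossing to break ties
    return events          # (time, index of the maximum from that time on)

print(kinetic_maximum([(5.0, -1.0), (0.0, 1.0), (3.0, 0.0)], t_end=10.0))
```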


Archive | 1997

Implementations of Dijkstra’s Algorithm Based on Multi-Level Buckets

Andrew V. Goldberg; Craig Silverstein

A 2-level bucket data structure [6] has been shown to perform well in a Dijkstra’s algorithm implementation [4]. In this paper we study how the implementation performance depends on the number of bucket levels used. In particular we are interested in the best number of levels to use in practice.


The Library Quarterly | 1996

Predicting individual book use for off-site storage using decision trees

Craig Silverstein; Stuart M. Shieber

We explore various methods for predicting library book use, as measured by circulation records. Accurate prediction is invaluable when choosing titles to be stored in an off-site location. Previous researchers in this area concluded that past-use information provides by far the most reliable predictor of future use. Because of the computerization of library data, it is now possible not only to reproduce these earlier experiments with a more substantial data set, but also to compare their algorithms with more sophisticated decision methods. We have found that while previous use is indeed an excellent predictor of future use, it can be improved on by combining previous-use information with bibliographic information in a technique that can be customized for individual collections. This has immediate application for libraries that are short on storage space and wish to identify low-demand titles to move to remote storage. For instance, simulations show that the best prediction method we develop, when used as the off-site storage selection method for the Harvard College Library, would have generated only a fifth as many off-site accesses as compared to a method based on previous use.
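
A hedged sketch of the general approach, not the study's model: train a decision tree on per-title features combining past use with bibliographic information, then rank candidate titles for off-site storage by their predicted probability of future use. The feature names, values, and labels below are invented, and scikit-learn is assumed to be available.

```python
# Sketch: decision tree over hypothetical per-title circulation features.
# 1 = likely to be requested again, 0 = unlikely (good off-site candidate).

from sklearn.tree import DecisionTreeClassifier

# columns: years_since_last_use, total_checkouts, years_since_publication
X = [[1, 12, 3], [8, 1, 20], [2, 5, 7], [15, 0, 40], [0, 20, 1], [10, 2, 25]]
y = [1, 0, 1, 0, 1, 0]

model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(X, y)

# Rank candidate titles by predicted probability of future use; the titles
# with the lowest probabilities would be moved to remote storage first.
candidates = [[12, 1, 30], [1, 8, 4]]
print(model.predict_proba(candidates)[:, 1])
```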


workshop on algorithms and data structures | 1997

Constrained TSP and Low-Power Computing

Moses Charikar; Rajeev Motwani; Prabhakar Raghavan; Craig Silverstein

In the precedence-constrained traveling salesman problem (PTSP) we are given a partial order on n nodes, each of which is labeled by one of k points in a metric space. We are to find a visit order consistent with the precedence constraints that minimizes the total cost of the corresponding path in the metric space. We give negative results on approximability by relating the problem to the Shortest Common Supersequence problem, helping to explain why there has been very little success in approximation algorithms for this problem. We also give approximation algorithms for a number of special cases, including cases appropriate for a problem in low-power computing; in the process, we show that algorithms for the k-server problem and the traveling salesman problem can be used to derive approximation algorithms for the PTSP. We give tight bounds on the approximation ratios achieved by natural classes of algorithms for this optimization problem (which include algorithms proposed and used in empirical studies of this problem). We briefly summarize results of experiments with several algorithms on a standard set of compiler benchmarks, comparing several known and new algorithms.
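
To make the problem statement concrete, here is a simple greedy baseline for the PTSP: repeatedly move to the nearest unvisited node whose predecessors have all been visited. It is not claimed to match the paper's algorithms or bounds; the points, precedence edges, and start node are made up for the example.

```python
# Greedy heuristic for the precedence-constrained TSP in the Euclidean plane:
# from the current node, go to the nearest "ready" node (all predecessors
# already visited), accumulating path cost along the way.

import math

def greedy_ptsp(points, precedences, start):
    # points: {node: (x, y)}; precedences: list of (u, v) meaning u before v
    preds = {v: set() for v in points}
    for u, v in precedences:
        preds[v].add(u)
    visited = [start]
    remaining = set(points) - {start}
    cost = 0.0
    while remaining:
        cur = visited[-1]
        ready = [v for v in remaining if preds[v] <= set(visited)]
        nxt = min(ready, key=lambda v: math.dist(points[cur], points[v]))
        cost += math.dist(points[cur], points[nxt])
        visited.append(nxt)
        remaining.remove(nxt)
    return visited, cost

points = {"a": (0, 0), "b": (1, 0), "c": (1, 1), "d": (0, 1)}
print(greedy_ptsp(points, [("c", "b")], start="a"))
```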

Collaboration


Dive into Craig Silverstein's collaboration.

Top Co-Authors

Boris V. Cherkassky

Central Economics and Mathematics Institute
