Padhraic Smyth
University of California, Irvine
Publications
Featured research published by Padhraic Smyth.
AI Magazine | 1996
Usama M. Fayyad; Gregory Piatetsky-Shapiro; Padhraic Smyth
Data mining and knowledge discovery in databases have been attracting a significant amount of research, industry, and media attention of late. What is all the excitement about? This article provides an overview of this emerging field, clarifying how data mining and knowledge discovery in databases are related both to each other and to related fields, such as machine learning, statistics, and databases. The article mentions particular real-world applications, specific data-mining techniques, challenges involved in real-world applications of knowledge discovery, and current and future research directions in the field.
Communications of the ACM | 1996
Usama M. Fayyad; Gregory Piatetsky-Shapiro; Padhraic Smyth
As we march into the age of digital information, the problem of data overload looms ominously ahead. Our ability to analyze and understand massive datasets lags far behind our ability to gather and store the data. A new generation of computational techniques and tools is required to support the extraction of useful knowledge from the rapidly growing volumes of data. These techniques and tools are the subject of the emerging field of knowledge discovery in databases (KDD) and data mining. Large databases of digital information are ubiquitous. Data from the neighborhood store’s checkout register, your bank’s credit card authorization device, records in your doctor’s office, patterns in your telephone calls, and many more applications generate streams of digital records archived in huge databases, sometimes in so-called data warehouses. Current hardware and database technology allow efficient and inexpensive reliable data storage and access. However, whether the context is business, medicine, science, or government, the datasets themselves (in raw form) are of little direct value. What is of value is the knowledge that can be inferred from the data and put to use. For example, the marketing database of a consumer ...
Knowledge Discovery and Data Mining | 2008
Ian Porteous; David Newman; Alexander T. Ihler; Arthur U. Asuncion; Padhraic Smyth; Max Welling
In this paper we introduce a novel collapsed Gibbs sampling method for the widely used latent Dirichlet allocation (LDA) model. Our new method results in significant speedups on real-world text corpora. Conventional Gibbs sampling schemes for LDA require O(K) operations per sample, where K is the number of topics in the model. Our proposed method draws equivalent samples but requires on average significantly fewer than K operations per sample. On real-world corpora FastLDA can be as much as 8 times faster than the standard collapsed Gibbs sampler for LDA. No approximations are necessary, and we show that our fast sampling scheme produces exactly the same results as the standard (but slower) sampling scheme. Experiments on four real-world data sets demonstrate speedups for a wide range of collection sizes. For the PubMed collection of over 8 million documents with a required computation time of 6 CPU months for LDA, our speedup of 5.7 can save 5 CPU months of computation.
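For context, the O(K) cost mentioned in the abstract comes from evaluating one weight per topic for every token. The following is a minimal sketch of that conventional collapsed Gibbs step, not of FastLDA itself. The function name, argument names, and the symmetric hyperparameters `alpha` and `beta` are assumptions made for illustration.

```python
import random


def sample_topic(n_dk, n_wk, n_k, alpha, beta, vocab_size, rng):
    """Draw one topic from the standard collapsed Gibbs conditional for LDA.

    n_dk[k]: tokens in the current document assigned to topic k
    n_wk[k]: occurrences of the current word assigned to topic k
    n_k[k]:  all tokens assigned to topic k
    All counts are assumed to exclude the token being resampled.
    """
    # One weight per topic: this list comprehension is the O(K) cost per sample.
    weights = [
        (n_dk[k] + alpha) * (n_wk[k] + beta) / (n_k[k] + vocab_size * beta)
        for k in range(len(n_k))
    ]
    # Inverse-CDF draw from the unnormalised weights.
    u = rng.random() * sum(weights)
    cumulative = 0.0
    for k, w in enumerate(weights):
        cumulative += w
        if u < cumulative:
            return k
    return len(weights) - 1  # guard against floating-point round-off
```

A full sampler would decrement the counts for a token, call `sample_topic`, and increment the counts for the newly drawn topic.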
IEEE Transactions on Knowledge and Data Engineering | 1992
Padhraic Smyth; Rodney M. Goodman
An algorithm for the induction of rules from examples is introduced. The algorithm is novel in the sense that it not only learns rules for a given concept (classification), but it simultaneously learns rules relating multiple concepts. This type of learning, known as generalized rule induction, is considerably more general than existing algorithms, which tend to be classification oriented. Initially, it is focused on the problem of determining a quantitative, well-defined rule preference measure. In particular, a quantity called the J-measure is proposed as an information-theoretic alternative to existing approaches. The J-measure quantifies the information content of a rule or a hypothesis. The information theoretic origins of this measure are outlined, and its plausibility as a hypothesis preference measure is examined. The ITRULE algorithm, which uses the measure to learn a set of optimal rules from a set of data samples, is defined. Experimental results on real-world data are analyzed.
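The abstract does not state the formula. As a sketch, the J-measure of a rule "if Y = y then X = x" is commonly written as p(y) times the relative entropy between the posterior and prior distributions of the binary event X = x. The function and parameter names below are illustrative assumptions, and base-2 logarithms (bits) are assumed.

```python
import math


def j_measure(p_y, p_x, p_x_given_y):
    """Information content (in bits) of the rule 'if Y=y then X=x'.

    p_y:         probability that the rule's condition holds
    p_x:         prior probability of the conclusion, assumed strictly in (0, 1)
    p_x_given_y: probability of the conclusion when the condition holds
    """
    def term(posterior, prior):
        # Convention: 0 * log(0 / q) = 0.
        return 0.0 if posterior == 0.0 else posterior * math.log2(posterior / prior)

    divergence = term(p_x_given_y, p_x) + term(1.0 - p_x_given_y, 1.0 - p_x)
    return p_y * divergence
```

A rule whose condition leaves the probability of the conclusion unchanged scores zero, and a rule that fires often and shifts that probability strongly scores high.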
Knowledge Discovery and Data Mining | 1999
Scott Gaffney; Padhraic Smyth
In this paper we address the problem of clustering trajectories, namely sets of short sequences of data measured as a function of a dependent variable such as time. Examples include storm path trajectories, longitudinal data such as drug therapy response, functional expression data in computational biology, and movements of objects or individuals in video sequences. Our clustering algorithm is based on a principled method for probabilistic modeling of a set of trajectories as individual sequences of points generated from a finite mixture model consisting of regression model components. Unsupervised learning is carried out using maximum likelihood principles. Specifically, the EM algorithm is used to cope with the hidden data problem (i.e., the cluster memberships). We also develop generalizations of the method to handle non-parametric (kernel) regression components as well as multi-dimensional outputs. Simulation results comparing our method with other clustering methods such as K-means and Gaussian mixtures are presented as well as experimental results on real data sets.
Pattern Recognition | 1994
Padhraic Smyth
The invention is a system failure monitoring method and apparatus which learns the symptom-fault mapping directly from training data. The invention first estimates the state of the system at discrete intervals in time. A feature vector x of dimension k is estimated from sets of successive windows of sensor data. A pattern recognition component then models the instantaneous estimate of the posterior class probability given the features, $p(\omega_i \mid \mathbf{x})$, $1 \le i \le m$. Finally, a hidden Markov model is used to take advantage of temporal context and estimate class probabilities conditioned on recent past history. In this hierarchical pattern of information flow, the time series data is transformed and mapped into a categorical representation (the fault classes) and integrated over time to enable robust decision-making.
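One way to picture the final stage is a forward (filtering) recursion that combines the instantaneous class posteriors with a class transition matrix. The sketch below is an illustration under stated assumptions, not the method as specified in the source: it assumes the per-window posteriors are turned into scaled likelihoods by dividing by the class priors, and the names `forward_filter`, `posteriors`, `priors`, and `transition` are hypothetical.

```python
def forward_filter(posteriors, priors, transition):
    """Smooth per-window class posteriors with an HMM forward recursion.

    posteriors: list of lists, posteriors[t][i] = instantaneous p(class i | x_t)
    priors:     priors[i] = class prior used by the classifier (all > 0)
    transition: transition[i][j] = p(class j at t+1 | class i at t)
    Returns, for each t, p(class | x_1..x_t) as a list.
    """
    m = len(priors)
    belief = list(priors)  # assumption: the chain starts from the class priors
    filtered = []
    for post in posteriors:
        # Predict: propagate the previous belief through the transition matrix.
        predicted = [sum(belief[i] * transition[i][j] for i in range(m)) for j in range(m)]
        # Update: weight by the scaled likelihood p(class | x) / p(class).
        unnorm = [predicted[j] * post[j] / priors[j] for j in range(m)]
        total = sum(unnorm)
        belief = [u / total for u in unnorm]
        filtered.append(belief)
    return filtered
```

With a "sticky" transition matrix, a single ambiguous window does not erase evidence accumulated from earlier windows.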
Knowledge Discovery and Data Mining | 2000
Igor V. Cadez; David Heckerman; Christopher Meek; Padhraic Smyth; Steven D. White
We present a new methodology for visualizing navigation patterns on a Web site. In our approach, we first partition site users into clusters such that only users with similar navigation paths through the site are placed into the same cluster. Then, for each cluster, we display these paths for users within that cluster. The clustering approach we employ is model based (as opposed to distance based) and partitions users according to the order in which they request Web pages. In particular, we cluster users by learning a mixture of first-order Markov models using the Expectation-Maximization algorithm. Our algorithm scales linearly with both number of users and number of clusters, and our implementation easily handles millions of users and thousands of clusters in memory. In the paper, we describe the details of our technology and a tool based on it called WebCANVAS. We illustrate the use of our technology on user-traffic data from msnbc.com.
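The clustering step described above can be illustrated with the E-step of EM for a mixture of first-order Markov chains: each user's page sequence gets a membership probability for every cluster. This is a minimal sketch, not the WebCANVAS implementation. Pages are assumed to be encoded as integer category indices, all probabilities are assumed to be nonzero, and the function names are hypothetical.

```python
import math


def sequence_log_likelihood(seq, initial, transition):
    """Log-probability of a page sequence under one first-order Markov chain."""
    ll = math.log(initial[seq[0]])
    for prev_page, next_page in zip(seq, seq[1:]):
        ll += math.log(transition[prev_page][next_page])
    return ll


def responsibilities(seq, weights, components):
    """E-step for one user: posterior probability of each cluster given seq.

    weights:    mixture weights, one per cluster
    components: list of (initial, transition) pairs, one per cluster
    """
    logs = [
        math.log(w) + sequence_log_likelihood(seq, initial, transition)
        for w, (initial, transition) in zip(weights, components)
    ]
    peak = max(logs)  # subtract the max before exponentiating, for stability
    exps = [math.exp(value - peak) for value in logs]
    total = sum(exps)
    return [e / total for e in exps]
```

The cost of this step is proportional to the sequence length times the number of clusters, which is consistent with the linear scaling the abstract reports.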
Neural Computation | 1997
Padhraic Smyth; David Heckerman; Michael I. Jordan
Graphical techniques for modeling the dependencies of random variables have been explored in a variety of different areas, including statistics, statistical physics, artificial intelligence, speech recognition, image processing, and genetics. Formalisms for manipulating these models have been developed relatively independently in these research communities. In this paper we explore hidden Markov models (HMMs) and related structures within the general framework of probabilistic independence networks (PINs). The paper presents a self-contained review of the basic principles of PINs. It is shown that the well-known forward-backward (F-B) and Viterbi algorithms for HMMs are special cases of more general inference algorithms for arbitrary PINs. Furthermore, the existence of inference and estimation algorithms for more general graphical models provides a set of analysis tools for HMM practitioners who wish to explore a richer class of HMM structures. Examples of relatively complex models to handle sensor fusion and coarticulation in speech recognition are introduced and treated within the graphical model framework to illustrate the advantages of the general approach.
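As a reminder of one of the HMM algorithms the abstract refers to, here is a minimal sketch of Viterbi decoding for a discrete HMM. It shows only the standard HMM special case, not the general PIN inference algorithms discussed in the paper. The argument layout is an assumption, and all probabilities are assumed to be strictly positive so that the logarithms are defined.

```python
import math


def viterbi(observations, initial, transition, emission):
    """Most likely hidden state path for a discrete HMM (log domain).

    initial[s]:       p(first state = s)
    transition[r][s]: p(next state = s | current state = r)
    emission[s][o]:   p(observation = o | state = s)
    """
    n = len(initial)
    score = [math.log(initial[s]) + math.log(emission[s][observations[0]]) for s in range(n)]
    back = []  # back[t][s] = best predecessor of state s at step t + 1
    for obs in observations[1:]:
        prev = score
        pointers, score = [], []
        for s in range(n):
            best = max(range(n), key=lambda r: prev[r] + math.log(transition[r][s]))
            pointers.append(best)
            score.append(prev[best] + math.log(transition[best][s]) + math.log(emission[s][obs]))
        back.append(pointers)
    # Trace the best final state back to the start.
    state = max(range(n), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]
```

Replacing the `max` over predecessors with a sum (in the probability domain) turns the same recursion into the forward pass of the forward-backward algorithm.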
Statistics and Computing | 2000
Padhraic Smyth
Cross-validated likelihood is investigated as a tool for automatically determining the appropriate number of components (given the data) in finite mixture modeling, particularly in the context of model-based probabilistic clustering. The conceptual framework for the cross-validation approach to model selection is straightforward in the sense that models are judged directly on their estimated out-of-sample predictive performance. The cross-validation approach, as well as penalized likelihood and McLachlan's bootstrap method, are applied to two data sets and the results from all three methods are in close agreement. The second data set involves a well-known clustering problem from the atmospheric science literature using historical records of upper atmosphere geopotential height in the Northern Hemisphere. Cross-validated likelihood provides an interpretable and objective solution to the atmospheric clustering problem. The clusters found are in agreement with prior analyses of the same data based on non-probabilistic clustering techniques.
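The idea of scoring each candidate number of components by held-out log-likelihood can be sketched as follows for a one-dimensional Gaussian mixture. This is an illustration under assumptions, not the paper's procedure: the fold scheme (interleaved folds after a seeded shuffle), the quantile-based initialisation, the variance floor, and all function names are choices made here for the example.

```python
import math
import random


def normal_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_gmm_1d(data, k, iters=100, var_floor=1e-3):
    """Fit a 1-D Gaussian mixture with EM; returns (weights, means, variances)."""
    ordered, n = sorted(data), len(data)
    # Assumption: initialise the means at evenly spaced quantiles of the data.
    means = [ordered[(2 * j + 1) * n // (2 * k)] for j in range(k)]
    centre = sum(data) / n
    variances = [max(var_floor, sum((x - centre) ** 2 for x in data) / n)] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        resp = []  # E-step: component memberships for every point
        for x in data:
            dens = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        for j in range(k):  # M-step: re-estimate each component
            nj = sum(r[j] for r in resp) + 1e-12
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            spread = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var_floor, spread)
    return weights, means, variances


def log_likelihood(data, model):
    weights, means, variances = model
    total = 0.0
    for x in data:
        density = sum(w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances))
        total += math.log(max(density, 1e-300))
    return total


def cv_log_likelihood(data, k, folds=5, seed=0):
    """Average held-out log-likelihood per point for a k-component mixture."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    held_out = 0.0
    for f in range(folds):
        test = shuffled[f::folds]
        train = [x for i, x in enumerate(shuffled) if i % folds != f]
        held_out += log_likelihood(test, fit_gmm_1d(train, k))
    return held_out / len(shuffled)


def select_k(data, candidates, folds=5, seed=0):
    """Pick the candidate component count with the best held-out score."""
    return max(candidates, key=lambda k: cv_log_likelihood(data, k, folds, seed))
```

For example, `select_k(data, [1, 2, 3, 4])` returns whichever candidate predicts the held-out points best, so extra components are kept only when they improve out-of-sample fit.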
ACM Transactions on Information Systems | 2010
Michal Rosen-Zvi; Chaitanya Chemudugunta; Thomas L. Griffiths; Padhraic Smyth; Mark Steyvers
We propose an unsupervised learning technique for extracting information about authors and topics from large text collections. We model documents as if they were generated by a two-stage stochastic process. An author is represented by a probability distribution over topics, and each topic is represented as a probability distribution over words. The probability distribution over topics in a multi-author paper is a mixture of the distributions associated with the authors. The topic-word and author-topic distributions are learned from data in an unsupervised manner using a Markov chain Monte Carlo algorithm. We apply the methodology to three large text corpora: 150,000 abstracts from the CiteSeer digital library, 1740 papers from the Neural Information Processing Systems (NIPS) Conferences, and 121,000 emails from the Enron corporation. We discuss in detail the interpretation of the results discovered by the system including specific topic and author models, ranking of authors by topic and topics by author, parsing of abstracts by topics and authors, and detection of unusual papers by specific authors. Experiments based on perplexity scores for test documents and precision-recall for document retrieval are used to illustrate systematic differences between the proposed author-topic model and a number of alternatives. Extensions to the model, allowing, for example, generalizations of the notion of an author, are also briefly discussed.
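The generative story in the abstract can be sketched as a small simulation: for each word, pick an author, draw a topic from that author's topic distribution, then draw a word from that topic's word distribution. The abstract does not say how the author is chosen for each word, so choosing uniformly among the document's authors is an assumption here, as are the function and parameter names.

```python
import random


def generate_document(authors, author_topic, topic_word, length, rng):
    """Simulate one document as a list of (author, topic, word_id) tuples.

    authors:      the document's author names
    author_topic: dict mapping author -> list of topic probabilities
    topic_word:   topic_word[t] = list of word probabilities for topic t
    """
    tokens = []
    for _ in range(length):
        # Assumption: each word's author is picked uniformly at random.
        author = rng.choice(authors)
        # Stage 1: the chosen author's distribution selects a topic.
        topic = rng.choices(range(len(topic_word)), weights=author_topic[author])[0]
        # Stage 2: the topic's distribution selects a word.
        word = rng.choices(range(len(topic_word[topic])), weights=topic_word[topic])[0]
        tokens.append((author, topic, word))
    return tokens
```

Inference runs this story in reverse: given only the words and the author lists, the Markov chain Monte Carlo procedure mentioned in the abstract estimates the two sets of distributions.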
