
Publications


Featured research published by Naftali Tishby.


Machine Learning | 1997

Selective Sampling Using the Query by Committee Algorithm

Yoav Freund; H. Sebastian Seung; Eli Shamir; Naftali Tishby

We analyze the “query by committee” algorithm, a method for filtering informative queries from a random stream of inputs. We show that if the two-member committee algorithm achieves an information gain with a positive lower bound, then the prediction error decreases exponentially with the number of queries. In particular, we show that this exponential decrease holds for query learning of perceptrons.
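
A minimal sketch of the committee filter described above, under simplifying assumptions: the paper's Gibbs sampling of the version space is approximated here by training two perceptrons from different random starts on the labels gathered so far, and all names and parameters below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 20
w_true = rng.normal(size=d)                         # hidden target perceptron (the oracle)

def perceptron(X, y, w0, epochs=20):
    """Plain perceptron updates from start w0; a cheap stand-in for
    Gibbs-sampling a committee member from the version space."""
    w = w0.copy()
    for _ in range(epochs):
        for x, t in zip(X, y):
            if np.sign(w @ x) != t:
                w += t * x
    return w

X_lab, y_lab, queries = [], [], 0
committee = [rng.normal(size=d) for _ in range(2)]  # two random initial members
for _ in range(2000):                               # random stream of unlabeled inputs
    x = rng.normal(size=d)
    if (committee[0] @ x) * (committee[1] @ x) < 0:  # the two members disagree on x
        y_lab.append(np.sign(w_true @ x))            # informative: query the oracle
        X_lab.append(x)
        queries += 1
        committee = [perceptron(X_lab, y_lab, rng.normal(size=d))
                     for _ in range(2)]              # refresh committee on the new data

w_hat = perceptron(X_lab, y_lab, rng.normal(size=d))
X_test = rng.normal(size=(3000, d))
err = np.mean(np.sign(X_test @ w_hat) != np.sign(X_test @ w_true))
print(f"{queries} labels queried, test error {err:.3f}")
```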


Machine Learning | 1998

The Hierarchical Hidden Markov Model: Analysis and Applications

Shai Fine; Yoram Singer; Naftali Tishby

We introduce, analyze and demonstrate a recursive hierarchical generalization of the widely used hidden Markov models, which we name Hierarchical Hidden Markov Models (HHMM). Our model is motivated by the complex multi-scale structure that appears in many natural sequences, particularly in language, handwriting and speech. We seek a systematic unsupervised approach to the modeling of such structures. By extending the standard Baum-Welch (forward-backward) algorithm, we derive an efficient procedure for estimating the model parameters from unlabeled data. We then use the trained model for automatic hierarchical parsing of observation sequences. We describe two applications of our model and its parameter estimation procedure. In the first application we show how to construct hierarchical models of natural English text, in which different levels of the hierarchy correspond to structures on different length scales in the text. In the second application we demonstrate how HHMMs can be used to automatically identify repeated strokes that represent combinations of letters in cursive handwriting.
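
To make the recursive activation concrete, here is a small, invented two-level HHMM used only as a generative sketch: each internal state owns a child chain over emitting states, and when the child reaches its end state, control returns to the parent, which then takes its own transition. The generalized Baum-Welch estimation is not shown, and all states and probabilities are made up.

```python
import random

def sample_markov(dist):
    """Draw a key from a {outcome: probability} distribution."""
    r, acc = random.random(), 0.0
    for k, p in dist.items():
        acc += p
        if r < acc:
            return k
    return k

# Two internal states ("A", "B"); each owns a child chain over emitting
# states, with 'END' returning control to the parent.
child_emit = {("A", 0): "a", ("A", 1): "b",
              ("B", 0): "x", ("B", 1): "y"}
child_trans = {("A", 0): {("A", 1): 0.7, "END": 0.3},
               ("A", 1): {("A", 0): 0.4, "END": 0.6},
               ("B", 0): {("B", 1): 0.8, "END": 0.2},
               ("B", 1): {("B", 0): 0.1, "END": 0.9}}
parent_trans = {"A": {"A": 0.3, "B": 0.7}, "B": {"A": 0.6, "B": 0.4}}
child_init = {"A": ("A", 0), "B": ("B", 0)}

def generate(n_segments, parent="A"):
    out = []
    for _ in range(n_segments):
        s = child_init[parent]                     # activate the child chain
        while True:
            out.append(child_emit[s])
            s = sample_markov(child_trans[s])
            if s == "END":                         # child finished: control returns
                break
        parent = sample_markov(parent_trans[parent])  # parent-level transition
    return "".join(out)

print(generate(5))
```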


Conference on Learning Theory | 1996

The power of amnesia: learning probabilistic automata with variable memory length

Dana Ron; Yoram Singer; Naftali Tishby

We propose and analyze a distribution learning algorithm for variable memory length Markov processes. These processes can be described by a subclass of probabilistic finite automata which we name Probabilistic Suffix Automata (PSA). Though hardness results are known for learning distributions generated by general probabilistic automata, we prove that the algorithm we present can efficiently learn distributions generated by PSAs. In particular, we show that for any target PSA, the KL-divergence between the distribution generated by the target and the distribution generated by the hypothesis that the learning algorithm outputs can be made small with high confidence in polynomial time and sample complexity. The learning algorithm is motivated by applications in human-machine interaction. Here we present two applications of the algorithm. In the first, we apply the algorithm to construct a model of the English language and use this model to correct corrupted text. In the second application we construct a simple stochastic model for E. coli DNA.
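
A rough sketch of the "variable memory" idea: predict the next symbol from the longest suffix of the history that was seen often enough in training. The paper's algorithm grows contexts only where they change the conditional distribution; the simplified longest-suffix fallback rule and the thresholds below are illustrative assumptions, not the PSA learning procedure itself.

```python
from collections import Counter, defaultdict

def train(text, max_order=4, min_count=3):
    """Count next-symbol statistics for every context up to max_order,
    keeping only contexts observed at least min_count times."""
    counts = defaultdict(Counter)                  # context -> next-symbol counts
    for i in range(len(text)):
        for k in range(max_order + 1):
            if i - k >= 0:
                counts[text[i - k:i]][text[i]] += 1
    return {c: n for c, n in counts.items() if sum(n.values()) >= min_count}

def predict(counts, history, max_order=4):
    """Next-symbol distribution from the longest usable suffix of history."""
    for k in range(max_order, -1, -1):
        ctx = history[max(0, len(history) - k):]
        if ctx in counts:
            total = sum(counts[ctx].values())
            return {s: c / total for s, c in counts[ctx].items()}
    return {}

model = train("abracadabra abracadabra abracadabra")
print(predict(model, "abr"))                       # mass concentrated on 'a'
```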


International ACM SIGIR Conference on Research and Development in Information Retrieval | 2000

Document clustering using word clusters via the information bottleneck method

Noam Slonim; Naftali Tishby

We present a novel implementation of the recently introduced information bottleneck method for unsupervised document clustering. Given a joint empirical distribution of words and documents, p(x, y), we first cluster the words, Y, so that the obtained word clusters, Ỹ, maximally preserve the information on the documents. The resulting joint distribution, p(X, Ỹ), contains most of the original information about the documents, I(X; Ỹ) ≈ I(X; Y), but it is much less sparse and noisy. Using the same procedure, we then cluster the documents, X, so that the information about the word clusters is preserved. Thus, we first find word clusters that capture most of the mutual information about the set of documents, and then find document clusters that preserve the information about the word clusters. We tested this procedure over several document collections based on subsets taken from the standard 20Newsgroups corpus. The results were assessed by calculating the correlation between the document clusters and the correct labels for these documents. Findings from our experiments show that this double clustering procedure, which uses the information bottleneck method, yields significantly superior performance compared to other common document distributional clustering algorithms. Moreover, the double clustering procedure improves all the distributional clustering methods examined here.
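
A small sketch of the first stage (word clusters that preserve information about the documents), in the agglomerative style: greedily merge the pair of word clusters whose merge loses the least mutual information with the documents. The joint distribution and target cluster count below are toy values, and the brute-force pair search is for clarity, not efficiency.

```python
import numpy as np

def mutual_info(p):
    """I(X;Y) in bits for a joint distribution p[x, y]."""
    px = p.sum(axis=1, keepdims=True)
    py = p.sum(axis=0, keepdims=True)
    mask = p > 0
    return float((p[mask] * np.log2(p[mask] / (px @ py)[mask])).sum())

def merge_words(p, n_clusters):
    cols = [p[:, [j]] for j in range(p.shape[1])]   # start: one cluster per word
    while len(cols) > n_clusters:
        best = None
        for i in range(len(cols)):
            for j in range(i + 1, len(cols)):
                trial = cols[:i] + cols[i+1:j] + cols[j+1:] + [cols[i] + cols[j]]
                loss = -mutual_info(np.hstack(trial))   # smaller loss = more I kept
                if best is None or loss < best[0]:
                    best = (loss, i, j)
        _, i, j = best
        merged = cols[i] + cols[j]
        cols = [c for k, c in enumerate(cols) if k not in (i, j)] + [merged]
    return np.hstack(cols)

rng = np.random.default_rng(1)
p = rng.random((6, 8)); p /= p.sum()                # toy joint p(doc, word)
p_clustered = merge_words(p, 3)
print(mutual_info(p), "->", mutual_info(p_clustered))
```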


International Conference on Machine Learning | 2004

Margin based feature selection - theory and algorithms

Ran Gilad-Bachrach; Amir Navot; Naftali Tishby

Feature selection is the task of choosing a small set out of a given set of features that capture the relevant properties of the data. In the context of supervised classification problems the relevance is determined by the given labels on the training data. A good choice of features is key to building compact and accurate classifiers. In this paper we introduce a margin based feature selection criterion and apply it to measure the quality of sets of features. Using margins, we devise novel selection algorithms for multi-class classification problems and provide a theoretical generalization bound. We also study the well known Relief algorithm and show that it resembles a gradient ascent over our margin criterion. We apply our new algorithm to various datasets and show that our new Simba algorithm, which directly optimizes the margin, outperforms Relief.
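
A sketch of the hypothesis (sample) margin underlying both Relief and Simba: half the difference between the distance to the nearest example of a different class (the nearmiss) and the nearest of the same class (the nearhit). The Relief-style accumulation below uses a simplified per-feature contribution and synthetic data; it is not the paper's exact Simba update.

```python
import numpy as np

def nearest(x, X, Y, label, same):
    """Closest example to x whose class equals (same=True) or differs
    from (same=False) label."""
    idx = [i for i, yy in enumerate(Y) if (yy == label) == same]
    d = [np.linalg.norm(x - X[i]) for i in idx]
    return X[idx[int(np.argmin(d))]]

def relief_weights(X, Y, passes=5, seed=0):
    """Accumulate per-feature contributions to the hypothesis margin
    margin(x) = (|x - nearmiss| - |x - nearhit|) / 2, Relief-style."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(passes):
        for i in rng.permutation(len(X)):
            others, oy = np.delete(X, i, axis=0), np.delete(Y, i)
            hit = nearest(X[i], others, oy, Y[i], same=True)
            miss = nearest(X[i], others, oy, Y[i], same=False)
            w += np.abs(X[i] - miss) - np.abs(X[i] - hit)
    return w

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 60) * 2 - 1
X = rng.normal(size=(60, 5))
X[:, 0] += 2.0 * y                                 # only feature 0 is relevant
print(relief_weights(X, y > 0).round(1))           # weight of feature 0 dominates
```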


Neural Computation | 2001

Predictability, Complexity, and Learning

William Bialek; Ilya Nemenman; Naftali Tishby

We define predictive information Ipred(T) as the mutual information between the past and the future of a time series. Three qualitatively different behaviors are found in the limit of large observation times T: Ipred(T) can remain finite, grow logarithmically, or grow as a fractional power law. If the time series allows us to learn a model with a finite number of parameters, then Ipred(T) grows logarithmically with a coefficient that counts the dimensionality of the model space. In contrast, power-law growth is associated, for example, with the learning of infinite-parameter (or non-parametric) models such as continuous functions with smoothness constraints. There are connections between the predictive information and measures of complexity that have been defined both in learning theory and in the analysis of physical systems through statistical mechanics and dynamical systems theory. Furthermore, in the same way that entropy provides the unique measure of available information consistent with some simple and plausible conditions, we argue that the divergent part of Ipred(T) provides the unique measure for the complexity of dynamics underlying a time series. Finally, we discuss how these ideas may be useful in problems in physics, statistics, and biology.
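
A hedged restatement of the quantities involved, assuming a stationary process observed through symmetric past and future windows of length T, with S(T) the entropy of a window of length T:

```latex
% Predictive information for symmetric windows of length T:
I_{\mathrm{pred}}(T) \;=\; 2\,S(T) \;-\; S(2T).
% For a process generated by a model with K independent parameters, the
% logarithmic regime carries the familiar MDL-like coefficient:
I_{\mathrm{pred}}(T) \;\sim\; \frac{K}{2}\,\log T \qquad (T \to \infty).
```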


International ACM SIGIR Conference on Research and Development in Information Retrieval | 2002

Unsupervised document classification using sequential information maximization

Noam Slonim; Nir Friedman; Naftali Tishby

We present a novel sequential clustering algorithm which is motivated by the Information Bottleneck (IB) method. In contrast to the agglomerative IB algorithm, the new sequential (sIB) approach is guaranteed to converge to a local maximum of the information, as required by the original IB principle. Moreover, the time and space complexity are significantly improved, typically linear in the data size. We apply this algorithm to unsupervised document classification. In our evaluation, on small and medium size corpora, sIB is found to be consistently superior to all the other clustering methods we examine, typically by a significant margin. Moreover, the sIB results are comparable to those obtained by a supervised Naive Bayes classifier. Finally, we propose a simple procedure for trading cluster recall for higher precision, and show how this approach can extract clusters which match the existing topics of the corpus almost perfectly.
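
A structural sketch of the sequential loop: repeatedly draw one document out of its cluster and re-insert it into the cluster it merges with most cheaply, until no document moves. The merge cost below is a plain Jensen-Shannon divergence to the cluster centroid, an assumption that keeps the shape of the algorithm without reproducing the paper's exact IB merge functional; the data are toy word distributions.

```python
import numpy as np

def js(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two distributions."""
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log((a + eps) / (b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def sib_like(docs, k, iters=20, rng=np.random.default_rng(0)):
    assign = rng.integers(0, k, len(docs))
    for _ in range(iters):
        moved = False
        for i in rng.permutation(len(docs)):
            costs = []
            for c in range(k):                      # cost of inserting doc i into c
                members = [j for j in range(len(docs)) if assign[j] == c and j != i]
                if not members:
                    costs.append(np.inf)
                    continue
                centroid = np.mean([docs[j] for j in members], axis=0)
                costs.append(js(docs[i], centroid))
            best = int(np.argmin(costs))
            moved |= best != assign[i]
            assign[i] = best
        if not moved:                               # converged: no document moved
            break
    return assign

rng = np.random.default_rng(0)
docs = rng.dirichlet(np.ones(30), size=40)          # toy per-document word distributions
print(sib_like(docs, k=3))
```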


Proceedings of the IEEE | 1990

A statistical approach to learning and generalization in layered neural networks

Esther Levin; Naftali Tishby; Sara A. Solla

A general statistical description of the problem of learning from examples is presented. Learning in layered networks is posed as a search in the network parameter space for a network that minimizes an additive error function over a set of statistically independent examples. By imposing the equivalence of the minimum error and the maximum likelihood criteria for training the network, the Gibbs distribution on the ensemble of networks with a fixed architecture is derived. The probability of correct prediction of a novel example can be expressed using this ensemble, serving as a measure of the network's generalization ability. The entropy of the prediction distribution is shown to be a consistent measure of the network's performance. The proposed formalism is applied to the problems of selecting an optimal architecture and predicting learning curves.
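
A hedged sketch of the construction: with an additive training error over m independent examples, equating the minimum error criterion with maximum likelihood yields a Gibbs ensemble over networks w at an inverse temperature β.

```latex
% Additive training error over m statistically independent examples:
E(w) \;=\; \sum_{i=1}^{m} \epsilon\big(w;\, x_i, y_i\big).
% Gibbs distribution on the ensemble of networks with fixed architecture:
P(w) \;=\; \frac{1}{Z(\beta)}\; e^{-\beta E(w)},
\qquad
Z(\beta) \;=\; \int dw\; e^{-\beta E(w)}.
```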


Neuron | 2006

Reduction of information redundancy in the ascending auditory pathway.

Gal Chechik; Michael Anderson; Omer Bar-Yosef; Eric D. Young; Naftali Tishby; Israel Nelken

Information processing by a sensory system is reflected in the changes in stimulus representation along its successive processing stages. We measured information content and stimulus-induced redundancy in the neural responses to a set of natural sounds in three successive stations of the auditory pathway: inferior colliculus (IC), auditory thalamus (MGB), and primary auditory cortex (A1). Information about stimulus identity was somewhat reduced in single A1 and MGB neurons relative to single IC neurons, when information is measured using spike counts, latency, or temporal spiking patterns. However, most of this difference was due to differences in firing rates. On the other hand, IC neurons were substantially more redundant than A1 and MGB neurons. IC redundancy was largely related to frequency selectivity. Redundancy reduction may be a generic organization principle of neural systems, allowing for easier readout of the identity of complex stimuli in A1 relative to IC.
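
One way to make the redundancy measure concrete (a sketch with simulated spike counts, not the paper's data or its bias corrections): sum the information that single neurons carry about stimulus identity and subtract the information they carry jointly; a positive difference indicates redundancy.

```python
import numpy as np
from collections import Counter

def mi(xs, ys):
    """Plug-in mutual information (bits) between two discrete sequences."""
    n = len(xs)
    pxy, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum(c / n * np.log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

rng = np.random.default_rng(0)
stim = rng.integers(0, 4, 3000)                    # four "natural sound" stimuli
r1 = rng.poisson(2 + stim)                         # two neurons with similar tuning
r2 = rng.poisson(2 + stim)

single = mi(stim, r1) + mi(stim, r2)               # information in single neurons
joint = mi(stim, list(zip(r1, r2)))                # information in the pair jointly
print(f"redundancy = {single - joint:.3f} bits")   # positive: the pair is redundant
```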


Conference on Learning Theory | 1995

On the learnability and usage of acyclic probabilistic finite automata

Dana Ron; Yoram Singer; Naftali Tishby

We propose and analyze a distribution learning algorithm for a subclass of acyclic probabilistic finite automata (APFA). This subclass is characterized by a certain distinguishability property of the automata's states. Though hardness results are known for learning distributions generated by general APFAs, we prove that our algorithm can efficiently learn distributions generated by the subclass of APFAs we consider. In particular, we show that the KL-divergence between the distribution generated by the target source and the distribution generated by our hypothesis can be made arbitrarily small with high confidence in polynomial time. We present two applications of our algorithm. In the first, we show how to model cursively written letters. The resulting models are part of a complete cursive handwriting recognition system. In the second application we demonstrate how APFAs can be used to build multiple-pronunciation models for spoken words. We evaluate the APFA-based pronunciation models on labeled speech data. The good performance (in terms of the log-likelihood obtained on test data) achieved by the APFAs and the little time needed for learning suggest that the learning algorithm of APFAs might be a powerful alternative to commonly used probabilistic models.
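
An illustrative use of an acyclic PFA as a multiple-pronunciation model: states form a DAG from a start state to a final state, edges carry a phone and a probability, and a pronunciation's log-likelihood is the sum of log edge probabilities along its path. The automaton below (for the word "tomato") is invented for the sketch; the learning algorithm itself is not shown.

```python
import math

# state -> {phone: (next_state, probability)}; deterministic on the phone label
apfa = {
    "s0": {"t": ("s1", 1.0)},
    "s1": {"ah": ("s2", 0.5), "ow": ("s2", 0.5)},
    "s2": {"m": ("s3", 1.0)},
    "s3": {"ey": ("s4", 0.6), "aa": ("s4", 0.4)},
    "s4": {"t": ("s5", 1.0)},
    "s5": {"ow": ("final", 1.0)},
}

def log_likelihood(model, phones, start="s0", final="final"):
    state, ll = start, 0.0
    for ph in phones:
        if state == final or ph not in model.get(state, {}):
            return float("-inf")                   # pronunciation not accepted
        state, p = model[state][ph]
        ll += math.log(p)
    return ll if state == final else float("-inf")

print(log_likelihood(apfa, ["t", "ah", "m", "ey", "t", "ow"]))  # log(0.5 * 0.6)
```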

Collaboration


Dive into Naftali Tishby's collaborations.

Top Co-Authors

Noam Slonim (Hebrew University of Jerusalem)
Amir Globerson (Hebrew University of Jerusalem)
Shlomo Dubnov (University of California)
Amir Navot (Hebrew University of Jerusalem)
Ohad Shamir (Weizmann Institute of Science)
Ran Gilad-Bachrach (Hebrew University of Jerusalem)