Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Soumya Ray is active.

Publication


Featured research published by Soumya Ray.


international conference on machine learning | 2005

Supervised versus multiple instance learning: an empirical comparison

Soumya Ray; Mark Craven

We empirically study the relationship between supervised and multiple instance (MI) learning. Algorithms to learn various concepts have been adapted to the MI representation. However, it is also known that concepts that are PAC-learnable with one-sided noise can be learned from MI data. A relevant question then is: how well do supervised learners do on MI data? We attempt to answer this question by looking at a cross section of MI data sets from various domains coupled with a number of learning algorithms, including Diverse Density, Logistic Regression, nonlinear Support Vector Machines and FOIL. We consider a supervised and MI version of each learner. Several interesting conclusions emerge from our work: (1) no MI algorithm is superior across all tested domains, (2) some MI algorithms are consistently superior to their supervised counterparts, (3) using high false-positive costs can improve a supervised learner's performance in MI domains, and (4) in several domains, a supervised algorithm is superior to any MI algorithm we tested.
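The supervised-versus-MI setup described above can be illustrated with a small sketch: a standard supervised learner is trained on MI data by giving every instance its bag's label, bags are then scored under the usual MI assumption (a bag is positive if any of its instances is), and the high false-positive cost from conclusion (3) is approximated with class weights. This is an assumed illustration in Python/scikit-learn, not the paper's code; the names `bags` and `bag_labels` are placeholders.

```python
# A minimal sketch (not the paper's code) of running a supervised learner on
# multiple-instance (MI) data: every instance inherits its bag's label for
# training, and a bag is predicted positive if any of its instances is.
# A higher false-positive cost is approximated with class weights.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_supervised_on_mi(bags, bag_labels, fp_cost=5.0):
    """bags: list of (n_i, d) arrays; bag_labels: array of 0/1 bag labels."""
    X = np.vstack(bags)
    y = np.concatenate([np.full(len(b), lab) for b, lab in zip(bags, bag_labels)])
    # Penalize false positives more by up-weighting the negative class.
    clf = LogisticRegression(max_iter=1000, class_weight={0: fp_cost, 1: 1.0})
    clf.fit(X, y)
    return clf

def predict_bags(clf, bags):
    # MI semantics: a bag is positive iff at least one instance is positive.
    return np.array([int(clf.predict(b).max()) for b in bags])
```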


international conference on machine learning | 2007

Multi-task reinforcement learning: a hierarchical Bayesian approach

Aaron Wilson; Alan Fern; Soumya Ray; Prasad Tadepalli

We consider the problem of multi-task reinforcement learning, where the agent needs to solve a sequence of Markov Decision Processes (MDPs) chosen randomly from a fixed but unknown distribution. We model the distribution over MDPs using a hierarchical Bayesian infinite mixture model. For each novel MDP, we use the previously learned distribution as an informed prior for model-based Bayesian reinforcement learning. The hierarchical Bayesian framework provides a strong prior that allows us to rapidly infer the characteristics of new environments based on previous environments, while the use of a nonparametric model allows us to quickly adapt to environments we have not encountered before. In addition, the use of infinite mixtures allows the model to automatically learn the number of underlying MDP components. We evaluate our approach and show that it leads to significant speedups in convergence to an optimal policy after observing only a small number of tasks.
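As a rough illustration of the "informed prior" idea above, the toy sketch below pools transition counts from earlier tasks into Dirichlet pseudo-counts that shape the model of a new task. It is an assumed simplification, not the paper's hierarchical infinite mixture; the class name `InformedDirichletModel` and the `weight` parameter are invented for illustration.

```python
# A toy sketch of the "informed prior" idea (not the paper's hierarchical
# infinite mixture): transition counts accumulated on earlier tasks become
# Dirichlet pseudo-counts that shape the dynamics model of a new task, so
# the new task's dynamics can be inferred from fewer observations.
import numpy as np

class InformedDirichletModel:
    def __init__(self, n_states, n_actions, base_alpha=1.0):
        # Symmetric Dirichlet prior over next-state distributions.
        self.prior = np.full((n_states, n_actions, n_states), base_alpha)
        self.counts = np.zeros_like(self.prior)

    def observe(self, s, a, s_next):
        self.counts[s, a, s_next] += 1.0

    def transition_probs(self, s, a):
        # Posterior mean of P(s' | s, a) under the Dirichlet prior.
        alpha = self.prior[s, a] + self.counts[s, a]
        return alpha / alpha.sum()

    def finish_task(self, weight=0.5):
        # Fold this task's counts into the prior used for the next task
        # (a crude stand-in for updating the learned distribution over MDPs).
        self.prior += weight * self.counts
        self.counts[:] = 0.0
```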


international conference on machine learning | 2008

Automatic discovery and transfer of MAXQ hierarchies

Neville Mehta; Soumya Ray; Prasad Tadepalli; Thomas G. Dietterich

We present an algorithm, HI-MAT (Hierarchy Induction via Models And Trajectories), that discovers MAXQ task hierarchies by applying dynamic Bayesian network models to a successful trajectory from a source reinforcement learning task. HI-MAT discovers subtasks by analyzing the causal and temporal relationships among the actions in the trajectory. Under appropriate assumptions, HI-MAT induces hierarchies that are consistent with the observed trajectory and have compact value-function tables employing safe state abstractions. We demonstrate empirically that HI-MAT constructs compact hierarchies that are comparable to manually-engineered hierarchies and facilitate significant speedup in learning when transferred to a target task.
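The sketch below is a loose toy illustration, not HI-MAT itself: it merely segments a successful trajectory into candidate subtasks by grouping consecutive actions that change the same state variables, a crude stand-in for the causal and temporal analysis the paper performs with DBN action models. The trajectory format (dictionaries of state variables) is an assumption.

```python
# A loose toy illustration (not HI-MAT) of trajectory segmentation:
# consecutive actions are grouped while they keep changing the same subset
# of state variables, a crude stand-in for HI-MAT's causal/temporal analysis.
def changed_vars(state, next_state):
    return frozenset(k for k in state if state[k] != next_state[k])

def segment_trajectory(trajectory):
    """trajectory: list of (state_dict, action, next_state_dict) triples."""
    segments, current, current_vars = [], [], None
    for s, a, s2 in trajectory:
        vars_now = changed_vars(s, s2)
        if current and vars_now != current_vars:
            segments.append((current_vars, current))
            current = []
        current_vars = vars_now
        current.append(a)
    if current:
        segments.append((current_vars, current))
    return segments
```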


BMC Bioinformatics | 2005

Learning Statistical Models for Annotating Proteins with Function Information using Biomedical Text

Soumya Ray; Mark Craven

Background: The BioCreative text mining evaluation investigated the application of text mining methods to the task of automatically extracting information from text in biomedical research articles. We participated in Task 2 of the evaluation. For this task, we built a system to automatically annotate a given protein with codes from the Gene Ontology (GO) using the text of an article from the biomedical literature as evidence. Methods: Our system relies on simple statistical analyses of the full text article provided. We learn n-gram models for each GO code using statistical methods and use these models to hypothesize annotations. We also learn a set of Naïve Bayes models that identify textual clues of possible connections between the given protein and a hypothesized annotation. These models are used to filter and rank the predictions of the n-gram models. Results: We report experiments evaluating the utility of various components of our system on a set of data held out during development, and experiments evaluating the utility of external data sources that we used to learn our models. Finally, we report our evaluation results from the BioCreative organizers. Conclusion: We observe that, on the test data, our system performs quite well relative to the other systems submitted to the evaluation. From other experiments on the held-out data, we observe that (i) the Naïve Bayes models were effective in filtering and ranking the initially hypothesized annotations, and (ii) our learned models were significantly more accurate when external data sources were used during learning.
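A minimal sketch of the text-based annotation idea, assuming a scikit-learn reimplementation rather than the authors' system: it collapses the paper's two stages (per-code n-gram models plus a Naïve Bayes filter) into a single n-gram Naïve Bayes classifier that ranks hypothesized GO codes for an article.

```python
# A minimal sketch (assumed, not the authors' system) of ranking GO codes
# for an article from its text: unigram+bigram counts stand in for the
# paper's n-gram models, and a Naive Bayes classifier produces the ranking.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def train_go_annotator(articles, go_codes):
    """articles: list of full-text strings; go_codes: one GO code per article."""
    model = make_pipeline(
        CountVectorizer(ngram_range=(1, 2), min_df=2),
        MultinomialNB(),
    )
    model.fit(articles, go_codes)
    return model

def rank_annotations(model, article_text, top_k=5):
    # Rank hypothesized GO codes for one article by posterior probability.
    probs = model.predict_proba([article_text])[0]
    codes = model.classes_
    order = probs.argsort()[::-1][:top_k]
    return [(codes[i], float(probs[i])) for i in order]
```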


Machine Learning | 2014

A theoretical and empirical analysis of support vector machine methods for multiple-instance classification

Gary Doran; Soumya Ray

The standard support vector machine (SVM) formulation, widely used for supervised learning, possesses several intuitive and desirable properties. In particular, it is convex and assigns zero loss to solutions if, and only if, they correspond to consistent classifying hyperplanes with some nonzero margin. The traditional SVM formulation has been heuristically extended to multiple-instance (MI) classification in various ways. In this work, we analyze several such algorithms and observe that all MI techniques lack at least one of the desirable properties above. Further, we show that this tradeoff is fundamental, stems from the topological properties of consistent classifying hyperplanes for MI data, and is related to the computational complexity of learning MI hyperplanes. We then study the empirical consequences of this three-way tradeoff in MI classification using a large group of algorithms and datasets. We find that the experimental observations generally support our theoretical results, and properties such as the labeling task (instance versus bag labeling) influence the effects of different tradeoffs.
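One family of heuristic MI extensions of the SVM analyzed in this line of work can be sketched as an alternating "witness selection" scheme: each positive bag is represented by its current highest-scoring instance and the SVM is refit until the witnesses stabilize. The sketch below is an assumed illustration in that spirit (akin to MI-SVM), not code from the paper.

```python
# A minimal sketch of a heuristic MI extension of the SVM: alternate between
# fitting an SVM and re-selecting each positive bag's "witness" instance
# (its highest-scoring instance under the current model).
import numpy as np
from sklearn.svm import SVC

def mi_svm(pos_bags, neg_bags, n_iters=10):
    """pos_bags, neg_bags: lists of (n_i, d) arrays of instances."""
    X_neg = np.vstack(neg_bags)
    witnesses = [b[0] for b in pos_bags]          # arbitrary initial witnesses
    clf = None
    for _ in range(n_iters):
        X = np.vstack([np.vstack(witnesses), X_neg])
        y = np.concatenate([np.ones(len(witnesses)), np.zeros(len(X_neg))])
        clf = SVC(kernel="rbf", C=1.0).fit(X, y)
        # Re-select each positive bag's highest-scoring instance as its witness.
        new_witnesses = [b[np.argmax(clf.decision_function(b))] for b in pos_bags]
        if all((w == nw).all() for w, nw in zip(witnesses, new_witnesses)):
            break
        witnesses = new_witnesses
    return clf
```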


Biomedical Optics Express | 2012

Automatic stent detection in intravascular OCT images using bagged decision trees

Hong Lu; Madhusudhana Gargesha; Zhao Wang; Daniel Chamié; Guilherme F. Attizzani; Tomoaki Kanaya; Soumya Ray; Marco A. Costa; Andrew M. Rollins; Hiram G. Bezerra; David L. Wilson

Intravascular optical coherence tomography (iOCT) is being used to assess viability of new coronary artery stent designs. We developed a highly automated method for detecting stent struts and measuring tissue coverage. We trained a bagged decision trees classifier to classify candidate struts using features extracted from the images. With the 12 best features identified by forward selection, recall (precision) was 90%–94% (85%–90%). Including struts deemed insufficiently bright for manual analysis, precision improved to 94%. Strut detection statistics approached the variability of manual analysis. Differences between manual and automatic area measurements were 0.12 ± 0.20 mm² and 0.11 ± 0.20 mm² for stent and tissue areas, respectively. With the proposed algorithms, analyst time per stent should be significantly reduced from the 6–16 hours now required.
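A minimal sketch of the classification step under assumed interfaces (not the authors' pipeline): candidate struts described by image-derived feature vectors are classified with bagged decision trees, and greedy forward selection keeps a 12-feature subset, as the abstract reports.

```python
# A minimal sketch (assumed setup, not the authors' pipeline) of classifying
# candidate stent struts with bagged decision trees, with greedy forward
# feature selection choosing the strongest 12 features.
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_selection import SequentialFeatureSelector

def train_strut_classifier(X_features, y_is_strut, n_features=12):
    """X_features: (n_candidates, n_features_total); y_is_strut: 0/1 labels."""
    base = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=50)
    # Greedy forward selection of the best-scoring feature subset.
    selector = SequentialFeatureSelector(
        base, n_features_to_select=n_features, direction="forward", cv=5
    )
    selector.fit(X_features, y_is_strut)
    # Refit the bagged trees on the selected features only.
    clf = base.fit(selector.transform(X_features), y_is_strut)
    return selector, clf
```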


Bioinformatics | 2012

Accurate estimation of short read mapping quality for next-generation genome sequencing

Matthew Ruffalo; Mehmet Koyutürk; Soumya Ray; Thomas LaFramboise

Motivation: Several software tools specialize in the alignment of short next-generation sequencing reads to a reference sequence. Some of these tools report a mapping quality score for each alignment; in principle, this quality score tells researchers the likelihood that the alignment is correct. However, the reported mapping quality often correlates weakly with actual accuracy, and the qualities of many mappings are underestimated, encouraging researchers to discard correct mappings. Further, these low-quality mappings tend to correlate with variations in the genome (both single nucleotide and structural), and such mappings are important in accurately identifying genomic variants. Approach: We develop a machine learning tool, LoQuM (LOgistic regression tool for calibrating the Quality of short read mappings), to assign reliable mapping quality scores to mappings of Illumina reads returned by any alignment tool. LoQuM uses statistics on the read (base quality scores reported by the sequencer) and the alignment (number of matches, mismatches and deletions, mapping quality score returned by the alignment tool, if available, and number of mappings) as features for classification, and uses simulated reads to learn a logistic regression model that relates these features to actual mapping quality. Results: We test the predictions of LoQuM on an independent dataset generated by the ART short read simulation software and observe that LoQuM can ‘resurrect’ many mappings that are assigned zero quality scores by the alignment tools and are therefore likely to be discarded by researchers. We also observe that the recalibration of mapping quality scores greatly enhances the precision of called single nucleotide polymorphisms. Availability: LoQuM is available as open source at http://compbio.case.edu/loqum/. Contact: [email protected].
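The recalibration idea can be sketched as follows, assuming a scikit-learn reimplementation rather than LoQuM itself: a logistic regression model maps read and alignment features to the probability that a mapping is correct, which is then re-expressed as a phred-scaled mapping quality. The feature names in FEATURES are assumptions, not LoQuM's exact feature set.

```python
# A minimal sketch of mapping-quality recalibration (an assumed
# reimplementation, not LoQuM): logistic regression predicts the probability
# a mapping is correct, then the probability is converted to a phred score.
import numpy as np
from sklearn.linear_model import LogisticRegression

FEATURES = ["mean_base_quality", "n_matches", "n_mismatches",
            "n_deletions", "aligner_mapq", "n_reported_mappings"]  # assumed names

def train_calibrator(X, y_correct):
    """X: (n_reads, len(FEATURES)); y_correct: 1 if a simulated read maps to its true origin."""
    return LogisticRegression(max_iter=1000).fit(X, y_correct)

def recalibrated_mapq(model, X, max_q=60):
    p_wrong = 1.0 - model.predict_proba(X)[:, 1]
    # Phred scale: Q = -10 * log10(P(mapping is wrong)), capped at max_q.
    q = -10.0 * np.log10(np.clip(p_wrong, 1e-6, 1.0))
    return np.minimum(q, max_q)
```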


Ai Magazine | 2011

Automatic Discovery and Transfer of Task Hierarchies in Reinforcement Learning

Neville Mehta; Soumya Ray; Prasad Tadepalli; Thomas G. Dietterich

Sequential decision tasks present many opportunities for the study of transfer learning. A principal one among them is the existence of multiple domains that share the same underlying causal structure for actions. We describe an approach that exploits this shared causal structure to discover a hierarchical task structure in a source domain, which in turn speeds up learning of task execution knowledge in a new target domain. Our approach is theoretically justified and compares favorably to manually designed task hierarchies in learning efficiency in the target domain. We demonstrate that causally motivated task hierarchies transfer more robustly than other kinds of detailed knowledge that depend on the idiosyncrasies of the source domain and are hence less transferable.


Computer Networks | 2014

A large-scale empirical analysis of email spam detection through network characteristics in a stand-alone enterprise

Tu Ouyang; Soumya Ray; Mark Allman; Michael Rabinovich

Spam is a never-ending issue that constantly consumes resources to no useful end. In this paper, we envision spam filtering as a pipeline consisting of DNS blacklists, filters based on SYN packet features, filters based on traffic characteristics and filters based on message content. Each stage of the pipeline examines more information in the message but is more computationally expensive. A message is rejected as spam once any layer is sufficiently confident. We analyze this pipeline, focusing on the first three layers, from a single-enterprise perspective. To do this we use a large email dataset collected over two years. We devise a novel ground truth determination system to allow us to label this large dataset accurately. Using two machine learning algorithms, we study (i) how the different pipeline layers interact with each other and the value added by each layer, (ii) the utility of individual features in each layer, (iii) stability of the layers across time and network events and (iv) an operational use case investigating whether this architecture can be practically useful. We find that (i) the pipeline architecture is generally useful in terms of accuracy as well as in an operational setting, (ii) it generally ages gracefully across long time periods and (iii) in some cases, later layers can compensate for poor performance in the earlier layers. Among the caveats we find are that (i) the utility of network features is not as high in the single enterprise viewpoint as reported in other prior work, (ii) major network events can sharply affect the detection rate, and (iii) the operational (computational) benefit of the pipeline may depend on the efficiency of the final content filter.
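A minimal sketch of the layered pipeline under assumed interfaces (the message fields and model objects are placeholders, not the paper's implementation): cheaper layers run first, and a message is rejected as spam as soon as any layer is sufficiently confident.

```python
# A minimal sketch of a layered spam-filtering pipeline: a DNS blacklist
# check first, then increasingly expensive learned filters; the first layer
# confident enough rejects the message as spam.
def classify_message(msg, blacklist, syn_model, traffic_model, content_model,
                     threshold=0.9):
    # Layer 1: blacklist lookup on the sending IP (cheapest check).
    if msg["sender_ip"] in blacklist:
        return "spam"
    # Layers 2-4: SYN-packet features, traffic features, then message content.
    # Each model returns an estimated probability that the message is spam.
    for model, features in ((syn_model, msg["syn_features"]),
                            (traffic_model, msg["traffic_features"]),
                            (content_model, msg["content_features"])):
        if model.predict_proba([features])[0][1] >= threshold:
            return "spam"
    return "ham"
```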


international conference on machine learning | 2005

Generalized skewing for functions with continuous and nominal attributes

Soumya Ray; David C. Page

This paper extends previous work on skewing, an approach to learning problematic functions in decision tree induction. The previous algorithms were applicable only to functions of binary variables. In this paper, we extend skewing to directly handle functions of continuous and nominal variables. We present experiments with randomly generated functions and a number of real-world datasets to evaluate the algorithms' accuracy. Our results indicate that our algorithm almost always outperforms an Information Gain-based decision tree learner.
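A simplified sketch of the skewing idea for nominal attributes, assumed for illustration rather than taken from the paper: the training data are reweighted several times using randomly chosen "preferred" values, information gain is computed under each reweighting, and attributes that score well across skews are favored as splits.

```python
# A simplified sketch of skewing for nominal attributes (assumed
# illustration, not the paper's algorithm): reweight the data under several
# random "preferred value" assignments and aggregate per-attribute gains.
import random
from math import log2

def entropy(labels, weights):
    total = sum(weights)
    probs = [sum(w for l, w in zip(labels, weights) if l == c) / total
             for c in set(labels)]
    return -sum(p * log2(p) for p in probs if p > 0)

def weighted_gain(column, labels, weights):
    total = sum(weights)
    gain = entropy(labels, weights)
    for v in set(column):
        idx = [i for i, x in enumerate(column) if x == v]
        sub_w = [weights[i] for i in idx]
        gain -= (sum(sub_w) / total) * entropy([labels[i] for i in idx], sub_w)
    return gain

def skewed_gains(columns, labels, n_skews=5, favor=2.0):
    """columns: one list of nominal values per attribute; labels: class labels."""
    scores = [0.0] * len(columns)
    for _ in range(n_skews):
        preferred = [random.choice(list(set(col))) for col in columns]
        # Up-weight examples matching each attribute's randomly preferred value.
        weights = [1.0] * len(labels)
        for j, col in enumerate(columns):
            weights = [w * (favor if col[i] == preferred[j] else 1.0)
                       for i, w in enumerate(weights)]
        for j, col in enumerate(columns):
            scores[j] += weighted_gain(col, labels, weights)
    return scores  # higher totals across skews suggest better split attributes
```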

Collaboration


Dive into Soumya Ray's collaborations.

Top Co-Authors

David C. Page (University of Wisconsin-Madison)
David L. Wilson (Case Western Reserve University)
Hiram G. Bezerra (Case Western Reserve University)
Andrew M. Rollins (Case Western Reserve University)
Mark Craven (University of Wisconsin-Madison)
Gary Doran (Case Western Reserve University)
Ronny Shalev (Case Western Reserve University)
Alan Fern (Oregon State University)
Andy Podgurski (Case Western Reserve University)