Stephen D. Scott
University of Nebraska–Lincoln
Publication
Featured research published by Stephen D. Scott.
International Conference on Machine Learning | 2004
Qingping Tao; Stephen D. Scott; N. V. Vinodchandran; Thomas Takeo Osugi
The multiple-instance learning (MIL) model has been very successful in application areas such as drug discovery and content-based image retrieval. Recently, a generalization of this model and an algorithm for this generalization were introduced, showing significant advantages over the conventional MIL model in certain application areas. Unfortunately, this algorithm is inherently inefficient, preventing scaling to high dimensions. We reformulate this algorithm using a kernel for a support vector machine, reducing its time complexity from exponential to polynomial. Computing the kernel is equivalent to counting the number of axis-parallel boxes in a discrete, bounded space that contain at least one point from each of two multisets P and Q. We show that this problem is #P-complete, but then give a fully polynomial randomized approximation scheme (FPRAS) for it. Finally, we empirically evaluate our kernel.
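The counting problem underlying the kernel can be illustrated by brute force in tiny discrete spaces (a sketch only; the function name and the enumeration over all boxes are illustrative, and this exponential enumeration is exactly what the paper's FPRAS avoids):

```python
from itertools import product

def count_common_boxes(P, Q, size):
    """Count axis-parallel boxes in {0,...,size-1}^k that contain at
    least one point from multiset P and at least one point from Q.
    Brute force over every box; usable only for tiny illustrative cases."""
    k = len(P[0])
    # Every per-dimension interval is a pair (lo, hi) with lo <= hi.
    intervals = [(lo, hi) for lo in range(size) for hi in range(lo, size)]
    count = 0
    for box in product(intervals, repeat=k):  # one interval per dimension
        def contains(pt):
            return all(lo <= pt[i] <= hi for i, (lo, hi) in enumerate(box))
        if any(contains(p) for p in P) and any(contains(q) for q in Q):
            count += 1
    return count
```

Since the count is symmetric in P and Q, the resulting kernel matrix is symmetric as well.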
International Journal of Computational Intelligence and Applications | 2005
Stephen D. Scott; Jun Zhang; Joshua Brown
We describe a generalization of the multiple-instance learning model in which a bag's label is not based on a single instance's proximity to a single target point. Rather, a bag is positive if and only if it contains a collection of instances, each near one of a set of target points. We then adapt a learning-theoretic algorithm for learning in this model and present empirical results on data from robot vision, content-based image retrieval, and protein sequence identification.
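Under the simplest reading of this labeling rule, a bag is positive when every target point has some instance of the bag nearby. A minimal sketch (the radius `r` and the use of Euclidean distance are assumptions for illustration, not the paper's exact formulation):

```python
def bag_label(bag, targets, r):
    """Generalized multiple-instance label (illustrative sketch):
    a bag is positive iff, for every target point, some instance
    in the bag lies within Euclidean distance r of that target."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return all(any(dist(inst, t) <= r for inst in bag) for t in targets)
```

Contrast this with conventional MIL, where a single instance near a single target already makes the bag positive.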
Journal of the ACM | 2001
Daniel R. Dooly; Sally A. Goldman; Stephen D. Scott
We study an on-line problem that is motivated by the networking problem of dynamically adjusting acknowledgments in the Transmission Control Protocol (TCP). We provide a theoretical model for this problem in which the goal is to send acknowledgments at times that minimize a linear combination of the cost for the number of acknowledgments sent and the cost for the additional latency introduced by delaying acknowledgments. To study the usefulness of applying packet arrival time prediction to this problem, we assume there is an oracle that provides the algorithm with the times of the next L arrivals, for some L ≥ 0. We give two different objective functions for measuring the cost of a solution, each with its own measure of latency cost. For each objective function we first give an O(n²)-time dynamic programming algorithm for optimally solving the off-line problem. Then we describe an on-line algorithm that greedily acknowledges exactly when the cost for an acknowledgment is less than the latency cost incurred by not acknowledging. We show that for this algorithm there is a sequence of n packet arrivals for which it is Ω(...)-competitive for the first objective function, 2-competitive for the second function for L = 0, and 1-competitive for the second function for L = 1. Next we present a second on-line algorithm which is a slight modification of the first, and we prove that it is 2-competitive for both objective functions for all L. We also give lower bounds on the competitive ratio of any deterministic on-line algorithm. These results show that for each objective function, at least one of our algorithms is optimal. Finally, we give some initial empirical results using arrival sequences from real network traffic, comparing the two acknowledgment-delay methods used in TCP with our two on-line algorithms. In all cases we examine performance with L = 0 and L = 1.
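The greedy rule can be sketched as follows. This is a discretized sketch that only checks the trigger at packet arrival times; the latency measure (sum of delays of unacknowledged packets) and the normalization of the ack cost are assumptions chosen for illustration, not the paper's exact objective functions:

```python
def greedy_acks(arrivals, ack_cost=1.0):
    """Greedy on-line acknowledgment (sketch): keep delaying the ack,
    and send one as soon as the latency cost accumulated by the
    pending (unacknowledged) packets reaches the cost of one ack.
    Latency cost here = sum of (now - arrival_time) over pending packets."""
    acks, pending = [], []
    for t in arrivals:                       # arrivals in increasing time order
        pending.append(t)
        latency = sum(t - p for p in pending)  # latency if we ack right now
        if latency >= ack_cost:
            acks.append(t)
            pending = []
    if pending:
        acks.append(pending[-1])             # flush any remaining packets
    return acks
```

On a bursty arrival sequence the rule batches closely spaced packets under one acknowledgment, which is the trade-off the objective functions formalize.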
Molecular Biology and Evolution | 2009
Cory L. Strope; Kevin Abel; Stephen D. Scott; Etsuko N. Moriyama
Sequence simulation is an important tool in validating biological hypotheses as well as testing various bioinformatics and molecular evolutionary methods. Hypothesis testing relies on the representational ability of the sequence simulation method. Simple hypotheses are testable through simulation of random, homogeneously evolving sequence sets. However, testing complex hypotheses, for example, local similarities, requires simulation of sequence evolution under heterogeneous models. To this end, we previously introduced indel-Seq-Gen version 1.0 (iSGv1.0; indel, insertion/deletion). iSGv1.0 allowed heterogeneous protein evolution and motif conservation as well as insertion and deletion constraints in subsequences. Despite these advances, for complex hypothesis testing, neither iSGv1.0 nor other currently available sequence simulation methods are sufficient. indel-Seq-Gen version 2.0 (iSGv2.0) aims at simulating evolution of highly divergent DNA sequences and protein superfamilies. iSGv2.0 improves upon iSGv1.0 through the addition of lineage-specific evolution, motif conservation using PROSITE-like regular expressions, indel tracking, subsequence-length constraints, as well as coding and noncoding DNA evolution. Furthermore, we formalize the sequence representation used for iSGv2.0 and uncover a flaw in the modeling of indels used in current state-of-the-art methods, which biases simulation results for hypotheses involving indels. We fix this flaw in iSGv2.0 by using a novel discrete stepping procedure. Finally, we present an example simulation of the calycin-superfamily sequences and compare the performance of iSGv2.0 with iSGv1.0 and with a random model of sequence evolution.
Journal of Computer and System Sciences | 2001
Sally A. Goldman; Stephen Kwek; Stephen D. Scott
P. W. Goldberg, S. A. Goldman, and S. D. Scott (Mach. Learning 25, No. 1 (1996), 51–70) discussed how the problem of recognizing a landmark from a one-dimensional visual image might be mapped to that of learning a one-dimensional geometric pattern and gave a PAC algorithm to learn that class. In this paper, we present an efficient online agnostic learning algorithm for learning the class of constant-dimensional geometric patterns. Our algorithm can tolerate both classification and attribute noise. By working in higher dimensional spaces we can represent more features from the visual image in the geometric pattern. Our mapping of the data to a geometric pattern and, hence, our learning algorithm are applicable to any data representable as a constant-dimensional array of values, e.g., sonar data, temporal difference information, amplitudes of a waveform, or other pattern recognition data. To our knowledge, these classes of patterns are more complex than any class of geometric patterns previously studied. Also, our results are easily adapted to learn the union of fixed-dimensional boxes from multiple-instance examples. Finally, our algorithms are tolerant of concept shift, where the target concept that labels the examples can change over time.
Global Communications Conference | 2003
Lin Li; Stephen D. Scott; Jitender S. Deogun
Due to the lack of optical random access memory, optical fiber delay lines (FDLs) are currently the only way to implement optical buffering. Feed-forward and feedback are the two kinds of FDL structures used in optical buffering; both have advantages and disadvantages. In this paper, we propose a more effective hybrid FDL architecture that combines the merits of both schemes. The core of this switch is the arrayed waveguide grating (AWG) and the tunable wavelength converter (TWC). It requires smaller optical device sizes and fewer wavelengths and has less noise than the feedback architecture. At the same time, it can facilitate preemptive priority routing, which the feed-forward architecture cannot support. Our numerical results show that the new switch architecture significantly reduces packet loss probability.
International Conference on Data Mining | 2010
Yaling Zheng; Stephen D. Scott; Kun Deng
In active learning, where a learning algorithm has to purchase the labels of its training examples, it is often assumed that there is only one labeler available to label examples, and that this labeler is noise-free. In reality, it is possible that there are multiple labelers available (such as human labelers in the online annotation tool Amazon Mechanical Turk) and that each such labeler has a different cost and accuracy. We address the active learning problem with multiple labelers where each labeler has a different (known) cost and a different (unknown) accuracy. Our approach uses the idea of adjusted cost, which allows labelers with different costs and accuracies to be directly compared. This allows our algorithm to find low-cost combinations of labelers that result in high-accuracy labelings of instances. Our algorithm further reduces costs by pruning underperforming labelers from the set under consideration, and by halting the process of estimating the accuracy of the labelers as early as it can. We found that our algorithm often outperforms, and is always competitive with, other algorithms in the literature.
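One simple way to make labelers with different prices and error rates comparable is to normalize cost by accuracy. The sketch below uses expected cost per correct label as the adjusted cost; this particular formula and the helper names are illustrative assumptions, not the paper's exact definition:

```python
def adjusted_cost(cost, accuracy):
    """Illustrative adjusted cost (an assumption, not the paper's exact
    definition): expected spend per correct label for a labeler that
    charges `cost` and answers correctly with probability `accuracy`."""
    if accuracy <= 0.5:
        return float("inf")   # no better than coin-flipping: never worth querying
    return cost / accuracy

def cheapest_labeler(labelers):
    """Pick the labeler with the lowest adjusted cost.
    `labelers` is a list of (name, cost, accuracy) triples."""
    return min(labelers, key=lambda l: adjusted_cost(l[1], l[2]))[0]
```

With a comparable scale in hand, a cheap-but-noisy labeler can be traded off directly against an expensive-but-accurate one.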
International Conference on Data Mining | 2006
Matt Culver; Deng Kun; Stephen D. Scott
In active learning, a machine learning algorithm is given an unlabeled set of examples U, and is allowed to request labels for a relatively small subset of U to use for training. The goal is then to judiciously choose which examples in U to have labeled in order to optimize some performance criterion, e.g., classification accuracy. We study how active learning affects AUC (the area under the ROC curve). We examine two existing algorithms from the literature and present our own active learning algorithms designed to maximize the AUC of the hypothesis. One of our algorithms was consistently the top performer, and Closest Sampling from the literature often came in second behind it. When good posterior probability estimates were available, our heuristics were by far the best.
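The Closest Sampling baseline mentioned above selects the unlabeled examples nearest the current decision boundary. A minimal sketch (the function name and the signed-score interface are assumptions; any classifier exposing a decision value would do):

```python
def closest_sampling(unlabeled, decision_value, batch=1):
    """Closest Sampling (sketch): request labels for the `batch`
    unlabeled examples whose signed decision value is nearest zero,
    i.e. those the current hypothesis is least certain about."""
    return sorted(unlabeled, key=lambda x: abs(decision_value(x)))[:batch]
```

An AUC-driven strategy would instead score candidates by their expected effect on the ranking quality of the hypothesis rather than by raw boundary distance.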
Annals of Mathematics and Artificial Intelligence | 2003
Sally A. Goldman; Stephen D. Scott
Recently there has been significant research in multiple-instance learning, yet most of this work has only considered this model when there are Boolean labels. However, in many of the application areas for which the multiple-instance model fits, real-valued labels are more appropriate than Boolean labels. We define and study a real-valued multiple-instance model in which each multiple-instance example (bag) is given a real-valued label in [0, 1] that indicates the degree to which the bag satisfies the target concept. To provide additional structure to the learning problem, we associate a real-valued label with each point in the bag. These values are then combined using a real-valued aggregation operator to obtain the label for the bag. We then present on-line agnostic algorithms for learning real-valued multiple-instance geometric concepts defined by axis-aligned boxes in constant-dimensional space and describe several possible applications of these algorithms. We obtain our learning algorithms by reducing the problem to one in which the exponentiated gradient or gradient descent algorithm can be used. We also give a novel application of the virtual weights technique. In typical applications of the virtual weights technique, all of the concepts in a group have the same weight and prediction, allowing a single “representative” concept from each group to be tracked. However, in our application there are an exponential number of different weights and possible predictions. Hence, boxes in each group have different weights and predictions, making the computation of the contribution of a group significantly more involved. However, we are able to both keep the number of groups polynomial in the number of trials and efficiently compute the overall prediction.
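The exponentiated gradient reduction mentioned above maintains a weight per tracked concept, multiplies each weight by an exponential of its loss gradient, and renormalizes. A generic single step (the learning rate `eta` and the gradient vector are placeholders, not the paper's exact update):

```python
import math

def eg_update(weights, gradient, eta=0.1):
    """One exponentiated-gradient step (sketch): scale each weight by
    exp(-eta * gradient_i), then renormalize so the weights remain a
    probability distribution over the tracked concepts."""
    new = [w * math.exp(-eta * g) for w, g in zip(weights, gradient)]
    z = sum(new)
    return [w / z for w in new]
```

The virtual-weights technique then avoids storing the exponentially many weights explicitly by tracking groups of concepts; the paper's contribution is making that grouping work even when boxes within a group carry different weights and predictions.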
International Conference on Tools with Artificial Intelligence | 2004
Qingping Tao; Stephen D. Scott; N. V. Vinodchandran; Thomas Takeo Osugi; Brandon Mueller
The multiple-instance learning (MIL) model has been successful in areas such as drug discovery and content-based image retrieval. Recently, this model was generalized and a corresponding kernel was introduced to learn generalized MIL concepts with a support vector machine. While this kernel enjoyed empirical success, it has limitations in its representation. We extend this kernel by enriching its representation and empirically evaluate our new kernel on data from content-based image retrieval, biological sequence analysis, and drug discovery. We found that our new kernel generalized noticeably better than the old one in content-based image retrieval and biological sequence analysis and was slightly better than or even with the old kernel in the other applications, showing that an SVM using this kernel does not overfit despite its richer representation.