William H. E. Day
Memorial University of Newfoundland
Publication
Featured research published by William H. E. Day.
Journal of Classification | 1984
William H. E. Day; Herbert Edelsbrunner
Whenever n objects are characterized by a matrix of pairwise dissimilarities, they may be clustered by any of a number of sequential, agglomerative, hierarchical, nonoverlapping (SAHN) clustering methods. These SAHN clustering methods are defined by a paradigmatic algorithm that usually requires O(n³) time, in the worst case, to cluster the objects. An improved algorithm (Anderberg 1973), while still requiring O(n³) worst-case time, can reasonably be expected to exhibit O(n²) expected behavior. By contrast, we describe a SAHN clustering algorithm that requires O(n² log n) time in the worst case. When SAHN clustering methods exhibit reasonable space distortion properties, further improvements are possible. We adapt a SAHN clustering algorithm, based on the efficient construction of nearest neighbor chains, to obtain a reasonably general SAHN clustering algorithm that requires in the worst case O(n²) time and space.

Whenever n objects are characterized by k-tuples of real numbers, they may be clustered by any of a family of centroid SAHN clustering methods. These methods are based on a geometric model in which clusters are represented by points in k-dimensional real space and points being agglomerated are replaced by a single (centroid) point. For this model, we have solved a class of special packing problems involving point-symmetric convex objects and have exploited it to design an efficient centroid clustering algorithm. Specifically, we describe a centroid SAHN clustering algorithm that requires O(n²) time, in the worst case, for fixed k and for a family of dissimilarity measures including the Manhattan, Euclidean, Chebychev and all other Minkowski metrics.
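The nearest-neighbor-chain idea underlying the O(n²) bound can be sketched as follows. This is a minimal illustration using single linkage (a reducible SAHN method), not the paper's general algorithm; all function and variable names are hypothetical.

```python
# Minimal sketch of nearest-neighbor-chain agglomerative clustering for
# single linkage. Follow nearest neighbors until a reciprocal pair is
# found, merge it, and continue from the remaining chain.
def nn_chain_single_linkage(d):
    """d: symmetric dissimilarity matrix (list of lists).
    Returns a list of merges (cluster_id_a, cluster_id_b, height)."""
    n = len(d)
    active = {i: {i} for i in range(n)}            # cluster id -> member set
    dist = {(i, j): d[i][j] for i in range(n) for j in range(n) if i != j}
    merges, chain = [], []
    while len(active) > 1:
        if not chain:
            chain.append(next(iter(active)))
        while True:
            a = chain[-1]
            # nearest active neighbor of the chain's tip
            b = min((c for c in active if c != a), key=lambda c: dist[(a, c)])
            if len(chain) > 1 and b == chain[-2]:
                break                              # reciprocal nearest neighbors
            chain.append(b)
        a, b = chain.pop(), chain.pop()
        merges.append((a, b, dist[(a, b)]))
        new = max(active) + 1
        active[new] = active.pop(a) | active.pop(b)
        for c in list(active):
            if c != new:
                # Lance-Williams update for single linkage: take the minimum
                dist[(new, c)] = dist[(c, new)] = min(dist[(a, c)], dist[(b, c)])
    return merges
```

Reducibility of the linkage method guarantees that the rest of the chain remains valid after each merge, which is what bounds the total work quadratically.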
Journal of Classification | 1985
William H. E. Day
Let Rn denote the set of rooted trees with n leaves in which: the leaves are labeled by the integers in {1, ..., n}; and among interior vertices only the root may have degree two. Associated with each interior vertex v in such a tree is the subset, or cluster, of leaf labels in the subtree rooted at v. Cluster {1, ..., n} is called trivial. Clusters are used in quantitative measures of similarity, dissimilarity and consensus among trees. For any k trees in Rn, the strict consensus tree C(T1, ..., Tk) is that tree in Rn containing exactly those clusters common to every one of the k trees. Similarity between trees T1 and T2 in Rn is measured by the number S(T1, T2) of nontrivial clusters in both T1 and T2; dissimilarity, by the number D(T1, T2) of clusters in T1 or T2 but not in both. Algorithms are known to compute C(T1, ..., Tk) in O(kn²) time, and S(T1, T2) and D(T1, T2) in O(n²) time. I propose a special representation of the clusters of any tree T ∈ Rn, one that permits testing in constant time whether a given cluster exists in T. I describe algorithms that exploit this representation to compute C(T1, ..., Tk) in O(kn) time, and S(T1, T2) and D(T1, T2) in O(n) time. These algorithms are optimal in a technical sense. They enable well-known indices of consensus between two trees to be computed in O(n) time. All these results apply as well to comparable problems involving unrooted trees with labeled leaves.
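The cluster-set definitions above can be illustrated directly, if not at the paper's O(kn) speed: represent each tree by the set of clusters at its interior vertices, then take set intersections and symmetric differences. The nested-tuple tree encoding and function names below are illustrative, not from the paper.

```python
# Naive illustration of clusters, strict consensus, S and D on rooted
# leaf-labeled trees encoded as nested tuples, e.g. ((1, 2), (3, 4)).
def clusters(tree):
    """Return the set of clusters (leaf-label sets of interior vertices)."""
    out = set()
    def walk(t):
        if isinstance(t, tuple):
            leaves = frozenset().union(*(walk(c) for c in t))
            out.add(leaves)
            return leaves
        return frozenset([t])        # leaves themselves are not clusters here
    walk(tree)
    return out

def strict_consensus_clusters(trees):
    """Clusters common to every tree in the profile."""
    return set.intersection(*(clusters(t) for t in trees))

def similarity(t1, t2):
    """S(T1, T2): number of shared nontrivial clusters."""
    c1, c2 = clusters(t1), clusters(t2)
    trivial = frozenset().union(*c1)  # the full leaf set {1, ..., n}
    return len((c1 & c2) - {trivial})

def dissimilarity(t1, t2):
    """D(T1, T2): clusters in T1 or T2 but not in both."""
    return len(clusters(t1) ^ clusters(t2))
```

This set-based version costs O(n²) per tree; the paper's contribution is a cluster representation that brings the same computations down to linear time.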
Bulletin of Mathematical Biology | 1987
William H. E. Day
Molecular biologists strive to infer evolutionary relationships from quantitative macromolecular comparisons obtained by immunological, DNA hybridization, electrophoretic or amino acid sequencing techniques. The problem is to find unrooted phylogenies that best approximate a given dissimilarity matrix according to a goodness-of-fit measure, for example the least-squares-fit criterion or Farris's f statistic. Computational costs of known algorithms guaranteeing optimal solutions to these problems increase exponentially with problem size; practical computational considerations limit the algorithms to analyzing small problems. It is established here that problems of phylogenetic inference based on the least-squares-fit criterion and the f statistic are NP-complete and thus are so difficult computationally that efficient optimal algorithms are unlikely to exist for them.
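Evaluating the least-squares criterion for one candidate phylogeny is easy; the NP-complete part is the search over tree topologies and edge lengths. A minimal sketch, with an assumed adjacency-dict encoding of a weighted unrooted tree (names hypothetical):

```python
# Score one weighted tree against a dissimilarity matrix under the
# least-squares-fit criterion: sum of squared differences between the
# observed dissimilarities and the tree's leaf-to-leaf path lengths.
from itertools import combinations

def path_lengths(adj, leaves):
    """All-pairs leaf path lengths in a weighted tree (DFS from each leaf)."""
    p = {}
    for s in leaves:
        dist, stack = {s: 0.0}, [s]
        while stack:
            u = stack.pop()
            for v, w in adj[u].items():
                if v not in dist:
                    dist[v] = dist[u] + w
                    stack.append(v)
        for t in leaves:
            p[(s, t)] = dist[t]
    return p

def least_squares_fit(adj, leaves, d):
    """d: dict mapping leaf pairs (i, j), i before j in `leaves`, to data."""
    p = path_lengths(adj, leaves)
    return sum((d[(i, j)] - p[(i, j)]) ** 2 for i, j in combinations(leaves, 2))
```

An exhaustive optimal method must repeat such an evaluation (and fit edge lengths) over a number of topologies that grows exponentially with the number of taxa, which is where the intractability lies.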
Mathematical Social Sciences | 1981
William H. E. Day
Day [3] describes an analytical model of minimum-length sequence (MLS) metrics measuring distances between partitions of a set. By selecting suitable values of model coordinates, a user may identify within the model that metric most appropriate to his classification application. Users should understand that within the model similar metrics may nevertheless exhibit extreme differences in their computational complexities. For example, the asymptotic time complexities of two MLS metrics are known to be linear in the number of objects being partitioned; yet we establish below that the computational problem for a closely related MLS metric is NP-complete.
Journal of Theoretical Biology | 1983
William H. E. Day
Abstract A basic problem in phylogenetic systematics is to construct an evolutionary hypothesis, or phylogenetic tree, from available data for a set of operational taxonomic units (OTUs). Associated with the edges of such trees are weights that usually are interpreted as lengths. Methods proposed for constructing phylogenetic trees attempt to select from among the myriad alternatives a tree that optimizes in some sense the fit of tree topology and edge lengths with the original data. One optimization criterion seeks a most parsimonious tree in which the sum of edge lengths is a minimum. Researchers have failed to develop efficient algorithms to compute optimal solutions for important variations of the parsimonious tree construction problem. Recently Graham & Foulds (1982) proved that a special case of the problem is NP-complete, thus making it unlikely that the computational problem for this case can be solved efficiently. I describe three other parsimonious tree construction problems and prove that they, too, are NP-complete.
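The contrast the abstract draws can be made concrete: scoring a single fixed tree is cheap even though searching all trees is NP-complete. Fitch's method, for example, counts the minimum number of character-state changes on one rooted binary tree in linear time. A minimal sketch; the nested-tuple tree encoding is illustrative.

```python
# Fitch parsimony length of one fixed rooted binary tree for a single
# character: bottom-up, keep the set of states achievable without extra
# cost; an empty intersection at a vertex forces one change.
def fitch_length(tree, states):
    """tree: nested pairs of leaf names; states: leaf name -> state.
    Returns the minimum number of state changes on this tree."""
    changes = 0
    def walk(t):
        nonlocal changes
        if isinstance(t, tuple):
            left, right = map(walk, t)
            if left & right:
                return left & right
            changes += 1          # a change must occur on an incident edge
            return left | right
        return {states[t]}
    walk(tree)
    return changes
```

The hardness results concern choosing, among the exponentially many trees, one whose total length (summed over characters or edges) is minimum, not this per-tree evaluation.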
Bellman Prize in Mathematical Biosciences | 1983
William H. E. Day
Abstract In numerical taxonomy there is considerable interest in developing theory and methodology to provide for the quantitative comparison of classifications. I describe a mathematical model in which this problem can be addressed, and with it I develop a quantitative measure of the fit of a consensus classification to the set of classifications from which the consensus is derived. The model assumes that classifications are related by a partial order and have associated with them a quantitative measure of classification complexity. I use basic results concerning valuations on partially ordered sets to exhibit relationships among the concepts of classification complexity, distance between classifications, and strict consensus of classifications. Classification complexity is extended parsimoniously to provide a measure of the complexity of sets of classifications. Computing this set complexity measure is a difficult, in fact NP-complete, problem. The measure of consensus fit is based on Papentin's idea that pattern complexity can be differentiated into organized complexity (i.e., the minimal description of rules underlying the pattern) and unorganized complexity (i.e., the minimal description of the random aspects of the pattern). Imagine a set of classifications as forming a pattern within the poset of classifications: the complexity of a consensus classification estimates the pattern's organized complexity, while its unorganized complexity is estimated by a variation of the set complexity problem. These estimates are used to construct a normalized fit measure whose extreme values are approached as the corresponding estimates approach zero. The fit measure is illustrated using majority-rule and strict-consensus methods. Computing this fit measure is a difficult, in fact NP-complete, problem.
Bulletin of Mathematical Biology | 1985
William H. E. Day; Fred R. McMorris
A consensus index method comprises a consensus method and a consensus index that are defined on a common set of objects (e.g. classifications). For each profile of objects, the consensus method returns a consensus object representing information or structure shared among profile objects, while the consensus index returns a quantitative measure of agreement among profile objects. Since the relationship between consensus method and consensus index is poorly understood, we propose simple axioms prescribing it in the most general terms. Many taxonomic consensus index methods violate these axioms because their consensus indices measure consensus object invariants rather than profile agreement. We propose paradigms to obtain consensus index methods that measure agreement and satisfy the axioms. These paradigms salvage concepts underlying consensus index methods violating the axioms.
Journal of Theoretical Biology | 1983
William H. E. Day
Abstract The crossover or nearest neighbor interchange metric has been proposed for use in numerical taxonomy to obtain a quantitative measure of distance between classifications that are modeled as unrooted binary trees with labeled leaves. This metric seems difficult to compute and its properties are poorly understood. A variant called the closest partition distance measure has also been proposed, but no efficient algorithm for its computation has yet appeared and its relationship to the nearest neighbor interchange metric is incompletely understood. I investigate four conjectures concerning the nearest neighbor interchange and closest partition distance measures and establish their validity for trees with as many as seven labeled vertices. For trees in this size range the two distance measures are identical. If a certain decomposition property holds for the nearest neighbor interchange metric, then the two distance measures are also identical at small distances for trees of any size.
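A single nearest neighbor interchange can be pictured concretely: an internal edge of a binary tree has four subtrees hanging off it, and an NNI swaps two of them across the edge, giving the two alternative arrangements. A minimal sketch on a rooted nested-tuple encoding (the paper's trees are unrooted; this encoding and the function name are illustrative):

```python
# The two trees one NNI move away across the central edge of a tree of
# the form ((a, b), (c, d)): swap one child of each side.
def nni_moves(tree):
    (a, b), (c, d) = tree          # internal edge joins the two pairs
    yield ((a, c), (b, d))
    yield ((a, d), (b, c))
```

The metric counts the minimum number of such moves needed to transform one tree into another, which is what makes it hard to compute: the move sequences, unlike the moves, are not local.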
Mathematical and Computer Modelling | 1993
William H. E. Day; F. R. McMorris
Two important consensus problems are closely related to two well-known sequence problems. M. Waterman's problem of finding consensus strings is a natural extension of the Longest Common Substring problem. The problem of identifying consensus subsequences is a natural extension of the Longest Common Subsequence problem, and thus is NP-hard.
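For context, the two-string Longest Common Subsequence problem that the consensus-subsequence problem extends is itself polynomial, solvable in O(mn) time by the standard dynamic program; it is the many-sequence generalization that is NP-hard. A minimal sketch:

```python
# Classic LCS dynamic program: L[i][j] is the length of a longest common
# subsequence of the prefixes a[:i] and b[:j].
def lcs(a, b):
    m, n = len(a), len(b)
    L = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1   # extend a common subsequence
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    return L[m][n]
```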
Mathematical Social Sciences | 1983
Ralph P. Boland; Edward Brown; William H. E. Day
Abstract In numerical taxonomy one may wish to measure the dissimilarity of classifications S and T by computing the distance between them with an appropriate metric. A minimum-length-sequence (MLS) metric requires that the user identify a set X of meaningful transformations of classifications; the MLS metric μ_X is then defined by requiring that μ_X(S, T) be the length of a shortest sequence of transformations from X that carries S into T. For a given application it may be relatively easy to identify an appropriate set X of transformations, but it may be difficult or impossible to design an efficient algorithm to compute μ_X. In this case it is natural to restrict the definition to obtain an approximation ϱ to the original metric μ_X such that ϱ has an efficient algorithm for its computation. This restriction process must be performed carefully lest the approximation fail to satisfy the metric properties. We present a general result about this problem and apply it in two ways. First we prove that a published ‘metric’ on partitions of a set in fact violates the triangle inequality and so is merely a semimetric. Then we clarify the relationship between the nearest neighbor interchange metric on labeled binary trees and the closest partition distance measure proposed by Waterman and Smith (1978).