David Daly
IBM
Publications
Featured research published by David Daly.
international workshop on petri nets and performance models | 2001
Graham Clark; Tod Courtney; David Daly; Daniel D. Deavours; Salem Derisavi; Jay M. Doyle; William H. Sanders; Patrick G. Webster
Despite the development of many modeling formalisms and model solution methods, most tool implementations support only a single formalism. Furthermore, models expressed in the chosen formalism cannot be combined with models expressed in other formalisms. This monolithic approach both limits the usefulness of such tools to practitioners and hampers modeling research, since it is difficult to compare new and existing formalisms and solvers. This paper describes the method that a new modeling tool, called Mobius, uses to eliminate these limitations. Mobius provides an infrastructure to support multiple interacting formalisms and solvers, and is extensible in that new formalisms and solvers can be added to the tool without changing those already implemented. Mobius provides this capability through the use of an abstract functional interface, which provides a formalism-independent interface to models. This allows models expressed in multiple formalisms to interact with each other, and with multiple solvers.
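The core idea of a formalism-independent interface can be illustrated with a minimal sketch. The class and method names below are hypothetical, not the actual Mobius AFI: the point is only that a solver written against an abstract model interface never needs to know which formalism produced the model.

```python
from abc import ABC, abstractmethod

class Model(ABC):
    """Hypothetical formalism-independent model interface (AFI-style):
    solvers see only state variables and actions, never the formalism."""

    @abstractmethod
    def state_variables(self) -> dict: ...

    @abstractmethod
    def enabled_actions(self) -> list: ...

    @abstractmethod
    def fire(self, action) -> None: ...

class PetriNetModel(Model):
    """Toy Petri-net formalism: places hold tokens, one transition moves them."""
    def __init__(self):
        self.places = {"p0": 1, "p1": 0}

    def state_variables(self):
        return dict(self.places)

    def enabled_actions(self):
        return ["t0"] if self.places["p0"] > 0 else []

    def fire(self, action):
        self.places["p0"] -= 1
        self.places["p1"] += 1

def simulate(model: Model, steps: int) -> dict:
    """A 'solver' written only against the abstract interface."""
    for _ in range(steps):
        actions = model.enabled_actions()
        if not actions:
            break
        model.fire(actions[0])
    return model.state_variables()

print(simulate(PetriNetModel(), 10))  # {'p0': 0, 'p1': 1}
```

A second formalism (say, a queueing model) implementing the same three methods could be driven by the same `simulate` solver unchanged, which is the extensibility property the abstract functional interface provides.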
Lecture Notes in Computer Science | 2000
David Daly; Daniel D. Deavours; Jay M. Doyle; Patrick G. Webster; William H. Sanders
Mobius is a system-level performance and dependability modeling tool. Mobius makes validation of large dependability models possible by supporting many different model solution methods as well as model specification in multiple modeling formalisms.
international symposium on computer architecture | 2010
Jeffrey A. Stuecheli; Dimitris Kaseridis; David Daly; Hillery C. Hunter; Lizy Kurian John
In computer architecture, caches have primarily been viewed as a means to hide memory latency from the CPU. Cache policies have focused on anticipating the CPU's data needs, and are mostly oblivious to the main memory. In this paper, we demonstrate that the era of many-core architectures has created new main memory bottlenecks, and mandates a new approach: coordination of cache policy with main memory characteristics. Using the cache for memory optimization purposes, we propose a Virtual Write Queue which dramatically expands the memory controller's visibility of processor behavior, at low implementation overhead. Through memory-centric modification of existing policies, such as scheduled writebacks, this paper demonstrates that the performance-limiting effects of highly-threaded architectures can be overcome. We show that through awareness of the physical main memory layout and by focusing on writes, both read and write average latency can be shortened, memory power reduced, and overall system performance improved. Through full-system cycle-accurate simulations of SPEC CPU2006, we demonstrate that the proposed Virtual Write Queue achieves an average 10.9% system-level throughput improvement on memory-intensive workloads, along with an overall reduction of 8.7% in memory power across the whole suite.
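The memory-centric intuition behind scheduled writebacks can be sketched in a few lines. This is a toy illustration, not the paper's mechanism: the helper names and the DRAM-row mapping are assumptions. The idea shown is that draining dirty cache lines grouped by DRAM row amortizes each row activation over a burst of writes, which is where the latency and power savings come from.

```python
from collections import defaultdict

def schedule_writebacks(dirty_lines, row_of, burst=4):
    """Toy scheduled-writeback pass: group dirty cache lines by the DRAM
    row they map to, then drain the fullest rows first so each row
    activation services a burst of writes (all names hypothetical)."""
    by_row = defaultdict(list)
    for line in dirty_lines:
        by_row[row_of(line)].append(line)
    # Fullest rows first: fewer activations per write.
    order = sorted(by_row, key=lambda r: len(by_row[r]), reverse=True)
    return [(row, by_row[row][:burst]) for row in order]

# Assume 64 lines per DRAM row for illustration.
row_of = lambda addr: addr // 64
print(schedule_writebacks([0, 64, 65, 66, 1], row_of))
# [(1, [64, 65, 66]), (0, [0, 1])]
```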
Ibm Journal of Research and Development | 2015
William J. Starke; Jeffrey A. Stuecheli; David Daly; John Steven Dodson; Florian A. Auernhammer; Patricia M. Sagmeister; Guy Lynn Guthrie; Charles F. Marino; Michael S. Siegel; Bart Blaner
In this paper, we describe the IBM POWER8™ cache, interconnect, memory, and input/output subsystems, collectively referred to as the “nest.” This paper focuses on the enhancements made to the nest to achieve balanced and scalable designs, ranging from small 12-core single-socket systems, up to large 16-processor-socket, 192-core enterprise rack servers. A key aspect of the design has been increasing the end-to-end data and coherence bandwidth of the system, now featuring more than twice the bandwidth of the POWER7® processor. The paper describes the new memory-buffer chip, called Centaur, providing up to 128 MB of eDRAM (embedded dynamic random-access memory) buffer cache per processor, along with an improved DRAM (dynamic random-access memory) scheduler with support for prefetch and write optimizations, providing industry-leading memory bandwidth combined with low memory latency. It also describes new coherence-transport enhancements and the transition to directly integrated PCIe® (PCI Express®) support, as well as additions to the cache subsystem to support higher levels of virtualization and scalability including snoop filtering and cache sharing.
dependable systems and networks | 2003
Henrik C. Bohnenkamp; Tod Courtney; David Daly; Salem Derisavi; Holger Hermanns; Joost-Pieter Katoen; Ric Klaren; Vinh Vi Lam; William H. Sanders
…framework components. This translation preserves the structure of the models, allowing efficient solutions. The framework is implemented in the tool by a well-defined Abstract Functional Interface (AFI). Models and solution techniques interact with one another through the use of the standard interface, allowing them to interact with Mobius.
international parallel and distributed processing symposium | 2007
David Daly; Jong Hyuk Choi; José E. Moreira; Amos Waterland
Commercial scale-out is a new research project at IBM Research. Its main goal is to investigate and develop technologies for the use of large-scale parallelism in commercial applications, eventually leading to a commercial supercomputer. The project leverages and explores the features of IBM's BladeCenter family of products. A significant challenge in using a large cluster of servers is the installation and provisioning of the base operating system in those servers. Compounding this problem is the issue of maintenance of the software image in each server after its provisioning. This paper describes the system we developed to manage the installation, provisioning, and maintenance process for a cluster of blades, providing a base level of functionality to be used by higher-level management tools. The system leverages the management facilitation features of BladeCenter, and exploits the network and storage architecture of the commercial scale-out prototype cluster. It uses a single shared root filesystem image to reduce management complexity, and completely automates the process of bringing a new blade into the cluster upon its insertion into a BladeCenter chassis.
distributed systems operations and management | 2002
David Daly; Gautam Kar; William H. Sanders
As Web services are increasingly accepted and used, the next step for them is the development of hierarchical and distributed services that can perform more complex tasks. In this paper, we focus on how to develop guarantees for the performance of an aggregate service based on the guarantees provided by the lower-level services. In particular, we demonstrate the problem with an example of an e-commerce Web site implemented using Web services. The example is based on the Transaction Processing Performance Council (TPC) TPC-W Benchmark [8], which specifies an online store complete with a description of all the functionality of the site as well as a description of how customers use the site. We develop models of the site's performance based on the performance of two sub-services. The models' results are compared to experimental data and are used to predict the performance of the system under varying conditions.
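The composition idea can be made concrete with a standard queueing sketch. This is an illustrative assumption, not the paper's model: each sub-service is treated as an M/M/1 queue with a guaranteed service rate, and an aggregate service that invokes its sub-services in sequence has a mean response time equal to the sum of theirs.

```python
def mm1_response_time(service_rate, arrival_rate):
    """Mean response time of an M/M/1 queue: 1 / (mu - lambda).
    (A textbook model used here for illustration only.)"""
    assert arrival_rate < service_rate, "queue must be stable"
    return 1.0 / (service_rate - arrival_rate)

def aggregate_response_time(sub_service_rates, arrival_rate):
    """If the aggregate service calls its sub-services sequentially,
    mean response times add."""
    return sum(mm1_response_time(mu, arrival_rate) for mu in sub_service_rates)

# Two sub-services guaranteeing 100 and 50 requests/s, offered 40 requests/s:
print(aggregate_response_time([100.0, 50.0], 40.0))  # 1/60 + 1/10 ≈ 0.1167 s
```

A guarantee for the aggregate then follows from the sub-service guarantees: as the offered load approaches the smallest guaranteed rate (50 requests/s here), the predicted aggregate response time grows without bound, which bounds the load at which the composite guarantee can be honored.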
high performance computer architecture | 2012
David Daly; Harold W. Cain
The economics of server consolidation have led to the support of virtualization features in almost all server-class systems, with the related feature set being a subject of significant competition. While most systems allow for partitioning at the relatively coarse grain of a single core, some systems also support multiprogrammed virtualization, whereby a system can be more finely partitioned through time-sharing, down to a percentage of a core being allotted to a virtual machine. When multiple virtual machines share a single core however, performance can suffer due to the displacement of microarchitectural state. We introduce cache restoration, a hardware-based prefetching mechanism initiated by the underlying virtualization software when a virtual machine is being scheduled on a core, prefetching its working set and warming its initial environment. Through cycle-accurate simulation of a POWER7 system, we show that when applied to its private per-core L3 last-level cache, the warm cache translates into 20% on average performance improvement for a mixture of workloads on a highly partitioned core, compared to a virtualized server without cache restoration.
high performance computational finance | 2008
David Daly; Kyung Dong Ryu; José E. Moreira
Computational finance is an important application area for high-performance computing today. Large computational resources are used for a variety of operations related to securities and asset portfolios. For online operations, the focus has been both on reducing latency and on improving the quality of the algorithms. This focus on latency has forced a predominance of univariate analysis, simply from a feasibility perspective. In this paper, we demonstrate that current supercomputers, and in particular the Blue Gene family of supercomputers, enable the move to online multivariate analysis of entire markets. We use a simple but representative example of multivariate analysis, namely the computation of the correlation matrix, to explore that space. We show how the computation can be parallelized and run as an online real-time operation at the scale of thousands of securities and millions of events per second.
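The general shape of an online correlation computation can be sketched as follows. This is a sequential sketch of the standard running-sums approach, not the paper's parallel Blue Gene implementation: keeping O(n²) accumulated sums makes each market event a cheap incremental update, and the correlation of any pair can be read out at any time.

```python
import math

class StreamingCorrelation:
    """Online correlation of n securities from a stream of return vectors.
    Maintains running sums and cross-products, so each event is an O(n^2)
    update and no history needs to be stored."""

    def __init__(self, n):
        self.n = n
        self.count = 0
        self.s = [0.0] * n                       # running sum of returns
        self.ss = [[0.0] * n for _ in range(n)]  # running cross-products

    def update(self, returns):
        self.count += 1
        for i in range(self.n):
            self.s[i] += returns[i]
            for j in range(self.n):
                self.ss[i][j] += returns[i] * returns[j]

    def correlation(self, i, j):
        c = self.count
        cov = self.ss[i][j] / c - (self.s[i] / c) * (self.s[j] / c)
        var_i = self.ss[i][i] / c - (self.s[i] / c) ** 2
        var_j = self.ss[j][j] / c - (self.s[j] / c) ** 2
        return cov / math.sqrt(var_i * var_j)
```

Parallelizing this amounts to partitioning the (i, j) pairs of the matrix across nodes, which is what makes the market-scale, millions-of-events-per-second regime described above plausible.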
quantitative evaluation of systems | 2006
David Daly; Peter Buchholz; William H. Sanders
Stochastic orders can be applied to Markov reward models and used to aggregate models, while introducing a bounded error. Aggregation reduces the number of states in a model, mitigating the effect of the state-space explosion and enabling the wider use of Markov reward models. Existing aggregation techniques based upon stochastic orders are limited by a combination of strong requirements on the structure of the model, and complexity in determining the stochastic order and generating the aggregated model. We develop a set of general conditions in which models can be analyzed and aggregated compositionally, dramatically lowering the complexity of the aggregation and solution of the model. When these conditions are combined with a recently developed general stochastic order for Markov reward models, significantly larger models can be solved than was previously possible for a large class of models.
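For readers unfamiliar with state aggregation, the exact (zero-error) special case is easy to show. The sketch below implements ordinary lumping of a Markov generator matrix, which requires every state in a block to have equal total rates into each other block; the stochastic-order techniques discussed above can be seen as relaxing this exactness requirement in exchange for a bounded error. The function and its lumpability check are a textbook construction, not the paper's algorithm.

```python
def aggregate(Q, partition):
    """Lump generator matrix Q by a state partition, assuming ordinary
    lumpability: every state in a block must have the same total rate
    into each other block (checked by the assert)."""
    k = len(partition)
    Qa = [[0.0] * k for _ in range(k)]
    for a, block in enumerate(partition):
        for b, other in enumerate(partition):
            rates = [sum(Q[i][j] for j in other) for i in block]
            assert max(rates) - min(rates) < 1e-12, "partition not lumpable"
            Qa[a][b] = float(rates[0])
    return Qa

# Three states where states 1 and 2 behave symmetrically, lumped to two:
Q = [[-2.0, 1.0, 1.0],
     [ 3.0, -3.0, 0.0],
     [ 3.0, 0.0, -3.0]]
print(aggregate(Q, [[0], [1, 2]]))  # [[-2.0, 2.0], [3.0, -3.0]]
```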