Francis Patrick O'Connell

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Francis Patrick O'Connell is active.

Explore More

Publication

Featured researches published by Francis Patrick O'Connell.

international conference on parallel architectures and compilation techniques | 2012

Making data prefetch smarter: adaptive prefetching on POWER7

Víctor Jiménez; Roberto Gioiosa; Francisco J. Cazorla; Alper Buyuktosunoglu; Pradip Bose; Francis Patrick O'Connell

Hardware data prefetch engines are integral parts of many general purpose server-class microprocessors in the field today. Some prefetch engines allow the user to change some of their parameters. The prefetcher, however, is usually enabled in a default configuration during system bring-up and dynamic reconfiguration of the prefetch engine is not an autonomic feature of current machines. Conceptually, however, it is easy to infer that commonly used prefetch algorithms, when applied in a fixed mode will not help performance in many cases. In fact, they may actually degrade performance due to useless bus bandwidth consumption and cache pollution. In this paper, we present an adaptive prefetch scheme that dynamically modifies the prefetch settings in order to adapt to the workload requirements. We implement and evaluate adaptive prefetching in the context of an existing, commercial processor, namely the IBM POWER7. Our adaptive prefetch mechanism improves performance with respect to the default prefetch setting up to 2.7X and 30% for single-threaded and multiprogrammed workloads, respectively.

high-performance computer architecture | 2015

Increasing multicore system efficiency through intelligent bandwidth shifting

Víctor Jiménez; Alper Buyuktosunoglu; Pradip Bose; Francis Patrick O'Connell; Francisco J. Cazorla; Mateo Valero

Memory bandwidth is a crucial resource in computing systems. Current CMP/SMT processors have a significant number of cores and they can run many threads concurrently. This large thread count adds high pressure to the memory bus, which demands high bandwidth to service memory requests from the cores. Hardware data prefetching is a well-known technique for hiding memory latency. Due to its speculative nature, however, in some situations prefetching does not effectively work, wasting memory bandwidth and polluting the caches. Data prefetching efficiency depends on the prefetching algorithm. It also depends on the characteristics of the applications running on the system. In this paper we propose an online bandwidth shifting mechanism that dynamically assigns bandwidth to applications according to their prefetch efficiency. This mechanism maximizes the utilization of memory bandwidth, thereby improving system performance and/or reducing memory power consumption. To the best of our knowledge, this solution is the first to not require hardware support. We evaluate the benefits of using our bandwidth shifting mechanism on a real system - the IBM POWER7. We obtain speedups in the order of 10-20% (in one instance, speedup exceeds 1.6X). Our mechanism does not generate a significant degree of unfairness among the applications. In many cases individual thread performance increases by 10-35%, while virtually no thread experiences a slowdown larger than 5%.

Journal of Systems Architecture | 1999

Bounds modelling and compiler optimizations for superscalar performance tuning

Pradip Bose; Sunil Kim; Francis Patrick O'Connell; William A. Ciarfella

Abstract We consider the floating point microarchitecture support in RISC superscalar processors. We briefly review the fundamental performance trade-offs in the design of such microarchitecutres. We propose a simple, yet effective bounds model to deduce the “best-case” loop performance limits for these processors. We compare these bounds to simulated and real performance measurements. From this study, we identify several loop tuning opportunities. In particular, we illustrate the use of this analysis in suggesting loop unrolling and scheduling heuristics. We report our experimental results in the context of a set of application-based loop test cases. These are designed to stress various resource limits in the core (infinite cache) microarchitecture.

Archive | 2000