David M. Koppelman
Louisiana State University
Publication
Featured research published by David M. Koppelman.
International Performance Computing and Communications Conference | 2007
Lu Peng; Jih-Kwon Peir; Tribuvan K. Prakash; Yen-Kuang Chen; David M. Koppelman
As chip multiprocessors (CMPs) have become the mainstream in processor architectures, Intel and AMD have introduced their dual-core processors to the PC market. In this paper, performance studies on an Intel Core 2 Duo, an Intel Pentium D and an AMD Athlon 64 X2 processor are reported. According to the design specifications, key differences exist in the critical memory hierarchy architecture among these dual-core processors. In addition to overall execution time and throughput measurements using both multiprogrammed and multi-threaded workloads, this paper provides a detailed analysis of memory hierarchy performance and of performance scalability between single and dual cores. Our results indicate that for the best performance and scalability, it is important to have (1) fast cache-to-cache communication, (2) large L2 or shared capacity, (3) low L2-to-core latency, and (4) fair cache resource sharing. The three dual-core processors that we studied show the benefits of some of these factors, but not all of them. Core 2 Duo has the best performance for most of the workloads because of microarchitecture features such as its shared L2 cache. Pentium D shows the worst performance in many respects because it is largely a technology remap of the Pentium 4.
Journal of Systems Architecture | 2008
Lu Peng; Jih-Kwon Peir; Tribuvan K. Prakash; Carl Staelin; Yen-Kuang Chen; David M. Koppelman
As chip multiprocessors (CMPs) have become the mainstream in processor architectures, Intel and AMD have introduced their dual-core processors. In this paper, performance measurements on an Intel Core 2 Duo, an Intel Pentium D and an AMD Athlon 64 X2 processor are reported. According to the design specifications, key differences exist in the critical memory hierarchy architecture among these dual-core processors. In addition to overall execution time and throughput measurements using both multiprogrammed and multi-threaded workloads, this paper provides a detailed analysis of memory hierarchy performance and of performance scalability between single and dual cores. Our results indicate that for better performance and scalability, it is important to have (1) fast cache-to-cache communication, (2) large L2 or shared capacity, (3) low L2-to-core latency, and (4) fair cache resource sharing. The three dual-core processors that we studied show the benefits of some of these factors, but not all of them. Core 2 Duo has the best performance for most of the workloads because of microarchitecture features such as the shared L2 cache. Pentium D shows the worst performance in many respects because it is largely a technology remap of the Pentium 4 that does not take advantage of on-chip communication.
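The cache-to-cache and L2-to-core latencies discussed above are typically exposed with dependent-load (pointer-chasing) microbenchmarks. Below is a minimal single-threaded sketch of that idea: it exposes per-level load latency as the working-set footprint grows, while measuring cache-to-cache transfers would additionally require a second thread touching the same lines. This is not the benchmark suite used in the paper, and the footprint sizes and iteration count are illustrative.

#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

// Estimate average load-to-use latency for a given footprint by chasing a
// single random cycle through an array of indices; each level of the memory
// hierarchy shows up as a latency plateau as the footprint grows.
static double chase_ns(std::size_t bytes) {
  const std::size_t n = bytes / sizeof(std::size_t);
  std::vector<std::size_t> perm(n), next(n);
  std::iota(perm.begin(), perm.end(), std::size_t{0});
  std::shuffle(perm.begin(), perm.end(), std::mt19937_64{42});
  for (std::size_t i = 0; i < n; ++i) next[perm[i]] = perm[(i + 1) % n];

  volatile std::size_t idx = perm[0];
  const std::size_t steps = 10'000'000;              // dependent loads, so little overlap
  const auto t0 = std::chrono::steady_clock::now();
  for (std::size_t s = 0; s < steps; ++s) idx = next[idx];
  const auto t1 = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::nano>(t1 - t0).count() / steps;
}

int main() {
  for (std::size_t kib : {16, 256, 4096, 65536})     // footprints spanning L1 through DRAM
    std::printf("%8zu KiB  %6.1f ns per load\n", kib, chase_ns(kib * 1024));
}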
International Conference on Parallel Architectures and Compilation Techniques | 2000
David M. Koppelman
A multiprocessor prefetch scheme is described in which a miss is followed by a prefetch of a group of lines, a neighborhood, surrounding the demand-fetched line. The neighborhood is based on the data address and the past behavior of the instruction that missed the cache. A neighborhood for an instruction is constructed by recording the offsets of addresses that subsequently miss. Neighborhood prefetching can exploit sequential access, as sequential prefetch does, and can to some extent exploit strided access, as stride prefetch does; unlike stride and sequential prefetch, it can also support irregular access patterns. Neighborhood prefetching was compared to adaptive sequential prefetching using execution-driven simulation. Results show more useful prefetches and lower execution time for neighborhood prefetching on six of eight SPLASH-2 benchmarks. Over the eight SPLASH-2 benchmarks the average normalized execution time is less than 0.9; for three benchmarks it is less than 0.8.
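As a rough illustration of the mechanism described above, the sketch below keeps a per-instruction (per-PC) table of line offsets learned from that instruction's later misses and replays them as prefetch candidates on each new miss. It is a software simplification for exposition only; the paper's scheme uses bounded hardware tables and a different learning window.

#include <cstdint>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Simplified neighborhood prefetcher: each missing PC accumulates line offsets
// observed between its consecutive misses; on a new miss the learned offsets
// are replayed around the demand-fetched line.
struct NeighborhoodPrefetcher {
  static constexpr int kLineBits = 6;                      // 64-byte lines (assumed)
  struct Entry {
    std::int64_t last_miss_line = -1;                      // last miss line by this PC
    std::unordered_set<std::int64_t> offsets;              // learned neighborhood
  };
  std::unordered_map<std::uint64_t, Entry> table;          // keyed by miss PC

  // Called on each demand miss; returns line addresses to prefetch.
  std::vector<std::int64_t> on_miss(std::uint64_t pc, std::uint64_t addr) {
    const std::int64_t line = static_cast<std::int64_t>(addr >> kLineBits);
    Entry& e = table[pc];
    std::vector<std::int64_t> prefetches;
    for (std::int64_t off : e.offsets) prefetches.push_back(line + off);
    if (e.last_miss_line >= 0)                             // learn offset from prior miss
      e.offsets.insert(line - e.last_miss_line);
    e.last_miss_line = line;
    return prefetches;
  }
};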
Scientific Programming | 2013
Marek Blazewicz; Ian Hinder; David M. Koppelman; Steven R. Brandt; Milosz Ciznicki; Michal Kierzynka; Frank Löffler; Jian Tao
Starting from a high-level problem description in terms of partial differential equations using abstract tensor notation, the Chemora framework discretizes, optimizes, and generates complete high-performance codes for a wide range of compute architectures. Chemora extends the capabilities of Cactus, facilitating efficient use of large-scale CPU/GPU systems for complex applications without low-level code tuning. Chemora achieves parallelism through MPI and multi-threading, combining OpenMP and CUDA. Optimizations include high-level code transformations, efficient loop traversal strategies, dynamically selected data and instruction cache usage strategies, and JIT compilation of GPU code tailored to the problem characteristics. The discretization is based on higher-order finite differences on multi-block domains. Chemora's capabilities are demonstrated by simulations of black hole collisions. This problem provides an acid test of the framework, as the Einstein equations contain hundreds of variables and thousands of terms.
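For readers unfamiliar with the discretization Chemora automates, the snippet below shows a hand-written fourth-order centered first derivative on a uniform one-dimensional grid. Chemora generates and optimizes stencils of this kind directly from the tensor-level problem description, so this loop is purely expository.

#include <cstddef>
#include <vector>

// Fourth-order centered first derivative on a uniform 1-D grid with spacing h:
//   u'(x) ~ (-u[i+2] + 8 u[i+1] - 8 u[i-1] + u[i-2]) / (12 h)
std::vector<double> d_dx_4th(const std::vector<double>& u, double h) {
  std::vector<double> du(u.size(), 0.0);
  for (std::size_t i = 2; i + 2 < u.size(); ++i)
    du[i] = (-u[i + 2] + 8.0 * u[i + 1] - 8.0 * u[i - 1] + u[i - 2]) / (12.0 * h);
  return du;                                   // boundary points left untouched here
}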
Journal of Computational Chemistry | 2015
Yun Ding; Ye Fang; Wei P. Feinstein; J. Ramanujam; David M. Koppelman; Juana Moreno; Michal Brylinski; Mark Jarrell
Molecular docking is an important component of computer-aided drug discovery. In this communication, we describe GeauxDock, a new docking approach that builds on the ideas of ligand homology modeling. GeauxDock features a descriptor-based scoring function integrating evolutionary constraints with physics-based energy terms, a mixed-resolution molecular representation of protein-ligand complexes, and an efficient Monte Carlo sampling protocol. To drive docking simulations toward experimental conformations, the scoring function was carefully optimized to produce a correlation between the total pseudoenergy and the native-likeness of binding poses. Indeed, benchmarking calculations demonstrate that GeauxDock has a strong capacity to identify near-native conformations across docking trajectories, with an area under the receiver operating characteristic curve of 0.85. By excluding closely related templates, we show that GeauxDock maintains its accuracy at lower levels of homology, with an increased contribution from physics-based energy terms compensating for weak evolutionary constraints. GeauxDock is available at http://www.institute.loni.org/lasigma/package/dock/.
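The sampling protocol builds on Metropolis Monte Carlo. A generic acceptance step is sketched below; the pose representation, move set, temperature, and pseudoenergy function (Pose, perturb, energy) are placeholders rather than GeauxDock's actual components.

#include <cmath>
#include <random>

// One Metropolis step: propose a perturbed pose, accept it if it lowers the
// energy, otherwise accept with Boltzmann probability exp(-(E_new - E_old)/T).
template <class Pose, class Energy, class Perturb, class Rng>
Pose metropolis_step(Pose current, double& e_current,
                     Energy energy, Perturb perturb, double temperature, Rng& rng) {
  Pose candidate = perturb(current, rng);          // e.g., rigid-body jitter plus torsions
  const double e_new = energy(candidate);
  std::uniform_real_distribution<double> u(0.0, 1.0);
  if (e_new <= e_current || u(rng) < std::exp(-(e_new - e_current) / temperature)) {
    e_current = e_new;                             // accept the candidate pose
    return candidate;
  }
  return current;                                  // reject: keep the previous pose
}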
PLOS ONE | 2016
Ye Fang; Yun Ding; Wei P. Feinstein; David M. Koppelman; Juana Moreno; Mark Jarrell; J. Ramanujam; Michal Brylinski
Computational modeling of drug binding to proteins is an integral component of direct drug design. In particular, structure-based virtual screening is often used to perform large-scale modeling of putative associations between small organic molecules and their pharmacologically relevant protein targets. Because of the large number of drug candidates to be evaluated, an accurate and fast docking engine is a critical element of virtual screening. Consequently, highly optimized docking codes are of paramount importance for the effectiveness of virtual screening methods. In this communication, we describe the implementation, tuning, and performance characteristics of GeauxDock, a recently developed molecular docking program. GeauxDock is built upon a Monte Carlo algorithm and features a novel scoring function combining physics-based energy terms with statistical and knowledge-based potentials. Developed specifically for heterogeneous computing platforms, the current version of GeauxDock can be deployed on modern multi-core Central Processing Units (CPUs) as well as massively parallel accelerators, the Intel Xeon Phi and NVIDIA Graphics Processing Units (GPUs). First, we carried out a thorough performance tuning of the high-level framework and the docking kernel to produce a fast serial code, which was then ported to shared-memory multi-core CPUs, yielding near-ideal scaling. Further, using a Xeon Phi gives a 1.9× performance improvement over a dual 10-core Xeon CPU, whereas the best GPU accelerator, a GeForce GTX 980, achieves a speedup as high as 3.5×. On that account, GeauxDock can take advantage of modern heterogeneous architectures to considerably accelerate structure-based virtual screening applications. GeauxDock is open-sourced and publicly available at www.brylinski.org/geauxdock and https://figshare.com/articles/geauxdock_tar_gz/3205249.
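The near-ideal multi-core scaling reported above is possible because candidate poses can be scored independently. A schematic OpenMP version of that pattern is shown below; Pose and the score callable stand in for GeauxDock's pose representation and pseudoenergy kernel, and the real code layers vectorization and accelerator-specific kernels on top of this structure.

#include <cstddef>
#include <vector>

// Score every pose independently across CPU cores; dynamic scheduling absorbs
// any per-pose cost variation. Compile with OpenMP enabled (e.g. -fopenmp).
template <class Pose, class ScoreFn>
std::vector<double> score_all(const std::vector<Pose>& poses, ScoreFn score) {
  std::vector<double> scores(poses.size());
  #pragma omp parallel for schedule(dynamic)
  for (std::ptrdiff_t i = 0; i < static_cast<std::ptrdiff_t>(poses.size()); ++i)
    scores[i] = score(poses[i]);                 // independent work items, no sharing
  return scores;
}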
Parallel Processing Letters | 1996
David M. Koppelman
The stereo-realization of a graph is an assignment of positions in Cartesian space to each of its vertices such that vertex density is bounded. A bound is derived on the average edge length in such a realization. It is similar to an earlier reported result, but the new bound can be applied to graphs for which the earlier result is not well suited. A more precise realization definition is also presented. The bound is applied to d-dimensional realizations of de Bruijn graphs, yielding an edge length of Ω((1 - 2^(-d)) r^(n/d) / (2n)), where r is the radix (number of distinct symbols) and n is the number of graph dimensions (number of symbol positions). The bound is also applied to shuffle-exchange graphs; for such graphs with small radix an edge-length bound is derived in terms of the radix r, the number of graph dimensions n, the average length lσ of shuffle edges, and the average length le of exchange edges.
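Written out explicitly, the de Bruijn bound quoted above reads as follows. The symbol for the average edge length is introduced here for readability, and the numerical instance is an illustrative evaluation of the stated formula, not a result quoted from the paper.

\[
  \bar{\ell} \;=\; \Omega\!\left( \frac{\left(1 - 2^{-d}\right)\, r^{\,n/d}}{2n} \right),
\]
so, for example, a binary de Bruijn graph ($r = 2$) realized in three-dimensional space ($d = 3$) has average edge length growing as
\[
  \Omega\!\left( \frac{7}{16} \cdot \frac{2^{\,n/3}}{n} \right),
\]
that is, exponentially in the number of graph dimensions $n$.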
IEEE Transactions on Parallel and Distributed Systems | 1994
David M. Koppelman
A method of reducing the volume of data flowing through the network in a shared-memory parallel computer (multiprocessor) is described. The reduction is achieved by difference coding the memory addresses in messages sent between processing elements (PEs) and memories. In an implementation, each PE would store the last address sent to each memory, and vice versa. Messages that would normally contain an address instead contain the difference between the addresses of the current and most recent messages. Trace-driven simulation shows that traffic volume (including data and overhead) can be reduced to 70% or less of its original level, even in systems using coherent caches. The reduction in traffic could result in a lower-cost or lower-latency network. The cost of the hardware needed to achieve this is small, and the delay added is insignificant compared to network latency.
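The difference-coding idea can be sketched in a few lines: each endpoint keeps, per communication partner, the last address exchanged, and messages carry only the signed difference. The sketch below omits the hardware details (field widths, variable-length encoding of small deltas, table resets) that a real design must address.

#include <cstdint>
#include <unordered_map>

// Per-endpoint address difference coder; sender and receiver keep mirror-image
// tables, so decode(encode(addr)) recovers the full address.
struct DiffCoder {
  std::unordered_map<std::uint32_t, std::uint64_t> last;   // partner id -> last address

  // Sender side: replace the full address with a (usually small) delta.
  std::int64_t encode(std::uint32_t partner, std::uint64_t addr) {
    std::uint64_t& prev = last[partner];                    // starts at 0 for a new partner
    const std::int64_t delta = static_cast<std::int64_t>(addr - prev);
    prev = addr;
    return delta;
  }

  // Receiver side: reconstruct the full address from the delta.
  std::uint64_t decode(std::uint32_t partner, std::int64_t delta) {
    std::uint64_t& prev = last[partner];
    prev += static_cast<std::uint64_t>(delta);
    return prev;
  }
};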
2016 IEEE 10th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSOC) | 2016
Yue Hu; David M. Koppelman; Steven R. Brandt
Stencil computations form the basis for computer simulations across almost every field of science, such as computational fluid dynamics, data mining, and image processing. Their mostly regular data access patterns potentially enable them to take advantage of the high computation and data bandwidth of GPUs, but only if data buffering and other issues are handled properly. Finding a good code generation strategy presents a number of challenges, one of which is making the best use of memory. GPUs have three types of on-chip storage: registers, shared memory, and read-only cache. The choice of storage type and how it is used (a buffering strategy) for each stencil array (grid function, GF) requires not only a good understanding of its stencil pattern but also of the efficiency of each type of storage for that GF, to avoid squandering storage that would be more beneficial to another GF. Our code-generation framework supports five buffering strategies. For a stencil computation with N GFs, the total number of possible assignments is 5^N. Large, complex stencil kernels may consist of dozens of GFs, resulting in significant search overhead. In this work, we present an analytic performance model for stencil computations on GPUs and study the behavior of the read-only cache and the L2 cache. Next, we propose an efficiency-based assignment algorithm, which operates by scoring a change in buffering strategy for a GF using a combination of (a) the predicted execution time and (b) on-chip storage usage. Using this scoring, an assignment for N GFs and b strategy types can be determined in (b - 1)N(N + 1)/2 steps. Results show that the performance model has good accuracy and that the assignment strategy is highly efficient.
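A greedy assignment loop consistent with the quoted step count is sketched below: each round scores the b - 1 alternative strategies for every not-yet-fixed GF, applies the single best-scoring change, and fixes that GF, for at most (b - 1)N(N + 1)/2 candidate evaluations. The score callback stands in for the model combining predicted execution time with on-chip storage usage (lower is better); the paper's actual scoring details are not reproduced here.

#include <cstddef>
#include <functional>
#include <vector>

// Greedy, efficiency-based buffering-strategy assignment (illustrative sketch).
std::vector<int> assign_buffering(std::size_t num_gfs, int num_strategies,
                                  const std::function<double(const std::vector<int>&)>& score) {
  std::vector<int> assignment(num_gfs, 0);          // strategy 0 = baseline for every GF
  std::vector<bool> fixed(num_gfs, false);

  for (std::size_t round = 0; round < num_gfs; ++round) {
    double best = score(assignment);                // current cost to beat
    std::size_t best_gf = num_gfs;
    int best_strategy = -1;
    for (std::size_t gf = 0; gf < num_gfs; ++gf) {
      if (fixed[gf]) continue;
      const int original = assignment[gf];
      for (int s = 0; s < num_strategies; ++s) {
        if (s == original) continue;                // only the b - 1 alternatives
        assignment[gf] = s;
        const double candidate = score(assignment);
        if (candidate < best) { best = candidate; best_gf = gf; best_strategy = s; }
      }
      assignment[gf] = original;
    }
    if (best_gf == num_gfs) break;                  // no improving change remains
    assignment[best_gf] = best_strategy;
    fixed[best_gf] = true;                          // one GF finalized per round
  }
  return assignment;
}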
Journal of Parallel and Distributed Computing | 1997
David M. Koppelman
A finite-buffered banyan network analysis technique designed to model networks at high traffic loads is presented. The technique specially models two states of the network queues: queue empty and queue congested (roughly, zero or one slots free). A congested queue tends to stay congested because packets bound for the queue accumulate in the previous stage. The expected duration of this state is computed using a probabilistic model of a switching module feeding the congested queue. A technique for finding the lower arrival rate to an empty queue is also described. The queues themselves are modeled using independent Markov chains with an additional congested state added. The new analysis technique is novel in its modeling of the higher arrival rate to a congested queue and the lower arrival rate to an empty queue. Comparison of queue state distributions obtained from the analysis and from simulation shows that an important feature of congestion is captured.
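To make the queue model concrete, the sketch below computes the stationary distribution of a small discrete-time Markov chain by repeated multiplication with its transition matrix; the states would correspond to queue occupancies plus the added congested state, but the transition probabilities are placeholders rather than the arrival and service rates derived in the analysis.

#include <cstddef>
#include <utility>
#include <vector>

// Approximate the stationary distribution pi of a row-stochastic transition
// matrix P by iterating pi_{t+1} = pi_t * P from the uniform distribution.
std::vector<double> steady_state(const std::vector<std::vector<double>>& P,
                                 std::size_t iterations = 10000) {
  const std::size_t n = P.size();
  std::vector<double> pi(n, 1.0 / n);
  for (std::size_t it = 0; it < iterations; ++it) {
    std::vector<double> next(n, 0.0);
    for (std::size_t i = 0; i < n; ++i)
      for (std::size_t j = 0; j < n; ++j)
        next[j] += pi[i] * P[i][j];
    pi = std::move(next);
  }
  return pi;                                       // approximate stationary distribution
}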