Publication


Featured research published by Konrad Malkowski.


international parallel and distributed processing symposium | 2005

Reducing power with performance constraints for parallel sparse applications

Guangyu Chen; Konrad Malkowski; Mahmut T. Kandemir; Padma Raghavan

Sparse and irregular computations constitute a large fraction of applications in the data-intensive scientific domain. While every effort is made to balance the computational workload in such computations across parallel processors, achieving sustained near machine-peak performance with close-to-ideal load balanced computation-to-processor mapping is inherently difficult. As a result, most of the time, the loads assigned to parallel processors can exhibit significant variations. While there have been numerous past efforts that study this imbalance from the performance viewpoint, to our knowledge, no prior study has considered exploiting the imbalance for reducing power consumption during execution. Power consumption in large-scale clusters of workstations is becoming a critical issue as noted by several recent research papers from both industry and academia. Focusing on sparse matrix computations in which underlying parallel computations and data dependencies can be represented by trees, this paper proposes schemes that save power through voltage/frequency scaling. Our goal is to reduce overall energy consumption by scaling the voltages/frequencies of those processors that are not in the critical path; i.e., our approach is oriented towards saving power without incurring performance penalties.
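
The core idea is that a subtree off the critical path can finish later without delaying the whole computation. A minimal sketch of that slack calculation follows; the tree shape, cycle counts, and frequency levels are all hypothetical, and the subtree slowdown is applied uniformly for simplicity, so this is an illustration of the principle rather than the paper's scheduler.

    # Sketch: pick the lowest frequency for each subtree that still lets it
    # finish alongside the critical (slowest) sibling subtree. Values invented.
    LEVELS = [1.0, 0.8, 0.6]                  # relative frequency levels
    children = {0: [1, 2], 1: [], 2: []}      # parent waits on its children
    cycles = {0: 100, 1: 400, 2: 150}         # per-task work in cycles

    def subtree_time(n):
        sub = max((subtree_time(k) for k in children[n]), default=0.0)
        return sub + cycles[n]                # finish time at full frequency

    def assign(n, freqs):
        kids = children[n]
        if not kids:
            return
        crit = max(subtree_time(k) for k in kids)
        for k in kids:
            ratio = subtree_time(k) / crit    # share of the critical time
            # Slowest level whose uniform slowdown still fits the slack.
            freqs[k] = min(f for f in LEVELS if ratio / f <= 1.0)
            assign(k, freqs)

    freqs = {0: 1.0}                          # root stays on the critical path
    assign(0, freqs)
    print(freqs)                              # node 2 (150 cycles) drops to 0.6

Here, slowing node 2 to 0.6 stretches its 150 cycles to 250, still under the 400-cycle critical subtree, so no execution time is lost.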


international parallel and distributed processing symposium | 2010

Analyzing the soft error resilience of linear solvers on multicore multiprocessors

Konrad Malkowski; Padma Raghavan; Mahmut T. Kandemir

As chip transistor densities continue to increase, soft errors (bit flips) are becoming a significant concern in networked multiprocessors with multicore nodes. Large cache structures in multicore processors are especially susceptible to soft errors as they occupy a significant portion of the chip area. In this paper, we consider the impacts of soft errors in caches on the resilience and energy efficiency of sparse linear solvers. In particular, we focus on two widely used sparse iterative solvers, namely Conjugate Gradient (CG) and Generalized Minimum Residuals (GMRES). We propose two adaptive schemes, (i) a Write Eviction Hybrid ECC (WEH-ECC) scheme for the L1 cache and (ii) a Prefetcher Based Adaptive ECC (PBA-ECC) scheme for the L2 cache, and evaluate the energy and reliability trade-offs they bring in the context of GMRES and CG solvers. Our evaluations indicate that WEH-ECC reduces the CG and GMRES soft error vulnerability by a factor of 18 to 220 in L1 cache, relative to an unprotected L1 cache, and energy consumption by 16%, relative to a cache with strong protection. The PBA-ECC scheme reduces the CG and GMRES soft error vulnerability by a factor of 9 × 10³ to 8.6 × 10⁹, relative to an unprotected L2 cache, and reduces the energy consumption by 8.5%, relative to a cache with strong ECC protection. Our energy overheads over unprotected L1 and L2 caches are 5% and 14% respectively.
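
The specifics of WEH-ECC and PBA-ECC are in the paper itself; the general flavor of hybrid cache protection can be pictured as below. This is a hypothetical sketch, not the authors' policy: clean lines can be recovered by refetching from the next level, so cheap parity detection suffices, while dirty lines hold the only valid copy and warrant full SEC-DED correction.

    # Hypothetical hybrid-protection sketch (not the paper's exact scheme):
    # parity for clean lines, SEC-DED for dirty ones.
    class Line:
        def __init__(self, data, dirty=False):
            self.data, self.dirty = data, dirty

    def protection_for(line):
        # Dirty data would be lost on a flip, so pay the stronger-code cost.
        return "SEC-DED" if line.dirty else "parity"

    def secded_correct(data):
        return data       # placeholder: a real SEC-DED code fixes 1-bit flips

    def on_detected_error(line, refetch):
        if line.dirty:
            return secded_correct(line.data)   # correct in place
        return refetch()                       # clean: refetch from below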


international parallel and distributed processing symposium | 2008

Towards energy efficient scaling of scientific codes

Yang Ding; Konrad Malkowski; Padma Raghavan; Mahmut T. Kandemir

Energy consumption is becoming a crucial concern within the high performance computing community as computers expand to the peta-scale and beyond. Although the peak execution rates on tuned dense matrix operations in supercomputers have consistently increased to approach the peta-scale regime, the linear scaling of peak execution rates has been achieved at the expense of cubic growth in power with systems already appearing in the megawatt range. In this paper, we extend the ideas of algorithm scalability and performance iso-efficiency to characterize the system-wide energy consumption. The latter includes dynamic and leakage energy for CPUs, memories and network interconnects. We propose analytical models for evaluating energy scalability and energy efficiency. These models are important for understanding the power consumption trends of data intensive applications executing on a large number of processors. We apply the models to two scientific applications to explore opportunities when using voltage/frequency scaling for energy savings without degrading performance. Our results indicate that such models are critical for energy-aware high-performance computing in the tera- to peta-scale regime.
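
A toy model with the same shape as such an analysis can make the trade-off concrete; all constants and the communication term below are invented for illustration. Dynamic energy scales roughly with V²f per active core, while leakage accrues over the full runtime, so slowing down trades one against the other.

    # Toy energy-scalability model, illustration only: E(p, f) over T(p, f).
    def runtime(work, p, f, overhead=0.05):
        # Ideal work split plus an invented communication term.
        return work / (p * f) + overhead * (p ** 0.5) / f

    def energy(work, p, f, c_dyn=1.0, p_leak=0.3):
        t = runtime(work, p, f)
        v = f                                  # assume voltage tracks frequency
        dynamic = p * c_dyn * v ** 2 * f * t   # ~ C V^2 f per core
        leakage = p * p_leak * t               # per core, over the whole runtime
        return dynamic + leakage

    # Sweeping f at fixed p shows the tension: lower f cuts dynamic energy
    # but stretches runtime, so leakage grows and deadlines push back.
    for f in (1.0, 0.8, 0.6):
        print(f, round(energy(1e3, 64, f), 1))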


international conference on computer design | 2008

Ring data location prediction scheme for Non-Uniform Cache Architectures

Sayaka Akioka; Feihui Li; Konrad Malkowski; Padma Raghavan; Mahmut T. Kandemir; Mary Jane Irwin

Increases in cache capacity are accompanied by growing wire delays due to technology scaling. Non-uniform cache architecture (NUCA) is one of the proposed solutions for reducing the average access latency in such cache designs. While most of the prior NUCA work focuses on data placement, data replacement, and migration-related issues, this paper studies the problem of data search (access) in NUCA. In our architecture we arrange sets of banks with equal access latency into rings. Our last access based (LAB) prediction scheme predicts the ring that is expected to contain the required data and checks the banks in that ring first for the data block sought. We compare our scheme to two alternate approaches: searching all rings in parallel, and searching rings sequentially. We show that our LAB ring prediction scheme reduces L2 energy significantly over the sequential and parallel schemes, while maintaining similar performance. Our LAB scheme reduces energy consumption by 15.9% relative to the sequential lookup scheme, and 53.8% relative to the parallel lookup scheme.
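
A sketch of what a last-access-based predictor might look like follows; the table size and indexing are hypothetical. A small table records the ring where each block was last found, that ring is probed first, and only a mis-prediction triggers the sequential fallback.

    # Sketch of a LAB ring predictor: probe the last-seen ring first,
    # fall back to a sequential search on a mis-prediction.
    TABLE_SIZE = 1024                          # hypothetical table size

    class RingPredictor:
        def __init__(self, n_rings):
            self.n_rings = n_rings
            self.table = [0] * TABLE_SIZE      # last ring seen per entry

        def lookup(self, addr, probe_ring):
            """probe_ring(addr, ring) -> True if the block is in that ring."""
            idx = (addr >> 6) % TABLE_SIZE     # index by block address
            guess = self.table[idx]
            if probe_ring(addr, guess):
                return guess                   # hit on the first probe
            for ring in range(self.n_rings):   # fallback: sequential search
                if ring != guess and probe_ring(addr, ring):
                    self.table[idx] = ring     # remember for next time
                    return ring
            return None                        # L2 miss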


international parallel and distributed processing symposium | 2006

Integrated link/CPU voltage scaling for reducing energy consumption of parallel sparse matrix applications

Seung Woo Son; Konrad Malkowski; Guilin Chen; Mahmut T. Kandemir; Padma Raghavan

Reducing power consumption is quickly becoming a first-class optimization metric for many high-performance parallel computing platforms. One technique employed by many prior proposals in this direction is voltage scaling, which past research has applied to different components such as networks, CPUs, and memories. In contrast to most existing efforts on voltage scaling, which target a single component (CPU, network, or memory), this paper proposes and experimentally evaluates a voltage/frequency scaling algorithm that considers CPUs and communication links in a mesh network at the same time. More specifically, it scales voltages/frequencies of both the CPUs in the network and the communication links among them in a coordinated fashion (instead of one after another) such that energy savings are maximized without impacting execution time. Our experiments with several tree-based sparse matrix computations reveal that the proposed integrated voltage scaling approach is very effective in practice and brings 13% and 17% energy savings over the pure CPU and pure communication link voltage scaling schemes, respectively. The results also show that our savings are consistent across different network sizes and different sets of voltage/frequency levels.
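
The benefit of coordination can be seen in a small sketch; the frequency levels and timing model below are invented. Searching CPU and link settings jointly can trade compute slack against communication slack, which two independent passes (CPU first, then links) would miss.

    # Sketch: choose CPU and link frequencies jointly, not one after another.
    CPU_LEVELS = [1.0, 0.8, 0.6]
    LINK_LEVELS = [1.0, 0.75, 0.5]

    def pick_pair(compute_t, comm_t, slack, energy):
        """energy(fc, fl) -> estimated energy at that setting (caller-supplied)."""
        best, best_e = (1.0, 1.0), energy(1.0, 1.0)
        for fc in CPU_LEVELS:
            for fl in LINK_LEVELS:
                stretched = compute_t / fc + comm_t / fl
                if stretched <= compute_t + comm_t + slack:   # deadline holds
                    e = energy(fc, fl)
                    if e < best_e:
                        best, best_e = (fc, fl), e
        return best

For a compute-heavy task, for instance, the joint search can keep the CPU fast while halving the link frequency, a combination a CPU-first pass might never consider.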


international symposium on low power electronics and design | 2007

Phase-aware adaptive hardware selection for power-efficient scientific computations

Konrad Malkowski; Padma Raghavan; Mahmut T. Kandemir; Mary Jane Irwin

We present recent research on utilizing power, performance, and reliability trade-offs in meeting the demands of scientific applications. In particular, we summarize results of our recent publications on (i) phase-aware adaptive hardware selection for power-efficient scientific computations, (ii) adapting application execution to reduced CPU availability, and (iii) a helper thread based EDP reduction scheme for adapting application execution in CMPs. Increased power consumption and heat dissipation have become the major limiters of available computational resources at many high performance computing (HPC) centers. Applications that run at such centers typically operate in single-user mode, run for long periods of time, and have long-lasting application phases. Their users are interested in obtaining the maximum performance. We propose a phase-aware adaptive hardware selection technique, featuring data prefetchers and dynamic voltage and frequency scaling. Our technique takes advantage of memory-bound phases in scientific codes, resulting in significant power (39%) and energy (37%) reductions while maintaining or exceeding the performance of an unoptimized system.
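
One way to picture the phase-aware selection, with the metric, threshold, and configuration names all invented here: sample misses per kilo-instruction over each interval and switch to a low-frequency, prefetch-heavy configuration while a phase looks memory-bound.

    # Hypothetical phase detector: MPKI as the memory-boundedness proxy.
    MEMORY_BOUND_MPKI = 20.0              # invented threshold

    def select_config(mpki):
        if mpki >= MEMORY_BOUND_MPKI:
            # Memory-bound: the CPU mostly waits, so lower its frequency
            # and prefetch aggressively to keep the memory system busy.
            return {"freq": 0.6, "prefetcher": "aggressive"}
        # Compute-bound: full frequency, conservative prefetching.
        return {"freq": 1.0, "prefetcher": "conservative"}

    def run_phases(samples):
        return [select_config(m) for m in samples]   # one decision per interval

    print(run_phases([3.1, 25.4, 27.0, 4.2]))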


international parallel and distributed processing symposium | 2007

Load Miss Prediction - Exploiting Power Performance Trade-offs

Konrad Malkowski; Greg M. Link; Padma Raghavan; Mary Jane Irwin

Modern CPUs operate at GHz frequencies, but the latencies of memory accesses are still relatively large, on the order of hundreds of cycles. Deeper cache hierarchies with larger cache sizes can mask these latencies for codes with good data locality and reuse, such as structured dense matrix computations. However, cache hierarchies do not necessarily benefit sparse scientific computing codes, which tend to have limited data locality and reuse. We therefore propose a new memory architecture with a load miss predictor (LMP), which includes a data bypass cache and a predictor table, to reduce access latencies by determining whether a load should bypass the main cache hierarchy and issue an early load to main memory. Our architecture uses the L2 (and lower caches) as a victim cache for data removed from our bypass cache. We use cycle-accurate simulations with SimpleScalar and Wattch to show that our LMP improves the performance of sparse codes, our application domain of interest, on average by 14%, with a 13.6% increase in power. When the LMP is used with dynamic voltage and frequency scaling (DVFS), performance can be improved by 8.7% with system power savings of 7.3% and energy reduction of 17.3% at 1800 MHz relative to the base system at 2000 MHz. Alternatively, our LMP can be used to improve the performance of SPEC benchmarks by an average of 2.9% at the cost of a 7.1% increase in average power.
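
A minimal sketch of a PC-indexed miss predictor follows; the table size and counter width are hypothetical. Saturating counters learn which loads tend to miss, and those loads are issued early to memory, bypassing the cache hierarchy.

    # Sketch: saturating counters, indexed by load PC, predict cache misses.
    class LoadMissPredictor:
        def __init__(self, size=4096, bits=2):
            self.size = size
            self.max = (1 << bits) - 1
            self.ctr = [0] * size              # per-entry miss counters

        def predicts_miss(self, pc):
            return self.ctr[pc % self.size] > self.max // 2

        def update(self, pc, missed):
            i = pc % self.size
            if missed:
                self.ctr[i] = min(self.ctr[i] + 1, self.max)
            else:
                self.ctr[i] = max(self.ctr[i] - 1, 0)

    # On a predicted miss the load would go straight to memory, with data
    # evicted from the bypass cache falling back into L2 (the victim cache).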


international parallel and distributed processing symposium | 2006

Conjugate gradient sparse solvers: performance-power characteristics

Konrad Malkowski; Ingyu Lee; Padma Raghavan; Mary Jane Irwin

We characterize the performance and power attributes of the conjugate gradient (CG) sparse solver which is widely used in scientific applications. We use cycle-accurate simulations with SimpleScalar and Wattch, on a processor and memory architecture similar to the configuration of a node of the BlueGene/L. We first demonstrate that substantial power savings can be obtained without performance degradation if low power modes of caches can be utilized. We next show that if Dynamic Voltage Scaling (DVS) can be used, power and energy savings are possible, but these are realized only at the expense of performance penalties. We then consider two simple memory subsystem optimizations, namely memory and level-2 cache prefetching. We demonstrate that when DVS and low power modes of caches are used with these optimizations, performance can be improved significantly with reductions in power and energy. For example, execution time is reduced by 23%, power by 55% and energy by 65% in the final configuration at 500 MHz relative to the original at 1 GHz. We also use our codes and the CG NAS benchmark code to demonstrate that performance and power profiles can vary significantly depending on matrix properties and the level of code tuning. These results indicate that architectural evaluations can benefit if traditional benchmarks are augmented with codes more representative of tuned scientific applications.
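
For reference, the kernel being characterized is standard unpreconditioned CG, sketched below in NumPy. The sparse matrix-vector product dominates its runtime, which is why the memory-subsystem optimizations above pay off.

    # Minimal unpreconditioned conjugate gradient; A @ p is the dominant kernel.
    import numpy as np

    def cg(A, b, tol=1e-8, max_iter=1000):
        x = np.zeros_like(b, dtype=float)
        r = b - A @ x                       # initial residual
        p = r.copy()
        rs = r @ r
        for _ in range(max_iter):
            Ap = A @ p                      # sparse matrix-vector product
            alpha = rs / (p @ Ap)
            x += alpha * p
            r -= alpha * Ap
            rs_new = r @ r
            if np.sqrt(rs_new) < tol:
                break
            p = r + (rs_new / rs) * p       # new conjugate search direction
            rs = rs_new
        return x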


international parallel and distributed processing symposium | 2006

On improving performance and energy profiles of sparse scientific applications

Konrad Malkowski; Ingyu Lee; Padma Raghavan; Mary Jane Irwin

In many scientific applications, the majority of the execution time is spent within a few basic sparse kernels such as sparse matrix vector multiplication (SMV). Such sparse kernels can utilize only a fraction of the available processing speed because of their relatively large number of data accesses per floating point operation, and limited data locality and data re-use. Algorithmic changes and tuning of codes through blocking and loop unrolling schemes can improve performance but such tuned versions are typically not available in benchmark suites such as the SPEC CFP 2000. In this paper, we consider sparse SMV kernels with different levels of tuning that are representative of this application space. We emulate certain memory subsystem optimizations using SimpleScalar and Wattch to evaluate improvements in performance and energy metrics. We also characterize how such an evaluation can be affected by the interplay between code tuning and memory subsystem optimizations. Our results indicate that the optimizations reduce execution time by over 40%, and the energy by over 85%, when used with power control modes of CPUs and caches. Furthermore, the relative impact of the same set of memory subsystem optimizations can vary significantly depending on the level of code tuning. Consequently, it may be appropriate to augment traditional benchmarks by tuned kernels typical of high performance sparse scientific codes to enable comprehensive evaluations of future systems.
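
The baseline kernel in question is CSR sparse matrix-vector multiply, shown below. Each multiply-add needs a matrix value, a column index, and an irregular read of x, which is what limits locality and reuse; the tuned variants discussed in the paper layer register blocking and unrolling on top of this.

    # CSR SpMV baseline: y = A x with A stored as (vals, cols, rowptr).
    def spmv_csr(vals, cols, rowptr, x, y):
        for i in range(len(rowptr) - 1):
            acc = 0.0
            for k in range(rowptr[i], rowptr[i + 1]):
                acc += vals[k] * x[cols[k]]   # indirect, low-locality read of x
            y[i] = acc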


ieee international symposium on parallel distributed processing workshops and phd forum | 2010

T-NUCA - a novel approach to non-uniform access latency cache architectures for 3D CMPs

Konrad Malkowski; Padma Raghavan; Mahmut T. Kandemir; Mary Jane Irwin

We consider a non-uniform access latency cache architecture (NUCA) design for 3D chip multiprocessors (CMPs) where cache structures are divided into small banks interconnected by a network-on-chip (NoC). In earlier NUCA designs, data is placed in banks either statically (S-NUCA) or dynamically (D-NUCA). In both S-NUCA and D-NUCA designs, scaling to hundreds of cores can pose several challenges. Thus, we propose a new NUCA architecture with an inclusive, octal tree-based, hierarchical directory (T-NUCA-8), with the potential to scale to hundreds of cores with performance comparable to D-NUCA at a fraction of the energy cost. Our evaluations indicate that relative to D-NUCA, our T-NUCA-8 reduces network usage by 92%, energy by 87%, and EDP by 87%, at a performance cost of 10%.
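
One way to picture an 8-ary tree directory lookup; the structures below are hypothetical, not the paper's exact design. Each directory node keeps a presence mask over its eight children, so a lookup descends only the subtrees that can hold the block instead of broadcasting to every bank.

    # Hypothetical 8-ary hierarchical directory: presence masks are
    # maintained inclusively up the tree, so lookups prune whole subtrees.
    class DirNode:
        def __init__(self, children=None, bank=None):
            self.children = children or []   # up to 8 subtrees (empty at leaf)
            self.bank = bank                 # leaf: the cache bank itself
            self.presence = {}               # block -> 8-bit child mask

    def lookup(node, block):
        if node.bank is not None:            # leaf: check the bank's tags
            return [node.bank] if block in node.presence else []
        banks = []
        mask = node.presence.get(block, 0)
        for i, child in enumerate(node.children):
            if mask & (1 << i):              # descend only where a bit is set
                banks += lookup(child, block)
        return banks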

Collaboration


Dive into Konrad Malkowski's collaborations.

Top Co-Authors

Padma Raghavan, Pennsylvania State University
Mary Jane Irwin, Pennsylvania State University
Mahmut T. Kandemir, Pennsylvania State University
Guilin Chen, Pennsylvania State University
Ingyu Lee, Pennsylvania State University
Seung Woo Son, Argonne National Laboratory
Feihui Li, Pennsylvania State University
Greg M. Link, Pennsylvania State University