Alexander V. Veidenbaum

Publication

Featured research published by Alexander V. Veidenbaum.

design, automation, and test in europe | 2002

Profile-Based Dynamic Voltage Scheduling Using Program Checkpoints

Ana Azevedo; Ilya Issenin; Radu Cornea; Rajesh K. Gupta; Nikil D. Dutt; Alexander V. Veidenbaum; Alexandru Nicolau

Dynamic voltage scaling (DVS) is a known effective mechanism for reducing CPU energy consumption without significant performance degradation. While a lot of work has been done on inter-task scheduling algorithms to implement DVS under operating system control, new research challenges exist in intra-task DVS techniques under software and compiler control. In this paper we introduce a novel intra-task DVS technique under compiler control using program checkpoints. Checkpoints are generated at compile time and indicate places in the code where the processor speed and voltage should be re-calculated. Checkpoints also carry user-defined time constraints. Our technique handles multiple intra-task performance deadlines and modulates power consumption according to a run-time power budget. We experimented with two heuristics for adjusting the clock frequency and voltage. For the particular benchmark studied, one heuristic yielded 63% more energy savings than the other. With the best of the heuristics we designed, our technique resulted in 82% energy savings over the execution of the program without employing DVS.
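As a rough illustration of what recalculating the processor speed at a checkpoint could look like, the sketch below picks the lowest clock frequency that still meets a deadline. It is not either of the paper's two heuristics: the operating points in `FREQS_MHZ`, the function name `pick_frequency`, and the use of a remaining-cycle estimate are all assumptions made for this example.

```python
from bisect import bisect_left

# Hypothetical operating points in MHz, sorted ascending (not from the paper).
FREQS_MHZ = [100, 200, 300, 400]

def pick_frequency(remaining_cycles, now_us, deadline_us, freqs=FREQS_MHZ):
    """Return the lowest available frequency that still meets the deadline."""
    slack_us = deadline_us - now_us
    if slack_us <= 0:
        return freqs[-1]  # no slack left: run at the highest frequency
    # Cycles per microsecond is numerically equal to MHz.
    required_mhz = remaining_cycles / slack_us
    i = bisect_left(freqs, required_mhz)
    return freqs[min(i, len(freqs) - 1)]
```

Called at each checkpoint with an updated estimate of the remaining work, a routine like this lets the clock (and with it the supply voltage) drop whenever the program is ahead of its deadline.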

international conference on supercomputing | 1990

Compiler-directed data prefetching in multiprocessors with memory hierarchies

Edward H. Gornish; Elana D. Granston; Alexander V. Veidenbaum

Memory hierarchies are used by multiprocessor systems to reduce large memory access times. Such a hierarchy must be managed automatically to obtain effective memory utilization. In this paper, we discuss the various issues involved in obtaining an optimal memory management strategy for a memory hierarchy. We present an algorithm for finding the earliest point in a program at which a block of data can be prefetched. This determination is based on the control and data dependencies in the program. Such a method is an integral part of more general memory management algorithms. We demonstrate our method's potential by using static analysis to estimate the performance improvement afforded by our prefetching strategy and to analyze the reference patterns in a set of Fortran benchmarks. We also study the effectiveness of prefetching in a realistic shared-memory system using an RTL-level simulator and real codes. This differs from previous studies by considering prefetching benefits in the presence of network contention.

international conference on supercomputing | 1999

Adapting cache line size to application behavior

Alexander V. Veidenbaum; Weiyu Tang; Rajesh K. Gupta; Alexandru Nicolau; Xiaomei Ji

A cache line size has a significant effect on miss rate and memory traffic. Today’s computers use a fixed line size, typically 32B, which may not be optimal for a given application. Optimal size may also change during application execution. This paper describes a cache in which the line (fetch) size is continuously adjusted by hardware based on observed application accesses to the line. The approach can improve the miss rate, even over the optimal for the fixed line size, as well as significantly reduce the memory traffic.

Neural Networks | 2009

2009 Special Issue: A configurable simulation environment for the efficient simulation of large-scale spiking neural networks on graphics processors

Jayram Moorkanikara Nageswaran; Nikil D. Dutt; Jeffrey L. Krichmar; Alex Nicolau; Alexander V. Veidenbaum

Neural network simulators that take into account the spiking behavior of neurons are useful for studying brain mechanisms and for various neural engineering applications. Spiking Neural Networks (SNNs) have traditionally been simulated on large-scale clusters, super-computers, or on dedicated hardware architectures. Alternatively, Compute Unified Device Architecture (CUDA) Graphics Processing Units (GPUs) can provide a low-cost, programmable, and high-performance computing platform for simulation of SNNs. In this paper we demonstrate an efficient, biologically realistic, large-scale SNN simulator that runs on a single GPU. The SNN model includes Izhikevich spiking neurons, detailed models of synaptic plasticity and variable axonal delay. We allow user-defined configuration of the GPU-SNN model by means of a high-level programming interface written in C++ but similar to the PyNN programming interface specification. PyNN is a common programming interface developed by the neuronal simulation community to allow a single script to run on various simulators. The GPU implementation (on NVIDIA GTX-280 with 1 GB of memory) is up to 26 times faster than a CPU version for the simulation of 100K neurons with 50 Million synaptic connections, firing at an average rate of 7 Hz. For simulation of 10 Million synaptic connections and 100K neurons, the GPU SNN model is only 1.5 times slower than real-time. Further, we present a collection of new techniques related to parallelism extraction, mapping of irregular communication, and network representation for effective simulation of SNNs on GPUs. The fidelity of the simulation results was validated against CPU simulations using firing rate, synaptic weight distribution, and inter-spike interval analysis. Our simulator is publicly available to the modeling community so that researchers will have easy access to large-scale SNN simulations.
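For readers unfamiliar with the Izhikevich neuron mentioned above, the following is a minimal single-neuron sketch in plain Python. It uses the standard published model equations with the common "regular spiking" parameters (a=0.02, b=0.2, c=-65, d=8) and a simple 1 ms Euler step. The function name `simulate`, the step size, and the constant input current are assumptions for illustration; the sketch says nothing about how the simulator maps neurons or synapses onto the GPU.

```python
def simulate(current, steps=1000, a=0.02, b=0.2, c=-65.0, d=8.0):
    """Euler-integrate one Izhikevich neuron with 1 ms steps; return spike count."""
    v = c          # membrane potential (mV)
    u = b * v      # membrane recovery variable
    spikes = 0
    for _ in range(steps):
        dv = 0.04 * v * v + 5.0 * v + 140.0 - u + current
        du = a * (b * v - u)
        v += dv
        u += du
        if v >= 30.0:      # spike threshold reached: reset v and bump u
            v = c
            u += d
            spikes += 1
    return spikes
```

With no input current the neuron settles to its resting potential and never fires, while a sustained input of 10 produces repeated spikes.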

international symposium on microarchitecture | 2012

Improving Cache Management Policies Using Dynamic Reuse Distances

Nam Duong; Dali Zhao; Taesu Kim; Rosario Cammarota; Mateo Valero; Alexander V. Veidenbaum

Cache management policies such as replacement, bypass, or shared cache partitioning have been relying on data reuse behavior to predict the future. This paper proposes a new way to use dynamic reuse distances to further improve such policies. A new replacement policy is proposed which prevents replacing a cache line until a certain number of accesses to its cache set, called a Protecting Distance (PD). The policy protects a cache line long enough for it to be reused, but not beyond that to avoid cache pollution. This can be combined with a bypass mechanism that also relies on dynamic reuse analysis to bypass lines with less expected reuse. A miss fetch is bypassed if there are no unprotected lines. A hit rate model based on dynamic reuse history is proposed and the PD that maximizes the hit rate is dynamically computed. The PD is recomputed periodically to track a program's memory access behavior and phases. Next, a new multi-core cache partitioning policy is proposed using the concept of protection. It manages lifetimes of lines from different cores (threads) in such a way that the overall hit rate is maximized. The average per-thread lifetime is reduced by decreasing the thread's PD. The single-core PD-based replacement policy with bypass achieves an average speedup of 4.2% over the DIP policy, while the average speedups over DIP are 1.5% for dynamic RRIP (DRRIP) and 1.6% for sampling dead-block prediction (SDP). The 16-core PD-based partitioning policy improves the average weighted IPC by 5.2%, throughput by 6.4% and fairness by 9.9% over thread-aware DRRIP (TA-DRRIP). The required hardware is evaluated and the overhead is shown to be manageable.
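The protection and bypass rules described above can be sketched for a single cache set as follows. This is a simplified model with a fixed PD, not the paper's hardware design: the class name `PDSet`, the per-way counter `rpd`, and the choice to age every line before the lookup are assumptions made for this example, and the dynamic PD computation is omitted entirely.

```python
class PDSet:
    """One cache set managed with a fixed Protecting Distance (PD)."""

    def __init__(self, ways, pd):
        self.pd = pd
        self.lines = [None] * ways   # stored tags (None means invalid)
        self.rpd = [0] * ways        # remaining protecting distance per way

    def access(self, tag):
        """Return 'hit', 'fill', or 'bypass' for one access to this set."""
        # Every access to the set ages all of its lines by one.
        self.rpd = [max(0, r - 1) for r in self.rpd]
        if tag in self.lines:
            # A reused line is protected again for PD accesses.
            self.rpd[self.lines.index(tag)] = self.pd
            return "hit"
        for way, r in enumerate(self.rpd):
            if r == 0:               # unprotected line: safe to replace
                self.lines[way] = tag
                self.rpd[way] = self.pd
                return "fill"
        return "bypass"              # all lines protected: do not allocate
```

For a 2-way set with PD = 3, the access sequence A, B, C, A, C, B yields fill, fill, bypass, hit, fill, bypass: the first C is bypassed because both lines are still protected, which keeps A resident long enough to be reused.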

international symposium on neural networks | 2009

Efficient simulation of large-scale Spiking Neural Networks using CUDA graphics processors

Jayram Moorkanikara Nageswaran; Nikil D. Dutt; Jeffrey L. Krichmar; Alexandru Nicolau; Alexander V. Veidenbaum

Neural network simulators that take into account the spiking behavior of neurons are useful for studying brain mechanisms and for engineering applications. Spiking Neural Networks (SNNs) have traditionally been simulated on large-scale clusters, super-computers, or on dedicated hardware architectures. Alternatively, Graphics Processing Units (GPUs) can provide a low-cost, programmable, and high-performance computing platform for simulation of SNNs. In this paper we demonstrate an efficient, Izhikevich neuron based large-scale SNN simulator that runs on a single GPU. The GPU-SNN model (running on an NVIDIA GTX-280 with 1 GB of memory) is up to 26 times faster than a CPU version for the simulation of 100K neurons with 50 Million synaptic connections, firing at an average rate of 7 Hz. For simulation of 100K neurons with 10 Million synaptic connections, the GPU-SNN model is only 1.5 times slower than real-time. Further, we present a collection of new techniques related to parallelism extraction, mapping of irregular communication, and compact network representation for effective simulation of SNNs on GPUs. The fidelity of the simulation results was validated against CPU simulations using firing rate, synaptic weight distribution, and inter-spike interval analysis. We intend to make our simulator available to the modeling community so that researchers will have easy access to large-scale SNN simulations.

IEEE Computer | 1990

Compiler-directed cache management in multiprocessors

Hoichi Cheong; Alexander V. Veidenbaum

The necessity of finding alternatives to hardware-based cache coherence strategies for large-scale multiprocessor systems is discussed. Three different software-based strategies sharing the same goals and general approach are presented. They consist of a simple invalidation approach, a fast selective invalidation scheme, and a version control scheme. The strategies are suitable for shared-memory multiprocessor systems with interconnection networks and a large number of processors. Results of trace-driven simulations conducted on numerical benchmark routines to compare the performance of the three schemes are presented.

international symposium on computer architecture | 2004

A Content Aware Integer Register File Organization

Gonzalez Gonzalez; Adrian Cristal; Daniel Ortega; Alexander V. Veidenbaum; Mateo Valero

A register file is a critical component of a modern superscalar processor. It has a large number of entries and read/write ports in order to enable high levels of instruction parallelism. As a result, the register file's area, access time, and energy consumption increase dramatically, significantly affecting the overall superscalar processor's performance and energy consumption. This is especially true in 64-bit processors. This paper presents a new integer register file organization, which reduces energy consumption, area, and access time of the register file with a minimal effect on overall IPC. This is accomplished by exploiting a new concept, partial value locality, which is defined as the occurrence of multiple live value instances identical in a subset of their bits. A possible implementation of the new register file is described and shown to obtain proposed optimized register file designs. Overall, an energy reduction of over 50%, an 18% decrease in area, and a 15% reduction in the access time are achieved in the new register file. The energy and area savings are achieved with a 1.7% reduction in IPC for integer applications and a negligible 0.3% in numerical applications, assuming the same clock frequency. A performance increase of up to 13% is possible if the clock frequency can be increased due to a reduction in the register file access time. This approach enables other, very promising optimizations, three of which are outlined in the paper.
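To make the definition of partial value locality concrete, the sketch below counts how many values in a set of live register values share their upper bits with at least one other value. The choice of "upper bits" as the shared subset, the 16-bit split, and the function name `shared_upper_bits` are assumptions for illustration only; the abstract does not say which bit subsets the proposed design exploits.

```python
from collections import Counter

def shared_upper_bits(values, low_bits=16, width=64):
    """Count values whose upper (width - low_bits) bits match another value's."""
    mask = (1 << width) - 1
    # Group values by their upper bits, ignoring the low_bits least significant bits.
    uppers = Counter((v & mask) >> low_bits for v in values)
    return sum(n for n in uppers.values() if n > 1)
```

For example, three pointers into the same memory region such as 0x7fff00001000, 0x7fff00001008, and 0x7fff00002000 differ only in their low 16 bits, so all three count as sharing their upper bits.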

international symposium on computer architecture | 1993

The Cedar system and an initial performance study

David J. Kuck; Edward S. Davidson; Duncan H. Lawrie; Ahmed H. Sameh; Chuan-Qi Zhu; Alexander V. Veidenbaum; Jeff Konicek; Pen Chung Yew; Kyle A. Gallivan; William Jalby; Harry A. G. Wijshoff; Randall Bramley; Ulrike Meier Yang; Perry A. Emrath; David A. Padua; Rudolf Eigenmann; Jay Hoeflinger; Greg Jaxon; Zhiyuan Li; T. Murphy; John T. Andrews; Stephen W. Turner

In this paper, we give an overview of the Cedar multiprocessor and present recent performance results. These include the performance of some computational kernels and the Perfect Benchmarks. We also present a methodology for judging parallel system performance and apply this methodology to Cedar, Cray YMP-8, and Thinking Machines CM-5.

conference on high performance computing (supercomputing) | 1991