
Publication


Featured research published by Mani Azimi.


International Conference on Parallel Architectures and Compilation Techniques (PACT) | 2012

Application-to-core mapping policies to reduce memory interference in multi-core systems

Reetuparna Das; Rachata Ausavarungnirun; Onur Mutlu; Akhilesh Kumar; Mani Azimi

How applications running on a many-core system are mapped to cores largely determines the interference between these applications in critical shared resources. This paper proposes application-to-core mapping policies to improve system performance by reducing inter-application interference in the on-chip network and memory controllers. The major new ideas of our policies are to: 1) map network-latency-sensitive applications to separate parts of the network from network-bandwidth-intensive applications such that the former can make fast progress without heavy interference from the latter, 2) map those applications that benefit more from being closer to the memory controllers close to these resources. Our evaluations show that both ideas significantly improve system throughput, fairness and interconnect power efficiency.
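The first policy idea amounts to classifying applications by how they use the network before placing them. Below is a minimal sketch of such a classifier, assuming misses-per-kilo-instruction (MPKI) as the intensity metric and a cutoff of 10; both the metric and threshold are illustrative assumptions, not the paper's exact criteria.

```python
# Hypothetical sketch: split applications into latency-sensitive and
# bandwidth-intensive sets so they can be mapped to separate parts of
# the network. The MPKI metric and cutoff are assumptions.

def classify(apps, mpki_cutoff=10.0):
    """Return (latency_sensitive, bandwidth_intensive) app-name lists."""
    latency_sensitive, bandwidth_intensive = [], []
    for app in apps:
        # Low-MPKI apps issue few network requests, so per-request
        # latency matters more to them than aggregate bandwidth.
        if app["mpki"] < mpki_cutoff:
            latency_sensitive.append(app["name"])
        else:
            bandwidth_intensive.append(app["name"])
    return latency_sensitive, bandwidth_intensive

apps = [{"name": "mcf", "mpki": 35.2},
        {"name": "povray", "mpki": 0.3},
        {"name": "lbm", "mpki": 22.8}]
print(classify(apps))  # povray isolated from mcf/lbm traffic
```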


IEEE International Symposium on High-Performance Computer Architecture (HPCA) | 2013

Application-to-core mapping policies to reduce memory system interference in multi-core systems

Reetuparna Das; Rachata Ausavarungnirun; Onur Mutlu; Akhilesh Kumar; Mani Azimi

Future many-core processors are likely to concurrently execute a large number of diverse applications. How these applications are mapped to cores largely determines the interference between these applications in critical shared hardware resources. This paper proposes new application-to-core mapping policies to improve system performance by reducing inter-application interference in the on-chip network and memory controllers. The major new ideas of our policies are to: 1) map network-latency-sensitive applications to separate parts of the network from network-bandwidth-intensive applications such that the former can make fast progress without heavy interference from the latter, 2) map those applications that benefit more from being closer to the memory controllers close to these resources. Our evaluations show that, averaged over 128 multiprogrammed workloads of 35 different benchmarks running on a 64-core system, our final application-to-core mapping policy improves system throughput by 16.7% over a state-of-the-art baseline, while also reducing system unfairness by 22.4% and average interconnect power consumption by 52.3%.
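The second policy idea, placing applications that benefit from memory-controller proximity on nearby cores, can be sketched as a greedy assignment. The mesh size, controller tile positions, and intensity metric below are illustrative assumptions.

```python
# Hypothetical sketch of the second policy idea: give the most
# memory-intensive applications the cores closest to a memory
# controller (MC). Mesh size and MC placement are assumptions.

MESH = 8  # 8x8 mesh, 64 cores
MCS = [(0, 0), (0, 7), (7, 0), (7, 7)]  # assumed MC tiles

def dist_to_mc(core):
    """Manhattan distance to the nearest memory controller."""
    x, y = core
    return min(abs(x - mx) + abs(y - my) for mx, my in MCS)

def map_apps(apps):
    """apps: list of (name, memory_intensity); higher = more intensive."""
    cores = sorted(((x, y) for x in range(MESH) for y in range(MESH)),
                   key=dist_to_mc)  # closest-to-MC cores first
    ranked = sorted(apps, key=lambda a: a[1], reverse=True)
    return {name: core for (name, _), core in zip(ranked, cores)}

mapping = map_apps([("mcf", 35.2), ("povray", 0.3)])
print(mapping)  # mcf lands on an MC tile; povray on the next-closest core
```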


International Symposium on Computer Architecture (ISCA) | 2013

SIMD divergence optimization through intra-warp compaction

Aniruddha S. Vaidya; Anahita Shayesteh; Dong Hyuk Woo; Roy Saharoy; Mani Azimi

SIMD execution units in GPUs are increasingly used for high performance and energy efficient acceleration of general purpose applications. However, SIMD control flow divergence effects can result in reduced execution efficiency in a class of GPGPU applications, classified as divergent applications. Improving SIMD efficiency, therefore, has the potential to bring significant performance and energy benefits to a wide range of such data parallel applications. Recently, the SIMD divergence problem has received increased attention, and several micro-architectural techniques have been proposed to address various aspects of this problem. However, these techniques are often quite complex and, therefore, unlikely candidates for practical implementation. In this paper, we propose two micro-architectural optimizations for GPGPU architectures, which utilize relatively simple execution cycle compression techniques when certain groups of turned-off lanes exist in the instruction stream. We refer to these optimizations as basic cycle compression (BCC) and swizzled-cycle compression (SCC), respectively. In this paper, we will outline the additional requirements for implementing these optimizations in the context of the studied GPGPU architecture. Our evaluations with divergent SIMD workloads from OpenCL (GPGPU) and OpenGL (graphics) applications show that BCC and SCC reduce execution cycles in divergent applications by as much as 42% (20% on average). For a subset of divergent workloads, the execution time is reduced by an average of 7% for today's GPUs or by 18% for future GPUs with a better provisioned memory subsystem. The key contribution of our work is in simplifying the micro-architecture for delivering divergence optimizations while providing the bulk of the benefits of more complex approaches.
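To make the basic cycle compression (BCC) idea concrete, here is a minimal sketch assuming a 32-wide warp executed over four cycles on eight physical lanes; that lane grouping is an illustrative assumption. BCC skips any issue cycle whose lane group is entirely turned off.

```python
# Minimal BCC sketch: count the issue cycles needed for one warp
# instruction, skipping cycles whose 8-lane group is fully inactive.
# The 8-lanes-per-cycle grouping is an assumption for illustration.

LANES_PER_CYCLE = 8

def issue_cycles(active_mask, warp_width=32):
    """Cycles actually issued for one instruction, given a lane mask."""
    cycles = 0
    for start in range(0, warp_width, LANES_PER_CYCLE):
        group = (active_mask >> start) & ((1 << LANES_PER_CYCLE) - 1)
        if group:  # at least one active lane -> this cycle must issue
            cycles += 1
    return cycles

# Divergent example: only lanes 0-7 active after a branch.
print(issue_cycles(0x000000FF))  # 1 cycle instead of 4
```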


IEEE Symposium on High Performance Interconnects (Hot Interconnects) | 2002

Scalability port: a coherent interface for shared memory multiprocessors

Mani Azimi; Faye A. Briggs; Michel Cekleov; Manoj Khare; Akhilesh Kumar; Lily P. Looi

The scalability port (SP) is a point-to-point cache-coherent interface for building scalable shared memory multiprocessors. The SP interface consists of three layers of abstraction: the physical layer, the link layer, and the protocol layer. The physical layer uses pin-efficient simultaneous bi-directional signaling and operates at 800 MHz in each direction. The link layer supports virtual channels and provides flow control and reliable transmission. The protocol layer implements cache consistency, TLB consistency, synchronization, and interrupt delivery functions, among others. The first implementation of the SP interface is in the Intel® E8870 and E9870 chipsets for the Intel Itanium® 2 processor and future generations of the Itanium processor family.
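The abstract does not detail the link layer's mechanism, but credit-based flow control per virtual channel is a standard way to provide the flow control it describes. The sketch below is that generic technique, not the SP's documented design; channel count and buffer depth are assumptions.

```python
# Generic sketch of credit-based flow control over virtual channels,
# one common way a link layer like SP's could avoid receiver buffer
# overrun. Not the SP specification; parameters are assumptions.

class LinkLayerTx:
    def __init__(self, num_vcs=4, buf_depth=8):
        # One credit per free flit buffer at the receiver, per VC.
        self.credits = [buf_depth] * num_vcs

    def can_send(self, vc):
        return self.credits[vc] > 0

    def send_flit(self, vc, flit):
        assert self.can_send(vc), "no credits: would overrun receiver"
        self.credits[vc] -= 1
        return (vc, flit)  # handed to the physical layer

    def credit_return(self, vc):
        # Receiver freed a buffer slot and returned a credit.
        self.credits[vc] += 1
```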


IEEE Transactions on Computers | 2014

MoDe-X: Microarchitecture of a Layout-Aware Modular Decoupled Crossbar for On-Chip Interconnects

Dongkook Park; Aniruddha S. Vaidya; Akhilesh Kumar; Mani Azimi

The number of cores in a single chip keeps increasing with process technology scaling, requiring a scalable interconnection network topology. Buffered wormhole-switched interconnect architectures are attractive for such multicore architectures, and the 2D mesh provides a scalable, cost-efficient, flexible, and reliable next-generation on-chip interconnect topology in this context. In this paper, we present the microarchitecture of a power- and area-efficient router for a 2D mesh interconnect. We propose an efficient crossbar implementation, the Modular Decoupled Crossbar (MoDe-X), that offers a favorable power-performance tradeoff by incorporating dimensional decomposition and segmentation to achieve power and area savings. Unlike most prior work, which considers only the logical representation of the crossbar, MoDe-X is a physically aware design that accounts for the actual layout of router components to reflect practical design requirements. Our simulation results and power estimates show that the MoDe-X router architecture can reduce overall router area by up to 40 percent and power consumption by up to 35 percent, with very little performance impact that occurs only at higher loads. Further, by applying aggressive power gating techniques, the net power reduction can be as much as 99 percent for some workloads with no additional performance impact.
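Dimensional decomposition, one of the two techniques MoDe-X combines, replaces a single full crossbar with smaller per-dimension sub-crossbars; traffic continuing straight in one dimension touches only its own small crossbar. The port grouping below is an illustrative assumption, not the paper's exact microarchitecture.

```python
# Sketch of dimensional decomposition for a 2D-mesh router: the X
# sub-crossbar serves E/W traffic, the Y sub-crossbar serves N/S, and
# a dimension turn crosses both. Structure here is an assumption.

N, S, E, W, LOCAL = "N", "S", "E", "W", "LOCAL"

X_PORTS = {E, W, LOCAL}   # X sub-crossbar ports (plus inject/eject)
Y_PORTS = {N, S, LOCAL}   # Y sub-crossbar ports (plus inject/eject)

def subcrossbars_used(in_port, out_port):
    """Return which sub-crossbar(s) a flit traverses."""
    if in_port in X_PORTS and out_port in X_PORTS:
        return ["X"]                # straight through in X
    if in_port in Y_PORTS and out_port in Y_PORTS:
        return ["Y"]                # straight through in Y
    return ["X", "Y"]               # dimension turn crosses both

print(subcrossbars_used(E, W))  # ['X']       small crossbar only
print(subcrossbars_used(E, N))  # ['X', 'Y']  turn under XY routing
```

The win is that two small crossbars cost far less area and power than one monolithic one, and under dimension-order routing most hops are straight-through, so turns (the expensive case) are rare.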


Formal Methods | 2003

Experience with Applying Formal Methods to Protocol Specification and System Architecture

Mani Azimi; Ching-Tsun Chou; Akhilesh Kumar; Victor W. Lee; Phanindra K. Mannava; Seungjoon Park

In the last three years or so, we at the Enterprise Platforms Group at Intel Corporation have been applying formal methods to various problems that arose while defining platform architectures for Intel's processor families. In this paper we give an overview of some of the problems we have worked on, the results we have obtained, and the lessons we have learned. The last topic is addressed mainly from the perspective of platform architects.


IEEE Transactions on Computers | 2015

GREEN Cache: Exploiting the Disciplined Memory Model of OpenCL on GPUs

Jaekyu Lee; Dong Hyuk Woo; Hyesoon Kim; Mani Azimi

As graphics processing unit architectures are deployed across a broad computing spectrum, from hand-held and embedded devices to high-performance computing servers, OpenCL has become the de facto standard programming environment for general-purpose computing on GPUs. Unlike its CPU counterpart, OpenCL has several distinct features, such as its disciplined memory model, which is partially inherited from conventional 3D graphics programming models. At the same time, due to ever-increasing memory bandwidth pressure and low-power requirements, the capacity of on-chip caches in GPUs keeps increasing over time. Given these trends, we believe there are interesting programming model/architecture co-optimization opportunities, in particular in energy-efficiently utilizing large on-chip caches for GPUs. In this paper, as a showcase, we study the characteristics of the OpenCL memory model and propose a GPU Region-aware Energy-Efficient Non-inclusive cache hierarchy, or GREEN cache hierarchy. With the GREEN cache, our simulation results show that we can save 56 percent of dynamic energy in the L1 cache, 39 percent of dynamic energy in the L2 cache, and 50 percent of leakage energy in the L2 cache, with practically no performance degradation or increase in off-chip accesses.
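The "disciplined memory model" hook is that every OpenCL access is statically tagged with an address space (global, local, constant, private), so the hierarchy can apply a different caching policy per region. The policy table below is purely an illustrative assumption of how such region-awareness could look, not the paper's actual mechanism.

```python
# Hypothetical sketch of a region-aware cache policy keyed on OpenCL
# address spaces. The specific per-region choices are assumptions
# made for illustration, not GREEN's published policy.

POLICY = {
    # region      (allocate_in_L1, allocate_in_L2)
    "constant":   (True,  False),  # read-only, small: keep near cores
    "private":    (True,  False),  # per-thread spills: no cross-core sharing
    "global":     (True,  True),   # shared read-write data
    "local":      (False, False),  # served by the on-chip scratchpad
}

def route_access(region):
    """Return the cache levels an access to this region allocates in."""
    l1, l2 = POLICY[region]
    levels = [lvl for lvl, on in (("L1", l1), ("L2", l2)) if on]
    return levels or ["bypass"]

print(route_access("constant"))  # ['L1']
print(route_access("local"))     # ['bypass']
```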


IEEE Computer Society International Conference (COMPCON) | 1992

Two level cache architectures

Mani Azimi; Bindi Prasad; Ketan Bhat

The authors discuss the performance measures required in building two-level cache solutions for uniprocessor desktop systems based on more aggressive processors than the Intel486 microprocessor. The performance of serial second-level caches is shown to exceed that of parallel caches by 10%-20%. The effect of second-level cache parameters such as cache size, line size, associativity, and sector size is examined. It is shown that, as long as one of the two caches in the hierarchy operates in write-back mode, performance is close to the case of both operating in write-back mode. The authors quantify the fact that second-level caches reduce memory latency sensitivity. The performance gain of a full-speed interface between the two levels of the cache hierarchy versus a half-speed interface is shown to be about 10% for desktop applications.
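A simple average-memory-access-time (AMAT) model shows the shape of such two-level comparisons. The sketch below contrasts a serial (look-through) L2, where memory is accessed only after an L2 miss, with a parallel (look-aside) L2, where the memory access launches alongside the lookup; all numbers are illustrative assumptions, and this latency-only model deliberately ignores the bus-utilization effects behind the paper's measured results.

```python
# Illustrative AMAT model for a two-level cache hierarchy.
# Latencies are in cycles; all values are assumptions.

def amat_serial(l1_hit, l1_t, l2_hit, l2_t, mem_t):
    # Look-through: memory is accessed only after an L2 miss.
    return l1_t + (1 - l1_hit) * (l2_t + (1 - l2_hit) * mem_t)

def amat_parallel(l1_hit, l1_t, l2_hit, l2_t, mem_t):
    # Look-aside: memory access starts alongside the L2 lookup, so an
    # L2 miss pays max(l2_t, mem_t) instead of their sum.
    return l1_t + (1 - l1_hit) * (l2_hit * l2_t +
                                  (1 - l2_hit) * max(l2_t, mem_t))

# Example: 95% L1 hit rate, 80% L2 hit rate.
print(amat_serial(0.95, 1, 0.80, 8, 50))    # 1.9
print(amat_parallel(0.95, 1, 0.80, 8, 50))  # 1.82
```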


Archive | 2014

On Chip Network Routing for Tera-Scale Architectures

Aniruddha S. Vaidya; Mani Azimi; Akhilesh Kumar

The emergence of tera-scale architectures features the interconnection of tens to several hundred general-purpose cores with each other and with other IP blocks. The high-level requirements of the underlying interconnect infrastructure include low latency, high throughput, scalable performance, flexible and adaptive routing, support for isolated partitions, fault tolerance, and support for irregular or partially enabled configurations. This chapter presents the architecture and routing algorithms for supporting these requirements within the overall framework of mesh- and torus-based point-to-point interconnect topologies. The requirements and desired attributes for tera-scale interconnects are outlined first, followed by an overview of the interconnect architecture and microarchitecture framework. The descriptions of the various routing algorithms supported are at the heart of the chapter, and include minimal deterministic and adaptive routing algorithms for mesh and torus networks, a novel load-balanced routing algorithm called pole routing, and performance-isolation routing in non-rectangular mesh partitions. The implementation aspects of these topics are covered through an overview of the environment for prototyping, debugging, performance evaluation, and visualization in the context of specific interconnect configurations of interest. Overall, this chapter aims to illustrate a comprehensive approach to architecting (and microarchitecting) a scalable and flexible on-die interconnect and associated routing algorithms that are applicable to a wide range of applications in an industry setting.
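As a concrete instance of the minimal deterministic class mentioned above, here is a sketch of dimension-order (XY) routing on a 2D mesh: route fully in X first, then in Y, which is minimal and deadlock-free on a mesh. Coordinates and the hop representation are illustrative.

```python
# Minimal sketch of dimension-order (XY) routing on a 2D mesh, a
# standard example of the minimal deterministic algorithms covered.

def xy_route(src, dst):
    """Return the list of hops (tile coordinates) from src to dst."""
    (x, y), (dx, dy) = src, dst
    hops = []
    while x != dx:                  # correct the X dimension first
        x += 1 if dx > x else -1
        hops.append((x, y))
    while y != dy:                  # then correct the Y dimension
        y += 1 if dy > y else -1
        hops.append((x, y))
    return hops

print(xy_route((0, 0), (2, 3)))
# [(1, 0), (2, 0), (2, 1), (2, 2), (2, 3)]
```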


International Symposium on Field Programmable Gate Arrays (FPGA) | 2010

FPGA-based prototyping of a 2D MESH / TORUS on-chip interconnect (abstract only)

Donglai Dai; Aniruddha S. Vaidya; Roy Saharoy; Seungjoon Park; Dongkook Park; Hariharan Thantry; Ralf Plate; Elmar Maas; Akhilesh Kumar; Mani Azimi

Many-core chip multiprocessors can be expected to scale to tens of cores and beyond in the near future. Existing and emerging workloads on general-purpose many-core processors typically exhibit fast-changing, unpredictable on-chip communication traffic, full of burstiness and jitter, between different functional blocks. To provide high sustainable performance, scalable interconnects are needed, with a rich feature set including support for adaptive and flexible communication, performance isolation, and fault tolerance. 2D mesh and torus topologies are attractive choices because they are physical-layout friendly and scale more gracefully in network latency and bisection bandwidth than simpler interconnects such as buses or rings. However, the adoption of a 2D mesh/torus in many-core processor designs depends on a verifiable and robust microarchitecture and a validated set of features. FPGA-based systems have recently become a cost-effective, rapid prototyping vehicle for chip multiprocessor architectures. In this paper we present an FPGA-based prototype of a 2D on-die interconnect architecture. Our prototype is a highly configurable full-scale design that supports options selecting many different microarchitectural features and routing algorithms. The prototype incorporates a synthetic traffic generator to exercise and evaluate our design. To facilitate evaluation and characterization, a rich development environment and novel software capabilities, including a very detailed performance visualization infrastructure, have been developed. We present experimental results from several configurations on a 6x6 2D network emulator setup.
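A synthetic traffic generator of the kind the prototype incorporates can be sketched as an on/off injection process with uniform-random destinations, matching the bursty behavior the abstract describes. Burst length, idle gap, and injection rate below are illustrative assumptions.

```python
# Sketch of a bursty synthetic traffic generator: uniform-random
# destinations with an on/off injection process. All parameters are
# assumptions chosen for illustration.

import random

def bursty_traffic(src, num_nodes, cycles,
                   burst_len=8, idle_len=24, rate_in_burst=0.8):
    """Yield (cycle, src, dst) packet-injection events."""
    cycle = 0
    while cycle < cycles:
        for _ in range(burst_len):       # on-phase: inject at high rate
            if cycle >= cycles:
                return
            if random.random() < rate_in_burst:
                dst = random.randrange(num_nodes)
                if dst != src:
                    yield (cycle, src, dst)
            cycle += 1
        cycle += idle_len                # off-phase between bursts

# Drive one node of a 6x6 (36-node) network for 100 cycles.
for event in bursty_traffic(src=0, num_nodes=36, cycles=100):
    print(event)
```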
