Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Vignesh Adhinarayanan is active.

Publication


Featured research published by Vignesh Adhinarayanan.


International Conference on Communications | 2013

Accelerating fast Fourier transform for wideband channelization

Carlo C. del Mundo; Vignesh Adhinarayanan; Wu-chun Feng

Wideband channelization is a compute-intensive task with performance requirements that are arguably greater than what current multi-core CPUs can provide. To date, researchers have used dedicated hardware such as field-programmable gate arrays (FPGAs) to address the performance-critical aspects of the channelizer. In this work, we assess the viability of the graphics processing unit (GPU) to achieve the necessary performance. In particular, we focus on the fast Fourier transform (FFT) stage of a wideband channelizer. While there exists previous work for FFT on an NVIDIA GPU, the substantially higher peak floating-point performance of an AMD GPU has been less explored. Thus, we consider three generations of AMD GPUs and provide insight into the optimization of FFT on these platforms. Our architecture-aware approach across three different generations of AMD GPUs outperforms a multithreaded Intel Sandy Bridge CPU with vector extensions by factors of 4.3, 4.9, and 6.6 on the Radeon HD 5870, 6970, and 7970, respectively.
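
The paper's implementations target OpenCL on AMD GPUs; as a rough, CPU-only illustration of the FFT stage that dominates the channelizer, the NumPy sketch below runs one batched N-point FFT per input frame. The channel count, frame count, and input data are invented placeholders, not the paper's workload.

    import numpy as np

    # Minimal sketch of the FFT stage of a wideband channelizer (CPU/NumPy only).
    # Sizes and data are illustrative placeholders, not the paper's configuration.
    num_channels = 64            # number of output sub-bands (FFT length)
    frames_per_batch = 4096      # FFT frames processed per batch

    rng = np.random.default_rng(0)
    x = (rng.standard_normal(num_channels * frames_per_batch)
         + 1j * rng.standard_normal(num_channels * frames_per_batch))

    # Reshape the wideband input so each row is one FFT frame, then run a
    # batched FFT across all frames -- this is the compute-heavy stage that
    # the paper maps onto the GPU.
    frames = x.reshape(frames_per_batch, num_channels)
    subbands = np.fft.fft(frames, axis=1)    # one num_channels-point FFT per frame
    print(subbands.shape)                    # (frames_per_batch, num_channels)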


IEEE International Symposium on Workload Characterization | 2016

Measuring and modeling on-chip interconnect power on real hardware

Vignesh Adhinarayanan; Indrani Paul; Joseph L. Greathouse; Wei Huang; Ashutosh Pattnaik; Wu-chun Feng

On-chip data movement is a major source of power consumption in modern processors, and future technology nodes will exacerbate this problem. Properly understanding the power that applications expend moving data is vital for inventing mitigation strategies. Previous studies combined data movement energy, which is required to move information across the chip, with data access energy, which is used to read or write on-chip memories. This combination can hide the severity of the problem, as memories and interconnects will scale differently to future technology nodes. Thus, increasing the fidelity of our energy measurements is of paramount concern. We propose to use physical data movement distance as a mechanism for separating movement energy from access energy. We then use this mechanism to design microbenchmarks to ascertain data movement energy on a real modern processor. Using these microbenchmarks, we study the following parameters that affect interconnect power: (i) distance, (ii) interconnect bandwidth, (iii) toggle rate, and (iv) voltage and frequency. We conduct our study on an AMD GPU built in 28nm technology and validate our results against industrial estimates for energy/bit/millimeter. We then construct an empirical model based on our characterization and use it to evaluate the interconnect power of 22 real-world applications. We show that up to 14% of the dynamic power in some applications can be consumed by the interconnect and present a range of mitigation strategies.
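
The abstract does not give the model's exact form, so the sketch below is only an assumed illustration of how an empirical interconnect power model could combine the four characterized parameters (distance, bandwidth, toggle rate, and voltage/frequency). The coefficient and the nominal operating point are invented placeholders, not the paper's calibrated values.

    # Hypothetical empirical interconnect power model (illustrative only).
    # Assumed form: dynamic power scales with bits moved per second, the distance
    # they travel, the toggle rate of the wires, and a V^2 * f term relative to a
    # nominal operating point.
    def interconnect_power_watts(bandwidth_gbps, distance_mm, toggle_rate,
                                 voltage_v, freq_ghz,
                                 energy_pj_per_bit_mm=0.2,   # assumed coefficient
                                 v_nom=1.0, f_nom=1.0):
        bits_per_s = bandwidth_gbps * 1e9
        base_power = bits_per_s * distance_mm * toggle_rate * energy_pj_per_bit_mm * 1e-12
        return base_power * (voltage_v / v_nom) ** 2 * (freq_ghz / f_nom)

    print(interconnect_power_watts(bandwidth_gbps=256, distance_mm=10,
                                   toggle_rate=0.5, voltage_v=0.95, freq_ghz=1.0))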


International Symposium on Performance Analysis of Systems and Software | 2016

An automated framework for characterizing and subsetting GPGPU workloads

Vignesh Adhinarayanan; Wu-chun Feng

Graphics processing units (GPUs) are becoming increasingly common in today's computing systems due to their superior performance and energy efficiency relative to their cost. To further improve these desired characteristics, researchers have proposed several software and hardware techniques. Evaluation of these proposed techniques can be tricky due to the ad hoc manner in which applications are selected for evaluation. Sometimes researchers spend unnecessary time evaluating redundant workloads, which is particularly problematic for time-consuming studies involving simulation. Other times, they fail to expose the shortcomings of their proposed techniques when too few workloads are chosen for evaluation. To overcome these problems, we propose an automated framework that characterizes and subsets GPGPU workloads based on a user-chosen set of performance metrics/counters. This framework internally uses principal component analysis (PCA) to reduce the dimensionality of the chosen metrics and then uses hierarchical clustering to identify similarity among the workloads. In this study, we use our framework to identify redundancy in the recently released SPEC ACCEL OpenCL benchmark suite using a few architecture-dependent metrics. Our analysis shows that a subset of eight applications provides most of the diversity in the 19-application benchmark suite. We also subset the Parboil, Rodinia, and SHOC benchmark suites and then compare them against each other to identify “gaps” in these suites. As an example, we show that SHOC has many applications that are similar to each other and could benefit from adding four applications from Parboil to improve its diversity.
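
A minimal sketch of the described pipeline, PCA for dimensionality reduction followed by hierarchical clustering with one representative kept per cluster, using scikit-learn and SciPy. The metric matrix, workload names, and cluster count are placeholders rather than the paper's data.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler
    from scipy.cluster.hierarchy import linkage, fcluster

    # Rows are workloads, columns are performance metrics/counters (placeholder values).
    workloads = ["app%d" % i for i in range(19)]
    rng = np.random.default_rng(0)
    metrics = rng.random((19, 12))

    # 1) Normalize the metrics and reduce dimensionality, keeping ~90% of the variance.
    reduced = PCA(n_components=0.9).fit_transform(StandardScaler().fit_transform(metrics))

    # 2) Hierarchically cluster the workloads in the reduced space.
    tree = linkage(reduced, method="ward")

    # 3) Cut the dendrogram into k clusters and keep one representative per cluster.
    k = 8
    labels = fcluster(tree, t=k, criterion="maxclust")
    subset = [workloads[np.flatnonzero(labels == c)[0]] for c in np.unique(labels)]
    print(subset)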


Reconfigurable Computing and FPGAs | 2014

On the performance and energy efficiency of FPGAs and GPUs for polyphase channelization

Vignesh Adhinarayanan; Thaddeus Koehn; Krzysztof Kepa; Wu-chun Feng; Peter M. Athanas

Wideband channelization is an important and computationally demanding task in the front-end subsystem of several software-defined radios (SDRs). The hardware that supports this task should provide high performance, consume low power, and allow flexible implementations. Several classes of devices have been explored in the past, with the FPGA proving to be the most popular as it reasonably satisfies all three requirements. However, the growing presence of low-power mobile GPUs holds much promise with improved flexibility for instant adaptation to different standards. Thus, in this paper, we present optimized polyphase channelizer implementations for the FPGA and the GPU that take into account the power and accuracy requirements of a military application. The performance in mega-samples per second (MSPS) and energy efficiency in MSPS/watt are compared between the two classes of hardware platforms: FPGA and GPU. The results show that by exploiting the flexible datapath width of FPGAs, FPGA implementations generally deliver an order-of-magnitude better performance and energy efficiency over fixed-width GPU architectures.
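
For readers unfamiliar with the technique, the following is a simplified NumPy/SciPy sketch of a polyphase filter-bank channelizer: a prototype FIR filter is split into polyphase branches and an FFT per output frame separates the channels. The channel count, filter length, and input are illustrative, and the commutator ordering is simplified relative to production FPGA or GPU implementations.

    import numpy as np
    from scipy.signal import firwin

    M = 32                                             # number of channels
    taps_per_branch = 8
    h = firwin(M * taps_per_branch, cutoff=1.0 / M)    # prototype low-pass filter
    poly_h = h.reshape(taps_per_branch, M)             # polyphase decomposition

    rng = np.random.default_rng(0)
    x = rng.standard_normal(M * 2048)                  # wideband input (placeholder)
    frames = x.reshape(-1, M)                          # distribute samples across branches

    # Per-branch FIR filtering over the last taps_per_branch frames, then an
    # M-point FFT per output sample to separate the channels.
    out = []
    for n in range(taps_per_branch - 1, frames.shape[0]):
        window = frames[n - taps_per_branch + 1 : n + 1, :]
        branch_sums = np.sum(window[::-1] * poly_h, axis=0)
        out.append(np.fft.fft(branch_sums))
    channels = np.array(out)                           # shape: (time, M)
    print(channels.shape)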


International Conference on Parallel and Distributed Systems | 2013

Wideband Channelization for Software-Defined Radio via Mobile Graphics Processors

Vignesh Adhinarayanan; Wu-chun Feng

Wideband channelization is a computationally intensive task within software-defined radio (SDR). To support this task, the underlying hardware should provide high performance and allow flexible implementations. Traditional solutions use field-programmable gate arrays (FPGAs) to satisfy these requirements. While FPGAs allow for flexible implementations, realizing an FPGA implementation is a difficult and time-consuming process. On the other hand, multicore processors, while more programmable, fail to satisfy performance requirements. Graphics processing units (GPUs) overcome the above limitations. However, traditional GPUs are power-hungry and can consume as much as 350 watts, making them ill-suited for many SDR environments, particularly those that are battery-powered. Here we explore the viability of low-power mobile graphics processors to simultaneously overcome the limitations of performance, flexibility, and power. Via execution profiling and performance analysis, we identify major bottlenecks in mapping the wideband channelization algorithm onto these devices and adopt several optimization techniques to achieve multiplicative speed-up over a multithreaded implementation. Overall, our approach delivers a speedup of up to 43-fold on the discrete AMD Radeon HD 6470M GPU and 27-fold on the integrated AMD Radeon HD 6480G GPU, when compared to a vectorized and multithreaded version running on the AMD A4-3300M CPU.


IEEE Computer Society Annual Symposium on VLSI | 2012

SCOC IP Cores for Custom Built Supercomputing Nodes

Venkateswaran Nagarajan; Rajagopal Hariharan; Vinesh Srinivasan; Ram Srivatsa Kannan; Prashanth Thinakaran; Vigneshwaren Sankaran; Bharanidharan Vasudevan; Ravindhiran Mukundrajan; Nachiappan Chidambaram Nachiappan; Aswin Sridharan; Karthikeyan Palavedu Saravanan; Vignesh Adhinarayanan; Vignesh Veppur Sankaranarayanan

A high-performance, low-power node architecture is crucial in the design of future-generation supercomputers. In this paper, we present a generic set of cells for designing complex functional units that are capable of executing an algorithm of reasonable size. They are called Algorithm Level Functional Units (ALFUs), and a suitable VLSI design paradigm for them is proposed in this paper. We provide a comparative analysis of many-core processors based on ALFUs against ALUs to show the reduced generation of control signals, fewer memory accesses and instruction fetches, and increased cache hit rates, resulting in better performance and lower power consumption. ALFUs have led to the inception of the Super Computer On Chip (SCOC) IP core paradigm for designing high-performance, low-power supercomputing clusters. The proposed SCOC IP cores are compared with the existing IP cores used in supercomputing clusters to bring out the improved features of the former.


Cluster Computing and the Grid | 2016

Online Power Estimation of Graphics Processing Units

Vignesh Adhinarayanan; Balaji Subramaniam; Wu-chun Feng

Accurate power estimation at runtime is essential for the efficient functioning of a power management system. While years of research have yielded accurate power models for the online prediction of instantaneous power for CPUs, such power models for graphics processing units (GPUs) are lacking. GPUs rely on low-resolution power meters that only nominally support basic power management. To address this, we propose an instantaneous power model, and in turn, a power estimator, that uses performance counters in a novel way so as to deliver accurate power estimation at runtime. Our power estimator runs on two real NVIDIA GPUs to show that accurate runtime estimation is possible without the need for the high-fidelity details that are assumed in simulation-based power models. To construct our power model, we first use correlation analysis to identify a concise set of performance counters that work well despite GPU device limitations. Next, we explore several statistical regression techniques and identify the best one. Then, to improve the prediction accuracy, we propose a novel application-dependent modeling technique, where the model is constructed online at runtime, based on the readings from a low-resolution, built-in GPU power meter. Our quantitative results show that a multi-linear model, which produces a mean absolute error of 6%, works the best in practice. An application-specific quadratic model reduces the error to nearly 1%. We show that this model can be constructed with low overhead and high accuracy at runtime. To the best of our knowledge, this is the first work attempting to model the instantaneous power of a real GPU system; earlier related work focused on average power.
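
A minimal sketch of a counter-based, multi-linear power model of the kind described, using scikit-learn; the counter names and the power samples are synthetic placeholders, not the counters or measurements used in the paper.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    counter_names = ["sm_activity", "dram_reads", "dram_writes", "l2_hits"]  # placeholders

    # Synthetic training data standing in for sampled counter rates and measured power.
    rng = np.random.default_rng(0)
    X = rng.random((500, len(counter_names)))
    true_w = np.array([60.0, 25.0, 20.0, 10.0])
    y = 40.0 + X @ true_w + rng.normal(0, 2.0, 500)    # "measured" power in watts

    model = LinearRegression().fit(X, y)               # multi-linear power model
    pred = model.predict(X)
    print("mean absolute error: %.1f%%" % (100 * np.mean(np.abs(pred - y) / y)))

An application-specific refinement along the lines of the paper's quadratic model could be fit the same way over second-order combinations of the counter terms.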


Computing Frontiers | 2018

GPU power prediction via ensemble machine learning for DVFS space exploration

Bishwajit Dutta; Vignesh Adhinarayanan; Wu-chun Feng

A software-based approach to achieve high performance within a power budget often involves dynamic voltage and frequency scaling (DVFS). Thus, accurately predicting the power consumption of an application at different DVFS levels (or more generally, different processor configurations) is paramount for the energy-efficient functioning of a high-performance computing (HPC) system. The increasing prevalence of graphics processing units (GPUs) in HPC systems presents new challenges in power management, and machine learning presents a unique way to improve the software-based power management of these systems. As such, we explore the problem of GPU power prediction at different DVFS states via machine learning. Specifically, we propose a new ensemble technique that incorporates three machine-learning techniques --- sequential minimal optimization regression, simple linear regression, and decision tree --- to reduce the mean absolute error (MAE) to 3.5%.
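
A sketch of an averaging ensemble over the three named base learners, with scikit-learn's SVR assumed as a stand-in for SMO regression; the features (for example, counters plus DVFS state) and the power targets are synthetic placeholders.

    import numpy as np
    from sklearn.ensemble import VotingRegressor
    from sklearn.linear_model import LinearRegression
    from sklearn.svm import SVR
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    X = rng.random((400, 6))                                       # placeholder features
    y = 30 + 80 * X[:, 0] + 15 * X[:, 1] + rng.normal(0, 3, 400)   # synthetic GPU power (W)

    # Averaging ensemble over the three base learners.
    ensemble = VotingRegressor([
        ("svr", SVR(kernel="rbf", C=10.0)),           # stand-in for SMO regression
        ("lin", LinearRegression()),                  # simple linear regression
        ("tree", DecisionTreeRegressor(max_depth=6))  # decision tree
    ])
    ensemble.fit(X[:300], y[:300])
    pred = ensemble.predict(X[300:])
    print("MAE: %.1f%%" % (100 * np.mean(np.abs(pred - y[300:]) / y[300:])))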


International Parallel and Distributed Processing Symposium | 2017

Characterizing and Modeling Power and Energy for Extreme-Scale In-Situ Visualization

Vignesh Adhinarayanan; Wu-chun Feng; David H. Rogers; James P. Ahrens; Scott Pakin

Plans for exascale computing have identified power and energy as looming problems for simulations running at that scale. In particular, writing to disk all the data generated by these simulations is becoming prohibitively expensive due to the energy consumption of the supercomputer while it idles waiting for data to be written to permanent storage. In addition, the power cost of data movement is also steadily increasing. A solution to this problem is to write only a small fraction of the data generated while still maintaining the cognitive fidelity of the visualization. With domain scientists increasingly amenable to adopting an in-situ framework that can identify and extract valuable data from extremely large simulation results and write them to permanent storage as compact images, a large-scale simulation will commit to disk a reduced dataset of data extracts that will be much smaller than the raw results, resulting in savings in both power and energy. The goal of this paper is two-fold: (i) to understand the role of in-situ techniques in combating power and energy issues of extreme-scale visualization and (ii) to create a model for performance, power, energy, and storage to facilitate what-if analysis. Our experiments on a specially instrumented, dedicated 150-node cluster show that while it is difficult to achieve power savings in practice using in-situ techniques, applications can achieve significant energy savings due to shorter write times for in-situ visualization. We present a characterization of power and energy for in-situ visualization; an application-aware, architecture-specific methodology for modeling and analysis of such in-situ workflows; and results that uncover indirect power savings in visualization workflows for high-performance computing (HPC).
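
As a back-of-the-envelope illustration of the what-if analysis such a model enables, the sketch below compares total energy for a post-hoc run that writes raw data against an in-situ run that writes compact extracts. Every number is an invented placeholder; the paper's actual model is application-aware and architecture-specific.

    # Toy energy what-if model (illustrative placeholders only).
    def run_energy_joules(compute_s, write_s, compute_power_w, io_power_w, idle_power_w):
        # Energy while computing plus energy while writing (machine otherwise near idle).
        return compute_s * compute_power_w + write_s * (io_power_w + idle_power_w)

    post_hoc = run_energy_joules(compute_s=3600, write_s=900,
                                 compute_power_w=300e3, io_power_w=20e3, idle_power_w=250e3)
    in_situ = run_energy_joules(compute_s=3700, write_s=30,
                                compute_power_w=310e3, io_power_w=20e3, idle_power_w=250e3)
    print("energy savings from in-situ: %.0f%%" % (100 * (post_hoc - in_situ) / post_hoc))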


IEEE Computer Society Annual Symposium on VLSI | 2012

Compilation Accelerator on Silicon

Venkateswaran Nagarajan; Vinesh Srinivasan; Ramsrivatsa Kannan; Prashanth Thinakaran; Rajagopal Hariharan; Bharanidharan Vasudevan; Nachiappan Chidambaram Nachiappan; Karthikeyan Palavedu Saravanan; Aswin Sridharan; Vigneshwaran Sankaran; Vignesh Adhinarayanan; V.S. Vignesh; Ravindhiran Mukundrajan

Current-day processors utilize complex and finely tuned system software to map applications across their cores and extract optimal performance. However, with increasing core counts and the rise of heterogeneity among cores, tremendous stress will be exerted on the software stack, leading to bottlenecks and underutilization of resources. We propose an architecture for a Compilation Accelerator on Silicon (CAS) coupled with a hardware instruction scheduler to tackle the complexity involved in analyzing dependencies among instructions dynamically, accelerate machine code generation, and obtain optimal resource utilization across the cores through effective and efficient scheduling. The CAS is realized as a two-level hierarchical subsystem employing the Primary Compiler on Silicon (PCOS) and Secondary Compiler on Silicon (SCOS), with the hardware instruction scheduler as an integral part of it. A comparative analysis with the conventional GCC compiler is presented for a real-world brain modeling application; higher instruction generation rates and improved scheduling efficiency are observed, resulting in a corresponding increase in resource utilization.

Collaboration


Dive into Vignesh Adhinarayanan's collaborations.

Top Co-Authors

David H. Rogers, Los Alamos National Laboratory
James P. Ahrens, Los Alamos National Laboratory
Scott Pakin, Los Alamos National Laboratory
Vinesh Srinivasan, North Carolina State University
Ashutosh Pattnaik, Pennsylvania State University