Arunachalam Annamalai
University of Massachusetts Amherst
Publication
Featured research published by Arunachalam Annamalai.
International Conference on Parallel Architectures and Compilation Techniques | 2011
Rance Rodrigues; Arunachalam Annamalai; Israel Koren; Sandip Kundu; Omer Khan
The trend toward multicore processors is moving the emphasis in computation from sequential to parallel processing. However, not all applications can be parallelized and benefit from multiple cores. Such applications lead to under-utilization of parallel resources and hence sub-optimal performance/watt. They may, however, benefit from powerful uniprocessors. On the other hand, not all applications can take advantage of more powerful uniprocessors. To address the competing requirements of diverse applications, we propose a heterogeneous multicore architecture with a Dynamic Core Morphing (DCM) capability. Depending on the computational demands of the currently executing applications, the resources of a few tightly coupled cores are morphed at runtime. We present a simple hardware-based algorithm to monitor the time-varying computational needs of the application and, when deemed beneficial, trigger reconfiguration of the cores at fine-grain time scales to maximize the performance/watt of the application. The proposed dynamic scheme is then compared against a baseline static heterogeneous multicore configuration and an equivalent homogeneous configuration. Our results show that dynamic morphing of cores can provide performance/watt gains of 43% and 16% on average, when compared to the homogeneous and baseline heterogeneous configurations, respectively.
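The morphing trigger described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's algorithm: the perf/watt proxy, the sample format, and the 10% gain threshold are all assumptions.

```python
# Sketch of a monitor that triggers Dynamic Core Morphing (DCM).
# The perf/watt proxy, sampling format, and threshold are illustrative
# assumptions, not the published algorithm.

def perf_per_watt(ipc, power_watts):
    """Performance/watt proxy: instructions per cycle per watt."""
    return ipc / power_watts

def should_morph(samples_base, samples_morphed_est, gain_threshold=0.10):
    """Trigger morphing when the estimated perf/watt gain of the morphed
    configuration exceeds a fixed threshold over the monitoring window."""
    base = sum(perf_per_watt(i, p) for i, p in samples_base) / len(samples_base)
    morph = sum(perf_per_watt(i, p) for i, p in samples_morphed_est) / len(samples_morphed_est)
    return (morph - base) / base > gain_threshold

# Example: a phase whose estimated morphed perf/watt is clearly higher.
base_window = [(1.0, 10.0), (1.1, 10.5)]         # (IPC, watts) per interval
morphed_window = [(1.5, 10.4), (1.5, 10.6)]
print(should_morph(base_window, morphed_window))  # True for this window
```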
IEEE Transactions on Circuits and Systems II: Express Briefs | 2013
Rance Rodrigues; Arunachalam Annamalai; Israel Koren; Sandip Kundu
We present a study on estimating the dynamic power consumption of a processor based on performance counters. Today's processors feature a large number of such counters to monitor various CPU and memory parameters, such as utilization, occupancy, bandwidth, page, cache, and branch buffer hit rates. The use of various sets of performance counters to estimate the power consumed by the processor has been demonstrated in the past. Our goal is to find out whether there exists a subset of counters that can be used to estimate, with sufficient accuracy, the dynamic power consumption of processors with varying microarchitectures. To this end, we consider two recent processor configurations representing two extremes of the performance spectrum, one targeting low power and the other high performance. Our results indicate that only three counters measuring 1) the number of fetched instructions, 2) level-1 cache hits, and 3) dispatch stalls are sufficient to achieve adequate precision. These counters are shown to be effective in predicting the dynamic power consumption across processors of varying resource sizes, achieving a prediction accuracy of 95%.
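A counter-based estimator of this kind is typically a fitted linear model over the selected counters. The sketch below uses the three counters named in the abstract, but the coefficients, base power, and counter magnitudes are made-up placeholders; a real model would be fitted against measured power traces.

```python
# Linear dynamic-power model over the three counters the study identifies:
# fetched instructions, L1 cache hits, and dispatch stalls.
# All coefficients below are invented placeholders, not fitted values.

COEFFS = {
    "fetched_insts":   4.1e-9,   # watts per fetched instruction per interval
    "l1_hits":         2.3e-9,
    "dispatch_stalls": -1.2e-9,  # stalls correlate with lower dynamic activity
}
BASE_POWER = 5.0  # watts of residual dynamic power not explained by counters

def estimate_dynamic_power(counters):
    """Estimate dynamic power (watts) from one sampling interval's counters."""
    return BASE_POWER + sum(COEFFS[name] * counters[name] for name in COEFFS)

sample = {"fetched_insts": 2_000_000_000,
          "l1_hits": 1_500_000_000,
          "dispatch_stalls": 400_000_000}
print(round(estimate_dynamic_power(sample), 2))  # 16.17 with these placeholders
```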
ACM Transactions on Design Automation of Electronic Systems | 2013
Rance Rodrigues; Arunachalam Annamalai; Israel Koren; Sandip Kundu
Asymmetric multi-core processors (AMPs) have been shown to outperform symmetric ones in terms of performance and performance/watt. Improved performance and power efficiency are achieved when the program threads are matched to their most suitable cores. Since the computational needs of a program may change during its execution, the best thread-to-core assignment will likely change with time. We have, therefore, developed an online program phase classification scheme that allows the swapping of threads when the current needs of the threads justify a change in the assignment. The architectural differences among the cores in an AMP can never match the diversity that exists among different programs and even between different phases of the same program. Consider, for example, a program (or a program phase) that has high instruction-level parallelism (ILP) and will exhibit high power efficiency if executed on a powerful core. We cannot, however, include such powerful cores in the designed AMP, since they will remain underutilized most of the time, and they are not power efficient when the programs do not exhibit a high degree of ILP. Thus, we must expect to see program phases where the designed cores will be unable to support the ILP that the program can exhibit. We, therefore, propose in this article a dynamic morphing scheme. This scheme will allow a core to gain control of a functional unit that is ordinarily under the control of a neighboring core during periods of intense computation with high ILP. This way, we dynamically adjust the hardware resources to the current needs of the application. Our results show that combining online phase classification and dynamic core morphing can significantly improve the performance/watt of most multithreaded workloads.
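An online phase classifier of the sort described can be pictured as a small rule over interval statistics. The features, thresholds, phase labels, and core names below are all illustrative assumptions, not the article's classifier.

```python
# Toy online phase classifier: buckets an execution interval by its ILP and
# memory behaviour, then proposes a thread-to-core assignment.
# Features, thresholds, and core labels are illustrative assumptions.

def classify_phase(ipc, l2_miss_rate):
    """Label the current interval of a thread."""
    if l2_miss_rate > 0.10:
        return "memory-bound"
    return "high-ilp" if ipc > 1.5 else "low-ilp"

def assign(phases):
    """Map each thread's current phase to the core type that suits it best."""
    best_core = {"high-ilp": "large", "low-ilp": "small", "memory-bound": "small"}
    return {tid: best_core[p] for tid, p in phases.items()}

phases = {0: classify_phase(2.1, 0.02),   # high ILP -> large core
          1: classify_phase(0.7, 0.22)}   # memory-bound -> small core
print(assign(phases))  # {0: 'large', 1: 'small'}
```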
Symposium on Computer Architecture and High Performance Computing | 2012
Rance Rodrigues; Arunachalam Annamalai; Israel Koren; Sandip Kundu
The emergence of asymmetric multicore processors (AMPs) has made thread scheduling in such systems an important problem. The computing needs of a thread often vary during its execution (phases) and hence, reassigning threads to cores (thread swapping) upon detection of such a change can significantly improve the AMP's power efficiency. Even though identifying a change in the resource requirements of a workload is straightforward, determining the thread reassignment is a challenge. Traditional online learning schemes rely on sampling to determine the best thread-to-core assignment in AMPs. However, as the number of cores in the multicore increases, the sampling overhead may be too large. In this paper, we propose a novel technique to dynamically assess the current thread-to-core assignment and determine whether swapping the threads between the cores will be beneficial and achieve a higher performance/Watt. This decision is based on estimating the expected performance and power of the current program phase on other cores. This estimation is done using the values of selected performance counters in the host core. By estimating the expected performance and power on each core type, informed thread scheduling decisions can be made while avoiding the overhead associated with sampling. We illustrate our approach using an 8-core high performance/low-power AMP and show the performance/Watt benefits of the proposed dynamic thread scheduling technique. We compare our proposed scheme against previously published schemes based on online learning and two schemes based on the use of an oracle, one static and the other dynamic. Our results show that significant performance/Watt gains can be achieved through informed thread scheduling decisions in AMPs.
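The sampling-free decision described here amounts to predicting each thread's (performance, power) on the other core type from counters read on its current core, then comparing perf/Watt before and after a hypothetical swap. The linear predictors, coefficients, and per-core power figures below are placeholders, not the paper's fitted models.

```python
# Sampling-free swap decision for a big/small core pair.
# The estimation models and watt figures are invented placeholders.

def estimate_on_big(ipc_small, stall_frac):
    """Predict (IPC, watts) on the big core from small-core counters."""
    return ipc_small * (1.0 + 1.2 * (1.0 - stall_frac)), 12.0

def estimate_on_small(ipc_big, stall_frac):
    """Predict (IPC, watts) on the small core from big-core counters."""
    return ipc_big * (0.2 + 0.4 * stall_frac), 4.0

def swap_beneficial(big_thread, small_thread):
    """Swap iff the summed predicted perf/Watt after swapping is higher."""
    current = big_thread["ipc"] / 12.0 + small_thread["ipc"] / 4.0
    ipc_b, w_b = estimate_on_big(small_thread["ipc"], small_thread["stall_frac"])
    ipc_s, w_s = estimate_on_small(big_thread["ipc"], big_thread["stall_frac"])
    return ipc_b / w_b + ipc_s / w_s > current

big = {"ipc": 0.5, "stall_frac": 1.0}    # stalls heavily on the big core
small = {"ipc": 0.4, "stall_frac": 0.0}  # compute-bound, has headroom
print(swap_beneficial(big, small))       # True: the swap raises perf/Watt
```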
International Conference on Parallel Architectures and Compilation Techniques | 2013
Arunachalam Annamalai; Rance Rodrigues; Israel Koren; Sandip Kundu
The importance of dynamic thread scheduling is increasing with the emergence of Asymmetric Multicore Processors (AMPs). Since the computing needs of a thread often vary during its execution, a fixed thread-to-core assignment is sub-optimal. Reassigning threads to cores (thread swapping) when the threads start a new phase with different computational needs can significantly improve the energy efficiency of AMPs. Although identifying phase changes in the threads is not difficult, determining the appropriate thread-to-core assignment is a challenge. Furthermore, the problem of thread reassignment is aggravated by the multiple power states that may be available in the cores. To this end, we propose a novel technique to dynamically assess the program phase needs and determine whether swapping threads between core-types and/or changing the voltage/frequency levels (DVFS) of the cores will result in higher throughput/Watt. This is achieved by predicting the expected throughput/Watt of the current program phase at different voltage/frequency levels on all the available core-types in the AMP. We show that the benefits from thread swapping and DVFS are orthogonal, demonstrating the potential of the proposed scheme to achieve significant benefits by seamlessly combining the two. We illustrate our approach using a dual-core High-Performance (HP)/Low-Power (LP) AMP with two power states and demonstrate significant throughput/Watt improvement over different baselines.
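Combining swapping and DVFS turns the decision into a search over (core type, V/f level) pairs for the predicted throughput/Watt maximum. The prediction table below stands in for the paper's counter-based per-phase predictors; the core names, power states, and numbers are assumptions.

```python
# Choosing among (core type, V/f level) pairs by predicted throughput/Watt.
# The prediction table is a stand-in for counter-based predictors;
# core types, power states, and all numbers are illustrative assumptions.

# Predicted (throughput, watts) for the current phase at each operating point.
PREDICTIONS = {
    ("HP", "high-vf"): (2.4, 14.0),
    ("HP", "low-vf"):  (1.8, 8.0),
    ("LP", "high-vf"): (1.2, 4.0),
    ("LP", "low-vf"):  (0.9, 2.5),
}

def best_operating_point(predictions):
    """Pick the (core, vf) pair maximizing predicted throughput/Watt."""
    return max(predictions, key=lambda k: predictions[k][0] / predictions[k][1])

print(best_operating_point(PREDICTIONS))  # ('LP', 'low-vf') for this phase
```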
International Parallel and Distributed Processing Symposium | 2012
Arunachalam Annamalai; Rance Rodrigues; Israel Koren; Sandip Kundu
Recent trends in technology scaling have enabled the incorporation of multiple processor cores on a single die. Depending on the characteristics of the cores, the multicore may be either symmetric (SMP) or asymmetric (AMP). Several studies have shown that in general, for a given resource and power budget, AMPs are likely to outperform their SMP counterparts. However, due to the heterogeneity in AMPs, scheduling threads is always a challenge. To address the issue of thread scheduling in AMPs, we propose a novel dynamic thread scheduling scheme that continuously monitors the current characteristics of the executing threads and determines the best thread-to-core assignment. The real-time monitoring is done using hardware performance counters that capture several microarchitecture-independent characteristics of the threads in order to determine the thread-to-core affinity. By controlling thread scheduling in hardware, the Operating System (OS) need not be aware of the underlying microarchitecture, significantly simplifying the OS scheduler for an AMP architecture. The proposed scheme is compared against a simple Round Robin scheduling and a recently published dynamic thread scheduling technique that allows swapping of threads (between asymmetric cores) at coarse grain time intervals, once every context switch (~20 ms for the Linux scheduler). The presented results indicate that our proposed scheme is able to achieve, on average, a performance/watt benefit of 10.5% over the previously published dynamic scheduling scheme and about 12.9% over the Round Robin scheme.
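One simple way to turn microarchitecture-independent characteristics into a thread-to-core affinity is to rank threads and give the big cores to the most compute-intensive ones. The feature names, scoring rule, and core labels below are illustrative assumptions, not the paper's mechanism.

```python
# Hardware-style affinity sketch: rank runnable threads by a score built
# from microarchitecture-independent characteristics and give the big cores
# to the highest-scoring threads. Features and rule are assumptions.

def schedule(threads, n_big_cores):
    """threads: {tid: {"ilp": ..., "mem_intensity": ...}} -> {tid: core type}."""
    # Favor high-ILP, compute-bound threads for the big cores.
    def score(tid):
        t = threads[tid]
        return t["ilp"] * (1.0 - t["mem_intensity"])
    ranked = sorted(threads, key=score, reverse=True)
    return {tid: ("big" if i < n_big_cores else "small")
            for i, tid in enumerate(ranked)}

threads = {0: {"ilp": 2.0, "mem_intensity": 0.1},
           1: {"ilp": 1.9, "mem_intensity": 0.7},   # ILP masked by memory stalls
           2: {"ilp": 0.8, "mem_intensity": 0.2}}
print(schedule(threads, n_big_cores=1))  # thread 0 gets the big core
```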
International Symposium on Quality Electronic Design | 2013
Arunachalam Annamalai; Raghavan Kumar; Arunkumar Vijayakumar; Sandip Kundu
As conventional CMOS technology approaches its scaling limits, the trend toward stacked 3D Integrated Circuits (3D ICs) is gaining momentum. 3D ICs offer reduced power dissipation, higher integration density, heterogeneous stacking and reduced interconnect delays. In a 3D IC stack, all but the bottom tier are thinned down to enable through-silicon vias (TSV). However, the thinning of the substrate increases the lateral thermal resistance, resulting in higher intra-layer temperature gradients potentially leading to performance degradation and even functional errors. In this work, we study the effect of thinning the substrate on the temperature profile of various tiers in 3D ICs. Our simulation results show that the intra-layer temperature gradient can be as high as 57°C. Conventional static solutions often lead to highly inefficient designs. To this end, we present a system-level situation-aware integrated scheme that performs opportunistic thread migration and dynamic voltage and frequency scaling (DVFS) to effectively manage thermal violations while increasing the system throughput relative to stand-alone schemes.
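The integrated policy can be sketched as: on a thermal violation, prefer migrating the hot thread to a cooler tier (which preserves throughput), and fall back to DVFS only when no tier has enough headroom. The temperature limit, headroom margin, and tier names below are illustrative assumptions.

```python
# Situation-aware thermal manager sketch for a 3D IC stack: migration first,
# DVFS as fallback. Thresholds and tier temperatures are assumptions.

T_LIMIT = 85.0  # assumed junction temperature limit in deg C

def manage(tier_temps, hot_tier):
    """Return the action for a tier that may exceed the thermal limit."""
    if tier_temps[hot_tier] <= T_LIMIT:
        return ("none", hot_tier)
    # Prefer migration: pick the coolest tier, if it has adequate headroom.
    coolest = min(tier_temps, key=tier_temps.get)
    if tier_temps[coolest] < T_LIMIT - 10.0:
        return ("migrate", coolest)
    return ("dvfs", hot_tier)  # no cool tier left: scale down V/f instead

print(manage({"tier0": 92.0, "tier1": 60.0, "tier2": 74.0}, "tier0"))
# ('migrate', 'tier1'): throughput is preserved instead of throttled
```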
ACM Transactions on Embedded Computing Systems | 2014
Rance Rodrigues; Arunachalam Annamalai; Sandip Kundu
There is a growing concern about the increasing rate of defects in computing substrates. Traditional redundancy solutions prove to be too expensive for commodity microprocessor systems. Modern microprocessors feature multiple execution units to take advantage of instruction level parallelism. However, most workloads do not exhibit the level of instruction level parallelism that a typical microprocessor is resourced for. This offers an opportunity to reexecute instructions using idle execution units. However, relying solely on idle resources will not provide full instruction coverage, so other alternatives must be explored. To that end, we propose and evaluate two instruction replay schemes within the same core for online testing of the execution units. One scheme (RER) reexecutes only the retired instructions, while the other (REI) reexecutes all the issued instructions. The complete proposed solution requires a comparator and minor modifications to control logic, resulting in negligible hardware overhead. Both soft and hard error detection are considered, and the performance and energy impact of both schemes are evaluated and compared against previously proposed redundant execution schemes. Results show that even though the proposed schemes result in a small performance penalty when compared to previous work, the energy overhead is significantly reduced.
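The core of both schemes is replay plus a comparator: execute an instruction twice (once on an idle unit) and flag any mismatch as an error. The sketch below models execution units as plain callables and injects a fault by hand; the function names and the fault model are illustrative assumptions.

```python
# Sketch of the RER idea: re-execute a retired instruction on an idle
# execution unit and compare results; a mismatch signals a soft or hard
# error. Units are modeled as callables; the fault model is an assumption.

def rer_check(instr, operands, primary_unit, idle_unit):
    """Re-execute one retired instruction; return (result, error_detected)."""
    first = primary_unit(instr, operands)
    replay = idle_unit(instr, operands)   # replay runs on an idle unit
    return first, first != replay         # comparator flags any mismatch

def alu(instr, ops):
    return ops[0] + ops[1] if instr == "add" else ops[0] - ops[1]

def faulty_alu(instr, ops):               # models a hard fault in one unit
    return alu(instr, ops) ^ 0x4          # a stuck bit corrupts the result

print(rer_check("add", (2, 3), alu, alu))        # (5, False): units agree
print(rer_check("add", (2, 3), alu, faulty_alu)) # (5, True): error detected
```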
IEEE Computer Society Annual Symposium on VLSI | 2013
Sudarshan Srinivasan; Rance Rodrigues; Arunachalam Annamalai; Israel Koren; Sandip Kundu
Asymmetric Multicore Processors (AMP) have emerged as likely candidates to solve the performance/power conundrum in the current generation of processors. Most recent work in this area evaluates such multicores by considering large (usually out-of-order (OOO)) and small (usually in-order (InO)) cores on the same chip. Dynamic online swapping of threads between these cores is then facilitated whenever deemed beneficial. However, if threads are swapped too often, the overheads may negatively impact the benefits of swapping. Hence, in most recent work, thread swapping decisions are made at coarse grain instruction granularities, leaving out many opportunities. In this paper, we propose a scheme to mitigate the penalty imposed by thread swapping and yet achieve all the benefits of AMPs. Here, a single superscalar OOO core morphs itself into an InO core at runtime, whenever determined to be performance/Watt efficient. Certain Intel processors already have a similar mechanism to statically morph an OOO core to an InO core to facilitate debug. We extend this existing capability to perform dynamic core morphing at runtime with an orthogonal objective of improving power efficiency. Results indicate that on average, performance/Watt benefits of 10% can be extracted by our proposed morphing scheme at a very small performance penalty of 3.8%. Since this scheme is based on existing mechanisms readily available in current microprocessors, it incurs no hardware overheads.
International Conference on Computer Design | 2013
Sudarshan Srinivasan; Rance Rodrigues; Arunachalam Annamalai; Israel Koren; Sandip Kundu
The computational needs of a program change over time. Sometimes a program exhibits low instruction level parallelism (ILP), while at other times the inherent ILP may be higher; sometimes a program stalls due to a large number of cache misses, while at other times it may exhibit high cache throughput. Asymmetric Multicore Processors (AMP) have been proposed to allow matching the computing needs of a thread to a core where it executes most efficiently. Some recent work focuses on AMPs consisting of a monolithic large out-of-order (OOO) core and a small in-order (InO) core. Dynamic swapping of threads between these cores is then facilitated to improve the energy efficiency of the threads without impacting performance too negatively. Swapping decisions are made at coarse grain instruction granularities to mitigate the impact of migration overhead. This excludes many opportunities for swapping at a fine granular level. In this paper we consider a single superscalar OOO core that can morph itself dynamically into an InO core at runtime. In order to determine when to morph from OOO to InO and vice versa, we rely on certain hardware performance monitors. Using these performance monitors we estimate the energy-delay-squared product (ED2P) for both modes of operation, which is then used to make morphing decisions. The morphing hardware support is simple and is already available in certain Intel processors to facilitate debug. The proposed scheme has a low migration overhead, which enables fine-grain morphing to achieve more energy efficient computing by trading a small loss of performance for much greater energy reduction.
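The ED2P comparison at the heart of the morphing decision can be sketched as follows. The per-mode (IPC, watts) estimates here are placeholder inputs standing in for the values derived from hardware performance monitors; the instruction count and frequency are assumptions.

```python
# ED2P-driven morphing sketch: estimate energy-delay-squared for the OOO
# and InO modes and morph to whichever minimizes it. The (IPC, watts)
# estimates, instruction count, and frequency are illustrative assumptions.

def ed2p(ipc, watts, insts=1e9, freq_hz=2e9):
    """Energy * delay^2 for a fixed instruction count at a fixed frequency."""
    delay = insts / (ipc * freq_hz)   # seconds to run `insts` instructions
    energy = watts * delay            # joules over that interval
    return energy * delay * delay

def choose_mode(ooo_est, ino_est):
    """Each estimate is (IPC, watts); pick the mode with the lower ED2P."""
    return "OOO" if ed2p(*ooo_est) < ed2p(*ino_est) else "InO"

print(choose_mode((1.8, 12.0), (1.0, 5.0)))  # OOO: high-ILP phase
print(choose_mode((1.1, 12.0), (0.9, 5.0)))  # InO: low-ILP phase
```

Since ED2P scales as watts / IPC^3 at fixed frequency, the OOO mode wins only when its IPC advantage is large enough to outweigh its higher power, which matches the intuition of morphing down during low-ILP phases.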