Rance Rodrigues
University of Massachusetts Amherst
Publications
Featured research published by Rance Rodrigues.
international conference on parallel architectures and compilation techniques | 2011
Rance Rodrigues; Arunachalam Annamalai; Israel Koren; Sandip Kundu; Omer Khan
The trend toward multicore processors is shifting the emphasis in computation from sequential to parallel processing. However, not all applications can be parallelized to benefit from multiple cores. Such applications under-utilize the parallel resources, resulting in sub-optimal performance/watt. They may, however, benefit from powerful uniprocessors. On the other hand, not all applications can take advantage of more powerful uniprocessors. To address the competing requirements of diverse applications, we propose a heterogeneous multicore architecture with a Dynamic Core Morphing (DCM) capability. Depending on the computational demands of the currently executing applications, the resources of a few tightly coupled cores are morphed at runtime. We present a simple hardware-based algorithm that monitors the time-varying computational needs of the application and, when deemed beneficial, triggers reconfiguration of the cores at fine-grain time scales to maximize the performance/watt of the application. The proposed dynamic scheme is then compared against a baseline static heterogeneous multicore configuration and an equivalent homogeneous configuration. Our results show that dynamic morphing of cores can provide average performance/watt gains of 43% and 16% when compared to the homogeneous and baseline heterogeneous configurations, respectively.
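The morphing trigger described in the abstract can be sketched as a simple decision loop: monitor performance/watt over short intervals and reconfigure only when the estimated gain clears a threshold. The function names, threshold, and sample numbers below are illustrative assumptions, not the paper's actual hardware algorithm.

```python
# Hypothetical sketch of a fine-grain morphing decision; names and
# thresholds are illustrative, not taken from the paper.

def perf_per_watt(ipc, watts):
    """Performance/watt metric for one monitoring interval."""
    return ipc / watts

def should_morph(baseline_samples, morphed_estimate, threshold=1.1):
    """Trigger reconfiguration only when the estimated performance/watt
    of the morphed (strong) core exceeds the recent baseline average
    by at least `threshold` (here, a 10% margin)."""
    baseline = sum(perf_per_watt(i, w) for i, w in baseline_samples) / len(baseline_samples)
    return morphed_estimate > threshold * baseline

# Example: (IPC, watts) samples give a baseline of 0.105 perf/watt;
# an estimated 0.13 after morphing clears the 10% margin.
samples = [(1.0, 10.0), (1.1, 10.0)]
print(should_morph(samples, morphed_estimate=0.13))
```

In the paper this decision runs in hardware at fine-grain time scales; the sketch only conveys the thresholded comparison.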
IEEE Transactions on Circuits and Systems Ii-express Briefs | 2013
Rance Rodrigues; Arunachalam Annamalai; Israel Koren; Sandip Kundu
We present a study on estimating the dynamic power consumption of a processor based on performance counters. Today's processors feature a large number of such counters to monitor various CPU and memory parameters, such as utilization, occupancy, bandwidth, and page, cache, and branch buffer hit rates. The use of various sets of performance counters to estimate the power consumed by the processor has been demonstrated in the past. Our goal is to find out whether there exists a subset of counters that can be used to estimate, with sufficient accuracy, the dynamic power consumption of processors with varying microarchitectures. To this end, we consider two recent processor configurations representing two extremes of the performance spectrum, one targeting low power and the other high performance. Our results indicate that only three counters, measuring 1) the number of fetched instructions, 2) level-1 cache hits, and 3) dispatch stalls, are sufficient to achieve adequate precision. These counters are shown to be effective in predicting the dynamic power consumption across processors of varying resource sizes, achieving a prediction accuracy of 95%.
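A counter-based power model of this kind is typically a linear fit from counter rates to measured power. The sketch below fits such a model on the three counters the abstract identifies; the coefficients and data are synthetic, and the paper's actual model and fitting procedure may differ.

```python
# Illustrative linear power model over the three counters named in the
# abstract (fetched instructions, L1 cache hits, dispatch stalls).
# All data and coefficients here are synthetic stand-ins.
import numpy as np

def fit_power_model(counters, measured_power):
    """Least-squares fit: power ~ w0 + w1*fetch + w2*l1hit + w3*stall."""
    X = np.hstack([np.ones((counters.shape[0], 1)), counters])
    w, *_ = np.linalg.lstsq(X, measured_power, rcond=None)
    return w

def predict_power(w, counters):
    """Predict dynamic power from counter rates using fitted weights."""
    X = np.hstack([np.ones((counters.shape[0], 1)), counters])
    return X @ w

rng = np.random.default_rng(0)
counters = rng.uniform(0, 1, size=(50, 3))   # normalized counter rates
true_w = np.array([2.0, 5.0, 1.5, 3.0])      # synthetic ground truth
power = np.hstack([np.ones((50, 1)), counters]) @ true_w
w = fit_power_model(counters, power)
print(np.allclose(w, true_w))                # exact recovery on noiseless data
```

On real measurements the fit would carry noise, and accuracy would be reported against held-out samples rather than exact recovery.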
ACM Transactions on Design Automation of Electronic Systems | 2013
Rance Rodrigues; Arunachalam Annamalai; Israel Koren; Sandip Kundu
Asymmetric multi-core processors (AMPs) have been shown to outperform symmetric ones in terms of performance and performance/watt. Improved performance and power efficiency are achieved when the program threads are matched to their most suitable cores. Since the computational needs of a program may change during its execution, the best thread-to-core assignment will likely change with time. We have, therefore, developed an online program phase classification scheme that allows the swapping of threads when the current needs of the threads justify a change in the assignment. The architectural differences among the cores in an AMP can never match the diversity that exists among different programs and even between different phases of the same program. Consider, for example, a program (or a program phase) that has a high instruction-level parallelism (ILP) and will exhibit high power efficiency if executed on a powerful core. We cannot, however, include such powerful cores in the designed AMP, since they will remain underutilized most of the time, and they are not power efficient when the programs do not exhibit a high degree of ILP. Thus, we must expect to see program phases where the designed cores will be unable to support the ILP that the program can exhibit. We, therefore, propose in this article a dynamic morphing scheme. This scheme will allow a core to gain control of a functional unit that is ordinarily under the control of a neighboring core during periods of intense computation with high ILP. This way, we dynamically adjust the hardware resources to the current needs of the application. Our results show that combining online phase classification and dynamic core morphing can significantly improve the performance/watt of most multithreaded workloads.
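One common way to realize online phase classification is to treat each interval's performance-counter vector as a phase fingerprint and declare a new phase when it drifts past a threshold. The distance metric, threshold, and class structure below are illustrative choices, not the paper's exact scheme.

```python
# Hedged sketch of online phase classification via counter-vector drift.
# Metric and threshold are assumptions for illustration.

def manhattan(a, b):
    """Manhattan distance between two counter vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

class PhaseClassifier:
    def __init__(self, threshold=0.3):
        self.threshold = threshold
        self.current = None   # fingerprint of the current phase
        self.phase_id = -1

    def classify(self, counters):
        """Return the phase id for this interval; a new id signals a
        phase change, i.e. a candidate moment for thread swapping or
        core morphing."""
        if self.current is None or manhattan(counters, self.current) > self.threshold:
            self.phase_id += 1
            self.current = counters
        return self.phase_id

pc = PhaseClassifier()
print(pc.classify([0.8, 0.1]))    # first interval starts phase 0
print(pc.classify([0.82, 0.12]))  # small drift: still phase 0
print(pc.classify([0.2, 0.7]))    # large drift: new phase 1
```

A detected phase change is what would trigger the reassignment or morphing decision described in the abstract.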
symposium on computer architecture and high performance computing | 2012
Rance Rodrigues; Arunachalam Annamalai; Israel Koren; Sandip Kundu
The emergence of asymmetric multicore processors (AMPs) has elevated the importance of thread scheduling in such systems. The computing needs of a thread often vary during its execution (phases) and hence, reassigning threads to cores (thread swapping) upon detection of such a change can significantly improve the AMP's power efficiency. Even though identifying a change in the resource requirements of a workload is straightforward, determining the thread reassignment is a challenge. Traditional online learning schemes rely on sampling to determine the best thread-to-core assignment in AMPs. However, as the number of cores in the multicore increases, the sampling overhead may be too large. In this paper, we propose a novel technique to dynamically assess the current thread-to-core assignment and determine whether swapping the threads between the cores will be beneficial and achieve a higher performance/Watt. This decision is based on estimating the expected performance and power of the current program phase on other cores. This estimation is done using the values of selected performance counters in the host core. By estimating the expected performance and power on each core type, informed thread scheduling decisions can be made while avoiding the overhead associated with sampling. We illustrate our approach using an 8-core high-performance/low-power AMP and show the performance/Watt benefits of the proposed dynamic thread scheduling technique. We compare our proposed scheme against previously published schemes based on online learning and two schemes based on the use of an oracle, one static and the other dynamic. Our results show that significant performance/Watt gains can be achieved through informed thread scheduling decisions in AMPs.
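The core idea, estimating how the current phase would perform on the other core type instead of sampling it there, can be sketched as follows. The cross-core estimator here is a stand-in linear model with made-up coefficients; the paper derives its own counter-based estimators.

```python
# Hedged sketch: estimate throughput/Watt of each thread's current phase
# on the *other* core type from host-core statistics, and swap only when
# the pairwise reassignment looks beneficial. Coefficients are invented.

def estimate_on_other_core(ipc, mem_intensity, core_type):
    """Toy cross-core perf/Watt estimator (illustrative coefficients)."""
    if core_type == "big":            # big core rewards high ILP, penalizes memory-bound phases
        return ipc * 1.5 - mem_intensity * 0.5
    return ipc * 0.8 + 0.1            # little core: lower but cheaper throughput

def swap_beneficial(thread_a, thread_b):
    """thread_a currently runs on the big core, thread_b on the little
    core; return True if swapping the pair improves the total estimated
    performance/Watt."""
    current = thread_a["ppw"] + thread_b["ppw"]
    swapped = (estimate_on_other_core(thread_a["ipc"], thread_a["mem"], "little")
               + estimate_on_other_core(thread_b["ipc"], thread_b["mem"], "big"))
    return swapped > current

# Memory-bound thread on the big core, high-ILP thread on the little core:
# the estimates favor swapping.
a = {"ipc": 0.5, "mem": 0.8, "ppw": 0.4}
b = {"ipc": 1.2, "mem": 0.1, "ppw": 0.9}
print(swap_beneficial(a, b))
```

The point of the comparison is that no sampling run on the other core is needed: both sides of the inequality come from counters already measured on the host cores.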
international conference on parallel architectures and compilation techniques | 2013
Arunachalam Annamalai; Rance Rodrigues; Israel Koren; Sandip Kundu
The importance of dynamic thread scheduling is increasing with the emergence of Asymmetric Multicore Processors (AMPs). Since the computing needs of a thread often vary during its execution, a fixed thread-to-core assignment is sub-optimal. Reassigning threads to cores (thread swapping) when the threads start a new phase with different computational needs can significantly improve the energy efficiency of AMPs. Although identifying phase changes in the threads is not difficult, determining the appropriate thread-to-core assignment is a challenge. Furthermore, the problem of thread reassignment is aggravated by the multiple power states that may be available in the cores. To this end, we propose a novel technique to dynamically assess the program phase needs and determine whether swapping threads between core types and/or changing the voltage/frequency levels (DVFS) of the cores will result in higher throughput/Watt. This is achieved by predicting the expected throughput/Watt of the current program phase at different voltage/frequency levels on all the available core types in the AMP. We show that the benefits from thread swapping and DVFS are orthogonal, demonstrating the potential of the proposed scheme to achieve significant benefits by seamlessly combining the two. We illustrate our approach using a dual-core High-Performance (HP)/Low-Power (LP) AMP with two power states and demonstrate significant throughput/Watt improvement over different baselines.
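Combining the two knobs amounts to searching over every (core type, voltage/frequency level) pair rather than either dimension alone. The sketch below uses a placeholder prediction table; the paper predicts these values from performance counters.

```python
# Sketch of jointly choosing core assignment and DVFS state by picking
# the (core_type, v/f level) pair with the best predicted throughput/Watt.
# The prediction values are placeholders, not measured numbers.

def best_config(predictions):
    """predictions maps (core_type, vf_level) -> estimated throughput/Watt.
    Returns the configuration that maximizes the estimate."""
    return max(predictions, key=predictions.get)

# For this hypothetical phase, neither "HP at high v/f" nor "LP" wins:
# the best point combines the HP core type with the low power state.
phase = {
    ("HP", "high"): 0.8,
    ("LP", "high"): 0.9,
    ("HP", "low"):  1.1,
    ("LP", "low"):  0.7,
}
print(best_config(phase))
```

Searching the joint space is what makes the two mechanisms orthogonal in the abstract's sense: the winner may differ from what either swapping alone or DVFS alone would choose.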
international parallel and distributed processing symposium | 2012
Arunachalam Annamalai; Rance Rodrigues; Israel Koren; Sandip Kundu
Recent trends in technology scaling have enabled the incorporation of multiple processor cores on a single die. Depending on the characteristics of the cores, the multicore may be either symmetric (SMP) or asymmetric (AMP). Several studies have shown that in general, for a given resource and power budget, AMPs are likely to outperform their SMP counterparts. However, due to the heterogeneity in AMPs, scheduling threads is always a challenge. To address the issue of thread scheduling in AMPs, we propose a novel dynamic thread scheduling scheme that continuously monitors the current characteristics of the executing threads and determines the best thread-to-core assignment. The real-time monitoring is done using hardware performance counters that capture several microarchitecture-independent characteristics of the threads in order to determine the thread-to-core affinity. By controlling thread scheduling in hardware, the Operating System (OS) need not be aware of the underlying microarchitecture, significantly simplifying the OS scheduler for an AMP architecture. The proposed scheme is compared against a simple Round Robin scheduling scheme and a recently published dynamic thread scheduling technique that allows swapping of threads (between asymmetric cores) at coarse-grain time intervals, once every context switch (~20 ms for the Linux scheduler). The presented results indicate that our proposed scheme is able to achieve, on average, a performance/watt benefit of 10.5% over the previously published dynamic scheduling scheme and about 12.9% over the Round Robin scheme.
international test conference | 2010
Rance Rodrigues; Sandip Kundu; Omer Khan
At various stages of a product's life, faults arise from different sources. During product bring-up, logic errors are dominant. During production, manufacturing defects are the main concern, while during operation, the concern shifts to aging defects. No matter what the source is, debugging such defects may permit logic, circuit, or physical design changes to eliminate them in the future. Within a processor chip, there are three broad categories of structures: large memory structures such as caches; small memory structures such as the reorder buffer, issue queue, and load-store buffers; and the datapath. Most control functions and data steering operations are based on small memory structures, and they are hard to debug. In this paper, we propose a lightweight hardware scheme, called the shadow checker, to detect faults in these critical units. The entries in these units are tested by means of a shadow entry that mimics the intended operation. A mismatch traps an error. The shadow checker shadows an entry for a few thousand cycles before moving on to shadow another. This scheme can be employed to test chips during silicon debug and manufacturing test, as well as during regular operation. We ran experiments on 13 SPEC2000 benchmarks and found that our scheme detects 100% of inserted faults.
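The shadow-entry mechanism can be modeled in a few lines: a single spare entry mirrors one monitored entry at a time, and any divergence flags a fault. The class and method names are illustrative; the paper's design is hardware, not software.

```python
# Minimal software model of the shadow-checker idea: one shadow entry
# mirrors one structure entry at a time; a mismatch traps an error.
# Names and rotation policy are illustrative assumptions.

class ShadowChecker:
    def __init__(self, num_entries):
        self.num_entries = num_entries
        self.watched = 0      # index of the entry currently shadowed
        self.shadow = None    # shadow copy of that entry's value

    def rotate(self):
        """After a few thousand cycles, move on to shadow the next entry."""
        self.watched = (self.watched + 1) % self.num_entries
        self.shadow = None

    def observe(self, index, value):
        """Mirror writes to the watched entry; compare on later accesses.
        Returns False when a mismatch (i.e. a fault) is detected."""
        if index != self.watched:
            return True               # entry not under observation
        if self.shadow is None:
            self.shadow = value       # shadow mimics the intended operation
            return True
        return self.shadow == value   # mismatch traps an error

checker = ShadowChecker(num_entries=4)
checker.observe(0, 42)        # shadow learns entry 0's value
print(checker.observe(0, 42)) # consistent
print(checker.observe(0, 41)) # divergence in entry 0: fault detected
```

Because only one shadow entry exists, the area cost stays small while coverage accrues over time as the checker rotates through the structure.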
international on line testing symposium | 2011
Rance Rodrigues; Sandip Kundu
Reliability and manufacturability have emerged as dominant concerns for today's multi-billion-transistor chips. In this paper, we investigate how to degrade a chip multiprocessor (CMP) gracefully in the presence of faults, keeping its architected functionality intact at the expense of some loss of performance. The proposed solution involves sharing critical execution resources among cores to survive faults. Recent research has suggested that large datapath units such as the FPU and integer division units are good candidates for execution outsourcing to other working cores in a CMP. In this paper, we focus on the relatively small but critically important integer ALU. Outsourcing ALU operations incurs a large performance penalty, and better solutions need to be in place to ensure survivability with minimal performance loss. We propose the provisioning of a shared ALU among a set of cores that can act as a spare for any constituent core in the group. This solution works well for single ALU failures, but leads to resource contention when multiple ALUs fail. Simulation case studies on MediaBench and MiBench benchmarks show that the proposed solution allows the CMP to remain functionally intact with no performance penalty for single ALU failures and no more than 1.5% performance loss on average for the failure of a single ALU in each core.
international conference on vlsi design | 2010
Rance Rodrigues; Sandip Kundu
The printed image on a silicon wafer differs from the layout due to optical diffraction. Optical proximity correction (OPC) is a layout distortion technique to improve the printed image. During manufacturing, parameters such as focus, dose, and resist thickness may vary within tolerance margins. These factors contribute additional distortion of the expected printed shape that is not addressed directly by OPC. To ensure a robust IC, process window considerations are extremely important when running lithography simulations as the technology is scaled further, since the sensitivity of patterns printed on silicon to process variations is very high. Optical lithography simulation has always been an important link in the chain for design for manufacturability (DFM), and a lot of research has gone into making it faster and more accurate. However, being a compute-intensive process, speeding up litho simulation without significant compromise in accuracy has always been tricky. In this paper we propose a new method to approximate litho simulation based on the wavelet transform, as opposed to the traditional method, and we validate the speed and accuracy of our simulator by comparing our results with those of a popular commercial lithography simulator while considering focus variations. While our simulator suffers from some RMS error, it offers (1) a speedup of 15X, (2) the ability to simulate very large circuit masks where the commercial software fails, and (3) direct incorporation of manufacturing process variation. This allows litho simulation against multiple manufacturing process corners, which in turn helps in producing a robust design.
defect and fault tolerance in vlsi and nanotechnology systems | 2011
Rance Rodrigues; Israel Koren; Sandip Kundu
CMOS wear-out mechanisms such as time-dependent dielectric breakdown (TDDB), hot carrier injection (HCI), negative bias temperature instability (NBTI), electromigration (EM), and stress-induced voiding (SIV) are well documented in the literature. Often the onset of wear-out is gradual, with initial manifestation as delay defects that result in timing errors. This motivates the need for online testing. The combined effect of dynamic reconfiguration, such as dynamic voltage and frequency scaling (DVFS), and signal integrity issues, coupled with aging-related wear-out, complicates a priori selection of test vectors, further favoring online testing. Traditional online test techniques such as Double and Triple Modular Redundancy (DMR and TMR) pose severe area and power overheads. In this paper we propose an architecture to assist online testing in a Chip Multiprocessor (CMP) based on execution path recording. Since core utilization in CMPs is low in practice, we can use the idle time of cores opportunistically to run test threads that mimic functional threads. The initiation and termination of tests and the comparison of test results are performed by a dedicated, simple, and functionally limited small core that we call the Sentry Core (SC). The sentry core is hidden from the OS and has the ability to monitor and interrupt the general-purpose cores. Upon interrupt, a general-purpose core can send data to the sentry core. To detect errors, the SC initializes the general-purpose cores, collects signatures from hardware monitors (that compact execution traces), and compares them against those of duplicate test threads, obviating any need for cycle-by-cycle comparison. Major benefits of the proposed solution include: (1) online testing with minimal area overhead, (2) scalability, and (3) testability throughout the life cycle of a CMP. Experimental results show that the proposed scheme is capable of detecting 87% of the faults injected into the processor at an area overhead of less than 3% of the target CMP.
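The signature comparison at the heart of the Sentry Core idea can be sketched compactly: compress each core's execution trace into a single signature and compare signatures instead of full traces. The choice of SHA-256 as the compaction function is an assumption for illustration; a hardware monitor would use something like a MISR, not a cryptographic hash.

```python
# Toy model of the Sentry Core check: compact an execution trace (e.g., a
# sequence of retired PCs) into one signature and compare the functional
# thread's signature against its duplicate test thread, avoiding any
# cycle-by-cycle comparison. SHA-256 is an illustrative stand-in for a
# hardware trace-compaction circuit.
import hashlib

def trace_signature(trace_events):
    """Compact a trace of integer events into a fixed-size signature."""
    h = hashlib.sha256()
    for event in trace_events:
        h.update(event.to_bytes(8, "little"))
    return h.hexdigest()

def sentry_check(functional_trace, test_trace):
    """A mismatch indicates a fault on the core that ran the test thread."""
    return trace_signature(functional_trace) == trace_signature(test_trace)

golden = [0x400000, 0x400004, 0x400008]
print(sentry_check(golden, [0x400000, 0x400004, 0x400008]))  # traces agree
print(sentry_check(golden, [0x400000, 0x400004, 0x40000c]))  # divergence: fault
```

Comparing fixed-size signatures rather than full traces is what keeps the sentry core small and the bandwidth to it minimal.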