Nima Honarmand
University of Illinois at Urbana–Champaign
Publications
Featured research published by Nima Honarmand.
Symposium on Principles of Programming Languages | 2011
Robert L. Bocchino; Stephen T. Heumann; Nima Honarmand; Sarita V. Adve; Vikram S. Adve; Adam Welc; Tatiana Shpeisman
A number of deterministic parallel programming models with strong safety guarantees are emerging, but similar support for nondeterministic algorithms, such as branch and bound search, remains an open question. We present a language together with a type and effect system that supports nondeterministic computations with a deterministic-by-default guarantee: nondeterminism must be explicitly requested via special parallel constructs (marked nd), and any deterministic construct that does not execute any nd construct has deterministic input-output behavior. Moreover, deterministic parallel constructs are always equivalent to a sequential composition of their constituent tasks, even if they enclose, or are enclosed by, nd constructs. Finally, in the execution of nd constructs, interference may occur only between pairs of accesses guarded by atomic statements, so there are no data races, either between atomic statements and unguarded accesses (strong isolation) or between pairs of unguarded accesses (stronger than strong isolation alone). We enforce the guarantees at compile time with modular checking using novel extensions to a previously described effect system. Our effect system extensions also enable the compiler to remove unnecessary transactional synchronization. We provide a static semantics, dynamic semantics, and a complete proof of soundness for the language, both with and without the barrier removal feature. An experimental evaluation shows that our language can achieve good scalability for realistic parallel algorithms, and that the barrier removal techniques provide significant performance gains.
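The following is a rough C++ analogy of the deterministic-by-default idea, not the paper's language: det_scale stands in for a deterministic parallel construct, nd_best for an explicitly nondeterministic (nd) one, and the mutex plays the role of the paper's atomic statements. All names are illustrative assumptions.

    // Hedged illustration (not the paper's language): the deterministic-by-default
    // idea rendered as a C++ analogy.
    #include <algorithm>
    #include <execution>
    #include <mutex>
    #include <vector>

    // Deterministic construct: tasks write disjoint state, so the parallel loop is
    // equivalent to a sequential composition of its iterations.
    void det_scale(std::vector<int>& a, int k) {
        std::for_each(std::execution::par, a.begin(), a.end(),
                      [k](int& x) { x *= k; });        // no interference between tasks
    }

    // Explicitly requested nondeterminism: iterations race to update a shared "best"
    // value (as in branch-and-bound), but every conflicting access sits inside the
    // same critical section, the analogue of the paper's atomic statements, so there
    // are no data races.
    int nd_best(const std::vector<int>& candidates) {
        int best = 0;
        std::mutex m;
        std::for_each(std::execution::par, candidates.begin(), candidates.end(),
                      [&](int c) {
                          std::lock_guard<std::mutex> g(m);  // "atomic" section
                          best = std::max(best, c);          // update order may vary
                      });
        return best;
    }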
International Symposium on Computer Architecture | 2013
Gilles Pokam; Klaus Danne; Cristiano Pereira; Rolf Kassa; Tim Kranich; Shiliang Hu; Justin E. Gottschlich; Nima Honarmand; Nathan Dautenhahn; Samuel T. King; Josep Torrellas
There has been significant interest in hardware-assisted deterministic Record and Replay (RnR) systems for multithreaded programs on multiprocessors. However, no proposal has implemented this technique in a hardware prototype with full operating system support. Such an implementation is needed to assess RnR practicality. This paper presents QuickRec, the first multicore Intel Architecture (IA) prototype of RnR for multithreaded programs. QuickRec is based on QuickIA, an Intel emulation platform for rapid prototyping of new IA extensions. QuickRec is composed of a Xeon server platform with FPGA-emulated second-generation Pentium cores, and Capo3, a full software stack for managing the recording hardware from within a modified Linux kernel. This paper's focus is understanding and evaluating the implementation issues of RnR on a real platform. Our effort leads to some lessons learned, as well as to some pointers for future research. We demonstrate that RnR can be implemented efficiently on a real multicore IA system. In particular, we show that the rate of memory log generation is insignificant, and that the recording hardware has negligible performance overhead. However, the software stack incurs an average recording overhead of nearly 13%, which must be reduced to enable always-on use of RnR.
Architectural Support for Programming Languages and Operating Systems | 2013
Nima Honarmand; Nathan Dautenhahn; Josep Torrellas; Samuel T. King; Gilles Pokam; Cristiano Pereira
Architectures for deterministic record-replay (R&R) of multithreaded code are attractive for program debugging, intrusion analysis, and fault-tolerance uses. However, very few of the proposed designs have focused on maximizing replay speed -- a key enabling property of these systems. The few efforts that focus on replay speed require intrusive hardware or software modifications, or target whole-system R&R rather than the more useful application-level R&R. This paper presents the first hardware-based scheme for unintrusive, application-level R&R that explicitly targets high replay speed. Our scheme, called Cyrus, requires no modification to commodity snoopy cache coherence. It introduces the concept of an on-the-fly software Backend Pass during recording which, as the log is being generated, transforms it for high replay parallelism. This pass also fixes up the log, and can flexibly trade off replay parallelism for log size. We analyze the performance of Cyrus using full system (OS plus hardware) simulation. Our results show that Cyrus has negligible recording overhead. In addition, for 8-processor runs of SPLASH-2, Cyrus attains an average replay parallelism of 5, and a replay speed that is, on average, only about 50% lower than the recording speed.
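A hedged C++ sketch of the flavor of such an on-the-fly backend pass: raw chunks carrying inter-processor dependences are levelized so that independent chunks can be replayed in parallel. The Chunk layout and field names are assumptions for illustration, not Cyrus's actual log format.

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    struct Chunk {
        int proc;                        // recording processor
        std::size_t nInstructions;       // chunk size captured by hardware
        std::vector<std::size_t> deps;   // indices of earlier chunks this one must follow
        std::size_t replayStep = 0;      // filled in by the backend pass
    };

    // Assign each chunk the earliest parallel step consistent with its dependences.
    // Chunks that end up with the same step can be replayed concurrently.
    void backendPass(std::vector<Chunk>& log) {
        for (std::size_t i = 0; i < log.size(); ++i) {   // chunks arrive in order
            std::size_t step = 0;
            for (std::size_t d : log[i].deps)
                step = std::max(step, log[d].replayStep + 1);
            log[i].replayStep = step;
        }
    }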
Architectural Support for Programming Languages and Operating Systems | 2014
Nima Honarmand; Josep Torrellas
Record and Deterministic Replay (RnR) of multithreaded programs on relaxed-consistency multiprocessors has been a long-standing problem. While there are designs that work for Total Store Ordering (TSO), finding a general solution that is able to record the access reordering allowed by any relaxed-consistency model has proved challenging. This paper presents the first complete solution for hardware-assisted memory race recording that works for any relaxed-consistency model of current processors. With the scheme, called RelaxReplay, we can build an RnR system for any relaxed-consistency model and coherence protocol. RelaxReplay's core innovation is a new way of capturing memory access reordering. Each memory instruction goes through a post-completion in-order counting step that detects any reordering, and efficiently records it. We evaluate RelaxReplay with simulations of an 8-core release-consistent multicore running SPLASH-2 programs. We observe that RelaxReplay induces negligible overhead during recording. In addition, the average size of the log produced is comparable to the log sizes reported for existing solutions, and still very small compared to the memory bandwidth of modern machines. Finally, deterministic replay is efficient and needs minimal hardware support.
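A hedged C++ model of the post-completion, in-order counting idea: accesses are walked in program order, and any access whose completion time precedes that of an earlier access is flagged as reordered. The field names and log format are illustrative assumptions, not RelaxReplay's encoding.

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    struct Access {
        std::uint64_t programOrder;   // position in program order
        std::uint64_t completionTime; // when the access became globally visible
    };

    struct ReorderRecord {
        std::uint64_t programOrder;   // the access that completed "early"
    };

    // Count accesses in program order; an access that completed before the latest
    // completion time seen so far was reordered with respect to an older access.
    std::vector<ReorderRecord> detectReordering(const std::vector<Access>& inProgramOrder) {
        std::vector<ReorderRecord> log;
        std::uint64_t latestCompletion = 0;
        for (const Access& a : inProgramOrder) {
            if (a.completionTime < latestCompletion)
                log.push_back({a.programOrder});          // record the reordering
            latestCompletion = std::max(latestCompletion, a.completionTime);
        }
        return log;
    }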
VLSI Test Symposium | 2007
Nima Honarmand; Ali Shahabi; Hasan Sohofi; Maghsoud Abbaspour; Zainalabedin Navabi
As the complexity of integrated circuits increases, they become more susceptible to manufacturing faults, decreasing the total process yield. Thus, it would be desirable to develop techniques for reusing faulty dies, even with degraded performance. In this paper, a new method for high-level synthesis of degradable ASICs is presented. Our technique introduces the concept of virtual binding. In this approach, operations are bound to virtual components that are linked with actual non-faulty components using a set of configuration multiplexers and flip-flops embedded in the data-path. Using virtual components simplifies the synthesis algorithm and decreases the size of the generated control unit. The virtual-to-physical mapping of the components is established by programming the configuration flip-flops after diagnosing the faulty components. The experimental results show that the area and delay overheads of the resulting circuits are acceptable compared to the original, non-degradable circuits.
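A hedged C++ sketch of the virtual-binding idea: operations target virtual functional units, and a small programmable table (standing in for the configuration multiplexers and flip-flops) maps each virtual unit to a healthy physical one after diagnosis. The structure and names are illustrative assumptions, not the paper's synthesis flow.

    #include <cstddef>
    #include <optional>
    #include <vector>

    struct Datapath {
        std::vector<bool> physicalFaulty;     // diagnosis result per physical unit
        std::vector<std::size_t> virtToPhys;  // programmable virtual -> physical map
    };

    // Program the configuration: bind each virtual unit to a healthy physical unit.
    // If fewer healthy units exist than virtual ones, units are shared and the chip
    // runs degraded instead of being discarded.
    std::optional<Datapath> configure(std::size_t nVirtual, std::vector<bool> faulty) {
        std::vector<std::size_t> healthy;
        for (std::size_t p = 0; p < faulty.size(); ++p)
            if (!faulty[p]) healthy.push_back(p);
        if (healthy.empty()) return std::nullopt;        // nothing usable remains
        Datapath dp{std::move(faulty), {}};
        for (std::size_t v = 0; v < nVirtual; ++v)
            dp.virtToPhys.push_back(healthy[v % healthy.size()]);
        return dp;
    }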
IEICE Electronics Express | 2007
Ali Shahabi; Nima Honarmand; Hassan Sohofi; Zainalabedin Navabi
The decreasing manufacturing yield of integrated circuits, a result of rising complexity and shrinking feature sizes, together with the emergence of NoC-based design techniques, has necessitated network reconfiguration techniques for reusing NoCs with faulty communication hardware. In this paper, we propose a method to cope with the problem of faulty communication links in mesh-based NoCs. The method is based on the use of programmable routing tables, which have a fixed number of entries, in the network switches. Experimental results show that a network reconfigured for fault masking by programming its routing tables has acceptable, though degraded, performance compared to the non-faulty network.
International Symposium on Computer Architecture | 2014
Nima Honarmand; Josep Torrellas
Hardware-assisted Record and Deterministic Replay (RnR) of programs has been proposed as a primitive for debugging hard-to-repeat software bugs. However, simply providing support for repeatedly stumbling on the same bug does not help diagnose it. For bug diagnosis, developers typically want to modify the code, e.g., by creating and operating on new variables, or printing state. Unfortunately, this renders the RnR log inconsistent and makes Replay Debugging (i.e., debugging while using an RnR log for replay) dicey at best. This paper presents rdb, the first scheme for replay debugging that guarantees exact replay. rdb relies on two mechanisms. The first one is compiler support to split the instrumented application into two executables: one that is identical to the original program binary, and another that encapsulates all the added debug code. The second mechanism is a runtime infrastructure that replays the application and, without affecting it in any way, invokes the appropriate debug code at the appropriate locations. We describe an implementation of rdb based on LLVM and Pin, and show an example of how rdb's replay debugging helps diagnose a real bug.
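A hedged conceptual sketch, in C++, of the replay-debugging split: the recorded program is replayed unchanged, while developer-added debug code lives in a separate module and is invoked at chosen locations by the replay runtime. The types and dispatch table are illustrative assumptions; the paper's actual implementation uses LLVM and Pin rather than this mechanism.

    #include <cstdint>
    #include <functional>
    #include <unordered_map>
    #include <vector>

    using DebugHook = std::function<void(const std::uint8_t* appState)>;

    class ReplayRuntime {
        std::unordered_map<std::uint64_t, std::vector<DebugHook>> hooks_;  // pc -> debug code
    public:
        // Debug code compiled into the side executable registers itself here.
        void addHook(std::uint64_t pc, DebugHook h) { hooks_[pc].push_back(std::move(h)); }

        // Called as replay reaches each instrumented location; hooks only observe
        // application state, so the replayed execution and the RnR log stay consistent.
        void onLocation(std::uint64_t pc, const std::uint8_t* appState) {
            auto it = hooks_.find(pc);
            if (it == hooks_.end()) return;
            for (auto& h : it->second) h(appState);
        }
    };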
International Symposium on Circuits and Systems | 2007
Ali Shahabi; Nima Honarmand; Zainalabedin Navabi
The decreasing manufacturing yield of integrated circuits, a result of rising complexity and shrinking feature sizes, together with the emergence of NoC-based design techniques, has necessitated network reconfiguration techniques for reusing NoCs with faulty communication hardware. In this paper, we propose a method to cope with the problem of faulty communication links in NoCs with a torus topology. The method is based on the use of programmable routing tables in the network switches. We investigate this technique for a conventional routing mechanism and for our optimized routing mechanism. The conventional mechanism uses one entry in its routing table for every destination address, while our proposed routing mechanism uses a fixed number of entries per table and routes by comparing the address of the current switch with that of the destination switch.
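A hedged C++ sketch of the routing idea: a switch normally forwards by comparing its own address with the destination's, and a small fixed-size programmable table overrides that choice for destinations affected by a faulty link. The table layout, port names, and table size are illustrative assumptions.

    #include <array>
    #include <optional>

    enum class Port { North, South, East, West, Local };

    struct TableEntry { int destX, destY; Port port; };

    struct Switch {
        int x, y;                                        // this switch's address
        std::array<std::optional<TableEntry>, 4> table;  // fixed number of entries

        Port route(int destX, int destY) const {
            for (const auto& e : table)                  // reconfigured exceptions first
                if (e && e->destX == destX && e->destY == destY) return e->port;
            if (destX != x) return destX > x ? Port::East : Port::West;   // default route
            if (destY != y) return destY > y ? Port::North : Port::South; // by comparison
            return Port::Local;                          // packet has arrived
        }
    };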
Architectural Support for Programming Languages and Operating Systems | 2015
Yuelu Duan; Nima Honarmand; Josep Torrellas
There have been several recent efforts to improve the performance of fences. The most aggressive designs allow post-fence accesses to retire and complete before the fence completes. Unfortunately, such designs present implementation difficulties due to their reliance on global state and structures. This paper's goal is to optimize both the performance and the implementability of fences. We start off with a design like the most aggressive ones but without the global state. We call it Weak Fence or wF. Since the concurrent execution of multiple wFs can deadlock, we combine wFs with a conventional fence (i.e., Strong Fence or sF) for the less performance-critical thread(s). We call the result an Asymmetric fence group. We also propose a taxonomy of Asymmetric fence groups under TSO. Compared to past aggressive fences, Asymmetric fence groups are both substantially easier to implement and higher-performing on average. The two main designs presented (WS+ and W+) speed up workloads under TSO by an average of 13% and 21%, respectively, over conventional fences.
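A hedged C++ sketch of where an Asymmetric fence group would sit: a Dekker-style pattern in which one thread is performance-critical. weak_fence() and strong_fence() are hypothetical stand-ins for the paper's wF and sF; both are modeled here with an ordinary sequentially consistent fence so the sketch remains correct on stock hardware.

    #include <atomic>

    std::atomic<int> flagA{0}, flagB{0};

    inline void weak_fence()   { std::atomic_thread_fence(std::memory_order_seq_cst); }  // wF stand-in
    inline void strong_fence() { std::atomic_thread_fence(std::memory_order_seq_cst); }  // sF stand-in

    // Performance-critical thread: this is where the weak fence would go, letting
    // post-fence accesses retire before the fence completes, without global state.
    bool fastThread() {
        flagA.store(1, std::memory_order_relaxed);
        weak_fence();                                   // wF
        return flagB.load(std::memory_order_relaxed) == 0;
    }

    // Less critical thread: pairs the weak fence with a conventional strong fence,
    // which is what prevents two wFs from deadlocking against each other.
    bool slowThread() {
        flagB.store(1, std::memory_order_relaxed);
        strong_fence();                                 // sF
        return flagA.load(std::memory_order_relaxed) == 0;
    }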
Asia Pacific Conference on Circuits and Systems | 2006
Nima Honarmand; Ali Afzali-Kusha
A data-driven approach to the design and optimization of low-power combinational multipliers is presented. The technique relies on signal gating to avoid unnecessary computations and thus reduce the switching activity and power consumption of combinational multipliers. The proposed technique applies equally well to signed and unsigned multiplications, while imposing reasonable area and delay overhead on the circuit. The benchmark data is extracted from typical DSP applications to show the efficiency of the proposed technique in the domain of DSP computations, in which low-power computing is of rapidly increasing importance. The results show an average 26% reduction in switching activity, with 22% area and 27% delay overhead, compared to combinational multipliers without this technique.
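A hedged software model of the signal-gating idea: when the dynamic value of an operand is small, the partial-product rows above its leading one contribute nothing and, in hardware, would be gated off to cut switching activity. The 16x16 width and the sample data are illustrative assumptions, not the paper's circuit.

    #include <cstdint>
    #include <cstdio>

    std::uint32_t gatedMultiply(std::uint16_t a, std::uint16_t b) {
        std::uint32_t result = 0;
        for (int i = 0; i < 16; ++i) {
            if (b >> i == 0) break;        // remaining partial-product rows gated off
            if (b & (1u << i))
                result += static_cast<std::uint32_t>(a) << i;   // active partial product
        }
        return result;
    }

    int main() {
        // Typical DSP data keeps many operand values small, so most rows stay gated.
        std::printf("%u\n", gatedMultiply(1234, 7));   // only the low three rows evaluated
        return 0;
    }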