Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Omer Khan is active.

Publication


Featured research published by Omer Khan.


International Symposium on Performance Analysis of Systems and Software | 2011

Scalable, accurate multicore simulation in the 1000-core era

Mieszko Lis; Pengju Ren; Myong Hyon Cho; Keun Sup Shim; Christopher W. Fletcher; Omer Khan; Srinivas Devadas

We present HORNET, a parallel, highly configurable, cycle-level multicore simulator based on an ingress-queued wormhole router NoC architecture. The parallel simulation engine offers cycle-accurate as well as periodic synchronization; while preserving functional accuracy, this permits tradeoffs between perfect timing accuracy and high speed with very good accuracy. When run on 6 separate physical cores on a single die, speedups can exceed a factor of 5, and when run on a two-die 12-core system with 2-way hyperthreading, speedups exceed 11×. Most hardware parameters are configurable, including memory hierarchy, interconnect geometry, bandwidth, crossbar dimensions, and parameters driving power and thermal effects. A highly parametrized table-based NoC design allows a variety of routing and virtual channel allocation algorithms out of the box, ranging from simple DOR routing to complex Valiant, ROMM, or PROM schemes, BSOR, and adaptive routing. HORNET can run in network-only mode using synthetic traffic or traces, directly emulate a MIPS-based multicore, or function as the memory subsystem for native applications executed under the Pin instrumentation tool. HORNET is freely available under the open-source MIT license at http://csg.csail.mit.edu/hornet/.
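The abstract's routing options start from simple dimension-ordered routing (DOR). As a point of reference only, the sketch below shows what XY dimension-ordered routing on a 2D mesh looks like; it is not taken from HORNET's sources, and the mesh size, function names, and hop-by-hop driver are illustrative assumptions.

// Illustrative sketch only: XY dimension-ordered routing (DOR) on a 2D mesh,
// the simplest of the routing policies the abstract mentions. Names and the
// 8x8 mesh size are hypothetical and not taken from HORNET's sources.
#include <cstdio>

enum class Port { Local, East, West, North, South };

// Route one hop: correct the X coordinate first, then Y (deadlock-free on a mesh).
Port xy_route(int cur_x, int cur_y, int dst_x, int dst_y) {
    if (dst_x > cur_x) return Port::East;
    if (dst_x < cur_x) return Port::West;
    if (dst_y > cur_y) return Port::North;
    if (dst_y < cur_y) return Port::South;
    return Port::Local;  // arrived
}

int main() {
    // Walk a flit from node (1,1) to node (6,3) on an 8x8 mesh and print each hop.
    int x = 1, y = 1;
    const int dx = 6, dy = 3;
    const char* names[] = {"Local", "East", "West", "North", "South"};
    for (Port p = xy_route(x, y, dx, dy); p != Port::Local; p = xy_route(x, y, dx, dy)) {
        std::printf("at (%d,%d) -> %s\n", x, y, names[static_cast<int>(p)]);
        if (p == Port::East) ++x; else if (p == Port::West) --x;
        else if (p == Port::North) ++y; else --y;
    }
    std::printf("delivered at (%d,%d)\n", x, y);
    return 0;
}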


Design, Automation, and Test in Europe | 2009

A self-adaptive system architecture to address transistor aging

Omer Khan; Sandip Kundu

As semiconductor manufacturing enters the advanced nanometer design paradigm, aging and device wear-out related degradation is becoming a major concern. Negative Bias Temperature Instability (NBTI) is one of the main sources of device lifetime degradation. The severity of such degradation depends on the operation history of a chip in the field, including such characteristics as temperature and workloads. In this paper, we propose a system level reliability management scheme where a chip dynamically adjusts its own operating frequency and supply voltage over time as the device ages. Major benefits of the proposed approach are (i) increased performance due to reduced frequency guard banding in the factory and (ii) continuous field adjustments that take environmental operating conditions such as actual room temperature and the power supply tolerance into account. The greatest challenge in implementing such a scheme is to perform calibration without a tester. Much of this work is performed by hypervisor-like software with very little hardware assistance. This keeps both the hardware overhead and the system complexity low. This paper describes the entire system architecture, including hardware and software components. Our simulation data indicates that under aggressive wear-out conditions, a scheduling interval of days or weeks is sufficient to reconfigure and keep the system operational, so the runtime overhead of such adjustments is negligible.
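The core idea, adjusting frequency in the field as the critical path slows, can be illustrated with a toy calibration loop. This is a minimal sketch under assumed numbers, not the paper's mechanism: the delay model, frequency table, and margin are invented, and the measurement stand-in (measure_critical_path_ns) is hypothetical.

// Minimal sketch (not the paper's implementation): a hypervisor-like routine that
// periodically re-picks the highest operating frequency whose clock period still
// covers the measured critical-path delay plus a small margin. All numbers and
// names below are illustrative assumptions.
#include <cstdio>
#include <vector>

// Stand-in for an on-chip delay measurement; here we just model a slowly aging path.
double measure_critical_path_ns(int months_in_field) {
    return 0.95 + 0.004 * months_in_field;  // NBTI-style gradual slowdown
}

int pick_frequency_mhz(double path_ns, const std::vector<int>& table_mhz, double margin_ns) {
    for (int f : table_mhz) {                      // table sorted fastest-first
        double period_ns = 1000.0 / f;
        if (period_ns >= path_ns + margin_ns) return f;
    }
    return table_mhz.back();                       // fall back to the slowest step
}

int main() {
    const std::vector<int> table_mhz = {1000, 950, 900, 850, 800};
    for (int month = 0; month <= 36; month += 6) {
        double d = measure_critical_path_ns(month);
        std::printf("month %2d: path %.3f ns -> %d MHz\n",
                    month, d, pick_frequency_mhz(d, table_mhz, 0.02));
    }
    return 0;
}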


IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems | 2012

HORNET: A Cycle-Level Multicore Simulator

Pengju Ren; Mieszko Lis; Myong Hyon Cho; Keun Sup Shim; Christopher W. Fletcher; Omer Khan; Nanning Zheng; Srinivas Devadas

We present HORNET, a parallel, highly configurable, cycle-level multicore simulator based on an ingress-queued wormhole router network-on-chip (NoC) architecture. The parallel simulation engine offers cycle-accurate as well as periodic synchronization; while preserving functional accuracy, this permits tradeoffs between perfect timing accuracy and high speed with very good accuracy. When run on six separate physical cores on a single die, speedups can exceed a factor of 5, and when run on a two-die 12-core system with 2-way hyperthreading, speedups exceed 12×. Most hardware parameters are configurable, including memory hierarchy, interconnect geometry, bandwidth, crossbar dimensions, and parameters driving power and thermal effects. A highly parametrized table-based NoC design allows a variety of routing and virtual channel allocation algorithms out of the box, ranging from simple dimension-ordered routing to complex Valiant, ROMM, O1Turn or PROM schemes, BSOR, and adaptive routing. HORNET can run in network-only mode using synthetic traffic or traces, or directly emulate a MIPS-based multicore. HORNET is freely available under the open-source MIT license at http://csg.csail.mit.edu/hornet/.


IEEE International Symposium on Workload Characterization | 2015

CRONO: A Benchmark Suite for Multithreaded Graph Algorithms Executing on Futuristic Multicores

Masab Ahmad; Farrukh Hijaz; Qingchuan Shi; Omer Khan

Algorithms operating on a graph setting are known to be highly irregular and unstructured. This leads to workload imbalance and data locality challenges when these algorithms are parallelized and executed on evolving multicore processors. Previous parallel benchmark suites for shared memory multicores have focused on various workload domains, such as scientific, graphics, vision, financial, and media processing. However, these suites lack graph applications that must be evaluated in the context of architectural design space exploration for futuristic multicores. This paper presents CRONO, a benchmark suite composed of multi-threaded graph algorithms for shared memory multicore processors. We analyze and characterize these benchmarks using a multicore simulator, as well as a real multicore machine setup. CRONO uses both synthetic and real world graphs. Our characterization shows that graph benchmarks are diverse and challenging in the context of scaling efficiency. They exhibit low locality due to unstructured memory access patterns, and incur fine-grain communication between threads. Energy overheads also occur due to nondeterministic memory and synchronization patterns on network connections. Our characterization reveals that these challenges remain in state-of-the-art graph algorithms, and in this context CRONO can be used to identify, analyze, and develop novel architectural methods to mitigate their efficiency bottlenecks in futuristic multicore processors.
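The locality and load-balance problems the abstract describes come from skewed degree distributions and unstructured accesses. The sketch below is not CRONO code; it uses an assumed synthetic degree distribution and thread count simply to show how an even vertex partition can still produce a badly imbalanced edge workload.

// Illustrative sketch of why graph kernels scale poorly: a synthetic graph with a
// skewed degree distribution is split evenly by vertex across threads, yet the
// edge work per thread ends up badly imbalanced. Graph generator, hub placement,
// and thread count are assumptions for the example.
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    const int n_vertices = 1 << 16, n_threads = 4;
    // Skewed degrees: a few "hub" vertices own most of the edges.
    std::vector<int> degree(n_vertices);
    for (int v = 0; v < n_vertices; ++v)
        degree[v] = (v < 64) ? 4096 : 4;

    std::vector<long long> edges_done(n_threads, 0);
    std::vector<std::thread> workers;
    for (int t = 0; t < n_threads; ++t) {
        workers.emplace_back([&, t] {
            int lo = t * (n_vertices / n_threads), hi = lo + n_vertices / n_threads;
            for (int v = lo; v < hi; ++v)
                edges_done[t] += degree[v];       // stand-in for per-edge work
        });
    }
    for (auto& w : workers) w.join();
    for (int t = 0; t < n_threads; ++t)
        std::printf("thread %d processed %lld edges\n", t, edges_done[t]);
    return 0;
}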


International Conference on Parallel Architectures and Compilation Techniques | 2011

Performance Per Watt Benefits of Dynamic Core Morphing in Asymmetric Multicores

Rance Rodrigues; Arunachalam Annamalai; Israel Koren; Sandip Kundu; Omer Khan

The trend toward multicore processors is moving the emphasis in computation from sequential to parallel processing. However, not all applications can be parallelized and benefit from multiple cores. Such applications lead to under-utilization of parallel resources and hence sub-optimal performance/watt. They may, however, benefit from powerful uniprocessors. On the other hand, not all applications can take advantage of more powerful uniprocessors. To address the competing requirements of diverse applications, we propose a heterogeneous multicore architecture with a Dynamic Core Morphing (DCM) capability. Depending on the computational demands of the currently executing applications, the resources of a few tightly coupled cores are morphed at runtime. We present a simple hardware-based algorithm to monitor the time-varying computational needs of the application and, when deemed beneficial, trigger reconfiguration of the cores at fine-grain time scales to maximize the performance/watt of the application. The proposed dynamic scheme is then compared against a baseline static heterogeneous multicore configuration and an equivalent homogeneous configuration. Our results show that dynamic morphing of cores can provide performance/watt gains of 43% and 16% on average, when compared to the homogeneous and baseline heterogeneous configurations, respectively.
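A rough sketch of the decision step only, not the paper's trigger logic: a monitor samples a few per-phase counters and asks whether morphing (lending one core's floating point resources to its neighbor) would likely pay off. All counter names and thresholds are invented for illustration.

// Hedged sketch of the morph-decision step: sample a few per-phase counters and
// decide whether lending one core's FP resources to its partner is likely to raise
// performance/watt. Thresholds and counter names are illustrative assumptions.
#include <cstdio>

struct PhaseSample {
    double fp_ops_per_kilo_instr;   // FP intensity of the phase
    double ipc;                     // committed instructions per cycle
};

// Morph when one core is FP-starved and its partner does almost no FP work.
bool should_morph(const PhaseSample& a, const PhaseSample& b) {
    const double fp_hungry = 200.0, fp_idle = 10.0, low_ipc = 0.8;
    return a.fp_ops_per_kilo_instr > fp_hungry &&
           a.ipc < low_ipc &&
           b.fp_ops_per_kilo_instr < fp_idle;
}

int main() {
    PhaseSample fp_phase{350.0, 0.6};   // core A: FP-bound, stalling
    PhaseSample int_phase{2.0, 1.4};    // core B: integer-dominated
    std::printf("morph? %s\n", should_morph(fp_phase, int_phase) ? "yes" : "no");
    return 0;
}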


International Symposium on Computer Architecture | 2013

The locality-aware adaptive cache coherence protocol

George Kurian; Omer Khan; Srinivas Devadas

Next generation multicore applications will process massive amounts of data with significant sharing. Data movement and management impact memory access latency and consume power. Therefore, harnessing data locality is of fundamental importance in future processors. We propose a scalable, efficient shared memory cache coherence protocol that enables seamless adaptation between private and logically shared caching of on-chip data at the fine granularity of cache lines. Our data-centric approach relies on in-hardware yet low-overhead runtime profiling of the locality of each cache line and only allows private caching for data blocks with high spatio-temporal locality. This allows us to better exploit the private caches and enable low-latency, low-energy memory access, while retaining the convenience of shared memory. On a set of parallel benchmarks, our low-overhead locality-aware mechanisms reduce the overall energy by 25% and completion time by 15% in an NoC-based multicore with the Reactive-NUCA on-chip cache organization and the ACKwise limited directory-based coherence protocol.
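The classification idea (profile each cache line's reuse and allow private caching only above a threshold) can be sketched as follows. This is an illustrative model, not the protocol: the two-epoch counter scheme and the threshold value are assumptions.

// Minimal sketch of the per-line locality classification idea, not the protocol itself.
// A line is allowed into the private cache only if its reuse during its previous
// private residency met a threshold. Threshold and bookkeeping are assumptions.
#include <cstdio>
#include <unordered_map>

struct LineStats {
    int reuses_last_epoch = 0;   // hits observed during the previous residency
    int reuses_this_epoch = 0;
};

class LocalityClassifier {
public:
    explicit LocalityClassifier(int threshold) : threshold_(threshold) {}

    // Returns true if this access should be served from the private cache,
    // false if the line should stay logically shared (accessed at its home tile).
    bool on_access(long line_addr) {
        LineStats& s = table_[line_addr];
        ++s.reuses_this_epoch;
        return s.reuses_last_epoch >= threshold_;
    }

    // Called when the line is evicted or invalidated: roll the epoch over.
    void on_eviction(long line_addr) {
        LineStats& s = table_[line_addr];
        s.reuses_last_epoch = s.reuses_this_epoch;
        s.reuses_this_epoch = 0;
    }

private:
    int threshold_;
    std::unordered_map<long, LineStats> table_;
};

int main() {
    LocalityClassifier c(4);
    long hot = 0x1000, cold = 0x2000;
    for (int i = 0; i < 8; ++i) c.on_access(hot);   // high reuse
    c.on_access(cold);                               // touched once
    c.on_eviction(hot);
    c.on_eviction(cold);
    std::printf("hot line cached privately next time? %s\n", c.on_access(hot) ? "yes" : "no");
    std::printf("cold line cached privately next time? %s\n", c.on_access(cold) ? "yes" : "no");
    return 0;
}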


IEEE Transactions on Computers | 2010

Thread Relocation: A Runtime Architecture for Tolerating Hard Errors in Chip Multiprocessors

Omer Khan; Sandip Kundu

As the semiconductor industry continues its relentless push for nano-CMOS technologies, device reliability and the occurrence of hard errors have emerged as a dominant concern in multicores. Although regular memory structures are protected against hard errors using error correcting codes or spare rows and columns, many of the structures within the cores are left unprotected. Even if the location of hard errors is known a priori, disabling faulty cores results in a substantial performance loss. Several proposed techniques use microarchitectural redundancy to allow defective cores to continue operation. These techniques are attractive, but limited either by the added cost of redundancy that offers no benefit to an error-free core, or by limited coverage when relying only on the natural redundancy offered by the microarchitecture. We propose to exploit the intercore redundancy in chip multiprocessors for hard-error tolerance. Our scheme combines hardware reconfiguration, to ensure reduced functionality of cores, with a runtime layer of software (microvisor) that manages the mapping of threads to cores. The microvisor observes the changing phase behavior of threads and initiates thread relocation to match the computational demands of threads to the capabilities of cores. Our results show that in the presence of degraded cores, the microvisor mitigates performance losses by an average of two percent.
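To make the mapping step concrete, here is a hedged sketch of a microvisor-like assignment pass: threads with heavy floating point demand are steered away from cores whose FPU has failed. The capability and demand vectors, and the greedy policy, are assumptions for illustration and not the paper's algorithm.

// Sketch of the mapping step only: given which functional units each core still has
// and how heavily each thread uses them, assign the most demanding thread to the
// least degraded compatible core. Vectors and policy are illustrative assumptions.
#include <algorithm>
#include <cstdio>
#include <vector>

struct Core   { int id; bool fpu_ok; bool mul_ok; };
struct Thread { int id; double fp_fraction; };   // share of instructions needing the FPU

int main() {
    std::vector<Core> cores = {{0, true, true}, {1, false, true}, {2, true, false}};
    std::vector<Thread> threads = {{0, 0.40}, {1, 0.02}, {2, 0.15}};

    // Most FP-hungry threads get first pick of the cores with a working FPU.
    std::sort(threads.begin(), threads.end(),
              [](const Thread& a, const Thread& b) { return a.fp_fraction > b.fp_fraction; });

    std::vector<bool> taken(cores.size(), false);
    for (const Thread& t : threads) {
        int pick = -1;
        for (size_t c = 0; c < cores.size(); ++c) {
            if (taken[c]) continue;
            if (t.fp_fraction > 0.05 && !cores[c].fpu_ok) continue;  // avoid degraded core
            pick = static_cast<int>(c);
            break;
        }
        if (pick < 0) {  // no fully capable core left; fall back to any free core
            for (size_t c = 0; c < cores.size(); ++c)
                if (!taken[c]) { pick = static_cast<int>(c); break; }
        }
        taken[pick] = true;
        std::printf("thread %d -> core %d\n", t.id, cores[pick].id);
    }
    return 0;
}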


Great Lakes Symposium on VLSI | 2010

A self-adaptive scheduler for asymmetric multi-cores

Omer Khan; Sandip Kundu

Asymmetric chip multiprocessors are imminent in the multi-core era, primarily due to their potential for power-performance efficiency. In order for software to fully realize this potential, the scheduling of threads to cores must be automated to adapt to the changing program behavior. However, strict system abstraction layers limit the controllability and observability of low level hardware details, thereby limiting state-of-the-art systems to manual or static mapping of threads to cores in an asymmetric multi-core. In this paper, we propose a self-adaptive scheduler that exploits program behavior at runtime by matching the computational demands of threads to the capabilities of cores. We present a novel empirical model to predict the selection of an appropriate core (based on optimizing throughput, power, or performance per watt) for changing program phases within threads. Thread migration is initiated when an optimal mapping of threads to cores is predicted. Results show that our predictive schedulers for the three target optimizations are within 10% of the ideal scheduler.
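As a sketch of the selection step, the code below stands in a simple linear predictor for the paper's empirical model: it estimates performance per watt for each core type from a phase's counters and picks the better core. All coefficients, counters, and core types are invented.

// Hedged sketch of core selection on an asymmetric multicore. The linear model is a
// stand-in for the paper's empirical model; every number here is an assumption.
#include <cstdio>

struct Phase { double ipc_small; double mem_intensity; };   // counters sampled at runtime

struct CoreType {
    const char* name;
    double perf_scale;    // predicted speedup over the small core
    double power_watts;   // predicted power at that operating point
};

double predicted_perf_per_watt(const Phase& p, const CoreType& c) {
    // Memory-bound phases gain little from a bigger core, so damp the speedup.
    double effective = 1.0 + (c.perf_scale - 1.0) * (1.0 - p.mem_intensity);
    return (p.ipc_small * effective) / c.power_watts;
}

int main() {
    const CoreType small_core{"small", 1.0, 1.5}, big_core{"big", 1.8, 2.5};
    Phase compute_bound{1.2, 0.1}, memory_bound{0.5, 0.8};

    for (const Phase& p : {compute_bound, memory_bound}) {
        double s = predicted_perf_per_watt(p, small_core);
        double b = predicted_perf_per_watt(p, big_core);
        std::printf("phase(ipc=%.1f, mem=%.1f): run on %s core\n",
                    p.ipc_small, p.mem_intensity, b > s ? big_core.name : small_core.name);
    }
    return 0;
}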


High-Performance Computer Architecture | 2014

Locality-aware data replication in the Last-Level Cache

George Kurian; Srinivas Devadas; Omer Khan

Next generation multicores will process massive data with varying degrees of locality. Harnessing on-chip data locality to optimize the utilization of cache and network resources is of fundamental importance. We propose a locality-aware selective data replication protocol for the last-level cache (LLC). Our goal is to lower memory access latency and energy by replicating only high locality cache lines in the LLC slice of the requesting core, while simultaneously keeping the off-chip miss rate low. Our approach relies on low-overhead yet highly accurate in-hardware runtime classification of data locality at the cache line granularity, and only allows replication for cache lines with high reuse. Furthermore, our classifier captures the LLC pressure at the existing replica locations and adapts its replication decision accordingly. The locality tracking mechanism is decoupled from the sharer tracking structures that cause scalability concerns in traditional coherence protocols. Moreover, the complexity of our protocol is low since no additional coherence states are created. On a set of parallel benchmarks, our protocol reduces the overall energy by 16%, 14%, 13%, and 21% and the completion time by 4%, 9%, 6%, and 13% when compared to the previously proposed Victim Replication, Adaptive Selective Replication, Reactive-NUCA, and Static-NUCA LLC management schemes.
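The replication decision can be sketched as a reuse threshold that tightens as the local LLC slice fills up, which is the pressure-adaptation idea in the abstract. The constants and the linear pressure model below are assumptions, not the protocol's actual policy.

// Sketch of the replication decision only: replicate a line in the requester's LLC
// slice when its observed reuse clears a threshold, and raise that threshold as the
// slice fills so low-value replicas do not evict useful data. Constants are invented.
#include <cstdio>

// occupancy: fraction of the local LLC slice already holding replicas (0..1).
bool should_replicate(int observed_reuse, double occupancy) {
    const int base_threshold = 3;
    int threshold = base_threshold + static_cast<int>(occupancy * 8.0);  // stricter under pressure
    return observed_reuse >= threshold;
}

int main() {
    std::printf("reuse=5, slice 10%% full -> %s\n", should_replicate(5, 0.1) ? "replicate" : "no");
    std::printf("reuse=5, slice 90%% full -> %s\n", should_replicate(5, 0.9) ? "replicate" : "no");
    return 0;
}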


Design, Automation, and Test in Europe | 2009

Improving yield and reliability of chip multiprocessors

Abhisek Pan; Omer Khan; Sandip Kundu

An increasing number of hardware failures can be attributed to device reliability problems that cause partial system failure or shutdown. In this paper, we propose a scheme for improving the reliability of a homogeneous chip multiprocessor (CMP) that also serves to improve manufacturing yield. Our solution centers on exploiting the natural redundancy that already exists in multi-core systems by using services from other cores for functional units that are defective in a faulty core. A micro-architectural modification allows a core on a CMP to use another core as a coprocessor to service any instruction that the former cannot execute correctly. This service is accessed to improve yield and reliability, but at the cost of some loss of performance. In order to quantify this loss, we have used a cycle-accurate simulator to simulate the performance of a dual-core system with one or two cores sustaining partial failure. Our results indicate that when a large and sparingly used unit such as a floating point arithmetic unit fails in a core, even for a floating point intensive benchmark, we can continue to run each faulty core with help from companion cores with as little as a 10% impact on performance and less than 1% area overhead.
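A toy illustration of the service idea, not the microarchitecture: an instruction that needs a unit marked faulty in the local core is executed by a companion core at an assumed latency penalty. Unit names and the penalty value are invented.

// Illustrative sketch of the coprocessor-service idea. When a required unit is
// marked faulty locally, the instruction is serviced by a companion core at an
// assumed latency penalty; unit names and the penalty are invented.
#include <cstdio>
#include <string>

struct Core {
    bool fpu_ok;
    bool divider_ok;
};

// Returns the extra latency (in cycles) paid to execute the instruction and
// reports where it ran; 0 means it ran locally as usual.
int execute(const Core& local, const std::string& unit) {
    const int remote_penalty = 40;   // assumed round-trip to the companion core
    bool ok = (unit == "fpu") ? local.fpu_ok
            : (unit == "div") ? local.divider_ok
            : true;                  // all other units assumed healthy
    std::printf("%s op -> %s\n", unit.c_str(), ok ? "local" : "companion core");
    return ok ? 0 : remote_penalty;
}

int main() {
    Core degraded{false, true};      // this core's FPU failed
    int extra = 0;
    extra += execute(degraded, "fpu");
    extra += execute(degraded, "int");
    extra += execute(degraded, "div");
    std::printf("extra cycles spent on remote service: %d\n", extra);
    return 0;
}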

Collaboration


Dive into Omer Khan's collaborations.

Top Co-Authors

Srinivas Devadas, Massachusetts Institute of Technology
Mieszko Lis, Massachusetts Institute of Technology
Sandip Kundu, University of Massachusetts Amherst
Farrukh Hijaz, University of Connecticut
Masab Ahmad, University of Connecticut
Qingchuan Shi, University of Connecticut
Keun Sup Shim, Massachusetts Institute of Technology
Myong Hyon Cho, Massachusetts Institute of Technology
George Kurian, Massachusetts Institute of Technology
Marten van Dijk, University of Connecticut