Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Farrukh Hijaz is active.

Publication


Featured research published by Farrukh Hijaz.


IEEE International Symposium on Workload Characterization | 2015

CRONO: A Benchmark Suite for Multithreaded Graph Algorithms Executing on Futuristic Multicores

Masab Ahmad; Farrukh Hijaz; Qingchuan Shi; Omer Khan

Algorithms operating on graphs are known to be highly irregular and unstructured. This leads to workload imbalance and data locality challenges when these algorithms are parallelized and executed on evolving multicore processors. Previous parallel benchmark suites for shared memory multicores have focused on various workload domains, such as scientific, graphics, vision, financial, and media processing. However, these suites lack graph applications, which must be evaluated in the context of architectural design space exploration for futuristic multicores. This paper presents CRONO, a benchmark suite composed of multithreaded graph algorithms for shared memory multicore processors. We analyze and characterize these benchmarks using both a multicore simulator and a real multicore machine. CRONO uses both synthetic and real-world graphs. Our characterization shows that graph benchmarks are diverse and challenging in the context of scaling efficiency: they exhibit low locality due to unstructured memory access patterns and incur fine-grain communication between threads. Energy overheads also arise from nondeterministic memory and synchronization patterns on the on-chip network. These challenges persist in state-of-the-art graph algorithms, and in this context CRONO can be used to identify, analyze, and develop novel architectural methods to mitigate their efficiency bottlenecks in futuristic multicore processors.
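
The irregularity described above comes from data-dependent neighbor accesses. A minimal C++ sketch of such a multithreaded graph kernel follows (a generic lock-free distance relaxation; the layout and names are illustrative and not taken from CRONO's code):

```cpp
// Minimal sketch of an irregular multithreaded graph kernel (illustrative;
// not CRONO's actual code). Neighbor accesses are data-dependent, so the
// memory access pattern is unstructured by construction.
#include <atomic>
#include <climits>
#include <vector>

struct Graph {                        // compressed sparse row layout
    std::vector<int> offsets;         // offsets[v]..offsets[v+1] bound v's edges
    std::vector<int> neighbors;       // flattened adjacency lists
};

// Each thread relaxes one slice of the vertex set; distances are updated
// with a lock-free compare-and-swap, producing fine-grain communication.
void relax_slice(const Graph& g, std::vector<std::atomic<int>>& dist,
                 int begin, int end) {
    for (int v = begin; v < end; ++v) {
        int dv = dist[v].load(std::memory_order_relaxed);
        if (dv == INT_MAX) continue;                  // unreached vertex
        for (int e = g.offsets[v]; e < g.offsets[v + 1]; ++e) {
            int u = g.neighbors[e];                   // irregular access
            int old = dist[u].load(std::memory_order_relaxed);
            while (dv + 1 < old &&
                   !dist[u].compare_exchange_weak(old, dv + 1)) {}
        }
    }
}
```

Splitting vertices across threads this way is also where the workload imbalance appears: high-degree vertices make some slices far more expensive than others.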


International Conference on Computer Design | 2013

A private level-1 cache architecture to exploit the latency and capacity tradeoffs in multicores operating at near-threshold voltages

Farrukh Hijaz; Qingchuan Shi; Omer Khan

Near-threshold voltage (NTV) operation is expected to enable up to 10× energy efficiency for future processors. However, reliable operation below a minimum voltage (Vccmin) cannot be guaranteed. Specifically, SRAM bit-cell error rates are expected to rise steeply since their margins can easily be violated at near-threshold voltages. Multicore processors rely on fast private L1 caches to exploit data locality and achieve high performance. In the presence of high bit-cell error rates, an L1 cache can either sacrifice capacity or incur additional latency to correct the errors. We observe that L1 cache sensitivity to hit latency offers a design tradeoff between capacity and latency. When the error rate is high at extreme Vccmin, it is worthwhile to incur additional latency to recover and utilize the additional L1 cache capacity. However, at low error rates, the additional constant latency to recover cache capacity degrades performance. With this tradeoff in mind, we propose a novel private L1 cache architecture that dynamically learns and adapts by either recovering cache capacity at the cost of additional latency, or operating at lower capacity while benefiting from optimal hit latency. Using simulations of a 64-core multicore, we demonstrate that our adaptive L1 cache architecture performs better than both individual schemes at low and high error rates (i.e., various NTV conditions).
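
To make the trade-off concrete, the following C++ sketch shows one way an epoch-based controller could pick between the two operating points; the counters, latencies, and cost model are assumptions, not the paper's actual mechanism:

```cpp
// Illustrative epoch-based controller for the capacity-vs-latency tradeoff
// (counters, latencies, and cost model are assumptions, not the paper's
// exact logic). One mode recovers faulty-line capacity at extra hit latency;
// the other disables faulty lines for fast hits at reduced capacity.
#include <cstdint>

enum class L1Mode { RecoverCapacity, DisableFaultyLines };

struct AdaptiveL1 {
    L1Mode   mode = L1Mode::DisableFaultyLines;
    uint64_t hits = 0, misses = 0;

    void record(bool hit) { hit ? ++hits : ++misses; }

    // Estimated cycles spent on L1 accesses during the epoch: recovering
    // capacity adds extraLat to every hit but should lower the miss count.
    uint64_t epochCost(uint64_t hitLat, uint64_t extraLat,
                       uint64_t missLat) const {
        uint64_t perHit = hitLat +
            (mode == L1Mode::RecoverCapacity ? extraLat : 0);
        return hits * perHit + misses * missLat;
    }

    // Alternate sampling epochs between modes; commit to whichever mode's
    // last sampled cost was lower, then reset counters for the next epoch.
    void endEpoch(uint64_t myCost, uint64_t otherModeLastCost) {
        if (otherModeLastCost < myCost)
            mode = (mode == L1Mode::RecoverCapacity)
                       ? L1Mode::DisableFaultyLines
                       : L1Mode::RecoverCapacity;
        hits = misses = 0;
    }
};
```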


International Conference on Computer Design | 2011

ARCc: A case for an architecturally redundant cache-coherence architecture for large multicores

Omer Khan; Henry Hoffmann; Mieszko Lis; Farrukh Hijaz; Anant Agarwal; Srinivas Devadas

This paper proposes an architecturally redundant cache-coherence architecture (ARCc) that combines directory-based and shared-NUCA-based coherence protocols to improve performance, energy, and dependability. Both coherence mechanisms co-exist in the hardware, and ARCc enables seamless transitions between the two protocols. We present an online analytical model, implemented in hardware, that predicts performance and triggers a transition between the two coherence protocols at application-level granularity. The ARCc architecture delivers up to 1.6× higher performance and up to 1.5× lower energy consumption compared to the directory-based counterpart. It does so by identifying applications that benefit from the large shared cache capacity of shared-NUCA because of fewer off-chip accesses, or for which remote-cache word accesses are efficient.
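
A toy version of the protocol-selection decision might look like the sketch below; the cost terms and the assumed off-chip reduction factor are illustrative stand-ins for the paper's online analytical model:

```cpp
// Toy stand-in for ARCc-style protocol selection (the cost terms and the
// assumed off-chip reduction factor are illustrative, not the paper's
// analytical model). Shared-NUCA wins when the larger effective cache
// capacity or cheap remote-word accesses outweigh directory coherence.
struct EpochStats {
    double offChipMissesPerAccess;   // measured under the current protocol
    double remoteWordFraction;       // accesses that would be remote words
};

enum class Protocol { Directory, SharedNUCA };

Protocol choose(const EpochStats& s, double offChipPenalty,
                double remoteWordCost, double directoryCostPerAccess,
                double nucaOffChipReduction /* assumed, e.g. 0.5 */) {
    // Estimated per-access cost under shared-NUCA: fewer off-chip misses,
    // but remote cache-word accesses pay network latency.
    double nucaCost =
        s.offChipMissesPerAccess * nucaOffChipReduction * offChipPenalty +
        s.remoteWordFraction * remoteWordCost;
    return nucaCost < directoryCostPerAccess ? Protocol::SharedNUCA
                                             : Protocol::Directory;
}
```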


International Conference on Networking, Architecture and Storage | 2015

Efficient parallel packet processing using a shared memory many-core processor with hardware support to accelerate communication

Farrukh Hijaz; Brian Kahne; Peter J. Wilson; Omer Khan

Software IP forwarding routers provide flexibility, programmability, and extensibility, while enabling fast deployment. The key question is whether they can keep up with the efficiency of their special-purpose hardware counterparts. Shared memory stands out as a sine qua non for parallel programming of many commercial multicore processors, so it is the paradigm of choice for implementing software routers. For efficiency, shared memory is often implemented with hardware support for cache coherence and data consistency among the cores. Although this enables efficient data access in many common-case scenarios, communication between cores using shared memory synchronization primitives often limits scalability. In this paper we perform a thorough characterization of a multithreaded packet processing application to quantify the opportunities for exploiting concurrency, as well as to identify scalability bottlenecks in futuristic shared memory multicores. We propose to retain the shared memory model but introduce a set of lightweight in-hardware explicit messaging send/receive instructions in the instruction set architecture (ISA). These instructions are used to mitigate the overheads of multi-party communication in shared memory protocols. Using simulations of a 64-core multicore, we find that the scalability of parallel packet processing is limited by the packet-ordering requirement, which leads to expensive implicit communication under shared memory. With explicit messaging support in the ISA, the communication bottleneck is mitigated, and the application scales to 30× at 64 cores.
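
The send/receive instructions can be approximated in software with per-core mailboxes. The C++ model below is such an approximation (the msg_send/msg_recv names and the reorder-stage usage are assumptions; in the paper these are single hardware-backed ISA instructions):

```cpp
// Software model of lightweight explicit messaging between cores (the
// msg_send/msg_recv names and mailbox structure are illustrative; the
// paper implements these as hardware-backed ISA instructions).
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <queue>

struct Mailbox {                       // stand-in for a per-core HW mailbox
    std::mutex m;
    std::condition_variable cv;
    std::queue<uint64_t> q;
};

void msg_send(Mailbox& dst, uint64_t payload) {
    { std::lock_guard<std::mutex> g(dst.m); dst.q.push(payload); }
    dst.cv.notify_one();
}

uint64_t msg_recv(Mailbox& self) {     // blocking receive
    std::unique_lock<std::mutex> g(self.m);
    self.cv.wait(g, [&] { return !self.q.empty(); });
    uint64_t v = self.q.front();
    self.q.pop();
    return v;
}
// Ordering stage: worker cores process packets in parallel and msg_send each
// packet's sequence number to one reorder core, which receives them and
// emits packets in order without shared-memory spinning.
```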


Hardware and Architectural Support for Security and Privacy | 2015

Exploring the performance implications of memory safety primitives in many-core processors executing multi-threaded workloads

Masab Ahmad; Syed Kamran Haider; Farrukh Hijaz; Marten van Dijk; Omer Khan

Security is a vital consideration for today's processor architectures, at both the software and hardware layers. However, security schemes are known to incur significant performance overheads. For example, buffer overflow protection schemes perform software checks for bounds on program data structures, and incur performance overheads of up to several orders of magnitude. To mitigate these overheads, prior works focus on either changing the security scheme itself, or selectively applying the security scheme to minimize program vulnerabilities. Most of these works also focus primarily on single-core processors, with no prior work done in the context of multicore processors. In this paper, we show how increasing thread counts can help hide the latency overheads of security schemes. We also analyze the architectural implications in the context of multicores, and the insights and challenges associated with applying these security schemes to multithreaded workloads.
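
For reference, a software bounds check of the kind whose overhead is being measured looks roughly like this (the checked-pointer layout is illustrative, not any specific scheme's metadata format):

```cpp
// Minimal sketch of a per-access software bounds check (illustrative
// checked-pointer layout; not any specific scheme's metadata format).
#include <cstddef>
#include <cstdio>
#include <cstdlib>

struct CheckedBuf {
    int*   base;   // underlying allocation
    size_t len;    // element count, carried alongside the pointer
};

inline int checked_load(const CheckedBuf& b, size_t i) {
    if (i >= b.len) {                 // the per-access software check
        std::fprintf(stderr, "bounds violation at index %zu\n", i);
        std::abort();
    }
    return b.base[i];
}
// With many threads, the check latency on one core overlaps with useful
// work on the others, which is how higher thread counts hide the overhead.
```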


ACM Transactions on Architecture and Code Optimization | 2014

NUCA-L1: A Non-Uniform Access Latency Level-1 Cache Architecture for Multicores Operating at Near-Threshold Voltages

Farrukh Hijaz; Omer Khan

Research has shown that operating in the near-threshold region is expected to provide up to 10× energy efficiency for future processors. However, reliable operation below a minimum voltage (Vccmin) cannot be guaranteed due to process variations. Because SRAM margins can easily be violated at near-threshold voltages, their bit-cell failure rates are expected to rise steeply. Multicore processors rely on fast private L1 caches to exploit data locality and achieve high performance. In the presence of high bit-cell fault rates, an L1 cache traditionally either sacrifices capacity or incurs additional latency to correct the faults. We observe that L1 cache sensitivity to hit latency offers a design trade-off between capacity and latency. When the fault rate is high at extreme Vccmin, it is beneficial to recover L1 cache capacity, even if it comes at the cost of additional latency. However, at low fault rates, the additional constant latency to recover cache capacity degrades performance. With this trade-off in mind, we propose a Non-Uniform Cache Access L1 architecture (NUCA-L1) that avoids additional latency on accesses to fault-free cache lines. To mitigate the capacity bottleneck, it deploys a correction mechanism to recover capacity at the cost of additional latency. Using extensive simulations of a 64-core multicore, we demonstrate that at various bit-cell fault rates, our proposed private NUCA-L1 cache architecture outperforms state-of-the-art schemes while significantly reducing energy consumption.
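
The non-uniform hit path reduces to a simple rule: only accesses to faulty lines pay the correction latency. A minimal sketch, with illustrative structure and latencies:

```cpp
// Sketch of the non-uniform L1 hit path described above (the structure and
// latency parameters are illustrative; the correction mechanism itself is
// abstracted away).
#include <cstdint>

struct L1Line {
    bool faulty;    // marked once for the target NTV voltage (assumption:
                    // e.g. by a built-in self test at boot)
    // ... tag, data, and correction bits would live here ...
};

// Fault-free lines hit at the base latency; faulty lines pay the extra
// correction latency instead of being disabled, recovering capacity.
uint32_t hit_latency(const L1Line& line,
                     uint32_t baseLat, uint32_t correctionLat) {
    return line.faulty ? baseLat + correctionLat : baseLat;
}
```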


International Conference on Computer Design | 2013

Towards efficient dynamic data placement in NoC-based multicores

Qingchuan Shi; Farrukh Hijaz; Omer Khan

Next-generation multicores will process massive data with significant sharing. Since future processors will also be inherently limited by off-chip bandwidth, on-chip data management is emerging as a first-order design constraint. On-chip memory latency increases as more cores are added, since the diameter of most on-chip networks grows with the number of cores. We observe that a large fraction of on-chip traffic originates from communication between the cores to maintain cache coherence. Motivated by these observations, we propose a novel on-chip data placement mechanism that optimizes shared data placement by minimizing the distance of data from the requesting cores (improving locality), while balancing network contention and the utilization of per-core cache capacity. Using simulations of a 64-core multicore, we show that our proposal outperforms state-of-the-art static and dynamic data placement mechanisms by an average of 5.5% and 8.5%, respectively.
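
A simplified placement cost in the spirit of the abstract might combine distance and capacity pressure as below; the terms and weights are assumptions, not the paper's exact policy:

```cpp
// Simplified placement cost for choosing a home core for a shared block
// (the terms and weight are assumptions, not the paper's exact policy).
#include <cstdlib>
#include <vector>

struct Core { int x, y; };

int hops(Core a, Core b) {                 // Manhattan distance on a mesh
    return std::abs(a.x - b.x) + std::abs(a.y - b.y);
}

// Score a candidate home: sum of distances to the cores that access the
// block, penalized by how full that core's cache slice already is.
double placement_cost(Core home, const std::vector<Core>& sharers,
                      double sliceUtilization, double capacityWeight) {
    double d = 0;
    for (const Core& s : sharers) d += hops(home, s);
    return d + capacityWeight * sliceUtilization;
}
```

The candidate slice with the lowest cost would be chosen as the block's home, trading pure locality against load on any one slice.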


International Symposium on Microarchitecture | 2012

Low-Latency Mechanisms for Near-Threshold Operation of Private Caches in Shared Memory Multicores

Farrukh Hijaz; Qingchuan Shi; Omer Khan

Near-threshold voltage operation is widely acknowledged as a potential mechanism to achieve an order-of-magnitude reduction in energy consumption in future processors. However, processors cannot operate reliably below a minimum voltage, Vccmin, since hardware components may fail. SRAM bitcell failures in memory structures, such as caches, typically determine the Vccmin for a processor. Although the last-level shared caches (LLC) in modern multicores are protected using error correcting codes (ECC), the private caches have been left unprotected due to their performance sensitivity to the latency overhead of ECC. This limits the operation of the processor at near-threshold voltages. In this paper, we propose mechanisms for near-threshold operation of private caches that do not require ECC support. First, we present a fine-grain mechanism to disable private cache lines with bitcell failures at the target near-threshold voltage. Second, we propose two mechanisms to better manage the capacity-stressed private caches: (1) we utilize OS-level classification of private and shared data and evaluate a data placement mechanism that dynamically relocates private data blocks to the LLC slice that is physically co-located with the requesting core; and (2) we propose an in-hardware yet low-overhead runtime profiling of the locality of each cache line classified as private data, and only allow such data to be cached in the private caches if it shows high spatio-temporal locality. These mechanisms allow the private caches to rely on the local LLC slice to cache low-locality private data efficiently, freeing space to hold the more frequently used private data (as well as the shared data). We show that combining cache line disabling with efficient cache management of private data performs better (in terms of application completion times) than using a single-error-correction double-error-detection (SECDED) ECC mechanism and/or cache line disabling.
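
The second private-data mechanism amounts to a per-line admission filter. A minimal sketch, assuming a small saturating reuse counter and threshold (both illustrative):

```cpp
// Sketch of a per-line locality filter for private data (counter width and
// threshold are illustrative assumptions, not the paper's exact profiler).
// Lines with low profiled reuse are kept out of the capacity-stressed L1
// and served from the core's local LLC slice instead.
#include <cstdint>

struct LocalityProfile {
    uint8_t reuse = 0;                 // 4-bit saturating reuse counter

    void touch(bool reusedWhileResident) {
        if (reusedWhileResident) { if (reuse < 15) ++reuse; }
        else                     { if (reuse > 0)  --reuse; }
    }

    // Admit into the private L1 only if profiled locality is high enough.
    bool admitToL1(uint8_t threshold = 8) const { return reuse >= threshold; }
};
```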


International Conference on Computer Design | 2015

M-MAP: Multi-factor memory authentication for secure embedded processors

Syed Kamran Haider; Masab Ahmad; Farrukh Hijaz; Astha Patni; Ethan Johnson; Matthew Seita; Omer Khan; Marten van Dijk

The challenges faced in securing embedded computing systems against multifaceted memory safety vulnerabilities have prompted great interest in the development of memory safety countermeasures. These countermeasures either provide protection only against their corresponding type of vulnerability, or incur substantial architectural modifications and overheads in order to provide complete safety, which makes them infeasible for embedded systems. In this paper, we propose M-MAP: a comprehensive system based on multi-factor memory authentication for complete memory safety. We examine certain crucial implications of composing memory integrity verification and bounds checking schemes in a comprehensive system. Based on these implications, we implement M-MAP with hardware-based memory integrity verification and software-based bounds checking to achieve a balance between hardware modifications and performance. We demonstrate that M-MAP implemented on top of a lightweight out-of-order processor delivers complete memory safety with only 32% performance overhead on average, while incurring minimal hardware modifications and area overhead.
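
Conceptually, the two factors compose on every checked access. A minimal sketch, with both factors abstracted behind hypothetical names (verify_integrity is a placeholder, not the paper's interface):

```cpp
// Sketch of two memory-safety factors composed on a load (names and
// interfaces are hypothetical; both checks are abstracted placeholders,
// not M-MAP's actual implementation).
#include <cstddef>
#include <cstdlib>

// Placeholder for the hardware integrity factor; always passes here,
// whereas real hardware would verify the accessed block against stored
// integrity metadata.
bool verify_integrity(const void*, std::size_t) { return true; }

int mmap_checked_load(const int* base, std::size_t len, std::size_t i) {
    if (i >= len) std::abort();                    // software factor: bounds
    if (!verify_integrity(base + i, sizeof(int)))  // hardware factor: integrity
        std::abort();
    return base[i];
}
```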


International Parallel and Distributed Processing Symposium | 2017

Accelerating Graph and Machine Learning Workloads Using a Shared Memory Multicore Architecture with Auxiliary Support for In-hardware Explicit Messaging

Halit Dogan; Farrukh Hijaz; Masab Ahmad; Brian Kahne; Peter J. Wilson; Omer Khan

Shared memory stands out as a sine qua non for parallel programming of many commercial and emerging multicore processors. It optimizes patterns of communication that benefit common programming styles. Now that parallel programming is mainstream, those common programming styles are challenged by emerging applications that communicate often and involve large amounts of data. Such applications include graph analytics and machine learning, and this paper focuses on these domains. We retain the shared memory model and introduce a set of lightweight in-hardware explicit messaging instructions in the instruction set architecture (ISA). A set of auxiliary communication models is proposed that utilizes explicit messages to accelerate synchronization primitives and efficiently move computation towards data. Results on a 256-core simulated multicore demonstrate that the proposed communication models improve performance and dynamic energy by an average of 4× and 42%, respectively, over traditional shared memory.
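
One of the auxiliary models, moving computation towards data, can be pictured as shipping an update to the core that owns the data. A self-contained, single-threaded C++ model follows (all names and the mailbox structure are illustrative assumptions):

```cpp
// Model of "moving computation towards data" via explicit messages: the
// requester ships a (vertex, delta) update to the owning core instead of
// pulling the data over the coherence protocol. The message plumbing is a
// trivial per-core queue; all names here are illustrative assumptions.
#include <cstdint>
#include <deque>
#include <vector>

struct RemoteUpdate { uint32_t vertex; int32_t delta; };

struct Core {
    std::vector<int32_t>     localData;  // vertices homed on this core
    std::deque<RemoteUpdate> inbox;      // stand-in for the hardware mailbox

    // Owner-side handler: apply updates in place. The data never migrates,
    // so no coherence traffic or lock is needed for these vertices.
    void drain() {
        while (!inbox.empty()) {
            RemoteUpdate u = inbox.front();
            inbox.pop_front();
            localData[u.vertex] += u.delta;
        }
    }
};
```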

Collaboration


Dive into Farrukh Hijaz's collaborations.

Top Co-Authors

Omer Khan, University of Connecticut
Qingchuan Shi, University of Connecticut
Masab Ahmad, University of Connecticut
Marten van Dijk, University of Connecticut
Srinivas Devadas, Massachusetts Institute of Technology
Astha Patni, University of Connecticut
Brian Kahne, Freescale Semiconductor
George Kurian, Massachusetts Institute of Technology
Matthew Seita, University of Connecticut