Haris Javaid | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Haris Javaid is active.

Explore More

Publication

Featured researches published by Haris Javaid.

design automation conference | 2014

darkNoC: Designing Energy-Efficient Network-on-Chip with Multi-Vt Cells for Dark Silicon

Haseeb Bokhari; Haris Javaid; Muhammad Shafique; Jörg Henkel; Sri Parameswaran

In this paper, we propose a novel NoC architecture, called dark-NoC, where multiple layers of architecturally identical, but physically different routers are integrated, leveraging the extra transistors available due to dark silicon . Each layer is separately optimized for a particular voltage-frequency range by the adroit use of multi-Vt circuit optimization. At a given time, only one of the network layers is illuminated while all the other network layers are dark. We provide architectural support for seamless integration of multiple network layers, and a fast inter-layer switching mechanism without dropping in-network packets. Our experiments on a 4 × 4 mesh with multi-programmed real application workloads show that darkNoC improves energy-delay product by up to 56% compared to a traditional single layer NoC with state-of-the-art DVFS. This illustrates darkNoC can be used as an energy-efficient communication fabric in future dark silicon chips.

design automation conference | 2009

A design flow for application specific heterogeneous pipelined multiprocessor systems

Haris Javaid; Sri Parameswaran

This paper describes a rapid design methodology to create a pipeline of processers to execute streaming applications. The methodology is in two separate phases: the first phase, uses a heuristic to rapidly search through a large number of processor configurations (configurations differ by the base processor, the additional instructions and cache sizes) to find the near Pareto front; the second phase, utilizes either the above heuristic or an ILP (integer linear programming) formulation to search a smaller design space to find an appropriate final implementation. By the utilization of the fast heuristic with differing runtime constraints in the first phase, we rapidly find the near Pareto front. The second phase provides either an optimal or a near optimal solution. Both the ILP formulation and the heuristic find a system with the smallest area, within a designer specified runtime constraint. The system has efficiently explored design spaces with over 1012 design points. We integrated this design methodology into a commercial design flow and evaluated our approach with different benchmarks (JPEG encoder, JPEG decoder and MP3 encoder). For each benchmark, the near Pareto front was found in a few hours using the heuristic (took several days for the ILP). The results show that the average area error of the heuristic is within 2.5% of the optimal design points (obtained using ILP) for all benchmarks.

international conference on hardware/software codesign and system synthesis | 2010

Optimal synthesis of latency and throughput constrained pipelined MPSoCs targeting streaming applications

Haris Javaid; Xin He; Aleksander Ignjatovic; Sri Parameswaran

A streaming application, characterized by a kernel that can be broken down into independent tasks which can be executed in a pipelined fashion, inherently allows its implementation on a pipeline of Application Specific Instruction set Processors (ASIPs), called a pipelined MPSoC. The latency and throughput requirements of streaming applications put constraints on the design of such a pipelined MPSoC, where each ASIP has a number of available configurations differing by additional instructions, and instruction and data cache sizes. Thus, the design space of a pipelined MPSoC is all the possible combinations of ASIP configurations (design points). In this paper, a methodology is proposed to optimize the area of a pipelined MPSoC under a latency or a throughput constraint. The final design point is a set of ASIP configurations with one configuration for each ASIP. We proposed an Integer Linear Programming (ILP) based solution to the area optimization problem under a latency constraint, and an algorithm for optimization of pipelined MPSoC area under a throughput constraint. The proposed solutions were evaluated using four streaming applications: JPEG encoder; JPEG decoder; MP3 encoder; and H.264 decoder. The time to find the Pareto front of each pipelined MPSoC was less than 4 minutes where design spaces had up to 1016 design points, illustrating the applicability of our approach.

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems | 2010

Rapid Design Space Exploration of Application Specific Heterogeneous Pipelined Multiprocessor Systems

Haris Javaid; Aleksander Ignjatovic; Sri Parameswaran

This paper describes a rapid design methodology to create a pipeline of processors to execute streaming applications. The methodology seeks a system with the smallest area while its runtime is within a specified runtime constraint. Initially, a heuristic is used to rapidly explore a large number of processor configurations to find the near Pareto front of the design space, and then an exact integer linear programming (ILP) formulation (EIF) is used to find an optimal solution. A reduced ILP formulation (RIF) or the heuristic is used if the EIF does not find an optimal solution in a given time window. This design methodology was integrated into a commercial design flow and was evaluated on four benchmarks with design spaces containing up to 1016 design points. For each benchmark, the near Pareto front was found in less than 3 h using the heuristic, while EIF took up to 16 h. The results show that the average area error of the heuristic and RIF was within 2.25% and 1.25% of the optimal design points for all the benchmarks, respectively. The heuristic is faster than RIF, while both the heuristic and RIF are significantly faster than EIF.

international conference on hardware/software codesign and system synthesis | 2008

Synthesis of heterogeneous pipelined multiprocessor systems using ILP: jpeg case study

Haris Javaid; Sri Parameswaran

Streaming applications can be implemented with a pipeline of processors. Each processor in the pipeline can be an application Specific Instruction Set Processor (ASIP) with the result being a heterogeneous pipelined MPSoC system. Since ASIPs can be of differing configurations, finding the optimal set of configurations for a multiprocessor architecture is a difficult problem. In this paper, we obtain an optimal system design for a set of processors which execute a multimedia application. The variables in the system are the presence or absence of different additional instructions and differing cache configurations for each of the processors. The problem is formulated as a 0-1 Integer Linear Programming (ILP) problem. To reduce the complexity of the ILP formulation, inferior ASIP configurations are efficiently pruned so that the solution could be reached quickly. Given a system runtime constraint, the proposed methodology finds a design with minimal area. We integrated this design methodology into a commercial design flow, and performed a case study upon the JPEG encoding application. We obtained 15 optimal designs subject to 15 different runtime constraints, each in less than 100 seconds from more than 4.2 x 1013 design points.

design automation conference | 2015

SuperNet: multimode interconnect architecture for manycore chips

Haseeb Bokhari; Haris Javaid; Muhammad Shafique; Jörg Henkel; Sri Parameswaran

Designers of the on-chip interconnect for manycore chips are faced with the dilemma of meeting performance, power and reliability requirements for different operational scenarios. In this paper, we propose a multimode on-chip interconnect called SuperNet. This interconnect can be configured to run in three different modes: energy efficient mode; performance mode; and, reliability mode. Our proposed interconnect is based on two parallel multi-vt optimized packet switched network-on-chip (NoC) meshes. We describe the circuit design techniques and architectural modifications required to realize such a multimode interconnect. Our evaluation with diverse set of applications show that the energy efficient mode can save on average 40% NoC power, whereas the performance mode can improve the core IPC by up to 13% on selected high MPKI applications. The reliability mode provides protection against soft errors in the routers data path through byte oriented SECDED codes that can correct up to 8 bit errors and detect up to 16 bit errors in a 64 bit flit, whereas the routers control path is protected through DMR lock step execution.

design, automation, and test in europe | 2010

Rapid runtime estimation methods for pipelined MPSoCs

Haris Javaid; Andhi Janapsatya; Mohammad Shihabul Haque; Sri Parameswaran

The pipelined Multiprocessor System on Chip (MPSoC) paradigm is well suited to the data flow nature of streaming applications. A pipelined MPSoC is a system where processing elements (PEs) are connected in a pipeline. Each PE is implemented using one of a number of processor configurations (configurations differ by instruction sets and cache sizes) available for that PE. The goal is to select a pipelined MPSoC with a mapping of a processor configuration to every PE. To estimate the run-time of a pipelined MPSoC, designers typically perform cycle-accurate simulation of the whole pipelined system. Since the number of possible pipelined implementations can be in the order of billions, estimation methods are necessary. In this paper, we propose two methods to estimate the runtime of a pipelined MPSoC, minimizing the use of slow cycle-accurate simulations. The first method estimates the runtime of the pipelined MPSoC, by performing cycle accurate simulations of individual processor configurations (rather than the whole pipelined system), and then utilizing an analytical model to estimate the runtime of the pipelined system. In the second method, runtimes of individual processor configurations are estimated using an analytical processor model (which uses cycle-accurate simulations of selected configurations, and an equation based on ISA and cache statistics). These estimated runtimes of individual processor configurations are then used to estimate the total runtime of the pipelined system. By evaluating our approach on three benchmarks, we show that the maximum estimation error is 5.91% and 16.45%, with an average estimation error of 2.28% and 6.30% for the first and second method respectively. The time to simulate all the possible pipelined implementations (design points) using cycle-accurate simulator is in the order of years, as design spaces with at least 1010 design points are considered in this paper. However, the time to simulate all processor configurations individually (first method) takes tens of hours, while the time to simulate a subset of processor configurations and estimate their runtimes (second method) is only a few hours. Once these simulations are done, the runtime of each pipelined implementation can be estimated within milliseconds.

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems | 2014

Energy-Efficient Adaptive Pipelined MPSoCs for Multimedia Applications

Haris Javaid; Muhammad Shafique; Jörg Henkel; Sri Parameswaran

Pipelined MPSoCs provide a high throughput implementation platform for multimedia applications. They are typically balanced at design-time considering worst-case scenarios so that a given throughput can be fulfilled at all times. Such worst-case pipelined MPSoCs lack runtime adaptability and result in inefficient resource utilization and high power/energy consumption under a dynamic workload. In this paper, we propose a novel adaptive architecture and a distributed runtime processor manager to enable runtime adaptation in pipelined MPSoCs. The proposed architecture consists of main processors and auxiliary processors, where a main processor uses differing number of auxiliary processors considering runtime workload variations. The runtime processor manager uses a combination of applications execution and knowledge, and offline profiling and statistical information to proactively predict the auxiliary processors that should be used by a main processor. The idle auxiliary processors are then deactivated using clock- or power-gating. Each main processor with a pool of auxiliary processors has its own runtime manager, which is independent of the other main processors, enabling a distributed runtime manager. Our experiments with an H.264 video encoder for HD720p resolution at 30 frames/s show that the adaptive pipelined MPSoC consumed up to 29% less energy (computed using processors and caches) than a worst-case pipelined MPSoC, while delivering a minimum of 28.75 frames/s. Our results show that adaptive pipelined MPSoCs can emerge as an energy-efficient implementation platform for advanced multimedia applications.

asia and south pacific design automation conference | 2013

RExCache: Rapid exploration of unified last-level cache

Su Myat Min Shwe; Haris Javaid; Sri Parameswaran

In this paper, we propose to explore design space of a unified last-level cache to improve system performance and energy efficiency. The challenge is to quickly estimate the execution time and energy consumption of the system with distinct cache configurations using minimal number of slow full-system cycle-accurate simulations. To this end, we propose a novel, simple yet highly accurate execution time estimator and a simple, reasonably accurate energy estimator. Our framework, RExCache, combines a cycle-accurate simulator and a trace-driven cache simulator with our novel execution time estimator and energy estimator to avoid cycle-accurate simulations of all the last-level cache configurations. Once execution time and energy estimates are available from the estimators, RExCache chooses minimum execution time or minimum energy consumption cache configuration. Our experiments with nine different applications from mediabench, and 330 last-level cache configurations show that the execution time and energy estimators had at least average absolute accuracy of 99.74% and 80.31% respectively. RExCache took only a few hours (21 hours for H.264enc) to explore last-level cache configurations compared to several days of traditional method (36 days for H.264enc) and cycle-accurate simulations (257 days for H.264enc), enabling quick exploration of the last-level cache. When 100 different real-time constraints on execution time and energy were used, all the cache configurations found by RExCache were similar to those from cycle-accurate simulations. On the other hand, the traditional method found correct cache configurations for only 69 out of 100 constraints. Thus, RExCache has better absolute accuracy than the traditional method, yet reducing the simulation time by at least 97%.

embedded systems for real time multimedia | 2011

Multi-ASIP based parallel and scalable implementation of motion estimation kernel for high definition videos

Hong Chinh Doan; Haris Javaid; Sri Parameswaran

Parallel implementations of motion estimation for high definition videos typically exploit various forms of parallelism (GOP, frame-, slice- and macroblock-level) to deliver real-time throughput. Although parallel implementations deliver real-time throughput, they often suffer from limited flexibility and scalability due to the form of parallelism and architecture used. In this work, we use Group Of MacroBlocks (GOMB) and Intra-MB (IMB) parallelism with a multi-ASIP (Application Specific Instruction set Processor) architecture to provide a flexible and scalable platform for motion estimation of high definition videos. Multiple GOMBs are processed by the ASIPs in parallel (GOMB-level) where each ASIP is equipped with custom instructions to process the pixels of an MB in parallel (IMB-level). The system is flexible and scalable as the number of ASIPs (number of GOMBs) and custom instructions are not fixed, and are determined through design space exploration. We evaluated the multi-ASIP architecture in Tensilicas commercial design environment with varying number of ASIPs (up to nine), and compared hand-coded and automatically generated custom instructions. The results illustrate that systems with three and seven ASIPs delivered real-time throughput of 30 and 60 fps respectively for “pedestrian”, “rush hour” and “tractor” HD1080p video sequences. In addition, the results indicate that the multi-ASIP platform can be extended for even higher resolutions such as Ultra High Definition (UHD) due to its flexibility and scalability.

Explore More