Sri Parameswaran
University of New South Wales
Publication
Featured research published by Sri Parameswaran.
international conference on vlsi design | 2004
Jeremy Chan; Sri Parameswaran
In this paper, we describe NoCGEN, a Network On Chip (NoC) generator, which is used to create a simulatable and synthesizable NoC description. NoCGEN uses a set of modularised router components that can be used to form different routers with a varying number of ports, routing algorithms, data widths and buffer depths. A graph description representing the interconnection between these routers is used to generate a top-level VHDL description. A wormhole output-queued 2-D mesh router was created to verify the capability of NoCGEN. Various parameterized designs were synthesized to provide estimated gate counts of 129 K to 695 K for a number of topologies varying from a 2×2 mesh to a 4×4 mesh, with a constant data bus width of 32. The NoC was simulated with random traffic using a mixed SystemC/VHDL environment to ensure correctness of operation and to obtain performance and average latency. The results show an accepted load of 53% to 55.6% with an increase in buffer depth from 8 to 32 flits for the 4×4 mesh router.
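The graph description mentioned in this abstract can be pictured with a small sketch. The following Python is only an illustration of how a 2-D mesh interconnection graph might be enumerated before a top-level netlist is generated; the function names `mesh_links` and `port_counts` and the row-major router numbering are assumptions made here, not NoCGEN's actual input format.

```python
def mesh_links(rows, cols):
    """Return the bidirectional links of a rows x cols 2-D mesh.

    Routers are numbered row-major (an assumption of this sketch).
    Each link is a (router_a, router_b) pair.
    """
    links = []
    for r in range(rows):
        for c in range(cols):
            node = r * cols + c
            if c + 1 < cols:
                links.append((node, node + 1))     # link to east neighbour
            if r + 1 < rows:
                links.append((node, node + cols))  # link to south neighbour
    return links


def port_counts(rows, cols):
    """Count network ports per router, excluding the local core port."""
    counts = {n: 0 for n in range(rows * cols)}
    for a, b in mesh_links(rows, cols):
        counts[a] += 1
        counts[b] += 1
    return counts


if __name__ == "__main__":
    for size in (2, 3, 4):
        print(size, len(mesh_links(size, size)), port_counts(size, size))
```

A generator would walk such a link list to instantiate one router per node, sized by its port count, and one connection per link.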
asia and south pacific design automation conference | 2006
Andhi Janapsatya; Aleksandar Ignjatovic; Sri Parameswaran
Modern embedded systems execute a single application or a class of applications repeatedly. An emerging methodology for designing embedded systems utilizes configurable processors where the cache size, associativity, and line size can be chosen by the designer. In this paper, a method is given to rapidly find the L1 cache miss rate of an application. An energy model and an execution time model are developed to find the best cache configuration for the given embedded application. Using benchmarks from Mediabench, we find that our method is on average 45 times faster at exploring the design space than Dinero IV, while still having 100% accuracy.
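To make the design space concrete, here is a minimal sketch of a plain trace-driven cache simulation, the slow kind of baseline that a faster exploration method competes with. It is not the method of the paper; the LRU replacement policy, the function name `miss_rate`, and the tiny address trace are assumptions chosen for illustration.

```python
from collections import OrderedDict


def miss_rate(trace, cache_bytes, line_bytes, ways):
    """Miss rate of a set-associative LRU cache on a list of byte addresses.

    Assumes cache_bytes is a multiple of line_bytes * ways.
    """
    if not trace:
        return 0.0
    num_sets = cache_bytes // (line_bytes * ways)
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for addr in trace:
        block = addr // line_bytes
        lines = sets[block % num_sets]
        tag = block // num_sets
        if tag in lines:
            lines.move_to_end(tag)         # hit: mark as most recently used
        else:
            misses += 1
            if len(lines) >= ways:
                lines.popitem(last=False)  # evict the least recently used line
            lines[tag] = True
    return misses / len(trace)


if __name__ == "__main__":
    trace = [0, 16, 0, 64, 0]  # hypothetical address trace
    for ways in (1, 2):
        print(ways, miss_rate(trace, cache_bytes=64, line_bytes=16, ways=ways))
```

Sweeping `cache_bytes`, `line_bytes` and `ways` with this function re-simulates the whole trace for every configuration, which is the cost a rapid estimation method aims to avoid.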
design automation conference | 2007
Seng Lin Shee; Sri Parameswaran
Multiprocessor SoC systems have led to the increasing use of parallel hardware along with the associated software. These approaches have included coprocessor, homogeneous processor (e.g. SMP) and application specific architectures (i.e. DSP, ASIC). ASIPs have emerged as a viable alternative to conventional processing entities (PEs) due to their configurability and programmability. In this work, we introduce a heterogeneous multi-processor system using ASIPs as processing entities in a pipeline configuration. A streaming application is taken and manually broken into a series of algorithmic stages (each of which makes up a stage in a pipeline). We formulate the problem of mapping each algorithmic stage in the system to an ASIP configuration, and propose a heuristic to efficiently search the design space for a pipeline-based multi-ASIP system. We have implemented the proposed heterogeneous multiprocessor methodology using a commercial extensible processor (Xtensa LX from Tensilica Inc.). We have evaluated our system by creating two benchmarks (MP3 and JPEG encoders) which are mapped to our proposed design platform. Our multiprocessor design provided a performance improvement of at least 4.11X (JPEG) and 3.36X (MP3) compared to the single processor design. The minimum cost obtained through our heuristic was within 5.47% and 5.74% of the best possible values for JPEG and MP3 benchmarks respectively.
design automation conference | 2007
Jude Angelo Ambrose; Roshan G. Ragel; Sri Parameswaran
Side channel attacks are becoming a major threat to the security of embedded systems. Countermeasures proposed to overcome Simple Power Analysis (SPA) and Differential Power Analysis (DPA) are data masking, table masking, current flattening, circuitry level solutions, dummy instruction insertions and balancing bit-flips. All these techniques are either susceptible to multi-order side channel attacks, not sufficiently generic to cover all encryption algorithms, or burden the system with high area cost, run-time or energy consumption. A HW/SW based randomized instruction injection technique is proposed in this paper to overcome the pitfalls of previous countermeasures. Our technique injects random instructions at random places during the execution of an application, which protects the system from both SPA and DPA. Further, we devise a systematic method to measure the security level of a power sequence and use it to measure the number of random instructions needed to suitably confuse the adversary. Our processor model costs 1.9% in additional area for a SimpleScalar processor, and costs on average 29.8% in runtime and 27.1% in additional energy consumption for six industry standard cryptographic algorithms.
asia and south pacific design automation conference | 2006
Andhi Janapsatya; Aleksandar Ignjatovic; Sri Parameswaran
Scratchpad memory has been introduced as a replacement for cache memory as it improves the performance of certain embedded systems. Additionally, it has also been demonstrated that scratchpad memory can significantly reduce the energy consumption of the memory hierarchy of embedded systems. This is significant, as the memory hierarchy consumes a substantial proportion of the total energy of an embedded system. This paper deals with optimization of the instruction memory scratchpad based on a methodology that uses a metric which we call the concomitance. This metric is used to find basic blocks which are executed frequently and in close proximity in time. Once such blocks are found, they are copied into the scratchpad memory at appropriate times; this is achieved using a special instruction inserted into the code at appropriate places. For a set of benchmarks taken from Mediabench, our scratchpad system consumed just 59% (avg) of the energy of the cache system, and 73% (avg) of the energy of the state of the art scratchpad system, while improving the overall performance. Compared to the state of the art method, the number of instructions copied into the scratchpad memory from the main memory is reduced by 88%.
design automation conference | 2014
Haseeb Bokhari; Haris Javaid; Muhammad Shafique; Jörg Henkel; Sri Parameswaran
In this paper, we propose a novel NoC architecture, called darkNoC, where multiple layers of architecturally identical, but physically different routers are integrated, leveraging the extra transistors available due to dark silicon. Each layer is separately optimized for a particular voltage-frequency range by the adroit use of multi-Vt circuit optimization. At a given time, only one of the network layers is illuminated while all the other network layers are dark. We provide architectural support for seamless integration of multiple network layers, and a fast inter-layer switching mechanism without dropping in-network packets. Our experiments on a 4 × 4 mesh with multi-programmed real application workloads show that darkNoC improves energy-delay product by up to 56% compared to a traditional single layer NoC with state-of-the-art DVFS. This illustrates that darkNoC can be used as an energy-efficient communication fabric in future dark silicon chips.
international conference on hardware/software codesign and system synthesis | 2014
Muhammad Shafique; Siddharth Garg; Tulika Mitra; Sri Parameswaran; Jörg Henkel
Dark Silicon refers to the observation that in future technology nodes, it may only be possible to power-on a fraction of on-chip resources (processing cores, hardware accelerators, cache blocks and so on) in order to stay within the power budget and safe thermal limits, while the other resources will have to be kept powered-off or “dark”. In other words, chips will have an abundance of transistors, i.e., more than the number that can be simultaneously powered-on. Heterogeneous computing has been proposed as one way to effectively leverage this abundance of transistors in order to increase performance, energy efficiency and even reliability within power and thermal constraints. However, several critical challenges remain to be addressed including design, automated synthesis, design space exploration and run-time management of heterogeneous dark silicon processors. The hardware/software co-design and synthesis community has potentially much to contribute in solving these new challenges introduced by dark silicon and, in particular, heterogeneous computing. In this paper, we identify and highlight some of these critical challenges, and outline some of our early research efforts in addressing them.
international conference on computer aided design | 2004
Andhi Janapsatya; Sri Parameswaran; Aleksandar Ignjatovic
We propose a methodology for energy reduction and performance improvement. The target system comprises an instruction scratchpad memory instead of an instruction cache. Highly utilized code segments are copied into the scratchpad memory, and are executed from the scratchpad. The copying of code segments from main memory to the scratchpad is performed during runtime. A custom hardware controller is used to manage the copying process. The hardware controller is activated by strategically placed custom instructions within the executing program. These custom instructions inform the hardware controller when to copy during program execution. Novel heuristic algorithms are implemented to determine locations within the program to insert these custom instructions, as well as to choose the best sets of code segments to be copied to the scratchpad memory. For a set of realistic benchmarks, experimental results indicate the method uses 50.7% lower energy (on average) and improves performance by 53.2% (on average) when compared to a traditional cache system which is identical in size. Cache systems compared had sizes ranging from 256 to 16K bytes and associativities ranging from 1 to 32.
design, automation, and test in europe | 2003
Newton Cheung; Jörg Henkel; Sri Parameswaran
We present a methodology that maximizes the performance of a Tensilica-based Application Specific Instruction-set Processor (ASIP) through instruction selection when an area constraint is given. Our approach rapidly selects from a set of pre-fabricated coprocessors/functional units from our library of pre-designed specific instructions (to evaluate our technology we use the Tensilica platform). As a result, we significantly increase application performance while area constraints are satisfied. Our methodology uses a combination of simulation, estimation and a pre-characterised library of instructions to select the appropriate coprocessors and instructions. We report that by selecting the appropriate coprocessors/functional units and specific TIE instructions, the total execution time of complex applications (we study a voice encoder/decoder) can be reduced by up to 85% compared to the base implementation. Our estimator used in the system typically takes less than a second to estimate, with an average error rate of 4% (as compared to full simulation, which takes 45 minutes). The total selection process using our methodology takes 3-4 hours, while a full design space exploration using simulation would take several days.
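Selecting instructions under an area constraint has the shape of a knapsack problem. The sketch below is a textbook 0/1 knapsack in Python, swapped in purely as an illustration; it is not the selection procedure of the paper. The candidate names, integer area units and cycle savings are hypothetical, and the savings of different extensions are assumed to add up independently.

```python
def select_extensions(candidates, area_budget):
    """Choose extensions maximising saved cycles within an area budget.

    candidates: list of (name, area, cycles_saved) with integer area >= 0.
    Returns (total_cycles_saved, tuple_of_chosen_names).
    """
    # best[a] holds the best (saving, chosen names) using at most area a.
    best = [(0, ())] * (area_budget + 1)
    for name, area, saved in candidates:
        # Iterate downwards so each candidate is used at most once.
        for a in range(area_budget, area - 1, -1):
            prev_saved, prev_names = best[a - area]
            if prev_saved + saved > best[a][0]:
                best[a] = (prev_saved + saved, prev_names + (name,))
    return best[area_budget]


if __name__ == "__main__":
    # Hypothetical library entries: (name, area units, cycles saved).
    library = [("mac", 4, 40), ("sat_add", 3, 25), ("fir_unit", 5, 45)]
    print(select_extensions(library, area_budget=8))
```

In practice the savings of custom instructions interact, which is one reason an estimator or simulation is needed rather than a fixed table of values.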
international conference on computer aided design | 2005
Jeremy Chan; Sri Parameswaran
In this paper we present NoCEE, a fast and accurate method for extracting energy models for packet-switched network on chip (NoC) routers. Linear regression is used to model the relationship between events occurring in the NoC and energy consumption. The resulting models are cycle accurate and can be applied to different technology libraries. We verify the individual router estimation models with many different synthetically generated traffic patterns and data inputs. Characterization of a small library takes about two hours. The mean absolute energy estimation error of the resultant models is 5% (10% max) against a complete gate level simulation. We also apply this method to a number of complete NoCs with inputs extracted from synthetic application traces and compare our estimated results to the gate level power simulations (mean absolute error is 5%). Our estimation methodology has been integrated with commercial logic synthesis flow and power estimation tools (Synopsys Design Compiler and PrimePower), allowing application across different designs. The extracted models show the different trends across various parameterizations of network on chip routers and have been integrated into an architecture exploration framework.
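The regression step described in this abstract can be illustrated with ordinary least squares. The Python sketch below fits energy as a linear function of per-interval event counts using the normal equations. It is a generic illustration and not NoCEE itself; the two event types (for example buffer writes and crossbar traversals) and all numbers are made up.

```python
def fit_energy_model(event_counts, energies):
    """Least-squares fit of energy = w0 + sum(w_i * count_i).

    event_counts: list of rows, one count per event type.
    energies: measured energy for each row.
    Returns [w0, w1, ...]. Assumes the design matrix has full column rank.
    """
    rows = [[1.0] + [float(c) for c in r] for r in event_counts]  # intercept
    n = len(rows[0])
    # Augmented normal equations: (X^T X | X^T y).
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(n)]
        + [sum(r[i] * y for r, y in zip(rows, energies))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda k: abs(aug[k][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for k in range(n):
            if k != col:
                f = aug[k][col]
                aug[k] = [vk - f * vc for vk, vc in zip(aug[k], aug[col])]
    return [aug[i][n] for i in range(n)]


def estimate_energy(weights, counts):
    """Apply a fitted model to one row of event counts."""
    return weights[0] + sum(w * c for w, c in zip(weights[1:], counts))


if __name__ == "__main__":
    # Hypothetical characterisation data: (buffer writes, crossbar traversals).
    counts = [(1, 0), (0, 1), (2, 3), (4, 1), (3, 5)]
    measured = [5.0, 2.5, 9.5, 14.5, 13.5]
    model = fit_energy_model(counts, measured)
    print(model, estimate_energy(model, (2, 2)))
```

Once fitted against gate level measurements, such a model can be evaluated from event counts alone, which is far cheaper than a full gate level power simulation.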