Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Andhi Janapsatya is active.

Publication


Featured research published by Andhi Janapsatya.


Asia and South Pacific Design Automation Conference | 2006

Finding optimal L1 cache configuration for embedded systems

Andhi Janapsatya; Aleksandar Ignjatovic; Sri Parameswaran

Modern embedded systems execute a single application or a class of applications repeatedly. An emerging methodology for designing embedded systems utilizes configurable processors, where the cache size, associativity, and line size can be chosen by the designer. In this paper, a method is given to rapidly find the L1 cache miss rate of an application. An energy model and an execution time model are developed to find the best cache configuration for the given embedded application. Using benchmarks from Mediabench, we find that our method explores the design space on average 45 times faster than Dinero IV while still achieving 100% accuracy.
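
The exploration loop described above can be sketched compactly. The following is a minimal illustration, not the paper's models: miss_rate_of stands in for the paper's fast single-pass simulation, and the energy and latency constants are purely illustrative.

```python
# Minimal cache design-space exploration sketch. The constants and
# the miss_rate_of callable are illustrative stand-ins, not the
# paper's energy/time models or simulation algorithm.
from itertools import product

E_HIT, E_MISS = 0.1, 2.0   # nJ per access (assumed values)
T_HIT, T_MISS = 1, 20      # cycles per access (assumed values)

def evaluate(n_accesses, miss_rate):
    """Energy and runtime estimates for one cache configuration."""
    misses = n_accesses * miss_rate
    energy = n_accesses * E_HIT + misses * E_MISS
    time = n_accesses * T_HIT + misses * T_MISS
    return energy, time

def best_config(n_accesses, miss_rate_of):
    """Exhaustively score every (size, assoc, line) combination."""
    best = None
    for size, assoc, line in product([1024, 2048, 4096, 8192],
                                     [1, 2, 4],
                                     [16, 32, 64]):
        e, t = evaluate(n_accesses, miss_rate_of(size, assoc, line))
        if best is None or (e, t) < best[:2]:
            best = (e, t, (size, assoc, line))
    return best
```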


Asia and South Pacific Design Automation Conference | 2006

A novel instruction scratchpad memory optimization method based on concomitance metric

Andhi Janapsatya; Aleksandar Ignjatovic; Sri Parameswaran

Scratchpad memory has been introduced as a replacement for cache memory as it improves the performance of certain embedded systems. Additionally, it has been demonstrated that scratchpad memory can significantly reduce the energy consumption of the memory hierarchy of embedded systems. This is significant, as the memory hierarchy consumes a substantial proportion of the total energy of an embedded system. This paper deals with optimization of the instruction scratchpad memory based on a methodology that uses a metric which we call concomitance. This metric is used to find basic blocks that are executed frequently and in close proximity in time. Once such blocks are found, they are copied into the scratchpad memory at appropriate times; this is achieved using a special instruction inserted into the code at appropriate places. For a set of benchmarks taken from Mediabench, our scratchpad system consumed just 59% (on average) of the energy of the cache system and 73% (on average) of the energy of the state-of-the-art scratchpad system, while improving overall performance. Compared to the state-of-the-art method, the number of instructions copied into the scratchpad memory from main memory is reduced by 88%.
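
The abstract characterizes concomitance only informally (blocks executed frequently and close together in time). One plausible way to compute such a metric from a basic-block execution trace is a sliding-window co-occurrence count, sketched below; the window size and the unweighted counting are assumptions, not the paper's definition.

```python
# Sliding-window sketch of a concomitance-like metric over a trace of
# basic-block IDs. WINDOW and the flat (unweighted) count are assumed;
# the paper's actual definition may weight by temporal distance.
from collections import defaultdict

WINDOW = 32  # assumed proximity threshold, in trace entries

def concomitance(trace):
    scores = defaultdict(int)
    for i, u in enumerate(trace):
        for j in range(i + 1, min(i + WINDOW, len(trace))):
            v = trace[j]
            if v != u:
                scores[frozenset((u, v))] += 1  # unordered block pair
    return scores
```

Block pairs with high scores are the natural candidates to be resident in the scratchpad at the same time.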


International Conference on Computer-Aided Design | 2004

Hardware/software managed scratchpad memory for embedded system

Andhi Janapsatya; Sri Parameswaran; Aleksandar Ignjatovic

We propose a methodology for energy reduction and performance improvement. The target system comprises an instruction scratchpad memory instead of an instruction cache. Highly utilized code segments are copied into the scratchpad memory and are executed from the scratchpad. The copying of code segments from main memory to the scratchpad is performed at runtime. A custom hardware controller is used to manage the copying process. The hardware controller is activated by strategically placed custom instructions within the executing program; these custom instructions inform the hardware controller when to copy during program execution. Novel heuristic algorithms are implemented to determine the locations within the program at which to insert these custom instructions, as well as to choose the best sets of code segments to be copied to the scratchpad memory. For a set of realistic benchmarks, experimental results indicate the method uses 50.7% lower energy (on average) and improves performance by 53.2% (on average) when compared to a traditional cache system of identical size. The cache systems compared had sizes ranging from 256 bytes to 16K bytes and associativities ranging from 1 to 32.
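
The runtime mechanism (custom instructions that activate the controller to copy a segment) can be modelled abstractly as a trigger table consulted at each fetch. The toy model below uses assumed names and structure; it illustrates the control flow, not the hardware design.

```python
# Toy software model of custom-instruction-triggered scratchpad
# loading. `triggers` maps a program address to the (start, length)
# of the code segment to copy when execution reaches that address.
# All names are illustrative, not from the paper.

class ScratchpadController:
    def __init__(self, triggers, main_memory, spm_size):
        self.triggers = triggers
        self.mem = main_memory   # dict: address -> instruction
        self.spm = {}            # current scratchpad contents
        self.spm_size = spm_size

    def fetch(self, pc):
        if pc in self.triggers:  # a custom instruction sits here
            start, length = self.triggers[pc]
            assert length <= self.spm_size
            # the controller copies the segment into the scratchpad
            self.spm = {a: self.mem[a] for a in range(start, start + length)}
        # execute from the scratchpad whenever the code is resident
        return self.spm.get(pc, self.mem[pc])
```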


International Conference on VLSI Design | 2005

Rapid embedded hardware/software system generation

Jorgen Peddersen; Seng Lin Shee; Andhi Janapsatya; Sri Parameswaran

This paper presents an RTL generation scheme for a SimpleScalar/PISA instruction set architecture with system calls to implement C programs. The scheme utilizes ASIPmeister, a processor generation tool. The RTL generated is available for download. The second part of the paper shows a method of reducing the PISA instruction set and generating a processor for a given application. This reduction and generation can be performed within an hour, making this one of the fastest methods of generating an application-specific processor. For five benchmark applications, we show that on average, processor size can be reduced by 30%, energy consumption reduced by 24%, and performance improved by 24%.
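
The instruction-set reduction step amounts to keeping only the opcodes the application actually uses. A hedged sketch: collect mnemonics from a disassembly listing and diff them against the full set. The listing format is an assumption, and the real flow feeds ASIPmeister rather than printing a set.

```python
# Sketch of application-driven instruction-set reduction: keep only
# opcodes appearing in the program's disassembly. The input format
# (one "addr: opcode operands" line per instruction) is assumed.

def used_opcodes(disassembly_lines):
    ops = set()
    for line in disassembly_lines:
        parts = line.split(":", 1)
        if len(parts) == 2 and parts[1].strip():
            ops.add(parts[1].split()[0])  # first token is the mnemonic
    return ops

full_isa = {"add", "sub", "mul", "div", "lw", "sw", "beq", "jal"}
listing = ["400100: lw $2,0($4)",
           "400104: add $2,$2,$5",
           "400108: sw $2,0($4)"]
removable = full_isa - used_opcodes(listing)  # drop from the processor
print(sorted(removable))
```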


International Conference on Hardware/Software Codesign and System Synthesis | 2009

SuSeSim: a fast simulation strategy to find optimal L1 cache configuration for embedded systems

Mohammad Shihabul Haque; Andhi Janapsatya; Sri Parameswaran

Simulation of an application is a popular and reliable approach for finding the optimal configuration of level-one cache memory for an application-specific embedded system processor. However, long simulation time is one of the main disadvantages of simulation-based approaches. In this paper, we propose a new and fast simulation method, Super Set Simulator (SuSeSim). While previous methods use a top-down searching strategy, SuSeSim utilizes a bottom-up search strategy along with a new, elaborate data structure to reduce the search space needed to determine a cache hit or miss. SuSeSim can simulate hundreds of cache configurations simultaneously by reading an application's memory request trace just once. The total numbers of cache hits and misses are accurately recorded. Depending on cache block sizes and benchmark applications, SuSeSim can reduce the number of tags to be checked by up to 43% compared to the existing fastest simulation approach (the CRCB algorithm). With the help of a faster search and an easy-to-maintain data structure, SuSeSim can be up to 94% faster in simulating memory requests than the CRCB algorithm.
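
SuSeSim's bottom-up search and its data structure are what make it fast; they are not reproduced here. The single-pass structure it accelerates, however, is easy to show: feed each trace address to every candidate configuration while reading the trace once. A naive baseline sketch:

```python
# Naive single-pass, multi-configuration LRU cache simulation: the
# trace is read once and each address is offered to every candidate
# configuration. SuSeSim avoids most of this per-configuration work;
# this baseline only illustrates the single-pass idea.
from collections import OrderedDict

class LRUCache:
    def __init__(self, n_sets, assoc, block):
        self.n_sets, self.assoc, self.block = n_sets, assoc, block
        self.sets = [OrderedDict() for _ in range(n_sets)]
        self.hits = self.misses = 0

    def access(self, addr):
        blk = addr // self.block           # block number
        s = self.sets[blk % self.n_sets]   # its set
        if blk in s:
            s.move_to_end(blk)             # refresh LRU position
            self.hits += 1
        else:
            self.misses += 1
            if len(s) >= self.assoc:
                s.popitem(last=False)      # evict least recently used
            s[blk] = True

def simulate(trace, configs):
    caches = [LRUCache(*c) for c in configs]
    for addr in trace:                     # trace read exactly once
        for c in caches:
            c.access(addr)
    return [(c.hits, c.misses) for c in caches]
```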


Design, Automation and Test in Europe | 2010

DEW: a fast level 1 cache simulation approach for embedded processors with FIFO replacement policy

Mohammad Shihabul Haque; Jorgen Peddersen; Andhi Janapsatya; Sri Parameswaran

Increasing the speed of cache simulation to obtain hit/miss rates enables performance estimation, cache exploration for embedded systems, and energy estimation. Previously, such simulations, particularly exact approaches, have been exclusively for caches which utilize the least recently used (LRU) replacement policy. In this paper, we propose a new, fast, and exact cache simulation method for the First In First Out (FIFO) replacement policy. This method, called DEW, is able to simulate multiple level-1 cache configurations (different set sizes, associativities, and block sizes) with the FIFO replacement policy. DEW utilizes a binomial-tree-based representation of cache configurations and a novel searching method to speed up simulation over single-cache simulators like Dinero IV. Depending on cache block sizes and benchmark applications, DEW operates around 8 to 40 times faster than Dinero IV. Dinero IV compares 2.17 to 19.42 times more cache ways than DEW to determine accurate miss rates.
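
What distinguishes FIFO from LRU simulation is visible even in a minimal single-configuration model: a FIFO set does not reorder its contents on a hit, only on a refill. A sketch (DEW's binomial-tree representation is not reproduced):

```python
# Minimal FIFO set-associative cache model. Unlike LRU, a hit does
# NOT change the replacement order; only insertions do.
from collections import deque

class FIFOCache:
    def __init__(self, n_sets, assoc, block):
        self.n_sets, self.assoc, self.block = n_sets, assoc, block
        self.sets = [deque() for _ in range(n_sets)]
        self.hits = self.misses = 0

    def access(self, addr):
        blk = addr // self.block
        q = self.sets[blk % self.n_sets]
        if blk in q:
            self.hits += 1       # no reordering on a hit
        else:
            self.misses += 1
            if len(q) >= self.assoc:
                q.popleft()      # evict the oldest resident block
            q.append(blk)        # newest block joins the back
```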


IEEE Transactions on Very Large Scale Integration (VLSI) Systems | 2006

Exploiting statistical information for implementation of instruction scratchpad memory in embedded system

Andhi Janapsatya; Aleksandar Ignjatovic; Sri Parameswaran

A method to both reduce energy and improve performance in a processor-based embedded system is described in this paper. Comprising a scratchpad memory instead of an instruction cache, the target system dynamically (at runtime) copies into the scratchpad code segments that are determined to be beneficial (in terms of energy efficiency and/or speed) to execute from the scratchpad. We develop a heuristic algorithm to select such code segments based on a metric called concomitance. Concomitance is derived from the temporal relationships of instructions. A hardware controller is designed and implemented for managing the scratchpad memory. Strategically placed custom instructions in the program inform the hardware controller when to copy instructions from main memory to the scratchpad. A novel heuristic algorithm is implemented for determining the locations within the program at which to insert these custom instructions. For a set of realistic benchmarks, experimental results indicate the method uses 41.9% lower energy (on average) and improves performance by 40.0% (on average) when compared to a traditional cache system of identical size.
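
Once concomitance scores exist, choosing which segments to keep in the scratchpad is a capacity-constrained selection problem. A simple greedy sketch under assumed inputs; the paper's heuristic algorithm is more involved than this:

```python
# Greedy stand-in for scratchpad content selection: rank code blocks
# by an assumed benefit score per byte and fill the capacity. The
# (id, size, score) inputs and the greedy rule are illustrative only.

def select_segments(blocks, spm_capacity):
    """blocks: list of (block_id, size_bytes, benefit_score)."""
    chosen, used = [], 0
    for bid, size, score in sorted(blocks,
                                   key=lambda b: b[2] / b[1],
                                   reverse=True):
        if used + size <= spm_capacity:
            chosen.append(bid)
            used += size
    return chosen

print(select_segments([("b1", 256, 900), ("b2", 512, 1500),
                       ("b3", 128, 600)], spm_capacity=640))
# -> ['b3', 'b1']: best benefit density that fits in 640 bytes
```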


Design, Automation and Test in Europe | 2010

Rapid runtime estimation methods for pipelined MPSoCs

Haris Javaid; Andhi Janapsatya; Mohammad Shihabul Haque; Sri Parameswaran

The pipelined Multiprocessor System-on-Chip (MPSoC) paradigm is well suited to the data-flow nature of streaming applications. A pipelined MPSoC is a system where processing elements (PEs) are connected in a pipeline. Each PE is implemented using one of a number of processor configurations (configurations differ by instruction sets and cache sizes) available for that PE. The goal is to select a pipelined MPSoC with a mapping of a processor configuration to every PE. To estimate the runtime of a pipelined MPSoC, designers typically perform cycle-accurate simulation of the whole pipelined system. Since the number of possible pipelined implementations can be in the order of billions, estimation methods are necessary. In this paper, we propose two methods to estimate the runtime of a pipelined MPSoC, minimizing the use of slow cycle-accurate simulations. The first method estimates the runtime of the pipelined MPSoC by performing cycle-accurate simulations of individual processor configurations (rather than the whole pipelined system), and then utilizing an analytical model to estimate the runtime of the pipelined system. In the second method, runtimes of individual processor configurations are estimated using an analytical processor model (which uses cycle-accurate simulations of selected configurations, and an equation based on ISA and cache statistics). These estimated runtimes of individual processor configurations are then used to estimate the total runtime of the pipelined system. Evaluating our approach on three benchmarks, we show that the maximum estimation error is 5.91% and 16.45%, with an average estimation error of 2.28% and 6.30%, for the first and second methods respectively. The time to simulate all possible pipelined implementations (design points) using a cycle-accurate simulator is in the order of years, as design spaces with at least 10^10 design points are considered in this paper. However, simulating all processor configurations individually (first method) takes tens of hours, while simulating a subset of processor configurations and estimating their runtimes (second method) takes only a few hours. Once these simulations are done, the runtime of each pipelined implementation can be estimated within milliseconds.
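
The paper's analytical model is not reproduced here, but the intuition behind estimating a pipelined MPSoC's runtime from per-PE runtimes can be stated with the standard pipeline formula: after the pipeline fills, throughput is limited by the slowest stage. A hedged sketch:

```python
# Textbook pipeline runtime estimate from per-stage latencies
# (cycles per data item for each PE), offered as intuition for the
# paper's analytical model rather than its exact equation.

def pipeline_runtime(stage_latencies, n_items):
    fill = sum(stage_latencies)                    # first item fills the pipe
    steady = (n_items - 1) * max(stage_latencies)  # bottleneck PE dominates
    return fill + steady

# e.g. 4 PEs, bottleneck PE at 1200 cycles/item, one million items:
print(pipeline_runtime([800, 1200, 950, 700], 10**6))
```

Swapping a PE's processor configuration changes only its entry in stage_latencies, which is why per-PE simulation plus an analytical model can stand in for simulating the whole pipeline.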


Design Automation Conference | 2010

SCUD: a fast single-pass L1 cache simulation approach for embedded processors with round-robin replacement policy

Mohammad Shihabul Haque; Jorgen Peddersen; Andhi Janapsatya; Sri Parameswaran

Embedded systems designers are free to choose the most suitable configuration of L1 cache in modern processor-based SoCs. Choosing the appropriate L1 cache configuration necessitates the simulation of long memory access traces to accurately obtain hit/miss rates. The long execution time taken to simulate these traces, particularly a separate simulation for each configuration, is a major drawback. Researchers have proposed techniques to speed up the simulation of caches with the LRU replacement policy. These techniques are of little use in the majority of embedded processors, as these processors utilize caches with a round-robin replacement policy. In this paper, we propose a fast L1 cache simulation approach, called SCUD (Sorted Collection of Unique Data), for caches with the round-robin policy. SCUD is a single-pass cache simulator that can simulate multiple L1 cache configurations (with varying set sizes and associativities) by reading the application trace once. Utilizing fast binary searches in a novel data structure, SCUD simulates an application trace significantly faster than a widely used single-configuration cache simulator (Dinero IV). We show SCUD can simulate a set of cache configurations up to 57 times faster than Dinero IV. SCUD shows an average speedup of 19.34 times over Dinero IV for Mediabench applications, and an average speedup of over 10 times for SPEC CPU2000 applications.
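
Round-robin replacement is typically implemented as a per-set victim pointer that advances on every refill and ignores hits. A minimal model of that policy (SCUD's sorted collection and binary searches are not reproduced):

```python
# Minimal round-robin set-associative cache: each set holds a fixed
# array of ways plus a rotating victim pointer that advances only on
# refills, never on hits.

class RoundRobinCache:
    def __init__(self, n_sets, assoc, block):
        self.n_sets, self.assoc, self.block = n_sets, assoc, block
        self.ways = [[None] * assoc for _ in range(n_sets)]
        self.ptr = [0] * n_sets   # per-set round-robin pointer
        self.hits = self.misses = 0

    def access(self, addr):
        blk = addr // self.block
        s = blk % self.n_sets
        if blk in self.ways[s]:
            self.hits += 1        # pointer is NOT advanced on a hit
        else:
            self.misses += 1
            self.ways[s][self.ptr[s]] = blk  # overwrite the victim way
            self.ptr[s] = (self.ptr[s] + 1) % self.assoc
```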


Asia and South Pacific Design Automation Conference | 2009

HitME: low power Hit MEmory buffer for embedded systems

Andhi Janapsatya; Sri Parameswaran; Aleksandar Ignjatovic

In this paper, we present a novel HitME (Hit-MEmory) buffer to reduce the energy consumption of the memory hierarchy in embedded processors. The HitME buffer is a small direct-mapped cache memory that is added as additional memory into existing cache memory hierarchies. The HitME buffer is loaded only when there is a hit on the L1 cache; otherwise, the L1 cache is updated from memory and the processor's memory request is served directly from the L1 cache. The strategy works because a large fraction (around 90%) of memory locations are accessed only once, and these often pollute the cache. Energy reduction is achieved by reducing the number of accesses to the L1 cache memory. Experimental results show that the use of the HitME buffer reduces L1 cache accesses, resulting in a reduction in the energy consumption of the memory hierarchy. This decrease in L1 cache accesses reduces cache system energy consumption by an average of 60.9% compared to a traditional L1 cache memory architecture, and by 6.4% compared to a filter cache architecture, for 70 nm cache technology.
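
The access policy described above (fill the small buffer only on an L1 hit, so once-touched lines never enter it) reduces to a short decision procedure. A sketch with assumed structures; the L1 is modelled as a simple set and capacity effects are ignored:

```python
# Sketch of the HitME access policy as described in the abstract.
# `buffer` is a direct-mapped dict (slot -> line), `l1` a plain set
# standing in for the L1 cache; both are simplifications.

def hitme_access(addr, buffer, l1, buffer_slots, line_size=32):
    line = addr // line_size            # line_size is an assumption
    slot = line % buffer_slots          # direct-mapped buffer index
    if buffer.get(slot) == line:
        return "HitME buffer hit"       # cheapest possible access
    if line in l1:
        buffer[slot] = line             # promote on an L1 hit only
        return "L1 hit, line promoted to buffer"
    l1.add(line)                        # L1 miss: fill L1, bypass buffer
    return "miss, served from main memory via L1"
```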

Collaboration


Dive into Andhi Janapsatya's collaborations.

Top Co-Authors

Sri Parameswaran
University of New South Wales

Aleksandar Ignjatovic
University of New South Wales

Jorgen Peddersen
University of New South Wales

Jörg Henkel
Karlsruhe Institute of Technology

Haris Javaid
University of New South Wales

Seng Lin Shee
University of New South Wales