
Publication


Featured research published by Jorgen Peddersen.


International Conference on VLSI Design | 2005

Rapid embedded hardware/software system generation

Jorgen Peddersen; Seng Lin Shee; Andhi Janapsatya; Sri Parameswaran

This paper presents an RTL generation scheme for a SimpleScalar/PISA instruction set architecture with system calls to implement C programs. The scheme utilizes ASIPmeister, a processor generation tool. The RTL generated is available for download. The second part of the paper shows a method of reducing the PISA instruction set and generating a processor for a given application. This reduction and generation can be performed within an hour, making this one of the fastest methods of generating an application specific processor. For five benchmark applications, we show that on average, processor size can be reduced by 30%, energy consumed reduced by 24%, and performance improved by 24%.


Asia and South Pacific Design Automation Conference | 2007

CLIPPER: Counter-based Low Impact Processor Power Estimation at Run-time

Jorgen Peddersen; Sri Parameswaran

Numerous dynamic power management techniques have been proposed which utilize the knowledge of processor power/energy consumption at run-time. So far, no efficient method to provide run-time power/energy data has been presented. Current measurement systems draw too much power to be used in small embedded designs, and existing performance counters cannot provide sufficient information for run-time optimization. This paper presents a novel methodology to solve the problem of run-time power optimization by designing a processor that estimates its own power/energy consumption. Estimation is performed by the addition of small counters that tally events which consume power. This methodology has been applied to an existing processor, resulting in an average power error of 2% and an energy estimation error of 1.5%. The system adds little impact to the design, with only a 4.9% increase in chip area and a 3% increase in average power consumption. A case study of an application that utilizes the processor showcases the benefits the methodology enables in dynamic power optimization.
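The core idea of counter-based estimation can be sketched in a few lines: each counter tallies one class of power-consuming event, and a weighted sum converts the counts into an energy figure. The event names and per-event energy weights below are purely illustrative, not the coefficients used in the paper.

```python
# Sketch of counter-based power/energy estimation (hypothetical event
# names and per-event energy weights; not the paper's actual model).

# Assumed per-event energy costs in picojoules, e.g. characterized
# offline via gate-level simulation of the target processor.
ENERGY_PJ = {
    "alu_op": 4.0,
    "mem_read": 12.0,
    "mem_write": 14.0,
    "branch": 6.0,
}

def estimate_energy(counters):
    """Total energy (pJ) implied by one snapshot of the event counters."""
    return sum(ENERGY_PJ[event] * count for event, count in counters.items())

def estimate_power_mw(counters, elapsed_ns):
    """Average power over the sampling window; pJ / ns equals mW."""
    return estimate_energy(counters) / elapsed_ns

# Example: counter values accumulated over a 5000 ns sampling window.
window = {"alu_op": 1000, "mem_read": 200, "mem_write": 100, "branch": 150}
print(estimate_energy(window))           # → 8700.0 (pJ)
print(estimate_power_mw(window, 5000))   # → 1.74 (mW)
```

In hardware, the weighted sum would be computed by small adders attached to the counters; software merely reads the result, which is what keeps the run-time overhead low.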


Design, Automation and Test in Europe | 2010

DEW: a fast level 1 cache simulation approach for embedded processors with FIFO replacement policy

Mohammad Shihabul Haque; Jorgen Peddersen; Andhi Janapsatya; Sri Parameswaran

Increasing the speed of cache simulation to obtain hit/miss rates enables performance estimation, cache exploration for embedded systems, and energy estimation. Previously, such simulations, particularly exact approaches, have been exclusively for caches which utilize the least recently used (LRU) replacement policy. In this paper, we propose a new, fast and exact cache simulation method for the First In First Out (FIFO) replacement policy. This method, called DEW, is able to simulate multiple level 1 cache configurations (different set sizes, associativities, and block sizes) with the FIFO replacement policy. DEW utilizes a binomial-tree-based representation of cache configurations and a novel searching method to speed up simulation over single-configuration cache simulators like Dinero IV. Depending on cache block sizes and benchmark applications, DEW operates around 8 to 40 times faster than Dinero IV. Dinero IV compares 2.17 to 19.42 times more cache ways than DEW to determine accurate miss rates.
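For context, the single-configuration baseline that DEW is measured against can be sketched as a straightforward trace-driven simulator. The snippet below models one set-associative cache with FIFO replacement; all parameters are illustrative, and DEW's binomial-tree machinery for simulating many configurations in one pass is not shown.

```python
# Minimal single-configuration FIFO cache simulator (illustrative only;
# a Dinero-IV-style baseline, not DEW's multi-configuration algorithm).
from collections import deque

def simulate_fifo_cache(trace, num_sets, assoc, block_size):
    """Count hits/misses for one set-associative FIFO cache.
    trace: iterable of byte addresses."""
    sets = [deque() for _ in range(num_sets)]
    hits = misses = 0
    for addr in trace:
        block = addr // block_size
        s = sets[block % num_sets]
        if block in s:
            hits += 1          # FIFO order is NOT updated on a hit
        else:
            misses += 1
            if len(s) == assoc:
                s.popleft()    # evict the oldest resident block (first in)
            s.append(block)
    return hits, misses

# Example: six accesses on a 2-set, 2-way cache with 64-byte blocks.
trace = [0, 64, 0, 128, 192, 0]
print(simulate_fifo_cache(trace, num_sets=2, assoc=2, block_size=64))  # → (2, 4)
```

The key difference from LRU is visible in the hit path: a FIFO hit does not reorder the set, which is exactly what breaks the inclusion properties that LRU single-pass simulators exploit.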


Design Automation Conference | 2010

SCUD: a fast single-pass L1 cache simulation approach for embedded processors with round-robin replacement policy

Mohammad Shihabul Haque; Jorgen Peddersen; Andhi Janapsatya; Sri Parameswaran

Embedded systems designers are free to choose the most suitable configuration of L1 cache in modern processor-based SoCs. Choosing the appropriate L1 cache configuration necessitates the simulation of long memory access traces to accurately obtain hit/miss rates. The long execution time taken to simulate these traces, particularly when each configuration must be simulated separately, is a major drawback. Researchers have proposed techniques to speed up the simulation of caches with the LRU replacement policy. These techniques are of little use in the majority of embedded processors, as these processors utilize Round-robin policy based caches. In this paper we propose a fast L1 cache simulation approach, called SCUD (Sorted Collection of Unique Data), for caches with the Round-robin policy. SCUD is a single-pass cache simulator that can simulate multiple L1 cache configurations (with varying set sizes and associativities) by reading the application trace once. Utilizing fast binary searches in a novel data structure, SCUD simulates an application trace significantly faster than a widely used single-configuration cache simulator (Dinero IV). We show SCUD can simulate a set of cache configurations up to 57 times faster than Dinero IV. SCUD shows an average speed-up of 19.34 times over Dinero IV for Mediabench applications, and an average speed-up of over 10 times for SPEC CPU2000 applications.


International Conference on VLSI Design | 2007

Energy Driven Application Self-Adaptation

Jorgen Peddersen; Sri Parameswaran

Until recently, there has been a lack of methods to trade off energy use for quality of service at run-time in stand-alone embedded systems. Such systems are motivated by the need to increase the apparent available battery energy of portable devices, with minimal compromise in quality. The available systems either drew too much power or added considerable overheads due to task swapping. In this paper we demonstrate a feasible method to perform these trade-offs. This work has been enabled by a low-impact power/energy estimating processor which utilizes counters to estimate power and energy consumption at run-time. Techniques are shown that modify multimedia applications to vary the fidelity of their output to optimize the energy/quality trade-off. Two adaptation algorithms are applied to multimedia applications, demonstrating the efficacy of the method. The method increases code size by 1% and execution time by 0.02%, yet is able to produce an output which is acceptable and processes up to double the number of frames.
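The shape of such an adaptation loop can be sketched as a simple feedback controller: after each frame, compare the energy reported by the processor's counters against a per-frame budget and step the output fidelity up or down. The quality levels, thresholds, and step policy below are hypothetical; the paper's two adaptation algorithms are more elaborate.

```python
# Sketch of run-time energy/quality adaptation (hypothetical quality
# levels and thresholds; not the paper's actual algorithms).

QUALITY_LEVELS = [1.0, 0.75, 0.5, 0.25]  # output fidelity, high to low

def adapt(level_idx, energy_used, energy_budget):
    """Step fidelity down when over budget, up when there is headroom."""
    if energy_used > energy_budget and level_idx < len(QUALITY_LEVELS) - 1:
        return level_idx + 1   # over budget: reduce fidelity
    if energy_used < 0.8 * energy_budget and level_idx > 0:
        return level_idx - 1   # comfortable headroom: raise fidelity
    return level_idx

# Example: a frame that overspent its budget forces a quality drop.
print(adapt(0, energy_used=120.0, energy_budget=100.0))  # → 1
```

Because the energy reading comes from on-chip counters rather than external measurement or a monitoring task, the loop itself adds almost nothing, which is consistent with the 1% code size and 0.02% execution time overheads reported.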


International Conference on Computer-Aided Design | 2011

CIPARSim: cache intersection property assisted rapid single-pass FIFO cache simulation technique

Mohammad Shihabul Haque; Jorgen Peddersen; Sri Parameswaran

An application's cache miss rate is used in timing analysis, system performance prediction, and in deciding the best cache memory for an embedded system to meet tighter constraints. Single-pass simulation allows a designer to find the number of cache misses quickly and accurately on various cache memories. Such single-pass simulation systems have previously relied heavily on cache inclusion properties, which allowed rapid simulation of cache configurations for different applications. Thus far, the only inclusion properties discovered were applicable to caches based on the Least Recently Used (LRU) replacement policy. However, LRU based caches are rarely implemented in real life due to their circuit complexity at larger cache associativities. Embedded processors typically use a FIFO replacement policy in their caches instead, for which there are no full inclusion properties to exploit. In this paper, for the first time, we introduce a cache property called the “Intersection Property” that helps to reduce single-pass simulation time in a manner similar to inclusion properties. An intersection property defines conditions that, if met, prove a particular element exists in larger caches, thus avoiding further search time. We discuss three such intersection properties for caches using the FIFO replacement policy in this paper. A rapid single-pass FIFO cache simulator, “CIPARSim”, has also been proposed. CIPARSim is the first single-pass simulator dependent on FIFO cache properties to reduce simulation time significantly. CIPARSim's simulation time was up to 5 times faster (on average 3 times faster) compared to the state-of-the-art single-pass FIFO cache simulator for the cache configurations tested. CIPARSim produces the cache hit and miss rates of an application accurately on various cache configurations. During simulation, CIPARSim's intersection properties alone predict up to 90% (on average 65%) of the total hits, reducing simulation time immensely.


Local Computer Networks | 2009

LOP_RE: Range encoding for low power packet classification

Xin He; Jorgen Peddersen; Sri Parameswaran

State-of-the-art hardware-based techniques achieve high performance and maximize efficiency of packet classification applications. The predominant example of these, Ternary Content Addressable Memory (TCAM) based packet classification systems, can achieve much higher throughput than software-based techniques. However, they suffer from high power consumption due to their highly parallel architecture, and they lack high-throughput range encoding techniques. In this paper, we propose a novel SRAM-based packet classification architecture with packet-side search key range encoding units, significantly reducing energy consumption without reducing throughput below that of TCAM, while additionally allowing range matching at wire speed. LOP_RE is a flexible packet classification system which can be customized to the requirements of the application. Ten different benchmarks were tested, with results showing that LOP_RE architectures provide high lookup rates and throughput, and consume low power and energy. Compared with a TCAM-based packet classification system (without range encoding) implemented in 65nm CMOS technology, LOP_RE saves 65% of energy consumption for the same rule set over these benchmarks.


International Conference on Hardware/Software Codesign and System Synthesis | 2009

LOP: a novel SRAM-based architecture for low power and high throughput packet classification

Xin He; Jorgen Peddersen; Sri Parameswaran

Packet classification has become an important problem to solve in modern network processors used in networking embedded systems such as routers. Algorithms for matching incoming packets from the network to pre-defined rules have been proposed by a number of researchers. Current software-based packet classification techniques have low performance, prompting many researchers to move their focus to new architectures encompassing both software and hardware components. Some of the newer hardware architectures exclusively utilize Ternary Content Addressable Memory (TCAM) to improve the performance of rule matching. However, this results in systems with high power consumption. TCAM consumes a high amount of power due to the fact that it reads the entire memory array during every access, much of which is unnecessary. In this paper, we propose LOP, a novel SRAM-based architecture where incoming packets are compared against parts of all rules simultaneously until a single matching rule is found for the compared bits in the packets. LOP significantly reduces power consumption as only a segment of the memory is compared against the incoming packet. Despite the additional time penalty to match a single packet, parallel comparison of multiple packets can improve throughput beyond that of the TCAM approaches, while consuming significantly less power. Nine different benchmarks were tested in two classification systems, with results showing that LOP architectures provide high lookup rates and high throughput with low power consumption. Compared with a state-of-the-art TCAM implementation (throughput of 495 Million Searches per Second (Msps)) in 65nm CMOS technology, on average, LOP saves 43% of energy consumption with a throughput of 590 Msps.
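The segment-by-segment narrowing that lets LOP stop reading memory early can be emulated in software: compare one bit-segment of the packet against the corresponding segment of every surviving rule, discard non-matching rules, and stop as soon as at most one candidate remains. The rule format, segment width, and early-exit policy below are illustrative assumptions, not the LOP hardware design.

```python
# Software emulation of segment-by-segment rule narrowing (illustrative;
# rule patterns, segment width, and actions are hypothetical).

def classify(packet_bits, rules, seg_len=4):
    """Narrow the candidate rule set one bit-segment at a time.
    rules: list of (pattern, action), where '*' is a don't-care bit."""
    candidates = list(range(len(rules)))
    for start in range(0, len(packet_bits), seg_len):
        seg = packet_bits[start:start + seg_len]
        candidates = [
            i for i in candidates
            if all(r in ("*", p)
                   for p, r in zip(seg, rules[i][0][start:start + seg_len]))
        ]
        if len(candidates) <= 1:
            break  # zero or one matching rule left: stop reading memory
    return rules[candidates[0]][1] if candidates else None

# Example: three non-overlapping 8-bit rules.
rules = [("0*******", "drop"), ("10******", "accept"), ("11******", "log")]
print(classify("10110000", rules))  # → accept (decided after one segment)
```

The early `break` mirrors the power saving: once the candidate set collapses, the remaining segments of the rule memory never need to be read, unlike a TCAM, which reads the entire array on every lookup.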


IEEE Computer Society Annual Symposium on VLSI | 2013

A double-width algorithmic balancing to prevent power analysis Side Channel Attacks in AES

Ankita Arora; Jude Angelo Ambrose; Jorgen Peddersen; Sri Parameswaran

Advanced Encryption Standard (AES) is one of the most widely used cryptographic algorithms in embedded systems, and is deployed in smart cards, mobile phones and wireless applications. Researchers have found various techniques to attack the encrypted data or the secret key using side channel information (execution time, power variations, electromigration, sound, etc.). Power analysis attack is the most prevalent of all Side Channel Attacks (SCAs), the most popular being Differential Power Analysis (DPA). Balancing of signal transitions is one of the methods by which a countermeasure is implemented. Existing balancing solutions to counter power analysis attacks are either costly in terms of power and area, or involve much complexity and hence lack practicality. This paper for the first time proposes a double-width single-core processor algorithmic balancing (earlier methods used two separate cores) to obfuscate power variations, resulting in a DPA-resistant system. The countermeasure only includes code/algorithmic modifications, hence it can be easily deployed in any embedded system with a 16-bit (or wider) processor. A DPA attack is demonstrated on the Double Width Single Core (DWSC) solution. The attack proved unsuccessful in finding the correct secret key. The instruction memory size overhead is only 16.6%, while data memory increases by 15.8%.


Journal of Computers | 2008

Energy Driven Application Self-Adaptation at Run-time

Jorgen Peddersen; Sri Parameswaran

Until recently, there has been a lack of methods to trade off energy use for quality of service at run-time in stand-alone embedded systems. Such systems are motivated by the need to increase the apparent available battery energy of portable devices, with minimal compromise in quality. The available systems either drew too much power or added considerable overheads due to task swapping. In this paper, we demonstrate a feasible method to perform these trade-offs. This work has been enabled by a low-impact power/energy estimating processor which utilizes counters to estimate power and energy consumption at run-time. Techniques are shown that modify multimedia applications to vary the fidelity of their output to optimize the energy/quality trade-off. Two adaptation algorithms are applied to multimedia applications, demonstrating the efficacy of the method. The method increases code size by 1% and execution time by 0.02%, yet is able to produce an output which is acceptable and processes up to double the number of frames.

Collaboration


Dive into Jorgen Peddersen's collaborations.

Top Co-Authors

Sri Parameswaran, University of New South Wales
Andhi Janapsatya, University of New South Wales
Jude Angelo Ambrose, University of New South Wales
Xin He, University of New South Wales
Josef Schneider, University of New South Wales
Kapil Batra, University of New South Wales
Aleksandar Ignjatovic, University of New South Wales