Peeter Ellervee | Researchain

Publication

Featured research published by Peeter Ellervee.

design automation conference | 1999

Lowering power consumption in clock by using globally asynchronous locally synchronous design style

Ahmed Hemani; Thomas Meincke; Shashi Kumar; Adam Postula; Thomas Olsson; Peter Nilsson; Johnny Öberg; Peeter Ellervee; Dan Lundqvist

Clock power consumption in large high-performance VLSIs can be reduced by adopting the globally asynchronous, locally synchronous (GALS) design style. GALS has small overheads for the global asynchronous communication and local clock generation. We propose methods to (a) evaluate the benefits of GALS and account for its overheads, which can be used as the basis for partitioning the system into an optimal number and size of synchronous blocks, and (b) automate the synthesis of the global asynchronous communication. Three realistic ASICs, ranging in complexity from 1 to 3 million gates, were used to evaluate GALS benefits and overheads. The results show an average clock power saving of about 70% with negligible overheads.

field programmable gate arrays | 1994

A case study on hardware/software partitioning

Axel Jantsch; Peeter Ellervee; Johnny Öberg; Ahmed Hemani

We present an analysis of a fully automatic method to accelerate standard software in C or C++ by use of field programmable gate arrays. Traditional compiler techniques are applied to the hardware/software partitioning problem and a compiler is linked to state of the art hardware synthesis tools. Time critical regions are identified by means of profiling and are automatically implemented in user programmable logic with high level and logic synthesis design tools. The underlying architecture is an add-on board with user programmable logic connected to a SPARC-based workstation via the system bus. We present an analysis and case study of this method. Eight programs are used as test cases and the data collected by applying this method to programs is used to discuss potentials and limitations of this and similar methods. We discuss architectural parameters, programming language properties, and analysis techniques.

international symposium on circuits and systems | 1999

Globally asynchronous locally synchronous architecture for large high-performance ASICs

Thomas Meincke; Ahmed Hemani; Shashi Kumar; Peeter Ellervee; Johnny Öberg; Thomas Olsson; Peter Nilsson; Dan Lindqvist; Hannu Tenhunen

Clock nets are the major source of power consumption in large, high-performance ASICs and a design bottleneck when it comes to tolerable clock skew. A way to obviate the global clock net is to partition the design into large synchronous blocks, each having its own clock. Data is exchanged with other blocks asynchronously using handshake signals. Adopting such a strategy requires a methodology that supports: 1) a partitioning method dividing a design into a number of synchronous blocks such that the gain due to global clock net removal exceeds the communication overhead and 2) synthesis of handshake protocols to implement the data transfer between synchronous blocks. We describe this methodology and present results of applying it to a realistic design done in 0.25 micron, ranging in operating frequencies from 20 MHz to 1 GHz. The results show that the net power savings compared to fully synchronous designs are on average about 30%.

Proceedings 25th EUROMICRO Conference. Informatics: Theory and Practice for the New Millennium | 1999

Exploiting data transfer locality in memory mapping

Peeter Ellervee; Miguel Miranda; Francky Catthoor; Ahmed Hemani

System-level exploration of memory architectures is one of the key issues in successful implementation of data-transfer dominated applications. Usually, one of the main design bottlenecks is the memory access bandwidth. Transformations that rearrange the layout of the data records stored in memory are very effective at improving the locality of the data transfers but usually lead to large memory bit-wastage when not performed carefully. In this paper, a methodology which reduces memory bandwidth requirements without sacrificing storage space is proposed. The methodology exploits parallelism in the data transfers to rearrange the layout of the data records. A distributed memory organization combined with our proposed layout rearrangement methodology effectively reduces the memory bandwidth bottleneck in data-transfer dominated applications.

Proceedings. 24th EUROMICRO Conference (Cat. No.98EX204) | 1998

Revolver: a high-performance MIMD architecture for collision free computing

Johnny Öberg; Peeter Ellervee

One of the main bottlenecks when using massively parallel processors, both RISC and CISC, and VLIW style processors has been the identification of potential parallelism in the tasks. Multi-threaded techniques for exploiting instruction- and data-level parallelism have gained renewed interest since high degrees of pipelining, caused by the increasing clock frequencies, introduce extra dependencies between instructions. Sophisticated methods implementing branch prediction and pipeline flushing during interrupts must be adopted, which in addition place more requirements on the compilers. We present an interleaved processing architecture we call the Revolver Architecture together with a technique we call register file folding, which relieves the MIMD architecture of these dependencies to allow for collision free computing. We also discuss the implementation of the Revolver as a multi-threaded processor core, based on our presented techniques, together with some architectural strategies for implementing the Revolver Architecture as a DSP core.

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems | 2001

System-level data-format exploration for dynamically allocated data structures

Peeter Ellervee; Miguel Miranda; Francky Catthoor; Ahmed Hemani

System-level exploration of memory organizations is a key issue in successful implementation of data dominated applications based on dynamically allocated data structures involving records and access keys. This paper presents a formalized technique for exploring different memory data-format alternatives when only the system level functional behavior of the application has been defined. Our data-format exploration approach substantially reduces the number of accessed bits by rearranging the format of the data records. The technique exploits parallelism in the data transfer by analyzing the dependencies between data-record accesses. As a result, significant reductions in memory size, bandwidth, and power are obtained. We have validated our techniques using several real-life asynchronous transfer mode cell processing applications, where we have obtained reductions in memory size (up to 20%), power (up to 60%), and bandwidth.

international conference on vlsi design | 1996

A novel allocation strategy for control and memory intensive telecommunication circuits

Bengt Svantesson; Ahmed Hemani; Peeter Ellervee; A. Postula; Johnny Öberg; Axel Jantsch; Hannu Tenhunen

Communication sub-systems that deal with switching, routing and protocol implementation often have their functionality dominated by control logic and interaction with memory. Synthesis of such Control and Memory Intensive Systems (hereafter abbreviated to CMISTs) poses demands that in the past have not been met satisfactorily by general purpose high-level synthesis (HLS) tools and have led to several research efforts to address these demands. In this paper we: characterise CMISTs from the synthesis viewpoint; contend that the synthesis demands of CMISTs can be met within the framework of a general purpose high-level synthesis tool, by making parts of it adaptive to the input, rather than developing a complete tool for a particular type of application; present an allocation strategy that automatically adapts for CMISTs; present the Operation and Maintenance (OAM) Protocol of the ATM, its modelling in VHDL and synthesis aspects of the VHDL model; present the results of applying the synthesis methodology to the OAM as a test case, compared with the results from two commercial high-level synthesis tools; prove the efficacy of the proposed synthesis methodology by applying it to an industrial design and comparing our results to those obtained by designing manually at register-transfer level and to the results from two commercial HLS tools.

international conference on asic | 1995

High-level synthesis of control and memory intensive communication systems

Ahmed Hemani; Bengt Svantesson; Peeter Ellervee; Adam Postula; Johnny Öberg; Axel Jantsch; Hannu Tenhunen

Communication sub-systems that deal with switching, routing and protocol implementation often have their functionality dominated by control logic and interaction with memory. Synthesis of such Control and Memory Intensive Systems (hereafter abbreviated to CMISTs) poses demands that in the past have not been met satisfactorily by general purpose high-level synthesis (HLS) tools and have led to several research efforts to address these demands. In this paper we: characterise CMISTs from the synthesis viewpoint; present a synthesis methodology adapted for CMISTs; present the Operation and Maintenance (OAM) Protocol of the ATM, its modelling in VHDL and synthesis aspects of the VHDL model; present the results of applying the synthesis methodology to the OAM as a test case, with the results compared to those obtained using the non-adapted general purpose high-level synthesis tool; prove the efficacy of the proposed synthesis methodology by applying it to an industrial design and comparing our results to the results from two commercial HLS tools and to the results obtained by designing manually at register-transfer level.

international conference on asic | 1994

Exploring ASIC design space at system level with a neural network estimator

Peeter Ellervee; Axel Jantsch; Johnny Öberg; Ahmed Hemani; Hannu Tenhunen

Estimators are critical tools in carrying out architectural level exploration of the design space. We present a novel approach to estimation based on the multilayer perceptron which builds the estimation function during the learning process and thus allows the description of arbitrarily complex functions. We also describe how the control data flow graph is encoded for the neural network input and present results of the first experiments made with realistic design examples.

field-programmable custom computing machines | 2014

Customizable Compression Architecture for Efficient Configuration in CGRAs

Syed Mohammad Asad Hassan Jafri; Adeel Tajammul; Masoud Daneshtalab; Ahmed Hemani; Kolin Paul; Peeter Ellervee; Juha Plosila; Hannu Tenhunen

Today, Coarse Grained Reconfigurable Architectures (CGRAs) host multiple applications. Novel CGRAs allow each application to exploit runtime parallelism and time sharing. Although these features enhance the power and silicon efficiency, they significantly increase the configuration memory overheads. As a solution to this problem researchers have employed statistical compression, intermediate compact representation, and multicasting. Each of these techniques has different properties, and is therefore best suited for a particular class of applications. However, existing research only deals with these methods separately. In this paper we propose a morphable compression architecture that interleaves these techniques in a unique platform.
