Paolo Ienne
École Polytechnique Fédérale de Lausanne
Publication
Featured research published by Paolo Ienne.
international symposium on experimental robotics | 1993
Francesco Mondada; Edoardo Franzi; Paolo Ienne
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems | 2006
Laura Pozzi; Kubilay Atasu; Paolo Ienne
In embedded computing, cost, power, and performance constraints call for the design of specialized processors, rather than for the use of existing off-the-shelf solutions. While the design of these application-specific CPUs could be tackled from scratch, a cheaper and more effective option is to extend existing processors and toolchains. Extensibility is indeed a feature now offered in real designs, e.g., by processors such as Tensilica Xtensa [T. R. Halfhill, Microprocess Rep., 2003], ARC ARCtangent [T. R. Halfhill, Microprocess Rep., 2000], STMicroelectronics ST200 [P. Faraboschi, G. Brown, J. A. Fisher, G. Desoli, and F. Homewood, Proc. 27th Annu. Int. Symp. Computer Architecture, 2000, p. 203], and MIPS CorExtend [T. R. Halfhill, Microprocess Rep., 2003]. While all these processors provide development environments with simulation capabilities for efficiently evaluating hand-crafted solutions, tools that automatically identify the best processor configuration for a given application are less common. In particular, solutions for choosing specialized instruction-set extensions (ISEs) have been investigated in the past years but are still seldom part of commercial toolchains. This paper provides a formal methodology and a set of algorithms that help address the problem. It proposes exact algorithms to derive optimal ISEs; exact identification of a single ISE is applicable to basic blocks of up to 1500 assembler-like instructions. The paper also introduces approximate methods that can process basic blocks of larger size. Results show that the described algorithms find solutions close to those that a designer would obtain by a detailed study of the application code. Both heuristic and exact algorithms find ISEs able to speed up unextended processors by up to 5.0x. State-of-the-art comparisons show that the presented algorithms outperform existing ones by up to 2.6x.
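The cut enumeration at the heart of ISE identification can be illustrated with a small sketch. The following is a minimal, brute-force version of the idea, not the paper's pruned exact algorithm: it enumerates convex subgraphs of a basic-block dataflow graph, discards those that violate register-file input/output bounds or contain forbidden operations (e.g., memory accesses), and keeps the candidate with the largest estimated latency gain. The graph library (networkx), the latency model, and the I/O bounds are illustrative assumptions.

```python
# Minimal sketch of single-ISE identification (illustrative, not the paper's
# exact algorithm). Assumes networkx for DAG bookkeeping; latencies and the
# I/O bounds are made-up parameters.
from itertools import combinations
import networkx as nx

def is_convex(g, cut):
    # A cut is convex if no node outside it lies on a path between two cut nodes.
    cut = set(cut)
    for n in set(g) - cut:
        if nx.ancestors(g, n) & cut and nx.descendants(g, n) & cut:
            return False
    return True

def io_counts(g, cut):
    cut = set(cut)
    ins = {u for v in cut for u in g.predecessors(v) if u not in cut}
    outs = {v for v in cut if g.out_degree(v) == 0
            or any(s not in cut for s in g.successors(v))}
    return len(ins), len(outs)

def identify_ise(g, sw_lat, hw_lat, max_in=4, max_out=2, forbidden=()):
    best, best_gain = None, 0
    nodes = [n for n in g if n not in forbidden]
    for k in range(2, len(nodes) + 1):            # exhaustive: only viable for tiny graphs
        for cut in combinations(nodes, k):
            if not is_convex(g, cut):
                continue
            n_in, n_out = io_counts(g, cut)
            if n_in > max_in or n_out > max_out:
                continue
            # crude merit: sequential software cost vs. hardware critical path
            hw_path = nx.dag_longest_path(g.subgraph(cut))
            gain = sum(sw_lat[v] for v in cut) - sum(hw_lat[v] for v in hw_path)
            if gain > best_gain:
                best, best_gain = set(cut), gain
    return best, best_gain
```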
international symposium on computer architecture | 2015
Zidong Du; Robert Fasthuber; Tianshi Chen; Paolo Ienne; Ling Fei Li; Tao Luo; Xiaobing Feng; Yunji Chen; Olivier Temam
In recent years, neural network accelerators have been shown to achieve both high energy efficiency and high performance for a broad application scope within the important category of recognition and mining applications. Still, both the energy efficiency and performance of such accelerators remain limited by memory accesses. In this paper, we focus on image applications, arguably the most important category among recognition and mining applications. The state-of-the-art neural networks for these applications are Convolutional Neural Networks (CNNs), and they have an important property: weights are shared among many neurons, considerably reducing the neural network's memory footprint. This property makes it possible to map a CNN entirely within an SRAM, eliminating all DRAM accesses for weights. By further hoisting this accelerator next to the image sensor, it is possible to eliminate all remaining DRAM accesses, i.e., for inputs and outputs. In this paper, we propose such a CNN accelerator, placed next to a CMOS or CCD sensor. The absence of DRAM accesses, combined with a careful exploitation of the specific data access patterns within CNNs, allows us to design an accelerator that is 60x more energy efficient than the previous state-of-the-art neural network accelerator. We present a full design down to the layout at 65 nm, with a modest footprint of 4.86 mm² and consuming only 320 mW, while still being about 30x faster than high-end GPUs.
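A back-of-the-envelope calculation makes the weight-sharing argument concrete: convolutional weights depend only on kernel and channel counts, not on the image resolution, so a whole CNN's weights can fit in on-chip SRAM. The layer shapes, weight precision, and SRAM budget below are illustrative assumptions, not figures from the paper.

```python
def conv_weight_bytes(in_ch, out_ch, k, bytes_per_weight=2):
    # each output channel shares one k x k kernel per input channel,
    # independently of how large the input image is
    return in_ch * out_ch * k * k * bytes_per_weight

layers = [          # (in_ch, out_ch, kernel) for a small, hypothetical CNN
    (1, 8, 5),
    (8, 16, 5),
    (16, 32, 3),
]
total = sum(conv_weight_bytes(i, o, k) for i, o, k in layers)
sram_budget = 288 * 1024          # assumed on-chip SRAM budget, a few hundred KB

print(f"shared weights: {total / 1024:.1f} KB; fits in SRAM: {total <= sram_budget}")
# A fully connected layer over the same feature maps would need orders of
# magnitude more storage, which is why weight sharing is the enabling property.
```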
design, automation, and test in europe | 2008
Ajay K. Verma; Philip Brisk; Paolo Ienne
Adders are one of the key components in arithmetic circuits. Enhancing their performance can significantly improve the quality of arithmetic designs. This is the reason why the theoretical lower bounds on the delay and area of an adder have been analysed, and circuits with performance close to these bounds have been designed. In this paper, we present a novel adder design that is exponentially faster than traditional adders; however, it produces incorrect results, deterministically, for a very small fraction of input combinations. We have also constructed a reliable version of this adder that can detect and correct mistakes when they occur. This creates the possibility of a variable-latency adder that produces a correct result very fast with extremely high probability; however, in some rare cases when an error is detected, the correction term must be applied and the correct result is produced after some time. Since errors occur with extremely low probability, this new type of adder is significantly faster than state-of-the-art adders when the overall latency is averaged over many additions.
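The behaviour of such a speculative, variable-latency adder can be sketched in a few lines. In the model below (an illustrative sketch, not the circuit in the paper), the carry into each bit is computed from only a fixed-size window of lower-order bits, so the result is wrong exactly when a carry chain longer than the window occurs; a checker then falls back to an exact addition. In hardware the detection would be a cheap propagate/generate check rather than a full recomputation.

```python
def speculative_add(a, b, nbits=32, window=8):
    """Speculative sum: the carry into bit i is derived from only the previous
    `window` bits, so carry chains longer than the window are missed."""
    s = 0
    for i in range(nbits):
        lo = max(0, i - window)
        mask = (1 << i) - (1 << lo)
        carry = (((a & mask) + (b & mask)) >> i) & 1   # carry from the truncated window
        s |= (((a >> i) ^ (b >> i) ^ carry) & 1) << i
    return s

def variable_latency_add(a, b, nbits=32, window=8):
    fast = speculative_add(a, b, nbits, window)
    exact = (a + b) & ((1 << nbits) - 1)
    if fast != exact:            # rare: apply the correction and pay extra latency
        return exact, "slow"
    return fast, "fast"          # common case: the speculative result is already correct
```

For random operands, carry chains longer than the window are rare, so the average latency stays close to that of the fast path.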
international symposium on systems synthesis | 2002
Frederic Worm; Paolo Ienne; Patrick Thiran; G. De Micheli
Systems-on-Chip (SoC) are evolving toward complex heterogeneous multiprocessors made of many predesigned macrocells or subsystems with application-specific interconnections. Intra-chip interconnects are thus becoming one of the central elements of SoC design and pose conflicting goals in terms of low energy per transmitted bit, guaranteed signal integrity, and ease of design. This work introduces and shows first results on a novel interconnect system which uses low-swing signalling, error detection codes, and a retransmission scheme; it minimises the interconnect voltage swing and frequency subject to workload requirements and S/N conditions. Simulation results show that tangible savings in energy can be attained while at the same time achieving more robustness to large variations in actual workload, noise, and technology quality (all quantities easily mispredicted in very complex systems and advanced technologies). It can be argued that the traditional worst-case, correct-by-design paradigm will be less and less applicable in future multibillion-transistor SoCs and deep sub-micron technologies; this work represents a first step towards robust adaptive designs.
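The trade-off the scheme exploits can be captured with a toy model: lowering the swing reduces the energy of each transfer quadratically but raises the bit-error rate, and with error detection plus retransmission the relevant metric becomes the expected energy per correctly delivered word. The capacitance, noise level, and Gaussian error model below are assumptions chosen only to illustrate that this expected energy has a minimum at some intermediate swing.

```python
import math

C_LINE = 1e-12      # effective capacitance per wire, assumed
WIRES = 32          # bus width
SIGMA = 0.05        # noise standard deviation in volts, assumed

def expected_energy_per_word(v_swing):
    e_attempt = WIRES * C_LINE * v_swing ** 2                        # dynamic energy of one transfer
    p_bit = 0.5 * math.erfc((v_swing / 2) / (SIGMA * math.sqrt(2)))  # Gaussian-noise bit error rate
    p_word_ok = (1 - p_bit) ** WIRES
    return e_attempt / p_word_ok             # geometric number of detect-and-retransmit attempts

best_v = min((v / 100 for v in range(10, 121)), key=expected_energy_per_word)
print(f"minimum-energy swing in this toy model: {best_v:.2f} V")
```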
compilers, architecture, and synthesis for embedded systems | 2005
Laura Pozzi; Paolo Ienne
Customisable embedded processors are becoming available on the market, making it possible for designers to speed up execution of applications by using Application-specific Functional Units (AFUs) implementing Instruction-Set Extensions (ISEs). As these processors have become available, the state of the art in automatic ISE identification has been improving; many algorithms have been proposed for choosing, given the application's source code, the best ISEs under various constraints. Read and write ports between the AFUs and the processor register file are an expensive asset, fixed in the microarchitecture---some processors indeed only allow two read ports and one write port---while, on the other hand, a large availability of inputs and outputs to and from the AFUs enables high speedups. This paper proposes a solution to the limitation of actual register-file ports by serialising register-file access, thereby allowing multi-cycle reads and writes. It does so in an innovative way for two reasons: (1) it exploits and builds on the progress in ISE identification under constraint [1, 16, 4, 19], and (2) it combines register-file access serialisation with pipelining in order to obtain the best global solution. This paper proposes an algorithm for scheduling graphs---corresponding to ISEs---under input/output constraints; experiments show that by using the proposed method applications can be sped up tangibly: speedup for low I/O constraints is 32% better on average, and 65% better at best, than that obtained by state-of-the-art techniques.
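The effect of serialising register-file accesses can be sketched with a toy latency model, shown below. It is not the scheduling algorithm from the paper: external inputs are read two per cycle, internally produced values are forwarded inside the AFU for free, and results are written back one per cycle; the example graph, latencies, and port counts are illustrative assumptions.

```python
def ise_latency(inputs, outputs, depends, op_latency, reads_per_cycle=2):
    """Toy latency of an ISE whose operands must pass through a 2-read/1-write
    register file; `depends` maps each operation to its operand names and is
    assumed to be topologically ordered."""
    ready = {}
    for i, name in enumerate(inputs):                 # serialise reads on the two read ports
        ready[name] = i // reads_per_cycle + 1
    for op, operands in depends.items():              # ASAP schedule of internal operations
        ready[op] = max(ready[d] for d in operands) + op_latency.get(op, 1)
    t = 0
    for name in sorted(outputs, key=lambda n: ready[n]):
        t = max(t + 1, ready[name])                   # one write-back per cycle on the write port
    return t

# e.g. three inputs, one output: y = (a * b) + c, with a 2-cycle multiply
print(ise_latency(["a", "b", "c"], ["y"],
                  {"m": ["a", "b"], "y": ["m", "c"]}, {"m": 2}))
```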
IEEE Transactions on Very Large Scale Integration Systems | 2005
Frederic Worm; Paolo Ienne; Patrick Thiran; G. De Micheli
Systems-on-Chip (SoC) design involves several challenges, stemming from the extreme miniaturization of the physical features and from the large number of devices and wires on a chip. Since most SoCs are used within embedded systems, specific concerns are increasingly related to correct, reliable, and robust operation. We believe that in the future most SoCs will be assembled from large-scale macro-cells and interconnected by means of on-chip networks. We examine some physical properties of on-chip interconnect busses, with the goal of achieving fast, reliable, and low-energy communication. These objectives are reached by dynamically scaling down the voltage swing, while ensuring data integrity (in spite of the decreased signal-to-noise ratio) by means of encoding and retransmission schemes. In particular, we describe a closed-loop voltage-swing controller that samples the error retransmission rate to determine the operational voltage swing. We present a control policy which achieves our goals with minimal complexity; such simplicity is demonstrated by implementing the policy in a synthesizable controller. Such a controller is an embodiment of a self-calibrating circuit that compensates for significant manufacturing parameter deviations and environmental variations. Experimental results show energy savings of up to 42%, while performance requirements are still met.
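A controller of the kind described can be sketched as a simple feedback loop: count retransmissions over a sampling window and nudge the swing up when the error rate is too high, down when it is comfortably low. The thresholds, step size, and window length below are illustrative assumptions rather than the policy synthesized in the paper.

```python
class SwingController:
    """Toy closed-loop voltage-swing controller driven by the retransmission rate."""

    def __init__(self, v_min=0.2, v_max=1.2, step=0.05,
                 window=1024, err_high=0.01, err_low=0.001):
        self.v = v_max                               # start conservatively at full swing
        self.v_min, self.v_max, self.step = v_min, v_max, step
        self.window, self.err_high, self.err_low = window, err_high, err_low
        self.sent = self.errors = 0

    def on_word(self, retransmitted):
        """Call once per transmitted word; returns the swing to use next."""
        self.sent += 1
        self.errors += int(retransmitted)
        if self.sent < self.window:
            return self.v
        rate = self.errors / self.sent
        if rate > self.err_high:                     # too unreliable: back off towards robustness
            self.v = min(self.v + self.step, self.v_max)
        elif rate < self.err_low:                    # comfortably reliable: try to save energy
            self.v = max(self.v - self.step, self.v_min)
        self.sent = self.errors = 0
        return self.v
```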
IEEE Design & Test of Computers | 2005
M. Vuletić; Laura Pozzi; Paolo Ienne
Ideally, reconfigurable-system programmers and designers should code algorithms and write hardware accelerators independently of the underlying platform. To realize this scenario, the authors propose a portable, hardware-agnostic programming paradigm, which delegates platform-specific tasks to a system-level virtualization layer. This layer supports a chosen programming model and hides platform details from users much as general-purpose computers do. We introduce a multithreaded programming model for reconfigurable computing based on a unified virtual-memory image for both software and hardware application parts. We also address the challenge of achieving seamless hardware-software interfacing and portability with minimal performance penalties.
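The flavour of the programming model can be conveyed with a hypothetical sketch: the application launches a hardware accelerator much like a software thread, both sides see the same virtual-memory image, and the virtualization layer is responsible for platform-specific configuration and data movement. Every name below (VirtualizationLayer, hw_thread, the bitstream file) is invented for illustration and is not the interface from the paper.

```python
import threading

class VirtualizationLayer:
    """Hypothetical system-level layer hiding the reconfigurable platform."""

    def hw_thread(self, bitstream, shared_buffer):
        # In a real system this would configure the FPGA with `bitstream`,
        # service the accelerator's accesses to the shared virtual memory,
        # and signal completion; here it is only a placeholder thread.
        def run():
            pass
        t = threading.Thread(target=run)
        t.start()
        return t

layer = VirtualizationLayer()
frame = bytearray(64 * 1024)                 # ordinary virtual memory, visible to both parts
accelerator = layer.hw_thread("idct.bit", frame)
accelerator.join()                           # software resumes once the hardware part finishes
```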
design automation conference | 2004
Partha Biswas; Vinay Choudhary; Kubilay Atasu; Laura Pozzi; Paolo Ienne; Nikil D. Dutt
Automatic generation of Instruction Set Extensions (ISEs), to be executed on a custom processing unit or a coprocessor, is an important step towards processor customization. A typical goal of a manual designer is to combine a large number of atomic instructions into an ISE satisfying microarchitectural constraints. However, memory operations pose a challenge to previous ISE approaches by limiting the size of the resulting instructions. In this paper, we introduce memory elements into custom units, resulting in ISEs closer to those sought after by designers. We consider two kinds of memory elements for mapping to the specialized hardware: small hardware tables and architecturally visible state registers. We devised a genetic algorithm to specifically exploit opportunities for introducing memory elements during ISE generation. Finally, we demonstrate the effectiveness of our approach by a detailed study of the variation in performance, area, and energy in the presence of the generated ISEs, on a number of MediaBench, EEMBC, and cryptographic applications. With the introduction of memory, the average speedup varied from 2.7x to 5x depending on the architectural configuration, with a nominal area overhead. Moreover, we obtained an average energy reduction of 26% with respect to a 32-KB cache.
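A genetic formulation of the kind mentioned above can be sketched compactly: a chromosome selects which dataflow operations join the ISE and whether a small hardware table is instantiated, and the fitness trades estimated cycle savings against a constraint penalty. The cost model, constraint, and constants below are toy assumptions, not the paper's evaluation.

```python
import random

N_OPS = 20                                   # operations in a hypothetical basic block

def fitness(chrom):
    ops, use_table = chrom[:N_OPS], chrom[N_OPS]
    moved = sum(ops)
    gain = moved - max(1, moved // 4)        # sequential software cost vs. crude parallel-HW cost
    if use_table:
        gain += 3                            # e.g. a lookup table replaces a short computation
    if moved > 12:                           # toy microarchitectural constraint (AFU area budget)
        gain -= 10 * (moved - 12)
    return gain

def evolve(pop_size=50, generations=100, p_mut=0.02):
    pop = [[random.randint(0, 1) for _ in range(N_OPS + 1)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]     # elitist selection
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = random.sample(survivors, 2)
            cut = random.randrange(1, N_OPS)                  # one-point crossover
            child = a[:cut] + b[cut:]
            children.append([g ^ (random.random() < p_mut) for g in child])  # bit-flip mutation
        pop = survivors + children
    return max(pop, key=fitness)

best = evolve()
print(fitness(best), best)
```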
international conference on computer aided design | 2004
Ajay K. Verma; Paolo Ienne
The increasing importance of datapath circuits in complex systems-on-chip calls for special arithmetic optimisations. The goal is to achieve automatically the handcrafted results which escape classic logic optimisations. Some work has been done in recent years to infer the use of the carry-save representation in the synthesis of arithmetic circuits. Yet, many cases of practical interest cannot be handled due to the scattering of logic operations among the arithmetic ones, especially in arithmetic computations which are originally described at the bit level in high-level languages such as C. We therefore introduce an algorithm to restructure dataflow graphs so that they can be synthesized into high-quality arithmetic circuits, close to those that an expert designer would conceive. On typical embedded software benchmarks which could be advantageously implemented with hardware accelerators, our technique consistently reduces the critical path tangibly, by up to 46%, and generally achieves the quality of manual implementations. In many cases, our algorithm also reduces the cell area by 10-20%.
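The carry-save representation that such restructuring exposes can be demonstrated in a few lines: a chain of additions is rewritten with 3:2 compressors (bitwise full adders) that defer all carry propagation to one final carry-propagate addition. The operand values are arbitrary; the sketch only illustrates the arithmetic identity the synthesis relies on.

```python
def compress_3_2(a, b, c):
    """One carry-save stage: three operands in, a (sum, carry) pair out,
    with no carry propagation across bit positions."""
    s = a ^ b ^ c
    carry = ((a & b) | (b & c) | (a & c)) << 1
    return s, carry

def carry_save_sum(operands):
    terms = list(operands)
    while len(terms) > 2:                    # each 3:2 stage removes one term
        a, b, c, *rest = terms
        terms = list(compress_3_2(a, b, c)) + rest
    return terms[0] + terms[1]               # single final carry-propagate adder

assert carry_save_sum([13, 200, 7, 42]) == 13 + 200 + 7 + 42
```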