Network


Latest external collaborations at the country level. Dive into the details by clicking on the dots.

Hotspot


Dive into the research topics where Paolo Faraboschi is active.

Publication


Featured research published by Paolo Faraboschi.


International Symposium on Computer Architecture | 2000

Lx: a technology platform for customizable VLIW embedded processing

Paolo Faraboschi; Geoffrey Brown; Joseph A. Fisher; Giuseppe Desoli; Fred Homewood

Lx is a scalable and customizable VLIW processor technology platform designed by Hewlett-Packard and STMicroelectronics that allows variations in instruction issue width, the number and capabilities of structures, and the processor instruction set. For Lx we developed the architecture and software from the beginning to support both scalability (variable numbers of identical processing resources) and customizability (special-purpose resources). In this paper we consider the following issues. When is customization or scaling beneficial? How can one determine the right degree of customization or scaling for a particular application domain? What architectural compromises were made in the Lx project to contain the complexity inherent in a customizable and scalable processor family? The experiments described in the paper show that specialization for an application domain is effective, yielding large gains in price/performance ratio. We also show how scaling machine resources scales performance, although not uniformly across all applications. Finally, we show that customization on an application-by-application basis is still very dangerous today, and much remains to be done for it to become a viable solution.


International Symposium on Microarchitecture | 1996

Custom-fit processors: letting applications define architectures

Joseph A. Fisher; Paolo Faraboschi; Giuseppe Desoli

In this paper we report on a system which automatically designs realistic VLIW architectures highly optimized for one given application (the input to this system), while running all other code correctly. The system uses a product-quality compiler that generates very aggressive VLIW code. We retarget the compiler until we have found a VLIW architecture idealized for the application on the basis of performance, a cost function, and a hardware budget. We show that we can automatically select architectures that achieve large speedups on color and image processing codes. Specialization is shown to be very valuable: the differences between architectural choices, even among reasonable-seeming architectures having similar costs, can be very great, often a factor of 5 (and sometimes much more). We also show that specialization is very dangerous. A reasonable choice of architecture to fit one algorithm can be a very poor choice for another, even in the same domain. There is sometimes an architecture, near in cost and performance to the best, that does much better on a second algorithm.
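The search loop the abstract describes (retarget, evaluate, keep the best architecture within a hardware budget) can be sketched as follows. This is a toy illustration, not the paper's system: the configuration space, the cost model, and the `evaluate` function are hypothetical stand-ins for the product-quality compiler and evaluation the authors used.

```python
from itertools import product

def best_architecture(evaluate, cost, budget):
    """evaluate(arch) -> cycles for the target application (lower is better);
    cost(arch) -> hardware cost. arch = (issue_width, num_alus, num_mults)."""
    candidates = product([2, 4, 8], [1, 2, 4], [0, 1, 2])
    affordable = [a for a in candidates if cost(a) <= budget]
    return min(affordable, key=evaluate) if affordable else None

# Illustrative toy models: cost grows with resources; this ALU-heavy kernel
# speeds up with issue width and ALUs but gains nothing from multipliers.
cost = lambda a: a[0] * 10 + a[1] * 5 + a[2] * 8
evaluate = lambda a: 1000 // (a[0] * a[1])   # multipliers are dead weight here
arch = best_architecture(evaluate, cost, budget=100)
```

Even in this toy form, the sketch shows the paper's point: two affordable configurations of similar cost can differ greatly in performance, so the choice must be driven by the application's own profile.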


International Symposium on Microarchitecture | 2002

DELI: a new run-time control point

Giuseppe Desoli; Nikolay Mateev; Evelyn Duesterwald; Paolo Faraboschi; Joseph A. Fisher

The Dynamic Execution Layer Interface (DELI) offers the following unique capability: it provides fine-grain control over the execution of programs, by allowing its clients to observe and optionally manipulate every single instruction - at run time - just before it runs. DELI accomplishes this by opening up an interface to the layer between the execution of software and hardware. To avoid the slowdown, DELI caches a private copy of the executed code and always runs out of its own private cache. In addition to giving powerful control to clients, DELI opens up caching and linking to ordinary emulators and just-in-time compilers, which then get the reuse benefits of the same mechanism. For example, emulators themselves can also use other clients, to mix emulation with already existing services, native code, and other emulators. This paper describes the basic aspects of DELI, including the underlying caching and linking mechanism, the Hardware Abstraction Mechanism (HAM), the Binary-Level Translation (BLT) infrastructure, and the Application Programming Interface (API) exposed to the clients. We also cover some of the services that clients could offer through the DELI, such as ISA emulation, software patching, and sandboxing. Finally, we consider a case study of emulation in detail: the emulation of a PocketPC system on the Lx/ST210 embedded VLIW processor. In this case, DELI enables us to achieve near-native performance, and to mix-and-match native and emulated code.
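The caching behavior described above (translate a code fragment once, then always run out of the private code cache) can be illustrated with a minimal sketch. All names here are hypothetical; the real DELI operates on native instructions beneath the software/hardware boundary, not on Python callables.

```python
translations = 0

class CodeCache:
    """Caches a translated copy of each code fragment so that repeated
    executions run out of the private cache instead of re-translating."""

    def __init__(self, translate):
        self.translate = translate  # client-supplied translation/instrumentation pass
        self.cache = {}             # fragment id -> translated fragment

    def run(self, fragment_id, fragment):
        if fragment_id not in self.cache:             # cold: translate once
            self.cache[fragment_id] = self.translate(fragment)
        return self.cache[fragment_id]()              # hot: run the cached copy

def counting_translate(fragment):
    # A trivial "client" pass: here it only counts how often translation
    # actually happens; a real client could instrument or rewrite the code.
    global translations
    translations += 1
    return fragment

cache = CodeCache(counting_translate)

def f():
    return 42

for _ in range(3):
    result = cache.run("f", f)   # translated on the first call only
```

The same cache-and-link mechanism is what the paper exposes to emulators and just-in-time compilers, so they amortize translation cost across repeated executions.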


ACM SIGARCH Computer Architecture News | 2009

How to simulate 1000 cores

Matteo Monchiero; Jung Ho Ahn; Ayose Falcón; Daniel Ortega; Paolo Faraboschi

This paper proposes a novel methodology to efficiently simulate shared-memory multiprocessors composed of hundreds of cores. The basic idea is to use thread-level parallelism in the software system and translate it into core-level parallelism in the simulated world. To achieve this, we first augment an existing full-system simulator to identify and separate the instruction streams belonging to the different software threads. Then, the simulator dynamically maps each instruction flow to the corresponding core of the target multi-core architecture, taking into account the inherent thread synchronization of the running applications. Our simulator allows a user to execute any multithreaded application in a conventional full-system simulator and evaluate the performance of the application on many-core hardware. We carried out extensive simulations on the SPLASH-2 benchmark suite and demonstrated scalability up to 1024 cores with limited simulation speed degradation vs. the single-core case on a fixed workload. The results also show that the proposed technique captures the intrinsic behavior of the SPLASH-2 suite, even when we scale up the number of shared-memory cores beyond the thousand-core limit.
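The core idea (separating a full-system instruction trace into per-thread streams and mapping each software thread onto its own simulated core) can be sketched as follows. The trace format and the first-come round-robin core assignment are illustrative assumptions, not the simulator's actual implementation, which also models thread synchronization.

```python
from collections import defaultdict

def map_threads_to_cores(trace, num_cores):
    """trace: interleaved (thread_id, instruction) pairs, as a full-system
    simulator would emit them. Returns per-core instruction streams."""
    thread_to_core = {}
    core_streams = defaultdict(list)
    for tid, insn in trace:
        if tid not in thread_to_core:
            # First time we see this thread: assign it the next core.
            thread_to_core[tid] = len(thread_to_core) % num_cores
        core_streams[thread_to_core[tid]].append(insn)
    return dict(core_streams)

# Three software threads interleaved in one trace become three core streams.
trace = [(0, "ld"), (1, "add"), (0, "st"), (2, "mul"), (1, "br")]
streams = map_threads_to_cores(trace, num_cores=4)
```

This is what turns thread-level parallelism in the simulated software into core-level parallelism in the simulated machine.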


IEEE Signal Processing Magazine | 1998

The latest word in digital and media processing

Paolo Faraboschi; Giuseppe Desoli; Joseph A. Fisher

The article discusses the technology behind the resurgence of DSP-oriented microprocessors and the techniques that allow one to use them well. After an overview of the VLIW architecture, it discusses three main areas: VLIW architectural features relevant to DSP applications; their associated compiler techniques; and coding techniques that allow the application programmer, while still coding in a high-level language, to best exploit the architecture.


Microprocessors and Microsystems | 2014

TERAFLUX: Harnessing dataflow in next generation teradevices

Roberto Giorgi; Rosa M. Badia; François Bodin; Albert Cohen; Paraskevas Evripidou; Paolo Faraboschi; Bernhard Fechner; Guang R. Gao; Arne Garbade; Rahulkumar Gayatri; Sylvain Girbal; Daniel Goodman; Behram Khan; Souad Koliai; Joshua Landwehr; Nhat Minh Lê; Feng Li; Mikel Luján; Avi Mendelson; Laurent Morin; Nacho Navarro; Tomasz Patejko; Antoniu Pop; Pedro Trancoso; Theo Ungerer; Ian Watson; Sebastian Weis; Stéphane Zuckerman; Mateo Valero

The improvements in semiconductor technologies are gradually enabling extreme-scale systems such as teradevices (i.e., chips composed of 1,000 billion transistors), most likely by 2020. Three major challenges have been identified: programmability, manageable architecture design, and reliability. TERAFLUX is a Future and Emerging Technology (FET) large-scale project funded by the European Union, which addresses such challenges at once by leveraging the dataflow principles. This paper presents an overview of the research carried out by the TERAFLUX partners and some preliminary results. Our platform comprises 1000+ general-purpose cores per chip in order to properly explore the above challenges. An architectural template has been proposed and applications have been ported to the platform. Programming models, compilation tools, and reliability techniques have been developed. The evaluation is carried out by leveraging modifications of the HP Labs COTSon simulator.


Digital Systems Design | 2013

The TERAFLUX Project: Exploiting the DataFlow Paradigm in Next Generation Teradevices

Marco Solinas; Rosa M. Badia; François Bodin; Albert Cohen; Paraskevas Evripidou; Paolo Faraboschi; Bernhard Fechner; Guang R. Gao; Arne Garbade; Sylvain Girbal; Daniel Goodman; Behran Khan; Souad Koliai; Feng Li; Mikel Luján; Laurent Morin; Avi Mendelson; Nacho Navarro; Antoniu Pop; Pedro Trancoso; Theo Ungerer; Mateo Valero; Sebastian Weis; Ian Watson; Stéphane Zuckermann; Roberto Giorgi

Thanks to the improvements in semiconductor technologies, extreme-scale systems such as teradevices (i.e., composed of 1,000 billion transistors) will enable systems with 1000+ general-purpose cores per chip, probably by 2020. Three major challenges have been identified: programmability, manageable architecture design, and reliability. TERAFLUX is a Future and Emerging Technology (FET) large-scale project funded by the European Union, which addresses such challenges at once by leveraging the dataflow principles. This paper describes the project and provides an overview of the research carried out by the TERAFLUX consortium.


International Symposium on Microarchitecture | 2011

System-level integrated server architectures for scale-out datacenters

Sheng Li; Kevin T. Lim; Paolo Faraboschi; Jichuan Chang; Parthasarathy Ranganathan; Norman P. Jouppi

A System-on-Chip (SoC) integrates multiple discrete components into a single chip, for example by placing CPU cores, network interfaces, and I/O controllers on the same die. While SoCs have dominated high-end embedded products for over a decade, system-level integration is a relatively new trend in servers, and is driven by the opportunity to lower cost (by reducing the number of discrete parts) and power (by reducing the pin crossings from the cores to the I/O). Today, the mounting cost pressures in scale-out datacenters demand technologies that can decrease the Total Cost of Ownership (TCO). At the same time, the diminishing return of dedicating the increasing number of available transistors to more cores and caches is creating a stronger case for SoC-based servers.


High Performance Distributed Computing | 2012

Exploring the performance and mapping of HPC applications to platforms in the cloud

Abhishek Gupta; Laxmikant V. Kalé; Dejan S. Milojicic; Paolo Faraboschi; Richard Kaufmann; Verdi March; Filippo Gioachin; Chun Hui Suen; Bu-Sung Lee

This paper presents a scheme to optimize the mapping of HPC applications to a set of hybrid dedicated and cloud resources. First, we characterize application performance on dedicated clusters and in the cloud to obtain application signatures. Then, we propose an algorithm to match these signatures to resources such that performance is maximized and cost is minimized. Finally, we show simulation results revealing that, in a concrete scenario, our proposed scheme reduces cost by 60% at only a 10-15% performance penalty vs. a non-optimized configuration. We also find that the execution overhead in the cloud can be minimized to a negligible level using thin hypervisors or OS-level containers.
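The matching step (for each application, pick the cheapest platform whose predicted slowdown against the dedicated baseline stays within a tolerated penalty) can be sketched as a feasibility filter followed by cost minimization. The platform names and numbers below are made up for illustration and are not the paper's data or its actual algorithm.

```python
def choose_platform(signature, platforms, max_penalty=0.15):
    """signature: application runtime on the dedicated cluster (the baseline).
    platforms: dict name -> (runtime, cost_per_hour). Returns the cheapest
    platform whose slowdown vs. the baseline is within max_penalty."""
    feasible = [
        (cost * runtime, name)                        # total cost of the run
        for name, (runtime, cost) in platforms.items()
        if runtime / signature - 1.0 <= max_penalty   # performance constraint
    ]
    return min(feasible)[1] if feasible else None

platforms = {
    "dedicated": (100.0, 1.00),
    "cloud_thin_vm": (112.0, 0.40),   # ~12% slower but much cheaper per hour
    "cloud_heavy_vm": (140.0, 0.30),  # excluded: 40% penalty exceeds the bound
}
best = choose_platform(100.0, platforms)
```

Under these illustrative numbers, the thin-VM cloud platform wins: it meets the performance bound and costs far less than the dedicated run, mirroring the paper's cost-vs-penalty trade-off.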


IEEE International Conference on Cloud Computing Technology and Science | 2013

The Who, What, Why, and How of High Performance Computing in the Cloud

Abhishek Gupta; Laxmikant V. Kalé; Filippo Gioachin; Verdi March; Chun Hui Suen; Bu-Sung Lee; Paolo Faraboschi; Richard Kaufmann; Dejan S. Milojicic

Cloud computing is emerging as an alternative to supercomputers for some of the high-performance computing (HPC) applications that do not require a fully dedicated machine. With cloud as an additional deployment option, HPC users are faced with the challenges of dealing with highly heterogeneous resources, where the variability spans across a wide range of processor configurations, interconnections, virtualization environments, and pricing rates and models. In this paper, we take a holistic viewpoint to answer the question - why and who should choose cloud for HPC, for what applications, and how should cloud be used for HPC? To this end, we perform a comprehensive performance evaluation and analysis of a set of benchmarks and complex HPC applications on a range of platforms, varying from supercomputers to clouds. Further, we demonstrate HPC performance improvements in the cloud using alternative lightweight virtualization mechanisms - thin VMs and OS-level containers, and hypervisor- and application-level CPU affinity. Next, we analyze the economic aspects and business models for HPC in clouds. We believe this is an important area that has not been sufficiently addressed by past research. Overall results indicate that current public clouds are cost-effective only at small scale for the chosen HPC applications, when considered in isolation, but can complement supercomputers using business models such as cloud burst and application-aware mapping.

Collaboration


Dive into Paolo Faraboschi's collaborations.

Top Co-Authors

Geoffrey Brown

Indiana University Bloomington
