Uzi Shvadron | Researchain

Publication

Featured researches published by Uzi Shvadron.

signal processing systems | 2000

Trends in compilable DSP architecture

John Glossner; Jaime H. Moreno; Mayan Moudgill; Jeff H. Derby; Erdem Hokenek; David Meltzer; Uzi Shvadron; Malcolm Scott Ware

We review the evolution of DSP architectures and compiler technology, and describe how compiler techniques are being used to optimize emerging DSP architectures. Such new architectures are characterized by the exploitation of data and instruction level parallelism while being an amenable target for a compiler, thereby reducing or eliminating the need to rely on assembly language programming and/or architecture-specific compiler intrinsics to achieve highly efficient code. We also summarize our research results on an ultra low power compilable DSP architecture.

IBM Journal of Research and Development | 2003

An innovative low-power high-performance programmable signal processor for digital communications

Jaime H. Moreno; Victor Zyuban; Uzi Shvadron; Fredy D. Neeser; Jeff H. Derby; Malcolm Scott Ware; Krishnan K. Kailas; Ayal Zaks; Amir Geva; Shay Ben-David; Sameh W. Asaad; Thomas W. Fox; Daniel Littrell; Marina Biberstein; Dorit Naishlos; Hillery C. Hunter

We describe an innovative, low-power, high-performance, programmable signal processor (DSP) for digital communications. The architecture of this processor is characterized by its explicit design for low-power implementations, its innovative ability to jointly exploit instruction-level parallelism and data-level parallelism to achieve high performance, its suitability as a target for an optimizing high-level language compiler, and its explicit replacement of hardware resources by compile-time practices. We describe the methodology used in the development of the processor, highlighting the techniques deployed to enable application/architecture/compiler/implementation co-development, and the optimization approach and metric used for power-performance evaluation and tradeoff analysis. We summarize the salient features of the architecture, provide a brief description of the hardware organization, and discuss the compiler techniques used to exercise these features. We also summarize the simulation environment and associated software development tools. Coding examples from two representative kernels in the digital communications domain are also provided. The resulting methodology, architecture, and compiler represent an advance of the state of the art in the area of low-power, domain-specific microprocessors.

international symposium on microarchitecture | 2012

Designing a Programmable Wire-Speed Regular-Expression Matching Accelerator

Jan van Lunteren; Christoph Hagleitner; Timothy Heil; Giora Biran; Uzi Shvadron; Kubilay Atasu

A growing number of applications rely on fast pattern matching to scan data in real-time for security and analytics purposes. The RegX accelerator in the IBM Power Edge of Network (PowerEN) processor supports these applications using a combination of fast programmable state machines and simple processing units to scan data streams against thousands of regular-expression patterns at state-of-the-art Ethernet link speeds. RegX employs a special rule cache and includes several new micro-architectural features that enable various instruction dispatch and execution options for the processing units. The architecture applies RISC philosophy to special-purpose computing: hardware provides fast, simple primitives, typically performed in a single cycle, which are exploited by an intelligent compiler and system software for high performance. This approach provides the flexibility required to achieve good performance across a wide range of workloads. As implemented in the PowerEN processor, the accelerator achieves a theoretical peak scan rate of 73.6 Gbit/s, and a measured scan rate of about 15 to 40 Gbit/s for typical intrusion detection workloads.

International Journal of Parallel Programming | 2011

ACOTES project: Advanced compiler technologies for embedded streaming

Eduard Ayguadé; Cédric Bastoul; Paul M. Carpenter; Zbigniew Chamski; Albert Cohen; Marco Cornero; Philippe Dumont; Marc Duranton; Mohammed Fellahi; Roger Ferrer; Razya Ladelsky; Menno Lindwer; Xavier Martorell; Cupertino Miranda; Dorit Nuzman; Andrea Ornstein; Antoniu Pop; Sebastian Pop; Louis-Noël Pouchet; Alex Ramirez; David Ródenas; Erven Rohou; Ira Rosen; Uzi Shvadron; Konrad Trifunovic; Ayal Zaks

Streaming applications are built of data-driven, computational components, consuming and producing unbounded data streams. Streaming oriented systems have become dominant in a wide range of domains, including embedded applications and DSPs. However, programming efficiently for streaming architectures is a challenging task, having to carefully partition the computation and map it to processes in a way that best matches the underlying streaming architecture, taking into account the distributed resources (memory, processing, real-time requirements) and communication overheads (processing and delay). These challenges have led to a number of suggested solutions, whose goal is to improve the programmer’s productivity in developing applications that process massive streams of data on programmable, parallel embedded architectures. StreamIt is one such example. Another more recent approach is that developed by the ACOTES project (Advanced Compiler Technologies for Embedded Streaming). The ACOTES approach for streaming applications consists of compiler-assisted mapping of streaming tasks to highly parallel systems in order to maximize cost-effectiveness, both in terms of energy and in terms of design effort. The analysis and transformation techniques automate large parts of the partitioning and mapping process, based on the properties of the application domain, on the quantitative information about the target systems, and on programmer directives. This paper presents the outcomes of the ACOTES project, a 3-year collaborative work of industrial (NXP, ST, IBM, Silicon Hive, NOKIA) and academic (UPC, INRIA, MINES ParisTech) partners, and advocates the use of Advanced Compiler Technologies that we developed to support Embedded Streaming.

international symposium on performance analysis of systems and software | 2008

Trace-based Performance Analysis on Cell BE

Marina Biberstein; Uzi Shvadron; Javier Turek; Bilha Mendelson; Moon S. Chang

The transition to multicore architectures creates significant challenges for programming systems. Taking advantage of specialized processing cores such as those in the Cell BE processor and managing all the required data movement inside the processor cannot be done efficiently without help from the software infrastructure. Alongside new programming models and compiler support for multicores, programmers need performance evaluation and analysis tools. In this paper, we present tools that help analyze the performance of applications executing on the Cell platform. The performance debugging tool (PDT) provides a means for recording significant events during program execution, maintaining the sequential order of events, and preserving important runtime information such as core assignment and relative timing of events. The trace analyzer (TA) reads and visualizes the PDT traces. We describe the architecture of the PDT and present several important use cases demonstrating the usage of PDT and TA to understand the performance of several workloads. We also discuss the overhead of tracing and its impact on the benchmark execution and performance analysis.

multimedia signal processing | 1997

Asynchronous rate conversion

Yoav Medan; Uzi Shvadron

A new approach for sampling rate conversion of digitally sampled signals is presented. The approach enables conversion from any given source sampling rate to any desired target sampling rate, even if the source and destination clocks are not synchronous. The algorithm matters in communication systems where sampling is done on one system and playback on another; in such cases the two clocks drift relative to each other, causing synchronization problems. The algorithm is very simple and consumes less than one MIPS. It was implemented on the IBM Signal Processor (Mwave-MDSP2780) and is used to convert a low sampling rate (8 or 9.6 kHz) to the high CD rate (44.1 kHz) and vice versa.
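The abstract does not say how the conversion is performed. As a rough illustration of converting between two arbitrary, non-integer-related rates, the sketch below uses plain linear interpolation. This is an assumed stand-in technique, not the paper's algorithm, and the function name `resample_linear` and its parameters are hypothetical.

```python
def resample_linear(samples, src_rate, dst_rate):
    """Resample a sequence by linear interpolation.

    Illustrative only: linear interpolation is an assumption here,
    not the method described in the paper. The rates may be any
    positive real numbers, so a slightly drifting clock can be
    modeled by passing e.g. 8000.5 instead of 8000.
    """
    if len(samples) < 2:
        return list(samples)
    step = src_rate / dst_rate  # source samples advanced per output sample
    last = len(samples) - 1
    out = []
    n = 0
    while n * step <= last:
        pos = n * step                   # fractional position in the source
        i = int(pos)                     # index of the sample at or before pos
        frac = pos - i                   # distance past that sample, in [0, 1)
        nxt = samples[min(i + 1, last)]  # clamp at the final sample
        out.append(samples[i] * (1.0 - frac) + nxt * frac)
        n += 1
    return out
```

Calling `resample_linear(block, 8000, 44100)` upsamples a block from 8 kHz toward the CD rate. A production converter would normally replace the two-point interpolation with a proper low-pass interpolation filter to limit aliasing and imaging.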

Simulation | 2012

Towards flexible exascale stream processing system simulation

Alfred Park; Cheng-Hong Li; Ravi Nair; Nobuyuki Ohba; Uzi Shvadron; Ayal Zaks; Eugen Schenfeld

Stream processing is an important emerging computational model for performing complex operations on and across multi-source, high-volume, unpredictable dataflows. We present Flow, a platform for parallel and distributed stream processing system simulation that provides a flexible modeling environment for analyzing stream processing applications. The Flow stream processing system simulator is a high-performance, scalable simulator that automatically parallelizes chunks of the model space and incurs near-zero synchronization overhead for acyclic stream application graphs. We show promising parallel and distributed event rates exceeding 149 million events per second on a cluster with 512 processor cores.

workshop on parallel and distributed simulation | 2010

Flow: A Stream Processing System Simulator

Alfred Park; Cheng-Hong Li; Ravi Nair; Nobuyuki Ohba; Uzi Shvadron; Ayal Zaks; Eugen Schenfeld

Stream processing is an important emerging computational model for performing complex operations on and across multi-source, high volume, unpredictable dataflows. We present Flow, a platform for parallel and distributed stream processing system simulation that provides a flexible modeling environment for analyzing stream processing applications. The Flow stream processing system simulator is a high performance, scalable simulator that automatically parallelizes chunks of the model space and incurs near zero synchronization overhead for stream application graphs that exhibit feed-forward behavior. We show promising multi-threaded and multi-process event rates exceeding 80 million events per second on a cluster with 256 processor cores.

IBM Journal of Research and Development | 2009

Cell broadband engine processor performance optimization: tracing tools implementation and use

Marina Biberstein; Shiri Dori-Hacohen; Yuval Harel; Andre Heilper; Bilha Mendelson; Uzi Shvadron; Eran Treister; Javier Turek; Moon S. Chang

Optimizing performance on multicore processors is a daunting task because of the increased importance of such factors as thread communication, memory contention, and memory access latency. This paper presents two tools that programmers and performance analysts can use to understand application performance on the Cell Broadband Engine® (Cell/B.E.) processor: the Performance Debugging Tool (PDT) and the Trace Analyzer (TA). PDT traces user-space events, augmenting them with scheduling data from the operating system; those traces are then read, analyzed, and presented visually by the TA. This paper describes the implementation issues arising from the fact that a common low-overhead clock shared by all cores, essential for analysis and visualization, is not available on the Cell/B.E. processor. The TA employs an offline analysis to align the collected events to a common time based only on thread-local timestamps, event order, and context switch information. We also discuss the overhead of tracing and its impact on execution and performance analysis. We illustrate the use of the PDT and TA by analyzing several significant Cell/B.E. processor workloads, including native code and higher-level abstractions offered by the Data Communication and Synchronization services. We show how trace analysis can help identify performance issues in these workloads and how it can be used by programmers to spot performance antipatterns (common programming practices leading to suboptimal performance).

Archive | 1992

System for facilitating continuous, real-time, unidirectional, and asynchronous intertask and end-device communication in a multimedia data processing system using open architecture data communication modules

Gary G. Allran; Donald Edward Carmon; Fetchi Chen; Jose A. Eduartez; Charles R. Knox; William W. Lawton; Llewellyn Bradley Marshall; Nathan A. Mitchell; Malcolm Scott Ware; Raymond W. Weeks; Yoav Medan; Uzi Shvadron
