Roger Moussalli
IBM
Network
Latest external collaboration on country level. Dive into details by clicking on the dots.
Publication
Featured researches published by Roger Moussalli.
ieee international conference on high performance computing data and analytics | 2014
Andrew S. Cassidy; Rodrigo Alvarez-Icaza; Filipp Akopyan; Jun Sawada; John V. Arthur; Paul A. Merolla; Pallab Datta; Marc Gonzalez Tallada; Brian Taba; Alexander Andreopoulos; Arnon Amir; Steven K. Esser; Jeff Kusnitz; Rathinakumar Appuswamy; Chuck Haymes; Bernard Brezzo; Roger Moussalli; Ralph Bellofatto; Christian W. Baks; Michael Mastro; Kai Schleupen; Charles Edwin Cox; Ken Inoue; Steven Edward Millman; Nabil Imam; Emmett McQuinn; Yutaka Nakamura; Ivan Vo; Chen Guok; Don Nguyen
Drawing on neuroscience, we have developed a parallel, event-driven kernel for neurosynaptic computation, that is efficient with respect to computation, memory, and communication. Building on the previously demonstrated highly optimized software expression of the kernel, here, we demonstrate True North, a co-designed silicon expression of the kernel. True North achieves five orders of magnitude reduction in energy to-solution and two orders of magnitude speedup in time-to solution, when running computer vision applications and complex recurrent neural network simulations. Breaking path with the von Neumann architecture, True North is a 4,096 core, 1 million neuron, and 256 million synapse brain-inspired neurosynaptic processor, that consumes 65mW of power running at real-time and delivers performance of 46 Giga-Synaptic OPS/Watt. We demonstrate seamless tiling of True North chips into arrays, forming a foundation for cortex-like scalability. True Norths unprecedented time-to-solution, energy-to-solution, size, scalability, and performance combined with the underlying flexibility of the kernel enable a broad range of cognitive applications.
high performance embedded architectures and compilers | 2010
Roger Moussalli; Mariam Salloum; Walid A. Najjar; Vassilis J. Tsotras
Publish-subscribe systems present the state of the art in information dissemination to multiple users. Such systems have evolved from simple topic-based to the current XML-enabled systems. Here, users pose complex queries (expressed in XPath) on the structure and content of the streaming documents. The parts of the documents that match the user queries are then returned to the users. This paper proposes a novel hardware architecture that would exploit the parallelism found in XPath filtering systems. Using an incoming XML stream, parsing and matching with thousands of user profiles are performed simultaneously on a single FPGA, thus yielding up to three orders of magnitude higher throughput when compared to conventional approaches bound by the sequential aspect of software computing. By converting XPath expressions into custom stacks, our architecture is the first providing full support for all structural XPath constructs, including parent-child and ancestor descendant relations, whilst allowing wildcarding and recursion.
international conference on data engineering | 2011
Roger Moussalli; Mariam Salloum; Walid A. Najjar; Vassilis J. Tsotras
In recent years, XML-based Publish-Subscribe Systems have become popular due to the increased demand of timely event-notification. Users (or subscribers) pose complex profiles on the structure and content of the published messages. If a profile matches the message, the message is forwarded to the interested subscriber. As the amount of published content continues to grow, current software-based systems will not scale. We thus propose a novel architecture to exploit parallelism of twig matching on FPGAs. This approach yields up to three orders of magnitude higher throughput when compared to conventional approaches bound by the sequential aspect of software computing. This paper, presents a novel method for performing unordered holistic twig matching on FPGAs without any false positives, and whose throughput is independent of the complexity of the user queries or the characteristics of the input XML stream. Furthermore, we present experimental comparison of different granularities of twig matching, namely path-based (root-to-leaf) and pair-based (parent-child or ancestor-descendant).We provide comprehensive experiments that compare the throughput, area utilization and the accuracy of matching (percent of false positives) of our holistic, path-based and pair-based FPGA approaches.
Geoinformatica | 2015
Roger Moussalli; Ildar Absalyamov; Marcos R. Vieira; Walid A. Najjar; Vassilis J. Tsotras
The wide and increasing availability of collected data in the form of trajectories has led to research advances in behavioral aspects of the monitored subjects (e.g., wild animals, people, and vehicles). Using trajectory data harvested by devices, such as GPS, RFID and mobile devices, complex pattern queries can be posed to select trajectories based on specific events of interest. In this paper, we present a study on FPGA- and GPU-based architectures processing complex patterns on streams of spatio-temporal data. Complex patterns are described as regular expressions over a spatial alphabet that can be implicitly or explicitly anchored to the time domain. More importantly, variables can be used to substantially enhance the flexibility and expressive power of pattern queries. Here we explore the challenges in handling several constructs of the assumed pattern query language, with a study on the trade-offs between expressiveness, scalability and matching accuracy. We show an extensive performance evaluation where FPGA and GPU setups outperform the current state-of-the-art (single-threaded) CPU-based approaches, by over three orders of magnitude for FPGAs (for expressive queries) and up to two orders of magnitude for certain datasets on GPUs (and in some cases slowdown). Unlike software-based approaches, the performance of the proposed FPGA and GPU solutions is only minimally affected by the increased pattern complexity.
field-programmable custom computing machines | 2013
Roger Moussalli; Walid A. Najjar; Xi Luo; Amna Khan
Integer compression techniques can generally be classified as bit-wise and byte-wise approaches. Though at the cost of a larger processing time, bit-wise techniques typically result in a better compression ratio. The Golomb-Rice (GR) method is a bit-wise lossless technique applied to the compression of images, audio files and lists of inverted indices. However, since GR is a serial algorithm, decompression is regarded as a very slow process; to the best of our knowledge, all existing software and hardware native (non-modified) GR decoding engines operate bit-serially on the encoded stream. In this paper, we present (1) the first no-stall hardware architecture, capable of decompressing streams of integers compressed using the GR method, at a rate of several bytes (multiple integers) per hardware cycle; (2) a novel GR decoder based on the latter architecture is further detailed, operating at a peak rate of one integer per cycle. A thorough design space exploration study on the resulting resource utilization and throughput of the aforementioned approaches is presented. Furthermore, a performance study is provided, comparing software approaches to implementations of the novel hardware decoders. While occupying 10% of a Xilinx V6LX240T FPGA, the no-stall architecture core achieves a sustained throughput of over 7 Gbps.
symposium on large spatial databases | 2013
Roger Moussalli; Marcos R. Vieira; Walid A. Najjar; Vassilis J. Tsotras
The wide and increasing availability of collected data in the form of trajectory has lead to research advances in behavioral aspects of the monitored subjects (e.g., wild animals, people, vehicles). Using trajectory data harvested by devices, such as GPS, RFID and mobile devices, complex pattern queries can be posed to select trajectories based on specific events of interest. In this paper, we present a study on FPGA-based architectures processing complex patterns on streams of spatio-temporal data. Complex patterns are described as regular expressions over a spatial alphabet that can be implicitly or explicitly anchored to the time domain. More importantly, variables can be used to substantially enhance the flexibility and expressive power of pattern queries. Here we explore the challenges in handling several constructs of the assumed pattern query language, with a study on the trade-offs between expressiveness, scalability and matching accuracy. We show an extensive performance evaluation where FPGA setups outperform the current state-of-the-art CPU-based approaches by over three orders of magnitude. Unlike software-based approaches, the performance of the proposed FPGA solution is only minimally affected by the increased pattern complexity.
field-programmable custom computing machines | 2015
Roger Moussalli; Mudhakar Srivatsa; Sameh W. Asaad
Insights extracted from spatial queries in geodatabase systems introduce significant opportunities for business intelligence. However, geodatabases are unable to keep up with the required performance due to the massive (and sky-rocketing) amounts of data generated from embedded location-enabled devices. In this paper, we focus on geographic information systems that make use of geohash, specifically, we tackle the kernel of converting geohash codes to and from longitude/latitude pairs. We present the first hardware implementation of a geohash conversion engine operating at wire speed. The presented geohash converter is further enhanced with runtime flexibility with respect to characteristics of the data it can process, furthermore, the architecture allows the user to compromise on performance when limited by hardware resources (design time flexibility). Experimental results of the geohash conversion engine on a Xilinx XC7K325T FPGA show >13X (end-to-end) speedup compared to optimized industry-grade software running on 16 CPU hardware threads.
ACM Transactions in Embedded Computing Systems | 2014
Roger Moussalli; Mariam Salloum; Robert J. Halstead; Walid A. Najjar; Vassilis J. Tsotras
Publish-subscribe systems present the state of the art in information dissemination to multiple users. Such systems have evolved from simple topic-based to the current XML-based systems. XML-based pub-sub systems provide users with more flexibility by allowing the formulation of complex queries on the content as well as the structure of the streaming messages. Messages that match a given user query are forwarded to the user. This article examines how to exploit the parallelism found in XPath filtering. Using an incoming XML stream, parsing and matching thousands of user profiles are performed simultaneously by matching engines. We show the benefits and trade-offs of mapping the proposed filtering approach onto FPGAs, processing streams of XML at wire speed, and GPUs, providing the flexibility of software. This is in contrast to conventional approaches bound by the sequential aspect of software computing, associated with a large memory footprint. By converting XPath expressions into custom stacks, our solution is the first to provide support for complex XPath structural constructs, such as parent-child and ancestor descendant relations, whilst allowing wildcarding and recursion. The measured speedups resulting from the GPU and FPGA accelerations versus single-core CPUs are up to 6.6X and 2.5 orders of magnitude, respectively. The FPGA approaches are up to 31X faster than software running on 12 CPU cores.
asilomar conference on signals, systems and computers | 2010
Robert J. Halstead; Jason R. Villarreal; Roger Moussalli; Walid A. Najjar
While the computational power of Field Programmable Gate Arrays (FPGA) makes them attractive as code accelerators, the lack of high-level language programming tools is a major obstacle to their wider use. Graphics Processing Units (GPUs), on the other hand, have benefitted from advanced and widely used high-level programming tools. This paper evaluates the performance, throughput and energy of both FPGAs and GPUs on image processing codes using high-level language programming tools for both.
ieee global conference on signal and information processing | 2013
Roger Moussalli; Bharat Sukhwani; Sameh W. Asaad
The low latency and high throughput requirements of high-frequency trading has resulted in increasing adoption of dedicated hardware for processing financial feeds. Development of hardware platforms, however, is plagued with slow design/verification cycles compared to their software counterparts. In this work, we present FINPAGE, a FINancial PArser GEnerator, to automatically generate hardware structures for parsing financial feeds. Given a high-level feed format description, FINPAGE generates an area-efficient hardware parser capable of processing feeds at line rate.