Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Leonel Sousa is active.

Publication


Featured research published by Leonel Sousa.


IEEE Transactions on Parallel and Distributed Systems | 2005

Communication contention in task scheduling

Oliver Sinnen; Leonel Sousa

Task scheduling is an essential aspect of parallel programming. Most heuristics for this NP-hard problem are based on a simple system model that assumes fully connected processors and concurrent interprocessor communication. Hence, contention for communication resources is not considered in task scheduling, yet it has a strong influence on the execution time of a parallel program. This paper investigates the incorporation of contention awareness into task scheduling. A new system model for task scheduling is proposed, allowing us to capture both end-point and network contention. To achieve this, the communication network is reflected by a topology graph for the representation of arbitrary static and dynamic networks. The contention awareness is accomplished by scheduling the communications, represented by the edges in the task graph, onto the links of the topology graph. Edge scheduling is theoretically analyzed, including aspects like heterogeneity, routing, and causality. The proposed contention-aware scheduling preserves the theoretical basis of task scheduling. It is shown how classic list scheduling is easily extended to this more accurate system model. Experimental results show the significantly improved accuracy and efficiency of the produced schedules.
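A minimal sketch of the edge-scheduling idea in Python: a link keeps a schedule of busy intervals just as a processor does, so two communications contending for the same link are serialized rather than assumed concurrent. All names are illustrative, not from the authors' implementation.

```python
# Sketch of the paper's core idea: communications (task-graph edges)
# are scheduled onto the links of a topology graph, just as tasks are
# scheduled onto processors.

class Link:
    """A communication link that, like a processor, holds a schedule."""
    def __init__(self):
        self.busy = []  # list of (start, end) busy intervals

    def earliest_slot(self, ready, duration):
        """Earliest start >= ready where the link is free for `duration`."""
        start = ready
        for b_start, b_end in sorted(self.busy):
            if start + duration <= b_start:
                break                  # fits in the gap before this interval
            start = max(start, b_end)  # otherwise try after it
        return start

    def schedule(self, ready, duration):
        start = self.earliest_slot(ready, duration)
        self.busy.append((start, start + duration))
        return start, start + duration

# Two communications contending for one link: the classic model would
# let both start at t=0; edge scheduling serializes the second one.
link = Link()
print(link.schedule(0, 5))  # (0, 5)
print(link.schedule(0, 3))  # (5, 8) -- delayed by contention
```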


IEEE Transactions on Parallel and Distributed Systems | 2006

Toward a realistic task scheduling model

Oliver Sinnen; Leonel Sousa; Frode Eika Sandnes

Task scheduling is an important aspect of parallel programming. Most of the heuristics for this NP-hard problem are based on a very simple model of the target parallel system. Experiments have revealed the inappropriateness of this classic model for obtaining accurate and efficient schedules on real systems. To overcome this shortcoming, a new scheduling model was proposed that considers contention for communication resources. Even though accuracy and efficiency improved with the consideration of contention, the contention model is still not accurate enough. The crucial aspect is the involvement of the processor in communication. This paper investigates the involvement of the processor in communication and its impact on task scheduling. A new system model, based on the contention model, is proposed that is aware of processor involvement. The challenges for scheduling techniques are analyzed and two scheduling algorithms are proposed. Experiments on real parallel systems show the significantly improved accuracy and efficiency of the new model and algorithms.
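A hedged sketch of what processor involvement changes: under the classic model a processor is free the moment its task finishes, while an involved processor must also spend (part of) the communication time before its next task can start. The function and involvement factor below are illustrative assumptions, not the paper's model.

```python
# Toy model of the involvement idea: the sending processor's next task
# is delayed by the fraction of the communication it must handle itself.

def processor_free_time(task_finish, comm_duration, involvement=1.0):
    """Time at which the sending processor can start its next task.

    involvement = 0.0 models dedicated communication hardware
    (the classic assumption); 1.0 models full processor participation
    in the transfer.
    """
    return task_finish + involvement * comm_duration

print(processor_free_time(10.0, 4.0, involvement=0.0))  # 10.0: classic model
print(processor_free_time(10.0, 4.0, involvement=1.0))  # 14.0: involved
```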


Biosensors and Bioelectronics | 2009

Femtomolar limit of detection with a magnetoresistive biochip.

V. C. Martins; F. A. Cardoso; J. Germano; S. Cardoso; Leonel Sousa; Moisés Piedade; Paulo P. Freitas; Luís P. Fonseca

In this paper, the biological limit of detection of a spin-valve-based magnetoresistive biochip applied to the detection of 20-mer ssDNA hybridization events is presented. Two reaction variables and their impact on biomolecular recognition efficiency are discussed: the influence of a 250 nm diameter magnetic particle attached to the target molecule during the hybridization event, and the effect of a magnetic focusing system on the hybridization of pre-labeled target DNA (assisted hybridization). The particles carrying the target molecules are attracted to the probe-active sensor sites by applying a 40 mA DC current to U-shaped aluminium current lines. Experiments comparing pre-hybridization versus post-hybridization magnetic labeling, and passive versus magnetically assisted hybridization, were conducted. The efficiency of passive hybridization is reduced by about 50% when constrained to the operational conditions (sample volume, reaction time, temperature, and magnetic label) of an on-chip real-time hybridization assay. This reduction was shown to be constant and independent of the initial target concentration. Conversely, the presence of the magnetic label improved the limit of detection when magnetically assisted hybridization was performed. The use of a labeled-target focusing system permitted a gain of three orders of magnitude (from 1 pM down to 1 fM) in the sensitivity of the device, compared with passive, diffusion-controlled hybridization.


IEEE Transactions on Parallel and Distributed Systems | 2011

Massively LDPC Decoding on Multicore Architectures

Gabriel Falcao; Leonel Sousa; Vitor Silva

Unlike the usual VLSI approaches necessary for computationally intensive Low-Density Parity-Check (LDPC) code decoders, this paper presents flexible software-based LDPC decoders. Algorithms and data structures suitable for parallel computing are proposed to perform LDPC decoding on multicore architectures. To evaluate the efficiency of the proposed parallel algorithms, LDPC decoders were developed on recent multicores, such as off-the-shelf general-purpose x86 processors, Graphics Processing Units (GPUs), and the CELL Broadband Engine (CELL/B.E.). Challenging restrictions, such as memory access conflicts, latency, coalescence, and the unknown behavior of thread and block schedulers, were unraveled and worked around. Experimental results for different code lengths show throughputs on the order of 1-2 Mbps on the general-purpose multicores, and from 40 Mbps on the GPU to nearly 70 Mbps on the CELL/B.E. The analysis of the results shows that the CELL/B.E. performs better for short to medium-length codes, while the GPU achieves superior throughputs with larger codes; in some cases the achieved throughputs closely approach those obtained with VLSI decoders. From this analysis, a throughput increase can be predicted as the number of cores rises.
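For context, a small Python sketch of the min-sum check-node update that dominates the workload of such decoders; it shows the generic algorithm, not the authors' parallel data structures or layout.

```python
import numpy as np

# Min-sum check-node update: on every edge, a check node sends the
# product of the signs and the minimum magnitude of the LLR messages
# arriving on all *other* edges.

def check_node_update(llrs):
    """llrs: incoming LLR messages on the edges of one check node."""
    llrs = np.asarray(llrs, dtype=float)
    signs = np.sign(llrs)
    total_sign = np.prod(signs)
    mags = np.abs(llrs)
    order = np.argsort(mags)
    min1, min2 = mags[order[0]], mags[order[1]]
    # The smallest magnitude's own edge receives the second smallest.
    out = np.where(np.arange(len(llrs)) == order[0], min2, min1)
    # total_sign * signs[i] = product of the signs of all other edges.
    return total_sign * signs * out

print(check_node_update([0.8, -1.5, 2.0]))  # [-1.5  0.8 -0.8]
```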


IEEE Transactions on Circuits and Systems | 2005

A universal architecture for designing efficient modulo 2^n+1 multipliers

Leonel Sousa; Ricardo Chaves

This paper proposes a simple and universal architecture for designing efficient modified Booth multipliers modulo (2^n+1). The proposed architecture is comprehensive, providing modulo (2^n+1) multipliers with similar performance and cost for both the ordinary and the diminished-1 number representations. The performance and efficiency of the proposed multipliers are evaluated and compared with the fastest earlier modulo (2^n+1) multipliers, based on a simple gate-count and gate-delay model and on experimental results obtained from CMOS implementations. These results show that the proposed approach leads, on average, to multipliers approximately 10% faster than the fastest known structures for the diminished-1 representation based on modified Booth recoding. Moreover, they show that the proposed architecture is the only one taking advantage of this recoding to obtain faster multipliers with a significant reduction in hardware. With the adopted figures of merit, the proposed diminished-1 multipliers are on average 10% and 25% more efficient than the most efficient known modulo (2^n+1) multipliers for Booth-recoded and non-recoded multipliers, respectively.
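A brief software-level illustration, assuming nothing about the paper's hardware: multiplication modulo 2^n+1 on ordinary operands, and the same product computed on diminished-1 coded operands, where a value x is stored as x-1 (zero needs separate flagging, omitted here).

```python
# The two operand representations covered by the paper, modeled in
# software only -- the paper's contribution is the Booth-recoded
# hardware architecture, which this sketch does not attempt to show.

n = 8
M = 2**n + 1  # modulus 257

def mul_ordinary(a, b):
    """Ordinary (weighted) representation."""
    return (a * b) % M

def to_dim1(x):
    """Diminished-1 code: x is stored as x - 1 (x = 0 handled apart)."""
    assert 1 <= x <= M - 1
    return x - 1

def mul_dim1(a1, b1):
    """Multiply two diminished-1 operands; result stays diminished-1."""
    return ((a1 + 1) * (b1 + 1) - 1) % M

a, b = 200, 123
print(mul_ordinary(a, b))                    # 185
print(mul_dim1(to_dim1(a), to_dim1(b)) + 1)  # 185, via diminished-1
```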


Cryptographic Hardware and Embedded Systems | 2006

Improving SHA-2 hardware implementations

Ricardo Chaves; Georgi Kuzmanov; Leonel Sousa; Stamatis Vassiliadis

This paper proposes a set of new techniques to improve the implementation of the SHA-2 hashing algorithm. These techniques consist mostly of operation rescheduling and hardware reuse, allowing a significant reduction of the critical path while the required area also decreases. Both the SHA256 and SHA512 hash functions have been implemented and tested on VIRTEX II Pro prototyping technology. Experimental results suggest improvements above 50% over related commercial SHA256 cores, above 100% over academic designs, and above 70% for the SHA512 hash function. The resulting cores are capable of achieving the same throughput as the fastest unrolled architectures with 25% less area occupation than the smallest proposed architectures. The proposed cores achieve throughputs of 1.4 Gbit/s and 1.8 Gbit/s, requiring 755 and 1667 slices for SHA256 and SHA512 respectively, on a XC2VP30-7 FPGA.
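As a reference point, a Python sketch of one generic SHA-256 compression round, which makes the dependencies visible: only e = d + T1 and a = T1 + T2 sit on the critical path, while the sum h + K[t] + W[t] depends on values known a round in advance, the kind of term operation rescheduling moves off the critical path. This is the textbook round, not the authors' core.

```python
MASK = 0xFFFFFFFF

def rotr(x, r):
    """32-bit right rotation."""
    return ((x >> r) | (x << (32 - r))) & MASK

def sha256_round(state, kt_wt):
    """One SHA-256 compression round; kt_wt = K[t] + W[t], which is
    precomputable one cycle ahead since h is the previous round's g."""
    a, b, c, d, e, f, g, h = state
    s1 = rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)
    ch = (e & f) ^ (~e & g & MASK)
    t1 = (h + s1 + ch + kt_wt) & MASK
    s0 = rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)
    maj = (a & b) ^ (a & c) ^ (b & c)
    t2 = (s0 + maj) & MASK
    # New state: only a and e are freshly computed; the rest shifts.
    return ((t1 + t2) & MASK, a, b, c, (d + t1) & MASK, e, f, g)

state = tuple(range(1, 9))  # toy state, not the real initial values
print(sha256_round(state, 0x428A2F98))
```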


Digital Systems Design | 2003

RDSP: a RISC DSP based on residue number system

Ricardo Chaves; Leonel Sousa

This paper focuses on the design of low-power, programmable, fast digital signal processors (DSPs) based on a configurable 5-stage RISC core architecture and on residue number systems (RNS). Several innovative aspects are introduced at the control and datapath architecture levels, which support both the binary system and the RNS. A new moduli set {2^n-1, 2^2n, 2^n+1} is also proposed for balancing the processing time across the different RNS channels. Experimental results, obtained through RDSP implementations on FPGA and ASIC, show that not only a significant reduction in circuit area and power consumption but also a speedup can be achieved with RNS when compared with a binary DSP.
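A software model of the proposed moduli set, as a sketch only: each RNS channel computes independently modulo its own modulus, with no carries between channels, and the binary result is recovered with the Chinese Remainder Theorem.

```python
from math import prod

n = 4
moduli = [2**n - 1, 2**(2 * n), 2**n + 1]  # {15, 256, 17}: pairwise coprime
M = prod(moduli)                            # dynamic range 2^2n * (2^2n - 1)

def to_rns(x):
    """Forward conversion: one residue per channel."""
    return [x % m for m in moduli]

def rns_mul(xs, ys):
    """Each channel multiplies independently -- no inter-channel carries."""
    return [(x * y) % m for x, y, m in zip(xs, ys, moduli)]

def from_rns(residues):
    """Chinese Remainder Theorem reconstruction to binary."""
    total = 0
    for r, m in zip(residues, moduli):
        Mi = M // m
        total += r * Mi * pow(Mi, -1, m)  # modular inverse of Mi mod m
    return total % M

a, b = 123, 321
print(from_rns(rns_mul(to_rns(a), to_rns(b))), (a * b) % M)  # equal
```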


Parallel Computing | 2004

List scheduling: extension for contention awareness and evaluation of node priorities for heterogeneous cluster architectures

Oliver Sinnen; Leonel Sousa

In the area of static scheduling, list scheduling is one of the most common heuristics for the temporal and spatial assignment of a directed acyclic graph (DAG) to a target system. Like most scheduling heuristics, list scheduling assumes fully connected homogeneous processors and ignores contention on the communication links. This article extends the list scheduling heuristic for contention-aware scheduling on heterogeneous, arbitrary architectures. The extension is based on the idea of scheduling edges to links, analogous to the scheduling of nodes to processors. Based on this extension, we compare eight priority schemes for determining the node order in the first phase of list scheduling. Random graphs are generated and scheduled with the different schemes onto homogeneous and heterogeneous parallel systems from the area of cluster computing. Apart from identifying the best priority scheme, the results give new insights into contention-aware DAG scheduling. Moreover, we demonstrate the appropriateness of our extended list scheduling for homogeneous and heterogeneous cluster architectures.
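To make the priority-scheme comparison concrete, a small sketch computing the bottom level, one common node priority: the length of the longest path from a node to an exit node. The DAG, node weights, and edge weights below are made up for illustration.

```python
from functools import lru_cache

# A hypothetical four-task DAG: successors with edge (communication)
# weights, plus per-node computation costs.
succ = {"A": {"B": 2, "C": 1}, "B": {"D": 3}, "C": {"D": 1}, "D": {}}
work = {"A": 4, "B": 2, "C": 3, "D": 1}

@lru_cache(maxsize=None)
def bottom_level(v):
    """Longest path from v to an exit node, including v's own cost."""
    tails = [w + bottom_level(u) for u, w in succ[v].items()]
    return work[v] + (max(tails) if tails else 0)

# Higher bottom level first: these nodes head the longest residual path.
order = sorted(work, key=bottom_level, reverse=True)
print([(v, bottom_level(v)) for v in order])
# [('A', 12), ('B', 6), ('C', 5), ('D', 1)]
```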


International Conference on Supercomputing | 2009

How GPUs can outperform ASICs for fast LDPC decoding

Gabriel Falcao; Vitor Silva; Leonel Sousa

Due to huge computational requirements, powerful Low-Density Parity-Check (LDPC) error-correcting codes, discovered in the early 1960s, have only recently been adopted by emerging communication standards. LDPC decoders are supported by VLSI technology, which delivers good parallel computational power with excellent throughputs, but at the expense of significant costs. In this work, we propose an alternative, flexible LDPC decoder that exploits data parallelism for simultaneous multicodeword decoding, supported by multithreading on CUDA-based graphics processing units (GPUs). The ratio of arithmetic operations per memory access is low for the efficient min-sum LDPC decoding algorithm proposed, which causes a bottleneck due to memory latency and data collisions. We propose runtime data realignment to allow coalesced parallel memory accesses to be performed by distinct threads inside the same warp. The memory access patterns of LDPC codes are random, which does not admit the simultaneous use of coalescence in both the read and write operations of the decoding process. To overcome this problem, we have developed a data-mapping transformation that allows new addresses to be contiguously accessed for one of these memory access types. Our implementation shows throughputs above 100 Mbps and BER curves that compare well with ASIC solutions.
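A toy illustration of the data-realignment idea, with a made-up access pattern rather than a real code: precomputing a gather permutation turns the scattered reads of one decoding phase into a contiguous pass, the pattern a GPU warp needs for coalesced access.

```python
import numpy as np

# Because the Tanner-graph access pattern is random, messages can be
# stored contiguously for only one phase; realigning the data once lets
# the other phase read contiguously too.

messages = np.array([10., 20., 30., 40., 50., 60.])
scatter = np.array([3, 0, 5, 1, 4, 2])   # toy random read order

# Offline: the permutation is built once per LDPC code.
# Online: one gather realigns the data, after which thread t simply
# reads element t -- the coalesced pattern a warp wants.
realigned = messages[scatter]
for t in range(len(realigned)):          # stand-in for one warp
    assert realigned[t] == messages[scatter[t]]
print(realigned)
```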


International Conference on Parallel Processing | 2009

Fine-grain Parallelism Using Multi-core, Cell/BE, and GPU Systems: Accelerating the Phylogenetic Likelihood Function

Frederico Pratas; Pedro Trancoso; Alexandros Stamatakis; Leonel Sousa

We are currently faced with a situation where applications have increasing computational demands and there is a wide selection of parallel processor systems. In this paper, we focus on exploiting fine-grain parallelism for a demanding Bioinformatics application, MrBayes, and its Phylogenetic Likelihood Functions (PLF) on different architectures. Our experiments compare side by side the scalability and performance achieved using general-purpose multi-core processors, the Cell/BE, and Graphics Processing Units (GPUs). The results indicate that all processors scale well for larger computation and data sets. The GPU and Cell/BE processors also achieve the best improvement for the parallel code section. Nevertheless, data transfers and the execution of the serial portion of the code are the reasons for their poor overall performance. The general-purpose multi-core processors prove to be simpler to program and provide the best balance between efficient parallel and serial execution, resulting in the largest speedup.
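For orientation, a sketch of the PLF's core recurrence (the Felsenstein pruning step), which is what these platforms accelerate: an inner node's per-site conditional likelihood vector is the element-wise product of its children's vectors, each propagated through that branch's transition-probability matrix. The matrices below are toy stand-ins, not MrBayes code.

```python
import numpy as np

def plf_inner_node(P_left, L_left, P_right, L_right):
    """4-state (DNA) conditional likelihoods for one alignment site.

    Each site is independent, which is the fine-grain parallelism
    the accelerators exploit.
    """
    return (P_left @ L_left) * (P_right @ L_right)

rng = np.random.default_rng(0)
P1 = rng.dirichlet(np.ones(4), size=4)   # toy row-stochastic matrices
P2 = rng.dirichlet(np.ones(4), size=4)
L1 = np.array([1., 0., 0., 0.])          # child tips: observed base A
L2 = np.array([0., 1., 0., 0.])          # and base C, as unit vectors
print(plf_inner_node(P1, L1, P2, L2))
```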

Collaboration


Dive into Leonel Sousa's collaborations.

Top Co-Authors

Nuno Roma
Instituto Superior Técnico

Pedro Tomás
Instituto Superior Técnico

Moisés Piedade
Instituto Superior Técnico