Arquimedes Canedo | Researchain

Publication

Featured research published by Arquimedes Canedo.

Journal of Parallel and Distributed Computing | 2008

The QC-2 parallel Queue processor architecture

Ben A. Abderazek; Arquimedes Canedo; Tsutomu Yoshinaga; Masahiro Sowa

A queue-based instruction set architecture processor offers an attractive option in the design of embedded systems. In our previous work, we proposed a novel queue processor architecture as a starting point for hardware/software design space exploration for embedded applications. In this paper, we present a high-performance 32-bit Synthesizable QueueCore (QC-2), an improved and optimized version of the produced order parallel Queue processor (PQP), with single-precision floating-point support. The QC-2 core also implements a novel technique for extending immediate values and memory instruction offsets that were otherwise not representable because of bit-width constraints in the PQP processor. A prototype implementation is produced by synthesizing the high-level model for a target FPGA device. We present the architecture description and design results in a fair amount of detail.

International Conference on Convergence Information Technology | 2007

Novel Addressing Method for Aggregate Types in Queue Processors

Teruhisa Yuki; Arquimedes Canedo; Ben A. Abderazek; Masahiro Sowa

Queue processors use a first-in first-out data structure to perform operations. Instructions implicitly reference their operands, which simplifies the design of the instruction set and reduces hardware complexity. Some memory accesses require a computed address. A register-indirect addressing method introduces severe limitations in a queue processor by inserting false dependencies that limit the high parallelism capacity of such architectures. In this paper we propose a novel addressing method for queue processors that employs the queue for address calculation and memory access. We demonstrate that our proposed method reduces the number of instructions by 6% and increases parallelism by 4% for a set of embedded applications.
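
The implicit operand access described in this abstract can be illustrated with a small simulator. This is a minimal sketch and not the QueueCore instruction set: the opcode names (`ld`, `add`, `sub`, `mul`), the function `run_queue_program`, and the variable environment are all assumptions made for illustration. Each arithmetic instruction takes its operands from the head of the queue and appends its result at the tail, so no instruction names a register.

```python
from collections import deque


def run_queue_program(program, env):
    """Run (opcode, arg) pairs on a FIFO operand queue (hypothetical mini-ISA)."""
    queue = deque()
    binary_ops = {
        "add": lambda x, y: x + y,
        "sub": lambda x, y: x - y,
        "mul": lambda x, y: x * y,
    }
    for opcode, arg in program:
        if opcode == "ld":
            # Loads write the variable's value at the tail of the queue.
            queue.append(env[arg])
        else:
            # Operands are read implicitly from the head of the queue.
            left = queue.popleft()
            right = queue.popleft()
            queue.append(binary_ops[opcode](left, right))
    return queue.popleft()


# (a + b) * (c - d), written in level order: all loads, then add/sub, then mul.
program = [
    ("ld", "a"), ("ld", "b"), ("ld", "c"), ("ld", "d"),
    ("add", None), ("sub", None), ("mul", None),
]
result = run_queue_program(program, {"a": 1, "b": 2, "c": 7, "d": 3})
print(result)  # 12
```

Note that `add` and `sub` do not depend on each other, which is the kind of parallelism the queue model exposes without register renaming.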

Computer Languages, Systems & Structures | 2008

A new code generation algorithm for 2-offset producer order queue computation model

Arquimedes Canedo; Ben A. Abderazek; Masahiro Sowa

Queue computing is an attractive alternative for meeting the pressing demand for high-performance architectures. Code generation for queue machines poses several problems whose solutions have not been studied thoroughly. A new parallel queue computation model, the 2-offset P-Code queue computation model, is presented together with a new code generation algorithm. The code generation algorithm takes leveled DAGs as input and produces 2-offset P-Code assembly. We also developed a queue compiler to evaluate the new algorithm and compiled a set of C language benchmark programs for the 2-offset P-Code. Depending on the program, the queue compiler generates between 8.55% fewer and 10.55% more instructions than a MIPS32 compiler.

Symposium on Code Generation and Optimization | 2010

Automatic parallelization of Simulink applications

Arquimedes Canedo; Takeo Yoshizawa; Hideaki Komatsu

The parallelization of Simulink applications is currently a responsibility of the system designer and the superscalar execution of the processors. State-of-the-art Simulink compilers excel at producing reliable and production-quality embedded code, but fail to exploit the natural concurrency available in the programs and to effectively use modern multi-core architectures. The reason may be that many Simulink applications are replete with loop-carried dependencies that inhibit most parallel computing techniques and compiler transformations. In this paper, we introduce the concept of strands, which allow the data dependencies to be broken while preserving the original semantics of the Simulink program. Our fully automatic compiler transformations create a concurrent representation of the program, and thread-level parallelism for multi-core systems is planned and orchestrated. To improve single-processor performance, we also exploit fine-grained (equation-level) parallelism by level-order scheduling inside each thread. Our strand transformation has been implemented as an automatic transformation in a proprietary compiler. With a realistic aeronautic model executed on two processors, it achieves a speedup of up to 1.98 times over uniprocessor execution, while the existing manual parallelization method achieves a 1.75 times speedup.

Symposium on Computer Architecture and High Performance Computing | 2007

Queue Register File Optimization Algorithm for QueueCore Processor

Arquimedes Canedo; Ben A. Abderazek; Masahiro Sowa

The queue computation model offers an attractive alternative for high-performance embedded computing given its characteristics of short instructions and high instruction-level parallelism. A queue-based processor uses a FIFO queue to read and write operands through hardware pointers located at the head and tail of the queue. Queue length is the number of elements stored between the head and the tail pointers during computations. We have found that 95% of the statements in integer applications require a queue length of less than 32 words. The remaining 5% require larger queue lengths of up to 230 queue words. In this paper we propose a compiler technique to optimize the queue utilization for these queue-hungry statements. We show that for SPEC CINT95 benchmarks, our technique optimizes the queue length without decreasing parallelism. However, our optimization incurs a slight increase in code size.
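
Queue length, as defined in this abstract, can be estimated for a single statement by tracking how many operands sit between the head and tail pointers after each instruction. The sketch below is only an illustration under assumptions: the `(opcode, arg)` program format, the opcode names, and the function `max_queue_length` are hypothetical, and every non-load instruction is assumed to consume two operands and produce one.

```python
def max_queue_length(program):
    """Return the peak number of queued operands for a hypothetical queue program."""
    length = 0  # elements currently between the head and tail pointers
    peak = 0
    for opcode, _ in program:
        if opcode == "ld":
            length += 1  # a load appends one element at the tail
        else:
            if length < 2:
                raise ValueError("binary instruction needs two queued operands")
            length -= 1  # two operands leave the head, one result joins the tail
        peak = max(peak, length)
    return peak


# (a + b) * (c - d) in level order: four loads are queued before any is consumed.
program = [
    ("ld", "a"), ("ld", "b"), ("ld", "c"), ("ld", "d"),
    ("add", None), ("sub", None), ("mul", None),
]
peak = max_queue_length(program)
print(peak)  # 4
```

Wide expression levels drive the peak up, which suggests why a small fraction of statements can need far more queue words than the rest.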

Signal Processing Systems | 2010

Compiling for Reduced Bit-Width Queue Processors

Arquimedes Canedo; Ben A. Abderazek; Masahiro Sowa

Embedded systems are characterized by the requirement for code with a small memory footprint. A popular architectural modification to improve code density in RISC embedded processors is to use a reduced bit-width instruction set. This approach reduces the length of the instructions to improve code size. However, because the reduced instructions can address fewer registers, these architectures suffer a slight performance degradation, as more reduced instructions are required to execute a given task. On the other hand, 0-operand computers such as stack and queue machines implicitly access their source and destination operands, making instructions naturally short. Queue machines offer a highly parallel computation model, unlike the stack model. This paper proposes a novel alternative for reducing code size by using a queue-based reduced instruction set while retaining the high parallelism characteristics in programs. We introduce an efficient code generation algorithm to generate programs for our reduced instruction set. Our algorithm successfully constrains the code to the reduced instruction set with the addition of only 4% extra code, on average. We show that our proposed technique is able to generate about 16% more compact code than MIPS16, 26% more compact than ARM/Thumb, and 50% more compact than MIPS32 code. Furthermore, we show that our compiler is able to extract about the same parallelism as fully optimized RISC code.

High Performance Embedded Architectures and Compilers | 2009

Compiler Support for Code Size Reduction Using a Queue-Based Processor

Arquimedes Canedo; Ben A. Abderazek; Masahiro Sowa

Queue computing delivers an attractive alternative for embedded systems. The main features of a queue-based processor are a dense instruction set, high-parallelism capabilities, and low hardware complexity. This paper presents the design of a code generation algorithm implemented in the queue compiler infrastructure to achieve high code density by using a queue-based instruction set processor. We evaluate the efficiency of our code generation technique by comparing the code size and extracted parallelism for a set of embedded applications against a set of conventional embedded processors. The compiled code is, on average, 12.03% more compact than MIPS16 code, and 45.1% more compact than ARM/Thumb code. In addition, we show that the queue compiler, without optimizations, can deliver about 1.16 times more parallelism than fully optimized code for a register machine.

Microprocessors and Microsystems | 2009

Design and implementation of a queue compiler

Arquimedes Canedo; Ben A. Abderazek; Masahiro Sowa

Queue processors are a viable alternative for high-performance embedded computing and parallel processing. We present the design and implementation of a compiler for a queue-based processor. Instructions of a queue processor implicitly reference their operands, making the programs free of false dependencies. Compiling for a queue machine differs from traditional compilation methods for register machines. The queue compiler is responsible for scheduling the program in a level-order manner to expose natural parallelism and for calculating the instructions' relative offset values to access their operands. This paper describes the phases and data structures used in the queue compiler to compile C programs into assembly code for the QueueCore, an embedded queue processor. Experimental results demonstrate that our compiler produces good code in terms of parallelism and code size when compared to code produced by a traditional compiler for a RISC processor.
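
Level-order scheduling, as mentioned in this abstract, can be sketched for the simplest case of a pure expression tree. This is not the queue compiler's algorithm: the tuple-based tree format, the opcode names, and the function `level_order_schedule` are assumptions for illustration, and the offset calculation needed when a value is shared (a DAG rather than a tree) is deliberately left out. The sketch groups nodes by depth and emits the deepest level first, left to right.

```python
def level_order_schedule(tree):
    """Emit hypothetical queue instructions for an expression tree, deepest level first.

    A tree is either a variable name (str) or a tuple (op, left, right).
    """
    levels = []  # levels[d] holds the instructions for nodes at depth d

    def visit(node, depth):
        if len(levels) <= depth:
            levels.append([])
        if isinstance(node, tuple):
            op, left, right = node
            levels[depth].append((op, None))
            visit(left, depth + 1)   # left before right keeps each level in
            visit(right, depth + 1)  # left-to-right order
        else:
            levels[depth].append(("ld", node))

    visit(tree, 0)
    # Deepest level runs first so operands are queued before their consumers.
    return [instr for level in reversed(levels) for instr in level]


expr = ("mul", ("add", "a", "b"), ("sub", "c", "d"))  # (a + b) * (c - d)
schedule = level_order_schedule(expr)
print(schedule)
```

Instructions that come from the same level have no dependencies on one another, so each level is a natural group of parallel operations.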

International Symposium on Parallel Architectures, Algorithms and Networks | 2008

Quantitative Evaluation of Common Subexpression Elimination on Queue Machines

Arquimedes Canedo; Masahiro Sowa; Ben A. Abderazek

The queue computation model is a novel alternative for high-performance architectures. Compiling for queue machines requires a different approach than compiling for traditional architectures. We have solved the problem of generating correct code with the queue compiler infrastructure. In this paper we introduce some problems encountered when optimizing code for queue machines. Common-subexpression elimination (CSE) is a widely used optimization to improve execution time. This paper makes a quantitative evaluation of how this optimization affects the characteristics of queue programs. We have found that, on average, 28% of instructions are eliminated and the critical path is reduced by 15%. We determine how enlarging the scope of compilation from expressions to basic blocks affects the distribution of offsetted instructions.
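
The instruction savings that CSE can produce are easy to see on a single expression. The following is a rough sketch under assumptions, not the evaluation method of the paper: expressions are hypothetical nested tuples `(op, left, right)` with variable names as leaves, one instruction is counted per node, and structurally identical subtrees (including repeated loads) are treated as one shared node.

```python
def subtrees(tree):
    """Yield every subtree of a nested-tuple expression, including the root."""
    yield tree
    if isinstance(tree, tuple):
        _, left, right = tree
        yield from subtrees(left)
        yield from subtrees(right)


# (a + b) * (a + b): the subexpression a + b appears twice.
expr = ("mul", ("add", "a", "b"), ("add", "a", "b"))

total = sum(1 for _ in subtrees(expr))  # one instruction per tree node
unique = len(set(subtrees(expr)))       # identical subtrees collapse into one
eliminated = total - unique
print(total, unique, eliminated)  # 7 4 3
```

After elimination the expression is a DAG rather than a tree, so a shared result has to be reached by more than one consumer. On a queue machine that is presumably where offset references come in, which is consistent with the abstract's interest in the distribution of offsetted instructions.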

Parallel and Distributed Computing: Applications and Technologies | 2007

New Code Generation Algorithm for QueueCore: An Embedded Processor with High ILP

Arquimedes Canedo; Ben A. Abderazek; Masahiro Sowa

Modern architectures rely on exploiting parallelism found at the instruction level to achieve high performance. Aggressive ILP compilers expose high amounts of instruction-level parallelism where, in some cases, the number of architected registers is not enough to hold the results of potential parallel instructions. This paper presents a new code generation scheme for the QueueCore, a 32-bit queue-based architecture capable of executing high amounts of ILP. QueueCore's instructions implicitly read their operands and write results. Compiling for the QueueCore requires that all instructions have at most one explicit operand, represented as an offset calculated at compile time. Additionally, the instructions must be scheduled in a level-order manner. The proposed algorithm successfully restricts all instructions to at most one offset reference, computes the offset values, and performs a level-order scheduling of the program. To evaluate the effectiveness of the new code generation scheme, we developed a queue compiler and compiled a set of benchmark programs. Our results show that the code has more parallelism than optimized RISC code by factors ranging from 1.12 to 2.30. QueueCore's instruction set allows us to generate code about 18% to 40% denser than optimized RISC code.
