Publications


Featured research published by Hasitha Muthumala Waidyasooriya.


IEEE Transactions on Parallel and Distributed Systems | 2016

Hardware-Acceleration of Short-Read Alignment Based on the Burrows-Wheeler Transform

Hasitha Muthumala Waidyasooriya; Masanori Hariyama

The alignment of millions of short DNA fragments to a large genome is a very important task in modern computational biology. However, software-based DNA sequence alignment takes many hours to complete. This paper proposes an FPGA-based hardware accelerator to reduce the alignment time. We apply a data encoding scheme that reduces the data size by 96 percent, and propose a pipelined hardware decoder to decode the data. We also design customized data paths to efficiently use the limited bandwidth of the DDR3 memories. The proposed accelerator can align a few hundred million short DNA fragments in an hour by using 80 processing elements in parallel. The proposed accelerator has the same mapping quality as the software-based methods.
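The paper's 96-percent encoding scheme is not detailed in this abstract. As a purely hypothetical illustration of why compact encoding matters for DNA data, the classic trick of packing the four bases into 2 bits each (versus 8-bit ASCII) can be sketched as:

```python
# Hypothetical sketch: pack DNA bases into 2 bits each (4 bases per byte).
# This simple scheme gives a 75% reduction versus ASCII; the paper's own
# encoding, which reaches 96 percent, is not described in the abstract.
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BITS_TO_BASE = {v: k for k, v in BASE_TO_BITS.items()}

def pack(seq: str) -> bytes:
    """Pack a DNA string into 2 bits per base."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | BASE_TO_BITS[base]
        byte <<= 2 * (4 - len(chunk))   # left-align a partial final group
        out.append(byte)
    return bytes(out)

def unpack(data: bytes, n: int) -> str:
    """Recover the first n bases from packed data."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BITS_TO_BASE[(byte >> shift) & 0b11])
    return "".join(bases[:n])
```

A hardware decoder like the paper's pipelined one would perform the `unpack` direction, one group of bases per clock cycle.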


International Symposium on Circuits and Systems | 2012

FPGA implementation of heterogeneous multicore platform with SIMD/MIMD custom accelerators

Hasitha Muthumala Waidyasooriya; Yasuhiro Takei; Masanori Hariyama; Michitaka Kameyama

Heterogeneous multi-core architectures with CPUs and accelerators attract much attention since they can achieve power-efficient computing in various areas, from low-power embedded processing to high-performance computing. Since the optimal architecture differs from application to application, it is important to explore suitable architectures for each application. In this paper, we propose an FPGA-based heterogeneous multi-core platform with custom accelerators for power-efficient computing. Our platform allows selecting the most suitable accelerator according to the requirements of an application. Moreover, we optimize the number of ALUs, the memory, and the interconnection network of the selected accelerators to increase performance and reduce power consumption. Experimental results with simple media processing applications show that the platform is more power-efficient than the GPU.


IEEE Transactions on Circuits and Systems for Video Technology | 2011

Memory Allocation Exploiting Temporal Locality for Reducing Data-Transfer Bottlenecks in Heterogeneous Multicore Processors

Hasitha Muthumala Waidyasooriya; Yosuke Ohbayashi; Masanori Hariyama; Michitaka Kameyama

High-performance and low-power very large-scale integration (VLSI) circuits are required to implement complex media processing applications on mobile devices. Heterogeneous multicore processors are a promising way to achieve this objective. They contain multiple accelerator cores and CPU cores to increase the processing speed. Since media processing applications access a huge amount of data, fast address generation is very important. To increase the address generation speed, accelerator cores contain address generation units (AGUs). To reduce power consumption, the AGUs have limited hardware resources such as adders and counters. Therefore, the AGUs generate simple addressing patterns where the address increases linearly in each clock cycle. Media processing applications frequently encounter addressing patterns where the same data are accessed in different time slots. To implement such addressing patterns, the same data have to be allocated to multiple memory addresses in such a way that those addresses can be generated by the AGUs. Allocation of the same data to multiple addresses is called "data duplication." Data duplication increases the data-transfer time and also the total processing time significantly. To remove such data-transfer bottlenecks, this paper proposes a memory allocation method that exploits the temporal and spatial locality of memory accesses in media processing applications. We evaluate the proposed method using media processing applications to validate its effectiveness. According to the results, the proposed method reduces the total processing time by 14% to more than 85% compared to previous works.
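The data-duplication problem described above can be made concrete with a small, hypothetical sketch (the AGU model and the access pattern here are invented for illustration, not taken from the paper):

```python
# Hypothetical sketch: a resource-limited AGU can only emit linear
# address sequences of the form base + i * stride, one per clock cycle.
def linear_agu(base: int, stride: int, count: int):
    return [base + i * stride for i in range(count)]

# A sliding-window kernel reads elements [0,1,2, 1,2,3, 2,3,4].
# This is not a single linear sequence, so with a linear-only AGU the
# overlapping elements must be stored twice ("data duplication") so
# that a plain sweep over addresses 0..8 reads them in order.
window_pattern = [j for i in range(3) for j in range(i, i + 3)]

duplicated_slots = len(window_pattern)        # 9 memory slots needed
distinct_values = len(set(window_pattern))    # but only 5 distinct values
```

The paper's allocation method aims to lay data out so that the AGU-generatable addresses cover such reuse patterns with far less duplication.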


International Conference of the IEEE Engineering in Medicine and Biology Society | 2013

Implementation of a custom hardware-accelerator for short-read mapping using Burrows-Wheeler alignment

Hasitha Muthumala Waidyasooriya; Masanori Hariyama; Michitaka Kameyama

The mapping of millions of short DNA fragments to a large genome is a great challenge in modern computational biology. Usually, it takes many hours or days to map a large genome using software. However, the recent progress of programmable hardware such as field-programmable gate arrays (FPGAs) provides a cost-effective solution to this challenge. FPGAs contain millions of programmable logic gates, enabling the design of massively parallel accelerators. This paper proposes a hardware architecture to accelerate short-read mapping using Burrows-Wheeler alignment. The speed-up of the proposed architecture is estimated to be at least 10 times that of the equivalent software implementation.
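The core loop that such hardware pipelines is the backward search over the Burrows-Wheeler transform (BWT). A minimal, unoptimized software sketch of that algorithm (a real mapper uses precomputed rank structures rather than the naive counting shown here):

```python
# Illustrative sketch of Burrows-Wheeler backward search, the kernel
# that the proposed hardware accelerates.
def bwt(text: str) -> str:
    """Burrows-Wheeler transform of text (text must end with '$')."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rot[-1] for rot in rotations)

def backward_search(bwt_str: str, pattern: str) -> int:
    """Count occurrences of pattern via LF-mapping over the BWT."""
    first_col = sorted(bwt_str)
    # C[c] = number of characters in the text lexicographically < c
    C = {c: first_col.index(c) for c in set(bwt_str)}

    def occ(c: str, i: int) -> int:     # occurrences of c in bwt_str[:i]
        return bwt_str[:i].count(c)     # naive; hardware uses rank tables

    lo, hi = 0, len(bwt_str)
    for c in reversed(pattern):         # extend the match right-to-left
        if c not in C:
            return 0
        lo = C[c] + occ(c, lo)
        hi = C[c] + occ(c, hi)
        if lo >= hi:
            return 0
    return hi - lo
```

Each pattern character costs one LF-mapping step, which is why a deeply pipelined hardware unit can sustain one alignment step per clock.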


IEICE Transactions on Electronics | 2008

Multi-Context FPGA Using Fine-Grained Interconnection Blocks and Its CAD Environment

Hasitha Muthumala Waidyasooriya; Weisheng Chong; Masanori Hariyama; Michitaka Kameyama

SUMMARY: Dynamically-programmable gate arrays (DPGAs) promise lower-cost implementations than conventional field-programmable gate arrays (FPGAs) since they efficiently reuse limited hardware resources in time. One of the typical DPGA architectures is a multi-context FPGA (MC-FPGA) that requires multiple memory bits per configuration bit to realize fast context switching. However, these additional memory bits cause significant overhead in area and power consumption. This paper presents a novel switch-element architecture to reduce the required capacity of the configuration memory. Our main idea is to exploit redundancy between different contexts by using a fine-grained switch element. The proposed MC-FPGA is designed in a 0.18 µm CMOS technology. Its maximum clock frequency and context-switching frequency are measured to be 310 MHz and 272 MHz, respectively. Moreover, a novel CAD process that exploits the redundancy in configuration data is proposed to support the MC-FPGA architecture.

Key words: dynamically-programmable gate array, multi-context FPGA, configuration data redundancy


Annual ACIS International Conference on Computer and Information Science | 2016

FPGA-based deep-pipelined architecture for FDTD acceleration using OpenCL

Hasitha Muthumala Waidyasooriya; Masanori Hariyama

Accelerating the FDTD (finite-difference time-domain) computation is very important for electromagnetic simulations. Conventional FDTD acceleration methods using multicore CPUs and GPUs share a common problem of memory-bandwidth limitation due to the large amount of parallel data access. Although FPGAs have the potential to solve this problem, very long design, testing, and debugging times are required to implement an architecture successfully. To solve this problem, we propose an FPGA architecture designed using the C-like programming language OpenCL (open computing language). As a result, the design time is short and extensive hardware-design knowledge is not required. We implemented the proposed architecture on an FPGA and achieved over 114 GFLOPS of processing power. We also achieved more than 13 times and 4 times speed-up compared to CPU and GPU implementations, respectively.
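The stencil that the deep pipeline streams through can be shown with a minimal 1D FDTD sketch (a toy in normalized units with an invented Gaussian source; the paper's kernel is a full OpenCL implementation, not this code):

```python
# Minimal 1D FDTD sketch: staggered E/H updates on a Yee grid.
# Normalized units; the Courant factor 0.5 keeps this toy scheme stable.
import math

def fdtd_1d(steps: int, n: int = 200, src: int = 100):
    ez = [0.0] * n          # electric field
    hy = [0.0] * n          # magnetic field
    for t in range(steps):
        for i in range(n - 1):              # H update (half-step offset)
            hy[i] += 0.5 * (ez[i + 1] - ez[i])
        for i in range(1, n):               # E update
            ez[i] += 0.5 * (hy[i] - hy[i - 1])
        ez[src] += math.exp(-((t - 30.0) / 10.0) ** 2)  # Gaussian pulse
    return ez
```

Every grid point needs its neighbors each step, which is exactly the parallel-data-access pattern that saturates CPU/GPU memory bandwidth and that an FPGA pipeline can instead stream through on-chip.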


Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing | 2017

Automatic Optimization of OpenCL-Based Stencil Codes for FPGAs

Tsukasa Endo; Hasitha Muthumala Waidyasooriya; Masanori Hariyama

Recently, a C-based OpenCL design environment has been proposed for designing FPGA (field-programmable gate array) accelerators. Although many C programs can be executed on FPGAs, the best C code for a CPU may not be the most appropriate one for an FPGA. Users must have some knowledge of computer architecture to write good OpenCL code. In addition, the OpenCL-based design process requires several hours of compilation time, so rewriting and compiling many different OpenCL codes requires a very large design time. To solve this problem, we propose an automatic optimization method. We accurately predict the kernel performance using the log files generated at the initial stage of compilation. Then we find the optimized FPGA architecture by searching all possible design parameters. We apply the proposed method to find the optimized architecture for stencil computation. According to the results, the design time is reduced to 6-11% of that of the conventional approach.
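The search strategy can be sketched as follows. Both the parameters (`unroll`, `simd`) and the performance model below are invented for illustration; the paper derives its model from the compiler's early-stage log files rather than from a closed-form formula:

```python
# Hypothetical sketch: score every candidate design with a cheap
# predicted-performance model instead of a multi-hour full compile.
from itertools import product

def predicted_time(unroll: int, simd: int, resources: int = 100_000) -> float:
    """Invented model: parallelism helps until FPGA resources run out."""
    used = 5_000 * unroll * simd        # made-up resource cost per unit
    if used > resources:
        return float("inf")             # design does not fit on the FPGA
    return 1.0 / (unroll * simd)        # ideal inverse-parallelism time

def best_design(unrolls=(1, 2, 4, 8), simds=(1, 2, 4)):
    """Exhaustively search the (small) design space by predicted time."""
    return min(product(unrolls, simds), key=lambda p: predicted_time(*p))
```

Because each candidate is scored analytically, the whole space is swept in microseconds; only the single winning design needs the full compilation.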


Annual ACIS International Conference on Computer and Information Science | 2016

Architecture of an FPGA accelerator for molecular dynamics simulation using OpenCL

Hasitha Muthumala Waidyasooriya; Masanori Hariyama; Kota Kasahara

Molecular dynamics (MD) simulations are very important for studying the physical properties of atoms and molecules. However, a huge amount of processing time is required to simulate even a few nanoseconds of an actual experiment. Although hardware acceleration using FPGAs provides promising results, huge design time and hardware-design skills are required to implement an accelerator successfully. In this paper, we propose an FPGA accelerator designed using C-based OpenCL. We achieved over 4.6 times speed-up compared to CPU-based processing while using only 36% of the Stratix V FPGA resources. A maximum of 18.4 times speed-up is possible by using 80% of the FPGA resources.
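The pairwise force loop that dominates MD runtime, and that such an accelerator pipelines, can be sketched in one dimension with the Lennard-Jones potential (an illustrative toy, not the paper's kernel):

```python
# Illustrative sketch: the O(N^2) pairwise force loop at the heart of MD.
# Lennard-Jones potential in reduced units: U(r) = 4 (r^-12 - r^-6),
# so the force magnitude is f(r) = -dU/dr = 24 (2 r^-13 - r^-7).
def lj_forces(positions):
    n = len(positions)
    forces = [0.0] * n
    for i in range(n):
        for j in range(i + 1, n):
            r = positions[j] - positions[i]     # signed 1D separation
            inv6 = 1.0 / r ** 6
            f = 24.0 * (2.0 * inv6 * inv6 - inv6) / r
            forces[i] -= f                      # Newton's third law:
            forces[j] += f                      # equal and opposite
    return forces
```

Every pair interaction is independent, which is why a deep FPGA pipeline can evaluate one pair per clock cycle.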


Asia Pacific Conference on Circuits and Systems | 2014

Efficient data transfer scheme using word-pair-encoding-based compression for large-scale text-data processing

Hasitha Muthumala Waidyasooriya; Daisuke Ono; Masanori Hariyama; Michitaka Kameyama

Large-scale data processing is very common in many fields such as data mining and genome mapping. To accelerate such processing, graphics processing units (GPUs) and FPGAs (field-programmable gate arrays) are used. However, the large data-transfer time between the accelerator and the host computer is a huge performance bottleneck. In this paper, we use a word-pair-encoding method to compress the data down to 25% of its original size. The encoded data can be decoded from any position without decoding the whole data file. For some algorithms, the encoded data can even be processed without decoding. Using a Burrows-Wheeler-algorithm-based text search, we show that the data amount and transfer time can be reduced by over 70%.


Archive | 2018

Exploiting the Memory Hierarchy

Hasitha Muthumala Waidyasooriya; Masanori Hariyama; Kunio Uchiyama

When designing an FPGA accelerator, efficient memory access is extremely important to obtain high performance. The OpenCL for FPGA uses a hierarchical memory structure that contains memories with different capacities, bandwidths and latencies. Therefore, using those different memories efficiently is very challenging. This chapter explains the problems of memory access and their solutions.
