Saliya Ekanayake | Researchain

Publication

Featured research published by Saliya Ekanayake.

BMC Bioinformatics | 2010

Hybrid cloud and cluster computing paradigms for life science applications

Judy Qiu; Jaliya Ekanayake; Thilina Gunarathne; Jong Youl Choi; Seung-Hee Bae; Hui Li; Bingjing Zhang; Tak-Lon Wu; Yang Ruan; Saliya Ekanayake; Adam Hughes; Geoffrey C. Fox

Background: Clouds and MapReduce have proven broadly useful for scientific computing, especially for parallel data-intensive applications. However, they have limited applicability to areas such as data mining, because MapReduce performs poorly on problems with the iterative structure found in the linear algebra that underlies much data analysis. Such problems can be run efficiently on clusters using MPI, leading to a hybrid cloud and cluster environment. This motivates the design and implementation of Twister, an open source Iterative MapReduce system.
Results: Comparisons of Amazon, Azure, and traditional Linux and Windows environments on common applications have shown encouraging performance and usability in several important non-iterative cases. These are linked to MPI applications for the final stages of the data analysis. Further, we have released the open source Twister Iterative MapReduce and benchmarked it against basic MapReduce (Hadoop) and MPI in information retrieval and life sciences applications.
Conclusions: The hybrid cloud (MapReduce) and cluster (MPI) approach offers an attractive production environment, while Twister promises a uniform programming environment for many life sciences applications.
Methods: We used the commercial clouds Amazon and Azure and the NSF resource FutureGrid to perform detailed comparisons and evaluations of different approaches to data-intensive computing. Several applications were developed in MPI, MapReduce and Twister in these different environments.
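
To make the "iterative structure" mentioned in this abstract concrete, here is a minimal single-process sketch of an iterative map/reduce loop, using one-dimensional K-means as the example problem. It is an illustration only: the function names (`map_phase`, `reduce_phase`, `iterative_kmeans`) are hypothetical and are not Twister's or Hadoop's API, and K-means is chosen here as a familiar iterative algorithm, not because the abstract names it.

```python
from collections import defaultdict


def map_phase(points, centroids):
    """Map: emit (index of nearest centroid, point) for every point."""
    for p in points:
        nearest = min(range(len(centroids)), key=lambda k: abs(p - centroids[k]))
        yield nearest, p


def reduce_phase(pairs, centroids):
    """Reduce: average the points assigned to each centroid.

    A centroid that received no points keeps its previous value.
    """
    groups = defaultdict(list)
    for k, p in pairs:
        groups[k].append(p)
    return [
        sum(groups[k]) / len(groups[k]) if groups[k] else c
        for k, c in enumerate(centroids)
    ]


def iterative_kmeans(points, centroids, max_iters=100, tol=1e-9):
    """Repeat map + reduce until the centroids stop moving.

    Every pass re-reads the same points; only the small centroid list
    changes between iterations. Returns (centroids, iterations run).
    """
    iterations = 0
    for _ in range(max_iters):
        iterations += 1
        updated = reduce_phase(map_phase(points, centroids), centroids)
        shift = max(abs(a - b) for a, b in zip(updated, centroids))
        centroids = updated
        if shift <= tol:
            break
    return centroids, iterations


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
    print(iterative_kmeans(data, [0.0, 5.0]))
```

Each pass through the loop is a complete map and reduce over the same input, which is the pattern the abstract says basic MapReduce handles poorly.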

International Parallel and Distributed Processing Symposium | 2016

Towards High Performance Processing of Streaming Data in Large Data Centers

Supun Kamburugamuve; Saliya Ekanayake; Milinda Pathirage; Geoffrey C. Fox

Smart devices, mobile robots, ubiquitous sensors, and other connected devices in the Internet of Things (IoT) increasingly require real-time computations beyond their hardware limits to process the events they capture. Leveraging cloud infrastructures for these computational demands is a pattern adopted in the IoT community as one solution, which has led to a class of Dynamic Data Driven Applications (DDDA). These applications offload computations to the cloud through Distributed Stream Processing Frameworks (DSPF) such as Apache Storm. While DSPFs are efficient in computations, current implementations barely meet the strict low latency requirements of large scale DDDAs due to inefficient inter-process communication. This research implements efficient highly scalable communication algorithms and presents a comprehensive study of performance, taking into account the nature of these applications and characteristics of the cloud runtime environments. It further reduces communication costs within a node using an efficient shared memory approach. These algorithms are applicable in general to existing DSPFs and the results show significant improvements in latency over the default implementation in Apache Storm.

Proceedings of the ACM Conference on Bioinformatics, Computational Biology and Biomedicine | 2012

DACIDR: deterministic annealed clustering with interpolative dimension reduction using a large collection of 16S rRNA sequences

Yang Ruan; Saliya Ekanayake; Mina Rho; Haixu Tang; Seung-Hee Bae; Judy Qiu; Geoffrey C. Fox

Recent advances in next generation sequencing (NGS) techniques have enabled the direct analysis of the genetic information within a whole microbial community, bypassing the culturing of individual microbial species in the lab. One can profile the 16S rRNA marker genes encoded in the sample by amplifying highly variable regions in the genes and sequencing them with Roche/454 sequencers to generate from half a million to a few million 16S rRNA fragments of about 400 base pairs. The main computational challenge in analyzing such data is to group these sequences into operational taxonomic units (OTUs). Common clustering algorithms (such as hierarchical clustering) require quadratic space and time, which makes them unsuitable for large datasets with millions of sequences. An alternative is to use greedy heuristic clustering methods (such as CD-HIT and UCLUST); although these enable fast sequence analysis, the hard-cutoff similarity threshold set for them and the random starting seeds can result in reduced accuracy and overestimation (too many clusters). In this paper, we propose DACIDR: a parallel sequence clustering and visualization pipeline, which can address the overestimation problem along with the space and time complexity issues while giving robust results. The pipeline starts with a parallel pairwise sequence alignment analysis followed by a deterministic annealing method for both clustering and dimension reduction. No explicit similarity threshold is needed in the clustering process. Experiments with our system also showed that the quadratic time and space complexity issue could be solved with a novel heuristic method called Sample Sequence Partition Tree (SSP-Tree), which allowed us to interpolate millions of sequences with sub-quadratic time and linear space requirements. Furthermore, SSP-Tree can speed up fine-tuning of an existing result, which made recursive clustering possible in order to achieve accurate local results. Our experiments showed that DACIDR produced a more reliable result than two popular greedy heuristic clustering methods.

Applied and Environmental Microbiology | 2016

Phylogenetically Structured Differences in rRNA Gene Sequence Variation among Species of Arbuscular Mycorrhizal Fungi and Their Implications for Sequence Clustering

Geoffrey L. House; Saliya Ekanayake; Yang Ruan; Ursel M. E. Schütte; Wittaya Kaonongbua; Geoffrey C. Fox; Yuzhen Ye; James D. Bever

ABSTRACT Arbuscular mycorrhizal (AM) fungi form mutualisms with plant roots that increase plant growth and shape plant communities. Each AM fungal cell contains a large amount of genetic diversity, but it is unclear if this diversity varies across evolutionary lineages. We found that sequence variation in the nuclear large-subunit (LSU) rRNA gene from 29 isolates representing 21 AM fungal species generally assorted into genus- and species-level clades, with the exception of species of the genera Claroideoglomus and Entrophospora. However, there were significant differences in the levels of sequence variation across the phylogeny and between genera, indicating that it is an evolutionarily constrained trait in AM fungi. These consistent patterns of sequence variation across both phylogenetic and taxonomic groups pose challenges to interpreting operational taxonomic units (OTUs) as approximations of species-level groups of AM fungi. We demonstrate that the OTUs produced by five sequence clustering methods using 97% or equivalent sequence similarity thresholds failed to match the expected species of AM fungi, although OTUs from AbundantOTU, CD-HIT-OTU, and CROP corresponded better to species than did OTUs from mothur or UPARSE. This lack of OTU-to-species correspondence resulted both from sequences of one species being split into multiple OTUs and from sequences of multiple species being lumped into the same OTU. The OTU richness therefore will not reliably correspond to the AM fungal species richness in environmental samples. Conservatively, this error can overestimate species richness by 4-fold or underestimate richness by one-half, and the direction of this error will depend on the genera represented in the sample.
IMPORTANCE Arbuscular mycorrhizal (AM) fungi form important mutualisms with the roots of most plant species. Individual AM fungi are genetically diverse, but it is unclear whether the level of this diversity differs among evolutionary lineages. We found that the amount of sequence variation in an rRNA gene that is commonly used to identify AM fungal species varied significantly between evolutionary groups that correspond to different genera, with the exception of two genera that are genetically indistinguishable from each other. When we clustered groups of similar sequences into operational taxonomic units (OTUs) using five different clustering methods, these patterns of sequence variation caused the number of OTUs to either over- or underestimate the actual number of AM fungal species, depending on the genus. Our results indicate that OTU-based inferences about AM fungal species composition from environmental sequences can be improved if they take these taxonomically structured patterns of sequence variation into account.

Cluster Computing and the Grid | 2014

Integration of Clustering and Multidimensional Scaling to Determine Phylogenetic Trees as Spherical Phylograms Visualized in 3 Dimensions

Yang Ruan; Geoffrey L. House; Saliya Ekanayake; Ursel M. E. Schütte; James D. Bever; Haixu Tang; Geoffrey C. Fox

Phylogenetic analysis is commonly used to analyze genetic sequence data from fungal communities, while ordination and clustering techniques are commonly used to analyze sequence data from bacterial communities. However, few studies have attempted to link these two independent approaches. In this paper, we propose a method, which we call the spherical phylogram (SP), to display the phylogenetic tree within the clustering and visualization result from a pipeline called DACIDR. In contrast to traditional tree display methods, the correlations between the tree and the clustering can be observed directly. In addition, we propose an algorithm called interpolative joining (IJ) to construct and visualize the SP in 3D space. In the experiments, we used the sum of branch lengths to quantify the general fit between the clustering and the phylogenetic tree in the SP, and Mantel tests to determine how well the same grouping of sequences was preserved between the clustering and the SP. Our results show that DACIDR has a classification accuracy similar to that of a phylogenetic tree generated using a multiple sequence alignment, while having much lower computational cost.

Grid Computing | 2010

Performance of Windows Multicore Systems on Threading and MPI

Judy Qiu; Scott Beason; Seung-Hee Bae; Saliya Ekanayake; Geoffrey C. Fox

We present performance results on a Windows cluster with up to 768 cores using MPI and two variants of threading – CCR and TPL. CCR (Concurrency and Coordination Runtime) presents a message based interface while TPL (Task Parallel Library) allows for loops to be automatically parallelized. MPI is used between the cluster nodes (up to 32) and either threading or MPI for parallelism on the 24 cores of each node. We use a simple matrix multiplication kernel as well as a significant bioinformatics gene clustering application. We find that the two threading models offer similar performance with MPI outperforming both at low levels of parallelism but threading much better when the grain size (problem size per process) is small. We find better performance on Intel compared to AMD on comparable 24 core systems. We develop simple models for the performance of the clustering code.

WBDB | 2015

Big Data, Simulations and HPC Convergence

Geoffrey C. Fox; Judy Qiu; Shantenu Jha; Saliya Ekanayake; Supun Kamburugamuve

Two major trends in computing systems are the growth in high performance computing (HPC), in particular through an international exascale initiative, and the growth in big data, with an accompanying cloud infrastructure of dramatic and increasing size and sophistication. In this paper, we study an approach to convergence for software and applications/algorithms and show what hardware architectures it suggests. We start by dividing applications into data plus model components and classifying each component (whether from Big Data or Big Compute) in the same way. This leads to 64 properties divided into 4 views, which are Problem Architecture (Macro pattern); Execution Features (Micro patterns); Data Source and Style; and finally the Processing (runtime) View. We discuss convergence software built around HPC-ABDS (High Performance Computing enhanced Apache Big Data Stack), show how one can merge Big Data and HPC (Big Simulation) concepts into a single stack, and discuss appropriate hardware.

BMC Bioinformatics | 2012

Interpolative multidimensional scaling techniques for the identification of clusters in very large sequence sets

Adam Hughes; Yang Ruan; Saliya Ekanayake; Seung-Hee Bae; Qunfeng Dong; Mina Rho; Judy Qiu; Geoffrey C. Fox

Background: Modern pyrosequencing techniques make it possible to study complex bacterial populations, such as 16S rRNA, directly from environmental or clinical samples without the need for laboratory purification. Alignment of sequences across the resultant large data sets (100,000+ sequences) is of particular interest for the purpose of identifying potential gene clusters and families, but such analysis represents a daunting computational task. The aim of this work is the development of an efficient pipeline for the clustering of large sequence read sets.
Methods: Pairwise alignment techniques are used here to calculate genetic distances between sequence pairs. These methods are pleasingly parallel and have been shown to reflect genetic distances in highly variable regions of rRNA genes more accurately than traditional multiple sequence alignment (MSA) approaches do. By utilizing Needleman-Wunsch (NW) pairwise alignment in conjunction with novel implementations of interpolative multidimensional scaling (MDS), we have developed an effective method for visualizing massive biosequence data sets and quickly identifying potential gene clusters.
Results: This study demonstrates the use of interpolative MDS to obtain clustering results that are qualitatively similar to those obtained through full MDS, but with substantial cost savings. In particular, the wall clock time required to cluster a set of 100,000 sequences has been reduced from seven hours to less than one hour through the use of interpolative MDS.
Conclusions: Although work remains to be done in selecting the optimal training set size for interpolative MDS, substantial computational cost savings will allow us to cluster much larger sequence sets in the future.
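
As a rough illustration of the pairwise step described in the Methods above, the following is a minimal sketch of Needleman-Wunsch global alignment and a distance derived from it. The scoring values (match +1, mismatch -1, linear gap -2) and the percent-identity distance are assumptions made for this example; the abstract does not state which scoring scheme or distance definition the authors used.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align a and b; return (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


def identity_distance(a, b):
    """Assumed distance: 1 minus the fraction of identical aligned columns."""
    _, x, y = needleman_wunsch(a, b)
    if not x:
        return 0.0
    same = sum(1 for p, q in zip(x, y) if p == q)
    return 1.0 - same / len(x)


if __name__ == "__main__":
    print(needleman_wunsch("ACGT", "AGT"))
    print(identity_distance("ACGT", "AGT"))
```

Because each pair of sequences is aligned independently, a full distance matrix can be filled by distributing pairs across workers, which is what "pleasingly parallel" refers to in the abstract.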

Concurrency and Computation: Practice and Experience | 2014

Visualizing the Protein Sequence Universe

Larissa Stanberry; Roger Higdon; Winston Haynes; Natali Kolker; William Broomall; Saliya Ekanayake; Adam Hughes; Yang Ruan; Judy Qiu; Eugene Kolker; Geoffrey C. Fox

Modern biology is experiencing a rapid increase in data volumes that challenges our analytical skills and existing cyberinfrastructure. Exponential expansion of the protein sequence universe (PSU), the protein sequence space, together with the costs and complexities of manual curation creates a major bottleneck in life sciences research. Existing resources lack scalable visualization tools that are instrumental for functional annotation. Here, we describe a new visualization tool using multidimensional scaling to create a 3D embedding of the protein space. The advantages of the proposed PSU method include the ability to scale to large numbers of sequences, integrate different similarity measures with other functional and experimental data, and facilitate protein annotation. We applied the method to visualize the prokaryotic PSU using sequence alignment scores. As an annotation example, we used the interpolation approach to map the set of annotated archaeal proteins into the prokaryotic PSU. Transdisciplinary approaches akin to the one described in this paper are urgently needed to quickly and efficiently translate the influx of new data into tangible innovations and groundbreaking discoveries.

International Conference on Big Data | 2016

Java thread and process performance for parallel machine learning on multicore HPC clusters

Saliya Ekanayake; Supun Kamburugamuve; Pulasthi Wickramasinghe; Geoffrey C. Fox

The growing use of Big Data frameworks on large machines highlights the importance of performance issues and the value of High Performance Computing (HPC) technology. This paper looks carefully at three major frameworks, Spark, Flink and Message Passing Interface (MPI), both in scaling across nodes and internally over the many cores inside modern nodes. We focus on the special challenges of the Java Virtual Machine (JVM) using an Intel Haswell HPC cluster with 24 cores per node. Two parallel machine learning algorithms, K-Means clustering and Multidimensional Scaling (MDS), are used in our performance studies. We identify three major issues (thread models, affinity patterns, and communication mechanisms) that affect performance by large factors, and show how to optimize them so that Java can match the performance of traditional HPC languages like C. Further, we suggest approaches that preserve the user interface and elegant dataflow approach of Flink and Spark but modify the runtime so that these Big Data frameworks can achieve excellent performance and realize the goals of HPC-Big Data convergence.
