Sanath Jayasena
University of Moratuwa
Publications
Featured research published by Sanath Jayasena.
IEEE International Conference on High Performance Computing, Data, and Analytics | 2013
Sanath Jayasena; Saman P. Amarasinghe; Asanka Abeyweera; Gayashan Amarasinghe; Himeshi De Silva; Sunimal Rathnayake; Xiaoqiao Meng; Yanbin Liu
False sharing is a major class of performance bugs in parallel applications. Detecting false sharing is difficult as it does not change the program semantics. We introduce an efficient and effective approach for detecting false sharing based on machine learning. We develop a set of mini-programs in which false sharing can be turned on and off. We then run the mini-programs both with and without false sharing, collect a set of hardware performance event counts and use the collected data to train a classifier. We can use the trained classifier to analyze data from arbitrary programs for detection of false sharing. Experiments with the PARSEC and Phoenix benchmarks show that our approach is indeed effective. We detect published false sharing regions in the benchmarks with zero false positives. Our performance penalty is less than 2%. Thus, we believe that this is an effective and practical method for detecting false sharing.
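The training-and-detection pipeline described above can be sketched as follows. The event names, the synthetic counts, and the nearest-centroid classifier are illustrative stand-ins chosen for brevity, not the authors' actual feature set or model.

```python
# Sketch: train a classifier on hardware performance event counts from
# mini-programs run with and without false sharing, then classify counts
# from an arbitrary program. Events (e.g. cache misses, HITM snoops,
# remote accesses per instruction) and values below are invented.
with_fs = [[9.1, 7.8, 5.2], [8.7, 8.1, 5.5], [9.4, 7.5, 4.9]]      # false sharing ON
without_fs = [[1.2, 0.4, 0.3], [1.0, 0.5, 0.2], [1.3, 0.3, 0.4]]   # false sharing OFF

def centroid(rows):
    """Mean vector of a list of equal-length count vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def dist2(a, b):
    """Squared Euclidean distance between two count vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

centroids = {"false-sharing": centroid(with_fs), "clean": centroid(without_fs)}

def classify(counts):
    """Label counts from an arbitrary program by the nearest centroid."""
    return min(centroids, key=lambda label: dist2(counts, centroids[label]))

print(classify([8.9, 7.2, 5.0]))   # resembles the false-sharing profile
print(classify([1.1, 0.6, 0.2]))   # resembles the clean profile
```

In practice the paper's classifier is trained offline once on the mini-programs; detection on a new program then only needs one profiled run, which is where the under-2% overhead comes from.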
International Parallel and Distributed Processing Symposium | 2015
Sanath Jayasena; Milinda Fernando; Tharindu Rusira; Chalitha Perera; Chamara Philips
We address the problem of tuning the performance of the Java Virtual Machine (JVM) with run-time flags (parameters), using the HotSpot JVM in our study. As the HotSpot JVM comes with over 600 flags to choose from, selecting a subset manually to maximize performance is infeasible. In prior work, the potential performance improvement is limited by the fact that only a subset of the tunable flags is tuned. We adopt a different approach and present the HotSpot Auto-tuner, which considers the entire JVM and the effect of all the flags. To the best of our knowledge, ours is the first auto-tuner for optimizing the performance of the JVM as a whole. We organize the JVM flags into a tree structure by building a flag hierarchy, which helps us resolve dependencies among aspects of the JVM such as garbage-collector algorithms and JIT compilation, and reduces the configuration search space. Experiments with the SPECjvm2008 and DaCapo benchmarks show that we can optimize the HotSpot JVM with significant speedup: 16 SPECjvm2008 startup programs were improved by an average of 19%, with three of them improved dramatically by 63%, 51% and 32%, within a maximum tuning time of 200 minutes each. Based on a minimum tuning time of 200 minutes, the average performance improvement for 13 DaCapo benchmark programs is 26%, with 42% being the maximum improvement.
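The idea behind the flag hierarchy can be sketched with a toy example: flags are grouped under the JVM component they belong to, so flags of an unselected branch (say, a garbage collector that is not in use) drop out of the search space. The flag names below are a tiny illustrative subset, not HotSpot's real 600+ flag set.

```python
# Toy flag hierarchy: choosing a GC branch exposes only that branch's
# child flags, shrinking the configuration search space versus a flat
# list of every flag. All names here are illustrative.
hierarchy = {
    "UseParallelGC": ["ParallelGCThreads", "MaxGCPauseMillis"],
    "UseG1GC": ["G1HeapRegionSize", "MaxGCPauseMillis"],
}
shared_flags = ["TieredCompilation", "InlineSmallCode"]

def tunable_flags(selected_gc):
    """Flags worth tuning once a GC branch has been chosen."""
    return [selected_gc] + hierarchy[selected_gc] + shared_flags

# A naive flat search space tunes every flag regardless of dependencies.
flat = sorted({f for kids in hierarchy.values() for f in kids}
              | set(hierarchy) | set(shared_flags))

print(len(flat))                      # every flag, dependencies ignored
print(len(tunable_flags("UseG1GC")))  # hierarchy-pruned space
```

With hundreds of real flags and deep dependencies the pruning is far more substantial than in this two-branch toy, which is what makes whole-JVM tuning tractable within the stated time budgets.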
International Parallel and Distributed Processing Symposium | 2016
Miyuru Dayarathna; Isuru Herath; Yasima Dewmini; Gayan Mettananda; Sameera Nandasiri; Sanath Jayasena; Toyotaro Suzumura
Linked-data mining has become one of the key problems in HPC graph mining in recent years. However, existing RDF database engines are not scalable and are less reliable in heterogeneous clouds. In this paper we describe the design and implementation of Acacia-RDF, a scalable distributed RDF graph database engine developed in the X10 programming language to address this issue. Acacia-RDF partitions RDF data sets into subgraphs following the vertex-cut paradigm. The partitioned data sets are persisted on secondary storage across X10 places. We developed a scalable SPARQL processor for Acacia-RDF that operates on top of the partitioned RDF data. Furthermore, we demonstrate the implementation of scalable graph algorithms, such as triangle counting, over the partitioned data sets. We present performance results gathered from Acacia-RDF with different scales of the LUBM RDF benchmark data sets and compare Acacia-RDF's performance against the Neo4j graph database server. From scalability experiments conducted on up to 16 X10 places, we observed that Acacia-RDF scales well with the LUBM data sets. Acacia-RDF reported approximately 2 seconds of elapsed time on 4 places for running the first and third queries of the LUBM benchmark on the LUBM scale-40 data set. Through this work we introduce the use of the X10 language for scalable RDF graph data management.
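The vertex-cut paradigm mentioned above assigns *edges* (here, RDF triples reduced to subject–object pairs) to partitions and replicates any vertex that ends up in several partitions. The toy graph, deterministic placement rule, and two-place setup below are invented for illustration; a real engine balances edge counts across places.

```python
# Toy sketch of vertex-cut partitioning: edges are placed on partitions,
# and vertices spanning partitions are replicated. The graph and the
# placement function are illustrative only.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d"), ("d", "a")]
NUM_PLACES = 2  # stand-in for X10 places

def place(u, v):
    # Deterministic toy placement; real engines balance edge counts.
    return (ord(u) + ord(v)) % NUM_PLACES

partitions = {p: [] for p in range(NUM_PLACES)}
for u, v in edges:
    partitions[place(u, v)].append((u, v))

# A vertex is replicated once per partition in which it appears; the
# replication factor measures the storage cost of the cut.
copies = {}
for part in partitions.values():
    for vtx in {x for e in part for x in e}:
        copies[vtx] = copies.get(vtx, 0) + 1

replication_factor = sum(copies.values()) / len(copies)
print(partitions)
print(replication_factor)
```

Low replication factors keep cross-place communication down, which is what lets per-partition algorithms such as triangle counting scale across places.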
International Conference on Performance Engineering | 2017
Sajith Ravindra; Miyuru Dayarathna; Sanath Jayasena
Elastic scaling of event stream processing systems has gained significant attention recently due to the prevalence of cloud computing technologies. We investigate the complexities associated with elastic scaling of an event processing system in a private/public cloud scenario. We develop an Elastic Switching Mechanism (ESM) which reduces the overall average latency of event processing jobs by a significant amount, considering the cost of operating the system. The ESM is augmented with adaptive compression of upstream data. It conducts one of two types of switching, based on the characteristics of the query: either part of the data is sent to the public cloud (data switching) or a selected query is sent to the public cloud (query switching). We model the operation of the ESM as the function of two binary switching functions. We show that our elastic switching mechanism with compression handles out-of-order events more efficiently than techniques that do not involve compression. We used two application benchmarks, EmailProcessor and a Social Networking Benchmark (SNB2016), in multiple experiments to evaluate the effectiveness of our approach. In a single-query deployment with the EmailProcessor benchmark, our elastic switching mechanism provides a 1.24-second average latency improvement per processed event, a 16.70% improvement over a private-cloud-only deployment. When given the option of scaling EmailProcessor with four public cloud VMs, the ESM further reduced the average latency by 37.55% compared to a single public cloud VM. In a multi-query deployment with both EmailProcessor and SNB2016, we reduced the average latency of both queries by 39.61 seconds, a decrease of 7% in overall latency. These performance figures indicate that our elastic switching mechanism with compressed data streams can effectively reduce the average elapsed time of stream processing in private/public clouds.
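The "two binary switching functions" formulation can be sketched as below. The capacity threshold, the load numbers, and the relocatability test are invented for illustration; the paper derives the actual switching functions from query characteristics and operating cost.

```python
# Sketch of the ESM modeled as two binary switching functions: s_data
# (ship part of the data stream to the public cloud) and s_query (ship a
# whole selected query). Thresholds and inputs are illustrative only.
CAPACITY = 100  # events/sec the private cloud sustains (invented)

def s_data(load):
    """1 -> overflow data is routed to the public cloud."""
    return 1 if load > CAPACITY else 0

def s_query(query_is_relocatable, load):
    """1 -> a selected query is relocated to the public cloud."""
    return 1 if query_is_relocatable and load > CAPACITY else 0

def route(load, query_is_relocatable):
    sd, sq = s_data(load), s_query(query_is_relocatable, load)
    if sq:
        return "query switching"
    if sd:
        return "data switching"
    return "private cloud only"

print(route(80, False))
print(route(150, False))
print(route(150, True))
```

The appeal of the binary formulation is that the whole system state collapses to the pair (s_data, s_query), so the cost model only has to price four operating modes.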
Applications of Natural Language to Data Bases | 2016
T. Mokanarangan; T. Pranavan; U. Megala; N. Nilusija; Gihan Dias; Sanath Jayasena; Surangika Ranathunga
Morphology is the analysis of the internal structure of words, using grammatical features and properties. Like other Dravidian languages, Tamil is a highly agglutinative language with a rich morphology. Most current morphological analyzers for Tamil mainly use segmentation to deconstruct a word into all possible candidates, and then apply either grammar rules or tag matching during post-processing to select the best candidate. This paper presents a morphological engine for Tamil that uses grammar rules and an annotated corpus to generate all possible candidates. A support vector machine classifier is then employed to determine the most probable morphological deconstruction for a given word, using lexical labels, their respective frequency scores, average length and suffixes as features. Our system achieves an accuracy of 98.73% and an F-measure of 0.943, exceeding the results reported by other similar research.
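The candidate-selection step can be sketched as follows. A hand-weighted linear scorer stands in here for the paper's SVM, and the candidate deconstructions, corpus frequencies, and weights are all invented; only the feature types (frequency scores, average length, suffix count) come from the abstract.

```python
# Sketch: each candidate deconstruction of a word becomes a feature
# vector, and the best-scoring candidate is selected. A linear scorer
# stands in for the SVM; morphemes and frequencies are invented.
corpus_freq = {"pad": 90, "padi": 40, "kir": 60, "aan": 70, "ikkaan": 5}

def features(morphemes):
    """[avg corpus frequency, avg morpheme length, suffix count]."""
    avg_freq = sum(corpus_freq.get(m, 0) for m in morphemes) / len(morphemes)
    avg_len = sum(len(m) for m in morphemes) / len(morphemes)
    n_suffixes = len(morphemes) - 1  # everything after the root
    return [avg_freq, avg_len, n_suffixes]

def score(morphemes, w=(1.0, -0.5, 0.2)):  # illustrative weights
    return sum(wi * fi for wi, fi in zip(w, features(morphemes)))

# Two invented candidate deconstructions of one word.
candidates = [["pad", "kir", "aan"], ["padi", "ikkaan"]]
best = max(candidates, key=score)
print(best)
```

In the real system the weights are learned from the annotated corpus rather than hand-set, which is what the SVM training contributes.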
Grid Computing | 2010
K. A. T. A. Jayasekara; Sanath Jayasena
The Persistent Data Library (PDL) manages object persistence in C++ applications. PDL abstracts persistence features and provides an easy programming environment: it offers a set of data structures that handle persistence transparently. The data structures in PDL are quite similar to those in the C++ Standard Template Library (STL), but STL does not provide functionality to make data persistent. Such a library is beneficial when implementing fault tolerance for state-based applications, which need to checkpoint data periodically and, in case of a failure, recover it quickly. In developing such applications, each time a new state is introduced to the system the programmer must write code to serialize and de-serialize data; the PDL framework reduces this serialization and de-serialization code. Due to the direct memory-dumping technique PDL uses, the time taken to write data to disk and to recover it from storage is minimized.
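PDL itself is a C++ library; the Python sketch below only illustrates the checkpoint/recover pattern it supports, with `pickle` standing in for PDL's direct memory dumping. The `PersistentList` class and its API are inventions for this sketch, not PDL's interface.

```python
# Illustrative checkpoint/recover pattern: a container persists its
# contents so a restarted process resumes from the last checkpoint.
# pickle stands in for PDL's direct memory dumping.
import os
import pickle
import tempfile

class PersistentList:
    def __init__(self, path):
        self.path = path
        self.items = []
        if os.path.exists(path):        # fast recovery after a restart
            with open(path, "rb") as f:
                self.items = pickle.load(f)

    def append(self, item):
        self.items.append(item)

    def checkpoint(self):               # periodic state checkpoint
        with open(self.path, "wb") as f:
            pickle.dump(self.items, f)

path = os.path.join(tempfile.mkdtemp(), "state.bin")
lst = PersistentList(path)
lst.append("job-1")
lst.append("job-2")
lst.checkpoint()

recovered = PersistentList(path)        # simulated process restart
print(recovered.items)
```

The point of PDL's direct memory dump is to avoid exactly the per-field serialization that `pickle` performs here, which is where its speed advantage comes from.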
International Conference on Parallel Processing | 2017
Isuru Dilanka Fernando; Sanath Jayasena; Milinda Fernando; Hari Sundar
We present a scalable distributed-memory library for generating, and computing with, structured dense matrices, such as those produced by boundary integral equation formulations. Such matrices are dense, but have special structure that can be exploited to obtain efficient storage and matrix-vector product evaluations, and consequently the fast solution of linear systems. At the core of the methods we use is the observation that off-diagonal blocks of such matrices have low numerical rank, and that this property can be exploited in a multi-level fashion. In this work we focus on the Hierarchically Semi-Separable (HSS) representation. We present algorithms for building and using HSS representations that are parallelized using MPI and CUDA to leverage state-of-the-art heterogeneous clusters. The efficiency of our methods and implementation is demonstrated on large dense matrices obtained from a boundary integral equation formulation of the Laplace equation with Dirichlet boundary conditions. We demonstrate excellent (linear) scalability on up to 128 GPUs on 128 nodes. Our codes will lay the foundation for fast direct solvers for elliptic problems.
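The core observation, that off-diagonal blocks have low numerical rank, can be demonstrated on a single block of a kernel matrix. This is a one-block sketch with an invented 1/|x − y| kernel and well-separated point sets, not the library's multi-level HSS construction.

```python
# One-block illustration of the low-rank property behind HSS: an
# off-diagonal kernel block compresses via truncated SVD, giving cheap
# storage and matrix-vector products. Kernel and points are invented.
import numpy as np

n = 200
x = np.linspace(0.0, 1.0, n)                 # source points
y = np.linspace(2.0, 3.0, n)                 # well-separated targets
B = 1.0 / np.abs(x[:, None] - y[None, :])    # off-diagonal kernel block

U, s, Vt = np.linalg.svd(B)
k = int(np.sum(s > 1e-10 * s[0]))            # numerical rank at ~1e-10
Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k]

v = np.random.default_rng(0).standard_normal(n)
full = B @ v                                  # O(n^2) dense product
fast = Uk @ (sk * (Vtk @ v))                  # O(nk) compressed product

print(k)                                      # far smaller than n
print(np.allclose(full, fast, atol=1e-6))
```

The HSS representation applies this compression recursively to nested off-diagonal blocks, so the whole matrix is stored and applied in near-linear time rather than block by block.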
International Conference on Advances in ICT for Emerging Regions | 2016
Vimuth Fernando; Milinda Fernando; Tharindu Rusira; Sanath Jayasena
X10 is a programming language specifically designed with productivity and scalability in mind. In the era of distributed multi-core systems, X10 provides programmers a high-level abstraction which is an absolute necessity. In this paper we present an auto-tuning solution to enhance the performance of X10 programs that use the Java back-end. Our auto-tuner is based on OpenTuner, an extensible framework for building auto-tuning applications. We present improved running times for X10 benchmark programs that are shipped with X10 and for the well-known LULESH benchmark. The auto-tuning experiments recorded a maximum performance improvement of 50% for LULESH, while the average improvement for the set of benchmarks is 25%. We analyze the internal changes a Java Virtual Machine (JVM) undergoes as a result of our auto-tuning. Finally, a study of the input sensitivity of the tuned programs shows that our tuned JVM configurations retain their enhanced performance over varying input sizes of a program.
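The shape of such an auto-tuning loop can be sketched as below. OpenTuner's real search uses ensembles of techniques rather than pure random sampling, and the flag space and the synthetic "runtime" model here are invented for illustration.

```python
# Sketch of an auto-tuning loop: repeatedly draw a JVM configuration
# from the search space, "measure" the benchmark under it, and keep the
# best. The flags and the synthetic runtime model are invented.
import random

search_space = {
    "Xmx_mb": [512, 1024, 2048, 4096],
    "TieredCompilation": [True, False],
    "ParallelGCThreads": [1, 2, 4, 8],
}

def measure(cfg):
    """Stand-in for running the benchmark under cfg and timing it."""
    t = 10.0
    t -= 0.5 * search_space["Xmx_mb"].index(cfg["Xmx_mb"])
    t -= 1.0 if cfg["TieredCompilation"] else 0.0
    t -= 0.3 * search_space["ParallelGCThreads"].index(cfg["ParallelGCThreads"])
    return t

rng = random.Random(42)
best_cfg, best_time = None, float("inf")
for _ in range(100):
    cfg = {k: rng.choice(v) for k, v in search_space.items()}
    t = measure(cfg)
    if t < best_time:
        best_cfg, best_time = cfg, t

print(best_cfg, best_time)
```

Replacing `measure` with an actual timed benchmark run (and random draws with OpenTuner's search techniques) yields the structure of the tuner described in the paper.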
Conference on Intelligent Text Processing and Computational Linguistics | 2016
Pranavan Theivendiram; Megala Uthayakumar; Nilusija Nadarasamoorthy; Mokanarangan Thayaparan; Sanath Jayasena; Gihan Dias; Surangika Ranathunga
Named-Entity Recognition (NER) is widely used as a foundation for Natural Language Processing (NLP) applications. There have been a few previous attempts at building generic NER systems for the Tamil language. These attempts were based on machine-learning approaches such as Hidden Markov Models (HMM), Maximum Entropy Markov Models (MEMM), Support Vector Machines (SVM) and Conditional Random Fields (CRF). Among them, CRF has been proven the best with respect to the accuracy of NER in Tamil. This paper presents a novel approach to building a Tamil NER system using the Margin-Infused Relaxed Algorithm (MIRA). We also present a comparison of performance between the MIRA and CRF algorithms for Tamil NER. When the gazetteer, POS tags and orthographic features are used with the MIRA algorithm, it attains an F1-measure of 81.38% on the Tamil BBC news data, whereas the CRF algorithm shows only an F1-measure of 79.13% for the same set of features. Our NER system outperforms all previous NER systems for the Tamil language.
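The MIRA update at the heart of such a tagger can be sketched as follows: after each mistake, the weight vector moves by the smallest amount that makes the correct tag beat the predicted one by the loss margin. The 3-dimensional feature vectors and the tag names are invented for illustration.

```python
# Sketch of the Margin-Infused Relaxed Algorithm (single-best) update:
# w <- w + tau * (phi_correct - phi_predicted), with tau the smallest
# step achieving the required margin. Feature vectors are invented.
def mira_update(w, phi_correct, phi_predicted, loss=1.0):
    diff = [c - p for c, p in zip(phi_correct, phi_predicted)]
    margin = sum(wi * di for wi, di in zip(w, diff))
    norm2 = sum(d * d for d in diff)
    if norm2 == 0:
        return w
    tau = max(0.0, (loss - margin) / norm2)   # minimal corrective step
    return [wi + tau * di for wi, di in zip(w, diff)]

w = [0.0, 0.0, 0.0]
phi_gold = [1.0, 0.0, 1.0]   # features firing for the correct tag
phi_pred = [0.0, 1.0, 0.0]   # features firing for the wrong prediction
w = mira_update(w, phi_gold, phi_pred)

score = lambda phi: sum(wi * fi for wi, fi in zip(w, phi))
print(score(phi_gold) - score(phi_pred))  # correct tag now wins by the loss
```

Compared with a plain perceptron update, the minimal-step property makes MIRA less prone to overshooting on individual examples, which is one reason it is competitive with CRF training here.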
International Conference on Advances in ICT for Emerging Regions | 2015
Prashan B. Weerasinghe; Chathura De Silva; Sanath Jayasena
The RALU optimization research targeted the development of a soft processor capable of dynamically optimizing resource utilization and increasing processor throughput by changing its structure according to the running instruction. The RALU shows higher instruction gain and clock-cycle gain compared to an 8-bit microprocessor of similar scale. The RALU approach thus offers a solution for resource-critical FPGA-based designs by improving resource utilization and providing higher processor throughput.