Publication


Featured research published by Sandhya Harikumar.


International Conference on Signal Processing | 2015

Hybridized fragmentation of very large databases using clustering

Sandhya Harikumar

Owing to the ever-growing need to manage huge volumes of data, together with the desire for consistent, scalable, reliable and efficient retrieval of information, an intelligent mechanism for designing the storage structure of distributed databases has become inevitable. The two critical facets of distributed databases are data fragmentation and allocation. Existing fragmentation techniques are based on the frequency and type of queries as well as on statistics of the empirical data; however, very little work has been done on fragmenting data according to the patterns in the tuples and the attributes responsible for those patterns. This paper presents a unique approach to hybridized fragmentation that applies a subspace clustering algorithm to produce a set of fragments partitioning the data with respect to tuples as well as attributes. Projected clustering determines clusters in the subspaces of high-dimensional data; this makes it possible to find the closely correlated attributes for different sets of instances, thereby yielding good hybridized fragments for distributed databases. Experimental results show that fragmenting the database based on clustering reduces database access time compared to fragments chosen at design time using certain statistics.
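As a sketch of the idea, the assignment step of subspace-based fragmentation can be illustrated as follows. The helper `fragment`, its parameters, and the toy data are all hypothetical; a real projected-clustering algorithm would also select the subspace per cluster rather than take it as input.

```python
# Hypothetical sketch: tuples are clustered using only a chosen attribute
# subset, and each resulting cluster becomes one fragment holding the
# tuples projected onto that subset.
from collections import defaultdict

def fragment(rows, subspace, k, iters=10):
    """Cluster `rows` (lists of numbers) on the attribute indices in
    `subspace`, returning k fragments (lists of projected tuples)."""
    # Initialise centroids from the first k rows, projected onto the subspace.
    cents = [[rows[i][j] for j in subspace] for i in range(k)]
    for _ in range(iters):
        groups = defaultdict(list)
        for row in rows:
            p = [row[j] for j in subspace]
            # Assign to the nearest centroid by squared Euclidean distance.
            c = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, cents[i])))
            groups[c].append(p)
        # Recompute each non-empty centroid as the mean of its members.
        for i, pts in groups.items():
            cents[i] = [sum(col) / len(col) for col in zip(*pts)]
    return [groups[i] for i in range(k)]

rows = [[0, 0, 9], [0, 1, 8], [10, 10, 1], [10, 11, 2]]
frags = fragment(rows, subspace=[0, 1], k=2)
```

On this toy table the two fragments separate the first two tuples from the last two, keeping only the two subspace attributes in each fragment.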


International Conference on Data Science and Engineering | 2016

Apriori algorithm for association rule mining in high dimensional data

Sandhya Harikumar; Divya Usha Dilipkumar

Apriori is one of the best-known algorithms for learning association rules. Due to the explosion of data, the storage and retrieval mechanisms of the various database paradigms have revolutionized the technologies and methodologies used in their architecture. As a result, the database is no longer used merely for information retrieval but also for analytical inference over the data. It is therefore essential to find association rules in high-dimensional data, because correlations amongst the attributes can give deeper insight into the data and aid decision making, recommendations, and the reorganization of data for effective retrieval. The traditional Apriori algorithm is computationally expensive and infeasible on high-dimensional datasets. Hence we propose a variant of the Apriori algorithm that uses QR decomposition to reduce the dimensionality, and thereby the complexity, of the traditional algorithm.
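For reference, the classic level-wise Apriori that the paper builds on can be sketched as below. This is a plain-Python toy; the QR-based dimensionality-reduction step proposed in the paper is not shown.

```python
# Classic Apriori sketch: grow frequent itemsets level by level, pruning
# any candidate whose support falls below the threshold.
def apriori(transactions, min_support):
    """Return every itemset whose support (a fraction) meets min_support."""
    tx = [set(t) for t in transactions]

    def support(s):
        return sum(s <= t for t in tx) / len(tx)

    # Level 1: single items.
    candidates = [frozenset([i]) for i in sorted({i for t in tx for i in t})]
    frequent = {}
    while candidates:
        survivors = [s for s in candidates if support(s) >= min_support]
        frequent.update((s, support(s)) for s in survivors)
        size = len(candidates[0]) + 1
        # Level k+1 candidates: unions of surviving sets, one item larger.
        candidates = list({a | b for a in survivors for b in survivors
                           if len(a | b) == size})
    return frequent

freq = apriori([["a", "b"], ["a", "c"], ["a", "b", "c"]], min_support=0.6)
```

With the threshold 0.6, the pair {a, b} survives (support 2/3) while {b, c} (support 1/3) is pruned, and no candidate containing it is ever generated.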


Advances in Computing and Communications | 2016

An adaptive distributed approach of a self organizing map model for document clustering using ring topology

M. Ajeissh; Sandhya Harikumar

Document clustering aims at grouping documents that are internally coherent while differing substantially across groups. Given the huge number of available documents, clustering faces scalability and accuracy issues; moreover, there is a dearth of tools that cluster such voluminous data efficiently. Conventional models focus on either a fully centralized or a fully distributed approach to document clustering. This paper therefore proposes a novel approach to document clustering that modifies the conventional Self-Organizing Map (SOM). The contribution of this work is threefold: a distributed approach to pre-processing the documents, an adaptive bottom-up approach to document clustering, and a neighbourhood model suited to a ring topology for document clustering. Experiments on real datasets and comparison with the traditional SOM show the efficacy of the proposed approach.
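A minimal SOM with a ring (circular one-dimensional) neighbourhood can be sketched as follows. All parameter values and the toy document vectors are illustrative; the paper's adaptive, distributed extensions are not modelled here.

```python
# Toy ring-topology SOM: nodes sit on a circle, and each training sample
# pulls its best-matching unit and the ring neighbours toward itself.
import math
import random

def bmu(x, nodes):
    """Index of the best-matching unit (nearest node) for input x."""
    return min(range(len(nodes)),
               key=lambda i: sum((a - w) ** 2 for a, w in zip(x, nodes[i])))

def train_ring_som(data, n_nodes=8, epochs=50, lr=0.5, radius=2):
    random.seed(0)                      # reproducible toy initialisation
    dim = len(data[0])
    nodes = [[random.random() for _ in range(dim)] for _ in range(n_nodes)]
    for _ in range(epochs):
        for x in data:
            b = bmu(x, nodes)
            for i in range(n_nodes):
                # Distance along the ring wraps around.
                d = min(abs(i - b), n_nodes - abs(i - b))
                if d <= radius:
                    h = math.exp(-d * d / (2.0 * radius ** 2))
                    nodes[i] = [w + lr * h * (a - w)
                                for w, a in zip(nodes[i], x)]
    return nodes

docs = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]]  # toy document vectors
som = train_ring_som(docs)
```

After training, the two toy document groups map to different regions of the ring, which is the behaviour the clustering relies on.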


IEEE Recent Advances in Intelligent Computational Systems | 2013

Implementation of projected clustering based on SQL queries and UDFs in relational databases

Sandhya Harikumar; H. Haripriya; M. R. Kaimal

Projected clustering is a clustering approach that determines clusters in the subspaces of high-dimensional data. Although it is possible to efficiently cluster a very large dataset outside a relational database, the time and effort to export and import it can be significant. Commercial RDBMSs offer no SQL query for any type of subspace clustering, even though subspace clustering is well suited to large databases with high dimensionality and large numbers of records. Integrating clustering into a relational DBMS using SQL is an important and challenging problem in today's world of Big Data. Projected clustering can find the closely correlated dimensions and the clusters in the corresponding subspaces. We have designed an SQL version of projected clustering that obtains the clusters of records in the database using a single SQL statement, which in turn calls other SQL functions we define. We used the PostgreSQL DBMS to validate our implementation and experimented with both synthetic and real data.
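The dimension-selection step that projected clustering adds on top of ordinary clustering can be sketched in plain Python. In the paper this logic is expressed through SQL and UDFs; the helper below is purely illustrative.

```python
# Illustrative sketch: for each cluster, keep the l dimensions along which
# its members are most tightly packed (lowest variance). These are the
# "closely correlated" dimensions that define the cluster's subspace.
def relevant_dims(cluster_points, l):
    dim = len(cluster_points[0])
    n = len(cluster_points)
    variances = []
    for j in range(dim):
        col = [p[j] for p in cluster_points]
        mean = sum(col) / n
        variances.append((sum((v - mean) ** 2 for v in col) / n, j))
    # Return the indices of the l lowest-variance dimensions, in order.
    return sorted(j for _, j in sorted(variances)[:l])

pts = [[0, 5, 100], [0, 6, -40], [0, 5, 70]]
dims = relevant_dims(pts, 2)  # [0, 1]: the third dimension varies wildly
```

In an SQL formulation the same per-column variances would come from aggregate queries, one `VAR_POP`-style aggregate per candidate attribute.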


International Conference on Data Science and Engineering | 2014

Semantic integration of heterogeneous relational schemas using multiple L1 linear regression and SVD

Sandhya Harikumar; R. Reethima; M. R. Kaimal

Semantic integration of heterogeneous databases is a critical area of interest due to the scalability of data and the need to share existing data as technology advances. Schema-level heterogeneity of the relations is the major obstacle to such integration. Though various approaches to schema analysis, transformation and integration have been explored, they sometimes become too general to solve the problem, especially when the data is very high-dimensional and the schema information is unavailable or inadequate. In this paper, a method is proposed to integrate heterogeneous relational schemas at the instance level rather than the schema level. A global schema is designed that integrates the most relevant attributes of the different relational schemas of a particular domain. To find the significant attributes, multiple linear regression based on the L1 norm and Singular Value Decomposition (SVD) is applied to the data iteratively. This is a variant of L1-PCA, an efficient, effective and meaningful method of linear subspace estimation. The most prominent instance-level similarity is found by identifying the most significant attributes of each relational data source and then measuring the similarity among those attributes using the L1 norm. An integrated schema is thus created that maps the relevant attributes of each local schema to a global schema.
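The final matching step can be sketched as follows. The helper `match_columns` and its greedy strategy are illustrative stand-ins; the paper's L1-regression/SVD attribute-ranking step is not shown.

```python
# Illustrative sketch: once each source's significant attributes are
# known, pair up columns across sources by minimal L1 distance between
# their value vectors.
def l1(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))

def match_columns(cols_a, cols_b):
    """Greedy 1-1 matching of columns (equal-length numeric lists) by L1
    distance; assumes len(cols_a) <= len(cols_b)."""
    pairs, used = [], set()
    for i, u in enumerate(cols_a):
        j = min((j for j in range(len(cols_b)) if j not in used),
                key=lambda j: l1(u, cols_b[j]))
        used.add(j)
        pairs.append((i, j))
    return pairs

cols_a = [[1, 2, 3], [10, 20, 30]]
cols_b = [[11, 19, 31], [1, 2, 4]]
pairs = match_columns(cols_a, cols_b)  # [(0, 1), (1, 0)]
```

Here the first column of schema A is matched to the second column of schema B because their instances are closest under the L1 norm, despite the columns appearing in different positions.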


International Conference on Data Science and Engineering | 2016

MapReduce model for k-medoid clustering

Sandhya Harikumar; Shabana Shamsuddin Thaha

Distributed and parallel computing are the best alternatives for scalable clustering of huge amounts of data with moderate to high dimensionality, together with improved speedup. In this paper we address k-medoid clustering using the MapReduce framework for distributed computing on commodity machines and evaluate its efficacy. There are two main issues to be tackled: how to distribute the data for efficient clustering, and how to minimize the I/O and network cost among the machines. The main contributions of this paper are: (a) a MapReduce methodology for distributed k-medoid clustering; (b) a reduction in overall execution time and in the overhead of moving data from one site to another, leading to sublinear scaleup and speedup. The approach proves efficient, as the local clusterings can be carried out independently of one another. Experimental analysis on millions of records using just 10 cores in parallel shows that clustering a dataset of size 1M × 17 requires only 4 minutes. Such low transmission cost and low bandwidth requirements lead to improved speedup and scaleup on distributed data.
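One k-medoid assignment/update round maps naturally onto MapReduce. A toy single-process sketch (function names and the distance measure are illustrative, not the paper's exact formulation):

```python
# Toy MapReduce-style round of k-medoid clustering: the "map" phase
# assigns each point to its nearest medoid; the "reduce" phase picks,
# per group, the member that minimises the total in-group distance.
from collections import defaultdict

def dist(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def kmedoid_round(points, medoids):
    # Map: emit (medoid index, point) pairs, grouped by key.
    groups = defaultdict(list)
    for p in points:
        key = min(range(len(medoids)), key=lambda i: dist(p, medoids[i]))
        groups[key].append(p)
    # Reduce: new medoid of each group = member with least total distance.
    return [min(g, key=lambda c: sum(dist(c, p) for p in g))
            for g in (groups[i] for i in sorted(groups))]

points = [(0, 0), (1, 0), (0, 1), (9, 9), (10, 9)]
medoids = kmedoid_round(points, [(0, 0), (9, 9)])
```

Because each group's reduce step needs only its own members, the per-group work can run on separate machines, which is what keeps data movement low in the distributed setting.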


International Conference on Contemporary Computing | 2016

Logistic regression within DBMS

Jackson Isaac; Sandhya Harikumar

The aim of this paper is an analytical query model for data categorization within the DBMS. Since the DBMS is the data asset of most organizations, classification can give better insight into, and control over, the data. Conventionally, classification algorithms such as logistic regression and KNN are applied after exporting the data out of the DBMS, using non-DBMS tools such as R, matrix packages, generic data mining programs, or large-scale systems like Hadoop and Spark. However, this incurs I/O overhead, since the data within the DBMS is updated quite frequently and usually cannot be accommodated in main memory. This paper proposes an alternative strategy, based on SQL and user-defined functions (UDFs), to integrate logistic regression for data categorization as well as prediction query processing within the DBMS. A comparison of SQL with UDFs and with statistical packages like R is presented through experiments on real datasets. The empirical results show the viability and validity of this approach for predicting the class of a given query.
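The gradient sums that the paper computes with SQL aggregate queries and UDFs can be sketched in plain Python. Hyperparameters and the toy data are illustrative.

```python
# Batch-gradient logistic regression sketch; the per-row terms summed in
# the inner loop are exactly the kind of aggregate an SQL query can
# compute inside the DBMS.
import math

def train_logreg(X, y, lr=0.5, epochs=200):
    w = [0.0] * (len(X[0]) + 1)          # last entry is the bias term
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + w[-1]
            p = 1.0 / (1.0 + math.exp(-z))       # sigmoid
            for j, xj in enumerate(xi):
                grad[j] += (p - yi) * xj
            grad[-1] += p - yi
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad)]
    return w

def predict(w, x):
    z = sum(wj * xj for wj, xj in zip(w, x)) + w[-1]
    return 1 if z > 0 else 0

w = train_logreg([[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1])
```

On this separable toy set the learned boundary falls between 1 and 2, so low inputs classify as 0 and high inputs as 1.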


Ingénierie des Systèmes d'Information | 2015

A Method to Induce Indicative Functional Dependencies for Relational Data Model

Sandhya Harikumar; R. Reethima

The relational model is one of the most extensively used database models. With contemporary technologies, however, high-dimensional data, whether structured or unstructured, must be analyzed for knowledge interpretation. One significant aspect of such analysis is exploring the relationships that exist between the attributes of high-dimensional data. In the relational model, the integrity constraints corresponding to these relationships are captured by functional dependencies. Processing high-dimensional data to discover all functional dependencies is computationally expensive. The functional dependencies of the most prominent attributes are of particular use and can reduce the search space of dependencies to be examined. In this paper we propose a regression model to find the most prominent attributes of a given relation. The functional dependencies of these prominent attributes are then discovered; they are indicative and can be obtained in much less time.
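Once candidate attributes are chosen, checking whether a functional dependency actually holds in the data is straightforward. A minimal sketch (the regression-based ranking of prominent attributes is not shown):

```python
# Check a candidate functional dependency X -> Y against a relation:
# the dependency holds iff rows that agree on X also agree on Y.
def holds(rows, lhs, rhs):
    """rows: list of tuples; lhs/rhs: lists of attribute indices."""
    seen = {}
    for row in rows:
        key = tuple(row[i] for i in lhs)
        val = tuple(row[i] for i in rhs)
        # setdefault stores the first Y seen for this X; a later mismatch
        # is a counterexample.
        if seen.setdefault(key, val) != val:
            return False
    return True

r = [("a1", "b1", "c1"), ("a1", "b1", "c2"), ("a2", "b2", "c3")]
holds(r, [0], [1])  # True: equal A always implies equal B
holds(r, [0], [2])  # False: a1 maps to both c1 and c2
```

Restricting `lhs`/`rhs` to the prominent attributes is what shrinks the otherwise exponential space of candidate dependencies.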


International Conference on Data Science and Engineering | 2014

SQL-MapReduce hybrid approach towards distributed projected clustering

Sandhya Harikumar; M. Shyju; M. R. Kaimal

Clustering high-dimensional data is a major challenge in data mining due to the inherent complexity and sparsity of the data. Projected clustering is a clustering approach that determines clusters in the subspaces of such high-dimensional data. However, projected clustering within a DBMS is computationally expensive in time and space when the volume of records reaches terabytes, petabytes and beyond. This expense becomes a hurdle especially when clustering of transactional data is used as a preprocessing step for other tasks such as frequent decision making, efficient indexing, and compression. Hence, parallelizing and distributing expensive clustering tasks becomes attractive for the speedup it brings and for the increased memory available in a computing cluster. To achieve this, we propose an SQL-MapReduce hybrid approach for scalable projected clustering.


IEEE Recent Advances in Intelligent Computational Systems | 2013

NSB-TREE for an efficient multidimensional indexing in non-spatial databases

Sandhya Harikumar; Ambili Vinay

Query processing of high-dimensional data with huge volumes of records, especially in non-spatial domains, requires an efficient multidimensional index. Present DBMS versions use single-dimension indexing at multiple levels, or indexing based on compound keys formed by concatenating the key values of the required attributes. The underlying structures, data models and query languages are insufficient for retrieving information over data that is complex in both dimensionality and size. This paper designs an efficient indexing structure for multidimensional data access in the non-spatial domain. The new structure evolves from the R-tree, with certain preprocessing steps applied to the non-spatial data. The proposed model, the NSB-tree (Non-Spatial Block tree), is balanced, performs better than traditional B-trees, and has less complicated algorithms than the UB-tree, with linear space complexity and logarithmic time complexity. The main aim of the NSB-tree is multidimensional indexing that eliminates the need for multiple secondary indexes and concatenated keys; R-trees cannot index non-spatial data in the available DBMSs, and our structure replaces an arbitrary number of secondary indexes with a single multicolumn index. It was implemented, and its feasibility checked, using the PostgreSQL database.
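As a point of comparison, the UB-tree mentioned above linearises a multidimensional key with a Z-order (Morton) code so that a one-dimensional structure can index multiple columns. A two-attribute sketch (not the NSB-tree itself):

```python
# Z-order (Morton) encoding of a two-attribute key: interleaving the bits
# of both attributes yields one integer key that preserves locality, so a
# single one-dimensional index can serve multicolumn lookups.
def morton2(x, y, bits=16):
    """Interleave the low `bits` bits of two non-negative ints."""
    z = 0
    for b in range(bits):
        z |= ((x >> b) & 1) << (2 * b)        # x occupies even bit positions
        z |= ((y >> b) & 1) << (2 * b + 1)    # y occupies odd bit positions
    return z

key = morton2(3, 0)  # 0b0101 == 5
```

Unlike a concatenated compound key, which favours the leading attribute, the interleaved key treats both attributes symmetrically, which is why Z-order indexes support range queries on either column.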

Collaboration


Dive into Sandhya Harikumar's collaborations.

Top Co-Authors

R. Reethima (Amrita Vishwa Vidyapeetham)

A.S. Akhil (Amrita Vishwa Vidyapeetham)

Ambili Vinay (Amrita Vishwa Vidyapeetham)

H. Haripriya (Amrita Vishwa Vidyapeetham)

Jackson Isaac (Amrita Vishwa Vidyapeetham)

M. Ajeissh (Amrita Vishwa Vidyapeetham)

M. Shyju (Amrita Vishwa Vidyapeetham)