Saptarsi Goswami
University of Calcutta
Publications
Featured research published by Saptarsi Goswami.
Ain Shams Engineering Journal | 2016
Saptarsi Goswami; Sanjay Chakraborty; Sanhita Ghosh; Amlan Chakrabarti; Basabi Chakraborty
Thousands of human lives are lost every year around the globe, apart from significant damage to property, animal life, etc., due to natural disasters such as earthquakes, floods, tsunamis, hurricanes and other storms, landslides, cloudbursts, heat waves and forest fires. In this paper, we review the data mining and analytical techniques designed so far for (i) prediction, (ii) detection, and (iii) development of appropriate disaster management strategies based on data collected from disasters. A detailed description is given of the availability of data from geological observatories (seismological, hydrological), satellites, remote sensing and newer sources such as social networking sites like Twitter. An extensive and in-depth literature study of current techniques for disaster prediction, detection and management has been carried out, and the results are summarized according to the various types of disasters. Finally, a framework is proposed for building, in a phased manner, a disaster management database for India hosted on an open-source Big Data platform such as Hadoop. The study has a special focus on India, which ranks among the top five countries in terms of the absolute number of human lives lost.
International Journal of Computer Applications | 2013
Subhajit DeySarakar; Saptarsi Goswami
Text classification has become much more relevant with the increased volume of unstructured data from various sources, and several techniques have been developed for it. The high dimensionality of the feature space is one of the established problems in text classification, and feature selection is one of the techniques used to reduce dimensionality. Feature selection helps to increase classifier performance, reduce overfitting, speed up construction and testing of the classification model, and make models more interpretable. This paper presents an empirical study comparing the performance of several feature selection techniques (chi-squared, information gain, mutual information and symmetrical uncertainty) employed with different classifiers: naive Bayes, SVM, decision tree and k-NN. The motivation of the paper is to present the results of these feature selection methods across various classifiers on text data sets. The study also allows the relative performance of the classifiers and the methods to be compared.
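The filter-then-classify pipeline the study compares can be sketched with scikit-learn. This is a minimal illustration, not the paper's experimental setup: two of the surveyed filter criteria (chi-squared and mutual information) are paired with two of the mentioned classifiers (naive Bayes and k-NN) on a synthetic non-negative data set standing in for a text corpus.

```python
# Hedged sketch: filter-style feature selection followed by classification,
# as surveyed in the paper. Data set and parameter choices are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=50,
                           n_informative=10, random_state=0)
X = np.abs(X)  # chi2 requires non-negative features, as in term-count data

for name, scorer in [("chi2", chi2), ("mutual_info", mutual_info_classif)]:
    X_sel = SelectKBest(scorer, k=10).fit_transform(X, y)  # keep top 10 features
    for clf in (MultinomialNB(), KNeighborsClassifier()):
        acc = cross_val_score(clf, X_sel, y, cv=5).mean()
        print(f"{name} + {type(clf).__name__}: {acc:.3f}")
```

Information gain and symmetrical uncertainty follow the same pattern; only the scoring function plugged into `SelectKBest` changes.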
Expert Systems With Applications | 2017
Saptarsi Goswami; Amit Kumar Das; Amlan Chakrabarti; Basabi Chakraborty
Highlights:
- FCTFS works in both autonomous and user-guided modes.
- The defined taxonomy helps in arriving at an optimal number of good-quality clusters.
- Feature elimination due to irrelevance and due to redundancy is clearly isolated.
- It is faster than traditional search-based methods.
- It yields superior results compared to some state-of-the-art methods over 24 data sets.

Feature subset selection is basically an optimization problem: choosing the most important features from various alternatives in order to facilitate classification or mining problems. Though many algorithms have been developed so far, none is considered the best for all situations, and researchers are still trying to come up with better solutions. In this work, a flexible and user-guided feature subset selection algorithm named FCTFS (Feature Cluster Taxonomy based Feature Selection) is proposed for selecting a suitable feature subset from a large feature set. The proposed algorithm falls under the genre of clustering-based feature selection techniques, in which features are initially clustered according to their intrinsic characteristics following the filter approach. In the second step, the most suitable feature is selected from each cluster to form the final subset following a wrapper approach. This two-stage hybrid process lowers the computational cost of subset selection, especially for large feature sets. One of the main novelties of the proposed approach lies in the process of determining the optimal number of feature clusters. Unlike currently available methods, which mostly employ a trial-and-error approach, the proposed method characterises and quantifies the feature clusters according to the quality of the features inside them and defines a taxonomy of the feature clusters. The selection of individual features from a feature cluster can then be done judiciously, considering both relevancy and redundancy according to the user's intention and requirements.
The algorithm has been verified by simulation experiments with different benchmark data sets containing from 10 to more than 800 features, and compared with other currently used feature selection algorithms. The simulation results demonstrate the superiority of our proposal in terms of model performance, flexibility of use in practical problems, and extendibility to large feature sets. Though the current proposal is verified in the domain of unsupervised classification, it can easily be used for supervised classification.
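The two-stage idea behind FCTFS can be sketched as follows. This is a hedged illustration, not the authors' implementation: features are first clustered by their correlation structure (filter step), then one representative per cluster is picked by its wrapper accuracy with a classifier. The cluster count is fixed here, whereas the paper derives it from the proposed feature-cluster taxonomy.

```python
# Hedged sketch of a cluster-then-wrap feature selection scheme in the
# spirit of FCTFS. Data set, cluster count and classifier are illustrative.
import numpy as np
from sklearn.datasets import load_wine
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

X, y = load_wine(return_X_y=True)

# Filter step: cluster features by the similarity of their correlation profiles.
corr = np.abs(np.corrcoef(X, rowvar=False))          # feature-feature |correlation|
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(corr)

# Wrapper step: from each cluster, keep the single feature that best
# predicts the class on its own.
selected = []
for c in range(4):
    members = np.where(labels == c)[0]
    scores = [cross_val_score(KNeighborsClassifier(), X[:, [f]], y, cv=5).mean()
              for f in members]
    selected.append(int(members[int(np.argmax(scores))]))

print("selected features:", sorted(selected))
```

Because the expensive wrapper evaluation runs only once per cluster rather than over all subsets, the hybrid keeps the cost low on large feature sets, which is the efficiency argument made in the abstract.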
international conference on electronics computer technology | 2011
Samiran Ghosh; Saptarsi Goswami; Amlan Chakrabarti
Extract, Transform, Load (ETL) is an integral part of Data Warehousing (DW) implementations. The commercial tools used for this purpose capture a lot of execution trace in the form of various log files containing a plethora of information. However, there has hardly been any initiative in which proactive analysis has been performed on ETL logs to improve their efficiency. In this paper we use an outlier detection technique to find the processes that vary most from the group in terms of execution trace. As our experiment was carried out on actual production processes, we consider any outlier a signal rather than noise. To identify the input parameters for the outlier detection algorithm, we conducted a survey among a developer community with a varied mix of experience and expertise. We use simple text parsing to extract these features, as shortlisted by the survey, from the logs. Subsequently, we applied a clustering-based outlier detection technique to the logs. By this process we reduced our domain of detailed analysis from 500 logs to 44 (8.8%). Among the 5 outlier clusters, 2 represent genuine concerns, while the other 3 appear only because of the huge number of rows involved.
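Clustering-based outlier detection of the kind described can be sketched briefly. This is an assumed illustration, not the paper's code: the feature names are invented stand-ins for survey-selected log features, runs are clustered with k-means, and clusters holding only a small fraction of the runs are flagged for detailed analysis.

```python
# Hedged sketch: clustering-based outlier detection on ETL-log features.
# Feature names, thresholds and synthetic data are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# hypothetical per-run features: [duration_s, rows_processed, error_count]
logs = rng.normal(loc=[120.0, 1e5, 1.0], scale=[10.0, 1e4, 1.0], size=(500, 3))
logs[:5] *= 50                                   # inject a few anomalous runs

km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(logs)
sizes = np.bincount(km.labels_, minlength=5)
tiny = np.where(sizes < 0.05 * len(logs))[0]     # clusters with < 5% of runs
outliers = np.where(np.isin(km.labels_, tiny))[0]
print(f"flagged {len(outliers)} of {len(logs)} runs for detailed analysis")
```

The same shrinking effect the paper reports (500 logs down to 44) comes from examining only the members of the sparse clusters instead of every run.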
Archive | 2017
Amit Kumar Das; Saptarsi Goswami; Basabi Chakraborty; Amlan Chakrabarti
A graph-theoretic approach is presented in this paper to visually represent feature association in data sets. This visual representation, named the Feature Association Map (FAM), is based on the similarity between features measured using the pairwise Pearson product-moment correlation coefficient. Highly similar features appear as clusters in the graph visualization, and data sets in which many features belong to feature clusters indicate the possibility of strong feature association. The efficacy of the method has been demonstrated on ten publicly available data sets. FAM can be applied effectively in the area of feature selection.
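The FAM construction can be sketched in a few lines. The threshold value and synthetic data below are illustrative assumptions, not the paper's settings: nodes are features, an edge joins any pair whose |Pearson r| exceeds the threshold, and connected components then surface as the feature clusters that the visualization makes apparent.

```python
# Hedged sketch of a Feature Association Map: a graph over features with
# edges for strongly correlated pairs. Threshold 0.8 is an assumption.
import numpy as np

rng = np.random.default_rng(1)
base = rng.normal(size=(200, 3))
# 6 features: two correlated pairs built from shared latent columns,
# plus two unrelated features
X = np.column_stack([base[:, 0], base[:, 0] + 0.1 * rng.normal(size=200),
                     base[:, 1], base[:, 1] + 0.1 * rng.normal(size=200),
                     base[:, 2], rng.normal(size=200)])

r = np.corrcoef(X, rowvar=False)
adj = (np.abs(r) > 0.8) & ~np.eye(6, dtype=bool)   # FAM adjacency matrix

# connected components via a simple depth-first search
seen, components = set(), []
for s in range(6):
    if s in seen:
        continue
    comp, stack = set(), [s]
    while stack:
        v = int(stack.pop())
        if v in comp:
            continue
        comp.add(v)
        stack.extend(np.where(adj[v])[0])
    seen |= comp
    components.append(sorted(comp))
print(components)   # correlated pairs group together; lone features stay singletons
```

Rendering `adj` with any graph-drawing tool reproduces the visual effect described: strongly associated features collapse into visible clumps.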
international conference on recent advances in information technology | 2016
Himanshu Kashyap; Sohini Das; Jayee Bhattacharjee; Ritu Halder; Saptarsi Goswami
Feature selection is a major preprocessing step in areas related to data mining, pattern recognition and machine learning. Finding the most favorable subset of features among the available combinations is an NP-complete problem. Despite much research in this domain, feature selection for clustering is far from solved. This paper is the first of its kind in applying a multi-objective Genetic Algorithm (GA) to feature selection for clustering as a filter method. The optimization objectives are (i) maximizing the Laplacian Score and (ii) minimizing the inter-attribute correlation. An empirical study has been conducted over 21 data sets, and the results look promising in terms of the amount of feature set reduction achieved. In terms of cluster validity, too, the proposed method achieves equivalent or better results in more than half of the data sets.
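The first of the two objectives, the Laplacian Score, can be computed per feature as sketched below. This is a minimal illustration under assumptions (k-nearest-neighbour connectivity graph with 0/1 weights; the paper may use a different affinity): for each feature f, the score is f̃ᵀLf̃ / f̃ᵀDf̃ on the graph Laplacian L = D − W, with f̃ the D-weighted-mean-centred feature; lower scores indicate features that respect the local cluster structure.

```python
# Hedged sketch of the Laplacian Score filter criterion used as one GA
# objective. Graph construction details (k, 0/1 weights) are assumptions.
import numpy as np
from sklearn.neighbors import kneighbors_graph

def laplacian_scores(X, k=5):
    A = kneighbors_graph(X, k, mode='connectivity', include_self=False).toarray()
    W = np.maximum(A, A.T)                 # symmetrize the 0/1 affinity matrix
    D = W.sum(axis=1)                      # node degrees
    L = np.diag(D) - W                     # unnormalized graph Laplacian
    scores = []
    for j in range(X.shape[1]):
        f = X[:, j]
        f = f - (f @ D) / D.sum()          # remove the D-weighted mean
        scores.append((f @ L @ f) / max(f @ (D * f), 1e-12))
    return np.array(scores)                # lower = better locality preservation

rng = np.random.default_rng(0)
smooth = np.repeat([0.0, 5.0], 50) + 0.1 * rng.normal(size=100)  # cluster-aligned
noise = rng.normal(size=100)                                     # structureless
X = np.column_stack([smooth, noise])
print(laplacian_scores(X))
```

In the multi-objective GA described, each chromosome encodes a feature subset, and the two fitness values (aggregate Laplacian Score, inter-attribute correlation) are traded off along a Pareto front rather than combined into one number.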
International Journal of Computer Applications | 2011
Saptarsi Goswami; Samiran Ghosh; Amlan Chakrabarti
The DBMS is at the heart of both OLTP and OLAP types of applications, and for both, thousands of queries expressed in SQL are executed on a daily basis. All commercial DBMS engines capture various attributes about these executed queries in system tables. The queries need to conform to best practices and to be tuned to ensure optimal performance. While checklists, and often tools, are used to enforce this, a black-box profiling technique on the queries, namely outlier detection, is not employed for a summary-level understanding. This is the motivation of the paper: such a technique not only points out inefficiencies built into the system, but also has the potential to reveal evolving best practices and inappropriate usage, and can thus reduce latency in information flow and improve utilization of hardware and software capacity. In this paper we start by formulating the problem. We explore four outlier detection techniques, apply them over a rich corpus of production queries, and analyze the results. We also explore the benefit of an ensemble approach, and conclude with future courses of action. We used the same philosophy for the optimization of extract, transform, load (ETL) jobs in one of our previous works; a brief introduction is given in Section 4.
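The ensemble idea can be sketched briefly. The paper's four techniques are not named in the abstract, so two common detectors (Isolation Forest and Local Outlier Factor) stand in here as illustrative examples, and the per-query feature names are invented: each detector votes, and queries flagged by both are treated as higher-confidence outliers.

```python
# Hedged sketch of an outlier-detection ensemble over query statistics.
# Detectors, features and synthetic data are illustrative assumptions.
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
# hypothetical per-query features: [cpu_ms, logical_reads, rows_returned]
queries = rng.lognormal(mean=[3.0, 6.0, 2.0], sigma=0.3, size=(1000, 3))
queries[:10] *= 20                     # inject a few pathological queries

iso = IsolationForest(random_state=0).fit_predict(queries)      # -1 = outlier
lof = LocalOutlierFactor(n_neighbors=20).fit_predict(queries)   # -1 = outlier
votes = (iso == -1).astype(int) + (lof == -1).astype(int)
flagged = np.where(votes == 2)[0]      # queries both detectors agree on
print(f"both detectors agree on {len(flagged)} of {len(queries)} queries")
```

Requiring agreement trims the candidate list for manual review, which matches the summary-level profiling goal stated above.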
international conference on recent advances in information technology | 2016
Avinash Kumar Bamwal; Govind Kumar Choudhary; Raj Swamim; Aman Kedia; Saptarsi Goswami; Amit Kumar Das
One of the reasons for the exponential growth of unstructured data is the increasing popularity of various social media and blogging sites. This vast amount of data can be utilized to form insights and take necessary actions. While such data has been used extensively in developed countries, there has been a lack of focused studies based on tweets in the context of India. In this paper, more than 15K tweets were collected using an efficient method and used to answer a few relevant questions about Indian healthcare. The study appears quite effective, and a repository of such tweets can be built for a more holistic and structured analysis.
2015 International Conference on Man and Machine Interfacing (MAMI) | 2015
Amit Kumar Das; Aman Kedia; Lisha Sinha; Saptarsi Goswami; Tamal Chakrabarti; Amlan Chakrabarti
Healthcare is a fast-growing field in developed and developing countries alike, and Indian healthcare has also witnessed rapid growth in the recent past. The data generated is of very high volume and exhibits wide diversity, and hence the data mining opportunities available in various areas of the healthcare industry are immense. In this paper, a brief review of data mining applications in healthcare is presented. The key differentiating factors of the paper are its focus on India and a patient-lifecycle-oriented view of the problems. As the study finds, most research has focused on predicting a disease, which is part of ongoing care management. The paper also identifies many uncharted areas of future research in this domain, especially in the Indian context.
2015 IEEE 2nd International Conference on Cybernetics (CYBCONF) | 2015
Saptarsi Goswami; Amlan Chakrabarti; Basabi Chakraborty
Pattern classification or clustering plays an important role in a wide variety of applications in areas such as psychology and other social sciences, biology and medical sciences, pattern recognition and data mining. Many algorithms for supervised or unsupervised classification have been developed so far in order to achieve high classification accuracy at lower computational cost. However, some methods or algorithms work well on some data sets and perform poorly on others, and for any particular data set it is difficult to find the most suitable algorithm without some trial and error. It seems that the characteristics of the data set influence which classification algorithm will work well. In this work, data set characteristics are studied in terms of the relationships among attributes, and a measure, MVS (multivariate score), is proposed to quantify data sets and group them on the basis of their correlation structure into strongly independent, weakly independent, weakly correlated and strongly correlated data sets. The performance of different feature selection algorithms on the different groups is studied by simulation experiments with 63 publicly available benchmark data sets. It is verified that univariate methods lead to significant performance gains over multivariate methods for strongly independent data sets, while multivariate methods perform better on strongly correlated data sets.
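The kind of correlation-structure measure described can be sketched as follows. The exact MVS definition is not given in the abstract, so the mean absolute off-diagonal correlation is used here as an illustrative stand-in, along with an assumed cut-off for grouping a data set as correlated or independent.

```python
# Hedged sketch: scoring a data set's correlation structure to place it
# in a correlated/independent group. The measure and 0.5 cut-off are
# illustrative stand-ins for the paper's MVS, whose formula is not
# reproduced in the abstract.
import numpy as np

def mean_abs_corr(X):
    r = np.corrcoef(X, rowvar=False)
    off = r[~np.eye(r.shape[0], dtype=bool)]   # off-diagonal correlations
    return float(np.abs(off).mean())

rng = np.random.default_rng(0)
indep = rng.normal(size=(500, 6))                    # independent features
latent = rng.normal(size=(500, 1))
corr = latent + 0.3 * rng.normal(size=(500, 6))      # shared latent factor

for name, X in [("independent", indep), ("correlated", corr)]:
    score = mean_abs_corr(X)
    group = "strong correlated" if score > 0.5 else "strong independent"
    print(f"{name}: {score:.2f} -> {group}")
```

With such a score in hand, the paper's recommendation can be applied directly: prefer univariate feature selection when the score is low and multivariate methods when it is high.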