Publication


Featured research published by Priyanka Tripathi.


International Conference on Communication Systems and Network Technologies | 2014

Pattern and Cluster Mining on Text Data

Deepak Agnihotri; Kesari Verma; Priyanka Tripathi

Because of the heavy use of electronic devices, most information is now available in electronic form, and a substantial portion of it is stored as text: news articles, technical papers, books, digital libraries, email messages, blogs, and web pages. Mining knowledge from this text, such as finding patterns or clustering similar words, is an important problem. This paper focuses on mining important information from text data. The experimental study uses the William Shakespeare stories data set from Project Gutenberg. R is used as the text mining and statistical analysis tool on the Ubuntu 12.04 LTS Linux operating system. Frequent pattern mining is used to find the frequent terms that appear in the documents, and the association between two or more words is measured at a given threshold value. Our algorithm uses cosine similarity to measure the distance between words before clustering. The algorithm may be used to find the similarity between stories, news items, and emails. In this paper, the k-means and hierarchical agglomerative clustering algorithms are used to form the clusters.
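
The cosine measure mentioned in this abstract can be illustrated with a minimal Python sketch. The paper itself used R; the function names and the small term-by-document counts below are hypothetical and are not taken from the paper's data.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here: a zero vector matches nothing
    return dot / (norm_u * norm_v)

def cosine_distance(u, v):
    """Distance form commonly fed to a clustering routine."""
    return 1.0 - cosine_similarity(u, v)

# Hypothetical counts: one row per word, one column per story.
term_vectors = {
    "king":  [4, 0, 3],
    "crown": [2, 0, 1],
    "storm": [0, 5, 0],
}

d_king_crown = cosine_distance(term_vectors["king"], term_vectors["crown"])
d_king_storm = cosine_distance(term_vectors["king"], term_vectors["storm"])
```

Words that occur in the same stories ("king" and "crown" here) end up closer together than words that never co-occur, which is the property a clustering step would rely on.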


Expert Systems With Applications | 2017

Variable Global Feature Selection Scheme for automatic classification of text documents

Deepak Agnihotri; Kesari Verma; Priyanka Tripathi

A novel Variable Global Feature Selection Scheme (VGFSS) is proposed. VGFSS selects a variable number of features from each class instead of an equal number. The selection of features in VGFSS is based on the distribution of terms in the classes. The methods are evaluated using the Macro_F1 and Micro_F1 measures, followed by a Z-test. The VGFSS algorithm comes out ahead among seven competing methods on benchmark datasets. Feature selection is important for speeding up Automatic Text Document Classification (ATDC). At present, the most common method for selecting discriminating features is the Global Filter-based Feature Selection Scheme (GFSS). GFSS assigns a score to each feature based on its discriminating power and selects the top-N features from the feature set, where N is an empirically determined number. As a result, the features of a few classes may be discarded either partially or completely. The Improved Global Feature Selection Scheme (IGFSS) solves this issue by selecting an equal number of representative features from all the classes. However, it struggles with unbalanced datasets that have a large number of classes, because the distribution of features across these classes is highly variable. In this case, if an equal number of features is chosen from each class, some important features may be excluded from the classes containing a higher number of features. To overcome this problem, we propose a novel Variable Global Feature Selection Scheme (VGFSS) that selects a variable number of features from each class based on the distribution of terms in the classes. It ensures that a minimum number of terms is selected from each class. The numerical results on benchmark datasets show the effectiveness of the proposed VGFSS algorithm over classical information science methods and IGFSS.
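
The idea of a variable per-class quota with a guaranteed minimum can be sketched as follows. The abstract does not give the VGFSS allocation formula, so the proportional rule, the function name `allocate_features`, and the class sizes below are assumptions made only for illustration.

```python
def allocate_features(terms_per_class, total, minimum=1):
    """Split `total` feature slots across classes in proportion to each
    class's term count, giving every class at least `minimum` slots.

    Assumed rule for illustration; not the published VGFSS formula.
    """
    n_classes = len(terms_per_class)
    grand_total = sum(terms_per_class.values())
    if n_classes == 0 or grand_total == 0:
        raise ValueError("need at least one class with at least one term")
    if total < minimum * n_classes:
        raise ValueError("total is too small to give every class the minimum")

    remaining = total - minimum * n_classes
    quotas = {
        cls: minimum + (remaining * count) // grand_total
        for cls, count in terms_per_class.items()
    }
    # Integer division can leave a few slots unassigned; hand them
    # to the classes with the most terms first.
    leftover = total - sum(quotas.values())
    largest_first = sorted(terms_per_class, key=terms_per_class.get, reverse=True)
    for cls in largest_first[:leftover]:
        quotas[cls] += 1
    return quotas

# Hypothetical unbalanced corpus: distinct terms observed per class.
quotas = allocate_features(
    {"sports": 600, "politics": 300, "weather": 100}, total=20, minimum=2
)
```

With these made-up counts the large class receives most of the slots while the small class still keeps its minimum, in contrast to an equal split of the kind the abstract attributes to IGFSS. The sketch does not cap a quota at the number of terms a class actually has.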


SpringerPlus | 2016

Computing symmetrical strength of N-grams: a two pass filtering approach in automatic classification of text documents.

Deepak Agnihotri; Kesari Verma; Priyanka Tripathi

The contiguous sequences of terms (N-grams) in documents are symmetrically distributed among different classes. This symmetrical distribution raises uncertainty about which class an N-gram belongs to. In this paper, we focus on selecting the most discriminating N-grams by reducing the effects of symmetrical distribution. In this context, a new text feature selection method named the symmetrical strength of the N-grams (SSNG) is proposed, using a two-pass filtering based feature selection (TPF) approach. In the first pass of the TPF, the SSNG method chooses various informative N-grams from all the N-grams extracted from the corpus. In the second pass, the well-known Chi Square (χ2) method is used to select a few of the most informative N-grams. To classify the documents, two standard classifiers, Multinomial Naive Bayes and Linear Support Vector Machine, are applied to ten standard text data sets. On most of the data sets, the experimental results show that the performance and success rate of the SSNG method using the TPF approach are superior to those of the state-of-the-art methods, viz. Mutual Information, Information Gain, Odds Ratio, Discriminating Feature Selection and χ2.
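
The second-pass χ2 score can be sketched from a 2x2 document-count table for one N-gram and one class. This is the textbook form of the statistic; the argument names are illustrative, and the SSNG first pass is not reproduced because the abstract does not define it.

```python
def chi_square(a, b, c, d):
    """Chi-square statistic for one term and one class.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that lack the term
    d: documents outside the class that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # degenerate table: the term or class never varies
    return n * (a * d - c * b) ** 2 / denom

# Hypothetical counts over 40 documents, 20 per class.
independent = chi_square(10, 10, 10, 10)  # term spread evenly across classes
skewed = chi_square(15, 5, 5, 15)         # term leans toward the class
```

An evenly spread term scores zero, while a term concentrated in one class scores higher, so ranking by this value keeps the more class-specific N-grams.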


International Conference on Computer and Automation Engineering | 2010

Towards the identification of usability metrics for academic Web-sites

Priyanka Tripathi; Manju Pandey; Divya Bharti

This paper presents a brief account of developments in the field of usability of Web-applications. The quality of a Web-site plays a crucial role in its success. Usability has been considered the most important factor among the quality factors. Since Web-based software is diverse and built by integrating numerous components, a common quality model is not appropriate for all Web-applications. In this paper, the authors propose usability metrics for Web-applications in the academic domain.


International Symposium on Computer Vision | 2016

Computing Correlative Association of Terms for Automatic Classification of Text Documents

Deepak Agnihotri; Kesari Verma; Priyanka Tripathi

The selection of the most informative terms reduces the feature set and speeds up the classification process. The most informative terms are highly affected by the correlative association of the terms. Rare terms are more informative than sparse and common terms. The main objective of this study is to assign a higher weight to the rare terms and a lower weight to the common and sparse terms. The term weights are computed by giving emphasis to term strength, mutual information, and strong association with the specific class. In this context, we propose a novel hybrid feature selection method named the Correlative Association Score (CAS) of terms. CAS utilizes the concept of the Apriori algorithm to select the most informative terms. Initially, CAS selects the most informative terms from all the extracted terms. Subsequently, N-grams of range (1,3) are generated from these informative terms. Finally, the standard Chi Square (χ2) method is applied to select the most informative N-grams. Two standard classifiers, Multinomial Naive Bayes (MNB) and Linear Support Vector Machine (LSVM), are applied to four standard text data sets: Webkb, 20Newsgroup, Ohsumed10, and Ohsumed23. The promising results of extensive experiments demonstrate the effectiveness of CAS compared to state-of-the-art methods, viz. Mutual Information (MI), Information Gain (IG), Discriminating Feature Selection (DFS), and χ2.
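
Generating N-grams of range (1,3) can be sketched in a few lines of Python. The abstract does not say exactly how the informative terms are combined, so this sketch simply assumes contiguous N-grams over an already filtered token list; the function name and tokens are hypothetical.

```python
def ngrams(tokens, n_min=1, n_max=3):
    """Return all contiguous N-grams with n_min <= N <= n_max,
    shortest first, each joined into a single space-separated string."""
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

# Hypothetical token list left after a term-selection step.
grams = ngrams(["text", "feature", "selection"])
```

For three tokens this yields three unigrams, two bigrams, and one trigram; a later scoring step such as χ2 would then rank these candidates.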


IETE Technical Review | 2016

An Efficient Approach to Detect Sudden Changes in Vegetation Index Time Series for Land Change Detection

Sangram Panigrahi; Kesari Verma; Priyanka Tripathi

In this paper, we propose a novel data mining approach, the Recursive Search Algorithm (RSA), to detect sudden changes in a time series data-set. In the literature, the Modified Lunetta, cumulative sum (CUMSUM) MEAN, Yearly Delta, and Recursive Merging techniques are used to detect sudden changes in land cover using a data mining approach. The main drawback of the Modified Lunetta, CUMSUM MEAN, and Yearly Delta approaches is that they only identify whether a time series has changed or not, while the Recursive Merging technique finds only the changed segment. The RSA approach can compute the change point (time of change) in time series data with high confidence and can also detect the type of change (increase/decrease) that occurred in the series. The proposed algorithm is scalable and shows considerable improvement in performance in the presence of cyclic data. All experiments are performed on a synthetic data-set that is analogous to a vegetation index time series data-set.
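
To make the notions of change point and change type concrete, here is a generic mean-shift sketch in Python. It is not the RSA, whose steps the abstract does not describe; the function name and the sample series are hypothetical, and the sketch makes no attempt to handle cyclic data.

```python
def best_change_point(series):
    """Return (index, direction) for the split that maximizes the absolute
    difference between the mean before and the mean after the split.

    `index` is the position of the first observation after the change.
    """
    n = len(series)
    if n < 2:
        raise ValueError("need at least two observations")
    total = sum(series)
    left_sum = 0.0
    best_index, best_gap = 1, 0.0
    for i in range(1, n):
        left_sum += series[i - 1]
        gap = (total - left_sum) / (n - i) - left_sum / i
        if abs(gap) > abs(best_gap):
            best_index, best_gap = i, gap
    if best_gap > 0:
        direction = "increase"
    elif best_gap < 0:
        direction = "decrease"
    else:
        direction = "none"
    return best_index, direction

# Hypothetical vegetation-index-like series with a sudden drop.
index, direction = best_change_point([0.8, 0.8, 0.8, 0.8, 0.3, 0.3, 0.3, 0.3])
```

The function reports both when the level shifts and whether it rose or fell, which are the two outputs the abstract says simpler changed/not-changed detectors do not provide.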


New Trends in Intelligent Information and Database Systems | 2015

Review of Feature Selection Algorithms for Breast Cancer Ultrasound Image

Kesari Verma; Bikesh Kumar Singh; Priyanka Tripathi; A. S. Thoke

Correct classification of patterns from images is a challenging task and has become the focus of much research in machine learning and computer vision in recent years. For practical model building, images are described by many variables, such as shape, texture, color, and spectral features. Hundreds or thousands of features are extracted from images, with each one containing only a small amount of information. The selection of optimal and relevant features is very important for the correct classification and identification of benign and malignant tumors in a breast cancer dataset. In this paper, we analyze different feature selection algorithms, such as best first search, chi-square test, gain ratio, information gain, recursive feature elimination, and random forest, on our dataset. We also propose a technique for ranking all the selected features based on the scores given by the different feature selection algorithms.


Journal of Intelligent Information Systems | 2018

An automatic classification of text documents based on correlative association of words

Deepak Agnihotri; Kesari Verma; Priyanka Tripathi

Improving the training speed of a classifier without degrading its predictive capability is an important concern in text classification. Feature selection plays a key role in this context. It selects a subset of the most informative words (terms) from the set of all words. The correlative association of words with the classes increases the uncertainty about which class a word represents. The representative words of a class are of either a positive or a negative nature. The standard feature selection methods, viz. Mutual Information (MI), Information Gain (IG), Discriminating Feature Selection (DFS) and Chi Square (CHI), do not consider the positive and negative nature of the words, which affects the performance of the classifiers. To address this issue, this paper presents a novel feature selection method named Correlative Association Score (CAS). It combines the strength, mutual information, and strong association of the words to determine their positive and negative nature for a class. CAS selects a few (k) informative words from the set of all words (m). These informative words generate a set of N-grams of length 1-3. Finally, the standard Apriori algorithm combines the power of CAS and CHI to select the top b informative N-grams, where b is a number set by empirical evaluation. Multinomial Naive Bayes (MNB) and Linear Support Vector Machine (LSVM) classifiers evaluate the performance of the selected N-grams. Four standard text data sets, viz. Webkb, 20Newsgroup, Ohsumed10, and Ohsumed23, are used for experimental analysis. Two standard performance measures, Macro_F1 and Micro_F1, show a significant improvement in the results using the proposed CAS method.
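
The two evaluation measures named in this abstract can be sketched for single-label classification as follows. The definitions are the standard ones (Macro_F1 averages per-class F1; Micro_F1 pools the counts first); the function names and the tiny label lists are hypothetical.

```python
def f1(tp, fp, fn):
    """F1 from true positives, false positives, and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_micro_f1(y_true, y_pred):
    """Return (Macro_F1, Micro_F1) for two equal-length label lists."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("label lists must be non-empty and of equal length")
    labels = sorted(set(y_true) | set(y_pred))
    per_class = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        per_class.append((tp, fp, fn))
    # Macro: average the per-class F1 values, so every class counts equally.
    macro = sum(f1(*counts) for counts in per_class) / len(per_class)
    # Micro: pool the counts over all classes, so every document counts equally.
    micro = f1(*(sum(column) for column in zip(*per_class)))
    return macro, micro

macro, micro = macro_micro_f1(["a", "a", "a", "b"], ["a", "a", "b", "b"])
```

Reporting both is informative on unbalanced data: Macro_F1 is pulled down by poorly classified small classes, while Micro_F1 is dominated by the large ones.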


International Conference on Communication Systems and Network Technologies | 2014

Knowledge Discovery from Earth Science Data

Sangram Panigrahi; Kesari Verma; Priyanka Tripathi; Rika Sharma

Mining knowledge from climate data is an important problem due to the rapid development and extensive use of data acquisition technology. The volume of Earth science data is increasing tremendously, and it contains a large amount of information and knowledge. In this paper, we use Earth science data downloaded from the NASA website. Earth science data is characterized by large volume, high dimensionality, and spatial correlations. The data consists of time series measurements of various climate variables, such as temperature, pressure, precipitation, humidity, and wind direction and speed. Any sudden change in these parameters causes a change in the weather and in the behaviour of the parameter patterns. These sudden changes do not conform to the general behaviour of the data set; such nonconforming patterns are often referred to as anomalies. Anomalies in the climatic variables cause ecosystem disturbance events such as forest fires, droughts, floods, and deforestation. Anomaly detection is the problem of finding unexpected patterns in a data set. In June 2013, a multi-day cloudburst centred on the North Indian state of Uttarakhand caused devastating floods and landslides in the country's worst natural disaster. Some areas neighbouring Uttarakhand also experienced heavy rainfall. In this paper, we show the correlations between the climatic parameters and analyse all the parameters that affect the climate with the help of our proposed algorithm, which is capable of discovering anomalies in the data set. In addition, we analyse and identify significant changes or anomalies in the Uttarakhand climate data set. We also propose an anomaly detection algorithm based on the Min-Max method.
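
One simple reading of a min-max style anomaly check is sketched below: learn the range of a baseline period and flag later observations that fall outside it. The abstract does not spell out the authors' Min-Max method, so this interpretation, the function name, the `tolerance` parameter, and the numbers are all assumptions; the values are made up and are not Uttarakhand measurements.

```python
def min_max_anomalies(baseline, observations, tolerance=0.0):
    """Flag observations outside the baseline range.

    The range is widened on both sides by `tolerance` times its width.
    Returns a list of (position, value) pairs for the flagged points.
    """
    if not baseline:
        raise ValueError("baseline must not be empty")
    low, high = min(baseline), max(baseline)
    margin = tolerance * (high - low)
    return [
        (i, value)
        for i, value in enumerate(observations)
        if value < low - margin or value > high + margin
    ]

# Made-up daily rainfall values in millimetres.
baseline = [2, 5, 3, 8, 4]
observed = [3, 40, 7, 0, 9]
strict = min_max_anomalies(baseline, observed)
relaxed = min_max_anomalies(baseline, observed, tolerance=0.25)
```

Raising `tolerance` trades sensitivity for fewer false alarms: the mild excursion at position 4 is flagged by the strict check but not by the relaxed one, while the extreme value at position 1 is flagged by both.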


Archive | 2018

A Review of Techniques to Determine the Optimal Word Score in Text Classification

Deepak Agnihotri; Kesari Verma; Priyanka Tripathi; Nilam Choudhary

Massive digital information is available in the form of text on Web pages, and this Web corpus is growing continuously. The result is a huge corpus with a large number of dimensions in the form of text content. Therefore, classifying a text corpus by its content is a challenging problem. Researchers have used various feature selection techniques to reduce the dimensionality of a text corpus without affecting the performance of text classification. This paper investigates the importance of N-gram-based term indexing over the unigram term indexing approach to text classification. It follows a new approach to find the most informative words as features. Initially, a correlation score of each term for a class label is computed using Pearson's correlation coefficient, and this score is then multiplied by the bigram collocation term score, which is computed by the chi-square method. The top n informative words are selected by sorting the words in descending order of score, where n is an empirically determined number. The performance of this approach is evaluated on two standard movie review text datasets using a Naive Bayes classifier. The results confirm that the accuracy achieved by the proposed method is much better than the state of the art.
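
The scoring pipeline described in this abstract (a Pearson correlation per term, multiplied by a collocation score, then a top-n cut) can be sketched as follows. The term-presence vectors and the collocation scores are hypothetical placeholders rather than values computed as in the paper, and whether the paper ranks by signed or absolute correlation is not stated; this sketch uses the signed value.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("x and y must be non-empty and of equal length")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant sequence carries no correlation signal
    return cov / math.sqrt(var_x * var_y)

def top_n(scores, n):
    """Names of the n highest-scoring words, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:n]

# Hypothetical data: 1 marks a positive review / a review containing the word.
labels = [1, 1, 1, 0, 0, 0]
presence = {
    "superb": [1, 1, 1, 0, 0, 0],
    "plot":   [1, 0, 1, 1, 0, 1],
    "dull":   [0, 0, 0, 1, 1, 0],
}
collocation_score = {"superb": 2.0, "plot": 1.0, "dull": 1.5}  # placeholder values

combined = {
    word: pearson(vector, labels) * collocation_score[word]
    for word, vector in presence.items()
}
selected = top_n(combined, 2)
```

Because the signed correlation is used, a word tied to the opposite class ("dull") ranks below an uninformative word ("plot"); ranking by absolute value instead would keep it as a feature.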
