Rajendra Kumar Roul

Birla Institute of Technology and Science

Publication

Featured research published by Rajendra Kumar Roul.

arXiv: Information Retrieval | 2015

A Novel Modified Apriori Approach for Web Document Clustering

Rajendra Kumar Roul; Saransh Varshneya; Ashu Kalra; Sanjay Kumar Sahay

The traditional Apriori algorithm can be used to cluster web documents based on the association technique of data mining. However, this algorithm has several limitations due to repeated database scans and its weak association rule analysis. In the modern world of large databases, the efficiency of the traditional Apriori algorithm drops considerably. In this paper, we propose a modified Apriori approach that cuts down the repeated database scans and improves the association analysis of the traditional Apriori algorithm to cluster web documents. We further improve those clusters by applying Fuzzy C-Means (FCM), K-Means and Vector Space Model (VSM) techniques separately. We use the Classic3 and Classic4 datasets of Cornell University, which contain more than 10,000 documents, and run both the traditional Apriori and our modified Apriori approach on them. Experimental results show that our approach outperforms the traditional Apriori algorithm in terms of database scans and association analysis.
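For orientation, the traditional Apriori baseline mentioned in this abstract can be sketched as follows. This is a minimal illustration of the textbook level-wise procedure, not the authors' modified approach; the function name `apriori`, the `min_support` threshold, and the idea of treating each document as a set of terms are assumptions made for the example. Each level performs one full pass over the transactions, which is the repeated scanning the paper aims to reduce.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support_count} for all frequent itemsets.

    Textbook level-wise Apriori; every level rescans all transactions.
    """
    transactions = [frozenset(t) for t in transactions]
    # Level 1 candidates: every distinct item as a 1-itemset.
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while candidates:
        # One full database scan per level to count candidate supports.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Join step: merge frequent k-itemsets into (k+1)-item candidates.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-item subset of a candidate must be frequent.
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent
```

Called on a handful of term sets with `min_support=2`, the function returns every term combination that occurs in at least two documents together with its count.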

software engineering artificial intelligence networking and parallel distributed computing | 2015

Extreme learning machines in the field of text classification

Rajendra Kumar Roul; Ashish Nanda; Viraj Patel; Sanjay Kumar Sahay

The World Wide Web serves as a huge repository of information that is highly dynamic, diverse and growing at an exponential rate. To speed up and further improve tasks such as information search, retrieval and personalization, it is important to develop techniques that classify text documents more accurately and efficiently than before. This paper is an effort in that direction: it studies the effectiveness of Extreme Learning Machines (ELM) in the domain of text classification and compares them with existing techniques such as Support Vector Machines (SVM), which are currently among the most popular and effective techniques for classifying text documents. Ours is one of the few works that highlight the high performance of ELM in text classification, by implementing classifiers based on different interpretations of ELM, analyzing their performance, and studying which feature selection techniques are best suited to improving their accuracy. For our multi-class classification problem, we studied a single ELM classifier based on the one-against-all scheme and a multi-layer ELM classifier inspired by deep networks, and then performed extensive experiments on different datasets to demonstrate the applicability and effectiveness of our approach. Results show that ELM-based classifiers can outperform many traditional classification techniques, including state-of-the-art techniques such as SVM.
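The single-hidden-layer ELM with one-against-all targets described in this abstract can be illustrated roughly as below. This is a sketch under stated assumptions rather than the authors' implementation: the sigmoid activation, the weight range, the ridge term, and the names `solve`, `hidden_layer`, `train_elm` and `predict_elm` are all choices made for the example. The defining idea is that the hidden weights are random and stay fixed, and only the output weights are computed, here via a regularised least-squares solution.

```python
import math
import random


def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting.

    a is n x n and b is n x m, both lists of lists; neither is modified.
    """
    n = len(a)
    aug = [list(a[i]) + list(b[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [[aug[i][n + j] / aug[i][i] for j in range(len(b[0]))] for i in range(n)]


def hidden_layer(x, weights, biases):
    """Sigmoid activations of the random hidden layer for one input vector."""
    return [
        1.0 / (1.0 + math.exp(-(sum(w * v for w, v in zip(ws, x)) + b)))
        for ws, b in zip(weights, biases)
    ]


def train_elm(xs, labels, n_classes, n_hidden=20, ridge=1e-6, seed=0):
    """Fit output weights beta = H^T (H H^T + ridge * I)^-1 T."""
    rng = random.Random(seed)
    dim = len(xs[0])
    # Hidden weights and biases are drawn at random and never trained.
    weights = [[rng.uniform(-4, 4) for _ in range(dim)] for _ in range(n_hidden)]
    biases = [rng.uniform(-4, 4) for _ in range(n_hidden)]
    h = [hidden_layer(x, weights, biases) for x in xs]  # N x L matrix H
    # One-against-all targets: +1 for the true class, -1 for the others.
    t = [[1.0 if c == y else -1.0 for c in range(n_classes)] for y in labels]
    n = len(xs)
    gram = [
        [sum(p * q for p, q in zip(h[i], h[j])) + (ridge if i == j else 0.0)
         for j in range(n)]
        for i in range(n)
    ]
    alpha = solve(gram, t)  # N x C
    beta = [
        [sum(h[i][l] * alpha[i][c] for i in range(n)) for c in range(n_classes)]
        for l in range(n_hidden)
    ]
    return weights, biases, beta


def predict_elm(model, x):
    """Return the class index with the highest output score."""
    weights, biases, beta = model
    h = hidden_layer(x, weights, biases)
    scores = [sum(hl * row[c] for hl, row in zip(h, beta)) for c in range(len(beta[0]))]
    return max(range(len(scores)), key=scores.__getitem__)
```

The sketch solves the N x N system in the number of training samples, which suits a toy problem with few samples and more hidden nodes; a real text classifier would use a numerical library and sparse document vectors.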

ieee india conference | 2015

Clustering based feature selection using Extreme Learning Machines for text classification

Rajendra Kumar Roul; Shashank Gugnani; Shah Mit Kalpeshbhai

The expansion of the dynamic Web increases the number of digital documents, which has attracted many researchers to the field of text classification. It is an important and well-studied area of machine learning with a variety of modern applications. Good feature selection is of paramount importance for increasing the efficiency of classifiers working on text data. Choosing the most relevant features out of what can be an incredibly large set is particularly important for accurate text classification. This paper is a step in that direction: we propose a new clustering-based feature selection technique that reduces the feature size. The traditional k-means clustering technique, along with TF-IDF and WordNet, helps us form a reduced, high-quality feature vector to train the Extreme Learning Machine (ELM) and Multi-layer ELM (ML-ELM), which are used as the classifiers for text classification. The experimental work has been carried out on the 20-Newsgroups and DMOZ datasets. Results on these two standard datasets demonstrate the efficiency of our approach using ELM and ML-ELM as the classifiers over the state-of-the-art classifiers.
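The TF-IDF weighting that this abstract uses as an ingredient can be sketched as follows. The variant shown (raw term frequency multiplied by log(N / document frequency)) is one common textbook form and is an assumption here, since the abstract does not state which weighting the paper uses; the k-means and WordNet steps are not reproduced, and the function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenised document.

    Weight = raw term frequency * log(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors
```

Under this variant a term that occurs in every document gets weight zero, which gives a feel for why such terms are poor candidates for a reduced feature vector.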

forum for information retrieval evaluation | 2017

Feature Space of Deep Learning and its Importance: Comparison of Clustering Techniques on the Extended Space of ML-ELM

Rajendra Kumar Roul; Amit Agarwal

Based on the architecture of deep learning, the Multilayer Extreme Learning Machine (ML-ELM) has many good characteristics that make it a distinctive and widely used classifier in the domain of text mining. Its salient features include non-linear mapping of features into a high-dimensional space, a high level of data abstraction, no backpropagation, and a higher rate of learning. This paper studies the importance of the ML-ELM feature space and tests the performance of various traditional clustering techniques on this feature space. Empirical results show the efficiency and effectiveness of the ML-ELM feature space compared to the TF-IDF vector space, which justifies the prominence of deep learning.

ieee india conference | 2016

Semi-supervised clustering using seeded-kMeans in the feature space of ELM

Rajendra Kumar Roul; Sanjay Kumar Sahay

Extreme learning machine (ELM) is based on single-layer feedforward neural networks (SLFNs) and has become a rapidly developing learning technology. The recently developed multilayer form of ELM, called ML-ELM, is based on the architecture of deep learning and has become more popular than other traditional classifiers because of important qualities such as multiple non-linear transformations of the input data, higher-level abstraction of data, learning different forms of input data, and the capability to manage huge volumes of data. In addition, ML-ELM can map the input feature vector non-linearly to an extended-dimensional feature space for better performance. This paper proposes an approach in which unsupervised and semi-supervised clustering, using kMeans and seeded-kMeans respectively, is done in the ML-ELM feature space. The empirical results of the proposed approach on two benchmark datasets outperform the results of clustering done in the TF-IDF vector space. It is also observed that, in the ML-ELM feature space, the results of seeded-kMeans are better than those of traditional kMeans.
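The difference between kMeans and seeded-kMeans mentioned in this abstract lies in the initialisation, which can be sketched as below. This is an illustrative version under assumptions: it works on plain numeric vectors in whatever feature space is supplied (the ML-ELM mapping is not included), the seeds are passed separately although in practice they are usually a labelled subset of the data, and the names `seeded_kmeans`, `mean` and `sq_dist` are hypothetical.

```python
def mean(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(points) for col in zip(*points)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def seeded_kmeans(points, seeds, max_iter=100):
    """Cluster `points`; `seeds` maps cluster id -> list of labelled seed points.

    Centroids start at the seed means instead of random positions; after
    that the usual k-means assignment and update steps run to convergence.
    """
    centroids = {k: mean(pts) for k, pts in seeds.items()}
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_assignment = [
            min(centroids, key=lambda k: sq_dist(p, centroids[k])) for p in points
        ]
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for k in centroids:
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # keep the old centroid if a cluster empties out
                centroids[k] = mean(members)
    return assignment, centroids
```

Because the seeds only fix the starting centroids, points may still move between clusters in later iterations.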

International Journal of Data Mining, Modelling and Management | 2016

Spam web page detection using combined content and link features

Rajendra Kumar Roul; Shubham Rohan Asthana; Gaurav Kumar

Web spamming refers to actions intended to mislead search engines into ranking some irrelevant web pages higher in the search results than they deserve. It is thus a roadblock to high-quality information retrieval from the web. Spam web pages are often littered with irrelevant and meaningless content. Spam detection methods have therefore been proposed to minimise the adverse effects of spam web pages. No single defining profile encompasses all types of spam websites, which makes spam web page detection extremely difficult. In this paper, the proposed technique combines the content and link-based features of web pages to classify them as spam or non-spam. The WEBSPAM-UK2006 dataset has been used for the experiments. The results of the proposed approach were compared with those of existing approaches, and the proposed approach was found to achieve a higher F-measure than the others.
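The F-measure used for the comparison in this abstract combines precision and recall for the spam class. A minimal sketch of the standard definition follows; the label strings `"spam"` and `"ham"`, the function name `f_measure`, and the default `beta=1.0` (the balanced F1 score) are assumptions for illustration, as the abstract does not say which variant was reported.

```python
def f_measure(predicted, actual, positive="spam", beta=1.0):
    """F-beta score for the `positive` class; returns 0.0 when undefined."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p == positive and a == positive)
    fp = sum(1 for p, a in pairs if p == positive and a != positive)
    fn = sum(1 for p, a in pairs if p != positive and a == positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)  # share of flagged pages that are spam
    recall = tp / (tp + fn)     # share of spam pages that were flagged
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```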

international conference on industrial and information systems | 2014

GM-Tree: An efficient frequent pattern mining technique for dynamic database

Rajendra Kumar Roul; Ishaan Bansal

Since its inception, frequent pattern mining has become an important issue in data mining. The main problem in this area is to find the association rules that identify the relationships among a set of items. The most expensive step in association rule mining is finding frequent itemsets, and hence it has drawn the attention of many researchers. In this paper, we propose a novel tree structure, called GM-Tree (Generate and Merge Tree), which combines prefix-based incremental mining using canonical ordering with batch incrementing techniques. Our approach makes the tree structure more compact, keeps its nodes canonically ordered, and avoids sequential incrementing of transactions. It also yields a scalable algorithm with minimal overhead for modifying the tree structure during update operations. This algorithm is expected to give better results especially for extremely large transaction databases in a dynamic environment. The experimental work has been carried out on two large datasets. Test results show the efficiency and effectiveness of the proposed approach, which outperforms the traditional FP-Tree, CanTree (Canonical-order Tree) and BIT (Batch Incremental Tree).

International Journal of Computer Applications | 2014

Triple Indexing: An Efficient Technique for Fast Phrase Query Evaluation

Shashank Gugnani; Rajendra Kumar Roul

Phrase query evaluation is an important task of every search engine. Optimizing the evaluation time for phrase queries is a major challenge for current search engines. Phrase queries are usually a hassle for standard indexing techniques, generally because merging the posting lists and checking the word ordering take a lot of time. This paper proposes a new technique called Triple Indexing to index web documents, which optimizes query evaluation time for phrase queries by reducing the time for merging the posting lists and checking the word ordering. In addition, a procedure has been put forward for document ranking using an extended vector space model. The 4 Universities dataset and the Industry Sector dataset of Carnegie Mellon University have been used for the experiments, and it has been found that, using the proposed method on a modern machine, the query time for phrase queries is reduced by almost 50 percent compared to a standard inverted index.
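The two costs named in this abstract, merging the posting lists and checking the word ordering, are easiest to see in the standard positional inverted index that serves as the baseline. The sketch below illustrates that baseline only and does not implement Triple Indexing; the function names `build_positional_index` and `phrase_query`, and the dictionary layout of the index, are assumptions made for the example.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map term -> {doc_id: [positions]} for tokenised documents."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for pos, term in enumerate(tokens):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def phrase_query(index, phrase):
    """Return the sorted ids of documents containing `phrase` as consecutive terms."""
    if not phrase or any(term not in index for term in phrase):
        return []
    # Merge step: intersect the posting lists of all query terms.
    common = set(index[phrase[0]])
    for term in phrase[1:]:
        common &= set(index[term])
    hits = []
    for doc_id in common:
        # Ordering check: term i must occur exactly i positions after term 0.
        later = [set(index[term][doc_id]) for term in phrase[1:]]
        if any(all(start + i + 1 in s for i, s in enumerate(later))
               for start in index[phrase[0]][doc_id]):
            hits.append(doc_id)
    return sorted(hits)
```

Both steps are repeated for every query term, so their cost grows with the length of the posting lists and of the phrase.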

Archive | 2019

Categorizing Text Data Using Deep Learning: A Novel Approach

Rajendra Kumar Roul; Sanjay Kumar Sahay

With the large number of Internet users on the Web, there is a need to improve the working principle of text classification, which is an important and well-studied area of machine learning. To work with text data and increase the efficiency of the classifier, the choice of quality features is of paramount importance. This study addresses two important aspects of text classification. It first proposes a new feature selection technique named Combined Cohesion Separation and Silhouette Coefficient (CCSS) to find the feature set that gathers the crux of the terms in the corpus without deteriorating the outcome in the construction process, and then discusses the underlying architecture and importance of deep learning in text classification. Four benchmark datasets are used for the experimental work. The empirical results of the proposed approach using deep learning are more promising than those of other established classifiers.

Archive | 2019

Document Labeling Using Source-LDA Combined with Correlation Matrix

Rajendra Kumar Roul; Jajati Keshari Sahoo

Topic modeling is one of the most applied and active research areas in the domain of information retrieval. It has become increasingly important due to the large and varied amount of data produced every second. In this paper, we address two major drawbacks of latent Dirichlet allocation (LDA): unsupervised learning and topic independence. To remove the first drawback, we use Wikipedia as a knowledge source to build a semi-supervised model (Source-LDA) for generating predefined topic-word distributions. The second drawback is removed using a correlation matrix containing the cosine-similarity measure of all the topics. The reason for using a semi-supervised LDA instead of a supervised model is to avoid overfitting the data for new labels. Experimental results show that the performance of Source-LDA combined with a correlation matrix is better than that of traditional LDA and Source-LDA.
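A correlation matrix of cosine similarities between topics, as mentioned in this abstract, can be computed roughly as follows. The sketch assumes each topic is represented by its row of the topic-word distribution over a shared vocabulary, which the abstract does not spell out; how the paper then uses the matrix for labeling is not reproduced, and the names `cosine` and `topic_correlation_matrix` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def topic_correlation_matrix(topic_word):
    """Pairwise cosine similarity between the rows of a topic-word matrix."""
    return [[cosine(u, v) for v in topic_word] for u in topic_word]
```

The result is symmetric, and for non-zero topic vectors its diagonal entries equal one up to floating-point rounding.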
