Amit Awekar
North Carolina State University
Publication
Featured research published by Amit Awekar.
web intelligence | 2009
Amit Awekar; Nagiza F. Samatova
All pairs similarity search is the problem of finding all pairs of records that have a similarity score above a specified threshold. Many real-world systems, such as search engines, online social networks, and digital libraries, frequently have to solve this problem for datasets with millions of records in a high-dimensional space, which are often sparse. The challenge is to design algorithms with feasible time requirements. To meet this challenge, algorithms have been proposed based on the inverted index, which maps each dimension to the list of records with a non-zero projection along that dimension. Common to these algorithms is a three-phase framework of data preprocessing, pair matching, and indexing. Matching is the most time-consuming phase. Within this framework, we propose a fast matching technique that exploits the sparse nature of real-world data to effectively reduce the size of the search space through a systematic set of tighter filtering conditions and heuristic optimizations. We integrate our technique with the fastest-to-date algorithm in the field and achieve up to 6.5X speed-up on three large real-world datasets.
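The inverted-index framework described above can be illustrated with a short sketch. This is a minimal, assumption-level implementation of the generic matching-and-indexing loop (dot-product similarity over sparse vectors), not the paper's optimized filtering conditions.

```python
# Minimal sketch of inverted-index based all pairs similarity search.
# Records are sparse vectors given as {dimension: weight} dicts; similarity
# is the dot product (cosine, if vectors are pre-normalized).
from collections import defaultdict

def all_pairs(records, threshold):
    index = defaultdict(list)          # dimension -> [(record_id, weight), ...]
    results = []
    for rid, vec in enumerate(records):
        # Matching phase: accumulate partial scores against indexed records.
        scores = defaultdict(float)
        for dim, w in vec.items():
            for other_id, other_w in index[dim]:
                scores[other_id] += w * other_w
        results.extend((other_id, rid, s) for other_id, s in scores.items()
                       if s >= threshold)
        # Indexing phase: add the current record's non-zero dimensions.
        for dim, w in vec.items():
            index[dim].append((rid, w))
    return results

# Example: three sparse, unit-normalized records.
recs = [{0: 0.8, 1: 0.6}, {0: 0.6, 2: 0.8}, {1: 1.0}]
print(all_pairs(recs, threshold=0.5))   # pairs with dot product >= 0.5
```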
conference on information and knowledge management | 2013
Sumeet Singh; Amit Awekar
Shared Nearest Neighbor Density-based clustering (SNN-DBSCAN) is a robust graph-based clustering algorithm with wide applications, from climate data analysis to network intrusion detection. We propose an incremental extension of this algorithm, IncSNN-DBSCAN, capable of finding clusters in a dataset to which frequent inserts are made. For each data point, the algorithm maintains four properties: the nearest neighbor list, strengths of shared links, total connection strength, and the topic property. The algorithm targets only those points whose properties change. We prove that, to obtain the exact clustering, it is sufficient to re-compute properties for only the targeted points, followed by possible cluster mergers on newly formed links and cluster splits on deleted links. Experiments on the KDD Cup 1999 and Mopsi search engine 2012 datasets demonstrate 75% and 99% reductions, respectively, in the size of the set of points involved in property re-computations. By avoiding most of the redundant property computations, the algorithm achieves speedups of up to 250 and 1000 times, respectively, on those datasets, while producing exactly the same clustering as the non-incremental algorithm. We experimentally verify our claim for up to 2500 inserts on both datasets. However, this speedup comes at the cost of up to 48 times more memory usage.
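As a rough illustration of the incremental idea, the sketch below shows how one might identify the points affected by a single insert; the function name and the brute-force distance checks are assumptions for clarity, not the paper's actual data structures.

```python
# A point is affected by an insert only if the new point enters its
# k-nearest-neighbor list; only affected points need their SNN properties
# (shared-link strengths, total connection strength, topic status) recomputed.
import numpy as np

def affected_points(data, knn_lists, new_point, k):
    """Return indices of existing points whose kNN list gains the new point."""
    affected = []
    for i, x in enumerate(data):
        d_new = np.linalg.norm(x - new_point)
        # Distance to the current k-th nearest neighbor of point i.
        kth = max(np.linalg.norm(x - data[j]) for j in knn_lists[i])
        if d_new < kth:                 # new point displaces an old neighbor
            affected.append(i)
    return affected
```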
social network mining and analysis | 2009
Amit Awekar; Nagiza F. Samatova; Paul Breimyer
All Pairs Similarity Search (APSS) is a ubiquitous problem in many data mining applications and involves finding all pairs of records with similarity scores above a specified threshold. In this paper, we introduce the problem of Incremental All Pairs Similarity Search (IAPSS), where APSS is performed multiple times over the same dataset by varying the similarity threshold. To the best of our knowledge, this is the first work that addresses the IAPSS problem. All existing solutions for APSS perform redundant computations by invoking APSS independently for each threshold value. In contrast, our solution to the IAPSS problem avoids redundant computations by storing the history of previous APSS invocations and using index splitting. While offering obvious benefits, the computation- and I/O-intensive nature of the IAPSS solution raises two key research challenges: (1) to develop efficient I/O techniques to manage the computation history, and (2) to efficiently identify and prune redundant computations. We address these challenges through the proposed (a) history binning technique, which clusters record pairs based on similarity values and performs I/O during the similarity computation, and (b) splitting of the inverted index, which maps each dimension to a list of records that have a non-zero projection along that dimension. We evaluate the effectiveness of our techniques by demonstrating speed-ups on the order of 2X to over 10^5X over the state-of-the-art APSS algorithm on four real-world large-scale datasets.
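The history-binning idea can be sketched as follows; the bin width, class name, and bookkeeping of the already-covered similarity range are illustrative assumptions, not the paper's exact design.

```python
# Record pairs found during one APSS run are bucketed by similarity so that a
# later run with a lower threshold can replay them and only compute the
# still-missing similarity range.
from collections import defaultdict

class PairHistory:
    def __init__(self, bin_width=0.01):
        self.bin_width = bin_width
        self.bins = defaultdict(list)      # bin index -> [(id_a, id_b, sim)]
        self.covered_from = 1.0            # lowest threshold already computed

    def record(self, id_a, id_b, sim):
        self.bins[int(sim / self.bin_width)].append((id_a, id_b, sim))

    def pairs_above(self, threshold):
        """Replay stored pairs with similarity >= threshold."""
        start = int(threshold / self.bin_width)
        return [p for b in self.bins if b >= start
                for p in self.bins[b] if p[2] >= threshold]

    def missing_range(self, threshold):
        """Similarity range that still needs a fresh APSS pass, if any."""
        lo, hi = threshold, self.covered_from
        self.covered_from = min(self.covered_from, threshold)
        return (lo, hi) if lo < hi else None
```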
computational intelligence and data mining | 2007
Amit Awekar; Jaewoo Kang
We address the problem of handling topic-oriented tasks on the World Wide Web. Our aim is to find the most relevant and important pages for broad-topic queries while searching within a small set of candidate pages. We present a link-analysis-based algorithm, SelHITS, which is an improvement over Kleinberg's HITS algorithm. We introduce the concept of virtual links to exploit latent information in the hyperlinked environment. Selective expansion of the root set and a novel ranking strategy are the distinguishing features of our approach. The selective expansion method avoids topic drift and provides results consistent with only one interpretation of the query. Experimental evaluation and user feedback show that our algorithm indeed distills the most relevant and important pages for broad-topic queries. Trends in user feedback suggest that there exists a uniform notion of search-result quality among users.
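For context, the sketch below shows the standard hub/authority iteration of Kleinberg's HITS that SelHITS starts from; the virtual links, selective root-set expansion, and ranking strategy that the paper contributes are not reproduced here.

```python
# Standard HITS iteration on a link graph given as an adjacency matrix.
import numpy as np

def hits(adjacency, iters=50):
    """adjacency[i, j] = 1 if page i links to page j."""
    A = np.asarray(adjacency, dtype=float)
    n = A.shape[0]
    hubs = np.ones(n)
    auths = np.ones(n)
    for _ in range(iters):
        auths = A.T @ hubs               # good authorities are linked by good hubs
        hubs = A @ auths                 # good hubs link to good authorities
        auths /= np.linalg.norm(auths) or 1.0
        hubs /= np.linalg.norm(hubs) or 1.0
    return hubs, auths
```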
international world wide web conferences | 2006
Amit Awekar; Pabitra Mitra; Jaewoo Kang
We address the problem of answering broad-topic queries on the World Wide Web. We present a link-analysis-based algorithm, SelHITS, which is an improvement over Kleinberg's HITS [2] algorithm. We introduce the concept of virtual links to exploit the latent information in the hyperlinked environment. We propose a novel approach to calculate hub and authority values. We also present a selective expansion method that avoids topic drift and provides results consistent with only one interpretation of the query, even if the query is ambiguous. Initial experimental evaluation and user feedback show that our algorithm indeed distills the most important and relevant pages for broad-topic queries. We also infer that there exists a uniform notion of search-result quality among users.
european conference on information retrieval | 2018
Sweta Agrawal; Amit Awekar
Harassment by cyberbullies is a significant phenomenon on social media. Existing works on cyberbullying detection have at least one of the following three bottlenecks. First, they target only one particular social media platform (SMP). Second, they address just one topic of cyberbullying. Third, they rely on carefully handcrafted features of the data. We show that deep learning based models can overcome all three bottlenecks. Knowledge learned by these models on one dataset can be transferred to other datasets. We performed extensive experiments using three real-world datasets: Formspring (12k posts), Twitter (16k posts), and Wikipedia (100k posts). Our experiments provide several useful insights about cyberbullying detection. To the best of our knowledge, this is the first work that systematically analyzes cyberbullying detection on various topics across multiple SMPs using deep learning based models and transfer learning.
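The transfer-learning workflow can be sketched roughly as below. The architecture (a small embedding-bag classifier), hyperparameters, and dataset handles are illustrative assumptions, not the paper's actual models.

```python
# Train a small text classifier on one platform's data, then fine-tune the
# same weights on another platform's data instead of training from scratch.
import torch
import torch.nn as nn

class BullyingClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, hidden=32):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, embed_dim)   # averages token embeddings
        self.head = nn.Sequential(nn.Linear(embed_dim, hidden), nn.ReLU(),
                                  nn.Linear(hidden, 2))        # bully / not-bully

    def forward(self, token_ids, offsets):
        return self.head(self.embed(token_ids, offsets))

def train(model, batches, epochs=3, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for token_ids, offsets, labels in batches:
            opt.zero_grad()
            loss = loss_fn(model(token_ids, offsets), labels)
            loss.backward()
            opt.step()

# Hypothetical usage: learn on the source platform, fine-tune on the target.
# model = BullyingClassifier(vocab_size=50_000)
# train(model, twitter_batches)                # source-platform training
# train(model, formspring_batches, lr=1e-4)    # fine-tune on target platform
```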
international world wide web conferences | 2017
Anasua Mitra; Amit Awekar
The number of published scholarly articles is growing exponentially. To tackle this information overload, researchers are increasingly depending on niche academic search engines. Recent work has shown that two major general web search engines, Google and Bing, have a high level of agreement in their top search results. In contrast, we show that various academic search engines have a low degree of agreement among themselves. We performed experiments using 2500 queries over four academic search engines. We observe that the overlap in the search result sets of any combination of academic search engines is significantly low, and in most cases the result sets are mutually exclusive. We also discuss the implications of this low overlap.
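The kind of pairwise comparison used in such a study can be sketched as follows; Jaccard overlap of the top-k URL sets is an assumed metric for illustration, not necessarily the exact measure from the paper.

```python
# Measure agreement between two engines' result lists for the same query.
def overlap(results_a, results_b, k=10):
    """Jaccard overlap between the top-k results of two search engines."""
    top_a, top_b = set(results_a[:k]), set(results_b[:k])
    union = top_a | top_b
    return len(top_a & top_b) / len(union) if union else 0.0

# Example with hypothetical result lists for one query.
engine_one = ["u1", "u2", "u3", "u4"]
engine_two = ["u3", "u9", "u1", "u7"]
print(overlap(engine_one, engine_two, k=4))   # 2 shared / 6 total ≈ 0.33
```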
european conference on information retrieval | 2017
Siddhesh Khandelwal; Amit Awekar
There has been considerable work on improving the popular clustering algorithm K-means in terms of both mean squared error (MSE) and speed. However, most K-means variants compute the distance of each data point to each cluster centroid in every iteration. We propose a fast heuristic to overcome this bottleneck with only a marginal increase in MSE. We observe that, across all iterations of K-means, a data point changes its membership only among a small subset of clusters. Our heuristic predicts such clusters for each data point by looking at nearby clusters after the first iteration of K-means. We augment well-known variants of K-means with our heuristic to demonstrate its effectiveness. For various synthetic and real-world datasets, our heuristic achieves speed-ups of up to 3 times compared to efficient variants of K-means.
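A rough sketch of the candidate-cluster heuristic follows; the choice of the m nearest centroids as the candidate set and the function signature are assumptions for illustration, not the paper's exact procedure.

```python
# After the first full K-means iteration, each point only considers a small
# set of nearby centroids in later iterations instead of all k centroids.
import numpy as np

def kmeans_with_candidates(X, k, m=3, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    # First iteration: full distance computation; record each point's
    # m nearest centroids as its candidate set.
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    candidates = np.argsort(d, axis=1)[:, :m]          # (n, m) candidate clusters
    labels = candidates[:, 0]
    for _ in range(iters - 1):
        for c in range(k):                              # update centroids
            members = X[labels == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
        # Later iterations: distances only to each point's candidate centroids.
        d_cand = np.linalg.norm(X[:, None, :] - centroids[candidates], axis=2)
        labels = candidates[np.arange(len(X)), np.argmin(d_cand, axis=1)]
    return labels, centroids
```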
european conference on information retrieval | 2017
Panthadeep Bhattacharjee; Amit Awekar
Incremental data mining algorithms process frequent updates to dynamic datasets efficiently by avoiding redundant computation. The existing incremental extension of the shared nearest neighbor density-based clustering (SNND) algorithm cannot handle deletions from the dataset and handles insertions only one point at a time. We present an incremental algorithm that overcomes both these bottlenecks by efficiently identifying the affected parts of clusters while processing updates to the dataset in batch mode. We show the effectiveness of our algorithm through experiments on large synthetic as well as real-world datasets. Our algorithm is up to four orders of magnitude faster than SNND and requires up to 60% more memory than SNND, while producing output identical to SNND.
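A rough sketch of the batch-mode bookkeeping is given below; the helper predicate and data layout are assumptions for illustration, not the paper's actual procedure.

```python
# For a whole batch of inserts and deletes, collect the affected points once,
# so each point's SNN properties are recomputed at most once per batch rather
# than once per individual update.
def affected_by_batch(knn_lists, deleted_ids, gains_new_neighbor):
    """knn_lists: point_id -> current kNN list.
    gains_new_neighbor(point_id): True if some inserted point enters its kNN list."""
    deleted = set(deleted_ids)
    affected = set()
    for pid, neighbors in knn_lists.items():
        if pid in deleted:
            continue
        # A deletion removes one of its neighbors, or an insertion displaces one.
        if deleted & set(neighbors) or gains_new_neighbor(pid):
            affected.add(pid)
    return affected
```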
conference of the european chapter of the association for computational linguistics | 2017
Abhishek Abhishek; Ashish Anand; Amit Awekar