Diana Palsetia | Researchain

Publication

Featured researches published by Diana Palsetia.

International Conference on Data Mining | 2011

Twitter Trending Topic Classification

Kathy Lee; Diana Palsetia; Ramanathan Narayanan; Md. Mostofa Ali Patwary; Ankit Agrawal; Alok N. Choudhary

With the increasing popularity of microblogging sites, we are in the era of information explosion. As of June 2011, about 200 million tweets are being generated every day. Although Twitter provides a real-time list of the most popular topics people tweet about, known as Trending Topics, it is often hard to understand what these trending topics are about. Therefore, it is important to classify these topics into general categories with high accuracy for better information retrieval. To address this problem, we classify Twitter Trending Topics into 18 general categories such as sports, politics, and technology. We experiment with two approaches for topic classification: (i) the well-known Bag-of-Words approach for text classification and (ii) network-based classification. In the text-based classification method, we construct word vectors from the trending topic definition and tweets, and the commonly used tf-idf weights are used to classify the topics using a Naive Bayes Multinomial classifier. In the network-based classification method, we identify the top 5 similar topics for a given topic based on the number of common influential users. The categories of the similar topics and the number of common influential users between the given topic and its similar topics are used to classify the given topic using a C5.0 decision tree learner. Experiments on a database of 768 randomly selected trending topics (over 18 classes) show that classification accuracy of up to 65% and 70% can be achieved using text-based and network-based classification modeling, respectively.
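
The network-based step described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the topic names, user IDs, and the helper names `top_similar_topics` and `vote_category` are hypothetical, and a simple weighted vote stands in for the C5.0 decision tree learner that the paper uses.

```python
from collections import Counter


def top_similar_topics(target, influencers, k=5):
    """Rank other topics by the number of influential users shared with `target`."""
    target_users = influencers[target]
    overlap = {
        topic: len(target_users & users)
        for topic, users in influencers.items()
        if topic != target
    }
    # Sort by shared-user count (descending), then by name for a stable order.
    ranked = sorted(overlap.items(), key=lambda kv: (-kv[1], kv[0]))
    # Keep at most k topics that share at least one influential user.
    return [(topic, shared) for topic, shared in ranked[:k] if shared > 0]


def vote_category(similar, categories):
    """Stand-in classifier (assumption): each similar topic votes for its
    category, weighted by its number of shared influential users."""
    votes = Counter()
    for topic, shared in similar:
        votes[categories[topic]] += shared
    return votes.most_common(1)[0][0] if votes else None


# Hypothetical toy data: topic -> set of influential user IDs.
influencers = {
    "#worldcup": {"u1", "u2", "u3"},
    "#nba": {"u1", "u2", "u9"},
    "#election": {"u7", "u8"},
    "#iphone": {"u3", "u5"},
}
# Hypothetical known categories for already-labeled topics.
categories = {"#nba": "sports", "#election": "politics", "#iphone": "technology"}

similar = top_similar_topics("#worldcup", influencers)
predicted = vote_category(similar, categories)
```

In the paper, the categories of the similar topics and the shared-user counts are features for a learned classifier; the vote here only shows how those two signals could be combined.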

IEEE International Conference on High Performance Computing Data and Analytics | 2012

A new scalable parallel DBSCAN algorithm using the disjoint-set data structure

Md. Mostofa Ali Patwary; Diana Palsetia; Ankit Agrawal; Wei-keng Liao; Fredrik Manne; Alok N. Choudhary

DBSCAN is a well-known density-based clustering algorithm capable of discovering arbitrarily shaped clusters and eliminating noise data. However, parallelization of DBSCAN is challenging as it exhibits an inherently sequential data access order. Moreover, existing parallel implementations adopt a master-slave strategy, which can easily cause an unbalanced workload and hence result in low parallel efficiency. We present a new parallel DBSCAN algorithm (PDSDBSCAN) using graph algorithmic concepts. More specifically, we employ the disjoint-set data structure to break the access sequentiality of DBSCAN. In addition, we use a tree-based bottom-up approach to construct the clusters. This yields a better-balanced workload distribution. We implement the algorithm for both shared and distributed memory. Using data sets containing up to several hundred million high-dimensional points, we show that PDSDBSCAN significantly outperforms the master-slave approach, achieving speedups of up to 25.97 using 40 cores on a shared-memory architecture, and speedups of up to 5,765 using 8,192 cores on a distributed-memory architecture.
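
To illustrate how a disjoint-set structure can replace DBSCAN's sequential cluster expansion, here is a minimal serial sketch. It is not the PDSDBSCAN implementation: the brute-force neighbor search, the function name `dbscan_disjoint_set`, and the label convention (cluster root index, or -1 for noise) are assumptions made for illustration.

```python
from math import dist


class DisjointSet:
    """Union-find with path halving."""

    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra


def dbscan_disjoint_set(points, eps, min_pts):
    n = len(points)
    # Brute-force eps-neighborhoods; each point counts as its own neighbor.
    neighbors = [
        [j for j in range(n) if dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    core = [len(nb) >= min_pts for nb in neighbors]
    ds = DisjointSet(n)
    assigned = list(core)  # core points always belong to a cluster
    for i in range(n):
        if not core[i]:
            continue
        for j in neighbors[i]:
            if core[j]:
                ds.union(i, j)  # two core points: merge their trees
            elif not assigned[j]:
                assigned[j] = True  # border point joins the first core that reaches it
                ds.union(i, j)
    # Label each point by the root of its tree; unassigned points are noise.
    return [ds.find(i) if assigned[i] else -1 for i in range(n)]


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan_disjoint_set(points, eps=1.5, min_pts=3)
```

Because the unions between core points can be applied in any order, the outer loop no longer depends on a particular expansion order, which is the property a parallel version can exploit.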

Communications of the ACM | 2012

Social media evolution of the Egyptian revolution

Alok N. Choudhary; William Hendrix; Kathy Lee; Diana Palsetia; Wei-keng Liao

Twitter sentiment was revealed, along with popularity of Egypt-related subjects and tweeter influence on the 2011 revolution.

International Conference on Data Mining | 2011

SES: Sentiment Elicitation System for Social Media Data

Kunpeng Zhang; Yu Cheng; Yusheng Xie; Daniel Honbo; Ankit Agrawal; Diana Palsetia; Kathy Lee; Wei-keng Liao; Alok N. Choudhary

Social media is becoming a major and popular technological platform that allows users to discuss and share information. Information is generated and managed through either computer or mobile devices by one person and consumed by many others. Most of this user-generated content is textual, coming from social networks (Facebook, LinkedIn), microblogging services (Twitter), and blogs (Blogspot, WordPress). Finding valuable nuggets of knowledge, such as capturing and summarizing sentiments from this huge amount of data, could help users make informed decisions. In this paper, we develop a sentiment identification system called SES, which implements three different sentiment identification algorithms. We augment basic compositional semantic rules in the first algorithm. In the second algorithm, we argue that sentiment should not simply be classified as positive, negative, or objective, but should be a continuous score reflecting the degree of sentiment. All word scores are calculated based on a large volume of customer reviews. Due to the special characteristics of social media texts, we propose a third algorithm that takes emoticons, negation word position, and domain-specific words into account. Furthermore, a machine learning model is employed on features derived from the outputs of the three algorithms. We conduct our experiments on user comments from Facebook and tweets from Twitter. The results show that Random Forest achieves better accuracy than decision tree, neural network, and logistic regression models. We also propose a flexible way to represent document sentiment based on the sentiments of the sentences it contains. SES is available online.

IEEE International Conference on High Performance Computing Data and Analytics | 2013

Scalable parallel OPTICS data clustering using graph algorithmic techniques

Md. Mostofa Ali Patwary; Diana Palsetia; Ankit Agrawal; Wei-keng Liao; Fredrik Manne; Alok N. Choudhary

OPTICS is a hierarchical density-based data clustering algorithm that discovers arbitrarily shaped clusters and eliminates noise using adjustable reachability distance thresholds. Parallelizing OPTICS is considered challenging as the algorithm exhibits a strongly sequential data access order. We present a scalable parallel OPTICS algorithm (POPTICS) designed using graph algorithmic concepts. To break the data access sequentiality, POPTICS exploits the similarities between the OPTICS algorithm and Prim's Minimum Spanning Tree algorithm. Additionally, we use the disjoint-set data structure to achieve high parallelism for distributed cluster extraction. Using high-dimensional datasets containing up to a billion floating point numbers, we show scalable speedups of up to 27.5 for our OpenMP implementation on a 40-core shared-memory machine, and up to 3,008 for our MPI implementation on a 4,096-core distributed-memory machine. We also show that the quality of the results given by POPTICS is comparable to that of the classical OPTICS algorithm.
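
For readers unfamiliar with Prim's algorithm, the sketch below shows the priority-queue pattern it shares with OPTICS: both repeatedly pull the closest unprocessed point from a queue. This is plain serial Prim on a complete graph, not POPTICS; using Euclidean distance as the edge weight and the function name `prim_mst` are assumptions for illustration.

```python
import heapq
from math import dist


def prim_mst(points):
    """Return MST edges (u, v, weight) over the complete Euclidean graph."""
    n = len(points)
    in_tree = [False] * n
    mst = []
    # Heap entries are (weight, from_vertex, to_vertex); start at vertex 0.
    heap = [(0.0, 0, 0)] if n else []
    while heap:
        weight, u, v = heapq.heappop(heap)
        if in_tree[v]:
            continue  # stale entry: v was already reached by a cheaper edge
        in_tree[v] = True
        if u != v:
            mst.append((u, v, weight))
        # Offer edges from the newly added vertex to every vertex outside the tree.
        for nxt in range(n):
            if not in_tree[nxt]:
                heapq.heappush(heap, (dist(points[v], points[nxt]), v, nxt))
    return mst


points = [(0, 0), (1, 0), (1, 1), (5, 1)]
mst = prim_mst(points)
total_weight = sum(w for _, _, w in mst)
```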

International Conference on Data Engineering | 2014

SILVERBACK: Scalable association mining for temporal data in columnar probabilistic databases

Yusheng Xie; Diana Palsetia; Goce Trajcevski; Ankit Agrawal; Alok N. Choudhary

We address the problem of large-scale probabilistic association rule mining and consider the trade-offs between the accuracy of the mining results and the quest for scalability on modest hardware infrastructure. We demonstrate how extensions and adaptations of research findings can be integrated in an industrial application, and we present the commercially deployed SILVERBACK framework, developed at Voxsup Inc. SILVERBACK tackles the storage efficiency problem by proposing a probabilistic columnar infrastructure and using Bloom filters and reservoir sampling techniques. In addition, a probabilistic pruning technique based on Apriori has been introduced for mining frequent item-sets. The proposed target-driven technique yields a significant reduction in the size of the frequent item-set candidates. We present extensive experimental evaluations which demonstrate the benefits of a context-aware incorporation of infrastructure limitations into corresponding research techniques. The experiments indicate that, when compared to the traditional Hadoop-based approach for improving scalability by adding more hosts, SILVERBACK, which has been commercially deployed and developed at Voxsup Inc. since May 2011, has much better run-time performance with negligible accuracy sacrifices.
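
The two probabilistic building blocks named in this abstract, Bloom filters and reservoir sampling, can be sketched generically as below. This is not SILVERBACK's code: the class and function names, the SHA-256-based hashing scheme, and the default sizes are all assumptions chosen for a short, self-contained example.

```python
import hashlib
import random


class BloomFilter:
    """Approximate set membership: no false negatives, some false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive several bit positions by salting one hash function (assumed scheme).
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def reservoir_sample(stream, k, rng=random):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


seen = BloomFilter()
for user_id in ("alice", "bob", "carol"):
    seen.add(user_id)

sample = reservoir_sample(range(1000), k=10, rng=random.Random(0))
```

Both structures trade a small, controllable amount of accuracy for fixed memory use, which matches the accuracy versus scalability trade-off the abstract describes.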

2013 IEEE Symposium on Large-Scale Data Analysis and Visualization (LDAV) | 2013

A scalable algorithm for single-linkage hierarchical clustering on distributed-memory architectures

William Hendrix; Diana Palsetia; Md. Mostofa Ali Patwary; Ankit Agrawal; Wei-keng Liao; Alok N. Choudhary

Hierarchical clustering is a fundamental and widely used clustering algorithm with many advantages over traditional partitional clustering. Due to the explosion in size of modern scientific datasets, there is a pressing need for scalable analytics algorithms, but good scaling is difficult to achieve for hierarchical clustering due to data dependencies inherent in the algorithm. To the best of our knowledge, no previous work on parallel hierarchical clustering has shown scalability beyond a couple hundred processes. In this paper, we present PINK, a scalable parallel algorithm for single-linkage hierarchical clustering based on decomposing a problem instance into two different types of subproblems. Despite the heterogeneous workloads, our algorithm exhibits good load balancing, as well as low memory requirements and a communication pattern that is both low-volume and deterministic. Evaluating PINK on up to 6050 processes, we find that it achieves speedups of up to approximately 6600.
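
As background, single-linkage clustering merges the two closest clusters at each step, which is equivalent to adding minimum spanning tree edges in increasing order of weight. The serial sketch below shows that baseline using a Kruskal-style pass with union-find. It is not PINK and does not reproduce its problem decomposition; the function name `single_linkage` and the brute-force pair enumeration are assumptions.

```python
from itertools import combinations
from math import dist


def single_linkage(points):
    """Return merges (i, j, distance) in the order single-linkage performs them."""
    n = len(points)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Consider all pairs from closest to farthest (O(n^2) pairs; fine for a sketch).
    pairs = sorted(
        combinations(range(n), 2),
        key=lambda p: dist(points[p[0]], points[p[1]]),
    )
    merges = []
    for i, j in pairs:
        ri, rj = find(i), find(j)
        if ri != rj:  # the pair joins two different clusters
            parent[rj] = ri
            merges.append((i, j, dist(points[i], points[j])))
    return merges


points = [(0,), (1,), (3,), (7,)]
merges = single_linkage(points)
```

The sorted, inherently ordered sequence of merges is the kind of data dependency that makes this algorithm hard to scale without a decomposition like the one the paper proposes.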

Social Network Analysis and Mining | 2014

Excavating social circles via user interests

Diana Palsetia; Md. Mostofa Ali Patwary; Ankit Agrawal; Alok N. Choudhary

The rapid evolution of modern social networks motivates the design of networks based on users’ interests. Using popular social media such as Facebook and Twitter, we show that this new perspective can bring more meaningful information about the networks. In this paper, we model user-interest-based networks by deducing intent from social media activities such as comments and tweets of millions of users on Facebook and Twitter, respectively. This interactive content yields networks that are dynamic in nature, as user interests can evolve due to temporal and spatial activities occurring around the user. To excavate social circles, we develop an approach that iteratively removes the influence of the communities identified in the previous steps by the widely used Clauset, Newman, and Moore (CNM) community detection algorithm. Experimental results show that our approach can detect communities at a much finer scale than the CNM algorithm. Our user-interest-based model and community extraction methodology together can be used to identify target communities in the context of business requirements.

Knowledge and Information Systems | 2017

SILVERBACK+: scalable association mining via fast list intersection for columnar social data

Yusheng Xie; Zhengzhang Chen; Diana Palsetia; Goce Trajcevski; Ankit Agrawal; Alok N. Choudhary

We present Silverback+, a scalable probabilistic framework for accurate association rule and frequent item-set mining of large-scale social behavioral data. Silverback+ tackles the problem of efficient storage utilization and management via (1) a probabilistic columnar infrastructure and (2) Bloom filters and sampling techniques. In addition, probabilistic pruning techniques based on the Apriori method are developed for accelerating the mining of frequent item-sets. The proposed target-driven techniques yield a significant reduction in the size of the frequent item-set candidates, as well as in the required number of repetitive membership checks, through a novel list intersection algorithm. Extensive experimental evaluations demonstrate the benefits of this context-aware consideration and incorporation of the infrastructure limitations when utilizing the corresponding research techniques. When compared to the traditional Hadoop-based approach for improving scalability by straightforwardly adding more hosts, Silverback+ exhibits much better runtime performance, with negligible loss of accuracy.
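
To show why list intersection matters for item-set mining over columnar data, here is a generic two-pointer intersection of sorted ID lists, used to count the support of an item-set. This is the textbook technique, not the novel algorithm the paper proposes; the column names, user IDs, and the helper names `intersect_sorted` and `support` are hypothetical.

```python
def intersect_sorted(a, b):
    """Intersect two ascending lists in O(len(a) + len(b)) time."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def support(columns, itemset):
    """Count users that appear in every column of a non-empty item-set."""
    # Start from the shortest list so intermediate results stay small.
    lists = sorted((columns[item] for item in itemset), key=len)
    result = lists[0]
    for other in lists[1:]:
        result = intersect_sorted(result, other)
        if not result:
            break  # no user can match the full item-set
    return len(result)


# Hypothetical columnar data: item -> sorted list of user IDs with that item.
columns = {
    "likes_A": [1, 2, 3, 5, 8],
    "likes_B": [2, 3, 5, 7],
    "likes_C": [3, 5, 9],
}
count = support(columns, ["likes_A", "likes_B", "likes_C"])
```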

IEEE International Conference on High Performance Computing, Data, and Analytics | 2016

Parallel Community Detection Algorithm Using a Data Partitioning Strategy with Pairwise Subdomain Duplication

Diana Palsetia; William Hendrix; Sunwoo Lee; Ankit Agrawal; Wei-keng Liao; Alok N. Choudhary

Community detection is an important data clustering technique for studying graph structures. Many serial algorithms have been developed and are well studied in the literature. As problem sizes grow, research attention has recently turned to parallelizing the technique. However, conventional parallelization strategies that divide the problem domain into non-overlapping subdomains do not scale with problem size and the number of processes. The main obstacle is that graph algorithms often exhibit a high degree of data dependency, which makes developing scalable parallel algorithms a great challenge.
