Prithviraj Sen
IBM
Publications
Featured research published by Prithviraj Sen.
very large data bases | 2014
Matthias Boehm; Shirish Tatikonda; Berthold Reinwald; Prithviraj Sen; Yuanyuan Tian; Douglas Burdick; Shivakumar Vaithyanathan
SystemML aims at declarative, large-scale machine learning (ML) on top of MapReduce, where high-level ML scripts with R-like syntax are compiled to programs of MR jobs. The declarative specification of ML algorithms enables automatic optimization, in contrast to existing large-scale machine learning libraries. SystemML's primary focus is on data parallelism, but many ML algorithms inherently exhibit opportunities for task parallelism as well. A major challenge is how to efficiently combine both types of parallelism for arbitrary ML scripts and workloads. In this paper, we present a systematic approach for combining task and data parallelism for large-scale machine learning on top of MapReduce. We employ a generic Parallel FOR construct (ParFOR) as known from high performance computing (HPC). Our core contributions are (1) complementary parallelization strategies for exploiting multi-core and cluster parallelism, as well as (2) a novel cost-based optimization framework for automatically creating optimal parallel execution plans. Experiments on a variety of use cases showed that this approach achieves both efficiency and scalability due to automatic adaptation to ad-hoc workloads and unknown data characteristics.
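The ParFOR idea, reduced to its essentials: loop iterations are independent, so they can be farmed out to workers (task parallelism) while each iteration runs vectorized matrix operations (data parallelism). The sketch below is a minimal Python analogy of that pattern, not SystemML's optimizer or MapReduce runtime; the workload and function names are hypothetical.

```python
# Conceptual sketch of a ParFOR-style loop: iterations are independent, so
# they are dispatched to worker processes (task parallelism), while each
# iteration itself performs vectorized matrix operations (data parallelism).
# This illustrates the pattern only; it is not SystemML's runtime.
from concurrent.futures import ProcessPoolExecutor
import numpy as np

def train_one_model(seed: int) -> float:
    """One independent iteration: fit a small least-squares model on
    randomly generated data and return its training error."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((1000, 10))
    y = X @ rng.standard_normal(10) + 0.1 * rng.standard_normal(1000)
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.mean((X @ w - y) ** 2))

if __name__ == "__main__":
    # Task parallelism: the "parfor" body runs concurrently per iteration.
    with ProcessPoolExecutor() as pool:
        errors = list(pool.map(train_one_model, range(8)))
    print(errors)
```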
international world wide web conferences | 2012
Prithviraj Sen
A crucial step in adding structure to unstructured data is to identify references to entities and disambiguate them. Such disambiguated references can help enhance readability and draw similarities across different pieces of running text in an automated fashion. Previous research has tackled this problem by first forming a catalog of entities from a knowledge base, such as Wikipedia, and then using this catalog to disambiguate references in unseen text. However, most of the previously proposed models either do not use all text in the knowledge base, potentially missing out on discriminative features, or do not exploit word-entity proximity to learn high-quality catalogs. In this work, we propose topic models that keep track of the context of every word in the knowledge base, so that words appearing within the same context as an entity are more likely to be associated with that entity. Thus, our topic models utilize all text present in the knowledge base and help learn high-quality catalogs. Our models also learn groups of co-occurring entities, thus enabling collective disambiguation. Unlike most previous topic models, our models are non-parametric and do not require the user to specify the exact number of groups present in the knowledge base. In experiments performed on an extract of Wikipedia containing almost 60,000 references, our models outperform SVM-based baselines by as much as 18% in terms of disambiguation accuracy, translating to an increase of almost 11,000 correctly disambiguated references.
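To make the core intuition concrete, the toy sketch below scores candidate entities for a mention by how well their knowledge-base context words match the mention's surrounding words. The mini-catalog, entities, and scoring rule are hypothetical illustrations; the paper's actual approach is a non-parametric topic model with collective disambiguation.

```python
# Toy illustration of context-based entity disambiguation: each candidate
# entity carries a bag of context words learned from a knowledge base, and a
# mention resolves to the candidate whose context best matches the words
# around the mention. This only shows the intuition, not the paper's models.
from collections import Counter

# Hypothetical mini-catalog: entity -> context word counts.
catalog = {
    "Jaguar (animal)": Counter("big cat rainforest predator prey".split()),
    "Jaguar (car)":    Counter("luxury car engine sedan british".split()),
}

def disambiguate(mention_context: list[str]) -> str:
    """Pick the entity whose catalog context overlaps most with the mention."""
    def score(entity: str) -> int:
        return sum(catalog[entity][w] for w in mention_context)
    return max(catalog, key=score)

print(disambiguate("the engine of the new luxury sedan".split()))
```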
very large data bases | 2016
Matthias Boehm; Michael W. Dusenberry; Deron Eriksson; Alexandre V. Evfimievski; Faraz Makari Manshadi; Niketan Pansare; Berthold Reinwald; Frederick R. Reiss; Prithviraj Sen; Arvind C. Surve; Shirish Tatikonda
The rising need for custom machine learning (ML) algorithms and the growing data sizes that require the exploitation of distributed, data-parallel frameworks such as MapReduce or Spark, pose significant productivity challenges to data scientists. Apache SystemML addresses these challenges through declarative ML by (1) increasing the productivity of data scientists as they are able to express custom algorithms in a familiar domain-specific language covering linear algebra primitives and statistical functions, and (2) transparently running these ML algorithms on distributed, data-parallel frameworks by applying cost-based compilation techniques to generate efficient, low-level execution plans with in-memory single-node and large-scale distributed operations. This paper describes SystemML on Apache Spark, end to end, including insights into various optimizer and runtime techniques as well as performance characteristics. We also share lessons learned from porting SystemML to Spark and declarative ML in general. Finally, SystemML is open-source, which allows the database community to leverage it as a testbed for further research.
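As a rough illustration of the workflow, the sketch below submits a small DML script through SystemML's Python MLContext API on Spark; exact class and method names vary across SystemML versions, so treat the API calls as an approximation of the workflow rather than a reference.

```python
# Hedged sketch: run a DML script on Spark via SystemML's Python MLContext.
# Class/method names are approximate and may differ by SystemML version.
from pyspark.sql import SparkSession
from systemml import MLContext, dml   # assumes the `systemml` package is installed

spark = SparkSession.builder.appName("systemml-sketch").getOrCreate()
ml = MLContext(spark)

# A small linear-algebra script in SystemML's R-like DML; the cost-based
# compiler decides whether each operation runs in-memory single-node or as
# distributed Spark operations.
script = dml("""
    X = rand(rows=10000, cols=100)
    y = rand(rows=10000, cols=1)
    w = solve(t(X) %*% X, t(X) %*% y)   # normal-equations least squares
""").output("w")

result = ml.execute(script)
print(result.get("w"))
```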
advances in social networks analysis and mining | 2013
Nagarajan Natarajan; Prithviraj Sen; Vineet Chaoji
Network structure and content in microblogging sites like Twitter influence each other: user A on Twitter follows user B for the tweets that B posts on the network, and A may then re-tweet the content shared by B to his/her own followers. In this paper, we propose a probabilistic model to jointly model link communities and content topics by leveraging both the social graph and the content shared by users. We model a community as a distribution over users, use it as a source for topics of interest, and jointly infer both communities and topics using Gibbs sampling. While modeling communities using the social graph, or modeling topics using content, have each received a great deal of attention, only a few recent approaches try to model topics in content-sharing platforms using both content and the social graph. Our work differs from the existing generative models in that we explicitly model the social graph of users along with the user-generated content, mimicking how the two entities co-evolve in content-sharing platforms. Recent studies have found Twitter to be more of a content-sharing network and less of a social network, and it seems hard to detect tightly knit communities from the follower-followee links. Still, the question of whether we can extract Twitter communities using both links and content is open. In this paper, we answer this question in the affirmative. Our model discovers coherent communities and topics, as evinced by qualitative results on sub-graphs of Twitter users. Furthermore, we evaluate our model on the task of predicting follower-followee links. We show that joint modeling of links and content significantly improves link prediction performance on a sub-graph of Twitter (consisting of about 0.7 million users and over 27 million tweets), compared to generative models based on only structure or only content, and to path-based methods such as Katz.
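For context on the path-based baseline named above, the sketch below computes the Katz similarity index on a tiny adjacency matrix using its closed form; this is only the baseline used for comparison in link prediction, not the paper's joint link-and-content generative model, and the example graph is hypothetical.

```python
# Minimal sketch of the Katz path-based baseline: the Katz score between
# users i and j sums over all paths between them, down-weighting longer
# paths: sum_{l>=1} beta^l * A^l = (I - beta*A)^{-1} - I, for beta small
# enough (below 1 / largest eigenvalue of A).
import numpy as np

def katz_scores(adjacency: np.ndarray, beta: float = 0.05) -> np.ndarray:
    """Closed-form Katz index for an undirected follower graph."""
    n = adjacency.shape[0]
    return np.linalg.inv(np.eye(n) - beta * adjacency) - np.eye(n)

# Tiny 4-user example graph; scores[i, j] ranks candidate links to predict.
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)
print(katz_scores(A).round(3))
```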
symposium on cloud computing | 2013
Matthias Boehm; Douglas Burdick; Alexandre V. Evfimievski; Berthold Reinwald; Prithviraj Sen; Shirish Tatikonda; Yuanyuan Tian
Analytics on big data range from passenger volume prediction in transportation to customer satisfaction in automotive diagnostic systems, and from correlation analysis in social media data to log analysis in manufacturing. Expressing and running these analytics for varying data characteristics and at scale is challenging. To address these challenges, SystemML implements a declarative, high-level language using an R-like syntax extended with machine-learning-specific constructs, which is compiled to a MapReduce runtime [2]. The language is rich enough to express a wide class of statistical, predictive modeling and machine learning algorithms (Fig. 1). We chose robust algorithms that scale to large, potentially sparse data with many features.
very large data bases | 2017
Shreyas Bharadwaj; Laura Chiticariu; Marina Danilevsky; Samarth Dhingra; Samved Divekar; Arnaldo Carreno-Fuentes; Himanshu Gupta; Nitin Gupta; Sang-Don Han; Mauricio A. Hernández; Howard Ho; Parag Jain; Salil Joshi; Hima P. Karanam; Saravanan Krishnan; Rajasekar Krishnamurthy; Yunyao Li; Satishkumaar Manivannan; Ashish R. Mittal; Fatma Ozcan; Abdul Quamar; Poornima Raman; Diptikalyan Saha; Karthik Sankaranarayanan; Jaydeep Sen; Prithviraj Sen; Shivakumar Vaithyanathan; Mitesh Vasa; Hao Wang; Huaiyu Zhu
The ability to create and interact with large-scale domain-specific knowledge bases from unstructured and semi-structured data is the foundation for many industry-focused cognitive systems. We will demonstrate the Content Services system that provides cloud services for creating and querying high-quality domain-specific knowledge bases by analyzing and integrating multiple unstructured and semi-structured content sources. We will showcase an instantiation of the system for a financial domain. We will also demonstrate both cross-lingual natural language queries and programmatic API calls for interacting with this knowledge base.
IEEE Data(base) Engineering Bulletin | 2014
Matthias Böhm; Douglas Burdick; Alexandre V. Evfimievski; Berthold Reinwald; Frederick Reiss; Prithviraj Sen; Shirish Tatikonda; Yuanyuan Tian
conference on innovative data systems research | 2017
Tarek Elgamal; Shangyu Luo; Matthias Boehm; Alexandre V. Evfimievski; Shirish Tatikonda; Berthold Reinwald; Prithviraj Sen
conference on information and knowledge management | 2017
Kun Qian; Lucian Popa; Prithviraj Sen
Archive | 2016
Matthias Boehm; Douglas Burdick; Berthold Reinwald; Prithviraj Sen; Shirish Tatikonda; Yuanyuan Tian; Shivakumar Vaithyanathan