Publication


Featured research published by Prasenjit Mitra.


ACM Conference on Hypertext | 2016

Summarizing Situational Tweets in Crisis Scenario

Koustav Rudra; Siddhartha Banerjee; Niloy Ganguly; Pawan Goyal; Muhammad Imran; Prasenjit Mitra

During mass convergence events such as natural disasters, microblogging platforms like Twitter are widely used by affected people to post situational awareness messages. These crisis-related messages are dispersed among multiple categories such as infrastructure damage and information about missing, injured, and dead people. The challenge is to extract important situational updates from these messages, assign them appropriate informational categories, and finally summarize the large volume of information in each category. In this paper, we propose a novel framework which first assigns tweets to different situational classes and then summarizes those tweets. In the summarization phase, we propose a two-stage framework which first extracts a set of important tweets from the whole pool of information through an integer linear programming (ILP) based optimization technique and then applies a word-graph and content-word based abstractive summarization technique to produce the final summary. Our method is time and memory efficient and outperforms the baseline in terms of quality, coverage of events and locations, effectiveness, and utility in disaster scenarios.
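
The ILP-based extraction step can be illustrated as a budgeted content-word coverage problem. The sketch below is my own minimal illustration, not the authors' code: it uses the PuLP library, and the tweets, word weights, and length budget are made-up placeholders.

```python
# Minimal sketch of ILP-based extractive tweet selection: pick tweets that
# maximize coverage of weighted content words under a length budget.
# Requires: pip install pulp. All data below are illustrative placeholders.
from collections import Counter
from pulp import LpProblem, LpMaximize, LpVariable, LpBinary, lpSum, PULP_CBC_CMD

tweets = [
    "bridge on route 9 collapsed, traffic diverted",
    "need volunteers and medical supplies near route 9",
    "power outage reported across the northern district",
    "route 9 bridge collapse confirmed by local police",
]
budget = 20  # maximum total number of words in the summary

docs = [t.split() for t in tweets]
weights = Counter(w for d in docs for w in d)          # content-word weight = frequency
vocab = sorted(weights)

prob = LpProblem("tweet_selection", LpMaximize)
x = [LpVariable(f"x_{i}", cat=LpBinary) for i in range(len(docs))]        # tweet selected?
y = {w: LpVariable(f"y_{j}", cat=LpBinary) for j, w in enumerate(vocab)}  # word covered?

prob += lpSum(weights[w] * y[w] for w in vocab)                      # objective
prob += lpSum(len(d) * x[i] for i, d in enumerate(docs)) <= budget   # length budget
for w in vocab:
    # a word counts as covered only if some selected tweet contains it
    prob += y[w] <= lpSum(x[i] for i, d in enumerate(docs) if w in d)

prob.solve(PULP_CBC_CMD(msg=0))
summary = [tweets[i] for i in range(len(docs)) if x[i].value() == 1]
print(summary)
```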


International Journal on Digital Libraries | 2015

A generalized topic modeling approach for automatic document annotation

Suppawong Tuarob; Line C. Pouchard; Prasenjit Mitra; C. Lee Giles

Ecological and environmental sciences have become more advanced and complex, requiring observational and experimental data from multiple places, times, and thematic scales to verify their hypotheses. Over time, such data have increased not only in volume but also in the diversity and heterogeneity of the sources spread throughout the world. This heterogeneity poses a huge challenge for scientists who must manually search for the data they need. ONEMercury has recently been implemented as part of the DataONE project to alleviate such problems and to serve as a portal for accessing environmental and observational data across the globe. ONEMercury harvests metadata records from multiple archives and repositories and makes them searchable. However, harvested metadata records are sometimes poorly annotated or lack meaningful keywords, which can impede effective retrieval. We propose a methodology that learns annotations from well-annotated collections of metadata records in order to automatically annotate poorly annotated ones. The problem is first transformed into a tag recommendation problem with a controlled tag library. Then, two variants of an algorithm for automatic tag recommendation are presented. Experiments on four datasets of environmental science metadata records show that our methods perform well and also shed light on the nature of the different datasets. We also discuss related topics such as using topical coherence to fine-tune parameters and experiments on cross-archive annotation.
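
The tag recommendation setup can be sketched roughly as follows, assuming a topic model over record descriptions and a nearest-neighbour aggregation of tags. This is a simplified stand-in for the paper's two algorithm variants; it uses scikit-learn's LDA implementation, and the records, tags, and controlled library are invented examples.

```python
# Sketch of topic-model-based tag recommendation: infer topic vectors for
# metadata records with scikit-learn's LDA, then recommend tags for a poorly
# annotated record from its most similar well-annotated neighbours.
# Toy-sized data, so the topic model itself is only illustrative.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity

annotated = {
    "soil moisture measurements in a temperate forest plot": {"soil", "forest"},
    "stream water chemistry and nutrient flux observations": {"water", "chemistry"},
    "long term forest canopy carbon exchange records": {"forest", "carbon"},
}
controlled_library = {"soil", "forest", "water", "chemistry", "carbon"}
query = "carbon flux above the forest canopy"

texts = list(annotated) + [query]
counts = CountVectorizer(stop_words="english").fit_transform(texts)
topics = LatentDirichletAllocation(n_components=2, random_state=0).fit_transform(counts)

sims = cosine_similarity(topics[-1:], topics[:-1])[0]       # query vs. annotated records
scores = {}
for sim, (_, tags) in zip(sims, annotated.items()):
    for tag in tags & controlled_library:                   # keep only controlled tags
        scores[tag] = scores.get(tag, 0.0) + sim

recommended = sorted(scores, key=scores.get, reverse=True)[:3]
print(recommended)
```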


International Conference on Management of Data | 2016

Graph Stream Summarization: From Big Bang to Big Crunch

Nan Tang; Qing Chen; Prasenjit Mitra

A graph stream, i.e., a graph whose edges are updated sequentially in the form of a stream, has important applications in cyber security and social networks. Due to the sheer volume and highly dynamic nature of graph streams, the practical way of handling them is summarization. Given a graph stream G, directed or undirected, the problem of graph stream summarization is to summarize G as SG with much smaller (sublinear) space, linear construction time, and constant maintenance cost per edge update, such that SG allows many queries over G to be answered approximately and efficiently. The widely used practice of summarizing data streams is to treat each stream element independently, e.g., by hash- or sample-based methods, without maintaining the connections (or relationships) between elements. Hence, existing methods can only solve ad-hoc problems and cannot support diversified and complicated analytics over graph streams. We present TCM, a novel generalized graph stream summary. Given an incoming edge, it summarizes both node and edge information in constant time. Consequently, the summary forms a graphical sketch where edges capture the connections inside elements, and nodes maintain relationships across elements. We discuss a wide range of supported queries and establish error bounds. In addition, we experimentally show that TCM can effectively and efficiently support analytics over graph streams, demonstrating its potential to start a new line of research and applications in graph stream management.
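
As a rough illustration of the sketch idea (not the authors' implementation), the snippet below compresses an edge stream into a few small hashed adjacency matrices and answers edge-frequency queries by taking the minimum across them; the matrix size and number of hash functions are arbitrary choices.

```python
# Minimal sketch of a TCM-style graph stream sketch: each of d independent
# hash functions maps nodes into a small m x m count matrix; an edge update
# increments one cell per matrix, and an edge-frequency query takes the
# minimum over the d matrices (count-min style). Parameters are illustrative.
import hashlib

class GraphSketch:
    def __init__(self, m=64, d=3):
        self.m, self.d = m, d
        self.mats = [[[0] * m for _ in range(m)] for _ in range(d)]

    def _h(self, node, seed):
        digest = hashlib.md5(f"{seed}:{node}".encode()).hexdigest()
        return int(digest, 16) % self.m

    def add_edge(self, u, v, w=1):
        # constant work per incoming edge: one cell update per hash function
        for k in range(self.d):
            self.mats[k][self._h(u, k)][self._h(v, k)] += w

    def edge_weight(self, u, v):
        # approximate (never under-estimating) aggregated weight of edge u -> v
        return min(self.mats[k][self._h(u, k)][self._h(v, k)] for k in range(self.d))

sketch = GraphSketch()
for u, v in [("a", "b"), ("a", "b"), ("b", "c")]:
    sketch.add_edge(u, v)
print(sketch.edge_weight("a", "b"))   # approximately 2
```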


International Conference on Document Analysis and Recognition | 2015

A hybrid approach to discover semantic hierarchical sections in scholarly documents

Suppawong Tuarob; Prasenjit Mitra; C. Lee Giles

Scholarly documents are usually composed of sections, each of which serves a different purpose by conveying a specific kind of content. The ability to automatically identify sections would allow us to understand what appears in different parts of a document, such as what was covered in the introduction, the methodologies used, the types of experiments, trends, etc. We propose a set of hybrid algorithms to 1) automatically identify section boundaries, 2) recognize standard sections, and 3) build a hierarchy of sections. Our algorithms achieve an F-measure of 92.38% on section boundary detection, an average accuracy of 96% on standard section recognition, and an accuracy of 95.51% on the section positioning task.
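
The rule-based side of such a hybrid pipeline might look like the following sketch, which flags header-like lines with a regular expression and nests them by their numbering depth; the patterns and the sample document are illustrative only and do not reproduce the paper's algorithms.

```python
# Sketch of rule-based section boundary detection: flag lines that look like
# numbered or standard headers, then nest them into a hierarchy by their
# numbering depth. The heading patterns and sample text are illustrative only.
import re

HEADER = re.compile(r"^(?:(\d+(?:\.\d+)*)\.?\s+)?"
                    r"(abstract|introduction|related work|methodology|methods|"
                    r"experiments|results|discussion|conclusion|references|[A-Z][\w\s]{2,60})$",
                    re.IGNORECASE)

def detect_sections(lines):
    sections = []          # list of (depth, title, start_line)
    for i, line in enumerate(lines):
        m = HEADER.match(line.strip())
        if m and len(line.split()) <= 8:          # headers tend to be short lines
            numbering = m.group(1)
            depth = numbering.count(".") + 1 if numbering else 1
            sections.append((depth, line.strip(), i))
    return sections

doc = ["1 Introduction", "We study ...", "2 Methodology",
       "2.1 Data Collection", "Tweets were ...", "3 Results"]
for depth, title, start in detect_sections(doc):
    print("  " * (depth - 1) + title)
```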


International Joint Conference on Natural Language Processing | 2015

WikiKreator: Improving Wikipedia Stubs Automatically

Siddhartha Banerjee; Prasenjit Mitra

Stubs on Wikipedia often lack comprehensive information. The huge cost of editing Wikipedia and the limited number of active contributors curb the consistent growth of Wikipedia. In this work, we present WikiKreator, a system that can automatically generate content to improve existing stubs on Wikipedia. The system has two components. First, a text classifier built using topic distribution vectors is used to assign content from the web to the various sections of a Wikipedia article. Second, we propose a novel abstractive summarization technique based on an optimization framework that generates section-specific summaries for Wikipedia stubs. Experiments show that WikiKreator is capable of generating well-formed, informative content. Further, automatically generated content from our system has been appended to Wikipedia stubs and successfully retained, demonstrating the effectiveness of our approach.
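
The first component, section assignment, can be sketched as a supervised text classifier trained on sections of existing comprehensive articles. WikiKreator uses topic-distribution vectors as features; the stand-in below swaps in TF-IDF features with a logistic regression classifier, and all section texts and headings are invented examples.

```python
# Sketch of the section-assignment step: train a text classifier on sections
# of existing, comprehensive Wikipedia articles, then route retrieved web
# snippets to section headings of a stub. Simplified stand-in using TF-IDF
# features instead of topic distributions. Data is illustrative.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

training_sections = [
    ("early life and education in a small town", "Early life"),
    ("studied physics at university and earned a doctorate", "Early life"),
    ("published influential papers on graph theory", "Career"),
    ("joined the faculty and led a research laboratory", "Career"),
    ("received several awards for scientific contributions", "Awards"),
    ("honored with a national medal of science", "Awards"),
]
texts, labels = zip(*training_sections)

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(texts, labels)

web_snippets = [
    "she was awarded the turing prize for her work",
    "he grew up in the countryside and attended a local school",
]
for snippet, section in zip(web_snippets, clf.predict(web_snippets)):
    print(f"{section!r:12} <- {snippet}")
```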


International World Wide Web Conferences | 2016

A Query-oriented Approach for Relevance in Citation Networks

Luam C. Totti; Prasenjit Mitra; Mourad Ouzzani; Mohammed Javeed Zaki

Finding a relevant set of publications for a given topic of interest is a challenging problem. We propose a two-stage, query-dependent approach for retrieving relevant papers given a keyword-based query. In the first stage, we use content similarity to select an initial seed set of publications; we then augment this set via citation links weighted with information such as citation-context relevance and age-based attenuation. In the second stage, we construct a multi-layer graph that expands the publication subgraph by including links to authors, venues, and keywords. This allows us to return recommendations that are both highly authoritative and textually related to the query. We show that our staged approach gives superior results on three different benchmark query sets.
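
A toy version of the two stages might look like this: seed scores come from word overlap with the query, and a fraction of each score is then pushed along citation links with an age-based decay. The graph, texts, and decay constants are invented for illustration and are not the paper's actual weighting scheme.

```python
# Sketch of the two-stage idea: (1) seed papers by keyword/content overlap with
# the query, (2) propagate relevance along citation links, attenuated by the
# age of the citing paper. Graph, texts, and constants are illustrative.
import math

papers = {
    "p1": ("graph stream summarization sketches", 2016),
    "p2": ("count-min sketch for data streams", 2005),
    "p3": ("social network analysis survey", 2010),
}
citations = [("p1", "p2"), ("p3", "p2")]   # (citing, cited)
query = "summarization of graph streams"
current_year = 2016

def overlap(a, b):
    a, b = set(a.split()), set(b.split())
    return len(a & b) / max(len(a | b), 1)

# Stage 1: content-similarity seed scores.
scores = {pid: overlap(query, text) for pid, (text, _) in papers.items()}

# Stage 2: push a fraction of each seed score to its cited papers,
# down-weighted by how old the citing paper is (age-based attenuation).
for citing, cited in citations:
    age = current_year - papers[citing][1]
    scores[cited] += 0.5 * scores[citing] * math.exp(-0.1 * age)

for pid in sorted(scores, key=scores.get, reverse=True):
    print(pid, round(scores[pid], 3))
```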


International World Wide Web Conferences | 2015

Abstractive Meeting Summarization Using Dependency Graph Fusion

Siddhartha Banerjee; Prasenjit Mitra; Kazunari Sugiyama

Automatic summarization techniques for meeting conversations developed so far have been primarily extractive, resulting in poor summaries. To improve on this, we propose an approach that generates abstractive summaries by fusing important content from several utterances. A meeting generally comprises several discussion topic segments. For each topic segment within a meeting conversation, we aim to generate a one-sentence summary from the most important utterances using an integer linear programming based sentence-fusion approach. Experimental results show that our method can generate more informative summaries than the baselines.
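
The fusion idea can be conveyed with a simpler word-graph variant (the paper fuses dependency graphs and selects content with an ILP): merge shared words across utterances into one graph and read the fused sentence off a low-cost path. The snippet assumes the networkx library, and the utterances are made up.

```python
# Sketch of graph-based sentence fusion: build a word graph over the utterances
# of one topic segment (shared words merged into one node), weight edges by how
# often a word transition occurs, and read the fused sentence off a cheap path.
# Simplified word-adjacency stand-in for dependency graph fusion.
from collections import Counter
import networkx as nx

utterances = [
    "we should finalize the budget today",
    "the team should finalize the budget report",
    "we need the budget report today",
]

bigrams = Counter()
for u in utterances:
    words = ["<s>"] + u.lower().split() + ["</s>"]
    bigrams.update(zip(words, words[1:]))

G = nx.DiGraph()
for (a, b), count in bigrams.items():
    G.add_edge(a, b, weight=1.0 / count)       # frequent transitions are cheap

# Take the cheapest path that is long enough to read as a sentence.
for path in nx.shortest_simple_paths(G, "<s>", "</s>", weight="weight"):
    words = path[1:-1]
    if len(words) >= 5:                        # crude length constraint
        print(" ".join(words))
        break
```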


Document Engineering | 2015

Filling the Gaps: Improving Wikipedia Stubs

Siddhartha Banerjee; Prasenjit Mitra

The limited number of contributors on Wikipedia cannot ensure the consistent growth and improvement of the online encyclopedia. With information scattered across the web, our goal is to automate the process of generating content for Wikipedia. In this work, we propose a technique for improving stubs on Wikipedia that do not contain comprehensive information. A classifier learns features from existing comprehensive articles on Wikipedia and recommends content that can be added to stubs to improve their completeness. We conduct experiments using several classifiers: a Latent Dirichlet Allocation (LDA) based model, a deep learning architecture (a deep belief network), and a TF-IDF based classifier. Our experiments reveal that the LDA-based model outperforms the other models (by ~6% F-score). Our generation approach shows that this technique is capable of producing comprehensive articles. The ROUGE-2 scores of articles generated by our system exceed those of articles generated using the baseline. Content generated by our system has been appended to several stubs and successfully retained in Wikipedia.
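
ROUGE-2, the evaluation score cited above, is at its core a bigram-overlap recall. A minimal sketch with illustrative texts follows; the official ROUGE toolkit adds stemming, multiple references, and F-scores.

```python
# Minimal sketch of ROUGE-2 recall: count the reference bigrams that also
# appear in the system output (with clipped counts) and divide by the total
# number of reference bigrams. Texts below are illustrative placeholders.
from collections import Counter

def bigrams(text):
    words = text.lower().split()
    return Counter(zip(words, words[1:]))

def rouge2_recall(system, reference):
    sys_bg, ref_bg = bigrams(system), bigrams(reference)
    overlap = sum(min(count, sys_bg[bg]) for bg, count in ref_bg.items())
    return overlap / max(sum(ref_bg.values()), 1)

reference = "the festival is held every summer in the old town"
system = "the festival is held each summer in the town center"
print(round(rouge2_recall(system, reference), 3))   # 5 of 9 reference bigrams
```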


AI Matters | 2015

WikiKreator: automatic authoring of Wikipedia content

Siddhartha Banerjee; Prasenjit Mitra

This article describes ongoing dissertation work on the automatic generation of Wikipedia articles. The goal of this work is to build an AI system to automatically summarize existing web content and utilize the resulting text to improve incomplete Wikipedia articles.


International World Wide Web Conferences | 2015

Deep Learning for the Web

Kyomin Jung; Byoung-Tak Zhang; Prasenjit Mitra

Deep learning is a machine learning technology that automatically extracts higher-level representations from raw data by stacking multiple layers of neuron-like units. The stacking allows for extracting representations of increasingly complex features without time-consuming, offline feature engineering. The recent success of deep learning has shown that it outperforms state-of-the-art systems in image processing, voice recognition, web search, recommendation systems, etc. [1]. Many industrial-scale big data processing systems, including IBM Watson's Jeopardy! system in 2011, Google Now, Facebook's face recognition system, and the voice recognition systems from Google and Microsoft, use deep learning [2][3][6]. Deep learning has huge potential to improve the intelligence of the web and of web service systems by efficiently and effectively mining big data on the Web [4][5]. This tutorial provides the basics of deep learning as well as its key applications. We give the motivation and underlying ideas of deep learning and describe the architectures and learning algorithms for various deep learning models. We also cover applications of deep learning for image and video processing, natural language and text data analysis, social data analytics, and wearable IoT sensor data, with an emphasis on the domain of Web systems. We deliver key insight into and understanding of these techniques, using graphical illustrations and examples that are important in analyzing large amounts of Web data. The tutorial is aimed at a general audience at the WWW Conference who are interested in machine learning and big data analysis for Web data.

The tutorial consists of five parts. The first part presents the basics of neural networks and their structures. We then explain the training algorithm via backpropagation, which is the common method of training artificial neural networks, including deep neural networks, and emphasize how each of these concepts can be used in various Web data analyses. In the second part, we describe the learning algorithms for deep neural networks and related ideas, such as contrastive divergence, wake-sleep algorithms, and Monte Carlo simulation. We then describe various kinds of deep architectures, including stacked autoencoders, deep belief networks [7], convolutional neural networks [8], and deep hypernetworks [9]. In the third part, we present more details of recursive neural networks, which can learn structured tree outputs as well as vector representations for phrases and sentences. We first show how training the recursive neural network can be achieved by a modified version of the backpropagation algorithm introduced earlier; these modifications allow the algorithm to work on tree structures. We then present its applications to sentence analysis, including POS tagging and sentiment analysis. The fourth part discusses the neural networks used to generate word embeddings, such as Word2Vec [10] and DSSM for deep semantic similarity [11], and networks for object detection in images [12], such as GoogLeNet and AlexNet. We explain in detail the applications of these deep learning techniques in the analysis of various social network data. By this point, the audience should have a clear understanding of how to build a deep learning system for word-, sentence-, and document-level tasks. The fifth part covers further application examples of deep learning, including object segmentation and action recognition from videos [9], web data analytics, and wearable/IoT sensor data modeling for smart services.
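
As a small illustration of the backpropagation basics covered in the first part, the following sketch trains a tiny two-layer network on XOR with NumPy; the data, network size, and learning rate are arbitrary choices, not material from the tutorial.

```python
# Tiny two-layer network trained with backpropagation: forward pass, loss
# gradient, chain rule back through each layer, gradient-descent update.
# XOR data and hyperparameters are illustrative choices only.
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)            # XOR targets

W1, b1 = rng.normal(scale=0.5, size=(2, 8)), np.zeros(8)   # hidden layer
W2, b2 = rng.normal(scale=0.5, size=(8, 1)), np.zeros(1)   # output layer
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
lr = 1.0

for step in range(5000):
    # forward pass
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # backward pass (squared-error loss), chain rule layer by layer
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    # gradient-descent updates
    W2 -= lr * h.T @ d_out
    b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h
    b1 -= lr * d_h.sum(axis=0)

print(np.round(out.ravel(), 2))    # should approach [0, 1, 1, 0]
```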

Collaboration


Dive into Prasenjit Mitra's collaborations.

Top Co-Authors

Siddhartha Banerjee, Pennsylvania State University
C. Lee Giles, Pennsylvania State University
Nan Tang, Qatar Computing Research Institute
Kazunari Sugiyama, National University of Singapore
Muhammad Imran, Qatar Computing Research Institute
Qing Chen, Qatar Computing Research Institute
Anuj R. Jaiswal, Pennsylvania State University