Barna Saha

University of Massachusetts Amherst

Publication

Featured research published by Barna Saha.

international colloquium on automata languages and programming | 2009

On Finding Dense Subgraphs

Samir Khuller; Barna Saha

Given an undirected graph G = (V, E), the density of a subgraph on vertex set S is defined as $d(S)=\frac{|E(S)|}{|S|}$, where E(S) is the set of edges in the subgraph induced by nodes in S. Finding subgraphs of maximum density is a very well studied problem. One can also generalize this notion to directed graphs. For a directed graph one notion of density given by Kannan and Vinay [12] is as follows: given subsets S and T of vertices, the density of the subgraph is $d(S,T)=\frac{|E(S,T)|}{\sqrt{|S||T|}}$, where E(S, T) is the set of edges going from S to T. Without any size constraints, a subgraph of maximum density can be found in polynomial time. When we require the subgraph to have a specified size, the problem of finding a maximum density subgraph becomes NP-hard. In this paper we focus on developing fast polynomial time algorithms for several variations of dense subgraph problems for both directed and undirected graphs. When there is no size bound, we extend the flow based technique for obtaining a densest subgraph in directed graphs and also give a linear time 2-approximation algorithm for it. When a size lower bound is specified for both directed and undirected cases, we show that the problem is NP-complete and give fast algorithms to find subgraphs within a factor 2 of the optimum density. We also show that solving the densest subgraph problem with an upper bound on size is as hard as solving the problem with an exact size constraint, within a constant factor.

research in computational molecular biology | 2010

Dense subgraphs with restrictions and applications to gene annotation graphs

Barna Saha; Allison Hoch; Samir Khuller; Louiqa Raschid; Xiao-Ning Zhang

In this paper, we focus on finding complex annotation patterns representing novel and interesting hypotheses from gene annotation data. We define a generalization of the densest subgraph problem by adding an additional distance restriction (defined by a separate metric) to the nodes of the subgraph. We show that while this generalization makes the problem NP-hard for arbitrary metrics, when the metric comes from the distance metric of a tree, or an interval graph, the problem can be solved optimally in polynomial time. We also show that the densest subgraph problem with a specified subset of vertices that have to be included in the solution can be solved optimally in polynomial time. In addition, we consider other extensions when not just one solution needs to be found, but we wish to list all subgraphs of almost maximum density as well. We apply this method to a dataset of genes and their annotations obtained from The Arabidopsis Information Resource (TAIR). A user evaluation confirms that the patterns found in the distance restricted densest subgraph for a dataset of photomorphogenesis genes are indeed validated in the literature; a control dataset validates that these are not random patterns. Interestingly, the complex annotation patterns potentially lead to new and as yet unknown hypotheses. We perform experiments to determine the properties of the dense subgraphs, as we vary parameters, including the number of genes and the distance.

international semantic web conference | 2011

Link prediction for annotation graphs using graph summarization

Andreas Thor; Philip Anderson; Louiqa Raschid; Saket Navlakha; Barna Saha; Samir Khuller; Xiao-Ning Zhang

Annotation graph datasets are a natural representation of scientific knowledge. They are common in the life sciences where genes or proteins are annotated with controlled vocabulary terms (CV terms) from ontologies. The W3C Linking Open Data (LOD) initiative and semantic Web technologies are playing a leading role in making such datasets widely available. Scientists can mine these datasets to discover patterns of annotation. While ontology alignment and integration across datasets has been explored in the context of the semantic Web, there is no current approach to mine such patterns in annotation graph datasets. In this paper, we propose a novel approach for link prediction; it is a preliminary task when discovering more complex patterns. Our prediction is based on a complementary methodology of graph summarization (GS) and dense subgraphs (DSG). GS can exploit and summarize knowledge captured within the ontologies and in the annotation patterns. DSG uses the ontology structure, in particular the distance between CV terms, to filter the graph, and to find promising subgraphs. We develop a scoring function based on multiple heuristics to rank the predictions. We perform an extensive evaluation on Arabidopsis thaliana genes.

international conference on data mining | 2006

Dynamic Algorithm for Graph Clustering Using Minimum Cut Tree

Barna Saha; Pabitra Mitra

In this paper we introduce a dynamic algorithm for clustering undirected graphs whose edge properties change continuously. The algorithm can maintain high-quality clusters efficiently in the presence of insertion and deletion (update) of edges. The algorithm is motivated by the minimum-cut tree based partitioning algorithm presented in G. W. Flake et al. (2004) and G. W. Flake (2000). It takes O(k^3) time to process each update, where k is the maximum size of any cluster. This is the worst case time complexity, and in general the time taken is much less. To the best of our knowledge, this is the first clustering algorithm for evolving graphs that provides a strong theoretical quality guarantee on clusters.

international conference on data engineering | 2010

Schema covering: a step towards enabling reuse in information integration

Barna Saha; Ioana Stanoi; Kenneth L. Clarkson

We introduce schema covering, the problem of identifying easily understandable common objects for describing large and complex schemas. Defining transformations between schemas is a key objective in information integration. However, this process often becomes cumbersome when the schemas are large and structurally complex. If such complex schemas can be broken into smaller and simpler objects, then simple transformations defined over these smaller objects can be reused to define suitable transformations among the complex schemas. Schema covering performs this vital task by identifying a collection of common concepts from a repository and creating a cover of the complex schema by these concepts. In this paper, we formulate the problem of schema covering, show that it is NP-complete, and give efficient approximation algorithms for it. A performance evaluation with real business schemas confirms the effectiveness of our approach.

very large data bases | 2016

Online entity resolution using an Oracle

Donatella Firmani; Barna Saha; Divesh Srivastava

Entity resolution (ER) is the task of identifying all records in a database that refer to the same underlying entity. This is an expensive task, and can take a significant amount of money and time; the end-user may want to take decisions during the process, rather than waiting for the task to be completed. We formalize an online version of the entity resolution task, and use an oracle which correctly labels matching and non-matching pairs through queries. In this setting, we design algorithms that seek to maximize progressive recall, and develop a novel analysis framework for prior proposals on entity resolution with an oracle, beyond their worst case guarantees. Finally, we provide both theoretical and experimental analysis of the proposed algorithms.

statistical and scientific database management | 2014

Distributed data placement to minimize communication costs via graph partitioning

Lukasz Golab; Marios Hadjieleftheriou; Howard J. Karloff; Barna Saha

very large data bases | 2013

On repairing structural problems in semi-structured data

Flip Korn; Barna Saha; Divesh Srivastava; Shanshan Ying

foundations of computer science | 2014

The Dyck Language Edit Distance Problem in Near-Linear Time

Barna Saha

foundations of computer science | 2015

Language Edit Distance and Maximum Likelihood Parsing of Stochastic Grammars: Faster Algorithms and Connection to Fundamental Graph Problems

Barna Saha
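As an illustration of the undirected density d(S) = |E(S)| / |S| defined in the abstract of "On Finding Dense Subgraphs", the following is a minimal Python sketch. It is not taken from any of the papers listed here. It uses the well-known greedy peeling heuristic (repeatedly delete a minimum-degree vertex and keep the densest intermediate subgraph), which is a 2-approximation for the undirected problem without size constraints. The function names `density` and `greedy_peel` are hypothetical, and the input is assumed to be a simple graph given as an edge list with mutually comparable vertex labels.

```python
import heapq
from collections import defaultdict


def density(edges, nodes):
    """Return d(S) = |E(S)| / |S| for the subgraph induced by `nodes`."""
    s = set(nodes)
    if not s:
        return 0.0
    induced = sum(1 for u, v in edges if u in s and v in s)
    return induced / len(s)


def greedy_peel(edges):
    """Greedy peeling: returns (vertex set, density) of the best subgraph seen."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops (assumption: simple graph)
            adj[u].add(v)
            adj[v].add(u)

    remaining = set(adj)
    if not remaining:
        return set(), 0.0

    degree = {u: len(adj[u]) for u in adj}
    m = sum(degree.values()) // 2  # edges in the current subgraph
    heap = [(d, u) for u, d in degree.items()]
    heapq.heapify(heap)

    order = []  # vertices in the order they are removed
    best_density, best_removed = m / len(remaining), 0

    while len(remaining) > 1:
        d, u = heapq.heappop(heap)
        if u not in remaining or d != degree[u]:
            continue  # stale heap entry
        remaining.remove(u)
        order.append(u)
        m -= degree[u]  # degree[u] counts only neighbours still present
        for w in adj[u]:
            if w in remaining:
                degree[w] -= 1
                heapq.heappush(heap, (degree[w], w))
        current = m / len(remaining)
        if current > best_density:
            best_density, best_removed = current, len(order)

    best_set = set(adj) - set(order[:best_removed])
    return best_set, best_density


if __name__ == "__main__":
    # A 4-clique on {0, 1, 2, 3} with one pendant vertex 4 attached to 3.
    k4_plus_tail = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4)]
    print(greedy_peel(k4_plus_tail))
```

With a binary heap this sketch runs in O((|V| + |E|) log |V|) time; a bucket queue keyed by degree would make the same procedure linear. The size-constrained and directed variants described in the abstract need different techniques and are not covered by this sketch.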
