
Publication


Featured research published by Andrey Balmin.


very large data bases | 2004

Objectrank: authority-based keyword search in databases

Andrey Balmin; Vagelis Hristidis; Yannis Papakonstantinou

The ObjectRank system applies authority-based ranking to keyword search in databases modeled as labeled graphs. Conceptually, authority originates at the nodes (objects) containing the keywords and flows to objects according to their semantic connections. Each node is ranked according to its authority with respect to the particular keywords. One can adjust the weight of global importance, the weight of each keyword of the query, the importance of a result actually containing the keywords versus being referenced by nodes containing them, and the volume of authority flow via each type of semantic connection. Novel performance challenges and opportunities are addressed. First, schemas impose constraints on the graph, which are exploited for performance purposes. Second, in order to address the issue of authority ranking with respect to the given keywords (as opposed to Google's global PageRank), we precompute single-keyword ObjectRanks and combine them during run time. We conducted user surveys and a set of performance experiments on multiple real and synthetic datasets, to assess the semantic meaningfulness and performance of ObjectRank.
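The authority-flow idea in the abstract can be sketched as a PageRank-style iteration in which the random surfer restarts only at the nodes containing the keyword (the base set) and each edge type carries its own flow weight. This is a minimal illustration, not the paper's implementation; the function name, graph encoding, and default parameters are assumptions.

```python
def objectrank(nodes, edges, base_set, edge_weight, d=0.85, iters=50):
    """Toy single-keyword authority-flow ranking.

    nodes: list of node ids; edges: list of (src, dst, edge_type);
    base_set: nodes containing the keyword; edge_weight: flow weight
    per edge type. Restart mass is spread over the base set only.
    """
    rank = {n: 0.0 for n in nodes}
    base = 1.0 / len(base_set)
    out = {}
    for (u, v, t) in edges:
        out.setdefault(u, []).append((v, edge_weight[t]))
    for _ in range(iters):
        # restart: only base-set nodes receive the (1 - d) teleport mass
        new = {n: (1 - d) * base if n in base_set else 0.0 for n in nodes}
        # flow: each node splits its authority over outgoing edges
        # in proportion to the edge-type weights
        for u, targets in out.items():
            total_w = sum(w for _, w in targets)
            for v, w in targets:
                new[v] += d * rank[u] * (w / total_w)
        rank = new
    return rank
```

In the full system one such vector is precomputed per keyword and the vectors are combined at query time; here a multi-keyword score could be obtained by, e.g., summing or multiplying the per-keyword ranks.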


international conference on data engineering | 2003

Keyword proximity search on XML graphs

Vagelis Hristidis; Yannis Papakonstantinou; Andrey Balmin

XKeyword provides efficient keyword proximity queries on large XML graph databases. A query is simply a list of keywords and does not require any schema or query language knowledge for its formulation. XKeyword is built on a relational database and, hence, can accommodate very large graphs. Query evaluation is optimized by using the graph's schema. In particular, XKeyword consists of two stages. In the preprocessing stage a set of keyword indices are built along with indexed path relations that describe particular patterns of paths in the graph. In the query processing stage plans are developed that use a near optimal set of path relations to efficiently locate the keyword query results. The results are presented graphically using the novel idea of interactive result graphs, which are populated on demand according to the user's navigation and allow efficient information discovery. We provide theoretical and experimental guidelines for the selection of the appropriate set of precomputed path relations. We also propose and experimentally evaluate algorithms to minimize the number of queries sent to the database to output the top-K results.


very large data bases | 2004

A framework for using materialized XPath views in XML query processing

Andrey Balmin; Fatma Ozcan; Kevin S. Beyer; Roberta Jo Cochrane; Hamid Pirahesh

XML languages, such as XQuery, XSLT and SQL/XML, employ XPath as the search and extraction language. XPath expressions often define complicated navigation, resulting in expensive query processing, especially when executed over large collections of documents. In this paper, we propose a framework for exploiting materialized XPath views to expedite processing of XML queries. We explore a class of materialized XPath views, which may contain XML fragments, typed data values, full paths, node references or any combination thereof. We develop an XPath matching algorithm to determine when such views can be used to answer a user query containing XPath expressions. We use the match information to identify the portion of an XPath expression in the user query which is not covered by the XPath view. Finally, we construct one or more compensation expressions, which are applied to the view to produce the query result. Experimental evaluation, using our prototype implementation, shows that the matching algorithm is very efficient and usually accounts for a small fraction of the total query compilation time.
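The match-then-compensate idea can be illustrated on a drastically simplified XPath model: paths as lists of child steps, with no predicates or descendant axes. If the view path is a prefix of the query path, the leftover steps form the compensation expression to run over the view's result. This is a sketch under those assumptions, far weaker than the paper's matcher.

```python
def match_view(query_steps, view_steps):
    """Decide whether a materialized view (a child-step path) can answer
    a query path, and if so return the compensation steps to apply to
    the view's materialized result. Returns None when the view does not
    cover a prefix of the query."""
    if query_steps[:len(view_steps)] != view_steps:
        return None  # the view's path diverges from the query's path
    return query_steps[len(view_steps):]  # remaining navigation = compensation
```

For example, a view on `dept/emp` answers the query `dept/emp/name` with the compensation expression `name` evaluated over the view's stored fragments.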


very large data bases | 2005

Storing and querying XML data using denormalized relational databases

Andrey Balmin; Yannis Papakonstantinou

XML database systems emerge as a result of the acceptance of the XML data model. Recent works have followed the promising approach of building XML database management systems on underlying RDBMSs. Achieving good query processing performance reduces to two questions: (i) How should the XML data be decomposed into data that are stored in the RDBMS? (ii) How should the XML query be translated into an efficient plan that sends one or more SQL queries to the underlying RDBMS and combines the data into the XML result? We provide a formal framework for XML Schema-driven decompositions, which encompasses the decompositions proposed in prior work and extends them with decompositions that employ denormalized tables and binary-coded XML fragments. We provide corresponding query processing algorithms that translate the XML query conditions into conditions on the relational tables and assemble the decomposed data into the XML query result. Our key performance focus is the response time for delivering the first results of a query. The most effective of the described decompositions have been implemented in XCacheDB, an XML DBMS built on top of a commercial RDBMS, which serves as our experimental basis. We present experiments and analysis that point to a class of decompositions, called inlined decompositions, that improve query performance for full results and first results, without significant increase in the size of the database.
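The translation step described above can be caricatured for the inlined case: once an XML path has been inlined into a column of a denormalized table, a path equality condition becomes a plain SQL predicate on that column. The table name, mapping structure, and function below are all hypothetical, and real translation must also handle joins across fragments and result assembly.

```python
def xpath_condition_to_sql(table, path_columns, path, value):
    """Translate one XML path equality condition into a SQL query over a
    denormalized (inlined) table. `path_columns` maps XML paths to the
    relational columns they were inlined into."""
    col = path_columns[path]  # the column holding this inlined path's values
    # parameterized queries would be used in practice; inline for clarity
    return f"SELECT * FROM {table} WHERE {col} = '{value}'"
```

A condition on `/book/title` over a table that inlines that path thus avoids any structural join, which is the source of the first-result speedups the abstract mentions.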


IBM Systems Journal | 2006

Cost-based optimization in DB2 XML

Andrey Balmin; Tom Eliaz; John F. Hornibrook; Lipyeow Lim; Guy M. Lohman; David E. Simmen; Min Wang; Chun Zhang

DB2 XML is a hybrid database system that combines the relational capabilities of DB2 Universal Database™ (UDB) with comprehensive native XML support. DB2 XML augments DB2® UDB with a native XML store, XML indexes, and query processing capabilities for both XQuery and SQL/XML that are integrated with those of SQL. This paper presents the extensions made to the DB2 UDB compiler, and especially its cost-based query optimizer, to support XQuery and SQL/XML queries, using much of the same infrastructure developed for relational data queried by SQL. It describes the challenges to the relational infrastructure that supporting XQuery and SQL/XML poses and provides the rationale for the extensions that were made to the three main parts of the optimizer: the plan operators, the cardinality and cost model, and statistics collection.


acm ifip usenix international conference on middleware | 2010

FLEX: a slot allocation scheduling optimizer for MapReduce workloads

Joel L. Wolf; Deepak Rajan; Kirsten Hildrum; Rohit Khandekar; Vibhore Kumar; Sujay Parekh; Kun-Lung Wu; Andrey Balmin

Originally, MapReduce implementations such as Hadoop employed First In First Out (fifo) scheduling, but such simple schemes cause job starvation. The Hadoop Fair Scheduler (hfs) is a slot-based MapReduce scheme designed to ensure a degree of fairness among the jobs, by guaranteeing each job at least some minimum number of allocated slots. Our prime contribution in this paper is a different, flexible scheduling allocation scheme, known as flex. Our goal is to optimize any of a variety of standard scheduling theory metrics (response time, stretch, makespan and Service Level Agreements (slas), among others) while ensuring the same minimum job slot guarantees as in hfs, and maximum job slot guarantees as well. The flex allocation scheduler can be regarded as an add-on module that works synergistically with hfs. We describe the mathematical basis for flex, and compare it with fifo and hfs in a variety of experiments.
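The allocation scheme described above can be sketched as a two-phase allocator: first satisfy every job's minimum slot guarantee, then distribute the remaining slots greedily toward some metric. The sketch below uses a shortest-remaining-work heuristic (a reasonable proxy for mean response time); FLEX itself optimizes several metrics with a more principled scheme, and the interface here is invented for illustration.

```python
def allocate_slots(jobs, total_slots):
    """Toy FLEX-style slot allocation.

    jobs: dict name -> (min_slots, max_slots, remaining_work).
    Phase 1 grants each job its hfs-style minimum; phase 2 gives the
    leftover slots to the jobs with the least remaining work, respecting
    each job's maximum slot guarantee."""
    alloc = {name: spec[0] for name, spec in jobs.items()}
    free = total_slots - sum(alloc.values())
    assert free >= 0, "minimum guarantees exceed cluster capacity"
    # smallest remaining work first, filling each job up to its maximum
    for name, (lo, hi, work) in sorted(jobs.items(), key=lambda kv: kv[1][2]):
        take = min(hi - alloc[name], free)
        alloc[name] += take
        free -= take
    return alloc
```

With 5 slots and two jobs guaranteed 1 slot each, the short job is boosted to its maximum while the long job keeps only its guarantee, which is exactly the kind of metric-driven reallocation the paper layers on top of hfs's fairness floor.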


web search and data mining | 2014

Scalable topic-specific influence analysis on microblogs

Bin Bi; Yuanyuan Tian; Yannis Sismanis; Andrey Balmin; Junghoo Cho

Social influence analysis on microblog networks, such as Twitter, has been playing a crucial role in online advertising and brand management. While most previous influence analysis schemes rely only on the links between users to find key influencers, they omit the important text content created by the users. As a result, there is no way to differentiate the social influence in different aspects of life (topics). Although a few prior works do support topic-specific influence analysis, they either separate the analysis of content from the analysis of network structure, or assume that content is the only cause of links, which is clearly an inappropriate assumption for microblog networks. To address the limitations of the previous approaches, we propose a novel Followship-LDA (FLDA) model, which integrates both content topic discovery and social influence analysis in the same generative process. This model properly captures the content-related and content-independent reasons why a user follows another in a microblog network. We demonstrate that FLDA produces results with significantly better precision than existing approaches. Furthermore, we propose a distributed Gibbs sampling algorithm for FLDA, and demonstrate that it provides excellent scalability on large clusters. Finally, we incorporate the FLDA model in a general search framework for topic-specific influencers. A user freely expresses his/her interest by typing a few keywords, and the search framework returns a ranked list of key influencers that satisfy the user's interest.


very large data bases | 2003

A system for keyword proximity search on XML databases

Andrey Balmin; Vagelis Hristidis; Nick Koudas; Yannis Papakonstantinou; Divesh Srivastava; Tianqiu Wang

This chapter discusses keyword proximity search on XML databases. Keyword proximity search is a user-friendly information discovery technique that has been extensively studied for text documents. In extending this technique to structured databases, recent works provide keyword proximity search on labeled graphs. A keyword proximity search does not require the user to know the structure of the graph, the role of the objects containing the keywords, or the type of the connections between the objects. The user simply submits a list of keywords and the system returns the sub-graphs that connect the objects containing the keywords. XML and its labeled graph/tree abstractions are becoming the data model of choice for representing semistructured, self-describing data, and keyword proximity search is well-suited to XML documents as well.


very large data bases | 2013

PREDIcT: towards predicting the runtime of large scale iterative analytics

Adrian Popescu; Andrey Balmin; Vuk Ercegovac; Anastasia Ailamaki

Machine learning algorithms are widely used today for analytical tasks such as data cleaning, data categorization, or data filtering. At the same time, the rise of social media motivates recent uptake in large scale graph processing. Both categories of algorithms are dominated by iterative subtasks, i.e., processing steps which are executed repetitively until a convergence condition is met. Optimizing cluster resource allocations among multiple workloads of iterative algorithms motivates the need for estimating their runtime, which in turn requires: i) predicting the number of iterations, and ii) predicting the processing time of each iteration. As both parameters depend on the characteristics of the dataset and on the convergence function, estimating their values before execution is difficult. This paper proposes PREDIcT, an experimental methodology for predicting the runtime of iterative algorithms. PREDIcT uses sample runs for capturing the algorithm's convergence trend and per-iteration key input features that are well correlated with the actual processing requirements of the complete input dataset. Using this combination of characteristics we predict the runtime of iterative algorithms, including algorithms with very different runtime patterns among subsequent iterations. Our experimental evaluation of multiple algorithms on scale-free graphs shows a relative prediction error of 10%-30% for predicting runtime, including algorithms with up to 100× runtime variability among consecutive iterations.
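The two prediction subproblems in the abstract (number of iterations, cost per iteration) can be illustrated with a deliberately simple extrapolation: estimate a geometric decay rate of the convergence residual from a short sample run, solve for the iteration count that reaches the convergence threshold, and scale by the observed per-iteration time. PREDIcT itself additionally scales per-iteration cost by input features; this sketch assumes a constant per-iteration time and geometric convergence, and all names are invented.

```python
import math

def predict_runtime(sample_residuals, sample_iter_time, threshold):
    """Predict (iterations, total runtime) for an iterative algorithm
    from a sample run's convergence residuals, assuming geometric decay
    residual_k = residual_0 * r**k."""
    # average decay ratio between consecutive observed residuals
    ratios = [b / a for a, b in zip(sample_residuals, sample_residuals[1:])]
    r = sum(ratios) / len(ratios)
    # solve residual_0 * r**k <= threshold for the iteration count k
    k = math.ceil(math.log(threshold / sample_residuals[0]) / math.log(r))
    return k, k * sample_iter_time
```

Algorithms whose per-iteration cost varies wildly (the 100× variability the abstract mentions) are exactly where this constant-time assumption breaks down and per-iteration input features become necessary.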


international conference on management of data | 2011

Emerging trends in the enterprise data analytics: connecting Hadoop and DB2 warehouse

Fatma Ozcan; David Hoa; Kevin S. Beyer; Andrey Balmin; Chuan Jie Liu; Yu Li

Enterprises are dealing with ever increasing volumes of data, reaching into the petabyte scale. In many of our customer engagements, we observe an emerging trend: customers are using Hadoop-based solutions in conjunction with their data warehouses. They use Hadoop to deal with the data volume, as well as the lack of strict structure in their data, to conduct various analyses, including but not limited to Web log analysis, sophisticated data mining, machine learning, and model building. This first stage of the analysis is off-line and suitable for Hadoop. But once their data is summarized or cleansed enough, and their models are built, they load the results into a warehouse for interactive querying and report generation. At this later stage, they leverage the wealth of business intelligence tools for warehouses that they are accustomed to. In this paper, we outline this use case and discuss the bidirectional connectors we developed between IBM DB2 and IBM InfoSphere BigInsights.

Collaboration


Top co-authors of Andrey Balmin.

Deepak Rajan (Lawrence Livermore National Laboratory)

Heasoo Hwang (University of California)