Tanu Malik | Researchain

Publication

Featured research published by Tanu Malik.

international conference on management of data | 2002

The SDSS skyserver: public access to the sloan digital sky server data

Alexander S. Szalay; Jim Gray; Ani Thakar; Peter Z. Kunszt; Tanu Malik; Jordan Raddick; Christopher Stoughton; Jan Vandenberg

The SkyServer provides Internet access to the public Sloan Digital Sky Survey (SDSS) data, both for astronomers and for science education. This paper describes the SkyServer goals and architecture. It also describes our experience operating the SkyServer on the Internet. The SDSS data is public and well documented, so it makes a good test platform for research on database algorithms and performance.

international conference on cluster computing | 2013

Distributed data provenance for large-scale data-intensive computing

Dongfang Zhao; Chen Shou; Tanu Malik; Ioan Raicu

It has become increasingly important to capture and understand the origins and derivation of data (its provenance). A key issue in evaluating the feasibility of data provenance is its performance, overheads, and scalability. In this paper, we explore the feasibility of a general metadata storage and management layer for parallel file systems, in which metadata includes both file operations and provenance metadata. We experimentally investigate the design optimality: whether provenance metadata should be loosely coupled or tightly integrated with a file metadata storage system. We consider two systems that have applied similar distributed concepts to metadata management, each focusing on a single kind of metadata: (i) FusionFS, which implements distributed file metadata management based on distributed hash tables, and (ii) SPADE, which uses a graph database to store audited provenance data and provides a distributed module for querying provenance. Our results on a 32-node cluster show that FusionFS+SPADE is a promising prototype with negligible provenance overhead and has promise to scale to petascale and beyond. Furthermore, FusionFS with its own storage layer for provenance capture is able to scale up to 1K nodes on a BlueGene/P supercomputer.

international conference on data engineering | 2005

Bypass caching: making scientific databases good network citizens

Tanu Malik; Randal C. Burns; Amitabh Chaudhary

Scientific database federations are geographically distributed and network bound. Thus, they could benefit from proxy caching. However, existing caching techniques are not suitable for their workloads, which compare and join large data sets. Existing techniques reduce parallelism by conducting distributed queries in a single cache and lose the data reduction benefits of performing selections at each database. We develop the bypass-yield formulation of caching, which reduces network traffic in wide-area database federations, while preserving parallelism and data reduction. Bypass-yield caching is altruistic; caches minimize the overall network traffic generated by the federation, rather than focusing on local performance. We present an adaptive, workload-driven algorithm for managing a bypass-yield cache. We also develop on-line algorithms that make no assumptions about workload: a k-competitive deterministic algorithm and a randomized algorithm with minimal space complexity. We verify the efficacy of bypass-yield caching by running workload traces collected from the Sloan Digital Sky Survey through a prototype implementation.
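The bypass-or-load decision described in this abstract can be illustrated with a minimal sketch. It uses a ski-rental-style rule, which is an assumption made here for illustration and not necessarily the algorithm in the paper: an object is loaded into the cache only once the result bytes shipped on its behalf while bypassing reach the size of the object itself. The class name `BypassYieldSketch`, its fields, and the `query` parameters are all hypothetical, and cache capacity and eviction are deliberately left out.

```python
class BypassYieldSketch:
    """Illustrative only: a ski-rental-style bypass/load rule (assumed, not from the paper)."""

    def __init__(self):
        self.cached = set()        # objects currently held in the cache
        self.accrued_yield = {}    # result bytes shipped per object while bypassing
        self.network_bytes = 0     # total wide-area traffic generated so far

    def query(self, obj, object_size, result_size):
        """Serve one query against `obj`; return which action the cache took."""
        if obj in self.cached:
            # Served locally: no wide-area traffic.
            return "hit"

        # Bypass: the server evaluates the query and ships only the result.
        self.network_bytes += result_size
        self.accrued_yield[obj] = self.accrued_yield.get(obj, 0) + result_size

        # Load the object once bypassing has cost as much as loading would.
        if self.accrued_yield[obj] >= object_size:
            self.network_bytes += object_size
            self.cached.add(obj)
            return "bypass+load"
        return "bypass"


cache = BypassYieldSketch()
actions = [cache.query("photo_table", object_size=100, result_size=40) for _ in range(4)]
print(actions, cache.network_bytes)
```

Under this rule, a large table that only ever yields small results stays at the server, which keeps the data reduction benefit of running selections there.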

international conference on data engineering | 2008

Automated physical design in database caches

Tanu Malik; Xiaodan Wang; Randal C. Burns; Debabrata Dash; Anastasia Ailamaki

The performance of proxy caches for database federations that serve a large number of users depends crucially on their physical design. Current techniques for physical design, automated or otherwise, depend on the identification of a representative workload. In proxy caches, however, such techniques are inadequate because workload characteristics change rapidly. This is clearly shown at the proxy cache of SkyQuery, an astronomy federation, which receives a continuously evolving workload. We present novel techniques for automated physical design that adapt to the workload and balance the performance benefits of physical design decisions against the cost of implementing these decisions. These include both competitive and incremental algorithms that optimize the combined cost of query evaluation and making physical design changes. Our techniques are general in that they make no assumptions about either the underlying schema or the incoming workload. Preliminary experiments on the TPC-D benchmark demonstrate significant improvement in response time when the physical design continually adapts to the workload using our online algorithm, compared with offline techniques.

international provenance and annotation workshop | 2012

SOLE: linking research papers with science objects

Quan Pham; Tanu Malik; Ian T. Foster; Roberto Di Lauro; Raffaele Montella

We introduce Science Object Linking and Embedding (SOLE), a tool for linking research papers with associated science objects, such as source code, datasets, annotations, workflows, packages, and virtual machine images. The objective of SOLE is to reduce the cost to an author of linking research papers with such science objects for the purpose of reproducible research. To this end, SOLE allows an author to use simple tags to delimit a science object to be associated with a research paper. It creates an adequate representation of the science object and manages a bibliography-like specification of science objects. Authors and readers can reference elements of this bibliography and associate them with phrases in the text of the research paper through a Web interface, in a manner similar to a traditional bibliography tool.

passive and active network measurement | 2005

Practical passive lossy link inference

Alexandros Batsakis; Tanu Malik; Andreas Terzis

We propose a practical technique for the identification of lossy network links. Our scheme is based on a function that computes the likelihood that each link is lossy. This function mainly depends on the number of times a link appears in lossy paths and on the relative loss rates of these paths. Preliminary simulation results show that our solution achieves accuracy comparable to statistical methods (e.g., Bayesian) at significantly lower running time.
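The abstract does not give the scoring function itself, so the following is only a sketch of the general idea under stated assumptions. Each link is scored by how often it appears in lossy paths, weighted by the loss rate of each such path relative to the worst path observed, and normalized by the number of paths the link appears in. The function name `rank_lossy_links`, the `loss_threshold` parameter, and the exact formula are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def rank_lossy_links(paths, loss_threshold=0.05):
    """Rank links by an illustrative lossiness score (assumed formula, not the paper's).

    `paths` is a list of (links, loss_rate) pairs, where `links` is the
    sequence of link identifiers on an end-to-end path.
    """
    max_loss = max((loss for _, loss in paths), default=0.0)
    appearances = defaultdict(int)     # paths containing each link
    lossy_weight = defaultdict(float)  # summed relative loss of lossy paths per link

    for links, loss in paths:
        for link in set(links):
            appearances[link] += 1
            if loss >= loss_threshold and max_loss > 0:
                lossy_weight[link] += loss / max_loss

    # Links that also sit on many loss-free paths are scored lower.
    scores = {link: lossy_weight[link] / total for link, total in appearances.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


observed = [
    (["a", "b"], 0.2),
    (["b", "c"], 0.2),
    (["a", "d"], 0.0),
    (["c", "e"], 0.0),
]
print(rank_lossy_links(observed)[0])
```

In this toy input, link `b` is the only link shared by both lossy paths and absent from every loss-free path, so it ranks first.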

database systems for advanced applications | 2007

A workload-driven unit of cache replacement for mid-tier database caching

Xiaodan Wang; Tanu Malik; Randal C. Burns; Stratos Papadomanolakis; Anastassia Ailamaki

Making multi-terabyte scientific databases publicly accessible over the Internet is increasingly important in disciplines such as biology and astronomy. However, contention at a centralized backend database is a major performance bottleneck, limiting the scalability of Internet-based database applications. Mid-tier caching reduces contention at the backend database by distributing database operations to the cache. To improve the performance of mid-tier caches, we propose the caching of query prototypes, a workload-driven unit of cache replacement in which the cache object is chosen from various classes of queries in the workload. In existing mid-tier caching systems, the storage organization in the cache is statically defined. Our approach adapts cache storage to workload changes, requires no prior knowledge about the workload, and is transparent to the application. Experiments over a one-month, 1.4 million query astronomy workload demonstrate up to a 70% reduction in network traffic and a reduction in query response time by up to a factor of three when compared with alternative units of cache replacement.

international conference on e-science | 2010

Tracking and Sketching Distributed Data Provenance

Tanu Malik; Ligia Nistor; Ashish Gehani

Current provenance collection systems typically gather metadata on remote hosts and submit it to a central server. In contrast, several data-intensive scientific applications require a decentralized architecture in which each host maintains an authoritative local repository of the provenance metadata gathered on that host. The latter approach allows the system to handle the large amounts of metadata generated when auditing occurs at fine granularity, and allows users to retain control over their provenance records. The decentralized architecture, however, increases the complexity of auditing, tracking, and querying distributed provenance. We describe a system for capturing data provenance in distributed applications, and the use of provenance sketches to optimize subsequent data provenance queries. Experiments with data gathered from distributed workflow applications demonstrate the feasibility of a decentralized provenance management system and improvements in the efficiency of provenance queries.

Archive | 2013

Sketching Distributed Data Provenance

Tanu Malik; Ashish Gehani; Dawood Tariq; Fareed Zaffar

Users can determine the precise origins of their data by collecting detailed provenance records. However, auditing at a finer grain produces large amounts of metadata. To efficiently manage the collected provenance, several provenance management systems, including SPADE, record provenance on the hosts where it is generated. Distributed provenance raises the issue of efficient reconstruction during the query phase. Recursively querying provenance metadata or computing its transitive closure is known to have limited scalability and cannot be used for large provenance graphs. We present matrix filters, which are novel data structures for representing graph information, and demonstrate their utility for improving query efficiency with experiments on provenance metadata gathered while executing distributed workflow applications.

Journal of Computational Science | 2015

An invariant framework for conducting reproducible computational science

Haiyan Meng; Rupa Kommineni; Quan Pham; Robert Gardner; Tanu Malik; Douglas Thain

Computational reproducibility depends on the ability not only to isolate necessary and sufficient computational artifacts but also to preserve those artifacts for later re-execution. Both isolation and preservation present challenges, in large part due to the complexity of existing software and systems as well as the implicit dependencies, resource distribution, and shifting compatibility of systems that result over time, all of which conspire to break the reproducibility of an application. Sandboxing is a technique that has been used extensively in OS environments to isolate computational artifacts. Several tools were proposed recently that employ sandboxing as a mechanism to ensure reproducibility. However, none of these tools preserve the sandboxed application for redistribution to a larger scientific community, an aspect that is as crucial for ensuring reproducibility as sandboxing itself. In this paper, we describe a framework of combined sandboxing and preservation, which is not only efficient and invariant, but also practical for large-scale reproducibility. We present case studies of complex high-energy physics applications and show how the framework can be useful for sandboxing, preserving, and distributing applications. We report on the completeness, performance, and efficiency of the framework, and suggest possible standardization approaches.