Publications
Featured research published by Cornel Constantinescu.
Data Compression Conference | 2011
Cornel Constantinescu; Joseph S. Glider; David D. Chambliss
Many new storage systems provide some form of data reduction. We examine data reduction methods that might be suitable for primary storage systems serving active data (as contrasted with backup and archive systems), by analyzing file sets found in different active-data environments. We address the following questions: how effective are compression and variations of deduplication, both separately and in combination; when deduplication and compression are combined, which should be applied first; what is the tradeoff between the different methods in their use of MIPS relative to the data reduction achieved; and what degree of data reduction should be expected for different data types.
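One way to make the ordering question concrete is to measure stored bytes under both orders on the same chunk stream. The sketch below is illustrative only, not the paper's methodology; it assumes pre-chunked input and per-chunk zlib compression.

```python
import hashlib
import zlib

def dedup_then_compress(chunks):
    """Deduplicate on raw chunk content, then compress only the unique chunks."""
    seen, stored = set(), 0
    for chunk in chunks:
        digest = hashlib.sha1(chunk).digest()
        if digest not in seen:
            seen.add(digest)
            stored += len(zlib.compress(chunk))
    return stored

def compress_then_dedup(chunks):
    """Compress every chunk first, then deduplicate on the compressed bytes.
    Small input differences usually change the compressed output, so duplicate
    detection tends to suffer in this order."""
    seen, stored = set(), 0
    for chunk in chunks:
        packed = zlib.compress(chunk)
        digest = hashlib.sha1(packed).digest()
        if digest not in seen:
            seen.add(digest)
            stored += len(packed)
    return stored
```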
ACM International Conference on Systems and Storage | 2012
Maohua Lu; David D. Chambliss; Joseph S. Glider; Cornel Constantinescu
There has been increasing interest in deploying data reduction techniques in primary storage systems. This paper analyzes large datasets in four typical enterprise data environments to find patterns that can suggest good design choices for such systems. The overall data reduction opportunity is evaluated for deduplication and compression, separately and combined, then in-depth analysis is presented focusing on frequency, clustering and other patterns in the collected data. The results suggest ways to enhance performance and reduce resource requirements and system cost while maintaining data reduction effectiveness. These techniques include deciding which files to compress based on file type and size, using duplication affinity to guide deployment decisions, and optimizing the detection and mapping of duplicate content adaptively when large segments account for most of the opportunity.
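The "decide which files to compress based on file type and size" design choice can be sketched as a simple policy check. The extension list and size threshold below are assumptions for illustration, not values taken from the paper.

```python
import os

# Illustrative policy only; thresholds and extension list are assumed.
ALREADY_COMPRESSED = {".jpg", ".jpeg", ".png", ".mp3", ".mp4", ".zip", ".gz"}
MIN_COMPRESS_SIZE = 4 * 1024  # skip tiny files where per-file overhead dominates

def should_compress(path, size_bytes):
    """Decide whether a file is worth spending compression MIPS on,
    based on its type and size."""
    ext = os.path.splitext(path)[1].lower()
    return size_bytes >= MIN_COMPRESS_SIZE and ext not in ALREADY_COMPRESSED
```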
Data Compression Conference | 2008
Subashini Balachandran; Cornel Constantinescu
Data de-duplication is a simple compression method, popular in storage archival and backup, that consists of partitioning large data objects (files) into smaller parts (named chunks) and replacing the chunks, for the purpose of communication or storage, with their IDs, generally a cryptographic hash such as SHA-1 of the chunk data [A. Muthitacharoen et al., 2001], [D.R. Bobbarjung et al., 2006]. The compression ratio achieved by de-duplication can be improved by (1) increasing the likelihood of matching the new chunks against the dictionary (archived) chunks and/or (2) compressing the list of hashes (indexes of 20 bytes each). Using smaller chunk sizes increases the chance of matching, but many more hashes will be generated. The chunk repository is a hash table where each entry stores the SHA-1 value of the chunk and the chunk data. In addition, with each newly created entry we store a chronological pointer linking it to the next new entry. When the hashes produced by the chunker follow the chronological pointers, we encode them as a sequence of hashes by specifying the first hash in the sequence and the length of the sequence; when the same hash is generated repeatedly, we encode it as a run of hashes by specifying its value and the number of repeated occurrences. The usefulness of the chronological pointers derives from the insight that, when archiving successive versions of a file or set of files, large contiguous areas remain unchanged between versions, and the chronological pointers are predictors of this contiguity. If the contiguity is broken, there is a small loss in the hash sequence compression.
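A minimal sketch of this scheme, assuming pre-chunked input; the function names and the tuple encoding are illustrative, not the paper's on-disk format.

```python
import hashlib

def store_chunks(chunks, table, order):
    """De-duplicate chunks by SHA-1. 'order' records the chronological chain:
    each newly created entry is implicitly linked to the next new entry."""
    refs = []
    for chunk in chunks:
        digest = hashlib.sha1(chunk).digest()
        if digest not in table:
            table[digest] = chunk
            order.append(digest)
        refs.append(digest)
    return refs

def encode_refs(refs, order):
    """Encode a file's hash list: a stretch of refs that follows the
    chronological chain collapses to a sequence (first hash, length);
    a repeated hash collapses to a run (hash, repeat count)."""
    position = {h: i for i, h in enumerate(order)}
    encoded, i = [], 0
    while i < len(refs):
        j = i
        if i + 1 < len(refs) and refs[i + 1] == refs[i]:
            while j + 1 < len(refs) and refs[j + 1] == refs[j]:
                j += 1
            encoded.append(("run", refs[i], j - i + 1))
        else:
            while (j + 1 < len(refs)
                   and position[refs[j + 1]] == position[refs[j]] + 1):
                j += 1
            encoded.append(("seq", refs[i], j - i + 1))
        i = j + 1
    return encoded
```

With store_chunks applied to successive versions of a file, unchanged regions produce refs that walk the chronological chain, so encode_refs collapses each such region to a single (first hash, length) pair.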
IEEE Conference on Mass Storage Systems and Technologies | 2014
Abdullah Gharaibeh; Cornel Constantinescu; Maohua Lu; Ramani R. Routray; Anurag Sharma; Prasenjit Sarkar; David Pease; Matei Ripeanu
Deduplication is a commonly used technique on disk-based storage pools. However, deduplication has not been used for tape-based pools: tape characteristics such as high mount and seek times, combined with data fragmentation resulting from deduplication, create a toxic combination that leads to unacceptably high retrieval times. This work proposes DedupT, a system that efficiently supports deduplication on tape pools. This paper (i) details the main challenges of enabling efficient deduplication on tape libraries; (ii) presents a class of solutions, based on graph modeling of the similarity between data items, that enables efficient placement on tapes; and (iii) presents the design and evaluation of novel cross-tape and on-tape chunk placement algorithms that alleviate tape mount time overhead and reduce on-tape data fragmentation. Using 4.5 TB of real-world workloads, we show that DedupT retains at least 95% of the deduplication efficiency. We show that DedupT mitigates major retrieval time overheads and, due to reading less data, is able to offer better restore performance than restoring non-deduplicated data.
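The graph-based placement idea can be illustrated with a greedy sketch. This is not DedupT's actual algorithm; it assumes a precomputed similarity graph (item -> {neighbor: shared bytes}) and a size map, both hypothetical inputs.

```python
def place_on_tapes(graph, size_of, tape_capacity):
    """Greedy illustration: seed each tape with the largest unplaced item,
    then keep pulling in the unplaced neighbor that shares the most content
    with items already on the tape, while it still fits."""
    unplaced = set(size_of)
    tapes = []
    while unplaced:
        seed = max(unplaced, key=size_of.get)
        unplaced.remove(seed)
        tape, used = [seed], size_of[seed]
        while True:
            best, best_weight = None, 0
            for placed in tape:
                for neighbor, weight in graph.get(placed, {}).items():
                    if (neighbor in unplaced and weight > best_weight
                            and used + size_of[neighbor] <= tape_capacity):
                        best, best_weight = neighbor, weight
            if best is None:
                break
            unplaced.remove(best)
            tape.append(best)
            used += size_of[best]
        tapes.append(tape)
    return tapes
```

Keeping strongly connected items on the same tape lets their shared chunks deduplicate within a tape, so a restore touches fewer tapes and suffers less fragmentation.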
Algorithms | 2012
Maohua Lu; Cornel Constantinescu; Prasenjit Sarkar
Deduplication in storage systems has gained momentum recently for its capability in reducing data footprint. However, deduplication introduces challenges to storage management as storage objects (e.g., files) are no longer independent from each other due to content sharing between these storage objects. In this paper, we present a graph-based framework to address the challenges of storage management due to deduplication. Specifically, we model content sharing among storage objects by content sharing graphs (CSG), and apply graph-based algorithms to two real-world storage management use cases for deduplication-enabled storage systems. First, a quasi-linear algorithm was developed to partition deduplication domains with a minimal amount of deduplication loss (i.e., data replicated across partitioned domains) in commercial deduplication-enabled storage systems, whereas in general the partitioning problem is NP-complete. For a real-world trace of 3 TB data with 978 GB of removable duplicates, the proposed algorithm can partition the data into 15 balanced partitions with only 54 GB of deduplication loss, that is, a 5% deduplication loss. Second, a quick and accurate method to query the deduplicated size for a subset of objects in deduplicated storage systems was developed. For the same trace of 3 TB data, the optimized graph-based algorithm can complete the query in 2.6 s, which is less than 1% of that of the traditional algorithm based on the deduplication metadata.
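For concreteness, a naive baseline for the two use cases follows, assuming hypothetical file_chunks (file -> set of chunk IDs) and chunk_size maps. The paper's partitioning and optimized size-query algorithms operate on the content sharing graph far more efficiently than this.

```python
from collections import defaultdict
from itertools import combinations

def content_sharing_graph(file_chunks, chunk_size):
    """Build a CSG where the edge weight between two files is the number of
    bytes of content they share (each shared chunk counted once)."""
    owners = defaultdict(set)
    for f, chunks in file_chunks.items():
        for c in chunks:
            owners[c].add(f)
    graph = defaultdict(lambda: defaultdict(int))
    for c, files in owners.items():
        for a, b in combinations(sorted(files), 2):
            graph[a][b] += chunk_size[c]
            graph[b][a] += chunk_size[c]
    return graph

def deduplicated_size(subset, file_chunks, chunk_size):
    """Naive subset-size query: count each chunk referenced by the subset
    exactly once, regardless of how many files in the subset share it."""
    unique = set()
    for f in subset:
        unique.update(file_chunks[f])
    return sum(chunk_size[c] for c in unique)
```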
Data Compression Conference | 2013
Cornel Constantinescu; Joseph S. Glider; Dilip Simha; David D. Chambliss
Modern primary storage systems support, or intend to add support for, real-time compression, usually based on some flavor of the LZ77 and/or Huffman algorithms. There is a fundamental tradeoff in adding real-time (adaptive) compression to such a system: to get good compression, the independently compressed blocks should be large; to be able to read quickly from random places, the blocks should be small. One idea is to let the independently compressed blocks be large but allow decompression of the needed part of a block to start from a random location inside the compressed block. We explore this idea and compare it with a few alternatives, experimenting with the zlib code base.
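One common way to get restartable decompression inside a single deflate stream is to issue a full flush at each logical block boundary and record the offsets. The sketch below shows that technique with zlib; the block size and index layout are assumptions for illustration, not necessarily the scheme evaluated in the paper.

```python
import zlib

BLOCK = 64 * 1024  # assumed logical block size, for illustration only

def compress_with_restart_points(data, block=BLOCK):
    """Single raw-deflate stream with a full flush at every block boundary,
    so decompression can restart at any recorded (uncompressed, compressed)
    offset pair without reading the whole compressed block."""
    comp = zlib.compressobj(6, zlib.DEFLATED, -15)   # raw deflate, no header
    out, index = bytearray(), []
    for i in range(0, len(data), block):
        index.append((i, len(out)))                  # restart point
        out += comp.compress(data[i:i + block])
        out += comp.flush(zlib.Z_FULL_FLUSH)         # byte-align and reset history
    out += comp.flush()
    return bytes(out), index

def read_range(compressed, index, start, length):
    """Decompress only from the last restart point at or before 'start'."""
    u0, c0 = max(p for p in index if p[0] <= start)
    d = zlib.decompressobj(-15)
    plain = d.decompress(compressed[c0:], (start - u0) + length)
    return plain[start - u0 : start - u0 + length]
```

The cost is the index plus a small compression loss at each full flush, in exchange for decompressing at most one block prefix per random read.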
Data Compression Conference | 2009
Cornel Constantinescu; Jan Pieper; Tiancheng Li
File and Storage Technologies | 2008
Sangeetha Seshadri; Lawrence Chiu; Cornel Constantinescu; Subashini Balachandran; Clem Dickey; Ling Liu; Paul Muench
Conference on Network and Service Management | 2012
Abdullah Gharaibeh; Cornel Constantinescu; Maohua Lu; Anurag Sharma; Ramani R. Routray; Prasenjit Sarkar; David Pease; Matei Ripeanu
Archive | 2007
Sangeetha Seshadri; Ling Liu; Lawrence Chiu; Cornel Constantinescu; Subashini Balachandran