Publications


Featured research published by Dmitry Sotnikov.


IEEE Conference on Mass Storage Systems and Technologies | 2012

Estimation of deduplication ratios in large data sets

Danny Harnik; Oded Margalit; Dalit Naor; Dmitry Sotnikov; Gil Vernik

We study the problem of accurately estimating the data reduction ratio achieved by deduplication and compression on a specific data set. This turns out to be a challenging task: it has been shown both empirically and analytically that essentially all of the data at hand needs to be inspected in order to produce an accurate estimate when deduplication is involved. Moreover, even when permitted to inspect all the data, there are challenges in devising an efficient, yet accurate, method. Efficiency in this case refers to the demanding CPU, memory and disk usage associated with deduplication and compression. Our study focuses on what can be done when scanning the entire data set. We present a novel two-phased framework for such estimations. Our techniques are provably accurate, yet run with very low memory requirements and avoid the overheads associated with maintaining large deduplication tables. We give formal proofs of the correctness of our algorithm, compare it to existing techniques from the database and streaming literature, and evaluate our technique on a number of real-world workloads. For example, we estimate the data reduction ratio of a 7 TB data set with accuracy guarantees of at most a 1% relative error while using as little as 1 MB of RAM (and no additional disk access). In the interesting case of full-file deduplication, our framework readily accepts optimizations that allow estimation on a large data set without reading most of the actual data. For one of the workloads used in this work, we achieved an accuracy guarantee of 2% relative error while reading only 27% of the data from disk. Our technique is practical, simple to implement, and useful for multiple scenarios, including estimating the number of disks to buy, choosing a deduplication technique, deciding whether or not to deduplicate, and conducting large-scale academic studies related to deduplication ratios.
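
For context on the memory cost this paper targets, the sketch below shows the naive full-scan baseline: chunk the data, keep one fingerprint per unique chunk, and compress each unique chunk to measure reduction. It is not the paper's two-phase estimator; the input path and the fixed 4 KB chunk size are illustrative assumptions, and the in-memory fingerprint set is exactly the kind of large deduplication table the paper's technique avoids.

```python
# A naive full-scan baseline (not the paper's two-phase estimator): chunk the
# data, keep one fingerprint per unique chunk, and compress each unique chunk.
# The in-memory fingerprint set is precisely the large deduplication table
# whose cost the paper's sampling framework avoids. The input path and the
# fixed 4 KB chunk size are illustrative assumptions.
import hashlib
import zlib

CHUNK_SIZE = 4096  # fixed-size chunking; real systems may also use content-defined chunking

def measure_reduction(path):
    seen = set()           # grows with the number of unique chunks
    raw = deduped = compressed = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            raw += len(chunk)
            fingerprint = hashlib.sha1(chunk).digest()
            if fingerprint not in seen:
                seen.add(fingerprint)
                deduped += len(chunk)
                compressed += len(zlib.compress(chunk))
    return {
        "dedup_ratio": raw / deduped if deduped else 1.0,
        "reduction_ratio": raw / compressed if compressed else 1.0,  # dedup + compression
    }

if __name__ == "__main__":
    print(measure_reduction("volume.img"))  # hypothetical input file
```

At 4 KB chunks, a 7 TB data set can contain on the order of a billion unique chunks, so the fingerprint table alone dwarfs the roughly 1 MB of RAM reported for the paper's estimator.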


Pacific Rim International Symposium on Dependable Computing | 2014

Reliability of Geo-replicated Cloud Storage Systems

Ilias Iliadis; Dmitry Sotnikov; Paula Ta-Shma; Vinodh Venkatesan

In geo-replicated cloud storage systems, network bandwidth between sites is typically scarcer than bandwidth within a site and can become a bottleneck for recovery operations. We study the reliability of geo-replicated cloud storage systems, taking into account the different bandwidths within a site and between sites. We consider a new recovery scheme called staged rebuild and compare it with both a direct scheme and a scheme known as intelligent rebuild. To assess the reliability gains achieved by these schemes, we develop an analytical model that incorporates various relevant aspects of storage systems, such as bandwidths, latent sector errors, and failure distributions. The model applies in the context of OpenStack Swift, a widely deployed cloud storage system. Under certain practical system configurations, we establish that order-of-magnitude improvements in mean time to data loss (MTTDL) can be achieved using these schemes.
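
For orientation, the sketch below computes the classical Markov-chain approximation of MTTDL for an n-way replicated group with independent failures, MTTDL ≈ MTTF^n / (n! · MTTR^(n-1)). It deliberately omits the inter-site bandwidth limits, latent sector errors, and staged rebuild scheme that the paper's analytical model captures; the MTTF and MTTR values are illustrative assumptions, not figures from the paper.

```python
# A minimal sketch of the classical Markov-chain approximation of MTTDL for an
# n-way replicated group with independent disk failures and repairs much
# faster than failures: MTTDL ≈ MTTF^n / (n! · MTTR^(n-1)). It ignores the
# inter-site bandwidth limits, latent sector errors, and staged rebuild scheme
# modeled in the paper; the parameter values below are illustrative only.
from math import factorial

def mttdl_replication(n_replicas, mttf_hours, mttr_hours):
    """Approximate mean time to data loss, in hours, for one replica group."""
    return mttf_hours ** n_replicas / (factorial(n_replicas) * mttr_hours ** (n_replicas - 1))

if __name__ == "__main__":
    # 3-way replication with a 1,000,000-hour disk MTTF; compare a fast and a slow rebuild.
    for mttr in (24.0, 72.0):
        print(f"MTTR = {mttr:5.1f} h  ->  MTTDL ≈ {mttdl_replication(3, 1e6, mttr):.2e} h")
```

Because MTTDL shrinks polynomially as MTTR grows, any rebuild slowdown caused by scarce inter-site bandwidth directly erodes reliability, which is the effect that schemes such as staged rebuild aim to mitigate.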


Data Compression Conference | 2014

A Fast Implementation of Deflate

Danny Harnik; Ety Khaitzin; Dmitry Sotnikov; Shai Taharlev

We present a fast implementation of the Deflate compression format that is substantially faster than the fastest version of the Zlib software package, yet maintains full compatibility with the Deflate standard. Our solution outperforms the fastest Zlib version by a factor of 2.6 and higher in compression time, yet shows only a negligible loss in compression ratio. The basic building blocks for our solution are a fast LZ77 compressor (the LZ4 package) and a standard Huffman encoding package (Zlib). In the paper, we describe how a non-trivial combination built around these building blocks achieves the aforementioned performance and compatibility.
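
The sketch below does not reproduce the paper's LZ4-plus-zlib construction; using only the standard zlib module, it illustrates the two costs that the design separates: LZ77 match finding, whose effort is set by the compression level, and Huffman coding alone (the Z_HUFFMAN_ONLY strategy, which skips string matching). The synthetic input data is an assumption for the example; all three variants still produce standard zlib/Deflate streams.

```python
# A minimal sketch (not the paper's LZ4+zlib implementation) showing the two
# costs that Deflate combines: LZ77 match finding, controlled by the zlib
# compression level, and Huffman coding alone (Z_HUFFMAN_ONLY, which skips
# string matching). All variants emit standard zlib/Deflate streams.
import random
import time
import zlib

def deflate(data, level, strategy=zlib.Z_DEFAULT_STRATEGY):
    c = zlib.compressobj(level, zlib.DEFLATED, zlib.MAX_WBITS, 8, strategy)
    return c.compress(data) + c.flush()

if __name__ == "__main__":
    # Illustrative synthetic input; any sizeable real file would do as well.
    rng = random.Random(0)
    data = b"".join(
        b"log entry %d: status=%s\n" % (i, rng.choice([b"OK", b"RETRY", b"FAIL"]))
        for i in range(200_000)
    )
    variants = [
        ("zlib level 6 (default)", 6, zlib.Z_DEFAULT_STRATEGY),
        ("zlib level 1 (fastest)", 1, zlib.Z_DEFAULT_STRATEGY),
        ("Huffman coding only, no LZ matching", 1, zlib.Z_HUFFMAN_ONLY),
    ]
    for name, level, strategy in variants:
        t0 = time.perf_counter()
        out = deflate(data, level, strategy)
        dt = time.perf_counter() - t0
        print(f"{name:38s} ratio={len(data) / len(out):5.2f}  time={dt * 1000:7.1f} ms")
```

As the abstract notes, the speedup comes from pairing LZ4's much faster LZ77 matching with zlib's standard, Deflate-compatible Huffman encoding.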


IEEE Conference on Mass Storage Systems and Technologies | 2014

The case for sampling on very large file systems

George Goldberg; Danny Harnik; Dmitry Sotnikov

Sampling has long been a prominent tool in statistics and analytics, first and foremost when very large amounts of data are involved. In the realm of very large file systems (and hierarchical data stores in general), however, sampling has mostly been ignored, and for several good reasons: running sampling in such an environment introduces technical challenges that can make the entire sampling process non-beneficial. In this work we demonstrate that there are cases for which sampling is very worthwhile in very large file systems. We address this topic in two aspects: (a) the technical side, where we design and implement solutions for efficient weighted sampling that are distributed, one-pass, and address multiple efficiency concerns; and (b) the usability side, where we demonstrate several use cases in which weighted sampling over large file systems is extremely beneficial. In particular, we show use cases involving estimation of compression ratios, testing and auditing, and offline collection of statistics on very large data stores.
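
One standard way to get the one-pass, merge-friendly weighted sampling the abstract refers to is the Efraimidis-Spirakis reservoir scheme sketched below; the paper's exact method may differ. Weighting files by size means byte-level statistics, such as compression ratio, can be estimated from a modest sample. The file listing in the usage example is hypothetical.

```python
# A minimal sketch of one-pass weighted sampling using the Efraimidis-Spirakis
# "A-Res" scheme: each item gets key u**(1/weight) and the k largest keys are
# kept. The paper's exact method may differ; this only illustrates why
# weighting by file size lets byte-level statistics be estimated from a small
# sample, and why a single pass with a small heap is enough.
import heapq
import random

def weighted_sample(items, k):
    """items: iterable of (name, weight); keep up to k items, favoring larger weights."""
    heap = []  # min-heap of (key, name), holding the k largest keys seen so far
    for name, weight in items:
        if weight <= 0:
            continue
        key = random.random() ** (1.0 / weight)
        if len(heap) < k:
            heapq.heappush(heap, (key, name))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, name))
    return [name for _, name in heap]

if __name__ == "__main__":
    # Hypothetical file listing: (path, size in bytes); large files dominate the sample.
    files = [(f"/data/f{i}", random.randint(1, 10**9)) for i in range(100_000)]
    print(weighted_sample(files, 10))
```

Since each node or directory scan only needs to retain its k largest keys, per-partition samples can be merged into a global one, which is what makes this style of sampling attractive for distributed scans of a very large namespace.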


Symposium on Reliable Distributed Systems | 2016

Network Aware Reliability Analysis for Distributed Storage Systems

Amir Epstein; Elliot K. Kolodner; Dmitry Sotnikov

It is hard to measure the reliability of a large distributed storage system, since it is influenced by low-probability failure events that occur over time. Nevertheless, it is critical to be able to predict reliability in order to plan, deploy and operate the system. Existing approaches suffer from unrealistic assumptions regarding network bandwidth. This paper introduces a new framework that combines simulation and an analytic model to estimate durability for large distributed cloud storage systems. Our approach is the first that takes network bandwidth into account with a focus on the cumulative effect of simultaneous failures on repair time. Using our framework, we evaluate the trade-offs among durability, network costs, and storage costs for the OpenStack Swift object store, comparing various system configurations and resiliency schemes, including replication and erasure coding. In particular, we show that when accounting for the cumulative effect of simultaneous failures, estimates of the probability of data loss can vary by two to four orders of magnitude.
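
As a point of reference for the quantity being estimated, the toy Monte Carlo sketch below approximates the probability of data loss for a single replica group. Unlike the paper's framework, it assumes independent failures, a fixed per-replica repair time, and no bandwidth contention among simultaneous repairs; the failure and repair parameters are deliberately pessimistic, illustrative values.

```python
# A toy Monte Carlo estimate of data-loss probability for one replica group,
# assuming independent exponential failures and a fixed repair time. It omits
# the shared-bandwidth repair contention that the paper's framework models;
# parameter values are deliberately pessimistic and purely illustrative.
import random

def simulate_group(n_replicas, mttf_h, repair_h, mission_h, rng):
    """Return True if all replicas of one group are ever down simultaneously."""
    t = 0.0
    repairs = []  # completion times of repairs for currently failed replicas
    while t < mission_h:
        working = n_replicas - len(repairs)
        if working == 0:
            return True  # every replica failed before any repair finished
        # next failure among the working replicas (exponential, memoryless)
        t_fail = t + rng.expovariate(working / mttf_h)
        t_repair = min(repairs) if repairs else float("inf")
        if t_repair < t_fail:
            t = t_repair               # a repair finishes first; resample from here
            repairs.remove(t_repair)
        elif t_fail < mission_h:
            t = t_fail                 # another replica fails before any repair
            repairs.append(t + repair_h)
        else:
            return False               # mission time ends with data intact
    return False

if __name__ == "__main__":
    rng = random.Random(7)
    trials = 20_000
    losses = sum(
        simulate_group(3, mttf_h=10_000.0, repair_h=168.0, mission_h=8760.0, rng=rng)
        for _ in range(trials)
    )
    print(f"estimated P(data loss within one year) ≈ {losses / trials:.1e} per group")
```

Modeling how concurrent failures stretch repair times by competing for shared network bandwidth is precisely the refinement that, per the paper, can move such data-loss estimates by two to four orders of magnitude.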


File and Storage Technologies | 2013

To Zip or not to Zip: effective resource usage for real-time compression

Danny Harnik; Ronen I. Kat; Oded Margalit; Dmitry Sotnikov; Avishay Traeger


Archive | 2013

Real-time reduction of CPU overhead for data compression

Ron Asher; Danny Harnik; Oded Margalit; Ronen I. Kat; Dmitry Sotnikov


File and Storage Technologies | 2015

SDGen: mimicking datasets for content generation in storage benchmarks

Raúl Gracia-Tinedo; Danny Harnik; Dalit Naor; Dmitry Sotnikov; Sivan Toledo; Aviad Zuck


Archive | 2013

Efficient One-Pass Cache-Aware Compression

Jonathan Amit; Jonathan Fischer-Toubol; Nir Halowani; Danny Harnik; Ety Khaitzin; Sergey Marenkov; Gil Shapira; Dmitry Sotnikov; Shai I. Tahar


Archive | 2014

Lookup-Based Data Block Alignment for Data Deduplication

Aviv Caro; Danny Harnik; Ety Khaitzin; Chaim Koifman; Sergey Marenkov; Ben Sasson; Yosef Shatsky; Dmitry Sotnikov; Shai I. Tahar
