Publication


Featured research published by Gil Vernik.


IEEE Conference on Mass Storage Systems and Technologies | 2012

Estimation of deduplication ratios in large data sets

Danny Harnik; Oded Margalit; Dalit Naor; Dmitry Sotnikov; Gil Vernik

We study the problem of accurately estimating the data reduction ratio achieved by deduplication and compression on a specific data set. This turns out to be a challenging task: it has been shown both empirically and analytically that essentially all of the data at hand needs to be inspected in order to come up with an accurate estimation when deduplication is involved. Moreover, even when permitted to inspect all the data, there are challenges in devising an efficient, yet accurate, method. Efficiency in this case refers to the demanding CPU, memory and disk usage associated with deduplication and compression. Our study focuses on what can be done when scanning the entire data set. We present a novel two-phased framework for such estimations. Our techniques are provably accurate, yet run with very low memory requirements and avoid overheads associated with maintaining large deduplication tables. We give formal proofs of the correctness of our algorithm, compare it to existing techniques from the database and streaming literature, and evaluate our technique on a number of real world workloads. For example, we estimate the data reduction ratio of a 7 TB data set with accuracy guarantees of at most a 1% relative error while using as little as 1 MB of RAM (and no additional disk access). In the interesting case of full-file deduplication, our framework readily accepts optimizations that allow estimation on a large data set without reading most of the actual data. For one of the workloads used in this work we achieved an accuracy guarantee of 2% relative error while reading only 27% of the data from disk. Our technique is practical, simple to implement, and useful for multiple scenarios, including estimating the number of disks to buy, choosing a deduplication technique, deciding whether or not to deduplicate, and conducting large-scale academic studies related to deduplication ratios.
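
The key idea behind low-memory estimation without full deduplication tables can be illustrated with content-based sampling: a chunk is counted only if its hash falls below a threshold, so every copy of a duplicate chunk is either sampled or skipped consistently. The sketch below is a minimal illustration of that general idea only, not the paper's actual two-phase algorithm; the chunk size, hash choice, and sampling rate are assumptions.

# Minimal sketch: estimating a deduplication ratio with low memory by
# sampling chunks based on their content hash. This illustrates hash-based
# sampling in general; it is NOT the paper's exact two-phase algorithm.
import hashlib
from collections import Counter

CHUNK_SIZE = 4096          # assumed fixed-size chunking
SAMPLE_RATE = 1 / 64       # keep roughly 1 in 64 distinct chunk hashes

def estimate_dedup_ratio(paths):
    sample = Counter()     # hash -> occurrence count, for sampled chunks only
    for path in paths:
        with open(path, "rb") as f:
            while True:
                chunk = f.read(CHUNK_SIZE)
                if not chunk:
                    break
                digest = hashlib.sha1(chunk).digest()
                # The sample decision is a deterministic function of the
                # content, so all copies of a duplicate chunk are in or out
                # of the sample together.
                if int.from_bytes(digest[:8], "big") < SAMPLE_RATE * 2**64:
                    sample[digest] += 1
    sampled_total = sum(sample.values())
    if sampled_total == 0:
        return 1.0
    # Distinct/total within the sample estimates the global ratio.
    return len(sample) / sampled_total

# Usage: ratio = estimate_dedup_ratio(["vm1.img", "vm2.img"])
# A ratio of 0.25 would suggest the data deduplicates to roughly 25% of its size.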


International Conference on Cloud Computing | 2013

Data On-Boarding in Federated Storage Clouds

Gil Vernik; Alexandra Shulman-Peleg; Sebastian Dippl; Ciro Formisano; Michael C. Jaeger; Elliot K. Kolodner; Massimo Villari

One of the main obstacles hindering wider adoption of storage cloud services is vendor lock-in, a situation in which large amounts of data placed in one storage system cannot be migrated to another vendor, e.g., due to time and cost considerations. To prevent this situation, we present an advanced on-boarding federation mechanism, enabling a cloud to add a special federation layer to efficiently import data from other storage clouds. This is achieved without depending on any special function from the other clouds. We design a generic, modular on-boarding architecture and demonstrate its implementation as part of VISION Cloud, a large-scale storage cloud designed for content-centric data. Our system is capable of integrating data from various clouds, providing a common global view of the stored data. Users can access the data through the new cloud provider immediately after setup, maintaining the normal operation of applications, so that they do not need to wait for the completion of the data migration process. Finally, we analyze the payment models of existing storage clouds, showing that transferring the data via on-boarding federation with a direct link between clouds can lead to significant time and cost savings.
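
The read path of such a federation layer can be sketched as follows: serve an object from the destination cloud if it has already been migrated, otherwise fetch it from the source cloud on demand and keep a local copy. The ObjectStoreClient interface below is a hypothetical stand-in; this is an illustration of the on-boarding idea, not VISION Cloud's actual implementation.

# Minimal sketch of an on-boarding federation read path. The client class is
# an in-memory stand-in for a real object store API (assumption).
class ObjectStoreClient:
    """Hypothetical minimal client for a cloud object store."""
    def __init__(self, name):
        self.name = name
        self._objects = {}
    def exists(self, key):
        return key in self._objects
    def get(self, key):
        return self._objects[key]
    def put(self, key, data):
        self._objects[key] = data

class OnboardingFederation:
    def __init__(self, source, destination):
        self.source = source            # old cloud being migrated away from
        self.destination = destination  # new cloud hosting the federation layer

    def get(self, key):
        # Serve locally if the object has already been on-boarded.
        if self.destination.exists(key):
            return self.destination.get(key)
        # Otherwise pull it from the source cloud on demand and keep a copy,
        # giving users a unified view without waiting for full migration.
        data = self.source.get(key)
        self.destination.put(key, data)
        return data

# Usage sketch:
# old, new = ObjectStoreClient("source"), ObjectStoreClient("destination")
# old.put("photos/cat.jpg", b"...")
# federation = OnboardingFederation(old, new)
# federation.get("photos/cat.jpg")   # fetched from source, now cached locally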


Symposium on Cloud Computing | 2017

Stocator: an object store aware connector for Apache Spark

Gil Vernik; Michael Factor; Elliot K. Kolodner; Effi Ofer; Pietro Michiardi; Francesco Pace

Data is the natural resource of the 21st century. It is being produced at dizzying rates, e.g., for genomics, for media and entertainment, and for the Internet of Things. Object storage systems such as Amazon S3, Azure Blob storage, and IBM Cloud Object Storage are highly scalable distributed storage systems that offer high capacity, cost effective storage. But it is not enough just to store data; we also need to derive value from it. Apache Spark is the leading big data analytics processing engine, combining MapReduce, SQL, streaming, and complex analytics. We present Stocator, a high performance storage connector that enables Spark to work directly on data stored in object storage systems, while providing the same correctness guarantees as Hadoop's original storage system, HDFS.

Current object storage connectors from the Hadoop community, e.g., for the S3 and Swift APIs, do not deal well with eventual consistency, which can lead to failure. These connectors assume file system semantics, which is natural given that their model of operation is based on interaction with HDFS. In particular, Spark and Hadoop achieve fault tolerance and enable speculative execution by creating temporary files, listing directories to identify these files, and then renaming them. This paradigm avoids interference between tasks doing the same work and thus writing output with the same name. However, with eventually consistent object storage, a container listing may not yet include a recently created object, and thus an object may not be renamed, leading to incomplete or incorrect results. Solutions such as EMRFS [1] from Amazon, S3mper [4] from Netflix, and S3Guard [2] attempt to overcome eventual consistency by requiring additional strongly consistent data storage. These solutions require multiple storage systems, are costly, and can introduce issues of consistency between the stores. Current object storage connectors from the Hadoop community are also notorious for their poor performance on write workloads. This, too, stems from their use of the rename operation, which is not a native object storage operation; not only is it not atomic, but it must be implemented using a costly copy operation followed by a delete. Others have tried to improve the performance of object storage connectors by eliminating rename, e.g., the DirectParquetOutputCommitter [5] for S3a introduced by Databricks, but have failed to preserve fault tolerance and speculation.

Stocator takes advantage of object storage semantics to achieve both high performance and fault tolerance. It eliminates the rename paradigm by writing each output object to its final name. The name includes both the part number and the attempt number, so that multiple attempts to write the same part use different objects. Stocator extends the already existing success indicator object written at the end of a Spark job to include a manifest with the names of all the objects that compose the final output; this ensures that a subsequent job will correctly read the output without resorting to a list operation whose results may not be consistent. By leveraging the inherent atomicity of object creation and using a manifest, we obtain fault tolerance and enable speculative execution; by avoiding the rename paradigm we greatly decrease the complexity of the connector and the number of operations on the object storage. We have implemented our connector and shared it in open source [3].

We have compared its performance with the S3a and Hadoop Swift connectors over a range of workloads and found that it executes many fewer operations on the object storage, in some cases as few as one thirtieth. Since the price of an object storage service typically includes charges based on the number of operations executed, this reduction in operations lowers costs for clients in addition to reducing the load on client software. It also reduces costs and load for the object storage provider, since it can serve more clients with the same amount of processing power. Stocator also substantially increases performance for Spark workloads running over object storage, especially for write intensive workloads, where it is as much as 18 times faster.
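
The rename-free output scheme described above can be sketched as follows: each task writes its part directly under a final name that encodes the part and attempt numbers, and the job's success indicator carries a manifest of the committed objects. The name layout and store API in this sketch are assumptions for illustration, not Stocator's actual code (which is available in open source [3]).

# Illustrative sketch of a rename-free commit: final object names encode part
# and attempt numbers, and a _SUCCESS object carries a manifest of the output.
import json

def part_object_name(output_path, part, attempt):
    # e.g. "results/part-00003-attempt-1": no temporary name, no rename later.
    return f"{output_path}/part-{part:05d}-attempt-{attempt}"

def write_task_output(store, output_path, part, attempt, data):
    name = part_object_name(output_path, part, attempt)
    store.put(name, data)          # object creation is atomic in the store
    return name

def commit_job(store, output_path, committed_names):
    # The success indicator object is extended with a manifest, so readers can
    # recover the exact output set without a (possibly inconsistent) listing.
    manifest = {"objects": sorted(committed_names)}
    store.put(f"{output_path}/_SUCCESS", json.dumps(manifest).encode())

# Usage sketch with a dict-backed stand-in for an object store:
class DictStore(dict):
    def put(self, key, value):
        self[key] = value

store = DictStore()
names = [write_task_output(store, "results", p, attempt=0, data=b"rows")
         for p in range(3)]
commit_job(store, "results", names)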


ACM International Conference on Systems and Storage | 2017

Stocator: a high performance object store connector for Spark

Gil Vernik; Michael Factor; Elliot K. Kolodner; Effi Ofer; Pietro Michiardi; Francesco Pace

Data is the natural resource of the 21st century. It is being produced at dizzying rates, e.g., for genomics by sequencers, for media and entertainment with very high resolution formats, and for the Internet of Things (IoT) by multitudes of sensors. Object stores such as AWS S3, Azure Blob storage, and IBM Cloud Object Storage are highly scalable distributed storage systems that offer high capacity, cost effective storage for this data. But it is not enough just to store data; we also need to derive value from it. Apache Spark is the leading big data analytics processing engine. It runs up to one hundred times faster than Hadoop MapReduce and combines SQL, streaming, and complex analytics. In this poster we present Stocator, a high performance storage connector that enables Spark to work directly on data stored in object storage systems.


European Conference on Service-Oriented and Cloud Computing | 2013

Delegation for On-boarding Federation Across Storage Clouds

Elliot K. Kolodner; Alexandra Shulman-Peleg; Gil Vernik; Ciro Formisano; Massimo Villari

On-boarding federation allows an enterprise to efficiently migrate its data from one storage cloud provider to another (e.g., for business or legal reasons), while providing continuous access and a unified view over the data during the migration. On-boarding is provided through a federation layer on the new destination cloud that uses delegation to access objects on the old source cloud. In this paper we describe a delegation architecture for on-boarding in which the user delegates to the on-boarding layer a subset of his or her access rights on the source and destination clouds, so that on-boarding can occur in a safe and secure way and the on-boarding layer has the least privilege required to carry out its work. The added value of this work lies in evaluating the security implications of delegation that must be taken into account during the on-boarding phase. We also show how this delegation architecture can be implemented using the Security Assertion Markup Language (SAML).
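
The least-privilege idea can be sketched concretely: the user delegates only the rights the on-boarding layer needs, e.g., read on a source container and write on a destination container, with an expiry. In a real deployment these grants would be expressed as signed SAML assertions; the plain data structure and check below are an illustrative assumption, not the paper's actual architecture.

# Minimal sketch of least-privilege delegation for on-boarding. The grant
# structure is illustrative; real systems would encode this as SAML assertions.
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass(frozen=True)
class DelegationGrant:
    delegate: str        # identity of the on-boarding layer
    cloud: str           # which cloud the right applies to
    container: str       # scope: a single container
    action: str          # "read" or "write"
    expires: datetime

def onboarding_grants(delegate, source_container, dest_container, hours=24):
    expiry = datetime.now(timezone.utc) + timedelta(hours=hours)
    return [
        DelegationGrant(delegate, "source-cloud", source_container, "read", expiry),
        DelegationGrant(delegate, "destination-cloud", dest_container, "write", expiry),
    ]

def is_allowed(grants, delegate, cloud, container, action, now=None):
    now = now or datetime.now(timezone.utc)
    return any(g.delegate == delegate and g.cloud == cloud and
               g.container == container and g.action == action and
               g.expires > now
               for g in grants)

# Usage: the layer may read the source container but not write to it,
# and may write the destination container but nothing else.
grants = onboarding_grants("onboarding-layer", "photos", "photos")
assert is_allowed(grants, "onboarding-layer", "source-cloud", "photos", "read")
assert not is_allowed(grants, "onboarding-layer", "source-cloud", "photos", "write")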


Archive | 2012

Estimation of Data Reduction Rate in a Data Storage System

Danny Harnik; Oded Margalit; Dalit Naor; Dmitry Sotnikov; Gil Vernik


Archive | 2013

Efficient Data Deduplication in a Data Storage Network

Elliot K. Kolodner; Eran Rom; Dmitry Sotnikov; Gil Vernik


Cluster Computing and the Grid | 2018

Stocator: Providing High Performance and Fault Tolerance for Apache Spark Over Object Storage

Gil Vernik; Michael Factor; Elliot K. Kolodner; Pietro Michiardi; Effi Ofer; Francesco Pace


International Conference on Cloud Computing | 2014

An Approach for Hybrid Clouds using VISION Cloud Federation

Uwe Hohenstein; Michael C. Jaeger; Sebastian Dippl; Enver Bahar; Gil Vernik; Elliot K. Kolodner


Scalable Computing: Practice and Experience | 2014

Delegation across storage clouds: on-boarding federation as a case study

Ciro Formisano; Elliot K. Kolodner; Alexandra Shulman-Peleg; Ermanno Travaglino; Gil Vernik; Massimo Villari
