Yongtao Zhou | Researchain

Publication

Featured researches published by Yongtao Zhou.

IEEE International Conference on Cloud Computing Technology and Science | 2018

EPAS: A Sampling Based Similarity Identification Algorithm for the Cloud

Yongtao Zhou; Yuhui Deng; Junjie Xie; Laurence T. Yang

The explosive growth of data brings new challenges to data storage and management in cloud environments. These data usually have to be processed in a timely fashion in the cloud, so any increase in latency may cause massive losses to enterprises. Similarity detection plays a very important role in data management. Many typical algorithms such as Shingle, Simhash, Traits and the Traditional Sampling Algorithm (TSA) are extensively used. The Shingle, Simhash and Traits algorithms read the entire source file to calculate the corresponding similarity characteristic value, thus requiring many CPU cycles and much memory space and incurring a large number of disk accesses. In addition, the overhead increases with the growth of the data set volume and results in a long delay. Instead of reading the entire file, TSA samples some data blocks and calculates their fingerprints as the similarity characteristic value. The overhead of TSA is fixed and negligible. However, a slight modification of a source file shifts the bit positions of the file content, so similarity identification fails after even slight modifications. This paper proposes an Enhanced Position-Aware Sampling algorithm (EPAS) to identify file similarity for the cloud by modulo file length. EPAS concurrently samples data blocks from the head and the tail of the modulated file to avoid the position shift incurred by the modifications. Meanwhile, an improved metric is proposed to measure the similarity between different files and make the possible detection probability close to the actual probability. Furthermore, this paper describes a query algorithm to reduce the time overhead of similarity detection. Our experimental results demonstrate that EPAS significantly outperforms the existing well-known algorithms in terms of time overhead, CPU and memory occupation. Moreover, EPAS makes a better tradeoff between precision and recall than other similarity detection algorithms. Therefore, it is an effective approach to similarity identification for the cloud.
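As a rough illustration of why sampling from both ends of a file helps, the Python sketch below fingerprints a few blocks anchored at the head and a few anchored at the tail. It is not the EPAS implementation: the abstract does not spell out the modulo-file-length step or the improved metric, so the function names, block sizes, SHA-1 fingerprints and the Jaccard-style score used here are all assumptions made for the example.

```python
import hashlib

def head_tail_samples(data: bytes, k: int = 4, step: int = 64, block: int = 8) -> set:
    """Fingerprint k blocks anchored at the head and k blocks anchored at the tail.

    Hypothetical helper: the parameters are illustrative, not taken from the paper.
    """
    if len(data) < (k - 1) * step + block:
        raise ValueError("file too short for the requested samples")
    prints = set()
    for i in range(k):
        head = data[i * step : i * step + block]   # offset counted from the start
        end = len(data) - i * step                 # offset counted from the end
        tail = data[end - block : end]
        prints.add(("head", i, hashlib.sha1(head).hexdigest()))
        prints.add(("tail", i, hashlib.sha1(tail).hexdigest()))
    return prints

def similarity(a: set, b: set) -> float:
    """Jaccard index of two fingerprint sets (an assumed metric, not the paper's)."""
    return len(a & b) / len(a | b) if a | b else 1.0

# Deterministic 4 KiB test file: the 32-bit counters 0..1023 back to back.
original = b"".join(i.to_bytes(4, "big") for i in range(1024))
edited = original[:2000] + b"X" + original[2000:]  # one byte inserted mid-file

score = similarity(head_tail_samples(original), head_tail_samples(edited))
print(score)
```

Because the head samples sit before the inserted byte and the tail samples are measured back from the end, none of the sampled blocks move in this example. An edit that lands inside a sampled block, or in front of the head samples, would still change those fingerprints.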

IEEE Transactions on Parallel and Distributed Systems | 2017

An Incrementally Scalable and Cost-Efficient Interconnection Structure for Data Centers

Junjie Xie; Yuhui Deng; Geyong Min; Yongtao Zhou

The explosive growth in the volume of stored data and the complexity of data processing drive data center networks (DCNs) to become incrementally scalable and cost-efficient while maintaining high network capacity and fault tolerance. To address these challenges, this paper proposes a new structure, called Totoro, which is defined recursively and hierarchically: dual-port servers and commodity switches are used to make Totoro affordable; a group of servers is connected to an intra-switch to form a basic partition; to construct a high-level structure, half of the backup ports of servers in the low-level structures are connected by inter-switches in order to incrementally build a larger partition. Totoro is incrementally scalable since expanding the structure does not require any rewiring or routing alteration. We further design a distributed and fault-tolerant routing protocol to handle multiple types of failures. Experimental results demonstrate that Totoro is able to satisfy the demands of fault tolerance and high throughput. Furthermore, architecture analysis indicates that Totoro balances performance and cost in terms of robustness, structural properties, bandwidth, economic costs and power consumption.

International Conference on Algorithms and Architectures for Parallel Processing | 2014

Identifying File Similarity in Large Data Sets by Modulo File Length

Yongtao Zhou; Yuhui Deng; Xiaoguang Chen; Junjie Xie

Identifying file similarity is very important for data management. Sampling files is a simple and effective approach to identifying file similarity. However, the traditional sampling algorithm (TSA) is very sensitive to file modification. For example, a single bit shift would result in a failure of similarity detection. Many research efforts have been invested in solving or alleviating this problem. This paper proposes a Position-Aware Sampling (PAS) algorithm to identify file similarity in large data sets by modulo file length. This method is very effective in dealing with file modification when performing similarity detection. Comprehensive experimental results demonstrate that PAS significantly outperforms a well-known similarity detection algorithm called simhash in terms of precision and recall. Furthermore, the time overhead, CPU and memory occupation of PAS are much lower than those of simhash.
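The sensitivity of fixed-offset sampling can be seen in a small Python sketch. This is only a toy stand-in for TSA, not the algorithm from the paper: the sampler, its step and block sizes, the SHA-1 fingerprints and the overlap score are assumptions, and the example shifts the content by one byte rather than one bit.

```python
import hashlib

def fixed_offset_samples(data: bytes, step: int = 256, block: int = 16) -> set:
    """Hash one block every `step` bytes, counting from the start of the file."""
    return {
        hashlib.sha1(data[off:off + block]).hexdigest()
        for off in range(0, len(data) - block + 1, step)
    }

def overlap(a: set, b: set) -> float:
    """Fraction of fingerprints shared by two sample sets (Jaccard index)."""
    return len(a & b) / len(a | b) if a | b else 1.0

# Deterministic 4 KiB test file: the 32-bit counters 0..1023 back to back.
original = b"".join(i.to_bytes(4, "big") for i in range(1024))
shifted = b"X" + original  # one byte inserted at the front

score_same = overlap(fixed_offset_samples(original), fixed_offset_samples(original))
score_shifted = overlap(fixed_offset_samples(original), fixed_offset_samples(shifted))
print(score_same, score_shifted)
```

Every sampled block in the shifted file starts one byte earlier in the original content, so no fingerprint survives even though the two files differ by a single byte.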

International Conference on Parallel and Distributed Systems | 2014

Leverage similarity and locality to enhance fingerprint prefetching of data deduplication

Yongtao Zhou; Yuhui Deng; Junjie Xie

Data deduplication has been widely used in data backup systems because it significantly reduces the requirements for storage capacity and network bandwidth. However, the performance of data deduplication gradually decreases with the growth of deduplicated data. This is because the volume of fingerprints grows significantly with the increase of backup data, and a large portion of fingerprints have to be stored on disk drives. This incurs frequent disk accesses to locate fingerprints and blocks the process of data deduplication. Furthermore, the fingerprints belonging to the same file may be scattered across disk drives. This generates small, random disk accesses and results in significant performance degradation when the fingerprints are referenced. Additionally, a single fingerprint may appear only once during a backup process. This results in a very low cache hit ratio due to the lack of temporal locality. This paper proposes to employ file similarity to enhance fingerprint prefetching, thus improving the cache hit ratio and the performance of data deduplication. Furthermore, the fingerprints are arranged sequentially in the order of the backup data stream to maintain locality and improve performance. Experimental results demonstrate that the proposed idea can effectively reduce the number of fingerprint accesses going to disk drives and decrease the query overhead of fingerprints, thus significantly alleviating the disk bottleneck of data deduplication.
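The effect of prefetching the fingerprints of a similar file can be sketched with a toy in-memory model. Everything below is a hypothetical illustration rather than the system described in the paper: the `FingerprintStore` class, the fixed 4-byte chunks, the SHA-1 fingerprints and the way disk reads are counted are assumptions, and the similar file is simply named by the caller instead of being detected.

```python
import hashlib

def fingerprints(data: bytes, chunk: int = 4) -> list:
    """Split data into fixed-size chunks and fingerprint each one."""
    return [hashlib.sha1(data[i:i + chunk]).hexdigest()
            for i in range(0, len(data), chunk)]

class FingerprintStore:
    """Toy fingerprint index: a dict stands in for the disk, a set for the cache."""

    def __init__(self, on_disk: dict):
        self.on_disk = on_disk   # file id -> fingerprints in backup-stream order
        self.cache = set()
        self.disk_reads = 0

    def prefetch(self, file_id: str) -> None:
        """One sequential read loads every fingerprint of a similar file."""
        self.disk_reads += 1
        self.cache.update(self.on_disk.get(file_id, []))

    def is_duplicate(self, fp: str) -> bool:
        if fp in self.cache:
            return True          # served from memory
        self.disk_reads += 1     # cache miss: one random lookup on disk
        return any(fp in fps for fps in self.on_disk.values())

def backup(store: FingerprintStore, data: bytes, similar_to=None) -> list:
    """Return one duplicate flag per chunk, optionally prefetching first."""
    if similar_to is not None:
        store.prefetch(similar_to)
    return [store.is_duplicate(fp) for fp in fingerprints(data)]

old = b"AAAABBBBCCCCDDDD"
new = b"AAAABBBBCCCCEEEE"   # shares three of its four chunks with `old`

cold = FingerprintStore({"v1": fingerprints(old)})
warm = FingerprintStore({"v1": fingerprints(old)})
flags_cold = backup(cold, new)
flags_warm = backup(warm, new, similar_to="v1")
print(cold.disk_reads, warm.disk_reads)
```

In this toy run both stores reach the same deduplication decisions, but the prefetching store touches the simulated disk twice (one prefetch plus one miss for the new chunk) instead of once per chunk.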

International Conference on Algorithms and Architectures for Parallel Processing | 2014

Athena: A Fault-Tolerant, Efficient and Applicable Routing Mechanism for Data Centers

Lijun Lyu; Junjie Xie; Yuhui Deng; Yongtao Zhou

The overall performance of a data center depends on the physical topology and the corresponding routing mechanism. Many novel network structures have been proposed in recent years to remedy the shortcomings of the traditional tree-based structure. In particular, some hybrid recursively defined structures with acceptable costs can perform well. These structures mainly adopt the conventional routing mechanism, which maintains large and complex link states. However, this routing mechanism still cannot work out the cost-optimal path to meet the requirements of short latency and low extra traffic consumption. Hence, this paper presents the Athena Routing Mechanism (ARM), based on dynamic programming with a path probing scheme, to further improve the performance of those structures. ARM is fault-tolerant since it makes full use of redundant links. It is also able to work out the shortest paths, which shortens the communication delay and relieves intermediate servers of forwarding loads as well as extra CPU and bandwidth consumption. Results from theoretical analysis, simulations and experiments firmly support the conclusion that ARM is a fault-tolerant and efficient routing mechanism that can be generalized to many other hybrid structures.

High Performance Computing and Communications | 2016

Predicting the Bursts of Data Access Streams by Filtering Correlated I/Os

Lifeng Huang; Yuhui Deng; Cheng Hu; Yongtao Zhou; Ru Yang; Ruikai Liu

Bursty behavior normally indicates that the workload generated by data accesses arrives in short, uneven spurts. In order to handle the bursts, the physical resources of IT devices have to be configured to offer capability that goes far beyond the average resource utilization, thus satisfying the performance requirements. However, this kind of fat provisioning wastes resources when the system does not experience peak workloads. If the bursts can be predicted in advance, thin provisioning will save a lot of resources in contrast to fat provisioning. However, bursty data access involves both correlated I/Os and non-correlated I/Os, which are mixed together. Therefore, it has long been a challenge to effectively predict the bursts. By analyzing real traces, this paper observes that the non-correlated block I/Os dominate bursts across I/O workloads. Based on this observation, the SAW-Apriori algorithm is proposed in this paper to mine the frequent and correlated I/Os by enhancing the temporal locality of the traditional Apriori algorithm. Furthermore, this paper proposes to predict the bursts by filtering those frequent and correlated I/Os. Experimental results demonstrate that the proposed approach significantly outperforms the traditional time series method when predicting the bursts.
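The idea of mining frequent, correlated I/Os and then filtering them out of a trace can be sketched as follows. This is not SAW-Apriori, whose details the abstract does not give; it is only the pair-counting step of a plain Apriori-style pass over sliding windows, and the function name, window size, support threshold and toy trace are all assumptions.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(trace: list, window: int = 3, min_support: int = 4) -> set:
    """Return block pairs that co-occur in at least `min_support` sliding windows."""
    counts = Counter()
    for start in range(len(trace) - window + 1):
        blocks = sorted(set(trace[start:start + window]))
        counts.update(combinations(blocks, 2))
    return {pair for pair, n in counts.items() if n >= min_support}

# Toy block-address trace: blocks 10 and 11 are always accessed together.
trace = [10, 11, 50, 10, 11, 73, 10, 11, 90]

pairs = frequent_pairs(trace)
correlated = {block for pair in pairs for block in pair}
remaining = [block for block in trace if block not in correlated]
print(pairs, remaining)
```

Here the pair (10, 11) is the only one that clears the support threshold, so filtering it leaves the accesses to blocks 50, 73 and 90 as the residual stream that a burst predictor would then analyze.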

International Performance Computing and Communications Conference | 2015

Reducing the read latency of in-line deduplication file system

Yongtao Zhou; Yuhui Deng; Yan Li; Junjie Xie

In-line deduplication systems mainly focus on secondary storage for backup and archiving, and offer only a few simple APIs. Applications cannot directly invoke these APIs without modification. Although file systems offer abundant APIs and are friendly to applications, building a file system for in-line deduplication brings new challenges in the I/O path. Read operations involve multiple disk accesses: getting the fingerprints from file recipes, obtaining the addresses by checking the fingerprint index, and reading the corresponding data blocks from the disk drive. This greatly increases the latency of the read path. We present a Low-Read-Latency File System (LRLFS) for in-line deduplication. Experiments suggest that LRLFS obtains low latency in the read path with negligible storage overhead and acceptable CPU and memory utilization.
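The three-step read path described above (file recipe, fingerprint index, data block) can be modeled with a minimal in-memory sketch. It shows only the baseline read path, not how LRLFS shortens it, which the abstract does not describe. The `DedupStore` class, the 4-byte blocks, the SHA-1 fingerprints and the lookup counter are hypothetical choices for illustration.

```python
import hashlib

class DedupStore:
    """Toy in-line deduplication store; dicts and lists stand in for disk structures."""

    def __init__(self, block_size: int = 4):
        self.block_size = block_size
        self.recipes = {}   # file name -> ordered list of fingerprints
        self.index = {}     # fingerprint -> block address
        self.blocks = []    # block address -> block contents
        self.lookups = 0    # simulated disk accesses on the read path

    def write(self, name: str, data: bytes) -> None:
        recipe = []
        for i in range(0, len(data), self.block_size):
            chunk = data[i:i + self.block_size]
            fp = hashlib.sha1(chunk).hexdigest()
            if fp not in self.index:            # store each unique block once
                self.index[fp] = len(self.blocks)
                self.blocks.append(chunk)
            recipe.append(fp)
        self.recipes[name] = recipe

    def read(self, name: str) -> bytes:
        recipe = self.recipes[name]             # step 1: fetch the file recipe
        self.lookups += 1
        out = bytearray()
        for fp in recipe:
            addr = self.index[fp]               # step 2: fingerprint -> address
            self.lookups += 1
            out += self.blocks[addr]            # step 3: read the data block
            self.lookups += 1
        return bytes(out)

store = DedupStore()
store.write("a.txt", b"AAAABBBBAAAA")
content = store.read("a.txt")
print(content, len(store.blocks), store.lookups)
```

Reading this three-block file costs seven simulated accesses (one for the recipe plus two per block), which is the kind of read amplification the abstract attributes to in-line deduplication.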

IEEE Transactions on Parallel and Distributed Systems | 2018