Weijun Xiao
University of Minnesota
Publications
Featured research published by Weijun Xiao.
high performance computing and communications | 2011
Youngjin Nam; Guanlin Lu; Nohhyun Park; Weijun Xiao; David Hung-Chang Du
Data deduplication has recently become commonplace in most secondary storage, and even in some primary storage, as a capacity optimization. As deduplication is deployed more widely, the read performance of deduplication storage is gaining significance alongside its write performance. In this paper, we emphasize the importance of read performance when reconstituting a data stream from its unique and shared chunks, which are physically dispersed across deduplication storage. We introduce a new read performance indicator called the Chunk Fragmentation Level (CFL). Through a theoretical performance model and extensive experiments, we validate that the CFL is an effective indicator of the read performance of deduplication storage. Finally, we articulate further research issues.
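A minimal sketch of the idea behind a fragmentation indicator. The paper's exact CFL formula is not reproduced in the abstract; this assumes CFL is the ratio of the ideal number of chunk containers for a stream to the number of distinct containers actually touched when reading it back, so 1.0 means a perfectly sequential layout and lower values mean more fragmentation. The container size and chunk layout are illustrative.

```python
# Hypothetical sketch: estimate a chunk-fragmentation-level (CFL) style
# metric for a deduplicated stream. Assumes CFL = ideal container count /
# distinct containers actually read; the paper's exact definition may differ.

def cfl(stream_chunks, container_of, chunks_per_container):
    """stream_chunks: ordered chunk IDs of the stream to reconstitute.
    container_of: maps a chunk ID to the container that stores it."""
    ideal = -(-len(stream_chunks) // chunks_per_container)  # ceiling division
    touched = {container_of[c] for c in stream_chunks}      # containers read
    return ideal / len(touched)

# A stream whose 4 chunks are scattered over 4 containers instead of 1:
layout = {"a": 0, "b": 3, "c": 1, "d": 7}
print(cfl(["a", "b", "c", "d"], layout, chunks_per_container=4))  # 0.25
```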
international conference on parallel processing | 2012
Ruixuan Li; Chengzhou Li; Weijun Xiao; Hai Jin; Heng He; Xiwu Gu; Kunmei Wen; Zhiyong Xu
Large-scale search engines use hard disk drives (HDDs) to store their massive index data because of HDDs' capacity, but their performance is limited by HDDs' relatively low I/O throughput. Caching is an effective optimization, and many caching algorithms have been proposed to improve retrieval performance. However, given the high cost of memory and the huge volume of data, a memory cache of limited capacity cannot fully resolve the problem. In this paper, we adopt a solid state disk (SSD) based storage architecture that uses the SSD as a secondary cache behind memory. We analyze the I/O patterns of search engines and propose SSD-based data management policies for this hybrid storage architecture, covering data selection, data placement, and data replacement. Our main goal is to improve the performance of search engines while reducing the operation cost inside the SSD. The experimental results demonstrate that the proposed architecture improves the hit ratio by 13.31%, overall performance by 41.05%, and the average access time inside the SSD by 43.83%, while reducing block erasure operations by 71.52%.
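A minimal sketch of the lookup path such a hybrid architecture implies, assuming the SSD acts as a victim cache for blocks evicted from the memory cache. The LRU policy and the capacities here are placeholders, not the selection/placement/replacement policies the paper actually proposes.

```python
from collections import OrderedDict

# Hypothetical two-level cache: RAM in front, SSD as a victim cache behind
# it, HDD as the backing store. LRU in both tiers is a placeholder for the
# paper's actual selection/placement/replacement policies.

class TwoLevelCache:
    def __init__(self, ram_slots, ssd_slots):
        self.ram = OrderedDict()
        self.ssd = OrderedDict()
        self.ram_slots, self.ssd_slots = ram_slots, ssd_slots

    def get(self, key, read_from_hdd):
        if key in self.ram:                 # fastest path: memory hit
            self.ram.move_to_end(key)
            return self.ram[key]
        if key in self.ssd:                 # second-level hit: promote
            value = self.ssd.pop(key)
        else:                               # miss in both tiers: go to HDD
            value = read_from_hdd(key)
        self._put_ram(key, value)
        return value

    def _put_ram(self, key, value):
        self.ram[key] = value
        if len(self.ram) > self.ram_slots:  # demote the LRU block to the SSD
            old_key, old_val = self.ram.popitem(last=False)
            self.ssd[old_key] = old_val
            if len(self.ssd) > self.ssd_slots:
                self.ssd.popitem(last=False)

cache = TwoLevelCache(ram_slots=2, ssd_slots=4)
print(cache.get("term:linux", lambda k: f"postings({k})"))
```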
international symposium on parallel and distributed processing and applications | 2012
Weijun Xiao; Xiaoqiang Lei; Ruixuan Li; Nohhyun Park; David J. Lilja
Recent advances in flash memory show great potential to replace traditional hard disk drives (HDDs) with flash-based solid state drives (SSDs), from personal computing to distributed systems. However, there is still a long way to go before SSDs can be used exclusively for enterprise data storage. Considering the cost, performance, and reliability of SSDs, a practical solution is to combine SSDs and HDDs. This paper proposes a hybrid storage system named PASS (Performance-dAta Synchronization - hybrid storage System) to trade off I/O performance against data discrepancy between SSDs and HDDs. PASS pairs a high-performance SSD with a traditional HDD that stores mirrored data for reliability. All I/O requests are redirected to the primary SSD first, and the updated data blocks are then copied to the backup HDD asynchronously. To hide the latency of the copying operations, we use an I/O window to coalesce write requests and maintain an ordered I/O queue to shorten the HDD seek and rotation times. Depending on the characteristics of different I/O workloads, we develop an adaptive policy to dynamically balance foreground I/O processing against background mirroring. We implement a prototype of PASS as a Linux device driver and conduct experiments with the Iometer, PostMark, and TPC-C benchmarks. Our results show that PASS achieves up to 12 times the performance of a RAID1 storage system for the Iometer and PostMark workloads while tolerating less than 2% data discrepancy between the primary SSD and the backup HDD. More interestingly, while PASS does not produce any performance benefit for the TPC-C benchmark, it does allow the system to scale to larger sizes than when using an HDD-based RAID system alone.
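A minimal sketch of the write path PASS describes: writes complete on the SSD immediately, while dirty blocks accumulate in an I/O window and are flushed to the HDD sorted by LBA to shorten seek and rotation time. The drain step runs inline here for brevity, where the real driver would use a background thread and the adaptive balancing policy.

```python
import threading

# Hypothetical sketch of PASS-style asynchronous mirroring: writes hit the
# primary SSD synchronously; dirty blocks are coalesced in an I/O window and
# flushed to the backup HDD in LBA order to shorten seek/rotation time.

class AsyncMirror:
    def __init__(self, ssd_write, hdd_write, window=64):
        self.ssd_write, self.hdd_write = ssd_write, hdd_write
        self.window, self.pending = window, {}   # lba -> latest data
        self.lock = threading.Lock()

    def write(self, lba, data):
        self.ssd_write(lba, data)                # foreground: SSD only
        with self.lock:
            self.pending[lba] = data             # rewrites coalesce in place
            full = len(self.pending) >= self.window
        if full:                                 # real driver: wake a thread
            self.flush()

    def flush(self):
        with self.lock:
            batch, self.pending = self.pending, {}
        for lba in sorted(batch):                # ordered queue for the HDD
            self.hdd_write(lba, batch[lba])

mirror = AsyncMirror(lambda l, d: None, lambda l, d: print("HDD", l), window=2)
mirror.write(9, b"x"); mirror.write(3, b"y")     # flush writes LBA 3, then 9
```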
international conference on computer design | 2012
Zhe Zhang; Weijun Xiao; Nohhyun Park; David J. Lilja
Phase change memory (PCM) is a promising technology for addressing energy and performance bottlenecks in memory and storage systems. To help understand the reliability characteristics of PCM devices, we present a simple fault model that categorizes four types of PCM errors. Based on the proposed fault model, we conduct extensive experiments on real PCM devices at the memory module level. Numerical results uncover many interesting trends in the lifetime of PCM devices and their error behaviors. Specifically, the PCM lifetime for the memory chips we tested exceeds 14 million cycles, much longer than that of flash memory devices. In addition, the distributions of the four types of errors are quite different. These results can be used to estimate PCM lifetime and to measure the fabrication quality of individual PCM memory chips.
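A minimal sketch of what a fault-model classifier of this kind might look like. The abstract does not spell out the four error types, so this assumes a taxonomy based on the direction of the bit flip (0 to 1 vs. 1 to 0) crossed with whether the error persists after a rewrite (stuck-at vs. transient); the paper's actual categories may differ.

```python
# Hypothetical PCM fault-model classifier. Assumed taxonomy: flip direction
# (0->1 vs 1->0) x persistence (stuck-at vs transient) = four error types.

def classify_errors(written, read, reread_after_rewrite):
    """All arguments are equal-length tuples of bits for one memory word."""
    errors = []
    for i, (w, r, rr) in enumerate(zip(written, read, reread_after_rewrite)):
        if w == r:
            continue                              # bit read back correctly
        flip = "0->1" if w == 0 else "1->0"
        kind = "stuck-at" if rr != w else "transient"
        errors.append((i, f"{kind} {flip}"))      # four combinations total
    return errors

print(classify_errors((0, 1, 1, 0), (1, 1, 0, 0), (1, 1, 1, 0)))
# [(0, 'stuck-at 0->1'), (2, 'transient 1->0')]
```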
computing frontiers | 2013
Ding Liu; Ruixuan Li; David J. Lilja; Weijun Xiao
Singular value decomposition (SVD) is a fundamental linear operation used in many applications, such as pattern recognition and statistical information processing. To accelerate this time-consuming operation, this paper presents a new divide-and-conquer approach for computing the SVD on a heterogeneous CPU-GPU system. We carefully design our algorithm to match the mathematical requirements of the SVD to the unique characteristics of a heterogeneous computing platform. This includes a high-performance solution to the secular equation with good numerical stability, overlapping the CPU and GPU tasks, and leveraging the GPU bandwidth in a heterogeneous system. The experimental results show that our algorithm outperforms MKL's divide-and-conquer routine [18] with four cores (eight hardware threads) when the input matrix is larger than 3000. Furthermore, it is up to 33 times faster than LAPACK's divide-and-conquer routine [17], 3 times faster than MKL's divide-and-conquer routine with four cores, and 7 times faster than CULA on the same device when the matrix size grows to 14,000. Our algorithm is also much faster than previous SVD approaches on GPUs.
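For context, LAPACK exposes both a divide-and-conquer SVD driver (gesdd) and the classical QR-iteration driver (gesvd), and SciPy lets you select either, which gives a quick way to reproduce the kind of CPU baseline the paper compares against. The matrix size below is kept small for illustration.

```python
import time
import numpy as np
from scipy import linalg

# Compare LAPACK's divide-and-conquer SVD driver (gesdd) against the
# classical QR-iteration driver (gesvd) on the CPU -- the same class of
# baselines the paper benchmarks its CPU-GPU algorithm against.

a = np.random.rand(1000, 1000)
for driver in ("gesdd", "gesvd"):
    t0 = time.perf_counter()
    u, s, vt = linalg.svd(a, lapack_driver=driver)
    print(f"{driver}: {time.perf_counter() - t0:.3f}s, "
          f"largest singular value {s[0]:.2f}")
```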
The Journal of Supercomputing | 2013
Guoqiang Gao; Ruixuan Li; Weijun Xiao; Zhiyong Xu
Originally the default infrastructure for efficient file sharing, the peer-to-peer (P2P) architecture has achieved great success. The P2P model has since been adopted for many other distributed applications, such as instant messaging and phone services, Internet gaming, and large-scale scientific computing. In recent years, P2P streaming systems have experienced tremendous growth and become one of the largest bandwidth consumers on the Internet. Compared to standard file sharing systems, streaming services show unique characteristics, with more stringent time constraints and much higher network bandwidth requirements. It is therefore extremely important to evaluate and analyze existing applications and to investigate the merits and weaknesses of these systems for future development. In this paper, we conduct a comprehensive measurement study of two of the most popular P2P streaming systems, PPLive and PPStream, which serve millions of registered users with hundreds of live TV channels and millions of other video clips. In our measurements, we deploy collectors in China and evaluate both live TV and video-on-demand (VoD) channels. We record run-time network traffic on the client side and compare and analyze the characteristics of these channels based on their popularity. For both categories, we find that, in general, the two measured P2P streaming systems provide a satisfactory experience for all channels regardless of popularity; however, for unpopular channels most of the data is downloaded from dedicated servers. We also observe that live TV channels show better peer coordination than VoD channels. Besides the traffic, we collected cache replacement information for VoD channels, and these measurement results help us understand the caching mechanism of P2P streaming systems. With the support of the cache, VoD channels outperform their live TV counterparts in terms of data transmission, workload distribution, and signaling traffic overhead. Overall, our results reveal that although P2P streaming systems usually provide an excellent viewing experience for popular channels, fully supporting unpopular channels remains a challenge. New designs and algorithms are urgently needed, especially for unpopular live TV channels.
symposium on computer architecture and high performance computing | 2012
Jiaxi Hu; Zhaosen Wang; Qiyuan Qiu; Weijun Xiao; David J. Lilja
Given an N-point sequence, finding its k largest components in the frequency domain is a problem of great interest. This problem, usually referred to as a sparse Fourier transform, was recently brought back to prominence by a newly proposed algorithm called the sFFT. In this paper, we present a parallel implementation of the sFFT on both multi-core CPUs and GPUs, using a human voice signal as a case study. Using this example, we experimentally estimate k at the 3 dB cutoff points. In addition, we present three optimization strategies. We demonstrate that the multi-core sFFT achieves speedups of up to three times over a single-threaded sFFT, while a GPU-based version achieves up to a ten-times speedup. For large-scale cases, the GPU-based sFFT shows even greater advantages, with roughly a 40-times speedup over the latest out-of-card FFT implementations [2].
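A minimal sketch of the problem being solved: recover the k largest frequency components of a signal. This uses a dense FFT as the reference baseline; the sFFT itself reaches the same answer in sublinear time via hashing and filtering tricks not shown here.

```python
import numpy as np

# Reference baseline for the sparse-recovery problem the sFFT targets:
# compute a dense FFT, then keep the k frequency bins with the largest
# magnitudes. sFFT recovers the same bins in sublinear time for sparse input.

def top_k_frequencies(signal, k):
    spectrum = np.fft.fft(signal)
    bins = np.argsort(np.abs(spectrum))[-k:]       # k strongest components
    return sorted((b, spectrum[b]) for b in bins)

n = 1024
t = np.arange(n)
x = 3 * np.sin(2 * np.pi * 50 * t / n) + np.sin(2 * np.pi * 120 * t / n)
for b, coeff in top_k_frequencies(x, k=4):         # 4 bins: +/-50 and +/-120
    print(f"bin {b}: |X| = {abs(coeff):.1f}")
```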
ieee conference on mass storage systems and technologies | 2011
Biplob Debnath; Srinivasan Krishnan; Weijun Xiao; David J. Lilja; David Hung-Chang Du
Existing garbage collection algorithms for flash-based storage use score-based heuristics to select victim blocks for reclaiming free space and wear leveling. The score for a block is estimated from metadata such as age, block utilization, and erase count. To quickly find a victim block, these algorithms maintain a priority queue in the SRAM of the storage controller. This priority queue takes O(K) space, where K is the flash storage capacity in total number of blocks. As flash capacity grows, K grows with it; however, because of its higher price per byte, SRAM will not scale proportionately. Given this SRAM scarcity, it becomes challenging to implement a large priority queue in the limited SRAM of a high-capacity flash storage device. Beyond the space issue, every metadata update forces the priority queue to be updated, which takes O(lg(K)) operations, a computational overhead that also increases with flash capacity. In this paper, we take a novel approach to the garbage collection metadata management problem of large-capacity flash storage. We propose a sampling-based approach that approximates existing garbage collection algorithms within the limited SRAM space. Since these algorithms are heuristic-based, our sampling-based algorithm performs as well as the unsampled (original) algorithm if good samples are chosen to make garbage collection decisions, and we propose a very simple policy for choosing samples. Our experimental results show that a small number of samples is sufficient to emulate existing garbage collection algorithms.
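A minimal sketch of the sampling idea: instead of maintaining an O(K) priority queue over all blocks, score only a small random sample and reclaim the best-scoring block in it. The cost-benefit score here is the classic (1 - u) / (2u) * age formula from the garbage-collection literature, standing in for whichever metadata-based heuristic the controller actually uses.

```python
import random

# Hypothetical sampling-based victim selection: rather than keep an O(K)
# priority queue in SRAM, draw a small sample of blocks and pick the best-
# scoring one. The score is a classic cost-benefit heuristic, a stand-in
# for whatever score the original algorithm computes from block metadata.

def score(block):
    u = block["utilization"]                      # fraction of valid pages
    if u == 0:
        return float("inf")                       # free block: reclaim first
    return (1 - u) / (2 * u) * block["age"]

def pick_victim(blocks, sample_size=16):
    sample = random.sample(blocks, min(sample_size, len(blocks)))
    return max(sample, key=score)

blocks = [{"id": i, "utilization": random.random(), "age": random.randint(1, 100)}
          for i in range(100_000)]                # K blocks, O(1) SRAM used
print(pick_victim(blocks)["id"])
```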
conference on communication networks and services research | 2011
Ruixuan Li; Guoqiang Gao; Weijun Xiao; Zhiyong Xu
In recent years, peer-to-peer (P2P) streaming systems have experienced tremendous growth and become one of the largest bandwidth consumers on the Internet. PPLive, one of the most popular applications in this category, serves millions of registered users with hundreds of live TV channels and millions of other video clips. Compared to standard file sharing systems, the streaming service shows unique characteristics, with more stringent time constraints and much higher network bandwidth requirements. It is extremely important to evaluate and analyze existing applications and to investigate the merits and weaknesses of these systems for future development. In this paper, we conduct a comprehensive measurement study of PPLive, evaluating both live TV and video-on-demand (VoD) channels. We record run-time network traffic on the client side and compare and analyze the characteristics of these channels based on their popularity. For both categories, we find that, in general, PPLive delivers satisfactory performance if enough concurrent peers are present in a particular channel. We also observe that VoD channels outperform their live TV counterparts in terms of data transmission, workload distribution, and signaling traffic overhead, whereas live TV channels show better peer coordination than VoD channels. Overall, our results reveal that although PPLive can provide an excellent viewing experience for popular channels, fully supporting unpopular channels remains a challenge. New designs and algorithms are urgently needed, especially for unpopular live TV channels.
high performance computing and communications | 2011
Guoqiang Gao; Ruixuan Li; Weijun Xiao; Zhiyong Xu
Today, P2P systems are among the largest consumers of Internet bandwidth. To relieve the burden on the Internet backbone and improve the user access experience, efficient caching strategies should be applied. However, because of its autonomous nature, a fully distributed caching scheme is very difficult to design and implement. Most current P2P caching approaches use a client/server architecture, deploying dedicated proxy servers at the edge of the network. Such an architecture is expensive, introduces single points of failure and hot spots, violates the P2P principle, and fails to utilize the vast resources available on individual peers. In this paper, we investigate techniques for efficient distributed P2P caching. We propose novel placement and replacement algorithms to make caching decisions: for each object, an adequate number of copies is generated and disseminated to topologically distant locations. Combined with the underlying hierarchical query infrastructure, our strategies relieve the over-caching problem for popular objects and provide more cache space for other objects, greatly reducing WAN traffic for P2P applications. We conduct simulation experiments comparing our approaches with several common caching strategies. The results show that our algorithms achieve higher cache hit rates and better load balance.
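A minimal sketch of popularity-aware placement, assuming the copy count for an object grows logarithmically with its request rate and that copies are spread across peers chosen by rendezvous hashing as a proxy for topological distance. Both choices are illustrative stand-ins, not the paper's actual placement and replacement algorithms.

```python
import hashlib
import math

# Hypothetical placement sketch: cap the number of cached copies per object
# by the log of its popularity, and spread those copies over peers chosen by
# rendezvous hashing so replicas land at distinct, well-separated locations.

def copy_count(requests, max_copies=8):
    return min(max_copies, 1 + int(math.log2(max(requests, 1))))

def placement(object_id, peers, requests):
    def weight(peer):
        h = hashlib.sha1(f"{object_id}:{peer}".encode()).hexdigest()
        return int(h, 16)
    ranked = sorted(peers, key=weight, reverse=True)
    return ranked[:copy_count(requests)]          # top-ranked distinct peers

peers = [f"peer{i}" for i in range(20)]
print(placement("video:42", peers, requests=300)) # log2(300)~8.2 -> 8 copies
```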