Kaixi Hou | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Kaixi Hou is active.

Explore More

Publication

Featured researches published by Kaixi Hou.

international conference on supercomputing | 2016

Parallel Transposition of Sparse Data Structures

Hao Wang; Weifeng Liu; Kaixi Hou; Wu-chun Feng

Many applications in computational sciences and social sciences exploit sparsity and connectivity of acquired data. Even though many parallel sparse primitives such as sparse matrix-vector (SpMV) multiplication have been extensively studied, some other important building blocks, e.g., parallel transposition for sparse matrices and graphs, have not received the attention they deserve. In this paper, we first identify that the transposition operation can be a bottleneck of some fundamental sparse matrix and graph algorithms. Then, we revisit the performance and scalability of parallel transposition approaches on x86-based multi-core and many-core processors. Based on the insights obtained, we propose two new parallel transposition algorithms: ScanTrans and MergeTrans. The experimental results show that our ScanTrans method achieves an average of 2.8-fold (up to 6.2-fold) speedup over the parallel transposition in the latest vendor-supplied library on an Intel multi-core CPU platform, and the MergeTrans approach achieves on average of 3.4-fold (up to 11.7-fold) speedup on an Intel Xeon Phi many-core processor.

international conference on supercomputing | 2017

Fast segmented sort on GPUs

Kaixi Hou; Weifeng Liu; Hao Wang; Wu-chun Feng

Segmented sort, as a generalization of classical sort, orders a batch of independent segments in a whole array. Along with the wider adoption of manycore processors for HPC and big data applications, segmented sort plays an increasingly important role than sort. In this paper, we present an adaptive segmented sort mechanism on GPUs. Our mechanisms include two core techniques: (1) a differentiated method for different segment lengths to eliminate the irregularity caused by various workloads and thread divergence; and (2) a register-based sort method to support N-to-M data-thread binding and in-register data communication. We also implement a shared memory-based merge method to support non-uniform length chunk merge via multiple warps. Our segmented sort mechanism shows great improvements over the methods from CUB, CUSP and ModernGPU on NVIDIA K80-Kepler and TitanX-Pascal GPUs. Furthermore, we apply our mechanism on two applications, i.e., suffix array construction and sparse matrix-matrix multiplication, and obtain obvious gains over state-of-the-art implementations.

international parallel and distributed processing symposium | 2016

AAlign: A SIMD Framework for Pairwise Sequence Alignment on x86-Based Multi-and Many-Core Processors

Kaixi Hou; Hao Wang; Wu-chun Feng

Pairwise sequence alignment algorithms, e.g., Smith-Waterman and Needleman-Wunsch, with adjustable gap penalty systems are widely used in bioinformatics. The strong data dependencies in these algorithms, however, prevents compilers from effectively auto-vectorizing them. When programmers manually vectorize them on multi-and many-core processors, two vectorizing strategies are usually considered, both of which initially ignore data dependencies and then appropriately correct in a subsequent stage: (1) iterate, which vectorizes and then compensates the scoring results with multiple rounds of corrections and (2) scan, which vectorizes and then corrects the scoring results primarily via one round of parallel scan. However, manually writing such vectorizing code efficiently is non-trivial, even for experts, and the code may not be portable across ISAs. In addition, even highly vectorized and optimized codes may not achieve optimal performance because selecting the best vectorizing strategy depends on the algorithms, configurations (gap systems), and input sequences. Therefore, we propose a framework called AAlign to automatically vectorize pairwise sequence alignment algorithms across ISAs. AAlign ingests a sequential code (which follows our generalized paradigm for pairwise sequence alignment) and automatically generates efficient vector code for iterate and scan. To reap the benefits of both vectorization strategies, we propose a hybrid mechanism where AAlign automatically selects the best vectorizing strategy at runtime no matter which algorithms, configurations, and input sequences are specified. On Intel Haswell and MIC, the generated codes for Smith-Waterman and Needleman-Wunsch achieve up to a 26-fold speedup over their sequential counterparts. Compared to the highly optimized and multi-threaded sequence alignment tools, e.g., SWPS3 and SWAPHI, our codes can deliver up to 2.5-fold and 1.6-fold speedups, respectively.

international parallel and distributed processing symposium | 2017

Auto-Tuning Strategies for Parallelizing Sparse Matrix-Vector (SpMV) Multiplication on Multi- and Many-Core Processors

Kaixi Hou; Wu-chun Feng; Shuai Che

Because sparse matrix-vector multiplication (SpMV) is an important and widely used computational kernel in many real-world applications, it behooves us to accelerate SpMV on modern multi- and many-core architectures. While many storage formats have been developed to facilitate SpMV operations, the compressed sparse row (CSR) format is still the most popular and general storage format. However, parallelizing CSR-based SpMV on multi- and many-core processors (e.g., CPUs, APUs, GPUs) remains a challenging problem, including dealing with uncoalesced memory access, balancing workload, and identifying the most appropriate parallelizing strategy. In the paper, we propose a novel auto-tuning framework that automatically finds the most efficient parallelizing strategy to achieve high-performance SpMV. Our framework can deter- mine the right binning schemes to group similar workloads into bins (e.g., buckets) with negligible overhead. Then, for each bin, the most suitable kernel is selected to process the rows within. Our framework is input-aware and based on a machine-learning method. The results show that our auto-tuned SpMV performs significantly better than the default SpMV. The speedups on 16 representative matrices range from 1.2x to 52.0x. Compared to the state-of-the-art SpMV kernel, our work yields better performance in most cases, achieving up to a 1.9x speedup.

computing frontiers | 2017

GPU-UniCache: Automatic Code Generation of Spatial Blocking for Stencils on GPUs

Kaixi Hou; Hao Wang; Wu-chun Feng

Spatial blocking is a critical memory-access optimization to efficiently exploit the computing resources of parallel processors, such as many-core GPUs. By reusing cache-loaded data over multiple spatial iterations, spatial blocking can significantly lessen the pressure of accessing slow global memory. Stencil computations, for example, can exploit such data reuse via spatial blocking through the memory hierarchy of the GPU to improve performance. However, approaches to take advantage of such blocking require complex and tedious changes to the GPU kernels for different stencils, GPU architectures, and multi-level cached systems. In this work, we explore the challenges of different spatial blocking strategies over three cache levels of the GPU (i.e., L1 cache, scratchpad memory, and registers) and propose a framework GPU-UniCache to automatically generate codes to access buffered data in the cached systems of GPUs. Based on the characteristics of spatial blocking over various stencil kernels, we generalize the patterns of data communication, index conversion, and synchronization (with abstracted ISA-friendly interfaces) and map them to different architectures with highly optimized code variants. Our approach greatly simplifies the design of efficient and portable stencil computations across GPUs. Compared to stencil kernels based on hardware-managed memory (L1 cache) and other state-of-the-art GPU benchmarks, the GPU-UniCache can achieve significant improvements.

international conference on computational advances in bio and medical sciences | 2015

pDindel: Accelerating indel detection on a multicore CPU architecture with SIMD

Da Zhang; Hao Wang; Kaixi Hou; Jing Zhang; Wu-chun Feng

Small insertions and deletions (indels) of bases in the DNA of an organism can map to functionally important sites in human genes, for example, and in turn, influence human traits and diseases. Dindel detects such indels, particularly small indels (> 50 nucleotides), from short-read data by using a Bayesian approach. Due to its high sensitivity to detect small indels, Dindel has been adopted by many bioinformatics projects, e.g., the 1,000 Genomes Project, despite its pedestrian performance. In this paper, we first analyze and characterize the current version of Dindel to identify performance bottlenecks. We then design, implement, and optimize a parallelized Dindel (pDindel) for a multicore CPU architecture by exploiting thread-level parallelism (TLP) and data-level parallelism (DLP). Our optimized pDindel can achieve up to a 37-fold speedup for the computational part of Dindel and a 9-fold speedup for the overall execution time over the current version of Dindel.

IEEE Transactions on Parallel and Distributed Systems | 2018

A Framework for the Automatic Vectorization of Parallel Sort on x86-Based Processors

Kaixi Hou; Hao Wang; Wu-chun Feng

The continued growth in the width of vector registers and the evolving library of intrinsics on the modern x86 processors make manual optimizations for data-level parallelism tedious and error-prone. In this paper, we focus on parallel sorting, a building block for many higher-level applications, and propose a framework for the Automatic SIMDization of Parallel Sorting (ASPaS) on x86-based multi- and many-core processors. That is, ASPaS takes any sorting network and a given instruction set architecture (ISA) as inputs and automatically generates vector code for that sorting network. After formalizing the sort function as a sequence of comparators and the transpose and merge functions as sequences of vector-matrix multiplications, ASPaS can map these functions to operations from a selected “pattern pool” that is based on the characteristics of parallel sorting, and then generate the vector code with the real ISA intrinsics. The performance evaluation on the Intel Ivy Bridge and Haswell CPUs, and Knights Corner MIC illustrates that automatically generated sorting codes from ASPaS can outperform the widely used sorting tools, achieving up to 5.2x speedup over the single-threaded implementations from STL and Boost and up to 6.7x speedup over the multi-threaded parallel sort from Intel TBB.

ieee international symposium on workload characterization | 2017

A framework for fast and fair evaluation of automata processing hardware

Xiaodong Yu; Kaixi Hou; Hao Wang; Wu-chun Feng

Programming Microns Automata Processor (AP) requires expertise in both automata theory and the AP architecture, as programmers have to manually manipulate state transition elements (STEs) and their transitions with a low-level Automata Network Markup Language (ANML). When the required STEs of an application exceed the hardware capacity, multiple reconfigurations are needed. However, most previous AP-based designs limit the dataset size to fit into a single AP board and simply neglect the costly overhead of reconfiguration. This results in unfair performance comparisons between the AP and other processors. To address this issue, we propose a framework for the fast and fair evaluation of AP devices. Our framework provides a hierarchical approach that automatically generates automata for large datasets through user-defined paradigms and allows the use of cascadable macros to achieve highly optimized reconfigurations. We highlight the importance of counting the configuration time in the overall AP performance, which in turn, can provide better insight into identifying essential hardware features, specifically for large-scale problem sizes. Our framework shows that the AP can achieve up to 461x overall speedup fairly compared to CPU counterparts.

international conference on supercomputing | 2015