Publication


Featured research published by Xinyu Que.


international conference on supercomputing | 2016

Optimizing Sparse Matrix-Vector Multiplication for Large-Scale Data Analytics

Daniele Buono; Fabrizio Petrini; Fabio Checconi; Xing Liu; Xinyu Que; Chris Long; Tai-Ching Tuan

Sparse Matrix-Vector multiplication (SpMV) is a fundamental kernel used by a large class of numerical algorithms. Emerging big-data and machine learning applications are propelling a renewed interest in SpMV algorithms that can tackle massive amounts of unstructured data---rapidly approaching the terabyte range---with predictable, high performance. In this paper we describe a new methodology for designing SpMV algorithms for shared-memory multiprocessors (SMPs) that organizes the original SpMV algorithm into two distinct phases. In the first phase we build a scaled matrix that is reduced in the second phase, providing numerous opportunities to exploit memory locality. Using this methodology, we have designed two algorithms. Our experiments on irregular big-data matrices (an order of magnitude larger than the current state of the art) show quasi-optimal scaling on a large-scale POWER8 SMP system, with an average speedup of 3.8x over an equally optimized version of the CSR algorithm. In terms of absolute performance, our implementation on the POWER8 SMP system is comparable to a 256-node cluster. In terms of size, it can process matrices with up to 68 billion edges, an order of magnitude larger than state-of-the-art clusters.
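The paper's two-phase algorithm is not spelled out in the abstract, but the CSR baseline it is compared against can be sketched as follows (a minimal, single-threaded illustration, not the authors' implementation):

```python
def spmv_csr(values, col_idx, row_ptr, x):
    """Baseline CSR sparse matrix-vector product y = A @ x."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        # Row i's nonzeros live in values[row_ptr[i]:row_ptr[i + 1]].
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[k] * x[col_idx[k]]
    return y

# 2x2 example: A = [[1, 2], [0, 3]], x = [1, 1]
print(spmv_csr([1.0, 2.0, 3.0], [0, 1, 1], [0, 2, 3], [1.0, 1.0]))  # [3.0, 3.0]
```

The irregular, x-dependent access pattern `x[col_idx[k]]` is exactly where memory locality is lost on large matrices, which is what the paper's two-phase reorganization targets.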


international parallel and distributed processing symposium | 2015

Scalable Community Detection with the Louvain Algorithm

Xinyu Que; Fabio Checconi; Fabrizio Petrini; John A. Gunnels

In this paper we present and evaluate a parallel community detection algorithm derived from the state-of-the-art Louvain modularity maximization method. Our algorithm adopts a novel graph mapping and data representation, and relies on an efficient communication runtime specifically designed for fine-grained applications executed on large-scale supercomputers. We have been able to process graphs with up to 138 billion edges on 8,192 Blue Gene/Q nodes and 1,024 P7-IH nodes. Leveraging the convergence properties of our algorithm and the efficiency of our implementation, we can analyze the communities of large-scale graphs in just a few seconds. To the best of our knowledge, this is the first parallel implementation of the Louvain algorithm that scales to these data and processor configurations.
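The Louvain method greedily moves vertices between communities to maximize Newman modularity. A minimal, serial computation of the score being maximized (illustrative only; the paper's distributed algorithm and data representation are not shown here):

```python
def modularity(adj, community):
    """Newman modularity Q of a partition.

    adj: dict mapping vertex -> set of neighbours (undirected graph)
    community: dict mapping vertex -> community id
    """
    two_m = sum(len(nbrs) for nbrs in adj.values())  # each edge counted twice
    deg = {v: len(nbrs) for v, nbrs in adj.items()}
    q = 0.0
    for i in adj:
        for j in adj:
            if community[i] == community[j]:
                a_ij = 1.0 if j in adj[i] else 0.0
                # Actual edge minus the null-model expectation k_i * k_j / 2m.
                q += a_ij - deg[i] * deg[j] / two_m
    return q / two_m

# Two triangles joined by the edge (2, 3), split into their natural communities.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
       3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
community = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
print(round(modularity(adj, community), 3))  # 0.357
```

Louvain repeatedly evaluates the gain in this quantity for moving a single vertex, which is what makes a fine-grained communication runtime essential at scale.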


IEEE Computer | 2015

Optimizing Sparse Linear Algebra for Large-Scale Graph Analytics

Daniele Buono; John A. Gunnels; Xinyu Que; Fabio Checconi; Fabrizio Petrini; Tai-Ching Tuan; Chris Long

Emerging data-intensive applications attempt to process and provide insight into vast amounts of online data. A new class of linear algebra algorithms can efficiently execute sparse matrix-matrix and matrix-vector multiplications on large-scale, shared memory multiprocessor systems, enabling analysts to more easily discern meaningful data relationships, such as those in social networks.


international conference on supercomputing | 2014

Performance Analysis of Graph Algorithms on P7IH

Xinyu Que; Fabio Checconi; Fabrizio Petrini

The IBM Power 775 (P7IH) is a supercomputing system designed for high productivity and high performance. Its key innovation, a hub-chip-based network, gives it superior performance on traditional HPCC benchmarks. In this paper, we characterize in detail the bare network performance using a thin communication stack. Building on that, we present a systematic performance analysis of a data-intensive benchmark, the Graph500 Breadth-First Search, on P7IH. We then provide insight into the overall interaction between hardware and software and present the lessons learned about the key bottlenecks of both the architecture and data-intensive applications.


international parallel and distributed processing symposium | 2016

Subgraph Counting: Color Coding Beyond Trees

Venkatesan T. Chakaravarthy; Michael Kapralov; Prakash Murali; Fabrizio Petrini; Xinyu Que; Yogish Sabharwal; Baruch Schieber

The problem of counting occurrences of query graphs in a large data graph, known as subgraph counting, is fundamental to several domains such as genomics and social network analysis. Many important special cases (e.g. triangle counting) have received significant attention. Color coding is a very general and powerful algorithmic technique for subgraph counting. Color coding has been shown to be effective in several applications, but scalable implementations are only known for the special case of tree queries (i.e. queries of treewidth one). In this paper we present the first efficient distributed implementation of color coding that goes beyond tree queries: our algorithm applies to any query graph of treewidth 2. Since tree queries can be solved in time linear in the size of the data graph, our contribution is the first step into the realm of color coding for queries that require superlinear worst-case running time. This superlinear complexity leads to significant load balancing problems on graphs with heavy-tailed degree distributions. Our algorithm works around high-degree nodes in the data graph, and achieves very good runtime and scalability on a diverse collection of data and query graph pairs. We also provide a theoretical analysis of our algorithmic techniques, exhibiting asymptotic improvements in runtime on random graphs with power-law degree distributions, a popular model for real-world graphs.
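Color coding itself is easiest to see on the simplest query class the paper generalizes from: tree (here, path) queries. A serial sketch under that assumption, not the authors' distributed treewidth-2 algorithm:

```python
import random

def count_paths_color_coding(adj, k, trials=300, seed=0):
    """Estimate the number of simple k-vertex paths via color coding.

    Per trial: colour vertices uniformly with k colours, then count paths
    whose k vertices all have distinct colours ("colourful" paths) by
    dynamic programming over colour subsets. A fixed path is colourful
    with probability k!/k**k, which de-biases the estimate.
    adj: dict mapping vertex -> list of neighbours (undirected graph).
    """
    rng = random.Random(seed)
    fact = 1
    for i in range(2, k + 1):
        fact *= i
    p_colourful = fact / k ** k
    total = 0
    for _ in range(trials):
        colour = {v: rng.randrange(k) for v in adj}
        # dp[v][S] = colourful paths ending at v whose colour set is S.
        dp = {v: {frozenset([colour[v]]): 1} for v in adj}
        for _ in range(k - 1):
            new = {v: {} for v in adj}
            for v in adj:
                for u in adj[v]:
                    for s, cnt in dp[u].items():
                        if colour[v] not in s:  # keep all colours distinct
                            s2 = s | {colour[v]}
                            new[v][s2] = new[v].get(s2, 0) + cnt
            dp = new
        total += sum(cnt for paths in dp.values() for cnt in paths.values())
    # Each undirected path is counted once per direction, hence the /2.
    return total / (trials * p_colourful * 2)
```

On the 3-vertex path graph `{0: [1], 1: [0, 2], 2: [1]}` with `k=3` the estimate converges to 1, the single such path. The subset-DP state is what stays linear for trees and grows superlinear for treewidth-2 queries, as the abstract notes.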


international parallel and distributed processing symposium | 2016

An Early Performance Study of Large-Scale POWER8 SMP Systems

Xing Liu; Daniele Buono; Fabio Checconi; Jee W. Choi; Xinyu Que; Fabrizio Petrini; John A. Gunnels; Jeff A. Stuecheli

In this paper we evaluate the performance of a large-scale POWER8 symmetric multiprocessor (SMP) system with eight processors. We focus our attention on cache and memory subsystems, analyzing the characteristics that have a direct impact on high-performance computing and analytics applications. We provide insight into the relevant characteristics of the POWER8 processor using a set of micro-benchmarks. We also analyze the POWER8 SMP at the system level using the well-known roofline model. Using the knowledge gained from these micro-benchmarks, we optimize three applications and use them to assess the capabilities of the POWER8 system. The results show that the POWER8-based SMP system is capable of delivering high performance for a wide range of applications and kernels.
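The roofline model mentioned above bounds attainable throughput by the lesser of the compute roof and the bandwidth roof. A one-line sketch with illustrative numbers (not measured POWER8 figures):

```python
def roofline(peak_gflops, peak_gbs, flops_per_byte):
    """Attainable GFLOP/s: min of compute roof and bandwidth * intensity."""
    return min(peak_gflops, peak_gbs * flops_per_byte)

# An SpMV-like kernel at 0.25 FLOP/byte on a hypothetical 200 GB/s machine
# is firmly bandwidth-bound:
print(roofline(500.0, 200.0, 0.25))  # 50.0
```

Plotting this bound against each kernel's arithmetic intensity is how the paper positions applications relative to the machine's memory subsystem.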


computing frontiers | 2017

Data Analytics with NVLink: An SpMV Case Study

Daniele Buono; Fausto Artico; Fabio Checconi; Jee W. Choi; Xinyu Que; Lars Schneidenbach

A recent advancement in the world of heterogeneous computing, the NVLink interconnect enables high-speed communication between GPUs and CPUs and among GPUs. In this paper we show how NVLink changes the role GPUs can play in graph and, more generally, data analytics. With the technology preceding NVLink, the processing efficiency of GPUs is limited to data sets that fit into their local memory. The increased bandwidth provided by NVLink imposes a reassessment of many algorithms---including those used in data analytics---that in the past could not efficiently exploit GPUs because of their limited bandwidth towards host memory. Our contributions consist of an introduction to the basic properties of one of the first systems using NVLink, and a description of how one of the most pervasive data analytics kernels, SpMV, can be tailored to the system in question. We evaluate the resulting SpMV implementation on a variety of data sets, and compare favorably with the best results available in the literature.


computing frontiers | 2018

Horizon: a multi-abstraction framework for graph analytics

Adnan Haider; Fabio Checconi; Xinyu Que; Lars Schneidenbach; Daniele Buono; Xian-He Sun

A graph application written using a distributed graph processing framework can perform over an order of magnitude slower than its high-performance, native counterpart. This issue stems from the aim, common to most graph frameworks, of restricting the scope of application development to specific graph constructs, such as vertex or edge programs. In this paper we present Horizon, a distributed graph processing framework that achieves close to native performance without penalizing productivity by providing a multi-layer, multi-abstraction model of computation. Compared to current frameworks, Horizon extends the scope of computation by exposing two notions usually relegated to implementations: graph data models and communication models. Horizon can reduce execution time by an average of 5.3x across different applications and datasets, and can process graphs an order of magnitude larger than the state of the art.


ieee international conference on high performance computing, data, and analytics | 2016

Performance Analysis of Spark/GraphX on POWER8 Cluster

Xinyu Que; Lars Schneidenbach; Fabio Checconi; Carlos H. Andrade Costa; Daniele Buono

POWER8, the latest RISC (Reduced Instruction Set Computer) microprocessor of the IBM Power architecture family, was designed to significantly benefit emerging workloads, including business analytics, cloud computing, and high-performance computing. In this paper, we provide a thorough performance evaluation of a widely used large-scale graph processing framework, Spark/GraphX, on a POWER8 cluster. Note that we use Spark and Java versions out of the box, without any optimization. We examine performance on several important graph kernels, such as Breadth-First Search, Connected Components, and PageRank, using both large real-world social graphs and synthetic graphs with billions of edges. We relate Spark/GraphX performance to architectural characteristics and perform the first Spark/GraphX scalability test with up to 16 POWER8 nodes.
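Of the kernels evaluated, PageRank is the easiest to sketch outside Spark. A plain power-iteration version (illustrative only, not the GraphX implementation):

```python
def pagerank(adj, d=0.85, iters=50):
    """Power-iteration PageRank.

    adj: dict mapping vertex -> list of out-neighbours (directed graph).
    d: damping factor; iters: number of power iterations.
    """
    n = len(adj)
    rank = {v: 1.0 / n for v in adj}
    for _ in range(iters):
        new = {v: (1.0 - d) / n for v in adj}
        for v, out in adj.items():
            if out:
                share = d * rank[v] / len(out)
                for u in out:
                    new[u] += share
            else:
                # Dangling vertex: spread its rank over all vertices.
                for u in adj:
                    new[u] += d * rank[v] / n
        rank = new
    return rank

# A 2-cycle is symmetric, so both vertices settle at 0.5.
print(pagerank({"a": ["b"], "b": ["a"]}))
```

In GraphX the same per-edge "share" propagation becomes a message-passing step, whose shuffle traffic is what ties the framework's performance to the cluster's memory and network characteristics.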


ieee international conference on high performance computing, data, and analytics | 2015

Exploring network optimizations for large-scale graph analytics

Xinyu Que; Fabio Checconi; Fabrizio Petrini; Xing Liu; Daniele Buono
