Intelligent-Unrolling: Exploiting Regular Patterns in Irregular Applications
Changxi Liu, School of Computer Science and Engineering, Beihang University, [email protected]
Hailong Yang, School of Computer Science and Engineering, Beihang University, [email protected]
Xu Liu, Department of Computer Science, College of William and Mary, [email protected]
Zhongzhi Luan, School of Computer Science and Engineering, Beihang University
Depei Qian, School of Computer Science and Engineering, Beihang University, [email protected]
Abstract
Modern optimizing compilers are able to exploit memory access or computation patterns to generate vectorization codes. However, such patterns in irregular applications are unknown until runtime due to the input dependence. Thus, neither the compiler's static optimization nor profile-guided optimization based on specific inputs can predict the patterns for any common input, which leads to suboptimal code generation. To address this challenge, we develop Intelligent-Unroll, a framework to automatically optimize irregular applications with vectorization. Intelligent-Unroll allows the users to depict the computation task using a code seed, with the memory access and computation patterns represented in a feature table and an information-code tree, and generates highly efficient codes. Furthermore, Intelligent-Unroll employs several novel optimization techniques to optimize reduction operations and gather/scatter instructions. We evaluate Intelligent-Unroll with sparse matrix-vector multiplication (SpMV) and graph applications. Experimental results show that Intelligent-Unroll is able to generate more efficient vectorization codes compared to the state-of-the-art implementations.
Keywords: irregular application, data access and instruction pattern, code optimization
1 Introduction

With the SIMD instructions adopted on modern CPU architectures, the performance gap between CPU and memory becomes even larger. Compilers have developed powerful static analysis to accelerate applications automatically by leveraging the SIMD units on CPU. However, it works well only with regular applications. In addition, although complex instructions such as reduction, gather and scatter have been supported on CPU architectures to optimize irregular applications, the performance with compiler optimizations is often sub-optimal. Especially when there are potential write conflicts, the compilers usually give up on SIMD instructions, trading performance for correctness. As the SIMD units become pervasive on modern CPU architectures, leaving the performance on the table for irregular applications, which take a large portion of scientific applications, becomes unacceptable.

Regular applications can be optimized by the static analysis of compilers based on their memory access and instruction patterns. However, for irregular applications, the memory access and instruction patterns can only be analyzed during runtime. Therefore, the compilers fail to identify the performance opportunities for irregular applications. For instance, on SIMD architectures, the compiler fails to optimize the program when confronting potential write conflicts. However, if the runtime behavior of the data accesses can be identified, then we can resolve the write conflicts for better parallelization using the SIMD units. Another example of compiler incapability at optimizing irregular applications can be found at the instruction level.
For the gather/scatter/reduction instructions that are widely used in irregular applications, if we can determine that the runtime data accesses are continuous or fall in the same vector lane, then we can replace the above instructions with load and permutation instructions for better performance.

However, there are several challenges in realizing the potential performance opportunities for irregular applications that cannot be exploited by compilers. Firstly, different from regular applications, the memory access and instruction patterns vary significantly across different irregular applications. Therefore, a general approach should be proposed to adapt to the various behaviors of irregular applications. Secondly, naively unrolling the instructions of irregular applications could lead to memory bloat that prevents further performance optimization. It is mandatory to constrain the memory occupancy when analyzing the runtime behaviors of irregular applications. Thirdly, the optimization method for irregular applications should be able to adapt to various underlying architectures in order to improve its practical adoption.

To address the above challenges, we propose Intelligent-Unroll, a framework for optimizing irregular applications on SIMD architectures automatically. There are three important components in Intelligent-Unroll, including the code seed, feature table, and information-code tree. The design of Intelligent-Unroll is easily extensible by adding new features. Intelligent-Unroll has already integrated several optimization techniques for reduction, gather and scatter instructions for better performance.
When evaluating with representative workloads, Intelligent-Unroll is able to generate more efficient codes on various SIMD architectures compared to the state-of-the-art implementations. Specifically, this paper makes the following contributions:

• We propose Intelligent-Unroll, a framework that identifies the regular patterns within irregular applications, and automatically optimizes the instructions and data synthetically by generating more efficient codes.
• We propose several techniques such as the code seed, feature table and information-code tree to identify the opportunities to replace the reduction instructions with load instructions, and the gather instructions with instruction groups of load, shuffle and select instructions for better performance.
• We evaluate with representative workloads such as SpMV and PageRank on KNL and Intel Xeon CPUs. The experimental results demonstrate that the codes automatically generated by Intelligent-Unroll achieve better performance than the state-of-the-art implementations.

The remainder of this paper is organized as follows. Section 2 presents the background of irregular applications and corresponding optimization methods. Section 3 describes the motivation of our work. Section 4 presents the design overview of Intelligent-Unroll. Section 5 and Section 6 describe the implementation details of the optimizations on reduction and gather operators. Section 7 presents the evaluation results of SpMV and PageRank compared to the state-of-the-art implementations. Section 8 presents the related work in the field, and Section 9 concludes this paper.
2 Background

Irregular applications are common in both traditional research fields such as high performance computing and emerging research fields such as big data analysis and deep learning, which exhibit a constant demand for higher performance. The difference between irregular and regular applications is whether the patterns of data access and instruction can be known before runtime. For irregular applications, the above patterns are strongly correlated with the input data and can only be known during runtime. Such uncertainty of irregular applications introduces difficulties such as irregular memory accesses, unbiased branches and write conflicts for compiler optimization.

For irregular applications, there are two important concepts to describe their data access and instruction patterns: access arrays and data arrays [13]. Algorithm 1 presents two code examples of irregular applications. We can see that the access arrays contain the indirect access or branch execution sequence (line 2 and line 6), whereas the data arrays are mostly accessed indirectly through the access arrays (line 3). Another code example of irregular applications is the inference process of sparse neural networks [20, 25]: although the data arrays during the inference can be updated, the access arrays are immutable or updated infrequently. The above observations inspire us to design a mechanism for uncovering the potential performance of irregular applications and applying corresponding optimizations automatically.
Algorithm 1 The code examples of irregular applications
1: function IrregularMemoryAccess
2:   idx ← Load access_array[...]
3:   data ← Load data_array[idx]
4: function UnbiasedBranches
5:   cond ← Load access_array[...]
6:   if cond then ...

The performance gap between CPU and memory is still increasing. Although multi-level memory hierarchies are introduced to hide memory access latency, they still cannot catch up with the instruction level parallelism developed in hardware such as SIMD, multi-stage pipelines and out-of-order execution. For regular applications, the compilers can generate efficient instructions such as AVX512 through static analysis of the program patterns for optimized performance. However, with irregular applications, the compiler optimization is quite restricted due to the unknown data access and instruction patterns that can only be determined during runtime. For instance, on SIMD architectures, to ensure correctness, the compilers perform almost no vectorization of irregular applications if there are potential memory write conflicts. The conservative optimization strategy of existing compilers wastes the opportunities to exploit the regular patterns within irregular applications for performance optimization.

Similar to regular applications, the optimization of irregular applications also focuses on the temporal and spatial reuse of data, as well as parallel efficiency. There are plenty of research works proposed to adapt irregular applications to underlying architectures. However, most of the above works require tremendous engineering efforts and cannot be easily ported to other architectures. Such ad-hoc optimizations are unsustainable as new architectures and applications are developed at an unprecedented rate, especially in emerging domains such as deep learning.
In addition, the optimization of irregular applications has also been studied in domain specific compilers such as Halide [26], Tensor Comprehensions [28] and TVM [7, 12]. These studies provide efficient ways to generate high performance code for special application domains. These domain specific compilers motivate our work to design a compilation framework for irregular applications that can analyze the data access and instruction patterns to generate efficient code automatically. We choose LLVM [19] as our compilation backend, because the JIT APIs in LLVM allow us to analyze the execution patterns and generate optimized code at runtime.

Figure 1. The memory access optimization of regular application (a), and irregular application (b).
3 Motivation

The memory access patterns of regular applications have already been optimized by the compilers using static analysis [1, 24]. However, the memory patterns of irregular applications are always dictated by the data being processed, which can only be known during runtime. Therefore, the existing compilers fail to optimize the performance for such irregular applications.

The memory access pattern usually has a significant impact on the performance. However, using the existing compiler optimizations sometimes leads to suboptimal performance for irregular applications. For instance, when loading data from discontinuous memory addresses, the compilers always generate a gather instruction for the memory load. However, as shown in the case of Figure 1, replacing the gather instruction (Method 1) with vload instructions (Method 2) achieves better performance. In the case of the regular application as shown in Figure 1(a), the compilers can automatically perform the above optimization through static analysis. Whereas with irregular applications as shown in Figure 1(b), since the memory access pattern can only be recognized during runtime, the compilers generate inefficient codes that load data from memory using the gather instruction.

Moreover, existing compilers are incapable of generating efficient code for the calculation of irregular applications. For instance, to utilize the vector units on SIMD architectures, the calculation dependencies need to be identified for
correct vectorization. For the regular application as shown in Figure 2(a), compilers can identify the calculation dependencies with static analysis and then generate efficient code: the compilers can determine which operations are independent from each other, and leverage such information (Method 2) to optimize performance. However, when dealing with the irregular application in Figure 2(b), the compilers have to assume that the calculations have dependencies with each other to ensure correctness. Whereas the optimization (Method 2) in Figure 2(b) indicates that operation 0 and operation 1 can be processed in parallel, and then operation 2 and operation 3 can be processed in parallel, which leads to better performance. Unfortunately, such optimization opportunities of irregular applications cannot be identified by compilers using static analysis.

Figure 2. The calculation optimization of regular application (a), and irregular application (b).

The above observations indicate that there is a huge space for performance optimization of irregular applications that cannot be achieved by compilers using static analysis. Such performance opportunities within irregular applications can only be identified during runtime, involving both the memory accesses and the calculation instructions.

However, naively unrolling the instructions of irregular applications and then applying optimizations could easily generate a formidable code space, which leads to the instruction bloat problem. In addition, if we use condition statements to select the optimal instructions, the application performance could degrade significantly due to the branch mis-prediction caused by the condition statements. Moreover, empirically writing specific code for each condition is also impractical, which requires tremendous engineering efforts. For instance, if the conditions to be optimized are (k1, k2, k3, ...), then the number of code versions to be written is (k1 × k2 × k3 × ...
).

To overcome the above problems, we propose Intelligent-Unroll, a framework that allows users to provide a code seed to describe the calculation process of the program. Intelligent-Unroll then automatically generates efficient instructions for the program. Specifically, Intelligent-Unroll can identify the regular instruction patterns and optimize them with efficient instructions. To accomplish the above goals, Intelligent-Unroll provides corresponding techniques to tackle the following challenges:

• How to leverage the code seed to describe diverse data access and instruction patterns?
• How to adapt instructions to the behaviors of data accesses for better performance?
• How to optimize the instruction and data access synergistically without violating the correctness?
4 Design Overview

Intelligent-Unroll is designed to identify the regular data access and instruction patterns hidden deeply within irregular applications. The goal of Intelligent-Unroll is to automatically optimize the instructions and data synthetically for the identified performance opportunities.

The design overview of Intelligent-Unroll is shown in Figure 3. The users only need to describe the calculation process using a lambda expression with its input data, and then Intelligent-Unroll interprets the calculation expression and automatically generates an efficient implementation for a particular architecture. The data of the computation task is classified into mutable data and immutable data. The immutable data, which is unchanged during the execution of the task, will be used to generate information for the optimization process. For the optimization process, Intelligent-Unroll firstly interprets the lambda expression and generates the code seed. The instruction patterns contained in the code seed as well as the immutable data are used by the Information Producer (Figure 3(a)) to generate the Feature Table (Figure 3(b)), which includes the information required for further optimization.

The Code Seed describes the calculation process without concerning the optimization. Based on the Code Seed, the Information Producer extracts the calculation patterns to generate the Feature Table, and the Code Optimizer and Data Transfer modules use the Code Seed to generate optimized code. Each column of the Feature Table is the calculation process for one iteration, and the rows represent the iterations. Each element in the Feature Table describes the instruction feature at the current iteration. Each column of the Feature Table is denoted as ops_k, where k is the k-th order. The Feature Table helps us handle various patterns in the irregular applications. We can merge instructions to optimize the execution based on the information provided by the Feature Table.

The Code Optimizer and Data Transfer modules then process the Feature Table to generate the Intermediate Representation (IR) code that is independent from the underlying architecture. Eventually, Intelligent-Unroll lowers the code implementation to LLVM to generate the machine instructions regarding the target architecture.

The design of the Code Optimizer and Data Transfer modules is shown in Figure 3(c). Firstly, the hash value of each column in the Feature Table is generated. The columns with the same hash value exhibit the same calculation pattern. Intelligent-Unroll merges the columns with the same hash value to generate a hash map. This hash map combines the instructions with the same calculation pattern, and thus decreases the memory occupancy during instruction unrolling.

After combining instructions, Intelligent-Unroll continues to process the hash map to merge instructions with the same write location. Figure 4(a) shows an example of two instruction groups writing to the same location. Without merging the instructions, two reduction operations in addition to two read and write operations to the Write Addr are required, which wastes computation resources and memory bandwidth. Figure 4(b) shows the calculation pattern after merging the instructions. We can see that only one reduction operation is required. Although in this case we introduce one extra vector operation, it is far more efficient than a reduction operation. Eventually, the optimized instructions are generated by the Optimization Pass and Rearrange Optimization Info modules, the details of which are described in Section 5 and Section 6.
5 Optimizing the Reduction Operator

The reduction instruction is frequently used in programs. However, the reduction instruction encounters the instruction dependency problem on SIMD architectures for parallelization. Traditional compilers degrade to SISD instructions because they fail to identify the dependencies using static analysis. The pseudo-code shown in Figure 5 serves as an example. Naively applying vectorization could lead to incorrect results, for example when more than two operators write to the same location in one SIMD instruction.

Intelligent-Unroll can analyze the write locations and rearrange the calculation to avoid write conflicts. However, changing the original calculation order may jeopardize the correctness of the program, therefore we need to make sure the correctness is not affected by the calculation rearrangement. The analysis of the calculation rearrangement in terms of program correctness is as follows.

The reduction operator is both associative and commutative. We define the reduction operator as ∗, and thus an example of a reduction operation can be expressed as res = p1 ∗ p2 ∗ p3 ∗ p4. The expression can be transformed to res = (p1 ∗ p2) ∗ (p3 ∗ p4) based on the associative property. Therefore, we can calculate res1 = p1 ∗ p2 and res2 = p3 ∗ p4, and then res = res1 ∗ res2.
It is clear for the reduction operation that we can reduce the partial results in parallel and then reduce the partial results to derive the final result.

Figure 3. The design overview of Intelligent-Unroll, which includes (a) information producer, (b) feature table and (c) code optimizer and data transfer.
Figure 4. An example of merging instruction groups that write to the same location: (a) before the instructions are merged, (b) after the instructions are merged.
Instead of generating code by the distribution of write locations, we generate various reduction operators by the number of reduction operations required. On a SIMD architecture whose vector length is N, we need at most log(N) reduction instructions to complete a SIMD reduction operator. We denote a flag of the reduction operator, which ranges over 0, 1, 2, ..., log(N). For example, when the flag of the reduction operator is M, it means that we need M reduction instructions to complete the SIMD reduction operator.

In addition to the flag, we also need other information. When the flag is M, it requires M vectors, whose dimension is N and where the bit width of each element is log(N). The above information represents the source location of the data to be reduced. As shown in Figure 5(a), R3 requires a reduction operation with each of R0, R1 and R2. Therefore, the shuffle addresses are 3 and 2, and R3 and R2 are moved to the first and second locations of the shuffle data. We can reduce the shuffle data and then the rest of the data together to derive the final results. When the flag is log(N), we can also choose the reduction operator supported by the architecture if it is available.

The commonly used reduction operators include add and multiply. For other reduction operators such as subtraction and division, we can transform them to add or multiply reduction operators with negative variance operators.

The code seed generated does not consider the write conflicts, and the optimization pass module afterwards will process it. Intelligent-Unroll identifies the source instruction that provides the write variance of scatter instructions. The reduction processing module is activated to insert several reduction operations before the scatter instructions if the operation type of the source instruction belongs to the reduction operators. Intelligent-Unroll will generate the reduction instructions according to the information in the corresponding column of the Feature Table.

As shown in Figure 5(b), the Res, which is the value written by the Scatter instruction, is provided by an Add operation, which belongs to the reduction operators. Activated by this condition, Intelligent-Unroll inserts a reduction operation before the Add instruction and then redirects the result to the Add instruction, which corresponds to operations 1 and 2 in the figure.

Figure 5. An example of reduction operator (a) and, (b) corresponding code generation pattern.

Intelligent-Unroll generates optimized codes for the original program. Table 1 provides a comparison of the instructions before and after the optimization. With Intelligent-Unroll, we can reduce the number of calculations on the reduction data from N to 1, and the number of reduction operations from N to M, where M is less than or equal to log(N). Although Intelligent-Unroll introduces additional operations such as permutation, it can still accelerate the calculation process if executing M shuffle operations is faster than the sum of (N-1) calculations and (N-M) reduction operations.

Table 1. The comparison of the instructions before and after the optimization of reduction operator.

            Calculation  Reduction  Permutation
original    N            N          0
optimized   1            M          M

Table 2. The comparison of the data size before and after the optimization of reduction operator.

            vload                                            vstore
            Write Index     Write Data     Additional Data   Write Data
original    N x Bit(Index)  N x Bit(Data)  -                 N x Bit(Data)
optimized   M x Bit(Index)  M x Bit(Data)  M x Bit(Info)     M x Bit(Data)
Figure 6. An example of gather operator (a) and, (b) corresponding code generation pattern.

Figure 7. The distribution of gather instructions that can be replaced by an instruction group of vload and permutation.

Intelligent-Unroll also changes the memory access pattern. From Table 2 we can see that it avoids the redundant memory loads and stores of the write data, whose size is (N - M) x Bit(Data). In addition, Intelligent-Unroll also eliminates unnecessary loads of the write address index, whose size is (N - M) x Bit(Index). However, Intelligent-Unroll also introduces extra overhead: the additional data used by the shuffle instructions is M x N x log(N) bits. Therefore, the performance of memory access can be optimized if the size of the additional data is less than the sum of the write data sizes saved by the optimization.
6 Optimizing the Gather Operator

Gather and scatter instructions are also frequently used in programs on SIMD architectures. We observe that replacing the gather instruction with a group of vload and permutation instructions achieves better performance in several cases. A similar performance improvement is also observed by replacing the scatter instruction with a group of permutation and store instructions. Since the methods of optimizing gather and scatter instructions are similar, we only present the optimization method of the gather instruction in the following.

Unlike the reduction instruction, the sparsity pattern of the data affects the performance opportunity when optimizing the gather operator. For instance, if the sparsity of the data is entirely random, there is hardly a chance to achieve better performance. Fortunately, most sparse data exhibits a regular distribution to some extent. Figure 7 shows the percentage of sparse datasets that achieve better performance when replacing the gather instructions with vload instructions. The sparse datasets include 2,700 matrices from the SuiteSparse Matrix Collection [11]. The x axis in the figure indicates the number of vload instructions, and the y axis indicates the percentage of the entire datasets. The legend of Figure 7 represents the percentage of the gather instructions within the execution on a particular dataset.

From Figure 7 we can see that the datasets where more than 25% of the gather instructions can be replaced by one vload instruction account for 18.4% of the entire datasets. Whereas, 46.9% of the datasets contain more than 25% of the gather instructions that can be replaced with no more than two vload instructions. Moreover, 55.0% of the datasets contain more than 75% of the gather instructions that can be replaced with four vload instructions. It is clear that there is a large performance space in optimizing the gather instructions of irregular applications with sparse data.
Similar to the optimization of the reduction operator, we use a flag to denote the number of vload instructions, and the largest value of the flag is the vector length of the architecture. As with the reduction operator, the optimization of gather instructions also needs additional information, and the bit width of each element in the address vector and the length of the vector are the same as for the reduction operator. The difference from the reduction operator is that we use only one Permutation Address regardless of the value of the flag. To determine which permutation instruction the data in the address vector belongs to, we use additional mask vectors whose number is (flag - 1) for the gather instructions.

Figure 6(a) is an example of the gather operator, where the vector length is four and the bit width of each shuffle vector element is two. In this example, we use two vload instructions to replace one gather instruction. Therefore, the value of the flag is two, and the number of mask vectors is one. First, we load the data ABCD and EFGH into registers using the base addresses D0 and D4. Then, based on the Permutation Address and ABCD, EFGH, we obtain AABB and EEFF by the permutation instruction. After that, we obtain AEFB from AABB and EEFF with the mask 0110 using the select instruction.
To optimize the gather instructions, we replace them with vload, permutation and select instructions. When scanning the code, we consult the corresponding column of the feature table to determine whether there is a performance benefit in replacing the gather instruction with the instruction group (e.g., vload, permutation and select). Then, Intelligent-Unroll performs the code transformation to generate the optimized code. Figure 6(b) shows an example of the code generation for the gather operator. An instruction group including multiple vload, permutation and select instructions is used to replace the original gather instruction. If the flag value equals one, it only requires vload and permutation instructions.

As shown in Table 3, after our optimization of the gather operator, the amount of index data that no longer needs to be loaded is N - M. However, our optimization introduces (M - 1) x N extra data to be loaded, as well as N x log(N) + (M - 1) x N bits to record the additional information. In addition to the memory load overhead, our optimization also requires M instruction groups of vload, permutation and select instructions. Fortunately, on the cache hierarchy of modern processors, the number of cache lines consumed by our method is the same as for the original gather instruction. In addition, the size of the extra data introduced by our method is always smaller than the size of the index data eliminated. Since our method is effective only when the performance improvement of the optimized gather operator outweighs the overhead due to the extra data, we apply the optimization only when the flags indicate there are performance benefits.

Table 3. The comparison of the data size before and after the optimization of gather operator.

            Index           Data               Additional Info
original    N x Bit(Index)  N x Bit(Data)      -
optimized   M x Bit(Index)  M x N x Bit(Data)  N x log(N) + (M - 1) x N

7 Evaluation

We evaluate Intelligent-Unroll on two representative benchmarks, Sparse Matrix-Vector Multiplication (SpMV) and PageRank. The code snippets of SpMV and PageRank are shown in Algorithm 2 and Algorithm 3, respectively. We choose these two benchmarks due to their distinct memory and calculation patterns. From Algorithm 2, we can see that SpMV always writes to the same memory location, whereas PageRank in Algorithm 3 exhibits a random memory write pattern. In addition, the calculation pattern of SpMV is represented by explicit reduction operations, whereas the reduction operations in PageRank are implicit.

Algorithm 2 The code snippet of SpMV in CSR format
1: for i ← 0, m do
2:   for j ← row_ptr[i], row_ptr[i+1] do
3:     y[i] ← y[i] + value[j] × x[col_ptr[j]]

Algorithm 3 The code snippet of PageRank
1: for j ← 0, nedges do
2:   sum[n[j]] ← sum[n[j]] + rank[n[j]] / nneighbor[n[j]]

The experiment platforms are an Intel Xeon Phi CPU (KNL) and an Intel Xeon CPU. The details of the platforms and the evaluation approach are shown in Table 4. The CPU machine is installed with 64-bit Ubuntu v16.04, whereas the KNL machine is installed with CentOS 7.4. icc v19.0.3 and LLVM v8.0.0 are installed on both machines. For SpMV, we compare to the implementations using CSR5 [21] and MKL in addition to the default compiler optimization. For PageRank, we compare to the implementation using the conflict-free method [14] on KNL in addition to the default compiler optimization. We omit the results of the conflict-free method on CPU since it does not support the CPU architecture. The default compiler optimization of SpMV and PageRank uses icc (-O3 -Xhost), which serves as our baseline. For each run, we execute the benchmark 1,000 times and measure the average execution time. Every experiment is evaluated 10 times and the best result is reported.

We select eight datasets from the University of Florida Sparse Matrix Collection to evaluate SpMV. The datasets include regular matrices such as Dense and

Table 4. The platform and benchmark evaluation approach. All experiments are done with a single thread.

Platform: Intel Xeon Phi 7210 (64 cores @ 1.30GHz, 2.66 DP TFlops, 16GB MCDRAM with 400GB/s bandwidth, 384GB DDR4 with 102.4Gbit/s bandwidth).
  SpMV evaluation approach: (1) The CSR-based SpMV compiled by ICC. (2) The CSR-based SpMV with Intel MKL version 2019 Update 3. (3) CSR5-based SpMV [21]. (4) The code generated by Intelligent-Unroll.
  PageRank evaluation approach: (1) PageRank compiled by ICC. (2) The method proposed by Peng Jiang [14]. (3) The code generated by Intelligent-Unroll.

Platform: Intel Xeon E5-2620 v3 (6 cores @ 2.40GHz, 230.40 DP GFlops, 4 x DDR4 with 59 GB/s bandwidth).
  SpMV evaluation approach: (1) The CSR-based SpMV compiled by ICC. (2) The CSR-based SpMV with Intel MKL version 2019 Update 3. (3) CSR5-based SpMV [21]. (4) The code generated by Intelligent-Unroll.
  PageRank evaluation approach: (1) PageRank compiled by ICC. (2) The code generated by Intelligent-Unroll.

Table 5. The datasets used by SpMV and PageRank.

Benchmark  Dataset      row x col    nnz   nnz/row
SpMV       Dense        2K x 2K      4.0M  2K
SpMV       FEM Ship     141K x 141K  -     -
SpMV       Webbase1M    1M x 1M      3.1M  3
SpMV       Wind Tunnel  218K x 218K  -     -
PageRank   amazon0312   401K x 401K  -     -
49K 1.9M 39PageRank amazon0312 401K × × × for PageRank in Algorithm 3, it exhibits a random memorywrite pattern. In addition, the calculation pattern of SpMVis represented by explicit reduction operations, whereas thereduction operations in PageRank are implicit.The experiment platform is an Intel Xeon Phi CPU (KNL)and an Intel Xeon CPU. The details of the platform and eval-uation approach are shown in Table 4. The CPU machine isinstalled with 64-bit Ubuntu v16.04, whereas the KNL ma-chine is installed with CentOS 7.4. The icc v19.0.3 and LLVMv8.0.0 are installed on both machines. For SpMV, we com-pare to the implementations using CSR5 [21] and MKL inaddition to the default compiler optimization. For PageR-ank, we compare to the implementation using conflict-freemethod [14] on KNL in addition to the default compiler op-timization. We omit the results of conflict-free method onCPU since it does not support CPU architecture. The defaultcompiler optimization of SpMV and PageRank uses icc (-O3-Xhost) that serves as our baseline. For each run, we executethe benchmark for 1,000 times, and measure the averageexecution time. Every experiment is evaluated for 10 timesand the best result is reported.We select eight datasets from the University of FloridaSparse Matrix Collection to evaluate SpMV. The datasetsinclude regular matrices such as Dense and
QCD , as wellas irregular matrices such as mip1 and
Webbase-1M . Thedatasets for evaluating PageRank are adopted from [14]. Thedetails of the evaluation datasets are shown in Table 5.
Algorithm 4
The PageRank defined in Intelligent-Unroll
  1: input:
  2:   int *n1, int *n2, double *rank, double *nneighbor
  3: output:
  4:   double *sum
  5: lambda i:
  6:   sum[n[i]] ← sum[n[i]] + rank[n[i]] / nneighbor[n[i]]

In Table 6, we present the percentage of gather/scatter/reduction instructions that can be replaced by load/store (L/S) and vector (Op) instructions for the two benchmarks under different datasets. The second column in Table 6 indicates the number of load/store/vector instructions that would be used to replace the original gather/scatter/reduction instruction. We do not include the results of the scatter instruction in SpMV, since it can be optimized by the static analysis of the compiler. A higher value of L/S means a higher cost of replacing the gather/scatter instruction, whereas Op = 0 means the reduction instructions can be replaced with vector instructions, and Op = 3 means the reduction instruction supported by the underlying architecture achieves better performance.

The SpMV running on the Dense dataset illustrates a perfect case for instruction optimization in Table 6, where each of its gather instructions can be replaced with only one load instruction. In addition, we can optimize the reduction operations (Op = 3) with the reduction instruction provided by the underlying architecture. There are also some cases with hardly any performance opportunity for Intelligent-Unroll, such as Webbase-1M and CirCuit: with L/S = 1, the percentage of replaceable instructions is less than 51%, and even with L/S = 8, the percentage is no more than 44.8% (e.g., the higgs-twitter dataset).
The code snippet of PageRank shown in Algorithm 3 can be defined using Intelligent-Unroll as Algorithm 4. The keywords input (lines 1-2) and output (lines 3-4) define the inputs and outputs of the PageRank algorithm respectively. The lambda expression specifies the calculation details (lines 5-6). Based on Algorithm 4, Intelligent-Unroll can generate an implementation of the PageRank algorithm.

Table 6. The percentage of the gather/scatter/reduction instructions that can be optimized by the load/store/operation instructions for both SpMV and PageRank benchmarks across different datasets. The results are evaluated on a CPU processor with a vector length of 8.

Benchmark        SpMV                                                                      PageRank
Dataset   Dense  FEM Ship  dc2    mip1   Webbase1M  Wind Tunnel  CirCuit  QCD    amazon0312  higgs-twitter  soc-pokec
Gather & Scatter
L/S = 1   100%   15.1%     14.8%  92.5%  5.6%       61.7%        2.6%     40.3%  50.2%       50.9%          50.2%
L/S = 2   0%     84.9%     9.4%   1.5%   53.0%      37.8%        22.5%    45.8%  1.3%        0%             0.5%
L/S = 3   0%     0%        15.7%  1.2%   18.1%      0.5%         28.9%    13.9%  5.0%        0%             0.9%
L/S = 4   0%     0%        23.6%  1.3%   11.0%      0%           22.9%    0%     10.0%       0.1%           1.5%
L/S = 5   0%     0%        20.7%  1.7%   5.4%       0%           14.6%    0%     12.1%       0.2%           2.4%
L/S = 6   0%     0%        10.1%  1.3%   3.0%       0%           6.5%     0%     11.1%       0.7%           4.1%
L/S = 7   0%     0%        4.4%   0.1%   1.6%       0%           1.6%     0%     7.2%        4.2%           9.0%
L/S = 8   0%     0%        1.3%   0.4%   2.3%       0%           0.4%     0%     3.1%        44.8%          31.4%
Reduction
Op = 0    0%     0%        6.5%   0.6%   31.8%      1.0%         18.4%    2.6%   92.0%       100%           100%
Op = 1    0%     2.6%      5.8%   0.7%   32.5%      2.8%         18.4%    2.6%   8.0%        0%             0%
Op = 2    0%     4.7%      9.8%   1.3%   8.1%       3.7%         32.5%    5.1%   0%          0%             0%
Op = 3    100%   92.7%     77.9%  97.4%  27.6%      92.5%        30.7%    89.7%  0%          0%             0%

Table 7. The performance of PageRank across different datasets on KNL and CPU. [Figure: GFlops of the ICC, Conflict-Free and Our Method implementations on KNL and CPU for the amazon0312, higgs-twitter and soc-pokec datasets.]

Table 7 shows the performance comparison of PageRank implemented using Intelligent-Unroll, the conflict-free method and the default compiler optimization on KNL and CPU. We can see that the implementation optimized by our method achieves better performance across almost all datasets on both KNL and CPU. Our method improves the performance of PageRank by 4.8% on average (11.6% at maximum) compared to the baseline on CPU, and by 30.2% and 146.0% on average (68.5% and 158.8% at maximum) compared to the conflict-free and baseline methods respectively on KNL.

On KNL, with the higgs-twitter and soc-pokec datasets, our method achieves performance similar to the conflict-free method, both of which are better than the baseline. This is because icc cannot optimize PageRank due to the potential write conflicts. However, on the amazon0312 dataset, our method outperforms the rest by a large margin. This is because the percentage of instructions with Op = 1 on the amazon0312 dataset is more than 8%, and these instructions are randomly distributed during the execution. Such random distribution degrades the effectiveness of branch prediction in the conflict-free method. In Intelligent-Unroll, by contrast, the code is generated directly for each branch condition without prediction during runtime. Therefore, our method outperforms the conflict-free method on the amazon0312 dataset.

On CPU, with the amazon0312 and higgs-twitter datasets, the performance using Intelligent-Unroll is better than the default compiler optimization. However, on the soc-pokec dataset, the code generated by Intelligent-Unroll is slower than the code optimized by icc. This is because the vector length on CPU is quite limited (e.g., 8 in single precision), which offsets the performance benefit of replacing the reduction instructions with vector instructions.

The reason why our method achieves better performance than the conflict-free method on KNL is twofold: first, our method generates the code for each data access and instruction pattern of PageRank, so it avoids pattern prediction during runtime and thus improves performance; second, in addition to using SIMD instructions, our method also replaces the gather/scatter instructions with load/store instructions. As shown in Table 6, the percentage of instructions with L/S = 1 is high on these datasets.

Algorithm 5
The SpMV defined in Intelligent-Unroll
  1: input:
  2:   int *row_ptr, int *col_ptr, double *x, double *value
  3: output:
  4:   double *y
  5: lambda i:
  6:   y[row_ptr[i]] ← y[row_ptr[i]] + value[i] × x[col_ptr[i]]

The baseline SpMV implementation uses the CSR format, because it decreases memory usage and provides more opportunity for compiler optimization. In Intelligent-Unroll, however, we use COO instead of CSR, which fits well with our optimization method. Algorithm 5 defines SpMV using Intelligent-Unroll (lines 5-6). We can see that the definition using Intelligent-Unroll is more concise than the original definition in Algorithm 2. Intelligent-Unroll automatically optimizes the data accesses and instructions instead of relying on manual optimization. After defining the calculation, users only need to specify the input and output (lines 1-4) in Intelligent-Unroll. Table 8 shows the performance comparison among the methods using the default compiler optimization, MKL, CSR5 and our method on both CPU and KNL.

Table 8. The performance of SpMV across different datasets on KNL and CPU. [Figure: GFlops of the ICC, MKL, CSR5 and Our Method implementations on KNL and CPU for the Dense, FEM Ship, dc2, mip1, Webbase1M, Wind Tunnel, CirCuit and QCD datasets.]
On KNL, our method achieves the best performance on the Dense and mip1 datasets, whereas on the FEM Ship, Wind Tunnel and QCD datasets, the MKL implementations achieve the best performance. The CSR5 implementations achieve the best performance on the dc2 and CirCuit datasets. The reason why our method achieves better performance than the other methods is similar to the PageRank benchmark: our method is able to avoid branch prediction during runtime and improve the memory accesses with load/store instructions. On the datasets where the MKL implementation is better, the reason can be attributed to the split of the writes to the same memory location across different calculation patterns in our method, which increases the load/write instructions to the output vector y. On the datasets where CSR5 achieves the best performance, the data structure of the input matrices is friendly to the CSR5 format and its corresponding calculation pattern, which has not been integrated into the optimization pass of Intelligent-Unroll yet.

On CPU, our method achieves the best performance on the Dense, mip1, Wind Tunnel and CirCuit datasets. On the datasets where our method fails to achieve the best performance, the reason can be attributed to the limited vector length on CPU, which diminishes the advantage of Intelligent-Unroll in avoiding branch prediction during runtime due to the small number of conditions. In sum, compared to the baseline, MKL and CSR5, our method improves the performance of SpMV by 54.8%, 24.9% and 35.7% on average (151.0%, 116.9% and 112.0% at maximum) respectively on KNL, and by 35.9%, 10.1% and 40.5% on average (68.2%, 48.3% and 72.5% at maximum) on CPU.
Designing efficient sparse data formats -
Many sparse data formats have been proposed targeting different sparsity patterns as well as architectural diversity. For instance, block-based formats are widely adopted due to their cache-friendly design [2, 4, 5]. CSR5 [21] and CVR [29] proposed new sparse data formats for SpMV, which focus on optimizing instruction parallelism and load balance. Liu et al. [22] proposed an ELLPACK-based format to accelerate the SpMV kernel on the Intel Xeon Phi processor. Choi et al. [8] proposed to use small sub-blocks, each represented as a dense matrix, to optimize SpMV on GPUs.
Improving the temporal and spatial data reuse -
Since there are many sparse data formats available, determining the appropriate sparse format for an irregular application is not trivial. Friese et al. [9] and Xie et al. [30] proposed different performance models to determine the optimal sparse data format. In essence, their works optimize irregular applications by improving temporal and spatial data reuse with the appropriate sparse format. There are also many works exploring optimizations on distributed-memory architectures [3, 10]. Several loop unrolling strategies [15, 23, 27] have been proposed in the literature. However, these works mainly focus on selecting the optimal tile size and unroll factor when unrolling the loop, and fail to exploit the performance opportunity of optimizing the instructions.
Optimizing parallelization strategies -
Different parallelization strategies have been proposed for optimizing irregular applications on specific architectures [16, 18]. Jiang et al. [14] optimized irregular applications by parallelizing the computation using the powerful SIMD units. Buono et al. [6] proposed optimizations of sparse linear algebra tailored for large-scale graph analytics. Kulkarni et al. [17] proposed a tool called ParaMeter to profile parallelism information of irregular programs.
In this paper, we address the limitation of traditional compilers, which are unable to exploit the performance opportunities of optimizing irregular applications due to their static analysis. We propose Intelligent-Unroll, which identifies the regular patterns within irregular applications and automatically optimizes the data accesses and instructions to generate more efficient code. The experimental results with representative benchmarks on both CPU and KNL processors demonstrate the effectiveness of our approach in optimizing irregular applications for better performance compared to state-of-the-art implementations.

References
[1] Andrew Anderson, Avinash Malik, and David Gregg. 2016. Automatic vectorization of interleaved data revisited. ACM Transactions on Architecture and Code Optimization (TACO) 12, 4 (2016), 50.
[2] Arash Ashari, Naser Sedaghati, John Eisenlohr, and P. Sadayappan. 2014. An efficient two-dimensional blocking strategy for sparse matrix-vector multiplication on GPUs. In Proceedings of the 28th ACM International Conference on Supercomputing. ACM, 273–282.
[3] Ayon Basumallik and Rudolf Eigenmann. 2006. Optimizing irregular shared-memory applications for distributed-memory systems. In Proceedings of the Eleventh ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. ACM, 119–128.
[4] Aydin Buluç, Jeremy T. Fineman, Matteo Frigo, John R. Gilbert, and Charles E. Leiserson. 2009. Parallel sparse matrix-vector and matrix-transpose-vector multiplication using compressed sparse blocks. In Proceedings of the Twenty-first Annual Symposium on Parallelism in Algorithms and Architectures. ACM, 233–244.
[5] Aydin Buluç, Samuel Williams, Leonid Oliker, and James Demmel. 2011. Reduced-bandwidth multithreaded algorithms for sparse matrix-vector multiplication. In 2011 IEEE International Parallel & Distributed Processing Symposium. IEEE, 721–733.
[6] Daniele Buono, John A. Gunnels, Xinyu Que, Fabio Checconi, Fabrizio Petrini, Tai-Ching Tuan, and Chris Long. 2015. Optimizing sparse linear algebra for large-scale graph analytics. Computer 48, 8 (2015), 26–34.
[7] Tianqi Chen, Thierry Moreau, Ziheng Jiang, Haichen Shen, Eddie Q. Yan, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. 2018. TVM: end-to-end optimization stack for deep learning. arXiv preprint arXiv:1802.04799 (2018), 1–15.
[8] Jee W. Choi, Amik Singh, and Richard W. Vuduc. 2010. Model-driven autotuning of sparse matrix-vector multiply on GPUs. In ACM SIGPLAN Notices, Vol. 45. ACM, 115–126.
[9] Luca Daniel, Ong Chin Siong, Low Sok Chay, Kwok Hong Lee, and Jacob White. 2004. A multiparameter moment-matching model-reduction approach for generating geometrically parameterized interconnect performance models. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 23, 5 (2004), 678–693.
[10] Raja Das, Mustafa Uysal, Joel Saltz, and Yuan-Shin Hwang. 1994. Communication optimizations for irregular scientific computations on distributed memory architectures. Journal of Parallel and Distributed Computing 22, 3 (1994), 462–478.
[11] Timothy A. Davis and Yifan Hu. 2011. The University of Florida sparse matrix collection. ACM Transactions on Mathematical Software (TOMS) 38, 1 (2011), 1.
[12] Zachary DeVito, Niels Joubert, Francisco Palacios, Stephen Oakley, Montserrat Medina, Mike Barrientos, Erich Elsen, Frank Ham, Alex Aiken, Karthik Duraisamy, et al. 2011. Liszt: a domain specific language for building portable mesh-based PDE solvers. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis. ACM, 9.
[13] Chen Ding and Ken Kennedy. 1999. Improving cache performance in dynamic applications through data and computation reorganization at run time. In ACM SIGPLAN Notices, Vol. 34. ACM, 229–241.
[14] Peng Jiang and Gagan Agrawal. 2018. Conflict-free vectorization of associative irregular applications with recent SIMD architectural advances. In Proceedings of the 2018 International Symposium on Code Generation and Optimization. ACM, 175–187.
[15] Toru Kisuki, Peter M. W. Knijnenburg, and Michael F. P. O'Boyle. 2000. Combined selection of tile sizes and unroll factors using iterative compilation. In Proceedings 2000 International Conference on Parallel Architectures and Compilation Techniques (Cat. No. PR00622). IEEE, 237–246.
[16] Milind Kulkarni, Martin Burtscher, Calin Cascaval, and Keshav Pingali. 2009. Lonestar: A suite of parallel irregular programs. In 2009 IEEE International Symposium on Performance Analysis of Systems and Software. IEEE, 65–76.
[17] Milind Kulkarni, Martin Burtscher, Rajeshkar Inkulu, Keshav Pingali, and Calin Cascaval. 2009. How much parallelism is there in irregular applications? In ACM SIGPLAN Notices, Vol. 44. ACM, 3–14.
[18] Milind Kulkarni, Patrick Carribault, Keshav Pingali, Ganesh Ramanarayanan, Bruce Walter, Kavita Bala, and L. Paul Chew. 2008. Scheduling strategies for optimistic parallel execution of irregular programs. In Proceedings of the Twentieth Annual Symposium on Parallelism in Algorithms and Architectures. ACM, 217–228.
[19] Chris Lattner and Vikram Adve. 2004. LLVM: A compilation framework for lifelong program analysis & transformation. In Proceedings of the International Symposium on Code Generation and Optimization: Feedback-directed and Runtime Optimization. IEEE Computer Society, 75.
[20] Baoyuan Liu, Min Wang, Hassan Foroosh, Marshall Tappen, and Marianna Pensky. 2015. Sparse Convolutional Neural Networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[21] Weifeng Liu and Brian Vinter. 2015. CSR5: An efficient storage format for cross-platform sparse matrix-vector multiplication. In Proceedings of the 29th ACM International Conference on Supercomputing. ACM, 339–350.
[22] Xing Liu, Mikhail Smelyanskiy, Edmond Chow, and Pradeep Dubey. 2013. Efficient sparse matrix-vector multiplication on x86-based many-core processors. In Proceedings of the 27th International ACM Conference on Supercomputing. ACM, 273–282.
[23] John Mellor-Crummey and John Garvin. 2004. Optimizing sparse matrix-vector product computations using unroll and jam. The International Journal of High Performance Computing Applications 18, 2 (2004), 225–236.
[24] Dorit Nuzman, Ira Rosen, and Ayal Zaks. 2006. Auto-vectorization of interleaved data for SIMD. ACM SIGPLAN Notices 41, 6 (2006), 132–143.
[25] Jongsoo Park, Sheng Li, Wei Wen, Ping Tak Peter Tang, Hai Li, Yiran Chen, and Pradeep Dubey. 2016. Faster CNNs with direct sparse convolutions and guided pruning. arXiv preprint arXiv:1608.01409 (2016).
[26] Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, and Saman Amarasinghe. 2013. Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. In ACM SIGPLAN Notices, Vol. 48. ACM, 519–530.
[27] Mark Stephenson and Saman Amarasinghe. 2005. Predicting unroll factors using supervised classification. In Proceedings of the International Symposium on Code Generation and Optimization. IEEE Computer Society, 123–134.
[28] Nicolas Vasilache, Oleksandr Zinenko, Theodoros Theodoridis, Priya Goyal, Zachary DeVito, William S. Moses, Sven Verdoolaege, Andrew Adams, and Albert Cohen. 2018. Tensor comprehensions: Framework-agnostic high-performance machine learning abstractions. arXiv preprint arXiv:1802.04730 (2018).
[29] Biwei Xie, Jianfeng Zhan, Xu Liu, Wanling Gao, Zhen Jia, Xiwen He, and Lixin Zhang. 2018. CVR: Efficient vectorization of SpMV on x86 processors. In Proceedings of the 2018 International Symposium on Code Generation and Optimization. ACM, 149–162.
[30] Zhen Xie, Guangming Tan, Weifeng Liu, and Ninghui Sun. 2019. IA-SpGEMM: an input-aware auto-tuning framework for parallel sparse matrix-matrix multiplication. In Proceedings of the ACM International Conference on Supercomputing. ACM, 94–105.