Towards an Objective Metric for the Performance of Exact Triangle Count

Abstract

The performance of graph algorithms is often measured in terms of the number of traversed edges per second (TEPS). However, this performance metric is inadequate for a graph operation such as exact triangle counting. In triangle counting, execution times on graphs with a similar number of edges can be distinctly different as demonstrated by results from the past Graph Challenge entries. We discuss the need for an objective performance metric for graph operations and the desired characteristics of such a metric such that it more accurately captures the interactions between the amount of work performed and the capabilities of the hardware on which the code is executed. Using exact triangle counting as an example, we derive a metric that captures how certain techniques employed in many implementations improve performance. We demonstrate that our proposed metric can be used to evaluate and compare multiple approaches for triangle counting, using a SIMD approach as a case study against a scalar baseline.

2020 IEEE High Performance Extreme Computing Conference (HPEC)

Mark P. Blanco∗, Scott McMillan†, Tze Meng Low∗
∗Dept. of Electrical and Computer Engineering, †Software Engineering Institute
Carnegie Mellon University

Pittsburgh, PA, United States
{markb1, scottmc, lowt}@cmu.edu

©2020 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. DOI: Pending Release by IEEE

Index Terms: Performance Metric, Graph Algorithms, Triangle Counting, High Performance, CPU, Performance Measurement

I. INTRODUCTION

It is widely accepted that software-hardware co-design is required for attaining high performance implementations. Therefore, any performance metric used in design and evaluation of an implementation must have properties that bridge the gap between a platform’s capabilities and the operations inherent to the problem.

For many graph algorithms, traversed edges per second (TEPS) is a widely used figure of merit. While TEPS suggests that performance is related to the number of edges that are traversed (read/written during the course of the computation) over time, a more common (mis)use of the metric is simply the number of edges in the graph divided by execution time. In many works on triangle counting specifically, the metric used is the simpler definition [1]–[5]. From here, we refer to this common usage as edges-per-second.

To illustrate the inadequacy of the edges-per-second metric for a graph operation such as exact triangle count, consider the plot in Fig. 1 wherein we report the edges-per-second performance obtained from a sequential implementation of triangle count for several graphs from the Graph Challenge Dataset [3], [6]. For each graph, we report performance numbers for three different ways in which the vertices have been labeled and sorted. Notice that despite having the same number of edges, vertices, and triangles, the performance numbers attained for each graph vary dramatically even when using the edges-per-second metric. In essence, the metric is as informative as raw execution time. It provides little insight into the software innovation or the hardware capabilities that contribute to the attained performance.

[Figure 1: bar chart, "Triangle Counting Edges per Second for Different Graph Labelings". x-axis: Graph (Number of Edges); y-axis: Millions of Edges per Second; series: Labeling 1, Labeling 2, Labeling 3.]

Fig. 1: The inadequacy of the edges-per-second performance metric is demonstrated by how the metric varies widely for each graph that has been sorted in a different order, while still retaining the same number of edges, vertices and triangles.

Samsi et al. [7], [8] use edges-per-second and the metric:

$$T_{tri} = \left(\frac{N_e}{N_1}\right)^{\beta},$$

where $T_{tri}$, $N_e$, and $N_1$ correspond to the execution time, the number of edges in a graph, and the number of edges processed in one second. In this metric, $\beta$ is a value (smaller is better) representing technology advancement that arises from implementation, algorithmic, or hardware improvements. This metric, however, does not explain the difference in execution times seen in Fig. 1. In addition, it is unclear how this metric can be used to account for the different aspects of technology improvements such as hardware differences and algorithmic improvements.

In this paper, we propose two performance metrics (match checks, and match checks per unit time) that we believe are more objective than edges-per-second. These metrics more accurately describe the performance of triangle counting attained in Fig. 1, and better capture the relationship between execution time and the expected amount of work that has to be performed.
In addition, the proposed metrics expose the relationship between the amount of work and the hardware capabilities required to perform that work, which in turn provides new insights into commonly used techniques employed in many triangle counting implementations.

II. PROPERTIES OF A PERFORMANCE METRIC

In this section, we describe properties which we consider desirable in an objective performance metric for hardware-software co-design. We illustrate these properties using dense matrix-matrix multiplication, where the number of floating point operations per second (FLOPS/s) is a well-established performance metric for dense linear algebra.

A. Indicative of the Expected Amount of Work Performed

A performance metric needs to measure the expected amount of useful work performed, as increasing the amount of work necessarily increases the execution time. By work, we mean the basic unit of computation (and data movement) that is necessary to compute the desired result using a generally accepted algorithm. It is necessary to note that the expected amount of work is not the minimum amount of work that could be performed.

To illustrate this, consider multiplying two matrices of sizes $m \times k$ and $k \times n$. The quantity of work (measured in floating point operations) is approximately $2mnk$ using the traditional triply-nested loop algorithm. Algorithmic innovations such as Strassen [9] reduce the amount of work actually performed, resulting in a faster execution time and thus a higher FLOPS/s when the $2mnk$ floating point operations are still used as the expected amount of work. This metric can potentially report performance above the theoretical peak of the hardware [10] because the expected amount of work exceeds that of the actual work.

B. Measures Hardware Capabilities

The performance metric has to be sufficiently low-level to expose hardware capabilities related to the work being performed. Hardware with more capability for computing the basic work unit should yield a correspondingly increased score using the metric. When normalized to the theoretical capability of the available hardware, the performance metric should also indicate how well the available hardware is being utilized.

Increased hardware capabilities such as the introduction of the fused multiply-accumulate (FMA) unit and Single Instruction Multiple Data (SIMD) instruction set extensions can be objectively quantified using FLOPS/s, as these hardware capabilities allow more floating point operations to be computed per unit time. Comparing attained FLOPS/s as a percentage of theoretical peak hardware capability allows for introspection on implementations to determine if more can be done to achieve better performance.
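As a small illustration of normalizing a metric to hardware capability, the sketch below computes attained FLOP/s for a matrix multiplication and expresses it as a fraction of peak. The problem size, the runtime, and the peak rate are hypothetical numbers chosen for the example, and the expected work is assumed to count each multiply and each add separately ($2mnk$).

```python
def matmul_expected_flops(m: int, n: int, k: int) -> int:
    """Expected work of the triply-nested loop algorithm:
    one multiply and one add per (i, j, p) index triple (assumed convention)."""
    return 2 * m * n * k


def attained_flops_per_second(m: int, n: int, k: int, seconds: float) -> float:
    """Expected work divided by measured execution time."""
    return matmul_expected_flops(m, n, k) / seconds


def fraction_of_peak(attained: float, peak: float) -> float:
    """Attained rate normalized to the theoretical peak of the hardware."""
    return attained / peak


# Hypothetical measurement: a 1000 x 1000 x 1000 product taking 0.5 s
# on a machine whose (assumed) theoretical peak is 16 GFLOP/s.
m = n = k = 1000
seconds = 0.5
peak = 16e9

attained = attained_flops_per_second(m, n, k, seconds)
print(f"{attained / 1e9:.1f} GFLOP/s, "
      f"{100 * fraction_of_peak(attained, peak):.0f}% of peak")
```

Because the numerator is the expected work rather than the work actually executed, an implementation that performs fewer operations (for example a Strassen-like algorithm) would report a higher rate for the same inputs.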

C. Captures Implementation Innovations

Implementation innovations, such as tiling and data layout changes, increase performance through improved data access while maintaining the amount of work that is performed. A performance metric should reflect these innovations with a better score despite there being no change in either the hardware or the expected amount of work.

A high performance matrix-matrix multiplication is often implemented as multiple nested loops that partition a matrix into submatrices through loop tiling/blocking [11], [12]. In addition, input matrices are repacked in order to ensure that 1) data is brought into the appropriate level of cache and 2) accesses can be performed with unit stride [13]. These implementation choices maintain the same amount of (useful) work, but increase performance due to better data access. These benefits are demonstrated via a higher FLOPS/s score.

D. Application to Graph Algorithms

While we have identified the desirable properties of a performance metric, the vertices of a graph can be relabeled and reordered without changing the structure of the graph. Moreover, the adjacency matrices of the different isomorphic graphs often exhibit different structures that may change the amount of work that needs to be computed. As a convention, we take the original unsorted graph as the canonical graph from which the expected amount of work is computed.

III. DERIVING A METRIC FOR TRIANGLE COUNT

For the sake of completeness, we begin with a brief description of triangle count and commonly used approaches for computing the number of triangles in a graph. We then derive a plausible performance metric for the triangle count operation: match checks.

A. Triangle Counting

Triangle counting, as its name suggests, counts the number of triangles in an undirected graph $G$. This graph can be represented by its adjacency matrix, which is a symmetric sparse matrix. For the purposes of this paper, we assume that only the lower-triangular portion of the adjacency matrix is stored.

[Figure 2: diagram of two connected vertices $v$ and $u$ with their lower neighborhoods $N_l(v)$ (vertices $x_i$ and $u$) and $N_l(u)$ (vertices $y_j$), annotated with $\Delta = \Delta + |(N_l(v) \setminus \{u\}) \cap N_l(u)|$, where $u < v$, $v > x_i$, and $u > y_j$.]

Fig. 2: Diagram of connected vertices $v$ and $u$. To count triangles, their lower neighborhoods are intersected, with $u$ subtracted from $N_l(v)$. $\Delta$ is the triangle count.

Processing only the lower-triangular (or upper) part of the adjacency matrix allows one to count each triangle exactly once. This approach is common in many implementations [14]–[16]. Elements in this part of the matrix correspond to edges leading from one vertex to a neighbor with a lower vertex ID. This portion of each vertex’s neighborhood is referred to as the lower neighborhood, denoted as $N_l(v)$ for vertex $v$.

B. Exact triangle count with wedges and intersections

In general, there are two approaches to counting the number of triangles in a graph. The first method is the wedge-check method, where given a wedge (i.e. three vertices connected with two edges), a check is performed to test if there exists an edge that closes the wedge into a triangle. The second approach, based on set intersection, identifies if there exists a common vertex in the neighborhoods of the two vertices of an edge.

We demonstrate the equivalence of the two approaches using Fig. 2. In the following discussion, $v > u$ for both approaches. In the wedge-check method, the wedges between each pair of vertices $v$, $u$ are found by way of the shared neighbors (e.g. vertices that carry both an $x_i$ and a $y_j$ label in the diagram) between their lower neighborhoods ($N_l(v)$, $N_l(u)$). The existence of the triangle is then confirmed by checking for the edge directly connecting $v$ to $u$. In the set-intersection approach, for the endpoints of an existing edge $(v, u)$, their lower neighborhoods are intersected to find shared neighbors (again, the vertices labeled with both an $x_i$ and a $y_j$ in the diagram). Because the intersection is only performed given that edge $(v, u)$ exists, each shared neighbor found in the intersection indicates a triangle.

In practice, work for the first approach is often pre-filtered by existing edges, fusing the two steps and making the two approaches equivalent. These ways of describing triangle counting mechanics are also compatible with sparse-matrix based approaches. In particular, the work by Wolf et al. for miniTri specifically highlights a wedge-check approach based on a linear algebraic specification [17].

Algorithm 1:

Scalar merge-based set intersection kernel. An asterisk (*) before an address symbol indicates access to memory.

Input: a1 and a2 initially give the start addresses of each set. a1_end and a2_end are the addresses just after the end of each set.

Output: ∆ is the number of matches among both sets.

    ∆ ← 0
    while a1 < a1_end ∧ a2 < a2_end do
        if *a1 == *a2 then
            a1 ← a1 + 1
            a2 ← a2 + 1
            ∆ ← ∆ + 1
        else if *a1 > *a2 then
            a2 ← a2 + 1
        else
            a1 ← a1 + 1
    return ∆

From either perspective, the work of finding wedges that may close into triangles constitutes an intersection of $N_l(v)$ and $N_l(u)$. Algorithm 1 illustrates the core operation in triangle counting by way of a naive approach for set intersection, commonly referred to as merge-based set intersection as in [18], [19].

C. Expected work for Exact Triangle Count
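As a reference point for this section, the merge-based kernel of Algorithm 1 can be sketched in Python with an extra counter that records one match check per loop iteration. This is a minimal illustration, not the authors' implementation: it uses list indices instead of addresses, and the function name `merge_intersect` and the assumption of sorted, duplicate-free inputs are ours.

```python
from typing import Sequence, Tuple


def merge_intersect(a: Sequence[int], b: Sequence[int]) -> Tuple[int, int]:
    """Merge-based intersection of two sorted, duplicate-free ID lists.

    Returns (matches, match_checks), where match_checks is the number of
    loop iterations, i.e. the number of comparisons of a[i] against b[j].
    """
    i = j = 0
    matches = checks = 0
    while i < len(a) and j < len(b):
        checks += 1              # one match check per iteration
        if a[i] == b[j]:
            matches += 1
            i += 1
            j += 1
        elif a[i] > b[j]:
            j += 1               # advance the side holding the smaller ID
        else:
            i += 1
    return matches, checks


# Two neighborhoods of size 8 under three ID distributions (hypothetical data).
identical = merge_intersect(list(range(8)), list(range(8)))
disjoint = merge_intersect(list(range(8)), list(range(8, 16)))
interleaved = merge_intersect(list(range(0, 16, 2)), list(range(1, 16, 2)))
print(identical, disjoint, interleaved)
```

The three calls have equally sized inputs, yet the number of match checks differs, and the two inputs with no common vertex still require work before the absence of a match is known.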

Given the previous discussion, the core operation in triangle counting can be viewed as finding the size of neighborhood set intersections. Given two neighborhoods of vertices, the first vertices of the two neighborhoods are compared (intersected). In the case of a match, the match is recorded and the next vertices in each neighborhood are compared. Otherwise, the neighborhood whose current vertex has the smaller ID advances, and its next vertex is compared with the vertex with the larger ID. This process is repeated until all vertices in one or both neighborhoods have been checked.

Notice that this work has to be performed even if there are no triangles in the graph, since the absence of any triangle can only be ascertained after iterating through the neighborhoods. This suggests that the number of match checks a graph requires better reflects the amount of work in triangle counting than the number of edges does. In addition, the number of match checks performed by triangle counting equals the number of iterations executed by the loop in Algorithm 1. From this point in the paper, we refer to such wedge or vertex comparisons as match checks or matches.

D. Match Checks in Hardware

As a metric, match checks do not prescribe which instructions or implementation should be used, but generally indicate that some form of compare instruction followed by conditional updates is required for each match check. Possible implementations of Algorithm 1 include 1) a compare operation followed by branches, and 2) predicated instructions to avoid branches [19].

However, knowing how match checks are mapped (either manually or by the compiler) to specific instructions or hardware components allows us to use match checks as a proxy for the amount of hardware resources required to compute the necessary amount of work. For example, mapping match checks to a compare instruction naturally limits the rate of match checks to the throughput of the compare (CMP) instruction on a given architecture.

[Figure 3: scatter plot, "Number of Match Checks and Edges v.s. Graph Runtime". x-axis: Triangle Count Runtime for Graph (s); y-axis: Number of Match Checks and Edges.]

Fig. 3: Number of match checks and edges plotted against runtime for scalar merge-sort-based triangle counting, for 3 graph orderings per graph. Increased numbers of match checks and edges both generally correspond to increased runtime. Match checks show a stronger fit to runtime (r = 0.…) compared to the raw number of edges (r = 0.…).

IV. APPLYING MATCH CHECKS AS A METRIC

In this section, we demonstrate that match checks are an acceptable metric for work in triangle counting.

A. Matches Reflecting Work in Triangle Count

We validate the use of match checks as a metric of the work for triangle count in Fig. 3. Each point represents a real or synthetic graph from the Graph Challenge dataset, processed using a sequential merge-based triangle count implementation [3], [6]. The number of match checks per graph, shown in blue, was obtained from a second non-timing run. As expected, the runtime increases with an increased number of match checks across the 93 graphs. Hence, match checks are a representative metric for the work in triangle counting. By contrast, we also show the number of edges in each graph in red in Fig. 3. While there is clearly some correlation between the number of edges in a graph and the runtime, it is a weaker one.
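The kind of non-timing run described above can be sketched as follows: a sequential triangle count over lower neighborhoods that also tallies match checks. This is an illustrative Python sketch under our own assumptions (an edge-list input, dictionary-based lower neighborhoods, and intersecting $N_l(v)$ with $N_l(u)$ directly, which gives the same matches because $u$ never appears in $N_l(u)$); it is not the implementation measured in the paper.

```python
from collections import defaultdict


def lower_neighborhoods(edges):
    """Map each vertex to the sorted list of its neighbors with a smaller ID."""
    lower = defaultdict(set)
    for a, b in edges:
        if a != b:                       # ignore self loops
            lower[max(a, b)].add(min(a, b))
    return {v: sorted(ns) for v, ns in lower.items()}


def count_triangles(edges):
    """Return (triangles, match_checks) for merge-based triangle counting."""
    lower = lower_neighborhoods(edges)
    triangles = checks = 0
    for v, nv in lower.items():
        for u in nv:                     # existing edge (v, u) with u < v
            nu = lower.get(u, [])
            i = j = 0
            while i < len(nv) and j < len(nu):
                checks += 1              # one match check per iteration
                if nv[i] == nu[j]:
                    triangles += 1
                    i += 1
                    j += 1
                elif nv[i] > nu[j]:
                    j += 1
                else:
                    i += 1
    return triangles, checks


# Hypothetical toy graphs: a star and a path have the same number of edges
# and no triangles, but require different numbers of match checks.
star = [(4, 0), (4, 1), (4, 2), (4, 3)]
path = [(0, 1), (1, 2), (2, 3), (3, 4)]
k4 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
print(count_triangles(star), count_triangles(path), count_triangles(k4))
```

Even on these toy inputs, edge count alone does not determine the amount of comparison work, which is the point the correlation in Fig. 3 makes at scale.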

B. Matches Reflecting Implementation Innovation

It has been observed in numerous works that graph reordering can improve triangle counting performance. Chiba and Nishizeki observed that processing vertices in decreasing order of degree enables better running-time bounds for triangle counting based on the graph’s arboricity [20]. Recent Graph Challenge implementations such as Bisson et al. attribute improvements in performance to more balanced threads on the GPU [21]. More generally for graph algorithms, Balaji et al. note improvements due to improved memory layout and access patterns [22]. Using the proposed match-check metric, we show that new insights can be gained into the need for sorting.

Fig. 4: Speedup over non-sorted graph plotted against the change in number of match checks performed for different graph sorts. Different markers represent sort orders. Different sort orders are necessary on different graphs to reduce match checks and execution time.

Observation 1: Sorting changes the number of match checks performed.

Besides load-balancing, sorting in the appropriate fashion can significantly reduce the amount of work. Fig. 4 shows the speedup of triangle counting against the decrease (below 1) or increase (above 1) in proportion of match checks over the unsorted original graph for sequential triangle count. The number of match checks is changed by relabeling vertices based on degree in decreasing order (blue) or increasing order (red). For most graphs, we observe that sorting changes the number of match checks performed, and thus a corresponding change in execution time is observed.
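The effect in Observation 1 can be reproduced on a toy graph. The sketch below relabels vertices by degree and recounts match checks with the same merge-based scheme. The helper names, the tie-breaking rule (original ID), and the convention that the first vertex in the sort order receives ID 0 are assumptions made for this example only.

```python
from collections import Counter, defaultdict


def count_with_checks(edges):
    """Return (triangles, match_checks) using lower neighborhoods and merging."""
    lower = defaultdict(set)
    for a, b in edges:
        if a != b:
            lower[max(a, b)].add(min(a, b))
    lower = {v: sorted(ns) for v, ns in lower.items()}
    triangles = checks = 0
    for v, nv in lower.items():
        for u in nv:
            nu = lower.get(u, [])
            i = j = 0
            while i < len(nv) and j < len(nu):
                checks += 1
                if nv[i] == nu[j]:
                    triangles += 1
                    i += 1
                    j += 1
                elif nv[i] > nu[j]:
                    j += 1
                else:
                    i += 1
    return triangles, checks


def relabel_by_degree(edges, decreasing=True):
    """Assign new IDs 0, 1, 2, ... in order of vertex degree (ties by old ID)."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    sign = -1 if decreasing else 1
    order = sorted(degree, key=lambda v: (sign * degree[v], v))
    new_id = {v: i for i, v in enumerate(order)}
    return [(new_id[a], new_id[b]) for a, b in edges]


# Hypothetical "fan" graph: hub 0 joined to a path 1-2-3-4.
fan = [(0, 1), (0, 2), (0, 3), (0, 4), (1, 2), (2, 3), (3, 4)]
original = count_with_checks(fan)
decreasing = count_with_checks(relabel_by_degree(fan, decreasing=True))
increasing = count_with_checks(relabel_by_degree(fan, decreasing=False))
print(original, decreasing, increasing)
```

All three labelings describe the same graph and find the same triangles, but the match-check totals are not all equal, so relabeling alone changes the amount of work performed.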

Observation 2: Different label orders are required for different graphs.

While sorting vertices by decreasing vertex degree often leads to lower execution time, a number of graphs benefit from sorting in the reversed order. This is demonstrated in Fig. 4 when certain red triangular markers are in the bottom right quadrant, while a small number of blue circular markers are in the top left quadrant. This suggests a possible area of future research: heuristics that determine the appropriate sort order or alternative labeling for a given graph.

C. Matches Reflected in Hardware Capability

Here we introduce a case study of using match checks to evaluate an alternate approach (SIMD) against a baseline (merge-based scalar) with a focus on hardware-software co-design. Single-instruction multiple-data (SIMD) hardware is commonly used in regular applications due to the higher throughput it affords. Use of SIMD in graphs is less common as graph algorithms (inclusive of triangle counting) are considered irregular. Using match checks, we showcase that SIMD is measurably faster than the scalar approach owing to performing more effective match checks per cycle.

Recall that match checks can be used as a proxy for the number of times a compare (CMP) instruction is required. The use of SIMD compare instructions (e.g. VPCMPEQ) can increase the number of comparisons that are performed per instruction, thus increasing the overall throughput.

We implemented multiple SIMD set intersection kernels, one of which is shown in Algorithm 2. This kernel compares eight vertices from each neighborhood against each other. This effectively computes 64 match checks within the kernel. However, notice that the number of SIMD match checks corresponding to match checks which the scalar implementation would have performed depends on the distribution of vertex IDs in each vector. The impact of the different distribution of vertex IDs is illustrated in Fig. 5. In our analysis, we therefore consider effective match checks, where performance of the SIMD approach is measured based on the number of effective match checks it performs relative to the scalar baseline. The remaining match checks performed in the SIMD approach are considered wasted work.

Algorithm 2: SIMD 8x8 set intersection kernel. Leftover set elements go to the scalar kernel.

Input: a1 and a2 initially give the start addresses of each set. a1_end and a2_end are the addresses just after the end of each set.

Output: ∆ is the number of matches among both sets.

    ∆ ← 0
    while a1 < a1_end ∧ a2 < a2_end do
        a1_max ← a1[7]
        a2_max ← a2[7]
        // Early termination checks
        a1_min ← a1[0]
        a2_min ← a2[0]
        if a1_max < a2_min then
            a1 ← a1 + 8; continue
        if a2_max < a1_min then
            a2 ← a2 + 8; continue
        vector_1[0:7] ← a1[0:7]      // packed load
        vector_2_1[0:7] ← a2[0]      // broadcast
        // Omitted: similar broadcasts for a2[1] through a2[7]...
        cmp_mask_1 ← cmp_eq(vector_1, vector_2_1)
        cmp_mask_2 ← cmp_eq(vector_1, vector_2_2)
        cmp_mask_3 ← cmp_eq(vector_1, vector_2_3)
        cmp_mask_4 ← cmp_eq(vector_1, vector_2_4)
        cmp_mask_5 ← cmp_eq(vector_1, vector_2_5)
        cmp_mask_6 ← cmp_eq(vector_1, vector_2_6)
        cmp_mask_7 ← cmp_eq(vector_1, vector_2_7)
        cmp_mask_8 ← cmp_eq(vector_1, vector_2_8)
        // Omitted: logically or all cmp_masks together into and_mask...
        ∆ ← ∆ + popcount(and_mask)
        if a1_max ≤ a2_max then
            a1 ← a1 + 8
        if a2_max ≤ a1_max then
            a2 ← a2 + 8
    return ∆

Fig. 5: Different distributions of vertex IDs result in different numbers of match checks. Arrows show the match checks that will be performed by Algorithm 1. The interleaved pattern has the largest number of match checks (15) between two neighborhoods of size 8 each.

The performance of triangle count using the scalar and SIMD versions of the set-intersection kernels, in match checks-per-cycle, is reported in Fig. 6 for the following systems:
– Intel Xeon E5-2667 CPU (Haswell)
– Intel i7-7700K CPU (Kaby Lake)
– Intel Xeon Platinum 8153 CPU (Skylake-X)

Sequential performance numbers are shown for all graphs, which are relabeled in decreasing vertex degree order. Sorting time is NOT included in the overall execution time. The graphs in each plot are ordered by increasing graph size (size in memory).

The utility of SIMD set intersection is mixed on the sorted graphs. For smaller graphs, scalar outperforms SIMD. This is likely due to smaller neighborhoods in small graphs reducing opportunities to apply the SIMD kernels.
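To make the distinction between performed and effective match checks concrete, the following Python sketch emulates the block structure of the 8x8 kernel on sorted lists and compares it with the scalar kernel on the same inputs. It is only an emulation under stated assumptions: the all-pairs comparison stands in for the eight broadcast-and-compare steps, the block loop is assumed to run only while a full block of eight remains on both sides, and the function names are ours.

```python
W = 8  # vector width assumed for the emulation


def scalar_kernel(a, b, i=0, j=0):
    """Merge-based intersection from positions (i, j): (matches, match_checks)."""
    matches = checks = 0
    while i < len(a) and j < len(b):
        checks += 1
        if a[i] == b[j]:
            matches += 1
            i += 1
            j += 1
        elif a[i] > b[j]:
            j += 1
        else:
            i += 1
    return matches, checks


def simd8x8_emulated(a, b):
    """Emulate the block kernel on sorted, duplicate-free lists.

    Returns (matches, vector_comparisons, leftover_scalar_checks).
    """
    i = j = 0
    matches = comparisons = 0
    while i + W <= len(a) and j + W <= len(b):
        a_blk, b_blk = a[i:i + W], b[j:j + W]
        a_min, a_max = a_blk[0], a_blk[-1]
        b_min, b_max = b_blk[0], b_blk[-1]
        # Early termination checks: the blocks cannot overlap.
        if a_max < b_min:
            i += W
            continue
        if b_max < a_min:
            j += W
            continue
        # All 64 pairs are compared, whether or not the scalar kernel
        # would have visited them.
        matches += sum(1 for x in a_blk for y in b_blk if x == y)
        comparisons += W * W
        if a_max <= b_max:
            i += W
        if b_max <= a_max:
            j += W
    tail_matches, tail_checks = scalar_kernel(a, b, i, j)  # leftover elements
    return matches + tail_matches, comparisons, tail_checks


# Interleaved IDs, the worst case for the scalar kernel at size 8.
evens, odds = list(range(0, 16, 2)), list(range(1, 16, 2))
simd = simd8x8_emulated(evens, odds)
effective = scalar_kernel(evens, odds)[1]
print(simd, effective, simd[1] - effective)
```

For this input the block kernel performs 64 comparisons while the scalar baseline needs 15, so 49 of the comparisons count as wasted work under the definition above.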
The added logic for switching between multiple kernels and the wasted work in the SIMD approach are potential reasons for diminished performance of the SIMD implementation. Additionally, the standard deviation of the lower-neighborhood size for most graphs is reduced after sorting, so there are fewer large neighborhoods that SIMD can process. For larger graphs, in spite of the wasted work, SIMD is often faster than the scalar implementation. This implementation innovation is reflected in the effective match checks per cycle shown in Fig. 6, particularly for larger graphs.

V. DISCUSSION AND CONCLUSION

In this work, we proposed the use of match checks and match checks per cycle as a performance metric for exact triangle counting. This is motivated by the need for a metric that explains the performance of triangle counting implementations in terms of the amount of work and the hardware capabilities of the system.

[Figure 6: three bar-chart panels, "Sequential Triangle Count on Haswell", "Sequential Triangle Count on Kaby Lake", and "Sequential Triangle Count on Skylake X". y-axis: Effective Match Checks per Cycle; x-axis: graphs ordered by size; series: Scalar, SIMD.]

Fig. 6: Scalar and SIMD triangle count performance, in effective match checks per cycle, for a variety of graphs from the SNAP dataset on Top) Haswell, Middle) Kaby Lake, and Bottom) Skylake-X architectures. Use of SIMD instructions results in higher attained performance for graphs with larger neighborhoods.

We demonstrated that the metric represents the amount of work that has to be performed in triangle counting. The metric also provides new insights into commonly employed techniques found in many triangle counting implementations, such as sort-based relabeling. While we have focused on only sequential implementations, we believe that extending the metric to represent parallel implementations should be straightforward.

The observant reader may note that match checks and the number of actually traversed edges in the graph are very similar in numeric value. In fact, the actual number of edges traversed equals the number of match checks plus the number of triangles in the graph. The case we make in this work is not that match checks are better than counting the number of truly traversed edges, but that edges-per-second as used in many recent works is not appropriate and some hardware-focused metric like match checks is needed for hardware-software co-design.

More generally, we believe that the approach we took to identify the proposed metric can be replicated for different (classes of) graph algorithms to identify objective metrics. Having better metrics would provide greater insights into the amount of work to be performed and the hardware capabilities that are required.
These insights could lead to faster and more efficient software implementations, and could also suggest hardware features needed by graph algorithms.

Identified metrics for one graph algorithm could potentially apply to other graph algorithms. For example, the set intersection kernel in triangle counting is similar to a sparse dot product. This suggests that other graph algorithms and sparse linear algebraic workloads may benefit from similar metrics for performance evaluation. We will pursue these directions in future work.

ACKNOWLEDGEMENT

This material is based upon work funded and supported by the Department of Defense under Contract No. FA8702-15-D-0002 with Carnegie Mellon University for the operation of the Software Engineering Institute, a federally funded research and development center [DM20-0657].

Mark Blanco is supported by the National Science Foundation Graduate Research Fellowship Program under Grant No. DGE 1745016. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

REFERENCES

[1] M. M. Wolf, M. Deveci, J. W. Berry, S. D. Hammond, and S. Rajamanickam, “Fast linear algebra-based triangle counting with KokkosKernels,” in . Waltham, MA: IEEE, Sep. 2017, pp. 1–7. [Online]. Available: http://ieeexplore.ieee.org/document/8091043/
[2] Y. Hu, P. Kumar, G. Swope, and H. H. Huang, “TriX: Triangle counting at extreme scale,” in . Waltham, MA: IEEE, Sep. 2017, pp. 1–7. [Online]. Available: http://ieeexplore.ieee.org/document/8091036/
[3] S. Samsi, V. Gadepally, M. Hurley, M. Jones, E. Kao, S. Mohindra, P. Monticciolo, A. Reuther, S. Smith, W. Song, D. Staheli, and J. Kepner, “Static Graph Challenge: Subgraph Isomorphism,” Aug. 2017. [Online]. Available: https://arxiv.org/abs/1708.06866v1
[4] V. S. Mailthody, K. Date, Z. Qureshi, C. Pearson, R. Nagi, J. Xiong, and W.-m. Hwu, “Collaborative (CPU + GPU) Algorithms for Triangle Counting and Truss Decomposition,” in , Sep. 2018, pp. 1–7, ISSN: 2377-6943.
[5] M. Bisson and M. Fatica, “Static graph challenge on GPU,” in , Sep. 2017, pp. 1–8.
[6] J. Leskovec and A. Krevl, “SNAP Datasets: Stanford large network dataset collection,” http://snap.stanford.edu/data, Jun. 2014.
[7] S. Samsi, V. Gadepally, M. Hurley, M. Jones, E. Kao, S. Mohindra, P. Monticciolo, A. Reuther, S. Smith, W. Song et al., “Graphchallenge.org: Raising the bar on graph analytic performance,” in . IEEE, 2018, pp. 1–7.
[8] S. Samsi, J. Kepner, V. Gadepally, M. Hurley, M. Jones, E. Kao, S. Mohindra, A. Reuther, S. Smith, W. Song, D. Staheli, and P. Monticciolo, “Graphchallenge.org triangle counting performance,” 2020.
[9] V. Strassen, “Gaussian elimination is not optimal,” Numerische Mathematik, vol. 13, pp. 354–356, 1969. [Online]. Available: http://eudml.org/doc/131927
[10] J. Huang, T. M. Smith, G. M. Henry, and R. A. van de Geijn, “Strassen’s algorithm reloaded,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC ’16. IEEE Press, 2016.
[11] K. Goto and R. A. v. d. Geijn, “Anatomy of high-performance matrix multiplication,” ACM Transactions on Mathematical Software, vol. 34, no. 3, pp. 1–25, May 2008. [Online]. Available: https://dl.acm.org/doi/10.1145/1356052.1356053
[12] F. G. Van Zee and R. A. van de Geijn, “Blis: A framework for rapidly instantiating blas functionality,” ACM Trans. Math. Softw., vol. 41, no. 3, Jun. 2015. [Online]. Available: https://doi.org/10.1145/2764454
[13] G. Henry, “Blas based on block data structures,” Cornell University, Tech. Rep., 1992.
[14] M. Lee and T. M. Low, “A Family of Provably Correct Algorithms for Exact Triangle Counting,” in Proceedings of the First International Workshop on Software Correctness for HPC Applications, ser. Correctness’17. New York, NY, USA: ACM, 2017, pp. 14–20, event-place: Denver, CO, USA. [Online]. Available: http://doi.acm.org/10.1145/3145344.3145484
[15] A. Azad, A. Buluç, and J. Gilbert, “Parallel Triangle Counting and Enumeration Using Matrix Algebra,” in , May 2015, pp. 804–811.
[16] C. Voegele, Y.-S. Lu, S. Pai, and K. Pingali, “Parallel triangle counting and k-truss identification using graph-centric methods,” in . Waltham, MA, USA: IEEE, Sep. 2017, pp. 1–7. [Online]. Available: http://ieeexplore.ieee.org/document/8091037/
[17] M. M. Wolf, J. W. Berry, and D. T. Stark, “A task-based linear algebra Building Blocks approach for scalable graph analytics,” in , Sep. 2015, pp. 1–6.
[18] H. Inoue, M. Ohara, and K. Taura, “Faster set intersection with SIMD instructions by reducing branch mispredictions,” Proceedings of the VLDB Endowment, vol. 8, no. 3, pp. 293–304, Nov. 2014. [Online]. Available: http://dl.acm.org/doi/10.14778/2735508.2735518
[19] J. Zhang, Y. Lu, D. G. Spampinato, and F. Franchetti, “FESIA: A fast and simd-efficient set intersection approach on modern cpus,” in . IEEE, 2020, pp. 1465–1476. [Online]. Available: https://doi.org/10.1109/ICDE48307.2020.00130
[20] N. Chiba and T. Nishizeki, “Arboricity and Subgraph Listing Algorithms,” SIAM Journal on Computing, vol. 14, no. 1, pp. 210–223, Feb. 1985. [Online]. Available: http://epubs.siam.org/doi/10.1137/0214017
[21] M. Bisson and M. Fatica, “Update on Static Graph Challenge on GPU,” in , Sep. 2018, pp. 1–8.
[22] V. Balaji and B. Lucia, “When is Graph Reordering an Optimization? Studying the Effect of Lightweight Graph Reordering Across Applications and Input Graphs,” in 2018 IEEE International Symposium on Workload Characterization (IISWC)