Parallelizing Bisection Root-Finding: A Case for Accelerating Serial Algorithms in Multicore Substrates
Mohammad Bakhshalipour and Hamid Sarbazi-Azad
Department of Computer Engineering, Sharif University of Technology
School of Computer Science, Institute for Research in Fundamental Sciences (IPM)
Abstract—Multicore architectures dominate today's processor market. Even though the number of cores and threads is already high and continues to grow, inherently serial algorithms do not benefit from the abundance of cores and threads. In this paper, we propose Runahead Computing, a technique which uses idle threads in a multi-threaded architecture to accelerate the execution of serial algorithms. Through detailed evaluations targeting both CPU and GPU platforms and a specific serial algorithm, our approach reduces the execution latency by up to 10x in our experiments.

Keywords—Multi-Thread Programming, Single-Thread Performance, Multicore Processor, GPU, Bisection Root-Finding.
I. INTRODUCTION
Multi-threaded architectures presently appear across the whole spectrum of computing machines, from low-end embedded processors to high-end general-purpose devices. The Chip Multiprocessor (CMP) [1] is a type of multi-threaded processor in which cores do not share computational resources. CMPs are implemented in many commercial systems and are used in broad classes of computations. Intel Nehalem i7 [2], AMD Bulldozer [3], IBM Power5 [4], Sun Niagara T2 [5], and TILE64 [6] are examples of commercial CMPs. A CMP enhances the performance of a single program when the program can be split into multiple pieces, each piece run by one core, in parallel with the others.

On the other hand, Graphics Processing Units (GPUs) are also being adapted for general-purpose computations. GPUs use an aggregate form of the single-instruction-multiple-data (SIMD) paradigm [7] with fine-grain multi-threading. NVIDIA Kepler GK110 [8] and AMD TeraScale [9] are examples of commercial GPUs. A GPU exploits the data-level parallelism of a single program which consists of the same operation on multiple data points to accelerate the execution of the application [10]–[17].

Even though the trend is toward growing the number of cores in both multicore and manycore systems [18]–[21], serial programs do not benefit from increasing the core count. Serial programs (or serial parts of a program) run on a single thread and cannot be split into two or more threads. Executing such programs on a manycore platform results in only one core running the whole program while the other cores remain idle. The situation gets worse when boosting the core count diminishes single-core performance. The problem arises for two reasons: (1) Tight physical constraints of the chip (i.e., limited power and area budgets) prevent accommodating hundreds of large and power-hungry cores on a single chip [22], [23].
So increasing the core count implies replacing high-performance cores with simple and small cores, which diminishes single-thread performance and consequently protracts the execution time of serial programs. Some proposals resolve this problem with asymmetric architectures [24]. (2) Increasing the core count forces replacing non-scalable crossbars with on-chip networks which use scalable topologies (e.g., Mesh). The latency incurred by the on-chip interconnect decreases per-core performance due to slower accesses to shared caches. Boosting the core count grows the network hop count and results in longer delays [25]–[42]. Some proposals offset this obstacle with richly-connected topologies [43], [44], which have significant area and energy overheads.

In this paper, we propose
Runahead Computing, a technique for accelerating inherently serial algorithms on multicore and manycore platforms. Runahead Computing draws on previous research on non-traditional parallelism and targets the execution latency of serial algorithms. The rest of the paper is organized as follows: Section 2 explains Runahead Computing and gives the essential background on non-traditional parallelism. In Section 3, we choose a proper serial algorithm as a case study and describe its details. Throughout Section 4, we illustrate the implementation of Runahead Computing in our case study. Section 5 discusses our evaluation methodology. Section 6 presents the results of evaluation experiments. Finally, Section 7 concludes the paper.

II. RUNAHEAD COMPUTING
Runahead Computing refers to using the idle threads in a multi-threaded architecture to improve the performance of a single-threaded application. Runahead Computing is a kind of speculation in which idle threads operate a few steps ahead of the main thread. These speculative threads attempt to provide a situation at their activity location under which, when the main thread reaches there, it executes faster than usual.

This is not the first research on using idle thread contexts with the purpose of increasing the performance of a single-threaded application. Various schemes have been proposed to use idle thread contexts to provide certain kinds of assistance to the main thread. Some approaches leverage idle thread contexts as a prefetcher for the main thread [45]–[50] in order to defeat long-latency memory accesses [51]–[53]. Several proposals precompute branch outcomes through a derived variant of the main thread [50], [54]. There are also some approaches which exploit idle thread contexts to precompute dependent live-in data [55]. However, all of these approaches require significant, non-trivial changes to the hardware of processors, which makes their implementation challenging for shipping products. In this paper, we target a method that pushes everything to software and hence is entirely applicable in the context of commercial off-the-shelf processors. This paradigm has sometimes been called non-traditional parallelism.
Fig. 1: Flowchart of a hypothetical program.

We proceed to describe our approach with an example. Figure 1 shows the flowchart of a hypothetical program. The program takes X as input and returns B as output. In the beginning, the program computes F(X) and stores the result in variable A. Then, if A is an even number, the program proceeds with calculating G(X) and storing the sum of the result and A in variable B. But if A is an odd number, the program computes the value of H(X) and puts the sum of the result and A into variable B. Finally, it returns B as the output of the program. Because of the dependencies between parts of the program, it cannot be regularly parallelized. So, considering execution times of 10, 5, and 5 seconds respectively for F, G, and H, and neglecting the latency of other operations (e.g., store and add), a single thread executes this program within 15 seconds. However, in Runahead Computing, the programmer initiates two threads at the beginning of execution. The first thread (main thread) begins with calculating F(X), and in parallel, the second thread (helper thread) calculates G(X) and H(X) and stores their results in variables G and H, respectively. In this manner, the execution latency of calculating G(X) and H(X) overlaps with the execution latency of computing F(X). So, when the main thread finishes the calculation of F(X), the results of G(X) and H(X) are ready in variables G and H. The main thread then checks the value of A, picks one of G or H for summing with A, stores the result in B, and returns the output. In this manner, the whole program finishes within 10 seconds. So, in this example, Runahead Computing improves the performance of the program by 50%.

III. BISECTION METHOD
In this section, we choose the Bisection root-finding algorithm [56] as our case study and describe its baseline manner (without Runahead Computing). Bisection is a serial algorithm and is a suitable case for our proposal due to its operative behavior.
A. Algorithm
Bisection is a root-finding algorithm which operates on continuous functions. The bisection method is based on this lemma: if a continuous function f returns opposite-sign values on a and b, then the equation f(x) = 0 has at least one real root in the interval (a, b), where a < b. The Bisection algorithm takes a function and an interval as inputs and repeatedly halves the interval, picking the subinterval in which a root must lie.
Fig. 2: Bisection algorithm on a sample function.

Figure 2 illustrates an example of the operation of this algorithm. The algorithm takes f(x) = sin(2x) as the input function and (2, 4) as the initial interval and then tries to find a root of the equation f(x) = 0 in the given interval. At first, it computes f(2) and f(4) and finds that f(2) is negative and f(4) is positive. The algorithm computes the middle point of the interval and the value of the function at that point (i.e., c = (2 + 4)/2 = 3 and f(3)) and finds that f(3) is negative. Now, the interval should be halved and replaced by one of the (2, 3) or (3, 4) intervals. Because the function returns opposite-sign values on the edge points of the (3, 4) interval, the algorithm chooses the (3, 4) interval for continuation. In the next step, the operation repeats similarly and the (3, 3.5) interval is chosen. The algorithm continues in a like manner for more iterations (not shown in the figure) until estimating the root with acceptable accuracy. Due to halving the interval at each step, in general, for achieving an error value less than ε, we need to iterate ⌈log₂((b − a)/ε)⌉ times for the initial interval (a, b).

The main advantage of the Bisection algorithm is its simplicity and robustness. While other root-finding numerical methods, which may have higher performance, operate only when the function satisfies specific conditions, the Bisection method works on any continuous function regardless of any particular circumstances. The only problem with the Bisection method is its high execution time, which we target in this work. Some approaches use the Bisection method to get near a solution, which is then used as a starting point for more quickly converging methods [57].

B. Baseline Implementation
We implement the Bisection method as in Algorithm 1 for the baseline evaluations. Throughout this implementation, even if the exact root is found in the middle of the execution, the program does not stop; it continues for the predefined number of iterations.
Data: Function: f, Interval: (a, b), Iterator: iterations
Result: root of f(x) = 0 in (a, b)
while iterations > 0 do
    iterations ← iterations − 1
    root ← (a + b)/2
    if f(a) × f(root) < 0 then
        b ← root
    else
        a ← root
    end
end
return root

Algorithm 1: Baseline implementation for the Bisection algorithm.

IV. RUNAHEAD BISECTION
In this section, we propose our approach for accelerating the Bisection algorithm by leveraging available idle threads. First, we describe our method in detail and then discuss its complexity.
A. Algorithm
We begin by considering the same example (i.e., Fig. 2) in an environment with three threads. One of the threads is the main thread, and the two others are helper threads. At first, the main thread computes the value of f(3), and in parallel, the helper threads operate one step ahead of the main thread. One helper thread predicts that f(3) will be positive and speculatively begins to compute the value of f(2.5). The other helper thread predicts a negative value for f(3) and speculatively calculates the value of f(3.5). Each helper thread stores the result of its computation in a shared variable (say f2.5 and f3.5). So, when the main thread completes its calculation, it compares the result of f(3) with the results of the helper threads, which have been stored in the two shared variables. Because f(3) is negative (i.e., the prediction of the second helper thread is correct) and f3.5 is positive, the next interval is (3, 3.5). In the next step, the main thread computes f(3.25), and the helper threads calculate f(3.125) and f(3.375). Operations repeat similarly in the subsequent steps. In this fashion, if we ignore the latency of some operations (e.g., joining threads and storing the values), the execution latency reduces by 50%.

We can further reduce the execution latency of the algorithm by devoting more helper threads to the program. For example, if we have seven threads, we can do the computations related to the two steps ahead in the current stage. In this situation, for our example, the main thread initially computes f(3), two helper threads compute f(3.5) and f(2.5), and four helper threads compute f(2.25), f(2.75), f(3.25), and f(3.75). In this way, the execution latency drops to one-third of the baseline latency.

To preserve the scalability of our method with the number of threads, we implement the shared variables as an array.
Each thread has a particular element in the array. The array also contains two elements which do not belong to any thread and hold the signs of the current interval's edge-point values. Each thread fills its corresponding element in the array after its computation finishes: if the result of the computation is negative, the thread fills the element with '1'; otherwise, it sets the element to '0'. Once all the threads have written their results into the array, each thread compares the entries which belong to its two neighbor threads (a simple XOR). If they are not the same, one edge of the new interval is the point of this thread, and the other edge is the neighbor point which has a different value in the array. Based on this, the new interval is chosen; then the main thread begins computing the value of the middle point of the new interval, the helper threads pick their corresponding points, and the scenario repeats as before. To prevent false sharing [58], [59], we implement the array as a two-dimensional structure but use only one dimension; in this manner, the shared variables map to different coherence units.

Fig. 3: Array-based implementation of Runahead Bisection.

Figure 3 illustrates the array-based implementation of Runahead Bisection for our example with three threads. At first, the main thread computes f(3) and writes '1' to the array, as the result is negative. In parallel with the main thread, helper thread-1 computes f(2.5) and helper thread-2 calculates f(3.5); then, they write their results ('0' or '1') into the array. Now, the array is complete (we already know the signs of f(2) and f(4)). Helper thread-1 compares the signs of the values of its neighbor points (i.e., f(2) and f(3)). In parallel, the main thread and helper thread-2 do similar work on their corresponding entries in the array. Based on the comparisons, helper thread-2 finds that its neighbors have different-sign values.
So, for the next step, helper thread-2 sets the interval (which is a shared variable among all threads) to (3, 3.5). The signs of the values of the edge points are copied to the top and bottom of the array, and the other operations repeat in the previous fashion.

B. Complexity
Because of the similarity between the operations of all threads, we first analyze the time complexity of a single thread. Each thread computes the value of its corresponding point; then it writes the sign of the result in the array; afterward, it waits to synchronize with the other threads. Next, the thread compares the signs of the results of the neighbor threads and updates the interval if required. All of these operations take O(1) latency (as their execution times are constant and independent of the demanded accuracy or the length of the initial interval).

If we need to iterate n times to reach a certain accuracy in the baseline algorithm, in the Runahead manner, we need to iterate n/k times for each thread, where k depends on the number of threads. Generally, we can reduce the number of iterations for each thread from n to n/k if we have 2^k − 1 threads. So, given k threads, our approach reduces the total execution time complexity of the program from O(n) to O(n/log₂(k + 1)), where n is the number of required iterations in the baseline implementation.

V. METHODOLOGY
We evaluate our approach on both a CMP and a GPU. Table I summarizes the parameters of our platforms. For the CPU, we compile the program using GCC without optimization. For the GPU, we use NVCC for compilation, again without optimization. To eliminate the effects of compulsory cache misses, we run each program two times and report the results of the second execution. To demonstrate the effectiveness of our method, we choose a high-latency input function: we use trigonometric functions and calculate them with Taylor series [60].

TABLE I: Evaluation parameters.
Parameter   Value
CPU         x86 architecture, Intel Core i7, 2.4 GHz, eight cores
OS          Linux, kernel version 4.4.0-34
GPU         NVIDIA Tesla K20, 732 MHz
Program     f(x) = sin(cos(x)), Taylor series, initial interval (1, …)
Fig. 4: Sensitivity of execution time of the CPU program to the number of threads.

VI. EVALUATION RESULTS
In this section, we report the results of two sensitivity analyses: (1) sensitivity of the execution time to the number of threads, and (2) sensitivity of the speed-up of our method to the latency of the input function.
A. Sensitivity to The Number of Threads
For the specific input program defined in Table I, we sweep the number of threads and measure the execution time of the application. For the CPU program, we sweep the number of threads from 1 to 7; for the GPU program, from 1 to 1023. We choose the maximum tolerable errors such that, in each situation, the number of iterations of the single-threaded program is divisible by log₂(Threads + 1).

Figure 4 shows the result of thread-sweeping on the CPU program. The execution latency values are normalized to that of the single-threaded program. The execution time of the CPU program drops to 55% (using three threads) and 38% (using seven threads) of its baseline serial implementation. As the figure illustrates, the performance nearly scales with increasing thread count. The noise in the scaling comes from the latency of operations which exist only in the multi-threaded program (e.g., creating and synchronizing the threads, filling the variables which are shared among threads). Notably, as the thread count increases, the latency of these operations (e.g., synchronizing the threads) grows and prevents perfect performance scaling.

Fig. 5: Sensitivity of execution time of the GPU program to the number of threads.

Figure 5 presents the result of sweeping the number of threads on the execution latency of the GPU program. Again, the execution times are normalized to that of the single-threaded program. The latency of the program drops to 50% (using three threads) and 10% (using 1023 threads) of its baseline serial implementation. As the figure confirms, the performance scales nearly perfectly with the growing number of threads. The low overhead of creating/joining hardware threads in the GPU platform [61] provides this near-ideal scalability.

Fig. 6: Sensitivity of speed-up to the execution latency of the input function on the CPU platform.
B. Sensitivity to The Execution Latency of Input Function
In this section, we investigate the sensitivity of our approach's speed-up to the computation latency of the input function. For this purpose, we sweep the number of iterations of the Taylor series used for computing the trigonometric functions. Growing the number of Taylor-series iterations raises the latency of calculating sin(x) and cos(x), and consequently, the entire latency of computing the value of a given point grows. In this experiment, we fix the maximum tolerable error such that the single-threaded program needs to iterate six times, and we restrict the number of threads to three; the multi-threaded program then requires three iterations. In an ideal situation (i.e., neglecting the latencies of creating and joining threads and filling shared variables), the multi-threaded application should take half the execution time of the single-threaded program (in other words, multi-threading should raise the performance by 100%).

Figure 6 shows the result of this study for the CPU program. As the figure illustrates, when the execution latency of the function is low (below 500 iterations), Runahead Computing not only does not improve the performance but actually decreases it. This occurs because, in this situation, the overhead of creating and joining threads exceeds the latency of computing the value of a point. But when the computation time of the function goes beyond a threshold, Runahead Computing improves the performance. In our experiment, Runahead Computing decreases the performance by 86% when the latency of the input function is small (10 iterations of the Taylor series). As the latency of the function increases, the speed-up of Runahead Computing also increases. With the number of Taylor-series iterations set to 10000, Runahead Computing improves the performance by 97% and converges to the ideal speed-up value.

Figure 7 presents the result of the same experiment on the GPU platform. As shown, unlike the CPU, Runahead Computing on the GPU never loses performance in our analysis. The reason is that the overhead of creating and joining threads on the GPU is very low in comparison with the CPU [61]. Even for a small number of Taylor-series iterations, Runahead Computing considerably increases the performance. For 10 iterations of the Taylor series, Runahead Computing speeds up the execution by 19%. For iterations beyond 500, Runahead Computing improves the performance by 99% and reaches its ideal speed-up value.

Fig. 7: Sensitivity of speed-up to the execution latency of the input function on the GPU platform.

VII. CONCLUSION
In this paper, we proposed Runahead Computing, a technique for increasing the performance of single-threaded applications in multi-threaded architectures by exploiting available idle threads. In the proposed approach, the programmer codes the idle threads to work some steps ahead of the main thread. As a case study, we chose the Bisection root-finding algorithm and accelerated it on a CMP and a GPU. While we examined our method on a particular algorithm, we believe that the ideas in this paper can be applied to other similar algorithms (e.g., Binary Search).

REFERENCES

[1] Kunle Olukotun et al. The Case for a Single-Chip Multiprocessor. In ASPLOS, 1996.
[2] Martin Dixon et al. The Next-Generation Intel Core Microarchitecture. Intel Technology Journal, 2010.
[3] Michael Butler et al. Bulldozer: An Approach to Multithreaded Compute Performance. IEEE Micro, 2011.
[4] Ron Kalla et al. IBM Power5 Chip: A Dual-Core Multithreaded Processor. IEEE Micro, 2004.
[5] Manish Shah et al. UltraSPARC T2: A Highly-Threaded, Power-Efficient, SPARC SOC. In ASSCC, 2007.
[6] Shane Bell et al. TILE64 Processor: A 64-Core SoC with Mesh Interconnect. In ISSCC, 2008.
[7] Michael J. Flynn. Some Computer Organizations and Their Effectiveness. IEEE Transactions on Computers, 1972.
[8] NVIDIA's Next Generation CUDA Compute Architecture: Kepler GK110. Technical report, 2012.
[9] M. Houston. Anatomy of AMD's TeraScale Graphics Engine. 2008.
[10] Mohammad Sadrosadati et al. LTRF: Enabling High-Capacity Register Files for GPUs via Hardware/Software Cooperative Register Prefetching. In ASPLOS, 2018.
[11] Amir Yazdanbakhsh et al. Neural Acceleration for GPU Throughput Processors. In MICRO, 2015.
[12] Farzad Khorasani et al. RegMutex: Inter-Warp GPU Register Time-Sharing. In ISCA, 2018.
[13] Mohammad Sadrosadati et al. Effective Cache Bank Placement for GPUs. In DATE, 2017.
[14] Homa Aghilinasab et al. Reducing Power Consumption of GPGPUs Through Instruction Reordering. In ISLPED, 2016.
[15] Ali Karami et al. A Statistical Performance Prediction Model for OpenCL Kernels on NVIDIA GPUs. In CADS, 2013.
[16] Amin Abbasi et al. A Preliminary Study of Incorporating GPUs in the Hadoop Framework. In CADS, 2012.
[17] Sayyed Ali Mirsoleimani et al. A Parallel Memetic Algorithm on GPU to Solve the Task Scheduling Problem in Heterogeneous Environments. In GECCO, 2013.
[18] Pejman Lotfi-Kamran et al. Scale-Out Processors. In ISCA, 2012.
[19] Michael Ferdman et al. Cuckoo Directory: A Scalable Directory for Many-Core Systems. In HPCA, 2011.
[20] Boris Grot et al. Optimizing Data-Center TCO with Scale-Out Processors. IEEE Micro, 2012.
[21] Pejman Lotfi-Kamran et al. TurboTag: Lookup Filtering to Reduce Coherence Directory Power. In ISLPED, 2010.
[22] Hadi Esmaeilzadeh et al. Dark Silicon and the End of Multicore Scaling. In ISCA, 2011.
[23] Nikos Hardavellas et al. Toward Dark Silicon in Servers. IEEE Micro, 2011.
[24] M. Aater Suleman et al. Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures. In ASPLOS, 2009.
[25] Mohammad Bakhshalipour et al. Fast Data Delivery for Many-Core Processors. IEEE Transactions on Computers, 2018.
[26] Amirhossein Mirhosseini et al. BiNoCHS: Bimodal Network-on-Chip for CPU-GPU Heterogeneous Systems. In NOCS, 2017.
[27] Amirhossein Mirhosseini et al. An Energy-Efficient Virtual Channel Power-Gating Mechanism for On-Chip Networks. In DATE, 2015.
[28] Mohammad Sadrosadati et al. An Efficient DVS Scheme for On-Chip Networks Using Reconfigurable Virtual Channel Allocators. In ISLPED, 2015.
[29] Pejman Lotfi-Kamran et al. NOC-Out: Microarchitecting a Scale-Out Processor. In MICRO, 2012.
[30] Pejman Lotfi-Kamran et al. Near-Ideal Networks-on-Chip for Servers. In HPCA, 2017.
[31] Pejman Lotfi-Kamran et al. EDXY: A Low Cost Congestion-Aware Routing Algorithm for Network-on-Chips. Journal of Systems Architecture, 2010.
[32] Pejman Lotfi-Kamran et al. BARP: A Dynamic Routing Protocol for Balanced Distribution of Traffic in NoCs. In DATE, 2008.
[33] Pejman Lotfi-Kamran et al. An Efficient Hybrid-Switched Network-on-Chip for Chip Multiprocessors. IEEE Transactions on Computers, 2016.
[34] Pejman Lotfi-Kamran et al. NOC Characteristics of Cloud Applications. In CADS, 2017.
[35] Babak Falsafi et al. Network-on-Chip Using Request and Reply Trees for Low-Latency Processor-Memory Communication, 2017. US Patent 9,703,707.
[36] Mehdi Modarressi et al. Application-Aware Topology Reconfiguration for On-Chip Networks. IEEE Transactions on Very Large Scale Integration Systems, 2011.
[37] Mehdi Modarressi et al. Virtual Point-to-Point Connections for NoCs. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2010.
[38] Mehdi Modarressi et al. A Hybrid Packet-Circuit Switched On-Chip Network Based on SDM. In DATE, 2009.
[39] Mehdi Modarressi et al. Power-Aware Mapping for Reconfigurable NoC Architectures. In ICCD, 2007.
[40] Mehdi Modarressi et al. An Efficient Dynamically Reconfigurable On-Chip Network Architecture. In DAC, 2010.
[41] Amirhossein Mirhosseini et al. Quantifying the Difference in Resource Demand among Classic and Modern NoC Workloads. In ICCD, 2016.
[42] Amirhossein Mirhosseini et al. POSTER: Elastic Reconfiguration for Heterogeneous NoCs with BiNoCHS. In PACT, 2017.
[43] Boris Grot et al. Kilo-NOC: A Heterogeneous Network-on-Chip Architecture for Scalability and Service Guarantees. In ISCA, 2011.
[44] John Kim et al. Flattened Butterfly: A Cost-Efficient Topology for High-Radix Networks. In ISCA, 2007.
[45] Jamison D. Collins et al. Dynamic Speculative Precomputation. In MICRO, 2001.
[46] Jamison D. Collins et al. Speculative Precomputation: Long-Range Prefetching of Delinquent Loads. In ISCA, 2001.
[47] Dongkeun Kim and Donald Yeung. Design and Evaluation of Compiler Algorithms for Pre-Execution. In ASPLOS, 2002.
[48] Chi-Keung Luk. Tolerating Memory Latency Through Software-Controlled Pre-Execution in Simultaneous Multithreading Processors. In ISCA, 2001.
[49] Weifeng Zhang et al. An Event-Driven Multithreaded Dynamic Optimization Framework. In PACT, 2005.
[50] Craig Zilles and Gurindar Sohi. Execution-Based Prediction Using Speculative Slices. In ISCA, 2001.
[51] Mohammad Bakhshalipour et al. Domino Temporal Data Prefetcher. In HPCA, 2018.
[52] Mohammad Bakhshalipour et al. An Efficient Temporal Data Prefetcher for L1 Caches. IEEE Computer Architecture Letters, 2017.
[53] Armin Vakil-Ghahani et al. Cache Replacement Policy Based on Expected Hit Count. IEEE Computer Architecture Letters, 2018.
[54] Robert S. Chappell et al. Simultaneous Subordinate Microthreading (SSMT). In ISCA, 1999.
[55] Carlos Madriles et al. Mitosis: A Speculative Multithreaded Processor Based on Precomputation Slices. IEEE Transactions on Parallel and Distributed Systems, 2008.
[56] George Corliss. Which Root Does the Bisection Algorithm Find? SIAM Review, 1977.
[57] Richard L. Burden and J. Douglas Faires. The Bisection Algorithm. Numerical Analysis. Prindle, Weber & Schmidt, 1985.
[58] Josep Torrellas et al. False Sharing and Spatial Locality in Multiprocessor Caches. IEEE Transactions on Computers, 1994.
[59] Tor E. Jeremiassen and Susan J. Eggers. Reducing False Sharing on Shared Memory Multiprocessors Through Compile Time Data Transformations. In PPOPP, 1995.
[60] Lars V. Ahlfors. Complex Analysis: An Introduction to the Theory of Analytic Functions of One Complex Variable. New York, London, 1953.
[61] Alejandro Segovia. Parallel Programming with NVIDIA CUDA.