Load Driven Branch Predictor (LDBP)
Akash Sridhar
Computer Science and Engineering, University of California, Santa Cruz
[email protected]
Nursultan Kabylkas
Computer Science and Engineering, University of California, Santa Cruz
[email protected]
Jose Renau
Computer Science and Engineering, University of California, Santa Cruz
[email protected]
Abstract—Branch instructions dependent on hard-to-predict load data are the leading branch misprediction contributors. Current state-of-the-art history-based branch predictors have poor prediction accuracy for these branches. Prior research backs this observation by showing that growing a history-based branch predictor from 256 Kbit to 1 Mbit reduces branch mispredictions by just 10%.

We present the novel Load Driven Branch Predictor (LDBP), specifically targeting hard-to-predict branches dependent on a load instruction. Though random load data determines the outcome for these branches, the load address for most of these data has a predictable pattern. This is an observable template in data structures like arrays and maps. Our predictor model exploits this behavior to trigger future loads associated with branches ahead of time and use their data to predict the branch's outcome. The predictable loads are tracked, and the precomputed outcomes of the branch instruction are buffered for making predictions. Our experimental results show that compared to a standalone 256-Kbit IMLI predictor, LDBP augmented with a 150-Kbit IMLI reduces average branch mispredictions by 20% and improves average IPC by 13.1% for benchmarks from the SPEC CINT2006 and GAP benchmark suites.
I. INTRODUCTION
Branch mispredictions and data cache misses are the two most significant factors limiting single-thread performance in modern microprocessors. Improving branch prediction accuracy has several benefits. First, it improves IPC by reducing the number of flushed instructions. Second, it reduces the power dissipated executing instructions down the wrong path of the branch. Third, it increases Memory Level Parallelism (MLP), which facilitates a deeper instruction window in the pipeline and supports multiple outstanding memory operations.

Current branch prediction championships and CPU designs use either perceptron-based predictors [1] [2] [3] [4] or TAGE-based predictors [5] [6]. These predictors may use global and local history, and a statistical corrector to further improve performance. The TAGE-SC-L [7], a derivative of its previous implementation from the Championship Branch Prediction (CBP-4) [8], combined several of these techniques and was the winner of the last branch prediction championship (CBP-5). Numbers from CBP-5 [7] [9] show that scaling a TAGE predictor from 64 Kbit to unlimited size only reduces branch Mispredictions per Kilo Instructions (MPKI) from 3.986 to 2.596.

Most current processors, like AMD Zen 2, ARM A72, and Intel Skylake, use some TAGE-variant branch predictor. TAGE-like predictors are excellent, but there are still many difficult-to-predict branches. Seznec [8] [7] studied the prediction accuracy of a 256-Kbit TAGE predictor and a TAGE with no storage limit. The 256-Kbit TAGE had only about 10% more mispredictions than its infinite-size counterpart. The numbers mentioned above would reflect the prediction accuracy of the latest Zen 2 CPU [10] using a 256-Kbit TAGE-based predictor.
For this work, we use the 256-Kbit TAGE-GSC + IMLI [11], which combines the global history components of the TAGE-SC-L with a loop predictor and local history, as our baseline system.

Recent work [12] shows that even though current state-of-the-art branch predictors have almost perfect prediction accuracy, there is scope for gaining significant performance by fixing the remaining mispredictions. The core architecture could be tuned to be wider if it had the support of better branch prediction, which could potentially offer more IPC gains. Prior works [13], [14] have tried to address different types of hard-to-predict branches. A vital observation of these works is that most branches that state-of-the-art predictors fail to capture are branches that depend on a recent load. If the loaded data is challenging to predict, TAGE-like predictors have low prediction accuracy, as these patterns are arbitrary and too large to be captured.

The critical observation and contribution of this paper is that although the load data feeding a load-dependent branch may be random, the load address may be highly predictable in some cases. If the branch operand(s) depend on arbitrary load data, the branch is going to be difficult to predict. If the load address is predictable, it is possible to "prefetch" the load ahead of time and use the actual data value in the branch predictor.

Based on the previous observation, we propose to combine the stride address predictor [15] with a new type of branch predictor that triggers loads ahead of time and feeds the load data to the branch predictor. Then, when the corresponding branch gets fetched, the proposed predictor will have very high accuracy even with random data.
The predictor is only active for branches that have low confidence with the default predictor and depend on loads with predictable addresses. Otherwise, the default IMLI predictor performs the prediction. The proposed predictor is called the Load Driven Branch Predictor (LDBP).

LDBP is an implementation of a new class of branch predictors that combine load(s) and branches to perform prediction. This new class of load-assisted branch predictors allows near-perfect branch prediction accuracy over random data as long as the load address is predictable. It is still a prediction because coherence or other forwarding issues can make it difficult to guarantee the results.

LDBP does not require software changes or modifications to the ISA. It tracks the backward code slice starting from the branch and terminating at a set of one or more loads. If all the loads have a predictable address, and the slice is small enough to be computed, LDBP keeps track of the slice. When the same branch retires again, it will start to trigger future loads ahead of time. The next fetch of this branch uses the precomputed slice result to predict the branch outcome. Through the rest of this paper, we will refer to a load (with a predictable address) that has a dependency with a branch as a trigger load, and its dependent branch as a load-dependent branch.

  addi a5,a5,4   //increments array index
  .
  .
  lw   a4,0(a5)  //loads data from array
  bnez a4,1043e

Listing 1. RISC-V assembly for the example kernel's hard-to-predict branch.
We will explain a simple code example that massively benefits from LDBP. Let us consider a simple kernel that iterates over a vector of random 0s and 1s to find values greater than zero. The branch with the most mispredictions in this kernel has the assembly sequence shown in Listing 1. As we are traversing a vector, the load addresses here are predictable, even though the data is completely random. TAGE fails to build these branch history patterns due to the dependence of the branch outcome on irregular data patterns. LDBP has near-perfect branch prediction because the trigger load (line 4) has a predictable address. LDBP triggers loads ahead of time, computes the branch-load backward slice, and stores the results. The branch uses the precomputed outcome at fetch. When we augment LDBP to a Zen 2-like core with a 256-Kbit IMLI predictor, the IPC improves by 2.6x.

In general, a load-dependent branch immediately follows a trigger load in program order. Due to the narrow interval between these two instructions, the load data will not be available when the branch is fetched. Therefore, if this load yields a stream of random data across iterations, LDBP will have a very slim chance of making a correct prediction. To address this issue, we ensure the timeliness of the trigger loads in our setup. The key challenge is to make sure that the trigger load execution is complete before the corresponding load-dependent branch reaches fetch. By leveraging the stride predictor, we can ensure trigger load timeliness. When a branch retires, a read request for a trigger load is generated. Owing to the high predictability of their addresses, trigger loads request future addresses in advance. These requests have sufficient prefetch distance to cover the in-flight instructions and variable memory latency.
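As a concrete reference, the kernel described above can be sketched in C++; the function name is illustrative, not taken from the paper's sources.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Counts elements greater than zero in a vector of random 0s and 1s.
// The compare-and-branch on v[i] is data-dependent and effectively
// random, so a history-based predictor mispredicts often; the address
// of v[i], however, advances by a fixed stride, which LDBP exploits.
int count_positive(const std::vector<int32_t>& v) {
    int count = 0;
    for (std::size_t i = 0; i < v.size(); ++i) {
        if (v[i] > 0) {  // hard-to-predict branch, predictable load address
            ++count;
        }
    }
    return count;
}
```

Every iteration loads v[i] from an address that advances by a fixed 4-byte stride, which is exactly the pattern a stride address predictor captures.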
As Section II shows, this can be achieved with very small structures incurring little hardware overhead.

To evaluate the results, we use the GAP benchmark suite [16] and the SPEC2006 integer benchmarks [17] having less than 95% prediction accuracy on our IMLI baseline. GAP is a collection of graph algorithm benchmarks. It is one of the highest-performance benchmark suites available, and graphs are known to be severely limited by branch prediction accuracy. We integrated an 81-Kbit LDBP with the baseline 256-Kbit IMLI predictor. Results show that LDBP fixes the topmost mispredicting branches for more than half of the benchmarks analyzed in this study. Compared to the baseline predictor, LDBP with IMLI decreases the branch MPKI by 22.7% on average across all benchmarks. Similarly, the combined predictor has an average IPC improvement of 13.7%. LDBP also eases the burden on the hardware budget of the primary predictor. When combined with a 150-Kbit IMLI predictor, branch mispredictions come down by 20%, and the performance gain scales by 13.1% compared to the 256-Kbit IMLI, for a 9.7% smaller hardware allocation.

The rest of the paper is organized as follows: Section II describes the LDBP mechanism and architecture. Section III reports our evaluation setup methodology. Benchmark analysis, architecture analysis, and results are highlighted in Section IV. Section V presents related works. Section VI concludes the paper.

II. LOAD DRIVEN BRANCH PREDICTOR
A. Load-Branch Chains
The core principle of LDBP is exploiting the dependency between load(s) and a branch in a load-branch chain. In this sub-section, we explain load-branch chains in detail. LDBP needs to capture the backward slice [18] of the operation sequence starting from the branch. The exit point of this slice must be a load with a predictable address or a trivially computable operation like a load immediate.
[Figure 1 diagrams: a trivial chain (a predictable LD feeding the branch directly), a complex chain (predictable LDs combined through simple ops before the branch), and a load-load chain (a predictable donor LD feeding an unpredictable recipient LD that feeds the branch)]
Fig. 1. Generic load-branch chain starts with predictable loads and terminates with a branch.
As shown in Figure 1, we classify load-branch chains into three types: trivial, complex, and load-load chains. In a trivial chain, the branch has a single source operand (as in a bnez instruction) or two source operands, and it has a direct dependency with a predictable load. No intermediate instructions modify the load data in this chain.

In a complex chain, all the branch inputs terminate with a predictable load or a load immediate. A complex chain includes at least one predictable load, one or more simple arithmetic operations, and it concludes with the branch. The LDBP framework does not track complex ALU operations, and any chain with such an operation is invalidated. We explain the load-load chain in Section IV.

A load-branch chain has two main constraints: (1) the maximum number of operations between the load and the branch, and (2) the maximum number of input loads. For example, a chain can have five simple ALU operations before the branch. It means that a Finite State Machine (FSM) of the chain needs six cycles to compute the branch result. From the benchmarks we analyzed, we found that a considerable proportion of hard-to-predict branches are part of a trivial load-branch chain.
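To make the classification concrete, the following C++ sketch shows source patterns that would compile into a trivial and a complex chain. The functions and data are illustrative, not taken from the paper's benchmarks.

```cpp
#include <cstdint>

// Trivial chain: the branch input is the load result itself
// (compiles to roughly: lw a4,0(a5); bnez a4,...).
bool trivial_chain(const int32_t* a, int i) {
    return a[i] != 0;
}

// Complex chain: both branch inputs terminate in predictable loads,
// joined by one simple ALU operation before the branch.
bool complex_chain(const int32_t* a, const int32_t* b, int i) {
    int32_t x = a[i] + b[i];   // one simple ALU op in the slice
    return x > 4;              // the branch concludes the chain
}

// Demo data: the values look random, but sequential traversal makes
// the load addresses predictable.
static const int32_t A[3] = {0, 5, -2};
static const int32_t B[3] = {7, 1, 3};
```

In both cases the branch outcome depends on arbitrary data, yet the loads feeding it walk the arrays with a fixed stride, so the chains qualify for LDBP tracking.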
B. LDBP Architecture
In this sub-section, we explain the LDBP architecture. As LDBP works in conjunction with the primary branch predictor, its architecture aims at being simple, timely, Spectre-safe, and having low power overhead. The LDBP architecture is dissected into two sub-blocks: one block attached to the core's retirement stage and another block at the fetch stage. At an abstract level, the retirement block detects potential load-branch chains, creates backward slices from the branch to its dependent load(s), and generates trigger loads. The fetch block, in turn, uses the backward slices to build FSMs of the program sequence and computes the outcome of load-dependent branches using the executed trigger load data.
1) LDBP Retirement Block:
A naive LDBP retirement block could consume significant power detecting and building backward slices all the time. To avoid this substantial power overhead, we leverage the stride address predictor that exists in many modern microarchitectures to detect predictable loads. In addition, LDBP attempts to identify a load-branch chain only when the load is predictable and the associated branch has low confidence with the default predictor (in our case, the IMLI predictor). Figure 2 shows the tables/structures associated with the retirement block.
Fig. 2. LDBP Retirement Block - Fields in each index of the tables are marked in the figure.
Stride Predictor (SP): The retiring load PC indexes the Stride Predictor table. This table has five fields: the PC tag (sp.pctag), the address of the last retired load (sp.lastaddr), the load address delta (sp.delta), a delta confidence counter (sp.confidence), and a tracking bit to indicate if a given load PC is tracked as part of a load-branch chain (sp.tracking). The stride predictor can store partial load addresses to save space. The updating policy of the confidence counter varies across different stride predictors. Standard practice involves increasing the counter each time the delta repeats and decreasing it each time the delta changes. This approach may skew the confidence either way. Ideally, increasing the counter by one and reducing it by a higher value minimizes the bias. A tracked load (with sp.tracking set) can trigger only when its confidence counter is saturated.
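A minimal sketch of this asymmetric update policy follows. The field names mirror the paper; the saturation value and penalty are assumptions for illustration.

```cpp
#include <cstdint>

// One Stride Predictor entry (partial tags/addresses omitted).
struct StrideEntry {
    uint64_t last_addr = 0;
    int64_t  delta = 0;
    int      confidence = 0;     // saturating counter
    bool     tracking = false;   // set when part of a load-branch chain
};

constexpr int kConfMax = 7;      // assumed saturation value
constexpr int kConfPenalty = 4;  // decrease faster than we increase

void update_stride(StrideEntry& e, uint64_t addr) {
    int64_t d = static_cast<int64_t>(addr - e.last_addr);
    if (d == e.delta) {
        if (e.confidence < kConfMax) ++e.confidence;  // +1 on a repeat
    } else {
        // Asymmetric penalty on a delta change minimizes the bias.
        e.confidence = (e.confidence > kConfPenalty)
                           ? e.confidence - kConfPenalty : 0;
        e.delta = d;
    }
    e.last_addr = addr;
}

// A tracked load may trigger only with a saturated counter.
bool can_trigger(const StrideEntry& e) {
    return e.tracking && e.confidence == kConfMax;
}
```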
Rename Tracking Table (RTT): The Rename Tracking Table detects and builds dependencies in the load-branch chains. The retiring instruction's logical register indexes the RTT. Each table entry has a saturating counter to track the number of operations (rtt.nops) in a load-branch chain and a pointer list to track Stride Predictor entries (rtt.strideptr). The number of entries in the pointer list depends on the number of loads supported by LDBP. If a chain consists of 2 loads and 4 arithmetic operations before the branch, we need 3 bits to track these six operations and two entries on the pointer list.
Branch Trigger Table (BTT): The Branch Trigger Table links a branch with its associated loads and intermediate operations. The retiring branch PC indexes the BTT. Each entry has the following fields: the branch PC tag (btt.pctag), the list of associated loads (copied from the Stride Predictor pointer list in the RTT (btt.strideptr)), and a 3-bit accuracy counter to track LDBP's accuracy for this branch (btt.accuracy). If the accuracy counter reaches zero, the BTT entry gets cleared, and the sp.tracking bits of the loads in btt.strideptr are reset. A BTT entry is allocated only when a load-branch chain satisfies the following three conditions: (1) the loads in the chain are predictable; (2) the retiring branch has low confidence with IMLI; (3) the number of loads and the number of operations in the chain are within the permissible threshold.
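The three allocation conditions can be summarized as a predicate; the structure and threshold values here are assumptions for illustration.

```cpp
// Summary of a candidate load-branch chain at branch retirement.
struct ChainInfo {
    int  n_ops;                  // rtt.nops accumulated along the slice
    int  n_loads;                // entries on the rtt.strideptr list
    bool loads_predictable;      // every load has saturated stride confidence
    bool branch_low_confidence;  // default (IMLI) predictor is unsure
};

constexpr int kMaxOps = 5;    // assumed op threshold per chain
constexpr int kMaxLoads = 2;  // assumed load threshold per chain

// A BTT entry is allocated only when all three conditions hold.
bool should_allocate_btt(const ChainInfo& c) {
    return c.loads_predictable &&
           c.branch_low_confidence &&
           c.n_ops <= kMaxOps && c.n_loads <= kMaxLoads;
}
```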
Code Snippet Builder (CSB): The CSB tracks the operation sequence of a load-branch chain for each logical register. Each entry on this table is a list of operations (csb.ops). The CSB entry is updated only when a new BTT entry gets allocated. This prerequisite ensures that the CSB is not polluted and minimizes power overhead. There are several works in the academic literature about building backward slices [18]. We use a table indexed by the retiring logical register (similar in behavior to the RTT). It copies the chain of operations starting from the load and terminating with the branch. Initially, we considered combining the CSB with the RTT but dropped the idea considering the additional power dissipation this would incur. The CSB entries are only needed when a new BTT entry is populated (when a load-branch chain is established), and it would not make sense to integrate it with the RTT.
Pending Load Queue (PLQ): The tables/structures mentioned above are sufficient to detect and build load-branch chains. The PLQ acts as a buffer and stores the Stride Predictor pointer list (plq.strideptr) associated with a load-branch chain. It tracks whether the last retired load had a change in delta (plq.tracking). If there is a change, it notifies the retire block to stop triggering potentially incorrect loads. Generally, a load generates prefetches when it retires. But, in our setup, we delay trigger load generation until the branch retires to ensure correctness in trigger load generation. The PLQ ensures that the BTT gets notified about any change in the retiring load's delta before it triggers any loads. As shown in Figure 4, the PLQ allocates entries during BTT allocation.

Fig. 3. LDBP Fetch Block - Fields in each index of the tables are marked in the figure.
2) LDBP Fetch Block:
The LDBP fetch block is responsible for accumulating trigger load results and computing the branch outcomes. Figure 3 shows the tables used by the fetch block, the registers associated with tracking loads, and the ALU used to compute the branch outcome for load-branch chains.
Load Outcome Table (LOT) and Load Outcome Registers (LOR): The combination of LOR and LOT stores trigger load data, which could be consumed by future branches. The lor.ldstart is the starting load address of the range, and it is updated after every branch fetch. The lor.delta field tracks the load address delta of each load tracked by the LOR (lor.strideptr). The lor.lot_pos field marks the data to be used by the current branch, and it helps to queue incoming data in the appropriate LOT index. The lot.valid bit gets set when the trigger load associated with that entry finishes execution.

The LOR keeps track of a range of load addresses whose data could be potentially useful for the current and future branches. The LOT caches the data associated with the addresses tracked by the LOR. Each LOR entry has an associated LOT entry. Each LOT entry has an n-entry load data queue (lot.ld_data) and a valid bit queue (lot.valid). The ending address tracked by the LOR is lor.ldstart + n * lor.delta. Any trigger load address outside this range is deemed useless, and the LOT does not cache its data.

Branch Outcome Table (BOT):
The branch PC indexes the BOT at fetch (bot.pctag). As shown in Figure 4, the BOT has two main tasks: one, use the pre-computed branch outcome to predict at the fetch stage; two, initiate the Code Snippet Table to compute the outcome for future branches.

Each BOT entry has a queue of 1-bit entries holding the branch outcome (bot.outcome_queue). The length of this queue is equivalent to the number of entries in the lot.ld_data queue. The bot.outcome_ptr points to the current BOT outcome queue entry to be used by the incoming branch instruction. The BOT uses the outcome if the corresponding bot.valid bit is set. The bot.strideptr has the list of loads associated with the branch. The Code Snippet FSM uses this field to pick the appropriate load(s) from the LOR/LOT and the CST pointer (bot.cstptr) to compute the branch outcome.
Code Snippet Table (CST):
The Code Snippet Table (CST) is responsible for executing the branch backward slice to compute the branch outcome. A CST entry is allocated during BOT allocation. The CST feeds the FSMs with the operation sequence of the load-branch chain. When all the trigger load data associated with the trigger branch are available, the FSM executes the code snippet to completion at the rate of one ALU operation per cycle. When large backward slices are supported, more FSMs are needed to reduce contention. Contention happens when all the FSMs are busy; in this case, the branch outcome gets delayed until an FSM is free. As the BOT only tracks a small number of trigger branches, a similar-sized CST is sufficient.
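A toy software model of snippet execution follows; the operation encoding and names are assumptions made for illustration, not the paper's hardware encoding.

```cpp
#include <cstdint>
#include <vector>

// One slice operation: 'a' = add immediate, 's' = shift left,
// 'g' = branch condition "greater than immediate".
struct Op { char kind; int32_t imm; };

struct FsmResult { bool outcome; int cycles; };

// Replays the backward slice at one ALU operation per cycle once the
// trigger-load data is available; the final op yields the branch outcome.
FsmResult run_snippet(const std::vector<Op>& slice, int32_t load_data) {
    int32_t v = load_data;
    int cycles = 0;
    bool outcome = false;
    for (const Op& op : slice) {
        ++cycles;  // one ALU operation per cycle
        switch (op.kind) {
            case 'a': v += op.imm; break;
            case 's': v <<= op.imm; break;
            case 'g': outcome = v > op.imm; break;  // the branch itself
        }
    }
    return {outcome, cycles};
}
```

Consistent with Section II-A, a chain with five ALU operations plus the branch would occupy such an FSM for six cycles.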
C. LDBP Flow
Figure 4 shows the interaction between the different LDBP components at the instruction fetch and retirement stages. Through the rest of this sub-section, we look at LDBP behavior in detail.
1) Load Retirement:
When a load retires, it updates the Stride Predictor. The sp.confidence field is updated depending upon the load address behavior. The sp.tracking bit for a load gets set at BOT allocation. A BOT entry allocation implies that a valid load-branch chain is present, and it is necessary to track the loads in this chain to ensure LDBP correctness.

If sp.tracking is set, the corresponding Stride Predictor index is appended to the Pending Load Queue table (plq.strideptr). The plq.tracking bit remains set until there is a change in delta for the load it tracks.

The retiring load also resets the RTT entry indexed by its destination register. If the sp.confidence is high, rtt.nops is initialized to zero, and the load's pointer from the Stride Predictor is appended to rtt.strideptr. In case the sp.confidence is low, rtt.nops is saturated, and the RTT stride pointer list is cleared.
2) ALU Retirement:
A retiring simple ALU operation (like an addition) updates the RTT entry pointed to by its destination register. The RTT retrieves the rtt.nops and rtt.strideptr values pointed to by its source registers and accumulates them into the fields indexed by the destination register. The cumulative rtt.nops is represented by Equation 1 (a RISC-V instruction has at most two sources). It is realistically infeasible to track an infinitely large load-branch chain. So, there is a threshold on the number of operations and the number of loads supported by LDBP. If these values in the RTT fields exceed the limit, the corresponding RTT entry gets invalidated. For simplicity, we add the number of operands per source, ignoring any potential redundancy in operations.

rtt[dst].nops = rtt[src1].nops + rtt[src2].nops + 1    (1)
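A software sketch of the RTT update implied by Equation 1, with thresholds chosen as assumptions:

```cpp
#include <cstddef>
#include <vector>

// One RTT entry; hardware uses a saturating counter and a bounded
// pointer list, modeled here with an int and a std::vector.
struct RttEntry {
    int n_ops = 0;
    std::vector<int> stride_ptrs;  // Stride Predictor entry pointers
    bool valid = true;
};

constexpr int kMaxOps = 5;         // assumed op threshold per chain
constexpr std::size_t kMaxLoads = 2;

// On ALU retirement, the destination accumulates both sources' op
// counts and stride pointers, plus one for the retiring op itself.
void retire_alu(RttEntry& dst, const RttEntry& src1, const RttEntry& src2) {
    dst.n_ops = src1.n_ops + src2.n_ops + 1;  // Equation 1
    dst.stride_ptrs = src1.stride_ptrs;
    dst.stride_ptrs.insert(dst.stride_ptrs.end(),
                           src2.stride_ptrs.begin(), src2.stride_ptrs.end());
    // Chains growing past the supported thresholds are invalidated.
    dst.valid = dst.n_ops <= kMaxOps && dst.stride_ptrs.size() <= kMaxLoads;
}
```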
Fig. 4. LDBP Flow - Interaction between Fetch and Retire Block.
3) Branch Retirement:
At cold start, when a branch retires, it indexes the RTT only when it has low confidence with the default IMLI predictor (IMLI is confident when the longest-table hit counter is saturated). The RTT ensures the validity of the load-branch chain by checking the load count and operation count in the entry indexed by the branch source(s). A BTT entry gets allocated only when all the loads in this chain are predictable.

BTT Allocation:
On BTT allocation, the contents of rtt.strideptr are copied to the BTT stride pointer list. The BTT accuracy counter (btt.accuracy) is initialized to half of its saturation value. The sp.tracking bits for the associated loads are set, and the CSB starts building the code snippet for this load-branch chain. As shown in Figure 4, the BTT allocation creates a chain reaction by initiating the PLQ allocation, LOR/LOT allocation, and BOT allocation.

Each load associated with the branch gets a unique entry during LOR/LOT allocation. Load-associated metadata from the Stride Predictor populates the LOR fields. The lor.lot_pos is cleared. Similarly, the BOT entry gets reset on allocation, and btt.strideptr updates the stride pointer list on the BOT. The branch's PC tag is assigned to bot.pctag.

BTT Hit:
On a BTT hit, the btt.accuracy counter gets incremented if LDBP made a correct prediction while the default IMLI predictor mispredicted, and vice versa. If this counter reaches zero, the BTT deallocates the entry, and the sp.tracking bits associated with btt.strideptr are cleared.

The CSB starts to build the code snippet for the load-branch chain on BOT allocation. After the CSB completes the snippet, on a BTT hit, the code snippet is copied to the CST. The CSB is disabled after this process.

When the retiring branch hits on the BTT, it reads the corresponding PLQ entries to check whether the tracking bit is high for the loads in btt.strideptr. The BTT can trigger load(s) if the PLQ and LOR track all the associated loads. Equation 2 represents the address of the triggered load. The lor.ldstart is incremented by the load address delta after every trigger load generation to ensure better coverage. The lor.lot_pos is incremented when a new load is triggered. The trigger load distance (tl_dist) and the number of triggers generated for each load can be tuned to facilitate better load timeliness.

tl_addr = lor.ldstart + lor.delta * tl_dist    (2)

There can be scenarios where the load-branch chain might change. It could happen when a different operation sequence is taken to reach the branch. There are also situations where the delta associated with any of the branch's dependent loads might change, potentially resulting in triggering incorrect loads. During such occurrences, LDBP flushes the branch entries on the BTT, the BOT (and its associated CST entry), and the corresponding load entries on the LOR/LOT. The tracking bits on the Stride Predictor and PLQ are reset for these loads. Such an aggressive recovery scheme guarantees higher LDBP accuracy and reduces memory congestion due to unwanted trigger loads.

Trigger Load Completion:
When a trigger load completes execution, it checks for matching entries on the LOR. There could be zero or more LOR entries whose address range contains this completed request. The address matches an LOR entry if it is within the entry's address range and is a whole multiple of lor.delta from lor.ldstart. On a hit, the corresponding LOT entry stores the trigger load data in the lot.ld_data queue, and its valid bit is set. The LOT data queue index is computed using Equations 3a and 3b.

lot_id = (tl.addr - lor.ldstart) / lor.delta    (3a)
lot_index = (lor.lot_pos + lot_id) % lot.ld_data.size()    (3b)
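The trigger-address generation (Equation 2) and the LOT indexing on completion (Equations 3a/3b) can be sketched together; the queue size and the positive-stride assumption are illustrative.

```cpp
#include <cstdint>

// LOR fields used by Equations 2, 3a, and 3b (names follow the paper).
struct Lor {
    uint64_t ldstart;  // first address of the tracked range
    int64_t  delta;    // load address stride (assumed positive here)
    uint32_t lot_pos;  // slot for the current branch's data
};

constexpr uint64_t kLotQueueSize = 8;  // size of lot.ld_data (assumption)

// Equation 2: address of the tl_dist-th future trigger load.
uint64_t trigger_addr(const Lor& lor, uint64_t tl_dist) {
    return lor.ldstart + lor.delta * tl_dist;
}

// A completed trigger load matches a LOR entry if it lies inside the
// tracked range and is a whole number of deltas from ldstart.
bool matches(const Lor& lor, uint64_t addr) {
    int64_t off = static_cast<int64_t>(addr - lor.ldstart);
    return off >= 0 &&
           off < lor.delta * static_cast<int64_t>(kLotQueueSize) &&
           off % lor.delta == 0;
}

// Equations 3a/3b: which lot.ld_data slot buffers this load's data.
uint64_t lot_index(const Lor& lor, uint64_t addr) {
    uint64_t lot_id = (addr - lor.ldstart) / lor.delta;  // (3a)
    return (lor.lot_pos + lot_id) % kLotQueueSize;       // (3b)
}
```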
4) Branch at Fetch:
When a branch hits on the BOT at instruction fetch, the bot.outcome_ptr is increased by one. This is the only value speculatively updated in the LDBP fetch block. When there is a table flush due to a misprediction, a load-branch chain change, or a load delta variation, the bot.outcome_ptr gets reset to zero. The BOT outcome queue entry pointed to by the bot.outcome_ptr yields the branch's prediction.

The bot.cst_ptr proactively instigates the computation of future branch outcomes at fetch. The CST FSMs use the load data values from valid entries on the LOT. Once the outcome is computed, the corresponding bot.outcome_queue entry gets updated.
D. Spectre-safe LDBP
LDBP has been designed to avoid speculative updates, so as not to create another source of Spectre-like [19] attacks. The LDBP retirement block is only updated when instructions are no longer speculative. This means that it never holds any speculative information or potential speculative side-channel leak.

The LDBP fetch block is populated only with information from the retirement block. Even the trigger loads are sent only when a safe target branch retires. The only speculatively updated field is in the LOR table, but this table is flushed after each misprediction, and the state is rebuilt from the LDBP retirement block.

In a way, LDBP is not a new source of speculative leaks because it is only updated with safe information, and the fields updated speculatively are always flushed on any pipeline flush. The flush is necessary for performance, not only for Spectre: when the number of in-flight trigger loads changes due to flushes, the LOR must be updated. LDBP structures are not a source of speculative leaks, but loads on the speculative path can still leak unless speculative loads are protected as in [20]. The result is that LDBP is not a new source of speculative leaks, unlike most branch predictors, which get speculatively updated and are not fixed on pipeline flushes.
E. Multiple Paths Per Branch
The LDBP load-branch slices are generated at run-time, and they can cross branches. As a result, the same branch can have multiple chains or backward slices. These cases are sporadic in benchmarks from GAP, as they have large and somewhat regular patterns. Multi-path branches are slightly more common in the SPEC CINT2006 benchmarks.

The analysis performed as a part of this work shows that branches with multiple slices are not frequent, and when they occur, they tend to depend on unpredictable loads. Therefore, they are not a significant cause of concern for LDBP in these cases. Nevertheless, they can be an issue in other workloads. We leave this as future work, possibly finding benchmarks that exhibit such behavior more predominantly.

III. SIMULATION SETUP
We evaluated LDBP using a subset of SPEC 2006 and the GAP Benchmark Suite [16]. For SPEC CINT2006, we ran all the benchmarks, skipping 8 billion and modeling 2 billion instructions. Any benchmark with branch prediction accuracy less than 95% is used for our evaluation (hmmer, astar, gobmk). The other benchmarks in the SPEC CINT2006 suite already have very low MPKI; therefore, they would not be a true reflection of the impact of LDBP. We run all the GAP applications with the "-g 19 -n 30" command line input set and instrument the benchmarks to skip the initialization, as suggested by the developers of GAP. All the benchmarks are compiled with gcc 9.2 with -Ofast -flto optimization for the RISC-V RV64 ISA.

    Benchmark      Branch MPKI   IPC
    spec06 hmmer   12.9          2.42
    spec06 astar   14.9          0.89
    spec06 gobmk   13.1          1.49
    gap bfs        23.9          0.66
    gap pr          4.6          1.64
    gap tc         44.5          1.07
    gap cc         32.7          0.51
    gap bc         22.0          1.14
    gap sssp        6.2          0.89

TABLE I. Benchmarks used and their MPKI and IPC running the baseline 256-Kbit IMLI.
We use ESESC [21] as the timing simulator. The processor configuration is set to closely model an AMD Zen 2 core [10]. Table I shows the Instructions Per Cycle (IPC) and MPKI for the benchmarks investigated when running the baseline 256-Kbit IMLI predictor. To match the Zen 2 architecture, the baseline branch prediction unit has a fast (1 cycle) branch predictor and a slower but more accurate (2 cycle) IMLI branch predictor. We evaluate the baseline configuration against a 1-Mbit IMLI, and different IMLI configurations (150-Kbit, 256-Kbit, and 1-Mbit) augmented with an 81-Kbit LDBP.

Fig. 5. LDBP minimizes the mispredictions by more than 22.7% when combined with the baseline 256-Kbit IMLI.
IV. RESULTS AND ANALYSIS
In this section, we highlight the results of our study. We compare the performance and misprediction rate variations between the baseline IMLI predictor and our proposed LDBP predictor augmented to IMLI. Mispredictions Per Kilo Instructions (MPKI) is the metric used to compare the misprediction rate in this section.

Figure 5 shows the MPKI values normalized to the baseline IMLI for different branch predictor configurations. LDBP has a considerable impact on more than half of the benchmarks. On average, the IMLI 256-Kbit + LDBP predictor reduces the MPKI of the GAP and SPEC CINT2006 benchmarks by 17.9% and 27.5%, respectively. As shown in Table I, astar had the worst branch prediction accuracy in the SPEC CINT2006 suite. The most mispredicting branch in astar constitutes 22% of the benchmark's mispredictions. This branch has a direct dependency with a load, but LDBP cannot fix this branch as the address of the load feeding it has a fluctuating delta. LDBP still manages to reduce astar's total branch misses by 25.4% without fixing the most mispredicting branch. These numbers attest to the fact that a considerable proportion of hard-to-predict branches in most benchmarks depend on data from loads with a predictable address. Another observation to note is that quadrupling the size of IMLI fixes only 9.7% of branch misses from the baseline. This inference substantiates the fact that a huge TAGE-like predictor cannot efficiently capture the history of hard-to-predict data-dependent branches.

Figure 6 compares IPC changes over the baseline 256-Kbit IMLI for different branch predictor configurations. LDBP achieves an average IPC improvement of 13.7% when paired with the baseline predictor. An interesting observation is that the GAP benchmarks have a speedup of 12.5% with this configuration.

Fig. 6. LDBP (when combined with 150-Kbit IMLI or 256-Kbit IMLI) outperforms the large 1-Mbit IMLI comprehensively.
In contrast, they have a slightly better IPC gain of 12.9% over the baseline when running on 150-Kbit IMLI + LDBP. The reason for this trend is that a smaller IMLI fixes fewer branches, and LDBP fixes branches that have low confidence with IMLI; the weaker the primary predictor, the more work is left for LDBP. A 41% smaller IMLI (150-Kbit) with LDBP produced IPC and MPKI numbers similar to those of the baseline IMLI-LDBP combination. For some benchmarks, like bfs, the smaller predictor even outperformed its larger counterpart. Moreover, the 150-Kbit IMLI + 81-Kbit LDBP offers a 13.1% higher performance gain and 20% fewer branch misses than the baseline 256-Kbit IMLI for a 9.7% lower hardware budget.

The MPKI and performance improvements yielded by LDBP clearly show that hard-to-predict load-dependent branches are major contributors to overall mispredictions in benchmarks across different application suites. LDBP does not affect some benchmarks, like gobmk, sssp, and bc. This behavior can be attributed to the fact that the mispredicting branches in these benchmarks do not have a load-branch dependency that LDBP can capture. An anomaly to note in Figures 5 and 6 is the behavior of gobmk running with 150-Kbit IMLI and LDBP: the IPC decreases by 2%, and the MPKI worsens by 10%. This is because the 150-Kbit IMLI has a worse MPKI and IPC than the baseline 256-Kbit IMLI, and, in addition, LDBP does not yield any improvement for gobmk.

    L2: addi   a7,a7,1
        ld     a5,8(t5)
        bge    a7,a5,L1    // outer 'for' loop
        sext.w t6,a7
        slli   t1,a7,0x2
        ld     a5,0(a1)
        add    t1,t1,a5
        lw     a5,0(t1)
        bgez   a5,L2       // 'if' condition check
Listing 2. GAP BFS RISC-V Assembly for Listing 3
A. Benchmark Study
In this sub-section, we analyze examples from different benchmarks where LDBP works and cases where it does not.
1) Case 1: BFS (GAP Benchmark Suite):
For our first case study, we look at the Breadth-First Search (BFS) algorithm, one of the most popular graph traversal algorithms used across several domains. Listing 2 and Listing 3 show a snippet of RISC-V assembly and its corresponding source code from GAP's BFS benchmark. Here, the loop traverses all the nodes in the graph to assign a parent to each node. The arbitrary nature of the graph makes it hard to predict whether a node has a valid parent, as each node can have multiple possible edges, but the node traversal is in order. It is hard to predict parent[u], but u is easily predictable (Line 2 in Listing 3). The branch on Line 9 in Listing 2 is the most mispredicted branch in this benchmark. It contributes about 30% of all mispredictions when simulated on the baseline architecture with 256-Kbit IMLI. When we augment LDBP into this setup, it resolves about 94% of the mispredictions of this branch and reduces the overall MPKI by 59%. It is also instrumental in gaining a 38% speedup.

    for (NodeID u=0; u < g.num_nodes(); u++){
      if (parent[u] < 0){
        ..
        ..
      }
    }
Listing 3. GAP BFS Source Code Snippet
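The mechanism described above can be sketched in a few lines of C++. This is our own illustrative model, not the paper's hardware: since u advances with a constant stride, the address of parent[u] for a future iteration is known, so the load can be issued early and the precomputed outcome of the `if (parent[u] < 0)` check buffered until the branch fetches. The names `OutcomeQueue`, `trigger`, and `predict` are ours.

```cpp
#include <cstddef>
#include <deque>
#include <vector>

// Illustrative sketch (our naming): buffer precomputed outcomes for a
// load-dependent branch whose load address is stride-predictable.
struct OutcomeQueue {
    std::deque<bool> outcomes;  // precomputed taken/not-taken bits

    // Trigger the load `distance` iterations ahead of the current index u
    // and precompute the branch outcome (the 'if' in Listing 3).
    void trigger(const std::vector<int>& parent, std::size_t u,
                 std::size_t distance) {
        std::size_t future = u + distance;
        if (future < parent.size())
            outcomes.push_back(parent[future] < 0);
    }

    // At fetch time, the branch consumes the oldest buffered outcome.
    bool predict() {
        bool taken = outcomes.front();
        outcomes.pop_front();
        return taken;
    }
};
```

In the real predictor the trigger distance and the queue depth are bounded by hardware (the outcome queue sizing is discussed later in this section); the sketch only shows the produce-ahead/consume-in-order relationship.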
2) Case 2: HMMER (SPEC CINT 2006):
Listing 4 shows the RISC-V assembly section of the branch (line 8) contributing the most mispredictions in SPEC CINT2006 hmmer. It accounts for 39% of all mispredicted branches. The branch outcome depends on values from different matrices, and the randomness of the data involved makes this a very hard-to-predict branch. Each branch source operand depends on two loads. As the code traverses the matrices, the loads involved have a traceable address pattern. LDBP has to track four different loads and some intermediate ALU operations to make the prediction. LDBP fixes 67% of the mispredictions yielded by bge. Appending LDBP to the baseline IMLI improves the IPC by 29% and reduces the overall MPKI of this benchmark by 56%.

    lw   s11, 0(a3)
    lw   a3, 4(a7)
    addw a3, s11, a3
    sw   a3, 0(t3)
    lw   s10, 0(s10)
    lw   s11, 4(t1)
    addw s11, s10, s11
    bge  a3, s11, LABEL
Listing 4. SPEC CINT2006 hmmer RISC-V Assembly
3) Case 3: CC (GAP Benchmark Suite):
Listing 5 shows the code snippet containing the branch (line 5) with the most mispredictions in the CC benchmark. It constitutes a little more than one-third of all mispredictions in this benchmark. LDBP cannot capture this load-branch chain. At first glance, it might look like the branch instruction's source operands depend on two loads. On deeper introspection, we notice that the source operand (load address) of the lw instruction (recipient) on line 4 is determined by the load data of the previous lw (donor) on line 1. We refer to such a dependency as a load-load chain. Figure 1 represents a load-load chain.

    lw   a6, 0(a4)
    slli a5, a6, 0x2
    add  a5, a5, a0
    lw   a5, 0(a5)
    beq  a6, a5, LABEL
Listing 5. GAP CC RISC-V Assembly
The current LDBP setup does not support load-branch slices containing a load-load chain. If the address of the first load instruction is predictable by the stride predictor, we can use its data to prefetch the second load. Similar to the backward slice computation of the load-branch chain, we need to build a backward chain starting from the recipient load and ending with the donor. If the load-load chain is predictable, then LDBP can build the load-branch slice and generate predictions. This implementation is part of our immediate future work.
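The donor/recipient relationship in Listing 5 can be written out as a small C++ sketch. This is our own illustration (the function and variable names are hypothetical): the donor load's data becomes the index of the recipient load, which is why a single stride predictor on the recipient's address fails, and why prefetching via the donor's data, as proposed above, could work.

```cpp
#include <cstddef>
#include <vector>

// Illustrative load-load chain from Listing 5 (our naming): the donor load
// reads comp[donor_index]; its *data* forms the address of the recipient
// load. Only the donor's address follows a simple stride.
int recipient_value(const std::vector<int>& comp, std::size_t donor_index) {
    int donor_data = comp[donor_index];                 // lw a6, 0(a4) (donor)
    return comp[static_cast<std::size_t>(donor_data)];  // lw a5, 0(a5) (recipient)
}
```

The branch in Listing 5 then compares `donor_data` against the recipient's value, so precomputing its outcome requires resolving both loads of the chain ahead of time.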
TABLE II. Overall LDBP size is 81-Kbit.

    Structure Name      No. of Entries    Total Size (Kbit)
    Stride Predictor    48                2
    …                   …                 …
    Total LDBP Size                       81.06
B. LDBP Table Sizing
In this sub-section, we explain the methodology used to size the tables in LDBP. We analyze the variation in MPKI for different numbers of entries in each structure in the predictor. Here, MPKI is the average MPKI of the benchmarks used. We define a baseline infinite LDBP predictor; the infinite LDBP has 512 entries in each table. When the MPKI sensitivity to a table's size is analyzed, all other tables in LDBP have 512 entries. Such an approach ensures a fair estimation of the table's impact on LDBP accuracy. A 2% MPKI increase over the infinite LDBP is the cut-off used to determine the ideal table size. Table II shows the overall size of LDBP and the breakdown of individual table sizes.

The overall size of LDBP is 81-Kbit. As a reference, the IMLI predictor used is 256-Kbit. The fetch block in a processor like Zen 2 also includes a 32-KByte instruction cache and two-level BTBs with 512 and 7K entries. The largest LDBP table is the LOT, which can use area-efficient single-port SRAMs.
Fig. 7. 48 entries are sufficient in the Stride Predictor and PLQ to achieve prediction accuracy varying by less than 2% from the infinite LDBP.
1) Stride Predictor and PLQ Sizing:
Figure 7 shows the impact of the number of Stride Predictor and PLQ entries on MPKI. We can see that the MPKI degradation exceeds 2% when the number of entries drops to around 32. With fewer stride predictor and PLQ entries, a load tracked as part of a hard-to-predict load-dependent branch's chain can be evicted to make way for a new incoming load, and LDBP cannot determine if a load is trigger-worthy if it is not in the stride predictor table. Entry counts larger than 64 have a negligible effect on the MPKI. The stride predictor and PLQ therefore have 48 entries each, as this offers the best balance between MPKI and hardware size.
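The 2% cut-off rule used throughout this sub-section can be stated as a few lines of code. This is our own sketch of the selection rule, not tooling from the paper; the helper name and the sweep format (entry count, measured MPKI) are assumptions.

```cpp
#include <utility>
#include <vector>

// Sketch of the sizing rule (our naming): given an MPKI sweep sorted by
// ascending entry count, pick the smallest table size whose average MPKI
// stays within 2% of the "infinite" (512-entry) LDBP configuration.
std::size_t pick_table_size(
    const std::vector<std::pair<std::size_t, double>>& sweep,
    double infinite_mpki) {
    for (const auto& point : sweep)
        if (point.second <= infinite_mpki * 1.02)  // 2% MPKI cut-off
            return point.first;
    return sweep.back().first;  // fall back to the largest size swept
}
```

Applied to the Figure 7 data, this rule lands between 32 entries (which exceeds the cut-off) and 64 (which gains nothing), hence the 48-entry choice.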
2) LOR and LOT Sizing:
Figure 8 plots the effect of varying LOR size on MPKI. There is a sharp increase in MPKI when the number of trigger loads tracked is less than 16. At the 2% cut-off point, the LOR and LOT have around 12 entries. To stay clear of this sharp rise in MPKI, we allocate 16 entries to both the LOR and LOT. The necessity to store the complete load data contributes to the large size of the LOT. The number of entries in the LOT data queue is determined by how proactively LDBP wants to predict branches and trigger the associated loads.
Fig. 8. Tracking 16 loads on the LOR and LOT is adequate to maintain high LDBP accuracy.

Fig. 9. The LOT Data Queue and Outcome Queue require 64 entries each.

Fig. 10. Effectiveness of LDBP remains steady for different numbers of entries on the BOT and BTT.

Fig. 11. LDBP must track at least 5 loads to maintain healthy prediction accuracy.

Fig. 12. Most LDBP chains have 3 ALU operations between the loads and branch.
Fig. 13. Each CSB index must have 4 sub-entries to capture the LDBP backward slice.

The number of entries in the BOT's outcome queue matches the LOT data queue entries. The sizing of the BOT outcome queue is discussed in Section IV-B3. Some load-dependent branches may consume two or more trigger loads, so a bottleneck on the number of trigger loads tracked has a direct implication on the effectiveness of LDBP. In most cases, the load-dependent branch tends to be the entry point to a huge loop. In such cases, it is sufficient for LDBP to track just one branch and its associated trigger loads. Therefore, a reasonably small to medium number of entries in the LOR and LOT is adequate to maintain LDBP accuracy.
3) Outcome Queue/LOT Data Queue Sizing:
The outcome queue is part of the BOT. The criticality of the outcome queue in the overall scheme of LDBP warrants optimal sizing. The number of entries in this queue corresponds to the number of future outcomes trackable for a given branch PC. The outcome queue entries directly determine the number of entries in the LOT data queue: it is sufficient for the LOT data queue to have as many entries as the branches tracked by the outcome queue. From Figure 9, the ideal number of outcome queue entries at the cut-off point is 64. As the outcome queue size decreases, the MPKI increase gets steeper; a smaller outcome queue inhibits the ability of LDBP to trigger loads with a higher prefetch distance. On the flip side, outcome queue sizes larger than 64 hit an MPKI plateau.
4) BOT and BTT Sizing:
Figure 10 shows the variation of MPKI for different sizes of the BOT and BTT. Just like the LOR and LOT, a small to medium number of entries in the BOT and BTT is sufficient to track almost every load-dependent branch in an application. These branches are usually part of large loops, which give LDBP adequate time to capture a new branch-load chain even if it replaces an already existing entry in the tables. The correlation between the number of entries and MPKI shows very minimal variation. Therefore, it is sufficient to have just 8 entries in the BOT and BTT.
5) CSB and CST Sizing:
The CSB builds the load-branch slice. It is critical to size this table optimally to keep LDBP's hardware budget in check. Figures 11 and 12 show the change in MPKI for different load and ALU operation thresholds in an LDBP chain. Five loads and three ALU operations are needed to ensure maximum LDBP efficiency. These figures reflect the cumulative number of operations tracked across both source operands of a branch instruction; each individual source operand might need to track fewer operations.

Figure 13 shows the number of sub-entries needed by each CSB index. This figure clearly shows that it is sufficient for each branch source register to track four operations to support an LDBP chain with a maximum of eight operations. There are 32 entries in the CSB, and each entry tracks four operations, so the total size of the CSB is 4-Kbit. The CST caches the backward slice of each branch. As there are 8 entries in the BOT, the CST must have 8 entries with 8 sub-entries (4 sub-entries for each branch source operand).

C. LDBP Gating and Energy Implications
LDBP has significant performance gains, but some benchmarks (gobmk, sssp, bc) do not benefit. We evaluate the effectiveness of gating LDBP when it is infrequently used, to save energy.

We gate (low-power mode) every component of LDBP apart from the Stride Predictor and RTT whenever there is a window of 100,000 or more clock cycles in which LDBP did not predict any branch. We refer to this phase as the LDBP low-power mode. As shown in Table III, for bc and sssp, LDBP remains in low-power mode for 99.5% and 98.2% of the benchmark's execution time, respectively. Gating offers a considerable reduction in the energy dissipated by LDBP, as the predictor remains in low-power mode for 38.5% of the average execution time across all benchmarks, and LDBP gating does not have any negative effect on the prediction accuracy of LDBP.

TABLE III. Proportion of execution time in LDBP low-power mode.

    Benchmark        % time in low-power mode
    spec06 hmmer     0
    …                …
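The gating policy just described amounts to a saturating idle counter. The following sketch is our own model of that policy (the struct and field names are hypothetical; only the 100,000-cycle threshold comes from the text): any prediction wakes the gated tables, and a long enough idle stretch puts everything except the Stride Predictor and RTT into low-power mode.

```cpp
#include <cstdint>

// Illustrative model of LDBP gating (our naming): count cycles with no LDBP
// prediction; after 100,000 idle cycles, gate everything except the Stride
// Predictor and RTT. A prediction resets the counter and wakes the tables.
struct LdbpGate {
    static constexpr std::uint64_t kIdleThreshold = 100000;
    std::uint64_t idle_cycles = 0;
    bool low_power = false;

    void tick(bool predicted_this_cycle) {
        if (predicted_this_cycle) {
            idle_cycles = 0;
            low_power = false;  // wake the gated tables
        } else if (++idle_cycles >= kIdleThreshold) {
            low_power = true;   // gate all tables but Stride Predictor / RTT
        }
    }
};
```

Because the Stride Predictor and RTT stay on, the predictor keeps learning load strides while gated, which is why gating costs no prediction accuracy.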
Haj-Yihia et al. [22] present a detailed breakdown of core power consumption for high-performance modern CPUs running SPEC CINT2006 benchmarks. We use the data presented in their work to estimate the core energy dissipation for our analysis. For our baseline energy model, we replicate the core power breakdown given in [22] for hmmer, astar, and gobmk. For the GAP benchmarks, we use the average power breakdown of the SPEC CINT2006 benchmarks given in [22]. This broad-spectrum power model based on SPEC CINT2006 is good enough to capture the energy dissipation behavior of the GAP benchmarks with a good level of accuracy.

The Energy Per Access (EPA) for IMLI and LDBP was calculated using CACTI 6.0 [23]. For IMLI, we model an ideal structure with a single port. LDBP has 55% lower EPA than IMLI even if we assume all the tables are accessed when not in low-power mode, which is not the case in reality. There is a 10.9% average increase in DL1 accesses with LDBP, which results in an equivalent escalation in energy on the memory sub-system. The 22.7% decrease in MPKI when using LDBP compensates for this increase in energy dissipation: lower MPKI implies less energy spent executing the wrong branch path. We do not account for the energy saved due to reduced wrong-path execution in the LDBP energy estimation numbers. In addition, we do not account for the energy reduction from the 13.3% shorter execution time when using LDBP, even though reducing execution time reduces energy.
Our pessimistic energy estimation model for LDBP does not consider either of these savings.

Fig. 14. LDBP maintains a favorable energy-performance tradeoff.

Figure 14 shows the energy-performance tradeoff of IMLI + LDBP compared to the baseline 256-Kbit IMLI. The IPC boost outweighs the increase in energy dissipation for the majority of the benchmarks that benefit from LDBP. Benchmarks like bc and sssp have only about 2% energy overhead, as the RTT and Stride Predictor remain active even in low-power mode. Interestingly, LDBP predicts only a negligible proportion of branches in gobmk, yet it contributes 8% more energy use. This is because LDBP resolves multiple low-frequency branches spread across different execution phases, so gobmk does not offer a consistent low-power-mode phase for LDBP. More aggressive clock gating with retention state or smarter phase learning could further improve the gobmk case, but we leave that as future work.

D. Impact of Triggering Loads on LDBP Performance Gains
Fig. 15. Triggering loads does not offer any unfair gains to LDBP.
Figure 15 shows the normalized speedup of LDBP over the 256-Kbit IMLI with three different Zen 2 core configurations: first, the default Zen 2 core used for evaluation in the rest of this paper; second, the default core with a standard stride prefetcher; and third, the default core with a perfect DL1 cache. We can notice that the IPC numbers are almost identical across all three configurations for most benchmarks. This clearly shows that prefetching trigger loads in LDBP does not give it an unfair advantage over the standalone IMLI predictor. Perhaps more importantly, Figure 15 shows that the LDBP benefits are consistent, independent of memory sub-system improvements.
E. Trigger Load Timeliness
In this sub-section, we focus on the trigger load prefetch distance and its importance in achieving optimum LDBP timeliness. We use Listing 1 to highlight the criticality of timely trigger loads. This example is the vector traversal problem discussed in Section I. In the scenario we discuss, assume it takes six cycles to load data from the vector and there are ten inflight load-branch iterations. As the load address has a delta of 8, to achieve an IPC of 1, we need to send the new trigger load at least 16 iterations ahead. If the current load address is x, LDBP triggers a load with a distance of 16 deltas (x + 16 × 8).
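The timeliness arithmetic above can be written out directly. This is our own worked sketch of the example (the function name and the latency-plus-inflight decomposition of the distance are our assumptions; the text only states that 6 cycles of load latency and 10 inflight iterations require a distance of 16):

```cpp
#include <cstdint>

// Illustrative trigger-address computation (our naming): with load latency
// of 6 cycles and 10 inflight load-branch iterations, the trigger load must
// run 6 + 10 = 16 iterations ahead; each iteration advances the address by
// the observed stride (delta = 8 in the text's example).
std::uint64_t trigger_address(std::uint64_t current_addr, std::uint64_t delta,
                              unsigned load_latency, unsigned inflight_iters) {
    unsigned distance = load_latency + inflight_iters;  // 16 in the example
    return current_addr + static_cast<std::uint64_t>(distance) * delta;
}
```

So for a current address x, the triggered address is x + 16 × 8 = x + 128, matching the example in the text.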
Fig. 16. MPKI variation vs Trigger Load Timeliness.
Figure 16 shows the MPKI of the different benchmarks recorded with the baseline IMLI predictor. The portion of each bar shaded in green indicates the number of mispredictions fixed by LDBP due to the timely execution of trigger loads. We can see the correlation between the number of predictable loads in a benchmark and LDBP's effectiveness. The timeliness of these predictable loads helps exploit the maximum potency of LDBP. Only for the tc benchmark is a significant portion of the loads not predictable. Though we optimized the methodology to trigger loads, the outliers can be attributed to changes in the load delta, which create considerable delay due to relearning time. Another potential reason could be memory bandwidth congestion. Minimizing the number of delayed trigger loads could lead to a significant MPKI reduction.

V. RELATED WORK
Branch prediction accuracy has improved severalfold since the counter-based bimodal predictor [24]. The ensuing works on branch prediction gradually raised the bar for prediction accuracy. Yeh and Patt came up with two-level branch predictors [25], and McFarling [26] proposed optimizations over their work. These works leverage the high correlation between the outcome of the current branch and the history of previous branch outcomes.

PPM-like [27] and TAGE [5] predictors achieve higher prediction accuracy by tracking longer histories. They use multiple prediction tables, each indexed by a longer global history than its preceding table. TAGE-based predictors are the state-of-the-art and offer very high prediction accuracy. However, they fail to capture the outcome correlation of branches with an irregular periodicity, or when a branch outcome history is too long or too random to capture. Statistical corrector [28] and IMLI [11] components are augmented to TAGE to mitigate some of these mispredictions.

Several studies and extensive workload analyses have identified different types of hard-to-predict branches and ways to resolve them. Sherwood et al. [29] and Morris et al. [30] proposed prediction mechanisms to tackle loop-termination branches. The Wormhole predictor [31] improved on earlier loop-based predictors to handle branches enclosed within nested loops and branches exhibiting correlation across different iterations of the outer loop.

Branches dependent on random data from load instructions contribute a high percentage of mispredictions with TAGE-based predictors. It is impossible to capture the history of such branches competently, even with an unusually large predictor. Prior works [32], [33] show that using data values as an input to the branch predictor improves the misprediction rate. Farooq et al. [14] note that some hard-to-predict data-dependent branches manifest a specific pattern of a store-load-branch chain.
They leverage this observation to mark the stores that are in the chain at compile time and compute branch conditions based on the values of the marked stores at run time in hardware. We tackle a similar problem, but our work is based on the observation that a considerable proportion of hard-to-predict data-dependent branches depend on loads whose addresses are very predictable. Moreover, we do not make any modifications to the ISA. Gao et al. [13] proposed a closely related work. They correlate the branch outcome with the load address and provide a prediction based on the confidence of the correlation. Our approach differs in that we precalculate the branch outcomes by triggering loads that are part of the branch's dependence chain and have a highly predictable address.

VI. CONCLUSIONS
As shown by the benchmarks evaluated in our work, branch outcomes dependent on arbitrary load data are hard to predict and contribute most mispredictions. They have poor prediction accuracy with current state-of-the-art branch predictors. These branch patterns are common in data structures like vectors, maps, and graphs. We propose the Load Driven Branch Predictor (LDBP) to eliminate the misses contributed by this class of branches. LDBP exploits the predictable nature of the addresses of the loads on which these hard-to-predict branches depend and triggers the dependent loads ahead of time. The triggered load data is used to precompute the branch outcome. With LDBP, programmers can traverse large vectors or maps, perform data-dependent branches, and still have near-perfect branch prediction.

LDBP has minimal hardware and power overhead and does not require any changes to the ISA. Our experimental results show that, compared to the standalone 256-Kbit IMLI predictor, the combination of the 256-Kbit IMLI and LDBP shrinks the branch MPKI by 22.7% and improves the IPC by 13.7%. The efficiency of LDBP also allows for a smaller primary predictor: a 150-Kbit IMLI + LDBP predictor yields a performance improvement of 13.1% and 20% fewer mispredictions compared to the baseline 256-Kbit IMLI.

Another opportunity this work provides is to extend the use of graphs further. As the GAP benchmark suite results show, LDBP can significantly improve the performance of graph traversals. There is an extensive set of works exploring graphs for neural networks [34], for which LDBP could help boost performance.

VII. ACKNOWLEDGEMENTS
This work was supported in part by the National Science Foundation under grant CCF-1514284. Any opinions, findings, and conclusions or recommendations expressed herein are those of the authors and do not necessarily reflect the views of the NSF. This material is based upon work supported by, or in part by, the Army Research Laboratory and the Army Research Office under contract/grant W911NF1910466.

REFERENCES

[1] D. A. Jiménez and C. Lin, "Dynamic branch prediction with perceptrons," in Proceedings of the Seventh International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 2001, pp. 197–206.
[2] D. A. Jiménez, "Fast path-based neural branch prediction," in Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, 2003, p. 243.
[3] Y. Ishii, "Fused two-level branch prediction with ahead calculation," Journal of Instruction-Level Parallelism, vol. 9, pp. 1–19, 2007.
[4] D. A. Jiménez and C. Lin, "Neural methods for dynamic branch prediction," ACM Transactions on Computer Systems (TOCS), vol. 20, no. 4, pp. 369–397, 2002.
[5] A. Seznec, "A case for (partially)-tagged geometric history length predictors," Journal of Instruction-Level Parallelism, 2006.
[6] A. Seznec, "A new case for the TAGE branch predictor," in Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE, 2015, pp. 347–357.
[12] C.-K. Lin and S. J. Tarsa, "Branch prediction is not a solved problem: Measurements, opportunities, and future directions," 2019.
[13] H. Gao, Y. Ma, M. Dimitrov, and H. Zhou, "Address-branch correlation: A novel locality for long-latency hard-to-predict branches." IEEE, 2008, pp. 74–85.
[14] M. U. Farooq, Khubaib, and L. K. John, "Store-load-branch (SLB) predictor: A compiler assisted branch prediction for data dependent branches." IEEE, 2013, pp. 59–70.
[15] J. W. Fu, J. H. Patel, and B. L. Janssens, "Stride directed prefetching in scalar processors," ACM SIGMICRO Newsletter, vol. 23, no. 1-2, pp. 102–110, 1992.
[16] S. Beamer, K. Asanović, and D. Patterson, "The GAP benchmark suite," arXiv preprint arXiv:1508.03619, 2015.
[17] J. L. Henning, "SPEC CPU2006 benchmark descriptions," ACM SIGARCH Computer Architecture News, vol. 34, no. 4, pp. 1–17, 2006.
[18] A. Moshovos, D. Pnevmatikatos, and A. Baniasadi, "Slice-processors: An implementation of operation-based prediction," in International Conference on Supercomputing, Sorrento, Italy, Jun. 2001, pp. 321–334.
[19] P. Kocher, D. Genkin, D. Gruss, W. Haas, M. Hamburg, M. Lipp, S. Mangard, T. Prescher, M. Schwarz, and Y. Yarom, "Spectre attacks: Exploiting speculative execution," arXiv preprint arXiv:1801.01203, 2018.
[20] C. Sakalis, S. Kaxiras, A. Ros, A. Jimborean, and M. Själander, "Efficient invisible speculative execution through selective delay and value prediction," in Proceedings of the 46th International Symposium on Computer Architecture, 2019, pp. 723–735.
[21] E. K. Ardestani and J. Renau, "ESESC: A fast multicore simulator using time-based sampling," in Proceedings of the IEEE 19th International Symposium on High Performance Computer Architecture, ser. HPCA '13. Washington, DC, USA: IEEE Computer Society, Feb. 2013, pp. 448–459. [Online]. Available: http://dx.doi.org/10.1109/HPCA.2013.6522340
[22] J. Haj-Yihia, A. Yasin, Y. B. Asher, and A. Mendelson, "Fine-grain power breakdown of modern out-of-order cores and its implications on Skylake-based systems," ACM Transactions on Architecture and Code Optimization (TACO), vol. 13, no. 4, pp. 1–25, 2016.
[23] N. Muralimanohar, R. Balasubramonian, and N. P. Jouppi, "CACTI 6.0: A tool to model large caches," HP Laboratories, vol. 27, p. 28, 2009.
[24] J. E. Smith, "A study of branch prediction strategies," in Proceedings of the 8th Annual Symposium on Computer Architecture, ser. ISCA '81. Washington, DC, USA: IEEE Computer Society Press, 1981, pp. 135–148.
[25] T.-Y. Yeh and Y. N. Patt, "Two-level adaptive training branch prediction," pp. 51–61, Nov. 1991.
[26] S. McFarling, "Combining branch predictors," Tech. Rep. TN-36, Jun. 1993.
[27] P. Michaud, "A PPM-like, tag-based branch predictor," Journal of Instruction-Level Parallelism, vol. 7, pp. 1–10, Apr. 2005.
[28] A. Seznec, "A 64 Kbytes ISL-TAGE branch predictor," 2011.
[29] T. Sherwood and B. Calder, "Loop termination prediction," in Proceedings of the Third International Symposium on High Performance Computing, ser. ISHPC '00. Berlin, Heidelberg: Springer-Verlag, 2000, pp. 73–87.
[30] D. Morris, M. Poplingher, T.-Y. Yeh, M. P. Corwin, and W. Chen, "Method and apparatus for predicting loop exit branches," 2002.
[31] J. Albericio, J. S. Miguel, N. E. Jerger, and A. Moshovos, "Wormhole: Wisely predicting multidimensional branches," in Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO-47. USA: IEEE Computer Society, 2014, pp. 509–520. [Online]. Available: https://doi.org/10.1109/MICRO.2014.40
[32] T. H. Heil, Z. Smith, and J. E. Smith, "Improving branch predictors by correlating on data values," in Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture, ser. MICRO-32, 1999, pp. 28–37.
[33] L. Chen, S. Dropsho, and D. H. Albonesi, "Dynamic data dependence tracking and its application to branch prediction," in Proceedings of the 9th International Symposium on High-Performance Computer Architecture, ser. HPCA '03. USA: IEEE Computer Society, 2003, p. 65.
[34] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu, "A comprehensive survey on graph neural networks," 2019.