SLAP: A Split Latency Adaptive VLIW pipeline architecture which enables on-the-fly variable SIMD vector-length
© 20XX IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Ashish Shrivastava⋆†, Alan Gatherer⋆‡, Tong Sun∥, Sushma Wokhlu∥, Alex Chandra∥
⋆Wireless Access Lab, Futurewei Technologies Inc
†Senior Member, IEEE  ‡Fellow, IEEE
ABSTRACT
Over the last decade the relative latency of access to shared memory by multicore processors has increased, as wire resistance came to dominate latency and low wire density layouts pushed multi-port memories farther away from their ports. Various techniques were deployed to improve average memory access latencies, such as speculative prefetching and branch prediction, often leading to high variance in execution time, which is unacceptable in real-time systems. Smart DMAs can be used to directly copy data into a layer-1 SRAM, but with overhead. The VLIW architecture, the de-facto signal-processing engine, suffers badly from a breakdown in lock-step execution of scalar and vector instructions. We describe the Split Latency Adaptive Pipeline (SLAP) VLIW architecture, a cache performance improvement technology that requires zero change to object code, while removing smart DMAs and their overhead. SLAP builds on the Decoupled Access and Execute (DAE) concept by 1) breaking lock-step execution of functional units, 2) enabling variable vector length for variable data-level parallelism, and 3) adding a novel triangular-load mechanism. We discuss the SLAP architecture and demonstrate the performance benefits on real traces from a wireless baseband system (where even the most compute-intensive functions suffer from an Amdahl's-law limitation due to a mixture of scalar and vector processing).
1. INTRODUCTION & PRIOR WORK
In wireless baseband SoC architectures [1][2][3], programmable compute engines are widely used in physical layer implementation and are split into three categories: 1) control/scalar dominant (integer arithmetic), requiring a CPU; 2) control/scalar along with floating-point signal processing with limited data-level parallelism, requiring narrow vector floating-point units; and 3) control/scalar processing along with heavy floating-point signal processing (with higher data-level parallelism), requiring wider vector floating-point units. VLIW has become the de-facto processor technology for narrow-SIMD and wider-SIMD processors, but it introduces a fundamental limitation if implemented traditionally (i.e., with all functional units in lock-step execution) when the algorithm calls for a mixture of scalar functional units with integer arithmetic (and therefore a shallower pipeline) and vector/SIMD functional units with floating-point capabilities (and a deeper execute pipeline), with strict dependencies between scalar and vector functionality.
∥Authors performed the work while at Futurewei Technologies Inc.
Fig. 1. Processor with hierarchical memory architecture

For example, scalar functional units calculate addresses for the floating-point data that will be processed by vector functional units. For wireless baseband, relaxing the VLIW's lock-step coupling between the scalar and vector parts reduces cycle count and makes the architecture scalable to different vector lengths. To support all the required 5G physical layer numerology and use-cases, the memory infrastructure is hierarchical, non-uniform, non-coherent and distributed, with on-chip multi-layer caches and SRAMs as well as off-chip DRAMs [4][5][6][7]. In Fig. 1.c we see that a non-data-cached system [6] allows the VLIW machine to run in an efficient and predictable manner, but with the overhead of programmed data movement operations in the DMA. This impacts power and latency unless the data movement can be pipelined perfectly with the computation, a task that proves to be very difficult due to the runtime variety and complexity of a modern modem. For this reason, all the major suppliers of baseband VLIW DSPs [4][7] provide cache support, as shown in Fig. 1.a. The code becomes more portable and robust, but performance degrades due to misses on data accesses, because data access dimensionality is flexible at runtime and there are usually some "rogue" parameters that need to be accessed within the loop, which may come from tables too big to be stored in L1 cache, or perhaps even in L2. It is therefore difficult to hand-optimize the loops or successfully use speculative prefetching techniques, as shown in Fig. 1.b. Much of the baseband physical layer code does not show "temporal locality of reference", and the "spatial locality of reference" gets polluted at runtime, leading to very minimal benefit from complex cache logic. To address these concerns, [8] combines ideas from DAE [9] and decoupled vector architectures [10] to improve data access for large SIMD, as shown in Fig. 1.d, where prefetching logic on a single processor accesses scalar and vector data. [11] identifies the load instructions that drive cache misses, allowing selective prefetching, but this assumes static and predictable code traversal. [12] follows Fig. 1.e, with scalar threads accessing memory via L1 caches and vector threads bypassing the L1 cache, but this requires vector prefetch optimization in the code. SLAP also follows Fig. 1.e, but differs from [12] in that we allow multiple lanes to access memory in an unsynchronized manner, using elastic queues to ensure correct operation. SLAP's split pipeline architecture uses the DAE principle but does so in a code-transparent manner, whereas [9] is an accelerator implemented as a vector extension alongside a separate control processor: the scalar and vector units access memory directly while the control processor has a data cache, so the decoupling of scalar and vector processing happens at compile time with two sets of programs, one for control and another for vector execution. Similarly, [13] implements DAE [9] with two sets of processors and two sets of programs, producing the same scheduling problems that appear when there is a separate L1 load engine.
SLAP's architecture, on the other hand, has a single program optimized at compile time (like traditional VLIW), and the vector pipeline splits away at runtime in the hardware (including vector load/store instructions), allowing for better code portability and object code compatibility. [14] proposes split-issue in conjunction with a delay buffer and reservation stations to support a dynamic scheduling mechanism, with a purpose similar to SLAP's. But it requires costly (in terms of power and area) reservation stations to support different access policies and out-of-order pipeline execution. SLAP's split pipeline architecture is in-order and shows significant performance improvement with no additional power and area overhead.
Fig. 2. Typical VLIW processor pipeline stages
2. ARCHITECTURE OVERVIEW
We compare our SLAP VLIW processor to the in-house VLIW DSP from which it is derived, adding SLAP features without changing the pipeline. This allows common object code for a fair comparison. It is also important to minimize changes to object code, as such changes place a heavy burden on code maintenance. Fig. 2 shows typical VLIW pipeline stages, with a 3-phase fetch stage (addressing, memory access, and data), instruction dispatch (DS), instruction decode (DC), and an execute pipeline with different stages for scalar processing and SIMD floating-point processing. Note that the pipeline requires functional units to execute in lock-step. The SIMD floating-point pipeline executes matrix and vector operations and differs dramatically from the scalar integer arithmetic pipeline, which typically executes address calculations for memory operations and software-pipelined loop counts. There is relatively little data exchange between these pipelines, which traditionally operate in lock-step. So when a scalar operation is stalled (due to memory read operations) it stalls the vector SIMD functional units, and vice versa.
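To make the cost of lock-step execution concrete, here is a minimal C++ sketch (our illustration with hypothetical names, not the authors' cycle-accurate model): an instruction packet issues only when every functional unit is ready, so a single scalar unit waiting on memory stalls the vector units as well.

```cpp
#include <vector>

// Minimal sketch of lock-step issue: the packet advances only when EVERY
// functional unit is ready.
struct FunctionalUnit {
    bool ready;  // false while, e.g., waiting on a memory read
};

bool issue_lockstep(const std::vector<FunctionalUnit>& units) {
    for (const auto& u : units) {
        if (!u.ready) {
            return false;  // one stalled unit holds back the whole VLIW packet
        }
    }
    return true;  // all functional units advance together
}
```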
For wireless baseband, memory communication latency stalls significantly impact performance by stalling both the scalar and vector pipelines. Techniques to improve average memory stalls, such as speculative/selective prefetching and branch prediction, are very complex and often under-utilized, while floating-point operations have minimal data exchange with scalar operations. These observations point to the potential benefit of relaxing the timing constraints between the scalar and vector pipelines by allowing variable-lag (but controlled) execution, improving throughput by mitigating the impact of costly memory communication delays. This is achieved in SLAP by "dispatching" (during the instruction dispatch pipeline stage) the vector arithmetic operations into a vector instruction buffer (with appropriate control logic to throttle the vector instruction flow).

Fig. 3. SLAP based VLIW Architecture

This is shown in Fig. 3 for the scalar pipeline, called the GPCU (Global Program Control Unit), and a single vector pipeline, called a CU (Compute Unit). Other CUs are easily added, attaching to the GPCU and memory in the same way, with each CU having an independent port to memory. SLAP allows variable offset/lag execution between the GPCU and the CU, up to the depth of the vector instruction buffer. The vector instruction buffers are implemented as low-power FIFOs, as shown in Section 3. Typically a single load/store functional unit performs scalar and vector loads/stores and address updates, along with register forwarding to the scalar and vector register files. In SLAP, the load/store pipeline is split: one path for scalar arithmetic (along with scalar-register forwarding logic) and another for vector floating-point arithmetic (along with vector-register forwarding). Address generation for the CU (vector floating-point operations) is part of the scalar functionality, so all address calculations for CU loads/stores are done in the GPCU, and the addresses are transferred to the CU via the vector load address buffer and the vector store address buffer. Vector load/store instructions are handled in three parts: 1) the GPCU manages the addresses for vector-SIMD loads/stores, as well as implementing a "triangular load" mechanism [15] for vector loads; 2) the CU interprets vector SIMD load instructions as reads from the SLAP data memory; and 3) the CU interprets vector store instructions as stores with the address popped from the write address buffer.
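A rough sketch of this decoupling, under our reading of the description above (all names such as `SlapQueues` are ours, not from the paper): the GPCU computes every CU load/store address itself, pushes vector work and addresses into elastic FIFOs, and stalls only on backpressure, while the CU drains the buffers at its own pace.

```cpp
#include <cstddef>
#include <cstdint>
#include <queue>

// Hypothetical model of SLAP's GPCU-to-CU buffers; depths and types are ours.
struct VectorOp {
    uint32_t opcode;
    bool isMemOp;   // vector load/store (address supplied by the GPCU)
    bool isLoad;
};

struct SlapQueues {
    static constexpr std::size_t DEPTH = 32;  // FIFO sweet spot from Section 3
    std::queue<VectorOp> vecInstr;            // vector instruction buffer
    std::queue<uint64_t> loadAddr;            // vector load address buffer
    std::queue<uint64_t> storeAddr;           // vector store address buffer
};

// GPCU side: dispatch a vector operation; returns false (GPCU stalls) when
// the instruction buffer is full. Addresses are computed by the GPCU's
// scalar units and travel in their own FIFOs.
bool gpcu_dispatch(SlapQueues& q, const VectorOp& op, uint64_t addr = 0) {
    if (q.vecInstr.size() >= SlapQueues::DEPTH) return false;
    q.vecInstr.push(op);
    if (op.isMemOp) (op.isLoad ? q.loadAddr : q.storeAddr).push(addr);
    return true;
}

// CU side: pop the next vector instruction if one is available; an empty
// buffer simply idles the CU without affecting the GPCU.
bool cu_pop(SlapQueues& q, VectorOp& out) {
    if (q.vecInstr.empty()) return false;
    out = q.vecInstr.front();
    q.vecInstr.pop();
    return true;
}
```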
SLAP's triangular load is a novel way of implementing DAE [9]. In a typical VLIW processor, a load functional unit issues the memory read instructions and receives the data into the shared register files for consumption. If the data is not available within the pipeline stages of the read instruction, the whole processor stalls. The triangular load allows the GPCU, running ahead of the CU, to issue reads and have the data return to the CU data memory CAM (Content-Addressable Memory [16]), as shown in Fig. 3.
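The following is a minimal sketch of the triangular load as we read it, with hypothetical names (the paper specifies only that returned data lands in a CAM in the CU's data memory): the GPCU issues the read early, the returned data is parked in the CAM keyed by address, and the CU's later vector load looks it up, so only the CU waits for in-flight data.

```cpp
#include <cstdint>
#include <optional>
#include <unordered_map>

// Hypothetical triangular-load CAM model; the real design is hardware [15][16].
class TriangularLoadCAM {
    std::unordered_map<uint64_t, uint32_t> cam_;  // address -> returned data
public:
    // Memory system returns data for a read the GPCU issued ahead of time.
    void fill(uint64_t addr, uint32_t data) { cam_[addr] = data; }

    // CU executes its vector load: a hit consumes the entry; a miss means
    // the data is still in flight and only this CU has to wait.
    std::optional<uint32_t> lookup(uint64_t addr) {
        auto it = cam_.find(addr);
        if (it == cam_.end()) return std::nullopt;
        uint32_t data = it->second;
        cam_.erase(it);
        return data;
    }
};
```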
Fig. 4. SLAP based Variable SIMD Vector architecture
One of the key features of the VLIW architecture, compiler-controlled static scheduling, is actually a curse in disguise. Basestation applications require different categories of computation (scalar-only, narrow-SIMD and wide-SIMD). This leads to multiple compiler/debug toolchains, one for each type of VLIW, even though each is derived from the baseline VLIW processor. Since the software uses intrinsics [17] to specify the SIMD instructions and unique datatypes [18], a lot of software needs to be rewritten and verified when migrating from one VLIW processor to another. SLAP removes this issue by enabling variable vector/SIMD length for variable data-level parallelism with a single compiler/debug toolchain. SLAP allows a variable SIMD/vector VLIW architecture via dynamic resource sharing, as shown in Fig. 4, by allowing the association of GPCUs to CUs dynamically, enabling variable vector length on the fly. Fig. 4 shows two GPCUs sharing 8 CUs in a dynamic configuration where GPCU1 controls 3 CUs and GPCU2 controls 5 CUs. If each CU supports SIMD4, then GPCU1 processes SIMD12 and GPCU2 processes SIMD20. The number of CUs assigned depends on the number of distinct data sets to process. The interface between the GPCUs and the CUs is a FIFO of instructions at each CU. Instructions are pushed onto the FIFOs simultaneously by the GPCU, but each CU may pop instructions at its own speed as it is ready to execute them. If an instruction queue fills, the GPCU stalls and does not dispatch any more instructions until space is available in all of its CU instruction queues. If a CU instruction queue empties, the CU stalls and waits for additional instructions from the GPCU. Both of these cases may happen during normal operation.
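A sketch of this dispatch discipline under our reading (the names and the handover detail below are our assumptions): the GPCU broadcasts each instruction to the FIFOs of all CUs currently assigned to it, stalling if any FIFO is full, and the effective SIMD width is simply the number of assigned CUs times the per-CU width.

```cpp
#include <cstddef>
#include <cstdint>
#include <queue>
#include <vector>

// Hypothetical model of variable vector length via GPCU-to-CU assignment.
struct Instr { uint32_t word; };

struct CU {
    static constexpr std::size_t SIMD_WIDTH = 4;  // SIMD4 per CU, as in Fig. 4
    std::queue<Instr> fifo;                       // per-CU instruction FIFO
};

struct GPCU {
    static constexpr std::size_t FIFO_DEPTH = 32;
    std::vector<CU*> assigned;   // e.g. 3 CUs here -> SIMD12; 5 CUs -> SIMD20

    std::size_t simd_width() const { return assigned.size() * CU::SIMD_WIDTH; }

    // Push to every assigned CU simultaneously; the GPCU stalls (returns
    // false) if ANY assigned CU's FIFO is full. Each CU pops independently.
    bool dispatch(const Instr& i) {
        for (const CU* cu : assigned)
            if (cu->fifo.size() >= FIFO_DEPTH) return false;
        for (CU* cu : assigned)
            cu->fifo.push(i);
        return true;
    }
};
```

In this reading, reassigning a CU from one GPCU to another amounts to moving its pointer between `assigned` lists once its FIFO has drained; how the hardware sequences that handover is not specified in the paper.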
3. PERFORMANCE STUDY
We developed cycle-accurate models for the DSP processor, caches, and memory hierarchy, and created a multi-processor SoC model for both the regular and SLAP DSPs. For performance analysis, we collected execution traces from the in-house VLIW DSP cores on our basestation SoCs running the Physical Uplink Shared Channel (PUSCH) receive chain, and then ran these traces on our SoC model. Fig. 5 shows the processing blocks in the PUSCH receive chain in green (see [19] for more details).

Fig. 5. Wireless Baseband PUSCH processing blocks
Fig. 6. SLAP-VLIW performance/area efficiency for PUSCH

The DSP clusters process the estimation blocks above the main processing flow, such as reference extraction, channel estimation, CQI, etc., and are also in charge of control and management and of generating parameters for the receiver chain. The DSP clusters are involved in other processing as well, but PUSCH is the most compute-intensive and the most appropriate for benchmarking DSP performance. Traces were collected from our product for different regions (based on the distribution of scalar/vector operations) and stitched together, as shown by the regions of performance in Fig. 7. For example, regions 1, 3, and 5 have more scalar operations than vector operations, whereas regions 2 and 4 show mixed scalar and vector operations. For this "combo trace" we benchmarked SLAP with various CU instruction FIFO sizes and different GPCU data cache sizes; in Fig. 6 we tabulate the percentage increase in runtime compared to an in-house DSP with flat memory (so with no cache misses or memory access stalls). Adding a 32KB data cache (with all speculative/selective prefetching mechanisms) to the in-house DSP results in about 33.63% overhead. SLAP with a FIFO size of 24 or 32, combined with a GPCU data cache of 8KB, 16KB or 32KB, reduces this overhead, with a sweet spot at a FIFO size of 32 and the opportunity to reduce the cache to 16KB without much degradation; even an 8KB cache produces benefit. Fig. 6 shows the performance-area efficiency of PUSCH improving in the range of 5.7%–8.22%. The test vectors used are the traces captured on the in-house DSP; recompiling for SLAP gives further improvements, but we do not have the space to present them here. Fig. 7 shows how the ratio of scalar to vector instructions varies across the combo trace, and one would expect the performance benefit to depend strongly on this ratio. In Fig. 8 we summarize the benefit for a SLAP FIFO depth of 24 and an 8KB data cache, versus the in-house DSP with a 32KB data cache, for different regions of the combo trace.
Fig. 7. PUSCH Scalar/Vector regions
Fig. 8. SLAP improvements

Regions 1, 3 and 5 show heavy scalar processing, but the gain of SLAP is still in the range of 7.10%–10.50%. This shows that SLAP also benefits scalar processing, by not polluting the data cache with vector data. Regions 2 and 4 evenly mix scalar and vector processing, but the improvements are 30.6% and 4.10% respectively. Clearly the SLAP improvement depends on the pattern of scalar/vector processing as much as on the simple ratio, but in all cases the benefit is significant. Though the area of SLAP decreases, this is due to a reduction in memory, which generally runs cool, combined with the addition of FIFO logic. One might suspect that the FIFO logic runs much hotter than the memory removed, increasing overall power. We performed a careful study of the power dissipation of the FIFOs compared to register files of equivalent size and found that the FIFO implementation is over an order of magnitude more power efficient than a similarly sized multi-ported register file, because of the dramatic reduction in the number of ports and in wire length. So although SLAP uses long FIFOs, it still saves power because it reduces cache size.
4. CONCLUSIONS AND FUTURE WORK
We have shown that an extension of a DAE structure can provide a significant, double-digit percentage performance benefit to a VLIW DSP in a basestation application, while potentially decreasing power and certainly decreasing area. This is possible without recompilation of the original DSP code, and it can also be extended to dynamic sharing of pools of SIMD units. There are several areas for future study, including the development of a scheduling strategy that takes advantage of the dynamic allocation of SIMD units to improve system-level scheduling goals.

5. REFERENCES

[1] Marvell, "Marvell Octeon Fusion CN73XX: Next generation integrated baseband processors," 2020.

[2] Andrei Frumusanu, "Marvell announces Octeon Fusion and Octeon TX2 5G infrastructure processors," 2020.

[3] Björn Fjellborg, "SoC and ASIC design at Ericsson," 2016.

[4] Texas Instruments, "TMS320C6670: 4 core fixed and floating point DSP for communications and telecom," March 2012.

[5] Freescale, "StarCore SC3900FP: Flexible vector processor," 2012.

[6] "CEVA-XC323 high-performance vector DSP for software defined radio infrastructure applications," 2010.

[7] "CEVA-XC12: The world's most advanced communication DSP," 2013.

[8] Yunsup Lee, Decoupled Vector-Fetch Architecture with a Scalarizing Compiler, Ph.D. thesis, EECS Department, University of California, Berkeley, 2016.

[9] J. E. Smith, "Decoupled access/execute computer architectures," ACM Trans. Comput. Syst., vol. 2, no. 4, pp. 289–308, Nov. 1984.

[10] R. Espasa and M. Valero, "Decoupled vector architectures," in Proc. Second International Symposium on High-Performance Computer Architecture, pp. 281–290, Feb. 1996.

[11] S. G. Abraham, R. A. Sugumar, D. Windheiser, B. R. Rau, and R. Gupta, "Predictability of load/store instruction latencies," in Proc. 26th Annual International Symposium on Microarchitecture, 1993, pp. 139–152.

[12] Lucian Codrescu, "Architecture of the Hexagon 680 DSP for mobile imaging and computer vision," in Hot Chips 27: A Symposium on High Performance Chips, 2015.

[13] M. K. Farrens and A. R. Pleszkun, "Implementation of the PIPE processor," Computer, vol. 24, no. 1, pp. 65–70, 1991.

[14] B. R. Rau, "Dynamically scheduled VLIW processors," in Proc. 26th Annual International Symposium on Microarchitecture, 1993, pp. 80–92.

[15] Alan Gatherer, Sushma Wokhlu, Peter Yan, Y. Pyng Harn, Ashish Shrivastava, Tong Sun, and Lee McFearin, "Processing units having triangular load protocol," US Patent Application US20180321939A1, https://patents.google.com/patent/US20180321939A1/en, Nov. 2018.
[16] T. Kohonen, Content-Addressable Memories, Springer-Verlag, 2nd edition, 1987.

[17] Joseph A. Fisher, Paolo Faraboschi, and Cliff Young, Embedded Computing: A VLIW Approach to Architecture, Compilers and Tools, Morgan Kaufmann, 2005.