A Variable Vector Length SIMD Architecture for HW/SW Co-designed Processors∗

Rakesh Kumar
NTNU, Norway
Alejandro Martínez
ARM, UK
Antonio González
UPC Barcelona, Spain

∗This work was done while R. Kumar was at UPC Barcelona, A. Martínez was at Intel Labs, Barcelona, and A. González was at UPC Barcelona and Intel Labs, Barcelona.
ABSTRACT
Hardware/Software (HW/SW) co-designed processors provide a promising solution to the power and complexity problems of modern microprocessors by keeping their hardware simple. Moreover, they employ several runtime optimizations to improve the performance. One of the most potent optimizations, vectorization, has been utilized by modern microprocessors to exploit the data level parallelism through SIMD accelerators. Due to their hardware simplicity, these accelerators have evolved in terms of width from 64-bit vectors in Intel MMX to 512-bit wide vector units in Intel Xeon Phi and AVX-512. Although SIMD accelerators are simple in terms of hardware design, code generation for them has always been a challenge. Moreover, increasing vector lengths with each new generation add to this complexity. This paper explores the scalability of SIMD accelerators from the code generation point of view. We discover that SIMD accelerators remain underutilized at higher vector lengths mainly due to: a) reduced dynamic instruction stream coverage for vectorization and b) an increase in permutations. Both of these factors can be attributed to the rigidness of the SIMD architecture. We propose a novel SIMD architecture that possesses the flexibility needed to support higher vector lengths. Furthermore, we propose Variable Length Vectorization and Selective Writing in a HW/SW co-designed environment to transparently target the flexibility of the proposed architecture. We evaluate our proposals using a set of SPECFP2006 and Physicsbench applications. Our experimental results show an average dynamic instruction reduction of 31% and 40% and an average speedup of 13% and 10% for SPECFP2006 and Physicsbench respectively, for 512-bit vector length, over the scalar baseline code.
1 INTRODUCTION

Hardware/Software (HW/SW) co-designed processors offer a solution to the power and complexity problems of modern microprocessors [37][11][17]. In order to reduce power consumption and complexity, these processors incorporate simple hardware. Moreover, several dynamic optimizations are applied to improve the performance.

Single Instruction Multiple Data (SIMD) accelerators form an integral part of modern microprocessors. Since these accelerators perform the same operation on multiple pieces of data, they just require duplicated functional units and a very simple control mechanism. Despite their simplicity, they are well suited to exploit data level parallelism from modern multimedia, scientific and throughput computing applications. For this reason, SIMD accelerators are ubiquitous in processors from different computing domains like general purpose processors [3][12][27], Digital Signal Processors [6], gaming consoles [15][42], as well as embedded architectures [7].

Due to their hardware simplicity, SIMD accelerators grow in size with each new generation. For example, Intel MMX [3] had a vector length of 64 bits, which was increased to 128 bits in the SSE [3] extensions. Intel AVX [3] and AVX2 [3] support 256-bit vectors, whereas Intel's recent SIMD extensions AVX-512 [1] and the Many Integrated Core architecture [2] support 512-bit vector operations.

In spite of their hardware simplicity, code generation for SIMD accelerators has always been challenging. In the early days, programmers used to target these extensions mainly using in-line assembly or specialized library calls, which is tedious and error prone. Then, automatic generation of SIMD instructions (auto-vectorization) was introduced in compilers [8][32], which borrowed their methodology from vector compilers. These compilers target loops for generating code for SIMD accelerators. Later, S. Larsen et al. [26] introduced Superword Level Parallelism (SLP), which targets basic blocks instead of whole loops for vectorization. Apart from these static approaches, dynamic vectorization in superscalar processors has also been explored by A. Pajuelo et al. [34].

Although SIMD accelerators are amenable to scaling from the hardware point of view, generating efficient code for higher vector lengths is not straightforward. The problem lies in the fact that different applications have different natural vector lengths. There are applications for which compilers just need to unroll loops with a higher unroll factor to fill the wider vector paths. However, there are other applications that do not have enough parallelism for vectorization at higher vector lengths, and SIMD resources are left un/under-utilized. Generating code for these applications for wider vector units becomes a challenge.

In this paper, we explore the scalability of SIMD accelerators from the code generation point of view. We discover that there are two key factors that thwart the performance at higher vector lengths. First, the dynamic instruction stream coverage for vectorization reduces as vector length increases. This is because the instructions in current vector ISAs operate on all the vector lanes together and not on a subset of them. For example, ADDPS in Intel SSE, VADD in ARM Neon and VADDFP in PowerPC Altivec all operate on all the vector lanes together.
Therefore, compilers generate a vector instruction only when there is a sufficient number of independent operations to fill the vector path. When there are not enough instructions to fill up the vector path, all the instructions are left in scalar form. We propose a flexible SIMD architecture that allows operating on any number of vector lanes. In addition, we propose Variable Length Vectorization (VLV) to target the flexible vector datapath.

Second, the number of permutation instructions increases with vector length. The rigidness of the SIMD architecture is again responsible for this. For example, the scalar SIMD instructions in Intel SSE always write their result to the lowest element of the vector register. If a vector instruction needs to read these results, they first need to be packed together in a single vector register using shuffle instructions. The proposed SIMD architecture allows scalar instructions to write their result to any element of the vector register, depending on how they are needed by the consumer vector instruction. Therefore, the shuffle instructions are no longer required. We call this ability of writing to any selective part of a vector register Selective Writing (SWR).

VLV increases the dynamic instruction stream coverage by iteratively packing the maximum number of scalar instructions together, even if the number is less than the number of vector lanes available. SWR employs two techniques to keep the permutations to a minimum. As a result, the proposed SIMD architecture alleviates the rigidness problem of the traditional SIMD architecture and allows generating optimized code at higher vector lengths. Moreover, the HW/SW co-designed nature of the processor provides some additional advantages. For example, since vectorization is done at runtime on the program binary, it does not require any changes in the compiler, operating system or application source code. Therefore, we can target the proposed ISA without modifying anything in the software stack. The main contributions of this paper can be summarized as:

• Identifies the bottleneck in vector code generation for wider vector units.
• Proposes a flexible SIMD architecture.
• Proposes Variable Length Vectorization to increase the dynamic instruction stream coverage.
• Proposes Selective Writing to reduce the number of permutation instructions.

This paper is an extension of our prior work [23] and makes the following additional contributions:

• Shows why both VLV and SWR are necessary and not just either of them.
• Shows why a vector length register is not a good choice for SIMD accelerators.

The rest of the paper is organized as follows: Section 2 provides a background on HW/SW co-designed processors. Section 3 provides the motivation for the work presented in this paper and identifies key issues in efficient vector code generation for higher vector lengths. Section 4 describes the speculative dynamic vectorization algorithm. Sections 5 and 6 explain the proposed SIMD ISA, Variable Length Vectorization and Selective Writing techniques. Evaluation of the proposals using a set of SPECFP2006 and Physicsbench applications is presented in Section 7. Section 8 presents related work and Section 9 concludes.
2 HW/SW CO-DESIGNED PROCESSORS

A HW/SW co-designed processor is a hybrid architecture that leverages hardware/software co-design to couple a software layer to the microarchitectural design of a processor. The software layer resides between the hardware and the operating system. This software layer allows the host and guest ISAs to be completely different, by translating the guest ISA instructions to the host ISA dynamically. We define the host ISA as the ISA that is implemented in the hardware, whereas the guest ISA is the one for which applications are compiled. The basic idea behind these processors is to have a simple host ISA to reduce power consumption and complexity. This kind of processor [11][13][37] first emerged more than two decades ago. Moreover, there is a renewed interest in them in both industry and academia [4][29][9][44][24][33][23][22].

These processors are specifically designed to achieve energy efficiency, design simplicity, and performance improvement. In order to achieve design simplicity, they keep the hardware simple and implement a relatively simple ISA. The simple hardware design also helps in achieving energy efficiency. Transmeta reports a significant reduction in power dissipation for their HW/SW co-designed processor Crusoe compared to the Intel Pentium III for a software DVD player [16]. Their data shows that the Pentium III heats up to a temperature of 105ºC whereas Crusoe's maximum temperature goes only up to 48ºC running the same software DVD player. Furthermore, to achieve the performance goal, HW/SW co-designed processors employ dynamic binary optimizations.

In general, HW/SW co-designed processors implement a proprietary ISA in order to achieve design simplicity and power efficiency. Therefore, they need to apply binary translation to map the guest ISA onto the host ISA. Binary translation, in general, can be implemented in either hardware or software. Modern processors implementing a CISC ISA, like x86, implement binary translation in hardware [41]. The hardware binary translator translates CISC instructions to RISC-like instructions dynamically to simplify the execution pipeline implementation. However, the hardware implementation leads to significant hardware complexity and power consumption. HW/SW co-designed processors, on the other hand, implement dynamic binary translation in software, which leads to energy efficiency.

Fig. 1(a) shows the hardware/software interface in a conventional RISC processor, where the software stack directly interacts with the hardware. Conventional CISC processors implement a RISC-like ISA in hardware. As shown in Fig. 1(b), they employ a hardware dynamic binary translator to translate CISC instructions to the internal ISA instructions. The binary translation in HW/SW co-designed processors is performed by a software layer, as shown in Fig. 1(c). We call this software layer the Translation Optimization Layer (TOL) in this paper.

Performing the dynamic binary translation/optimization in a software layer provides several benefits over the hardware implementation. For example, the software implementation significantly reduces hardware complexity and power consumption. Furthermore, it allows upgrading a processor in the field by introducing new optimizations in the software layer. In contrast, if TOL is implemented in hardware, adding new optimizations to an existing processor is not feasible. Additionally, the software implementation of TOL significantly reduces hardware validation and verification cost and time.
Translating guest ISA code to host ISA is the prime responsibility of TOL. The translation is done dynamically and, generally, in multiple phases. Usually, in the first phase, an interpreter decodes and executes guest ISA instructions sequentially. In the rest of the phases, the guest code is translated into host ISA code and stored in the code cache, after applying several dynamic optimizations, for faster execution. The number of translation phases and optimizations in each phase are implementation dependent.

Figure 1: HW/SW interface in processors: (a) conventional RISC processor, (b) conventional CISC processor, (c) HW/SW co-designed processor

Fig. 2 shows a typical two stage translation/optimization flow in a TOL. It starts by interpreting the guest ISA instruction stream sequentially. While interpreting, TOL also profiles the guest code to collect information about the most frequently executed code and biased branch directions. The execution frequency guides TOL in deciding which guest code basic blocks to translate. When a basic block has been executed more than a predetermined number of times, TOL invokes the translator. The translator takes the guest ISA basic blocks as input, translates them to host ISA code and saves the translated code into the code cache for fast native execution. Instead of translating and optimizing each basic block in isolation, the translator uses the biased branch direction information, collected during interpretation, to create bigger optimization regions, called superblocks. A superblock, generally, consists of multiple basic blocks following the biased direction of branches. Therefore, superblocks increase the scope of optimizations to multiple basic blocks and allow more aggressive optimizations. Superblocks have a single entry point, which is the first instruction of the first basic block included in the superblock. However, depending on the implementation, they might have a single or multiple exit points.

Initially, the control is transferred back to TOL after executing a superblock from the code cache. Then, TOL searches for the next instructions to be executed. If the next instruction is not already translated, it has to be interpreted. However, if it is already translated, TOL patches the last branch of the first superblock (the one that transferred the control back to TOL) to the beginning of the second superblock. This process is called chaining or linking.
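The control flow of Fig. 2 can be condensed into a small, self-contained C simulation. This is an illustrative sketch only: the threshold value, block count and all identifiers are invented for exposition and do not come from any real TOL implementation.

#include <stdio.h>

enum { N_BLOCKS = 3, TRANSLATION_THRESHOLD = 4, STEPS = 16 };

typedef struct {
    int exec_count;   /* profiling counter maintained while interpreting */
    int translated;   /* nonzero once the block's translation is in the code cache */
} BasicBlock;

int main(void) {
    BasicBlock bb[N_BLOCKS] = {0};
    for (int step = 0; step < STEPS; step++) {
        int id = step % N_BLOCKS;          /* stand-in for guest control flow */
        BasicBlock *b = &bb[id];
        if (b->translated) {
            printf("block %d: execute from code cache (chain to predecessor)\n", id);
        } else if (++b->exec_count > TRANSLATION_THRESHOLD) {
            b->translated = 1;             /* translate, optimize, store */
            printf("block %d: hot -- translate to host ISA, store in code cache\n", id);
        } else {
            printf("block %d: interpret (execution count %d)\n", id, b->exec_count);
        }
    }
    return 0;
}

Running the loop shows each block being interpreted a few times, then promoted to the code cache, after which all further executions come from translated code, mirroring the interpret/translate/chain cycle of Fig. 2.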
HW/SW co-designed processors provide certain features that set them apart from traditional hardware-only processors. The following are some of the reasons that motivated us to choose them for our proposals:
Aggressive Vectorization:
Compilers' inability to do accurate interprocedural pointer disambiguation and interprocedural array dependence analysis severely limits their vectorization ability [30]. On the other hand, the dynamic optimization environment in HW/SW co-designed processors avoids the need for these analyses by vectorizing speculatively [22]. Furthermore, these processors provide efficient support to recover from speculation failures [37][11]. Therefore, they enable aggressive vectorization and catch vectorization opportunities missed by conservative compiler vectorization.

Figure 2: Typical two stage TOL control flow
Dynamic Information:
Since the vectorization is done at runtime, it benefits from the availability of runtime information. For example, the loop unroll factor can be determined at runtime through profiling for loops where the loop trip count is unknown at compile time. This is especially important for variable length vectorization, where the optimal loop unroll factor varies based on the logical vector length, which is not always equal to the SIMD accelerator width, as explained in Section 5.1.
Decoupled Vector ISA and SIMD Accelerator:
HW/SW co-designed processors decouple the hardware implementation of the SIMD accelerator from the application-visible vector ISA by means of dynamic binary translation. This enables modifications/improvements in the SIMD accelerator without affecting the application-visible SIMD ISA. We leverage this fact to introduce a flexible SIMD accelerator without any modification to the application-visible (guest) ISA, compiler or any other component of the software stack.

Figure 3: Dynamic FP instruction stream coverage for vectorization at 128, 256 and 512-bit vector lengths

Portable Vectorization:
Since vectorization is done by TOL at runtime, the same application binary can be executed on different SIMD accelerators. This kind of portable vectorization provides forward and backward binary compatibility.
Legacy Code Vectorization:
Runtime vectorization in HW/SW co-designed processors also enables legacy code vectorization. Therefore, code that was not compiled for any SIMD accelerator can also benefit from its presence.
3 MOTIVATION

The trends in the recent past show that vector lengths are likely to keep increasing in future microprocessors, since wider vectors provide a simple and efficient way of achieving higher FLOPS in an energy efficient manner. Intel's 256-bit AVX [3] and the 512-bit vector length of AVX-512 [1] and Larrabee [38] are a few examples of these trends. However, it is a challenge to generate efficient code to utilize these wider vector units. To demonstrate this fact, we vectorized floating point instructions in SPECFP2006 for three different vector lengths of 128, 256, and 512 bits using the speculative dynamic vectorization algorithm described in [22]. Moreover, at a given vector length, all the vector instructions operate only on the maximum vector length and not on a subset of it. For example, for the 512-bit vector length case, all the vector instructions operate on the whole 512 bits and there is no vector instruction that operates only on 256 or 128 bits. This is in line with how vector instructions function in current SIMD architectures, operating on all the vector lanes and not on a subset.

Our results show that there are mainly two problems in vector code generation at higher vector lengths: reduced dynamic instruction stream coverage for vectorization and a huge number of permutation instructions.
We define dynamic instruction stream coverage as the number of dynamic scalar instructions vectorized. Fig. 3 shows the dynamic instruction stream coverage for vectorization at different vector lengths, normalized to the 128-bit case. The best, worst and average cases are shown. We divide the applications into two categories: The first category of applications has maximum dynamic instruction stream coverage at all the vector lengths, like 454.calculix. On the contrary, there are applications like 444.namd where dynamic instruction stream coverage falls by 70% at a vector length of 512 bits.

Figure 4: Normalized number of permutation instructions generated per vector instruction

The dynamic instruction stream coverage at different vector lengths depends upon the degree of data level parallelism available in the application and how this parallelism is extracted through SIMD extensions. If an application spends most of its time in loops with high trip counts, it will benefit from higher vector lengths, since the wider vector paths can be filled by unrolling the loops more times, depending on the vector length. However, as shown by the average case of Fig. 3, this is not the case for most of the applications. We see an average reduction of 25% and 48% in dynamic instruction stream coverage at 256-bit and 512-bit respectively. If this trend continues, the coverage is going to be even lower at higher vector lengths.
When the input operands of a vector instruction are not available in a single vector register, or are not in the same order as required by the vector instruction, permutation instructions are needed to arrange them in the correct order. Our results show that the number of permutation instructions grows significantly with increasing vector lengths.

Fig. 4 shows the number of permutation instructions generated per vector instruction in SPECFP2006, normalized to the 128-bit case. As the figure shows, if we generate one permutation instruction for each vector instruction at 128-bit vector length, this number goes as high as 10 at 512-bit vectors in the case of 444.namd. Also, there are applications for which this number does not grow that rapidly. However, the average behavior suggests that the number of permutation instructions is going to be a problem at higher vector lengths.

Both of these factors become a limitation as vector paths become wider, and instead of improving, performance starts degrading compared to the lower vector lengths. In essence, both of these problems arise because current SIMD architectures are not flexible enough to handle these situations. The vector instructions in current SIMD architectures operate on all the vector lanes and not on a subset of them. As a result, if there are not enough independent instructions performing the same operation, compilers do not generate a vector instruction. This behavior leads to reduced dynamic instruction stream coverage. Also, the scalar instructions in current SIMD architectures, such as ADDSS, MULSS etc. in Intel SSE, write their result only to the lowest element of a vector register. If a vector instruction needs to read these results, they need to be packed in a single register using shuffle instructions before they can be consumed by the vector instruction, thereby increasing the number of permutations. This paper investigates both problems and proposes a flexible SIMD architecture along with Variable Length Vectorization and Selective Writing to solve the problems of reduced coverage and permutation instructions, respectively.

4 SPECULATIVE DYNAMIC VECTORIZATION

This section briefly discusses the baseline speculative dynamic vectorization scheme; the details of the algorithm and its evaluation can be found in [20, 22, 25]. The software layer of our co-designed processor is called the Translation Optimization Layer (TOL). TOL operates in three translation modes for generating host code from guest x86 code: Interpretation Mode (IM), Basic Block Translation Mode (BBM) and Superblock Translation Mode (SBM). SBM is the most aggressive translation/optimization mode, and the majority (more than 90%) of the dynamic application code is executed in this mode. Vectorization is done only in SBM, after applying several standard optimizations.
Before starting with vectorization, we create a superblock, optimize it by applying standard compiler optimizations, and generate a Data Dependence Graph (DDG), as explained below:
TOL starts by interpreting the guest x86 instruction stream in IM. When a basic block is executed more than a predetermined number of times, TOL switches to BBM. In this mode, the whole basic block is translated and stored in the code cache, and the rest of the executions of this basic block are done from the code cache. Moreover, profiling information is gathered for all the basic blocks in BBM using software counters. This information consists of execution and edge counters. The execution counter provides the execution frequency of a basic block, while the edge counters monitor the biased branch direction. Once the execution of a basic block exceeds another predetermined threshold, TOL creates a bigger optimization region, called a superblock, using the branch profiling information collected during BBM. A superblock generally includes multiple basic blocks following the biased direction of branches.

Moreover, the branches inside the superblocks are converted to "asserts" so that a superblock can be treated as a single-entry, single-exit sequence of instructions. This gives the freedom to reorder and optimize instructions across multiple basic blocks. "Asserts" are similar to branches in the sense that both check a condition. Branches determine the next instruction to be executed based on the condition; however, asserts have no such effect. If the condition is true, the assert does nothing. However, if the condition evaluates to false, the assert "fails" and the execution is restarted from a previously saved checkpoint in IM. Furthermore, if the number of assert failures in a superblock exceeds a predetermined limit, the superblock is recreated without converting branches to "asserts". As a result, this time the superblock has to be treated as a single-entry, multiple-exit sequence of instructions. Having multiple exits in a superblock also reduces the available optimization opportunities, because the instructions across different exit paths cannot be reordered as freely as before.

Loop unrolling plays a major role in vectorization. Compilers unroll a loop a particular number of times to get sufficient independent instructions to fill the vector path. It is relatively simple to determine the unroll factor for loops with a static trip count. However, for loops where the number of iterations is not known statically, it is difficult to decide the unroll factor. The availability of dynamic application behavior in HW/SW co-designed processors allows us to detect the loop unroll factor dynamically. We profile the applications, in BBM, to collect the loop iteration count for each loop. This information is used in superblock creation to decide the loop unroll factor. Currently, we unroll loops with a single basic block, as loops with no or minimal control flow are the ones which provide maximum benefits [31].
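The branch-to-assert conversion described above can be pictured with the following schematic C sketch. The checkpoint and rollback are abstracted with setjmp/longjmp, and all names and the recovery policy are simplified assumptions, not the actual mechanism of any co-designed processor.

#include <stdio.h>
#include <setjmp.h>

static jmp_buf checkpoint_state;   /* architectural checkpoint at superblock entry */
static int assert_failures;        /* too many failures => rebuild the superblock */

static void tol_assert(int biased_direction_taken) {
    if (!biased_direction_taken) {     /* speculation on the branch failed */
        assert_failures++;
        longjmp(checkpoint_state, 1);  /* discard speculative state, back to IM */
    }
}

int main(void) {
    int x = -1;  /* profiling predicted x > 0, but this run violates it */
    if (setjmp(checkpoint_state) == 0) {
        /* superblock body: the branch on x became an assert, so instructions
         * below it may be reordered freely across the old block boundary */
        tol_assert(x > 0);
        printf("speculative superblock committed\n");
    } else {
        printf("assert failed (%d): re-executing from checkpoint in IM\n",
               assert_failures);
    }
    return 0;
}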
The optimizer applies several transformations on the superblock. First, x86 code is translated to an intermediate representation. Then the resulting code is transformed into Static Single Assignment form. This transformation removes anti and output dependences and significantly reduces the complexity of subsequent optimizations. Second, a forward pass applies a set of conventional single pass optimizations: constant folding, constant propagation, copy propagation, and common subexpression elimination. Third, a backward pass applies dead code elimination.

After the basic optimizations, the Data Dependence Graph (DDG) is prepared. During DDG creation, we perform memory disambiguation analysis. If the analysis cannot prove that a pair of memory operations will never/always alias, it is marked as "may alias". In case of reordering, the original memory instructions are converted to speculative memory operations. Apart from this, Redundant Load Elimination and Store Forwarding are also applied during the DDG phase, so that redundant memory operations are removed before vectorization. The DDG is then passed as input to the vectorizer. After vectorization, an instruction scheduler that uses a conventional list scheduling algorithm schedules the vectorized code. Afterwards, the determined schedule is used by the register allocator, which implements the linear scan register allocation algorithm. Finally, the optimized code is translated to host instructions and is stored in the code cache.
The vectorizer packs together a number of independent scalar instructions that perform the same operation, and replaces them with one vector instruction. The number of scalar instructions packed depends on two factors:

• the data types of the scalar instructions
• the host vector length

For example, for a host vector length of 128 bits, four 32-bit single-precision floating-point instructions can be packed together in a single vector instruction. Therefore, vectorization reduces the dynamic instruction count and improves performance. Before describing the algorithm itself, we define a set of conditions that a pair of instructions must satisfy to be included in the same pack:

• The instructions must perform the same operation.
• The instructions must be independent.
• The instructions must not be in another pack.
• If the instructions are load/store, they must be accessing consecutive memory locations.

Vectorization starts by marking all the instructions which are candidates for vectorization. Moreover, we mark First Load and First Store instructions. First Load/Store instructions are those for which there are no other loads/stores from/to adjacently previous memory locations. For example, if there is a 64-bit load instruction I_L that loads from a memory location [M] and there is no 64-bit load instruction that loads from address [M-8], we call I_L a First Load.

Vectorization begins by packing consecutive stores, starting from a First Store. The decision to start with stores instead of loads is based on the observation that a given kind of operation always has the same number of predecessors, e.g. additions always have two predecessors, whereas the number of successors may vary depending on how many instructions consume the result. Consequently, following a bottom-up approach results in a more structured tree traversal than a top-down approach.

Once a pack of stores is created, their predecessors are packed, before packing other stores, if they satisfy the packing conditions. Moreover, if the last store in the pack has a next adjacent store, it is marked as a First Store so that a new pack can start from it.

Once all the stores are packed and their predecessor/successor chains have been followed, we check for remaining load instructions that satisfy the packing conditions and pack them in the same way as stores.

Vectorization starting from adjacent loads/stores has an obvious limitation: if a superblock does not have any consecutive loads/stores, nothing can be vectorized. To tackle this problem, after packing all loads/stores and their predecessors/successors, we check if there are still some arithmetic instructions that can be packed together. If so, we vectorize them and follow their predecessor/successor trees. This allows us to partially vectorize loops with interleaved memory accesses.

While traversing the predecessor/successor chains, if we find that the predecessors of a pack cannot be vectorized, a Pack instruction is generated. This Pack instruction collects the results of all the predecessors into a single vector register and feeds the current pack. Similarly, if all the successors of a pack cannot be vectorized, an Unpack instruction is generated. This Unpack instruction distributes the result of the pack to the scalar successor instructions. For example, in the case of loops with interleaved memory accesses, when we reach several load instructions while traversing the tree, we find that they cannot be packed since they are not consecutive. Therefore, we leave them in scalar form and assemble their results using a Pack instruction.
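The store-first traversal just described can be condensed into a toy C sketch. The DDG is reduced to one producer pointer per store, only one level of the predecessor chain is examined, and all names (Instr, pack_from_first_store, etc.) are invented for illustration; this is not the actual vectorizer code.

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

typedef struct Instr Instr;
struct Instr {
    const char *op;   /* operation kind, e.g. "store", "add" */
    long addr;        /* byte address for memory ops, -1 otherwise */
    int width;        /* access width in bytes */
    bool packed;
    Instr *pred;      /* producer of the stored value (one shown for brevity) */
};

#define VL 4          /* scalar operations that fit in one vector instruction */

static bool same_op(const Instr *a, const Instr *b) {
    return strcmp(a->op, b->op) == 0;
}

static bool consecutive(const Instr *a, const Instr *b) {
    return a->addr >= 0 && b->addr == a->addr + a->width;
}

/* Pack a run of consecutive stores starting at a First Store, then look one
 * level up the predecessor chain (the real pass recurses and also handles
 * loads, leftover arithmetic, and Unpack generation). */
static void pack_from_first_store(Instr **s, int n) {
    int k = 1;
    while (k < n && k < VL && !s[k]->packed &&
           same_op(s[k - 1], s[k]) && consecutive(s[k - 1], s[k]))
        k++;
    for (int i = 0; i < k; i++) s[i]->packed = true;
    printf("packed %d stores into one vector store\n", k);

    bool producers_packable = true;   /* all producers present and same op? */
    for (int i = 1; i < k; i++)
        if (!s[0]->pred || !s[i]->pred || !same_op(s[0]->pred, s[i]->pred))
            producers_packable = false;
    puts(producers_packable
             ? "producers packed into a vector op too"
             : "emit a Pack instruction to gather scalar producer results");
}

int main(void) {
    Instr a = {"add", -1, 0, false, NULL}, b = {"add", -1, 0, false, NULL};
    Instr s0 = {"store", 0, 8, false, &a};   /* First Store: no store at [M-8] */
    Instr s1 = {"store", 8, 8, false, &b};   /* adjacent to s0 */
    Instr *stores[] = {&s0, &s1};
    pack_from_first_store(stores, 2);
    return 0;
}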
5 VARIABLE LENGTH VECTORIZATION

As shown in Fig. 3 in Section 3, the dynamic instruction stream coverage for vectorization reduces at higher vector lengths. We observe that the reason for this behavior lies in the way the vector instructions in SIMD architectures function. Vector instructions in current SIMD architectures, such as ADDPS in Intel SSE, VADD in ARM Neon and VADDFP in PowerPC Altivec, operate on all the vector lanes and not on a subset of them. For this reason, compilers generate a vector instruction only when there is a sufficient number of independent operations to fill the vector path. When there are not enough instructions to fill up the vector path, all the instructions are left in scalar form. This is going to be an important issue in future microprocessors with wider vector paths, and a lot of otherwise vectorizable code will be left unvectorized. We propose Variable Length Vectorization (VLV), a speculative dynamic iterative vectorization technique that targets a flexible SIMD architecture for optimal vectorization of data parallel applications.

VLV targets a SIMD architecture with vector instructions that can operate on all or any subset of the vector lanes. Since the vector instructions can operate on any number of vector lanes, we need a way to notify the SIMD accelerator which vector lanes to enable and which not. We make use of mask registers for this purpose. A mask register has one bit per vector lane. The bits containing ones signify that the corresponding vector lanes are to be enabled; 0 means otherwise. The mask register is included in the instruction encoding in addition to the regular source and destination registers.

An important factor to consider here is the need for masking. Masking is used to disable unused vector lanes when a vector instruction does not use all the lanes. In general, not masking the unused lanes might work well for arithmetic instructions from the functionality point of view. However, performing unnecessary operations in the unused lanes might also generate false exceptions, like divide by zero. Therefore, we would need a way to distinguish real and false exceptions. Furthermore, for memory access instructions this might result in crossing array boundaries, leading to page/segmentation faults. Also, for store instructions it would result in writing incorrect data to memory. Moreover, the register file would contain invalid data, because the whole destination register would be written. As a result, we would need a way to distinguish between invalid and valid data in the register file. Mixing architectural state and temporal values is typically not a good idea. On the other hand, masking the unused lanes gets rid of all these problems.

From the implementation perspective, we do not really need to have real mask registers in the hardware. Since we need to enable only consecutive lower order vector lanes, the number of lanes to be activated can be encoded directly in the instruction encoding. This also saves the extra instructions otherwise needed to write the mask to the registers. It is important to note that traditional vector processors support variable vector length through a vector length register. It needs to be set to the desired vector length before executing vector instructions.
However, it is not the optimal solution for processors targeting general purpose applications, where the vector length needs to be changed frequently. In this scenario, the overhead of writing the vector length register would affect the performance severely, as will be shown in Section 7. Therefore, instead of having a vector length register, we propose to have Variable Length Vectorization using masked vector instructions.

For the execution of a vector instruction, the hardware now reads not only the source registers but also a mask to enable only the required vector lanes. The example in Fig. 5 shows the execution of a vector instruction that needs only two of the four vector lanes. As shown in the figure, only two of the four vector lanes are activated. This is also important from the power consumption point of view: not all the vector lanes have to be activated for all the vector instructions.

Figure 5: Masked vector instruction execution

Figure 6: Variable Length Vectorization example: (a) unvectorized code, (b) vectorized code for a fixed vector length of 128 bits, (c) vectorized code with variable length vectorization
We modify our baseline speculative dynamic vectorization algorithm of [22], briefly explained in Section 4, to generate vector code with the variable vector length SIMD ISA. The modified algorithm starts by vectorizing for the given maximum vector length, which we call the physical vector length. Once all the possible packs for the physical vector length have been created, the vectorizer reduces the logical vector length iteratively. At lower logical vector lengths, packs are created with a smaller number of scalar instructions than required to fill the vector path. The left-out positions in a pack are considered as no-operations.

Fig. 6 shows a simple vectorization example using the proposed VLV algorithm. Fig. 6(a) shows unvectorized code having six independent single-precision floating-point (32-bit) addition instructions. For a vector length of 128 bits, we can pack a maximum of four single-precision floating-point additions in a single vector addition instruction. The algorithm first packs four of the six instructions in a vector instruction and assigns a mask with all ones to this instruction, as shown in Fig. 6(b). A mask with all ones signifies that all the vector lanes are to be enabled.

A fixed vector length vectorization algorithm would stop at this point, since there are just two ADDSS instructions left and at least four are required to generate a vector instruction. However, the VLV algorithm continues and packs the remaining two addition instructions, as shown in Fig. 6(c). Moreover, a mask register with ones only at the lowest two positions is assigned to this instruction. This makes sure that only the two lower vector lanes are enabled during the execution of this vector instruction, as shown in Fig. 5.

Variable Length Vectorization helps in vectorizing applications which have loops with a lower iteration count than required by the vector length, and straight-line code with fewer independent scalar operations.

The VLV algorithm is fairly simple to extend to compilers for static trip count loops; however, for loops with a trip count unknown at compile time, it becomes tricky. For a fixed vector length, a compiler can vectorize such loops by unrolling them enough times to fill the vector path and putting a runtime check before the vectorized version to decide whether to execute it or not. However, for variable length vectorization, choosing a single unroll factor becomes difficult at compile time. The runtime information of the program behavior in HW/SW co-designed processors makes it straightforward to choose the correct unroll factor for VLV.
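As an illustration of the masked execution of Fig. 5 and the packing of Fig. 6, the following self-contained C sketch emulates a 4-lane (128-bit) masked vector add; the lane count, function name and mask encoding are our assumptions for exposition, not an actual ISA definition.

#include <stdio.h>

#define LANES 4  /* a 128-bit path: four 32-bit single-precision lanes */

static void vaddps_masked(float *dst, const float *a, const float *b,
                          unsigned mask) {
    for (int l = 0; l < LANES; l++)
        if (mask & (1u << l))     /* only enabled lanes compute and write */
            dst[l] = a[l] + b[l]; /* disabled lanes: no access, no write, so
                                     no spurious faults or invalid results */
}

int main(void) {
    float a[6] = {1, 2, 3, 4, 5, 6}, b[6] = {10, 20, 30, 40, 50, 60};
    float r[6] = {0};
    vaddps_masked(r,     a,     b,     0xFu);  /* Fig. 6(b): mask 1111 */
    vaddps_masked(r + 4, a + 4, b + 4, 0x3u);  /* Fig. 6(c): mask 0011 */
    for (int i = 0; i < 6; i++) printf("%.0f ", r[i]);
    printf("\n");
    return 0;
}

Note that the second call only touches lanes 0 and 1, so it never reads past the six-element arrays, mirroring the argument above that masking avoids out-of-bounds accesses and false exceptions.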
6 SELECTIVE WRITING

This section presents the proposed Selective Writing (SWR) technique to reduce the number of permutation instructions at higher vector lengths. First, we present a technique to eliminate permutation instructions completely if the result of an instruction is read by only one instruction. Then, we present another technique to reduce the number of instructions required to pack N values from N-1 to N/2, if the values to be packed are in N different registers.
If the producer instructions of a vector instruction cannot be vectorized, the results of these instructions have to be packed together before feeding the vector instruction. This is due to the fact that the scalar instructions in current SIMD architectures, such as ADDSS, MULSS etc. in the Intel SIMD extensions, write their results only to the lowest element of vector registers, whereas the vector instructions need them to be in a single vector register and in a particular order.

Fig. 7(a) shows a situation where the producers of I7 (I0-I3) are not vectorized and their results are packed using a permutation instruction sequence (I4-I6). As shown in the figure, I0 to I3 write their results to the lowest elements of different vector registers. Then a sequence of three instructions, I4 to I6, is used to pack these results in a single vector register xmm3, before feeding it to the vector instruction I7.

I0 addss xmm0, xmm6
I1 addss xmm1, xmm6
I2 mulss xmm2, xmm7
I3 mulss xmm3, xmm7
I4 shufps xmm1, xmm0, imm
I5 shufps xmm3, xmm2, imm
I6 blendps xmm3, xmm1, imm
I7 addps xmm3, [M]
(a) Traditional code sequence

I0 addss vr4, vr0, vr6, imm
I1 addss vr4, vr1, vr6, imm
I2 mulss vr4, vr2, vr7, imm
I3 mulss vr4, vr3, vr7, imm
I4 addps vr5, vr4, [M]
(b) Proposed instruction sequence

Figure 7: Packing scalar instruction results for feeding a vector instruction

The scalar instructions in the proposed SIMD architecture can write their results to any element of a vector register, instead of always writing to the lowest element, thus getting rid of the permutation instructions. This is done by making the scalar instructions selectively write to the different elements of a vector register in the order they are needed by the vector instruction, as shown in Fig. 8. This way, we can avoid permutation instructions altogether. This kind of selective writing capability is already available in the memory access instruction set of current architectures. For example, INSERTPS in Intel SSE can be used to write a 32-bit value loaded from memory to any part of the destination register. We extend this capability to the arithmetic instruction set as well.

In addition to carrying source and destination register numbers, all scalar arithmetic instructions also carry an immediate that specifies to which element of the destination vector register the scalar result is to be written.

Figure 8: Functionality of the proposed arithmetic scalar instructions

If the scalar instructions have written their results to a single vector register in the order in which they are needed by the vector instruction, the instruction sequence for packing these results is no longer needed, as shown in Fig. 7(b).

The limitation of the SWR scheme is that it works as long as a scalar instruction has only one consumer. In the case of more than one consumer, we would not get the maximum benefit out of SWR. However, our analysis of SPECFP2006 shows that more than 70% of dynamic instructions have only one consumer.

The proposed scalar instructions can be viewed as an arithmetic operation followed by a shuffle. However, this does not affect the latency of these instructions, since the results can be forwarded as soon as the arithmetic operation is finished. As Fig. 9 shows, it requires only an additional input to the multiplexers that select the input operands of the ALUs, taken from the output of the first vector lane (which performs scalar operations). Consequently, forwarding the results of the first vector lane to any other vector lane provides the functionality of a shuffle operation.
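A C stand-in for the proposed selective-write scalar instructions of Fig. 7(b) and Fig. 8: one scalar result is computed from lane 0 of the sources and deposited into an arbitrary element of the destination register. The register file layout, operation mix and names (VReg, addss_sel) are simplified assumptions for illustration.

#include <stdio.h>

#define LANES 4

typedef struct { float e[LANES]; } VReg;   /* one vector register */

/* Proposed scalar add: computes a single scalar result from lane 0 of the
 * sources, but writes it to element dst_elem of the destination register
 * (the immediate of Fig. 8), instead of always to the lowest element. */
static void addss_sel(VReg *dst, const VReg *a, const VReg *b, int dst_elem) {
    dst->e[dst_elem] = a->e[0] + b->e[0];
}

int main(void) {
    VReg v0 = {{1}}, v1 = {{2}}, v2 = {{3}}, v3 = {{4}}, v6 = {{10}};
    VReg vr4 = {{0}};
    /* Four unvectorizable scalar adds deposit their results directly into
     * the four elements of vr4, in the order the consumer vector
     * instruction needs them -- no shufps/blendps sequence required. */
    addss_sel(&vr4, &v0, &v6, 0);
    addss_sel(&vr4, &v1, &v6, 1);
    addss_sel(&vr4, &v2, &v6, 2);
    addss_sel(&vr4, &v3, &v6, 3);
    for (int i = 0; i < LANES; i++) printf("%.0f ", vr4.e[i]);
    printf("\n");
    return 0;
}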
Current architectures provide a vector instruction set where N-1 instructions are required to bring N values to a register. A typical instruction sequence to bring 4 values from different vector registers into a single vector register in the x86 architecture is shown in Fig. 10(a). The first two shuffle instructions bring the values selected by the immediate into registers xmm1 and xmm3, respectively. Then a BLENDPS instruction is used to combine the results from xmm1 and xmm3 into xmm3.

Figure 9: Operand forwarding before shuffle

I0 shufps xmm1, xmm0, imm
I1 shufps xmm3, xmm2, imm
I2 blendps xmm3, xmm1, imm
(a) x86 instruction sequence

I0 packps vr6, vr0, vr1, imm
I1 packps vr6, vr2, vr3, imm
(b) Proposed instruction sequence

Figure 10: Instruction sequence for packing 4 values from different registers into a single register
Figure 11: Functionality of the proposed Pack instruction

One of the main factors that force this instruction count to be N-1 is that these instructions write to all the elements of the destination register. If it is possible to write only selected elements of the destination register, then this number can be brought down. In this case, the number of instructions required will depend upon the total number of different registers to be read and the number of registers that can be read by a single permutation instruction. In a case where we need to read N registers and the permutation instruction can read only two registers, we would need N/2 instructions to collect N values in a single register. If we support a higher number of input registers, the number of instructions required can be brought down further. Moreover, we need a mechanism to tell which elements of the source registers are to be read and which elements of the destination register are to be written.

We propose to have a permutation instruction with the functionality shown in Fig. 11. The proposed instruction (PACKPS) has two input registers and a 16-bit immediate that tells which elements of the source and destination registers are to be accessed. The first four bits of the immediate [0:3] tell which element of the first source register is to be read, and the next four bits [4:7] tell where it is to be written in the destination. Similarly, bits [8:11] tell which element of the second source register is to be written to the destination element selected by bits [12:15]. Note that PACKPS is very similar to SHUFPS but with a bit more freedom in choosing the source element for each destination element. Therefore, their latencies will be similar.

The instruction sequence replacing the x86 instruction sequence of Fig. 10(a) is shown in Fig. 10(b). In this case, we are able to reduce the number of instructions required to two. For higher vector lengths, where we need to get 8 and 16 values into a register, we need just 4 and 8 instructions, respectively, instead of the 7 and 15 instructions required by the original sequence. The downside of this scheme is that it requires N/2 instructions even if the values to be collected are in fewer than N registers. However, our experiments show that in SPECFP2006, on average, about 86% and 48% of permutations, for 256-bit and 512-bit vectors respectively, need to read N or N-1 registers to pack N values.

7 EVALUATION

To measure the success of our proposals, we use a set of applications from the SPECFP2006 [5] and Physicsbench [46] benchmark suites. All the SPECFP2006 benchmarks used in our experiments employ 64-bit double precision floating point data types, except 435.gromacs, whereas the benchmarks in Physicsbench operate on 32-bit single precision floating point values. All the benchmarks are compiled with gcc-4.5.3 with the "-O3 -fomit-frame-pointer -ffast-math -mfpmath=sse -msse3" flags.

For SPECFP2006 we instrument the benchmarks, using PIN [28], to find the most frequently executing routines. Then we simulate one billion instructions starting from these routines. The benchmarks in Physicsbench are executed till completion.

Figure 12: Dynamic instruction stream coverage at three vector lengths, baseline and with VLV
To evaluate our proposals, we use DARCO [19, 35], which is an infrastructure for evaluating HW/SW co-designed virtual machines. DARCO executes a guest x86 binary on a PowerPC-like RISC host architecture. Since DARCO emulates floating point code in software, we extended the infrastructure to add floating point scalar and vector operations. We implemented the dynamic vectorization algorithm in the TOL to provide vectorization support.

For our experiments, we extended the host architecture to support vector sizes of 128, 256 and 512 bits. Moreover, we consider only floating point operations for vectorization (because most SIMD optimizations tend to focus on them) and no integer operation is vectorized. Therefore, we show only the floating point instructions in the results presented.
Fig. 12 shows the dynamic instruction stream coverage for the three vector lengths, first without and then with Variable Length Vectorization (VLV). We have maximum coverage when the number of instructions required to create a pack is at its minimum, i.e. two instructions. At 128-bit vector length, the maximum number of 64-bit double precision operations that can be packed together is two. Therefore, the 128-bit vector length provides maximum coverage, even without VLV, for double precision operations. Since all the SPECFP2006 benchmarks primarily operate on double precision floating point variables, they have maximum coverage at 128 bits, as shown in Fig. 12. For single precision floating point variables, Variable Length Vectorization helps increase coverage even at 128-bit vector length, as is evident from the figure for the Physicsbench benchmark suite and 435.gromacs.

For the vector lengths of 256 and 512 bits, the benchmarks can be divided into two categories. First, benchmarks like 454.calculix have maximum, or close to maximum, dynamic instruction stream coverage at higher vector lengths as well. The hottest loops of these benchmarks have enough iterations to fill the wider vector paths. Second, benchmarks like 436.cactusADM, 444.namd, and Physicsbench show a drastic reduction in coverage as vector length increases, due to the lack of independent instructions to fill the wider paths. These benchmarks either have loops with fewer iterations or with complex control flow. For example, the hottest loops in 410.bwaves iterate four times; therefore, for 256-bit vector length it has the maximum coverage, but for 512-bit, it drops down to zero. Benchmarks in Physicsbench have loops with complex control flow and cannot be unrolled. Moreover, the number of independent instructions in individual superblocks is not enough to fill the vector path. Thus, the dynamic instruction stream coverage reduces severely. Using VLV, we bring the coverage for these benchmarks also to the maximum, as shown in Fig. 12.

Figure 13: Dynamic instruction stream distribution for SPECFP2006: 128, 256 and 512-bit vector lengths without and with VLV

This section shows that even though VLV increases the dynamic instruction stream coverage, by itself it does not provide much benefit in terms of overall dynamic instruction reduction, because of a corresponding increase in permutations. Fig. 13 presents the dynamic instruction stream distribution for SPECFP2006 for 128, 256 and 512-bit vector lengths, first without VLV (called baseline in the figure) and then with VLV. The results shown are normalized to the no vectorization case. The dynamic instruction stream is divided into: scalar and vector instructions, Pack/Unpack instructions (as described in Section 4.2), and unvectorizable instructions (e.g. we do not vectorize conversions).

On average, the number of scalar instructions increases with vector length without VLV, as shown by the 128, 256 and 512-bit baseline cases. Scalar instructions constitute 31% of the overall dynamic instruction stream for SPECFP2006 at 128-bit vector length without VLV. However, this number increases to 41% and 52% at 256 and 512 bits without VLV. It is because of this increase in scalar instructions (or the corresponding decrease in dynamic instruction stream coverage) that we do not get any reduction in the overall dynamic instruction stream at higher vector lengths. VLV, on the other hand, reduces the scalar instructions in the dynamic instruction stream by extracting additional vectorization opportunities. As shown in Fig. 13, VLV brings down the scalar instructions to 28% from 41% and 52% at 256 and 512-bit vector lengths.

Even though VLV increases the number of dynamic instructions vectorized, the overall reduction in the dynamic instruction stream is only marginal, as is evident from Fig. 13. This is a result of the fact that the increased number of vectorized instructions comes at the cost of an increase in permutations. Therefore, we need a way to keep the permutation instructions to a minimum. We use Selective Writing (SWR) as a means to that end and evaluate it next. For Physicsbench, VLV by itself is able to provide a significant dynamic instruction stream reduction with a minimal increase in permutations. Therefore, we do not show results for it.
Fig. 14 shows the number of permutation instructions per vector instruction required at the three vector lengths, without and with Selective Writing (SWR). Again, we have the same two categories of benchmarks as for the dynamic instruction stream coverage. Benchmarks like 434.zeusmp, 459.GemsFDTD, and Physicsbench have essentially the same number of permutation instructions across all the vector lengths. Packing the instructions from the different iterations of unrolled loops avoids the generation of permutation instructions in the case of 434.zeusmp and 459.GemsFDTD. Physicsbench, however, has very few permutations, since we fail to vectorize anything. On the contrary, 433.milc, 436.cactusADM and 444.namd show an increase in permutation instructions at higher vector lengths. Complex control flow and fewer loop iterations force us to vectorize straight-line code, which requires a higher number of permutation instructions. SWR helps in eliminating a significant number of permutation instructions for these benchmarks. Another point to notice in Fig. 14 is that for the 128-bit vector length there is negligible reduction in permutation instructions. This is because we need to pack two double precision values in a 128-bit register, and for N=2, N/2 and N-1 are the same. Therefore, we do not get much benefit. However, on average we reduce the number of permutation instructions required to half.
This section shows that even though SWR is effective in keeping the permutation instructions to a minimum, by itself it is also unable to provide a significant overall dynamic instruction reduction. Fig. 15 presents the dynamic instruction stream distribution for SPECFP2006 for 128, 256 and 512-bit vector lengths, first without SWR (called baseline in the figure) and then with SWR. The results shown are also normalized to the no vectorization case. The dynamic instruction stream is again divided into: scalar and vector instructions, Pack/Unpack instructions, and unvectorizable instructions.

Figure 14: Number of permutation instructions per vector instruction, baseline and with SWR

Figure 15: Dynamic instruction stream distribution for SPECFP2006: 128, 256 and 512-bit vector lengths without and with SWR

SWR achieves a significant permutation reduction, as shown in Fig. 15, especially for the 433.milc, 436.cactusADM and 470.lbm benchmarks. For other benchmarks like 410.bwaves, 434.zeusmp, 437.leslie3d etc., permutation instructions are not significant, either because of the small number of vectorized instructions due to low coverage, or because the benchmarks have enough parallelism at higher vector lengths as well. Even though SWR is effective in keeping the permutations to a minimum, it cannot provide a significant dynamic instruction reduction if the vectorizer is not able to vectorize most of the code, as shown in Fig. 15.

Therefore, neither VLV nor SWR by itself is able to achieve significant dynamic instruction stream reductions at higher vector lengths. However, when combined together, they do reduce the dynamic instruction stream substantially, as shown in the next section.
Fig. 16 shows the percentage of dynamic instructions after vectorization without and with VLV-SWR. As shown in this figure, after applying both optimizations, all the applications perform better as the vector length is increased. Applications like 433.milc, 436.cactusADM, 470.lbm, and Physicsbench, which were earlier getting worse with an increase in the vector length compared to the 128-bit vector length, now perform better. On average, VLV-SWR helps eliminate 9% and 16% more dynamic instructions compared to the baseline vectorization, at 256-bit and 512-bit vector lengths respectively, for SPECFP2006. Overall, vectorization with VLV-SWR reduces the unvectorized dynamic instruction stream by 15%, 27%, and 31% for 128-bit, 256-bit, and 512-bit vector lengths respectively. For Physicsbench, we eliminate 40% more instructions compared to baseline vectorization and unvectorized code, at 256-bit and 512-bit vector lengths with VLV-SWR. Baseline vectorization does not find any vectorization opportunity at higher vector lengths for Physicsbench.

As Fig. 16 shows, the percentage of reduced instructions is the same for the 256-bit and 512-bit vector lengths in the case of Physicsbench and 410.bwaves. The lack of availability of independent instructions at 512-bit vector length forces VLV to vectorize the code the same way as for the 256-bit vector length. However, the important point to notice is that we still have more instruction reduction than in the 128-bit case, which was not possible without VLV.
Traditional vector processors used a special register, called the vector length register, to choose the number of vector lanes to be enabled. This register needs to be written every time a vector instruction needs a different number of lanes than the vector instruction immediately preceding it. This section shows why a vector length register is not an optimal solution in SIMD accelerators for dynamically varying the logical vector length. Fig. 17 shows the average number of dynamic vector instructions executed before a vector instruction requiring a different number of vector lanes is encountered. In other words, the figure shows how frequently the vector length register would need to be written had we used it instead of the proposed VLV.

Figure 16: Dynamic instruction percentage after baseline and VLV-SWR vectorizations

Figure 17: Average number of consecutive dynamic vector instructions with the same vector length in a 512-bit wide vector unit with VLV-SWR

As the figure shows, a hypothetical vector length register would need to be written very frequently for most of the benchmarks. For example, for 433.milc, 436.cactusADM and 470.lbm it would be written after executing only two vector instructions. Although there are a few benchmarks, like 410.bwaves, 454.calculix and 482.sphinx3, where the writes to the vector length register would be quite rare, for the majority of the benchmarks it would need to be written very frequently. The vector processors could use a vector length register because they specifically targeted heavily data parallel applications.

The extra instructions to write the vector length register would severely affect the performance benefits of vector execution. Therefore, VLV chooses to encode the number of vector lanes to be enabled in the instruction encoding rather than using a vector length register.
Table 1: Processor Microarchitectural Parameters
Parameter                           Value
L1 I-cache                          64KB, 4-way set associative, 64-byte line, 1-cycle hit, LRU
L1 D-cache                          64KB, 4-way set associative, 64-byte line, 1-cycle hit, LRU
Unified L2 cache                    512KB, 8-way set associative, 64-byte line, 6-cycle hit, LRU
Scalar functional units (latency)   2 simple int (1), 2 int mul/div (3/10), 2 simple FP (2), 2 FP mul/div (4/20)
Vector functional units (latency)   1 simple int (1), 1 int mul/div (3/10), 1 simple FP (2), 1 FP mul/div (4/20)
Registers                           128 integer, 128 vector, 32 FP
Memory latency                      128 cycles
We model a simple in-order processor with an issue width of two, in keeping with the simple hardware design philosophy of co-designed processors. The microarchitectural parameters are shown in Table 1.

Fig. 18 shows the percentage of execution time, at the three vector lengths, after vectorization without and with VLV-SWR. On average, VLV-SWR provides 5% and 7% speedup over the baseline vectorization and 10% and 13% over the unvectorized code, at vector lengths of 256-bit and 512-bit respectively, for SPECFP2006. Similarly, for Physicsbench, VLV-SWR yields a speedup of 10% over both the unvectorized code and the baseline vectorization.

[Figure 18: Execution time for baseline and VLV-SWR vectorizations, normalized to the unvectorized execution time]

There are several interesting points to note in Fig. 18. First, even though the dynamic instruction elimination is high, e.g. 31% for SPECFP2006 at the 512-bit vector length, the corresponding speedup is smaller, 13%. This is because only 39% of the dynamic instructions in SPECFP2006 are floating point, which limits the overall performance gain. Second, although the dynamic instruction reduction is larger for Physicsbench than for SPECFP2006 at the 512-bit vector length, 40% versus 31%, SPECFP2006 shows more speedup, 13% compared to 10%. This is due to the fact that Physicsbench has a higher percentage of integer instructions than SPECFP2006.

Masked operations have been used in the past to vectorize code with control flow. We, however, use them in the absence of control flow to increase the dynamic instruction stream coverage. J. Smith et al. [40] proposed masked operations as a means of supporting conditional operations in a vector instruction set. J. Shin et al. [39] incorporated masked operations to vectorize loops with control flow in the Superword Level Parallelism approach. Larrabee [38] also uses masked instructions to map scalar if-then-else control structures to the vector processing unit. All of these proposals execute both the if and else clauses and select the correct results based on the values in the mask registers. Our proposal, on the other hand, uses masked operations to increase the dynamic instruction stream coverage when there are not enough instructions to fill the wider vector paths.

A significant amount of work has been done on the optimal generation of permutation instructions. However, previous work does not show the effect of permutations at increasing vector lengths. A. Kudriavtsev et al. [18] show the relationship between operation grouping and permutation generation: the ordering of individual operations in SIMD instructions affects the number of permutation instructions required. G. Ren et al. [36] presented an algorithm that converts all permutations to a generic form; permutations are then propagated across statements and redundant permutations are eliminated. These solutions focus on reducing the number of permutations required, whereas our solution reduces the number of instructions needed for each permutation. L. Huang et al. [14] proposed a method to reduce the number of instructions for one permutation. Their system has a Permutation Vector Register File which provides implicit permutation capabilities. However, the permutation pattern has to be saved beforehand in a permutation register, and only the values from two consecutive registers can be permuted.

The proposal by M. Woh et al. [45] for supporting multiple SIMD widths is the closest to our proposal of Variable Length Vectorization.
They proposed a configurable SIMD datapath that can be configured to process wide vectors or multiple narrow vectors. Unfortunately, they do not provide details of their vectorization algorithm for multiple vector lengths.

Speculative dynamic vectorization itself has received little attention in the literature. There have been only a few proposals, such as Speculative Dynamic Vectorization [34], dynamic vectorization in trace processors [43] and Liquid SIMD [10], and none of them is in the context of HW/SW co-designed processors. A. Pajuelo et al. [34] proposed to speculatively vectorize the instruction stream in hardware for superscalar architectures. Their scheme prefetches data into the vector registers and speculatively manipulates it through arithmetic instructions. S. Vajapeyam et al. [43] build a large logical instruction window and convert repetitive dynamic instructions from different iterations of a loop into vector form; the whole loop is vectorized if all its iterations have the same control flow. Liquid SIMD [10] decouples the SIMD accelerator implementation from the instruction set of the processor through compiler support and a hardware-based dynamic translator. The compiler passes hints to the dynamic translator, which can then retarget the vector code to different SIMD accelerators. Selective devectorization [21, 24] has also been explored to reduce the energy consumption of SIMD accelerators by keeping them power gated for longer intervals.
In this paper, we showed that widening the SIMD accelerators does not improve performance for all applications. We identified two main problems that hurt the performance of naturally low vector length applications on wider SIMD units: reduced dynamic instruction stream coverage and a large number of permutation instructions.

We proposed a flexible SIMD architecture that allows vector instructions to operate on a variable number of lanes. Additionally, scalar instructions can selectively write to any element of a vector register, thus avoiding permutations. We also proposed the Variable Length Vectorization and Selective Writing techniques to target the flexibility of the proposed SIMD architecture. Variable Length Vectorization vectorizes the code even when it is not possible to fill the wider vector paths. Selective Writing writes scalar results to any particular element of a vector register, thus reducing permutations. Our experimental results show an average dynamic instruction elimination of 31% and 40% and an average speedup of 13% and 10% for SPECFP2006 and Physicsbench respectively, at a 512-bit vector length, over the scalar baseline code.

REFERENCES
[1] Intel AVX-512. [Online]. Available: https://software.intel.com/en-us/blogs/2013/avx-512-instructions
[2] Intel MIC. [Online]. Available: https://software.intel.com/en-us/forum/37014
[3] Intel® 64 and IA-32 Architectures Software Developer's Manual.
[4] Intel's HW/SW co-designed processor project.
[5] Standard Performance Evaluation Corporation, SPEC CPU2006 Benchmarks.
[6] Wireless Symposium, Motorola, 1999.
[7] M. Baron, "Cortex-A8: High speed, low power," Microprocessor Report, 11(14), 2005, pp. 1–6.
[8] A. J. C. Bik, M. Girkar, P. M. Grey, and X. Tian, "Automatic intra-register vectorization for the Intel architecture," Int. J. Parallel Program., vol. 30, no. 2, pp. 65–98, Apr. 2002.
[9] A. Branković, K. Stavrou, E. Gibert, and A. González, "Warm-up simulation methodology for HW/SW co-designed processors," in Proceedings of the Annual IEEE/ACM International Symposium on Code Generation and Optimization, ser. CGO '14, 2014, pp. 284–294.
[10] N. Clark, A. Hormati, S. Yehia, S. Mahlke, and K. Flautner, "Liquid SIMD: Abstracting SIMD hardware using lightweight dynamic mapping," in High Performance Computer Architecture, 2007. HPCA 2007. IEEE 13th International Symposium on, Feb 2007, pp. 216–227.
[11] J. C. Dehnert, B. K. Grant, J. P. Banning, R. Johnson, T. Kistler, A. Klaiber, and J. Mattson, "The Transmeta Code Morphing™ Software: Using speculation, recovery, and adaptive retranslation to address real-life challenges," in Proceedings of the International Symposium on Code Generation and Optimization: Feedback-directed and Runtime Optimization, ser. CGO '03, 2003, pp. 15–24.
[12] K. Diefendorff, P. Dubey, R. Hochsprung, and H. Scale, "AltiVec extension to PowerPC accelerates media processing," IEEE Micro, vol. 20, no. 2, pp. 85–95, Mar 2000.
[13] K. Ebcioğlu and E. R. Altman, "DAISY: Dynamic compilation for 100% architectural compatibility," in Proceedings of the 24th Annual International Symposium on Computer Architecture, ser. ISCA '97, 1997, pp. 26–37.
[14] L. Huang, L. Shen, Z. Wang, W. Shi, N. Xiao, and S. Ma, "SIF: Overcoming the limitations of SIMD devices via implicit permutation," in High Performance Computer Architecture (HPCA), 2010 IEEE 16th International Symposium on, Jan 2010, pp. 1–12.
[15] J. A. Kahle, M. N. Day, H. P. Hofstee, C. R. Johns, T. R. Maeurer, and D. Shippy, "Introduction to the Cell multiprocessor," IBM J. Res. Dev., vol. 49, no. 4/5, pp. 589–604, Jul. 2005.
[16] A. Klaiber, "The technology behind the Crusoe processors," White paper, January 2000.
[17] K. Krewell, "Transmeta gets more Efficeon," Microprocessor Report, 17(10), 2003.
[18] A. Kudriavtsev and P. Kogge, "Generation of permutations for SIMD processors," in Proceedings of the 2005 ACM SIGPLAN/SIGBED Conference on Languages, Compilers, and Tools for Embedded Systems, ser. LCTES '05, 2005, pp. 147–156.
[19] R. Kumar, J. Cano, A. Brankovic, D. Pavlou, K. Stavrou, E. Gibert, A. Martínez, and A. González, "HW/SW co-designed processors: Challenges, design choices and a simulation infrastructure for evaluation," 2017, pp. 185–194.
[20] R. Kumar, A. Martínez, and A. González, "Speculative dynamic vectorization for HW/SW codesigned processors," 2012, pp. 459–460.
[21] R. Kumar, A. Martínez, and A. González, "Dynamic selective devectorization for efficient power gating of SIMD units in a HW/SW co-designed environment," 2013, pp. 81–88.
[22] R. Kumar, A. Martínez, and A. González, "Speculative dynamic vectorization to assist static vectorization in a HW/SW co-designed environment," in High Performance Computing (HiPC), 2013 20th International Conference on, Dec 2013.
[23] R. Kumar, A. Martínez, and A. González, "Vectorizing for wider vector units in a HW/SW co-designed environment," in High Performance Computing and Communications (HPCC), 2013 IEEE International Conference on, Nov 2013, pp. 518–525.
[24] R. Kumar, A. Martínez, and A. González, "Efficient power gating of SIMD accelerators through dynamic selective devectorization in an HW/SW codesigned environment," ACM Trans. Archit. Code Optim., vol. 11, no. 3, pp. 25:1–25:23, Jul. 2014.
[25] R. Kumar, A. Martínez, and A. González, "Assisting static compiler vectorization with a speculative dynamic vectorizer in an HW/SW codesigned environment," ACM Trans. Comput. Syst., vol. 33, no. 4, Jan. 2016.
[26] S. Larsen and S. Amarasinghe, "Exploiting superword level parallelism with multimedia instruction sets," in Proceedings of the ACM SIGPLAN 2000 Conference on Programming Language Design and Implementation, ser. PLDI '00, 2000, pp. 145–156.
[27] R. B. Lee, "Subword parallelism with MAX-2," IEEE Micro, vol. 16, no. 4, pp. 51–59, Aug. 1996.
[28] C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J. Reddi, and K. Hazelwood, "Pin: Building customized program analysis tools with dynamic instrumentation," in Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation, ser. PLDI '05, 2005.
[29] M. Lupon, E. Gibert, G. Magklis, S. Samudrala, R. Martínez, K. Stavrou, and D. R. Ditzel, "Speculative hardware/software co-designed floating-point multiply-add fusion," in Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS '14, 2014.
[30] S. Maleki, Y. Gao, M. J. Garzarán, T. Wong, and D. A. Padua, "An evaluation of vectorizing compilers," in Proceedings of the 2011 International Conference on Parallel Architectures and Compilation Techniques, ser. PACT '11, 2011, pp. 372–382.
[31] S. S. Muchnick, Advanced Compiler Design & Implementation. Morgan Kaufmann, 1997.
[32] D. Naishlos, "Autovectorization in GCC," in The 2004 GCC Developers' Summit, 2004, pp. 105–118.
[33] N. Neelakantam, D. R. Ditzel, and C. Zilles, "A real system evaluation of hardware atomicity for software speculation," in Proceedings of the Fifteenth Edition of ASPLOS on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS XV, 2010, pp. 29–38.
[34] A. Pajuelo, A. González, and M. Valero, "Speculative dynamic vectorization," in Computer Architecture, 2002. Proceedings. 29th Annual International Symposium on, 2002, pp. 271–280.
[35] D. Pavlou, A. Brankovic, R. Kumar, M. Gregori, K. Stavrou, E. Gibert, and A. González, "DARCO: Infrastructure for research on HW/SW co-designed virtual machines," in Proceedings of the 4th Workshop on Architectural and Microarchitectural Support for Binary Translation (AMAS-BT'11) at ISCA-38, June 2011.
[36] G. Ren, P. Wu, and D. Padua, "Optimizing data permutations for SIMD devices," in Proceedings of the 2006 ACM SIGPLAN Conference on Programming Language Design and Implementation, ser. PLDI '06, 2006, pp. 118–131.
[37] S. Sathaye, P. Ledak, J. Leblanc, S. Kosonocky, M. Gschwind, J. Fritts, A. Bright, E. Altman, and C. Agricola, "BOA: Targeting multi-gigahertz with binary translation," in Proc. of the 1999 Workshop on Binary Translation, IEEE Computer Society Technical Committee on Computer Architecture Newsletter, 1999, pp. 2–11.
[38] L. Seiler, D. Carmean, E. Sprangle, T. Forsyth, M. Abrash, P. Dubey, S. Junkins, A. Lake, J. Sugerman, R. Cavin, R. Espasa, E. Grochowski, T. Juan, and P. Hanrahan, "Larrabee: A many-core x86 architecture for visual computing," ACM Trans. Graph., vol. 27, no. 3, pp. 18:1–18:15, Aug. 2008.
[39] J. Shin, M. Hall, and J. Chame, "Superword-level parallelism in the presence of control flow," in Proceedings of the International Symposium on Code Generation and Optimization, ser. CGO '05, 2005.
[40] J. E. Smith, G. Faanes, and R. Sugumar, "Vector instruction set support for conditional operations," in Proceedings of the 27th Annual International Symposium on Computer Architecture, ser. ISCA '00, 2000, pp. 260–269.
[41] J. Smith and R. Nair, Virtual Machines: Versatile Platforms for Systems and Processes. Morgan Kaufmann Publishers Inc., 2005.
[42] M. Sporny, G. Carper, and J. Turner, "The PlayStation 2 Linux kit handbook," 2002.
[43] S. Vajapeyam, P. J. Joseph, and T. Mitra, "Dynamic vectorization: A mechanism for exploiting far-flung ILP in ordinary programs," in Proceedings of the 26th Annual International Symposium on Computer Architecture, 1999, pp. 16–27.
[44] C. Wang, Y. Wu, and M. Cintra, "AccelDroid: Co-designed acceleration of Android bytecode," in Code Generation and Optimization (CGO), 2013 IEEE/ACM International Symposium on, Feb 2013.
[45] M. Woh, S. Seo, S. Mahlke, T. Mudge, C. Chakrabarti, and K. Flautner, "AnySP: Anytime anywhere anyway signal processing," in Proceedings of the 36th Annual International Symposium on Computer Architecture, ser. ISCA '09, 2009, pp. 128–139.
[46] T. Y. Yeh, P. Faloutsos, S. J. Patel, and G. Reinman, "Parallax: An architecture for real-time physics," in Proceedings of the 34th Annual International Symposium on Computer Architecture, ser. ISCA '07, 2007, pp. 232–243.