Accelerating Viterbi Algorithm using Custom Instruction Approach

Abstract

In recent years, decoding algorithms in communication networks have become increasingly complex in order to achieve high reliability in correctly decoding received messages. These decoding algorithms involve computationally complex operations that require high-performance computing hardware, which is generally expensive. A cost-effective solution is to enhance the Instruction Set Architecture (ISA) of the processors by creating new custom instructions for the computational parts of the decoding algorithms. In this paper, we propose to utilize the custom instruction approach to efficiently implement the widely used Viterbi decoding algorithm by adding new assembly language instructions to the ISA of the DLX, PicoJava II and NIOS II processors, which represent RISC, stack and FPGA-based soft-core processor architectures, respectively. By using the custom instruction approach, the execution time of the Viterbi algorithm is improved by approximately 3 times for DLX and PicoJava II, and by 2 times for NIOS II.


Waqar Ahmad ∗ and Imran Hafeez Abbassi †
∗ Electrical and Computer Engineering, Concordia University, Montréal, Canada. Email: [email protected]
∗† School of Electrical Engineering and Computer Science, National University of Sciences and Technology, Islamabad, Pakistan. Email: {waqar.ahmad,imran.abbasi}@seecs.nust.edu.pk

Usman Sanwal ‡ and Hasan Mahmood §
‡ Computational Biomodeling Laboratory, Department of Computer Science, Åbo Akademi University, Turku Centre for Computer Science, 20500 Turku, Finland. Email: msanwal@abo.fi
§ Department of Electronics, Quaid-i-Azam University, 45320, Islamabad, Pakistan. Email: [email protected]


Index Terms: Viterbi Algorithm, DLX, PicoJava II, NIOS II, Custom Instruction

I. INTRODUCTION

In the past few years, there has been a continuous increase in the demand for efficient and reliable transmission of messages over band-limited and noisy communication channels, especially for Internet and wireless networks. In order to meet this demand, many decoding algorithms, such as Viterbi [9], have been developed and enhanced over the years. The Viterbi algorithm [9] is one of the most widely used decoding algorithms; it utilizes the Maximum-Likelihood Decoding (MLD) [10] procedure in order to reliably decode the transmitted messages at the receiver end. According to an estimate, bits/sec are decoded every day by the Viterbi algorithm in digital TV devices [11]. In the Viterbi algorithm, the add, compare and select (ACS) operations are called many times during the decoding process. This makes the Viterbi algorithm extensively iterative and computationally complex, so it is important to implement this algorithm efficiently to improve its performance.

Several methods have been proposed in order to implement the Viterbi algorithm efficiently, ranging from DSPs [27] to FPGA-based [6] dedicated hardware designs. However, these implementations have been mainly intended for powerful but expensive high-end DSPs and FPGA devices. A cost-effective solution is to implement the Viterbi algorithm by using the custom instruction approach [28], which is a method of enhancing the ISA of the processors by adding new instructions in order to significantly reduce the execution time of the Viterbi algorithm.

In the custom instruction approach, the most computationally intensive part of the given algorithm is first identified, and then new instructions that implement the identified part, generally at the microarchitecture level of the processor, are added to the ISA of the processors. This enables the modified processors to execute a computationally complex algorithm, such as Viterbi [9], more efficiently than an implementation without custom instructions. This approach has been successfully utilized to improve the execution time of many cryptographic algorithms [1], [5], a video coding standard [12] and trigonometric functions [16].

In this paper, we utilize the above-mentioned custom instruction approach to efficiently implement the Viterbi decoding algorithm in DLX [15], PicoJava II [25] and NIOS II [18], representing RISC, stack and FPGA-based soft processors, respectively. DLX and PicoJava II provide a microprogrammed control store enabling modification as well as the inclusion of new custom instructions in their ISAs [15], [25]. We utilize the microprogramming technique to design the custom instructions (Texpand) for Viterbi ACS, i.e., the add, compare and select operations that are extensively called during the decoding process to correctly find the transmitted codeword, and include them in the ISA of the DLX and PicoJava II processors. We then test the performance of the Texpand instructions by implementing the ISA of DLX and PicoJava II in the CPU Sim [23] and MIC-1 [25] simulators, respectively. Currently, the custom instruction based implementation of the Viterbi algorithm on DLX is only presented for 12 bits decoding [2]. However, in this paper, we provide the results for up to 60 bits, which can be easily extended to a larger number of bits. Since NIOS II is an FPGA-based soft processor, the custom instruction is a dedicated hardware circuit, which is attached to the NIOS II ALU and invoked when the custom instruction is executed. We create the Texpand custom instruction in NIOS II by using the Verilog HDL programming language [20] and test it on the ALTERA DE2 board Cyclone II FPGA [21] using the NIOS II IDE software. We compare the performance of the Viterbi algorithm implementation with and without the custom instruction in terms of clock cycles for the DLX, PicoJava II and NIOS II processors. The proposed custom instruction approach shows significant improvement in the Viterbi algorithm execution time of ≈3 times for DLX and PicoJava II and ≈2 times for NIOS II.

The final publication is available at http://ieeexplore.ieee.org.

II. RELATED WORK

The Viterbi decoding algorithm has been extensively implemented in DSPs and FPGAs in order to reduce the computational complexity. For instance, Cholan described the FPGA-based design of the Viterbi decoding algorithm and presented an implementation of the decoder for the UWB MB OFDM technology [6]. Similarly, Ou et al. presented an FPGA-based Viterbi decoder architecture that can provide various throughput and energy trade-offs with an improvement of up to 26.1% [19]. Wilson described an efficient implementation of the Viterbi decoding algorithm on the ZSP500 digital signal processor (DSP) core [27]. However, state-of-the-art DSPs and FPGAs are generally expensive and their utilization for Viterbi implementation may not be a suitable choice for cost-limited applications, such as digital TVs [22].

The custom instruction approach has been extensively utilized as a cost-effective solution for the efficient implementation of computationally complex algorithms. For instance, Chen et al. utilized the custom instruction approach to protect cryptographic software implementations against Side Channel Attack (SCA) by emulating the behavior of secure hardware circuits [5]. Similarly, the custom instruction approach has also been used to efficiently implement the face detection algorithm [8] and the S8 AES algorithm [1]. The Viterbi algorithm has also been implemented in Xtensa [26] and in DLX [2] processors using the custom instruction approach. However, in [26], the implementation is described using C programming, which may not be optimized and may cause extra assembly instruction overhead, whereas [2] presented the custom instruction based implementation of the Viterbi algorithm in the DLX processor for 12 bits only.

In this paper, we utilize the custom instruction approach and describe the efficient implementation of the Viterbi decoding algorithm for up to 60 bits in the DLX, PicoJava II and NIOS II processors, representing RISC, Java and FPGA-based soft processor architectures, respectively.

III. PROCESSOR ARCHITECTURES

Many processors have hardwired control units composed of digital logic components. The ISA of these processors consists of a fixed number of instructions that cannot be modified. On the other hand, some processors, such as DLX and PicoJava II, have microprogrammed control units offering the ability to enhance and modify their ISAs. Similarly, FPGA-based soft-core processors, such as NIOS II, also provide the flexibility of adding new custom instructions to their ISAs. In this work, we utilize the DLX, PicoJava II and NIOS II processors for accelerating the Viterbi algorithm using the custom instruction approach. A brief description of their architectures is given in this section.

A. DLX Processor

The DLX processor [15] has 32 general-purpose registers (R0-R31) of 32 bits each. Some registers have special roles. For instance, the value of register R0 is always zero, while branch instructions to subroutines implicitly use register R31 to store the return address. The DLX processor memory is byte-addressable and divided into words of 32 bits. Microprograms consisting of microinstructions are typically used to drive the DLX datapath. Some of the commonly used DLX assembly instructions with their microinstructions are shown in Table I.

CPU Sim [23] is a Java-based simulator allowing users to design processors at the microcode level and to run machine-language or assembly-language programs on those processors through simulation. It provides several features for designing a variety of architectures, including accumulator-based, RISC-like, or stack-based (such as the JVM) architectures. In this paper, we utilize the CPU Sim simulator to design a DLX processor ISA by implementing the microprogramming code for each individual instruction, and then include the Texpand custom instruction in order to accelerate the Viterbi algorithm.

TABLE I: DLX Instructions with their Microinstructions

DLX Instruction | Microinstructions
LD R4,100(R1) | ir(8-15) -> mar; Main[mar] -> mdr; mdr -> ir(5-7); end
SW R4,100(R1) | ir(8-15) -> mar; ir(5-7) -> mdr; mdr -> Main[mar]; end
AND R1,R2,R3 | ir(8-10) -> B; ir(11-13) -> A; acc <- A & B; acc -> ir(5-7); end

B. PicoJava II Processor

PicoJava II [25] is a 32-bit pipelined stack-based processor, which can execute the Java Virtual Machine (JVM) instructions. There are about 30 JVM instructions that are microprogrammed and typically execute in a single clock cycle. The instructions in PicoJava II are executed in six pipeline stages. The first stage is the instruction fetch stage, which takes instructions from the instruction cache (I-cache). The second and third stages are the decode and fold stages. The opcode and three register fields are decoded in the decode stage. In the fold stage, the instruction folding operation is performed, in which a particular sequence of instructions is detected and combined into one instruction [25]. In the fourth stage, the operands are fetched from the stack, i.e., from the register file, and are then ready for the fifth stage, known as the execution stage. The results are stored in the cache during the sixth stage. Some of the PicoJava II assembly language instructions and their microcode are shown in Table II. Further detail about the PicoJava II microinstructions and their execution stages can be found in [25].

TABLE II: PicoJava II Instructions and their Microinstructions

Mnemonic | Microcode | Description
iadd1 | MAR = SP = SP - 1; rd | Read in next-to-top word on stack
iadd2 | H = TOS | H = top of stack
iadd3 | MDR = TOS = MDR + H; wr; goto (MBR1) | Add top two words; write to new top of stack
iload1 | H = LV | MBR contains index; copy LV to H
iload2 | MAR = MBRU + H; rd | MAR = address of local variable to push
iload3 | MAR = SP = SP + 1 | SP points to new top of stack; prepare write
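To make the register transfers in Table II concrete, the three iadd micro-steps can be traced with a small software model. This is only an illustrative sketch: the register names (SP, MAR, MDR, H, TOS) follow the table, while the dictionary-based memory, the upward-growing stack and the function name `run_iadd` are assumptions made here, not part of the MIC-1 simulator.

```python
# Hypothetical register-transfer model of the iadd1..iadd3 micro-steps of Table II.
# The dict-based memory and the function name are assumptions for illustration.

def run_iadd(memory, sp):
    """Apply iadd1..iadd3 to a word-addressed stack that grows upward."""
    tos = memory[sp]            # TOS caches the word currently on top of the stack

    # iadd1: MAR = SP = SP - 1; rd  (start reading the next-to-top word)
    sp = sp - 1
    mar = sp
    mdr = memory[mar]           # the word read is available in MDR by iadd3

    # iadd2: H = TOS
    h = tos

    # iadd3: MDR = TOS = MDR + H; wr  (write the sum to the new top of stack)
    mdr = tos = mdr + h
    memory[mar] = mdr

    return sp, tos


stack = {0: 7, 1: 5}            # address 1 holds the top of the stack
new_sp, new_tos = run_iadd(stack, sp=1)
print(new_sp, new_tos, stack[new_sp])   # prints: 0 12 12
```

The model collapses the memory read latency into a single assignment; on the real datapath the read started in iadd1 completes while iadd2 executes.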

C. NIOS II Soft Processor

The basic architecture of the Cyclone II FPGA system, designed by ALTERA [21], consists of several peripherals, such as SDRAM, SRAM and UART, and their interface with the NIOS II processor through the Avalon Switch Fabric [18]. The NIOS II processor needs interfaces to connect to other devices on the board; these are instantiated on the Cyclone II FPGA chip along with the NIOS II processor. These interfaces are connected to each other by means of an interconnection network known as the Avalon Switch Fabric. In this network, the master components are on one side and the slave components are on the other side. The key responsibility of the Avalon Switch Fabric is to synchronize the transfer of data between two devices.

The NIOS II soft processor is available in three different versions, i.e., economy (e), standard (s) and fast (f) processors [18]. All of these processors have separate instruction and data caches except NIOS II/e. About custom instructions can be added to the ISA of these NIOS II processors. NIOS II processors can be created on the ALTERA DE2 board [7] using the SOPC Builder in the Quartus II software [21]. By using the Add new component feature in SOPC Builder, we can add new custom instructions to the ISA of the NIOS II processor [18]. The custom logic is then attached to the NIOS II ALU and is invoked when the custom instruction is executed.

IV. VITERBI ALGORITHM

A typical communication system incorporates channel coding schemes in order to correct transmission errors. The process of channel coding involves the addition of redundancy to the information bits. Over the years, many channel coding schemes have been developed, which are mainly distinguished by their error-correcting capabilities against channel noise. There are two major types of codes, i.e., block and convolutional codes, which differ in their encoding principle. In block codes, the information bits are followed by the parity bits, while convolutional codes map the sequence of information bits to codewords sequentially according to some specified rules. The Viterbi algorithm has been extensively utilized for decoding both types of codes [3]. However, in this paper, we mainly focus on the convolutional codes that are generated by a convolutional encoder. The encoding of information bits into a codeword using the convolutional encoder is briefly described in the next section.

A. Encoding

Convolutional encoders are discrete-time linear time-invariant (LTI) systems that are typically used to encode K information bits into N > K code bits in each time step [3]. A convolutional encoder having coding rate 1/2 is shown in Figure 1(a), where U represents the information bits and the two V outputs are the corresponding bits generated by the encoder from each information bit in a sequential manner. The two memory elements m represent the state of the encoder during the encoding of information bits.

Fig. 1: (a) State diagram of convolutional encoder, (b) A typical convolutional encoder, (c) Trellis for convolutional encoder

A complete state diagram of the convolutional encoder is shown in Figure 1(b). The nodes represent the states of the encoder, whereas the edges describe the transitions between states based on the input/output relationship of the information bits with the code bits. A state diagram can be equivalently represented in the form of a trellis diagram, as shown in Figure 1(c). The trellis is a special graph with edges representing the possible transitions between states and is considered the backbone of the decoding process of convolutional codes.

In order to illustrate the working of the convolutional encoder, as shown in Figure 1(b), consider the information bits (110100), in which the first four bits are data bits and the last two are flush bits. After passing the information bits through the encoder, the resulting codeword bits are (10 01 11 10 11 00). If the noisy channel causes the 3rd and 7th bits of the codeword to be in error, then the received codeword becomes (10 11 11 00 11 00).

B. Decoding
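Before turning to the decoder, the encoding example above can be reproduced with a short sketch. The tap equations used here (first output = input XOR first memory element, second output = first memory element) are not stated in the text; they are an assumption inferred from the worked example, chosen because they map (110100) to (10 01 11 10 11 00). The function name `conv_encode` is likewise hypothetical.

```python
# Sketch of a rate-1/2 convolutional encoder with two memory elements (m1, m2).
# ASSUMPTION: the taps v1 = u XOR m1 and v2 = m1 are inferred from the worked
# example in the text; they are not given by the source.

def conv_encode(bits):
    m1 = m2 = 0                      # the encoder starts in state (00)
    out = []
    for u in bits:
        out.append((u ^ m1, m1))     # the two code bits for this input bit
        m1, m2 = u, m1               # shift the register (m2 only affects the state)
    return out


info = [1, 1, 0, 1, 0, 0]            # four data bits followed by two flush bits
codeword = conv_encode(info)
print(" ".join(f"{a}{b}" for a, b in codeword))   # prints: 10 01 11 10 11 00

# Flip the 3rd and 7th code bits to model the channel errors of the example.
flat = [b for pair in codeword for b in pair]
flat[2] ^= 1
flat[6] ^= 1
print("".join(map(str, flat)))       # prints: 101111001100
```

Under these assumed taps the second memory element does not influence the output bits, although it is still part of the four-state trellis.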

Fig. 2: Trellis Diagram for Typical Application

The Viterbi algorithm utilizes the trellis structure to perform the decoding operation. The number of times the trellis expansion function is called depends upon the number of bits to decode and the number of states in the trellis. Based on the above design, the complete trellis diagram is shown in Fig. 2. This trellis diagram describes a way to select the path with minimum weight among all the paths. In Fig. 2, the received bits are shown at the top of the trellis diagram, whereas all the possible states of the encoder are listed on the left side. The trellis expands from state (00) and only those paths survive which end at state (00). The dashed lines are the paths that originate when bit is given as input to the encoder, whereas the solid lines are obtained from input bit to the encoder. The path which has the minimum weight among all the surviving paths is shown by a dark solid line.

In the Viterbi algorithm, after a transition from a state to the next state, the weights are calculated for each possible path. The path weight is incremented whenever a received bit differs from the corresponding state output bit on the transition path. For example, if the received bits are (00) and the output bits of a particular state transition are (01), then the path weight is incremented by 1 for that transition. Similarly, if two bits differ, the path weight is incremented by 2. If more than one path arrives at a particular state, the path with the lowest weight survives and the remaining paths are deleted. When the weights of the arriving paths are equal, the path arriving from the lowest state survives. For instance, if paths from state (00) and state (01) both arrive at state (00) with the same path weight, we select the path that arrives from state (00).
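The weight update, survivor selection and tie-breaking rule described above can be sketched as a small hard-decision decoder. This is a minimal illustration and not the Texpand implementation: the encoder taps (first output = input XOR m1, second output = m1) are an assumption inferred from the encoding example, the numbering of states by the bit pair (m1 m2) is assumed, and all function names are hypothetical. The sketch ends with the traceback from state (00) that the following paragraph describes, and it does not force the flush bits to zero.

```python
# Hard-decision Viterbi sketch for a 4-state, rate-1/2 trellis.
# ASSUMPTIONS: taps v1 = u XOR m1, v2 = m1 (inferred from the worked example);
# states are numbered by the two bits (m1 m2).

INF = float("inf")


def step(state, u):
    """Return (next_state, output_pair) for input bit u."""
    m1 = state >> 1
    return (u << 1) | m1, (u ^ m1, m1)


def viterbi_decode(received):
    weights = [0, INF, INF, INF]          # the trellis expands from state (00)
    history = []                          # per stage: survivors[state] = (prev, bit)
    for r in received:
        new_weights = [INF] * 4
        survivors = [None] * 4
        for prev in range(4):             # source states in ascending order
            if weights[prev] == INF:
                continue
            for u in (0, 1):
                nxt, out = step(prev, u)
                # add: the branch weight is the Hamming distance to the received pair
                cand = weights[prev] + (out[0] != r[0]) + (out[1] != r[1])
                # compare/select: strict '<' keeps the lowest source state on a tie
                if cand < new_weights[nxt]:
                    new_weights[nxt] = cand
                    survivors[nxt] = (prev, u)
        weights = new_weights
        history.append(survivors)

    # traceback: follow the survivors backwards from state (00)
    state, bits = 0, []
    for survivors in reversed(history):
        state, u = survivors[state]
        bits.append(u)
    return bits[::-1], weights[0]


received = [(1, 0), (1, 1), (1, 1), (0, 0), (1, 1), (0, 0)]
decoded, weight = viterbi_decode(received)
print(decoded, weight)                    # prints: [1, 1, 0, 1, 0, 0] 2
```

With the received word of the encoding example, the sketch recovers (110100) with a final path weight of 2, matching the two channel errors. The result depends on the tie-breaking rule: at the fifth stage two paths reach state (01) with equal weight, and choosing the one from the lower state yields the transmitted sequence.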
When the decoding process is completed, a traceback function is performed to determine the most probable transmitted sequence by selecting the path that starts from state (00) and ends at state (00) with the minimum path weight among all the paths. In Fig. 2, the weight of the path at each node is shown in a circle, whereas the square box shows the correctly decoded information bits. A cross in the square box indicates that the corresponding path does not survive.

C. Custom Trellis Instruction

From the last section, it is evident that the trellis expansion process, involving the add, compare and select (ACS) operations, is called several times as we progress in the decoding process. For example, if there are 12 bits in a received codeword, then the trellis expansion function is called almost 19 times. Therefore, it is desirable to create a custom trellis expand instruction that allows the processors to execute these operations in a minimum number of clock cycles in order to achieve maximum efficiency.

We create Texpand custom instructions in the DLX, PicoJava II and NIOS II processors that perform two fundamental tasks. The first task is the implementation of the following operations: (i) add: the calculation of the cumulative weights of the paths arriving at a particular state; (ii) compare: a comparison between the weights of the arriving paths; and (iii) select: a selection operation to find the surviving path. The second task is to keep track of the path with minimum weight that ultimately ends at state (00). At the final stage, the traceback function is performed, based on the path with minimum weight, in order to determine the most probable transmitted sequence.

V. COMPARISON

In this section, we present a performance comparison between the trellis expansion function, which is written in assembly language, and the Texpand custom instruction that is created and included in the ISA of DLX by using CPU Sim, of PicoJava II by using MIC-1 and of NIOS II by using SOPC Builder. Each microinstruction in DLX and PicoJava II typically takes 4 clock cycles to complete its execution. The comparison is made on the basis of the number of clock cycles consumed by the microinstructions in the implementation of the trellis expansion function and of the Texpand custom instruction. The Viterbi algorithm is implemented for 12 bits decoding, in which the trellis expansion function or the Texpand instruction is called about 19 times. Tables III and IV show a significant performance improvement of about 3.4 and 2.9 times for the DLX and PicoJava II processors, respectively.

TABLE III: Comparison between trellis assembly function and Texpand Instruction on CPUSIM

Quantity | Trellis Assembly Function | Texpand Instruction
Assembly Instruction (A.I) | 63 | 1
Microinstruction (M.I) | 277 | 100
Fetched Instructions (I x 4) | 63 | 1
Function calls / Texpand instruction calls | 19 | 19
Total M.I = ((M.I + F.I) x 19) | 6460 | 1919
Total Time (T) = M.I x 4 | 25840 | 7676
%age Improvement | 236

TABLE IV: Comparison between Trellis assembly function and Texpand Instruction on MIC-1

Quantity | Trellis Assembly Function | Texpand Instruction
Assembly Instruction (A.I) | 41 | 1
Microinstruction (M.I) | 255 | 102
Fetched Instructions (I x 4) | 41 | 1
Function calls / Texpand instruction calls | 19 | 19
Total M.I ((M.I + F.I) x 19) | 5624 | 1957
Total Time (T) = M.I x 4 | 22496 | 7828
%age Improvement | 187
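The totals in Tables III and IV follow directly from the per-call counts, and the arithmetic can be checked with a few lines. In this sketch the function names are hypothetical, and the reading of "%age Improvement" as (old - new) / new, truncated to an integer, is an assumption that happens to reproduce both table entries.

```python
# Recomputes the totals of Tables III and IV from the per-call counts.
CALLS = 19            # trellis expansions for 12-bit decoding (from the text)
CYCLES_PER_MI = 4     # clock cycles per microinstruction (from the text)


def total_time(micro, fetched):
    """Return (total microinstructions, total clock cycles) for 19 calls."""
    total_mi = (micro + fetched) * CALLS
    return total_mi, total_mi * CYCLES_PER_MI


def improvement_percent(old, new):
    # ASSUMPTION: improvement = (old - new) / new, truncated to an integer
    return (old - new) * 100 // new


tables = {
    "DLX (CPU Sim)": ((277, 63), (100, 1)),
    "PicoJava II (MIC-1)": ((255, 41), (102, 1)),
}
for name, (asm, texpand) in tables.items():
    t_asm = total_time(*asm)[1]
    t_ci = total_time(*texpand)[1]
    print(name, t_asm, t_ci, improvement_percent(t_asm, t_ci), round(t_asm / t_ci, 2))
# prints: DLX (CPU Sim) 25840 7676 236 3.37
# prints: PicoJava II (MIC-1) 22496 7828 187 2.87
```

Expressed as a ratio of clock cycles, the improvements of 236% and 187% correspond to speedups of roughly 3.37 and 2.87 times.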

Similarly, Table V shows the comparison between the NIOS II assembly language program and the custom instruction-based Viterbi algorithm implementation. The NIOS II assembly instructions take different numbers of clock cycles to execute, and these also differ for each version [17]. The performance of the Viterbi algorithm is improved by about 2 times with the custom instruction as compared to the assembly language function for all the NIOS II processors. Some additional assembly instructions are used in the assembly program with the custom instruction because the data that must be passed to the custom instruction are packed into a register through shift instructions before calling the custom instruction. After the execution of the custom instruction, the results in the particular register can only be extracted by using additional assembly instructions.

As can be seen from Tables III, IV and V, the performance improvements of the Viterbi algorithm in the DLX and PicoJava II processors are considerably higher than those on the NIOS II soft processors. For the DLX and PicoJava II implementations, there is no need to use shift instructions to pass the data to the custom instruction; we can directly access the data in the memory location through microinstructions in the CPU Sim and MIC-1 simulators, while in NIOS II the data are passed to the custom instruction by using additional assembly instructions. Therefore, additional execution cycles are consumed in the assembly program of the Viterbi algorithm with the custom instruction. Another reason is that the custom instruction in NIOS II is created by writing a Verilog HDL program, which is quite different from the procedure used for creating custom instructions in the CPU Sim and MIC-1 simulators.

TABLE V: Comparison between NIOS II Trellis assembly function and Custom Instruction
A.L.T.F = Assembly Language Trellis Expansion Function, C.I = Custom Instruction, A.L.I = Assembly Language Instructions
Columns: Nios II/f Processor, Nios II/s Processor, Nios II/e Processor

Fig. 3: Histogram Plot for DLX, PicoJava II and NIOS II Performance Improvement. Panels: (a) DLX (CI) and DLX, (b) Pico Java (CI) and Pico Java, (c) NIOS II (CI) and NIOS II. x-axis: No. of Bits; y-axis: No. of Clock Cycles.

The trend in the performance improvement of the Viterbi algorithm, using our custom Texpand instruction, as the number of received bits increases can be seen in Fig. 3. The x-axis shows the number of decoding bits and the y-axis represents the number of clock cycles consumed in order to recover the information bits using the Viterbi algorithm. The first bar depicts the total number of clock cycles used with the custom instruction approach, whereas the second bar represents the number of clock cycles used by the assembly-level program without the custom instructions. We take clock cycle consumption as the metric of comparison to measure the performance of the Viterbi algorithm implementations, i.e., with and without the custom instruction. It can be seen clearly in Fig. 3 that fewer clock cycles are consumed when the Viterbi algorithm is implemented with the custom instruction than when it is implemented using the non-modified ISAs of the DLX, PicoJava II and NIOS II processors. Also, it can be observed that as the number of bits increases, the clock cycle consumption of the Viterbi algorithm increases drastically, and the custom instruction based implementations significantly help to reduce the number of clock cycles. For a bird's-eye view, Fig. 4 presents a graphical depiction of the performance improvement of the Viterbi algorithm by using the Texpand instructions compared to the assembly program for the DLX, PicoJava II and NIOS II processors.

The proposed implementation of the Viterbi algorithm is more efficient than the custom instruction-based Viterbi algorithm implementation in Xtensa [13], which is also an FPGA-based soft processor, like NIOS II. In [26], the comparison and performance improvement are described between the implementation of a C language program and the custom instruction, named TIE.
However, the assembly code generated from a C program may not be as optimized as a hand-written assembly program, causing extra assembly instruction overhead. Consequently, it consumes more cycles, affecting the overall performance of the processor.

The custom instructions that we have created in this work are generic and can be used for practical convolutional encoders having coding rate 1/2. For instance, the GSM convolutional encoder [14], which has coding rate 1/2 and constraint length K is and total number of states is .

An important feature of our proposed approach is that it can be used to improve the execution performance of other computationally complex algorithms, especially those used in the domain of image processing. For instance, Sundararajana et al. have recently implemented a custom instruction based FFT algorithm on the NIOS II processor using the ALTERA DE2 board embedded with a Cyclone II FPGA [24]. By using our proposed approach, a fair comparison of the performance improvement of their custom instruction based FFT algorithm implementation can be made by implementing it also on the DLX and PicoJava II processors.

Fig. 4: Trend of Performance Improvements in DLX, PicoJava II and NIOS II. x-axis: No. of Bits (12, 18, 30, 40, 60); y-axis: No. of Cycles; series: DLX, DLX (CI), Pico Java, Pico Java (CI), NIOS II, NIOS II (CI).

VI. CONCLUSION

In this paper, we report an enhancement of the DLX and PicoJava II processor ISAs for the efficient implementation of the Viterbi decoding algorithm. We create a custom trellis expansion instruction (Texpand) in the CPUSIM simulator for a RISC-based architecture and in the MIC-1 simulator for a stack-based architecture. The execution time is improved by approximately three times when the Texpand instruction is designed for the RISC architecture and by approximately three times for the stack-based architecture. In addition, we enhance the ISA of the NIOS II soft processor for the efficient implementation of the Viterbi algorithm. The comparison with and without the custom instruction shows substantial improvement in the results. The performance of the NIOS II processor with the custom instruction is improved by about two times compared to the assembly language program without the custom instruction.

In this paper, we presented our proposed approach by realizing the implementation of the DLX and PicoJava II processors on computer-based software tools. However, an FPGA-based implementation of these processors may also improve the execution performance for computationally complex algorithms, as we can change the clock frequency and also execute the custom instruction in parallel with other independent instructions. We also plan to extend our proposed approach to state-of-the-art architectures, such as GPUs [4], aiming to provide a detailed comparison in terms of execution time, delay, latency and complexity.

REFERENCES

[1] W Ahmed, H Mahmood, and U Siddique. The Efficient Implementation of S8 AES Algorithm. In Proceedings of world congress on engineering, pages 1215–1219, 2011.
[2] W Ahmed, H Mahmood, and U Siddique. Efficient Implementation of Computationally Complex Algorithms: Custom Instruction Approach. In Electrical Engineering and Intelligent Systems, pages 39–52. Springer, 2013.
[3] Martin Bossert. Channel Coding for Telecommunications. John Wiley & Sons, Inc., 1999.
[4] S Che, J Li, J W Sheaffer, K Skadron, and J Lach. Accelerating Compute-intensive Applications with GPUs and FPGAs. In Symposium on Application Specific Processors, pages 101–107. IEEE, 2008.
[5] Z Chen, A Sinha, and P Schaumont. Implementing Virtual Secure Circuit using a Custom-instruction Approach. In International conference on Compilers, Architectures and Synthesis for Embedded Systems, pages 57–66. ACM, 2010.
[6] K Cholan. Design and Implementation of Low Power High Speed Viterbi Decoder. Procedia Engineering
Application-specific Systems, Architectures and Processors, pages 75–82. IEEE, 2017.
[9] G David Forney. The Viterbi Algorithm. Proceedings of the IEEE, 61(3):268–278, 1973.
[10] GD Forney. Convolutional Codes II. Maximum-likelihood Decoding. Information and control, 25(3):222–266, 1974.
[11] G David Forney Jr. The Viterbi Algorithm: A Personal History. In Viterbi Conference on Advancing Technology through Communications Sciences, pages 1–8, 2005.
[12] D González, G Botella, C García, M Prieto, and F Tirado. Acceleration of Block-matching Algorithms using a Custom Instruction-based Paradigm on a Nios II Microprocessor. EURASIP Journal on Advances in Signal Processing, 2013(1):118, 2013.
[13] RE Gonzalez. Xtensa: A Configurable and Extensible Processor. IEEE micro, 20(2):60–70, 2000.
[14] S Grech. Channel Coding Standards in Mobile Communications. Helsinki University of Technology, 1999.
[15] JL Hennessy and DA Patterson. Computer Architecture: A Quantitative Approach. Elsevier, 2011.
[16] KJ Lin and CC Hou. Implementation of Trigonometric Custom Functions Hardware on Embedded Processor. In Consumer Electronics
Acoustics, Speech, and Signal Processing, volume 5, pages 33–36. IEEE, 2005.
[20] S Palnitkar. Verilog HDL: A Guide to Digital Design and Synthesis
Global Telecommunications Conference, pages 1694–1698. IEEE, 1992.
[23] D Skrien. CPU Sim 3.1: A tool for Simulating Computer Architectures for Computer Organization Classes. Journal on Educational Resources in Computing, 1(4):46–59, 2001.
[24] S Sundararajana, U Meyer-Baese, and G Botella. Custom Instruction for NIOS II Processor FFT Implementation for Image Processing. In Sensing and Analysis Technologies for Biomedical and Cognitive Applications 2016, volume 9871, pages 1–12. International Society for Optics and Photonics, 2016.
[25] AS Tanenbaum. Structured Computer Organization