Deep Learning-Based FPGA Function Block Detection Method using an Image-Coded Representation of Bitstream
Minzhen Chen and Peng Liu, Member, IEEE
Abstract—Examining a field-programmable gate array (FPGA) bitstream is found to help detect known function blocks, which offers assistance and insight for analyzing the circuit's system function. Our goal is to detect one or more function blocks in an FPGA design from a complete bitstream by utilizing the latest deep learning techniques, which do not require manually designed features. To this end, in this paper, we propose a deep learning-based FPGA function block detection method by transforming the bitstream into a three-channel color image. Specifically, we first analyze the format of the bitstream to find the mapping relationship between the configuration bits and configurable logic blocks. Next, an image-coded representation of the bitstream suitable for deep learning processing is proposed. This bitstream-to-image transformation takes into account the adjacency nature of the programmable logic as well as the high degree of redundancy of the configuration information. With the color images transformed from bitstreams as the training dataset, a deep learning-based object detection algorithm is applied to generate the function block detection results. The effects of EDA tools, the input size of the deep neural network, and the data arrangement of the representation on the detection accuracy are explored. Xilinx Zynq-7000 SoCs and Xilinx Zynq UltraScale+ MPSoCs are adopted to verify the proposed method, and the results show that the mean Average Precision (IoU=0.5) for 10 function blocks is as high as 97.72% for the YOLOv3 detector.
Index Terms—Field-programmable gate array, bitstream, image-coded representation, function block detection.
I. INTRODUCTION

FIELD-PROGRAMMABLE gate arrays (FPGAs) have become popular and been widely applied in various fields, such as communication, computation, deep learning, and digital signal processing, due to their configurability, fast development cycles, and abundant resources. Because of their low-latency and low-power features, FPGAs also play an important role in highly real-time and embedded systems. With widespread application, the security issues of FPGAs are gaining more and more attention. Since the function of an FPGA is fully determined by the bitstream used to configure it, it is possible to retrieve the design details of the circuit implemented in the FPGA by reverse engineering the bitstream. This paper focuses on the reverse analysis of circuit design function. Detecting known function blocks from the circuit design can play the role of pre-analysis and offer assistance to the analysis of the circuit's system function. After the function
The authors are with the College of Information Science and Electronic Engineering, Zhejiang University, Hangzhou 310027, China. E-mail: {chenmz, liupeng}@zju.edu.cn.
Fig. 1. An example application scenario demonstrating that FPGA function block detection helps analyze the circuit's system function.

blocks are detected, the system function implemented on the circuit can be further found out.

We take the scenario shown in Fig. 1 as an example to demonstrate that FPGA function block detection helps analyze the circuit's system function. In this scenario, the circuit's system function refers to the application algorithm implemented on the FPGA, and the application algorithm contains different kinds of function blocks. After bitstream access, the function blocks contained in the application algorithm can be identified through FPGA function block detection from the bitstream. Then, the application algorithm can be found out or narrowed down to several candidates.

To identify the function blocks in FPGA designs, researchers in [1], [2] analyzed bitstreams or netlists, partitioned circuits, and compared the content of the circuits partitioned from bitstreams or netlists with existing designs using conventional algorithms. However, the partitioning process is time-consuming, and imperfect partitioning can lead to incorrect matching results. Features for conventional algorithms need to be designed manually, and improperly selected features result in performance degradation. Deep neural networks (DNNs) have been used to classify arithmetic operator modules in circuits [3], [4], [5], due to their good performance, strong ability to learn from large amounts of data, and the advantage of not requiring manually designed features. Dai et al. [3] and Fayyazi et al. [4] classified arithmetic operators from gate-level circuits, which requires the bitstream to be reverse engineered first. Mahmood et al. [5] classified arithmetic operators from partial bitstreams.
However, the work in [5] lacks the processing of raw bitstream data and could only classify a circuit containing one hardware module.

Inspired by the work mentioned above, the goal of this paper is to detect one or more function blocks in an FPGA circuit design from a given complete bitstream by making use of the latest deep learning research results. Since more than one function block is to be detected at the same time, object detection networks are used in this paper instead of classification networks similar to the ones mentioned above. Because of the discontinuity of the configuration bits for one element and the high degree of redundancy of the configuration information, the raw bitstream data should be pre-processed. Therefore, an effective representation of the bitstream should be proposed so that the DNN can make full use of its feature extraction capability.

In this paper, we propose an FPGA function block detection method based on deep learning using an image-coded representation of the bitstream. For the purpose of improving the detection accuracy and compressing the size of the dataset, the representation should reflect the adjacency of the programmable logic and remove the useless information. The approaches taken are: 1) finding the mapping relationship between the configurable logic block (CLB) elements and the configuration bits, and 2) using only the configuration bits of CLBs for the representation. Then, the bitstreams are transformed into images by the proposed representation, and the deep learning techniques are applied by training an innovative object detection algorithm on the images. Researchers [6] have implemented a variety of application-specific encryption algorithms containing different cryptographic operators, such as the Advanced Encryption Standard (AES), Secure Hashing Algorithm 1 (SHA-1), Message Digest Algorithm 4 (MD4), and so on, on FPGA devices.
Our work takes cryptographic operator detection as an example to verify the methodology.

In summary, the main contributions of this paper include:
• A three-channel color image-coded representation of bitstream suitable for deep learning processing is proposed by analyzing the mapping relationship between the configuration bits and CLB elements.
• A dataset, in which the images are transformed from bitstream files containing 10 kinds of cryptographic operators, is generated without manual annotation.
• The deep learning techniques are applied to FPGA function block detection from bitstream for the first time by training a deep learning-based object detection algorithm on the dataset. The mean Average Precision (mAP) reaches 97.72% for 10 kinds of function blocks when the Intersection over Union (IoU) is 0.5.

The rest of the paper is organized as follows: Sec. II briefly describes the background of FPGAs and deep learning-based object detection algorithms. Sec. III first introduces the overall process of the detection method and then describes each step of the method in detail. The experimental results are presented in Sec. IV. Sec. V discusses the related work and Sec. VI summarizes the findings.

II. BACKGROUND
In this section, the basic knowledge of FPGAs and function block detection is presented first (Sec. II-A). Then, the background of deep learning and object detection algorithms is introduced (Sec. II-B).
A. FPGA and Function Block Detection
FPGA.
FPGAs are widely applied in digital designs. Compared to application-specific integrated circuits (ASICs), FPGAs have the advantages of reconfigurability and flexibility, which bring about lower costs and shorter development time. There are several kinds of FPGAs, such as static random-access memory (SRAM) FPGAs, flash memory FPGAs, and anti-fuse FPGAs. Although flash memory FPGAs and anti-fuse FPGAs have higher security than SRAM FPGAs, they have a more complex process and limited writing times. SRAM FPGAs occupy the main market and are more widely applied in daily life than the other kinds of FPGAs. With the increase of application demand, more and more resources are integrated into FPGAs. For example, the Xilinx Zynq-7000 SoC includes embedded microprocessors in the FPGA system. A Xilinx SoC usually consists of the processing system, the programmable logic, and many other features all in one silicon chip. The processing system includes microprocessors, an interconnection interface, an external memory interface, and so on. The programmable logic includes CLBs, Input/Output Blocks (IOBs), Block RAMs (BRAMs), Digital Signal Processors (DSPs), and others. CLB resources are the logical resources for realizing combinational and sequential logic circuits. A CLB element contains Look-Up Tables (LUTs) and Flip-Flops (FFs). A Switch Matrix connects the CLB to the other resources, allowing flexible wiring. CLBs are arrayed into a two-dimensional matrix in the programmable logic. BRAMs are used for dense storage, and DSPs are used for high-speed computing.
Bitstream.
An FPGA bitstream is a binary file that contains the configuration information for an FPGA design. In SRAM FPGAs, the bitstream is placed in external non-volatile memory. The programming process of an FPGA is to load a bitstream file into the FPGA. The reconfigurability of FPGAs benefits from the bitstream file. However, the bitstream also brings about many security issues for FPGAs. Many FPGA vendors provide bitstream encryption to improve the confidentiality of the bitstream.
FPGA security.
Since the security of the processing system in an FPGA corresponds to processor security, FPGA security here focuses on the security of the programmable logic in the FPGA. Compared to ASICs, FPGAs are more vulnerable to attack because of the binary file used to configure the FPGA design. FPGA security threats include cloning, overbuilding, reverse engineering, tampering, spoofing, and so on. These threats may lead to intellectual property (IP) theft, circuit design leakage, privacy issues, and even FPGA system damage.

The reverse engineering of FPGAs refers to analyzing the bitstream format and transforming the bitstream into the netlist [7]. The netlist contains the components and the connection information among the components. The reverse engineering of FPGAs consists of the following steps: bitstream access, bitstream decryption [8], [9], [10], and bitstream reverse engineering [11], [12], [13], [14]. Bitstream access and bitstream decryption are not the focus of this paper.
Function block detection.
In this paper, a function block in an FPGA system design refers to a circuit block implementing a complex function, such as a cryptographic operator like SHA-1. A function block is more complex and occupies more hardware resources than a simple hardware module, such as an adder or a multiplier. Function block detection should not only pick out one or more function blocks from the FPGA design but also point out the locations of the function blocks in the FPGA device diagram. However, classification can only give one prediction result for one test sample. Therefore, function block detection means partitioning first, then classifying each partitioned sample. Function block detection plays the role of pre-analysis in the analysis of the circuit's system function.
B. Deep Learning and Object Detection Algorithms
Deep learning.
Deep learning has become a popular research field with many applications, such as image processing, speech recognition, medical diagnosis, unmanned ground vehicles, and so on. Deep learning makes it needless to design features manually because it extracts high-level, abstract features from raw data through several layers [15]. Convolutional neural networks (CNNs) and recurrent neural networks (RNNs) are popular DNNs, which are widely used in various fields. A deep learning model learns from a large dataset; this process is called training, and the dataset is called a training set. After training, the performance of the model is usually evaluated on another dataset, which is called a test set.
CNN.
CNNs are among the most important deep learning networks and have gained great success in many applications. CNNs are a kind of DNN with at least one convolution layer, which uses convolution calculation instead of general matrix multiplication. CNNs usually consist of convolution layers, pooling layers, fully connected layers, and so on. Because of the convolution calculation, CNNs are able to extract local features of an image and have fewer parameters than fully connected neural networks. Therefore, CNNs achieve quite good performance in two-dimensional image applications.
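To make the parameter-count claim concrete, consider mapping a 32 × 32 × 3 input to a 32 × 32 × 16 output (the sizes are illustrative, not from the paper): a single 3 × 3 convolution needs orders of magnitude fewer weights than a dense layer over the same tensors.

```python
# Parameter counts (weights + biases) for mapping a 32x32x3 input
# to a 32x32x16 output; the tensor sizes here are illustrative.
conv_params = 3 * 3 * 3 * 16 + 16  # one 3x3 conv kernel per output channel
dense_params = (32 * 32 * 3) * (32 * 32 * 16) + 32 * 32 * 16

print(conv_params, dense_params)  # 448 vs 50348032
```

The convolution reuses the same small kernel at every spatial position, which is exactly why CNNs scale to image-sized inputs.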
Object detection.
The function of object detection algorithms is to find the locations of the objects of interest in an image and to give a classification probability for each object. Object detection algorithms generally fall into two categories: conventional methods based on manually designed features and deep learning-based methods. Deep learning-based methods do not need specifically defined features and have achieved good performance. To date, there are two groups of deep learning-based object detection algorithms. One is two-stage object detection algorithms based on region proposal and classification, such as R-CNN [16], Fast R-CNN [17], and Faster R-CNN [18]. The other is one-stage object detection algorithms based on regression, such as You Only Look Once (YOLO) [19], [20], [21] and the Single Shot MultiBox Detector (SSD) [22]. The biggest advantage of the one-stage object detection algorithms is their fast detection speed with the necessary accuracy.

Some of the function blocks occupy a small part of the whole image transformed from the bitstream by the image-coded representation, and sometimes there is more than one function block in an image. Considering the characteristics of the function blocks in the image, object detection algorithms are more suitable for our work than other kinds of algorithms, such as classification algorithms. Classification algorithms cannot work on bitstreams containing more than one function block, because they can only assign one category to each image.

Object detection is widely applied in various fields, such as face detection, object tracking, security, unmanned vehicles, robots, and so on. However, it has never been applied to FPGA function block detection. Our work combines the image-coded representation of bitstream and a deep learning-based object detection algorithm to realize the detection of function blocks in FPGA bitstreams for the first time.
YOLO.
YOLO is the first deep learning-based object detection algorithm with the idea of end-to-end training [19], which means YOLO does not have a region proposal stage and uses a single network. The object detection problem is regarded as a regression problem. YOLO extracts features from images and predicts the location information of the bounding boxes and the class probabilities directly. Compared to two-stage object detection algorithms, YOLO greatly improves the running speed and has better generalization ability. Through the development of YOLOv1 [19], YOLOv2 [20], and YOLOv3 [21], YOLO has significantly improved its speed, number of detectable classes, localization accuracy, and accuracy on small objects. From YOLOv2 onward, YOLO uses convolution layers instead of the last fully connected layer and adopts multi-scale training so that multi-scale input images can be predicted by the same network. Among various deep learning-based object detection algorithms, YOLOv3 [21] has the best comprehensive performance since it achieves high accuracy and fast speed at the same time.
SSD.
SSD [22] is another classical one-stage deep learning-based object detection algorithm, which is also based on the idea of end-to-end training. SSD is able to detect multi-scale objects by computing the results from multiple feature maps of different sizes. The speed and accuracy of SSD were higher than those of YOLOv1 when SSD was presented. However, the performance of SSD was soon exceeded by YOLOv2 and YOLOv3.

III. METHODOLOGY
A. Overview
The process of function block detection consists of the following steps, as shown in Fig. 2. First of all, the analysis of the FPGA bitstream format is carried out, and the mapping relationship used for the representation of bitstream is found. Secondly, a large number of bitstreams are transformed into images by the image-coded representation of bitstream, and a dataset containing many images is generated without manual annotation. Then, a process of deep learning training is carried out, and a model with good performance is obtained. When testing, an FPGA bitstream is transformed into an image, which later passes through the deep network. At last, the detection result is obtained. The processes of dataset generation, deep learning training, and testing can be implemented automatically by scripts. The rest of this section describes each step in detail.
Fig. 2. The process of the function block detection method, consisting of bitstream format analysis, image-coded representation of bitstream, dataset generation, deep learning training, and deep learning inference.
B. Bitstream Format Analysis
The components of FPGAs are described in Sec. II-A. The bitstream files are used to configure the programmable logic. The programmable logic of an FPGA can be divided into several Clock Regions. Each Clock Region consists of many columns of CLBs, and q CLBs make up a column of CLBs. For instance, Fig. 3 shows the Clock Regions of the Xilinx Zynq-7000 SoC ZC702 Evaluation Board. The dark blue area represents columns of CLBs, and the numbers of columns and rows of CLBs are marked in Fig. 3. For Xilinx Zynq-7000 SoCs, a CLB element contains two slices, and each slice consists of 4 LUTs and 8 FFs [23]. For Xilinx Zynq UltraScale+ MPSoCs, a CLB element contains one slice, and each slice consists of 8 LUTs and 16 FFs [24]. There are two types of slices, SLICEL (logic) and SLICEM (memory).

The Xilinx FPGA bitstream consists of Head-of-File, FDRI (Frame Data Register Input) data, and End-of-File. Among these three parts, the FDRI data contains the configuration information and is the main content. The configuration memory is arranged in frames, which are the smallest addressable segments of the configuration memory space. Each frame contains m words. Every n frames of the FDRI data configure a column of CLBs (q CLBs), as shown in Fig. 4.

Fig. 3. As an example, the ZC702 FPGA consists of the programmable logic and the processing system. The programmable logic can be divided into 6 Clock Regions. The dark blue area represents columns of CLBs, and the digits in black show the numbers of columns and rows of CLBs of each Clock Region. The total number of CLBs in the ZC702 FPGA is 6,650.

Except for the l words in the middle of the frame,
Fig. 4. Every successive n frames of the FDRI data configure a column of CLBs from bottom to top. The configuration data for one CLB is distributed at the same location in the n frames.

every p words in the remaining m - l words of a frame correspond to a CLB, from the bottom to the top of the column. For Xilinx Zynq-7000 SoCs, a CLB contains two slices, and the p words configure the two slices in a CLB separately, from left to right. Therefore, each CLB in a column of CLBs needs p × n words to configure, which are distributed at the same location in the n frames.

In the n frames of the FDRI data configuring the same column of CLBs, the first frame has a fixed position in the bitstream. The positions of every first frame can be found by configuring the different columns of CLBs repeatedly. As an example, the composition of the bitstream of the ZC702 FPGA is shown in Fig. 5. The number of frames configuring a column of CLBs (n) is 36 for the ZC702 FPGA. Sometimes the gap between two consecutive first frames is 64 frames instead of 36 frames, because there is a column of BRAMs between the two columns of CLBs, and a column of BRAMs needs 28 frames to configure. After the positions of every first frame
Fig. 5. The bitstream file is arranged in frames. As an example, the composition of a bitstream file of the ZC702 FPGA and the positions of some first frames are shown. The number of frames configuring a column of CLBs (n) is 36 for the ZC702 FPGA.

are found, the configuration bits for every CLB element can be extracted from the raw data by a Python script.

C. Image-Coded Representation of Bitstream
When proposing an image-coded representation of bitstream, there are two challenges to be faced. The first challenge is that the configuration bits of one CLB element are not consecutive in a bitstream. In order to extract features from the image-coded representation, the representation should reflect the adjacency of the programmable logic, and the discontinuity of the configuration bits makes this difficult. The approach taken is analyzing the bitstream format and finding the mapping relationship between the CLB element in the device location diagram and the configuration bits in the bitstream, so that the two-dimensional location distribution diagram of the programmable logic is obtained through the mapping relationship.

The second challenge is that not all configuration information in the bitstream file is useful for function block detection. This challenge leads to two bad effects: 1) Since our work focuses on the logic resources, the utilization information of other resources may confuse the function block detection. For instance, the utilization of BRAMs will change if the array size in the function block changes, so the utilization of BRAMs may be very different for the same kind of function block. However, the logic in the function block will not change according to the data size. Therefore, the configuration bits that are irrelevant to the logic resources and unhelpful for function block detection can be dropped. 2) A large image size leads to a large image dataset and a low speed of reading images during deep learning training. For instance, the storage size of a bitstream of the ZC702 FPGA is 3.86 MiB, which is fixed for one type of FPGA device. If all of the configuration bits in a bitstream were transformed into an image, the size of the image would reach 1280 × … × 3, which is too large for deep learning training. The approach taken is using only the configuration bits of CLBs for the representation. Since approximately 60% of the configuration bits in a bitstream are used for configuring CLBs, using only the configuration bits of CLBs compresses effectively and drops useless information at the same time.

Fig. 6. For Xilinx Zynq-7000 SoCs, the 36-word (144-byte) configuration bits of a slice are transformed into a three-channel color image with 6 × 8 pixels.

On the basis of the above bitstream analysis (Sec. III-B), we propose a three-channel color image-coded representation of the bitstream. The device diagram of the FPGA is divided into a × b blocks, since each block corresponds to a CLB. The representation of each CLB is done separately, and the representation results of all of the CLBs are aggregated to obtain the entire device location map.

According to the analysis of the bitstream format (Sec. III-B), each CLB is allocated 4 × p × n bytes of configuration memory from n successive frames. For Xilinx Zynq-7000 SoCs, each slice is allocated 2 × p × n = 144 bytes. For Xilinx Zynq UltraScale+ MPSoCs, each slice is allocated 4 × p × n bytes (n is different for the different types of slices). The bytes of configuration data for a slice are transformed into a separate three-channel RGB image with the proper height and width. For example, the configuration bits of a slice in Xilinx Zynq-7000 SoCs are transformed into a three-channel color image with 6 × 8 pixels, so the entire image-coded representation has (a × 6) × (b × 2 × 8) pixels with three RGB channels. For Xilinx Zynq UltraScale+ MPSoCs, the size of the image-coded representation of bitstream is decided by the numbers of SLICELs and SLICEMs. The configuration bits for every slice can be transformed into the three-channel image-coded representation by the Python script after the configuration bits are extracted from the bitstream file.
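As an illustrative sketch (not the paper's released scripts), the per-slice extraction and transformation described in Sec. III-B and above can be written as follows. The Zynq-7000 parameters m = 101, n = 36, p = 2, l = 1 are taken from the format analysis; the function names and the exact byte-to-pixel ordering are our assumptions.

```python
import numpy as np

# Zynq-7000 format parameters from the bitstream analysis (Sec. III-B):
# words per frame, frames per CLB column, words per CLB, skipped middle words
M, N, P, L = 101, 36, 2, 1

def slice_words(fdri, first_frame, clb_index, slice_index):
    """Collect the 36 words configuring one slice of a CLB.

    `fdri` holds the FDRI data as 32-bit words, `first_frame` is the index
    of the column's first frame, `clb_index` counts CLBs from the bottom
    of the column, and `slice_index` is 0 or 1 (left/right slice).
    """
    words = []
    mid = (M - L) // 2  # the L middle words of a frame configure no CLB
    for f in range(first_frame, first_frame + N):
        w = clb_index * P + slice_index  # one of the P per-CLB words
        if w >= mid:
            w += L                       # skip the middle words
        words.append(fdri[f * M + w])
    return words                         # N = 36 words (144 bytes)

def words_to_tile(words):
    """Turn 36 words (144 bytes) into the 6 x 8 three-channel tile of Fig. 6.

    The 6 x 8 geometry follows Fig. 6; the byte order here is illustrative.
    """
    raw = b"".join(w.to_bytes(4, "big") for w in words)
    return np.frombuffer(raw, dtype=np.uint8).reshape(6, 8, 3)
```

Aggregating the tiles of all a × b CLBs (two tiles side by side per Zynq-7000 CLB) then yields the full device image.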
D. Generation of Dataset
The image-coded representation of bitstream is used to transform a large number of bitstream files into images, which are gathered into a dataset for deep learning training and testing. Each bitstream file implements an algorithm, and each algorithm contains one or more function blocks. In a practical application, one kind of function block can have different constructions, such as the original one with no special design and a pipelined one. Therefore, each kind of function block is implemented in one or two constructions when the bitstreams are generated.
Multiple bitstream files containing various kinds of function blocks are needed for the training of the deep network. We generate a large number of bitstreams with an EDA toolset (Xilinx Vivado). Constraining the implementation region makes the function blocks be placed at different locations in different bitstreams. Tcl (Tool Command Language) [27] scripts for Xilinx Vivado are used instead of the graphical user interface (GUI). The categories and locations of the function blocks in the FPGA device diagram can be extracted from the EDA toolset when the bitstreams are generated by the Tcl scripts. Finally, the bitstream files are transformed into images using Python scripts, which also process the label information into annotation files for deep learning.
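The automatic annotation step can be sketched as follows: a function block's region constraint, expressed in CLB coordinates, maps directly to a pixel bounding box in the image-coded representation. The Zynq-7000 tile geometry (two 6 × 8 slice tiles per CLB) follows the representation above; the function name and the exact coordinate conventions of the paper's scripts are our assumptions.

```python
def clb_region_to_bbox(col, row, width, height,
                       slices_per_clb=2, tile_w=8, tile_h=6):
    """Map a rectangular CLB region (in CLB units) to a pixel bounding box.

    Assumes each CLB occupies `slices_per_clb` 6x8 slice tiles side by
    side, as in the Zynq-7000 representation. (col, row) is the region's
    top-left CLB; the result is (x_min, y_min, x_max, y_max) in pixels.
    """
    clb_w = slices_per_clb * tile_w  # 16 pixels per CLB horizontally
    return (col * clb_w, row * tile_h,
            (col + width) * clb_w, (row + height) * tile_h)
```

Because the region constraint and the function block's class are both known to the EDA scripts, each training image can be labeled without any manual annotation.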
E. Architectures of the DNNs and Training for Function Block Detection
Since deep learning techniques have not previously been applied to FPGA function block detection, our work makes use of the image-coded representation of bitstream and the image feature extraction capabilities of DNNs. YOLOv3 and SSD are two classical one-stage deep learning-based object detection algorithms, which have fast speed and high accuracy at the same time; in particular, their speed is much faster than that of two-stage deep learning-based object detection algorithms. The reason we choose YOLOv3 and SSD is their high comprehensive performance. The architectures and training processes of YOLOv3 and SSD used in our work are introduced briefly below.
YOLOv3.
YOLOv3 consists of 75 convolution layers. The kernel/stride sizes include 1 × 1 and 3 × 3. Each output grid cell predicts box_number bounding boxes (box_number is 3 for YOLOv3). One objectness prediction, C class predictions for C classes, and 4 box offsets are predicted for each bounding box. The objectness prediction quantifies how likely the image in the box contains a generic object [28]. Thus, the filter number of the output layers is 3 × (1 + C + 4). For example, we apply YOLOv3 to the detection of 10 kinds of function blocks, so the number of classes C is 10 and the filter number of the output layers is 45.

When training YOLOv3, we take the weights pre-trained on the COCO dataset [29] as the initial weights. During the first 50 epochs of the training process, the front layers are frozen to get a stable loss value, except for the last three output convolution layers. From the 51st to the 100th epoch, all of the layers are unfrozen and trained with a smaller learning rate. When the training is finished, the model with the smallest validation loss value is chosen as the final model. Eventually, the model is tested on the test set.

SSD. SSD consists of 29 convolution layers and 4 max-pooling layers. The kernel/stride sizes are the same as in YOLOv3. The backbone of SSD is VGG-16 [30]. There are six output convolution layers for objects of different sizes.
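The output-layer sizing rules for both detectors can be checked numerically. The helper names are ours; the SSD rule box_number × (C + 4) is the one stated later in this subsection.

```python
def yolov3_filters(num_classes, boxes_per_scale=3):
    # one objectness score, num_classes class scores, 4 box offsets per box
    return boxes_per_scale * (1 + num_classes + 4)

def ssd_filters(num_classes, boxes_per_location):
    # num_classes class scores and 4 box offsets per default box
    return boxes_per_location * (num_classes + 4)

print(yolov3_filters(10))                       # 45
print(ssd_filters(10, 4), ssd_filters(10, 6))   # 56 84
```

With C = 10 function block classes, these reproduce the filter counts of 45 for YOLOv3 and 56 or 84 for SSD quoted in this subsection.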
TABLE I
PARAMETERS FOR THE TRAINING PROCESS OF YOLOV3 AND SSD

Deep neural network | YOLOv3 | SSD
Input size | … × … | … × …

The images are resized to the input size of the DNN before being fed into the DNN.
Similar to YOLOv3, the filter number of the output layers is box_number × (C + 4). In SSD, box_number can be 4 or 6 for different output layers. Thus, the filter number of the output layers in SSD is 56 or 84 when the class number C is 10.

When training SSD, we load the trained weights of VGG-16 as the initial weights for the front layers. In the first stage of training, the front layers are frozen with the weights from the VGG-16 model. Then, the whole network is trainable in the second stage of training. The training process is similar to that of YOLOv3 introduced above.

IV. EXPERIMENTAL RESULTS
A. Experimental Setup
For evaluation purposes, we use Xilinx Zynq-7000 SoCs and Xilinx Zynq UltraScale+ MPSoCs to evaluate our proposed methodology. All of the experiments in this section are performed with the following experimental setup unless otherwise stated. The bitstream files are generated by the Xilinx Vivado toolset without encryption. The versions of the Xilinx Vivado toolset used by this work include Vivado 2016.3, Vivado 2017.2, and Vivado 2017.4. The scripts used for transforming bitstreams into images run in Python 2.7.15. The training and testing of the deep learning run in Keras 2.2.5 based on TensorFlow 1.10.0 for GPUs, with Python 3.5.6. A server running CentOS Linux 7.6, with an NVIDIA Tesla P100 GPU, is used to perform all of the experiments. There are 10 kinds of function blocks to detect. The DNN YOLOv3 is used as the deep learning-based object detection algorithm unless otherwise stated. The DNN SSD is only used in the experiments in Sec. IV-E6. The parameters for the training process of YOLOv3 and SSD are listed in TABLE I.

To characterize the performance of the object detector quantitatively, mAP under a specific IoU is used as the performance metric, which takes into account both precision and recall. In general, the performance is good when the IoU between the detected box and the ground truth is more than 0.5. In our work, two metrics are used, imitating the COCO dataset. One is the mAP at IoU=0.5 ([email protected]), which is the metric for the PASCAL VOC dataset [32]. The other one is the mAP at IoU=0.75 ([email protected]), which is stricter than [email protected].
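For reference, the IoU underlying both metrics is a simple area ratio; a minimal implementation, with boxes given as (x1, y1, x2, y2), is:

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)

# Two 10x10 boxes overlapping by half give IoU = 50 / 150 = 1/3, so this
# pair would count as a miss at [email protected] but a hit at a 0.3 threshold.
example = iou((0, 0, 10, 10), (5, 0, 15, 10))
```

A detection is counted as correct at [email protected] (respectively [email protected]) only when its IoU with a ground-truth box of the same class is at least 0.5 (respectively 0.75).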
B. Bitstream Format Information
For the purpose of finding the mapping relationship used for the representation of bitstream, this work analyzes the bitstream format of Xilinx Zynq-7000 SoCs and Xilinx Zynq
TABLE II
BITSTREAM FORMAT INFORMATION DETERMINED BY THE FPGA DEVICE FAMILY, TAKING XILINX ZYNQ-7000 SOCS AND XILINX ZYNQ ULTRASCALE+ MPSOCS AS EXAMPLES.

Device family                                      | Xilinx Zynq-7000 SoCs | Xilinx Zynq UltraScale+ MPSoCs
Number of words in a frame (m)                     | 101                   | 93
Number of CLBs in a column (q)                     | 50                    | 60
Number of frames configuring a column of CLBs (n)  | 36                    | 29 for SLICEL, 79 for SLICEM
Number of words in the middle of a frame
not configuring CLBs (l)                           | 1                     | 3
Number of words configuring a CLB (p)              | 2                     | 1.5

TABLE III
BITSTREAM FORMAT INFORMATION DETERMINED BY THE FPGA DEVICE, TAKING XILINX ZYNQ-7000 SOC Z-7020, XILINX ZYNQ-7000 SOC Z-7030, AND XILINX ZYNQ ULTRASCALE+ MPSOC ZU9EG AS EXAMPLES.

Device name                                   | Xilinx Zynq-7000 SoC Z-7020* | Xilinx Zynq-7000 SoC Z-7030** | Xilinx Zynq UltraScale+ MPSoC ZU9EG***
Number of Clock Regions in programmable logic | 6                            | 8                             | 25****
Number of CLBs in total                       | 6,650                        | 9,825                         | 34,260
Number of slices in total                     | 13,300                       | 19,650                        | 34,260
Number of frames of the FDRI data in total    | 10,008                       | 14,796                        | 71,260
Number of words of the FDRI data in total     | 1,010,808                    | 1,494,396                     | 6,627,180

* Evaluated on the Xilinx Zynq-7000 SoC ZC702 Evaluation Board.
** Evaluated on the Xilinx xc7z030fbg484-3 FPGA.
*** Evaluated on the Xilinx Zynq UltraScale+ ZCU102 Evaluation Board.
**** There are 28 Clock Regions in the Xilinx Zynq UltraScale+ MPSoC ZU9EG device in total; however, 3 Clock Regions have no programmable logic resources.
UltraScale+ MPSoCs. The analysis results of these two families of SoCs are listed in TABLE II. The bitstream format information listed in TABLE II is determined by the FPGA device family.

In order to further analyze the bitstream format information determined by the FPGA device, this work takes three FPGA devices, namely Xilinx Zynq-7000 SoC Z-7020, Xilinx Zynq-7000 SoC Z-7030, and Xilinx Zynq UltraScale+ MPSoC ZU9EG, as examples. The bitstream format information of these three FPGA devices is listed in TABLE III. The bitstream format information listed in TABLE II and TABLE III and the positions of every first frame are necessary for transforming the bitstreams into images. Similar bitstream format rules can be found when analyzing other FPGA devices and other FPGA device families.
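As an illustration of how the TABLE II parameters fit together (note that m = q·p + l for both families), the following sketch splits one Zynq-7000 frame of m = 101 words into per-CLB word groups. The placement of the l non-CLB words at the midpoint of the frame is an illustrative assumption, not a claim from the paper.

```python
# TABLE II parameters for Xilinx Zynq-7000 SoCs.
M, Q, P, L = 101, 50, 2, 1  # words/frame, CLBs/column, words/CLB, middle words

def clb_word_groups(frame_words):
    """Split one frame into Q groups of P words, one group per CLB.

    Assumes (for illustration only) that the L words not configuring
    CLBs sit exactly at the midpoint of the frame.
    """
    assert len(frame_words) == M and M == Q * P + L
    half = (Q // 2) * P                       # words before the middle gap
    body = frame_words[:half] + frame_words[half + L:]
    return [body[i * P:(i + 1) * P] for i in range(Q)]

groups = clb_word_groups(list(range(M)))
print(len(groups), groups[0], groups[25])
```

For the UltraScale+ family the same identity holds with m = 93, q = 60, p = 1.5, and l = 3.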
TABLE IV
PARAMETERS FOR TRANSFORMING THE BITSTREAMS INTO IMAGES, TAKING XILINX ZYNQ-7000 SOC Z-7020, XILINX ZYNQ-7000 SOC Z-7030, AND XILINX ZYNQ ULTRASCALE+ MPSOC ZU9EG AS EXAMPLES.

Device name           | Xilinx Zynq-7000 SoC Z-7020 | Xilinx Zynq-7000 SoC Z-7030 | Xilinx Zynq UltraScale+ MPSoC ZU9EG
Image size of a slice | 6 × …                       | …                           | …
Image size (a × b)    | 150 × 57                    | 200 × 60                    | 420 × …

C. Bitstream Representation
According to the bitstream format analysis results in Sec. IV-B, the bitstreams are transformed into images by the proposed image-coded representation of bitstream. Taking Xilinx Zynq-7000 SoC Z-7020, Xilinx Zynq-7000 SoC Z-7030, and Xilinx Zynq UltraScale+ MPSoC ZU9EG as examples, the parameters for transforming the bitstreams into images are listed in TABLE IV. For instance, the bitstream length of the ZC702 FPGA is 3.86 MiB, and the size of its image-coded representation is 900 × … pixels.

Fig. 7. Vivado implemented designs and the image-coded representation. (a) Vivado implemented design and (b) the image-coded representation of bitstream of Xilinx Zynq-7000 SoC Z-7020. (c) Vivado implemented design and (d) the image-coded representation of bitstream of Xilinx Zynq-7000 SoC Z-7030. (e) Vivado implemented design and (f) the image-coded representation of bitstream of Xilinx Zynq UltraScale+ MPSoC ZU9EG.

The image-coded representation of bitstream can reflect the adjacency of the programmable logic. In summary, the two challenges mentioned in Sec. III-C have been overcome by the proposed image-coded representation of bitstream.

D. Dataset Description
For the purpose of training and testing the DNN models, a large number of bitstreams, which implement FPGA designs on the ZC702 FPGA, are generated to make up the dataset. There are 15 kinds of application-specific encryption algorithms chosen for generating 15,104 bitstream files, and these encryption algorithms contain 10 kinds of cryptographic operators. The application-specific encryption algorithms used for generating the dataset and the cryptographic operators they contain are listed in TABLE V. Each encryption algorithm used in this work contains up to 3 kinds of cryptographic operators. Each kind of cryptographic operator is implemented in one or two constructions: pipeline means the cryptographic operator is implemented in a pipelined design, and module means the cryptographic operator is implemented without special designs. For example, a bitstream implementing the encryption algorithm used for NTLM (NT LAN Manager) contains an MD4 pipeline or an MD4 module. A bitstream implementing the encryption algorithm used for PDF-R2 contains an MD5 (Message Digest Algorithm 5) pipeline and an RC4 (Rivest Cipher 4) module.

In order to arrange the experiments reasonably, 13 kinds of encryption algorithms in TABLE V are chosen to make up the training set and the test set, comprising 10,047 bitstreams generated by Xilinx Vivado 2017.4 in total. The bitstreams for training and testing are divided into the training set and the test set randomly in a 4:1 ratio. Two kinds of encryption algorithms, used for PDF-R5 and OFFICE 2010 and implemented with Xilinx Vivado 2017.4, are used for testing only. Besides, to explore the effect of EDA tools, the bitstreams generated by Xilinx Vivado 2016.3 and Xilinx Vivado 2017.2 are used to test the performance of the trained model. These bitstreams are transformed into images to make up the dataset for training and testing in Sec. IV-E.
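The random 4:1 partition described above can be sketched as follows; the helper name and the fixed seed are ours, not the paper's.

```python
import random

def split_dataset(samples, ratio=4, seed=0):
    """Randomly divide samples into a training set and a test set, ratio:1."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = len(shuffled) * ratio // (ratio + 1)
    return shuffled[:cut], shuffled[cut:]

# Partition the 10,047 training-and-testing bitstream indices 4:1.
train, test = split_dataset(range(10047))
print(len(train), len(test))
```

Fixing the seed keeps the split reproducible across reruns of the dataset-generation scripts.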
E. Function Block Detection
The experimental results of function block detection are evaluated on the ZC702 FPGA. Firstly, the evaluation results on the test set with the same distribution as the training set are shown, as well as the evaluation results on the encryption algorithms not appearing in the training set. Then, the effects of the EDA tools, the input size of the DNN, and the data arrangement of the image-coded representation are discussed. Finally, another deep learning-based object detection algorithm, SSD, is applied and its performance is presented.
1) Evaluation results on the test set:
The first 13 kinds of encryption algorithms listed in TABLE V are chosen to make up the training set and the test set. Since the images transformed from the bitstreams of the 13 kinds of encryption algorithms are divided into the training set and the test set randomly in a 4:1 ratio, the test set has the same distribution as the training set.

TABLE V
COMPONENTS OF THE TRAINING SET AND TEST SET.
(Columns: Vivado version; applications of encryption algorithms; number of cryptographic operators contained; cryptographic operators; implementation constructions; number of bitstreams. The rows are grouped into bitstreams used for training and testing and bitstreams used for testing only.)

In this experiment, we train our detector on the training set for around 16 hours and test it on the test set with the same distribution as the training set. The function block detection result of a bitstream file, which implements the encryption algorithm used for PDF-R2 on the ZC702 FPGA, is shown in Fig. 8 as an example. The function blocks included in this image are marked with boxes, and each box is labeled with its category and a classification probability.

TABLE VI shows the evaluation results under the metrics mentioned in Sec. IV-A, which evaluate the performance of the detector quantitatively on the test set. The AP (Average Precision) @0.5 values of all 10 kinds of cryptographic operators are beyond 85.33%, and [email protected] reaches 97.72%. Even under the stricter metric, [email protected] reaches 95.87%. It is evident that the detector has good detection performance on the test set with the same distribution as the training set.

The image-coded representation of bitstream can reflect the resource utilization of function blocks and the adjacency of the CLBs used, and different kinds of function blocks differ in these two aspects. The SHA-256 module takes over nearly all of the LUT resources of the ZC702 FPGA, whereas the RC4 module occupies no more than 1% of the FF resources and approximately 3% of the LUT resources. When implemented in the same construction, two function blocks of the same kind are similar but appear at different locations in different images. Since the image-coded representation of bitstream keeps characteristics of function blocks that can distinguish one kind of function block from another, the proposed image-coded representation of bitstream is proved effective and applicable for generating a dataset for deep learning. The performance of the detector also demonstrates that YOLOv3 successfully learns how to detect the function blocks from the images.

Some kinds of cryptographic operators, such as SHA-1 and SHA-256, are well detected. Because these operators occupy a large area of the images, it is not difficult to detect a box with an IoU over 0.5 or 0.75 with the ground truth.

Some kinds of cryptographic operators, such as AES and Serpent, have relatively low AP.
The reasons are as follows: 1) These kinds of cryptographic operators always occupy a small amount of FPGA resources, and it is difficult to distinguish one function block with high resource utilization from a combination of several function blocks with low resource utilization. 2) There is more than one function block in an image; it is more difficult to detect a function block in a system with various function blocks than in a system with a single function block, and the boundary between two function blocks in an image is hard to distinguish.
2) Effectiveness of detecting cryptographic operators:
In this experiment, the trained model is tested on the bitstream files implementing the encryption algorithms that do not appear in the training set, to confirm the capability to detect cryptographic operators. The training process is the same as mentioned in Sec. IV-E1. We choose 1,339 bitstreams for testing in this experiment, which implement two kinds of encryption algorithms used for PDF-R5 and OFFICE 2010 and are generated by Vivado 2017.4. The function blocks in these bitstreams have the same implementation constructions as the ones in the training set.

TABLE VI
EVALUATION RESULTS ON THE TEST SET WITH THE SAME DISTRIBUTION AS THE TRAINING SET.
(Columns: the function blocks AES, DES, MD4, MD5, RC4, Serpent, SHA-1, SHA-256, SHA-512, and Twofish, together with the mAP; rows: [email protected] (%) and [email protected] (%).)

TABLE VII
EVALUATION RESULTS ON THE ENCRYPTION ALGORITHMS NOT APPEARING IN THE TRAINING SET.

Function blocks                | [email protected] (%) | [email protected] (%)
PDF-R5      (1) MD5 pipeline   | 100.00     | 99.60
            (2) RC4 module     | 97.13      | 94.98
OFFICE 2010 (3) SHA-1 pipeline | 100.00     | 100.00

TABLE VII shows the results of this experiment. The evaluation results on the encryption algorithms used for PDF-R5 and OFFICE 2010 show that the detector also performs well on the encryption algorithms not appearing in the training set, because the cryptographic operators in these encryption algorithms have the same implementation constructions as the ones in the training set. TABLE VI and TABLE VII demonstrate that the detector has the capability to detect the function blocks with the same constructions as in the training set, no matter whether the encryption algorithms appear in the training set or not.

Although the RC4 module has appeared in the training set, the small occupied area of the RC4 module accounts for its relatively low AP. On the contrary, the SHA-1 pipeline occupies a large area, and the detector has the capability to distinguish SHA-1 from other cryptographic operators that also have a large area, such as SHA-256. Therefore, the SHA-1 pipeline has a really high AP.
3) Effect of EDA tools:
In this experiment, the model is trained on the bitstreams generated by Vivado 2017.4 and tested on the bitstreams generated by the other versions of Xilinx Vivado, namely Vivado 2016.3 and Vivado 2017.2. The training process is the same as mentioned in Sec. IV-E1. The bitstreams for testing, implementing the encryption algorithms used for PDF-R2 and PDF-R5, contain the function blocks with the same implementation constructions as the ones in the training set. The encryption algorithm used for PDF-R2 is included in the training set; however, the encryption algorithm used for PDF-R5 is not. This experiment is set up to explore the effect of the EDA tools: although the EDA tools are provided by the FPGA vendors, the corresponding EDA tools are updated continually.

The evaluation results on the bitstreams generated by Vivado 2016.3 and Vivado 2017.2 are shown in Fig. 9.

Fig. 9. Evaluation results on the bitstreams generated by different EDA tools. (a) The [email protected] and (b) the [email protected] of MD5 and RC4 evaluated on the bitstreams generated by different EDA tools.
4) Effect of input size:
In this experiment, five models are trained with different input sizes of YOLOv3, to explore the effect of the input size of YOLOv3 on the detection accuracy. The images are resized to the input size of YOLOv3 before being fed into the DNN. Since YOLOv3 is a fully convolutional network without any fully connected layers, changing the input size does not change the number of weight parameters in any layer. The hyperparameters of the models with different input sizes are the same as those of the model trained in Sec. IV-E1, whose input size is 416 × 416 × 3. The training set and test set are the same as set up in Sec. IV-E1 and Sec. IV-E2. The bitstreams in this experiment are all generated by Vivado 2017.4.

Fig. 10. Evaluation results of five models with different input sizes on (a) the test set and (b) the encryption algorithm used for PDF-R5.

Except for the model with the smallest input size, the other four models have similar performance. There is no need to choose too large an input size for the DNN, because a large input size leads to heavy computation.
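The claim above that resizing the input leaves the weight count unchanged follows from how convolutional parameters are counted; a quick generic check (the layer sizes are illustrative, not the paper's exact configuration):

```python
def conv_layer_params(kernel, c_in, c_out):
    # kernel*kernel*c_in weights per output channel, plus one bias each;
    # nothing here depends on the spatial size of the input feature map.
    return kernel * kernel * c_in * c_out + c_out

# The same 3x3 layer has the same parameter count whether the fully
# convolutional detector runs at 416 x 416 x 3 or any other input size:
print(conv_layer_params(3, 32, 64))  # 18496
```

Only the spatial size of the feature maps, and hence the computation, grows with the input size.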
5) Effect of the order in which the configuration bits of a slice correspond to the image:
In this experiment, three models are trained on different image datasets. The three image datasets are transformed from the same bitstreams by different representation methods; the only difference among the three representation methods is the order in which the configuration bits of a slice correspond to the image. The first order is (channel, height, width), as is shown in Fig. 6. The other two orders are (height, width, channel) and (channel, width, height), respectively. The training processes are the same as mentioned in Sec. IV-E1. The bitstreams in this experiment are all generated by Vivado 2017.4. The purpose of this experiment is to explore the effect of the order in which the configuration bits of a slice correspond to the image.

The mAP of the 10 kinds of function blocks on the test set with the same distribution as the training set is listed in TABLE VIII. The three models are also evaluated on the encryption algorithm used for PDF-R5, and the evaluation results are listed in TABLE IX. It is demonstrated that the order in which the configuration bits of a slice correspond to
TABLE VIII
EVALUATION RESULTS OF THREE MODELS WITH DIFFERENT ORDERS, IN WHICH THE CONFIGURATION BITS OF A SLICE CORRESPOND TO THE IMAGE, ON THE TEST SET.

The order                | [email protected] (%) | [email protected] (%)
(channel, height, width) | 97.72      | 95.87
(height, width, channel) | 97.46      | 95.28
(channel, width, height) | 97.64      | 95.71

TABLE IX
EVALUATION RESULTS OF THREE MODELS WITH DIFFERENT ORDERS, IN WHICH THE CONFIGURATION BITS OF A SLICE CORRESPOND TO THE IMAGE, ON THE ENCRYPTION ALGORITHM USED FOR PDF-R5.

The order                | Function blocks | [email protected] (%) | [email protected] (%)
(channel, height, width) | MD5             | 100.00     | 99.51
                         | RC4             | 97.13      | 94.99
(height, width, channel) | MD5             | 100.00     | 99.66
                         | RC4             | 96.90      | 95.39
(channel, width, height) | MD5             | 100.00     | 99.52
                         | RC4             | 96.44      | 94.38

TABLE X
EVALUATION RESULTS OF YOLOV3 AND SSD ON THE TEST SET WITH THE SAME DISTRIBUTION AS THE TRAINING SET.

Deep neural network (input size) | [email protected] (%) | [email protected] (%)
YOLOv3 (416 × 416 × 3)           | 97.72      | 95.87
SSD (300 × 300 × 3)              | 96.34      | 81.54

the image has almost no effect on the detection accuracy. It is inferred that as long as the representation of bitstream can reflect whether each slice is used or not, the detection accuracy will not be affected.
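The three data arrangements compared above amount to axis permutations of the same per-slice bit block; a minimal sketch with an illustrative (3, 4, 4) block per slice (the block shape is our assumption, not the paper's exact per-slice image size):

```python
import numpy as np

bits = np.arange(3 * 4 * 4).reshape(3, 4, 4)   # (channel, height, width)

# The three arrangements compared in this experiment are axis permutations
# of the same block of configuration bits:
chw = bits                                     # (channel, height, width)
hwc = bits.transpose(1, 2, 0)                  # (height, width, channel)
cwh = bits.transpose(0, 2, 1)                  # (channel, width, height)

print(hwc.shape, cwh.shape)  # (4, 4, 3) (3, 4, 4)
```

Since every permutation preserves which bits are set, each arrangement still shows whether a slice is used, which is consistent with the near-identical accuracies in TABLE VIII and TABLE IX.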
6) Application of other deep learning-based object detection algorithms to function block detection:
In this experiment, SSD is applied to bitstream function block detection to demonstrate the generality of the methodology: it is not necessary to choose YOLOv3 as the deep learning-based object detection algorithm. The training set and test set are the same as used in Sec. IV-E1. The bitstreams are generated by Vivado 2017.4.

The evaluation results of SSD on the test set are listed in TABLE X, together with the evaluation results of YOLOv3 from Sec. IV-E1. The evaluation results show that SSD also has the capability to detect function blocks. The methodology thus has generality to some degree, and other deep learning-based object detection algorithms with high performance can be applied. Compared with YOLOv3, however, the performance of SSD is evidently lower: [email protected] of SSD is slightly lower than that of YOLOv3, but [email protected] of SSD is much lower. It is shown that the localization accuracy of SSD is poor in this scenario.
F. Processing Time
Taking Xilinx Zynq-7000 SoC Z-7020, Xilinx Zynq-7000 SoC Z-7030, and Xilinx Zynq UltraScale+ MPSoC ZU9EG as examples, the processing time of transforming a bitstream into an image is reported in Fig. 11, together with the bitstream length. The processing time is measured on a single Intel Xeon Gold 5118 CPU@2.30GHz. It is evident that the processing time of transforming a bitstream into an image is almost proportional to the bitstream length, which is determined by the FPGA device.

Fig. 11. The processing time of transforming a bitstream into an image on a single Intel Xeon Gold 5118 CPU@2.30GHz.

Fig. 12. The processing time per image of the YOLOv3 inference process on an NVIDIA Tesla P100 GPU.

As is shown in Fig. 12, the processing time per image of the YOLOv3 inference process varies with the input size of the DNN. Before being fed into the DNN, the images are resized to the input size of the DNN; therefore, the processing time of YOLOv3 inference is unrelated to the size of the image-coded representation of bitstream. The confidence threshold is set to 0.5 and the IoU threshold is set to 0.45 for inference. The processing time is measured on an NVIDIA Tesla P100 GPU. It is shown that the processing time of YOLOv3 inference increases as the input size of YOLOv3 increases. Besides, the processing time of the SSD inference process is 0.0803 s per image, also measured on an NVIDIA Tesla P100 GPU, with the same thresholds as YOLOv3. The input size of SSD is 300 × 300 × 3.

G. Recap
Based on the above experimental results and analysis, the effectiveness of the proposed bitstream function block detection methodology is proved. The following points summarize the insights from the experimental results:
1) Similar bitstream format rules can be found in several FPGA devices by bitstream format analysis, which is the basis of the image-coded representation of bitstream. The image-coded representation can reflect the adjacency of the programmable logic and has a suitable size without losing useful information.
2) The deep learning-based object detection algorithm has the capability to detect the function blocks with the same constructions as in the training set, no matter whether the system designs appear in the training set or not.
3) The model trained on the bitstreams generated by one version of Xilinx Vivado can also detect function blocks from the bitstreams generated by other versions of Xilinx Vivado.
4) A model with too small an input size has bad detection accuracy, while a model with too large an input size brings no accuracy improvement at extra computation cost.
5) The order in which the configuration bits of a slice correspond to the image has almost no effect on the detection accuracy, as long as the image-coded representation of a slice can reflect whether the slice is used or not.
6) In the methodology, other deep learning-based object detection algorithms with high performance can also be chosen to detect function blocks from the bitstream.

V. RELATED WORKS
In this section, some related works on bitstream format analysis, bitstream reverse engineering, and deep learning-based circuit classification are presented.

An FPGA bitstream contains the programming information for an FPGA device, which configures the programmable logic in the FPGA. Because of the lack of disclosed information about the bitstream format from FPGA vendors, many works have analyzed the format of the bitstream [1], [33], [34], [35]. Ziener et al. [1] extracted the content of LUTs in the bitstream of Xilinx Virtex-II and Virtex-II Pro FPGAs to identify IP cores in the FPGAs. Le Roux et al. [33] analyzed the bitstream of Xilinx Virtex-5 FPGAs to manipulate the configuration bits of LUTs for the purpose of reconfiguring the FPGAs in real time. There are also some related works analyzing the bitstream format of the later Xilinx 7-series FPGAs [34], [35]. Dang Pham et al. [34] provided a tool called BITMAN that supports bitstream manipulations, such as module placement, module relocation, and so on. COMET [35] is a tool supporting bitstream analysis, visualization, and manipulation. The manipulation of bitstream provides means to perform partial reconfiguration or fault injection.

There are many works implementing FPGA bitstream reverse engineering based on bitstream analysis. Some of them reverse the bitstream to a Xilinx Design Language (XDL) level representation of the netlist or a Native Circuit Description (NCD) file [11], [12], [13]. Some of them further reverse the netlist file to Register Transfer Level (RTL) code [14]. These works analyze the bitstream format and gather databases containing the mapping relationship from the configuration bits in the bitstream to their related configurable elements, and reverse engineering is implemented through these databases. Bitstream reverse engineering aims at performing analysis or detecting hardware Trojans based on the netlist or RTL code.

Dai et al. [3] put forward a CNN method for arithmetic operator classification and detection from gate-level circuits rather than the bitstream level, and discussed the importance of the representation of circuits for CNN processing. Fayyazi et al. [4] also presented a CNN-based gate-level circuit recognition method and used a vector-based representation for the CNN processing. The method can be used to detect hardware Trojans or classify the circuits of different arithmetic operators, such as an adder and a multiplier. Mahmood et al. [5] proposed judging whether a partial bitstream contains a hardware module implementing an add operation using neural networks; their work lacks the analysis and processing of the raw bitstream data. Neto et al. [36] also took advantage of the latest progress of DNNs in image classification by proposing a binary image-like circuit representation of Boolean logic functions. However, the work in [36] just utilized the DNNs to choose the best optimization method for the partitioned circuits, and the binary image-like representation of Boolean logic functions in [36] is different from the three-channel color image-coded representation in this paper, which is proposed to represent the configuration bits in the bitstream.

VI. CONCLUSIONS
In this paper, we have proposed an FPGA bitstream function block detection method built upon deep learning techniques. At first, we analyze the bitstream format and find the mapping relationship between the configuration bits and the CLB elements. Then, we propose a three-channel color image-coded representation of bitstream, which reflects the adjacency of the programmable logic and transforms the bitstreams into images suitable for deep learning processing. A dataset of 15,104 images transformed from bitstreams is generated without manual annotation. A deep learning-based object detection algorithm is applied to detecting function blocks from FPGA bitstreams by training on this dataset. The processes of dataset generation and deep learning training and testing can be implemented by scripts automatically. Experimental results show that the mAP (IoU=0.5) for 10 kinds of function blocks reaches 97.72% when using YOLOv3 as the object detector. The detector is also demonstrated to have the capability to detect the function blocks from bitstreams implementing system designs not appearing in the training set, or from bitstreams generated by other EDA tools.

REFERENCES

[1] D. Ziener, S. Assmus, and J. Teich, "Identifying FPGA IP-cores based on lookup table content analysis," in International Conference on Field Programmable Logic and Applications (FPL), Aug. 2006, pp. 1-6.
[2] J. Couch, E. Reilly, M. Schuyler, and B. Barrett, "Functional block identification in circuit design recovery," in IEEE International Symposium on Hardware Oriented Security and Trust (HOST), May 2016, pp. 75-78.
[3] Y.-Y. Dai and R. K. Brayton, "Circuit recognition with deep learning," in IEEE International Symposium on Hardware Oriented Security and Trust (HOST), May 2017, pp. 162-162.
[4] A. Fayyazi, S. Shababi, P. Nuzzo, S. Nazarian, and M. Pedram, "Deep learning-based circuit recognition using sparse mapping and level-dependent decaying sum circuit representations," in Design, Automation & Test in Europe Conference & Exhibition (DATE), Mar. 2019, pp. 638-641.
[5] S. Mahmood, J. Rettkowski, A. Shallufa, M. Hübner, and D. Göhringer, "IP core identification in FPGA configuration files using machine learning techniques," in IEEE International Conference on Consumer Electronics (ICCE-Berlin), Sep. 2019, pp. 103-108.
[6] P. Liu, S. Li, and Q. Ding, "An energy-efficient accelerator based on hybrid CPU-FPGA devices for password recovery," IEEE Trans. Comput., vol. 68, no. 2, pp. 170-181, Feb. 2019.
[7] S. E. Quadir, J. Chen, D. Forte, N. Asadizanjani, S. Shahbazmohamadi, L. Wang, J. A. Chandy, and M. M. Tehranipoor, "A survey on chip to system reverse engineering," ACM J. Emerg. Technol. Comput. Syst., vol. 13, no. 1, pp. 1-34, Apr. 2016.
[8] A. Moradi, A. Barenghi, T. Kasper, and C. Paar, "On the vulnerability of FPGA bitstream encryption against power analysis attacks: Extracting keys from Xilinx Virtex-II FPGAs," in ACM Conference on Computer and Communications Security (CCS), Oct. 2011, pp. 111-124.
[9] S. Tajik, H. Lohrke, J. Seifert, and C. Boit, "On the power of optical contactless probing: Attacking bitstream encryption of FPGAs," in ACM Conference on Computer and Communications Security (CCS), Nov. 2017, pp. 1661-1674.
[10] M. Ender, A. Moradi, and C. Paar, "The unpatchable silicon: A full break of the bitstream encryption of Xilinx 7-Series FPGAs," in USENIX Security Symposium, Aug. 2020.
[11] J. Note and É. Rannaud, "From the bitstream to the netlist," in International Symposium on Field Programmable Gate Arrays (FPGA), Feb. 2008, pp. 264-264.
[12] F. Benz, A. Seffrin, and S. A. Huss, "Bil: A tool-chain for bitstream reverse-engineering," in International Conference on Field Programmable Logic and Applications (FPL), Aug. 2012, pp. 735-738.
[13] Z. Ding, Q. Wu, Y. Zhang, and L. Zhu, "Deriving an NCD file from an FPGA bitstream: Methodology, architecture and evaluation," Microprocessors and Microsystems, vol. 37, no. 3, pp. 299-312, May 2013.
[14] T. Zhang, J. Wang, S. Guo, and Z. Chen, "A comprehensive FPGA reverse engineering tool-chain: From bitstream to RTL code," IEEE Access, vol. 7, pp. 38379-38389, 2019.
[15] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.
[16] R. B. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2014, pp. 580-587.
[17] R. B. Girshick, "Fast R-CNN," in IEEE International Conference on Computer Vision (ICCV), Dec. 2015, pp. 1440-1448.
[18] S. Ren, K. He, R. B. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Conference on Neural Information Processing Systems (NIPS), Dec. 2015, pp. 91-99.
[19] J. Redmon, S. K. Divvala, R. B. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2016, pp. 779-788.
[20] J. Redmon and A. Farhadi, "YOLO9000: Better, faster, stronger," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jul. 2017, pp. 6517-6525.
[21] J. Redmon and A. Farhadi, "YOLOv3: An incremental improvement," arXiv preprint arXiv:1804.02767, 2018.
[22] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, C. Fu, and A. C. Berg, "SSD: Single shot multibox detector," in European Conference on Computer Vision (ECCV)
… in European Conference on Computer Vision (ECCV), Sep. 2014, pp. 740-755.
[30] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in International Conference on Learning Representations (ICLR), May 2015.
[31] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in International Conference on Learning Representations (ICLR), May 2015.
[32] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The Pascal Visual Object Classes (VOC) challenge," International Journal of Computer Vision, vol. 88, no. 2, pp. 303-338, Jun. 2010.
[33] R. le Roux, G. van Schoor, and P. van Vuuren, "Parsing and analysis of a Xilinx FPGA bitstream for generating new hardware by direct bit manipulation in real-time," South African Computer Journal, vol. 31, pp. 80-102, Jul. 2019.
[34] K. Dang Pham, E. L. Horta, and D. Koch, "BITMAN: A tool and API for FPGA bitstream manipulations," in Design, Automation & Test in Europe Conference & Exhibition (DATE), Mar. 2017, pp. 894-897.
[35] L. Bozzoli and L. Sterpone, "COMET: A configuration memory tool to analyze, visualize and manipulate FPGAs bitstream," in International Conference on Architecture of Computing Systems (ARCS) Workshop, Apr. 2018, pp. 1-4.
[36] W. L. Neto, M. Austin, S. Temple, L. G. Amarù, X. Tang, and P. Gaillardon, "LSOracle: A logic synthesis framework driven by artificial intelligence: Invited paper," in