FPCNet: Fast Pavement Crack Detection Network Based on Encoder-Decoder Architecture

Abstract

Timely, accurate and automatic detection of pavement cracks is necessary for making cost-effective decisions concerning road maintenance. Conventional crack detection algorithms focus on the design of single or multiple crack features and classifiers. However, complicated topological structures, varying degrees of damage and oil stains make the design of crack features difficult. In addition, the contextual information around a crack is not investigated extensively in the design process. Accordingly, these design features have limited discriminative adaptability and cannot fuse effectively with the classifiers. To solve these problems, this paper proposes a deep learning network for pavement crack detection. Using the Encoder-Decoder structure, crack characteristics with multiple contexts are automatically learned, and end-to-end crack detection is achieved. Specifically, we first propose the Multi-Dilation (MD) module, which can synthesize the crack features of multiple context sizes via dilated convolution with multiple rates. The crack MD features obtained in this module can describe cracks of different widths and topologies. Next, we propose the SE-Upsampling (SEU) module, which uses the Squeeze-and-Excitation learning operation to optimize the MD features. Finally, the above two modules are integrated to develop the fast crack detection network, namely, FPCNet. This network continuously optimizes the MD features step-by-step to realize fast pixel-level crack detection. Experiments are conducted on challenging public CFD datasets and G45 crack datasets involving various crack types under different shooting conditions. The distinct performance and speed improvements over all the datasets demonstrate that the proposed method outperforms other state-of-the-art crack detection methods.


IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, UNDER REVIEW.

FPCNet: Fast Pavement Crack Detection Network Based on Encoder-Decoder Architecture

Wenjun Liu, Yuchun Huang, Ying Li, and Qi Chen


Index Terms: Pavement crack detection, convolutional neural network, deep learning, semantic segmentation, Encoder-Decoder.

I. INTRODUCTION

Pavement cracks, which are one of the most representative defects of roads, are mainly caused by overloading, temperature changes and road surface aging. This damage can degrade the performance of road surfaces, shorten the service life of roads, and endanger the driving safety of vehicles. Fast and accurate pavement crack detection facilitates timely maintenance of roads and prevents the road conditions from deteriorating further. With the rapid advancement in sensors and information technology (IT), millions of road images have been collected by many transportation agencies for crack detection.

W. Liu, Y. Huang and Q. Chen are with the School of Remote Sensing and Information Engineering, Wuhan University, Wuhan 430079, China (e-mail: [email protected]; [email protected]; [email protected]).
Y. Li is with the Mobile Sensing and Geodata Science Laboratory, Department of Geography and Environmental Management, University of Waterloo, ON N2L 3G1, Canada (e-mail: [email protected]).

Various manually designed features, such as grayscale [1]–[4], edge [5]–[7], Gabor filters [8], [9], wavelet [10], [11], and histogram of oriented gradients (HOG) [12], are used to detect cracks from images. However, the performance of these methods is still limited, owing to the complex and diverse topology, arbitrary shapes and varying widths of cracks, and to the presence of oil spots, gravel, zebra crossings and other strong disturbances on roads. In addition, the poor contrast around the cracked pixels caused by undesired imaging conditions (such as overexposure or underexposure) also makes crack detection difficult. Therefore, in complex situations, manually designing one or multiple robust features is ineffective for extracting cracks from different road images.

In deep learning, Convolutional Neural Networks (CNN) can automatically learn the characteristics of target objects through alternating layers of convolution and pooling, and subsequently classify them. Human experience for feature and classifier design is not required in such networks, which provides new opportunities for end-to-end crack detection. The CNN-based crack detection algorithms proposed by some researchers have achieved relatively successful results by training automatic crack feature learners. Some of these algorithms use object detection methods [13], [14] or image block classification [15], [16] to detect cracks. These algorithms can locate cracks in a pavement image but fail to detect them pixel by pixel. Some algorithms [17], [18] first partition the crack image according to a certain size, and then predict whether single or multiple pixel(s) in the center of the block are cracks. Pixel-level prediction is achieved in these methods, but they are time consuming and are not end-to-end. Some studies [19], [20] applied a fully convolutional network (FCN) [21] to crack detection to solve the above problems with high precision and speed.
However, the FCN methods still have the following problems with respect to crack detection.
1) Pavement cracks have different widths and topologies, but the filters of the FCN methods use only one receptive field size to extract the crack features within only one context, thus limiting their robustness for crack detection.
2) The edge, pattern or shape features of cracks contribute differently to the detection results. However, FCN methods treat these features equally with addition [21] or concatenation [22] operations performed during the lateral connection of different features.

Inspired by the Encoder-Decoder structure, we propose a new crack detection network called FPCNet.

First, the features of cracks of one context size are extracted through a series of convolutions and pooling layers in the Encoder.

A Multi-Dilation (MD) module is then used to obtain the crack features of multiple context sizes. The dilated convolution [23] can increase the context and learn deeper features without compromising the edge resolution. It is employed in the MD module with multiple rates to extract the crack MD features of multiple context sizes. With the proposed MD module, the cracks of different widths and topologies can be robustly detected.

Next, an SE-Upsampling (SEU) module is developed to construct the Decoder. It restores the resolution of the crack MD features through transposed convolution. The features in the Encoder are added to the restored MD features of the same resolution, which combines the context of the cracks with edge details. After this addition, the SEU module assigns different weights to the MD features adaptively through the Squeeze-and-Excitation learning operation [24]. The crack information corresponding to the edge, pattern or shape embedded in the MD features is assigned different weights, based on its contribution to the detection results.

Finally, the MD and SEU modules are integrated in the Encoder-Decoder structure to develop the fast crack detection network FPCNet. The network uniquely characterizes the crack context by using the MD module, and continuously optimizes the contextual features by using the SEU modules to obtain the finest prediction. Pixel-by-pixel crack detection is realized using the proposed FPCNet.

We conducted several experiments on public CFD datasets and G45 crack datasets involving multiple crack types under different shooting conditions. The experimental results show that FPCNet can detect multiple types of cracks and attain state-of-the-art precision on CFD datasets, with a high speed of 14.7 FPS.

The rest of the paper is organized as follows: Section II provides a brief review of crack detection methods; Section III describes in detail our Multi-Dilation module, SE-Upsampling module and FPCNet; Section IV describes the experiments performed and the corresponding results and analysis; finally, Section V summarizes the main work presented in this article.

II. RELATED WORK

Existing visual-based crack detection methods can be roughly classified into three categories: traditional, machine learning-based and deep learning-based methods. In this section, we briefly describe the application of these methods for crack detection.

A. Traditional Method

Early studies such as [1]–[4] observed that the cracks in a road image are darker than the background; thus, different thresholding methods were used to extract the cracks. However, these methods experience difficulty in selecting the appropriate threshold. In addition, they are sensitive to imaging conditions and noise, which ultimately results in poor performance. Edge detection methods [5]–[7] achieved distinct improvements in images with a large contrast between the crack edge and the background. However, these methods demonstrate limited performance in road images with low contrast or noise. Manually designed feature descriptors such as Gabor filters [8], [9], wavelet transform [10], [11], and histogram of oriented gradients (HOG) [12] exhibit significant advancements in detecting simple cracks, but they are not suitable for complex and diverse cracks. In addition, the parameter selection is commonly time-consuming and laborious.

B. Machine Learning

With the advancement of machine learning, the following methods have been successfully applied in crack detection: [25] considered the road surface as a textured surface to design features, and then applied the support vector machine (SVM) for classification; [26] utilized numerous linear and nonlinear filters to extract texture features that could then be filtered by AdaBoost; [27] selected the random forest method to classify multiple spatially adjusted visual features. However, these detection methods are restricted to detecting learned cracks and find it difficult to detect new cracks. CrackForest [28] solved this problem by using random structured forests classifiers, which can identify arbitrarily complex cracks. However, these methods are limited in terms of the quality and quantity of the manually designed features. Moreover, it is difficult to design universal features that can be applied to all types of cracks.

C. Deep Learning

Recently, deep learning has made great progress in the field of computer vision. The precision achieved by CNN-based networks has greatly exceeded the precision attained by traditional image classification methods [29]–[32], and even that possible at the human level [32]. In recent research, deep learning-based methods have been successfully applied to road crack detection. [14], [15] applied deep learning-based object detection methods to detect the location of cracks in road images; [16], [17] utilized road grids or sliding windows to divide the road images into smaller image blocks before using a CNN to determine whether an image block contains a crack. Although the abovementioned methods can accurately locate the crack, they cannot detect cracks pixel by pixel. [18], [19] selected image blocks in the road image and used a CNN to determine whether the central pixel or the pixels of the image block belong to the crack, which not only achieved pixel-by-pixel detection but also attained high precision. However, the small blocks fail to provide enough context information for prediction. Moreover, the time consumption is relatively high for block-based detection. [20], [21] used the FCN network for crack detection and achieved high precision and speed. However, this method does not consider the fact that cracks with different widths and topologies require different context sizes. Moreover, it ignores the fact that different crack features contribute differently to crack detection, and treats all crack features in the same manner.

Fig. 1. Convolution kernels with different dilation rates: (a) r = 1, (b) r = 2, (c) r = 3, (d) r = 4.

III. METHODOLOGY

In this section, we first introduce the proposed Multi-Dilation module and SE-Upsampling module. Next, the network structure for crack detection, i.e., FPCNet, is described.

A. Multi-Dilation (MD)

To extract crack features, the Encoder operation is performed first. This process includes four groups of two typical 3 × 3 convolutions. Each convolutional group produces its crack multiple-convolution (MC) features, which are then downsampled by a max pooling layer to capture the context. However, the convolutional filters in the Encoder use only one receptive field size to extract the crack features within one context. As a result, the MC features, which are extracted by the Encoder, cannot robustly detect cracks with different widths and topologies.

Thus, the Multi-Dilation (MD) module is developed, which is based on the MC features. The dilated convolution, which expands the context window size of the convolution without downsampling or convolving with larger filters of more parameters, is employed in the MD module. By combining multiple dilated convolutions [23] with different rates and a global pooling, the MD module extracts crack features with multiple context sizes and detects cracks with different widths and topologies.

The dilated convolution was first proposed by [23] to efficiently perform wavelet decomposition. In a 1D signal, the dilated convolution can be defined as follows:

$$y[i] = \sum_{k=1}^{K} x[i + r \cdot k]\, w[k] \tag{1}$$

where $x[i]$ is the input signal, $y[i]$ is the output signal, $w[k]$ represents the filter of length $K$, and parameter $r$ represents the interval at which the dilation is used to sample the input signal. In the default filtering operation, $r = 1$.

The dilated convolution is used in convolution operations [33], [34], which add “holes” with a value of 0 between the pixels of the convolution kernel. For a convolution kernel of size $k \times k$, the actual convolution kernel size is $k_{result} = k + (k - 1) \times (r - 1)$. For example, a 3 × 3 kernel with $r = 4$ spans 3 + 2 × 3 = 9 pixels per side. As shown in Fig. 1, (a) is the default convolution, and (b), (c) and (d) are the convolution kernels for $r = 2$, $3$, and $4$, respectively.

As shown in Fig. 1, the context of the dilated convolution kernel is larger than that of the standard convolution kernel when $r$ is greater than 1. Since “0” is not a parameter, the parameters and calculation amount of the convolution kernel are not actually increased. Thus, compared to the standard convolution with larger filters, dilated convolution enlarges the context of the convolution operation without increasing the amount of calculation involved.

However, in complex road images, the width along the same crack curve changes dramatically. In addition, the contexts required for detecting cracks of different topologies and severity levels are different, and dilated convolution with one rate can only get one context. For example, when $r = 1$, which is the standard convolution, the context size obtained is small. Such a convolution is suitable for thin and simple cracks, but it cannot effectively detect wide cracks or cracks with complex topologies. However, these cracks can be robustly detected by dilated convolutions with a larger value of $r$ (for example, 4). Thus, a dilated convolution with a single rate cannot obtain all the required contextual information for crack detection.

Based on this, we propose the Multi-Dilation module. As shown in Fig. 2, the input is the crack MC features extracted from the Encoder, and the output is the crack MD features. This module analyzes crack features with different context sizes and integrates them to obtain features with multiple contexts. First, the dilated convolutions with four rates {1, 2, 3, 4} are used to obtain the crack features of different context sizes. The global pooling layer is then added to obtain the global crack information contained in the MC features. To retain the crack information of the MC features, they are fed directly into the final output. Next, a concatenation method is employed to combine the abovementioned six features from pixelwise to global context. As a consequence, the number of feature channels is increased to six times the original number of channels, which increases the amount of calculation in subsequent operations. Therefore, a 1 × 1 convolution is finally applied to reduce the number of channels from 6 × 512 for the concatenated features to 1024, which also increases the communication among these channels. After performing this convolution, we obtain the output, that is, the crack MD features.

Note that the MD module integrates the features of multiple context sizes, including the pixelwise context, contexts of multiple dilated convolution rates and global context, which can help robustly describe cracks of various widths and topologies. The rates of the dilated convolutions in the MD module are set according to the statistics of the crack widths, and can be readily expanded, if necessary, for different cases.

Fig. 2. Multi-Dilation module. The module concatenates four dilated convolutions with rates of {1, 2, 3, 4}, a global pooling layer and the original crack MC features. After the concatenation, a 1 × 1 convolution is performed to obtain the crack MD features. Every convolution retains its number of feature channels except the last 1 × 1 convolution, and padding is used to ensure that the resolution of the MC feature remains constant.

B. SE-Upsampling (SEU)

Fig. 3. SE-Upsampling module. The MC features are first added to the MD features after transposed convolution. Next, global pooling is performed to obtain the global information of the C channels. After squeeze ($F_{sq}$) and excitation ($F_{ex}$) (two fully connected layers) of the global information, the weight of each feature for its channel is obtained. Finally, each feature in the added MD features is multiplied ($F_{scale}$) by its corresponding weight to obtain the optimized MD features. The green arrow indicates the transposed convolution. H, W, and C represent the length, width, and number of channels of the features, respectively.
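The weighting steps summarized in the Fig. 3 caption (addition, global pooling, squeeze, excitation, scaling) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names and the nested-list feature layout are hypothetical, the transposed convolution is assumed to have been applied already, and the fully connected weights are passed in as arguments rather than learned.

```python
import math


def global_average_pool(features):
    """Mean of each channel; `features` is a list of C channels, each an H x W nested list."""
    return [sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
            for channel in features]


def fully_connected(vector, weights, bias):
    """Plain dense layer: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def seu_weighting(md_upsampled, mc, w_sq, b_sq, w_ex, b_ex):
    """Add MC to the (already upsampled) MD features, then reweight each channel."""
    # Lateral connection by element-wise addition.
    added = [[[a + b for a, b in zip(row_md, row_mc)]
              for row_md, row_mc in zip(ch_md, ch_mc)]
             for ch_md, ch_mc in zip(md_upsampled, mc)]
    # Global information of each channel.
    pooled = global_average_pool(added)
    # Squeeze: fully connected layer followed by ReLU.
    squeezed = [max(0.0, z) for z in fully_connected(pooled, w_sq, b_sq)]
    # Excitation: fully connected layer followed by sigmoid gives the channel weights.
    channel_weights = [1.0 / (1.0 + math.exp(-z))
                       for z in fully_connected(squeezed, w_ex, b_ex)]
    # Scale: multiply every channel by its weight.
    scaled = [[[channel_weights[c] * v for v in row] for row in channel]
              for c, channel in enumerate(added)]
    return scaled, channel_weights


# Toy input with C = 2 channels of size 2 x 2 (illustrative values only).
md = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [0.0, 0.0]]]
mc = [[[1.0, 0.0], [1.0, 0.0]], [[2.0, 2.0], [2.0, 2.0]]]
scaled, weights = seu_weighting(md, mc,
                                w_sq=[[0.0, 0.0]], b_sq=[0.0],        # squeeze 2 -> 1
                                w_ex=[[0.0], [0.0]], b_ex=[0.0, 0.0])  # excite 1 -> 2
```

With all-zero weights every channel receives the neutral weight sigmoid(0) = 0.5; training would move the weights toward the channels that contribute more to detection.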

The resolution of the crack MD features decreases because they are based on MC features that have been subsampled many times. To achieve pixel-level detection, the resolution of the MD features needs to be restored to that of the original input pavement image. Hence, the Decoder operation is performed. With the Decoder's upsampling operation (such as transposed convolution or bilinear interpolation), the resolution of the MD features can be continuously restored. Because they have undergone fewer subsamplings than the MD features, the MC features in the Encoder retain more crack details, which are blurred in the MD features. To incorporate more crack details, the MD features can be combined with the MC features of the same resolution, i.e., lateral connection of the MD and MC features can be realized. However, if two different features are simply concatenated or added as proposed in [21], [22], the different contributions of the edge, pattern, texture and other information embedded in the features for crack detection are regarded as being identical.

To overcome this problem, we propose the SE-Upsampling (SEU) module, as shown in Fig. 3. The inputs are MD features and MC features, and the output is the optimized MD features after weighted fusion. The SEU module first restores the resolution of the crack MD features through transposed convolution. Next, it adds the MC features to the MD features in order to fuse the associated crack information concerning the edge, pattern, and texture, among others. Subsequently, the Squeeze-and-Excitation learning operation [24] is applied to the added MD features to learn the weights of the different features. After the learning, the SEU module can adaptively assign different weights to different crack features such as the edge, pattern, and texture.

Specifically, the MD features first undergo upsampling by transposed convolution, which doubles their resolution and reduces their number of channels to half of the original value. Next, the MC features with the same resolution are added to the MD features. Global average pooling is performed to obtain the global information of each channel from the added MD features. Subsequently, the global information is processed by a squeeze operation ($F_{sq}$). A fully connected layer is used to squeeze the number of channels by a fixed ratio, and the ReLU layer is used to nonlinearize the output. We carry out an excitation process ($F_{ex}$) on the output, which restores the squeezed output to its original number of channels by using a fully connected layer. The sigmoid layer is used to obtain the channel weights. A larger weight indicates that the feature in a channel has a larger contribution to crack detection. Finally, each MD feature is multiplied ($F_{scale}$) by its corresponding weight to obtain the optimized MD feature.

C. Network Architecture

Fig. 4. Network architecture of FPCNet. The method uses 4 Convs (two 3 × 3 convolutions and ReLUs) + max poolings as the Encoder to extract features. Next, the MD module is employed to obtain the information of multiple context sizes. Subsequently, 4 SEU modules are operated as the Decoder. H and W indicate the original sizes of the image. The red, green, and blue arrows indicate the max pooling, transposed convolution and 1 × 1 convolution + sigmoid, respectively. MCF denotes the multiple-convolution features extracted in the Encoder, and MDF denotes the MD features.
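The data flow of Fig. 4 can be followed with a small shape-bookkeeping sketch. It is an illustration under stated assumptions rather than the authors' code: the helper name `trace_fpcnet_shapes` is hypothetical, the Encoder channel widths {64, 128, 256, 512} and the 1024-channel MD output are taken from the text of this paper, each max pooling is assumed to halve both spatial dimensions, and each SEU module is assumed to double the resolution while halving the channels.

```python
def trace_fpcnet_shapes(height, width,
                        encoder_channels=(64, 128, 256, 512),
                        md_channels=1024):
    """Return (stage, channels, height, width) tuples for one forward pass."""
    factor = 2 ** len(encoder_channels)  # total downsampling of the Encoder
    if height % factor or width % factor:
        raise ValueError(f"height and width must be divisible by {factor}")

    trace = []
    h, w = height, width
    # Encoder: each Conv group yields MC features, then pooling halves the size.
    for level, channels in enumerate(encoder_channels, start=1):
        trace.append((f"MCF{level}", channels, h, w))
        h, w = h // 2, w // 2

    # MD module: four dilated convolutions + global pooling + the input itself
    # are concatenated (six features), then reduced by a 1 x 1 convolution.
    trace.append(("MD concat", 6 * encoder_channels[-1], h, w))
    trace.append(("MDF", md_channels, h, w))

    # Decoder: every SEU module doubles the resolution and halves the channels,
    # so its output can be added to the MC features of the same level.
    channels = md_channels
    for level in range(len(encoder_channels), 0, -1):
        channels, h, w = channels // 2, h * 2, w * 2
        trace.append((f"SEU{level}", channels, h, w))

    # Final 1 x 1 convolution + sigmoid: one probability per pixel.
    trace.append(("prediction", 1, h, w))
    return trace


# Example with a hypothetical 320 x 480 input.
for stage in trace_fpcnet_shapes(320, 480):
    print(stage)
```

The trace makes the design constraint visible: the output of each SEU stage has exactly the shape of the Encoder features at the same level, which is what allows the lateral connection to be an addition rather than a concatenation.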

FPCNet is developed by integrating the Multi-Dilation module and the SE-Upsampling module in the Encoder-Decoder structure. The network structure of FPCNet is shown in Fig. 4. This framework shares the common Encoder-Decoder structure of semantic segmentation networks: the upper row is the Encoder structure, and the bottom row is the Decoder structure. Each Conv in the Encoder structure consists of two typical 3 × 3 convolutional layers followed by a nonlinearization layer ReLU. Each 3 × 3 convolution is padded to maintain the original resolution. After the convolution, the 2 × 2 max pooling layer is used for downsampling. In total, there are four sets of convolutions + pooling operations. After each operation, the resolution of the MC features is reduced to half of their original resolution, and the number of channels is doubled, that is, the number of channels in each group is {64, 128, 256, 512}.

After the fourth max pooling layer at the end of the Encoder, the proposed Multi-Dilation module is employed to extract the contextual crack MD features of multiple sizes for robust crack detection.

The Decoder process is constructed by four successive SEU modules. The number of channels of the MD features is gradually reduced, specifically {512, 256, 128, 64}, and the resolution of the MD features is restored. By continuously operating the SEU modules, the MD features are also optimized step-by-step to obtain the finest prediction.

Finally, after the Decoder operation, the information in the feature vector of each pixel is integrated and predicted by a 1 × 1 convolution, and then the sigmoid nonlinearization layer is used to keep the prediction probability between 0 and 1. With the proposed FPCNet, fast pixel-level crack detection can be achieved owing to the following factors:

1) MD module: Since the crack MC features undergo four iterations of max pooling, the size of the input MC features in the MD module is small (after four downsamplings, each side is reduced to 1/16 of its original length). As a result, although five crack contextual features with different context sizes are calculated from the MC features to obtain the MD features, the calculation cost does not increase sharply.

2) SEU module: Instead of the concatenation operation described in [22], the SEU module uses the addition operation to combine two types of features, which reduces the subsequent amount of required calculation.

3) Encoder-Decoder structure: FPCNet uses an end-to-end structure to achieve pixelwise prediction, which is much faster than the block-based methods.

IV. EXPERIMENTS

This section first describes the evaluation of the proposed approach on two crack datasets: the public CFD dataset [28] and our own G45 crack dataset. Next, the selection of hyperparameters of the MD module is discussed.

We implemented our approach using PyTorch [35] as the deep learning framework for training and evaluation, on a PC running Windows 10 with an Intel(R) Core(TM) i7-6800K CPU @ 3.40 GHz, 16 GB of memory, and an NVIDIA GTX 1080 Ti GPU with 11 GB of memory. To evaluate the proposed approach, we compare it with FCN [21] and other state-of-the-art methods tested on the CFD dataset, including CrackForest [28], MFCD [36], Method [37], and Method [18]. At the same time, to verify the effectiveness and scalability of FPCNet, the network is applied and evaluated on our G45 dataset, which includes various types of cracks.

Evaluation: To evaluate the performance of the proposed network, the values of Precision, Recall and F1 score are introduced. These values are computed based on true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN) as follows:

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{2}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{3}$$

$$F1\ \mathrm{score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{4}$$

Because both Precision and Recall have their biases, this study focuses on the F1 score. Since exact pixel-level ground truth is extremely difficult to obtain for crack images, a tolerance margin is used in most crack detection algorithms for evaluating the performance of the algorithm. This margin counts detected pixels that are no more than 2 [18], [37] or 5 [28], [36] pixels away from the ground truth as TP. We use a tolerance margin of 2 pixels in this study.
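A minimal sketch of how Eqs. (2)-(4) might be evaluated with a tolerance margin is given below. Several details are assumptions, since the text does not fix them: the function name `tolerant_scores` is hypothetical, distance is taken to be Euclidean, a detected pixel counts as TP when some ground-truth pixel lies within the margin, and a ground-truth pixel counts as FN when no detected pixel lies within the margin. The brute-force search is meant for small illustrative masks only.

```python
def tolerant_scores(pred, truth, margin=2):
    """Precision, Recall and F1 for binary masks (nested lists of 0/1)."""
    pred_px = {(r, c) for r, row in enumerate(pred)
               for c, v in enumerate(row) if v}
    truth_px = {(r, c) for r, row in enumerate(truth)
                for c, v in enumerate(row) if v}

    def near(pixel, others):
        # True if any pixel in `others` lies within `margin` (Euclidean, assumed).
        return any((pixel[0] - q[0]) ** 2 + (pixel[1] - q[1]) ** 2 <= margin ** 2
                   for q in others)

    tp = sum(1 for p in pred_px if near(p, truth_px))      # detections close to a crack
    fp = len(pred_px) - tp                                  # detections far from any crack
    fn = sum(1 for q in truth_px if not near(q, pred_px))   # crack pixels left undetected

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Toy example: the prediction is shifted by one pixel and has one stray detection.
truth = [[1 if c == 2 else 0 for c in range(7)] for _ in range(5)]
pred = [[1 if c == 3 else 0 for c in range(7)] for _ in range(5)]
pred[0][6] = 1
precision, recall, f1 = tolerant_scores(pred, truth, margin=2)
```

In the toy example the one-pixel shift is forgiven by the 2-pixel margin, so only the stray detection lowers Precision; with `margin=0` the same prediction would score zero on all three measures.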

A. CFD Dataset

The CFD dataset was published in [28], and it is composed of 118 RGB images with a resolution of 480 × 320 pixels. All the images were taken using an iPhone 5 on pavements in Beijing, China, and generally reflect the urban pavement surface conditions there. These images have uneven illumination and contain noise such as shadows, oil spots and water stains, which makes crack detection quite difficult. We randomly assigned 60% (72 images) of the dataset to training and 40% (46 images) to testing, as described in [28].

1) Implementation details: An insufficient number of crack images can easily cause overfitting during training. Thus, data augmentation is performed, including clockwise rotation by two fixed angles, horizontal flip, and random color jittering. Random cropping to a fixed size is performed in training. Data augmentation is not used during testing.

We use the following binary cross entropy (BCE) + dice coefficient loss as the loss function during training:

$$L(Y^*, Y) = -\frac{1}{N} \sum_{P \in N} \left( Y^*_P \cdot \lg Y_P + (1 - Y^*_P) \cdot \lg(1 - Y_P) \right) + 1 - \frac{2 \times TP}{2 \times TP + FP + FN} \tag{5}$$

where $Y^*$ and $Y$ denote the target image and prediction image, respectively; $N$ is the set of all pixels in the image; and $Y_P$ and $Y^*_P$ denote the values at pixel $P$ in the prediction and the target images, respectively. We use the initialization method proposed in [38] as the weight initialization approach, and choose SGD with Momentum (0.9) [39] as our optimizer with a batch size of 1 and a weight decay of 0.0001. Training is started with a learning rate of 0.01; it is reduced by a factor of 10 at epochs 50, 80, and 110, and training is terminated at 120 epochs.
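The BCE + dice loss and the step learning-rate schedule described above can be sketched in a few lines of plain Python. The sketch rests on assumptions that the text does not settle: the names `bce_dice_loss` and `learning_rate` are hypothetical, the BCE term is taken in its usual negated form with the natural logarithm, TP, FP and FN are computed as "soft" sums over predicted probabilities so that the dice term stays differentiable, predictions are clipped by a small `eps` for numerical stability, and the learning rate is assumed to drop at the start of the listed (0-indexed) epochs.

```python
import math


def bce_dice_loss(target, pred, eps=1e-7):
    """BCE + dice loss for flat lists of targets (0/1) and predicted probabilities."""
    bce = 0.0
    tp = fp = fn = 0.0
    for t, p in zip(target, pred):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        bce -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
        tp += t * p                      # soft true positives (assumed)
        fp += (1.0 - t) * p              # soft false positives (assumed)
        fn += t * (1.0 - p)              # soft false negatives (assumed)
    bce /= len(target)
    dice = 1.0 - 2.0 * tp / (2.0 * tp + fp + fn + eps)
    return bce + dice


def learning_rate(epoch, base_lr=0.01, milestones=(50, 80, 110), factor=0.1):
    """Step schedule: multiply by `factor` at every milestone already reached."""
    return base_lr * factor ** sum(epoch >= m for m in milestones)


target = [1.0, 0.0, 1.0, 0.0]
loss_perfect = bce_dice_loss(target, [1.0, 0.0, 1.0, 0.0])
loss_uniform = bce_dice_loss(target, [0.5, 0.5, 0.5, 0.5])
loss_inverted = bce_dice_loss(target, [0.0, 1.0, 0.0, 1.0])
```

A perfect prediction drives both terms toward zero, whereas an uninformative prediction of 0.5 everywhere is penalized by both the BCE term and the dice term.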

Fig. 5. Results of comparison of proposed approach with Method [18] on CFD (from top to bottom: original image, ground truth, Method [18], FCN [21], FPCNet, probability images predicted by FPCNet, special display of the probability images).

2) Results: We compare the proposed approach with FCN [21] (which we trained on the CFD dataset) and four other crack detection methods tested on the CFD dataset: CrackForest [28], MFCD [36], Method [37], and Method [18]. The results shown in TABLE I demonstrate that our approach outperforms the others. FPCNet achieves an F1 score of 96.93%, exceeding that of Method [18], the current state-of-the-art approach on the CFD dataset, by 4.49 percentage points and that of FCN by 1.03 percentage points. Fig. 5 shows the comparison of the proposed approach with Method [18] and FCN [21]. In this figure, wrong detection (FP) and missed detection (FN) are indicated by blue and green rectangles, respectively. As seen from the third row in the figure, several wrong detections and missed detections occur in the images predicted by Method [18] because of the small context of the blocks this method uses for detection. The FCN with a large context size solves these problems to some extent. As shown in the fourth row, most of the noise (wrong detection) is eliminated and the situation of missed detections is also alleviated. However, owing to the FCN's single context size and because it treats all features equally during detection, a large number of missed detections (green rectangles in the fourth row) occur, especially for cracks having complex topologies. Compared with these two methods, the results predicted by FPCNet have fewer wrong detections and missed detections, as seen in the fifth row of Fig. 5, indicating the robustness of our approach. This is because FPCNet can acquire the features of multiple context sizes via the MD module and treat them differently according to their respective contributions via the SEU module.

TABLE I
RESULTS FOR CRACK DETECTION EVALUATION ON CFD DATASET

Method            Tolerance Margin  Precision  Recall  F1 score
CrackForest [28]  5                 82.28%     89.44%  85.71%
MFCD [36]         5                 89.90%     89.47%  88.04%
Method [37]       2                 90.70%     84.60%  87.00%
Method [18]       2                 91.19%     94.81%  92.44%
FCN [21]          2                 97.29%     94.56%  95.90%
FPCNet            2                                    96.93%

The sixth row shows the probability images predicted by FPCNet, in which a darker pixel has a higher probability of being a crack. It can be observed that the noise and the pixels near the outer sides of the crack edges have such a low prediction probability that they are barely noticeable. To illustrate these features clearly, we indicate the pixels with probabilities higher than 0.5 in red and those with probabilities lower than 0.5 in blue in the seventh row (all the following probability images are marked in a similar manner). After binarization, the noise and the pixels on the outer sides, which have low probabilities (marked in blue), are erased. This indicates that, owing to the design of the MD module, FPCNet can obtain wide-ranging and multiple contextual information to effectively suppress noise and adapt to the different widths of the cracks.

3) Time consumption:

The time consumption of FPCNet is analyzed here. We first measure the time required by each module in FPCNet during the detection of one image. The corresponding results are shown in Fig. 6. The Encoder takes the most time (17.8 ms) in FPCNet, because its eight convolutional layers and four pooling layers all operate on the large images that carry the fine crack MC features. The MD module contains 10 convolutional layers, one global pooling layer, and one upsampling layer, but it operates after the extraction of the MC features. Consequently, its time consumption is only 2.0 ms owing to the small size of the features. The Decoder module has four SE-Upsampling modules and an additional prediction convolution layer. Although the image resolution increases, the process takes only 7.0 ms, which is roughly two-fifths of the time taken by the Encoder module. Other operations, including image reading, processing, and saving of the predicted image, take up 41.09 ms. The total time consumption in predicting an image is 67.9 ms, which corresponds to a speed of 14.7 frames per second.

Fig. 6. Percentage of time consumption of each module in the process of crack detection by FPCNet in the CFD dataset. The total time consumption is 67.9 ms per image.

We also compare the time consumption with that of Method [18], since it is the current state-of-the-art approach on the CFD dataset. We reproduce this method in PyTorch and measure the average time consumption for one crack image under the same GPU memory requirement (890 M) and environment (CPU: Intel(R) Core(TM) i7-6800K CPU @ 3.40 GHz, Memory: 16 GB, GPU: NVIDIA GTX1080Ti). FPCNet detects an image 5.7 times as fast as Method [18], as shown in TABLE II. This shows that our network offers not only higher detection accuracy but also faster detection, and that it is suitable for large-scale pavement crack detection.

TABLE II
RESULTS FOR TIME CONSUMPTION ON CFD DATASET

Method       Batch Size  Time Consumption  FPS
Method [18]  2800        380 ms            2.6
FPCNet       1           67.9 ms           14.7
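The timing figures above can be cross-checked with a few lines of arithmetic. The sketch below only re-derives the total, the frame rate, the per-module shares, and the speed-up from the numbers quoted in the text; the dictionary keys and the helper name `shares` are illustrative and do not come from the paper.

```python
# Per-module timings reported for one CFD image, in milliseconds.
timings_ms = {
    "encoder": 17.8,
    "md_module": 2.0,
    "decoder": 7.0,
    "other (I/O and processing)": 41.09,
}


def shares(timings):
    """Return each entry's share of the total time, in percent."""
    total = sum(timings.values())
    return {name: 100.0 * ms / total for name, ms in timings.items()}


total_ms = sum(timings_ms.values())   # total time per image
fps = 1000.0 / total_ms               # frames per second

method18_ms = 380.0                   # per-image time reported for Method [18]
speedup = method18_ms / total_ms      # ratio of per-image times

if __name__ == "__main__":
    print(f"total: {total_ms:.2f} ms, {fps:.1f} FPS, speed-up: {speedup:.2f}x")
    for name, pct in shares(timings_ms).items():
        print(f"{name}: {pct:.1f}%")
```

The ratio of the raw per-image times comes out slightly below the reported factor of 5.7; the reported value is consistent with dividing the rounded frame rates in TABLE II (14.7 / 2.6). The shares also make it clear that reading, processing, and saving images account for most of the per-image time, not the network itself.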

B. G45 Dataset

The G45 crack dataset consists of 122 grayscale crack images with a resolution of [...] × [...]; these images were collected on the G45 highway in China. The ground truth of each image is carefully labeled by professional engineers. The dataset covers four types of cracks: transverse cracks, longitudinal cracks, block cracks, and alligator cracks. The imaging conditions and brightness vary from image to image, so the contrast between the crack and the pavement differs across the dataset. Moreover, the number, length, and width of the cracks in each image vary. Some images contain oil stains, speckle noise, and lane lines, which makes crack detection difficult.

We randomly divided the 122 crack images into 77 training images and 45 testing images. Owing to the large image resolution, each image is cropped into twelve images of size [...] × [...] (without overlapping), and these images are then sent to the network for training.
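The non-overlapping cropping described above can be sketched as a simple grid split, together with its inverse for stitching patch-wise outputs back into a full image. The image and patch sizes are not recoverable from this section, so the 3 × 4 grid in the usage note is a hypothetical layout that merely yields twelve patches; the function names are likewise illustrative.

```python
from typing import List

Image = List[List[int]]


def split_into_patches(image: Image, grid_rows: int, grid_cols: int) -> List[Image]:
    """Split an image into grid_rows * grid_cols non-overlapping patches.

    Patches are returned in row-major order (left to right, top to bottom).
    """
    height, width = len(image), len(image[0])
    if height % grid_rows or width % grid_cols:
        raise ValueError("image size must be divisible by the grid")
    ph, pw = height // grid_rows, width // grid_cols
    patches = []
    for gr in range(grid_rows):
        for gc in range(grid_cols):
            rows = image[gr * ph:(gr + 1) * ph]
            patches.append([row[gc * pw:(gc + 1) * pw] for row in rows])
    return patches


def merge_patches(patches: List[Image], grid_rows: int, grid_cols: int) -> Image:
    """Inverse of split_into_patches: stitch row-major patches into one image."""
    merged = []
    for gr in range(grid_rows):
        row_patches = patches[gr * grid_cols:(gr + 1) * grid_cols]
        for r in range(len(row_patches[0])):
            # Concatenate the r-th row of every patch in this grid row.
            merged.append([value for patch in row_patches for value in patch[r]])
    return merged
```

For example, `split_into_patches(image, 3, 4)` would produce twelve patches, and `merge_patches(patches, 3, 4)` restores the original layout, provided the patches are kept in the same order.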

1) Implementation details:

The data augmentation is performed in the same manner as on the CFD dataset, except for the random color jittering, because the images in the G45 dataset are in grayscale. Further, we also omit the clockwise rotation of [...]°. Random cropping is used in training, but with a size of [...] × [...]. We use the same hyperparameters as those used when training on the CFD dataset, except for the batch size, which is 2.
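One practical detail of random cropping for pixel-level labels is that the image and its ground-truth mask must be cropped with the same offset. The sketch below shows this under stated assumptions: the crop size is a free parameter because the value used in the paper is not recoverable here, and `random_paired_crop` is a hypothetical helper, not a function from the authors' code.

```python
import random
from typing import List, Tuple

Image = List[List[int]]


def random_paired_crop(
    image: Image,
    label: Image,
    crop_h: int,
    crop_w: int,
    rng: random.Random,
) -> Tuple[Image, Image]:
    """Crop the same random window from an image and its label mask."""
    height, width = len(image), len(image[0])
    if crop_h > height or crop_w > width:
        raise ValueError("crop size exceeds image size")
    # Draw one offset and reuse it for both arrays so they stay aligned.
    top = rng.randint(0, height - crop_h)
    left = rng.randint(0, width - crop_w)

    def crop(grid: Image) -> Image:
        return [row[left:left + crop_w] for row in grid[top:top + crop_h]]

    return crop(image), crop(label)
```

Passing an explicit `random.Random` instance keeps the augmentation reproducible when a fixed seed is used.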

2) Results:

To obtain the complete prediction result, we crop the entire [...] × [...] pavement image into twelve [...] × [...] patches and send them to the network separately. Next, we combine the twelve predicted results into an overall prediction of the entire pavement image.

Fig. 7 shows typical results for the four types of cracks. From top to bottom, the rows show the pavement crack images, the ground truth, the prediction results from FPCNet, the detailed images of the regions marked by green rectangles in the first row, the probability results of the detailed images, and the prediction results of the detailed images. From left to right, the columns show the results for transverse cracks, longitudinal cracks, block cracks, and alligator cracks. FPCNet can analyze different types of cracks using the rich contextual features of the MD module. As a result, all cracks can be robustly detected irrespective of how complex their patterns are (third row in Fig. 7).

FPCNet demonstrates excellent ability to identify transverse cracks (first column in Fig. 7) owing to their simple structure and the effectiveness of the network. Only a little noise (blue pixels) exists in the probability image shown in the fifth row, which indicates that the proposed network detects transverse cracks with remarkably high accuracy. Because rotation by [...]° is used as data augmentation during training, the rotated images of the transverse cracks enhance the network's ability to recognize longitudinal cracks. As shown in the second column of Fig. 7, longitudinal cracks are detected with high accuracy, similar to the transverse cracks. Block cracks are more complicated than transverse and longitudinal cracks (third column in Fig. 7). Missed detections occur in certain locations in which the cracks are shallow and indistinct. As seen in the fifth row, FPCNet assigns them a low prediction probability. However, the proposed network can still detect most cracks of this type reasonably well (final row in third column). Alligator cracks are the most complex type of pavement cracks (fourth column in Fig. 7), and they are densely interlaced. As illustrated in the fifth row of the fourth column, more noise exists in this probability image than in those of the other crack types; however, after binarization, the pixels with low probability are filtered out. This shows that the proposed network performs satisfactorily in the detection of alligator cracks.

Fig. 7. Typical results of four types of cracks on G45 dataset using FPCNet (from left to right: transverse cracks, longitudinal cracks, block cracks, alligator cracks; from top to bottom: original image, ground truth, predicted image, detailed images having green rectangles in first row, detailed images of probability image, detailed images of predicted image).

We also illustrate some typical examples of crack images with low contrast, zebra crossings, and noise in Fig. 8. As seen in this figure, such cracks are harder to discern than other cracks and their backgrounds are full of noise. In addition, the image in the right column also exhibits poor illumination. The presence of such features in crack images critically affects the detection of cracks, leading to an increase in missed detections. However, most cracks can be successfully detected by FPCNet, indicating the robustness of the proposed network.

The evaluation results of the four types of cracks on the G45 dataset are presented in TABLE III. Transverse and longitudinal cracks have relatively good detection results owing to their simple structure and the effective training strategy mentioned above. The F1 scores of both these types of cracks exceed 95%. Most images of block cracks involve bad imaging conditions and noise, leading to an increase in missed detections. However, owing to the robustness of FPCNet, the Precision, Recall, and F1 score of block cracks still reach 91.53%, 89.90%, and 90.71%, respectively. Our network performs robustly in the case of alligator cracks, for which the values of Precision, Recall, and F1 score are 95.24%, 94.26%, and 94.75%, respectively.

TABLE III
EVALUATION RESULTS FOR DIFFERENT TYPES OF CRACKS ON G45 DATASET

Type          Precision  Recall   F1 score
Transverse    98.22%     96.82%   97.51%
Longitudinal  93.10%     98.58%   95.76%
Block         91.53%     89.90%   90.71%
Alligator     95.24%     94.26%   94.75%

TABLE IV
EVALUATION RESULTS FOR DIFFERENT DILATION RATES ON CFD DATASET

Dilation rates  Precision  Recall  F1 score
{1, 2, 3, 4}    [...]      [...]   [...]
{1, 2, 4, 8}    [...]      [...]   [...]
{2, 4, 8, 16}   [...]      [...]   [...]

C. Discussion

The dilation rate introduced in Eqn. 1 is an important hyperparameter that allows us to vary the context size obtained by the MD module in the network. As illustrated in Fig. 1, a larger dilation rate yields a larger context size. Different context sizes can affect the prediction results differently. To find an appropriate combination of dilation rates for crack detection, we conduct experiments on the CFD and G45 datasets with different settings of this hyperparameter in the MD module.

Three groups of dilation rates, {1, 2, 3, 4}, {1, 2, 4, 8}, and {2, 4, 8, 16}, are tested. As seen from the results given in TABLE IV and TABLE V, the highest accuracy is achieved on both datasets with the group {1, 2, 3, 4}. This is because, for a relatively elongated structure such as a crack, a larger dilation rate ignores more details of the crack, thereby decreasing the accuracy.

Fig. 8. Typical examples of crack images with low contrast, zebra crossings, and noise.

V. CONCLUSION

In this paper, we propose a high-precision and high-speed pavement crack detection network called FPCNet. A Multi-Dilation module and an SE-Upsampling module are developed in this framework. The Multi-Dilation module extracts the crack MD features of multiple context sizes to robustly detect cracks with different widths and topologies. The SE-Upsampling module restores the resolution of the MD features and assigns different weights to the MD features after lateral connection, to optimize the MD features. By integrating these two modules in the Encoder-Decoder structure, FPCNet characterizes the crack context with the MD module and recursively optimizes the contextual features step-by-step. Finally, pixel-level crack detection is achieved.

The results of a large number of experiments performed on two different crack datasets show that the proposed method can robustly detect many types of cracks and achieve state-of-the-art results on the CFD dataset, with a speed of 14.7 FPS.

In future work, we will test FPCNet on more crack datasets. In addition, a learning-based conditional random field (CRF) will be incorporated into the network to further refine its output.

REFERENCES

[1] J. Tang and Y. Gu, “Automatic crack detection and segmentation using a hybrid algorithm for road distress analysis,” in Systems, Man, and Cybernetics (SMC), 2013 IEEE International Conference on. IEEE, 2013, pp. 3026–3030.
[2] Q. Li and X. Liu, “Novel approach to pavement image segmentation based on neighboring difference histogram method,” in Image and Signal Processing, 2008. CISP’08. Congress on, vol. 2. IEEE, 2008, pp. 792–796.
[3] H. Oliveira and P. L. Correia, “Automatic road crack segmentation using entropy and image dynamic thresholding,” in Signal Processing Conference, 2009 17th European. IEEE, 2009, pp. 622–626.
[4] H.-D. Cheng and M. Miyojim, “Automatic pavement distress detection system,” Inf. Sci., vol. 108, no. 1-4, pp. 219–240, 1998.
[5] H. Zhao, G. Qin, and X. Wang, “Improvement of canny algorithm based on pavement edge detection,” in Image and Signal Processing (CISP), 2010 3rd International Congress on, vol. 2. IEEE, 2010, pp. 964–967.
[6] R. S. Lim, H. M. La, Z. Shan, and W. Sheng, “Developing a crack inspection robot for bridge maintenance,” in Robotics and Automation (ICRA), 2011 IEEE International Conference on. IEEE, 2011, pp. 6288–6293.
[7] R. S. Lim, H. M. La, and W. Sheng, “A robotic crack inspection and mapping system for bridge deck maintenance,” IEEE Trans. Autom. Sci. Eng., vol. 11, no. 2, pp. 367–378, 2014.
[8] S. Chanda, G. Bu, H. Guan, J. Jo, U. Pal, Y.-C. Loo, and M. Blumenstein, “Automatic bridge crack detection–a texture analysis-based approach,” in IAPR Workshop on Artificial Neural Networks in Pattern Recognition. Springer, 2014, pp. 193–203.
[9] R. Medina, J. Llamas, E. Zalama, and J. Gómez-García-Bermejo, “Enhanced automatic detection of road surface cracks by combining 2d/3d image processing techniques,” in Image Processing (ICIP), 2014 IEEE International Conference on. IEEE, 2014, pp. 778–782.
[10] P. Subirats, J. Dumoulin, V. Legeay, and D. Barba, “Automation of pavement surface crack detection using the continuous wavelet transform,” in Image Processing, 2006 IEEE International Conference on. IEEE, 2006, pp. 3037–3040.
[11] J. Zhou, P. S. Huang, and F.-P. Chiang, “Wavelet-based pavement distress detection and evaluation,” Opt. Eng., vol. 45, no. 2, p. 027007, 2006.
[12] R. Kapela, P. Śniatała, A. Turkot, A. Rybarczyk, A. Pożarycki, P. Rydzewski, M. Wyczałek, and A. Błoch, “Asphalt surfaced pavement cracks detection based on histograms of oriented gradients,” in Mixed Design of Integrated Circuits & Systems (MIXDES), 2015 22nd International Conference. IEEE, 2015, pp. 579–584.
[13] F.-C. Chen and M. R. Jahanshahi, “Nb-cnn: deep learning-based crack detection using convolutional neural network and naive bayes data fusion,” IEEE Trans. Ind. Electron., vol. 65, no. 5, pp. 4392–4400, 2018.
[14] J.-H. Lee, S.-S. Yoon, I.-H. Kim, and H.-J. Jung, “Diagnosis of crack damage on structures based on image processing techniques and r-cnn using unmanned aerial vehicle (uav),” in Sensors and Smart Structures Technologies for Civil, Mechanical, and Aerospace Systems 2018, vol. 10598. International Society for Optics and Photonics, 2018, p. 1059811.
[15] X. Wang and Z. Hu, “Grid-based pavement crack analysis using deep learning,” in Transportation Information and Safety (ICTIS), 2017 4th International Conference on. IEEE, 2017, pp. 917–924.
[16] Y.-J. Cha, W. Choi, and O. Büyüköztürk, “Deep learning-based crack damage detection using convolutional neural networks,” Comput.-Aided Civ. Infrastruct. Eng., vol. 32, no. 5, pp. 361–378, 2017.
[17] Y.-K. An, K.-Y. Jang, B. Kim, and S. Cho, “Deep learning-based concrete crack detection using hybrid images,” in Sensors and Smart Structures Technologies for Civil, Mechanical, and Aerospace Systems 2018, vol. 10598. International Society for Optics and Photonics, 2018, p. 1059812.
[18] Z. Fan, Y. Wu, J. Lu, and W. Li, “Automatic pavement crack detection based on structured prediction with the convolutional neural network,” arXiv preprint arXiv:1802.02208, 2018.
[19] X. Yang, H. Li, Y. Yu, X. Luo, T. Huang, and X. Yang, “Automatic pixel-level crack detection and measurement using fully convolutional network,” Comput.-Aided Civ. Infrastruct. Eng.
[20] H.-w. Huang, Q.-t. Li, and D.-m. Zhang, “Deep learning based image recognition for crack and leakage defects of metro shield tunnel,” Tunnelling Underground Space Technol., vol. 77, pp. 166–176, 2018.
[21] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3431–3440.
[22] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention. Springer, 2015, pp. 234–241.
[23] M. Holschneider, R. Kronland-Martinet, J. Morlet, and P. Tchamitchian, “A real-time algorithm for signal analysis with the help of the wavelet transform,” in Wavelets. Springer, 1990, pp. 286–297.
[24] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” arXiv preprint arXiv:1709.01507, vol. 7, 2017.
[25] Y. Hu, C. Zhao, and H. Wang, “Automatic pavement crack detection using texture and shape descriptors,” IETE TECH REV, vol. 27, no. 5, p. 398, 2010.
[26] A. Cord and S. Chambon, “Automatic road defect detection by textural pattern recognition based on adaboost,” Comput.-Aided Civ. Infrastruct. Eng., vol. 27, no. 4, pp. 244–259, 2012.
[27] P. Prasanna, K. J. Dana, N. Gucunski, B. B. Basily, H. M. La, R. S. Lim, and H. Parvardeh, “Automated crack detection on concrete bridges,” IEEE Trans. Autom. Sci. Eng., vol. 13, no. 2, pp. 591–599, 2016.
[28] Y. Shi, L. Cui, Z. Qi, F. Meng, and Z. Chen, “Automatic road crack detection using random structured forests,” IEEE Trans. Intell. Transp. Syst., vol. 17, no. 12, pp. 3434–3445, 2016.
[29] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[30] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
[31] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” Proc. IEEE CVPR, pp. 1–9, 2015.
[32] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” Proc. IEEE CVPR, pp. 770–778, 2016.
[33] F. Yu and V. Koltun, “Multi-scale context aggregation by dilated convolutions,” International Conference on Learning Representations, 2016.
[34] L. Chen, G. Papandreou, I. Kokkinos, K. P. Murphy, and A. L. Yuille, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 4, pp. 834–848, 2018.
[35] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in pytorch,” in NIPS-W, 2017.
[36] H. Li, D. Song, Y. Liu, and B. Li, “Automatic pavement crack detection by multi-scale image fusion,” IEEE Trans. Intell. Transp. Syst., no. 99, pp. 1–12, 2018.
[37] D. Ai, G. Jiang, L. S. Kei, and C. Li, “Automatic pixel-level pavement crack detection using information of multi-scale neighborhoods,” IEEE Access, vol. 6, pp. 24 452–24 463, 2018.
[38] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” Proc IEEE Int Conf Comput Vis, pp. 1026–1034, 2015.
[39] B. T. Polyak, “Some methods of speeding up the convergence of iteration methods,”