Damage detection using in-domain and cross-domain transfer learning
Zaharah A. Bukhsh
Eindhoven University of Technology, Eindhoven, The Netherlands
[email protected]

Nils Jansen
Radboud University, Nijmegen, The Netherlands
[email protected]

Aaqib Saeed
Eindhoven University of Technology, Eindhoven, The Netherlands
[email protected]
Abstract
We investigate the capabilities of transfer learning in the area of structural health monitoring. In particular, we are interested in damage detection for concrete structures. Typical image datasets for such problems are relatively small, calling for the transfer of learned representations from a related large-scale dataset. Past efforts of damage detection using images have mainly considered cross-domain transfer learning approaches using pre-trained ImageNet models that are subsequently fine-tuned for the target task. However, there are rising concerns about the generalizability of ImageNet representations for specific target domains, such as visual inspection and medical imaging. We therefore propose a combination of in-domain and cross-domain transfer learning strategies for damage detection in bridges. We perform comprehensive comparisons to study the impact of cross-domain and in-domain transfer, with various initialization strategies, using six publicly available visual inspection datasets. The pre-trained models are also evaluated for their ability to cope with an extremely low-data regime. We show that the combination of cross-domain and in-domain transfer persistently shows superior performance, even with tiny datasets. Likewise, we provide visual explanations of the predictive models to enable algorithmic transparency and give experts insight into the intrinsic decision logic of typically black-box deep models.

Preprint. Under review.

1 Introduction

Civil structures such as bridges are reaching their end of service life due to aging, increased usage, and adverse climate impact [9]. Asset owners have to employ human experts to conduct periodic visual inspections to ensure structural safety and usability. In particular, so-called condition scorecards are used to rate the condition of bridges, whereas damage details are captured in images for further analysis [17]. Given that countries have thousands of bridges, for instance, 60,000 bridges across the US and 39,000 bridges in Germany, such expert-driven visual inspections are very costly and labor-intensive.

Towards a more automated process, unmanned aerial vehicles (UAVs) equipped with cameras, thermal infrared sensors, GPS, as well as light detection and ranging (LiDAR) sensors have proven useful to characterize cracks, spalls, concrete degradation, and corrosion [25, 34, 8]. Although UAVs are cost-effective and safe for harder-to-reach areas, they capture both damaged and structurally sound parts of bridges. As a consequence, an enormous amount of high-dimensional sensory data is created [7]. Manual damage identification from such an enormous data source demands tremendous effort and is prone to discrepancies due to human errors, fatigue, and poor judgments of bridge inspectors [36, 2]. In particular, for image data, the subjectivity involved in a visual inspection process results in inaccurate outcomes and poses several concerns for public safety [30], as shown by collapse incidents like the Malahide viaduct [39] and the I-35 Minneapolis bridge [47].

The large-scale unlabeled datasets from visual inspections call for state-of-the-art machine learning methods. In particular, we focus on damage detection of concrete surfaces using visual data. The field of computer vision has witnessed unprecedented developments since the advent of deep learning and the availability of large-scale annotated datasets, such as ImageNet. Deep neural networks, specifically convolutional neural networks (CNNs), have shown strong performance in tasks such as object detection [41], image segmentation [21], image synthesis [54], and reconstruction [15], with promising applications in multiple domains.

While the use of deep learning techniques for structural health monitoring of civil structures is gaining momentum [52, 13], we found only six publicly available bridge inspection datasets [18, 29, 49, 26, 20, 32] that can be utilized to develop automated damage detection models. These datasets are relatively small-scale due to the tedious process of image acquisition and data labeling by experts. Due to this limitation, recent studies for damage detection of bridges, tunnels, and roads have adopted transfer learning as the de-facto standard for crack detection [26, 42], pothole identification [4, 28], and related defect classification tasks [49].
These studies exclusively rely on cross-domain transfer learning, where ImageNet is used as the source dataset (or upstream task), which is then fine-tuned for specific downstream tasks. However, several studies have raised skepticism about employing transfer learning for disparate target domains [16, 23, 14, 19]. For instance, Raghu et al. [38] showed that ImageNet feature reuse offers little benefit to performance when evaluated on medical imaging. Similar to medical images, bridge inspection datasets are fundamentally different from the ImageNet dataset in terms of the number of classes, quality of images, size of the areas of interest, and task specification.

This work performs a fine-grained investigation of the potential advantages and downsides of cross-domain transfer learning for structural damage detection. Besides, for the first time, we propose to learn and transfer in-domain representations and their combination with cross-domain transfer strategies for improved automated damage detection. Specifically, in-domain transfer learning refers to a strategy in which the source and target datasets belong to a similar domain. Furthermore, we perform a comparative analysis to evaluate the transferability of learned representations under a low-data regime. We evaluate the different transfer strategies using all six publicly available bridge inspection datasets.

The main contributions of this work are as follows:

• We develop multiple CNN models using cross-domain and in-domain transfer learning strategies for damage detection (classification) tasks. We demonstrate that the combination of cross-domain and in-domain learning provides superior performance, but at the cost of extensive fine-tuning of large pre-trained models.

• We conduct rigorous evaluations of transfer strategies for damage detection using the six publicly available visual inspection datasets. Moreover, we establish benchmarks for the publicly available damage detection datasets.

• Finally, we generate visual explanations of CNNs to reveal the learning mechanisms of multi-layered deep neural networks. Such an interpretability analysis enables algorithmic transparency, validates the robustness of trained models, and gains practitioners' trust in typically black-box models.

The paper is structured as follows: Section 2 presents related work on damage detection using deep transfer learning and feature reuse. Section 3 introduces the publicly available bridge inspection datasets. Section 4 explains the methodology of our approach, along with the experimental setup. Section 5 presents the results of the experiments and provides comparisons to the baselines. The interpretability analysis of the developed models is given in Section 6. Finally, the conclusions of this study are presented in Section 7.

2 Related work
Visual inspection datasets are small-scale and expensive to curate. Therefore, several studies adopt standard architectures, such as VGG16, Inception-v3, or ResNet50 models pre-trained with ImageNet representations, for the detection of cracks [49, 44, 56, 26, 42], potholes [4, 28], spalls [50], and multiple other damages including corrosion, seepage, and exposed bars [32, 20, 58, 59, 55, 12]. Additionally, pre-trained CNN models are also being utilized for vibration-based damage localization [1, 3, 51], condition assessment [22], and fault diagnosis [48].

Specific (visual) detection tasks in the damage recognition setting are strongly influenced by various operating conditions, such as surface reflectance, roughness, concrete materials, coatings, and weather phenomena for different components of a bridge [32]. Due to differences in the source domain (i.e., ImageNet) and the target domain (i.e., inspection datasets), transfer learning from cross-domain representations may not bring the desired performance improvements, as shown in the case of medical images [38].

Transfer learning via ImageNet representations is a widely adopted standard for learning from small-scale datasets across several domains. The availability of a multitude of pre-trained models, such as MoBiNet, YOLO, ResNet, or GoogLeNet, has further encouraged the application of transfer learning beyond the standard datasets consisting of day-to-day objects. The learned representations are typically utilized either via fixed feature extraction or fine-tuning methods, depending on the target task.

Due to the widespread adoption of transfer learning, several fine-grained studies examine the transferability of features concerning the number of layers, the order of fine-tuning [53, 27, 45], the generalizability of the architecture and learned weights [23, 37], and the characteristics of the data used for pre-training [19]. Despite the popularity of transfer learning and its perceived usefulness, it has been argued that feature transfer does not necessarily deliver performance improvements compared to learning from scratch [38, 16]. Moreover, learned representations from pre-trained models may be less generic than often suggested [23], and the choice of data for pre-training is not as critical for transfer performance [19].

In the context of in-domain and cross-domain strategies, domain adaptation is often misinterpreted as a general transfer learning method. In fact, domain adaptation is a subcategory of transfer learning in which both target and source datasets are required during training, and the target dataset is weakly labeled or unlabeled [35]. However, in a standard transfer learning setting, the target dataset is supervised (labeled), and only pre-trained models can be used for learning without explicit access to the source dataset. Further details about transfer learning and the proposed strategies are provided in the following sections.
3 Bridge inspection datasets

Due to the ubiquitous nature of images, several traditional application domains, such as retail, automotive, and agriculture, have benefited from massive labeled datasets to develop deep neural networks for various tasks. These natural datasets are relatively easy to label and do not require specific domain expertise. On the other hand, bridge inspection datasets are very scarce due to the vital domain knowledge required to identify and label damages. Moreover, differences in the human expertise used for annotation make these datasets susceptible to noisy labels. To the best of our knowledge, only six (publicly available) datasets of visual inspection of bridges exist. Datasets too small to train and evaluate CNNs are not considered here. Table 1 provides a brief overview of the datasets. A few example images are shown in Figure 1.

Table 1: Overview of (bridge) visual inspection datasets.

Dataset         Instances   Classes   Problem
CDS [18]            1,027         2   Binary
SDNETv1 [29]       13,620         2   Binary
BCD [49]            5,390         2   Binary
ICCD [26]          60,010         2   Binary
MCDS [20]           2,411        10   Multi-label
CODEBRIM [32]       8,304         6   Multi-label

Figure 1: Example images from the datasets: (a) CDS, (b) SDNETv1, (c) BCD, (d) ICCD, (e) MCDS, (f) CODEBRIM. From left to right: (a) complex lighting with no crack, crack with efflorescence, and exposed bars; (b) intact concrete, minor crack; (c) intact concrete, large crack on deck; (d) crack with rust stains, crack with minor scaling; (e) concrete scaling, corrosion with exposed bars; (f) spallation and efflorescence, rust stains.
Four of the datasets mainly focus on crack detection tasks. The other two datasets are for damage classification in a multi-label setting, in which multiple damages co-exist in a single input image. We briefly introduce the key characteristics of each dataset.
CDS [18]. The Cambridge dataset is a small bridge inspection dataset that seeks to detect defects on concrete surfaces. The binary classes are divided into healthy and unhealthy images. Besides the different luminous conditions across all images, the unhealthy class contains concrete damage such as cracking, graffiti, vegetation, and blistering.

SDNET [29]. The dataset has images of reinforced concrete decks, walls, and pavements, sub-segmented into smaller images. However, we found misclassified label categories in the dataset. To utilize the data, we performed a manual cleaning of the deck images only. The new dataset is referred to as SDNETv1, which is highly imbalanced between crack and uncracked images.

BCD [49].
The bridge crack detection dataset consists of crack and background images, where the background images are healthy concrete surfaces. The dataset contains crack images with details like shading, water stains, and strong lights.

ICCD [26]. The image-based concrete crack detection dataset is one of the largest crack datasets, having 60,000 images with an equal distribution of crack and uncracked classes. The images are captured using smartphone cameras under varying lighting conditions. In addition to cracks, the images predominantly show corrosion stains.
MCDS [20]. The multi-classifier dataset has original inspection data and images collected from ten highway bridges. The authors defined the problem in a multi-stage classifier manner. However, we use the dataset in a multi-label setting with ten classes, i.e., crack, efflorescence, scaling, spalling, general defects, no defects, exposed reinforcement, no exposed reinforcement, rust staining, and no rust staining.

CODEBRIM [32].
The COncrete DEfect BRidge IMage dataset provides overlapping defect images of bridges collected via cameras and UAVs. The dataset consists of six classes, i.e., cracks, spallation, efflorescence, exposed bars, corrosion stain, and background. The authors framed the damage detection problem as multi-target multi-class due to the overlapping damages in the images. Here, we utilize the dataset in a multi-label setting only.

4 Methodology

We propose to employ transfer learning as a natural solution for learning and improving the performance of damage detection models on small labeled datasets, as shown in numerous other vision tasks [35]. Transfer learning attempts to transfer the learned knowledge from a source task to improve the learning in a related target task [46]. Here, we consider variants of transfer learning, referred to as cross-domain and in-domain, and their combination, as shown in Figure 2. The main difference between these approaches is the type of dataset used as the source task (also called the upstream dataset).

In cross-domain transfer learning, a convolutional neural network is trained on the large ImageNet dataset of one million images of generic objects [6]. The learned representations are then fine-tuned for specific downstream tasks, such as damage detection. Given the computational cost of training on the ImageNet dataset, several pre-trained models have been made available by academia and industry. Pre-trained models such as VGG16, Inception-v3, and ResNet50, among others, have been shown to yield state-of-the-art performance for object detection and image localization in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). A pre-trained model consists of representations (or weights) learned from ImageNet together with a standard CNN architecture.

Figure 2: Schematic representation of the transfer learning strategies: (a) cross-domain, (b) in-domain, and (c) their combination. The model architectures are VGG16, Inception-v3, ResNet50, and custom small networks; the downstream tasks are the visual inspection (VI) datasets.

For in-domain transfer learning, the upstream and downstream datasets belong to a similar application domain, for instance, medical imaging or asset inspection. For comparison purposes, we utilize standard CNN architectures for training on a visual inspection dataset, which are then fine-tuned for a specific task. It is important to note that the datasets used for the upstream and downstream tasks are mutually exclusive, as depicted by one less square in Figure 2b. In theory, in-domain transfer learning should provide improved model performance compared to cross-domain transfer due to the availability of similar visual concepts. However, cross-domain transfer learning is standard in the computer vision field, as representations are learned from a large dataset with diverse classes.

We also study the combination of in-domain and cross-domain transfer learning to enable further performance improvements for damage detection tasks. Here, the pre-trained models carrying ImageNet representations are first trained on a visual inspection dataset before being fine-tuned on a specific damage detection problem. The combination strategy is depicted in Figure 2c.

We conduct several experiments to evaluate the different transfer strategies for improved damage detection.
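As a toy illustration of the three strategies, the sketch below trains a tiny logistic-regression "model" (a stand-in for the CNNs used in this work) on synthetic upstream and downstream tasks and compares random initialization against cross-domain, in-domain, and combined warm starts. All dataset sizes, separations, and hyper-parameters here are illustrative assumptions, not the actual experimental setup.

```python
import math
import random

def train(X, y, w=None, lr=0.5, epochs=200):
    """Logistic regression via batch gradient descent; `w` is an optional warm start."""
    n = len(X[0])
    w = list(w) if w is not None else [0.0] * n
    for _ in range(epochs):
        grad = [0.0] * n
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi))
            p = 1.0 / (1.0 + math.exp(-z))          # sigmoid prediction
            for j in range(n):
                grad[j] += (p - yi) * xi[j]         # gradient of the log loss
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w

def accuracy(w, X, y):
    preds = [1 if sum(wj * xj for wj, xj in zip(w, xi)) > 0 else 0 for xi in X]
    return sum(int(p == t) for p, t in zip(preds, y)) / len(y)

def make_task(n, sep, rng):
    """Two-class 1-D Gaussian blobs; `sep` loosely controls task relatedness."""
    X, y = [], []
    for i in range(n):
        label = i % 2
        X.append([rng.gauss(sep if label else -sep, 1.0), 1.0])  # feature + bias term
        y.append(label)
    return X, y

rng = random.Random(0)
src_X, src_y = make_task(400, 1.5, rng)  # large generic source (cross-domain stand-in)
dom_X, dom_y = make_task(200, 1.2, rng)  # related inspection source (in-domain stand-in)
tgt_X, tgt_y = make_task(20, 1.0, rng)   # tiny damage-detection target

w_cross = train(src_X, src_y)            # (a) cross-domain pre-training
w_in = train(dom_X, dom_y)               # (b) in-domain pre-training
w_comb = train(dom_X, dom_y, w=w_cross)  # (c) combination: generic, then in-domain

# Fine-tune each initialization briefly on the small target task.
results = {}
for name, w0 in [("scratch", None), ("cross-domain", w_cross),
                 ("in-domain", w_in), ("combination", w_comb)]:
    w = train(tgt_X, tgt_y, w=w0, epochs=20)
    results[name] = accuracy(w, tgt_X, tgt_y)
```

The point is only to make the initialization flow of Figure 2 concrete: each strategy differs solely in the weights used to start fine-tuning on the downstream task, not in the downstream training itself.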
The experiments seek to answer the following questions:

• Do features reused from standard pre-trained models prove helpful compared to learning from scratch for the damage detection task?
• Do the standard architectures perform better than small custom CNN models?
• Does in-domain transfer learning provide a performance benefit compared to the state-of-the-art cross-domain transfer approach?
• Does the combination of in-domain and cross-domain transfer perform better than their stand-alone versions?
• How well do the representations generalize given an extremely small downstream training dataset?

To acquire pre-trained (ImageNet) features, we utilize the pre-trained models VGG16, Inception-v3, and ResNet50. The choice of these models is motivated by the fact that they are widely used for damage detection tasks, as shown in the literature [50, 58, 20, 5, 24]. The traditional strategies for leveraging transfer learning include: (i) using a pre-trained model as a fixed feature extractor (FE), in which the pre-trained layers are kept frozen; (ii) fine-tuning (FT) all or a few of the layers of an existing model so that the weights are updated for the target task; and (iii) training ImageNet weights from scratch and then fine-tuning on the target task. Since training on ImageNet from scratch is computationally expensive, in our experiments we opted for the feature extraction and fine-tuning strategies, appended with an additional convolution and a dropout layer to avoid overfitting.

For the comparative evaluation of standard architectures, four smaller and simpler CNN models are also adopted from [38]. Raghu et al. [38] referred to these models as the CBR family and used them to evaluate pre-trained networks for medical imaging. The simpler networks consist of four to five layers of convolution, batch normalization, and ReLU activation with varying numbers of filters and kernel sizes. Depending on the architecture configuration, the size of the CBR networks ranges from a third of the ImageNet architecture size to one-twentieth of that size, as described in [38]. Due to the extensive computational cost of training on ImageNet, we perform architectural comparisons with random initialization only.

Besides cross-domain transfer, we investigate in-domain transfer learning for structural damage detection, where a dataset from a similar domain is utilized for the upstream and downstream tasks. For comparison purposes, we evaluate the best performing randomly initialized and ImageNet models for the specific downstream damage detection task. Finally, we conduct several experiments with fewer labeled examples from the training sets to assess the quality of the representations learned from cross-domain and in-domain learning within a low-data regime.

For the performance evaluation, we employ the held-out approach, in which the complete dataset is split into training, validation, and test sets. For the ICCD and
CODEBRIM datasets, the predefined train/validation/test splits are used. The validation set is used for the hyper-parameter tuning of the model. We performed five independent runs of the models for 30 epochs with a batch size of 64 for all the experiments. The learning rate was kept low, with a binary or categorical cross-entropy loss function depending on the problem. We used a machine with two Nvidia Tesla T4 GPUs on the Google Cloud Platform.

Additionally, for the performance evaluation and comparison of the deep models, we compute the area under the receiver operating characteristic curve (AUC-ROC) score, which is one of the most robust evaluation metrics for classification tasks with skewed class distributions [11]. The AUC-ROC score ranges from 0 to 1; the best performing model achieves a score closer to 1, while a classifier with a score of 0.5 is considered random, with no predictive capability.

5 Results

Five independent runs (i.e., training and evaluation) of the CBR models and the pre-trained ImageNet models are performed on the six bridge inspection datasets. The averaged AUC-ROC performance scores and standard deviations are shown in Table 2. First, we note that transfer learning with fine-tuning performed consistently well for the majority of the bridge inspection datasets, as depicted with bold entries in Table 2. VGG16 with fine-tuned weights achieved the best performance on all datasets except the CODEBRIM dataset. For the BCD, ICCD, and CODEBRIM datasets, the results from the Inception-v3 and ResNet50 models are also comparable, differing only by small margins. Next, in the random initialization setting, the vanilla CBR models, at one-twentieth of the standard architecture size, performed best on four datasets, followed by VGG16 for the other two datasets, as depicted with underlined entries. Feature extraction yielded poor performance compared to random initialization and fine-tuning. This poor performance may be due to considerable differences between the ImageNet and bridge inspection datasets regarding the number of classes, quality of images, size of the area of interest, and task specification.

Table 2: AUC-ROC performance scores of the CBR models and standard ImageNet architectures with different initializations, evaluated on CDS, SDNETv1, BCD, ICCD, MCDS, and CODEBRIM. Under random initialization, the compared models are CBR Tiny, CBR Small, CBR LargeW, CBR LargeT, VGG16, Inception-v3, and ResNet50; under fine-tuning and feature extraction, VGG16, Inception-v3, and ResNet50. VGG16 with fine-tuning performed best for five datasets, followed by ResNet50, as shown with bold entries. Under the random initialization and fixed-weights settings, the CBR models performed better than the standard ImageNet architectures for four datasets, followed by VGG16, as shown with underlined entries.

In-domain representation learning typically follows two steps [33]: (a) learning representations using in-domain data, called upstream training, and (b) evaluating the learned representations by transferring them to a new and unseen task, termed downstream. For the upstream step, we utilize the best performing models trained from scratch and the fine-tuned
ImageNet models (see Table 2). The learned representations are further fine-tuned on all the new and unseen downstream tasks.

Table 3 provides the results of the in-domain transfer. For comparison purposes, the results from random initialization are also provided. The first column depicts the upstream datasets, and the remaining columns are the downstream datasets. Out of six datasets, in-domain transfer yields a performance improvement over learning from scratch for at least four datasets. SDNETv1 and MCDS show notable performance improvements, with an increase of 0.7 and 0.2 in the AUC-ROC score, respectively. Additionally, the BCD and ICCD datasets predominantly showed good transfer of in-domain knowledge across all the target tasks, despite the considerable difference in their sizes.

Table 3: AUC-ROC performance scores for in-domain transfer per dataset (CDS, SDNETv1, BCD, ICCD, MCDS, and CODEBRIM). The rows represent the upstream source model trained for the target task, and the columns show the results of in-domain transfer for each dataset. The bold entries depict the best performing models compared to random initialization (given in the first row). Upstream model architectures: CBR Tiny (CDS, SDNETv1, MCDS), CBR Small (ICCD), and VGG16 (BCD, CODEBRIM).

We also evaluate the combination of in-domain and cross-domain transfer for performance gains. Compared to discrete ImageNet or in-domain transfer, the combination shows further performance improvements for the majority of datasets, as shown with bold entries in Table 4. With the combination of in-domain and cross-domain transfer, it is difficult to remark on the usefulness of a specific dataset, since the performance scores across the different source datasets are very similar. When used as a source, most of the datasets show comparable performance irrespective of their size and number of classes.

Table 4: AUC-ROC performance scores for in-domain and ImageNet transfer (CDS, SDNETv1, BCD, ICCD, MCDS, and CODEBRIM). The rows represent the upstream models with in-domain and ImageNet representations, and the column entries show the results of transfer for each dataset. The bold entries depict the best performing models compared to transfer from ImageNet only (given in the first row). Upstream model architecture: VGG16 for all datasets except CODEBRIM, which uses ResNet50.
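For reference, the AUC-ROC score used throughout these tables has a simple rank-based formulation: it equals the probability that a randomly chosen positive example is scored above a randomly chosen negative one. A minimal pure-Python sketch (not the implementation used in this work, which can rely on any standard library):

```python
def auc_roc(y_true, scores):
    """Rank-based AUC-ROC: fraction of (positive, negative) pairs in which the
    positive example receives the higher score; ties count as half a win."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A score of 0.5 corresponds to a random classifier and 1.0 to a perfect ranking, matching the interpretation given in Section 4. This pairwise form is quadratic in the number of examples; practical implementations sort the scores instead.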
Bridge inspection datasets are typically much smaller than ImageNet due to the laborious process of acquiring labeled images. This lack of data necessitates the effective transfer of knowledge from pre-trained models. Thereby, to further explore the generalizability of in-domain and cross-domain representations in a low-data regime, we perform further experiments with a smaller number of training samples, using 5%, 10%, 20%, and 50% of the datasets.

Figure 3 reports the AUC-ROC performance scores for the different initialization settings with fewer training samples per dataset. Transfer learning, either with in-domain or cross-domain datasets, provides significant performance gains on very small datasets compared to learning from scratch. The combination of in-domain and ImageNet transfer shows strong performance with as few as 5% of the training samples.

Figure 3: AUC-ROC score on the test set after training with different initializations (random initialization, ImageNet, in-domain, and in-domain + ImageNet) over a limited number of training samples for (a) CDS, (b) SDNETv1, (c) BCD, (d) ICCD, (e) MCDS, and (f) CODEBRIM.

Despite the gap between the ImageNet dataset, with millions of images, and the largest in-domain dataset, with 60,000 images, their transfer shows comparable performance for the binary datasets. For the multi-label datasets, i.e., MCDS and CODEBRIM, the impact of the different initializations on the performance scores is notable. For the CODEBRIM dataset, in-domain transfer performs notably well compared to ImageNet when only 5% to 20% of the data is used. In contrast, with an increasing number of training samples, ImageNet surpasses the other initializations.

This section also compares against the baselines to assess the performance of the proposed transfer learning strategies. Table 5 presents the accuracy scores for each dataset and provides the comparison to the baselines. Transfer learning with a combination of the in-domain and cross-domain strategies shows improved results for at least four datasets, as depicted by bold entries. For the other two datasets, i.e., [32] and [20], a direct comparison is not feasible due to the disparity in research methodologies. Mundt et al. [32] treated the damage classification problem in a multi-label multi-target setting, in which an exact match of the predicted and actual multi-labels is ensured. Brilakis et al. [20] dealt with damage classification in a multi-stage multi-classification manner, where several multi-class and binary classifiers are defined. Instead, in our approach, we develop a single classifier for the detection of multiple damages. Besides transfer learning, the results provided in Table 5 can be used for future performance comparisons and improvements on the publicly available visual inspection datasets.

Table 5:
Accuracy scores of the transfer learning strategies and comparisons with baselines.

                  CDS   SDNETv1    BCD        ICCD       MCDS    CODEBRIM
Rand. Init.       82%   94%        98%        98%        68%     82%
Cross-domain      86%   95%        98%        98%        76%
In-domain         83%   95%                   98%        69%     82%
Combination       87%   96%                              79%
Previous studies  -     92% [29]   96% [49]   99% [26]   [20]*   72% [32]†

* Multiple (binary and multi-class) classifiers approach. † Multi-target accuracy score.
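The low-data experiments above draw class-balanced fractions of each training set. A possible sketch of such stratified subsampling is shown below; the exact sampling procedure is not specified in this work, so this is an assumption:

```python
import random

def stratified_subset(samples, labels, frac, seed=0):
    """Draw a `frac` fraction of the training set, preserving class proportions."""
    rng = random.Random(seed)
    by_class = {}
    for s, l in zip(samples, labels):
        by_class.setdefault(l, []).append(s)
    sub_samples, sub_labels = [], []
    for label, items in by_class.items():
        k = max(1, round(frac * len(items)))  # keep at least one example per class
        for s in rng.sample(items, k):
            sub_samples.append(s)
            sub_labels.append(label)
    return sub_samples, sub_labels

# e.g., the 5%, 10%, 20%, and 50% regimes of Figure 3 (X_train/y_train hypothetical):
# for frac in (0.05, 0.10, 0.20, 0.50):
#     Xs, ys = stratified_subset(X_train, y_train, frac)
```

Stratification matters here because several of the inspection datasets are heavily imbalanced; a uniform random subsample at 5% could otherwise lose a minority class entirely.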
Transparency and interpretability of predictive models are indispensable to enable their use in practice.Deep neural networks are well-known for exploiting millions of parameters, processed by severalnon-linear functions and pooling layers to learn optimal weights. It is intractable for humans to followthe exact mapping of data from input to classification through these complex multi-layered networks.To encourage CNN’s usage in practice, the interpretability and explainability of deep neural networksare increasingly popular and an active research area. Several model-agnostic, visual explanations andexample-based methods have recently been proposed [10, 31].In this study, we employed gradient-weighted class activation mapping (Grad-CAM) to visualize andlocalize the areas of an input image that are most important for the models’ prediction. Grad-CAMis a class-discriminative localization technique that, unlike its predecessor [57], does not requireany change in the CNNs architecture or retraining for generating visual explanations [43]. Severalstudies have shown that the deeper layers of CNNs capture high-level abstract constructs; therefore,Grad-CAM utilizes the gradient information flowing into the last convolution layer to evaluate eachneuron’s importance. We apply importance scores with logistic regression and a non-linear activationfunction to obtain coarse heatmaps for discriminative class mappings.Figure 4 shows the visual explanation of the fine-tuned VGG model, which has pre-dominantlyperformed well for most of the damage detection datasets (see Table 2 for results). We utilized thetf-explain implementation of Grad-CAM to generate visual mapping [40]. In addition to the lastconvolution layer of CNN, Grad-CAM can be used to extract discriminative class mapping with anyintermediate convolution layer. We experimented with various convolution and pooling layers togenerate visual explanations, as shown in Figure 4. 
The visual explanations with the correct predicted class validate that the model is paying attention to the right areas of an image to identify and localize a specific damage category.

Figure 4: Visual explanations for classification and localization of the VGG16 models on all six datasets. Grad-CAM highlights class-discriminative features, and the heatmap localizes the specific damage regions. (a) CDS — ground truth: potentially unhealthy; prediction: potentially unhealthy (0.99). (b) SDNETv1 — ground truth: cracked deck; prediction: cracked deck (0.99). (c) BCD — ground truth: uncracked; prediction: cracked (0.90). (d) ICCD — ground truth: cracked; prediction: cracked (0.99). (e) MCDS — ground truth: spallation & no exposed reinforcement; prediction: spallation (0.98) & no exposed reinforcement (0.99). (f) CODEBRIM — ground truth: spallation, exposed bars, & corrosion stain; prediction: spallation (0.99), exposed bars (0.99) & corrosion stain (0.99).

It is interesting to notice Figure 4c, where the model misclassified a healthy patch of concrete as cracked by assigning higher weights to the textured concrete patch. The visual interpretations of CNN models reveal why the models predict what they predict. The visual explanations also localize the specific damage in an input image without any additional labeling or segmentation activity. By revealing the decision-logic of these complex models, decision-makers and infrastructure managers can trust them for automatic damage detection tasks.
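The Grad-CAM computation described above reduces to a few array operations: global-average-pool the gradients of the class score to obtain per-channel importance weights, take the weighted sum of the feature maps, and apply a ReLU. A minimal NumPy sketch (illustrative only — in our experiments we relied on the tf-explain implementation [40]; the function and variable names here are our own) could look like:

```python
import numpy as np

def grad_cam(feature_maps: np.ndarray, gradients: np.ndarray) -> np.ndarray:
    """Coarse Grad-CAM heatmap from a conv layer's activations (H, W, K)
    and the gradients of the class score w.r.t. those activations."""
    # alpha_k: global-average-pool the gradients -> per-channel importance
    weights = gradients.mean(axis=(0, 1))                      # shape (K,)
    # weighted sum of feature maps, then ReLU to keep positive evidence only
    cam = np.maximum((feature_maps * weights).sum(axis=-1), 0.0)
    # normalize to [0, 1] so the map can be rendered as a heatmap overlay
    if cam.max() > 0:
        cam = cam / cam.max()
    return cam
```

The coarse (H, W) map is then upsampled to the input resolution and overlaid on the image; using an earlier convolution layer in place of the last one yields the intermediate-layer explanations discussed above.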
This paper presented transfer learning strategies and comprehensive experiments to evaluate their usefulness for damage detection tasks in bridges. We compared different initialization settings, namely random initialization, in-domain, cross-domain, and their combination, for transfer learning on six publicly available visual inspection datasets of concrete bridges. We found that cross-domain transfer yields performance improvements only when fine-tuned for the target damage detection task. Our main message is that the combination of in-domain and cross-domain representations provides enhanced performance compared to their stand-alone versions. Additionally, in contrast to pre-trained ImageNet models, in-domain transfer provides training efficiency and flexibility in selecting relatively smaller yet powerful CNN architectures. In our exploration of learning for tasks with a limited number of training samples, in-domain and ImageNet representations show comparable performance. The results demonstrate considerable performance gains when in-domain and cross-domain (ImageNet) representations are used jointly for the target task.

We further assessed the best performing ImageNet models by developing visual explanations using gradient-weighted class activation mapping. The visual explanations show the class-discriminative regions of an input that are most important for a specific damage prediction. Additionally, the interpretability of the models' predictions localizes the specific damages and enables decision-makers to understand the underlying intrinsic decision-logic of the neural model. Such visual exploration of damage detection and localization also encourages the use of predictive models in practice.

Acknowledgements
This research has been partially funded by NWO under the grant PrimaVera NWA.1160.18.238.
Reproducibility
The source code and supplementary material will be made publicly available upon publication.
References

[1] O. Abdeljaber, O. Avci, M. S. Kiranyaz, B. Boashash, H. Sodano, and D. J. Inman. 1-D CNNs for structural damage detection: Verification on a structural health monitoring benchmark data. Neurocomputing, 275:1308–1317, 2018.
[2] D. Agdas, J. A. Rice, J. R. Martinez, and I. R. Lasa. Comparison of visual inspection and structural-health monitoring as bridge condition assessment methods. Journal of Performance of Constructed Facilities, 30(3):04015049, 2016.
[3] M. Azimi and G. Pekcan. Structural health monitoring using extremely compressed data through deep learning. Computer-Aided Civil and Infrastructure Engineering, 2019.
[4] W. Cao, Q. Liu, and Z. He. Review of pavement defect detection methods. IEEE Access, 8:14531–14544, 2020.
[5] T. A. Carr, M. D. Jenkins, M. I. Iglesias, T. Buggy, and G. Morison. Road crack detection using a single stage detector based deep neural network. In , pages 1–5. IEEE, 2018.
[6] J. Deng, W. Dong, R. Socher, L. Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In , pages 248–255, 2009.
[7] S. Dorafshan and M. Maguire. Bridge inspection: human performance, unmanned aerial systems and automation. Journal of Civil Structural Health Monitoring, 8(3):443–476, 2018.
[8] A. Ellenberg, A. Kontsos, F. Moon, and I. Bartoli. Bridge deck delamination identification from unmanned aerial vehicle infrared imagery. Automation in Construction, 72:155–165, 2016.
[9] European Commission. Transport in the European Union, current trends and issues. Mobility and Transport, 2018. Accessed on: 05 May, 2020. Available at: https://bit.ly/2BuDRpy.
[10] F. Fan, J. Xiong, and G. Wang. On interpretability of artificial neural networks. Preprint at https://arxiv.org/abs/2001.02522, 2020.
[11] T. Fawcett. An introduction to ROC analysis. Pattern Recognition Letters, 27(8):861–874, 2006.
[12] C. Feng, H. Zhang, S. Wang, Y. Li, H. Wang, and F. Yan. Structural damage detection using deep convolutional neural network and transfer learning. KSCE Journal of Civil Engineering, 23(10):4493–4502, 2019.
[13] O. Fink, Q. Wang, M. Svensén, P. Dersin, W.-J. Lee, and M. Ducoffe. Potential, challenges and future directions for deep learning in prognostics and health management applications. Engineering Applications of Artificial Intelligence, 92:103678, 2020.
[14] R. Geirhos, P. Rubisch, C. Michaelis, M. Bethge, F. A. Wichmann, and W. Brendel. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. arXiv preprint arXiv:1811.12231, 2018.
[15] X. Han, H. Laga, and M. Bennamoun. Image-based 3D object reconstruction: State-of-the-art and trends in the deep learning era. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
[16] K. He, R. Girshick, and P. Dollár. Rethinking ImageNet pre-training. In Proceedings of the IEEE International Conference on Computer Vision, pages 4918–4927, 2019.
[17] Highway Agency. Inspection manual for highway structures, volume 1. TSO, 2007.
[18] P. Huethwohl. Cambridge bridge inspection dataset, 2017. Available at: 10.17863/CAM.13813.
[19] M. Huh, P. Agrawal, and A. A. Efros. What makes ImageNet good for transfer learning? arXiv preprint arXiv:1608.08614, 2016.
[20] P. Hüthwohl, R. Lu, and I. Brilakis. Multi-classifier for reinforced concrete bridge defects. Automation in Construction, 105:102824, 2019.
[21] S. Jégou, M. Drozdzal, D. Vazquez, A. Romero, and Y. Bengio. The one hundred layers tiramisu: Fully convolutional DenseNets for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 11–19, 2017.
[22] H. Khodabandehlou, G. Pekcan, and M. S. Fadali. Vibration-based structural condition assessment using convolution neural networks. Structural Control and Health Monitoring, 26(2):e2308, 2019.
[23] S. Kornblith, J. Shlens, and Q. V. Le. Do better ImageNet models transfer better? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2661–2671, 2019.
[24] S. L. Lau, X. Wang, Y. Xu, and E. K. Chong. Automated pavement crack segmentation using fully convolutional U-Net with a pretrained ResNet-34 encoder. arXiv preprint arXiv:2001.01912, 2020.
[25] B. Lei, N. Wang, P. Xu, and G. Song. New crack detection method for bridge inspection using UAV incorporating image processing. Journal of Aerospace Engineering, 31(5):04018058, 2018.
[26] S. Li and X. Zhao. Image-based concrete crack detection using convolutional neural network and exhaustive search technique. Advances in Civil Engineering, 2019.
[27] M. Long, Y. Cao, J. Wang, and M. Jordan. Learning transferable features with deep adaptation networks. In International Conference on Machine Learning, pages 97–105. PMLR, 2015.
[28] H. Maeda, Y. Sekimoto, T. Seto, T. Kashiyama, and H. Omata. Road damage detection using deep neural networks with images captured through a smartphone. arXiv preprint arXiv:1801.09454, 2018.
[29] M. Maguire, S. Dorafshan, and R. J. Thomas. SDNET2018: A concrete crack image dataset for machine learning applications, 2018.
[30] C. R. Middleton and F. Lea. Reliability of visual inspection of highway bridges. Federal Highway Administration Research and Technology, 2002.
[31] C. Molnar. Interpretable Machine Learning. Open source, 2019. https://christophm.github.io/interpretable-ml-book/.
[32] M. Mundt, S. Majumder, S. Murali, P. Panetsos, and V. Ramesh. Meta-learning convolutional neural architectures for multi-target concrete defect classification with the concrete defect bridge image dataset. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[33] M. Neumann, A. S. Pinto, X. Zhai, and N. Houlsby. In-domain representation learning for remote sensing. arXiv preprint arXiv:1911.06721, 2019.
[34] T. Omar and M. L. Nehdi. Remote sensing of concrete bridge decks using unmanned aerial vehicle infrared thermography. Automation in Construction, 83:360–371, 2017.
[35] S. J. Pan and Q. Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, 2010.
[36] B. M. Phares, G. A. Washer, D. D. Rolander, B. A. Graybeal, and M. Moore. Routine highway bridge inspection condition documentation accuracy and reliability. Journal of Bridge Engineering, 9(4):403–413, 2004.
[37] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683, 2019.
[38] M. Raghu, C. Zhang, J. Kleinberg, and S. Bengio. Transfusion: Understanding transfer learning for medical imaging. In Advances in Neural Information Processing Systems, pages 3342–3352, 2019.
[39] Railway Accident Investigation Unit. Malahide Viaduct Collapse on the Dublin to Belfast Line, Aug 2009. Accessed on: 07 Feb., 2021.
[40] Raphaël. Interpretability methods for tf.keras models with TensorFlow 2.x. https://github.com/sicara/tf-explain, 2019.
[41] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6):1137–1149, 2016.
[42] Y. Ren, J. Huang, Z. Hong, W. Lu, J. Yin, L. Zou, and X. Shen. Image-based concrete crack detection in tunnels using deep fully convolutional networks. Construction and Building Materials, 234:117367, 2020.
[43] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, pages 618–626, 2017.
[44] M. Słoński. A comparison of deep convolutional neural networks for image-based detection of concrete surface cracks. Computer Assisted Methods in Engineering and Science, 26(2):105–112, 2019.
[45] A. Tamkin, T. Singh, D. Giovanardi, and N. Goodman. Investigating transferability in pretrained language models. arXiv preprint arXiv:2004.14975, 2020.
[46] L. Torrey and J. Shavlik. Transfer learning. In Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques, pages 242–264. IGI Global, 2010.
[47] U.S. Department of Transportation. I-35 Bridge Collapse, Minneapolis, Sep 2012.
[48] Neural Computing and Applications, pages 1–14, 2019.
[49] H. Xu, X. Su, Y. Wang, H. Cai, K. Cui, and X. Chen. Automatic bridge crack detection using a convolutional neural network. Applied Sciences, 9(14), 2019.
[50] L. Yang, B. Li, W. Li, Z. Liu, G. Yang, and J. Xiao. A robotic system towards concrete structure spalling and crack database. In , pages 1276–1281. IEEE, 2017.
[51] S. Yang and Y. Huang. Damage identification method of prestressed concrete beam bridge based on convolutional neural network. Neural Computing and Applications, pages 1–11, 2020.
[52] X. Ye, T. Jin, and C. Yun. A review on deep learning-based structural health monitoring of civil infrastructures. Smart Structures and Systems, 24(5):567–585, 2019.
[53] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson. How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems, pages 3320–3328, 2014.
[54] Y. Yu, Z. Gong, P. Zhong, and J. Shan. Unsupervised representation learning with deep convolutional neural network for remote sensing images. In International Conference on Image and Graphics, pages 97–108. Springer, 2017.
[55] C. Zhang, C.-C. Chang, and M. Jamshidi. Bridge damage detection using a single-stage detector and field inspection images. arXiv preprint arXiv:1812.10590, 2018.
[56] X. Zhang, D. Rajan, and B. Story. Concrete crack detection using context-aware deep semantic segmentation network. Computer-Aided Civil and Infrastructure Engineering, 34(11):951–971, 2019.
[57] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2921–2929, 2016.
[58] F. Zhou, G. Liu, F. Xu, and H. Deng. A generic automated surface defect detection based on a bilinear model. Applied Sciences, 9(15), 2019.
[59] J. Zhu, C. Zhang, H. Qi, and Z. Lu. Vision-based defects detection for bridges using transfer learning and convolutional neural networks.