Single View Distortion Correction using Semantic Guidance
Szabolcs-Botond Lőrincz, Szabolcs Pável, Lehel Csató
Faculty of Mathematics and Informatics, Babeș-Bolyai University of Cluj-Napoca, [email protected]
Robert Bosch SRL, Cluj-Napoca, Romania, {fixed-term.szabolcs.lorincz, szabolcs.pavel}@ro.bosch.com

Abstract — Most distortion correction methods focus on simple forms of distortion, such as radial or linear distortions. These works undistort images either based on measurements in the presence of a calibration grid [1]–[3], or use multiple views to find point correspondences and predict distortion parameters [4]–[6]. When the possible distortions are more complex, e.g. in the case of a camera placed behind a refractive surface such as glass, the standard method is to use a calibration grid [7], [8]. Considering a high variety of distortions, it is nonviable to conduct these measurements. In this work, we present a single view distortion correction method which is capable of undistorting images containing arbitrarily complex distortions, by exploiting recent advancements in differentiable image sampling introduced in [9] and in the usage of semantic information to augment various tasks. The results of this work show that our model is able to estimate and correct highly complex distortions, and that incorporating semantic information facilitates the process of image undistortion.
I. INTRODUCTION
One way intelligent systems are able to perceive and interact with complex environments is vision; thus, they are highly reliant on a wide variety of computer vision algorithms, such as object detection, semantic segmentation, or depth estimation. The propagation of errors caused by geometric image distortions has a negative effect on the accuracy of these algorithms, therefore it is critical to correct them.

Camera based driver assistance and autonomous driving systems are no exception, as the camera is usually placed behind the vehicle's windshield, which typically consists of two curved sheets of glass with a plastic layer laminated between them. The curvature, deviation in thickness, and inconsistency in the parallelism of the two surfaces cause geometric distortions. Measuring ground truth distortions caused by various glass surfaces requires laboratory setups, making the collection of large training sets, and as a consequence standard supervised learning, unfeasible.

Our contribution is threefold. First, we present a scalable deep learning approach that can correct arbitrarily complex nonlinear distortions. Second, we construct two data sets comprising real-world (KITTI odometry [10]) and synthesized (Carla [11]) images and corresponding semantic segmentations, on which we apply parametric distortions sampled from a distribution derived from real-world measurements in the presence of different windshields. Third, we train our network in an end-to-end manner without using hard-to-obtain ground truth distortions as supervision, and instead leverage recent advancements in differentiable image sampling to formulate a loss based on the Multi-Scale Structural Similarity Index Metric (MS-SSIM) [12].

Our experiments on both data sets show that our model is able to estimate highly complex distortions. Moreover, the network not only estimates the distortions, but also directly produces the undistorted image and segmentation.

II. PREVIOUS WORK
In this section, we summarize the studies related to our approach. First, we give a brief overview of existing work addressing the problem of geometric distortion correction. Second, we present various successful use cases of semantic guidance, including distortion correction. Third, we describe current progress on spatial transformer networks and their applications, a spatial transformer being the main unit of our distortion correction system.
A. Distortion Correction
Most of the literature focuses on simple forms of distortions, such as radial distortions. According to [13], radial distortions are the main components of distortions caused by lenses, in addition to decentering distortions and thin prism distortions. There are two major approaches for correcting radial distortions in the literature. The first involves using point correspondences of two or more images [4]–[6], while the second is based on finding distorted straight lines in single images and estimating the distortion parameters [1]–[3].

In contrast with these methods, in [14] a CNN based model is introduced, which predicts radial distortion parameters based on single input images. In [15], fisheye distortion parameters are estimated by CNNs. Notably, the networks in both [14] and [15] are trained on synthetically distorted images, yet they are demonstrated to achieve similarly good results when undistorting images containing real distortions.

Only a few studies have conducted experiments with more complex distortions. In [7], a calibration grid is used to measure aircraft windscreen distortions, and a decision-tree-based classifier is introduced which classifies the distortions as acceptable or not. In [16] and [17], the distortions caused by a car's front window are estimated with the help of a calibration grid in order to create head-up displays (HUD), which show important information directly in the field of view of the driver. The disadvantage of utilizing a calibration grid is that the measurements need to be conducted for each kind of distortion separately, hence the method is not scalable.

B. Semantic Guidance
Semantic segmentation is understanding an image at pixel level, i.e., an object class (e.g. car, road, pedestrian, ...) is assigned to each pixel in a given image. A number of studies have investigated the potential utilization of semantic labels for solving various problems.

In [18], an end-to-end deep convolutional neural network is proposed, which learns to capture semantic information and uses that information for image harmonization. A multi-context embedding network, which integrates high-level semantic labels and low-level image details, is proposed for automatic shadow removal from single images in [19]. Single image depth estimation is achieved in [20] by first performing a semantic segmentation of the scene and using the semantic labels to guide the 3D reconstruction. The task of optical flow estimation is also enhanced by modelling motions as a function of the classes present in the images [21].

Knowing semantic labels can also aid the correction of distortions, because objects of different classes have different geometric properties (e.g. buildings have straight borders, while cars usually do not). This property is exploited in [15] to facilitate estimating fisheye distortion parameters and correcting the distortions.
C. Spatial Transformer Module
Spatial transformers, introduced in [9], are modules which can be incorporated into any CNN to augment various problems, and are composed of three parts. The first part is the localization network, which takes an input feature map or input image and produces the parameters of the chosen transformation. The second generates a sampling grid based on the predicted transformation parameters. The third part is a differentiable sampler, which transforms the input using the generated sampling grid. The task of these modules is to achieve real spatial invariance by automatically transforming input images or feature maps to a prototype instance before they are used for classification or other tasks. Recent successful use-cases of spatial transformers include handwritten digit classification [9] on the distorted MNIST [22] data set, recognition of sequences of numbers [9] on Street View House Numbers (SVHN) [23], and scene text recognition [24].
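To make the three-part structure concrete, the following is a minimal, self-contained PyTorch sketch of a spatial transformer, using a plain affine transform rather than the thin plate spline transform adopted later in this paper; the toy localization network and its layer sizes are illustrative choices, not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AffineSTN(nn.Module):
    """Three-part spatial transformer: (1) localization network,
    (2) grid generator, (3) differentiable sampler."""

    def __init__(self):
        super().__init__()
        # (1) Localization network: predicts the 6 affine parameters.
        self.loc = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, 6),
        )
        # Start from the identity transform ("no warping"), mirroring
        # the initialization strategy described in Section IV-C.
        self.loc[-1].weight.data.zero_()
        self.loc[-1].bias.data.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))

    def forward(self, x):
        theta = self.loc(x).view(-1, 2, 3)
        # (2) Grid generator: dense sampling grid over [-1, 1]^2.
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        # (3) Differentiable sampler (bilinear interpolation).
        return F.grid_sample(x, grid, align_corners=False)

x = torch.randn(2, 3, 64, 64)
y = AffineSTN()(x)   # same shape as x, differentiable w.r.t. theta
```

Swapping the affine grid for the thin plate spline grid of Section IV-A gives the kind of module used in the proposed architecture.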
III. DATA SETS
In order to demonstrate that our model is able to undistort both synthetic and real-world images, we construct two data sets, Distorted Carla (DC) and Distorted KITTI (DK).

Distorted Carla is composed of 10,000 synthetic images and their corresponding semantic labels, generated using the Carla driving simulator [11]. The labels provided by Carla are presented in Table I. We generate the images with autopilot turned on, at a fixed time-step of 0.2 seconds, using the weather preset ClearNoon, with 128 vehicles and 256 pedestrians spawned on the map Town01.

TABLE I
SEMANTIC LABELS PROVIDED BY CARLA

Value  Label
0      None
1      Buildings
2      Fences
3      Other
4      Pedestrians
5      Poles
6      Road lines
7      Roads
8      Sidewalks
9      Vegetation
10     Vehicles
11     Walls
12     Traffic signs

Distorted KITTI comprises 15,223 images originating from sequences 00 to 10 of the KITTI odometry [10] data set.

The geometric distortions are defined as a grid containing displacement vectors $\delta_i = [\delta_{x_i}, \delta_{y_i}]^\top$ for each pixel in the image. We apply the distortions synthetically on the images and corresponding semantic labels. As a second step, we interpolate the color values in the sampled images using bilinear interpolation, and the semantic labels using nearest-neighbour interpolation. The set of distortions used in our experiments is drawn from real-world windscreen distortion distribution data, covering a wide range of parameter settings.
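As an illustration of how such a data set can be distorted offline, here is a short SciPy sketch; the image sizes, the label count, and the sinusoidal toy displacement field are stand-ins for the windscreen-derived distortions described above.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def distort(image, labels, dx, dy):
    """Warp an image and its label map with a per-pixel displacement
    field (dx, dy): bilinear interpolation for color values (order=1),
    nearest-neighbour for semantic labels (order=0)."""
    h, w = labels.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    coords = np.stack([ys + dy, xs + dx])   # where each output pixel samples

    warped = np.stack([map_coordinates(image[..., c], coords, order=1)
                       for c in range(image.shape[-1])], axis=-1)
    warped_labels = map_coordinates(labels, coords, order=0)
    return warped, warped_labels

# Toy example: a smooth sinusoidal displacement field of ~2 px amplitude.
img = np.random.rand(128, 256, 3)
lab = np.random.randint(0, 13, (128, 256))
ys, xs = np.mgrid[0:128, 0:256]
dx = 2.0 * np.sin(ys / 20.0)
dy = 2.0 * np.cos(xs / 20.0)
out_img, out_lab = distort(img, lab, dx, dy)
```

Nearest-neighbour interpolation for the labels ensures that class ids are never blended into meaningless intermediate values.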
IV. OUR APPROACH

First, we introduce our parametric distortion model. Then, we specify the architecture of our deep network. We close this section with a description of how our model is trained.
A. Distortion Model
In [25] it is shown that a pair of thin plate splines (TPS), one representing the $x$-component and the other the $y$-component, forms a map from $\mathbb{R}^2$ to $\mathbb{R}^2$ which can model biological deformations, e.g. in the case of Apert syndrome. We model geometric distortions with thin plate spline pairs.

The transformed coordinates $f_{tps}(G_i)$ at image coordinate $G_i = [x_i, y_i]^\top$, assuming $n$ control points are given, are defined as

$$f_{tps}(G_i) = A \begin{bmatrix} G_i \\ 1 \end{bmatrix} + \sum_{k=1}^{n} \varphi(\| p'_k - G_i \|) \cdot w_k. \quad (1)$$

For our purpose, we use $n = 16$ control points, but it is possible to use more points to model arbitrarily complex distortions. The target control points $P' = [p'_1, p'_2, \ldots, p'_n] \in \mathbb{R}^{2 \times n}$ are in our case fixed and evenly distributed on a $4 \times 4$ grid, whereas the source control points $P = [p_1, p_2, \ldots, p_n] \in \mathbb{R}^{2 \times n}$ have to be localized on the distorted image in order to interpolate the displacements between them.

The first term of $f_{tps}(\cdot)$ is an affine transformation $A = [a_1, a_2, a_3] \in \mathbb{R}^{2 \times 3}$. The second term represents the non-affine deformation, where the radial basis kernel corresponding to the TPS transformation is $\varphi(r) = r^2 \log(r)$, with $r$ denoting the Euclidean distance between two points, while $W = [w_1, w_2, \ldots, w_n] \in \mathbb{R}^{2 \times n}$ is a warping coefficient matrix. The transformation parameter $\theta$ containing the two terms is calculated as

$$\theta = (W \mid A)^\top = L^{-1} \begin{bmatrix} P^\top \\ 0_{3 \times 2} \end{bmatrix}, \quad (2)$$

where $L^{-1}$ is the inverse of the padded kernel matrix $L$, which is computed based on the target control points and is given by

$$L = \begin{bmatrix} K & 1_{n \times 1} & P'^\top \\ 1_{1 \times n} & 0 & 0_{1 \times 2} \\ P' & 0_{2 \times 1} & 0_{2 \times 2} \end{bmatrix}. \quad (3)$$

Here, $K \in \mathbb{R}^{n \times n}$ is defined by

$$K = \begin{bmatrix} 0 & \varphi(r_{12}) & \cdots & \varphi(r_{1n}) \\ \varphi(r_{21}) & 0 & \cdots & \varphi(r_{2n}) \\ \vdots & \vdots & \ddots & \vdots \\ \varphi(r_{n1}) & \varphi(r_{n2}) & \cdots & 0 \end{bmatrix}, \quad (4)$$

where $r_{ij}$ denotes the Euclidean distance between target control points $p'_i$ and $p'_j$.

In order to undistort images, we need the displaced coordinates for each point in the undistorted reference grid $G = [G_1, G_2, \cdots, G_N]$, where $N$ is the total number of pixels in the undistorted image. Thus, we first define the matrix $K' \in \mathbb{R}^{N \times n}$, which contains the radial basis kernel values of the pairwise distances between undistorted grid points and target control points, and is defined by

$$K' = \begin{bmatrix} \varphi(r'_{11}) & \varphi(r'_{12}) & \cdots & \varphi(r'_{1n}) \\ \varphi(r'_{21}) & \varphi(r'_{22}) & \cdots & \varphi(r'_{2n}) \\ \vdots & \vdots & \ddots & \vdots \\ \varphi(r'_{N1}) & \varphi(r'_{N2}) & \cdots & \varphi(r'_{Nn}) \end{bmatrix}, \quad (5)$$

where $r'_{ij}$ denotes the Euclidean distance between grid point $G_i$ and target control point $p'_j$.

Let $\tau_\theta(G)$ be the distorted grid consisting of displaced coordinates, where $\tau_\theta$ is a transformation of choice parameterized by $\theta$. The distorted grid is given by

$$\tau_\theta(G) = \begin{bmatrix} K' & 1_{N \times 1} & G^\top \end{bmatrix} \theta. \quad (6)$$

Fig. 2. Sample distorted image (top) and undistorted image (bottom) from the Distorted Carla data set. The task is to localize source control points in the distorted image (large blue dots). Since the target control points (large green dots) are fixed, we are able to generate a sampling grid (small blue dots) using thin plate spline interpolation, based on which we can undistort the image.
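The following NumPy sketch transcribes Equations (1)–(6) directly, with points stored as rows (i.e., transposed relative to the $2 \times n$ column convention above); the control point layout in $[0, 1]^2$ and the random perturbation are illustrative only.

```python
import numpy as np

def phi(r):
    # TPS radial basis kernel phi(r) = r^2 log(r), with phi(0) = 0.
    return np.where(r == 0, 0.0, r**2 * np.log(np.maximum(r, 1e-12)))

def tps_theta(P_src, P_tgt):
    """Solve Eq. (2): theta = L^{-1} [P^T; 0], where L (Eq. 3) is built
    from the fixed target control points. Points have shape (n, 2)."""
    n = P_tgt.shape[0]
    K = phi(np.linalg.norm(P_tgt[:, None] - P_tgt[None, :], axis=-1))  # Eq. (4)
    L = np.zeros((n + 3, n + 3))
    L[:n, :n] = K
    L[:n, n] = 1.0
    L[:n, n + 1:] = P_tgt
    L[n, :n] = 1.0
    L[n + 1:, :n] = P_tgt.T                                            # Eq. (3)
    rhs = np.zeros((n + 3, 2))
    rhs[:n] = P_src
    return np.linalg.solve(L, rhs)          # (n+3, 2), i.e. (W | A)^T

def tps_grid(G, P_tgt, theta):
    """Eq. (6): tau_theta(G) = [K' | 1 | G] theta for grid points G (N, 2)."""
    Kp = phi(np.linalg.norm(G[:, None] - P_tgt[None, :], axis=-1))     # Eq. (5)
    ones = np.ones((G.shape[0], 1))
    return np.hstack([Kp, ones, G]) @ theta                            # (N, 2)

# 16 target control points, evenly spaced on a 4x4 grid in [0, 1]^2.
gx, gy = np.meshgrid(np.linspace(0, 1, 4), np.linspace(0, 1, 4))
P_tgt = np.stack([gx.ravel(), gy.ravel()], axis=1)
P_src = P_tgt + 0.02 * np.random.randn(16, 2)   # perturbed source points
theta = tps_theta(P_src, P_tgt)
ux, uy = np.meshgrid(np.linspace(0, 1, 64), np.linspace(0, 1, 64))
grid = tps_grid(np.stack([ux.ravel(), uy.ravel()], axis=1), P_tgt, theta)
```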
B. Proposed Architecture

We employ an end-to-end architecture which takes a single distorted image $I$ as input, and outputs the undistorted image $I'$ and its corresponding semantic labels (Figure 3). Our architecture processes the input in two steps: a feature extraction step and a distortion correction step.

Fig. 3. Schematic diagram of the proposed model, composed of three main processing units: the core network (green), the semantic segmentation network (red), and the spatial transformer module (light blue). First, low-level features are extracted using the core network. The semantic segmentation module then combines the input image with the upsampled low-level features and produces pixel-level semantic labels. The spatial transformer module fuses the extracted low-level features with the high-level semantics and the input image, and predicts the transformation parameters, based on which the sampling grid is generated and the output (undistorted) image and segmentation are produced.

Feature Extraction: First, low-level features are extracted by the core network, for which we use ResNet-18 [26] pre-trained on ImageNet [27], with the top two layers removed. The model includes a semantic segmentation network, as in [15], which provides high-level semantics for undistorting the images. The segmentation network takes the extracted feature maps with channel dimension 512 and the distorted image as input, and outputs high-level semantics. The feature maps are first upsampled using five resize-convolution layers [28]. Each of these layers upsamples the input feature map by a factor of two using nearest-neighbour sampling (Upx2), then applies a convolution followed by batch normalization [29] and a parametric rectified linear unit (PReLU) [30].

After five such blocks, the upsampled feature maps are concatenated with the input image and are further upsampled using two resize-convolution layers. They are then passed to a Conv4x4-BN-PReLU block. Finally, a Conv4x4 layer produces the pixel-level semantic labels, corresponding to the 13 classes which Carla provides as ground truth.

Distortion Correction: Once the control points have been localized, the model generates a sampling grid containing 2D pixel coordinates using Equation (6). Finally, a Spatial Transformer Sampler [9] takes the sampling grid, the distorted image, and the semantic segmentation to produce the undistorted image and segmentation (Figure 4).
Fig. 4. The undistorted grid points $G_i$ (green) are transformed to distorted grid points $\tau_\theta(G_i)$ (dark blue). The pixel values in the undistorted image are calculated by bilinear sampling from the nearby pixel values (light blue). The undistorted semantic labels are the semantic labels nearest to the transformed points.

Both the sampling grid generator and the sampler are differentiable; thus, end-to-end learning using gradient descent is possible.
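A minimal sketch of this final sampling step, assuming the sampling grid has already been converted to the normalized $[-1, 1]$ coordinates that PyTorch's `grid_sample` expects:

```python
import torch
import torch.nn.functional as F

def undistort(image, labels, grid):
    """One sampling grid, two interpolation modes, as in Fig. 4.
    image:  (B, 3, H, W) float tensor
    labels: (B, 1, H, W) integer label ids
    grid:   (B, H, W, 2) sampling locations in normalized [-1, 1] coords
    """
    out_img = F.grid_sample(image, grid, mode='bilinear', align_corners=True)
    out_lab = F.grid_sample(labels.float(), grid, mode='nearest',
                            align_corners=True).long()
    return out_img, out_lab
```

Both calls backpropagate gradients to the grid, which is what allows the control point localization network to be trained end-to-end.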
C. Model Training

In the section that follows, we detail our training method, including network initialization, the training loss function, and hyperparameters.
Network Initialization: The model parameters are initialized using the "He" uniform initialization described in [30], except for the last fully connected layer in the control point localization network, where the weights are set to zero and the biases are set to initially produce the target control points, similarly to [24].
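A minimal sketch of this initialization, assuming 16 control points on a 4 × 4 grid in normalized coordinates and a placeholder input feature dimension (both stand-ins for details not specified above):

```python
import torch
import torch.nn as nn

# 16 target control points, evenly spaced on a 4x4 grid in normalized
# [-1, 1] coordinates (the coordinate range is our assumption).
gx, gy = torch.meshgrid(torch.linspace(-1, 1, 4),
                        torch.linspace(-1, 1, 4), indexing='ij')
target_pts = torch.stack([gx.reshape(-1), gy.reshape(-1)], dim=1)  # (16, 2)

# Last layer of the control point localization network; the input
# feature dimension (256) is a placeholder.
fc = nn.Linear(256, 16 * 2)
nn.init.zeros_(fc.weight)                   # zero weights ...
with torch.no_grad():
    fc.bias.copy_(target_pts.reshape(-1))   # ... biases emit the targets,
# so at step 0 the predicted source points coincide with the target
# points and the initial warp is (approximately) the identity.
```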
Training Loss: We experiment with various settings of the training loss function. In order to enforce reconstruction of both the image and the semantic labels, we employ a joint loss.

For the former, we propose to use a reconstruction loss $L_r$ based on MS-SSIM [12], which involves computing single-scale structural similarity on multiple scales, measuring the similarity of two image patches in luminance, contrast, and structure. This metric is appropriate when training our model on data sets containing real-world distortions, since a ground truth sampling grid is hard to obtain in those scenarios. The proposed loss is given by

$$L_r = \frac{1 - \text{MS-SSIM}(I, I')}{2}. \quad (7)$$

Since we applied synthetic distortions, we are able to calculate the ground truth sampling grid for undistorting the images. Therefore, we investigate whether directly minimizing the grid loss gives better results than $L_r$ from Equation (7). The grid loss is formulated as

$$L_g = \frac{1}{N} \sum_{i=1}^{N} \| \tau_\theta(G_i) - \hat{\tau}_\theta(G_i) \|, \quad (8)$$

where $\tau_\theta(G)$ is the estimated sampling grid, while $\hat{\tau}_\theta(G)$ is the ground truth sampling grid.

For semantic segmentation, we use $L_s$, which we calculate as the mean of the pixel-wise cross-entropy between the ground truth distorted and predicted distorted semantic labels.

The final loss of our network is a weighted sum of all three losses:

$$L = L_r + \lambda_1 L_g + \lambda_2 L_s, \quad (9)$$

where $\{\lambda_i\}$ is a set of weighting coefficients balancing the loss functions. We set $\lambda_1 = 100$ and $\lambda_2 < 1$ empirically.
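A sketch of the combined loss of Equations (7)–(9); the `pytorch_msssim` package is an assumed third-party MS-SSIM implementation, and `lam_s` is a placeholder since the exact value of $\lambda_2$ is not specified above.

```python
import torch
import torch.nn.functional as F
from pytorch_msssim import ms_ssim   # assumed third-party MS-SSIM package

def total_loss(img_pred, img_gt, grid_pred, grid_gt, seg_logits, seg_gt,
               lam_g=100.0, lam_s=0.1):
    # Eq. (7): reconstruction loss in [0, 1], derived from MS-SSIM.
    l_r = (1.0 - ms_ssim(img_pred, img_gt, data_range=1.0)) / 2.0
    # Eq. (8): mean Euclidean norm between estimated and ground truth grids.
    l_g = torch.norm(grid_pred - grid_gt, dim=-1).mean()
    # L_s: mean pixel-wise cross-entropy for the semantic labels.
    l_s = F.cross_entropy(seg_logits, seg_gt)
    # Eq. (9): weighted sum; lam_s stands in for the unspecified lambda_2.
    return l_r + lam_g * l_g + lam_s * l_s
```

Dropping `l_g` from the sum corresponds to the configuration that needs no ground truth sampling grid, which is what makes training on real-world distortions possible.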
TABLE II
ORIGINAL DISTORTION NORM (PX): mean and standard deviation of the distortion norm, in pixels, for the Distorted Carla Test and Distorted KITTI Test sets.
Training Method: We use Adam [31] optimization and mini-batch gradient descent, with learning rates set separately for the core network and for the rest of the model. During fine-tuning on sequences 01 to 10 of Distorted KITTI, we set the learning rate of the semantic segmentation network to zero, because no ground truth semantic labels are provided.

V. EXPERIMENTAL RESULTS
We split Distorted Carla and Distorted KITTI into training and test sets. We train the networks on Distorted Carla Train (8,000 images) for 10 epochs, while we test the models on both Distorted Carla Test (2,000 images) and on sequence 00 of Distorted KITTI (4,539 images). We fine-tune the previously trained networks on sequences 01 to 10 of Distorted KITTI (10,684 images) for another 10 epochs and test the fine-tuned models on sequence 00.

Since we synthetically generate the applied distortions, we are able to measure the performance of our model quantitatively by calculating the residual distortion norm, measured in pixels, for each pixel of each image, and computing the mean and standard deviation over all pixels in all images. This is in contrast to previous work [15], which relied on image-similarity metrics such as the Peak Signal-to-Noise Ratio (PSNR) or MS-SSIM [12] rather than direct distortion measurements.

In Table II we report the mean and standard deviation of the distortion norms, in pixels, in the original distorted test images, which serve as a baseline for the residual distortion norms after correction. The mean and standard deviation of the residual distortion norms, in pixels, after distortion correction for different settings of our method are summarized in Table III.

TABLE III
RESIDUAL DISTORTION NORM (PX): mean and standard deviation of the residual distortion norm, in pixels, for configurations of the loss terms $L_r$, $L_g$, and $L_s$, with and without fine-tuning, evaluated on the Distorted Carla (DC) and Distorted KITTI (DK) test sets.

We conducted various experiments using different configurations of the training loss function. Employing the semantic loss reduces the residual error in general, except in the case where it is used alongside both the reconstruction and the grid loss when tested on Distorted KITTI.

The model which performs best on Distorted Carla Test uses the reconstruction loss and the segmentation loss, whereas on Distorted KITTI the best performing model uses the grid loss together with the segmentation loss. Among the fine-tuned models, the model which achieves the lowest mean residual distortion norm is also the one using the grid loss and the segmentation loss.

The fine-tuned models do not achieve significantly better performance when using the grid loss instead of the reconstruction loss, so it is possible to train our network without obtaining a ground truth sampling grid, which simplifies the training and usage of our model.
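The residual distortion norm described above takes only a few lines to compute; the grid shapes below are assumptions for illustration:

```python
import numpy as np

def residual_distortion_stats(pred_grids, gt_grids):
    """Mean and standard deviation of the per-pixel residual distortion
    norm in pixels, over all pixels of all images. Both inputs have
    shape (num_images, H, W, 2) and hold absolute pixel coordinates."""
    norms = np.linalg.norm(pred_grids - gt_grids, axis=-1)   # (N, H, W)
    return norms.mean(), norms.std()
```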
VI. CONCLUSION AND FUTURE WORK

In this work we have demonstrated a deep-network based system that can correct arbitrarily complex distortions, and have illustrated its accuracy on the highly nonlinear distortions caused by vehicles' windshields. We have also shown that, when trained on synthesized data, the model is able to generalize to real-world scenes. Additionally, we demonstrated that incorporating semantic information facilitates the process of image undistortion. It is possible to train our model using real-world distortions instead of synthetic distortions, since only the distorted image is needed as input, and the undistorted image as supervision for the fine-tuning phase. In the future, we want to test the model in real-world scenarios, and to quantify the impact of our algorithm on the performance of end-to-end computer vision pipelines.

ACKNOWLEDGMENT
The authors thank Robert Bosch SRL for technical support and for useful discussions.
REFERENCES

[1] A. Wang, T. Qiu, and L. Shao, "A simple method of radial distortion correction with centre of distortion estimation," Journal of Mathematical Imaging and Vision, vol. 35, no. 3, pp. 165–172, 2009.
[2] C. Bräuer-Burchardt and K. Voss, "Automatic lens distortion calibration using single views," in Mustererkennung 2000. Springer, 2000, pp. 187–194.
[3] B. Prescott and G. McLean, "Line-based correction of radial lens distortion," Graphical Models and Image Processing, vol. 59, no. 1, pp. 39–47, 1997.
[4] G. P. Stein, "Lens distortion calibration using point correspondences," in Computer Vision and Pattern Recognition, 1997. Proceedings., 1997 IEEE Computer Society Conference on. IEEE, 1997, pp. 602–608.
[5] A. W. Fitzgibbon, "Simultaneous linear estimation of multiple view geometry and lens distortion," in Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on, vol. 1. IEEE, 2001, pp. I–I.
[6] R. Hartley and S. B. Kang, "Parameter-free radial distortion correction with center of distortion estimation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 8, pp. 1309–1321, 2007.
[7] M. Dixon, R. Glaubius, P. Freeman, R. Pless, M. P. Gleason, M. M. Thomas, and W. D. Smart, "Measuring optical distortion in aircraft transparencies: a fully automated system for quantitative evaluation," Machine Vision and Applications, vol. 22, no. 5, pp. 791–804, Sep 2011. [Online]. Available: https://doi.org/10.1007/s00138-010-0258-z
[8] P. L. Wisely, "A digital head-up display system as part of an integrated autonomous landing system concept," in Enhanced and Synthetic Vision 2008, vol. 6957. International Society for Optics and Photonics, 2008, p. 69570O.
[9] M. Jaderberg, K. Simonyan, A. Zisserman et al., "Spatial transformer networks," in NIPS, 2015, pp. 2017–2025.
[10] A. Geiger, P. Lenz, and R. Urtasun, "Are we ready for autonomous driving? The KITTI vision benchmark suite," in Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
[11] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun, "CARLA: An open urban driving simulator," in Proceedings of the 1st Annual Conference on Robot Learning, 2017, pp. 1–16.
[12] Z. Wang, E. P. Simoncelli, and A. C. Bovik, "Multiscale structural similarity for image quality assessment," in Asilomar Conference on Signals, Systems & Computers, vol. 2, 2003, pp. 1398–1402.
[13] Z. Zhang, "A flexible new technique for camera calibration," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, 2000.
[14] J. Rong, S. Huang, Z. Shang, and X. Ying, "Radial lens distortion correction using convolutional neural networks trained with synthesized images," in Computer Vision – ACCV 2016, S.-H. Lai, V. Lepetit, K. Nishino, and Y. Sato, Eds. Cham: Springer International Publishing, 2017, pp. 35–49.
[15] X. Yin, X. Wang, J. Yu, M. Zhang, P. Fua, and D. Tao, "FishEyeRecNet: A multi-context collaborative deep network for fisheye image rectification," arXiv preprint arXiv:1804.04784, 2018.
[16] A. Sato, I. Kitahara, Y. Kameda, and Y. Ohta, "Visual navigation system on windshield head-up display," in Proc. 13th World Congress on Intelligent Transport Systems, CD-ROM, 2006.
[17] F. Wientapper, H. Wuest, P. Rojtberg, and D. Fellner, "A camera-based calibration for automotive augmented reality head-up-displays," in Mixed and Augmented Reality (ISMAR), 2013 IEEE International Symposium on. IEEE, 2013, pp. 189–197.
[18] Y.-H. Tsai, X. Shen, Z. Lin, K. Sunkavalli, X. Lu, and M.-H. Yang, "Deep image harmonization," in IEEE CVPR, vol. 2, 2017.
[19] L. Qu, J. Tian, S. He, Y. Tang, and R. W. Lau, "DeshadowNet: A multi-context embedding deep network for shadow removal," in IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, no. 2, 2017, p. 3.
[20] B. Liu, S. Gould, and D. Koller, "Single image depth estimation from predicted semantic labels," in Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE, 2010, pp. 1253–1260.
[21] L. Sevilla-Lara, D. Sun, V. Jampani, and M. J. Black, "Optical flow with semantic segmentation and localized layers," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3889–3898.
[22] Y. LeCun, C. Cortes, and C. Burges, "MNIST handwritten digit database," AT&T Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, vol. 2, 2010.
[23] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng, "Reading digits in natural images with unsupervised feature learning," in NIPS Workshop on Deep Learning and Unsupervised Feature Learning, vol. 2011, no. 2, 2011, p. 5.
[24] B. Shi, X. Wang, P. Lyu, C. Yao, and X. Bai, "Robust scene text recognition with automatic rectification," in IEEE CVPR, 2016, pp. 4168–4176.
[25] F. L. Bookstein, "Principal warps: Thin-plate splines and the decomposition of deformations," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 11, no. 6, pp. 567–585, 1989.
[26] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[27] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, "ImageNet large scale visual recognition challenge," International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, Dec 2015. [Online]. Available: https://doi.org/10.1007/s11263-015-0816-y
[28] A. Odena, V. Dumoulin, and C. Olah, "Deconvolution and checkerboard artifacts," Distill, 2016. [Online]. Available: http://distill.pub/2016/deconv-checkerboard
[29] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in Proceedings of the 32nd International Conference on Machine Learning (ICML), 2015, pp. 448–456. [Online]. Available: http://dl.acm.org/citation.cfm?id=3045118.3045167
[30] K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1026–1034.
[31] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.