DeSTNet: Densely Fused Spatial Transformer Networks
Roberto Annunziata roberto.annunziata@onfido.com
Christos Sagonas christos.sagonas@onfido.com
Jacques Calì jacques.cali@onfido.com
Onfido Research, 3 Finsbury Avenue, London, UK
Abstract
Modern Convolutional Neural Networks (CNNs) are extremely powerful on a range of computer vision tasks. However, their performance may degrade when the data is characterised by large intra-class variability caused by spatial transformations. The Spatial Transformer Network (STN) is currently the method of choice for providing CNNs with the ability to remove those transformations and improve performance in an end-to-end learning framework. In this paper, we propose the Densely Fused Spatial Transformer Network (DeSTNet), which, to the best of our knowledge, is the first dense fusion pattern for combining multiple STNs. Specifically, we show how changing the connectivity pattern of multiple STNs from sequential to dense leads to more powerful alignment modules. Extensive experiments on three benchmarks, namely MNIST, GTSRB, and IDocDB, show that the proposed technique outperforms related state-of-the-art methods (i.e., STNs and CSTNs) both in terms of accuracy and robustness.

Accepted for publication at the 29th British Machine Vision Conference (BMVC 2018).

Introduction

Recently, significant progress has been made in several real-world computer vision applications, including image classification [13, 22], face recognition [32], object detection and semantic segmentation [12, 14, 31]. These breakthroughs are attributed to advances in CNNs [13, 16, 33], as well as the availability of huge amounts of data [21, 22] and computational power. In general, performance is adversely affected by intra-class variability caused by spatial transformations, such as affine or perspective; therefore, achieving invariance to these transformations is highly desirable. CNNs achieve translation equivariance through the use of convolutional layers. However, the filter response is not in itself transformation invariant. To compensate for this, max-pooling strategies are often applied [4, 22]. Pooling is usually performed on very small regions (e.g., 2 × 2 pixels), providing robustness only to small translations. Augmenting the training data with transformed samples is an alternative, but (i) the set of transformations must be defined a priori; and (ii) a large number of samples is required, thus reducing training efficiency.

Figure 1: DeSTNet - a stack of densely fused Spatial Transformer Networks.

Arguably, one of the best-known methods used to efficiently increase invariance to geometric transformations in CNNs is the Spatial Transformer Network (STN) [19]. The STN provides an end-to-end learning mechanism that can be seamlessly incorporated into a CNN to explicitly learn how to transform the input data to achieve spatial invariance. One might look at an STN as an attention mechanism that manipulates a feature map in a way that simplifies the input for some process downstream, e.g. image classification. For example, in [5] an STN was used in a supervised manner in order to improve the performance of a face detector. Similarly, a method based on STNs for performing face alignment and recognition simultaneously was introduced in [38]. Although the incorporation of the STN within CNNs led to state-of-the-art performance, its effectiveness can drop drastically in cases where the face is heavily deformed (e.g. due to facial expressions). To overcome this issue, Wu et al. [37] proposed multiple STNs linked in a recurrent manner. One of the main drawbacks when combining multiple STNs can be seen in the boundary pixels. Each STN samples the output image produced by the previous one, so as the image passes through multiple transforms the quality of the transformed image deteriorates. In cases where the initial bounding boxes are not sufficiently accurate, transformed images are heavily affected by the boundary effect, as shown in [25].
To overcome this, and inspired by the Lucas-Kanade (LK) algorithm [27], Lin and Lucey [25] proposed Compositional STNs (CSTNs) and their recurrent version, ICSTNs. CSTNs are made up of an STN variant (henceforth, p-STN) which propagates transformation parameters instead of transformed images.

In this work, building on the success of p-STNs, we present DeSTNet (Fig. 1), an end-to-end framework designed to increase spatial invariance in CNNs. Firstly, motivated by information-theoretic principles, we propose a dense fusion connectivity pattern for p-STNs. Secondly, we introduce a novel expansion-contraction fusion block for combining the predictions of multiple p-STNs in a dense manner. Finally, extensive experimental results on two public benchmarks and a non-public real-world dataset suggest that the proposed DeSTNet outperforms the state-of-the-art CSTN [25] and the original STN [19].

Related Work

Geometric transformations can be mitigated through the use of either (i) invariant or equivariant features, or (ii) encoding some form of attention mechanism. More traditional computer vision systems achieved this through the use of hand-crafted features such as HOG [9], SIFT [26] and SCIRD [1, 2] that were designed to be invariant to various transformations. In CNNs, translation equivariance is achieved through convolutions, and limited spatial invariance through pooling.

In [20], a method for creating scale-invariant CNNs was proposed. Locally scale-invariant representations are obtained by applying filters at multiple scales and locations, followed by max-pooling. Rotational invariance can be achieved by discretely rotating the filters [6, 7, 28] or the input images and feature maps [10, 23, 30]. Recently, a method providing continuous rotation robustness was proposed in [36]. To facilitate the translation-invariance property of CNNs, Henriques and Vedaldi [15] proposed to transform the image via a constant warp and then employ a simple convolution. Although this approach is simple and powerful, it requires prior knowledge of the type of transformation as well as the location inside the image where it is applied.

More related to our work are methods that encode an attention or detection mechanism. Szegedy et al. [35] introduced a detection system as a form of regression within the network to predict object bounding boxes and classification results simultaneously. Erhan et al. [11] proposed a saliency-inspired neural network that predicts a set of class-agnostic bounding boxes along with a likelihood of each box containing the object of interest. A few years later, He et al. [14] designed a network that performs a number of complementary tasks: classification, bounding-box prediction and object segmentation. The region proposal network within their model provided a form of learnt attention mechanism. For a more thorough review of object detection systems we point the reader to Huang et al. [17], who look at speed/accuracy trade-offs for modern detection systems.
Methodology

Let D = {I_1, I_2, ..., I_M} be a set of M images and {p_i}_{i=1}^M ∈ R^n the initial estimates of the distortion parameters for each image (n = 8 for the perspective warps considered here); this initial estimate may simply be the identity. Our goal is to reduce the intra-class variability due to the perspective transformations inherently applied to the images during capture. Achieving this goal has the potential to significantly simplify subsequent tasks, such as classification. To this end, we need to find the optimal parameters {p*_i}_{i=1}^M that warp all the images into a transformation-free space.

Arguably, the most notable method for finding the optimal parameters is the STN [19]. An STN is made up of three components, namely the localisation network, the grid generator and the sampler. The localisation network L is used to predict transformation parameters for a given input image I and initial parameters p_init, i.e. p = L(I, p_init); the grid generator and sampler are used for warping the image based on the computed parameters, i.e. I(W(p)) (Fig. 2(a)). By allowing the network to learn how to warp the input, it is able to gain geometric invariance, thus boosting task performance. When recovering larger transformations, a number of STNs can be stacked or used in combination with a recurrent framework (Fig. 2(b)). However, this tends to introduce boundary artifacts and image-quality degradation in the final transformed image, as each STN re-samples from an image that is the result of multiple warpings.

To address this, and inspired by the success of the LK algorithm for image alignment, Lin and Lucey [25] proposed compositional STNs (CSTNs). The LK algorithm is commonly used for alignment problems [3, 29] as it approximates the linear relationship between appearance and geometric displacement. Specifically, given two images I_1 and I_2 that are related by a parametric transformation W, the goal of LK is to find the optimal parameters that minimise the l2 norm of the error between the deformed version of I_1 and I_2:

    min_p || I_1(W(p)) - I_2 ||_2^2 .

Figure 2: (a) Spatial Transformer Network (STN) [19] and (b) a stack of STNs.

Figure 3: (a) Compositional STN (CSTN) [25] and (b) a stack of CSTNs.

Applying a first-order Taylor expansion to I_1, it has been shown that this problem can be optimised by an iterative algorithm with the following additive update rule at each iteration t:

    p_{t+1} = p_t + Dp_t .    (1)

In [25], Lin and Lucey introduced the CSTN, which predicts the parameter updates by employing a modified STN, which we refer to as p-STN, and then composes them as in Eq. (1). By incorporating the LK formulation, the resulting CSTN inherits the geometry-preserving property of LK. Unlike a stack of STNs, which propagates warped images to recover large displacements (Fig. 2(b)), a stack of CSTNs (Fig. 3(b)) propagates the warp parameters, in a similar fashion to the iterative process used in the LK algorithm.

Here, we extend the CSTN framework to improve the information flow in terms of parameter updates. In particular, we modify Eq. (1) and propose the additive dense-fusion update rule:

    p_{t+1} = p_t + f(Dp'_t, Dp'_{t-1}, ..., Dp'_1) ,    (2)

where the parameter update at iteration t, Dp_t, is now a function f : R^{n×t} -> R^n of the update predicted by the p-STN at iteration t, Dp'_t, and all the previous ones, {Dp'_i}_{i=1}^{t-1} (Fig. 1). Learning the fusion function f(·) at each iteration t means learning the posterior distribution p(Dp_t | Dp'_t, Dp'_{t-1}, ..., Dp'_1) for the parameter update Dp_t.
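As a toy illustration of the difference between the two update rules, Eq. (1) and Eq. (2) can be contrasted in NumPy; the fusion function below is a hypothetical stand-in for the learned fusion block, not the authors' implementation:

```python
import numpy as np

def cstn_compose(raw_updates):
    """CSTN, Eq. (1): each level's raw prediction is added directly
    (the fusion f is the identity mapping)."""
    p = np.zeros_like(raw_updates[0])
    for dp in raw_updates:
        p = p + dp
    return p

def destnet_compose(raw_updates, fuse):
    """DeSTNet, Eq. (2): the update applied at level t is a function of the
    current raw prediction AND all previous raw predictions."""
    p = np.zeros_like(raw_updates[0])
    history = []
    for dp in raw_updates:
        history.append(dp)
        p = p + fuse(history)
    return p

n = 8                                      # warp parameter dimension
raw = [np.full(n, float(t + 1)) for t in range(3)]
identity_fuse = lambda hist: hist[-1]      # keep only the latest raw update
assert np.allclose(destnet_compose(raw, identity_fuse), cstn_compose(raw))
```

Passing an identity-style fusion (returning only the latest raw update) makes the dense rule collapse to the CSTN rule, mirroring the special-case argument of Eq. (3).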
From an information-theoretic perspective, this amounts to predicting Dp_t with an uncertainty measured by the conditional entropy, H(Dp_t | Dp'_t, Dp'_{t-1}, ..., Dp'_1). We notice that the CSTN update in Eq. (1) is a special case of Eq. (2):

    p_{t+1} = p_t + f(Dp'_t) ,    (3)

where the parameter update at iteration t, Dp_t, is only a function of the update predicted by the t-th regressor (Dp'_t). In fact, no fusion has to be applied (i.e., f(·) is an identity mapping) and Dp_t = Dp'_t. In other words, the CSTN learns the distribution p(Dp_t) for the parameter update Dp_t at each iteration t. This amounts to predicting Dp_t with an uncertainty measured by the related entropy, H(Dp_t).

Figure 4: Fusion blocks. (a) The bottleneck-based fusion block proposed in [16]. (b) The proposed expansion-contraction fusion block used in Figure 1.

Invoking the well-known 'conditioning reduces entropy' principle from information theory [8], it can be shown that H(Dp_t | Dp'_t, Dp'_{t-1}, ..., Dp'_1) <= H(Dp_t). In other words, the update predictions in the proposed formulation are upper-bounded by those made with CSTN in terms of uncertainty. We advocate that this theoretical advantage can translate into better performance.

Inspired by the recent success of densely connected CNNs [16] and justified by the extension outlined above, we propose an alignment module which we call DeSTNet (Densely fused Spatial Transformer Network). DeSTNet consists of a cascade of p-STNs with a dense fusion connectivity pattern, as shown in Fig. 1. The fusion function, implemented by the fusion block F in Fig. 1, is adopted to combine the update predictions of all the previous p-STNs and estimate the best parameter update at each level t.
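The 'conditioning reduces entropy' inequality invoked above can be checked numerically on a toy discrete joint distribution (the numbers below are purely illustrative, not from the paper):

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a discrete distribution."""
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# Toy joint distribution of a binary "update" X and an observation Y.
joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])   # rows: X, columns: Y
p_x = joint.sum(axis=1)          # marginal of X
p_y = joint.sum(axis=0)          # marginal of Y

h_x = entropy(p_x)                                    # H(X) = 1 bit
h_x_given_y = sum(p_y[j] * entropy(joint[:, j] / p_y[j])
                  for j in range(joint.shape[1]))     # H(X|Y), about 0.72 bits
assert h_x_given_y <= h_x        # conditioning never increases entropy
```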
Unlike the fusion block adopted in [16], consisting of a single bottleneck layer (Fig. 4(a)), we advocate the use of an expansion-contraction fusion block (Fig. 4(b)). This solves the fusion task in a high-dimensional space and then maps the result back to the original one. Specifically, we concatenate all the previous parameter updates and project them with a 1 × 1 convolution layer with n · t input channels and k_F output channels (expansion), where n is the dimension of the warp parameters p, t = 1, ..., T is the level within DeSTNet, and k_F is the expansion rate. This is then followed by a 1 × 1 convolution layer with k_F input channels and n output channels (contraction), as shown in Fig. 4(b). We adopt tanh activations (non-linearities) after each convolutional layer of the fusion block to be able to predict both positive and negative parameter values. It is worth noting that the use of expansion layers is made possible by the relatively low dimension of each individual prediction (i.e., n = 8).

Experiments

In this section, we assess the effectiveness of the proposed DeSTNet in (i) adding spatial transformation invariance (up to perspective warps) to CNN-based classification models and (ii) planar image alignment. To this end, artificially distorted versions of two widely used datasets, namely the German Traffic Sign Recognition Benchmark (GTSRB) [34] and MNIST [24], are utilised. Furthermore, we evaluate the performance of DeSTNet on a non-public dataset of official identity documents (IDocDB), which includes substantially larger images (e.g. up to 6,… × …,910 pixels) and, more importantly, real perspective transformations. Additional results can be found in the supplementary material.
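A minimal numerical sketch of the expansion-contraction fusion block described above (our own illustration: on 1 × 1 spatial maps the two 1 × 1 convolutions reduce to plain matrix products, and the weights here are random stand-ins for learned ones):

```python
import numpy as np

def fusion_block(updates, w_exp, w_con):
    """Expansion-contraction fusion, a sketch of Fig. 4(b).

    updates: list of t predicted parameter updates, each an n-vector.
    The concatenated (n*t)-vector is expanded to k_F dimensions and then
    contracted back to n, with a tanh after each layer so that both
    positive and negative parameter values can be predicted.
    """
    h = np.concatenate(updates)        # shape (n*t,)
    h = np.tanh(w_exp @ h)             # expansion to k_F dimensions
    return np.tanh(w_con @ h)          # contraction back to n dimensions

n, t, k_f = 8, 3, 64                   # k_F = 64 is an arbitrary choice here
rng = np.random.default_rng(0)
w_exp = rng.normal(scale=0.1, size=(k_f, n * t))   # random stand-ins for
w_con = rng.normal(scale=0.1, size=(n, k_f))       # learned 1x1 conv weights
dp = fusion_block([rng.normal(size=n) for _ in range(t)], w_exp, w_con)
assert dp.shape == (n,)
```

The tanh after the contraction keeps each fused update bounded, which matches the text's motivation for choosing tanh over, say, ReLU non-linearities.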
GTSRB
  Model       Test error   Alignment architecture                              Classifier architecture
  CNN         8.29%        -                                                   conv7-6 | conv7-12 | P | conv7-24 | FC(200) | FC(43)
  STN         6.49%        conv7-6 | conv7-24 | FC(8)                          conv7-6 | conv7-12 | P | FC(43)
  CSTN-1      5.01%        [ conv7-6 | conv7-24 | FC(8) ] × 1                  conv7-6 | conv7-12 | P | FC(43)
  CSTN-2      3.18%        [ conv7-6 | conv7-24 | FC(8) ] × 2                  conv7-6 | conv7-12 | P | FC(43)
  CSTN-4      2.15%        [ conv7-6 | conv7-24 | FC(8) ] × 4                  conv7-6 | conv7-12 | P | FC(43)
  DeSTNet-4   1.35%        F{ [ conv7-6 | conv7-24 | FC(8) ] × 4 }             conv7-6 | conv7-12 | P | FC(43)

MNIST
  Model       Test error   Alignment architecture                              Classifier architecture
  CNN         6.60%        -                                                   conv3-3 | conv3-6 | P | conv3-9 | conv3-12 | FC(48) | FC(10)
  STN         4.94%        conv7-4 | conv7-8 | P | FC(48) | FC(8)              conv9-3 | FC(10)
  CSTN-1      3.69%        [ conv7-4 | conv7-8 | P | FC(48) | FC(8) ] × 1      conv9-3 | FC(10)
  CSTN-2      1.23%        [ conv7-4 | conv7-8 | P | FC(48) | FC(8) ] × 2      conv9-3 | FC(10)
  CSTN-4      1.04%        [ conv7-4 | conv7-8 | P | FC(48) | FC(8) ] × 4      conv9-3 | FC(10)
  DeSTNet-4   0.71%        F{ [ conv7-4 | conv7-8 | P | FC(48) | FC(8) ] × 4 } conv9-3 | FC(10)

Table 1: Test classification errors of the compared models on the GTSRB and MNIST datasets.
Traffic Signs:
We report experimental results on the GTSRB dataset [34], consisting of 39,209 training and 12,630 test colour images of 43 traffic signs taken under various real-world conditions, including motion blur, illumination changes and extremely low resolution. We adopt the image classification error as a proxy measure for alignment quality. Specifically, we build classification pipelines made up of two components: an alignment network followed by a classification one (detailed architectures are reported in Table 1). Both networks are jointly trained with the classification loss using standard back-propagation. For a fixed classification network, a lower classification error suggests better alignment (i.e., spatial transformation invariance). Following the experimental protocol in [25], we resize images to s × s, s = 36 pixels, and artificially distort them using a perspective warp. Specifically, the four corners of each image are independently and randomly scaled with Gaussian noise N(0, (σs)^2), then randomly translated with the same noise model.

In the first experiment, we follow the same setting adopted in [25] and train all the networks for 200,000 iterations with batches of 100 perturbed samples generated on the fly. For DeSTNet, we use the learning rate α_clf for the classification network and α_aln for the alignment network, the latter reduced by a factor of 10 after 100,000 iterations. For the proposed expansion-contraction fusion block we set the expansion rate k_F and use S = 0.9. Finally, images of both the train and test sets are perturbed using σ = 10%. From Table 1, we observe that alignment improves classification performance, irrespective of the specific alignment module, supporting the need for removing perspective transformations with which a standard CNN classifier would not be able to cope. Importantly, CSTN-1 achieves a lower classification error than the STN (5.01% vs 6.49%), supporting our

Notes for Tables 1 and 2: convD-C denotes a convolution layer with a D × D receptive field and C channels; P a max-pooling layer; FC a fully connected layer; F the fusion operation used in DeSTNet for combining the parameter updates; and F̃ the standard fusion operation of [16]. Convolution and max-pooling help with small transformations, but are not enough to cope with full perspective warps.
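The corner-perturbation protocol described above can be sketched as follows (a hypothetical re-implementation: the four corners are jittered with N(0, (σs)^2) noise, translated with the same noise model, and a homography is fitted with the Direct Linear Transform):

```python
import numpy as np

def fit_homography(src, dst):
    """Direct Linear Transform: fit the homography mapping src -> dst,
    given four or more point correspondences as (x, y) rows."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y, -u])
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y, -v])
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=float))
    h = vt[-1].reshape(3, 3)       # null vector of the DLT system
    return h / h[2, 2]

def random_perspective(s=36, sigma=0.10, rng=None):
    """Jitter the four image corners with N(0, (sigma*s)^2) noise, translate
    them with the same noise model, and fit the resulting homography."""
    rng = rng or np.random.default_rng()
    corners = np.array([[0, 0], [s - 1, 0], [s - 1, s - 1], [0, s - 1]], float)
    jitter = rng.normal(scale=sigma * s, size=(4, 2))    # per-corner scaling noise
    shift = rng.normal(scale=sigma * s, size=(1, 2))     # shared translation
    return fit_homography(corners, corners + jitter + shift)

# Sanity check: identical corners give (numerically) the identity homography.
corners = np.array([[0, 0], [35, 0], [35, 35], [0, 35]], float)
assert np.allclose(fit_homography(corners, corners), np.eye(3), atol=1e-6)
```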
GTSRB (test error at perturbation σ = 10% / 20% / 30%)
  Model           10%     20%     30%      Alignment architecture                                Classifier
  CSTN-4          6.86%   8.92%   13.72%   [ conv7-6 | conv7-24 | FC(8) ] × 4                    FC(43)
  DeSTNet-4 (F̃)  3.60%   4.65%   5.25%    F̃{ [ conv7-6 | conv7-24 | FC(8) ] × 4 }              FC(43)
  DeSTNet-4       3.04%   -       3.85%    F{ [ conv7-6 | conv7-24 | FC(8) ] × 4 }               FC(43)

MNIST (test error at perturbation σ = 10% / 20% / 30%)
  Model           10%     20%     30%      Alignment architecture                                Classifier
  CSTN-4          1.50%   2.39%   3.40%    [ conv7-4 | conv7-8 | P | FC(48) | FC(8) ] × 4        FC(10)
  DeSTNet-4 (F̃)  0.86%   0.89%   1.09%    F̃{ [ conv7-4 | conv7-8 | P | FC(48) | FC(8) ] × 4 }  FC(10)
  DeSTNet-4       0.66%   -       0.74%    F{ [ conv7-4 | conv7-8 | P | FC(48) | FC(8) ] × 4 }   FC(10)
Table 2: Test classification errors of the compared models using a single fully connected layer as classifier, under three perturbation levels, on the GTSRB and MNIST datasets.

choice of building DeSTNet from p-STNs. Moreover, using a cascade of four CSTNs further improves results. Finally, DeSTNet-4 outperforms CSTN-4, with an error of 1.35% down from 2.15%, which amounts to a relative improvement of 37%.

It is worth noting that (i) the perturbations in this experiment are relatively small (σ = 10%), and (ii) the CNN followed by a fully connected layer as classifier does not fully off-load the alignment task to the alignment network. This is due to the translation invariance and robustness to small transformations brought about by the convolution and pooling layers. Therefore, to further investigate the alignment quality of the state-of-the-art CSTN and DeSTNet, we use a single fully connected layer as the classification network and report performance under three perturbation levels, σ = {10%, 20%, 30%}, corresponding to a minimum noise standard deviation of 3.6 pixels. The results in Table 2 show that (i) DeSTNet yields an alignment quality that significantly simplifies the classification task compared to CSTN (i.e., up to 9.87% better classification performance for DeSTNet); (ii) DeSTNet exhibits robustness against stronger perturbation levels, with performance degrading by only 0.81% from 10% to 30% perturbation, while CSTN performance degrades by 6.86% in the same range; and (iii) the proposed expansion-contraction fusion block F leads to better performance w.r.t. the standard bottleneck layer F̃ proposed in [16]. Qualitative experimental results for CSTN and DeSTNet under different perturbation levels are reported in Fig. 5.

Figure 5: Qualitative comparison of CSTN-4 and DeSTNet-4 on the GTSRB dataset. Averages of the test traffic signs under perturbation levels (a) σ = 10%, (b) σ = 20% and (c) σ = 30%.
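The qualitative check used in Fig. 5, averaging aligned test images so that a sharper mean indicates better alignment, can be reproduced in miniature (our own toy example, scoring sharpness with gradient energy):

```python
import numpy as np

def average_sharpness(aligned):
    """Average a stack of aligned images and score the mean image's sharpness
    via gradient energy: better alignment -> sharper, more detailed mean.

    aligned: array of shape (N, H, W), one row per aligned test image.
    """
    mean_img = aligned.mean(axis=0)
    gy, gx = np.gradient(mean_img)
    return mean_img, float(np.mean(gx**2 + gy**2))

# Toy check: identical images (perfect alignment) yield a sharper mean
# than the same images randomly shifted (poor alignment).
rng = np.random.default_rng(0)
base = np.zeros((32, 32))
base[8:24, 8:24] = 1.0
perfect = np.stack([base] * 50)
shifted = np.stack([np.roll(base, rng.integers(-5, 6, 2), axis=(0, 1))
                    for _ in range(50)])
_, s_perfect = average_sharpness(perfect)
_, s_shifted = average_sharpness(shifted)
assert s_perfect > s_shifted
```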
Figure 6: Sample alignment results produced by the DeSTNet-4 model on three examples (rows) from (a) the GTSRB and (b) the MNIST datasets. Column 1: input image; columns 2-5: results obtained by applying the intermediate perspective transformations predicted at levels 1-4, respectively.

More specifically, the averages of the 43 traffic signs before and after convergence for CSTN-4 and DeSTNet-4 are shown. We observe that the average images produced by DeSTNet-4 are much sharper and show more detail (even for the 30% perturbation level, Fig. 5(c)) than the averages produced by CSTN-4; this is indicative of the better alignment performance of the proposed model. Fig. 6(a) illustrates aligned examples generated by DeSTNet-4.
Handwritten Digits:
For this experiment, we adopt the MNIST dataset [24], consisting of handwritten digits between 0 and 9, with a training set of 60,000 and a test set of 10,000 grayscale images (28 × 28 pixels). We adopt the same settings as for the GTSRB experiments, using the image classification error as a proxy measure for alignment quality. Training and test sets are distorted using the same perspective warp noise model (σ = 10%), and results are reported in Table 1. In line with the GTSRB experiments, (i) pre-alignment considerably improves classification performance, regardless of the specific alignment module used; (ii) a lower classification error is achieved when using CSTN-1 as compared to the STN, again supporting our choice of using p-STNs as the base STNs in DeSTNet; and (iii) although performance almost saturates with four CSTNs, DeSTNet is still able to squeeze out extra performance, outperforming CSTN-4 with an error of 0.71%, down from 1.04%, which is a relative improvement of 32%.

We further investigate the alignment quality of the state-of-the-art CSTN and DeSTNet when a single fully connected layer is used for classification, and report performance under three perturbation levels, corresponding to a minimum noise standard deviation of 2.8 pixels. From Table 2, we can see that (i) DeSTNet achieves an alignment quality that significantly simplifies the classification task compared to CSTN (i.e., up to 2.66% better classification performance for DeSTNet); (ii) DeSTNet exhibits robustness against stronger perturbation levels, with the classification performance degrading

Figure 7: Qualitative comparison of CSTN-4 and DeSTNet-4 on the MNIST dataset. Mean (top rows) and variance (bottom rows) of the 10 digits under perturbation levels (a) σ = 10%, (b) σ = 20% and (c) σ = 30%.
Figure 8: (a) Cumulative Error Distribution curves and (b) qualitative results obtained by CSTN-5 and DeSTNet-5 on IDocDB.

by only 0.08% from 10% to 30% perturbation, while CSTN performance degrades by 1.90% in the same range; and (iii) the proposed expansion-contraction fusion block further helps to reduce the classification test error.

Qualitative experimental results are reported in Fig. 7. In particular, the average and corresponding variance of all test samples, grouped by digit, are computed and shown for CSTN-4 and DeSTNet-4. Inspecting the images, we can see that the mean images generated by DeSTNet-4 are sharper than those of CSTN-4, while the variance images are thinner. This suggests that DeSTNet is more accurate and robust to different perturbation levels than CSTN. Finally, aligned images generated by DeSTNet-4 are displayed in Fig. 6(b).

Here, we show how DeSTNet can be successfully utilised for aligning planar images. To this end, we make use of our non-public official identity documents dataset (IDocDB), consisting of 1,000 training and 500 testing colour images collected under in-the-wild conditions. Specifically, each image contains a single identity document (UK Driving Licence V2015), and image sizes range from 422 × 215 to 6,… × …,910 pixels. In addition to typical challenges such as non-uniform illumination, shadows, and compression noise, several other aspects make this dataset challenging, including: considerable variations in resolution; highly variable backgrounds, which may include clutter and non-target objects; and occlusions, e.g. the presence of fingers covering part of the document when held for capture. The ground truth consists of the locations of the four corners of each document. From these points, we can compute a homography matrix that maps each document to a reference frame. The alignment task can then be solved by predicting the locations of the corner points in each input image. We train the networks using the smooth l1 loss [31] between the ground-truth and predicted corner coordinates.

We adopt the following experimental setting: we resize each image to 256 × 256 pixels for computational efficiency, as done for instance in [18, 33]. We set the learning rate for the localisation network to α_aln, which we reduce by a factor of 10 after 20,000 iterations. We use batches of 8 images for all the models. For the fusion blocks of DeSTNet, we set k_F = 256 and use S = 0.9. We assess the performance of DeSTNet and compare it with the state-of-the-art CSTN (the strongest baseline based on the presented experiments). Given the increased complexity of the task compared to MNIST and GTSRB, we built networks with five STNs for both CSTN and DeSTNet (architectures are reported in Table 1 of the supplementary material). For comparison, we use the average point-to-point Euclidean distance, normalised by each document's diagonal, between the ground-truth and predicted locations of the four corners. In addition, the Cumulative Error Distribution (CED) curve for each method is computed using the fraction of test images for which the average error is smaller than a threshold. The CED curves in Fig. 8(a) show that DeSTNet-5 outperforms CSTN-5 both in terms of accuracy and robustness. In fact, DeSTNet-5 achieves a higher AUC@0.08 (0.77 vs 0.…).

Conclusions

It is well-known that image recognition is adversely affected by spatial transformations, and increasing geometric invariance helps to improve performance. Although CNNs achieve some level of translation equivariance, they are still susceptible to large spatial transformations. In this paper, we addressed this problem by introducing DeSTNet, a stack of densely fused STNs that improves information flow in terms of warp parameter updates. Furthermore, we provided a novel fusion technique, demonstrating its improved performance in our problem setting. We showed the superiority of DeSTNet over the current state-of-the-art STN and its variant CSTN by conducting extensive experiments on two widely used benchmarks (MNIST, GTSRB) and a new non-public real-world dataset of official identity documents.
Acknowledgements.
We would like to thank all the members of the Onfido research team for their support and candid discussions.
References

[1] Roberto Annunziata and Emanuele Trucco. Accelerating convolutional sparse coding for curvilinear structures segmentation by refining SCIRD-TS filter banks. IEEE Transactions on Medical Imaging (IEEE-TMI), 35(11):2381–2392, 2016.
[2] Roberto Annunziata, Ahmad Kheirkhah, Pedram Hamrah, and Emanuele Trucco. Scale and curvature invariant ridge detector for tortuous and fragmented structures. In International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), pages 588–595. Springer, 2015.
[3] Simon Baker and Iain Matthews. Lucas-Kanade 20 years on: A unifying framework. International Journal of Computer Vision (IJCV), 56(3):221–255, 2004.
[4] Y-Lan Boureau, Jean Ponce, and Yann LeCun. A theoretical analysis of feature pooling in visual recognition. In Proceedings of the International Conference on Machine Learning (ICML), pages 111–118, 2010.
[5] Dong Chen, Gang Hua, Fang Wen, and Jian Sun. Supervised transformer network for efficient face detection. In Proceedings of the European Conference on Computer Vision (ECCV), pages 122–138. Springer, 2016.
[6] Taco Cohen and Max Welling. Group equivariant convolutional networks. In Proceedings of the International Conference on Machine Learning (ICML), pages 2990–2999, 2016.
[7] Taco S Cohen and Max Welling. Steerable CNNs. In Proceedings of the International Conference on Learning Representations (ICLR), 2017.
[8] Thomas M Cover and Joy A Thomas. Elements of Information Theory. John Wiley & Sons, 2012.
[9] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In Proceedings of the IEEE International Conference on Computer Vision & Pattern Recognition (CVPR), volume 1, pages 886–893, 2005.
[10] Sander Dieleman, Jeffrey De Fauw, and Koray Kavukcuoglu. Exploiting cyclic symmetry in convolutional neural networks. In Proceedings of the International Conference on Machine Learning (ICML), 2016.
[11] Dumitru Erhan, Christian Szegedy, Alexander Toshev, and Dragomir Anguelov. Scalable object detection using deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision & Pattern Recognition (CVPR), pages 2147–2154, 2014.
[12] Ross Girshick. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 1440–1448, 2015.
[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE International Conference on Computer Vision & Pattern Recognition (CVPR), pages 770–778, 2016.
[14] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2980–2988, 2017.
[15] Joao F Henriques and Andrea Vedaldi. Warped convolutions: Efficient invariance to spatial transformations. In Proceedings of the International Conference on Machine Learning (ICML), 2017.
[16] Gao Huang, Zhuang Liu, Kilian Q Weinberger, and Laurens van der Maaten. Densely connected convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision & Pattern Recognition (CVPR), 2017.
[17] Jonathan Huang, Vivek Rathod, Chen Sun, Menglong Zhu, Anoop Korattikara, Alireza Fathi, Ian Fischer, Zbigniew Wojna, Yang Song, Sergio Guadarrama, et al. Speed/accuracy trade-offs for modern convolutional object detectors. In Proceedings of the IEEE International Conference on Computer Vision & Pattern Recognition (CVPR), 2017.
[18] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision & Pattern Recognition (CVPR), pages 1125–1134, 2017.
[19] Max Jaderberg, Karen Simonyan, Andrew Zisserman, et al. Spatial transformer networks. In Proceedings of Advances in Neural Information Processing Systems (NIPS), pages 2017–2025, 2015.
[20] Angjoo Kanazawa, Abhishek Sharma, and David Jacobs. Locally scale-invariant convolutional neural networks. arXiv preprint arXiv:1412.5104, 2014.
[21] Ira Kemelmacher-Shlizerman, Steven M Seitz, Daniel Miller, and Evan Brossard. The MegaFace benchmark: 1 million faces for recognition at scale. In Proceedings of the IEEE International Conference on Computer Vision & Pattern Recognition (CVPR), pages 4873–4882, 2016.
[22] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Proceedings of Advances in Neural Information Processing Systems (NIPS), pages 1097–1105, 2012.
[23] Dmitry Laptev, Nikolay Savinov, Joachim M Buhmann, and Marc Pollefeys. TI-pooling: Transformation-invariant pooling for feature learning in convolutional neural networks. In Proceedings of the IEEE International Conference on Computer Vision & Pattern Recognition (CVPR), pages 289–297, 2016.
[24] Yann LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.
[25] Chen-Hsuan Lin and Simon Lucey. Inverse compositional spatial transformer networks. In Proceedings of the IEEE International Conference on Computer Vision & Pattern Recognition (CVPR), pages 2568–2576, 2017.
[26] David G Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision (IJCV), 60(2):91–110, 2004.
[27] Bruce D Lucas and Takeo Kanade. An iterative image registration technique with an application to stereo vision. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), pages 674–679, 1981.
[28] Diego Marcos, Michele Volpi, and Devis Tuia. Learning rotation invariant convolutional filters for texture classification. In Proceedings of the International Conference on Pattern Recognition (ICPR), pages 2012–2017, 2016.
[29] Iain Matthews and Simon Baker. Active appearance models revisited. International Journal of Computer Vision (IJCV), 60(2):135–164, 2004.
[30] Edouard Oyallon and Stéphane Mallat. Deep roto-translation scattering for object classification. In Proceedings of the IEEE International Conference on Computer Vision & Pattern Recognition (CVPR), 2015.
[31] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Proceedings of Advances in Neural Information Processing Systems (NIPS), pages 91–99, 2015.
[32] Florian Schroff, Dmitry Kalenichenko, and James Philbin. FaceNet: A unified embedding for face recognition and clustering. In
Proceedings of IEEE InternationalConference on Computer Vision & Pattern Recognition (CVPR) , pages 815–823, 2015.
NNUNZIATA, SAGONAS, CALÌ: DENSELY FUSED SPATIAL TRANSFORMER NETWORKS [33] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In Proceedings of International Conference on Learning Rep-resentations (ICLR) , 2014.[34] Johannes Stallkamp, Marc Schlipsing, Jan Salmen, and Christian Igel. The germantraffic sign recognition benchmark: a multi-class classification competition. In
Pro-ceedings of International Joint Conference on Neural Networks (IJCNN) , pages 1453–1460, 2011.[35] Christian Szegedy, Alexander Toshev, and Dumitru Erhan. Deep neural networks forobject detection. In
Proceedings of Advances in Neural Information Processing Sys-tems (NIPS) , pages 2553–2561, 2013.[36] Daniel E Worrall, Stephan J Garbin, Daniyar Turmukhambetov, and Gabriel J Brostow.Harmonic networks: Deep translation and rotation equivariance. In
Proceedings ofIEEE International Conference on Computer Vision & Pattern Recognition (CVPR) ,volume 2, 2017.[37] Wanglong Wu, Meina Kan, Xin Liu, Yi Yang, Shiguang Shan, and Xilin Chen. Recur-sive spatial transformer (rest) for alignment-free face recognition. In
Proceedings ofIEEE International Conference on Computer Vision (ICCV) , pages 3772–3780, 2017.[38] Yuanyi Zhong, Jiansheng Chen, and Bo Huang. Toward end-to-end face recognitionthrough alignment learning.
IEEE Signal Processing Letters , 24(8):1213–1217, 2017. ANNUNZIATA, SAGONAS, CALÌ: DENSELY FUSED SPATIAL TRANSFORMER NETWORKS
Figures 9 and 10 show additional alignment results obtained by the proposed DeSTNet model on the GTSRB [34] and MNIST [24] datasets, respectively.

Figure 9: Sample alignment results produced by the DeSTNet-4 model on the GTSRB dataset. Row 1: input image. Rows 2-4: results produced after each one of the four levels.

Figure 10: Sample alignment results produced by the DeSTNet-4 model on the MNIST dataset. Row 1: input image. Rows 2-4: results produced after each one of the four levels.
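The level-by-level refinement visualised in these figures amounts to composing, at each level, a newly predicted incremental transformation with the running estimate, and warping the input with each running estimate in turn. The sketch below illustrates this composition only; the 3×3 homography updates are invented for illustration and are not the model's actual predictions.

```python
import numpy as np

# Invented incremental updates for a 4-level model: each level predicts a
# transformation that further refines the alignment of the input image.
level_updates = [
    np.array([[1.0, 0.0, 0.1],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]]),                      # level 1: small x-shift
    np.array([[1.05, 0.0, 0.0],
              [0.0, 1.05, 0.0],
              [0.0, 0.0, 1.0]]),                      # level 2: slight scaling
    np.array([[np.cos(0.05), -np.sin(0.05), 0.0],
              [np.sin(0.05),  np.cos(0.05), 0.0],
              [0.0, 0.0, 1.0]]),                      # level 3: small rotation
    np.array([[1.0, 0.02, 0.0],
              [0.0, 1.0, -0.05],
              [0.0, 0.0, 1.0]]),                      # level 4: shear + y-shift
]

# Compose the per-level updates; warping the input with each running estimate
# would yield one intermediate alignment per level, as in the figure rows.
running = np.eye(3)
intermediates = []
for H_k in level_updates:
    running = H_k @ running  # new update composed with the current estimate
    intermediates.append(running.copy())

print(len(intermediates))  # 4 intermediate alignments, one per level
```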
Table 3 reports the architectures of the compared CSTN-5 [25] and DeSTNet-5 models for the task of planar image alignment.

Additional qualitative results obtained by CSTN-5 and DeSTNet-5 on the IDocDB database are provided in Figs. 11 and 12. These results confirm that the proposed DeSTNet is more accurate than the CSTN and exhibits greater robustness to partial occlusions, clutter, and low-light conditions.
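The connectivity difference between the two models can be sketched at the level of the predicted parameter updates: CSTN propagates each level's estimate only to the next level, whereas DeSTNet fuses the updates of all earlier levels. In the sketch below the per-level regressors are stand-in random linear maps, the update rule is additive, and the fusion operation F is a plain summation; all three are simplifying assumptions for illustration, not the paper's exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_LEVELS = 5   # five levels, as in CSTN-5 / DeSTNet-5
NUM_PARAMS = 8   # FC8 output: 8 parameters of a planar transformation

# Stand-in for each level's regressor (the conv/FC8 stack of Table 3):
# a fixed random linear map of the current parameters, for illustration only.
weights = [rng.normal(scale=0.1, size=(NUM_PARAMS, NUM_PARAMS))
           for _ in range(NUM_LEVELS)]

def predict_update(level, params):
    """Hypothetical per-level update predictor (not the paper's network)."""
    return weights[level] @ params

def cstn(p0):
    """Sequential connectivity: each level refines only the previous estimate."""
    p = p0.copy()
    for k in range(NUM_LEVELS):
        p = p + predict_update(k, p)  # additive update (simplifying assumption)
    return p

def destnet(p0):
    """Dense connectivity: level k receives the updates of all earlier levels,
    combined by the fusion operation F (summation used here as a stand-in)."""
    p = p0.copy()
    updates = []
    for k in range(NUM_LEVELS):
        updates.append(predict_update(k, p))
        p = p0 + np.sum(updates, axis=0)  # F fuses every update produced so far
    return p

p0 = rng.normal(size=NUM_PARAMS)
print(cstn(p0).shape, destnet(p0).shape)  # both (8,)
```

The key structural point is that in `destnet` every previously produced update remains a direct input to the fusion at each level, mirroring the dense connectivity pattern, while `cstn` discards all but the latest estimate.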
Model       Architecture
CSTN-5      [ conv3-64 ( ) | conv3-128 ( ) | conv3-256 ( ) | FC8 ] × 5
DeSTNet-5   F{[ conv3-64 ( ) | conv3-128 ( ) | conv3-256 ( ) | FC8 ] × 5}

Table 3: Architectures utilized by CSTN-5 and DeSTNet-5. convD1-D2 (D3): convolution layer with a D1 × D1 receptive field, D2 channels, and stride D3; FC: fully connected layer; F: fusion operation used in DeSTNet for fusing the parameter updates.

Figure 11: Qualitative results obtained with CSTN-5 and DeSTNet-5 on IDocDB. (Results are best viewed on a digital screen.)

Figure 12: Qualitative results obtained with CSTN-5 and