Dual Path Networks
Yunpeng Chen, Jianan Li, Huaxin Xiao, Xiaojie Jin, Shuicheng Yan, Jiashi Feng
National University of Singapore; Beijing Institute of Technology; National University of Defense Technology; Qihoo 360 AI Institute
Abstract
In this work, we present a simple, highly efficient and modularized Dual Path Network (DPN) for image classification which presents a new topology of connection paths internally. By revealing the equivalence of the state-of-the-art Residual Network (ResNet) and Dense Convolutional Network (DenseNet) within the HORNN framework, we find that ResNet enables feature re-usage while DenseNet enables new feature exploration, both of which are important for learning good representations. To enjoy the benefits from both path topologies, our proposed Dual Path Network shares common features while maintaining the flexibility to explore new features through its dual path architecture. Extensive experiments on three benchmark datasets, ImageNet-1k, Places365 and PASCAL VOC, clearly demonstrate superior performance of the proposed DPN over the state-of-the-art methods. In particular, on the ImageNet-1k dataset, a shallow DPN surpasses the best ResNeXt-101 (64×4d) with 26% smaller model size, 25% less computational cost and 8% lower memory consumption, and a deeper DPN (DPN-131) further pushes the state-of-the-art single-model performance with about 2 times faster training speed. Experiments on the Places365 large-scale scene dataset, the PASCAL VOC detection dataset and the PASCAL VOC segmentation dataset also demonstrate its consistently better performance than DenseNet, ResNet and the latest ResNeXt model over various applications.

1 Introduction

"Network engineering" is increasingly important for visual recognition research. In this paper, we aim to develop a new path topology of deep architectures to further push the frontier of representation learning. In particular, we focus on analyzing and re-forming the skip connection, which has been widely used in designing modern deep neural networks and offers remarkable success in many applications [16, 7, 20, 14, 5]. A skip connection creates a path propagating information from a lower layer directly to a higher layer. During forward propagation, a skip connection enables a very top layer to access information from a distant bottom layer; during backward propagation, it facilitates gradient back-propagation to the bottom layer without diminishing magnitude, which effectively alleviates the gradient vanishing problem and eases the optimization.

Deep Residual Network (ResNet) [5] is one of the first works that successfully adopt skip connections, where each micro-block, a.k.a. residual function, is associated with a skip connection, called the residual path. The residual path element-wise adds the input features to the output of the same micro-block, making it a residual unit. Depending on the inner structure design of the micro-block, the residual network has developed into a family of various architectures, including WRN [22], Inception-ResNet [20] and ResNeXt [21].

More recently, Huang et al. [8] proposed a different network architecture that achieves comparable accuracy with deep ResNet [5], named Dense Convolutional Network (DenseNet). Different from residual networks, which add the input features to the output features through the residual path, DenseNet uses a densely connected path to concatenate the input features with the output features, enabling each micro-block to receive raw information from all previous micro-blocks.
Similar to the residual network family, DenseNet and its variants can be categorized into the densely connected network family. Although the width of the densely connected path increases linearly as the network goes deeper, causing the number of parameters to grow quadratically, DenseNet provides higher parameter efficiency compared with ResNet [5].

In this work, we aim to study the advantages and limitations of both topologies and further enrich the path design by proposing a dual path architecture. In particular, we first provide a new understanding of the densely connected networks from the lens of a higher order recurrent neural network (HORNN) [19], and explore the relations between densely connected networks and residual networks. More specifically, we bridge the densely connected networks with HORNNs, showing that the densely connected networks are HORNNs when the weights are shared across steps. Inspired by [12], which demonstrates the relations between residual networks and RNNs, we prove that residual networks are densely connected networks when connections are shared across layers. With this unified view on the state-of-the-art deep architectures, we find that deep residual networks implicitly reuse features through the residual path, while densely connected networks keep exploring new features through the densely connected path.

Based on this new view, we propose a novel dual path architecture, called the Dual Path Network (DPN). This new architecture inherits both advantages of residual and densely connected paths, enabling effective feature re-usage and re-exploitation. The proposed DPN also enjoys higher parameter efficiency, lower computational cost and lower memory consumption, and is friendlier to optimization compared with the state-of-the-art classification networks. Experimental results validate the high accuracy of DPN compared with other well-established baselines for image classification on both the ImageNet-1k dataset and the Places365-Standard dataset. Additional experiments on the object detection and semantic segmentation tasks also demonstrate that the proposed dual path architecture can be broadly applied to various tasks and consistently achieves the best performance.

2 Related Work

Designing an advanced neural network architecture is one of the most challenging but effective ways of improving image classification performance, which can also directly benefit a variety of other tasks. AlexNet [10] and VGG [18] are two of the most important works that show the power of deep convolutional neural networks. They demonstrate that building deeper networks with tiny convolutional kernels is a promising way to increase the learning capacity of a neural network. Residual Network was first proposed by He et al. [5]; it greatly alleviates the optimization difficulty and further pushes the depth of deep neural networks to hundreds of layers by using skip connections. Since then, different kinds of residual networks have arisen, concentrating on either building a more efficient micro-block inner structure [3, 21] or exploring how to use residual connections [9]. Recently, Huang et al. [8] proposed a different network, called Dense Convolutional Network, where skip connections are used to concatenate the input to the output instead of adding them. However, the width of the densely connected path increases linearly as the depth rises, causing the number of parameters to grow quadratically and costing a large amount of GPU memory compared with the residual networks if the implementation is not specifically optimized.
This limits the building of a deeper and wider DenseNet that may further improve the accuracy.

Besides designing new architectures, researchers have also tried to re-explore existing state-of-the-art architectures. In [6], the authors showed the importance of the residual path in alleviating the optimization difficulty. In [12], residual networks are bridged with recurrent neural networks (RNNs), which helps people better understand the deep residual network from the perspective of RNNs. In [3], several different residual functions are unified, trying to provide a better understanding of how to design a micro structure with higher learning capacity. But still, for the densely connected networks, apart from several intuitive explanations on better feature re-usage and efficient gradient flow, there have been few works that provide a really deeper understanding. In this work, we aim to provide a deeper understanding of the densely connected network from the lens of Higher Order RNNs, and explain how the residual networks are indeed a special case of densely connected networks. Based on this analysis, we then propose a novel Dual Path Network architecture that not only achieves higher accuracy, but also enjoys high parameter and computational efficiency.

Figure 1: The topological relations of different types of neural networks: (a) ResNet with shared weights; (b) ResNet in RNN form; (c) DenseNet with shared weights; (d) DenseNet in HORNN form. (a) and (b) show relations between residual networks and RNNs, as stated in [12]; (c) and (d) show relations between densely connected networks and higher order recurrent neural networks (HORNNs), which is explained in this paper. The symbol "$z^{-1}$" denotes a time-delay unit; "$\oplus$" denotes element-wise summation; "$I(\cdot)$" denotes an identity mapping function.

3 Revisiting ResNet, DenseNet and Higher Order RNN

In this section, we first bridge the densely connected network [8] with higher order recurrent neural networks [19] to provide a new understanding of the densely connected network. We prove that residual networks [5, 6, 22, 21, 3] essentially belong to the family of densely connected networks, except that their connections are shared across steps. Then, we present an analysis of the strengths and weaknesses of each topology, which motivates us to develop the dual path network architecture.

For exploring the above relation, we provide a new view on the densely connected networks from the lens of Higher Order RNNs, explain their relations, and then specialize the analysis to residual networks. Throughout the paper, we formulate the HORNN in a generalized form. We use $h^t$ to denote the hidden state of the recurrent neural network at the $t$-th step and use $k$ as the index of the current step. Let $x^t$ denote the input at the $t$-th step, with $h^0 = x^0$. For each step, $f_t^k(\cdot)$ refers to the feature extracting function, which takes the hidden state as input and outputs the extracted information. $g^k(\cdot)$ denotes a transformation function that transforms the gathered information into the current hidden state:

$$h^k = g^k\Big[\sum_{t=0}^{k-1} f_t^k(h^t)\Big]. \qquad (1)$$

Eqn. (1) encapsulates the update rule of various network architectures in a generalized way. For HORNNs, weights are shared across steps, i.e. $\forall t,k,\ f_{k-t}^{k}(\cdot) \equiv f_t(\cdot)$ and $\forall k,\ g^k(\cdot) \equiv g(\cdot)$.
For the densely connected networks, each step (micro-block) has its own parameters, which means $f_t^k(\cdot)$ and $g^k(\cdot)$ are not shared. This observation shows that the densely connected path of DenseNet is essentially a higher order path which is able to extract new information from previous states. Figure 1(c)(d) graphically shows the relations between densely connected networks and higher order recurrent networks.

We then explain that residual networks are special cases of densely connected networks when $\forall t,k,\ f_t^k(\cdot) \equiv f_t(\cdot)$. Here, for succinctness, we introduce $r^k$ to denote the intermediate result and let $r^1 = 0$. Then Eqn. (1) can be rewritten as

$$r^k \triangleq \sum_{t=1}^{k-1} f_t(h^t) = r^{k-1} + f_{k-1}(h^{k-1}), \qquad (2)$$
$$h^k = g^k\big(r^k\big). \qquad (3)$$

Thus, by substituting Eqn. (3) into Eqn. (2), Eqn. (2) can be simplified as

$$r^k = r^{k-1} + f_{k-1}(h^{k-1}) = r^{k-1} + f_{k-1}\big(g^{k-1}(r^{k-1})\big) = r^{k-1} + \phi^{k-1}(r^{k-1}), \qquad (4)$$

where $\phi^k(\cdot) = f_k(g^k(\cdot))$. Obviously, Eqn. (4) has the same form as the residual network and the recurrent neural network. Specifically, when $\forall k,\ \phi^k(\cdot) \equiv \phi(\cdot)$, Eqn. (4) degenerates to an RNN; when none of the $\phi^k(\cdot)$ is shared and $x^k = 0$ for $k > 1$, Eqn. (4) produces a residual network. Figure 1(a)(b) graphically shows this relation. Besides, recall that Eqn. (4) is derived from Eqn. (1) under the condition $\forall t,k,\ f_t^k(\cdot) \equiv f_t(\cdot)$, and the densely connected networks are of the form of Eqn. (1), meaning that the residual network family essentially belongs to the densely connected network family. Figure 2(a–c) gives an example and demonstrates such equivalence, where $f_t(\cdot)$ corresponds to the first 1×1 convolutional layer and $g^k(\cdot)$ corresponds to the other layers within a micro-block in Figure 2(b).

From the above analysis, we observe that: 1) both residual networks and densely connected networks can be seen as a HORNN when $f_t^k(\cdot)$ and $g^k(\cdot)$ are shared for all $k$; 2) a residual network is a densely connected network if $\forall t,k,\ f_t^k(\cdot) \equiv f_t(\cdot)$. By sharing $f_t^k(\cdot)$ across all steps, $g^k(\cdot)$ receives the same features from a given output state, which encourages feature re-usage and thus reduces feature redundancy. However, such an information sharing strategy makes it difficult for residual networks to explore new features. Comparatively, the densely connected networks are able to explore new information from previous outputs since $f_t^k(\cdot)$ is not shared across steps. However, different $f_t^k(\cdot)$ may extract the same type of features multiple times, leading to high redundancy.

In the following section, we present the dual path networks, which can overcome both inherent limitations of these two state-of-the-art network architectures. Their relations with HORNNs also imply that our proposed architecture can be used for improving HORNNs, which we leave for future work.
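To make the equivalence above concrete, the following short NumPy sketch (our own illustration, not code from the paper) builds the final hidden state both ways: once via the dense aggregation of Eqn. (1) with a shared feature extractor, and once via the residual recursion of Eqn. (4). With toy linear functions standing in for the convolutional micro-blocks, the two computations agree to floating-point precision.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K = 4, 6                                   # feature width, number of steps

# Toy stand-ins: a shared feature extractor f(.) and step-specific transforms g^k(.)
Wf = 0.1 * rng.standard_normal((d, d))
Wg = [0.1 * rng.standard_normal((d, d)) for _ in range(K + 1)]
f = lambda h: Wf @ h                          # shared across steps: f_t^k == f
g = lambda k, r: np.tanh(Wg[k] @ r)           # step-specific transform g^k

h0 = rng.standard_normal(d)

# Dense aggregation with a shared extractor, Eqn (1): h^k = g^k( sum_{t<k} f(h^t) )
h_dense = [h0]
for k in range(1, K + 1):
    h_dense.append(g(k, sum(f(h) for h in h_dense)))

# Residual recursion, Eqn (4): r^k = r^{k-1} + f(h^{k-1}),  h^k = g^k(r^k)
r = f(h0)                                     # r for step 1
h_res = [h0, g(1, r)]
for k in range(2, K + 1):
    r = r + f(h_res[-1])                      # add the contribution of the latest state
    h_res.append(g(k, r))

assert np.allclose(h_dense[-1], h_res[-1])    # both formulations yield the same state
```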
4 Dual Path Networks

Above, we explained the relations between residual networks and densely connected networks, showing that the residual path implicitly reuses features but is not good at exploring new features, whereas the densely connected network keeps exploring new features but suffers from higher redundancy. In this section, we describe the details of our proposed novel dual path architecture, i.e. the Dual Path Network (DPN). In the following, we first introduce and formulate the dual path architecture, and then present the network structure in detail with a complexity analysis.

Sec. 3 discussed the advantages and limitations of both residual networks and densely connected networks. Based on this analysis, we propose a simple dual path architecture which shares $f_t^k(\cdot)$ across all blocks to enjoy the benefits of re-using common features with low redundancy, while still keeping a densely connected path that gives the network more flexibility in learning new features. We formulate such a dual path architecture as follows:

$$x^k \triangleq \sum_{t=1}^{k-1} f_t^k(h^t), \qquad (5)$$
$$y^k \triangleq \sum_{t=1}^{k-1} v_t(h^t) = y^{k-1} + \phi^{k-1}(y^{k-1}), \qquad (6)$$
$$r^k \triangleq x^k + y^k, \qquad (7)$$
$$h^k = g^k\big(r^k\big), \qquad (8)$$

where $x^k$ and $y^k$ denote the information extracted at the $k$-th step by the two individual paths, and $v_t(\cdot)$ is a feature learning function, like $f_t^k(\cdot)$. Eqn. (5) refers to the densely connected path that enables exploring new features; Eqn. (6) refers to the residual path that enables common feature re-usage; Eqn. (7) defines the dual path that integrates them and feeds them to the last transformation function in Eqn. (8). The final transformation function $g^k(\cdot)$ generates the current state, which is used for making the next mapping or prediction. Figure 2(d)(e) shows an example of the dual path architecture that is used in our experiments.

More generally, the proposed DPN is a family of convolutional neural networks which contains a residual-alike path and a densely connected-alike path, as explained later. Similar to these networks, one can customize the micro-block function of DPN for task-specific usage or for further overall performance boosting.
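The four equations above can be read as a single update loop: keep a growing list of previous states for the densely connected path, keep a running sum for the residual path, merge the two, and transform. The following schematic Python sketch is our own paraphrase of Eqns. (5)-(8), with placeholder callables instead of real convolutions and with the summation indices slightly simplified (sums start from the input state here):

```python
def dual_path_forward(h0, f, v, g, num_steps):
    """Schematic dual path update following Eqns (5)-(8).

    f[k][t]: step-specific extractors for the densely connected path
    v[t]:    extractors for the residual path (structured so Eqn (6) is a running sum)
    g[k]:    transformation producing the next state h^k
    """
    states = [h0]        # h^0, h^1, ... accumulated for the densely connected path
    y = 0                # running residual-path sum
    for k in range(1, num_steps + 1):
        x = sum(f[k][t](states[t]) for t in range(len(states)))  # Eqn (5): explore new features
        y = y + v[k - 1](states[-1])                              # Eqn (6): re-use common features
        r = x + y                                                 # Eqn (7): merge both paths
        states.append(g[k](r))                                    # Eqn (8): next state
    return states[-1]
```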
Figure 2: Architecture comparison of different networks. (a) The residual network. (b) The densely connected network, where each layer can access the outputs of all previous micro-blocks. Here, a 1×1 convolutional layer (underlined) is added for consistency with the micro-block design in (a). (c) By sharing the first 1×1 connection of the same output across micro-blocks in (b), the densely connected network degenerates to a residual network; the dotted rectangle in (c) highlights the residual unit. (d) The proposed dual path architecture, DPN. (e) An equivalent form of (d) from the perspective of implementation, where the fork symbol denotes a split operation and "+" denotes element-wise addition.

The proposed network is built by stacking multiple modularized micro-blocks as shown in Figure 2. In this work, the structure of each micro-block is designed in a bottleneck style [5]: it starts with a 1×1 convolutional layer, is followed by a 3×3 convolutional layer, and ends with a 1×1 convolutional layer. The output of the last 1×1 convolutional layer is split into two parts: the first part is element-wise added to the residual path, and the second part is concatenated with the densely connected path. To enhance the learning capacity of each micro-block, we use a grouped convolution layer as the second layer, as in ResNeXt [21].

Considering that residual networks are more widely used than densely connected networks in practice, we choose the residual network as the backbone and add a thin densely connected path to build the dual path network. This design also helps slow the width increment of the densely connected path and reduce the GPU memory cost. Table 1 shows the detailed architecture settings. In the table, G refers to the number of groups, and k refers to the channel increment of the densely connected path. For the newly proposed DPNs, we use (+k) to indicate the width increment of the densely connected path. The overall design of DPN inherits the backbone architecture of the vanilla ResNet / ResNeXt, making it very easy to implement and apply to other tasks. One can simply implement a DPN by adding one more "slice layer" and "concat layer" upon existing residual networks. Under a well-optimized deep learning platform, none of these newly added operations requires extra computational cost or extra memory consumption, making the DPNs highly efficient.

In order to demonstrate the appealing effectiveness of the dual path architecture, we intentionally design a set of DPNs with a considerably smaller model size and fewer FLOPs compared with the state-of-the-art ResNeXts [21], as shown in Table 1. Due to limited computational resources, we set these hyper-parameters based on our previous experience instead of grid search experiments.
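As a concrete illustration of the micro-block just described, here is a minimal PyTorch-style sketch of our own (not the authors' MXNet implementation): the class name is hypothetical, the widths are read off Table 1 as placeholders, and batch-normalization/activation placement as well as the projection shortcuts used at stage transitions are simplified away.

```python
import torch
import torch.nn as nn

class DualPathBlock(nn.Module):
    """One DPN micro-block (cf. Figure 2(e)): a 1x1 -> 3x3 (grouped) -> 1x1 bottleneck whose
    last 1x1 output is split into a part added to the residual path and a part
    concatenated onto the densely connected path."""

    def __init__(self, in_channels, mid_channels, res_channels, dense_inc, groups):
        super().__init__()
        self.res_channels = res_channels
        out_channels = res_channels + dense_inc       # the last 1x1 produces both parts at once
        self.body = nn.Sequential(
            nn.Conv2d(in_channels, mid_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(mid_channels), nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, mid_channels, kernel_size=3, padding=1,
                      groups=groups, bias=False),
            nn.BatchNorm2d(mid_channels), nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, out_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_channels),
        )

    def forward(self, residual, dense):
        # The micro-block sees the concatenation of both paths.
        out = self.body(torch.cat([residual, dense], dim=1))
        res_part = out[:, :self.res_channels]
        dense_part = out[:, self.res_channels:]
        residual = residual + res_part                    # residual path: common feature re-use
        dense = torch.cat([dense, dense_part], dim=1)     # dense path: width grows by dense_inc
        return residual, dense

# Illustrative instantiation with conv3-stage widths of DPN-92 from Table 1
# (mid width 192, residual width 512, +32 per block, G=32); the dense path is assumed
# to currently hold 64 channels, which is why the block input has 512 + 64 channels.
block = DualPathBlock(in_channels=512 + 64, mid_channels=192,
                      res_channels=512, dense_inc=32, groups=32)
res, dense = block(torch.randn(1, 512, 28, 28), torch.randn(1, 64, 28, 28))
# dense now has 64 + 32 = 96 channels; res keeps 512 channels
```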
Model complexity

We measure the model complexity by counting the total number of learnable parameters within each neural network. Table 1 shows the results for different models. The DPN-92 costs about 15% fewer parameters than ResNeXt-101 (32×4d), while the DPN-98 costs about 26% fewer parameters than ResNeXt-101 (64×4d).
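The parameter count used in this comparison is straightforward to reproduce once a model object is available; a minimal PyTorch-style helper of our own (the `dpn_92` instance in the comment is a hypothetical placeholder) is:

```python
def count_parameters(model) -> int:
    """Total number of learnable parameters, the quantity compared in this paragraph."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# e.g. count_parameters(dpn_92) / 1e6  -> parameter count in millions
```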
Computational complexity

We measure the computational cost of each deep neural network by its number of floating-point operations (FLOPs) for a 224×224 input, counted as multiply-adds following [21]. Table 1 shows the theoretical computational cost. Though the actual time cost may be influenced by other factors, e.g. GPU bandwidth and coding quality, the theoretical cost indicates an upper bound on the achievable speed. As can be seen from the results, DPN-92 consumes about 19% fewer FLOPs than ResNeXt-101 (32×4d), and DPN-98 consumes about 25% fewer FLOPs than ResNeXt-101 (64×4d).

Table 1: Architecture and complexity comparison of our proposed Dual Path Networks (DPNs) and other state-of-the-art networks. We compare DPNs with two baseline methods: DenseNet [8] and ResNeXt [21]. The symbol (+k) denotes the width increment on the densely connected path.

conv1 (output 112×112):
  DenseNet-161 (k=48): 7×7, 96, stride 2
  ResNeXt-101 (32×4d): 7×7, 64, stride 2
  ResNeXt-101 (64×4d): 7×7, 64, stride 2
  DPN-92 (32×3d): 7×7, 64, stride 2
  DPN-98 (40×4d): 7×7, 96, stride 2

conv2 (output 56×56, preceded by 3×3 max pooling with stride 2 in all models):
  DenseNet-161 (k=48): [1×1, 192; 3×3, 48] × 6
  ResNeXt-101 (32×4d): [1×1, 128; 3×3, 128, G=32; 1×1, 256] × 3
  ResNeXt-101 (64×4d): [1×1, 256; 3×3, 256, G=64; 1×1, 256] × 3
  DPN-92 (32×3d): [1×1, 96; 3×3, 96, G=32; 1×1, 256 (+16)] × 3
  DPN-98 (40×4d): [1×1, 160; 3×3, 160, G=40; 1×1, 256 (+16)] × 3

conv3 (output 28×28):
  DenseNet-161 (k=48): [1×1, 192; 3×3, 48] × 12
  ResNeXt-101 (32×4d): [1×1, 256; 3×3, 256, G=32; 1×1, 512] × 4
  ResNeXt-101 (64×4d): [1×1, 512; 3×3, 512, G=64; 1×1, 512] × 4
  DPN-92 (32×3d): [1×1, 192; 3×3, 192, G=32; 1×1, 512 (+32)] × 4
  DPN-98 (40×4d): [1×1, 320; 3×3, 320, G=40; 1×1, 512 (+32)] × 6

conv4 (output 14×14):
  DenseNet-161 (k=48): [1×1, 192; 3×3, 48] × 36
  ResNeXt-101 (32×4d): [1×1, 512; 3×3, 512, G=32; 1×1, 1024] × 23
  ResNeXt-101 (64×4d): [1×1, 1024; 3×3, 1024, G=64; 1×1, 1024] × 23
  DPN-92 (32×3d): [1×1, 384; 3×3, 384, G=32; 1×1, 1024 (+24)] × 20
  DPN-98 (40×4d): [1×1, 640; 3×3, 640, G=40; 1×1, 1024 (+32)] × 20

conv5 (output 7×7):
  DenseNet-161 (k=48): [1×1, 192; 3×3, 48] × 24
  ResNeXt-101 (32×4d): [1×1, 1024; 3×3, 1024, G=32; 1×1, 2048] × 3
  ResNeXt-101 (64×4d): [1×1, 2048; 3×3, 2048, G=64; 1×1, 2048] × 3
  DPN-92 (32×3d): [1×1, 768; 3×3, 768, G=32; 1×1, 2048 (+128)] × 3
  DPN-98 (40×4d): [1×1, 1280; 3×3, 1280, G=40; 1×1, 2048 (+128)] × 3

FLOPs (224×224 input): 7.7×10⁹ (DenseNet-161), 8.0×10⁹ (ResNeXt-101, 32×4d), 15.5×10⁹ (ResNeXt-101, 64×4d), 6.5×10⁹ (DPN-92), 11.7×10⁹ (DPN-98)

5 Experiments

Extensive experiments are conducted for evaluating the proposed Dual Path Networks. Specifically, we evaluate the proposed architecture on three tasks: image classification, object detection and semantic segmentation, using three standard benchmark datasets: the ImageNet-1k dataset, the Places365-Standard dataset and the PASCAL VOC datasets. Key properties of the proposed DPNs are studied on the ImageNet-1k object classification dataset [17] and further verified on the Places365-Standard scene understanding dataset [24]. To verify whether the proposed DPNs can benefit other tasks besides image classification, we further conduct experiments on the PASCAL VOC datasets [4] to evaluate their performance in object detection and semantic segmentation.
We implement the DPNs using MXNet [2] on a cluster with 40 K80 graphics cards. Following [3], we adopt standard data augmentation methods and train the networks using SGD with a mini-batch size of 32 for each GPU.
For the deepest network, DPN-131, the mini-batch size is limited to 24 because of the 12 GB GPU memory constraint. The learning rate starts from √0.1 for DPN-92 and DPN-131 and drops in a "steps" manner by a factor of 0.1; DPN-98 uses a different initial learning rate. Following [5], batch normalization layers are refined after training.

Firstly, we compare the image classification performance of DPNs with current state-of-the-art models. As can be seen from the first block of Table 2, a shallow DPN with a depth of only 92 reduces the top-1 error rate by an absolute value of 0.5% compared with ResNeXt-101 (32×4d) and by an absolute value of 1.5% compared with DenseNet-161, yet it requires considerably fewer FLOPs. In the second block of Table 2, a deeper DPN (DPN-98) surpasses the best residual network, ResNeXt-101 (64×4d), and still enjoys fewer FLOPs and a much smaller model size (236 MB v.s. 320 MB). In order to further push the state-of-the-art accuracy, we slightly increase the depth of the DPN to 131 (DPN-131). The results are shown in the last block of Table 2. Again, the DPN shows superior accuracy over the best single model, Very Deep PolyNet [23], with a much smaller model size (304 MB v.s. 365 MB). Note that the Very Deep PolyNet adopts numerous tricks, e.g. initialization by insertion, residual scaling and stochastic paths, to assist the training process. In contrast, our proposed DPN-131 is simple and does not involve these tricks; DPN-131 can be trained using the same standard training strategy as the shallower DPNs. More importantly, the actual training speed of DPN-131 is about 2 times faster than that of the Very Deep PolyNet, as discussed in the following paragraph. The DPN-131 has 128 channels at conv1, 4 blocks at conv2, 8 blocks at conv3, 28 blocks at conv4 and 3 blocks at conv5, with 16.0×10⁹ FLOPs (see Table 2).

Table 2: Comparison with state-of-the-art CNNs on the ImageNet-1k dataset. Single-crop validation error rate (%) on the validation set. *: performance reported by [21]. †: with Mean-Max Pooling (see Appendix A).

Method | Model size | GFLOPs | 224×224 top-1 | 224×224 top-5 | 320×320 / 299×299 top-1 | 320×320 / 299×299 top-5
DenseNet-161 (k=48) [8] | 111 MB | 7.7 | 22.2 | – | – | –
ResNet-101* [5] | 170 MB | 7.8 | 22.0 | 6.0 | – | –
ResNeXt-101 (32×4d) [21] | 170 MB | 8.0 | 21.2 | 5.6 | – | –
DPN-92 (32×3d) | 145 MB | 6.5 | 20.7 | 5.4 | 19.3 | 4.7
ResNet-200 [6] | 247 MB | 15.0 | 21.7 | 5.8 | 20.1 | 4.8
Inception-resnet-v2 [20] | 227 MB | – | – | – | 19.9 | 4.9
ResNeXt-101 (64×4d) [21] | 320 MB | 15.5 | 20.4 | 5.3 | 19.1 | 4.4
DPN-98 (40×4d) | 236 MB | 11.7 | 20.2 | 5.2 | 18.9 | 4.4
Very deep Inception-resnet-v2 [23] | 531 MB | – | – | – | 19.10 | 4.48
Very Deep PolyNet [23] | 365 MB | – | – | – | 18.71 | 4.25
DPN-131 (40×4d) | 304 MB | 16.0 | 19.93 | 5.12 | 18.62 | 4.23
DPN-131 (40×4d)† | 304 MB | 16.0 | 19.93 | 5.12 | 18.55 | 4.16
Table 3: Comparison with state-of-the-art CNNs on the Places365-Standard dataset. 10-crop validation accuracy rate (%) on the validation set.

Method | Model size | top-1 acc. | top-5 acc.
AlexNet [24] | 223 MB | 53.17 | 82.89
GoogLeNet [24] | 44 MB | 53.63 | 83.88
VGG-16 [24] | 518 MB | 55.24 | 84.91
ResNet-152 [24] | 226 MB | 54.74 | 85.08
ResNeXt-101 [3] | 165 MB | 56.21 | 86.25
CRU-Net-116 [3] | 163 MB | 56.60 | 86.55
DPN-92 (32×3d) | 138 MB | 56.84 | 86.69
Figure 3: Comparison of total actual cost between different models during training: (a) single-crop top-1 error v.s. training speed (samples/sec); (b) single-crop top-1 error v.s. memory cost (GB, batch size 24); (c) memory cost v.s. training speed. The compared models are ResNet-200, ResNeXt-101 (64×4d), DPN-98 (40×4d) and DPN-131 (40×4d). Evaluations are conducted on a single node with four K80 graphics cards, with all training samples cached into memory. (For the comparison of training speed, we push the mini-batch size to its maximum value given the 12 GB GPU memory, to test the fastest possible training speed of each model.)

Secondly, we compare the training cost between the best performing models. Here, we focus on evaluating two key properties: the actual GPU memory cost and the actual training speed. Figure 3 shows the results. As can be seen from Figure 3(a)(b), the DPN-98 is faster and uses less memory than the best performing ResNeXt, with a considerably lower testing error rate. Note that the theoretical computational cost of DPN-98 shown in Table 2 is lower than that of the best performing ResNeXt, indicating there is still room for code optimization. Figure 3(c) presents the same result in a clearer way. The deeper DPN-131 costs only slightly more training time than the best performing ResNeXt, but achieves state-of-the-art single-model performance. The training speed of the previous state-of-the-art single model, i.e. the Very Deep PolyNet (537 layers) [23], is about 31 samples per second based on our implementation using MXNet, showing that DPN-131 runs about 2 times faster than the Very Deep PolyNet during training.
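For reference, measurements of this kind can be reproduced with a few lines of profiling code. The sketch below is our own PyTorch-style illustration of the procedure (the paper's numbers come from its MXNet setup); `model`, `images`, `labels`, `loss_fn` and `optimizer` are hypothetical placeholders.

```python
import time
import torch

def profile_training(model, images, labels, loss_fn, optimizer, warmup=5, iters=20):
    """Rough actual-cost measurement: samples/sec and peak GPU memory for one model."""
    torch.cuda.reset_peak_memory_stats()
    for i in range(warmup + iters):
        if i == warmup:                       # exclude warm-up iterations from timing
            torch.cuda.synchronize()
            start = time.time()
        optimizer.zero_grad()
        loss_fn(model(images), labels).backward()
        optimizer.step()
    torch.cuda.synchronize()
    samples_per_sec = iters * images.size(0) / (time.time() - start)
    peak_memory_gb = torch.cuda.max_memory_allocated() / 1024 ** 3
    return samples_per_sec, peak_memory_gb
```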
In this experiment, we further evaluate the accuracy of the proposed DPN on the scene classification task using the Places365-Standard dataset. The Places365-Standard dataset is a high-resolution scene understanding dataset with more than 1.8 million images of 365 scene categories. Different from object images, scene images do not have very clear discriminative patterns and require a higher-level context reasoning ability.

Table 3 shows the results of different models on this dataset. To make a fair comparison, we evaluate DPN-92 on this dataset instead of using deeper DPNs. As can be seen from the results, DPN achieves the best validation accuracy compared with other methods. The DPN-92 requires much fewer parameters (138 MB v.s. 163 MB), which again demonstrates its high parameter efficiency and high generalization ability.
We further evaluate the proposed Dual Path Network on the object detection task. Experiments are performed on the PASCAL VOC 2007 dataset [4]. We train the models on the union set of VOC 2007 trainval and VOC 2012 trainval following [16], and evaluate them on the VOC 2007 test set. We use the standard evaluation metrics Average Precision (AP) and mean Average Precision (mAP), following the PASCAL challenge protocols, for evaluation.

Table 4: Object detection results on the PASCAL VOC 2007 test set. The performance is measured by mean Average Precision (mAP, in %).
Method | mAP | aero | bike | bird | boat | bottle | bus | car | cat | chair | cow | table | dog | horse | mbk | prsn | plant | sheep | sofa | train | tv
DenseNet-161 (k=48) | 79.9 | 80.4 | 85.9 | 81.2 | 72.8 | 68.0 | 87.1 | 88.0 | 88.8 | 64.0 | 83.3 | 75.4 | 87.5 | 87.6 | 81.3 | 84.2 | 54.6 | 83.2 | 80.2 | 87.4 | 77.2
ResNet-101 [16] | 76.4 | 79.8 | 80.7 | 76.2 | 68.3 | 55.9 | 85.1 | 85.3
ResNeXt-101 (32×4d) | 80.1 | 80.2 | 86.5 | 79.4 | 72.5 | 67.3 | 86.9 | 88.6 | 88.9 | 64.9 | 85.0 | 76.2 | 87.3 | 87.8 | 81.8 | 84.1 | 55.5 | 84.0 | 79.7 | 87.9 | 77.0
DPN-92 (32×3d)

Table 5: Semantic segmentation results on the PASCAL VOC 2012 test set. The performance is measured by mean Intersection over Union (mIoU, in %).
Method | mIoU | bkg | aero | bike | bird | boat | bottle | bus | car | cat | chair | cow | table | dog | horse | mbk | prsn | plant | sheep | sofa | train | tv
DenseNet-161 (k=48) | 68.7 | 92.1 | 77.3 | 37.1 | 83.6 | 54.9 | 70.0 | 85.8 | 82.5 | 85.9 | 26.1 | 73.0 | 55.1 | 80.2 | 74.0 | 79.1 | 78.2 | 51.5 | 80.0 | 42.2 | 75.1 | 58.6
ResNet-101 | 73.1 | 93.1 | 86.9 | 39.9
ResNeXt-101 (32×4d) | 73.6 | 93.1 | 84.9 | 36.2 | 80.3
DPN-92 (32×3d)

We perform all experiments based on the ResNet-based Faster R-CNN framework, following [5], and make comparisons by replacing the residual network while keeping other parts unchanged. Since our goal is to evaluate DPN rather than to further push the state-of-the-art accuracy on this dataset, we adopt the shallowest DPN-92 and baseline networks at roughly the same complexity level. Table 4 provides the detection performance comparisons of the proposed DPN with several current state-of-the-art models. It can be observed that the DPN obtains the highest mAP, with large improvements over both ResNet-101 [16] and ResNeXt-101 (32×4d). The better results shown in this experiment demonstrate that the Dual Path Network is also capable of learning better feature representations for detecting objects and benefiting the object detection task.

In this experiment, we evaluate the performance of the Dual Path Network for dense prediction, i.e. semantic segmentation, where the training target is to predict the semantic label for each pixel in the input image. We conduct experiments on the PASCAL VOC 2012 segmentation benchmark dataset [4] and use DeepLab-ASPP-L [1] as the segmentation framework. For each compared method in Table 5, we replace the 3×3 convolutional layers in conv4 and conv5 of Table 1 with atrous convolutions [1] and plug a head of Atrous Spatial Pyramid Pooling (ASPP) [1] onto the final feature maps of conv5. We adopt the same training strategy for all networks, following [1], for a fair comparison.

Table 5 shows the results of different convolutional neural networks. It can be observed that the proposed DPN-92 has the highest overall mIoU accuracy. Compared with ResNet-101, which has a larger model size and a higher computational cost, the proposed DPN-92 further improves the IoU for most categories and improves the overall mIoU by a clear margin. Considering that ResNeXt-101 (32×4d) only improves the overall mIoU by an absolute value of 0.5 compared with ResNet-101, the proposed DPN-92 gains several times that improvement. The better results once again demonstrate that the proposed Dual Path Network is capable of learning a better feature representation for dense prediction.

6 Conclusion

In this paper, we revisited the densely connected networks, bridged the densely connected networks with Higher Order RNNs, and proved that residual networks are essentially densely connected networks with shared connections. Based on this new explanation, we proposed a dual path architecture that enjoys benefits from both sides. The novel network, DPN, is then developed based on this dual path architecture. Experiments on the image classification task demonstrate that the DPN enjoys high accuracy, small model size, low computational cost and low GPU memory consumption, and is thus extremely useful not only for research but also for real-world applications. Experiments on the object detection and semantic segmentation tasks show that the proposed DPN can also benefit other tasks by simply replacing the base network.

References
[1] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. arXiv preprint arXiv:1606.00915, 2016.
[2] Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274, 2015.
[3] Yunpeng Chen, Xiaojie Jin, Bingyi Kang, Jiashi Feng, and Shuicheng Yan. Sharing residual units through collective tensor factorization in deep neural networks. arXiv preprint arXiv:1703.02180, 2017.
[4] Mark Everingham, S. M. Ali Eslami, Luc Van Gool, Christopher K. I. Williams, John Winn, and Andrew Zisserman. The PASCAL visual object classes challenge: A retrospective. IJCV, 111(1):98–136, 2014.
[5] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[6] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, pages 630–645. Springer, 2016.
[7] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. arXiv preprint arXiv:1703.06870, 2017.
[8] Gao Huang, Zhuang Liu, Kilian Q. Weinberger, and Laurens van der Maaten. Densely connected convolutional networks. arXiv preprint arXiv:1608.06993, 2016.
[9] Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1646–1654, 2016.
[10] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[11] Chen-Yu Lee, Patrick W. Gallagher, and Zhuowen Tu. Generalizing pooling functions in convolutional neural networks: Mixed, gated, and tree. In Artificial Intelligence and Statistics, pages 464–472, 2016.
[12] Qianli Liao and Tomaso Poggio. Bridging the gaps between residual learning, recurrent neural networks and visual cortex. arXiv preprint arXiv:1604.03640, 2016.
[13] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.
[14] Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose estimation. In European Conference on Computer Vision, pages 483–499. Springer, 2016.
[15] Geoff Pleiss, Danlu Chen, Gao Huang, Tongcheng Li, Laurens van der Maaten, and Kilian Q. Weinberger. Memory-efficient implementation of DenseNets. arXiv preprint arXiv:1707.06990, 2017.
[16] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.
[17] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015. doi: 10.1007/s11263-015-0816-y.
[18] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[19] Rohollah Soltani and Hui Jiang. Higher order recurrent neural networks. arXiv preprint arXiv:1605.00064, 2016.
[20] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alex Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. arXiv preprint arXiv:1602.07261, 2016.
[21] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. arXiv preprint arXiv:1611.05431, 2016.
[22] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
[23] Xingcheng Zhang, Zhizhong Li, Chen Change Loy, and Dahua Lin. PolyNet: A pursuit of structural diversity in very deep networks. arXiv preprint arXiv:1611.05725, 2016.
[24] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Antonio Torralba, and Aude Oliva. Places: An image database for deep scene understanding. arXiv preprint arXiv:1610.02055, 2016.

Table 6: Single-crop validation error rate (%) on the ImageNet-1k validation set, without and with Mean-Max Pooling.

Method | Model size | GFLOPs | w/o Mean-Max Pooling top-1 | w/o top-5 | w/ Mean-Max Pooling top-1 | w/ top-5
DPN-92 (32×3d) | 145 MB | 6.5 | 19.34 | 4.66 | |
DPN-98 (40×4d) | 236 MB | 11.7 | 18.94 | 4.44 | |
DPN-131 (40×4d) | 304 MB | 16.0 | 18.62 | 4.23 | 18.55 | 4.16

A Testing with Mean-Max Pooling
Here, we introduce a new testing technique using Mean-Max Pooling, which can further improve the performance of a well-trained CNN in the testing phase without any noticeable computational overhead. This testing technique is very effective for testing images whose size is larger than the training crops. The idea is to first convert a trained CNN model into a fully convolutional network [13] and then insert a Mean-Max Pooling layer (a.k.a. Max-Avg Pooling [11]) on top of the resulting spatial score map.
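In code, the extra layer amounts to a few lines. The following PyTorch-style sketch is our own reading of the description above: it assumes the network has already been convolutionalized so that it emits a spatial class-score map, and the equal 0.5/0.5 weighting of global max and global average pooling is our assumption, in the spirit of Max-Avg Pooling [11].

```python
import torch
import torch.nn.functional as F

def mean_max_pool(score_map: torch.Tensor) -> torch.Tensor:
    """Collapse a spatial class-score map of shape (N, C, H, W) to (N, C) scores by
    averaging global max pooling and global average pooling (assumed equal weighting)."""
    max_pooled = F.adaptive_max_pool2d(score_map, 1).flatten(1)
    avg_pooled = F.adaptive_avg_pool2d(score_map, 1).flatten(1)
    return 0.5 * (max_pooled + avg_pooled)
```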