Graph-guided Architecture Search for Real-time Semantic Segmentation

Peiwen Lin*, Peng Sun*, Guangliang Cheng, Sirui Xie, Xi Li†, Jianping Shi†
SenseTime Research; Zhejiang University; University of California, Los Angeles
{linpeiwen,chengguangliang,shijianping}@sensetime.com, {sunpeng1996,xilizju}@zju.edu.cn, [email protected]

Abstract
Designing a lightweight semantic segmentation network often requires researchers to find a trade-off between performance and speed, which is always empirical due to the limited interpretability of neural networks. In order to release researchers from these tedious mechanical trials, we propose a Graph-guided Architecture Search (GAS) pipeline to automatically search real-time semantic segmentation networks. Unlike previous works that use a simplified search space and stack a repeatable cell to form a network, we introduce a novel search mechanism with a new search space, where a lightweight model can be effectively explored through cell-level diversity and a latency-oriented constraint. Specifically, to produce the cell-level diversity, the cell-sharing constraint is eliminated by making each cell independent. Then a graph convolution network (GCN) is seamlessly integrated as a communication mechanism between cells. Finally, a latency-oriented constraint is endowed into the search process to balance speed and performance. Extensive experiments on the Cityscapes and CamVid datasets demonstrate that GAS achieves a new state-of-the-art trade-off between accuracy and speed. In particular, on the Cityscapes dataset, GAS achieves the new best performance of 73.5% mIoU at a speed of 108.4 FPS on a Titan Xp.
1. Introduction
As a fundamental topic in computer vision, semantic segmentation [26, 51, 9, 7] aims at predicting pixel-level labels for images. Leveraging the strong ability of CNNs [38, 18, 19, 12], many works have achieved remarkable performance on public semantic segmentation benchmarks [13, 15, 4]. To pursue higher accuracy, state-of-the-art models have become increasingly larger and deeper, and thus require high computational resources and large memory overhead, which makes them difficult to deploy on resource-constrained platforms such as mobile devices, robots, and self-driving cars.

* The first two authors contributed equally to this paper. † Corresponding author.

Figure 1. The inference speed and mIoU for different networks on the Cityscapes test set with only fine training data. Our GAS achieves the state-of-the-art trade-off between speed and performance. The mark * denotes that the speed is remeasured on a Titan Xp. Best viewed in color.

Recently, much research has focused on designing and improving CNN models with light computation cost and satisfactory segmentation accuracy. For example, some works [1, 34] reduce the computation cost via pruning algorithms, and ICNet [50] uses an image cascade network to incorporate multi-resolution inputs. BiSeNet [47] and DFANet [22] utilize a lightweight backbone to speed up, equipped with a well-designed feature fusion or aggregation module to remedy the accuracy drop. To achieve such designs, researchers acquire expertise in architecture design through enormous trial and error to carefully balance accuracy and resource-efficiency.

To design more effective segmentation networks, some researchers have explored automatic neural architecture search (NAS) methods [25, 53, 30, 21, 36, 5, 44] and achieved excellent results. For example, Auto-Deeplab [24] searches cell structures and the downsampling strategy together in the same round. CAS [49] searches an architecture with a customized resource constraint and a multi-scale module which has been widely used in the semantic segmentation field [9, 51]. Particularly, CAS has achieved state-of-the-art segmentation performance in the mobile setting [50, 22, 47]. Like general NAS methods such as ENAS [36], DARTS [25] and SNAS [44], CAS searches for two types of cells (i.e., normal cell and reduction cell) and then repeatedly stacks the identical cells to form the whole network.
This simplifies the search process, but also increases the difficulty of finding a good trade-off between performance and speed due to the limited cell diversity. As shown in Figure 2(a), the cell is prone to learn a complicated structure to pursue high performance without any resource constraint, and the whole network then suffers high latency. When a low-computation constraint is applied, the cell structures tend to be over-simplified, as shown in Figure 2(b), which may not achieve satisfactory performance.

Different from traditional search algorithms with a simplified search space, in this paper we propose a novel search mechanism with a new search space, where a lightweight model with high performance can be fully explored through well-designed cell-level diversity and a latency-oriented constraint. On one hand, to encourage cell-level diversity, we make each cell structure independent, so that cells with different computation costs can be flexibly stacked to form a lightweight network, as in Figure 2(c). In this way, simple cells can be applied to stages with high computation cost to achieve low latency, while complicated cells can be chosen in deep layers with low computation cost for high accuracy. On the other hand, we apply a real-world latency-oriented constraint to the search process, through which the searched model can achieve a better trade-off between performance and latency.

However, simply endowing cells with independence in exploring their own structures enlarges the search space and makes the optimization more difficult, which causes accuracy degradation as shown in Figure 5(a) and Figure 5(b). To address this issue, we incorporate a Graph Convolution Network (GCN) [20] as the communication deliverer between cells.
Our idea is inspired by [29]: different cells can be treated as multiple agents, whose achievement of social welfare may require communication between them. Specifically, in the forward process, starting from the first cell, the information of each cell is propagated to the next adjacent cell with a GCN. Our ablation study shows that this communication mechanism tends to guide cells to select less-parametric operations while achieving satisfactory accuracy. We name the method Graph-guided Architecture Search (GAS).

Figure 2. (a) The network stacked by complicated cells results in high latency and high performance. (b) The network stacked by simple cells leads to low latency and low performance. (c) The cell diversity strategy, i.e., each cell possesses its own independent structure, can flexibly construct a high-accuracy lightweight network. Best viewed in color.

We conduct extensive experiments on the standard Cityscapes [13] and CamVid [4] benchmarks. Compared to other real-time methods, our method locates in the top-right area of Figure 1, i.e., the state-of-the-art trade-off between performance and latency.

The main contributions can be summarized as follows:
• We propose a novel search framework for real-time semantic segmentation, with a new search space in which a lightweight model with high performance can be effectively explored.
• We integrate the graph convolution network seamlessly into neural architecture search as a communication mechanism between independent cells.
• The lightweight segmentation network searched with GAS is customizable in real applications. Notably, GAS has achieved 73.5% mIoU on the Cityscapes test set at 108.4 FPS on an NVIDIA Titan Xp.
2. Related Work
Semantic Segmentation Methods
FCN [26] is the pioneer work in semantic segmentation. To improve segmentation performance, some remarkable works have utilized various heavy backbones [38, 18, 19, 12] or effective modules to capture multi-scale context information [51, 7, 8]. These outstanding works are designed for high-quality segmentation, which is inapplicable to real-time applications. In terms of efficient segmentation methods, there are two mainstreams. One is to employ a relatively lighter backbone (e.g., ENet [34]) or introduce some efficient operations (depth-wise dilated convolution). DFANet [22] utilizes a lightweight backbone to speed up and equips a cross-level feature aggregation module to remedy the accuracy drop. The other is based on multi-branch algorithms that consist of more than one path. For example, ICNet [50] proposes to use a multi-scale image cascade to speed up inference. BiSeNet [47] decouples the extraction of spatial and context information using two paths.

Figure 3. Illustration of our Graph-Guided Network Architecture Search. In reduction cells, all the operations adjacent to the input nodes are of stride two. (a) The backbone network: it is stacked by a series of independent normal and reduction cells, preceded by three 3x3 convolutions (strides 2, 1, 2) and followed by an ASPP module. (b) The GCN-Guided Module (GGM): it propagates information between adjacent cells. α_k and α_{k−1} represent the architecture parameters for cell k and cell k−1, respectively, and α′_k is the architecture parameter updated by the GGM for cell k. The dotted lines indicate that the GGM is only utilized in the search progress. Best viewed in color.
Neural Architecture Search
Neural Architecture Search (NAS) aims at automatically searching network architectures. Most existing architecture search works are based on either reinforcement learning [52, 17] or evolutionary algorithms [37, 11]. Though they can achieve satisfactory performance, they need thousands of GPU hours. To address this cost, one-shot methods [2, 3] have been developed, which train a parent network from which each sub-network can inherit the weights. They can be roughly divided into cell-based and layer-based methods according to the type of search space. Among cell-based methods, ENAS [36] proposes a parameter-sharing strategy among sub-networks, and DARTS [25] relaxes the discrete architecture distribution into continuous deterministic weights, such that they can be optimized with gradient descent. SNAS [44] proposes novel search gradients that train neural operation parameters and architecture distribution parameters in the same round of back-propagation. Moreover, there are also excellent works [10, 32] that reduce the difficulty of optimization by gradually decreasing the size of the search space. Among layer-based methods, FBNet [42], MnasNet [39] and ProxylessNAS [5] use a multi-objective search approach that optimizes both accuracy and real-world latency.

In the field of semantic segmentation, DPC [6] is the pioneer work, introducing meta-learning techniques into the network search problem. Auto-Deeplab [24] searches cell structures and the downsampling strategy together in the same round. More recently, CAS [49] searches an architecture with a customized resource constraint and a multi-scale module which has been widely used in the semantic segmentation field, and [31] over-parameterises the architecture during training via a set of auxiliary cells using reinforcement learning. NAS has also been used in object detection, such as NAS-FPN [16], DetNAS [48] and Auto-FPN [45].
Graph Convolution Network
Convolutional neural networks on graph-structured data are an emerging topic in deep learning research. Kipf [20] presents a scalable approach for graph-structured data based on an efficient variant of convolutional neural networks that operates directly on graphs, for better information propagation. Since then, Graph Convolution Networks (GCNs) [20] have been widely used in many domains, such as video classification [41] and action recognition [46]. In this paper, we apply GCNs to model the relationship of adjacent cells in network architecture search.

Figure 4. The structure of a cell in our GAS. Each colored edge represents one candidate operation.
3. Methods
As shown in Figure 3, GAS searches, with the GCN-Guided Module (GGM), for an optimal network constructed by a series of independent cells. In the search process, we take latency into consideration to obtain a computationally efficient network. This search problem can be formulated as:

    min_{a ∈ A}  L_val + β · L_lat    (1)

where A denotes the search space, and L_val and L_lat are the validation loss and the latency loss, respectively. Our goal is to search for an optimal architecture a ∈ A that achieves the best trade-off between performance and speed. In this section, we describe the three main components of GAS: 1) network architecture search; 2) the GCN-Guided Module; 3) latency-oriented optimization.

3.1. Network Architecture Search

As shown in Figure 3(a), the whole backbone takes an image as input, which is first filtered with three convolutional layers followed by a series of independent cells. The ASPP [9] module is subsequently used to extract the multi-scale context for the final prediction.

A cell is a directed acyclic graph (DAG), as shown in Figure 4. Each cell has two input nodes i_1 and i_2, N ordered intermediate nodes, denoted by N = {x_1, ..., x_N}, and an output node which outputs the concatenation of all intermediate nodes. Each node represents a latent representation (e.g., a feature map) in the network, and each directed edge in this DAG represents a candidate operation (e.g., conv, pooling). The number of intermediate nodes N is 2 in our work. Each intermediate node takes all its previous nodes as input. In this way, x_1 has two inputs I_1 = {i_1, i_2}, and node x_2 takes I_2 = {i_1, i_2, x_1} as inputs. The intermediate node x_i is calculated by:

    x_i = Σ_{c ∈ I_i} Õ_{h,i}(c)    (2)

where Õ_{h,i} is the selected operation at edge (h, i).
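To make Eq. (2) concrete, here is a minimal, stdlib-only sketch of a cell's forward pass. The toy "feature maps" are plain floats, and the two operations (`identity`, `scale`) are illustrative stand-ins, not the actual search-space operations.

```python
# Toy sketch of Eq. (2): each intermediate node sums the output of the
# selected operation applied to every one of its predecessor nodes.
# "Feature maps" are floats here; real operations would be conv/pool layers.

def identity(x):        # stands in for a skip connection
    return x

def scale(x):           # stands in for a parametric op (e.g. a conv)
    return 0.5 * x

def cell_forward(i1, i2, selected_ops):
    """selected_ops[(h, i)] is the chosen operation on edge h -> i.
    Nodes 0, 1 are the cell inputs; nodes 2, 3 are the N = 2
    intermediate nodes, each fed by all previous nodes."""
    nodes = {0: i1, 1: i2}
    for i in (2, 3):
        nodes[i] = sum(selected_ops[(h, i)](nodes[h]) for h in range(i))
    return nodes[2] + nodes[3]   # stand-in for concatenating x_1 and x_2

ops = {(0, 2): identity, (1, 2): scale,
       (0, 3): identity, (1, 3): scale, (2, 3): identity}
out = cell_forward(1.0, 2.0, ops)   # node2 = 1 + 1 = 2; node3 = 1 + 1 + 2 = 4
```

In the real network each node is a tensor and the output concatenates the intermediate nodes along the channel dimension; the summation structure over predecessors is the part this sketch preserves.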
Concretely, each edge is associated with a one-hot random variable which is multiplied as a mask to all possible operations O_{h,i} = (o^1_{h,i}, o^2_{h,i}, ..., o^M_{h,i}) on this edge. We denote the one-hot random variable as Z_{h,i} = (z^1_{h,i}, z^2_{h,i}, ..., z^M_{h,i}), where M is the number of candidate operations. The intermediate nodes during the search process are then:

    x_i = Σ_{c ∈ I_i} Õ_{h,i}(c) = Σ_{c ∈ I_i} Σ_{m=1}^{M} z^m_{h,i} o^m_{h,i}(c)    (3)

To make p(Z) differentiable, reparameterization [27] is used to relax the discrete architecture distribution to be continuous:

    Z_{h,i} = f_{α_{h,i}}(G_{h,i}) = softmax((log α_{h,i} + G_{h,i}) / λ)    (4)

where α_{h,i} is the architecture parameter at edge (h, i), G_{h,i} = −log(−log(U_{h,i})) is a vector of Gumbel random variables, U_{h,i} is a uniform random variable, and λ is the temperature of the softmax.

For the set of candidate operations O, we use only 8 kinds of operations to better balance speed and performance: the skip connection, the zero operation, and six convolution and pooling operations (including separable and dilated convolutions).

3.2. GCN-Guided Module

With cells independent of each other, the inter-cell relationship becomes very important for searching efficiently. We propose a novel GCN-Guided Module (GGM) to naturally bridge the operation information between adjacent cells. The overall architecture of our GGM is shown in Figure 3(b). Inspired by [41], the GGM represents the communication between adjacent cells as a graph and performs reasoning on the graph for information delivery. Specifically, we utilize the similarity relations of edges in adjacent cells to construct the graph, where each node represents one edge in the cells. In this way, the state changes of the previous cell can be delivered to the current cell by reasoning on this graph.

As stated in Section 3.1, let α_k represent the architecture parameter matrix for cell k; the dimension of α_k is p × q, where p and q represent the number of edges and the number of candidate operations, respectively.
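Before describing how α_{k−1} is fused into cell k, the relaxation in Eq. (4) can be sketched as follows (stdlib-only; the fixed uniform samples are only there to make the example reproducible):

```python
import math
import random

def gumbel_softmax(alpha, lam, u=None):
    """Soften a one-hot choice over M candidate ops (Eq. 4).
    alpha: positive architecture weights for one edge; lam: temperature;
    u: optional uniform samples in (0, 1), drawn randomly if omitted."""
    if u is None:
        u = [random.random() for _ in alpha]
    g = [-math.log(-math.log(ui)) for ui in u]              # Gumbel noise
    logits = [(math.log(a) + gi) / lam for a, gi in zip(alpha, g)]
    mx = max(logits)                                        # stable softmax
    exp = [math.exp(v - mx) for v in logits]
    s = sum(exp)
    return [v / s for v in exp]

# With identical noise and a near-zero temperature, z approaches the
# one-hot vector of the largest alpha, recovering a discrete choice.
z = gumbel_softmax([1.0, 2.0, 4.0], lam=0.03, u=[0.5, 0.5, 0.5])
```

This also illustrates why the temperature is annealed during search (Section 4): a high λ gives a smooth mixture over operations, while a small λ concentrates z on a single operation.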
Similarly, the architecture parameter α_{k−1} for cell k−1 is also a p × q matrix. To fuse the architecture parameter information of the previous cell k−1 into the current cell and generate the updated α′_k, we model the information propagation between cell k−1 and cell k as follows:

    α′_k = α_k + γ Φ_2(G(Φ_1(α_{k−1}), Adj))    (5)

where Adj represents the adjacency matrix of the reasoning graph between cells k and k−1, and the function G denotes the Graph Convolution Network (GCN) [20] that performs reasoning on the graph. Φ_1 and Φ_2 are two different transformations implemented by 1D convolutions. Specifically, Φ_1 maps the original architecture parameters to an embedding space, and Φ_2 transfers them back into the source space after the GCN reasoning. γ controls the fusion of the two kinds of architecture parameter information.

For the function G, we construct the reasoning graph between cell k−1 and cell k by their similarity. Given an edge in cell k, we calculate the similarity between this edge and all other edges in cell k−1, and a softmax function is used for normalization. Therefore, the adjacency matrix Adj of the graph between two adjacent cells k and k−1 can be established by:

    Adj = Softmax(φ_1(α_k) φ_2(α_{k−1})^T)    (6)

where φ_1(α_k) = α_k w_1 and φ_2(α_{k−1}) = α_{k−1} w_2 are two different transformations of the architecture parameters, and the parameters w_1 and w_2 are both q × q weights which can be learned via back-propagation. The result Adj is a p × p matrix.

Based on this adjacency matrix Adj, we use the GCN to perform information propagation on the graph as shown in Equation 7. A residual connection is added to each layer of the GCN. The GCN allows us to compute the response of a node based on its neighbors defined by the graph relations, so performing graph convolution is equivalent to performing message propagation on the graph.
    G(Φ_1(α_{k−1}), Adj) = Adj Φ_1(α_{k−1}) W^g_{k−1} + Φ_1(α_{k−1})    (7)

where W^g_{k−1} denotes the GCN weight with dimension d × d, which can be learned via back-propagation.

The proposed GGM seamlessly integrates the graph convolution network into neural architecture search, bridging the operation information between adjacent cells.

3.3. Latency-Oriented Optimization

To obtain a real-time semantic segmentation network, we take the real-world latency into consideration during the search process, which orients the search toward an optimal lightweight model. Specifically, we create a GPU-latency lookup table [5, 42, 49, 39] which records the inference latency of each candidate operation. During the search process, each candidate operation m at edge (h, i) is assigned a cost lat^m_{h,i} given by the pre-built lookup table. In this way, the total latency for cell k is accumulated as:

    lat_k = Σ_{h,i} Σ_{m=1}^{M} z^m_{h,i} lat^m_{h,i}    (8)

where z^m_{h,i} is the softened one-hot random variable as stated in Section 3.1. Given an architecture a, the total latency cost is estimated as:

    LAT(a) = Σ_{k=1}^{K} lat_k    (9)

where K refers to the number of cells in architecture a. The latency lat^m_{h,i} of each operation is a constant, and thus the total latency loss is differentiable with respect to the architecture parameters α_{h,i}. The total loss function is designed as follows:

    L(a, w) = CE(a, w_a) + β log(LAT(a))    (10)

where CE(a, w_a) denotes the cross-entropy loss of architecture a with parameters w_a, LAT(a) denotes the overall latency of architecture a, measured in micro-seconds, and the coefficient β controls the balance between accuracy and latency. The architecture parameters α and the weights w are optimized in the same round of back-propagation.
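The latency terms of Eqs. (8)-(10) reduce to a weighted table lookup. Here is a sketch under a hypothetical lookup table; the operation names and latency values below are made up for illustration, not measured numbers:

```python
import math

# Hypothetical per-operation latencies (micro-seconds) from a lookup table.
LATENCY_LUT = {"conv3x3": 95.0, "sep_conv3x3": 60.0, "skip": 5.0, "zero": 0.0}
OPS = list(LATENCY_LUT)

def expected_latency(cells):
    """cells: list of {edge: softened one-hot z over OPS} (Eqs. 8-9).
    Each lat^m_{h,i} is a constant, so this sum is differentiable in z."""
    return sum(zm * LATENCY_LUT[op]
               for edges in cells
               for z in edges.values()
               for zm, op in zip(z, OPS))

def total_loss(ce_loss, cells, beta):
    """Eq. (10): cross-entropy plus the log-latency penalty."""
    return ce_loss + beta * math.log(expected_latency(cells))

cells = [{"e0": [0.7, 0.1, 0.1, 0.1],    # mostly conv3x3
          "e1": [0.0, 0.0, 1.0, 0.0]}]   # exactly skip
lat = expected_latency(cells)            # 0.7*95 + 0.1*60 + 0.1*5 + 5.0 = 78.0
```

Because the penalty is applied to log(LAT(a)), doubling the latency adds a constant to the loss regardless of the absolute scale, which is one plausible reason for the log in Eq. (10).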
4. Experiments
In this section, we conduct extensive experiments to verify the effectiveness of our GAS. First, we compare the network searched by our method with other works on two standard benchmarks. Second, we perform ablation studies on the GCN-Guided Module and the latency optimization settings, and close with an insight about the GCN-Guided Module.
Datasets
In order to verify the effectiveness and robustness of our method, we evaluate it on the Cityscapes [13] and CamVid [4] datasets. Cityscapes [13] is a publicly released dataset for urban scene understanding. It contains 5,000 high-quality, finely annotated pixel-level images (2,975, 500, and 1,525 for the training, validation, and test sets, respectively) with size 1024 × 2048. CamVid [4] is a street-scene dataset whose images have size 960 × 720 and 11 semantic categories.

Evaluation Metrics

For evaluation, we use mean class-wise intersection over union (mIoU), network forward time (Latency), and Frames Per Second (FPS) as the evaluation metrics.
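For reference, mIoU can be computed from a per-class confusion matrix; a minimal sketch (the helper name and the 2-class toy matrix are illustrative):

```python
def mean_iou(conf):
    """conf[c][p]: number of pixels of ground-truth class c predicted as p.
    IoU per class = TP / (TP + FP + FN); mIoU averages over classes."""
    n = len(conf)
    ious = []
    for c in range(n):
        tp = conf[c][c]
        fn = sum(conf[c]) - tp                       # missed pixels of class c
        fp = sum(conf[r][c] for r in range(n)) - tp  # pixels wrongly called c
        denom = tp + fp + fn
        if denom:                                    # skip absent classes
            ious.append(tp / denom)
    return sum(ious) / len(ious)

m = mean_iou([[8, 2],
              [1, 9]])   # class 0: 8/11, class 1: 9/12
```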
Implementation Details

We conduct all experiments using PyTorch 0.4 [35] on a workstation, and the inference time in all experiments is reported on one Nvidia Titan Xp GPU.

The whole pipeline contains three sequential steps: search, pretraining and finetuning. It starts with the search process on the target dataset, which yields the lightweight architecture according to the optimized α; this model is then pretrained on ImageNet [14] and subsequently finetuned on the specific dataset for 200 epochs. In the search process, the architecture contains 14 cells and each cell has N = 2 intermediate nodes. With speed in mind, the initial channel count of the network is 8. For the training hyper-parameters, the mini-batch size is set to 16. The architecture parameters α are optimized by Adam, with initial learning rate 0.001, β = (0.5, 0.999) and weight decay 0.0001. The network parameters are optimized using SGD with momentum 0.9, weight decay 0.001, and a cosine scheduler that decays the learning rate from 0.025 to 0.001. For the Gumbel softmax, we set the initial temperature λ in Equation 4 to 1.0 and gradually decrease it to a minimum value of 0.03. The search on Cityscapes takes approximately 10 hours with 16 Titan Xp GPUs.

For finetuning, we train the network with mini-batch size 8 and an SGD optimizer with a 'poly' scheduler that decays the learning rate from 0.01 to zero. Following [43], the online bootstrapping strategy is applied in the finetuning process. For data augmentation, we use random flipping and random resizing with scale between 0.5 and 2.0. Finally, we randomly crop the image to a fixed size for training.

For the GCN-Guided Module, we use one Graph Convolution Network (GCN) [20] between two adjacent cells, and each GCN contains one layer of graph convolutions. The kernel size of the GCN parameter W in Equation 7 is 64 × 64. We set γ in Equation 5 to 0.5 in our experiments.

In this part, we compare the model searched by GAS with other existing real-time segmentation methods on semantic segmentation datasets. The inference time is measured on an Nvidia Titan Xp GPU, and the speeds of other methods reported on a Titan Xp GPU in CAS [49] are used for fair comparison. Moreover, the speed is remeasured on a Titan Xp if the original paper reports it on a different GPU and it is not mentioned in CAS [49].
Results on Cityscapes.
We evaluate the network searched by GAS on the Cityscapes test set. The validation set is added to train the network before submitting to
Method | Input Size | mIoU (%) | Latency (ms) | FPS
FCN-8S [26] | 512×1024 | 65.3 | 227.23 | 4.4
PSPNet [51] | 713×713 | 81.2 | 1288.0 | 0.78
DeepLabV3* [7] | 769×769 | 81.3 | 769.23 | 1.3
SegNet [1] | 640×360 | 57.0 | 30.3 | 33
ENet [34] | 640×360 | 58.3 | 12.7 | 78.4
SQ [40] | 1024×2048 | 59.8 | 46.0 | 21.7
ICNet [50] | 1024×2048 | 69.5 | 26.5 | 37.7
SwiftNet [33] | 1024×2048 | 75.1 | 26.2 | 38.1
ESPNet [28] | 1024×512 | 60.3 | 8.2 | 121.7
BiSeNet [47] | 768×1536 | 68.4 | 9.52 | 105.8
DFANet A§ [22] | 1024×1024 | 71.3 | 10.0 | 100.0
DFANet A† [22] | 1024×1024 | 71.3 | - | -
CAS* [49] | 768×1536 | 72.3 | 9.25 | 108.0
GAS (ours) | - | 73.5 | - | 108.4
Table 1. Results on the Cityscapes test set. The mark § represents that the speed is measured on a TitanX, and the mark † represents that the speed is remeasured on a Titan Xp.

the Cityscapes online server. Following [47, 49], GAS takes as input an image with height 769, and achieves 73.5% mIoU at 108.4 FPS.

Results on CamVid.
We directly transfer the network searched on Cityscapes to CamVid to verify the transferability of GAS. Table 2 shows the comparison results with other methods. With input size 720 × 960, GAS achieves 72.8% mIoU at 153.1 FPS.

Note that after merging the BN layers for DFANet, there is still a speed gap between the original paper and our measurement. We suspect that this is caused by an inconsistency of implementation platforms, in which DFANet has an optimized depth-wise convolution (DW-Conv). GAS also has many candidate operations using DW-Conv, so the speed of GAS is still capable of beating it if DW-Conv is optimized correctly as in DFANet.

Method | mIoU (%) | Latency (ms) | FPS
SegNet [1] | 55.6 | 34.01 | 29.4
ENet [34] | 51.3 | 16.33 | 61.2
ICNet [50] | 67.1 | 28.98 | 34.5
BiSeNet [47] | 65.6 | - | -
DFANet A [22] | 64.7 | 8.33 | 120
CAS [49] | 71.2 | 5.92 | 169
GAS | 72.8 | 6.53 | 153.1

Table 2. Results on the CamVid test set with resolution 960 × 720.
We propose the GCN-Guided Module (GGM) to build theconnection between cells. To verify the effectiveness of theGGM, we conduct a series of experiments with differentstrategies: a) network stacked by shared cell; b) networkstacked by independent cell; c) based on strategy-b, usingfully connected layer to infer the relationship between cells;d) based on strategy-b, using GGM to infer the relationshipbetween cells. Experimental results are shown in Figure 5.The performance reported here is the average mIoU overfive repeated experiments on the Cityscapes validation setwith latency loss weight β = 0.005. The numbers belowthe horizontal axis are the average model size of five archi-tectures ( e.g. i.e. setting (c)). (a) Cell shared(6.5M) (b) Cell independent(4.24M) (c) Cell independent + FC(3.12M) (d) Cell independent + GCN(2.18M) Effectiveness of GCN-Guided Module mIoU (%)
Figure 5. Ablation study for the effectiveness of GCN-GuidedModule on Cityscapes validation dataset.
Best viewed in color.
Comparison against Random Search
As discussed in [23], random search is a competitive baseline for hyper-parameter optimization. To further verify the effectiveness of the GCN-Guided Module, we randomly sample ten architectures from the search space and evaluate them on the Cityscapes validation set with ImageNet pretraining. Specifically, we try two types of random settings in our experiments: a) fully random search without any constraint; b) randomly selecting networks from the search space that meet a speed requirement of about 108 FPS. The results are shown in Table 3, in which each value is the average over ten random architectures. In summary, the network searched by GAS achieves an excellent trade-off between performance and latency, while random search results in either high overhead without a latency constraint or low performance with one.
Methods | mIoU (%) | FPS
GAS | 72.3 | 108.2
Random setting (a) | 69.6 | 61.2
Random setting (b) | 65.8 | 105.6

Table 3. Comparison to random search on the Cityscapes validation set.
Dimension Selection
The dimension selection of the GCN weight W in Equation 7 is also important, so we conduct experiments with different GCN weight dimensions (denoted by d). Experimental results are shown in Table 4, in which the values are the average mIoU over five repeated experiments on the Cityscapes validation set with latency loss weight β = 0.005. The results indicate that GAS achieves the best performance when d = 64.

Methods | mIoU (%) | FPS
GCN with d = 16 | 71.6 | 108.6
GCN with d = 32 | 71.8 | 102.2
GCN with d = 64 | 72.4 | 108.4
GCN with d = 128 | 72.1 | 104.1
GCN with d = 256 | 71.5 | 111.2

Table 4. Ablation study for different GCN weight dimensions of the GCN-Guided Module.
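Putting Eqs. (5)-(7) together, the GGM update whose weight dimension d is varied above can be sketched with plain lists. To keep the sketch small, the 1D-convolution embeddings Φ_1, Φ_2 and the linear maps φ_1, φ_2 are taken as identities, and all sizes are toy values; this is an illustration of the update structure, not the trained module.

```python
import math

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def row_softmax(a):
    out = []
    for row in a:
        m = max(row)
        e = [math.exp(v - m) for v in row]
        s = sum(e)
        out.append([v / s for v in e])
    return out

def ggm_update(alpha_k, alpha_prev, w_g, gamma):
    """alpha_k, alpha_prev: p x q architecture parameters of cells k, k-1;
    w_g: the GCN weight. Returns the updated alpha'_k of Eq. (5)."""
    # Eq. (6): edge-similarity adjacency (p x p), softmax-normalized per row
    adj = row_softmax(matmul(alpha_k, list(map(list, zip(*alpha_prev)))))
    # Eq. (7): one graph convolution plus a residual connection
    gcn = matmul(matmul(adj, alpha_prev), w_g)
    msg = [[g + a for g, a in zip(gr, ar)] for gr, ar in zip(gcn, alpha_prev)]
    # Eq. (5): fuse the propagated message into the current parameters
    return [[ak + gamma * m for ak, m in zip(row_k, row_m)]
            for row_k, row_m in zip(alpha_k, msg)]

alpha_k    = [[1.0, 0.0], [0.0, 1.0]]   # p = 2 edges, q = 2 ops
alpha_prev = [[0.5, 0.5], [0.2, 0.8]]
w_identity = [[1.0, 0.0], [0.0, 1.0]]
updated = ggm_update(alpha_k, alpha_prev, w_identity, gamma=0.5)
```

With positive α and γ > 0 every entry of α′_k is pulled toward the previous cell's parameters, which is the communication effect the ablation in Figure 5 isolates.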
Reasoning Graph
For the GCN-Guided Module, in addition to the construction described in Section 3.2, we also try another way to build the reasoning graph. Specifically, we treat each candidate operation in a cell as a node in the reasoning graph. Given the architecture parameter α_k for cell k with dimension p × q, we first flatten α_k and α_{k−1} into one-dimensional vectors α′_k and α′_{k−1}, and then perform matrix multiplication to get the adjacency matrix Adj = α′_k (α′_{k−1})^T. Different from the "edge-similarity" reasoning graph of Section 3.2, we call this graph the "operation-identity" reasoning graph. We compare the two types of graphs on the Cityscapes validation set under the same latency loss weight β = 0.005; the comparison results are shown in Table 5.

Reasoning Graph | mIoU (%) | FPS
Edge-similarity | 72.4 | 108.4
Operation-identity | 70.9 | 102.2

Table 5. Comparison results between reasoning graphs over edges and over operations.
Intuitively, the "operation-identity" graph provides more fine-grained information about operation selection for other cells, but it also breaks the overall properties of an edge, and thus does not consider the other operations on the same edge when making decisions. After visualizing the searched networks, we also found that the "operation-identity" reasoning graph tends to make cells select the same operation for all edges, which increases the difficulty of trading off performance against latency. This can also be verified from the results in Table 5. We therefore choose the "edge-similarity" construction described in Section 3.2.
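The "operation-identity" alternative above amounts to an outer product of the flattened parameter matrices; a sketch with toy 1 × 2 parameter matrices:

```python
def operation_identity_adj(alpha_k, alpha_prev):
    """Treat every operation (not every edge) as a graph node: flatten the
    p x q matrices and take their outer product, giving (p*q) x (p*q)."""
    v_k = [x for row in alpha_k for x in row]
    v_prev = [x for row in alpha_prev for x in row]
    return [[a * b for b in v_prev] for a in v_k]

adj = operation_identity_adj([[1.0, 2.0]], [[3.0, 4.0]])
# adj == [[3.0, 4.0], [6.0, 8.0]]
```

Note the size difference against the edge-similarity graph: p × p nodes there versus (p·q) × (p·q) here, which is one way to see why this variant couples individual operations rather than whole edges.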
Network Visualization
We illustrate the network structure searched by GAS in the supplementary material. An interesting observation is that the operations selected by GAS with the GGM have fewer parameters and less computational complexity than those selected by GAS without the GGM, where more dilated or separable convolution kernels are preferred. This exhibits the emergence of a concept of burden sharing in a group of cells once they know how much the others are willing to contribute.
As mentioned above, GAS provides the ability to flexibly achieve a superior trade-off between performance and speed via the latency-oriented optimization. We conducted a series of experiments with different loss weights β in Equation 10. Figure 6 shows the variation of mIoU and latency as β changes. With a smaller β, we obtain a model with higher accuracy, and vice versa. When β increases from 0.0005 to 0.005, the latency decreases rapidly while the performance falls only slightly. But when β increases from 0.005 to 0.05, the performance drops quickly while the latency decline is fairly limited. Thus, in our experiments, we set β to 0.005. We can clearly see that the latency-oriented optimization is effective for balancing accuracy and latency.

Figure 6. The validation accuracy on the Cityscapes dataset for different latency constraints. Best viewed in color.

One concern is what kind of role the GCN plays in the search process. We suspect that its effectiveness derives from the following two aspects: 1) To search a lightweight network, we do not allow the cell structures to be shared with each other, in order to encourage structural diversity. Apparently, learning cells independently makes the search more difficult and does not guarantee better performance, so the GCN-Guided Module can be regarded as a regularization term on the search process. 2) We have discussed above that p(Z) is a fully factorizable joint distribution. As shown in Equation 4, p(Z_{h,i}) for the current cell becomes a conditional probability once the architecture parameter α_{h,i} depends on the architecture parameters of the previous cell. In this case, the GCN-Guided Module plays the role of modeling the condition in the probability distribution p(Z).
5. Conclusion & Discussion
In this paper, a novel Graph-guided Architecture Search (GAS) framework is proposed to tackle the real-time semantic segmentation task. Different from existing NAS approaches that stack the same searched cell into a whole network, GAS explores searching different cell architectures and adopts a graph convolution network to bridge the information connection among cells. In addition, a latency-oriented constraint is endowed into the search process to balance accuracy and speed. Extensive experiments have demonstrated that GAS performs much better than state-of-the-art real-time segmentation approaches. In the future, we will extend GAS in the following directions: 1) we will search networks directly for the segmentation and detection tasks without retraining; 2) we will explore deeper research on how to effectively combine NAS and graph convolution networks.
Acknowledgement
This paper is carried out at SenseTime Research in Beijing, China, and is supported by a key scientific technological innovation research project of the Ministry of Education, the Zhejiang Provincial Natural Science Foundation of China under Grant LR19F020004, and the Zhejiang University K.P. Chao's High Technology Development Foundation.

References

[1] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. PAMI, 39(12):2481–2495, 2017.
[2] Gabriel Bender, Pieter-Jan Kindermans, Barret Zoph, Vijay Vasudevan, and Quoc Le. Understanding and simplifying one-shot architecture search. In ICML, pages 549–558, 2018.
[3] Andrew Brock, Theodore Lim, James M Ritchie, and Nick Weston. Smash: One-shot model architecture search through hypernetworks. arXiv:1708.05344, 2017.
[4] Gabriel J. Brostow, Jamie Shotton, Julien Fauqueur, and Roberto Cipolla. Segmentation and recognition using structure from motion point clouds. In ECCV (1), pages 44–57, 2008.
[5] Han Cai, Ligeng Zhu, and Song Han. Proxylessnas: Direct neural architecture search on target task and hardware. arXiv:1812.00332, 2018.
[6] Liang-Chieh Chen, Maxwell D. Collins, Yukun Zhu, George Papandreou, Barret Zoph, Florian Schroff, Hartwig Adam, and Jonathon Shlens. Searching for efficient multi-scale architectures for dense image prediction. In NeurIPS, pages 8713–8724, 2018.
[7] Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for semantic image segmentation. CoRR, abs/1706.05587, 2017.
[8] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, pages 833–851, 2018.
[9] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. PAMI, 40(4):834–848, 2018.
[10] Xin Chen, Lingxi Xie, Jun Wu, and Qi Tian. Progressive differentiable architecture search: Bridging the depth gap between search and evaluation. arXiv:1904.12760, 2019.
[11] Yukang Chen, Qian Zhang, Chang Huang, Lisen Mu, Gaofeng Meng, and Xinggang Wang. Reinforced evolutionary neural architecture search. CoRR, abs/1808.00193, 2018.
[12] François Chollet. Xception: Deep learning with depthwise separable convolutions. In CVPR, pages 1800–1807, 2017.
[13] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In CVPR, pages 3213–3223, 2016.
[14] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, pages 248–255. IEEE, 2009.
[15] Mark Everingham, SM Ali Eslami, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes challenge: A retrospective. IJCV, 111(1):98–136, 2015.
[16] Golnaz Ghiasi, Tsung-Yi Lin, Ruoming Pang, and Quoc V Le. Nas-fpn: Learning scalable feature pyramid architecture for object detection. In CVPR, 2019.
[17] Minghao Guo, Zhao Zhong, Wei Wu, Dahua Lin, and Junjie Yan. IRLAS: Inverse reinforcement learning for architecture search. CoRR, abs/1812.05285, 2018.
[18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
[19] Gao Huang, Zhuang Liu, and Kilian Q. Weinberger. Densely connected convolutional networks. CVPR, pages 1–9, 2016.
[20] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv:1609.02907, 2016.
[21] Ben Krause, Emmanuel Kahembwe, Iain Murray, and Steve Renals. Dynamic evaluation of neural sequence models. arXiv:1709.07432, 2017.
[22] Hanchao Li, Pengfei Xiong, Haoqiang Fan, and Jian Sun. Dfanet: Deep feature aggregation for real-time semantic segmentation. In CVPR, pages 9522–9531, 2019.
[23] Liam Li and Ameet Talwalkar. Random search and reproducibility for neural architecture search. In UAI, page 129, 2019.
[24] Chenxi Liu, Liang-Chieh Chen, Florian Schroff, Hartwig Adam, Wei Hua, Alan Yuille, and Li Fei-Fei. Auto-deeplab: Hierarchical neural architecture search for semantic image segmentation. In CVPR, 2019.
[25] Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: Differentiable architecture search. In ICLR, 2019.
[26] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In CVPR, pages 3431–3440, 2015.
[27] Chris J Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. arXiv:1611.00712, 2016.
[28] Sachin Mehta, Mohammad Rastegari, Anat Caspi, Linda Shapiro, and Hannaneh Hajishirzi. Espnet: Efficient spatial pyramid of dilated convolutions for semantic segmentation. In ECCV, pages 552–568, 2018.
[29] Marvin Minsky. The Society of Mind. Simon & Schuster, 1988.
[30] Renato Negrinho and Geoff Gordon. Deeparchitect: Automatically designing and training deep architectures. arXiv:1704.08792, 2017.
[31] Vladimir Nekrasov, Hao Chen, Chunhua Shen, and Ian Reid. Fast neural architecture search of compact semantic segmentation models via auxiliary cells. In CVPR, pages 9126–9135, 2019.
[32] Asaf Noy, Niv Nayman, Tal Ridnik, Nadav Zamir, Sivan Doveh, Itamar Friedman, Raja Giryes, and Lihi Zelnik-Manor. Asap: Architecture search, anneal and prune. arXiv:1904.04123, 2019.
[33] Marin Orsic, Ivan Kreso, Petra Bevandic, and Sinisa Segvic. In defense of pre-trained imagenet architectures for real-time semantic segmentation of road-driving images. CVPR, pages 12599–12608, 2019.
[34] Adam Paszke, Abhishek Chaurasia, Sangpil Kim, and Eugenio Culurciello. Enet: A deep neural network architecture for real-time semantic segmentation. arXiv:1606.02147, 2016.
[35] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In NIPS, 2019.
[36] Hieu Pham, Melody Y. Guan, Barret Zoph, Quoc V. Le, and Jeff Dean. Efficient neural architecture search via parameter sharing. In ICML, pages 4092–4101, 2018.
[37] Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V. Le. Regularized evolution for image classifier architecture search. CoRR, abs/1802.01548, 2018.
[38] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
[39] Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard, and Quoc V Le. Mnasnet: Platform-aware neural architecture search for mobile. In CVPR, pages 2820–2828, 2019.
[40] Michael Treml, José Arjona-Medina, Thomas Unterthiner, Rupesh Durgesh, Felix Friedmann, Peter Schuberth, Andreas Mayr, Martin Heusel, Markus Hofmarcher, Michael Widrich, et al. Speeding up semantic segmentation for autonomous driving. In MLITS, NIPS Workshop, volume 2, page 7, 2016.
[41] Xiaolong Wang and Abhinav Gupta. Videos as space-time region graphs. In ECCV, pages 399–417, 2018.
[42] Bichen Wu, Xiaoliang Dai, Peizhao Zhang, Yanghan Wang, Fei Sun, Yiming Wu, Yuandong Tian, Peter Vajda, Yangqing Jia, and Kurt Keutzer. Fbnet: Hardware-aware efficient convnet design via differentiable neural architecture search. In CVPR, pages 10734–10742, 2019.
[43] Zifeng Wu, Chunhua Shen, and Anton van den Hengel. High-performance semantic segmentation using very deep fully convolutional networks. CoRR, abs/1604.04339, 2016.
[44] Sirui Xie, Hehui Zheng, Chunxiao Liu, and Liang Lin. SNAS: Stochastic neural architecture search. In ICLR, 2019.
[45] Hang Xu, Lewei Yao, Wei Zhang, Xiaodan Liang, and Zhenguo Li. Auto-fpn: Automatic network architecture adaptation for object detection beyond classification. In ICCV, 2019.
[46] Sijie Yan, Yuanjun Xiong, and Dahua Lin. Spatial temporal graph convolutional networks for skeleton-based action recognition. In AAAI, 2018.
[47] Changqian Yu, Jingbo Wang, Chao Peng, Changxin Gao, Gang Yu, and Nong Sang. Bisenet: Bilateral segmentation network for real-time semantic segmentation. In ECCV, pages 334–349, 2018.
[48] Yukang Chen, Tong Yang, Xiangyu Zhang, Gaofeng Meng, Xinyu Xiao, and Jian Sun. Detnas: Backbone search for object detection. In NeurIPS, 2019.
[49] Yiheng Zhang, Zhaofan Qiu, Jingen Liu, Ting Yao, Dong Liu, and Tao Mei. Customizable architecture search for semantic segmentation. In CVPR, pages 11641–11650, 2019.
[50] Hengshuang Zhao, Xiaojuan Qi, Xiaoyong Shen, Jianping Shi, and Jiaya Jia. Icnet for real-time semantic segmentation on high-resolution images. In ECCV, pages 405–420, 2018.
[51] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In CVPR, pages 2881–2890, 2017.
[52] Barret Zoph and Quoc V. Le. Neural architecture search with reinforcement learning. In ICLR, 2017.
[53] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. Learning transferable architectures for scalable image recognition. In CVPR, pages 8697–8710, 2018.

Supplemental Material

A. Network Visualization
As shown in Section 4.4.1 of the paper, the network searched by our GAS with GGM has a smaller parameter size while achieving much higher performance. Visualization can effectively help to analyze which component brings the performance improvement. We thus visualize the networks searched by three methods: 1) GAS with GGM; 2) GAS with a fully connected layer; and 3) random search, in Figure 7, Figure 8 and Figure 9, respectively. Compared to the other methods, the network searched by our GAS with GGM shows the following three advantages:

1) The cells in the low stages tend to choose light-weight operations (i.e., none, max pooling, skip connection) and the cells in the high stages use the complex ones, which matches the goal of pursuing high speed described in the introduction of our paper. Specifically, under the same latency loss weight, the network searched by our GAS with GGM contains thirty light-weight operations (dashed-line arrows in the figures) with lower latency, while the other two methods use twenty-one and twenty-three light-weight operations, respectively. Nonetheless, our GAS with GGM achieves higher performance, which exhibits the emergence of burden sharing in a group of cells when they know how much the others are willing to contribute.

2) The deeper layers tend to utilize operations with larger receptive fields (e.g., conv with dilation 4 or 8), which play a key role in improving performance in semantic segmentation [9, 7]. Specifically, the network searched by our GAS with GGM uses 11 large-receptive-field operations (denoted by green arrows) in the last four cells, while the other two methods only use 4 and 8 such operations, respectively.

3) The final structure has sufficient cell-level diversity, as we expected. On the contrary, the network searched by GAS with a fully connected layer tends to use similar structures; for example, cell 7 is similar to cells 8 and 9, and cell 1 is similar to cells 2, 3 and 4.
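The receptive-field argument in point 2) can be made concrete with a back-of-the-envelope calculation. The sketch below is the generic stride-1 receptive-field formula, not code from the paper:

```python
def receptive_field(layers):
    """Receptive field of stacked stride-1 convolutions:
    rf = 1 + sum((kernel - 1) * dilation) over the layers."""
    rf = 1
    for kernel, dilation in layers:
        rf += (kernel - 1) * dilation
    return rf

# A single 3x3 conv with dilation 8 sees a 17-pixel-wide context,
# more than four stacked plain 3x3 convs (9 pixels) at a fraction of the cost.
wide = receptive_field([(3, 8)])       # 17
stack = receptive_field([(3, 1)] * 4)  # 9
```

This is why placing dilated convolutions in the deep cells enlarges context cheaply, which the searched networks exploit.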
B. Multi-Scale Module Exploration
Methods   mIoU (%)   FPS
ASPP      72.4       108.4
PPM       72.5       114.1

Table 6. The performance of different multi-scale modules on the Cityscapes validation set.
When considering multi-scale features, we also tried the PPM module of PSPNet [51]; as shown in Table 6, our GAS achieves similar performance with a faster speed on the Cityscapes validation set.
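For reference, the core of the PPM idea in [51] is to average-pool the feature map into several grid sizes and broadcast each grid back to full resolution as extra context channels. The numpy sketch below is our simplified illustration (single channel, bin sizes that divide the map evenly, nearest-neighbour upsampling); PSPNet's actual module additionally applies 1x1 convolutions and bilinear upsampling:

```python
import numpy as np

def pyramid_pool(feat, bins=(1, 2, 3, 6)):
    """Average-pool a (H, W) map into b-by-b grids, then upsample each
    grid back to (H, W) by repetition, and stack with the input."""
    h, w = feat.shape
    outs = [feat]
    for b in bins:
        # block-average into a (b, b) grid
        pooled = feat.reshape(b, h // b, b, w // b).mean(axis=(1, 3))
        # nearest-neighbour upsample back to (H, W)
        outs.append(np.repeat(np.repeat(pooled, h // b, axis=0),
                              w // b, axis=1))
    return np.stack(outs)  # (1 + len(bins), H, W)
```

The bin-1 branch reduces the whole map to its global mean, which is the coarsest context the module injects.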
Figure 7. The network searched by our GAS with GGM exhibits the desired properties for real-time semantic segmentation (e.g., more dilated convolution operations in the deep layers and more low-computation operations for fast speed). Operation legend: conv 3x3, sep conv 3x3, dil-2 conv 3x3, dil-4 conv 3x3, dil-8 conv 3x3, none, max pooling 3x3, skip connection.