Graph Neural Network for Large-Scale Network Localization
Wenzhong Yan⋆, Di Jin†, Zhidi Lin⋆ and Feng Yin⋆
⋆School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen, China
†Signal Processing Group, Technische Universität Darmstadt, Darmstadt, Germany
ABSTRACT
Graph neural networks (GNNs) are popular for classifying structured data in the context of machine learning, but, surprisingly, they are rarely applied to regression problems. In this work, we adopt GNN for a classic but challenging nonlinear regression problem, namely network localization. Our main findings are in order. First, GNN is potentially the best solution to large-scale network localization in terms of accuracy, robustness and computational time. Second, thresholding of the communication range is essential to its superior performance. Simulation results corroborate that the proposed GNN based method outperforms all benchmarks by far. Such inspiring results are further justified theoretically in terms of data aggregation, non-line-of-sight (NLOS) noise removal and the low-pass filtering effect, all affected by the threshold for neighbor selection. Code is available at https://github.com/Yanzongzi/GNN-For-localization.

Index Terms — Graph neural networks, large-scale network localization, non-line-of-sight, thresholding.
1. INTRODUCTION
In recent years, graph neural networks (GNNs) have achieved many state-of-the-art results in various graph-related learning tasks such as node classification, link prediction and graph classification [1–4]. Compared with multi-layer perceptrons (MLPs), GNNs can exploit extra information about edges: each node aggregates the information of its adjacent nodes instead of using only its own. Though effective, GNN models are mostly used for classification tasks with discrete labels, while regression problems are more challenging and constitute a large body of practical signal processing applications. In this paper, we consider a classic yet challenging nonlinear regression problem, namely large-scale network localization.

Network localization requires not only the measurements between agent nodes and anchor nodes, but also the measurements between the agent nodes themselves. In the past decades, a variety of canonical network localization methods have been developed, including 1) maximum likelihood estimation based methods [5, 6]; 2) least-squares estimation based methods [7]; 3) multi-dimensional scaling based methods [8]; 4) mathematical programming based methods [9, 10]; and 5) Bayesian message passing based methods [7, 11]. Ignoring the effect of non-line-of-sight (NLOS) propagation incurs severe performance degradation in all of the aforementioned methods. As a remedy, the most widely applied strategy is to perform NLOS identification for each link and then either discard the NLOS measurements or suppress them in the aforementioned methods [12]. However, accurate NLOS identification requires large-scale offline calibration and huge manpower. The NLOS effect can also be dealt with from the algorithmic side. Based on the assumption that NLOS noise follows a certain probability distribution, maximum likelihood estimation based methods were developed in [13, 14]; however, model mismatch may cause severe performance degradation. In a recent work [15], the network localization problem is formulated as a regularized optimization problem in which the NLOS-induced sparsity of the ranging-bias parameters is exploited. Unfortunately, all of these methods are computationally expensive for large-scale networks.

In this paper, we propose a fresh GNN based network localization method that achieves all of the following merits at the same time. First, it provides stable and highly accurate localization despite severe NLOS propagation. Second, it requires neither laborious offline calibration nor NLOS identification. Third, it scales to large networks at an affordable computational cost.

The remainder of this paper is organized as follows. The background of network localization is introduced in Section 2. In Section 3, we introduce a fresh GNN framework for network localization. Numerical results are provided in Section 4, followed by theoretical performance verification in Section 5. Finally, we conclude the paper in Section 6.
2. PROBLEM FORMULATION
We consider a wireless network in two-dimensional (2-D) space; the extension to the 3-D case is straightforward. Without loss of generality, we let $S_a = \{1, 2, \ldots, N_l\}$ be the set of indices of the anchors, whose positions $\mathbf{p}_i = [x_i, y_i]^\top$, $i \in S_a$, are known, and $S_b = \{N_l+1, N_l+2, \ldots, N\}$ be the set of indices of the agents, whose positions are unknown. The distance measurement between any two nodes $i$ and $j$ is given by

$$x_{ij} = d(\mathbf{p}_i, \mathbf{p}_j) + n_{ij}, \quad (1)$$

where $d(\mathbf{p}_i, \mathbf{p}_j) := \|\mathbf{p}_i - \mathbf{p}_j\|$ is the Euclidean distance and $n_{ij}$ is an additive measurement error due to line-of-sight (LOS) and NLOS propagation. A distance matrix, denoted by $\mathbf{X}$, is constructed by stacking the distance measurements such that $x_{ij}$ is the $(i,j)$-th entry of $\mathbf{X}$. Notably, the distance matrix $\mathbf{X}$ is a "zero-diagonal" matrix, i.e., $x_{ii} = 0$ for $i = 1, 2, \ldots, N$. Based on this distance matrix, our goal is to accurately locate the agents in a large-scale wireless network within satisfactory computation time.

The above signal model suits many realistic large-scale wireless networks. For instance, in a 5G network, a large number of small base stations are deployed densely in each cell; an Internet of Things (IoT) network, advocating the interconnection of everything, comprises a huge number of connected smart devices and machines. For large-scale networks, it is typical that only a small fraction of nodes know their locations precisely, since otherwise either a lot of manpower is needed for offline calibration or the nodes must be equipped with expensive and power-hungry GPS/BeiDou chips. To locate a large number of agents, we propose a completely new paradigm, namely GNN based network localization, which is data-driven and relies on a graph-based learning model, to be specified in the next section.
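To make the setup concrete, the following is a minimal NumPy sketch of this measurement model. The node counts, noise levels and the NLOS bias range are illustrative assumptions of this sketch, not the paper's exact settings:

```python
import numpy as np

rng = np.random.default_rng(0)

N, N_l = 500, 50        # total nodes / anchors (illustrative values)
sigma, p_B = 0.1, 0.3   # LOS noise std and NLOS occurrence probability (assumed)

P = rng.uniform(0.0, 5.0, size=(N, 2))   # true positions in a 5 m x 5 m area

# Pairwise Euclidean distances d(p_i, p_j)
D = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=-1)

# n_ij = n^L_ij + b_ij * n^N_ij, cf. Eq. (1); the NLOS bias range is an assumption
n_L = rng.normal(0.0, sigma, size=(N, N))
n_N = rng.uniform(0.0, 10.0, size=(N, N))
b = rng.binomial(1, p_B, size=(N, N))
noise = np.triu(n_L + b * n_N, 1)
noise = noise + noise.T                  # symmetric errors, zero diagonal

X = D + noise
np.fill_diagonal(X, 0.0)                 # the "zero-diagonal" distance matrix
```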
3. NETWORK LOCALIZATION WITH GCN
Among the different types of GNN models, graph convolutional networks (GCNs) constitute a representative class. In this section, we focus on formulating the network localization problem using GCNs. An undirected graph associated with a wireless network can be formally defined as $G = (V, \mathbf{A})$, where $V$ represents the vertex set of the nodes $\{v_1, v_2, \ldots, v_N\}$, and $\mathbf{A} \in \mathbb{R}^{N \times N}$ is a symmetric adjacency matrix whose entry $a_{ij}$ denotes the weight of the edge between $v_i$ and $v_j$. We set $a_{ij} = 0$ if there is no connection between $v_i$ and $v_j$, and $a_{ij} = 1$ otherwise. The degree matrix $\mathbf{D} := \mathrm{diag}(d_1, d_2, \ldots, d_N)$ is a diagonal matrix with $d_i = \sum_{j=1}^{N} a_{ij}$.

In the original GCN, an edge $a_{ij}$ indicates that the labels of nodes $i$ and $j$ are similar. In the context of network localization, similar labels of two nodes means that they are close to each other, i.e., $d(\mathbf{p}_i, \mathbf{p}_j)$ is small. Accordingly, we introduce a Euclidean distance threshold, denoted by $T_h$, to determine whether there is an edge between two nodes. As will be explained in Section 5, this threshold is critical to the localization performance. By thresholding, a refined adjacency matrix $\mathbf{A}_{T_h}$ is constructed as follows:

$$[\mathbf{A}_{T_h}]_{ij} = \begin{cases} 0, & \text{if } x_{ij} > T_h, \\ 1, & \text{otherwise}. \end{cases} \quad (2)$$

Consequently, the augmented adjacency matrix [1] is defined as $\tilde{\mathbf{A}}_{T_h} := \mathbf{A}_{T_h} + \mathbf{I}$, where $\mathbf{I}$ is the identity matrix, and the associated degree matrix of $\tilde{\mathbf{A}}_{T_h}$ is denoted by $\tilde{\mathbf{D}}_{T_h}$. Similarly, we construct a sparse distance matrix $\hat{\mathbf{X}} = \mathbf{A}_{T_h} \odot \mathbf{X}$, where $\odot$ denotes the Hadamard product. Consequently, $\hat{\mathbf{X}}$ contains only the distance measurements that are smaller than or equal to $T_h$.

In general, each layer of a GCN carries out three actions: feature propagation, linear transformation and an element-wise nonlinear activation [16]. The main difference between GCN and the standard multi-layer perceptron (MLP) [17] lies in the feature propagation, which is clarified in the following. In the $k$-th graph convolution layer, the input and output node representations are denoted by the matrices $\mathbf{H}^{(k-1)}$ and $\mathbf{H}^{(k)}$, respectively. The initial node representation is $\mathbf{H}^{(0)} = \hat{\mathbf{X}}$. A $K$-layer GCN differs from a $K$-layer MLP in that the hidden representation of each node in GCN is averaged with those of its neighbors. More precisely, in GCN, the update process for all layers is obtained by performing the matrix multiplication

$$\bar{\mathbf{H}}^{(k)} \leftarrow \hat{\mathbf{A}}_{T_h} \mathbf{H}^{(k-1)}, \quad (3)$$

where $\hat{\mathbf{A}}_{T_h} := \tilde{\mathbf{D}}_{T_h}^{-1/2} \tilde{\mathbf{A}}_{T_h} \tilde{\mathbf{D}}_{T_h}^{-1/2}$ is the augmented normalized adjacency matrix [1] and $\bar{\mathbf{H}}^{(k)}$ is the hidden representation matrix in the $k$-th graph convolution layer. Intuitively, this step smoothes the hidden representations locally along the edges of the graph and ultimately encourages similar predictions among locally connected nodes.

After feature propagation, the remaining two steps of GCN, i.e., linear transformation and nonlinear activation, are identical to those of the standard MLP. The $k$-th layer contains a layer-specific trainable weight matrix $\mathbf{W}^{(k)}$ and a nonlinear activation function $\phi(\cdot)$, such as $\mathrm{ReLU}(\cdot) = \max(0, \cdot)$ [18]. The representation updating rule of the $k$-th layer, presented in Fig. 1, is given by

$$\mathbf{H}^{(k)} \leftarrow \phi\left(\bar{\mathbf{H}}^{(k)} \mathbf{W}^{(k)}\right). \quad (4)$$

It is noteworthy that the activation function $\phi(\cdot)$ is applied to every element of the matrix.

Fig. 1: Diagram of the GCN updating rule for one hidden layer.
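To make the construction concrete, here is a minimal NumPy sketch of Eqs. (2)–(4) for one propagation step, continuing the simulation snippet from Section 2; the threshold value is an illustrative assumption:

```python
T_h = 1.2                                   # illustrative threshold (assumed value)

# Eq. (2): thresholded adjacency A_{T_h}
A = (X <= T_h).astype(float)
np.fill_diagonal(A, 0.0)

# Augmented normalized adjacency: A_hat = D^{-1/2} (A + I) D^{-1/2}
A_tilde = A + np.eye(N)
d_tilde = A_tilde.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(d_tilde))
A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt

# Sparse distance features and one propagation step, Eq. (3)
X_sparse = A * X                            # Hadamard product A_{T_h} ⊙ X
H0 = X_sparse                               # initial representation H^{(0)}
H1_bar = A_hat @ H0                         # hidden representation \bar{H}^{(1)}
H1 = np.maximum(H1_bar, 0.0)                # Eq. (4) with φ = ReLU (weights omitted)
```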
Taking a 2-layer GCN as an example, the estimated positions $\hat{\mathbf{R}} = [\hat{\mathbf{p}}_1, \hat{\mathbf{p}}_2, \ldots, \hat{\mathbf{p}}_N]^\top \in \mathbb{R}^{N \times 2}$ are given by

$$\hat{\mathbf{R}} = \hat{\mathbf{A}}_{T_h} \, \phi\left(\hat{\mathbf{A}}_{T_h} (\mathbf{A}_{T_h} \odot \mathbf{X}) \mathbf{W}^{(1)}\right) \mathbf{W}^{(2)}. \quad (5)$$

The weight matrices $\mathbf{W}^{(1)}$ and $\mathbf{W}^{(2)}$ can be optimized by minimizing the mean-squared-error (MSE) loss $\mathcal{L}(\mathbf{W}^{(1)}, \mathbf{W}^{(2)}) := \|\mathbf{R}_l - \hat{\mathbf{R}}_l\|_F^2$, where $\mathbf{R}_l = [\mathbf{p}_1, \mathbf{p}_2, \ldots, \mathbf{p}_{N_l}]^\top$ and $\hat{\mathbf{R}}_l = [\hat{\mathbf{p}}_1, \hat{\mathbf{p}}_2, \ldots, \hat{\mathbf{p}}_{N_l}]^\top$ are the true anchor positions and their estimates, respectively, and $\|\cdot\|_F$ is the Frobenius norm of a matrix. This optimization problem is mostly solved via a gradient descent type method, such as stochastic gradient descent [19] or Adam [20].
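A compact PyTorch sketch of the 2-layer model in Eq. (5), trained with the anchor-only MSE loss, is given below. The width, learning rate and epoch count follow the settings reported in Section 4; everything else (bias-free layers, full-batch training, reuse of the NumPy arrays from the earlier snippets) is an assumption of this sketch:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
A_hat_t = torch.tensor(A_hat, dtype=torch.float32)
H0_t = torch.tensor(X_sparse, dtype=torch.float32)
R_true = torch.tensor(P, dtype=torch.float32)

class GCN(nn.Module):
    """Two-layer GCN realizing Eq. (5): R_hat = A_hat phi(A_hat X_hat W1) W2."""
    def __init__(self, n_in, n_hidden=2000, n_out=2):
        super().__init__()
        self.W1 = nn.Linear(n_in, n_hidden, bias=False)
        self.W2 = nn.Linear(n_hidden, n_out, bias=False)

    def forward(self, A_hat, H0):
        H1 = torch.relu(A_hat @ self.W1(H0))   # propagate, transform, activate
        return A_hat @ self.W2(H1)             # second layer outputs 2-D positions

model = GCN(n_in=N)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
anchors = torch.arange(N_l)                    # nodes with known positions

for epoch in range(200):
    optimizer.zero_grad()
    R_hat = model(A_hat_t, H0_t)
    # MSE loss on the anchor rows only: ||R_l - R_hat_l||_F^2
    loss = ((R_hat[anchors] - R_true[anchors]) ** 2).sum()
    loss.backward()
    optimizer.step()
```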
4. NUMERICAL RESULTS
In this section, the performance of the proposed GCN based method is evaluated in terms of localization accuracy, robustness against NLOS noise and computational time. For comparison purposes, we choose various salient competitors, including an MLP based method, a neural tangent kernel (NTK) regression based method, the sparsity-inducing semi-definite programming (SDP) method [15], the expectation-conditional maximization (ECM) method [13], and the centralized least-squares (LS) method [7]. Note that we choose MLP to demonstrate the performance improvement brought by adding the normalized adjacency matrix $\hat{\mathbf{A}}_{T_h}$ in each layer. Additionally, we use NTK based regression to mimic an ultra-wide MLP with random initialization, based on the theorem that a sufficiently wide and randomly initialized MLP trained by gradient descent is equivalent to a kernel regression predictor with the NTK [21].

Implementation details of these methods are as follows. We use a 2-layer GCN with 2000 neurons in each hidden layer. We train the GCN and MLP models for a maximum of 200 epochs (training iterations) using Adam with a learning rate of 0.01. We initialize the weights using the routine described in [22] and normalize the input feature vectors along rows. The dropout rate [23] is set to 0.5 for all layers. The settings of NTK remain the same as described in [24]. The regularization parameter in SDP is set to $\lambda = 0.$ . For both the ECM and LS methods, the initial positions are randomly generated in the square area. All simulations are performed on a server equipped with 48 Intel Xeon E5-2650 2.2 GHz CPUs and 8 NVIDIA TITAN Xp 12 GB GPUs. In all experiments, we set the threshold $T_h = 1.$ for GCN, MLP and NTK, and $T_h = 0.$ for the other methods, which leads to similar averaged localization accuracy but requires much less computational time. Fairness in comparison is carefully maintained.

Fig. 2: The averaged loss (RMSE) versus the number of anchors under different noise conditions.

Details of the simulated localization scenarios are given below. We consider a 5 m ×
5 m square area. Each network consists of randomly generated nodes in total. The number of anchors, $N_l$, is varied to investigate its impact, and the remaining nodes are agents. The measurement error $n_{ij}$ is generated according to $n_{ij} = n^L_{ij} + b_{ij} n^N_{ij}$. Here, the LOS noise $n^L_{ij}$ is drawn from a zero-mean Gaussian distribution, $n^L_{ij} \sim \mathcal{N}(0, \sigma^2)$, while the positive NLOS bias $n^N_{ij}$ is drawn from a uniform distribution, and $b_{ij}$ is drawn from the Bernoulli distribution $\mathcal{B}(P_B)$ with $P_B$ being the NLOS occurrence probability.

First, we assess the localization accuracy of all methods under different noise conditions. Here, the localization accuracy is measured in terms of the averaged test root-mean-squared-error (RMSE). The results are summarized in Table 1. Among all the considered methods, GCN provides the highest localization accuracy in almost all cases. In particular, when the NLOS probability $P_B$ is high, GCN outperforms all competitors by far. Moreover, we test the localization performance of GCN for larger networks with $N = 1000$ (listed separately in Table 1). The results show that GCN performs even better with only slightly increased computational time, as shown in Table 2. If we increase $N$ even further, the GCN based method maintains its outstanding performance, while the other methods (SDP, ECM and LS) degrade severely in terms of both localization accuracy and computational time.

Next, we focus on the two data-driven methods, GCN and MLP, which perform relatively well in Table 1. The localization accuracy is investigated by varying $N_l$ under two different noise conditions. The results are depicted in Fig. 2. There are two main observations. First, GCN attains the lowest RMSE consistently for all $N_l$. Compared with MLP, the improvement in localization accuracy is particularly remarkable for small $N_l$. This result indicates that GCN can better exploit the distance information than MLP. When $N_l$ increases, both GCN and MLP tend to approach a performance lower bound. Second, GCN performs similarly under both noise conditions, while MLP shows a clear performance degradation when $P_B$ increases. This observation indicates that GCN is very robust against NLOS noise. Lastly, we note that a fine-tuned MLP is often superior to NTK, which corresponds to a randomly initialized MLP, although NTK performs surprisingly close to the other benchmark methods, as shown in Table 1.

Fig. 3: The averaged loss (RMSE) versus threshold under different noise conditions in the GCN model. $N_l = 50$.

In the third simulation, we investigate the influence of the threshold, $T_h$, on the localization performance. Figure 3 depicts the RMSE of GCN versus $T_h$ in three different noise scenarios. Interestingly, the RMSE curves show a similar trend as the threshold changes. We characterize the localization performance obtained with different thresholds. In Region I, the RMSE is very large at the beginning and drops rapidly as $T_h$ increases. The reason for the poor performance at the beginning is that when $T_h$ is too small there are not enough edges in the graph, incurring isolated nodes. In Regions II–IV, GCN shows stable performance. A closer inspection shows that the RMSE is relatively lower in Region II, rises slightly in Region III and decreases in Region IV to the same lowest level as in Region II. This observation can be explained as follows. In Region II, the good performance of GCN is due to the NLOS noise truncation effect of $T_h$, which will be explained in Section 5.1. In Region IV, the adverse effect of large NLOS noise is compensated by the increased number of neighbors of each node. Lastly, the rapid increase of the RMSE in Region V can be explained by the effect of extremely large NLOS noise and over-smoothing, which will also be explained in Section 5.1.

Another important requirement for real-world applications is fast response. Table 2 shows the practical computational time of the different methods. GCN, MLP and NTK are more computationally efficient than the other methods. Besides, the computational time of GCN increases only slightly when the number of nodes in the network is doubled. GCN can also handle very large networks, for instance $N = 10000$, in an affordable time, while LS, ECM and SDP all become intractable. All the above results indicate that the proposed GCN based method is a promising solution to large-scale network localization.
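For completeness, the RMSE metric used throughout this section can be computed as below; the sweep loop is only indicative, with `train_gcn` standing in as a hypothetical wrapper around the training sketch of Section 3, not a function from the paper's code:

```python
def rmse(R_hat, R_true, test_idx):
    """Averaged test RMSE over the agent (non-anchor) nodes."""
    err = np.linalg.norm(R_hat[test_idx] - R_true[test_idx], axis=1)
    return np.sqrt(np.mean(err ** 2))

# Indicative threshold sweep, reproducing Fig. 3 in outline
# (train_gcn is a hypothetical wrapper, assumed here for illustration):
# for T_h in np.arange(0.2, 5.0, 0.2):
#     R_hat = train_gcn(X, P, N_l, T_h)
#     print(T_h, rmse(R_hat, P, np.arange(N_l, N)))
```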
5. PERFORMANCE VERIFICATION
In this section, we aim to uncover the reasons behind the remarkable performance of the GCN based method, corroborated by the results shown in Section 4. We pinpoint two major factors: one is the threshold $T_h$ and the other is the augmented normalized adjacency matrix $\hat{\mathbf{A}}_{T_h}$. In the following, we analyze the two factors separately, although $T_h$ determines $\hat{\mathbf{A}}_{T_h}$.

Table 1: The averaged loss (RMSE) of all methods under different noise conditions for $N_l = 50$.

5.1. Effects of Thresholding

Thresholding plays two major roles: 1) truncating large noise and 2) avoiding over-smoothing.
Table 2: A comparison of different methods in terms of computational time (in seconds) at $(\sigma = 0.1, P_B = 30\%)$ and $N_l = 50$.
Noise truncation.
The operation $\hat{\mathbf{X}} = \mathbf{A}_{T_h} \odot \mathbf{X}$ in Section 3 implies that $d(\mathbf{p}_i, \mathbf{p}_j) + n_{ij} \le T_h$ holds for each $\hat{x}_{ij} \neq 0$. Equivalently, for each non-zero element of $\hat{\mathbf{X}}$ we have $n_{ij} \le T_h - d(\mathbf{p}_i, \mathbf{p}_j)$, indicating that only noise within a small, limited range enters $\hat{\mathbf{X}}$. Specifically, since an $n_{ij}$ of large value is usually caused by the positive NLOS bias $n^N_{ij}$, each element $x_{ij}$ associated either with a large true distance or with a large noise is discarded when the threshold $T_h$ is set small. In other words, a measurement is retained only if the two nodes are adjacent and affected by at most a small or moderate measurement noise.
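A quick numerical check of this truncation effect, reusing `X`, `D` and `T_h` from the earlier sketches:

```python
# For retained links x_ij <= T_h, the entering noise obeys
# n_ij <= T_h - d(p_i, p_j) <= T_h, so large (mostly NLOS) errors are dropped.
off_diag = ~np.eye(N, dtype=bool)
retained = (X <= T_h) & off_diag
print("max noise over all links: ", (X - D)[off_diag].max())
print("max noise over kept links:", (X - D)[retained].max())  # bounded by T_h
```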
Avoiding over-smoothing. When the threshold is large enough, the corresponding graph becomes fully connected and the adjacency matrix becomes a matrix of all ones. In this case, according to Eq. (3), all rows of the hidden representation matrix $\bar{\mathbf{H}}^{(k)}$ are identical, meaning that the obtained hidden representation completely loses its distance information. Consequently, the predicted positions of all nodes tend to converge to the same point. This phenomenon is known as over-smoothing. As an illustration, Region V in Fig. 3 confirms that GCN suffers from over-smoothing when the threshold is set too large. Thus, we need to choose a proper threshold to prevent such adverse behavior.
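This collapse is easy to verify numerically; a sketch of the fully connected extreme described above, reusing the arrays from Section 3's snippet:

```python
# Fully connected extreme: the adjacency is all ones, so the normalized
# propagation matrix averages every node and all output rows coincide.
A_ones = np.ones((N, N))
A_hat_full = A_ones / N                         # D^{-1/2} A D^{-1/2} with d_i = N
H_smooth = A_hat_full @ X_sparse
print(np.allclose(H_smooth[0], H_smooth[-1]))   # True: distance info is lost
```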
5.2. Effects of $\hat{\mathbf{A}}_{T_h}$

To understand the superior localization performance of GCN compared with MLP, we analyze the effects of the augmented normalized adjacency matrix $\hat{\mathbf{A}}_{T_h}$ from both the spatial and the spectral perspective.

Aggregation and combination. To understand the spatial effect of $\hat{\mathbf{A}}_{T_h}$, we decompose the update in Eq. (3) into two parts:

$$\bar{\mathbf{h}}^{(k)}_i = \underbrace{\frac{1}{d_i + 1}\,\mathbf{h}^{(k-1)}_i}_{\text{Own information}} + \underbrace{\sum_{j=1}^{N} \frac{a_{ij}}{\sqrt{(d_i+1)(d_j+1)}}\,\mathbf{h}^{(k-1)}_j}_{\text{Aggregated information}}, \quad (6)$$

where $\bar{\mathbf{h}}^{(k)}_i$ and $\mathbf{h}^{(k-1)}_i$ are the $i$-th row vectors of the hidden representation matrix $\bar{\mathbf{H}}^{(k)}$ and the input representation matrix $\mathbf{H}^{(k-1)}$ of the $k$-th layer, respectively. Specifically, Eq. (6) contains two operations: aggregation and combination. Aggregation corresponds to the second term of Eq. (6), in which the neighboring features are captured by following the given graph structure. Then, the target node's own information is combined with the aggregated information.
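The per-node form in Eq. (6) can be checked directly against the matrix form of Eq. (3); a sketch reusing `A`, `A_hat` and `H0` from Section 3's snippet:

```python
d = A.sum(axis=1)          # degrees d_i of A_{T_h}
i = 0                      # any node index
own = H0[i] / (d[i] + 1.0)                       # "own information" term
agg = sum(A[i, j] / np.sqrt((d[i] + 1.0) * (d[j] + 1.0)) * H0[j]
          for j in range(N))                     # "aggregated information" term
print(np.allclose(own + agg, (A_hat @ H0)[i]))   # True: Eq. (6) matches Eq. (3)
```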
Fig. 4: Spectral components of different signals in the dataset $(\sigma = 0.1, P_B = 0)$.

Compared with the training procedure of MLP, which solely uses the features of the labeled nodes (referring to the anchors here), GCN is a semi-supervised method in which the hidden representation of each labeled node is averaged over itself and its neighbors. Equivalently speaking, GCN trains a model by exploiting the features of both labeled and unlabeled nodes, which leads to its superior localization performance.

Low-pass filtering.
From the spectral perspective, the eigenvalues of the augmented normalized Laplacian $\Delta_{T_h} = \mathbf{I} - \hat{\mathbf{A}}_{T_h}$, denoted by $\tilde{\lambda}_i$, $i = 1, 2, \ldots, N$, can be regarded as "frequency" components. Multiplying by $K$ augmented normalized adjacency matrices, $\hat{\mathbf{A}}^K_{T_h}$, in the graph convolution layers is equivalent to applying a spectral "low-pass" filter $g(\tilde{\lambda}_i) := (1 - \tilde{\lambda}_i)^K$ [16]. Figure 4 depicts the graph Fourier transforms of the original LOS noise, the thresholded LOS noise and the thresholded true distance matrix. It can be seen that almost all information of the thresholded true distance matrix is concentrated in the "low frequency" band, while both "low frequency" and "high frequency" components are present in the original and the thresholded LOS noise. Thus, $\hat{\mathbf{A}}_{T_h}$, acting as a "low-pass" filter, partially removes the "high frequency" components of the LOS noise. This explains the improved localization performance of GCN from the spectral perspective.
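This filtering interpretation can also be verified numerically; a sketch reusing `A_hat` from Section 3's snippet:

```python
# Eigenvalues of the augmented normalized Laplacian are the graph
# "frequencies"; K propagation steps apply the filter g(lam) = (1 - lam)^K.
L_aug = np.eye(N) - A_hat
lam, U = np.linalg.eigh(L_aug)            # eigenvalues and graph Fourier basis
K = 2
g = (1.0 - lam) ** K                      # spectral response of A_hat^K

signal = X_sparse[:, 0]                   # an example graph signal
spectrum = U.T @ signal                   # graph Fourier transform
filtered = U @ (g * spectrum)             # filtering in the spectral domain
print(np.allclose(filtered, np.linalg.matrix_power(A_hat, K) @ signal))
```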
6. CONCLUSION
In this paper, we have proposed a GCN based data-driven method for robust large-scale network localization in mixed LOS/NLOS environments. Numerical results have shown that the proposed method achieves substantial improvements in terms of localization accuracy, robustness and computational time, in comparison with both MLP and various benchmark methods. Moreover, our detailed analysis has revealed that thresholding the neighboring features is crucial to attaining the desired localization performance.

7. REFERENCES
[1] T. N. Kipf and M. Welling, "Semi-Supervised Classification with Graph Convolutional Networks," in International Conference on Learning Representations (ICLR), 2017.
[2] W. Hamilton, Z. Ying, and J. Leskovec, "Inductive Representation Learning on Large Graphs," in Advances in Neural Information Processing Systems (NIPS), 2017, pp. 1024–1034.
[3] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, "Graph Attention Networks," 2017.
[4] K. Xu, W. Hu, J. Leskovec, and S. Jegelka, "How Powerful are Graph Neural Networks?," 2019.
[5] N. Patwari, A. O. Hero, M. Perkins, N. S. Correal, and R. J. O'Dea, "Relative Location Estimation in Wireless Sensor Networks," IEEE Transactions on Signal Processing, vol. 51, no. 8, pp. 2137–2148, Aug. 2003.
[6] A. Simonetto and G. Leus, "Distributed Maximum Likelihood Sensor Network Localization," IEEE Transactions on Signal Processing, vol. 62, no. 6, pp. 1424–1437, 2014.
[7] H. Wymeersch, J. Lien, and M. Z. Win, "Cooperative Localization in Wireless Networks," Proceedings of the IEEE, vol. 97, no. 2, pp. 427–450, 2009.
[8] J. A. Costa, N. Patwari, and A. O. Hero, "Distributed Weighted-Multidimensional Scaling for Node Localization in Sensor Networks," ACM Transactions on Sensor Networks, vol. 2, no. 1, pp. 39–64, 2006.
[9] P. Biswas, T. Lian, T. Wang, and Y. Ye, "Semidefinite Programming Based Algorithms for Sensor Network Localization," ACM Transactions on Sensor Networks, vol. 2, no. 2, pp. 188–220, May 2006.
[10] P. Tseng, "Second-Order Cone Programming Relaxation of Sensor Network Localization," SIAM Journal on Optimization, vol. 18, no. 1, pp. 156–185, 2007.
[11] A. Ihler, J. W. Fisher, R. L. Moses, and A. S. Willsky, "Nonparametric Belief Propagation for Self-Localization of Sensor Networks," IEEE Journal on Selected Areas in Communications, 2005.
[12] S. Marano, W. M. Gifford, H. Wymeersch, and M. Z. Win, "NLOS Identification and Mitigation for Localization Based on UWB Experimental Data," IEEE Journal on Selected Areas in Communications, vol. 28, no. 7, pp. 1026–1035, Sep. 2010.
[13] F. Yin, C. Fritsche, D. Jin, F. Gustafsson, and A. M. Zoubir, "Cooperative Localization in WSNs Using Gaussian Mixture Modeling: Distributed ECM Algorithms," IEEE Transactions on Signal Processing, vol. 63, no. 6, pp. 1448–1463, 2015.
[14] H. Chen, G. Wang, Z. Wang, H. C. So, and H. V. Poor, "Non-Line-of-Sight Node Localization Based on Semi-Definite Programming in Wireless Sensor Networks," IEEE Transactions on Wireless Communications, vol. 11, no. 1, pp. 108–116, 2012.
[15] D. Jin, F. Yin, M. Fauß, M. Muma, and A. M. Zoubir, "Exploiting Sparsity for Robust Sensor Network Localization in Mixed LOS/NLOS Environments," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 5915–5919.
[16] F. Wu, T. Zhang, A. H. de Souza Jr., C. Fifty, T. Yu, and K. Q. Weinberger, "Simplifying Graph Convolutional Networks," in International Conference on Machine Learning (ICML), 2019.
[17] D. Svozil, V. Kvasnicka, and J. Pospichal, "Introduction to Multi-layer Feed-forward Neural Networks," Chemometrics and Intelligent Laboratory Systems, vol. 39, no. 1, pp. 43–62, 1997.
[18] P. Ramachandran, B. Zoph, and Q. V. Le, "Searching for Activation Functions," in International Conference on Learning Representations (ICLR), 2018.
[19] L. Bottou, "Large-Scale Machine Learning with Stochastic Gradient Descent," in Proceedings of the 19th International Conference on Computational Statistics, Springer, 2010, pp. 177–186.
[20] D. P. Kingma and J. Ba, "Adam: A Method for Stochastic Optimization," in International Conference on Learning Representations (ICLR), 2014.
[21] S. Arora, S. S. Du, W. Hu, Z. Li, R. R. Salakhutdinov, and R. Wang, "On Exact Computation with an Infinitely Wide Neural Net," in Advances in Neural Information Processing Systems (NIPS), 2019, pp. 8141–8150.
[22] X. Glorot and Y. Bengio, "Understanding the Difficulty of Training Deep Feedforward Neural Networks," in Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, 2010, pp. 249–256.
[23] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A Simple Way to Prevent Neural Networks from Overfitting," The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[24] S. Arora, S. S. Du, Z. Li, R. Salakhutdinov, R. Wang, and D. Yu, "Harnessing the Power of Infinitely Wide Deep Nets on Small-Data Tasks," in International Conference on Learning Representations (ICLR), 2020.