On Learning Paradigms for the Travelling Salesman Problem
Chaitanya K. Joshi, Thomas Laurent, and Xavier Bresson
School of Computer Science and Engineering, Nanyang Technological University, Singapore
Department of Mathematics, Loyola Marymount University
{chaitanya.joshi, xbresson}@ntu.edu.sg, [email protected]
Abstract
We explore the impact of learning paradigms on training deep neural networks for the Travelling Salesman Problem. We design controlled experiments to train supervised learning (SL) and reinforcement learning (RL) models on fixed graph sizes up to 100 nodes, and evaluate them on variable sized graphs up to 500 nodes. Beyond not needing labelled data, our results reveal favorable properties of RL over SL: RL training leads to better emergent generalization to variable graph sizes and is a key component for learning scale-invariant solvers for novel combinatorial problems.

Introduction

The Travelling Salesman Problem (TSP) is one of the most intensively studied combinatorial optimization problems in the Operations Research community and is the backbone of industries such as transportation, logistics and scheduling. Being an NP-hard graph problem, finding optimal TSP solutions is intractable at large scales above thousands of nodes. In practice, the Concorde TSP solver [1] uses carefully handcrafted heuristics to find approximate solutions up to tens of thousands of nodes. Unfortunately, powerful OR solvers such as Concorde are problem-specific; their development for new problems requires significant time and specialized knowledge.

An alternate approach by the Machine Learning community is to develop generic learning algorithms which can be trained to solve any combinatorial problem from problem instances themselves [15, 3]. Using 2D Euclidean TSP as a representative of practical combinatorial problems, recent learning-based approaches [10, 7, 13] have leveraged advances in graph representation learning [5, 6, 16, 12, 8, 14] to operate directly on the problem's graph structure and perform competitively with Concorde on problem instances of fixed, trivially small sizes.

A key design choice for scaling these approaches to real-world problem sizes is the learning paradigm: supervised learning (SL) or reinforcement learning (RL). As noted in [3], the performance of SL-based models depends on the availability of large sets of optimal or high-quality instance-solution pairs. Although RL is known to be less sample efficient than SL, it does not require labelled instances. As long as a problem can be formulated via a reward signal for making sequential decisions, an autoregressive policy can be trained via RL. Hence, most recent work on TSP has defaulted to training autoregressive RL models to minimize the tour length [2, 10, 13]. In contrast, [9] showed that non-autoregressive SL models with sufficient labelled data (generated using Concorde) outperform state-of-the-art RL approaches on fixed graph sizes. However, their non-autoregressive architectures show poor 'zero-shot' generalization performance compared to autoregressive models when evaluated on instances of different sizes than those used for training.

Code available: https://github.com/chaitjo/learning-paradigms-for-tsp

In this paper, we perform controlled experiments on the learning paradigm for autoregressive TSP models with an emphasis on emergent generalization to variable graph sizes, especially those larger than training graphs. We find that both SL and RL learn to solve TSP on fixed graph sizes very close to optimal, with SL models achieving state-of-the-art results for TSP20-TSP100. Interestingly, RL emerges as the superior learning paradigm for zero-shot generalization to variable and large-scale TSP instances up to TSP500.

Our findings suggest that learning guided by sparse reward functions trains policies which identify more general motifs and patterns for TSP. In contrast, policies trained to imitate optimal solutions overfit to specific graph sizes.
We contribute to the existing literature on neural combinatorial optimization by empirically exploring the impact of learning paradigms for TSP, and believe that RL will be a key component towards building scale-invariant solvers for new combinatorial problems beyond TSP.

Intuitively, learning from variable graph sizes is a straightforward way of building more robust and scale-invariant solvers. In our experiments, we chose to focus on learning from graphs of fixed sizes because we want to study the impact of the learning paradigm on emergent generalization in the extreme case, where generalization is measured as performance on smaller or larger instances of the same combinatorial problem that the model was trained on.
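As a concrete illustration of the two paradigms compared in this paper, the sketch below (plain Python with illustrative numbers; not the authors' implementation) contrasts the REINFORCE objective, whose reward is the negative tour length, with the step-wise cross-entropy loss used to imitate optimal (e.g. Concorde) solutions:

```python
import math

def tour_length(coords, tour):
    """Total Euclidean length of a closed tour over 2D node coordinates."""
    n = len(tour)
    return sum(math.dist(coords[tour[i]], coords[tour[(i + 1) % n]])
               for i in range(n))

def reinforce_loss(log_probs, sampled_len, baseline_len):
    """REINFORCE: advantage (sampled minus baseline tour length) times
    the sum of log-probabilities of the sampled tour's decisions."""
    return (sampled_len - baseline_len) * sum(log_probs)

def supervised_loss(step_probs, optimal_next_nodes):
    """Cross entropy between the policy's per-step distribution and the
    optimal next node taken from a labelled solution."""
    return -sum(math.log(p[t]) for p, t in zip(step_probs, optimal_next_nodes))

# Unit-square instance: the perimeter tour has length 4.0
coords = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
print(tour_length(coords, [0, 1, 2, 3]))                   # 4.0

# Toy decoding probabilities over 4 nodes (made-up values)
probs = [[0.7, 0.1, 0.1, 0.1], [0.1, 0.6, 0.2, 0.1], [0.1, 0.1, 0.5, 0.3]]
print(round(supervised_loss(probs, [0, 1, 2]), 3))         # 1.561
print(round(reinforce_loss([-0.36, -0.51], 5.9, 5.7), 3))  # -0.174
```

In both cases the model outputs a distribution over the next node at each decoding step; only the training signal differs.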
Model Setup
We follow the experimental setup of [13] to train autoregressive Graph Attention-based models for TSP20, TSP50 and TSP100, and evaluate on instances from TSP20 up to TSP500. We train two otherwise equivalent variants of the model: an RL model trained with REINFORCE [19] and a greedy rollout baseline; and an SL model trained to minimize a cross-entropy loss between the model's predictions and optimal targets at each step, similar to supervised Pointer Networks [18]. We use the model architecture and optimizer specified by [13] for both approaches. Optimal TSP datasets from [9] are used to train the SL models, whereas training data is generated on the fly for RL. See Appendix A for detailed descriptions of training setups.

Evaluation
We measure performance on held-out test sets of TSP20, TSP50 and TSP100 instances from [9], as well as test sets of TSP150, TSP200, TSP300, TSP400 and TSP500 instances generated using Concorde. We use the average predicted tour length and the average optimality gap (percentage ratio of the predicted tour length relative to the optimal solution) over the test sets as performance metrics. We evaluate both models in three search settings: greedy search, sampling from the learnt policy (1,280 solutions), and beam search (with beam width 1,280).

Performance on training graph sizes
Table 1 presents the performance of SL and RL models for various TSP sizes. In general, we found that both SL and RL models learn solvers close to optimal for TSP20, TSP50 and TSP100 when trained on the corresponding problem size. In the greedy setting, RL models clearly outperform SL models. As we sample or perform beam search, SL models obtain state-of-the-art results for all graph sizes, showing significant improvement over RL models as well as the non-autoregressive SL models from [9]; e.g. a TSP100 optimality gap of 0.39% for the TSP100 SL model using beam search vs. 2.86% for the TSP100 RL model.

Generalization to variable graph sizes
RL clearly results in better zero-shot generalization to problem sizes smaller or larger than the training graphs. The different generalization trends for SL and RL models can be visualized in Figure 1 and are highlighted below:

• Both TSP20 models do not generalize well to TSP100, but RL training leads to better performance; e.g. in the greedy setting, a TSP100 optimality gap of 32.26% for the TSP20 SL model vs. 15.22% for the TSP20 RL model.

We select the autoregressive RL model [13] as it naturally fits the sequential nature of TSP and can easily be extended to the SL setting. In contrast, it is not trivial to extend the non-autoregressive SL model [9] to the RL setting. We modify their codebase: https://github.com/wouterkool/attention-learn-to-route
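The average optimality gap used throughout the tables is computed directly from predicted and optimal tour lengths; a minimal sketch (plain Python, with illustrative lengths):

```python
def optimality_gap(pred_len, opt_len):
    """Percentage excess of a predicted tour over the optimal tour length."""
    return 100.0 * (pred_len / opt_len - 1.0)

def average_optimality_gap(pred_lens, opt_lens):
    """Mean gap over a test set of instances."""
    gaps = [optimality_gap(p, o) for p, o in zip(pred_lens, opt_lens)]
    return sum(gaps) / len(gaps)

# e.g. a predicted TSP100 tour of length 8.146 vs. an optimal length of 7.764
print(round(optimality_gap(8.146, 7.764), 2))  # 4.92
```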
Table 1: Performance of SL and RL models on fixed-size TSP test sets. In the Type column, SL: Supervised Learning, RL: Reinforcement Learning. In the Decoder column, G: greedy search, S: sampling, BS: beam search.

| Model | Type | Decoder | TSP20 Tour Len. | TSP20 Opt. Gap | TSP20 Time | TSP50 Tour Len. | TSP50 Opt. Gap | TSP50 Time | TSP100 Tour Len. | TSP100 Opt. Gap | TSP100 Time |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Concorde | - | - | 3.831 | 0.00% | (1m) | 5.692 | 0.00% | (2m) | 7.764 | 0.00% | (3m) |
| *Greedy search* | | | | | | | | | | | |
| TSP20 Model | SL | G | 3.847 | 0.42% | (1s) | 6.219 | 9.26% | (1s) | 10.269 | 32.26% | (5s) |
| TSP50 Model | SL | G | 4.177 | 9.03% | (1s) | 5.951 | 4.55% | (1s) | 8.519 | 9.72% | (5s) |
| TSP100 Model | SL | G | 5.696 | 48.68% | (1s) | 6.643 | 16.71% | (1s) | 8.589 | 10.63% | (5s) |
| TSP20 Model | RL | G | 3.846 | 0.39% | (1s) | 5.946 | 4.46% | (1s) | 8.946 | 15.22% | (5s) |
| TSP50 Model | RL | G | 3.885 | 1.41% | (1s) | 5.809 | 2.06% | (1s) | 8.177 | 5.32% | (5s) |
| TSP100 Model | RL | G | 4.362 | 13.86% | (1s) | 5.971 | 4.90% | (1s) | 8.146 | 4.92% | (5s) |
| *Sampling, 1,280 solutions* | | | | | | | | | | | |
| TSP20 Model | SL | S | 3.831 | 0.00% | (6m) | 5.992 | 5.27% | (26m) | 11.115 | 43.16% | (1.5h) |
| TSP50 Model | SL | S | 3.834 | 0.08% | (6m) | 5.694 | 0.04% | (26m) | 8.135 | 4.78% | (1.5h) |
| TSP100 Model | SL | S | 4.380 | 14.33% | (6m) | 5.751 | 1.04% | (26m) | 7.862 | 1.26% | (1.5h) |
| TSP20 Model | RL | S | 3.834 | 0.08% | (6m) | 5.805 | 1.99% | (26m) | 9.733 | 25.36% | (1.5h) |
| TSP50 Model | RL | S | 3.841 | 0.26% | (6m) | 5.726 | 0.60% | (26m) | 7.990 | 2.91% | (1.5h) |
| TSP100 Model | RL | S | 3.951 | 3.13% | (6m) | 5.847 | 2.72% | (26m) | 7.972 | 2.68% | (1.5h) |
| *Beam search, 1,280 width* | | | | | | | | | | | |
| TSP20 Model | SL | BS | 3.831 | 0.00% | (4m) | 5.750 | 1.02% | (30m) | 9.928 | 27.87% | (2h) |
| TSP50 Model | SL | BS | 3.831 | 0.00% | (4m) | 5.692 | 0.00% | (30m) | 7.905 | 1.82% | (2h) |
| TSP100 Model | SL | BS | 4.138 | 8.01% | (4m) | 5.703 | 0.19% | (30m) | 7.794 | 0.39% | (2h) |
| TSP20 Model | RL | BS | 3.831 | 0.00% | (4m) | 5.795 | 1.81% | (30m) | 9.166 | 18.06% | (2h) |
| TSP50 Model | RL | BS | 3.833 | 0.05% | (4m) | 5.714 | 0.39% | (30m) | 7.986 | 2.86% | (2h) |
| TSP100 Model | RL | BS | 3.922 | 2.37% | (4m) | 5.824 | 2.32% | (30m) | 7.986 | 2.86% | (2h) |
[Figure 1: three panels, (a) Greedy Search, (b) Sampling, (c) Beam Search; x-axis: TSP Size (20-100), y-axis: Optimality Gap (%); curves for TSP20/TSP50/TSP100 SL and RL models (GAT).]
Figure 1: Zero-shot generalization trends on TSP20-TSP100 for various search settings.

• The TSP50 RL model generalizes to TSP20 and TSP100 better than the TSP50 SL model, except when performing beam search; e.g. in the sampling setting, a TSP100 optimality gap of 4.78% for the TSP50 SL model vs. 2.91% for the TSP50 RL model.

• In all search settings, the TSP100 SL model shows extremely poor generalization to TSP20 compared to the TSP100 RL model; e.g. in the beam search setting, a TSP20 optimality gap of 8.01% for the TSP100 SL model vs. 2.37% for the TSP100 RL model.

TSP solution visualizations for various models are available in Appendix D.

Performance on large-scale instances
We further evaluate all SL and RL models on TSP instances with up to 500 nodes to determine the upper limits of emergent generalization. During sampling/beam search, we only sample/search for 250 solutions instead of 1,280 as in previous experiments, due to time and memory constraints.

Figure 2 shows that RL training leads to relatively better generalization than SL in all search settings. Although no model is able to solve large instances close to optimal, generalization performance breaks down smoothly, suggesting that an incremental fine-tuning process using RL might be the key to learning solvers for large-scale TSP. Additionally, sampling leads to worse performance than greedy search for larger instances. This may be due to the probability distributions at each decoding step being close to uniform, i.e. the learnt policy is not confident about choosing the next node.
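The last observation can be illustrated with a toy simulation (plain Python, made-up near-uniform probabilities): when the per-step distribution is close to uniform, sampling rarely agrees with the greedy (argmax) choice, so sampled tours accumulate many low-probability moves.

```python
import random

random.seed(0)
step_probs = [0.26, 0.25, 0.25, 0.24]  # near-uniform distribution over next nodes
greedy_choice = max(range(len(step_probs)), key=step_probs.__getitem__)

# Simulate many sampled decisions at this step
draws = random.choices(range(len(step_probs)), weights=step_probs, k=100_000)
agreement = sum(d == greedy_choice for d in draws) / len(draws)

print(greedy_choice)        # 0
print(round(agreement, 2))  # ~0.26: sampling picks the greedy node only ~1 in 4 steps
```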
[Figure 2: three panels, (a) Greedy Search, (b) Sampling, (c) Beam Search; x-axis: TSP Size (150-500), y-axis: Optimality Gap (%); curves for TSP20/TSP50/TSP100 SL and RL models (GAT).]
Figure 2: Zero-shot generalization trends on TSP150-TSP500 for various search settings.
Stability and sample efficiency of learning paradigms
Conventionally, RL is known to be unstable and less sample efficient than SL. As seen in Figure 3, we found training to be stable and equally sample efficient for both learning paradigms in our experiments. SL and RL models require approximately equal numbers of mini-batches to converge to stable states using a fixed learning rate of 10⁻⁴. As noted by [13], using larger learning rates with a decay schedule speeds up learning at the cost of stability.

[Figure 3: three panels, (a) TSP20, (b) TSP50, (c) TSP100; y-axis: Optimality Gap (%); curves for two SL runs and two RL runs each.]
Figure 3: Validation optimality gap (using greedy search) vs. number of training mini-batches for SL and RL models. The difference in validation optimality gap is prominent in the greedy setting, but reduces considerably as we sample or use beam search.
Impact of RL baseline and graph encoder
In Appendix B, we replace the rollout baseline for REINFORCE with a critic baseline similar to [2]. We find that RL models follow similar generalization trends on variable sized instances, regardless of the choice of baseline used.

Further, in Appendix C, we replace the Graph Transformer encoder with the Graph Convolutional Network encoder from [9]. We find that the choice of graph encoder has negligible impact on performance, and that the better generalization of RL models is due to the learning paradigm.
This paper investigates the choice of learning paradigms for the Travelling Salesman Problem. Through controlled experiments, we find that both supervised learning and reinforcement learning can train models which perform close to optimal on fixed graph sizes. Evaluating models on variable sized instances larger than those seen in training reveals a threefold advantage of RL training: (1) training does not require expensive labelled data; (2) learning is as stable and sample-efficient as SL; and, most importantly, (3) RL training leads to better emergent generalization on variable sized graphs. Finding (3) has broader implications for building scale-invariant solvers for practical combinatorial problems beyond TSP.

Future work will explore more detailed evaluations of learning paradigms for other problems, as well as fine-tuning/transfer learning for generalization to large-scale instances.

Acknowledgements
XB is supported in part by NRF Fellowship NRFF2017-10.
References

[1] D. L. Applegate, R. E. Bixby, V. Chvatal, and W. J. Cook. The Traveling Salesman Problem: A Computational Study. Princeton University Press, 2006.
[2] I. Bello, H. Pham, Q. V. Le, M. Norouzi, and S. Bengio. Neural combinatorial optimization with reinforcement learning. In International Conference on Learning Representations, 2017.
[3] Y. Bengio, A. Lodi, and A. Prouvost. Machine learning for combinatorial optimization: a methodological tour d'horizon. arXiv preprint arXiv:1811.06128, 2018.
[4] X. Bresson and T. Laurent. An experimental study of neural networks for variable graphs. In International Conference on Learning Representations, 2018.
[5] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun. Spectral networks and locally connected networks on graphs. In International Conference on Learning Representations, 2014.
[6] M. Defferrard, X. Bresson, and P. Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pages 3844-3852, 2016.
[7] M. Deudon, P. Cournut, A. Lacoste, Y. Adulyasak, and L.-M. Rousseau. Learning heuristics for the TSP by policy gradient. In International Conference on the Integration of Constraint Programming, Artificial Intelligence, and Operations Research, pages 170-181. Springer, 2018.
[8] W. Hamilton, Z. Ying, and J. Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pages 1024-1034, 2017.
[9] C. K. Joshi, T. Laurent, and X. Bresson. An efficient graph convolutional network technique for the travelling salesman problem. arXiv preprint arXiv:1906.01227, 2019.
[10] E. Khalil, H. Dai, Y. Zhang, B. Dilkina, and L. Song. Learning combinatorial optimization algorithms over graphs. In Advances in Neural Information Processing Systems, pages 6348-6358, 2017.
[11] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
[12] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations, 2017.
[13] W. Kool, H. van Hoof, and M. Welling. Attention, learn to solve routing problems! In International Conference on Learning Representations, 2019.
[14] F. Monti, D. Boscaini, J. Masci, E. Rodolà, J. Svoboda, and M. M. Bronstein. Geometric deep learning on graphs and manifolds using mixture model CNNs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5115-5124, 2017.
[15] K. A. Smith. Neural networks for combinatorial optimization: a review of more than a decade of research. INFORMS Journal on Computing, 11(1):15-34, 1999.
[16] S. Sukhbaatar, A. Szlam, and R. Fergus. Learning multiagent communication with backpropagation. In Advances in Neural Information Processing Systems, pages 2244-2252, 2016.
[17] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio. Graph attention networks. In International Conference on Learning Representations, 2018.
[18] O. Vinyals, M. Fortunato, and N. Jaitly. Pointer networks. In Advances in Neural Information Processing Systems, pages 2692-2700, 2015.
[19] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229-256, 1992.
A Experimental Setup

Table 2 presents an overview of our experimental setup for training autoregressive SL and RL models on fixed graphs. We use the model architecture specified in [13] for both approaches: the graph encoder is a 3-layer Graph Transformer [17] with 128-dimensional embeddings/hidden states and 8 attention heads per layer. The node and graph embeddings produced by the encoder are provided to an autoregressive attention-based decoder which outputs TSP tours node-by-node. Both models are trained using the Adam optimizer [11] with a fixed learning rate of 10⁻⁴ for 100 epochs with mini-batches of size 512, where the epoch size is 1,000,000.

Table 2: Training setup for comparing autoregressive SL and RL models.

| Parameter | Reinforcement Learning | Supervised Learning |
|---|---|---|
| Training Epochs | 100 | 100 |
| Epoch Size | 1,000,000 | 1,000,000 |
| Batch Size | 512 | 512 |
| Graph Generation | Random | Fixed set of 1M |
| Graph Encoder | Graph Transformer | Graph Transformer |
| Encoder Layers | 3 | 3 |
| Embedding Dimension | 128 | 128 |
| Hidden Dimension | 128 | 128 |
| Feed-forward Dimension | 512 | 512 |
| Attention Heads | 8 | 8 |
| Number of Parameters | 706,608 | 706,608 |
| Loss Function | REINFORCE, rollout baseline | Cross Entropy Loss |
| Other Tricks | Baseline update after validation | Teacher Forcing |
| Optimizer | Adam | Adam |
| Learning Rate | 10⁻⁴ (fixed) | 10⁻⁴ (fixed) |

B Replacing Rollout with Critic Baseline
In Figure 4, we compare the generalization trends for RL models trained with rollout baselines (as in [13]) and critic baselines (as in [2]). The critic network uses the same 3-layer graph encoder architecture as the main model, after which the node embeddings are averaged and provided to an MLP with one ReLU-activated hidden layer and a single output. All other experimental settings are kept consistent across both approaches, as in Table 2. We find that RL models follow similar generalization trends on variable sized instances, regardless of the choice of baseline used to train them.
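The two baselines differ only in how the REINFORCE advantage is formed; a minimal sketch (plain Python, with illustrative numbers standing in for the learnt models):

```python
def advantage_rollout(sampled_len, greedy_rollout_len):
    """Rollout baseline: subtract the tour length obtained by greedily
    decoding the current best policy on the same instance."""
    return sampled_len - greedy_rollout_len

def advantage_critic(sampled_len, critic_value):
    """Critic baseline: subtract a learnt estimate of the expected tour
    length for the instance (trained by regression, as in [2])."""
    return sampled_len - critic_value

# Illustrative numbers: a sampled tour of length 6.1 on an instance where
# the greedy rollout gives 5.9 and the critic predicts 5.95
print(round(advantage_rollout(6.1, 5.9), 2))   # 0.2
print(round(advantage_critic(6.1, 5.95), 2))   # 0.15
```

In both cases, a positive advantage pushes the policy away from the sampled tour and a negative advantage reinforces it.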
[Figure 4: three panels, (a) Greedy Search, (b) Sampling, (c) Beam Search; x-axis: TSP Size (20-100), y-axis: Optimality Gap (%); curves for TSP20/TSP50/TSP100 RL (rollout) and RL-AC (critic) models.]
Figure 4: Zero-shot generalization trends on TSP20-TSP100 for various search settings.
C Replacing GAT Graph Encoder with GCN

To further isolate the impact of learning paradigms on performance and generalization, we swap the Graph Transformer encoder used in [13] with the Graph Convolutional Network encoder from [9], keeping the autoregressive decoder the same. Our GCN architecture consists of 3 graph convolution layers with 128-dimensional embeddings/hidden states. We use residual connections, batch normalization and edge gates between each layer, as described in [4].

Figure 5 plots the performance of SL and RL models using GAT and GCN as graph encoders. We find negligible impact on performance for both learning paradigms, despite GCN models having approximately half as many parameters as GAT models. Generalization to instance sizes different from the training graphs indicates that the better emergent generalization of RL models is chiefly due to the learning paradigm, and is independent of the choice of graph encoder.
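The residual, edge-gated graph convolution described above can be sketched as follows (NumPy; simplified for illustration: batch normalization is omitted, weights are random, and the dimensions are toy sizes, not the trained model's):

```python
import numpy as np

def gated_gcn_layer(h, e, W1, W2, W3):
    """One simplified edge-gated graph convolution with a residual connection.

    h: (n, d) node features; e: (n, n, d) edge features;
    W1, W2, W3: (d, d) weights. Edge gates eta_ij = sigmoid(e_ij @ W3)
    modulate how much neighbour j contributes to node i.
    """
    gates = 1.0 / (1.0 + np.exp(-(e @ W3)))          # (n, n, d) edge gates
    neigh = np.einsum("ijd,jd->id", gates, h @ W2)   # gated neighbourhood sum
    return h + np.maximum(0.0, h @ W1 + neigh)       # residual + ReLU

rng = np.random.default_rng(0)
n, d = 5, 8                                          # tiny illustrative sizes
h = rng.normal(size=(n, d))
e = rng.normal(size=(n, n, d))
W1, W2, W3 = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
out = gated_gcn_layer(h, e, W1, W2, W3)
print(out.shape)  # (5, 8)
```

The residual path means each layer refines, rather than replaces, the node embeddings, which is what allows stacking several such layers.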
[Figure 5: nine panels; rows: TSP20, TSP50 and TSP100 models; columns: (a/d/g) Greedy Search, (b/e/h) Sampling, (c/f/i) Beam Search; x-axis: TSP Size (20-100), y-axis: Optimality Gap (%); curves compare GAT vs. GCN encoders under SL and RL.]
Figure 5: Zero-shot generalization trends for various search settings for GAT and GCN models.

We did not experiment with larger GCN encoders, as deeper GCNs took longer to train than the 3-layer GATs and had high GPU overhead due to the computation of edge features for fully-connected graphs.

D Solution Visualizations
Figures 6, 7, 8 and 9 display prediction visualizations for samples from the test sets of various problem instances under various search settings. In each figure, the first panel shows the ground-truth TSP tour, obtained using Concorde. Subsequent panels show the predicted tours from each model.
Concorde: 4.131 TSP20 Model (GAT,SL): 4.131 TSP20 Model (GAT,RL): 4.154 TSP50 Model (GAT,SL): 4.154 TSP50 Model (GAT,RL): 4.131 TSP100 Model (GAT,SL): 7.477 TSP100 Model (GAT,RL): 4.562 (a) TSP20 instance, Greedy search
Concorde: 4.132 TSP20 Model (GAT,SL): 4.132 TSP20 Model (GAT,RL): 4.132 TSP50 Model (GAT,SL): 4.143 TSP50 Model (GAT,RL): 4.204 TSP100 Model (GAT,SL): 4.543 TSP100 Model (GAT,RL): 4.255 (b) TSP20 instance, Sampling (1280 solutions)
Concorde: 4.102 TSP20 Model (GAT,SL): 4.096 TSP20 Model (GAT,RL): 4.096 TSP50 Model (GAT,SL): 4.096 TSP50 Model (GAT,RL): 4.096 TSP100 Model (GAT,SL): 4.664 TSP100 Model (GAT,RL): 4.213 (c) TSP20 instance, Beam search (1280 width)
Figure 6: Prediction visualizations from various models for TSP20 test instances.
Concorde: 6.232 TSP20 Model (GAT,SL): 6.432 TSP20 Model (GAT,RL): 6.727 TSP50 Model (GAT,SL): 6.287 TSP50 Model (GAT,RL): 6.577 TSP100 Model (GAT,SL): 6.930 TSP100 Model (GAT,RL): 6.432 (a) TSP50 instance, Greedy search
Concorde: 5.679 TSP20 Model (GAT,SL): 5.950 TSP20 Model (GAT,RL): 5.782 TSP50 Model (GAT,SL): 5.679 TSP50 Model (GAT,RL): 5.689 TSP100 Model (GAT,SL): 5.691 TSP100 Model (GAT,RL): 5.788 (b) TSP50 instance, Sampling (1280 solutions)
Concorde: 5.772 TSP20 Model (GAT,SL): 5.824 TSP20 Model (GAT,RL): 5.963 TSP50 Model (GAT,SL): 5.772 TSP50 Model (GAT,RL): 5.856 TSP100 Model (GAT,SL): 5.772 TSP100 Model (GAT,RL): 5.932 (c) TSP50 instance, Beam search (1280 width)
Figure 7: Prediction visualizations from various models for TSP50 test instances.

Concorde: 7.848 TSP20 Model (GAT,SL): 11.054 TSP20 Model (GAT,RL): 9.077 TSP50 Model (GAT,SL): 8.450 TSP50 Model (GAT,RL): 8.102 TSP100 Model (GAT,SL): 7.945 TSP100 Model (GAT,RL): 8.126 (a) TSP100 instance, Greedy search
Concorde: 7.346 TSP20 Model (GAT,SL): 10.589 TSP20 Model (GAT,RL): 9.020 TSP50 Model (GAT,SL): 7.541 TSP50 Model (GAT,RL): 7.538 TSP100 Model (GAT,SL): 7.395 TSP100 Model (GAT,RL): 7.553 (b) TSP100 instance, Sampling (1280 solutions)
Concorde: 7.595 TSP20 Model (GAT,SL): 9.490 TSP20 Model (GAT,RL): 8.894 TSP50 Model (GAT,SL): 7.725 TSP50 Model (GAT,RL): 7.872 TSP100 Model (GAT,SL): 7.648 TSP100 Model (GAT,RL): 7.772 (c) TSP100 instance, Beam search (1280 width)
Figure 8: Prediction visualizations from various models for TSP100 test instances.
Concorde: 10.664 TSP20 Model (GAT,SL): 15.704 TSP20 Model (GAT,RL): 13.854 TSP50 Model (GAT,SL): 12.922 TSP50 Model (GAT,RL): 12.037 TSP100 Model (GAT,SL): 12.664 TSP100 Model (GAT,RL): 11.410 (a) TSP200 instance, Greedy search
Concorde: 10.693 TSP20 Model (GAT,SL): 21.592 TSP20 Model (GAT,RL): 18.537 TSP50 Model (GAT,SL): 14.481 TSP50 Model (GAT,RL): 12.885 TSP100 Model (GAT,SL): 12.104 TSP100 Model (GAT,RL): 11.398 (b) TSP200 instance, Sampling (250 solutions)
Concorde: 10.851 TSP20 Model (GAT,SL): 14.689 TSP20 Model (GAT,RL): 14.249 TSP50 Model (GAT,SL): 13.880 TSP50 Model (GAT,RL): 11.935 TSP100 Model (GAT,SL): 11.315 TSP100 Model (GAT,RL): 11.473 (c) TSP200 instance, Beam search (250 width)

Figure 9: Prediction visualizations from various models for TSP200 test instances.