[PDF] Variational Neural Annealing

Abstract

Many important challenges in science and technology can be cast as optimization problems. When viewed in a statistical physics framework, these can be tackled by simulated annealing, where a gradual cooling procedure helps search for groundstate solutions of a target Hamiltonian. While powerful, simulated annealing is known to have prohibitively slow sampling dynamics when the optimization landscape is rough or glassy. Here we show that by generalizing the target distribution with a parameterized model, an analogous annealing framework based on the variational principle can be used to search for groundstate solutions. Modern autoregressive models such as recurrent neural networks provide ideal parameterizations since they can be exactly sampled without slow dynamics even when the model encodes a rough landscape. We implement this procedure in the classical and quantum settings on several prototypical spin glass Hamiltonians, and find that it significantly outperforms traditional simulated annealing in the asymptotic limit, illustrating the potential power of this yet unexplored route to optimization.

Full PDF

VVariational Neural Annealing

Mohamed Hibat-Allah,

1, 2, ∗ Estelle M. Inack,

3, 1

Roeland Wiersema,

1, 2

Roger G. Melko,

2, 3 and Juan Carrasquilla

1, 2 Vector Institute, MaRS Centre, Toronto, Ontario, M5G 1M1, Canada Department of Physics and Astronomy, University of Waterloo, Ontario, N2L 3G1, Canada Perimeter Institute for Theoretical Physics, Waterloo, ON N2L 2Y5, Canada (Dated: January 26, 2021)Many important challenges in science and technology can be cast as optimization problems. Whenviewed in a statistical physics framework, these can be tackled by simulated annealing, where agradual cooling procedure helps search for groundstate solutions of a target Hamiltonian. Whilepowerful, simulated annealing is known to have prohibitively slow sampling dynamics when theoptimization landscape is rough or glassy. Here we show that by generalizing the target distributionwith a parameterized model, an analogous annealing framework based on the variational principlecan be used to search for groundstate solutions. Modern autoregressive models such as recurrentneural networks provide ideal parameterizations since they can be exactly sampled without slowdynamics even when the model encodes a rough landscape. We implement this procedure in theclassical and quantum settings on several prototypical spin glass Hamiltonians, and ﬁnd that itsigniﬁcantly outperforms traditional simulated annealing in the asymptotic limit, illustrating thepotential power of this yet unexplored route to optimization.

I. INTRODUCTION

A wide array of complex combinatorial optimizationproblems can be reformulated as ﬁnding the lowest en-ergy conﬁguration of an Ising Hamiltonian of the form [1]: H target = − (cid:88) iAAAB8nicbVBNS8NAEN3Ur1q/qh69LBbBU0lE0ItQ9OKxQr8gDWWz3bRLN7thdyKE0J/hxYMiXv013vw3btoctPpg4PHeDDPzwkRwA6775VTW1jc2t6rbtZ3dvf2D+uFRz6hUU9alSig9CIlhgkvWBQ6CDRLNSBwK1g9nd4Xff2TacCU7kCUsiMlE8ohTAlbyO/gGD7mMIKuN6g236S6A/xKvJA1Uoj2qfw7HiqYxk0AFMcb33ASCnGjgVLB5bZgalhA6IxPmWypJzEyQL06e4zOrjHGktC0JeKH+nMhJbEwWh7YzJjA1q14h/uf5KUTXQc5lkgKTdLkoSgUGhYv/8ZhrRkFklhCqub0V0ynRhIJNqQjBW335L+ldND236T1cNlq3ZRxVdIJO0Tny0BVqoXvURl1EkUJP6AW9OuA8O2/O+7K14pQzx+gXnI9v0KmQSQ== AAAB8nicbVBNS8NAEN3Ur1q/qh69LBbBU0lE0ItQ9OKxQr8gDWWz3bRLN7thdyKE0J/hxYMiXv013vw3btoctPpg4PHeDDPzwkRwA6775VTW1jc2t6rbtZ3dvf2D+uFRz6hUU9alSig9CIlhgkvWBQ6CDRLNSBwK1g9nd4Xff2TacCU7kCUsiMlE8ohTAlbyO/gGD7mMIKuN6g236S6A/xKvJA1Uoj2qfw7HiqYxk0AFMcb33ASCnGjgVLB5bZgalhA6IxPmWypJzEyQL06e4zOrjHGktC0JeKH+nMhJbEwWh7YzJjA1q14h/uf5KUTXQc5lkgKTdLkoSgUGhYv/8ZhrRkFklhCqub0V0ynRhIJNqQjBW335L+ldND236T1cNlq3ZRxVdIJO0Tny0BVqoXvURl1EkUJP6AW9OuA8O2/O+7K14pQzx+gXnI9v0KmQSQ== AAAB8nicbVBNS8NAEN3Ur1q/qh69LBbBU0lE0ItQ9OKxQr8gDWWz3bRLN7thdyKE0J/hxYMiXv013vw3btoctPpg4PHeDDPzwkRwA6775VTW1jc2t6rbtZ3dvf2D+uFRz6hUU9alSig9CIlhgkvWBQ6CDRLNSBwK1g9nd4Xff2TacCU7kCUsiMlE8ohTAlbyO/gGD7mMIKuN6g236S6A/xKvJA1Uoj2qfw7HiqYxk0AFMcb33ASCnGjgVLB5bZgalhA6IxPmWypJzEyQL06e4zOrjHGktC0JeKH+nMhJbEwWh7YzJjA1q14h/uf5KUTXQc5lkgKTdLkoSgUGhYv/8ZhrRkFklhCqub0V0ynRhIJNqQjBW335L+ldND236T1cNlq3ZRxVdIJO0Tny0BVqoXvURl1EkUJP6AW9OuA8O2/O+7K14pQzx+gXnI9v0KmQSQ== AAAB8nicbVBNS8NAEN3Ur1q/qh69LBbBU0lE0ItQ9OKxQr8gDWWz3bRLN7thdyKE0J/hxYMiXv013vw3btoctPpg4PHeDDPzwkRwA6775VTW1jc2t6rbtZ3dvf2D+uFRz6hUU9alSig9CIlhgkvWBQ6CDRLNSBwK1g9nd4Xff2TacCU7kCUsiMlE8ohTAlbyO/gGD7mMIKuN6g236S6A/xKvJA1Uoj2qfw7HiqYxk0AFMcb33ASCnGjgVLB5bZgalhA6IxPmWypJzEyQL06e4zOrjHGktC0JeKH+nMhJbEwWh7YzJjA1q14h/uf5KUTXQc5lkgKTdLkoSgUGhYv/8ZhrRkFklhCqub0V0ynRhIJNqQjBW335L+ldND236T1cNlq3ZRxVdIJO0Tny0BVqoXvURl1EkUJP6AW9OuA8O2/O+7K14pQzx+gXnI9v0KmQSQ== T = 0 AAAB7XicbVBNS8NAEJ3Ur1q/qh69LBbBU0lE0ItQ9OKxQr+gDWWz3bRrN7thdyOU0P/gxYMiXv0/3vw3btIctPXBwOO9GWbmBTFn2rjut1NaW9/Y3CpvV3Z29/YPqodHHS0TRWibSC5VL8CaciZo2zDDaS9WFEcBp91gepf53SeqNJOiZWYx9SM8FixkBBsrdVroBrmVYbXm1t0caJV4BalBgeaw+jUYSZJEVBjCsdZ9z42Nn2JlGOF0XhkkmsaYTPGY9i0VOKLaT/Nr5+jMKiMUSmVLGJSrvydSHGk9iwLbGWEz0cteJv7n9RMTXvspE3FiqCCLRWHCkZEoex2NmKLE8JklmChmb0VkghUmxgaUheAtv7xKOhd1z617D5e1xm0RRxlO4BTOwYMraMA9NKENBB7hGV7hzZHOi/PufCxaS04xcwx/4Hz+AID2jcE= AAAB7XicbVBNS8NAEJ3Ur1q/qh69LBbBU0lE0ItQ9OKxQr+gDWWz3bRrN7thdyOU0P/gxYMiXv0/3vw3btIctPXBwOO9GWbmBTFn2rjut1NaW9/Y3CpvV3Z29/YPqodHHS0TRWibSC5VL8CaciZo2zDDaS9WFEcBp91gepf53SeqNJOiZWYx9SM8FixkBBsrdVroBrmVYbXm1t0caJV4BalBgeaw+jUYSZJEVBjCsdZ9z42Nn2JlGOF0XhkkmsaYTPGY9i0VOKLaT/Nr5+jMKiMUSmVLGJSrvydSHGk9iwLbGWEz0cteJv7n9RMTXvspE3FiqCCLRWHCkZEoex2NmKLE8JklmChmb0VkghUmxgaUheAtv7xKOhd1z617D5e1xm0RRxlO4BTOwYMraMA9NKENBB7hGV7hzZHOi/PufCxaS04xcwx/4Hz+AID2jcE= AAAB7XicbVBNS8NAEJ3Ur1q/qh69LBbBU0lE0ItQ9OKxQr+gDWWz3bRrN7thdyOU0P/gxYMiXv0/3vw3btIctPXBwOO9GWbmBTFn2rjut1NaW9/Y3CpvV3Z29/YPqodHHS0TRWibSC5VL8CaciZo2zDDaS9WFEcBp91gepf53SeqNJOiZWYx9SM8FixkBBsrdVroBrmVYbXm1t0caJV4BalBgeaw+jUYSZJEVBjCsdZ9z42Nn2JlGOF0XhkkmsaYTPGY9i0VOKLaT/Nr5+jMKiMUSmVLGJSrvydSHGk9iwLbGWEz0cteJv7n9RMTXvspE3FiqCCLRWHCkZEoex2NmKLE8JklmChmb0VkghUmxgaUheAtv7xKOhd1z617D5e1xm0RRxlO4BTOwYMraMA9NKENBB7hGV7hzZHOi/PufCxaS04xcwx/4Hz+AID2jcE= AAAB7XicbVBNS8NAEJ3Ur1q/qh69LBbBU0lE0ItQ9OKxQr+gDWWz3bRrN7thdyOU0P/gxYMiXv0/3vw3btIctPXBwOO9GWbmBTFn2rjut1NaW9/Y3CpvV3Z29/YPqodHHS0TRWibSC5VL8CaciZo2zDDaS9WFEcBp91gepf53SeqNJOiZWYx9SM8FixkBBsrdVroBrmVYbXm1t0caJV4BalBgeaw+jUYSZJEVBjCsdZ9z42Nn2JlGOF0XhkkmsaYTPGY9i0VOKLaT/Nr5+jMKiMUSmVLGJSrvydSHGk9iwLbGWEz0cteJv7n9RMTXvspE3FiqCCLRWHCkZEoex2NmKLE8JklmChmb0VkghUmxgaUheAtv7xKOhd1z617D5e1xm0RRxlO4BTOwYMraMA9NKENBB7hGV7hzZHOi/PufCxaS04xcwx/4Hz+AID2jcE= P AAACNHicfVDLSgMxFL1TX7W+Rl26CRbBVZkRQZdFN4KbCvYBbRkyaaYNzWSGJGMpQz/KjR/iRgQXirj1G8y0s6iteCBwOOfcJPf4MWdKO86rVVhZXVvfKG6WtrZ3dvfs/YOGihJJaJ1EPJItHyvKmaB1zTSnrVhSHPqcNv3hdeY3H6hULBL3ehzTboj7ggWMYG0kz76teSnq9KKRwFJGI9RJ4kUyZ/6Xm5Q8u+xUnCnQMnFzUoYcNc9+NjeSJKRCE46VartOrLsplpoRTielTqJojMkQ92nbUIFDqrrpdOkJOjFKDwWRNEdoNFXnJ1IcKjUOfZMMsR6oRS8T//LaiQ4uuykTcaKpILOHgoQjHaGsQdRjkhLNx4ZgIpn5KyIDLDHRpuesBHdx5WXSOKu4TsW9Oy9Xr/I6inAEx3AKLlxAFW6gBnUg8Agv8A4f1pP1Zn1aX7NowcpnDuEXrO8fzwWstw== AAACNHicfVDLSgMxFL1TX7W+Rl26CRbBVZkRQZdFN4KbCvYBbRkyaaYNzWSGJGMpQz/KjR/iRgQXirj1G8y0s6iteCBwOOfcJPf4MWdKO86rVVhZXVvfKG6WtrZ3dvfs/YOGihJJaJ1EPJItHyvKmaB1zTSnrVhSHPqcNv3hdeY3H6hULBL3ehzTboj7ggWMYG0kz76teSnq9KKRwFJGI9RJ4kUyZ/6Xm5Q8u+xUnCnQMnFzUoYcNc9+NjeSJKRCE46VartOrLsplpoRTielTqJojMkQ92nbUIFDqrrpdOkJOjFKDwWRNEdoNFXnJ1IcKjUOfZMMsR6oRS8T//LaiQ4uuykTcaKpILOHgoQjHaGsQdRjkhLNx4ZgIpn5KyIDLDHRpuesBHdx5WXSOKu4TsW9Oy9Xr/I6inAEx3AKLlxAFW6gBnUg8Agv8A4f1pP1Zn1aX7NowcpnDuEXrO8fzwWstw== AAACNHicfVDLSgMxFL1TX7W+Rl26CRbBVZkRQZdFN4KbCvYBbRkyaaYNzWSGJGMpQz/KjR/iRgQXirj1G8y0s6iteCBwOOfcJPf4MWdKO86rVVhZXVvfKG6WtrZ3dvfs/YOGihJJaJ1EPJItHyvKmaB1zTSnrVhSHPqcNv3hdeY3H6hULBL3ehzTboj7ggWMYG0kz76teSnq9KKRwFJGI9RJ4kUyZ/6Xm5Q8u+xUnCnQMnFzUoYcNc9+NjeSJKRCE46VartOrLsplpoRTielTqJojMkQ92nbUIFDqrrpdOkJOjFKDwWRNEdoNFXnJ1IcKjUOfZMMsR6oRS8T//LaiQ4uuykTcaKpILOHgoQjHaGsQdRjkhLNx4ZgIpn5KyIDLDHRpuesBHdx5WXSOKu4TsW9Oy9Xr/I6inAEx3AKLlxAFW6gBnUg8Agv8A4f1pP1Zn1aX7NowcpnDuEXrO8fzwWstw== AAACNHicfVDLSgMxFL1TX7W+Rl26CRbBVZkRQZdFN4KbCvYBbRkyaaYNzWSGJGMpQz/KjR/iRgQXirj1G8y0s6iteCBwOOfcJPf4MWdKO86rVVhZXVvfKG6WtrZ3dvfs/YOGihJJaJ1EPJItHyvKmaB1zTSnrVhSHPqcNv3hdeY3H6hULBL3ehzTboj7ggWMYG0kz76teSnq9KKRwFJGI9RJ4kUyZ/6Xm5Q8u+xUnCnQMnFzUoYcNc9+NjeSJKRCE46VartOrLsplpoRTielTqJojMkQ92nbUIFDqrrpdOkJOjFKDwWRNEdoNFXnJ1IcKjUOfZMMsR6oRS8T//LaiQ4uuykTcaKpILOHgoQjHaGsQdRjkhLNx4ZgIpn5KyIDLDHRpuesBHdx5WXSOKu4TsW9Oy9Xr/I6inAEx3AKLlxAFW6gBnUg8Agv8A4f1pP1Zn1aX7NowcpnDuEXrO8fzwWstw== P " AAACNnicbVDLSgMxFL3js9bXqEs3wSK4KjMi6LLoxo1QwT6gLUMmzbShmcyQZCxl6Fe58TvcdeNCEbd+gpl2FG17IHA459wk9/gxZ0o7zsRaWV1b39gsbBW3d3b39u2Dw7qKEklojUQ8kk0fK8qZoDXNNKfNWFIc+pw2/MFN5jceqVQsEg96FNNOiHuCBYxgbSTPvqt6KUKoncRYymiI2t1oKBbpr70s96ONi55dcsrOFGiRuDkpQY6qZ7+Yi0gSUqEJx0q1XCfWnRRLzQin42I7UTTGZIB7tGWowCFVnXS69hidGqWLgkiaIzSaqn8nUhwqNQp9kwyx7qt5LxOXea1EB1edlIk40VSQ2UNBwpGOUNYh6jJJieYjQzCRzPwVkT6WmGjTdFaCO7/yIqmfl12n7N5flCrXeR0FOIYTOAMXLqECt1CFGhB4ggm8wbv1bL1aH9bnLLpi5TNH8A/W1zePZa0L AAACNnicbVDLSgMxFL3js9bXqEs3wSK4KjMi6LLoxo1QwT6gLUMmzbShmcyQZCxl6Fe58TvcdeNCEbd+gpl2FG17IHA459wk9/gxZ0o7zsRaWV1b39gsbBW3d3b39u2Dw7qKEklojUQ8kk0fK8qZoDXNNKfNWFIc+pw2/MFN5jceqVQsEg96FNNOiHuCBYxgbSTPvqt6KUKoncRYymiI2t1oKBbpr70s96ONi55dcsrOFGiRuDkpQY6qZ7+Yi0gSUqEJx0q1XCfWnRRLzQin42I7UTTGZIB7tGWowCFVnXS69hidGqWLgkiaIzSaqn8nUhwqNQp9kwyx7qt5LxOXea1EB1edlIk40VSQ2UNBwpGOUNYh6jJJieYjQzCRzPwVkT6WmGjTdFaCO7/yIqmfl12n7N5flCrXeR0FOIYTOAMXLqECt1CFGhB4ggm8wbv1bL1aH9bnLLpi5TNH8A/W1zePZa0L AAACNnicbVDLSgMxFL3js9bXqEs3wSK4KjMi6LLoxo1QwT6gLUMmzbShmcyQZCxl6Fe58TvcdeNCEbd+gpl2FG17IHA459wk9/gxZ0o7zsRaWV1b39gsbBW3d3b39u2Dw7qKEklojUQ8kk0fK8qZoDXNNKfNWFIc+pw2/MFN5jceqVQsEg96FNNOiHuCBYxgbSTPvqt6KUKoncRYymiI2t1oKBbpr70s96ONi55dcsrOFGiRuDkpQY6qZ7+Yi0gSUqEJx0q1XCfWnRRLzQin42I7UTTGZIB7tGWowCFVnXS69hidGqWLgkiaIzSaqn8nUhwqNQp9kwyx7qt5LxOXea1EB1edlIk40VSQ2UNBwpGOUNYh6jJJieYjQzCRzPwVkT6WmGjTdFaCO7/yIqmfl12n7N5flCrXeR0FOIYTOAMXLqECt1CFGhB4ggm8wbv1bL1aH9bnLLpi5TNH8A/W1zePZa0L AAACNnicbVDLSgMxFL3js9bXqEs3wSK4KjMi6LLoxo1QwT6gLUMmzbShmcyQZCxl6Fe58TvcdeNCEbd+gpl2FG17IHA459wk9/gxZ0o7zsRaWV1b39gsbBW3d3b39u2Dw7qKEklojUQ8kk0fK8qZoDXNNKfNWFIc+pw2/MFN5jceqVQsEg96FNNOiHuCBYxgbSTPvqt6KUKoncRYymiI2t1oKBbpr70s96ONi55dcsrOFGiRuDkpQY6qZ7+Yi0gSUqEJx0q1XCfWnRRLzQin42I7UTTGZIB7tGWowCFVnXS69hidGqWLgkiaIzSaqn8nUhwqNQp9kwyx7qt5LxOXea1EB1edlIk40VSQ2UNBwpGOUNYh6jJJieYjQzCRzPwVkT6WmGjTdFaCO7/yIqmfl12n7N5flCrXeR0FOIYTOAMXLqECt1CFGhB4ggm8wbv1bL1aH9bnLLpi5TNH8A/W1zePZa0L P AAACOHicbVDLSsNAFL3xWesr6tLNYBFclUQEXRbduLOCfUBbwmQ6aYdOJmFmYimhn+XGz3Anblwo4tYvcNIGH20PXDicc+/MvcePOVPacZ6tpeWV1bX1wkZxc2t7Z9fe26+rKJGE1kjEI9n0saKcCVrTTHPajCXFoc9pwx9cZX7jnkrFInGnRzHthLgnWMAI1kby7Juql6IM7W40FFjKaIjaSTxLFpm/2o84Lnp2ySk7E6B54uakBDmqnv1kHiJJSIUmHCvVcp1Yd1IsNSOcjovtRNEYkwHu0ZahAodUddLJ4WN0bJQuCiJpSmg0Uf9OpDhUahT6pjPEuq9mvUxc5LUSHVx0UibiRFNBph8FCUc6QlmKqMskJZqPDMFEMrMrIn0sMdEm6ywEd/bkeVI/LbtO2b09K1Uu8zgKcAhHcAIunEMFrqEKNSDwAC/wBu/Wo/VqfVif09YlK585gH+wvr4BTjStXw== AAACOHicbVDLSsNAFL3xWesr6tLNYBFclUQEXRbduLOCfUBbwmQ6aYdOJmFmYimhn+XGz3Anblwo4tYvcNIGH20PXDicc+/MvcePOVPacZ6tpeWV1bX1wkZxc2t7Z9fe26+rKJGE1kjEI9n0saKcCVrTTHPajCXFoc9pwx9cZX7jnkrFInGnRzHthLgnWMAI1kby7Juql6IM7W40FFjKaIjaSTxLFpm/2o84Lnp2ySk7E6B54uakBDmqnv1kHiJJSIUmHCvVcp1Yd1IsNSOcjovtRNEYkwHu0ZahAodUddLJ4WN0bJQuCiJpSmg0Uf9OpDhUahT6pjPEuq9mvUxc5LUSHVx0UibiRFNBph8FCUc6QlmKqMskJZqPDMFEMrMrIn0sMdEm6ywEd/bkeVI/LbtO2b09K1Uu8zgKcAhHcAIunEMFrqEKNSDwAC/wBu/Wo/VqfVif09YlK585gH+wvr4BTjStXw== AAACOHicbVDLSsNAFL3xWesr6tLNYBFclUQEXRbduLOCfUBbwmQ6aYdOJmFmYimhn+XGz3Anblwo4tYvcNIGH20PXDicc+/MvcePOVPacZ6tpeWV1bX1wkZxc2t7Z9fe26+rKJGE1kjEI9n0saKcCVrTTHPajCXFoc9pwx9cZX7jnkrFInGnRzHthLgnWMAI1kby7Juql6IM7W40FFjKaIjaSTxLFpm/2o84Lnp2ySk7E6B54uakBDmqnv1kHiJJSIUmHCvVcp1Yd1IsNSOcjovtRNEYkwHu0ZahAodUddLJ4WN0bJQuCiJpSmg0Uf9OpDhUahT6pjPEuq9mvUxc5LUSHVx0UibiRFNBph8FCUc6QlmKqMskJZqPDMFEMrMrIn0sMdEm6ywEd/bkeVI/LbtO2b09K1Uu8zgKcAhHcAIunEMFrqEKNSDwAC/wBu/Wo/VqfVif09YlK585gH+wvr4BTjStXw== AAACOHicbVDLSsNAFL3xWesr6tLNYBFclUQEXRbduLOCfUBbwmQ6aYdOJmFmYimhn+XGz3Anblwo4tYvcNIGH20PXDicc+/MvcePOVPacZ6tpeWV1bX1wkZxc2t7Z9fe26+rKJGE1kjEI9n0saKcCVrTTHPajCXFoc9pwx9cZX7jnkrFInGnRzHthLgnWMAI1kby7Juql6IM7W40FFjKaIjaSTxLFpm/2o84Lnp2ySk7E6B54uakBDmqnv1kHiJJSIUmHCvVcp1Yd1IsNSOcjovtRNEYkwHu0ZahAodUddLJ4WN0bJQuCiJpSmg0Uf9OpDhUahT6pjPEuq9mvUxc5LUSHVx0UibiRFNBph8FCUc6QlmKqMskJZqPDMFEMrMrIn0sMdEm6ywEd/bkeVI/LbtO2b09K1Uu8zgKcAhHcAIunEMFrqEKNSDwAC/wBu/Wo/VqfVif09YlK585gH+wvr4BTjStXw== Variational AAAB/HicbVBNS8NAEN3Ur1q/oj16WSyCp5KIoMeiF48V7Ae0oWy2m3bpZhN2J2II9a948aCIV3+IN/+NmzQHbX0w8Hhvhpl5fiy4Bsf5tipr6xubW9Xt2s7u3v6BfXjU1VGiKOvQSESq7xPNBJesAxwE68eKkdAXrOfPbnK/98CU5pG8hzRmXkgmkgecEjDSyK4PgT1C1iWKFwoR89rIbjhNpwBeJW5JGqhEe2R/DccRTUImgQqi9cB1YvAyooBTwea1YaJZTOiMTNjAUElCpr2sOH6OT40yxkGkTEnAhfp7IiOh1mnom86QwFQve7n4nzdIILjyMi7jBJiki0VBIjBEOE8Cj7liFERqCKGKm1sxnRJFKJi88hDc5ZdXSfe86TpN9+6i0bou46iiY3SCzpCLLlEL3aI26iCKUvSMXtGb9WS9WO/Wx6K1YpUzdfQH1ucPFBGVBg== AAAB/HicbVBNS8NAEN3Ur1q/oj16WSyCp5KIoMeiF48V7Ae0oWy2m3bpZhN2J2II9a948aCIV3+IN/+NmzQHbX0w8Hhvhpl5fiy4Bsf5tipr6xubW9Xt2s7u3v6BfXjU1VGiKOvQSESq7xPNBJesAxwE68eKkdAXrOfPbnK/98CU5pG8hzRmXkgmkgecEjDSyK4PgT1C1iWKFwoR89rIbjhNpwBeJW5JGqhEe2R/DccRTUImgQqi9cB1YvAyooBTwea1YaJZTOiMTNjAUElCpr2sOH6OT40yxkGkTEnAhfp7IiOh1mnom86QwFQve7n4nzdIILjyMi7jBJiki0VBIjBEOE8Cj7liFERqCKGKm1sxnRJFKJi88hDc5ZdXSfe86TpN9+6i0bou46iiY3SCzpCLLlEL3aI26iCKUvSMXtGb9WS9WO/Wx6K1YpUzdfQH1ucPFBGVBg== AAAB/HicbVBNS8NAEN3Ur1q/oj16WSyCp5KIoMeiF48V7Ae0oWy2m3bpZhN2J2II9a948aCIV3+IN/+NmzQHbX0w8Hhvhpl5fiy4Bsf5tipr6xubW9Xt2s7u3v6BfXjU1VGiKOvQSESq7xPNBJesAxwE68eKkdAXrOfPbnK/98CU5pG8hzRmXkgmkgecEjDSyK4PgT1C1iWKFwoR89rIbjhNpwBeJW5JGqhEe2R/DccRTUImgQqi9cB1YvAyooBTwea1YaJZTOiMTNjAUElCpr2sOH6OT40yxkGkTEnAhfp7IiOh1mnom86QwFQve7n4nzdIILjyMi7jBJiki0VBIjBEOE8Cj7liFERqCKGKm1sxnRJFKJi88hDc5ZdXSfe86TpN9+6i0bou46iiY3SCzpCLLlEL3aI26iCKUvSMXtGb9WS9WO/Wx6K1YpUzdfQH1ucPFBGVBg== AAAB/HicbVBNS8NAEN3Ur1q/oj16WSyCp5KIoMeiF48V7Ae0oWy2m3bpZhN2J2II9a948aCIV3+IN/+NmzQHbX0w8Hhvhpl5fiy4Bsf5tipr6xubW9Xt2s7u3v6BfXjU1VGiKOvQSESq7xPNBJesAxwE68eKkdAXrOfPbnK/98CU5pG8hzRmXkgmkgecEjDSyK4PgT1C1iWKFwoR89rIbjhNpwBeJW5JGqhEe2R/DccRTUImgQqi9cB1YvAyooBTwea1YaJZTOiMTNjAUElCpr2sOH6OT40yxkGkTEnAhfp7IiOh1mnom86QwFQve7n4nzdIILjyMi7jBJiki0VBIjBEOE8Cj7liFERqCKGKm1sxnRJFKJi88hDc5ZdXSfe86TpN9+6i0bou46iiY3SCzpCLLlEL3aI26iCKUvSMXtGb9WS9WO/Wx6K1YpUzdfQH1ucPFBGVBg==

Simulated annealing AAACBHicbVA9SwNBEN3z2/gVtUyzGASrcCeClkEbS0WjgeQIc5tJXNzbO3bnxHCksPGv2FgoYuuPsPPfuEmu0OiDgcd7M7szL0qVtOT7X97M7Nz8wuLScmlldW19o7y5dWWTzAhsiEQlphmBRSU1NkiSwmZqEOJI4XV0ezLyr+/QWJnoSxqkGMbQ17InBZCTOuVKm/Ce8gsZZwoIuxy0RnCv9YelTrnq1/wx+F8SFKTKCpx1yp/tbiKyGDUJBda2Aj+lMAdDUigcltqZxRTELfSx5aiGGG2Yj48Y8l2ndHkvMa408bH6cyKH2NpBHLnOGOjGTnsj8T+vlVHvKMylTjNCLSYf9TLFKeGjRHhXGhSkBo6AMNLtysUNGBDkchuFEEyf/Jdc7dcCvxacH1Trx0UcS6zCdtgeC9ghq7NTdsYaTLAH9sRe2Kv36D17b977pHXGK2a22S94H98J7phR AAACBHicbVA9SwNBEN3z2/gVtUyzGASrcCeClkEbS0WjgeQIc5tJXNzbO3bnxHCksPGv2FgoYuuPsPPfuEmu0OiDgcd7M7szL0qVtOT7X97M7Nz8wuLScmlldW19o7y5dWWTzAhsiEQlphmBRSU1NkiSwmZqEOJI4XV0ezLyr+/QWJnoSxqkGMbQ17InBZCTOuVKm/Ce8gsZZwoIuxy0RnCv9YelTrnq1/wx+F8SFKTKCpx1yp/tbiKyGDUJBda2Aj+lMAdDUigcltqZxRTELfSx5aiGGG2Yj48Y8l2ndHkvMa408bH6cyKH2NpBHLnOGOjGTnsj8T+vlVHvKMylTjNCLSYf9TLFKeGjRHhXGhSkBo6AMNLtysUNGBDkchuFEEyf/Jdc7dcCvxacH1Trx0UcS6zCdtgeC9ghq7NTdsYaTLAH9sRe2Kv36D17b977pHXGK2a22S94H98J7phR AAACBHicbVA9SwNBEN3z2/gVtUyzGASrcCeClkEbS0WjgeQIc5tJXNzbO3bnxHCksPGv2FgoYuuPsPPfuEmu0OiDgcd7M7szL0qVtOT7X97M7Nz8wuLScmlldW19o7y5dWWTzAhsiEQlphmBRSU1NkiSwmZqEOJI4XV0ezLyr+/QWJnoSxqkGMbQ17InBZCTOuVKm/Ce8gsZZwoIuxy0RnCv9YelTrnq1/wx+F8SFKTKCpx1yp/tbiKyGDUJBda2Aj+lMAdDUigcltqZxRTELfSx5aiGGG2Yj48Y8l2ndHkvMa408bH6cyKH2NpBHLnOGOjGTnsj8T+vlVHvKMylTjNCLSYf9TLFKeGjRHhXGhSkBo6AMNLtysUNGBDkchuFEEyf/Jdc7dcCvxacH1Trx0UcS6zCdtgeC9ghq7NTdsYaTLAH9sRe2Kv36D17b977pHXGK2a22S94H98J7phR AAACBHicbVA9SwNBEN3z2/gVtUyzGASrcCeClkEbS0WjgeQIc5tJXNzbO3bnxHCksPGv2FgoYuuPsPPfuEmu0OiDgcd7M7szL0qVtOT7X97M7Nz8wuLScmlldW19o7y5dWWTzAhsiEQlphmBRSU1NkiSwmZqEOJI4XV0ezLyr+/QWJnoSxqkGMbQ17InBZCTOuVKm/Ce8gsZZwoIuxy0RnCv9YelTrnq1/wx+F8SFKTKCpx1yp/tbiKyGDUJBda2Aj+lMAdDUigcltqZxRTELfSx5aiGGG2Yj48Y8l2ndHkvMa408bH6cyKH2NpBHLnOGOjGTnsj8T+vlVHvKMylTjNCLSYf9TLFKeGjRHhXGhSkBo6AMNLtysUNGBDkchuFEEyf/Jdc7dcCvxacH1Trx0UcS6zCdtgeC9ghq7NTdsYaTLAH9sRe2Kv36D17b977pHXGK2a22S94H98J7phR

Exact Boltzmann dist. AAACBnicbVDLSsNAFJ34rPUVdSnCYBFchUQEXZaK4LKCfUAbymQ6aYdOJmHmRlpDV278FTcuFHHrN7jzb5y0XWjrgYHDOfdy55wgEVyD635bS8srq2vrhY3i5tb2zq69t1/Xcaooq9FYxKoZEM0El6wGHARrJoqRKBCsEQyucr9xz5TmsbyDUcL8iPQkDzklYKSOfdQGNoTsekgo4Eos4CEiUuKuueyMix275DruBHiReDNSQjNUO/ZXuxvTNGISqCBatzw3AT8jCjgVbFxsp5olhA5Ij7UMlSRi2s8mMcb4xChdHMbKPAl4ov7eyEik9SgKzGREoK/nvVz8z2ulEF76GZdJCkzS6aEwFRhinHdi0ipGQYwMIVRx81dM+0SZSkxzeQnefORFUj9zPNfxbs9L5cqsjgI6RMfoFHnoApXRDaqiGqLoET2jV/RmPVkv1rv1MR1dsmY7B+gPrM8f6XuYvA== AAACBnicbVDLSsNAFJ34rPUVdSnCYBFchUQEXZaK4LKCfUAbymQ6aYdOJmHmRlpDV278FTcuFHHrN7jzb5y0XWjrgYHDOfdy55wgEVyD635bS8srq2vrhY3i5tb2zq69t1/Xcaooq9FYxKoZEM0El6wGHARrJoqRKBCsEQyucr9xz5TmsbyDUcL8iPQkDzklYKSOfdQGNoTsekgo4Eos4CEiUuKuueyMix275DruBHiReDNSQjNUO/ZXuxvTNGISqCBatzw3AT8jCjgVbFxsp5olhA5Ij7UMlSRi2s8mMcb4xChdHMbKPAl4ov7eyEik9SgKzGREoK/nvVz8z2ulEF76GZdJCkzS6aEwFRhinHdi0ipGQYwMIVRx81dM+0SZSkxzeQnefORFUj9zPNfxbs9L5cqsjgI6RMfoFHnoApXRDaqiGqLoET2jV/RmPVkv1rv1MR1dsmY7B+gPrM8f6XuYvA== AAACBnicbVDLSsNAFJ34rPUVdSnCYBFchUQEXZaK4LKCfUAbymQ6aYdOJmHmRlpDV278FTcuFHHrN7jzb5y0XWjrgYHDOfdy55wgEVyD635bS8srq2vrhY3i5tb2zq69t1/Xcaooq9FYxKoZEM0El6wGHARrJoqRKBCsEQyucr9xz5TmsbyDUcL8iPQkDzklYKSOfdQGNoTsekgo4Eos4CEiUuKuueyMix275DruBHiReDNSQjNUO/ZXuxvTNGISqCBatzw3AT8jCjgVbFxsp5olhA5Ij7UMlSRi2s8mMcb4xChdHMbKPAl4ov7eyEik9SgKzGREoK/nvVz8z2ulEF76GZdJCkzS6aEwFRhinHdi0ipGQYwMIVRx81dM+0SZSkxzeQnefORFUj9zPNfxbs9L5cqsjgI6RMfoFHnoApXRDaqiGqLoET2jV/RmPVkv1rv1MR1dsmY7B+gPrM8f6XuYvA== AAACBnicbVDLSsNAFJ34rPUVdSnCYBFchUQEXZaK4LKCfUAbymQ6aYdOJmHmRlpDV278FTcuFHHrN7jzb5y0XWjrgYHDOfdy55wgEVyD635bS8srq2vrhY3i5tb2zq69t1/Xcaooq9FYxKoZEM0El6wGHARrJoqRKBCsEQyucr9xz5TmsbyDUcL8iPQkDzklYKSOfdQGNoTsekgo4Eos4CEiUuKuueyMix275DruBHiReDNSQjNUO/ZXuxvTNGISqCBatzw3AT8jCjgVbFxsp5olhA5Ij7UMlSRi2s8mMcb4xChdHMbKPAl4ov7eyEik9SgKzGREoK/nvVz8z2ulEF76GZdJCkzS6aEwFRhinHdi0ipGQYwMIVRx81dM+0SZSkxzeQnefORFUj9zPNfxbs9L5cqsjgI6RMfoFHnoApXRDaqiGqLoET2jV/RmPVkv1rv1MR1dsmY7B+gPrM8f6XuYvA==

Figure 1. Schematic illustration of the space of probabilitydistributions visited during simulated annealing. An arbitrar-ily slow SA visits a series of Boltzmann distributions startingat the high temperature (e.g. T = ∞ ) and ending in the T = 0Boltzmann distribution (continuous yellow line), where a per-fect solution to an optimization problem is reached. Thesesolutions are found either at the edge or a corner (for non-degenerate problems) of the standard probabilistic simplex(colored triangle plane). A practical, ﬁnite-time SA trajectory(red dotted line), as well as a variational classical annealingtrajectory (green dashed line), deviate from the trajectory ofexact Boltzmann distributions. annealing has been so successful that it has inspired in-tense research into its quantum extension, which requiresquantum hardware to anneal the tunneling amplitude,and can be simulated in an analogous way to SA [11, 12].The SA algorithm explores an optimization problem’senergy landscape via a gradual decrease in thermalﬂuctuations generated by the Metropolis-Hastings algo-rithm. The procedure stops when all thermal kineticsare removed from the system, at which point the solu-tion to the optimization problem is expected to be found.While an exact solution to the optimization problem is al- a r X i v : . [ c ond - m a t . d i s - nn ] J a n ways attained if the decrease in temperature is arbitrarilyslow, a practical implementation of the algorithm mustnecessarily run on a ﬁnite time scale [13]. As a conse-quence, the annealing algorithm samples a series of eﬀec-tive, quasi-equilibrium distributions close but not exactlyequal to the stationary Boltzmann distributions targetedduring the annealing [14] (see Fig. 1 for a schematic illus-tration). This naturally leads to approximate solutionsto the optimization problem, whose quality generally de-pends on the interplay between the problem complexityand the rate at which the temperature is decreased.In this paper, we oﬀer an alternative route to solv-ing optimization problems of the form of Eq. (1), called variational neural annealing . Here, the conventionalsimulated annealing formulation is substituted with theannealing of a parameterized model. Namely, insteadof annealing and approximately sampling the exactBoltzmann distribution, this approach anneals a quasi-equilibrium model, which must be suﬃciently expressiveand capable of tractable sampling. Fortunately, suitablemodels have recently been provided by machine learningtechnology [15–17]. In particular, neural autoregressive models combined with variational principles have beenshown to accurately describe the equilibrium propertiesof classical and quantum systems [18–21]. Here, we im-plement variational neural annealing using autoregres-sive recurrent neural networks, and show that they oﬀera powerful alternative to conventional SA and its analo-gous quantum extension, i.e., simulated quantum anneal-ing (SQA) [11]. This powerful and unexplored route tooptimization is schematically illustrated in Fig. 1, wherea variational neural annealing trajectory (dashed greenarrow) is shown to provide a more accurate approxima-tion to the ideal trajectory (continuous yellow line) thana conventional SA run (dotted red line). II. VARIATIONAL CLASSICAL ANDQUANTUM ANNEALING

We ﬁrst consider the variational approach to statisticalmechanics [18, 22], where a distribution p λ ( σ ) deﬁned bya set of variational parameters λ is optimized to closelyreproduce the equilibrium properties of a system at tem-perature T . Following the spirit of SA, we dub our ﬁrstvariational neural annealing algorithm variational classi-cal annealing (VCA).The VCA algorithm searches for the ground state of anoptimization problem, encoded in a target Hamiltonian H target , by slowly annealing the model’s variational freeenergy F λ ( t ) = (cid:104) H target (cid:105) λ − T ( t ) S classical ( p λ ) , (2)from a high temperature to a low temperature. Thequantity F λ ( t ) provides an upper bound to the true in-stantaneous free energy and can be used at each anneal-ing stage to update λ through gradient-descent tech-niques. The brackets (cid:104) ... (cid:105) λ denote ensemble averages taken over the probability p λ ( σ ). The von Neumannentropy is given by S classical ( p λ ) = − (cid:88) σ p λ ( σ ) log ( p λ ( σ )) , (3)where the sum runs over all the elements of the statespace { σ } . In our setting, the temperature is decreasedfrom an initial value T to 0 using a linear schedule func-tion T ( t ) = T (1 − t ), where t ∈ [0 , σ drawn from p λ ( σ ). Since RNNs are normalized by con-struction, these samples naturally allow the estimation ofthe entropy in Eq. (3). We provide a detailed descriptionof the RNN in Methods Sec. V A.The VCA algorithm, summarized in Fig. 2(a), per-forms a warm-up step which brings a randomly initializeddistribution p λ ( σ ) to an approximate equilibrium statewith free energy F λ ( t = 0) via N warmup gradient descentsteps. At each step t , we reduce the temperature of thesystem from T ( t ) to T ( t + δt ) and apply N train gradi-ent descent steps to re-equilibrate the model. A criticalingredient to the success of VCA is that the variationalparameters optimized at temperature T ( t ) are reused attemperature T ( t + δt ) to ensure that the model’s distri-bution is always near its instantaneous equilibrium state.Repeating the last two steps N annealing times, we reachtemperature T (1) = 0, which is the end of the anneal-ing protocol. Here the distribution p λ ( σ ) is expectedto assign high probability to conﬁgurations σ that solvethe optimization problem. Likewise, the residual entropyEq. (3) at T (1) = 0 provides a heuristic approach tocount the number of solutions to the problem Hamilto-nian [18]. Further algorithmic details are provided inMethods Sec. V B.Simulated annealing provides a powerful heuristic forthe solution of hard optimization problems by harnessingthermal ﬂuctuations. Inspired by the latter, the advent ofcommercially available quantum devices [24] has enabledthe analogous concept of quantum annealing [25], wherethe solution to an optimization problem is performed byharnessing quantum ﬂuctuations. In quantum annealing,the search for the ground state of Eq. (1) is performed at T = 0, by supplementing the target Hamiltonian with aquantum mechanical kinetic (or “driving”) term,ˆ H ( t ) = ˆ H target + f ( t ) ˆ H D , (4) Figure 2. Variational neural annealing protocols. (a) The variational classical annealing (VCA) algorithm steps. A warm-upstep brings the initialized variational state (green dot) close to the minimum of the free energy (cyan dot) at a given value ofthe order parameter M . This step is followed by an annealing and a training step that brings the variational state back to thenew free energy minimum. Repeating the last two steps until T ( t = 1) = 0 (red dots) produces approximate solutions to H target if the protocol is conducted slowly enough. This schematic illustration corresponds to annealing through a continuous phasetransition with an order parameter M . (b) Variational quantum annealing (VQA). VQA includes a warm-up step, followed byan annealing and a training step, which brings the variational energy (green dot) closer to the new a ground state energy (cyandot). We loop over the previous two steps until reaching the target ground state of ˆ H target (red dot) if annealing is performedslowly enough. where H target in Eq. (1) is promoted to a quantum me-chanical Hamiltonian ˆ H target .Quantum annealing algorithms typically start with adominant driving term ˆ H D (cid:29) ˆ H target chosen so thatthe ground state of ˆ H (0) is easy to prepare. When thestrength of the driving term is subsequently reduced (typ-ically adiabatically) using a schedule function f ( t ), thesystem is annealed to the ground state of ˆ H target . In anal-ogy to its thermal counterpart, SQA emulates this pro-cess on classical computers using quantum Monte Carlomethods [11].Here, we leverage the variational principle of quantummechanics and devise a strategy that emulates quan-tum annealing variationally. We dub our second vari-ational neural annealing algorithm variational quantumannealing (VQA). The latter is based on the variationalMonte Carlo (VMC) algorithm, whose goal is to simu-late the equilibrium properties of quantum systems atzero temperature (see Methods Sec. V C). In VMC, theground state of a Hamiltonian ˆ H is modeled through anansatz | Ψ λ (cid:105) endowed with parameters λ . The varia-tional principle guarantees that the energy (cid:104) Ψ λ | ˆ H | Ψ λ (cid:105) is an upper bound to the ground state energy of ˆ H ,which we use to deﬁne a time-dependent objective func-tion E ( λ , t ) ≡ (cid:104) ˆ H ( t ) (cid:105) λ = (cid:104) Ψ λ | ˆ H ( t ) | Ψ λ (cid:105) to optimize theparameters λ .The VQA setup, graphically summarized in Fig. 2(b), applies N warmup gradient descent steps to minimize E ( λ , t = 0), which brings | Ψ λ (cid:105) close to the ground stateof ˆ H (0). Setting t = δt while keeping the parameters λ ﬁxed results in a variational energy E ( λ , t = δt ).A set of N train gradient descent steps bring the ansatzcloser to the new instantaneous ground state, which re-sults in a variational energy E ( λ , t = δt ). The vari-ational parameters optimized at time step t are reusedat time t + δt , which promotes the computational adi-abaticity of the protocol (see Appendix. A). We repeatthe annealing and training steps N annealing times on alinear schedule ( f ( t ) = 1 − t with t ∈ [0 , t = 1,at which point the system should solve the optimizationproblem (red dot in Fig. 2(b)). We note that in our sim-ulations, no training steps are taken at t = 1. Finally,similarly to VCA, we choose normalized RNN wave func-tions [20, 21] as ans¨atze, giving the VQA algorithm accessto exact Monte Carlo samples.To gain theoretical insight on the principles behind asuccessful VQA simulation, we derive a variational ver-sion of the adiabatic theorem [26]. Starting from a set ofassumptions, such as the convexity of the energy land-scape in the warm-up phase and close to convergenceduring annealing, as well as the absence of noise in theenergy gradients, we provide a bound on the total numberof gradient descent steps N steps that guarantees the adia-baticity of the VQA algorithm as well as a success proba-bility of solving the optimization problem P success > − (cid:15) .Here, (cid:15) is an upper bound on the overlap between thevariational wave function and the excited states of theHamiltonian ˆ H ( t ), i.e., |(cid:104) Ψ ⊥ ( t ) | Ψ λ (cid:105)| < (cid:15) . We show that N steps can be bounded as (see Appendix. B): O  poly( N ) (cid:15) min { t n } ( g ( t n ))  ≤ N steps ≤ O  poly( N ) (cid:15) min { t n } ( g ( t n ))  . (5)The function g ( t ) is the energy gap between the ﬁrstexcited state and the ground state of the instantaneousHamiltonian ˆ H ( t ), N is the system size, and the set oftimes { t n } is deﬁned in Appendix. B. As expected forhard optimization problems, the minimum gap typicallydecreases exponentially with system size N , which dom-inates the computational complexity of a VQA simula-tion, but in cases where the minimum gap scales as theinverse of a polynomial in N , then the number of steps N steps is also polynomial in N . III. RESULTSA. Annealing on random Ising chains

We now proceed to evaluate the power of VCA andVQA. As a ﬁrst benchmark, we consider the task of solv-ing for the ground state the one-dimensional (1D) IsingHamiltonian with random couplings J i,i +1 , H target = − N − (cid:88) i =1 J i,i +1 σ i σ i +1 . (6)First, we examine J i,i +1 sampled from a uniform dis-tribution in the interval [0 , E G = − (cid:80) N − i =1 J i,i +1 [27].We use a tensorized RNN ansatz without weight shar-ing for both VCA and VQA (see Methods Sec. V A).We consider system sizes N = 32 , ,

128 and N train = 5,which suﬃces to achieve accurate solutions. For VQA, weuse a one-body driving term ˆ H D = − Γ (cid:80) Ni =1 ˆ σ xi , whereˆ σ x,y,zi are Pauli matrices acting on site i . To quantifythe performance of the algorithms, we use the residualenergy [11], (cid:15) res = (cid:2) (cid:104) H target (cid:105) av − E G (cid:3) dis , (7)where E G is the exact ground state energy of H target . Weuse the arithmetic mean for statistical averages (cid:104) . . . (cid:105) av over samples from the models. For VCA it means that (cid:104) H target (cid:105) av ≈ (cid:104) H target (cid:105) λ , while for VQA the target Hamil-tonian is promoted to ˆ H target = − (cid:80) N − i =1 J i,i +1 ˆ σ zi ˆ σ zi +1 and (cid:104) H target (cid:105) av ≈ (cid:104) ˆ H target (cid:105) λ . We consider the typical(geometric) mean for averaging over instances of the tar-get Hamiltonian, i.e., (cid:2) ... (cid:3) dis = exp( (cid:104) ln( ... ) (cid:105) av ). The aver-age in the argument of the exponential stands for arith-metic mean over diﬀerent realizations of the couplings. N annealing ✏ r e s / N (a) VQA (N = 32) / /t . ± . VQA (N = 64) / /t . ± . VQA (N = 128) / /t . ± . VCA (N = 32) / /t . ± . VCA (N = 64) / /t . ± . VCA (N = 128) / /t . ± . N annealing ✏ r e s / N (b) VQA (N = 32) / /t . ± . VQA (N = 64) / /t . ± . VQA (N = 128) / /t . ± . VCA (N = 32) / /t . ± . VCA (N = 64) / /t . ± . VCA (N = 128) / /t . ± . Figure 3. Variational neural annealing on a random Isingchain. Here we represent the residual energy per site (cid:15) res /N vs the number of annealing steps N annealing for both VQA andVCA. The system sizes are N = 32 , , J i,i +1 ∈ [0 ,

1) (see text for more details).The error bars represent the one s.d. statistical uncertaintycalculated over diﬀerent disorder realizations [28].

We take advantage of the autoregressive nature of theRNN and sample 10 conﬁgurations at the end of theannealing, which allows us to accurately estimate themodel’s arithmetic mean. The typical mean is taken over25 instances of H target .In Fig. 3 we report the residual energies per site againstthe number of annealing steps N annealing . As expected,the residual energy is a decreasing function of N annealing ,which underlines the importance of adiabaticity and an-nealing in our setting. In our examples, we observe thatthe decrease of the residual energy of VCA and VQA isconsistent with a power-law decay for a large number ofannealing steps. Whereas VCA’s decay exponent is in theinterval 1 . − .

9, the VQA exponent is about 0 . − . J i,i +1 uniformly sampled fromthe discrete set {− , +1 } , are provided in Appendix. A. B. Edwards-Anderson model

We now consider the two-dimensional (2D) Edwards-Anderson (EA) model, which is a prototypical spin glassarranged on a square lattice with nearest neighbor ran-dom interactions. The problem of ﬁnding ground statesof the model has been studied experimentally [12] andnumerically [11] from the annealing perspective, as wellas theoretically [2] from the computational complexityperspective. The EA model with open boundary condi-tions is given by H target = − (cid:88) (cid:104) i,j (cid:105) J ij σ i σ j , (8)where (cid:104) i, j (cid:105) denote nearest neighbors. The couplings J ij are drawn from a uniform distribution in the interval[ − , H D = − Γ (cid:80) Ni =1 ˆ σ xi . Fig. 4(a) shows the annealing re-sults obtained on a system size N = 10 ×

10 spins. VCAoutperforms VQA and in the adiabatic, long-time anneal-ing regime, it produces solutions three orders of magni-tude more accurate on average than VQA. In addition, weinvestigate the performance of VQA supplemented witha ﬁctitious Shannon information entropy [21] term thatmimics thermal relaxation eﬀects observed in quantumannealing hardware [31]. This form of regularized VQA,here labelled (RVQA), is described by a pseudo free en-ergy cost function ˜ F λ ( t ) = (cid:104) ˆ H ( t ) (cid:105) λ − T ( t ) S classical ( | Ψ λ | ).As in VCA, the pseudo entropy term S classical ( | Ψ λ | ) at f (1) = 0 provides a heuristic approach to count the num-ber of solutions to H target for VQA and RVQA. The re-sults in Fig. 4(a) do show an amelioration of the VQAperformance, including changing a saturating dynamicsat large N annealing to a power-law like behavior. How-ever, it appears to be insuﬃcient to compete with theVCA scaling (see exponents in Fig. 4(a)). This observa-tion suggests the superiority of a thermally driven varia-tional emulation of annealing over a purely quantum onefor this example.To further scrutinize the relevance of the annealingeﬀects in VCA, we also consider VCA with zero ther-mal ﬂuctuations, i.e., setting T = 0. Because of itsintimate relation to the classical-quantum optimization(CQO) methods of Refs. [32–34], we refer to this settingas CQO. Fig. 4(a) shows that CQO takes about 10 train-ing steps to reach accuracies nearing 1%. The accuracydoes not further improve upon additional training up to10 gradient steps, which indicates that CQO is proneto getting stuck in local minima. In comparison, VCAand VQA oﬀer solutions orders of magnitude more ac- d h , number of training steps N train , gradient descent optimizer, number of samples,etc), which may open up avenues to boost the perfor-mance of our algorithms. For reproducibility purposes,Appendix. D provides a summary of the hyperparametersused to produce the results shown here. B. Edwards-Anderson model

We now consider the two-dimensional Edwards-Anderson (EA) model, which is a prototypical spin-glassmodel where a set of spins are arranged on a squarelattice with nearest neighbor random interactions. Theproblem of ﬁnding ground states of the model has beenstudied experimentally [76] and numerically [55, 56, 68]from the annealing perspective, as well as theoretically [2]from the computational complexity perspective. In thissection, we use the EA model as a benchmark to fur-ther probe VCA and VQA, and compare them againststandard heuristics, namely, SA and SQA implementedvia discrete-time path-integral Monte Carlo [55, 68]. TheEA model is given byˆ H EA = X h i,j i J ij ˆ zi ˆ zj , (8)where the sum runs over nearest neighbors, and the cou-plings J ij are drawn independently from a uniform dis-tribution in the range [ , J ij , we usethe spin-glass server [77] to obtain the exact ground stateenergy. This feature makes the EA model an ideal bench-mark for our method, particularly for large system sizes.To simulate our variational neural annealing protocols,we use a 2D tensorized RNN (see Methods Sec. V B) as anansatz without weight sharing. We implement the meth-ods described in Sec. II and ?? with VQA implementedusing a one-body driving term. Fig. 3 shows the anneal-ing results obtained on a system size N = 10 ⇥

10 spins.As for the random Ising chains in Sec. III A, VCA out-performs VQA and in the adiabatic, long-time annealingregime, VCA produces solutions three orders of magni-tude more accurate than VQA. In addition, we investi-gate the performance of VQA supplemented with a ﬁcti-tious Shannon information entropy [47] term that mimicsthermal relaxation e↵ects observed in quantum anneal-ing hardware [78] and induces a thermal-like explorationof the energy landscape during the quantum annealingemulation. This form of regularized variational quan-tum annealing (RVQA) is described by a free energy costfunction:˜ F ( t ) = h ˆ H ( t ) i T ( t ) S classical ( | ( t ) | ) . (9) N annealing ✏ r e s / N N steps CQOVQARVQA /t . ± . VCA /t . ± . CQOVQARVQA /t . ± . VCA /t . ± . Figure 3. A comparison between VCA, VQA, RVQA, andCQO for Edwards-Anderson (EA) on a 10 ⇥

10 lattice. Theresidual energy per site vs. N annealing for VCA, VQA andRVQA. For CQO, we report the residual energy per site vs.the number of optimization steps N steps . While the results in Fig. 3 do show an amelioration ofthe VQA performance, including changing a saturatingdynamics at long annealing time to a power-law like be-havior, it appears to be insucient to compete with theVCA scaling. This suggests the superiority of a thermallydriven variational emulation of annealing over a quantumone.To further scrutinize the relevance of the annealing ef-fects in VCA, we also consider VCA with zero thermalﬂuctuations, i.e., setting T = 0. Because of its intimaterelation to the classical-quantum optimization methodsof Ref. 51, 79, and 80, we call this setting CQO. Fig. 3shows that CQO takes about 10 training steps start-ing from random parameters initialization to reach closeto 1% accuracy. The accuracy does not further improvewhen trained up to 10 gradient steps, indicating that theCQO limit of VCA is prone to getting stuck in local min-ima. In comparison, VCA and VQA o↵er solutions ordersof magnitude more accurate at long annealing times, sug-gesting the importance of the annealing e↵ect in tacklingoptimization problems.Since VCA displays the best performance in the pre-vious benchmarks, we use it to demonstrate its capabili-ties on a relatively large system with 40 ⇥

40 spins. Forcomparison, we use SA as well as SQA with P = 20 trot-ter slices, and take the average energy across all trotterslices, for each realization of randomness (see MethodsSec. V E). In addition, we average the energy obtainedafter 25 annealing runs on every instance of randomnessfor SA and SQA. To average over Hamiltonian instances,we use the typical mean over 25 di↵erent realizations forthe three annealing methods. The results are shown in N annealing ✏ r e s / N SASQAVCASASQAVCA

Figure 4. Comparison between Simulated Annealing (SA),Path-Integral Quantum Monte Carlo (SQA) with P = 20trotter slices, and VCA using a 2D tensorized pRNN state forthe EA model on a 40 ⇥

40 lattice. We report the residualenergy per site as a function of the number of annealing steps N annealing for SA, VCA and SQA. Fig. 4, where we present the residual energies per siteagainst the number of annealing steps N annealing , whichis set so that the speed of annealing is the same for SA,SQA and VCA. We ﬁrst note that our results conﬁrmthe qualitative behavior of SA and SQA in Refs. [55, 68].While at short annealing times SA and SQA producelower residual energy solutions than VCA, we observethat VCA achieves residual energies for large annealingtime about three orders of magnitude smaller than SQAand SA. Notably, the rate at which the residual energyimproves with increasing the annealing time is signiﬁ-cantly higher in VCA than SQA and SA even at rela-tively short annealing time. These observations highlightthe advantages of solving hard optimization problems ina variational space compared to SA and SQA paradigms. C. Fully-connected spin glasses

We now focus our attention on fully-connected spinglasses [2, 81]. We ﬁrst focus on the Sherrington-Kirkpatrick (SK) model [82], which provides a concep-tual framework for the understanding of the role of dis-order and frustration in widely diverse systems rangingfrom materials to combinatorial optimization and ma-chine learning. The combined e↵ect of disorder and long-range interactions in the SK model results in an energylandscape characterized by a hierarchy of valleys with anumber of local minima growing exponentially in the sys-tem size [81]. Together with the fact that many combina-torial NP-hard problems can be thought of as the task ofﬁnding a ground state of a densely connected spin glass,the properties above make fully connected spin glassesa suitable benchmark for heuristic optimization meth- ods [5]. The SK Hamiltonian ˆ H SK is given byˆ H SK = X i = j J ij p N ˆ zi ˆ zj , (10)where { J ij } is a symmetric matrix such that each matrixelement J ij is sampled from a gaussian distribution withmean 0 and variance 1.Since VCA performed best in our previous examples,we use it to ﬁnd ground states of the SK model for N =100 spins. Here, exact ground states energies of the SKmodel are calculated using the spin-glass server [77] ona total of 25 instances of disorder. To account for long-distance dependencies between spins in the SK model,we use a dilated RNN that has d log ( N ) e = 7 layers(see Methods Sec. V B) and we start the annealing at aninitial temperature T = 2. We compare our results withSA and SQA. For SQA, we start with an initial magneticﬁeld = 2, while for SA we use T = 2.To e↵ectively compare the three methods (i.e., SA,SQA, and VCA), we ﬁrst plot the residual energy persite as a function of N annealing for VCA, SA and SQA(with P = 100 trotter slices). Here, the SA and SQAresidual energies are obtained by averaging the outcomeof 50 independent annealing runs, while for VCA we av-erage the outcome of 10 exact samples from the an-nealed RNN. For all methods, we take the typical aver-age over 25 disorder instances. The results are shown inFig. 5(a). As observed in the EA model in Fig. 4, we notethat for fast annealing runs SA and SQA produce lowerresidual energy solutions than VCA, but we emphasizethat VCA delivers a lower residual energy compared toSQA and SA as the total annealing time increases past N annealing ⇠ . Likewise, we observe that the rate atwhich the residual energy improves with increasing thetotal annealing time is signiﬁcantly higher in VCA thanSQA and SA.A more detailed look at the statistical behaviour of themethods at long annealing times can be obtained fromthe residual energy histograms separately produced byeach method, as shown in Fig. 5(e). For each instance { J ij } after the end of annealing, we represent the ob-tained residual energies in a histogram form. For thethree methods, we extract 10 residual energies for eachdisorder realization. Here, we observe that VCA is supe-rior to SA and SQA, as it produces a higher density oflow residual energies. This indicates that, even thoughVCA typically takes more annealing steps, it ultimatelyresults in a higher chance of getting more accurate solu-tions to optimization problems than their SA and SQAcounterparts.We now focus on the Wishart planted ensemble(WPE), which is a class of zero-ﬁeld Ising models with aﬁrst-order phase transition and tunable algorithmic hard-ness [83]. These problems belong to a special class of hardproblem ensembles whose solutions are known to the con-structor, which, together with the tunability of the hard-ness, makes the WPE model an ideal tool to benchmark ab Figure 4. Benchmarking the two-dimensional Edwards-Anderson spin glass. (a) A comparison between VCA, VQA,RVQA, and CQO on a 10 ×

10 lattice by plotting the resid-ual energy per site vs N annealing . For CQO, we report theresidual energy per site vs the number of optimization steps N steps . (b) Comparison between SA, SQA with P = 20 trot-ter slices, and VCA using a 2D tensorized RNN ansatz on a40 ×

40 lattice. The annealing speed is the same for SA, SQAand VCA. curate on average for a large number of annealing steps,highlighting the importance of annealing in tackling op-timization problems.Since VCA displays the best performance in the pre-vious benchmarks, we use it to demonstrate its capa-bilities on a 40 ×

40 spin system. For comparison, weuse SA as well as SQA. The SQA simulation uses thepath-integral Monte Carlo method [11] with P = 20 trot-ter slices, and we report averages over energies acrossall trotter slices, for each realization of randomness (seeMethods Sec. V D). In addition, we average the energyobtained after 25 annealing runs on every instance of ran-domness for SA and SQA. To average over Hamiltonianinstances, we use the typical mean over 25 diﬀerent re-alizations for the three annealing methods. The resultsare shown in Fig. 4(b), where we present the residualenergies per site against the number of annealing steps N annealing , which is set so that the speed of annealing isthe same for SA, SQA and VCA. We ﬁrst note that ourresults conﬁrm the qualitative behavior of SA and SQAin Refs. [11, 35]. While SA and SQA produce lower resid-ual energy solutions than VCA for small N annealing , weobserve that VCA achieves residual energies about threeorders of magnitude smaller than SQA and SA for a largenumber of annealing steps. Notably, the rate at which theresidual energy improves with increasing N annealing is sig-niﬁcantly higher for VCA compared to SQA and SA evenat relatively small number of annealing steps. C. Fully-connected spin glasses

We now focus our attention on fully-connected spinglasses [2, 36]. We ﬁrst focus on the Sherrington-Kirkpatrick (SK) model [37], which provides a concep-tual framework for the understanding of the role of dis-order and frustration in widely diverse systems rangingfrom materials to combinatorial optimization and ma-chine learning. The SK Hamiltonian is given by H target = − (cid:88) i (cid:54) = j J ij √ N σ i σ j , (9)where { J ij } is a symmetric matrix such that each matrixelement J ij is sampled from a gaussian distribution withmean 0 and variance 1.Since VCA performed best in our previous examples,we use it to ﬁnd ground states of the SK model for N =100 spins. Here, exact ground states energies of the SKmodel are calculated using the spin-glass server [30] ona total of 25 instances of disorder. To account for long-distance dependencies between spins in the SK model, weuse a dilated RNN ansatz that has (cid:100) log ( N ) (cid:101) = 7 layers(see Methods Sec. V A) and set the initial temperature T = 2. We compare our results with SA and SQA. ForSQA, we start with an initial magnetic ﬁeld Γ = 2, whilefor SA we use T = 2.For an eﬀective comparison, we ﬁrst plot the resid-ual energy per site as a function of N annealing for VCA,SA and SQA (with P = 100 trotter slices). Here, theSA and SQA residual energies are obtained by averag-ing the outcome of 50 independent annealing runs, whilefor VCA we average the outcome of 10 exact samplesfrom the annealed RNN. For all methods, we take thetypical average over 25 disorder instances. The resultsare shown in Fig. 5(a). As observed in the EA model,we note that SA and SQA produce lower residual energysolutions than VCA for small N annealing , but we empha-size that VCA delivers a lower residual energy comparedto SQA and SA as the total number of annealing stepsincreases past N annealing ∼ . Likewise, we observethat the rate at which the residual energy improves withincreasing N annealing is signiﬁcantly higher for VCA incomparison to SQA and SA. A more detailed look at the statistical behaviour ofthe methods at large N annealing can be obtained from theresidual energy histograms separately produced by eachmethod, as shown in Fig. 5(d). The histograms contain1000 residual energies for each of the same 25 disorderrealizations. For each instance, we plot results for 1000SA runs, 1000 samples obtained from the RNN at theend of annealing for VCA, and 10 SQA runs includingcontribution from each of the P = 100 Trotter slices.We observe that VCA is superior to SA and SQA, as itproduces a higher density of low energy conﬁgurations.This indicates that, even though VCA typically takesmore annealing steps, it ultimately results in a higherchance of getting more accurate solutions to optimizationproblems than SA and SQA. Note that for the SK model,the SQA histogram remain quantitatively the same for200 runs, and we report data of 10 runs only for fairnesspurposes compared to both SA and VCA.We now focus on the Wishart planted ensemble(WPE), which is a class of zero-ﬁeld Ising models with aﬁrst-order phase transition and tunable algorithmic hard-ness [38]. These problems belong to a special class ofhard problem ensembles whose solutions are known a pri-ori, which, together with the tunability of the hardness,makes the WPE model an ideal tool to benchmark heuris-tic algorithms for optimization problems. The Hamilto-nian of the WPE model is deﬁned as H target = − (cid:88) i (cid:54) = j J αij σ i σ j . (10)Here J αij is a symmetric matrix satisfying J α = ˜ J α − diag( ˜ J )and ˜ J α = − N W α W T α . The term W α is an N × (cid:98) αN (cid:99) random matrix satisfy-ing W α t ferro = 0 where t ferro = (+1 , +1 , ..., +1) is theferromagnetic state (see Ref. [38] for details about thegeneration of W α ). The ground state of the WPE modelis known (i.e., it is planted) and corresponds to the ferro-magnetic states ± t ferro . Interestingly, α is a tunable pa-rameter of hardness, where for α < N = 32 and α ∈ { . , . } .We consider 25 instances of the couplings { J αij } andattempt to solve the model with VCA implemented usinga dilated RNN ansatz with (cid:100) log ( N ) (cid:101) = 5 layers and aninitial temperature T = 1. For SQA ( P = 100 trotter Figure 5. Benchmarking SA, SQA ( P = 100 trotter slices) and VCA on the Sherrington-Kirkpatrick (SK) model and theWishart planted ensemble (WPE). Panels (a),(b), and (c) display the residual energy per site as a function of N annealing . (a)The SK model with N = 100 spins. (b) WPE with N = 32 spins and α = 0 .

5. (c) WPE with N = 32 spins and α = 0 . − for (cid:15) res /N , which is within our numerical accuracy. slices), we use an initial magnetic ﬁeld Γ = 1, and forSA we start with T = 1.We ﬁrst plot the scaling of residual energies per site (cid:15) res /N as shown in Figs. 5(b) and (c). Here we note thatVCA is superior to SA and SQA for α = 0 . α = 0 .

25 in Fig. 5(c), VCA is competitive whereit achieves a similar performance compared to SA andSQA on average for a large number of annealing steps.We also represent the residual energies in a histogramform. We observe that for α = 0 . (cid:15) res /N ∼ − -10 − compared to SA and SQA. For α = 0 .

25 in Fig. 5(f), VCA leads to a non-negligibledensity at very low residual energies as opposed to SAand SQA, whose solutions display residual energies or-ders of magnitude higher. Finally, our WPE simulationssupport the observation that VCA tends to improve thequality of solutions faster than SQA and SA for a largenumber of annealing steps.

IV. CONCLUSIONS AND OUTLOOK

In conclusion, we have introduced a strategy to com-bat the slow sampling dynamics encountered by simu-lated annealing when an optimization landscape is roughor glassy. Based on annealing the variational parametersof a generalized target distribution, our scheme — whichwe dub variational neural annealing — takes advantageof the power of modern autoregressive models, which canbe exactly sampled without slow dynamics even whena rough landscape is encountered. We implement varia-tional neural annealing parameterized by a recurrent neu-ral network, and compare its performance to conventionalsimulated annealing on prototypical spin glass Hamiltoni-ans known to have landscapes of varying roughness. Weﬁnd that variational neural annealing produces accuratesolutions to all of the optimization problems considered,including spin glass Hamiltonians where our techniquestypically reach solutions orders of magnitude more accu-rate on average than conventional simulated annealing inthe limit of a large number of annealing steps.We emphasize that several hyperparameters, model,hardware, and variational objective function choices canbe explored and may improve our methodologies. Wehave utilized a simple annealing schedule in our protocolsand highlight that reinforcement learning can be used toimprove it [39]. A critical insight gleaned from our exper-iments is that certain neural network architectures weremore eﬃcient on speciﬁc Hamiltonians. Thus, a natu-ral direction is to study the intimate relation betweenthe model architecture and the problem Hamiltonian,where we envision that symmetries and domain knowl-edge would guide the design of models and algorithms.As we witness the unfolding of a new age for opti-mization powered by deep learning [40], we anticipatea rapid adoption of machine learning techniques in thespace of combinatorial optimization, as well as antici-pate domain-speciﬁc applications of our ideas in diversetechnological and scientiﬁc areas related to physics, biol-ogy, health care, economy, transportation, manufactur-ing, supply chain, hardware design, computing and in-formation technology, among others.

V. METHODSA. Recurrent Neural Network Ans¨atze

Recurrent neural networks model complex probabilitydistributions p by taking advantage of the chain rule p ( σ ) = p ( σ ) p ( σ | σ ) · · · p ( σ N | σ N − , . . . , σ , σ ) , (11)where specifying every conditional probability p ( σ i | σ

Training step F (c) Tensorized RNNDilated RNN abc

Figure 6. (a) An illustration of a 1D RNN: at each site n , theRNN cell denoted by the green box, receives a hidden state h n − and the one-hot spin vector σ n − , to generate a newhidden state h n that is fed into a Softmax layer (denoted bya magenta circle). (b) A graphical illustration of a 2D RNN.Each RNN cell receives two hidden states h i,j − and h i − ,j ,as well as two input vectors σ i,j − and σ i − ,j (not shown) asillustrated by the black arrows. The red arrows correspond tothe zigzag path we use for 2D autoregressive sampling. Theinitial memory state h of the RNN and the initial inputs σ (not shown) are null vectors. (c) An illustration of a dilatedRNN, where the distance between each two RNN cells growsexponentially with depth to account for long-term dependen-cies. We choose depth L = (cid:100) log ( N ) (cid:101) where N is the numberof spins. In our work, we choose the size of the hidden states h ( l ) n ,where l >

0, as constant and equal to d h . We also use anumber of layers L = (cid:100) log ( N ) (cid:101) , where N is the numberof spins and (cid:100) . . . (cid:101) is the ceiling function. This meansthat two spins are connected with a path whose length isbounded by O (log ( N )), which follows the spirit of themulti-scale renormalization ansatz [46]. For more detailson the advantage of dilated RNNs over tensorized RNNssee Appendix. D.We ﬁnally note that for all the RNN architectures inour work, we found accurate results using the exponentiallinear unit (ELU) activation function, deﬁned as:ELU( x ) = (cid:40) x, if x ≥ , exp( x ) − , if x < . B. Minimizing the variational free energy

To implement the variational classical annealing algo-rithm, we use the variational free energy F λ ( T ) = (cid:104) H target (cid:105) λ − T S classical ( p λ ) , (20)where the target Hamiltonian H target encodes the op-timization problem and T is the temperature. More-over, S classical is the entropy of the distribution p λ . Toestimate F λ ( T ) we take N s exact samples σ ( i ) ∼ p λ ( i = 1 , . . . , N s ) drawn from the RNN and evaluate F λ ( T ) ≈ N s N s (cid:88) i =1 F loc ( σ ( i ) ) , where the local free energy is F loc ( σ ) = H target ( σ ) + T log ( p λ ( σ )) [18]. Similarly, the gradients are given by ∂ λ F λ ( T ) ≈ N s N s (cid:88) i =1 ∂ λ log (cid:16) p λ (cid:16) σ ( i ) (cid:17)(cid:17) × (cid:16) F loc ( σ ( i ) ) − F λ ( T ) (cid:17) , where we subtract F λ ( T ) in order to reduce noise in thegradients [18, 20]. We note that this variational schemeexhibits a zero-variance principle, namely that the localfree energy variance per spin σ F ≡ var( { F loc ( σ ) } ) N , (21)becomes zero when p λ matches the Boltzmann distribu-tion, provided that mode collapse is avoided [18].The gradient updates are implemented using the Adamoptimizer [47]. Furthermore, the computational complex-ity of VCA for one gradient descent step is O ( N s × N × d h )for 1D RNNs and 2D RNNs (both vanilla and tensorizedversions) and O ( N s × N log( N ) × d h ) for dilated RNNs.Consequently, VCA has lower computational cost thanVQA, which is implemented using VMC (see MethodsSec. V C).0Finally, we note that in our implementations no train-ing steps are performed at the end of annealing for bothVCA and VQA. C. Variational Monte Carlo

The main goal of Variational Monte Carlo is to approx-imate the ground state of a Hamiltonian ˆ H through theiterative optimization of an ansatz wave function | Ψ λ (cid:105) .The VMC objective function is given by E ≡ (cid:104) Ψ λ | ˆ H | Ψ λ (cid:105)(cid:104) Ψ λ | Ψ λ (cid:105) . We note that an important class of stoquastic many-body Hamiltonians has ground states | Ψ (cid:105) with strictlyreal and positive amplitudes in the standard product spinbasis [48]. These ground states can be written down interms of probability distributions, | Ψ (cid:105) = (cid:88) σ Ψ( σ ) | σ (cid:105) = (cid:88) σ (cid:112) P ( σ ) | σ (cid:105) . (22)To approximate this family of states, we use an RNNwave function, namely Ψ λ ( σ ) = (cid:112) p λ ( σ ). Extensionsto complex-valued RNN wave functions are deﬁned inRef. [20], and results on their ability to simulate vari-ational quantum annealing of non-stoquastic Hamilto-nians [49] will be reported elsewhere [50]. These fami-lies of RNN states are normalized by construction (i.e., (cid:104) Ψ λ | Ψ λ (cid:105) = 1) and allow for accurate estimates of theenergy expectation value. By taking N s exact samples σ ( i ) ∼ p λ ( i = 1 , . . . , N s ), it follows that E ≈ N s N s (cid:88) i =1 E loc ( σ ( i ) ) . The local energy is given by E loc ( σ ) = (cid:88) σ (cid:48) H σσ (cid:48) Ψ λ ( σ (cid:48) )Ψ λ ( σ ) , (23)where the sum over σ (cid:48) is tractable when the Hamiltonianˆ H is local. Similarly, we can also estimate the energygradients as ∂ λ E = 2 N s N s (cid:88) i =1 ∂ λ log (cid:16) Ψ λ (cid:16) σ ( i ) (cid:17)(cid:17) (cid:16) E loc (cid:16) σ ( i ) (cid:17) − E (cid:17) . Here, we can subtract the term E in order to reduce noisein the stochastic estimation of our gradients without in-troducing a bias [20, 51]. In fact, when the ansatz is closeto an eigenstate of ˆ H , then E loc ( σ ) ≈ E , which meansthat the variance of gradients Var( ∂ λ j E ) ≈ λ j . We note that this is similar inspirit to the control variate methods in Monte Carlo andto the baseline methods in reinforcement learning [51]. Similarly to the minimization scheme of the variationalfree energy in Methods Sec. V B, VMC also exhibits azero-variance principle, where the energy variance perspin σ ≡ var( { E loc ( σ ) } ) N , (24)becomes zero when | Ψ λ (cid:105) matches an excited state of ˆ H ,which thanks to the minimization of the variational en-ergy E is likely to be the ground state | Ψ G (cid:105) .The gradients ∂ λ log (Ψ λ ( σ )) are numerically com-puted using automatic diﬀerentiation [52]. We use theAdam optimizer to perform gradient descent updates,with a learning rate η , to optimize the variational param-eters λ of the RNN wave function. We note that in thepresence of O ( N ) non-diagonal elements in a Hamilto-nian ˆ H , the local energies E loc ( σ ) have O ( N ) terms (seeEq. (23)). Thus, the computational complexity of onegradient descent step is O ( N s × N × d h ) for 1D RNNsand 2D RNNs (both vanilla and tensorized versions). D. Simulated Quantum Annealing and SimulatedAnnealing

Simulated Quantum Annealing is a standard quantum-inspired classical technique that has traditionally beenused to benchmark the behavior of quantum anneal-ers [24]. It is usually implemented via the path-integralMonte Carlo method [11], a QMC method that simu-lates equilibrium properties of quantum systems at ﬁnitetemperature. To illustrate this method, consider a D -dimensional time-dependent quantum Hamiltonianˆ H ( t ) = − (cid:88) i,j J ij ˆ σ zi ˆ σ zj − Γ( t ) N (cid:88) i =1 ˆ σ xi , where Γ( t ) = Γ (1 − t ) controls the strength of the quan-tum annealing dynamics at a time t ∈ [0 , Z = Tr exp {− β ˆ H ( t ) } , (25)with the inverse temperature β = T , we can map the D -dimensional quantum Hamiltonian onto a ( D + 1) clas-sical system consisting of P coupled replicas (Trotterslices) of the original system H D +1 ( t ) = − P (cid:88) k =1 (cid:88) i,j J ij σ ki σ kj + J ⊥ ( t ) N (cid:88) i =1 σ ki σ k +1 i  , (26)where σ ki is the classical spin at site i and replica k . Theterm J ⊥ ( t ) corresponds to uniform coupling between σ ki and σ k +1 i for each site i , such that J ⊥ ( t ) = − P T (cid:18) tanh (cid:18) Γ( t ) P T (cid:19)(cid:19) . σ P +1 ≡ σ arise because of the trace in Eq. (25).Interestingly, we can approximate Z with an eﬀectivepartition function Z p at temperature P T given by [35]: Z p ∝ Tr exp (cid:26) − H D+1 ( t ) P T (cid:27) , which can now be simulated with a standard Metropolis-Hastings Monte Carlo algorithm. A key element to thisalgorithm is the energy diﬀerence induced by a single spinﬂip at site σ ki , which is equal to∆ i E local = 2 (cid:88) j J ij σ ki σ kj + 2 J ⊥ ( t ) (cid:0) σ k − i σ ki + σ ki σ k +1 i (cid:1) . Here, the second term encodes the quantum dynamics. Inour simulations we consider single spin ﬂip (local) movesapplied to all sites in all slices. We can also perform aglobal move [35], which means ﬂipping a spin at location i in every slice k . Clearly this has no impact on theterm dependent on J ⊥ , because it contains only termsquadratic in the ﬂipped spin, so that∆ i E global = 2 P (cid:88) k =1 (cid:88) j J ij σ ki σ kj . In summary, a single Monte Carlo step (MCS) consistsof ﬁrst performing a single local move on all sites in each k -th slice and on all slices, followed by a global move forall sites. For the SK model and the WPE model studiedin this paper, we use P = 100, whereas for the EA modelwe use P = 20 similarly to Ref. [11]. Before startingthe quantum annealing schedule, we ﬁrst thermalize thesystem by performing SA [35] from a temperature T = 3to a ﬁnal temperature 1 /P (so that P T = 1). This isdone in 60 steps, where at each temperature we perform100 Metropolis moves on each site. We then performSQA using a linear schedule that decreases the ﬁeld fromΓ to a ﬁnal value close to zero Γ( t = 1) = 10 − , whereﬁve local and global moves are performed for each valueof the magnetic ﬁeld Γ( t ), so that it is consistent with thechoice of N train = 5 for VCA (see Sec. II and III A). Thus,the number of MCS is equal to ﬁve times the number ofannealing steps.For the standalone SA, we decrease the temperaturefrom T to T ( t = 1) = 10 − . Here, a single MCS consistsof a Monte Carlo sweep, i.e., attempting a spin-ﬂip for allsites. For each thermal annealing step, we perform ﬁveMCS, and hence similar to SQA, the number of MCS isequal to ﬁves times the number of annealing steps. Fur-thermore, we do a warm-up step for SA, by performing N warmup MCS to equilibrate the Markov Chain at theinitial temperature T and to provide a consistent choicewith VCA (see Sec. II). ACKNOWLEDGMENTS

We acknowledge Jack Raymond for suggesting to usethe Wishart Planted Ensemble as a benchmark for ourvariational annealing setup. We also thank ChristopherRoth, Cunlu Zhou, Martin Ganahl and Giuseppe Santorofor fruitful discussions. We are also grateful to LaurenHayward for providing her plotting code to produce ourﬁgures using Matplotlib library. Our RNN implementa-tion is based on Tensorﬂow and NumPy. We acknowledgesupport from the Natural Sciences and Engineering Re-search Council (NSERC), a Canada Research Chair, theShared Hierarchical Academic Research Computing Net-work (SHARCNET), Compute Canada, Google Quan-tum Research Award, and the Canadian Institute forAdvanced Research (CIFAR) AI chair program. Re-sources used in preparing this research were provided,in part, by the Province of Ontario, the Government ofCanada through CIFAR, and companies sponsoring theVector Institute .Research at Perimeter Institute is supported in part bythe Government of Canada through the Department ofInnovation, Science and Economic Development Canadaand by the Province of Ontario through the Ministry ofEconomic Development, Job Creation and Trade.

Appendix A: Numerical proof of principle ofadiabaticity

As demonstrated in Sec. III, we have shown that bothVQA and VCA are eﬀective at ﬁnding the classicalground state of disordered spin chains. Here, we fur-ther illustrate the adiabaticity of both VQA and VCA.First, we perform VQA on the uniform ferromagneticIsing chain (i.e., J i,i +1 = 1) with N = 20 spins andopen boundary conditions with an initial transverse ﬁeldΓ = 2. Here, we use a tensorized RNN wave func-tion with weight sharing across sites of the chain. Wealso choose N annealing = 1024. In Fig. 7(a), we showthat the variational energy tracks the exact ground en-ergy throughout the annealing process with high accu-racy. We also observe that optimizing an RNN wavefunction from scratch, i.e., randomly reinitializing theparameters of the model at each new value of the trans-verse magnetic ﬁeld is not optimal. This observation un-derlines the importance of transferring the parameters ofour wave function ansatz after each annealing step. Fur-thermore, in Fig. 7(b) we illustrate that the RNN wavefunction’s residual energy is much lower compared to thegap throughout the annealing process, which shows thatVQA remains adiabatic for a large number of annealingsteps.Similarly, in Fig. 7(c) we perform VCA with an initialtemperature T = 2 on the same model, the same systemsize, the same ansatz, and the same number of annealingsteps. We see an excellent agreement between the RNNwave function free energy and the exact free energy, high-2 . . . . . h ˆ H i Random parametersTransferred parametersExact energy . . . . . . . . . . ✏ r e s GapRNN residual energy . . . . . T F ( T ) Random parametersTransferred parametersExact Free energy abc

Figure 7. Numerical evidence of adiabaticity on the uniformIsing chain with N = 20 spins for VQA in panels (a) and(b) and VCA in panel (c). (a) Variational energy of RNNwave function against the transverse magnetic ﬁeld Γ, with λ initialized using the parameters optimized in the previous an-nealing step (transferred parameters, green curve) and withrandom parameter reinitialization (random parameters, pur-ple curve). These strategies are compared with the exact en-ergy obtained from exact diagonalization (dashed black line).(b) Residual energy of the RNN wave function vs the trans-verse ﬁeld Γ. Throughout annealing with VQA, the resid-ual energy is always much smaller than the gap within errorbars. (c) Variational free energy vs temperature T for a VCArun with λ initialized using the parameters optimized in theprevious annealing step (transferred parameters, purple line)and with random reinitialization (random parameters, orangecurve). N annealing ✏ r e s / N (a) VQA (N = 32) / /t . ± . VQA (N = 64) / /t . ± . VQA (N = 128) / /t . ± . VCA (N = 32) / /t . ± . VCA (N = 64) / /t . ± . VCA (N = 128) / /t . ± . N annealing ✏ r e s / N (b) VQA (N = 32) / /t . ± . VQA (N = 64) / /t . ± . VQA (N = 128) / /t . ± . VCA (N = 32) / /t . ± . VCA (N = 64) / /t . ± . VCA (N = 128) / /t . ± . Figure 8. Variational annealing on random Ising chains,where we represent the residual energy per site (cid:15) res /N vs N annealing for both VQA and VCA. The system sizes are N = 32 , ,

128 and we use random discrete couplings J i,i +1 ∈{− , } . lighting once again the adiabaticity of our emulation ofclassical annealing, as well as the importance of trans-ferring the parameters of our ansatz after each annealingstep. Taken all together, the results in Fig. 7 support thenotion that VQA and VCA evolutions can be adiabatic.In Fig. 8 we report the residual energies per site againstthe number of annealing steps N annealing . Here, weconsider J i,i +1 uniformly sampled from the discrete set {− , +1 } , where the ground state conﬁguration is dis-ordered and the ground state energy is given by E G = − (cid:80) N − i =1 | J i,i +1 | = − ( N − . − . Appendix B: The variational adiabatic theorem

In this section, we derive a suﬃcient condition for thenumber of gradient descent steps needed to maintain thevariational ansatz close to the instantaneous ground statethroughout the VQA simulation. First, consider a vari-ational wave function | Ψ λ (cid:105) and the following the time-dependent Hamiltonian:ˆ H ( t ) = ˆ H target + f ( t ) ˆ H D , The goal is to ﬁnd the ground state of the targetHamiltonian ˆ H target by introducing quantum ﬂuctuationsthrough a driving Hamiltonian ˆ H D , where ˆ H D (cid:29) ˆ H target .Here f ( t ) is a decreasing schedule function such that f (0) = 1, f (1) = 0 and t ∈ [0 , E ( λ , t ) = (cid:104) Ψ λ | ˆ H ( t ) | Ψ λ (cid:105) , and E G ( t ) , E E ( t ) theinstantaneous ground/excited state energy of the Hamil-tonian ˆ H ( t ), respectively. The instantaneous energy gapis deﬁned as g ( t ) ≡ E E ( t ) − E G ( t ).To simplify our discussion, we consider the case of atarget Hamiltonian that has a non-degenerate groundstate. Here, we decompose the variational wave functionas: | Ψ λ (cid:105) = (1 − a ( t )) | Ψ G ( t ) (cid:105) + a ( t ) | Ψ ⊥ ( t ) (cid:105) , (B1)where | Ψ G ( t ) (cid:105) is the instantaneous ground state and | Ψ ⊥ ( t ) (cid:105) is a superposition of all the instantaneous excitedstates. From this decomposition, one can show that [56]: a ( t ) ≤ E ( λ , t ) − E G ( t ) g ( t ) . (B2)As a consequence, in order to satisfy adiabaticity, i.e., | (cid:104) Ψ ⊥ ( t ) | Ψ λ (cid:105) | (cid:28) t , then one should have a ( t ) < (cid:15) (cid:28) (cid:15) is a small upper bound on theoverlap between the variational wave function and theexcited states. This means that the success probability P success of obtaining the ground state at t = 1 is boundedfrom below by 1 − (cid:15) . From Eq. (B2), to satisfy a ( t ) < (cid:15) ,it is suﬃcient to have: (cid:15) res ( λ , t ) ≡ E ( λ , t ) − E G ( t ) < (cid:15)g ( t ) . (B3)To satisfy the latter condition, we require a slightlystronger condition as follows: (cid:15) res ( λ , t ) < (cid:15)g ( t )2 . (B4)In our derivation of a suﬃcient condition on the numberof gradient descent steps to satisfy the previous require-ment, we use the following set of assumptions: • (A1) | ∂ kt E G ( t ) | , | ∂ kt g ( t ) | , | ∂ kt f ( t ) | ≤ O (poly( N )),for all 0 ≤ t ≤ k ∈ { , } . • (A2) |(cid:104) Ψ λ | ˆ H D | Ψ λ (cid:105)| ≤ O (poly( N )) for all possibleparameters λ of the variational wave function. • (A3) No anti-crossing during annealing, i.e., g ( t ) (cid:54) =0, for all 0 ≤ t ≤ • (A4) The gradients ∂ λ E ( λ , t ) can be calculatedexactly, are L ( t )-Lipschitz with respect to λ and L ( t ) ≤ O (poly( N )) for all 0 ≤ t ≤ • (A5) Local convexity, i.e., close to convergencewhen (cid:15) res ( λ , t ) < (cid:15)g ( t ), the energy landscape of E ( λ , t ) is convex with respect to λ , for all 0
Given the assumptions (A1) to (A8) , asuﬃcient (but not necessary) number of gradient descentsteps N steps to satisfy the condition (B4) during the VQAprotocol, is bounded as: O  poly( N ) (cid:15) min { t n } ( g ( t n ))  ≤ N steps ≤ O  poly( N ) (cid:15) min { t n } ( g ( t n ))  , where ( t , t , t , . . . ) is an increasing ﬁnite sequence oftime steps, satisfying t = 0 and t n +1 = t n + δt n , where δt n = O (cid:18) (cid:15)g ( t n )poly( N ) (cid:19) . Proof:

In order to satisfy the condition Eq. (B4) dur-ing the VQA protocol, we follow these steps: • Step 1 (warm-up step): we prepare our variationalwave function at the ground state at t = 0 suchthat Eq. (B4) is veriﬁed at time t = 0. • Step 2 (annealing step): we change time t by aninﬁnitesimal amount δt , so that the condition (B3)is veriﬁed at time t + δt . • Step 3 (training step): we tune the parameters ofthe variational wave function, using gradient de-scent, so that the condition (B4) is satisﬁed at time t + δt . • Step 4: we loop over steps 2 and 3 until we arrive at t = 1, where we expect to obtain the ground stateenergy of the target Hamiltonian.Let us ﬁrst start with step 2 assuming that step 1 isveriﬁed. In order to satisfy the requirement of this stepat time t , then δt has to be chosen small enough so that (cid:15) res ( λ t , t + δt ) < (cid:15)g ( t + δt ) (B5)is veriﬁed given that the condition (B4) is satisﬁed attime t . Here, λ t are the parameters of the variationalwave function that satisﬁes the condition (B4) at time t .To get a sense of how small δt should be, we do a Taylorexpansion, while ﬁxing the parameters λ t , to get: (cid:15) res ( λ t , t + δt )= (cid:15) res ( λ t , t ) + ∂ t (cid:15) res ( λ t , t ) δt + O (( δt ) ) ,< (cid:15)g ( t )2 + ∂ t (cid:15) res ( λ t , t ) δt + O (( δt ) ) , ∂ t (cid:15) res ( λ t , t ) = ∂ t f ( t ) (cid:104) ˆ H D (cid:105) − ∂ t E G ( t ). To satisfy the condition (B3) at time t + δt ,it is enough to have the right hand side of the previousinequality to be much smaller than the gap at t + δt , i.e., (cid:15)g ( t )2 + ∂ t (cid:15) res ( λ t , t ) δt + O (( δt ) ) < (cid:15)g ( t + δt ) . By Taylor expanding the gap, we get: ∂ t (cid:15) res ( λ t , t ) δt + O (( δt ) ) < (cid:15)g ( t )2 + (cid:15)∂ t g ( t ) δt + O (( δt ) ) , hence, it is enough to satisfy the following condition:( ∂ t (cid:15) res ( λ t , t ) − (cid:15)∂ t g ( t )) δt + O (( δt ) ) < (cid:15)g ( t )2 . (B6)Using the Taylor-Laplace formula, one can express theTaylor remainder term O (( δt ) ) as follows: O (( δt ) ) = (cid:90) t + δtt ( τ − t ) A ( τ )d τ, where A ( τ ) = ∂ τ (cid:15) res ( λ t , τ ) − (cid:15)∂ τ g ( τ ) = ∂ τ f ( τ ) (cid:104) ˆ H D (cid:105) − ∂ τ E G ( τ ) − (cid:15)∂ τ g ( τ ) and τ is between t and t + δt . Thelast expression can be bounded as follows: O (( δt ) ) ≤ (cid:90) t + δtt ( τ − t ) | A ( τ ) | d τ ≤ ( δt ) | A | ) . where “sup( | A | )” is the supremum of | A | over the interval[0 , (A1) and (A2) , then sup( | A | )is bounded from above by a polynomial in N , hence: O (( δt ) ) ≤ O (poly( N ))( δt ) ≤ O (poly( N )) δt, where the last inequality holds since δt ≤ t ∈ [0 , ∂ t (cid:15) res ( λ t , t ) − (cid:15)∂ t g ( t )) is also bounded fromabove by a polynomial in N (according to assumptions (A1) and (A2) ), then in order to satisfy Eq. (B6), it issuﬃcient to require the following condition: O (poly( N )) δt < (cid:15)g ( t )2 . Thus, it is suﬃcient to take: δt = O (cid:18) (cid:15)g ( t )poly( N ) (cid:19) . (B7)By taking account of assumption ( A3 ), δt can be takennon-zero for all time steps t . As a consequence, assumingthe condition (B7) is veriﬁed for a non-zero δt and asuitable O (1) prefactor, then the condition (B5) is alsoveriﬁed.We can now move to step 3. Here, we apply a numberof gradient descent steps N train ( t ) to ﬁnd a new set ofparameters λ t + δt such that: (cid:15) res ( λ t + δt , t + δt ) = E ( λ t + δt , t + δt ) − E G ( t + δt ) < (cid:15)g ( t + δt )2 , (B8) To estimate the scaling of the number of gradient descentsteps N train ( t ) needed to satisfy (B8), we make use ofassumptions (A4) and (A5) . The assumption (A5) isreasonable providing that the variational energy E ( λ t , t + δt ) is very close to the ground state energy E G ( t + δt ),as given by Eq. (B5). Using the above assumptions andassuming that the learning rate η ( t ) = 1 /L ( t ), we canuse a well-known result in convex optimization [57](seeSec. 2.1.5), which states the following inequality: E ( ˜ λ t , t + δt ) − min λ E ( λ , t + δt ) ≤ L ( t ) || λ t − λ ∗ t + δt || N train ( t ) + 4 . Here, ˜ λ t are the new variational parameters obtained af-ter applying N train ( t + δt ) gradient descent steps startingfrom λ t . Furthermore, λ ∗ t + δt are the optimal parameterssuch that: E ( λ ∗ t + δt , t + δt ) = min λ E ( λ , t + δt ) . Since the Lipschitz constant L ( t ) ≤ O (poly( N )) (as-sumption (A4) ) and || λ t − λ ∗ t + δt || ≤ O (poly( N )) (as-sumption (A6) ), one can take N train ( t + δt ) = O (cid:18) poly( N ) (cid:15)g ( t + δt ) (cid:19) , (B9)with a suitable O (1) prefactor, so that: E ( ˜ λ t , t + δt ) − min λ E ( λ , t + δt ) < (cid:15)g ( t + δt )4 . Moreover, by assuming that the variational wave functionis expressive enough (assumption (A7) ), i.e.,min λ E ( λ , t + δt ) − E G ( t + δt ) < (cid:15)g ( t + δt )4 , we can then deduce, by taking λ t + δt ≡ ˜ λ t and summingthe two previous inequalities, that: E ( λ t + δt , t + δt ) − E G ( t + δt ) < (cid:15)g ( t + δt )2 . Let us recall that in step 1, we have to initially pre-pare the variational ansatz to satisfy condition (B4) at t = 0. In fact, we can take advantage of the assump-tion (A4) , where the gradients are L (0)-Lipschitz with L (0) ≤ O (poly( N )). We can also use the convexity as-sumption (A8) , and we can show that a suﬃcient num-ber of gradient descent steps to satisfy condition (B4) at t = 0 is estimated as: N warmup ≡ N train (0) = O (cid:18) poly( N ) (cid:15)g (0) (cid:19) . The latter can be obtained in a similar way as in Eq. (B9).In conclusion, the total number of gradient steps N steps to evolve the Hamiltonian ˆ H (0) to the target Hamilto-nian ˆ H (1), while verifying the condition (B4) is givenby: N steps = N annealing +1 (cid:88) n =1 N train ( t n ) , N train ( t n ) satisﬁes the requirement (B9). Theannealing times { t n } N annealing +1 n =1 are deﬁned such that t ≡ t n +1 ≡ t n + δt n . Here, δt n satisﬁes δt n = O (cid:18) (cid:15)g ( t n )poly( N ) (cid:19) . (B10)We also consider N annealing the smallest integer suchthat t N annealing + δt N annealing ≥

1, in this case, we deﬁne t N annealing +1 ≡

1, indicating the end of annealing. Thus, N annealing is the total number of annealing steps. Takingthis deﬁnition into account, then one can show that N annealing ≤ { t n } ( δt n ) + 1 . Using Eqs. (B7) and (B9) and the previous inequality, N steps can be bounded from above as: N steps ≤ ( N annealing + 1) max { t n } ( N train ( t n )) ≤  { t n } ( δt n ) + 2  max { t n } ( N train ( t n )) ≤ O  poly( N ) (cid:15) min { t n } ( g ( t n ))  , where the transition from line 2 to line 3 is valid fora suﬃciently small (cid:15) and min { t n } ( g ( t n )). Furthermore, N steps can also be bounded from below as: N steps ≥ max { t n } ( N train ( t n )) = O  poly( N ) (cid:15) min { t n } ( g ( t n ))  . (B11)Note that the minimum in the previous two bounds aretaken over all the annealing times t n where 1 ≤ n ≤ N annealing + 1.In this derivation of the bound on N steps , we have as-sumed that the ground state of ˆ H target is non-degenerate,so that the gap does not vanish at the end of annealing(i.e., t = 1). In the case of degeneracy of the targetground state, we can deﬁne the gap g ( t ) by consideringthe lowest energy level that does not lead to the degen-erate ground state.It is also worth noting that the assumptions of thisderivation can be further expanded and improved. Inparticular, the gradients of E ( λ , t ) are computed stochas-tically (see Methods Sec. V C), as opposed to our as-sumption (A4) where the gradients are assumed to beknown exactly. To account for noisy gradients, it ispossible to use convergence bounds of stochastic gradi-ent descent [47, 58] to estimate a bound on the num-ber of gradient descent steps. Second-order optimizationmethods such as stochastic reconﬁguration/natural gra-dient [59, 60] can potentially show a signiﬁcant advantageover ﬁrst-order optimization methods, in terms of scalingwith the minimum gap of the time-dependent Hamilto-nian ˆ H ( t ). Appendix C: Default Hyperparameters

In this Appendix, we summarize the architectures andthe hyperparameters of the simulations performed in thispaper, as shown in Tab. I. The latter has shown to yieldgood performance, while we believe that a more advancedstudy of the hyperparameters can result in optimal re-sults. We also note that in this paper, VQA and VCAwere run using a single GPU workstation for each simula-tion, while SQA and SA were performed on a multi-coreCPU.

Appendix D: Benchmarking Recurrent neuralnetwork cells

To show the advantage of tensorized RNNs over vanillaRNNs, we benchmark these architectures on the task ofﬁnding the ground state of the uniform ferromagneticIsing chain (i.e., J i,i +1 = 1) with N = 100 spins at thecritical point (i.e., no annealing is employed). Since thecouplings in this model are site-independent, we choosethe parameters of the model to be also site-independent.In Fig. 9(a), we plot the energy variance per site σ (seeEq. (24)) against the number of gradient descent steps.Here σ is a good indicator of the quality of the optimizedwave function [59, 61, 62]. The results show that thetensorized RNN wave function can achieve both a lowerestimate of the energy variance and a faster convergence.For the disordered systems studied in this paper, weset the weights T n , U n and the biases b n , c n (in Eqs. (16)and (17)) to be site-dependent. To demonstrate the ben-eﬁt of using site-dependent over site-independent param-eters when dealing with disordered systems, we bench-mark both architectures on the task of ﬁnding the groundstate of the disordered Ising chain with random discretecouplings J i,i +1 = ± N = 20 spins and at temperature T = 1,as explained in Methods Sec. V B. Both RNNs have acomparable number of parameters (66400 parameters forthe tensorized RNN and 59240 parameters for the dilatedRNN). Interestingly, in Fig. 9(c), we ﬁnd that the dilatedRNN supersedes the tensorized RNN with almost an or-der of magnitude diﬀerence in term of the free energyvariance per spin deﬁned in Eq. (21). Indeed, this resultsuggests that the mechanism of skip connections allowsdilated RNNs to capture long-term dependencies moreeﬃciently compared to tensorized RNNs.6 Figures Parameter ValueFigs. 3 and 8 Architecture Tensorized RNN wave function with no-weight sharingNumber of memory units d h = 40Number of samples N s = 50Initial magnetic ﬁeld for VQA Γ = 2Initial temperature for VCA T = 1Learning rate η = 5 × − Warmup steps N warmup = 1000Number of random instances N instances = 25Fig. 4 Architecture 2D tensorized RNN wave function with no weight-sharingNumber of memory units d h = 40Number of samples N s = 25Initial magnetic ﬁeld Γ = 1 (for SQA, VQA and RVQA)Initial temperature T = 1 (for SA, VCA and RVQA)Learning rate η = 10 − Number of warmup steps N warmup = 1000 for 10 ×

10 and N warmup = 2000 for 40 × N instances = 25Figs. 5(a) and (d) Architecture Dilated RNN wave function with no weight-sharingNumber of memory units d h = 40Number of samples N s = 50Initial temperature T = 2 (for SA and VCA)Initial magnetic ﬁeld Γ = 2 (for SQA)Learning rate η = 10 − Number of warmup steps N warmup = 2000Number of random instances N instances = 25Figs. 5(b), (c), (e) and (f) Architecture Dilated RNN wave function with no weight-sharingNumber of memory units d h = 20Number of samples N s = 50Initial temperature T = 1 (for SA and VCA)Initial magnetic ﬁeld Γ = 1 (for SQA)Learning rate η = 10 − Number of warmup steps N warmup = 1000Number of random instances N instances = 25Fig. 7 Architecture Tensorized RNN wave function with weight sharingNumber of memory units d h = 20Number of samples N s = 50Initial temperature T = 2Initial magnetic ﬁeld Γ = 2Learning rate η = 10 − Number of warmup steps N warmup = 1000Figs. 9(a) and (b) Architecture RNN wave functionNumber of memory units d h = 50Number of samples N s = 50Learning rate η = 10 − for Fig. 9(a) and η = 5 × − for Fig. 9(b)Fig. 9(c) Architecture RNN wave function with no-weight sharingNumber of memory units of dilated RNN d h = 20Number of memory units of tensorized RNN d h = 40Number of samples N s = 100Learning rate η = 10 − Table I. Hyperparameters used to obtain the results reported in this paper. Note that the number of samples stands for thebatch size used to train the RNN. abc Training step (a) Vanilla RNNTensorized RNN

Training step (b) RNN with weight sharingRNN with no weight sharing

Training step F (c) Tensorized RNNDilated RNN abc

Figure 9. Energy (or Free energy) variance per spin σ vsthe number of training steps. (a) We compare tensorized andvanilla RNN ansatzes both with weight sharing across siteson the uniform ferromagnetic Ising chain at the critical pointwith N = 100 spins. (b) Comparison between a tensorizedRNN with and without weight sharing, trained to ﬁnd theground state of the random Ising chain with discrete disorder( J i,i +1 = ±

1) at criticality with N = 20 spins. (c) Compar-ison between a tensorized RNN and dilated RNN ansatzes,both with no weight sharing, trained to ﬁnd the Sherrington-Kirkpatrick model’s equilibrium distribution with N = 20spins at temperature T = 1. [1] Andrew Lucas, “Ising formulations of many np prob-lems,” Front. Phys. , 5 (2014).[2] F Barahona, “On the computational complexity of isingspin glass models,” Journal of Physics A: Mathematicaland General , 3241–3253 (1982).[3] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi, “Opti-mization by simulated annealing,” Science , 671–680(1983).[4] C Koulamas, SR Antony, and R Jaen, “A survey ofsimulated annealing applications to operations researchproblems,” Omega , 41 – 56 (1994).[5] Bruce Hajek, “A tutorial survey of theory and applica-tions of simulated annealing,” in (1985) pp. 755–760.[6] D.I. Svergun, “Restoring low resolution structure of bi-ological macromolecules from solution scattering usingsimulated annealing,” Biophysical Journal , 2879 –2886 (1999).[7] David S. Johnson, Cecilia R. Aragon, Lyle A. McGeoch,and Catherine Schevon, “Optimization by simulated an-nealing: An experimental evaluation; part ii, graph color-ing and number partitioning,” Operations Research ,378–406 (1991).[8] M. A. Abido, “Robust design of multimachine power sys-tem stabilizers using simulated annealing,” IEEE Trans-actions on Energy Conversion , 297–304 (2000).[9] Torsten Karzig, Armin Rahmani, Felix von Oppen, andGil Refael, “Optimal control of majorana zero modes,”Phys. Rev. B , 201404 (2015).[10] Georges Gielen, Herman Walscharts, and Willy Sansen,“Analog circuit design optimization based on symbolicsimulation and simulated annealing,” in ESSCIRC ’89:Proceedings of the 15th European Solid-State CircuitsConference (1989) pp. 252–255.[11] Giuseppe E. Santoro, Roman Martoˇn´ak, Erio Tosatti,and Roberto Car, “Theory of quantum annealing of anising spin glass,” Science , 2427–2430 (2002).[12] J. Brooke, D. Bitko, T. F. Rosenbaum, and G. Aeppli,“Quantum annealing of a disordered magnet,” Science , 779–781 (1999).[13] Debasis Mitra, Fabio Romeo, and Alberto Sangiovanni-Vincentelli, “Convergence and ﬁnite-time behavior ofsimulated annealing,” Advances in Applied Probability , 747–771 (1986).[14] Daniel Delahaye, Supatcha Chaimatanan, and MarcelMongeau, “Simulated annealing: From basics to applica-tions,” in Handbook of Metaheuristics , edited by MichelGendreau and Jean-Yves Potvin (Springer InternationalPublishing, Cham, 2019) pp. 1–35.[15] Ilya Sutskever, James Martens, and Geoﬀrey Hinton,“Generating text with recurrent neural networks,” in

Proceedings of the 28th International Conference on In-ternational Conference on Machine Learning , ICML’11(Omnipress, Madison, WI, USA, 2011) p. 1017–1024.[16] Hugo Larochelle and Iain Murray, “The neural autore-gressive distribution estimator,” in

Proceedings of theFourteenth International Conference on Artiﬁcial Intelli-gence and Statistics , Proceedings of Machine LearningResearch, Vol. 15, edited by Geoﬀrey Gordon, DavidDunson, and Miroslav Dud´ık (JMLR Workshop andConference Proceedings, Fort Lauderdale, FL, USA, 2011) pp. 29–37.[17] Ashish Vaswani, Noam Shazeer, Niki Parmar, JakobUszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser,and Illia Polosukhin, “Attention is all you need,” (2017),arXiv:1706.03762 [cs.CL].[18] Dian Wu, Lei Wang, and Pan Zhang, “Solving statisti-cal mechanics using variational autoregressive networks,”Physical Review Letters (2019), 10.1103/phys-revlett.122.080602.[19] Or Sharir, Yoav Levine, Noam Wies, Giuseppe Car-leo, and Amnon Shashua, “Deep autoregressive mod-els for the eﬃcient variational simulation of many-bodyquantum systems,” Physical Review Letters (2020),10.1103/physrevlett.124.020503.[20] Mohamed Hibat-Allah, Martin Ganahl, Lauren E. Hay-ward, Roger G. Melko, and Juan Carrasquilla, “Recur-rent neural network wave functions,” Physical ReviewResearch (2020), 10.1103/physrevresearch.2.023358.[21] Christopher Roth, “Iterative retraining of quantumspin models using recurrent neural networks,” (2020),arXiv:2003.06228 [physics.comp-ph].[22] R.P. Feynman, Statistical Mechanics: A Set of Lectures ,Advanced Books Classics (Avalon Publishing, 1998).[23] Philip M. Long and Rocco A. Servedio, “Restricted boltz-mann machines are hard to approximately evaluate orsimulate,” in

Proceedings of the 27th International Con-ference on International Conference on Machine Learn-ing , ICML’10 (Omnipress, Madison, WI, USA, 2010) p.703–710.[24] Sergio Boixo, Troels F Rønnow, Sergei V Isakov, ZhihuiWang, David Wecker, Daniel A Lidar, John M Martinis,and Matthias Troyer, “Evidence for quantum annealingwith more than one hundred qubits,” Nat. Phys. , 218–224 (2014).[25] Tadashi Kadowaki and Hidetoshi Nishimori, “Quantumannealing in the transverse ising model,” Physical ReviewE , 5355–5363 (1998).[26] M. Born and V. Fock, “Beweis des adiabatensatzes,”Zeitschrift f¨ur Physik , 165–180 (1928).[27] Glen Bigan Mbeng, Lorenzo Privitera, Luca Arceci, andGiuseppe E. Santoro, “Dynamics of simulated quantumannealing in random ising chains,” Phys. Rev. B ,064201 (2019).[28] Nilan Norris, “The standard errors of the geometric andharmonic means and their application to index num-bers,” The Annals of Mathematical Statistics , 445–448 (1940).[29] Tommaso Zanca and Giuseppe E. Santoro, “Quantumannealing speedup over simulated annealing on randomising chains,” Phys. Rev. B , 224431 (2016).[30] “https://software.cs.uni-koeln.de/spinglass/,” .[31] Neil G Dickson, MW Johnson, MH Amin, R Harris,F Altomare, AJ Berkley, P Bunyk, J Cai, EM Chapple,P Chavez, et al. , “Thermally assisted quantum annealingof a 16-qubit problem,” Nature communications , 1–6(2013).[32] Joseph Gomes, Keri A. McKiernan, Peter Eastman,and Vijay S. Pande, “Classical quantum optimiza-tion with neural network quantum states,” (2019),arXiv:1910.10675 [cond-mat.dis-nn].[33] Semyon Sinchenko and Dmitry Bazhanov, “The deep learning and statistical physics applications to theproblems of combinatorial optimization,” (2019),arXiv:1911.10680 [cond-mat.dis-nn].[34] Tianchen Zhao, Giuseppe Carleo, James Stokes, andShravan Veerapaneni, “Natural evolution strategiesand quantum approximate optimization,” (2020),arXiv:2005.04447 [quant-ph].[35] Roman Martoˇn´ak, Giuseppe E. Santoro, and ErioTosatti, “Quantum annealing by the path-integral montecarlo method: The two-dimensional random isingmodel,” Phys. Rev. B , 094203 (2002).[36] M Mezard, G Parisi, and M Virasoro, Spin GlassTheory and Beyond , 1792–1796(1975).[38] Firas Hamze, Jack Raymond, Christopher A. Pattison,Katja Biswas, and Helmut G. Katzgraber, “Wishartplanted ensemble: A tunably rugged pairwise ising modelwith a ﬁrst-order phase transition,” Physical Review E (2020), 10.1103/physreve.101.052102.[39] Kyle Mills, Pooya Ronagh, and Isaac Tamblyn, “Con-trolled online optimization learning (cool): Finding theground state of spin hamiltonians with reinforcementlearning,” (2020), arXiv:2003.00011 [physics.comp-ph].[40] Yoshua Bengio, Andrea Lodi, and Antoine Prou-vost, “Machine learning for combinatorial opti-mization: A methodological tour d’horizon,” Eu-ropean Journal of Operational Research (2020),https://doi.org/10.1016/j.ejor.2020.07.063.[41] Ian Goodfellow, Yoshua Bengio, and AaronCourville, Deep Learning (MIT Press, 2016) .[42] Richard Kelley, “Sequence modeling with recurrent ten-sor networks,” (2016).[43] Shiyu Chang, Yang Zhang, Wei Han, Mo Yu, XiaoxiaoGuo, Wei Tan, Xiaodong Cui, Michael Witbrock, MarkHasegawa-Johnson, and Thomas S. Huang, “Dilatedrecurrent neural networks,” (2017), arXiv:1710.02224[cs.AI].[44] Y. Bengio, P. Simard, and P. Frasconi, “Learninglong-term dependencies with gradient descent is diﬃ-cult,” IEEE Transactions on Neural Networks , 157–166(1994).[45] Salah El Hihi and Yoshua Bengio, “Hierarchical recur-rent neural networks for long-term dependencies,” in Advances in Neural Information Processing Systems 8 ,edited by D. S. Touretzky, M. C. Mozer, and M. E. Has-selmo (MIT Press, 1996) pp. 493–499.[46] G. Vidal, “Class of quantum many-body states that canbe eﬃciently simulated,” Physical Review Letters (2008), 10.1103/physrevlett.101.110501.[47] Diederik P. Kingma and Jimmy Ba, “Adam: A methodfor stochastic optimization,” (2014), arXiv:1412.6980[cs.LG].[48] Sergey Bravyi, David P. Divincenzo, Roberto Oliveira, and Barbara M. Terhal, “The complexity of stoquasticlocal hamiltonian problems,” Quantum Info. Comput. ,361–385 (2008).[49] I. Ozﬁdan, C. Deng, A.Y. Smirnov, T. Lanting, R. Har-ris, L. Swenson, J. Whittaker, F. Altomare, M. Bab-cock, C. Baron, A.J. Berkley, K. Boothby, H. Chris-tiani, P. Bunyk, C. Enderud, B. Evert, M. Hager, A. Ha-jda, J. Hilton, S. Huang, E. Hoskinson, M.W. Johnson,K. Jooya, E. Ladizinsky, N. Ladizinsky, R. Li, A. Mac-Donald, D. Marsden, G. Marsden, T. Medina, R. Molavi,R. Neufeld, M. Nissen, M. Norouzpour, T. Oh, I. Pavlov,I. Perminov, G. Poulin-Lamarre, M. Reis, T. Prescott,C. Rich, Y. Sato, G. Sterling, N. Tsai, M. Volkmann,W. Wilkinson, J. Yao, and M.H. Amin, “Demonstrationof a nonstoquastic hamiltonian in coupled superconduct-ing ﬂux qubits,” Phys. Rev. Applied , 034037 (2020).[50] Mohamed Hibat-Allah, Estelle M. Inack, Roger G.Melko, and Juan Carrasquilla, (Manuscript in prepa-ration).[51] Shakir Mohamed, Mihaela Rosca, Michael Figurnov, andAndriy Mnih, “Monte carlo gradient estimation in ma-chine learning,” (2019), arXiv:1906.10652 [stat.ML].[52] Shi-Xin Zhang, Zhou-Quan Wan, and Hong Yao, “Au-tomatic diﬀerentiable monte carlo: Theory and applica-tion,” (2019), arXiv:1911.09117 [physics.comp-ph].[53] Sei Suzuki, “Cooling dynamics of pure and random isingchains,” Journal of Statistical Mechanics: Theory andExperiment , P03032 (2009).[54] Jacek Dziarmaga, “Dynamics of a quantum phase transi-tion in the random ising model: Logarithmic dependenceof the defect density on the transition rate,” Phys. Rev.B , 064416 (2006).[55] Tommaso Caneva, Rosario Fazio, and Giuseppe E. San-toro, “Adiabatic quantum dynamics of a random isingchain across its quantum critical point,” Phys. Rev. B , 144427 (2007).[56] Sandro Sorella and Federico Becca, SISSA Lecture noteson Numerical methods for strongly correlated electrons(Sec. 1.3) (2016).[57] Yurii Nesterov, “Smooth convex optimization,” in Lec-tures on Convex Optimization (Springer InternationalPublishing, Cham, 2018) pp. 59–137.[58] Mark Schmidt, Nicolas Le Roux, and Francis Bach,“Minimizing ﬁnite sums with the stochastic average gra-dient,” (2013), arXiv:1309.2388 [math.OC].[59] F. Becca and S. Sorella,

Quantum Monte Carlo Ap-proaches for Correlated Systems (Cambridge UniversityPress, 2017).[60] Shun-ichi Amari, “Natural gradient works eﬃcientlyin learning,” Neural Computation , 251–276 (1998),https://doi.org/10.1162/089976698300017746.[61] Claudius Gros, “Criterion for a good variational wavefunction,” Phys. Rev. B , 6835–6838 (1990).[62] Roland Assaraf and Michel Caﬀarel, “Zero-variance zero-bias principle for observables in quantum monte carlo:Application to forces,” The Journal of Chemical Physics119