Training Neural Networks is ∃R-complete
Mikkel Abrahamsen∗ (University of Copenhagen), Linda Kleist (TU Braunschweig), and Tillmann Miltzow† (Utrecht University)

February 22, 2021

∗ The first author is part of Basic Algorithms Research Copenhagen (BARC). BARC is supported by the VILLUM Foundation grant 16582.
† The third author is supported by the NWO Veni grant EAGER.
Abstract
Given a neural network, training data, and a threshold, it was known that it is NP-hard to find weights for the neural network such that the total error is below the threshold. We determine the algorithmic complexity of this fundamental problem precisely, by showing that it is ∃R-complete. This means that the problem is equivalent, up to polynomial-time reductions, to deciding whether a system of polynomial equations and inequalities with integer coefficients and real unknowns has a solution. If, as widely expected, ∃R is strictly larger than NP, our work implies that the problem of training neural networks is not even in NP.

1 Introduction

Training neural networks is a fundamental problem in machine learning. An (artificial) neural network is a brain-inspired computing system; see Figure 1 for an example. Neural networks are modelled by directed acyclic graphs whose vertices are called neurons. The source nodes are called the input neurons, the sinks are called output neurons, and all other neurons are said to be hidden. A network computes in the following way: each input neuron s receives an input signal (a real number), which is sent through all outgoing edges to the neurons that s points to. A non-input neuron v receives signals through its incoming edges, processes them, and transmits a single output signal to all neurons that v points to. The values computed by the output neurons are the result of the computation of the network. A neuron v evaluates the input signals by a so-called activation function ϕ_v. Each edge has a weight that scales the signal transmitted through the edge. Similarly, each neuron v often has a bias b_v that is added to the input signals. Denoting the unweighted input values to a neuron v by x ∈ ℝ^k and the corresponding edge weights by w ∈ ℝ^k, the output of v is given by ϕ_v(⟨w, x⟩ + b_v).

During a training process, the network is fed with input values for which the true output values are known. The task is then to adjust the weights and the biases so that the network produces outputs that are close to the ground truth specified by the training data. We formalize this problem in the following definition.

Figure 1: The architecture of a neural network with one layer of hidden neurons; the edge weights w and the biases b are not shown individually. As an example of a training problem, we are given the data points D = {(1, …, 3; 1, …), (3, …, 1; 2, …)}, the identity as the activation function ϕ for every neuron, the threshold δ = 10, and the mean of squared errors as cost function. Are there weights and biases such that the total cost is below δ?

Definition 1 (Training Neural Networks). The problem of training a neural network (NN-Training) has the following inputs explained above:

• a neural network architecture N = (V, E), where some S ⊂ V are input neurons and have in-degree 0, and some T ⊂ V are output neurons and have out-degree 0,
• an activation function ϕ_v : ℝ → ℝ for each neuron v ∈ V \ S,
• a cost function c : ℝ^|T| × ℝ^|T| → ℝ≥0,
• a threshold δ ≥ 0, and
• a set of data points D ⊂ ℝ^(|S|+|T|).

Here, each data point d ∈ D has the form d = (x_1, …, x_|S|; y_1, …, y_|T|), where x(d) = (x_1, …, x_|S|) specifies the values to the input neurons and y(d) = (y_1, …, y_|T|) are the associated ground truth output values. If the actual values computed by the network are y′(d) = (y′_1, …, y′_|T|), then the cost is c(y(d), y′(d)). The total cost is then

C(D) = Σ_{d ∈ D} c(y(d), y′(d)).

We seek to answer the following question: do there exist weights and biases of N such that C(D) ≤ δ?

We say that a cost function c is honest if it satisfies that c(y(d), y′(d)) = 0 if and only if y(d) = y′(d). An example of an honest cost function is the popular mean of squared errors:

c(y(d), y′(d)) = (1/|T|) Σ_{i=1}^{|T|} (y_i − y′_i)².
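For concreteness, the decision problem of Definition 1 can be sketched in a few lines of Python for a network with one hidden layer, identity activations, and the mean of squared errors; the concrete architecture, weights, and data below are illustrative assumptions, not the instance of Figure 1.

```python
def network_output(x, W1, b1, W2, b2):
    """One hidden layer, identity activations: y' = W2 (W1 x + b1) + b2."""
    hidden = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b for row, b in zip(W2, b2)]

def total_cost(data, W1, b1, W2, b2):
    """Total cost C(D): mean of squared errors per data point, summed over D."""
    cost = 0.0
    for x, y in data:
        y_pred = network_output(x, W1, b1, W2, b2)
        cost += sum((yi - pi) ** 2 for yi, pi in zip(y, y_pred)) / len(y)
    return cost

data = [([1.0, 0.0], [1.0]), ([0.0, 1.0], [2.0])]  # D, as (x(d); y(d)) pairs
W1, b1 = [[1.0, 2.0]], [0.0]                       # weights/bias of the hidden layer
W2, b2 = [[1.0]], [0.0]                            # weights/bias of the output layer
delta = 0.0
print(total_cost(data, W1, b1, W2, b2) <= delta)   # the NN-Training question; True
```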
1.1 Results

As our main result, we determine the algorithmic complexity of the fundamental problem NN-Training.

Theorem 2. NN-Training is ∃R-complete, even if

• the neural network has only one layer of hidden neurons,
• all neurons use the identity function ϕ(x) = x as activation function,
• any honest cost function c is used,
• each data point is in {0, 1}^(|S|+|T|),
• the threshold δ is 0, and
• there are only three output neurons.

We show that NN-Training is contained in ∃R in Section 2 and prove its hardness in Section 3. Section 4 discusses how our proof could be modified to work with the ReLU activation function. Section 5 contains a discussion and open problems. In the remainder of the introduction, we familiarize the reader with the complexity class ∃R, discuss the practical implications of our result, and give an overview of related complexity results on neural network training.

1.2 The Existential Theory of the Reals
In the problem ETR, we are given a logical expression (using ∧ and ∨) involving polynomial equalities and inequalities with integer coefficients, and the task is to decide whether there are real values for the variables that satisfy the expression. One example is

∃x, y ∈ ℝ : (x² + y² − 13) · (x · y − 6) = 0 ∧ (x ≥ 0 ∨ y > 0).

This is a yes-instance because (x, y) = (3, 2) is a solution.
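Note that verifying a given assignment is easy; the difficulty of ETR lies in deciding whether any real assignment exists at all. A short check of the example above (a sketch in Python, with exact integer arithmetic):

```python
def satisfies(x, y):
    """Evaluate the example ETR expression at a candidate point."""
    return (x**2 + y**2 - 13) * (x * y - 6) == 0 and (x >= 0 or y > 0)

print(satisfies(3, 2))  # True: (x, y) = (3, 2) witnesses the yes-instance
```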
Due to deep connections to many related fields, ETR is a fundamental and well-studied problem in mathematics and computer science. Despite its long history, we still lack algorithms that can solve ETR efficiently in theory and in practice. The Existential Theory of the Reals, denoted by ∃R, is the complexity class of all decision problems that reduce to ETR under polynomial-time many-one reductions. Its importance is reflected by the fact that many natural problems are ∃R-complete. Famous examples from discrete geometry are the recognition of geometric structures, such as unit disk graphs [22], segment intersection graphs [21], stretchability [24, 28], and order type realizability [21]. Other ∃R-complete problems are related to graph drawing [20], Nash equilibria [13, 5], geometric packing [3], the art gallery problem [2], non-negative matrix factorization [27], and geometric linkage constructions [1]. We refer the reader to the lecture notes by [21] and the surveys by [26] and [10] for more information on the complexity class ∃R.

1.3 ∃R-hardness

The ∃R-completeness of a problem gives us a better understanding of the inherent difficulty of finding exact algorithms for it. Problems that are ∃R-complete often require irrational numbers of arbitrarily high algebraic degree or doubly exponential precision to describe valid solutions. These phenomena make it hard to find efficient algorithms. We know that NP ⊆ ∃R ⊆ PSPACE [9], and both inclusions are believed to be strict, although this remains an outstanding open question in the field of complexity theory. In a classical view of complexity, we distinguish between problems that we can solve in polynomial time and intractable problems that may not be solvable in polynomial time. Usually, knowing that a problem is NP-hard is argument enough to convince us that we cannot solve the problem efficiently. Yet there is a big difference between NP-complete problems and ∃R-complete problems, assuming that NP ≠ ∃R. To give a simple example, NP-complete problems can be solved in a brute-force fashion by exhaustively going through all possible solutions. Although this method is not very sophisticated, it is good enough to solve small-sized instances. The same is not possible for ∃R-complete problems due to their continuous nature.

The difficulty of solving ∃R-complete problems is nicely illustrated by the problem of placing eleven unit squares into a minimum-sized square container without overlap. Whether a given square can contain eleven unit squares can be expressed as an ETR-formula of modest size, so if such formulas could be solved efficiently, we would know the answer (at least to within any desired accuracy). Despite the apparent simplicity of the problem, it is only known that the side length is between 2 + 4/√5 ≈ 3.788 and 3.878 [14].

By now, strong algorithmic methods and tools are known with which we can find optimal solutions to large-scale instances of NP-complete problems. We highlight here FPT algorithms, ILP solvers, and SAT solvers, to name just a few popular approaches. We are completely lacking similarly efficient approaches to solve ∃R-complete problems. Here, it is most common to use some type of gradient descent method, which, in the context of neural networks, includes the backpropagation algorithm. Unfortunately, gradient descent methods have very weak performance guarantees in general. Specifically, it is difficult to distinguish between local and global optima.

It would be interesting to find, or disprove the existence of, methods that outperform gradient descent for the problem NN-Training. For instance, methods with the convergence speed of ILP solvers would be a great asset, saving money, energy, and time, and returning solutions of a higher quality. However, our results indicate that this is not achievable in general.

1.4 Previous Hardness Results for NN-Training
It has been known for more than three decades that it is NP-hard to train various types of neural networks for binary classification [6, 23, 18], which means that the output neurons use activation functions that map to {0, 1}. The first hardness result for networks using continuous activation functions appears to be by [17], who showed NP-hardness of training networks with two hidden neurons using sigmoidal activation functions and one output neuron using the identity function. [16] and [29] showed hardness of training networks with no hidden neurons and a single output neuron with sigmoidal activation function. The latter paper contains an informative survey of the numerous hardness results that were known at that time.

Recently, attention has turned to networks using the so-called ReLU function [x]⁺ = max{0, x} as activation function due to its extreme popularity in practice. [15] and [11] showed that it is even NP-hard to train a network with no hidden neurons and a single output neuron using the ReLU activation function. For hardness results on other simple architectures using the ReLU activation function, see [8, 7, 4].

In order to prevent overfitting, we may stop training early, although the costs could still be reduced further. In the context of training neural networks, overfitting can be regarded as a secondary problem, as it only emerges after we have been able to train the network on the data at all. Thus, none of the complexity theory papers on this subject address overfitting. Besides training neural networks, there exist other NP-hard problems related to neural networks, e.g., continual learning [19].

Some of these training problems are not only NP-hard, but also contained in NP, implying that they are NP-complete. For instance, consider a fully connected network with one hidden layer of neurons and one output neuron, all using the ReLU activation function. Here, it is not hard to see that the problem of deciding whether total cost δ = 0 can be achieved is in NP. We will show that the network does not need to be much more complicated before the training problem becomes ∃R-complete, even when δ = 0.

2 NN-Training is contained in ∃R

In order to prove that NN-Training is ∃R-complete, we show that the problem is contained in the class ∃R and that it is ∃R-hard (just as when proving NP-completeness of a problem). The first part is obtained by proving that NN-Training can be reduced to ETR, while for the latter we present a reduction in the opposite direction.

To see that NN-Training ∈ ∃R, we use a recent result by Erickson et al. [12]. Given an algorithmic problem, a real verification algorithm A has the following properties: for every yes-instance I, there exists a witness w consisting of integers and real numbers such that A(I, w) can be executed on the real RAM in polynomial time and returns yes; on the other hand, for every no-instance I and any witness w, the output A(I, w) is no. [12] showed that an algorithmic problem is in ∃R if and only if there exists a real verification algorithm for it. Note that this is very similar to how NP-membership is usually shown. The crucial difference is that a real verification algorithm accepts real numbers as input for the witness and works on the real RAM instead of the integer RAM.

It remains to describe a real verification algorithm for NN-Training. As a witness, we simply describe all the weights of the network. The verification then computes the total cost over all the data points and checks whether it is below the given threshold δ. Clearly, this algorithm can be executed in polynomial time on the real RAM.

Note that if the activation function is not piecewise algebraic, e.g., the sigmoid function ϕ(x) = 1/(1 + e^(−x)), it is not clear that we have ∃R-membership, as such a function is not supported by the real RAM model of computation [12].

3 NN-Training is ∃R-hard

In the following, we describe a reduction from the ∃R-hard problem ETR-INV to NN-Training. As a first step, we establish the ∃R-hardness of ETR-INV using previous work. As the next step, we define the intermediate problem Restricted Training. In the main part of the reduction, we describe how to encode variables and subtraction operations as well as inversion and addition constraints in Restricted Training. Finally, we present two modifications that enable the step from Restricted Training to NN-Training.
3.1 The Problems ETR-INV and ETR-INV-EQ

In order to define the new algebraic problem ETR-INV-EQ, we recall the definition of ETR-INV. An ETR-INV formula Φ = Φ(x_1, …, x_n) is a conjunction C_1 ∧ ⋯ ∧ C_m of m ≥ 1 constraints, where each constraint C_i with variables x, y, z ∈ {x_1, …, x_n} has one of the forms

x + y = z  or  x · y = 1.

The first constraint is called an addition constraint and the second an inversion constraint. An instance I of the ETR-INV problem consists of an ETR-INV formula Φ. The goal is to decide whether there are real numbers that satisfy all the constraints.

Abrahamsen et al. [3] established the following theorem. Note that their definition of ETR-INV formulas asks for a number of additional properties (e.g., restricting the variables to certain ranges) that we do not need for our purposes, so these can be omitted without affecting the correctness of the following result.

Theorem A ([3], Theorem 3). ETR-INV is ∃R-complete.
For our purposes, we slightly extend their result and define the algorithmic problem ETR-INV-EQ, in which each constraint has the form

x^{±1} + y^{±1} − z^{±1} = 0.

We call a constraint of the above form a combined constraint, as it is possible to express both inversion and addition constraints using combined constraints. To see that ETR-INV-EQ is also ∃R-complete, we show how to transform an instance of ETR-INV into an instance of ETR-INV-EQ. First, note that we can assume that every variable is contained in at least one addition constraint; otherwise, we can add a trivially satisfiable addition constraint, i.e., x + y_1 = y_2 for new variables y_1 and y_2. Furthermore, consider the case that a variable x of Φ appears in (at least) two inversion constraints, i.e., there exist two constraints of the forms x · y_1 = 1 and x · y_2 = 1. This implies y_1 = y_2, and we can replace all occurrences of y_2 by y_1. Thus, we may assume that each variable appears in at most one inversion constraint. Now, if there is an inversion constraint x · y = 1, we replace all occurrences of y (in addition constraints) by x^{−1}. In this way, the inversion constraint becomes redundant and we can remove it. We are then left with a collection of combined constraints. This proves the ∃R-completeness of ETR-INV-EQ.
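The transformation is purely syntactic. The following sketch illustrates it in Python under an assumed tuple encoding of constraints; the encoding and the function name to_etr_inv_eq are ours, and we assume every variable already appears in some addition constraint and in at most one inversion constraint, as arranged above.

```python
def to_etr_inv_eq(constraints):
    """Rewrite ETR-INV constraints ('add', x, y, z) for x + y = z and
    ('inv', x, y) for x*y = 1 into combined constraints
    x^{±1} + y^{±1} - z^{±1} = 0, encoded as (variable, exponent) pairs."""
    subst = {}
    for kind, *args in constraints:
        if kind == 'inv':                 # x*y = 1: replace y by x^{-1}
            x, y = args
            subst[y] = (x, -1)

    combined = []
    for kind, *args in constraints:
        if kind == 'inv':
            continue                      # redundant after the substitution
        x, y, z = args                    # addition constraint x + y = z
        term = lambda v: subst.get(v, (v, +1))
        combined.append((term(x), term(y), term(z)))
    return combined

# x + w = u together with z*w = 1 becomes x^1 + z^{-1} - u^1 = 0:
print(to_etr_inv_eq([('add', 'x', 'w', 'u'), ('inv', 'z', 'w')]))
```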
In the following, we reduce ETR-INV-EQ to NN-Training.

3.2 Restricted Training

We start by defining the algorithmic problem Restricted Training, which differs from NN-Training in the following three properties:

• Some weights and biases can be predefined in the input.
• The output values of data points may contain a question mark symbol '?'.
• When computing the total cost, all output values with question marks are ignored.

Figure 2 displays an example of such an instance. In the rest of the reduction, we will use the identity as activation function and the threshold δ = 0. The reduction works for all honest cost functions.

3.3 The Overall Network

The overall network consists of two layers, as depicted in Figure 2. Note that these layers are not fully connected. Some weights of the first layer will represent variables of the ETR-INV-EQ instance; other weights will be predefined by the input, and some will have purely auxiliary purposes. Furthermore, all biases are set to 0.

Figure 2: The architecture of an example instance of Restricted Training. Further, the input consists of the data points d_1 = (0, …, 1; 1, …, ?) and d_2 = (2, …, 0; ?, ?, …), the identity ϕ(x) = x as activation function, the threshold δ = 7, and the Manhattan norm ‖·‖₁ as the cost function. Setting the weights to (x, y, z) = (1, …, −1) gives a valid solution.

In the following, we describe the individual gadgets to represent variables, addition, and inversion. Then we show how we combine these parts. Later, we modify the construction to remove preset weights, biases, and question marks. At last, we will sketch how this reduction can be modified to work with ReLU functions as activation functions.
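The question-mark convention of Restricted Training above simply masks entries when computing the cost; a minimal sketch with the Manhattan cost (representing '?' as a string is our assumption):

```python
def masked_manhattan_cost(y_true, y_pred):
    """Manhattan cost that ignores all output entries marked '?'."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred) if t != '?')

print(masked_manhattan_cost([1.0, '?', 0.0], [1.0, 42.0, 0.5]))  # 0.5
```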
3.4 Subtraction Gadget

The subtraction gadget consists of five neurons, four edges, two prescribed weights, and one data point. See Figure 3 for an illustration of the network architecture of the subtraction gadget. The data point d = (1, 1; 0) enforces the constraint x = −y, as can be easily calculated: with the two prescribed weights equal to 1 (see Figure 3), the output on input (1, 1) is x + y, and zero cost forces x + y = 0.

Figure 3: The architecture of the subtraction gadget. The data point d = (1, 1; 0) enforces the constraint x = −y.
3.5 Inversion Gadget

The purpose of the inversion gadget is to enforce that two variables are the inverse of one another. It consists of six vertices, five edges, two prescribed weights, and two data points; for an illustration of the architecture, see Figure 4. A simple calculation shows that the data point d_1 = (0, 1, 0; 1) enforces the constraint y · z = 1, while the data point d_2 = (1, 0, 1; 0) enforces the constraint x − z = 0. It follows that x · y = 1.

Figure 4: The architecture of the inversion gadget. The data points d_1 = (0, 1, 0; 1) and d_2 = (1, 0, 1; 0) enforce y · z = 1 and x − z = 0, respectively, implying that x · y = 1.
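Both gadget computations can be verified mechanically; here is a small sketch using sympy (the use of sympy is our choice, any computer algebra system would do):

```python
import sympy as sp

x, y, z = sp.symbols('x y z')

# Subtraction gadget: the data point (1, 1; 0) forces the output x + y to be 0.
print(sp.solve(sp.Eq(x + y, 0), x))  # [-y], i.e. x = -y

# Inversion gadget: (0, 1, 0; 1) forces y*z = 1 and (1, 0, 1; 0) forces x - z = 0.
sol = sp.solve([sp.Eq(y * z, 1), sp.Eq(x - z, 0)], [x, z], dict=True)[0]
print(sp.simplify(sol[x] * y))  # 1, hence x*y = 1
```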
3.6 Variable Gadget

For every variable, we build a gadget such that there exist four weights on the first layer with the values x, −x, 1/x, −1/x.

Figure 5: The variable gadget.

The variable gadget is a combination of two subtraction gadgets and one inversion gadget. In total, it has five input neurons, four middle neurons, and two output neurons; see Figure 5 for an illustration of the architecture and the initial weights. We denote the input neurons by s_1, …, s_5 and the output neurons by a and b. The output neuron a is drawn twice for clarity of the drawing. We have four data points d_1, …, d_4, one per sub-gadget; each has input entry 0 at every input neuron that does not belong to its sub-gadget and a question mark at the output neuron it does not constrain. The data point d_1 enforces w = −x. To see this, note that the output neurons with a question mark symbol are irrelevant; similarly, input neurons with 0-entries can be ignored. The remaining neurons form exactly the subtraction gadget. Analogously, we conclude that y = −z, using data point d_2. From the data point d_3, we infer that y · v = 1, and using d_4, we infer v = x. We summarize our observations in the following lemma.

Lemma 3 (Variable Gadget). The variable gadget enforces the following constraints on the weights: w = −x, y = 1/x, and z = −1/x.

3.7 Combining Variable Gadgets

Here, we describe how to combine n variable gadgets. For the architecture, we identify the two output neurons of all gadgets; all other neurons remain distinct. Figure 6 gives a schematic drawing of the architecture for the case of n = 3 variables. Additionally, we construct 4n data points: for each variable, we construct the four data points described for the variable gadget, with the additional input entries set to 0. In this way, we represent all n variables of an ETR-INV-EQ formula. Furthermore, for each variable x, we have edges from the first to the second vertex layer with the weights x, −x, 1/x, and −1/x, see Lemma 3.

Figure 6: A schematic drawing of combining three variable gadgets. The two output neurons of all gadgets are identified. The input neurons and the neurons in the middle layer remain distinct.
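Assuming the four equations derived from d_1, …, d_4 above, the conclusion of Lemma 3 can be checked symbolically, e.g. with sympy:

```python
import sympy as sp

w, x, y, z, v = sp.symbols('w x y z v')

# d1: w = -x,  d2: y = -z,  d3: y*v = 1,  d4: v = x.
sol = sp.solve([sp.Eq(w, -x), sp.Eq(y, -z), sp.Eq(y * v, 1), sp.Eq(v, x)],
               [w, y, z, v], dict=True)[0]
print(sol)  # {w: -x, y: 1/x, z: -1/x, v: x}, matching Lemma 3
```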
3.8 Encoding the Combined Constraints

For the purpose of concreteness, we consider a combined constraint C of an ETR-INV-EQ instance,

w_1 + w_2 + w_3 = 0,

where each w_i is either the value of some variable, its inverse, its negative, or its negative inverse. Then, by construction, there exists a weight in the combined variable gadget for each w_i. Figure 7 depicts the network induced by these edges and the output vertex a; in particular, note that all the edges with weights w_i are connected to a, as can also be checked in Figure 5.

In order to represent the constraint C, we introduce a data point d(C). It has input entry 1 exactly at the input neurons of w_1, w_2, and w_3; otherwise it is 0. Its output is defined by 0 for a and '?' for b. Thus d(C) enforces the combined constraint C. Note that enforcing the combined constraints requires neither altering the neural network architecture nor modifying any of the weights.

Figure 7: There are three weights encoding w_1, w_2, and w_3. They are all connected to the output vertex a.
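For illustration, assuming (as in the gadgets above) that the second-layer edges into a carry prescribed weight 1, the data point d(C) reads off exactly the sum w_1 + w_2 + w_3 at the output neuron a:

```python
def output_at_a(w1, w2, w3):
    """Value at output neuron a on the data point d(C): the inputs are 1 exactly
    at the input neurons carrying w1, w2, w3, so a receives w1 + w2 + w3."""
    return 1 * w1 + 1 * w2 + 1 * w3

# With threshold delta = 0 and an honest cost function, zero cost on d(C)
# forces output_at_a(...) == 0, i.e. the combined constraint w1 + w2 + w3 = 0.
print(output_at_a(0.5, 2.0, -2.5))  # 0.0: this choice satisfies C
```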
3.9 Removing the Preset Weights

Next, we modify the construction such that we do not make use of predefined weights. To this end, we show how to enforce edge weights of ±1. Recall that, by construction, all predefined weights are either +1 or −1. We add a new output neuron q. For each middle neuron m, we perform the following steps individually: we add one more input neuron s and insert the two edges sm and mq. We will show later that the weights z_1 and z_2 on the two new edges can be assumed to be 1, as depicted in Figure 8. We modify all previously defined data points such that they have output '?' for the output neuron q. They are further padded with zeros for all the new input neurons.

Figure 8: For each middle neuron m, we add an input neuron s and the edges sm and mq with weights z_1 and z_2.

Furthermore, we add one data point d(m) with input entry 1 for s and 0 otherwise, and output entry 1 for q and '?' otherwise. This data point ensures that neither z_1 nor z_2 is 0.

Next, we describe a simple observation. Consider a single middle neuron m with k incoming edges and l outgoing edges. Let us denote some arbitrary input by a = (a_1, …, a_k), the weights on the first layer by w = (w_1, …, w_k), and the weights of the second layer by w̄ = (w̄_1, …, w̄_l). We consider all vectors to be columns. Then, the output vector for this input is given by ⟨a, w⟩ · w̄, where ⟨·, ·⟩ denotes the scalar product. Let α ≠ 0 be some real number. Note that replacing w by w′ = α · w and w̄ by w̄′ = (1/α) · w̄ does not change the output.

Observation 4. Scaling the weights of the incoming edges of a middle neuron by α ≠ 0 and the weights of the outgoing edges by α⁻¹ does not change the neural network behavior.

This observation can be used to assume that some non-zero weight equals 1, because we can freely choose some α ≠ 0 and multiply all weights as described above without changing the output. In particular, for the middle neuron m, we may assume that z_1 = 1, see Figure 8. Here, we crucially use the fact that z_1 is not zero. This standard technique is often referred to as normalization. Moreover, by the data point d(m), we can also infer that the weight z_2 equals 1, as we know z_1 · z_2 = 1. With the help of the edge weights z_1 = z_2 = 1, we are able to set more weights to ±1.
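A quick numeric sanity check of Observation 4 (all values illustrative; identity activation assumed):

```python
import random

def middle_neuron_output(a, w_in, w_out):
    """Output vector <a, w_in> * w_out of a middle neuron with identity activation."""
    s = sum(ai * wi for ai, wi in zip(a, w_in))
    return [s * wo for wo in w_out]

a = [0.3, -1.2, 2.0]
w_in, w_out = [1.5, 0.4, -0.7], [2.0, -1.0]
alpha = random.uniform(0.1, 5.0)
scaled = middle_neuron_output(a, [alpha * w for w in w_in],
                              [w / alpha for w in w_out])
print(middle_neuron_output(a, w_in, w_out), scaled)  # equal up to rounding
```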
Let z denote the weight of some other edge e incident to m that we wish to fix to the value +1 (the case of −1 is analogous). First suppose that the edge e = mt is an outgoing edge of m; the case of an incoming edge is analogous. See Figure 8 for an illustration. We add a new data point d, with output entries being 1 for t and '?' otherwise, and input entries being 1 for s and 0 otherwise. Given that z_1 = 1, the data point d implies that z = 1 as well.

3.10 Removing the Biases

In this section, we modify the input such that we do not make use of the biases being set to zero. First note that a function f represented by a certain neural network might be representable by several choices of weights and biases. To be specific, let b ∈ ℝ be the bias of a fixed middle neuron m. Denote by z_1, …, z_k the weights of the outgoing edges of m and by b_1, …, b_k the biases of the corresponding neurons. We can replace these biases as follows: we replace b by b′ = 0 and b_i by b′_i = b_i + z_i · b. We observe that the new neural network represents exactly the same function f. Note that we used here explicitly the fact that the activation function is the identity; this may not be the case for other activation functions. From here on, we assume that all biases in the middle layer are set to zero.

It remains to ensure that the biases of the output neurons are zero as well. We add the additional data point d = (0, …, 0; 0, …, 0) that is zero on all inputs and outputs. The value of the neural network on this input is precisely the vector of biases of its output neurons. As our threshold is zero and the cost function is honest, we can conclude that the biases of all output neurons must be zero as well. We summarize these observations as follows.

Observation 5.
All biases can be assumed to be zero.
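The bias-shifting step behind Observation 5 is easy to check numerically; a minimal sketch with one middle neuron and identity activations (all values illustrative):

```python
def two_layer(x, w1, b, w2, biases):
    """x -> middle neuron (weights w1, bias b) -> output neurons (weights w2, biases)."""
    m = sum(xi * wi for xi, wi in zip(x, w1)) + b  # identity activation
    return [m * w + bi for w, bi in zip(w2, biases)]

x, w1, w2 = [1.0, -2.0], [0.5, 1.5], [2.0, -3.0]
b, biases = 0.7, [0.1, 0.2]
# Move the middle bias b into the output biases: b' = 0, b_i' = b_i + z_i * b.
shifted = [bi + w * b for bi, w in zip(biases, w2)]
print(two_layer(x, w1, b, w2, biases))     # original network
print(two_layer(x, w1, 0.0, w2, shifted))  # the same function, zero middle bias
```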
3.11 Removing the Question Marks

To complete the construction, it remains to show how to remove the question marks from the data points. We remove the question marks one after the other. For each data point d and every contained symbol '?', we add an input neuron, a middle neuron, the edge between them, and the edge from the middle neuron to the output neuron containing the considered symbol '?' in d; see Figure 9 for a schematic illustration.

Due to the additional input entry for the added input neuron, we need to modify all data points slightly. In d, we set the additional entry to 1; for all other data points, we set it to 0. Moreover, we replace the considered symbol '?' in d by the entry 0.

We have to show that this modification does not change the feasibility of the neural network. Clearly, the output entries of d with the question mark can now be freely adjusted using the two new edges. At the same time, no other data point d′ can make use of the new edges, as the value for the new input neuron equals 0. This finishes the description of the reduction. Next, we show its correctness.

3.12 Correctness
Let Φ be an ETR-INV-EQ instance on n variables. We construct an instance I of NN-Training as described above. First note that the construction is polynomial in n in time and space complexity. To be precise, the size of the network is O(n), the number of data points is O(n), and each data point has a size in O(n). Thus, the total space complexity is in O(n²). Because no part of the construction needs additional computation time, the time complexity is also in O(n²).

Figure 9: For every '?' in a data point d, we add two more vertices and edges to the neural network.

We show that Φ has a real solution x* ∈ ℝⁿ if and only if there exists a set of weights for I such that all input data are mapped to the correct output.

Suppose that there exists a solution x* ∈ ℝⁿ satisfying all constraints of Φ. We show that there are weights w for I that predict all outputs correctly for each data point. By construction, for every variable x, there exist edge weights x, −x, 1/x, −1/x. We set these weights according to the value of x given by x*. Moreover, we prescribe all other weights as intended by the construction procedure, e.g., the ±1 prescribed weights. By construction and the arguments above, all data points are predicted correctly by the neural network. Specifically, all the data points described in Section 3.8 are correctly predicted, as x* satisfies Φ.

For the reverse direction, we suppose that we are given weights w for all the edges of the network in I. By Section 3.10, we can assume that all biases are zero without changing the function that is represented by the neural network. By Observation 4, we may normalize the weights without changing the behavior of the neural network. Consequently, we can assume that all the weights are as prescribed for the Restricted Training problem. By Lemma 3, there exist edge weights that consistently encode the variables. Thus, we use the values of these weights to describe a real solution x*. Due to the data points introduced in Section 3.8, we can conclude that all combined constraints of Φ are satisfied. This finishes the proof of Theorem 2.

4 The ReLU Activation Function

As the ReLU activation function is commonly used in practice, we present some ideas to prove that our reduction also holds when linear activation functions are replaced by ReLUs.
Conjecture 6. NN-Training is ∃R-complete, even if the activation function for all neurons is the ReLU.
ETR-INV has asolution, then there also exists a solution where each variable is in the interval [1 / , / ,
2] as well. Together with the factthat all data points have small support, we may conclude that also the sum of the incoming values of neuronsare lower bounded by some negative constant C . Thus, we can define the function ψ ( x ) = max { C, x } , whichis a shifted version of the standard ReLU. For this proof to be rigorous, it remains to show the following.No choice of weights activating the constant part of the ReLU on any neuron can be completed to a validsolution. We believe that this is possible by adding some extra data points. As the function that we intend11o represent will be linear, adding more data points that are linear combinations of previously added datapoints may be helpful to support the argument.Additionally, if someone aims to show hardness for the standard ReLU ϕ ( x ) = max { , x } , the followingapproach may work. We shift the values of all variables, in particular of − x, − /x , to the positive range asfollows. Instead of representing the values − x, − /x , we represent it by 3 − x and 3 − /x . This requires amodification of the neural network and the given data points, see Section 3.8. Henceforth, the combinedconstraint x + y = z is replaced by x + y + (3 − z ) = 3. As a consequence of this modification, all weightsof the neural network are in the positive range. Again, it remains to ensure that the non-linear part of theReLU is not activated. We believe that this is possible by adding some extra data points. Training neural networks is undoubtedly a fundamental problem in machine learning. We present a clean andsimple argument to show that
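The two ingredients above can be illustrated in a few lines; the bound C and all concrete values are assumptions of this sketch, not part of the proof:

```python
C = -10.0  # assumed lower bound on every incoming sum (see the argument above)

def shifted_relu(s):
    return max(C, s)

# Wherever the incoming sums stay above C, the shifted ReLU acts as the identity,
# so the linear reduction is unaffected there.
print(all(shifted_relu(s) == s for s in [-9.5, -1.0, 0.0, 2.0]))  # True

# The shifted combined constraint is equivalent to the original one:
# x + y = z  <=>  x + y + (3 - z) = 3.
x, y = 0.75, 1.25
z = x + y
print(x + y + (3 - z) == 3)  # True
```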
5 Discussion

Training neural networks is undoubtedly a fundamental problem in machine learning. We present a clean and simple argument to show that NN-Training is complete for the complexity class ∃R. Compared to other prominent ∃R-hardness proofs, such as [25] or [3], our proof is relatively accessible. Our findings illustrate the fundamental difficulty of training neural networks. At the same time, we explain why neural networks can be a very powerful tool, since we prove neural networks to be more expressive than any learning method involving only discrete parameters or linear models: in practice, neural networks have proved useful for solving problems that cannot be solved by combinatorial methods such as ILP solvers, SAT solvers, or linear programming, and our work gives a reason why (at least under the assumption that NP ≠ ∃R).

In our reduction, we carefully choose which edges should be part of our network to obtain an architecture that is particularly difficult to train. In practice, it is common to use a simpler fully connected network, where each neuron of one layer has an edge to each neuron of the next. It is thus an interesting open problem for future research to find out whether training a fully connected neural network with one hidden layer of neurons is also ∃R-complete, which we expect to be the case. Note that, as mentioned in Section 1.4, this requires at least two output neurons, as the problem is otherwise in NP (when using ReLU or identity activation functions). Besides the neural network architecture, also the data points are chosen specifically to create a difficult instance. However, data on which training is hard is arguably a realistic scenario.

Acknowledgments.
We thank Frank Staals for many enjoyable and valuable discussions. We thank Thijs van Ommen for helpful comments on the write-up.
References

[1] Zachary Abel, Erik Demaine, Martin Demaine, Sarah Eisenstat, Jayson Lynch, and Tao Schardl. Who needs crossings? Hardness of plane graph rigidity. In SoCG, pages 3:1–3:15, 2016.

[2] Mikkel Abrahamsen, Anna Adamaszek, and Tillmann Miltzow. The art gallery problem is ∃R-complete. In STOC, pages 65–73, 2018.

[3] Mikkel Abrahamsen, Tillmann Miltzow, and Nadja Seiferth. A framework for ∃R-completeness of two-dimensional packing problems. In FOCS, 2020.

[4] Ainesh Bakshi, Rajesh Jayaram, and David P. Woodruff. Learning two layer rectified neural networks in polynomial time. In Proceedings of the Thirty-Second Conference on Learning Theory (COLT 2019), pages 195–268, 2019.

[5] Vittorio Bilò and Marios Mavronicolas. A catalog of ∃R-complete decision problems about Nash equilibria in multi-player games. In STACS, 2016.

[6] Avrim L. Blum and Ronald L. Rivest. Training a 3-node neural network is NP-complete. Neural Networks, 5(1):117–127, 1992.

[7] Digvijay Boob, Santanu S. Dey, and Guanghui Lan. Complexity of training ReLU neural network. Discrete Optimization, 2020. In press.

[8] Alon Brutzkus and Amir Globerson. Globally optimal gradient descent for a ConvNet with Gaussian inputs. In Proceedings of the 34th International Conference on Machine Learning (ICML 2017), 2017.

[9] John Canny. Some algebraic and geometric computations in PSPACE. In Proceedings of the Twentieth Annual ACM Symposium on Theory of Computing (STOC 1988), pages 460–467. ACM, 1988.

[10] Jean Cardinal. Computational geometry column 62. SIGACT News, 46(4):69–78, 2015.

[11] Santanu S. Dey, Guanyi Wang, and Yao Xie. Approximation algorithms for training one-node ReLU neural networks. IEEE Transactions on Signal Processing, 68:6696–6706, 2020.

[12] Jeff Erickson, Ivor van der Hoog, and Tillmann Miltzow. Smoothing the gap between NP and ∃R. In FOCS, pages 1022–1033. IEEE, 2020.

[13] Jugal Garg, Ruta Mehta, Vijay V. Vazirani, and Sadra Yazdanbod. ETR-completeness for decision versions of multi-player (symmetric) Nash equilibria. In Proceedings of the 42nd International Colloquium on Automata, Languages, and Programming (ICALP 2015), part 1, pages 554–566, 2015.

[14] Thierry Gensane and Philippe Ryckelynck. Improved dense packings of congruent squares in a square. Discrete & Computational Geometry, 34(1):97–109, 2005.

[15] Surbhi Goel, Adam Klivans, Pasin Manurangsi, and Daniel Reichman. Tight hardness results for training depth-2 ReLU networks, 2020. Preprint, https://arxiv.org/abs/2011.13550.

[16] Don R. Hush. Training a sigmoidal node is hard. Neural Computation, 11(5):1249–1260, 1999.

[17] L. K. Jones. The computational intractability of training sigmoidal neural networks. IEEE Transactions on Information Theory, 43(1):167–173, 1997.

[18] Stephen Judd. On the complexity of loading shallow neural networks. Journal of Complexity, 4(3):177–192, 1988.

[19] Jeremias Knoblauch, Hisham Husain, and Tom Diethe. Optimal continual learning has perfect memory and is NP-hard. In Proceedings of the 37th International Conference on Machine Learning (ICML 2020), pages 5327–5337, 2020.

[20] Anna Lubiw, Tillmann Miltzow, and Debajyoti Mondal. The complexity of drawing a graph in a polygonal region. In International Symposium on Graph Drawing and Network Visualization, 2018.

[21] Jiří Matoušek. Intersection graphs of segments and ∃R, 2014. Preprint, https://arxiv.org/abs/1406.2636.

[22] Colin McDiarmid and Tobias Müller. Integer realizations of disk and segment graphs. Journal of Combinatorial Theory, Series B, 103(1):114–143, 2013.

[23] Nimrod Megiddo. On the complexity of polyhedral separability. Discrete & Computational Geometry, 3(4):325–337, 1988.

[24] Nicolai Mnëv. The universality theorems on the classification problem of configuration varieties and convex polytopes varieties. In Oleg Y. Viro, editor, Topology and Geometry – Rohlin Seminar, pages 527–543, 1988.

[25] Jürgen Richter-Gebert and Günter M. Ziegler. Realization spaces of 4-polytopes are universal. Bulletin of the American Mathematical Society, 32(4):403–412, 1995.

[26] Marcus Schaefer. Complexity of some geometric and topological problems. In Proceedings of the 17th International Symposium on Graph Drawing (GD 2009), pages 334–344, 2009.

[27] Yaroslav Shitov. A universality theorem for nonnegative matrix factorizations, 2016. Preprint, https://arxiv.org/abs/1606.09068.

[28] Peter W. Shor. Stretchability of pseudolines is NP-hard. In Peter Gritzmann and Bernd Sturmfels, editors, Applied Geometry and Discrete Mathematics: The Victor Klee Festschrift, volume 4 of DIMACS Series in Discrete Mathematics and Theoretical Computer Science, pages 531–554. American Mathematical Society and Association for Computing Machinery, 1991.

[29] Jiří Šíma. Training a single sigmoidal neuron is hard. Neural Computation, 14(11):2709–2728, 2002.