Stronger Separation of Analog Neuron Hierarchy by Deterministic Context-Free Languages
Jiří Šíma, Institute of Computer Science of the Czech Academy of Sciences, P. O. Box 5, 18207 Prague 8, Czech Republic
Abstract
We analyze the computational power of discrete-time recurrent neural networks (NNs) with the saturated-linear activation function within the Chomsky hierarchy. This model restricted to integer weights coincides with binary-state NNs with the Heaviside activation function, which are equivalent to finite automata (Chomsky level 3) recognizing regular languages (REG), while rational weights make this model Turing-complete even for three analog-state units (Chomsky level 0). For the intermediate model αANN of a binary-state NN that is extended with α ≥ 0 extra analog-state neurons with rational weights, we have established the analog neuron hierarchy 0ANNs ⊆ 1ANNs ⊆ 2ANNs ⊆ 3ANNs. The separation 1ANNs ⊊ 2ANNs has been witnessed by the deterministic context-free language (DCFL) L# = { 0ⁿ1ⁿ | n ≥ 1 } which cannot be recognized by any 1ANN even with real weights, while any DCFL (Chomsky level 2) is accepted by a 2ANN with rational weights. In this paper, we strengthen this separation by showing that any non-regular DCFL cannot be recognized by 1ANNs with real weights, which means (DCFLs \ REG) ⊂ (2ANNs \ 1ANNs), implying 1ANNs ∩ DCFLs = 0ANNs. For this purpose, we have shown that L# is the simplest non-regular DCFL by reducing L# to any language in this class, which is by itself an interesting achievement in computability theory.

Keywords: recurrent neural network, analog neuron hierarchy, deterministic context-free language, Chomsky hierarchy
Email address: [email protected] (Jiří Šíma)
Preprint submitted to Elsevier, February 3, 2021.

1. Analog Neuron Hierarchy

The standard techniques used in artificial neural networks (NNs), such as Hebbian learning, back-propagation, simulated annealing, support vector machines, and deep learning, are of statistical or heuristic nature. NNs, often considered as "black box" solutions, are mainly subject to empirical research whose methodology is based on computer simulations through which the developed heuristics are tested, tuned, and mutually compared on benchmark data. Nevertheless, the development of NN methods has, among others, its own intrinsic limits given by mathematical, computability, or physical laws. By exploring these limits one can understand what is computable in principle or efficiently by NNs, which is a necessary prerequisite for pushing or even overcoming these boundaries in future intelligent technologies. Thus, rigorous mathematical foundations of NNs need to be further developed, which is the main motivation for this study. We explore the computational potential and limits of NNs for general-purpose computations by comparing them with more traditional computational models such as finite or pushdown automata, Chomsky grammars, and Turing machines.

The computational power of discrete-time recurrent NNs with the saturated-linear activation function depends on the descriptive complexity of their weight parameters [6, 7]. NNs with integer weights, corresponding to binary-state (shortly binary) networks which employ the Heaviside activation function (with Boolean outputs 0 or 1), coincide with finite automata (FAs) recognizing regular languages [8, 9, 10, 11, 12, 13]. Rational weights make the analog-state (shortly analog) NNs (with real-valued outputs in the interval [0, 1]) Turing-complete, while real weights can even derive "super-Turing" computational capabilities [6].
Namely, their polynomial-time computations correspond to the nonuniform complexity class P/poly while any input/output mapping (including algorithmically undecidable problems) can be computed within exponential time [15]. Moreover, a proper infinite hierarchy of nonuniform complexity classes between P and P/poly has been established for polynomial-time computations of NNs with increasing Kolmogorov complexity of their real weights.

(The results are partially valid for more general classes of activation functions [1, 2, 3, 4] including the logistic function [5].)

In this paper, we study the intermediate models of binary-state NNs that are extended with α ≥ 0 extra analog-state neurons, so-called αANNs. This study has primarily been motivated by theoretical issues of how the computational power of NNs increases with enlarging analogicity when we change step by step from binary to analog states, or equivalently, from integer to arbitrary rational weights. In particular, the weights are mainly assumed to be just fixed fractions with a finite representation (i.e. a quotient of two integer constants), avoiding real numbers with infinite precision. Hence, the states of the α added analog units can only be rationals, although the number of digits in the representation of analog values may increase (linearly) along a computation. Nevertheless, by bounding the precision of analog states, we would reduce the computational power of NNs to that of finite automata, which could be implemented by binary states. This would not allow the study of analogicity phenomena such as the transition from integer to rational weights in NNs whose functionality (program) is after all encoded in numerical weights.

There is nothing suspicious about the fact that the precision of analog states in αANNs is not limited by a fixed constant in advance. The same is true in conventional abstract models of computation such as pushdown automata or Turing machines with unlimited (potentially infinite) size of stack or tape, respectively, whose limitation would lead to the collapse of the Chomsky hierarchy to finite automata.
Thus, the proposed abstract model of αANNs itself has been intended for measuring the expressive power of a binary-state NN to which analog neurons are added one by one, rather than for solving special-purpose practical tasks or biological modeling. (Nevertheless, we formulate the present lower-bound results for arbitrary real weights, which hold all the more so for rationals.)

In our previous work [22], we have characterized the class of languages accepted by 1ANNs, that is, binary-state NNs with one extra analog unit, in terms of so-called cut languages [23] which are combined in a certain way by usual operations such as complementation, intersection, union, concatenation, Kleene star, reversal, the largest prefix-closed subset, and a letter-to-letter morphism. By using this syntactic characterization of 1ANNs we have derived a sufficient condition when a 1ANN accepts only a regular language (Chomsky level 3), which is based on the quasi-periodicity [23] of some parameters depending on its real weights. This condition defines the subclass QP-1ANNs of so-called quasi-periodic 1ANNs which are computationally equivalent to FAs. For example, the class QP-1ANNs contains the 1ANNs with weights from the smallest field extension Q(β) over the rational numbers Q including a Pisot number β > 1, such that the self-loop weight w of the only analog neuron equals 1/β. For instance, the 1ANNs with arbitrary rational weights except for the self-loop weight w = 1/n for some integer n > 1, or w = 1/ϕ = ϕ − 1 where ϕ is the golden ratio, belong to QP-1ANNs recognizing REG. An example of a QP-1ANN that accepts the regular language (27) is depicted in Figure 2 with parameters (25).

A cut language L_{<c} contains the finite β-expansions with a base β > 1, using real digits from A, that are less than a given real threshold c, that is, the words x₁ . . . xₙ ∈ A* such that Σ_{k=1}^n xₖ β^{−k} < c. It is known that L_{<c} is regular iff the threshold c is quasi-periodic [23].

(For a real base β > 1 and a finite alphabet A of real digits, an infinite β-expansion Σ_{k=1}^∞ xₖ β^{−k} where xₖ ∈ A, is called quasi-periodic if the sequence (Σ_{k=1}^∞ x_{n+k} β^{−k})_{n=0}^∞ contains a constant infinite subsequence. We say that a real number x is quasi-periodic if all its infinite β-expansions x = Σ_{k=1}^∞ xₖ β^{−k} are quasi-periodic.)

(Recall that in algebra, the rational numbers (fractions) form the field Q with the two usual operations, the addition and the multiplication over real numbers. For any real number β ∈ R, the field extension Q(β) ⊂ R is the smallest set containing Q ∪ {β} that is closed under these operations. For example, the golden ratio ϕ = (1 + √5)/2 belongs to Q(√5), while Q(β) = Q for every β ∈ Q.)

(A Pisot number is a real algebraic integer, i.e. a root of some monic polynomial with integer coefficients, greater than 1 such that all its Galois conjugates, the other roots of such a unique monic polynomial with minimal degree, are in absolute value less than 1.)
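The tail sequence used in this definition can be generated mechanically. The following Python sketch (our own illustration, not part of the paper) computes the greedy β-expansion of a rational x over the binary digits A = {0, 1} together with its tails r_n = β r_{n−1} − x_n; since the greedy digit choice is a function of the current tail, a repeated tail value makes the generated expansion eventually periodic (hence quasi-periodic), while pairwise distinct tails rule quasi-periodicity out for this particular expansion.

```python
from fractions import Fraction

def greedy_tails(x, beta, steps):
    """Greedy beta-expansion of x over digits {0, 1}: emit digit 1
    whenever beta times the current tail reaches 1. Returns the digits
    x_1, ..., x_steps and the tails r_0, r_1, ..., r_steps, where
    r_n = beta * r_{n-1} - x_n."""
    r = Fraction(x)
    tails = [r]
    digits = []
    for _ in range(steps):
        r = beta * r
        d = 1 if r >= 1 else 0
        r -= d
        digits.append(d)
        tails.append(r)
    return digits, tails

# Rational base beta = 3/2 > 1 and x = 1/2, chosen only for illustration.
digits, tails = greedy_tails(Fraction(1, 2), Fraction(3, 2), 12)
# All tails distinct so far => no constant subsequence among the first
# computed tails of this greedy expansion.
all_distinct = len(set(tails)) == len(tails)
```

By construction the invariant x = Σ_{k=1}^n xₖ β^{−k} + rₙ β^{−n} holds exactly, which the exact `Fraction` arithmetic preserves.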
On the other hand, we have introduced [22] examples of languages accepted by 1ANNs with rational weights that are not context-free (CFLs), i.e. are above Chomsky level 2, while we have proven that any language accepted by this model online is context-sensitive (CSL) at Chomsky level 1. For example, the 1ANN N(27/8, 1/4) depicted in Figure 2 with parameters (11), accepts the context-sensitive language

L = L(N(27/8, 1/4)) = { x₁ . . . xₙ ∈ {0, 1}* | Σ_{k=1}^n x_{n−k+1} (27/8)^{−k} < 1/4 }  (1)

defined in (24) as the reversal of a cut language, which is not context-free. These results refine the analysis of the computational power of NNs with the weight parameters between integer and rational weights. Namely, the computational power of binary-state networks having integer weights can increase from REG (Chomsky level 3) to a class between CFLs (Chomsky level 2) and CSLs (Chomsky level 1) when an extra analog unit with rational weights is added, while we have formulated a condition under which adding one analog neuron even with real weights does not increase the power of binary-state networks, which defines QP-1ANNs.

Furthermore, we have established the analog neuron hierarchy of classes of languages recognized by binary αANNs with α extra analog units having rational weights, for α = 0, 1, 2, 3, . . ., that is, 0ANNs ⊆ 1ANNs ⊆ 2ANNs ⊆ 3ANNs ⊆ ···, respectively [24]. Note that we use the notation αANNs also for the class of languages accepted by αANNs, which can clearly be distinguished by the context. Obviously, the 0ANNs are purely binary-state NNs equivalent to FAs, which also implies 0ANNs = QP-1ANNs. Hence, 0ANNs ⊊ 1ANNs by the non-context-free language L in (1) accepted by 1ANNs [22]. In contrast, we have proven that the non-regular deterministic context-free language (DCFL)

L# = { 0ⁿ1ⁿ | n ≥ 1 },  (2)

which contains the words of n zeros followed by n ones, cannot be recognized even offline by any 1ANN with arbitrary real weights [24].
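For concreteness, membership in the witness language L# from (2) is decidable with a single counter, which is all a DPDA with a unary stack does on this language; a minimal Python sketch (ours, for illustration only, not part of the 1ANN model):

```python
def in_L_sharp(word: str) -> bool:
    """Decide membership in L# = { 0^n 1^n | n >= 1 } with one counter,
    mimicking a unary-stack DPDA: push on '0', pop on '1'."""
    count = 0
    seen_one = False
    for ch in word:
        if ch == '0':
            if seen_one:          # a '0' after some '1' is fatal
                return False
            count += 1            # push
        elif ch == '1':
            seen_one = True
            count -= 1            # pop
            if count < 0:         # more ones than zeros
                return False
        else:
            return False
    return seen_one and count == 0
```

The counter is exactly the pushdown content that a 1ANN with a single analog unit provably cannot maintain, which is what the lower bound formalizes.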
We thus know that 1ANNs are not Turing-complete.

(In online input/output protocols, the time between reading two consecutive input symbols, as well as the delay in outputting the result after an input has been read, is bounded by a constant, while in offline protocols these time intervals are not bounded.)

Figure 1: The analog neuron hierarchy.

Nevertheless, we have shown that any DCFL included in Chomsky level 2 can be recognized by a 2ANN with two extra analog neurons having rational weights, by simulating a corresponding deterministic pushdown automaton (DPDA) [24]. This provides the separation 1ANNs ⊊ 2ANNs since L# in (2) is not accepted by any 1ANN. In addition, we have proven that any TM can be simulated by a 3ANN having rational weights with a linear-time overhead [24]. It follows that the recursively enumerable languages (RE) at the highest Chomsky level 0 are accepted by 3ANNs with rational weights and thus this model including only three analog neurons is Turing-complete. Since αANNs with rational weights can be simulated by TMs for any α ≥ 0, the analog neuron hierarchy collapses to 3ANNs:

FAs ≡ 0ANNs ⊊ 1ANNs ⊊ 2ANNs ⊆ 3ANNs ≡ 4ANNs ≡ ··· ≡ TMs.

It appears that the analog neuron hierarchy, which is schematically depicted in Figure 1, is only partially comparable to that of Chomsky.

In this paper, we further study the relation between the analog neuron hierarchy and the Chomsky hierarchy. We show that any non-regular DCFL cannot be recognized online by 1ANNs with real weights, which provides the stronger separation

(DCFLs \ REG) ⊂ (2ANNs \ 1ANNs),

implying REG = 0ANNs = QP-1ANNs = 1ANNs ∩ DCFLs. Thus, the class of non-regular DCFLs is contained in 2ANNs with rational weights, having the empty intersection with 1ANNs, as depicted in Figure 1.

In order to prove this lower bound on the computational power of 1ANNs, we have shown that the non-regular language L# in (2) is in some sense the simplest DCFL (a so-called DCFL-simple problem), by reducing L# to any language in DCFLs \ REG [25]. Namely, given any non-regular DCFL L, we can recognize the language L# by a Mealy machine (a deterministic finite-state transducer) that is allowed to call a subroutine for deciding L (an oracle) on its output extended with a few suffixes of constant length. In computability theory, this is a kind of truth-table (Turing) reduction by Mealy machines with an oracle for L. In this paper, we prove that such a reduction can be implemented by an online 1ANN. Thus, if the non-regular DCFL L were accepted by an online 1ANN, then we could recognize L# by a 1ANN, which is a contradiction, implying that L cannot be accepted by any online 1ANN even with real weights.

Note that the definition of DCFL-simple problems, which any language in DCFLs \ REG must admit a reduction from, is by itself an interesting achievement in formal language theory [25].
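The shape of such a reduction can be sketched abstractly. The following Python fragment is a schematic illustration with hypothetical transition tables, not the concrete reduction from [25]: a Mealy machine transduces its input, then asks the oracle for L about its output extended with each of finitely many constant suffixes, and decides membership in L# from the tuple of oracle answers via a fixed Boolean table.

```python
def mealy_oracle_reduction(word, delta, rho, q0, oracle, suffixes, decide):
    """Truth-table reduction by an oracle Mealy machine: run a
    deterministic finite-state transducer (delta = state transition,
    rho = output function), query the oracle for L on the produced
    output extended with each constant suffix, and combine the answers
    by a fixed Boolean table 'decide'."""
    state, out = q0, []
    for symbol in word:
        out.append(rho[(state, symbol)])
        state = delta[(state, symbol)]
    answers = tuple(oracle("".join(out) + s) for s in suffixes)
    return decide[answers]

# Toy instance (hypothetical data, for illustration only): the identity
# transducer with a single state, an oracle for { 0^n 1^n | n >= 1 },
# and the trivial one-query truth table, which just re-decides L#.
def oracle(w):
    n = len(w) // 2
    return n >= 1 and w == "0" * n + "1" * n

delta = {(0, "0"): 0, (0, "1"): 0}
rho = {(0, "0"): "0", (0, "1"): "1"}
decide = {(True,): True, (False,): False}
```

In the paper's setting, the transducer and the constant suffixes depend on the given non-regular DCFL L, and the whole loop above is what Section 4 implements by an online 1ANN.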
A DCFL-simple problem can be reduced to all the non-regular DCFL problems by the truth-table reduction using oracle Mealy machines, which is somewhat methodologically opposite to the usual hardness results in computational complexity theory where all problems in a class are reduced to its hardest problem, such as in NP-completeness proofs. The concept of DCFL-simple problems has been motivated by our analysis of the computational power of 1ANNs and represents its first non-trivial application to proving lower bounds. Our result can thus open a new direction of research in computability theory aiming towards the existence of the simplest problems in traditional complexity classes and their mutual reductions.

The paper is organized as follows. In Section 2, we introduce basic definitions of the language acceptor based on 1ANNs, including an example of a 1ANN that recognizes the reversal of a cut language, which also illustrates its input/output protocol. In Section 3, we prove two technical lemmas concerning the properties of 1ANNs, which are used in Section 4 for the reduction of L# to any non-regular DCFL by a 1ANN, implying that one extra analog neuron even with real weights is not sufficient for recognizing any non-regular DCFL online. Finally, we summarize the results and list some open problems in Section 5.

A preliminary version of this paper [26] contains only a sketch of the proof exploiting the representation of DCFLs by so-called deterministic monotonic restarting automata [27], while the complete argument for L# to be the DCFL-simple problem has eventually been achieved by using DPDAs [25].

2. Neural Language Acceptors with One Analog Unit

We specify the computational model of a discrete-time binary-state recurrent neural network with one extra analog unit (shortly, 1ANN), N, which will be used as a formal language acceptor. The network N consists of s units (neurons), indexed as V = {1, . . . , s}.
All the units in N are assumed to be binary-state (shortly binary) neurons (i.e. perceptrons, threshold gates) except for the last sth neuron which is an analog-state (shortly analog) unit. The neurons are connected into a directed graph representing an architecture of N, in which each edge (i, j) ∈ V × V leading from unit i to j is labeled with a real weight w_{ji} ∈ R. The absence of a connection within the architecture corresponds to a zero weight between the respective neurons, and vice versa. The computational dynamics of N determines for each unit j ∈ V its state (output) y_j^(t) at discrete time instants t = 0, 1, 2, . . . . The states y_j^(t) of the first s − 1 neurons j ∈ Ṽ = V \ {s} are Boolean values 0 or 1, whereas the output y_s^(t) from the analog unit s is a real number from the unit interval I = [0, 1]. This determines the network state y^(t) = (y_1^(t), . . . , y_{s−1}^(t), y_s^(t)) ∈ {0, 1}^{s−1} × I at each discrete time instant t ≥ 0. At the beginning of a computation, N is placed in a predefined initial state y^(0) ∈ {0, 1}^{s−1} × I.

At discrete time instant t ≥ 0, an excitation of any neuron j ∈ V is defined as

ξ_j^(t) = Σ_{i=0}^s w_{ji} y_i^(t),  (3)

including a real bias value w_{j0} ∈ R which, as usual, can be viewed as the weight from a formal constant unit input y_0^(t) ≡ 1 for every t ≥ 0. At the next instant t + 1, all the neurons j ∈ V compute their new outputs y_j^(t+1) in parallel by applying an activation function σ_j : R → I to ξ_j^(t), that is,

y_j^(t+1) = σ_j(ξ_j^(t)) for every j ∈ V.  (4)

For the neurons j ∈ Ṽ with binary states y_j ∈ {0, 1}, the Heaviside activation function σ_j(ξ) = H(ξ) is used where

H(ξ) = 1 for ξ ≥ 0, and 0 for ξ < 0,  (5)

while the analog unit s ∈ V with real output y_s ∈ I employs the saturated-linear function σ_s(ξ) = σ(ξ) where

σ(ξ) = 1 for ξ ≥ 1; ξ for 0 < ξ < 1; and 0 for ξ ≤ 0.  (6)

In this way, the new network state y^(t+1) ∈ {0, 1}^{s−1} × I is determined at time t + 1.

The computational power of NNs has been studied analogously to the traditional models of computation [7] so that the network is exploited as an acceptor of a formal language L ⊆ Σ* over a finite alphabet Σ = {λ₁, . . . , λ_q} composed of q letters (symbols). For the finite 1ANN N, we use the following online input/output protocol employing its special binary neurons X ⊂ Ṽ and nxt, out ∈ Ṽ. An input word (string) x = x₁ . . . xₙ ∈ Σⁿ of arbitrary length n ≥ 0 is sequentially presented to the network, symbol after symbol, via the first q < s so-called input neurons X = {1, . . . , q} ⊂ Ṽ, at the time instants 0 < τ₁ < τ₂ < ··· < τₙ after being queried by N. The neuron nxt ∈ Ṽ is used by N to prompt a user to enter the next input symbol. Thus, once the prefix x₁, . . . , x_{k−1} of x for 1 ≤ k ≤ n has been read, the next input symbol xₖ ∈ Σ is presented to N at the time instant τₖ that is one computational step after N activates the neuron nxt ∈ Ṽ. This means that N signals

y_{nxt}^{(t−1)} = 1 iff t = τₖ, for k = 1, . . . , n.  (7)

We employ the popular one-hot encoding of the alphabet Σ where each letter λᵢ ∈ Σ is represented by one input neuron i ∈ X which is activated when the symbol λᵢ is being read while, at the same time, the remaining input neurons j ∈ X \ {i} do not fire. Namely, the states of the input neurons i ∈ X, which represent a current input symbol xₖ ∈ Σ at the time instant τₖ, are thus externally set as

y_i^{(τₖ)} = 1 if xₖ = λᵢ, and 0 otherwise, for i ∈ X and k = 1, . . . , n.  (8)

At the same time, N carries out its computation deciding about each prefix of the input word x whether it belongs to L, which is indicated by the output neuron out ∈ Ṽ when the next input symbol is presented, which is one step after the neuron nxt is active according to (7):

y_{out}^{(τ_{k+1})} = 1 if x₁ . . . xₖ ∈ L, and 0 if x₁ . . . xₖ ∉ L, for k = 0, . . . , n,  (9)

where τ_{n+1} > τₙ is the time instant when the input word x is decided (e.g. formally define x_{n+1} to be any symbol from Σ to ensure the consistency with the input protocol (8) for k = n + 1). For instance, y_{out}^{(τ₁)} = 1 iff the empty word ε belongs to L. We assume the online protocol where τ_{k+1} − τₖ ≤ δ for every k = 0, . . . , n (formally τ₀ = 0) is bounded by some integer constant δ > 0, which ensures that N halts on every input word x ∈ Σ*. We say that a language L ⊆ Σ* is accepted (recognized) by the 1ANN N, which is denoted as L = L(N), if for any input word x ∈ Σ*, N accepts x iff x ∈ L.

Example 1
We illustrate the definition of the 1ANN language acceptor and its input/output protocol on a simple network N = N(β, c) with two real parameters β > 1 and c. This 1ANN is used for recognizing a language L(N) ⊆ {0, 1}* over the binary alphabet Σ = {λ₁, λ₂} including q = 2 binary digits λ₁ = 0 and λ₂ = 1. The network N is composed of s = 8 neurons, that is, V = {1, . . . , 8}, where the last neuron s = 8 ∈ V is the analog unit whereas Ṽ = V \ {8} = {1, . . . , 7} contains the remaining binary neurons, including the input neurons X = {1, 2} ⊂ Ṽ employing the one-hot encoding of the binary alphabet Σ, and the neurons nxt = 3 ∈ Ṽ, out = 7 ∈ Ṽ which implement the input/output protocol (7)–(9).

The architecture of N(β, c) is depicted in Figure 2 where the directed edges connecting neurons are labeled with the respective weights w_{82} = β^{−1}/ν = (β − 1)/β, w_{88} = β^{−1/3}, w_{4,nxt} = w_{54} = w_{nxt,5} = w_{65} = w_{68} = w_{out,nxt} = 1, and w_{out,6} = −1, while the edges drawn without the originating formal unit 0 correspond to the biases w_{60} = −1 − c/ν = −1 − (β − 1)c and w_{nxt,0} = w_{40} = w_{50} = w_{out,0} = −1, where

ν = Σ_{k=1}^∞ β^{−k} = 1/(β − 1) > 0.  (10)

We will first choose the parameters β, c of N(β, c) so that the language L(N(β, c)) is not a CFL, while we will later reduce its power to a regular language for other parameters, that is,

β = (3/2)³ = 27/8 > c = 1/4,  (11)

which determine the parameterized weights and bias of N,

w_{82} = 19/27, w_{88} = 2/3, w_{60} = −51/32.  (12)

Figure 2: The 1ANN language acceptor N(β, c).

Suppose that the input word x = 101 ∈ {0, 1}³ of length n = 3 is externally presented to N where x₁ = 1, x₂ = 0, x₃ = 1, and formally let x₄ = 0. Table 1 shows the sequential schedule of presenting the symbols x₁, x₂, x₃ of x to N through the input neurons X = {1, 2} ⊂ Ṽ at the time instants τ₁ = 1, τ₂ = 4, τ₃ = 7, respectively, by using the one-hot encoding, that is, y_1^(1) = 0, y_2^(1) = 1, y_1^(4) = 1, y_2^(4) = 0, y_1^(7) = 0, y_2^(7) = 1, according to (8), which is indicated in boldface. Each input symbol is queried by the neuron nxt ∈ Ṽ one step beforehand according to (7). Thus, the neuron nxt is the only initially active unit, that is, y_{nxt}^(0) = 1, and this activity propagates repeatedly around the oriented cycle composed of the three neurons nxt(= 3), 4, 5 ∈ Ṽ through the edges with the unit weights, which ensures that the neuron nxt fires only at the time instants τₖ − 1 = 3(k − 1) for k > 0, when the next input symbol xₖ is prompted, whereas

y_5^{(3k−1)} = 1 for every k > 0.  (13)

[Table 1: The rejecting computation by the 1ANN N(27/8, 1/4) on the input 101 — for each time instant t = 0, 1, . . . , 10, the table lists the states y_1^(t), y_2^(t), y_{nxt}^(t), y_4^(t), y_5^(t), y_6^(t), y_{out}^(t), y_8^(t), together with the results of recognition ε ∈ L(N), 1 ∉ L(N), 10 ∈ L(N), 101 ∉ L(N), reported at the time instants 1, 4, 7, 10, respectively; the rows for t = 2, 5, 8 read (0, 0, 0, 0, 1, 0, 0).]

In addition, the units 5 and nxt from this cycle synchronize the incident neurons 6 ∈ Ṽ and out = 7 ∈ Ṽ, respectively, so that the unit 6 can be activated only at the time instants t = 3k for k > 0, by (13), while the output neuron out can fire only at the time instants τ_{k+1} = 3k + 1 for k ≥ 0. Thus, N decides about the prefixes of x, namely the empty string ε, 1, 10, and 101, at the time instants τ₁ = 1, τ₂ = 4, τ₃ = 7, τ₄ = 10, respectively, according to (9).

According to (3), (4), and (6), we obtain the recurrence equation for the analog state of unit 8 ∈ V,

y_8^(t) = ξ_8^{(t−1)} = w_{82} y_2^{(t−1)} + w_{88} y_8^{(t−1)} = (β^{−1}/ν) y_2^{(t−1)} + β^{−1/3} y_8^{(t−1)}  (14)

at time instant t ≥ 1, where y_8^(t) = ξ_8^{(t−1)} ∈ I by (10). Hence, the input symbols, which determine y_2^{(3k+1)} = x_{k+1} at the time instants τ_{k+1} = 3k + 1 for every k ≥ 0 by the one-hot encoding, are stored in this analog state as

y_8^(1) = y_8^(0) = 0  (15)
y_8^(2) = (β^{−1}/ν) x₁  (16)
y_8^(4) = β^{−1/3} y_8^(3) = β^{−2/3} y_8^(2) = (β^{−5/3}/ν) x₁  (17)
y_8^(5) = (1/ν)(x₁ β^{−2} + x₂ β^{−1})  (18)

etc., which generalizes to

y_8^{(3k−1)} = (1/ν) Σ_{i=1}^k x_{k−i+1} β^{−i}.  (19)

It follows that the neuron 6 ∈ Ṽ, activating only at the time instants t = 3k for k > 0, satisfies y_6^{(3k)} = 1 iff

ξ_6^{(3k−1)} = w_{60} + w_{65} y_5^{(3k−1)} + w_{68} y_8^{(3k−1)} = −1 − c/ν + 1 + (1/ν) Σ_{i=1}^k x_{k−i+1} β^{−i} ≥ 0,  (20)

that is,

y_6^{(3k)} = 1 iff Σ_{i=1}^k x_{k−i+1} β^{−i} ≥ c.  (21)

At the time instant t = τ_{k+1} = 3k + 1, the output neuron out ∈ Ṽ computes the negation of y_6^{(3k)}, and hence,

y_{out}^{(τ_{k+1})} = 1 iff Σ_{i=1}^k x_{k−i+1} β^{−i} < c.  (22)

It follows from (22) that the neural language acceptor N(β, c) accepts the reversal of the cut language L_{<c} = { x₁ . . . xₙ ∈ {0, 1}* | Σ_{k=1}^n xₖ β^{−k} < c }, that is, L(N(β, c)) = L^R_{<c}.
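The acceptance test (22) can be replayed numerically. The following Python sketch (ours; it implements only the analog-state recurrence and the threshold test of (22), not the full eight-neuron network) checks the decisions reported in Table 1 for the input 101, taking the concrete parameters β = 27/8 and c = 1/4 of (11):

```python
from fractions import Fraction

def accepts_prefixes(word, beta, c):
    """For each prefix x1...xk of word, evaluate the test (22):
    out = 1 iff sum_{i=1}^{k} x_{k-i+1} * beta^{-i} < c.
    The sum is maintained incrementally as in (14)-(19): reading a new
    bit x multiplies the stored value by beta^{-1} and adds x*beta^{-1},
    which is how the analog neuron accumulates the reversed input."""
    s = Fraction(0)
    results = [s < c]                 # the empty prefix eps
    for ch in word:
        s = (s + int(ch)) / beta      # s := beta^{-1} * (s + x_k)
        results.append(s < c)
    return results

# beta = 27/8, c = 1/4 as in (11); prefixes eps, 1, 10, 101
res = accepts_prefixes("101", Fraction(27, 8), Fraction(1, 4))
```

With exact rational arithmetic the four answers reproduce the accept/reject pattern of Table 1: the prefixes ε and 10 are accepted, while 1 and 101 are rejected.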
3. Technical Properties of 1ANNs
In this section, we will prove two lemmas about technical properties of 1ANNs that will be used in Section 4 for implementing the reduction of L# to any non-regular DCFL by a 1ANN. Namely, Lemma 1 shows that for every constant T > 0, the state domain I of the only analog unit of a 1ANN N can be partitioned into finitely many subintervals so that the binary states during T consecutive computational steps by N are invariant to any initial analog state within each subinterval of this partition. Thus, one can extrapolate any computation by N for the next T computational steps only on the basis of the information to which subinterval the initial analog state belongs. Lemma 2 then shows that such an extrapolation can be evaluated by a binary neural network, which ensures that the class of 1ANNs is in fact closed under the (right) quotient with a word.

(According to the definition of a quasi-periodic number, it suffices to prove that for some β-expansion Σ_{k=1}^∞ xₖ (27/8)^{−k} = 1/4 with xₖ ∈ {0, 1}, all the numbers rₙ = Σ_{k=1}^∞ x_{n+k} (27/8)^{−k} for n ≥ 0 are distinct. Clearly, r₀ = 1/4 and r_{n+1} = (27/8) rₙ − x_{n+1} for every n ≥ 0. One can show by induction on n that rₙ = cₙ/2^{3n+2} for some odd integer cₙ, which provides the proof.)

Lemma 1. Let N be a 1ANN of size s neurons, which can be exploited as an acceptor of languages over an alphabet Σ for different initial states of N. Then for every integer T > 0, there exists a partition I₁ ∪ I₂ ∪ ··· ∪ I_p = I of the unit interval I = [0, 1] into p = O(s 2^{sT}) intervals such that for any initial state y^(0) ∈ {0, 1}^{s−1} × I and any input word u ∈ Σ* of length n = |u| that meets τ_{n+1} ≤ T according to the input/output protocol (7)–(9) for N, the binary states ỹ^(t) = (y_1^(t), . . . , y_{s−1}^(t)) ∈ {0, 1}^{s−1} at any time instant t ∈ {0, 1, . . . , τ_{n+1}} are uniquely determined only by the initial binary states ỹ^(0) ∈ {0, 1}^{s−1} and the index r ∈ {1, . . . , p} such that the initial state of the analog unit s ∈ V satisfies y_s^(0) ∈ I_r.

Proof.
Let
T > y (0) ∈ { , } s − × I be an initial state of N , and u ∈ Σ ∗ of length n = | u | be an input word that meets τ n +1 ≤ T according to the input/output protocol (7)–(9) for N . Assume that0 < ξ ( t − s < t = 1 , . . . , τ − τ such that 0 ≤ τ < τ n +1 , which implies y ( t ) s = ξ ( t − s for every t = 1 , . . . , τ −
1, according to (4) and (6), and hence, for τ > ξ ( τ − s = s − (cid:88) i =0 w si y ( τ − i + w ss y ( τ − s = s − (cid:88) i =0 w si y ( τ − i + w ss (cid:32) s − (cid:88) i =0 w si y ( τ − i + w ss y ( τ − s (cid:33) . . . = τ − (cid:88) t =0 (cid:32) s − (cid:88) i =0 w si y ( t ) i (cid:33) w τ − t − ss + w τss y (0) s . (29) The (right) quotient of language L with a word u is the language L/ u = { x | x · u ∈ L } . ξ ( τ − s = s − (cid:88) i =0 w si y ( τ − i , (30)when w ss = 0.First assume 0 < ξ ( τ − s < τ >
0, which implies y ( τ ) s = ξ ( τ − s = τ − (cid:88) t =0 (cid:32) s − (cid:88) i =0 w si y ( t ) i (cid:33) w τ − t − ss + w τss y (0) s (31)according to (4), (6), and (29). For any binary neuron j ∈ ˜ V , we have y ( τ +1) j = 1 iff ξ ( τ ) j = s − (cid:88) i =0 w ji y ( τ ) i + w js y ( τ ) s ≥ y ( τ +1) j = 1 iff s − (cid:88) i =0 w ji y ( τ ) i + w js τ − (cid:88) t =0 (cid:32) s − (cid:88) i =0 w si y ( t ) i (cid:33) w τ − t − ss + w js w τss y (0) s ≥ , (33)which can be rewritten for w ss (cid:54) = 0 and w js (cid:54) = 0 as y ( τ +1) j = 1 iff (34) τ − (cid:88) t =0 (cid:32) − s − (cid:88) i =0 w si w ss y ( t ) i (cid:33) w − tss − s − (cid:88) i =0 w ji w js y ( τ ) i w − τss (cid:40) ≥ y (0) s if w js w τss < ≤ y (0) s if w js w τss > . (35)For w ss = 0 and τ >
0, condition (33) reduces to y ( τ +1) j = 1 iff s − (cid:88) i =0 w ji y ( τ ) i + w js (cid:32) s − (cid:88) i =0 w si y ( τ − i (cid:33) ≥ y ( τ +1) j depends in fact only on the binary states ˜ y ( τ ) and ˜ y ( τ − where ˜ y ( t ) = (cid:16) y ( t )1 , . . . , y ( t ) s − (cid:17) ∈ { , } s − . Similarly, for w js = 0,we have y ( τ +1) j = 1 iff s − (cid:88) i =0 w ji y ( τ ) i ≥ y ( τ +1) j depends only on the binary states ˜ y ( τ ) .For the case when either ξ ( τ − s ≤ ξ ( τ − s ≥ w ss (cid:54) = 0 and τ > y ( τ ) s = 0 iff τ − (cid:88) t =0 (cid:32) − s − (cid:88) i =0 w si w ss y ( t ) i (cid:33) w − tss (cid:40) ≥ y (0) s if w τss > ≤ y (0) s if w τss < y ( τ ) s = 1 iff 1 w τss + τ − (cid:88) t =0 (cid:32) − s − (cid:88) i =0 w si w ss y ( t ) i (cid:33) w − tss (cid:40) ≥ y (0) s if w τss < ≤ y (0) s if w τss > , (39)respectively, according to (4), (6), and (29).Altogether, for any (cid:96) ∈ V such that w (cid:96)s (cid:54) = 0, and ˜ y = ( y , . . . , y s − ) ∈{ , } s − , we denote ζ (cid:96) (˜ y ) = − s − (cid:88) i =0 w (cid:96)i w (cid:96)s y i , (40)which reduces conditions (35), (38), (39) with w ss (cid:54) = 0 to y ( τ +1) j = 1 iff z j (cid:0) ˜ y (0) , ˜ y (1) , . . . , ˜ y ( τ ) (cid:1) (cid:40) ≥ y (0) s if w js w τss < ≤ y (0) s if w js w τss > j ∈ ˜ V such that w js (cid:54) = 0, y ( τ ) s = 0 iff z s (cid:0) ˜ y (0) , ˜ y (1) , . . . , ˜ y ( τ − (cid:1) (cid:40) ≥ y (0) s if w τss > ≤ y (0) s if w τss < y ( τ ) s = 1 iff 1 w τss + z s (cid:0) ˜ y (0) , ˜ y (1) , . . . , ˜ y ( τ − (cid:1) (cid:40) ≥ y (0) s if w τss < ≤ y (0) s if w τss > , (43)for τ >
0, respectively, where z (cid:96) (˜ y , ˜ y , . . . , ˜ y τ ) = τ − (cid:88) t =0 ζ s (˜ y t ) w − tss + ζ (cid:96) (˜ y m ) w − τss . (44)We define the set Z = (cid:0) Z (cid:48) ∩ ( I × {− , } ) (cid:1) ∪ (cid:8) (0 , − , (0 , , (1 , − , (1 , (cid:9) = (cid:8) ( a , b ) , ( a , b ) , . . . , ( a p +1 , b p +1 ) (cid:9) ⊂ I × {− , } (45)17here Z (cid:48) = (cid:0) z j (˜ y , . . . , ˜ y τ ) , − sgn ( w js w τss ) (cid:1) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) j ∈ ˜ V s.t. w js (cid:54) = 0˜ y . . . , ˜ y τ ∈ { , } s − ≤ τ < T (cid:83) (cid:26)(cid:0) z s (˜ y , . . . , ˜ y τ − ) , sgn ( w τss ) (cid:1) (cid:12)(cid:12)(cid:12)(cid:12) ˜ y . . . , ˜ y τ − ∈ { , } s − < τ < T (cid:27) (46) (cid:83) (cid:26)(cid:18) w ss + z s (˜ y , . . . , ˜ y τ − ) , − sgn ( w τss ) (cid:19) (cid:12)(cid:12)(cid:12)(cid:12) ˜ y . . . , ˜ y τ − ∈ { , } s − < τ < T (cid:27) and sgn : R → {− , , } is the signum function. The set Z includes the p + 1 pairs ( a r , b r ) ∈ I × {− , } for r = 1 , . . . , p + 1, which encode all thepossible closed half-lines with the finite endpoints a r ∈ I = [0 , a r , + ∞ ) if b r = −
1, or ( −∞ , a r ] if b r = 1, that may occur in conditions(41)–(43) determining the binary outputs y ( τ +1) j , y ( τ ) s ∈ { , } for the analogstate y (0) s ∈ I . Clearly, the number | Z | = p + 1 of these half-lines can bebounded as p + 1 ≤ ( s − (cid:16) s − + (cid:0) s − (cid:1) + · · · + (cid:0) s − (cid:1) T (cid:17) +2 (cid:16)(cid:0) s − (cid:1) + · · · + (cid:0) s − (cid:1) T − (cid:17) + 4 = O (cid:0) s sT (cid:1) . (47)We also assume that the elements of Z are lexicographically sorted as( a , b ) < ( a , b ) < · · · < ( a p +1 , b p +1 ) (48)which is used in the definition of the partition of the unit interval I = [0 ,
1] = I ∪ I ∪ . . . ∪ I p into p intervals: I r = [ a r , a r +1 ) if b r = − b r +1 = − a r , a r +1 ] if b r = − b r +1 = 1( a r , a r +1 ) if b r = 1 & b r +1 = − a r , a r +1 ] if b r = 1 & b r +1 = 1 for r = 1 , . . . , p . (49)Note that if a r = a r +1 for some r ∈ { , . . . , p } , then we know − b r < b r +1 = 1 due to Z is lexicographically sorted, which produces thedegenerate interval I r = [ a r , a r ]. Thus, I = [0 ,
0] and I p = [1 ,
1] because(0 , − , (0 , , (1 , − , (1 , ∈ Z according to (45).We will show that for any initial binary states ˜ y (0) ∈ { , } s − , the binaryoutput y ( τ ) j ∈ { , } from any neuron j ∈ ˜ V after the next τ computational18teps of N where 0 ≤ τ ≤ τ n +1 ≤ T , is the same for all initial analogvalues y (0) s within the whole interval I r , which means ˜ y ( τ ) depends only on˜ y (0) and r ∈ { , . . . , p } such that y (0) s ∈ I r . We proceed by induction on τ = 0 , . . . , τ n +1 satisfying (28). The base case is trivial since ˜ y (0) does notdepend on y (0) s at all. Thus assume in the induction step that the statementholds for ˜ y (0) , ˜ y (1) , . . . , ˜ y ( τ ) that meet (28), where 0 ≤ τ < τ n +1 .Consider first the case when either τ = 0 or 0 < ξ ( τ − s < τ > τ replacedby τ + 1 in the next inductive step. Further assume w ss (cid:54) = 0 and let j ∈ ˜ V be any binary neuron. For w js = 0, the state y ( τ +1) j is clearly determinedonly by ˜ y ( τ ) according to (37). For w js (cid:54) = 0, the binary state y ( τ +1) j ∈ { , } depends on whether the initial analog output y (0) s ∈ I lies on the corre-sponding half-line from Z with the endpoint z j (˜ y (0) , ˜ y (1) , . . . , ˜ y ( τ ) ), accordingto (41), which holds within the whole interval I r (cid:51) y (0) s , since the endpoints z j (˜ y , ˜ y , . . . , ˜ y τ ) of all the possible half-lines in condition (41) for 0 ≤ τ < T ,are taken into account in the definition (45), (47) determining the partition(49) of the analog state domain I . Thus, ˜ y ( τ +1) depends only on ˜ y ( τ ) and I r containing y (0) s , and hence, only on ˜ y (0) and r ∈ { , . . . , p } such that y (0) s ∈ I r ,by induction hypothesis. For w ss = 0, we know that ˜ y ( τ +1) depends only on˜ y ( τ ) and ˜ y ( τ − according to (36), which proves the assertion for τ > τ = 0 the argument is the same as for w ss (cid:54) = 0since condition (41) makes still sense for τ = 0. 
This completes the induction step for τ = 0 or 0 < ξ_s^(τ−1) < 1. Consider now the remaining case when τ > 0 and either ξ_s^(τ−1) ≤ 0 or ξ_s^(τ−1) ≥ 1. For τ > 0, we know that the analog output y_s^(τ) ∈ {0,1} is, in fact, binary, satisfying (42) or (43) when w_ss ≠ 0, respectively, which means that y_s^(0) ∈ I lies on the corresponding half-line from Z with the endpoint z_s(ỹ^(0), ỹ^(1), …, ỹ^(τ−1)). This holds within the whole interval I_r ∋ y_s^(0), since the endpoints z_s(ỹ^(0), ỹ^(1), …, ỹ^(τ−1)) of all the possible half-lines in conditions (42) and (43) for 0 < τ < T are taken into account in the definition (45), (47) determining the partition (49). For w_ss = 0, the state y_s^(τ) ∈ {0,1} depends only on ỹ^(τ−1) according to (30). Thus, ỹ^(τ+1) is determined by the binary state y_s^(τ) ∈ {0,1} that is guaranteed for the whole interval I_r containing y_s^(0), and hence, ỹ^(τ+1) depends only on ỹ^(0) and r ∈ {1, …, p} such that y_s^(0) ∈ I_r, by the induction hypothesis. In addition, the same holds for the subsequent binary states ỹ^(τ+2), ỹ^(τ+3), …, ỹ^(τ_{n+1}), which are also determined by the binary state y_s^(τ) ∈ {0,1} at the time instant τ, which completes the proof of Lemma 1. □

Lemma 2.
Let N be a 1ANN which recognizes the language L = L(N) ⊆ Σ* over an alphabet Σ by using the online input/output protocol (7)–(9), satisfying τ_{k+1} − τ_k ≤ δ for every k ≥ 0 and some integer constant δ > 0. Let u_1, u_2 ∈ Σ⁺ be two nonempty strings which define the (right) quotients L_1 = L/u_1 and L_2 = L/(u_2·u_1) of L with u_1 and u_2·u_1, respectively, where L/u = {x ∈ Σ* | x·u ∈ L}. Then there exists a 1ANN N′ that accepts L(N′) = L_2 \ L_1, respectively L(N′) = L_1 \ L_2, with the delay of 3 computational steps, that is, the output protocol (9) is modified for N′ as y_out′^(τ_{k+1}+3) = 1 iff x_1 … x_k ∈ L(N′), where out′ ∈ Ṽ′ is the binary output neuron of N′.

Proof.
We will construct the 1ANN N′ such that L(N′) = L_2 \ L_1, respectively L(N′) = L_1 \ L_2, for the delayed output protocol, which contains N with s neurons as its subnetwork, including the analog unit s ∈ V shared by N′, that is, V ⊂ V′ = Ṽ′ ∪ {s} for the corresponding sets of (binary) neurons. The architecture of N′ is schematically depicted in Figure 3. Let I_1 ∪ I_2 ∪ ⋯ ∪ I_p = I be the partition of the state domain I = [0, 1] of the analog unit s ∈ V in N into p intervals according to Lemma 1 for T = δ·(|u_1 u_2| + 1). We encode these intervals by the p + 1 pairs (a_r, b_r) ∈ I × {−1, 1} for r = 1, …, p + 1, according to (49), where a_r ∈ I is the left endpoint of I_r and b_r = 1 if I_r is left-open, while b_r = −1 if I_r is left-closed, which are lexicographically sorted according to (48).

For each pair (a_r, b_r) where r ∈ {1, …, p + 1}, we introduce one binary neuron α_r ∈ Ṽ′ in N′ to which the analog unit s ∈ V is connected so that

  y_{α_r}^(t+1) = 1  iff  y_s^(t) ≥ a_r for b_r = −1,  or  y_s^(t) ≤ a_r for b_r = 1,   (50)

iff b_r·a_r − b_r·y_s^(t) ≥ 0, for any time instant t ≥ 0. According to (3)–(5), the bias and the corresponding weight of α_r ∈ Ṽ′ from s are thus defined as w′_{α_r,0} = b_r·a_r and w′_{α_r,s} = −b_r, respectively (see Figure 3). Clearly, the binary states y_α^(t+1) = (y_{α_1}^(t+1), …, y_{α_{p+1}}^(t+1)) ∈ {0,1}^(p+1) of the neurons in α = {α_1, …, α_{p+1}} ⊂ Ṽ′ at time t + 1 determine uniquely the index r ∈ {1, …, p} such that y_s^(t) ∈ I_r. In addition, for the synchronization purpose, we introduce the set β = {β_1, …, β_{s−1}} ⊂ Ṽ′ of s − 1 binary neurons in N′ that, at the time instant t + 1, copy the binary states ỹ^(t) = (y_1^(t), …, y_{s−1}^(t)) ∈ {0,1}^(s−1) of N from the time instant t, which means y_β^(t+1) = (y_{β_1}^(t+1), …, y_{β_{s−1}}^(t+1)) = ỹ^(t). This can be implemented by the biases w′_{β_i,0} = −1 and the weights w′_{β_i,i} = 1 for every i = 1, …, s − 1, according to (3)–(5).
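As a sanity check of (50), the following Python sketch (with hypothetical endpoint pairs; the helper names are ours) simulates the Heaviside neurons α_r with bias b_r·a_r and weight −b_r, and shows how their p + 1 binary states jointly recover the unique index r with y_s ∈ I_r:

```python
# Sketch of the neurons alpha_r from (50): each has bias b_r*a_r and weight
# -b_r from the analog unit s; with the Heaviside activation it fires iff
# b_r*a_r - b_r*y_s >= 0.  The endpoint pairs below are a hypothetical
# example of a lexicographically sorted list as in (48).

def heaviside(xi):
    return 1 if xi >= 0 else 0

def alpha_states(pairs, y_s):
    # y_alpha[r] = 1 iff y_s >= a_r (b_r = -1), or y_s <= a_r (b_r = 1)
    return [heaviside(b * a - b * y_s) for (a, b) in pairs]

def interval_index(pairs, y_alpha):
    """Recover the unique r with y_s in I_r from the p+1 binary states."""
    for r in range(len(pairs) - 1):
        (_, b), (_, b2) = pairs[r], pairs[r + 1]
        left_ok = y_alpha[r] == (1 if b == -1 else 0)       # left endpoint test
        right_ok = y_alpha[r + 1] == (1 if b2 == 1 else 0)  # right endpoint test
        if left_ok and right_ok:
            return r + 1  # 1-based index of I_r, as in the text
    raise ValueError("not a partition")

# Hypothetical (a_r, b_r): I_1 = [0,0], I_2 = (0,0.5), I_3 = [0.5,1), I_4 = [1,1]
pairs = [(0, -1), (0, 1), (0.5, -1), (1, -1), (1, 1)]
expected = {0.0: 1, 0.2: 2, 0.5: 3, 0.7: 3, 1.0: 4}
for y, r in expected.items():
    assert interval_index(pairs, alpha_states(pairs, y)) == r
```

The decoding loop plays the role that the two-layered network N_f takes over below; here it only demonstrates that the firing pattern of α determines r uniquely.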
Figure 3: The 1ANN N′ that, with the delay of 3 steps, accepts L(N′) = L_2 \ L_1, respectively L(N′) = L_1 \ L_2, where L_1 = L(N)/u_1 and L_2 = L(N)/(u_2·u_1).

For any input word x ∈ Σ* of length n = |x|, let t ≥ 0 be a time instant at which x has been read but not yet decided by N, that is, τ_n ≤ t < τ_{n+1} according to the input protocol (7)–(8). According to Lemma 1, for the state y^(t) ∈ {0,1}^(s−1) × I that is considered as an initial state of N and for any nonempty suffix string u ∈ Σ⁺ added to x such that δ·(|u| + 1) ≤ T, which is presented to N as an input since the time instant t, the binary states ỹ^(t+τ) = (y_1^(t+τ), …, y_{s−1}^(t+τ)) ∈ {0,1}^(s−1) at any time instant t + τ ≥ t of the ongoing computation of N over u are uniquely determined by the binary states y_β^(t+1) = ỹ^(t) = (y_1^(t), …, y_{s−1}^(t)) ∈ {0,1}^(s−1) of N and y_α^(t+1) ∈ {0,1}^(p+1), since y_α^(t+1) is unique for I_r ∋ y_s^(t). In particular, the binary state y_out^(τ_{n+|u|}) ∈ {0,1} of the output neuron out ∈ V in N after the suffix u has been read, where t < τ_{n+|u|} ≤ t + T, is uniquely determined by the binary states y_α^(t+1) and y_β^(t+1), according to the output protocol (9).

In other words, there is a Boolean function f_u : {0,1}^(p+s) → {0,1} such that f_u(y_α^(t+1), y_β^(t+1)) = 1 iff x·u ∈ L(N) iff x ∈ L/u. We define the Boolean function f : {0,1}^(p+s) → {0,1} as the conjunction f = ¬f_{u_1} ∧ f_{u_2·u_1}, where ¬ denotes the negation, or f = f_{u_1} ∧ ¬f_{u_2·u_1}, which satisfies f(y_α^(t+1), y_β^(t+1)) = 1 iff x ∈ L_2 \ L_1 or x ∈ L_1 \ L_2, respectively. The Boolean function f can be computed by a binary-state two-layered neural network N_f that implements e.g. the disjunctive normal form of f. As depicted in Figure 3, the network N_f is integrated into N′ so that the neurons α ∪ β ⊂ Ṽ′ create the input layer to N_f, while the output of N_f represents the output neuron out′ ∈ Ṽ′ of N′, which thus produces y_out′^(t+3) = f(y_α^(t+1), y_β^(t+1)). Hence, N′ recognizes L(N′) = L_2 \ L_1, respectively L(N′) = L_1 \ L_2, with the delay of 3 computational steps, which completes the proof of Lemma 2. □
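The two-layered network N_f computing f from its disjunctive normal form can be sketched with threshold gates. In the following minimal Python illustration, the function f below is a hypothetical stand-in for ¬f_{u_1} ∧ f_{u_2·u_1}; the construction builds one AND-gate (threshold neuron) per satisfying assignment and one OR-gate as the output:

```python
# Sketch of a binary-state two-layer threshold network computing a Boolean
# function f from its DNF, as used for N_f.  The function f below is a
# hypothetical example standing in for the conjunction defined in the proof.
from itertools import product

def dnf_network(f, k):
    """Return a two-layer threshold circuit computing f: {0,1}^k -> {0,1}."""
    terms = [t for t in product((0, 1), repeat=k) if f(t)]  # minterms of f

    def network(x):
        # Layer 1: the AND-gate for minterm t has weight +1 where t_i = 1,
        # weight -1 where t_i = 0, and threshold sum(t); it fires iff x = t.
        hidden = [
            1 if sum((1 if t_i else -1) * x_i
                     for t_i, x_i in zip(t, x)) >= sum(t) else 0
            for t in terms
        ]
        # Layer 2: the OR-gate fires iff at least one hidden neuron fires.
        return 1 if sum(hidden) >= 1 else 0

    return network

f = lambda x: (not x[0]) and x[1]   # hypothetical stand-in for the conjunction
net = dnf_network(f, 2)
assert all(net(x) == (1 if f(x) else 0) for x in product((0, 1), repeat=2))
```

In the proof, the inputs of N_f are the p + s binary states of the neurons α ∪ β, and the OR-gate is the delayed output neuron out′.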
4. Separation of 1ANNs by DCFLs
In this section, we will show the main result that no non-regular DCFL can be recognized online by a binary-state 1ANN with one extra analog unit, which gives the stronger separation (DCFLs \ REG) ⊂ (2ANNs \ 1ANNs), or equivalently, 1ANNs ∩ DCFLs = 0ANNs = REG. The class of non-regular DCFLs is thus contained in 2ANNs with rational weights and has the empty intersection with 1ANNs, as depicted in Figure 1. For the proof, we will exploit the following fact that at least one DCFL cannot be recognized by any 1ANN, which has been shown in our previous work:

Theorem 1 [24, Theorem 1]. The non-regular deterministic context-free language L_# = {0^n 1^n | n ≥ 1} ⊂ {0,1}* over the binary alphabet cannot be recognized by any 1ANN with one extra analog unit having real weights.

In order to generalize Theorem 1 to all non-regular DCFLs, we have shown that L_# is in some sense the simplest DCFL, which is contained in every non-regular DCFL, as formalized in the following Theorem 2.

Theorem 2 [25, Theorem 1]. Let L ⊆ Σ* be a non-regular deterministic context-free language over an alphabet Σ. Then there exist nonempty words v_1, v_2, v_3, v_4, v_5 ∈ Σ⁺ and languages L̂, L̂′ ∈ {L, co-L}, where co-L = Σ* \ L, such that for every m ≥ 1,

  v_1 v_2^m v_3 v_4^n v_5  ∉ L̂   for 0 ≤ n < m,
                           ∈ L̂   for n = m,
                           ∈ L̂′  for n > m.   (51)

This theorem is the basis for the novel concept of so-called DCFL-simple problems, which has been inspired by this study and represents an interesting contribution to formal language theory. Namely, the DCFL-simple problem L_# can be reduced to every non-regular DCFL by the truth-table (Turing) reduction using oracle Mealy machines [25]. We will show in the following Theorem 3 that this reduction can be implemented by 1ANNs, which generalizes Theorem 1 to all non-regular DCFLs, providing the stronger separation of 1ANNs in the analog neuron hierarchy:

Theorem 3
Any non-regular deterministic context-free language L ⊂ Σ* over an alphabet Σ cannot be recognized online by any 1ANN with one extra analog unit having real weights.

Proof.
Let L ⊂ Σ* be a non-regular deterministic context-free language over an alphabet Σ composed of q > 0 symbols, and assume to the contrary that there is a 1ANN N that accepts L = L(N). Let v_1, v_2, v_3, v_4, v_5 ∈ Σ⁺ be the nonempty words and L̂, L̂′ ∈ {L, co-L} be the languages guaranteed by Theorem 2 for L, which satisfy condition (51). For any integer constant c > 0, we can assume without loss of generality that the strings v_i have length at least c, that is, |v_i| ≥ c for every i = 1, …, 5, since otherwise we can replace v_1, v_2, v_3, v_4, v_5 by v_1 v_2^c, v_2^c, v_2^c v_3 v_4^c, v_4^c, v_4^c v_5, respectively. According to Lemma 2 for L_1 = L/v_5 and L_2 = L/(v_4·v_5), there is a 1ANN N′ that accepts L(N′) = L_2 \ L_1 if L̂ = L, or L(N′) = L_1 \ L_2 if L̂ = co-L, respectively, with the delay of 3 computational steps. It follows from (51) that for every m, n ≥ 1,

  v_1 v_2^m v_3 v_4^(n−1) ∈ L(N′)  iff  m = n,   (52)

which will be used in the construction of a bigger 1ANN N_# including N′ as its subnetwork, that recognizes the language L_# = {0^n 1^n | n ≥ 1} over the binary alphabet {0, 1}. The architecture of N_# is schematically depicted in Figure 4.

Figure 4: The reduction of L_# to a non-regular DCFL L.

We denote by Ṽ′ ⊂ Ṽ_# the corresponding sets of binary neurons in N′ and N_#, respectively, while N_# shares the only analog unit with N′. Namely, an input x = x_1 … x_r ∈ {0,1}* to N_# of the valid form 0^m 1^n is translated to the string v_1 v_2^m v_3 v_4^(n−1) ∈ Σ* and presented to its subnetwork N′, which decides online whether m = n according to (52). The result is used by N_# for deciding whether x ∈ L_#. For this purpose, N_# contains a finite buffer memory B organized as a queue of current input symbols from Σ*, which are presented online, one by one, to N′ through its q input neurons X′ ⊂ Ṽ′ by using the one-hot encoding of Σ, when queried by nxt′ ∈ Ṽ′ according to the input protocol (7) and (8) for N′.

At the beginning, B is initialized with the nonempty string v_1 ∈ Σ⁺ and N_# queries the first input bit x_1 ∈ {0,1}, that is, y_nxt^(0) = 1 where nxt ∈ Ṽ_#, according to the input protocol (7) and (8) for N_#. Thus, at the time instant τ_1 = 1, N_# reads the first input bit x_1 through its two input neurons X ⊂ Ṽ_# by using the one-hot encoding of {0, 1}.
If x_1 = 1, then x = 1x′ ∉ L_# is further rejected for any suffix x′ ∈ {0,1}* by clamping the state y_out^(t) = 0 of the output neuron out ∈ Ṽ_# in N_# whereas y_nxt^(t) = 1, for every t > 1. If x_1 = 0, then N_# writes the string v_2 ∈ Σ⁺ to B. At the same time, the computation of N′ proceeds while reading its input from the buffer B when needed, which is indicated by the neuron nxt′ ∈ Ṽ′ one computational step beforehand. Every time before B becomes empty, N_# reads the next input bit x_k ∈ {0,1} for k > 1 and writes the string v_2 ∈ Σ⁺ to B if x_k = 0, so that N′ can smoothly continue in its computation. This is repeated until N_# reads the input bit x_{m+1} = 1 for m ≥ 1, which completes the first phase of the computation by N_#. In the course of this first phase, each prefix 0^k ∉ L_# of the input word x, which is being read online by N_#, is rejected by putting the state y_out^(τ_{k+1}) = 0 of its output neuron out for every k = 1, …, m, according to the output protocol (9) for N_#.
1, which completesthe first phase of the computation by N . In the course of this first phase,each prefix 0 k / ∈ L of the input word x , which is being read online by N , isrejected by putting the state y ( τ k +1 )out = 0 of its output neuron out for every k = 1 , . . . , m , according to the output protocol (9) for N .At the beginning of the subsequent second phase when the input bit x m +1 = 1 has been read, N writes the string v v ∈ Σ + to B and continuesuninterruptedly in the computation of N (cid:48) over the input being read fromthe buffer B when required. Every time before B becomes empty which willprecisely be specified below, N reads the next input bit x m + n ∈ { , } for n > v ∈ Σ + to B if x m + n = 1, so that N (cid:48) cansmoothly carry out its computation. If x m + n = 0, then x = 0 m n − x (cid:48) / ∈ L is further rejected for any suffix x (cid:48) ∈ { , } ∗ by clamping the states y ( t )out = 0and y ( t )nxt = 1 since that.It follows that in the second phase, N (cid:48) decides online for each n > v v m v v n − ∈ Σ + of length (cid:96) = | v v | + m · | v | +( n − ·| v | belongs to L ( N (cid:48) ), where the result is indicated through its outputneuron out (cid:48) ∈ ˜ V (cid:48) at the time instant τ (cid:48) (cid:96) +1 +3 with the delay of 3 computational25teps after the next symbol subsequent to v v m v v n − is read, according tothe delayed output protocol (9) for N (cid:48) . For sufficiently large length | v | > (cid:48) thus signals whether v v m v v n − ∈ L ( N (cid:48) ), while stillreading the next string v corresponding to the last input bit x m + n = 1 of thecurrent input 0 m n to N . 
At the next time instant τ_{m+n+1} = τ′_{ℓ+1} + 4, when the subsequent input bit x_{m+n+1} ∈ {0,1} is presented to N_#, which is queried by N_# via the state y_nxt^(τ_{m+n+1}−1) = 1 of the neuron nxt one step beforehand, the output neuron out of N_# copies the state of out′, providing the result of the computation by N_# over the input word x ∈ {0,1}* according to the output protocol (9) for N_#. Namely, y_out^(τ_{m+n+1}) = 1 iff v_1 v_2^m v_3 v_4^(n−1) ∈ L(N′) iff m = n iff 0^m 1^n ∈ L_#, according to (52), which ensures L(N_#) = L_#.

The preceding online reduction of any input 0^m 1^n for N_# to the input v_1 v_2^m v_3 v_4^(n−1) for N′ can clearly be realized by a finite automaton, including the implementation of the finite buffer memory B. This finite automaton can further be implemented by a binary-state neural network by using the standard constructions [9, 10, 11, 13], which is wired to the 1ANN N′ in order to create the 1ANN N_# recognizing the language L(N_#) = L_# online, as described above. In particular, the synchronization of the two networks is controlled by their input/output protocols, while the operation of N′ can suitably be slowed down for sufficiently large lengths of the strings v_i. However, we know by Theorem 1 that there is no 1ANN that accepts L_#, which is a contradiction, completing the proof of Theorem 3. □
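The reduction just proved can be checked on a toy instance. The following Python sketch uses our own hypothetical witness words for the concrete choice L = L_# (with L̂ = L and L̂′ = co-L); it verifies condition (51), the online translation of 0^m 1^n into v_1 v_2^m v_3 v_4^(n−1) (plus the trailing v_4), and the equivalence (52):

```python
# Toy check (not from the paper): conditions (51) and (52) instantiated for
# the concrete DCFL L_# = {0^n 1^n | n >= 1}.  The witness words v1..v5 are
# our own hypothetical choice, with Lhat = L_# and Lhat' = co-L_#.

def in_L(w):  # membership in L_# = {0^n 1^n | n >= 1}
    k = len(w) // 2
    return len(w) >= 2 and w == "0" * k + "1" * k

v1, v2, v3, v4, v5 = "00", "0", "1", "1", "1"  # |v1| = |v3| + |v5|, |v2| = |v4|

def witness(m, n):
    return v1 + v2 * m + v3 + v4 * n + v5

# (51): the witness is in Lhat = L_# iff n = m; otherwise it is in co-L_#.
for m in range(1, 7):
    for n in range(0, 7):
        assert in_L(witness(m, n)) == (n == m)

def translate(x):
    """Stream fed to the subnetwork N' for an input x = 0^m 1^n."""
    stream, seen_one = v1, False     # the buffer B is initialized with v1
    for bit in x:
        if bit == "0":
            stream += v2             # every 0 appends v2 to B
        elif not seen_one:
            stream, seen_one = stream + v3 + v4, True  # the first 1: v3 v4
        else:
            stream += v4             # every further 1 appends one more v4
    return stream

assert translate("000111") == v1 + v2 * 3 + v3 + v4 * 3

# (52): the prefix v1 v2^m v3 v4^(n-1) lies in L2 \ L1 iff m = n, where
# L1 = L/v5 and L2 = L/(v4 v5) are the quotients from Lemma 2.
for m in range(1, 6):
    for n in range(1, 6):
        prefix = v1 + v2 * m + v3 + v4 * (n - 1)
        in_L2, in_L1 = in_L(prefix + v4 + v5), in_L(prefix + v5)
        assert (in_L2 and not in_L1) == (m == n)
```

This only validates the combinatorics of the reduction on one hypothetical instance; the actual proof of course works for an arbitrary non-regular DCFL L.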
5. Conclusion
In this paper, we have refined the analysis of the computational power of discrete-time binary-state recurrent neural networks αANNs extended with α analog-state neurons by proving the stronger separation (DCFLs \ REG) ⊂ (2ANNs \ 1ANNs), or equivalently, 1ANNs ∩ DCFLs = 0ANNs = REG. For this purpose, we have reduced the non-regular DCFL L_# = {0^n 1^n | n ≥ 1}, which is known not to be in 1ANNs [24], to any non-regular DCFL.

It follows that L_# is in some sense the simplest language in the class of non-regular DCFLs. This is by itself an interesting contribution to computability theory, which has inspired the novel concept of a DCFL-simple problem that can be reduced to any non-regular DCFL by the truth-table (Turing) reduction using oracle Mealy machines [25]. The proof of the stronger separation of 1ANNs could possibly be extended to nondeterministic context-free languages (CFLs) by showing that 1ANNs ∩ CFLs = 0ANNs. Moreover, it is an open question whether there is a non-context-sensitive language that can be accepted offline by a 1ANN, which does not apply to the online input/output protocol since we know online 1ANNs ⊂ CSLs. Another important challenge for future research is the separation 2ANNs ⊊ 3ANNs, e.g. the question whether 2ANNs ∩ CFLs = DCFLs.

It also appears that the analog neuron hierarchy is only partially comparable to the Chomsky hierarchy, since 1ANNs and probably also 2ANNs do not coincide with the Chomsky levels, although 0ANNs and 3ANNs correspond to FAs and TMs, respectively. In our previous paper [22], the class of languages accepted by 1ANNs has been characterized syntactically by so-called cut languages, which represent a new type of basis languages defined by NNs that do not have an equivalent in the Chomsky hierarchy. A similar characterization still needs to be done for 2ANNs.

The analog neuron hierarchy shows the role of analogicity in the computational power of NNs.
The binary states restrict NNs to a finite domain, while the analog values create a potentially infinite state space which can be exploited for recognizing more complex languages in the Chomsky hierarchy. This is not only an issue of increasing precision of rational-number parameters in NNs, but also of the functional limitations of one or two analog units for decoding information from rational states as well as for synchronizing the storage operations. An important open problem thus concerns the generalization of the hierarchy to other types of analog neurons used in practical deep networks such as LSTM, GRU, or ReLU units [18, 20]. Clearly, the degree of analogicity represents another computational resource that can simply be measured by the number of analog units, while a possible tradeoff with computational time can also be explored.

Nevertheless, the ultimate goal is to prove a proper "natural" hierarchy of NNs between integer and rational weights, similarly as it is known between rational and real weights [16], and possibly to map it to known hierarchies of regular/context-free languages. This problem is related to a more general issue of finding suitable complexity measures of realistic NNs establishing the complexity hierarchies, which could be employed in practical neurocomputing, e.g. the precision of weight parameters [21], energy complexity [12], temporal coding, etc.

Yet another important issue concerns grammatical inference. For a given PDA or TM, the constructions of computationally equivalent 2ANNs and 3ANNs, respectively, can be implemented algorithmically [24], although they do not provide learning algorithms that would infer a language from training data. Nevertheless, the underlying results establish the principal limits (lower and upper bounds) on the ability of a few analog units to recognize more complex languages. For example, we now know that one analog neuron cannot accept even some simple DCFLs.
In other words, any learning algorithm has to employ a sufficient number of analog units to be able to infer more complex grammars.

Acknowledgments
The research was done with institutional support RVO: 67985807 and was partially supported by the grant of the Czech Science Foundation No. GA19-05704S.
References

[1] P. Koiran, A family of universal recurrent networks, Theoretical Computer Science 168 (2) (1996) 473–480.
[2] H. T. Siegelmann, Recurrent neural networks and finite automata, Journal of Computational Intelligence 12 (4) (1996) 567–574.
[3] J. Šíma, Analog stable simulation of discrete neural networks, Neural Network World 7 (6) (1997) 679–686.
[4] M. Šorel, J. Šíma, Robust RBF finite automata, Neurocomputing 62 (2004) 93–110.
[5] J. Kilian, H. T. Siegelmann, The dynamic universality of sigmoidal neural networks, Information and Computation 128 (1) (1996) 48–56.
[6] H. T. Siegelmann, Neural Networks and Analog Computation: Beyond the Turing Limit, Birkhäuser, Boston, 1999.
[7] J. Šíma, P. Orponen, General-purpose computation with neural networks: A survey of complexity theoretic results, Neural Computation 15 (12) (2003) 2727–2778.
[8] N. Alon, A. K. Dewdney, T. J. Ott, Efficient simulation of finite automata by neural nets, Journal of the ACM 38 (2) (1991) 495–514.
[9] B. G. Horne, D. R. Hush, Bounds on the complexity of recurrent neural network implementations of finite state machines, Neural Networks 9 (2) (1996) 243–252.
[10] P. Indyk, Optimal simulation of automata by neural nets, in: Proceedings of the STACS 1995 Twelfth Annual Symposium on Theoretical Aspects of Computer Science, Vol. 900 of LNCS, Springer, 1995, pp. 337–348.
[11] M. Minsky, Computation: Finite and Infinite Machines, Prentice-Hall, Englewood Cliffs, 1967.
[12] J. Šíma, Energy complexity of recurrent neural networks, Neural Computation 26 (5) (2014) 953–973.
[13] J. Šíma, J. Wiedermann, Theory of neuromata, Journal of the ACM 45 (1) (1998) 155–178.
[14] H. T. Siegelmann, E. D. Sontag, On the computational power of neural nets, Journal of Computer and System Sciences 50 (1) (1995) 132–150.
[15] H. T. Siegelmann, E. D. Sontag, Analog computation via neural networks, Theoretical Computer Science 131 (2) (1994) 331–360.
[16] J. L. Balcázar, R. Gavaldà, H. T. Siegelmann, Computational power of neural networks: A characterization in terms of Kolmogorov complexity, IEEE Transactions on Information Theory 43 (4) (1997) 1175–1183.
[17] J. Schmidhuber, Deep learning in neural networks: An overview, Neural Networks 61 (2015) 85–117.
[18] S. A. Korsky, R. C. Berwick, On the computational power of RNNs, arXiv:1906.06349 (2019).
[19] W. Merrill, Sequential neural networks as automata, arXiv:1906.01615 (2019).
[20] W. Merrill, G. Weiss, Y. Goldberg, R. Schwartz, N. A. Smith, E. Yahav, A formal hierarchy of RNN architectures, in: Proceedings of the ACL 2020 Fifty-Eighth Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, 2020, pp. 443–459.
[21] G. Weiss, Y. Goldberg, E. Yahav, On the practical computational power of finite precision RNNs for language recognition, in: Proceedings of the ACL 2018 Fifty-Sixth Annual Meeting of the Association for Computational Linguistics, Vol. 2, Association for Computational Linguistics, 2018, pp. 740–745.
[22] J. Šíma, Subrecursive neural networks, Neural Networks 116 (2019) 208–223.
[23] J. Šíma, P. Savický, Quasi-periodic β