Sampling and Learning for Boolean Function∗

Chuyu Xiong
Independent researcher, New York, USA. Email: [email protected]
January 22, 2020
Abstract
In this article, we continue our study of the universal learning machine by introducing new tools. We first discuss boolean functions and boolean circuits, and we establish one set of tools, namely, fitting extremum and proper sampling set. We prove the fundamental relationship between proper sampling set and the complexity of boolean circuits. Armed with this set of tools, we then introduce much more effective learning strategies. We show that with such learning strategies and learning dynamics, universal learning can be achieved while requiring much less data.
Keywords: Boolean Circuit, Fitting Extremum, Proper Sampling Set, Learning Dynamics and Strategy, X-form
It can scarcely be denied that the supreme goal of all theory is to make the irreducible basic elements as simple and as few as possible without having to surrender the adequate representation of a single datum of experience. — A. Einstein

...... then a sudden leap takes place in the brain in the process of cognition, ...... — Mao Zedong
1 Introduction

In [1, 2, 4, 5], we tried to study the universal learning machine. There, we laid out a framework of discussion and proved some basic yet important results, such as: with sufficient data, a universal learning machine can be achieved. The core of the universal learning machine is the X-form, which turns out to be a form of boolean function. We showed that learning is actually equivalent to the dynamics of X-forms inside a learning machine. Thus, in order to study the universal learning machine well, we need to thoroughly study the X-form and the motion of X-forms driven by data.

Since the work of [2, 4, 5], we have constantly pursued effective learning dynamics, and tried to understand the X-form, and more generally, boolean functions and boolean circuits. In the process, eventually, we found that the very core of the problem is: we need to find a powerful way to describe the properties of a boolean function. If we have such a tool, we can penetrate deep into boolean functions and do much better than before. But it is not easy to find such a tool, and it took us a long time. We recently invented a set of tools, namely, fitting extremum and proper sampling set. Our invention, i.e. fitting extremum and learning dynamics, can be seen in our patent applications [9, 10]. How to use fitting extremum and proper sampling set for a special case, namely 1-dim real functions, can be seen in [7]. In this article, we provide theoretical discussions of these tools and related studies.

We discuss boolean functions in section 2, and boolean circuits in section 3. We define a way to present a boolean circuit, i.e. the connection matrix, and the decomposition of the connection matrix. In section 4, we introduce sampling set, fitting extremum, and proper sampling set (PSS). We show the deep connections between PSS and the size of boolean circuits.

∗ Great thanks for the whole-hearted support of my wife. Thanks also to the Internet and to the contributors of research content on the Internet.
In section 5, we will discuss how to apply these tools to learning dynamics, and prove that a universal learning machine can be achieved by using them. Finally, in section 6, we make some comments. In the appendix, we put the details of the relationship between PSS and the size of boolean circuits.
2 Boolean Function

Boolean functions and boolean circuits are very important for learning machines. We first define boolean functions and related concepts. B^N is the N-dim boolean space; it consists of all N-dim boolean vectors:

B^N = { (b_1, b_2, ..., b_N) | b_k = 0 or 1, k = 1, 2, ..., N }

We also call this space the base pattern space [2]. B^N is the starting point for us. Specially, when N = 1, B^N becomes B = {0, 1}. An N-dim boolean function is a function defined on B^N:

Definition 2.1 (Boolean Function).
An N-dim boolean function f : B^N → B is a function from B^N to B. We can also write it as:

f : B^N → B, f(b_1, b_2, ..., b_N) = 0 or 1

We can see some examples of boolean functions.

Example 2.1 (Some Simplest Boolean Functions).
The constant function is simplest:

f : B^N → B, f(b_1, b_2, ..., b_N) = 1

A function that only depends on one variable is also very simple:

f : B^N → B, f(b_1, b_2, ..., b_N) = b_1

We can see more examples of boolean functions. Boolean functions formed by one basic logic operation are also very simple. The logical operation OR forms one boolean function:

o : B^2 → B, o(b_1, b_2) = b_1 ∨ b_2 = { 0 if b_1 = b_2 = 0; 1 otherwise }

The logical operation AND forms one boolean function:

a : B^2 → B, a(b_1, b_2) = b_1 ∧ b_2 = { 1 if b_1 = b_2 = 1; 0 otherwise }

The identity forms one boolean function:

id : B → B, id(b) = b = { 1 if b = 1; 0 if b = 0 }

The logical operation Negation also forms one boolean function:

n : B → B, n(b) = ¬b = { 1 if b = 0; 0 if b = 1 }

The logical operation XOR also forms one boolean function:

x : B^2 → B, x(b_1, b_2) = b_1 ⊕ b_2 = { 1 if exactly one of b_1, b_2 is 1; 0 otherwise }

It is worth noting that XOR can be written by using OR, AND and Negation:

b_1 ⊕ b_2 = (b_1 ∧ ¬b_2) ∨ (¬b_1 ∧ b_2) = (b_1 ∨ b_2) ∧ ¬(b_1 ∧ b_2)

Example 2.2 (Boolean Function as Real Function).
The logical operation OR can be written as a real function:

o : B^2 → B, o(b_1, b_2) = b_1 ∨ b_2 = sign(b_1 + b_2), where sign(x) = { 1 if x > 0; 0 if x ≤ 0 }

The logical operation AND can be written as:

a : B^2 → B, a(b_1, b_2) = b_1 ∧ b_2 = b_1 · b_2

where · is the multiplication of real numbers. The logical operation Negation can also be written as a real function:

n : B → B, n(b) = ¬b = −(b − 1)

Example 2.3 (More Boolean Functions Defined by Real Functions).
We can define a boolean function as:

f : B^2 → B, f(b_1, b_2) = sign(Oscil(r_1 b_1 + r_2 b_2)), sign(x) = { 1 if x > 0; 0 if x ≤ 0 }

where r_1, r_2 are 2 real numbers, sign is the sign function, and Oscil is an oscillator function. An oscillator function is something like sin(x), which oscillates from negative to positive and so on. Generally, oscillator functions are very rich. They do not need to oscillate regularly like sin(x). They could oscillate irregularly and very complicatedly.

Yet another boolean function is more popular:

f : B^N → B, f(b_1, b_2, ..., b_N) = sign(r_1 b_1 + r_2 b_2 + ... + r_N b_N)

where r_1, r_2, ..., r_N are real numbers. This function is often called an artificial neuron. A little modification gives the linear threshold function:

f : B^N → B, f(b_1, b_2, ..., b_N) = sign(r_1 b_1 + r_2 b_2 + ... + r_N b_N − θ)

where r_1, r_2, ..., r_N, θ are real numbers.

The parity function is one important boolean function, which helps us in many aspects.

Example 2.4 (Parity Function).
The parity function p : B^N → B is defined as below:

p(b_1, b_2, ..., b_N) = { 1 if the number of 1s among b_1, ..., b_N is odd; 0 otherwise }

Equivalently,

p(b_1, b_2, ..., b_N) = ( Σ_{i=1}^{N} b_i ) (mod 2)

Since a boolean function is on a finite set, it is possible to express it by a table of values. This table is called a truth table. For example, a parity function of 3 variables can be expressed as the table below:

(b_1, b_2, b_3)    (0,0,0)  (1,0,0)  (0,1,0)  (1,1,0)  (0,0,1)  (1,0,1)  (0,1,1)  (1,1,1)
p(b_1, b_2, b_3)      0        1        1        0        1        0        0        1

We have seen that a boolean function can be defined and calculated in many ways, such as: logical operations, real functions, truth table, etc. But any boolean function can be expressed by logical operations.
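The truth table above can be reproduced mechanically. The sketch below is a minimal illustration in plain Python (the function name `parity` and the tuple ordering are ours, not from the text):

```python
from itertools import product

def parity(bits):
    # p(b_1, ..., b_N) = (b_1 + ... + b_N) mod 2, as defined above
    return sum(bits) % 2

# Truth table of the 3-variable parity function: 2^3 = 8 entries
table = {v: parity(v) for v in product((0, 1), repeat=3)}
```

Looking up `table[(1, 1, 0)]`, for instance, reproduces the corresponding entry of the table above.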
Lemma 2.1 (Expressed by Basic Logic Operations).
Any boolean function f : B^N → B can be expressed by the basic logic operations ∨, ∧, ¬.

Proof:
First, one boolean function can be expressed by its truth table. In the truth table, there are 2^N entries, and at each entry, the function value f(b_1, b_2, ..., b_N) is recorded. Since we can use the basic logic operations to express one boolean vector in B^N, each entry can be expressed by basic logic operations. Thus, we can express the boolean function.

For example, we can express the parity function of 3 variables as:

p(b_1, b_2, b_3) = (b_1 ⊕ b_2) ⊕ b_3

Note, ⊕ can be expressed by ∨, ∧, ¬.

Here is another example of a boolean function.

Example 2.5 (Expressed By Polynomial Function).
Consider a polynomial function P on real numbers, e.g., P(x) = x^2 − x − 3. Also, consider a way to embed a boolean vector v ∈ B^N into the real numbers. There are infinitely many such embeddings. We will consider the following:

∀v ∈ B^N, x = b_1 (1/2) + b_2 (1/2)^2 + ... + b_N (1/2)^N

Then, we define a boolean function f : B^N → B as:

∀v ∈ B^N, f(v) = sign(P(x)), where x is as above

This defines a boolean function on B^N for any N. Such a way to define a boolean function by an embedding into the real numbers is quite useful.

3 Boolean Circuit

We know a boolean function can be defined and calculated in many possible ways. But, no matter how it is defined and calculated, Lemma 2.1 tells us that it can be expressed by ∨, ∧, ¬. We call such an expression a boolean expression.

Definition 3.1 (Boolean Expression).
A boolean function f : B^N → B can be expressed by ∨, ∧, ¬ and the input variables b_1, b_2, ..., b_N as one algebraic expression; we call this algebraic expression the boolean expression of f.

A boolean expression is also called a boolean formula. As one example, the parity function of 4 variables can be expressed as:

p(b_1, b_2, b_3, b_4) = (b_1 ⊕ b_2) ⊕ (b_3 ⊕ b_4)

That is to say, we can realize a boolean function by one algebraic expression. Moreover, we can realize one algebraic expression by hardware that is a group of switches and connections, namely, a circuit. Actually, we can make such a circuit by direct translation from the boolean expression: just use an AND switch to replace ∧, an OR switch to replace ∨, and a negation connection to replace ¬. Thus, we have the definition:

Definition 3.2 (Boolean Circuit).
A boolean circuit is one directed acyclic graph. There are 2 types of nodes, AND and OR nodes. A connection between nodes is either a direct connection (1 to 1 and 0 to 0) or a negation connection (1 to 0 and 0 to 1). This graph starts from the input nodes b_1, b_2, ..., b_N, and ends at the top node. We note that at each node, there are 2 and only 2 connections from below (this is called 2 fanin). But the number of connections going up could be any number.

Note, the definition here is slightly different from the boolean circuit defined in most literature (for example [11]). But the difference is superficial, and it is just for the convenience of our discussions. We can draw a boolean circuit as a diagram; see the diagrams below for some examples. A boolean circuit and a boolean expression are actually identical, so we will use them interchangeably later.

Example 3.1 (Some Simple Circuits).
The simplest circuit: C = 1. This is a special case; this circuit has no node, i.e. the number of nodes is 0.

The second simplest circuit: C = b_1 ∨ b_2. See Fig. 1 C1 for the diagram. This circuit has 1 node and 2 connections.

Circuit: C = b_1 ∨ ¬b_2. See Fig. 1 C3 for the diagram. This circuit has 1 node and 2 connections, one a direct connection, the other a negation connection.

Circuit for AND. See Fig. 1 C2 for the diagram. This circuit has 1 node and 2 connections, both direct connections.

Circuit for XOR. See Fig. 1 C5 for the diagram. We can express it as: b_1 ⊕ b_2 = (b_1 ∧ ¬b_2) ∨ (¬b_1 ∧ b_2). This circuit has 3 nodes, i.e. one OR node and 2 AND nodes, with 2 negation connections.

C = (b_1 ∨ (b_2 ∧ ¬b_3)). See Fig. 1 C4 for the diagram. This circuit has 2 nodes.

Fig. 1 Diagrams of Some Simple Circuits
For a given boolean circuit C and a given input, i.e. b_1, b_2, ..., b_N taking values of 0 or 1, we can feed these values into C. The circuit will take a value at each node accordingly. When the value at the topmost node is taken, the circuit takes its value. This is how a boolean circuit executes a boolean function. We will denote it as C(b_1, b_2, ..., b_N).

Any boolean function f : B^N → B, no matter how f is defined and calculated, can be expressed by one boolean circuit C. That is to say, ∀x ∈ B^N, f(x) = C(x).

Clearly, for a boolean function, the boolean circuit expressing the function is not unique. For example, one very simple boolean function, XOR, can be expressed in 2 ways: (b_1 ∧ ¬b_2) ∨ (¬b_1 ∧ b_2) or (b_1 ∨ b_2) ∧ ¬(b_1 ∧ b_2). That is to say, XOR can be expressed by 2 different boolean circuits. For more complicated boolean functions, this is even more true.

A boolean circuit consists of a series of nodes and connections. One very important property of a boolean circuit is its number of nodes.

Definition 3.3 (Node Number).
For one boolean circuit C, we denote the number of nodes of C as d(C). That is to say, we define a function d(C) on all circuits. This function is called the node number, and it will play an important role in our discussions.
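One concrete way to hold such a circuit in code is a list of 2-fanin gates. This is a hedged sketch, not notation from the paper: the representation and names (`evaluate`, `node_number`) are ours.

```python
# Working nodes g_1, ..., g_d are listed after the N input nodes; each has
# exactly 2 incoming connections (2 fanin), each possibly negated.
AND, OR = min, max  # on {0, 1}, AND is min and OR is max

def evaluate(gates, inputs):
    """Evaluate a circuit given as a list of (op, (src1, neg1), (src2, neg2)).

    Sources index into the inputs first, then into earlier gates; the last
    gate is the ending (top) node.
    """
    values = list(inputs)
    for op, (s1, n1), (s2, n2) in gates:
        a = 1 - values[s1] if n1 else values[s1]
        b = 1 - values[s2] if n2 else values[s2]
        values.append(op(a, b))
    return values[-1]

def node_number(gates):
    # d(C): the number of working nodes of the circuit
    return len(gates)

# XOR as (b1 AND NOT b2) OR (NOT b1 AND b2): d(C) = 3
xor_gates = [
    (AND, (0, False), (1, True)),   # g1 = b1 AND NOT b2
    (AND, (0, True),  (1, False)),  # g2 = NOT b1 AND b2
    (OR,  (2, False), (3, False)),  # g3 = g1 OR g2 (ending node)
]
```

Running `evaluate(xor_gates, [b1, b2])` over all four inputs reproduces the XOR truth table, and `node_number(xor_gates)` gives the 3 nodes counted in Example 3.1.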
How can we write a boolean circuit? We can write it as an algebraic expression as before. But, for the purpose of easy manipulation, we need to write circuits in more ways. First, we denote all working nodes of a circuit C as g_1, g_2, ..., g_d, where d = d(C). Yet, the input variables b_1, b_2, ..., b_N are also nodes, namely the nodes for inputs. So, C is a graph with nodes b_1, b_2, ..., b_N, g_1, g_2, ..., g_d. Here b_1, b_2, ..., b_N are input nodes, g_i, i = 1, 2, ..., d, are working nodes, and g_d is the ending node (it is a working node as well).

At each working node g_i, i = 1, 2, ..., d, there are 2 and only 2 incoming connections. At each working node except the ending node, there are 1 or more outgoing connections.

Thus, besides using a diagram or a boolean algebraic expression, we can use matrix notation to express a circuit.

Definition 3.4 (Connection Matrix).
For a circuit C on B^N, suppose all working nodes of C are g_1, g_2, ..., g_d, where d = d(C). We define a d × (N + d − 1) matrix M, whose entries are the symbols ∧, ∨, ∧¬, ∨¬ or 0, with the following meaning:

at (i, j):  ∧   direct connection from the j-th node to the i-th working node, and this working node is ∧
            ∧¬  negation connection from the j-th node to the i-th working node, and this working node is ∧
            ∨   direct connection from the j-th node to the i-th working node, and this working node is ∨
            ∨¬  negation connection from the j-th node to the i-th working node, and this working node is ∨

We call such a matrix the connection matrix of C.

Clearly, for a given circuit, we can write down its connection matrix. Reversely, if we have such a matrix, it gives a circuit as well. So, we can identify a circuit with a connection matrix.

We can see some immediate properties of the connection matrix. Each row of the connection matrix is for one working node, and each column is for the connections going out of one node (the ending node has no column, since no connection goes out of it). Since each working node has 2 and only 2 incoming connections, each row has 2 and only 2 non-0 entries. Since each node except the ending node has 1 or more outgoing connections, each column has 1 or more non-0 entries.

Example 3.2 (Examples of Connection Matrix).
Consider a circuit C_f = b_1 ∨ (b_2 ∧ ¬b_3). See Fig. 1 C4 for the diagram of this circuit. All nodes of C_f are b_1, b_2, b_3, g_1, g_2; the working nodes are g_1 = b_2 ∧ ¬b_3 and g_2 = b_1 ∨ g_1, and the ending node is g_2. The connection matrix of C_f is a 2 × 4 matrix (columns for b_1, b_2, b_3, g_1):

M_f = [ 0  ∧  ∧¬  0
        ∨  0  0   ∨ ]

For another example, consider XOR; the circuit is C_xor = (b_1 ∨ b_2) ∧ ¬(b_1 ∧ b_2). All nodes of C_xor are b_1, b_2, g_1, g_2, g_3; the working nodes are g_1 = b_1 ∨ b_2, g_2 = b_1 ∧ b_2, g_3 = g_1 ∧ ¬g_2, and the ending node is g_3. The connection matrix of C_xor is a 3 × 4 matrix (columns for b_1, b_2, g_1, g_2):

M_xor = [ ∨  ∨  0  0
          ∧  ∧  0  0
          0  0  ∧  ∧¬ ]

In the above discussions, there is no order among working nodes. Now we define an order among working nodes. Let's see how the ending node gets its value. At the very beginning, only the input nodes have values; all working nodes are with empty value. When the values propagate along the circuit, the working nodes whose 2 incoming connections are both from input nodes get their values first. So, these nodes should be put first in the order. But, there could be more than one such node. Among these nodes, we define the order in this way: if 2 nodes g_i, g_j both have their 2 incoming connections from input nodes, say, g_i from b_{i_1}, b_{i_2}, i_1 < i_2, and g_j from b_{j_1}, b_{j_2}, j_1 < j_2, then the order of g_i, g_j is determined by the so-called dictionary order, i.e. if i_1 < j_1, then g_i comes before g_j; if i_1 = j_1 and i_2 < j_2, then g_i comes before g_j. Yet, in the case i_1 = j_1, i_2 = j_2, g_i and g_j must be of different types (otherwise, we could eliminate one), and then the ∨ node comes first.

Now, we have an order among the working nodes that have 2 incoming connections from input nodes. These nodes will be evaluated. We then consider the working nodes whose 2 incoming connections are from the nodes already evaluated, order them in the same way, and continue until all working nodes are ordered. As an example, consider the circuit of parity of 4 variables, C_p = (b_1 ⊕ b_2) ⊕ (b_3 ⊕ b_4). See the diagram below.

Fig. 2 Circuit of Parity of 4 Variables
There are 9 working nodes. Thus, all nodes are b_1, b_2, b_3, b_4, g_1, g_2, ..., g_9. According to the natural order, the working nodes get values in this way: b_1, b_2, b_3, b_4 get the input values, then g_1, g_2, g_3, g_4 get values, then g_5, g_6, then g_7, g_8, and finally g_9. The working nodes are: g_1 = b_1 ∨ b_2, g_2 = b_1 ∧ b_2, g_3 = b_3 ∨ b_4, g_4 = b_3 ∧ b_4, g_5 = g_1 ∧ ¬g_2, g_6 = g_3 ∧ ¬g_4, g_7 = g_5 ∨ g_6, g_8 = g_5 ∧ g_6, g_9 = g_7 ∧ ¬g_8. We can write the connection matrix below (columns for b_1, b_2, b_3, b_4, g_1, ..., g_8):

M_p = [ ∨  ∨  0  0  0  0   0  0   0  0  0  0
        ∧  ∧  0  0  0  0   0  0   0  0  0  0
        0  0  ∨  ∨  0  0   0  0   0  0  0  0
        0  0  ∧  ∧  0  0   0  0   0  0  0  0
        0  0  0  0  ∧  ∧¬  0  0   0  0  0  0
        0  0  0  0  0  0   ∧  ∧¬  0  0  0  0
        0  0  0  0  0  0   0  0   ∨  ∨  0  0
        0  0  0  0  0  0   0  0   ∧  ∧  0  0
        0  0  0  0  0  0   0  0   0  0  ∧  ∧¬ ]

Note, the connection matrix is written according to the natural order of working nodes. If the order of the working nodes is different, the connection matrix will appear differently (but just as some permutation). The natural order of working nodes is a useful tool. We use a lemma to describe it.
Lemma 3.1 (Natural Order of Working Nodes).
For a boolean circuit C, suppose its working nodes are g_1, g_2, ..., g_d. We can make one natural order on the working nodes, so that the evaluation of each working node depends only on the working nodes before it, and not on any working node after it.

Proof:
The proof is already done in the above discussions. □
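The evaluation of a circuit directly from its connection matrix, row by row in the natural order, can be sketched as follows. This is a hedged illustration, not the paper's notation: the ASCII symbols '&', '|', '&~', '|~' stand in for ∧, ∨, ∧¬, ∨¬, and the matrix mirrors the parity-of-4 circuit discussed above (columns b1..b4, g1..g8).

```python
# Connection matrix of the parity-of-4 circuit; 0 marks "no connection".
M_p = [
    ['|', '|', 0, 0,  0, 0, 0, 0,  0, 0, 0, 0],      # g1 = b1 | b2
    ['&', '&', 0, 0,  0, 0, 0, 0,  0, 0, 0, 0],      # g2 = b1 & b2
    [0, 0, '|', '|',  0, 0, 0, 0,  0, 0, 0, 0],      # g3 = b3 | b4
    [0, 0, '&', '&',  0, 0, 0, 0,  0, 0, 0, 0],      # g4 = b3 & b4
    [0, 0, 0, 0,  '&', '&~', 0, 0,  0, 0, 0, 0],     # g5 = g1 & ~g2
    [0, 0, 0, 0,  0, 0, '&', '&~',  0, 0, 0, 0],     # g6 = g3 & ~g4
    [0, 0, 0, 0,  0, 0, 0, 0,  '|', '|', 0, 0],      # g7 = g5 | g6
    [0, 0, 0, 0,  0, 0, 0, 0,  '&', '&', 0, 0],      # g8 = g5 & g6
    [0, 0, 0, 0,  0, 0, 0, 0,  0, 0, '&', '&~'],     # g9 = g7 & ~g8
]

def eval_matrix(M, inputs):
    values = list(inputs)
    for row in M:
        ins = [(j, sym) for j, sym in enumerate(row) if sym != 0]
        (j1, s1), (j2, s2) = ins  # exactly 2 incoming connections per row
        a = 1 - values[j1] if '~' in s1 else values[j1]
        b = 1 - values[j2] if '~' in s2 else values[j2]
        values.append(min(a, b) if s1[0] == '&' else max(a, b))
    return values[-1]  # value at the ending node
```

On any input, `eval_matrix(M_p, ...)` agrees with the parity function, because the rows are listed in the natural order, so every row only reads values that were already computed.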
Using the natural order of working nodes, we can see that the working nodes fall into levels. For example, in the example of parity of 4 variables, we have 9 working nodes, and they are divided into 4 levels: level 1: g_1, g_2, g_3, g_4; level 2: g_5, g_6; level 3: g_7, g_8; and level 4: g_9. This can be seen clearly in the diagram. Nodes in level 1 get their values first. Nodes in level 2 depend on level 1, etc. That is to say, in order to evaluate the nodes in level j, all nodes in all levels i < j should be evaluated first.

Definition 3.5 (Level of Nodes).
For a boolean circuit C, suppose its working nodes are g_1, g_2, ..., g_d. We can group the working nodes into a series of subsets l_1, ..., l_K, with l_i consisting of all working nodes whose incoming connections all come from previous subsets, i.e. from l_j, j < i. We call each subset l_i one level of working nodes, and we call the number K the depth, or depth number, or height.

According to Lemma 3.1, we can indeed make such levels of working nodes. Clearly, the top level has only one node, i.e. the ending node g_d. As the above example of parity of 4 variables demonstrates, the evaluation process of a circuit must be level by level. In order to evaluate the nodes in level i + 1, it must first evaluate all nodes in level i. This property indicates that we can do a decomposition according to level.

That is to say, we can do the evaluation in this way: from the input nodes to level 1, then from level 1 to level 2, etc. If we look at the connection matrix of parity of 4 variables, we can see this clearly. Thus, we can decompose the connection matrix according to levels. See below:

M_1 = [ ∨  ∨  0  0
        ∧  ∧  0  0
        0  0  ∨  ∨
        0  0  ∧  ∧ ]

M_2 = [ ∧  ∧¬  0  0
        0  0   ∧  ∧¬ ]

M_3 = [ ∨  ∨
        ∧  ∧ ]

M_4 = [ ∧  ∧¬ ]

Here, M_1 is for getting the values of the nodes in level 1 from the input nodes: if v = (b_1, b_2, b_3, b_4)^T is the input, then M_1 v gives the values (g_1, g_2, g_3, g_4)^T of all nodes in level 1. We can continue to use M_2 for the values of all nodes in level 2, M_3 for the values of all nodes in level 3, and finally M_4 for the value of the top node. We can write these operations in the following form:

C_p(v) = M_4 M_3 M_2 M_1 v,   v = (b_1, b_2, b_3, b_4)^T ∈ B^4

Here, C_p is the circuit of parity of 4 variables, and C_p(v) stands for the value of the top node, which is the output value of the circuit. In this way, we can operate on circuits much more easily. It is still not as good as ordinary matrix calculations, but it is much better and clearer. We will use this notation consistently. However, we need to be more careful.
In the above example, level i + 1 only depends on level i, not on level i − 1 or lower. But this is not always the case. Consider the circuit C_f = b_1 ∨ (b_2 ∧ ¬b_3), which is in diagram C4 in Fig. 1. All nodes of C_f are b_1, b_2, b_3, g_1, g_2; the working nodes are g_1, g_2. The connection matrix of C_f is a 2 × 4 matrix:

M_f = [ 0  ∧  ∧¬  0
        ∨  0  0   ∨ ]

So, clearly, level 0 is {b_1, b_2, b_3} (the input nodes), level 1 is {g_1}, and level 2 is {g_2} (the ending node). But, we can see that the level 2 node has incoming connections from level 1 and level 0. Thus, the decomposition from level to level seems to have difficulties. Can we still do the decomposition as we did for C_p?

In order to make a neat decomposition, we need to introduce a new kind of node: the spurious node. We will use s to denote a spurious node. A spurious node is one node added to one level just to pass the connections from a lower level to a higher level. After introducing spurious nodes, we can go back to the situation: level i + 1 only depends on level i, not on any previous level. As one example to demonstrate, for C_f, we add one spurious node in level 1. This spurious node has 1 and only 1 incoming connection, and it does not do anything but pass the value of b_1, so its outgoing connections are exactly the same as the outgoing connections of b_1. After adding this node, g_2 has 2 incoming connections from level 1. So, we can write the following decomposition:

M_1 = [ s  0  0
        0  ∧  ∧¬ ]

M_2 = [ ∨  ∨ ]

And,

C_f(v) = M_2 M_1 v,   v = (b_1, b_2, b_3)^T ∈ B^3

This decomposition makes our operations on circuits easier. For example, if the input is v = (1, 0, 0)^T, then u = M_1 v = (1, 0)^T, and M_2 u = 1, so C_f(v) = 1.

Definition 3.6 (Spurious Nodes).
For a circuit C on B^N, suppose all working nodes of C are g_1, g_2, ..., g_d, where d = d(C), and the nodes are grouped into levels {g_{i,j}}, i = 1, ..., K, j = 1, ..., L_i, where K is the number of levels. If at level i + 1 there are incoming connections not from level i but from a level lower than i, we can add spurious nodes in level i, so that these nodes only pass the values. We use s to denote such nodes. By adding spurious nodes, the evaluation of level i + 1 will only depend on level i.

We can write this decomposition into the following lemma.

Lemma 3.2 (Decomposition of Connection Matrix by Level).
For a boolean circuit C, suppose its working nodes are g_1, g_2, ..., g_d, and the nodes are grouped into levels {g_{i,j}}, i = 1, ..., K, j = 1, ..., L_i, where K is the number of levels. Then, adding spurious nodes if necessary, the evaluation of C can be decomposed into a series of evaluations, so that each evaluation is done from one level to the next level. And each evaluation can be achieved by a matrix operation.

Proof:
The proof is already done in the above discussions. □
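The level-by-level evaluation with a spurious node can be sketched concretely for C_f = b_1 ∨ (b_2 ∧ ¬b_3) from the text. This is a hedged illustration: 's' marks a spurious (pass-through) node, and the ASCII symbols '&', '|', '&~' stand in for ∧, ∨, ∧¬.

```python
def apply_level(M, values):
    """Map the values of one level to the next via a one-level matrix."""
    out = []
    for row in M:
        ins = [(j, sym) for j, sym in enumerate(row) if sym != 0]
        if len(ins) == 1 and ins[0][1] == 's':
            out.append(values[ins[0][0]])  # spurious node: just pass the value
            continue
        (j1, s1), (j2, s2) = ins
        a = 1 - values[j1] if '~' in s1 else values[j1]
        b = 1 - values[j2] if '~' in s2 else values[j2]
        out.append(min(a, b) if s1[0] == '&' else max(a, b))
    return out

M1 = [['s', 0, 0],      # spurious node passing b1 up to level 1
      [0, '&', '&~']]   # g1 = b2 & ~b3
M2 = [['|', '|']]       # g2 = s | g1 (ending node)

def C_f(v):
    # C_f(v) = M2 M1 v, evaluated level by level
    return apply_level(M2, apply_level(M1, v))[0]
```

With the spurious node in place, each one-level matrix reads only the previous level's values, which is exactly the situation Lemma 3.2 describes.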
Fig. 3 Circuits of 5 Levels

Example 3.3 (Example of Decomposition).
We consider this boolean circuit: C = (b_1 ∨ (b_2 ∧ b_3)) ⊕ (b_4 ∨ (b_2 ⊕ b_3)). See the diagram for this circuit in Fig. 3, the left diagram. C has 9 working nodes: g_1, g_2, ..., g_9, ordered as we discussed before. We can write down the working nodes as: g_1: b_2 ∧ b_3, g_2: b_2 ∨ b_3, g_3: b_2 ∧ b_3, g_4: b_1 ∨ g_1, g_5: g_2 ∧ ¬g_3, g_6: b_4 ∨ g_5, g_7: g_4 ∨ g_6, g_8: g_4 ∧ g_6, g_9: g_7 ∧ ¬g_8. The connection matrix is below (columns for b_1, b_2, b_3, b_4, g_1, ..., g_8):

M = [ 0  ∧  ∧  0  0  0  0   0  0  0  0  0
      0  ∨  ∨  0  0  0  0   0  0  0  0  0
      0  ∧  ∧  0  0  0  0   0  0  0  0  0
      ∨  0  0  0  ∨  0  0   0  0  0  0  0
      0  0  0  0  0  ∧  ∧¬  0  0  0  0  0
      0  0  0  ∨  0  0  0   0  ∨  0  0  0
      0  0  0  0  0  0  0   ∨  0  ∨  0  0
      0  0  0  0  0  0  0   ∧  0  ∧  0  0
      0  0  0  0  0  0  0   0  0  0  ∧  ∧¬ ]

There are 5 levels (besides level 0) in this circuit: level 0: {b_1, b_2, b_3, b_4}, level 1: {g_1, g_2, g_3}, level 2: {g_4, g_5}, level 3: {g_6}, level 4: {g_7, g_8}, level 5: {g_9}. These levels are not single-level evaluations. For example, at g_4, we need g_1 (level 1) and b_1 (level 0) to evaluate it. But, we can add spurious nodes. See the right diagram in Fig. 3, where the nodes s are spurious nodes. We can see clearly that with spurious nodes, the circuit becomes a single-level evaluation. Then, we can do the decomposition by level. We have the following connection matrices between levels (with spurious nodes s_1 passing b_1 and s_2 passing b_4 into level 1, s_3 passing b_4 into level 2, and s_4 passing g_4 into level 3):

M_1 = [ s  0  0  0        (s_1 = b_1)
        0  ∧  ∧  0        (g_1)
        0  ∨  ∨  0        (g_2)
        0  ∧  ∧  0        (g_3)
        0  0  0  s ]      (s_2 = b_4)

M_2 = [ ∨  ∨  0  0   0    (g_4 = s_1 ∨ g_1)
        0  0  ∧  ∧¬  0    (g_5 = g_2 ∧ ¬g_3)
        0  0  0  0   s ]  (s_3 = s_2)

M_3 = [ s  0  0           (s_4 = g_4)
        0  ∨  ∨ ]         (g_6 = s_3 ∨ g_5)

M_4 = [ ∨  ∨              (g_7 = s_4 ∨ g_6)
        ∧  ∧ ]            (g_8 = s_4 ∧ g_6)

M_5 = [ ∧  ∧¬ ]           (g_9 = g_7 ∧ ¬g_8)

By using these connection matrices, we can see the evaluation of the circuit as follows:

C(v) = M_5 M_4 M_3 M_2 M_1 v,   v = (b_1, b_2, b_3, b_4)^T ∈ B^4

First, the input value is v = (b_1, b_2, b_3, b_4)^T. We feed this into M_1, and get a 5-dim vector u = M_1 v. Then, feeding u into M_2, we get a 3-dim vector u′ = M_2 u. Then, feeding into M_3, we get a 2-dim vector; then into M_4, a 2-dim vector; and finally into M_5, the value at the ending node. Note the role that the spurious nodes are playing.

This example shows that decomposition makes a boolean circuit much easier to analyze. After decomposition, we have several levels. Each level is a very simple boolean circuit: each node has only 2 incoming connections, and all nodes are in exactly the same level. We can use one matrix to record this one-level circuit well.
We call this matrix a one-level connection matrix. We can use the matrix to evaluate all nodes in the one-level circuit, and the evaluation is very simple and mechanical, almost like normal matrix-vector multiplication. This makes analysis much easier. Although the operation is not truly a matrix calculation, it is quite simple and easy to handle. So, the above notation is good enough to help us record boolean circuits, and to help us do operations and analysis on boolean circuits.

About Size of Boolean Circuit
In most literature about boolean circuits, for example in [11], the size and depth of a boolean circuit are defined. They are highly related to, yet different from, our definitions of node number and level number. We discuss them here.

In [11], the size of a boolean circuit is defined as the number of gates ∨, ∧, ¬ used in the circuit. In contrast, we define the node number of a circuit as the number of ∨, ∧ nodes, not including ¬. We will use the notation s(C) for the size of a boolean circuit (as in most literature), and the notation d(C) for the node number.

In most literature, the depth of a circuit is defined as the number of steps required from input to output. Our definition of depth is exactly the same as in most literature. The depth equals the number of levels. So, if the depth of a circuit is K, we can decompose the connection matrix into K connection matrices, each of which is only for one level, i.e. of depth 1.

Lemma 3.3 (Relationship of s(C) and d(C), K and Depth). For a boolean circuit C, suppose s(C) is the size of the circuit (as in most literature), and d(C) is the node number; then d(C) ≤ s(C) ≤ 3 d(C). And, the depth of a circuit equals the number of levels.

Proof:
The proof is clear: each working node contributes one ∨ or ∧ gate, and each of its 2 incoming connections contributes at most one ¬ gate. □
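The counting in the lemma can be made concrete with a small sketch (the gate-list format and the names `d`, `s` are illustrative, not from the paper): every working node is one AND/OR gate, and each of its two incoming connections may add one NOT gate.

```python
def d(gates):
    # node number d(C): count of AND/OR working nodes
    return len(gates)

def s(gates):
    # literature-style size s(C): AND/OR gates plus one NOT gate per
    # negated incoming connection
    return len(gates) + sum(int(n1) + int(n2)
                            for _, (_, n1), (_, n2) in gates)

# XOR as (b1 AND NOT b2) OR (NOT b1 AND b2): 3 nodes, 2 negations
xor_gates = [
    ('and', (0, False), (1, True)),
    ('and', (0, True), (1, False)),
    ('or', (2, False), (3, False)),
]
```

For this XOR circuit, d = 3 and s = 5, consistent with d(C) ≤ s(C) ≤ 3 d(C). (A real circuit might share one NOT gate among several connections, which only makes s smaller.)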
Since circuit complexity in most literature is measured by s(C), if we are interested in circuit complexity, using d(C) is equivalent to using s(C). However, for our purposes, d(C) is more convenient. We will mostly use d(C) to measure a circuit.

4 Sampling Set, Fitting Extremum and Proper Sampling Set

In order to analyze a boolean function f : B^N → B, one way is to consider some examples; say, we feed some x ∈ B^N into f and see its value. This is called sampling. More precisely, we get an input x ∈ B^N and its value f(x); this forms one sample of f. If we repeat such sampling some times, we get a sampling set.

Definition 4.1 (Sampling Set).
A sampling set is one subset of B^N; that is, if S ⊂ B^N, we say S is one sampling set (or just a sampling). Moreover, over one sampling set, there could be assigned values:

Sv = { [x, b] | x ∈ S, b = 0 or 1 }

We call such a set Sv a sampling set with assigned values, or a sampling with values, or just a sampling. For a boolean function f : B^N → B, we can have the sampling set of f (or sampling set for f):

Sv = { [x, f(x)] | x ∈ S }

A sampling set of f gives us information about this boolean function. We can think of a sampling set of f as a subset of the truth table of f. Naturally, we want to ask: can we recover the whole truth table from a sampling set? Actually, under certain conditions, we can. See this simple example. Consider the simplest circuit: C = b_1 ∨ b_2. The truth table is very simple, as below:

(b_1, b_2)    (0,0)  (1,0)  (0,1)  (1,1)
C(b_1, b_2)     0      1      1      1

If we only have a subset of this truth table, can we use it to recover the whole truth table? It depends. If the subset is {[(0,0), 0], [(1,0), 1]}, we cannot: the circuit C′ = b_1 satisfies this sampling set as well. But, if the subset is {[(0,0), 0], [(1,0), 1], [(0,1), 1]}, then no equally simple circuit other than C satisfies this set. There is indeed a circuit C_xor that satisfies this set and is not C, but this circuit C_xor is more complicated than C, i.e. it has more nodes.

This simple fact, and of course many other facts as well, motivates us to consider this question: given a sampling set, if we look for a simplest boolean circuit to satisfy the sampling set, what would happen? Can we recover the whole truth table by this action? This is the central question that we try to address. But first we define the circuit space.

Definition 4.2 (Circuit Space on B^N). The set of all boolean circuits on B^N is called the circuit space on B^N. We use C to represent the circuit space:

C = { C | C is a boolean circuit on B^N }

Note, C is a much bigger set than the set of all boolean functions. The number of boolean functions on B^N is finite, though the number is very huge: 2^{2^N}. But one boolean function could have many boolean circuits expressing it. So, C is a much larger space.

We then define the fitting extremum, which is a minimization problem looking for the boolean circuit that has the smallest node number while fitting the sampling.

Definition 4.3 (Fitting Extremum). For a sampling set Sv with values, we define one extremum problem as follows:

Min: d(C), C ∈ C  s.t.  ∀[x, b] ∈ Sv, C(x) = b

We call this problem the fitting extremum on Sv.

In the fitting extremum, we are looking for a boolean circuit in C with these properties: 1) it fits the sampling set, and 2) it has the smallest node number. We can use one most simple case to illustrate the meaning of the fitting extremum. Consider the sampling set {[(0,0), 0], [(1,0), 1], [(0,1), 1]}. As discussed above, this could be a subset of the truth table of some unknown circuit. We want to use this sampling set to recover the whole truth table. When we look for circuits fitting the sampling, we find that 2 circuits C_1 = b_1 ∨ b_2 and C_2 = b_1 ⊕ b_2 fit the sampling. So, which circuit should we choose? The sampling set itself is not good enough. But, if we add one more condition, i.e. to look for the simplest circuit fitting the sampling, then we know C_1 should be chosen, since d(C_1) = 1, d(C_2) = 3. This simple example indeed tells us what the fitting extremum is about.

In the definition of fitting extremum, we are given a sampling set with values. But, what if we are given a subset of B^N and a boolean function? This surely gives a fitting extremum as well.
Definition 4.4 (Fitting Extremum of a Boolean Function).
For one boolean function f : B^N → B, and for a sampling set S ⊂ B^N, we define one extremum problem as follows:

Min: d(C), C ∈ C  s.t.  ∀x ∈ S, C(x) = f(x)

We call this problem the fitting extremum on S and f. Such a circuit C is called a circuit generated by the fitting extremum on the sampling S and f. That is to say, given a sampling and a boolean function, we can generate circuits from them.

Lemma 4.1 (Existence of Circuit Generated).
For any given boolean function f, and any given sampling S, the circuit generated by the fitting extremum on S and f always exists. That is to say, there exists at least one circuit C so that C fits the sampling and d(C) reaches the minimum.

Proof:
For a given S, we denote the set of fitting circuits as G: G = { C ∈ C | C fits with f on S }. Clearly G is not empty, since there is at least one circuit C expressing f, and this C fits with S. So, the set { d(C) | C ∈ G } is a nonempty set of integers. Thus, there must be a C so that d(C) equals the minimum. □

So, for any given f and S, there is at least one circuit C generated by the fitting extremum from them. That is to say, if we have a boolean function f and a sampling set S, we can put them into the fitting extremum, and we get one or more boolean circuits C fitting with f on S. Naturally, we ask: what is the relationship between C and f? Could this circuit C express f exactly? We first see a simple example.

For the OR function f = b_1 ∨ b_2, and for the sampling S = {(1,0), (0,0)}, if we put them into the fitting extremum, it is easy to see that the circuit C = b_1 fits the sampling, with d(C) = 0. So, the circuit b_1 is a circuit generated by the fitting extremum. But, the circuit C does not express f, since C(0,1) ≠ f(0,1). However, for the sampling S = {(1,0), (0,1), (0,0)}, the circuit generated by the fitting extremum from f and S is C = b_1 ∨ b_2, which expresses f exactly.

This simple example tells us: for a boolean function f, for some sampling S, the circuit C generated by the fitting extremum from f and S indeed expresses f, but for some other sampling, the circuit generated by the fitting extremum does not express f. The sampling that makes the fitting extremum produce a circuit expressing f is special and needs our attention. Thus, we define the proper sampling set.

Definition 4.5 (Proper Sampling Set).
For a given boolean function f : B^N → B and a sampling set S ⊂ B^N, if the fitting extremum on S and f generates a boolean circuit C, i.e. C fits f on S and d(C) reaches the minimum, and if C expresses f exactly, i.e. ∀x ∈ B^N, C(x) = f(x), we say S is a proper sampling set of f, or just a proper sampling set.

In other words, when S is a proper sampling set, the boolean circuit generated by fitting extremum on S and f will always express f. This is one crucial property. We will use PSS to stand for proper sampling set. In the above simple example, for the OR function f, S = {(1,0), (0,0)} is not a PSS, but S = {(1,0), (0,1), (0,0)} is a PSS. Lemma 4.2 (Existence of PSS).
For any boolean function f, there is some subset S ⊂ B^N such that S is a proper sampling set of f. Proof:
This is very clear. At least, the whole space B^N is a proper sampling set. □ That is to say, for any boolean function f, a PSS always exists. The trivial case is that the PSS equals the whole boolean space B^N. We can think of it this way: given a sampling set S, if S is not a PSS, we can add more elements into S; eventually, S will become a PSS. Of course, we do not want the whole space, if possible. This is actually the major problem we will discuss here. First, we consider more examples. Example 4.1 (Examples for Sampling and PSS).
Note, normally we write vectors as columns. But, for convenience, for short vectors (low dimension), we write them as rows.

For the OR function f = b1 ∨ b2, the sampling set {(1,1)} is not a PSS. It is easy to see the fitting extremum generates a constant circuit C = 1. But the sampling set S = {(1,0), (0,1), (0,0)} is a PSS. The fitting extremum generates C = b1 ∨ b2, which expresses f exactly. Note, |S| = 3.

For the AND function f = b1 ∧ b2, the sampling set {(0,0), (1,1)} is not a PSS. It is easy to see the fitting extremum generates a circuit C = b1. But the sampling set S = {(1,1), (0,1), (1,0)} is a PSS; the fitting extremum generates C = b1 ∧ b2, which expresses f. Also note |S| = 3.

For the XOR function f = b1 ⊕ b2, the sampling set S = {(1,0), (0,1)} is not a PSS. But S = {(1,1), (0,0), (1,0), (0,1)} is a PSS. Here, |S| = 4.

See diagram C4 in Fig. 1. It is for the function f = b1 ∨ (b2 ∧ ¬b3). The sampling set S = {(1,0,0), (0,1,0), (0,0,0), (0,1,1)} is a PSS. How do we know this? Let's see some details. For the node g = b2 ∧ ¬b3, this is a ∧ node with one negation connection. As we talked about above, for a ∧ node the PSS should be {(1,1), (0,1), (1,0)}, but since there is one negation connection, for the ∧¬ node the PSS becomes {(1,0), (0,0), (1,1)}. This is only for b2, b3. But we can add b1 as 0, so we have the set {(0,1,0), (0,0,0), (0,1,1)}. We also need a sample for b1. This is the sampling (1,0,0): b1 as 1, and b2, b3 as 0. So we have S = {(1,0,0), (0,1,0), (0,0,0), (0,1,1)}. We then consider the node f = b1 ∨ g. This is a ∨ node. As discussed above, for this node we need to have {(1,0), (0,0), (0,1)} for (b1, g). And indeed, S induces {(1,0), (0,0), (0,1)} for (b1, g). Thus, S is a PSS. We can verify this by trying some circuits. The procedure we did here is generally true, which we will see in later discussions. Example 4.2 (More Examples of PSS).
Consider a sampling set with values in B², Sv = {[(1,1),1], [(0,0),0]}. This sampling set is not a PSS. We can easily see that the circuit C = b1 fits with Sv, and C = b2 fits with Sv as well. However, if we add one more sample into Sv, for example [(1,0),1], then only C = b1 fits. Thus, Sv = {[(1,1),1], [(0,0),0], [(1,0),1]} is a PSS.

From the above discussions, we know that for a boolean function f, we could first sample it, then apply the fitting extremum on the sampling; if the sampling is right, i.e. it is a PSS, we will get a boolean circuit that expresses f. This is a great outcome. With this procedure, we can understand f better. Theorem 4.3 (PSS Implies Circuit). If f is a boolean function f : B^N → B, S ⊂ B^N is a PSS for f, and |S| is the size of the PSS, then there is a circuit C expressing f with d(C) < N|S|. The opposite direction is also true; that is to say, if we have a circuit, we can construct a PSS from it.
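The fitting extremum is, in principle, a search over circuits ordered by size. For intuition, here is a minimal brute-force sketch (our code, not from the paper) for N = 2: circuits are AND/OR formulas whose inputs are variables, negated variables, or constants (all counted as 0 nodes), and each binary gate adds one node.

```python
from itertools import product

N = 2
POINTS = list(product([0, 1], repeat=N))  # all of B^N

def fitting_extremum(f, S, max_gates=4):
    """Brute-force the fitting extremum for small N: among AND/OR circuits
    whose size-0 inputs are literals, negated literals and constants,
    return (d(C), expression) for a smallest C with C(x) = f(x) on S."""
    base = {}
    for i in range(N):
        base.setdefault(tuple(p[i] for p in POINTS), "b%d" % (i + 1))
        base.setdefault(tuple(1 - p[i] for p in POINTS), "~b%d" % (i + 1))
    base.setdefault(tuple(0 for _ in POINTS), "0")
    base.setdefault(tuple(1 for _ in POINTS), "1")
    by_size = {0: base}                      # gate count -> {truth table: expr}
    target = tuple(f(p) for p in POINTS)
    sel = [i for i, p in enumerate(POINTS) if p in S]
    for size in range(max_gates + 1):
        for t in sorted(by_size.get(size, {})):
            if all(t[i] == target[i] for i in sel):
                return size, by_size[size][t]   # smallest circuit fitting S
        nxt = by_size.setdefault(size + 1, {})  # combine two smaller circuits
        for sl in range(size + 1):
            for tl, el in by_size.get(sl, {}).items():
                for tr, er in by_size.get(size - sl, {}).items():
                    nxt.setdefault(tuple(a & b for a, b in zip(tl, tr)),
                                   "(%s&%s)" % (el, er))
                    nxt.setdefault(tuple(a | b for a, b in zip(tl, tr)),
                                   "(%s|%s)" % (el, er))
    return None
```

For f = b1 ∨ b2 with S = {(1,0), (0,0)} this returns the 0-node circuit b1, matching the example above; adding (0,1) forces a 1-node circuit expressing OR.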
Theorem 4.4 (Circuit Implies PSS). If f is a boolean function f : B^N → B, and C is a boolean circuit expressing f, then there is a PSS for f, and the size of the PSS is less than d(C).

The PSS-implies-circuit theorem tells us that for a boolean function f, if we have a PSS for f, we can construct a circuit to express f, and the size of the circuit is controlled by the size of the PSS. Note, the size of a circuit is one good measure of the complexity of f; thus, the size of a PSS is also a good measure of the complexity of f. The circuit-implies-PSS theorem tells us that for a boolean function f, if we know a circuit C expressing f, we can pick up a PSS by using C.

So, the 2 theorems tell us that for a boolean function f, if we have a PSS of f, we can construct a circuit to express f whose size is controlled by the size of the sampling; and, reversely, if there is a circuit expressing f, then we can find a PSS by using the circuit, and the size of the sampling is controlled by the size of the circuit. Thus, the size of the circuit and the size of the PSS are equivalent. Since the size of the circuit is one good measure of the computational complexity of f, so is the size of the PSS. This is a very important property. The above 2 theorems are crucial; we put their proofs in the Appendix.

For one boolean function f, there might be more than one PSS; there could be many. Among all PSSs, the PSS with the smallest number of elements is especially interesting. Definition 4.6 (Minimal Proper Sampling Set).
For a given boolean function f : B^N → B, if a sampling set S ⊂ B^N is a proper sampling set and |S| reaches the minimum, we call such a sampling set a minimal proper sampling set. We use the brief notation mPSS for minimal proper sampling set.
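To make the definition concrete, a PSS can be checked by brute force for small N: S is a PSS of f exactly when every minimum-size circuit fitting f on S agrees with f on all of B^N. A sketch under the same conventions as above (AND/OR gates; literals, negated literals and constants cost 0 nodes; the names are ours):

```python
from itertools import product

N = 2
POINTS = list(product([0, 1], repeat=N))

def tables_by_size(max_gates):
    """truth tables of all AND/OR circuits over b1, b2, grouped by gate count;
    size-0 circuits are literals, negated literals and constants"""
    size0 = set()
    for i in range(N):
        size0.add(tuple(p[i] for p in POINTS))
        size0.add(tuple(1 - p[i] for p in POINTS))
    size0 |= {tuple(0 for _ in POINTS), tuple(1 for _ in POINTS)}
    by_size = [size0]
    for g in range(1, max_gates + 1):
        new = set()
        for sl in range(g):
            for tl in by_size[sl]:
                for tr in by_size[g - 1 - sl]:
                    new.add(tuple(a & b for a, b in zip(tl, tr)))
                    new.add(tuple(a | b for a, b in zip(tl, tr)))
        by_size.append(new)
    return by_size

def is_pss(f, S, max_gates=4):
    """S is a PSS of f iff every minimal-size circuit fitting f on S
    agrees with f on the whole of B^N."""
    target = tuple(f(p) for p in POINTS)
    sel = [i for i, p in enumerate(POINTS) if p in S]
    for sized in tables_by_size(max_gates):
        fitting = [t for t in sized if all(t[i] == target[i] for i in sel)]
        if fitting:  # minimal fitting size reached: all fits must equal f
            return all(t == target for t in fitting)
    return False
```

This reproduces the examples above: two samples never suffice for XOR, while the four-point set is a PSS.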
We discussed universal learning machine in [2, 4, 5]: a machine that can learn anything learnable without human intervention. In our previous discussions, the learning dynamics of universal learning machine was given special attention, and several methods/strategies were introduced. As a result, we proved that with sufficient data (sufficient to bound and sufficient to support), a universal learning machine can be realized. Of course, we are constantly looking for better learning methods. As a matter of fact, we invented Fitting Extremum and Proper Sampling Set (FE and PSS) particularly for such a purpose. Without the efforts to find better learning methods, perhaps FE and PSS would not have been invented. In this section, we discuss how to utilize FE and PSS in learning dynamics.
Universal Learning Machine
We briefly recall learning machine and learning dynamics. A universal learning machine M is a system consisting of an input space, an output space, a conceiving space and a governing space. The input space has dimension N, and the output space has dimension M. The conceiving space contains information processing units that get information from the input space, process the information, and put results into the output space. The conceiving space is the container for information processing units, and it normally contains many of them. But at one particular time, only one information processing unit is used to generate output. Learning is actually modifying/adapting the current information processing unit so that it becomes better. The governing space is the container for the methods that control how learning is conducted.

For convenience of discussion and without loss of generality, we often set the dimension of the output space to M = 1. Thus the information processing unit becomes a boolean function p : B^N → B. Inside the conceiving space, there could be many boolean functions, and one is used as the current information processing unit.

The input space has dimension N, thus an input is v ∈ B^N. We also call the space B^N the base pattern space. Any vector v ∈ B^N is also called a base pattern. The learning machine gets information from an input v and forms a subjective view of v inside the machine. Such a subjective view is called a subjective pattern, which is handled inside the machine by something called an X-form. The information processing is actually done according to these subjective patterns, hence according to X-forms. Inside the conceiving space, normally, there are many X-forms.

The X-form plays one crucial role in a learning machine. For full details of X-form, consult [2, 4, 5]. Here, we focus on the relationship between X-forms and boolean functions. Definition 5.1 (X-form as Algebraic Expression). If E is an algebraic expression of the 3 operators ∨, ∧, ¬ (OR, AND, NOT), and g = {b1, b2, . . . , bK} is a group of base patterns, then we call the expression E(g) = E(b1, b2, . . . , bK) an X-form upon g, or simply an X-form.

Note a small difference on the surface: in [2, 4, 5], we used +, ·, ¬ for the OR, AND, NOT operators. In fact, for algebraic expressions, +, ·, ¬ is much better. Here, for consistency with this paper, we use ∨, ∧, ¬, though it is not as good for algebraic expressions.

In other words, an X-form is an algebraic expression of some base patterns. This is one way to see an X-form. But we can also view such an algebraic expression as a subjective pattern. Definition 5.2 (X-form as Subjective Pattern). Suppose g = {p1, p2, . . . , pK} is a set of subjective patterns, and E = E(g) = E(p1, p2, . . . , pK) is one X-form on g (as algebraic expression). With the necessary supports (i.e. the operations in the algebraic expression can be realized), this expression E is a new subjective pattern.

Further, such an algebraic expression can be viewed as information processing: Definition 5.3 (X-form as Information Processor). Assume M is a learning machine, g = {p1, p2, . . . , pK} is a set of subjective patterns subjectively perceived by M, and E = E(g) is an X-form on g (as algebraic expression); then E(g) is an information processing unit that processes information as follows: when a base pattern p ∈ B^N is put into M and M perceives this pattern, the subjective patterns p1, p2, . . . , pK form a set of boolean variables, still written as p1, p2, . . . , pK, and when this set of boolean variables is applied to E, the value of E is the output of the unit, written as E(g)(p).

Thus, one X-form actually is one boolean function. So, we now understand the meaning of X-form in several aspects. Why do we call it X-form? These expressions are mathematical forms and have very rich meanings, yet many properties of such expressions are unknown.
Following tradition, we use X to name it. The following theorem connects objective pattern, subjective pattern and X-form. Theorem 5.1 (Objective and Subjective Pattern, and X-form). Suppose M is a learning machine. For any objective pattern p_o (i.e. a subset of B^N), we can find a set of base patterns g = {b1, b2, . . . , bK} and one X-form E on g, E = E(g) = E(b1, b2, . . . , bK), so that M perceives any base pattern in p_o as E, and we write p_o = E(g). We say p_o is expressed by the X-form E(g).
Example 5.1 ( X-form and related).
We see some examples of X-forms.

Suppose N = 2 and the information processing unit is the boolean function f : B² → B, f(b1, b2) = b1 ⊕ b2. We can write this boolean function as an X-form. Let p1 = (1,0), p2 = (0,1); p1, p2 are both base patterns. With the algebraic expression E(p1, p2) = (p1 ∨ p2) ∧ ¬(p1 ∧ p2), we can see: for any v ∈ B², E(v) = f(v).

Suppose N = 3 and we have an objective pattern p_o (a subset of B³). We can have the base patterns {p1, p2, p3}, p1 = (1,0,0), p2 = (0,1,0), p3 = (0,0,1), and the X-form E(p1, p2, p3) = p1 ∨ p2 ∨ p3 ∨ (p1 ∧ p2), so that p_o = E(p1, p2, p3). We can see the number of operations in E is d(E) = 4.

Suppose N = 4 and we have some X-forms Q1, Q2, Q3; then we can form a new X-form such as (Q1 ∨ Q2) ∧ ¬(Q2 ∧ Q3).

If we want to emphasize the information processing unit, we can just focus on the boolean function. But in this way, we lose the connection to the subjective pattern, which is crucial in many aspects. By using X-forms, we can reach both subjective pattern and boolean function, since an X-form is both. Inside the conceiving space, there are a lot of X-forms. We can find that some X-forms are better, and choose them. And we use existing X-forms to form new X-forms that could be better. These actions are actually the learning dynamics. The following learning strategies will tell us how to do the learning.

Lemma 5.2. If E is an X-form, then there is a boolean circuit C so that ∀p ∈ B^N, E(p) = C(p), and d(C) = d(E) + L, where d(C) is the number of nodes of C, d(E) is the number of operators ∧ and ∨ in E, and L is an adjusting number.

Proof: E is an X-form; according to the definition, there is an algebraic expression of the 3 operators ∨, ∧, ¬ (OR, AND, NOT) over a group of base patterns g = {p1, p2, . . . , pK}, and E = E(p1, p2, . . . , pK). Note, E is almost a boolean circuit; only 2 things are different. One is: in E, there are 3 operators, and ¬ is viewed as an operator. But in a boolean circuit C, ¬ is integrated into the nodes. So, if we only count the ∨ and ∧ operators in E, we get the number of nodes of C. The other difference is: E is based on the base patterns {p1, p2, . . . , pK}. But we can write each base pattern p1, p2, . . . into the form p = (. . . ((s1 b1 ∧ s2 b2) ∧ s3 b3) . . . ∧ sN bN), where the s_i, i = 1, 2, . . . , N, are: if b_i = 1, s_i = id; if b_i = 0, s_i = ¬. We do the same for p2, etc. (see Lemma 4.3, the circuit of a single vector). Connecting these circuits with E, we have the boolean circuit C that expresses the X-form E. Also, d(C) = d(E) + L, where L depends on 1) the number of ¬ in E, and 2) the number of nodes used in p1 etc., which is K(N − 1). □

This lemma tells us that we can get a boolean circuit from an X-form. Reversely, we can also get an X-form from a boolean circuit.
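As a small illustration of the conversion in the proof (our sketch, with hypothetical helper names): each base pattern becomes the AND-chain circuit of a single vector, and the X-form's operators are stacked on top. Using the XOR form of Example 5.1 over the base patterns (1,0) and (0,1):

```python
def vector_circuit(v):
    """single-vector circuit (cf. Lemma 4.3): outputs 1 exactly on input v,
    built from N-1 AND nodes with negation on the connections where v_i = 0"""
    return lambda x: int(all(xi == vi for xi, vi in zip(x, v)))

# the two base patterns of the XOR example, realized as circuits
p1 = vector_circuit((1, 0))
p2 = vector_circuit((0, 1))

def C(x):
    """circuit for the X-form E(p1, p2) = (p1 OR p2) AND NOT(p1 AND p2)"""
    a, b = p1(x), p2(x)
    return (a | b) & (1 - (a & b))
```

On B² this circuit agrees with b1 ⊕ b2 at every point, with d(E) counting the ∧ and ∨ operators of the expression and L absorbing the ¬ and the K(N − 1) nodes inside the base-pattern circuits.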
Lemma 5.3. If C is a boolean circuit over B^N, it is an X-form E as well, and d(C) = d(E) + L, where d(C) is the number of nodes of C, d(E) is the number of operators ∧ and ∨ in E, and L is an adjusting number.
Proof: C is a boolean circuit, so it is as follows: there is an algebraic expression E of the 3 operators ∨, ∧, ¬ (OR, AND, NOT) over the group of base patterns g = {b1, b2, . . . , bN}, and C = E(b1, b2, . . . , bN). Clearly, E is an X-form. We also see d(C) = d(E) + L. □

We point out here: C is a boolean circuit, which is objective. But E is an X-form, which could have subjective factors. One circuit could be several different X-forms; the way to form an X-form from a circuit is not unique. We see some examples below. Example 5.2 (X-form and Circuit).
Some examples of X-forms and circuits.

Suppose N = 3. We have a boolean circuit C: C(b1, b2, b3) = b1 ⊕ b2. This boolean circuit is actually an X-form in this way: let p1 = (1,0,0), p2 = (0,1,0); p1, p2 are both base patterns, and with the algebraic expression E(p1, p2) = (p1 ∨ p2) ∧ ¬(p1 ∧ p2), E is one X-form. We can see ∀v ∈ B³, E(v) = C(v).

Suppose N = 4 and we have a group of base patterns {p1, p2, p3}, p1 = (1,0,0,0), p2 = (0,1,0,0), p3 = (0,0,1,0), and E(p1, p2, p3) = p1 ∨ p2 ∨ p3 ∨ (p1 ∧ p2). They form an X-form E. This X-form E is equivalent to a boolean circuit C over B⁴.

From the above lemmas, we know that X-forms are equivalent to boolean circuits. Thus, looking for a better X-form is equivalent to looking for a better boolean circuit. FE and PSS provide us a new set of tools for finding a better circuit, thus a better X-form. Learning Strategies by Using Fitting Extremum and PSS
In [2], we discussed learning dynamics and suggested several learning strategies. As a consequence of those discussions, we showed that deep learning can be explained by the learning strategy called "Embed X-forms into Parameter Space". From its root, this learning strategy needs a lot of human intervention, which is not desirable. In order to achieve learning without human intervention, we invented other strategies, called "Squeeze X-form from Inside to Higher Abstraction" and "Squeeze X-form from Inside and Outside to Higher Abstraction". We showed that if we have data that are sufficient to bound and sufficient to support the X-form, these 2 strategies could realize universal learning (i.e. be able to learn anything learnable without human intervention).

However, these learning strategies are not good enough: they need huge data (sufficient to bound and sufficient to support is often equivalent to huge data) and depend on some capabilities that are still under development. In fact, we know very clearly that these learning strategies are just our first attempt in the study of universal learning machine. They helped us gain theoretical understanding, but they are not practical. We need better methods. Now, with the newly invented tools, i.e. FE and PSS, we can design much better learning strategies.

Suppose the learning machine is M, the conceiving space of M is C, and the current X-form in C is E. We also denote the input data as D = {(b_j, o_j) | j = 1, 2, . . .}. In this framework, the learning is: driven by the input data, the current X-form E moves toward the X-form that we desire. The learning strategy is how to move/change E, effectively and efficiently. Here, we design 2 strategies, both based on FE and PSS. The first strategy does learning purely objectively, while the second utilizes the subjective view of the machine. We discuss the 2 strategies separately below.

Suppose the data inputs are D = {(b_j, o_j) | j = 1, 2, . . .}, where the b_j ∈ B^N are base patterns as input, and o_j is the value the output should take; o_j could be empty. If o_j is not empty, o_j ∈ B; this means we know the output of the information processing. If o_j is empty, it means that we do not know (or do not need to know) the output of the information processing. If in learning each o_j is not empty, it is supervised learning. Learning Strategy – Objectively Using Fitting Extremum
We can call this strategy Strategy OF. For Strategy OF, we need to put one requirement on its data input: in the data input D = {(b_j, o_j) | j = 1, 2, . . . , K}, b_j ∈ B^N, o_j ∈ B, and o_j is not empty, for all j. We summarize Strategy OF as:

1. At each step, the current X-form is E.
2. At first, the initial X-form is E0, which could be any X-form. Set E = E0.
3. Start from the first data input: (b1, o1).
4. At the J-th step, J < K, the data input is (b_J, o_J). First check whether E(b_J) = o_J. If it is true, this step is done, there is no need to do more, and we go to the next step.
5. If E(b_J) ≠ o_J, then we need to update E. The way to update is: form the sampling set with values Sv_J = {[b_j, o_j] | j = 1, 2, . . . , J}, then do FE on Sv_J to generate a circuit C, then use this C to replace E.
6. Decide whether to continue learning. If so, go to the next step.

Strategy OF is purely driven by data, i.e. the learning machine M learns objectively according to the incoming data. This is why we call it "objectively using FE". We have the following theorem about Strategy OF. Theorem 5.4 (Strategy OF).
Suppose a learning machine M, and suppose the data D = {(b_j, o_j) | j = 1, 2, . . . , K} are used to drive learning with Strategy OF. If the desired X-form is E_d, and the sampling set Sv_J = {[b_j, o_j] | j = 1, 2, . . . , J} is a PSS for E_d for some J < K, then, starting from any X-form E, eventually M will learn E_d, i.e. the current X-form E will become the desired X-form E_d. Proof:
It is easy to see the proof. Since for some J < K the sampling set Sv_J is a PSS for E_d, when we do FE on Sv_J, the circuit generated will be E_d. That is to say, once the data feed is long enough (i.e. greater than J), the current X-form becomes E_d. □ Corollary 5.5.
A learning machine M with Strategy OF is a universal learning machine. Proof:
For any given starting X-form E and any desired X-form E_d, if we give data input that contains a PSS for E_d, then without any human intervention, M will learn E_d. That is to say, M is a universal learning machine. □

Compared with the other learning strategies we discussed before, the advantage of Strategy OF is very clear: it needs much less data. It only needs a data set that includes a PSS for the desired X-form, which is much smaller than data sufficient to bound and sufficient to support. This makes learning much better and faster. Another advantage is that Strategy OF gives a definitive method for the evolution of the current X-form. In the other methods we discussed in [2], we only assumed some learning capabilities that are still waiting to be realized. With Strategy OF, we are ready to put universal learning machine into a practical stage.

One thing we need to state again: Strategy OF requires that the data o_j are not empty. This is a very big restriction. We then turn to another learning strategy. In this strategy, we utilize the subjective view of the machine, which makes learning better. Compared to the pure objective way, the subjective way is better in many aspects. One such aspect is: the data o_j could be empty for some j. Learning Strategy – Subjectively Using Fitting Extremum
We can call this strategy Strategy SF. In Strategy OF, we use FE and PSS purely objectively, and we require that o_j is not empty for all j. In Strategy SF, we will utilize the subjective view of the machine in learning, and some o_j could be empty. We summarize this strategy as:

1. At each step, the current X-form is E.
2. At first, the initial X-form is E0, which could be any X-form. Set E = E0.
3. In the conceiving space C, maintain a set of X-forms that are available to be used. Denote this set of X-forms as X. This set X is super important: when we are looking for an X-form to be used, we look only in X. M will subjectively maintain this set X (of course, driven by the data).
4. Start from the first data input: (b1, o1).
5. At the J-th step, J < K, the data input is (b_J, o_J). If o_J is not empty, check whether E(b_J) = o_J. There are 3 situations: 1) o_J is empty; 2) o_J is not empty and E(b_J) = o_J; 3) o_J is not empty and E(b_J) ≠ o_J.
6. For situation 1), do subjective actions to maintain the set X.
7. For situation 2), do subjective actions to maintain the set X. Also, keep o_J and the information that E(b_J) = o_J.
8. For situation 3), we need to update E to fit the data. First form a sampling set with values Sv_J = {[b_i, o_i] | i = 1, 2, . . . , I_J}, where the [b_i, o_i] are the pairs of data input whose o_i is not empty, and I_J is the number of such pairs up to step J. Then do FE on Sv_J, but with the available X-forms chosen from X. Suppose the circuit generated by FE on Sv_J over X is C, and the X-form associated with C is E′; then use E′ to replace the current X-form.
9. Decide whether to do more learning. If so, go to the next step.

Strategy SF is ultimately driven by the input data, but there are significant subjective actions. This is why we call it "subjectively using FE". We have the following theorem about this strategy. Theorem 5.6 (Strategy SF).
Suppose a learning machine M, and suppose the data D = {(b_j, o_j) | j = 1, 2, . . . , K} are used to drive learning with Strategy SF. If the desired X-form is E_d, and there is a sampling set Sv_I = {[b_i, o_i] | i = 1, 2, . . . , I} embedded in D, and Sv_I is a PSS for E_d, then, starting from any X-form E, eventually M will learn E_d, i.e. the current X-form E will become the desired X-form E_d. Proof:
Suppose the subjective actions in learning are in the right direction, so that eventually X will have E_d inside it, and the sampling set Sv_I is eventually used. Since there is a sampling set Sv_I embedded in the data D, and it is a PSS for E_d, when we do FE on Sv_I, the circuit generated will be E_d. That is to say, eventually, the current X-form is E_d. □ Corollary 5.7.
A learning machine M with Strategy SF is a universal learning machine.

By using subjective actions, it is possible to speed up the learning very substantially if these subjective actions are in the right direction (the performance could become worse if the subjective actions are not good). So, Strategy SF could learn much faster than Strategy OF. What the subjective actions are and how to do them efficiently is actually a big question. We will discuss this elsewhere.
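Both strategies share the same skeleton: keep the current form, and refit by FE only when a labeled input contradicts it. A minimal executable sketch of Strategy OF for N = 2 (our names and representation, not the paper's; X-forms are represented by truth tables, and `fe` is the same brute-force search over AND/OR circuits as before — Strategy SF would additionally restrict this search to the maintained pool X):

```python
from itertools import product

N = 2
POINTS = list(product([0, 1], repeat=N))

def fe(samples):
    """stand-in FE oracle: truth table of a smallest AND/OR circuit
    (literals, negated literals and constants cost 0 gates) consistent
    with `samples`, a dict mapping points of B^N to output values"""
    by_size = [set()]
    for i in range(N):
        by_size[0].add(tuple(p[i] for p in POINTS))
        by_size[0].add(tuple(1 - p[i] for p in POINTS))
    by_size[0] |= {tuple(0 for _ in POINTS), tuple(1 for _ in POINTS)}
    for g in range(1, 5):
        new = set()
        for sl in range(g):
            for tl in by_size[sl]:
                for tr in by_size[g - 1 - sl]:
                    new.add(tuple(a & b for a, b in zip(tl, tr)))
                    new.add(tuple(a | b for a, b in zip(tl, tr)))
        by_size.append(new)
    for sized in by_size:
        for t in sorted(sized):
            if all(t[POINTS.index(b)] == o for b, o in samples.items()):
                return t
    return None

def strategy_of(data):
    """Strategy OF: on a mismatch E(b_J) != o_J, replace E by FE on the
    sampling set Sv_J of all data seen so far."""
    E = tuple(0 for _ in POINTS)          # initial X-form E0: constant 0
    seen = {}
    for b, o in data:
        seen[b] = o
        if E[POINTS.index(b)] != o:       # step 5: current form fails on b_J
            E = fe(seen)
    return E
```

Feeding the four XOR samples (a PSS of XOR), the current form ends at the XOR truth table (0, 1, 1, 0), illustrating Theorem 5.4.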
We make some comments about FE, PSS and learning dynamics.

1. FE+PSS (fitting extremum and proper sampling set) are important tools. They are highly related to machine epistemology, i.e. how a machine learns a rule in its environment and how the machine represents the learned rule inside itself. FE+PSS tell us: the rule is in fact inside a set of data (data containing a PSS), and if the machine keeps looking for better representations (X-forms) with least cost (fewest nodes), eventually the machine learns the rule fully. This has a very strong epistemological meaning and is worth deep study. We will discuss this issue elsewhere.
2. We proved the fundamental relationship between PSS and the complexity of boolean circuits. This gives us a strong tool to study computational complexity. We will explore this in the next study. This fundamental relationship between PSS and computational complexity actually reflects the intrinsic relationship between learning and computational complexity, and such an intrinsic relationship is the very core of learning.
3. FE reveals why generalization can be achieved in mechanical learning. From the view of FE, we see generalization very naturally, no longer with surprise.
4. With FE+PSS and the learning strategies OF and SF, universal learning machine is no longer just theoretically true, but is in a practical stage. Our previous papers discussed other learning strategies. But Strategy OF and Strategy SF are much different, and much better; they are ready to be used in engineering practice.
References
Appendix
In this appendix, we prove the 2 lemmas and 2 theorems stated in section 4, i.e. Expansion of Sampling, PSS Implies Circuit and Circuit Implies PSS. We first put PSS Implies Circuit below.
PSS Implies Circuit: If f is a boolean function f : B^N → B, S ⊂ B^N is a PSS for f, and |S| is the size of the PSS, then there is a circuit C expressing f with d(C) < N|S|. Proof:
Now, let K = |S| and S = {v1, v2, . . . , vK}. Let S1 = {v_j ∈ S | f(v_j) = 1}, and for each v_j ∈ S1, let C_{v_j} be the circuit expressing the single vector v_j: if x = v_j, C_{v_j}(x) = 1, and if x ≠ v_j, C_{v_j}(x) = 0. Using them, we form one circuit C_f as the disjunction of the C_{v_j} over S1 (if S1 is empty, take C_f as the constant 0). For x ∈ S with f(x) = 1, the term of x fires, so C_f(x) = 1; for x ∈ S with f(x) = 0, no term fires, so C_f(x) = 0. That is, ∀x ∈ S, C_f(x) = f(x), i.e. the circuit C_f fits with S. Also, C_f has at most K − 1 "∨" nodes, and each C_{v_j} has N − 1 "∧" nodes, so d(C_f) ≤ K − 1 + K(N − 1) = KN − 1 < N|S|. That is to say, there is a circuit C_f fitting with S and d(C_f) < N|S|.

Therefore, if a circuit C is the circuit generated by fitting extremum from f and S, since S is a PSS, C must express f. And d(C) ≤ d(C_f) < N|S|. □

For Circuit Implies PSS, we need some lemmas first. Suppose C is a circuit and w is one node of C; then for each vector b ∈ B^N, w takes some value accordingly. We call this value the value at node w for input b, denoted w(b). If w is the top node, then w(b) is the value of the circuit for input b, i.e. C(b) = w(b). Here is a lemma that tells us about the values at the nodes of a circuit. Lemma 6.1 (Value at Node).
Suppose f : B^N → B is a boolean function, C is a boolean circuit expressing f, and d(C) reaches the minimum. Then, for any node w in C, the values at the 2 nodes w_L, w_R directly underneath w must satisfy the following rule: for each type of connection configuration (there are 8 types in total), there must be inputs b1, b2, b3 ∈ B^N so that (w_L(b1), w_R(b1)), (w_L(b2), w_R(b2)), (w_L(b3), w_R(b3)) take the values specified below. Proof:
There are 8 connection configurations: [∨ ∨], [∨ ∨¬], [∨¬ ∨], [∨¬ ∨¬], [∧ ∧], [∧ ∧¬], [∧¬ ∧], [∧¬ ∧¬].

First consider [∨ ∨]. We want to show: there must be 3 inputs b1, b2, b3 ∈ B^N so that (w_L(b1), w_R(b1)) = (0,0), (w_L(b2), w_R(b2)) = (1,0), (w_L(b3), w_R(b3)) = (0,1). If there were no b ∈ B^N so that (w_L(b), w_R(b)) = (0,0), then at w the value would always be 1. In this case, the circuit C could be simplified to another circuit C′ with ∀b ∈ B^N, C(b) = C′(b) and d(C′) < d(C). This contradicts that d(C) reaches the minimum. So, there is at least one b1 ∈ B^N so that (w_L(b1), w_R(b1)) = (0,0). Next, if there were no b ∈ B^N so that (w_L(b), w_R(b)) = (1,0), then for any b ∈ B^N there would be only 3 possibilities: (w_L(b), w_R(b)) = (0,0) or (1,1) or (0,1). In each of these, the value at w equals the value at w_R, i.e. w(b) = w_R(b), ∀b ∈ B^N. So, we could eliminate the node w_L without modifying the value of w. In this case, the circuit C could again be simplified to another circuit C′ with ∀b ∈ B^N, C(b) = C′(b) and d(C′) < d(C), a contradiction. So, there is at least one b2 ∈ B^N so that (w_L(b2), w_R(b2)) = (1,0). By the same argument, there is at least one b3 ∈ B^N so that (w_L(b3), w_R(b3)) = (0,1).

Then consider [∧ ∧]. By very similar arguments, we can show: there is at least one b1 ∈ B^N so that (w_L(b1), w_R(b1)) = (1,1), at least one b2 ∈ B^N so that (w_L(b2), w_R(b2)) = (1,0), and at least one b3 ∈ B^N so that (w_L(b3), w_R(b3)) = (0,1).

The same arguments give the values for all 8 configurations:
For [∨ ∨], there are 3 inputs so that the values at nodes w_L and w_R are: (0,0), (1,0), (0,1).
For [∨ ∨¬], there are 3 inputs so that the values at nodes w_L and w_R are: (0,1), (1,1), (0,0).
For [∨¬ ∨], there are 3 inputs so that the values at nodes w_L and w_R are: (1,0), (0,0), (1,1).
For [∨¬ ∨¬], there are 3 inputs so that the values at nodes w_L and w_R are: (1,1), (0,1), (1,0).
For [∧ ∧], there are 3 inputs so that the values at nodes w_L and w_R are: (1,1), (1,0), (0,1).
For [∧ ∧¬], there are 3 inputs so that the values at nodes w_L and w_R are: (1,0), (1,1), (0,0).
For [∧¬ ∧], there are 3 inputs so that the values at nodes w_L and w_R are: (0,1), (0,0), (1,1).
For [∧¬ ∧¬], there are 3 inputs so that the values at nodes w_L and w_R are: (0,0), (0,1), (1,0). □

By observing the results of Lemma 6.1, we can see something very interesting and useful. First consider the case where at w the connection configuration is [∨ ∨]; then we have b1, b2, b3 so that the values at w_L, w_R are (0,0), (1,0), (0,1). Let S_L = {b1, b2}, S_R = {b1, b3}; then ∀x ∈ S_L, w_R(x) = 0, and ∀x ∈ S_R, w_L(x) = 0.
Also, ∀x ∈ S_L, w(x) = w_L(x), and ∀x ∈ S_R, w(x) = w_R(x).

Then consider the case where at w the connection configuration is [∧ ∧]; then we have b1, b2, b3 so that the values at w_L, w_R are (1,1), (1,0), (0,1). Let S_L = {b1, b3}, S_R = {b1, b2}; then ∀x ∈ S_L, w_R(x) = 1, and ∀x ∈ S_R, w_L(x) = 1. Also, ∀x ∈ S_L, w(x) = w_L(x), and ∀x ∈ S_R, w(x) = w_R(x).

For all other types of connection configurations, we have similar results. These results are important for later use. Lemma 6.2 (Expansion of Sampling).
Suppose f : B^N → B is a boolean function and S ⊂ B^N is a PSS for f. We expand the sampling: let b ∈ B^N, b ∉ S, and S′ = S ∪ {b}, and set the value on b different from f(b). If D is a circuit fitting with S with d(D) reaching the minimum, and D′ is a circuit generated by FE on S′, then d(D) < d(D′). Proof:
Since S is a PSS and d(D) reaches the minimum, the circuit D must express f. Now, let D′ be a circuit generated by FE on S′. Since D′ fits with S′, it fits with S, so by the definition of the fitting extremum, d(D) ≤ d(D′). Further, if d(D′) = d(D), then the circuit D′ fits with S and its number of nodes reaches the minimum. Since S is a PSS, such a circuit D′ must express f. However, the value of D′ on b is different from f(b), as D′ fits with S′. This is a contradiction. The contradiction tells us d(D′) = d(D) is wrong. Thus, we must have d(D) < d(D′). □ We can have a weaker version.
Lemma 6.3 ( Expansion of Sampling, Weaker).
Suppose S ⊂ B^N is a sampling set, not necessarily a PSS, and Sv = {[s, v] | s ∈ S, v = 0 or 1} is a sampling set with value over S. Suppose D is a circuit that fits with Sv, and d(D) reaches minimum. Suppose b ∈ B^N, b ∉ S; we expand the sampling set with value as Sv' = Sv ∪ {[b, v]}, where v is a value different than D(b). If D' is a circuit generated from FE on Sv', then d(D) < d(D'). Proof:
We want to use the above lemma (Expansion of Sampling). The problem is: S is not necessarily a PSS, so we need to make some additional arguments. Define a boolean function f(t) = D(t), ∀ t ∈ B^N. If S is a PSS for f, then we can apply the above lemma and the proof is done. If S is not a PSS for f, we can add some points to S (with the values given by f) to get a sampling set S*, so that S* becomes a PSS of f. This surely can be done. In this case, D is still a circuit generated from FE on S*. Then we can apply the above lemma, and the proof is done. □

The above 2 lemmas tell us this: if the sampling expands, then the circuit generated by FE on the sampling expands as well. That is to say, for a more complicated sampling, the circuit generated from FE on it must be bigger, with more nodes. This is one fundamental fact that plays an important role. Next, we show how to join PSSs to form a new PSS.
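The effect described in Lemmas 6.2 and 6.3 can be seen in a tiny brute-force computation. The sketch below is our own illustration, not from the paper: circuits over B^2 are enumerated as trees whose leaves are the input variables and whose nodes carry one of the 8 connection configurations, with d counting internal nodes; a sampling set fitted by the single wire b_1 is then expanded by a point whose value disagrees with b_1, and the fitting extremum grows.

```python
from itertools import product

# Brute-force FE over B^2: map each truth table to the minimal node count d
# of a circuit realizing it (leaves = variables, negations on the edges).
N = 2
POINTS = list(product((0, 1), repeat=N))

OPS = [lambda l, r: l | r, lambda l, r: l | (1 - r),
       lambda l, r: (1 - l) | r, lambda l, r: (1 - l) | (1 - r),
       lambda l, r: l & r, lambda l, r: l & (1 - r),
       lambda l, r: (1 - l) & r, lambda l, r: (1 - l) & (1 - r)]

best = {tuple(p[i] for p in POINTS): 0 for i in range(N)}  # leaves b_1, b_2
for _ in range(3):  # enough rounds to reach all 16 functions of 2 variables
    for (tl, dl), (tr, dr) in list(product(best.items(), repeat=2)):
        for op in OPS:
            t = tuple(op(a, c) for a, c in zip(tl, tr))
            if dl + dr + 1 < best.get(t, dl + dr + 2):
                best[t] = dl + dr + 1

def fe(samples):
    """Minimal d over all circuits fitting a sampling set with values."""
    return min(d for t, d in best.items()
               if all(t[POINTS.index(x)] == v for x, v in samples))

Sv = [((1, 0), 1), ((0, 1), 0)]   # fitted minimally by the wire b_1, d = 0
d0 = fe(Sv)
Sv2 = Sv + [((1, 1), 0)]          # new point, value different from b_1(1,1)
d1 = fe(Sv2)                      # now needs e.g. b_1 AND NOT b_2, d = 1
assert d0 < d1                    # the circuit from FE must grow (Lemma 6.3)
print(d0, d1)                     # prints: 0 1
```

The enumeration scheme (truth tables keyed to minimal node counts) is just one convenient way to compute FE exactly on such small domains.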
Lemma 6.4 (Join PSS). If f : B^N → B is a boolean function, circuit C expresses f, and d(C) reaches minimum, then C must be in the form C = L ◦ R, where ◦ is the connection configuration of the top node (there are 8 types, see Lemma 6.1), and L, R are the 2 sub-circuits of C. Suppose S_L, S_R ⊂ B^N are 2 sets with the properties: 1) S_L is a PSS of L and S_R is a PSS of R; 2) ∀ x ∈ S_L, f(x) = L(x) and ∀ x ∈ S_R, f(x) = R(x). Then the set S = S_L ∪ S_R is a PSS of f.
Proof:
We can think of a process that seeks a circuit D fitting S_L ∪ S_R while keeping d(D) lowest. We start from S_L and do FE on S_L; suppose we get circuit D_L. Since S_L is a PSS for L, we must have ∀ x ∈ B^N, D_L(x) = L(x). The next step is to modify the circuit D_L to get a circuit D_{L+R} so that D_{L+R} keeps the values D_L(x), ∀ x ∈ S_L, fits with S_R, and makes d(D_{L+R}) lowest. The only possible choice is D_{L+R} = D_L ◦ D_R, where D_R is a circuit from FE on S_R. Since S_R is a PSS for R, we must have ∀ x ∈ B^N, D_R(x) = R(x). Thus, ∀ x ∈ B^N, D_{L+R}(x) = D_L(x) ◦ D_R(x) = C(x) = f(x). This tells us that S_L ∪ S_R is a PSS of f. □

This lemma tells us one very essential property of a PSS: it must grasp the characteristics of each branch, and can distinguish one branch from the other. Using this property, we know how to pick up a PSS from a circuit.
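Lemma 6.4 can be checked by brute force on a tiny instance. The sketch below is our own toy example, not from the paper: we take f = C = b_1 ∧ b_2 over B^3, branches L = b_1 and R = b_2, and the three inputs x_1, x_2, x_3 supplied by Lemma 6.1 for the configuration [∧ ∧] (the concrete vectors and helper names are our choices); we verify that S_L and S_R are PSSs of the branches, that f agrees with each branch on its set, and that the union is a PSS of f.

```python
from itertools import product

# Enumerate circuits over B^3 by truth table, keeping the minimal node
# count d (leaves = variables b_1, b_2, b_3; negations on the edges).
N = 3
POINTS = list(product((0, 1), repeat=N))

OPS = [lambda l, r: l | r, lambda l, r: l | (1 - r),
       lambda l, r: (1 - l) | r, lambda l, r: (1 - l) | (1 - r),
       lambda l, r: l & r, lambda l, r: l & (1 - r),
       lambda l, r: (1 - l) & r, lambda l, r: (1 - l) & (1 - r)]

best = {tuple(p[i] for p in POINTS): 0 for i in range(N)}
for _ in range(3):
    for (tl, dl), (tr, dr) in list(product(best.items(), repeat=2)):
        for op in OPS:
            t = tuple(op(a, c) for a, c in zip(tl, tr))
            if dl + dr + 1 < best.get(t, dl + dr + 2):
                best[t] = dl + dr + 1

def minimal_fits(samples):
    """Truth tables of the minimal circuits fitting a sampling set with values."""
    fits = {t: d for t, d in best.items()
            if all(t[POINTS.index(x)] == v for x, v in samples)}
    dmin = min(fits.values())
    return {t for t, d in fits.items() if d == dmin}

x1, x2, x3 = (1, 1, 0), (1, 0, 0), (0, 1, 0)   # (L, R) = (1,1), (1,0), (0,1)
f = lambda b: b[0] & b[1]                       # f = C = b_1 AND b_2
S_L, S_R = [x1, x3], [x1, x2]
tb = lambda g: tuple(g(p) for p in POINTS)      # truth table of a function

# S_L is a PSS of L = b_1, and S_R is a PSS of R = b_2 ...
assert minimal_fits([(x, x[0]) for x in S_L]) == {tb(lambda b: b[0])}
assert minimal_fits([(x, x[1]) for x in S_R]) == {tb(lambda b: b[1])}
# ... and f agrees with each branch on its set, so the union is a PSS of f:
assert minimal_fits([(x, f(x)) for x in S_L + S_R]) == {tb(f)}
print("joined PSS of size", len(set(S_L + S_R)))
```

Note the joined set has only 3 points, matching the height-1 case discussed below.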
Pick up PSS by using circuit:
We are going to pick up a sampling set from a given circuit. Suppose f : B^N → B is a boolean function, circuit C expresses f, and d(C) reaches minimum. We are going to pick up a sampling set by using Lemma 6.4, which tells us how to join the PSSs of 2 branches together to form a PSS.

We first consider the simplest circuits, the circuits with height 1. In order to make writing easier, we consider B^3. Such a circuit C must be in the form C = L ◦ R, where ◦ is one of the 8 types of connection configuration shown in Lemma 6.1, and L, R are the 2 sub-circuits of C. In this case, due to height 1, we must have L = b_i, R = b_j, i, j = 1, 2, 3, i ≠ j.

As Lemma 6.4 tells us, if we can find a PSS for L and a PSS for R that satisfy certain conditions, then joining these 2 PSSs gives a PSS for C. Consider one example: C = L ∧ R, L = b_1, R = b_2. By Lemma 6.1, for the configuration [∧ ∧] there are 3 inputs so that the values at (L, R) are (1,1), (1,0), (0,1); for example, x_1 = (1,1,0), x_2 = (1,0,0), x_3 = (0,1,0). It is easy to see that there are several choices for a PSS of L, and likewise several choices for a PSS of R. However, the sets S_L = {x_1, x_3} and S_R = {x_1, x_2} have the properties: S_L is a PSS of L, S_R is a PSS of R, ∀ x ∈ S_L, C(x) = L(x), and ∀ x ∈ S_R, C(x) = R(x). This property is essential. With it, by Lemma 6.4, S = S_L ∪ S_R is a PSS for C.

This is for the top node as ∧, but we can do exactly the same for the other types of node (see Lemma 6.1). This is how to pick up a PSS from a circuit with height 1. For height 1, clearly, |S| = 3 and d(C) = 1. So, |S| ≤ 3 d(C).

For a circuit C expressing f with d(C) reaching minimum, any sub-circuit D of C expresses a boolean function, which we also denote by D, i.e. x → D(x), ∀ x ∈ B^N. It is easy to see that d(D) also reaches minimum (otherwise C could be made smaller).
So, we pick up a sampling set in this way: for the 2 branches L, R of D, we can have S_L and S_R with the properties: S_L is a PSS for L, S_R is a PSS for R, ∀ x ∈ S_L, D(x) = L(x), and ∀ x ∈ S_R, D(x) = R(x). Then, S = S_L ∪ S_R is a PSS for D, and |S| ≤ 3 d(D). We do this for all sub-circuits of C, from the bottom up, until we finally reach the top of C. In this way, we eventually get 2 sampling sets S_L and S_R for C, so that S = S_L ∪ S_R is a PSS for C, and |S| ≤ 3 d(C). □

The above process already shows: a circuit implies a PSS. We state this again below.
Circuit implies PSS: Suppose f : B^N → B is a boolean function, a circuit C expresses f, and d(C) reaches minimum. Then, we can pick up a sampling set S so that S is a PSS of f, and |S| ≤ 3 d(C).

We consider some simple examples of how to pick a PSS from a circuit.
Example 6.1 ( Example of PSS).