A Constructive, Type-Theoretic Approach to Regression via Global Optimisation
Dan R. Ghica and Todd Waugh Ambridge, School of Computer Science, University of Birmingham, UK
Abstract
We examine the connections between deterministic, complete, and general global optimisation of continuous functions and a general concept of regression, from the perspective of constructive type theory, via the concept of 'searchability'. We see how the property of convergence of global optimisation is a straightforward consequence of searchability. The abstract setting allows us to generalise searchability and continuity to higher-order functions, so that we can formulate novel convergence criteria for regression, derived from the convergence of global optimisation. All the theory and the motivating examples are fully formalised in the proof assistant Agda.
Introduction

For some given objective function f and set of equalities, inequalities, or arbitrary constraints S, the central goal of global optimisation is to compute, with mathematical guarantees, the global minimum of f subject to S. Global optimisation has numerous obvious applications in all areas of engineering and computational sciences, as it gives a general recipe for solving problems of arbitrary complexity. As an area of research, the study of global optimisation algorithms is mature, with a recent survey indicating more than twenty textbooks and research monographs in the last few decades [12]. Global optimisation algorithms fall under several categories, but in this paper we will focus on algorithms that are:

General:
Algorithms may take into account information about the shape of the function; for example, the minimisation of functions with convex envelopes is intensively studied [27]. In contrast, we will make minimal assumptions of this nature.
Complete:
An incomplete algorithm makes no guarantees regarding the quality of the solution it arrives at, focussing on efficiency via sophisticated heuristics rather than correctness. The typical example of an incomplete algorithm is gradient descent, which will only find a local minimum of a function [23]. In contrast, we provide mathematical guarantees that a solution is indeed optimal within some margin of error.
Deterministic:
A randomised algorithm can offer an asymptotic guarantee that the optimum is reached, with probability one, without actually knowing when it has been reached [26]. In contrast, we will give strong termination guarantees for the algorithm.
Continuous:
Many global optimisation algorithms deal with discrete problems, such as branch-and-bound [14]. In contrast, we will focus on the minimisation of continuous functions.

To summarise, in this paper we will concentrate on general, complete, continuous, deterministic global search, which finds one guaranteed optimal-within-epsilon global minimum of a continuous function [18]. In the sequel, by 'optimisation' this is precisely what we mean. The first important results in the area relevant to our work appeared in the 1960s and 70s: the optimisation of rational functions using interval arithmetic by Moore and Young [16], which was then generalised to Lipschitz-continuous functions by Piyavskii [21]. The idea of the algorithm is rather simple. By splitting the domain of the function into intervals, we impose a certain degree of precision on the horizontal axis. The Lipschitz constant will then bound the growth of the function on each interval, thus allowing us to calculate a precision on the vertical axis. In effect, we can 'discretise' the function with known precision along both input and output, which makes the problem decidable (a minimal sketch is given at the end of this introduction). It also allows the application of efficient discrete algorithms, such as branch-and-bound, to continuous optimisation [25].

One of the important and immediate applications of optimisation is regression, broadly construed: finding some parameters for a model so that a target error (loss) function is minimised. This connection is so intuitive and obvious that it is rather surprising that it is not expressed more emphatically in the literature. This broad formulation of regression captures not just conventional regression problems (linear regression, polynomial regression, etc.) but virtually all machine learning algorithms that are sometimes referred to as 'curve fitting' [19].
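As an illustration of the discretisation idea, consider the following Haskell sketch (ours, not part of the paper's formal development): assuming a Lipschitz constant lip for f on [a, b], a grid of spacing eps / lip is fine enough that the minimum over the grid is within eps of the true global minimum value.

    -- Sketch of Lipschitz-based global minimisation (illustration only).
    -- Assumes |f x - f y| <= lip * |x - y| for all x, y in [a, b].
    minLipschitz :: Double -> (Double -> Double) -> Double -> Double -> Double -> Double
    minLipschitz lip f a b eps = minimum (map f grid)
      where
        step = eps / lip                           -- horizontal precision
        n    = ceiling ((b - a) / step) :: Integer
        grid = [a + fromIntegral i * step | i <- [0 .. n]]

Any point of [a, b] lies within step of a grid point, so the value returned is within eps of the true minimum; this is the sense in which discretisation makes the problem decidable.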
A new approach: searchable types

The inspiration for our new approach to global search and regression is earlier work on searchability [7, 8], concerning the construction of algorithms (selection functions) for finding elements of compact spaces satisfying a (computable) predicate. Finite sets are trivially compact, and so are trivially searchable. However, certain infinite sets are also searchable, by Tychonoff's theorem, which states that the product space of any set of compact spaces is itself compact. The infinite product of a set X is given by the function space N → X, whose elements are infinitary sequences of elements of X. These infinitary sequences are therefore, in a certain sense, searchable, which is somewhat surprising. This development is particularly interesting in the context of constructive real numbers, as the computable elements of compact intervals of R can be represented as infinitary sequences of digits taken from a finite set D. In this work, a constructive Tychonoff-style theorem is utilised to search these representation spaces N → D of constructive real numbers, relative to certain explicit continuity conditions.

Contributions
Our paper establishes new connections between several areas: global optimisation, regression, searchable types, and constructive real numbers. This is its most important contribution. Our paper also makes a technical contribution to the study of searchable types by adding an explicit requirement of continuity to the key theorems, which allows us to formulate our key proofs in a way that is compatible with proof assistants based on constructive type theory, namely Agda.
This means that the entirety of our proofs is fully formalised. Another significant contribution is a more general methodological perspective on global optimisation and especially regression. In fact, the bulk of our paper is spent on regression, as formulated in our type-theoretic framework for searchability. The advantage of the type-theoretic framework is that we can generalise the formulation of convergence of global search from Rⁿ to the more general setting of searching on S-types, our own version of searchable types.

Our first result is straightforward (Thm. 2): that regression can be formulated as a global minimisation property, which has a deterministic, optimal-within-epsilon solution. However, we note that this is not an actual convergence property for regression, in the sense in which the Weierstrass theorem follows by interpolation (see [20] for an informal survey of this issue). Regression, unlike interpolation, relies on a prior assumption about the model which, if wrong, will prevent convergence no matter how precisely we calculate its parameters. So Thm. 2 only states that a solution converges on a 'best guess'.

Thus, what we give is a theorem which states, in a general setting, what it means for a regression algorithm to converge absolutely. We distinguish between 'perfect' models, which are the same as the function we aim to model (the 'oracle'), provided some parameters are given the right values, and 'imperfect' models, for which that is not the case. One of the challenges here is to formulate the right notions of approximation between models, not just between parameters. The requisite functions, namely a loss function between models and a distortion function from models to models, are higher order. The abstract type-theoretic setting is essential here in formulating the right notions of continuity which make the theorems true.

The most general versions of convergence are Thm. 4 and Thm. 5, which characterise the convergence of regression for an 'imperfect' model. Informally, the former says that whenever the imperfect model and the oracle are 'approximately equal', the parameters of the model can be computed so that the error between the model and the oracle is approximately the same as the error introduced by the distortion function. The latter says that if the loss between a distorted oracle and the oracle is less than some ε then so is the loss between the regressed model and the distorted oracle. Both theorems capture the same idea: the error introduced by a 'bad guess' of a model bounds the error between the regressed model and the oracle. As an immediate consequence (Thm. 3), if the model is perfect (i.e. the distortion function is the identity) then the loss between the oracle and the model converges on zero.

We give some examples, mainly to show that the definitions we provide (S-types and continuity) can accommodate standard examples. The framework that we have built for this perspective is formalised in the Agda programming language, which allows us to give computable (but practically inefficient) algorithms for our version of optimisation.
To maintain a high assurance of correctness, all our main results and most of our examples are proved formally using Agda [2]. The proofs can be found online at https://github.com/tnttodda/RegressionInTypeArxiv. We use certain options to ensure a high standard of consistency and compatibility. The 'safe' option of Agda disables features that may lead to possible inconsistencies, such as type-in-type or experimental and exotic constructs. This option also prevents the local disabling of termination checking. It is our explicit requirement of continuity conditions that allows all proofs to go through without violating termination, unlike prior proofs in the literature [11]. We also turn off the K axiom, to ensure compatibility with type theories that are incompatible with 'uniqueness of identity' proofs, such as homotopy type theory. Finally, using the 'exact split' clause we force the type-checker to require that all clauses in a definition hold as definitional equalities. Our proofs require several basic types and related properties found in Escardó's TypeTopology library (https://github.com/martinescardo/TypeTopology).

The bulk of the proofs of this section are in the SearchableTypes module, which contains annotations cross-referenced against this text. To make the presentation accessible to readers without a background in Agda, the mathematical statements in our paper are formulated in a conventional, informal yet rigorous, mathematical vernacular. To aid readers interested in formal proof details, each mathematical proof is labelled with the Agda function formalising it.

S-types

This section concerns the definition and properties of 'S-types', which are used to develop the concept of Escardó's searchable types. These types define the spaces in which regression can take place.

Definition 1 (SearchableTypes.ST-Type). An S-type is defined inductively as a finite non-empty type, the product of two S-types S × S′, or the type of functions N → S, where S is an S-type.

The key technical challenge of our approach is to define a notion of (uniform) continuity for S-types, where continuity of a function is broadly understood as 'finite amounts of output only require finite amounts of input'. In this context, whenever we deal with infinite data the precision of our observation comes into play. In the case of S-types, infinite data comes from types of the shape N → S. It is natural to think of such data as sequences, which leads to a natural notion of precision-up-to-m as observing the first m elements of the sequence. This notion of equivalence induces the usual ultrametric on such sequences, from which we can derive a reliable definition of uniform continuity.
We generalise this intuitive notion of precision to S-types as follows. First, the way we measure precision depends on the type at which we measure it; we call the type of precisions for a given type its exactness type (the elements of this type are precisions). For finite data we do not afford degrees of precision, so the exactness type is the unit type. For product types we take the product of the two exactness types point-wise. Finally, for functions from N we record a natural number, which is the precision at that level, paired with a precision for the codomain of the function.

Definition 2 (SearchableTypes.ST-Moduli). The exactness type of an S-type is defined inductively as:
• The exactness type of a finite set is the unit type.
• The exactness type of a finite product of S-types is the product of their exactness types.
• The exactness type of a function N → S, where S is an S-type, is the product of N with the exactness type of S.

Precision as defined above can be used to qualify equality between elements of S-types. For finite data equality is not qualified by precision, and for products it is taken component-wise. For sequences N → S, equality with precision (n, p), where p has the exactness type of S, is interpreted as observing only the first n elements of the sequence, with each element observed up to precision p.

Definition 3 (SearchableTypes.ST-≈).
• Two elements of a finite set are said to be equal with precision p just if they are equal, for any p.
• Two elements of a product of S-types are equal with precision (p₁, p₂) if their i-th projections are equal with precision pᵢ.
• Two elements of N → S, with S an S-type, are equal with precision (m, p) if all elements in their m-size prefixes are equal with precision p.

Note that in the definition above the type of the precision depends on the S-type, as spelled out in Def. 2. If x, y are equal with precision p, we write x ≡_p y. The concept of 'equality with precision p' can be adapted to predicates as logical equivalence with precision p (⇔_p) in the obvious way (formally, SearchableTypes.ST-≈p). The following properties are immediate.

Proposition 1 (SearchableTypes.ST-≈-EquivRel). Equality with precision p is an equivalence relation.
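To illustrate Defs. 2 and 3, here is a small Haskell sketch of ours (the formal version is in SearchableTypes): for sequences, equality with precision m compares only the first m entries, and for products it is taken component-wise, with a precision for each side.

    -- Sketch of equality-with-precision (Def. 3), illustration only.
    type Seq a = Integer -> a

    -- sequences: compare the first m entries
    eqSeq :: Eq a => Integer -> Seq a -> Seq a -> Bool
    eqSeq m alpha beta = all (\i -> alpha i == beta i) [0 .. m - 1]

    -- products: compare component-wise
    eqPair :: (p -> a -> a -> Bool) -> (q -> b -> b -> Bool)
           -> (p, q) -> (a, b) -> (a, b) -> Bool
    eqPair eqA eqB (p, q) (x1, y1) (x2, y2) = eqA p x1 x2 && eqB q y1 y2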
An immediate consequence of Prop. 1 is that equality implies equality with precision p, for any p. We are now in a position to introduce continuity for predicates on S-types. A predicate is said to be continuous if its argument only needs to be examined up to some precision in order to yield an answer. Naturally, the type of that precision is the exactness type of the argument.

Definition 4 (SearchableTypes.continuous). We say that a predicate Q on an S-type S is continuous if there exists a precision q in the exactness type of S such that for all x, x′ : S, whenever x ≡_q x′ we also have Q(x) ⇒ Q(x′). We call q the modulus of continuity (MoC) of Q.

The same intuition applies to functions.
Definition 5 (SearchableTypes.continuous). A function f : S → S′, with S, S′ S-types, is said to be continuous if for any precision p in the exactness type of S′ there exists a precision q in the exactness type of S such that for all x, x′ : S, if x ≡_q x′ then f(x) ≡_p f(x′). We call q the MoC of f for p.

Functions S → S′ with S, S′ S-types are not themselves S-types, but they are an important class of types, which we shall call O-types (oracle types, usually ranged over by the variable Y). Certain helpful properties of continuity are immediate:

Proposition 2 (SearchableTypes.all-F-preds-continuous). All predicates and functions on finite types are continuous.
Proposition 3 (SearchableTypes.◦-continuous). If f : S → S′ and g : S′ → S′′ are continuous then so is g ◦ f : S → S′′.

We are now ready to introduce the concept of searchability. A predicate is said to be detachable if it is always decidable, i.e. either it or its negation holds. Note that a detachable predicate is essentially a function to a two-element type, i.e. the Booleans.
Definition 6 (SearchableTypes.searcher). A searcher E on an S-type S is a function which, given a detachable and continuous predicate on S, returns a witness element of S, for which the predicate holds if such an element exists.

Since the searcher is a well-defined function, it will always return an element of S even if a witness, i.e. an element satisfying the predicate, does not exist. In that case the searcher will just return some arbitrary element of S.
Remark 1. In the Agda code the definition above has two parts, also involving SearchableTypes.search-condition, which spells out what it means for a witness to satisfy the predicate.
We will usually denote a searcher by E.

Definition 7 (SearchableTypes.continuous-searcher). A searcher E on S is said to be continuous if, whenever given predicates which are equivalent with precision p, P ⇔_p Q, it returns witnesses which are equal with precision p, E(P) ≡_p E(Q).

An S-type is said to be continuously searchable if any continuous and detachable predicate on it has a continuous searcher. We are now building towards the main theorem of this section: that all S-types are in fact continuously searchable.

Lemma 1 (SearchableTypes.finite-ST-searchable). All finite non-empty types are continuously searchable.

Proof.
In the case of finite (non-empty) types we use induction on the size of the type. For singletons the proof is immediate, with the searcher always returning the unique element. The continuity of this searcher, and the fact that it is a proper searcher, are immediate. In the inductive case, given a searcher E_n for a set of size n and some predicate Q, we construct a new searcher for the finite type Fin_{n+1} = Fin_n + {∗} which behaves like the old searcher if it finds a witness for Q, and returns the additional element inr(∗) otherwise:

E_{n+1}(Q) = E_n(Q) if Q(E_n(Q)); inr(∗) otherwise.

Checking that this is a continuous searcher is laborious but routine.

Lemma 2 (SearchableTypes.product-ST-searchable). The product of two continuously searchable S-types is continuously searchable.

Proof. In the case of the product of two searchable S-types S × S′, we need to construct a searcher for a predicate Q which returns as witness a pair (x₀, y₀) : S × S′. Let E_S and E_S′ be the searchers for the two types. The witnesses are computed by:

ŷ(x) = E_S′(λy. Q(x, y))
x₀ = E_S(λx. Q(x, ŷ(x)))
y₀ = ŷ(x₀).

These computations are obviously continuous, and the formal proof is straightforward. Verifying that these values satisfy the conditions of a correct searcher is laborious but routine.
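Both constructions transcribe directly into Haskell. In the following sketch (ours; the formal proofs are in SearchableTypes), finiteSearcher mirrors the case split defining E_{n+1}, and pairSearcher mirrors the pairing trick of Lemma 2.

    -- A searcher returns a witness for the predicate if one exists,
    -- and some arbitrary element otherwise (Def. 6). Sketch only.
    type Searcher a = (a -> Bool) -> a

    -- Lemma 1: search a finite non-empty type, listed out as its elements.
    finiteSearcher :: [a] -> Searcher a
    finiteSearcher [x]      _ = x
    finiteSearcher (x : xs) q = let c = finiteSearcher xs q
                                in if q c then c else x

    -- Lemma 2: search a product via a candidate second component.
    pairSearcher :: Searcher a -> Searcher b -> Searcher (a, b)
    pairSearcher ea eb q = (x0, yhat x0)
      where
        yhat x = eb (\y -> q (x, y))
        x0     = ea (\x -> q (x, yhat x))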
Remark 2. In the previous two lemmas the details of checking that the defined searchers meet the required conditions are intricate, but they are routine in a way that a proof assistant (Agda) handles easily. Because of this, our reliance on a proof assistant is not onerous but, in fact, beneficial, improving the productivity of the mathematics.
The previous two lemmas are perhaps unsurprising, since finite types and binary products can be searched exhaustively and component-wise, respectively. The surprising fact is that the type of infinitary sequences satisfies the same property. Before we proceed to the main result, we note the following.
Lemma 3 (SearchableTypes.tail-decrease-mod). For any natural number n, if a predicate P(α) over S-sequences, with S an S-type, has modulus of continuity (n + 1, p) then the predicate P(x :: α) has modulus of continuity (n, p), for any x : S.

Lemma 4 (SearchableTypes.tychonoff). Sequences of continuously searchable S-types are continuously searchable.

Proof. In this case, we need to construct a searcher E for a predicate Q which returns a witness α₀ : N → S. Let E_S be the searcher for S. We proceed by induction on the first projection, n : N, of the modulus of continuity of Q.
For n = 0, we can return any element, as it will vacuously satisfy the predicate; for example, α₀ = λn. E_S(λx. ⊤). For the inductive step we construct the witness like so:

x̂(α) = E_S(λx. Q(x :: α))
α_t = E(λα. Q(x̂(α) :: α))
x₀ = x̂(α_t)
α₀ = x₀ :: α_t

α_t is constructed using the inductive hypothesis: by Lem. 3, the first projection of the MoC of the searched predicate is one less than n. It is laborious but routine to show that the two predicates searched here are detachable and continuous. While the formal proof may look daunting, proving that this witness satisfies the predicate is intuitively straightforward. Verifying that the constructed searcher E is continuous is somewhat complex, but follows from the continuity of E_S by induction on the modulus of continuity of the predicates involved.

From Lem. 1-4 the key result of this section follows immediately:

Theorem 1 (SearchableTypes.all-ST-searchable). All S-types are continuously searchable.

This is a Tychonoff-style theorem, since S-types are closely related to compact types, and the definition of S-types can be interpreted as covering all types that can be built from finite types using products, finite or countable. The theorem guarantees that the collection of types that can be used in regression is rich enough to cover many interesting examples.
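The witness construction of Lemma 4 is, in essence, Escardó's stream search. In the Haskell sketch below (ours, not the paper's Agda development) laziness plays the role of the explicit modulus of continuity: the recursion terminates exactly when the predicate inspects only a finite prefix of its argument, i.e. when it is continuous in the sense above.

    -- Sketch of Lemma 4: streams modelled as infinite lists.
    type Searcher a = (a -> Bool) -> a

    seqSearcher :: Searcher a -> Searcher [a]
    seqSearcher e q = xhat alphaT : alphaT
      where
        xhat alpha = e (\x -> q (x : alpha))                          -- best head for a tail
        alphaT     = seqSearcher e (\alpha -> q (xhat alpha : alpha)) -- best tail

    -- e.g. a searcher for the Cantor space of boolean streams:
    -- cantor :: Searcher [Bool]
    -- cantor = seqSearcher (\q -> q True)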
In this preamble to our main technical results we give a semi-formal presentation of the key ideas, to aid understanding and explain the method we are following. Consider the most common form of regression, linear regression. It involves a 'model' M_k⃗ : R → R defined as M_k⃗(x) = k₁·x + k₀, with k₀, k₁ ∈ R. The regression task involves computing the parameters k⃗ = (k₀, k₁) ∈ R² such that a measure of loss, or error, between M_k⃗ and a data set Ω = {(xᵢ, yᵢ) ∈ R² | 0 ≤ i < n} is minimised. A common, but not unique, formula for such a loss function is 'least squares', defined as

Φ = ∑_{0 ≤ i < n} (yᵢ − M_k⃗(xᵢ))².

This is essentially an optimisation problem: finding the best k⃗ ∈ R² to minimise the function above. Note that the regression problem has an identical formulation for polynomial regression, where the model is a polynomial of some fixed rank, M_k⃗(x) = ∑_{0 ≤ i ≤ n} kᵢ·xⁱ, except that the problem now is finding some k⃗ ∈ R^{n+1}. We work towards generalising these concepts, offering the following informal definitions first:

Definition 8. We say that an oracle is a continuous function of type Ω : X₁ → X₂. We say that a parameterised model is a continuous function of type M : X₀ → (X₁ → X₂). We define a loss function as any continuous function of type Φ : (X₁ → X₂) → (X₁ → X₂) → [0, 1] such that Φ(f, f) = 0, for any f : X₁ → X₂.
These definitions are still informal in the sense that we are not saying anything yet about what the Xᵢ are. The obvious candidates for such types are computable representations of (compact subsets of) real numbers. However, as we shall see, any S-types can be used, which leads to a generalisation of existing notions of regression. Note that a loss function is a generalisation of a metric, dropping the requirements that it be sub-additive and even symmetric. It is convenient, without loss of generality, to normalise it to the unit interval, which will be represented as a specific S-type. For readability we may write the instantiation of a model for a given parameter as M_k = M(k), and the loss function in curried form, so that the quantity to minimise is written as Φ(M_k, Ω). Our perspective on regression, succinctly expressed, is the following:

The regression problem consists of finding a parameter k : X₀ such that for a given oracle Ω : X₁ → X₂ and model M : X₀ → (X₁ → X₂), the value of the loss function Φ(M_k, Ω) is minimised.

For instance, in the case of linear regression we may (naively) take X₀ = R² and X₁ = X₂ = R; for polynomial regression, X₀ = R^{n+1} for some fixed n, and the type of the oracle as before. The loss function, least squares (or rather a normalised version thereof), is in (R → R) → (R → R) → [0, 1].
However, the reals R cannot be represented as an S-type. In the sequel we see how to work with computable representations of certain (compact) subsets of R which are S-types and lead to interesting examples, in line with our motivation discussed earlier.
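As a concrete instance of Def. 8, here is our Haskell sketch of a normalised least-squares loss over a finite list of sample inputs, with Double standing in for the constructive reals used later:

    -- Sketch of a normalised least-squares loss between two models.
    -- Assumes a non-empty list of sample points; illustration only.
    lossLSQ :: [Double] -> (Double -> Double) -> (Double -> Double) -> Double
    lossLSQ xs f g =
      min 1 (sum [(f x - g x) ^ (2 :: Int) | x <- xs] / fromIntegral (length xs))

Note that lossLSQ xs f f = 0 for any f, as Def. 8 requires, and min 1 provides the normalisation to [0, 1].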
Remark 3. Before we proceed we need to make some important distinctions. The real numbers R are a well-understood mathematical concept. In our formal perspective we are required to work with a representation, or an encoding, of the real numbers into entities that can be defined type-theoretically. This leads to a foundational tension between the mathematical concepts and their formal representations. The most significant potential problem arises from the fact that mathematical functions operate on real numbers, whereas our functions work on encodings of real numbers (codes). If a function defined in our representational domain corresponds to a genuine mathematical function, it is called its realiser. However, we can define functions on codes which are more 'intensional' in nature than mathematical functions, because they have access to the internal representation of the numbers in a way that mathematical functions do not. Such functions are not realisers of any genuine mathematical function. Yet, such functions are interesting from the point of view of computer science, data science, or machine learning, insofar as these disciplines are intrinsically algorithmic rather than purely mathematical, and thus restricted to operating on codes. Resolving this foundational tension by ensuring that all 'representational' functions are genuine realisers is not something we are concerned with in this paper, although it is an important and well-studied topic in computable real number arithmetic [24].
As motivated by the considerations above and by our leading target examples, we now need to consider real numbers. In our constructive setting we clearly need to restrict ourselves to representations of 'computable' reals. More precisely, we require representations of the reals for which our desired operations (at least comparison, addition and multiplication) can be defined and are continuous.

Real numbers are used in two ways: in the general setting, as part of defining the concept of 'loss function', and in examples. Because of this distinction we can conveniently use several types which serve different purposes. For the loss function we can represent the unit interval [0, 1] as binary sequences U = N → {0, 1}, which is clearly an S-type. For this type we can define families of strict and total order relations, each of which is detachable and continuous. Each element r : U is an encoding of a real number in [0, 1]; we notate the encoding of 0 as 0_U. The interpretation is the standard one for binary numbers: ∑_{i∈N} r(i) × 2^{−(i+1)}.

Definition 9 (UIOrder.<U). For any p : N, a sequence a : U is said to be less-than with precision p another sequence b : U, written a <_p b, if there is some k : N, k < p, such that their prefixes up to k are equal and a_k < b_k.

Definition 10 (UIOrder.≤U). For any p : N, a sequence a : U is said to be less-than-or-equal-to with precision p another sequence b : U, written a ≤_p b, if either a <_p b or a ≡_p b.

It is straightforward to prove that, for any p : N, <_p is a strict partial order, ≤_p is a total order and, given a, b : U, these predicates are decidable and continuous.
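Defs. 9 and 10 translate directly; our Haskell sketch over binary streams:

    -- Sketch of the orders <_p and <=_p on U (Defs. 9-10).
    type U = [Int]   -- infinite streams of digits 0 or 1

    ltU :: Int -> U -> U -> Bool   -- a <_p b
    ltU p a b = or [take k a == take k b && a !! k < b !! k | k <- [0 .. p - 1]]

    leU :: Int -> U -> U -> Bool   -- a <=_p b, i.e. a <_p b or prefixes agree
    leU p a b = ltU p a b || take p a == take p b

Both are decided by inspecting only the first p digits, which is the sense in which they are detachable and continuous.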
With these considerations in place we can revisit and spell out the informal parts of Def. 8, the general formulation of regression. To cast it in type theory, we will always take the Xᵢ to be S-types, and we will use the representation U of the unit interval as the codomain of the loss function. The type of the oracle is thus some X₁ → X₂, which is an O-type. The type of the loss function is Y → Y → U, with Y an O-type. This means that the standard definition of function continuity (Def. 5) does not apply. In this section we define a notion of 'continuity' for loss functions. First we introduce a notion of approximate equality for functions.

Definition 11 (TheoremsBase.ST-≈f). Two functions f, g : S → T, with S, T S-types, are said to be equal with precision p, in the exactness type of T, written f ≈_p g, if for all x : S we have that f(x) ≡_p g(x).

This is an extensional definition in which all points in the domain are evaluated, but the results are compared only with precision p, which needs to be of the exactness type of T. With this, we can define a weaker notion of continuity for model functions.

Definition 12 (TheoremsBase.continuousM). A model function M : S → Y, where S is an S-type and Y = S′ → S′′ an O-type, is said to be weakly continuous if for all precisions p in the exactness type of S′′ there exists a precision q in the exactness type of S such that for all k, k′ : S, if k ≡_q k′ then M_k ≈_p M_k′.

Note that p above has the exactness type of S′′, and q that of S. It is straightforward to show that:

Lemma 5 (TheoremsBase.strong→weak-continuity). Any (model) function that is continuous is also weakly continuous.

With this, we can define (weak) continuity for the loss function.
Definition 13 (TheoremsBase.continuousL). A loss function Φ : Y → Y → U, where Y = S → S′ is an O-type, is said to be (weakly) continuous if for any precision p in the exactness type of U there exists a precision q in the exactness type of S′ such that for all g, h : Y, if g ≈_q h then for all f : Y we have that Φ(f, g) ≡_p Φ(f, h). We call q the MoC of Φ for precision p.

The definition above can be generalised so that the loss function is continuous in both arguments. However, only this more restricted continuity is required by the theorems below.
Global optimisation and the convergence of regression

We now turn our attention to a general characterisation of algorithms for regression: in what circumstances they exist, and what it means for them to be correct. The standard property of regression is that a 'best guess' parameter can always be produced.
Theorem 2.
Let S be an S-type, Y an O-type, and p a precision in the exactness type of U. For any weakly continuous model M : S → Y, oracle Ω : Y, and continuous loss function Φ : Y → Y → U, we can construct a parameter k₀ : S such that for any k : S we have that Φ(Ω, M_{k₀}) ≤_p Φ(Ω, M_k).

Proof. We prove this as a corollary of the more general theorem that any continuous function f : S → U has a minimum argument k₀ : S such that ∀k : S. f(k₀) ≤_p f(k). The corollary follows because, due to the continuity conditions on M and Φ, the function λx. Φ(Ω, M_x) is continuous. We use induction on the structure of S as an S-type. In each case we wish to construct the argmin for f with precision p, notated argmin_S(f, p) : S.

In the finite case, we proceed by induction on the number of constructors of S. If S = 𝟙, the unit type with the single constructor ⋆, then clearly argmin_𝟙(f, p) = ⋆. If S = S′ + 𝟙 for some S-type S′, then we proceed by inductively computing x′ = argmin_{S′}(λx. f(inl x), p), where inl : S′ → S casts the element x : S′ to the corresponding element of S. As x′ is the argmin for f in S′ with precision p, and ⋆ is the corresponding argmin in 𝟙, we simply need to decide whether f(inl x′) ≤_p f(inr ⋆) or f(inr ⋆) ≤_p f(inl x′), where inr : 𝟙 → S. This is decidable because ≤_p is decidable and a total order by Def. 10.

In the product case S = S′ × S′′, we proceed similarly to the proof of Lem. 2. We construct (x₀, y₀) = argmin_S(f, p) as follows:

ŷ(x) = argmin_{S′′}(λy. f(x, y), p)
x₀ = argmin_{S′}(λx. f(x, ŷ(x)), p)
y₀ = ŷ(x₀).

From these inductive constructions, we have that ∀x. f(x₀, ŷ(x₀)) ≤_p f(x, ŷ(x)) and ∀x, y. f(x, ŷ(x)) ≤_p f(x, y). By transitivity of ≤_p (Def. 10), therefore, ∀x, y. f(x₀, y₀) ≤_p f(x, y).

In the sequence case S = N → S′, we proceed similarly to the above and by the structure of Lemma 4, i.e. by induction on the first projection n : N of the MoC of f at precision p. When n = 0 any element will do; otherwise we construct α₀ = argmin_S(f, p) as follows:

x̂(α) = argmin_{S′}(λx. f(x :: α), p)
α_t = argmin_S(λα. f(x̂(α) :: α), p)
x₀ = x̂(α_t)
α₀ = x₀ :: α_t

α_t is constructed by the inductive hypothesis on the MoC, because the MoC of λα. f(x :: α), for a given value x : S′, will be one lower than that of f. Therefore, we have that ∀α. f(x̂(α_t) :: α_t) ≤_p f(x̂(α) :: α) and ∀x, α. f(x̂(α) :: α) ≤_p f(x :: α); again, the result is obtained via the transitivity of ≤_p. An additional lemma is used to finish this case, showing that the output of a continuous function at α is equal, to the required precision, to its output at head(α) :: tail(α), where head(α) = α(0) and tail(α) = λn. α(n + 1). Thus, ∀α. f(x₀ :: α_t) ≤_p f(head α :: tail α) ≡_p f(α).

This theorem seems to give a definitive constructive, type-theoretic characterisation of regression. However, the computational content of the proof is, on closer inspection, not satisfactory. We can understand this more easily by instantiating the theorem at particular types, such as Y = U → U. Informally speaking, the proof requires finding the argmin k₀ of the function f(k) = Φ(Ω, M_k) with some fixed precision. The way in which k₀ is computed is by partitioning the interval U into a finite number of intervals computed from the precision. The continuity condition on f allows us to compute a size for these intervals which is small enough that their images through f are smaller than the precision. In other words, for the given precision p we do not need more than a certain precision of the input. And, since there is a finite number of partitions, we can simply examine the value of f on all of them and select the one for which this value is minimal.
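At type U the extracted procedure behaves like the following exhaustive sketch (ours): enumerate all 2^p digit prefixes of precision p, pad each into an element of U, and keep the candidate whose image is smallest at precision p.

    -- Sketch of the extracted minimisation at type U -> U (illustration).
    type U = [Int]

    argminU :: Int -> (U -> U) -> U
    argminU p f = foldl1 pick candidates
      where
        candidates = [ds ++ repeat 0 | ds <- sequence (replicate p [0, 1])]
        pick x y   = if take p (f x) <= take p (f y) then x else y

The lexicographic comparison of the p-prefixes is exactly the order ≤_p of Def. 10 restricted to the digits that matter.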
There are two inter-related problems here. The first one is obvious: the algorithm that is extracted out of the proof is an always-exhaustive search of the domain, up to the desired level of precision. The second one is more subtle, and has to do with the 'stability' of the algorithm. Suppose that there are two distinct values k and k′ for which f(k) = f(k′) and this value is minimal for f. In this situation, as we run the algorithm with different precisions p, sometimes we may get an approximation around k as a result and sometimes we may get one around k′. As p gets smaller the algorithm is not guaranteed to converge on either of them. The misbehaviour is not entirely surprising, considering that we are attempting to compute a function, argmin, which is known not to be computable [28]. The reason we manage to compute anything at all is that our algorithm has access to the codes of the numbers involved, so it is a function which is not a realiser of any mathematical function (also see Remark 3).

Thm. 2 gives a conventional characterisation of regression, but it has certain shortcomings, as discussed. It also does not tell the whole story. Whereas it states the situation in which the loss value can be minimised, it makes no absolute statement regarding the loss itself. We therefore desire a statement which says something about the situation in which the error can not only be minimised, but also made vanishingly small: in other words, a convergence theorem guaranteeing that the regressed model is arbitrarily close, as measured by the loss function, to the oracle.

In parametric regression we are epistemologically committed to a model; we just do not know its parameters, and we want to calculate them from observations. The minimisation algorithm of Thm. 2 is always guaranteed to produce a 'best guess' in terms of minimising loss, but if our bet on a particular model is the correct one then this 'best guess' should be such that the loss can be made vanishingly small. To represent this situation, instead of taking an arbitrary oracle Ω we take an arbitrary parameter k and create a synthetic oracle Ω = M_k. The synthetic oracle has the 'same shape' as the model, and can therefore be approximated with arbitrarily small loss. For this theorem we will rely on the concept of searchability, which did not come into play in the minimisation theorem Thm. 2. We will call a regression algorithm a regressor.
Theorem 3 (LossTheorems.perfect-theorem). Let S be an S-type, Y an O-type, p : N a precision, ε : U a loss value such that 0_U <_p ε, and Φ : Y → Y → U a continuous loss function. There exists a regressor reg : (S → Y) → Y → S such that, given an element k : S and a weakly continuous model M : S → Y, we can construct k₀ = reg M Ω such that Φ(Ω, M_{k₀}) <_p ε, for the synthetic oracle Ω = M(k).

Proof. This theorem is an immediate corollary of the more general Thm. 5 below.
Thm. 2 states that parametric regression eventually converges on the 'best possible' solution, whereas Thm. 3 proves that if we 'guess' the model correctly then the regression converges on the 'absolutely best' solution. But what if we don't guess the right model? Consider the data produced by the oracle Ω(x) = x + sin x in Fig. 1. Parametric regression requires us to commit to a model, and the model can be imperfect. For instance, trying to regress a linear model M_k⃗(x) = k₁·x + k₀ against the oracle Ω could give a 'pretty good' approximation, depending on the desired precision. We will aim to quantify this using another convergence theorem, which essentially says that the better the guessed model, the higher the precision of the approximation.

[Figure 1: Regression to imperfect model]

To formulate the theorem we will again use a synthetic oracle Ω = M_k, with unknown parameter k, but we will distort it using a function Ψ : Y → Y, so that the regression will try to reconstruct Ψ_Ω = Ψ(Ω) = Ψ(M_k) by wrongly assuming it is Ω. The distortion function Ψ can represent either measurement noise or a lack of perfect knowledge about the oracle. To quantify this lack of knowledge, or how powerful the distortion is, we use two approaches.

The first theorem for regressing an unreliable model uses equality with precision to compare how 'equal' the original and the distorted oracle are, and shows that the loss between the correct and the distorted oracle is 'just as equal' (with precision p) to the loss between the correct and the regressed oracle. It utilises the following definition of a 'continuous' distortion function:

Definition 14 (FunEquivTheorem.continuousD). A distortion function Ψ : Y → Y, for a given O-type Y = S → S′, is called continuous if for any function f : Y and precision p in the exactness type of S′, there exists some precision q in the exactness type of S such that for any x, x′ : S, if x ≡_q x′ then Ψ(f)(x) ≡_p Ψ(f)(x′).

Theorem 4 (FunEquivTheorem.imperfect-corollary-with-≈). Let S be an S-type, Y an O-type, p a precision in the exactness type of U, and Φ : Y → Y → U a continuous loss function. Given an element k : S, a continuous model M : S → Y, and any continuous distortion function Ψ : Y → Y, there exists a regressor reg : (S → Y) → Y → S such that, whenever k₀ = reg M Ψ_Ω: if Ψ_Ω ≈_q Ω then Φ(Ω, Ψ_Ω) ≡_p Φ(Ω, M_{k₀}), where Ω = M(k) is the synthetic oracle, Ψ_Ω = Ψ(Ω) the distorted synthetic oracle, and q is the MoC of the loss function Φ for precision p.

Proof. S is an S-type, therefore it comes equipped with a searcher E. The regressor which computes the parameter k₀ is reg M Ω = E(λk. Ω ≈_q M_k). It turns out that, due to the searchability of the S-type and the continuity conditions on the model and distortion functions, this predicate is in fact detachable and continuous. Because there exists some k : S such that Ψ_Ω ≈_q M_k, we have by the condition on the searcher that Ψ_Ω ≈_q M_{k₀}, where k₀ = reg M Ψ_Ω. By transitivity of ≡_q (Prop. 1), we arrive at Ω ≈_q M_{k₀}. Finally, a routine calculation from the continuity of the loss function gives us the result.

This theorem gives a convergence property of sorts, but it is not very useful in practice. It only applies when the distortion produced by Ψ is small enough for the original and distorted oracles to be 'almost equal' (with precision q).
This means that if the distorted model differs from the true model even rarely, but by a large enough amount, the theorem does not apply. For this reason we also give a more practically relevant convergence theorem, which uses the loss function itself to measure the degree of distortion, rather than approximate equality, and which only requires a weakly continuous model. This second imperfect-model regression theorem states that if the loss between the distorted synthetic oracle and the true oracle is small, then so is the loss between the distorted synthetic oracle and the regressed model. To emphasise: this holds even though the model is regressed using the distorted oracle as a source of data.

Theorem 5 (LossTheorems.imperfect-theorem-with-Φ). Let S be an S-type, Y an O-type, p : N a precision, ε : U a loss value, and Φ : Y → Y → U a continuous loss function. There exists a regressor reg : (S → Y) → Y → S such that, given an element k : S, a weakly continuous model M : S → Y, and a distortion function Ψ : Y → Y, for the parameter k₀ = reg M Ψ_Ω: if Φ(Ψ_Ω, Ω) <_p ε then Φ(Ψ_Ω, M_{k₀}) <_p ε, for the synthetic oracle Ω = M(k) and distorted synthetic oracle Ψ_Ω = Ψ(Ω).

Proof. The proof follows the same 'recipe' as that of Thm. 4, effectively constructing a regressor which has the desired property. The regressor will use the searcher E of the searchable type S on the predicate P(k) = Φ(Ψ_Ω, M_k) <_p ε to produce the model parameter k₀. We need to show that this predicate is continuous, detachable, and satisfies the desired property, which follows from routine calculations.

It is easy to see now that the perfect-model convergence theorem (Thm. 3) is an immediate consequence of the imperfect-model convergence theorem (Thm. 5), by using the identity distortion Ψ(Ω) = Ω, which makes Φ(Ψ_Ω, Ω) = Φ(Ω, Ω) = 0_U <_p ε, so that the condition is trivially true. We prefer this final formulation of the theorem, in contrast to the previous one, and we will take it as the defining property of regression, rather than the conventional minimisation one expressed in Thm. 2.

Compared to the global minimisation approach, Thm. 5 has the potential to serve as a basis for more efficient algorithms. This is because the regressor uses a searcher, which does not need to explore the search space exhaustively, unlike Thm. 2. The searcher can stop and return the parameter as soon as the predicate is satisfied. In other words, it will provide a 'good enough' solution, up to the specified target loss value, instead of searching for the 'best' solution. The 'worst case' behaviour of exploring the entire space can still happen, especially if there is no witness to the predicate.

We also need to understand that the regressor is guaranteed to return a good enough parameter only when our model is a good enough guess of the oracle. If our model is bad then the regressed parameter will not be very good either. This is a problem in practical applications, since we may not know what the true model is. That means we cannot know whether Φ(Ψ_Ω, Ω) <_p ε. Therefore, for the computed parameter k₀, we need to check separately whether Φ(Ψ_Ω, M_{k₀}) <_p ε. Fortunately, the latter is computable: this can be considered a separate 'validation of regression' step. It matches accepted practice in machine learning and data science, where 'learning' or 'inference' is always followed by 'validation' or 'testing'.
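The regressor common to the proofs of Thm. 4 and Thm. 5 is a one-liner over a searcher. In our Haskell sketch (with Double standing in for U):

    -- Sketch of the regressor of Thm. 5: search the parameter space for
    -- any k whose loss against the (distorted) oracle is below the target.
    type Searcher a = (a -> Bool) -> a

    regress :: Searcher k                -- searcher for the parameter S-type
            -> (k -> m)                  -- model M
            -> (m -> m -> Double)        -- loss function (Double for U)
            -> Double                    -- target loss value
            -> m                         -- (distorted) oracle
            -> k
    regress e model loss eps oracle = e (\k -> loss oracle (model k) < eps)

The searcher may stop as soon as the predicate is satisfied, which is the efficiency advantage discussed above; the validation step is just re-evaluating the predicate at the returned parameter.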
What Thm. 5 guarantees is that the regression algorithm is valid, in the sense that good models will always be inferred accurately.

The imperfect-model regression theorem also saves us from relying too much on the small methodological innovation discussed in the Introduction. Regression as broadly practised is 'from data' and not 'from oracle'; in other words, it is 'off-line' rather than 'on-line', with all data pre-sampled in advance. But we can think of off-line regression as regression to an imperfect model, with the distortion function formed by the composition of a sampling function followed by an interpolation function, noting that interpolation can easily be defined so as to be continuous. Thm. 5 then guarantees that if the reconstruction via sampling and interpolation is 'almost perfect' then so is the regressed model. What is left unsaid is whether it is indeed possible to reconstruct a function via sampling and interpolation with arbitrary precision, in other words, whether the Stone–Weierstrass theorem can be recast in this setting. This is the subject of further research.

The framework described above is rather abstract. In this section we show that it is applicable in a common scenario in which regression is used: polynomial regression with a loss function in the style of least squares. As a warm-up example we also show a 'degenerate' form of regression, which is simply searching for the argmin of a function. This example is interesting because it gives a deterministic version of the well-known random search theorem [26]. Finally, we show and discuss the practical implications of regression to a model described by an infinite
Taylor series, which is normally outside the scope of existing regression methods.

Real number arithmetic
For the examples we focus on the interval [−1, 1], which is represented by the type of ternary sequences I = N → {−1, 0, 1}, a version of the 'signed digit representation' [1]. Sequences r : I are encodings of real numbers in [−1, 1], using the standard binary numeral interpretation ∑_{i∈N} r(i) × 2^{−(i+1)}. This representation is particularly well suited for the definition of multiplication and normalised addition (taking the midpoint), but is inconvenient for defining an order, as the same number can have too many encodings. In contrast, U is suitable for ordering but not for arithmetic. This highlights the convenience of being able to use different representations of the reals for different purposes.

The midpoint algorithm is closely inspired by Ciaffaglione and Di Gianantonio [4], and multiplication by Escardó [9]. Both of these have been proved formally correct in loc. cit., but not in a way that can be easily reused (or recycled) in our setting. However, we face an additional burden of proof, being required to show that they are all continuous functions in the specific sense of Sec. 2.2. This is what we focus on.

Practical applications may require operating with representations of larger sets of reals than just [−1, 1]. Arbitrary closed intervals can be obtained from [−1, 1] using scaling and shifting by constant values, which introduces some not insurmountable complications. To deal with larger sets of reals still, we need to be careful that the representation remains an S-type. For instance, a 'mantissa and exponent' representation, where the mantissa is a representation of a real and the exponent a natural number, is not an S-type. A good rule of thumb is that compact sets are good candidates for such representations. We leave these issues for further work.

The operations below are a minimal set which allows us to formulate examples. The implementations are meant to be easy to reason about rather than efficient – they are in fact not practically usable. To scale up to realistic regression examples, as used for example in machine learning, the operations need to be implemented much more efficiently and, perhaps, extracted out of Agda into a more performance-oriented language. However, there is no reason to believe that the recipe we follow below cannot be applied to more, and more efficiently implemented, operations.
Midpoint (Details in module IAddition)

Let x ++ x′ be a sequence with head x and tail x′. Let Z be the type of integers and + addition on integers. We write 2i = i + i. Following loc. cit. we define the midpoint operator ⊕ using auxiliary operations ⌈−⌉ : Z → {−1, 0, 1}, ⌊−⌋ : Z → Z and a : I → I → Z → I:

⌈m⌉ = −1 if m ≤ −2; 0 if −2 < m ≤ 1; 1 if 1 < m
⌊m⌋ = m + 4 if m ≤ −2; m if −2 < m ≤ 1; m + (−4) if 1 < m
a(x ++ x′, y ++ y′, i)(0) = ⌈2i + x + y⌉
a(x ++ x′, y ++ y′, i)(m + 1) = a(x′, y′, ⌊2i + x + y⌋)(m)
(x ++ x′) ⊕ (y ++ y′) = a(x′, y′, x + y).

Full-blown addition can be defined using ⊕ and a global scaling factor, via elementary algebraic manipulations. For example, if u, v, w ∈ [−1, 1] then we can define u + v + w = (u ⊕ v) ⊕ (w ⊕ 0) with a global scaling factor of 4.
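For illustration, our Haskell transcription of ⊕, following the case split of ⌈−⌉ and ⌊−⌋ above (note that ⌊m⌋ = m − 4⌈m⌉):

    -- Sketch of the midpoint operator on signed-digit streams.
    -- Digits are -1, 0, 1; the integer carry remains within [-2, 2].
    type I = [Int]

    mid :: I -> I -> I
    mid (x : xs) (y : ys) = go xs ys (x + y)
      where
        go (a : as) (b : bs) i = d : go as bs (m - 4 * d)  -- new carry = floor m
          where
            m = 2 * i + a + b
            d | m <= -2   = -1                             -- d = ceiling m
              | m <= 1    = 0
              | otherwise = 1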
Lemma 6. The ⊕ operator is continuous.

Proof. This is so because, for any n : N, sequences x₁ ++ x′₁, x₂ ++ x′₂, y₁ ++ y′₁, y₂ ++ y′₂ : [−1, 1] and any carry z : Z, if (x₁ ++ x′₁, x₂ ++ x′₂) ≡_{((n+1,∗),(n+1,∗))} (y₁ ++ y′₁, y₂ ++ y′₂) then a(x′₁, x′₂, z) ≡_{(n,∗)} a(y′₁, y′₂, z); continuity of ⊕ follows by taking z = x₁ + x₂.

Negation

In the signed-digit representation this operation simply reverses the sign of each digit. Its continuity is immediate.
Multiplication (Details in module IMultiplication)

Let ⊙ be multiplication on the set of digits {−1, 0, 1}, defined in the obvious way, and let ⊗ be the multiplication of a (code of a) real by a digit, defined by mapping ⊙ over the sequence of digits which is an element of I. We use several auxiliary operations defined by mutual recursion, where throughout x = x′ ++ (x′′ ++ x′′′) and y = y′ ++ (y′′ ++ y′′′). First consider p, p′, p′′ : I × I → I:

p(x, y) = p′(x, y) ⊕ p′′(x, y)
p′(x, y)(0) = x′ ⊙ y′′
p′(x, y)(n) = (y′′ ⊗ x ⊕ x′′ ⊗ y)(n) otherwise
p′′(x, y) = y′ ⊗ x ⊕ x′ ⊗ y

The second helper function is q : N → I × I → I:

q(k, x, y)(0) = x′ ⊙ y′
q(k, x, y)(1) = x′′ ⊙ y′
q(k, x, y)(2) = x′′ ⊙ y′′
q(k, x, y)(n) = 0 if n > k
q(k, x, y)(n) = (p(x′′′, y′′′) ⊕ q(k − 1, x′′′, y′′′))(n) otherwise

Finally, multiplication × : I × I → I is defined as (x × y)(n) = (p(x, y) ⊕ q(n, x, y))(n).
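The digit-level helpers are immediate; our sketch:

    -- Sketch of digit multiplication (the ⊙ above) and of digit-by-real
    -- scaling (the ⊗ above), on signed-digit streams.
    type I = [Int]

    dmul :: Int -> Int -> Int   -- ⊙ is ordinary multiplication on {-1, 0, 1}
    dmul = (*)

    scale :: Int -> I -> I      -- d ⊗ x: map ⊙ over the code of x
    scale d = map (dmul d)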
Lemma 7. Multiplication is continuous.

Proof. This amounts to proving that the constituent operators p, p′, p′′ and q are continuous. The question is whether for every precision in the exactness type of I (the output) there exists some MoC in the exactness type of I × I (the input). In the cases where the output is a simple arithmetic operation relying upon zero or one digits of the input – for example, the n = 0 case of q – the MoC is clear and easily constructed. In all other cases, the output is the result of composing the ⊕ operator with other operators that have already been proved continuous. As ⊕ is continuous, it is clear that an MoC can be constructed in these cases too. The most difficult case is the 'otherwise' case of q, which relies upon constructing an MoC from the continuity of ⊕, p and q itself. However, as the value k decreases, we can construct the MoC from an inductive hypothesis on the continuity of q at differing values of k. The formalisation seems forbidding, but the intuition is clear.
Positive truncation

The domain of the normalised loss function is U, whereas arithmetic happens in I, for example in computing least-square-like loss functions. Since we only require continuity and the vanishing property of the loss function, rather than a precise measure of loss, the simplest way to create a well-typed loss function is to use a 'truncation' function t : I → U which changes all digits −1 to 0. t is a perfect example of a function that operates strictly at the level of codes and is not the realiser of a real function [−1, 1] → [0, 1]. This is somewhat unsatisfactory from a foundational perspective, but from an algorithmic (and somewhat pragmatic) point of view it raises no serious issues in our setting. More meaningful loss functions, which are realisers of real functions and have additional desirable properties (e.g. they are monotonic), can be defined, but at the cost of extra complexity.
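At the level of codes, t is a one-liner; our sketch:

    -- Sketch of the truncation t : I -> U, rewriting the digit -1 to 0.
    -- It acts on codes and is not the realiser of any real function.
    trunc :: [Int] -> [Int]
    trunc = map (max 0)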
The operations above, together with Prop. 3, which states that continuity is preserved by composition, allow us to construct arbitrary multi-variate polynomial functions. Going beyond that would require extra operations (division, square root, logarithm, trigonometric functions, etc.), for which algorithms in real number computation exist, but these are beyond the scope of the present work.

Using the minimisation algorithm for regression (Thm. 2) we can compute, to an arbitrary precision, solutions of continuous equations, simply by considering the variable(s) as an unknown degenerate model parameter and least squares as the loss function. Concretely, let us illustrate this by solving a non-linear system of equations:

x² = x ⊕ y
y² = x.

This system is expressed in terms of real numbers, and we can use the minimisation algorithm of Thm. 2 to look for approximate solutions in I with some given precision. As it happens, the solutions to these equations are both in [−1, 1]. We take S = I × I, which is an S-type, and Y = 𝟙 → I × I ≅ I × I, with 𝟙 the unit type, which is an O-type. This is why we call this 'degenerate' regression: the oracle type is not a function type. The model 'function' is now a constant,

M(x, y) = (x² ⊕ −(x ⊕ y), y² ⊕ −x),

and the true (degenerate) oracle is the constant Ω = 0.
The loss function is Φ : (I × I) → (I × I) → U, with Φ(0, (u, v)) = t(u × u ⊕ v × v). Since all the types involved are S-types and all the functions are continuous, being compositions of continuous operations, it is an immediate consequence that the 'parameter' (x, y) : I × I can be computed to whatever precision p : N. The minimiser used in the theorem is one possible such algorithm.

Two caveats are required. The first is that regression computes the 'argmin' of the function, so it will return just one of the solutions if any exist; this has already been discussed in the general setting in Remark 3. In this example both real solutions are in [−1, 1].
The algorithm does not control which one will be returned. The second caveat is that, in the case of no solution, the minimisation algorithm will still return some (x, y) value for the argmin, so the model itself must be used to test whether the loss value is close enough to zero to be considered a solution. Whether a returned pair is an exact solution, i.e. whether Φ(Ω, M(x, y)) = 0_U, is not decidable.

This is the 'meat and potatoes' motivating example. Consider a set of points (xᵢ, yᵢ) ∈ R², i ≤ n, and suppose that we want to 'best fit' a polynomial f_k⃗(x) = k₀ + k₁x + ··· + k_m x^m : R → R through this data set, i.e. find values for k⃗ ∈ R^{m+1} which minimise a loss function such as least squares. One apparent obstacle is that all the convergence theorems require an oracle to compute the parameter k⃗, whereas we only have a set of points. An important observation is that the least squares loss function computes the loss only at the given data points and ignores behaviour elsewhere. So any continuous 'oracle' constructed from the points would ultimately lead to the same result.

To construct such an oracle we can use interpolation. There are many interpolation algorithms, but for our purpose we might as well take the simplest one: piece-wise constant interpolation. Let p : N be some fixed precision and y_n : I an arbitrary value. We define a (distorted) oracle from the data points (xᵢ, yᵢ : I):

Ω(x) = y₀ if x <_p x₀ ⊕ x₁; y₁ if x <_p x₁ ⊕ x₂; ...; y_{n−1} if x <_p x_{n−1} ⊕ x_n; y_n otherwise.

The definition assumes that the data points are sorted by the xᵢ component. The function is defined by cases, noting that the order in which the conditions are tested is fixed, top-to-bottom. This makes the function well defined, computable, and, perhaps surprisingly, continuous. (In fact Thm. 5 does not require the oracle to be continuous, just Thm. 4.) The real issue is not continuity but why the function is well defined. The function is defined piecemeal, but if x is closer to some xᵢ ⊕ x_{i+1} than the precision p then we cannot say for sure whether it lies to the left or to the right of it. In this situation the fact that the side-conditions are checked in a fixed order means x will be treated as if it were to the left, which makes the function well defined. Note that this also means the function is not a realiser of a continuous function R → R, an issue we discussed before (Remark 3).

The general property of regression (Thm. 2) guarantees that parameters k⃗ can be computed so that the regressed model M_k⃗ minimises the least-square error at each xᵢ. It is interesting to also consider what this means from the point of view of convergence. The perfect-model convergence theorem (Thm. 3) is not applicable, since the general form of the oracle (line segments) and of the model (a polynomial) are not the same. However, the general imperfect-model convergence theorem (Thm. 5) says that if the loss between a distorted model and the true model vanishes then so does the loss between the true model and the regressed model. In this case the true oracle would be a polynomial of the same (or lower) degree, from which the data points are sampled and then interpolated, resulting in the distorted oracle. Since the least squares loss function only considers the behaviour at the sample points, it will be zero when applied to the true and distorted oracles.
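Our Haskell sketch of this oracle, with Double standing in for I, exact comparison standing in for <_p, the samples assumed sorted and non-empty, and the arbitrary default y_n taken to be the last sample's value:

    -- Sketch of the piece-wise constant interpolation oracle.
    interpOracle :: [(Double, Double)] -> Double -> Double
    interpOracle [(_, y)] _ = y                 -- the final 'otherwise' case
    interpOracle ((x0, y0) : rest@((x1, _) : _)) x
      | x < (x0 + x1) / 2 = y0                  -- left of the midpoint x0 ⊕ x1
      | otherwise         = interpOracle rest x
    interpOracle [] _ = error "interpOracle: requires at least one sample"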
The general property of regression (Thm. 2) guarantees that parameters k⃗ can be computed so that the interpolated model M_k⃗ minimises the least-squares error at each x_i. It is interesting to also consider what this means from the point of view of convergence. The perfect-model convergence theorem (Thm. 3) is not applicable, since the general form of the oracle (line segments) and of the model (a polynomial) are not the same.

However, the general imperfect-model convergence theorem (Thm. 5) says that if the loss between a distorted model and the true model vanishes then so does the loss between the true model and the regressed model. In this case the true oracle would be a polynomial of the same (or lower) degree, from which the data points are sampled and then interpolated, resulting in the distorted oracle. Since the least-squares loss function only considers the behaviour at the sample points, it will be zero when applied to the true and distorted oracles. This means that Thm. 5 guarantees that in this situation the loss between the true oracle and the model can also be made arbitrarily small, from which we can conclude that polynomial regression, as performed in practice, has good convergence properties.

The possibly problematic aspect of this is not the use of a polynomial as a model but the fact that we are working 'offline' (from data) as opposed to 'online' (from the oracle). But the correctness of 'offline' regression is an immediate corollary of Thm. 5.

Proposition 4 (Examples.offline-regression). Let S be an S-type, Y = I → I, p : N a precision, ε : U a loss value, points x⃗ : I^n for n : N, Φ_x⃗ a least-squares loss function, and Ψ_x⃗ : Y → Y a constant interpolation function, both defined at the points x⃗. There exists a regressor reg : (S → Y) → Y → S such that, given a weakly continuous model M : S → Y and a synthetic oracle Ω = M(k′) for some parameter k′ : S, the parameter k = reg M (Ψ_x⃗(Ω)) satisfies:

if Φ_x⃗(Ψ_x⃗(Ω), Ω) <_p ε then Φ_x⃗(Ψ_x⃗(Ω), M_k) <_p ε.

From this, the convergence of offline polynomial regression follows immediately, as any model defined by a polynomial is continuous.
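The whole offline pipeline of Prop. 4 can likewise be sketched end to end. The sketch below, under the same simplifying assumptions as before (Double arithmetic, a dyadic grid search standing in for the AGDA regressor, hypothetical helper names lossAt and regress), fits a degree-one polynomial k_0 + k_1 x by searching for the least-squares argmin at the sample points.

    -- Offline regression sketch: least-squares fit of f(x) = k0 + k1*x,
    -- with the parameter pair searched over a dyadic grid in [-1,1]^2.
    import Data.List (minimumBy)
    import Data.Ord (comparing)

    -- Least-squares loss, evaluated only at the sample points.
    lossAt :: [(Double, Double)] -> (Double -> Double) -> Double
    lossAt pts f = sum [ (f x - y) ^ 2 | (x, y) <- pts ]

    -- The 'regressor': argmin over parameter pairs at precision p.
    regress :: Int -> [(Double, Double)] -> (Double, Double)
    regress p pts = minimumBy (comparing fit) [ (k0, k1) | k0 <- g, k1 <- g ]
      where
        g = [ fromIntegral k / 2 ^ p | k <- [-(2 ^ p) .. 2 ^ p] ]
        fit (k0, k1) = lossAt pts (\x -> k0 + k1 * x)

    main :: IO ()
    main = print (regress 5 [(0, 0.25), (0.5, 0.5), (1, 0.75)])
    -- prints (0.25,0.5): the sample lies exactly on y = 0.25 + 0.5 x

Because the loss is evaluated only at the sample points, interpolating the points first and then sampling the interpolant at the x_i would give exactly the same argmin, which is the observation behind Prop. 4.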
In applications, particularly to machine learning, we may not know the general form of the oracle. In such a situation we may want to consider a more general kind of model, expressed as an infinite series, such as a power series or a trigonometric series. Many such series can be written in the form M_k(x) = Σ_{n∈N} f(k_n, x, n), where f is a fixed function and k_n an infinite family of parameters. For example, in the case of a power series, f(k, x, n) = k x^n. Such series can serve as 'universal approximators' for classes of functions. For example, analytic functions equal to their Taylor series at all points form a class known as 'integral' functions. The polynomials, the exponential, and certain trigonometric (sine and cosine) functions are examples of integral functions.

These models are intriguing because they can be given types such as M : (N → U) → U → U, with the type of parameters k : N → U an S-type. This means that, provided the continuity of M is proved, the entire set of parameters k can be computed to any degree of precision. In the case of the power series we know that using addition, multiplication, and composition always leads to continuous functions. The problem is computing the infinite series. Provided that the series converges, it can be computed in general [17] or approximated [5], but this is beyond the scope of our paper. From the point of view of regression analysis this may seem surprising, but it is a known result using searchable sets [10].

For example, consider a model M_k(x) = ⊕_{i:N} k_i x^i, which converges for all values of x : U. It can be used to regress some oracle Ω : U → U. Using the regressor of Thm. 3, the sequence of parameters is given by k = reg M Ω, so a model can be instantiated as M_k.

Note that the solution above involves an infinite set of parameters, so it obviously cannot be computed other than lazily. The model, after instantiating k, is M(x) = ⊕_{i:N} (reg M Ω)(i) · x^i, which is computable but could be expensive to compute.
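To make the last point concrete, here is a minimal sketch of evaluating such a series lazily to a requested precision. The convergence bound is our simplifying assumption, not the paper's: if every |k_i| ≤ 1 and |x| ≤ 1/2 then the tail beyond the first p + 1 terms is bounded by Σ_{i>p} 2^-i = 2^-p, so truncation gives the answer to precision p.

    -- Lazy evaluation of M_k(x) = sum_i k_i * x^i from an infinite
    -- coefficient stream. Assumes (our assumption, for the sketch)
    -- |k_i| <= 1 and |x| <= 1/2, so taking terms 0..p leaves a tail
    -- of at most 2^(-p).
    evalSeries :: Int -> [Double] -> Double -> Double
    evalSeries p ks x = sum (take (p + 1) terms)
      where terms = zipWith (\k i -> k * x ^ i) ks [0 :: Int ..]

    -- Example: k_i = 1 for all i gives the geometric series 1/(1-x).
    main :: IO ()
    main = print (evalSeries 20 (repeat 1) 0.25)  -- about 1.3333...

Note that here the truncation index is supplied externally, from an a priori convergence bound; computing a truncation point by search is a different matter, as discussed next.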
A problem of practical importance in this setting is the 'truncation' of the series defining the model M_k to only a finite number of terms, i.e. M_{k,m} = ⊕_{0 ≤ i ≤ m} k_i x^i. However, such a model has type M : (N → U) → N → U → U. The type N is not an S-type; it is also clearly not a searchable type. So this problem cannot be solved.

A broader consequence is that some 'hyper-parameters' of neural networks (the number of layers, the number of neurons per layer, etc.) also cannot be computed using our approach.

Remark 4. This class of more speculative examples, in particular the summing of infinite series, is not formalised in AGDA.

This paper has been inspired by and relies extensively on a significant body of work by Escardó, starting with searchable infinite sets [7]. The properties of regression established here can be equally formulated in that setting, or in related settings such as compact sets [10] and compact types [11]. What makes our approach distinct is that the formulations above are not synthetic, in the sense of [6]. Whereas in synthetic topology all functions are assumed to be continuous, we work with an explicit condition of continuity. This makes proofs more difficult, but it has the advantage of making our regression theorems hold in more models of type theory, including those that manipulate non-continuous functions, while still allowing formalisation in a proof assistant based on dependent type theory (namely AGDA).
We are interested in establishing an alternative framework for a better mathematical understanding of data science, machine learning, etc., based on type theory and constructive real numbers. It is worth drawing an analogy with the established mathematical framework for machine learning, probably approximately correct (PAC) learning [13, 29]. We first introduce its basic concepts.
Let X be a set and f : X → {0, 1} an unknown function (in our terminology, the 'oracle'). A sample x⃗ is drawn from X according to some (unknown) distribution D and is correctly classified according to f. Can we learn the function f? Note that this is a particular instance of regression as discussed here.

The function f is not usually guessed from nothing, but from a known class of possible functions H, dubbed the inductive bias. Our counterpart is, of course, the class of models M. The working assumption is that f ∈ H, which is mirrored in our approach, in the convergence theorems, by the fact that Ω = M_k for some unknown k.

Suppose that a learning procedure (which we call a 'regressor') produces a new hypothesis h_x⃗ ∈ H based on the sample. This is what we call a 'regressed model' M_k. The basic question is: how good is this new hypothesis? It should be good for the sample, but also for new examples. The error is defined as err(h_x⃗) = Pr_{x∼D}[h_x⃗(x) ≠ f(x)], the probability that, under the given distribution, the unknown function and the hypothesis differ. The problem statement is: given an error ε ∈ (0, 1), what can be said about err(h_x⃗) ≤ ε?
This cannot be guaranteed, except with probability at least 1 − δ for some fixed parameter δ ∈ (0, 1).
A hypothesis class H is PAC-learnable if there is an algorithm such that for every ε, δ ∈ (0, 1) and every unknown function f ∈ H there is a natural number m such that, sampling points x_i, 1 ≤ i ≤ m, according to the distribution D, we obtain an h_x⃗ ∈ H with err(h_x⃗) ≤ ε with probability at least 1 − δ.
The size m of the sample, given as a function of δ^-1 and ε^-1, is called the sample complexity. Finite sets are an unsurprising example of PAC-learnable classes, and their sample complexity bounds are known. But certain infinite sets are also PAC-learnable, with sample complexity determined by the so-called Vapnik-Chervonenkis (VC) dimension [30].

Our approach is complementary to PAC, having certain strengths and weaknesses (leaving aside the obvious fact that PAC theory is a mature and well-explored area of research). The setting of the problem is similar, up to differences in vocabulary, but both the learning procedure and the validation procedure differ significantly. PAC requires a prior sampling of the oracle with a given distribution, which makes it intrinsically 'off-line', whereas our learning procedure assumes access to the oracle, 'on-line'. (The two are related by Prop. 4, but more about this in the next section.) It also means that the learning procedure and the testing criterion in PAC are necessarily probabilistic. In contrast, our approach is deterministic and quantitative in a different way: instead of measuring the probability of the learned outcome differing from the desired outcome, we measure the definite amount by which the two outcomes differ. For finite sets, which can be searched trivially, our approach is trivial whereas PAC is interesting. But for infinite sets both our approach and the PAC approach give interesting and non-trivial characterisations.
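To make the dependence on ε^-1 and δ^-1 concrete, recall the standard textbook bound for a finite hypothesis class, a well-known PAC fact (see e.g. [13, 29]) rather than a contribution of this paper: any learner that returns a hypothesis consistent with the sample is PAC provided

m ≥ (1/ε) (ln |H| + ln (1/δ)).

For instance, |H| = 1000 and ε = δ = 0.1 give m ≥ 10 (ln 1000 + ln 10) ≈ 93: linear in ε^-1 but only logarithmic in δ^-1 and in |H|.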
The main contribution of the paper is to offer a range of convergence criteria for parametric regression, formalised in type theory and proved formally in AGDA.
The main convergence theorem (Thm. 5) states that a large class of oracles, all continuous functions of O-type with unknown parameters of S-type, can be regressed up to any desired precision, even in the presence of distortions, so long as the distortions are small. The regressors used in the theorem can be considered correct, albeit inefficient, reference implementations that satisfy the conditions of the convergence theorem.

The next part of this work will require us to turn our attention to off-line learning. The starting point is Prop. 4, which gives a convergence criterion for off-line regression. The interesting part is the precondition Φ_x⃗(Ψ_x⃗(Ω), Ω) <_p ε. We conjecture that if the sample x⃗ is large enough then this precondition always holds. The reason is that the distorted oracle Ψ_x⃗(Ω), constructed by interpolation, should become arbitrarily close to the true oracle Ω as the sample grows, which is a version of the Stone-Weierstrass approximation theorem. Our simple (piece-wise constant) interpolator may not be suitable for such a theorem, but we strongly believe that suitable interpolators exist in our setting. Interpolation, as mentioned above, is closely related to sampling, which could open the door to dealing with probabilistic sampling and to formulating convergence results more closely related to PAC learning, including estimating or bounding the sample size.
The fact that probabilities over discrete sets, such as N → [0, 1], are S-types is encouraging. In the longer term we also wish to find (synthetic) topological or type-theoretic characterisations of other PAC concepts, such as the VC dimension.

Interpolation is in itself very important, especially in the presence of distortions (noise), as it forms the basis of non-parametric regression: the learning of models without committing to a particular shape of model.

A better class of interpolation functions should also resolve the foundational rough edges discussed in Sec. 4.3, namely the fact that the interpolated functions are not realisers of real functions. We do not believe these issues have any profound consequences, but they are best avoided. In contrast, the same issues in the context of the minimisation theorem (Thm. 2) cannot be solved; but this theorem is a 'dead end' for us.

In parallel we aim to consider more realistic implementations, either extracted from the AGDA regressors or implemented directly in other, more performance-oriented languages. The key requirement is fast (enough) arbitrary-precision arithmetic over real numbers, a field intensely studied, with multiple libraries available for various languages [3, 15, 22].
References

[1] A. Avizienis. Signed-digit number representations for fast parallel arithmetic. IRE Transactions on Electronic Computers, 10(3):389–400, 1961.

[2] A. Bove, P. Dybjer, and U. Norell. A brief overview of Agda - a functional language with dependent types. In S. Berghofer, T. Nipkow, C. Urban, and M. Wenzel, editors, Theorem Proving in Higher Order Logics, 22nd International Conference, TPHOLs 2009, Munich, Germany, August 17-20, 2009. Proceedings, volume 5674 of Lecture Notes in Computer Science, pages 73–78. Springer, 2009.

[3] K. Briggs. Implementing exact real arithmetic in Python, C++ and C. Theoretical Computer Science, 351(1):74–81, 2006.

[4] A. Ciaffaglione and P. D. Gianantonio. A certified, corecursive implementation of exact real numbers. Theoretical Computer Science, 351(1):39–51, 2006.

[5] R. A. DeVore and G. G. Lorentz. Constructive Approximation, volume 303. Springer Science & Business Media, 1993.

[6] M. H. Escardó. Synthetic topology of data types and classical spaces. Electronic Notes in Theoretical Computer Science, 87:21–156, 2004.

[7] M. H. Escardó. Infinite sets that admit fast exhaustive search. In 22nd Annual IEEE Symposium on Logic in Computer Science (LICS 2007), pages 443–452. IEEE Computer Society, 2007.

[8] M. H. Escardó. Exhaustible sets in higher-type computation. Logical Methods in Computer Science, 4(3), 2008.

[9] M. H. Escardó. Real number computation in Haskell with real numbers represented as infinite sequences of digits. 2011.

[10] M. H. Escardó. Algorithmic solution of higher type equations. Journal of Logic and Computation, 23(4):839–854, 2013.

[11] M. H. Escardó. Compact types. 2018.

[12] C. A. Floudas and C. E. Gounaris. A review of recent advances in global optimization. Journal of Global Optimization, 45(1):3–38, 2009.

[13] D. Haussler. Probably approximately correct learning. In H. E. Shrobe, T. G. Dietterich, and W. R. Swartout, editors, Proceedings of the 8th National Conference on Artificial Intelligence, Boston, Massachusetts, USA, July 29 - August 3, 1990, 2 Volumes, pages 1101–1108. AAAI Press / The MIT Press, 1990.

[14] E. L. Lawler and D. E. Wood. Branch-and-bound methods: A survey. Operations Research, 14(4):699–719, 1966.

[15] V. Ménissier-Morain. Arbitrary precision real arithmetic: design and algorithms. Journal of Logic and Algebraic Programming, 64(1):13–39, 2005.

[16] R. Moore and C. Yang. Interval analysis. Space Division Report LMSD285875, Lockheed Missiles and Space Co., 1959.

[17] N. T. Müller. Constructive aspects of analytic functions. In Proceedings of the Workshop on Computability and Complexity in Analysis, volume 190 of Informatik Berichte, pages 105–114. FernUniversität Hagen, 1995.

[18] A. Neumaier. Complete search in continuous global optimization and constraint satisfaction. Acta Numerica, 13:271–369, 2004.

[19] J. Pearl. To build truly intelligent machines, teach them cause and effect. Quanta Magazine (15 May 2018), 2018.

[20] A. Pinkus. Weierstrass and approximation theory. Journal of Approximation Theory, 107(1):1–66, 2000.

[21] S. Piyavskii. An algorithm for finding the absolute extremum of a function. USSR Computational Mathematics and Mathematical Physics, 12(4):57–67, 1972.

[22] D. Plume. A calculator for exact real number computation. University of Edinburgh, 1998.

[23] S. Ruder. An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747, 2016.

[24] A. K. Simpson. Lazy functional algorithms for exact real functionals. In Mathematical Foundations of Computer Science 1998, 23rd International Symposium, MFCS'98, Brno, Czech Republic, August 24-28, 1998, Proceedings, pages 456–464, 1998.

[25] S. Skelboe. Computation of rational interval functions. BIT Numerical Mathematics, 14(1):87–95, 1974.

[26] F. J. Solis and R. J. B. Wets. Minimization by random search techniques. Mathematics of Operations Research, 6(1):19–30, 1981.

[27] M. Tawarmalani and N. V. Sahinidis. Semidefinite relaxations of fractional programs via novel convexification techniques. Journal of Global Optimization, 20(2):133–154, 2001.

[28] A. Troelstra and D. van Dalen. Chapter 6: Some elementary analysis. In Constructivism in Mathematics, volume 121 of Studies in Logic and the Foundations of Mathematics, pages 291–325. Elsevier, 1988.

[29] L. G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984.

[30] V. N. Vapnik and A. Y. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and Its Applications, 16(2):264–280, 1971.