Lazily Adapted Constant Kinky Inference for Nonparametric Regression and Model-Reference Adaptive Control
Jan-Peter Calliess
Dept. of Engineering, University of Cambridge, UK
April 13, 2018
Abstract
Techniques known as Nonlinear Set Membership prediction, Lipschitz Interpolation or Kinky Inference are approaches to machine learning that utilise presupposed Lipschitz properties to compute inferences over unobserved function values. Provided a bound on the true best Lipschitz constant of the target function is known a priori, they offer convergence guarantees as well as bounds around the predictions. Considering a more general setting that builds on Hölder continuity relative to pseudo-metrics, we propose an online method for estimating the Hölder constant from function value observations that possibly are corrupted by bounded observational errors. Utilising this estimate to compute adaptive parameters within a kinky inference rule gives rise to a nonparametric machine learning method for which we establish strong universal approximation guarantees. That is, we show that our prediction rule can learn any continuous function in the limit of increasingly dense data, to within a worst-case error bound that depends on the level of observational uncertainty. We apply our method in the context of nonparametric model-reference adaptive control (MRAC). Across a range of simulated aircraft roll-dynamics and performance metrics, our approach outperforms recently proposed alternatives that were based on Gaussian processes and RBF neural networks. For discrete-time systems, we provide guarantees on the tracking success of our learning-based controllers, both for the batch and the online learning setting.
Typically, a controller is designed on the basis of a dynamical model of the system. When little is known about these dynamics a priori, or if the dynamics may be subject to unexpected change, machine learning methods can be employed to learn such a model online on the basis of measurements.
Supervised machine learning methods are algorithms for inductive inference. On the basis of a sample, they construct (learn) a computable model of a data-generating process that facilitates inference over the underlying ground-truth function and aims to predict its function values at unobserved inputs. Among supervised learning methods, nonparametric algorithms tend to offer greater flexibility to learn rich function classes (e.g. rich classes of nonlinear dynamics).
Perhaps the most popular nonparametric machine learning method is Bayesian inference with
Gaussian processes (GPs) [34]. GPs offer a flexible and principled probabilistic method for nonparametric regression and have evolved into one of the primary work-horses for learning dynamic systems [17, 18, 15, 32, 16, 27] in the research communities related to artificial intelligence. However, they suffer from several limitations, including scalability to large data sets, a lack of understanding of how to bound the closed-loop dynamics resulting from controlling on the basis of a GP state-space model, and the question of how to choose a good prior in a principled yet computable manner. To alleviate the last problem, it is common practice to tune the hyper-parameters of a chosen (typically universal) kernel to explain the data via the marginal log-likelihood. While often successful on many data sets, the result can be highly sensitive to the choice of optimiser, initialisations, data sets and computational budget. Unfortunately, little theoretical understanding of the important interplay between these components in the resulting inference mechanism seems to exist.
In contrast to such Bayesian methods, this work considers an extension of ideas existing in applied mathematics (e.g. [37, 44, 12, 13, 2, 3]) as well as in control [29, 8] that harnesses Lipschitz regularity of the target function to provide bounds on the predictions of the target function at unobserved inputs. Applied to machine learning, the basic idea is that Lipschitz continuity constrains the set of possible function values of a target function at a query input as a function of the distance between the query and the previously observed training examples. A prediction is then made by choosing a function value in the middle of the set of possible function values. This idea, which goes back at least to [39], has been leveraged in different fields under different headlines, including Lipschitz Interpolation [44, 3] and
Nonlinear Set Membership (NSM) methods [29]. If the Lipschitz constant is known, Lipschitz interpolation provides uncertainties around the predictions of function values at unobserved inputs. The uncertainties are maximally tight if no knowledge other than the presupposed Lipschitz regularity is available [39, 7]. The presupposed Lipschitz constant is a crucial parameter of the inference rule, quite similar to the choice of a prior (e.g. via kernel and hyperparameters) in Bayesian inference.
We would argue that one of the advantages of these methods is their computational simplicity. That is, they are numerically robust and only involve basic computational steps that could be computed efficiently even on a simple embedded RISC micro-controller. A practical concern is that the predictions and bounds of these methods hinge on a priori knowledge of the presupposed Lipschitz parameter of the underlying target function. Some previous works remark that, in the absence of such knowledge, the constants might in practice be estimated from the data (e.g. via estimators discussed in [38, 43, 4, 29, 3]). In fact, [29] suggest fitting a parametric regressor to the data first and utilising the Lipschitz constant of the fitted model for the NSM approach. Unfortunately, no theoretical analysis is given anywhere of the impact such an estimate has on the predictive performance of the regression method. (Consequently, control methods that rely on the resulting predictions would be unable to assert closed-loop convergence or robustness guarantees.)
Our work addresses this deficiency by proposing an approach that allows us to provide learning and tracking guarantees even in the absence of a priori knowledge of this constant.
As a first step, we rehearse
Kinky Inference (KI) [7], which generalises the Lipschitz interpolation and NSM frameworks in several ways. We then combine the KI machine learning approach with a simple online parameter estimator that allows us to incrementally compute an estimate of the Hölder constant on the basis of incrementally arriving (noisy) data. This merger yields a new inference rule we refer to as Lazily Adapted Constant Kinky Inference (LACKI). We prove that this rule is sample-consistent (up to the level of observational error in the data) and that the LACKI predictors themselves are Hölder continuous. This allows us to establish strong universal approximation properties: that is, in the limit of infinite data, the LACKI rule is capable of approximating not only any Hölder continuous target function but also any non-Hölder continuous function with arbitrarily low error (up to a bound dependent on the level of observational error in the training data). Since our LACKI rule can be seen as an extension of Lipschitz interpolation with empirical Lipschitz constant estimation as considered in [3], our results also provide new theoretical guarantees for this previously proposed interpolation rule when the Lipschitz parameters are estimated online from the noisy training data with our proposed modified constant estimation method.
In addition to these learning-theoretic considerations, we apply our LACKI approach to online-learning-based model-reference adaptive control. As a testbed, we replicate simulations of the roll dynamics of an F-4 fighter aircraft under uncertain wing rock previously considered by other authors in model-reference adaptive control [10, 31, 11] and compare our controller against their methods. Here our LACKI-based controller outperforms its competitors across a range of metrics including computational speed, prediction and tracking accuracy. For discrete-time feedback-linearisable systems with uncertain nonlinear drift, we provide theoretical guarantees on the tracking success of our LACKI model-reference adaptive controller, both in the batch and in the online learning setting.
In contrast to much of the standard literature on probabilistic nonparametric regression (e.g. [23, 40]), our analysis focusses on the derivation of deterministic worst-case error bounds. While possibly being more conservative, we would argue that this type of analysis has the benefit of being more meaningful in a control setting, where the learner receives training examples and queries that will typically violate the statistical assumptions presupposed in that literature.
The remainder of the paper is structured as follows: Sec. 2 contains the core of the LACKI regression methods. Following a rehearsal of the kinky inference (KI) framework for nonparametric learning, Sec. 2.2 describes our LACKI approach for setting the KI parameters. Sec. 2.3 is dedicated to the derivation of several properties of the resulting LACKI approach, including our consistency guarantees. Sec. 3 contains the control part of the paper. We first introduce the framework of model-reference adaptive control, in which we propose a controller based on our LACKI learning method. For illustration purposes, we closely follow the setting of wing-rock control considered in [10, 9] and compare our LACKI-based controller to other MRAC controllers considered and proposed by previous work. The section concludes by giving convergence guarantees for LACKI-MRAC in discrete-time systems. The paper concludes with Sec. 4, summarising our findings and containing various suggestions for future work. The appendix contains a variety of background material on Hölder continuity and various supplementary derivations referred to at various points of the main body of the paper. In comparison to the original 2016 version of this preprint, this updated version adds some experiments in Sec. 2.4 and corrects a typo that had existed in Thm. 3.4.
In this section, we will introduce the class of learning rules we refer to as Kinky Inference. They encompass a host of other methods such as Lipschitz Interpolation and Nonlinear Set Interpolation. The rules possess a parameter L(n) that needs to be specified by any KI algorithm. In this paper, we are most concerned with studying LACKI, a KI rule where L(n) coincides with a noise-robust and multi-variate generalisation of Strongin's estimate [38] of a Hölder constant, computed from the data set D_n available at time step n.

Setting.
Let X, Y be two spaces endowed with (pseudo-)metrics d : X × X → R≥0 and d_Y : Y × Y → R≥0, respectively. Spaces X and Y will be referred to as input space and output space, respectively. It will be convenient to restrict our attention to input and output spaces that are additive abelian groups and whose (pseudo-)metrics are translation-invariant. That is, for the input space X, we assume d(x + x′, x″ + x′) = d(x, x″), ∀x, x′, x″ ∈ X.
For simplicity, throughout the remainder of this work we will assume the output space is the canonical Hilbert space Y = R^m (m ∈ N), endowed with d_Y(y, y′) = ‖y − y′‖∞, ∀y, y′ ∈ Y.
Let f : X → Y be a target or ground-truth function we desire to learn. For our purposes, learning means regression. That is, we utilise the data to construct a computable function that allows us to predict values of the target function at any given input.
Assume that, at time step n, we have access to a sample or data set D_n := {(s_i, f̃_i) | i = 1, ..., N_n} containing N_n ∈ N sample vectors f̃_i = f̃(s_i) ∈ Y of an observable function f̃ at sample inputs s_i ∈ X. Here, the observable f̃ : X → Y is a "noise-corrupted" version of the true target function f : X → Y we would like to make inferences about on the basis of the available sample. In this work, we will typically assume that the observable f̃ coincides with the target f up to a level of interval-bounded observational noise: ∀x ∈ X : d_Y(f̃(x), f(x)) ≤ e(x), where e : X → R≥0 is the error bound function whose values we assume to be bounded by some (known or unknown) bound ē ∈ R≥0. We can model the situation by the presence of a bounded additive observational error (or "noise") function ν : X → Y with f̃ = f + ν. The interpretation of these errors depends on the given application, and this "noise" may be deterministic or stochastic. For instance, in the context of system identification, the sample might be based on noisy measurements of velocities and the error may be due to sensor noise; or, the noise might model systematic error such as numerical approximation errors. Furthermore, e may also accommodate input uncertainty (that is, when predicting f(x), x itself is uncertain); for details refer to [7]. In the course of our theoretical considerations below, the error will also serve to absorb the discrepancy between a Hölder and a non-Hölder function.

Learning.
It is our aim to learn the target function f in the sense that, combining prior knowledge about f with the observed data D_n, we infer predictions f̂_n(x) of f(x) at unobserved query inputs x ∉ G_n. Here, G_n = {s_i | i = 1, ..., N_n} ⊂ X refers to the (not necessarily regular) grid of sample inputs. The entire function f̂_n that is learned to facilitate predictions is referred to as the predictor. Since the computation of the predictor is based on the available data and utilised to make inferences over unobserved inputs, we can view the learning process as an instance of (inductive) inference. Therefore, the formula to compute the predictor f̂_n will also be referred to as an inference rule. In our context, we will understand a machine learning algorithm to implement such an inference rule. That is, it is a computable function that maps a data set D_n to a predictor f̂_n (and possibly an uncertainty estimate function v̂_n). A typical desideratum of a good predictor is that it is efficiently computable. Its learning performance is measured in terms of the degree to which, and how rapidly, it converges to the target (up to the observational error given by e) in the limit of increasingly dense data. Of course, there are many different metrics with respect to which one can assess convergence. Perhaps the most convenient one is mean-square convergence. However, inspired by our control applications, we desire to investigate worst-case convergence rates, which hold independently of distributional assumptions and will yield performance guarantees even in zero-measure events.

The Kinky Inference Learning rules.
In this work, we build on the following class of predictors to perform learning as inference over unobserved function values:
Definition 2.1 (Kinky Inference (KI) rule). Let R∞ := R ∪ {−∞, ∞} and let X be some space endowed with a pseudo-metric d. Let B, B̄ : X → Y ⊆ R^m_∞ denote lower- and upper-bound functions that can be specified in advance, and assume B(x) ≤ B̄(x), ∀x ∈ I ⊂ X, component-wise. Furthermore, let e denote a parameter that specifies a deterministic belief about the true observational error bound e. Given sample set D_n, we define the predictive functions f̂_n : X → Y and v̂_n : X → R^m_{≥0} to perform inference over function values. For j = 1, ..., m, their j-th output components are given by:

f̂_{n,j}(x; Ξ(n)) := 1/2 min{ B̄_j(x), u_{n,j}(x; Ξ(n)) } + 1/2 max{ B_j(x), l_{n,j}(x; Ξ(n)) },
v̂_{n,j}(x; Ξ(n)) := 1/2 min{ B̄_j(x), u_{n,j}(x; Ξ(n)) } − 1/2 max{ B_j(x), l_{n,j}(x; Ξ(n)) }.

Here, u_n(·; Ξ(n)), l_n(·; Ξ(n)) : X → R^m are called ceiling and floor functions, respectively. Their j-th component functions are given by

u_{n,j}(x; Ξ(n)) := min_{i=1,...,N_n} f̃_{i,j} + d̃(x, s_i; Ξ(n))   and   l_{n,j}(x; Ξ(n)) := max_{i=1,...,N_n} f̃_{i,j} − d̃(x, s_i; Ξ(n)),

respectively. Here, d̃(·, ·; Ξ(n)) is a mapping parameterised by Ξ(n). While there are many conceivable parameterisations, we restrict our attention to the case where, for some pseudo-metric d on the input space X, we have Ξ(n) = (L(n), α, e) with

d̃(x, x′; Ξ(n)) = L(n) d^α(x, x′) + e(x).

As will be seen below, parameter L(n) has the interpretation of a Hölder constant of the predictor relative to pseudo-metric d, while α ∈ (0, 1] can be interpreted as a Hölder exponent (cf. Thm. 2.6). That is, we will show that f̂_n(·; Ξ(n)) belongs to the class

H(L(n), α) = { φ : X → Y | ∀x, x′ ∈ X : d_Y(φ(x), φ(x′)) ≤ L(n) d(x, x′)^α }

of L(n)-α-Hölder continuous functions. Note, we could alternatively re-express this Hölder class as a generalised class of Lipschitz functions Lip(L(n)) = { φ : ∀x, x′ ∈ X : d_Y(φ(x), φ(x′)) ≤ L(n) d̃(x, x′) }, where, for any α ∈ (0, 1], d̃ = d^α is a pseudo-metric provided d is (refer to Lem. A.11). However, as it is often customary to define Lipschitz and Hölder continuity in a more restricted sense relative to standard norm-induced metrics (in which case the Hölder class is strictly more general than the Lipschitz class), we chose to refer to Hölder continuity rather than Lipschitz continuity to highlight that such Hölder functions can be learned as well.
As insinuated by our notation, we consider parameter L(n) to be adaptive, i.e. data-dependent, while the other parameters are assumed to be set in advance. Function e can be utilised to accommodate observational noise. That is, if the noise level in the data is assumed to be contained in [−ē, ē], then one would choose e(x) = ē, ∀x. In addition, the functions B, B̄ : X → R^m_∞ are parameters that have to be specified in advance and can impose a priori knowledge of bounds on the target function. For example, if we know the target function to exclusively map to nonnegative values, then one can set B(x) = 0, ∀x. To disable restrictions of boundedness, one may set the lower and upper bound functions to the constants −∞ and ∞, respectively.
Function f̂_n is the predictor that is to be utilised for predicting/inferring function values at unseen inputs. Function v̂_n(x; Ξ(n)) is meant to quantify the uncertainty of prediction f̂_n(x; Ξ(n)). For ease of notation, we shall often omit explicit mention of the parameter, e.g. we may write f̂_n(x) instead of f̂_n(x; Ξ(n)).
To provide an intuition of the inference rule, consider the following special case where we have access to a noise-free sample D_n and suppose the target f is a real-valued L*-α-Hölder continuous function. Observing the noise-free sample point (s_i, f_i) constrains the set of possible function values f(x) to the set S_i(x) = { φ ∈ Y | d_Y(φ, f_i) ≤ L* d^α(s_i, x) }. Considering a set of sample points D_n, the target value f(x) is constrained to lie in the intersection S(x) = ∩_{i=1}^{N_n} S_i(x). It is easy to see that the floor and ceiling functions are tight lower and upper bounds of S(x), with S(x) = { φ ∈ Y | l_n(x; L*) ≤ φ ≤ u_n(x; L*) }. In other words, setting parameter L(n) to the best Hölder constant L* and the bounds to B = −∞, B̄ = +∞ yields a predictor f̂_n(x) that for every query x chooses the mid-point of the set S(x) of those function values that can possibly be assumed by a Hölder continuous function that interpolates the observed sample. The prediction error bound v̂_n(x) simply is the radius of this set.
For the case of α = 1, this approach is known as Lipschitz interpolation [3, 44]. Since a set is utilised for interpolation, the approach is also known as
Nonlinear Set Interpolation [29, 8] in control.
Specification of B̄, B allows us to incorporate additional knowledge and constrain our set S(x) further. For instance, when estimating densities we might incorporate the knowledge of dealing with nonnegative functions. In this case, it makes sense to set B to a constant value of zero, yielding S(x) = { φ | φ ≥ 0 } ∩ ∩_{i=1}^{N_n} S_i(x).
When choosing L(n) to coincide with the best Hölder constant, one can give strong guarantees of convergence to the target as well as on the tightness of the prediction bounds [39, 7], showing that the bounds are as tight as possible without imposing additional assumptions and that the predictor minimises the worst-case risk.
Unfortunately, this requires us to know at least an upper bound of L*, and therefore several authors have proposed different approaches of how to estimate the constant from the data (e.g. [38, 43]). However, it appears to be largely unknown how to do so in the presence of bounded observational noise ē > 0. Moreover, when replacing L(n) by such empirical estimates, nothing seems to be known about the convergence properties of the resulting kinky inference rule that is based on these estimates.
In the remainder of the paper, we shall address this gap. Firstly, we propose an estimator to be utilised in place of L(n) that can be set to be robust to noise (i.e. does not grow unbounded). Referring to the resulting KI rule as LACKI, we then prove universal approximation properties of the LACKI rule before considering its performance in a control application.

Above we explained the benefits of choosing parameter L(n) to coincide with a Hölder constant of the target. However, if such a constant is unavailable a priori, we desire to compute L(n) as a data-dependent estimate of the Hölder constant. Our proposal for such an estimator will be introduced next.
For notational convenience, for two sets S, S′ ⊂ X of inputs we define U(S, S′) := { (s, s′) ∈ S × S′ | d(s, s′) > 0 } and let U_n := U(G_n, G_n) be the set of all pairs of grid inputs deemed disparate under the pseudo-metric d.
The best Hölder constant of a function f is the smallest nonnegative number L* such that f is contained in the set H(L*, α) = { φ : X → Y | ∀x, x′ ∈ X : d_Y(φ(x), φ(x′)) ≤ L* (d(x, x′))^α } of L*-α-Hölder continuous functions. So, this best Hölder constant is given by

L* = sup_{(x,x′) ∈ U(X,X)} d_Y(f(x), f(x′)) / d^α(x, x′).

Given the noisy data D_n = {(s_i, f̃_i) | i = 1, ..., N_n}, a natural estimate of the best Hölder constant might be to compute ℓ̂*_n := max_{(s,s′) ∈ U_n} d_Y(f̃(s), f̃(s′)) / d^α(s, s′) [38]. In the absence of noise (be it stochastic or deterministic), that is, if f̃_i = f(s_i), ∀i, ℓ̂*_n never overestimates the true best Hölder constant. That is, ℓ̂*_n ≤ L*. However, in the presence of noise ν : X → Y (such that f̃ = f + ν) this boundedness of the estimates no longer holds true. In particular, if the noise is not Hölder continuous, we expect ℓ̂*_n to grow unbounded with increasingly dense data.
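As a quick numerical illustration of this effect, the following minimal Python sketch (our own illustration; the target, noise level and sample sizes are arbitrary choices) draws increasingly dense noisy samples of the Lipschitz function f(x) = |cos(2πx)| + x and evaluates the naive pairwise estimate ℓ̂*_n, which keeps growing as the data become denser:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.abs(np.cos(2 * np.pi * x)) + x  # Lipschitz target (best constant <= 2*pi + 1)

def naive_hoelder_estimate(s, y, alpha=1.0):
    """Strongin-type estimate: max over all pairs of |y_i - y_j| / d(s_i, s_j)^alpha."""
    diffs_y = np.abs(y[:, None] - y[None, :])
    diffs_s = np.abs(s[:, None] - s[None, :]) ** alpha
    mask = diffs_s > 0                           # only consider pairs with distinct inputs
    return np.max(diffs_y[mask] / diffs_s[mask])

for n in (50, 200, 800, 1600):
    s = rng.uniform(0.0, 1.0, size=n)            # sample inputs on [0, 1]
    y = f(s) + rng.uniform(-0.5, 0.5, size=n)    # observations corrupted by bounded noise
    print(n, naive_hoelder_estimate(s, y))       # estimate keeps growing with denser data
```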
For practical reasons and for the sake of our theoretical arguments presented below, we desire the parameters L(n) to remain bounded. Thus, without further modifications, ℓ̂*_n is not a suitable candidate for L(n). To ensure bounded estimates even in the presence of noise, we propose the following estimator:

ℓ(D_n; λ, L) := max{ L,  max_{(s,s′) ∈ U_n} [ d_Y(f̃(s), f̃(s′)) − λ ] / d^α(s, s′) }.   (1)

Here L is a parameter that can be used to specify a priori knowledge of a lower bound on the best Lipschitz constant. In the absence of particular domain-specific knowledge, one can of course always set L = 0.

Remark 2.2. By setting parameter λ at least as large as twice the maximum level of observational noise, i.e. λ = 2ē + q for some q ≥ 0, it is easy to see that the estimates ℓ(D_n; λ, 0) are bounded from above by L̄ = sup_{x,x′ : d(x,x′)>0} [ d_Y(f(x), f(x′)) − q ] / d^α(x, x′) ≤ L* (and L* < ∞ if the target is Hölder continuous).

Next, consider an online learning situation where the available data increases over time. That is, D_n ⊆ D_{n+1} for all time steps n ∈ N. For time step n ∈ N, let S_{n+1} := G_{n+1} \ G_n be the set of new sample inputs.
Figure 1: Two LACKI inferences over function values of the target f : x ↦ |cos(2πx)| + x (dashed line) on the basis of a noisy sample (plotted as dots). The predictors are plotted in grey; the noisy observational function f̃(x) = f(x) + ν_x is plotted in cyan. Here the ν_x were drawn i.i.d. at random from a uniform distribution on the interval [−0.5, 0.5], and we set B̄ ≡ ∞, B ≡ −∞, L = 0 and α = 1. The left plot ("LACKI 1") shows the LACKI predictor f̂_n (grey line) for parameter choice λ = 0, falsely assuming absence of observational noise. As a result, the prediction overfits to the noise. The right plot ("LACKI 2") depicts the prediction f̂_n (grey curve) for the correct parameter choice λ = 2ē = 1, causing the noise to be smoothed out and resulting in a more accurate prediction of the underlying ground truth f.

We can define an incremental update rule recursively as follows:

ℓ_{n+1} := max{ ℓ_n,  max_{(s,s′) ∈ U(G_n, S_{n+1})} [ d_Y(f̃(s), f̃(s′)) − λ ] / d^α(s, s′),   (2)
                     max_{(s,s′) ∈ U(S_{n+1}, S_{n+1})} [ d_Y(f̃(s), f̃(s′)) − λ ] / d^α(s, s′) }   (3)

for n ∈ N and where ℓ_0 := L. The effort of computing ℓ_{n+1} is in the order of O(M (|S_{n+1}| N_n + |S_{n+1}|²)), where M denotes the effort for evaluating the pseudo-metrics. By construction, we have ℓ_n = ℓ(D_n; λ, L), ∀n ∈ N.

Remark 2.3. Remember that (ℓ_n)_{n∈N} is bounded. Since it is also growing monotonically, we can appeal to the monotone convergence theorem to show that the sequence converges to some number L̄ ≤ max{L*, L}.

So far, we have defined a rule of how to update noise-robust and convergent estimates ℓ_n of the Hölder constant. Using these data-dependent estimates in place of L(n) in our kinky inference framework as per Def. 2.1 yields an inference rule that shall henceforth be referred to as Lazily Adapted Constant Kinky Inference (LACKI).

Definition 2.4 (LACKI rule). For each output component j ∈ {1, ..., dim Y}, define f̂_n(·)_j as per Def. 2.1, but choose the parameter L(n) := ℓ(D_n; λ, L) (according to Eq. 1). We refer to the resulting predictor f̂_n(·) as a Lazily Adapted Constant Kinky Inference (LACKI) rule. Here, the free parameters are α ∈ (0, 1], λ ∈ R≥0 and L ∈ R≥0.

To develop a first feel for our inference rule, refer to Fig. 1. Here, we depicted the predictors for an underlying ground-truth function on the basis of a sample but with different parameter choices λ. When setting this parameter to two times the observational noise level, the predictor accurately smoothes out the noise. In contrast, when the parameter is set to zero, the resulting predictor will perfectly interpolate through the noisy observations, thereby limiting the approximation quality to the level of observational noise.
Furthermore, we notice that the predictors are Hölder continuous but non-differentiable. Informally speaking, the inference exhibits "kinks", motivating the term "kinky inference".
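To make the estimator of Eq. (1), the incremental update (2)–(3) and the predictor of Def. 2.1 concrete, the following self-contained Python sketch implements a scalar-output LACKI rule under the simplifying assumptions B ≡ −∞, B̄ ≡ ∞, e ≡ 0 inside d̃ and d(x, x′) = |x − x′|. It is an illustrative reading of the definitions, not the authors' code, and all identifiers are our own:

```python
import numpy as np

class LACKI1D:
    """Minimal LACKI sketch: scalar outputs, d(x, x') = |x - x'|, B = -inf, B_bar = +inf."""

    def __init__(self, lam=0.0, alpha=1.0, L_lower=0.0):
        self.lam = lam            # lambda; should be >= 2 * noise bound for noise-robustness
        self.alpha = alpha        # Hoelder exponent
        self.L = L_lower          # running estimate l_n, initialised with the lower bound L
        self.S = np.empty(0)      # sample inputs s_1, ..., s_N
        self.Y = np.empty(0)      # noisy observations f~(s_i)

    def update(self, s_new, y_new):
        """Incremental update of l_n as in Eqs. (2)-(3), then append the new data."""
        s_new, y_new = np.atleast_1d(s_new), np.atleast_1d(y_new)
        for S_old, Y_old in ((self.S, self.Y), (s_new, y_new)):   # old-vs-new, then new-vs-new pairs
            if S_old.size:
                d = np.abs(s_new[:, None] - S_old[None, :]) ** self.alpha
                dy = np.abs(y_new[:, None] - Y_old[None, :])
                mask = d > 0
                if mask.any():
                    self.L = max(self.L, np.max((dy[mask] - self.lam) / d[mask]))
        self.S = np.concatenate([self.S, s_new])
        self.Y = np.concatenate([self.Y, y_new])

    def predict(self, x):
        """Midpoint of ceiling and floor (Def. 2.1); also returns the uncertainty radius v_n."""
        x = np.atleast_1d(x)
        d = self.L * np.abs(x[:, None] - self.S[None, :]) ** self.alpha
        ceil = np.min(self.Y[None, :] + d, axis=1)
        floor = np.max(self.Y[None, :] - d, axis=1)
        return 0.5 * (ceil + floor), 0.5 * (ceil - floor)


# Usage: learn f(x) = |cos(2*pi*x)| + x from noisy data with noise bound 0.5, i.e. lambda = 1.
rng = np.random.default_rng(1)
f = lambda x: np.abs(np.cos(2 * np.pi * x)) + x
s = rng.uniform(0, 1, 400)
model = LACKI1D(lam=1.0)
model.update(s, f(s) + rng.uniform(-0.5, 0.5, s.size))
pred, radius = model.predict(np.linspace(0, 1, 5))
print(np.round(pred - f(np.linspace(0, 1, 5)), 2))  # errors should be small for dense data
```

For multi-dimensional outputs, the same construction is applied per output component, and the bound functions B, B̄ of Def. 2.1 can be incorporated by clipping the ceiling and floor before taking the midpoint.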
LACKI 2
Figure 2: Repetition of the experiment but with p = 0 .
5. This time, we also plotted floor and ceilingfunctions (grey dotted and dashed-dotted curves) delimiting the uncertainty bounds. Note, whensetting λ = 0 the estimate (cid:96) ( D n ; λ, L ) was found to be 128 resulting in extremely conservative uncer-tainty estimates (left figure). In contrast, choosing λ = 2¯ e gave a parameter estimate (cid:96) ( D n ; λ, L ) = 2 . (cid:96) n determining parameter L ( n ) is “lazy” in the sense that it onlyincreases the estimate of the H¨older constant just enough to be consistent with the observed data. Thatis, it chooses L ( n ) to coincide with the smallest H¨older constant of a conceivable target function f thatcould have generated the data under the given noise assumption. Below, we will see that the predictorˆ f n has H¨older constant L ( n ). Therefore, the “laziness” of the estimator of L ( n ) implements Occam’srazor : it regularises the hypothesis space of continuous functions to prefer simple explanations of thedata (i.e. functions with low H¨older constants) over complex ones (i.e. functions with higher H¨olderconstants). Here, λ can serve as a parameter that can be utilised to regularise the predictor furtherin order to compensate for (bounded) noise in the data. We will now establish several properties of the LACKI rules including boundedness of the predictors,sample-consistency and H¨older continuity. Most importantly however, we will show that the LACKIrules are universal approximators, in the sense that they can be set to learn any continuous functionwith arbitrary worst-case error.The core idea behind this can be sketched as follows: First, we establish H¨older continuity andsample-consistency. This allows us to prove that LACKI can learn any H¨older function. Note, someuniversal approximators, such as RBFNs with Gaussian kernels, are provably Lipschitz. Therefore,learning any continuous function can be interpreted as learning some Gaussian RBFN with an ob-servational error level that absorbs the discrepancy between the RBFN and the ground truth. Sincea finite RBFN with smooth, bounded-derivative kernel is provably H¨older and since we can learnany H¨older function with LACKI up to the level of observational error, we can learn the continuousground-truth up to the approximation error of the RBFN.Following this outline, we will now proceed to establish the desired properties formally.
Lemma 2.5 (Boundedness of the predictor). Irrespective of the boundedness of the input space X, and assuming finite sample size N_n = |D_n| < ∞, the predictor f̂_n : X → Y is bounded. In particular, for α = 1, we have

∀x ∈ X : ‖f̂_n(x)‖∞ ≤ max_{i=1,...,N_n} ‖f̃_i‖∞ + (L(n)/2) max_{i,j=1,...,N_n} d^α(s_i, s_j) < ∞.

Proof. Let D = max_{i,j=1,...,N_n} d^α(s_i, s_j) and, for the k-th output dimension, let F_k = max_{i=1,...,N_n} |f̃_{i,k}|. As shown in Sec. A.1, d^α is a pseudo-metric too and hence adheres to the triangle inequality. Utilising the definition of the predictor and the triangle inequality we see that, for any x ∈ X and any output dimension k, there are some i, j ∈ {1, ..., N_n} such that we have:

f̂_{n,k}(x) = (f̃_{j,k} + f̃_{i,k})/2 + (L(n)/2) (d^α(x, s_i) − d^α(x, s_j)) ≤ (f̃_{j,k} + f̃_{i,k})/2 + (L(n)/2) d^α(s_j, s_i) ≤ F_k + (L(n)/2) D < ∞.

As promised, we establish that the predictors of the LACKI inference rule are Hölder continuous:

Lemma 2.6 (Hölder regularity of LACKI). With definitions as before, let (Y, d_Y) = (R^m, (y, y′) ↦ ‖y − y′‖∞). Provided that the bounding functions B, B̄ are Hölder continuous (or set to −∞, ∞, respectively), the predictors f̂_n are Hölder continuous (n ∈ N) with constant L(n) and exponent α. That is, ∀n ∈ N : f̂_n ∈ H(L(n), α).

Proof. It is easy to show that one-dimensional mappings of the form x ↦ ℓ d(x, x′)^α are ℓ-α-Hölder continuous for any choices of ℓ, α and inputs x′. Furthermore, taking point-wise max, min as well as averages of Hölder continuous functions is known not to change their Hölder properties (e.g. cf. [7]). Therefore, the output-component predictors f̂_{n,j} (j = 1, ..., m) are L(n)-α-Hölder. Hence, ∀x, x′:

d_Y(f̂_n(x), f̂_n(x′)) = ‖f̂_n(x) − f̂_n(x′)‖∞ = max_{j=1,...,m} |f̂_{n,j}(x) − f̂_{n,j}(x′)| ≤ max_{j=1,...,m} L(n) d(x, x′)^α = L(n) d(x, x′)^α.

We now establish how well our LACKI rule can interpolate the training data as a function of the noise bound and the regularisation parameter λ:

Lemma 2.7 (Sample-consistency of LACKI). If for each output dimension j ∈ {1, ..., m} and some λ ≥ 0 we have L(n) ≥ max_{(s,s′) ∈ U_n} [ |f̃_j(s) − f̃_j(s′)| − λ ] / d^α(s, s′), then the LACKI rule is sample-consistent (up to λ). That is, ∀q ∈ {1, ..., N_n} : f̂_n(s_q) ∈ B_λ(f̃_q), where B_λ(f̃_q) = { y ∈ Y | ‖y − f̃_q‖∞ ≤ λ } denotes the λ-ball around the observation f̃_q. Thus, we also have ‖f(s_q) − f̂_n(s_q)‖∞ ≤ λ + e(s_q) ≤ λ + ē.

Proof. Remember, our output-space metric is given by d_Y(y, y′) = ‖y − y′‖∞. For ease of notation, we will confine our proof to the case of one-dimensional outputs (m = 1).
The multi-dimensional case follows trivially from the one-dimensional result by applying it to each output component function. Let n ∈ N be fixed and, for ease of notation, write L := L(n). Let j, k ∈ {1, ..., N_n} be such that j ∈ argmin_i f̃_i + L d^α(s_i, s_q) and k ∈ argmax_i f̃_i − L d^α(s_i, s_q). By definition of f̂_n we have:

f̂_n(s_q) = 1/2 ( f̃_j + L d^α(s_j, s_q) ) + 1/2 ( f̃_k − L d^α(s_k, s_q) ),   (4)

where we abbreviate B := f̃_j + L d^α(s_j, s_q) and A := f̃_k − L d^α(s_k, s_q).
(i) Firstly, we show A ∈ [f̃_q, f̃_q + λ]: If k = q, this holds trivially true, since then A = f̃_q. So, assume k ≠ q. We have f̃_k ≥ f̃_k − L d^α(s_k, s_q) ≥ f̃_q − L d^α(s_q, s_q) = f̃_q, where the second inequality holds due to k ∈ argmax_i f̃_i − L d^α(s_i, s_q). That is,

A = f̃_k − L d^α(s_k, s_q) ≥ f̃_q.   (5)

On the other hand, since L ≥ max_{(s,s′) ∈ U_n} [ |f̃(s) − f̃(s′)| − λ ] / d^α(s, s′), we have in particular L ≥ [ |f̃_k − f̃_q| − λ ] / d^α(s_k, s_q). Thus, L d^α(s_k, s_q) + λ ≥ |f̃_k − f̃_q| ≥ f̃_k − f̃_q. Hence, f̃_q + λ ≥ f̃_k − L d^α(s_k, s_q) = A. Together with (5) we have shown A ∈ [f̃_q, f̃_q + λ].
(ii) The proof of B ∈ [f̃_q − λ, f̃_q] is completely analogous to that of (i) and hence is omitted.
(iii) Together, the statements in (i) and (ii) entail f̂_n(s_q) = (A + B)/2 ∈ [f̃_q − λ/2, f̃_q + λ/2] ⊆ [f̃_q − λ, f̃_q + λ]. Hence, d_Y(f̂_n(s_q), f̃(s_q)) ≤ λ.
Moreover, for any sample input s_q we have f̂_n(s_q) = f(s_q) + φ_q + ψ_q with d_Y(0, ψ_q) ≤ λ and d_Y(0, φ_q) ≤ d_Y(0, e(s_q)) ≤ ē. Our output-space metric is translation-invariant and hence, d_Y(f(s_q), f̂_n(s_q)) = d_Y(0, f̂_n(s_q) − f(s_q)) = d_Y(0, φ_q + ψ_q) ≤ λ + d_Y(0, e(s_q)) ≤ λ + ē.

To assess our learning rule, we might be interested in the discrepancy d_F(f̂_n, f) between the predictor f̂_n and the target function f relative to some metric d_F between functions in the space F of continuous functions. In statistics, a typical choice is the mean-square error metric assessed with respect to some distribution over inputs, the function space and the noise. However, in many safety-critical applications, often arising in control, worst-case error considerations are of greater value, leading to a worst-case metric d_F(f, g) = sup_{x ∈ I} d_Y(f(x), g(x)) for some subset I ⊆ X of queries one finds interesting to take into consideration.
Therefore, we will now establish worst-case consistency guarantees of our LACKI inference rules. That is, we shall study the worst-case error sequence

E∞ := (E∞_n)_{n∈N},  E∞_n := sup_{x ∈ I} d_Y(f̂_n(x), f(x))   (6)

for data D_n that becomes increasingly dense over time relative to a set of query inputs I ⊆ X. To clarify the latter concept, consider the sequence of grids (G_n)_{n∈N}. We say this sequence becomes dense relative to a set I in the limit of large n if we can use points in the sequence to approximate any point in I with increasing accuracy. That is, if ∀ε > 0, x ∈ I ∃n_0 ∀n ≥ n_0 ∃g ∈ G_n : d(x, g) < ε. If the rate at which this happens is independent of x, then we say that the grid sequence becomes dense uniformly.
This is the case iff ∀ε > 0 ∃n_0 ∀n ≥ n_0, x ∈ I ∃g ∈ G_n : d(x, g) < ε. To make the rates explicit in our notation, we list the following general definitions:

Definition 2.8 (Becoming dense, rates, →_r, ⇝_r, ⇒_r). Let X be a space endowed with a pseudo-metric d. Let r : N → R be a "rate" function that vanishes, that is, with lim_{n→∞} r(n) = 0 (i.e. r ∈ o(1)).
• The sequence s = (s_n)_{n∈N} of points in X is said to converge to a point x ∈ X with rate r (denoted by s →_r x) iff ∀n ∈ N : d(x, s_n) ≤ r(n) and r(n) → 0 as n → ∞.
  – The sequence s is said to converge to a set S ⊂ X with rate r : N → R (denoted by s →_r S) iff ∀n ∈ N : inf_{x ∈ S} d(x, s_n) ≤ r(n) and r(n) → 0 as n → ∞.
• A sequence of sets 𝒮 = (S_n)_{n∈N} is said to become dense relative to x ∈ X with rate r (denoted by 𝒮 ⇝_r x) iff 𝒮 contains a point sequence that converges to x with that rate. That is, iff ∃s = (s_n)_{n∈N} : s →_r x ∧ ∀n : s_n ∈ S_n.
  – Similarly, the sequence of sets 𝒮 is said to become dense relative to a set of points S ⊂ X (denoted by 𝒮 ⇝ S) iff it becomes dense relative to all points of S, i.e. iff ∀x ∈ S : 𝒮 ⇝_{r_x} x for some vanishing rate r_x : N → R.
  – The sequence is said to become dense relative to S uniformly (denoted by 𝒮 ⇒ S) iff there is a single vanishing rate for all x ∈ S. That is, if ∃r : N → R : lim_{n→∞} r(n) = 0 ∧ sup_{x ∈ S} inf_{s_n ∈ S_n} d(s_n, x) ≤ r(n), ∀n. Function r is referred to as the convergence rate, and we write 𝒮 ⇒_r S to denote that 𝒮 becomes dense relative to S with uniform rate r.

Theorem 2.9 (LACKI can learn any Hölder function). Assume the following holds true:
1. The observational errors given by e are bounded from above by ē = sup_x d_Y(0, e(x)) ∈ R≥0.
2. The target f : X → Y is Hölder continuous, i.e. ∃L* ∈ R : f ∈ H(L*, α).
Under these assumptions we can give the following guarantees:
(A) If the grid becomes dense (pointwise), the point-wise worst-case error vanishes up to λ + ē: If ∀x ∈ I ⊂ X ∃r_x ∈ o(1) : L(·) r_x^α(·) ∈ o(1) ∧ (G_n)_{n∈N} ⇝_{r_x} x, then ∀x ∈ I : (d_Y(f̂_n(x), f(x)))_{n∈N} →_{ϱ_x} [0, λ + ē], where for the error convergence rate ϱ_x we have ϱ_x(n) ≤ (L(n) + L*) r_x^α(n), ∀n ∈ N.
(B) If the grid becomes dense in I ⊂ X uniformly, then the worst-case prediction error vanishes uniformly (up to λ + ē): If ∃r ∈ o(1) : L(·) r^α(·) ∈ o(1) ∧ (G_n) ⇒_r I, then E∞ →_ϱ [0, λ + ē], where for the uniform error convergence rate ϱ we have ϱ(n) ≤ (L(n) + L*) r^α(n), ∀n ∈ N.

Proof. We have established that the predictors f̂_n(·) of the LACKI rule are L(n)-α-Hölder (Lem. 2.6) and sample-consistent up to level λ (Lem. 2.7).
For any input x ∈ X, let ξ^x_n denote a nearest neighbour of x in the grid G_n. That is, ξ^x_n ∈ arg inf_{s ∈ G_n} d(x, s). Since G_n is assumed to become dense in the input domain, for any input x there is a rate function r_x : N → R≥0 such that r_x(n) → 0 and d(x, ξ^x_n) ≤ r_x(n), ∀n ∈ N. In the case of uniform convergence, a rate function can be chosen independently of x and will be denoted by r rather than r_x.
(A) For all n ∈ N and x ∈ X we have:

d_Y(f̂_n(x), f(ξ^x_n)) ≤(i) d_Y(f̂_n(x), f̂_n(ξ^x_n)) + d_Y(f̂_n(ξ^x_n), f(ξ^x_n))   (7)
  ≤(ii) d_Y(f̂_n(x), f̂_n(ξ^x_n)) + λ + d_Y(0, e(ξ^x_n)) = ‖f̂_n(x) − f̂_n(ξ^x_n)‖∞ + λ + ē   (8)
  ≤(iii) L(n) d(x, ξ^x_n)^α + λ + ē.   (9)

Here, (i) follows from the triangle inequality, (ii) leverages Lem. 2.7 and (iii) follows by Hölder continuity of the predictors (Lem. 2.6).
Thus, for x ∈ X, n ∈ N:

0 ≤ d_Y(f̂_n(x), f(x)) ≤ d_Y(f̂_n(x), f(ξ^x_n)) + d_Y(f(ξ^x_n), f(x))   (10)
  ≤(†) (L(n) + L*) d(x, ξ^x_n)^α + λ + ē,   (11)

where (†) follows from (9) and the presupposed Hölder continuity of f.
Since, by assumption, d(x, ξ^x_n) ≤ r_x(n), ∀n, this implies:

d_Y(f̂_n(x), f(x)) ∈ [0, (L(n) + L*) r_x(n)^α + λ + ē], ∀n.   (12)

By assumption, r_x(n), L(n) r_x^α(n) → 0 as n → ∞, ∀x, and hence d_Y(f̂_n(x), f(x)) converges to [0, λ + ē], ∀x, with rate ϱ_x ≤ (L(n) + L*) r_x(n)^α.
(B) Proceeding analogously as before, but utilising uniform convergence with rate r, we obtain:

d_Y(f̂_n(x), f(x)) ∈ [0, (L(n) + L*) r(n)^α + λ + ē], ∀x ∀n.   (13)

By assumption, L(n) r^α(n) ∈ o(1) and thus lim_{n→∞} L(n) r(n)^α = 0. Hence,

E∞ = ( sup_{x ∈ I} d_Y(f̂_n(x), f(x)) )_{n∈N} →_ϱ [0, λ + ē]

with rate ϱ such that ϱ(n) ≤ (L(n) + L*) r(n)^α, ∀n.
Note that a necessary condition was that the product of L(n) and the rate is in o(1), that is, vanishing in the limit of n → ∞. A sufficient condition for this to hold is that L(n) is guaranteed to be bounded (assuming the rate is vanishing). Above, we have established a sufficient condition for this (cf. Rem. 2.2): L(n) is bounded as long as parameter λ ≥ 2ē + q for any q ≥ 0. This yields the following result:
Corollary 2.10. With definitions and assumptions as in Thm. 2.9, if parameter λ is chosen to be 2ē + q for some q ≥ 0, then convergence to the ground truth is guaranteed (up to twice the observational error and a term dependent on q). In particular, if the data becomes dense uniformly in I ⊆ X with a rate of r(n), then, for some L̄ ∈ [0, L*] and any n ∈ N, we have

sup_{x ∈ I} d_Y(f̂_n(x), f(x)) ≤ (L̄ + L*) r(n)^α + q + 2ē → q + 2ē as n → ∞.   (14)

Of course, in the absence of observational errors one can choose λ = 0. In this case, the corollary implies that LACKI will learn the ground truth arbitrarily well in the limit of infinitely dense data.

Remark 2.11. Our bounds rely on the proximity (expressed by the rate functions) of the query input to the previously observed data. Refer to Thm. 2.9. Roughly speaking, for a particular query input x, our guarantee in (A) asserts that the closer the query is to the previously seen data, the better the confidence in prediction accuracy. In (B), this is extended to a worst-case statement implying that the smaller the worst-case proximity of the data to any query in I, the smaller the worst-case prediction error can be. Unfortunately, this worst-case proximity, and therefore the prediction error bound, is subject to the curse of dimensionality. That is, the number of samples necessary to guarantee a desired reduction in worst-case prediction uncertainty will inevitably have to scale exponentially with the dimensionality of the space. A manifestation of this fact can be seen in Sec. 2.3.2, where we give a sample complexity bound for uniformly distributed input samples.

Having established that our LACKI rule can learn any Hölder function with any Hölder constant, we will now extend the results to non-Hölder functions. In preparation of the necessary derivations, we first rehearse universality and Hölder properties of radial basis function networks. Park and Sandberg derived universal approximation guarantees for radial basis function networks [33]. In particular, on page 252 of their article, the authors make an assertion that translates to our notation as follows:
Lemma 2.12 (Expressiveness of RBFNs). Assume X ⊆ R^d is compact. Given a parameter vector θ := (w_1, ..., w_m, σ_1, ..., σ_m, c_1, ..., c_m) and a kernel function K, let β(·; θ) = Σ_{i=1}^m w_i K((· − c_i)/σ_i) denote a radial basis function network (RBFN). Assume K : R^d → R is continuous and has non-vanishing integral, i.e. ∫_{R^d} K(x) dx ≠ 0. Then, the set S_K := { β(·; θ) | m ∈ N, θ an admissible parameter vector } of all RBFNs is uniformly dense in the set C(X) of continuous functions on the compact domain X. That is,

∀f ∈ C(X) ∀ε > 0 ∃m ∈ N, θ : sup_{x ∈ X} |f(x) − β(x; θ)| < ε.

Remark 2.13. We note that, for any finite-dimensional parameter θ, any RBFN β(·; θ) is Lipschitz continuous as long as the kernel K is. This can be seen by applying Lem. A.8, which allows us to conclude that a Lipschitz constant of the RBFN β(·; θ) = Σ_{i=1}^m w_i K((· − c_i)/σ_i) is given by L_β = Σ_{i=1}^m |w_i / σ_i| L_K, where L_K ∈ R≥0 denotes a Lipschitz constant of K. By the same lemma it is easy to see that choosing the Gaussian kernel for K satisfies both the Lipschitz requirement as well as the integrability requirement of Lem. 2.12. As a by-product, this means that on a compact support, any continuous function can be approximated by some Lipschitz function with arbitrarily small, positive worst-case error ε > 0. Note, it may well be the case that the Lipschitz constant of the approximator grows with decreasing approximation error bound ε. We consider this to be inevitable when the approximated function is not Lipschitz.
Harnessed with these preparatory statements, we can move on to show that the LACKI rule can be set up to learn any continuous function up to arbitrarily low error.
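As a small numerical cross-check of Remark 2.13 (our own illustration; the Gaussian kernel, weights, centres and widths below are arbitrary assumptions), one can compare the analytic bound L_β = Σ_i |w_i/σ_i| L_K for a one-dimensional Gaussian-kernel RBFN against a finite-difference estimate of the network's steepest slope:

```python
import numpy as np

K = lambda x: np.exp(-x ** 2)                 # Gaussian kernel (1D)
L_K = np.sqrt(2.0 / np.e)                     # max |K'(x)| = 2|x| exp(-x^2), attained at x = 1/sqrt(2)

w = np.array([1.5, -0.7, 0.3])                # example weights
c = np.array([-1.0, 0.2, 2.0])                # centres
sigma = np.array([0.5, 1.0, 0.25])            # widths

beta = lambda x: np.sum(w * K((x[..., None] - c) / sigma), axis=-1)

L_beta_bound = np.sum(np.abs(w / sigma)) * L_K           # bound from Remark 2.13
x = np.linspace(-5, 5, 200001)
slope = np.max(np.abs(np.diff(beta(x)) / np.diff(x)))    # empirical steepest slope of the RBFN

print(L_beta_bound, slope)   # the empirical slope stays below the analytic Lipschitz bound
```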
Theorem 2.14 (Universality of LACKI). Assume we are given a sequence (D_n)_{n∈N} of samples with observational errors bounded by ē ∈ R≥0. We set the parameters of the LACKI rule to B = −∞, B̄ = ∞, L = 0 and λ := 2ē + q for some q > 0. In this theorem, we assume that the set of interest I ⊆ X is compact. Then, we have:
The LACKI rule as per Def. 2.4 is a universal approximator in the following sense: If the sequence of input grids (G_n)_{n∈N} becomes dense relative to I (uniformly), then the sequence of predictors (f̂_n)_{n∈N} computed by the LACKI rule (uniformly) converges to any continuous target f : X → R up to error 2ē + 3q. That is, the following holds true:
• (I) Let x ∈ I. If ∃r_x ∈ o(1) : (G_n) ⇝_{r_x} x, then ∃C ∈ R : (d_Y(f̂_n(x), f(x)))_{n∈N} converges to [0, 2ē + 3q] with rate C r_x^α.
• (II) If ∃r ∈ o(1) : (G_n) ⇒_r I, then ∃C ∈ R : E∞ converges to [0, 2ē + 3q] with rate C r^α.

Proof. We choose any parameter λ = 2ē + q with q > 0.
As observed in Rem. 2.13, Lem. 2.12 allows us to infer that there exists a Lipschitz function h that approximates the target with a worst-case error of at most q. That is, sup_{x ∈ X} d_Y(h(x), f(x)) ≤ q. (Also, note that Lipschitz continuity implies Hölder continuity for any Hölder exponent α ∈ (0, 1] on a compact domain, so h ∈ H(L_h, α) for some L_h ∈ R≥0.)
Consequently, there exists a function φ′ : X → Y with sup_x d_Y(0, φ′(x)) ≤ q accounting for the discrepancy between the Hölder function h and the target f: f = h + φ′.
Furthermore, we define φ to be the bounded observational noise. Hence, we have f̃ = f + φ and sup_x d_Y(0, φ(x)) ≤ ē. Combining both functions into ψ := φ + φ′, we can write f̃ = h + ψ with sup_x d_Y(0, ψ(x)) ≤ q + ē =: ν̄.
This can be interpreted as follows: Instead of viewing the given sample as being generated by the target f (with some observational error φ), we can view the sample as being generated by the Hölder function h corrupted by the extended "observational noise" ψ, accounting for both the original observational error and the discrepancy between the target and the Hölder function h. This gives us a reduction to the case of learning Hölder functions with observational error bounded by ν̄. Firstly, we note that our choice λ = 2ē + q entails that the sequence (L(n))_{n∈N} is bounded by some constant L̄ = sup_{x,x′ : d(x,x′)>0} [ d_Y(h(x), h(x′)) − q ] / d^α(x, x′) ≤ L_h. Linking to Thm. 2.9, we get all the desired statements with regard to learning h. These can easily be converted into statements about learning f by adding the worst-case difference q between f and h to all error bounds. For example, leveraging sup_x d_Y(0, φ′(x)) ≤ q and λ = 2ē + q, and going through analogous steps as in the previous theorem, we obtain:

d_Y(f̂_n(x), f(x)) = d_Y(f̂_n(x), h(x) + φ′(x)) ≤ d_Y(f̂_n(x), h(x)) + d_Y(0, φ′(x))   (15)
  ≤ (L̄ + L_h) d(x, ξ^x_n)^α + λ + ν̄ + q ≤ (L̄ + L_h) d(x, ξ^x_n)^α + 2ē + 3q,   (16)

where ξ^x_n := arg inf_{s ∈ G_n} d(x, s) denotes a nearest neighbour of x in the input sample G_n.
So, convergence (pointwise or uniform) of the grid to the input space with a rate of at most r(n) implies that the right-hand side of (16) and hence the prediction error converges (pointwise or uniformly) to the interval [0, 2ē + 3q] with a rate of at most (L̄ + L_h) r^α(n) as n → ∞.

2.3.2 Convergence in probability with uniformly distributed inputs

Above, we have given guarantees relative to deterministic convergence rates of the input sample to the domain. In this subsection, we shall study probabilistic convergence rates as a function of the sample size in situations where the sample is obtained by drawing inputs independently from a uniform probability distribution on I = X := [0, 1]^d.
We can show that the worst-case prediction error given by sup_{x ∈ X} d_Y(f̂_n(x), f(x)) vanishes (up to the usual worst-case bounds in the presence of observational errors) in probability for canonical input-space metrics:
Theorem 2.15. Let X = [0, 1]^d be the domain of the target function f ∈ H(L*, α). Assume the input data G_n = {s_1, ..., s_n} contains n sample inputs which are drawn independently at random from a uniform distribution over X. Furthermore, assume there are no observational errors, i.e. ē = 0, and that d(x, x′) = ‖x − x′‖∞, ∀x, x′ ∈ X. Then the worst-case error of our LACKI predictor vanishes in probability. That is,

∀ε > 0 ∀δ ∈ (0, 1) ∃N ∈ N ∀n ≥ N : Pr[ sup_{x ∈ X} d_Y(f̂_n(x), f(x)) > ε ] ≤ δ.

In particular, for all δ ∈ (0, 1) we have Pr[ sup_{x ∈ X} d_Y(f̂_n(x), f(x)) > ε ] ≤ δ
1. for any ε ≥ L*, provided that n ≥ 1;
2. for any ε < L*, provided that n ≥ N := ⌈ log(δ 2^{−kd}) / log(1 − 2^{−kd}) ⌉ with k = ⌈ log((L*/ε)^{1/α}) / log 2 ⌉.

Proof. Let r_n := sup_{x ∈ X} min_{s ∈ G_n} d(x, s) = sup_{x ∈ X} min_{s ∈ G_n} ‖x − s‖∞ ≤ 1 and let P^ε_n := Pr[ sup_{x ∈ X} d_Y(f̂_n(x), f(x)) > ε ], which we intend to bound from above. Remember, from Cor. 2.10, sup_x d_Y(f̂_n(x), f(x)) ≤ L* r_n^α. Hence, for ε ≥ L*, P^ε_n = 0, ∀n ∈ N.
So, it suffices to focus on the case where ε < L*. Now, sup_x d_Y(f̂_n(x), f(x)) ≤ ε is implied by sup_x d_Y(f̂_n(x), f(x)) ≤ L* r_n^α provided that r_n ≤ (ε/L*)^{1/α}. So, we define an event E_n that ensures r_n satisfies the latter inequality with a probability that grows as n increases. To this end, we introduce a partition of the domain into m hyper-rectangles H_1, ..., H_m of equal size, each having edge length l_k = 2^{−k}, where k is a natural number such that l_k ≤ (ε/L*)^{1/α}. As a valid choice, we set k := ⌈ log((L*/ε)^{1/α}) / log 2 ⌉. Note, Pr[s_i ∈ H_j] = l_k^d = 2^{−kd}. By construction, in the event that each hyper-rectangle ends up containing at least one sample input of G_n, we have r_n ≤ (ε/L*)^{1/α}. We define the complement of this event as Ē_n := { (s_1, ..., s_n) ∈ X^n | ∃j ∈ {1, ..., m} ∀i ∈ {1, ..., n} : s_i ∉ H_j }. Let W := { s = (s_1, ..., s_n) | sup_x d_Y(f̂_n(x), f(x)) > ε } be the event that the sample inputs are located in such a fashion that they give rise to a worst-case error larger than ε. We have: s ∉ Ē_n implies that r_n ≤ (ε/L*)^{1/α}, which in turn implies sup_x d_Y(f̂_n(x), f(x)) ≤ ε, i.e. that s ∉ W. Hence, W ⊆ Ē_n and thus P^ε_n = Pr[W] ≤ Pr[Ē_n]. So, to bound P^ε_n from above it suffices to bound Pr[Ē_n] from above, which we will do next: We can employ the union bound, utilise that m = 2^{kd} and the fact that the s_i are drawn i.i.d. from a uniform distribution to see that Pr[Ē_n] ≤ Σ_{j=1}^m Π_{i=1}^n Pr[s_i ∉ H_j] = 2^{kd} (1 − 2^{−kd})^n → 0 as n → ∞. So, to ensure Pr[W] ≤ δ for n sufficiently large, we consider the inequality 2^{kd} (1 − 2^{−kd})^n ≤ δ. Taking the log on both sides and rearranging yields the sufficient condition n ≥ log(δ 2^{−kd}) / log(1 − 2^{−kd}).
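To get a feel for how the sample-size bound of Theorem 2.15 scales with the input dimension, one can evaluate the formula directly; the following sketch does so for arbitrary illustrative choices of ε, δ, L* and α (all numbers are our own assumptions):

```python
import math

def sample_size_bound(eps, delta, d, L_star, alpha=1.0):
    """Sufficient n from Thm. 2.15 for Pr[worst-case error > eps] <= delta (case eps < L*)."""
    k = math.ceil(math.log((L_star / eps) ** (1.0 / alpha)) / math.log(2.0))
    p = 2.0 ** (-k * d)                      # probability of a sample hitting one fixed hyper-rectangle
    return math.ceil(math.log(delta * p) / math.log(1.0 - p))

for d in (1, 2, 4, 8):
    print(d, sample_size_bound(eps=0.1, delta=0.05, L_star=1.0, alpha=1.0, d=d))
# The required n grows roughly like 2^{kd} * (kd log 2 + log(1/delta)), i.e. exponentially in d,
# which is the curse of dimensionality mentioned in Remark 2.11.
```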
In the theorems above, we considered the worst-case asymptotics for the case where the data becomes dense in the domain; the error was evaluated on the entire input domain. By contrast, we will now consider an online learning setting where we incrementally get to observe samples along a trajectory of inputs (x_n)_{n∈N} and are interested in the long-term one-step-lookahead prediction errors on this trajectory. That is, we are interested in the evolution of the prediction errors d_Y(f̂_n(x_n), f(x_n)), where the predictor f̂_n(·) is based on D_n = D_{n−1} ∪ {(x_{n−1}, f̃(x_{n−1}))}, ∀n > 1, and the trajectory (x_n)_{n∈N} is bounded.
In preparation of these considerations, we will establish the following facts:

Lemma 2.16. Assume we are given a trajectory (x_n)_{n∈N} of inputs with x_n ∈ X, where the input space X can be endowed with a shift-invariant measure. Furthermore, assume the sequence is bounded, i.e. d_X(x_n, 0) ≤ β for some β ∈ R_+ and all n ∈ N. Finally, assume the inputs of the available data coincide with the complete history of past inputs, i.e. G_n = {x_i | i ∈ N, i < n}. Then we have:

dist(G_n, x_n) = min{ d_X(g, x_n) | g ∈ G_n } → 0 as n → ∞.
Proof. The intuition behind the following proof is that, if the distances were not to converge, there would be an infinite number of disjoint balls around the input points that sum up to infinite volume. This, however, would contradict the presupposed boundedness of the sequence. We formalise this intuition as follows: We can rephrase the desired convergence statement as

∀ε > 0 ∃n_0 ∈ N ∀m ≥ n_0 : dist(x_m, G_m) ≤ ε.   (18)

For contradiction, assume that

∃ε > 0 ∀n ∈ N ∃m(n) ≥ n : dist(x_{m(n)}, G_{m(n)}) > ε.   (19)

Fix such an ε > 0 and n ∈ N. By definition of G_{m(n)} = {x_i | i < m(n)} we have:

∀i < m(n) : d_X(x_{m(n)}, x_i) > ε.   (20)

Applying (19) repeatedly yields an infinite subsequence of inputs whose pairwise distances all exceed ε. Hence, the balls of radius ε/2 centred at these inputs are pairwise disjoint, and by shift-invariance of the measure µ each of them has the same positive measure v := µ(B_{ε/2}(0)) > 0. Since d_X(x_n, 0) ≤ β for all n, all of these balls are contained in the bounded set Ī := B_{β+ε/2}(0), which has finite measure µ(Ī) < ∞. Summing the measures of more than µ(Ī)/v of the disjoint balls, which is possible since there are infinitely many of them and v > 0, we have concluded the false statement µ(Ī) > µ(Ī).
Theorem 2.17. Assume that, for some q ≥ 0, we chose λ = 2ē + q in our LACKI prediction rule. And assume that the target f is Hölder continuous up to some error level Ē_h. That is, f = φ + ψ with φ ∈ H(L*, α) and a function ψ such that sup_x d_Y(0, ψ(x)) ≤ Ē_h ∈ R.
Assume we are given a trajectory (x_n)_{n∈N} of inputs that is bounded, i.e. where d(x_n, 0) ≤ β for some β ∈ R_+ and all n ∈ N. Furthermore, assume D_{n+1} = D_n ∪ {(x_n, f̃(x_n))} and thus G_n = {x_i | i ∈ N, i < n}. Then the prediction error on the sequence vanishes up to the level of sample-consistency and Hölder continuity in the following sense:

d_Y(f̂_n(x_n), f(x_n)) → [0, q + 2ē + 2Ē_h] as n → ∞.

In particular, in case the observations are error-free (f̃ = f) and the target is Hölder continuous then, when choosing λ = 0, the prediction error is guaranteed to vanish. That is,

d_Y(f̂_n(x_n), f(x_n)) → 0 as n → ∞.
Let ξ n ∈ argmin g ∈ G n d ( x n , g ) denote the nearest neighbour of x n in G n = { x , ..., x n − } .Since sequence ( x n ) is bounded, Lem. 2.16 is applicable and hence: (i) lim n →∞ d ( x n , ξ n ) = 0.From Lem. 2.7 we conclude d Y (cid:0) ˆ f n ( ξ n ) , f ( ξ n ) (cid:1) ≤ e + q . Hence, appealing to the triangle inequality,we see that (ii) d Y (cid:0) ˆ f n ( x n ) , f ( ξ n ) (cid:1) ≤ d Y (cid:0) ˆ f n ( x n ) , ˆ f n ( ξ n ) (cid:1) + 2¯ e + q .Moreover we note that the predictors ˆ f n have H¨older constants L ( n ) and that the L ( n ) are boundedfrom above by some ¯ L ∈ R . Thus, ( iii ) ∃ ¯ L ∈ R ∀ n ∈ N : ˆ f n ∈ H ( ¯ L, α ).In conclusion, 0 ≤ d Y (cid:0) ˆ f n ( x n ) , f ( x n ) (cid:1) ≤ d Y (cid:0) ˆ f n ( x n ) , f ( ξ n ) (cid:1) + d Y (cid:0) f ( ξ n ) , f ( x n ) (cid:1) ( ii ) ≤ d Y (cid:0) ˆ f n ( x n ) , ˆ f n ( ξ n ) (cid:1) +2¯ e + q + d Y (cid:0) f ( ξ n ) , f ( x n ) (cid:1) ≤ d Y (cid:0) ˆ f n ( x n ) , ˆ f n ( ξ n ) (cid:1) + 2¯ e + q + d Y (cid:0) φ ( ξ n ) , φ ( x n ) (cid:1) + 2 ¯ E h ( iii ) ≤ ( ¯ L + L ∗ ) d ( x n , ξ n ) α + 2¯ e + q + 2 ¯ E h n →∞ −→ e + q + 2 ¯ E h . The computational complexity for computing the parameter update L ( n + 1) based on L ( n ), thepre-existing data D n of size N n and a newly arriving sample input is in O ( N n m ) where m is theeffort for evaluating the pseudo-metric. Typically, m will scale linearly with the dimensionality of theinput space. Therefore, online updates cost training time that will be linear in the number of existingtraining data and input dimensionality. In batch training, for a batch of N samples, computationof the estimate will require an effort in O ( N D ). Once the parameter L ( n ) is computed, the effortfor evaluating ˆ f n ( x ) is linear in the number of samples and, again, typically linear in the input andoutput space dimensionality.However, it should be noted though that generalised nearest neighbor techniques can be utilisedto reduce the prediction effort to expected logarithmic effort in the sample size (see [3]). Devised forstandard Lipschitz interpolation, this approach could be readily applied to our LACKI inference rule. Having a established worst-case prediction error bounds of our LACKI method, we now aim to illus-trate the benefits and shortcomings of our approach in a number of regression problems. In order toassess the predictive performance accurately, we need access to the ground truth. Therefore, we re-strict our benchmarks to artificial data for the time being. We conducted three different experiments.In each of the experiments we varied the conditions over the course of randomised trials as follows: • Exp. 0 : Here, we investigated the regression performance on the target function f : [0 , d → R , x (cid:55)→ | cos(2 πx ) | + x for increasing input-space dimensionality d . The observations wereperturbed by i.i.d. uniform noise sampled from the interval [ − . , . N n = 1000 examples. In trial n , the input space dimensionality was set to d = 2 n . • Exp. 1 : A repetition of the setup of Exp.1 but for fixed input space dimensionality d = 2 andincreasing sample size. In trial n the training data size was set to N n = 2 n + 1.16 ���� ��� ��� ��������� � � � �� ���� � ��������� ��� ��� ��������� � � � �� �������� ��� ��� ��������� � � � �� ������� ��� ��� ��������� � � � �� ����������� ��� ��� ��������� � � � �� ����� Figure 3:
Exp. 0.
The predictors of the various regression methods on target function f for d = 1 and training sample size N_n = 500. Training examples are plotted as light blue dots, the graph of the target function is plotted in dark blue and the predictions are plotted in magenta.
(Figure 4 panels: RMS, log-Training Time (log-TT) and log-Prediction Time (log-PT) for GP, GP2, Lin. Mod., LACKI2 and LACKI.)
Figure 4:
Exp. 0.
Comparisons on target function f over a range of different trials. In the n thtrial, test function 1 was sampled uniformly at random on the domain 2 n -dimensional input spacedomain X = [0 , n with fixed training data size |D n | = 500 but varying input space. Depicted arethe measured means and standard deviations for each method and each trial. Note how LACKI’spredictive performance degrades comparatively mildly with increasing dimensionality on the giventest function. 17 Exp. 2 : A repetition of the setup of Exp. 2, but for the benchmark target (considered in [3]) f : x (cid:55)→ sin( x ) sin( x ) + 0 . (cid:0) sin(5 x ) sin(5 x ) (cid:1) under no observational noise.In each trial of each experiment, we recorded the following performance measures:1. For a predictor ˆ f , on a set X test of test inputs we recorded the empirical root-mean-square error RMS = (cid:80) x ∈X test (cid:12)(cid:12)(cid:12) f ( x ) − ˆ f ( x ) (cid:12)(cid:12)(cid:12) as well as maximal prediction error ME = max x ∈X test (cid:12)(cid:12)(cid:12) f ( x ) − ˆ f ( x ) (cid:12)(cid:12)(cid:12) .2. The log of the run time measurement (in sec.) of training the predictor ( log-TT ).3. The log of the run time measurement (in sec.) of computing the predictions of the random testinputs divided by the number of test inputs ( log-PT ).Here, the test sample inputs in X test were drawn i.i.d. at random from a uniform distribution overthe domain X . Therefore, the pertaining performance measures were random variables. In each trial,we have obtained a sample (of size 30) of these random variables and recorded their empirical meansand standard deviations for each trial and for the following regression techniques: • LACKI : Our LACKI method with parameter choice set to λ = 1. • LACKI2 : Our LACKI method as above, but with parameter choice λ = 0. • GP : A Gaussian process [34] with fixed covariance function k ( x, x (cid:48) ; θ ) = θ exp( (cid:107) x − x (cid:48) (cid:107) θ ). Wedetermined the parameters manually to give good results on Exp. 1 for d = 1. This tuningprocess resulted in the choice of θ = (1 , ) with observational noise variance (the resultingpredictor for a 1-dimensional data set is depcited in Fig. 3). The predictor was chosen tocoincide with the mean function of the posterior process. • GP2 : A GP with hyper-parameters determined by following the standard approach of max-imising the marginal log-likelihood of the data [34]. Optimisation was done without restartsemploying BFGS. The optimiser was started with initial hyper-parameters set to θ = (1 ,
1) and the observational noise parameter being initialised with ē. • Lin. Mod.: A linear regression model fitted with the least-squares method.
The code was implemented in pure Julia 0.4.7, with the GPs making use of the library GaussianProcesses.jl. The code was executed on a 2015 MacBook Pro furnished with an i7 processor and 16 GB RAM, running OS X 10.11.6.
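For readers who prefer code to formulas, the following Julia fragment sketches the two ingredients used throughout the experiments: the lazy update of the Hölder-constant estimate L(n) and the kinky inference prediction (the midpoint of the ceiling and floor functions). It is a simplified illustration under the assumption that d is the maximum norm and that outputs are scalar; the struct name, field names and the plain O(n) update are ours and not the exact experiment code.

# Minimal LACKI sketch in Julia (illustrative; not the exact experiment code).
# Assumes scalar outputs, d(x, x') = ||x - x'||_inf, and the standard
# kinky-inference form: predict the midpoint of the ceiling and floor
# functions induced by the current Hölder-constant estimate L(n).
mutable struct LACKI
    X::Vector{Vector{Float64}}   # sample inputs
    Y::Vector{Float64}           # (possibly noisy) observations
    L::Float64                   # lazily adapted Hölder-constant estimate L(n)
    p::Float64                   # Hölder exponent α
    λ::Float64                   # slack parameter, λ = 2ē + q
end

dist(x, xp) = maximum(abs.(x .- xp))

# Lazy update: enlarge L(n) only as far as needed to keep the new observation
# λ-consistent with the existing data (O(N_n) work per new sample).
function add_sample!(m::LACKI, x, y)
    for (xi, yi) in zip(m.X, m.Y)
        d = dist(x, xi)
        d > 0 || continue                    # skip repeated inputs
        m.L = max(m.L, (abs(y - yi) - m.λ) / d^m.p)
    end
    push!(m.X, x); push!(m.Y, y)
end

# Kinky inference prediction: midpoint of the upper (ceiling) and lower (floor) bounds.
function predict(m::LACKI, x)
    u = minimum(yi + m.L * dist(x, xi)^m.p for (xi, yi) in zip(m.X, m.Y))
    l = maximum(yi - m.L * dist(x, xi)^m.p for (xi, yi) in zip(m.X, m.Y))
    return (u + l) / 2
end

For instance, in Exp. 0 one would call add_sample! on the noisy observations and then query predict at test inputs; setting λ = 1 = 2ē corresponds to the LACKI configuration and λ = 0 to LACKI2, up to implementation details of the original code.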
Discussion:
The results of the experiments are depicted in Fig. 4 - Fig. 6. We note that, when the noise hyper-parameter λ was set correctly to twice the level of observational error (e.g. LACKI in Exp. 0 and Exp. 1, and LACKI2 in Exp. 2), our approach yielded good predictive performance that outperformed the GPs considering both prediction accuracy and computation. Interestingly, even in the presence of stochastic observational noise, LACKI was able to yield worst-case prediction error below the level of observational error. Furthermore, the prediction errors seemed to vanish in the limit of increasing sample size with a rate matching or outperforming the competing regression methods (ref. Fig. 5). Furthermore, observe that the performance deterioration with increasing dimensionality of the input space (ref. Fig. 4) was less than with the GPs. This is noteworthy since one might expect the GP-based predictors to benefit from stronger regularisation. Of course, we cannot claim that this superior performance over GPs will hold in general, but it is interesting to note that it can hold.
As stated above, the superior performance of the LACKI approach was contingent on setting the observational noise parameter to the correct value of λ = 2ē. In fact, Exp. 2 was designed to expose a shortcoming of our approach: namely the sensitivity to correctly setting λ to zero in the absence of observational noise (ē = 0) when no information about the Lipschitz constant of the target is known.
(Figure 5 panels: Max. Error (ME), log-Training Time (log-TT) and log-Prediction Time (log-PT) for GP, GP2, Lin. Mod., LACKI2 and LACKI.)
Figure 5:
Exp. 1.
Comparisons on target function f over a range of different trials. In the n th trial,the noise-perturbed target function was sampled uniformly at random on the domain X = [0 , with varying training data size |D n | = 2 n + 1. The plots depict measured means and standard deviationsover 30 repetitions for each trial. Prediction error measures were estimated based on 25000 testsamples drawn independently from the domain. Run times for training and prediction are depictedon a log-scale. We note that LACKI, with correct noise parameter λ = 1 overall outperformed theother methods in terms of prediction accuracy.As we can see in Fig. 6, falsely setting λ = 1 resulted in poor predictive performance in Exp. 2. Weexplain this as follows: the target was confined to the interval [ − . , . λ = 1 denied an increase of L ( n ).Thus, the resulting predictor of LACKI remained a constant and hence, gave rise to a relatively largeand non-decreasing prediction error. Note, this problem would be prevented by either setting L ( n ) toa valid Lipschitz constant of the target, or, by setting λ = 0 reflecting the absence of observationalerrors. The positive effect of the latter is also testified by the plot of LACKI2 in Fig. 6 showing thatour method for this setting was competitive with the best GP method on the noiseless regression task.While our experiments might suggest that GPs may not always be as sensitive to the noise hyper-parameter settings we would like to emphasise the fact that the GP learners were carefully initialisedto give good performance on the problems. And, suitable alterations of these choices could provokesubstantially degradation of the GPs predictive performance, even in the limit of dense data.In summary, the experimental results suggest the following observation: If the parameter λ iscorrectly set to 2¯ e , our LACKI method can offer a fast and reliable approach to regression underbounded stochastic noise that can outperform GP regression when assessed along performance metricsthat reflect prediction accuracy and computational effort. We note that the latter could be furtherenhanced by applying nearest-neighbor approaches for Lipschitz interpolation in lieu to methodsproposed in [3, 7]. So far, we have established some learning guarantees for LACKI as a method for supervised learning.In this section, we utilise our results and discuss the use of LACKI in the context of model-referenceadaptive control. We introduce the control framework with a simple example of controlling thesimulated roll dynamics of an aircraft under wing rock. In the second half of the remainder ofthe paper, we use our theoretical guarantees established in the first half in order to provide globalasymptotic convergence guarantees of the closed-loop trajectory to the reference. Throughout theentire section we simplify our analysis by assuming the pseudo-metrics d and d Y are in fact canonicalnorm-induced metrics. For instance, we assume that X = R d is a finite-dimensional vector space andwe have d ( x, x (cid:48) ) = (cid:107) x − x (cid:48) (cid:107) for some norm (cid:107)·(cid:107) equivalent to the maximum norm (cid:107)·(cid:107) ∞ .19 M a x . E rr o r ( M E ) GPGP2Lin. Mod.LACKI2LACKI l o g - T r a i n i n g T i m e ( l o g - TT ) GPGP2Lin. Mod.LACKI2LACKI l o g - P r e d i c t i o n T i m e ( l o g - P T ) GPGP2Lin. Mod.LACKI2LACKI
Figure 6:
Exp. 2.
Repetition of Exp. 1 but with target function f and no observational noise. Inthe n th trial, the target was sampled uniformly at random on the domain X = [0 , with varyingtraining data size |D n | = 2 n + 1. Prediction errors were estimated based on 25000 test samples drawnindependently from the domain. Run times for training and prediction times are depicted on a log-scale. LACKI, having a now falsely set noise parameter λ = 1, did not achieve good predictive results.By contrast, both GPs, as well as LACKI2 (with the correctly set parameter λ = 0) managed to learnthe target accurately with increasing data. However, in comparison to LACKI2, we note that theGPs exhibited much higher variability in performance for lower sample sizes and that GP2, whichtended to have slightly lower prediction error, achieved this at the expense of substantially highercomputational training effort. As pointed out in [11], modern fighter aircraft designs are susceptible to lightly damped oscillationsin roll known as “wing rock”. Commonly occurring during landing [35], removing wing rock from thedynamics is crucial for precision control of such aircraft. Precision tracking control in the presence ofwing rock is a nonlinear problem of practical importance and has served as a test bed for a numbernonlinear adaptive control methods [10, 31, 11].For comparison, we replicate the experiments of the recent work of Chowdhary et. al. [10, 9]. Here the authors have compared their Gaussian process based approach, called
GP-MRAC, to the more established adaptive model-reference control approach based on RBF networks [36, 25], referred to as RBFN-MRAC. Replacing the Gaussian process learner by our kinky inference learner, we readily obtain an analogous approach which we will refer to as LACKI-MRAC. As an additional baseline, we also examine the performance of a simple P-controller.
While, with the exact same parameter settings as in the experiments of [10], the performance of our LACKI-MRAC method comes second to GP-MRAC, we also evaluate the performance of all controllers over a range of 555 random parameter settings and initial conditions. As we will see, across this range of problem instances and parameter settings, LACKI-MRAC markedly outperforms all other methods.
Before proceeding with the wing rock application we will commence with (i) outlining model referenceadaptive control (MRAC) [1] as considered in [10] and (ii) describe the deployment of kinky inferenceto this framework. We will now rehearse the description of MRAC for second-order systems following[10].Assume m ∈ N to be the dimensionality of a configuration of the system in question and define d = 2 m to be the dimensionality of the pertaining state space X .Let x = [ x ; x ] ∈ X denote the state of the plant to be controlled. Given the control-affine system We are grateful to the authors for kindly providing the code. x = x (22)˙ x = a ( x ) + b ( x ) u ( x ) (23)it is desired to find a control law u ( x ) such that the closed-loop dynamics exhibit a desired referencebehaviour: ˙ ξ = ξ (24)˙ ξ = f r ( ξ, r ) (25)where r is a reference command, f r some desired response and t (cid:55)→ ξ ( t ) is the reference trajectory.If a priori a and b are believed to coincide with ˆ a , ˆ b respectively, the inversion control u =ˆ b − ( − ˆ a + u (cid:48) ) is applied. This reduces the closed-loop dynamics to ˙ x = x , ˙ x = u (cid:48) + ˜ a ( x, u ) where˜ a ( x, u ) captures the modelling error of the dynamics:˜ a ( x, u ) = a ( x ) − ˆ a ( x ) + (cid:0) b ( x ) − ˆ b ( x ) (cid:1) u. (26)Let I d ∈ R d × d denote the identity matrix. If b is perfectly known, then b − ˆ b − = 0 and the modelerror can be written as ˜ a ( x ) = a ( x ) − ˆ a ( x ). In particular, ˜ a has lost its dependence on the controlinput.In this situation [10, 9] propose to set the pseudo control as follows: u (cid:48) ( x ) := ν r + ν pd − ν ad where ν r = f r ( ξ, r ) is a feed-forward reference term, ν ad is a yet to be defined output of a learning module adaptive element and ν pd = [ K K ] e is a feedback error term designed to decrease the tracking error e ( t ) = ξ ( t ) − x ( t ) by defining K , K ∈ R m × m as described in what is to follow.Inserting these components, we see that the resulting error dynamics are:˙ e = ˙ ξ − [ x ; ν r + ν pd + ˜ a ( x )] = M e + B (cid:0) ν ad ( x ) − ˜ a ( x ) (cid:1) (27)where M = (cid:18) O m I m − K − K (cid:19) and B = (cid:18) O m I m (cid:19) . If the feedback gain matrices K , K parametris-ing ν pd are chosen such that M is stable then the error dynamics converge to zero as desired, providedthe learning error E λ vanishes: E λ ( x ( t )) = (cid:107) ν ad ( x ( t )) − a ( x ( t )) (cid:107) t →∞ −→ a online. This is done by continuously feeding it training examples of the form (cid:0) x ( t i ) , ˜ a ( x ( t i )) + ε i (cid:1) where ε i is observational noise.Intuitively, assuming the learning algorithm is suitable to learn target ˜ a (i.e. ˜ a is close to someelement in the hypothesis space [30] of the learner) and that the controller manages to keep the visitedstate space bounded, the learning error (as a function of time t ) should vanish.Substituting different learning algorithms yields different adaptive controllers. RBFN-MRAC [25]utilises radial basis function neural networks for this purpose whereas
GP-MRAC employs Gaussianprocess learning [34] to learn ˜ a [10, 9].In what is to follow, we utilise our LACKI method as the adaptive element. Following the nomen-clature of the previous methods we name the resulting adaptive controller LACKI-MRAC . The wing rock dynamics control problem considers an aircraft in flight. Denoting x to be the rollattitude (angle of the aircraft wings) and x the roll rate (measured in angles per second), the controllercan set the aileron control input u to influence the state x := [ x ; x ].Based on [31], Chowdhary et. al. [10, 9] consider the following model of the wing rock dynamics:21
(Figure 7 panels, roll angle error (deg) and roll rate error (deg/s) over time: (a) Tracking error (RBF-MRAC). (b) Tracking error (GP-MRAC).)
0 10 20 30 40 50−0.50.0 0.5 time (seconds) r o ll r a t e e rr o r ( deg / s ) (c) Tracking error (KI-MRAC). Figure 7: Tracking error comparison of first example.˙ x = x (28)˙ x = a ( x ) + b u (29)where b = 3 is a known constant and a ( x ) = W ∗ + W ∗ x + W ∗ x + W ∗ | x | x + W ∗ | x | x + W ∗ x is an priori unknown nonlinear drift.Note, the drift is non-smooth but it would be easy to derive a Lipschitz constant on any boundedsubset of state space if the parameters W := ( W ∗ , . . . , W ∗ ) were known.To control the system we employ LACKI as the adaptive element ν ad . In the absence of theknowledge of a Lipschitz constant, we start with a guess of L = 1 (which will turn out to be too low)and update it following the procedure described in Sec. 2.2.In a first instance, we replicated the experiments conducted in [10, 11] with the exact same pa-rameter settings. That is, we chose W ∗ = 0 . , W ∗ = 0 . , W ∗ = 0 . , W ∗ = − . , W ∗ =0 . , W ∗ = 0 . x = (3 , (cid:62) and simulated forward with a first-orderEuler approximation with time increment ∆ = 0 . s ] over a time interval I t = [ t , t f ] with t = 0[ s ]and t f = 50[ s ]. Training examples and control signal were continuously updated every ∆ u = ∆ o =∆[ s ]. The RBF and GP learning algorithms were initialised with fixed length scales of 0.3 units. TheGP was given a training example budget of a maximum of 100 training data points to condition theposterior model on. Our LACKI learner was initialised with L = α = 1 and updated online followingour lazy update method described above.The test runs also exemplify the working of the lazy update rule. The initial guess L = 1 wastoo low. However, our lazy update rule successfully picked up on this and had ended up increasingconstant to L = 2 .
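To make the simulation setup concrete, the following sketch shows an Euler-discretised LACKI-MRAC loop on the wing-rock model above, reusing the LACKI sketch from the regression section. The reference dynamics f_r, the feedback gains, the step size, the initial state and the drift coefficients W below are placeholder values (several of the published settings are not reproduced here), so this illustrates the structure of the controller rather than replicating the experiments.

# Illustrative LACKI-MRAC loop on the wing-rock model (first-order Euler).
# W, gains, reference dynamics and step size are placeholder values.
a_drift(x, W) = W[1] + W[2]*x[1] + W[3]*x[2] +
                W[4]*abs(x[1])*x[2] + W[5]*abs(x[2])*x[2] + W[6]*x[1]^3

function run_lacki_mrac(; Δ = 0.01, T = 50.0, b = 3.0, K1 = 1.0, K2 = 1.0,
                        W = [0.8, 0.23, 0.69, -0.62, 0.01, 0.02])
    lacki = LACKI(Vector{Vector{Float64}}(), Float64[], 1.0, 1.0, 0.0)  # L(0)=1, α=1, λ=0
    x, ξ = [3.0, 0.0], [0.0, 0.0]           # plant state and reference-model state
    errs = Float64[]
    for t in 0:Δ:T
        r    = 5.0 * sign(sin(0.4t))         # placeholder roll-angle command
        ν_r  = -2ξ[1] - 2ξ[2] + 2r           # placeholder reference dynamics f_r(ξ, r)
        e    = ξ .- x                        # tracking error
        ν_ad = isempty(lacki.X) ? 0.0 : predict(lacki, x)   # LACKI adaptive element
        u    = (ν_r + K1*e[1] + K2*e[2] - ν_ad) / b         # inversion law with â = 0, b̂ = b
        x = x .+ Δ .* [x[2], a_drift(x, W) + b*u]           # Euler step of the plant
        ξ = ξ .+ Δ .* [ξ[2], ν_r]                           # Euler step of the reference model
        add_sample!(lacki, x, a_drift(x, W)) # online, error-free observation of the drift
        push!(errs, maximum(abs.(ξ .- x)))
    end
    return errs, lacki.L                     # tracking errors and final constant estimate
end

Running such a loop, the lazily adapted constant grows from its initial value only when the observed drift increments demand it, mirroring the behaviour of the lazy update rule reported above.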
(a) Prediction vs. ground truth (RBF-MRAC). (b) Prediction vs. ground truth (GP-MRAC).
0 10 20 30 40 50−6−4−20 2 4 6 8 Time ν ad Ground truth ν ad (c) Prediction v.s. ground truth(LACKI-MRAC). Figure 8: Prediction vs ground truth comparisons for the first example. Both nonparametric methodsaccurately predict the true drift and clearly outperform the RBFN learner.data points [34]. We would add that, as a non-parametric method, LACKI-MRAC shares this kindof flexibility, which might explain the fairly similar performance.However, being an online method, the authors of GP-MRAC explicitly avoided hyperparametertraining via optimising the marginal log-likelihood. The latter is commonly done in GP learning [34] toavoid the impact of an unadjusted prior but is often a computational bottle neck. Therefore, avoidingsuch hyperparameter optimisation greatly enhances learning and prediction speed in an online setting.However, we would expect the performance of the prediction to be dependent upon the hyperparametersettings. As we have noted above, the Lipschitz constant depends on the part of state space visitedat runtime. Similarly, we might expect length scale changes depending on the part of state space thetrajectory is in. Unfortunately, [10, 9, 11] provide no discussion of the length scale parameter settingand also called the choice of the maximum training corpus size “arbitrary”.Since the point of learning-based and adaptive control is to be able to adapt to various settings, wetest the controllers across a range of randomised problem settings, initial conditions and parametersettings.We created 555 randomised test runs of the wingrock tracking problems and tested each algorithmon each one of them. The initial state x ( t ) was drawn uniformly at random from [0 , × [0 , . , L for LACKI-MRAC was initialised atrandom from the same interval but was allowed to be adapted as part of the online learning process.Furthermore, we chose λ = 0. The parameter weights W of the system dynamics specified abovewere multiplied by a constant drawn uniformly at random from the interval [0 , K = K = 1.In addition to the three adaptive controllers we also tested the performance of a simple P D controller with just these feedback gains (i.e. we executed x-MRAC with adaptive element ν ad = 0).This served as a baseline comparison to highlight the benefits of the adaptive element over simplefeedback control.The performance of all controllers across these randomised trials is depicted in Fig. 9. Each datapoint of each boxplot represent a performance measurement for one particular trial.For each method, the figures show the boxplots of the following recorded quantities: • log-XERR : cummulative angular position error (log-deg), i.e. log( (cid:82) t f t (cid:107) ξ ( t ) − x ( t ) (cid:107) dt ). • log-XDOTERR : cummulative roll rate error (log-deg/sec.), i.e. log( (cid:82) t f t (cid:107) ξ ( t ) − x ( t ) (cid:107) dt ). • log-PREDERR : log-prediction error, i.e.log( (cid:82) t f t (cid:107) ν ad ( x ( t )) − ˜ a ( x ( t )) (cid:107) dt ). 23
(Panels: log-XERR, log-XDOTERR, log-PREDERR, log-CMD, log-max. RT (predictions), log-max. RT (learning).) (a) Results over 555 randomised examples.
Figure 9: Performance of the different online controllers over a range of 555 trials with randomisedparameter settings and initial conditions. 1: RBF-MRAC, 2: GP-MRAC, 3:LACKI-MRAC, 4: P-Controller. LACKI-MRAC outperforms all other methods with respect to all performance measures,except for prediction runtime (where the parametric learner RBF-MRAC performs best). • log-CMD : cummulative control magnitude (log-scale), i.e. log( (cid:82) t f t (cid:107) u ( t ) (cid:107) dt ). • log-max. RT (predictions) : the log of the maximal runtime (within time span [ t , t f ]) eachmethod took to generate a prediction ν ad within the time span. • log-max. RT (learning) : the log of the maximal runtime (within time span [ t , t f ]) it took eachmethod to incorporate a new training example of the drift ˜ a .As can be seen from Fig. 9, all three adaptive methods outperformed the simple α controller interms of tracking error.In terms of prediction runtime, the RBF-MRAC outperformed both GP-MRAC and LACKI-MRAC. This is hardly surprising. After all, RBF-MRAC is a parametric method with constantprediction time. By contrast, both non-parametric methods will have prediction times growing withthe number of training examples. That is, it would be the case if GP-MRAC were given an infinitetraining size budget. Indeed one might argue whether GP-MRAC, if operated with a finite budget,actually is a parametric approximation where the parameter consists of the hyperparameters alongwith the fixed-size training data matrix. When comparing the (maximum) prediction and learningruntimes one should also bear in mind that GP-MRAC predicted with up to 200 examples in thetraining data set. By contrast, LACKI-MRAC undiscerningly had incorporated all 10001 trainingpoints by the end of each trial.Across the remaining metrics, LACKI-MRAC markedly outperformed all other methods.Note, we have also attempted to test all methods across a greater range of problem settings,including larger initial states, more varied hyper-parameter settings, lower feedback gains and morevaried choices of dynamics coefficients W . However, this resulted in GP-MRAC to often run intoconditioning problems. This is a common issue in GP learning due to the necessity of matrix inversionor Cholesky decompositions of the covariance matrix (it seems to be common practice to address thisby hand-tuning the observational noise parameters). Similar behaviour ensued when setting thetraining size budget to large values. All these changes often resulted in long learning runtimes, spikycontrol outputs and thus, poor overall performance. Similarly, code execution of our RBF-MRACimplementation was frequently interrupted with error messages when the state was initialised topositions outside of the rectangle [0 , × [0 , x ( t ) = ( − , (cid:62) corresponding to a rapidly rotating aircraft. Furthermore, the wing rock coefficients W were multipliedby a factor of 5, amplifying the non-linearities of the drift field.When initialised with a length scale parameter of 0.3, the GP ran into conditioning problems andcaused the output of the adaptive element in GP-MRAC to produce spikes of very large magnitudeand thus, further destabilised the system. We tried the problem with various kernel length scalesettings ranging from 0 . L = 1 asbefore. Starting out with a relatively large tracking and prediction error, LACKI-MRAC nonethelessmanaged to recover and successfully track the system (see Fig. 10(d) - 10(f)). 
The state path and learned drift model obtained by LACKI-MRAC are depicted in Fig. 11.
In the previous section, we gave an illustration of LACKI-MRAC, a combination of a feedback-linearising controller with our KI learning method. The results are encouraging: our adaptive control law managed to learn a dynamic system online and to track a reference in the presence of wing-rock dynamics where other methods failed. To simulate the dynamics we relied on a first-order Euler approximation, resulting in a discrete-time dynamical system.
In this section, we study the convergence of LACKI-MRAC in such discrete-time systems. We will consider both the offline and the online learning setting. In the former, the LACKI learner receives a sample set once and builds a controller that remains unaltered during execution time. For this case, we will provide robustness bounds on the control success (as quantified by a bound on the norm of the error dynamics) as a function of the remaining uncertainty of the trained LACKI model.
In the online learning setting, which we considered in the wing-rock control simulations, the LACKI learner gets updated with the most recent observation after each time step. Provided the initial uncertainty is bounded on the given state space, we will be able to guarantee that LACKI-MRAC leads to a closed-loop system that achieves the tracking objective with increasing time and learning experience.
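The analysis below works with a discrete-time error recurrence of the form e_{n+1} = M e_n + Δ F(x_n) and bounds its norm by a sum involving powers of M (cf. Thm. 3.2 further down). As a minimal numerical sanity check of such a bound, with placeholder gains, step size and disturbance bound that are not taken from the experiments, one can roll the recurrence out alongside the bound:

using LinearAlgebra

# Checks numerically (for one illustrative setup) that
#   ||e_n|| <= |||M^n||| ||e_0|| + Δ N̄ Σ_{i<n} |||M^i|||
# holds along the recurrence e_{n+1} = M e_n + Δ F_n with ||F_n|| <= N̄.
function check_error_bound(; Δ = 0.01, K1 = 4.0, K2 = 4.0, N̄ = 0.5, steps = 1000)
    M  = [1.0 Δ; -Δ*K1 1.0 - Δ*K2]           # error-state transition matrix for m = 1
    e0 = [1.0, 0.0]
    e, Mpow, S = copy(e0), Matrix{Float64}(I, 2, 2), 0.0
    for n in 0:steps
        ϱ = opnorm(Mpow) * norm(e0) + Δ * N̄ * S   # bound at step n
        @assert norm(e) ≤ ϱ + 1e-12
        Fn   = [0.0, N̄ * sin(0.1n)]          # any disturbance sequence with ||F_n|| <= N̄
        e    = M * e + Δ * Fn                 # error recurrence
        S   += opnorm(Mpow)                   # running Σ_{i=0}^{n} |||M^i||| (spectral norm)
        Mpow = M * Mpow
    end
    return maximum(abs.(eigvals(M)))          # spectral radius ρ(M); < 1 for these gains
end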
In the offline learning setting, the predictor sequence (cid:0) ˆ f n (cid:1) n ∈ N of adaptive elements is based on onlyone fixed data set D at time 0 which is not updated subsequently. That is D n = D , ∀ n ∈ N .Given some data set D n = D at time n and the resulting predictor ˆ f n ( · ), the model error is givenby F n ( · ) := f ( · ) − ˆ f n ( · ). Since we assume constant data, the error does not change either. That is,we have F n ( x ) = . . . = F ( x ) = F ( x ) , ∀ n, ∀ x .In our analysis, we consider discrete-time dynamical systems. For example the dynamics might befirst-order Euler approximations of the control-affine dynamics of Sec. 3.1. Consequently, the errordynamics as per Eq. 27 translate to the recurrence relation: e n +1 = M e n + ∆ F ( x n ) (30)25 time (seconds) r o ll ang l e ( deg ) actualref model0 5 10 15 20 25 30 35 40 45 50−6−4−202 x 10 time (seconds) r o ll r a t e ( deg / s ) roll rate actualref model (a) Position (GP-MRAC). time (seconds) r o ll ang l e e rr o r ( deg ) time (seconds) r o ll r a t e e rr o r ( deg / s ) (b) Tracking error (GP-MRAC).
(c) Log-prediction error (GP-MRAC). (d) Position (LACKI-MRAC). (e) Tracking error (LACKI-MRAC). (f) Log-prediction error (LACKI-MRAC).
Figure 10: Example where GP-MRAC fails. By contrast, LACKI-MRAC manages to adapt and direct the system back to the desired trajectory.
150 −100 −50 0 50 −200−150−100−50 0 50 100 roll angle deg r o ll r a t e deg / s e c ond s (a) State path (LACKI-MRAC). −200 −100 0 100−200−1000100−20−15−10−505 x 10 angleangular rate −15−10−50x 10 (b) Learned drift model (LACKI-MRAC). Figure 11: Depicted are the state path and the drift model learned online by LACKI-MRAC.where ∆ ∈ R + is a positive time increment and n is the time step index. Furthermore, F ( x n ) = f ( x n ) − ˆ f n ( x n ) = B (cid:0) ν ad ( x n ) − ˜ a ( x n ) (cid:1) (31)is an uncertain increment due to the model error of the learner (cf. Eq. 27), B = (cid:18) O m ∆ I m (cid:19) and M = (cid:18) I m ∆ I m − ∆ K I m − ∆ K (cid:19) (32)is the (error state) transition matrix. Here, m = d is half the dimensionality of the state space, I m denotes the m × m identity matrix and K , K are gain matrices that can be freely chosen by thedesigner of the linear pseudo controller. By induction, it is easy to show that the recurrence can beconverted into the closed-form expression: e n = M e n − + ∆ F ( x n +1 )= · · · = M k e + ∆ n − (cid:88) i =0 M n − − i F ( x i ) . For vector norm (cid:107)·(cid:107) , let |||·||| denote a matrix norm such that (cid:107)
M v (cid:107) ≤ ||| M ||| (cid:107) v (cid:107) for all suitable matrices M and vectors v . For instance, for the Euclidean norm (cid:107)·(cid:107) = (cid:107)·(cid:107) , we can choose the spectral norm |||·||| = |||·||| as a matrix norm. Or, for (cid:107)·(cid:107) = (cid:107)·(cid:107) ∞ , if the vector space is d -dimensional, we can choosethe matrix norm |||·||| = √ d |||·||| . We desire to bound the norm of the error. To this end, we leveragethat the norms are sub-additive and sub-multiplicative to deduce: (cid:107) e n (cid:107) ≤ ||| M n ||| (cid:107) e (cid:107) + ∆ n − (cid:88) i =0 (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) M n − − i (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:107) F ( x i ) (cid:107) (33) ≤ ||| M n ||| (cid:107) e (cid:107) + ∆ ¯ N n n − (cid:88) i =0 (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) M n − − i (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) =: (cid:37) [ n ] (34)where ¯ N n is chosen such that we can guarantee that ¯ N n ≥ max i =1 ,...,n − (cid:107) F ( x i ) (cid:107) . For instance, wecould choose ¯ N n := sup s ∈S n (cid:107) F ( s ) (cid:107) where S n = (cid:83) t 1, and provided ¯ N n is bounded (see e.g. [7]).Whether or not M is stable, in low dimensions, the sums can be computed offline and in advance.This is of great benefit if the controller that is building on the error bounds is utilising optimisation-based control with a finite time-horizon.To obtain a (conservative) closed-form bound on the error norms [7] contains a derivation anddiscussion of the following result: Theorem 3.2. Let ( F n ) n ∈ N be a norm-bounded sequence of vectors with ¯ N n := max i ∈{ ,...,n − } (cid:107) F i ( x i ) (cid:107) ≤ ¯ N ∈ R . For error sequence ( e n ) n ∈ N defined by the linear recurrence e n +1 = M e n +∆ F n ( x n ) ( n ∈ N ) ,we can define the following bounds:1. (cid:107) e n (cid:107) ≤ ||| M n ||| (cid:107) e (cid:107) + ∆ ¯ N n (cid:80) n − i =0 (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) M i (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) . If ρ ( M ) < and ∃ ¯ N : ¯ N ≥ ¯ N n − ≥ , ∀ n , theright-hand side converges as n → ∞ .2. Let k ∈ N , k > such that ||| M n ||| < , ∀ n ≥ k , let ϕ := (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) M k (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) < and let δ n := (cid:98) n/k (cid:99) . If r := ρ ( M ) < , for n > k , we also have: (cid:107) e n (cid:107) ≤ c ϕ δ n (cid:107) e (cid:107) + ∆ ¯ N n (cid:16) n − (cid:88) j =0 (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) M j (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + c k ϕ − ϕ (cid:106) nk (cid:107) +1 − ϕ (cid:17) n →∞ −→ C ≤ ∆ ¯ N n − (cid:88) j =0 (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) M j (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + ∆ ¯ N c k ϕ − ϕ for some constants C, c ∈ R . Here, possible choices for c ∈ R are:(i) c = max { (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) M i (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) | i = 1 , . . . , k − } or (ii) c = d − (cid:16) − d log r (cid:17) d − ||| M ||| d − r − d log r − d +1 . Since ||| M ||| (cid:54) = 1 , one can also choose (iii) c := ||| M ||| n .3. If ||| M ||| (cid:54) = 1 , we have: (cid:107) e n (cid:107) ≤ ||| M ||| n (cid:107) e (cid:107) + ∆ ¯ N n −||| M ||| n −||| M ||| . In this subsection, we lift the assumption that the available sample is static. Instead we assume thatat time step n + 1 we get to see an additional sample of the uncertain drift at the state visited inthe previous time step n . 
That is, the predictor ˆ f n +1 ( · ) is based on D n +1 = D n ∪ { (cid:0) x n , ˜ f ( x n ) (cid:1) } , ∀ n .Therefore, we also need to index the innovations vector field by time. That is, F n := f − ˆ f n denotesthe prediction error function (or innovation) due to the data available at time n ∈ N . As pointed outin Rem. 3.1, the error innovations F n are assumed to be bounded for all finite sample sets D n .Above, we have seen that any continuous function can be approximated by some H¨older continuousLACKI predictor up to an arbitrarily small error. For convenience, we will establish the followingdefinition: 28 efinition 3.3. We say that a continuous function f is L ∗ − p -H¨older up to error ¯ E h ∈ R on domain X iff there is an L ∗ − p -H¨older function φ ∈ H ( L ∗ , p ) and a function ψ such that ∀ x : f ( x ) = φ ( x ) + ψ ( x ) , sup x ∈X d Y (cid:0) , ψ ( x ) (cid:1) ≤ ¯ E h . Theorem 3.4 (Tracking error convergence) . Assume that, for some q ≥ , we choose λ = 2¯ e + q inour LACKI prediction rule and that the sequence of innovations (cid:0) F n ( x n ) (cid:1) n ∈ N as well as the reference (cid:0) ξ n (cid:1) n ∈ N are bounded. If the initial error innovation function is bounded, i.e. if ∃ b ∈ R ∀ x : (cid:107) F ( x ) (cid:107) ∞ ≤ b , and, if M is a stable matrix, i.e. if ρ ( M ) < , then the tracking error converges to the interval [0 , q + 2¯ e + 2 ¯ E h ] . That is, (cid:107) e n (cid:107) ∞ n →∞ −→ [0 , σ (cid:0) q e + 2 ¯ E h (cid:1) ] where σ := ∆ (cid:80) ∞ i =0 (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) M i (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) < ∞ .Proof. Let (cid:107)·(cid:107) := (cid:107)·(cid:107) ∞ with accociated matrix norm |||·||| := √ d |||·||| .Let (cid:15) > 0. We desire to show: ∃ N ∈ N ∀ n ≥ N : (cid:107) e n (cid:107) ≤ (cid:15) + q e + 2 ¯ E h . (36)If sequence (cid:0) F n ( x n ) (cid:1) n ∈ N is bounded then, by Thm. 3.2, the error sequence (cid:0) e n (cid:1) n ∈ N is bounded.That is, ∃ b ∈ R ∀ n : (cid:107) e n (cid:107) ≤ β . Knowing that the error dynamics are bounded by some β ≥ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) M k (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:107) e n (cid:107) ≤ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) M k (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) β k →∞ −→ 0. Here, the convergence to zero follows from the assumption that M is a stable matrix. Hence, we have:( I ) ∀ n ∃ k ( n ) ∈ N ∀ k ≥ k ( n ) : (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) M k (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:107) e n (cid:107) ≤ (cid:15) . If in addition, the reference is bounded this implies that the sequence ( x n ) is bounded, too. Thm.2.17 implies convergence of the innovations and hence, assuming d Y ( f, f (cid:48) ) = (cid:107) f − f (cid:48) (cid:107) , we have: ∀ ε > ∃ n ∀ n ≥ n : (cid:107) F n ( x n ) (cid:107) ≤ ε + q e + 2 ¯ E h . (37)Referring to (i) in Thm. 3.2, With a change of variables we can follow analogous steps to convertIneq. (34) to state that for all k ∈ N , n ∈ N we have: (cid:107) e n + k (cid:107) ≤ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) M k (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:107) e n (cid:107) + ∆ Q n : n + k k − (cid:88) i =0 (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) M k − − i (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (38) Q n : n + k := max {(cid:107) F n ( x n ) (cid:107) , . . . , (cid:107) F k + n − ( x k + n − ) (cid:107)} . 
With Gelfand’s formula and the standard roottest for series it is easy to establish convergence of the series: That is, σ = lim k →∞ ∆ (cid:80) k − i =0 (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) M k − − i (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) < ∞ . And, we have ∆ (cid:80) k − i =0 (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) M k − − i (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ σ, ∀ k . Hence, (cid:107) e n + k (cid:107) ≤ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) M k (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:107) e n (cid:107) + σ Q k : n + k , ∀ n ∈ N , k ∈ N . (39)With (37) follows that there exists n ∈ N such that we have:( II ) ∀ k ∈ N : Q n : n + k ≤ (cid:15) σ + q e + 2 ¯ E h . Combining (I) and (II) with Eq. 39 allows us to conclude that for any n ≥ N := n + k ( n ) wehave (cid:107) e n (cid:107) ≤ (cid:15) σ (cid:16) (cid:15) σ + q e + 2 ¯ E h (cid:17) = (cid:15) + σ ( q e + 2 ¯ E h (cid:17) . Corollary 3.5. In the special case of error-free observations of a H¨older continuous target function,choosing a parameter λ = 0 implies that the tracking error vanishes, i.e. : (cid:107) e n (cid:107) ∞ n →∞ −→ . Furthermore, the control action sequence (cid:0) u ( x n ) (cid:1) n ∈ N converges, provided the reference trajectory (cid:0) ξ n (cid:1) n ∈ N is bounded.Proof. The convergence statement is an immediate consequence of the preceding theorem. Rememberfrom Sec. 3.1.1 that the control action at time n is of the form u n := u ( x n ) = − ˆ f n ( x n ) − Ke n + c forsome constant c . We show that ( u n ) is a Cauchy sequence, provided that the reference sequence ξ n is. Since X is a Hilbert space, the desired convergence result follows.So, let (cid:15) > 0. Since ( e n ) , ( ξ n ) converge, also the state sequence ( x n ) converges. Hence, allthree are convergent Cauchy sequences. In particular, there is N such that for all n, m > N : (cid:107) e n − e m (cid:107) < (cid:15) ||| K ||| and (cid:107) x n − x m (cid:107) < (cid:15) L . Hence, utilising the definition of the control law andthe fact that all predictors are H¨older continuous with H¨older constant ¯ L , for all m, n > N : (cid:107) u n − u m (cid:107) ≤ ||| K ||| (cid:107) e n − e m (cid:107) + (cid:13)(cid:13)(cid:13) ˆ f n ( x n ) − ˆ f m ( x m ) (cid:13)(cid:13)(cid:13) ≤ (cid:15) + ¯ L (cid:107) x n − x m (cid:107) ≤ (cid:15) . Therefore, ( u n ) is aCauchy sequence and hence, convergent. In this paper, we have introduced Lazily Adapted Constant Kinky Inference (LACKI) as an approachto nonparametric machine learning. Our method was built on the framework of Kinky Inference (which in turn is a generalisation of well-known approaches such as Lipschitz Interpolation [39, 44, 3]and Nonlinear Set Membership (NSM) methods [29]). Our approach inherits the numerical simplicityof these methods. On top of this, it can deal with bounded additive observational errors and doesnot require a priori knowledge about a H¨older constant of the underlying target function– instead itestimates the constant online from the data. This of course is of great practical interest since thisendows LACKI with superior black-box learning capabilities while still allowing us to give theoreticalguarantees on learning and control success.To avoid the need to specify the H¨older constant, LACKI adapts the parameter L ( n ) to reflect amodification of the empirical estimate of the H¨older constant of the underlying target function. 
Theadapted parameters were carefully defined to be bounded even if the target function is not H¨oldercontinuous and the data is subject to bounded observational uncertainty. This allowed us to establishseveral theoretical guarantees of worst-case consistency. That is, we provided asymptotic guaranteeson the ability to learn any H¨older (and non-H¨older) continuous target function as well as convergencerates. Our derivations focussed on worst-case prediction error bounds.Future work will investigate in how far the bounds can be tightened further (albeit we do not expectthat worst-case guarantees could be given that avoid the curse of dimensionality without imposingmore confining assumptions on the target functions and the nature of the observational uncertainties).However, if we were to shift our attention away from worst-case error analysis under general (possiblysystematic) observational uncertainties towards standard mean-square risk analysis and assumptionsprevalent in probability theory, we believe less conservative consistency might be establishable. To thisend, we believe it is possible to modify our proofs to translate mean-square consistency derivationsfor nearest-neighbor regression methods (e.g. as discussed in [23]) to our LACKI approach. Thismight provide a theoretical underpinning for the smoothing properties of our approach observed inthe presence of i.i.d. stochastic noise (refer to Fig. 1 and Fig. 2).30ur learning-theoretic considerations were supplemented by an application of LACKI to onlinelearning-based model-reference adaptive control. In a simulated aircraft control problem with nonlin-ear model uncertainty, we compared our LACKI-based controller against other learning-based meth-ods that were recently proposed in the control literature. Across a range of performance metricsand randomised problem instances, LACKI-MRAC exhibited consistent and robust performance andoutperformed its competitors on the majority of randomised test cases.For discrete-time systems with additive, bounded, nonlinear uncertainty, we provided theoreticalguarantees on the tracking success of our LACKI-MRAC controller. For the online learning casewhere the LACKI learner was assumed to be continuously updated with the most recently visitedstate observation, we proved tracking success up to a term dependent on the observational error.In future work, we would like to apply our LACKI learning method to more challenging controltasks that require planning. To this end, it might be beneficial to link our results to recent work onNSM-based model-predictive control (e.g. [8]). We believe that the worst-case analysis we focus onin this work is key to establish the necessary links to results existing in the robust MPC literature.In this work, we have assumed that the observational noise was bounded; we have addressed theissue of unbounded noise in a companion paper [6] where the H¨older constant parameter L ( n ) is foundas the minimiser of a prediction loss estimator. We also believe that these estimators could also beused to estimate the noise bound if this exists but is unknown a priori.A final suggestion for future work pertains to the question of speeding up prediction time. Inthe context of Lipschitz interpolation, Beliakov [3] proposed a way to organise the sample in a tree-structure (in lieu to KD-trees utilised in nearest-neighbor search) with the aim to reduce the predictiontime to be logarithmic in the sample size. 
While applicable to our LACKI approach in the batchlearning setting, it is not clear to us how this tree structure could be efficiently updated in an onlinelearning setting. Future work could explore avenues of connecting his idea of appealing to notions ofgeneralised nearest-neighbor search to existing efficient approaches of online nearest neighbor search. I am grateful for useful feedback from Jan Maciejowski, Carl Rasmussen, Stephen Roberts and DanielLimon. I would also like to thank Girish Chowdhary and Hassan Kingravi who generously supplied mewith their code that allowed me to most closely reproduce their work and use it for my comparisonsin Sec. 3.1. I also gratefully acknowledge funding via the AIS project, NMZR/031 RG64733. A Supplementaries A.1 H¨older and Lipschitz continuity for inference A.1.1 Introduction and related work H¨older continuous functions are uniformly continuous functions that may exhibit infinitely manypoints of non-differentiability and yet are sufficiently regular to facilitate inference. That is, they haveproperties that make it possible to make assertions of global properties on the basis of a finite functionsample.H¨older continuity is a generalisation of Lipschitz continuity. Lipschitz properties are widely usedin applied mathematics to establish error bounds and, among many other, find application in optimi-sation [37, 24] and quadrature [2, 14, 19] and are a key property to establish convergence properties ofapproximation rules in (stochastic) differential equations [26, 20]. Furthermore, most machine learn-ing methods for function interpolation seem to impose smoothness (and thereby, H¨older continuity)on the function. For instance, with our Lem. A.8 derived below, it would be possible to show thatany finite radial basis function neural network [5] with a smooth basis function is H¨older continuouson a compact domain. Or, a Gaussian process with a smooth covariance function also has a smooth31ean function and a.s. smooth sample paths [34, 22]. Therefore, posterior inference over functionson compact support made with such a Gaussian process on the basis of a finite sample is H¨oldercontinuous.Recently, we have become aware of related work published in mathematical and operations researchjournals [13, 12, 44, 3, 4]. For instance, Zabinsky et. al. [44] consider the problem of estimating aone-dimensional Lipschitz function (with respect to the canonical norm-induced metric). Similarto the analysis we employ to establish our guarantees, they use a pair of bounding functions andmake predictions by taking the average of these functions. While we have developed our kinkyinference rule independently, it can be seen as a generalisation of their approach. Our method providesextensions to H¨older continuous multi-output functions over general, multi-dimensional (pseudo-)metric spaces, can cope with with erroneous observations and inputs, can fold in additional knowledgeabout boundedness, learn parameters from data and provides different guarantees such as (uniform)convergence of the prediction uncertainty. As part of the analysis of our method, we constructdelimiting functions we refer to as ceiling and floor functions. The construction of similar functionsis a recurring theme that, in the standard Lipschitz context, can be found in global optimisation [37],quadrature [2], interpolation [3, 4], as well as in the analysis of linear splines for function estimation[12]. 
Cooper [13, 12] utilises such upper and lower bound functions in a multi-dimensional setting toderive probabilistic PAC-type error bounds [41] for a linear interpolation rule. He assumes the datais sampled uniformly at random on a hypercube domain. This precludes the application of his resultsto much of our control applications where the data normally is collected along continuous trajectoriesvisited by a controlled plant. Our inference rule is different from his and our guarantees do not relyon distributional assumptions. This of course is important in control settings where the commonassumption that the input data was drawn independently from a fixed distribution typically is notmet. In this thread of works, perhaps the work that is most closely related to ours is the functioninterpolation method of Beliakov [3] that is a special case of a kinky inference rule: For a single-outputfunction that is Lipschitz with respect to a special input space norm and where the data is error-free,the authour provides an algorithm that promises logarithmic prediction time. Unfortunately, many ofhis assumptions are unrealistic in a control setting. And, the improved prediction time is achieved byconstructing a data structure from batch data which precludes its use in an online learning setting.However, future work might explore in how far his ideas can be converted into an online learning rule.Furthermore, in learning situations where Beliakov’s interpolation method is applicable, our theoreticalresults extend to his work. For instance, our results show how Lipschitz constant estimation can beharnessed to render his approach a universal approximator.Other work of relevance can be found in analysis. For instance, Miculescu [28] presents workproving that any continuous function on a metric space is a uniform limit of a sequence of locally Lipschitz functions and also mentions that the stronger statement, that every function is a limit ofa sequence of globally Lipschitz functions, is not true in general. However, he cites earlier work [21]that does show that every real-valued continuous function on compact domain is a uniform limit of asequence of Lipschitz functions. In some sense, our work develops a related statement as a by-product.From our convergence guarantee of the LACKI rule, we have derived constructive method for showingthat any continuous function on compact domain is the uniform limit of a sequence of H¨older functionsup to an arbitrarily small error.Finally, in the context of control, Milanese and Novara [29] considered NSM methods for inter-polation. For a fixed Lipschitz constant, their prediction rule can be seen as a special case of ourswithout the B and ¯ B parameters and with special choices of metrics. Similar to us, they do considerthe problem of estimating the Lipschitz constant from the data and consider bounded noise. However,they obtain the Lipschitz constant estimate via the maximum partial derivative of an arbitrarily cho-sen fitted parametric model of a bounded input set. And, they give no guarantees on the quality ofthe resulting estimator that is fitted to the data like this nor do they discuss the impact of the choiceof parametric model or the fitting method on the quality of the estimator.32 .1.2 Basic facts and derivations In preparation of subsequent parts of the work that take advantage of H¨older properties this sectionwill proceed to establish essential prerequisites. 
The remainder of this section is structured as follows:Firstly, we will go over basic definitions and engage in some preliminary derivations that will be ofimportance throughout this work. While we do not claim novelty on any of the results we provideproofs for in this section, we had not found them in the literature and hence, had to derive them onour own.Firstly, we commence with introducing the notions of (pseudo-) metric spaces. Definition A.1 ((Pseudo-) metrics) . Let X be a set. A mapping d X : X → R is called a pseudo-metric if it positive ( ∀ x, x (cid:48) ∈ X : d X ( x, x (cid:48) ) ≥ 0) and satisfies the triangle inequality ( ∀ x, x (cid:48) , x (cid:48)(cid:48) ∈ X : d ( x, x (cid:48) ) ≤ d X ( x, x (cid:48)(cid:48) ) + d ( x (cid:48)(cid:48) , x (cid:48) )). If furthermore the pseudo-metric d is definite ( i.e. ∀ x, x (cid:48) ∈ X X : d X ( x, x (cid:48) ) = 0 ⇔ x = x (cid:48) ) then the mapping d is called a metric . The set X endowed with a (pseudo-)metric d X : X → R or the pair ( X , d X ) are called (pseudo-) metric space . Definition A.2. Let ( X , d X ) , ( Y , d Y ) be two (pseudo-) metric spaces and I ⊂ X be an open set. Afunction f : X → Y is called (L-p-) H¨older (continuous) on I ⊂ X if there exists a (H¨older) constant L ≥ (H¨older) exponent p ≥ ∀ x, x (cid:48) ∈ I : d Y (cid:0) f ( x ) , f ( x (cid:48) ) (cid:1) ≤ L (cid:0) d X ( x, x (cid:48) ) (cid:1) p . We denote the space of all L-p- H¨older functions by H ( L, p ).H¨older functions are known to be uniformly continuous. A special case of importance is the classof L - Lipschitz functions. These are H¨older continuous functions with exponent p = 1. In this context,coefficient L is referred to as Lipschitz constant or Lipschitz number . Example A.3 (Square root function) . As an example of a H¨older function that is not Lipschitz wecan consider x (cid:55)→ √ x on domain I = [0 + (cid:15), c ] where c > (cid:15) ≥ 0. For (cid:15) > L = sup x ∈ I √ x . We can see that the coefficient grows infinitely large as (cid:15) → 0. By contrast,the function is H¨older continuous with H¨older coefficient L = 1 and exponent p = for any bounded I ⊂ R . We can see this as follows: Let (cid:15) = 0 , x, y ∈ I and, without loss of generality, y ≥ x . Let ξ := √ x, γ := √ y and thus, γ ≥ ξ . We have: ξ ≤ γ ⇔ ξ ≤ ξγ ⇔ γ − ξγ + ξ ≤ γ − ξ ⇔ ( γ − ξ ) ≤ γ − ξ ⇔ | γ − ξ | ≤ (cid:12)(cid:12) γ − ξ (cid:12)(cid:12) ⇔ | γ − ξ | ≤ (cid:112) | γ − ξ | ⇔ (cid:12)(cid:12) √ x − √ y (cid:12)(cid:12) ≤ | y − x | . Since, x, y were chosen arbitrarily, we have shown H¨older continuity as desired.Most commonly, one considers H¨older continuity for the special case of the standard metric inducedby a norm, i.e. d ( x, x (cid:48) ) = (cid:107) x − x (cid:48) (cid:107) . For a function f : X → Y , the H¨older condition becomes: ∀ x, x (cid:48) ∈ I : (cid:107) f ( x ) − f ( x (cid:48) ) (cid:107) Y ≤ L (cid:107) x − x (cid:48) (cid:107) p X . Similarly, we can consider H¨older continuity for each output component: Definition A.4. Let Y ⊆ R m and X be a space endowed with a metric (or indeed a semi-metric) d X . 
Then, the function f : X → Y is output-component-wise H¨older continuous with exponent p and constant L ∈ R m ≥ if f ∈ H ( L, p ) where H d X ( L, p ) := (cid:8) φ : X → Y | ∀ j ∈ { , ..., m } , ∀ x, x (cid:48) ∈X : | φ j ( x ) − φ j ( x (cid:48) ) | ≤ L j d p X ( x, x (cid:48) ) (cid:9) is the set of all functions whose component functions are H¨oldercontinuous with respect to input space metric d X and an output space metric that is induced by thecanonical norm d Y ( x, x (cid:48) ) = | x − x (cid:48) | . Remark A.5 (Best H¨older constant) . Note for p ∈ (0 , , ≤ L ≤ L we have H d X ( L , p ) ⊆ H ( L , p ).The smallest L ∗ ≥ L ∗ − p − H¨older, f ∈ H d X ( L ∗ , p ), is called the best H¨olderconstant of f .Generally, it is obviously true that H ( L, p ) ⊆ H ( L (cid:48) , p ) for L (cid:48) ≥ L . With regard to the H¨olderexponent, we will now show that smaller exponents are less restrictive than larger ones.33 emma A.6. Let X be the input space (not necessarily bounded). For some p ∈ (0 , , L ≥ assumethat and f : X → Y is locally L − p -H¨older continuous. Then we have: (i) for any q ∈ (0 , p ] , f is alsolocally L − q -H¨older. (ii) If f : X → Y is bounded with sup x,x (cid:48) ∈X d Y ( f ( x ) , f ( x (cid:48) )) ≤ B ∈ R and globally L − p H¨older then f is globally L ∗ − q -H¨older, where L ∗ := max { L, B } and q ∈ (0 , p ] . In particular,on compact support, Lipschitz continuity entails H¨older continuity for any H¨older exponent p ∈ [0 , .Proof. (i) Let p ∈ (0 , , f ∈ H ( L, p ) and p = q + r, r ∈ [0 , ξ ∈ X and I denote the in-tersection of the domain with an (cid:15) -ball around ξ such that f satisfies the H¨older condition on I and such that sup x,x (cid:48) ∈ I d X ( x, x (cid:48) ) ≤ 1. For all x, x (cid:48) ∈ I we have d Y ( f ( x ) , f ( x (cid:48) )) ≤ L d p X ( x, x (cid:48) ) = L d q X ( x, x (cid:48) ) d r X ( x, x (cid:48) ) ≤ L d q X ( x, x (cid:48) ) where the last inequality holds since r ∈ [0 , 1) and sup x,x (cid:48) ∈ I d X ( x, x (cid:48) ) ≤ 1. (ii) Let x, x (cid:48) ∈ X . If d X ( x, x (cid:48) ) ≤ d Y ( f ( x ) , f ( x (cid:48) )) ≤ L d q X ( x, x (cid:48) ) following throughthe same sequence of inequalities as above in the proof of (i). Now, let d ( x (cid:48) , x ) > 1. We have d Y ( f ( x ) , f ( x (cid:48) )) ≤ B ≤ B d X ( x, x (cid:48) ) q . Theorem A.7. Let ( X , d ) be a metric space and f, g : X → X be two H¨older continuous mappingswith H¨older constants L ( f ) , L ( g ) and H¨older exponents p f , p g , respectively. Then, the concatenation h = f ◦ g : X → X is also H¨older continuous with H¨older constant L ( h ) := L ( f ) L ( g ) p f and exponent p h := p g p f . That is, ∀ x, x (cid:48) ∈ X : d (cid:0) h ( x ) , h ( x (cid:48) ) (cid:1) ≤ L ( h ) (cid:0) d ( x, x (cid:48) ) (cid:1) p h . Proof. d (cid:0) f ◦ g ( x ) , f ◦ g ( x (cid:48) ) (cid:1) ≤ L ( f ) (cid:16) d( g ( x ) , g ( x (cid:48) )) (cid:17) p f ≤ L ( f ) (cid:16) L ( g ) d( x, x (cid:48) ) p g (cid:17) p f = L ( f ) L ( g ) p f (cid:16) d( x, x (cid:48) ) (cid:17) p g p f where in the first step we were using H¨older-continuity of f and inthe second, we were using H¨older continuity of g combined with the fact that ( · ) p f is a monotonicallyincreasing function.Several numerical methods, such as Lipschitz optimisation [37], rely on the knowledge of a Lipschitzconstant. In the more general case of H¨older continuous functions this will correspond to the needof knowing a H¨older constant and exponent. 
Several numerical methods, such as Lipschitz optimisation [37], rely on knowledge of a Lipschitz constant. In the more general case of Hölder continuous functions, this corresponds to the need of knowing a Hölder constant and exponent. To avoid having to derive these for each new function from first principles, we establish the following collection of facts that allows us to determine bounds on Hölder constants of combinations of functions with known Hölder parameters. While we provide proofs for a restatement in the Hölder continuous setting, a number of the following statements have also been proven in [42] in the context of Lipschitz algebras. A worked example and a small numerical sanity check follow the proof below.

Lemma A.8 (Hölder arithmetic). Let I, J ⊂ X where X is a metric space endowed with metric d. Let f : X → R be Hölder on I with constant L_I(f) ∈ R_+ and g : X → R be Hölder on J with constant L_J(g) ∈ R_+. Assume both functions have the same Hölder exponent p ∈ (0, 1]. That is, ∀ x, x′ ∈ X : |f(x) − f(x′)| ≤ L(f) d(x, x′)^p and ∀ x, x′ ∈ X : |g(x) − g(x′)| ≤ L(g) d(x, x′)^p. We have:

1. The mapping x ↦ |f(x)| is Hölder on I with constant L_I(f) and exponent p.
2. If g is Hölder on all of J = f(I), the composition g ∘ f : t ↦ g(f(t)) is Hölder on I with constant L_I(g ∘ f) ≤ L_J(g) L_I^p(f) and exponent p².
3. Let r ∈ R. Then r f : x ↦ r f(x) is Hölder on I with constant L_I(r f) = |r| L_I(f).
4. f + g : x ↦ f(x) + g(x) has Hölder constant at most L_I(f) + L_J(g).
5. Let m_f = sup_{x ∈ X} |f(x)| and m_g = sup_{x ∈ X} |g(x)|. The product function f · g : x ↦ f(x) g(x) has Hölder exponent p and a Hölder constant on I which is at most m_f L_J(g) + m_g L_I(f).
6. For some countable index set I_t, let the functions f_i be Hölder with exponent p and constant L(f_i) each. Furthermore, let H(x) := sup_{i ∈ I_t} f_i(x) and h(x) := inf_{i ∈ I_t} f_i(x) be finite for all x. Then H, h are also Hölder with exponent p and have a Hölder constant which is at most sup_{i ∈ I_t} L(f_i).
7. Let b := inf_{x ∈ X} |f(x)| > 0 and let φ(x) = 1/f(x), ∀ x ∈ X, be well-defined. Then L_I(φ) ≤ b^{−2} L_I(f).
8. Let p = 1 (that is, we consider the Lipschitz case), let I be convex and let d(x, x′) = ‖x − x′‖ where ‖·‖ is a norm that induces a sub-multiplicative matrix norm (e.g. all p-norms are valid). If f is continuously differentiable on I then L_I(f) ≤ sup_{x ∈ I} ‖∇f(x)‖. For one-dimensional input space, X = R, L_I(f) = sup_{x ∈ I} |∇f(x)| is the smallest Lipschitz number.
9. Let c ∈ R and f(x) = c, ∀ x ∈ I. Then f is Hölder continuous with constant L_I(f) = 0 and any exponent p_f ∈ R.
10. L_I(f²) ≤ 2 L_I(f) sup_{t ∈ I} |f(t)|.
11. With conditions as in 8) and input space dimension one, we have ∀ q ∈ Q : L_I(f^q) = |q| sup_{τ ∈ I} |f^{q−1}(τ) ḟ(τ)|.

Proof. 1. We show that |f| has the same constant and exponent as f. Let x, x′ ∈ X be arbitrary. We enter a case differentiation:
Case f(x), f(x′) ≥ 0: ||f(x)| − |f(x′)|| = |f(x) − f(x′)| ≤ L_I(f) d(x, x′)^p, using that f is Hölder.
Case f(x) ≥ 0, f(x′) ≤ 0: Note that |y| = −y iff y ≤ 0. Hence, ||f(x)| − |f(x′)|| ≤ |f(x)| + |f(x′)| = |f(x)| − f(x′) = |f(x) − f(x′)| ≤ L_I(f) d(x, x′)^p.
Case f(x) ≤ 0, f(x′) ≥ 0:
Completely analogous to the second case.
Case f(x), f(x′) ≤ 0: ||f(x)| − |f(x′)|| = |f(x′) − f(x)| = |f(x) − f(x′)| ≤ L_I(f) d(x, x′)^p, using that f is Hölder.
The remaining points are also proven in [42] in the context of Lipschitz functions.
2. Special case of Thm. A.7.
3. For arbitrary x, x′ ∈ X and r ∈ R we have: |r f(x) − r f(x′)| = |r| |f(x) − f(x′)| ≤ |r| L_I(f) d(x, x′)^p.
4. For arbitrary x, x′ ∈ X we have: |g(x) + f(x) − (g(x′) + f(x′))| = |g(x) − g(x′) + f(x) − f(x′)| ≤ |g(x) − g(x′)| + |f(x) − f(x′)| ≤ (L_J(g) + L_I(f)) d(x, x′)^p.
5. Let x, x′ ∈ X and δ := f(x) − f(x′). Then |g(x) f(x) − g(x′) f(x′)| = |g(x) (f(x′) + δ) − g(x′) f(x′)| = |(g(x) − g(x′)) f(x′) + g(x) δ| ≤ |g(x) − g(x′)| |f(x′)| + |g(x)| |δ| ≤ L_J(g) d(x, x′)^p |f(x′)| + |g(x)| L_I(f) d(x, x′)^p ≤ L_J(g) d(x, x′)^p sup_{x′ ∈ X} |f(x′)| + sup_{x ∈ X} |g(x)| L_I(f) d(x, x′)^p = (L_J(g) sup_{x′ ∈ X} |f(x′)| + sup_{x ∈ X} |g(x)| L_I(f)) d(x, x′)^p.
6. The proof of Proposition 1.5.5 in [42] proves our statement if one replaces their metric ρ by d^p.
7. Let x, x′ ∈ X. Then |1/f(x) − 1/f(x′)| = |f(x′)/(f(x′) f(x)) − f(x)/(f(x′) f(x))| = |f(x′) − f(x)| / (|f(x′)| |f(x)|) ≤ L_I(f) d(x, x′)^p / (inf_{x ∈ X} |f(x)|)².
8. Let p = 1 and let d(x, x′) = ‖x − x′‖ for a norm that induces a sub-multiplicative matrix norm. Define ℓ := sup_{x ∈ I} ‖∇f(x)‖. Firstly, we show that ℓ is a Lipschitz constant: Let x, x′ ∈ I and let xx′ := { t x + (1 − t) x′ | t ∈ [0, 1] }. Owing to convexity of I, xx′ ⊂ I. Due to the mean value theorem, ∃ ξ ∈ xx′ ⊂ I : |f(x) − f(x′)| = |T_ξ(x − x′)|, where T_ξ(v) = ⟨∇f(ξ), v⟩ is a linear operator. Assuming the derivative of f is bounded, T_ξ is a bounded operator and we have |T_ξ(x − x′)| ≤ |||T_ξ||| ‖x − x′‖ where |||T_ξ||| = sup_{‖v‖ = 1} |⟨∇f(ξ), v⟩| ≤ ‖∇f(ξ)‖. In conjunction, |f(x) − f(x′)| ≤ ‖∇f(ξ)‖ ‖x − x′‖ ≤ ℓ ‖x − x′‖.
Secondly, we show that ℓ is the smallest Lipschitz constant in the one-dimensional case: Let ℓ̄ be another Lipschitz constant on I such that ℓ̄ ≤ ℓ. Of course, here ‖·‖ = |·|. Since I is compact and ‖∇f(·)‖ is continuous, there is some ξ ∈ I such that ‖∇f(ξ)‖ = sup_{x ∈ I} ‖∇f(x)‖ = ℓ. Pick any sequence (y_k)_{k=1}^∞ contained in I such that y_k → ξ as k → ∞ and y_k ≠ ξ. For all k, y_k ∈ I and ℓ̄ is a Lipschitz constant on I. Hence, ℓ̄ ≥ |f(y_k) − f(ξ)| / |y_k − ξ| → |∇f(ξ)| = ℓ as k → ∞. Thus, ℓ̄ = ℓ.
9. Trivial.
10. Special case of property 5).
11. L_I(f^q) = sup_{τ ∈ I} |d/dτ f^q(τ)| = |q| sup_{τ ∈ I} |f^{q−1}(τ) ḟ(τ)|, by property 8).

As a simple illustration, assume we desire to establish that f(t) = max{exp(−t), exp(−sin(t))} is Lipschitz and to find a Lipschitz number on R_{≥0}. Application of 8. shows that t ↦ sin(t) and t ↦ exp(−t) have a Lipschitz number of 1. Application of 2., 9., 1. and 6. then shows that L(f) = 3 is a Lipschitz number of f.
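The following Python snippet is the numerical sanity check announced above (a minimal sketch under the assumptions that the domain is restricted to t ≥ 0 and that, via rule 8, exp(−·) on the range [−1, 1] of sin is bounded by the rough Lipschitz number e ≤ 3; these are illustrative choices, not a verbatim reproduction of the derivation in the text). It confirms empirically that the derived constant upper-bounds the observed Lipschitz quotients.

import numpy as np

# Lipschitz numbers of the building blocks (rule 8: supremum of |derivative|)
L_exp_neg = 1.0          # t -> exp(-t) on t >= 0
L_sin = 1.0              # t -> sin(t) on R
L_exp_on_range = np.e    # x -> exp(-x) on [-1, 1], the range of sin; e <= 3

# rule 2 (composition, p = 1): L(exp(-sin)) <= L_exp_on_range * L_sin
L_comp = L_exp_on_range * L_sin
# rule 6 (pointwise max): L(f) <= maximum of the constituent constants
L_f = max(L_exp_neg, L_comp)   # <= 3, matching the Lipschitz number quoted in the text

f = lambda t: np.maximum(np.exp(-t), np.exp(-np.sin(t)))

rng = np.random.default_rng(2)
t1 = rng.uniform(0.0, 20.0, size=5000)   # restrict to t >= 0 (assumption)
t2 = rng.uniform(0.0, 20.0, size=5000)
mask = t1 != t2
t1, t2 = t1[mask], t2[mask]

quotients = np.abs(f(t1) - f(t2)) / np.abs(t1 - t2)
print("largest observed Lipschitz quotient:", quotients.max(), " derived bound:", L_f)
assert quotients.max() <= L_f + 1e-9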
A.2 Hölder continuity of the exponentiated map

The main objective of this subsection is to show that, for each s ∈ X, the function φ_s : x ↦ L d(x, s)^p is Hölder continuous with coefficient L and exponent p with respect to the (pseudo-) metric d : X × X → R. This is important in our derivations since these maps are essential building blocks of the kinky inference procedure. It is then easy to employ Lem. A.8 to show that the full kinky inference rule is Hölder continuous.

To establish the Hölder regularity result, we will first show that (x, y) ↦ d(x, y)^p, for p ∈ (0, 1], is itself a (pseudo-) metric, and then use this fact to conclude that φ_s ∈ H_d(L, p).

Before proceeding we need to establish a few facts. Firstly, we remind ourselves of the following well-known fact:

Lemma A.9. A nonnegative, concave function g : R_{≥0} → R with g(0) = 0 is subadditive. That is, ∀ x, y ∈ R_{≥0} : g(x + y) ≤ g(x) + g(y).

Proof. If x = y = 0 then subadditivity trivially holds: 0 = g(x + y) ≤ g(x) + g(y) = 0. So, let x, y ∈ R_{≥0} such that x > 0 ∨ y > 0. We have g(x + y) = (x/(x + y)) g(x + y) + (y/(x + y)) g(x + y) ≤ g((x/(x + y)) (x + y)) + g((y/(x + y)) (x + y)) = g(x) + g(y). The inequality holds taking into account that x/(x + y), y/(x + y) ∈ [0, 1] and that g is concave: we have ∀ p ∈ [0, 1], x ∈ R_{≥0} : g(p x) = g(p x + (1 − p) 0) ≥ p g(x) + (1 − p) g(0) ≥ p g(x). The last inequality follows from g(0) ≥ 0.

Lemma A.10. Let h : R_{≥0} → R_{≥0}, x ↦ x^p, for p ∈ (0, 1]. h is positive definite and subadditive. That is, (i) h(0) = 0 and h(x) > 0, ∀ x ≠ 0, and (ii) ∀ x, y ∈ R_{≥0} : h(x + y) ≤ h(x) + h(y). Moreover, h is concave. If p ∈ (0, 1), h is strictly concave.

Proof. Pos. def.: h(0) = 0. Also lim_{x → 0+} h(x) = 0. Since ∇h(x) = p x^{p−1} > 0 for x > 0, h is strictly monotonically increasing on R_+. Hence, h(x) > 0, ∀ x > 0. Concavity: If p = 1, h is linear and hence concave. If p ∈ (0, 1), ∇²h(x) = p (p − 1) x^{p−2} < 0 for all x > 0, so h is strictly concave. Subadditivity: Follows directly with Lem. A.9 on the basis of the established positive definiteness and concavity.
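As a quick numerical check of Lem. A.10, and a preview of the next lemma, the following Python sketch (arbitrary sample sizes and the Euclidean distance on R² as the base metric; these are illustrative assumptions) verifies the subadditivity (x + y)^p ≤ x^p + y^p and the induced triangle inequality d(x, z)^p ≤ d(x, y)^p + d(y, z)^p on random samples.

import numpy as np

rng = np.random.default_rng(3)
p = 0.5

# subadditivity of h(x) = x**p on the nonnegative reals (Lem. A.10)
x = rng.uniform(0.0, 10.0, size=10000)
y = rng.uniform(0.0, 10.0, size=10000)
assert np.all((x + y) ** p <= x ** p + y ** p + 1e-12)

# consequence: the p-th power of a metric still satisfies the triangle inequality
X, Y, Z = (rng.normal(size=(10000, 2)) for _ in range(3))
d = lambda a, b: np.linalg.norm(a - b, axis=1)   # Euclidean base metric on R^2
assert np.all(d(X, Z) ** p <= d(X, Y) ** p + d(Y, Z) ** p + 1e-12)
print("subadditivity and triangle inequality of d**p hold on all sampled points")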
Lemma A.11. Let p ∈ (0, 1]. With definitions as above, we assume that the set X is endowed with a pseudo-metric d. The function d̃ : X × X → R_{≥0}, (x, y) ↦ (d(x, y))^p is a pseudo-metric on X. If d is a metric then so is d̃.

Proof. By Lem. A.10, x ↦ x^p is pos. def. and subadditive. Therefore, positive definiteness and the triangle inequality of d readily extend to d̃ as follows:
Pos. def.: If d(x, x′) = 0 then d̃(x, x′) = d(x, x′)^p = 0^p = 0; in particular, d̃(x, x) = 0. If k := d(x, x′) ≠ 0 then d̃(x, x′) = k^p ≠ 0. Hence, if d is definite then so is d̃.
Triangle inequality: Choose arbitrary x, y, z ∈ X. We have d̃(x, z) = d(x, z)^p ≤ (d(x, y) + d(y, z))^p ≤ d(x, y)^p + d(y, z)^p = d̃(x, y) + d̃(y, z). Here, the first inequality followed from the triangle inequality of the pseudo-metric d (together with monotonicity of x ↦ x^p) and the second from the subadditivity established in Lem. A.10.
Symmetry: If d is a metric it is symmetric. Hence, d̃(x, y) = d(x, y)^p = d(y, x)^p = d̃(y, x), ∀ x, y ∈ X, in which case d̃ also is symmetric.

Before proceeding we establish a slight generalisation of the well-known reverse triangle inequality:

Lemma A.12 (Reverse Triangle Inequality). Let X be a set and d : X × X → R a symmetric function that satisfies the triangle inequality. That is, ∀ x, y, z ∈ X : d(x, y) = d(y, x) ∧ d(x, z) ≤ d(x, y) + d(y, z). Then ∀ x, y, z ∈ X : |d(x, y) − d(z, y)| ≤ d(x, z).

Proof. For contradiction, assume |d(x, y) − d(z, y)| > d(x, z) for some x, y, z ∈ X. This implies (i) d(x, y) − d(z, y) > d(x, z) or (ii) d(z, y) − d(x, y) > d(x, z). Both inequalities contradict the triangle inequality: (i) ⇔ d(x, y) > d(x, z) + d(z, y) and (ii) ⇔ d(z, y) > d(x, z) + d(x, y) = d(z, x) + d(x, y).

Lemma A.13. Let d be a (pseudo-) metric on X. For arbitrary s ∈ X we define φ_s : X → R as φ_s : x ↦ d(x, s). φ_s is Lipschitz with respect to the metric d. That is, ∀ x, y ∈ X : |φ_s(x) − φ_s(y)| ≤ d(x, y).

Proof. |φ_s(x) − φ_s(y)| = |d(x, s) − d(y, s)| ≤ d(x, y), ∀ x, y, s ∈ X, by Lem. A.12.

Finally, combining this lemma with Lem. A.11 immediately establishes that mappings of the form d(·, s)^p are Hölder continuous with exponent p ∈ (0, 1]:

Theorem A.14. Let d be a (pseudo-) metric on the set X and let p ∈ (0, 1], L ≥ 0. For arbitrary s ∈ X we define φ_s : X → R, x ↦ L (d(x, s))^p. φ_s is Hölder with constant L and exponent p. That is, ∀ x, y ∈ X : |φ_s(x) − φ_s(y)| ≤ L d(x, y)^p. In particular, for any norm ‖·‖ on G and s ∈ X, the mapping x ↦ L ‖x − s‖^p is Hölder with constant L and exponent p.

Proof. By Lem. A.11, d^p(·, ·) is a (pseudo-) metric on X. Hence, Lem. A.12 is applicable, yielding: |φ_s(x) − φ_s(y)| = L |d(x, s)^p − d(y, s)^p| ≤ L d(x, y)^p, ∀ x, y, s ∈ X. The last sentence concerning norms follows from the fact that the mapping (x, y) ↦ ‖x − y‖ defines a metric.
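To illustrate why Thm. A.14 matters for the learning rule, the following Python sketch builds φ_s(x) = L d(x, s)^p from a base metric, checks the Hölder bound of Thm. A.14 empirically, and then uses such maps as upper and lower envelopes in a generic Lipschitz-interpolation-style predictor. This is only a minimal sketch: the predictor shown is the standard ceiling/floor form with a fixed, assumed constant L; the adaptive variant studied in this paper additionally estimates the Hölder constant from data, which is not reproduced here.

import numpy as np

L, p = 2.0, 0.5
d = lambda a, b: np.abs(a - b)          # base (pseudo-)metric on R

def phi(s, x):
    """Building block of Thm. A.14: x -> L * d(x, s)**p."""
    return L * d(x, s) ** p

rng = np.random.default_rng(4)
s = 0.3
x = rng.uniform(-5, 5, size=5000)
y = rng.uniform(-5, 5, size=5000)
mask = d(x, y) > 0
x, y = x[mask], y[mask]
# empirical check of |phi_s(x) - phi_s(y)| <= L * d(x, y)**p
assert np.all(np.abs(phi(s, x) - phi(s, y)) <= L * d(x, y) ** p + 1e-9)

# Generic kinky-inference / Lipschitz-interpolation predictor built from such maps:
# ceiling and floor envelopes around the data, prediction in the middle.
def predict(xq, X_train, y_train):
    ceil = np.min(y_train + L * d(xq, X_train) ** p)    # upper envelope at xq
    floor = np.max(y_train - L * d(xq, X_train) ** p)   # lower envelope at xq
    return 0.5 * (ceil + floor)

# toy usage on a target that is Hölder(L=1, p=1/2): the square root
X_train = np.linspace(0.0, 4.0, 30)
y_train = np.sqrt(X_train)
print(predict(2.5, X_train, y_train), np.sqrt(2.5))   # prediction vs. ground truth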
References

[1] K. J. Åström and B. Wittenmark. Adaptive Control. Addison-Wesley, 2nd edition, 2013.
[2] B. Baran, E. D. Demaine, and D. A. Katz. Optimally adaptive integration of univariate Lipschitz functions. Algorithmica, 2008.
[3] G. Beliakov. Interpolation of Lipschitz functions. Journal of Computational and Applied Mathematics, 2006.
[4] G. Beliakov. Smoothing Lipschitz functions. Optimization Methods and Software, 2007.
[5] D. S. Broomhead and D. Lowe. Multivariable functional interpolation and adaptive networks. Complex Systems, 1988. First paper on RBF networks.
[6] J. Calliess. Lipschitz optimisation for Lipschitz interpolation. Working paper, 2016.
[7] J.-P. Calliess. Conservative Decision-Making and Inference in Uncertain Dynamical Systems. PhD thesis, University of Oxford, 2014.
[8] M. Canale, L. Fagiano, and M. C. Signorile. Nonlinear model predictive control from data: a set membership approach. International Journal of Robust and Nonlinear Control, 2014.
[9] G. Cho, G. Chowdhary, H. Kingravi, J. P. How, and A. Vela. A Bayesian nonparametric approach to adaptive control using Gaussian processes. In CDC, 2013.
[10] G. Chowdhary, H. A. Kingravi, J. P. How, and P. A. Vela. Bayesian nonparametric adaptive control using Gaussian processes. Technical report, MIT, 2013.
[11] G. Chowdhary, H. A. Kingravi, J. How, and P. A. Vela. Nonparametric adaptive control of time-varying systems using Gaussian processes. In American Control Conference (ACC), 2013.
[12] D. A. Cooper. Learning Lipschitz functions. International Journal of Computer Mathematics, 59:15-26, 1995.
[13] D. A. Cooper. Learning C and Hölder functions. International Journal of Pure and Applied Mathematics, 2006.
[14] F. Curbera. Optimal integration of Lipschitz functions with a Gaussian weight. Journal of Complexity, 1998.
[15] M. P. Deisenroth, D. Fox, and C. E. Rasmussen. Gaussian processes for data-efficient learning in robotics and control. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015.
[16] M. P. Deisenroth, G. Neumann, and J. Peters. A survey on policy search for robotics. Foundations and Trends in Robotics, 2013.
[17] M. P. Deisenroth, C. E. Rasmussen, and J. Peters. Gaussian process dynamic programming. Neurocomputing, 2009.
[18] M. P. Deisenroth and C. E. Rasmussen. PILCO: A model-based and data-efficient approach to policy search. In ICML, 2011.
[19] S. Dereich, T. Mueller-Gronbach, and K. Ritter. Infinite-dimensional quadrature and quantization. arXiv:math/060124v1, 2006.
[20] C. Gardiner. Stochastic Methods: A Handbook for the Natural and Social Sciences. Springer, 4th edition, 2009.
[21] G. Georganopoulos. Sur l'approximation des fonctions continues par des fonctions lipschitziennes. C. R. Acad. Sci. Paris, 1967.
[22] G. Grimmett and D. Stirzaker. Probability and Random Processes. Oxford University Press, 3rd edition, 2001.
[23] L. Gyoerfi, M. Kohler, A. Krzyzak, and H. Walk. A Distribution-Free Theory of Nonparametric Regression. Springer, 2002.
[24] D. R. Jones, C. D. Perttunen, and B. E. Stuckman. Lipschitzian optimization without the Lipschitz constant. JOTA, 79(1), 1991.
[25] Y. H. Kim and F. Lewis. High-Level Feedback Control with Neural Networks. Robotics and Intelligent Systems, 1998.
[26] P. E. Kloeden and E. Platen. Numerical Solution of Stochastic Differential Equations. Springer, 1992.
[27] A. McHutchon. Nonlinear Modelling and Control using Gaussian Processes. PhD thesis, Dept. of Engineering, Cambridge University, 2014.
[28] R. Miculescu. Approximation of continuous functions by Lipschitz functions. Real Analysis Exchange, 2000.
[29] M. Milanese and C. Novara. Set membership identification of nonlinear systems. Automatica, 2004.
[30] T. Mitchell. Machine Learning. McGraw-Hill, 1997.
[31] M. M. Monahemi and M. Krstic. Control of wing rock motion using adaptive feedback linearization. Journal of Guidance, Control, and Dynamics, 1996.
[32] D. Nguyen-Tuong and J. Peters. Model learning for robot control: a survey. Cognitive Processing, 2011.
[33] J. Park and I. W. Sandberg. Universal approximation using radial-basis-function networks. Neural Computation, 1991.
[34] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
[35] A. A. Saad. Simulation and Analysis of Wing Rock Physics for a Generic Fighter Model with Three Degrees of Freedom. PhD thesis, Air Force Institute of Technology, Air University, 2000.
[36] R. Sanner and J.-J. Slotine. Gaussian networks for direct adaptive control. IEEE Transactions on Neural Networks, 1992.
[37] B. Shubert. A sequential method seeking the global maximum of a function. SIAM Journal on Numerical Analysis, 9, 1972.
[38] R. G. Strongin. On the convergence of an algorithm for finding a global extremum. Engineering Cybernetics, 1973.
[39] A. G. Sukharev. Optimal method of constructing best uniform approximation for functions of a certain class. Computational Mathematics and Mathematical Physics, 1978.
[40] A. B. Tsybakov. Introduction to Nonparametric Estimation. Springer, 2009.
[41] L. Valiant. A theory of the learnable. Communications of the ACM, 1984.
[42] N. Weaver. Lipschitz Algebras. World Scientific, 1999.
[43] G. R. Wood and B. P. Zhang. Estimation of the Lipschitz constant of a function. Journal of Global Optimization, 1996.
[44] Z. B. Zabinsky, R. L. Smith, and B. P. Kristinsdottir. Optimal estimation of univariate black-box Lipschitz functions with upper and lower bounds.