Detecting Rewards Deterioration in Episodic Reinforcement Learning
DRIFT DETECTION IN EPISODIC DATA: DETECT WHEN YOUR AGENT STARTS FALTERING
Ido Greenberg
Faculty of Electrical Engineering, Technion, Israel Institute of Technology. [email protected]
Shie Mannor
Faculty of Electrical Engineering, Technion, Israel Institute of Technology; Nvidia Research. [email protected]

ABSTRACT
Detection of deterioration of agent performance in dynamic environments is challenging due to the non-i.i.d. nature of the observed performance. We consider an episodic framework, where the objective is to detect when an agent begins to falter. We devise a hypothesis testing procedure for non-i.i.d. rewards, which is optimal under certain conditions. To apply the procedure sequentially in an online manner, we also suggest a novel Bootstrap mechanism for False Alarm Rate control (BFAR). We demonstrate our procedure in problems where the rewards are not independent, nor identically-distributed, nor normally-distributed. The statistical power of the new testing procedure is shown to outperform alternative tests – often by orders of magnitude – for a variety of environment modifications (which cause deterioration in agent performance). Our detection method is entirely external to the agent, and in particular does not require model-based learning. Furthermore, it can be applied to detect changes or drifts in any episodic signal.
1 INTRODUCTION
Reinforcement learning (RL) algorithms have recently demonstrated impressive success in a variety of sequential decision-making problems (Badia et al., 2020; Hessel et al., 2018). While most RL works focus on the maximization of rewards under various conditions, a key issue in real-world RL tasks is the safety and reliability of the system (Dulac-Arnold et al., 2019; Chan et al., 2020), arising in both offline and online settings.

In offline settings, comparing the agent performance in different environments is important for generalization (e.g., in sim-to-real and transfer learning). The comparison may indicate the difficulty of the problem or help to select the right learning algorithms. Uncertainty estimation, which could help to address this challenge, is currently considered a hard problem in RL, in particular for model-free methods (Yu et al., 2020).

In online settings, where a fixed, already-trained agent runs continuously, its performance may be affected (gradually or abruptly) by changes in the controlled system or its surroundings, or when reaching new states beyond the ones explored during the training. Some works address the robustness of the agent to such changes (Lecarpentier & Rachelson, 2019; Lee et al., 2020). However, noticing the changes may be equally important, as it allows us to fall back into manual control, send the agent to re-train, guide diagnosis, or even bring the agent to a halt. This is particularly critical in real-world problems such as health care and autonomous driving (Zhao et al., 2019), where agents are required to be fixed and stable (Matsushima et al., 2020), and any performance degradation should be detected as soon as possible.

Many sequential statistical tests exist for detection of mean degradation in a random process.
However, common methods (Page, 1954; Lan, 1994; Harel et al., 2014) assume independent and identically distributed (i.i.d.) samples, while in RL the feedback from the environment is usually both highly correlated over consecutive time-steps, and varies over the life-time of the task. This is demonstrated in Fig. 1 for the HalfCheetah environment (MuJoCo).

A possible solution is to apply statistical tests to large blocks of time-steps assumed to be i.i.d. Since many RL applications consist of repeating episodes, such a blocks-partition can be applied in a natural way. However, this approach requires complete episodes to allow the detection of changes, while a faster response is often required. Furthermore, naively applying a statistical test on the accumulated feedback (e.g., sum of rewards) from complete episodes ignores the dependencies within the episodes and may miss vital information, leading to highly sub-optimal tests.

In this work, we devise a test for detection of degradation of the rewards in an episodic RL task (or in any other episodic signal), based on the covariance structure within the episodes. The test can be used to detect changes and drifts in both the offline and the online settings mentioned above. We use the Neyman-Pearson Lemma (Neyman et al., 1933) to prove that the test is optimal under certain assumptions. Furthermore, for the online settings, we suggest a novel Bootstrap mechanism to control the False Alarm Rate (BFAR) through adjustment of test thresholds in sequential tests of episodic signals. The suggested procedures rely on the ability to estimate the correlations within the episodes, e.g., through a "reference dataset" of episodes.

Since the test is applied directly to the rewards, it is model-free in the sense that a model may not be learned. Furthermore, as the rewards are simply treated as episodic time-series, the test can be similarly applied to detect changes in any episodic signal.
Note that if the episodes are very short, or if the non-zero signal is very sparse within the episodes, the test becomes similar to a standard i.i.d. test where each episode is just a sample.

We demonstrate the new procedures in the environments of Pendulum (OpenAI), HalfCheetah and Humanoid (MuJoCo; Todorov et al., 2012). BFAR is shown to successfully control the false alarm rate. The covariance-based degradation-test detects degradation faster and more often than two alternative tests – in certain cases by orders of magnitude.

Section 3 formulates the offline setup (individual tests) and the online setup (sequential tests). Section 4 introduces the model of an episodic signal, and derives an optimal test for degradation in such a signal. Section 5 shows how to adjust the test for online settings and control the false alarm rate. Section 6 describes the experiments, Section 7 discusses related works and Section 8 summarizes.

To the best of our knowledge, we are the first to exploit the covariance between rewards in the post-training phase to test for changes in RL-based systems. The contributions of this paper are (i) a new framework for model-free statistical tests on episodic (non-i.i.d.) data with trusted reference-episodes; (ii) an optimal test (under certain conditions) for degradation in episodic data; and (iii) a novel bootstrap mechanism that controls the false alarm rate of sequential tests on episodic data.

2 PRELIMINARIES
Reinforcement learning and episodic framework:
A Reinforcement Learning (RL) problem is usually modeled as a decision process, where a learning agent has to repeatedly make decisions that affect its future states and rewards. The process is often organized as a finite sequence of time-steps (an episode) that repeats multiple times in different variants, e.g., with different initial states. Common examples are board and video games (Brockman et al., 2016), as well as more realistic problems such as repeating drives in autonomous driving tasks.

Once the agent is fixed (which is the case in the scope of this work), the rewards of the decision process essentially reduce to a (decision-free) random process {X_t}_{t=1}^n, which can be defined by its PDF f_{{X_t}_{t=1}^n}: R^n → [0, ∞). The {X_t} usually depend on each other: even in the popular Markov Decision Process (Bellman, 1957), where the dependence goes only a single step back, long-term correlations may still carry information if the states are not observable by the agent.
Hypothesis tests:
Consider a parametric probability (or probability density) function p(X|θ) describing a random process, and consider two different hypotheses H_0, H_A determining the allowed values of θ. When designing an observations-based test to decide between the hypotheses, the basic metrics for the test efficacy are its significance P(not reject H_0 | H_0) = 1 − α and its power P(reject H_0 | H_A) = β. A statistical hypothesis test with significance 1 − α and power β is said to be optimal if any test with as high significance 1 − α̃ ≥ 1 − α has smaller power β̃ ≤ β.

The likelihood of the hypothesis H: θ ∈ Θ given data X is defined as L(H|X) = sup_{θ∈Θ} p(X|θ). According to the Neyman-Pearson Lemma (Neyman et al., 1933), a threshold-test on the likelihood ratio LR(H_0, H_A | X) = L(H_0|X) / L(H_A|X) is optimal. In a threshold-test, the threshold is uniquely determined by the desired significance level α, though it is often difficult to calculate given α.

In many practical applications, a hypothesis test is repeatedly applied as the data change or grow, a procedure known as a sequential test. If the null hypothesis H_0 is true, and any individual hypothesis test falsely rejects H_0 with some probability α, then the probability that at least one of the multiple tests will reject H_0 is α′ > α, termed the family-wise type-I error (or false alarm rate when associated with frequency). See Appendix K for more details about hypothesis testing and sequential tests in particular.

Common approaches for sequential tests, such as CUSUM (Page, 1954; Ryan, 2011) and α-spending functions (Lan, 1994; Pocock, 1977), usually make strong assumptions such as independence or normality, as further discussed in Appendix F.

3 PROBLEM SETUP
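As a toy illustration of a likelihood-ratio threshold-test (a sketch of ours, not part of the paper's procedure): for i.i.d. Gaussian samples with known variance and two simple hypotheses on the mean, the likelihood ratio has a closed form and can be thresholded directly. All names and default values below are illustrative.

```python
import numpy as np

def log_likelihood_ratio(x, mu0=0.0, mu_a=-1.0, sigma=1.0):
    """log L(H0|x) - log L(HA|x) for i.i.d. Gaussian samples with known
    variance, under simple hypotheses on the mean (mu0 vs. mu_a)."""
    ll0 = -0.5 * np.sum((x - mu0) ** 2) / sigma ** 2
    lla = -0.5 * np.sum((x - mu_a) ** 2) / sigma ** 2
    return ll0 - lla

def lr_threshold_test(x, kappa=0.0, **hyp):
    """Reject H0 when the (log) likelihood ratio falls below kappa;
    in practice kappa is tuned to the desired significance level alpha."""
    return bool(log_likelihood_ratio(x, **hyp) < kappa)
```

In this Gaussian case the log-likelihood ratio is monotone in the sample mean, so the test is equivalent to a threshold on the mean — the same kind of monotonicity argument used for the optimality results of Section 4.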
In this work, we consider two setups where detecting performance deterioration is important – sequential degradation-tests and individual degradation-tests. The individual tests, in addition to their importance in (offline) settings such as sim-to-real and transfer learning, are used in this work as building-blocks for the (online) sequential tests.

Both setups assume a fixed agent that was previously trained, and aim to detect whenever the agent performance begins to deteriorate, e.g., due to environment changes. The ability to notice such changes is essential in many real-world problems, as explained in Section 1.
Setup 1 (Individual degradation-test).
We consider a fixed trained agent, whose rewards in an episodic environment (with episodes of length T) were previously recorded for multiple episodes (the reference dataset). The agent runs in a new environment for n time-steps (both n < T and n ≥ T are valid). The goal is to decide whether the rewards in the new environment are smaller than in the original environment or not. If the new environment is identical, the probability of a false alarm must not exceed α.

Setup 2 (Sequential degradation-test).
As in Setup 1, we consider a fixed trained agent with recorded reference data of multiple episodes. This time the agent keeps running in the same environment, and at a certain point in time its rewards begin to deteriorate, e.g., due to changes in the environment. The goal is to alert to the degradation as soon as possible. As long as the environment has not changed, the probability of a false alarm must not exceed α during a run of h̃ episodes.

Note that while in this work the setups focus on degradation, they can be easily modified to look for any change (as positive changes may also indicate the need for further training, for example).

4 OPTIMIZATION OF INDIVIDUAL DEGRADATION-TESTS
To tackle the problem of Setup 1, we first define the properties of an episodic signal and the general assumptions regarding its degradation.
Definition 4.1 (T-long episodic signal). Let n, T ∈ N, and write n = KT + τ (for non-negative integers K, τ with τ ≤ T). A sequence of real-valued random variables {X_t}_{t=1}^n is a T-long episodic signal if its joint probability density function can be written as

f_{{X_t}_{t=1}^n}(x_1, ..., x_n) = [ ∏_{k=0}^{K−1} f_{{X_t}_{t=1}^T}(x_{kT+1}, ..., x_{kT+T}) ] · f_{{X_t}_{t=1}^τ}(x_{KT+1}, ..., x_{KT+τ})    (1)

(where an empty product is defined as 1). We further denote

µ̄ := E[(X_1, ..., X_T)^⊤] ∈ R^T,    Σ̄ := Cov((X_1, ..., X_T)^⊤, (X_1, ..., X_T)) ∈ R^{T×T}.

Note that the episodic signal consists of i.i.d. episodes, but is not assumed to be independent or identically-distributed within the episodes. For simplicity we focus on one-dimensional episodic signals, although a generalization to multidimensional signals is straightforward (see Appendix G).
Figure 1:
Parameters of an episodic signal of the rewards in the HalfCheetah environment, estimated over N = 10000 episodes of T = 1000 time-steps: (a) distribution of rewards per time-step; (b) variance per time-step; (c) correlation(t_0, t) vs. t − t_0. The estimations were done in resolution of 25 time-steps, i.e., every episode was split into 40 intervals of 25 consecutive rewards, and each sample is the average over an interval.

In the analysis below we assume that both µ̄ and Σ̄ are known. In practice, this can be achieved either through detailed domain knowledge, or by estimation from the recorded reference dataset of Setup 1, assuming it satisfies Eq. (1). The estimation errors decrease as O(1/√N) with the number N of reference episodes, and are distributed according to the Central Limit Theorem (for means) and the Wishart distribution (K. V. Mardia & Bibby, 1979) (for covariance). While in this work we use up to N = 10000 reference episodes, Appendix E shows that N = 300 reference episodes are sufficient for reasonable results in HalfCheetah, for example. Note that correlations estimation has already been discussed in several other RL works (Alt et al., 2019).

Fig. 1 demonstrates the estimation of mean and covariance parameters for a trained agent in the environment of HalfCheetah, from a reference dataset of N = 10000 episodes. This also demonstrates the non-trivial correlations structure in the environment. According to Fig. 1b, the variance in the rewards varies and does not seem to reach stationarity within the scope of an episode. Fig. 1c shows the autocorrelation function ACF(t − t_0) = corr(t_0, t) for different reference times t_0. It is evident that the correlations last for hundreds of time-steps, and depend on the time t_0 rather than merely on the time-difference t − t_0.
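This estimation step can be sketched as follows (our illustrative code; the paper's implementation details are in its appendices). Given a reference dataset of N i.i.d. episodes, the per-episode mean µ̄ and covariance Σ̄ are estimated after averaging over intervals of d consecutive rewards, as done for Fig. 1 with d = 25:

```python
import numpy as np

def estimate_episode_params(ref, d=25):
    """Estimate per-episode mean and covariance from reference episodes.

    ref: (N, T) array of N i.i.d. reference episodes of T rewards each.
    d:   downsampling factor; each episode is split into T/d intervals of d
         consecutive rewards, and each interval is replaced by its average.
    Returns mu of shape (T/d,) and Sigma of shape (T/d, T/d)."""
    N, T = ref.shape
    assert T % d == 0, "T must be a multiple of the downsampling factor"
    x = ref.reshape(N, T // d, d).mean(axis=2)  # interval averages, (N, T/d)
    return x.mean(axis=0), np.cov(x, rowvar=False)
```

The downsampling both reduces the number of covariance entries to estimate (hence the estimation error) and shortens the running time of the downstream tests.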
This means that the autocorrelation function is not expressive enough for the actual correlations structure.

Once the per-episode parameters µ̄ ∈ R^T, Σ̄ ∈ R^{T×T} are known, the expectations and covariance matrix of the whole signal, µ ∈ R^n and Σ ∈ R^{n×n}, can be derived directly: µ consists of periodic repetitions of µ̄, and Σ consists of copies of Σ̄ as T×T blocks along its diagonal. For both parameters, the last repetition is cropped if n is not an integer multiple of T.

The degradation in the signal X = {X_t}_{t=1}^n is defined through the difference between two hypotheses. The null hypothesis H_0 states that X is a T-long episodic signal with expectations µ̄ ∈ R^T and invertible covariance matrix Σ̄ ∈ R^{T×T}. Our first alternative hypothesis – uniform degradation – states that X is a T-long episodic signal with the same covariance Σ̄ but smaller expectations µ̄_A: ∃ε > 0, ∀1 ≤ t ≤ T: (µ̄_A)_t = (µ̄)_t − ε. This brings us to the following result.

Theorem 4.1 (Optimal test for uniform degradation). Define the uniform-degradation weighted-mean s_unif(X) := W · X, where W := 1^⊤ · Σ^{-1} ∈ R^n (and 1 ∈ R^n is the all-1 vector). If the distribution of X is multivariate normal, then a threshold-test on s_unif is optimal.

Proof Sketch. The Neyman-Pearson Lemma (Neyman et al., 1933) states that an optimal hypothesis test is a threshold-test on the likelihood-ratio between H_0 and H_A. s_unif can be shown to be strictly monotone in the likelihood-ratio, hence every threshold-test on s_unif is equivalent to a threshold-test on the likelihood-ratio, which is optimal. See the full proof in Appendix J.

Following Theorem 4.1, we define the Uniform Degradation Test (UDT) to be a threshold-test on s_unif, i.e., "declare a degradation if s_unif < κ" for a pre-defined κ.

Recall that the optimality of a test is defined in Section 2 as having maximal power given its significance level.
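As an illustrative sketch (our own code, not the paper's implementation), the statistic s_unif can be computed cheaply by exploiting the block-diagonal structure of Σ, and its threshold κ can be taken as the α-quantile of the statistic under H_0, estimated by resampling reference episodes. All function names are ours; Σ̄ is assumed known:

```python
import numpy as np

def udt_weights(Sigma_ep, n):
    """W = 1^T Sigma^{-1} for an n-step signal, where Sigma is block-diagonal
    with copies of the per-episode covariance Sigma_ep (the last block is
    cropped if n is not an integer multiple of T)."""
    T = Sigma_ep.shape[0]
    K, tau = divmod(n, T)
    w_ep = np.linalg.solve(Sigma_ep, np.ones(T))  # 1^T Sigma_ep^{-1}
    parts = [w_ep] * K
    if tau:
        parts.append(np.linalg.solve(Sigma_ep[:tau, :tau], np.ones(tau)))
    return np.concatenate(parts)

def s_unif(x, Sigma_ep):
    """Uniform-degradation weighted mean s_unif(X) = W . X."""
    return float(udt_weights(Sigma_ep, len(x)) @ x)

def udt_threshold(ref, Sigma_ep, n, alpha=0.05, B=1000, rng=None):
    """Bootstrap the alpha-quantile of s_unif under H0 by concatenating
    resampled reference episodes to length n; a degradation is declared on
    new data x whenever s_unif(x) < this threshold."""
    rng = np.random.default_rng(rng)
    N, T = ref.shape
    k = -(-n // T)  # number of episodes needed to cover n steps
    stats = [s_unif(ref[rng.integers(0, N, k)].reshape(-1)[:n], Sigma_ep)
             for _ in range(B)]
    return float(np.quantile(stats, alpha))
```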
To achieve the significance α required in Setup 1, we apply a bootstrap mechanism that randomly samples episodes from the reference dataset and calculates the corresponding statistic (e.g., s_unif). This yields a bootstrap-estimate of the distribution of the statistic under H_0, and the α-quantile of the estimated distribution is chosen as the test-threshold (κ = q_α(s_unif | H_0)).

Note that Theorem 4.1 assumes multivariate normality of the signal. If we remove this strong assumption, Theorem 4.2 still guarantees that UDT is asymptotically better than a test on the simple mean s_simp = (1/n) ∑_{t=1}^n X_t. Note that "asymptotic" refers to the signal length n → ∞ (while T remains constant), and is translated in the sequential setup into a "very long lookback-horizon h" (rather than very long running time).

Theorem 4.2 (Asymptotic power of UDT). Denote the length of the signal n = K · T, assume a uniform degradation of size ε/√K, and let two threshold-tests – τ_simp on s_simp and UDT on s_unif – be tuned to have significance α. Then

lim_{K→∞} P(τ_simp rejects H_0 | H_A) = Φ( q_α + εT / √(1^⊤ Σ̄ 1) ) ≤ Φ( q_α + ε √(1^⊤ Σ̄^{-1} 1) ) = lim_{K→∞} P(UDT rejects H_0 | H_A)    (2)

where Φ is the CDF of the standard normal distribution, and q_α is its α-quantile.

Proof Sketch. Since the episodes of the signal are i.i.d., both s_simp and s_unif are asymptotically normal according to the Central Limit Theorem. Derivation of the theorem from the asymptotic distributions is provided in Appendix J.

The power advantage of UDT in Eq. (2) depends on the spectrum of Σ̄, and in particular increases with the heterogeneity of Σ̄'s eigenvalues. A detailed calculation is available in Appendix J.

Figure 2: Rewards degradation in HalfCheetah following changes in gravity, mass, and control-cost, over N = 5000 episodes per scenario.

So far we assumed a uniform degradation.
In the context of RL, such a model may refer to changes in constant costs or action costs, as well as certain environment dynamics whose change influences various states in a similar way. Fig. 2 demonstrates the empiric degradation in the rewards of a trained agent in HalfCheetah, following changes in gravity, mass and control-cost. It seems that some modifications indeed cause a quite uniform degradation, while in others the degradation is mostly restricted to certain ranges of time.

To model effects that are less uniform in time we suggest a partial degradation hypothesis, where some of the entries of µ̄ are reduced by ε > 0, and the others do not change. The number m = p · T of the reduced entries is defined by a parameter p ∈ (0, 1).

As before, a likelihood-ratio calculation along with a normality assumption can be used to derive an optimal test-statistic. We use an approximation of the statistic, named the partial-degradation mean and denoted s_part, that essentially sums the m = p · T smallest time-steps after a Σ^{-1}-transformation. The calculation of the statistic and the meaning of the approximation are discussed in Appendix I. We define the Partial Degradation Test (PDT) to be a threshold-test on s_part with a parameter p.

5 BOOTSTRAP FOR FALSE ALARM RATE CONTROL (BFAR)
For Setup 2, we suggest a sequential testing procedure: run an individual degradation-test every d steps (i.e., F = T/d test-points per episode), and return once any individual test declares a degradation. The tests can run according to Section 4, applied on the h recent episodes. Multiple tests may be applied at every test-point, e.g., with varying test-statistics {s} or lookback-horizons {h}. This procedure, as implemented for the experiments of Section 6, is described in Fig. 3.

Setup 2 limits the probability of a false alarm to α in a run of h̃ episodes. To satisfy this condition, we set a uniform threshold κ on the p-values of the individual tests (i.e., declare once a test returns p-val < κ). The threshold is determined using a Bootstrap mechanism for False Alarm Rate control (BFAR), as described in Algorithm 1.
Algorithm 1: BFAR: Bootstrap for FAR control

Input: reference dataset x ∈ R^{N×T}; statistic functions {s}; lookback-horizons {h_1, ..., h_max}; test length h̃ ∈ N; B ∈ N; α ∈ (0, 1)
Output: test threshold for individual tests
Initialize P = (1, ..., 1) ∈ [0, 1]^B
for b in 1:B do
    Initialize Y ∈ R^{(h_max + h̃)T}
    for k in 0:(h_max + h̃ − 1) do
        Sample j uniformly from (1, ..., N)
        Y[kT + 1 : kT + T] ← (x_{j,1}, ..., x_{j,T})
    for t in test-points do
        for h in lookback-horizons and s in statistic functions do
            y ← Y[t − hT : t]
            p ← individual_test_pvalue(y vs. x; s)
            P[b] ← min(P[b], p)
Return quantile_α(P)

BFAR samples h_max + h̃ episodes (where h_max is the maximal lookback-horizon) from reference data of N episodes, to simulate sequential data Y. Then individual tests are simulated for every test-point along h̃ episodes, starting after h_max episodes. The minimal p-value determines whether a detection would occur in Y. The whole procedure repeats B times, creating a bootstrap estimate of the distribution of the minimal p-value along h̃ episodes. We choose the tests' threshold to be the α-quantile of this distribution, such that α of the bootstrap simulations would raise a false alarm. The simulation relies on the assumption of i.i.d. reference episodes, but does not assume independence, normality, or identical distributions within episodes. Note that the statistic for the tests is given to BFAR as an input, making its choice independent of BFAR. Additional details and time complexity are discussed in Appendices H, L.

6 EXPERIMENTS
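A compact Python rendering of Algorithm 1 (a sketch of ours: the individual-test p-value routine, the test-point spacing, and all names are left to the caller, exactly as they are inputs in the pseudocode):

```python
import numpy as np

def bfar(ref, stat_fns, horizons, h_tilde, pvalue_fn,
         B=1000, alpha=0.05, points_per_episode=1, rng=None):
    """Bootstrap for False Alarm Rate control (Algorithm 1, sketch).

    ref:       (N, T) reference dataset of i.i.d. episodes.
    stat_fns:  statistic functions {s}, forwarded to pvalue_fn.
    horizons:  lookback-horizons (in episodes); h_max = max(horizons).
    h_tilde:   length of the simulated sequential test, in episodes.
    pvalue_fn: pvalue_fn(y, ref, s) -> p-value of an individual test of the
               window y against the reference data, using statistic s.
    Returns the alpha-quantile of the minimal p-value along a simulated run,
    used as the uniform threshold kappa of the individual tests."""
    rng = np.random.default_rng(rng)
    N, T = ref.shape
    h_max, d = max(horizons), T // points_per_episode
    P = np.ones(B)
    for b in range(B):
        # simulate sequential data Y by resampling whole reference episodes
        Y = ref[rng.integers(0, N, h_max + h_tilde)].reshape(-1)
        # test-points start once h_max lookback episodes are available
        for t in range(h_max * T + d, (h_max + h_tilde) * T + 1, d):
            for h in horizons:
                for s in stat_fns:
                    P[b] = min(P[b], pvalue_fn(Y[t - h * T: t], ref, s))
    return float(np.quantile(P, alpha))
```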
6.1 METHODOLOGY
We run experiments in standard Reinforcement Learning environments as described below. For every environment, we use a PyTorch implementation (Kostrikov, 2018) of the standard A2C algorithm (Mnih et al., 2016) to train an agent. We let the trained agent run in the environment for N episodes and record its rewards, considered the trusted reference data. We then define several scenarios, and let the agent run for M × N episodes in each scenario (divided later into M = 100 blocks of N episodes). One scenario is named H_0 and is identical to the reference run (up to initial-state randomization). The other scenarios are defined per environment, and present environmental changes expected to harm the agent's rewards. The agent is not trained to adapt to these changes, and the goal is to test how long it takes for a degradation-test to detect its degradation.

Individual degradation-tests of length n (Setup 1) are applied for every scenario over the first n time-steps of each block. Sequential degradation-tests (Setup 2) are applied sequentially on the episodes of each block. Since the agent is assumed to run continuously as the environment changes from H_0 to an alternative scenario, each block is preceded by a random sample of H_0 episodes, as demonstrated in Fig. 3.

BFAR adjusts the tests' thresholds to have a false alarm with probability α = 5% per h̃ = N episodes (where N is the data-block size). Two lookback-horizons h_1, h_2 are chosen for every environment. The rewards are downsampled by a factor d before applying the tests, intended to reduce the parameters' estimation error and the running time of the tests.

The experimented degradation-tests are a threshold-test on the simple Mean; CUSUM (Ryan, 2011); UDT and PDT (with a fixed p) from Section 4; and a Mixed Degradation Test (MDT) that runs Mean and PDT in parallel (note that Algorithm 1 permits multiple test-statistics). We call UDT, PDT and MDT the covariance-based tests, and discuss implementation details in Appendix D.

Figure 3: A summary of the sequential degradation-test procedure described in Section 6.1.
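The online side of this procedure — running the individual tests every d steps over the lookback windows, against the BFAR-calibrated threshold — might be sketched as follows (our own code, reusing the same assumed p-value interface as in Algorithm 1):

```python
import numpy as np

def sequential_detect(signal, ref, kappa, stat_fns, horizons, pvalue_fn,
                      points_per_episode=1):
    """Scan an online reward stream and return the first time-step at which
    some individual test's p-value drops below the threshold kappa
    (or None if no degradation is declared).

    signal: 1-D array of rewards from the continuously running agent.
    ref:    (N, T) reference episodes; pvalue_fn(y, ref, s) supplies the
            individual test, as in Algorithm 1."""
    T = ref.shape[1]
    h_max, d = max(horizons), T // points_per_episode
    for t in range(h_max * T + d, len(signal) + 1, d):
        for h in horizons:
            for s in stat_fns:
                if pvalue_fn(signal[t - h * T: t], ref, s) < kappa:
                    return t  # degradation declared at time-step t
    return None
```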
6.2 RESULTS
We run the tests in the environments of Pendulum (OpenAI), where the goal is to keep a one-dimensional pendulum pointing upwards; HalfCheetah (Todorov et al., 2012), where the goal is for a two-dimensional cheetah to run as fast as possible; and Humanoid, where the goal is for a person to walk without falling. In each environment we define the scenario ccostx of control cost increased to x% of its original value, in addition to scenarios of changed dynamics as specified in Appendix D.

In all the environments the rewards are clearly not independent, identically-distributed or normally-distributed (see Fig. 1 for example). Yet the false alarm rates are close to α = 5% per h̃ episodes in all the tests, as demonstrated in Fig. 4 for HalfCheetah, for example. These results for the H_0 scenarios indicate that BFAR tunes the thresholds properly in spite of the complexity of the data. Note that BFAR never observed the data of scenario H_0, but only the reference data.

In most of the non-H_0 scenarios, the covariance-based tests prove to be more powerful than the standard tests, often by extreme margins. For example, increased control cost in all the environments and additive noise in Pendulum are all 100%-detected by the suggested tests, usually within a few episodes (Fig. 4); whereas Mean and CUSUM have very poor detection rates. Mean did not detect degradation in Pendulum even after the control cost increased from 110% to 300%(!).

Note that we run the tests with two lookback-horizons in parallel, as allowed by BFAR. This proves useful: with +30% control cost in HalfCheetah, for example, the short lookback-horizon allows fast detection of degradation; but with merely +10%, the long horizon is necessary to notice the slight degradation over a large number of episodes. This is demonstrated in Fig. 11 in Appendix C.

The covariance-based tests reduce the weights of the highly-varying (and presumably noisier) time-steps. In HalfCheetah these turn out to be in the later parts of the episode.
As a result, in certain scenarios, both Mean (which ignores the different variances) and CUSUM (which exploits them only in a heuristic way) do better in individual degradation-tests of 100 samples (out of T = 1000).

Figure 4: Bottom: percent of sequential tests (among M = 100 runs with different seeds) that ended with degradation detection, for various degradation-tests (corresponding to different colors), in a sample of scenarios in Pendulum, HalfCheetah and Humanoid. Top: the distribution of time until detection – for the runs that ended with detection. High detection rates usually go along with short detection times.

7 RELATED WORK
Training in non-stationary environments has been widely researched, in particular in the frameworks of MAB (Mukherjee & Maillard, 2019; Garivier & Moulines, 2011; Besbes et al., 2014; Lykouris et al., 2020; Alatur et al., 2020; Gupta et al., 2019; Jun et al., 2018), model-based RL (Lecarpentier & Rachelson, 2019; Lee et al., 2020) and general multi-agent environments (Hernandez-Leal et al., 2019). Safe exploration during training in RL was addressed by Garcia & Fernandez (2015); Chow et al. (2018); Junges et al. (2016); Cheng et al. (2019); Alshiekh (2017). Note that our work refers to changes beyond the scope of the training phase: it addresses the stage where the agent is fixed and required not to train further, in particular not in an online manner. Robust algorithms may prevent degradation in the first place, but when they fail – or when their assumptions are not met – a model-free monitor with minimal assumptions (as the one suggested in this work) is crucial.

Sequential tests were addressed by many over the years. Common approaches rely on strong assumptions such as samples independence (Page, 1954; Ryan, 2011) and normality (Pocock, 1977; O'Brien & Fleming, 1979). Generalizations exist for certain private cases (Lu & Jr., 2001; Xie & Siegmund, 2011), sometimes at the cost of alternative assumptions such as a known change-size (Lund et al., 2007). Samples independence is usually assumed also in recent works based on numeric approaches (Abhishek & Mannor, 2017; Harel et al., 2014), and is often justified by consolidating many data samples (e.g., an episode) together as a single sample (Colas et al., 2019). Ditzler et al. (2015) wrote that "change detection is typically carried out by inspecting independent and identically distributed (i.i.d.) features extracted from the incoming data stream, e.g., the sample mean".

See extended discussion about related works in Appendix F.
8 SUMMARY
We introduce a novel approach that is optimal (under certain conditions) for the detection of changes in episodic signals, exploiting the correlations structure as measured in a reference dataset. In environments of classic control (Pendulum) and MuJoCo (HalfCheetah, Humanoid), the suggested statistical tests detected degradation faster than alternatives, often by orders of magnitude. Certain conditions, such as a combination of positive and negative changes in very heterogeneous signals, may cause instability in some of the suggested tests; however, this is shown to be solved by running the new test in parallel to a standard mean test – with only a small loss of test power.

We also introduce BFAR, a bootstrap mechanism that adjusts tests' thresholds according to the desired false alarm rate in sequential tests. The mechanism empirically succeeded in providing valid thresholds for various tests in all the environments, in spite of the non-i.i.d. data.

The suggested approach may contribute to the development of more reliable RL-based systems. Future research may: consider different hypotheses, such as a permitted small degradation (instead of H_0) or a mix of degradation and improvement (instead of H_A); suggest additional stabilizing mechanisms for covariance-based tests; exploit other metrics than rewards for tests on model-based RL systems; and apply comparative tests of episodic signals beyond the scope of drift detection.

ACKNOWLEDGEMENTS
The authors wish to thank Guy Tennenholtz and Nadav Merlis for their helpful insights.
REFERENCES
Vineet Abhishek and Shie Mannor. A nonparametric sequential test for online randomized experiments. Proceedings of the 26th International Conference on World Wide Web Companion, pp. 610–6, 2017.
Pragnya Alatur, Kfir Y. Levy, and Andreas Krause. Multi-player bandits: The adversarial case. JMLR, 2020.
Mohammed Alshiekh. Safe reinforcement learning via shielding. Logic in Computer Science, 2017.
Bastian Alt, Adrian Sosic, and Heinz Koeppl. Correlation priors for reinforcement learning. NeurIPS, 2019.
Samaneh Aminikhanghahi and D. Cook. A survey of methods for time series change point detection. Knowledge and Information Systems, 51:339–367, 2016.
Adria Puigdomenech Badia et al. Agent57: Outperforming the atari human benchmark. ICML, 2020.
Richard Bellman. A markovian decision process. Indiana Univ. Math. J., 6:679–684, 1957. ISSN 0022-2518.
Donald A. Berry and Bert Fristedt. Bandit problems. Springer Netherlands, 1985. doi: 10.1007/978-94-015-3711-7.
Omar Besbes, Yonatan Gur, and Assaf Zeevi. Stochastic multi-armed-bandit problem with non-stationary rewards. Advances in Neural Information Processing Systems (NIPS), 27, 2014.
Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym, 2016.
D. Brook et al. An approach to the probability distribution of cusum run length. Biometrika, 59(3):539–549, 1972.
Tom Bylander. Lecture notes: Reinforcement learning.
Stephanie C.Y. Chan et al. Measuring the reliability of reinforcement learning algorithms. ICLR, 2020.
James Chen. Conditional value at risk (cvar), 2020.
Richard Cheng et al. End-to-end safe reinforcement learning through barrier functions for safety-critical continuous control tasks. AAAI Conference on Artificial Intelligence, 2019.
Yinlam Chow et al. A lyapunov-based approach to safe reinforcement learning. NIPS, 2018.
Cedric Colas, Olivier Sigaud, and Pierre-Yves Oudeyer. A hitchhiker's guide to statistical comparisons of reinforcement learning algorithms, 2019.
Bin Dai, Shilin Ding, and Grace Wahba. Multivariate bernoulli distribution. Bernoulli, 19(4):1465–1483, 2013. doi: 10.3150/12-BEJSP10. URL https://doi.org/10.3150/12-BEJSP10.
David A. Dickey and Wayne A. Fuller. Distribution of the estimators for autoregressive time series with a unit root. Journal of the American Statistical Association, 74(366a):427–431, 1979. doi: 10.1080/01621459.1979.10482531. URL https://doi.org/10.1080/01621459.1979.10482531.
Gregory Ditzler, Robi Polikar, and Cesare Alippi. Learning in nonstationary environments: A survey. IEEE Computational Intelligence Magazine, 2015.
Gabriel Dulac-Arnold, Daniel Mankowitz, and Todd Hester. Challenges of real-world reinforcement learning, 2019.
Bradley Efron. Second thoughts on the bootstrap. Statist. Sci., 18(2):135–140, 2003. doi: 10.1214/ss/1063994968. URL https://doi.org/10.1214/ss/1063994968.
Ari Freedman. Convergence theorem for finite markov chains. https://math.uchicago.edu/~may/REU2017/REUPapers/Freedman.pdf, 2017.
Javier Garcia and Fernando Fernandez. A comprehensive survey on safe reinforcement learning. JMLR, 2015.
Aurélien Garivier and Eric Moulines. On upper-confidence bound policies for switching bandit problems. International Conference on Algorithmic Learning Theory.
Proceedings of Machine Learning Research, 2019.
Maayan Harel, Koby Crammer, Ran El-Yaniv, and Shie Mannor. Concept drift detection through resampling. International Conference on Machine Learning, pp. II–1009–II–1017, 2014.
Peter Henderson et al. Deep reinforcement learning that matters. AAAI, 2017.
Pablo Hernandez-Leal, Michael Kaisers, Tim Baarslag, and Enrique Munoz de Cote. A survey of learning in multiagent environments: Dealing with non-stationarity, 2019.
Matteo Hessel, Joseph Modayil, Hado van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining improvements in deep reinforcement learning. AAAI, 2018.
Mark E. Irwin. Lecture notes: Convergence in distribution and central limit theorem, 2006.
Kwang-Sung Jun et al. Adversarial attacks on stochastic bandits.
NeurIPS , 2018.Sebastian Junges et al. Safety-constrained reinforcement learning for mdps.
International Confer-ence on Tools and Algorithms for the Construction and Analysis of Systems , 2016.J. T. Kent K. V. Mardia and J. M. Bibby.
Multivariate analysis . Academic Press, 1979.Ilya Kostrikov. Pytorch implementations of reinforcement learning algorithms. https://github.com/ikostrikov/pytorch-a2c-ppo-acktr-gail , 2018.Dirk P. Kroese, T. Brereton, T. Taimre, and Z. Botev. Why the monte carlo method is so importanttoday.
Wiley Interdisciplinary Reviews: Computational Statistics , 6:386–392, 2014.David L. Demets K. K. Gordon Lan. Interim analysis: The alpha spending function approach.
Statistics in Medicine , 13:1341–52, 1994.Erwan Lecarpentier and Emmanuel Rachelson. Non-stationary markov decision processes: a worst-case approach using model-based reinforcement learning.
NeurIPS 2019 , abs/1904.10090, 2019.URL http://arxiv.org/abs/1904.10090 .Kimin Lee et al. Context-aware dynamics model for generalization in model-based rl.
ICML , 2020.Chao-Wen Lu and Marion R. Reynolds Jr. Cusum charts for monitoring an autocorrelated process.
Journal of Quality Technology , 33(3):316–334, 2001. doi: 10.1080/00224065.2001.11980082.URL https://doi.org/10.1080/00224065.2001.11980082 .Robert Lund, Xiaolan L. Wang, Qi Qi Lu, Jaxk Reeves, Colin Gallagher, and Yang Feng. Change-point Detection in Periodic and Autocorrelated Time Series.
Journal of Climate , 20(20):5178–5190, 10 2007. ISSN 0894-8755. doi: 10.1175/JCLI4291.1. URL https://doi.org/10.1175/JCLI4291.1 . 11hodoris Lykouris, Vahab Mirrokni, and Renato Paes Leme. Bandits with adversarial scaling.
ICML ,2020.Shie Mannor. Why does reinforcement learning not work (foryou)? https://rlrl.net.technion.ac.il/2020/01/27/why-does-reinforcement-learning-not-work-for-you/ , 2019.MathWorks. Conditional value-at-risk (cvar). .Tatsuya Matsushima, Hiroki Furuta, Y. Matsuo, Ofir Nachum, and Shixiang Gu. Deployment-efficient reinforcement learning via model-based offline optimization.
ArXiv , abs/2006.03647,2020.Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, TimHarley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcementlearning.
Proceedings of Machine Learning Research , 48:1928–1937, 20-22 Jun 2016.MuJoCo. Halfcheetah-v2. https://gym.openai.com/envs/HalfCheetah-v2/ .Subhojyoti Mukherjee and Odalric-Ambrym Maillard. Distribution-dependent and time-uniformbounds for piecewise i.i.d bandits. arXiv preprint arXiv:1905.13159 , 2019.Susan A Murphy, Mark J van der Laan, and James M Robins. Marginal mean models for dynamicregimes.
Journal of the American Statistical Association , 2001.Ofir Nachum, Michael Ahn, Hugo Ponte, Shixiang (Shane) Gu, and Vikash Kumar. Multi-agentmanipulation via locomotion using hierarchical sim2real.
PMLR , 100:110–121, 30 Oct–01 Nov2020. URL http://proceedings.mlr.press/v100/nachum20a.html .NCSS. Cumulative sum (cusum) charts. https://ncss-wpengine.netdna-ssl.com/wp-content/themes/ncss/pdf/Procedures/NCSS/CUSUM_Charts.pdf .Jerzy Neyman, Egon Sharpe Pearson, and Karl Pearson. On the problem of the most efficient testsof statistical hypotheses.
Philosophical Transactions of the Royal Society of London , 1933. doi:10.1098/rsta.1933.0009.Peter C. O’Brien and Thomas R. Fleming. A multiple testing procedure for clinical trials.
Biomet-rics , 35(3):549–556, 1979. ISSN 0006341X, 15410420. URL .PennState College of Science. Lecture notes in stat 509: Alpha spending function approach. https://online.stat.psu.edu/stat509/node/81/ .OpenAI. Pendulum-v0. https://gym.openai.com/envs/Pendulum-v0/ .E. S. Page. Continuous Inspection Schemes.
Biometrika , 41(1-2):100–115, 06 1954. ISSN 0006-3444. doi: 10.1093/biomet/41.1-2.100. URL https://doi.org/10.1093/biomet/41.1-2.100 .Fabio Pardo, Arash Tavakoli, Vitaly Levdik, and Petar Kormushev. Time limits in reinforcementlearning.
CoRR , abs/1712.00378, 2017. URL http://arxiv.org/abs/1712.00378 .V. V. Petrov.
Sums of Independent Random Variables . Nauka, 1972.S. J. Pocock. Group sequential methods in the design and analysis of clinical trials.
Biometrika , 64(2):191–199, 08 1977. ISSN 0006-3444. doi: 10.1093/biomet/64.2.191. URL https://doi.org/10.1093/biomet/64.2.191 .R. Tyrrell Rockafellar and Stanislav Uryasev. Optimization of conditional value-at-risk.
Journal ofRisk , 2:21–41, 2000. doi: 10.21314/JOR.2000.038.Thomas P. Ryan.
Statistical Methods for Quality Improvement . Wiley; 3rd Edition, 2011.12. Todorov, T. Erez, and Y. Tassa. Mujoco: A physics engine for model-based control. , pp. 5026–5033, 2012.A. Wald. Sequential tests of statistical hypotheses.
Annals of Mathematical Statistics , 16(2):117–186, 06 1945. doi: 10.1214/aoms/1177731118. URL https://doi.org/10.1214/aoms/1177731118 .James Westgard, Torgny Groth, T Aronsson, and C Verdier. Combined shewhart-cusum controlchart for improved quality control in clinical chemistry.
Clinical chemistry , 23:1881–7, 11 1977.doi: 10.1093/clinchem/23.10.1881.S. S. Wilks. The large-sample distribution of the likelihood ratio for testing composite hypotheses.
Ann. Math. Statist. , 9(1):60–62, 03 1938. doi: 10.1214/aoms/1177732360. URL https://doi.org/10.1214/aoms/1177732360 .S. M. Williams et al. Quality control: an application of the cusum.
BMJ: British medical journal ,304.6838:1359, 1992.Yao Xie and David Siegmund. Weak change-point detection using temporal correlation, 2011.E. Yashchin. On the analysis and design of cusum-shewhart control schemes.
IBM Journal ofResearch and Development , 29(4):377–391, 1985.Tianhe Yu, Garrett Thomas, Lantao Yu, Stefano Ermon, James Zou, Sergey Levine, Chelsea Finn,and Tengyu Ma. Mopo: Model-based offline policy optimization, 2020.Xingyu Zhao et al. Assessing the safety and reliability of autonomous vehicles from road testing.
ISSRE , 2019. 13 T ABLE OF C ONTENTS C ONTENTS L IST OF D EFINITIONS AND T HEOREMS L IST OF T HEOREMS
Setup 1 (Individual degradation-test)
Setup 2 (Sequential degradation-test)
Definition 4.1 (T-long episodic signal)
Theorem 4.1 (Optimal test for uniform degradation)
Theorem 4.2 (Asymptotic power of UDT)
Definition G.1 (Episodic index decomposition)
Definition G.2 (T-long episodic signal; an extended formulation of Definition 4.1)
Proposition G.1 (Expectation and covariance of an episodic signal)
Definition G.3 (Multivariate normal T-long episodic signal)
Definition I.1 (Null hypothesis)
Definition I.2 (The standard setup)
Definition I.3 (The standard normal setup)
Definition I.4 (General degradation hypothesis)
Definition I.5 (Uniform degradation hypothesis)
Definition I.6 (Uniform degradation weighted-mean)
Definition I.7 (Threshold test)
Theorem I.1 (Optimal test for uniform degradation; an extended formulation of Theorem 4.1)
Definition I.8 (The rolling setup)
Proposition I.1 (Uniform degradation test consistency)
Theorem I.2 (Uniform degradation test asymptotic power; an extended formulation of Theorem 4.2)
Definition I.9 (Uniform degradation asymptotic power gain)
Proposition I.2 (Uniform degradation test asymptotic power gain)
Definition I.10 (Partial degradation hypothesis)
Definition I.11 (Partial degradation mean)
Theorem I.3 (Optimal test for partial degradation)
Proposition J.1 (Likelihood ratio with respect to general degradation)
Proposition J.2 (Expected value of uniform-degradation weighted-mean)
Proposition J.3 (Consistency of uniform-degradation weighted-mean)
Lemma 1 (Maximum likelihood with relation to the complex hypothesis of uniform-degradation)
Lemma 2 (Properties of statistics under uniform-degradation)
Lemma 3 (Sensitivity of the minimum to deviations in the elements)

COMPLEMENTARY FIGURES
Figures 5-14 introduce additional results from the experiments described in Section 6. Note that Section 6 refers directly to some of the results presented in this section.

Figure 5: The weights assigned to the various time-steps by the various tests. Mind the logarithmic scale. Note that the weights of CUSUM are derived from its normalization scheme, i.e., w_t = 1/std(r_t).

Figure 6: Percent of rejections of H0 when H0 is true, for various statistical tests, for both individual (top) and sequential (bottom) tests. Each point in each plot represents M = 100 tests, and the shaded area represents a 95% confidence interval. The tests were tuned by Algorithm 2 (individual) and Algorithm 1 (sequential), using a reference dataset, to yield a rejection rate of 5% under H0.

Figure 7: Sequential tests in different scenarios in the Pendulum environment: cumulative percent of rejections vs. number of simulated time-steps. In the legend, the numbers in parentheses are the final percents of rejection.

Figure 8: Sequential tests in different scenarios in the HalfCheetah environment: cumulative percent of rejections vs. number of simulated time-steps. In the legend, the numbers in parentheses are the final percents of rejection.

Figure 9: Sequential tests in different scenarios in the Humanoid environment: cumulative percent of rejections vs. number of simulated time-steps. In the legend, the numbers in parentheses are the final percents of rejection.

Figure 10: Individual (not sequential) tests in the HalfCheetah environment: (a) percent of rejections (with significance α = 0.05) vs. number of samples: recall that T = 1000 samples correspond to a single episode, and note that Mean and CUSUM perform better in the beginning of the first episode – before most of the noise comes in; (b) for each scenario and each test-statistic, the distribution of the M = 100 z-values corresponding to simulated data blocks of 10 episodes each. The horizontal line represents the rejection threshold for significance α = 0.05.

Figure 11: Lookback horizons for which H0 was rejected in sequential tests: smaller degradation requires a longer horizon (i.e., more data) for detection.

Figure 12: Rewards degradation in Pendulum following changes in pole length, over N = 3000 episodes per scenario. The figure is split due to the extreme scale difference between small t and large t.

Figure 13: Parameters of an episodic signal of the rewards in the Pendulum environment, estimated over N = 3000 episodes of T = 200 time-steps: (a) distribution of rewards per time-step; (b) standard deviations; (c) correlation(t1, t2) vs. |t1 − t2|. The estimations were done in a resolution of 10 time-steps, i.e., every episode was split into 20 intervals of 10 consecutive rewards, and each sample is the average over an interval.

Figure 14: Parameters of an episodic signal of the rewards in the Humanoid environment, estimated over N = 5000 episodes of T = 200 time-steps: (a) distribution of rewards per time-step; (b) standard deviations; (c) correlation(t1, t2) vs. |t1 − t2|; (d) correlations map; (e) rewards degradation following changes in control costs and leg size. The estimations were done in a resolution of 20 time-steps, i.e., every episode was split into 10 intervals of 20 consecutive rewards, and each sample is the average over an interval.

EXPERIMENTS IMPLEMENTATION DETAILS
In the Pendulum (OpenAI) environment, where the goal is to keep a one-dimensional pendulum-pole pointing upwards, we define several alternative scenarios. The ccostx scenario (with a parameter x) increases the cost of action ("control cost", which is quadratic in the activated force) to x% of its original value. Note that the control cost is typically the smallest among the components of the reward, which also include the angle of the pendulum and its speed. The noisex scenario adds additive random normally-distributed noise to each action, whose standard deviation is x% of the range of valid actions. The lenx and massx scenarios change the pendulum's length and mass, respectively, to x% of their original values. Note that while these scenarios are not necessarily harder to act in, the changes are still expected to cause degradation, since the agent is not re-trained for them.

In the HalfCheetah (MuJoCo) environment, where the goal is to train a two-dimensional cheetah to run as fast as possible, we also define several alternative scenarios. ccostx and massx are similar to the analogous scenarios in Pendulum described above. The "control cost" in this case is also quadratic in the activated force, and is typically smaller than the other component of the reward – the speed of the cheetah. The gravityx scenario changes the gravity to x% of its original value.

Table 2 briefly summarizes the various scenarios, and Table 1 summarizes the parameters of the tests setup per environment.

For every environment, before running the statistical tests according to Section 6.1, the recorded rewards are downsampled in time by a factor d: every interval of samples $\{x_t\}_{t=d\tilde{t}+1}^{d\tilde{t}+d}$ is replaced by its mean as a single sample $\tilde{x}_{\tilde{t}} := \frac{1}{d}\sum_{t=d\tilde{t}+1}^{d\tilde{t}+d} x_t$. The downsampling reduces the dimension of the covariance matrix $\Sigma_0$ to $T/d \times T/d$, making it less noisy to estimate. In addition, it reduces the computational complexity of the experiments in this section.
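In code, this per-interval averaging might look as follows (a minimal sketch in plain Python; the function name and the plain-list representation are ours, not taken from the paper's implementation):

```python
def downsample(rewards, d):
    """Replace every consecutive interval of d rewards by its mean.

    Assumes len(rewards) is a multiple of d, as when an episode of
    length T is reduced to T/d aggregated time-steps.
    """
    return [sum(rewards[i:i + d]) / d for i in range(0, len(rewards), d)]

# An episode of T = 8 rewards downsampled by factor d = 4
# yields T/d = 2 aggregated samples.
episode = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
print(downsample(episode, 4))  # [2.5, 6.5]
```

With d = 1 the signal is left unchanged, so the same code path covers the non-downsampled case.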
After the downsampling, the sequential tests apply an individual test at every single time-step, i.e., the testing frequency of the sequential tests is F = T/d per episode.

The statistical tests compared in Section 6.2 are mostly based on the threshold-tests described in Algorithm 3 (for individual tests) and Algorithm 5 (for sequential tests), with the following test-statistics:
• Mean: the simple mean of the rewards.
• CUSUM: the standard cumulative-sum test (Page, 1954; NCSS), with every time-step normalized by its standard deviation (estimated over all the reference episodes), and with reference value k = 0. As CUSUM is online by nature, it is used as is (beginning to run h episodes in advance for any lookback horizon h) instead of as part of Algorithm 5. The family-wise significance level of CUSUM is controlled using Algorithm 1. Note that through the normalization mentioned above, we let CUSUM take advantage of the episodic setup and the trusted reference data. Other normalization methods (time-invariant normalization and no normalization) did not improve the CUSUM results in the experiments of Section 6.2.
• UDWM: the uniform-degradation weighted-mean from Definition I.6. The test based on this statistic is also named UDT.
• PDM: the 0.9-partial-degradation mean from Definition I.11, corresponding to degradation focused on 90% of the time-steps. The test based on this statistic is also named PDT. While this statistic may look quite similar to UDWM, it can produce different results in environments where UDWM concentrates most of the weights in a few time-steps – if PDM "drops" these time-steps. This may make PDM more robust to degradation scenarios where the dominantly-weighted time-steps are not affected. Further discussion regarding p and the relation to the CVaR statistic is provided in Appendix I.
• Mixed: the mixed statistic essentially incorporates multiple statistics together – Mean and PDM in this case – and tests whether any of them has "extreme" values. It is defined as s = min(p-value(Mean), p-value(PDM)), i.e., we calculate both statistics and take the more significant p-value. Note that s itself is the statistic, and that its p-value is derived using Algorithm 2's bootstrap, as for any of the other test statistics. We see below that the Mixed test enjoys most of the value of PDM, and still performs reasonably well wherever PDM is not robust enough (namely, when the mean reward decreases but the highly-weighted time-steps actually increase). The test based on this statistic is also named MDT.

Table 1: Environments parameters – episode length (T), reference episodes (N), episodes per block (N0) (= sequential test length h̃), test blocks (M), lookback horizons (h1, h2), and tests per episode (F = T/d).

Environment    | T    | N     | N0 (= h̃) | M   | h1, h2 | F = T/d
Pendulum-v0    | 200  | 3000  | 30        | 100 | 3, 30  | 20
HalfCheetah-v3 | 1000 | 10000 | 50        | 100 | 5, 50  | 40
Humanoid       | 200  | 5000  | 30        | 100 | 3, 30  | 10

Table 2: Environments scenarios.

Environment    | Scenarios
Pendulum-v0    | H0; ccostx: action cost ×= x%; noisex: additive noise = x% of max action; lenx: length ×= x%; massx: mass ×= x%
HalfCheetah-v3 | H0; ccostx: action cost ×= x%; massx: mass ×= x%; gravityx: gravity ×= x%
Humanoid       | H0; ccostx: action cost ×= x%; lenx: leg size ×= x%

E SENSITIVITY TO COVARIANCE MATRIX ESTIMATION
In most of the analysis in this work we assume that both the means μ0 and the covariance Σ0 of the episodic signal X are known. In practice, this can be achieved either through detailed domain knowledge, or by estimation from the recorded reference dataset of Setup 1, assuming it satisfies Eq. (1). The parameter estimation errors decrease as O(1/√N) with the number N of reference episodes, and are distributed according to the Central Limit Theorem (for the means) and the Wishart distribution (Mardia et al., 1979) (for the covariance).

If N is suspected to be too small for accurate estimation, it is possible to deal with the estimation errors of the model parameters through regularization. One possible regularization is assuming the absence of correlations between distant time-steps ($\exists \delta \in \mathbb{N}, \forall |t_1 - t_2| > \delta: (\Sigma_0)_{t_1 t_2} = 0$). Another is to essentially reduce T by grouping sequences of time-steps together (as we do in Section 6, for example).

To test the practical consequences of inaccurate parameter estimation, we repeated some of the offline (individual) tests of Section 6 for HalfCheetah, with different sizes of reference datasets. The reference datasets vary between N = 100 and N = 10000 episodes (where N = 10000 corresponds to Section 6). As in Section 6, we downsample each episode from T = 1000 to F = T/d = 40 time-steps.

Figure 15 shows the results of the sensitivity tests. Even with as few as N = 100 reference episodes, the largest weights are successfully assigned to the first time-steps (mind the logarithmic scale in both axes), although certain later weights are still noisy. N = 300 is sufficient to yield a consistent statistic distribution under H0, i.e., to reliably tune the false alarm rate. All sizes of reference datasets yield similar test power in the tested scenarios ccost130 and gravity090. N = 3000 is hardly distinguishable from N = 10000 by any means.

Figure 15: The weights of Uniform Degradation Tests (UDT), based on estimation of parameters from reference datasets of various sizes (top left); and percents of degradation detections in individual tests in different scenarios (with significance α = 0.05).

F RELATED WORK: DETAILED DISCUSSION
As explained in Section 2, sequential tests repeatedly apply individual hypothesis tests with a certain significance level $1 - \alpha_0 \in (0, 1)$. The probability that at least one test rejects the null hypothesis H0 increases with the number of individual tests, leading to "inflation of α" and a decreased family-wise significance level $1 - \alpha < 1 - \alpha_0$. Section 5 discusses this problem in the context of tests on episodic signals. Here we discuss some of the existing methods for the design of sequential tests.

Sequential Probability Ratio Test (SPRT): SPRT (Wald, 1945) takes a symmetric approach between two hypotheses H0, H1, and aims to decide between them as fast as possible, subject to the probability of a wrong decision being bounded by α. The decision rule is chosen such that the expected time until decision is minimized. The element that bounds the probability of a wrong decision is the setup of the flow of the test: in every iteration, the decision rule chooses between three possibilities – accept H0, accept H1, or continue. The possibility to stop upon acceptance of the true hypothesis limits the inflation of α.

In contrast to this setup, in the change-point detection problem – where we continuously look for changes – we either reject H0 or continue, but never stop to accept H0. Dedicated sequential tests are designed for the problem of change-point detection.

Cumulative Sum test (CUSUM): The CUSUM test (Page, 1954; NCSS) is a well-studied (Brook et al., 1972; Yashchin, 1985) and very popular method in quality control and change detection (Williams et al., 1992; Westgard et al., 1977). While useful in a wide scope of problems, the test requires the size of the change to be defined in advance as a parameter (a requirement that exists in other methods as well (Lund et al., 2007)). In addition, CUSUM assumes i.i.d samples. The statistic is defined incrementally in a non-linear way, making it more difficult to generalize to non-i.i.d models, although several generalizations do exist, e.g., for the case of a first-order autoregressive signal AR(1) (Lu & Reynolds, 2001). Fig. 1c, for example, demonstrates empirical rewards in the HalfCheetah environment (MuJoCo), where the dependencies in the signal require a more expressive model.
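For concreteness, a minimal one-sided CUSUM for a downward mean-shift can be sketched as follows (a textbook-style illustration with arbitrarily chosen reference value k and decision threshold h, not the normalized variant used in our experiments):

```python
def cusum_downward(samples, mu_ref, k, h):
    """One-sided CUSUM: accumulate evidence that the mean dropped
    below mu_ref by more than the reference value k; alarm when the
    cumulative sum exceeds the decision threshold h.

    Returns the index of the first alarm, or None if no alarm.
    """
    s = 0.0
    for i, x in enumerate(samples):
        # Positive increments correspond to samples below mu_ref - k.
        s = max(0.0, s + (mu_ref - k) - x)
        if s > h:
            return i
    return None

# Stream whose mean drops from 1.0 to 0.0 at index 5:
stream = [1.0] * 5 + [0.0] * 10
print(cusum_downward(stream, mu_ref=1.0, k=0.25, h=2.0))  # 7
```

Note how both k (the change size deemed worth detecting) and h must be fixed in advance, which is exactly the parameterization requirement discussed above.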
Persistent drift and Dickey-Fuller test: Certain methods are available for the detection of persistent drifts (also known as trends) in time-series. For example, the Dickey-Fuller test (Dickey & Fuller, 1979) for unit roots in autoregressive models essentially looks for linear drifts. However, in the scope of this work we do not assume a persistent drift, nor limit ourselves to autoregressive models.

α-spending functions: The α-spending functions (Lan, 1994; PennState College of Science) deal with the inflation of α in sequential tests by conceptually referring to α as a limited budget of significance, where every individual test spends some of the budget. Due to the dependence between the individual tests, the total budget spent is smaller than the sum of the individual spends: $\alpha < \sum_i \alpha_i$. Thus, careful calculations are required for tuning the family-wise significance level α.

Pocock (1977), for example, showed how to calculate a constant individual significance level $\alpha_i \equiv \alpha_0$, given a desired family-wise significance α and a known number k of individual tests, assuming that the tests are applied to accumulated normal i.i.d data samples. For many applications, such a constant significance level tends to spend too much of the α-budget on the first individual tests, removing too much power from the later tests – where most of the data are available. It is often preferred to keep a high significance level for these final tests, and reject H0 in earlier tests only in extreme cases. Accordingly, the O'Brien-Fleming function (O'Brien & Fleming, 1979) determines the individual significance levels $\{\alpha_i\}_{i=1}^{k}$ under similar i.i.d and normality assumptions as Pocock, but lets $\alpha_i$ gradually increase over the sequential test. In Section 5 we consider the α-spending approach and generalize it through a bootstrap mechanism to handle any sequence of individual tests for the case of episodic data; that is, i.i.d episodes consisting of samples which are not assumed to be independent, normal, or identically-distributed.

Numeric methods: Colas et al. (2019) address the problem of comparing different RL algorithms, referring to a whole episode as a single data sample for the tests. Harel et al. (2014) apply permutation tests to detect changes in i.i.d data, focusing on drifts that impair predictive models of the data. The bootstrap mechanism discussed in Section 5 can be seen as a permutation test on i.i.d episodes (instead of single samples). Abhishek & Mannor (2017) also bring together ideas from bootstrap and sequential tests to construct a nonparametric sequential hypothesis test. Their test applies bootstrap on single samples within blocks of data, assuming the data samples are i.i.d. Certain machine-learning-based approaches were also suggested for change-point detection in time-series (Aminikhanghahi & Cook, 2016). Ditzler et al. (2015) note that "change detection is typically carried out by inspecting independent and identically distributed (i.i.d) features extracted from the incoming data stream, e.g., the sample mean, the sample variance, and/or the classification error".
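The α-inflation that these methods control can be illustrated numerically. If the k individual tests were independent (a simplifying assumption – repeated tests on overlapping episodic data are dependent, which is exactly why careful tuning is needed), each with individual significance spend α0, the family-wise false-alarm probability would be 1 − (1 − α0)^k:

```python
def familywise_alpha(alpha0, k):
    """False-alarm probability of at least one rejection among k
    independent tests, each with individual significance alpha0."""
    return 1.0 - (1.0 - alpha0) ** k

# With alpha0 = 0.05, already 10 independent tests inflate the
# family-wise error to about 40%.
print(round(familywise_alpha(0.05, 1), 4))   # 0.05
print(round(familywise_alpha(0.05, 10), 4))  # 0.4013
```

Inverting this relation (α0 = 1 − (1 − α)^(1/k)) is the Šidák-style correction for independent tests; dependent episodic tests allow larger individual spends, which is what the bootstrap of Section 5 exploits.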
Changing environment and safety in RL:
In Multi-Armed Bandits (MAB) (Berry & Fristedt, 1985), where by default each bandit (action) yields i.i.d rewards, several works address the problem of regret minimization (namely, optimization of rewards during training) with abrupt changes (Garivier & Moulines, 2011; Mukherjee & Maillard, 2019), gradual changes (Besbes et al., 2014) and even adversarial changes (Lykouris et al., 2020; Alatur et al., 2020; Gupta et al., 2019; Jun et al., 2018).

Training in the presence of a non-stationary environment was also considered in other settings, such as multi-agent environments (Hernandez-Leal et al., 2019) and model-based RL with a varying model (Lecarpentier & Rachelson, 2019). Several works addressed the problem of safety in the exploration of RL algorithms during training (Garcia & Fernandez, 2015; Chow et al., 2018; Junges et al., 2016), often using model-based learning of the environment (Cheng et al., 2019) or specified constraints (Alshiekh, 2017).

Note that our work refers to changes beyond the scope of the training phase, at the stage where the agent is fixed and required not to train further, in particular not in an online manner. Robust algorithms may prevent rewards degradation in the first place, but when they do not, it is crucial to be alerted. To the best of our knowledge, we are the first to exploit correlations between rewards in the post-training phase to test for changes, in both model-based and model-free RL.
G EXTENDED DEFINITIONS AND MODEL DISCUSSIONS

Episodic signal model: Below is the formal definition of an episodic signal, as discussed in Section G.

Definition G.1 (Episodic index decomposition). Let $t, T \in \mathbb{N}$. We define $k(t,T) := \lfloor (t-1)/T \rfloor$, $\tilde{\tau}(t,T) := t \,(\mathrm{mod}\, T)$, and
$$\tau(t,T) := \begin{cases} T & \text{if } \tilde{\tau}(t,T) = 0 \\ \tilde{\tau}(t,T) & \text{if } \tilde{\tau}(t,T) \neq 0 \end{cases}$$
When no confusion is risked, we may simply write $k = k(t,T)$, $\tau = \tau(t,T)$. Note that $\forall t, T \in \mathbb{N}: t = kT + \tau$.

Definition G.2 (T-long episodic signal; an extended formulation of Definition 4.1). Let $n, T \in \mathbb{N}$. Denote $K = k(n,T)$, $\tau = \tau(n,T)$ according to Definition G.1. A sequence of real-valued random variables $\{X_t\}_{t=1}^n$ is a T-long episodic signal if its joint probability density can be written as
$$f_{\{X_t\}_{t=1}^n}(x_1, \dots, x_n) = \left[ \prod_{k=0}^{K-1} f_{\{X_t\}_{t=1}^T}(x_{kT+1}, \dots, x_{kT+T}) \right] \cdot f_{\{X_t\}_{t=1}^{\tau}}(x_{KT+1}, \dots, x_{KT+\tau}) \quad (3)$$
(where in the edge case $K = 0$ we define the empty product to be 1). We further denote $\boldsymbol{\mu}_0 := \mathbb{E}[(X_1, \dots, X_T)^\top] \in \mathbb{R}^T$, $\Sigma_0 := \mathrm{Cov}((X_1, \dots, X_T)^\top, (X_1, \dots, X_T)) \in \mathbb{R}^{T \times T}$.

Expectation and covariance of an episodic signal:
The expectations and covariance matrix of a whole episodic signal can be directly derived from the parameters $\boldsymbol{\mu}_0, \Sigma_0$ corresponding to the expectations and covariance matrix of a single episode.

Proposition G.1 (Expectation and covariance of an episodic signal). Let $\{X_t\}_{t=1}^n$ be a T-long episodic signal with parameters $\boldsymbol{\mu}_0, \Sigma_0$. The expectations $\boldsymbol{\mu} := \mathbb{E}[(X_1, \dots, X_n)^\top] \in \mathbb{R}^n$ and covariance matrix $\Sigma := \mathrm{Cov}((X_1, \dots, X_n)^\top, (X_1, \dots, X_n)) \in \mathbb{R}^{n \times n}$ are uniquely determined by $\boldsymbol{\mu}_0$ and $\Sigma_0$, respectively.

Proof. For any $t \in \{1, \dots, n\}$, denote $t = kT + \tau$ according to Definition G.1. From Eq. (1) it is clear that $\forall t_1 = k_1 T + \tau_1,\ t_2 = k_2 T + \tau_2 \in \{1, \dots, n\}$:
$$\mu_t = \mathbb{E}[X_t] = (\boldsymbol{\mu}_0)_\tau, \qquad \Sigma_{t_1 t_2} = \mathrm{Cov}(X_{t_1}, X_{t_2}) = \begin{cases} (\Sigma_0)_{\tau_1 \tau_2} & \text{if } k_1 = k_2 \\ 0 & \text{if } k_1 \neq k_2 \end{cases} \quad (4)$$

Proposition G.1 essentially means that $\boldsymbol{\mu}$ consists of periodic repetitions of $\boldsymbol{\mu}_0$, and $\Sigma$ consists of copies of $\Sigma_0$ as $T \times T$ blocks along its diagonal. For both parameters, the last repetition is cropped if $\tau(n,T) < T$.

Multivariate normal episodic signal:
Some of the theoretical results in Section I assume multivariate normality of the episodic signal. The formal definition of such a signal is given below.
Definition G.3 (Multivariate normal T-long episodic signal). Let $\{X_t\}_{t=1}^n$ be a T-long episodic signal (Definition G.2). For any $1 \le \tau \le \min(T, n)$, define $\boldsymbol{\mu}_\tau \in \mathbb{R}^\tau$ to be the first $\tau$ elements of $\boldsymbol{\mu}_0$, and $\Sigma_\tau \in \mathbb{R}^{\tau \times \tau}$ to be the upper-left $\tau \times \tau$ block of $\Sigma_0$. The signal $\{X_t\}_{t=1}^n$ is multivariate normal if $\forall 1 \le \tau \le \min(T, n)$,
$$f_{X_1, \dots, X_\tau}(\boldsymbol{x}) = (2\pi)^{-\tau/2} \det(\Sigma_\tau)^{-1/2} \, e^{-\frac{1}{2}(\boldsymbol{x} - \boldsymbol{\mu}_\tau)^\top \Sigma_\tau^{-1} (\boldsymbol{x} - \boldsymbol{\mu}_\tau)} \quad (5)$$
From Definitions G.2 and G.3 it is clear that if $\{X_t\}_{t=1}^n$ form a multivariate normal T-long episodic signal, then in particular $X = (X_1, \dots, X_n)^\top \in \mathbb{R}^n$ is an n-dimensional multivariate normal variable, with expectations $\boldsymbol{\mu}$ and covariance $\Sigma$ determined by Eq. (4).

Parameters estimation:
As mentioned above, a possible way to estimate the parameters $\boldsymbol{\mu}_0, \Sigma_0$ of an episodic signal is to compute the mean vector and the covariance matrix of a dataset $\{x_{i\tau} \mid 1 \le i \le N,\ 1 \le \tau \le T\}$ of N episodes assumed to satisfy Eq. (1). According to the Central Limit Theorem (Petrov, 1972; Irwin, 2006), since the episodes are i.i.d, for any time-step $\tau$ the estimate $(\hat{\boldsymbol{\mu}}_0)_\tau = \frac{1}{N}\sum_{i=1}^{N} x_{i\tau}$ is asymptotically normally-distributed around the true mean $(\boldsymbol{\mu}_0)_\tau$, with variance $(\Sigma_0)_{\tau\tau}/N$. Furthermore, in the special case of a multivariate normal signal, the covariance matrix estimate $(\hat{\Sigma}_0)_{ij} = \frac{1}{N-1}\sum_{k=1}^{N}(x_{ik} - \bar{x}_i)(x_{jk} - \bar{x}_j)$ follows a Wishart distribution (Mardia et al., 1979) (up to a factor of $N-1$), with $N-1$ degrees of freedom and variance $\mathrm{Var}((\hat{\Sigma}_0)_{ij}) = \frac{1}{N-1}\left((\Sigma_0)_{ii}(\Sigma_0)_{jj} + (\Sigma_0)_{ij}^2\right)$.

If N is suspected to be too small for accurate estimation, it is possible to deal with the estimation error of the model parameters through regularization. One possible regularization is assuming the absence of correlations between distant time-steps ($\exists \delta \in \mathbb{N}, \forall |t_1 - t_2| > \delta: (\Sigma_0)_{t_1 t_2} = 0$). Another is to essentially reduce T by grouping sequences of time-steps together (as we do in Section 6, for example).

In the analysis in the following sections we assume that both $\boldsymbol{\mu}_0$ and $\Sigma_0$ are known.

Multidimensional signals:
For simplicity of the theoretical discussion, we only consider one-dimensional signals: for any t , the random variable X t returns a scalar x t ∈ R . However, ageneralization to multidimensional signals ( x t ∈ R d ) is straight-forward: A d -dimensional T -longepisodic signal is simply a one-dimensional ( dT ) -long episodic signal, where the observations arrivein groups of d samples per group (i.e., n is always an integer multiplication of d ). Since the variousdimensions are equivalent to time-steps in the eyes of this model, the correlations between the var-ious dimensions are inherently captured. Note that for a large number of dimensions, the O ( d T ) degrees of freedom in the model may be impractical to estimate through a reference dataset. H B
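The two operations above, estimating $(\mu_0, \Sigma_0)$ from $N$ i.i.d. episodes and tiling them into the full-signal parameters of Eq. (4), can be sketched in numpy as follows (the helper names are ours, not from the paper's code):

```python
import numpy as np

def estimate_params(episodes):
    """Estimate episode-level parameters (mu0, Sigma0) from an
    (N, T) array of i.i.d. episodes, as in Appendix G."""
    mu0 = episodes.mean(axis=0)
    Sigma0 = np.cov(episodes, rowvar=False)  # unbiased, factor 1/(N-1)
    return mu0, Sigma0

def full_signal_params(mu0, Sigma0, n):
    """Tile (mu0, Sigma0) into the n-dimensional mean vector and
    block-diagonal covariance of the whole signal (Eq. (4)):
    episodes are i.i.d., so cross-episode blocks are zero."""
    T = len(mu0)
    K, tau = divmod(n, T)              # K full episodes + a tail of tau steps
    mu = np.concatenate([np.tile(mu0, K), mu0[:tau]])
    Sigma = np.zeros((n, n))
    for k in range(K):
        Sigma[k*T:(k+1)*T, k*T:(k+1)*T] = Sigma0
    if tau:                            # last, possibly-partial episode
        Sigma[K*T:, K*T:] = Sigma0[:tau, :tau]
    return mu, Sigma
```

The block-diagonal structure is exactly what later makes the inversion of $\Sigma$ cheap: only $\Sigma_0$ (and possibly one upper-left block of it) must be inverted.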
H  BOOTSTRAP FOR SEQUENTIAL TESTS: EXTENDED DISCUSSION
Section 5 describes a mechanism for sequential hypothesis testing over episodic signals. The mechanism simply runs individual hypothesis tests repeatedly with a constant significance level $\alpha_0$, similarly to the concept of $\alpha$-spending functions (Lan, 1994), and in particular to Pocock's approach (Pocock, 1977). Note that Pocock's constant $\alpha$-spending function is often avoided, as it is claimed to spend "too much" of the $\alpha$-budget at the beginning of the sequential test at the expense of its end. In our online setup this time-homogeneous approach is welcome, as we do not rely on a well-defined beginning and end. However, in contrast to Pocock, we can assume neither independence nor normality of the aggregated parts of the data.

The sequential test (described in Algorithm 5) uses a constant, manually-determined lookback-horizon $h$. Any individual test at time $t = kT + \tau$ runs the simple threshold-test of Algorithm 3 on the signal $X_{(k-h)T}, \ldots, X_{kT+\tau}$, i.e., it looks exactly $h + \tau/T$ episodes back. In practice, $n_h$ different lookback-horizons $h_1, \ldots, h_{n_h}$ can be used simultaneously, such that at any point of time, we reject $H_0$ if any of the lookback tests rejects it. This allows us to detect slight changes which are only detectable over large amounts of data (large $h$), while still allowing quick detection of larger abrupt changes, without mixing them with older irrelevant data (small $h$).

To determine the significance level $\alpha_0$ for the individual tests within the sequential test, we use the bootstrap mechanism described in Algorithm 1 (see also the extended pseudo-code in Algorithm 4 in Appendix L). The mechanism simulates sequential tests using bootstrap-sampling of $\max(h_1, \ldots, h_{n_h}) + \tilde{h}$ episodes (maximum lookback horizon + length of simulation) from a reference dataset of $N$ episodes assumed to be i.i.d. Once the episodes are sampled, the simulation runs $\tilde{h}$ episodes without a stopping condition, keeps track of the resulting p-values, and eventually returns the minimal p-value among all the individual tests. This simulative process is repeated $\tilde{B}$ times with different bootstrap-samples, and the $\alpha$-quantile among all the minimal p-values is chosen as the individual-test significance level $\alpha_0$. Indeed, $\alpha$ is the relative part of bootstrap-samples in which at least one individual test returned a p-value smaller than $\alpha_0$.

The sequential bootstrap mechanism of Algorithm 1 may look computationally overwhelming, since it runs a bootstrap that calls another bootstrap (Algorithm 2, called through Algorithm 3). However, if the sequential test runs individual tests with $n_h$ different lookback-horizons (where $n_h$ is typically small), at a frequency of $F$ test-points per episode, then the inner bootstrap of Algorithm 2 will only be called $n_h \cdot F$ times in total (thanks to the bootstrap-storage mechanism described in Algorithm 3). Also note that in spite of its name, the whole sequential bootstrap algorithm is intended to run only once (and not sequentially), after the reference dataset becomes available.

As a practical remark for implementation, note that Algorithm 3 necessarily returns a p-value $\ge \frac{1}{B+1}$, which is the resolution of the inner bootstrap. Now consider the case where in Algorithm 1, in more than $\alpha$ of the simulated sequential tests, there is a certain individual test whose returned value is $\frac{1}{B+1}$. In other words, the bootstrap found that under $H_0$, with probability higher than $\alpha$, a sequential test of length $\tilde{h}$ will encounter the most extreme possible result of Algorithm 3 at least once. In that case Algorithm 5 would not be able to distinguish any degradation from $H_0$: we would have the individual test threshold set to $\alpha_0 = \mathrm{quantile}_\alpha(P) = \frac{1}{B+1}$, which can never be overcome. For this reason, Algorithm 5 makes sure to check whether $\alpha_0 = \frac{1}{B+1}$. Possible solutions in this situation are an increase of $B$ for better resolution, or a reduction of the required significance level through either $\tilde{h}$ or $\alpha$.

Multiple test-statistics:
In every iteration, Algorithm 5 calculates the test-statistic for multiple lookback horizons, where Algorithm 1 is responsible for controlling the family-wise type-I error rate through the test-thresholds. In a similar manner, Algorithm 5 can be generalized to run multiple test-statistics in parallel: simply iterate over the statistics just as over the lookback-horizons. Heterogeneous test-statistics should provide more robustness with respect to the alternative hypothesis, since each statistic is typically affected differently by each alternative hypothesis. This comes at the cost of reduced sensitivity of each statistic, expressed through the decrease of the test-thresholds by Algorithm 1.
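Under simplifying assumptions (a single lookback horizon $h$, one test-point per episode, and the episode-mean as the test-statistic; all function and parameter names are ours), the tuning loop of Algorithm 1 can be sketched as:

```python
import numpy as np

def tune_alpha0(ref_episodes, h, h_tilde, alpha=0.05, B_tilde=300, B=1000, seed=0):
    """Sketch of the BFAR tuning loop: simulate B_tilde sequential tests
    of h_tilde episodes under H0 via bootstrap-sampling from the reference
    data, record each run's minimal p-value, and return the alpha-quantile
    of those minima as the individual-test significance level alpha0."""
    rng = np.random.default_rng(seed)
    N = len(ref_episodes)

    def stat(block):                   # mean reward over a lookback window
        return float(np.mean(block))

    # null distribution of the statistic, estimated once by the inner bootstrap
    null = np.sort([stat(ref_episodes[rng.integers(0, N, h)]) for _ in range(B)])

    min_pvals = np.empty(B_tilde)
    for b in range(B_tilde):
        run = ref_episodes[rng.integers(0, N, h_tilde + h)]  # simulated H0 run
        pvals = []
        for t in range(h, h_tilde + h):  # one individual test per episode
            s = stat(run[t - h:t])
            # left-tail p-value vs. the bootstrap null; resolution is 1/(B+1)
            pvals.append((np.searchsorted(null, s, side='right') + 1) / (B + 1))
        min_pvals[b] = min(pvals)
    return float(np.quantile(min_pvals, alpha))
```

The returned $\alpha_0$ is smaller than $\alpha$ exactly because the minimum over many (correlated) individual p-values is taken; rejecting whenever an individual p-value falls below $\alpha_0$ then keeps the simulated family-wise false-alarm rate at roughly $\alpha$ per $\tilde{h}$ episodes.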
I  LIKELIHOOD-RATIO TEST FOR DRIFT IN EPISODIC SIGNAL: FORMAL DEVELOPMENT
In this section we look for an optimal hypothesis test for the detection of a negative drift in a multivariate normal episodic signal (see Definitions G.2, G.3). The corresponding hypotheses are an episodic signal with known parameters ($H_0$), and an episodic signal with an identical covariance matrix but smaller expected values ($H_A$), as defined below. By "optimal test" we mean that given the test's significance level (i.e., type-I error rate), it should provide the maximum possible power (i.e., minimum type-II error rate) with respect to $H_A$. To that end, we calculate the log-likelihood-ratio and use it (up to a monotone transformation) as a test-statistic, according to the Neyman-Pearson Lemma (Neyman & Pearson, 1933).

In Section I.1.1, after proving optimality for a certain negative drift, we drop the multivariate-normality assumption and analyze the asymptotic power of the suggested statistical test. In particular, we show that it is asymptotically superior to a simple threshold-test on the average reward. Note that in the scope of this section we assume an individual test at a certain point of time. Adjustment of the significance level to sequential tests is handled in Section 5. Formally, the test is defined with respect to some real-valued random variables $X_1, \ldots, X_n$.

Definition I.1 (Null hypothesis). Let $\{X_t\}_{t=1}^n$ be real-valued random variables, and let $T \in \mathbb{N}$, $\mu_0 \in \mathbb{R}^T$, $\Sigma_0 \in \mathbb{R}^{T\times T}$. The null hypothesis $H_0(T, \mu_0, \Sigma_0)$, in the scope of this section, is that $\{X_t\}_{t=1}^n$ form a $T$-long episodic signal (Definition G.2), with known parameters $T, \mu_0, \Sigma_0$. For simplicity, we further assume that $\Sigma_0$ is of full rank (i.e., invertible).

We define a standard setup for most of the analysis below, both with and without the multivariate-normality assumption.
Definition I.2 (The standard setup). In the standard setup, we denote by $X = \{X_t\}_{t=1}^n$ a $T$-long episodic signal for some $n, T \in \mathbb{N}$ (Definition G.2), and let the null hypothesis $H_0$ be as in Definition I.1, with known parameters $\mu_0 \in \mathbb{R}^T$, $\Sigma_0 \in \mathbb{R}^{T\times T}$. Note that under $H_0$, the complete signal's expectations $\mu \in \mathbb{R}^n$ and covariance matrix $\Sigma \in \mathbb{R}^{n\times n}$ are also known through Proposition G.1. We also denote $k(t) = k(t,T)$, $\tau(t) = \tau(t,T)$ as in Definition G.1, and in particular $K := k(n,T)$, $\tau_1 := \tau(n,T)$.

Definition I.3 (The standard normal setup). The standard normal setup is the standard setup where $X$ is a multivariate-normal episodic signal (Definition G.3).

I.1  UNIFORM DEGRADATION TEST
The general alternative hypothesis we use assumes conservation of the correlation structure of $H_0$, along with a decrease in the expectations.

Definition I.4 (General degradation hypothesis). Given the standard setup (Definition I.2), let $E \subseteq \mathbb{R}^T$ s.t. $\forall \boldsymbol{\epsilon}_0 \in E,\ 1 \le t \le T: (\boldsymbol{\epsilon}_0)_t \ge 0$. According to the $E$-degradation hypothesis, denoted $H_A(E)$, there exists $\boldsymbol{\epsilon}_0 \in E$ such that $\{X_t\}_{t=1}^n$ form a $\tilde{T}$-long episodic signal with the parameters $\tilde{T} = T$, $\tilde\Sigma_0 = \Sigma_0$ and $\tilde\mu_0 = \mu_0 - \boldsymbol{\epsilon}_0$. In particular, according to Eq. (4), the covariance and the mean of the whole signal under $H_A(E)$ are $\tilde\Sigma = \Sigma$ and $\tilde\mu = \mu - \boldsymbol{\epsilon}$, where $\boldsymbol{\epsilon} = \boldsymbol{\epsilon}(\boldsymbol{\epsilon}_0) \in \mathbb{R}^n$ is a cyclic completion defined by $\forall t = kT+\tau: (\boldsymbol{\epsilon})_t := (\boldsymbol{\epsilon}_0)_\tau$.

Proposition J.1 calculates the log-likelihood-ratio with respect to the hypotheses in Definitions I.1, I.4, assuming a multivariate-normal episodic signal. Still, to derive a concrete statistical test, further assumptions must be applied to $E$. We begin with the uniform degradation assumption, corresponding to a disturbance source that affects the whole signal uniformly. For example, in the context of Reinforcement Learning, such a model may refer to changes in constant costs or action costs, as well as certain environment dynamics whose change influences the various states in a similar way.

Definition I.5 (Uniform degradation hypothesis). Let $\epsilon_0 > 0$. The uniform degradation hypothesis, denoted $H_A^{unif}(\epsilon_0)$, is a degradation hypothesis $H_A(E)$ with $E := \{\epsilon\cdot\mathbf{1} \mid \epsilon \ge \epsilon_0\}$, where $\mathbf{1} := (1, \ldots, 1)^\top \in \mathbb{R}^T$.

Fig. 2 demonstrates the empirical degradation in the rewards of a trained agent in the HalfCheetah environment, following changes in gravity, mass, and control-cost (see Table 2 for details). It seems that some modifications indeed cause a fairly uniform degradation, while in others the degradation is mostly restricted to certain ranges of time. This may be important, in particular if the non-degraded time-steps happen to be assigned large weights by the test, as demonstrated in Section 6.2. In Section I.2 we suggest an alternative model, whose corresponding test is shown in Section 6.2 to be more robust to such non-uniform degradation.

We now show that an optimal hypothesis test for the detection of uniform degradation in a multivariate normal episodic signal is a threshold-test on a weighted mean of the signal, where the weights are derived from the inverted covariance matrix. Note that according to Proposition G.1, the covariance matrix $\Sigma = \Sigma(\Sigma_0) \in \mathbb{R}^{n\times n}$ of the full signal is block-diagonal, with the blocks being $\Sigma_0 \in \mathbb{R}^{T\times T}$ (or an upper-left block of $\Sigma_0$). Hence, the inverse of $\Sigma$ is given directly by inverting $\Sigma_0$ (and possibly one upper-left block of $\Sigma_0$).

Definition I.6 (Uniform degradation weighted-mean). Given the standard setup (Definition I.2), the uniform-degradation weighted-mean of $X$ is $s_{unif}(X \mid \Sigma) := W \cdot X$, where $W := \mathbf{1}^\top\cdot\Sigma^{-1} \in \mathbb{R}^n$. Note that the first $KT$ elements of $W$ are $T$-periodic, with $\forall 0 \le k \le K-1: (w_{kT+1}, \ldots, w_{kT+T}) = \mathbf{1}^\top\cdot\Sigma_0^{-1} \in \mathbb{R}^T$. We define accordingly $W_0 := \mathbf{1}^\top\cdot\Sigma_0^{-1}$, and for any $1 \le \tau \le T$, $W_\tau := \mathbf{1}^\top\cdot\Sigma_{0,\tau}^{-1}$, where $\Sigma_{0,\tau}$ is the upper-left $\tau\times\tau$ block of $\Sigma_0$; in particular, $(w_{KT+1}, \ldots, w_{KT+\tau_1})^\top = W_{\tau_1}$.

Proposition J.3 shows the consistency of the uniform-degradation weighted-mean, and Theorem I.1 shows that it derives an optimal hypothesis test for uniform degradation.
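A minimal numpy sketch of $s_{unif}$ for an integer number of episodes (the function name is ours):

```python
import numpy as np

def uniform_degradation_stat(x, Sigma0):
    """Uniform-degradation weighted mean s_unif = W . x (Definition I.6),
    sketched for n = K*T: the weights are T-periodic, so each episode is
    weighted by W0 = 1^T Sigma0^{-1}."""
    T = Sigma0.shape[0]
    assert len(x) % T == 0, "sketch assumes n is a multiple of T"
    # Sigma0 is symmetric, so solve(Sigma0, 1) gives W0 as a column vector
    W0 = np.linalg.solve(Sigma0, np.ones(T))
    return float((x.reshape(-1, T) @ W0).sum())
```

In the i.i.d. case ($\Sigma_0 = \sigma^2 I$), the weights are uniform and $s_{unif}$ reduces (up to a constant factor) to the simple sum of the signal.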
Definition I.7 (Threshold test). Assume the standard setup (Definition I.2), and let $S: \mathbb{R}^n \to \mathbb{R}$ (statistic), $\kappa \in \mathbb{R}$ (threshold) and $\rho \in [0,1]$ (edge-case probability). The corresponding $\kappa$-threshold-test is defined as follows: given the observations $\forall 1 \le t \le n: X_t = x_t \in \mathbb{R}$, calculate the statistic $s = S(x_1, \ldots, x_n)$. If $s < \kappa$, reject $H_0$. If $s = \kappa$, reject $H_0$ with probability $\rho$ (note that this is only relevant for non-continuous distributions, where $P(S = \kappa) \ne 0$). If $s > \kappa$, do not reject $H_0$. We denote the significance level of the test $\alpha := P(\text{reject } H_0 \mid H_0)$. For simplicity, in the discussion below we often omit $\rho$, implicitly assuming a continuous distribution of the signal.

Theorem I.1 (Optimal test for uniform degradation; an extended formulation of Theorem 4.1). Assume the standard normal setup (Definition I.3) with $H_A^{unif}(\epsilon_0)$ of Definition I.5 as the alternative hypothesis, and let $\alpha \in (0,1)$. Then, there exists $\kappa \in \mathbb{R}$ such that a $\kappa$-threshold-test on the uniform-degradation weighted-mean statistic has the greatest power among all the statistical tests with significance level $\tilde\alpha \le \alpha$.

Proof. The proof is available in Appendix J. Roughly speaking, according to the Neyman-Pearson Lemma (Neyman & Pearson, 1933) a threshold-test on the likelihood-ratio is optimal, hence it is sufficient to show that the uniform-degradation weighted-mean $s_{unif}$ is monotone with the likelihood-ratio. Note that the likelihood-ratio is taken with respect to a complex hypothesis $H_A^{unif}(\epsilon_0)$ that has a degree of freedom $\epsilon \in [\epsilon_0, \infty)$, where the optimizing $\epsilon$ depends on $X$. Some algebraic work is required to show that this $\epsilon$ depends on $X$ only through $s_{unif}$, and that the whole likelihood-ratio is monotone with $s_{unif}$.

Algorithm 3 describes the threshold-test in the non-sequential framework. The uniform-degradation test-statistic (or any other function) can be fed into the algorithm as an input. As can be seen, the rejection threshold $\kappa_\alpha \in \mathbb{R}$ is chosen according to the desired type-I error rate $\alpha \in (0,1)$, using the bootstrap mechanism described in Algorithm 2. $B$ bootstrap-samples are sampled from a reference dataset of $N$ episodes of the signal, assumed to follow the null hypothesis $H_0$ of Definition I.1. For each bootstrap-sample the test-statistic is calculated, yielding a bootstrap-estimate of the distribution of the statistic under $H_0$. The rejection threshold $\kappa_\alpha$ is set to be the $\alpha$-quantile of the estimated distribution. If the estimated distribution is close to the true distribution, then we have $P(s \le \kappa_\alpha \mid H_0) \approx P(s \le q_\alpha(s \mid H_0) \mid H_0) = \alpha$, where $q_\alpha(s \mid H_0)$ is the $\alpha$-quantile of $s$ under $H_0$.

I.1.1  ASYMPTOTIC ANALYSIS IN ABSENCE OF THE NORMALITY ASSUMPTION
The optimality of the uniform-degradation weighted-mean test (proved in Theorem I.1) relies on the assumption that the episodic signal is multivariate normal. In this section we show that even in the absence of the normality assumption, the test is asymptotically superior to a standard threshold-test on the average of the signal (though it is not necessarily the optimal test anymore). Since the episodes in the signal are still assumed to be i.i.d., both the simple mean and the uniform-degradation weighted-mean $s_{unif}$ are asymptotically normal (as $n \to \infty$ with respect to a constant episode length $T \in \mathbb{N}$). For simplicity of the asymptotic analysis below, we focus on an integer number of episodes, i.e., $n = KT$ and $K \to \infty$ (rather than $n \to \infty$). We also define normalized variants of our statistics, with zero mean and unit variance per episode:
\[
s_{simp}(\{X_t\}_{t=1}^n) := \sum_{t=1}^n X_t
\]
\[
\tilde{s}^K_{simp} := \frac{s_{simp}(\{X_t\}_{t=1}^{KT}) - K\cdot E\left[s_{simp}(\{X_t\}_{t=1}^T) \mid H_0\right]}{\sqrt{K\cdot \mathrm{Var}\left(s_{simp}(\{X_t\}_{t=1}^T) \mid H_0\right)}}
\qquad
\tilde{s}^K_{unif} := \frac{s_{unif}(\{X_t\}_{t=1}^{KT}) - K\cdot E\left[s_{unif}(\{X_t\}_{t=1}^T) \mid H_0\right]}{\sqrt{K\cdot \mathrm{Var}\left(s_{unif}(\{X_t\}_{t=1}^T) \mid H_0\right)}} \quad (6)
\]
Note that Algorithm 3 is invariant to a linear transformation of the statistic, since the test-statistic and the reference bootstrap distribution pass through the same transformation. Hence, the tests on $s_{simp}, s_{unif}$ are equivalent to the tests on $\tilde{s}_{simp}, \tilde{s}_{unif}$, respectively. As a terminological note, this sampling mechanism can be considered a bootstrap in the sense of distribution estimation from a single dataset using sampling with replacement; or it can merely be considered a Monte-Carlo simulation, in the sense that the test signal is compared to a distribution estimated by an external simulative source (the reference data).
Since $\tilde{s}_{simp}, \tilde{s}_{unif}$ are asymptotically normal with zero mean and unit variance under $H_0$, the desired test threshold for sufficiently large $K$ is $\kappa \approx q_\alpha$, where $q_\alpha$ is the $\alpha$-quantile of the standard normal distribution. This threshold should be indirectly estimated by Algorithm 2. Note that the sequential test of Algorithm 5 in Section 5 applies the individual tests of Algorithm 3 to a constant number of episodes (defined by the lookback horizon $h$). Hence, in the context of the sequential tests suggested in this work, the asymptotic analysis in this section refers to a very long lookback horizon, rather than a very long running time. Regardless, as the analysis refers to a varying $n$, we need to generalize the standard setup (which assumes a constant signal length $n$).

Definition I.8 (The rolling setup). Let $\{X_t\}_{t\in\mathbb{N}}$ be an infinite sequence of real-valued random variables. In the rolling setup, for any $n \in \mathbb{N}$ we assume the standard setup of Definition I.2 with relation to the variables $\{X_t\}_{t=1}^n$ and the parameters $T, \mu_0, \Sigma_0$ (which are independent of $n$).

We first show that the test threshold $\kappa = q_\alpha$ indeed yields an asymptotic significance level of $\alpha$, and guarantees asymptotic rejection of $H_0$ for a uniform degradation of any size $\epsilon > 0$. Note that Algorithm 3 does not pick $q_\alpha$ directly as a threshold, but should estimate it indirectly through Algorithm 2.

Proposition I.1 (Uniform degradation test consistency). Given the rolling setup, we define the alternative hypothesis $H_A^\epsilon$ to be that $\forall K \in \mathbb{N}$, the parameters of the signal $\{X_t\}_{t=1}^{KT}$ are $\mu_0 - \epsilon\cdot\mathbf{1}, \Sigma_0$ (i.e., $H_A(\{\epsilon\cdot\mathbf{1}\})$ in terms of Definition I.4). Given a significance level $\alpha \in (0,1)$, we have
\[
\lim_{K\to\infty} P\left(s \le q_\alpha \mid H_0\right) = \alpha, \qquad \lim_{K\to\infty} P\left(s \le q_\alpha \mid H_A^\epsilon\right) = 1
\]
for both $s = \tilde{s}^K_{simp}$ and $s = \tilde{s}^K_{unif}$ of Eq. (6), where $q_\alpha$ is the $\alpha$-quantile of the standard normal distribution.

Proof.
The proof, fully provided in Appendix J, applies the Central Limit Theorem (Petrov, 1972; Irwin, 2006) to the i.i.d. episodes.

Theorem I.2 quantifies the asymptotic power of the threshold test for both the simple mean and the uniform-degradation weighted-mean. To that end, we consider a uniform degradation scaled as $\epsilon \propto 1/\sqrt{K}$. We also denote by $\Phi$ the Cumulative Distribution Function of the standard normal distribution.

Theorem I.2 (Uniform degradation test asymptotic power; an extended formulation of Theorem 4.2). Given the rolling setup, we define the alternative hypothesis $H_A^{\epsilon,K}$ to be that $\forall K \in \mathbb{N}$, the parameters of the signal $\{X_t\}_{t=1}^{KT}$ are $\mu_0 - \frac{\epsilon}{\sqrt{K}}\cdot\mathbf{1}, \Sigma_0$ (i.e., $H_A(\{\frac{\epsilon}{\sqrt{K}}\cdot\mathbf{1}\})$ of Definition I.4). Given a significance level $\alpha \in (0,1)$, we have
\[
\lim_{K\to\infty} P\left(\tilde{s}^K_{simp} \le q_\alpha \mid H_A^{\epsilon,K}\right) = \Phi\left(q_\alpha + \frac{\epsilon T}{\sqrt{\mathbf{1}^\top\Sigma_0\mathbf{1}}}\right) \le \Phi\left(q_\alpha + \epsilon\sqrt{\mathbf{1}^\top\Sigma_0^{-1}\mathbf{1}}\right) = \lim_{K\to\infty} P\left(\tilde{s}^K_{unif} \le q_\alpha \mid H_A^{\epsilon,K}\right)
\]
Proof. Similarly to Proposition I.1, the proof applies the Central Limit Theorem to the i.i.d. episodes to calculate the asymptotic properties. Full details are provided in Appendix J.

Note that while Theorem I.1 shows the optimality of the uniform-degradation weighted-mean test for a multivariate-normal episodic signal, Theorem I.2 proves that even in the absence of normality, the test is asymptotically superior to a threshold-test on the simple mean. Finally, we quantify the difference in power between the tests.
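The two detection margins of Theorem I.2 differ by the factor quantified in Definition I.9; as a small numpy sketch (the function name is ours), the gain can be computed directly from $\Sigma_0$:

```python
import numpy as np

def power_gain(Sigma0):
    """Asymptotic power gain of the weighted-mean test over the simple
    mean: G = sqrt((1^T Sigma0^{-1} 1)(1^T Sigma0 1)) / T.
    By Cauchy-Schwarz G >= 1 always; e.g., G > 1 for heterogeneous
    diagonal covariances."""
    T = Sigma0.shape[0]
    ones = np.ones(T)
    quad = ones @ Sigma0 @ ones                      # 1^T Sigma0 1
    inv_quad = ones @ np.linalg.solve(Sigma0, ones)  # 1^T Sigma0^{-1} 1
    return float(np.sqrt(quad * inv_quad) / T)
```

For $\Sigma_0 = \sigma^2 I$ this gives $G = 1$ (the two tests coincide), while e.g. $\Sigma_0 = \mathrm{diag}(1, 4)$ gives $G = 1.25$.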
Definition I.9 (Uniform degradation asymptotic power gain). Given the rolling setup, the uniform-degradation power gain is defined to be
\[
G := \frac{\sqrt{(\mathbf{1}^\top\Sigma_0^{-1}\mathbf{1})(\mathbf{1}^\top\Sigma_0\mathbf{1})}}{T}.
\]
Note that according to Theorem I.2, if the asymptotic power of the simple-mean threshold-test with relation to the alternative hypothesis $H_A^{\epsilon,K}$ is $\Phi(q_\alpha + y)$ (where $y \in \mathbb{R}$), then the asymptotic power of the weighted-mean threshold-test is $\Phi(q_\alpha + G\cdot y)$.

Proposition I.2 (Uniform degradation test asymptotic power gain). Under the setup of Theorem I.2, there exist non-negative weights $\{w_{ij}\}_{i,j=1}^T$ such that the uniform-degradation power gain satisfies
\[
G^2 = 1 + \sum_{i,j=1}^T w_{ij}(\lambda_i - \lambda_j)^2
\]
where $\{\lambda_i\}_{i=1}^T$ are the eigenvalues of $\Sigma_0$.

Proof. The result follows from simple algebra after diagonalizing the symmetric positive-definite covariance matrix $\Sigma_0$. The full details are available in Appendix J.

Clearly, the asymptotic power gain $G$ becomes larger as the covariance matrix $\Sigma_0$ introduces more heterogeneous eigenvalues. Note that in the independent case, the eigenvalues are simply the variances of the different time-steps. In particular, in the i.i.d. case, the variances are identical and the gain becomes $G = 1$, which is consistent with the fact that the two tests are equivalent in this case.

I.2  PARTIAL DEGRADATION TEST
Definition I.5 assumes uniform degradation over all the time-steps in every episode. However, the effects of many environmental changes may focus on certain states (which is translated in our model-free setup into "certain time-steps"). An example is available in Fig. 2, as discussed before. To model such effects we introduce the partial degradation hypothesis.

Definition I.10 (Partial degradation hypothesis). Let $\epsilon > 0$, $p \in (0,1]$. Define $A^T_m := \{a \in \{0,1\}^T \mid \sum_{t=1}^T (a)_t = m\}$, the set of binary vectors with exactly $m$ one-entries. The partial degradation hypothesis, denoted $H_A^{part}(\epsilon, p)$, is a degradation hypothesis $H_A(E)$ (see Definition I.4) with $E := \{\epsilon\cdot a \mid a \in A^T_{\lceil pT\rceil}\}$.

Interpretation:
As a special case of Definition I.4, Definition I.10 assumes conservation of the correlation structure. One possible interpretation of this assumption is causal relationships (where a change in a certain time-step affects any other time-steps correlated with it). Another possible interpretation is that the partial degradation hypothesis does not restrict the degradation to only $m = \lceil pT\rceil$ time-steps, but rather distributes the degradation from $m$ time-steps to the whole episode, according to the same relations that created the correlations in the signal in the first place.

Similarly to the case of uniform degradation, we can use the likelihood-ratio to derive a test-statistic and prove its approximate optimality with respect to $H_A^{part}(\epsilon, p)$, under certain smoothness assumptions.

Definition I.11 (Partial degradation mean). Given the standard setup (Definition I.2), denote $\tilde{X} := X - \mu$, $\tilde{s} := \Sigma^{-1}\tilde{X} \in \mathbb{R}^n$ and $\forall 1 \le \tau \le T: \tilde{S}_\tau := \sum_{k=0}^{\lfloor (n-\tau)/T\rfloor} \tilde{s}_{kT+\tau}$. Given $a \in \{0,1\}^T$, denote $o(a) := \{1 \le t \le T \mid (a)_t = 1\}$. Given $p \in (0,1]$, the $p$-partial-degradation mean of $X$ is $s_{part}(X; p) := \min_{a\in A^T_m} \sum_{\tau\in o(a)} \tilde{S}_\tau$, where $m(p) = \lceil pT\rceil$ and $A^T_m$ is defined as in Definition I.10. Note that while $a$ has $|A^T_m| = \binom{T}{m}$ possible values, to compute $s_{part}$ we only need to sum the lowest $m$ values in $\{\tilde{S}_\tau\}_{\tau=1}^T$.

Theorem I.3 (Optimal test for partial degradation). Consider the standard normal setup (Definition I.3) with $H_A^{part}(\epsilon, p)$ of Definition I.10 as the alternative hypothesis, and let $\alpha \in (0,1)$. Denote by $P_\alpha$ the largest possible power (with respect to $H_A^{part}(\epsilon, p)$) of a statistical test with significance level $\le \alpha$. Then, there exists $\kappa \in \mathbb{R}$ such that the power of the $\kappa$-threshold-test on the partial-degradation mean is $P_\alpha - O(\epsilon)$ (where $O(\epsilon)$ is defined with relation to $\epsilon \to 0$).

Proof. The proof is provided in Appendix J. Similarly to the proof of Theorem I.1, it is based on the calculation of the log-likelihood-ratio $\lambda_{LR}$ from Lemma 1, after substituting Definition I.10. The calculation results in $s_{part}$ along with an $\epsilon$-dependent term whose effect on the test power is shown to be $O(\epsilon)$.

The parameter $p$ and comparison to CVaR: In Definition I.11, the "weighted mean" completely eliminates $T - m$ of the entries of $\Sigma^{-1}\tilde{X}$, and only sums the most negative ones. Hence it can be interpreted as the well-known CVaR statistic (Rockafellar & Uryasev, 2000) of $X - \mu$ after transformation to the $\Sigma^{-1}$ basis. CVaR (Conditional Value at Risk) is intended to measure the "risky" tail of a random variable's distribution (Chen, 2020) by estimating its expectation, conditioned on it being below the $p$-quantile of the distribution. This is done simply by averaging the fraction $p$ of "worst" (lowest) values in the corresponding data. To express risk, the parameter $p$ is often set below 5% (MathWorks). In our context, however, the relative part of time-steps $p$ represents the scope of the degradation rather than the extremity of risk, and there is usually no reason to assume that $p \ll 1$. Such an assumption, when misplaced, may cause the elimination of necessary information from the statistic. In fact, intermediate values of $p$ (below 1, but not too small) are shown in Section 6 to often achieve superior results, presumably because they maintain most of the information while still being able to filter noisy or misleading time-steps (e.g., time-steps with particularly large values).

Note that filtering positive time-steps is not "cheating" in the context of our problem: we essentially apply a monitor which looks for negative changes, and thus positive changes in other time-steps are indeed considered noise for the sake of our monitor. If a more symmetric comparison is desired, then the test can be applied twice: once for negative changes, and once for positive changes.
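For the simplified case of a single full episode ($n = T$), where $\tilde{S}_\tau$ reduces to the $\tau$-th entry of $\Sigma^{-1}(x - \mu)$, the statistic can be sketched as follows (the function name is ours):

```python
import numpy as np

def partial_degradation_stat(x, mu, Sigma, p):
    """Partial-degradation mean (Definition I.11) for one full episode:
    transform the centered signal by Sigma^{-1}, then sum its
    m = ceil(p*T) lowest (most degraded) entries -- a CVaR-like
    statistic in the Sigma^{-1} basis."""
    T = len(x)
    s_tilde = np.linalg.solve(Sigma, x - mu)   # Sigma^{-1} (x - mu)
    m = int(np.ceil(p * T))
    return float(np.sort(s_tilde)[:m].sum())   # sum of the lowest m entries
```

Note that the minimization over $\binom{T}{m}$ binary vectors collapses to a partial sort, which is what makes the statistic cheap to evaluate on every test-point.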
The dependence on $\epsilon$ and the $O(\epsilon)$ approximation: The partial degradation mean is shown to be equal to an optimal test-statistic up to $O(\epsilon)$. The approximation is used to handle the dependence of the minimum
\[
\min_{a\in A^T_m}\left(a^\top\Sigma^{-1}\tilde{X} + 0.5\,\epsilon\,(a^\top\Sigma^{-1}a)\right)
\]
on $\epsilon$. Note that for a small degradation, the second term is indeed negligible, while for a larger degradation, the distinction between the two hypotheses should pose little challenge to any detection algorithm. Furthermore, the test is entirely invariant to a constant additive factor (due to the adjustment of the test threshold using the bootstrap in Algorithm 2); hence, the true distortion in the test is not of size $\gamma = 0.5\,\epsilon\,(a^\top\Sigma^{-1}a)$, but rather the change in $\gamma$ due to the possibly-changed choice of $a$. Note that (a) since $\Sigma^{-1}$ is positive-definite, we have $\forall a: a^\top\Sigma^{-1}a > 0$, hence the change in $\gamma$ is necessarily smaller than $\gamma$; (b) if the parameter $p$ is close to 100% (as discussed above), then most of the entries of $a$ are necessarily kept unchanged, further reducing the change in $\gamma$.

If we wish to apply a more formal test, we can define for example $H_A^{part}(p) := \exists\epsilon > 0: H_A^{part}(\epsilon, p)$ (similarly to Definition I.5 of uniform degradation, with $\epsilon_0 = 0$), which yields the log-likelihood-ratio
\[
\min_{a\in A^T_m} -\frac{(a^\top\Sigma^{-1}\tilde{X})^2}{a^\top\Sigma^{-1}a} \quad \text{s.t.}\quad a^\top\Sigma^{-1}\tilde{X} \le 0
\]
(after minimization with relation to $\epsilon > 0$, similarly to the proof of Theorem I.1). Due to the discrete domain $A^T_m$ of $a$, this becomes a non-linear integer programming problem, which should be solved for every run of the test on new data $X$.

Note that from a practical point of view, a major role of the partial-degradation model is to allow focusing on negative (degrading) entries of $\Sigma^{-1}\tilde{X}$, and filtering positive ones (as discussed above). For this role, both the approximate $s_{part}$ and the accurate minimizer of $H_A^{part}(p) = \exists\epsilon > 0: H_A^{part}(\epsilon, p)$ above are perfectly qualified, as both would tend to reject positive entries of $\Sigma^{-1}\tilde{X}$. In the scope of this work, we stick to the approximately-correct and computationally-simpler partial degradation mean $s_{part}$.

J  SUPPLEMENTARY CALCULATIONS
Proposition J.1 (Likelihood ratio with respect to general degradation). Assume the standard normal setup (Definition I.3) with the $E$-degradation hypothesis $H_A(E)$ (Definition I.4). Define $\boldsymbol{\epsilon}(\boldsymbol{\epsilon}_0)$ as in Definition I.4, and denote $\tilde{X}_t := X_t - (\mu)_t$. Then, the log-likelihood-ratio
\[
\lambda_{LR}(H_0, H_A \mid \{X_t\}_{t=1}^n) := 2\ln\left(\frac{P(\{X_t\}_{t=1}^n \mid H_0)}{\sup_{H\in H_A} P(\{X_t\}_{t=1}^n \mid H)}\right)
\]
of $\{X_t\}_{t=1}^n$ with respect to (the simple hypothesis) $H_0$ and (the complex hypothesis) $H_A$ is
\[
\lambda_{LR}(H_0, H_A \mid \{X_t\}_{t=1}^n) = \min_{\boldsymbol{\epsilon}_0\in E}\left[2(\boldsymbol{\epsilon}(\boldsymbol{\epsilon}_0))^\top\Sigma^{-1}\tilde{X} + (\boldsymbol{\epsilon}(\boldsymbol{\epsilon}_0))^\top\Sigma^{-1}\boldsymbol{\epsilon}(\boldsymbol{\epsilon}_0)\right] \quad (7)
\]
Proof.
Using the density function of Eq. (5), we have (denoting $\boldsymbol{\epsilon} = \boldsymbol{\epsilon}(\boldsymbol{\epsilon}_0)$):
\[
\lambda_{LR}(H_0, H_A \mid \{X_t\}_{t=1}^n) = 2\ln\left(\frac{e^{-\frac{1}{2}\tilde{X}^\top\Sigma^{-1}\tilde{X}}}{\max_{\boldsymbol{\epsilon}_0\in E}\, e^{-\frac{1}{2}(\tilde{X}+\boldsymbol{\epsilon})^\top\Sigma^{-1}(\tilde{X}+\boldsymbol{\epsilon})}}\right)
= \min_{\boldsymbol{\epsilon}_0\in E}\left[(\tilde{X}+\boldsymbol{\epsilon})^\top\Sigma^{-1}(\tilde{X}+\boldsymbol{\epsilon}) - \tilde{X}^\top\Sigma^{-1}\tilde{X}\right]
\]
\[
= \min_{\boldsymbol{\epsilon}_0\in E}\left[\boldsymbol{\epsilon}^\top\Sigma^{-1}\tilde{X} + \tilde{X}^\top\Sigma^{-1}\boldsymbol{\epsilon} + \boldsymbol{\epsilon}^\top\Sigma^{-1}\boldsymbol{\epsilon}\right]
= \min_{\boldsymbol{\epsilon}_0\in E}\left[2\boldsymbol{\epsilon}^\top\Sigma^{-1}\tilde{X} + \boldsymbol{\epsilon}^\top\Sigma^{-1}\boldsymbol{\epsilon}\right]
\]
where the last equality relies on the invariance of the scalar $\tilde{X}^\top\Sigma^{-1}\boldsymbol{\epsilon} \in \mathbb{R}$ to the transpose operation, as well as the symmetry of the covariance matrix (and its inverse).

Proposition J.2 (Expected value of uniform-degradation weighted-mean). Let $X$ be a $T$-long episodic signal of length $n = KT$ for some $K \in \mathbb{N}$ (i.e., an integer number of episodes), with parameters $\mu_0, \Sigma_0$. The expected value of $\frac{1}{n}s_{unif}(X \mid \Sigma)$ defined in Definition I.6 is $\frac{1}{T}(W_0\cdot\mu_0)$.

Proof. Using the $T$-periodicity of $W$ and $\mu$ (see Eq. (4)), we have
\[
E\left[\tfrac{1}{n}s_{unif}\right] = E\left[\tfrac{1}{n}W\cdot X\right] = \tfrac{1}{n}W\cdot\mu = \tfrac{1}{n}K\cdot(W_0\cdot\mu_0) = \tfrac{1}{T}(W_0\cdot\mu_0).
\]

Proposition J.3 (Consistency of uniform-degradation weighted-mean). $\frac{1}{n}s_{unif}$ defined in Definition I.6 is consistent with relation to the expected value $\frac{1}{T}(W_0\cdot\mu_0)$ calculated in Proposition J.2, i.e., $\forall\epsilon > 0: \lim_{n\to\infty} P\left(\left|\frac{1}{n}s_{unif} - \frac{1}{T}(W_0\cdot\mu_0)\right| \ge \epsilon\right) = 0$.

Proof.
The consistency is proven through the Law of Large Numbers over the i.i.d. episodes, where the last, possibly-partial episode becomes negligible in the limit of infinitely many episodes. Using the episodic index decomposition of Definition G.1, and the notations of $W_0, W_\tau$ from Definition I.6, we can write
\[
\frac{1}{n}s_{unif} - \frac{1}{T}(W_0\cdot\mu_0) = \frac{1}{n}WX - \frac{1}{T}(W_0\cdot\mu_0) = \frac{1}{n}\sum_{t=1}^n w_tX_t - \frac{1}{T}(W_0\cdot\mu_0)
\]
\[
= \left[\frac{1}{n}\sum_{k=0}^{K-1}\sum_{\tau=1}^T (W_0)_\tau X_{kT+\tau} - \frac{KT}{nT}W_0\mu_0\right] + \left[\frac{1}{n}\sum_{\tau=1}^{\tau_1}(W_{\tau_1})_\tau X_{KT+\tau} - \frac{\tau_1}{nT}W_0\mu_0\right]
\]
\[
= \frac{1}{n}\sum_{k=0}^{K-1}\sum_{\tau=1}^T (W_0)_\tau\left(X_{kT+\tau} - (\mu_0)_\tau\right) + \frac{1}{n}\sum_{\tau=1}^{\tau_1}(W_{\tau_1})_\tau\left(X_{KT+\tau} - (\mu_0)_\tau\right) = \frac{1}{n}\sum_{k=0}^{K-1}S_k + \frac{1}{n}S^{tail}_{K,\tau_1}
\]
where $\tau_1 := \tau(n,T)$, $S_k := \sum_{\tau=1}^T (W_0)_\tau(X_{kT+\tau} - (\mu_0)_\tau)$ and $S^{tail}_{K,\tau_1} := \sum_{\tau=1}^{\tau_1}(W_{\tau_1})_\tau(X_{KT+\tau} - (\mu_0)_\tau)$.

To prove consistency we have to show that $\forall\epsilon > 0: \lim_{n\to\infty} P\left(\left|\frac{1}{n}s_{unif} - \frac{1}{T}(W_0\cdot\mu_0)\right| \ge \epsilon\right) = 0$. Indeed, given $\epsilon > 0$, we have
\[
\lim_{n\to\infty} P\left(\left|\frac{1}{n}s_{unif} - \frac{1}{T}(W_0\cdot\mu_0)\right| \ge \epsilon\right) \le \lim_{n\to\infty} P\left(\left|\frac{1}{n}\sum_{k=0}^{K-1}S_k\right| \ge \epsilon/2 \;\vee\; \left|\frac{1}{n}S^{tail}_{K,\tau_1}\right| \ge \epsilon/2\right)
\]
\[
\le \lim_{n\to\infty}\left[P\left(\left|\frac{1}{n}\sum_{k=0}^{K-1}S_k\right| \ge \epsilon/2\right) + P\left(\left|\frac{1}{n}S^{tail}_{K,\tau_1}\right| \ge \epsilon/2\right)\right]
\]
where $P\left(\left|\frac{1}{n}\sum_{k=0}^{K-1}S_k\right| \ge \epsilon/2\right) \to 0$ according to the Law of Large Numbers applied to the i.i.d. sequence $\{S_k\}$; and
\[
\lim_{n\to\infty} P\left(\left|\frac{1}{n}S^{tail}_{K,\tau_1}\right| \ge \epsilon/2\right) \le \lim_{n\to\infty} P\left(\sum_{\tau=1}^T \left|\max_{1\le\tilde\tau\le T}(W_{\tilde\tau})_\tau\right|\cdot\left|X_{KT+\tau} - (\mu_0)_\tau\right| \ge \frac{n\epsilon}{2}\right)
\]
\[
= \lim_{n\to\infty} P\left(\sum_{\tau=1}^T \left|\max_{1\le\tilde\tau\le T}(W_{\tilde\tau})_\tau\right|\cdot\left|X_\tau - (\mu_0)_\tau\right| \ge \frac{n\epsilon}{2}\right) = 0
\]
where the equality uses the identical distribution of the episodes.

Lemma 1 (Maximum likelihood with relation to the complex hypothesis of uniform degradation). Under the setup of Theorem I.1, denote $s_1 := W\cdot\mu - \epsilon_0[\mathbf{1}^\top\Sigma^{-1}\mathbf{1}]$. $\lambda_{LR}(H_0, H_A^{unif}(\epsilon_0) \mid \{X_t\}_{t=1}^n)$ is minimized by $\epsilon = \epsilon_0$ if $s_{unif} \ge s_1$, and by $\epsilon = \frac{W\cdot\mu - s_{unif}}{\mathbf{1}^\top\Sigma^{-1}\mathbf{1}}$ if $s_{unif} \le s_1$.

Proof. Applying Proposition J.1 to Definition I.4 yields
\[
\lambda_{LR}(H_0, H_A^{unif}(\epsilon_0) \mid \{X_t\}_{t=1}^n) = \min_{\epsilon\ge\epsilon_0}\left[2\epsilon[W\tilde{X}] + \epsilon^2[\mathbf{1}^\top\Sigma^{-1}\mathbf{1}]\right] = \min_{\epsilon\ge\epsilon_0} P(\epsilon) \quad (8)
\]
where $P(\epsilon)$ is a parabola with respect to $\epsilon$, with leading coefficient $\mathbf{1}^\top\Sigma^{-1}\mathbf{1} > 0$ (since the full-rank covariance matrix $\Sigma$ is necessarily positive definite) and minimum $\epsilon_{min} = -\frac{W\tilde{X}}{\mathbf{1}^\top\Sigma^{-1}\mathbf{1}} = \frac{W\mu - s_{unif}}{\mathbf{1}^\top\Sigma^{-1}\mathbf{1}}$ (remember that $\tilde{X} = X - \mu$). If $s_{unif} \le s_1$ then $\epsilon_{min} \ge \epsilon_0$, and $\min_{\epsilon\ge\epsilon_0} P(\epsilon)$ is attained at $\epsilon = \epsilon_{min}$. If $s_{unif} \ge s_1$ then $\epsilon_{min} \le \epsilon_0$ (i.e., $\epsilon_0$ is to the right of the minimum of the parabola), hence $\forall\epsilon > \epsilon_0: P(\epsilon) > P(\epsilon_0)$, and $\min_{\epsilon\ge\epsilon_0} P(\epsilon)$ is attained at $\epsilon = \epsilon_0$.

Proof of Theorem I.1 (also compactly formulated as Theorem 4.1): Given $\alpha \in (0,1)$, consider a $\tilde\kappa$-threshold-test (Definition I.7) with relation to the log-likelihood-ratio $\lambda_{LR}(H_0, H_A^{unif}(\epsilon_0) \mid \{X_t\}_{t=1}^n)$ (and with edge-case rejection-probability $\rho \in [0,1]$), such that the significance level of the test is $\alpha$.
Since $\lambda_{LR} = 2\ln(LR)$ is monotonously increasing with relation to the likelihood-ratio, then according to the Neyman-Pearson Lemma (Neyman et al., 1933) this test has the greatest power among all tests with significance level $\tilde\alpha\le\alpha$. We will show that this test is equivalent to a threshold-test on the uniform-degradation weighted-mean.

According to Lemma 1, we have
$$\lambda_{LR}(H_0, H_A^{\mathrm{unif}}(\epsilon_0)\,|\,\{X_t\}_{t=1}^n) = \min_{\epsilon\ge\epsilon_0}\; 2\epsilon\,[W\tilde X] + \epsilon^2\,[\mathbf{1}^\top\Sigma_0^{-1}\mathbf{1}]$$
$$= \begin{cases} 2\epsilon_0\,[W\tilde X] + \epsilon_0^2\,[\mathbf{1}^\top\Sigma_0^{-1}\mathbf{1}] & \text{if } s_{\mathrm{unif}}\ge s_0 \\[4pt] 2\,\frac{W\cdot\boldsymbol{\mu} - s_{\mathrm{unif}}}{\mathbf{1}^\top\Sigma_0^{-1}\mathbf{1}}\,[W\tilde X] + \left[\frac{W\cdot\boldsymbol{\mu} - s_{\mathrm{unif}}}{\mathbf{1}^\top\Sigma_0^{-1}\mathbf{1}}\right]^2[\mathbf{1}^\top\Sigma_0^{-1}\mathbf{1}] & \text{if } s_{\mathrm{unif}}\le s_0 \end{cases}$$
$$= \begin{cases} 2\epsilon_0\, s_{\mathrm{unif}} - 2\epsilon_0\, W\cdot\boldsymbol{\mu} + \epsilon_0^2\,[\mathbf{1}^\top\Sigma_0^{-1}\mathbf{1}] & \text{if } s_{\mathrm{unif}}\ge s_0 \\[4pt] -\,\frac{(s_{\mathrm{unif}} - W\cdot\boldsymbol{\mu})^2}{\mathbf{1}^\top\Sigma_0^{-1}\mathbf{1}} & \text{if } s_{\mathrm{unif}}\le s_0 \end{cases} \tag{9}$$

Clearly $\lambda_{LR}$ is strictly increasing with $s_{\mathrm{unif}}$ in $[s_0,\infty)$, where it is linear with the positive slope $2\epsilon_0$. In the case $s_{\mathrm{unif}}\le s_0$, $\lambda_{LR}$ is the parabola $P(s_{\mathrm{unif}}) = -(s_{\mathrm{unif}} - W\cdot\boldsymbol{\mu})^2$ (up to a positive multiplicative factor), whose maximum is at $s_{\max} = W\cdot\boldsymbol{\mu}$. Since in this case $s_{\mathrm{unif}}\le s_0 = W\cdot\boldsymbol{\mu} - \epsilon_0\,[\mathbf{1}^\top\Sigma_0^{-1}\mathbf{1}] \le W\cdot\boldsymbol{\mu} = s_{\max}$, then $s_{\mathrm{unif}}$ is to the left of the maximum of the parabola, and hence $\lambda_{LR}$ is strictly increasing with $s_{\mathrm{unif}}$ in $(-\infty, s_0]$ as well. Since $\lambda_{LR}$ is strictly increasing with $s_{\mathrm{unif}}$ in both $(-\infty, s_0]$ and $[s_0,\infty)$, it is strictly monotonously increasing with $s_{\mathrm{unif}}$ in $\mathbb{R}$. Hence there exists $\kappa\in\mathbb{R}$ such that $\lambda_{LR}<\tilde\kappa \Leftrightarrow s_{\mathrm{unif}}<\kappa$, and the tests are equivalent. $\square$

Lemma 2 (Properties of statistics under uniform-degradation). Let $X$ be a $T$-long episodic signal of length $n=KT$ for some $K\in\mathbb{N}$ (i.e., an integer number of episodes), with parameters $\boldsymbol{\mu} - \epsilon\cdot\mathbf{1}\in\mathbb{R}^T$, $\Sigma_0\in\mathbb{R}^{T\times T}$. Denote $s_{\mathrm{simp}} = \sum_{t=1}^n X_t$ as in Eq. (6) and $s_{\mathrm{unif}} = WX$ as in Definition I.6. Then we have:
$$\mathbb{E}[s_{\mathrm{simp}}] = K\,\mathbf{1}^\top\boldsymbol{\mu} - KT\epsilon \qquad \mathbb{E}[s_{\mathrm{unif}}] = K\,W\cdot\boldsymbol{\mu} - \epsilon K\,W\mathbf{1}$$
$$\mathrm{Var}(s_{\mathrm{simp}}) = K\,\mathbf{1}^\top\Sigma_0\mathbf{1} \qquad \mathrm{Var}(s_{\mathrm{unif}}) = K\,\mathbf{1}^\top\Sigma_0^{-1}\mathbf{1}$$

Proof.
The first three identities are straightforward:
$$\mathbb{E}[s_{\mathrm{simp}}] = \sum_{k=0}^{K-1}\sum_{\tau=1}^{T}\left((\boldsymbol{\mu})_\tau - \epsilon\right) = K\,\mathbf{1}^\top\boldsymbol{\mu} - KT\epsilon$$
$$\mathbb{E}[s_{\mathrm{unif}}] = \sum_{k=0}^{K-1} W\cdot(\boldsymbol{\mu} - \epsilon\mathbf{1}) = K\,W\cdot\boldsymbol{\mu} - \epsilon K\,W\mathbf{1}$$
$$\mathrm{Var}(s_{\mathrm{simp}}) = \sum_{i,j=1}^{n}\mathrm{Cov}(X_i,X_j) = K\sum_{i,j=1}^{T}\mathrm{Cov}(X_i,X_j) = K\,\mathbf{1}^\top\Sigma_0\mathbf{1}$$
For the last identity denote $Y := \Sigma_0^{-1}X\in\mathbb{R}^n$ (i.e., $Y_i = \sum_{m=1}^{T}(\Sigma_0)^{-1}_{im}X_m$, applied per-episode), such that $s_{\mathrm{unif}} = \sum_i Y_i$:
$$\mathrm{Var}(s_{\mathrm{unif}}) = \sum_{i,j=1}^{n}\mathrm{Cov}(Y_i,Y_j) = \sum_{i,j=1}^{n}\mathrm{Cov}\left(\sum_{m}(\Sigma_0)^{-1}_{im}X_m,\; \sum_{l}(\Sigma_0)^{-1}_{jl}X_l\right)$$
$$= \sum_{i,j,m,l}(\Sigma_0)^{-1}_{im}(\Sigma_0)^{-1}_{jl}\,\mathrm{Cov}(X_m,X_l) = K\sum_{i,j,m,l=1}^{T}(\Sigma_0)^{-1}_{im}(\Sigma_0)^{-1}_{jl}(\Sigma_0)_{ml}$$
$$= K\sum_{i,j,l=1}^{T}(\Sigma_0)^{-1}_{jl}\left((\Sigma_0)^{-1}_{i\cdot}\cdot(\Sigma_0)_{\cdot l}\right) = K\sum_{i,j,l=1}^{T}(\Sigma_0)^{-1}_{jl}\,\delta_{il} = K\sum_{i,j=1}^{T}(\Sigma_0)^{-1}_{ji} = K\,\mathbf{1}^\top\Sigma_0^{-1}\mathbf{1} \qquad\square$$

Proof of Proposition I.1: Both $\sqrt{K}\,\tilde s^K_{\mathrm{simp}}$ and $\sqrt{K}\,\tilde s^K_{\mathrm{unif}}$ defined in Eq. (6) are, under $H_0$, the sums of $K$ i.i.d. variables with mean 0 and variance 1. Thus, according to the Central Limit Theorem (Petrov, 1972; Irwin, 2006), both converge in distribution to the standard normal distribution under $H_0$:
$$\tilde s^K_{\mathrm{simp}} \xrightarrow[K\to\infty]{D} \mathcal{N}(0,1) \qquad \tilde s^K_{\mathrm{unif}} \xrightarrow[K\to\infty]{D} \mathcal{N}(0,1)$$
Hence, from the definition of convergence in distribution, we have
$$\lim_{K\to\infty} P\left(\tilde s^K_{\mathrm{simp}}\le q_\alpha \,\middle|\, H_0\right) = \lim_{K\to\infty} F_{\tilde s^K_{\mathrm{simp}}|H_0}(q_\alpha) = \Phi(q_\alpha) = \alpha$$
$$\lim_{K\to\infty} P\left(\tilde s^K_{\mathrm{unif}}\le q_\alpha \,\middle|\, H_0\right) = \lim_{K\to\infty} F_{\tilde s^K_{\mathrm{unif}}|H_0}(q_\alpha) = \Phi(q_\alpha) = \alpha$$
where $F_{s|H}$ is the Cumulative Distribution Function of the random variable $s$ under the hypothesis $H$, and $\Phi$ is that of the standard normal distribution.

Note that $\tilde s^K_{\mathrm{simp}}, \tilde s^K_{\mathrm{unif}}$ can be computed from $s_{\mathrm{simp}}, s_{\mathrm{unif}}$ by substituting Lemma 2 (with $\epsilon=0$, corresponding to $H_0$) in Eq. (6):
$$\tilde s^K_{\mathrm{simp}} = \frac{s_{\mathrm{simp}} - K\,\mathbf{1}^\top\boldsymbol{\mu}}{\sqrt{K\,\mathbf{1}^\top\Sigma_0\mathbf{1}}} \qquad \tilde s^K_{\mathrm{unif}} = \frac{s_{\mathrm{unif}} - K\,W\cdot\boldsymbol{\mu}}{\sqrt{K\,\mathbf{1}^\top\Sigma_0^{-1}\mathbf{1}}} \tag{10}$$
and by substituting Lemma 2 with $\epsilon>0$ in Eq. (10), we also have the properties of $\tilde s^K_{\mathrm{simp}}, \tilde s^K_{\mathrm{unif}}$ under $H_A^\epsilon$:
$$\mathbb{E}\left[\tilde s^K_{\mathrm{simp}}\,\middle|\,H_A^\epsilon\right] = \frac{-KT\epsilon}{\sqrt{K\,\mathbf{1}^\top\Sigma_0\mathbf{1}}} = -\sqrt{K}\,\frac{T\epsilon}{\sqrt{\mathbf{1}^\top\Sigma_0\mathbf{1}}}$$
$$\mathbb{E}\left[\tilde s^K_{\mathrm{unif}}\,\middle|\,H_A^\epsilon\right] = \frac{-\epsilon K\,W\mathbf{1}}{\sqrt{K\,\mathbf{1}^\top\Sigma_0^{-1}\mathbf{1}}} = -\sqrt{K}\,\epsilon\sqrt{\mathbf{1}^\top\Sigma_0^{-1}\mathbf{1}}$$
$$\mathrm{Var}(\tilde s^K_{\mathrm{simp}}|H_A^\epsilon) = \mathrm{Var}(\tilde s^K_{\mathrm{unif}}|H_A^\epsilon) = 1$$
(using $W\mathbf{1} = \mathbf{1}^\top\Sigma_0^{-1}\mathbf{1}$). Accordingly, using the Central Limit Theorem again, we have under $H_A^\epsilon$:
$$\tilde s^K_{\mathrm{simp}} + \sqrt{K}\,\frac{T\epsilon}{\sqrt{\mathbf{1}^\top\Sigma_0\mathbf{1}}} \xrightarrow[K\to\infty]{D} \mathcal{N}(0,1) \qquad \tilde s^K_{\mathrm{unif}} + \sqrt{K}\,\epsilon\sqrt{\mathbf{1}^\top\Sigma_0^{-1}\mathbf{1}} \xrightarrow[K\to\infty]{D} \mathcal{N}(0,1)$$
and by the definition of convergence in distribution, we receive
$$\lim_{K\to\infty} P\left(\tilde s^K_{\mathrm{simp}}\le q_\alpha\,\middle|\,H_A^\epsilon\right) = \lim_{K\to\infty}\Phi\left(q_\alpha + \sqrt{K}\,\frac{T\epsilon}{\sqrt{\mathbf{1}^\top\Sigma_0\mathbf{1}}}\right) = 1$$
$$\lim_{K\to\infty} P\left(\tilde s^K_{\mathrm{unif}}\le q_\alpha\,\middle|\,H_A^\epsilon\right) = \lim_{K\to\infty}\Phi\left(q_\alpha + \sqrt{K}\,\epsilon\sqrt{\mathbf{1}^\top\Sigma_0^{-1}\mathbf{1}}\right) = 1 \tag{11}$$
$\square$

Proof of Theorem I.2 (also compactly formulated in Theorem 4.2): Following identical reasoning to the proof of Proposition I.1 with $\epsilon$ replaced by $\epsilon/\sqrt{K}$, and recalling that $W = \mathbf{1}^\top\Sigma_0^{-1}$ (Definition I.6), we receive the analog of Eq. (11):
$$\lim_{K\to\infty} P\left(\tilde s^K_{\mathrm{simp}}\le q_\alpha\,\middle|\,H_A^{\epsilon,K}\right) = \Phi\left(q_\alpha + \frac{T\epsilon}{\sqrt{\mathbf{1}^\top\Sigma_0\mathbf{1}}}\right)$$
$$\lim_{K\to\infty} P\left(\tilde s^K_{\mathrm{unif}}\le q_\alpha\,\middle|\,H_A^{\epsilon,K}\right) = \Phi\left(q_\alpha + \epsilon\sqrt{\mathbf{1}^\top\Sigma_0^{-1}\mathbf{1}}\right) \tag{12}$$
To complete the proof, since $\Phi$ is monotonously increasing, we only have to show that $\frac{T}{\sqrt{\mathbf{1}^\top\Sigma_0\mathbf{1}}} \le \sqrt{\mathbf{1}^\top\Sigma_0^{-1}\mathbf{1}}$, or equivalently $T^2 \le (\mathbf{1}^\top\Sigma_0^{-1}\mathbf{1})(\mathbf{1}^\top\Sigma_0\mathbf{1})$, which can be seen as a matrix-form generalization of the harmonic-arithmetic means inequality.

Since the invertible covariance matrix $\Sigma_0$ is necessarily symmetric and positive definite, it has a symmetric positive definite square-root $R = \Sigma_0^{1/2}$. Since $\mathbf{1}^\top\Sigma_0\mathbf{1} = \mathbf{1}^\top R^\top R\mathbf{1} = \|R\mathbf{1}\|^2$ and $\mathbf{1}^\top\Sigma_0^{-1}\mathbf{1} = \|R^{-1}\mathbf{1}\|^2$, we indeed have by the Cauchy-Schwarz inequality
$$(\mathbf{1}^\top\Sigma_0^{-1}\mathbf{1})(\mathbf{1}^\top\Sigma_0\mathbf{1}) = \|R^{-1}\mathbf{1}\|^2\cdot\|R\mathbf{1}\|^2 \ge \left((\mathbf{1}^\top R^{-1})(R\mathbf{1})\right)^2 = (\mathbf{1}^\top\mathbf{1})^2 = T^2 \tag{13}$$
$\square$

Proof of Proposition I.2:
Since $\Sigma_0$ is symmetric it is orthogonally diagonalizable, i.e., $\Sigma_0 = U^\top A U$ where $A$ is diagonal and $U$ is orthogonal. Since $\Sigma_0$ is positive-definite (note that Definition I.1 assumes a full-rank covariance matrix), its eigenvalues are positive, i.e., $\forall\, 1\le i\le T: \lambda_i = A_{ii} > 0$. We also have $\mathbf{1}^\top\Sigma_0\mathbf{1} = \mathbf{1}^\top U^\top A U\mathbf{1} = u^\top A u$ (where $u = U\mathbf{1}$), and $\mathbf{1}^\top\Sigma_0^{-1}\mathbf{1} = \mathbf{1}^\top U^\top A^{-1} U\mathbf{1} = u^\top A^{-1} u$. From this we receive
$$G = \frac{(\mathbf{1}^\top\Sigma_0^{-1}\mathbf{1})(\mathbf{1}^\top\Sigma_0\mathbf{1})}{T^2} = \frac{(u^\top A^{-1}u)(u^\top Au)}{T^2} = \frac{1}{T^2}\left(\sum_{i=1}^T u_i^2\lambda_i\right)\left(\sum_{j=1}^T u_j^2/\lambda_j\right) = \frac{1}{T^2}\sum_{i,j=1}^T (u_iu_j)^2\,\frac{\lambda_i}{\lambda_j}$$
$$= \frac{1}{2T^2}\sum_{i,j=1}^T (u_iu_j)^2\left(\frac{\lambda_i}{\lambda_j} + \frac{\lambda_j}{\lambda_i}\right) = \frac{1}{2T^2}\sum_{i,j=1}^T (u_iu_j)^2\,\frac{\lambda_i^2+\lambda_j^2}{\lambda_i\lambda_j} = \frac{1}{2T^2}\sum_{i,j=1}^T (u_iu_j)^2\,\frac{(\lambda_i-\lambda_j)^2 + 2\lambda_i\lambda_j}{\lambda_i\lambda_j}$$
$$= \frac{1}{2T^2}\left[2\sum_{i,j=1}^T (u_iu_j)^2 + \sum_{i,j=1}^T (u_iu_j)^2\,\frac{(\lambda_i-\lambda_j)^2}{\lambda_i\lambda_j}\right] = \frac{(u^\top u)^2}{T^2} + \frac{1}{2T^2}\sum_{i,j=1}^T (u_iu_j)^2\,\frac{(\lambda_i-\lambda_j)^2}{\lambda_i\lambda_j}$$
$$= 1 + \frac{1}{2T^2}\sum_{i,j=1}^T (u_iu_j)^2\,\frac{(\lambda_i-\lambda_j)^2}{\lambda_i\lambda_j}$$
where the last equality uses $u^\top u = \mathbf{1}^\top U^\top U\mathbf{1} = \mathbf{1}^\top\mathbf{1} = T$. Since $u_i = U_{i\cdot}\cdot\mathbf{1}$, the sum of the $i$-th row of the orthogonal matrix $U$, is assumed to be nonzero for every $i$, we only need to denote $w_{ij} := \frac{(u_iu_j)^2}{2T^2\lambda_i\lambda_j} > 0$. $\square$

Lemma 3 (Sensitivity of the minimum to deviations in the elements). Let $A$ be a finite set, $f,g: A\to\mathbb{R}$, and $\epsilon>0$. Note that since $A$ is finite, both $f,g$ are bounded; denote by $G$ an upper bound $|g|\le G$. Denote $y := \min_{a\in A}\left[f(a) + \epsilon g(a)\right]$ and $\tilde y := \min_{a\in A} f(a)$. Then $|y - \tilde y|\le 3\epsilon G$.

Proof. Denote by $a_0, \tilde a$ the minimizers of $y, \tilde y$ respectively, i.e., $y = f(a_0) + \epsilon g(a_0) \le f(\tilde a) + \epsilon g(\tilde a)$ and $\tilde y = f(\tilde a)\le f(a_0)$. From the last two inequalities we get $0 \le f(a_0) - f(\tilde a) \le \epsilon\left(g(\tilde a) - g(a_0)\right)$.
Finally, from the triangle inequality,
$$|y - \tilde y| = \left|f(a_0) - f(\tilde a) + \epsilon g(a_0)\right| \le \left|\epsilon\left(g(\tilde a) - g(a_0)\right)\right| + \left|\epsilon g(a_0)\right| \le \epsilon\left[|g(\tilde a)| + |g(a_0)| + |g(a_0)|\right] \le 3\epsilon G \qquad\square$$

Proof of Theorem I.3: Similarly to the proof of Theorem I.1, we wish to show that the log-likelihood-ratio $\lambda_{LR}$ from Lemma 1, after substituting Definition I.10, is (approximately) a strictly monotonously increasing function of $s_{\mathrm{part}}$.

Given $a\in\{0,1\}^T$, let $\bar a(a)\in\{0,1\}^n$ be its $T$-periodic completion to $n$ dimensions, and recall the notations $m(p) = \lceil pT\rceil$, $n = KT + \tau_0$. Then we have
$$\lambda_{LR}(H_0, H_A^{\mathrm{part}}\,|\,\{X_t\}_{t=1}^n) = \min_{a\in A^T_m}\; 2\epsilon_0\,\bar a(a)^\top\Sigma_0^{-1}\tilde X + \epsilon_0^2\,\bar a(a)^\top\Sigma_0^{-1}\bar a(a) = 2\epsilon_0\cdot\min_{a\in A^T_m}\left(\bar a(a)^\top\Sigma_0^{-1}\tilde X + 0.5\,\epsilon_0\left(\bar a(a)^\top\Sigma_0^{-1}\bar a(a)\right)\right)$$
Denote $f(a) = \bar a(a)^\top\Sigma_0^{-1}\tilde X$, $g(a) = 0.5\,\bar a(a)^\top\Sigma_0^{-1}\bar a(a)$ and $y = \min_{a\in A^T_m}\left[f(a) + \epsilon_0 g(a)\right]$, such that $\lambda_{LR} = 2\epsilon_0 y$ is monotonously increasing with $y$. Note that
$$\min_{a\in A^T_m} f(a) = \min_{a\in A^T_m}\bar a(a)^\top\Sigma_0^{-1}\tilde X = \min_{a\in A^T_m}\left[\sum_{k=0}^{K-1}\sum_{\substack{\tau=1 \\ (a)_\tau=1}}^{T}\tilde s_{kT+\tau} + \sum_{\substack{\tau=1 \\ (a)_\tau=1}}^{\tau_0}\tilde s_{KT+\tau}\right] = \min_{a\in A^T_m}\sum_{\tau\in o(a)}\tilde S_\tau = s_{\mathrm{part}}(X)$$
Also note that the term $g(a)$ is bounded:
$$\forall a\in A^T_m:\quad |g(a)| \le \frac{K+1}{2}\sum_{i,j=1}^{T}\left|(\Sigma_0^{-1})_{ij}\right|$$
Hence, by Lemma 3, we have
$$|y - s_{\mathrm{part}}| \le \frac{3\epsilon_0(K+1)}{2}\sum_{i,j=1}^{T}\left|(\Sigma_0^{-1})_{ij}\right| = O(\epsilon_0) \tag{14}$$
(where $O(\epsilon_0)$ is defined with relation to $\epsilon_0\to 0$).

Now consider the $\alpha$-quantile of $y$ under $H_0$, denoted $\tilde\kappa = q_\alpha(y|H_0)$. By construction $P(y\le\tilde\kappa\,|\,H_0) = \alpha$ (up to a non-continuous probability density in the edge case $y=\tilde\kappa$). According to the Neyman-Pearson Lemma, a threshold-test on $y$ has the greatest power $P_\alpha$ among all statistical tests with significance level $\le\alpha$ (see the proof of Theorem I.1 for more details), i.e., $P_\alpha = P(y\le\tilde\kappa\,|\,H_A^{\mathrm{part}}) = F_{y|H_A^{\mathrm{part}}}(\tilde\kappa)$ (where $F_{s|H}$ is the Cumulative Distribution Function of the variable $s$ given the hypothesis $H$).

Denote the $\alpha$-quantile of the actual test-statistic $s_{\mathrm{part}}$ by $\kappa = q_\alpha(s_{\mathrm{part}}|H_0)$. Since $\forall X\in\mathbb{R}^n: |y - s_{\mathrm{part}}| = O(\epsilon_0)$ (Eq. (14)), we also have $|\tilde\kappa - \kappa| = |q_\alpha(y|H_0) - q_\alpha(s_{\mathrm{part}}|H_0)| = O(\epsilon_0)$. Hence, along with Eq. (14), we have
$$P\left(s_{\mathrm{part}}\le\kappa\,\middle|\,H_A^{\mathrm{part}}\right) \ge P\left(y + O(\epsilon_0)\le\tilde\kappa - O(\epsilon_0)\,\middle|\,H_A^{\mathrm{part}}\right) = P\left(y\le\tilde\kappa - O(\epsilon_0)\,\middle|\,H_A^{\mathrm{part}}\right)$$
$$= F_{y|H_A^{\mathrm{part}}}\left(\tilde\kappa - O(\epsilon_0)\right) = F_{y|H_A^{\mathrm{part}}}(\tilde\kappa) - O(\epsilon_0) = P_\alpha - O(\epsilon_0)$$
where the second-to-last equality holds since $F_{y|H_A^{\mathrm{part}}}(x)$ has a finite derivative at $x=\tilde\kappa$, as the CDF of the minimum of the normal variables $\left\{\bar a(a)^\top\Sigma_0^{-1}\tilde X + 0.5\,\epsilon_0\left(\bar a(a)^\top\Sigma_0^{-1}\bar a(a)\right)\right\}_{a\in A^T_m}$. $\square$

K Detailed Preliminary Materials
K.1 Reinforcement Learning and Episodic Framework
The environment of a Reinforcement Learning (RL) problem is usually modeled as a Decision Process. This is essentially a state-machine, where the (possibly random) transitions between states depend on the decision-making, as well as on the current and previous states (in the general case). Every state (and possibly every decision) is assigned a corresponding reward, and the goal of the decision-making system (termed the agent) is to maximize some function of the rewards, named the return function. In contrast to Supervised Learning, the feedback from the environment does not inform the agent whether it succeeded in maximizing the rewards, but merely how high the rewards were. It is up to the agent to explore the possible decisions (also termed actions) and the corresponding rewards.

The return function is usually a simple sum of the rewards for a finite process, and a decayed sum for an infinite process. In the finite case, the process usually repeats multiple times in different variants, e.g., with different initial states. Common examples are board and video games (Brockman et al., 2016), as well as more realistic problems such as repeating drives in an autonomous driving task. In the context of RL, the repetitions of the decision process are usually named episodes. Bylander (Bylander) defined an episode as the "path from initial to a terminal state". Pardo et al. (Pardo et al., 2017) wrote that "it is common to let an agent interact for a fixed amount of time with its environment before resetting it and repeating the process in a series of episodes".

Note that once the agent chooses a decision-making scheme (termed a policy), the decision process essentially reduces to a (decision-free) random process. Every time-step in the process has a certain distribution of (state and) reward, and different time-steps may depend on each other.

The decision process in RL is often modeled as a
Markov Decision Process (MDP) (Bellman, 1957), where every state depends only on the preceding state and the agent's action. The decision-free process obtained from an MDP with relation to a fixed policy is a
Markov Chain (MC), which under certain further assumptions is guaranteed to converge to a stationary state (Freedman, 2017). However, even in such a restrictive model, long-term correlations between rewards may still carry information if the states are not observable by the agent; and even under the further conditions of convergence to a stationary state, the rate of convergence may be slow compared to the length of an episode. The non-stationarity of the rewards within an episode is demonstrated, for example, in Fig. 1b.

This work exploits the repetitive nature of episodic random processes, and in particular the rewards of episodic decision processes in the context of RL, to estimate the expectations and the correlations in the process. Since we measure the rewards directly, without considering the underlying states or any other observations available to the agent, we may call this approach model-free in the context of RL.

Note that in the scope of this work, the goal of the episodes is to provide i.i.d. samples of a non-i.i.d. random process, so that the covariance parameters of the process can be estimated. Hence, the scope of "episodic problems" may be quite extensive: it may include even life-time systems that run continuously without ever resetting, as long as a reference dataset of other instances of the system is available, and the sample resolution does not introduce too many covariance parameters to estimate from the reference dataset. Indeed, the model defined in Section G and the optimality results in Section I are fully capable of handling a part of a single, long episode (with the exception of the asymptotic results in Section I.1.1).

K.2 Hypothesis Testing
In a standard hypothesis test, two hypotheses are formulated regarding some observable phenomenon, and we wish to decide which one is true according to available evidence, given in the form of observations $X\in\mathcal{X}$ from a corresponding observation space $\mathcal{X}$. One hypothesis is often regarded as the default, named the Null Hypothesis and denoted $H_0$; given $X$, we have to decide whether to reject $H_0$ in favor of the Alternative Hypothesis $H_A$.

The fundamental distinction between the hypotheses lies in their different probabilistic models $P(X\,|\,H)$ (either a probability function or a probability density function), also referred to as the likelihood $L(H\,|\,X)$ of the hypothesis given the observations. The difference between the models is often formulated in terms of different values of a parameter $\theta$ for some parametric probability function $P(X\,|\,\theta)$. A complex hypothesis is one that allows different possible probabilistic models, represented by a set $\Theta$ of permitted values of $\theta$. The likelihood of a complex hypothesis $H: \theta\in\Theta$ is defined as $L(H\,|\,X) = \sup_{\theta\in\Theta} P(X\,|\,\theta)$. The likelihood-ratio between two hypotheses is defined as $LR(H_0, H_A\,|\,X) = \frac{L(H_0|X)}{L(H_A|X)}$. The log-likelihood-ratio is often used instead (Wilks, 1938), since it tends to derive simpler expressions for exponential families of distributions such as the Normal distribution. In this work we often denote $\lambda_{LR}(H_0, H_A\,|\,X) = 2\ln(LR)$.

The basic metrics for the efficiency of a hypothesis test are its significance $P(\text{not reject } H_0\,|\,H_0) = 1 - P(\text{type-I error}) = 1-\alpha$ and its power $P(\text{reject } H_0\,|\,H_A) = 1 - P(\text{type-II error}) = \beta$. A statistical hypothesis test with significance $1-\alpha$ and power $\beta$ is said to be optimal if any statistical test with at-least-as-high significance $1-\tilde\alpha \ge 1-\alpha$ has smaller-or-equal power $\tilde\beta\le\beta$.

According to the Neyman-Pearson Lemma (Neyman et al., 1933), a threshold-test on the likelihood-ratio is an optimal hypothesis test. In a likelihood-ratio threshold-test with a threshold $\kappa\in\mathbb{R}$, we reject $H_0$ if $LR(H_0,H_A\,|\,X) < \kappa$; reject with a certain probability $\rho\in[0,1)$ if $LR = \kappa$; and do not reject $H_0$ otherwise. Note that the behavior in the edge-case $LR = \kappa$ (controlled by $\rho$) only matters in the case of non-continuous distributions, where it is possible that $P(LR = \kappa)\ne 0$.

Note that the optimal hypothesis test is not unique, but rather leaves a degree of freedom in the tradeoff between $\alpha$ and $\beta$. In the case of a threshold-test, this degree of freedom is controlled by the threshold $\kappa$ (and the edge probability $\rho$). It is common to define the test according to a desired significance level (often $\alpha=0.05$ or $\alpha=0.01$), and derive the corresponding threshold $\kappa_\alpha$.

In certain cases, given a test-statistic and a desired $\alpha$, the threshold $\kappa_\alpha$ can be analytically calculated from the corresponding probabilistic model $P(X\,|\,H_0)$. If the model is too complex or not well-defined, but expresses the sum of i.i.d. random variables, then according to the Central Limit Theorem (CLT) (Petrov, 1972; Irwin, 2006), the model becomes closer to a Normal distribution as the number of summed variables grows, allowing to calculate the asymptotic value of $\kappa_\alpha$ analytically. Note that the CLT relies on the independence and identical distributions of the summed variables, two properties which are not generally satisfied by episodic rewards in the decision processes described in Section K.1.

Numeric methods are also available for the estimation of the properties of a hypothesis test (or the properties of a statistic of the observations).
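For illustration, the calibration of a threshold $\kappa_\alpha$ by simulation, as described above, can be sketched as follows for a toy Gaussian degradation test (the setting, the names and the plain-sum statistic here are ours, for illustration only, and do not correspond to the weighted statistics derived in this work):

```python
import numpy as np

rng = np.random.default_rng(0)
n, alpha = 100, 0.05

def statistic(x):
    # test-statistic: the plain sum of the observations (lower = worse)
    return x.sum()

# Monte-Carlo calibration under H0: here X_t ~ N(0, 1), i.i.d. for simplicity
null_stats = np.array([statistic(rng.normal(0.0, 1.0, n)) for _ in range(10_000)])
kappa_alpha = np.quantile(null_stats, alpha)   # reject H0 when s(X) < kappa_alpha

# empirical type-I error rate (close to alpha by construction)
sig = (null_stats < kappa_alpha).mean()

# empirical power under a degradation alternative: mean shifted down by eps
eps = 0.3
alt_stats = np.array([statistic(rng.normal(-eps, 1.0, n)) for _ in range(10_000)])
power = (alt_stats < kappa_alpha).mean()
print(f"kappa_alpha={kappa_alpha:.2f}, type-I~{sig:.3f}, power~{power:.3f}")
```

By construction, the empirical type-I error rate is close to $\alpha$, while the power depends on the magnitude of the degradation and on the sample size.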
In the Monte-Carlo method (Kroese et al., 2014), the test is simulated (or the statistic is computed) multiple times for observations $X$ generated in a way which is assumed to be similar to a hypothesis $H$ (in particular $H_0$ for significance estimation). In the bootstrap method (Efron, 2003), given i.i.d. observations $X\in\mathbb{R}^n$ (assumed to be drawn according to a hypothesis $H$), the Monte-Carlo method is applied to artificial observations $X_b$, drawn by repeatedly sampling $n$ elements from $X$ with replacement.

K.3 Sequential Tests
Section K.2 describes the general scheme of a standard hypothesis test for distinction between two hypotheses according to certain available data. In many practical applications, the hypothesis test is repeatedly applied as the data change or grow, a procedure known as a sequential test. If the null hypothesis $H_0$ is true, and any individual hypothesis test falsely rejects $H_0$ with some probability $\alpha$, then the probability that at least one of the multiple tests rejects $H_0$ is $\alpha' > \alpha$, termed the family-wise type-I error rate. For simplicity, consider the special case of $k$ independent tests, where $\alpha' = 1 - (1-\alpha)^k \xrightarrow{k\to\infty} 1$.

This problem, also known as the inflation of significance (or inflation of $\alpha$) in sequential tests, has been addressed by many works over the years. A simple solution is the Bonferroni correction (Goldman, 2008), setting a significance level of $1-\alpha/k$ in every individual test. This way, we have $P(\exists i: \text{test } i \text{ rejects}\,|\,H_0) \le \sum_{i=1}^k \alpha/k = \alpha$. However, the inequality becomes an equality only if the rejections of the various tests are disjoint events (a stronger requirement than independence); thus in practice the actual family-wise error rate is often much smaller than $\alpha$, which makes the Bonferroni correction extremely conservative. Appendix F describes other relevant works on sequential testing.

L Algorithms (Pseudocode)

This appendix concentrates the pseudo-code of the algorithms for hypothesis testing and for bootstrap-based $\alpha$ tuning, in the contexts of both individual and sequential tests. Algorithm 4 is a more detailed version of the pseudo-code of Algorithm 1.

Algorithm 2: Individual test bootstrap
Input: $x_0\in\mathbb{R}^{N\times T}$, assumed to be drawn from a $T$-long episodic signal; sample size $n = KT+\tau\in\mathbb{N}$; a test-statistic function $s:\mathbb{R}^n\to\mathbb{R}$; number of repetitions $B\in\mathbb{N}$.
Output: test-statistic bootstrap distribution $S\in\mathbb{R}^B$.
Algorithm:
Initialize $S\in\mathbb{R}^B$;
for b in 1:B do
    // sample
    Initialize $y\in\mathbb{R}^n$;
    for k in 0:K-1 do
        Sample $j$ uniformly from $(1,...,N)$;
        $y[kT+1 : kT+T] \leftarrow (x_{0,j1},...,x_{0,jT})$;
    Sample $j$ uniformly from $(1,...,N)$;
    $y[KT+1 : KT+\tau] \leftarrow (x_{0,j1},...,x_{0,j\tau})$;
    // calculate
    $S_b \leftarrow s(y)$;
Return $S$;

Algorithm 3: Individual degradation test
Input: reference episodic signal $x_0\in\mathbb{R}^{N\times T}$; test data $x\in\mathbb{R}^n$; a test-statistic function $s:\mathbb{R}^n\to\mathbb{R}$; bootstrap repetitions $B\in\mathbb{N}$; bootstrap-distributions storage $BS$; allowed type-I error rate $\alpha\in(0,1)$.
Output: rejection $\in\{0,1\}$; p-value $p\in\mathbb{R}$.
Algorithm:
if $BS[n]$ does not exist then
    $BS[n] \leftarrow$ Individual test bootstrap($x_0, n, s, B$); (Algorithm 2)
$S \leftarrow BS[n]$;
$\kappa_\alpha \leftarrow \text{quantile}_\alpha(S)$;
$y \leftarrow s(x)$;
$count \leftarrow |\{b\in\{1,...,B\}\,|\, S_b \le y\}|$;
$p \leftarrow count / B$;
$reject \leftarrow 1$ if $y < \kappa_\alpha$ else $0$; (or equivalently, $1$ if $p < \alpha$ else $0$)
Return $reject$, $p$;

Algorithm 4: Sequential test bootstrap
Input: $x_0\in\mathbb{R}^{N\times T}$, assumed to be drawn from a $T$-long episodic signal; test-statistic function $s$; inner-bootstrap repetitions $B\in\mathbb{N}$; inner-bootstrap storage $BS$; tests frequency $d\in[1,T]$ and lookback horizons $h_1,...,h_{n_h}\in\mathbb{N}$; sequential test length $\tilde h\in\mathbb{N}$; outer-bootstrap repetitions $\tilde B\in\mathbb{N}$.
Output: bootstrap-distribution $P\in[0,1]^{\tilde B}$ of the minimal p-value in a sequential test of $\tilde h$ episodes under $H_0$.
Algorithm:
Initialize $P = (1,...,1)\in[0,1]^{\tilde B}$;
$h_{max} \leftarrow \max(h_1,...,h_{n_h})$;
for b in 1:$\tilde B$ do
    // sample
    Initialize $Y\in\mathbb{R}^{(h_{max}+\tilde h)T}$;
    for k in 0:($h_{max}+\tilde h$-1) do
        Sample $j$ uniformly from $(1,...,N)$;
        $Y[kT+1 : kT+T] \leftarrow (x_{0,j1},...,x_{0,jT})$;
    // calculate the p-value at any time, for any lookback horizon
    for k in 0:($\tilde h$-1) do
        for $\tau$ in 1:d:T do
            for h in $h_1,...,h_{n_h}$ do
                $y \leftarrow Y[(h_{max}+k-h)T : (h_{max}+k)T+\tau]$;
                $p \leftarrow$ Individual test($x_0=x_0, x=y, s=s, B=B, BS=BS, \alpha=1$).p; (Algorithm 3)
                $P[b] \leftarrow \min(P[b], p)$;
Return $P$;

Algorithm 5: Sequential degradation test
Input: reference episodic signal $x_0\in\mathbb{R}^{N\times T}$; test data stream $x$; test-statistic function $s$; inner-bootstrap repetitions $B\in\mathbb{N}$; tests frequency $d\in[1,T]$ and lookback horizons $h_1,...,h_{n_h}\in\mathbb{N}$; family-wise significance parameters $\alpha_0\in(0,1)$, $\tilde h\in\mathbb{N}$; outer-bootstrap repetitions $\tilde B\in\mathbb{N}$.
Output: time of $H_0$ rejection.
Algorithm:
Initialize bootstrap-storage $BS$;
$P \leftarrow$ Sequential test bootstrap($x_0, s, B, BS, d, \{h_i\}, \tilde h, \tilde B$); (Algorithm 1)
$\alpha \leftarrow \text{quantile}_{\alpha_0}(P)$;
if $\alpha < 1/(B+1)$ then
    // Can never reject $H_0$
    Warn("Either increase B or reduce significance requirements.");
    Return ERROR;
for k in ($h_{max}$, $h_{max}$+1, ...) do
    for $\tau$ in 1:d:T do
        for h in $h_1,...,h_{n_h}$ do
            $y \leftarrow x[(k-h)T : kT+\tau]$;
            $r \leftarrow$ Individual test($x_0=x_0, x=y, s=s, BS=BS, \alpha=\alpha$).reject; (Algorithm 3)
            if $r=1$ then
                // Reject $H_0$
                Return $kT+\tau$;
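To make the individual procedure concrete, the following is a minimal Python sketch of Algorithms 2-3 (the function names are ours, and the simple-sum statistic stands in for the weighted statistics derived in this work):

```python
import numpy as np

def individual_test_bootstrap(x0, n, s, B, rng):
    """Algorithm 2 (sketch): bootstrap distribution of the statistic s over
    n-long samples assembled from whole reference episodes (rows of x0)."""
    N, T = x0.shape
    K, tau = divmod(n, T)
    S = np.empty(B)
    for b in range(B):
        # sample K full episodes, plus a partial episode of length tau if needed
        rows = rng.integers(0, N, K + (1 if tau > 0 else 0))
        y = np.concatenate([x0[j] for j in rows])[:n]
        S[b] = s(y)
    return S

def individual_degradation_test(x0, x, s, B=2000, alpha=0.05, rng=None):
    """Algorithm 3 (sketch): reject H0 (no degradation) if s(x) falls below
    the alpha-quantile of the bootstrap distribution of s under H0."""
    if rng is None:
        rng = np.random.default_rng(0)
    S = individual_test_bootstrap(x0, len(x), s, B, rng)
    kappa_alpha = np.quantile(S, alpha)
    y = s(x)
    p = (S <= y).mean()  # bootstrap p-value
    return int(y < kappa_alpha), p

# toy usage: healthy reference episodes vs. a clearly-degraded test episode
rng = np.random.default_rng(1)
x0 = rng.normal(1.0, 1.0, (200, 10))   # 200 reference episodes of length T=10
x_bad = rng.normal(-1.0, 1.0, 10)      # degraded rewards (mean shifted down)
reject, p = individual_degradation_test(x0, x_bad, np.sum, rng=rng)
```

The sequential procedure (Algorithms 4-5) wraps this individual test with the outer bootstrap that tunes the per-test $\alpha$ to control the family-wise error rate.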