Active and Adaptive Sequential Learning
Yuheng Bu∗† Jiaxun Lu∗‡ Venugopal V. Veeravalli†

Abstract
A framework is introduced for actively and adaptively solving a sequence of machine learning problems that change in a bounded manner from one time step to the next. An algorithm is developed that actively queries the labels of the most informative samples from an unlabeled data pool, and that adapts to the change by utilizing the information acquired in previous steps. Our analysis shows that the proposed active learning algorithm, based on stochastic gradient descent, achieves near-optimal excess risk for maximum likelihood estimation. Furthermore, an estimator of the change in the learning problems is constructed from the active learning samples, yielding an adaptive sample size selection rule that guarantees the excess risk is bounded for a sufficiently large number of time steps. Experiments with synthetic and real data are presented to validate our algorithm and theoretical results.
∗Equal contribution.
†ECE Department and the Coordinated Science Laboratory, University of Illinois at Urbana-Champaign, Urbana, IL, USA. Email: {bu3, vvv}@illinois.edu
‡EE Department, Tsinghua University, Beijing, China. Email: [email protected].
Work in progress.

1 Introduction

Machine learning problems that vary in a bounded manner over time arise naturally in many applications. For example, in personalized recommendation systems [9, 15], the preferences of users might change with fashion trends. Since acquiring new training samples from users can be expensive in practice, a recommendation system needs to update the machine learning model and adapt to this change using as few new samples as possible.

In such problems, we are given a large set of unlabeled samples, and the learning tasks are solved by minimizing the expected value of an appropriate loss function on this unlabeled data pool at each time t. To capture the idea that the sequence of learning problems is changing in a bounded manner, we assume the following bound holds:

‖θ*_t − θ*_{t−1}‖ ≤ ρ, ∀ t ≥ 2, (1)

where θ*_t is the true minimizer of the loss function at time t, and ρ is a finite upper bound on the change of the minimizers, which needs to be estimated in practice.

To tackle this sequential learning problem, we propose an active and adaptive algorithm to learn approximate minimizers θ̂_t of the loss function. At each time t, our algorithm actively queries the labels of K_t samples from the unlabeled data pool, using a well-designed active sampling distribution that adapts to the change in the minimizers by utilizing the information acquired in the previous steps. In particular, we adaptively select K_t and construct θ̂_t such that the excess risk [13] is bounded at each time t.

The challenges of this active and adaptive sequential learning problem arise in three aspects: 1) we need to determine which samples are more informative for solving the task at the current time step, based on the information acquired in the previous time steps, in order to conduct active learning; 2) to achieve a desired bounded excess risk with as few new samples as possible, we need to understand the tradeoff between the solution accuracy and the adaptively determined sample complexity K_t; 3) the change in the minimizers ρ is unknown and we need to estimate it.

Our contributions in this paper can be summarized as follows. We propose an active and adaptive learning framework with theoretical guarantees for solving a sequence of learning problems, which ensures a bounded excess risk for each individual learning task when t is sufficiently large. We construct a new estimator ρ̂_t of the change in the minimizers from active learning samples, and show that this estimate upper bounds the true parameter ρ almost surely. We test our approach on a synthetic regression problem, and further apply it to a recommendation system that tracks changes in the preferences of customers. Our experiments demonstrate that our algorithm achieves better performance than the baseline algorithms in these scenarios.

1.1 Related Work

Our active and adaptive learning problem is related to multi-task learning (MTL) and transfer learning. In multi-task learning, the goal is to learn several tasks simultaneously, as in [2, 10, 21], by exploiting the similarities between the tasks. In transfer learning, prior knowledge from a source task is transferred to a target task, either with or without additional training data [14]. Multi-task learning could be applied to solve our problem by running an MTL algorithm at each time step while remembering all prior tasks; however, this approach incurs a heavy memory and computational burden. Transfer learning lacks the sequential nature of our problem, and neither line of work has an active learning component.
For multi-task and transfer learning, there are regret guarantees for some algorithms [1], whereas we provide an excess risk guarantee for each individual task.

In the concept drift problem, a stream of incoming data that changes over time is observed, and the goal is to predict some property of each piece of data as it arrives; after prediction, a loss is revealed as feedback [17]. Some approaches to concept drift use iterative algorithms such as stochastic gradient descent, but without a specific model of how the data changes, there are no theoretical guarantees for these algorithms.

Our work is of course related to active learning [8, 4], in which a learning algorithm interactively queries the labels of samples from an unlabeled data pool to achieve better performance. A standard approach to active learning is to select the unlabeled samples by optimizing specific statistics of these samples [7]. For example, with the goal of minimizing the expected excess risk in maximum likelihood estimation, the authors of [6, 16] propose a two-stage algorithm based on the Fisher information ratio to select the most informative samples, and show that it is optimal in terms of the convergence rate. We apply similar algorithms to our problem, but the first stage, in which the Fisher information is estimated from labeled samples to conduct active learning, can be skipped by exploiting the bounded nature of the change and utilizing information obtained in previous time steps.

Our approach is closely related to prior work on adaptive sequential learning [20, 19], where the training samples are drawn passively and the adaptation is only in the selection of the number of training samples K_t at each time step.

The rest of the paper is organized as follows. In Section 2, we describe the problem setting. In Section 3, we present our active and adaptive learning algorithm. In Section 4, we provide the theoretical analysis that motivates the proposed algorithm.
In Section 5, we test our algorithm on synthetic and real data. Finally, in Section 6, we provide some concluding remarks.

Throughout this paper, we use lower case letters to denote scalars and vectors, and upper case letters to denote random variables and matrices. All logarithms are natural. We use I to denote an identity matrix of appropriate size, the superscript (·)⊤ to denote the transpose of a vector or a matrix, and Tr(A) to denote the trace of a square matrix A. We write ‖x‖_A = √(x⊤Ax) for a vector x and a matrix A of appropriate dimensions.

2 Problem Setting

We consider the active and adaptive sequential learning problem in the maximum likelihood estimation (MLE) setting. At each time t, we are given a pool S_t = {x_{1,t}, · · · , x_{N,t}} of N_t unlabeled samples drawn from some instance space X. We have the ability to interactively query the labels of K_t of these samples from a label space Y. In addition, we are given a parameterized family of distribution models M = {p(y | x, θ_t), θ_t ∈ Θ}, where Θ ⊆ R^d. We assume that there exists an unknown parameter θ*_t ∈ Θ such that the label y_t of x_t ∈ S_t is actually generated from the distribution p(y_t | x_t, θ*_t).

For any x ∈ X, y ∈ Y and θ ∈ Θ, we let the loss function be the negative log-likelihood with parameter θ, i.e.,

ℓ(y | x, θ) ≜ − log p(y | x, θ), p(y | x, θ) ∈ M. (2)

Then, the expected loss function over the uniform distribution on the data pool S_t can be written as

L_{U_t}(θ) ≜ E_{X∼U_t, Y∼p(Y|X,θ*_t)}[ℓ(Y | X, θ)], (3)

where U_t denotes the uniform distribution over the samples in S_t. It can be seen that the minimizer of L_{U_t}(θ) is the true parameter θ*_t.
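To make the setting concrete, the following sketch instantiates the loss (2) and the pool risk (3) for a linear-Gaussian model; the model choice, the constants, and all names are illustrative assumptions, not code from the paper.

```python
import numpy as np

# Hypothetical linear-Gaussian instance of the MLE setting: y = theta^T x + w,
# w ~ N(0, sigma2), for which the negative log-likelihood (2) is a scaled
# squared loss plus a constant.

rng = np.random.default_rng(0)
d, N, sigma2 = 5, 500, 1.0
theta_star = rng.normal(size=d)            # true minimizer theta*_t of L_{U_t}
pool = rng.normal(size=(N, d))             # unlabeled pool S_t

def nll(y, x, theta):
    """ell(y | x, theta) = -log p(y | x, theta) for the Gaussian model, cf. (2)."""
    r = y - x @ theta
    return 0.5 * r**2 / sigma2 + 0.5 * np.log(2 * np.pi * sigma2)

def pool_risk(theta):
    """L_{U_t}(theta) in (3): average over the pool of E_Y[ell], where
    Y ~ p(y | x, theta*). For this model E_Y[ell] has the closed form below."""
    gap = pool @ (theta_star - theta)
    return np.mean(0.5 * (gap**2 + sigma2) / sigma2) + 0.5 * np.log(2 * np.pi * sigma2)

# theta* minimizes the pool risk, so any perturbation has positive excess risk:
excess = pool_risk(theta_star + 0.5) - pool_risk(theta_star)
```

The quantity `excess` is exactly the excess risk L_{U_t}(θ) − L_{U_t}(θ*_t) that the mean tracking criterion below constrains.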
As mentioned in (1), we assume that θ*_t changes at a bounded but unknown rate, i.e., ‖θ*_t − θ*_{t−1}‖ ≤ ρ for t ≥ 2.

The quality of our approximate minimizers θ̂_t is evaluated through a mean tracking criterion, which requires that the excess risk of θ̂_t be bounded at each time step t, i.e.,

E[L_{U_t}(θ̂_t) − L_{U_t}(θ*_t)] ≤ ε. (4)

Thus, our goal is to actively and adaptively select the smallest number of samples K_t in S_t whose labels to query, and to sequentially construct an estimate θ̂_t satisfying the above mean tracking criterion at each time step t. Note that querying the label of the same sample multiple times is allowed.

Let Γ_t be an arbitrary sampling distribution on S_t. Then the following MLE using Γ_t,

θ̂_{Γ_t} ≜ argmin_{θ∈Θ} (1/K_t) Σ_{k=1}^{K_t} ℓ(Y_{k,t} | X_{k,t}, θ), (5)

can be viewed as an empirical risk minimizer (ERM) of (3), where X_{k,t} ∼ Γ_t and Y_{k,t} ∼ p(Y | X_{k,t}, θ*_t). To ensure that our algorithm works correctly, we require the following assumption on the Hessian matrix of ℓ(y | x, θ), which determines the Fisher information matrix.

Assumption 1.
For any x ∈ X, y ∈ Y, θ ∈ Θ, the Hessian H(x, θ) ≜ ∂²ℓ(y | x, θ)/∂θ² is a function of only x and θ and does not depend on y.

Assumption 1 holds for many practical models, such as generalized linear models, logistic regression and conditional random fields [6]. Moreover, for θ ∈ Θ, we denote by I_{Γ_t}(θ) ≜ E_{X∼Γ_t}[H(X, θ)] the Fisher information matrix under the sampling distribution Γ_t.

3 The Active and Adaptive Sequential Learning Algorithm

The main idea of our algorithm is to adaptively choose the number of samples K_t, based on the estimated change in the minimizers ρ̂_{t−1}, such that the mean tracking criterion in (4) is satisfied; then to actively query the labels of these K_t samples with a well-designed sampling distribution Γ_t; and finally to perform the MLE in (5) using a stochastic gradient descent (SGD) algorithm over the labeled samples. By executing these steps iteratively, we sequentially learn θ̂_t over all the considered time steps. The procedure is formally presented in Algorithm 1.

To ensure good performance with a limited number of queried samples, it is essential to construct Γ_t carefully. By Lemma 1 in Section 4.2, the convergence rate of the excess risk for the ERM using K_t samples from Γ_t is Tr(I_{Γ_t}^{−1}(θ*_t) I_{U_t}(θ*_t))/K_t. Thus, the optimal sampling distribution Γ*_t should be the one that minimizes Tr(I_{Γ_t}^{−1}(θ*_t) I_{U_t}(θ*_t)), which relies on the unknown parameter θ*_t. Based on the bounded nature of the change in (1), we resolve this by approximating θ*_t with θ̂_{t−1} and generating the sampling distribution Γ̂*_t by minimizing Tr(I_{Γ_t}^{−1}(θ̂_{t−1}) I_{U_t}(θ̂_{t−1})) (Step 1).

Then, as shown in Section 4.3, we use the minimum number of samples K*_t such that the mean tracking criterion is satisfied, and actively draw samples from ¯Γ_t to estimate θ̂_t (Steps 2-4). Note that the distribution Γ̂*_t is modified slightly to ¯Γ_t in Step 3 to ensure that it still has full support on S_t.

Algorithm 1 Active and Adaptive Sequential Learning
Input:
Sample pool S_t = {x_{1,t}, · · · , x_{N,t}}, the previous estimates θ̂_{t−1}, ρ̂_{t−1}, and the desired mean tracking accuracy ε.
1: Solve the following semidefinite programming problem (see Section 4.2):

Γ̂*_t = argmin_{Γ_t ∈ R^{N_t}} Tr[I_{Γ_t}^{−1}(θ̂_{t−1}) I_{U_t}(θ̂_{t−1})]
s.t. I_{Γ_t}(θ̂_{t−1}) = Σ_{i=1}^{N_t} Γ_{i,t} H(x_{i,t}, θ̂_{t−1}), Σ_{i=1}^{N_t} Γ_{i,t} = 1, Γ_{i,t} ∈ [0, 1].
2: Choose K*_t based on ρ̂_{t−1} as the minimum number of samples required to meet the mean tracking criterion (see Section 4.3).
3: Generate K*_t samples using the distribution ¯Γ_t = α_t Γ̂*_t + (1 − α_t) U_t on the unlabeled data pool S_t, where α_t ∈ (0, 1). Query their labels to obtain the labeled set S′_t = {(x_{k,t}, y_{k,t})}_{k=1}^{K*_t}.
4: Solve the MLE over the labeled set S′_t with an SGD algorithm initialized at θ̂_{t−1}:

θ̂_t = argmin_{θ_t ∈ Θ} Σ_{(x_{k,t}, y_{k,t}) ∈ S′_t} ℓ(y_{k,t} | x_{k,t}, θ_t).
5: Update the estimate ρ̂_t using the estimator defined in Section 4.4 for all t ≥ 2.
Output: θ̂_t, ρ̂_t.

Finally, based on the current and previous estimates θ̂_t and θ̂_{t−1}, we update the estimate ρ̂_t of the bounded change rate using the estimator proposed in Section 4.4.

The active nature of Algorithm 1 comes from the active sampling distribution, which is constructed by minimizing the Fisher information ratio in Step 1. The adaptivity of Algorithm 1 is more subtle and arises in three ways: 1) the sampling distribution adapts to the bounded change through the replacement of θ*_t with θ̂_{t−1} in Step 1; 2) the sample size selection rule adapts through the choice of the minimum number of samples required in Step 2; 3) the SGD adapts through its initialization at θ̂_{t−1} in Step 4.

4 Theoretical Analysis

In this section, we present the theoretical analysis of Algorithm 1. We first introduce the assumptions needed. Then, in Section 4.2, we analyze the active sampling distribution. In Section 4.3, we present theoretical guarantees for the sample size selection rule that meets the mean tracking criterion in (4). In Section 4.4, we describe the proposed estimator ρ̂_t. The proofs of the theorems and all supporting lemmas are presented in the Appendices.

4.1 Assumptions

For the purpose of analysis, the following regularity assumption on the log-likelihood function ℓ is required to establish the standard local asymptotic normality of the MLE [18].
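As a concrete illustration, the core of Algorithm 1 (Steps 1, 3 and 4) can be sketched for a linear model, where the Hessians H(x) = xx⊤ do not depend on θ, so Step 1 needs no plug-in approximation. A Frank-Wolfe iteration over the simplex stands in here for a generic SDP solver (the objective Tr(I_Γ^{−1} I_U) is convex in Γ); the pool, constants and step sizes are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
d, N, K, alpha = 3, 50, 200, 0.9
theta_prev = rng.normal(size=d)                      # previous estimate
theta_star = theta_prev + 0.1 * rng.normal(size=d)   # drifted true minimizer
pool = rng.normal(size=(N, d)) * np.array([3.0, 1.0, 0.3])  # anisotropic pool S_t

H = np.einsum('ni,nj->nij', pool, pool)  # per-sample Hessians H(x_i) = x_i x_i^T
I_U = H.mean(axis=0)                     # Fisher information under uniform sampling

def fw_design(H, I_U, iters=300):
    """Step 1 via Frank-Wolfe on the simplex: minimize Tr(I_Gamma^{-1} I_U).
    Returns the best iterate seen, so it is never worse than uniform."""
    N = H.shape[0]
    gamma = np.full(N, 1.0 / N)
    best, best_val = gamma.copy(), np.inf
    for k in range(iters):
        I_G = np.einsum('n,nij->ij', gamma, H)
        val = np.trace(np.linalg.solve(I_G, I_U))
        if val < best_val:
            best_val, best = val, gamma.copy()
        M = np.linalg.solve(I_G, I_U) @ np.linalg.inv(I_G)  # I_G^{-1} I_U I_G^{-1}
        grads = -np.einsum('nij,ji->n', H, M)               # d val / d gamma_n
        eta = 2.0 / (k + 3)                                 # keeps all weights > 0
        gamma *= (1 - eta)
        gamma[np.argmin(grads)] += eta
    return best

gamma_hat = fw_design(H, I_U)
mixed = alpha * gamma_hat + (1 - alpha) / N          # Step 3: mixture with uniform
idx = rng.choice(N, size=K, p=mixed)                 # active label queries
y = pool[idx] @ theta_star + 0.1 * rng.normal(size=K)

theta = theta_prev.copy()                            # Step 4: SGD warm start
for x_k, y_k in zip(pool[idx], y):
    theta -= 0.01 * (x_k @ theta - y_k) * x_k        # squared-loss gradient step
```

For a uniform distribution the objective equals d, so the returned design is at least as good as passive sampling; the warm start at the previous estimate is what makes the SGD step adaptive.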
Assumption 2 (Regularity conditions). 1. Regularity conditions for the MLE:
(a)
Compactness: Θ is compact and θ*_t is an interior point of Θ for each t.
(b) Smoothness: ℓ(y | x, θ) is smooth in the following sense: its first, second and third derivatives with respect to θ exist at all interior points of Θ.
(c) Strong convexity: For each t and θ ∈ Θ, I_{U_t}(θ) ⪰ mI with m > 0, and hence I_{U_t}(θ) is positive definite and invertible.
(d) Boundedness: For all θ ∈ Θ, the largest eigenvalue of I_{U_t}(θ) is upper bounded by L_b.

2. Concentration at θ*_t: For all t, and any x_t ∈ S_t, y_t ∈ Y,

‖∇ℓ(y_t | x_t, θ*_t)‖_{I_{U_t}(θ*_t)^{−1}} ≤ L₁ and ‖I_{U_t}(θ*_t)^{−1/2} H(x_t, θ*_t) I_{U_t}(θ*_t)^{−1/2}‖ ≤ L₂ (6)

hold with probability one.

3. Lipschitz continuity: For all t, there exist a neighborhood B_t of θ*_t and a constant L₃ such that for all x_t ∈ S_t, H(x_t, ·) is L₃-Lipschitz in this neighborhood, namely,

‖I_{U_t}(θ*_t)^{−1/2} (H(x_t, θ) − H(x_t, θ′)) I_{U_t}(θ*_t)^{−1/2}‖ ≤ L₃ ‖θ − θ′‖_{I_{U_t}(θ*_t)} (7)

holds for θ, θ′ ∈ B_t.

In addition, we need the following assumption to prove that replacing θ*_t with θ̂_{t−1} in Algorithm 1 does not change the performance of the active learning algorithm in terms of the convergence rate. This assumption is satisfied by many classes of models, including the generalized linear model [6].

Assumption 3 (Point-wise self-concordance). For all t, there exists a constant L₄ such that

−L₄ ‖θ_t − θ*_t‖ H(x, θ*_t) ⪯ H(x, θ_t) − H(x, θ*_t) ⪯ L₄ ‖θ_t − θ*_t‖ H(x, θ*_t). (8)

4.2 Active Sampling Distribution

In this subsection, we provide the intuition behind and analysis of Step 1 in Algorithm 1.
The construction of the active sampling distribution Γ_t is motivated by the following lemma, which characterizes the convergence rate of the ERM solution θ̂_{Γ_t} defined in (5) when ρ and θ*_{t−1} are known.

Lemma 1.
Suppose Assumptions 1 and 2 hold, and let Θ_t ≜ {θ_t : ‖θ_t − θ*_{t−1}‖ ≤ ρ}. For any sampling distribution Γ_t on S_t, suppose that I_{Γ_t}(θ*_t) ⪰ C I_{U_t}(θ*_t) holds for some constant C < 1. Then, for K_t sufficiently large that γ_t ≜ O((1/C)(L₂L₃ + √L₂)√(log(dK_t)/K_t)) < 1, the excess risk of θ̂_{Γ_t} can be bounded as

(1 − γ_t) τ_t/K_t − L₁²/(C K_t²) ≤ E[L_{U_t}(θ̂_{Γ_t}) − L_{U_t}(θ*_t)] ≤ (1 + γ_t) τ_t/K_t + 2 L_b ρ²/K_t² (9)

for all t, where τ_t ≜ Tr(I_{Γ_t}^{−1}(θ*_t) I_{U_t}(θ*_t)).

In practice, the parameter space Θ_t = {θ_t : ‖θ_t − θ*_{t−1}‖ ≤ ρ} is unknown, and the ERM solution of (5) cannot be obtained directly for computational reasons. To address these issues, we can apply an optimization algorithm such as SGD to find approximate minimizers in the original parameter space Θ, initialized at θ̂_{t−1}. Thus, we build Algorithm 1 and our theoretical results around the SGD algorithm (which incidentally achieves the optimal convergence rate for the ERM). We need the following assumption on the optimization algorithm used to solve (5):

Assumption 4.
Given an optimization algorithm that generates an approximate loss minimizer θ̂_t ≜ A(θ̂_{t−1}, {∇_θ ℓ(y_{k,t} | x_{k,t}, θ)}_{k=1}^{K_t}) using K_t stochastic gradients with initialization at θ̂_{t−1}, if E‖θ̂_{t−1} − θ*_t‖² ≤ ∆_t², there exists a function b(τ_t, ∆_t, K_t) such that

E[L_{U_t}(θ̂_t)] − L_{U_t}(θ*_t) ≤ b(τ_t, ∆_t, K_t), (10)

where b(τ_t, ∆_t, K_t) is monotonically increasing in τ_t, ∆_t and 1/K_t.

The bound b(τ_t, ∆_t, K_t) depends on the convergence rate through τ_t and on the expected difference ∆_t between the initialization and the true minimizer, which correspond to the first and second terms in the upper bound of Lemma 1, respectively. As an example of this type of bound, for the Streaming Stochastic Variance Reduced Gradient (Streaming SVRG) algorithm in [11], it holds that

b(τ_t, ∆_t, K_t) = C₁ τ_t/K_t + C₂ (∆_t/K_t)² (11)

with constants C₁ and C₂. In addition, the paper [20] contains several examples of the bound b(τ_t, ∆_t, K_t) for other variants of SGD.

The following theorem characterizes, in the order sense, the convergence rate obtained with the active sampling distribution used in Algorithm 1.

Theorem 1. Suppose Assumptions 1-4 hold, and let β_t ≜ L₄(ρ + (1/δ)√(ε/m)) < 1. Then, the excess risk of θ̂_t in Algorithm 1 is upper bounded by

E[L_{U_t}(θ̂_t) − L_{U_t}(θ*_t)] ≤ b(´τ_t, ∆_t, K_t), (12)

with probability 1 − δ, where

´τ_t = ((1 + β_t)/(1 − β_t)) Tr(I_{Γ*_t}^{−1}(θ*_t) I_{U_t}(θ*_t))/α_t, ∆_t = √(ε/m) + ρ, (13)

δ ∈ (0, 1), and Γ*_t is the optimal sampling distribution minimizing Tr(I_{Γ*_t}^{−1}(θ*_t) I_{U_t}(θ*_t)).

Remark 1.
A comparison between Theorem 1 and Lemma 1 shows that the convergence rate of Algorithm 1, which approximates θ*_t with θ̂_{t−1} in Step 1, matches that of the ERM solution with high probability, as long as the change in the minimizers ρ is small enough, i.e., L₄(ρ + (1/δ)√(ε/m)) < 1. In certain cases, such as the linear regression model, the Hessian matrices are independent of θ*_t; then no approximation is needed in constructing the sampling distribution, and Algorithm 1 is rate optimal.

4.3 Sample Size Selection

In this subsection, we explain and analyze the sample size selection rule of Step 2 in Algorithm 1. The idea starts with the bound b(τ_t, ∆_t, K_t) from Assumption 4. If we could compute τ_t and ∆_t, the sample size K_t could be determined by requiring b(τ_t, ∆_t, K_t) ≤ ε, which satisfies the mean tracking criterion. However, θ*_t in τ_t = Tr(I_{Γ_t}^{−1}(θ*_t) I_{U_t}(θ*_t)) is unknown in practice. Although we can approximate θ*_t by θ̂_{t−1} as in Step 1, the resulting upper bound only holds with high probability, as shown in Theorem 1, so the mean tracking criterion would only be satisfied with high probability. To avoid this issue, we use the fact that Tr(I_{Γ*_t}^{−1}(θ*_t) I_{U_t}(θ*_t)) ≤ Tr(I_{U_t}^{−1}(θ*_t) I_{U_t}(θ*_t)) = d (recall that d is the dimension of the parameters) to form a conservative bound b(d, ∆_t, K_t) for choosing K_t, which is valid for the uniform sampling distribution U_t.

To bound the difference ∆_t between the initialization and the true minimizer, we have the inequality E‖θ̂_{t−1} − θ*_t‖² ≤ (√(ε/m) + ρ)², which follows from the triangle inequality, Jensen's inequality and the strong convexity in Assumption 2. This implies that we may take ∆_t = √(ε/m) + ρ.

Therefore, if ρ is known, we can set K*_t = min{K ≥ 1 : b(d, √(ε/m) + ρ, K) ≤ ε} for t ≥ 2 to ensure that E[L_{U_t}(θ̂_t) − L_{U_t}(θ*_t)] ≤ ε.
For t = 1, we can always use diameter(Θ) to bound ∆₁ and select K₁. In general, if ρ is much smaller than diameter(Θ), then significantly fewer samples K_t are required to meet the mean tracking criterion for t ≥ 2.

When the change in the minimizers ρ is unknown, we replace ρ with an estimate ρ̂_{t−1} to select the sample size. The following theorem characterizes the convergence guarantee obtained using the sample size selection rule of Step 2 in Algorithm 1 together with the estimator ρ̂_t of Section 4.4.

Theorem 2. If

K_t ≥ K*_t ≜ min{K ≥ 1 : b(d, √(ε/m) + ρ̂_{t−1}, K) ≤ ε}, (14)

then

lim sup_{t→∞} (E[L_{U_t}(θ̂_t)] − L_{U_t}(θ*_t)) ≤ ε almost surely.

4.4 Estimating the Change in the Minimizers

In this subsection, we construct an estimate ρ̂_t of the change in the minimizers ρ using the active learning samples, for Step 5 in Algorithm 1. We first construct an estimate ρ̃_t of the one-step change ‖θ*_{t−1} − θ*_t‖. As a consequence of strong convexity, the following lemma holds.

Lemma 2.
Suppose Assumption 2 holds. Then

‖θ*_{t−1} − θ*_t‖² ≤ (1/m)[L_{U_t}(θ*_{t−1}) − L_{U_t}(θ*_t) + L_{U_{t−1}}(θ*_t) − L_{U_{t−1}}(θ*_{t−1})]. (15)

Motivated by Lemma 2, we construct the following one-step estimate of ρ:

ρ̃²_t = (1/m)[L̂_{U_t}(θ̂_{t−1}) − L̂_{U_t}(θ̂_t) + L̂_{U_{t−1}}(θ̂_t) − L̂_{U_{t−1}}(θ̂_{t−1})], (16)

where

L̂_{U_t}(θ̂_{t−1}) ≜ (1/K_t) Σ_{k=1}^{K_t} ℓ(Y_{k,t} | X_{k,t}, θ̂_{t−1}) / (N_t ¯Γ_t(X_{k,t})) (17)

is the empirical estimate of L_{U_t}(θ*_{t−1}). Note that we use the samples generated by the active learning distribution, i.e., X_{k,t} ∼ ¯Γ_t and Y_{k,t} ∼ p(Y | X_{k,t}, θ*_t); thus, following the idea of importance sampling [5], we normalize the estimate by the sampling distribution ¯Γ_t.

We then combine the one-step estimates into an overall estimate. The simplest combination would be ´ρ_t = max{ρ̃₂, · · · , ρ̃_t}. However, if each estimate ρ̃_j were, for example, an independent Gaussian random variable, this maximum would diverge as t → ∞. To avoid this issue, we use a class of functions h_W : R^W → R that are non-decreasing in their arguments and satisfy E[h_W(ρ̃_j, · · · , ρ̃_{j−W+1})] ≥ ρ. For example, h_W(ρ̃_j, · · · , ρ̃_{j−W+1}) = ((W+1)/W) max{ρ̃_j, · · · , ρ̃_{j−W+1}} satisfies these requirements. The combined estimate ´ρ_t is computed by applying h_W to a sliding window of the one-step estimates, i.e.,

´ρ_t = (1/(t−1)) Σ_{j=2}^{t} h_{min[W, j−1]}(ρ̃_j, ρ̃_{j−1}, · · · , ρ̃_{j−W+1}). (18)

The following theorem characterizes the performance of the proposed estimator in (18).

Theorem 3.
Suppose Assumptions 1 and 2 hold, and there exists a sequence {r_t} satisfying

Σ_{t=1}^{∞} exp{−m²(t−1)r_t² / (L_b Diameter(Θ))²} < ∞.

Then, for all t large enough, ρ̂_t ≜ ´ρ_t + D_t + r_t ≥ ρ almost surely, where D_t is a constant.

5 Experiments

In this section, we present two experiments to validate our algorithm and the related theoretical results: tracking a synthetic regression model, and tracking time-varying user preferences in a recommendation system. Additional experiments on binary classification are presented in the Appendices. We use three baseline algorithms for comparison: passive adaptive, active random, and passive random. Compared with Algorithm 1,
Passive means drawing new samples from the uniform distribution U_t in Step 3, and Random means replacing the estimate θ̂_{t−1} with a random point from Θ in Steps 1 and 4. All reported results are averaged over 1000 Monte Carlo trials. The sizes of the sample pools are the same for all tested algorithms, with N_t = 500, and the number of time steps is 25. We construct the active sampling distribution from the exact solution of the SDP problem in Step 1; note that the approximation algorithms for the SDP introduced in [16] can be applied to accelerate this step. We set K_t = K*_t for all tested algorithms and use the estimator defined in Section 4.4 with window size W = 3 to estimate ρ.

5.1 Synthetic Regression

The model of the synthetic regression problem is y_t = θ_t⊤x_t + w_t, where the input variable x_t is a 5-dimensional zero-mean Gaussian vector and w_t is zero-mean Gaussian noise. We learn the parameter θ_t by minimizing the negative log-likelihood ℓ(y_{k,t} | x_{k,t}, θ_t) = (y_{k,t} − θ_t⊤x_{k,t})². In the simulations, the change of the true minimizers is ρ = 10, and the target excess risk is ε = 1. To highlight the time-varying nature of the problem, we implement an "all samples up front" baseline that uses Σ_t K*_t samples at the first time step and keeps this time-invariant regression model for the rest of the time steps. (A choice of r_t greater than 1/√(t−1) in the order sense works here.)

With K*_t new samples, the passive adaptive algorithm meets the mean tracking criterion, and our proposed active and adaptive learning algorithm outperforms all the other algorithms. The "all samples up front" algorithm performs best initially, but it fails to track the time-varying underlying model after only a few time steps.
Moreover, the excess risk of the active random algorithm is almost the same as that of the active adaptive algorithm, since the Hessian matrices in the regression task are independent of θ_t. In this case, no approximation is needed, and the change rate ρ in the regression task can be arbitrarily large, as noted in Remark 1. Fig. 1(b) shows that ρ̂_t converges to a conservative estimate of ρ, which verifies Theorem 3. Moreover, the corresponding number of samples determined by Theorem 2 is depicted in Fig. 1(c); it shrinks adaptively as ρ̂_t converges.

Figure 1: Experiments on synthetic regression: (a) Excess risk. (b) Estimated rate of change of the minimizers. (c) Number of samples. Experiments on user preference tracking using Yelp data: (d) Excess risk. (e) Estimated rate of change of the minimizers. (f) Classification error.
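The sample sizes in Fig. 1(c) are produced by the rule in Theorem 2. The following sketch computes that rule for the synthetic setting, assuming the Streaming-SVRG-style bound of (11); the constants C₁ = C₂ = 1 are illustrative, not the paper's values.

```python
import math

def b(tau, delta, K, C1=1.0, C2=1.0):
    """Assumed bound b(tau, delta, K) = C1*tau/K + C2*(delta/K)^2, as in (11)."""
    return C1 * tau / K + C2 * (delta / K) ** 2

def sample_size(d, eps, m, rho_hat, K_max=10**6):
    """Smallest K with b(d, sqrt(eps/m) + rho_hat, K) <= eps, cf. (14)."""
    delta = math.sqrt(eps / m) + rho_hat
    for K in range(1, K_max + 1):
        if b(d, delta, K) <= eps:
            return K
    raise ValueError("no feasible K below K_max")

# Paper's synthetic setting: d = 5, target excess risk eps = 1, rho estimated.
K_star = sample_size(d=5, eps=1.0, m=1.0, rho_hat=10.0)
```

As the estimate ρ̂_t shrinks, so does the returned K*_t, which is the adaptive shrinking of the sample size visible in Fig. 1(c).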
5.2 User Preference Tracking with Yelp Data

We use a subset of the Yelp 2017 dataset for our experiments. We filter the original dataset so that each user has at least 10 ratings; after filtering, our dataset contains the ratings of M = 473 users for N = 858 businesses. Converting the original 5-scale ratings to binary labels, with high ratings (4 and 5) as positive (+1) and low ratings (3 and below) as negative (−1), yields the N × M binary rating matrix R, which is very sparse: only a small fraction of its entries are observed. We complete the sparse matrix R to make recommendations using the matrix factorization method [12]. The rating matrix R is modeled by the following logistic regression model:

p(R_{u,b} | φ_b, φ_u) = 1/(1 + exp(−R_{u,b} φ_u⊤φ_b)), (19)

where φ_u and φ_b are the d-dimensional latent vectors representing the preferences of user u and the properties of business b, respectively. We train φ_u and φ_b with dimension d = 5 for each user and business in the dataset by maximum likelihood estimation with SGD. With the learned latent vectors, we can complete the matrix R and make recommendations to customers in a collaborative filtering fashion [9, 15].

In practice, the preferences of users φ_{u,t} may vary with time t, and hence user features need to be retrained. Since acquiring new ratings from users can be expensive, we apply our algorithm, using a subset of the business features {φ_b} of size N_t as our unlabeled data pool, while the remaining features serve as a test set to evaluate the algorithms. To model the bounded time-varying changes of the user preferences φ_{u,t}, we start from a randomly chosen user feature and update it at each time step by adding a random Gaussian drift with norm bounded by 0.1. Since we cannot retrieve the actual answer from a real user, we instead generate the labels from the probabilistic model (19) with the true parameter φ_{u,t}.
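A minimal sketch of this retraining step under the logistic model (19): a drifted user vector is refit from a few queried ±1 ratings by SGD on the negative log-likelihood, warm-started at its previous estimate. The sizes, learning rate and drift magnitude are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
d, N, K = 5, 100, 13
phi_b = rng.normal(size=(N, d))                    # fixed business features
phi_u_true = rng.normal(size=d)                    # drifted true user preference
phi_u = phi_u_true + 0.1 * rng.normal(size=d)      # previous estimate, warm start

def p_pos(phi_u, phi_b):
    """P(R = +1 | phi_b, phi_u) under the logistic model (19)."""
    return 1.0 / (1.0 + np.exp(-(phi_b @ phi_u)))

idx = rng.choice(N, size=K, replace=False)         # queried businesses
r = np.where(rng.random(K) < p_pos(phi_u_true, phi_b[idx]), 1.0, -1.0)

for x, y in zip(phi_b[idx], r):                    # SGD on -log p(r | x, phi_u)
    grad = -y * x / (1.0 + np.exp(y * (x @ phi_u)))
    phi_u -= 0.1 * grad
```

Because the logistic Hessians depend on the parameter, the actively designed sampling distribution genuinely differs from uniform in this task, which is the point made about Fig. 1(d)-(f).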
Note that one cannot ask a user the same question twice in a real recommendation system, so we implement without-replacement sampling by querying the labels of the samples with the largest K*_t values in the active sampling distribution ¯Γ_t.

Fig. 1(e) shows that ρ̂_t converges to a conservative estimate of ρ, and the corresponding sample size converges to K*_t = 13 after two time steps. Fig. 1(d) and Fig. 1(f) show that our algorithm achieves an error rate of 6% with these samples and significantly outperforms the other algorithms. This is because the Hessian matrices of logistic regression are functions of θ_t, and hence the sampling distribution generated by the active and adaptive algorithm selects more informative samples.

6 Conclusion

In this paper, we proposed an active and adaptive learning framework for solving a sequence of learning problems, which ensures a bounded excess risk for each individual learning task when the number of time steps is sufficiently large. We constructed an estimator ρ̂_t of the change in the minimizers using active learning samples and showed that this estimate upper bounds the true parameter ρ almost surely. We tested our algorithm on a synthetic regression problem, and further applied it to a recommendation system that tracks changes in the preferences of customers. Our experiments demonstrate that our algorithm achieves better performance than the other baseline algorithms.

References

[1] A. Agarwal, A. Rakhlin, and P. Bartlett. Matrix regularization techniques for online multitask learning. Technical Report UCB/EECS-2008-138, EECS Department, University of California, Berkeley, 2008.
[2] A. Agarwal, S. Gerber, and H. Daume. Learning multiple tasks using manifold regularization. In
Advances in Neural Information Processing Systems 23, pages 46–54, 2010.
[3] R. G. Antonini and Yu. V. Kozachenko. A note on the asymptotic behavior of sequences of generalized sub-Gaussian random vectors. Random Operators and Stochastic Equations, 13(1):39–52, 2005.
[4] M.-F. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. In Proceedings of the 23rd International Conference on Machine Learning, pages 65–72. ACM, 2006.
[5] O. Cappé, R. Douc, A. Guillin, J.-M. Marin, and C. P. Robert. Adaptive importance sampling in general mixture classes. Statistics and Computing, 18(4):447–459, 2008.
[6] K. Chaudhuri, S. M. Kakade, P. Netrapalli, and S. Sanghavi. Convergence rates of active learning for maximum likelihood estimation. In Advances in Neural Information Processing Systems 28, pages 1090–1098, 2015.
[7] J. A. Cornell. Experiments with Mixtures: Designs, Models, and the Analysis of Mixture Data. John Wiley & Sons, 2011.
[8] S. Dasgupta. Coarse sample complexity bounds for active learning. In Advances in Neural Information Processing Systems 18, pages 235–242, 2006.
[9] M. Elahi, F. Ricci, and N. Rubens. A survey of active learning in collaborative filtering recommender systems. Computer Science Review, 20:29–50, 2016.
[10] T. Evgeniou and M. Pontil. Regularized multi-task learning. In Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 109–117, 2004.
[11] R. Frostig, R. Ge, S. M. Kakade, and A. Sidford. Competing with the empirical risk minimizer in a single pass. In Conference on Learning Theory, pages 728–763, 2015.
[12] Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8), 2009.
[13] M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of Machine Learning. MIT Press, 2012.
[14] S. J. Pan and Q. Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, 2010.
[15] N. Rubens, M. Elahi, M. Sugiyama, and D. Kaplan. Active learning in recommender systems. In Recommender Systems Handbook, pages 809–846. Springer, 2015.
[16] J. Sourati, M. Akcakaya, T. K. Leen, D. Erdogmus, and J. G. Dy. Asymptotic analysis of objectives based on Fisher information in active learning. Journal of Machine Learning Research, 18(34):1–41, 2017.
[17] Z. J. Towfic, J. Chen, and A. H. Sayed. On distributed online classification in the midst of concept drifts. Neurocomputing, 112:138–152, 2013.
[18] A. W. van der Vaart. Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2000.
[19] C. Wilson and V. V. Veeravalli. Adaptive sequential optimization with applications to machine learning. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pages 2642–2646, 2016.
[20] C. Wilson, V. V. Veeravalli, and A. Nedich. Adaptive sequential stochastic optimization. IEEE Transactions on Automatic Control, 2018.
[21] Y. Zhang and D. Yeung. A convex formulation for learning task relationships in multi-task learning. arXiv preprint arXiv:1203.3536, 2012.

A Proof of Lemma 1
To prove Lemma 1, we use the following result from [11]. In particular, the following lemma is a generalization of Theorem 5.1 in [11]; its proof follows by generalizing the derivation of that theorem and is omitted here.
Lemma 3.
Suppose ψ_1(θ), ..., ψ_K(θ): R^d → R are random functions drawn i.i.d. from a distribution, where θ ∈ Θ ⊆ R^d. Denote P(θ) = E[ψ_1(θ)], and let Q(θ): R^d → R be another function. Let

 θ̂ = argmin_{θ∈Θ} (1/K) Σ_{k=1}^K ψ_k(θ), and θ* = argmin_{θ∈Θ} P(θ).

Assume:
1. Regularity conditions:
 (a) Compactness: Θ is compact, and θ* is an interior point of Θ.
 (b) Smoothness: ψ_1(θ) is smooth in the following sense: the first, second and third derivatives exist at all interior points of Θ with probability one.
 (c) Convexity: ψ_1(θ) is convex with probability one, and ∇²P(θ*) is positive definite.
 (d) ∇P(θ*) = 0 and ∇Q(θ*) = 0.
2. Concentration at θ*: Suppose ‖∇ψ_1(θ*)‖_{(∇²P(θ*))^{-1}} ≤ L'_1 and ‖(∇²P(θ*))^{-1/2} ∇²ψ_1(θ*) (∇²P(θ*))^{-1/2}‖ ≤ L'_2 hold with probability one.
3. Lipschitz continuity: There exist a neighborhood B of θ* and a constant L'_3, such that ∇²ψ_1(θ) and ∇²Q(θ) are L'_3-Lipschitz in this neighborhood, namely,

 ‖(∇²P(θ*))^{-1/2} (∇²ψ_1(θ) − ∇²ψ_1(θ')) (∇²P(θ*))^{-1/2}‖ ≤ L'_3 ‖θ − θ'‖_{∇²P(θ*)},
 ‖(∇²Q(θ*))^{-1/2} (∇²Q(θ) − ∇²Q(θ')) (∇²Q(θ*))^{-1/2}‖ ≤ L'_3 ‖θ − θ'‖_{∇²P(θ*)},

hold with probability one, for θ, θ' ∈ B. Choose p ≥ 2 and define

 γ ≜ c (L'_1 L'_3 + √(L'_2)) √(p log(dK)/K),

where c is an appropriately chosen constant. Let c' be another appropriately chosen constant. If K is large enough so that

 √(p log(dK)/K) ≤ c' min{ 1/√(L'_2), 1/(L'_1 L'_3), diameter(B) L'_3 },

then:

 (1 − γ) τ/K − L'_2/K^{p/2} ≤ E[Q(θ̂) − Q(θ*)] ≤ (1 + γ) τ/K + max_{θ∈Θ}[Q(θ) − Q(θ*)]/K^p,

where

 τ ≜ (1/(2K)) Tr( Σ_{i,j} E[∇ψ_i(θ*) ∇ψ_j(θ*)^⊤] (∇²P(θ*))^{-1} ∇²Q(θ*) (∇²P(θ*))^{-1} ).

Then, we proceed to prove Lemma 1.
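To see the leading term (1 ± γ)τ/K of Lemma 3 numerically, the following sketch (not from the paper; a hypothetical toy instance) takes ψ_k(θ) = ‖x_k − θ‖²/2 with x_k ~ N(μ, σ²I) and Q = P, so that θ̂ is the sample mean and τ = dσ²/2, and checks by Monte Carlo that E[Q(θ̂) − Q(θ*)] ≈ τ/K:

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, sigma2, trials = 3, 50, 1.0, 20000   # illustrative toy dimensions
mu = np.ones(d)                             # theta* = mu for this toy problem

excess = np.empty(trials)
for i in range(trials):
    x = rng.normal(mu, np.sqrt(sigma2), size=(K, d))
    theta_hat = x.mean(axis=0)                        # minimizer of (1/K) sum_k psi_k
    excess[i] = 0.5 * np.sum((theta_hat - mu) ** 2)   # Q(theta_hat) - Q(theta*)

tau = 0.5 * d * sigma2              # tau = d * sigma^2 / 2 for this toy instance
ratio = excess.mean() / (tau / K)   # should concentrate near 1
```

Here d, K, sigma2 and the number of trials are arbitrary illustrative choices; the point is only that the empirical excess risk matches the predicted leading term τ/K up to the (1 ± γ) correction.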
Proof of Lemma 1.
We first use Lemma 3 to bound the excess risk, which is similar to the idea of Lemma 1 in [6]. We first define

 ψ_k(θ_t) = ℓ(Y_{k,t} | X_{k,t}, θ_t), (20)

where X_{k,t} ∼ Γ̄_t and Y_{k,t} ∼ p(Y_{k,t} | X_{k,t}, θ*_t) for 1 ≤ k ≤ K_t. Then,

 P(θ_t) = E[ψ_k(θ_t)] = L_{Γ̄_t}(θ_t), and ∇²P(θ*_t) = I_{Γ̄_t}(θ*_t). (21)

Further, we choose

 Q(θ_t) = L_{U_t}(θ_t), and ∇²Q(θ*_t) = I_{U_t}(θ*_t). (22)

As shown in Assumption 2, the assumptions of Lemma 3 are satisfied. Moreover, according to the condition in Lemma 1 that I_{Γ̄_t}(θ*) ⪰ C I_{U_t}(θ*) holds for some constant C < 1, we have

 ‖I_{Γ̄_t}(θ*_t)^{-1/2} (H(x, θ) − H(x, θ')) I_{Γ̄_t}(θ*_t)^{-1/2}‖
  ≤ (1/C) ‖I_{U_t}(θ*_t)^{-1/2} (H(x, θ) − H(x, θ')) I_{U_t}(θ*_t)^{-1/2}‖
  ≤ (L_3/C) ‖θ − θ'‖_{I_{U_t}(θ*_t)} ≤ (L_3/C^{3/2}) ‖θ − θ'‖_{I_{Γ̄_t}(θ*_t)}, (23)

and

 ‖I_{U_t}(θ*_t)^{-1/2} (H(x, θ) − H(x, θ')) I_{U_t}(θ*_t)^{-1/2}‖ ≤ L_3 ‖θ − θ'‖_{I_{U_t}(θ*_t)} ≤ (L_3/√C) ‖θ − θ'‖_{I_{Γ̄_t}(θ*_t)}. (24)

Hence, L'_3 = max{L_3/C^{3/2}, L_3/√C} = L_3/C^{3/2}. Similarly, we have L'_1 = L_1/√C and L'_2 = L_2/C. In summary, Assumptions 2 and 3 of Lemma 3 are satisfied with constants

 (L'_1, L'_2, L'_3) = (L_1/√C, L_2/C, L_3/C^{3/2}). (25)

Applying Lemma 3 with p = 2 and using the fact that E_{x∼Γ̄_t}[∇ℓ(Y_{i,t}|X_{i,t}, θ*_t) ∇ℓ(Y_{i,t}|X_{i,t}, θ*_t)^⊤] = I_{Γ̄_t}(θ*_t), we obtain that

 (1 − γ_t) τ_t/K_t − L_2/(C K_t) ≤ E[L_{U_t}(θ̂_{Γ_t}) − L_{U_t}(θ*_t)] ≤ (1 + γ_t) τ_t/K_t + max_{θ∈Θ_t}[L_{U_t}(θ) − L_{U_t}(θ*_t)]/K_t² (26)

holds, where

 γ_t = O( (L'_1 L'_3 + √(L'_2)) √(log(dK_t)/K_t) ) = O( C^{-2} (L_1 L_3 + √(L_2)) √(log(dK_t)/K_t) ), (27)

and τ_t = (1/2) Tr( I_{Γ̄_t}(θ*_t)^{-1} I_{U_t}(θ*_t) ). Note that if we assume the parameter set Θ_t ≜ {θ_t : ‖θ_t − θ*_{t−1}‖ ≤ ρ} is known, then the second term on the right-hand side of (26) can be further bounded as

 max_{θ∈Θ_t}[L_{U_t}(θ) − L_{U_t}(θ*_t)]/K_t² ≤ max_{θ∈Θ_t}[L_b ‖θ − θ*_t‖]/K_t² ≤ L_b Diameter(Θ_t)/K_t² ≤ 2 L_b ρ/K_t², (28)

where the inequalities follow from the boundedness condition in Assumption 2. Combining this result with the inequality in (26) completes the proof of Lemma 1.

B Proof of Theorem 1
Proof of Theorem 1.
The proof starts from the bound b(τ, Δ_t, K_t) on the SGD algorithm in Assumption 4. To compute the convergence rate, we first need to study how well θ̂_{t−1} approximates θ*_t. The difference between θ̂_{t−1} and θ*_t can be bounded as

 ‖θ̂_{t−1} − θ*_t‖ ≤ ‖θ*_{t−1} − θ*_t‖ + ‖θ̂_{t−1} − θ*_{t−1}‖ ≤ ρ + ‖θ̂_{t−1} − θ*_{t−1}‖. (29)

To bound the second term, we use the strong convexity assumption in Assumption 2,

 ‖θ̂_{t−1} − θ*_{t−1}‖² ≤ (2/m) ( L_{U_{t−1}}(θ̂_{t−1}) − L_{U_{t−1}}(θ*_{t−1}) ). (30)

Suppose the excess risk bound E[L_{U_{t−1}}(θ̂_{t−1}) − L_{U_{t−1}}(θ*_{t−1})] ≤ ε holds at time t − 1. Then, we have

 E(‖θ̂_{t−1} − θ*_{t−1}‖) ≤ √( E(‖θ̂_{t−1} − θ*_{t−1}‖²) ) ≤ √(2ε/m). (31)

Then, ‖θ̂_{t−1} − θ*_{t−1}‖ ≤ (1/δ)√(2ε/m) holds with probability 1 − δ by Markov's inequality, for any δ ∈ (0, 1). Thus,

 ‖θ̂_{t−1} − θ*_t‖ ≤ ρ + (1/δ)√(2ε/m) (32)

holds with probability 1 − δ. By the self-concordance condition in Assumption 3, we have that

 (1 − β_t) H(x_t, θ*_t) ⪯ H(x_t, θ̂_{t−1}) ⪯ (1 + β_t) H(x_t, θ*_t), x_t ∈ S_t, (33)

holds with probability 1 − δ, where β_t = L(ρ + (1/δ)√(2ε/m)). Then, for the distributions Γ*_t, Γ̂*_t and U_t, we have

 (1 − β_t) I_{Γ*_t}(θ*_t) ⪯ I_{Γ*_t}(θ̂_{t−1}) ⪯ (1 + β_t) I_{Γ*_t}(θ*_t), (34)
 (1 − β_t) I_{Γ̂*_t}(θ*_t) ⪯ I_{Γ̂*_t}(θ̂_{t−1}) ⪯ (1 + β_t) I_{Γ̂*_t}(θ*_t), (35)
 (1 − β_t) I_{U_t}(θ*_t) ⪯ I_{U_t}(θ̂_{t−1}) ⪯ (1 + β_t) I_{U_t}(θ*_t). (36)

Recall that Γ̄_t = α_t Γ̂*_t + (1 − α_t) U_t. Hence, I_{Γ̄_t}(θ*_t) ⪰ α_t I_{Γ̂*_t}(θ*_t), which implies that I_{Γ̄_t}(θ*_t)^{-1} ⪯ (1/α_t) I_{Γ̂*_t}(θ*_t)^{-1}. Thus,

 τ_t = (1/2) Tr( I_{Γ̄_t}(θ*_t)^{-1} I_{U_t}(θ*_t) ) ≤ (1/(2α_t)) Tr( I_{Γ̂*_t}(θ*_t)^{-1} I_{U_t}(θ*_t) ). (37)

From (35) and (36), (37) can be further upper bounded by

 Tr( I_{Γ̂*_t}(θ*_t)^{-1} I_{U_t}(θ*_t) ) ≤ ((1 + β_t)/(1 − β_t)) Tr( I_{Γ̂*_t}(θ̂_{t−1})^{-1} I_{U_t}(θ̂_{t−1}) )
  ≤(a) ((1 + β_t)/(1 − β_t)) Tr( I_{Γ*_t}(θ̂_{t−1})^{-1} I_{U_t}(θ̂_{t−1}) )
  ≤(b) ((1 + β_t)/(1 − β_t))² Tr( I_{Γ*_t}(θ*_t)^{-1} I_{U_t}(θ*_t) ), (38)

where (a) holds because Γ̂*_t is the minimizer of Tr( I_Γ(θ̂_{t−1})^{-1} I_{U_t}(θ̂_{t−1}) ) and (b) follows from the results in (34) and (36). To bound the difference between the initialization and the true minimizer, we use the triangle inequality and Jensen's inequality to get

 √(E‖θ̂_{t−1} − θ*_t‖²) ≤ √(E‖θ̂_{t−1} − θ*_{t−1}‖²) + ‖θ*_t − θ*_{t−1}‖ ≤ √(E‖θ̂_{t−1} − θ*_{t−1}‖²) + ρ. (39)

From (31), we have

 E‖θ̂_{t−1} − θ*_{t−1}‖² ≤ 2ε/m, (40)

which yields

 E‖θ̂_{t−1} − θ*_t‖² ≤ (√(2ε/m) + ρ)² = Δ_t². (41)

Thus, combining the above result with the bound in (38), we can conclude that the upper bound

 E[L_{U_t}(θ̂_t) − L_{U_t}(θ*_t)] ≤ b(τ́_t, Δ_t, K_t) (42)

holds with probability 1 − δ, where

 τ́_t = ((1 + β_t)/(1 − β_t))² Tr( I_{Γ*_t}(θ*_t)^{-1} I_{U_t}(θ*_t) ) / (2α_t). (43)

This completes the proof of Theorem 1.

C Proof of Lemma 2
Proof of Lemma 2.
The following inequalities hold from the strong convexity assumption and the fact that ∇L_{U_t}(θ*_t) = ∇L_{U_{t−1}}(θ*_{t−1}) = 0:

 L_{U_t}(θ*_{t−1}) ≥ L_{U_t}(θ*_t) + (m/2) ‖θ*_t − θ*_{t−1}‖², (44)
 L_{U_{t−1}}(θ*_t) ≥ L_{U_{t−1}}(θ*_{t−1}) + (m/2) ‖θ*_t − θ*_{t−1}‖². (45)

Then, adding and rearranging these inequalities yields

 (1/m) [ L_{U_t}(θ*_{t−1}) − L_{U_t}(θ*_t) + L_{U_{t−1}}(θ*_t) − L_{U_{t−1}}(θ*_{t−1}) ] ≥ ‖θ*_t − θ*_{t−1}‖². (46)

Moreover, we have the following relation

 ‖θ*_t − θ*_{t−1}‖² ≤ (1/m) [ L_{U_t}(θ*_{t−1}) − L_{U_t}(θ*_t) + L_{U_{t−1}}(θ*_t) − L_{U_{t−1}}(θ*_{t−1}) ]
  = (1/m) [ E_{X∼U_t}[ D( p(Y|X, θ*_t) ‖ p(Y|X, θ*_{t−1}) ) ] + E_{X∼U_{t−1}}[ D( p(Y|X, θ*_{t−1}) ‖ p(Y|X, θ*_t) ) ] ], (47)

where

 D(p ‖ q) ≜ ∫_{y∈Y} p(y) log( p(y)/q(y) ) dy (48)

is the KL divergence between distributions p and q. Thus, an upper bound on ρ can be constructed by estimating the symmetric KL divergence between p(y|x, θ*_t) and p(y|x, θ*_{t−1}) using the data pools U_t and U_{t−1}, respectively.

D Proof of Theorem 3
To analyze the performance of the estimator of ρ, we need to introduce a few results for sub-Gaussian random variables, including the following key technical lemma from [3]. This lemma controls the concentration of sums of random variables that are sub-Gaussian conditioned on a particular filtration.

Lemma 4. Suppose we have a collection of random variables {V_i}_{i=1}^n and a filtration {F_i}_{i=0}^n such that for each random variable V_i it holds that
1. E[exp{s(V_i − E[V_i | F_{i−1}])} | F_{i−1}] ≤ e^{σ_i² s²/2} with σ_i a constant.
2. V_i is F_i-measurable.
Then for every a ∈ R^n it holds that

 P{ Σ_{i=1}^n a_i V_i > Σ_{i=1}^n a_i E[V_i | F_{i−1}] + t } ≤ exp{ −t²/(2ν²) },

with ν² = Σ_{i=1}^n σ_i² a_i². The other tail is similarly bounded. If we can upper bound the conditional expectations E[V_i | F_{i−1}] ≤ ξ_i by some constants ξ_i, then we have

 P{ Σ_{i=1}^n a_i V_i > Σ_{i=1}^n a_i ξ_i + t } ≤ exp{ −t²/(2ν²) }. (49)

For our analysis, we generally cannot compute E[V_i | F_{i−1}] directly, but we can find the upper bounds ξ_i. To compute σ_i for use in Lemma 4, we employ the following conditional version of Hoeffding's lemma.

Lemma 5. (Conditional Hoeffding's Lemma): If a random variable V and a sigma algebra F satisfy a ≤ V ≤ b and E[V | F] = 0, then E[e^{sV} | F] ≤ exp{ (b − a)² s²/8 }.

Proof of Theorem 3.
To simplify the proof, we look at the special case where ‖θ*_t − θ*_{t−1}‖ = ρ holds. The proof for the case ‖θ*_t − θ*_{t−1}‖ ≤ ρ is similar, and more details about the window function h_W can be found in [20]. For the case ‖θ*_t − θ*_{t−1}‖ = ρ, we use the following estimator to combine the one-step estimators ρ̃_i²:

 ρ́_t² = (1/(t−1)) Σ_{i=2}^t ρ̃_i² = (1/(m(t−1))) Σ_{i=2}^t ( L̂_{U_i}(θ̂_{i−1}) − L̂_{U_i}(θ̂_i) + L̂_{U_{i−1}}(θ̂_i) − L̂_{U_{i−1}}(θ̂_{i−1}) ). (50)

We denote

 ρ̄_t² ≜ (1/(m(t−1))) Σ_{i=2}^t ( L_{U_i}(θ*_{i−1}) − L_{U_i}(θ*_i) + L_{U_{i−1}}(θ*_i) − L_{U_{i−1}}(θ*_{i−1}) ) ≥ ρ², (51)

where the inequality follows from Lemma 2. We want to construct ρ̂_t such that ρ̂_t² ≥ ρ̄_t² ≥ ρ² almost surely. Then, we have

 ρ̄_t² − ρ́_t² = (1/(m(t−1))) ( Σ_{i=2}^t [ L_{U_i}(θ*_{i−1}) − L̂_{U_i}(θ̂_{i−1}) ] + Σ_{i=2}^t [ L_{U_{i−1}}(θ*_i) − L̂_{U_{i−1}}(θ̂_i) ] (52)
  + L̂_{U_1}(θ̂_1) − L_{U_1}(θ*_1) + 2 Σ_{i=2}^{t−1} [ L̂_{U_i}(θ̂_i) − L_{U_i}(θ*_i) ] + L̂_{U_t}(θ̂_t) − L_{U_t}(θ*_t) ). (53)

Define

 U_t ≜ (1/(m(t−1))) Σ_{i=2}^t ( L_{U_i}(θ*_{i−1}) − L̂_{U_i}(θ̂_{i−1}) ), (54)
 V_t ≜ (1/(m(t−1))) Σ_{i=2}^t ( L_{U_{i−1}}(θ*_i) − L̂_{U_{i−1}}(θ̂_i) ), (55)
 W_t ≜ (1/(m(t−1))) ( L̂_{U_1}(θ̂_1) − L_{U_1}(θ*_1) + 2 Σ_{i=2}^{t−1} ( L̂_{U_i}(θ̂_i) − L_{U_i}(θ*_i) ) + L̂_{U_t}(θ̂_t) − L_{U_t}(θ*_t) ). (56)

Then it holds that

 ρ̄_t² − ρ́_t² = U_t + V_t + W_t. (57)

Now, we look at bounding E[L_{U_i}(θ*_{i−1}) − L̂_{U_i}(θ̂_{i−1})], E[L_{U_{i−1}}(θ*_i) − L̂_{U_{i−1}}(θ̂_i)] and E[L̂_{U_i}(θ̂_i) − L_{U_i}(θ*_i)] in U_t, V_t and W_t, respectively. Note that the samples at time step i − 1 are independent of the samples at time i; hence,

 E[L̂_{U_i}(θ̂_{i−1})] = E[ E_{X_{k,i}∼Γ̄_i, Y_{k,i}∼p(Y|X_{k,i}, θ*_i)}[ (1/K_i) Σ_{k=1}^{K_i} ℓ(Y_{k,i}|X_{k,i}, θ̂_{i−1}) / ( N_i Γ̄_i(X_{k,i}) ) ] ]
  = E[ E_{X_i∼U_i, Y_i∼p(Y|X_i, θ*_i)}[ ℓ(Y_i|X_i, θ̂_{i−1}) ] ] = E[L_{U_i}(θ̂_{i−1})]. (58)

Thus,

 E[L_{U_i}(θ*_{i−1}) − L̂_{U_i}(θ̂_{i−1})] = E[L_{U_i}(θ*_{i−1}) − L_{U_i}(θ̂_{i−1})], (59)
 E[L_{U_{i−1}}(θ*_i) − L̂_{U_{i−1}}(θ̂_i)] = E[L_{U_{i−1}}(θ*_i) − L_{U_{i−1}}(θ̂_i)]. (60)

We use Lemma 3 to construct bounds for these two terms. Let Q(θ) = ( L_{U_i}(θ*_{i−1}) − L_{U_i}(θ) )², and

 ψ_k(θ) = ℓ(Y_k|X_k, θ), 1 ≤ k ≤ K_{i−1}, (61)

where X_k ∼ Γ̄_{i−1} and Y_k ∼ p(Y|X_k, θ*_{i−1}). It can be verified that

 θ̂_{i−1} = argmin_{θ∈Θ} (1/K_{i−1}) Σ_k ψ_k(θ), θ* = argmin_{θ∈Θ} P(θ) = argmin_{θ∈Θ} E[ψ_1(θ)] = θ*_{i−1}, (62)

and ∇Q(θ*_{i−1}) = 0. All the conditions in Lemma 3 are satisfied. We have

 ∇²P(θ*) = I_{Γ̄_{i−1}}(θ*_{i−1}), ∇²Q(θ*) = 2 I_{U_i}(θ*_{i−1}). (63)

Thus,

 ( E[L_{U_i}(θ*_{i−1}) − L_{U_i}(θ̂_{i−1})] )² ≤ E[ ( L_{U_i}(θ*_{i−1}) − L_{U_i}(θ̂_{i−1}) )² ]
  ≤ (1 + γ_{i−1}) Tr( I_{Γ̄_{i−1}}(θ*_{i−1})^{-1} I_{U_i}(θ*_{i−1}) )/K_{i−1} + max_{θ∈Θ}[ L_{U_i}(θ) − L_{U_i}(θ*_{i−1}) ]²/K_{i−1}² ≜ A_i. (64)

Similarly, we have

 ( E[L_{U_{i−1}}(θ*_i) − L_{U_{i−1}}(θ̂_i)] )² ≤ (1 + γ_i) Tr( I_{Γ̄_i}(θ*_i)^{-1} I_{U_{i−1}}(θ*_i) )/K_i + max_{θ∈Θ}[ L_{U_{i−1}}(θ) − L_{U_{i−1}}(θ*_i) ]²/K_i² ≜ B_i. (65)

For the term E[L̂_{U_i}(θ̂_i) − L_{U_i}(θ*_i)] in W_t, suppose that the samples used to estimate θ̂_i and the samples used to compute L̂_{U_i} are independent. This can be arranged by splitting the samples at each time step i. Note that this assumption is only required for the theoretical analysis; in practice, we use all the samples to estimate θ̂_i. Then, an argument similar to (58) holds, and we have

 E[L̂_{U_i}(θ̂_i) − L_{U_i}(θ*_i)] = E[L_{U_i}(θ̂_i) − L_{U_i}(θ*_i)] ≥ 0, (66)

where the inequality follows from the fact that θ*_i is the minimizer of L_{U_i}(θ). Applying the upper bound in Lemma 1, this term can be bounded as

 0 ≤ E[L_{U_i}(θ̂_i) − L_{U_i}(θ*_i)] ≤ (1 + γ_i) Tr( I_{Γ̄_i}(θ*_i)^{-1} I_{U_i}(θ*_i) )/(2K_i) + max_{θ∈Θ}[ L_{U_i}(θ) − L_{U_i}(θ*_i) ]/K_i² ≜ C_i. (67)

The resulting bounds on the expectations of U_t, V_t, and W_t, denoted by Ū_t, V̄_t, and W̄_t, are as follows:

 Ū_t = (1/(m(t−1))) Σ_{i=2}^t √(A_i), (68)
 V̄_t = (1/(m(t−1))) Σ_{i=2}^t √(B_i), (69)
 W̄_t = (1/(m(t−1))) ( C_1 + 2 Σ_{i=2}^{t−1} C_i + C_t ). (70)

These upper bounds play the role of the constants ξ_i mentioned in (49). Then it holds that

 P{ ρ̄_t² − ρ́_t² > Ū_t + V̄_t + W̄_t + r_t } = P{ U_t + V_t + W_t > Ū_t + V̄_t + W̄_t + r_t }
  ≤ P{ U_t > Ū_t + r_t/3 } + P{ V_t > V̄_t + r_t/3 } + P{ W_t > W̄_t + r_t/3 }. (71)

To bound these probabilities with (49), we first bound the moment generating functions using Lemma 5. The summands satisfy

 (1/m) | L̂_{U_i}(θ̂_i) − L_{U_i}(θ*_i) | ≤ (L_b/m) max_{θ∈Θ} ‖θ − θ*_i‖ ≤ (L_b/m) Diameter(Θ), (72)

and

 (1/m) | L_{U_i}(θ*_{i−1}) − L̂_{U_i}(θ̂_{i−1}) | ≤ (1/m) | L_{U_i}(θ*_{i−1}) − L_{U_i}(θ*_i) | + (1/m) | L_{U_i}(θ*_i) − L̂_{U_i}(θ̂_{i−1}) | ≤ (2L_b/m) Diameter(Θ). (73)

Then, we apply Lemma 4 and Lemma 5 with σ_i = (2L_b/m) Diameter(Θ) for the terms in U_t and V_t, and with σ_i = (L_b/m) Diameter(Θ) for the terms in W_t, respectively. We have

 ν_U² = ν_V² = ( (2L_b/m) Diameter(Θ) )² Σ_{i=2}^t 1/(t−1)² = 4 L_b² Diameter(Θ)² / (m²(t−1)), (74)
 ν_W² ≤ ( (L_b/m) Diameter(Θ) )² Σ_{i=1}^t 4/(t−1)² ≤ 8 L_b² Diameter(Θ)² / (m²(t−1)). (75)

Let D_t ≜ Ū_t + V̄_t + W̄_t. Then we obtain

 P{ ρ̄_t² > ρ́_t² + D_t + r_t } ≤ 3 exp{ −c m²(t−1) r_t² / ( L_b² Diameter(Θ)² ) }, (76)

for an absolute constant c. Then it follows from the assumption in Theorem 3 that

 Σ_{t=2}^∞ P{ ρ́_t² + D_t + r_t < ρ̄_t² } ≤ Σ_{t=2}^∞ 3 exp{ −c m²(t−1) r_t² / ( L_b² Diameter(Θ)² ) } < ∞. (77)

Therefore, by the Borel–Cantelli lemma, for all t large enough it holds that

 ρ̂_t² = ρ́_t² + D_t + r_t ≥ ρ̄_t² (78)

almost surely. Finally, it holds that ρ̄_t² ≥ ρ² from Lemma 2, which proves the result.

E Proof of Theorem 2
To prove Theorem 2, we use the following result from Theorem 3 in [20].
Lemma 6. If ρ̂_t ≥ ρ almost surely for all t sufficiently large, then with

 K_t ≥ K*_t ≜ min{ K ≥ 1 : b( d/2, √(2ε/m) + ρ̂_{t−1}, K ) ≤ ε } (79)

samples, we have lim sup_{t→∞} ( E[L_{U_t}(θ̂_t)] − L_{U_t}(θ*_t) ) ≤ ε almost surely.

Proof of Theorem 2. From Theorem 3, we know that the proposed estimate satisfies ρ̂_t² ≥ ρ² almost surely, which implies that ρ̂_t ≥ ρ almost surely. Directly applying the above lemma completes the proof.

F Estimation of m and L_b

We construct the estimators of m and L_b with the samples drawn from the distribution Γ̄_t. By the strong convexity assumption, we have

 L_{U_t}(θ) ≥ L_{U_t}(θ') + ⟨∇L_{U_t}(θ'), θ − θ'⟩ + (m/2) ‖θ − θ'‖², ∀ θ, θ' ∈ Θ, (80)

which implies that

 m ≤ 2 [ L_{U_t}(θ) − L_{U_t}(θ') − ⟨∇L_{U_t}(θ'), θ − θ'⟩ ] / ‖θ − θ'‖² (81)

holds for any θ, θ' ∈ Θ. Since m is the smallest value of the right-hand side of (81) over θ, θ' ∈ Θ, we consider the following estimator

 m̃_t ≜ min_{θ,θ'∈Θ_t} (2/K_t) Σ_{k=1}^{K_t} [ ℓ(Y_{k,t}|X_{k,t}, θ) − ℓ(Y_{k,t}|X_{k,t}, θ') − ⟨∇ℓ(Y_{k,t}|X_{k,t}, θ'), θ − θ'⟩ ] / ( N_t Γ̄_t(X_{k,t}) ‖θ − θ'‖² ). (82)

Following (82), we have

 E(m̃_t) = E_{X_{k,t}∼Γ̄_t}{ min_{θ,θ'∈Θ_t} (2/K_t) Σ_{k=1}^{K_t} [ ℓ(Y_{k,t}|X_{k,t}, θ) − ℓ(Y_{k,t}|X_{k,t}, θ') − ⟨∇ℓ(Y_{k,t}|X_{k,t}, θ'), θ − θ'⟩ ] / ( N_t Γ̄_t(X_{k,t}) ‖θ − θ'‖² ) }
  ≤ min_{θ,θ'∈Θ_t} E_{X_{k,t}∼Γ̄_t}{ (2/K_t) Σ_{k=1}^{K_t} [ ℓ(Y_{k,t}|X_{k,t}, θ) − ℓ(Y_{k,t}|X_{k,t}, θ') − ⟨∇ℓ(Y_{k,t}|X_{k,t}, θ'), θ − θ'⟩ ] / ( N_t Γ̄_t(X_{k,t}) ‖θ − θ'‖² ) }
  = min_{θ,θ'∈Θ_t} E_{X_{k,t}∼U_t}{ (2/K_t) Σ_{k=1}^{K_t} [ ℓ(Y_{k,t}|X_{k,t}, θ) − ℓ(Y_{k,t}|X_{k,t}, θ') − ⟨∇ℓ(Y_{k,t}|X_{k,t}, θ'), θ − θ'⟩ ] / ‖θ − θ'‖² }
  = min_{θ,θ'∈Θ_t} 2 [ L_{U_t}(θ) − L_{U_t}(θ') − ⟨∇L_{U_t}(θ'), θ − θ'⟩ ] / ‖θ − θ'‖² = m, (83)

which implies that m̃_t is a conservative estimate of m. In practice, the strong convexity parameter m may also vary with time t. Thus, we use the following estimator to combine the one-step estimators m̃_t:

 m̂_t = min{ m̃_{t−1}, m̃_t }, (84)

for t ≥ 2. Moreover, following the boundedness assumption in Assumption 2, we have

 max_{θ∈Θ} λ_max[ I_{U_t}(θ) ] ≤ L_b, (85)

where λ_max(·) denotes the maximal eigenvalue of a square matrix. In this case, we consider the following estimator

 L̂_{b,t} ≜ max_{θ_t∈Θ_t} λ_max[ (1/K_t) Σ_{k=1}^{K_t} H(X_{k,t}, θ_t) / ( N_t Γ̄_t(X_{k,t}) ) ]. (86)

Similarly, L̂_{b,t} is also a conservative estimate of L_b. That is,

 E(L̂_{b,t}) = E_{X_{k,t}∼Γ̄_t}{ max_{θ_t∈Θ_t} λ_max[ (1/K_t) Σ_{k=1}^{K_t} H(X_{k,t}, θ_t) / ( N_t Γ̄_t(X_{k,t}) ) ] }
  ≥ max_{θ_t∈Θ_t} E_{X_{k,t}∼Γ̄_t}{ λ_max[ (1/K_t) Σ_{k=1}^{K_t} H(X_{k,t}, θ_t) / ( N_t Γ̄_t(X_{k,t}) ) ] }
  ≥ max_{θ_t∈Θ_t} λ_max[ I_{U_t}(θ_t) ] = L_b. (87)

Fig. 2(a) and Fig. 2(b) demonstrate our estimates m̂_t and L̂_{b,t} in the synthetic regression problem described in Section 5, respectively.

Figure 2: Estimated parameters on the regression task over synthetic data. (a) Estimated strong convexity parameter. (b) Estimated largest eigenvalue.

G Experiments on Synthetic Classification
We consider solving a sequence of binary classification problems using logistic regression. At time t, the features of the two classes are drawn from Gaussian distributions with means μ_{1,t} and −μ_{1,t}, respectively. More specifically, the features are 2-dimensional Gaussian vectors with ‖μ_{1,t}‖ = 2 and isotropic covariance. The parameter θ_t is learned by minimizing the logistic loss

 ℓ(y_{k,t} | x_{k,t}, θ_t) = log( 1 + exp(−y_{k,t} θ_t^⊤ x_{k,t}) ). (88)

To ensure that the change of the minimizers is bounded, we let μ_{1,t} drift at a constant rate along the circle ‖μ_{1,t}‖ = 2. We further set ρ and ε to fixed values below 1.
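A minimal sketch of this drifting-mean setup follows. The drift rate, noise level, sample sizes, and gradient-descent settings below are illustrative stand-ins (the text does not specify them); the code only shows the mechanics: classes centered at ±μ_{1,t} with μ_{1,t} rotating on the circle of radius 2, and θ_t fitted by minimizing the logistic loss (88) with a warm start from the previous estimate.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(mu, n, sigma2=0.5):
    """Class +1 centered at mu, class -1 at -mu (sigma2 is an assumed noise level)."""
    X = np.vstack([rng.normal(mu, np.sqrt(sigma2), (n, 2)),
                   rng.normal(-mu, np.sqrt(sigma2), (n, 2))])
    y = np.concatenate([np.ones(n), -np.ones(n)])
    return X, y

def fit_logistic(X, y, theta0, lr=0.5, steps=500):
    """Minimize the average logistic loss log(1 + exp(-y theta^T x)) by gradient descent."""
    theta = theta0.copy()
    for _ in range(steps):
        z = y * (X @ theta)
        grad = -((y / (1.0 + np.exp(z))) @ X) / len(y)
        theta -= lr * grad
    return theta

drift = 0.05                      # assumed angular drift rate per time step
theta, thetas = np.zeros(2), []
for t in range(5):
    ang = drift * t
    mu = 2.0 * np.array([np.cos(ang), np.sin(ang)])   # ||mu_{1,t}|| = 2, drifting on a circle
    X, y = make_data(mu, 500)
    theta = fit_logistic(X, y, theta)                 # warm start from the previous estimate
    thetas.append(theta.copy())

steps_norm = [np.linalg.norm(thetas[i] - thetas[i - 1]) for i in range(1, len(thetas))]
Xte, yte = make_data(mu, 1000)
acc = np.mean(np.sign(Xte @ theta) == yte)            # held-out accuracy at the last step
```

Because the drift per step is small, successive fitted parameters change only moderately, which is the bounded-change condition (1) that the adaptive sample-size rule exploits.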