Outlier Robust Online Learning
Jiashi Feng, Department of Electrical and Computer Engineering, National University of Singapore ([email protected])
Huan Xu, School of Industrial and Systems Engineering, Georgia Institute of Technology ([email protected])
Shie Mannor, Department of Electrical Engineering, Technion – Israel Institute of Technology ([email protected])
Abstract
We consider the problem of learning from noisy data in practical settings where the size of the data is too large to store on a single machine. More challengingly, the data coming from the wild may contain malicious outliers. To address the scalability and robustness issues, we present an online robust learning (ORL) approach. ORL is simple to implement and has a provable robustness guarantee—in stark contrast to existing online learning approaches that are generally fragile to outliers. We specialize the ORL approach for two concrete cases: online robust principal component analysis and online linear regression. We demonstrate the efficiency and robustness advantages of ORL through comprehensive simulations and by predicting image tags on a large-scale data set. We also discuss an extension of ORL to distributed learning and provide experimental evaluations.
In the era of big data, traditional statistical learning methods face two significant challenges: (1) how to scale current machine learning methods to large-scale data, and (2) how to obtain accurate inference results when the data are noisy and may even contain malicious outliers. These two important challenges naturally lead to a need for developing scalable robust learning methods.

Traditional robust learning methods generally rely on optimizing certain robust statistics [16, 21] or applying sample trimming strategies [7], whose calculations require loading all the samples into memory or going through the data multiple times [9]. Thus, the computational time of those robust learning methods is usually at least linearly dependent on the size of the sample set, $N$. For example, in RPCA [21], the computational time is $O(Npr)$, where $r$ is the intrinsic dimension of the subspace and $p$ is the ambient dimension. In robust linear regression [3], the computational time is super-linear in the sample size: $O(pN \log N)$. This rapidly increasing computation time becomes a major obstacle for applying robust learning methods to big data in practice, where the sample size easily reaches the terabyte or even petabyte scale.

Online learning and distributed learning are natural solutions to the scalability issue. Most existing online statistical learning methods optimize a surrogate function in an online fashion, such as employing stochastic gradient descent [10, 15, 8] to update the estimates, which however cannot handle outlier samples in the streaming data [12]. Similarly, most existing distributed learning approaches (e.g., MapReduce [6]) are not robust to contamination from outliers, communication errors or computation node breakdown.

In this work, we propose an online robust learning (ORL) framework to efficiently process big data with outliers while preserving robustness and statistical consistency of the estimates. The core technique is a two-level online learning procedure, one level of which employs a novel median filtering process. The robustness of the median has been investigated in statistical estimation for heavy-tailed distributions [17, 11]. However, to the best of our knowledge, this work is among the first to employ such an estimator to deal with outlier samples in the context of online learning.

The implementation of ORL follows mini-batch based online optimization, which is popular in a wide range of machine learning problems on large-scale data (e.g., deep learning, large-scale SVM). Within each mini-batch, ORL computes an independent estimate. However, outliers may be heterogeneously distributed over the mini-batches, and some mini-batches may contain overwhelmingly many outliers. The corresponding estimates will be arbitrarily bad and break down the overall online learning. Therefore, on top of such streaming estimates ORL performs another level of robust estimation—median filtering—to obtain a reliable estimate. The ORL approach is general and compatible with many popular learning algorithms. Besides its obvious advantage of enhancing the computational efficiency for handling big data, ORL incurs negligible robustness loss compared to centralized (and computationally unaffordable) robust learning methods.
In fact, we provide analysis and demonstrate that ORL is robust to a constant fraction of "bad" estimates generated from streaming mini-batches that are corrupted by outliers.

We specialize the ORL approach for two concrete problems: online robust principal component analysis (PCA) and linear regression. Comprehensive experiments on both synthetic and real large-scale datasets demonstrate the efficiency and robustness advantages of the proposed ORL approach. In addition, ORL can be adapted straightforwardly to the distributed learning setting and offers additional robustness to corruption of several computation nodes or to communication errors, as demonstrated in the experiments.

In short, we make the following contributions in this work. First, we develop an outlier robust online learning framework, which is the first with provable robustness to a constant fraction of outliers. Secondly, we introduce two concrete online robust learning approaches, one for unsupervised learning and the other for supervised learning; other examples can be developed in a similar way easily. Finally, we also present the application of the ORL approach to the distributed learning setting, which is equally attractive for learning from large-scale data.

We consider a set of $N = n_1 + n_2$ observation samples $\mathcal{X} = \mathcal{X}_I \cup \mathcal{X}_O = \{x_1, \dots, x_{n_1}\} \cup \{x_{n_1+1}, \dots, x_{n_1+n_2}\} \subset \mathbb{R}^p$, which contains a mixture of $n_1$ authentic samples $\mathcal{X}_I$ and $n_2$ outliers $\mathcal{X}_O$. The authentic samples are generated according to an underlying model (i.e., the ground truth) parameterized by $\theta^\star \in \Theta$. The target of a statistical learning procedure is to estimate the model parameter $\theta^\star$ from the provided observations $\mathcal{X}$. Throughout the paper, we assume the authentic samples are sub-Gaussian random vectors in $\mathbb{R}^p$, which thus satisfy
$$\mathbb{P}\big(|\langle x, u\rangle| > t\big) \le e^{-t^2/L^2} \quad \text{for all } t > 0,\; u \in \mathbb{S}^{p-1}, \qquad (1)$$
for some $L$. Here $\mathbb{S}^{p-1}$ denotes the unit sphere.

In this work, we focus on the case where a constant fraction of the observations are outliers, and we use $\lambda \triangleq n_2/N$ to denote this outlier fraction. In the context of online learning, samples are provided in a sequence of $T$ mini-batches, each of which contains $b = \lfloor (n_1+n_2)/T \rfloor$ observations. Denote the sequence as $\{\mathcal{X}_1, \dots, \mathcal{X}_T\} = \{x_1, \dots, x_b, x_{b+1}, \dots, x_{Tb}\}$. The target of online statistical learning is to estimate the parameter $\theta^\star$ based only on the observations revealed so far.

We first introduce the geometric median, a core concept underlying the median filtering procedure that is central to the proposed online robust learning approach.
Definition 1 (Geometric Median). Given a finite collection of i.i.d. estimates $\theta_1, \dots, \theta_T \in \Theta$, their geometric median is the point which minimizes the total $\ell_2$ distance to all the given estimates, i.e.,
$$\hat\theta = \mathrm{median}(\theta_1, \dots, \theta_T) := \arg\min_{\theta \in \Theta} \sum_{j=1}^{T} \|\theta - \theta_j\|. \qquad (2)$$

An important property of the geometric median is that it aggregates a collection of independent estimates into a single estimate $\hat\theta$ with strong concentration guarantees, even in the presence of a constant fraction of outlying estimates in the collection. The following lemma, derived straightforwardly from Lemma 2.1 in [17], characterizes this robustness property of the geometric median.

Lemma 1. Let $\hat\theta$ be the geometric median of the points $\theta_1, \dots, \theta_T \in \Theta$. Fix $\gamma \in (0, 1/2)$ and let $C_\gamma = (1-\gamma)\sqrt{\tfrac{1}{1-2\gamma}}$. Suppose there exists a subset $J \subseteq \{1, \dots, T\}$ of cardinality $|J| > (1-\gamma)T$ such that, for some point $\theta^\star \in \Theta$ and all $j \in J$, $\|\theta_j - \theta^\star\| \le r$. Then $\|\hat\theta - \theta^\star\| \le C_\gamma r$.

In words, given a set of points, their geometric median will be close to the "true" $\theta^\star$ as long as at least half of them are close to $\theta^\star$. In particular, the geometric median will not be skewed severely even if some of the points deviate significantly from $\theta^\star$.

In this section, we present how to scale up robust learning algorithms to process large-scale data (containing outliers) through online learning without losing robustness. We term the proposed approach online robust learning (ORL).

The idea behind ORL is intuitive: instead of incorporating the estimate generated at each time step with equal weight, ORL aggregates the estimates produced sequentially by mini-batch based learning methods via an online computation of the robust geometric median. Basically, ORL runs online learning at two levels. At the bottom level, ORL employs an appropriate robust learning procedure $\mathrm{RL}(\cdot, \nu)$ with parameter $\nu$ (e.g., a robust PCA algorithm applied to a mini-batch of samples) to obtain a sequence of estimates $\{\theta_1, \dots, \theta_T\}$ of $\theta^\star$ based on the observation mini-batches $\mathcal{X}_1, \dots, \mathcal{X}_T$. At the top level, ORL updates the running estimate $\hat\theta_t$ ($1 \le t \le T$) through a geometric median filtering algorithm (explained later) over $\hat\theta_1, \dots, \hat\theta_{t-1}$ and outputs a robust estimate after going through all the mini-batches. Intuitively, according to Lemma 1, as long as a majority of the mini-batch estimates are not skewed by outliers, the produced $\hat\theta_t$ will be robust and accurate. This two-level robust learning gives ORL stronger robustness to outliers than ordinary online learning.

To develop the top-level geometric median filtering procedure, recall the definition of the geometric median in (2). A natural estimate of the geometric median $\hat\theta$ is the minimizer $\hat\theta_T$ of the following empirical loss function $\hat G_T$:
$$\hat\theta_T = \arg\min_{\theta \in \Theta} \Big\{ \hat G_T \triangleq \frac{1}{T} \sum_{i=1}^{T} \|\theta_i - \theta\| \Big\}. \qquad (3)$$
The empirical function $\hat G_T$ is differentiable everywhere except at the points $\theta_i$, and can be optimized by applying stochastic gradient descent (SGD) [1].
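Before turning to the streaming update, the following minimal sketch (our own illustration, not code from the paper) shows how the batch geometric median of Definition 1 can be computed by Weiszfeld-type fixed-point iterations; the function name, tolerances, and the toy data are illustrative assumptions.

```python
import numpy as np

def geometric_median(points, max_iter=200, tol=1e-8):
    """Weiszfeld-type iterations for the geometric median of the rows of `points`.

    `points` is a (T, p) array of estimates theta_1, ..., theta_T; the result
    approximately minimizes sum_j ||theta - theta_j|| as in Eqn. (2).
    """
    theta = points.mean(axis=0)               # start from the plain average
    for _ in range(max_iter):
        dists = np.linalg.norm(points - theta, axis=1)
        dists = np.maximum(dists, 1e-12)      # avoid division by zero at a data point
        weights = 1.0 / dists
        new_theta = (weights[:, None] * points).sum(axis=0) / weights.sum()
        if np.linalg.norm(new_theta - theta) < tol:
            return new_theta
        theta = new_theta
    return theta

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    good = rng.normal(0.0, 0.1, size=(18, 5))   # estimates close to theta* = 0
    bad = rng.normal(10.0, 0.1, size=(2, 5))    # a few wildly corrupted estimates
    estimates = np.vstack([good, bad])
    print("mean error  :", np.linalg.norm(estimates.mean(axis=0)))
    print("median error:", np.linalg.norm(geometric_median(estimates)))
```

On such data the mean is dragged toward the corrupted estimates while the geometric median stays near the origin, which is exactly the behavior Lemma 1 quantifies.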
More concretely, at time step $t$, given a new estimate $\theta_{t+1}$ (based on the $(t+1)$-st mini-batch) and the current estimate $\hat\theta_t$, ORL computes the gradient of the empirical function $\hat G_T$ in Eqn. (3) evaluated only at $\theta_{t+1}$:
$$\hat g(\theta; \theta_{t+1}) \triangleq \frac{\partial \hat G_T(\theta; \theta_{t+1})}{\partial \theta} = \frac{2(\theta - \theta_{t+1})}{\|\theta - \theta_{t+1}\|}. \qquad (4)$$

Algorithm 1 The ORL Approach
Input: mini-batch sequence $\mathcal{X}_1, \dots, \mathcal{X}_T$, convexity parameter $c_a$, robust learning procedure parameter $\nu$.
Initialization: $\hat\theta_0 = 0$.
for $t = 1, \dots, T$ do
  Call the robust learning procedure: $\theta_t = \mathrm{RL}(\mathcal{X}_t, \nu)$;
  Compute the weight $w_t = 2\eta_t / \|\hat\theta_{t-1} - \theta_t\|$ with $\eta_t = 1/(c_a t)$;
  Update the estimate: $\hat\theta_t = (1 - w_t)\,\hat\theta_{t-1} + w_t\, \theta_t$.
end for
Output: final estimate $\hat\theta_T$.

ORL updates the estimate $\hat\theta_t$ by the following filtering step:
$$\hat\theta_{t+1} \leftarrow \hat\theta_t - \eta_t\, \hat g(\hat\theta_t; \theta_{t+1}) = (1 - w_t)\,\hat\theta_t + w_t\, \theta_{t+1}. \qquad (5)$$
Here $\eta_t$ is a predefined step size, which usually takes the form $1/(c_a t)$ with a constant $c_a$ characterizing the convexity of the empirical function being optimized. The weight $w_t = 2\eta_t / \|\hat\theta_t - \theta_{t+1}\|$ controls the contribution of each new estimate $\theta_{t+1}$ conservatively when updating the global estimate $\hat\theta$. Details of ORL are provided in Algorithm 1.

This additional level of filtering is important. Certain mini-batches may contain overwhelmingly many outliers, so even though a robust learning procedure is employed on each mini-batch, the resulting estimate cannot be guaranteed to be accurate. In fact, a mini-batch containing over 50% outliers would corrupt any robust learning procedure; the resulting estimate can be arbitrarily bad and break down the overall online learning. To address this critical issue, ORL performs another level of online learning, updating the "global" estimate with adaptive weights on each new estimate and thereby "filtering out" possibly bad estimates.

We provide the performance guarantees for ORL in this section. Throughout this section, we use the following asymptotic inequality notation: for positive numbers $a$ and $b$, the asymptotic inequality $a \lesssim_{p,q} b$ means that $a \le C_{p,q}\, b$, where $C_{p,q}$ is a constant depending only on $p, q$.

Suppose $N$ samples, a constant fraction of which are authentic and have the sub-Gaussian distribution specified in (1) for some $L$, are evenly divided into $T$ mini-batches, and the outlier fractions of the $T$ mini-batches are $\lambda_1, \dots, \lambda_T$ respectively. Let $\theta_1, \dots, \theta_T$ be a collection of independent estimates of $\theta^\star$ output by running the robust learning procedure $\mathrm{RL}(\cdot, \nu)$ on the $T$ mini-batches independently. We assume the robust learning procedure satisfies the following composite deviation bound:
$$\mathbb{P}\left( \|\theta_i - \theta^\star\| \lesssim_{\delta, L} \sqrt{\frac{p}{b}} + \frac{\lambda_i}{1 - \lambda_i}\sqrt{p} \right) \ge 1 - \delta, \qquad (6)$$
where $b$ is the size of each mini-batch, whose value can be tuned according to the desired accuracy (e.g., through data augmentation). We will specify the value of the constant depending on $\delta$ and $L$ explicitly in the concrete applications. The above bound indicates that the estimation error depends on the standard statistical error and on the outlier fraction; if $\lambda_i$ is overwhelmingly large, the estimate can be arbitrarily bad.

We now proceed to demonstrate that the ORL approach is robust to outliers: even if the estimates obtained on a constant fraction of the mini-batches are bad, ORL can still provide a reliable estimate with bounded error. Given the sequence of estimates $\theta_1, \dots, \theta_T$ produced internally by ORL, we analyze the performance of ORL in two steps. We first demonstrate that the geometric median function $G_T(\theta)$ is in fact strongly convex, and thus geometric median filtering provides a good estimate of the "true" geometric median of $\theta_1, \dots, \theta_T$.
Then we derive the following performance guarantee for ORL by invoking the robustness property of the geometric median.

Proposition 1.
Suppose in total $N$ samples, a constant fraction of which have the sub-Gaussian distribution in (1), are divided into $T$ sequential mini-batches of size $b$ with outlier fractions $\lambda_1, \dots, \lambda_T$, where $T$ is at least a small absolute constant. We run a base robust learning algorithm satisfying a deviation bound as in (6) on each mini-batch. Denote the ground truth of the parameter to estimate as $\theta^\star$ and the output of ORL (Alg. 1) as $\hat\theta_T$. Then with probability at least $1 - \delta$, $\hat\theta_T$ satisfies
$$\|\hat\theta_T - \theta^\star\| \lesssim_{\delta, L, p, \gamma} \frac{\log(\log(T)) + 1}{T} + \sqrt{\frac{p}{b}} + \lambda(\gamma)\sqrt{p}.$$
Here $\lambda(\gamma) = \lambda_{(1-\gamma)} / (1 - \lambda_{(1-\gamma)})$, and $\lambda_{(1-\gamma)}$ denotes the $\lfloor (1-\gamma)T \rfloor$-th smallest outlier fraction in $\{\lambda_1, \dots, \lambda_T\}$, with $\gamma \in [0, 1/2)$.

The above result shows that the estimation error of ORL consists of two components. The first term accounts for the deviation between the solution $\hat\theta_T$ and the "true" geometric median of the $T$ sequential estimates. When $T$ is sufficiently large, i.e., after ORL has seen sufficiently many mini-batches of observations, this error vanishes at a rate of $O(\log\log(T)/T)$. The second term captures the deviation of the geometric median of the estimates from the ground truth. The significant part of this result is that the error of ORL depends only on the $\lfloor (1-\gamma)T \rfloor$-th smallest outlier fraction among the $T$ mini-batches, no matter how severely the other estimates are corrupted. This explains why ORL is robust to outliers in the samples.
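Before specializing ORL to concrete problems, the following minimal Python sketch (our own illustration, not the paper's code) makes the generic two-level procedure of Algorithm 1 concrete, using the update in Eqns. (4)–(5). The base learner `rl` passed in is a stand-in for any robust procedure RL(·, ν); initializing from the first estimate and capping the weight at 1 are small practical deviations from the pseudocode made for numerical safety, and the constant `c_a` is assumed given.

```python
import numpy as np

def orl(minibatches, rl, c_a=1.0):
    """Online Robust Learning (Algorithm 1): streaming geometric-median
    filtering over per-mini-batch estimates from a base robust learner `rl`."""
    theta_hat = None
    for t, batch in enumerate(minibatches, start=1):
        theta_t = rl(batch)                    # bottom level: robust estimate on one mini-batch
        if theta_hat is None:
            theta_hat = theta_t                # initialize with the first estimate
            continue
        eta_t = 1.0 / (c_a * t)                # step size 1/(c_a t)
        gap = np.linalg.norm(theta_hat - theta_t)
        if gap < 1e-12:
            continue                           # already at this estimate; nothing to update
        w_t = min(2.0 * eta_t / gap, 1.0)      # adaptive weight, capped for safety
        theta_hat = (1.0 - w_t) * theta_hat + w_t * theta_t   # top level: filtering step (5)
    return theta_hat

# Toy usage: the base learner here is a plain mean, so a mini-batch that is
# mostly outliers yields a bad estimate; the top-level filtering absorbs it.
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    batches = [rng.normal(0.0, 1.0, size=(200, 10)) for _ in range(50)]
    batches[7] += 50.0                         # one mini-batch overwhelmed by outliers
    print(np.linalg.norm(orl(batches, rl=lambda b: b.mean(axis=0))))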
In this section, we provide two concrete examples of the ORL approach: one unsupervised learning algorithm, principal component analysis (PCA), and one supervised learning algorithm, linear regression (LR). Both algorithms are popular in practice, but online versions with robustness guarantees are still absent. Finally, we also discuss an extension of ORL to distributed robust learning.

Online Robust PCA

Classical PCA is known to be fragile to outliers, and many robust PCA methods have been proposed (see [21] and references therein). However, most of those methods require loading all the data into memory and have computational cost (super-)linear in the sample size, which prevents them from being applicable to big data. In this section, we first develop a new robust PCA method, which robustifies PCA via a robust sample covariance matrix estimate, and then demonstrate how to implement it with the ORL approach to enhance efficiency.

Given a sample matrix $X = [x_1, x_2, \dots, x_n] \in \mathbb{R}^{p \times n}$, the standard covariance matrix is computed as $C = XX^\top$, i.e., $C_{ij} = \langle X_i, X_j \rangle$ for all $i, j = 1, \dots, p$, where $X_i$ denotes the $i$-th row vector of the matrix $X$. To obtain a robust estimate of the covariance matrix, we replace the vector inner product by a trimmed inner product, $\hat C_{ij} = \langle X_i, X_j \rangle_{n_2}$, as proposed in [4] for linear regressor estimation. Intuitively, the trimmed inner product removes the outliers having large magnitude, and the remaining outliers are bounded by inliers. Thus, the obtained covariance matrix, after proper symmetrization, is close to the authentic sample covariance. How to calculate the trimmed inner product for a robust estimate of the sample covariance matrix is given in Algorithm 2.
Algorithm 2 Trimmed inner product $\langle x, x' \rangle_{n}$
Input:
Two vectors $x \in \mathbb{R}^N$ and $x' \in \mathbb{R}^N$, trimming parameter $n$.
Compute $q_i = x_i x'_i$ for $i = 1, \dots, N$.
Sort $\{|q_i|\}$ in ascending order and select the smallest $(N - n)$ of them; let $\Omega$ be the set of selected indices.
Output: $h = \sum_{i \in \Omega} q_i$.

Then we perform a standard eigenvector decomposition on the trimmed covariance matrix to produce the principal component estimates. The details of the new Robust Covariance PCA (RC-PCA) algorithm are provided in Algorithm 3.
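A direct sketch of Algorithm 2, written by us for illustration (the function name is not from the paper), follows.

```python
import numpy as np

def trimmed_inner_product(x, x_prime, n_trim):
    """Algorithm 2: sum the element-wise products after discarding the
    n_trim entries with the largest magnitude |x_i * x'_i|."""
    q = x * x_prime
    keep = np.argsort(np.abs(q))[: len(q) - n_trim]  # indices of the N - n_trim smallest |q_i|
    return q[keep].sum()
```

With the trimming parameter set to (an estimate of) the number of outliers, the products contributed by large-magnitude outliers are dropped, which is what makes the covariance estimate in Algorithm 3 robust.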
Algorithm 3 Robust Covariance PCA (RC-PCA)
Input:
Sample matrix $X = [x_1, \dots, x_N] \in \mathbb{R}^{p \times N}$, subspace dimension $d$, outlier fraction $\lambda$.
Compute the trimmed covariance matrix $\hat\Sigma$: $\hat\Sigma_{ij} = \langle X_i, X_j \rangle_{\lambda N}$ for all $i, j = 1, \dots, p$.
Perform eigen-decomposition on the symmetrized matrix $\hat\Sigma' = (\hat\Sigma + \hat\Sigma^\top)/2$ and collect the eigenvectors corresponding to the $d$ largest eigenvalues: $\hat P_U = [\hat w_1, \dots, \hat w_d]$.
Output: column subspace projector $\hat P_U$.

Applying the proposed ORL approach on top of RC-PCA yields a new online robust PCA algorithm, called ORL-PCA, described in Algorithm 4. Based on the above result, along with Proposition 1, we provide the following performance guarantee for ORL-PCA.

Algorithm 4 ORL-PCA
Input: sequential mini-batches $\mathcal{X}_1, \dots, \mathcal{X}_T$ of size $b$, subspace dimension $d$, RC-PCA trimming parameter $q$ (set to a fixed fraction of $b$).
Initialization: $\hat\Sigma^{(0)} = 0 \in \mathbb{R}^{p \times p}$.
for $t = 1, \dots, T$ do
  Perform RC-PCA on $\mathcal{X}_t$: $P_U^{(t)} = \text{RC-PCA}(\mathcal{X}_t; d, q)$;
  Compute the covariance estimate: $\Sigma^{(t)} = P_U^{(t)} P_U^{(t)\top}$;
  Compute $w_t = 2\eta_t / \|\hat\Sigma^{(t-1)} - \Sigma^{(t)}\|$ with $\eta_t = 1/(c_a t)$;
  Update the estimate: $\hat\Sigma^{(t)} = (1 - w_t)\,\hat\Sigma^{(t-1)} + w_t\, \Sigma^{(t)}$.
end for
Output: $\hat P_U^{(T)} = \mathrm{svd}\big(\hat\Sigma^{(T)}, d\big)$.
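Putting Algorithms 3 and 4 together, a compact sketch might look as follows. This is our own illustration under the paper's notation, not the reference implementation: the helper names are assumptions, the weight cap is added for numerical safety, and the step-size constant `c_a` is assumed given.

```python
import numpy as np

def trimmed_inner_product(x, y, n_trim):
    """Trimmed inner product, as in Algorithm 2."""
    q = x * y
    keep = np.argsort(np.abs(q))[: q.size - n_trim]
    return q[keep].sum()

def rc_pca(X, d, n_trim):
    """Algorithm 3: robust covariance PCA on a p x N sample matrix X."""
    p = X.shape[0]
    sigma = np.empty((p, p))
    for i in range(p):
        for j in range(p):
            sigma[i, j] = trimmed_inner_product(X[i], X[j], n_trim)
    sigma = (sigma + sigma.T) / 2.0                     # symmetrize
    eigvals, eigvecs = np.linalg.eigh(sigma)
    return eigvecs[:, np.argsort(eigvals)[::-1][:d]]    # top-d eigenvectors, p x d

def orl_pca(minibatches, d, trim_frac, c_a=1.0):
    """Algorithm 4: ORL-PCA, median filtering over per-batch projector estimates."""
    p = minibatches[0].shape[0]
    sigma_hat = np.zeros((p, p))
    for t, X_t in enumerate(minibatches, start=1):
        P_t = rc_pca(X_t, d, n_trim=int(trim_frac * X_t.shape[1]))
        sigma_t = P_t @ P_t.T                           # projector onto the batch estimate
        eta_t = 1.0 / (c_a * t)
        gap = np.linalg.norm(sigma_hat - sigma_t)       # Frobenius distance
        w_t = min(2.0 * eta_t / max(gap, 1e-12), 1.0)   # capped adaptive weight
        sigma_hat = (1.0 - w_t) * sigma_hat + w_t * sigma_t
    eigvals, eigvecs = np.linalg.eigh(sigma_hat)
    return eigvecs[:, np.argsort(eigvals)[::-1][:d]]    # final subspace estimate
```

Note that the running object being filtered is the projector $P_U P_U^\top$ rather than $P_U$ itself, which avoids sign and rotation ambiguities between the per-batch bases.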
Theorem 1. Suppose the samples are divided into $T$ mini-batches of size $b$, and the authentic samples satisfy the sub-Gaussian distribution with parameter $L$. Let $\lambda(\gamma) = \lambda_{(1-\gamma)}/(1 - \lambda_{(1-\gamma)})$, where $\lambda_{(1-\gamma)}$ is the $\lfloor (1-\gamma)T \rfloor$-th smallest outlier fraction among the $T$ mini-batches. Let $\hat P_U^{(T)}$ denote the projection operator given by ORL-PCA, and let $P_U^\star$ denote the projection operator onto the ground-truth $d$-dimensional subspace. Then, with probability at least $1 - \delta$, we have
$$\|\hat P_U^{(T)} - P_U^\star\|_F \le C_a\, \frac{\log(\log(T)/\delta) + 1}{T} + c_1\, p \sqrt{\frac{d \log(1/\delta)}{b}} + c_2\, \lambda(\gamma) \sqrt{dp}.$$
Here $C_a$, $c_1$, $c_2$ are positive constants.

Online Robust Regression

We now showcase another example of the application of ORL: online robust regression. As aforementioned, the target of linear regression is to estimate the parameter $\theta^\star$ of the linear regression model $y_i = \langle \theta^\star, x_i \rangle + \varepsilon$ given the observation pairs $\{x_i, y_i\}_{i=1}^{n_1+n_2}$, where $n_2$ samples are corrupted. Here $\varepsilon \sim \mathcal{N}(0, \sigma_e^2)$ is additive noise. Similar to ORL-PCA, we use the robustified thresholding regression (RoTR) in Algorithm 5 (cf. [4]) as the robust learning procedure for parameter estimation within each mini-batch.

Algorithm 5
Base Robust Regression (RoTR)
Input:
Covariate matrix $X = [x_1, \dots, x_{n_1+n_2}] \in \mathbb{R}^{p \times (n_1+n_2)}$ and response $y \in \mathbb{R}^{n_1+n_2}$, outlier fraction $\lambda$ (set to a fixed default value).
For $j = 1, \dots, p$, compute $\hat\theta(j) = \langle y, X_j \rangle_{\lambda(n_1+n_2)}$.
Output: $\hat\theta$.

Thanks to the online robust learning framework, ORL-LR has the following performance guarantee.

Theorem 2. Adopt the notation of Theorem 1. Suppose the authentic samples have the sub-Gaussian distribution as in (1) with noise level $\sigma_e$ and are divided into $T$ sequential mini-batches. Let $\hat\theta_T$ be the output of ORL-LR and $\theta^\star$ be the ground truth. Then, with probability at least $1 - \delta$, the following holds:
$$\|\hat\theta_T - \theta^\star\| \le C_a\, \frac{\log(\log(T)/\delta) + 1}{T} + C_\gamma\, \|\theta^\star\| \sqrt{\frac{\sigma_e}{\|\theta^\star\|}} \left( \sqrt{\frac{p \log(1/\delta)}{b}} + \lambda(\gamma) \sqrt{p}\, \log\Big(\frac{1}{\delta}\Big) \right).$$

Distributed Robust Learning

Following the spirit of ORL, we can also develop a distributed robust learning (DRL) approach. Suppose that in a distributed computing platform $k$ machines are available for parallel computation. Then, to process a large-scale dataset, one can evenly distribute the samples onto the $k$ machines and run the robust learning procedure $\mathrm{RL}(\cdot, \nu)$ in parallel. Each machine provides an independent estimate $\theta_i$ of the parameter of interest $\theta^\star$. Aggregating these estimates via the geometric median (cf. Eqn. (2)) provides additional robustness to inaccuracy, breakdown, and communication errors affecting a fraction of the machines in the computing cluster, as stated in Lemma 1. Of particular interest, DRL provides much stronger robustness than the commonly used averaging over the $k$ estimates, as the average (or mean) is notoriously fragile to corruption: even a single corrupted estimate out of the $k$ estimates can make the final estimate arbitrarily bad.

Experiments

In this section, we investigate the robustness of the ORL approach by evaluating the ORL-PCA and ORL-LR algorithms and comparing them with their centralized and non-robust counterparts. We also perform a similar investigation for DRL (ref. Section 5.3), considering that robustness is also critical for distributed learning in practice. In the simulations, we report results with the outlier fraction computed as $\lambda = n_2/(n_1 + n_2)$.
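The base learner used by ORL-LR (and by DRL-RLR) in the experiments below is the trimmed estimator of Algorithm 5. A minimal sketch is given here as our own illustration; the function name and the normalization by the sample count are assumptions, since the exact scaling convention is not fully spelled out above.

```python
import numpy as np

def rotr(X, y, outlier_frac):
    """Algorithm 5 (RoTR): per-coordinate trimmed regression estimate.

    X is p x n (columns are samples x_i) and y has length n.  For each coordinate
    j, the products y_i * x_{ij} with the largest magnitudes are discarded before
    averaging; dividing by n is our normalization assumption.
    """
    p, n = X.shape
    n_trim = int(outlier_frac * n)
    theta_hat = np.empty(p)
    for j in range(p):
        q = y * X[j]                                  # element-wise products y_i * x_{ij}
        keep = np.argsort(np.abs(q))[: n - n_trim]    # drop the n_trim largest |q_i|
        theta_hat[j] = q[keep].sum() / n
    return theta_hat
```

Plugging `rotr` in as the base learner of the ORL sketch given earlier yields the ORL-LR procedure evaluated below; aggregating the same per-machine trimmed estimates with the batch geometric median of Eqn. (2) gives the DRL variant.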
Data generation

In the simulations of the PCA problems, samples are generated according to $x_i = \theta^\star z_i + \varepsilon$. Here the signal $z_i \in \mathbb{R}^d$ is sampled from the normal distribution $z_i \sim \mathcal{N}(0, I_d)$, and the noise $\varepsilon \in \mathbb{R}^p$ is sampled as $\varepsilon \sim \mathcal{N}(0, \sigma_e^2 I_p)$. The underlying matrix $\theta^\star \in \mathbb{R}^{p \times d}$ is randomly generated and its columns are then orthogonalized. The entries of the outliers $x_o \in \mathbb{R}^p$ are i.i.d. random variables from the uniform distribution on $[-\sigma_o, \sigma_o]$. We use the distance between two projection matrices to measure the subspace estimation error for PCA: $\|\hat P_U - P_U^\star\|_F / \|P_U^\star\|_F$, where $\hat P_U$ is the output estimate and $P_U^\star = \theta^\star \theta^{\star\top}$ is the ground truth.

In the simulations of the LR problems, samples $(x_i, y_i)$ are generated according to $y_i = \theta^{\star\top} x_i + \varepsilon$. Here the model parameter $\theta^\star$ is randomly sampled from $\mathcal{N}(0, I_p)$, and $x_i \in \mathbb{R}^p$ is also sampled from the normal distribution $x_i \sim \mathcal{N}(0, I_p)$. The noise $\varepsilon \in \mathbb{R}$ is sampled as $\varepsilon \sim \mathcal{N}(0, \sigma_e^2)$. The entries of the outliers $x_o$ are also i.i.d. samples from the uniform distribution on $[-\sigma_o, \sigma_o]$, and the response of an outlier is generated by $y_o = -\theta^{\star\top} x_o + v$. We use $\|\theta^\star - \hat\theta\| / \|\theta^\star\|$ to measure the error, where $\hat\theta$ is the output estimate.
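The synthetic data described above can be generated along the following lines; this is a sketch under the stated distributions, in which the sample counts, seeds, and helper names are our own choices rather than the paper's.

```python
import numpy as np

def make_pca_data(n, p=100, d=5, outlier_frac=0.2, sigma_e=1.0, sigma_o=10.0, seed=0):
    """Inliers x_i = theta* z_i + eps with orthonormal theta*; uniform outliers."""
    rng = np.random.default_rng(seed)
    theta_star, _ = np.linalg.qr(rng.normal(size=(p, d)))     # random orthonormal columns
    n_out = int(outlier_frac * n)
    inliers = rng.normal(size=(n - n_out, d)) @ theta_star.T \
        + sigma_e * rng.normal(size=(n - n_out, p))
    outliers = rng.uniform(-sigma_o, sigma_o, size=(n_out, p))
    X = np.vstack([inliers, outliers])
    rng.shuffle(X)                                             # mix inliers and outliers
    return X, theta_star

def make_lr_data(n, p=100, outlier_frac=0.2, sigma_e=1.0, sigma_o=10.0, seed=0):
    """Inliers y_i = <theta*, x_i> + eps; outlier covariates uniform, responses flipped."""
    rng = np.random.default_rng(seed)
    theta_star = rng.normal(size=p)
    n_out = int(outlier_frac * n)
    x_in = rng.normal(size=(n - n_out, p))
    y_in = x_in @ theta_star + sigma_e * rng.normal(size=n - n_out)
    x_out = rng.uniform(-sigma_o, sigma_o, size=(n_out, p))
    y_out = -x_out @ theta_star + sigma_e * rng.normal(size=n_out)  # adversarial responses
    return np.vstack([x_in, x_out]), np.concatenate([y_in, y_out]), theta_star
```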
[Figure 1 panels: (a) Online PCA (ORL-PCA, Online-Avg. RPCA, Batch RPCA) and (b) Online LR (ORL-LR, Online Avg. LR, Batch LR), plotting estimation error against the number of mini-batches; (c) Distributed PCA (PCA, Div.-Avg. RPCA, Centralized RPCA, DRL-RPCA) and (d) Distributed LR (LR, Div.-Avg. RLR, Centralized RLR, DRL-RLR), plotting estimation error for the compared methods.]

Figure 1: Simulation comparison between online (in (a), (b)) and distributed as well as centralized (in (c), (d)) algorithms, along with standard non-robust ones, for the PCA and LR problems. Both problems use the following setting: noise level $\sigma_e = 1$, outlier magnitude $\sigma_o = 10$, sample dimension $p = 100$, a large sample size $N$, $k = 100$ (for distributed algorithms), and $T = 100$ (for online algorithms). For PCA, the intrinsic dimension is $d = 5$. (Best viewed in color.)

Online Setting
The results shown in Figure 1(a) give the following observations. First, ORL-PCA converges to performance comparable with batch RC-PCA, which has access to the entire data set; this demonstrates the rapid convergence of ORL-PCA. It is also worth noting that ORL-PCA requires far less memory than batch RC-PCA (8 MB versus a much larger footprint) and far less computation time (212 seconds versus roughly 27 hours), since ORL-PCA performs SVD on much smaller matrices. Secondly, ORL-PCA offers much stronger robustness than naive averaged aggregation when the ordering of outliers is adversarial and corrupts a fraction of the mini-batches: as shown in Figure 1(a), even when some mini-batches contain overwhelmingly many outliers, ORL-PCA remains accurate while the averaged aggregation is skewed severely.

Distributed setting
All the simulations are implemented on a single PC. Centralized RPCA takes a substantial amount of time to process the full sample set (dimensionality 100), whereas distributed RPCA with $k = 100$ parallel procedures is dramatically cheaper. The communication cost here is negligible, since only eigenvector matrices of small size are communicated. For the RLR simulations we observe a similar efficiency enhancement.

As for the estimation performance, from Fig. 1(c) we observe that when $\lambda \le 0.5$, DRL-RPCA, RPCA with division-averaging (Div.-Avg. RPCA) and centralized RPCA (i.e., RC-PCA) achieve similar performance, which is much better than non-robust standard PCA. When $\lambda = 0$, i.e., when there are no outliers, the performance of DRL-RPCA and Div.-Avg. RPCA is slightly worse than standard PCA, as the quality of each mini-batch estimate deteriorates due to the smaller sample size; the distributed algorithms of course trade this small loss for their large gains in efficiency.

Table 1: Comparison of the estimation error for PCA between Division-Averaging (Div.-Avg.) and DRL, with machine latency and communication errors, under the same parameter setting as Figure 1(c) and outlier fraction $\lambda = 0.4$. The average and standard deviation of the error over 10 repetitions are reported (rows: latency, communication error; columns: DRL, Div.-Avg.).

These results demonstrate that DRL preserves the robustness of the centralized algorithms well. When the outlier fraction $\lambda$ increases to 0.6, the centralized (blue lines) and division-averaging algorithms (green lines) break down sharply, as the outliers outnumber their maximal breakdown point of 0.5. In contrast, DRL-RPCA and DRL-RLR still exhibit strong robustness and perform much better, which demonstrates that the DRL framework is indeed robust to computing nodes breaking down, and even enhances the robustness of the base robust learning methods under favorable outlier distributions across the machines.
Comparison with Averaging
Taking the average instead of the geometric median is a natural alternative to DRL. Here we provide more simulations for the RPCA problem to compare these two aggregation strategies in the presence of different errors on the computing nodes.

In distributed computation of learning problems, besides outliers, significant deterioration of performance may result from unreliabilities such as the latency of some machines or communication errors. For instance, it is not uncommon that machines solve their own sub-problems at different speeds, and sometimes users may need to stop the learning before all machines have output their final results. In this case, the results from the slow machines may not be accurate enough and can hurt the quality of the aggregated solution. Similarly, communication errors may also damage the overall performance. We simulate machine latency by stopping the algorithms once over half of the machines have finished their computation. To simulate communication errors, we randomly sample $k/10$ of the estimates and flip the sign of 30% of the elements in these estimates. The estimation errors of the solutions aggregated by averaging and by DRL are given in Table 1. Clearly, DRL offers stronger resilience to unreliability of the computing nodes.
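The sign-flip corruption described above is easy to reproduce. The snippet below is our own illustration (not the paper's experiment code); it reuses a compact version of the Weiszfeld routine sketched earlier and compares plain averaging with geometric-median aggregation when a tenth of the $k$ per-machine estimates are corrupted.

```python
import numpy as np

def geometric_median(points, iters=100):
    theta = points.mean(axis=0)
    for _ in range(iters):                        # Weiszfeld iterations, as sketched earlier
        d = np.maximum(np.linalg.norm(points - theta, axis=1), 1e-12)
        theta = (points / d[:, None]).sum(axis=0) / (1.0 / d).sum()
    return theta

rng = np.random.default_rng(0)
k, p = 100, 50
estimates = rng.normal(0.0, 0.05, size=(k, p)) + 1.0       # per-machine estimates near theta* = 1
corrupted = rng.choice(k, size=k // 10, replace=False)      # k/10 machines hit by communication errors
for i in corrupted:
    flip = rng.choice(p, size=int(0.3 * p), replace=False)  # flip the sign of 30% of the entries
    estimates[i, flip] *= -1.0
theta_star = np.ones(p)
print("averaging error       :", np.linalg.norm(estimates.mean(axis=0) - theta_star))
print("geometric median error:", np.linalg.norm(geometric_median(estimates) - theta_star))
```

The averaged estimate absorbs the corrupted coordinates in full, while the geometric median essentially ignores the corrupted machines, mirroring the behavior reported in Table 1.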
Real large-scale data
We further apply ORL-LR to an image tag prediction problem on a large-scale image set, the Flickr-10M image set, which contains $1 \times 10^7$ images with noisy user-contributed tags (available at http://webscope.sandbox.yahoo.com/catalog.php?datatype=i&did=67). We employ robust linear regression to predict 200 semantic tags for each image, where each image is described by a high-dimensional feature vector.

Table 2: Tag prediction accuracy (mean ± standard deviation) of ORL-LR, Online Averaging LR, and the non-robust SGD baseline on the Flickr data.

We perform the experiments in the online learning setting and compare the performance of the proposed ORL-LR with Online Averaging LR. We also implement a non-robust baseline, stochastic gradient descent, to solve the LR problem. The size of each mini-batch is fixed. From the results in Table 2, one can observe that ORL-LR achieves significantly higher accuracy than the non-robust baseline algorithms, with a margin of more than 9%.

Technical Lemmas

Lemma 2 (Hoeffding's Inequality). Let $X_1, \dots, X_n$ be independent random variables taking values in $[0, 1]$. Let $\bar X_n = \frac{1}{n}\sum_{i=1}^{n} X_i$ and $\mu = \mathbb{E}\bar X_n$. Then for $0 < t < 1 - \mu$,
$$\mathbb{P}\big( \bar X_n - \mu \ge t \big) \le \exp\{-2nt^2\}.$$

Lemma 3 (A coupling result [14]). Let $Y_1, \dots, Y_N$ be independent random variables, let $x$ be a real number, and let $A = \mathrm{Card}\{i = 1, \dots, N \text{ s.t. } Y_i > x\}$. Let $p \in (0, 1)$ be such that $p \ge \mathbb{P}\{Y_i > x\}$ for all $i = 1, \dots, N$, and let $B$ be a random variable with binomial law $\mathrm{Bin}(N, p)$. There exists a coupling $\tilde C = (\tilde A, \tilde B)$ such that $\tilde A$ has the same distribution as $A$, $\tilde B$ has the same distribution as $B$, and $\tilde A \le \tilde B$. In particular, for all $y > 0$, $\mathbb{P}\{A > y\} \le \mathbb{P}\{B > y\}$.

The following lemma demonstrates that aggregating estimates via their geometric median can enhance the confidence significantly.
Lemma 4.
Given $k$ independent estimates of $\theta^\star$ satisfying $\mathbb{P}(\|\theta_i - \theta^\star\| \le R) \ge p^* > 1/2$ for all $i = 1, \dots, k$, let $\hat\theta = \mathrm{median}(\theta_1, \dots, \theta_k)$. Then we have
$$\mathbb{P}\big( \|\hat\theta - \theta^\star\| < C_\gamma R \big) \ge 1 - \exp\Big\{ -2k\big(\gamma - (1 - p^*)\big)^2 \Big\}.$$
Here $\gamma$ and $C_\gamma$ are defined as in Lemma 1.

Proof. According to Lemma 1, we have
$$\mathbb{P}\big( \|\hat\theta - \theta^\star\| \ge C_\gamma R \big) \le \mathbb{P}\left( \sum_{j=1}^{k} \mathbb{1}\{\|\theta_j - \theta^\star\| \ge R\} \ge \gamma k \right).$$
Let $Z_j = \mathbb{1}\{\|\theta_j - \theta^\star\| > R\} \sim \mathrm{Ber}(1 - p^*)$, and let $W = \sum_{j=1}^{k} Z_j$ have the binomial distribution $W \sim \mathrm{Bin}(k, 1 - p^*)$. Then
$$\mathbb{P}\left( \sum_{j=1}^{k} \mathbb{1}\{\|\theta_j - \theta^\star\| > R\} > \gamma k \right) \le \mathbb{P}(W > \gamma k),$$
according to Lemma 3. Applying Hoeffding's inequality in Lemma 2 (with $\mu = 1 - p^*$ and $t = \gamma - (1 - p^*)$, where $1 - p^* < \gamma < 1/2$) gives
$$\mathbb{P}\big( \|\hat\theta - \theta^\star\| > C_\gamma R \big) \le \mathbb{P}(W > \gamma k) \le \exp\Big\{ -2k\big(\gamma - (1 - p^*)\big)^2 \Big\}.$$

We suppose from now on that the following conditions hold.
Condition 1.
Assume that $\theta^\star \in \Theta$ is the parameter of interest. Let $\theta_1, \dots, \theta_T \in \Theta$ be a collection of independent estimates of $\theta^\star$ which are not concentrated on a straight line: for all $v \in \Theta$, there is $w \in \Theta$ such that $\langle v, w \rangle = 0$ and $\frac{1}{T}\sum_{i=1}^{T} \|\langle w, \theta_i - \bar\theta \rangle\| > 0$, where $\bar\theta = \frac{1}{T}\sum_{i=1}^{T} \theta_i$.

As noted in [2], Condition 1 ensures that the geometric median $\hat\theta$ of the $T$ estimates is uniquely defined.
The distribution of the independent estimates of $\theta^\star$ is a mixture of two "nice" distributions: $\mu_{\theta^\star} = \mu_c + \mu_d$. Here $\mu_c$ is not strongly concentrated around single points: if $B(0, a)$ is the ball $\{u \in \Theta : \|u\| \le a\}$ and $Y$ is a random variable with distribution $\mu_c$, then for any constant $a > 0$,
$$\exists\, C_a \in [0, \infty), \;\; \forall u \in B(0, a), \;\; \mathbb{E}_Y\big[\|Y - u\|^{-1}\big] \le C_a.$$
In addition, $\mu_d$ is a discrete measure, $\mu_d = \sum_i \delta_{u_i}$, where $\delta_{u_i}$ is the Dirac measure at the point $u_i$. We denote by $D$ the support of $\mu_d$ and assume that the median $\hat\theta \notin D$.

Conditions 1 and 2 are only technical conditions to avoid pathologies in the convergence analysis of Algorithm 1. In practical implementations, we can simply set the sub-gradient of $G(u)$ at $u'$ to zero (a valid sub-gradient, as proved in [2]) when $u' \in D$.

Convergence Rate of Geometric Median Filtering

Given the definition of the geometric median in (2), we can define the following population geometric median loss function $G: \Theta \to \mathbb{R}$, which we minimize to compute the geometric median:
$$G(u) \triangleq \mathbb{E}\big[ \|\Theta - u\| - \|\Theta\| \big]. \qquad (7)$$
In this subsection, we first show that the geometric median function in (7) is indeed strongly convex under Conditions 1 and 2. Thus the SGD optimization provides solutions that converge at a rate of $O(\log\log(T)/T)$ to the true geometric median $\hat\theta$, given $T$ independent estimates.

Definition 2 ($\beta$-strongly convex function [18]). A function $G$ is $\beta$-strongly convex if for all $u_1, u_2 \in \Theta$ and any sub-gradient $g(u_1)$ of $G$ at $u_1$,
$$\langle g(u_1) - g(u_2),\, u_1 - u_2 \rangle \ge \beta \|u_1 - u_2\|^2.$$

The following theorem establishes the strong convexity of the geometric median function in (7).
Theorem 3.
Let $g(u)$ be the sub-gradient of $G(u)$ at $u$. Under Conditions 1 and 2, there is a strictly positive constant $c_a > 0$ such that, for all $u_1, u_2 \in B(0, a)$,
$$\langle g(u_1) - g(u_2),\, u_1 - u_2 \rangle \ge c_a \|u_1 - u_2\|^2,$$
and thus $G(u)$ is $c_a$-strongly convex.

The proof can be derived straightforwardly from the proof of Proposition 2.1 in [2], and we omit the details here.

Given the strong convexity of the geometric median function $G(u)$, we can apply the convergence argument of SGD for strongly convex functions (e.g., Proposition 1 in [19]) and obtain the following convergence rate for online geometric median filtering.

Theorem 4.
Assume Conditions 1 and 2 hold, and $\|\theta_i\| \le K$ for all $i = 1, \dots, T$. Then $\|\hat g_t\| \le K C_a$ with probability 1. Assume $T$ is at least a small absolute constant and let $\delta \in (0, 1/e)$. Pick $\eta_t = 1/(t c_a)$ in Algorithm 1 and let $\hat\theta_t$ denote the output at time step $t$. Furthermore, let $\hat\theta$ be the geometric median of $\{\theta_i\}_{i=1}^{T}$. Then, for any $t \le T$,
$$\|\hat\theta_t - \hat\theta\| \le \frac{C'_a\,\big(\log(\log(t)/\delta) + 1\big)}{t},$$
with probability at least $1 - \delta$. Here $C'_a = K C_a / c_a$.

The bound on the gradient, $\|\hat g_t\| \le K C_a$, follows from the definition of the gradient in (4), Condition 2, and the assumption that all the estimates are bounded.

Proofs of Proposition 1

From now on, we slightly abuse notation and use $\tilde\theta$ to denote the geometric median of a collection of estimates.

Proof.
Proposition 1 can be derived from the following triangle inequality:
$$\|\hat\theta_t - \theta^\star\| \le \|\hat\theta_t - \tilde\theta\| + \|\tilde\theta - \theta^\star\|,$$
where $\tilde\theta$ denotes the "true" geometric median of the estimates $\{\theta_i\}_{i=1}^{t}$. We now proceed to bound the above two terms separately. Based on Theorem 4, we have
$$\|\hat\theta_t - \tilde\theta\| \le \frac{C'_a\,\big(\log(\log(t)/\delta) + 1\big)}{t},$$
with probability at least $1 - \delta$. The second term can be bounded as follows by applying Lemma 4:
$$\mathbb{P}\left( \|\tilde\theta - \theta^\star\| \lesssim_{\delta, L} \sqrt{\frac{p}{b}} + \lambda(\gamma)\sqrt{p} \right) \ge 1 - \delta,$$
where $\lambda(\gamma) = \lambda_{(1-\gamma)}/(1 - \lambda_{(1-\gamma)})$ and $\lambda_{(1-\gamma)}$ denotes the $\lfloor (1-\gamma)k \rfloor$-th smallest outlier fraction in $\{\lambda_1, \dots, \lambda_k\}$, with $\gamma \in [0, 1/2)$. Combining the above two bounds gives
$$\|\hat\theta_t - \theta^\star\| \lesssim_{\delta, L}\; \frac{C'_a\,\big(\log(\log(t)/\delta) + 1\big)}{t} + C_\gamma \sqrt{\frac{p}{b}} + C'\, \lambda(\gamma)\sqrt{p}.$$

Before proving the performance guarantees for ORL-PCA and ORL-LR, we provide a robustness analysis for the base robust learning procedures, RC-PCA and RoTR.
Theorem 5. Suppose in total $N$ samples are provided, with $n_1$ authentic samples and $n_2$ outliers, and let $\lambda = n_2/N$. Assume the authentic samples follow a sub-Gaussian design with parameter $L$. Let $\Delta_d = \sigma_d - \sigma_{d+1}$, where $\sigma_d$ denotes the $d$-th largest eigenvalue of the ground-truth sample covariance matrix $C^\star$. Let $P_U$ be the $d$-dimensional subspace projector output by RC-PCA. Then, for a constant $c$, we have with probability $1 - \delta$,
$$\|P_U - P_U^\star\|_\infty \le \frac{L}{\Delta_d}\left\{ \sqrt{c \log\Big(\frac{4}{\delta}\Big)}\, \sqrt{\frac{p}{n_1}} + \frac{\lambda}{1-\lambda}\, \log\Big(\frac{2}{\delta}\Big) \right\}.$$

Proof. According to the proof of Theorem 4 in [4] and the deviation bound on empirical covariance matrix estimation in [20], when the authentic samples come from a sub-Gaussian distribution with parameter $L$, we have, for the covariance matrix constructed in Algorithm 3,
$$\|\hat C - C^\star\|_\infty \le L \sqrt{c \log\Big(\frac{4}{\delta}\Big)}\, \sqrt{\frac{p}{n_1}} + \frac{n_2}{n_1}\, L \log\Big(\frac{2}{\delta}\Big),$$
with probability at least $1 - \delta$. Here $c$ is a constant, $n_1$ is the number of authentic samples, and $n_2$ is the number of outliers. Let $\Delta_d = \sigma_d - \sigma_{d+1}$ be the eigenvalue gap, where $\sigma_d$ denotes the $d$-th largest eigenvalue of $C^\star$. Then, applying the Davis–Kahan perturbation theorem [5], whenever $\|\hat C - C^\star\|_\infty \le \Delta_d$ we have $\|P_U - P_U^\star\|_\infty \le \|\hat C - C^\star\|_\infty / \Delta_d$. Thus,
$$\|P_U - P_U^\star\|_\infty \le \frac{L}{\Delta_d}\left\{ \sqrt{c \log\Big(\frac{4}{\delta}\Big)}\, \sqrt{\frac{p}{n_1}} + \frac{n_2}{n_1}\, \log\Big(\frac{2}{\delta}\Big) \right\},$$
with probability at least $1 - \delta$.

Proof.
Theorem 1 can be derived directly from the following triangle inequality:
$$\|\hat P_U^{(T)} - P_U^\star\|_F \le \|\hat P_U^{(T)} - \tilde P_U\|_F + \|\tilde P_U - P_U^\star\|_F,$$
and we bound the above two terms separately. The first term can be bounded by Theorem 4 as
$$\|\hat P_U^{(T)} - \tilde P_U\|_F \le \frac{C'_a\,\big(\log(\log(T)/\delta) + 1\big)}{T},$$
with probability $1 - \delta$. The second term can be bounded as in Theorem 5: with probability $1 - \delta$,
$$\|\tilde P_U - P_U^\star\|_F \le c_1\, p \sqrt{\frac{d \log(1/\delta)}{N}} + c_2\, \lambda(\gamma) \sqrt{dp}.$$
Combining the above two bounds (with a union bound) proves the theorem.
Before proving Theorem 2, we first state the following performance guarantee for the RoTR algorithm from [4]. The estimation error of RoTR is bounded as in Lemma 5.
Lemma 5 (Performance of RoTR [4]). Suppose the samples $x$ come from a sub-Gaussian design with $\Sigma_x = I_p$, with dimension $p$ and noise level $\sigma_e$. Then, with probability at least $1 - \delta$, the output of RoTR satisfies the $\ell_2$ bound
$$\big\|\hat\theta - \theta^\star\big\| \le c\, \|\theta^\star\| \sqrt{\frac{\sigma_e}{\|\theta^\star\|}} \left( \sqrt{\frac{p \log(1/\delta)}{n}} + \frac{\lambda}{1-\lambda}\, \sqrt{p}\, \log(1/\delta) \right).$$
Here $c$ is a constant independent of $p$, $n$, $\lambda$.

Proof. Based on the results in Lemma 5 and Lemma 4, it is straightforward to get
$$\|\tilde\theta - \theta^\star\| \le C'_\gamma\, \|\theta^\star\| \sqrt{\frac{\sigma_e}{\|\theta^\star\|}} \left( \sqrt{\frac{p \log(1/\delta)}{N}} + \lambda(\gamma)\, \sqrt{p}\, \log(1/\delta) \right),$$
where $C'_\gamma = C_\gamma\, c$ with $c$ the constant in Lemma 5, $C_\gamma = (1-\gamma)\sqrt{\tfrac{1}{1-2\gamma}}$, and $\lambda(\gamma) = \lambda_{(1-\gamma)}/(1 - \lambda_{(1-\gamma)})$ with $\lambda_{(1-\gamma)}$ the $\lfloor k(1-\gamma) \rfloor$-th smallest outlier fraction in $\{\lambda_1, \dots, \lambda_k\}$. As in the proof of Theorem 1, Theorem 2 can then be derived based on the results in Theorem 4; for simplicity, we omit the details here.

Conclusions

We developed a generic Online Robust Learning (ORL) approach with a provable robustness guarantee, and we also demonstrated its application to Distributed Robust Learning (DRL). The proposed approaches not only significantly enhance the time and memory efficiency of robust learning but also preserve the robustness of the centralized learning procedures. Moreover, when the outliers are not uniformly distributed, the proposed approaches remain robust to adversarial outlier distributions. We provided two concrete examples: online and distributed robust principal component analysis and linear regression.
References

[1] Léon Bottou. Online learning and stochastic approximations. On-line Learning in Neural Networks, 17(9):142, 1998.
[2] Hervé Cardot, Peggy Cénac, Pierre-André Zitt, et al. Efficient and fast estimation of the geometric median in Hilbert spaces with an averaged stochastic gradient algorithm. Bernoulli, 19(1):18–43, 2013.
[3] Yudong Chen and Constantine Caramanis. Noisy and missing data regression: Distribution-oblivious support recovery. In ICML, 2013.
[4] Yudong Chen, Constantine Caramanis, and Shie Mannor. Robust sparse regression under adversarial corruption. In ICML, 2013.
[5] Chandler Davis and William Morton Kahan. The rotation of eigenvectors by a perturbation. III. SIAM Journal on Numerical Analysis, 7(1):1–46, 1970.
[6] Jeffrey Dean and Sanjay Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.
[7] David L. Donoho and Miriam Gasko. Breakdown properties of location estimates based on halfspace depth and projected outlyingness. The Annals of Statistics, pages 1803–1827, 1992.
[8] Jiashi Feng, Huan Xu, Shie Mannor, and Shuicheng Yan. Online PCA for contaminated data. In NIPS, pages 764–772, 2013.
[9] Jiashi Feng, Huan Xu, and Shuicheng Yan. Online robust PCA via stochastic optimization. In NIPS, 2013.
[10] N. Guan, D. Tao, Z. Luo, and B. Yuan. Online nonnegative matrix factorization with robust stochastic approximation. IEEE Transactions on Neural Networks and Learning Systems, 23(7):1087–1099, 2012.
[11] Daniel Hsu and Sivan Sabato. Loss minimization and parameter estimation with heavy tails. arXiv preprint arXiv:1307.1827, 2013.
[12] Peter J. Huber. Robust Statistics. Springer, 2011.
[13] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
[14] Matthieu Lerasle and Roberto I. Oliveira. Robust empirical mean estimators. arXiv preprint arXiv:1112.3914, 2011.
[15] Julien Mairal, Francis Bach, Jean Ponce, and Guillermo Sapiro. Online dictionary learning for sparse coding. In ICML, pages 689–696, 2009.
[16] Ricardo A. Maronna and Víctor J. Yohai. Robust estimation of multivariate location and scatter. Encyclopedia of Statistical Sciences, 1998.
[17] Stanislav Minsker. Geometric median and robust estimation in Banach spaces. arXiv preprint arXiv:1308.1334, 2013.
[18] Angelia Nedic, D. P. Bertsekas, and A. E. Ozdaglar. Convex Analysis and Optimization. Athena Scientific, 2003.
[19] Alexander Rakhlin, Ohad Shamir, and Karthik Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. arXiv preprint arXiv:1109.5647, 2011.
[20] Roman Vershynin. How close is the sample covariance matrix to the actual covariance matrix? Journal of Theoretical Probability, 25(3):655–686, 2012.
[21] Huan Xu, C. Caramanis, and S. Mannor. Outlier-robust PCA: the high-dimensional case.