Estimating Traffic and Anomaly Maps via Network Tomography
Morteza Mardani and Georgios B. Giannakis (contact author)∗

Abstract—Mapping origin-destination (OD) network traffic is pivotal for network management and proactive security tasks. However, lack of sufficient flow-level measurements as well as potential anomalies pose major challenges towards this goal. Leveraging the spatiotemporal correlation of nominal traffic, and the sparse nature of anomalies, this paper brings forth a novel framework to map out nominal and anomalous traffic, which treats jointly important network monitoring tasks including traffic estimation, anomaly detection, and traffic interpolation. To this end, a convex program is first formulated with nuclear- and ℓ1-norm regularization to effect low rank and sparsity for the nominal and anomalous traffic, respectively, with only the link counts and a small subset of OD-flow counts. Analysis and simulations confirm that the proposed estimator can exactly recover sufficiently low-dimensional nominal traffic and sporadic anomalies so long as the routing paths are sufficiently "spread-out" across the network, and an adequate amount of flow counts are randomly sampled. The results offer valuable insights about data acquisition strategies and network scenarios giving rise to accurate traffic estimation. For practical networks where the aforementioned conditions are possibly violated, the inherent spatiotemporal traffic patterns are taken into account by adopting a Bayesian approach along with a bilinear characterization of the nuclear and ℓ1 norms. The resultant nonconvex program involves quadratic regularizers with correlation matrices, learned systematically from (cyclo)stationary historical data. Alternating-minimization based algorithms with provable convergence are also developed to procure the estimates. Insightful tests with synthetic and real Internet data corroborate the effectiveness of the novel schemes.
Index Terms—Sparsity, low rank, convex optimization, nominal and anomalous traffic, spatiotemporal correlation.
I. INTRODUCTION
Emergence of multimedia services and Internet-friendly portable devices is multiplying network traffic volume day by day [1]. Moreover, the advent of diverse networks of intelligent devices, including those deployed to monitor the smart power grid, transportation networks, medical information networks, and cognitive radio networks, will transform the communication infrastructure to an even more complex and heterogeneous one. Thus, ensuring compliance to service-level agreements necessitates ground-breaking management and monitoring tools providing operators with informative depictions of the network state. One such atlas (set of maps) can offer a flow-time depiction of the network origin-destination (OD) flow traffic. Situational awareness provided by such maps will be the key enabler for effective routing and congestion control, network health management, risk analysis, security assurance, and proactive network failure prevention.

† Work in this paper was supported by the MURI Grant No. AFOSR FA9550-10-1-0567. Parts of the paper were presented in the Proc. of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Vancouver, Canada, May 26-31, 2013, and in the IEEE Global Signal and Information Processing Workshop, Austin, Texas, December 3-5, 2013.
∗ The authors are with the Dept. of ECE and the Digital Technology Center, University of Minnesota, 200 Union Street SE, Minneapolis, MN 55455. Tel/fax: (612) 626-7781/625-4583; Emails: {morteza,georgios}@umn.edu

Acquiring such diagnosis/prognosis maps for large networks however is an arduous task. This is mainly because the number of OD pairs grows promptly as the network size grows, while probing exhaustively all OD pairs becomes impractical even for moderate-size networks [2]. In addition, OD flows potentially undergo anomalies arising due to e.g., cyberattacks and network failures [3], and the acquired measurements typically encounter misses, outliers, and errors.

Towards creating traffic maps, one typically has access to: (D1) link counts comprising the superposition of OD flows per link; these counts can be readily obtained using the simple network management protocol (SNMP) [3]; and (D2) partial
OD-flow counts recorded using e.g., the NetFlow protocol [3]. Extensive studies of backbone Internet protocol (IP) networks reveal that the nominal OD-flow traffic is spatiotemporally correlated mainly due to common temporal patterns across OD flows, and exhibits periodic trends (e.g., daily or weekly) across time [3]. This renders the nominal traffic having a small intrinsic dimensionality. Moreover, traffic volume anomalies rarely occur across flows and time [3]–[5]. Given the observations (D1) and/or (D2), ample research has been carried out over the years to tackle the ill-posed traffic inference task relying on various techniques that leverage the traffic features as prior knowledge; see e.g., [6]–[13] and references therein.

To date, the main body of work on traffic inference relies on least-squares (LS) and Gaussian [7], [8] or Poisson models [9], and entropy regularization [10]. None of these methods however takes spatiotemporal dependencies of the traffic into account. To enhance estimation accuracy by exploiting the spatiotemporal dependencies of traffic, attempts have been made in [11] and [12]. Using the prior spatial and temporal structures of traffic, [11] applies rank regularization along with matrix factorization to discover the global low-rank traffic matrix from the link and/or flow counts. The model in [11] is however devoid of anomalies, which can severely deteriorate traffic estimation quality. In the context of anomaly detection, our companion work [12] capitalizes on the low rank of traffic and sparsity of anomalies to unveil the traffic volume anomalies from the link loads (D1). Without OD-flow counts however, the nominal flow-level traffic cannot be identified using the approach of [12].

The present work addresses these limitations by introducing a novel framework that efficiently and scalably constructs network traffic maps.
Leveraging recent advances in compressive sensing and rank minimization, first, a novel estimator is put forth to effect sparsity and low-rank attributes for
IEEE/ACM TRANSACTIONS ON NETWORKING (SUBMITTED)

the anomalous and nominal traffic components through the ℓ1- and nuclear-norm, respectively. The recovery performance of the sought estimator is then analyzed in the noise-free setting following a deterministic approach along the lines of [14]. Sufficient incoherence conditions are derived based on the angle between certain subspaces to ensure the retrieved traffic and anomaly matrices coincide with the true ones. The recovery conditions yield valuable insights about the network structures and data acquisition strategies giving rise to accurate traffic estimation. Intuitively, one can expect accurate traffic estimation if: (a) NetFlow measures sufficiently many randomly selected OD flows; (b) the OD paths are sufficiently "spread-out" so that the routes form a column-incoherent routing matrix; (c) the nominal traffic is sufficiently low dimensional; and, (d) anomalies are sporadic enough.

Albeit insightful, the accurate-recovery conditions may not hold in practical networks. For instance, it may happen that a specific flow undergoes a bursty anomaly lasting for a long time [4], or certain OD flows may be inaccessible for the entire time horizon of interest with no NetFlow samples at hand. With the practical network challenges however come opportunities to exploit certain structures, and thus cope with the aforementioned challenges. This work bridges this "theory-practice" gap by incorporating the spatiotemporal patterns of the nominal and anomalous traffic, both of which can be learned from historical data. Adopting a Bayesian approach, a novel estimator is introduced for the traffic following a bilinear characterization of the nuclear- and ℓ1-norms. The resultant nonconvex problem entails quadratic regularizers loaded with inverse correlation matrices to effect structured sparsity and low rank for the anomalous and nominal traffic matrices, respectively.
A systematic approach for learning traffic correlations from historical data is also devised taking advantage of the (cyclo)stationary nature of traffic. Alternating majorization-minimization algorithms are also developed to obtain iterative estimates, which are provably convergent.

Simulated tests with synthetic network and real Internet data corroborate the effectiveness of the novel schemes, especially in reducing the number of acquired NetFlow samples needed to attain a prescribed estimation accuracy. In addition, the proposed optimization-based approach opens the door for efficient in-network and online processing along the lines of our companion works in [15] and [16]. The novel ideas can also be applicable to various other inference tasks dealing with recovery of structured low-rank and sparse matrices.

The rest of this paper starts with preliminaries and problem statement in Section II. The novel estimator to map out the nominal and anomalous traffic is discussed in Section III, and pertinent reconstruction claims are established in Section IV. Sections V and VI deal with incorporating the spatiotemporal patterns of traffic to improve estimation quality. Certain practical issues are addressed in Section VIII. Simulated tests are reported in Section IX, and finally Section X draws the conclusions.

Notation:
Bold uppercase (lowercase) letters will denote matrices (column vectors), and calligraphic letters will be used for sets. Operators (·)′, tr(·), σ_max(·), [·]_+, ⊕, ⊙, E[·], and dim(·) will denote transposition, matrix trace, maximum singular value, projection onto the nonnegative orthant, direct sum, Hadamard product, statistical expectation, and dimension of a subspace, respectively; |·| will stand for the cardinality of a set, and the magnitude of a scalar. The ℓ_p-norm of x ∈ R^n is ‖x‖_p := (Σ_{i=1}^n |x_i|^p)^{1/p} for p ≥ 1. For two matrices M, U ∈ R^{n×n}, ⟨M, U⟩ := tr(M′U) denotes their trace inner product. The Frobenius norm of matrix M = [m_{i,j}] ∈ R^{n×p} is ‖M‖_F := √tr(MM′), ‖M‖ := max_{‖x‖₂=1} ‖Mx‖₂ is the spectral norm, and ‖A‖_∞ := max_{i,j} |a_{i,j}| the ℓ_∞-norm. The n×n identity matrix will be represented by I_n and its i-th column by e_i, while 0_n will stand for the n×1 vector of all zeros, and 0_{n×p} := 0_n 0′_p. Operator vec stacks the columns of a matrix, and unvec does the converse; ∩ and ∪ stand for set intersection and union, respectively; supp(A) := {(i, j) : a_{i,j} ≠ 0} is the support set of A, and [n] := {1, . . . , n}.
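For concreteness, a few of these notational conventions can be sanity-checked with a short pure-Python sketch; the matrices below are illustrative, not from the paper.

```python
# Sanity checks for the notation: the lp-norm, the trace inner product
# <M, U> := tr(M'U), and the Frobenius norm ||M||_F := sqrt(tr(M M')).
import math

def lp_norm(x, p):
    # ||x||_p := (sum_i |x_i|^p)^(1/p), for p >= 1
    return sum(abs(xi) ** p for xi in x) ** (1.0 / p)

def trace_inner(M, U):
    # <M, U> = tr(M'U) = elementwise sum of products
    return sum(M[i][j] * U[i][j]
               for i in range(len(M)) for j in range(len(M[0])))

def fro_norm(M):
    # ||M||_F = sqrt(<M, M>)
    return math.sqrt(trace_inner(M, M))

x = [3.0, -4.0]
print(lp_norm(x, 2))     # 5.0
M = [[1.0, 2.0], [3.0, 4.0]]
print(trace_inner(M, M)) # 30.0
print(fro_norm(M))       # sqrt(30) ~ 5.477
```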
II. PRELIMINARIES AND PROBLEM STATEMENT

Consider a backbone IP network described by the directed graph G(N, L), where L and N denote the set of links and nodes (routers) of cardinality |L| = L and |N| = N, respectively. A set of end-to-end flows F with |F| = F traverse different OD pairs. In backbone networks, the number of OD flows far exceeds the number of physical links (F ≫ L). Per OD flow, multipath routing is considered where each flow traverses multiple possibly overlapping paths to reach its intended destination. Letting x_{f,t} denote the unknown traffic level of flow f ∈ F at time t, link ℓ ∈ L carries the fraction r_{ℓ,f} ∈ [0, 1] of this flow; clearly, r_{ℓ,f} = 0 if flow f is not routed through link ℓ. The total traffic carried by link ℓ is then the weighted superposition of the flows routed through it, that is, Σ_{f∈F} r_{ℓ,f} x_{f,t}. The weights {r_{ℓ,f}} form the routing matrix R ∈ [0, 1]^{L×F}, which is assumed fixed and given. These weights are not arbitrary but must respect the flow-conservation law

Σ_{ℓ∈L_in(n)} r_{ℓ,f} = Σ_{ℓ∈L_out(n)} r_{ℓ,f},  ∀ f ∈ F

where L_in(n) and L_out(n) denote the sets of incoming and outgoing links of node n ∈ N, respectively.

It is not uncommon for some of the flow rates to experience sudden changes, termed traffic volume anomalies, which are typically due to network failures or cyberattacks [3]. With a_{f,t} denoting the unknown traffic volume anomaly of flow f at time t, the traffic carried by link ℓ at time t is

y_{ℓ,t} = Σ_{f∈F} r_{ℓ,f} (x_{f,t} + a_{f,t}) + v_{ℓ,t},  t ∈ T   (1)

where the time horizon T comprises T slots, and v_{ℓ,t} accounts for the measurement errors. In IP networks, link loads can be readily measured via SNMP supported by most routers [3]. Introducing the matrices Y := [y_{ℓ,t}], V := [v_{ℓ,t}] ∈ R^{L×T}, X := [x_{f,t}], and A := [a_{f,t}] ∈ R^{F×T}, the link counts in (1) can be expressed in the compact matrix form

Y = R(X + A) + V.   (2)
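A toy instance of the link-count model (1)–(2) can be generated as follows; the three-node chain, its routing matrix, and all traffic values are hypothetical, chosen only to illustrate the model.

```python
# Toy instance of Y = R(X + A) + V on the chain 1 -> 2 -> 3 with links
# l1:(1->2), l2:(2->3), and flows f1: 1->3 (via l1, l2), f2: 1->2 (via l1).
# All numbers are illustrative, not from the paper.

def matmul(R, X):
    L, F, T = len(R), len(R[0]), len(X[0])
    return [[sum(R[l][f] * X[f][t] for f in range(F)) for t in range(T)]
            for l in range(L)]

R = [[1.0, 1.0],   # link l1 carries flows f1 and f2
     [1.0, 0.0]]   # link l2 carries flow f1 only
X = [[2.0, 3.0],   # nominal traffic of f1 over T = 2 slots
     [1.0, 1.0]]   # nominal traffic of f2
A = [[0.0, 5.0],   # f1 hit by an anomaly at t = 2
     [0.0, 0.0]]
S = [[X[f][t] + A[f][t] for t in range(2)] for f in range(2)]
Y = matmul(R, S)   # noise-free link counts (V = 0)
print(Y)           # [[3.0, 9.0], [2.0, 8.0]]
```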
Here, matrices X and A contain, respectively, the nominal and anomalous traffic flows over the time horizon T. Inferring (X, A) from the compressed measurements Y is a severely underdetermined task (recall that L ≪ F), necessitating
Fig. 1. Internet-2 traffic for a few representative OD flows across time and flows [17].

additional data to ensure identifiability and improve estimation accuracy. A useful such source is the direct flow-level measurements

z_{f,t} = x_{f,t} + a_{f,t} + w_{f,t},  t ∈ T, f ∈ F   (3)

where w_{f,t} accounts for measurement errors. The flow traffic in (3) is sampled via NetFlow [3] at each origin node. This however incurs a high cost, which means that one can have the measurements (3) only for a few (f, t) pairs [3]. To account for missing flow-level data, collect the available pairs (f, t) in the set Π ⊆ [F] × [T]; introduce also the matrices Z_Π := [z_{f,t}], W_Π := [w_{f,t}] ∈ R^{F×T}, where z_{f,t} = w_{f,t} = 0 for (f, t) ∉ Π, and associate the sampling operator P_Π with the set Π, which sets the entries of its matrix argument not in Π to zero, and keeps the rest unchanged. As with X, it holds that P_Π(X) ∈ R^{F×T}. The flow counts in (3) can then be compactly written as

Z_Π = P_Π(X + A) + W_Π.   (4)

Besides periodicity, temporal patterns common to traffic flows render the rows (correspondingly columns) of X correlated, and thus X exhibits a few dominant singular values, which makes it (approximately) low rank [3]. Anomalies on the other hand are expected to occur occasionally, as only a small fraction of flows are supposed to be anomalous at any given time instant, which means A is sparse. Anomalies may exhibit certain patterns; e.g., a failure at a part of the network may simultaneously render a subset of flows anomalous, or certain flows may be subject to bursty malicious attacks over time.

Given the link counts Y obeying (2) along with the partial flow counts Z_Π adhering to (4), and with {R, Π} known, this paper aims at accurately estimating the unknown low-rank nominal and sparse anomalous traffic pair (X, A).

III. MAPS OF NOMINAL AND ANOMALOUS TRAFFIC
In order to estimate the unknowns of interest, a natural estimator accounting for the low rank of X and the sparsity of A would minimize the rank of X, and the number of nonzero entries of A measured by its ℓ0-(pseudo) norm. Unfortunately, both rank and ℓ0-norm minimization problems are in general NP-hard [18]–[20]. The nuclear norm ‖X‖∗ := Σ_k σ_k(X), where σ_k(X) signifies the k-th singular value of X, and the ℓ1-norm ‖A‖₁ := Σ_{f,t} |a_{f,t}| are typically adopted as convex surrogates [19], [20]. Accordingly, one solves

(P1) (X̂, Â) = arg min_{(X,A)} (1/2)‖Y − R(X + A)‖²_F + (1/2)‖P_Π(Z − X − A)‖²_F + λ∗‖X‖∗ + λ‖A‖₁

where λ, λ∗ ≥ 0 are the sparsity- and rank-controlling parameters. From a network operation perspective, the estimate Â maps out the network health state across both time and flows. A large value |â_{f,t}| indicates that at time instant t flow f exhibits a severe anomaly, and therefore appropriate traffic engineering and security tasks need to be run to mitigate the consequences. The estimated map of nominal traffic X̂ is also a viable input for network planning tasks.

From the recovery standpoint, (P1) subsumes several important special cases, which deal with recovery of X̂ and/or Â. In the absence of flow counts, i.e., Π = ∅, exact recovery of the sparse anomaly matrix Â from the link loads is established in [12]. The key to this is the sparsity present, which enables recovery from compressed linear measurements. However, the (possibly huge) nullspace of R challenges identifiability of the nominal traffic matrix X, as will be delineated later. Moreover, with only flow counts partially available, (P1) boils down to the so-termed robust principal component pursuit (PCP), for which exact reconstruction of the low-rank nominal traffic component is established in [14]. An instrumental role in this case is played by the dependencies among the entries of the low-rank component, reflected in the observations.
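For concreteness, the (P1) objective can be evaluated numerically as follows; the dimensions, the 0/1 sampling mask standing in for P_Π, and the regularization weights are illustrative choices, not values advocated by the analysis.

```python
# Evaluating the (P1) objective for a candidate pair (X, A).  At the true
# noise-free pair the two data-fit terms vanish, so only the nuclear- and
# l1-norm regularizers remain.  All matrices and weights are illustrative.
import numpy as np

def p1_objective(Y, R, Z, mask, X, A, lam_star, lam_1):
    # 0.5||Y - R(X+A)||_F^2 + 0.5||P_Pi(Z - X - A)||_F^2
    #   + lam_star ||X||_* + lam_1 ||A||_1
    fit_links = 0.5 * np.linalg.norm(Y - R @ (X + A), 'fro') ** 2
    fit_flows = 0.5 * np.linalg.norm(mask * (Z - X - A), 'fro') ** 2
    nuclear = np.linalg.svd(X, compute_uv=False).sum()
    ell1 = np.abs(A).sum()
    return fit_links + fit_flows + lam_star * nuclear + lam_1 * ell1

rng = np.random.default_rng(0)
F, T, L, r = 6, 8, 3, 1
X = rng.standard_normal((F, r)) @ rng.standard_normal((r, T))  # low-rank nominal
A = np.zeros((F, T)); A[2, 3] = 10.0                           # sparse anomaly
R = rng.random((L, F))                                         # routing weights
mask = (rng.random((F, T)) < 0.3).astype(float)                # 0/1 pattern of Pi
Y = R @ (X + A)                                                # noise-free (2)
Z = mask * (X + A)                                             # noise-free (4)
val = p1_objective(Y, R, Z, mask, X, A, lam_star=0.1, lam_1=0.1)
```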
Indeed, the matrix of anomalies is not recoverable in this case, since the observed entries do not convey any information about the unobserved anomalies. Furthermore, without the sparse matrix, i.e., A = 0, and only with flow counts partially available, (P1) boils down to the celebrated matrix completion problem studied e.g., in [21], which can be applied to interpolate the traffic of unreachable OD flows from the ones observed at the edge routers.

The aforementioned considerations regarding recovery in these special cases make one hopeful to retrieve X and A via (P1). Before delving into the analysis of (P1), it is worth noting that [22] has recently studied recovery of compressed low-rank-plus-sparse matrices, also known as compressive PCP, where the compression is performed by an orthogonal projection onto a low-dimensional subspace, and the support of the sparse matrix is presumed uniformly random. The results require certain subspace incoherence conditions to hold, which in the considered traffic estimation task impose strong restrictions on the routing matrix R and the sampling operator P_Π(·). Furthermore, it is unclear how to relate the subspace incoherence conditions to the well-established incoherence measures adopted in the context of matrix completion and compressive sampling, which are satisfied by various classes of random matrices; see e.g., [20], [23].

Before closing this section, it is important to recognize that albeit few, the NetFlow measurements Z_Π play an important role in estimating X. In principle, if one merely knows the link counts Y, it is impossible to accurately identify X when the only prior information about X and A is that they are sufficiently low rank and sparse, respectively. This identifiability issue is formalized in the next lemma.
Lemma 1: With N_R denoting the nullspace of R, and X = UΣV′, if N_R ≠ {0} and one only knows {Y, R}, then for any W ∈ N_R the matrix pair {X̃ := X + WV′, A}: (i) is feasible; and (ii) satisfies rank(X̃) ≤ rank(X) =: r.

Proof: Clearly (i) holds true since RW = 0, and subsequently R(A + X̃) = R(A + X) + RWV′ = Y. Also, (ii) readily follows from Sylvester's inequality [24], which implies that rank(X̃) = rank((UΣ + W)V′) ≤ min{rank(UΣ + W), rank(V)} ≤ rank(V) = r.
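Lemma 1 is easy to verify numerically; the following sketch (toy dimensions, randomly drawn matrices, all illustrative) constructs a nullspace perturbation W and checks claims (i) and (ii).

```python
# Numerical illustration of Lemma 1: when R has a nontrivial nullspace,
# X~ = X + W V' with R W = 0 is also feasible and has rank <= r, so X is
# not identifiable from link counts alone.  Toy sizes throughout.
import numpy as np

rng = np.random.default_rng(1)
L, F, T, r = 3, 6, 8, 2                    # L < F => nontrivial nullspace of R
R = rng.random((L, F))
X = rng.standard_normal((F, r)) @ rng.standard_normal((r, T))  # rank-r nominal
A = np.zeros((F, T)); A[1, 4] = 5.0        # sparse anomaly
Y = R @ (X + A)                            # noise-free link counts

_, _, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt[:r].T                               # basis of the row space of X
G = rng.standard_normal((F, r))
W = G - np.linalg.pinv(R) @ (R @ G)        # project G onto null(R): R @ W = 0
X_tilde = X + W @ V.T                      # the ambiguous alternative of Lemma 1

print(np.allclose(R @ (X_tilde + A), Y))                            # (i): True
print(np.linalg.matrix_rank(X_tilde) <= np.linalg.matrix_rank(X))   # (ii): True
```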
IV. RECONSTRUCTION GUARANTEES

This section studies the exact reconstruction performance of (P1) in the absence of noise, namely V = 0 and W_Π = 0. The corresponding formulation can be expressed as

(P2) (X̂, Â) = arg min_{(X,A)} ‖X‖∗ + λ‖A‖₁
s. to Y = R(X + A), Z_Π = P_Π(X + A).

In the sequel, identifiability of (X, A) from the linear measurements {Y, Z_Π} is pursued first, followed by technical conditions based on certain incoherence measures, to guarantee (X̂ = X, Â = A), where X and A are the true low-rank and sparse matrices of interest.

A. Local Identifiability
Let r := rank(X) and s := ‖A‖₀ denote the rank and sparsity level of the true matrices of interest. The first issue to address is identifiability, asserting that there is a unique pair (X, A) fulfilling the data constraints: (d1) Y = R(X + A) and (d2) Z_Π = P_Π(X + A). Apparently, if multiple solutions exist, one cannot hope to find (X, A). Before examining this issue, introduce the subspaces: (s1) N_R := {H : RH = 0_{L×T}} as the nullspace of the linear operator R, and (s2) N_Π := {H ∈ R^{F×T} : supp(H) ⊆ Π^⊥} as the nullspace of the linear operator P_Π(·) [Π^⊥ is the complement of Π]. If there exists a perturbation pair (H₁, H₂) with H₁ + H₂ ∈ N_R ∩ N_Π so that X + H₁ and A + H₂ are still low rank and sparse, one may pick the pair (X + H₁, A + H₂) as another legitimate solution. This section aims at resolving such identifiability issues.

Let UΣV′ denote the singular value decomposition (SVD) of X, and consider the subspaces: (s3) Φ_X := {Z ∈ R^{F×T} : Z = UW₁′ + W₂V′, W₁ ∈ R^{T×r}, W₂ ∈ R^{F×r}} of matrices in either the column or row space of X; (s4) Ω_A := {H ∈ R^{F×T} : supp(H) ⊆ supp(A)} of matrices whose support is contained in that of A. Noteworthy properties of these subspaces are: (i) since Φ_X, Ω_A ⊂ R^{F×T}, it is possible to directly compare elements from them; (ii) X ∈ Φ_X and A ∈ Ω_A; and (iii) if a nonzero Z ∈ Φ_X^⊥ is added to X, then rank(Z + X) > r, and likewise ‖Z + A‖₀ > s for any nonzero Z ∈ Ω_A^⊥.

Suppose temporarily that the subspaces Φ_X and Ω_A are also known. This extra piece of information helps identifiability based on the data (d1) and (d2), since the potentially troublesome solutions

Υ := {(X + H₁, A + H₂) : H₁ + H₂ ∈ N_R ∩ N_Π}   (5)

are restricted to a smaller set. If (X + H₁, A + H₂) ∉ Ῡ, where

Ῡ := {(X + H₁, A + H₂) : H₁ ∈ Φ_X, H₂ ∈ Ω_A}   (6)

that candidate solution is not admissible, since it is known a priori that X ∈ Φ_X and A ∈ Ω_A.
This notion of exploiting additional knowledge to assure uniqueness is known as local identifiability [14]. Global identifiability from (d1) and (d2) is not guaranteed. However, local identifiability will become essential later on to establish the main result. With these preliminaries, the following lemma puts forth the necessary and sufficient conditions for local identifiability.

Lemma 2: Matrices (X, A) satisfy (d1) and (d2) uniquely if and only if (c1) Φ_X ∩ Ω_A = {0}; and, (c2) Υ ∩ Ῡ = {(X, A)}.

Condition (c1) implies that for the solutions in Υ to be admissible, H₁ + H₂ must belong to the subspace Φ_X ⊕ Ω_A. Accordingly, (c2) holds true if

N_R ∩ N_Π ∩ (Φ_X ⊕ Ω_A) = {0}.   (7)

Notice that (c1) appears also in the context of the low-rank-plus-sparse recovery results in [14], [25]. However, (c2) is unique to the setting here. It captures the impact of the overlap between the nullspaces of R and of the operator P_Π(·). Finding simpler sufficient conditions to assure (c1) and (c2) is dealt with next.

B. Incoherence Measures
The overlap between any pair of the subspaces {Φ_X, Ω_A, N_R, N_Π} plays a crucial role in identifiability and exact recovery, as seen e.g., from Lemma 1. To quantify the overlap of two subspaces, e.g., Φ_X and Ω_A, consider the incoherence parameter

µ(Φ_X, Ω_A) := max_{X ∈ Ω_A, ‖X‖_F = 1} ‖P_{Φ_X}(X)‖_F   (8)

which clearly satisfies µ(Φ_X, Ω_A) ∈ [0, 1]. The lower bound is achieved when Φ_X and Ω_A are orthogonal, whereas the upper bound is attained when Φ_X ∩ Ω_A contains a nonzero element. To gain further geometric intuition, µ(Φ_X, Ω_A) represents the cosine of the angle between the subspaces when they have trivial intersection, namely Φ_X ∩ Ω_A = {0} [26]. Small values of µ(Φ_X, Ω_A) indicate sufficient separation between Φ_X and Ω_A, and thus less chance of ambiguity when discerning X from A.

It will be seen later that (c1) requires µ(Φ_X, Ω_A) < 1. In addition, to ensure (c2) one needs the incoherence parameter µ(N_R ∩ N_Π, Φ_X ⊕ Ω_A) < 1. In fact, µ(N_R ∩ N_Π, Φ_X ⊕ Ω_A) captures the ambiguity inherent to the nullspaces of the compression and sampling operators. It depends on all the subspaces (s1)–(s4), and it is desirable to express it in terms of the incoherences of different subspace pairs, namely µ(N_R, Ω_A), µ(N_R, Φ_X), µ(N_Π, Ω_A), and µ(N_Π, Φ_X). This is formalized in the next claim.
Proposition 1: Assume that µ(Ω_A, Φ_X) < 1. If either dim(N_R ∩ N_Π) = 0; or, dim(N_R ∩ N_Π) ≥ 1 and

χ := [µ(N_Π, Φ_X) + µ(N_R, Ω_A) µ(N_Π, Ω_A)] / [1 − µ(Ω_A, Φ_X)] < 1

hold, then Φ_X ∩ Ω_A = {0} and N_R ∩ N_Π ∩ (Φ_X ⊕ Ω_A) = {0}.

Proof: Since µ(Ω_A, Φ_X) < 1 and dim(Φ_X ⊕ Ω_A ⊕ (N_R ∩ N_Π)) = dim(Φ_X) + dim(Ω_A) + dim(N_R ∩ N_Π), [22, Lemma 11] implies that

µ(Φ_X ⊕ Ω_A, N_R ∩ N_Π) ≤ [1 − µ(Φ_X, Ω_A)]⁻¹ × [µ(Φ_X, N_R ∩ N_Π) + µ(Ω_A, N_R ∩ N_Π)].   (9)

The result then follows by bounding µ(Φ_X, N_R ∩ N_Π) ≤ µ(Φ_X, N_R) µ(Φ_X, N_Π) using the fact that N_R ∩ N_Π ⊆ N_R, N_Π [likewise for µ(Ω_A, N_R ∩ N_Π)], and N_R ∩ Φ_X = {0}.

Apparently, small values of µ(N_R, Ω_A) and µ(N_Π, Φ_X) give rise to a small χ. In fact, µ(N_R, Ω_A) measures whether N_R contains sparse elements, and it is tightly related to the incoherence among the sparse column subsets of R. For row-orthonormal compression matrices in particular, where RR′ = I, the incoherence reduces to the restricted isometry constant of R; see e.g., [20]. Moreover, µ(N_Π, Φ_X) measures whether low-rank matrices fall into the nullspace of the subsampling operator P_Π(·), which is tightly linked to the incoherence metrics introduced in the context of matrix completion; see e.g., [27]. It is worth mentioning that a wide class of matrices resulting in small incoherences µ(N_R, Ω_A), µ(N_Π, Φ_X) and µ(Ω_A, Φ_X) are provided in [20], [25], [27], which give rise to a sufficiently small value of χ.
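For small instances, the incoherence (8) can be computed explicitly by vectorizing the projection onto Φ_X restricted to Ω_A; the sketch below (illustrative sizes and support) uses the standard projector P_U Z + Z P_V − P_U Z P_V onto Φ_X.

```python
# Computing mu(Phi_X, Omega_A) of (8) for a toy low-rank X and a hypothetical
# anomaly support.  The canonical matrices e_i e_j' with (i, j) in supp(A)
# form an orthonormal basis of Omega_A, so the maximum in (8) equals the
# largest singular value of the vectorized, restricted projection map.
import numpy as np

rng = np.random.default_rng(2)
F, T, r = 5, 7, 2
X = rng.standard_normal((F, r)) @ rng.standard_normal((r, T))  # rank-r matrix
U, _, Vt = np.linalg.svd(X, full_matrices=False)
PU = U[:, :r] @ U[:, :r].T          # projector onto the column space of X
PV = Vt[:r].T @ Vt[:r]              # projector onto the row space of X

support = [(0, 0), (3, 5)]          # supp(A) of a hypothetical sparse A
cols = []
for (i, j) in support:
    E = np.zeros((F, T)); E[i, j] = 1.0      # basis element of Omega_A
    PhiE = PU @ E + E @ PV - PU @ E @ PV     # projection of E onto Phi_X
    cols.append(PhiE.ravel())
mu = np.linalg.svd(np.column_stack(cols), compute_uv=False)[0]
print(0.0 <= mu <= 1.0)             # True: mu(Phi_X, Omega_A) lies in [0, 1]
```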
C. Exact Recovery via Convex Optimization

Besides µ(Ω_A, Φ_X) and χ, there are other incoherence measures which play an important role in the conditions for exact recovery. These measures are introduced to avoid ambiguity when the (feasible) perturbations H₁ and H₂ do not necessarily belong to the subspaces Φ_X and Ω_A, respectively. Before moving on, it is worth noting that these measures resemble the ones for matrix completion and decomposition problems; see e.g., [25], [27]. For instance, consider a feasible solution {X + a_{i,j} e_i e_j′, A − a_{i,j} e_i e_j′}, where (i, j) ∉ supp(A), and thus a_{i,j} e_i e_j′ ∉ Ω_A. It may happen that a_{i,j} e_i e_j′ ∈ Φ_X and rank(X + a_{i,j} e_i e_j′) = rank(X) − 1, while ‖A − a_{i,j} e_i e_j′‖₀ = ‖A‖₀ + 1, thus challenging identifiability when Φ_X and Ω_A are unknown. Similar complications arise if X has a sparse row space that can be confused with the row space of A. These issues motivate defining

γ(U) := max_i ‖P_U e_i‖₂,  γ(V) := max_i ‖P_V e_i‖₂   (10)

where P_U := UU′ (resp. P_V := VV′) is the projector onto the column (row) space of X. Notice that γ(U), γ(V) ∈ [0, 1]. The maximum of γ(U) (resp. γ(V)) is attained when e_i is in the column (row) space of X for some i. Small values of γ(U) (resp. γ(V)) imply that the column (row) space of X does not contain sparse vectors.

Another identifiability issue arises when X is sparse, in which case each column of X is spanned by a few canonical basis vectors. Consider the parameter

γ(U, V) := ‖UV′‖_∞ = max_{i,j} |e_i′ UV′ e_j|.   (11)

A small value of γ(U, V) indicates that each column of X is spanned by sufficiently many canonical basis vectors. It is worth noting that γ(U, V) can be bounded in terms of γ(U) and γ(V), but it is kept here for the sake of generality.

From (c2) in Lemma 2 it is evident that the dimension of the nullspace N_R ∩ N_Π is critical for identifiability. In essence,
In essence,the lower dim( N R ∩ N Π ) is, the higher is the chance for exactreconstruction. In order to quantify the size of the nullspace,define τ ( N R , N Π ) := max X ∈ N R ∩ N Π k X k =1 k X k ∞ (12)which will appear later in the exact recovery conditions. Allelements are now in place to state the main result. D. Main Result
Theorem 1: Let (X, A) denote the true low-rank and sparse matrix pair of interest, and define X = UΣV′, r := rank(X), and s := ‖A‖₀. Assume that A has at most k nonzero elements per column, and define the incoherence parameters α := µ(Ω_A, Φ_X), β := µ(Ω_A, N_R), ξ := µ(N_Π, Φ_X), ν := µ(N_R, Ω_A ∩ N_Π), η := γ(U) + γ(V), τ := τ(N_R, N_Π), and γ := γ(U, V). Given Y and Z_Π adhering to (d1) and (d2), respectively, with known R and Π, if χ < 1, and

(I) λ_max := (1/k) [1 − α − α²/(1 − α²) − ge/f] / [α²/(1 − α²) + he/f] > λ_min := [γ + qg/f] / [1 − ηαk − kqh/f] ≥ 0;
(II) f := 1 − νβ − [(ξ + αν)/(1 − α²)](ξ + αβ) > 0

hold, where

g := ξ + [α(ξ + αν)/(1 − α²)]α,  h := ν + [α/(1 − α²)](ξ + αν)
q := τ + ηα + ηξ,  e := [α/(1 − α²)](ξ + αβ) + 1 + ν

then for any λ_min ≤ λ ≤ λ_max the convex program (P1) yields (X̂ = X, Â = A).

Satisfaction of the conditions in Theorem 1 hinges upon the incoherence parameters {α, γ, η, ξ, τ}, whose sufficiently small values fulfil (I) and (II). In fact, these parameters are increasing functions of the rank r and the sparsity level s. In particular, {α, γ, η}, which capture the ambiguity between the additive components X and A, are known to be small enough for small values of {r, s, k}; see e.g., [14], [27]. Regarding χ, recall that it is an increasing function of β and ξ, where the parameter ξ takes a small value when NetFlow samples an adequately large subset of OD flows uniformly at random. Moreover, in large-scale networks with distant OD node pairs, and routing paths that are sufficiently "spread-out", the sparse column subsets of R tend to be incoherent, and thus β takes a small value. Likewise, for sufficiently many NetFlow samples and column-incoherent routing matrices, τ takes a small value.

Remark 1 (Satisfiability):
Notice that (I) and (II) in Theorem 1 are expressible in terms of the angles between the subspaces (s1)–(s4). In general, they are NP-hard to verify. Introducing a class of (possibly random) traffic matrices (X, A) and realistic network settings giving rise to a desirable routing matrix R is the subject of our ongoing research. The major roadblock in this direction is deriving tight bounds for the parameter τ, which involves the intersection of a pair of subspaces.

E. ADMM Algorithm
This section introduces an iterative solver for the convex program (P2) based on the alternating direction method of multipliers (ADMM). ADMM is an iterative augmented Lagrangian method especially well suited for parallel processing [28], and has proven successful in tackling the optimization tasks encountered e.g., in statistical learning; see e.g., [29]. While ADMM could be directly applied to (P2), R couples the entries of A and X, leading to computationally demanding nuclear- and ℓ1-norm minimization subtasks per iteration. To overcome this hurdle, a common trick is to introduce the auxiliary (decoupling) variables {B, O}, and formulate the following optimization problem

(P3) min_{A, X, O, B} ‖X‖∗ + λ‖A‖₁
s. to Y = R(O + B), Z_Π = P_Π(O + B)
B = A, O = X

which is equivalent to (P2). To tackle (P3), associate the Lagrange multipliers {M_y, M_z, M_a, M_x} with the constraints, and introduce the quadratically augmented Lagrangian function

L(X, A, B, O; M_y, M_z, M_a, M_x) := ‖X‖∗ + λ‖A‖₁ + ⟨M_y, Y − R(O + B)⟩ + ⟨M_a, B − A⟩ + ⟨M_z, Z_Π − P_Π(O + B)⟩ + ⟨M_x, O − X⟩ + (c/2)‖Y − R(O + B)‖²_F + (c/2)‖Z_Π − P_Π(O + B)‖²_F + (c/2)‖B − A‖²_F + (c/2)‖O − X‖²_F   (13)

where c > 0 is a penalty coefficient. Splitting the primal variables into the two groups {X, B} and {A, O}, the ADMM solver entails an iterative procedure comprising three steps per iteration k = 1, 2, . . .

[S1] Update dual variables:
M_y[k] = M_y[k−1] + c(Y − R(O[k] + B[k]))   (14)
M_z[k] = M_z[k−1] + c(Z_Π − P_Π(O[k] + B[k]))   (15)
M_a[k] = M_a[k−1] + c(B[k] − A[k])   (16)
M_x[k] = M_x[k−1] + c(O[k] − X[k])   (17)

[S2] Update the first group of primal variables:
A[k+1] = arg min_{A ∈ R^{F×T}} { (c/2)‖A − B[k]‖²_F − ⟨M_a[k], A⟩ + λ‖A‖₁ }
O[k+1] = arg min_{O ∈ R^{F×T}} { (c/2)‖O − X[k]‖²_F + (c/2)‖Y − R(O + B[k])‖²_F + (c/2)‖Z_Π − P_Π(O + B[k])‖²_F + ⟨M_x[k] − R′M_y[k] − P_Π(M_z[k]), O⟩ }

[S3] Update the second group of primal variables:
X[k+1] = arg min_{X ∈ R^{F×T}} { (c/2)‖X − O[k+1]‖²_F − ⟨M_x[k], X⟩ + ‖X‖∗ }
B[k+1] = arg min_{B ∈ R^{F×T}} { (c/2)‖A[k+1] − B‖²_F + (c/2)‖Y − R(O[k+1] + B)‖²_F + (c/2)‖Z_Π − P_Π(O[k+1] + B)‖²_F + ⟨M_a[k] − R′M_y[k] − P_Π(M_z[k]), B⟩ }

The resulting iterative solver is tabulated as Algorithm 1. Here, [S_τ(X)]_{i,j} := sgn(x_{i,j}) max{|x_{i,j}| − τ, 0} denotes the soft-thresholding operator; the vectors {y_t, o_t, a_t, b_t, z_t, x_t, m_{z,t}, m_{a,t}, m_{x,t}, m_{y,t}} denote the t-th columns of the corresponding matrices, and the diagonal matrix Π_t ∈ {0, 1}^{F×F} has unity at its (i, i)-th entry if (i, t) ∈ Π, and zero otherwise. Algorithm 1 reveals that the update of the anomaly matrix entails a soft-thresholding operator to promote sparsity, while the nominal traffic is updated via singular-value thresholding to effect low rank. The updates for B and O are also parallelizable across the columns (time slots). Due to the convexity of (P3), Algorithm 1 with two Gauss-Seidel block updates converges to the global optimum of (P2), as stated next.

Proposition 2: [28] For any value of the penalty coefficient c > 0, the iterates {X[k], A[k]} converge to the optimal solution of (P2) as k → ∞.
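The two per-iteration workhorses of [S2]–[S3], soft-thresholding for the ℓ1-norm and singular-value thresholding for the nuclear norm, can be sketched as follows; the input matrix is illustrative.

```python
# Proximal operators used by the ADMM solver: elementwise soft-thresholding
# S_tau (prox of the l1-norm, used for A) and singular-value thresholding
# (prox of the nuclear norm, used for X).
import numpy as np

def soft_threshold(M, tau):
    # [S_tau(M)]_ij = sgn(m_ij) * max(|m_ij| - tau, 0)
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

def svt(M, tau):
    # Shrink the singular values of M by tau, then reassemble.
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

M = np.array([[3.0, -0.5], [0.2, -2.0]])
print(np.allclose(soft_threshold(M, 1.0), [[2.0, 0.0], [0.0, -1.0]]))  # True
```

Both operators have closed forms, which is precisely why the decoupling variables {B, O} in (P3) pay off: each ADMM iteration reduces to these two thresholdings plus per-column linear solves.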
NCORPORATING S PATIOTEMPORAL C ORRELATION I NFORMATION
Being convex, (P1) is appealing, and as Theorem 1 asserts for the noiseless case, it reliably reconstructs the underlying traffic when: (c1) the anomalous traffic is sufficiently "sporadic" across time and flows; (c2) the nominal traffic matrix is sufficiently low rank with non-spiky singular vectors; (c3) NetFlow samples OD flows uniformly; and (c4) the routing paths are sufficiently "spread out." In practical networks however, these conditions may be violated, and as a consequence (P1) may perform poorly. For instance, if a bursty anomaly occurs, (c1) does not hold. A particular OD flow may also be inaccessible to NetFlow sampling, which violates (c3). In the latter case, knowing the cross-correlation of a missing OD flow with the other flows enables accurate interpolation of the misses.

Inherent patterns of the nominal traffic matrix $X$ and the anomalous traffic matrix $A$ can be learned from historical/training data $\{x_t, a_t\}_{t \in \mathcal{H}}$, where $x_t$ and $a_t$ denote the network-wide nominal and anomalous traffic vectors at time $t$. Given the training data $\{x_t, a_t\}_{t \in \mathcal{H}}$, link counts $Y$ obeying (2), partial flow counts $Z_\Pi$ adhering to (4), and with $\{R, \Pi\}$ known, the rest of this paper deals with estimating the matrix pair $(X, A)$.

IEEE/ACM TRANSACTIONS ON NETWORKING (SUBMITTED)

Algorithm 1: ADMM solver for (P2)
input $Y$, $Z_\Pi$, $\Pi$, $R$, $\lambda$, $c$, $\{H_t := (I_F + \Pi_t + R'R)^{-1}\}_{t=1}^{T}$
initialize $M_y[-1] = 0_{L \times T}$, $X[0] = O[0] = A[0] = B[0] = M_z[-1] = M_a[-1] = M_x[-1] = 0_{F \times T}$, and set $k = 0$.
while not converged do
[S1] Update dual variables:
$M_y[k] = M_y[k-1] + c(Y - R(O[k] + B[k]))$
$M_z[k] = M_z[k-1] + c(Z_\Pi - \mathcal{P}_\Pi(O[k] + B[k]))$
$M_a[k] = M_a[k-1] + c(B[k] - A[k])$
$M_x[k] = M_x[k-1] + c(O[k] - X[k])$
[S2] Update first group of primal variables:
$A[k+1] = \mathcal{S}_{\lambda/c}(c^{-1}M_a[k] + B[k])$
Update in parallel ($t = 1, \ldots, T$):
$o_t[k+1] = c^{-1}H_t\big(c\,x_t[k] + c\,\Pi_t z_t + c\,R'y_t - c[\Pi_t + R'R]b_t[k] + R'm_{y,t}[k] + \Pi_t m_{z,t}[k] - m_{x,t}[k]\big)$
[S3] Update second group of primal variables:
$U\Sigma V' = \mathrm{svd}(O[k+1] + c^{-1}M_x[k])$, $X[k+1] = U\,\mathcal{S}_{1/c}(\Sigma)\,V'$
Update in parallel ($t = 1, \ldots, T$):
$b_t[k+1] = c^{-1}H_t\big(c\,a_t[k+1] + c\,\Pi_t z_t + c\,R'y_t - c[\Pi_t + R'R]o_t[k+1] + R'm_{y,t}[k] + \Pi_t m_{z,t}[k] - m_{a,t}[k]\big)$
$k \leftarrow k + 1$
end while
return $(A[k], X[k])$

A. Bilinear Factorization
The first step toward incorporating correlation information is the bilinear characterization of the nuclear norm. Using the singular value decomposition [24], one can always factorize the low-rank component as $X = LQ'$, where $L \in \mathbb{R}^{F \times \rho}$ and $Q \in \mathbb{R}^{T \times \rho}$, for some $\rho \geq \mathrm{rank}(X)$. The nuclear norm can then be redefined as (see e.g., [30])

$\|X\|_* := \min_{X = LQ'} \tfrac{1}{2}\big\{\|L\|_F^2 + \|Q\|_F^2\big\}.$ (18)

In the scalar case, (18) reduces to the identity $|a| = \min_{a = bc} \tfrac{1}{2}(b^2 + c^2)$. The latter implies that the $\ell_1$-norm of $A$ can be alternatively defined as

$\|A\|_1 := \min_{A = B \odot C} \tfrac{1}{2}\big\{\|B\|_F^2 + \|C\|_F^2\big\}$ (19)

where $B, C \in \mathbb{R}^{F \times T}$. For notational convenience, let $U := [Y', Z'_\Pi]'$ and define the corresponding linear operator $\mathcal{P}(X) := [(RX)', \mathcal{P}_\Pi(X)']'$. Leveraging (18) and (19), one is prompted to recast (P1) as

(P4) $\min_{\{L, Q, B, C\}} \tfrac{1}{2}\|U - \mathcal{P}(LQ' + B \odot C)\|_F^2 + \tfrac{\lambda_*}{2}\big\{\|L\|_F^2 + \|Q\|_F^2\big\} + \tfrac{\lambda}{2}\big\{\|B\|_F^2 + \|C\|_F^2\big\}.$

This Frobenius-norm regularization doubles the number of optimization variables for the sparse component $A$ (to $2FT$), but reduces the variable count for the low-rank component $X$ to $\rho(F + T)$. Regarding performance, the bilinear factorization incurs no loss of optimality, as stated in the next lemma.

Lemma 3: If $\hat{X}$ denotes the optimal low-rank solution of (P1) and $\rho \geq \mathrm{rank}(\hat{X})$, then (P4) is equivalent to (P1).

Proof: It readily follows from (18) and (19), along with the commutative property of minimization, which allows minimizing first with respect to (w.r.t.) $\{L, Q\}$ and then w.r.t. $\{B, C\}$.

VI. BAYESIAN TRAFFIC AND ANOMALY ESTIMATES
This section recasts (P4) in a Bayesian framework by adopting the AWGN model $U = \mathcal{P}(X + A) + E$, where $E$ contains independent identically distributed (i.i.d.) entries drawn from $\mathcal{N}(0, \sigma^2)$. As in (18), $X$ is factorized as $LQ'$ with the independent factors $L := [l_1, \ldots, l_\rho]$ and $Q := [q_1, \ldots, q_\rho]$. Matrices $L$ and $Q$ have i.i.d. columns obeying $l_i \sim \mathcal{N}(0, R_L)$ and $q_i \sim \mathcal{N}(0, R_Q)$, respectively, for positive-definite correlation matrices $R_L \in \mathbb{R}^{F \times F}$ and $R_Q \in \mathbb{R}^{T \times T}$. Without loss of generality (w.l.o.g.), to avoid the scalar ambiguity in $X = LQ'$, set $\mathrm{tr}(R_L) = \mathrm{tr}(R_Q)$. Likewise, the anomaly matrix is factored as $A = B \odot C$, with the independent factors $b := \mathrm{vec}(B) \in \mathbb{R}^{FT}$ and $c := \mathrm{vec}(C) \in \mathbb{R}^{FT}$ drawn from $b \sim \mathcal{N}(0, R_B)$ and $c \sim \mathcal{N}(0, R_C)$, for positive-definite correlation matrices $R_B, R_C \in \mathbb{R}^{FT \times FT}$, respectively.

For the considered AWGN model with these priors, the maximum a posteriori (MAP) estimator of $(X, A)$ is given by the solution of

(P5) $\min_{\{L, Q, B, C\}} \tfrac{1}{2}\|U - \mathcal{P}(LQ' + B \odot C)\|_F^2 + \tfrac{\lambda}{2}\big[b'R_B^{-1}b + c'R_C^{-1}c\big] + \tfrac{\lambda_*}{2}\big[\mathrm{tr}(L'R_L^{-1}L) + \mathrm{tr}(Q'R_Q^{-1}Q)\big]$

for $\lambda = \lambda_* = \sigma^2$, where different weights $\lambda$ and $\lambda_*$ are allowed here for generality. Observe that (P5) specializes to (P4) upon choosing $R_L = I_F$, $R_Q = I_T$, and $R_B = R_C = I_{FT}$. Lemma 3 then implies that the convex program (P1) yields the MAP-optimal estimator for the considered statistical model, so long as the factors contain i.i.d. Gaussian entries. With respect to the statistical model for the low-rank and sparse components, as will become clear later on, $R_L$ ($R_Q$) captures the correlation among columns (rows) of $X$; likewise, $R_B$ and $R_C$ capture the correlation among entries of $A$. Albeit clear in this section's statistical formulation, the adopted model $X = LQ'$ promotes low rank since $\mathrm{rank}(X) \leq \rho$, but it is not obvious whether $A = B \odot C$ effects sparsity.
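A quick numerical check is instructive here: the entrywise product of two independent standard Gaussians concentrates far more mass near zero, and has heavier tails, than a Gaussian of the same variance. The following Monte Carlo sketch (ours, not from the paper; the sample size and seed are arbitrary) illustrates this via the excess kurtosis:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500_000

# a = b*c with b, c i.i.d. N(0,1): the prior induced entrywise by A = B (.) C
a = rng.standard_normal(n) * rng.standard_normal(n)

# Excess kurtosis: 0 for a Gaussian, 3 for a Laplacian; the Gaussian product
# is even more peaked (theoretical value 6), i.e., it favors near-zero entries.
excess_kurtosis = np.mean(a**4) / np.mean(a**2) ** 2 - 3.0

# Fraction of samples close to zero; compare with about 8% for N(0,1).
share_near_zero = np.mean(np.abs(a) < 0.1)
```

The large excess kurtosis and the inflated mass near zero are numerical hints of the sparsity-promoting behavior established analytically next.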
The sparsity-promoting effect of $A = B \odot C$ relies on the fact that the product of two independent Gaussian random variables is heavy tailed. To recognize this, consider the independent scalar random variables $b \sim \mathcal{N}(0,1)$ and $c \sim \mathcal{N}(0,1)$. The product random variable $a = bc$ can be expressed as $bc = \tfrac{1}{4}(b+c)^2 - \tfrac{1}{4}(b-c)^2$, where $S_1 := (b+c)^2$ and $S_2 := (b-c)^2$ are central $\chi^2$-distributed random variables. Since $\mathbb{E}[(b+c)(b-c)] = \mathbb{E}[b^2 - c^2] = 0$, the random variables $S_1$ and $S_2$ are independent, and consequently the characteristic function of $a$ admits the simple form $\Phi_a(\omega) = \Phi_{S_1/4}(\omega)\,\Phi_{-S_2/4}(\omega) = 1/\sqrt{1+\omega^2}$. Applying the inverse Fourier transform to $\Phi_a(\omega)$ yields the probability density function $p_a(x) = (1/\pi)\,K_0(|x|)$, where $K_0(x) := \int_0^\infty \cos(\omega x)/\sqrt{1+\omega^2}\,d\omega$ denotes the modified Bessel function of the second kind, which is tightly approximated by $\sqrt{\pi/(2x)}\,e^{-x}$ for $x \gg 0$ [31, p. 20]. One can then readily deduce that $p_a(x) \approx (1/\pi)\sqrt{\pi/(2|x|)}\,e^{-|x|}$ behaves similarly to the Laplacian distribution, which is well known to promote sparsity. In contrast with the Laplacian distribution however, the product of Gaussian random variables incurs a slightly lighter tail, as depicted in Fig. 2. It is worth commenting that the correlated multivariate Laplacian distribution is an alternative prior to postulate for the sparse component; however, its complicated form [32] renders the optimization for the MAP estimator intractable.

Fig. 2. Sparsity-promoting priors with zero mean and unit variance: the product-of-Gaussians density versus the Laplacian (double-exponential) density.

Remark 2 (nonzero mean):
In general, one can allow nonzero means for the factors in the adopted statistical model, replacing correlations with covariances accordingly. This can be useful e.g., to estimate the nominal traffic, which is inherently positive valued. The means are assumed zero here for simplicity.
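The statistical model of this section is straightforward to simulate. A minimal sketch (the dimensions, seed, and AR(1)-type correlation matrices are illustrative choices of ours) draws the correlated factors and forms one realization of the nominal and anomalous traffic:

```python
import numpy as np

rng = np.random.default_rng(1)
F, T, rho = 6, 8, 2

# Positive-definite correlation matrices (illustrative AR(1)-type choices)
RL = 0.5 ** np.abs(np.subtract.outer(np.arange(F), np.arange(F)))
RQ = 0.9 ** np.abs(np.subtract.outer(np.arange(T), np.arange(T)))

# i.i.d. columns l_i ~ N(0, RL) and q_i ~ N(0, RQ) via Cholesky coloring
L = np.linalg.cholesky(RL) @ rng.standard_normal((F, rho))
Q = np.linalg.cholesky(RQ) @ rng.standard_normal((T, rho))
X = L @ Q.T  # nominal traffic: rank(X) <= rho by construction

# Anomalies: entrywise product of two Gaussian matrices (white factors here)
A = rng.standard_normal((F, T)) * rng.standard_normal((F, T))
```

Such a synthetic draw is handy for sanity-checking estimators of the correlation matrices discussed next.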
A. Learning the correlation matrices
Implementing (P5) first requires obtaining the correlation matrices $\{R_L, R_Q, R_B, R_C\}$ from the second-order statistics of $(X, A)$, or from their estimates based on training data. Given the second-order statistics of the unknown nominal-traffic matrix $X$, the matrices $\{R_L, R_Q\}$ can be readily found, as explained in the next lemma. The proof follows along the lines of [33], and is hence omitted for brevity.

Lemma 4:
Under the Gaussian bilinear model for $X$, and with $\mathrm{tr}(R_L) = \mathrm{tr}(R_Q)$, it holds that

$R_Q = \rho^{-1/2}\,\mathbb{E}[X'X]\,(\mathbb{E}[\|X\|_F^2])^{-1/2}, \qquad R_L = \rho^{-1/2}\,\mathbb{E}[XX']\,(\mathbb{E}[\|X\|_F^2])^{-1/2}.$

It is evident that $R_Q$ captures the temporal correlation of the network traffic, while $R_L$ captures the spatial correlation across OD flows (rows of $X$).

For real data where the distribution of the unknowns is unavailable, $\{R_L, R_Q\}$ are typically estimated from training data, e.g., past estimates of the nominal and anomalous traffic. For instance, consider forming $\{R_L, R_Q\}$ estimates as input to (P5) for estimating the traffic on day $K+1$ (corresponding to a time horizon $\mathcal{T}$ with $T$ time instants), from the training data $\{x_t\}_{t=1}^{KT}$ collected during the past $K$ days. Clearly, reliable correlation estimates cannot be formed for general nonstationary processes. Empirical analysis of Internet traffic suggests adopting the following assumptions [3]: (a1) the process $\{x_t\}$ is cyclostationary with a day-long period, due to large-scale periodic trends in the nominal traffic; and (a2) OD flows are uncorrelated, as their origins are mutually unrelated. One can also take weekly or monthly periodicity of traffic usage into account to further improve the accuracy of the correlation estimates.

Let $r_t$ denote the remainder of dividing $t$ by $T$. For time slots $t_1, t_2 \in \mathcal{T}$, (a1) asserts that the vector subprocesses $\{x_{kT+r_{t_1}}\}_{k=0}^{K-1}$ and $\{x_{kT+r_{t_2}}\}_{k=0}^{K-1}$ are jointly stationary, and thus one can consistently estimate $\mathbb{E}[x'_{r_{t_1}} x_{r_{t_2}}]$, to obtain $R_Q$ via the sample correlation $\frac{1}{K}\sum_{k=0}^{K-1} x'_{kT+r_{t_1}} x_{kT+r_{t_2}}$ [34]. Likewise, the normalization term $\mathbb{E}[\|X\|_F^2]$ is estimated relying on (a1) as $\frac{1}{K}\sum_{t=1}^{T}\sum_{k=0}^{K-1} \|x_{kT+t}\|_2^2$. Estimating $R_L$, on the other hand, relies on (a2).
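Before turning to $R_L$, the $R_Q$ estimator just described can be sketched compactly: average inner products of traffic snapshots across the $K$ daily cycles, then normalize. The function name and normalization details below are ours, under the cyclostationarity assumption (a1):

```python
import numpy as np

def estimate_RQ(X_hist, T, rho):
    # X_hist: F x (K*T) matrix whose columns are training snapshots x_t,
    # spanning K whole days of T time slots each.
    F, KT = X_hist.shape
    K = KT // T
    days = X_hist[:, :K * T].reshape(F, K, T)  # split into K day-long blocks

    # [T x T] sample correlation: (1/K) sum_k <x_{kT+t1}, x_{kT+t2}>
    EXX = np.einsum('fkt,fks->ts', days, days) / K

    # Sample estimate of E[||X||_F^2], used for the Lemma 4 normalization
    normF2 = np.sum(days ** 2) / K
    return EXX / np.sqrt(rho * normF2)
```

By construction the output is symmetric and positive semidefinite, as a correlation matrix should be.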
Let $\xi'_f \in \mathbb{R}^{1 \times T}$ denote the time series of traffic associated with OD flow $f$, namely the $f$-th row of $X$. It then follows from (a2) that $\mathbb{E}[\xi'_{f_1}\xi_{f_2}] = (\mathbb{E}[\xi_{f_1}])'(\mathbb{E}[\xi_{f_2}])$ for $f_1 \neq f_2 \in \mathcal{F}$, where due to (a1), $\mathbb{E}[\xi_{f,t}]$ ($\xi_{f,t}$ signifies the $t$-th entry of $\xi_f$) is estimated via the sample mean $\frac{1}{K}\sum_{k=0}^{K-1} x_{f,kT+r_t}$. Moreover, for $f_1 = f_2 = f$, the estimate of $\mathbb{E}[\xi'_f \xi_f]$ is $\frac{1}{K}\sum_{k=0}^{K-1}\sum_{t=1}^{T} x^2_{f,kT+t}$.

Given the second-order statistics of $A$, the correlation matrices $R_B$ and $R_C$ are obtained next.

Lemma 5:
Under the Gaussian bilinear model for $a = \mathrm{vec}(A')$, it holds that $\mathbb{E}[aa'] = R_B \odot R_C$.

In order to avoid the scalar ambiguity present in $R_B$ and $R_C$, assume equal-magnitude entries $|[R_B]_{i,j}| = |[R_C]_{i,j}| = |[\mathbb{E}[aa']]_{i,j}|^{1/2}$, $\forall (i,j)$. Clearly, for a diagonal correlation matrix $\mathbb{E}[aa']$, the factors are uniquely determined as $[R_B]_{i,i} = [R_C]_{i,i} = [\mathbb{E}[aa']]_{i,i}^{1/2}$, $\forall i$. However, when nonzero off-diagonals are present, there may exist a sign ambiguity, and the signs should be assigned appropriately to guarantee that $R_B$ and $R_C$ are positive definite.

The correlation matrices $\{R_B, R_C\}$ required to run (P5) over the time horizon $\mathcal{T}$ ($|\mathcal{T}| = T$) are estimated from training data $\{a_t\}_{t=1}^{KT}$ collected e.g., over the past $K$ days. Due to the diverse nature of anomalies, developing a universal methodology to learn $R_B$ and $R_C$ is an ambitious objective. Depending on the nature of the anomalies, however, learning is possible under certain assumptions. One reasonable assumption is that anomalies of different flows are uncorrelated, while for each OD flow the anomalous traffic is stationary and possibly correlated over time. This model is appropriate e.g., when different flows are subject to bursty anomalies arising from unrelated external sources.

For the stationary anomaly process of flow $f$, namely $\{a_{f,t}\}_t$, let $R_a^{(f)}(\tau) := \mathbb{E}[a_{f,t-\tau}a_{f,t}]$ denote the time-invariant autocorrelation. Let also $\alpha'_f$ denote the $f$-th row of $A$, and introduce the correlation matrix $R_a^{(f)} := \mathbb{E}[\alpha_f \alpha'_f] \in \mathbb{R}^{T \times T}$, which is Toeplitz with entries $[R_a^{(f)}]_{i,i+\tau} = R_a^{(f)}(\tau)$, $i \in [T]$, $\tau = 0, \ldots, T-1$. Accordingly, $\mathbb{E}[aa']$ is a block-diagonal matrix with blocks $R_a^{(f)}$, and subsequently Lemma 5 implies that $R_B$ and $R_C$ are block diagonal with Toeplitz blocks $R_b^{(f)}$ and $R_c^{(f)}$, respectively.
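In code, one concrete way to carry out this Toeplitz factorization is to split each autocorrelation entry's magnitude evenly between the two factors and assign its sign to one of them; a sketch (function name is ours):

```python
import numpy as np

def toeplitz_factors(r):
    # r[tau] = estimated autocorrelation R_a^{(f)}(tau) of one flow, tau = 0..T-1.
    # Split the magnitude evenly and give the sign to one factor, so that the
    # Hadamard product R_b (.) R_c recovers the Toeplitz block built from r.
    rr = np.asarray(r, dtype=float)
    T = len(rr)
    tau = np.abs(np.subtract.outer(np.arange(T), np.arange(T)))  # Toeplitz pattern
    mag = np.sqrt(np.abs(rr))
    Rb = mag[tau]
    Rc = mag[tau] * np.sign(rr)[tau]
    return Rb, Rc
```

By construction, the entrywise product of the two outputs reproduces the Toeplitz autocorrelation block exactly.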
Under the equal-magnitude assumption on the entries of $R_B$ and $R_C$, the entries of $R_b^{(f)}$ and $R_c^{(f)}$ are readily obtained as

$[R_b^{(f)}]_{i,i+\tau} = |R_a^{(f)}(\tau)|^{1/2}, \qquad [R_c^{(f)}]_{i,i+\tau} = |R_a^{(f)}(\tau)|^{1/2}\,\mathrm{sgn}\big(R_a^{(f)}(\tau)\big).$ (20)

Notice that if $|R_a^{(f)}(\tau)|$ decays sufficiently fast as $\tau$ grows, $R_B$ and $R_C$ become positive definite [35]. Finally, thanks to the stationarity of $\{a_{f,t}\}_t$, $R_a^{(f)}(\tau)$ can be consistently estimated using $\frac{1}{KT-\tau}\sum_{t=\tau+1}^{KT} a_{f,t-\tau}a_{f,t}$. It is worth noting that the considered model renders the sparsity regularizer in (P5) separable across the rows of $A$, which in turn induces row-wise sparsity.

VII. ALTERNATING MAJORIZATION-MINIMIZATION ALGORITHM
In order to efficiently solve (P5), an alternating minimization (AM) scheme is developed here, cycling among the four matrix variables $\{L, Q, B, C\}$. The algorithm entails iterations updating one matrix variable at a time, while the rest are kept fixed at their most recent values. In particular, iteration $k$ comprises the ordered updates $L[k] \to Q[k] \to B[k] \to C[k]$. For instance, $L[k]$ is updated given the latest iterates $\{Q[k-1], B[k-1], C[k-1]\}$ as $L[k] = \arg\min_L g_L^{(k)}(L)$, where

$g_L^{(k)}(L) := \tfrac{1}{2}\|U - \mathcal{P}(LQ'[k-1] + B[k-1] \odot C[k-1])\|_F^2 + \tfrac{\lambda_*}{2}\mathrm{tr}\big(L'R_L^{-1}L\big).$ (21)

Likewise, $Q[k]$, $B[k]$, and $C[k]$ are updated by respectively minimizing $g_Q^{(k)}$, $g_B^{(k)}$, and $g_C^{(k)}$, defined similarly to $g_L^{(k)}$ based on the latest updates of the corresponding variables. The functions $\{g_L^{(k)}, g_Q^{(k)}, g_B^{(k)}, g_C^{(k)}\}$ are strongly convex quadratics, thanks to the positive-definite correlation matrices in the regularizers, and thus their minimizers admit closed form after inverting certain, possibly large, matrices. For instance, updating $L[k]$ requires inverting an $F\rho \times F\rho$ matrix. This may not be affordable, since in practice the number of flows $F$ is typically $O(N^2)$, which can be very large. To cope with this curse of dimensionality, instead of $\{g_L^{(k)}, g_Q^{(k)}, g_B^{(k)}, g_C^{(k)}\}$, judicious surrogates $\{\tilde{g}_L^{(k)}, \tilde{g}_Q^{(k)}, \tilde{g}_B^{(k)}, \tilde{g}_C^{(k)}\}$, chosen based on a second-order Taylor expansion around the previous updates, are minimized. As will become clear later, adopting these surrogates avoids inversion and parallelizes the computations. The surrogate for $g_L^{(k)}$ around $L[k-1]$ is

$\tilde{g}_L^{(k)}(L) := g_L^{(k)}(L[k-1]) + \mathrm{tr}\big((L - L[k-1])'\nabla g_L^{(k)}(L[k-1])\big) + \tfrac{\mu_L[k]}{2}\|L - L[k-1]\|_F^2$ (22)

for some $\mu_L[k] \geq \sigma_{\max}\big[\nabla^2 g_L^{(k)}(L[k-1])\big]$ (and likewise for $\tilde{g}_Q^{(k)}$, $\tilde{g}_B^{(k)}$, and $\tilde{g}_C^{(k)}$). It is useful to recognize that each surrogate, say $\tilde{g}_L^{(k)}$, has the following properties: (i) it majorizes $g_L^{(k)}$, namely $g_L^{(k)}(L) \leq \tilde{g}_L^{(k)}(L)$, $\forall L$; and it is locally tight,

Algorithm 2: Alternating majorization-minimization solver for (P5)
input $Y$, $Z_\Pi$, $\Pi$, $R$, $R_L$, $R_Q$, $R_B$, $R_C$, $\lambda_*$, $\lambda$, and $\{\mu_L[k], \mu_Q[k], \mu_B[k], \mu_C[k]\}_{k=1}^{\infty}$.
initialize $L[0]$, $Q[0]$, $B[0]$, $C[0]$ at random, and set $k = 0$.
while not converged do
[S1] Update $L$:
$F[k] = R'\Phi_y(L[k], Q[k], B[k], C[k]) + \Phi_z(L[k], Q[k], B[k], C[k])$
$L[k+1] = L[k] - \frac{1}{\mu_L[k]}\big(F[k]Q[k] + \lambda_* R_L^{-1}L[k]\big)$
[S2] Update $Q$:
$G[k] = \Phi'_y(L[k+1], Q[k], B[k], C[k])\,R + \Phi'_z(L[k+1], Q[k], B[k], C[k])$
$Q[k+1] = Q[k] - \frac{1}{\mu_Q[k]}\big[G[k]L[k+1] + \lambda_* R_Q^{-1}Q[k]\big]$
[S3] Update $B$:
$H[k] = R'\Phi_y(L[k+1], Q[k+1], B[k], C[k]) + \Phi_z(L[k+1], Q[k+1], B[k], C[k])$
$B[k+1] = B[k] - \frac{1}{\mu_B[k]}\big[C[k] \odot H[k] + \lambda\,\mathrm{unvec}\big(R_B^{-1}\mathrm{vec}(B[k])\big)\big]$
[S4] Update $C$:
$E[k] = R'\Phi_y(L[k+1], Q[k+1], B[k+1], C[k]) + \Phi_z(L[k+1], Q[k+1], B[k+1], C[k])$
$C[k+1] = C[k] - \frac{1}{\mu_C[k]}\big[B[k+1] \odot E[k] + \lambda\,\mathrm{unvec}\big(R_C^{-1}\mathrm{vec}(C[k])\big)\big]$
$k \leftarrow k + 1$
end while
return $(A[k] = B[k] \odot C[k],\ X[k] = L[k]Q'[k])$

which means that (ii) $\tilde{g}_L^{(k)}(L[k-1]) = g_L^{(k)}(L[k-1])$; and (iii) $\nabla \tilde{g}_L^{(k)}(L[k-1]) = \nabla g_L^{(k)}(L[k-1])$.

The sought approximation leads to an iterative procedure, where iteration $k$ entails the ordered updates of $\{L[k], Q[k], B[k], C[k]\}$ by minimizing $\tilde{g}_L^{(k)}$, $\tilde{g}_Q^{(k)}$, $\tilde{g}_B^{(k)}$, $\tilde{g}_C^{(k)}$, respectively; e.g., the update for $L[k]$ is

$L[k] = \arg\min_{L \in \mathbb{R}^{F \times \rho}} \tilde{g}_L^{(k)}(L) = L[k-1] - (\mu_L[k])^{-1}\nabla g_L^{(k)}(L[k-1])$

which is nothing but a single gradient-descent step on $g_L^{(k)}$. Upon defining the residual matrices $\Phi_y(L, Q, B, C) := R(LQ' + B \odot C) - Y$ and $\Phi_z(L, Q, B, C) := \mathcal{P}_\Pi(LQ' + B \odot C) - Z_\Pi$, the overall scheme is listed as Algorithm 2. All in all, Algorithm 2 amounts to an iterative block-coordinate-descent scheme with four block updates per iteration, each minimizing a tight surrogate of (P5). Since each subproblem is smooth and strongly convex, convergence follows from [36], as stated next.
Proposition 3: [36] Upon choosing $\{c'_L \geq \mu_L[k] \geq \sigma_{\max}[\nabla^2 g_L^{(k)}(L[k-1])]\}_{k=1}^{\infty}$ for some $c'_L > 0$ (and likewise for $\mu_Q[k]$, $\mu_B[k]$, and $\mu_C[k]$), the iterates $\{L[k], Q[k], B[k], C[k]\}$ generated by Algorithm 2 converge to a stationary point of (P5).

Remark 3 (Fast algorithms):
In order to speed up the gradient-descent iterations per block of Algorithm 2, Nesterov-type acceleration techniques along the lines of those introduced e.g., in [37] can be deployed, improving the $O(1/k)$ convergence rate of standard gradient descent to $O(1/k^2)$.
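The per-block iterations of Algorithm 2 have the generic form $x \leftarrow x - \nabla g(x)/\mu$, and the acceleration of Remark 3 wraps them with momentum. A generic FISTA-style sketch (ours; not the paper's exact scheme, and applicable to any smooth block cost with gradient `grad` and majorizer constant `mu`):

```python
import numpy as np

def nesterov_descent(grad, x0, mu, iters):
    # Gradient descent with Nesterov momentum: extrapolate first, then take
    # the gradient step x <- y - grad(y)/mu. On smooth convex costs this
    # improves the O(1/k) rate of plain gradient descent to O(1/k^2).
    x_prev, y, t = x0, x0, 1.0
    for _ in range(iters):
        x = y - grad(y) / mu
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = x + ((t - 1.0) / t_next) * (x - x_prev)  # momentum extrapolation
        x_prev, t = x, t_next
    return x_prev
```

For the L-block, for instance, one would plug in the gradient $F[k]Q[k] + \lambda_* R_L^{-1}L[k]$ from Algorithm 2 with $\mu = \mu_L[k]$.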
VIII. PRACTICAL CONSIDERATIONS
Before the proposed algorithms can be deemed relevant to large-scale networks, additional practical issues must be addressed. These relate to the fact that network data are typically decentralized, streaming, and subject to outliers as well as misses, and that the routing matrix may be unknown or may change dynamically over time. This section sheds light on ways to cope with these practical challenges.
A. Inconsistent partial measurements
Certain network links may not be easily accessible for collecting measurements, or their measurements may be lost in transit due to e.g., packet drops. Let $\Pi_y$ index the available link measurements during the time horizon $\mathcal{T}$. In addition, certain link or flow counts may be inconsistent with the models adopted in (2) and (4). To account for the possible presence of outliers, introduce the matrices $O_y \in \mathbb{R}^{L \times T}$ and $O_z \in \mathbb{R}^{F \times T}$, which are nonzero at the positions associated with outlying measurements, and zero elsewhere. The link-count model (2) is then modified to $Y_{\Pi_y} = \mathcal{P}_{\Pi_y}(R(X + A) + O_y + V)$, and the flow counts to $Z_\Pi = \mathcal{P}_\Pi(X + A + O_z + W)$. Since outliers typically constitute a small fraction of the measurements, $\{O_y, O_z\}$ are sparse. The optimization task (P1) can then be modified to account for misses and outliers as

(P6) $(\hat{X}, \hat{A}) = \arg\min_{\{X, A, O_y, O_z\}} \tfrac{1}{2}\|\mathcal{P}_{\Pi_y}(Y - R(X + A) - O_y)\|_F^2 + \tfrac{1}{2}\|\mathcal{P}_\Pi(Z - X - A - O_z)\|_F^2 + \lambda_*\|X\|_* + \lambda\|A\|_1 + \lambda_y\|O_y\|_1 + \lambda_z\|O_z\|_1$

where $\lambda_y$ and $\lambda_z$ control the density of link- and flow-level outliers, respectively. Again, ADMM-type algorithms can be employed to solve (P6).

Routing information may also be unavailable in certain applications due to e.g., privacy concerns. In this case, each network link can potentially carry an unknown fraction of every OD flow. Let $\mathcal{L}_{\mathrm{in}}(n)$ and $\mathcal{L}_{\mathrm{out}}(n)$ denote the sets of incoming and outgoing links of node $n \in \mathcal{N}$. The routing variables must then respect the flow-conservation constraints, that is, $R \in \mathcal{R} := \{R \in [0,1]^{L \times F} : \sum_{\ell \in \mathcal{L}_{\mathrm{in}}(n)} r_{\ell,f} = \sum_{\ell \in \mathcal{L}_{\mathrm{out}}(n)} r_{\ell,f},\ \forall f \in \mathcal{F},\ n \in \mathcal{N}\}$. Taking the unknown routing variables into account, the traffic-estimation task is formulated as

(P7) $(\hat{X}, \hat{A}) = \arg\min_{\{X, A, R \in \mathcal{R}\}} \tfrac{1}{2}\|Y - R(X + A)\|_F^2 + \tfrac{1}{2}\|\mathcal{P}_\Pi(Z - X - A)\|_F^2 + \lambda_*\|X\|_* + \lambda\|A\|_1$

which is nonconvex due to the bilinear terms in the LS cost.

B. Real-time operation
Monitoring large-scale IP networks necessitates collecting massive amounts of data, which far outweigh the ability of modern computers to store and analyze them in real time. In addition, nonstationarities due to routing changes and missing data further challenge traffic and anomaly estimation. In dynamic networks, routing tables are constantly readjusted to effect traffic load balancing and to avoid congestion caused by e.g., traffic anomalies or network infrastructure failures. On top of these considerations, in practice the measurements are acquired sequentially in time, which motivates updating previously obtained estimates rather than recomputing new ones from scratch each time a new datum becomes available.

To account for routing changes, let $R_t \in \mathbb{R}^{L \times F}$ denote the routing matrix at time $t$. The observed link counts at time $t$ then adhere to $y_t = R_t(x_t + a_t) + v_t$, $t = 1, 2, \ldots$, where $y_t \in \mathbb{R}^L$, while the partial flow counts obey $z_{\Pi_t} = \mathcal{P}_{\Pi_t}(x_t + a_t + w_t)$, $t = 1, 2, \ldots$, where $z_{\Pi_t} \in \mathbb{R}^F$ and $\Pi_t$ indexes the OD flows measured at time $t$. In order to estimate the nominal and anomalous traffic components $(x_t, a_t)$ at time $t$ in real time, given only the past observations $\{y_\tau, z_{\Pi_\tau}\}_{\tau=1}^{t}$, the framework developed in our companion paper [16] can be adopted. Building on the premise that the traffic traces $\{x_t\}_{t=1}^{\infty}$ lie in a low-dimensional linear subspace, say $\mathcal{L}$, one can postulate $x_t = Lq_t$ for $L \in \mathbb{R}^{F \times \rho}$ with $\rho \ll F$, where $L$ spans the subspace $\mathcal{L}$. Pursuing the ideas in [16], the nuclear-norm characterization in (18), which enjoys separability across time, can be applied to formulate exponentially weighted LS estimators.
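With the subspace basis $L$ held fixed, one step of such a real-time scheme reduces to a small $\rho$-dimensional least-squares fit per time slot. A minimal anomaly-free sketch (ours; function name is hypothetical):

```python
import numpy as np

def track_nominal(y_t, R_t, L):
    # Estimate x_t = L q_t from the current link counts y_t = R_t x_t (+ noise),
    # with the subspace matrix L held fixed; anomalies ignored for brevity.
    # Only a rho-dimensional least-squares problem is solved per time slot.
    q_t, *_ = np.linalg.lstsq(R_t @ L, y_t, rcond=None)
    return L @ q_t
```

The key computational point is that the per-slot cost scales with $\rho \ll F$, which is what makes real-time operation viable.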
The corresponding optimization task can then be solved via alternating-minimization algorithms [16]. It is worth commenting that the companion work [16] aims primarily at identifying the anomalies $a_t$ from link counts, which requires slow variations of the routing matrix to ensure that $\{R_t x_t\}_{t=1}^{\infty}$ lie in a low-dimensional subspace. In contrast, the tomography task considered in the present paper imposes no restriction on the routing matrix. Indeed, routing variability aids estimation of the nominal traffic $x_t$. More precisely, suppose that the matrices $\{R_t\}$ are sufficiently distinct so that the intersection of their nullspaces $\bigcap_t \mathcal{N}_{R_t}$ has small dimension. It is then less likely to find an alternative feasible solution $\tilde{X} := X + H$, with $H := [h_1, \ldots, h_T]$ and $h_t \in \mathcal{N}_{R_t}$, such that $\tilde{X} \in \Phi_X$ (cf. Section IV); see also Lemma 1. Further analysis of this intriguing phenomenon goes beyond the scope of the present paper, and will be pursued in future research.

C. Decentralized implementation
Algorithms 1 and 2 require each network node (router) $n \in \mathcal{N}$ to continuously communicate the local measurements of its incident links, as well as the OD-flow counts originating at node $n$, to a central monitoring station. While this is the prevailing operational paradigm in current network technologies, the architecture has limitations. Collecting all these data from the routers may incur excessive protocol overhead, especially for large-scale networks with high acquisition rates. In addition, when raw measurements are exchanged, missing data due to communication errors are inevitable. Performing the optimization in a centralized fashion also raises robustness concerns, since the central monitoring station represents an isolated point of failure.
The aforementioned reasons motivate devising fully distributed iterative algorithms for large-scale networks, which allocate the network tomography functionality to the routers. In a nutshell, per iteration, nodes carry out simple computational tasks locally, relying on their own measurements. Local estimates are then refined after exchanging messages only with directly connected neighbors, which facilitates percolation of information across the entire network. The ultimate goal is for the network nodes to reach consensus on the global network-traffic-state map $(\hat{X}, \hat{A})$, which remains close to the one obtained by the centralized counterpart with all network data available at once. Building on the separable characterization of the nuclear norm in (18), and adopting ADMM as the basic tool for distributed optimization, a generic framework for decentralized sparsity-regularized rank minimization was put forth in our companion paper [15]. In the context of network anomaly detection, the results there are encouraging, and the proposed ideas can be applied to solve (P1) in a distributed fashion as well.

IX. PERFORMANCE EVALUATION
Performance of the novel schemes is assessed in this section via computer simulations with both synthetic and real network data, as described below.
Synthetic network data.
The network topology is generated according to a random geometric graph model, where nodes are placed uniformly at random in the unit square, and two nodes are connected by an edge if their distance is less than a prescribed threshold $d_c$. To form the routing matrix, each OD pair is assigned $K$ nonoverlapping paths, each determined by the minimum hop-count algorithm. After finding the routes, links carrying no traffic are discarded. Clearly, the number of links varies with $d_c$. The underlying traffic matrix $X$ follows the bilinear model $X = LQ'$, with the factors $L \in \mathbb{R}^{F \times \rho}$ and $Q \in \mathbb{R}^{T \times \rho}$ having i.i.d. Gaussian entries drawn from $\mathcal{N}(0, 1/F)$ and $\mathcal{N}(0, 1/T)$, respectively. Entries of the anomaly matrix $A$ are randomly drawn from the set $\{-1, 0, 1\}$ with probabilities (w.p.) $\Pr(a_{f,t} = -1) = \Pr(a_{f,t} = 1) = p/2$ and $\Pr(a_{f,t} = 0) = 1 - p$. The link loads are then formed as $Y = R(X + A)$. A subset of OD flows is also sampled uniformly at random to form the partial OD flow-level measurements $Z_\Pi = \Pi \odot (X + A)$, where each entry of $\Pi \in \{0,1\}^{F \times T}$ is i.i.d. Bernoulli distributed, taking value one w.p. $\pi$ and zero w.p. $1 - \pi$.

Real network data. Real data including OD-flow traffic levels are collected from the operation of the Internet-2 network (the Internet backbone network across the USA) [17], shown in Fig. 3(a). Internet-2 comprises $N = 11$ nodes, $L = 41$ links, and $F = 121$ OD flows. Flow traffic levels are recorded every five-minute interval over a three-week operational period during December 8-28, 2003 [17], [38]. The collected flow levels are the aggregate of clean and anomalous traffic components, that is, the sum $X + A$ of the unknown "ground-truth" low-rank and sparse matrices. The "ground-truth" components are discerned from their aggregate by applying the robust PCP algorithms developed e.g., in [25]. The recovered $X$ exhibits three dominant singular values, confirming the low-rank property of the nominal traffic matrix. Also, after retaining only the significant spikes with magnitude exceeding the threshold $\|Y\|_F/LT$, the formed anomaly matrix $A$ is sparse. The link loads in $Y$ are obtained by multiplying the aggregate traffic with the Internet-2 routing matrix. Even though $Y$ is "constructed" here from flow measurements, in practice link loads are acquired from SNMP traces [39]. Moreover, the aggregate flow-traffic matrix $X + A$ is sampled uniformly at random with probability $\pi$ to form $Z_\Pi$; in practice, these samples are acquired via the NetFlow protocol [7].

Fig. 3. Network topology graphs. (a) Internet-2. (b) Random synthetic network with $N = 30$.

A. Exact recovery validation
To demonstrate the merits of (P2) in accurately recovering the true pair $(X, A)$, it is solved for a wide range of rank $r$ and (average) sparsity levels $s = pFT$ using the ADMM solver of Algorithm 1. Synthetic data are generated as described earlier for a random network with $N = 30$ nodes and $F = T = 290$; see Fig. 3(b). For the $F$ randomly selected OD pairs, $K$ nonoverlapping paths are chosen to carry the traffic; each path is formed by minimum hop-count routing to build the routing matrix, and a random fraction of the origin's traffic is assigned to each path. The gray-scale plots in Fig. 4 depict the phase transition of the relative estimation error $e_{x+a} = e_x + e_a$, comprising the nominal-traffic error $e_x := \|\hat{X} - X\|_F/\|X\|_F$ and the anomalous-traffic error $e_a := \|\hat{A} - A\|_F/\|A\|_F$, under various percentages of misses. The top plot corresponds to $K = 1$, and the bottom one to $K = 3$. The parameter $\lambda$ in (P2) is tuned to optimize performance.

When single-path routing ($K = 1$) is used, the network entails $L = 159$ physical links. In this case, the routing matrix $R \in \{0,1\}^{L \times F}$ has a large nullspace with $\dim(\mathcal{N}_R) = 127$, and as a result Fig. 4 (top) indicates that accurate recovery is possible only for relatively small values of $r$ and $s$. However, when multipath routing ($K = 3$) is used, $L = 227$ physical links are involved in carrying the OD-flow traffic. This shrinks the nullspace of $R \in [0,1]^{L \times F}$ to $\dim(\mathcal{N}_R) = 68$, and improves the isometry property of $R$ for sparse vectors. As a result, accurate traffic estimation becomes possible for traffic of higher dimensionality and denser anomalies; see Fig. 4 (bottom).

B. Traffic and anomaly maps
Real Internet-2 data are considered to portray the traffic based on (P1), solved over a time horizon of $T = 504$ time bins.

Fig. 4. Relative estimation error $e_{x+a}$ for various values of rank ($r$) and sparsity level ($s = pFT$), where $F = T = 290$, for a fixed sampling ratio $\pi$. (a) Single-path routing versus (b) multipath routing ($K = 3$). White represents exact recovery ($e_{x+a} \approx 0$), while black represents $e_{x+a} \approx 1$.

Impact of NetFlow data. The role of NetFlow measurements in the traffic-estimation performance is depicted in Fig. 5, which plots the relative error $e_{x+a}$ for various percentages of NetFlow samples ($\pi$). Naturally, the estimation accuracy improves as $\pi$ grows, and the improvement is more pronounced for the nominal traffic: relative to using only the link loads, adding NetFlow samples markedly enhances the nominal-traffic estimation accuracy, and improves the anomalous-traffic accuracy as well. This observation corroborates the effectiveness of exploiting partial NetFlow samples toward mapping out the network traffic.
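The error metrics used throughout these experiments can be computed directly from their definitions of $e_x$ and $e_a$ given earlier; a one-function transcription (ours):

```python
import numpy as np

def relative_errors(X_hat, X, A_hat, A):
    # e_x = ||X_hat - X||_F / ||X||_F, e_a likewise; e_{x+a} = e_x + e_a.
    e_x = np.linalg.norm(X_hat - X) / np.linalg.norm(X)
    e_a = np.linalg.norm(A_hat - A) / np.linalg.norm(A)
    return e_x, e_a, e_x + e_a
```

Exact recovery corresponds to all three values being (numerically) zero, matching the white regions of the phase-transition plots.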
Traffic profiles.
The true and estimated traffic time series are illustrated in Fig. 6 for three representative OD flows originating from the CHIN autonomous system located in Chicago. The depicted time series correspond to three different rows of $\hat{X}$ and $\hat{A}$ returned by (P1). Apparently, the traffic variations are closely tracked, and significant spikes are correctly flagged by (P1). In particular, (P1) confidently pinpoints a significant anomaly occurring within 9:20 P.M.-9:25 P.M. on December 11, 2003, in the flow CHIN-LOSA, which traverses several physical links. The higher false-alarm rate observed for the CHIN-IPLS flow is due to this flow traversing only a single link, which does not reveal enough information.

Unveiling anomalies.
Identifying anomalous patterns is pivotal for proactive network security tasks. The estimated map Â returned by (P1) offers a depiction of the network health-state across both time and flows. Our previous work in [12] and [16] deals with creating such a map with only the link loads Y at hand (i.e., Π = ∅), where the primary goal is to recover Â. The results in [12], [16] are promising, and can markedly outperform state-of-the-art workhorse PCA-based approaches in e.g., [13], [40]. Relative to [12], [16], however, the current work allows additional partial flow-level measurements. This naturally raises the question of how effective this additional information is toward identifying the anomalies. As seen in Fig. 5, taking more NetFlow samples is useful, but beyond a certain threshold it does not offer any extra appeal.

Fig. 5. Relative estimation error versus percentage of NetFlow samples.

Fig. 6. Nominal (a) and anomalous (b) traffic portraits for three representative OD flows (CHIN-LOSA, CHIN-ATLA, CHIN-IPLS) when π = 0.. True traffic is dashed blue and the estimated one is solid red.

C. Estimation with spatiotemporal correlation information
This section evaluates the effectiveness of (P5) and demonstrates the usefulness of traffic correlation information. Training data from the week of December 8-15, 2003 are used to estimate the Internet-2 traffic on the next day, December 16, 2003. The nominal "ground truth" traffic matrix X described earlier is considered, and for validation purposes bursty anomalies are synthetically injected to form the aggregate traffic X + A, which is then used to generate Y and Z_Π. To simulate the NetFlow samples, suppose of the randomly selected OD flows are inaccessible for the entire time horizon, and the rest are sampled only of the time, resulting in of the flow-level measurements being available.

Bursty anomalies.
To generate the anomalies A, envision a scenario where a subset of OD flows undergoes bursty anomalies while the rest are clean. Per flow f, bursty anomalies are generated according to the random multiplicative process {a_{f,t} = γ_f b_{f,t} c_{f,t}}_t, with mutually independent stationary processes {c_{f,t}} and {b_{f,t}}. The former is a correlated Gaussian process, and the latter is a correlated {0,1}-Bernoulli process that models the bursts. The Gaussian process obeys the first-order auto-regressive model c_{f,t} = θ c_{f,t−1} + σ_n n_{f,t}, with c_{f,0} = 0 and n_{f,t} ~ N(0, 1), for some |θ| < 1. The Bernoulli process adheres to b_{f,t} = d_{f,t} b_{f,t−1} + (1 − d_{f,t}) e_{f,t}, where the independent random variables d_{f,t} and e_{f,t} obey d_{f,t} ~ Ber(α) and e_{f,t} ~ Ber(ν), respectively. The initial variable b_{f,0} is also generated as Ber(ν).

Fig. 7. Sample correlations R_B (a) and R_Q (b) learned from historical traffic data during December 8-15, 2003.

Learning correlations.
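As a reference for this step, the burst model just described can be simulated directly to produce synthetic anomaly series. The parameter values below are illustrative only, since several constants are not legible in the source:

```python
import numpy as np

def bursty_flow(T, gamma=50.0, theta=0.9, sigma_n=0.5, alpha=0.8, nu=0.1, seed=0):
    """Simulate a_{f,t} = gamma * b_{f,t} * c_{f,t} for a single flow f."""
    rng = np.random.default_rng(seed)
    a = np.empty(T)
    c = 0.0                                  # c_{f,0} = 0
    b = float(rng.random() < nu)             # b_{f,0} ~ Ber(nu)
    for t in range(T):
        c = theta * c + sigma_n * rng.standard_normal()  # AR(1) Gaussian factor
        d = rng.random() < alpha                         # d_{f,t} ~ Ber(alpha)
        e = float(rng.random() < nu)                     # e_{f,t} ~ Ber(nu)
        b = b if d else e                    # Markov on/off burst indicator
        a[t] = gamma * b * c
    return a

a = bursty_flow(504)                         # horizon of T = 504 time bins
burst_fraction = np.mean(a != 0.0)           # sparsity of the anomaly series
```

The resulting series is zero outside bursts and exhibits correlated, heavy spikes inside them, which is the behavior the correlation matrices below are meant to capture.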
Owing to the stationarity of the processes {b_{f,t}} and {c_{f,t}}, the process {a_{f,t}} is stationary, and as a result R_a^{(f)}(τ) = γ_f² R_b^{(f)}(τ) R_c^{(f)}(τ), with the corresponding correlations given by R_c^{(f)}(τ) = θ^{|τ|} σ_n² / (1 − θ²) and R_b^{(f)}(τ) = ν(1 − ν) α^{|τ|} + ν². Set γ_f = 50, θ = 0., σ_n = 0., α = 0., and ν = 0.. The correlation matrices {R_B, R_C} with Toeplitz blocks are then obtained from (20). Moreover, to account for the cyclostationarity of traffic with a day-long periodicity, the correlation matrices {R_L, R_Q} are learned as elaborated in Section VI-A. The resulting temporal correlation matrices R_B and R_Q, learned from the traffic data of December 8-15, 2003, are displayed in Fig. 7, where the data points in each axis correspond to hours. The sharp transition noticed in Fig. 7(b) happens at p.m., and signifies a sudden increase in traffic usage for the rest of the day.

Traffic maps.
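The closed-form lag correlations above can be assembled directly into a Toeplitz temporal correlation matrix, as a sketch of how {R_B, R_C} arise from (20). Parameter values are again illustrative, since the true constants are not legible in the source:

```python
import numpy as np

gamma_f, theta, sigma_n, alpha, nu = 50.0, 0.9, 0.5, 0.8, 0.1
T = 24
tau = np.arange(T)

# Closed-form lag-tau correlations of the AR(1) and Markov-Bernoulli factors
R_c = theta**tau * sigma_n**2 / (1.0 - theta**2)   # AR(1) autocorrelation
R_b = nu * (1.0 - nu) * alpha**tau + nu**2         # Bernoulli autocorrelation
R_a = gamma_f**2 * R_b * R_c                       # a = gamma*b*c, b and c independent

# Toeplitz matrix with (t, s) entry R_a(|t - s|)
R_A = R_a[np.abs(np.subtract.outer(tau, tau))]
eigmin = np.linalg.eigvalsh(R_A).min()             # valid correlation: eigmin >= 0
```

Since R_A is the entrywise product of two valid (positive semidefinite) Toeplitz correlation matrices scaled by γ_f², it is itself positive semidefinite, as the eigenvalue check confirms numerically.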
Fig. 9 depicts the time series of the estimated and true nominal traffic for the IPLS-CHIN OD flow (see Fig. 3(a)). For this flow, no direct NetFlow sample is collected. It is apparent that (P5), which uses the knowledge of traffic spatiotemporal correlation, tracks the underlying traffic fairly well, whereas (P1) cannot even track the large-scale variations of the traffic. This demonstrates the nonidentifiability of (P1) when only a small fraction of OD flows is nonuniformly sampled, and notably around of the rows of X are not directly observable. (P5), however, interpolates the traffic associated with unobserved OD flows from the observed ones through the correlation matrices {R_L, R_Q}. The resulting relative estimation error for (P5) is e_x = 0., which is well below e_x = 0. for (P1). The correlation knowledge also helps in discovering the anomalous traffic patterns, as seen in Fig. 8, where in particular (P5) attains e_a = 0., while (P1) attains e_a = 0.. Interestingly, the anomaly map revealed by (P1) tends to spot the anomalies intermittently, since the ℓ1-norm regularizer weighs all flows and time instants equally.
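The interpolation of unobserved flows through correlation matrices can be illustrated in its simplest Gaussian form: the minimum mean-square error estimate of unobserved entries conditions on the observed ones through the correlation matrix. This toy sketch is purely illustrative; (P5) itself uses {R_L, R_Q} as quadratic regularizers rather than this explicit formula, and the correlation matrix below is hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
F = 6
# Hypothetical flow correlation matrix (exponentially decaying, hence valid)
idx = np.arange(F)
C = 0.8 ** np.abs(np.subtract.outer(idx, idx))

# One zero-mean traffic snapshot x ~ N(0, C); flows in `miss` are unobserved
x = np.linalg.cholesky(C) @ rng.standard_normal(F)
obs, miss = np.array([0, 2, 3, 5]), np.array([1, 4])

# Conditional mean (MMSE) interpolation: x_miss ~= C_mo C_oo^{-1} x_obs
x_hat = C[np.ix_(miss, obs)] @ np.linalg.solve(C[np.ix_(obs, obs)], x[obs])

# Posterior variance quantifies how much the correlations reduce uncertainty
post_var = np.diag(C[np.ix_(miss, miss)]
                   - C[np.ix_(miss, obs)] @ np.linalg.solve(C[np.ix_(obs, obs)],
                                                            C[np.ix_(obs, miss)]))
```

Each unobserved flow sits between strongly correlated observed neighbors, so the posterior variances drop well below the prior variance of 1, which is the mechanism by which correlation knowledge restores identifiability.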
Fig. 8. Estimated and "ground truth" (c) anomaly maps across time and flows, without using correlation (a) and after using correlation information (b).
Fig. 9. True and estimated traffic of the IPLS-CHIN flow: true, estimate with correlation, and estimate without correlation.
X. CONCLUSIONS AND FUTURE WORK
This paper taps recent advances in low-rank and sparse recovery to create maps of nominal and anomalous traffic as valuable input for network management and proactive security tasks. A novel tomographic framework is put forth which subsumes critical network monitoring tasks including traffic estimation, anomaly identification, and traffic interpolation. Leveraging the low intrinsic dimensionality of nominal traffic as well as the sparsity of anomalies, a convex program is formulated with ℓ1- and nuclear-norm regularizers, with the link loads and a small subset of flow counts as the available data. Sufficient conditions on the true traffic and anomalies, as well as on the routing and OD-flow sampling strategies, are derived which guarantee accurate estimation of the traffic.

For practical networks where the said conditions are possibly violated, additional knowledge about inherent traffic patterns is incorporated through correlations by adopting a Bayesian approach and taking advantage of the bilinear characterization of the ℓ1- and nuclear norms. A systematic approach is also devised to learn the correlations using (cyclo)stationary historical traffic data. Simulated tests with synthetic and real
Internet data confirm the efficacy of the novel estimators. There remain intriguing open questions that go beyond the scope of the current paper but are worth pursuing as future research. One such question pertains to quantifying the minimal number of sampled OD flows that assures accurate traffic estimation for a realistic network scenario with a given routing matrix. Another avenue to explore involves the adoption of tensor models along the lines of [33], [41], [42] to further exploit network topological information toward improving the traffic estimation accuracy.

APPENDIX: PROOF OF THE MAIN RESULT
In what follows, conditions are first derived under which the pair (X, A) is the unique optimal solution of (P2). The sought conditions pertain to the existence of certain dual certificates, which are then constructed in Section B.

A. Unique Optimality Conditions
Recall the nonsmooth optimization problem (P2) and its Lagrangian

L(X, A; M_y, M_z) = ‖X‖_* + λ‖A‖_1 + ⟨M_y, Y − R(X + A)⟩ + ⟨M_z, Z_Π − P_Π(X + A)⟩   (23)

where M_y ∈ R^{L×T} and M_z ∈ R^{F×T} are the matrices of dual variables (multipliers) associated with the link- and flow-level constraints in (P2), respectively. From the characterization of the subdifferentials of the nuclear and ℓ1 norms (see e.g., [43]), the subdifferential of the Lagrangian at (X, A) is given by (recall that X = U Σ V′)

∂_X L(X, A; M_y, M_z) = { U V′ + W − R′ M_y − P_Π(M_z) : ‖W‖ ≤ 1, P_{Φ_X}(W) = 0_{F×T} }   (24)

∂_A L(X, A; M_y, M_z) = { λ sign(A) + λ F − R′ M_y − P_Π(M_z) : ‖F‖_∞ ≤ 1, P_{Ω_A}(F) = 0_{F×T} }.   (25)

The optimality conditions for (P2) assert that (X, A) is an optimal (not necessarily unique) solution if and only if

0_{F×T} ∈ ∂_X L(X, A; M_y, M_z),   0_{F×T} ∈ ∂_A L(X, A; M_y, M_z).

This is tantamount to the existence of dual variables {W, F, M_y, M_z} satisfying: (i) ‖W‖ ≤ 1 and P_{Φ_X}(W) = 0_{F×T}; (ii) ‖F‖_∞ ≤ 1 and P_{Ω_A}(F) = 0_{F×T}; and (iii) λ sign(A) + λ F = U V′ + W = R′ M_y + P_Π(M_z).

In essence, to eliminate M_y and M_z, one can alternatively interpret (iii) as finding a dual variable Γ ∈ N_R^⊥ + N_Π^⊥ = (N_R ∩ N_Π)^⊥ such that Γ = λ sign(A) + λ F = U V′ + W. Since W = P_{Φ_X^⊥}(Γ) and F = P_{Ω_A^⊥}(Γ), conditions (i) and (ii) can also be simply recast in terms of Γ. In general, (i)-(iii) may hold for multiple solution pairs. However, the next lemma asserts that a slight tightening of the optimality conditions (i)-(iii) leads to a unique optimal solution for (P2). The proof goes along the lines of [12, Lemma 2], and it is omitted here for conciseness.
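The projections entering these subdifferentials are explicit: with X = UΣV′, the tangent space Φ_X = {U Z_1′ + Z_2 V′} has orthogonal projector P_{Φ_X}(M) = P_U M + M P_V − P_U M P_V, while P_{Ω_A} zeroes the entries off the support of A. A small numerical check of these projector identities, on random data purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
F, T, r = 8, 10, 2
X = rng.standard_normal((F, r)) @ rng.standard_normal((r, T))   # rank-r X
U, _, Vt = np.linalg.svd(X, full_matrices=False)
U, V = U[:, :r], Vt[:r, :].T
PU, PV = U @ U.T, V @ V.T

def P_phi(M):
    """Orthogonal projector onto the tangent space Phi_X at X."""
    return PU @ M + M @ PV - PU @ M @ PV

A = np.zeros((F, T)); A[rng.random((F, T)) < 0.1] = 1.0         # sparse support
def P_omega(M):
    """Projector onto matrices supported on the support of A."""
    return M * (A != 0)

M = rng.standard_normal((F, T))
idempotent = np.allclose(P_phi(P_phi(M)), P_phi(M))
contains_UV = np.allclose(P_phi(U @ V.T), U @ V.T)   # U V' lies in Phi_X
```

Idempotence and the fact that U V′ is a fixed point of P_{Φ_X} are exactly the properties exploited when the subgradient conditions are recast in terms of a single certificate Γ.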
Proposition 4: If (X, A) is locally identifiable from (c1) and (c2), and there exists a dual certificate Γ ∈ R^{F×T} satisfying

C1) P_{Φ_X}(Γ) = U V′
C2) P_{Ω_A}(Γ) = λ sgn(A)
C3) P_{N_R ∩ N_Π}(Γ) = 0
C4) ‖P_{Φ_X^⊥}(Γ)‖ < 1
C5) ‖P_{Ω_A^⊥}(Γ)‖_∞ < λ

then (X, A) is the unique optimal solution to (P2).

The rest of the proof deals with the construction of a valid dual certificate Γ that simultaneously meets C1-C5. One should note that condition (iii) is a distinct feature of the recovery task pursued in this paper. In a similar context, in the robust PCP problem studied in [14], N_R = {0} and N_Π = {0}, and thus C3 does not appear anymore. In addition, the low-rank plus compressed sparse recovery task studied in [12] does not involve the intersection of subspaces appearing in C3.

B. Dual Certificate Construction
The main steps of the construction are inspired by [14], which studies the decomposition of low-rank plus sparse matrices; that is, Π = ∅ and R = I_F. However, relative to [14], the problem here brings up several new distinct elements, including the nullspaces of the compression and sampling operators in C3, which further challenge the construction of dual certificates and demand, in part, a new treatment. In addition, different incoherence measures are introduced here which facilitate satisfiability for random ensembles. The construction involves two steps. In the first step, a candidate dual certificate is selected to fulfill C1-C3, whereas the second step ensures the candidate dual certificate also satisfies C4-C5 under certain technical conditions in terms of the incoherence parameters of Section IV-B.

Toward the first step, condition (II) in Theorem 1 implies local identifiability of the observation model, namely Ω_A ∩ Φ_X = {0} and (Ω_A ⊕ Φ_X) ∩ (N_R ∩ N_Π) = {0}, and thus, based on a property of direct sums [24], there exists a unique certificate Γ ∈ Ω_A ⊕ Φ_X ⊕ (N_R ∩ N_Π) with projections P_{Ω_A}(Γ) = λ sign(A), P_{Φ_X}(Γ) = U V′, and P_{N_R∩N_Π}(Γ) = 0. This dual certificate can be expressed as Γ = Γ_{Ω_A} + Γ_{Φ_X} + Γ_{N_R∩N_Π} with components Γ_{Ω_A} ∈ Ω_A, Γ_{Φ_X} ∈ Φ_X, and Γ_{N_R∩N_Π} ∈ N_R ∩ N_Π. As will be seen later, it is more convenient to represent Γ_{Ω_A} = ε_{Ω_A} + λ sign(A) and Γ_{Φ_X} = ε_{Φ_X} + U V′. From C1-C3, the projection components {ε_{Ω_A}, ε_{Φ_X}, Γ_{N_R∩N_Π}} then satisfy

ε_{Φ_X} = −P_{Φ_X}(ε_{Ω_A} + λ sign(A) + Γ_{N_R∩N_Π})   (26)

ε_{Ω_A} = −P_{Ω_A}(ε_{Φ_X} + U V′ + Γ_{N_R∩N_Π})   (27)

Γ_{N_R∩N_Π} = −P_{N_R∩N_Π}(ε_{Φ_X} + U V′ + ε_{Ω_A} + λ sign(A)).   (28)

The second step of the proof manages the candidate dual certificate Γ to satisfy C4 and C5 as well. The main idea is to tighten the conditions for local identifiability and impose additional conditions on the incoherence measures (cf. Section IV-B) to ensure that C4 and C5 hold true.
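The incoherence measures μ(·,·) that drive this step quantify how far two subspaces are from overlapping, via the operator norm of one projector composed with the other; when μ is small, the fixed-point equations (26)-(28) contract. Taking μ(S_1, S_2) = ‖P_{S_2} P_{S_1}‖ as the working definition (an assumption patterned after [12], [14]; the paper's exact definition is in Section IV-B, outside this excerpt), it can be estimated by power iteration on the self-adjoint map M ↦ P_{S_1}(P_{S_2}(P_{S_1}(M))):

```python
import numpy as np

rng = np.random.default_rng(0)
F, T, r = 8, 10, 2
X = rng.standard_normal((F, r)) @ rng.standard_normal((r, T))
U, _, Vt = np.linalg.svd(X, full_matrices=False)
PU, PV = U[:, :r] @ U[:, :r].T, Vt[:r, :].T @ Vt[:r, :]
support = rng.random((F, T)) < 0.1                  # support of a sparse A

P_phi = lambda M: PU @ M + M @ PV - PU @ M @ PV     # projector onto Phi_X
P_omega = lambda M: M * support                     # projector onto Omega_A

def mu(P1, P2, shape, iters=200, seed=1):
    """Estimate ||P2 P1|| (Frobenius operator norm) by power iteration
    on the self-adjoint, PSD map  M -> P1(P2(P1(M)))."""
    M = np.random.default_rng(seed).standard_normal(shape)
    lam = 0.0
    for _ in range(iters):
        M = P1(P2(P1(M)))
        lam = np.linalg.norm(M)                     # converges to lam_max
        if lam == 0.0:
            return 0.0
        M /= lam
    return np.sqrt(lam)                             # ||P2 P1|| = sqrt(lam_max)

mu_phi_omega = mu(P_phi, P_omega, (F, T))
```

Since both maps are orthogonal projectors, the estimate always lies in [0, 1]; values strictly below 1 reflect the transversality of Φ_X and Ω_A that the certificate construction relies on.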
In this direction, one can begin by bounding

‖P_{Φ_X^⊥}(Γ)‖ ≤ ‖Γ_{Ω_A} + Γ_{N_R∩N_Π}‖ = ‖ε_{Ω_A} + λ sgn(A) + Γ_{N_R∩N_Π}‖
  ≤^(a) ‖P_{Ω_A}(ε_{Φ_X} + U V′ + Γ_{N_R∩N_Π})‖ + λ ‖sgn(A)‖ + ‖Γ_{N_R∩N_Π}‖   (29)

and

‖P_{Ω_A^⊥}(Γ)‖_∞ ≤ ‖Γ_{Φ_X} + Γ_{N_R∩N_Π}‖_∞ = ‖ε_{Φ_X} + U V′ + Γ_{N_R∩N_Π}‖_∞
  ≤^(b) ‖P_{Φ_X}(ε_{Ω_A} + λ sgn(A) + Γ_{N_R∩N_Π})‖_∞ + ‖U V′‖_∞ + ‖Γ_{N_R∩N_Π}‖_∞   (30)

where (a) and (b) come from (26) and (27) after applying the triangle inequality. In order to bound the r.h.s. of (29) and (30), it is instructive to first recognize that ‖U V′‖_∞ ≤ γ(U, V), and

‖sgn(A)‖ ≤ (‖sgn(A)‖_{∞,∞} ‖sgn(A)‖_{1,1})^{1/2} = k   (31)

see e.g., [24]. In addition, building on (8) and (12), the first term on the r.h.s. of (29) is bounded as

‖P_{Ω_A}(ε_{Φ_X} + U V′ + Γ_{N_R∩N_Π})‖ ≤ ‖P_{Ω_A} P_{Φ_X}(ε_{Φ_X} + U V′)‖ + ‖P_{Ω_A} P_{N_Π} P_{N_R}(Γ_{N_R∩N_Π})‖
  ≤^(a) μ(Φ_X, Ω_A)(‖ε_{Φ_X}‖ + 1) + μ(N_R, Ω_A ∩ N_Π) ‖Γ_{N_R∩N_Π}‖   (32)

where (a) is due to the fact that P_{Ω_A ∩ N_Π} = P_{Ω_A} P_{N_Π}. Proceeding in a similar manner as for (32), upon using (11) it follows that

‖P_{Φ_X}(ε_{Ω_A} + λ sgn(A) + Γ_{N_R∩N_Π})‖_∞ ≤ γ(U, V) ‖P_{Φ_X}(ε_{Ω_A} + λ sgn(A) + Γ_{N_R∩N_Π})‖
  ≤ γ(U, V) [ ‖P_{Φ_X}(ε_{Ω_A} + λ sgn(A))‖ + ‖P_{Φ_X} P_{N_Π}(Γ_{N_R∩N_Π})‖ ]
  ≤ γ(U, V) [ μ(Ω_A, Φ_X)(‖ε_{Ω_A}‖ + λk) + μ(Φ_X, N_Π) ‖Γ_{N_R∩N_Π}‖ ].   (33)

Focusing on (32) and (33), it only remains to bound ‖ε_{Ω_A}‖, ‖ε_{Φ_X}‖, and ‖Γ_{N_R∩N_Π}‖.
To this end, (27)-(28) are utilized to arrive at

‖ε_{Φ_X}‖ = ‖P_{Φ_X}(ε_{Ω_A} + λ sign(A) + Γ_{N_R∩N_Π})‖ ≤ μ(Φ_X, Ω_A)(‖ε_{Ω_A}‖ + λk) + μ(Φ_X, N_Π) ‖Γ_{N_R∩N_Π}‖   (34)

‖ε_{Ω_A}‖ = ‖P_{Ω_A}(ε_{Φ_X} + U V′ + Γ_{N_R∩N_Π})‖ ≤ μ(Φ_X, Ω_A)(‖ε_{Φ_X}‖ + 1) + μ(N_R, Ω_A) ‖Γ_{N_R∩N_Π}‖   (35)

and

‖Γ_{N_R∩N_Π}‖ = ‖P_{N_R∩N_Π}(ε_{Φ_X} + U V′ + ε_{Ω_A} + λ sign(A))‖ ≤ μ(Φ_X, N_Π)(‖ε_{Φ_X}‖ + 1) + μ(N_R, Ω_A ∩ N_Π)(‖ε_{Ω_A}‖ + λk).   (36)

For convenience, introduce the notation α := μ(Φ_X, Ω_A), β := μ(N_R, Ω_A), ξ := μ(Φ_X, N_Π), and ν := μ(N_R, Ω_A ∩ N_Π). Then, after combining (34)-(36) and some algebra, it follows that

‖Γ_{N_R∩N_Π}‖ ≤ θ := [ ξ + ν(α + λk) + α(ξ + αν)(1 − α²)⁻¹(α + λk) ] / [ 1 − νβ − (ξ + αν)(1 − α²)⁻¹(ξ + αβ) ]   (37)

and

‖ε_{Ω_A}‖ ≤ α + (1 − α²)⁻¹ α²(α + λk) + [ β + (α²β + αξ)(1 − α²)⁻¹ ] θ   (38)

‖ε_{Φ_X}‖ ≤ (1 − α²)⁻¹ [ α(α + λk) + (αβ + ξ) θ ].   (39)

At this point, it is important to recognize from (12) that ‖Γ_{N_R∩N_Π}‖_∞ ≤ τ ‖Γ_{N_R∩N_Π}‖. Now, building on (37)-(39), one can bound the terms on the r.h.s. of (29) and (30) from above in terms of {α, β, ξ, ν, k}. Finally, to fulfill C4 and C5, it suffices to confine the corresponding upper bounds to the values 1 and λ, respectively. This imposes the conditions

(a) λk + α + α(1 − α²)⁻¹ [ α(α + λk) + (αβ + ξ) θ ] + (1 + ν) θ < 1
(b) γ + ηαλk + (τ + ηα + ηξ) θ < λ.

Conditions (a) and (b) imply that C1-C5 hold for the dual certificate Γ if there exists a valid λ ∈ [λ_min, λ_max], with λ_max ≥ λ_min ≥ 0. The resulting condition is summarized in assumptions (I) and (II) of Theorem 1, and the proof is complete.

REFERENCES

[1] X. Wu, K. Yu, and X. Wang, "On the growth of Internet application flows: A complex network perspective," in
Proc. IEEE Intl. Conf. on Computer Commun., Shanghai, China, 2011.
[2] Y. Shavitt, X. Sun, A. Wool, and B. Yener, "Computing the unmeasured: An algebraic approach to Internet mapping," in Proc. IEEE Intl. Conf. on Computer Commun., Alaska, USA, Apr. 2001.
[3] A. Lakhina, K. Papagiannaki, M. Crovella, C. Diot, E. D. Kolaczyk, and N. Taft, "Structural analysis of network traffic flows," in Proc. of ACM SIGMETRICS, New York, NY, Jul. 2004.
[4] P. Barford and D. Plonka, "Characteristics of network traffic flow anomalies," in Proc. 1st ACM SIGCOMM Workshop on Internet Measurements, San Francisco, CA, Nov. 2001.
[5] K. Papagiannaki, R. Cruz, and C. Diot, "Network performance monitoring at small time scales," in Proc. 1st ACM SIGCOMM Workshop on Internet Measurements, Miami Beach, FL, Oct. 2003.
[6] E. D. Kolaczyk, Analysis of Network Data: Methods and Models. New York: Springer, 2009.
[7] Q. Zhao, Z. Ge, J. Wang, and J. Xu, "Robust traffic matrix estimation with imperfect information: Making use of multiple data sources," vol. 34, pp. 133–144, 2006.
[8] E. Cascetta, "Estimation of trip matrices from traffic counts and survey data: A generalized least-squares estimator," Transportation Research, Part B: Methodological, vol. 18, pp. 289–299, 1984.
[9] Y. Vardi, "Network tomography: Estimating source-destination traffic intensities from link data," Journal of the American Statistical Association, vol. 91, pp. 365–377, 1996.
[10] H. V. Zuylen and L. Willumsen, "The most likely trip matrix estimated from traffic counts," Transportation Research, Part B: Methodological, vol. 14, pp. 281–293, 1980.
[11] M. Roughan, Y. Zhang, W. Willinger, and L. Qiu, "Spatio-temporal compressive sensing and Internet traffic matrices," IEEE/ACM Trans. Networking, vol. 20, pp. 662–676, 2012.
[12] M. Mardani, G. Mateos, and G. B. Giannakis, "Recovery of low-rank plus compressed sparse matrices with application to unveiling traffic anomalies," IEEE Trans. Info. Theory, vol. 59, pp. 5186–5205, Aug. 2013.
[13] Y. Zhang, Z. Ge, A. Greenberg, and M. Roughan, "Network anomography," in Proc. of ACM SIGCOMM Conf. on Internet Measurements, Berkeley, CA, USA, Oct. 2005.
[14] V. Chandrasekaran, S. Sanghavi, P. R. Parrilo, and A. S. Willsky, "Rank-sparsity incoherence for matrix decomposition," SIAM J. Optim., vol. 21, no. 2, pp. 572–596, 2011.
[15] M. Mardani, G. Mateos, and G. B. Giannakis, "In-network sparsity-regularized rank minimization: Applications and algorithms," IEEE Trans. Signal Process., vol. 59, pp. 5374–5388, Nov. 2013.
[16] ——, "Dynamic anomalography: Tracking network anomalies via sparsity and low rank," IEEE J. Sel. Topics in Signal Process., vol. 7, no. 11, pp. 50–66, Feb. 2013.
[17] [Online]. Available: http://internet2.edu/observatory/archive/data-collections.html
[18] B. K. Natarajan, "Sparse approximate solutions to linear systems," SIAM J. Comput., vol. 24, pp. 227–234, 1995.
[19] B. Recht, M. Fazel, and P. A. Parrilo, "Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization," SIAM Rev., vol. 52, no. 3, pp. 471–501, 2010.
[20] E. J. Candès and T. Tao, "Decoding by linear programming," IEEE Trans. Info. Theory, vol. 51, no. 12, pp. 4203–4215, 2005.
[21] E. J. Candès and Y. Plan, "Matrix completion with noise," Proceedings of the IEEE, vol. 98, pp. 925–936, 2009.
[22] A. Ganesh, K. Min, J. Wright, and Y. Ma, "Principal component pursuit with reduced linear measurements," arXiv:1202.6445v1 [cs.IT], 2012.
[23] E. J. Candès and B. Recht, "Exact matrix completion via convex optimization," Foundations of Computational Mathematics, vol. 9, no. 6, pp. 717–772, 2009.
[24] R. A. Horn and C. R. Johnson, Matrix Analysis. Cambridge University Press, 1985.
[25] E. J. Candès, X. Li, Y. Ma, and J. Wright, "Robust principal component analysis?" Journal of the ACM, vol. 58, no. 1, pp. 1–37, 2011.
[26] F. Deutsch, Best Approximation in Inner Product Spaces, 2nd ed. Springer-Verlag, 2001.
[27] E. J. Candès and B. Recht, "Exact matrix completion via convex optimization," Found. Comput. Math., vol. 9, no. 6, pp. 717–722, 2009.
[28] D. P. Bertsekas and J. N. Tsitsiklis, Parallel and Distributed Computation: Numerical Methods, 2nd ed. Athena-Scientific, 1999.
[29] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, "Distributed optimization and statistical learning via the alternating direction method of multipliers," Found. Trends Mach. Learning, vol. 3, pp. 1–122, 2010.
[30] N. Srebro and A. Shraibman, "Rank, trace-norm and max-norm," in Proc. of Learning Theory, 2005, pp. 545–560.
[31] S. G. Samko, A. A. Kilbas, and O. I. Marichev, Fractional Integrals and Derivatives. Yverdon, Switzerland: Gordon and Breach, 1993.
[32] T. Eltoft, T. Kim, and T.-W. Lee, "On the multivariate Laplace distribution," IEEE Signal Process. Letters, vol. 13, pp. 300–303, May 2006.
[33] J. A. Bazerque, G. Mateos, and G. B. Giannakis, "Rank regularization and Bayesian inference for tensor completion and extrapolation," IEEE Trans. Signal Process., vol. 61, no. 22, pp. 5689–5703, Nov. 2013.
[34] G. B. Giannakis, "Cyclostationary signal analysis," in Digital Signal Processing Handbook, V. K. Madisetti and D. Williams, Eds. Boca Raton, FL: CRC, 1998.
[35] V. Solo and X. Kong, Adaptive Signal Processing Algorithms: Stability and Performance. Prentice-Hall, 1995.
[36] M. Razaviyayn, M. Hong, and Z.-Q. Luo, "A unified convergence analysis of block successive minimization methods for nonsmooth optimization," SIAM Journal on Optimization, vol. 23, pp. 1126–1153, 2013.
[37] Y. Nesterov, "A method of solving a convex programming problem with convergence rate O(1/k²)," Soviet Mathematics Doklady, vol. 27, pp. 372–376, 1983.
[38] A. Lakhina, M. Crovella, and C. Diot, "Mining anomalies using traffic feature distributions," vol. 35, pp. 217–228, Aug. 2005.
[39] M. Thottan and C. Ji, "Anomaly detection in IP networks," IEEE Trans. Signal Process., vol. 51, pp. 2191–2204, Aug. 2003.
[40] A. Lakhina, M. Crovella, and C. Diot, "Diagnosing network-wide traffic anomalies," in Proc. of ACM SIGCOMM, Portland, OR, Aug. 2004.
[41] M. Mardani, G. Mateos, and G. B. Giannakis, "Subspace learning and imputation for streaming big data matrices and tensors," arXiv:1404.4667v1 [stat.ML], 2014.
[42] H. Kim, S. Lee, X. Ma, and C. Wang, "Higher-order PCA for anomaly detection in large-scale networks," in Proc. of 3rd Workshop on Comp. Advances in Multi-Sensor Adaptive Proc., Aruba, Dutch Antilles, Dec. 2009.
[43] S. Boyd and L. Vandenberghe,