A Non-intrusive Failure Prediction Mechanism for Deployed Optical Networks
Dibakar Das, Mohammad Fahad Imteyaz, Jyotsna Bapat, Debabrata Das
arXiv [cs.NI]
© 2021 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Dibakar Das
IIIT Bangalore
Bangalore, [email protected]
Mohammad Fahad Imteyaz
Tejas Networks
Bangalore, [email protected]
Jyotsna Bapat
IIIT Bangalore
Bangalore, [email protected]
Debabrata Das
IIIT Bangalore
Bangalore, [email protected]
Abstract: Failures in the optical network backbone can lead to major disruption of internet data traffic. Hence, minimizing such failures is of paramount importance for network operators. It is even better if network failures can be predicted and preventive steps taken in advance to avoid any disruption in traffic. Various data driven and machine learning techniques have been proposed in the literature for failure prediction. Most of these techniques need real time data from the networks and also need different monitors to measure key optical parameters. This means that provision for failure prediction has to be available in network nodes, e.g., routers and network management systems. However, sometimes deployed networks do not have failure prediction built into their initial design, but the need for such mechanisms arises subsequently. For such systems, there are two key challenges. Firstly, statistics of failure distribution, data, etc., are not readily available. Secondly, major changes cannot be made to network nodes which are already commercially deployed. This paper proposes a novel implementable non-intrusive failure prediction mechanism for deployed network nodes using information from the log files of those devices. Numerical results show that the mechanism has near perfect accuracy in predicting failures of individual network nodes.
Index Terms: Failure prediction, optical networks, directed acyclic graph
I. INTRODUCTION
The digital world (business, education, consumer, etc.) depends on access networks and backbone networks. The backbone network carries a massive amount of data and interconnects all types of access networks to provide reliable connectivity among users, internet of things (IoT) devices, machines, clouds, etc. Most backbone networks use optical communication because of its ability to support very high data rates. The backbone network consists of millions of routers/switches across the globe. Each router has multiple elements, and if one of them malfunctions, the router may be disconnected from the rest of the network, leading to failure of a critical global link, enormous disruption of data traffic, and major loss to businesses and other activities. Hence, it is important to prevent failures using intelligent prediction mechanisms rather than a reactive approach of network recovery after a fault.

Several research proposals have been made on optical network failure prediction. [1] applies a Gaussian process classifier to detect single link failures. It also proposes a heuristic to first identify the suspected links and then apply the classifier to identify the failed link. A support vector machine and double exponential smoothing based approach for network equipment failure prediction is described in [2]. [3] applies Bayesian networks to predict failures in cellular networks. Supervised learning based online and off-line techniques to predict link quality estimates in wireless sensor networks have been applied in [4]. [5] compares three different data mining algorithms for network fault classification, namely, K-Means, Fuzzy C-Means, and Expectation Maximization, to suggest abnormal behavior in communication networks.

All of the literature explored above needs a large amount of data to be collected, or simulated assuming certain distributions of various optical layer parameters. These techniques use various monitors for data collection.
Sometimes, however, failure prediction is not built into the design of already deployed network systems, but the need for such mechanisms arises subsequently. In such a scenario, two key challenges have to be dealt with. Firstly, statistics of failure distribution and relevant data are not readily available to directly apply conventional data driven approaches, such as machine learning, Bayesian networks, etc. [2] [3]. Even if logs are available, extracting useful statistics from all the historical data can be a time consuming process. Secondly, when such network nodes are commercially deployed in thousands, major changes in software or hardware are neither possible nor recommended.

The logging mechanism in network nodes, in the form of text or binary files, contains various information on the variation of optical and system parameters, e.g., optical signal to noise ratio (OSNR), clock drift, etc. Log files also contain information on the sequence of events over a period of time leading to different types of node failures.

This paper proposes a novel network node failure prediction mechanism using information from the log files without making any major changes to the deployed system. Using the log files to figure out the events leading to failures, a prediction mechanism is developed based on a directed acyclic graph (DAG) and the construction of two efficient standard data structures (section II). The internal nodes in the DAG are the events and the leaf nodes are the failures. A directed edge exists from one event to the succeeding one, finally reaching a leaf failure node. A probability is defined for each node based on how far the node is from a possible failure (distance is defined in terms of the number of hops to a failure). The nearer the node is to a failure, the higher the probability. The DAG and the associated data structures are constructed off-line from old log files of errors similar to those the system is expected to predict.
During the prediction phase, as events occur in real time, the DAG is traversed and failures are predicted in association with the data structures. Numerical results show near perfect failure prediction. To the best of the authors' knowledge, no previous work in the literature has proposed such a non-intrusive failure prediction mechanism for commercially deployed network nodes.

There can be many possible implementations of such a network node failure prediction mechanism [1] [2]. Based on the amount of information available at this point in time, the following objectives are envisaged.
• Develop a non-intrusive system without making any changes to the commercially deployed system
• Develop a quick first-cut solution to reduce time-to-market
• Use just enough information (available at this point in time) from the log files to develop the first-cut solution
• Though it is a first-cut solution with a low number of failures to predict, the system should evolve over time (for example, the DAG would be extended as more log files are analyzed off-line to extract the appropriate event sequences leading to different failures)
• Quick failure prediction phase
• Use efficient standard data structures built from the information in log files to have a quick prediction phase by ruling out invalid sequences of events and false-positives, and converging on the valid transitions in the DAG
• Efficient, safe and quick implementation of the prediction mechanism

The paper is organized as follows. The proposed idea and the system model are described in section II. Results are discussed in section III. Conclusions are drawn in section IV.

II. SYSTEM MODEL
Most network systems have a logging mechanism which helps in debugging malfunctions and failures in the networks and their elements, e.g., routers. Developers look at these logs to find out the sequence of events leading to a failure. Normally, logs contain information such as time stamps and associated text, parameter values (e.g., clock drift), etc. (Fig. 1). The information relevant to a failure can be expressed as a sequence of events (E_i) and times (T_i), i.e., (E_i, T_i) tuples, leading to the failure (i being the index).

Figure 1. Example log file

Consider the section of an example log in Fig. 1. Each network log contains important information in the first two columns, namely, the time stamp T_i and the associated text and values of system parameters, e.g., clock drift. When a failure occurs in a node, only the relevant information from the log files is designated as events E_i, shown in the third column of the example log in Fig. 1. For this example log file, to investigate a failure, the information at times T, T, T_{m+1} and T_n is relevant and is designated as events E, E, E and E respectively. Events can be, for example, clock drift beyond a certain threshold, rise in temperature above a certain limit, OSNR falling below a lower threshold, or a node not receiving signal from its peer. It is assumed that the events are spread out in time; if they happen in quick succession, it would be difficult to make a prediction and subsequently avoid or prevent the failure. A DAG is created from old log files using these events for all the failures to be predicted by the system. Note that a couple of logs per failure are enough to identify the pattern of occurrences and designate certain information as events. This DAG and the associated data structures (described below) are used for failure prediction.

The overall architecture is shown in Fig. 2. During the prediction phase, log files are periodically read from the network device using remote copy, etc. The log file is then passed through a time based sliding window parser. This parser uses the text to event mapping (derived from the old log files) as explained above and outputs (E_i, T_i) tuples. A serializer sorts these tuples in time and passes the events to the DAG based prediction engine to check for failures.

Figure 2. Architecture of the proposed prediction mechanism

Let there be a maximum of M events and N failures in the system. This information is represented by a two dimensional array E consisting of N rows and M columns. Hence, each row vector represents the sequence of events leading to a failure. E[i, j] is set to 1 if event j happens leading to failure i, for i = 1, 2, .., N and j = 1, 2, .., M; otherwise the value is 0. It is assumed for any failure i and k, l ∈ {1, 2, .., M}, if k ≤ l then T_k ≤ T_l, implying E_k happens before E_l. E can be filled in based on the experience of analyzing the same failures from old log files.

The above array E[i, j] of 1s and 0s is used as a general way to represent different kinds of failures. Events due to wrong configuration and one-off runtime events can be set easily. For periodic failures, two events may be required, one when the sequence starts and another when it ends. For events based on variation of certain parameters, such as OSNR, different events may be set when they cross a lower or an upper threshold. Hence, this representation is general enough to handle most types of events leading to corresponding failures. Also, having 1s and 0s allows bitwise operators to be applied, which helps in efficient implementation.

Using this matrix E, a DAG is created, which is explained with an example below. Let us consider the matrix E_{5,8} given in (1).

E_{5,8} = (5 × 8 matrix of 0s and 1s; rows F_1, .., F_5, columns E_1, .., E_8)   (1)

Events are represented along the columns of matrix E_{5,8}, and failures along the rows. Failure F happens in the sequence E → E → E. Other failures happen in the same way. Generally, each of the events E_j and failures F_i, for i = 1, 2, .., N and j = 1, 2, .., M, are the nodes of the DAG, and the sequences of events are linked with edges. For the example matrix in (1), the DAG is shown in Fig. 4. Note that for M events there are 2^M − 1 possible sequences (leaving out the one with all zeros). This DAG can be constructed once while initializing the system and then used during the online failure prediction phase.
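The parser and serializer stage of Fig. 2, described earlier in this section, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the log line format (a numeric time stamp followed by text), the regular expressions, and the names `EVENT_PATTERNS`, `parse_window` and `serialize` are all assumptions made for the example.

```python
import re
from typing import List, Tuple

# Hypothetical text-to-event mapping. In practice it would be derived
# off-line from old log files; these patterns are invented for illustration.
EVENT_PATTERNS = {
    "E1": re.compile(r"clock drift .*exceeds threshold"),
    "E5": re.compile(r"temperature .*above limit"),
    "E6": re.compile(r"no signal from peer"),
}

def parse_window(lines: List[str], t_start: float, t_end: float) -> List[Tuple[str, float]]:
    """Return (event, time) tuples for lines stamped within [t_start, t_end)."""
    tuples = []
    for line in lines:
        stamp, _, text = line.partition(" ")
        try:
            t = float(stamp)
        except ValueError:
            continue  # skip lines without a leading numeric time stamp
        if not (t_start <= t < t_end):
            continue  # outside the current sliding window
        for event, pattern in EVENT_PATTERNS.items():
            if pattern.search(text):
                tuples.append((event, t))
                break  # at most one event per log line in this sketch
    return tuples

def serialize(tuples: List[Tuple[str, float]]) -> List[Tuple[str, float]]:
    """Sort (event, time) tuples by time before they reach the prediction engine."""
    return sorted(tuples, key=lambda et: et[1])

# Assumed log excerpt, deliberately out of time order.
LOG = [
    "30.0 no signal from peer on port 3",
    "10.0 clock drift 12ppm exceeds threshold",
    "15.0 link up",
    "20.0 temperature 81C above limit",
    "99.0 clock drift 15ppm exceeds threshold",
]

events = serialize(parse_window(LOG, 0.0, 60.0))
print(events)
```

Lines that match no pattern, or that fall outside the window, are simply dropped, so only designated events reach the prediction engine.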
The data structure for the DAG is an array of nodes (vertically), and the edges from each node to its neighbours form a list (horizontally), as shown in Fig. 3. The DAG is constructed using information from off-line analysis by the developers. Real time logs cannot be used to build the DAG.

Figure 3. DAG data structure

Figure 4. DAG with 8 events and 5 failures
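As a rough sketch of the array-of-lists structure described above, the DAG can be derived from an event matrix by linking consecutive events of each row and ending at the failure node. The matrix below is hypothetical (the entries of (1) are not reproduced here); only the shape, 5 failures by 8 events, follows the text. The helper names `build_dag`, `start_events` and `is_valid_sequence` are assumptions.

```python
from typing import Dict, List

EVENTS = [f"E{j}" for j in range(1, 9)]    # E1..E8
FAILURES = [f"F{i}" for i in range(1, 6)]  # F1..F5

# Hypothetical 5 x 8 event matrix (rows: failures, columns: events).
# These values are invented for illustration and are NOT the matrix in (1).
E = [
    [1, 0, 0, 0, 1, 1, 0, 0],  # F1: E1 -> E5 -> E6
    [1, 0, 0, 0, 1, 0, 1, 0],  # F2: E1 -> E5 -> E7
    [0, 1, 0, 0, 0, 1, 0, 1],  # F3: E2 -> E6 -> E8
    [0, 0, 1, 0, 0, 0, 1, 1],  # F4: E3 -> E7 -> E8
    [0, 0, 0, 1, 0, 0, 0, 1],  # F5: E4 -> E8
]

def build_dag(matrix: List[List[int]]) -> Dict[str, List[str]]:
    """Adjacency lists: one entry per node, each holding its outgoing edges."""
    dag: Dict[str, List[str]] = {name: [] for name in EVENTS + FAILURES}
    for i, row in enumerate(matrix):
        # Events of the row in column (time) order, ending at the failure leaf.
        path = [EVENTS[j] for j, bit in enumerate(row) if bit] + [FAILURES[i]]
        for src, dst in zip(path, path[1:]):
            if dst not in dag[src]:
                dag[src].append(dst)
    return dag

def start_events(matrix: List[List[int]]) -> List[str]:
    """First event of every row, i.e. the valid start events."""
    starts: List[str] = []
    for row in matrix:
        first = EVENTS[row.index(1)]
        if first not in starts:
            starts.append(first)
    return starts

def is_valid_sequence(dag: Dict[str, List[str]], sequence: List[str]) -> bool:
    """True if every consecutive pair of events is joined by an edge."""
    return all(dst in dag[src] for src, dst in zip(sequence, sequence[1:]))

DAG = build_dag(E)
print(DAG["E5"], start_events(E))
```

Because the structure is built once from static information, checking a transition during prediction is a lookup in one short list.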
Since the only information available is the sequence of events leading to a failure, conventional data driven approaches, such as Bayesian networks [3], cannot be applied. To circumvent this problem and still have a working solution, the probability of failure is defined by how close a node is to a failure. For example, the probability of F is higher when event E happens compared to when E occurs. However, the probabilities of F and F are equal when E occurs, since the distances from both failures are the same (3 hops) by (1).

Since the DAG and the matrix E are static information, they can be used to construct a hop matrix H_{8,5}, in (2), which contains the number of hops to a failure. For example, E has 3 hops to failures F and F. Note that F and F are the only possible failures according to (1) when E occurs; F, F and F are ruled out. Similarly, E has 3 hops for F, event E has 2 hops to failures F, F and F, and 3 to F.

H_{8,5} = (8 × 5 matrix of hop counts; rows E_1, .., E_8, columns F_1, .., F_5)   (2)

The probability of a failure F_i from any node (event E_j) in the DAG is defined in (3). The essential intuition behind this definition is to increase the probability exponentially as the number of hops decreases, so that the failure avoidance/prevention mechanism can be triggered at the required threshold value. Note that this definition holds only for the hop matrix (2) and is not generalized to any number of hops. In future, the function will be generalized to any number of nodes using concepts like network diameter.

\[ P(E_j)_{F_i} = 100 - e^{H_{8,5}(j,i)} \tag{3} \]

A. Failure prediction
The prediction process begins with the valid start events. In Fig. 4, E, E, E and E are the valid start events (marked in red); the rest are not. Once the DAG is constructed off-line, failure prediction starts with its traversal as the events occur in real time, and invalid sequences for which there are no edges between nodes are filtered out (from the total of 2^M − 1 possible sequences). For example, E cannot succeed E. At each hop, all the failures reachable from that node are predicted using (2). This is not very efficient. One important enhancement is to prune the number of failures that the mechanism is trying to predict. For example, considering F and F, if E occurs then the proposed mechanism will predict both failures with equal probability using (1), (2) and (3). It can be observed that the sequences of events for F (E → E → E) and F (E → E → E) are different. For this purpose, it is essential to keep track of the path taken to reach a node in the DAG to predict failures accurately.

To uniquely identify the path, a heuristic is developed. A global event bit-mask, initialized to zero, is maintained, and the bit of each event is set when that event is received (starting with the valid first events in Fig. 4). A bitwise AND operation is performed between the event bit-mask and each row of the matrix in (1). Rows for which the output of the AND operation does not have all the mask bits set to 1 are left out of contention, and the remaining ones are probable failures. These steps are continued until the correct failure is predicted. Note that this enhanced procedure, to uniquely identify the path to a failure, needs to be performed only when there is a valid transition in the DAG. Also, if there is a single transition from the present state in the DAG, this procedure is not necessary.
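The event bit-mask heuristic can be sketched in a few lines. This is a minimal illustration under stated assumptions: the event matrix is hypothetical (not the entries of (1)), each row is packed into an integer with E1 as the most significant bit, and the DAG transition check is omitted. The names `to_bits`, `event_bit` and `predict` are invented for the example.

```python
from typing import Dict, List

M = 8  # number of events

# Hypothetical event matrix, one row per failure; NOT the values of (1).
E: Dict[str, List[int]] = {
    "F1": [1, 0, 0, 0, 1, 1, 0, 0],
    "F2": [1, 0, 0, 0, 1, 0, 1, 0],
    "F3": [0, 1, 0, 0, 0, 1, 0, 1],
    "F4": [0, 0, 1, 0, 0, 0, 1, 1],
    "F5": [0, 0, 0, 1, 0, 0, 0, 1],
}

def to_bits(row: List[int]) -> int:
    """Pack a row of 0s and 1s into an integer, E1 as the most significant bit."""
    value = 0
    for bit in row:
        value = (value << 1) | bit
    return value

ROW_BITS = {name: to_bits(row) for name, row in E.items()}

def event_bit(j: int) -> int:
    """Bit position of event E_j (1-indexed) in the packed representation."""
    return 1 << (M - j)

def predict(event_indices: List[int]) -> List[List[str]]:
    """Return the surviving candidate failures after each received event."""
    mask = 0                      # global event bit-mask, initialized to zero
    candidates = list(ROW_BITS)   # every failure is a candidate at the start
    history = []
    for j in event_indices:
        mask |= event_bit(j)
        # Keep a failure only if its row contains every event seen so far.
        candidates = [f for f in candidates if (ROW_BITS[f] & mask) == mask]
        history.append(list(candidates))
    return history

steps = predict([1, 5, 6])
print(steps)
```

Each received event costs one AND and one comparison per remaining candidate row, which is consistent with an overhead that grows linearly in the number of failures.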
The working of the event bit-mask is explained in detail in the results section (section III). This process of identifying failures has an additional computational overhead of the order of O(N), assuming that the bitwise AND operation of the two vectors takes one time step.

It may appear that a depth first search (DFS) would help in arriving at a failure. However, DFS is more useful for, and is applied in, exploratory search of an entire graph. That is not the case here, because the proposed idea looks for the correct path to a failure with static information already available in the data structures. Also, DFS has higher time complexity compared to the approach taken here.

An important implementation point is that an event can be a valid starting point for one failure and an intermediate one for a different failure. For example, E is the starting point for F and an intermediate one for F. Hence, when E occurs it is important to find out whether it is the beginning of a new sequence of events for F or an intermediate one for F. Concurrent invocation of the prediction engine for multiple failures is therefore necessary. This can be done by making the implementation re-entrant so that it can be invoked by multiple threads concurrently. Each invocation of the prediction engine will try to predict an independent failure.

III. RESULTS AND DISCUSSION
This section discusses the results obtained by applying the network node failure prediction mechanism proposed above. For this purpose, the DAG in Fig. 4 is used to demonstrate the proposal. The model is implemented using the Python numpy library.
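The quantities plotted in this section, hop counts as in (2) and probabilities as in (3), can be reproduced in outline as follows. This sketch uses plain Python rather than numpy, a hypothetical event matrix (not the entries of (1)), and one plausible reading of the hop count: the number of remaining transitions in the failure's own row, with 0 marking a failure that is ruled out. Those choices, and the names `hop_matrix` and `probability`, are assumptions.

```python
import math
from typing import List

# Hypothetical 5 x 8 event matrix (rows F1..F5, columns E1..E8); NOT (1).
E = [
    [1, 0, 0, 0, 1, 1, 0, 0],
    [1, 0, 0, 0, 1, 0, 1, 0],
    [0, 1, 0, 0, 0, 1, 0, 1],
    [0, 0, 1, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 0, 0, 1],
]

def hop_matrix(matrix: List[List[int]]) -> List[List[int]]:
    """hops[j][i]: hops from event j to failure i, 0 if i is ruled out at j."""
    n_fail, n_events = len(matrix), len(matrix[0])
    hops = [[0] * n_fail for _ in range(n_events)]
    for i, row in enumerate(matrix):
        remaining = sum(row)  # the first event of a row is this many hops away
        for j, bit in enumerate(row):
            if bit:
                hops[j][i] = remaining
                remaining -= 1
    return hops

def probability(h: int) -> float:
    """Probability as in (3), 100 - e^h; 0.0 for ruled-out failures (assumed)."""
    return 100.0 - math.exp(h) if h > 0 else 0.0

H = hop_matrix(E)
for j in (0, 4, 5):  # events E1, E5, E6 along the hypothetical row of F1
    print(f"E{j + 1}: hops to F1 = {H[j][0]}, P = {probability(H[j][0]):.2f}")
```

With this definition the probability rises from roughly 80 at three hops to about 97 at one hop, which matches the intent of triggering prevention at a chosen threshold.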
A. DAG based invalid sequence detection
If there are 8 events, then there can be 2^8 − 1 possible sequences, out of which only 5 are considered valid by (1). The DAG helps in filtering out some of the invalid sequences. Let us consider an invalid sequence (E_1 = 1, E_2 = 0, E_3 = 0, E_4 = 0, E_5 = 1, E_6 = 1, E_7 = 0, E_8 = 1). If this sequence of events is propagated through the DAG, it is evident that there is no transition from E to E. Hence, an invalid sequence is declared. Thus, sequences of events which do not have transitions (edges) in the DAG get discarded in this step. However, the invalid sequence (E_1 = 1, E_2 = 0, E_3 = 0, E_4 = 0, E_5 = 1, E_6 = 1, E_7 = 1, E_8 = 1) will predict failures F and F. To circumvent these false-positive cases, the following steps are necessary.

B. Sequence of events leading to failure
Fig. 5 steps through the failure detection for F when the relevant events occur. Along the x-axis, the sequence of events E_i, i = 1, 2, .., 8, in time leading to F is shown. Note that E_i = 1 means the event has occurred and 0 otherwise, and F occurs when the following ordered tuple of events happens: (E_1 = 1, E_2 = 0, E_3 = 0, E_4 = 0, E_5 = 1, E_6 = 1, E_7 = 0, E_8 = 0). The probabilities (3) of the relevant failures are shown along the y-axis. When E_1 occurs, the model predicts F and F using the hop matrix (2), which rules out other failures. Subsequently, the model predicts F, F, F and F when E_5 occurs, and F, F, F when E_6 occurs.

The above results have some redundancy, since the model predicts two additional failures F and F along with the correct one F when E_6 occurs. The next section applies the enhancement with the event bit-mask to optimize the prediction.

Figure 5. Sequence of events E_i leading to failure F (x-axis: events E_i; y-axis: probability of failures F_i)

C. Sequence of events leading to failure with enhancements
Fig. 6 predicts F and F when E_1 occurs and constructs an event bit-mask [1, 0, 0, 0, 0, 0, 0, 0]. Since E_1 triggers a valid transition in the DAG, a bitwise AND operation is performed with the rows of (1), which rules out failures F, F and F. When event E_5 (a valid transition in the DAG) occurs, the mask is updated to [1, 0, 0, 0, 1, 0, 0, 0]. Again, when this mask is applied to F and F in (1), both are retained (F, F, F are already ruled out in the previous step). Finally, event E_6 leads to the updated event bit-mask [1, 0, 0, 0, 1, 1, 0, 0]. This mask is applied to F and F, and the latter is ruled out. Hence, F is the only failure predicted, which is as expected.

IV. CONCLUSION
Failure prediction in optical backbone networks is extremely important to avoid large scale disruption of data traffic. However, such a prediction mechanism is sometimes not built into the network nodes at design time, and the need to have one arises subsequently. This paper presented an implementable non-intrusive failure prediction mechanism for network nodes that makes use of existing log files. The proposed idea constructs a DAG and other associated data structures from key events in the log files resulting in a failure. Numerical results show that the proposed idea is able to predict the failures in a near perfect way.

Future work will focus on extending the model to a higher number of nodes, and on performance analysis after integration with the commercially deployed network.

Figure 6. Sequence of events E_i leading to failure F with enhancements (x-axis: events E_i; y-axis: probability of failures F_i)

ACKNOWLEDGMENT
The authors would like to thank Tejas Networks for sponsoring this research project.

REFERENCES
[1] T. Panayiotou, S. P. Chatzis, and G. Ellinas, “Leveraging statistical machine learning to address failure localization in optical networks,” J. Opt. Commun. Netw., vol. 10, no. 3, pp. 162–173, Mar 2018. [Online]. Available: http://jocn.osa.org/abstract.cfm?URI=jocn-10-3-162
[2] Z. Wang, M. Zhang, D. Wang, C. Song, M. Liu, J. Li, L. Lou, and Z. Liu, “Failure prediction using machine learning and time series in optical network,” Opt. Express
SIGMOBILE Mob. Comput. Commun. Rev., vol. 11, no. 3, pp. 71–83, Jul. 2007. [Online]. Available: https://doi.org/10.1145/1317425.1317434
[5] K. Qader, M. Adda, and M. Al-Kasassbeh, “Comparative analysis of clustering techniques in network traffic faults classification,”