A Data Augmented Bayesian Network for Node Failure Prediction in Optical Networks

Abstract

Failures in the optical network backbone can cause significant interruption of internet data traffic, so it is very important to reduce such network outages. Predicting such failures would be a step towards avoiding disruption of internet services for users as well as operators. Several proposals in the literature apply data science and machine learning techniques to this problem. Most of these techniques rely on a significant amount of real time data collection: network devices are assumed to be equipped to collect data, which are then analysed by different algorithms to predict failures. However, not every network element already deployed in the field has such data gathering or analysis techniques designed into it, and the need for them often arises only after deployment. This paper proposes Bayesian network based failure prediction for network nodes, e.g., routers, using very basic information from the log files of the devices and applying power law based data augmentation to compensate for scarce real time information. Numerical results show that network node failure prediction can be performed with high accuracy using the proposed mechanism.


This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.


Dibakar Das

IIIT Bangalore

Bangalore

Mohammad Fahad Imteyaz

Tejas Networks

Bangalore

Jyotsna Bapat

IIIT Bangalore

Bangalore

Debabrata Das

IIIT Bangalore

Bangalore


Index Terms: Bayesian Networks, data augmentation, optical, failure prediction

I. INTRODUCTION

Today's digital world depends primarily on the internet. The internet backbone network carries the bulk of the data traffic from different users, such as individuals, Internet of Things (IoT) devices, edge devices, computers and the cloud. Backbone networks primarily use optical communications due to their high bandwidth and low bit error rates. These networks comprise a huge number of nodes, e.g., routers, which carry data from one part of the world to the other. A failure in any of these nodes can lead to major disruption in internet services, leading to losses in business and other activities. Hence, for reliable internet services it is essential to prevent failures proactively in backbone networks using intelligent mechanisms.

There are several approaches for failure prediction in optical networks. A Gaussian classifier based approach to detect single link failures has been proposed in [1]. The authors apply heuristics to shortlist the probable failed links and then apply the Gaussian classifier to identify the failed link. [2] proposes a support vector machine along with double exponential smoothing to predict optical network equipment failure. [3] describes a method for prediction of link quality estimates in wireless sensor networks using online and offline supervised learning. A comparison of three data mining approaches, K-Means, Fuzzy C-Means, and Expectation Maximization, to detect abnormal behaviour in networks is presented in [4]. Using Bayesian networks, [5] derives a mechanism to predict failures in cellular networks.

Most of the proposals mentioned above are data intensive. They rely on collecting real time data from various monitors in the network and then analyze the data to predict failures. For deployed systems in the field, such a prediction mechanism may not be built into the initial design. However, a need for failure prediction arises subsequently. In such a scenario, non-availability of relevant data is a major hindrance. Changes to the deployed system, like introducing new probes to collect data, are highly risky. Hence, applying conventional data intensive techniques is not possible. Non-intrusive failure prediction techniques have to be developed with very little information available (quantitative or qualitative) and without disturbing the deployed network. This paper proposes such a technique using Bayesian Networks (BN), as explained below.

In [6], we described an architecture for non-intrusive fault prediction in network nodes. It applies an ad-hoc node failure prediction mechanism as an initial solution. This paper extends and generalizes the network node failure prediction mechanism in [6] by applying the formal approach of a data augmented BN.

Network nodes are equipped with log files which are used by the developers to debug problems. Observing the logs of past failures, patterns emerge in the sequence of events leading to a failure. These events can be represented as nodes in a Directed Acyclic Graph (DAG). This DAG can be used as a BN based failure prediction mechanism. Bayesian networks need the conditional probabilities of a node (event) given its parents in the DAG for prediction. As already mentioned above, statistics on events and failures are not readily available in deployed network nodes. However, qualitative information on how frequently or infrequently a failure occurs can be acquired from the developers. Using this information, data augmentation is applied to generate the conditional probabilities, assuming a power law distribution for failure occurrences. The BN uses these probabilities and predicts failures as events occur in real time. Numerical results show that, even with the scarce data available from logs retrieved from the deployed network nodes, fairly accurate failure prediction is possible.

The objectives behind this BN based approach are as follows.
• Construct a quick solution to meet time to market requirements
• Construct a non-intrusive prediction mechanism devoid of any changes in the deployed network
• Effectively use information from the logs and qualitative information on frequency of occurrence of failures from the developers
• Failure prediction mechanism should evolve over time

The remainder of this paper is organized as follows. Section II describes the proposed idea and the system model. The results obtained by applying the proposed idea are presented in Section III. Section IV concludes this work with some future directions.

This research is funded by Tejas Networks, Bangalore, India.

II. SYSTEM MODEL

As already mentioned, statistical information about the occurrence of events and failures at network nodes is not readily known, since the deployed systems are not equipped by initial design with the mechanisms needed to collect such data. Mining all the historical logs to extract this statistical information can be extremely time consuming and may not meet time to market requirements. The only information extracted from the logs, with the help of the developers, is the sequence of events leading to failures. Also, qualitative information on which errors occur more frequently than others is known from the experience of the developers.

An example log file is shown in Fig. 1. The first column contains the time at which the corresponding text (second column) is logged, along with associated values of system parameters, e.g., clock drift, Optical Signal to Noise Ratio (OSNR), etc. Based on analysis by the developers, some of the texts can be designated as events, shown in the third column of Fig. 1. There can be several events, such as clock drift exceeding a certain threshold, temperature rising above a certain value, OSNR crossing its lower threshold, or a node not receiving a signal from its peer.

Once the failures and their corresponding events are identified from the logs, they are presented in the form of a matrix, as shown in (1) for 5 failures. Each row in the matrix represents a sequence of events leading to a failure. A value of 1 means that the corresponding event has to happen for that particular failure. For example, event $E_2$ has to happen for failures $F_1$, $F_2$ and $F_3$, but not for $F_4$ and $F_5$. Subsequently, a DAG comprising all the events can be constructed (Fig. 2), which forms the BN. For example, events $E_1 \to E_2 \to E_4 \to E_5$ have to occur in sequence for failure $F_1$. Note that $E_1$ and $E_2$ (marked in red) are the valid start states of event sequences leading to failures. By (1), $F_1$, $F_2$, $F_4$ and $F_5$ start with $E_1$, and $F_3$ starts with $E_2$.

Figure 1. Example log file with events

Figure 2. DAG constructed using (1)

$$
\begin{array}{c|ccccc}
 & E_1 & E_2 & E_3 & E_4 & E_5 \\
\hline
F_1 & 1 & 1 & 0 & 1 & 1 \\
F_2 & 1 & 1 & 1 & 0 & 1 \\
F_3 & 0 & 1 & 0 & 1 & 1 \\
F_4 & 1 & 0 & 1 & 1 & 0 \\
F_5 & 1 & 0 & 1 & 0 & 1
\end{array} \tag{1}
$$

A. Generation of statistics for events and failures

To apply a BN for failure prediction, statistics on the occurrence of events and their failures are necessary to calculate the conditional probabilities. However, as already mentioned, such statistics are not readily available. For this purpose, two available pieces of information are used. Firstly, events are extracted from old logs as explained above. Secondly, developers can provide information on which failures occur more frequently than others. Based on this information, a probability distribution can be assumed to artificially create statistics of the events and their corresponding failures. Since there is a non-zero chance of any failure, a scale free probability distribution can be assumed. For this purpose, a power law probability distribution is assumed in (2) for the occurrence of $N$ failures, though other scale free functions can also be considered in future.

$$p(x) = a x^{-k} \tag{2}$$

where $a$ is a constant, $k$ is the exponent, and failure $F_x$ occurs with probability $p(x)$, $x = 1, 2, \ldots, N$. Also, it is assumed that $p(i) > p(j)$ for $i < j$ and $i, j \in \{1, 2, \ldots, N\}$. The value of $a$ can be adjusted so that

$$\sum_{x=1}^{N} p(x) \approx 1 \tag{3}$$

Based on the probability distribution function, the statistics of each of the failures can be calculated using (4).

$$C_{F_x} = \lfloor S \times p(x) \rfloor \tag{4}$$

where $C_{F_x}$ is the number of occurrences of $F_x$ in the population of $S$ failures.

Once the number of occurrences of each failure is estimated with (4), the statistics of the events can also be found using (1). Using all these augmented data and information, a BN can be constructed with all the conditional probabilities.

B. Application of Bayesian Networks

The application of the BN is explained with the following scenario. Let us evaluate the probabilities of occurrence of $E_4$ and $E_5$ given $E_1 = 1$ (Fig. 2), as shown in (5). Note that there are 4 possible combinations of $E_4$ and $E_5$.

$$Pr(E_4, E_5 \mid E_1 = 1) = \frac{Pr(E_4, E_5, E_1 = 1)}{Pr(E_1 = 1)} \tag{5}$$

$E_4$ and $E_5$ depend on other predecessors (Fig. 2), so the above equation has to be expanded as follows.

$$\Rightarrow Pr(E_4, E_5 \mid E_1 = 1) = \frac{1}{Pr(E_1 = 1)} \times \sum_{E_3} \sum_{E_2} Pr(E_1 = 1, E_2, E_3, E_4, E_5) \tag{6}$$

$$\Rightarrow Pr(E_4, E_5 \mid E_1 = 1) = \frac{1}{Pr(E_1 = 1)} \times \sum_{E_3} \sum_{E_2} Pr(E_5 \mid E_3, E_4) \times Pr(E_4 \mid E_2, E_3) \times Pr(E_3 \mid E_1 = 1, E_2) \times Pr(E_2 \mid E_1 = 1) \times Pr(E_1 = 1) \tag{7}$$

$$\Rightarrow Pr(E_4, E_5 \mid E_1 = 1) = \frac{1}{Pr(E_1 = 1)} \times \Big\{ Pr(E_1 = 1) \sum_{E_3} Pr(E_5 \mid E_3, E_4) \times \Big( \sum_{E_2} Pr(E_4 \mid E_2, E_3) \times Pr(E_3 \mid E_1 = 1, E_2) \times Pr(E_2 \mid E_1 = 1) \Big) \Big\} \tag{8}$$

The factor $Pr(E_1 = 1)$ inside the braces cancels the normalising denominator. Thus, $Pr(E_4, E_5 \mid E_1 = 1)$ is expressed in terms of the conditional probabilities of its predecessors in Fig. 2. These derivations have to be repeated for each combination of events. Evidently, when the BN is large, these derivations can be extremely tedious and cumbersome.

C. Failure Prediction

Once the BN is constructed as explained above, failure prediction is performed by traversing the BN based on the events happening in real time, extracted from network node logs. A remote machine running the proposed BN failure prediction model transfers the real time logs from the network nodes (using remote copy, etc.) and parses the logs for events, using the architecture proposed in [6].

III. RESULTS AND DISCUSSION

This section presents the results obtained using the system model in Section II. The model is implemented in Python using the pgmpy library [7]. The first step is the generation of statistics of occurrence of failures. Using the generated statistics, the failures are predicted using the BN. Manually calculating the conditional probabilities of an event given all its predecessors in the BN, for equations such as (8), can be extremely cumbersome, tedious and error-prone when the network is large (as it is expected to be in future). Hence, a tool such as pgmpy is extremely beneficial for calculating the probabilities reliably.
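To make concrete what the tool automates, the nested sums of (8) can be evaluated by brute force for a network this small. The following is a minimal plain-Python sketch, not the pgmpy implementation used for the results. It assumes that the query in (8) is $Pr(E_4, E_5 \mid E_1 = 1)$, that the parent sets are $E_1 \to E_2$, $(E_1, E_2) \to E_3$, $(E_2, E_3) \to E_4$ and $(E_3, E_4) \to E_5$, and it copies the probabilities from Tables II to V in Section III-B. The names `pr` and `query_e4_e5_given_e1` are illustrative only.

```python
from itertools import product

# Pr(child = 1 | parent values), copied from Tables II-V.
# ASSUMPTION: the parent sets are inferred from the layout of those tables.
P_E2 = {(1,): 0.924128, (0,): 1.0}                                # parent: E1
P_E3 = {(0, 0): 0.0, (0, 1): 0.0, (1, 0): 1.0, (1, 1): 0.2}       # parents: E1, E2
P_E4 = {(0, 0): 0.0, (0, 1): 0.607843, (1, 0): 1.0, (1, 1): 0.0}  # parents: E2, E3
P_E5 = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 0.0}       # parents: E3, E4


def pr(table, value, *parents):
    """Pr(child = value | parents) from a table storing Pr(child = 1 | parents)."""
    p_one = table[parents]
    return p_one if value == 1 else 1.0 - p_one


def query_e4_e5_given_e1(e1=1):
    """Evaluate the nested sums of (8) for all four (E4, E5) combinations."""
    result = {}
    for e4, e5 in product((0, 1), repeat=2):
        total = 0.0
        for e3 in (0, 1):  # outer sum over E3
            inner = sum(   # inner sum over E2
                pr(P_E4, e4, e2, e3) * pr(P_E3, e3, e1, e2) * pr(P_E2, e2, e1)
                for e2 in (0, 1)
            )
            total += pr(P_E5, e5, e3, e4) * inner
        result[(e4, e5)] = total
    return result


posterior = query_e4_e5_given_e1()
for combo, prob in sorted(posterior.items()):
    print(combo, round(prob, 6))
```

Under these assumptions the four values sum to 1 and the most probable combination is $E_4 = 1$, $E_5 = 1$. Every additional node adds another nested sum, which is why delegating the computation to a library is attractive for larger networks.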

A. Generation of failure statistics

For generation of the population of failures, the power law distribution in (2) is used with $k = 2$, $N = 5$ and $a$ set so that (3) is satisfied. The number of each failure $F_i$, $i = 1, 2, 3, 4, 5$, is shown in Fig. 3 and the probability distribution of failures is shown in Fig. 4, under the assumption that the occurrence frequencies satisfy $F_1 > F_2 > F_3 > F_4 > F_5$, as available from the qualitative information provided by the developers. Ten thousand samples of failures are generated. The probabilities of the events are shown in Fig. 5.

B. Application of BN
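As a sketch of where the conditional probabilities used in this subsection come from, the code below generates the failure counts of (4) for $k = 2$, $N = 5$ and $S = 10000$, and then estimates one conditional probability from those counts. Several things here are assumptions rather than the authors' procedure: the constant $a$ is obtained by exact normalisation of (3), the rows of (1) are taken from the PREDICTION listings later in this section, and the function names are hypothetical. The resulting numbers need not coincide digit for digit with the tables.

```python
import math

# Rows of (1): the 0/1 pattern of events E1..E5 that leads to each failure.
# ASSUMPTION: read off the PREDICTION listings in Section III-B.
EVENT_MATRIX = {
    "F1": (1, 1, 0, 1, 1),
    "F2": (1, 1, 1, 0, 1),
    "F3": (0, 1, 0, 1, 1),
    "F4": (1, 0, 1, 1, 0),
    "F5": (1, 0, 1, 0, 1),
}


def power_law_counts(n_failures, k, population):
    """Return (a, counts): p(x) = a * x**-k normalised as in (3), counts as in (4)."""
    a = 1.0 / sum(x ** -k for x in range(1, n_failures + 1))
    counts = [math.floor(population * a * x ** -k) for x in range(1, n_failures + 1)]
    return a, counts


def conditional(child, parent_values, counts_by_failure):
    """Estimate Pr(E_child = 1 | parents) from the augmented counts.

    parent_values maps event index -> required value, e.g. {1: 1, 2: 1}.
    Returns None when no failure in (1) has that parent combination.
    """
    matching = [
        (row, counts_by_failure[name])
        for name, row in EVENT_MATRIX.items()
        if all(row[idx - 1] == val for idx, val in parent_values.items())
    ]
    total = sum(count for _, count in matching)
    if total == 0:
        return None
    return sum(count for row, count in matching if row[child - 1] == 1) / total


a, counts = power_law_counts(n_failures=5, k=2, population=10_000)
counts_by_failure = dict(zip(EVENT_MATRIX, counts))
print(round(a, 4), counts_by_failure)
print(conditional(3, {1: 1, 2: 1}, counts_by_failure))  # Pr(E3 = 1 | E1 = 1, E2 = 1)
```

A parent combination that appears in no row of (1), such as $E_1 = 0$ and $E_2 = 0$, returns `None` in this sketch; these are the cases for which a filler row has to be supplied to the tool.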

Using the augmented data described above in Section III-A, the conditional probabilities necessary for prediction of failures applying the BN are calculated. The probabilities of $E_1$ are shown in Table I. Similarly, the probabilities of $E_2$ given whether its parent $E_1$ (Fig. 2) occurred or not are provided in Table II. Likewise, the same for $E_3$ is provided in Table III. Note that the probabilities of $E_3$ given $E_1 = 0$ and $E_2 = 0$ do not correspond to valid failures in the current set (1). However, pgmpy needs all the combinations of probabilities of nodes given their parents to be made available, and each row in the tables should add up to 1. This does not adversely affect the performance of the prediction model, as the results show subsequently. The probabilities of $E_4$ and $E_5$ given their respective parents are shown in Tables IV and V. These probabilities are then fed into the BN for failure prediction.

Non-occurrence of an event, i.e., $E_i = 0$, is hard to provide as evidence to the BN. Hence, occurrence of an event is always set as evidence to predict the failures. The tool takes all the probabilities provided in Tables I to V and outputs the prediction after evaluating equations such as (8).

Table I. PROBABILITIES OF $E_1$

| Pr($E_1 = 1$) | Pr($E_1 = 0$) |
|---|---|
| 0.924128 | 0.075871 |

Table II. PROBABILITIES OF $E_2$

| Condition on | Pr($E_2 = 1$) | Pr($E_2 = 0$) | Comments |
|---|---|---|---|
| $E_1 = 1$ | 0.924128 | 0.075871 | |
| $E_1 = 0$ | 1 | 0 | Tool needs all the combinations |

Table III. PROBABILITIES OF $E_3$

| Condition on | Pr($E_3 = 1$) | Pr($E_3 = 0$) | Comments |
|---|---|---|---|
| $E_1 = 0$ and $E_2 = 0$ | 0 | 1 | Tool needs all the combinations to add up to 1 |
| $E_1 = 0$ and $E_2 = 1$ | 0 | 1 | |
| $E_1 = 1$ and $E_2 = 0$ | 1 | 0 | |
| $E_1 = 1$ and $E_2 = 1$ | 0.2 | 0.8 | |

Figure 3. Number of each failure $F_i$ in the power law distributed population

Figure 4. Probability of each failure $F_i$ in the power law distributed population

Figure 5. Probability of each event $E_i$ in the power law distributed population

When the BN is queried for the subsequent events with the evidence that $E_1$ has already occurred, its output predicts $F_1$, defined in (1), as shown below. The PREDICTION is the concatenation of the evidence and the OUTPUT.

FUNCTION CALL:
infer.map_query(['E2', 'E3', 'E4', 'E5'], evidence={'E1': '1'})
OUTPUT:
{'E2': '1', 'E3': '0', 'E4': '1', 'E5': '1'}
PREDICTION:
{'E1': '1', 'E2': '1', 'E3': '0', 'E4': '1', 'E5': '1'} --> Failure F1

Afterwards, when events $E_1$ and $E_2$ occur and are presented as evidence to the BN, it continues to predict failure $F_1$.

FUNCTION CALL:
infer.map_query(['E3', 'E4', 'E5'], evidence={'E1': '1', 'E2': '1'})
OUTPUT:
{'E3': '0', 'E4': '1', 'E5': '1'}
PREDICTION:
{'E1': '1', 'E2': '1', 'E3': '0', 'E4': '1', 'E5': '1'} --> Failure F1

Table IV. PROBABILITIES OF $E_4$

| Condition on | Pr($E_4 = 1$) | Pr($E_4 = 0$) | Comments |
|---|---|---|---|
| $E_2 = 0$ and $E_3 = 0$ | 0 | 1 | Tool needs all the combinations to add up to 1 |
| $E_2 = 0$ and $E_3 = 1$ | 0.607843 | 0.392156 | |
| $E_2 = 1$ and $E_3 = 0$ | 1 | 0 | |
| $E_2 = 1$ and $E_3 = 1$ | 0 | 1 | |

Table V. PROBABILITIES OF $E_5$

| Condition on | Pr($E_5 = 1$) | Pr($E_5 = 0$) | Comments |
|---|---|---|---|
| $E_3 = 0$ and $E_4 = 0$ | 0 | 1 | Tool needs all the combinations to add up to 1 |
| $E_3 = 0$ and $E_4 = 1$ | 1 | 0 | |
| $E_3 = 1$ and $E_4 = 0$ | 1 | 0 | |
| $E_3 = 1$ and $E_4 = 1$ | 0 | 1 | |

With evidence $E_1$, $E_2$ and $E_3$, the BN changes its prediction from $F_1$ to $F_2$ as defined in (1).

FUNCTION CALL:
infer.map_query(['E4', 'E5'], evidence={'E1': '1', 'E2': '1', 'E3': '1'})
OUTPUT:
{'E4': '0', 'E5': '1'}
PREDICTION:
{'E1': '1', 'E2': '1', 'E3': '1', 'E4': '0', 'E5': '1'} --> Failure F2
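The way the prediction shifts as evidence accumulates can be imitated without the library by enumerating all assignments of the unobserved events and keeping the most probable one. This is a plain-Python stand-in for the `map_query` call, not the tool's algorithm. It assumes the parent sets $E_1 \to E_2$, $(E_1, E_2) \to E_3$, $(E_2, E_3) \to E_4$, $(E_3, E_4) \to E_5$ and the values of Tables I to V, it always queries every unobserved event jointly, and it reports zero-probability evidence as `None`, which is a choice of this sketch. It is only meant to mirror the three fully specified queries so far and need not agree with the tool on partial queries.

```python
from itertools import product

EVENTS = ("E1", "E2", "E3", "E4", "E5")
# ASSUMPTION: DAG structure inferred from the conditioning columns of Tables II-V.
PARENTS = {
    "E1": (),
    "E2": ("E1",),
    "E3": ("E1", "E2"),
    "E4": ("E2", "E3"),
    "E5": ("E3", "E4"),
}
# Pr(event = 1 | parent values), copied from Tables I-V.
P_ONE = {
    "E1": {(): 0.924128},
    "E2": {(1,): 0.924128, (0,): 1.0},
    "E3": {(0, 0): 0.0, (0, 1): 0.0, (1, 0): 1.0, (1, 1): 0.2},
    "E4": {(0, 0): 0.0, (0, 1): 0.607843, (1, 0): 1.0, (1, 1): 0.0},
    "E5": {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 0.0},
}


def joint(assignment):
    """Joint probability of a full 0/1 assignment via the chain rule over the DAG."""
    prob = 1.0
    for event in EVENTS:
        p_one = P_ONE[event][tuple(assignment[p] for p in PARENTS[event])]
        prob *= p_one if assignment[event] == 1 else 1.0 - p_one
    return prob


def map_by_enumeration(evidence):
    """Most probable values of all unobserved events, or None if evidence has probability 0."""
    free = [e for e in EVENTS if e not in evidence]
    best, best_prob = None, 0.0
    for values in product((0, 1), repeat=len(free)):
        candidate = dict(zip(free, values))
        prob = joint({**evidence, **candidate})
        if prob > best_prob:  # strict comparison: zero-probability candidates never win
            best, best_prob = candidate, prob
    return best


for evidence in ({"E1": 1}, {"E1": 1, "E2": 1}, {"E1": 1, "E2": 1, "E3": 1}):
    print(evidence, "->", map_by_enumeration(evidence))
```

Enumeration visits $2^n$ assignments for $n$ unobserved events, so it is only practical for small networks such as this one.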

However, if the occurrence of $E_1$, $E_2$, $E_3$ and $E_4$ is provided as evidence, then it correctly detects an invalid event, since the output does not match any row in (1).

FUNCTION CALL:
infer.map_query(['E5'], evidence={'E1': '1', 'E2': '1', 'E3': '1', 'E4': '1'})
OUTPUT:
{'E5': '0'}
PREDICTION:
{'E1': '1', 'E2': '1', 'E3': '1', 'E4': '1', 'E5': '0'} --> invalid event
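The final step of each listing, turning a PREDICTION into a failure label or an invalid event, amounts to a lookup against the rows of (1). A minimal sketch follows. The rows of (1) are taken from the PREDICTION listings in this section, the helper name `classify` is hypothetical, and treating an event that appears in neither the evidence nor the output as not having occurred is an assumption of this sketch.

```python
EVENTS = ("E1", "E2", "E3", "E4", "E5")

# Rows of (1). ASSUMPTION: read off the PREDICTION listings in this section.
EVENT_MATRIX = {
    "F1": (1, 1, 0, 1, 1),
    "F2": (1, 1, 1, 0, 1),
    "F3": (0, 1, 0, 1, 1),
    "F4": (1, 0, 1, 1, 0),
    "F5": (1, 0, 1, 0, 1),
}


def classify(evidence, map_output):
    """Concatenate evidence and MAP output, then look the pattern up in (1)."""
    assignment = {**evidence, **map_output}
    # Events mentioned nowhere are treated as 0 (not occurred).
    pattern = tuple(int(assignment.get(event, 0)) for event in EVENTS)
    for failure, row in EVENT_MATRIX.items():
        if row == pattern:
            return failure
    return "invalid event"


print(classify({"E1": "1"}, {"E2": "1", "E3": "0", "E4": "1", "E5": "1"}))
print(classify({"E1": "1", "E2": "1", "E3": "1", "E4": "1"}, {"E5": "0"}))
```

Keeping the lookup separate from the inference step means new failure rows can be added to the matrix without touching the query code.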

If the occurrence of events $E_2$ and $E_4$ is provided as evidence, then the BN predicts $F_3$ as expected.

FUNCTION CALL:
infer.map_query(['E5'], evidence={'E2': '1', 'E4': '1'})
OUTPUT:
{'E5': '1'}
PREDICTION:
{'E1': '0', 'E2': '1', 'E3': '0', 'E4': '1', 'E5': '1'} --> Failure F3

If events $E_1$ and $E_3$ are provided as evidence, then $F_4$ is predicted due to its higher probability (Fig. 4).

FUNCTION CALL:
infer.map_query(['E4', 'E5'], evidence={'E1': '1', 'E3': '1'})
OUTPUT:
{'E4': '1', 'E5': '0'}
PREDICTION:
{'E1': '1', 'E2': '0', 'E3': '1', 'E4': '1', 'E5': '0'} --> Failure F4

To predict $F_5$, the query has to include the evidence that $E_5$ has occurred, since it is the only difference between $F_4$ and $F_5$; that is, $E_1$, $E_3$ and $E_5$ have to occur. Doing so, the BN predicts $F_5$ correctly.

FUNCTION CALL:
infer.map_query(['E4'], evidence={'E1': '1', 'E3': '1', 'E5': '1'})
OUTPUT:
{'E4': '0'}
PREDICTION:
{'E1': '1', 'E2': '0', 'E3': '1', 'E4': '0', 'E5': '1'} --> Failure F5

IV. CONCLUSION AND FUTURE WORK

Failures in backbone optical networks can lead to major disruption in internet traffic. Hence, prediction of such failures can help avoid such problems. This paper proposed a data augmented BN to predict failures of network nodes using some information from logs and (qualitative) inputs from developers on the frequency of occurrence of failures. The conditional probabilities of the BN are calculated after generating a failure population by applying a power law distribution to the failures based on their frequency of occurrence. Results show that the proposed node failure prediction mechanism is able to perform with high accuracy.

Future work will extend the model to more nodes in the BN and integrate it into the deployed network.

ACKNOWLEDGMENT

This research project is funded by Tejas Networks, Bangalore, India.

REFERENCES

[1] T. Panayiotou, S. P. Chatzis, and G. Ellinas, “Leveraging statistical machine learning to address failure localization in optical networks,” J. Opt. Commun. Netw., vol. 10, no. 3, pp. 162–173, Mar 2018. [Online]. Available: http://jocn.osa.org/abstract.cfm?URI=jocn-10-3-162

[2] Z. Wang, M. Zhang, D. Wang, C. Song, M. Liu, J. Li, L. Lou, and Z. Liu, “Failure prediction using machine learning and time series in optical network,” Opt. Express

SIGMOBILE Mob. Comput. Commun. Rev., vol. 11, no. 3, pp. 71–83, Jul. 2007. [Online]. Available: https://doi.org/10.1145/1317425.1317434

[4] K. Qader, M. Adda, and M. Al-Kasassbeh, “Comparative analysis of clustering techniques in network traffic faults classification,”