Distributed Fault Detection in Sensor Networks using a Recurrent Neural Network
Oliver Obst
CSIRO ICT Centre, Autonomous Systems Laboratory, Locked Bag 17, North Ryde, NSW 1670, Australia
Abstract
In long-term deployments of sensor networks, monitoring the quality of gathered data is a critical issue. Over the time of deployment, sensors are exposed to harsh conditions, causing some of them to fail or to deliver less accurate data. If such a degradation remains undetected, the usefulness of a sensor network can be greatly reduced. We present an approach that learns spatio-temporal correlations between different sensors, and makes use of the learned model to detect misbehaving sensors by using distributed computation and only local communication between nodes. We introduce SODESN, a distributed recurrent neural network architecture, and a learning method to train SODESN for fault detection in a distributed scenario. Our approach is evaluated using data from different types of sensors and is able to work well even with less-than-perfect link qualities and more than 50% of failed nodes.

Key words:
Echo state networks, recurrent neural networks, anomaly detection, distributed computation, wireless sensor networks
1. Introduction
Wireless sensor networks (WSN) are increasingly being deployed over extended periods of time [3], in particular for environmental monitoring applications. To facilitate long-term deployments in remote areas, nodes are typically powered by solar energy and rechargeable batteries. Consequently, much of the research has focused on energy-aware design of hardware and software as well as on building models of energy supply and demand. The continuing progress in this area has led to longer autonomy of WSN, but also revealed that deploying a sensor network over a long period of time requires automatic monitoring of the quality of gathered data and of the condition of solar panels, sensors and batteries. With information about the performance of these components, maintenance trips to remote monitoring sites can be better planned or possibly avoided, leading to a reduction of management costs. Some of the faults might be easier to detect than others: when some of the expected data is missing, the fault seems obvious to recognize. Even in this simple case, an automatic notification relieves the administrator from continuously monitoring a database. When the network delivers data as expected, there might also be more subtle problems, like mis-calibration or build-up of dust on sensors and solar panels, leading to incorrect sensor readings or shorter duty-cycles and thus less data. To prevent this, sensor networks have to become more user-friendly: existing systems often require users to manually detect and diagnose potential problems. First steps towards higher reliability and user-friendliness are to automatically build a model of the normal system behavior and to use this model to detect anomalies. With the result of this process, it is possible to notify administrators who then can decide on appropriate actions. Consequently, the system can run unobserved with less danger of losing important data.

For this work, we are interested in detecting problems that manifest in changes of sensor readings for some of the nodes of an entire network as a result of a sensor fault. Typically, some of the sensors at different nodes are correlated over space or time. We present an approach that is able to learn spatio-temporal correlations and make use of them for detecting anomalies in a decentralized way, without using global communication during normal operation. Instead, sensor nodes participate in a large, distributed recurrent neural network, where each of the sensor nodes hosts only a few neural units and communicates only with its local neighbor sensor nodes. Our neural network approach is inspired by echo state networks (ESN) [5], a recurrent neural network approach which has been shown to be successful in learning even complex time series. ESN have already been applied to anomaly detection in sensor networks [8], but only in a way that requires one instance of an ESN on each node. This results in an unnecessary consumption of memory resources and processing power. A straightforward distribution of an ESN over the entire sensor network is also not a solution, because it requires all of the nodes to communicate with each other. More often than not, this sort of communication is neither available nor desired in sensor networks.

To address the problem of detecting sensor faults in WSN in a distributed way, we introduce spatially organized distributed echo state networks (SODESN), an architecture that allows for distributing a single recurrent neural network over an entire sensor network even when the WSN imposes a local communication structure on its connectivity matrix (Sect. 3). In Sect. 4, we present a training method for SODESN and an approach to train SODESN for fault detection in WSN.

Email address: [email protected] (Oliver Obst)
SODESN learn a model of normal behavior of sensor nodes based on information from other sensors. The fault detection in turn monitors differences between the model and actual sensor readings in a distributed way. We demonstrate the capabilities of our approach with data from different temperature and radiation sensors (Sect. 5) and discuss our results in Sect. 6. In the following section, we start with a brief overview of related work, and give a short review of the ESN approach, the starting point for our work.

Preprint submitted October 22, 2018 (arXiv, cs.NE)
2. Background
Detecting and diagnosing faults is a challenge that has been addressed in many different areas for different purposes. Logic-based approaches, for instance, can be applied if a complete description of the desired behavior of the system is available (see e.g. [10]). In distributed systems, approaches like in [1] detect faults by using connections between processors to implement a voting-based diagnosis system. WSN are distributed systems where different components, from batteries and sensors to processors, contribute to potentially many different types of faults. There may be problems with the energy supply, with the routing, or other communication problems, resulting in missing data from single nodes, or causing the whole system to deliver no data at all. In long-term deployments, problems like degradation of hardware can result in inaccurate measurements, caused by dust and continued exposure of sensors to the environment. Some of the existing work tackles the problem of automatically detecting node failures with centralized approaches (e.g. [11]), where relevant information is forwarded to a dedicated manager performing the fault detection. Methods to detect faults in a distributed way have been investigated because global communication becomes prohibitive with increasing network sizes. The approach in [2] is an example of such a decentralized approach, where sensor faults are detected based on differences in the readings between neighbors. It uses only local communication between nodes, but assumes that all sensors measure the same variable. Likewise, [9] is able to detect faults with a distributed approach, but here the assumptions are not as strong. Neighbor sensors are not required to measure the same variable, but are assumed to be correlated as long as they are working normally, and uncorrelated as soon as they are faulty.
This fault detection method uses a graph-based approach to isolate faulty nodes in the network, where the correlation between the time series over a time window is used to identify faults.

In our work, we are also interested in detecting sensor faults in a distributed way. Instead of explicitly basing our fault detection on spatial correlations between sensors, we want our system to detect the relevant spatio-temporal correlations on its own. If we are able to distribute a large recurrent neural network over the entire sensor network, each sensor node can estimate its own true values based on information from its neighbors in a training period. Because recurrent neural networks model dynamical systems (i.e. with a memory of past events), correlations can be both temporal as well as spatial. Using the estimated true values, and a threshold on the deviation between estimated value and recent readings, each node can decide if it can be assumed to work correctly.

Recurrent neural networks have only recently become more widely used in practice, because many approaches have been difficult to set up and train. An ESN is a specific type of recurrent neural network which is able to successfully predict complex time series [5]. At the same time, the complexity of training an ESN is much lower than with traditional recurrent neural networks. Like any other neural network, ESN consist of neural units and synaptic connections between these units. A neural network is recurrent if there is at least one cycle in these connections. Units are typically organized in different layers and possess a state (called "activation"). This activation is computed (using a typically non-linear "activation function") based on inputs from incoming connections. Connections between units perform a linear transformation and can be either excitatory (positive connection weights) or inhibitory (negative connection weights).
Traditional approaches to training recurrent neural networks, like backpropagation through time [12], change all of the weights between different units. The lower training complexity of ESN is a result of using a fixed, randomly connected "reservoir" of neural units in the recurrent layer, and only changing connections to output units during training (see Fig. 1). Once the training is finished, connections are no longer changed. Both the output and the next state of the network are determined by the current state of the network and the current input.

Figure 1: Echo State Network.
To make the approach work, however, connections cannot be entirely random, but need to fulfill the so-called echo state condition [4]. For an illustration of this condition [7], consider a time-discrete recursive function $x_{t+1} = F(x_t, u_t)$ that is defined at least on a compact sub-area of the vector-space $x \in \mathbb{R}^n$, with $n$ the number of internal units. The $x_t$ are to be interpreted as internal states and $u_t$ is some external input sequence. Now, assume an infinite input sequence $\bar{u}^\infty = u_0, u_1, \ldots$ and two random initial internal states of the system $x_0$ and $y_0$. To both initial states $x_0$ and $y_0$ the sequences $\bar{x}^\infty = x_0, x_1, \ldots$ and $\bar{y}^\infty = y_0, y_1, \ldots$ can be assigned:

$$x_{t+1} = F(x_t, u_t) \quad (1)$$
$$y_{t+1} = F(y_t, u_t) \quad (2)$$

The system $F(\cdot)$ fulfills the echo state condition if it is independent from the set $u_t$, and if for any $(x_0, y_0)$ and all real values $\epsilon > 0$, there exists a $\delta(\epsilon)$ for which $d(x_t, y_t) \le \epsilon$ for all $t \ge \delta(\epsilon)$, where $d$ is a square Euclidean metric. Two rules are helpful for creating a connectivity matrix $\mathbf{W}$ with this condition:

C1 It is a necessary condition that the spectral radius of $\mathbf{W}$ (its largest absolute eigenvalue) is below one.

C2 It is a sufficient condition that the biggest singular value of $\mathbf{W}$ is smaller than one.

Using one ESN for each sensor node, or one ESN in a central location, would require a combination of high memory resources on each node, an explicit selection of correlated sensors, or global communication. Instead, we describe a new approach where we distribute a recurrent neural network over an entire sensor network, fulfill the above-mentioned echo state condition, and use only communication between neighbor nodes.

With sensor networks and recurrent neural networks, two different kinds of networks play a role in the following. In order to avoid confusion between the two in our description, we use node when we talk of sensor network nodes, whereas we use unit for the components of a neural network. In our notation we use bold capital letters for matrices, bold small letters for vectors or vector-sized functions, and italics for scalars.
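As a rough sketch of rule C1, the following hypothetical helper (not from the paper; all names and parameter values are illustrative) creates a sparse random reservoir matrix and rescales it by its spectral radius so that the largest absolute eigenvalue stays below one:

```python
import numpy as np

def make_reservoir(n_units, density=0.1, spectral_radius=0.9, rng=None):
    """Build a sparse random reservoir matrix W and rescale it so that
    its spectral radius (largest |eigenvalue|) is below one (rule C1).
    Parameter values are illustrative, not taken from the paper."""
    rng = np.random.default_rng(rng)
    W = rng.uniform(-1.0, 1.0, size=(n_units, n_units))
    # sparsify: keep roughly `density` of the entries, zero the rest
    W *= rng.random((n_units, n_units)) < density
    rho = max(abs(np.linalg.eigvals(W)))   # current spectral radius
    if rho > 0:
        W *= spectral_radius / rho         # rescale to the target radius
    return W

W = make_reservoir(50, rng=42)             # 50-unit toy reservoir
```

Note that C1 is only necessary; in practice reservoirs scaled this way are commonly used, while the sufficient condition C2 bounds the largest singular value instead.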
3. Spatially organized distributed echo state networks
To distribute a recurrent neural network over a WSN, connections between units have to be restricted to the spatial neighborhood of sensor nodes in order to avoid unrestricted global communication. We also would like to retain the efficient learning of ESN. Therefore, we create neural units on each sensor node, and follow the original idea of ESN in that all connections between internal units are randomly initialized and fixed. Connecting units only to spatial neighbors on different devices leads to our idea of spatially organized distributed echo state networks (SODESN), where the underlying communication structure of the sensor network prohibits arbitrary synaptic connections between distributed units. More specifically, we allow hidden units to be connected to each other only if they are hosted on the same or on a neighbor network node. Moreover, neural inputs are only connected to units on the same sensor node in order to further reduce communication. Instead of globally connected output units, we use local output units on each sensor node. Output units get their input from the local part of the reservoir and from reservoirs on neighbor nodes.

ESN typically use a sparsely connected reservoir, so that different internal units develop different dynamics. Outputs are then calculated as a linear combination of the (non-linear) internal units. Using only local connections in SODESN almost automatically leads to a sparse connection matrix, albeit with a different distribution of connections. From a global perspective, regarding a SODESN as a single neural network, we also want to make sure the system fulfills the echo state condition mentioned in the previous section.

In a setup with $M$ sensor nodes, each node $m$ hosts $K_m$ input units, $N_m$ hidden units, and $L_m$ output units. The total number of neural units thus is

$$K = \sum_{m=1}^{M} K_m \ \text{inputs}, \qquad N = \sum_{m=1}^{M} N_m \ \text{hidden units}, \qquad L = \sum_{m=1}^{M} L_m \ \text{output units}.$$
Then, from a global perspective, the SODESN model consists of $K$ input units with an activation vector

$$\mathbf{u}(n) = (\underbrace{u_1^1(n), \ldots, u_{K_1}^1(n)}_{\text{node } 1}, \ldots, \underbrace{u_1^M(n), \ldots, u_{K_M}^M(n)}_{\text{node } M})', \quad (3)$$

of $N$ hidden units with an activation vector

$$\mathbf{x}(n) = (x_1^1(n), \ldots, x_{N_1}^1(n), \ldots, x_1^M(n), \ldots, x_{N_M}^M(n))', \quad (4)$$

and of $L$ output units with an activation vector

$$\mathbf{y}(n) = (y_1^1(n), \ldots, y_{L_1}^1(n), \ldots, y_1^M(n), \ldots, y_{L_M}^M(n))'. \quad (5)$$

For the rest of this paper, we assume all neural units to be evenly distributed over all sensor nodes, i.e. each node contains the same number of units.

For theoretical considerations, it is convenient to represent synaptic connection weights between units in several global matrices, which have to be distributed in a practical implementation. Connections between hidden units are represented in an $N \times N$ matrix $\mathbf{W} = (w_{ij})$, connections from input units to hidden units in an $N \times K$ matrix $\mathbf{W}^{in} = (w_{ij}^{in})$, and connections from input and hidden units to output units in an $L \times (K + N)$ matrix $\mathbf{W}^{out} = (w_{ij}^{out})$.

The activation of internal units is computed as

$$\mathbf{x}(n+1) = \mathbf{f}(\mathbf{W}^{in}\mathbf{u}(n+1) + \mathbf{W}\mathbf{x}(n)), \quad (6)$$

where $\mathbf{u}(n+1)$ represents the readings from all sensors, and $\mathbf{f}$ the vector of activation functions $f$ of all internal units. We use $f = \tanh$ as activation function in each internal unit, and linear input and output units. ESN may additionally use connections projecting back from output units into the reservoir, represented by a matrix $\mathbf{W}^{back}$; the activation of internal units then becomes $\mathbf{x}(n+1) = \mathbf{f}(\mathbf{W}^{in}\mathbf{u}(n+1) + \mathbf{W}\mathbf{x}(n) + \mathbf{W}^{back}\mathbf{y}(n))$. For our application, we do not make use of these connections.

In a practical implementation, activation vectors are distributed over multiple sensor nodes. Moreover, there are connections between units on different sensor nodes, which require a specified physical location. We store incoming connections from units hosted on neighbor sensor nodes on the local node. Units with outgoing connections to units on other devices just forward their activations unchanged to the neighbor device. Additional proxy units on the neighbor act as placeholders for remote units and take activations from connected units. From proxy units, there are only local connections to the reservoir or to output units. Proxy units also eliminate the need for all sensor nodes to be synchronized, as long as they all use the same interval to process data (e.g. once every minute or every 15 minutes). After new activations have been computed, their values are forwarded to connected proxy units where they can be used by the neighbor device. Once their values have been used, proxy units are reset to 0. This is to avoid using old values in case of a link failure between two network nodes. In our experiments described in Sect. 5, we used link qualities from 10% to 100%.

Figure 2: Neural units in a sensor node and connections to units on neighbor nodes.

To set up the untrained SODESN, we construct the desired number of units on each sensor node. We create local internal connection matrices $\mathbf{W}_j$ with a specified density, and scale each of them so that the spectral radius is smaller than one.
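The state update of Eq. (6) can be sketched in plain NumPy as follows (toy dimensions; this illustrative sketch treats the network as one global matrix and ignores the distribution across nodes and the proxy-unit forwarding):

```python
import numpy as np

def esn_step(x, u_next, W, W_in):
    """One step of Eq. (6): x(n+1) = tanh(W_in u(n+1) + W x(n)).
    In a SODESN, W would carry the block structure imposed by the
    sensor-network neighborhood; here it is just a dense matrix."""
    return np.tanh(W_in @ u_next + W @ x)

rng = np.random.default_rng(0)
N, K = 8, 2                                # hidden units, inputs (toy sizes)
W = rng.uniform(-0.5, 0.5, (N, N)) * 0.2   # small weights, contractive reservoir
W_in = rng.uniform(-1, 1, (N, K))
x = np.zeros(N)                            # initial internal state
for u in rng.normal(size=(5, K)):          # feed a short input sequence
    x = esn_step(x, u, W, W_in)
```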
In addition, we create sparse random connections between internal units on neighbor devices, represented by connections from proxy units for incoming connections, and by references to sensor nodes and respective proxy units for outgoing connections. Local input connection matrices $\mathbf{W}^{in}_j$ with random weights fully connect input units to all local internal units on the node (with one input unit for each local sensor). For output units, we create local random matrices $\mathbf{W}^{out}_j$ to provide them with input from input units, proxy units and internal units.

The local internal connection matrices are scaled by their largest eigenvalue so that each spectral radius is at most one. For the entire connection matrix composed of all local matrices, this procedure does not in general lead to a spectral radius smaller than one yet, but it leads to similar conditions for the internal units on each sensor node. After all local matrices are created in this way, the resulting global connection matrix is scaled to meet the echo state condition.

Algorithm 1 generates a distributed SODESN, where each sensor node hosts some input units, hidden units and output units. Globally, the sensor network imposes a specific structure on the random reservoir connectivity matrix. Figure 3 illustrates the difference in connectivity between a standard ESN and a SODESN.
Algorithm 1: Initialization, on each node $j$:

1. Generate $K_j$ input units, $N_j$ internal units, and $L_j$ output units.
2. Generate $M_j = \sum_i N_i$ proxy units for all neighbor sensor nodes $i$, as placeholders for the internal units on neighbor nodes.
3. For each neighbor sensor node $i$, create $N_j$ pointers to proxy units on node $i$.
4. Generate a sparse, random matrix $\mathbf{W}_j$ for connections between local internal units.
5. Find $\lambda_j$ as the largest eigenvalue of $\mathbf{W}_j$.
6. Scale $\mathbf{W}_j$ by $1/\max(\lambda_j, 1)$.
7. Choose $x \in \{\ldots\}$, a connection density between neighbor units.
8. Generate random connections from $x \times M_j$ of the local proxy units to local internal units.
9. Generate and initialize an all-zero $L_j \times (M_j + N_j + K_j)$ matrix for connections to local output units from all other local units.
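From the global perspective, the initialization can be sketched as follows (a simplified, centralized rendering of the per-node procedure; the adjacency structure, density values and helper names are illustrative assumptions, not the paper's code):

```python
import numpy as np

def build_sodesn_reservoir(adj, n_local, density=0.2, rho_target=0.9, seed=0):
    """Sketch: per-node reservoir blocks scaled by 1/max(lambda, 1),
    sparse links only between neighboring nodes, then one global
    rescaling to meet the echo state condition."""
    rng = np.random.default_rng(seed)
    M = len(adj)
    N = M * n_local
    W = np.zeros((N, N))
    blk = lambda j: slice(j * n_local, (j + 1) * n_local)
    for j in range(M):
        # local internal connections, scaled to spectral radius <= 1
        Wj = rng.uniform(-1, 1, (n_local, n_local))
        Wj *= rng.random((n_local, n_local)) < density
        lam = max(abs(np.linalg.eigvals(Wj)))
        W[blk(j), blk(j)] = Wj / max(lam, 1.0)
    for j in range(M):
        for i in adj[j]:                  # connections only between neighbors
            Wij = rng.uniform(-1, 1, (n_local, n_local))
            Wij *= rng.random((n_local, n_local)) < density
            W[blk(j), blk(i)] = Wij
    # global rescaling so the composed matrix fulfills the echo state condition
    rho = max(abs(np.linalg.eigvals(W)))
    return W * (rho_target / rho) if rho > 0 else W

# 2x2 grid of sensor nodes with 4-neighborhood (nodes 0 and 3 not adjacent)
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
W = build_sodesn_reservoir(adj, n_local=5)
```

The resulting matrix has the block structure of Figure 3: non-zero blocks only on the diagonal and between neighboring nodes.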
4. A training algorithm for SODESN
After initial setup, the SODESN needs to be trained. We describe an approach to offline training of a SODESN in a supervised fashion, i.e. we need time series of both input and output units as training data. Once the training is finished, no further adaptation is made. For our application of diagnosing problems in sensor readings, we train output units to predict readings of a sensor in a neighbor node. In this case, the training data can be derived from any input time series of "normal" sensor readings.
Figure 3: Reservoir connectivity, with excitatory (blue) and inhibitory (green) connections, of a standard ESN (left) with 400 internal units and a SODESN (right) with 400 internal units distributed over a 5 × … grid of sensor nodes.

4.1. Offline training SODESN

For a first description of the training algorithm, we regard SODESN as one recurrent neural network with specific connectivity; from there, a distribution of the algorithm over all sensor nodes is straightforward. Unfortunately, the standard training approach for ESN (see [4] for a detailed description) cannot be applied, because it assumes that output units can be connected to any of the input or hidden units. In SODESN, we want to connect output units only to local input, internal or proxy units.

Training is executed in two steps, in a similar way to training ESN: as a first step, we sample a matrix $\mathbf{M}$ of internal network states, and a matrix $\mathbf{T}$ of output activations. Samples are taken while feeding a training data time series into input units (when using connections projecting back from output units into the reservoir, a teacher time series also has to be fed to output units). For each time step of the training data, we collect a vector of internal activations and a vector of output activations from our SODESN. The sampled vectors are stored in new rows of $\mathbf{M}$ and $\mathbf{T}$. With $N$ the total number of hidden units, $L$ the number of output units, and $S$ the number of training steps, the final sizes of $\mathbf{M}$ and $\mathbf{T}$ are $S \times N$ and $S \times L$, respectively. The first samples of a training run are typically discarded in order to wash out the initial network state.

As a second step, we compute the output weights $w^{out}_{ij}$ to let the training time series $d(n)$ for each output unit $j$ approximate a linear combination of the internal activations $\mathbf{x}(n)$. "Approximate" means to minimize the mean squared error on the training signal, which, in the case of ESN, can be achieved by multiplying the pseudoinverse of $\mathbf{M}$ with $\mathbf{T}$: $(\mathbf{W}^{out})^t = \mathbf{M}^{+}\mathbf{T}$.
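This pseudoinverse step can be sketched in NumPy as follows (illustrative names, not the paper's code; the same operation is later applied per node with the local matrices M_j and T_j):

```python
import numpy as np

def train_readout(M, T):
    """Readout training sketch: output weights minimizing the mean
    squared error via the Moore-Penrose pseudoinverse,
    (W_out)^t = pinv(M) @ T, so that y = W_out @ x."""
    return (np.linalg.pinv(M) @ T).T

rng = np.random.default_rng(1)
S, N, L = 200, 12, 2               # samples, sampled units, outputs (toy sizes)
M = rng.normal(size=(S, N))        # rows: sampled activation vectors
w_true = rng.normal(size=(N, L))
T = M @ w_true                     # teacher outputs from a known linear map
W_out = train_readout(M, T)        # recovers the linear map exactly here
```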
In SODESN, however, this operation is not possible, because it would create connections from all internal units to all output units. A solution to the problem is to adapt the output weights locally, by using local matrices $\mathbf{M}_j$ and $\mathbf{T}_j$ for each sensor node $j$. $\mathbf{M}_j$ contains only activations of local input, internal and proxy units, while $\mathbf{T}_j$ contains output activations of the local output units (see Algorithm 2). For each node, we compute a local output connection matrix:

$$(\mathbf{W}^{out}_j)^t = \mathbf{M}^{+}_j \mathbf{T}_j. \quad (7)$$

An additional advantage of this operation, at least in theory, is that it can be performed on each sensor node in parallel.

Algorithm 2: Offline training SODESN
Input: $\mathbf{u}(n)$, $\mathbf{d}(n)$, $n = 1 \ldots T$, $T_0 < T$

    Initialize the network state $\mathbf{x}(0) = \mathbf{0}$
    // Sample network states for the training series
    Initialize $\mathbf{M} = \emptyset$, $\mathbf{T} = \emptyset$
    for $n = 1 \ldots T$ do
        $\mathbf{x}(n+1) = \mathbf{f}(\mathbf{W}^{in}\mathbf{u}(n+1) + \mathbf{W}\mathbf{x}(n))$
        // Discard initial states
        if $n \ge T_0$ then
            Add $\mathbf{x}$ as a new row to $\mathbf{M}$
            Add $\tanh^{-1}\mathbf{d}(n)$ as a new row to $\mathbf{T}$
        end
    end
    // Compute sample matrices for each node
    foreach sensor node $j$ do
        Initialize $\mathbf{M}_j = \emptyset$, $\mathbf{T}_j = \emptyset$
        foreach column $\mathbf{x}'$ in $\mathbf{M}$ do
            if $\mathbf{x}'$ are the activations of an internal unit on the same or on a neighbor node then
                Add $\mathbf{x}'$ as a new column to $\mathbf{M}_j$
            end
        end
        foreach column $\mathbf{y}'$ in $\mathbf{T}$ do
            if $\mathbf{y}'$ are the activations of an output unit on the same or on a neighbor node then
                Add $\mathbf{y}'$ as a new column to $\mathbf{T}_j$
            end
        end
        // Compute all output weights for node $j$
        // using the pseudoinverse of $\mathbf{M}_j$
        $(\mathbf{W}^{out}_j)^t = \mathbf{M}^{+}_j \mathbf{T}_j$
    end

In many practical cases, however, the amount of desired training data and the complexity of the operation will exceed the available memory and limited processing power of small sensor nodes. This is not a severe restriction, though, because the training needs to be done only once and can be executed on a remote machine. The result of the training, a set of output weights, then has to be sent back to all nodes and installed in the local connection matrices.

With the supervised training approach described above, we need to provide input as well as output signals for each sensor node. In our application to detect sensor faults, we expect the input signal and output signal for a sensor to be the same when the sensor works normally. To gather training data, the sensor network has to be deployed and collect sensor readings for a period of time. During this period, we assume there are no sensor faults, so that the training output for each sensor is exactly the same as the input time series.

Using only normal data for training results in the learning picking up this correlation. The output weights will be adjusted
so that input and output always match closely. When a sensor is faulty and delivers unexpected values to its input unit, the respective output will be similar to the input rather than an estimate of the true value. In such a case, we cannot distinguish between normal and faulty sensors, so that our prediction is useless.

To fix the approach so that the prediction of the true value of one sensor is independent of its actual value, a possible solution is to not connect this sensor to the neural network during both training and exploitation. The prediction is then solely based on inputs from other sensors. This is, however, only possible if we are interested in monitoring just very few sensors in the network. To monitor all of the sensors, this would require disconnecting all of the sensors from the neural network. With no remaining inputs, we cannot make any predictions, so that this approach is not an option.

A more promising attempt is therefore to make only the training of one output unit independent of the respective input unit. This can be achieved by training one output unit at a time, and disconnecting the input unit we are trying to predict during the training.

Figure 4: Training data and setup of the simulated sensor network. (a) One week of the sensor data used in our experiments (9 May 2006 to 15 May 2006; 96 points = 1 day): dry bulb temperature; soil temperature at 1cm, 5cm, 10cm, 20cm and 50cm; radiation and infrared, both scaled by 1/15. (b) Arrows between sensor nodes indicate the local SODESN information exchange for fault detection.
However, this approach leads to a further problem: the prediction will be based on the assumption that there is no input from the sensor in question. During normal operation, the input signal of the sensor will be added to some of the internal units and lead to a change in the output. In our experiments we found the influence of the incoming signal large enough to make the prediction useless.

Instead of just disconnecting individual input units during training of their respective output units, we make sure there is an actual signal from all of the inputs. For the input of the sensor we are currently training, the signal should be uncorrelated to the true sensor value. This can be achieved by, for example, replacing the input by a white noise signal. The correct signal is used as teacher output, and the goal of the training is to learn the correlation between the true local sensor value and the values of neighbor sensors.

As mentioned above, the training aims to minimize the mean square error on the training signal. In all our experiments, we tested the capability of the SODESN to generalize to new data by computing the normalized root mean square error (NRMSE) of the predictions on an independent test set. The NRMSE of $n$ predictions $p$ of the SODESN against the test data $t$ is defined by

$$\text{NRMSE} = \sqrt{\frac{\sum_{i=1}^{n} (t(i) - p(i))^2}{n \, \text{var}(t)}}, \quad (8)$$

where $\text{var}(t)$ is the variance of the test data.

Using SODESN for fault detection involves making predictions on each sensor node. It also requires setting a threshold for sensor readings to be considered abnormal. Possible methods for defining thresholds can be based on measuring deviations from the predicted value of a sensor (for example, a deviation exceeding the maximum deviation of predictions on the test set), or on the NRMSE between the prediction of the sensor value and its actual reading for a specified time window.

In the previous section, we set up the training so that predictions of a sensor are independent of its current value.
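Eq. (8) and a window-based detection rule can be sketched as follows (the threshold rule is one of the possibilities mentioned above, not a choice prescribed by the paper; names are illustrative):

```python
import numpy as np

def nrmse(t, p):
    """Eq. (8): NRMSE of predictions p against test data t."""
    t, p = np.asarray(t, float), np.asarray(p, float)
    return np.sqrt(np.mean((t - p) ** 2) / np.var(t))

def abnormal(readings, predictions, threshold):
    """Illustrative detection rule: flag a window of readings whose
    NRMSE against the SODESN predictions exceeds a threshold."""
    return nrmse(readings, predictions) > threshold

t = np.array([1.0, 2.0, 3.0, 4.0])
# a perfect prediction gives NRMSE 0; predicting the mean gives NRMSE 1
perfect = nrmse(t, t)
baseline = nrmse(t, np.full(4, t.mean()))
```

A NRMSE of 1 thus corresponds to a predictor no better than the test-set mean, which gives a natural scale for choosing the threshold.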
By using random noise as local input during training, we base the fault detection of each sensor on input from the rest of the network. If sensors fail only rarely, only a few of them will feed faulty values into the SODESN at the same time. If there is a faulty sensor, it will continue to feed incorrect readings into the network until the problem is fixed. This will affect fault detection in the remaining sensors, even more so if more than one sensor is faulty at the same time.

In systems with a high likelihood of simultaneous sensor failures, it might therefore be a good idea to prevent faulty sensors from feeding their readings into the SODESN. For the same reasons we used random noise as input during training in the previous section (as opposed to no input), we expect that simply disconnecting faulty sensors does not improve the predictions: after all, output units in other nodes used a fraction of their input for training. In order to decrease their effect on the system, we do flag and disconnect faulty sensors from the SODESN. Instead of using no input from faulty sensors at all, we replace their input with the predictions of their readings as computed by the SODESN. We expect this to help maintain a high prediction quality for the remaining sensors with a larger number of faults in the system.

Figure 5: Results of various experiments using SODESN and a benchmark using ESN. (a) Learning curves for predicting the 10cm soil temperature, air temperature, and an average learning curve over all sensors. Experiments were run with a 90% WSN connectivity and training set sizes of up to 30,000 data points; the graph shows results for sets of up to 18,000 points. (b) Influence of the number of internal units per node on the learning performance. On average (blue line), an increasing number of internal units decreases the NRMSE only slightly. Experiments were run using a training set size of 30,000 and a 90% success rate in communication between nodes. (c) The benchmark against a centralized approach using one ESN for each prediction shows that the SODESN is able to maintain a high prediction quality even with poor link qualities. Only under ideal conditions can the centralized approach keep up with SODESN.
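The flag-and-substitute strategy described above can be sketched per node as follows (threshold value and function names are illustrative assumptions, not the paper's implementation):

```python
def detect_and_substitute(reading, prediction, threshold):
    """Per-node sketch: flag the sensor when the reading deviates too
    far from the SODESN prediction, and feed the prediction, not the
    faulty reading, back into the network."""
    faulty = abs(reading - prediction) > threshold
    network_input = prediction if faulty else reading
    return faulty, network_input

# healthy sensor: reading stays close to the prediction, used as-is
detect_and_substitute(20.1, 20.0, threshold=0.5)
# stuck-at-zero fault: flagged, reading replaced by the prediction
detect_and_substitute(0.0, 19.8, threshold=0.5)
```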
5. Experiments and results
We evaluated our approach in simulations where we used data from a local weather station with several sensors measuring temperatures, radiation, infrared, etc. (the automatic weather station of the Department of Physical Geography of Macquarie University [6]). The simulated setup consisted of 8 sensor nodes arranged in a 2 by 4 grid, where each node has one of the sensors and can communicate with its nearest neighbors (see Fig. 4(b)). The sensors we used measured the air temperature, soil temperatures at 1cm, 5cm, 10cm, 20cm, and 50cm respectively, radiation, and infrared. The data we used was taken in 15 minute intervals. Figure 4(a) shows data of our sensors for one week. In the graph it is visible that the different time series are at least weakly correlated to each other. In a setup with all sensors measuring the same variable at slightly different locations, correlations would be expected to be even stronger.

Experiment 1 — learning curves.

A number of parameters play a role in training and using SODESN, such as the amount of training data, the number of units on each node, connectivity between units, link qualities between nodes, etc. In a first experiment, we used 15 internal units on each node with approximately 10% connectivity between nodes in total. We used a spectral radius of 0.66 for the connectivity matrix, a link quality of 90% during both training and testing, and an increasing amount of training data to obtain learning curves using an incremental 10-fold cross validation. The training data varied from 300 data points, corresponding to slightly more than 3 days' worth of data, up to 30,000 data points, i.e. data from a period of 10 months. The test data set had a size of 16,665 data points in all cases. For each individual experiment, a new SODESN was generated.
Experiment 2 — reservoir size.
To evaluate if and how much an increasing number of internal units contributes to higher prediction quality, we varied the number of internal units per sensor node from 3 to 39 units, resulting in SODESNs with 24 up to 312 internal units. We used a training data size of 30,000 points for training and 16,665 data points for testing in a 10-fold cross validation. The basic procedure and all other parameters remained unchanged from the first experiment.
Experiment 3 — ESN vs. SODESN.
To compare SODESN against a baseline, we simulated fault detection with global communication, using one (centralized) ESN for each sensor in the network. The ESNs we used had 120 internal units each, equivalent to a SODESN with 15 units on each of our 8 nodes, and we simulated link qualities from 10% to 100% during both training and testing. In the centralized setting, these link qualities represent the quality of the link from sensor to central node (independent of the number of hops). In contrast to using SODESN, in a setup with one ESN for each sensor, input data from only 7 sensors can be used to predict the 8th sensor of our sensor network.
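The centralized baseline is a standard echo state network [5]. A minimal sketch follows; the reservoir size and spectral radius are taken from the experiments above, while the input data, target, and regularization are illustrative stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)

# 7 sensor inputs predicting the 8th sensor; 120 internal units as above.
n_in, n_res = 7, 120
W_in = rng.uniform(-0.5, 0.5, (n_res, n_in))
W = rng.uniform(-0.5, 0.5, (n_res, n_res))
W *= 0.66 / np.max(np.abs(np.linalg.eigvals(W)))   # spectral radius 0.66

def run_reservoir(U):
    """Drive the reservoir with an input sequence U (shape T x n_in)
    and collect the internal states."""
    x = np.zeros(n_res)
    states = np.empty((len(U), n_res))
    for t, u in enumerate(U):
        x = np.tanh(W_in @ u + W @ x)
        states[t] = x
    return states

# Train a linear readout with ridge regression on a toy input sequence.
U = rng.standard_normal((500, n_in))
y = 0.5 * U[:, 0] + 0.3 * U[:, 1]                  # toy "8th sensor" target
X = run_reservoir(U)
W_out = np.linalg.solve(X.T @ X + 1e-6 * np.eye(n_res), X.T @ y)
pred = X @ W_out
print(np.sqrt(np.mean((pred - y) ** 2)))           # training RMSE
```

Only the output weights are learned; the input and reservoir weights stay fixed after random initialization, which is what keeps the computational cost of training low.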
Experiment 4 — robustness.
Sensors in our first experiments deliver time series from different (yet correlated) phenomena, such as temperatures at different depths and radiation. (a) To test SODESN with more closely correlated inputs, we computed different time series based on the air temperature data by randomly shifting the original series up to ±30 minutes in time and adding uniform random noise. (b) We then scaled the experiment up to 100 sensor nodes arranged in a 10 × 10 grid. Using these 100 nodes,
we simulated multiple sensor faults to test the effect on the prediction quality for other sensors. To this end, we randomly selected an increasing number of sensors to fail. Instead of the true value, faulty sensors constantly returned zero and fed this value into the SODESN. (c) Finally, we tested the effect of multiple sensor faults where faulty sensors stopped feeding their values into the neural network. As discussed above, the predictions of their true values as computed by the SODESN were used instead.

[Figure 6: Example of a prediction of the current soil temperature at 5cm (the solid blue line in the top graph; the true value is shown as a dotted red line). The prediction is based on inputs to other sensor nodes (bottom graph, one measurement every 15 mins). The dotted red line in the bottom graph also shows the soil temperature at 5cm for comparison (not used in the prediction). Additional inputs were radiation and infrared (omitted for clarity).]

[Figure 7: Example of a prediction of the current air temperature as a solid blue line; the true value is shown as a dotted red line. As in Fig. 6, the prediction is based on inputs to other sensor nodes.]

Experiment 1. Figure 5(a) gives an impression of the NRMSE we obtained, dependent on the amount of training data used. Results are shown for two of the sensors, and an overall average NRMSE for all 8 sensors.
With an increasing amount of training data, prediction of our SODESN becomes more reliable, after an initial oscillating phase of 3000 data points. Table 1 shows NRMSE and some absolute maximum errors of predictions on test data. In particular for smaller training sets, absolute errors of the more dynamic time series, such as the air temperature, can become quite large for a short period even though prediction and true value are close over longer intervals. In this case, the NRMSE between predicted readings and actual values over a window of time might be a more reliable fault indicator. For less dynamic time series, such as the different soil temperatures, both NRMSE and absolute errors are small and may be used to indicate faults.

Figure 6 shows the result of a continuous prediction of the soil temperature at 5cm depth, while the sensor for this variable fed just random noise into the SODESN during the whole period (slightly more than 10 days). Similarly, the graph in Fig. 7 is a plot of the prediction of the air temperature during the same period, again while replacing the true temperature measurement by random noise in the input to the SODESN.

Table 1: NRMSE and some maximum absolute errors for varying training set sizes (max. abs. errors in brackets, in °C).

  training    air            soil temp.   soil temp.
  set size    temp.          5cm          20cm        radiation
  1500        6.76 (155.4)   0.14 (1.9)   2.64        3.40
  15000       1.11 (26.6)    0.04 (1.0)   0.46        0.97
  30000       0.56 (14.2)    0.04 (0.8)   0.19        0.79

Experiment 2.
On average, an increasing number of internal units in the reservoir of our SODESN did not significantly improve the prediction quality. Figure 5(b) is a plot of the NRMSE for several reservoir sizes from 3 to 39 units per node. It can be seen from both the plot and from Table 2 that the prediction of air temperature seems to benefit from an increased number of units. In other cases, using more internal units does not lead to smaller errors, and in some, the error even increased slightly. The average NRMSE over all sensors for SODESN with 312 internal units is only slightly lower than the average NRMSE for SODESN with only 24 internal units.
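The windowed fault indicator suggested above for the more dynamic series (NRMSE between prediction and reading over a sliding window) could be implemented roughly as follows; the window length (one day at 15-minute sampling) and the demonstration series are illustrative choices:

```python
import numpy as np

def windowed_nrmse(predicted, actual, window=96):
    """NRMSE between predicted and actual readings over a sliding window
    (96 samples = one day at 15-minute sampling). A sustained high score
    flags a fault more reliably than a single large absolute error."""
    predicted, actual = np.asarray(predicted), np.asarray(actual)
    scores = []
    for start in range(len(actual) - window + 1):
        p, a = predicted[start:start + window], actual[start:start + window]
        scores.append(np.sqrt(np.mean((p - a) ** 2)) / (np.std(a) + 1e-12))
    return np.array(scores)

# A well-calibrated prediction keeps the score near 0; a signal that
# disagrees persistently (here: a stuck-at-zero series compared against
# the true readings) pushes the score to 1 and above.
actual = np.sin(np.linspace(0, 4 * np.pi, 400))   # stand-in sensor series
good = windowed_nrmse(actual + 0.01, actual)
stuck = windowed_nrmse(np.zeros(400), actual)
print(good.max(), stuck.min())
```

Thresholding such a windowed score avoids false alarms from the short-lived absolute errors that Table 1 reports for the air temperature.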
Experiment 3.
With a decreasing link quality, the accuracy of the centralized approach using one ESN for each predicted sensor decreases rapidly (see Table 3). In contrast, SODESN maintains the same level of accuracy even with poor link qualities between local nodes. The graph in Fig. 5(c) shows that the ESN can achieve results close to SODESN only under almost perfect conditions. This seems surprising at first, but the difference in performance is a result of the different methods of passing on sensory information: in a centralized approach, loss of data has a much bigger impact on the result because the missing information is not replicated elsewhere. In our distributed approach, data is broadcast to several neighbors (2 or 3 neighbors in our 8 node experiment, up to 4 neighbors in our experiments with 100 nodes). Because in our experiments links between nodes fail independently, information lost as a result of one link failing may still be present in the network and can be used for prediction.

Experiment 4. (a) Using 8 more closely correlated air temperature time series, we achieved an almost constant NRMSE of 0.2 for SODESN independent of the number of units (from 3 units/node up to 39 units/node), and a maximum absolute prediction difference of 6°C. The lowest NRMSE in experiment 2, where we used soil temperatures and radiation data to predict air temperature, was 0.47 (Table 2). The better performance in this scenario was expected. (b) Scaling the experiment up to 100 sensor nodes, the prediction has about the same quality as with only 8 sensors. Then, we begin to successively fail random sensors. A first qualitative (visual) inspection of the predicted time series vs. the true values shows acceptable performance up to more than 60% of failed sensors (see Fig. 8(b) for a sample prediction with 60 failed sensors). More quantitatively, from the graph in Fig. 8(a) we see that failing up to 16 of the sensors does not change the performance of the system at all. In our experiments, the average maximum absolute error for up to 16 failed nodes was below 11°C, and for up to 32 failed nodes, it remained below 16°C. For 60 failed nodes, the NRMSE has grown from 0.26 to about 1.0, with a maximum absolute error of around 19°C. (c) Feeding back the predictions of the true value instead of faulty sensor values results in a greatly improved prediction quality, so that the average error is almost constant for up to 50% of failed nodes. Even for more than 50% failed nodes, the error increases only slowly until around 90%.

Table 2: NRMSE and some maximum absolute errors for different reservoir sizes from 3 to 39 units per node (max. abs. errors in brackets, in °C).

  units/node  air temp.     soil temp.   soil temp.
                            5cm          20cm        radiation
  3           0.67 (18.2)   0.07 (1.2)   0.12        0.74
  15          0.51 (15.4)   0.04 (0.8)   0.21        0.76
  27          0.48 (12.0)   0.06 (1.2)   0.15        0.73
  39          0.47 (11.6)   0.09 (2.1)   0.15        0.73
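The derived air-temperature series used in part (a) of experiment 4 can be generated roughly along these lines. The noise amplitude is not specified above and is an assumed value here, and the sine wave is a stand-in for the real air-temperature data:

```python
import numpy as np

rng = np.random.default_rng(1)

def derived_series(base, max_shift=2, noise_amp=0.5):
    """Create a correlated variant of `base` by shifting it a random
    number of samples (up to +/-2 samples, i.e. +/-30 minutes at 15-minute
    sampling) and adding uniform noise of amplitude `noise_amp`
    (an assumed value; the paper does not state it)."""
    shift = int(rng.integers(-max_shift, max_shift + 1))
    shifted = np.roll(base, shift)
    return shifted + rng.uniform(-noise_amp, noise_amp, size=base.shape)

base = np.sin(np.linspace(0, 8 * np.pi, 1000))       # stand-in air temperature
variants = [derived_series(base) for _ in range(8)]   # one series per node
print(np.corrcoef(base, variants[0])[0, 1])           # strongly correlated
```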
6. Discussion
Our first experiment showed that the amount of training data used strongly influences the prediction quality. Further relevant aspects seem to be how strongly correlated the different sensors are, and the dynamics of the time series. Some of the "easier" sensors in our experiment could be successfully modeled after training on 1500 data points (≈ 15 days of training data), while for "harder", less correlated sensors we needed at least 5000 points (≈ 52 days). Our offline learning approach requires performing a computation on the whole training time series. In particular for larger data sets, this will usually be done on a machine outside the network. The learning then computes sets of output weights for each sensor node. A way to deal with less correlated sensors may therefore be to successively improve the SODESN by re-training on increasingly larger data sets and exchanging the learned weights over time.

Table 3: NRMSE of a centralized approach using one ESN with 120 internal units (E) compared to NRMSE of SODESN (S) under varying WSN link qualities from 10 to 98%.

  link %   air temp.     soil temp. 5cm   soil temp. 20cm
           E      S      E      S         E      S
  10       1.41   0.51   1.72   0.04      1.83   0.19
  50       0.89   0.54   1.00   0.04      1.04   0.23
  90       0.63   0.51   0.50   0.04      0.56   0.22
  98       0.55   0.49   0.27   0.04      0.32   0.17

[Figure 8: (a) Performance of the system with increasing number of failures, with and without replacing faulty sensors with their prediction (all results are averages over 10 runs). (b) Sample prediction for one sensor with 60 out of 100 sensors failed, and continuing to feed their values into the SODESN. (c) Sample prediction for one sensor with 60 out of 100 sensors failed, and replacing their values by predictions.]

A second important factor is the amount of local communication introduced. From the description of our architecture in Sect. 3 it follows that neighbors exchange activations of their local internal units, one value per unit and sample step. Results from our second experiment are therefore interesting, because we have seen that the number of internal units did not play a crucial role; we used only 3 units per node in some experiments. SODESN communication and the sample rate of sensors do not have to run synchronously with each other. Alternatively, it is also possible to collect some data locally, and to run the SODESN on larger blocks of data, as long as all nodes run their part of the SODESN at the same rate (proxy units would have to be changed to queues in this case).

The amount of local computation required is similarly dependent on the number of units. In contrast to the offline training, exploitation requires only a few operations: for each internal unit, a number of additions, multiplications, and the computation of tanh(x).
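One exploitation step on a single node can be sketched as below; the weight values here are illustrative placeholders (in SODESN, input and internal weights are fixed after initialization and only output weights come from training), and the proxy units simply hold the most recently received neighbor activations:

```python
import numpy as np

rng = np.random.default_rng(2)
n_units, n_proxy = 15, 5   # internal units; proxy units holding neighbor activations

# Illustrative weights, standing in for a trained SODESN node.
W_in = rng.uniform(-0.5, 0.5, n_units)                  # one local sensor input
W_local = rng.uniform(-0.5, 0.5, (n_units, n_units)) * 0.1
W_proxy = rng.uniform(-0.5, 0.5, (n_units, n_proxy))
W_out = rng.uniform(-0.5, 0.5, n_units)

def node_step(u_local, x_local, proxy):
    """One exploitation step on a single node: per internal unit, a few
    additions and multiplications plus one tanh(), using only the local
    sensor reading, the local state, and neighbor activations."""
    x_new = np.tanh(W_in * u_local + W_local @ x_local + W_proxy @ proxy)
    return x_new, W_out @ x_new      # new local state, local prediction

x = np.zeros(n_units)
x, y = node_step(21.5, x, rng.standard_normal(n_proxy))
print(y)
```

With 15 units per node, this is on the order of a few hundred multiply-adds per sample step, which is well within the budget of typical WSN node hardware.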
7. Conclusions
In this paper, we presented SODESN, a novel distributed recurrent neural network architecture for creating models of dynamical systems. We introduced an offline learning approach for SODESN that is closely related to training ESN and inherits the low computational complexity of the original approach. We then presented an approach to train SODESN for fault detection in WSN, where predictions of sensor values are made based on information from neighbor nodes.

Our evaluations on real-world data show that our approach can be used to build models of dynamic time series and to help detect sensor faults. We have shown that the approach is robust to WSN link failures through its distributed computation and local communication. SODESN outperforms a comparable centralized approach under realistic link qualities. Using only local communication also contributes to SODESN scaling well with an increasing number of WSN nodes.

We have also shown that our approach is robust against multiple node failures. In our evaluation using the predictions of failed sensors as input, 50% of the sensors failed without affecting prediction quality, and the performance degraded gracefully up to slightly more than 80% failed nodes.

References

[1] Blough, D. M., Sullivan, G. F., & Masson, G. M. (1989). Fault diagnosis for sparsely interconnected multiprocessor systems. In Digest of the Nineteenth International Symposium on Fault-Tolerant Computing (FTCS-19) (pp. 62–69). IEEE Computer Society Press.
[2] Chen, J., Kher, S., & Somani, A. (2006). Distributed fault detection of wireless sensor networks. In DIWANS '06: Proceedings of the 2006 workshop on Dependability issues in wireless ad hoc networks and sensor networks (pp. 65–72). New York, NY, USA: ACM.
[3] Corke, P., Valencia, P., Sikka, P., Wark, T., & Overs, L. (2007). Long-duration solar-powered wireless sensor networks. In EmNets '07: Proceedings of the 4th workshop on Embedded networked sensors (pp. 33–37). New York, NY, USA: ACM.
[4] Jaeger, H. (2002). Tutorial on training recurrent neural networks, covering BPTT, RTRL, EKF and the echo state network approach. GMD Report 159, Fraunhofer Institute AIS.
[5] Jaeger, H. & Haas, H. (2004). Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science, 304(5667), 78–80.
[6] Macquarie University, Sydney (2008). Automatic weather station of the Department of Physical Geography. Web page: http://aws.mq.edu.au/index.html. Last checked: 3 Oct. 2008.
[7] Mayer, N. M. & Browne, M. (2004). Echo state networks and self-prediction. In Biologically Inspired Approaches to Advanced Information Technology, volume 3141 of Lecture Notes in Computer Science (pp. 40–47). Springer.
[8] Obst, O., Wang, X. R., & Prokopenko, M. (2008). Using echo state networks for anomaly detection in underground coal mines. In Proceedings of the International Conference on Information Processing in Sensor Networks (IPSN 2008) (pp. 219–229). Los Alamitos, CA, USA: IEEE Computer Society.
[9] Rajagopal, R., Nguyen, X., Ergen, S. C., & Varaiya, P. (2008). Distributed online simultaneous fault detection for multiple sensors. In Proceedings of the International Conference on Information Processing in Sensor Networks (IPSN 2008) (pp. 133–144). Washington, DC, USA: IEEE Computer Society.
[10] Reiter, R. (1987). A theory of diagnosis from first principles. Artificial Intelligence, 32(1), 57–95.
[11] Ruiz, L. B., Nogueira, J. M., & Loureiro, A. A. F. (2003). MANNA: a management architecture for wireless sensor networks. IEEE Communications Magazine, 41(2), 116–125.
[12] Werbos, P. J. (1990). Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78(10), 1550–1560.