An Explainable Artificial Intelligence Approach for Unsupervised Fault Detection and Diagnosis in Rotating Machinery
Lucas C. Brito a,∗, Gian Antonio Susto b, Jorge N. Brito c, Marcus A. V. Duarte a

a School of Mechanical Engineering, Federal University of Uberlândia, Av. João N. Ávila, 2121, Uberlândia, Brazil
b Department of Information Engineering, University of Padova, Via Gradenigo 6/B, 35131, Padova, Italy
c Department of Mechanical Engineering, Federal University of São João del Rei, P. Orlando, 170, São João del Rei, Brazil
Abstract
The monitoring of rotating machinery is an essential task in today's production processes. Currently, several machine learning and deep learning-based modules have achieved excellent results in fault detection and diagnosis. Nevertheless, to further increase user adoption and diffusion of such technologies, users and human experts must be provided with explanations and insights by the modules. Another issue is related, in most cases, to the unavailability of labeled historical data that makes the use of supervised models unfeasible. Therefore, a new approach for fault detection and diagnosis in rotating machinery is here proposed. The methodology consists of three parts: feature extraction, fault detection and fault diagnosis. In the first part, the vibration features in the time and frequency domains are extracted. Secondly, in the fault detection, the presence of a fault is verified in an unsupervised manner based on anomaly detection algorithms. The modularity of the methodology allows different algorithms to be implemented. Finally, in fault diagnosis, Shapley Additive Explanations (SHAP), a technique to interpret black-box models, is used. Through the feature importance ranking obtained by the model explainability, the fault diagnosis is performed. Two tools for diagnosis are proposed, namely: unsupervised classification and root cause analysis. The effectiveness of the proposed approach is shown on three datasets containing different mechanical faults in rotating machinery. The study also presents a comparison between models used in machine learning explainability: SHAP and Local Depth-based Feature Importance for the Isolation Forest (Local-DIFFI). Lastly, an analysis of several state-of-the-art anomaly detection algorithms in rotating machinery is included.
Keywords: Anomaly Detection, Explainable Artificial Intelligence, Fault Detection, Fault Diagnosis, Rotating Machinery, Condition Monitoring

This work has been submitted for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.
1. Introduction and Related Work
The study of artificial intelligence (AI) techniques applied in the monitoring of rotating machinery is a topic in continuous development and of great interest to both researchers and industrial engineers. More and more, industries are adopting sophisticated monitoring technologies to increase the reliability and availability of their machines and, consequently, remain competitive in the globalized economy.

As detailed by [1], there are three basic tasks of fault diagnosis: (1) determining whether the equipment is normal or not; (2) finding the incipient fault and its reason; (3) predicting the trend of fault development. It is clear that when determining the type of fault and its reason (task 2), the answer on the condition of the equipment is consequently obtained (task 1). However, the AI models usually used to classify the type of fault require labeled data for training (supervised training), with examples for all conditions, which in most cases are not available in the industry [2]. In addition, motivated by the recent advances in Deep Learning (DL), the vast majority of AI technologies lack explainability traits and require a large volume of data labeled for both normal and fault conditions, dramatically limiting their industrial application.

∗ Corresponding author. Email address: [email protected] | [email protected] (Lucas C. Brito)
Preprint submitted to arXiv. February 24, 2021

Although the field of rotating machinery monitoring is widely developed, a small number of approaches have been presented based on unsupervised anomaly detection, in relation to the vast majority focused on classification and prognostics, as shown in the review works [1, 3, 4, 5]. A detailed review of AI for fault detection in rotating machines is presented in [1]: most of the 100 references cited there refer mainly to fault classification. In their review, the main models present in the literature were: Artificial Neural Networks (ANNs), k-Nearest Neighbor (kNN), Naive Bayes, Support Vector Machines (SVM) and DL-based approaches. In [3] a review of the main machine learning (ML) and DL techniques applied in the monitoring of induction motors is presented, aiming to detect faults such as: broken bars, bearing faults, stator faults and eccentricity. Among more than 100 references cited, the vast majority refers to classification, and the main models used are: ANNs, Decision Trees, kNN, SVM and DL-based approaches.

More recently, a broad review with more than 400 citations, focused on AI applications for fault detection, is presented in [4]. The authors provide a historical overview, in addition to current developments and future prospects.
Among the revised ML methods employed in the field, the authors recognized the following as the most commonly adopted: ANNs, Decision Trees, kNN, Probabilistic Graphical Models (PGM) and SVM. Moreover, the following DL approaches are taken into consideration: Autoencoders (AEs), Convolutional Neural Networks (CNN), Deep Belief Networks (DBN) and Residual Neural Networks (ResNet). As in the other cited review works, the references in [4] mainly focus on classification of the type of fault. The authors also confirm the dependence on real and labeled data from the machine under analysis, in addition to highlighting the recent and future importance of explainable models in the Intelligent Fault Diagnosis (IFD) scenario, with increasing interest starting from 2017. Finally, they mention that traditional ML models should not be abandoned despite the recent advances of DL: it is still worth investigating statistical learning in IFD with the big data revolution, since the theories of statistical learning have rigorous theoretical bases, which promote the construction of diagnostic models with parameters, characteristics and results that are easy to understand.

Anomaly detection is the process of identifying unexpected events in a dataset, which are different from normal. In general, the signals generated by a fault have characteristic patterns that are different from normal and indicate a change in the behavior of the machine. Using a method that indicates changes in the current condition of the equipment, does not need a labeled historical dataset for training and provides explainability of the results can be the solution to the mass dissemination of artificial intelligence methods in the industrial environment for monitoring rotating machinery.

Among the references studied, there are very few studies involving anomaly detection with unsupervised approaches in the monitoring of rotating machinery.
Authors in [6] used Fourier local autocorrelation (FLAC) to extract features in the time-frequency domain and a Gaussian Mixture Model (GMM) approach (based on class clusters) to detect faults; in the same work, vibration signals were used to detect faults in wind turbine components. The authors showed that the use of features extracted from FLAC improves the model's performance, making it possible to detect anomalies even in more complicated cases, such as low speeds, where conventional features do not present such satisfactory results. In [7] an application of Self Organizing Maps (SOM) for anomaly detection was presented, showing its effectiveness in detecting variations when the component fails: use cases related to cyber-physical system components (bearings and blades) were exploited. In [8] the authors proposed the use of different ML methods on vibration signals from a fan for unsupervised detection of incipient faults. The algorithms used were: PCA T2 statistic, Hierarchical clustering, K-Means and Fuzzy C-Means clustering. Finally, they presented a comparison of the models, showing the feasibility of implementing them in the monitoring of machines. Other studies [9, 10, 11] propose anomaly detection approaches in the monitoring of rotating machinery based on combinations of different techniques with variations of GMM. To the best of our knowledge, the vast majority of state-of-the-art ML unsupervised anomaly detection algorithms have never been used, e.g., Isolation Forest (IF), Local Outlier Factor (LOF), Angle-Based Outlier Detection (ABOD), etc.

Another important aspect in the field that has not been fully explored yet is the interpretability of ML-based monitoring solutions for equipment machinery: as argued above, without providing explainable results to the user, even when ML-based modules achieve excellent results on historical data, AI models are unlikely to be applied in real-world scenarios [12].
Moreover, as mentioned by [4], collecting labeled data from machines generates a high cost, and consequently unlabeled data are the majority in engineering scenarios. Therefore, using an anomaly detection (AD) model that works with unlabeled data and provides explainability is essential to enable large-scale implementation of AI in the monitoring of rotating machinery.

Recently, studies have been developed with a focus on Explainable Artificial Intelligence (XAI). In order to explain black-box models, different methods can be used according to the ML model in use [13]. In general, the methods provide information to understand how the model performs fault detection, which can be, for example, a ranking of the most important features, the model weight relevance or the most significant points in the underlying signals [14]. Despite the current interest, the vast majority of studies are focused on explainability for DL models and mostly on fault classification. More information can be found in the articles available on the topic [15, 16, 17, 18, 19, 20, 21, 22, 23, 24]. Among the references researched, only [24, 25] address the explainability of the model in anomaly detection, with [24] based on DL. [25] presented a methodology for detecting anomalies in electric motors (voltage unbalance) using a set of similar equipment through electrical and vibration signatures. The authors use generic building blocks and present the advantages of not needing historical data and of incorporating human knowledge. Despite the interesting approach, it is noted that its use requires data from more than one machine, so that they can be compared, making applications on single machines unfeasible.

In this paper, a new approach for fault detection and diagnosis in rotating machinery is proposed. In the first part, the vibration features in the time and frequency domains are extracted.
Secondly, in the fault detection, the presence of a fault is verified in an unsupervised manner based on anomaly detection algorithms. Finally, in fault diagnosis, Shapley Additive Explanations (SHAP), a technique to interpret black-box models, is used. Through the feature importance ranking obtained by the model's explainability, the fault diagnosis is performed. Two diagnosis approaches are proposed, namely: unsupervised classification and root cause analysis.

The main contributions of the proposed approach are: i) unsupervised identification of the fault in rotating machinery through vibration analysis; ii) unsupervised classification of the type of fault in rotating machinery, based on the analysis of the features' relevance; iii) possibility of performing root cause analysis when the features may be related to more than one fault and the unsupervised classification is not feasible; iv) a new contribution to the study of XAI and a novel application in fault diagnosis for rotating machinery based on SHAP and Local-DIFFI; v) possibility to be applied to different types of faults; vi) possibility to change models according to the dataset; vii) industrial applications.

To the best of the authors' knowledge, this is the first study to compare and analyze unsupervised state-of-the-art anomaly detection algorithms for monitoring rotating machinery, in addition to providing explainability about the ML models used and proposing a new approach to perform unsupervised classification or root cause analysis.

The remainder of this paper starts with a brief explanation of the machine learning and XAI methods used in Section 2. The proposed approach is presented in Section 3. The experimental procedure is shown in Section 4. Analyses of the experimental results are given in Section 5. Finally, Section 6 concludes this paper.
2. Methodologies
2.1. Anomaly Detection

In this subsection we provide a brief overview of the data-driven unsupervised Anomaly Detection (AD) algorithms compared in this work.

Anomaly detection (also known as outlier detection) refers to the task of identifying rare observations which differ from the general ('normal') distribution of the data at hand [26]. Anomaly Detection approaches have the capability of summarizing the status of a multivariate system with a unique quantitative indicator, typically called Anomaly Score (AS): while many approaches provide guidelines on how to define outliers based on the AS, the quantitative nature of the AS indices allows implementing different strategies to govern the trade-off between false positives and false negatives depending on the application at hand. While, to the best of our knowledge, no applications have been presented in the field of rotating machinery monitoring using vibration data and the state-of-the-art models introduced in the rest of this Section, anomaly detection approaches have been successfully applied in various areas like biomedical engineering [27], fraud detection [28], and oil and gas [29].

The algorithms are arranged by increasing year of presentation.

2.1.1. k-Nearest Neighbor (kNN)

k-Nearest Neighbor (kNN) is a simple and popular method used for supervised classification and regression tasks. kNN can also be employed in the context of AD: given a sample, the distance to its k-th nearest neighbor can be considered as its AS [30]. More formally, the anomaly score [31] is defined as:

s_kNN(x) = D_k(x)    (1)

where D_k(x) denotes the distance of the k-th nearest neighbor from observation x. The distance function can be any metric distance function. The most common methods for selecting the distance function are: largest distance, where the distance to the k-th neighbor is used as the AS; mean distance, where the AS is the average over all k neighbors; and median distance, which uses the median of the distances to the k neighbors as the AS.
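As an illustrative sketch (not the paper's implementation; the toy data and the value of k are our assumptions), the largest-distance and mean-distance variants of the kNN score can be computed with scikit-learn:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))   # 'normal' operating data
X[0] = [10.0, 10.0]             # one injected anomaly

k = 5
# k + 1 neighbors because each point is returned as its own nearest neighbor
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
dist, _ = nn.kneighbors(X)

s_largest = dist[:, -1]            # AS = distance to the k-th neighbor (Eq. 1)
s_mean = dist[:, 1:].mean(axis=1)  # AS = mean distance over the k neighbors
```

Under both variants the injected point stands far from every neighbor and thus receives the largest score.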
2.1.2. Minimum Covariance Determinant (MCD)

The minimum covariance determinant (MCD) is a robust estimator of multivariate location whose goal is to find the n instances (out of N) whose covariance matrix has the lowest determinant [32]. In the context of AD, MCD is used with the Mahalanobis distance (MD), a well-known distance metric of a point from a distribution: first a minimum covariance determinant model is fitted and then the Mahalanobis distance is used as the AS. Since the parameters required by the MD are unknown (mean and covariance matrix), the MCD model is used to estimate them, and then the MD can be calculated as follows:

s_MCD(x) = d(x, x̄, Cov(X)) = √((x − x̄)′ Cov(X)⁻¹ (x − x̄))    (2)

where x̄ is the sample mean and Cov(X) is the sample covariance matrix. If the data are assumed centered but not normalized, the robust location and covariance are directly computed with the FastMCD algorithm without additional treatment. Otherwise, the support of the robust location and covariance estimates is computed, and a covariance estimate is recomputed from it, without centering the data [26].

2.1.3. Local Outlier Factor (LOF) and Cluster-based LOF (CBLOF)

LOF [33] is a density-based approach for AD; this class of approaches is based on the study of the local neighborhoods of the data points under exam: an observation in a dense region is considered a normal data point (also referred to in the literature as an inlier), while observations in low-density regions are anomalies. The LOF procedure involves two steps: (i) evaluating the so-called Local Reachability Density; (ii) evaluating the AS s_LOF. The Local Reachability Density of a data point x in its k-neighborhood N_k(x) (the space where the k other data points closest to x live) is defined as:

LRD_k(x) = k / Σ_{y ∈ N_k(x)} r_k(x, y)    (3)

(The terms 'anomaly' and 'outlier' are treated interchangeably in this work; other authors refer to the concept of Anomaly Score with various names, e.g., Health Factor or Deviance Index.)
Here r_k(x, y) = max{d_k(y), d(x, y)} is the so-called reachability distance, where d_k(y) is the distance from y to its k-th nearest neighbor. The reachability distance just defined is used instead of the distance d(x, y) in order to reduce statistical fluctuations/noise in the evaluation of the AS s_LOF; the AS is in fact defined as:

s_LOF(x) = (1/k) Σ_{y ∈ N_k(x)} LRD_k(y) / LRD_k(x)    (4)

The anomaly score defined above can assume values between 0 and ∞; however, a value around 1 (or lower than 1) indicates that the data point x is somehow similar to its neighbors and can therefore be considered an inlier, while a value of s_LOF larger than 1 indicates a case in which the data point under exam can be considered an outlier. For more details we refer the interested reader to [33].

LOF is a classic approach to AD, and extended versions of the algorithm have been proposed over the years [34, 35]: in this work we consider the popular Cluster-based LOF (CBLOF). The CBLOF [36] algorithm for AD is an extended version of LOF that exploits a clustering procedure before applying the LOF algorithm: the underlying idea of this approach is to overcome a known problem of LOF, which has some difficulties in dealing with data that are clustered. First, a clustering algorithm (typically k-means) is used to partition the dataset into k disjoint clusters C = {C_1, . . . , C_k}. Each data instance is assigned an AS s_CBLOF based on the size of the cluster it was assigned to: the method uses two cluster types, called 'small clusters' (SC) and 'large clusters' (LC), based on the cardinality of the clusters. The coefficients for deciding small and large clusters are given by the numeric parameters α and β.
With b the boundary between small and large clusters, the anomaly score for a data point x is defined as:

s_CBLOF(x) = |C_i| · min(d(x, C_j)),  where x ∈ C_i, C_i ∈ SC and C_j ∈ LC for j = 1 to b
s_CBLOF(x) = |C_i| · d(x, C_i),       where x ∈ C_i and C_i ∈ LC    (5)

2.1.4. One-Class Support Vector Machine (OCSVM)

The One-Class Support Vector Machine [37] is an extension for AD of the popular classification approach known as Support Vector Machine (SVM). The training data is projected to a high-dimensional space and the hyperplane that best separates the points from the origin is determined. When evaluating a new sample, if it lies within the frontier-delimited subspace, it is considered to come from the same population and is therefore an inlier; otherwise, the data point is considered an anomaly by the approach. As in SVM, kernel functions are used to produce non-linear hyperplanes; different kernels can be used: linear, polynomial, sigmoid, Gaussian. In this work, the kernel coefficient for the Gaussian, polynomial and sigmoid kernels will be called gamma, and the parameter that defines an upper bound on the fraction of training errors and a lower bound on the fraction of support vectors, nu.

2.1.5. Feature Bagging

Feature Bagging is the combination of multiple outlier detection algorithms using different sets of features [38]. Every outlier detection algorithm uses a small subset of features that are randomly selected from the original feature set. Any AD approach can be used as the base estimator. Using a cumulative sum approach, the ASs generated by the outlier detectors are combined in order to find a final AS, described as:

s_final(x) = Σ_{t=1}^{T} s_t(x)    (6)

where the final anomaly score s_final(x) is the sum of the anomaly scores s_t from all T iterations over the outlier detectors used. The number of base estimators in the ensemble and the number of features to draw from X to train each base estimator can be adjusted. Moreover, the final combination of the ASs can also be performed by averaging over all models or by taking the maximum score.
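The feature bagging scheme of Eq. (6) can be sketched as follows; the choice of Isolation Forest as base estimator, the toy data and the subset sizes are our assumptions (any AD approach could serve as base detector):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
X[0] = np.full(6, 6.0)          # injected anomaly, visible in every feature

T, p = 10, X.shape[1]
s_final = np.zeros(len(X))
for t in range(T):
    # random feature subset of size between p/2 and p (illustrative choice)
    m = rng.integers(p // 2, p, endpoint=True)
    feats = rng.choice(p, size=m, replace=False)
    det = IsolationForest(random_state=t).fit(X[:, feats])
    # score_samples is higher for inliers; negate so higher = more anomalous
    s_final += -det.score_samples(X[:, feats])
```

Summing the per-detector scores implements the cumulative-sum combination; replacing the `+=` accumulation with an average or a running maximum gives the other combination rules mentioned above.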
(In the original paper [36], the authors indicated with 'CBLOF' the AS computed with the 'FindCBLOF' algorithm. Nevertheless, the community has also been referring to the algorithm itself by the name 'CBLOF': in this work we follow this naming convention for CBLOF and for the other AD approaches.)

2.1.6. Angle-based Outlier Detector (ABOD) and Fast-ABOD

Differently from the methods for detecting outliers based on distances or distributions, the Angle-based Outlier Detector (ABOD) [39] exploits considerations made on the angles obtained by taking the data point under exam as vertex and all the possible pairs of the other points present in the dataset. The underlying idea is that outliers form angles with the other data points that are typically acute, while inliers form angles of all types, from small angles to straight ones. For this reason, the AS monitored for a generic data point x is the variance of the angles formed with x as a vertex.

The computation of all the angles formed by all the possible triples in the dataset is a time-consuming operation: for this reason, several approximated versions of the ABOD algorithm have been proposed over the years. In this work we employ the approximation presented in the original paper [39], called Fast-ABOD, which considers only the angles formed by the data point under exam and its k nearest neighbors; in the Fast-ABOD formulation the anomaly score is computed as:

s_Fast-ABOD(A) = VAR_{B,C ∈ N_k(A)} ( ⟨AB, AC⟩ / (‖AB‖² ‖AC‖²) )    (7)

where ⟨·, ·⟩ indicates the scalar product, A, B and C are the considered data points, VAR is the variance over the angles between the difference vectors of A to all pairs of points in N_k(A), weighted by the distances of the points, and N_k(A) is the set of the k nearest neighbors of A.
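A from-scratch sketch of the angle-variance idea, simplified by dropping the distance weighting of Eq. (7), shows why outliers receive low scores (the toy data and k are illustrative assumptions):

```python
import numpy as np

def fast_abod_score(X, i, k=10):
    """Variance of the cosines of the angles formed at X[i] by pairs of its
    k nearest neighbors. Simplified sketch: the pairwise terms are not
    distance-weighted as in Eq. (7)."""
    x = X[i]
    d = np.linalg.norm(X - x, axis=1)
    nn_idx = np.argsort(d)[1:k + 1]          # skip the point itself
    diffs = X[nn_idx] - x
    cosines = []
    for a in range(k):
        for b in range(a + 1, k):
            u, v = diffs[a], diffs[b]
            cosines.append(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
    return float(np.var(cosines))

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 2))
X[0] = [10.0, 10.0]   # outlier: sees all its neighbors in a narrow angular range
scores = [fast_abod_score(X, i) for i in range(len(X))]
```

The outlier views the whole cluster within a small solid angle, so its angle variance is close to zero, while inliers, surrounded by neighbors in all directions, obtain a large variance.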
It is important to highlight that although Fast-ABOD presents a better computational cost, the quality of the approximation depends on the number k of nearest neighbors.

2.1.7. Isolation Forest (IF)

Isolation Forest (iForest or IF) [40, 41] uses the concept of isolation, instead of measuring distance or density, to detect anomalies. The IF exploits a space partitioning procedure: the main idea underlying the approach is that an outlier will require fewer iterations than an inlier to be isolated, i.e., to find through the partitioning procedure a region of the space where only that observation lies.

The partitioning procedure used by the IF is achieved through the creation of iTrees, binary trees that are the result of a random partitioning procedure obtained by splitting the data based on one of their features at each iteration of the algorithm. Following the fundamental idea of IF stated above, it is expected that the path to reach a leaf node from the root of an iTree will be shorter for outliers than for inliers; the anomaly score will be related to this path length: the shorter, the more anomalous the data point. We underline that this procedure is done randomly: to achieve fast computation the features and the splitting points are chosen randomly; the drawback of this approach is that a single tree gives an estimate of the path length that has high variance: thus, similarly to the popular Random Forest (which, we remark, is a supervised approach), an ensemble of T trees is constructed in order to provide a low-variance estimation. More in detail, an iTree is built as follows.

1. A subsample of data S ∈ X is randomly selected.
2. A feature v ∈ {1, . . . , p} is randomly selected: a node in the tree is created and at this node the value of v is used.
3. A random threshold v̄ on v is chosen within the domain of the variable.
4. Two children nodes are generated: one associated to the points with values of variable v below v̄ and one for those with values above.
5. Steps 2 to 4 of this procedure are repeated until either a data point is isolated or a threshold on the maximum tree length is reached.

After the iTrees are constructed, the AS for a data point x is computed as follows:

s_IF(x) = 2^(−E(h(x))/c)    (8)

where h(x) is the length of the path for a data point from its leaf to the root, E(h(x)) is the average of h(x) over the collection of iTrees, and c is an adjustment factor which is set to the average path length of unsuccessful searches in a binary search tree procedure. Using the AS just defined, if instances return s_IF very close to 1, they are tagged as anomalies; on the other hand, values much smaller than 0.5 are quite safe to classify as normal instances, while if all values are close to 0.5 the entire sample does not really have any distinct anomaly [41].

iForest works well in high-dimensional problems which have a large number of irrelevant attributes, and in situations where the training set does not contain any anomalies. Given its high performance and the possibility to parallelize its computation (thanks to its ensemble structure), IF is probably the most popular AD approach: for this reason we will consider, as will be detailed in Section 2.2.2, a dedicated approach for providing interpretable traits to IF.

2.1.8. Histogram-based Outlier Score (HBOS)

HBOS is an AD approach based on histograms that was introduced to provide fast computation of an AS w.r.t. previously proposed AD methods. The HBOS algorithm can be summarized as follows: univariate histograms are computed for each single feature (in the case of numerical data, a set of k bins of equal size is used for each histogram). The number of bins k is a hyper-parameter that needs to be tuned; histograms are normalized to [0,
1] for each single feature; the frequency (relative amount) of samples in a bin is used as density estimation; the AS for each instance x is computed as the product of the inverses of the estimated densities, i.e., as the sum of their logarithms:

s_HBOS(x) = Σ_{i=1}^{p} log(1 / hist_i(x))    (9)

where p is the number of features and hist_i(x) is the density estimation. With such a definition of the AS, in HBOS the outliers correspond to high values of s_HBOS(x), while inliers to low values. In this algorithm, two more parameters are employed and need to be tuned, α and the tolerance (tol): α is a regularization factor to avoid overfitting, and tol adjusts the flexibility when dealing with samples falling outside the bins.

2.1.9. Lightweight On-line Detector of Anomalies (LODA)

The Lightweight on-line detector of anomalies (LODA) is based on the concept, from supervised learning, that a collection of weak classifiers can result in a strong classifier. LODA is comprised of a collection of k one-dimensional histograms with n_bins bins, each approximating the probability density of the input data projected onto a single projection vector [42]. The LODA output f(x) is the average of the logarithms of the probabilities estimated on the individual projection vectors, defined as:

f(x) = −(1/k) Σ_{i=1}^{k} log p̂_i(x^T w_i)    (10)

where p̂_i is the probability estimated by the i-th histogram, w_i the corresponding projection vector and x the sample. The number of LODA sparse random projections can also be defined by the user, here called n_randomcuts. Due to its simplicity, LODA is particularly useful in domains where a large number of samples need to be processed in real time or in domains subject to concept drift. It can also be applied where the detector needs to be updated on-line [42].

2.1.10. Ensemble

The ensemble method combines different algorithms to obtain a single final result.
Knowing that ML models are sensitive to the types of data, ensemble methods are commonly used to increase the efficiency and robustness of the final result. Being H_i the result of the i-th base model, the sum of the k selected ones is the final result (FR) of the ensemble method, and the final decision (FD) is obtained by majority voting, both described as:

FD = 1 if FR > k/2, and FD = 0 otherwise,  where FR = Σ_{i=1}^{k} H_i    (11)

where, in this case, 1 indicates that the sample is an anomaly and 0 that the sample is normal.

2.2. Explainable Artificial Intelligence (XAI)

In this subsection the XAI approaches adopted in this work are revised.

2.2.1. Shapley Additive Explanations (SHAP)
Shapley Additive Explanations (SHAP) [43] is a state-of-the-art, model-agnostic method (it can be applied to any algorithm) for interpreting ML predictions, in both unsupervised and supervised tasks. Based on Shapley values from coalitional game theory, SHAP provides a feature importance ranking which can be used to explain the ML model down to the individual data point level: in the context of anomaly detection, having an ordered list of features can be really helpful for a domain expert to enable effective troubleshooting. The feature importance ranking is the result of the contribution of each feature to the final prediction of the model. Since Shapley values are expensive to obtain, SHAP approximates them by means of a conditional expectation function of the original model. The detailed mathematical formulation of SHAP can be retrieved in [43].
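The additive attribution behind SHAP can be illustrated by computing Shapley values exactly, by brute force, for a toy score function; this sketch replaces missing features with a background value (a common simplification of SHAP's conditional expectation) and is not the optimized estimator of the SHAP library:

```python
import itertools
import math
import numpy as np

def shapley_values(f, x, background):
    """Exact Shapley values of score f at point x; features absent from a
    coalition S are replaced by the background value."""
    p = len(x)
    phi = np.zeros(p)
    for i in range(p):
        others = [j for j in range(p) if j != i]
        for r in range(p):
            for S in itertools.combinations(others, r):
                # coalitional weight |S|!(p-|S|-1)!/p!
                w = math.factorial(r) * math.factorial(p - r - 1) / math.factorial(p)
                z_without = np.array(background, dtype=float)
                z_without[list(S)] = x[list(S)]
                z_with = z_without.copy()
                z_with[i] = x[i]
                phi[i] += w * (f(z_with) - f(z_without))
    return phi

# toy anomaly score: sum of absolute deviations from the (zero) background
f = lambda z: float(np.sum(np.abs(z)))
x = np.array([3.0, 0.0, 1.0])
phi = shapley_values(f, x, background=np.zeros(3))
```

For this additive toy score the attributions are simply the per-feature deviations, and they satisfy the efficiency property: the contributions sum to f(x) minus the score of the background point.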
Given the increased interest in and popularity of IF, we chose to also consider in this work a model-specific approach for providing, as in SHAP, a feature importance ranking.

2.2.2. Local Depth-based Feature Importance for the Isolation Forest (Local-DIFFI)

Local Depth-based Feature Importance for the Isolation Forest (Local-DIFFI) is the first model-specific method for interpretability in IF [44]. While IF is one of the most commonly adopted AD algorithms, its structure and predictions lack interpretability. To overcome this problem, the Local-DIFFI method proposes an effective and computationally inexpensive approach to define a local feature importance (LFI) in IF, computed as:
LFI = I_o / C_o    (12)

where C_o is the feature counter for the single predicted outlier x_o and I_o is updated, while iterating over all the trees in the forest, by adding the quantity [44]:

Δ = 1/h_t(x_o) − 1/h_max    (13)

where h_t(x_o) is the depth reached by x_o in the t-th tree and h_max is the maximum tree depth. Local-DIFFI is a post-hoc method which, due to its operation, preserves the performance of an established and effective AD algorithm (IF). An interesting property of Local-DIFFI is that, while achieving results comparable with SHAP, its computing time is orders of magnitude smaller. The method provides additional information about a trained instance of the IF model with the main objective of increasing the users' confidence in the result obtained. Besides the local feature importance provided by Local-DIFFI, the method can also be used to provide a global feature importance, namely DIFFI.
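The following is our illustrative reading of Eqs. (12)-(13), traversing the trees of a scikit-learn Isolation Forest; it is a simplified sketch, not the reference Local-DIFFI implementation, and the toy data and depth bound are assumptions:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[0] = [8.0, 0.0, 0.0, 0.0]     # anomaly driven by feature 0

clf = IsolationForest(random_state=0).fit(X)
x = X[0]
p = X.shape[1]
h_max = int(np.ceil(np.log2(clf.max_samples_)))   # maximum iTree depth

I_o = np.zeros(p)   # cumulative importance for the predicted outlier
C_o = np.zeros(p)   # counter of feature usage along x's paths
for est in clf.estimators_:
    tree = est.tree_
    node, depth, used = 0, 0, set()
    while tree.feature[node] != -2:               # -2 marks a leaf node
        f = tree.feature[node]
        used.add(f)
        if x[f] <= tree.threshold[node]:
            node = tree.children_left[node]
        else:
            node = tree.children_right[node]
        depth += 1
    delta = 1.0 / depth - 1.0 / h_max             # Eq. (13): short path -> large weight
    for f in used:
        I_o[f] += delta
        C_o[f] += 1

lfi = np.divide(I_o, C_o, out=np.zeros(p), where=C_o > 0)   # Eq. (12)
```

Features that repeatedly appear on short isolation paths of x accumulate large increments, so they dominate the resulting local ranking.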
3. Proposed Approach
The proposed methodology is depicted in Fig. 1 and is divided into three parts: 1) Feature extraction; 2) Fault detection: Anomaly detection; 3) Fault diagnosis: Unsupervised classification / Root cause analysis. The vibration features are initially extracted based on the type of monitored component. The extracted features are divided into a training and a testing group, and the hyperparameters of the anomaly detection models are tuned. The samples are evaluated in the fault detection part: if a fault (anomaly) is not detected, the analysis is completed; on the other hand, if the sample is a fault (anomaly), the most relevant features used to generate the result are evaluated through the model's explainability. In the fault diagnosis part, the features that indicate only the presence of a fault, but do not indicate its type/location, are disregarded (called general features, e.g., rms and kurtosis). For components that have unique fault-specific features (e.g., bearings, gearboxes), it is possible to perform an unsupervised classification based on the most relevant feature for the result. On the other hand, for analyses where the features may be related to more than one fault (e.g., misalignment and mechanical looseness), the most relevant features (feature ranking) for identifying the sample as an anomaly are presented, allowing the specialist to analyze the problem in more detail, namely root cause analysis.

Fig. 1.
Framework of the proposed methodology.
3.1. Feature extraction

One of the main reasons for the wide use of DL models in many tasks is that DL approaches implicitly implement a feature extraction procedure, owing to the ability of DL architectures to learn discriminating features through the non-linear relations performed within the model: avoiding the time-consuming task of feature extraction is a captivating property for ML technology developers. However, in the domain of rotating machinery, the vast majority of faults have already been studied, and ad-hoc features that are informative for fault detection (computed directly on the raw signals or after dedicated signal processing filters) have been developed by researchers over the years. For this reason, we have decided to base our approach on 'classic' ML techniques, exploiting the wide knowledge of filtering approaches and feature definitions provided by the literature.

Among the sensors used for monitoring rotating machinery, the vibration-based diagnostic method is the most popular and researched. The interest is justified by the fact that vibration signals directly represent the dynamic behavior of the equipment [45, 46, 47, 48], and vibration monitoring is a non-invasive technique. The features used to detect faults in rotating machinery from vibration signals are commonly extracted from the time, frequency and time-frequency domains [4].

(i) Among the most used in the time domain are: mean, standard deviation, rms (root mean square), peak value and peak-to-peak value. According to [49, 50], these features can be affected by the speed and load of the machines; therefore, other features that are robust to the machine's operating conditions are also commonly used to fill this gap: shape indicator, skewness, kurtosis, crest factor, clearance indicator, etc.

(ii) The features in the frequency domain are extracted from the frequency spectrum, for example: mean frequency, central frequency, energy in frequency bands, etc. Different information can be obtained that is not found, or is hardly extracted, in the time domain [4].

(iii) In the time-frequency domain, features such as entropy are usually extracted through the Wavelet Transform (WT), Wavelet Packet Transform (WPT) and Empirical Mode Decomposition (EMD). These features are capable of reflecting the machine's health state under non-stationary operating conditions [4].

In this study, two approaches were combined in relation to the types of features. Firstly, general features were selected to indicate the presence of fault and degradation in the system. This approach does not allow the identification/location of the fault, but it detects variations in the system in a global way, reducing the risk that a fault goes unnoticed. Secondly, specific features commonly associated with the type of defect in the respective components were used to enable the identification/location of the fault. During the extraction of specific features, it must be defined whether the features are related to different faults or are unique, enabling fault diagnosis through root cause analysis or unsupervised classification, respectively.

3.2. Fault detection: Anomaly detection
Identifying the fault is extremely important for production processes, and even more so when this is performed in an unsupervised manner, that is, without labeled data related to the fault modes in the training set. In this part, the extracted features are divided into training and testing groups, and the hyperparameters of each AD model are adjusted. The samples are evaluated in an unsupervised manner and identified as normal or faulty (anomalous). If the sample is not considered an anomaly, the analysis is completed. On the other hand, if the sample is an anomaly, root cause analysis or fault classification based on the model's explainability is performed.

Different models used in the field of AD were studied. As is common knowledge, the performance of AD models is strongly related to the type of data available. While we report in the following the best approaches in our studies, the methodology is generic and the user can replace the AD model in use if the expected performance is not achieved, without affecting its structure.

The AD algorithms evaluated were the ones reported in Section 2.1: Clustering-Based Local Outlier Factor (CBLOF), Local Outlier Factor (LOF), Isolation Forest (IF), Lightweight On-line Detector of Anomalies (LODA), Histogram-based Outlier Detection (HBOS), k-Nearest Neighbors (kNN), Fast Angle-Based Outlier Detector (FastABOD), Outlier Detection with Minimum Covariance Determinant (MCD), One-Class Support Vector Machine (OCSVM), Feature Bagging (FB) and an Ensemble (combination of all models), as available in [26].
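To make the anomaly-scoring step concrete, the following minimal sketch implements one of the simplest detectors from the list above, a kNN detector with the 'largest' method (in the study itself the implementations come from the library in [26]); the data here are synthetic and purely illustrative:

```python
import numpy as np

def knn_anomaly_scores(X_train, X_test, k=5):
    """Anomaly score as the Euclidean distance to the k-th nearest
    training sample ('largest' method); higher score = more anomalous."""
    scores = []
    for x in X_test:
        d = np.sqrt(((X_train - x) ** 2).sum(axis=1))  # distances to train set
        scores.append(np.sort(d)[k - 1])               # k-th nearest neighbour
    return np.array(scores)

# Synthetic example: 'normal' feature vectors vs. shifted 'faulty' ones
rng = np.random.default_rng(0)
X_train = rng.normal(0.0, 1.0, size=(200, 5))
X_test = np.vstack([rng.normal(0.0, 1.0, size=(5, 5)),    # 5 normal samples
                    rng.normal(6.0, 1.0, size=(5, 5))])   # 5 faulty samples
scores = knn_anomaly_scores(X_train, X_test, k=5)
```

Faulty samples lie far from the training cloud, so their k-th neighbour distance, and hence their anomaly score, is much larger than that of normal samples.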
3.3. Fault diagnosis: Unsupervised classification / Root cause analysis

Despite the advances in ML applications for fault diagnosis in rotating machines, the vast majority of methods operate in a supervised manner. In other words, the methods use labeled data during training to ensure that the model is able to distinguish between different classes of faults. The proposed methodology presents an approach where no training labels are necessary: the fault diagnosis is performed in an unsupervised manner, based on the importance ranking obtained through the model's explainability. Two different analyses are possible, depending on the type of component being monitored, namely unsupervised classification and root cause analysis. For faults that have unique characteristic features (e.g., bearings, gearboxes), unsupervised classification can be performed directly. On the other hand, for analyses where the features may be related to more than one fault (e.g., misalignment and mechanical looseness), the most relevant features (feature ranking) for identifying the sample as an anomaly are presented, allowing the specialist to analyze the problem in more detail, which we call root cause analysis.

The methodology is based on the feature importance ranking for each new sample identified as an anomaly, as presented in Algorithm 1. After identifying the anomaly in the previous part, the most relevant features are analyzed through the model's explainability. SHAP is used to obtain the feature importance ranking. The general features that only indicate the presence of a fault, but do not indicate its type/location, are disregarded (e.g., rms and kurtosis). A new importance ranking is then obtained using only the specific features. For example, assume that, based on the importance scores calculated by SHAP, the most relevant features are, in order: rms, Ball Pass Frequency Outer (BPFO), Ball Pass Frequency Inner (BPFI), kurtosis, Ball Spin Frequency (BSF). Applying the methodology, the new importance ranking would be: BPFO, BPFI and BSF.
After that, according to the type of procedure applied, the result is obtained. For unsupervised classification, the specific features are analyzed and the fault is classified based on the feature most relevant to the result. As each specific feature is related to a unique potential type of component fault, the most relevant feature is taken to indicate the fault present in the system. For root cause analysis, since the features may be related to more than one fault, the feature importance ranking is presented, assisting the specialist in identifying the type of fault.

Understanding which features the model uses to identify the anomaly is essential to perform root cause analysis/classification. In other words, through explainability it is possible to mimic human knowledge. Without an explainability algorithm such as SHAP or Local-DIFFI, it is not possible to carry out the analysis, since the models used do not explain how the final results were obtained. Thus, associating state-of-the-art models for identifying anomalies in the signals with algorithms that provide explainability enables the proposed methodology.

Even though it is the state of the art in explainability and is model-agnostic, SHAP presents a high computational cost in relation to model-specific solutions. Therefore, a comparison was made using the recently proposed model-specific explainability algorithm Local-DIFFI for the Isolation Forest model. As stated above, the choice of the model-specific Local-DIFFI is due to the fact that Isolation Forest presents excellent results in the literature and good robustness with respect to hyperparameter variation. Moreover, in general, Local-DIFFI presents results very similar to SHAP, as shown in [44]. The similarity of the models was verified through the Kendall-Tau rank distance, a metric commonly used for the comparison of two ranking lists.
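The Kendall-Tau rank distance used for this comparison can be computed as the fraction of feature pairs ordered differently by the two rankings; a minimal implementation:

```python
from itertools import combinations

def kendall_tau_distance(rank_a, rank_b):
    """Normalized Kendall-Tau distance between two rankings of the same
    items: fraction of discordant pairs (0 = identical, 1 = reversed)."""
    pos_a = {item: i for i, item in enumerate(rank_a)}
    pos_b = {item: i for i, item in enumerate(rank_b)}
    discordant = sum(
        1 for x, y in combinations(rank_a, 2)
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0
    )
    n = len(rank_a)
    return discordant / (n * (n - 1) / 2)
```

For example, comparing a SHAP ranking ['BPFO', 'BPFI', 'BSF'] against a Local-DIFFI ranking ['BPFO', 'BSF', 'BPFI'] gives a distance of 1/3 (one discordant pair out of three).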
Algorithm 1 Pseudo-code

procedure Unsupervised Classification
    Type: specific analysis / each specific feature related to a single fault
    Input: new sample
    Output: fault classification (most important specific feature)
    if new sample = anomaly then
        feature importance ranking ← shap or local-diffi(new sample)
        feature importance ranking.drop(general features)
        feature importance ranking ← sort(feature importance ranking)
        most important feature ← feature importance ranking[0]
        print('The fault is located in: ', most important feature)

procedure Root Cause Analysis
    Type: general analysis / specific features related to different faults
    Input: new sample
    Output: root causes (most important specific features)
    if new sample = anomaly then
        feature importance ranking ← shap or local-diffi(new sample)
        feature importance ranking.drop(general features)
        feature importance ranking ← sort(feature importance ranking)
        print('The root causes are related to: ', feature importance ranking)
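A compact Python rendering of Algorithm 1, assuming the feature importances (SHAP or Local-DIFFI values) for an anomalous sample are already available as a dictionary; the feature names follow the bearing example in the text:

```python
GENERAL_FEATURES = {"rms", "kurtosis"}  # indicate a fault, not its type/location

def diagnose(importances, mode="classification"):
    """Algorithm 1: rank the specific features by absolute importance.

    importances : dict mapping feature name -> importance score for one
                  anomalous sample (e.g., SHAP or Local-DIFFI values)
    mode        : 'classification' returns the single most important
                  specific feature; 'root_cause' returns the full ranking
    """
    # Drop the general features and sort the specific ones by importance
    specific = {f: v for f, v in importances.items() if f not in GENERAL_FEATURES}
    ranking = sorted(specific, key=lambda f: abs(specific[f]), reverse=True)
    if mode == "classification":
        return ranking[0]  # each specific feature maps to a unique fault
    return ranking         # root cause analysis: present the whole ranking
```

With the importances of the example above (rms 0.9, BPFO 0.8, BPFI 0.5, kurtosis 0.4, BSF 0.1), classification returns 'BPFO' and root cause analysis returns the ranking ['BPFO', 'BPFI', 'BSF'].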
4. Experimental procedure
Three datasets were used to address different faults found in rotating machinery. The faults analyzed were: bearing and gearbox defects, misalignment, unbalance, mechanical looseness and combined faults. The use of different datasets, with different monitoring approaches, aims to validate the proposed methodology in different scenarios.
The first dataset considered (publicly available [51]), namely the Bearing Dataset, is composed of three run-to-failure tests with four bearings in each test. The rotation speed was kept constant at 2,000 rpm by an AC motor coupled to the shaft via rub belts. A radial load of 6,000 lb was applied to the shaft and bearings by a spring mechanism. Rexnord ZA-2115 double-row bearings were installed on the shaft, and PCB 353B33 accelerometers were installed on the bearing housings. All failures occurred after exceeding the projected bearing life, which is more than 100 million revolutions [51]. For this study, bearing 01 of test 02 was used. Each test consists of individual files of vibration signals recorded at specific intervals. Each file consists of 20,480 points with the sampling rate set at 20 kHz. An NI DAQ Card 6062E was used for data collection.

The dataset consists of run-to-failure tests, therefore no labels are available indicating the fault start: the only information provided is the type of fault present at the end of each test. To assess the efficiency of the AD model, the data was manually labeled. In the analysis, it was considered that, after the onset of the defect, all subsequent observations correspond to a faulty bearing. It is worth mentioning that the labels were used only to evaluate the efficiency of the methodology and were not used by the AD model.

The test has 984 observations, with the first 531 observations labeled as normal and the last 453 as anomalies (faults). The fault was identified in the outer race. The features used were kurtosis, rms, BPFI, BPFO and BSF, which are widely used in bearing fault detection [52, 53, 54, 55, 56, 57]. Specific features are those that indicate the type of fault (BPFI, BPFO and BSF), while general features are those that indicate the presence of a defect (kurtosis and rms). The bearing fault frequencies are important to assess the type of defect and confirm its existence, which is not always noticed through other features.
It is also important to mention that there are cases where the fault does not present the classic defect behavior, with the deterministic bearing frequencies in evidence [58], which makes it important to use other features as well. Knowing that bearing faults are generally associated with impacts, kurtosis is a relevant feature for the study. Finally, the rms value represents the global behavior of the system, indicating general degradation and the accentuation of the defect. The purpose of using this dataset, in addition to identifying the presence of the fault in a real monitoring situation, is to classify the type of fault using the proposed methodology.
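As an illustration of the feature extraction for this dataset, the sketch below computes rms, kurtosis and the spectral energy around the bearing fault frequencies. It is a simplification that evaluates the raw amplitude spectrum directly, whereas in practice the fault frequencies are typically assessed on the envelope spectrum after band-pass filtering; the band width and the frequency values in the usage example are illustrative assumptions, not the dataset's actual values:

```python
import numpy as np

def band_energy(signal, fs, f_center, bw=5.0):
    """Spectral energy in a band of +/- bw Hz around f_center."""
    spec = np.abs(np.fft.rfft(signal)) / len(signal)   # one-sided amplitude
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    mask = (freqs >= f_center - bw) & (freqs <= f_center + bw)
    return float((spec[mask] ** 2).sum())

def bearing_features(signal, fs, bpfo, bpfi, bsf):
    """General features (rms, kurtosis) plus specific features (energy
    around BPFO, BPFI and BSF) for one vibration record."""
    x = np.asarray(signal, dtype=float)
    rms = float(np.sqrt(np.mean(x ** 2)))
    kurtosis = float(np.mean((x - x.mean()) ** 4) / np.var(x) ** 2)
    return {"rms": rms, "kurtosis": kurtosis,
            "BPFO": band_energy(x, fs, bpfo),
            "BPFI": band_energy(x, fs, bpfi),
            "BSF": band_energy(x, fs, bsf)}
```

For a synthetic signal dominated by a tone at the assumed BPFO, the BPFO band energy dominates the other fault-frequency features, mimicking the outer-race fault signature.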
The second dataset considered, the Gearbox Dataset, was presented in [59] and is used to evaluate gearbox faults. A 32-tooth pinion and an 80-tooth gear were installed on the first-stage input shaft. The second stage consists of a 48-tooth pinion and a 64-tooth gear. The data were recorded using an accelerometer through a dSPACE DS1006 system, with a sampling frequency of 20 kHz. Nine different gear conditions were introduced to the pinion on the input shaft, including healthy condition, missing tooth, root crack, spalling, and chipping tip with five different levels of severity. For each gear condition, 104 observations were collected, resulting in a total of 936 observations.

It is common knowledge that gear problems tend to increase the energy of the sidebands, spaced at the rotation frequency, around the Gear Mesh Frequency (GMF) and its harmonics. Thus, simulating a real condition, the features used were: kurtosis, rms, 1xGMF, 2xGMF, 3xGMF, 4xGMF (1st stage) and 1xGMF, 2xGMF (2nd stage). Due to non-stationarity and the uncertainty caused by speed variation, instead of using the energy value at each GMF and its respective sidebands, the energy in the band GMF +/- 4x(nominal rotation frequency) was calculated. In addition to detecting the fault (AD), the use of this dataset aims to identify the location of the fault in the gearbox (first or second stage), not to classify the type of fault (missing tooth, root crack, spalling or chipping tip).
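The GMF band-energy feature described above can be sketched as follows, with the band GMF +/- 4x(nominal rotation frequency) evaluated on the amplitude spectrum; the frequency values in the usage example are illustrative assumptions, not the dataset's actual GMF:

```python
import numpy as np

def gmf_band_energy(signal, fs, gmf, fr, n_sidebands=4):
    """Energy in the band GMF +/- n_sidebands * fr, covering the gear
    mesh frequency and its rotation-frequency sidebands."""
    spec = np.abs(np.fft.rfft(signal)) / len(signal)   # one-sided amplitude
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    lo, hi = gmf - n_sidebands * fr, gmf + n_sidebands * fr
    mask = (freqs >= lo) & (freqs <= hi)
    return float((spec[mask] ** 2).sum())
```

Using a whole band rather than the single GMF line makes the feature tolerant to small speed variations, which shift the mesh frequency and its sidebands by a few hertz.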
The last dataset, the Mechanical Fault Dataset, was developed by one of the authors [60, 61]: the dataset contains different electrical and mechanical faults, which were inserted in an experimental test rig. In this work we consider the following faults: unbalance, misalignment, looseness and combined faults (the combination of the previous ones). Six accelerometers were used to acquire the vibration signals in the horizontal, vertical and axial positions, three on the fan-end side and three on the drive-end side.

The rotation speed was kept constant at 1,717.5 rpm. The observations were labeled according to the fault introduced in the test rig and a later analysis of the vibration spectrum. Each file consists of 3,200 points with df = 0.125 Hz. The dataset contains 5 conditions with a total of 1,418 observations (532 normal, 557 unbalance, 283 misalignment, 28 mechanical looseness and 18 combined fault).

In general, unbalance is commonly identified in the vibration signal by an increase of the energy at 1x fr (rotation speed frequency). It is noteworthy that other faults, such as structural problems and even mechanical looseness, can also appear at 1x fr. The most common types of misalignment and mechanical looseness show an increase in energy level at 2x and 3x fr, and therefore may have similar characteristics. Mechanical looseness can also present multiples and sub-harmonics of fr. Considering the types of faults and their respective behaviors, the following features were used: rms and the energy levels at 1x fr, 2x fr, 3x fr and 4x fr.

In addition to the basic objective of identifying the fault, the use of this dataset aims to evaluate the classification methodology with a focus on root cause analysis, when the features are correlated with more than one type of fault. The dataset also provides the possibility of studying isolated and combined faults that, although well known, have been little used in studies involving fault detection and new artificial intelligence techniques, compared to bearings and gearboxes.
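The harmonic energy features used for this dataset (energy at 1x fr to 4x fr) can be sketched as follows; the band width bw around each harmonic is an illustrative assumption:

```python
import numpy as np

def harmonic_features(signal, fs, fr, n_harmonics=4, bw=2.0):
    """Energy at the first n_harmonics multiples of the rotation
    frequency fr (unbalance -> 1x fr; misalignment and looseness
    tend to raise 2x and 3x fr)."""
    spec = np.abs(np.fft.rfft(signal)) / len(signal)   # one-sided amplitude
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    feats = {}
    for k in range(1, n_harmonics + 1):
        mask = (freqs >= k * fr - bw) & (freqs <= k * fr + bw)
        feats[f"{k}x_fr"] = float((spec[mask] ** 2).sum())
    return feats
```

For a synthetic unbalance-like signal (a pure tone at fr), the 1x fr feature carries essentially all the energy while the higher harmonics stay near zero.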
Two approaches were used to define three different scenarios (Cases 1, 2 and 3) that can be found in real-world monitoring applications.

In the first approach, a dynamic condition was considered, with the data collected in sequence, presenting a temporal relationship and fault evolution (Case 1). For this study, a sliding window was used, where the training group was updated with each new sample, whenever the sample was considered normal. Initially, 100 samples were used for the training group in order to ensure stability in the models. In this situation, as the model was started together with the machine under normal conditions (e.g., after maintenance or for a new machine), there are no anomalies in the training group. It is worth noting that this approach can also be used if there are anomalies in the training group (e.g., in continuous monitoring, where the machine was repaired after a fault and it is desired to use all the signals to increase the amount of data in the model).

In the second approach, a static condition was considered, where the signals do not have a temporal correlation with each other (Cases 2 and 3). This approach simulates the situation where unlabeled historical data are available for the machine, referring to different types of faults and normal conditions, but not necessarily collected in sequence. It is important to highlight that, although Cases 2 and 3 represent the same static condition, the types of faults studied are different in each case. The data were divided into training and test groups. Due to the number of observations available, the size of the training group is limited by the number of normal samples, with the rest designated as test data. The training group consisted of 80% samples in normal condition and 20% anomalies, selected at random. This proportion was defined because a machine in operation is mostly in normal condition, with few fault situations. Such an approach also shows that it is possible to implement the proposed methodology even with anomalies present in the training set.
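The sliding-window scheme of the dynamic condition (Case 1) can be sketched as follows; fit_score is a hypothetical stand-in for any of the AD models, and the fixed threshold is used only for illustration:

```python
def sliding_window_monitor(stream, fit_score, n_init=100, threshold=0.5):
    """Dynamic condition (Case 1): the training window grows with every
    new sample judged normal; anomalous samples are flagged and not added.

    fit_score : callable(train, sample) -> anomaly score for the sample
                (hypothetical interface standing in for an AD model)
    """
    train = list(stream[:n_init])          # initial training group
    flags = [False] * n_init               # initial samples assumed normal
    for sample in stream[n_init:]:
        score = fit_score(train, sample)
        is_anomaly = score > threshold
        flags.append(is_anomaly)
        if not is_anomaly:                 # update the window only with normals
            train.append(sample)
    return flags
```

Keeping anomalous samples out of the window prevents a developing fault from gradually being absorbed into the model's notion of normality.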
The hyperparameters for each model were adjusted based on the training group to obtain the best performance and are shown in Table 1. The hyperparameters are presented with respect to the library used [26]. As the models did not show significant differences in the final result with respect to the hyperparameters in each case, the hyperparameters were kept the same for all analyses.
Table 1
Hyperparameters for each model

Model      Hyperparameters
kNN        n_neighbors=5, method='largest', metric='minkowski'
MCD        assume_centered=False
LOF        n_neighbors=16
CBLOF      n_clusters=6, alpha=0.8, beta=4
OCSVM      kernel='rbf', gamma=0.2, nu=0.7
FB         base_estimator=LOF, n_estimators=10, max_features=1.0, combination='average'
FastABOD   n_neighbors=5
IF         n_estimators=100, max_samples=128
HBOS       n_bins=5, alpha=0.1, tol=0.5
LODA       n_bins=5, n_random_cuts=50

4.4. Evaluation metrics

For the fault detection part, as the methodology is unsupervised, the anomaly score is calculated at the end of the test, where samples with high anomaly score values are usually anomalies. To verify the performance of the proposed methodology, threshold values were defined based on the training group. For the bearing dataset (Case 1), the threshold was defined based on the assumption that the training group is composed only of signals in normal condition (considering that the initial signals correspond to the start of operation of the bearing). As for the gearbox dataset (Case 2) and the mechanical fault dataset (Case 3), the contamination ratio is known, so its value was used to define the threshold. It is also worth mentioning that, given knowledge about the fault characteristics and their behavior, the user can adjust the contamination ratio of the methodology during the application, based on a preliminary analysis of the training data.

For the static condition, each test was performed 100 times to show the stability of the models. The signals were randomly assigned to the test and training groups at each iteration of the model. For the dynamic condition, in each new update of the training group, 5% of the samples were randomly excluded, also to assess the stability of the model. As the update occurred more than 400 times in the tested dataset, the complete test was performed 10 times.
In addition to the variation of the dataset, each iteration of the model was performed with a different random seed.

The results are presented using the F1-Score, the PR-AUC (Precision-Recall Area Under the Curve) and the average confusion matrix over the iterations, with the respective standard deviations. These metrics were chosen due to the greater interest in correctly identifying samples referring to faults (anomalies). Although false positives in the final result are a problem, failing to acknowledge a fault is even worse, as it can result in machine breakdown. Moreover, these metrics are also important when dealing with imbalanced datasets (a common situation in real scenarios).

For the fault diagnosis using the unsupervised classification approach (Cases 1 and 2), each sample identified as an anomaly is classified with respect to the type/location of the fault. As the classification was performed only for the identified anomalies, accuracy was used as the evaluation metric. For the root cause analysis (Case 3), the feature importance ranking is presented. The Kendall-Tau distance was used to compare SHAP and Local-DIFFI. The tests were performed using a 2.2 GHz Intel Core i7 Dual-Core with 8 GB 1600 MHz DDR3 and Intel HD Graphics 6000 1536 MB.
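The threshold definition described above can be sketched as a quantile of the training anomaly scores: with a known contamination ratio (Cases 2 and 3) the (1 - contamination) quantile is used, while a clean training group (Case 1) reduces to the maximum training score. This is an illustrative reading of the procedure, not the exact implementation:

```python
import numpy as np

def threshold_from_contamination(train_scores, contamination):
    """Threshold chosen so that the expected fraction 'contamination'
    of the training scores falls above it; with contamination = 0
    (clean training group), the maximum training score is used."""
    train_scores = np.asarray(train_scores, dtype=float)
    if contamination <= 0:
        return float(train_scores.max())
    return float(np.quantile(train_scores, 1.0 - contamination))
```

At test time, any sample whose anomaly score exceeds this threshold is flagged as a fault.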
5. Results and discussion
In this section, the data used in this work for Cases 1-3 are analyzed and discussed.
Fig. 2a shows the complete signal for the test in the time domain. As the signal was not collected continuously (24/7), it was decided to present it as a function of the sample number (x-axis). The point at which the incipient fault starts, as well as the fault itself, are identified. In Fig. 2a it can be seen that, although the fault is easily identified by the signal trend in the time domain, the incipient fault is not easily identified by visual analysis, making it important to use an ML model with appropriate features to provide the maintenance team adequate time to schedule an intervention. Fig. 2b shows the moment at which the incipient fault begins, presented in the envelope spectrum and used to define the labels of the signals. It is noted that from sample 531 onwards there is evidence of BPFO, which is taken as indicative of an incipient fault and, therefore, an anomaly. Based on the adopted methodology, all samples after this signal are considered faults (anomalies).

Fig. 2. Bearing dataset signal: (a) complete signal for the test in the time domain; (b) waterfall envelope spectrum.

The signals for the different types of faults present in the dataset are shown in Fig. 3. Due to the possibility of non-stationarity caused by the load variation, it was decided to show the signals in the time domain. In addition, some defects, such as a broken/cracked tooth, can also be better viewed. It is possible to notice an increase in the energy level of the signal for defects such as root crack, spalling and chipping tip (most severe). On the other hand, differentiating a normal signal from one with a missing tooth or chipped tip in the initial stage is not so simple. Therefore, feature extraction and the use of artificial intelligence techniques become essential for more assertive monitoring.

Fig. 3. Vibration signal examples under different gear health conditions: (a) normal; (b) root crack; (c) missing tooth; (d) spalling; (e) chipping tip (least severe); (f) chipping tip (most severe).
In Fig. 4 some examples of faults are shown. The frequency domain was used to better characterize the faults, since the rotation speed was kept constant.

In the normal situation, there is no predominance of any characteristic frequency, and the vibration level is low in relation to the other situations. In the unbalance case, the increase in energy at 1x fr, characteristic of the fault, is evident. Misalignment and mechanical looseness exhibit very similar behavior in the signal, with 2x fr greater than the other harmonics. The differentiation was performed based on the type of fault inserted in the test rig. For the situation of combined faults (unbalance, misalignment and mechanical looseness), the characteristics of all faults are noted.

For the reasons mentioned above, the classification of such faults requires the analysis of signals in other positions and complementary techniques. Therefore, for this case, the proposed classification methodology will provide only the most relevant features for the identification of the fault, assisting the specialists in the search for the root cause of the problem.

Fig. 4. Examples of vibration signals for different faults present in the dataset: (a) normal; (b) unbalance; (c) misalignment; (d) looseness; (e) combined faults.
Using the proposed methodology, the results obtained for fault detection are presented in Table 2. Table 2 also shows the average time spent for training and for testing (a new sample). The top three results for each metric are shown in bold. It can be seen in Table 2, for Cases 1 and 3, that the models with the best identification of the faults according to the F1-Score were MCD, HBOS and IF. For Case 2, although HBOS had a very close performance, the models that showed the best results were MCD, kNN and IF.

In order to evaluate the general efficiency of the models, regardless of the defined threshold, the PR-AUC value was calculated. The results show that, by modifying the threshold, the models can present even better results. It is noteworthy that the threshold used for comparison and for calculating the F1-Score was defined based on the previous analysis of the training group, simulating a real condition where the test data are not yet available. For this reason, it was decided to present the F1-Score value based on the defined threshold instead of the optimum value that could be obtained by adjusting the threshold on the complete dataset. Nevertheless, despite the improvement in the results, in general the models that showed better performance in relation to the PR-AUC were the same ones with the highest F1-Score values: IF, HBOS and MCD.

Although, in general, HBOS, MCD and IF presented good results for the three cases, depending on the dataset other models can obtain better performance, such as kNN in Cases 1 and 2. The good results obtained in Table 2 for the three cases show that it is possible to detect faults in rotating machinery through the studied models in an unsupervised way.

Among the models with the best results, HBOS presented the lowest computational time. In general, LOF and OCSVM also presented low values.
On the other hand, FastABOD, MCD and IF demanded more computational time than the other models (in a general analysis, excluding the Ensemble). The low average time of most models for training and for testing a sample allows implementation in an industrial environment focused on predictive maintenance.

For the proposed comparison between SHAP and Local-DIFFI, and due to the good overall performance of the Isolation Forest, the detailed results of the methodology are presented for this model. The average values of the confusion matrix are presented in Table 3 (the sample quantities were rounded because they are integer values). The confusion matrix allows a better visualization of the results in relation to the distribution of the signals in the respective classes. The results are presented both in percentage and in quantity of signals.
Table 2
Fault detection results

Metric     kNN     MCD     LOF     CBLOF   OCSVM   FB      FastABOD  IF      HBOS    LODA    Ensemble
Case 1
F1-Score   63.21   …
Case 2
F1-Score   99.82 (0.02) (0.00) (0.01) (0.04) (0.71) (0.01) (0.12) (0.01) (0.04) (0.11) (0.01)
Time [s]   0.1706  0.0807  …
Case 3
F1-Score   96.27   …
Table 3
Confusion matrix (rows: true label; columns: predicted label)

          Case 1          Case 2          Case 3
          Normal  Fault   Normal  Fault   Normal  Fault
Normal    …       …       …       …       …       …
Fault     …       …       …       …       …       …

The results presented in Table 3 for Case 1 show that the samples of the normal group were all correctly classified, while the anomalies had an average classification error of 25 samples out of a total of 453 anomalies, confirming the good performance of the model. For Case 2, on average, 2 anomalies out of 812 were classified incorrectly, and 2 normal samples out of 24 were classified as anomalies. For Case 3, 694 of the 709 anomalies and 82 of the 107 normal samples were classified correctly. As in Cases 1 and 2, the results show the good performance of the model.

Such performance in a real application will make it possible to avoid intervening in the machine unnecessarily (which is also a big problem, considering the need to stop production and the high cost of some components that would otherwise be replaced without need). Moreover, the model was able to correctly identify most anomalies, including those at an early stage of fault, allowing the maintenance team to schedule the machine shutdown without directly interfering in the production process.

Keeping in mind that the essence of anomaly detection methods is unsupervised, that is, without defining even the threshold value (in addition to not having labeled data in training), the normalized anomaly scores are presented for the entire test in Fig. 5. The anomalies identified in Fig. 5 are presented based on the defined threshold. The x-axis values refer to the test samples only. Fig. 5a shows the evolution of the anomaly score with the development of the fault. It is possible to notice the gradual increase near the region identified as the beginning of the incipient defect, sample 531, as shown in Fig. 2b. Subsequently, there is an increase in the anomaly score in relation to the normal condition, indicating a permanent change in the behavior of the equipment and, consequently, a fault.

Fig. 5. Anomaly scores for Isolation Forest: (a) Case 1 - Bearing Dataset; (b) Case 2 - Gearbox Dataset; (c) Case 3 - Mechanical Fault Dataset.

For Cases 2 and 3, the anomaly score value does not show the evolution of the fault, since the analysis was performed in a static way. Analyzing Case 2, it is noted that the scores for samples considered anomalies are higher than those of the normal group (first 24 samples), enabling identification. The samples were grouped in sequence with respect to the equipment condition to allow comparison between the faults (healthy condition, missing tooth, root crack, spalling, chipping tip 1, chipping tip 2, chipping tip 3, chipping tip 4, chipping tip 5). It can be seen that faults with similar condition have similar anomaly scores. Thus, in a real monitoring situation, if the anomaly score changes suddenly, it can be concluded that a possible new fault is occurring or that the severity of the current fault has increased.

For Case 3, due to the presence of different fault conditions in each group, it is not possible to distinguish the type of fault through visual analysis of the anomaly score. However, the variation between the normal samples (first 107 samples) and those considered faults is still noticeable, even if smaller than in the other cases. Despite the threshold value defined for comparison of the models, it is possible to notice the difference between normal and faulty samples, even where the fault is incipient. This difference allows the user to correctly identify the anomalies present in the monitoring.
The importance of anomalies detected in the incipient period is emphasized, as this is a stage not easily identified by visual analysis or by the variation of the time-signal energy trend (often used as a metric to define maintenance alarm levels based on standards).

The difference in the anomaly score values also proves that the faults studied in rotating machinery behave as anomalies/outliers. In other words, they are samples with values so different from the other observations that they are capable of raising suspicions about the mechanism from which they were generated [62]. As the samples share this basic principle, it is concluded that the use of AI models based on anomaly detection makes it possible to identify faults in rotating machinery in a satisfactory and unsupervised way.
The results obtained using the proposed methodology for the unsupervised classification are presented in Table 4. The type of fault diagnosis to be used is defined during feature extraction, based on whether or not the features are related to different types of fault.

Cases 1 and 2 present specific features related to a single type of fault / location, which allows the identification of the type of fault and of the location, respectively. Therefore, the proposed Unsupervised Classification can be performed. For Case 1, it can be noted that the IF, HBOS and FB models obtained better results.

Table 4
Fault Diagnosis: Unsupervised Classification

Case     kNN    MCD     LOF    CBLOF  OCSVM  FB     FastABOD  IF     HBOS   LODA   Ensemble
Case 1   BSF    BPFO    BPFO   BSF    BSF    BPFO   BPFO      BPFO   BPFO   BPFO   BPFO
Case 2   1st Stage (all models)
Std      (0.77) (12.64) (2.88) (2.19) (3.61) (3.27) (3.83)    (3.06) (3.88) (8.31) (1.11)
[Per-model accuracy and execution-time values were not recoverable from the extracted text.]

Using IF as an example, in 99.57% of the analyzed samples the specific feature BPFO was considered the most relevant, and consequently the type of fault was correctly classified. Considering also the results obtained in the previous fault detection stage, IF and HBOS are good models for the methodology, since they showed good ability both to detect and to diagnose the fault. FB, on the other hand, notwithstanding a good result in diagnosis, presented a low fault detection rate, which in this case would fail to identify some anomalies in the equipment. Some models, such as CBLOF, kNN and OCSVM, classified the fault as BSF instead of BPFO, which is considered an error. The other models, despite having correctly classified the type of bearing fault, had a lower hit rate than those mentioned above, both in the fault detection stage and in the unsupervised classification.

For Case 2, the fault was classified with respect to its location in the gearbox. The most relevant features were associated according to their stage. In other words, using the Ensemble model as an example, in 96.72% of the analyzed samples the most relevant feature was related to a fault in the first stage. The models with the highest hit rate were CBLOF, kNN and Ensemble, which showed good results for both fault detection and diagnosis. It is worth mentioning that for the dataset under analysis most models showed good detection results, possibly because the faults have well-characterized behaviors. IF and HBOS, which presented good fault detection results in all cases, showed inferior performance here, erroneously classifying approximately 15% of the faults as present in the second stage.
In general, MCD, which showed good results for fault detection, was not as effective in the fault diagnosis part.

For Case 3, the features may be related to more than one fault; therefore, it is not possible to perform the unsupervised classification directly. In this case, the general analysis using the Root Cause Analysis procedure is applied (Table 5). For better visualization, the results are presented based on the most relevant feature obtained by the methodology. A sub-division for each type of fault was carried out in order to provide more details on the method. An example of the complete results is presented for IF and Case 3.1 in Table 6.

Table 5
Fault Diagnosis: Root Cause Analysis results
[Table 5 lists, for each model (kNN, MCD, LOF, CBLOF, OCSVM, FB, FastABOD, IF, HBOS, LODA, Ensemble), the most relevant feature obtained for Case 3.1 (Unbalance), Case 3.2 (Misalignment), Case 3.3 (Mechanical Looseness) and Case 3.4 (Combined Faults); the individual cell values were not recoverable.]
The unbalance fault is presented in Case 3.1 and the results are shown in Table 5 and Table 6. Since the unbalance behavior manifests predominantly in 1xfr, this feature is expected to show greater relevance in the analysis, as presented by the IF and OCSVM models. On the other hand, as the features are directly or indirectly related to more than one fault, the model may exploit the relationship with another feature instead of the expected one. For example: it is known that unbalance manifests itself in 1xfr; however, if the energy in 2xfr is greater than in 1xfr, the sample possibly presents a misalignment (excluding other fault possibilities just for the sake of the example). Thus, given an unbalanced sample, the model can use 2xfr as a basis to check whether it is smaller or greater than 1xfr, and 2xfr then becomes the most relevant feature even if the fault is an unbalance. In addition to the aforementioned justification, the type of fault introduced was used to label the samples. Thus, in some cases the fault behavior was not evident in the signal, which explains why the model identified other features as more relevant. For example: for a small unbalance, the acquired signal is labeled as unbalanced even if it does not significantly increase the amplitude in 1xfr.

Table 6 shows that in 55.83% of the samples 1xfr was classified as the most relevant feature. Subsequently, the features 2xfr and 3xfr are the most important. Such features are related to the way an unbalance is identified in a vibration signal, and therefore they can be used by the specialist to analyze the root cause of the fault. It is also noted that the 4xfr feature was in most cases classified as less relevant, since this feature (for the case under study) is not so important for identifying or distinguishing this fault.
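The root cause analysis uses the complete relevance ranking rather than only the top feature. A minimal sketch, with hypothetical contribution values for a single anomalous sample:

```python
import numpy as np

feature_names = ["1xfr", "2xfr", "3xfr", "4xfr"]

# Hypothetical SHAP contributions for one anomalous sample (not real data).
contrib = np.array([0.42, 0.21, 0.15, 0.03])

# Full root-cause ranking: features ordered by decreasing absolute contribution.
order = np.abs(contrib).argsort()[::-1]
ranking = [feature_names[i] for i in order]
print(ranking)  # ['1xfr', '2xfr', '3xfr', '4xfr']
```

Aggregating such per-sample rankings over the dataset yields position statistics of the kind reported in Table 6 (e.g. 1xfr in first position for 55.83% of samples), which the specialist can then read against known fault signatures.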
Table 6
Fault Diagnosis: Root Cause Analysis full ranking
[Table 6 reports, for each feature, the fraction of samples in which it occupied the 1st, 2nd, 3rd and 4th relevance positions; the individual values were not recoverable.]

As presented in the methodology, the unsupervised classification / root cause analysis is performed through the importance ranking of the specific features obtained by the model's explainability. To study the possibility of using different explainable models within the methodology, and the feasibility of implementing a computationally faster one, Table 7 compares the complete relevance rankings obtained by SHAP and Local-DIFFI. As the main goal is to compare the two methods, the values for Case 3 were calculated over all faults. From Table 7, the time taken to compute the explanations was higher with SHAP than with Local-DIFFI. Being model-specific, Local-DIFFI runs approximately 6.5x to 8.0x faster than SHAP, which is extremely relevant in applications where execution time is essential.
Table 7
XAI: SHAP vs. Local-DIFFI
Metric/Case             Case 1   Case 2   Case 3
Kendall-Tau Distance    0.348    0.127    0.455
SHAP: Time [s]          0.2890   0.2538   0.3012
Local-DIFFI: Time [s]   0.0361   0.0365   0.0453

The comparison made through the Kendall-Tau distance shows that the models produce similar relevance rankings, visually presented in Fig. 6. Since the main objective is to compare the two models, all the features used by the models for fault detection are considered, without excluding the general features proposed in the application of the fault diagnosis part. It can be seen in Fig. 6, for Case 1, that the most relevant feature for both models is precisely BPFO, allowing the unsupervised classification to achieve good results. For Local-DIFFI, some samples presented BPFI and BSF as the most important specific feature, leading the methodology to misclassify the type of fault. Case 2 presented the lowest agreement between the rankings. Among the most relevant features, SHAP showed a lower occurrence of the features related to the second stage than Local-DIFFI, leading the model to make fewer errors during the application of the proposed methodology. Despite the smaller similarity, the main feature (1xGMF 1st) was also the same in both models. For Case 3, the most relevant feature for both models is 1xfr, with a good similarity for positions 3, 4 and 5. Thus, through the analysis of the Kendall-Tau distance, it is possible to verify that the rankings show similar behaviors. Because it is a model-specific method, Local-DIFFI is subject to noise due to the stochasticity of the IF, which can degrade its result. Finally, the choice of the explainability model is based on a trade-off between response time and precision.

Fig. 6. SHAP and Local-DIFFI feature importance ranking: (a)-(c) SHAP for Cases 1-3; (d)-(f) Local-DIFFI for Cases 1-3.
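The Kendall-Tau distance between two relevance rankings can be derived from the correlation coefficient returned by scipy. A small sketch with hypothetical rankings, using one common normalization to [0, 1] (0 for identical orderings, 1 for fully reversed ones), which is assumed here rather than taken from the paper:

```python
from scipy.stats import kendalltau

# Hypothetical relevance rankings of five features from two explainers:
# each list gives the rank position assigned to the same five features.
shap_rank = [1, 2, 3, 4, 5]
diffi_rank = [1, 3, 2, 4, 5]  # features 2 and 3 are swapped

tau, _ = kendalltau(shap_rank, diffi_rank)

# Map the correlation tau in [-1, 1] to a distance in [0, 1].
distance = (1 - tau) / 2
print(round(distance, 3))  # -> 0.1
```

Low distances, as in Table 7 for Case 2 (0.127), indicate that the two explainers order the features almost identically; higher values, as for Case 3 (0.455), indicate more disagreement further down the ranking.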
6. Conclusions
This paper proposes a new approach for fault detection and diagnosis in rotating machinery. A three-stage scheme is adopted: 1) feature extraction; 2) fault detection: anomaly detection; 3) fault diagnosis: unsupervised classification / root cause analysis. The vibration features in the time and frequency domains were extracted based on human knowledge already available. In the fault detection stage, the presence of a fault was verified in an unsupervised manner based on anomaly detection algorithms. Finally, in fault diagnosis, the diagnosis was performed through the feature importance ranking obtained by the model's explainability, either as unsupervised classification or as root cause analysis.

The results show that the proposed methodology enables unsupervised fault detection in rotating machinery. In addition to providing explainability about the models used, the methodology provides relevant information for root cause analysis, or even unsupervised fault classification.

Different state-of-the-art ML algorithms for anomaly detection were studied, showing the possibility of changing models according to the dataset. The new approach can be applied to different types of faults simply by modifying the extracted features associated with a potential fault, as shown for the 3 datasets studied. Since the approach does not require previously labeled data, only knowledge currently available on fault detection through vibration analysis, the methodology has many possible industrial applications.

Future work will focus on domain adaptation and transfer learning associated with methods for model interpretability to improve the applicability of the proposed approach in different industrial scenarios.
Acknowledgement
The authors gratefully acknowledge the Brazilian research funding agencies CNPq (National Council for Scientific and Technological Development) and CAPES (Federal Agency for the Support and Improvement of Higher Education) for their financial support of this work.
References

[1] R. Liu, B. Yang, E. Zio, X. Chen, Artificial intelligence for fault diagnosis of rotating machinery: A review, Mechanical Systems and Signal Processing 108 (2018) 33–47.
[2] M. Carletti, C. Masiero, A. Beghi, G. A. Susto, Explainable machine learning in industry 4.0: evaluating feature importance in anomaly detection to enable root cause analysis, in: 2019 IEEE International Conference on Systems, Man and Cybernetics (SMC), IEEE, 2019, pp. 21–26.
[3] P. Kumar, A. S. Hati, Review on machine learning algorithm based fault detection in induction motors, Archives of Computational Methods in Engineering (2020) 1–12.
[4] Y. Lei, B. Yang, X. Jiang, F. Jia, N. Li, A. K. Nandi, Applications of machine learning to machine fault diagnosis: A review and roadmap, Mechanical Systems and Signal Processing 138 (2020) 106587.
[5] A. Stetco, F. Dinmohammadi, X. Zhao, V. Robu, D. Flynn, M. Barnes, J. Keane, G. Nenadic, Machine learning methods for wind turbine condition monitoring: A review, Renewable Energy 133 (2019) 620–635.
[6] J. Ogata, M. Murakawa, Vibration-based anomaly detection using FLAC features for wind turbine condition monitoring, 8th European Workshop on Structural Health Monitoring (EWSHM 2016), July 5-8, 2016.
[7] A. von Birgelen, D. Buratti, J. Mager, O. Niggemann, Self-organizing maps for anomaly localization and predictive maintenance in cyber-physical production systems, Procedia CIRP 72 (2018) 480–485.
[8] N. Amruthnath, T. Gupta, A research study on unsupervised machine learning algorithms for early fault detection in predictive maintenance, 5th ICIEA, Singapore (2018) 355–361.
[9] Y. Zhang, P. Hutchinson, N. Lieven, J. Nunez-Yanez, Adaptive event-triggered anomaly detection in compressed vibration data, Mechanical Systems and Signal Processing 122 (2019) 480–501.
[10] T. Hasegawa, J. Ogata, M. Murakawa, T. Ogawa, Tandem connectionist anomaly detection: Use of faulty vibration signals in feature representation learning, IEEE International Conference on Prognostics and Health Management, Seattle (2018) 1–7.
[11] T. Hasegawa, J. Ogata, M. Murakawa, T. Ogawa, Adaptive training of vibration-based anomaly detector for wind turbine condition monitoring, Annual Conference of the PHM Society (2017) 1–8.
[12] C. Molnar, Interpretable Machine Learning, Lulu.com, 2020.
[13] M. Du, N. Liu, X. Hu, Techniques for interpretable machine learning, Communications of the ACM 63 (1) (2019) 68–77.
[14] F. Doshi-Velez, B. Kim, Towards a rigorous science of interpretable machine learning, arXiv preprint arXiv:1702.08608.
[15] Y. Lei, F. Jia, J. Lin, S. Xing, S. Ding, An intelligent fault diagnosis method using unsupervised feature learning towards mechanical big data, IEEE Transactions on Industrial Electronics 63 (2016) 3137–3147.
[16] W. Zhang, G. Peng, C. Li, Y. Chen, Z. Zhang, A new deep learning model for fault diagnosis with good anti-noise and domain adaptation ability on raw vibration signals, Sensors 17 (2017) 425.
[17] F. Jia, Y. Lei, N. Lu, S. Xing, Deep normalized convolutional neural network for imbalanced fault classification of machinery and its understanding via visualization, Mechanical Systems and Signal Processing 110 (2018) 349–367.
[18] T. Li, Z. Zhao, C. Sun, L. Cheng, X. Chen, R. Yan, R. X. Gao, WaveletKernelNet: An interpretable deep neural network for industrial intelligent diagnosis, arXiv:1911.07925v3 (2019) 1–9.
[19] F. B. Abid, M. Sallem, A. Braham, Robust interpretable deep learning for intelligent fault diagnosis of induction motors, IEEE Transactions on Instrumentation and Measurement 69 (2020) 3506–3515.
[20] X. Li, W. Zhang, Q. Ding, Understanding and improving deep learning-based rolling bearing fault diagnosis with attention mechanism, Signal Processing 161 (2019) 136–154.
[21] H. Chen, C. Lee, Vibration signals analysis by explainable artificial intelligence (XAI) approach: Application on bearing faults diagnosis, IEEE Access 8 (2020) 134246–134256.
[22] J. Grezmak, P. Wang, C. Sun, R. X. Gao, Explainable convolutional neural network for gearbox fault diagnosis, Procedia CIRP 80 (2019) 476–481.
[23] J. Grezmak, J. Zhang, P. Wang, K. A. Loparo, R. X. Gao, Interpretable convolutional neural network through layer-wise relevance propagation for machine fault diagnosis, IEEE Sensors Journal 20 (2020) 3172–3181.
[24] M. Saeki, J. Ogata, M. Murakawa, T. Ogawa, Visual explanation of neural network based rotation machinery anomaly detection system, IEEE International Conference on Prognostics and Health Management, San Francisco (2019) 1–4.
[25] K. Hendrickx, W. Meert, Y. Mollet, J. Gyselinck, B. Cornelis, K. Gryllias, J. Davis, A general anomaly detection framework for fleet-based condition monitoring of machines, Mechanical Systems and Signal Processing 139 (2020) 106585.
[26] Y. Zhao, Z. Nasrullah, Z. Li, PyOD: A python toolbox for scalable outlier detection, Journal of Machine Learning Research 20 (96) (2019) 1–7.
[27] L. Meneghetti, M. Terzi, S. Del Favero, G. A. Susto, C. Cobelli, Data-driven anomaly recognition for unsupervised model-free fault detection in artificial pancreas, IEEE Transactions on Control Systems Technology.
[28] A. K. Rai, R. K. Dwivedi, Fraud detection in credit card data using unsupervised machine learning based scheme, in: 2020 International Conference on Electronics and Sustainable Communication Systems (ICESC), IEEE, 2020, pp. 421–426.
[29] T. Barbariol, E. Feltresi, G. A. Susto, Self-diagnosis of multiphase flow meters through machine learning-based anomaly detection, Energies 13 (12) (2020) 3136.
[30] E. Knorr, R. Ng, Algorithms for mining distance-based outliers in large datasets, in: Proceedings of the 24th International Conference on Very Large Data Bases (1998) 392–403.
[31] S. Ramaswamy, R. Rastogi, K. Shim, Efficient algorithms for mining outliers from large data sets, ACM SIGMOD Record 29 (2000) 427–438.
[32] P. J. Rousseeuw, K. V. Driessen, A fast algorithm for the minimum covariance determinant estimator, Technometrics 41 (3) (1999) 212–223.
[33] M. M. Breunig, H.-P. Kriegel, R. T. Ng, J. Sander, LOF: identifying density-based local outliers, ACM SIGMOD Record 29 (2000) 93–104.
[34] H.-P. Kriegel, P. Kröger, E. Schubert, A. Zimek, LoOP: local outlier probabilities, in: Proceedings of the 18th ACM Conference on Information and Knowledge Management, 2009, pp. 1649–1652.
[35] E. Schubert, R. Wojdanowski, A. Zimek, H.-P. Kriegel, On evaluation of outlier rankings and outlier scores, in: Proceedings of the 2012 SIAM International Conference on Data Mining, SIAM, 2012, pp. 1047–1058.
[36] Z. He, X. Xu, S. Deng, Discovering cluster-based local outliers, Pattern Recognition Letters 24 (9-10) (2003) 1641–1650.
[37] B. Schölkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, R. C. Williamson, Estimating the support of a high-dimensional distribution, Neural Computation 13 (7) (2001) 1443–1471.
[38] A. Lazarevic, V. Kumar, Feature bagging for outlier detection, in: Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining (2005) 157–166.
[39] H.-P. Kriegel, M. Schubert, A. Zimek, Angle-based outlier detection in high-dimensional data, in: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2008) 444–452.
[40] F. T. Liu, K. M. Ting, Z.-H. Zhou, Isolation forest, in: Proceedings of the 2008 IEEE International Conference on Data Mining (ICDM'08), IEEE (2008) 413–422.
[41] F. T. Liu, K. M. Ting, Z.-H. Zhou, Isolation-based anomaly detection, ACM Transactions on Knowledge Discovery from Data 6 (2012) 1–39.
[42] T. Pevný, Loda: lightweight on-line detector of anomalies, Machine Learning 102 (2) (2016) 275–304.
[43] S. M. Lundberg, S.-I. Lee, A unified approach to interpreting model predictions, Advances in Neural Information Processing Systems 30 (2017) 4765–4774.
[44] M. Carletti, M. Terzi, G. A. Susto, Interpretable anomaly detection with DIFFI: Depth-based feature importance for the Isolation Forest, arXiv preprint arXiv:2007.11117 (2020) 1–12.
[45] L. Ciabattoni, F. Ferracuti, A. Freddi, A. Monteriù, Statistical spectral analysis for fault diagnosis of rotating machines, IEEE Transactions on Industrial Electronics 65 (5) (2018) 4301–4310.
[46] P. D. Samuel, D. J. Pines, A review of vibration-based techniques for helicopter transmission diagnostics, Journal of Sound and Vibration 282 (1) (2005) 475–508.
[47] F. Dalvand, S. Dalvand, F. Sharafi, M. Pecht, Current noise cancellation for bearing fault diagnosis using time shifting, IEEE Transactions on Industrial Electronics 64 (10) (2017) 8138–8147.
[48] Y. Wei, Y. Li, M. Xu, W. Huang, A review of early fault diagnosis approaches and their applications in rotating machinery, Entropy 21 (4) (2019) 409.
[49] Y. Lei, Z. He, Y. Zi, Q. Hu, Fault diagnosis of rotating machinery based on multiple ANFIS combination with GAs, Mechanical Systems and Signal Processing 21 (5) (2007) 2280–2294.
[50] Y. Lei, M. J. Zuo, Z. He, Y. Zi, A multidimensional hybrid intelligent method for gear fault diagnosis, Expert Systems with Applications 37 (2) (2010) 1419–1430.
[51] H. Qiu, J. Lee, J. Lin, G. Yu, Wavelet filter-based weak signature detection method and its application on rolling element bearing prognostics, Journal of Sound and Vibration 289 (4) (2006) 1066–1090.
[52] V. Bolón-Canedo, N. Sánchez-Maroño, A. Alonso-Betanzos, A review of feature selection methods on synthetic data, Knowledge and Information Systems 34 (2013) 483–519.
[53] K. Zhang, Y. Li, P. Scarf, A. Ball, Feature selection for high-dimensional machinery fault diagnosis data using multiple models and radial basis function networks, Neurocomputing 74 (17) (2011) 2941–2952.
[54] X. Zhang, Q. Zhang, M. Chen, Y. Sun, X. Qin, H. Li, A two-stage feature selection and intelligent fault diagnosis method for rotating machinery using hybrid filter and wrapper method, Neurocomputing 275 (2018) 2426–2439.
[55] Y. Lei, M. J. Zuo, Gear crack level identification based on weighted K nearest neighbor classification algorithm, Mechanical Systems and Signal Processing 23 (5) (2009) 1535–1547.
[56] Y. Li, Y. Yang, G. Li, M. Xu, W. Huang, A fault diagnosis scheme for planetary gearboxes using modified multi-scale symbolic dynamic entropy and mRMR feature selection, Mechanical Systems and Signal Processing 91 (2017) 295–312.
[57] M. Singh, A. G. Shaik, Faulty bearing detection, classification and location in a three-phase induction motor based on Stockwell transform and support vector machine, Measurement 131 (2019) 524–533.
[58] W. A. Smith, R. B. Randall, Rolling element bearing diagnostics using the Case Western Reserve University data: A benchmark study, Mechanical Systems and Signal Processing 64-65 (2015) 100–131.
[59] P. Cao, S. Zhang, J. Tang, Preprocessing-free gear fault diagnosis using small datasets with deep convolutional neural network-based transfer learning, IEEE Access 6 (2018) 26241–26253.
[60] J. N. Brito, R. Pederiva, Using artificial intelligence tools to detect problems in induction motors, in: Proceedings of the 1st International Conference on Soft Computing and Intelligent Systems (International Session of 8th SOFT Fuzzy Systems Symposium) and 3rd International Symposium on Advanced Intelligent Systems (SCIS and ISIS 2002) 1 (2002) 1–6.
[61] J. N. Brito, R. Pederiva, A hybrid neural/expert system to diagnose problems in induction motors, Proceedings of the 17th International Congress of Mechanical Engineering 17 (2003) 1–9.
[62] D. Hawkins, Identification of Outliers, Chapman and Hall, 1980.