Semi-supervised and Unsupervised Methods for Heart Sounds Classification in Restricted Data Environments
Balagopal Unnikrishnan, Pranshu Ranjan Singh, Xulei Yang, Matthew Chin Heng Chua
Abstract—Automated heart sounds classification is a much-needed diagnostic tool in view of the increasing incidence of heart-related diseases worldwide. In this study, we conduct a comprehensive study of heart sounds classification using various supervised, semi-supervised and unsupervised approaches on the PhysioNet/CinC 2016 Challenge dataset. Supervised approaches, including deep learning and machine learning methods, require large amounts of labelled data to train the models, which are challenging to obtain in most practical scenarios. In view of the need to reduce the labelling burden for clinical practices, where human labelling is both expensive and time-consuming, semi-supervised or even unsupervised approaches in restricted data settings are desirable. A GAN-based semi-supervised method is therefore proposed, which allows the usage of unlabelled data samples to boost the learning of the data distribution. It achieves a better performance in terms of AUROC over the supervised baseline when limited data samples exist. Furthermore, several unsupervised methods are explored as an alternative approach by treating the given problem as an anomaly detection scenario. In particular, the unsupervised feature extraction using a 1D CNN Autoencoder coupled with a one-class SVM obtains good performance without any data labelling. The proposed semi-supervised and unsupervised methods may lead to a future workflow tool for the creation of higher-quality datasets.
Index Terms—Heart Sounds Classification, Semi-supervised Learning, Unsupervised Learning, Generative Adversarial Networks, One-Class Support Vector Machines.
I. INTRODUCTION

Cardiovascular diseases (CVDs) are the main cause of death globally: 17.9 million deaths have been attributed to CVDs, representing 31% of all global deaths [1]. There is a need for methods for first-hand examination of the cardiovascular system. Auscultation of heart sounds, or Phonocardiogram (PCG) signals, is a crucial component of physical examination and can help detect cardiac conditions such as arrhythmia, valve disease, heart failure, and more [2]. Physicians have long assessed the condition of the heart by auscultation. However, designing an accurate and automated system for detection of abnormal heart sounds is challenging due to the unavailability of rigorously validated, high-quality heart sounds datasets [3].

Apart from PCG signals, Electrocardiogram (ECG) signals have been used for detecting arrhythmia, myocardial ischemia and chronic alterations [4, 5]. Although ECG signals can reveal various intricate and abnormal heart behaviours, symptoms such as heart murmurs are concealed from an ECG signal [6]. The use of heart sounds to detect various heart abnormalities has led to the development of a wide range of algorithms. In [7], PCG signals undergo digital subtraction analysis to detect and characterize heart murmurs. Automated classification methods for heart sounds involve approaches such as Support Vector Machines (SVM) [8], Neural Networks [9], probability-based methods [10] and ensembles of various classifiers [11].

The design of supervised methods for heart sounds classification requires large amounts of labelled data. However, it is often difficult, expensive, or time-consuming to obtain additional labelled data [12]. There are challenges in obtaining patient data in the medical domain; furthermore, multiple physicians have to perform labelling in order to reach a common consensus.
Semi-supervised learning and active learning methods address this problem by utilizing available unlabelled data along with the labelled data to build better classifier models [13]. Chamberlain et al. demonstrate automatic lung sounds classification using a semi-supervised deep learning algorithm [14]. Transfer learning for supervised heart sounds classification and data augmentation for minority-class (abnormal category) samples are some of the areas being explored to improve performance over traditional supervised classification methods [15, 16].

In most cases, abnormal samples are far fewer than normal samples, which leads to a class imbalance when performing classification tasks [17]. It is both time-consuming and expensive to collect abnormal samples. There have been works that perform clustering on features extracted from heart sounds, followed by classification [18]. In anomaly detection methods, the model is trained only on normal samples but tested with both normal and abnormal samples [19].

In this work, the focus is on exploring current and new supervised, semi-supervised and unsupervised methods for heart sounds classification. The main contributions of this work are:
(i) Analysis of the performance of various supervised methods for heart sounds classification;
(ii) Utilization of a Generative Adversarial Network (GAN)-based semi-supervised technique to obtain better performance in terms of Area Under the Receiver Operating Characteristic curve (AUROC) as compared to the supervised benchmark; and
(iii) Learning of latent representations from heart sounds features using a 1D Convolutional Neural Network (CNN) model (unsupervised method) together with anomaly detection algorithms, and evaluation of the classification performance using the AUROC metric.
The methods and experimental analysis are discussed in detail in the following sections.

II. METHODOLOGY
This section describes the data and the methods used in this study. The sub-sections Dataset and Data Preparation describe the dataset used and the feature extraction methods for heart sounds, respectively. Subsequent sub-sections explain the techniques used for heart sounds classification using supervised, semi-supervised and unsupervised methods.
A. Dataset
The heart sounds dataset used for this study was provided by the 2016 PhysioNet/Computing in Cardiology Challenge [2]. It contains 3,240 labelled heart sounds recordings. The dataset is divided into two classes, Normal and Abnormal. Fig. 1 shows the heart sounds signal for a normal and an abnormal sample. The duration of the recordings ranges from 5 seconds (short period) to 120 seconds (long period). The dataset was obtained by combining various heart sounds databases and consists of six sub-datasets, labelled A, B, C, D, E and F, as shown in Fig. 2.

The recordings were collected from nine different locations on the body; the four major locations are the aortic, pulmonic, tricuspid and mitral areas. The normal recordings correspond to healthy subjects, whereas the abnormal ones were obtained from patients with a confirmed cardiac diagnosis, typically heart valve defects and coronary artery disease. The presence of noise in some samples is due to the uncontrolled environment of the recordings; noise sources include talking, stethoscope motion, breathing and intestinal sounds.
B. Data Preparation
For this study, various features obtained from the heart sounds signals are used to train different models. The raw signal undergoes pre-processing steps such as padding and pruning. In the padding operation, all samples are zero-padded to the length of the longest signal in the dataset (120 seconds). In the pruning operation, all signals are truncated to the length of the shortest signal in the dataset (5 seconds).

The different types of features extracted from the heart sounds signal are shown in Fig. 3. For semi-supervised methods, the raw processed signal is used as input. For supervised methods, both the padded and the pruned signals are used to obtain the spectrogram and mel-spectrogram features. Both are plotted with time on the x-axis and frequency on the y-axis, and the plots are saved as colour images of resolution 64 x 64 x 3 and 128 x 128 x 3, respectively.
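To make the two pre-processing variants concrete, here is a minimal numpy sketch; the helper names are our own, and the 2000 Hz sampling rate follows the dataset description in Fig. 1.

```python
import numpy as np

SR = 2000                 # PCG sampling rate (Hz), per the dataset description
MAX_LEN = 120 * SR        # padding target: length of the longest signal (120 s)
MIN_LEN = 5 * SR          # pruning target: length of the shortest signal (5 s)

def zero_pad(signal: np.ndarray, target_len: int = MAX_LEN) -> np.ndarray:
    """Right-pad a 1-D signal with zeros up to target_len samples."""
    pad = max(0, target_len - len(signal))
    return np.pad(signal, (0, pad))

def prune(signal: np.ndarray, target_len: int = MIN_LEN) -> np.ndarray:
    """Keep only the first target_len samples of a 1-D signal."""
    return signal[:target_len]
```

A 10-second recording, for example, becomes 240,000 samples after padding and 10,000 samples after pruning.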
Fig. 1. The heart sounds signal for the normal class (top) and the abnormal class (bottom). The x-axis represents the time steps and the y-axis the signal value. The sampling rate of the signal is 2000 Hz.

Fig. 2. The 2016 PhysioNet/Computing in Cardiology Challenge dataset distribution. The dataset was obtained by combining heart sounds databases collected independently by various research teams. The individual datasets are labelled A, B, C, D, E and F. The distribution of normal and abnormal samples differs across sub-datasets.
Audio features such as Mel-Frequency Cepstral Coefficients (MFCCs), Chroma [20], mel-scaled spectrogram (mel-spectrogram), spectral contrast [21] and tonal centroid features (Tonnetz) [22] were extracted from the heart sounds signals. MFCCs, Chroma, mel-spectrogram, spectral contrast and Tonnetz contribute 40, 12, 128, 7 and 6 features, respectively; these are appended to form a combined feature list of 193 features. The extracted audio features are used in the supervised methods and in the unsupervised methods (for anomaly detection). Since there is a class imbalance, oversampling of the minority class (Abnormal) is performed on the audio features using the Synthetic Minority Over-sampling Technique (SMOTE) [23].
Fig. 3. Feature extraction from the heart sounds signal. Various features are extracted to support the different heart sounds classification techniques. Spectrograms and mel-spectrograms are obtained by converting the PCG signals to images. Audio features are obtained by appending MFCC, Chroma, mel-spectrogram, spectral contrast and Tonnetz features.
C. Supervised Methods for Heart Sounds Classification
The various supervised methods used for heart sounds classification can be grouped into four clusters:
(i) Transfer learning using pre-trained deep learning models on spectrogram/mel-spectrogram images;
(ii) Custom CNN on spectrogram images;
(iii) Deep learning models on extracted audio features; and
(iv) Machine learning models on extracted audio features.
The details of each method are described below.
1) Transfer Learning using Pre-trained Deep Learning Models on Spectrogram/Mel-spectrogram Images:
Transfer learning in CNNs has shown that image representations learnt on a large-scale labelled dataset can be transferred to classification tasks with limited data samples [24]. ResNet-50 [25], Inception-v3 [26] and DenseNet-121 [27] have shown state-of-the-art classification results on the ImageNet dataset. The spectrograms and mel-spectrograms obtained from the heart sounds signals are converted to 64 x 64 x 3 images (see the Data Preparation sub-section). These images are used to train ImageNet pre-trained ResNet-50, Inception-v3 and DenseNet-121 models: the output of the final convolutional layer of each model is fed to a fully-connected single-node layer for classification into the Normal or Abnormal class.
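A minimal tf.keras sketch of this setup is given below; it is our own scaffolding, not the authors' code. The pooling layer bridging the backbone to the single-node classifier is our choice, and weights=None stands in for weights="imagenet" so the sketch runs without downloading pre-trained weights.

```python
import tensorflow as tf

# One of the three backbones, truncated after its final convolutional block
base = tf.keras.applications.ResNet50(
    weights=None,              # use weights="imagenet" for actual transfer learning
    include_top=False,         # drop the 1000-class ImageNet head
    input_shape=(64, 64, 3),   # spectrogram images from Data Preparation
)
model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),        # pool conv features (our choice)
    tf.keras.layers.Dense(1, activation="sigmoid"),  # single node: Normal vs Abnormal
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["AUC"])
```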
2) Custom CNN on Spectrogram Images:
The spectrogram obtained from the heart sounds signal is converted to a 128 x 128 x 3 image. These images are fed to a custom-designed CNN which follows a VGG-like architecture [28]; the architecture is given in Table I. The input spectrogram image passes through a series of convolution and pooling layers, with dense layers towards the end of the network to output the class of the heart sounds signal. ReLU activation [29] is used for the convolutional and dense layers, except for the final dense layer, which uses Sigmoid activation. Dropout layers are added to prevent the model from over-fitting the training set [30].
3) Deep Learning Models on Extracted Audio Features:
The audio features extracted from the heart sounds signals
TABLE I
CUSTOM CNN ARCHITECTURE ON SPECTROGRAM IMAGES
Layers | Attributes
Convolution 2D | 16 filters, 3 x 3 kernel, ReLU activation, padding=same
Convolution 2D | 16 filters, 3 x 3 kernel, ReLU activation, padding=same
MaxPool 2D | 2 x 2 kernel, stride=2
Convolution 2D | 32 filters, 3 x 3 kernel, ReLU activation, padding=same
Convolution 2D | 32 filters, 3 x 3 kernel, ReLU activation, padding=same
MaxPool 2D | 2 x 2 kernel, stride=2
Convolution 2D | 64 filters, 3 x 3 kernel, ReLU activation, padding=same
Convolution 2D | 64 filters, 3 x 3 kernel, ReLU activation, padding=same
MaxPool 2D | 2 x 2 kernel, stride=2
Convolution 2D | 128 filters, 3 x 3 kernel, ReLU activation, padding=same
Convolution 2D | 128 filters, 3 x 3 kernel, ReLU activation, padding=same
MaxPool 2D | 2 x 2 kernel, stride=2
Flatten & Dropout | dropout rate=0.25
Dense | 256 nodes, ReLU activation
Dropout | dropout rate=0.25
Dense | 1 node, Sigmoid activation
TABLE II
NEURAL NETWORK WITH LSTM UNITS ON EXTRACTED AUDIO FEATURES
Layers | Attributes
LSTM | 128 units, dropout=0.2, recurrent dropout=0.25
Dropout | dropout rate=0.25
LSTM | 64 units, dropout=0.2, recurrent dropout=0.25
Dense | 1 node, Sigmoid activation

undergo oversampling using SMOTE to obtain an equal number of samples for the Normal and Abnormal classes. These features are then modelled using a Dense Neural Network (Dense NN), a Neural Network with Long Short-Term Memory units (LSTM NN) [31] and a 1D CNN. The Dense NN takes a feature list of dimension 193 as input and passes it through four densely connected layers, each with 128 nodes and ReLU activation, followed by a single-node densely connected layer with Sigmoid activation. The LSTM NN takes a feature list of dimension 193 x 1 as input and passes it through a series of LSTM units and densely connected layers; its architecture is given in Table II. The LSTM units are useful for modelling sequential data, and the final densely connected node performs the classification. The 1D CNN takes a feature list of dimension 193 x 1 as input and passes it through a series of 1D convolutional, 1D pooling and densely connected layers, as depicted in Table III.
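As a rough, runnable illustration, the Dense NN variant can be approximated with scikit-learn's MLPClassifier. This is a stand-in for the original deep-learning implementation: the optimizer settings are our own assumptions, and synthetic data replaces the real extracted audio features.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Four hidden layers of 128 ReLU nodes on 193-dimensional feature vectors,
# with a probabilistic output for the binary Normal-vs-Abnormal decision.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 193))               # stand-in for 193 audio features
y = (X[:, :5].sum(axis=1) > 0).astype(int)    # toy Normal/Abnormal labels

dense_nn = MLPClassifier(hidden_layer_sizes=(128, 128, 128, 128),
                         activation="relu", max_iter=300, random_state=0)
dense_nn.fit(X, y)
proba = dense_nn.predict_proba(X)[:, 1]       # abnormal-class probability
```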
4) Machine Learning Models on Extracted Audio Features:
The extracted audio features are used to fit machine learning models such as Decision Tree, SVM, Random Forest and
TABLE III
1D CNN ON EXTRACTED AUDIO FEATURES
Layers | Attributes
Convolution 1D | 128 filters, kernel size=3, ReLU activation
Convolution 1D | 128 filters, kernel size=3, ReLU activation
MaxPool 1D | kernel size=3, stride=3
Convolution 1D | 256 filters, kernel size=3, ReLU activation
Convolution 1D | 256 filters, kernel size=3, ReLU activation
MaxPool 1D | kernel size=3, stride=3
Convolution 1D | 512 filters, kernel size=3, ReLU activation
Convolution 1D | 512 filters, kernel size=3, ReLU activation
Flatten |
Dense | 256 nodes, ReLU activation
Dense | 128 nodes, ReLU activation
Dense | 1 node, Sigmoid activation
Gradient Boosting. For each machine learning method, the model was fitted on the training set, its hyper-parameters were tuned on a validation set, and it was finally evaluated on a test set. The hyper-parameters and their values for each machine learning model are provided in Table IV. For Decision Tree modelling, the hyper-parameters are criterion (the function that measures the quality of a split), max depth (the maximum depth of the tree), max leaf nodes (the maximum number of leaf/terminal nodes) and class weight. The class weight hyper-parameter represents the ratio (abnormal : normal) of the weights associated with the classes. The hyper-parameters used in SVM modelling are C (the penalty factor of the error term), kernel (the type of kernel used in the algorithm), gamma (the kernel coefficient) and class weight; the kernel used in the SVM model was the radial basis function (rbf). The hyper-parameters used in Random Forest modelling are criterion, number of estimators (the number of trees in the forest), max depth and max leaf nodes. For Gradient Boosting, the hyper-parameters are number of estimators (the number of boosting stages), max depth and learning rate (which reduces the contribution of each tree). These hyper-parameters are tuned for each machine learning model for two cases: without SMOTE balancing and with SMOTE balancing.

D. Semi-supervised Method: Generative Adversarial Network
Generative adversarial networks (GANs) provide a way of generating fake samples and utilizing them for other tasks [32]. The semi-supervised method makes use of GANs to exploit the unlabelled data samples. The semi-supervised models have access to both the labelled and the unlabelled data from the training set. In theory, such models should perform better than the supervised methods, as they now have access to unlabelled training data, provided the semi-supervised smoothness assumption holds, i.e. if two points x1, x2 are close in a high-density region, their labels y1, y2 are also close [33]. This class of semi-supervised algorithms is called generative models, and they are generally trained in a coupled fashion, similar to the training procedure of GANs. Fig. 4 and Fig. 5 show the GAN training and testing frameworks. A combined loss function, as given in equations (1) to
TABLE IV
HYPER-PARAMETERS FOR VARIOUS MACHINE LEARNING MODELS ON EXTRACTED AUDIO FEATURES
ML Model | Hyper-parameter | Value
Decision Tree | Criterion | Entropy
 | Max Depth |
 | Max Leaf Nodes |
 | Class weight |
Decision Tree (with SMOTE) | Criterion | Entropy
 | Max Depth |
 | Max Leaf Nodes |
SVM | C |
 | Kernel | rbf
 | Gamma | auto
 | Class weight |
SVM (with SMOTE) | C |
 | Kernel | rbf
 | Gamma | auto
Random Forest | Criterion | Entropy
 | No. of estimators |
 | Max Depth |
 | Max Leaf Nodes |
 | Class weight |
Random Forest (with SMOTE) | Criterion | Entropy
 | No. of estimators |
 | Max Depth |
 | Max Leaf Nodes |
Gradient Boosting | No. of estimators |
 | Max Depth |
 | Learning rate |
Gradient Boosting (with SMOTE) | No. of estimators |
 | Max Depth |
 | Learning rate |
(4), is used to train the discriminator and the generator; the reformulation trick is used as depicted in [34]. The generator is trained by matching the features of the generated samples and of the real samples. The supervised loss is similar to the cross-entropy loss in K-class classification problems, while the unsupervised loss helps in distinguishing between real and fake samples. This coupled training in an adversarial setting is used to train the semi-supervised network. The network architectures of the discriminator and the generator are provided in Table V and Table VI; 1D convolutions are used in both. They are highly effective here because convolution operations are translation and scale invariant and can pick up relevant features anywhere within the input, which is useful since the heart sounds are not segmented or aligned in any fashion. The first 5 seconds of each heart sounds recording is taken directly as input for the semi-supervised method. In the overall training setup, only a minimal amount of annotation or labelling is required.

E. Unsupervised Method: Anomaly Detection
For the purpose of obtaining good performance in restricted data environments, the method of anomaly detection was explored. In the anomaly detection scenario, the model is trained using just the normal class samples. Any abnormality or deviation from normality is considered a disease case (abnormal class). This has two major advantages: (i) it can perform the entire training without the need for any labels (no need
Fig. 4. Semi-supervised GAN training framework. The Generator (G) takes random noise z as input and produces a generated sample x_generated. The Discriminator (D) takes the generated samples, labelled real samples (x_labelled, y) and unlabelled real samples x_unlabelled, and produces the prediction of the class label and the intermediate-layer output M(x).

Fig. 5. Semi-supervised GAN testing framework. During the testing phase, only the Discriminator (D) is used. The test sample is fed to the Discriminator to obtain the class prediction. The predicted class, together with the ground-truth class, is used to obtain the AUROC metric.

TABLE V
SEMI-SUPERVISED GAN DISCRIMINATOR ARCHITECTURE
Layers | Attributes
Convolution 1D | 64 filters, kernel size=8, stride=1, LeakyReLU activation
Convolution 1D | 64 filters, kernel size=8, stride=2, LeakyReLU activation
Convolution 1D | 128 filters, kernel size=8, stride=2, LeakyReLU activation
Convolution 1D | 256 filters, kernel size=8, stride=2, LeakyReLU activation
Convolution 1D | 256 filters, kernel size=8, stride=2, LeakyReLU activation
Convolution 1D | 256 filters, kernel size=8, stride=2, LeakyReLU activation
Convolution 1D | 256 filters, kernel size=8, stride=2, LeakyReLU activation
Convolution 1D | 256 filters, kernel size=8, stride=2, LeakyReLU activation
Adaptive Avg Pooling 1D | output size=1
Flatten | Intermediate Layer Output
Dense | 2 nodes (number of classes)
TABLE VI
SEMI-SUPERVISED GAN GENERATOR ARCHITECTURE
Layers | Attributes
Dense | 256*33 (=8448) nodes, batch norm 1D, ReLU activation
Reshape | Reshape to 256 x 33
Conv Transpose 1D | 256 filters, kernel size=8, stride=2, padding=0, batch norm 1D, ReLU activation
Conv Transpose 1D | 256 filters, kernel size=8, stride=2, padding=0, batch norm 1D, ReLU activation
Conv Transpose 1D | 256 filters, kernel size=8, stride=2, padding=0, batch norm 1D, ReLU activation
Conv Transpose 1D | 256 filters, kernel size=8, stride=2, padding=0, batch norm 1D, ReLU activation
Conv Transpose 1D | 256 filters, kernel size=8, stride=2, padding=0, batch norm 1D, ReLU activation
Conv Transpose 1D | 128 filters, kernel size=8, stride=2, padding=1, batch norm 1D, ReLU activation
Conv Transpose 1D | 64 filters, kernel size=8, stride=2, padding=1, batch norm 1D, ReLU activation
Conv Transpose 1D | 1 filter, kernel size=8, stride=1, padding=0, tanh activation

for abnormal class samples), and (ii) it provides an anomaly score, which can be used to obtain a relative grade of the abnormal samples and can be utilized in applications such as triaging.

Two anomaly detection algorithms, and two sets of features used to train them, are considered for evaluation. The two algorithms are One-Class SVM [35] and Isolation Forest [36]. In the One-Class SVM algorithm, the normal samples are enclosed within a hyper-sphere or hyper-plane, and everything outside it is considered anomalous; the distance from the separating plane determines the degree of abnormality. In Isolation Forest, the samples are split randomly during training using isolation trees, and the average path length over the tree forest is taken as a measure of abnormality. Anomalous samples are more susceptible to isolation during splitting and hence have shorter average path lengths, which can be used to distinguish anomalies from normal samples.

During training, a stack of 1D convolution layers and 1D convolution-upsampling layers is combined to serve as
Loss_discriminator = Loss_unsupervised + Loss_supervised    (1)

Loss_supervised = -E_(x,y)[ log P_D(y | x, y < K + 1) ],  K = number of classes    (2)

Loss_unsupervised = -E_x[ log(1 - P_D(y = K + 1 | x)) ] - E_x_g[ log P_D(y = K + 1 | x) ]    (3)

Loss_generator = || E_x[ M(x) ] - E_x_g[ M(G(z)) ] ||    (4)

Fig. 6. Unsupervised anomaly detection framework using the 1D CNN Autoencoder. The Autoencoder takes an input sample x and produces the reconstructed sample x_recon. The latent representations z from the Autoencoder and the reconstruction loss are used as features for the anomaly detection methods, Isolation Forest and One-Class SVM.

an auto-encoder. The audio features extracted from the heart sounds are provided to the auto-encoder for reconstruction. Two features from the 1D CNN Autoencoder serve as input features for Isolation Forest and One-Class SVM:
(i) Reconstruction loss: the difference between the actual input and the output sample reconstructed by the 1D CNN Autoencoder. The intuition is that the reconstruction loss will be higher for anomalous samples at test time, since the model is trained to reconstruct only normal samples and cannot accurately reconstruct anomalous ones. The reconstruction loss is defined in equation (5):

Loss_reconstruction = | X_input - X_rec |    (5)

(ii) Latent representations: latent representations, or embeddings, are the output of the bottleneck layer, i.e. the last layer of the encoder. During training, the latent representation provides a set of features that represent the training samples; this feature set helps discriminate between normal and anomalous samples.
The overall anomaly detection framework is shown in Fig. 6, and the autoencoder network structure is provided in Table VII. Two modes of training were used for the 1D CNN autoencoder: in the first case, the training data consists purely of normal samples; in the second case, the data is contaminated with abnormal samples as well.
This helps in evaluating the utility of the method in use cases where there is no filter to prevent abnormal samples from being used, such as screening applications, where the data can be a mix of both normal and abnormal samples but the proportion of anomalous data is small. A contamination level of 8% to 12% is a reasonable assumption, as the prevalence of heart disease in the general population is roughly 10% [37]. During the training phase, the latent representations and reconstruction losses obtained from the auto-encoder are used to train the two anomaly detection algorithms. For both algorithms, experiments are conducted on clean data (only normal samples) and on contaminated data (normal and abnormal samples mixed).

III. COMPUTATIONS AND RESULTS
This section describes the experiments performed to evaluate the methods discussed in the previous section. The Computational Setup and Evaluation Metrics sub-sections describe the training setup and the metrics used to validate performance on a test set. Subsequent sub-sections describe the results obtained for the supervised, semi-supervised and unsupervised methods.
A. Computational Setup
The computations for the supervised methods of heart sounds classification utilize the entire dataset of 3,240 samples: 20% (648 samples) was held out for testing and the remaining 80% was used for training, further divided into training (90%, 2,333 samples) and validation (10%, 259 samples) sets. The training set was used

TABLE VII
1D CNN AUTOENCODER ARCHITECTURE
Layers | Attributes
Convolution 1D | 64 filters, kernel size=3, padding=same
MaxPool 1D | kernel size=2, stride=2
Convolution 1D | 64 filters, kernel size=3, padding=same
MaxPool 1D | kernel size=2, stride=2
Convolution 1D | 32 filters, kernel size=3, padding=same
MaxPool 1D | kernel size=2, stride=2
Convolution 1D | 16 filters, kernel size=3, padding=same
MaxPool 1D | kernel size=2, stride=2
Convolution 1D | 8 filters, kernel size=3, padding=same
MaxPool 1D | kernel size=2, stride=2
Flatten | Latent Space
Reshape | Reshape to 12 x 8
Convolution 1D | 8 filters, kernel size=3, padding=same
Upsampling 1D | size=2
Convolution 1D | 16 filters, kernel size=3, padding=same
Upsampling 1D | size=2
Convolution 1D | 32 filters, kernel size=3, padding=same
Upsampling 1D | size=2
Convolution 1D | 64 filters, kernel size=3, padding=same
Upsampling 1D | size=2
Zero Padding 1D | 0 x 1
Convolution 1D | 1 filter, kernel size=3, padding=same

for model fitting, and the validation set was used to tune the hyper-parameters of the model.

The computations for the semi-supervised and unsupervised methods utilize only sub-dataset E (2,141 samples). Since each sub-dataset was collected from a different source, there may be some bias associated with each one. 20% of sub-dataset E (429 samples) was used for testing and the remaining 80% for training, further divided into training (90%, 1,540 samples) and validation (10%, 172 samples) sets.

B. Evaluation Metrics
The supervised methods are evaluated using the standard classification metrics: sensitivity, specificity and accuracy. An additional metric, MAcc, defined as the average of sensitivity and specificity, is also used [38]. For the semi-supervised evaluation, the idea is to compare the performance of the supervised baseline against the semi-supervised method across different amounts of labelled data, mimicking a clinical use case where the number of labelled samples is limited. As there is class imbalance, AUROC (Area Under the Receiver Operating Characteristic curve) is used as the metric for comparing the supervised and semi-supervised models. AUROC not only accounts for the class imbalance but is also insensitive to the cutoff value chosen for the class predictions.
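The cutoff-insensitivity of AUROC can be illustrated in a few lines: the metric depends only on how the scores rank the samples, so any monotone rescaling of the scores (i.e. any choice of decision threshold) leaves it unchanged. The toy scores below are illustrative only.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])   # imbalanced, like the dataset
scores = np.array([0.10, 0.20, 0.15, 0.30, 0.05,
                   0.20, 0.25, 0.10, 0.80, 0.22])    # classifier scores

auc = roc_auc_score(y_true, scores)                  # 14 of 16 pos/neg pairs ranked correctly
auc_rescaled = roc_auc_score(y_true, 0.5 * scores + 0.2)  # monotone rescaling
```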
C. Results and Discussion
Table VIII shows the results for the various supervised methods discussed in this study. DenseNet-121 on mel-spectrograms (with padding) and Decision Tree on extracted audio features achieved the best specificity and sensitivity, respectively. Gradient Boosting on extracted audio features (with SMOTE balancing) achieved the best accuracy and MAcc. Table IX compares the Gradient Boosting method with the methods reported in the PhysioNet/CinC 2016 Challenge.

For the semi-supervised computations, the percentage of labelled data provided to the models is gradually increased, and the models are compared by their AUROC scores. Fig. 7 plots AUROC against the percentage of labelled data for the supervised baseline and the semi-supervised method. The semi-supervised model outperforms the supervised baseline even with very few labelled samples: even with 4 or 8 labelled samples, it beats the supervised baseline. This observation can be explained as follows:
(i) The large unlabelled training set helps approximate the overall data distribution, which allows for a much better decision boundary than the supervised method, which can only account for labelled samples;
(ii) The unlabelled data has a regularization effect on the classification network, as the semi-supervised training follows a coupled adversarial training procedure.
Higher AUROC is obtained as more and more labelled samples are used for classification.
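To make the loss algebra of equations (1) to (4) concrete, here is a toy numpy sketch. It is an illustration only, not the training code: softmax outputs over the K real classes plus one fake class stand in for the discriminator, and random vectors stand in for the intermediate features M(x).

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

K = 2                                            # Normal / Abnormal; index K is the fake class
rng = np.random.default_rng(0)
p_lab = softmax(rng.normal(size=(8, K + 1)))     # labelled real samples
y_lab = rng.integers(0, K, size=8)
p_unl = softmax(rng.normal(size=(16, K + 1)))    # unlabelled real samples
p_gen = softmax(rng.normal(size=(16, K + 1)))    # generated (fake) samples

# (2) supervised loss: cross-entropy over the K real classes only
p_real = p_lab[:, :K] / p_lab[:, :K].sum(axis=1, keepdims=True)
loss_sup = -np.mean(np.log(p_real[np.arange(8), y_lab]))

# (3) unsupervised loss: real samples should not look fake; fakes should
loss_unsup = -np.mean(np.log(1 - p_unl[:, K])) - np.mean(np.log(p_gen[:, K]))

# (1) combined discriminator loss
loss_disc = loss_unsup + loss_sup

# (4) feature-matching generator loss on intermediate features M(.)
feat_real = rng.normal(size=(16, 32))            # M(x) for real samples
feat_gen = rng.normal(size=(16, 32))             # M(G(z)) for generated samples
loss_gen = np.linalg.norm(feat_real.mean(axis=0) - feat_gen.mean(axis=0))
```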
Use cases for semi-supervised methods:
These methods are of particular importance when annotation capacity is limited. In most clinical settings, a large number of labelled training samples for supervised training is not readily available. The semi-supervised methods can be used in two scenarios:
(i) They can learn better from less labelled data and provide better labels for pseudo-labelling algorithms;
(ii) They have better predictive power than supervised methods and, because they also use unlabelled samples, the model can iteratively select the samples that need to be labelled for better performance.

Table X and Table XI provide the results of the anomaly detection methods. One-Class SVM achieves better AUROC than Isolation Forest in both cases. Moreover, latent representations (embeddings) give better performance than the reconstruction loss. Data contamination in the experiment with the autoencoder trained on only normal samples is not a major concern, as the latent representations obtained are fairly robust to this issue. However, for the experiment with the autoencoder trained on both normal and abnormal samples, it is ideal that normal samples be fed into the anomaly detection algorithm.
Use cases for unsupervised methods:
The unsupervised feature extraction (using the 1D CNN Autoencoder) coupled with anomaly detection methods achieved good performance with no major labelling burden. These methods can be used in two scenarios:
(i) They are well suited to triaging applications, since the abnormality scores are an indicator of disease and the samples with higher abnormality scores can be evaluated first;
(ii) They can be useful for creating datasets for supervised or semi-supervised training: samples close to the classification boundary can be chosen in an unsupervised setting, as these samples are the most confusing for the models to distinguish.
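The anomaly-detection stage described in the Methodology can be sketched with scikit-learn as follows. Synthetic Gaussian clusters stand in for the autoencoder's latent representations, and the roughly 10% abnormal fraction mirrors the contamination assumption discussed earlier.

```python
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
z_normal = rng.normal(0.0, 1.0, size=(300, 8))     # stand-in normal-class latents
z_abnormal = rng.normal(3.0, 1.0, size=(30, 8))    # ~10% abnormal samples

# Train on normal samples only (the "clean data" case)
ocsvm = OneClassSVM(kernel="rbf", gamma="auto").fit(z_normal[:200])
iforest = IsolationForest(random_state=0).fit(z_normal[:200])

z_test = np.vstack([z_normal[200:], z_abnormal])
y_test = np.r_[np.zeros(100), np.ones(30)]         # 1 = abnormal

# decision_function is higher for more normal samples, so negate it
# to obtain an anomaly score and evaluate with AUROC
auc_svm = roc_auc_score(y_test, -ocsvm.decision_function(z_test))
auc_if = roc_auc_score(y_test, -iforest.decision_function(z_test))
```

The continuous anomaly scores are what make the triaging use case possible: test samples can be ranked by score rather than merely thresholded.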
TABLE VIII
RESULTS FOR SUPERVISED METHODS OF HEART SOUNDS CLASSIFICATION

Method                                              Accuracy  Specificity  Sensitivity  MAcc
ResNet-50 on Mel-spectrogram (with padding)         0.869     0.941        0.604        —
ResNet-50 on spectrogram (with padding)             0.878     0.943        0.640        —
ResNet-50 on Mel-spectrogram (with pruning)         0.860     0.919        0.640        —
ResNet-50 on spectrogram (with pruning)             0.847     0.896        0.669        —
Inception-v3 on Mel-spectrogram (with padding)      0.850     0.931        0.554        —
Inception-v3 on spectrogram (with padding)          0.867     0.947        0.576        —
Inception-v3 on Mel-spectrogram (with pruning)      0.796     0.941        0.266        —
Inception-v3 on spectrogram (with pruning)          0.826     0.953        0.360        —
DenseNet-121 on Mel-spectrogram (with padding)      0.869     0.965        0.518        —
DenseNet-121 on spectrogram (with padding)          —         —            0.187        —
Custom CNN on spectrogram (with padding)            0.909     0.967        0.698        —
Dense NN on extracted audio features (with SMOTE)   0.855     0.880        0.763        —
LSTM NN on extracted audio features (with SMOTE)    0.748     0.770        0.670        —
1D CNN on extracted audio features (with SMOTE)     0.843     0.847        0.827        —
Decision Tree on extracted audio features           0.824     —            —            —
Decision Tree on extracted audio features (SMOTE)   0.832     0.837        0.813        —
SVM on extracted audio features                     0.807     0.813        0.784        —
SVM on extracted audio features (with SMOTE)        0.827     0.953        0.367        —
Random Forest on extracted audio features           0.898     0.925        0.798        —
Random Forest on extracted audio features (SMOTE)   0.878     0.888        0.842        —
Gradient Boosting on extracted audio features       0.913     0.970        0.705        —
Gradient Boosting on extracted audio features (SMOTE) 0.935   —            —            —

TABLE IX
COMPARISON OF PROPOSED METHOD WITH VARIOUS SUPERVISED METHODS REPORTED IN THE PHYSIONET/CINC 2016 CHALLENGE

Method                             Feature                   Balancing data  MAcc
AdaBoost and CNN [11]              Time-frequency            No              —
Ensemble of NN [39]                Time-frequency            Yes             —
Dropout Connected NN [40]          MFCC                      No              —
SVM and KNN [41]                   Time-frequency, MFCC      No              —
CNN [42]                           MFCC                      No              —
SVM and ELM [43]                   Audio Signal Analysis     No              —
Gradient Boosting (Current study)  Extracted Audio features  Yes             —

Fig. 7. Semi-supervised Results. The graph shows the AUROC evaluation metric against the percentage of labelled data for the supervised baseline and the semi-supervised method. The performance of the semi-supervised approach is better than the supervised approach throughout the graph.

TABLE X
RESULTS FOR ANOMALY DETECTION WHEN AUTOENCODER IS TRAINED ON ONLY NORMAL SAMPLES

Method            Features    Labels        AUROC
Isolation Forest  Embeddings  Normal        —
                  Embeddings  Contaminated  —
                  Rec Loss    Normal        —
                  Rec Loss    Contaminated  —
One-Class SVM     Embeddings  Normal        —
                  Embeddings  Contaminated  —
                  Rec Loss    Normal        —
                  Rec Loss    Contaminated  —

TABLE XI
RESULTS FOR ANOMALY DETECTION WHEN AUTOENCODER IS TRAINED ON ENTIRE DATA (BOTH NORMAL AND ABNORMAL SAMPLES)

Method            Features    Labels        AUROC
Isolation Forest  Embeddings  Normal        —
                  Embeddings  Contaminated  —
                  Rec Loss    Normal        —
                  Rec Loss    Contaminated  —
One-Class SVM     Embeddings  Normal        —
                  Embeddings  Contaminated  —
                  Rec Loss    Normal        —
                  Rec Loss    Contaminated  —

IV. CONCLUSION AND FUTURE DIRECTIONS
This study explores supervised, semi-supervised and unsupervised methods of heart sounds classification for use cases where labelled data is scarce. With a large number of labelled samples, the supervised methods plateau and perform similarly. However, with a smaller number of labelled samples, the semi-supervised algorithm outperforms the supervised baselines. Furthermore, the given problem is framed as an anomaly detection problem with unsupervised feature learning; the issue of data contamination is also studied and the results are presented.
These works can be a starting point for various future use cases and studies. One promising direction is active learning, where a small subset of samples is labelled first and further samples are then iteratively chosen for labelling to improve performance. The good performance with a low number of labelled samples is also useful for pseudo-labelling, where existing supervised classification methods are used with assumed labels. The heart sounds signals used in this study are not segmented; various segmentation algorithms have been developed in recent years, and proper segmentation and alignment techniques can be employed to further boost performance. Apart from band-pass filters, other signal processing techniques can be explored, so better pre-processing and feature extraction techniques are another pathway for exploration.
Another, more challenging, setting is data augmentation for heart sounds signals. It would be interesting to see how sound signals can be augmented using techniques like Mix-Up [44], apart from SMOTE for data balancing. However, since the domain of application is health care, it is important to ensure that the augmented samples do not introduce wrong features or biases into the model, so augmentation should be undertaken with utmost care.
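Mix-Up [44] forms convex combinations of two training signals and their labels. A minimal sketch for 1D waveforms, in which the toy sinusoids and mixing parameter are assumptions for illustration, not a validated PCG augmentation scheme:

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    # Convex combination of two signals and their one-hot labels,
    # following the Mix-Up recipe of Zhang et al. [44].
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

# Two toy waveforms standing in for normal/abnormal PCG recordings.
t = np.linspace(0, 1, 2000)
x_normal = np.sin(2 * np.pi * 5 * t)
x_abnormal = np.sin(2 * np.pi * 5 * t) + 0.3 * np.sin(2 * np.pi * 40 * t)

x_mix, y_mix = mixup(x_normal, np.array([1.0, 0.0]),
                     x_abnormal, np.array([0.0, 1.0]),
                     rng=np.random.default_rng(3))
print("mixed label:", y_mix)
```

The mixed label remains a probability distribution over the two classes, which is why Mix-Up requires soft-label training rather than hard 0/1 targets.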
Moreover, all the methods presented in this work can generalize to any 1D signal input; hence, ECG signals can also be used instead of PCG signals. This work is presented with the belief that it can aid both the creation of better models and, more importantly, better datasets, which can further improve performance, as in most practical cases the quality of the data used is a crucial factor in obtaining better performance.

REFERENCES
[2] C. Liu et al., "An open access database for the evaluation of heart sound algorithms," Physiological Measurement, vol. 37, no. 12, p. 2181, 2016.
[3] R. M. Rangayyan and R. J. Lehner, "Phonocardiogram signal analysis: a review," Critical Reviews in Biomedical Engineering, vol. 15, no. 3, pp. 211–236, 1987.
[4] N. V. Thakor and Y.-S. Zhu, "Applications of adaptive filtering to ECG analysis: noise cancellation and arrhythmia detection," IEEE Transactions on Biomedical Engineering, vol. 38, no. 8, pp. 785–794, 1991.
[5] R. Silipo and C. Marchesi, "Artificial neural networks for automatic ECG analysis," IEEE Transactions on Signal Processing, vol. 46, no. 5, pp. 1417–1425, 1998.
[6] W. Phanphaisarn, A. Roeksabutr, P. Wardkein, J. Koseeyaporn, and P. Yupapin, "Heart detection and diagnosis based on ECG and EPCG relationships," Medical Devices (Auckland, NZ), vol. 4, p. 133, 2011.
[7] M. A. Akbari, K. Hassani, J. D. Doyle, M. Navidbakhsh, M. Sangargir, K. Bajelani, and Z. S. Ahmadi, "Digital subtraction phonocardiography (DSP) applied to the detection and characterization of heart murmurs," BioMedical Engineering OnLine, vol. 10, no. 1, p. 109, 2011.
[8] S. Ari, K. Hembram, and G. Saha, "Detection of cardiac abnormality from PCG signal using LMS based least square SVM classifier," Expert Systems with Applications, vol. 37, no. 12, pp. 8019–8026, 2010.
[9] I. Grzegorczyk, M. Soliński, M. Łepek, A. Perka, J. Rosiński, J. Rymko, K. Stępień, and J. Gierałtowski, "PCG classification using a neural network approach," in 2016 Computing in Cardiology Conference (CinC). IEEE, 2016, pp. 1129–1132.
[10] F. Plesinger, I. Viscor, J. Halamek, J. Jurco, and P. Jurak, "Heart sounds analysis using probability assessment," Physiological Measurement, vol. 38, no. 8, p. 1685, 2017.
[11] C. Potes, S. Parvaneh, A. Rahman, and B. Conroy, "Ensemble of feature-based and deep learning-based classifiers for detection of abnormal heart sounds," in 2016 Computing in Cardiology Conference (CinC). IEEE, 2016, pp. 621–624.
[12] X. J. Zhu, "Semi-supervised learning literature survey," University of Wisconsin-Madison Department of Computer Sciences, Tech. Rep., 2005.
[13] B. Settles, "Active learning literature survey," University of Wisconsin-Madison Department of Computer Sciences, Tech. Rep., 2009.
[14] D. Chamberlain, R. Kodgule, D. Ganelin, V. Miglani, and R. R. Fletcher, "Application of semi-supervised deep learning to lung sound analysis," IEEE, 2016, pp. 804–807.
[15] A. I. Humayun, M. Khan, S. Ghaffarzadegan, Z. Feng, T. Hasan et al., "An ensemble of transfer, semi-supervised and supervised learning methods for pathological heart sound classification," arXiv preprint arXiv:1806.06506, 2018.
[16] A. Ukil, S. Bandyopadhyay, C. Puri, R. Singh, and A. Pal, "Class augmented semi-supervised learning for practical clinical analytics on physiological signals," arXiv preprint arXiv:1812.07498, 2018.
[17] M. M. Rahman and D. Davis, "Addressing the class imbalance problem in medical datasets," International Journal of Machine Learning and Computing, vol. 3, no. 2, p. 224, 2013.
[18] G. Amit, N. Gavriely, and N. Intrator, "Cluster analysis and classification of heart sounds," Biomedical Signal Processing and Control, vol. 4, no. 1, pp. 26–36, 2009.
[19] M. A. Pimentel, D. A. Clifton, L. Clifton, and L. Tarassenko, "A review of novelty detection," Signal Processing, vol. 99, pp. 215–249, 2014.
[20] D. Ellis, "Chroma feature analysis and synthesis," Resources of Laboratory for the Recognition and Organization of Speech and Audio (LabROSA), 2007.
[21] D.-N. Jiang, L. Lu, H.-J. Zhang, J.-H. Tao, and L.-H. Cai, "Music type classification by spectral contrast feature," in Proceedings. IEEE International Conference on Multimedia and Expo, vol. 1. IEEE, 2002, pp. 113–116.
[22] C. Harte, M. Sandler, and M. Gasser, "Detecting harmonic change in musical audio," in Proceedings of the 1st ACM Workshop on Audio and Music Computing Multimedia. ACM, 2006, pp. 21–26.
[23] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, "SMOTE: synthetic minority over-sampling technique," Journal of Artificial Intelligence Research, vol. 16, pp. 321–357, 2002.
[24] M. Oquab, L. Bottou, I. Laptev, and J. Sivic, "Learning and transferring mid-level image representations using convolutional neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1717–1724.
[25] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[26] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the inception architecture for computer vision," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818–2826.
[27] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4700–4708.
[28] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[29] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 807–814.
[30] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: a simple way to prevent neural networks from overfitting," The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[31] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[32] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
[33] O. Chapelle, B. Scholkopf, and A. Zien, "Semi-supervised learning (Chapelle, O. et al., eds.; 2006) [book reviews]," IEEE Transactions on Neural Networks, vol. 20, no. 3, pp. 542–542, 2009.
[34] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, "Improved techniques for training GANs," in Advances in Neural Information Processing Systems, 2016, pp. 2234–2242.
[35] L. M. Manevitz and M. Yousef, "One-class SVMs for document classification," Journal of Machine Learning Research, vol. 2, no. Dec, pp. 139–154, 2001.
[36] F. T. Liu, K. M. Ting, and Z.-H. Zhou, "Isolation forest," in 2008 Eighth IEEE International Conference on Data Mining. IEEE, 2008, pp. 413–422.
[37] E. J. Benjamin, M. J. Blaha, S. E. Chiuve, M. Cushman, S. R. Das, R. Deo, J. Floyd, M. Fornage, C. Gillespie, C. Isasi et al., "Heart disease and stroke statistics-2017 update: a report from the American Heart Association," Circulation, vol. 135, no. 10, pp. e146–e603, 2017.
[38] G. D. Clifford, C. Liu, B. Moody, J. Millet, S. Schmidt, Q. Li, I. Silva, and R. G. Mark, "Recent advances in heart sound analysis," Physiological Measurement, vol. 38, no. 8, pp. E10–E25, 2017.
[39] M. Zabihi, A. B. Rad, S. Kiranyaz, M. Gabbouj, and A. K. Katsaggelos, "Heart sound anomaly and quality detection using ensemble of neural networks without segmentation," in 2016 Computing in Cardiology Conference (CinC). IEEE, 2016, pp. 613–616.
[40] E. Kay and A. Agarwal, "DropConnected neural network trained with diverse features for classifying heart sounds," in 2016 Computing in Cardiology Conference (CinC). IEEE, 2016, pp. 617–620.
[41] I. J. D. Bobillo, "A tensor approach to heart sound classification," in 2016 Computing in Cardiology Conference (CinC). IEEE, 2016, pp. 629–632.
[42] J. Rubin, R. Abreu, A. Ganguli, S. Nelaturi, I. Matei, and K. Sricharan, "Classifying heart sound recordings using deep convolutional neural networks and mel-frequency cepstral coefficients," in 2016 Computing in Cardiology Conference (CinC). IEEE, 2016, pp. 813–816.
[43] X. Yang, F. Yang, L. Gobeawan, S. Y. Yeo, S. Leng, L. Zhong, and Y. Su, "A multi-modal classifier for heart sound recordings," in 2016 Computing in Cardiology Conference (CinC). IEEE, 2016, pp. 1165–1168.
[44] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, "mixup: Beyond empirical risk minimization," arXiv preprint arXiv:1710.09412, 2017.