Semi-supervised and Unsupervised Methods for Heart Sounds Classification in Restricted Data Environments
Balagopal Unnikrishnan, Pranshu Ranjan Singh, Xulei Yang, Matthew Chin Heng Chua
Abstract—Automated heart sounds classification is a much-needed diagnostic tool in view of the increasing incidence of heart-related diseases worldwide. In this study, we conduct a comprehensive study of heart sounds classification using various supervised, semi-supervised and unsupervised approaches on the PhysioNet/CinC 2016 Challenge dataset. Supervised approaches, including deep learning and machine learning methods, require large amounts of labelled data to train the models, which are challenging to obtain in most practical scenarios. In view of the need to reduce the labelling burden for clinical practices, where human labelling is both expensive and time-consuming, semi-supervised or even unsupervised approaches in restricted data settings are desirable. A GAN-based semi-supervised method is therefore proposed, which allows the usage of unlabelled data samples to boost the learning of the data distribution. It achieves a better performance in terms of AUROC over the supervised baseline when limited data samples exist. Furthermore, several unsupervised methods are explored as an alternative approach by treating the given problem as an anomaly detection scenario. In particular, the unsupervised feature extraction using a 1D CNN Autoencoder coupled with a one-class SVM obtains good performance without any data labelling. The proposed semi-supervised and unsupervised methods may lead to a future workflow tool for the creation of higher-quality datasets.
Index Terms—Heart Sounds Classification, Semi-supervised Learning, Unsupervised Learning, Generative Adversarial Networks, One-Class Support Vector Machines.
I. INTRODUCTION

Cardiovascular diseases (CVDs) are the main cause of death globally: 17.9 million deaths have been attributed to CVDs, representing 31% of all global deaths [1]. There is a need for methods for first-hand examination of the cardiovascular system. Auscultation of heart sounds, or Phonocardiogram (PCG) signals, is a crucial component of physical examination and can help detect cardiac conditions such as arrhythmia, valve disease, heart failure, and more [2]. Physicians have long assessed the condition of the heart by auscultation. However, designing an accurate and automated system for detection of abnormal heart sounds is challenging due to the unavailability of rigorously validated, high-quality heart sounds datasets [3].

Apart from PCG signals, Electrocardiogram (ECG) signals have been used for detecting arrhythmia, myocardial ischemia and chronic alterations [4, 5]. Although ECG signals can reveal various intricate and abnormal heart behaviours, symptoms such as heart murmurs are concealed from an ECG signal [6]. The use of heart sounds to detect various heart abnormalities has led to the development of a wide range of algorithms. In [7], PCG signals undergo digital subtraction analysis to detect and characterize heart murmurs. Automated classification methods for heart sounds involve approaches such as Support Vector Machines (SVM) [8], Neural Networks [9], probability-based methods [10] and ensembles of various classifiers [11].

The design of supervised methods for heart sounds classification requires large amounts of labelled data. However, it is often difficult, expensive, or time-consuming to obtain additional labelled data [12]. There are challenges in obtaining patient data in the medical domain; furthermore, multiple physicians have to perform labelling in order to reach a common consensus.
Semi-supervised learning and active learning methods address this problem by utilizing available unlabelled data along with the labelled data to build better classifier models [13]. Chamberlain et al. demonstrate automatic lung sounds classification using a semi-supervised deep learning algorithm [14]. Transfer learning for supervised heart sounds classification and data augmentation for minority-class (abnormal category) samples are some of the areas being explored to improve performance over traditional supervised classification methods [15, 16].

In most cases, abnormal samples are far fewer than normal samples, which leads to a class imbalance when performing classification tasks [17]. It is both time-consuming and expensive to collect abnormal samples. There have been works that perform clustering on features extracted from heart sounds, followed by classification [18]. In anomaly detection methods, the model is trained only on normal samples but tested with both normal and abnormal samples [19].

In this work, the focus is on exploring current and new supervised, semi-supervised and unsupervised methods for heart sounds classification. The main contributions of this work are:
(i) Analysis of the performance of various supervised methods for heart sounds classification;
(ii) Utilization of a Generative Adversarial Network (GAN)-based semi-supervised technique to obtain better performance in terms of Area Under the Receiver Operating Characteristic curve (AUROC) as compared to the supervised benchmark; and
(iii) Learning of latent representations from heart sounds features using a 1D Convolutional Neural Network (CNN) model (unsupervised method) together with anomaly detection algorithms, and evaluation of the classification performance using the AUROC metric.
The methods and experimental analysis are discussed in detail in the following sections.

II. METHODOLOGY
This section describes the data and the methods used in this study. The sub-sections Dataset and Data Preparation describe the dataset used and the feature extraction methods for heart sounds, respectively. Subsequent sub-sections explain the techniques used for heart sounds classification using supervised, semi-supervised and unsupervised methods.
A. Dataset
The heart sounds dataset used for this study was provided by the 2016 PhysioNet/Computing in Cardiology Challenge [2]. It contains 3,240 labelled heart sounds recordings. The dataset is divided into two classes, Normal and Abnormal. Fig. 1 shows the heart sounds signal for a normal and an abnormal sample. The duration of the recordings ranges from 5 seconds (short period) to 120 seconds (long period). The dataset was obtained by combining various heart sounds databases and consists of six sub-datasets, labelled A, B, C, D, E and F, as shown in Fig. 2.

The recordings were collected from nine different locations on the body; the four major locations are the aortic, pulmonic, tricuspid and mitral areas. The normal recordings correspond to healthy subjects, whereas the abnormal ones were obtained from patients with a confirmed cardiac diagnosis, typically heart valve defects and coronary artery disease. The presence of noise in some samples is due to the uncontrolled environment of the recordings; noise sources include talking, stethoscope motion, breathing and intestinal sounds.
B. Data Preparation
For this study, various features obtained from the heart sounds signals are used to train different models. The raw signal undergoes pre-processing steps such as padding and pruning. In the padding operation, all samples are zero-padded to the length of the longest signal in the dataset (120 seconds). In the pruning operation, all signals are truncated to the length of the shortest signal in the dataset (5 seconds).

The different types of features extracted from the heart sounds signal are shown in Fig. 3. For semi-supervised methods, the raw processed signal is used as input. For supervised methods, both the padded and the pruned signals are used to obtain the spectrogram and mel-spectrogram features. Both are plotted with time on the x-axis and frequency on the y-axis, and the plots are saved as colour images of resolution 64 x 64 x 3 and 128 x 128 x 3, respectively.
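To make the two pre-processing variants concrete, here is a minimal numpy sketch; the helper names are our own, and the 2000 Hz sampling rate follows the dataset description in Fig. 1.

```python
import numpy as np

SR = 2000                 # PCG sampling rate (Hz), per the dataset description
MAX_LEN = 120 * SR        # padding target: length of the longest signal (120 s)
MIN_LEN = 5 * SR          # pruning target: length of the shortest signal (5 s)

def zero_pad(signal: np.ndarray, target_len: int = MAX_LEN) -> np.ndarray:
    """Right-pad a 1-D signal with zeros up to target_len samples."""
    pad = max(0, target_len - len(signal))
    return np.pad(signal, (0, pad))

def prune(signal: np.ndarray, target_len: int = MIN_LEN) -> np.ndarray:
    """Keep only the first target_len samples of a 1-D signal."""
    return signal[:target_len]
```

A 10-second recording, for example, becomes 240,000 samples after padding and 10,000 samples after pruning.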
Fig. 1. The heart sounds signal for the normal class (top) and the abnormal class (bottom). The x-axis represents the time steps and the y-axis the signal value. The sampling rate of the signal is 2000 Hz.

Fig. 2. The 2016 PhysioNet/Computing in Cardiology Challenge dataset distribution. The dataset was obtained by combining heart sounds databases collected independently by various research teams. The individual datasets are labelled A, B, C, D, E and F. The distribution of normal and abnormal samples differs across sub-datasets.
Audio features such as Mel-Frequency Cepstral Coefficients (MFCCs), Chroma [20], mel-scaled spectrogram (mel-spectrogram), spectral contrast [21] and tonal centroid features (Tonnetz) [22] were extracted from the heart sounds signals. MFCCs, Chroma, mel-spectrogram, spectral contrast and Tonnetz contribute 40, 12, 128, 7 and 6 features, respectively; these are appended to form a combined feature list of 193 features. The extracted audio features are used in the supervised methods and in the unsupervised methods (for anomaly detection). Since there is a class imbalance, oversampling of the minority class (Abnormal) is performed on the audio features using the Synthetic Minority Over-sampling Technique (SMOTE) [23].
Fig. 3. Feature extraction from the heart sounds signal. Various features are extracted to support the different heart sounds classification techniques. Spectrograms and mel-spectrograms are obtained by converting the PCG signals to images. Audio features are obtained by appending MFCC, Chroma, mel-spectrogram, spectral contrast and Tonnetz features.
C. Supervised Methods for Heart Sounds Classification
The various supervised methods used for heart sounds classification can be grouped into four clusters:
(i) Transfer learning using pre-trained deep learning models on spectrogram/mel-spectrogram images;
(ii) Custom CNN on spectrogram images;
(iii) Deep learning models on extracted audio features; and
(iv) Machine learning models on extracted audio features.
The details of each method are described below.
1) Transfer Learning using Pre-trained Deep Learning Models on Spectrogram/Mel-spectrogram Images:
Transfer learning in CNNs has shown that image representations learnt on a large-scale labelled dataset can be transferred to classification tasks with limited data samples [24]. ResNet-50 [25], Inception-v3 [26] and DenseNet-121 [27] have shown state-of-the-art classification results on the ImageNet dataset. The spectrograms and mel-spectrograms obtained from the heart sounds signals are converted to 64 x 64 x 3 images (see the Data Preparation sub-section). These images are used to train ImageNet pre-trained ResNet-50, Inception-v3 and DenseNet-121 models: the output of the final convolutional layer of each model is fed to a fully-connected single-node layer for classification into the Normal or Abnormal class.
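A minimal tf.keras sketch of this setup is given below; it is our own scaffolding, not the authors' code. The pooling layer bridging the backbone to the single-node classifier is our choice, and weights=None stands in for weights="imagenet" so the sketch runs without downloading pre-trained weights.

```python
import tensorflow as tf

# One of the three backbones, truncated after its final convolutional block
base = tf.keras.applications.ResNet50(
    weights=None,              # use weights="imagenet" for actual transfer learning
    include_top=False,         # drop the 1000-class ImageNet head
    input_shape=(64, 64, 3),   # spectrogram images from Data Preparation
)
model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),        # pool conv features (our choice)
    tf.keras.layers.Dense(1, activation="sigmoid"),  # single node: Normal vs Abnormal
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["AUC"])
```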
2) Custom CNN on Spectrogram Images:
The spectrogram obtained from the heart sounds signal is converted to a 128 x 128 x 3 image. These images are fed to a custom-designed CNN which follows a VGG-like architecture [28]; the architecture is given in Table I. The input spectrogram image passes through a series of convolution and pooling layers, with dense layers towards the end of the network to output the class of the heart sounds signal. ReLU activation [29] is used for the convolutional and dense layers, except for the final dense layer, which uses Sigmoid activation. Dropout layers are added to prevent the model from over-fitting the training set [30].
3) Deep Learning Models on Extracted Audio Features:
The audio features extracted from the heart sounds signals
TABLE I
CUSTOM CNN ARCHITECTURE ON SPECTROGRAM IMAGES
Layers | Attributes
Convolution 2D | 16 filters, 3 x 3 kernel, ReLU activation, padding=same
Convolution 2D | 16 filters, 3 x 3 kernel, ReLU activation, padding=same
MaxPool 2D | 2 x 2 kernel, stride=2
Convolution 2D | 32 filters, 3 x 3 kernel, ReLU activation, padding=same
Convolution 2D | 32 filters, 3 x 3 kernel, ReLU activation, padding=same
MaxPool 2D | 2 x 2 kernel, stride=2
Convolution 2D | 64 filters, 3 x 3 kernel, ReLU activation, padding=same
Convolution 2D | 64 filters, 3 x 3 kernel, ReLU activation, padding=same
MaxPool 2D | 2 x 2 kernel, stride=2
Convolution 2D | 128 filters, 3 x 3 kernel, ReLU activation, padding=same
Convolution 2D | 128 filters, 3 x 3 kernel, ReLU activation, padding=same
MaxPool 2D | 2 x 2 kernel, stride=2
Flatten & Dropout | dropout rate=0.25
Dense | 256 nodes, ReLU activation
Dropout | dropout rate=0.25
Dense | 1 node, Sigmoid activation
TABLE II
NEURAL NETWORK WITH LSTM UNITS ON EXTRACTED AUDIO FEATURES
Layers | Attributes
LSTM | 128 units, dropout=0.2, recurrent dropout=0.25
Dropout | dropout rate=0.25
LSTM | 64 units, dropout=0.2, recurrent dropout=0.25
Dense | 1 node, Sigmoid activation

undergo oversampling using SMOTE to obtain an equal number of samples for the Normal and Abnormal classes. These features are then modelled using a Dense Neural Network (Dense NN), a Neural Network with Long Short-Term Memory units (LSTM NN) [31] and a 1D CNN. The Dense NN takes a feature list of dimension 193 as input and passes it through four densely connected layers, each with 128 nodes and ReLU activation, followed by a single-node densely connected layer with Sigmoid activation. The LSTM NN takes a feature list of dimension 193 x 1 as input and passes it through a series of LSTM units and densely connected layers; its architecture is given in Table II. The LSTM units are useful for modelling sequential data, and the final densely connected node performs the classification. The 1D CNN takes a feature list of dimension 193 x 1 as input and passes it through a series of 1D convolutional, 1D pooling and densely connected layers, as depicted in Table III.
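As a rough, runnable illustration, the Dense NN variant can be approximated with scikit-learn's MLPClassifier. This is a stand-in for the original deep-learning implementation: the optimizer settings are our own assumptions, and synthetic data replaces the real extracted audio features.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Four hidden layers of 128 ReLU nodes on 193-dimensional feature vectors,
# with a probabilistic output for the binary Normal-vs-Abnormal decision.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 193))               # stand-in for 193 audio features
y = (X[:, :5].sum(axis=1) > 0).astype(int)    # toy Normal/Abnormal labels

dense_nn = MLPClassifier(hidden_layer_sizes=(128, 128, 128, 128),
                         activation="relu", max_iter=300, random_state=0)
dense_nn.fit(X, y)
proba = dense_nn.predict_proba(X)[:, 1]       # abnormal-class probability
```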
4) Machine Learning Models on Extracted Audio Features:
The extracted audio features are used to fit machine learning models such as Decision Tree, SVM, Random Forest and
TABLE III
1D CNN ON EXTRACTED AUDIO FEATURES
Layers | Attributes
Convolution 1D | 128 filters, kernel size=3, ReLU activation
Convolution 1D | 128 filters, kernel size=3, ReLU activation
MaxPool 1D | kernel size=3, stride=3
Convolution 1D | 256 filters, kernel size=3, ReLU activation
Convolution 1D | 256 filters, kernel size=3, ReLU activation
MaxPool 1D | kernel size=3, stride=3
Convolution 1D | 512 filters, kernel size=3, ReLU activation
Convolution 1D | 512 filters, kernel size=3, ReLU activation
Flatten |
Dense | 256 nodes, ReLU activation
Dense | 128 nodes, ReLU activation
Dense | 1 node, Sigmoid activation
Gradient Boosting. For each machine learning method, the model was fitted on the training set, its hyper-parameters were tuned on a validation set, and it was finally evaluated on a test set. The hyper-parameters and their values for each machine learning model are provided in Table IV. For Decision Tree modelling, the hyper-parameters are criterion (the function that measures the quality of a split), max depth (the maximum depth of the tree), max leaf nodes (the maximum number of leaf/terminal nodes) and class weight. The class weight hyper-parameter represents the ratio (abnormal : normal) of the weights associated with the classes. The hyper-parameters used in SVM modelling are C (the penalty factor of the error term), kernel (the type of kernel used in the algorithm), gamma (the kernel coefficient) and class weight; the kernel used in the SVM model was the radial basis function (rbf). The hyper-parameters used in Random Forest modelling are criterion, number of estimators (the number of trees in the forest), max depth and max leaf nodes. For Gradient Boosting, the hyper-parameters are number of estimators (the number of boosting stages), max depth and learning rate (which reduces the contribution of each tree). These hyper-parameters are tuned for each machine learning model for two cases: without SMOTE balancing and with SMOTE balancing.

D. Semi-supervised Method: Generative Adversarial Network
Generative adversarial networks (GANs) provide a way of generating fake samples and utilizing them for other tasks [32]. The semi-supervised method makes use of GANs to exploit the unlabelled data samples. The semi-supervised models have access to both the labelled and the unlabelled data from the training set. In theory, such models should perform better than the supervised methods, as they now have access to unlabelled training data, provided the semi-supervised smoothness assumption holds, i.e. if two points x1, x2 are close in a high-density region, their labels y1, y2 are also close [33]. This class of semi-supervised algorithms is called generative models, and they are generally trained in a coupled fashion, similar to the training procedure of GANs. Fig. 4 and Fig. 5 show the GAN training and testing frameworks. A combined loss function, as given in equations (1) to
TABLE IV
HYPER-PARAMETERS FOR VARIOUS MACHINE LEARNING MODELS ON EXTRACTED AUDIO FEATURES
ML Model | Hyper-parameter | Value
Decision Tree | Criterion | Entropy
 | Max Depth |
 | Max Leaf Nodes |
 | Class weight |
Decision Tree (with SMOTE) | Criterion | Entropy
 | Max Depth |
 | Max Leaf Nodes |
SVM | C |
 | Kernel | rbf
 | Gamma | auto
 | Class weight |
SVM (with SMOTE) | C |
 | Kernel | rbf
 | Gamma | auto
Random Forest | Criterion | Entropy
 | No. of estimators |
 | Max Depth |
 | Max Leaf Nodes |
 | Class weight |
Random Forest (with SMOTE) | Criterion | Entropy
 | No. of estimators |
 | Max Depth |
 | Max Leaf Nodes |
Gradient Boosting | No. of estimators |
 | Max Depth |
 | Learning rate |
Gradient Boosting (with SMOTE) | No. of estimators |
 | Max Depth |
 | Learning rate |
(4), is used to train the discriminator and the generator; the reformulation trick is used as depicted in [34]. The generator is trained by matching the features of the generated samples and of the real samples. The supervised loss is similar to the cross-entropy loss in K-class classification problems, while the unsupervised loss helps in distinguishing between real and fake samples. This coupled training in an adversarial setting is used to train the semi-supervised network. The network architectures of the discriminator and the generator are provided in Table V and Table VI; 1D convolutions are used in both. They are highly effective here because convolution operations are translation and scale invariant and can pick up relevant features anywhere within the input, which is useful since the heart sounds are not segmented or aligned in any fashion. The first 5 seconds of each heart sounds recording is taken directly as input for the semi-supervised method. In the overall training setup, only a minimal amount of annotation or labelling is required.

E. Unsupervised Method: Anomaly Detection
For the purpose of obtaining good performance in restricted data environments, the method of anomaly detection was explored. In the anomaly detection scenario, the model is trained using just the normal class samples. Any abnormality or deviation from normality is considered a disease case (abnormal class). This has two major advantages: (i) it can perform the entire training without the need for any labels (no need
Fig. 4. Semi-supervised GAN training framework. The Generator (G) takes random noise z as input and produces a generated sample x_generated. The Discriminator (D) takes the generated samples, labelled real samples (x_labelled, y) and unlabelled real samples x_unlabelled, and produces the prediction of the class label and the intermediate-layer output M(x).

Fig. 5. Semi-supervised GAN testing framework. During the testing phase, only the Discriminator (D) is used. The test sample is fed to the Discriminator to obtain the class prediction. The predicted class, together with the ground-truth class, is used to obtain the AUROC metric.

TABLE V
SEMI-SUPERVISED GAN DISCRIMINATOR ARCHITECTURE
Layers | Attributes
Convolution 1D | 64 filters, kernel size=8, stride=1, LeakyReLU activation
Convolution 1D | 64 filters, kernel size=8, stride=2, LeakyReLU activation
Convolution 1D | 128 filters, kernel size=8, stride=2, LeakyReLU activation
Convolution 1D | 256 filters, kernel size=8, stride=2, LeakyReLU activation
Convolution 1D | 256 filters, kernel size=8, stride=2, LeakyReLU activation
Convolution 1D | 256 filters, kernel size=8, stride=2, LeakyReLU activation
Convolution 1D | 256 filters, kernel size=8, stride=2, LeakyReLU activation
Convolution 1D | 256 filters, kernel size=8, stride=2, LeakyReLU activation
Adaptive Avg Pooling 1D | output size=1
Flatten | Intermediate Layer Output
Dense | 2 nodes (number of classes)
TABLE VI
SEMI-SUPERVISED GAN GENERATOR ARCHITECTURE
Layers | Attributes
Dense | 256*33 (=8448) nodes, batch norm 1D, ReLU activation
Reshape | Reshape to 256 x 33
Conv Transpose 1D | 256 filters, kernel size=8, stride=2, padding=0, batch norm 1D, ReLU activation
Conv Transpose 1D | 256 filters, kernel size=8, stride=2, padding=0, batch norm 1D, ReLU activation
Conv Transpose 1D | 256 filters, kernel size=8, stride=2, padding=0, batch norm 1D, ReLU activation
Conv Transpose 1D | 256 filters, kernel size=8, stride=2, padding=0, batch norm 1D, ReLU activation
Conv Transpose 1D | 256 filters, kernel size=8, stride=2, padding=0, batch norm 1D, ReLU activation
Conv Transpose 1D | 128 filters, kernel size=8, stride=2, padding=1, batch norm 1D, ReLU activation
Conv Transpose 1D | 64 filters, kernel size=8, stride=2, padding=1, batch norm 1D, ReLU activation
Conv Transpose 1D | 1 filter, kernel size=8, stride=1, padding=0, tanh activation

for abnormal class samples), and (ii) it provides an anomaly score, which can be used to obtain a relative grade of the abnormal samples and can be utilized in applications such as triaging.

Two anomaly detection algorithms, and two sets of features used to train them, are considered for evaluation. The two algorithms are One-Class SVM [35] and Isolation Forest [36]. In the One-Class SVM algorithm, the normal samples are enclosed within a hyper-sphere or hyper-plane, and everything outside it is considered anomalous; the distance from the separating plane determines the degree of abnormality. In Isolation Forest, the samples are split randomly during training using isolation trees, and the average path length over the tree forest is taken as a measure of abnormality. Anomalous samples are more susceptible to isolation during splitting and hence have shorter average path lengths, which can be used to distinguish anomalies from normal samples.

During training, a stack of 1D convolution layers and 1D convolution-upsampling layers is combined to serve as
Loss_discriminator = Loss_unsupervised + Loss_supervised    (1)

Loss_supervised = -E_(x,y)[ log P_D(y | x, y < K + 1) ],  K = number of classes    (2)

Loss_unsupervised = -E_x[ log(1 - P_D(y = K + 1 | x)) ] - E_x_g[ log P_D(y = K + 1 | x) ]    (3)

Loss_generator = || E_x[ M(x) ] - E_x_g[ M(G(z)) ] ||    (4)

Fig. 6. Unsupervised anomaly detection framework using the 1D CNN Autoencoder. The Autoencoder takes an input sample x and produces the reconstructed sample x_recon. The latent representations z from the Autoencoder and the reconstruction loss are used as features for the anomaly detection methods, Isolation Forest and One-Class SVM.

an auto-encoder. The audio features extracted from the heart sounds are provided to the auto-encoder for reconstruction. Two features from the 1D CNN Autoencoder serve as input features for Isolation Forest and One-Class SVM:
(i) Reconstruction loss: the difference between the actual input and the output sample reconstructed by the 1D CNN Autoencoder. The intuition is that the reconstruction loss will be higher for anomalous samples at test time, since the model is trained to reconstruct only normal samples and cannot accurately reconstruct anomalous ones. The reconstruction loss is defined in equation (5):

Loss_reconstruction = | X_input - X_rec |    (5)

(ii) Latent representations: latent representations, or embeddings, are the output of the bottleneck layer, i.e. the last layer of the encoder. During training, the latent representation provides a set of features that represent the training samples; this feature set helps discriminate between normal and anomalous samples.
The overall anomaly detection framework is shown in Fig. 6, and the autoencoder network structure is provided in Table VII. Two modes of training were used for the 1D CNN autoencoder: in the first case, the training data consists purely of normal samples; in the second case, the data is contaminated with abnormal samples as well.
This helps in evaluating the utility of the method in use cases where there is no filter to prevent abnormal samples from being used, such as screening applications, where the data can be a mix of both normal and abnormal samples but the proportion of anomalous data is small. A contamination level of 8% to 12% is a reasonable assumption, as the prevalence of heart disease in the general population is roughly 10% [37]. During the training phase, the latent representations and reconstruction losses obtained from the auto-encoder are used to train the two anomaly detection algorithms. For both algorithms, experiments are conducted on clean data (only normal samples) and on contaminated data (normal and abnormal samples mixed).

III. COMPUTATIONS AND RESULTS
This section describes the experiments performed to evaluate the methods discussed in the previous section. The Computational Setup and Evaluation Metrics sub-sections describe the training setup and the metrics used to validate performance on a test set. Subsequent sub-sections describe the results obtained for the supervised, semi-supervised and unsupervised methods.
A. Computational Setup
The computations for the supervised methods of heart sounds classification utilize the entire dataset of 3,240 samples: 20% (648 samples) was held out for testing and the remaining 80% was used for training, further divided into training (90%, 2,333 samples) and validation (10%, 259 samples) sets. The training set was used

TABLE VII
1D CNN AUTOENCODER ARCHITECTURE
Layers | Attributes
Convolution 1D | 64 filters, kernel size=3, padding=same
MaxPool 1D | kernel size=2, stride=2
Convolution 1D | 64 filters, kernel size=3, padding=same
MaxPool 1D | kernel size=2, stride=2
Convolution 1D | 32 filters, kernel size=3, padding=same
MaxPool 1D | kernel size=2, stride=2
Convolution 1D | 16 filters, kernel size=3, padding=same
MaxPool 1D | kernel size=2, stride=2
Convolution 1D | 8 filters, kernel size=3, padding=same
MaxPool 1D | kernel size=2, stride=2
Flatten | Latent Space
Reshape | Reshape to 12 x 8
Convolution 1D | 8 filters, kernel size=3, padding=same
Upsampling 1D | size=2
Convolution 1D | 16 filters, kernel size=3, padding=same
Upsampling 1D | size=2
Convolution 1D | 32 filters, kernel size=3, padding=same
Upsampling 1D | size=2
Convolution 1D | 64 filters, kernel size=3, padding=same
Upsampling 1D | size=2
Zero Padding 1D | 0 x 1
Convolution 1D | 1 filter, kernel size=3, padding=same

for model fitting, and the validation set was used to tune the hyper-parameters of the model.

The computations for the semi-supervised and unsupervised methods utilize only sub-dataset E (2,141 samples). Since each sub-dataset was collected from a different source, there may be some bias associated with each one. 20% of sub-dataset E (429 samples) was used for testing and the remaining 80% for training, further divided into training (90%, 1,540 samples) and validation (10%, 172 samples) sets.

B. Evaluation Metrics
The supervised methods are evaluated using the standard classification metrics: sensitivity, specificity and accuracy. An additional metric, MAcc, defined as the average of sensitivity and specificity, is also used [38]. For the semi-supervised evaluation, the idea is to compare the performance of the supervised baseline against the semi-supervised method across different amounts of labelled data, mimicking a clinical use case where the number of labelled samples is limited. As there is class imbalance, AUROC (Area Under the Receiver Operating Characteristic curve) is used as the metric for comparing the supervised and semi-supervised models. AUROC not only accounts for the class imbalance but is also insensitive to the cutoff value chosen for the class predictions.
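The cutoff-insensitivity of AUROC can be illustrated in a few lines: the metric depends only on how the scores rank the samples, so any monotone rescaling of the scores (i.e. any choice of decision threshold) leaves it unchanged. The toy scores below are illustrative only.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])   # imbalanced, like the dataset
scores = np.array([0.10, 0.20, 0.15, 0.30, 0.05,
                   0.20, 0.25, 0.10, 0.80, 0.22])    # classifier scores

auc = roc_auc_score(y_true, scores)                  # 14 of 16 pos/neg pairs ranked correctly
auc_rescaled = roc_auc_score(y_true, 0.5 * scores + 0.2)  # monotone rescaling
```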
C. Results and Discussion
Table VIII shows the results for the various supervised methods discussed in this study. DenseNet-121 on mel-spectrograms (with padding) and Decision Tree on extracted audio features achieved the best specificity and sensitivity, respectively. Gradient Boosting on extracted audio features (with SMOTE balancing) achieved the best accuracy and MAcc. Table IX compares the Gradient Boosting method with the methods reported in the PhysioNet/CinC 2016 Challenge.

For the semi-supervised computations, the percentage of labelled data provided to the models is gradually increased, and the models are compared by their AUROC scores. Fig. 7 plots AUROC against the percentage of labelled data for the supervised baseline and the semi-supervised method. The semi-supervised model outperforms the supervised baseline even with very few labelled samples: even with 4 or 8 labelled samples, it beats the supervised baseline. This observation can be explained as follows:
(i) The large unlabelled training set helps approximate the overall data distribution, which allows for a much better decision boundary than the supervised method, which can only account for labelled samples;
(ii) The unlabelled data has a regularization effect on the classification network, as the semi-supervised training follows a coupled adversarial training procedure.
Higher AUROC is obtained as more and more labelled samples are used for classification.
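To make the loss algebra of equations (1) to (4) concrete, here is a toy numpy sketch. It is an illustration only, not the training code: softmax outputs over the K real classes plus one fake class stand in for the discriminator, and random vectors stand in for the intermediate features M(x).

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

K = 2                                            # Normal / Abnormal; index K is the fake class
rng = np.random.default_rng(0)
p_lab = softmax(rng.normal(size=(8, K + 1)))     # labelled real samples
y_lab = rng.integers(0, K, size=8)
p_unl = softmax(rng.normal(size=(16, K + 1)))    # unlabelled real samples
p_gen = softmax(rng.normal(size=(16, K + 1)))    # generated (fake) samples

# (2) supervised loss: cross-entropy over the K real classes only
p_real = p_lab[:, :K] / p_lab[:, :K].sum(axis=1, keepdims=True)
loss_sup = -np.mean(np.log(p_real[np.arange(8), y_lab]))

# (3) unsupervised loss: real samples should not look fake; fakes should
loss_unsup = -np.mean(np.log(1 - p_unl[:, K])) - np.mean(np.log(p_gen[:, K]))

# (1) combined discriminator loss
loss_disc = loss_unsup + loss_sup

# (4) feature-matching generator loss on intermediate features M(.)
feat_real = rng.normal(size=(16, 32))            # M(x) for real samples
feat_gen = rng.normal(size=(16, 32))             # M(G(z)) for generated samples
loss_gen = np.linalg.norm(feat_real.mean(axis=0) - feat_gen.mean(axis=0))
```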
Use cases for semi-supervised methods:
These methods are of particular importance when annotation capacity is limited. In most clinical settings, a large number of labelled training samples for supervised training is not readily available. The semi-supervised methods can be used in two scenarios:
(i) They can learn better from less labelled data and provide better labels for pseudo-labelling algorithms;
(ii) They have better predictive power than supervised methods and, because they also use unlabelled samples, the model can iteratively select the samples that need to be labelled for better performance.

Table X and Table XI provide the results of the anomaly detection methods. One-Class SVM achieves better AUROC than Isolation Forest in both cases. Moreover, latent representations (embeddings) give better performance than the reconstruction loss. Data contamination in the experiment with the autoencoder trained on only normal samples is not a major concern, as the latent representations obtained are fairly robust to this issue. However, for the experiment with the autoencoder trained on both normal and abnormal samples, it is ideal that normal samples be fed into the anomaly detection algorithm.
Use cases for unsupervised methods:
The unsupervised feature extraction (using the 1D CNN Autoencoder) coupled with anomaly detection methods achieved good performance with no major labelling burden. These methods can be used in two scenarios:
(i) They are well suited to triaging applications, since the abnormality scores are an indicator of disease and the samples with higher abnormality scores can be evaluated first;
(ii) They can be useful for creating datasets for supervised or semi-supervised training: samples close to the classification boundary can be chosen in an unsupervised setting, as these samples are the most confusing for the models to distinguish.
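The anomaly-detection stage described in the Methodology can be sketched with scikit-learn as follows. Synthetic Gaussian clusters stand in for the autoencoder's latent representations, and the roughly 10% abnormal fraction mirrors the contamination assumption discussed earlier.

```python
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
z_normal = rng.normal(0.0, 1.0, size=(300, 8))     # stand-in normal-class latents
z_abnormal = rng.normal(3.0, 1.0, size=(30, 8))    # ~10% abnormal samples

# Train on normal samples only (the "clean data" case)
ocsvm = OneClassSVM(kernel="rbf", gamma="auto").fit(z_normal[:200])
iforest = IsolationForest(random_state=0).fit(z_normal[:200])

z_test = np.vstack([z_normal[200:], z_abnormal])
y_test = np.r_[np.zeros(100), np.ones(30)]         # 1 = abnormal

# decision_function is higher for more normal samples, so negate it
# to obtain an anomaly score and evaluate with AUROC
auc_svm = roc_auc_score(y_test, -ocsvm.decision_function(z_test))
auc_if = roc_auc_score(y_test, -iforest.decision_function(z_test))
```

The continuous anomaly scores are what make the triaging use case possible: test samples can be ranked by score rather than merely thresholded.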
TABLE VIII
RESULTS FOR SUPERVISED METHODS OF HEART SOUNDS CLASSIFICATION

Method                                              Accuracy  Specificity  Sensitivity  MAcc
ResNet-50 on Mel-spectrogram (with padding)         0.869     0.941        0.604        —
ResNet-50 on spectrogram (with padding)             0.878     0.943        0.640        —
ResNet-50 on Mel-spectrogram (with pruning)         0.860     0.919        0.640        —
ResNet-50 on spectrogram (with pruning)             0.847     0.896        0.669        —
Inception-v3 on Mel-spectrogram (with padding)      0.850     0.931        0.554        —
Inception-v3 on spectrogram (with padding)          0.867     0.947        0.576        —
Inception-v3 on Mel-spectrogram (with pruning)      0.796     0.941        0.266        —
Inception-v3 on spectrogram (with pruning)          0.826     0.953        0.360        —
DenseNet-121 on Mel-spectrogram (with padding)      0.869     0.965        0.518        —
DenseNet-121 on spectrogram (with padding)          —         —            0.187        —
Custom CNN on spectrogram (with padding)            0.909     0.967        0.698        —
Dense NN on extracted audio features (with SMOTE)   0.855     0.880        0.763        —
LSTM NN on extracted audio features (with SMOTE)    0.748     0.770        0.670        —
1D CNN on extracted audio features (with SMOTE)     0.843     0.847        0.827        —
Decision Tree on extracted audio features           0.824     —            —            —
Decision Tree on extracted audio features (SMOTE)   0.832     0.837        0.813        —
SVM on extracted audio features                     0.807     0.813        0.784        —
SVM on extracted audio features (with SMOTE)        0.827     0.953        0.367        —
Random Forest on extracted audio features           0.898     0.925        0.798        —
Random Forest on extracted audio features (SMOTE)   0.878     0.888        0.842        —
Gradient Boosting on extracted audio features       0.913     0.970        0.705        —
Gradient Boosting on extracted audio features (SMOTE) 0.935   —            —            —

TABLE IX
COMPARISON OF PROPOSED METHOD WITH VARIOUS SUPERVISED METHODS REPORTED IN THE PHYSIONET/CINC 2016 CHALLENGE

Method                             Feature                   Balancing data  MAcc
AdaBoost and CNN [11]              Time-frequency            No              —
Ensemble of NN [39]                Time-frequency            Yes             —
Dropout Connected NN [40]          MFCC                      No              —
SVM and KNN [41]                   Time-frequency, MFCC      No              —
CNN [42]                           MFCC                      No              —
SVM and ELM [43]                   Audio Signal Analysis     No              —
Gradient Boosting (Current study)  Extracted Audio features  Yes             —

Fig. 7. Semi-supervised Results. The graph shows the AUROC evaluation metric against the percentage of labelled data for the supervised baseline and the semi-supervised method. The performance of the semi-supervised approach is better than the supervised approach throughout the graph.

TABLE X
RESULTS FOR ANOMALY DETECTION WHEN AUTOENCODER IS TRAINED ON ONLY NORMAL SAMPLES

Method            Features    Labels        AUROC
Isolation Forest  Embeddings  Normal        —
                  Embeddings  Contaminated  —
                  Rec Loss    Normal        —
                  Rec Loss    Contaminated  —
One-Class SVM     Embeddings  Normal        —
                  Embeddings  Contaminated  —
                  Rec Loss    Normal        —
                  Rec Loss    Contaminated  —

TABLE XI
RESULTS FOR ANOMALY DETECTION WHEN AUTOENCODER IS TRAINED ON ENTIRE DATA (BOTH NORMAL AND ABNORMAL SAMPLES)

Method            Features    Labels        AUROC
Isolation Forest  Embeddings  Normal        —
                  Embeddings  Contaminated  —
                  Rec Loss    Normal        —
                  Rec Loss    Contaminated  —
One-Class SVM     Embeddings  Normal        —
                  Embeddings  Contaminated  —
                  Rec Loss    Normal        —
                  Rec Loss    Contaminated  —

IV. CONCLUSION AND FUTURE DIRECTIONS
This study explores supervised, semi-supervised and unsupervised methods of heart sounds classification for use cases where labelled data is scarce. With a large number of labelled samples, the supervised methods plateau and perform similarly. However, with a smaller number of labelled samples, the semi-supervised algorithm outperforms the supervised baselines. Furthermore, the given problem is framed as an anomaly detection problem with unsupervised feature learning; the issue of data contamination is also studied and the results are presented.
These works can be a starting point for various future use cases and studies. One promising direction is active learning, where a small subset of samples is labelled first and further samples are then iteratively chosen for labelling to improve performance. The good performance with a low number of labelled samples is also useful for pseudo-labelling, where existing supervised classification methods are used with assumed labels. The heart sounds signals used in this study are not segmented; various segmentation algorithms have been developed in recent years, and proper segmentation and alignment techniques can be employed to further boost performance. Apart from band-pass filters, other signal processing techniques can be explored, so better pre-processing and feature extraction techniques are another pathway for exploration.
Another, more challenging, setting is data augmentation for heart sounds signals. It would be interesting to see how sound signals can be augmented using techniques like Mix-Up [44], apart from SMOTE for data balancing. However, since the domain of application is health care, it is important to ensure that the augmented samples do not introduce wrong features or biases into the model, so augmentation should be undertaken with utmost care.
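Mix-Up [44] forms convex combinations of two training signals and their labels. A minimal sketch for 1D waveforms, in which the toy sinusoids and mixing parameter are assumptions for illustration, not a validated PCG augmentation scheme:

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    # Convex combination of two signals and their one-hot labels,
    # following the Mix-Up recipe of Zhang et al. [44].
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

# Two toy waveforms standing in for normal/abnormal PCG recordings.
t = np.linspace(0, 1, 2000)
x_normal = np.sin(2 * np.pi * 5 * t)
x_abnormal = np.sin(2 * np.pi * 5 * t) + 0.3 * np.sin(2 * np.pi * 40 * t)

x_mix, y_mix = mixup(x_normal, np.array([1.0, 0.0]),
                     x_abnormal, np.array([0.0, 1.0]),
                     rng=np.random.default_rng(3))
print("mixed label:", y_mix)
```

The mixed label remains a probability distribution over the two classes, which is why Mix-Up requires soft-label training rather than hard 0/1 targets.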
Moreover, all the methods presented in this work can generalize to any 1D signal input; hence, ECG signals can also be used instead of PCG signals. This work is presented with the belief that it can aid both the creation of better models and, more importantly, better datasets, which can further improve performance, as in most practical cases the quality of the data used is a crucial factor in obtaining better performance.

REFERENCES
[2] C. Liu et al., "An open access database for the evaluation of heart sound algorithms," Physiological Measurement, vol. 37, no. 12, p. 2181, 2016.
[3] R. M. Rangayyan and R. J. Lehner, "Phonocardiogram signal analysis: a review," Critical Reviews in Biomedical Engineering, vol. 15, no. 3, pp. 211–236, 1987.
[4] N. V. Thakor and Y.-S. Zhu, "Applications of adaptive filtering to ECG analysis: noise cancellation and arrhythmia detection," IEEE Transactions on Biomedical Engineering, vol. 38, no. 8, pp. 785–794, 1991.
[5] R. Silipo and C. Marchesi, "Artificial neural networks for automatic ECG analysis," IEEE Transactions on Signal Processing, vol. 46, no. 5, pp. 1417–1425, 1998.
[6] W. Phanphaisarn, A. Roeksabutr, P. Wardkein, J. Koseeyaporn, and P. Yupapin, "Heart detection and diagnosis based on ECG and EPCG relationships," Medical Devices (Auckland, NZ), vol. 4, p. 133, 2011.
[7] M. A. Akbari, K. Hassani, J. D. Doyle, M. Navidbakhsh, M. Sangargir, K. Bajelani, and Z. S. Ahmadi, "Digital subtraction phonocardiography (DSP) applied to the detection and characterization of heart murmurs," BioMedical Engineering OnLine, vol. 10, no. 1, p. 109, 2011.
[8] S. Ari, K. Hembram, and G. Saha, "Detection of cardiac abnormality from PCG signal using LMS based least square SVM classifier," Expert Systems with Applications, vol. 37, no. 12, pp. 8019–8026, 2010.
[9] I. Grzegorczyk, M. Soliński, M. Łepek, A. Perka, J. Rosiński, J. Rymko, K. Stępień, and J. Gierałtowski, "PCG classification using a neural network approach," in 2016 Computing in Cardiology Conference (CinC). IEEE, 2016, pp. 1129–1132.
[10] F. Plesinger, I. Viscor, J. Halamek, J. Jurco, and P. Jurak, "Heart sounds analysis using probability assessment," Physiological Measurement, vol. 38, no. 8, p. 1685, 2017.
[11] C. Potes, S. Parvaneh, A. Rahman, and B. Conroy, "Ensemble of feature-based and deep learning-based classifiers for detection of abnormal heart sounds," in 2016 Computing in Cardiology Conference (CinC). IEEE, 2016, pp. 621–624.
[12] X. J. Zhu, "Semi-supervised learning literature survey," University of Wisconsin-Madison Department of Computer Sciences, Tech. Rep., 2005.
[13] B. Settles, "Active learning literature survey," University of Wisconsin-Madison Department of Computer Sciences, Tech. Rep., 2009.
[14] D. Chamberlain, R. Kodgule, D. Ganelin, V. Miglani, and R. R. Fletcher, "Application of semi-supervised deep learning to lung sound analysis," IEEE, 2016, pp. 804–807.
[15] A. I. Humayun, M. Khan, S. Ghaffarzadegan, Z. Feng, T. Hasan et al., "An ensemble of transfer, semi-supervised and supervised learning methods for pathological heart sound classification," arXiv preprint arXiv:1806.06506, 2018.
[16] A. Ukil, S. Bandyopadhyay, C. Puri, R. Singh, and A. Pal, "Class augmented semi-supervised learning for practical clinical analytics on physiological signals," arXiv preprint arXiv:1812.07498, 2018.
[17] M. M. Rahman and D. Davis, "Addressing the class imbalance problem in medical datasets," International Journal of Machine Learning and Computing, vol. 3, no. 2, p. 224, 2013.
[18] G. Amit, N. Gavriely, and N. Intrator, "Cluster analysis and classification of heart sounds," Biomedical Signal Processing and Control, vol. 4, no. 1, pp. 26–36, 2009.
[19] M. A. Pimentel, D. A. Clifton, L. Clifton, and L. Tarassenko, "A review of novelty detection," Signal Processing, vol. 99, pp. 215–249, 2014.
[20] D. Ellis, "Chroma feature analysis and synthesis," Resources of Laboratory for the Recognition and Organization of Speech and Audio (LabROSA), 2007.
[21] D.-N. Jiang, L. Lu, H.-J. Zhang, J.-H. Tao, and L.-H. Cai, "Music type classification by spectral contrast feature," in Proceedings. IEEE International Conference on Multimedia and Expo, vol. 1. IEEE, 2002, pp. 113–116.
[22] C. Harte, M. Sandler, and M. Gasser, "Detecting harmonic change in musical audio," in Proceedings of the 1st ACM Workshop on Audio and Music Computing Multimedia. ACM, 2006, pp. 21–26.
[23] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, "SMOTE: synthetic minority over-sampling technique," Journal of Artificial Intelligence Research, vol. 16, pp. 321–357, 2002.
[24] M. Oquab, L. Bottou, I. Laptev, and J. Sivic, "Learning and transferring mid-level image representations using convolutional neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1717–1724.
[25] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[26] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the inception architecture for computer vision," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818–2826.
[27] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4700–4708.
[28] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[29] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 807–814.
[30] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: a simple way to prevent neural networks from overfitting," The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[31] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[32] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
[33] O. Chapelle, B. Scholkopf, and A. Zien, "Semi-supervised learning (Chapelle, O. et al., eds.; 2006) [book reviews]," IEEE Transactions on Neural Networks, vol. 20, no. 3, pp. 542–542, 2009.
[34] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, "Improved techniques for training GANs," in Advances in Neural Information Processing Systems, 2016, pp. 2234–2242.
[35] L. M. Manevitz and M. Yousef, "One-class SVMs for document classification," Journal of Machine Learning Research, vol. 2, no. Dec, pp. 139–154, 2001.
[36] F. T. Liu, K. M. Ting, and Z.-H. Zhou, "Isolation forest," in 2008 Eighth IEEE International Conference on Data Mining. IEEE, 2008, pp. 413–422.
[37] E. J. Benjamin, M. J. Blaha, S. E. Chiuve, M. Cushman, S. R. Das, R. Deo, J. Floyd, M. Fornage, C. Gillespie, C. Isasi et al., "Heart disease and stroke statistics-2017 update: a report from the American Heart Association," Circulation, vol. 135, no. 10, pp. e146–e603, 2017.
[38] G. D. Clifford, C. Liu, B. Moody, J. Millet, S. Schmidt, Q. Li, I. Silva, and R. G. Mark, "Recent advances in heart sound analysis," Physiological Measurement, vol. 38, no. 8, pp. E10–E25, 2017.
[39] M. Zabihi, A. B. Rad, S. Kiranyaz, M. Gabbouj, and A. K. Katsaggelos, "Heart sound anomaly and quality detection using ensemble of neural networks without segmentation," in 2016 Computing in Cardiology Conference (CinC). IEEE, 2016, pp. 613–616.
[40] E. Kay and A. Agarwal, "DropConnected neural network trained with diverse features for classifying heart sounds," in 2016 Computing in Cardiology Conference (CinC). IEEE, 2016, pp. 617–620.
[41] I. J. D. Bobillo, "A tensor approach to heart sound classification," in 2016 Computing in Cardiology Conference (CinC). IEEE, 2016, pp. 629–632.
[42] J. Rubin, R. Abreu, A. Ganguli, S. Nelaturi, I. Matei, and K. Sricharan, "Classifying heart sound recordings using deep convolutional neural networks and mel-frequency cepstral coefficients," in 2016 Computing in Cardiology Conference (CinC). IEEE, 2016, pp. 813–816.
[43] X. Yang, F. Yang, L. Gobeawan, S. Y. Yeo, S. Leng, L. Zhong, and Y. Su, "A multi-modal classifier for heart sound recordings," in 2016 Computing in Cardiology Conference (CinC). IEEE, 2016, pp. 1165–1168.
[44] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, "mixup: Beyond empirical risk minimization," arXiv preprint arXiv:1710.09412, 2017.