A deep-learning classifier for cardiac arrhythmias
Carla Sofia Carvalho
Hitachi Vantara
Lisbon, [email protected]
Abstract — We report on a method that classifies heart beats according to a set of 13 classes, including cardiac arrhythmias. The method localises the QRS peak complex to define each heart beat and uses a neural network to infer the patterns characteristic of each heart beat class. The best performing neural network contains six one-dimensional convolutional layers and four dense layers, with the kernel sizes being multiples of the characteristic scale of the problem, thus resulting in a computationally fast and physically motivated neural network. For the same number of heart beat classes, our method yields better results with a considerably smaller neural network than previously published methods, which renders our method competitive for deployment in an internet-of-things solution.
Index Terms — Cardiac arrhythmias, electrocardiograms, convolutional neural networks.
I. INTRODUCTION
An industry domain that inherently produces data is the health domain, in particular the subdomain related to the monitoring of patients. Since cardiovascular diseases are the leading cause of death worldwide, a common monitoring target is the heart, as a way to identify or prevent heart dysfunctions. Heart dysfunctions are related to anomalies in the heart's electrical activity, including cardiac arrhythmias, and can be diagnosed in electrocardiograms (ECG), produced in real time by portable devices. These records show the heart's beating patterns in time as a result of differences in the electrical potential in the heart.

The interest thus lies in producing a physically motivated method that classifies heart beats from medically annotated ECG records in a fast and robust way, so that it can be deployed to generate alerts. Our suggested method encompasses the processing of ECG records, based on the location of characteristic features, and a classification model, based on a deep neural network, with the medical annotations providing the labels.

The advantage of neural networks over a heuristic model is that they search for the optimal combination of weights over different layers in sequence, which add non-linearities and can reproduce different functional forms. Neural networks have been used in the past to classify cardiac arrhythmias, using different numbers of heart beat types (e.g. Refs. [1]-[3]) and adopting complex architectures with difficult interpretability (e.g. Ref. [4]). In this paper, we test different architectures from previously published work [1], [6] and create new neural networks based on the scale of the problem, with a view towards increasing the performance while simultaneously keeping the neural networks short and fast.

Figure 1. Signals in record 101. Chopped smoothed signals and their subsequent discrete first derivative and amplitude spectrum. Top three panels: Channel 1. Bottom three panels: Channel 2.

II. DATA PROCESSING
A. Data digitisation
We use publicly available data from the MIT–BIH Arrhythmia Database Directory, comprising 48 records (https://physionet.org/physiobank/database/html/mitdbdir/mitdbdir.htm). These records contain signals from two ECG channels (an upper signal and a lower signal) sampled at a frequency f_s = 1/dt = 360 Hz for N × dt = 30 min. These records also contain annotations by two cardiologists. Some records contain paced beats driven by a pacemaker or artifacts. We choose to use all 48 records, since the paced beats can work as an additional heart beat type and the artifacts can work as a noise component.

We also use the WFDB software package (https://wfdb.readthedocs.io/en/latest/) to read and process the file format that the records are encoded in.

B. Heart beat locations
The signals consist of readings of the heart potential in time. The heart potential contains characteristic peaks, namely the P peak, the QRS peak complex and the T peak, which correspond to the polarisation/depolarisation heart cycle. The first step consists in identifying the QRS peak complexes in time, which are usually more prominent in the upper signal.

The annotations are located at the QRS peak complex and provide the labels to the heart beats. Hence the next step consists in chopping the signals about each QRS peak complex so that each fraction contains an individual heart beat (Fig. 1). Each record has a characteristic beat length between consecutive QRS peak complexes. We choose the median characteristic beat length (len = 256) so that the resulting chopped signals can be concatenated into a matrix x_ijk, where i ∈ {1, ..., n_sample} indexes the resulting chopped signals, j ∈ {1, ..., len} indexes the length in time of each chopped signal and k ∈ {sign1, sign2} indicates the signal.
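A minimal sketch of the chopping step, assuming the annotated QRS sample indices are already available (the function name and the toy two-channel signal are ours, not the paper's code):

```python
import numpy as np

def chop_beats(signal, qrs_idx, length=256):
    """Chop a (n_samples, n_channels) record into fixed-length windows
    centred on each annotated QRS index; beats too close to the record
    edges are discarded. Returns shape (n_beats, length, n_channels)."""
    half = length // 2
    windows = [signal[i - half:i + half]
               for i in qrs_idx
               if i - half >= 0 and i + half <= len(signal)]
    return np.stack(windows)

# Toy two-channel signal standing in for an ECG record.
t = np.arange(3600)
record = np.stack([np.sin(2 * np.pi * t / 360),
                   np.cos(2 * np.pi * t / 360)], axis=1)
x = chop_beats(record, qrs_idx=[400, 900, 1400])
print(x.shape)  # (3, 256, 2)
```

In practice the QRS indices would come from the record annotations read with the WFDB package.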
C. Heart beat annotations

The possible values of the annotations define the set of heart beat classes. The records in the MIT–BIH Arrhythmia Database Directory follow the annotation system (https://archive.physionet.org/physiobank/annotations.shtml) such that beat annotations take the values {N, L, R, B, a, J, A, S, j, e, n, V, r, E, F, /, f, Q, ?} and non-beat annotations take all the other possible values.

We produce the distribution of heart beat classes across the records (Fig. 2, top panel). We observe that the classes {B, n, r, ?} are not represented in this data set. We also observe that the beat classes are not all equally represented. Hence we set an upper bound to the number of occurrences per heart beat class (here n_row_max = 4000), so that all beats from under-represented classes are included but only a fraction of the beats from over-represented classes is included. We also observe that the classes {S, E} contain less than six elements, which is the minimum number of elements required to balance the representation of a given class (Sec. III). Note that some values do not correspond to heart beat classes, e.g. {Q} corresponds to unclassifiable beats and {/, f} corresponds to paced beats; we choose to keep them to add robustness to the model. Hence the heart beat classes that can be classified are reduced to {N, L, R, a, J, A, j, e, V, F, Q, /, f}, totalling 13 classes.

We encode the annotation values into binary vectors with length equal to the number of classes, resulting in a matrix y_ic, where i ∈ {1, ..., n_sample} indexes the heart beats and c ∈ {1, ..., n_class} indexes the heart beat classes.
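As a sketch of the encoding step (the class ordering below is an assumption for illustration, not taken from the paper):

```python
import numpy as np

# The 13 retained classes; this ordering is an illustrative assumption.
CLASSES = ['N', 'L', 'R', 'a', 'J', 'A', 'j', 'e', 'V', 'F', 'Q', '/', 'f']

def encode_labels(symbols, classes=CLASSES):
    """Encode annotation symbols into a binary matrix y_ic of shape
    (n_sample, n_class), with a single 1 per row at the class index."""
    index = {s: c for c, s in enumerate(classes)}
    y = np.zeros((len(symbols), len(classes)), dtype=int)
    for i, s in enumerate(symbols):
        y[i, index[s]] = 1
    return y

y = encode_labels(['N', 'V', '/'])
print(y.shape)  # (3, 13)
```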
III. DATA ENGINEERING

A. Data resampling
Since the heart beat classes are not all equally represented, we re-sample the data by generating synthetic beats belonging to the under-represented classes, thus producing a new data set with balanced classes. We use the Synthetic Minority Oversampling Technique (SMOTE) as implemented in the imbalanced-learn package (https://imbalanced-learn.readthedocs.io/en/stable/index.html). When we resample data, the training set undergoes resampling, whereas the test set does not, so that the performance metrics refer to the original class distribution.

Figure 2. Distribution of the heart beat classes. Top panel: Distribution of the original labels. Bottom panel: Distribution of both the original and the predicted labels (best method).

B. Generation of new variables
From the original signals, we can generate new variables that encode potentially useful information, e.g. the first discrete derivative of the smoothed signals dx_ijk and the Fourier transform of the chopped smoothed signals X_ilk. The Fourier transform of a signal contains both positive-frequency and negative-frequency components; hence, instead of X_ilk, we use the amplitude spectrum |X_ilk| = sqrt(X_ilk X*_ilk). The result of the concatenation of the original variables with the generated variables is the data matrix ~x_ijk = {x_ijk, dx_ijk, |X_ijk|} such that, for n_chann = 2, the data matrix ~x_ijk consists of 3 × n_chann = 6 variables.
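A sketch of the generated variables using NumPy (the smoothing step is omitted here for brevity):

```python
import numpy as np

# x holds one chopped beat per row, shape (n_sample, len).
rng = np.random.default_rng(1)
x = rng.standard_normal((4, 256))

dx = np.gradient(x, axis=1)           # discrete first derivative dx_ijk
X = np.fft.rfft(x, axis=1)            # keep positive-frequency components
amp = np.sqrt((X * np.conj(X)).real)  # amplitude spectrum |X| = sqrt(X X*)

print(dx.shape, amp.shape)  # (4, 256) (4, 129)
```

Using the real-input FFT keeps only the non-negative frequencies, which already discards the redundant negative-frequency components mentioned in the text.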
C. Selection of variables

We compute the correlation between each pair of variables indexed {k1, k2} with values {~x_ijk1, ~x_ijk2}, which we denote by Corr_k1k2 (Fig. 3, top panel). We also compute the correlation of each variable k with the beat annotations y_ic, which we denote by Corr_ky (Fig. 3, bottom panel).

By setting an upper bound corr_max, we use the correlation between each pair of variables as a measure of redundancy. Variables {k1, k2} such that |Corr_k1k2| > corr_max are classified as redundant, hence one of them can be removed without loss of information. From Fig. 3, top panel, no removal is justified on the basis of redundancy.

By setting a lower bound corr_min, we use the correlation of each variable with the beat annotations as a measure of the relevance of that variable in predicting annotations. A variable k such that |Corr_ky| < corr_min is classified as irrelevant, hence it can be removed without loss of information. From Fig. 3, bottom panel, the variables {dx_ij,sign1, dx_ij,sign2, |X_ij,sign2|} can be removed on the basis of relevance.
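A sketch of both correlation criteria (the threshold values, function name and variable names are illustrative assumptions):

```python
import numpy as np

def select_variables(vars_, target, corr_max=0.9, corr_min=0.1):
    """Apply the two correlation criteria of the text: a pair of
    variables with |Corr| > corr_max is redundant; a variable whose
    |Corr| with the target is below corr_min is irrelevant.
    Thresholds here are illustrative assumptions."""
    names = list(vars_)
    data = np.stack([vars_[n] for n in names])
    C = np.corrcoef(data)  # Corr_{k1 k2}
    redundant = {(names[a], names[b])
                 for a in range(len(names))
                 for b in range(a + 1, len(names))
                 if abs(C[a, b]) > corr_max}
    irrelevant = {n for n, row in zip(names, data)
                  if abs(np.corrcoef(row, target)[0, 1]) < corr_min}
    return redundant, irrelevant

rng = np.random.default_rng(2)
s = rng.standard_normal(500)
vars_ = {'sign1': s,
         'sign2': s + 0.01 * rng.standard_normal(500),  # near-duplicate
         'noise': rng.standard_normal(500)}             # unrelated
red, irr = select_variables(vars_, target=s)
print(red)  # {('sign1', 'sign2')}
```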
IV. CLASSIFICATION MODELS

A. Neural networks
Since the heart beats in the data are labelled, we look for a classification model to infer the patterns common to heart beats in the same class. Given the nature of the data, neural networks (NN) either of the recurrent type (RNN) or the convolutional type (CNN) will be adequate models.

An RNN regards each heart beat as a sequence of data points in time and combines the value at the previous instant with a transformation of earlier values. A variation of RNN is the long short-term memory (LSTM) NN. Each heart beat must be further divided into sublengths of the original beat length so that the different sublengths are regarded as a sequence of data points. Since each heart beat has size len and we are looking for three peak-like structures, the characteristic sublength will be len/3. For sublengths, we use fractions of the characteristic sublength, in particular sublen ∈ {1/8, 1/4, 1/2, 1} × len = {32, 64, 128, 256}.

A CNN regards each heart beat as a one-dimensional image and operates one-dimensional convolutions (Conv1D) over the kernel. Since the characteristic sublength is len/3, the largest scale will be the length corresponding to the Nyquist frequency, hence len/3/2. For kernel sizes, we use powers of two between 4 and 32; for stride step, we use stride = 1. For optimiser, we use ADAM (with the default values lr = 0.001, beta_1 = 0.9, beta_2 = 0.999, decay = 0); for loss function, we use categorical cross entropy; for number of epochs, we use n_epochs = 25; for batch size, we use batch_size = 64; for activation function, we use the rectified linear unit (ReLU).

B. Cross-validation
Figure 3. Correlation matrices. Top panel: Correlation matrix between each pair of variables. Bottom panel: Correlation between each variable and the heart beat annotations.

We devise a cross-validation scheme so that each classification is trained on a manageably sized training set. We first divide the entire data matrix ~x_ijk into nk = 5 subsets, each of which preserves the proportion among the different classes in ~x_ijk. We keep one of the nk subsets as test set with the original class distribution and resample the remaining nk − 1 subsets, producing the resampled data. We then divide the resampled data into nk' = 3 subsets, one of which serving as the resampled input data. We then divide the resampled input data into nk'' = 5 subsets, one of which serving as resampled test set and the remaining serving as resampled training set. We rotate the resampled training data set over the nk'' subsets so that the fitting of the classification model is done nk'' times on different training sets. We then rotate the resampled input data over the nk' sets so that the fitting of the classification model is done nk'' × nk' times on different training sets.

We average the nk'' × nk' classification predictions over nk'', yielding nk' mean predictions of the test data. We then average the nk' classification predictions, yielding one confusion matrix labelled by the corresponding nk' mean accuracies. The fitting resulting from each of the nk'' × nk' subsets is applied to the test data, thus producing nk'' × nk' classification predictions for each element in the test data. The final combined prediction is the mean over the nk'' × nk' classification predictions, whose resulting accuracies we include in the tables.
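One rotation level of this scheme can be sketched with scikit-learn's StratifiedKFold; here a logistic-regression stand-in replaces the NN and the resampling step is omitted:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Hold out one stratified test set, fit the model on each of the nk''
# inner folds, and average the nk'' predictions of the test set.
rng = np.random.default_rng(3)
X = rng.standard_normal((300, 8))
y = (X[:, 0] > 0).astype(int)

outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)   # nk
train_idx, test_idx = next(iter(outer.split(X, y)))

inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)   # nk''
probas = []
for fit_idx, _ in inner.split(X[train_idx], y[train_idx]):
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X[train_idx][fit_idx], y[train_idx][fit_idx])
    probas.append(clf.predict_proba(X[test_idx]))

mean_proba = np.mean(probas, axis=0)  # combined prediction of the test set
accuracy = (mean_proba.argmax(axis=1) == y[test_idx]).mean()
print(mean_proba.shape)
```

Averaging class probabilities before taking the argmax matches the combined-prediction step described above.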
C. Selection of neural network architecture

In order to select an adequate NN architecture, we try different architectures for RNN and CNN [5]. We use the implementation from TensorFlow via the application programming interface Keras (https://keras.io).

Table I. Selected architectures for testing. Column 1: identification of the architecture. Columns 2-4: identification of the performance metrics, where Accuracy is the resulting accuracy from the combined classification over all nk' sets, and ⟨Precision⟩ and ⟨Recall⟩ are respectively the mean precision and mean recall over all classes. Column 5: identification of the efficiency metric, where Run time is the running time in minutes.

Architecture               | Accuracy | ⟨Precision⟩ | ⟨Recall⟩ | Run time (min)
LSTM (sublen = 256)        | 0.712    | 0.479       | 0.580    | 436
LSTM (sublen = 32)         | 0.699    | 0.460       | 0.556    | 445
Conv                       |          |             |          |
Conv + LSTM (sublen = 256) | 0.770    | 0.572       | 0.569    | 215
Conv + LSTM (sublen = 32)  | 0.761    | 0.513       | 0.605    | 143
ConvLSTM (sublen = 256)    | 0.759    | 0.564       | 0.549    | 536
ConvLSTM (sublen = 32)     | 0.759    | 0.570       | 0.584    | 391

We represent the NN architectures schematically as sequences of layers, with Input representing the input data and Output representing the predicted classification. We explore four types of architectures:

a) architecture with LSTM layers, named
LSTM:

LSTM: Input → LSTM → Dropout → Dense → Output ; (1)

b) architecture with Conv1D layers, named Conv:

Conv: Input → Conv1D → Dropout → MaxPool → Flat → Dense → Output ; (2)

c) architectures that combine both LSTM and Conv1D layers, named Conv + LSTM and
ConvLSTM:

Conv + LSTM: Input → Conv1D → Dropout → MaxPool → Flat → Dense → LSTM → Dropout → Dense → Output , (3)

ConvLSTM: Input → ConvLSTM → Dropout → Flat → Dense → Output . (4)

We first test these architectures for the minimal NN formulation and for approximately the same number of layers, setting the kernel size to kernel_size = 4 and the number of filters to n_filter = 64. As measures of performance, we use the total number of true positives (TP) and the number of predicted classes; as measure of efficiency, we use the running time (Table I).

We observe that the
Conv architecture yields the best performance and the shortest running time; this prompted us to consider Conv for further study. We also observe that most heart beats belonging to the classes {/, A, L, N, R, a, e, j} are correctly classified by all architectures. The goal is now to increase the number of TP of the other classes, namely {F, J, V, f}.

D. Selection of convolutional neural network architecture
We test the Conv architecture for different numbers and organisations of layers. We explore eight NNs. In the Conv1D layers, we first set kernel_size = 4 and padding = valid, and vary the number of filters within the range n_filter ∈ {16, 32, 64}, starting at 16 and doubling every time that the Conv1D layer is preceded by a pooling layer. As measure of performance, we use the total TP.

We start with the NN suggested in Ref. [1], since this NN was conceived to classify heart beats from the same database. We keep the architecture as shown below, named Acharya:

Acharya: Input → Conv1D (ReLU) → Dropout → MaxPool → Conv1D (ReLU) → Dropout → MaxPool → Conv1D (ReLU) → Dropout → MaxPool → Flat → Dense (ReLU) → Dense (ReLU) → Dense (Softmax) → Output . (5)

Comparing Acharya (Fig. 4, left panel) with the previous best performing NN, the TP of {f} increases significantly, whereas the TP of {F, V} decrease, with the total TP staying approximately the same.

We change Acharya by moving the drop-out layers from after the Conv1D layers to after the dense layers, which we name Acharya 2. Comparing Acharya 2 with Acharya, the TP of {F} increases, whereas the TP of {V, f} decrease, with the total TP decreasing, thus a worsening in performance.

We test the NN suggested in Ref. [6], since this NN was conceived to estimate cosmological parameters, which requires looking for different scales in the data. We keep the architecture as shown below, named Gupta:

Gupta: Input → Conv1D (ReLU) → AveragePool → Conv1D (ReLU) → Conv1D (ReLU) → AveragePool → Conv1D → AveragePool → Conv1D → AveragePool → AveragePool → Flat → Dense (ReLU) → Dropout → Dense (ReLU) → Dropout → Dense (ReLU) → Dropout → Dense (Softmax) → Output . (6)

This NN contains contiguous convolutional layers without intermediate pooling layers, forming a block of two Conv1D layers. Comparing Gupta with Acharya, the TP of {F, f} increase but the TP of {V} decreases, with the total TP decreasing, thus no improvement in performance.

We change Gupta by adding another set of contiguous Conv1D layers without intermediate pooling layers, as shown below, named
Gupta 2:

Gupta 2: Input → Conv1D (ReLU) → AveragePool → Conv1D (ReLU) → Conv1D (ReLU) → AveragePool → Conv1D (ReLU) → Conv1D (ReLU) → Conv1D (ReLU) → AveragePool → Flat → Dense (ReLU) → Dropout → Dense (ReLU) → Dropout → Dense (ReLU) → Dropout → Dense (Softmax) → Output . (7)

Figure 4. Confusion matrix from NNs. Left panel: Acharya with n_conv = 3 convolutional layers, kernel_size = {4, 4, 4}, n_pool = 3 MaxPool layers and n_drop = 3 drop-out layers; scores = [0.66, 0.77, 0.79]. Centre panel: Gupta 2 with n_conv = 6 convolutional layers, kernel_size = {4, (4, 4), (4, 4, 4)}, n_pool = 3 AveragePool layers and n_drop = 3 drop-out layers; scores = [0.72, 0.76, 0.78]. Right panel: Gupta 4 with n_conv = 7 convolutional layers, kernel_size = {4, (4, 4), (4, 4, 4), 4}, n_pool = 4 AveragePool layers and n_drop = 3 drop-out layers; scores = [0.70, 0.77, 0.78]. All NNs have padding = valid.

Comparing
Gupta 2 (Fig. 4, centre panel) with Gupta, the TP of {V} increases but the TP of {F, f} decrease, with the total TP increasing, thus an improvement in performance.

We change Gupta 2 by adding drop-out layers before each pooling layer in a similar way to Acharya, which we name Acharya 3. Comparing Acharya 3 with Gupta 2, the total TP stays approximately the same, thus no improvement in performance.

We change again Gupta 2 by adding batch-normalization layers (BatchNorm) between the Conv1D layers and the ReLU layers, which we name Gupta 3. Comparing Gupta 3 with Gupta 2, the total TP stays approximately the same, thus no improvement in performance.

While the newly generated Gupta NNs prove to increase the TP of {V} in comparison to Gupta, they cannot increase the TP of {F, J, f}. Hence we change Gupta 2 again by adding a single Conv1D layer just before the flat layer, which we name Gupta 4. Comparing Gupta 4 (Fig. 4, right panel) with Gupta 2 or Gupta 3, the total TP decreases slightly, thus a worsening in performance.

We change Gupta 4 by adding BatchNorm layers between the Conv1D layers and the ReLU layers, which we name Gupta 5. Comparing Gupta 5 with Gupta 2 or Gupta 3, the total TP decreases slightly, thus a worsening in performance.

We thus select Gupta 2 for further testing.
E. Selection of convolutional neural network hyper–parameters
We explore further the Gupta 2 NN by varying some hyper-parameters. As measures of performance, we use the average accuracies, the resulting accuracy of the combined classification and the total TP.

a) Type of pooling layer and type of padding: We test the type of pooling layer and the type of padding simultaneously. For each type of pooling layer, we vary the type of padding in the Conv1D layers, while keeping the kernel sizes per block equal to kernel_size = {4, (4, 4), (4, 4, 4)}. We observe that AveragePool with padding = same yields the best performance.

b) Number of dense layers: We vary the number of dense layers n_dense over six values, where the number of drop-out layers is n_drop = n_dense − 1, while keeping the number of filters equal to n_filter = 64. We observe that Gupta 2 yields an increase in the total TP up to n_dense = 4 and a decrease for n_dense > 4. Since the difference in performance between n_dense = 4 and n_dense = 5 is not significant, we keep the original Gupta 2 with n_dense = 4 dense layers due to its simplicity.

c) Kernel sizes of each convolutional layer:
We vary the kernel sizes of each block over the range {4, 8, 16}, arranging these three values in combinations respectively of one, two and three values in consecutive order. With these constraints, the best performing NNs of the Gupta 2 type are for

kernel_size ∈ { {·, (4, ·), (4, ·, ·)}, {·, (8, ·), (4, ·, ·)}, {·, (4, ·), (4, ·, ·)}, {·, (16, ·), (4, ·, ·)}, {·, (8, ·), (4, ·, ·)}, {·, (8, ·), (4, ·, ·)} } . (8)

While all these NNs yield average accuracies between 0.7 and 0.8, the NN with kernel_size = {16, (8, 8), (4, 4, 4)} yields the largest total TP and the largest resulting accuracy. This suggests that the important features for the classification of heart beats are best captured in groups of 4, 8 and 16 data points, with the convolutions proceeding from large to small scales. These scales correspond to {4, 8, 16} × dt ≈ {0.011, 0.022, 0.044} s. We thus select kernel_size = {16, (8, 8), (4, 4, 4)} for further testing.

Table II. Comparison with previously published work. Column 1: identification of the work. Column 2: identification of the network size. Column 3: identification of the number of classes in the data. Columns 4-6: identification of the performance metrics for the best performing network, where Accuracy is the resulting accuracy, and ⟨Precision⟩_w and ⟨Recall⟩_w are respectively the mean precision and mean recall over all classes, weighted by the size of each class.

Work                        | Network size        | No. classes | Accuracy       | ⟨Precision⟩_w  | ⟨Recall⟩_w
Acharya et al. (2017) [1]   | 3 Conv1D + 3 Dense  | 5           | (0.935, 0.940) | (0.979, 0.979) | (0.960, 0.967)
He et al. (2018) [2]        | 9 Conv1D + 2 Dense  | 5           | (0.979, 0.988) |                |
Jun et al. (2018) [3]       | 6 Conv2D + 1 Dense  | 8           | 0.990          | 0.986          | 0.978
Rajpurkar et al. (2017) [4] | 33 Conv1D + 1 Dense | 14          |                | 0.809          | 0.827
Carvalho (2020)             | 6 Conv1D + 4 Dense  | 13          | 0.821          | 0.848          | 0.822

Figure 5. Normalised confusion matrix from selected NN. Gupta 2 NN with kernel_size = {16, (8, 8), (4, 4, 4)}, AveragePool layers and padding = same, applied to ~x_ijk = {x_ij,sign1, x_ij,sign2}.

F. Selection of input data
Gupta 2
NN for different data matrices, namely ~x ijk = { x ij , sign1 , x ij , sign2 } , consisting of the original vari-ables, and ~x ijk = { x ij , sign1 , x ij , sign2 , | X ij , sign1 |} , consistingof the variables selected from the correlation constraints.While both data matrices yield mean accuracies between 0.7and 0.8, the data matrix ~x ijk = { x ij , sign1 , x ij , sign2 } withoutsmoothing yields the largest total TP and the largest resultingaccuracy. This suggests that the Fourier transforms of theoriginal signals do not add discriminative information and thatsmoothing the heart beats might erase important features.V. R ESULTS
A. Results from the best–performing neural network
We produce the distribution of the predicted heart beatclasses, which follows approximately the same distribution asthat of the original heart beat classes (Fig. 2, bottom panel).We produce the confusion matrix of our best performing neuralnetwork normalised to the data per heart beat class for easierassessment of the performance per class (Fig. 5). The class thatis worst classified is { F } (“Fusion of ventricular and normalbeat”), which is mostly misclassified as { V } (“Premature ventricular contraction”), hence the neural network is con-founding between two ventricular arrhythmias. The next worseclassified classes are: a) { J } (“Nodal (junctional) prematurebeat”), which is mostly classified as { R } ( “Right bundlebranch block beat”); b) { A } (“Atrial premature beat”), whichis also classified as either { N , R , V , a } (where a stands for“Aberrated atrial premature beat”); and c) { f } (“Fusion ofpaced and normal beat”), which is also classified as either { /, Q } (respectively “Paced beat” and “Unclassifiable beat”). B. Comparison with results from other published work