[PDF] Sleep Apnea and Respiratory Anomaly Detection from a Wearable Band and Oxygen Saturation

Abstract

Objective: Sleep related respiratory abnormalities are typically detected using polysomnography. There is a need in general medicine and critical care for a more convenient method to automatically detect sleep apnea from a simple, easy-to-wear device. The objective is to automatically detect abnormal respiration and estimate the Apnea-Hypopnea-Index (AHI) with a wearable respiratory device, compared to an SpO2 signal or polysomnography using a large (n = 412) dataset serving as ground truth. Methods: Simultaneously recorded polysomnographic (PSG) and wearable respiratory effort data were used to train and evaluate models in a cross-validation fashion. Time domain and complexity features were extracted, important features were identified, and a random forest model employed to detect events and predict AHI. Four models were trained: one each using the respiratory features only, a feature from the SpO2 (%)-signal only, and two additional models that use the respiratory features and the SpO2 (%)-feature, one allowing a time lag of 30 seconds between the two signals. Results: Event-based classification resulted in areas under the receiver operating characteristic curves of 0.94, 0.86, 0.82, and areas under the precision-recall curves of 0.48, 0.32, 0.51 for the models using respiration and SpO2, respiration-only, and SpO2-only respectively. Correlation between expert-labelled and predicted AHI was 0.96, 0.78, and 0.93, respectively. Conclusions: A wearable respiratory effort signal with or without SpO2 predicted AHI accurately. Given the large dataset and rigorous testing design, we expect our models are generalizable to evaluating respiration in a variety of environments, such as at home and in critical care.

Full PDF

Wolfgang Ganglberger, MSc*

1, 2 , Abigail A. Bucklin, BA* , Ryan A. Tesh, BSc , Madalena Da Silva Cardoso , Haoqi Sun, PhD , Michael J. Leone, MSc , Luis Paixao, MD, MSc , Ezhil Panneerselvam, MD , Elissa M. Ye, MSc , B. Taylor Thompson, MD , Oluwaseun Akeju, MD , David Kuller, BSc , Robert J. Thomas, MD , M. Brandon Westover, MD, PhD Abstract

Objective : Sleep related respiratory abnormalities are typically detected using polysomnography. There is a need in general medicine and critical care for a more convenient method to automatically detect sleep apnea from a simple, easy-to-wear device. The objective is to automatically detect abnormal respiration and estimate the Apnea-Hypopnea-Index (AHI) with a wearable respiratory device, compared to an SpO signal or polysomnography using a large (n = 412) dataset serving as ground truth. Methods : Simultaneously recorded polysomnographic (PSG) and wearable respiratory effort data were used to train and evaluate models in a cross-validation fashion. Time domain and complexity features were extracted, important features were identified, and a random forest model employed to detect events and predict AHI. Four models were trained: one each using the respiratory features only, a feature from the SpO (%)-signal only, and two additional models that use the respiratory features and the SpO (%)-feature, one allowing a time lag of 30 seconds between the two signals. Results : Event-based classification resulted in areas under the receiver operating characteristic curves of 0.94, 0.86, 0.82, and areas under the precision-recall curves of 0.48, 0.32, 0.51 for the models using respiration and SpO , respiration-only, and SpO -only respectively. Correlation between expert-labelled and predicted AHI was 0.96, 0.78, and 0.93, respectively. Conclusions : A wearable respiratory effort signal with or without SpO2 predicted AHI accurately. Given the large dataset and rigorous testing design, we expect our models are generalizable to evaluating respiration in a variety of environments, such as at home and in critical care. Department of Neurology, Massachusetts General Hospital, Boston, MA, USA Sleep & Health Zurich, University of Zurich, Zurich, Switzerland Present Address: Department of Neurology, Washington University School of Medicine in St. Louis, USA Division of Pulmonary and Critical Care Medicine, Massachusetts General Hospital, Boston, MA, USA Department of Anesthesiology, Critical Care, and Pain Medicine, Massachusetts General Hospital, Boston, USA MyAir Inc, Boston, MA, USA Department of Medicine, Division of Pulmonary, Critical Care & Sleep, Beth Israel Deaconess Medical Center, Boston, MA, USA * Co-first authors, equal contribution, † Co-senior authors, equal contribution. Corresponding author: M. Brandon Westover, [email protected]

Sleep Apnea and Respiratory Anomaly Detection from a Wearable Band and Oxygen Saturation I. INTRODUCTION

Abnormalities of breathing during sleep may reflect both a primary sleep disorder (sleep apnea) or the health of the brain, lungs, or heart. There is broad utility if an easy to apply, minimally obtrusive and accurate measure of respiration is available, either for sleep apnea diagnostics or to track key vital signs in a range of situations. The latter includes intensive care units, heart failure and lung disease admissions or home tracking, and during conscious sedation. In such situations, abnormal breathing is expected, and could be used as a biomarker of recovery or treatment effects. Respiratory tracking may also be used to identify those at higher risk for post-operative delirium[1]. Central sleep apnea, which occurs when breathing stops during sleep because the brain fails to send signals to the muscles that control breathing[2] is especially common in hospitalized patients and may predict readmissions in systolic heart failure[3]. Obstructive sleep apnea is more common in the general population than central sleep apnea, and it is estimated that 2% to 4% of the adult population is affected[4]. A recent study on global prevalence of obstructive sleep apnea reports that almost one billion people are affected and stresses the urge to improve both healthcare and cost-effectiveness[5]. The standard method for sleep apnea detection is for patients to undergo a sleep study at a sleep centre or hospital. These sleep studies use polysomnography (PSG) to detect sleep apnea. PSG includes data from multiple signals involving many wires and is often uncomfortable for the patient. The recording from the night is then manually scored by trained sleep experts who detect apnea and hypopnea events and arousals. This process takes significant time and effort and is susceptible to errors. Even home sleep apnea tests are not readily repeatable, and current guidelines emphasize manual scoring of respiratory events. Many studies recently have attempted to implement machine learning and deep learning techniques to automate the apnea-hypopnea detection process from PSG signals[6–9]. However, there is a need for a diagnostic tool for detecting sleep apnea that is suitable both for home settings as well as critical care. The ideal respiratory monitoring device should have at least three characteristics: ease of use/application, automated analytics, and accuracy. In the present study we focus on a method for detecting sleep apnea, as a surrogate for the general capability of detecting abnormal respiration during sleep, from a practical wearable respiratory belt on its own or in combination with SpO . Our analytics approach consists of the extraction of clinically relevant and interpretable features, as well as a machine learning model (random forest). While the model predicts apnea events, the underlying respiratory features can give further insight to clinicians. We show a novel, automated method for the detection of sleep apnea and AHI categorization with patient-friendly equipment suitable for a variety of clinical settings. We further provide the main code used in this study, ten PSGs with annotations and the wearable respiratory signals, and all trained models on our GitHub page[10]. II. M ATERIAL AND M ETHODS

Figure 1. Flowchart of the modelling approach. We train models to detect apnoea events and predict Apnoea Hypopnea Index based on either the wearable respiration signal only, or on respiration and oxygen saturation signals.

Data Collection : The data, 412 PSGs (Natus System[11]) from 404 patients who also wore a wearable respiration belt (Airgo[12], see Figure E1 in appendix) were collected at the Massachusetts General Hospital sleep lab between January 2019 and January 2020 with approval of the Partners Institutional Review Board under protocol 2018P002937. Participants who were coming to the sleep lab for an overnight sleep study were enrolled through verbal consent shortly before the onset of PSG. Patients who agreed to participate wore the respiratory belt during their sleep study. There were no exclusion criteria. Experts at the sleep lab annotated the PSG recordings according to the American Academy of Sleep Medicine 2007 manual[13]. PSG and wearable recordings synchronization have been manually reviewed and corrected if necessary.

Apnea and Hypopnea Scoring Rule:

There are two different accepted rules for scoring hypopneas according to the AASM manual[13], the ‘3% rule’ and ‘4% rule’, see appendix. The data was scored by experts (experienced registered polysomnographic technicians) using the 4% rule and we designed an automatic hypopnea scoring algorithm, using all PSG signals, to determine ground truth for the 3% rule and 4% rule.

We use the expert and automatic labels to train machine learning models on the wearable respiration and oximetry signals, resulting in the following ground truth labels: Expert Labels 4% Rule (‘EL4’), Auto Labels 4% Rule (‘AL4’), and Auto Labels 3% Rule (‘AL3’).

Cross-Validation:

To obtain an unbiased performance estimate for the models, we split the data subject-wise into training, validation, and test sets in a 10-fold-cross-validation fashion, see appendix.

Definition of Apnea Event:

All events labelled as any type of apnea or hypopnea are treated as an ‘apnea event.’ All remaining data are categorized as a ‘non-apnea-event’.

Overall Modelling Approach: (see Figure 1 and Figure E2)

We used a 90-second window, allowing enough context information before and after a typical 10 to 30 second lasting apnea event, and a step size of 1 second to extract features from the respiration signal, which were used to train a random forest model[14] for the binary apnea classification task (‘Model Respiration-Only’). Furthermore, one feature from the SpO signal was extracted (desaturation drop within 45 seconds of the detected event), and together with the respiration features was used to train another model with the same goal (‘Model Respiration+SpO ). A third model was trained with a very similar approach to the Respiration+SpO model, but the SpO feature was extracted to be insensitive to a time lag of up to ±30 seconds between the respiration and SpO signal (‘Model Respiration+SpO (robust)’). The latter is intended for scenarios where time alignment of signals coming from different sensors cannot be guaranteed. Finally, we constructed an ‘SpO -Only’ model, that solely checks if the magnitude of a desaturation drop is above a certain threshold. Table 1. Demographics Demographic n (%) Age (years) a

56 (16) Sex Male 220 (54%) Female 182 (45%) Unknown 2 (0.5%) Race White of Caucasian 257 (64%) Black or African American 20 (5%) Asian 16 (4%) Native Hawaiian or Pacific Islander 1 (0.2%) American Indian or Alaska Native 1 (0.2%) Other or Unknown 109 (27%) Hispanic Ethnicity 17 (14%) Study Type Diagnostic 193 (47%) Split Night 112 (28%) Full Night Titration 106 (26%) Apnea-Hypopnea-Index a

11 (13) Charlson Comorbidity Index a a Mean (Standard Deviation)

The following four types of features were extracted at various timepoints (see Figure E3): standard deviation of the signal, sum of positive first derivative, sample entropy[15, 16], and Katz fractal dimension[17]. From a total of 242 features we performed feature selection[18, 19] and we determined the number of features that achieved an optimal median receiver operating characteristic curve (ROC AUC) [20] among the ten folds. Furthermore, we performed model hyperparameter optimization. We used the models to predict apnea events and AHI, and evaluated the performances using ROC AUC, area under the precision recall curve (PRC AUC) [21], accuracy, sensitivity, precision, F1 score[22], and correlation between AHI and oxygen desaturation. See appendix for further details on methodology. III. R ESULTS

Apnea and Hypopnea Scoring Rule:

The Pearson Correlation between the hypopnea indices (HI) based on expert and automatic PSG-based labels were 0.90 for Expert 4% and Auto 4%, 0.70 for Expert 4% and Auto 3%, and 0.84 for Auto 4% and Auto 3%, (see Figure E4). The mean HI are the same for the expert and auto 4% approaches, and there is a mean increase of 6 hypopneas per hour when using the 3% rule. These results (together with a sleep expert’s manual review session of the 3%-rule auto-detections) show the applicability of the automatic hypopnea label algorithm (see Figure E5).

Feature Selection and Feature Importance : As illustrated in Figure E6, the individual folds resulted in AUCs between 0.8 and 0.92 and F1 scores between 0.55 and 0.75 on the validation sets. The mean and median performances increased up to approximately 40 features, plateaued until approximately 55, and then performances decreased again (‘overfitting’). Analysing those results, we decided to use a maximum of 51 features (per fold) and exclude all sample entropy and fractal dimension features computed on the raw (non-smoothed) respiration signal, as the smoothed-version features showed higher importance. The most important respiration features among all folds were (descending order): 1.

Ventilation features at positions 55, 60, 65, 75, 80, 90 (all with width 10 seconds, and 12 second width for position 60). 2.

Katz fractal dimension with width 10 seconds at positions 65 and 60. 3.

Ventilation width 10 minutes (‘reference’) 4.

Katz fractal dimension with width 60 seconds at position 90. 5.

Standard deviation with widths 8 and 10 seconds at positions 60 and 80, and width 10-minute (‘reference’). 6.

Sample entropy with width 30 seconds at positions 70 and 75.

Figure 2 . Apnoea Hypopnea Index (AHI)-based model performance evaluation (based on expert labels) for models trained on features from respiration and SpO (left column), respiration only (middle column), and SpO2-only (right column). Results are obtained from left-out test data in a 10-fold cross-validation fashion; n=409 polysomnographic recordings including wearable respiration. Panels A-C: Scatterplots of predicted (model-based) and expert-labelled AHI per recording. Panels D-F: Difference of predicted and expert-labelled AHI, including 95% confidence interval (CI). Both model results show a unimodal error distribution. Panels G-I: Confusion matrix (in %) with AHI categorization (Normal: 0-5, Mild: 5-15, Moderate: 15-30, Severe: > 30), where rows: expert-labelled AHI category, columns: predicted AHI category. Accuracies for categorizations: 80% (Respiration+SpO ), 67% (Respiration-Only), and 77% (SpO -Only). Event based:

Table E1 shows results for the event-based performance evaluation of the three models with different inputs. Adding the SpO feature increased the overall sensitivity from 55.6% to 66.7% and reduced the number of false positives by nearly 50% to a false positive rate of 2.7%. The lag-robust version performed similarly to the ‘non-robust’ model and was just a fraction more sensitive (67.6% vs. 66.7%) and less precise. When including the SpO feature, the largest increase in class-based sensitivity occurs for hypopneas, which is not surprising given that its definition includes ‘a reduction of airflow of more than 50%’ but not a complete absence of respiration – so this event was expected to be more difficult to detect looking at respiration signal only. Mean patient performance:

Model performances averaged over all PSGs that contained at least five expert-labelled apnoea events (n=360) (see Table 2 for Expert 4% and Auto 3% rules, and Table E2 for Auto 4% rule) showed: 1.

The results are similar for all ground truth label versions (3% and 4%-hypopnea-scoring rule), showing the model approach is suitable for either of the scoring rules. 2.

ROC AUC is greatest for Respiration+ SpO models (0.94), followed by Respiration-Only (0.86) and SpO -Only (0.82). 3. PRC AUC is greatest for SpO -Only (0.51), followed by Respiration+ SpO (0.48) and Respiration-Only (0.32). 4. The Respiration+SpO robust version shows only a slight performance decrease compared to Respiration+ SpO (ROC AUC 0.93 vs. 0.94, PRC AUC 0.45 vs. 0.48, respectively). Apnea Hypopnea Index:

The AHI evaluation from the Respiration+SpO model resulted in coefficients of determinations (r ) 0.92 and 0.85 for the Expert Labels (4%-hypopnea rule) and Auto labels (3%-hypopnea rule) respectively. The respective AHI categorization accuracies (Acc) were 0.8 and 0.7. See Table E3 in the appendix for r and accuracy values for the Respiration-Only and SpO only models. Figure 2 shows the scatterplot for predicted and expert-labelled AHI for the Respiration+SpO , the Respiration-Only model, and SpO -Only Model (panels A,-C), the distribution of the absolute AHI difference (panels D-F) and the AHI-categorization confusion matrix (panel G-I). The coefficients of determinations of 0.92, 0.61 and 0.87 respectively show that all models can predict AHI with good agreement to the true AHI. We observed an increased performance if the SpO feature is added together with the respiration signal. Figure E7 and Figure E8 show similar trends for the Auto 3% and Auto 4%-rule. The ROC AUCs for the binarized AHI categorization tasks are between 0.94-0.98 (Respiration+ SpO ), 0.85-0.95 (Respiration-Only), and 0.91-0.97 (SpO -Only) for the different hypopnea rules, indicating that the models accurately distinguish between AHI categories for any input signal and hypopnea rule used (see Figure 3). Example model predictions and expert labels are shown together with the respiration and oxygen saturation signals in Figure 4. SpO burden: The correlations between the ground truth labels and SpO labels are between 0.46-0.53. For the Respiration+SpO models, the correlations are between 0.50-0.52, and for the Respiration-Only model they are between 0.37-0.39. See Table E4. Figure 3. Receiver Operating Curves and the corresponding areas under the curve (AUC), evaluated on binary AHI categorization tasks, for different hypopnea scoring rules. All models result in AUCs greater than 0.85, showing an accurate AHI categorization.

Figure 4. Example signals and predictions: 60 minutes of a recording from a 47-year old subject, where the AHI prediction is comparable to the mean model performance (true AHI: 7, predicted AHI respiration+SpO :8, predicted AHI respiration-only: 10). The respiration and SpO signals as well as the apnoea events annotated by the sleep experts and predicted by our two models are presented. Point a) shows an instance where both models labelled an apnoea event, but the sleep experts did not, even though a decrease in the respiration signal and a large drop in blood oxygenation is visible. Such instances are treated as false positives, even though this event is either missed by the sleep experts or the event potentially narrowly fails to qualify for an apnoea event. Point b) is an example of an event where the sleep experts and both models label an apnoea event. Point c) shows an instance where the Respiration+SpO model incorrectly labels an apnoea event, but the Respiration-Only model does not. Point d) is labelled as an apnoea event by the sleep expert but is not detected by either model (false negative). The region around point e) shows a series of similar events, all showing a cyclic decrease and increase in respiration amplitude and desaturation. Some of those events are marked as apnoea events by the sleep expert, some are not – illustrating the difficulty for a model to fully agree with the human labels. At point f), the respiration clearly switches from an instable period to a more stable period and it becomes more regular – neither the sleep expert nor the models detect events here. Figure 5. 10-minute respiration and SpO2 signal tracings from two subjects that have a very similar expert-scored AHI (15 and 17) but clearly show different kinds of apnoea events. These different phenotypes are reflected in the obstructive apnoea indices (OI) and central apnoea indices (CI), subject 1: OI=1, CI=7, subject 2: OI=12, CI=2. The wearable respiration signal not only contains information to automatically assess the AHI, but to differentiate phenotypes. Figure 6. Example signals and predictions: 20 minutes of a recording from an 88-year old woman who was admitted to the ICU for a hip fracture. The patient has a Charlson Comorbidity Index of 0, and no previous obstructive sleep apnea diagnosis. The apnea predictions from the Respiration+SpO2 and Respiration-Only models are shown on the figure. The detected AHI from the Respiration+SpO2 model was 9.6, and from the Respiration-Only model it was 12.7. The average SpO2 value was 98.0 ± 1.6, with a maximum value of 100.0 and a minimum of 92.0. Table 2. Mean Patient Performance Expert Labels (4% Hypopnea Rule) Performance Metric Respiration+SpO Respiration+SpO (robust) Respiration-Only SpO -Only ROC AUC 0.94 (0.93 to 0.94) 0.93 (0.92 to 0.93) 0.86 (0.85 to 0.87) 0.82 (0.81 to 0.83) PRC AUC 0.48 (0.46 to 0.50) 0.45 (0.42 to 0.47) 0.32 (0.30 to 0.35) 0.51 (0.50 to 0.53) Accuracy 0.94 (0.93 to 0.95) 0.94 (0.93 to 0.95) 0.92 (0.92 to 0.93) 0.96 (0.95 to 0.97) Sensitivity 0.58 (0.56 to 0.61) 0.58 (0.56 to 0.61) 0.50 (0.48 to 0.53) 0.73 (0.71 to 0.76) Precision 0.50 (0.48 to 0.53) 0.44 (0.42 to 0.46) 0.32 (0.30 to 0.34) 0.57 (0.55 to 0.58) F1 Score 0.52 (0.50 to 0.54) 0.48 (0.46 to 0.50) 0.36 (0.34 to 0.38) 0.62 (0.61 to 0.64) Automatic Labels (3% Hypopnea Rule) Performance Metric Respiration+SpO Respiration+SpO (robust) Respiration-Only SpO -Only ROC AUC 0.93 (0.92 to 0.93) 0.92 (0.92 to 0.93) 0.86 (0.85 to 0.87) 0.83 (0.82 to 0.83) PRC AUC 0.44 (0.42 to 0.46) 0.42 (0.40 to 0.44) 0.32 (0.30 to 0.34) 0.52 (0.51 to 0.53) Accuracy 0.95 (0.94 to 0.96) 0.95 (0.94 to 0.96) 0.93 (0.93 to 0.94) 0.95 (0.94 to 0.96) Sensitivity 0.51 (0.48 to 0.53) 0.54 (0.51 to 0.56) 0.53 (0.51 to 0.56) 0.85 (0.83 to 0.87) Precision 0.52 (0.50 to 0.54) 0.48 (0.45 to 0.50) 0.36 (0.34 to 0.38) 0.49 (0.46 to 0.51) F1 Score 0.50 (0.48 to 0.52) 0.49 (0.47 to 0.51) 0.40 (0.37 to 0.42) 0.59 (0.57 to 0.61) IV. D ISCUSSION

Detecting abnormal sleep respiration including classic sleep apnea in a patient-friendly and low-cost manner can enable clinicians and researchers to better understand and treat primary sleep disorders and abnormal sleep-breathing in a range of medical and neurological conditions. The present study provides new measurement insights in this area as it: a) uses a large dataset of simultaneously recorded wearable signals and polysomnographies (as the gold standard) with more than 400 subjects. b) uses a rigorous method of model performance evaluation, i.e. no information from the test patients was used for model training (cross-validation), making the reported performance an unbiased estimate of what one can expect when using this model in the future on a similar population. c) shows state-of-the-art performance in automatically assessing AHI, compared to other studies published using wearable or portable signals (see Table E5). Performance comparison with other studies is challenging, as even when the same performance metrics (such as accuracy) are used, the underlying approach to compute the metrics might differ. Moreover, a substantial number of studies have not reported a complete view of the performance; for example, reporting sensitivity- and specificity-based metrics but not precision-based metrics, or reporting event-based metrics but not patient-based (AHI) metrics. d) investigates the question of how much it helps in predicting AHI to have an SpO (%) sensor in addition to the wearable respiratory device, where we observed an increase in performance for most metrics. This leads us to the conclusion that having both sensors will facilitate the most reliable automatic AHI assessment and further analysis, such as apnea phenotyping. Moreover, for real-world (and non-ideal sleep lab settings), we showed that allowing a time lag between the respiration and SpO signals (potentially recorded from two different sensors) of up to ±30 seconds, performance is practically unchanged. In this paper we have shown that it is possible to detect apnea events using a single, wearable respiratory belt with or without SpO , and from SpO alone. While we show here that it is possible to accurately compute AHI from SpO signals alone, it is important to note that the respiration signals from this wearable respiratory device provide additional important phenotypic information about breathing patterns that one may not obtain from conventional SpO metrics alone. Figure 5 shows an example of two distinctly different phenotypic breathing patterns from two patients with similar AHI score. This illustrates the importance of measuring respiration when detecting sleep apnea or tracking sleep-breathing in diverse conditions, in addition to SpO information. We have also shown that this model for apnea and hypopnea detection is robust. We tested the model using both of the hypopnea scoring rules recommended by the AASM manual[13], and the results for each rule showed accurate AHI prediction. Our results have implications for both sleep apnea diagnostics, and sleep-breathing tracking in environments such as the intensive care unit. There are several approaches to monitoring respiration in the intensive care unit [23, 24], including pulse oximetry, ventilator pressure and flow waveforms, lung volume measurement, electrical impedance tomography, respiratory information embedded in hemodynamic monitoring, volumetric capnography, esophageal and transpulmonary pressures and diaphragmatic electromyography. Each technique has advantages and disadvantages, and some cannot be continued once mechanical ventilation is stopped. The technology we describe is simple, and can move with the individual patient, including from hospital to home. Sleep based on electroencephalography is notoriously difficult to measure in the intensive care environment, and oximetry typically is used as a safety measure. Technologies and analytics of the type we describe could perform as an enhanced vital sign, including post-weaning from the ventilator, when patients are especially vulnerable. The effect of treatment of heart failure or lung disease exacerbations could also be tracked. In the context of sleep apnea, a diagnosis from a wearable respiration device with or without the addition of SpO allows greater flexibility of sleep apnea diagnosis in cases where full PSGs or event home testing are not readily feasible. High resolution respiratory effort monitoring also enables extraction of sleep stage information[25], and sleep-breathing phenotype data, which may further enable improved precision management. The concept of home sleep and sleep apnea testing continues to evolve rapidly with advances in technology. Current guidelines to assess obstructive sleep apnea do not include single respiratory effort belts [26]. Our results show that we can obtain clinically useful accuracy with a sensitive respiratory effort signal alone, combined with advanced analytics. Some limitations to the study include potential selection bias from the fact that the study population was chosen from a single-centre and included only patients already coming to the sleep lab for testing. Thus, we did not have hospitalized and critically ill patients, where breathing abnormalities may be harder to quantify. Including more patients from multiple hospitals and patients from the general public could have increased the breadth of the study population and the generalizability of the findings. Additionally, only one sleep expert scored each PSG recording, as is typical of clinical recordings. Although there are seven sleep experts (technicians) at the centre who score PSGs, each individual recording is only scored by one of the experts. Therefore, there is a possibility that that there is some variation or label noise when it comes to the ground truth sleep expert labels, because scoring PSGs is a subjective process and can be biased based on the expert that scored the recording. This concern is largely mitigated by the finding that results were similar when evaluating the model against ground truth labels determined from the PSG by automated methods. V. C ONCLUSIONS

Respiratory effort signals obtained with a single patient-friendly wearable band, standard oxygen saturation signals, and an automated analytics approach allow us to extract clinically relevant respiratory features and detect sleep apnea with high accuracy. In future work, we plan to apply this model to ICU patients. Patients enrolled in a clinical trial [27] in the ICU at Massachusetts General Hospital wear a respiratory belt for the study, and we plan to use this information in addition to SpO to detect apneas in this population of patients. For the patients already enrolled, we performed a preliminary check using our models and obtained promising initial results – Figure 6 shows the resulting apnea predictions for one example patient. Compliance with Ethical Standards: Funding: This study was funded by the NIH (1R01NS102190, 1R01NS102574, 1R01NS107291, 1RF1AG064312). Conflict of Interest: Wolfgang Ganglberger declares that he/she has no conflict of interest. Abigail A. Bucklin declares that he/she has no conflict of interest. Ryan A. Tesh declares that he/she has no conflict of interest. Madalena Da Silva Cardoso declares that he/she has no conflict of interest. Haoqi Sun declares that he/she has no conflict of interest. Michael J. Leone declares that he/she has no conflict of interest. Luis Paixao declares that he/she has no conflict of interest. Ezhil Panneerselvam declares that he/she has no conflict of interest. Elissa M. Ye declares that he/she has no conflict of interest. B. Taylor Thompson reports personal fees from Bayer and Thetis, outside the submitted work. Oluwaseun Akeju declares that he/she has no conflict of interest. David Kuller reports non-financial support from Myair Inc., during the conduct of the study; non-financial support from Myair Inc, outside the submitted work; In addition, Dr. Kuller has a patent Patent US10123724B2 "Breath volume monitoring system and method" issued. Dr. Thomas reports personal fees from GLG Councils, Guidepoint Global, and Jazz Pharmaceutics, outside the submitted work. In addition, Dr. Thomas has a patent ECG-spectrogram with royalties paid by MyCardio, LLC, a patent Auto-CPAP with royalties paid by DeVilbiss-Drive, and an unlicensed patent CO2 device for central / complex sleep apnea issued. Dr. Westover reports grants from NIH, during the conduct of the study. Ethical approval: All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Helsinki declaration and its later amendments or comparable ethical standards. Informed consent: Informed consent was obtained from all individual participants included in the study.

References 1. Revels SL, Cameron BH, Cameron RB (2019) Obstructive sleep apnea and perioperative delirium among thoracic surgery intensive care unit patients: perspective on the STOP-BANG questionnaire and postoperative outcomes. J Thorac Dis 11:S1292–S1295. https://doi.org/10.21037/jtd.2019.04.63 2. Sacchetti LM, Mangiardi P (2012) Obstructive Sleep Apnea: Causes, Treatment and Health Implications. Nova Science Publishers 3. Khayat R, Abraham W, Patt B, et al (2012) Central sleep apnea is a predictor of cardiac readmission in hospitalized patients with systolic heart failure. J Card Fail 18:534–540. https://doi.org/10.1016/j.cardfail.2012.05.003 4. Epstein LJ, Kristo D, Strollo PJ, et al (2009) Clinical guideline for the evaluation, management and long-term care of obstructive sleep apnea in adults. J Clin Sleep Med 5:263–276 5. Benjafield AV, Ayas NT, Eastwood PR, et al (2019) Estimation of the global prevalence and burden of obstructive sleep apnoea: a literature-based analysis. The Lancet Respiratory Medicine 7:687–698. https://doi.org/10.1016/S2213-2600(19)30198-5 6. Deviaene M, Testelmans D, Borzée P, et al (2019) Feature Selection Algorithm based on Random Forest applied to Sleep Apnea Detection. In: 2019 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC). pp 2580–2583 7. Marcos JV, Hornero R, Alvarez D, et al (2010) Automated detection of obstructive sleep apnoea syndrome from oxygen saturation recordings using linear discriminant analysis. Med Biol Eng Comput 48:895–902. https://doi.org/10.1007/s11517-010-0646-6 8. Varon C, Caicedo A, Testelmans D, et al (2015) A Novel Algorithm for the Automatic Detection of Sleep Apnea From Single-Lead ECG. IEEE Transactions on Biomedical Engineering 62:2269–2278. https://doi.org/10.1109/TBME.2015.2422378

9. Almazaydeh L, Elleithy K, Faezipour M (2012) Obstructive sleep apnea detection using SVM-based classification of ECG signal features. In: 2012 Annual International Conference of the IEEE Engineering in Medicine and Biology Society. pp 4938–4941 10. (2020) mghcdac/respiratory_event_detection_wearable. MGH Clinical Data Animation Center 11. Welcome to Neuro. In: Natus. https://neuro.natus.com/. Accessed 6 Apr 2020 12. MyAir LLC. In: AirGo TM

24. Tobin MJ, Yang KL, Jubran A (1990) Respiratory monitoring in the ICU. Appl Cardiopulm Pathophysiol 3:211–218 25. Sun H, Ganglberger W, Panneerselvam E, et al (2019) Sleep Staging from Electrocardiography and Respiration with Deep Learning. Sleep. https://doi.org/10.1093/sleep/zsz306 26. Collop NA, Tracy SL, Kapur V, et al (2011) Obstructive sleep apnea devices for out-of-center (OOC) testing: technology evaluation. J Clin Sleep Med 7:531–548. https://doi.org/10.5664/JCSM.1328 27. https://clinicaltrials.gov/ct2/show/NCT03355053?term=Investigation+of+Sleep+in+the+Intensive+Care+Unit

Appendix M ETHODS

Apnea and Hypopnea Scoring Rule:

According to the AASM manual i , there are two different accepted rules for scoring hypopneas. Both rules have the same requirement for airflow decrease (measured on nasal pressure or PAP device flow signals) but differ in that a) there must be at least 3% oxygen desaturation from baseline or the airflow decrease is associated with an arousal, or b) there is at least 4% oxygen desaturation. For simplicity, we call the first rule ‘3% rule’ and the second ‘4% rule’. The sleep lab dataset used for this study was scored by “experts” (registered and experienced polysomnographic technicians) using the 4% rule. At the time of data collection, the 3% rule was not being implemented. To create apnea prediction models that were suitable with the 3% rule as well, we designed the following automatic hypopnea scoring algorithm to determine ground truth for the 3% rule based on full PSG data: 1) Compute airflow decrease and check if requirement, as specified in the AASM manual, is fulfilled. 2)

Compute the oxygen desaturation and check if a 3% threshold is exceeded. 3)

Check if the expert has labelled an event as an arousal. 4)

If 1) and either 2) or 3) is fulfilled, an event is classified according to the 3% rule. We then use these automatic labels derived from the PSG data in the same manner as the expert labels, i.e. as ground truths to train machine learning models that act on the reduced data. We computed the auto-labels for both the 3% and the 4%-rule, the latter to validate the automatic approach, as the auto-hypopnea-labels with the 4% rule should be similar to the expert hypopnea labels. We computed pairwise Pearson correlation coefficients between the hypopnea indices ( Cross Validation:

In each fold, recordings from 80% of all unique subjects were used for training, while both the validation and test sets consisted of 10% of patients each. For patients with multiple recordings, all recordings from the same patient were in the same set.

Signal Preprocessing:

The only preprocessing step done in the respiration signal was normalization: For each recording, the mean µ and standard deviation σ of the 0.01 and 0.99-quantiles clipped signal was used to normalize the signal s: s = (s- µ)/ σ.

Feature Extraction : Twelve timepoints of interest were defined for each 90-second segment:10, 20, 30, 40, 45, 50, 55, 60, 65, 70, 80, and 90 seconds (see E3 for an illustration). Four types of features were extracted at these timepoints: 1.

Standard deviation of the signal for the past 8, 10, and 12 seconds; we also added past 5-min and 10-min (long-term) references. 2.

Sum of positive first derivative (as an estimate of ventilation) for the past 8, 10, and 12 seconds and additionally a 5-min and a 10-min feature. 3.

Sample Entropy ii : for the past 10, 30, 45, and 60 seconds. 4. Katz Fractal Dimension iii : for the past 10, 30, 45, and 60 seconds. Features 1 and 2 were aimed at encoding the information about the expected decrease in standard deviation and ventilation during an apnea period compared to before and after an event. Features 3 and 4 were aimed at encoding information about the regularity of the respiration, as we expected periods with apneas have a more irregular, unpredictable pattern than normal breathing. Features 3 and 4 were computed both from the raw respiration signal and from a 1.2-second-moving average of the respiration signal that is supposed to be less affected by short, small artifacts such as movement and heartbeats. In total, those four feature categories with different look-back times and positions in the 90-second window led to 242 features.

Feature Selection:

To reduce the number of features and avoid problems due to the curse of dimensionality iv and overfitting v the following iterative feature selection approach was designed: All 242 features were computed for the full data. 2.

Hierarchical clustering with Ward’s method vi was used to group the features into clusters so that colinear features are in the same cluster. 3. The remaining features (first iteration: all features) were used to train a random forest model to predict apnea-events. 4.

Feature permutation importance vii was computed, and the bottom 10% of features were removed. If all features from a cluster were in the bottom 10%, then one of the features from the cluster was kept. Therefore, all features from a cluster could only be removed if at a certain step one (and only one) feature was left in the cluster and was selected in the bottom 10%. This cluster-elimination behavior was used to account for colinear features that might have been ranked with low importance by permutation (because the colinear feature contains similar information) but in fact had high importance as a group. 5.

Steps 3 and 4 were performed until there were ten features left. To reduce computational complexity of feature computation, the sample entropy and Katz fractal dimension features were only kept for either the raw or 1.2-second-moving average version after result analysis, i.e. the versions that had the least surviving features in the remaining 10 features (in all folds) were eliminated. To decide on the number of features to use, we evaluated the apnea-event prediction performance of all models, for all steps in the feature selection procedure, on the validation set. Performance metrics used were the area under the receiver operating characteristic curve (ROC AUC) viii and the F1 score (harmonic mean of sensitivity and precision) ix .We selected the number of features that had the best median AUC among all ten folds. Model output smoothing rule:

The model was designed to be applied to the data with a step size of one second, and to yield an event prediction each second. We applied a post-processing prediction output smoothing rule that grouped any potentially disjoined 1-second-predictions into joined event predictions: If, in a segment of 10 seconds, there were more than or equal to i_positive_predictions (hyperparameter, see below) apnea predictions on a 1-second resolution, then the whole 10-second segment was labelled as an apnea prediction. If this rule was not fulfilled, the whole 10-second segment was labelled as no-apnea prediction.

Model Building and Hyperparameter Search:

With the number of features fixed, we then aimed to find a good set of hyperparameters for the binary classification random forest models (hyperparameter tuning used training data only). This step was done independently for the Respiration-Only and the Respiration+SpO models. Hyperparameters analyzed were: 1. minimum numbers of samples needed for a split 2. imbalance of apnea-event and non-apnea-event samples class-weights for apnea-events and non-apnea-events. For a resulting set of models that performed best on the ROC AUC on the validation set, two further post-processing parameters were refined: 1. prediction threshold of classifier 2. prediction smoothing rule ( i_positive_predictions)

A final set of hyperparameters was selected that offered the most favorable performance in terms of mean AHI prediction on the validation set.

Model evaluation:

The models’ performances were evaluated in the following ways: 1.

Event-based: All test set recordings were pooled together and the number of true positives, false positives, false negatives, and true negatives were computed x , where: -True positive: An expert-labelled event that is predicted by the model too, so that at least 50% of the expert-label overlaps with the prediction label. -False negative: An expert labelled event that is not predicted as an event by the model, i.e. no prediction or a prediction that overlaps less than 50% of the expert-label. -False positive: A model predicted event that is not expert-labelled, or less than 50% of the prediction’s duration is expert-labelled as an event. -Number of true negatives = (seconds of data without any labelled or predicted event)/18. The median length of a labelled event in our dataset is 18 seconds. By normalizing the non-labelled part by the median, a proportionate number of true negative events is computed. Furthermore, we computed the sensitivities of the different expert-labelled apnea events, e.g. sensitivity obstructive apnea = percentage of all expert-labelled obstructive apneas that were predicted as an event by the model. 2.

Mean patient performance: With the event-based definitions above, we computed the following standard performance metrics for each recording independently and report the sample mean and associated 95% confidence interval: ROC AUC, area under the precision recall curve (PRC AUC) xi , accuracy, sensitivity, precision, and F1 score. To avoid biasing the mean with recordings with nearly no events labelled (performance metrics can be very low or very high by chance in this case), recordings with less than 5 total expert-labelled events were excluded (this exclusion criteria affected 49 recordings and did not change the result significantly). 3. Apnea Hypopnea Index: Following AASM 2007 manual xii , AHI was computed both for the ground truth labels, and for the model predicted apnea events: AHI = ) was computed as the squared Pearson correlation coefficient. For each recording, the difference of AHI prediction was computed, i.e. AHI_ difference = AHI_predicted – AHI_true, and visualized as a histogram with bin size 2. Next, we categorized each expert-labelled and predicted AHI according to the AASM manual’s suggestion: Normal: 0-5 Mild: 5-15 Moderate: 15-30 Severe: 30 or more. A normalized confusion matrix of the categorized AHI prediction was computed and visualized (e.g. row ‘mild’, column ‘severe’ shows the percentage of all recordings that were categorized as ‘mild’ by the expert but ‘severe’ by the model), and the corresponding accuracy was derived. Furthermore, to evaluate the models’ ability to discriminate between binary AHI categorizations, we compute ROCs and AUCs for the following classification tasks: a) normal vs. mild, moderate and severe. b) normal and mild vs. moderate and severe. c) normal, mild and moderate vs. severe. To compute the ROCs, we vary the prediction threshold of the model. For each threshold, the predicted (binarized) AHI category for each patient is computed, from which a pair of sensitivity and specificity values is obtained. SpO burden: To quantify the blood oxygenation burden of a full night’s sleep, we defined: SpO burden = area under the median SpO line / time, where only data during sleep was considered. The Pearson correlation between the AHIs (true and predicted) and the SpO burden was computed. Data analysis was done using Python 3.6 xiii , with main packages used: NumPy xiv , pandas xv , SciPy xvi , Matplotlib xvii , scikit-learn xviii and EntroPy xix . F IGURES

AND T ABLES

Figure E3. Illustration of feature extraction. For each window of 90 seconds (and step size one second) features like standard deviation, ventilation, fractal dimension, and sample entropy are extracted at twelve positions. The signal input to the feature (‘width’) varies from 10 seconds to 60 seconds (and 5 and 10 minutes for reference).

Figure E1. The wearable respiration device (‘Airgo’ [11]) used in this study

Figure E2. Flowchart of the modelling approach for the SpO2-Only model. Figure E4. Automatic hypopnea algorithm results. The hypopnea labels of the expert and the auto 4% approach show a large agreement with a Pearson correlation of 0.9 and same mean hypopnea index (HI). Using the 3% rule leads to a mean increase of 6 hypopneas per hour Figure E5. Automatic Hypopnea detection (and rule conversion) algorithm. Example results for a 15- minute signal during diagnostic part of a split night study (i.e. PTAF signal available). Colored rectangles with height 4 represent hypopnea labels for automatic and expert approach, blue rectangle with height 1 represents an event labelled as obstructive apnea by the expert. The fourth and fifth panel show the computed features from the airflow and SpO2 signal respectively that are used to assess if a threshold (as specified by American Society of Sleep Manual) is met. TABLE E1. Performance Results Event-Based P

ERFORMANCE EVALUATION OF THE MODELS INCLUDING THE WEARABLE RESPIRATION ON THE FULL DATA OF

POLYSOMNOGRAPHIC RECORDINGS (10-

FOLD - CROSS - VALIDATION ), USING THE EXPERT LABELS AS GROUND TRUTH . R ESPIRATION +S P O R ESPIRATION +S PO (R OBUST ) R ESPIRATION -O NLY T OTAL HOURS SLEEP A NNOTATED EVENTS T RUE P OSITIVES F ALSE P OSITIVES F ALSE N EGATIVES

RUE N EGATIVES S ENSITIVITY O BSTRUCTIVE A PNEA S ENSITIVITY C ENTRAL A PNEA S ENSITIVITY M IXED A PNEA

ENSITIVITY H YPOPNEA P RECISION A PNEA E VENT

Figure E6. Results of iterative feature selection approach on the validation set for the 10 folds. A maximum of 50 features was selected as optimal. Table E2. Mean Patient Performance Automatic Labels (4% Hypopnea Rule) Performance Metric Respiration+SpO Respiration+SpO (robust) Respiration-Only SpO -Only ROC AUC 0.93 (0.93 to 0.94) 0.93 (0.92 to 0.93) 0.86 (0.85 to 0.87) 0.82 (0.81 to 0.83) PRC AUC 0.47 (0.45 to 0.49) 0.44 (0.41 to 0.46) 0.32 (0.30 to 0.34) 0.51 (0.50 to 0.53) Accuracy 0.95 (0.94 to 0.96) 0.95 (0.94 to 0.96) 0.94 (0.94 to 0.94) 0.96 (0.95 to 0.97) Sensitivity 0.51 (0.49 to 0.54) 0.52 (0.50 to 0.55) 0.50 (0.47 to 0.53) 0.73 (0.71 to 0.76) Precision 0.59 (0.57 to 0.61) 0.53 (0.51 to 0.55) 0.40 (0.38 to 0.42) 0.57 (0.55 to 0.58) T ABLE E3 – AHI E VALUATION P ERFORMANCE E XPERT LABELS (4%-

HYPOPNEA RULE ) R

ESPIRATION +S P O R ESPIRATION -O NLY S P O R R CCURACY

UTO LABELS (4%-

HYPOPNEA RULE ) R

ESPIRATION +S P O R ESPIRATION -O NLY S P O R R CCURACY

UTO LABELS (3%-

HYPOPNEA RULE ) R

ESPIRATION +S P O R ESPIRATION -O NLY S P O R R CCURACY Figure E7.

Apnea Hypopnea Index (AHI)-based model performance evaluation (based on auto labels for the 3% hypopnea rule) for models trained on features from respiration and SpO (left column), respiration only (middle column), and SpO2-only (right column). Results are obtained from left-out test data in a 10-fold cross-validation fashion; n=409 polysomnographic recordings including wearable respiration. Panels A-C: Scatterplots of predicted (model-based) and expert-labelled AHI per recording. Panels D-F: Difference of predicted and expert-labelled AHI, including 95% confidence interval (CI). Both model results show a unimodal error distribution. Panels G-I: Confusion matrix (in %) with AHI categorization (Normal: 0-5, Mild: 5-15, Moderate: 15-30, Severe: > 30), where rows: expert-labelled AHI category, columns: predicted AHI category. Accuracies for categorizations: 70% (Respiration+SpO ), 54% (Respiration-Only), and 68% (SpO2-Only). Figure E8. Apnea Hypopnea Index (AHI)-based model performance evaluation (based on auto labels for the 4% hypopnea rule) for models trained on features from respiration and SpO (left column), respiration only (middle column), and SpO2-only (right column). Results are obtained from left-out test data in a 10-fold cross-validation fashion; n=409 polysomnographic recordings including wearable respiration. Panels A-C: Scatterplots of predicted (model-based) and expert-labelled AHI per recording. Panels D-F: Difference of predicted and expert-labelled AHI, including 95% confidence interval (CI). Both model results show a unimodal error distribution. Panels G-I : Confusion matrix (in %) with AHI categorization (Normal: 0-5, Mild: 5-15, Moderate: 15-30, Severe: > 30), where rows: expert-labelled AHI category, columns: predicted AHI category. Accuracies for categorizations: 81% (Respiration+SpO ), 64% (Respiration-Only), and 82% (SpO2-Only). T ABLE E4 – C ORRELATION

AHI-S P O B URDEN

AHI

ASSESSMENT METHOD P EARSON CORRELATION

AHI E XPERT L ABELS

4 0.53 AHI A UTO -L ABELS

4 0.52 AHI A UTO -L ABELS

3 0.46 AHI R ESP +S P O (EL4) R ESP +S P O (AL4) 0.51 AHI R ESP +S P O (AL3) 0.50 AHI R ESPIRATION -O NLY (EL4) 0.37 AHI R ESPIRATION -O NLY (AL4) 0.39 AHI R ESPIRATION -O NLY (AL3) 0.37 AHI S P O -O NLY (3%

THRESHOLD ) S P O -O NLY (4%

THRESHOLD ) 0.58

9 *

Database: Sleep Heart Health Study database

Database: MIT-BIH database +Database: UZ Leuven database %Database: MESA sleep study database T ABLE E5 – L ITERATURE R EVIEW T ABLE P APER Y EAR D ATABASE /H OSPITAL R ECORDINGS /S UBJECTS W EARABLE /P ORTABLE R A CCURACY (%) AUC S P O xx ATABASE

ATABASE

2 8,444

SUBJECTS (*D

ATABASE SUBJECTS (+D

ATABASE

2) N O - 81.6, DIFFERENT TEST SETS ) 0.822,

DIFFERENT TEST SETS ) xxi OSPITAL

SUBJECTS N O - 93.02 0.95 xxii OSPITAL

SUBJECTS P ORTABLE - 89.8 - xxiii

OSPITAL SUBJECTS N O xxiv OSPITAL

RECORDINGS N O - 88.7 (FSFS + LR) (G AS + SVM) - xxv

OSPITAL SUBJECTS N O - 83.26 - xxvi OSPITAL

SUBJECTS N O - 87.2 0.92 xxvii OSPITAL

SUBJECTS N O - 87.61 0.925 xxviii OSPITAL SUBJECTS N O - 93.67 (K NN ) 88.61 (LS-SVM) - xxix OSPITAL

SUBJECTS N O - 87.5 - xxx OSPITAL

SUBJECTS N O - 93.9 0.97 xxxi ATABASE RECORDINGS N O - 93.3 - xxxii OSPITAL

SUBJECTS N O - 89.8 0.90 xxxiii ATABASE RECORDINGS N O - 97.7 - xxxiv OSPITAL

SUBJECTS N O - 89.7 0.967 xxxv OSPITAL

SUBJECTS P ORTABLE - 84.9 0.86 xxxvi

ATABASE ^D ATABASE

2 33

SUBJECTS N O - 85.26 (~D ATABASE ) 97.64 (^D

ATABASE ) - xxxvii

ATABASE RECORDINGS W EARABLE - IHR: P O : P O : P O : P O : R E SP I R A T I ON S I GNA L S xxxviii OSPITAL RECORDINGS W EARABLE - 72.8 ± ± xxxix OSPITAL

DATA POINTS W EARABLE - 91.8 - xl ATABASE

SUBJECTS N O - 82.1±0.6 ( ABDOMEN ) 82.6±0.4 ( THORACIC ) 83.4±0.3 (EDR) 54.8±0.4 ( ABDOMEN ) 55.4±0.6 ( THORACIC ) 52.4±0.4 (EDR) xli

ATABASE RECORDINGS N O - 73.58±5.20 - xlii D ATABASE RECORDINGS P ORTABLE

HI: xliii

OSPITAL

SIGNALS N O - 95 0.9837±0.0168 xliv OSPITAL SUBJECTS N O - 95.8 - xlv D ATABASE RECORDINGS N O - 95.1 - xlvi OSPITAL SUBJECTS N O - - 0.881 xlvii OSPITAL

SUBJECTS N O - 82.43 0.903 xlviii ATABASE

RECORDINGS N O - 72.3 - xlix ATABASE RECORDINGS N O - 98.43(A DA B OOST ) 98.68 (ANFIS) - l ATABASE RECORDINGS N O - 98.68 - li OSPITAL PATIENTS N O - 87.6 - lii ATABASE

SUBJECTS N O - 74.70±1.43 - liii ATABASE RECORDINGS N O - 91.2 0.9604 liv OSPITAL SUBJECTS N O - 89.26 - lv ATABASE RECORDINGS N O - 79.61 - lvi ATABASE

SUBJECTS N O - 95 - R E SP I R A T I O N

AND S P O lvii OSPITAL

RECORDINGS P ORTABLE - AHI > > AHI > >5: > > lviii OSPITAL SUBJECTS N O - 80 - lix OSPITAL SUBJECTS P ORTABLE - 84.13 - ^Database: PhysioNet Apnea-ECG database ~Database: PhysioNet UCD database Table E6. Event-Based Performance Evaluation for All Sensor-Input and Hypopnea Rules Combinations Abbreviations: AL3: Automatic Label 3% Hypopnea Rule, AL4: Automatic Label 4% Hypopnea Rule, EL4: Expert Label 4% Hypopnea Rule, TP: True Positives, FP: False Positives, FN: False Negatives, TN: True Negatives, SE: Sensitivity, OA: Obstructive Apnea, CA: Central Apnea, MA: Mixed Apnea, HY: Hypopnea

Model Sleep (h) Events TP FP FN TN SE OA SE CA SE MA SE HY Respiration-Only AL3 2522 27592 16175 24020 11417 529,560 0.6 0.78 0.58 0.41 Respiration-Only AL4 2522 27592 15479 19479 12113 531,340 0.56 0.78 0.57 0.37 Respiration-Only EL4 2522 27592 15334 28555 12258 524,442 0.55 0.75 0.57 0.38 Respiration+SpO AL3 2522 27592 16323 11037 11269 535,423 0.62 0.67 0.51 0.51 Respiration+SpO AL4 2522 27592 16736 9018 10856 524,870 0.63 0.67 0.50 0.54 Respiration+SpO EL4 2522 27592 18411 14458 9181 517,269 0.68 0.67 0.54 0.66 Respiration+SpO (Robust) AL3 2522 27592 17147 14056 10445 528,875 0.65 0.71 0.54 0.53 Respiration+SpO (Robust) AL4 2522 27592 17177 11527 10415 520,669 0.65 0.70 0.54 0.54 Respiration+SpO (Robust) EL4 2522 27592 18647 18460 8945 518,738 0.70 0.71 0.56 0.64 SpO -Only 3% 2522 27592 24984 17862 2608 448,211 0.92 0.83 0.77 0.97 References Apendix i Richard B. Berry et al., “Rules for Scoring Respiratory Events in Sleep: Update of the 2007 AASM Manual for the Scoring of Sleep and Associated Events,” Journal of Clinical Sleep Medicine : JCSM : Official Publication of the American Academy of Sleep Medicine 8, no. 5 (October 15, 2012): 597–619, https://doi.org/10.5664/jcsm.2172. ii Joshua S. Richman and J. Randall Moorman, “Physiological Time-Series Analysis Using Approximate Entropy and Sample Entropy,” American Journal of Physiology-Heart and Circulatory Physiology 278, no. 6 (June 1, 2000): H2039–49, https://doi.org/10.1152/ajpheart.2000.278.6.H2039. iii Michael J. Katz, “Fractals and the Analysis of Waveforms,” Computers in Biology and Medicine 18, no. 3 (January 1, 1988): 145–56, https://doi.org/10.1016/0010-4825(88)90041-8.

Figure E9. AHI correlation for the different hypopnea labels and different input signals.

Figure E10. Change of AHI correlation between using the expert 4% hypopnea labels and the auto 3% hypopnea labels (in%). Relative to the 4% results, the Respiration-Only achieves largest performance (or the drop in performance is lowest). xvii John D. Hunter, “Matplotlib: A 2D Graphics Environment,” Computing in Science Engineering 9, no. 3 (May 2007): 90–95, https://doi.org/10.1109/MCSE.2007.55. xviii Fabian Pedregosa et al., “Scikit-Learn: Machine Learning in Python,” Journal of Machine Learning Research 12, no. Oct (2011): 2825–2830. xix Raphael Vallat, “Raphaelvallat/Entropy,” April 4, 2020, https://github.com/raphaelvallat/entropy. xx Margot Deviaene et al., “Feature Selection Algorithm Based on Random Forest Applied to Sleep ApneaDetection,” in 2019 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), 2019, 2580–83, https://doi.org/10.1109/EMBC.2019.8856582. xxi J. Víctor Marcos et al., “Automated Detection of Obstructive Sleep Apnoea Syndrome from Oxygen Saturation Recordings Using Linear Discriminant Analysis,” Medical & Biological Engineering & Computing 48, no. 9 (September 2010): 895–902, https://doi.org/10.1007/s11517-010-0646-6. xxii D. Álvarez et al., “Automated Analysis of Unattended Portable Oximetry by Means of Bayesian Neural Networks to Assist in the Diagnosis of Sleep Apnea,” in 2016 Global Medical Engineering Physics Exchanges/Pan American Health Care Exchanges (GMEPE/PAHCE), 2016, 1–4, https://doi.org/10.1109/GMEPE-PAHCE.2016.7504628. xxiii Bijoy Laxmi Koley and Debangshu Dey, “On-Line Detection of Apnea/Hypopnea Events Using SpO2 Signal: A Rule-Based Approach Employing Binary Classifier Models,” IEEE Journal of Biomedical and Health Informatics 18, no. 1 (January 2014): 231–39, https://doi.org/10.1109/JBHI.2013.2266279. xxiv Daniel Alvarez et al., “Assessment of Feature Selection and Classification Approaches to Enhance Information from Overnight Oximetry in the Context of ApneaDiagnosis,” International Journal of Neural Systems 23, no. 5 (October 2013): 1350020, https://doi.org/10.1142/S0129065713500202. xxv Baile Xie and Hlaing Minn, “Real-Time Sleep ApneaDetection by Classifier Combination,” IEEE Transactions on Information Technology in Biomedicine 16, no. 3 (May 2012): 469–77, https://doi.org/10.1109/TITB.2012.2188299. xxvi D. Alvarez et al., “Nonlinear Characteristics of Blood Oxygen Saturation from Nocturnal Oximetry for Obstructive Sleep Apnoea Detection,” Physiological Measurement 27, no. 4 (April 2006): 399–412, https://doi.org/10.1088/0967-3334/27/4/006. xxvii J. Víctor Marcos et al., “Assessment of Four Statistical Pattern Recognition Techniques to Assist in Obstructive Sleep Apnoea Diagnosis from Nocturnal Oximetry,” Medical Engineering & Physics 31, no. 8 (October 2009): 971–78, https://doi.org/10.1016/j.medengphy.2009.05.010. xxviii John F. Morales et al., “Sleep ApneaHypopnea Syndrome Classification in SpO2 Signals Using Wavelet Decomposition and Phase Space Reconstruction,” in 2017 IEEE 14th International Conference on Wearable and Implantable Body Sensor Networks (BSN), 2017, 43–46, https://doi.org/10.1109/BSN.2017.7936003. xxix Daniel Álvarez et al., “Feature Selection from Nocturnal Oximetry Using Genetic Algorithms to Assist in Obstructive Sleep Apnoea Diagnosis,” Medical Engineering & Physics 34, no. 8 (October 2012): 1049–57, https://doi.org/10.1016/j.medengphy.2011.11.009. xxx Daniel Sánchez Morillo and Nicole Gross, “Probabilistic Neural Network Approach for the Detection of SAHS from Overnight Pulse Oximetry,” Medical & Biological Engineering & Computing 51, no. 3 (March 2013): 305–15, https://doi.org/10.1007/s11517-012-0995-4. xxxi Laiali Almazaydeh, Miad Faezipour, and Khaled Elleithy, “A Neural Network System for Detection of Obstructive Sleep Apnea Through SpO2 Signal Features,” International Journal of Advanced Computer Science and Applications (IJACSA) 3, no. 5 (2012), https://doi.org/10.14569/IJACSA.2012.030502. xxxii J. Víctor Marcos et al., “Utility of Multilayer Perceptron Neural Network Classifiers in the Diagnosis of the Obstructive Sleep Apnoea Syndrome from Nocturnal Oximetry,” Computer Methods and Programs in Biomedicine 92, no. 1 (October 2008): 79–89, https://doi.org/10.1016/j.cmpb.2008.05.006. xxxiii Sheikh Shanawaz Mostafa et al., “Optimization of Sleep ApneaDetection Using SpO2 and ANN,” in 2017 XXVI International Conference on Information, Communication and Automation Technologies (ICAT), 2017, 1–6, https://doi.org/10.1109/ICAT.2017.8171609. xxxiv Daniel Alvarez et al., “Multivariate Analysis of Blood Oxygen Saturation Recordings in Obstructive Sleep ApneaDiagnosis,” IEEE Transactions on Bio-Medical Engineering 57, no. 12 (December 2010): 2816–24, https://doi.org/10.1109/TBME.2010.2056924. xxxv Ainara Garde et al., “Development of a Screening Tool for Sleep Disordered Breathing in Children Using the Phone OximeterTM,” PLoS ONE 9, no. 11 (November 17, 2014), https://doi.org/10.1371/journal.pone.0112959. xxxvi Sheikh Mostafa et al., “SpO2 Based Sleep Apnea Detection Using Deep Learning,” 2017, https://doi.org/10.1109/INES.2017.8118534. xxxvii Rahul Krishnan Pathinarupothi et al., “Single Sensor Techniques for Sleep Apnea Diagnosis Using Deep Learning,” in 2017 IEEE International Conference on Healthcare Informatics (ICHI), 2017, 524–29, https://doi.org/10.1109/ICHI.2017.37. xxxviii Tom Van Steenkiste et al., “Portable Detection of Apneaand Hypopnea Events Using Bio-Impedance of the Chest and Deep Learning,” IEEE Journal of Biomedical and Health Informatics, January 20, 2020, https://doi.org/10.1109/JBHI.2020.2967872. xxxix Sayandeep Acharya et al., “Ensemble Learning Approach via Kalman Filtering for a Passive Wearable Respiratory Monitor,” IEEE Journal of Biomedical and Health Informatics 23, no. 3 (May 2019): 1022–31, https://doi.org/10.1109/JBHI.2018.2857924. xl Tom Van Steenkiste et al., “Automated Sleep ApneaDetection in Raw Respiratory Signals Using Long Short-Term Memory Neural Networks,” IEEE Journal of Biomedical and Health Informatics 23, no. 6 (2019): 2354–64, https://doi.org/10.1109/JBHI.2018.2886064. xli T. Willemen et al., “Probabilistic Cardiac and Respiratory Based Classification of Sleep and Apneic Events in Subjects with SleepApnea,” Physiological Measurement 36, no. 10 (October 2015): 2103–18, https://doi.org/10.1088/0967-3334/36/10/2103. xlii Bijoy Laxmi Koley and Debangshu Dey, “Real-Time Adaptive Apnea and Hypopnea Event Detection Methodology for Portable Sleep Apnoea Monitoring Devices,” IEEE Transactions on Bio-Medical Engineering 60, no. 12 (December 2013): 3354–63, https://doi.org/10.1109/TBME.2013.2282337. xliii Aditya Sundar and Chinmay Das, “Low Cost, High Precision System for Diagnosis of Central Sleep ApneaDisorder,” in 2015 International Conference on Industrial Instrumentation and Control (ICIC), 2015, 354–59, https://doi.org/10.1109/IIC.2015.7150767. xliv Seong-Hoon Kim and Gi-Tae Han, “1D CNN Based Human Respiration Pattern Recognition Using Ultra Wideband Radar,” in 2019 International Conference on Artificial Intelligence in Information and Communication (ICAIIC), 2019, 411–14, https://doi.org/10.1109/ICAIIC.2019.8669000. xlv Bijoylaxmi Koley and Debangshu Dey, “Automated Detection of Apneaand Hypopnea Events,” in 2012 Third International Conference on Emerging Applications of Information Technology, 2012, 85–88, https://doi.org/10.1109/EAIT.2012.6407868. xlvi P. Caseiro, R. Fonseca-Pinto, and A. Andrade, “Screening of Obstructive Sleep ApneaUsing Hilbert-Huang Decomposition of Oronasal Airway Pressure Recordings,” Medical Engineering & Physics 32, no. 6 (July 2010): 561–68, https://doi.org/10.1016/j.medengphy.2010.01.008. xlvii G. C. Gutiérrez-Tobal et al., “Linear and Nonlinear Analysis of Airflow Recordings to Help in Sleep Apnoea-Hypopnoea Syndrome Diagnosis,” Physiological Measurement 33, no. 7 (July 2012): 1261–75, https://doi.org/10.1088/0967-3334/33/7/1261. xlviii Nandakumar Selvaraj and Ravi Narasimhan, “Detection of Sleep Apneaon a Per-Second Basis Using Respiratory Signals,” Conference Proceedings: ... Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE Engineering in Medicine and Biology Society. Annual Conference 2013 (2013): 2124–27, https://doi.org/10.1109/EMBC.2013.6609953. xlix Minu M Paul and P G Student, “SAHS Detection Based on ANFIS Using Single Channel Airflow Signal” 5, no. 7 (2007): 9. l Cafer Avcı and Ahmet Akbaş, “Sleep ApneaClassification Based on Respiration Signals by Using Ensemble Methods,” Bio-Medical Materials and Engineering 26 Suppl 1 (2015): S1703-1710, https://doi.org/10.3233/BME-151470. li Galip Ozdemir, Huseyin Nasifoglu, and Osman Erogul, “A Time-Series Approach to Predict Obstructive Sleep Apnea (OSA) Episodes,” 2016, https://doi.org/10.11159/icbes16.117. lii Rim Haidar, Irena Koprinska, and Bryn Jeffries, “Sleep ApneaEvent Detection from Nasal Airflow Using Convolutional Neural Networks,” in Neural Information Processing, ed. Derong Liu et al., Lecture Notes in Computer Science (Cham: Springer International Publishing, 2017), 819–27, https://doi.org/10.1007/978-3-319-70139-4_83. liii Anirudh Thommandram, J. Mikael Eklund, and Carolyn McGregor, “Detection of Apnoea from Respiratory Time Series Data Using Clinically Recognizable Features and KNN Classification,” Conference Proceedings: ... Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE Engineering in Medicine and Biology Society. Annual Conference 2013 (2013): 5013–16, https://doi.org/10.1109/EMBC.2013.6610674. liv Yashar Maali and Adel Al-Jumaily, “Automated Detecting Sleep Apnea Syndrome: A Novel System Based on Genetic SVM,” in 2011 11th International Conference on Hybrid Intelligent Systems (HIS), 2011, 590–94, https://doi.org/10.1109/HIS.2011.6122171. lv Ling Cen et al., “Automatic System for Obstructive Sleep ApneaEvents Detection Using Convolutional Neural Network,” Conference Proceedings: ... Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE Engineering in Medicine and Biology Society. Annual Conference 2018 (2018): 3975–78, https://doi.org/10.1109/EMBC.2018.8513363. lvi Haitham M. Al-Angari and Alan V. Sahakian, “Automated Recognition of Obstructive Sleep ApneaSyndrome Using Support Vector Machine Classifier,” IEEE Transactions on Information Technology in Biomedicine: A Publication of the IEEE Engineering in Medicine and Biology Society 16, no. 3 (May 2012): 463–68, https://doi.org/10.1109/TITB.2012.2185809. lvii Daniel Álvarez et al., “A Machine Learning-Based Test for Adult Sleep Apnoea Screening at Home Using Oximetry and Airflow,” Scientific Reports 10, no. 1 (March 24, 2020): 1–12, https://doi.org/10.1038/s41598-020-62223-4. lviii Eduardo Gil et al., “Discrimination of Sleep-Apnea-Related Decreases in the Amplitude Fluctuations of PPG Signal in Children by HRV Analysis,” IEEE Transactions on Bio-Medical Engineering 56, no. 4 (April 2009): 1005–14, https://doi.org/10.1109/TBME.2008.2009340. lix Jhao-Cheng Wu et al., “A Portable Monitoring System with Automatic Event Detection for Sleep ApneaLevel-IV Evaluation,” in 2018 IEEE International Symposium on Circuits and Systems (ISCAS), 2018, 1–4, https://doi.org/10.1109/ISCAS.2018.8351221.l Cafer Avcı and Ahmet Akbaş, “Sleep ApneaClassification Based on Respiration Signals by Using Ensemble Methods,” Bio-Medical Materials and Engineering 26 Suppl 1 (2015): S1703-1710, https://doi.org/10.3233/BME-151470. li Galip Ozdemir, Huseyin Nasifoglu, and Osman Erogul, “A Time-Series Approach to Predict Obstructive Sleep Apnea (OSA) Episodes,” 2016, https://doi.org/10.11159/icbes16.117. lii Rim Haidar, Irena Koprinska, and Bryn Jeffries, “Sleep ApneaEvent Detection from Nasal Airflow Using Convolutional Neural Networks,” in Neural Information Processing, ed. Derong Liu et al., Lecture Notes in Computer Science (Cham: Springer International Publishing, 2017), 819–27, https://doi.org/10.1007/978-3-319-70139-4_83. liii Anirudh Thommandram, J. Mikael Eklund, and Carolyn McGregor, “Detection of Apnoea from Respiratory Time Series Data Using Clinically Recognizable Features and KNN Classification,” Conference Proceedings: ... Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE Engineering in Medicine and Biology Society. Annual Conference 2013 (2013): 5013–16, https://doi.org/10.1109/EMBC.2013.6610674. liv Yashar Maali and Adel Al-Jumaily, “Automated Detecting Sleep Apnea Syndrome: A Novel System Based on Genetic SVM,” in 2011 11th International Conference on Hybrid Intelligent Systems (HIS), 2011, 590–94, https://doi.org/10.1109/HIS.2011.6122171. lv Ling Cen et al., “Automatic System for Obstructive Sleep ApneaEvents Detection Using Convolutional Neural Network,” Conference Proceedings: ... Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE Engineering in Medicine and Biology Society. Annual Conference 2018 (2018): 3975–78, https://doi.org/10.1109/EMBC.2018.8513363. lvi Haitham M. Al-Angari and Alan V. Sahakian, “Automated Recognition of Obstructive Sleep ApneaSyndrome Using Support Vector Machine Classifier,” IEEE Transactions on Information Technology in Biomedicine: A Publication of the IEEE Engineering in Medicine and Biology Society 16, no. 3 (May 2012): 463–68, https://doi.org/10.1109/TITB.2012.2185809. lvii Daniel Álvarez et al., “A Machine Learning-Based Test for Adult Sleep Apnoea Screening at Home Using Oximetry and Airflow,” Scientific Reports 10, no. 1 (March 24, 2020): 1–12, https://doi.org/10.1038/s41598-020-62223-4. lviii Eduardo Gil et al., “Discrimination of Sleep-Apnea-Related Decreases in the Amplitude Fluctuations of PPG Signal in Children by HRV Analysis,” IEEE Transactions on Bio-Medical Engineering 56, no. 4 (April 2009): 1005–14, https://doi.org/10.1109/TBME.2008.2009340. lix Jhao-Cheng Wu et al., “A Portable Monitoring System with Automatic Event Detection for Sleep ApneaLevel-IV Evaluation,” in 2018 IEEE International Symposium on Circuits and Systems (ISCAS), 2018, 1–4, https://doi.org/10.1109/ISCAS.2018.8351221.