A survey of statistical learning techniques as applied to inexpensive pediatric Obstructive Sleep Apnea data
Emily T. Winn, Marilyn Vazquez, Prachi Loliencar, Kaisa Taipale, Xu Wang, Giseon Heo
Abstract
Pediatric obstructive sleep apnea affects an estimated 1-5% of elementary-school-aged children and can lead to other detrimental health problems. Swift diagnosis and treatment are critical to a child's growth and development, but the variability of symptoms and the complexity of the available data make this a challenge. We take a first step in streamlining the process by focusing on inexpensive data from questionnaires and craniofacial measurements. We apply correlation networks, the Mapper algorithm from topological data analysis, and singular value decomposition in a process of exploratory data analysis. We then apply a variety of supervised and unsupervised learning techniques from statistics, machine learning, and topology, ranging from support vector machines to Bayesian classifiers and manifold learning. Finally, we analyze the results of each of these methods and discuss the implications for a multi-data-sourced algorithm moving forward.
Emily T. Winn
Division of Applied Mathematics, Brown University, Providence, RI, 02912. e-mail: [email protected]
Marilyn Vazquez
Mathematical Biosciences Institute, Ohio State University, Columbus, OH, 43210. e-mail: [email protected]
Prachi Loliencar
Department of Mathematical and Statistical Sciences, University of Alberta, Edmonton, AB, T6G 2C1. e-mail: [email protected]
Kaisa Taipale
C. H. Robinson, 14701 Charlson Rd, Eden Prairie, MN. e-mail: [email protected]
Xu Wang
Department of Mathematics, Wilfrid Laurier University, Waterloo, ON, N2L 3C5, Canada. e-mail: [email protected]
Giseon Heo (corresponding author)
School of Dentistry; Department of Mathematical and Statistical Sciences, University of Alberta, Edmonton, AB, T6G 1C9, Canada. e-mail: [email protected]
Obstructive sleep apnea (OSA), a form of sleep-disordered breathing characterized by recurrent episodes of partial or complete airway obstruction during sleep, is a serious health problem, affecting an estimated 1-5% of elementary-school-aged children [9, 2]. Even mild forms of untreated pediatric OSA may cause high blood pressure, behavioral challenges, or impeded growth. Compared to adults, the symptoms of childhood-onset OSA are more varied and change continuously with development, making diagnosis a difficult challenge. The complexity of the data from surveys, biomedical measurements, 3D facial photos, and time-series data calls for state-of-the-art techniques from mathematics and data science.

Clinical data, including those considered in confirming or ruling out a diagnosis of pediatric OSA, consist of high-dimensional multi-mode data with mixtures of variables of disparate types (e.g., nominal and categorical data of different scales, interval data, time-to-event and longitudinal outcomes), also called mixed or non-commensurate data. Such data obtained from multiple sources are commonplace in modern statistical applications in medicine and health, with thousands, even millions, of features recorded simultaneously from each object or individual.

In this paper, we analyze symptom data provided by patients and clinicians, while in related work, we analyze polysomnography data (physiological time series data). These are case studies in a larger project of building an algorithm to aid clinicians in their treatment decisions for pediatric OSA patients. To overcome the difficulties in analyzing high-dimensional multi-mode data from multiple sources, we propose to adopt a hybrid approach that interactively combines statistics, computational topology, and deep learning to take advantage of their strengths and mitigate their weaknesses.
Statistics provides a suite of tools for model specification and identification, including model estimation and inference, which oftentimes entail a high computational cost. Computational topology via persistent homology aims to detect 'true' signals in high-dimensional data with respect to a varying model parameter ([8], [39]), which contrasts with the conventional statistical approach of estimating one or more model parameters that, at best, yield signals; however, the lack of a coherent approach to statistical inference is a serious drawback. Deep learning, on the one hand, has been successfully used in speech recognition [15] and owes part of its practical appeal to its computational efficiency; on the other hand, it can be difficult to intuitively justify why deep neural networks work. Integrative ensemble methods would thus ideally blend statistical theory, topology, and deep learning in a seamless fashion, all three working in concert, as in a musical ensemble, fully exploiting the amount of available information across multiple sources.

Here, we illustrate the first step toward building a hybrid approach by comparing several methods in statistics, computational topology, and machine learning. In Section 2, we describe data collected from the pediatric obstructive sleep apnea (OSA) study Pro00057638 at the University of Alberta. Survey and craniofacial data are easier and cheaper to collect than polysomnography (PSG) data, so for maximum clinical impact we focus this article on analyses of those data sets. For the analyses of PSG time series data, we refer the reader to [30].
The rest of the paper is organized as follows: we outline initial findings of our data explorations in Section 3, which provides a basis for the methods we applied to our data, described in Section 4. We compare methods in statistics and machine learning for classifying OSA patients using survey data and craniofacial data and discuss results in Section 5. We close the article with conclusions and future research steps in Section 6.
As in adults, OSA in children is associated with cardiovascular dysfunction, neurocognitive dysfunction, behavioral issues, and metabolic consequences. OSA is also believed to negatively influence school performance and learning potential in children. The gold standard for diagnosis of pediatric OSA is polysomnography (PSG) [5]. However, in many countries, access to PSG is severely limited and many children do not have an appropriate diagnosis before treatment. Consequently, children with OSA may not be treated, or some children without OSA may undergo unnecessary surgery. A simple and accessible way to identify children with OSA is needed. Finding insights within inexpensive data is a crucial first step.

The patients at risk of OSA underwent PSG, filled out questionnaires, and had 3D photos taken, which were assessed by orthodontists for the craniofacial index. Normative patients (who were considered not at risk of OSA) did not undergo PSG, but filled out questionnaires and were assessed by orthodontists in the same way as patients at risk of OSA. Our analysis of PSG data can be found in [30].

Broadly speaking, data types can be distinguished as structured and unstructured. Examples of the former include metabolite concentrations, medical records, and survey questionnaires, while more complex high-dimensional data such as digital images (e.g., photos, CT scans, MRI), text, time series, audio, and DNA sequences are examples of the latter. Methodologies for analysis differ based on whether data are structured or unstructured. In the following sections we demonstrate methods in statistics, computational topology, and deep learning to classify or cluster OSA patients using survey questionnaires and craniofacial data (both structured). Our current research aims to build a foundation which will be useful for combining analytic methods for both structured and unstructured data in predicting the severity of OSA.
Once a clinician suspects that OSA may cause troublesome symptoms in a child, OSA-specific surveys can be administered to the affected child and their parents. The questionnaires analyzed here encompass the Child's Sleep Habits Questionnaire, the OSA-18 Quality of Life survey, a Health Screening Questionnaire, the Pediatric Sleep Questionnaire, and the PedsQL Pediatric Quality of Life Inventories for child and parent. In the case of pediatric surveys, many children are not old enough to read or respond to such a survey, leaving parents to report observations about symptoms as best as they can. Missing data may result from survey-takers not knowing an answer to a question or feeling uncomfortable answering a question truthfully. However, even with their shortcomings, surveys are far easier and less costly to obtain and analyze than a PSG exam. Usually, PSG exams are not covered by insurance, resulting in out-of-pocket costs; they require a separate appointment and overnight evaluation; and the results can take anywhere between six months and two years to come back. Conversely, surveys can be completed at the clinic during a visit, come at no additional cost to patients or their families, and can be evaluated immediately. The primary aim of employing statistical, topological, and deep learning techniques in the analysis of survey data is to identify features of the data that would enable more accurate analysis of OSA in children. In addition to the "subjective" patient- and parent-provided data about symptoms and quality of life, we include dentist-gathered data about craniofacial characteristics of children where noted below.
Of particular interest to clinicians is craniofacial data, a series of measurements taken to capture the shape of the face and mouth. There are two reasons for a potential preference for craniofacial data instead of survey or PSG data. First, craniofacial data are inexpensive and take minutes to measure, and therefore take up far fewer resources than those needed for a PSG. The accessibility of these data is demonstrated in our data set, which is complete for all craniofacial measurements. Second, craniofacial data consist of quantitative measurements, which reduces some bias that may come up in qualitative survey questions.

There are nine craniofacial measurements we consider in our analysis. Figure 9 in the appendix depicts the first eight features, while the ninth, the Craniofacial Index, is a sum of these first eight measurements. The measurements are defined as follows [1]:

1. Profile is a measurement of the angle of the shape made by the line from the brow to the base of the nose and a line from the base of the nose to the chin when viewing the patient from the side.
2. Midface Deficiency quantifies the projection of the malar area below the eyes (the bones which form the eye socket and cheekbones) relative to the rest of the face.
3. Lower Face Height is the proportion of the length from the brow to the base of the nose to the length from the base of the nose to the bottom of the chin.
4. Lip Strain scores the amount of effort a patient uses to close their lips, measured by observing the muscle contractions from the front view.
5. Palate scores the depth of the palate (the roof of the mouth) and the arch of the palate.
6. Overjet is the horizontal distance between the upper incisors and the lower incisors when the patient bites down. Any measurement above 5 mm is considered severe.
7. Overbite is the length of vertical overlap between the upper and lower incisors.
8. Posterior Bite is the transverse relationship between the molars and premolars, assessed by observing the relationship between upper posterior teeth and lower posterior teeth on both sides of the mouth.
9. Craniofacial Index is the summation of the previous eight scores and gives a summary statistic of the craniofacial data.

Measurements 1-8 are scored on a scale from 0 to 2, where 0 indicates a normal measurement and 2 indicates a severely abnormal measurement. As a result, the Craniofacial Index can range from 0 to 16. To analyze the craniofacial data, we first looked at the overall distributions of the complete data set (187 subjects, with 76 controls and 111 patients), visualized as histograms (see Figures 10 and 11), to spot any glaring differences in distributions, and calculated the Earth Mover's Distance between each pair of distributions to quantify those differences (see Table 9). We then conducted our analysis as described in Sections 3 and 5.
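As an illustration of the Earth Mover's Distance comparison described above, the following sketch computes the distance between two hypothetical score distributions for a single craniofacial feature; the score counts are invented for illustration and are not the study's actual distributions.

```python
import numpy as np
from scipy.stats import wasserstein_distance

# Hypothetical score samples (0 = normal, 2 = severely abnormal) for one
# craniofacial feature; 76 controls and 111 patients, mirroring the group sizes.
controls = np.repeat([0, 1, 2], [50, 20, 6])
patients = np.repeat([0, 1, 2], [40, 40, 31])

# Earth Mover's (1-Wasserstein) distance between the two empirical distributions.
d = wasserstein_distance(controls, patients)
print(round(d, 3))
```

For scored 1-D data like this, `wasserstein_distance` reduces to the area between the two empirical CDFs, so larger values indicate a bigger shift of patients toward abnormal scores.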
Our data consist of survey responses and craniofacial data from 200 subjects from two different clinics, with 172 observed variables. After removing subjects which did not have an OSA classification listed, we had 187 subjects remaining. At the early stage of recruiting normative children, some subjects did not fill out questionnaires; the reason for this was a miscommunication between the principal investigator and research collaborators. Other subjects were too young to be able to fill out any child surveys and thus had too much missing data. As a result, we first removed any subjects with no OSA classification or with more than 50% of the data variables missing. This brought our total subjects down to 173 (67 controls, 106 patients). We excluded text responses from this analysis, such as lists of medications or descriptions of pain. Yes-no questions were encoded in binary. All clock times for bed time and waking time were removed, and we encoded total sleep time in one numeric column. Finally, we removed gender, height, weight, and body mass index (BMI) due to the number of missing values. After these steps, we were left with 157 input variables.

Many of the questions in the surveys were on a Likert scale. For example, patients were asked to rank on a scale of 1-7 how much they agree with certain statements such as "I have a hard time getting out of bed," or "I feel sleepy during the day." To aid the sorting algorithms, we standardized the encoding so that higher numbers indicate the presence of OSA symptoms and lower numbers indicate the absence of OSA symptoms.

For our learning methods, we imputed all missing values using the MissForest command from the missingpy package for Python [27]; these numbers were rounded to the nearest integer to be consistent with the raw data. We split the data such that 70% was contained in the training set and 30% was in the test set. To demonstrate stability, we applied each learning algorithm to 10 different training/test splits with the same 70/30 ratio. We used the same test and train data sets for each supervised learning algorithm unless otherwise stated. The algorithms were run on three subsets of data: survey questions only, craniofacial measurements only, and combined survey and craniofacial data.
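A minimal sketch of this preprocessing pipeline follows, using scikit-learn's IterativeImputer with a random-forest estimator as a stand-in for missingpy's MissForest; the matrix below is synthetic, not the real 173-by-157 data.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(60, 8)).astype(float)   # synthetic ordinal scores
y = rng.integers(0, 2, size=60)                      # 0 = control, 1 = OSA
X[rng.random(X.shape) < 0.1] = np.nan                # inject ~10% missingness

# Forest-based iterative imputation, then round to the nearest integer
# so imputed values stay on the same ordinal scale as the raw data.
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=10, random_state=0),
    max_iter=3, random_state=0)
X_imputed = np.rint(imputer.fit_transform(X))

# One 70/30 train/test split; the paper repeats this over 10 random splits.
X_train, X_test, y_train, y_test = train_test_split(
    X_imputed, y, test_size=0.3, random_state=0)
print(X_train.shape, X_test.shape)
```

Repeating the final split with different random seeds and re-fitting each classifier gives the stability check described above.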
In this section we present superlevel sets of correlation networks [14] and visualizations from the Mapper algorithm from topological data analysis [32], which we compare to visualizations from singular value decomposition [29]. These explorations of the data illustrate the heterogeneity of symptoms experienced by children and shed light on the limitations of the classification algorithms explored later.
As an initial exploration of the data, we plotted superlevel sets of correlation networks for questionnaire responses of the patient and control groups. Networks allow for visualizing the relationships between input variables and for comparison of those relationships across data sets. Correlation network analyses have been successfully applied in computational biology [3], neuroscience [37], and finance [18]. We use the most basic correlation network, as outlined below. By employing correlation networks, we hope to observe which input variables interact differently in the patient group versus the control group and to use this information to inform our more rigorous analyses. For this particular method, we exclude the children's Pediatric Quality of Life survey, as those values likely correlate strongly with the answers on the parents' Quality of Life survey.

We plot one network representing the patient data and one network representing the control data. Each node or vertex in the network is a symptom/survey response. For each pair of survey questions i and j, we calculate the Pearson correlation between questionnaire responses across all respondents. If the correlation between survey responses i and j is greater than the threshold h, then we place an edge between i and j. If we consider the threshold parameter h ∈ [−1, 1], the networks obtained by varying h form a filtration of the simplicial complex given by the complete graph on all nodes. The topology of these superlevel sets gives information about which symptoms are more coincident in both pediatric OSA and control patients. In Figure 1(a), the threshold is h = 0.6, and in Figure 1(b), the threshold is h = 0.7. These values were chosen because they were the values for which the graphs were sparse enough to make visual observations while still retaining some edges. No two variables had a correlation above 0.8, which we attribute to the size of the data set and the variability of symptom expression in OSA patients.

To make the plots easier to examine, we only show nodes with a degree of one or more; that is, no isolated points were plotted, so any variables not shown can be assumed not to meet the correlation threshold with any other variable.

In the network graphs with threshold h = 0.6, there are some symptom correlations which seem obvious. For example, a patient having seen an orthodontist correlates highly with a patient having received orthodontic treatment, as one is a prerequisite for the other. However, in Figure 1(a), we see that although the patient and control graphs seem to have the same number of nodes, the control group seems to have fewer connected components and more nodes per connected component. We hypothesize that the variables measured in the control group are perhaps more regularly distributed than in the patient group; that is, some subsets of variables in the control group have a more concentrated joint distribution than the same subsets of variables in the patient group. Our hypothesis is further supported by Figure 1(b), where the connectivity appears stronger in the control group than in the patient group. In particular, we notice that five of the craniofacial variables are in a connected component in the control group, whereas all craniofacial variables are isolated nodes in the patient group. Given the value of craniofacial measures as discussed in Section 2.2, we note that the subset of craniofacial measures and their relationship to OSA diagnoses is worth exploring separately from the survey data.
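The thresholding step above can be sketched directly from the correlation matrix; the three synthetic "responses" below stand in for survey questions, two of them deliberately correlated.

```python
import numpy as np

def correlation_network(X, h):
    """Adjacency matrix of the superlevel set at threshold h:
    edge (i, j) iff the Pearson correlation of columns i and j exceeds h."""
    C = np.corrcoef(X, rowvar=False)   # features are the columns of X
    A = C > h
    np.fill_diagonal(A, False)         # no self-loops
    return A

rng = np.random.default_rng(1)
base = rng.normal(size=(100, 1))
# Two strongly correlated synthetic responses plus an independent one.
X = np.hstack([base + 0.1 * rng.normal(size=(100, 1)),
               base + 0.1 * rng.normal(size=(100, 1)),
               rng.normal(size=(100, 1))])

A = correlation_network(X, h=0.6)
print(A.sum(axis=0))   # node degrees: the third variable is isolated
```

Sweeping h from high to low adds edges monotonically, which is exactly the filtration structure described above.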
The Mapper algorithm from topological data analysis ([26], implemented as [32]) is a tool for abstracting high-dimensional data so that one can recover the underlying topological structure (the topological nerve of the data). We here refer to the algorithm as the K-Mapper algorithm to indicate both the method and the implementation. Whereas the other methods give classifications (a subject does have OSA or does not have OSA), K-Mapper reveals the shape of data via a simplicial complex representation. Given the high dimensionality of the data and the suite of topological and geometric classification methods available, running the K-Mapper algorithm on our data may indicate whether such geometric and topological methods are worth implementing, or whether we might gain information from those techniques that we would not gain from traditional statistical tools. This methodology has gained some traction elsewhere in exploring medical data ([16], [24]).

The Mapper algorithm works as follows: first, project the data set into space based on the feature coordinates. Next, using a lens function and an unsupervised clustering method set by the analyst, cover the projection with overlapping hypercubes and then cluster points within the intervals defined by the hypercubes; these clusters become nodes of the graph. Since a single point can appear in multiple nodes, draw an edge between two nodes if there is more than 60% overlap.
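The steps above can be sketched in a toy one-dimensional Mapper; everything here (the lens, the interval count, DBSCAN as the clusterer, the synthetic two-cluster data) is an illustrative assumption rather than the K-Mapper configuration used in the paper.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def mapper_1d(X, lens, n_intervals=5, overlap=0.3, eps=0.5):
    """Toy Mapper: cover the 1-D lens values with overlapping intervals,
    cluster the points in each interval, and connect clusters sharing points."""
    lo, hi = lens.min(), lens.max()
    width = (hi - lo) / n_intervals
    nodes = []                                   # each node = set of point indices
    for i in range(n_intervals):
        a = lo + i * width - overlap * width
        b = lo + (i + 1) * width + overlap * width
        idx = np.where((lens >= a) & (lens <= b))[0]
        if len(idx) == 0:
            continue
        labels = DBSCAN(eps=eps, min_samples=1).fit_predict(X[idx])
        nodes.extend(set(idx[labels == lab]) for lab in np.unique(labels))
    edges = [(i, j) for i in range(len(nodes)) for j in range(i + 1, len(nodes))
             if nodes[i] & nodes[j]]             # shared points -> edge
    return nodes, edges

rng = np.random.default_rng(2)
# Two well-separated synthetic clusters in the plane.
X = np.vstack([rng.normal(0, 0.1, (30, 2)), rng.normal(3, 0.1, (30, 2))])
nodes, edges = mapper_1d(X, lens=X[:, 0])
print(len(nodes), len(edges))
```

With well-separated clusters, no covering interval contains points from both groups, so the resulting graph splits into at least two connected components, the same kind of structure read off from the figures below. Note one simplification: the paper's implementation requires more than 60% overlap before drawing an edge, while this sketch connects any two nodes sharing a point.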
Fig. 1:
Correlation networks derived from the raw data. If two variables had a correlation higher than 0.6 (top) or 0.7 (bottom), then an edge was drawn between them. Isolated points were not plotted. The Quality of Life (QL) survey answered by children was not included, so as to have clearer graphs, as children's answers to the QL survey were likely highly correlated with their parent's or guardian's answers.
We chose to run the K-Mapper algorithm with five intervals (hypercubes) and Euclidean distance as our lens function, together with unsupervised k-means clustering. In the resulting simplicial complex for the combined data, there are two connected components: one with a node which is less than 40% OSA and two nodes with more than 80% OSA, and one with the six other nodes, each of which has at least 70% OSA. Additionally, the homological structure is simple, with only one generator for the first homology group, evident from the one loop seen in the simplicial complex. Overall, the simplicial complex raises questions about the structure of the data. Is there a manifold or subset of space where data points are more than 70% likely to be at risk for OSA? Why is the only overlap for the lowest-risk cluster only above 60% with a cluster for which more than 90% of the subjects are at risk for OSA? Such questions cannot be answered by traditional statistical classification methods; we must invoke more topologically based methods such as manifold learning, singular value decomposition, and density-based clustering for answers.

The survey data (Figure 2(b)) produced the exact same simplicial complex as the combined data did; this may be a signal of the dominance of the survey data in the combined set (survey data had 148 features, as opposed to craniofacial data's 8 features) or a feature of data scaling (we did not scale the data, and used Euclidean distance). This indicates that survey data do and should contribute to diagnosis decisions.

In the simplicial complexes derived from craniofacial data, there was a stark difference in structure from the simplicial complexes derived from the combined data and the survey data alone. (See Figure 2(c), which contains the simplicial complex derived from the craniofacial subset of the combined cleaned data, and Figure 2(d), which contains the simplicial complex derived from the complete set of craniofacial data.)
First, the two craniofacial graphs have one connected component, as opposed to the two components of the other graphs. Second, both craniofacial networks had 15 nodes, compared to the 8 nodes of the other graphs, potentially demonstrating greater heterogeneity in craniofacial characteristics. Finally, the connectivity of the craniofacial networks is far higher than that of the first two networks, with the first simplicial complex having 29 edges and the second simplicial complex having 24 edges. As an edge in the K-Mapper algorithm indicates a 60% overlap or more between clusters, we see more evidence that the underlying manifold for craniofacial data is harder to dissect and separate compared to survey data.

Fig. 2: Simplicial complexes derived using the K-Mapper algorithm. Top row: combined survey and craniofacial data gave the same simplicial complex as survey data by itself. Note the simplicial complexes have two connected components, which imply there may be nontrivial connectivity in the space underlying our data. Bottom row: craniofacial data, a subset of the combined survey and craniofacial data, and the complete set of craniofacial data give simplicial complexes with more nodes and higher connectivity.

Singular value decomposition (SVD) is a commonly used matrix factorization method that decomposes a matrix A into the product of three matrices: A = USV^T [29]. SVD can be used as a dimension reduction technique that allows identification of directions of high variance via the left and right singular vectors U and V. The SVD is related to the well-known Principal Component Analysis (PCA): while PCA is a linear projection of data via singular vectors, the SVD offers a full matrix factorization. For the matrix of respondents and responses, the SVD can suggest survey questions that are significant in confirming or discarding a diagnosis of obstructive sleep apnea. It also suggests ways in which presentation may be divergent among distinct groups of children. Here we examine survey data only, discarding the craniofacial data for the moment.

To apply SVD, we scale the survey data using scikit-learn's MinMaxScaler, which scales each feature to the interval [0, 1]. The SVD is sensitive to the scale of variables, and without scaling, the answer to a question like "minutes of night waking" will appear as the feature that exhibits the most variance, because it takes values between 0 and 75 minutes, in contrast with all the 0-1 binary response questions on the surveys.

Visualization of the projection to the first two singular components given by singular value decomposition demonstrates the range of symptoms experienced by participants and illuminates the difficulty that we'll see in using supervised learning methods like decision trees and support vector machines to create accurate classifiers. The projection to the first two singular components in respondent space is in Figure 3(a), with respondents labeled "no OSA" and "OSA", and the projection to symptom space is in Figure 3(b). As demonstrated in this projection, control patients and symptomatic patients are not obviously separable, nor do they cluster cleanly.
This observation foreshadows the poor results we will see from classification algorithms. The projection to the first two singular components in symptom space shows that just a few questions exhibit the largest variance in responses: the frequency with which a child falls asleep spontaneously while playing alone, riding in the car, eating meals, or watching television. The surprise is that while respondents with OSA tend to exhibit more of these symptoms, these symptoms do not cleanly differentiate between children with and without OSA. Pediatric OSA is a complex diagnosis that cannot be easily reduced to the appearance of a subset of symptoms.

Fig. 3:
Projections to symptom space (left) and respondent space (right) as given by singular value decomposition (SVD). The figures underscore the difficult challenge of categorizing patients and prioritizing symptoms.
It is worth comparing the visualizations from singular value decomposition with those coming from kernel PCA, shown in Figure 6.
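The projection described above can be sketched as follows; the survey matrix is synthetic, and we mean-center after scaling so the projection matches the PCA view (the paper does not state whether centering was applied, so that step is an assumption).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(3)
# Synthetic stand-in for the respondents-by-responses survey matrix.
X = rng.integers(0, 8, size=(40, 12)).astype(float)

X_scaled = MinMaxScaler().fit_transform(X)        # each feature to [0, 1]
Xc = X_scaled - X_scaled.mean(axis=0)             # center before the SVD
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

respondents_2d = U[:, :2] * S[:2]   # respondents in the first two components
symptoms_2d = Vt[:2].T              # loading of each question on those components
print(respondents_2d.shape, symptoms_2d.shape)
```

Plotting `respondents_2d` colored by diagnosis gives the respondent-space view, and `symptoms_2d` gives the symptom-space view; questions far from the origin in the latter are the high-variance ones discussed above.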
Because of the high dimensionality of our data and the findings in our data exploration, we apply a variety of machine learning techniques. In this section, we discuss each of the methods used and the justification for applying them to our data sets. The performance results of those classification methods are in Section 5. We aim to classify recruited patients into two categories: no OSA and risk of OSA. In clinical practice, there are four categories for OSA (no OSA, mild OSA, moderate OSA, and severe OSA), but to establish initial diagnostic results, we start with presence or absence of OSA. Further discussion about classification of severity, using all four categories, is in Section 6.
We applied various supervised methods to see how accurately one can infer whether a subject has OSA or does not have OSA. Supervised learning methods applied to the data include Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), Logistic Regression (LR), Decision Trees (DT), Random Forests (RF), Neural Networks (NNET), Support Vector Machines (SVM), and k-Nearest Neighbors (KNN). These techniques are well established, and we provide only a cursory overview of results so that contrast with other methods can be established.

Linear Discriminant Analysis (LDA), proposed by Fisher in 1936, is the most basic and traditional classification method. Although LDA may seem too simple to handle large and complex data, it can give insight into how much improvement more advanced machine learning methods can achieve. We use LDA as a benchmark method. Compared to LDA, Quadratic Discriminant Analysis (QDA) assumes different covariance matrices for different classes, which results in non-linear decision boundaries between classes. Logistic regression (LR) is also a very popular method for classification, especially for binary classification. The R packages MASS [34] and glmnet [12] were used in the calculations.

The decision tree (DT) is a non-parametric method which classifies based on how strongly each individual feature performs in predicting the ultimate diagnosis. A major benefit of DTs is their interpretability, which allows the user to see which variables are most significant in the classification. The DecisionTreeClassifier from the Scikit-learn library for Python was used [23]. A grid search to find the optimal cost-complexity pruning parameter was performed, using the model selection library from Scikit-learn. The optimized parameters for DT and the other machine learning methods can be found in Table 1.

In addition to DT, we used random forests (RF) to select feature importance. Random forests consist of an ensemble of decision trees. At each step of growing the component decision trees, a different set of variables is randomly selected from the whole set of variables. The purpose of this randomization is to grow many trees which are not similar to each other. This allows the algorithm to explore the explanatory-variable space as widely as possible, which often gives a more robust outcome than decision trees. In this paper, we used the default choice for the number of variables selected at each step, i.e., √p, the square root of the total number of variables. The R package randomForest was used in these calculations [17].

Neural networks (NNET) are the basic component of modern deep learning methods. A NNET iterates between forward and backward propagation to update the weights and makes a final prediction with a non-linear function (usually the logit function) of linear combinations of variables. There are a number of parameters that need to be tuned in order to achieve optimal results: the number of hidden layers and the number of hidden units. Usually the tuning parameters are chosen by cross-validation. The R package nnet was used to implement NNET [35].

In Support Vector Machines (SVM), a decision boundary between the two classifications is chosen by solving a convex optimization problem. Here we optimized the penalty parameter C and scaling parameter γ for both linear and exponential kernels using a cross-validation grid search.
SVM has given promising results in some diagnostic contexts, for instance in diagnosing Lyme disease ([10]).

Method   CF Data        Survey Data    Combined Data
DT       α = .          α = .          α = .
RF       √p ≈           √p ≈           √p ≈
SVM      C = e−, γ =    C = e−, γ = .  C = e−, γ = .
k-NN     k =            k =            k =

Table 1:
Optimal parameters for supervised learning methods, as found by cross-validation grid search.
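As a sketch of the cross-validation grid search used to pick these parameters, shown here for the decision tree's cost-complexity pruning parameter; the data and the α grid are assumptions for illustration, not the study's.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary-classification stand-in for the OSA / no-OSA data.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# 5-fold cross-validated grid search over the pruning parameter ccp_alpha.
grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                    {"ccp_alpha": np.linspace(0.0, 0.05, 11)}, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, round(grid.score(X_test, y_test), 2))
```

The same `GridSearchCV` pattern applies to the SVM (grid over C and γ) and KNN (grid over k) tuning described in this section.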
Last, we used k -nearest neighbors classification (KNN). We used uniform weightsas we do not wish to introduce feature importance in a benchmark measure. Weused a cross validation grid search to find the optimal k over values { , , ..., } . The scikit-learn KNeighborsClassifier toolbox for nearest neighbors was used in thesecalculations [23]. Bayesian networks are directed acyclic graph models that represent the joint distri-bution of a set of random variables. The random variables are represented as nodesand the dependencies between them are represented by the directed edges. There area few advantages of Bayesian networks (see more details in [28] and [31]): they are(1) suitable for small and incomplete data sets, (2) combine known knowledge/priorswith data (so one can use an expert’s knowledge if necessary), and (3) take into ac-count subtle relationships between variables, while avoiding a parametric approachthat makes strong assumptions on the data structure. The main disadvantage of thesenetworks, however, is the requirement of discretization of continuous variables.Bayesian classifiers are a particular application of Bayesian networks; they esti-mate the probability of each class given the predictor variables. We apply two basic classifiers, Naïve Bayes (NB) classifier, Tree augmented Bayesian (TAN) classi-fier, and a more recent approach, Semi-Hierarchical Bayesian (SHNB), describedin [20]. NB assumes that all the attributes are independent given the class variable.This classifier can be effective due to the low number of required parameters and thelow computational cost for inference and learning. However, in many applications toreal life data, conditional independence may not be valid. TAN constructs a directedtree among the attributes to incorporate dependencies between attributes. That is,in TAN, each attribute depends on both the class and other attributes. 
We used the R packages bnlearn [25] and gRain [13] to obtain posterior probabilities (inference) and the model (learning) for the NB and TAN classifiers.

As the number of attributes increases, the number of possible structures becomes larger, requiring a large data set to obtain good estimates. To overcome high dimensionality (a large number of attributes), the authors in [38] introduce the Hierarchical Naive Bayes (HNB) classifier, which integrates latent ('hidden') variables. We applied SHNB, a variation of HNB, to our data analysis because the computations are straightforward, with two steps: create latent variables using NB, then use TAN with the latent variables and the remaining attributes. To create latent variables, we calculated conditional mutual information (CMI), a similarity between two variables given the class. We then obtained an undirected graph connecting all the attributes based on the CMI between every pair of variables. Lastly, we created a latent variable from the variables that form a maximal clique. For example, among the 8 variables of craniofacial data, two maximal cliques were chosen with the threshold similarity measure: overjet & overbite, and midface deficiency & lower face height. We created latent variables for overjet & overbite and for midface deficiency & lower face height, termed anterior teeth coupling and maxillomandibular facial proportion, respectively. All four attributes consisted of three levels; see Table 2. Thus, both latent variables consisted of nine categories. We combined the levels "deep bite" and "open bite" in overbite and called it not-normal (similarly, the non-normal category was used for both "increased" and "reverse" in overjet). The SHNB structures were formed by the TAN classifier with the two latent variables and the four remaining attributes that were not involved in cliques.

  OB \ OJ     Increased   Normal   Reverse
  Deep bite          15       27         0
  Normal             15       96         8
  Open bite           3        2         7

  ATC        Normal   Abnormal
  Normal         96         23
  Abnormal       29         25
Table 2: (Left) Original 9 categories (cardinalities) of combined OB and OJ. (Right) A latentvariable, anterior teeth coupling (ATC) with 4 categories.
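The CMI step above can be made concrete as follows. This is a minimal sketch, not the authors' code, and the synthetic overjet/overbite-like variables are invented for illustration; it computes I(X;Y|C) as a class-weighted sum of per-class mutual informations.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def conditional_mutual_info(x, y, c):
    """I(X; Y | C) = sum over classes v of P(C=v) * I(X; Y | C=v)."""
    x, y, c = map(np.asarray, (x, y, c))
    return sum((c == v).mean() * mutual_info_score(x[c == v], y[c == v])
               for v in np.unique(c))

rng = np.random.default_rng(0)
cls = rng.integers(0, 2, 300)                       # hypothetical class labels
overbite = rng.integers(0, 3, 300)                  # 3-level attribute
overjet = (overbite + rng.integers(0, 2, 300)) % 3  # attribute coupled to overbite
noise = rng.integers(0, 3, 300)                     # unrelated attribute

# The coupled pair scores higher CMI, so it would end up in the same clique.
print(conditional_mutual_info(overbite, overjet, cls) >
      conditional_mutual_info(overbite, noise, cls))
```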
We modeled NB, TAN, and SHNB for two additional data sets: (1) all survey questionnaires (a total of 149 variables) and (2) both survey and craniofacial index data (a total of 157 variables). Five continuous variables in the survey questionnaires were discretized into two groups, cut off at the following values: age (8 years old), sleep (9 hours), how long waking lasts at night (12 min),
In addition to the supervised learning methods, Bayesian classifiers, and correlation networks described in the previous sections, we considered manifold learning as an unsupervised learning method for classification.

Manifold learning is an approach that sees data as a sample from some low-dimensional manifold, even if the points themselves are embedded in a high dimension. From this perspective, clustering can be seen as an estimation of the connected components of the underlying manifold. Manifold learning approaches have been shown to produce very accurate clustering results in highly non-linear scenarios [33, 4, 22]. Given the non-linearity of our data embedded in 157 dimensions, we would like to apply manifold learning to cluster the survey respondents into a control and a patient group. However, since there are only 173 useful respondents, and the data themselves have a mixture of continuous and categorical features, we must be careful how this approach is applied and how the results are interpreted.

For this purpose, we explored the application of manifold learning in this new setting of mixed data types. We first discuss the challenges that need to be solved and how we have addressed them (Section 4.3.1). We then discuss four different density-based manifold learning methods that were applied to our data (Section 4.3.2).
As mentioned previously, manifold learning has the potential to be useful for clustering our mixed data. However, the nature of our data raises some questions about how appropriate such a method is for our mixed data scenario. The first challenge is that the theoretical guarantees of these types of methods usually require a large number of points [7, 19, 4]. This is because the accuracy depends on taking limits to a small scale, which for categorical data means that every point will be its own connected component. It is important to mention that although theoretical results do require a large number of points, experimental results have shown that good results are still possible with some methods [33, 22]. We conjecture that this is due to the data being close to a manifold, i.e., a manifold could be a good approximation to the given data. Therefore, we wanted to investigate whether this was the case with our data.

Another issue with having a small sample size is the curse of dimensionality in manifold learning [4, 33]: the higher the dimension of the points to be clustered, the exponentially more data will be needed for accuracy. One way to get around the inaccuracies caused by the curse of dimensionality is to employ a dimension reduction method. While dimension reduction methods for categorical variables exist, we would also like to consider a different approach. It is common in manifold learning to embed the data, raw or projected, in a graph in order to increase efficiency. This step makes use of meaningful similarity measures to construct a representation of the data in graph form. We hope to get a representative graph construction directly from the categorical points by an appropriate choice of distance metric.

We now proceed to describe the manifold learning approaches used to analyze the survey data.
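One way to realize the graph construction described above is a k-nearest-neighbor graph built directly from a categorical-friendly metric. The sketch below uses scikit-learn and SciPy; the synthetic 3-level scores, the Manhattan metric, and k = 5 are all illustrative choices, not the paper's settings.

```python
import numpy as np
from scipy.sparse.csgraph import connected_components
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(1)
# Stand-in for 173 respondents, each with ten 3-level categorical scores.
X = rng.integers(0, 3, size=(173, 10)).astype(float)

# k-NN graph under the Manhattan (city block) metric, a natural choice
# for ordinal score data.
graph = kneighbors_graph(X, n_neighbors=5, metric="manhattan", include_self=False)

# Connected components of the graph act as candidate clusters.
n_components, labels = connected_components(graph, directed=False)
print(n_components, graph.shape)
```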
A well-defined way to use manifold learning for clustering purposes is the density-based approach [6, 4]. In this approach, clusters are defined as high-density areas. This method seems reasonable for our data given some of the patterns in the distance matrices, and thus the density, of our data seen in Figure 4. In this figure, we looked at all the features in the data, and one can see some of the block-like patterns that arise, which correspond to groupings of points. Note that the Kernel Principal Component Analysis (Kernel PCA) projection of the data in Figure 4(a) is for visualization purposes only. We use this projection in this section in order to take into account the non-linear characteristics of our data, something that PCA and SVD are not able to do. However, we are not able to conclude anything from it, since the Kernel PCA used only applies to continuous data. We are able to conjecture that there might be a distinction of densities between the patient class and the control class. That is, the control class seems to be found mainly in the high-density region, while the patient class seems to be spread throughout the space.

To cluster the survey respondents into the "patient" and "control" classes, we applied four different approaches: (1) Density-based spatial clustering of applications with noise (DBSCAN) [11], (2) spectral clustering [36], (3) the Continuous k-Nearest Neighbor approach (CkNN) [4], and (4) Cut-Cluster-Classify (CCC) [33]. The tuning of each method's parameters was done by comparing the clustering F-score (the harmonic mean of precision and sensitivity); i.e., the optimal parameters were those that gave the highest F-score.

DBSCAN is a clustering method that assumes clusters differ in the density within each individual cluster. The method first finds "core points" for each cluster, which are points in the center of the cluster, and then connects close-by points to the clusters. It also labels some points as outliers.
Given that we only want 2 classes, we placed all outliers in the class that gave the largest F-score. One of the two main parameters for this method is min_samples, the minimum number of sample points needed to define a core point. Given that we are dealing with the same number of points in all the experiments, we found min_samples = 2 to give the best F-score. The other parameter is ε, the distance needed to define a neighborhood. This depends heavily on the distance matrix, so we studied the distribution of distance values in each distance matrix, which can be seen in Figure 5. This gave us a good range of values to try, and we chose the one that gave the optimal F-score. The optimal values can be seen in Table 3.

Fig. 4: (a) Kernel PCA projection of all features of the data colored by sample density, and the distance matrices for the data set according to the (b) Euclidean, (c) Correlation, and (d) Manhattan metrics. Darker colors denote lower values and brighter colors denote higher values.

Fig. 5: The distribution of distance values given the different metrics for (a) survey data, (b) craniofacial data, and (c) combined data.

Table 3: Optimal ε values found for the DBSCAN method using the different features and distance metrics.

                  Euclidean   Cosine   Correlation   Manhattan   Pearson correlation
  Survey Data          9.73     0.16          0.13       65.85                  0.13
  CF Data              2.19     0.10          0.92        4.00                  0.18
  Combined Data        6.71     0.14          0.04       45.92                  0.04

The main idea behind spectral clustering is to construct a graph Laplacian matrix and use its eigenvalue decomposition to cluster the data. The classical algorithm starts by constructing a normalized graph Laplacian matrix from similarity measures, calculating a few eigenvectors, and then running a k-means search on these vectors. The only parameter needed is the number of clusters to find, which for all our experiments is 2.

CkNN clustering uses a continuous scale to construct a representative graph of the data. At each scale it finds a clustering, and it then uses persistent homology to choose the best scale. The two parameters it needs are the number of points needed to define a neighborhood and the k value to define the sample density q_k:

    q_k(x) = ||x − x_k||,    (1)

where x_k denotes the k-th nearest neighbor of the point x. The values with the best F-scores are given in Table 4(a).

                  is_neigh   k                       n_samples    k
  Survey Data            3   2       Survey Data           100   25
  CF Data                7   2       CF Data               150    2
  Combined Data          3   2       Combined Data         100    2

  (a) CkNN Optimal Parameters       (b) CCC Optimal Parameters

Table 4: (a) Optimal parameter values found for the CkNN method using the different features. (b) Optimal parameter values found for the Cut-Cluster-Classify method using the different features.
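The DBSCAN tuning described above can be sketched as follows. The synthetic data, the percentile-based choice of ε, and min_samples = 2 follow the spirit of the text, but the percentile rule is our own illustrative stand-in; the paper's ε values come from Table 3.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(2)
X = rng.integers(0, 3, size=(173, 157)).astype(float)  # stand-in score data

# Precomputed Manhattan (city block) distance matrix.
D = squareform(pdist(X, metric="cityblock"))

# Pick eps by inspecting the distance distribution (here: a low percentile).
eps = np.percentile(D[D > 0], 5)

labels = DBSCAN(eps=eps, min_samples=2, metric="precomputed").fit_predict(D)
print(sorted(set(labels)))  # -1 marks DBSCAN's outliers
```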
The Cut-Cluster-Classify method divides the labeling task into three steps. It starts by picking points that pass a density threshold to separate clusters, then clusters those retained points, and finishes by classifying the remaining points. The two parameters it needs are the number of points to sample and the k value to define the sample density q_k of Equation 1. The values with the best F-scores can be found in Table 4(b).

As mentioned in the previous section, one of the most important choices is which distance metric to use. We compared the F-scores of five different metrics for discrete points: Euclidean, Cosine, Correlation, City block (or Manhattan), and 1 minus the absolute value of the Pearson coefficient. The results can be seen in Tables 5-7. Note that the metric giving the best results varies across the clustering methods. We choose to report the quality metrics for the Manhattan distance in Tables 8, 10, and 11, since this distance seems to be consistently good for all the methods. Figure 6 shows some of the labeling results.

  Metric          DBSCAN    Spectral   CkNN      Cut-Cluster-Classify (CCC)
  Euclidean       0.75986   0.76534    0.76259   0.73993
  Cosine          0.75986   0.55319    0.75540   0.75812
  Correlation     0.76190   0.62000    0.75540   0.75986
  Manhattan       0.75986   0.62857    0.76259   0.75540
  Pearson coeff   0.76190   0.75812    0.75540   0.76259

Table 5: Survey Data. F-scores of using different metrics (rows) to construct the distance matrix. The corresponding distance metric is the input for the different clustering methods (columns).
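The F-scores in Tables 5-7 compare a two-class clustering against the ground truth; since cluster labels are arbitrary, the better of the two possible label matchings is kept. A minimal sketch (the class sizes below are invented for illustration):

```python
import numpy as np
from sklearn.metrics import f1_score

def clustering_f_score(truth, pred):
    """F-score (harmonic mean of precision and sensitivity) of a 2-class
    clustering, trying both ways of matching cluster labels to classes."""
    truth, pred = np.asarray(truth), np.asarray(pred)
    return max(f1_score(truth, pred), f1_score(truth, 1 - pred))

truth = np.array([1] * 100 + [0] * 73)   # hypothetical patient/control split
pred = np.array([0] * 95 + [1] * 78)     # clustering output with labels flipped
print(round(clustering_f_score(truth, pred), 3))
```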
Given the hypothesis that the classes differ by sample density, we also estimated the sample density and found a threshold to cluster the data, for comparison with the other methods. The sample density we used is q_k with the 2-norm, and we used Otsu's method [21] to do the thresholding. This method analyzes the distribution of values and finds the modes to set a threshold value.

Table 6: Craniofacial (CF) Data. F-scores of using different metrics (rows) to construct the distance matrix. The corresponding distance metric is the input for the different clustering methods (columns).

Table 7: Combined Data. F-scores of using different metrics (rows) to construct the distance matrix. The corresponding distance metric is the input for the different clustering methods (columns).

  Metric          DBSCAN    Spectral   CkNN      Cut-Cluster-Classify (CCC)
  Euclidean       0.75986   0.76534    0.76259   0.75986
  Cosine          0.75986   0.58537    0.75540   0.74453
  Correlation     0.75986   0.61386    0.75540   0.75986
  Manhattan       0.76534   0.58937    0.76259   0.75986
  Pearson coeff   0.75986   0.75812    0.75540   0.75812
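The threshold-density clustering just described can be sketched as below: estimate q_k as the distance to the k-th nearest neighbor, then split it with Otsu's method. The histogram-based Otsu implementation, the two-blob toy data, and k = 5 are our own illustrative choices, not the paper's.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def otsu_threshold(values, bins=64):
    """Otsu's method: choose the histogram split maximizing between-class variance."""
    hist, edges = np.histogram(values, bins=bins)
    mids = (edges[:-1] + edges[1:]) / 2
    total, weighted_total = hist.sum(), (hist * mids).sum()
    best_var, thresh = -1.0, mids[0]
    w = s = 0.0
    for i in range(bins - 1):
        w += hist[i]
        s += hist[i] * mids[i]
        if w == 0 or w == total:
            continue
        m0, m1 = s / w, (weighted_total - s) / (total - w)
        between_var = w * (total - w) * (m0 - m1) ** 2
        if between_var > best_var:
            best_var, thresh = between_var, mids[i]
    return thresh

rng = np.random.default_rng(3)
# Toy data: a dense "control-like" blob plus a diffuse "patient-like" group.
X = np.vstack([rng.normal(0, 0.3, (120, 2)), rng.normal(0, 2.0, (53, 2))])

k = 5
dist, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
q_k = dist[:, k]                                   # distance to k-th neighbor
labels = (q_k > otsu_threshold(q_k)).astype(int)   # 1 = low-density group
print(labels.sum())
```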
For each method and each data subset, we assess classification success by measuring the accuracy, positive predictive value (PPV), negative predictive value (NPV), sensitivity, and specificity. These measures are standard in the clinical research literature. To assess the stability of each method, we used ten different training/test splits of the cleaned data. All measures for each method are reported in the form of mean ± standard deviation in Tables 8, 10, and 11. (Manifold learning methods had only one measure to report, as they do not have training and test splits; see Section 4.3 for more information.) We note that, when we applied QDA to the combined survey and craniofacial data, some classes were too small to estimate the covariance matrices. Therefore, we did not obtain any results for QDA.

Of all the classification techniques applied to the survey data, random forests (RF) and threshold q_k density clustering performed the most consistently. In addition to having one of the higher accuracy scores (mean 0.77 ±), RF could be a strong source for correctly ruling out OSA as a potential underlying cause of harmful symptoms. Threshold q_k density had high accuracy (0.77457) and high sensitivity (0.86792), but lacked the most in specificity (0.62687), which would make it a possible complement to RF.

In PPV and sensitivity, RF fell short compared to some other methods. The highest PPV score, 0.87 ±, came from k-nearest neighbors, while the highest sensitivity, 0.88 ±, came from naive Bayes. Overall, a combination of RF, threshold q_k density clustering, k-nearest neighbors, and naive Bayes would deliver the best results by all measures. All performance values for techniques as applied to survey data can be found in Table 8.

Table 8: Performance measures of classification methods on survey data: all supervised and unsupervised learning methods applied to the survey data set. Best performances by each metric are bolded. Supervised methods were applied to ten different train/test splits; their performance measures are recorded in the format of mean performance plus/minus standard deviation. Unsupervised learning methods do not take training sets, so only mean performance is reported. Note that the Cut-Cluster-Classify and DBSCAN methods classified all data as not having OSA, while the CkNN method classified all but one subject as having OSA; see Figure 6. We do not consider these results optimal for NPV or sensitivity.
Results: CF Distributions

In comparing the distributions of craniofacial variables, we found that the three variables with the largest difference between patients and controls were Palate Score, Lower Face Height, and Overjet Score. The other variables may have some differences in distribution shapes (see Table 9 and Figures 10 and 11), but overall do not appear as different when comparing control subjects and OSA subjects. This is not to say that all other craniofacial variables should be ignored; rather, clinicians should pay extra attention to Palate Score, Lower Face Height, and Overjet Score when evaluating a patient.
Frequencies and Earth Mover's Distance for Craniofacial Data Measures

  Metric               Group   Score 0   Score 1   Score 2   EMD
  Profile              P       0.7477    0.0       0.2522    0.1243
                       C       0.9342    0.0       0.0658
  Midface Deficiency   P       0.6126    0.3513    0.0360    0.1179
                       C       0.7895    0.1934    0.0132
  Lower Face Height    P       0.5766    0.3604    0.0630    0.1507
                       C       0.8026    0.1579    0.0395
  Lip Strain           P       0.6306    0.2793    0.0900    0.1234
                       C       0.8158    0.1711    0.0132
  Palate               P       0.4775    0.4595    0.0630    0.1817
                       C       0.7500    0.2105    0.0395
  Overjet              P       0.6306    0.0       0.3694    0.1498
                       C       0.8553    0.0       0.1447
  Overbite             P       0.8919    0.0       0.1081    0.0633
                       C       0.9868    0.0       0.0132
  Posterior Bite       P       0.8468    0.0811    0.0721    0.0846
                       C       0.9737    0.0132    0.0132

Table 9: Frequencies of scores 0, 1, and 2 for the patient (P) group and the control (C) group for each craniofacial metric, together with the Earth Mover's Distance between these frequency distributions. According to the data, the craniofacial variables with the largest distribution differences were palate score, lower face height, and overjet. However, as shown in Figure 7, only the palate score played a significant role in classification in the combined data set.
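The distances in Table 9 can be checked with SciPy's one-dimensional Earth Mover's Distance. The table's values appear to correspond to the raw EMD on scores {0, 1, 2} divided by 3 (i.e., scores rescaled by the number of levels); that scaling is our inference from the numbers, not something stated in the text.

```python
from scipy.stats import wasserstein_distance

scores = [0, 1, 2]
# Profile row of Table 9: score frequencies for patients (P) and controls (C).
patient = [0.7477, 0.0, 0.2522]
control = [0.9342, 0.0, 0.0658]

emd = wasserstein_distance(scores, scores, patient, control)
print(round(emd / 3, 4))  # dividing by 3 reproduces the table's 0.1243
```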
Results: Classification with Craniofacial Data

See Table 10 for the results of various classification methods on the craniofacial data set, a subset of the combined data set.

Overall, unlike with the survey data, there was no one dominant method that performed consistently in classification using craniofacial data. In fact, in every measure, a different method achieved the highest score. Support vector machines and SHNB both had a mean accuracy of about 0.72 ±.

Table 10: Performance measures of classification methods on craniofacial data: all supervised and unsupervised learning methods applied to the craniofacial data set. For each metric, the best performance, as considered both by mean and standard deviation, is in bold. Supervised methods were applied to ten different train/test splits; their performance measures are recorded in the format of mean performance plus/minus standard deviation. Unsupervised learning methods do not take training sets, so only mean performance in classification is reported. Note that the Cut-Cluster-Classify method and DBSCAN classified almost all data as having OSA. We do not consider these results optimal for NPV or sensitivity.
We finally analyze the combined survey and craniofacial data to see how the two work together in classification. About 58% of survey respondents whose responses were used in testing or training sets had diagnosed pediatric OSA; this is our no-information rate.

Overall, the performance of each method was similar to its respective performance on the survey data. (See Tables 11 and 8, respectively, for a comparison of the combined data and survey data.) Algorithms applied to the combined data generally outperformed algorithms applied only to craniofacial data.

Random forests, as on the survey data, performed the best in the categories of accuracy, NPV, and specificity. Threshold q_k density clustering performed at about the same level as on the survey data, once again having a high accuracy of 0.78035 and a high sensitivity of 0.87736. Naive Bayes also once again delivered the best sensitivity score, and k-nearest neighbors again yielded the highest PPV score. Interestingly, k-nearest neighbors improved both in average and stability for PPV, up to 0.89 ±. The only craniofacial variable that random forests marked as important was "palate score". This information is especially valuable given that random forests was one of the top performing classification techniques.

Fig. 6: Kernel PCA coordinates of the combined survey and craniofacial data, colored by the resulting labels from each clustering method; the panels show the ground truth, DBSCAN, spectral clustering, CkNN, Cut-Cluster-Classify (CCC), and threshold sample density. In this plot, blue (dark) points correspond to the control group, while the yellow (light) ones correspond to OSA patients.

Table 11: Performance measures of classification methods on combined data: all supervised and unsupervised learning methods applied to the combined data set. Best performances by each metric are bolded. Supervised methods were applied to ten different train/test splits; their performance measures are recorded in the format of mean performance plus/minus standard deviation. Unsupervised learning methods do not take training sets, so there is only one number reported for their performance measures. Note that the Cut-Cluster-Classify method classified all data as not having OSA, while the CkNN and DBSCAN methods classified almost all data as having OSA; see Figure 6. As such, we did not consider those methods to be better or worse in certain categories.

Our results demonstrate that using inexpensive data can still yield strong predictions for a binary classification of not having OSA vs. being at risk for OSA. In particular, random forests, threshold-based density clustering, naïve Bayes classifiers, and k-nearest neighbors performed the best. Of those successful methods, random forests and naïve Bayes are interpretable, in that we can see which variables were prioritized most in sorting. This information is valuable for clinicians in obtaining a differential diagnosis.

Using survey data and craniofacial data, we merely attempted to classify whether a subject was diagnosed with OSA or not. However, this is a simplification of the classification of OSA as mild, strong, or severe depending on the apnea hypopnea index. The next natural step in our work is to expand classification techniques to these four possible outcomes (including no diagnosis of OSA) and identify which measures indicate OSA severity.

Another interesting future direction is to repeat the algorithms but, instead of combining multiple surveys, isolate each survey and run each classification method separately. In doing so, we may lend evidence to which survey is the most effective in diagnosing OSA. We would do this for both the binary OSA classification and the classification by the apnea hypopnea index.

We hope to use this information to inform an algorithm which can aid clinicians in diagnosis and personalized treatment of a child with OSA. This algorithm would be updated as we track children through treatment and follow up on their progress towards healthy sleeping patterns.
Fig. 7: Variable importance from random forests for the combined data ("Variable Importance for the Combined Data"). The left plot ranks variables by their contributions to the mean decrease in accuracy, while the right plot ranks variables by their mean decrease in Gini score. The only craniofacial variable marked as important was "palate score"; all other highly ranked variables were from the surveys.
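A variable-importance ranking of the kind shown in Figure 7 can be sketched as below. This uses scikit-learn's Gini-based feature_importances_ (the paper's plots come from R's randomForest, which also reports mean decrease in accuracy), and the synthetic data with one dominant "palate-like" score is invented for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(4)
# Eight 3-level craniofacial-style scores; only feature 0 drives the class.
X = rng.integers(0, 3, size=(173, 8)).astype(float)
y = (X[:, 0] + rng.normal(0, 0.5, 173) > 1).astype(int)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
ranking = np.argsort(rf.feature_importances_)[::-1]  # most important first
print(ranking[0])  # the dominant feature should rank first
```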
Acknowledgements

We thank The Institute for Computational and Experimental Research in Mathematics (ICERM), Brown University, for hosting the second Women in Data Science and Mathematics workshop (WiSDM 2) in summer 2019. We would also like to thank the other group members, Brenda Praggastis, Kritika Singhal, Melissa Stockman, and Sarah Tymochko. EW is funded by the National Science Foundation Graduate Research Fellowship Program under Grant No. 1644760. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. XW would like to thank the Natural Sciences and Engineering Research Council of Canada (NSERC DG 2019-05917). GH would like to thank the Natural Sciences and Engineering Research Council of Canada (NSERC DG 2016-05167), a seed grant from the Women and Children's Health Research Institute, a Biomedical Research Award from the American Association of Orthodontists Foundation, and the McIntyre Memorial Fund from the School of Dentistry at the University of Alberta.
Fig. 8: Variable importance from random forests for the survey data ("Random Forest without Craniofacial Variables"). The left plot ranks variables by their contributions to the accuracy, while the right plot ranks variables by their decrease in Gini score. The top three most important variables were from the same OSA-18 questionnaire, but questions from each of the three surveys made the list.
Fig. 9: Table illustrating eight of the measurements taken for craniofacial data. A green circle receives a numerical score of 0, a blue square receives a numerical score of 1, and a red triangle receives a numerical score of 2. The ninth score, Dental Tool Score, is the sum of these measurements.
Fig. 10: (Left) The distribution of craniofacial data for the control group; (Right) the distribution for the patient group. Graphs are scaled for overall frequency in the data sets. As shown by the Earth Mover's Distance in Table 9, the distributions that differ most are Palate Score and Lower Face Height.
Fig. 11: Distribution of Dental Tool Score for the control group (top) and patient group (bottom). Graphs are rescaled for frequency with respect to the size of their data sets. The Earth Mover's Distance between these two distributions is 0.0415.
References
1. Mostafa Altalibi, Humam Saltaji, Mary A. Roberts, Michael P. Major, Joanna MacLean, and Paul W. Major. Developing an index for the orthodontic treatment need in paediatric patients with obstructive sleep apnoea: a protocol for a novel communication tool between physicians and orthodontists. British Medical Journal, 2014.
2. Ashok K. Rohra Jr., Catherine A. Demko, Mark G. Hans, Carol Rosen, and Juan Martin Palomo. Sleep disordered breathing in children seeking orthodontic care. American Journal of Orthodontics and Dentofacial Orthopedics, 154(1):65–71, 2018.
3. Albert Batushansky, David Toubiana, and Aaron Fait. Correlation-based network generation, visualization, and analysis as a powerful tool in biological studies: A case study in cancer cell metabolism. BioMed Research International, 2016.
4. Tyrus Berry and Timothy Sauer. Consistent manifold representation for topological data analysis. Foundations of Data Science, 1(1):1, 2019.
5. Carole L. Marcus, Lee Jay Brooks, Kari A. Draper, David Gozal, Ann Carol Halbower, Jacqueline Jones, Michael S. Schechter, Stephen Howard Sheldon, Karen Spruyt, Sally Davidson Ward, Christopher Lehmann, and Richard N. Shiffman. Diagnosis and management of childhood obstructive sleep apnea syndrome. Pediatrics, 130:576–584, 2012.
6. Kamalika Chaudhuri, Sanjoy Dasgupta, Samory Kpotufe, and Ulrike von Luxburg. Consistent procedures for cluster tree estimation and pruning. IEEE Transactions on Information Theory, 60(12):7900–7912, 2014.
7. Ronald R. Coifman and Stéphane Lafon. Diffusion maps. Applied and Computational Harmonic Analysis, 21(1):5–30, 2006.
8. Herbert Edelsbrunner, David Letscher, and Afra Zomorodian. Topological persistence and simplification. Discrete & Computational Geometry, 28:511–533, 2002.
9. Edward O. Bixler, Alexandros N. Vgontzas, Hung-Mo Lin, Duanping Liao, Susan Calhoun, Antonio Vela-Bueno, Fred Fedok, Vukmir Vlasic, and Gavin Graff. Sleep disordered breathing in children in a general population sample: prevalence and risk factors. Sleep, 32:731–736, 2009.
10. H. I. Elshazly, A. M. Elkorany, and A. E. Hassanien. Lymph diseases diagnosis approach based on support vector machines with different kernel functions. In , pages 198–203, Dec 2014.
11. Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, et al. A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD, volume 96, pages 226–231, 1996.
12. Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1–22, 2010.
13. Søren Højsgaard. Graphical independence networks with the gRain package for R. Journal of Statistical Software, Articles, 46(10):1–26, 2012.
14. Rachid Kharoubi, Karim Oualkacha, and Abdallah Mkhardri. The cluster correlation-network support vector machine for high-dimensional binary classification. Journal of Statistical Computation and Simulation, 89, 2019.
15. G. Krishna, C. Tran, J. Yu, and A. H. Tewfik. Speech recognition with no speech or with noisy speech. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1090–1094, May 2019.
16. L. Li, W. Y. Cheng, B. S. Glicksberg, et al. Identification of type 2 diabetes subgroups through topological analysis of patient similarity. Science Translational Medicine, 7(311), 2015.
17. Andy Liaw and Matthew Wiener. Classification and regression by randomForest. R News, 2(3):18–22, 2002.
18. Tristan Millington and Mahesan Niranjan. Partial correlation financial networks. Applied Network Science, 5, 2020.
19. Boaz Nadler, Stéphane Lafon, Ronald R. Coifman, and Ioannis G. Kevrekidis. Diffusion maps, spectral clustering and reaction coordinates of dynamical systems. Applied and Computational Harmonic Analysis, 21(1):113–127, 2006.
20. H. Njah, S. Jamoussi, and W. Mahdi. Semi-hierarchical naïve Bayes classifier. In , pages 1772–1779, July 2016.
21. Nobuyuki Otsu. A threshold selection method from gray-level histograms. IEEE Transactions on Systems, Man, and Cybernetics, 9(1):62–66, 1979.
22. Stefanos Papanikolaou, Michail Tzimas, Andrew C. E. Reid, and Stephen A. Langer. Spatial strain correlations, machine learning, and deformation history in crystal plasticity. Physical Review E, 99(5):053003, 2019.
23. F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
24. S. Kyeong, J. J. Kim, and E. Kim. Novel subgroups of attention-deficit/hyperactivity disorder identified by topological data analysis and their functional network modular organizations. PLoS One, 12(8), 2017.
25. Marco Scutari. Learning Bayesian networks with the bnlearn R package. Journal of Statistical Software, Articles, 35(3):1–22, 2010.
26. Gurjeet Singh, Facundo Memoli, and Gunnar Carlsson. Topological methods for the analysis of high dimensional data sets and 3D object recognition. In M. Botsch, R. Pajarola, B. Chen, and M. Zwicker, editors, Eurographics Symposium on Point-Based Graphics. The Eurographics Association, 2007.
27. Daniel J. Stekhoven and Peter Bühlmann. MissForest—non-parametric missing value imputation for mixed-type data. Bioinformatics, 28(1):112–118, 2011.
28. Luis E. Sucar. Probabilistic Graphical Models: Principles and Applications. Springer, 2015.
29. Lloyd N. Trefethen and David Bau III. Numerical Linear Algebra, volume 50, chapter 4, pages 25–31. SIAM, 1997.
30. Sarah Tymochko, Kritika Singhal, and Giseon Heo. Classifying sleep states using persistent homology and Markov chain: a pilot study. WiSDM Proceedings, 2019.
31. Laura Uusitalo. Advantages and challenges of Bayesian networks in environmental modelling. Ecological Modelling, 203(3):312–318, 2007.
32. Hendrik Jacob van Veen and Nathaniel Saul. KeplerMapper. http://doi.org/10.5281/zenodo.1054444, Jan 2019.
33. Marilyn Y. Vazquez Landrove.
Consistency of Density Based Clustering and its Applicationto Image Segmentation . PhD Dissertation, George Mason University, 4400 University Drive,Fairfax, VA 22030, 8 2018.34. W. N. Venables and B. D. Ripley.
Modern Applied Statistics with S . Springer, New York,fourth edition, 2002. ISBN 0-387-95457-0.35. W. N. Venables and B. D. Ripley.
Modern Applied Statistics with S . Springer, New York,fourth edition, 2002. ISBN 0-387-95457-0.36. Ulrike Von Luxburg. A tutorial on spectral clustering.
Statistics and computing , 17(4):395–416, 2007.37. Siyu Yu, Nanning Zheng, Yongqiang Ma, Hao Wu, and Badong Chen. A novel brain decodingmethod: A correlation network framework for revealing brain connections.
IEEE Transactionson Cognitive and Developmental Systems , 11, March 2019.38. Nevin Lianwen Zhang, Thomas D. Nielsen, and Finn Verner Jensen. Latent variable discoveryin classification models.
Artificial intelligence in medicine , 30 3:283–99, 2004.39. Afra Zomorodian and Gunnar Carlsson. Computing Persistent Homology.