A Bayesian Hierarchical Network for Combining Heterogeneous Data Sources in Medical Diagnoses
Claire Donnat
Department of Statistics, Stanford University [email protected]
Nina Miolane
Department of Statistics, Stanford University [email protected]
Jack Kreindler
Centre for Health and Human Performance [email protected]
Frederick de Saint Pierre Bunbury
Carnegie Institution for Science [email protected]
Abstract
Computer-Aided Diagnosis has shown stellar performance in providing accurate medical diagnoses across multiple testing modalities (medical images, electrophysiological signals, etc.). While this field has typically focused on fully harvesting the signal provided by a single (and generally extremely reliable) modality, fewer efforts have utilized imprecise data lacking reliable ground truth labels. In this unsupervised, noisy setting, the robustification and quantification of the diagnosis uncertainty become paramount, thus posing a new challenge: how can we combine multiple sources of information – often themselves with vastly varying levels of precision and uncertainty – to provide a diagnosis estimate with confidence bounds? Motivated by a concrete application in antibody testing, we devise a Stochastic Expectation-Maximization algorithm that allows the principled integration of heterogeneous, and potentially unreliable, data types. Our Bayesian formalism is essential in (a) flexibly combining these heterogeneous data sources and their corresponding levels of uncertainty, (b) quantifying the degree of confidence associated with a given diagnostic, and (c) dealing with the missing values that typically plague medical data. We quantify the potential of this approach on simulated data, and showcase its practicality by deploying it on a real COVID-19 immunity study.
1 Introduction

Current medical diagnoses are most often based on the combination of several data inputs by medical experts, typically including (i) clinical history, interviews, and physical exams, (ii) laboratory tests, and (iii) electrophysiological signals and medical images. Advances in the machine learning (ML) community have highlighted the potential of ML to contribute to the field of Computer-Aided Diagnosis (CAD), for which we distinguish two main classes of methods.
Single modality analysis.
Most recent efforts in the ML community have focused on analyzing a single medical data source — called a “modality". For instance, the parsing of electronic health records (EHR) for clinical history, patients’ interviews, and physical exams has shown huge potential for the diagnosis of a broad range of diseases, from coronary artery disease to rheumatoid arthritis [1–5]. ML algorithms have been developed to process various types of laboratory results, from urine steroid profiles to gene expression data [6–9]. Others have shown success in automatically processing and classifying medical signals such as electrocardiograms (ECG) [10–13] or electroencephalograms (EEG) [14]. Finally, a large body of work has focused on medical imaging analysis, encompassing a vast number of tasks such as automatic extraction of diagnostic features [15, 16], segmentation of anatomical structures [17–22], or direct diagnosis through image classification [23, 24].
While these works achieve record-breaking diagnostic accuracy, they often rely on supervised learning approaches – requiring the diagnosis ground truth to be available during training – and on the acquisition of large datasets. Furthermore, they focus solely on a single type of data input (EEGs, scans, etc.) – often acquired by clinicians using specialized, highly accurate equipment – and do not harvest the potentially rich and complementary sources of information provided by alternative medical modalities. Yet, with the development of at-home diagnostic tests (lateral flow assays, questionnaire data for disease screening, mobile health apps, etc.), this paradigm shifts, and diagnoses have to be established through the combination of multiple sources of cheaper, yet often noisier and more imprecise data. In light of the uncertainty exhibited by these various inputs, it also becomes indispensable to pair the provision of a diagnosis with a notion of a confidence interval, thereby thoroughly characterizing our state of knowledge given the available data.

Multiple modality integration.
Interestingly, diagnosis studies fusing different data sources seemed to be more prominent in the first years of CAD. Bayesian networks [25] – also called belief networks – have been used as a crucial decision tool for automatic diagnosis. Such networks provide a biologically-grounded and interpretable statistical framework that integrates the probabilities of contracting the disease given a patient’s clinical history, and the probability of developing symptoms or observing positive test results in the presence or absence of the disease. The advantages of these Bayesian networks are that they are (a) fully unsupervised and do not require ground-truth information, and (b) able to provide meaningful results even in low-sample regimes. Bayesian networks have thus been implemented in a variety of contexts to integrate clinical data and laboratory results, and to diagnose conditions ranging from pyloric stenosis to acute appendicitis [26–30]. They have also been used to combine clinical data with medical images, and subsequently applied to assess venous thromboembolism [31], community-acquired pneumonia [32], head injury prognostics [33], or to predict tumors [34–36]. As early as 1994, full integration of clinical data, laboratory results and imaging features was performed to diagnose gallbladder disease [37].

However, contrary to our proposed setting, the uncertainty that these Bayesian networks integrate is typically known and controlled. The medical conditions that they study are well characterized by a set of specific questions and physiological exams (e.g. projectile vomiting, potassium levels and ultrasounds in the case of pyloric stenosis). Not only do these inputs provide a very strong signal for the diagnosis, but the uncertainty arising from the different modalities can often be reliably informed by a test manufacturer, extensive medical research, or prior ML studies (for diagnostic inputs obtained through classification algorithms). The uncertainty of a diagnostic method, however, is not always well specified: whether it be (i) a physician’s assessment of the symptoms associated with a novel or rare disease, (ii) predictions of an ML algorithm whose accuracy has not been fully characterized outside of a curated research environment or reference datasets (such as MNIST or CIFAR), or simply (iii) biological tests whose sensitivity still has to be determined.
Multiple noisy modality integration: a COVID-19 case study.
To motivate this paper and further understand the limitations of these previous approaches, let us consider a particular use case: COVID-19 antibody testing. Lateral flow immunoassay (LFA) antibody tests for COVID-19 are one of the manageable, affordable and easily deployable options to allow at-home testing for large populations and provide assessments of our immunity. Yet, studies have shown that the sensitivity of these tests remains highly variable and highly contingent on the time of testing. The successful deployment of LFAs thus depends on their augmentation with additional data inputs, such as user-specific risk factors and self-reported symptoms. The provision of confidence scores is essential to flag potential false negative or positive tests (requiring re-testing or closer scrutiny) and to assess local prevalence levels – both pivotal for researchers and policymakers in the context of a pandemic.
Contributions and outline.
This applications paper is geared towards the practical integration of noisy sources of information for CAD. Our contribution is two-fold. From a methods perspective, we account for the variability of the inputs by devising a two-level Bayesian hierarchical model. In contrast to existing Bayesian methods for CAD, our model is deeper, and trained using a Stochastic Expectation-Maximization (EM) algorithm [38–40]. The Stochastic EM allows us to overcome the limitations of its non-stochastic counterpart [38], namely (a) its sensitivity to the starting position, (b) its potentially slow convergence rate, and (c) its possible convergence to a saddle point instead of a local maximum. From an applications perspective, we gear this algorithm towards enhancing at-home LFA testing in the context of the COVID-19 pandemic. In particular, we wish to (a) quantify the benefit of multimodal data integration when the diagnostics are uncertain, and (b) show how our method can benefit medical experts or researchers in real life. Section 3 presents the Bayesian model for multimodal integration and the Stochastic Expectation-Maximization algorithm that performs principled and scalable inference. Section 5 presents extensive tests of our model on simulated datasets. Section 6 details the results obtained on the COVID-19 dataset and the impact of our method for affordable and reliable at-home test kits.
2 The COVID-19 Dataset

By way of clarifying the challenges that our algorithm proposes to overcome, we present here the COVID Clearance Remote Recovery & Antibody Monitoring Platform (COV-CLEAR) study, which motivated our approach. The purpose of this study is to track the evolution of the immunity of a cohort of adult participants in the UK with various COVID-19 exposure risks. This paper focuses on a subset of 117 healthcare workers, for which we have both questionnaire information and LFA test results.

Home-testing Immunoassay Data.
Participants are issued packs of home-testing kits with written instructions. These kits identify Immunoglobulin M (IgM) and/or Immunoglobulin G (IgG) specific to SARS-CoV-2 in blood samples using three gold-standard methods: the Abbott ARCHITECT® Anti-SARS-CoV-2 IgG CMIA (CE Marked; PHE Approved), the Elecsys® Anti-SARS-CoV-2 Double Antigen assay (CE Marked; PHE Approved), and the Biopanda Ltd RAPG-COV-019 (version 1) in-vitro LFT IgM/IgG antibody test (CE certified for professional use only). Pictures and typical examples of LFA test results are provided in the supplementary materials. For the LFAs, participants self-report a positive result if IgM and/or IgG are detected by the test in addition to a positive control band. Through the collection of this data, the COV-CLEAR study aims to address questions relating to (a) the quantification of the robustness of the antibody response, and (b) the durability of this response. However, while affordable, we highlight that the LFA tests suffer from vastly varying levels of sensitivity — registering a sensitivity as low as 70% on asymptomatic individuals.
Clinical Data: Questionnaire.
Additionally, participants are asked to answer a questionnaire, ideally before knowing the result of their LFA test. The form consists of questions related to K = 14 exhibited symptoms (fever, cough, runny nose, etc.) and M = 2 subject-specific risk factors (household size and proportion of members with a suspected or confirmed history of COVID-19). The empirical distributions of the symptoms and the risk factors in this cohort are provided in the supplementary materials.

3 Bayesian Model for Multimodal Integration

In this section, we describe our principled integration of noisy diagnostic test results with additional clinical data, such as symptom data and subject-specific risk factors. Our approach applies to general heterogeneous medical data where the outputs are binary. The latter could be either self-reported answers to questionnaires, clinician-reported physiological exams, outputs of a diagnostic based on an image, abnormalities of laboratory results, etc., making this a widespread and general setting. Thus, while we implement and showcase the method for the particular purpose of applying it to antibody testing, it could be relevant to any medical diagnostic with binary inputs.

Denote D the true diagnosis of an individual (healthy/sick), T the outcome of the noisy diagnostic test (positive/negative), S the symptomaticity (symptomatic/asymptomatic), X the symptoms exhibited, and Y the subject-specific risk factors. The underlying assumption is that, given a true diagnosis D, the symptoms X and the diagnostic test outcome T are independent. In other words, the probability of the diagnostic test being a false negative P[T = 0 | D = 1] is independent of the symptoms of a truly infected individual P[X | D = 1]. Similarly, given a diagnosis D, the test outcome T and the exhibited symptoms X are independent of the risk factors Y. We define:
• P(D = 1 | Y) = π_β(Y), the probability of contracting the disease given risk factors Y,
• P[T = 1 | D = 1] = x, the sensitivity of the diagnostic test,
• P[T = 0 | D = 0] = 1 − y, the specificity of the diagnostic test, i.e. y = P[T = 1 | D = 0],
• P(S = 1 | D = 0) = p_0, the probability of being symptomatic whilst not having been infected by that specific disease (the symptoms could be due to another illness, for instance),
• P(S = 1 | D = 1) = p_1, the probability of having been symptomatic upon infection,
• P(X_k = 1 | S = 1, D = 0) = s_0k, the probability of exhibiting symptom k when not infected,
• P(X_k = 1 | S = 1, D = 1) = s_1k, the probability of exhibiting symptom k upon infection.

In the above, the diagnostic test may be any biological test (e.g. LFA antibody tests), or the output of a medical imaging classification algorithm provided by an ML research team.
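To make the generative structure concrete, here is a minimal sampling sketch in Python. It is an illustration, not the authors' code: every identifier is ours, and the optional Gaussian noise σ on the log-odds follows the prior described below.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_subject(Y_i, beta, x, y, p0, p1, s0, s1, sigma=0.0):
    """Draw (D, T, S, X) for one subject under the model of Section 3.

    Y_i    : (M,) risk factors;  beta : (M,) logistic weights
    x      : sensitivity P[T=1|D=1];  y : P[T=1|D=0] (one minus specificity)
    p0, p1 : P(S=1|D=0), P(S=1|D=1)
    s0, s1 : (K,) symptom probabilities given S=1, for D=0 and D=1
    """
    logit = Y_i @ beta + rng.normal(0.0, sigma)   # noisy log-odds of infection
    D = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))
    T = rng.binomial(1, x if D else y)            # noisy diagnostic test
    S = rng.binomial(1, p1 if D else p0)          # symptomaticity switch
    X = S * rng.binomial(1, s1 if D else s0)      # symptoms only fire when S=1
    return D, T, S, X
```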
We have here split our binary variables into two classes according to their variability: (a) the test T, for which we have provisional estimates of the sensitivity and specificity (given by the manufacturer) but about which we are still uncertain, and (b) the symptoms X, which carry additional uncertainty in that none of them is extremely specific to COVID-19, nor is their prevalence amongst COVID cohorts very well established. Values for (x, y), p_0, p_1, s_0, s_1 or β may be published, but challenged by complementary research or field experience, or may be completely unavailable. We thus need to account for these inputs’ variability and for the relative importance of the different risk factors β. We propose using a hierarchical Bayesian network (see Fig. 1) that is consequently deeper than the ones traditionally implemented in the CAD literature [26–30].

The uncertainty in the test sensitivity x and specificity 1 − y is expressed through Beta priors on x and y with parameters (α_x, β_x) and (α_y, β_y), see Fig. 1. Similarly, the uncertainty on the probabilities of symptomaticity and of the different symptoms is expressed via Beta priors with respective parameters (α_{p_0}, β_{p_0}), (α_{p_1}, β_{p_1}), {(α_{s_0k}, β_{s_0k}), (α_{s_1k}, β_{s_1k})}_k, see Fig. 1. When published estimates exist for x, y, p_0, p_1, s_0 and s_1 (e.g. as provided by the LFA test manufacturer), we match the moments of the Beta priors with those of the published distributions. If this information is unavailable, we use the non-informative Beta prior with parameters (0.5, 0.5). In the COVID-19 example, we implement a non-informative prior for the probabilities of symptomaticity and symptoms, as the etiology of the disease remains unknown. These priors are then updated during learning as we aggregate the information from the observed and imputed variables.

Lastly, we model the probability π_β(Y) of contracting the disease with a logistic regression on the M risk factors Y. In the case of the COVID-19 dataset, these include county data, profession, size of household, etc. The coefficients β = {β_m}_m weight the relative importance of each factor Y_m, m = 1...M. We express our uncertainty on the possible impact of a given factor m by introducing a Gaussian prior: β ∼ N(0, σ_β²), see Fig. 1.

Figure 1: Hierarchical Bayesian network integrating the accuracy uncertainty on the data sources, to estimate the true diagnosis D. The index k = 1...K represents symptom k, while m = 1...M represents risk factor m. The shaded cells represent observed variables or known hyper-parameters. The dashed lines represent switches.

4 Inference in the Hierarchical Bayesian Network via Stochastic EM
At the subject level, we seek to compute the posterior distribution of the true diagnosis D_i, informed by the integration of the observed variables T_i, S_i, X_i, Y_i. This posterior yields an estimated diagnosis, together with a credible interval that expresses the confidence associated with our prediction. In the case of COVID-19, this provides individual estimates of each citizen’s immunity that may inform their social behavior as lockdown is eased.

At the global level, we wish to learn: (i) the distributions of the sensitivity x and specificity 1 − y of the diagnostic test as observed in the field; (ii) the distributions of the probabilities p_0, p_1 of being symptomatic, and s_0, s_1 of exhibiting specific symptoms, within a population of interest; and (iii) the distribution of the impact β of the risk factors for contracting the disease. In the context of the COVID-19 pandemic, such aggregated figures are pivotal to understand the dynamics of the disease and implement appropriate crisis policy.

Since the true diagnoses are hidden variables, the Expectation-Maximization algorithm [41] is an appealing method to perform inference in our Bayesian model, and hence to achieve the aforementioned objectives. However, the EM algorithm requires the computation of an expectation over the posterior of the hidden variables, which may be intractable depending on the probability distributions defined in the model. To allow for flexibility in the model’s design within our multimodal data integration framework, we propose to rely on the Stochastic EM algorithm (StEM) [42]. StEM effectively estimates the conditional expectation in the EM using the “Stochastic Imputation Principle", i.e. approximating the expectation by sampling once from the underlying distribution. This method allows us to carry out inference with priors that are not necessarily the Beta distributions implemented in our experiments — thus providing us with additional flexibility in modeling real data. Furthermore, StEM is more robust, being less dependent on the parameters’ initialization than the EM – a definite advantage given our very uncertain framework. Finally, StEM shows better asymptotic behavior: unlike the EM, StEM always leads to maximization of a complete-data log-likelihood in the M-step [43].
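In outline, one StEM pass alternates a stochastic E-step with a MAP M-step. The loop below is our own condensation of the procedure (the helpers e_step and m_step are hypothetical names, elaborated in the next two sketches; a convergence test is omitted for brevity, and rng comes from the earlier sketch):

```python
def stochastic_em(T, S, X, Y, theta0, priors, n_iter=200):
    """Run StEM: impute diagnoses stochastically, then update parameters by MAP."""
    theta = theta0
    for _ in range(n_iter):
        post = e_step(T, S, X, Y, theta)           # P(D_i = 1 | data, theta)
        D_hat = rng.binomial(1, post)              # stochastic imputation of D
        theta = m_step(T, S, X, Y, D_hat, priors)  # MAP updates (Prop. 4.2)
    return theta, post
```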
Stochastic E-step at iteration (j + 1). Starting iteration j + 1, the current estimates of the model's parameters are θ^(j). The stochastic E-step computes the posterior distribution P(D_i | T_i, S_i, X_i, Y_i, θ^(j)) of the hidden variable D_i. We then sample a diagnosis D̂_i from this posterior.

Proposition 4.1. The odds of the posterior of the hidden variable D_i at iteration (j + 1) write:

$$\frac{P(D_i = 1 \mid T_i, S_i, X_i, Y_i, \theta^{(j)})}{P(D_i = 0 \mid T_i, S_i, X_i, Y_i, \theta^{(j)})} = \frac{(x^{(j)})^{T_i}(1-x^{(j)})^{1-T_i} \, (s_1^{(j)})^{S_i X_i}(1-s_1^{(j)})^{S_i(1-X_i)} \, \pi(Y_i, \beta^{(j)}) \, p_1^{S_i}(1-p_1)^{1-S_i}}{(y^{(j)})^{T_i}(1-y^{(j)})^{1-T_i} \, (s_0^{(j)})^{S_i X_i}(1-s_0^{(j)})^{S_i(1-X_i)} \, (1-\pi(Y_i, \beta^{(j)})) \, p_0^{S_i}(1-p_0)^{1-S_i}}. \tag{1}$$
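A vectorized sketch of this posterior computation, working in log-odds for numerical stability (identifiers are ours; theta packs the current parameter estimates):

```python
def e_step(T, S, X, Y, theta):
    """Posterior P(D_i = 1 | T_i, S_i, X_i, Y_i, theta), via the odds of Eq. (1).

    T, S : (n,) binary arrays;  X : (n, K) symptoms;  Y : (n, M) risk factors.
    """
    x, y, p0, p1, s0, s1, beta = theta
    pi = 1.0 / (1.0 + np.exp(-(Y @ beta)))        # pi(Y_i, beta)
    log_odds = (
        T * np.log(x / y) + (1 - T) * np.log((1 - x) / (1 - y))
        + S * (X @ np.log(s1 / s0) + (1 - X) @ np.log((1 - s1) / (1 - s0)))
        + S * np.log(p1 / p0) + (1 - S) * np.log((1 - p1) / (1 - p0))
        + np.log(pi / (1 - pi))
    )
    return 1.0 / (1.0 + np.exp(-log_odds))
```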
M-step at iteration (j + 1). The following proposition gives the updates of the model's parameters performed in the M-step at iteration (j + 1).

Proposition 4.2. The parameter updates write:

$$x^{(j+1)} = \frac{\sum_{i=1}^n T_i \widehat{D}_i + \alpha_x - 1}{\sum_{i=1}^n \widehat{D}_i + (\alpha_x + \beta_x) - 2}, \qquad y^{(j+1)} = \frac{\sum_{i=1}^n T_i (1 - \widehat{D}_i) + \alpha_y - 1}{\sum_{i=1}^n (1 - \widehat{D}_i) + (\alpha_y + \beta_y) - 2},$$

$$p_0^{(j+1)} = \frac{\sum_{i=1}^n (1 - \widehat{D}_i) S_i + \alpha_{p_0} - 1}{\sum_{i=1}^n (1 - \widehat{D}_i) + (\alpha_{p_0} + \beta_{p_0}) - 2}, \qquad p_1^{(j+1)} = \frac{\sum_{i=1}^n \widehat{D}_i S_i + \alpha_{p_1} - 1}{\sum_{i=1}^n \widehat{D}_i + (\alpha_{p_1} + \beta_{p_1}) - 2},$$

$$s_0^{(j+1)} = \frac{\sum_{i:\, S_i = 1} (1 - \widehat{D}_i) X_i + \alpha_{s_0} - 1}{\sum_{i:\, S_i = 1} (1 - \widehat{D}_i) + (\alpha_{s_0} + \beta_{s_0}) - 2}, \qquad s_1^{(j+1)} = \frac{\sum_{i:\, S_i = 1} \widehat{D}_i X_i + \alpha_{s_1} - 1}{\sum_{i:\, S_i = 1} \widehat{D}_i + (\alpha_{s_1} + \beta_{s_1}) - 2},$$

$$\beta^{(j+1)} = \operatorname*{argmin}_\beta \; \sum_{i=1}^n \frac{\| \widehat{D}_i - g(Y_i \beta) \|^2}{\sigma^2} + \frac{\| \beta \|^2}{\sigma_\beta^2}, \tag{2}$$

where the minimization on β is performed through Newton-Raphson descent.
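The closed-form updates are modes of Beta posteriors. A sketch follows, again with our own naming; note that the logistic fit uses the standard penalized logistic log-likelihood as a stand-in for the least-squares form of Eq. (2):

```python
def beta_mode(successes, failures, a, b):
    """MAP of a Beta(successes + a, failures + b) posterior; falls back to
    the posterior mean when the mode is undefined (cf. Appendix B.5)."""
    a_post, b_post = successes + a, failures + b
    mode = (a_post - 1) / (a_post + b_post - 2)
    mean = a_post / (a_post + b_post)
    return np.where((a_post > 1) & (b_post > 1), mode, mean)

def fit_ridge_logistic(Y, D, lam=1.0, n_newton=25):
    """Ridge-penalized logistic fit of D on Y via Newton-Raphson."""
    beta = np.zeros(Y.shape[1])
    for _ in range(n_newton):
        p = 1.0 / (1.0 + np.exp(-(Y @ beta)))
        grad = Y.T @ (p - D) + lam * beta
        hess = (Y * (p * (1 - p))[:, None]).T @ Y + lam * np.eye(Y.shape[1])
        beta -= np.linalg.solve(hess, grad)
    return beta

def m_step(T, S, X, Y, D, priors):
    """MAP updates of Proposition 4.2, given the imputed diagnoses D."""
    x  = beta_mode((T * D).sum(), ((1 - T) * D).sum(), *priors["x"])
    y  = beta_mode((T * (1 - D)).sum(), ((1 - T) * (1 - D)).sum(), *priors["y"])
    p0 = beta_mode(((1 - D) * S).sum(), ((1 - D) * (1 - S)).sum(), *priors["p0"])
    p1 = beta_mode((D * S).sum(), (D * (1 - S)).sum(), *priors["p1"])
    sym = S == 1                        # symptom updates use symptomatic subjects only
    s0 = beta_mode(((1 - D[sym])[:, None] * X[sym]).sum(0),
                   ((1 - D[sym])[:, None] * (1 - X[sym])).sum(0), *priors["s0"])
    s1 = beta_mode((D[sym][:, None] * X[sym]).sum(0),
                   (D[sym][:, None] * (1 - X[sym])).sum(0), *priors["s1"])
    return x, y, p0, p1, s0, s1, fit_ridge_logistic(Y, D)
```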
Figure 2: Performance of the StEM algorithm for n = 300 samples, a fixed noise level σ, and varying levels of specificity and sensitivity. (A) Raw accuracy of the labels imputed via StEM. (B) Difference in accuracy between the StEM-imputed diagnostics and the observed test variable T.

Algorithm Complexity.
Our algorithm relies only on simple sequential updates of the distribution parameters, as highlighted in Prop. B.1 and B.2. Denoting by L the number of such parameters, and by N the number of samples:
• The updates for X and T only involve matrix multiplications of the form D̂^T X, while those for β involve matrix multiplications of size O(MN) — yielding a complexity of O(LN).
• Prior updates rely solely on element-wise operations on the log-odds, with O(N) complexity.
Denoting by B the maximum number of steps, the overall complexity is O(BLN), and thus linear in N. Fig. 4 (A) provides an example of the evolution of the number of steps as the number of samples increases.

5 Experiments on Simulated Data

Since our approach is unsupervised, we begin by validating it on synthetic datasets where the ground truth is known and controlled — thus allowing us to characterize the performance of the algorithm, and to showcase the strength of combining multiple noisy modalities. We assume the same generative model as in Fig. 1 and generate synthetic data for various pairs of values of sensitivity and specificity ranging from 60% to 99%. In each case, we simulate N = 100 datasets of tuples (Y_i, X_i, T_i)_{i=1}^n, for n ∈ {100, 200, 500, 1000} participants. We also vary the noise level σ ∈ {0.1, 0.5, 1.0} for the prior on D in Fig. 1. To mimic our COVID data, we simulate K = 14 symptoms and M = 2 risk factors, for which we randomly select the parameters (all of these parameters are provided with the code in the supplementary materials).

Improving upon the sole test T. Fig. 2 (A) and (B) show the Stochastic EM's raw accuracy and its accuracy improvement over the sole test result T. Note that this difference is always positive, highlighting that our method consistently improves upon the single diagnostic input — even when one input is more reliable than any of the others. As expected, the most substantial improvements upon T are observed when the test specificity and/or sensitivity are low. For instance, for a sensitivity and specificity of 70%, our method provides an 86 (± 2)% accuracy for the diagnosis – or equivalently, a 16% gain over T. We further quantify the algorithm's performance in Table 1 of the supplementary materials, for values of the specificity close to those observed in the field.
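Putting the earlier sketches together, a hypothetical driver for one simulation cell could look as follows; the parameter choices here are illustrative placeholders of our own, not the values used in the paper:

```python
def simulate_cohort(n, sens, spec, K=14, M=2, sigma=0.5):
    """Generate one synthetic cohort under the generative model of Fig. 1."""
    beta_true = rng.normal(size=M)
    p0, p1 = 0.2, 0.8                                  # illustrative symptomaticity rates
    s0 = rng.uniform(0.05, 0.3, size=K)                # symptom rates when not infected
    s1 = rng.uniform(0.2, 0.8, size=K)                 # symptom rates when infected
    Y = rng.normal(size=(n, M))
    draws = [sample_subject(Y[i], beta_true, sens, 1 - spec, p0, p1, s0, s1, sigma)
             for i in range(n)]
    D, T, S, X = (np.array(v) for v in zip(*draws))
    return D, T, S, X, Y

D_true, T, S, X, Y = simulate_cohort(n=300, sens=0.7, spec=0.7)
theta0 = (0.8, 0.2, 0.3, 0.7, np.full(14, 0.2), np.full(14, 0.5), np.zeros(2))
priors = {k: (0.5, 0.5) for k in ["x", "y", "p0", "p1", "s0", "s1"]}  # Jeffreys priors
theta_hat, post = stochastic_em(T, S, X, Y, theta0, priors)
print(((post > 0.5) == D_true).mean(), (T == D_true).mean())  # StEM vs. raw test accuracy
```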
Benchmarking against other models. We now benchmark our algorithm against other approaches:
• Vanilla Classifier: using both the context Y and the symptoms X as inputs, we fit a logistic regression to the test labels T. We choose the regularization parameter using 10-fold cross-validation, and compute confidence intervals for the log-probability of (D | X, T) by bootstrapping (a minimal sketch of this baseline is given at the end of this subsection).
• Data-Agnostic EM: we implement a deterministic version of EM, providing uninformative priors for the parameters (thereby reflecting our absence of knowledge of the truth), which are not updated — i.e., an "uninformed" equivalent of the belief networks found in the literature.
• Data-Informed EM: similar to the Data-Agnostic EM, but choosing the priors (which then remain fixed) based on the empirical data.

Figure 3: Performance of the Stochastic EM for a fixed specificity of 80%, (A) as the number of samples increases (for a fixed σ) and (B) as the variance σ increases (for n = 300).

Figure 4: (A) Number of EM steps until convergence; and (B) accuracy comparison against benchmarks: Stochastic EM (StEM), EM with data-informed priors on the model's parameters, not updated during learning (EM-1), EM with uninformative priors on the model's parameters, not updated during learning (EM-2), and the vanilla logistic regression (Vanilla).

The results are shown in Fig. 4 (B), and further completed in the supplementary materials. We highlight the superiority of our deeper and more adaptive hierarchical model, yielding improvements (for a reasonable pair of specificity and sensitivity values of 80%) of up to 4% over the Data-Informed EM, 8% over the Data-Agnostic one, and 9% over the Vanilla classifier.
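For concreteness, here is one way the vanilla baseline could be set up with scikit-learn; the exact regularization grid and bootstrap details used in the paper may differ:

```python
from sklearn.linear_model import LogisticRegressionCV
import numpy as np

# Vanilla baseline: fit a logistic regression of the noisy test labels T
# on the concatenated risk factors Y and symptoms X, choosing the
# regularization strength by 10-fold cross-validation.
features = np.hstack([Y, X])
vanilla = LogisticRegressionCV(cv=10, max_iter=1000).fit(features, T)
proba = vanilla.predict_proba(features)[:, 1]   # proxy for P(D = 1 | X, Y)
```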
Assessing the robustness of the Stochastic EM.
For a fixed specificity of 80%, Figure 3 shows the accuracy of StEM as the number of samples increases (A) and for different values of σ (B). These axes of variation seem to have little impact on the model's performance, providing evidence that our algorithm is fairly robust, especially with respect to small sample sizes.
Finally, we assess the convergence properties of our algorithm. We examine the distribution of the relative difference between the recovered coefficients and the ground truth (expressed as a percentage of the ground-truth value). The plots, provided in the supplementary materials, show deviations that are within a few percent of the true values of the coefficients – thus highlighting the ability of the model to converge to the ground-truth parameters, and making it relevant from a medical perspective for characterizing the disease's etiology. Illustrations of the behavior of the number of required EM steps are also provided in the supplementary materials.
6 Results on the COVID-19 Dataset

We turn to the real dataset of n = 117 participants described in Section 2. The purpose of this section is to show how our algorithm can be practically applied to process heterogeneous data types and inform participants and researchers alike. At the individual level, we provide each user with (a) the confirmation of the diagnostic, and (b) confidence intervals, to shed more light on the uncertainty associated with the diagnostic. At the global level, we provide policymakers with (c) an aggregated analysis of herd immunity, and associated measures of uncertainty.

Figure 5: Posterior diagnosis distributions for two selected subjects: (A) Subject 111, whose negative questionnaire confirms a negative test, and (B) Subject 108, whose positive questionnaire infirms a negative test. In each panel (A-B): the left plot represents the posterior of the diagnosis given the symptoms and risk factors data reported in the questionnaire, while the right plot is the posterior of the diagnosis given the symptoms, risk factors, and reported result of the diagnostic test. The red dot represents the expectation.

Figure 6: (A) Posteriors of the sensitivity and specificity compared to their priors, for asymptomatic subjects. (B-C) Posteriors of the probability of exhibiting specific symptoms (B: fever; C: cough with sputum), for symptomatic subjects with estimated negative (D = 0) or positive (D = 1) diagnosis.

At the subject level.
We present two examples, where our algorithm either confirms or infirms the result of the test – thereby allowing for the potential flagging of false negatives. The first example (Fig. 5 A) is a user who registers a negative test while being asymptomatic and having a limited number of risk factors. In this case, we expect our model to confirm the result of the test and provide a narrower confidence interval for the probability of immunity – as confirmed by Fig. 5 (A). The second example showcases an instance where the questionnaire and the test disagree. Subject 108 is a user who registers a negative test while exhibiting a wide number of known COVID symptoms (dry cough, shortness of breath, fever), but took the test less than 10 days after his illness. While the posterior does not reclassify the subject's diagnosis, the confidence interval associated with the prediction of immunity reflects the uncertainty associated with this case, and flags it as a potential false negative.
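The paper does not detail how these subject-level credible intervals are computed. One natural construction, and it is only our assumption, propagates parameter uncertainty by drawing parameter tuples from the fitted posteriors (e.g. the updated Beta distributions) and recomputing the subject's diagnosis posterior for each draw:

```python
def diagnosis_credible_interval(t_i, s_i, x_i, y_i, theta_draws, level=0.95):
    """Monte-Carlo credible interval for P(D_i = 1) under parameter uncertainty.

    theta_draws: list of parameter tuples sampled from the fitted posteriors
    (an assumption on our part); x_i: (K,) symptoms, y_i: (M,) risk factors.
    """
    probs = np.array([
        e_step(np.array([t_i]), np.array([s_i]), x_i[None, :], y_i[None, :], th)[0]
        for th in theta_draws
    ])
    lo, hi = np.quantile(probs, [(1 - level) / 2, (1 + level) / 2])
    return probs.mean(), (lo, hi)
```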
At the global level.
This framework allows us to perform principled inference for the disease and the population of interest. For this cohort of n = 117 healthcare workers in the UK, the at-home tests alone provide a raw estimate of the cohort's immunity rate. The posterior distributions of the model's parameters shed light on the actual accuracy of the LFA tests on our population. Fig. 6 (A) compares the posteriors of the sensitivity and specificity to the priors built from values reported by manufacturers, for asymptomatic subjects. Furthermore, our results provide information regarding the most prevalent symptoms for COVID-19. For instance, among symptomatic participants, the probability of exhibiting fever is higher for infected subjects (see Fig. 6 (B)), while cough with sputum does not seem to be associated with COVID-19, being probably the sign of another infection (see Fig. 6 (C)). Similar plots for additional symptoms are provided in the supplementary materials.

7 Conclusion

This applications paper provides a statistical framework for the multimodal integration of noisy diagnostic test results, self-reported symptoms, and demographic variables. Compared to previous approaches, our work is more amenable to the handling of the different inputs' uncertainty. While we have focused here on binary symptoms, this adaptive Bayesian framework, together with the flexibility provided by the Stochastic EM algorithm, paves the way for the integration of other continuous-valued inputs (index of the severity of the disease, pain, etc.) as well as interactions – an extension that we are currently in the process of implementing. The robustness of the Stochastic EM is crucial in ensuring the reliability of the parameters given the medical stakes. Finally, we note that the generative model allows for principled handling of missing data, thus allowing us to make diagnoses even in the presence of incomplete data.
References

[1] Frank E. Harrell, Kerry L. Lee, Robert M. Califf, David B. Pryor, and Robert A. Rosati. Regression modelling strategies for improved prognostic prediction.
Statistics in Medicine, 3(2):143–152, 1984. doi: 10.1002/sim.4780030207.

[2] Imran Kurt, Mevlut Ture, and A. Turhan Kurum. Comparing performances of logistic regression, classification and regression tree, and neural networks for predicting coronary artery disease. Expert Systems with Applications, 34(1):366–374, 2008. doi: 10.1016/j.eswa.2006.09.004.

[3] Robert J. Carroll, Anne E. Eyler, and Joshua C. Denny. Naïve Electronic Health Record phenotype identification for Rheumatoid arthritis. AMIA Annual Symposium Proceedings, 2011:189–196, 2011.

[4] Julia Hippisley-Cox and Carol Coupland. Predicting risk of emergency admission to hospital using primary care data: Derivation and validation of QAdmissions score. BMJ Open, 3(8), 2013. doi: 10.1136/bmjopen-2013-003482.

[5] Fatemeh Rahimian, Gholamreza Salimi-Khorshidi, Amir H. Payberah, Jenny Tran, Roberto Ayala Solares, Francesca Raimondi, Milad Nazarzadeh, Dexter Canoy, and Kazem Rahimi. Predicting the risk of emergency admission with machine learning: Development and validation using linked electronic health records. PLoS Medicine, 15(11):1–18, 2018. doi: 10.1371/journal.pmed.1002695.

[6] Lei Huang and Tong Wu. Novel neural network application for bacterial colony classification. Theoretical Biology and Medical Modelling, 15(1):1–16, 2018. doi: 10.1186/s12976-018-0093-x.

[7] Edmund H. Wilkes, Gill Rumsby, and Gary M. Woodward. Using machine learning to aid the interpretation of urine steroid profiles. Clinical Chemistry, 64(11):1586–1595, 2018. doi: 10.1373/clinchem.2018.292201.

[8] Ferhat Demirci, Pinar Akan, Tuncay Kume, Ali Riza Sisman, Zubeyde Erbayraktar, and Suleyman Sevinc. Artificial neural network approach in laboratory test reporting: Learning algorithms. American Journal of Clinical Pathology, 146(2):227–237, 2016. doi: 10.1093/ajcp/aqw104.

[9] Rupesh Agrahari, Amir Foroushani, T. Roderick Docking, Linda Chang, Gerben Duns, Monika Hudoba, Aly Karsan, and Habil Zare. Applications of Bayesian network models in predicting types of hematological malignancies. Scientific Reports, 8(1):1–12, 2018. doi: 10.1038/s41598-018-24758-5.

[10] U. Rajendra Acharya, Hamido Fujita, Oh Shu Lih, Yuki Hagiwara, Jen Hong Tan, and Muhammad Adam. Automated detection of arrhythmias using different intervals of tachycardia ECG segments with convolutional neural network. Information Sciences, 405:81–90, 2017. doi: 10.1016/j.ins.2017.04.012.

[11] Serkan Kiranyaz, Turker Ince, and Moncef Gabbouj. Real-time patient-specific ECG classification by 1-D convolutional neural networks.
IEEE Transactions on Biomedical Engineering, 63(3):664–675, 2016. doi: 10.1109/TBME.2015.2468589.

[12] Ali Bahrami Rad, Trygve Eftestol, Kjersti Engan, Unai Irusta, Jan Terje Kvaloy, Jo Kramer-Johansen, Lars Wik, and Aggelos K. Katsaggelos. ECG-based classification of resuscitation cardiac rhythms for retrospective data analysis. IEEE Transactions on Biomedical Engineering, 64(10):2411–2418, 2017. doi: 10.1109/TBME.2017.2688380.

[13] Paweł Pławiak. Novel methodology of cardiac health recognition based on ECG signals and evolutionary-neural system. Expert Systems with Applications, 92:334–349, 2018. doi: 10.1016/j.eswa.2017.09.022.

[14] B. Richhariya and M. Tanveer. EEG signal classification using universum support vector machine. Expert Systems with Applications, 106:169–182, 2018. doi: 10.1016/j.eswa.2018.03.053.

[15] Ricardo L. Thomaz, Pedro C. Carneiro, and Ana C. Patrocinio. Feature extraction using convolutional neural network for classifying breast density in mammographic images. Medical Imaging 2017: Computer-Aided Diagnosis, 10134:101342M, 2017. doi: 10.1117/12.2254633.

[16] Yaniv Bar, Idit Diamant, Lior Wolf, Sivan Lieberman, Eli Konen, and Hayit Greenspan. Chest pathology identification using deep feature selection with non-medical training. Computer Methods in Biomechanics and Biomedical Engineering: Imaging and Visualization, 6(3):259–263, 2018. doi: 10.1080/21681163.2016.1138324.

[17] Pim Moeskops, Max A. Viergever, Adrienne M. Mendrik, Linda S. De Vries, Manon J.N.L. Benders, and Ivana Isgum. Automatic segmentation of MR brain images with a convolutional neural network. IEEE Transactions on Medical Imaging, 35(5):1252–1261, 2016. doi: 10.1109/TMI.2016.2548501.

[18] Wen Li, Fucang Jia, and Qingmao Hu. Automatic segmentation of liver tumor in CT images with deep convolutional neural networks. Journal of Computer and Communications, 03(11):146–151, 2015. doi: 10.4236/jcc.2015.311023.

[19] Evan Shelhamer, Jonathan Long, and Trevor Darrell. Fully convolutional networks for semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4):640–651, 2017. doi: 10.1109/TPAMI.2016.2572683.

[20] Fausto Milletari, Nassir Navab, and Seyed Ahmad Ahmadi. V-Net: Fully convolutional neural networks for volumetric medical image segmentation. In Proceedings of the 2016 4th International Conference on 3D Vision (3DV), pages 565–571, 2016. doi: 10.1109/3DV.2016.79.

[21] Jose Dolz, Christian Desrosiers, and Ismail Ben Ayed. 3D fully convolutional networks for subcortical segmentation in MRI: A large-scale study. NeuroImage, 170:456–470, 2018. doi: 10.1016/j.neuroimage.2017.04.039.

[22] Patrick Ferdinand Christ, Mohamed Ezzeldin A. Elshaer, Florian Ettlinger, Sunil Tatavarty, Marc Bickel, Patrick Bilic, Markus Rempfle, Marco Armbruster, Felix Hofmann, Melvin D'Anastasi, Wieland H. Sommer, Seyed-Ahmad Ahmadi, and Bjoern H. Menze. Automatic liver and lesion segmentation in CT using cascaded fully convolutional neural networks and 3D conditional random fields. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2016, 2016. doi: 10.1007/978-3-319-46723-8.

[23] Marios Anthimopoulos, Stergios Christodoulidis, Lukas Ebner, Andreas Christe, and Stavroula Mougiakakou. Lung pattern classification for interstitial lung diseases using a deep convolutional neural network. IEEE Transactions on Medical Imaging, 35(5):1207–1216, 2016. doi: 10.1109/TMI.2016.2535865.

[24] Yuma Miki, Chisako Muramatsu, Tatsuro Hayashi, Xiangrong Zhou, Takeshi Hara, Akitoshi Katsumata, and Hiroshi Fujita. Classification of teeth in cone-beam CT using deep convolutional neural network.
Computers in Biology and Medicine, 80:24–29, 2017. doi: 10.1016/j.compbiomed.2016.11.003.

[25] Radford M. Neal. Probabilistic Inference Using Markov Chain Monte Carlo Methods. Technical Report CRG-TR-93-1, Department of Computer Science, University of Toronto, 1993.

[26] Sonia M. Alvarez, Beverly A. Poelstra, and Randall S. Burd. Evaluation of a Bayesian decision network for diagnosing pyloric stenosis. Journal of Pediatric Surgery, 41(1):155–161, 2006. doi: 10.1016/j.jpedsurg.2005.10.019.

[27] Olivier Gevaert, Frank De Smet, Dirk Timmerman, Yves Moreau, and Bart De Moor. Predicting the prognosis of breast cancer by integrating clinical and microarray data with Bayesian networks. Bioinformatics, 22(14), 2006. doi: 10.1093/bioinformatics/btl230.

[28] S. Sakai, K. Kobayashi, J. Nakamura, S. Toyabe, and Kohei Akazawa. Accuracy in the diagnostic prediction of acute appendicitis based on the Bayesian network model. Methods of Information in Medicine, 46(6):723–726, 2007. doi: 10.3414/ME9066.

[29] J. J. González-López, M. García-Aparicio, D. Sánchez-Ponce, N. Muñoz-Sanz, N. Fernandez-Ledo, P. Beneyto, and M. C. Westcott. Development and validation of a Bayesian network for the differential diagnosis of anterior uveitis. Eye (Basingstoke), 30(6):865–872, 2016. doi: 10.1038/eye.2016.64.

[30] Chaitawat Sangamuang, Peter Haddawy, Viravarn Luvira, Watcharapong Piyaphanee, Sopon Iamsirithaworn, and Saranath Lawpoolsri. Accuracy of dengue clinical diagnosis with and without NS1 antigen rapid test: Comparison between human and Bayesian network model decision. PLoS Neglected Tropical Diseases, 12(6):1–14, 2018. doi: 10.1371/journal.pntd.0006573.

[31] J. A. Kline, A. J. Novobilski, C. Kabrhel, and D. M. Courtney. Derivation and validation of a Bayesian network to predict pretest probability of venous thromboembolism. Annals of Emergency Medicine, 45(3):282–290, 2005.

[32] D. Aronsky and P. J. Haug. Diagnosing community-acquired pneumonia with a Bayesian network. In Proceedings of the AMIA Annual Symposium, pages 632–636, 1998.

[33] G. C. Sakellaropoulos and G. C. Nikiforidis. Development of a Bayesian network for the prognosis of head injuries using graphical model selection techniques. Methods of Information in Medicine, 38(1):37–42, 1999. doi: 10.1055/s-0038-1634146.

[34] Xiao Hui Wang, Bin Zheng, Walter F. Good, Jill L. King, and Yuan Hsiang Chang. Computer-assisted diagnosis of breast cancer using a data-driven Bayesian belief network. International Journal of Medical Informatics, 54(2):115–126, 1999. doi: 10.1016/S1386-5056(98)00174-9.

[35] Charles E. Kahn, Linda M. Roberts, Katherine A. Shaffer, and Peter Haddawy. Construction of a Bayesian network for mammographic diagnosis of breast cancer. Computers in Biology and Medicine, 27(1):19–29, 1997. doi: 10.1016/S0010-4825(96)00039-X.

[36] Sweta Sneha and Rajeev Agrawal. Towards enhanced accuracy in medical diagnostics – A technique utilizing statistical and clinical data analysis in the context of ultrasound images. In Proceedings of the Annual Hawaii International Conference on System Sciences, pages 2408–2415, 2013. doi: 10.1109/HICSS.2013.566.

[37] Peter Haddawy, Charles E. Kahn, and Manonton Butarbutar. A Bayesian network model for radiological diagnosis and procedure selection: Work-up of suspected gallbladder disease.
Medical Physics, 21(7):1185–1192, 1994. doi: 10.1118/1.597400.

[38] Gilles Celeux, Didier Chauveau, and Jean Diebolt. On stochastic versions of the EM algorithm. Technical report, INRIA, 1995.

[39] Gilles Celeux, Didier Chauveau, and Jean Diebolt. On stochastic versions of the EM algorithm. Biometrika, 88(1):281–286, 2001. doi: 10.1093/biomet/88.1.281.

[40] Søren Feodor Nielsen et al. The stochastic EM algorithm: estimation and asymptotic results. Bernoulli, 6(3):457–489, 2000.

[41] Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 39(1):1–38, 1977.

[42] Gilles Celeux and Jean Diebolt. A random imputation principle: the stochastic EM algorithm. Technical report, 1988.

[43] Søren Feodor Nielsen. The stochastic EM algorithm: Estimation and asymptotic results. Bernoulli, 6(3):457–489, 2000. doi: 10.2307/3318671.
A COVID-19 Datasets
This section provides additional details on the COV-CLEAR dataset. Figure 7 shows the sampling distributions of selected symptoms and risk factors in the population of interest – grouped by the result of the lateral flow immunoassay (LFA) test. Figure 8 illustrates the LFA test results as observed by a given subject. Participants self-report a positive result if IgM and/or IgG are detected by the test. The marker “C" stands for “Control".

Figure 7: (A-C) Distributions of selected symptoms (fever, nausea, shortness of breath) in the COV-CLEAR dataset. (D) Distribution of household size, a COVID-19 risk factor, in the COV-CLEAR dataset. (E) Distribution of symptomatic subjects in the COV-CLEAR dataset. In each plot, the distributions are grouped with respect to the result of the LFA test: T = 0 for a negative test, and T = 1 for a positive test.

Figure 8: Left: Example of test result. Right: Illustrations of a subset of possible test results.

B Stochastic Expectation-Maximization
This section presents the derivation of the Stochastic Expectation-Maximization (StEM) algorithm, specifically the proofs of Propositions 4.1 and 4.2 of the main paper. For the convenience of the reader, we recall the Bayesian model of interest.
B.1 Bayesian generative model
Denote D the true diagnosis of an individual (healthy/sick), T the outcome of the noisy diagnostic test (positive/negative), S the symptomaticity (symptomatic/asymptomatic), X the symptoms exhibited, and Y the subject-specific risk factors. The underlying assumption is that, given a true diagnosis D, the symptoms X and the diagnostic test outcome T are independent. In other words, the probability of the diagnostic test being a false negative P[T = 0 | D = 1] is independent of the symptoms of a truly infected individual P[X | D = 1]. Similarly, given a diagnosis D, the test outcome T and the exhibited symptoms X are independent of the risk factors Y. We define:
• P(D = 1 | Y) = π_β(Y), the probability of contracting the disease given risk factors Y,
• P[T = 1 | D = 1] = x, the sensitivity of the diagnostic test,
• P[T = 0 | D = 0] = 1 − y, the specificity of the diagnostic test, i.e. y = P[T = 1 | D = 0],
• P(S = 1 | D = 0) = p_0, the probability of being symptomatic whilst not having been infected by that specific disease (the symptoms could be due to another illness, for instance),
• P(S = 1 | D = 1) = p_1, the probability of having been symptomatic upon infection,
• P(X_k = 1 | S = 1, D = 0) = s_0k, the probability of exhibiting symptom k when not infected,
• P(X_k = 1 | S = 1, D = 1) = s_1k, the probability of exhibiting symptom k upon infection.

The uncertainty in the test sensitivity and specificity is expressed by putting a prior on x and y, which will be updated during training:

$$P[T = 1 \mid D = 1] = x \sim B(\alpha_x, \beta_x), \qquad P[T = 1 \mid D = 0] = y \sim B(\alpha_y, \beta_y). \tag{3}$$

The uncertainty on the symptomaticity, as well as on the appearance of specific symptoms for infected and healthy individuals, is expressed with a prior on p_0, p_1, s_0 and s_1, updated as we aggregate more information:

$$P[S = 1 \mid D = 0] = p_0 \sim B(\alpha_{p_0}, \beta_{p_0}), \qquad P[S = 1 \mid D = 1] = p_1 \sim B(\alpha_{p_1}, \beta_{p_1}),$$
$$P[X = 1 \mid D = 0] = s_0 \sim B(\alpha_{s_0}, \beta_{s_0}), \qquad P[X = 1 \mid D = 1] = s_1 \sim B(\alpha_{s_1}, \beta_{s_1}). \tag{4}$$

Lastly, we model the probability for an individual to contract the disease depending on their risk factors, by modeling the log-odds:

$$\log\Big(\frac{\pi(Y)}{1 - \pi(Y)}\Big) = Y\beta + \epsilon, \quad \text{where } \epsilon \sim N(0, \sigma^2), \tag{5}$$

where the parameters β weight the importance of the different components of the risk factors Y (county data, profession, size of household) in contracting the disease. We express our uncertainty on whether a given factor has an impact by putting a prior on β:

$$\beta \sim N(0, \sigma_\beta^2). \tag{6}$$

In this model:
• ζ = (α_x, β_x, α_y, β_y, α_{p_0}, β_{p_0}, α_{p_1}, β_{p_1}, {α_{s_0k}, β_{s_0k}}_k, {α_{s_1k}, β_{s_1k}}_k, {σ_{β_m}}_m) are hyper-parameters, considered fixed and known,
• θ = (x, y, p_0, p_1, {s_0k, s_1k}_k, {β_m}_m) are the parameters,
• D is a hidden random variable,
• Y, S, X, T are the observed variables.

For simplicity of notation, we write O = (S, X, T) for some of the observed variables. The Bayesian model is represented in plate notation in Figure 9.
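The moment matching used to build informative Beta priors (e.g. from a manufacturer-reported sensitivity) solves the Beta mean/variance equations. A small self-contained sketch follows, with illustrative numbers of our own:

```python
import numpy as np

def beta_from_moments(mean, std):
    """Return (alpha, beta) of the Beta distribution matching a published
    mean and standard deviation, using mean = a/(a+b) and
    var = ab / ((a+b)^2 (a+b+1))."""
    var = std ** 2
    assert 0.0 < var < mean * (1.0 - mean), "variance too large for a Beta"
    nu = mean * (1.0 - mean) / var - 1.0      # nu = alpha + beta
    return mean * nu, (1.0 - mean) * nu

# e.g. a reported sensitivity of 0.84 with standard deviation 0.05
# (illustrative numbers, not the manufacturer's actual figures)
alpha_x, beta_x = beta_from_moments(0.84, 0.05)
```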
Figure 9: Hierarchical Bayesian network integrating the accuracy uncertainty on the data sources, to estimate the true diagnosis D. The index k = 1...K represents symptom k, while m = 1...M represents risk factor m. The shaded cells represent observed variables or known hyper-parameters. The dashed lines represent switches.

Our objective is two-fold. At the patient level, we compute the posterior distribution of the true diagnosis D_i, informed by the integration of the observed variables O_i. This distribution gives us an estimate of the true diagnosis, through Maximum a Posteriori, as well as a credible interval that expresses our confidence in this diagnosis. At the global level, we learn the posterior distributions of the parameters x, y, i.e. the sensitivity and specificity of the test – to be compared to the values given by the providers. We also learn the posterior distributions of the parameters s_0, s_1, i.e. the probabilities of symptoms with or without the disease – to be compared to the values given by medical specialists and the CDC. We also learn the posterior distribution of the parameter β, which weights the impact of each risk factor for contracting the disease.

To fulfil this objective, we perform inference in the Bayesian model described in Figure 9 and in the main paper. Since the D_i are hidden variables, we proceed with the Expectation-Maximization (EM) algorithm, specifically its stochastic version.

B.2 Stochastic Expectation-Maximization
We seek the Maximum a Posteriori (MAP) of the parameters θ. Given the parameters' estimates, we can compute the posterior distributions of the diagnoses D_i. The stochastic EM algorithm allows us to compute jointly the posterior distribution of the parameters and the hidden variables D_i, through an iterative procedure described below.

We want to maximize the posterior distribution of the parameters θ:

$$P(\theta \mid O_1, \ldots, O_n, Y_1, \ldots, Y_n) = \frac{P(O_1, \ldots, O_n \mid \theta, Y_1, \ldots, Y_n) \times P(\theta)}{P(O_1, \ldots, O_n \mid Y_1, \ldots, Y_n)} \propto P(\theta) \times \prod_{i=1}^n P(O_i \mid \theta, Y_i), \tag{7}$$

which translates into maximizing the expression:

$$\begin{aligned}
\ell &= \sum_{i=1}^n \log P(O_i \mid \theta, Y_i) + \log P(\theta) \\
&= \sum_{i=1}^n \log \sum_{d_i = 0, 1} P(O_i, D_i = d_i \mid \theta, Y_i) + \log P(\theta) \\
&= \sum_{i=1}^n \log \sum_{d_i = 0, 1} P(D_i = d_i \mid O_i, \tilde{\theta}, Y_i) \, \frac{P(O_i, D_i = d_i \mid \theta, Y_i)}{P(D_i = d_i \mid O_i, \tilde{\theta}, Y_i)} + \log P(\theta) \\
&\geq \sum_{i=1}^n \sum_{d_i = 0, 1} P(D_i = d_i \mid O_i, \tilde{\theta}, Y_i) \log \frac{P(O_i, D_i = d_i \mid \theta, Y_i)}{P(D_i = d_i \mid O_i, \tilde{\theta}, Y_i)} + \log P(\theta) \\
&= \sum_{i=1}^n \mathbb{E}_{D_i \mid O_i, \tilde{\theta}, Y_i}\Big[\log \frac{P(O_i, D_i \mid \theta, Y_i)}{P(D_i \mid O_i, \tilde{\theta}, Y_i)}\Big] + \log P(\theta). \tag{8}
\end{aligned}$$

In the above computations, the lower bound is obtained using Jensen's inequality. It represents a tangent lower bound of the posterior distribution θ → ℓ(θ) at the given set of parameters θ̃. On the last line, the expectation is taken for D_i distributed according to P(D_i | O_i, θ̃, Y_i). Following the literature on the stochastic EM algorithm, we compute this expectation by replacing it with its Monte-Carlo estimate, sampling one D̂_i according to P(D_i | O_i, θ̃, Y_i). For simplicity of notation, we denote:

$$w_i = P(D_i = \widehat{D}_i \mid O_i, \tilde{\theta}, Y_i). \tag{9}$$

The approximate lower bound of the posterior of the parameters writes, after sampling:

$$\ell \geq \sum_{i=1}^n \log \frac{P(O_i, D_i = \widehat{D}_i \mid \theta, Y_i)}{w_i} + \log P(\theta), \tag{10}$$

and we only need to maximize the following function M in its parameters θ:

$$\theta \to M(\theta) = \sum_{i=1}^n \log P(O_i, D_i = \widehat{D}_i \mid \theta, Y_i) + \log P(\theta). \tag{11}$$

Since D_i = D̂_i is now fixed, we further decompose the right-hand side of the above inequality:

$$\begin{aligned}
M(\theta) &= \sum_{i=1}^n \log P(T_i, S_i, X_i, D_i = \widehat{D}_i \mid \theta, Y_i) + \log P(\theta) \\
&= \sum_{i=1}^n \Big[ \log P(T_i \mid D_i = \widehat{D}_i, \theta, Y_i) + \log P(X_i \mid S_i, D_i = \widehat{D}_i, \theta, Y_i) + \log P(S_i \mid D_i = \widehat{D}_i, \theta, Y_i) \Big] \\
&\quad + \sum_{i=1}^n \log P(D_i = \widehat{D}_i \mid \theta, Y_i) + \log P(\theta), \tag{12}
\end{aligned}$$

using the conditional independence of (S_i, X_i) and T_i given D_i.
Using the dependency structure of the variables, relying on the Bayesian network from Figure 9, we get:

$$\begin{aligned}
M(\theta) = &\sum_{i=1}^n \log P(T_i \mid D_i = \widehat{D}_i, x, y) \\
&+ \sum_{i=1}^n \Big[ \log P(X_i \mid S_i, D_i = \widehat{D}_i, s_0, s_1) + \log P(S_i \mid D_i = \widehat{D}_i, p_0, p_1) \Big] \\
&+ \sum_{i=1}^n \log P(D_i = \widehat{D}_i \mid \beta, Y_i) \\
&+ \log P(x, y) + \log P(p_0, p_1) + \log P(s_0, s_1) + \log P(\beta). \tag{13}
\end{aligned}$$

As a result, we find the maximum a posteriori of (x, y), (p_0, p_1), (s_0, s_1) and β separately, by maximizing respectively:

$$\begin{aligned}
M(x, y) &= \sum_{i=1}^n \log P(T_i \mid D_i = \widehat{D}_i, x, y) + \log P(x, y), \\
M(p_0, p_1) &= \sum_{i=1}^n \log P(S_i \mid D_i = \widehat{D}_i, p_0, p_1) + \log P(p_0, p_1), \\
M(s_0, s_1) &= \sum_{i=1}^n \log P(X_i \mid S_i, D_i = \widehat{D}_i, s_0, s_1) + \log P(s_0, s_1), \\
M(\beta) &= \sum_{i=1}^n \log P(D_i = \widehat{D}_i \mid \beta, Y_i) + \log P(\beta). \tag{14}
\end{aligned}$$

To summarize, at iteration (j + 1) of the stochastic EM algorithm:
• Stochastic E-step: for each patient i,
  – compute the posterior distribution of their diagnosis P(D_i | O_i, θ̃, Y_i),
  – sample D̂_i from the posterior.
• M-step: maximize the approximate lower bound of the parameters' posteriors in θ = (x, y, p_0, p_1, s_0, s_1, β) by maximizing M(x, y), M(p_0, p_1), M(s_0, s_1) and M(β). This updates the parameters to:

$$\theta^{(j+1)} = (x^{(j+1)}, y^{(j+1)}, p_0^{(j+1)}, p_1^{(j+1)}, s_0^{(j+1)}, s_1^{(j+1)}, \beta^{(j+1)}). \tag{15}$$

We iterate until convergence in θ = (x, y, p_0, p_1, s_0, s_1, β). At each iteration, this algorithm maximizes a (stochastic approximation of a) tangent lower bound of the posterior distribution of the parameters; therefore, each iteration increases the posterior distribution of the parameters.

B.3 Auxiliary computation: joint log-likelihood
As an auxiliary computation, we provide the formula for the joint likelihood under our model, which we will use in the E- and M-steps. The joint likelihood writes:

$$\begin{aligned}
P(O, D, \theta \mid Y) &= P(T, S, X, D, x, y, p_0, p_1, s_0, s_1, \beta \mid Y) \\
&= P(T, S, X, x, y, p_0, p_1, s_0, s_1 \mid D, \beta, Y) \times P(D, \beta \mid Y) \\
&= P(T, S, X, x, y, p_0, p_1, s_0, s_1 \mid D) \times P(D \mid Y, \beta) \times P(\beta \mid Y). \tag{16}
\end{aligned}$$

Using the conditional independence of (S, X) and T given D, we write:

$$\begin{aligned}
P(O, D, \theta \mid Y) &= P(T, x, y \mid D) \times P(S, X, p_0, p_1, s_0, s_1 \mid D) \times P(D \mid Y, \beta) \times P(\beta \mid Y) \\
&= P(T, x, y \mid D) \times P(X, s_0, s_1 \mid S, p_0, p_1, D) \times P(S, p_0, p_1 \mid D) \times P(D \mid Y, \beta) \times P(\beta \mid Y) \\
&= P(T, x, y \mid D) \times P(X, s_0, s_1 \mid S, D) \times P(S, p_0, p_1 \mid D) \times P(D \mid Y, \beta) \times P(\beta \mid Y). \tag{17}
\end{aligned}$$

B.3.1 Auxiliary computations using the generative model
We separately compute the probabilities in the above formula, using the probability distributions provided by the generative model. We get, for the term involving the immunoassay test T:

$$\begin{aligned}
P(T, x, y \mid D) &= P(T \mid D, x, y) \times P(x) \times P(y) \\
&= x^{TD} (1-x)^{(1-T)D} \, y^{T(1-D)} (1-y)^{(1-T)(1-D)} \\
&\quad \times B(\alpha_x, \beta_x)^{-1} x^{\alpha_x - 1} (1-x)^{\beta_x - 1} \times B(\alpha_y, \beta_y)^{-1} y^{\alpha_y - 1} (1-y)^{\beta_y - 1}. \tag{18}
\end{aligned}$$

For the term involving the symptoms X, we get:

$$\begin{aligned}
P(X, s_0, s_1 \mid S, D) &= P(X \mid S, D, s_0, s_1) \times P(s_0) \times P(s_1) \\
&= \Big( \delta_{S=0, X=0} + \delta_{S=1} \, s_0^{(1-D)X} (1-s_0)^{(1-D)(1-X)} s_1^{DX} (1-s_1)^{D(1-X)} \Big) \\
&\quad \times B(\alpha_{s_0}, \beta_{s_0})^{-1} s_0^{\alpha_{s_0}-1} (1-s_0)^{\beta_{s_0}-1} \times B(\alpha_{s_1}, \beta_{s_1})^{-1} s_1^{\alpha_{s_1}-1} (1-s_1)^{\beta_{s_1}-1}. \tag{19}
\end{aligned}$$

For the term involving the symptomaticity variable S, we get:

$$\begin{aligned}
P(S, p_0, p_1 \mid D) &= P(S \mid p_0, p_1, D) \times P(p_0) \times P(p_1) \\
&= p_0^{(1-D)S} (1-p_0)^{(1-D)(1-S)} \, p_1^{DS} (1-p_1)^{D(1-S)} \\
&\quad \times B(\alpha_{p_0}, \beta_{p_0})^{-1} p_0^{\alpha_{p_0}-1} (1-p_0)^{\beta_{p_0}-1} \times B(\alpha_{p_1}, \beta_{p_1})^{-1} p_1^{\alpha_{p_1}-1} (1-p_1)^{\beta_{p_1}-1}. \tag{20}
\end{aligned}$$

For the terms involving the diagnosis D and the parameter β of the logistic regression, we get:

$$P(D \mid Y, \beta) = \pi(Y, \beta)^D (1 - \pi(Y, \beta))^{1 - D}, \qquad P(\beta) = n(\beta; \sigma_\beta), \tag{21}$$

where we use the notations:
• π(Y, β) = g(Yβ), with g the sigmoid function from the logistic regression,
• n(β; σ_β), the probability density function of the Gaussian N(0, σ_β²).

B.3.2 Final formula for the joint log-likelihood
We plug the expressions of the probabilities into the joint likelihood, and get:

$$\begin{aligned}
P(O, D, \theta \mid Y) \propto\; & x^{TD}(1-x)^{(1-T)D} \, y^{T(1-D)}(1-y)^{(1-T)(1-D)} \, x^{\alpha_x-1}(1-x)^{\beta_x-1} \, y^{\alpha_y-1}(1-y)^{\beta_y-1} \\
& \times \Big( \delta_{S=0,X=0} + \delta_{S=1}\, s_0^{(1-D)X}(1-s_0)^{(1-D)(1-X)}\, s_1^{DX}(1-s_1)^{D(1-X)} \Big) \\
& \times s_0^{\alpha_{s_0}-1}(1-s_0)^{\beta_{s_0}-1}\, s_1^{\alpha_{s_1}-1}(1-s_1)^{\beta_{s_1}-1} \\
& \times p_0^{(1-D)S}(1-p_0)^{(1-D)(1-S)}\, p_1^{DS}(1-p_1)^{D(1-S)} \, p_0^{\alpha_{p_0}-1}(1-p_0)^{\beta_{p_0}-1}\, p_1^{\alpha_{p_1}-1}(1-p_1)^{\beta_{p_1}-1} \\
& \times \pi(Y,\beta)^D (1-\pi(Y,\beta))^{1-D} \times n(\beta; \sigma_\beta), \tag{22}
\end{aligned}$$

where we omit the normalizing constants of the Beta distributions.

B.4 Stochastic E-step: Compute the posterior of the hidden variables D_i

In the stochastic E-step, we aim to sample D̂_i from the posterior distribution of the hidden variable D_i. Given the current estimate θ^(j) of the parameters, we compute the posterior of the diagnosis D_i for each patient i:

$$P(D_i \mid O_i, \theta^{(j)}, Y_i) \propto P(D_i, O_i, \theta^{(j)} \mid Y_i), \tag{23}$$

where we plug in the expression of the joint likelihood from Equation 22. Omitting the factors that are shared by both formulae below, we have:

$$\begin{aligned}
P(D_i = 1 \mid O_i, \theta^{(j)}, Y_i) &\propto (x^{(j)})^{T_i}(1-x^{(j)})^{1-T_i} \, (s_1^{(j)})^{S_i X_i}(1-s_1^{(j)})^{S_i(1-X_i)} \, \pi(Y_i, \beta^{(j)}) \, p_1^{S_i}(1-p_1)^{1-S_i}, \\
P(D_i = 0 \mid O_i, \theta^{(j)}, Y_i) &\propto (y^{(j)})^{T_i}(1-y^{(j)})^{1-T_i} \, (s_0^{(j)})^{S_i X_i}(1-s_0^{(j)})^{S_i(1-X_i)} \, (1-\pi(Y_i, \beta^{(j)})) \, p_0^{S_i}(1-p_0)^{1-S_i}, \tag{24}
\end{aligned}$$

by identification in the expression of the joint likelihood from Equation 22. This shows the proposition:

Proposition B.1. The odds of the posterior of the hidden variable D_i at iteration (j + 1) write:

$$\frac{P(D_i = 1 \mid T_i, S_i, X_i, Y_i, \theta^{(j)})}{P(D_i = 0 \mid T_i, S_i, X_i, Y_i, \theta^{(j)})} = \frac{(x^{(j)})^{T_i}(1-x^{(j)})^{1-T_i} \, (s_1^{(j)})^{S_i X_i}(1-s_1^{(j)})^{S_i(1-X_i)} \, \pi(Y_i, \beta^{(j)}) \, p_1^{S_i}(1-p_1)^{1-S_i}}{(y^{(j)})^{T_i}(1-y^{(j)})^{1-T_i} \, (s_0^{(j)})^{S_i X_i}(1-s_0^{(j)})^{S_i(1-X_i)} \, (1-\pi(Y_i, \beta^{(j)})) \, p_0^{S_i}(1-p_0)^{1-S_i}}. \tag{25}$$

Following the methodology of the stochastic EM algorithm, we sample from this posterior to obtain D̂_i.

B.5 M-step: Update the model's parameters
B.5 M-step: Update model's parameters

In the M-step, we update the parameters of the generative model by maximizing the lower bound of their posterior distribution. We update each set of parameters separately, following the decomposition of the objective into $M(x, y)$, $M(p_0, p_1)$, $M(s_0, s_1)$, and $M(\beta)$.
B.5.1 Update Sensitivity and Specificity
Given the $\widehat{D}_i$'s, we update $x, y$ by maximizing:
\[
M(x, y) = \sum_{i=1}^n \log P(T_i \mid D_i = \widehat{D}_i, x, y) + \log P(x, y). \tag{26}
\]
We recognize in this expression the logarithm of a product of beta densities in $x$ and in $y$, omitting the normalization constants:
\[
\prod_{i=1}^n P(T_i \mid D_i = \widehat{D}_i, x, y) = \prod_{i=1}^n \Big( x^{T_i \widehat{D}_i}(1-x)^{(1-T_i)\widehat{D}_i} \times y^{T_i(1-\widehat{D}_i)}(1-y)^{(1-T_i)(1-\widehat{D}_i)} \Big)
\]
\[
P(x, y) = x^{\alpha_x-1}(1-x)^{\beta_x-1} \times y^{\alpha_y-1}(1-y)^{\beta_y-1}, \tag{27}
\]
such that $M(x, y)$ writes:
\[
M(x, y) = \log \Big[ B_x\Big( \sum_{i=1}^n T_i \widehat{D}_i + \alpha_x,\; \sum_{i=1}^n (1-T_i)\widehat{D}_i + \beta_x \Big) \Big] + \log \Big[ B_y\Big( \sum_{i=1}^n T_i(1-\widehat{D}_i) + \alpha_y,\; \sum_{i=1}^n (1-T_i)(1-\widehat{D}_i) + \beta_y \Big) \Big]. \tag{28}
\]
The updated sensitivity and specificity, $x^{(j+1)}, y^{(j+1)}$, maximize $M(x, y)$. They are the modes of the beta distributions:
\[
x^{(j+1)} = \arg\max_x B_x\Big( \sum_{i=1}^n T_i \widehat{D}_i + \alpha_x,\; \sum_{i=1}^n (1-T_i)\widehat{D}_i + \beta_x \Big), \qquad
y^{(j+1)} = \arg\max_y B_y\Big( \sum_{i=1}^n T_i(1-\widehat{D}_i) + \alpha_y,\; \sum_{i=1}^n (1-T_i)(1-\widehat{D}_i) + \beta_y \Big). \tag{29}
\]
If $\alpha_x, \beta_x > 1$ and $\alpha_y, \beta_y > 1$, the expressions of the modes are:
\[
x^{(j+1)} = \frac{\sum_{i=1}^n T_i \widehat{D}_i + \alpha_x - 1}{\sum_{i=1}^n \widehat{D}_i + (\alpha_x + \beta_x) - 2}, \qquad
y^{(j+1)} = \frac{\sum_{i=1}^n T_i (1-\widehat{D}_i) + \alpha_y - 1}{\sum_{i=1}^n (1-\widehat{D}_i) + (\alpha_y + \beta_y) - 2}. \tag{30}
\]
The shape parameters of these beta distributions can be smaller than 1 if the sensitivity or the specificity is very close to 1. In this case, we update the parameter using the expectation of the beta distribution instead of its mode.
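A minimal sketch of this conjugate update, assuming the same array conventions as the previous sketch; the helper implements the fallback to the beta mean described above, and applies verbatim to the p- and s-updates of the next two subsections, with the counts of Equations (35) and (41).

\begin{verbatim}
import numpy as np

def beta_mode_or_mean(a, b):
    """Mode (a-1)/(a+b-2) of Beta(a, b) when a, b > 1, else its mean."""
    if a > 1.0 and b > 1.0:
        return (a - 1.0) / (a + b - 2.0)
    return a / (a + b)

def update_x_y(T, D, alpha_x, beta_x, alpha_y, beta_y):
    """Conjugate update of Eq. (30), with x = P(T=1|D=1), y = P(T=1|D=0)."""
    x = beta_mode_or_mean(np.sum(T * D) + alpha_x,
                          np.sum((1 - T) * D) + beta_x)
    y = beta_mode_or_mean(np.sum(T * (1 - D)) + alpha_y,
                          np.sum((1 - T) * (1 - D)) + beta_y)
    return x, y
\end{verbatim}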
B.5.2 Update Probabilities of being symptomatic

Given the $\widehat{D}_i$'s, we update $p_0, p_1$ by maximizing:
\[
M(p_0, p_1) = \sum_{i=1}^n \log P(S_i \mid D_i = \widehat{D}_i, p_0, p_1) + \log P(p_0, p_1). \tag{31}
\]
We recognize in this expression the logarithm of a product of beta densities in $p_0$ and in $p_1$, omitting the normalization constants:
\[
\prod_{i=1}^n P(S_i \mid D_i = \widehat{D}_i, p_0, p_1) = \prod_{i=1}^n \Big( p_0^{(1-\widehat{D}_i)S_i}(1-p_0)^{(1-\widehat{D}_i)(1-S_i)} \times p_1^{\widehat{D}_i S_i}(1-p_1)^{\widehat{D}_i(1-S_i)} \Big)
\]
\[
P(p_0, p_1) = p_0^{\alpha_{p_0}-1}(1-p_0)^{\beta_{p_0}-1} \times p_1^{\alpha_{p_1}-1}(1-p_1)^{\beta_{p_1}-1}, \tag{32}
\]
such that $M(p_0, p_1)$ writes:
\[
M(p_0, p_1) = \log \Big[ B_{p_0}\Big( \sum_{i=1}^n (1-\widehat{D}_i)S_i + \alpha_{p_0},\; \sum_{i=1}^n (1-\widehat{D}_i)(1-S_i) + \beta_{p_0} \Big) \Big] + \log \Big[ B_{p_1}\Big( \sum_{i=1}^n \widehat{D}_i S_i + \alpha_{p_1},\; \sum_{i=1}^n \widehat{D}_i (1-S_i) + \beta_{p_1} \Big) \Big]. \tag{33}
\]
The updated probabilities of being symptomatic, in absence or presence of the disease, $p_0^{(j+1)}, p_1^{(j+1)}$, maximize $M(p_0, p_1)$. They are the modes of the beta distributions:
\[
p_0^{(j+1)} = \arg\max_{p_0} B_{p_0}\Big( \sum_{i=1}^n (1-\widehat{D}_i)S_i + \alpha_{p_0},\; \sum_{i=1}^n (1-\widehat{D}_i)(1-S_i) + \beta_{p_0} \Big), \qquad
p_1^{(j+1)} = \arg\max_{p_1} B_{p_1}\Big( \sum_{i=1}^n \widehat{D}_i S_i + \alpha_{p_1},\; \sum_{i=1}^n \widehat{D}_i(1-S_i) + \beta_{p_1} \Big). \tag{34}
\]
If $\alpha_{p_0}, \beta_{p_0} > 1$ and $\alpha_{p_1}, \beta_{p_1} > 1$, the expressions of the modes are:
\[
p_0^{(j+1)} = \frac{\sum_{i=1}^n (1-\widehat{D}_i)S_i + \alpha_{p_0} - 1}{\sum_{i=1}^n (1-\widehat{D}_i) + (\alpha_{p_0} + \beta_{p_0}) - 2}, \qquad
p_1^{(j+1)} = \frac{\sum_{i=1}^n \widehat{D}_i S_i + \alpha_{p_1} - 1}{\sum_{i=1}^n \widehat{D}_i + (\alpha_{p_1} + \beta_{p_1}) - 2}. \tag{35}
\]
The shape parameters of these beta distributions can be smaller than 1 if the probabilities of being symptomatic are very close to 0 or 1. In this case, we update the parameter using the expectation of the beta distribution instead of its mode.
B.5.3 Update Symptoms Probabilities

Given the $\widehat{D}_i$'s, we update $s_0, s_1$ by maximizing:
\[
M(s_0, s_1) = \sum_{i=1}^n \log P(X_i \mid S_i, D_i = \widehat{D}_i, s_0, s_1) + \log P(s_0, s_1). \tag{36}
\]
We recognize in this expression the logarithm of a product of beta densities in $s_0$ and in $s_1$, omitting the normalization constants:
\[
\prod_{i=1}^n P(X_i \mid S_i, D_i = \widehat{D}_i, s_0, s_1) = \prod_{i=1}^n \Big( \delta_{S_i=0,\,X_i=0} + \delta_{S_i=1}\, s_0^{(1-\widehat{D}_i)X_i}(1-s_0)^{(1-\widehat{D}_i)(1-X_i)}\, s_1^{\widehat{D}_i X_i}(1-s_1)^{\widehat{D}_i(1-X_i)} \Big)
\]
\[
= \prod_{i\,:\,S_i=1} \Big( s_0^{(1-\widehat{D}_i)X_i}(1-s_0)^{(1-\widehat{D}_i)(1-X_i)}\, s_1^{\widehat{D}_i X_i}(1-s_1)^{\widehat{D}_i(1-X_i)} \Big) \tag{37}
\]
and:
\[
P(s_0, s_1) = s_0^{\alpha_{s_0}-1}(1-s_0)^{\beta_{s_0}-1} \times s_1^{\alpha_{s_1}-1}(1-s_1)^{\beta_{s_1}-1}, \tag{38}
\]
such that $M(s_0, s_1)$ writes:
\[
M(s_0, s_1) = \log \Big[ B_{s_0}\Big( \sum_{i: S_i=1} (1-\widehat{D}_i)X_i + \alpha_{s_0},\; \sum_{i: S_i=1} (1-\widehat{D}_i)(1-X_i) + \beta_{s_0} \Big) \Big] + \log \Big[ B_{s_1}\Big( \sum_{i: S_i=1} \widehat{D}_i X_i + \alpha_{s_1},\; \sum_{i: S_i=1} \widehat{D}_i (1-X_i) + \beta_{s_1} \Big) \Big]. \tag{39}
\]
The updated probabilities of exhibiting symptoms, in absence or presence of the disease, $s_0^{(j+1)}, s_1^{(j+1)}$, maximize $M(s_0, s_1)$. They are the modes of the beta distributions:
\[
s_0^{(j+1)} = \arg\max_{s_0} B_{s_0}\Big( \sum_{i: S_i=1} (1-\widehat{D}_i)X_i + \alpha_{s_0},\; \sum_{i: S_i=1} (1-\widehat{D}_i)(1-X_i) + \beta_{s_0} \Big), \qquad
s_1^{(j+1)} = \arg\max_{s_1} B_{s_1}\Big( \sum_{i: S_i=1} \widehat{D}_i X_i + \alpha_{s_1},\; \sum_{i: S_i=1} \widehat{D}_i (1-X_i) + \beta_{s_1} \Big). \tag{40}
\]
If $\alpha_{s_0}, \beta_{s_0} > 1$ and $\alpha_{s_1}, \beta_{s_1} > 1$, the expressions of the modes are:
\[
s_0^{(j+1)} = \frac{\sum_{i: S_i=1} (1-\widehat{D}_i)X_i + \alpha_{s_0} - 1}{\sum_{i: S_i=1} (1-\widehat{D}_i) + (\alpha_{s_0} + \beta_{s_0}) - 2}, \qquad
s_1^{(j+1)} = \frac{\sum_{i: S_i=1} \widehat{D}_i X_i + \alpha_{s_1} - 1}{\sum_{i: S_i=1} \widehat{D}_i + (\alpha_{s_1} + \beta_{s_1}) - 2}. \tag{41}
\]
The shape parameters of these beta distributions can be smaller than 1 if the probabilities of symptoms are very close to 1. In this case, we update the parameter using the expectation of the beta distribution instead of its mode.

B.5.4 Update coefficients of the risk factors $\beta$

Given the $\widehat{D}_i$'s, we update $\beta$ by maximizing:
\[
M(\beta) = \sum_{i=1}^n \log P(D_i = \widehat{D}_i \mid \beta, Y_i) + \log P(\beta) = -\sum_{i=1}^n \frac{\|\widehat{D}_i - g(Y_i \beta)\|^2}{2\sigma^2} - \frac{\|\beta\|^2}{2\sigma_\beta^2}, \tag{42}
\]
where we omit the normalization constants of the Gaussian distributions. We recognize the loss function of a penalized logistic regression, which we optimize using stochastic gradient descent with update rule:
\[
\beta \leftarrow \beta - \gamma \Big[ \sum_i \big(g(Y_i \beta) - \widehat{D}_i\big)\, Y_i + \frac{\beta}{\sigma_\beta^2} \Big]. \tag{43}
\]
This gives the update of $\beta$.
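A sketch of the corresponding gradient loop; the update rule itself is Equation (43), while the full-batch iteration, the step count, and the names gamma and sigma_beta are our assumptions.

\begin{verbatim}
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def update_beta(Y, D, beta, gamma=0.01, sigma_beta=1.0, n_steps=100):
    """Iterate the update rule of Eq. (43) on the full batch."""
    for _ in range(n_steps):
        grad = Y.T @ (sigmoid(Y @ beta) - D) + beta / sigma_beta**2
        beta = beta - gamma * grad
    return beta
\end{verbatim}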
B.5.5 Summary of the updates

The following proposition summarizes the parameter updates.
Proposition B.2.
The parameter updates write:
\[
x^{(j+1)} = \frac{\sum_{i=1}^n T_i \widehat{D}_i + \alpha_x - 1}{\sum_{i=1}^n \widehat{D}_i + (\alpha_x + \beta_x) - 2}, \qquad
y^{(j+1)} = \frac{\sum_{i=1}^n T_i (1-\widehat{D}_i) + \alpha_y - 1}{\sum_{i=1}^n (1-\widehat{D}_i) + (\alpha_y + \beta_y) - 2},
\]
\[
p_0^{(j+1)} = \frac{\sum_{i=1}^n (1-\widehat{D}_i) S_i + \alpha_{p_0} - 1}{\sum_{i=1}^n (1-\widehat{D}_i) + (\alpha_{p_0} + \beta_{p_0}) - 2}, \qquad
p_1^{(j+1)} = \frac{\sum_{i=1}^n \widehat{D}_i S_i + \alpha_{p_1} - 1}{\sum_{i=1}^n \widehat{D}_i + (\alpha_{p_1} + \beta_{p_1}) - 2},
\]
\[
s_0^{(j+1)} = \frac{\sum_{i: S_i=1} (1-\widehat{D}_i) X_i + \alpha_{s_0} - 1}{\sum_{i: S_i=1} (1-\widehat{D}_i) + (\alpha_{s_0} + \beta_{s_0}) - 2}, \qquad
s_1^{(j+1)} = \frac{\sum_{i: S_i=1} \widehat{D}_i X_i + \alpha_{s_1} - 1}{\sum_{i: S_i=1} \widehat{D}_i + (\alpha_{s_1} + \beta_{s_1}) - 2},
\]
\[
\beta^{(j+1)} = \arg\min_\beta \sum_{i=1}^n \frac{\|\widehat{D}_i - g(Y_i \beta)\|^2}{2\sigma^2} + \frac{\|\beta\|^2}{2\sigma_\beta^2}, \tag{44}
\]
where the minimization on $\beta$ is performed through stochastic gradient descent.

C Stochastic EM with missing T: truncated Bayesian network

We apply the stochastic EM algorithm in the Bayesian model truncated at $T$. In this model:
• $\zeta = \big( \alpha_{p_0}, \beta_{p_0}, \alpha_{p_1}, \beta_{p_1}, \{\alpha_{s_{0k}}, \beta_{s_{0k}}\}_k, \{\alpha_{s_{1k}}, \beta_{s_{1k}}\}_k, \{\sigma_{\beta_m}\}_m \big)$ are the hyper-parameters, considered fixed and known,
• $\theta = \big( p_0, p_1, \{s_{0k}, s_{1k}\}_k, \{\beta_m\}_m \big)$ are the parameters,
• $D$ is a hidden random variable,
• $Y, S, X$ are the observed variables.
We now write $O = (S, X)$. We still wish to maximize the expression:
\[
\ell = \sum_{i=1}^n \mathbb{E}_{D_i \mid O_i, \tilde{\theta}, Y_i} \Big[ \log \frac{P(O_i, D_i = d_i \mid \theta, Y_i)}{P(D_i = d_i \mid O_i, \tilde{\theta}, Y_i)} \Big] + \log P(\theta), \tag{45}
\]
under the new notations stated above. The approximate lower bound of the posterior of the parameters still writes, after sampling one $\widehat{D}_i$ according to $P(D_i \mid O_i, \tilde{\theta}, Y_i)$:
\[
\ell \geq \sum_{i=1}^n \log \frac{P(O_i, D_i = \widehat{D}_i \mid \theta, Y_i)}{w_i} + \log P(\theta), \tag{46}
\]
using the notation $w_i = P(D_i = \widehat{D}_i \mid O_i, \tilde{\theta}, Y_i)$. Again, we only need to maximize the following function $M$ in the parameters $\theta$:
\[
\theta \to M(\theta) = \sum_{i=1}^n \log P(O_i, D_i = \widehat{D}_i \mid \theta, Y_i) + \log P(\theta). \tag{47}
\]
Since $D_i = \widehat{D}_i$ is now fixed, we further decompose the right-hand side of the above inequality:
\[
M(\theta) = \sum_{i=1}^n \log P(S_i, X_i, D_i = \widehat{D}_i \mid \theta, Y_i) + \log P(\theta) = M(p_0, p_1) + M(s_0, s_1) + M(\beta), \tag{48}
\]
where:
\[
M(p_0, p_1) = \sum_{i=1}^n \log P(S_i \mid D_i = \widehat{D}_i, p_0, p_1) + \log P(p_0, p_1),
\]
\[
M(s_0, s_1) = \sum_{i=1}^n \log P(X_i \mid S_i, D_i = \widehat{D}_i, s_0, s_1) + \log P(s_0, s_1),
\]
\[
M(\beta) = \sum_{i=1}^n \log P(D_i = \widehat{D}_i \mid \beta, Y_i) + \log P(\beta), \tag{49}
\]
are the same functions as in the case with observed $T_i$. The only difference is that $\widehat{D}_i$ is sampled from $P(D_i \mid O_i, \tilde{\theta}, Y_i)$, where $O_i = (S_i, X_i)$ does not contain $T_i$.
C.1 Auxiliary computation: log-likelihood in the truncated Bayesian model

The joint likelihood in the truncated Bayesian model writes:
\[
P(O, D, \theta \mid Y) = P(S, X, D, p_0, p_1, s_0, s_1, \beta \mid Y) = P(S, X, p_0, p_1, s_0, s_1 \mid D) \times P(D \mid Y, \beta) \times P(\beta \mid Y)
\]
\[
= P(X, s_0, s_1 \mid S, D) \times P(S, p_0, p_1 \mid D) \times P(D \mid Y, \beta) \times P(\beta \mid Y), \tag{50}
\]
which gives the same result as in the non-truncated case, but without the term $P(T, x, y \mid D)$. Using the computations from subsection B.3.1, we plug the expressions of the probabilities into the joint likelihood and get:
\[
P(O, D, \theta \mid Y) \propto 0^{(1-S)X}\, 1^{(1-S)(1-X)}\, s_0^{S(1-D)X}(1-s_0)^{S(1-D)(1-X)}\, s_1^{SDX}(1-s_1)^{SD(1-X)} \times s_0^{\alpha_{s_0}-1}(1-s_0)^{\beta_{s_0}-1}\, s_1^{\alpha_{s_1}-1}(1-s_1)^{\beta_{s_1}-1}
\]
\[
\times\; p_0^{(1-D)S}(1-p_0)^{(1-D)(1-S)} \times p_1^{DS}(1-p_1)^{D(1-S)} \times p_0^{\alpha_{p_0}-1}(1-p_0)^{\beta_{p_0}-1}\, p_1^{\alpha_{p_1}-1}(1-p_1)^{\beta_{p_1}-1}
\]
\[
\times\; \pi(Y, \beta)^D (1 - \pi(Y, \beta))^{1-D} \times n(\beta; \sigma_\beta), \tag{51}
\]
where we omit the normalizing constants of the beta distributions.

C.2 Stochastic E-step in the truncated model: Compute posterior of the hidden variables $D_i$

In the stochastic E-step, we aim to sample $\widehat{D}_i$ from the posterior distribution of the hidden variable $D_i$. Given the current estimate $\theta^{(j)}$ of the parameters, we compute the posterior of the diagnosis $D_i$ for each patient $i$:
\[
P(D_i \mid O_i, \theta^{(j)}, Y_i) = P(D_i \mid S_i, X_i, \theta^{(j)}, Y_i) \propto P(D_i, S_i, X_i, \theta^{(j)} \mid Y_i), \tag{52}
\]
where we plug in the expression of the joint likelihood of the truncated model. Omitting the factors shared by the two probabilities below, we have:
\[
P(D_i = 1 \mid O_i, \theta^{(j)}, Y_i) \propto (s_1^{(j)})^{S_i X_i}(1-s_1^{(j)})^{S_i(1-X_i)} \times \pi(Y_i, \beta^{(j)}) \times (p_1^{(j)})^{S_i}(1-p_1^{(j)})^{1-S_i}
\]
\[
P(D_i = 0 \mid O_i, \theta^{(j)}, Y_i) \propto (s_0^{(j)})^{S_i X_i}(1-s_0^{(j)})^{S_i(1-X_i)} \times (1 - \pi(Y_i, \beta^{(j)})) \times (p_0^{(j)})^{S_i}(1-p_0^{(j)})^{1-S_i}, \tag{53}
\]
by identification.

D Stochastic EM with missing T: Missing and hidden variables

We derive the stochastic EM algorithm in the full Bayesian model, taking into account the missing variables $T_i$ (missing for $i = r+1, \ldots, n$) and the hidden variables $D_i$. We want to maximize the posterior distribution of the parameters $\theta$:
\[
P(\theta \mid O_1, \ldots, O_r, O^M_{r+1}, \ldots, O^M_n, Y_1, \ldots, Y_n) = \frac{P(O_1, \ldots, O_r, O^M_{r+1}, \ldots, O^M_n \mid \theta, Y_1, \ldots, Y_n) \times P(\theta)}{P(O_1, \ldots, O_r, O^M_{r+1}, \ldots, O^M_n \mid Y_1, \ldots, Y_n)}
\]
\[
\propto \prod_{i=1}^r P(O_i \mid \theta, Y_i) \times \prod_{i=r+1}^n P(O^M_i \mid \theta, Y_i) \times P(\theta), \tag{54}
\]
where we write $O_i = (S_i, X_i, T_i)$ when $T_i$ is available and $O^M_i = (S_i, X_i)$ when $T_i$ is missing. This translates into maximizing the expression:
\[
\ell = \sum_{i=1}^r \log P(O_i \mid \theta, Y_i) + \sum_{i=r+1}^n \log P(O^M_i \mid \theta, Y_i) + \log P(\theta)
\]
\[
\geq \sum_{i=1}^r \mathbb{E}_{D_i \mid O_i, \tilde{\theta}, Y_i} \Big[ \log \frac{P(O_i, D_i = d_i \mid \theta, Y_i)}{P(D_i = d_i \mid O_i, \tilde{\theta}, Y_i)} \Big] + \sum_{i=r+1}^n \mathbb{E}_{D_i, T_i \mid O^M_i, \tilde{\theta}, Y_i} \Big[ \log \frac{P(O^M_i, D_i = d_i, T_i = t_i \mid \theta, Y_i)}{P(D_i = d_i, T_i = t_i \mid O^M_i, \tilde{\theta}, Y_i)} \Big] + \log P(\theta), \tag{55}
\]
where we use Jensen's inequality to compute the tangent lower bounds of $\ell_D = \sum_{i=1}^r \log P(O_i \mid \theta, Y_i)$ and $\ell_{T,D} = \sum_{i=r+1}^n \log P(O^M_i \mid \theta, Y_i)$ independently.
This is still a valid lower bound of $\ell$ at $\tilde{\theta}$, as the sum of the tangent lower bounds is a tangent lower bound of the sum. After sampling, we need to maximize the following function of $\theta$:
\[
\theta \to M^{MIS}(\theta) = \sum_{i=1}^r \log P(O_i, D_i = \widehat{D}_i \mid \theta, Y_i) + \sum_{i=r+1}^n \log P(O^M_i, D_i = \widehat{D}_i, T_i = \widehat{T}_i \mid \theta, Y_i) + \log P(\theta)
\]
\[
= M(\theta, n = r) + \sum_{i=r+1}^n \log P(O^M_i, D_i = \widehat{D}_i, T_i = \widehat{T}_i \mid \theta, Y_i), \tag{56}
\]
where $M(\theta, n = r)$ is the cost function of the case without missing $T$. We compute the second term of the lower bound, which we denote $M^{NEW}$:
\[
M^{NEW}(\theta) = \sum_{i=r+1}^n \log P(O^M_i, D_i = \widehat{D}_i, T_i = \widehat{T}_i \mid \theta, Y_i) = \sum_{i=r+1}^n \log P(S_i, X_i, D_i = \widehat{D}_i, T_i = \widehat{T}_i \mid \theta, Y_i)
\]
\[
= \sum_{i=r+1}^n \Big[ \log P(T_i = \widehat{T}_i \mid D_i = \widehat{D}_i, \theta, Y_i) + \log P(X_i \mid S_i, D_i = \widehat{D}_i, \theta, Y_i) + \log P(S_i \mid D_i = \widehat{D}_i, \theta, Y_i) + \log P(D_i = \widehat{D}_i \mid \theta, Y_i) \Big], \tag{57}
\]
which is $M(\theta)$ restricted to $i = r+1, \ldots, n$, with $T_i$ replaced by $\widehat{T}_i$. Therefore, $M^{MIS}(\theta)$ is $M(\theta)$ where the missing $T_i$'s have been replaced by their imputed values. We proceed with the stochastic EM algorithm as follows:
• Stochastic E-step: sample the hidden and missing variables, which amounts to:
  – for $i = 1, \ldots, r$: sampling $\widehat{D}_i$ from $P(D_i \mid O_i, \tilde{\theta}, Y_i)$,
  – for $i = r+1, \ldots, n$: sampling $(\widehat{D}_i, \widehat{T}_i)$ from $P(D_i, T_i \mid O^M_i, \tilde{\theta}, Y_i)$,
• M-step: maximize the lower bound in $\theta$.
D.1 Stochastic E-step

For $i = 1, \ldots, r$, sampling $\widehat{D}_i$ is performed as previously. We focus on sampling $(\widehat{D}_i, \widehat{T}_i)$ from $P(D_i, T_i \mid O^M_i, \tilde{\theta}, Y_i)$. We compute the posterior of interest:
\[
P(D_i, T_i \mid O^M_i, \tilde{\theta}, Y_i) \propto P(D_i, T_i, O^M_i \mid \tilde{\theta}, Y_i). \tag{58}
\]
We get:
\[
P(D_i = 0, T_i = 0 \mid O^M_i, \tilde{\theta}, Y_i) \propto (1 - y^{(j)}) \times (s_0^{(j)})^{S_i X_i}(1-s_0^{(j)})^{S_i(1-X_i)} \times (1 - \pi(Y_i, \beta^{(j)})) \times (p_0^{(j)})^{S_i}(1-p_0^{(j)})^{1-S_i}
\]
\[
P(D_i = 0, T_i = 1 \mid O^M_i, \tilde{\theta}, Y_i) \propto y^{(j)} \times (s_0^{(j)})^{S_i X_i}(1-s_0^{(j)})^{S_i(1-X_i)} \times (1 - \pi(Y_i, \beta^{(j)})) \times (p_0^{(j)})^{S_i}(1-p_0^{(j)})^{1-S_i}
\]
\[
P(D_i = 1, T_i = 0 \mid O^M_i, \tilde{\theta}, Y_i) \propto (1 - x^{(j)}) \times (s_1^{(j)})^{S_i X_i}(1-s_1^{(j)})^{S_i(1-X_i)} \times \pi(Y_i, \beta^{(j)}) \times (p_1^{(j)})^{S_i}(1-p_1^{(j)})^{1-S_i}
\]
\[
P(D_i = 1, T_i = 1 \mid O^M_i, \tilde{\theta}, Y_i) \propto x^{(j)} \times (s_1^{(j)})^{S_i X_i}(1-s_1^{(j)})^{S_i(1-X_i)} \times \pi(Y_i, \beta^{(j)}) \times (p_1^{(j)})^{S_i}(1-p_1^{(j)})^{1-S_i}. \tag{59}
\]
These allow us to sample $(\widehat{D}_i, \widehat{T}_i)$ from this posterior, for $i = r+1, \ldots, n$.
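A sketch of this joint draw for one patient with a missing test, under the same naming assumptions as the previous sketches; the enumeration of the four $(D_i, T_i)$ pairs follows Equation (59).

\begin{verbatim}
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample_D_T(S_i, X_i, Y_i, x, y, s0, s1, p0, p1, beta, rng):
    """Draw (D_i, T_i) from the four-point posterior of Eq. (59)."""
    pi = sigmoid(Y_i @ beta)
    lik0 = (s0**(S_i * X_i) * (1 - s0)**(S_i * (1 - X_i))
            * p0**S_i * (1 - p0)**(1 - S_i) * (1 - pi))
    lik1 = (s1**(S_i * X_i) * (1 - s1)**(S_i * (1 - X_i))
            * p1**S_i * (1 - p1)**(1 - S_i) * pi)
    # Order: (D=0,T=0), (D=0,T=1), (D=1,T=0), (D=1,T=1).
    w = np.array([(1 - y) * lik0, y * lik0, (1 - x) * lik1, x * lik1])
    k = rng.choice(4, p=w / w.sum())
    return k // 2, k % 2  # (D_i, T_i)
\end{verbatim}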
D.2 M-step

The updates are the same as before, except that the missing $T_i$'s are replaced by their imputed values $\widehat{T}_i$.
D.3 Enhancement

In addition, we can consider a different prior on the imputed $T_i$'s. We therefore create a new sensitivity/specificity pair, dedicated to the imputed $T_i$'s. The corresponding priors are initialized as non-informative beta distributions and updated at training time.
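A sketch of this enhancement: the imputed tests are given their own sensitivity/specificity pair, updated with the same conjugate rule as $(x, y)$ but restricted to the imputed indices; the prior pseudo-counts and all names are placeholders for the non-informative beta prior mentioned above.

\begin{verbatim}
import numpy as np

def update_imputed_x_y(T_imp, D_imp, a_prior=2.0, b_prior=2.0):
    """Conjugate update restricted to patients whose T was imputed."""
    def mode_or_mean(a, b):
        return (a - 1) / (a + b - 2) if min(a, b) > 1 else a / (a + b)
    x_imp = mode_or_mean(np.sum(T_imp * D_imp) + a_prior,
                         np.sum((1 - T_imp) * D_imp) + b_prior)
    y_imp = mode_or_mean(np.sum(T_imp * (1 - D_imp)) + a_prior,
                         np.sum((1 - T_imp) * (1 - D_imp)) + b_prior)
    return x_imp, y_imp
\end{verbatim}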
Figure 10: Performance of the StEM algorithm compared to benchmark versions of the EM, for n = 300 samples, σ = 0. and varying levels of specificity and sensitivity. (A) Gain in accuracy with respect to the Data-Agnostic EM described in Section 5. (B) Gain in accuracy with respect to the Data-Informed EM described in Section 5.
E Additional Plots for Validation on Synthetic Data
This section provides additional details on the validation of the StEM algorithm on synthetic data.
Improvement upon T. Table 1 shows the improvement in diagnosis accuracy provided by our method over the sole test T, and highlights the potential strength of harvesting multiple noisy sources of information. The sensitivity values chosen here reflect those reported for the LFA test in the COV-CLEAR study.

    Sensitivity \ Specificity     70          80      93      99
    Asymptomatic (70)             16.2 ± ·    ·       ·       ·
    21+ days (99.0)                8.7 ± ·    ·       ·       ·

Table 1: Gain in accuracy (mean and standard deviation) when using StEM over the sole test, for real-life specificity values of 70, 80, 93, and 99; rows range from asymptomatic subjects (sensitivity 70) to tests taken 21+ days after symptom onset (sensitivity 99.0).
Benchmarks.
We complete here our discussion of the improvement that our method brings over standard Computer-Aided Diagnosis (CAD) methods. Fig. 10 compares, for various pairs of sensitivity and specificity, the diagnosis accuracies of the Stochastic EM algorithm (StEM) and two variants of the EM algorithm: one where the parameters' priors are agnostic, or uninformative, and another where these priors are computed from the available data, as described in Section 5. This figure demonstrates that learning the parameters' posterior distributions while estimating the diagnosis yields higher prediction accuracy.
Convergence.
Fig. 11 shows the distribution of the relative difference between the recovered coefficients and the ground truth (as a percentage of the ground-truth value): the deviations are within a few percent of the true values of the coefficients, highlighting the ability of the model to converge to the ground-truth parameters. Fig. 12 shows the average number of convergence steps for different values of the sensitivity, the specificity, and the sample size. Interestingly, we note that for high values of the sensitivity/specificity, convergence is much faster. To illustrate the complexity of the algorithm, we also display in Fig. 13 the time required per iteration as a function of the number of samples.
Figure 11: Relative difference between estimated parameters and their ground truth, for n = 300, σ = 0. and different values of the sensitivity (simulated data); the panels include the probability of symptom 0 if healthy, the probability of symptom 0 if sick, and the coefficient of risk factor 2.

Figure 12: Number of steps until convergence, for σ = 0. and different values of the sensitivity and specificity, at sample sizes n = 100, 200, and 300 (simulated data).
Figure 13: Time (s) per iteration, as the number of samples increases.
F Additional Plots on Real Data
At the subject level.
Fig. 14 presents four examples where our algorithm either confirms or contradicts the result of the test, thereby allowing the potential flagging of false negatives or false positives. The first example (Fig. 14 A) is a user who registers a positive test, together with a significant number of symptoms and risk factors. The second user (Fig. 14 B) registers a negative test, while being asymptomatic and having a limited number of risk factors. In both cases, we expect our model to confirm the result of the test and to provide a narrower confidence interval for the probability of immunity, as confirmed by Fig. 14. The third and fourth examples showcase instances where our diagnostic posterior and the test disagree. Subject 108 is a user who registers a negative test while exhibiting a wide range of known COVID symptoms (dry cough, shortness of breath, fever), but took the test less than 10 days after falling ill. Similarly, subject 92 exhibited fewer symptoms (in particular, no shortness of breath), but lives in a household of three in which the two other members have also fallen sick. While the posteriors reclassify the subjects' diagnoses in each case, the confidence intervals associated with the predictions of immunity reflect the uncertainty associated with these cases, and flag them as potential false negatives.
At the global level: LFA sensitivity and specificity.
The posterior distributions of the model's parameters shed light on the actual accuracy of the LFA tests on our population. Fig. 15 compares the posteriors of the sensitivity and specificity to the priors built from the values reported by manufacturers, for: (A) asymptomatic subjects, (B) symptomatic subjects who took the test between 2 and 10 days after their first symptoms, and (C) symptomatic subjects who took the test between 11 and 20 days after their first symptoms. We observe that the StEM has updated the prior distributions of the sensitivity and specificity, the most significant update being observed for the asymptomatic subjects.
At the global level: COVID-19 symptoms in the population of interest.
Furthermore, our results provide information regarding the most prevalent symptoms of COVID-19. The posterior probability of exhibiting specific symptoms, among symptomatic subjects with a positive or negative predicted diagnosis, is shown in Fig. 16. We emphasize that these probability distributions are relevant to our population of n = 117 healthcare workers, and may vary for studies considering another population.
Figure 14: Posterior diagnosis distributions for four selected subjects: (A) Subject 83, whose positive questionnaire confirms a positive test; (B) Subject 111, whose negative questionnaire confirms a negative test; (C) Subject 108 and (D) Subject 92, whose positive questionnaires contradict a negative test. In each panel, the left plot represents the posterior of the diagnosis given the symptoms and risk factors reported in the questionnaire, while the right plot is the posterior of the diagnosis given the symptoms, risk factors, and reported result of the diagnostic test. The red dot represents the expectation of the corresponding distribution.
Figure 15: Posteriors of the sensitivity and specificity compared to their priors, for three regimes: (A) asymptomatic subjects, (B) LFA test taken 2-10 days after first symptoms, (C) LFA test taken 11-20 days after first symptoms.
Figure 16: Posterior distributions of the probability of exhibiting selected symptoms, for a symptomatic subject: (A) fever, (B) cough with sputum, (C) shortness of breath, (D) fatigue, (E) congested nose, (F) dry cough. The probability of exhibiting each symptom is plotted for symptomatic subjects with an estimated negative (D = 0) or positive (D = 1) diagnosis.