A Machine Learning alternative to placebo-controlled clinical trials upon new diseases: A primer
IICAS 048/20
Ezequiel Alvarez (a)†, Federico Lamagna (a,b)‡, Manuel Szewc (a)⋄
(a) International Center for Advanced Studies (ICAS), UNSAM and CONICET, Campus Miguelete, 25 de Mayo y Francia, (1650) Buenos Aires, Argentina
(b) Centro Atómico Bariloche, Instituto Balseiro and CONICET, Av. Bustillo 9500, 8400, S. C. de Bariloche, Argentina
Abstract
The appearance of a new dangerous and contagious disease requires the development of a drug therapy faster than what is foreseen by usual mechanisms. Many drug therapy developments consist in investigating, through different clinical trials, the effects of different specific drug combinations by delivering them to a test group of ill patients, while a placebo treatment is delivered to the remaining ill patients, known as the control group. We compare the above technique to a new technique in which all patients receive a different and reasonable combination of drugs, and this outcome is used to feed a Neural Network. By averaging out fluctuations and recognizing different patient features, the Neural Network learns the pattern that connects the patients' initial state to the outcome of the treatments and can therefore predict the best drug therapy better than the above method. In contrast to many available works, we do not study any detail of drug composition or interaction, but instead pose and solve the problem from a phenomenological point of view, which allows us to compare both methods. Although the conclusion is reached through mathematical modeling and is stable upon any reasonable model, this is a proof-of-concept that should be studied within other areas of expertise before confronting a real scenario. All calculations, tools and scripts have been made open source for the community to test, modify or expand. Finally, it should be mentioned that, although the results presented here are in the context of a new disease in medical sciences, they are useful for any field that requires an experimental technique with a control group.
E-mail: † [email protected], ‡ [email protected], ⋄ [email protected]
arXiv [q-bio.QM], Mar

1 Introduction
Over the last decades, research in drug discovery and in drug therapy design for clinical trials has implemented Machine Learning (ML) techniques in order to optimize drug design, therapy and disposition. Drug discovery and design is a key field for tackling new diseases, and in recent years it has used databases of molecular structures and target activities in combination with ML techniques to learn and discover potential biologically active molecules and their corresponding human drug targets [1–6]. Clinical trials on known and new drugs, as well as on combination therapies, are among the largest costs in biopharma and medicine, and have also taken advantage of ML techniques to enhance their efficacy and reduce their costs. These techniques have been applied to treatments for a diversity of diseases; recent reviews and relevant papers can be found in Refs. [7–11] and references therein.

Clinical trials are strictly regulated by their corresponding agencies. However, in many cases of seriously ill patients a compassionate drug therapy is allowed. Upon the appearance of a new and contagious disease such as COVID-19 [12, 13] or any other, working drug therapies are urgently required and usual development times must be compressed and optimized. In many cases a clinical trial consists in a placebo-controlled study: a test group receives a drug treatment and a control group receives placebo. This is necessary to recognize the real drug effect by comparing it to a placebo treatment. These trials need many patients in each of the groups in order to reduce the statistical fluctuations, as well as fluctuations that may originate in uncontrolled or unknown variables which may exist in the patients or in each specific treatment. If the number of patients is large enough, these fluctuations wash out and the real effect of the specific drug treatment may be analyzed.
The problem with this technique is that each specific drug treatment needs to be taken by a large number of patients, and therefore there is little room to explore different combination treatments. This translates into a delay in finding the drug combination that would yield the best available therapy.

With the advent of Artificial Intelligence and ML techniques these premises may change. Although conventional human-led research requires testing the same drug and a placebo in many patients to learn its effect as the fluctuations average to zero, a complex Neural Network (NN) could in principle learn from a more diverse dataset in which each patient is tested with a different drug combination. The NN would learn the pattern of the drugs' effect, whereas the fluctuations would wash out to zero in the loss function. This 'reasoning' of the Neural Network could enlarge the number of possible drug combinations to be analyzed in comparison to the usual clinical trial described above, when both techniques use the same number of patients. If this is the case, then the ML technique would be an important tool to find a better drug treatment for a new disease. This is the problem we investigate and present in the current article.

We propose to explore whether, with the assistance of Machine Learning techniques, there could be better ways than placebo-controlled studies to tackle the problem of finding the best drug therapy for a new unknown disease. This work is purely mathematical, consisting in modeling and simulations, and therefore has no other ambition than being a proof-of-concept of the program described below. Reference [11] considers a similar approach, although we use different tools.

This work is organized as follows. In Section 2 we pose the problem and express it in a specific mathematical language.
In Section 3 we propose a phenomenological model to address the problem, which allows us to perform a diversity of calculations and simulations to show our results. Section 4 contains a few remarks on the model and our results, as well as future prospects that open up from these results. We conclude in Section 5 and collect some relevant formulas and verification results in Appendices A and B. All calculations and programs to reproduce and expand the results in this paper can be found as supplementary material in the Git repository in Ref. [14].
Upon the appearance of a new disease, there is a set of possible drugs which can be combined to treat it. However, the correct or best combination is usually unknown. The drugs' effect on the patients is, among other things, non-linear (drugs may interfere with one another) and patient-dependent. This scenario can be mathematically described as follows. We describe the patient features and initial conditions with a multidimensional vector $\vec{x}$, where each component $x_i$ corresponds to any kind of patient feature or pre-existing condition which may be relevant for the study, for instance diabetes or heart pre-conditions, genomics, age group, ethnicity, etc. The drugs whose combinations are to be studied are described with a multidimensional vector $\vec{y}$, where each component $y_i$ corresponds to the daily dose of drug $i$ to be delivered in treatment. For the sake of concreteness and simplicity, for a given patient $\vec{x}$ and drug combination $\vec{y}$, we assume a unique treatment that consists in delivering drug combination $\vec{y}$ every day during a specific fixed number of days. Therefore, for any $(\vec{x}, \vec{y})$ all treatments follow the same protocol, but with a different patient and drug combination. As a result of this treatment, we assume that the patient outcome can be described with a number between 0 and 1: 0 being dead and 1 being in excellent health. In an exact science there would be a function that connects the initial condition and features of the patient to the outcome through a given specific treatment. This health function, which we define as $h(\vec{x}, \vec{y})$, is a multi-variable, non-linear, unknown function that takes values in $[0, 1]$. However, in real scenarios there are many uncontrolled variables which also affect the outcome of the given treatment, such as an unknown pre-existing condition, a genomic factor, or different physicians providing different evaluations of the patients, among many others.
These uncontrolled variables could be included as stochastic noise on $h$ in different ways. One practical way is to consider a factorizable stochastic modulation on $h$,

$$H(\vec{x}, \vec{y}) = S(\vec{x}, \vec{y})\, h(\vec{x}, \vec{y}), \qquad (1)$$

while either re-normalizing, dropping or overflowing values outside $[0, 1]$. This noisy function $H$ is the one with physical meaning that best represents in practice the health outcome of a patient with features $\vec{x}$ when receiving a treatment consisting in drug therapy $\vec{y}$. In this framework we have that

$$H(\vec{x}, \vec{0}) = \text{health outcome of a patient receiving placebo or no drugs during treatment.}$$

Within this mathematical modeling, the problem of finding the best drug treatment for the disease on patient $\vec{x}$ is translated into finding the drug combination $\vec{y}$ that maximizes $H(\vec{x}, \vec{y})$. In most of this work we are interested in finding the drug therapy that best heals a set of patients on average; in this case we therefore look for the drug combination $\vec{y}$ that maximizes the average of $H(\vec{x}, \vec{y})$ over a set of patients $\vec{x}$. Actually, in the cases in which the dependence of $H$ on the patient features is subleading compared to the dependence on the drug treatment, both maxima would be similar.

Therefore, given $N$ patients, we want to explore which of the following techniques would be the best option to find the drug combination $\vec{y}$ that yields the best distribution of $H(\vec{x}, \vec{y})$ over a new given set of patients:

• A) Regular Drug Therapy technique (RDT):
Divide the $N$ patients into $k$ sets of $n$ patients ($n \cdot k = N$) and in each of these sets apply a given drug therapy $\vec{y}_i$ ($i = 1, \ldots, k$) to half of the patients ($n/2$) and placebo to the other half, known as the control group. After the treatment, consider in each set the average outcome of the patients receiving treatment and the average outcome of the patients receiving placebo. (There are many different clinical trial analyses [15, 16]; here we adopt one very similar to the current COVID-19 clinical trial on Remdesivir [17].) The set that yields the largest difference between these two averages, and its corresponding drug therapy $\vec{y}_A$, is considered to be the best drug therapy within this technique. In the previously defined mathematical language, $\vec{y}_A$ provides the best distribution of $H(\vec{x}, \vec{y})$ over $\vec{x}$, where best means that it provides the largest average of $H$ and therefore the best average healing for patients.

• B) Neural Network Drug Therapy on the RDT data (NN@RDT): Using the above $N$ data points we train a Neural Network (NN) to learn $H(\vec{x}, \vec{y})$. Once this NN is trained we simulate a large set of pseudo-patients and drug therapies $(\vec{x}, \vec{y})$, and find the drug therapy $\vec{y}$ that yields the maximum average of the trained NN output on $(\vec{x}, \vec{y})$. We define this drug therapy as $\vec{y}_B$, and estimate that it provides the best distribution of $H(\vec{x}, \vec{y})$ over $\vec{x}$.

• C) Neural Network Drug Therapy technique (NNDT): Apply a different drug therapy to each one of the $N$ patients, observe the result after the treatment, and train a NN on these $N$ data points to learn $H$. Then, follow the same procedure as in NN@RDT, simulating a large set of pseudo-patients and drug therapies, and finding the best drug therapy $\vec{y}_C$.

(In this work we have used renormalization and overflowing, with $S$ independent of $\vec{x}$ and $\vec{y}$, as described in more detail in Appendix A.)

There are a few points to be discussed about these techniques. We call the first technique Regular simply because it is the one against which we want to contrast the other two, and because it is currently used in an important clinical trial for COVID-19 [17]. There are many other techniques, or improvements of this one, which are currently being used, some of them including ML techniques. The technique NN@RDT does not in principle need data points other than those in RDT, although it is very likely that different sets in RDT have a spread in their setups (duration, forms of dosage delivery, etc.). Technique NNDT may at first seem purely theoretical, since in reality one cannot try an arbitrary drug combination in patients. However, one can use mathematics to reduce the space of drug therapies $\vec{y}$ in such a way that only feasible therapies appear as $\vec{y}$ is varied (see below).

Finally, it is worth stressing at this point that this work aims to compare the above techniques and to determine whether it is possible to issue a statement that works for a general class of functions $H$, and therefore in particular for the true function. It is not the objective of this work to inquire about the true form of $H$. As a matter of fact, the whole work remains phenomenological and mathematical, with no connection to real drug compounds or real patient features.
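The selection step of technique A, and the final search step shared by techniques B and C, can be sketched in a few lines. This is only an illustrative sketch under stated assumptions: the function names, the `toy_outcome` stand-in for $H$ (noiseless, with one helpful and one harmful drug), and the numbers of patients and therapies are hypothetical and are not taken from the repository in Ref. [14]. Any callable, for instance a trained network, can play the role of `model`.

```python
import random
from statistics import mean


def rdt_best_therapy(outcome, patients, therapies, placebo):
    """Technique A (RDT): one patient set per therapy, half treated, half placebo."""
    k = len(therapies)
    n = len(patients) // k  # patients per set, so that n * k = N
    best_index, best_gain = None, float("-inf")
    for i, therapy in enumerate(therapies):
        group = patients[i * n:(i + 1) * n]
        treated, control = group[: n // 2], group[n // 2:]
        # Difference between the treated average and the placebo average.
        gain = (mean(outcome(x, therapy) for x in treated)
                - mean(outcome(x, placebo) for x in control))
        if gain > best_gain:
            best_index, best_gain = i, gain
    return best_index, best_gain


def surrogate_best_therapy(model, pseudo_patients, candidate_therapies):
    """Techniques B and C: candidate with the largest mean model output."""
    return max(candidate_therapies,
               key=lambda y: mean(model(x, y) for x in pseudo_patients))


def toy_outcome(x, y):
    # Hypothetical noiseless stand-in for H: drug 0 helps, drug 1 harms.
    return min(1.0, max(0.0, 0.3 * x[0] + 0.6 * y[0] - 0.4 * y[1]))


rng = random.Random(0)
patients = [[rng.random() for _ in range(5)] for _ in range(500)]
therapies = [[rng.random() for _ in range(10)] for _ in range(5)]
placebo = [0.0] * 10
best_index, best_gain = rdt_best_therapy(toy_outcome, patients, therapies, placebo)
```

In this sketch RDT can only rank the $k$ therapies that were actually administered, whereas the surrogate search can rank any number of simulated candidates; this is the difference in scope that the comparison in this work is about.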
Throughout this section, we address the posed problem from the phenomenological point of view. We do not consider the intermediate steps, nor inquire into the real chemical or biological reactions, which have been and are being extensively studied; instead we consider only the initial condition and the final outcome of the patient (see, however, the discussion in Section 4). Also, we do not consider the individual meaning of the variables $(\vec{x}, \vec{y})$; we only require $\vec{x}$ to be features and initial conditions of the patient, and $\vec{y}$ any part of the treatment that can be varied at will. As a matter of fact, the components of $\vec{y}$ could have other non-drug meanings that physicians consider relevant for the treatment, and the whole procedure would still be valid. (A similar approach with different tools is found in Ref. [11]; we discuss our main differences in Section 4.)

In order to normalize the following procedure and to make it more efficient, it is more appropriate to have all components of $\vec{x}$ and $\vec{y}$ range between 0 and 1. Therefore, any patient feature or drug dosage should be normalized to a number in $[0, 1]$. In particular, any drug dosage known to be harmful or impossible in practice should not be taken into account; only those which are scientifically and ethically possible should be considered and fitted into this range. It is also possible to take advantage of known chemical and biological facts and eventually merge some suggested possible combinations into a single $y_i$ component. This is what is usually known as designing the NN according to the requirements of the problem, and we comment briefly on it in Section 4.

Figure 1: Distributions of $H_1$ and $H_2$, in the case of patients only and patients plus drugs, for uniformly distributed $\vec{x}$ and $\vec{y}$.

Once we have stated the above considerations, we can proceed to compare techniques RDT, NN@RDT and NNDT as follows.
Although we do not aim to propose a function $H$, we can investigate the comparison in different and varied reasonable samples of $H$ and extract conclusions which we can expect to hold in data.

In the following paragraphs we consider a scenario where patients have 5 features and there are 10 drugs to be tested in any combination. This means that in $x_i$, $i$ goes from 1 to 5, and in $y_j$, $j$ goes from 1 to 10. Any extension or other scenario can be easily tested or modified, since we have made the source code of all the calculations in this work open in the Git repository in Ref. [14].

In this section we present the results for two given $H$'s; however, we have also tested other kinds of $H$'s, as described in Appendix A and in [14]. We need $H$ to satisfy a few reasonable requirements. It should be non-linear, and it should contain features such as drug interference. The distribution of $H(\vec{x}, \vec{0})$ for random patients $\vec{x}$ corresponds to the outcome of patients with no drug treatment, and should therefore lie mainly below some given bound, since we are assuming that an important fraction of patients do not heal without treatment. The distribution of $H(\vec{x}, \vec{y})$ over $\vec{x}$ and $\vec{y}$ should yield values in the proximity of 1 (excellent health), since we are assuming that in principle there exists a combination of drugs that can heal patients. But it should also yield values in the proximity of 0 (dead), since it is reasonable to assume that there are also harmful drug combinations. Finally, $H$ should depend little or not at all on some of the components $x_i$ and $y_j$, since it is expected that some of the proposed drugs do not have a significant effect on the outcome of the treatment.

Using these requirements we have constructed and tested many hypothetical $H$. In this work we show the results for two representative functions $H_1(\vec{x}, \vec{y})$ and $H_2(\vec{x}, \vec{y})$, whose explicit forms can be found in Appendix A.
Other or new functions can be tested using the tools available in Ref. [14]. In Fig. 1 we have plotted their distributions for placebo treatments ($\vec{y} = \vec{0}$) and for drug treatments, using $x_i$ and $y_j$ uniformly distributed at random in $[0, 1]$. We stress that, since only feasible drug doses have been encoded in the $y_i \in [0, 1]$ variables, this random choice actually means random over feasible drug doses. An explicit and reasonable form for $H$ (or a sample of many explicit forms) is the starting point for our analysis.

For each one of the $H$ functions we explicitly perform the techniques RDT, NN@RDT and NNDT described above and compare their outcomes. This is done by computing the distribution of $H(\vec{x}, \vec{y}_{A,B,C})$ over $\vec{x}$, where $\vec{y}_{A,B,C}$ corresponds to the optimal drug therapy found by each technique RDT, NN@RDT and NNDT, respectively. We can interpret the mean of each one of these distributions as the average outcome of the treatment with drug therapy $\vec{y}_{A,B,C}$, respectively. Since the function $H$ is monotonically related to the probability of healing, if a treatment has a larger average than the others, it can be deemed the treatment with the better probability of success when averaging over types of patients. Furthermore, one can look at the distribution and seek drug combinations whose shape is tilted towards 1, avoiding bad outcomes most of the time.

We show the results of our analysis for $N = 500$ and $N = 2000$ patients in Figs. 2 and 3 for functions $H_1$ and $H_2$, respectively. In analyzing these results one should bear in mind that these functions are quite different not only in their shape, but also in their structure, as detailed in Appendix A. For the Regular Drug Therapy trials, these patients are divided into groups of 100 patients each, of which half are treated with placebo. We do not consider fewer than 500 patients because a smaller number means too few samples of the drug space for the RDT, although we did study NNs with 200 training points and found encouraging results. For each choice of $N$ and $H$, we consider three different random noise values, which we call stochastic. Further details about the stochastic implementation can be found in Appendix A. The results presented are those that are representative of the best behavior of the techniques introduced. However, the procedure is fairly stable assuming certain conditions. Further details about the choice of NN hyperparameters and the NN performance can be found in Appendix B.

From both figures we see that the NN-reliant techniques perform better than RDT, finding both a higher average output and a more probable favorable output than RDT. Most of the time, NNDT performs better than NN@RDT. However, NN@RDT, which performs better than RDT, in some cases (not shown) reaches the NNDT level or performs slightly better.
While both cases are favorable, we also see that $H_1$ and $H_2$ model different kinds of functions: one yields a more evenly distributed outcome for every technique, while the other yields a 'lumped' result for each technique. In both cases, the NN is found to provide a good reconstruction of the underlying function $H$, yielding Spearman rank correlation coefficients $R$ that depend on $N$ and the stochastic noise parameter, as detailed in Appendix B.

Figure 2: Distributions for the $H_1$ function for different patients, with drug features fixed over each of the cases "no-drugs", RDT, NN@RDT and NNDT. The procedure of drug therapy discovery was done for samples of $N = 500$ and $N = 2000$ patients (panels (a)-(f)), and for values of the stochastic noise of 0%, 20% and 40%. In all cases the NNDT technique (blue) performs better than the RDT (green).

Figure 3: Distributions for the $H_2$ function for different patients, with drug features fixed over each of the cases "no-drugs", RDT, NN@RDT and NNDT. The procedure of drug therapy discovery was done for samples of $N = 500$ and $N = 2000$ patients (panels (a)-(f)), and for values of the stochastic noise of 0%, 20% and 40%. In all cases the NNDT technique (blue) performs better than the RDT (green).

4 Discussion
The work and results presented so far admit different possible improvements and follow-ups. Many of them can be taken on directly from the open source code available in the Git repository [14], where all the tools to reproduce and expand the results of this article are publicly available.

During the investigation we have found that the NN architecture is important to achieve good results. The best NN architecture depends on the number of cases analyzed ($N$) since, as $N$ is reduced, the architecture should be reduced as well to avoid overfitting. Also, we find that the higher the complexity and non-linearity of $H$, the deeper (more layers) the NN should be. Further investigations and trials to understand the best NN architecture design would lead to more solid results when considering a real scenario.

This work could be complemented with another ML technique to best choose the variables $\vec{x}$ and $\vec{y}$. In particular, one could run an auto-encoder on the whole set of these variables and let the algorithm reduce them to a smaller set of more relevant ones. The latter could be used as input to the NN used in this work, eventually improving the training time and performance of the network.

An important point that can be extracted from this work is related to the function $H$ and its true value. Although throughout the article we have tested general samples of $H$ for the sake of finding general behaviors and properties, having true expected features coming from biology and physician expertise could provide more realistic forecasts and more focused NNs. Moreover, conversely, one could attempt to reconstruct how $H$ depends on its variables through the NN and compare it to molecular and biological models to learn from its description.

The NNDT presented seems to have, at least from the mathematical point of view, better prospects than RDT. However, this NNDT could still be improved as follows. Once the NN has been trained with the real patients' data, upon the need to apply the best drug therapy to a new patient, one could use the NN to design the best drug therapy customized for this new patient.
The procedure is to fix $\vec{x}$ to the new patient's features and vary the drugs $\vec{y}$ at random until finding the maximum outcome of the NN. Such a drug combination $\vec{y}_C$ would be a customized drug therapy for the patient's features. We have verified that this technique is as good as or better than NNDT; however, obtaining the corresponding distributions of this customized NN Drug Therapy within the presented framework requires large CPU resources. We plan to study this possibility further and eventually include it in a future update.

Finally, we mention similarities and differences between our work and the one in Ref. [11]. The main similarity between both studies is the mathematical and phenomenological setup of the problem, in contrast to the usual approach. On the other hand, Ref. [11] proposes a Bayesian additive regression tree model based on sequential experimentation, whereas our approach is based on a NN that analyzes the whole patient dataset at once, since time is crucial for our objectives. Our scheme proposes to work with an explicit function $H$, which allows us to study in more detail the differences between the RDT and NN approaches within this framework. We consider that both works complement each other and push forward the same idea of replacing RDT with a more sophisticated Machine Learning algorithm.

We have investigated, from a mathematical point of view, how newly available Machine Learning techniques could improve the efficacy and reduce the duration of clinical trials in finding the best drug therapy for a new unknown disease.

The main point of our analysis consists in replacing the usual placebo-controlled clinical trial technique, in which a fraction of the patients is treated with a given specific drug and the remaining patients with placebo, by one in which each patient is treated with a different one of a variety of possible drug therapies.
We have shown that a Neural Network can assimilate the variety of data and expected fluctuations without the need for a large number of patients under the same treatment.

To compare the prediction of a trained Neural Network against the usual clinical trial techniques we have implemented a phenomenological approach in which we model the evolution of a patient with specific features from beginning to end of treatment as a function of the drug combination received. This modeling does not rely on any physiological, biological, or molecular behavior or interaction, but instead just on the outcome of the patient as a number in $[0, 1]$ related to the patient's health at the end of treatment.

Our findings show that a Neural Network Drug Therapy (blue line in Figs. 2 and 3) always performs better than a Regular Drug Therapy (green line). The strength of the argument resides in the fact that this result holds regardless of the specific $H$ used.

Throughout the article we also discuss potential developments and improvements that could follow from this result. Among them, we propose that such a Neural Network technique could be used to test any modeling as described above, or that the Machine Learning technique could provide an even better customized drug therapy for each specific patient. Further work is needed on many of these fronts. We also observe that the results presented in this work are useful as well for other disciplines in which experimental techniques require a test and a control group in different scenarios to understand different behaviors and/or patterns.

We understand this work as a proof-of-concept of the presented idea. Further investigations on the biopharma and physician sides would be required to explore whether some of the ideas proposed here could be taken to a real scenario. In such a case, this could provide an important step in accelerating drug therapy discovery for important health issues, such as the COVID-19 pandemic.
A Functions H

In this work we modeled the behavior of a practically intractable physical system with a function $H(\vec{x}, \vec{y})$ which satisfies a fairly general set of hypotheses described in Sections 2 and 3. To perform a practical test of the techniques proposed we use a set of functions $H(\vec{x}, \vec{y})$ which aim to capture the following hypotheses:

• It is a multi-variable (highly) non-linear function.
• It can take values in $[0, 1]$.
• It has a stochastic component.
• $H(\vec{x}, \vec{0})$ should be tilted towards 0 to represent the expected no-drugs outcome.
• $H(\vec{x}, \vec{y})$ may reach higher or lower values than $H(\vec{x}, \vec{0})$, representing those drugs $\vec{y}$ that have a positive or harmful effect on health, respectively.
• It can contain features such as drug interference or cancellation. For example, a dependence like $x_k (y_i - y_j)$ allows for such a behavior.

Results presented in this work were obtained using two different forms for the $H$ function, $H_1$ and $H_2$. Their differences lie in the way they are constructed, which is discussed in the following paragraphs.

Function $H_1$

$H_1$ is factorized into a patient-specific part $P$, a drug-specific part $d_1$, a function of both patient and drugs $d_2$, and the stochastic piece $S$ depending on a parameter $\eta$:

$$H_1(\vec{x}, \vec{y}) = P(\vec{x})\, d_1(\vec{y})\, d_2(\vec{x}, \vec{y})\, S_\eta \qquad (2)$$

with

$$P(\vec{x}) = \left| \frac{\sum_k (\alpha_k - \alpha_k x_k)}{\sum_k \alpha_k} \right|^{\psi},$$

$$d_1(\vec{y}) = \exp\!\Big( -\big| \tilde{\alpha}_{\cdot} y_{\cdot} + \tilde{\alpha}_{\cdot} y_{\cdot} + \tilde{\alpha}_{\cdot} y_{\cdot} + \tilde{\alpha}_{\cdot} y_{\cdot} + \tilde{\alpha}_{\cdot} y_{\cdot} \big| + \big| \tilde{\alpha}_{\cdot} y_{\cdot} + \tilde{\alpha}_{\cdot} y_{\cdot} + \tilde{\alpha}_{\cdot} y_{\cdot} + \tilde{\alpha}_{\cdot} y_{\cdot} \big| + \big| \tilde{\alpha}_{\cdot} y_{\cdot} + \tilde{\alpha}_{\cdot} y_{\cdot} + \tilde{\alpha}_{\cdot} y_{\cdot} + \tilde{\alpha}_{\cdot} y_{\cdot} \big| \,/\, {\cdot} \Big),$$

$$d_2(\vec{x}, \vec{y}) = 1 + 0.{\cdot}\, \sin\!\Big( \beta_{\cdot} y_{\cdot} - \big| (\beta_{\cdot} y_{\cdot} + \beta_{\cdot} y_{\cdot} + \beta_{\cdot} y_{\cdot})(\beta_{\cdot} x_{\cdot} y_{\cdot} - \beta_{\cdot} x_{\cdot} y_{\cdot}) \big| \Big). \qquad (3)$$

Here a dot stands for a subscript or digit that is not reproduced in this text.

The patient function depends on a coefficient $\psi$ that controls how easily patients can heal in the absence of drugs. For $\psi < 1$ this function leans towards higher values, and for $\psi > 1$, towards zero.
The drug function $d_1$ has an exponential dependence on a combination of drug parameters $y_i$. The part $d_2$ adds an oscillation that depends on a specific combination of drugs and patient features, allowing certain combinations to conspire into an increase or a reduction of the whole value of the function. For example, the second term inside the sine function contains a factor given by the difference of two products of a patient feature and a drug dose, which accounts for interference between two drug components, weighted by two patient features. The remaining function parameters $\alpha_i$, $\tilde{\alpha}_i$, $\beta_i$ are selected at random within fixed ranges, with $\alpha_i$ non-negative and $\tilde{\alpha}_i$, $\beta_i$ allowed to be negative. For each specific set of values of the parameters, we have a certain function $H_1$. Thus, the above formulas describe a family of functions $H_1$. We checked that for several values of the parameters the distributions of $H_1$ with and without drugs are sufficiently separated, as in Fig. 3. We then fixed the parameters' values to reproduce such behavior.

Regarding the noise factor $S$, we use a Gaussian function centered at 1, with standard deviation $\eta$ and independent of $\vec{x}$ and $\vec{y}$. As we want the output of the function to be in the range $[0, 1]$, we have to do two things. First, as it cannot take negative values, we take $S$ to be the absolute value of the Gaussian centered around 1. Then, we calculate the maximum value attained by the part $P \cdot d_1 \cdot d_2$, and take into account possible fluctuations up to two standard deviations $\eta$. We then divide the function by this value,

$$H_1^{(0)} \equiv \max_{\vec{x}, \vec{y}} \big( P(\vec{x})\, d_1(\vec{y})\, d_2(\vec{x}, \vec{y}) \big)\, (1 + 2\eta).$$

Finally, we bound the value of the function to be lower than or equal to one. That is, if a certain fluctuation of the noise is outside of two standard deviations and the function evaluates to something higher than 1, we take it to be equal to 1.
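The noise and normalization recipe described above can be illustrated with a short sketch. It assumes that the noiseless part $P \cdot d_1 \cdot d_2$ has already been evaluated (`base_value`) and that its maximum (`base_max`) is known; the function names and the example numbers are hypothetical and are not the implementation in Ref. [14].

```python
import random


def noise_factor(eta, rng):
    """S: absolute value of a Gaussian centered at 1 with standard deviation eta."""
    return abs(rng.gauss(1.0, eta))


def noisy_health(base_value, base_max, eta, rng):
    """Divide by the maximum times (1 + 2*eta), then cap the result at 1."""
    scale = base_max * (1.0 + 2.0 * eta)
    return min(1.0, base_value * noise_factor(eta, rng) / scale)


rng = random.Random(42)
# Hypothetical noiseless value 0.8 with maximum 1.0 and 20% stochastic noise.
samples = [noisy_health(0.8, 1.0, 0.2, rng) for _ in range(10_000)]
```

With `eta = 0` the noise factor is exactly 1 and the function reduces to `base_value / base_max`; for `eta > 0` only fluctuations beyond two standard deviations at the maximum are affected by the cap.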
Function $H_2$

The function $H_2$ is not factorizable, but has a more compact form,

$$H_2 = \frac{1}{15} \left| x_{\cdot} + (y_{\cdot} + 3 y_{\cdot} - y_{\cdot})(x_{\cdot} - x_{\cdot}) + \sinh(y_{\cdot} - y_{\cdot}) - e^{-(y_{\cdot} - y_{\cdot})} \right| S_\eta, \qquad (4)$$

where a dot stands for a component index that is not reproduced in this text. Once again, it has the features described above: the nonlinearities, together with drug interference and cancellation. The distributions of this function with and without drugs can be seen in Fig. 3. In this case the distribution with drugs takes smaller values than the distribution without them; that is, there are combinations of drugs that are harmful for the patients.

B Neural Network Performance and Validation
The results presented in Fig. 2 and 3 correspond to a particular choice of hyperparameters for the techniques that rely on Neural Networks. The validity of the NN@RDT and NNDT techniques needs to take into account not only how good $H(\vec{x}, \vec{y}_{B,C})$ is, but also how well the NN performs at reconstructing this truth-level distribution.

For the task of building and training the Neural Networks we use the Python programming language and resort to the Keras package [18]. For each of the functions we scanned over different architectures and other hyperparameters. The Neural Networks considered were composed of fully connected layers, with ReLU activation in all of the units and a single neuron in the final layer. We turn off the bias in most of the hidden layers to prevent a shift in the output of the NN. As the goal at hand is a regression problem (we want to interpolate the H function given a certain number of sample points), we used a mean squared error loss function. Each of the H functions is different in complexity and, as such, the number of epochs, the learning rate and the number of neurons should be chosen correspondingly. As we need to fit nonlinear functions, we consider deep networks with a number of hidden layers between 4 and 6. For the results in this article, the used architecture consists of 8 hidden layers with neuron numbers of (100 , , , , , , , respectively. As we need to avoid overfitting, we have to select architectures with a number of parameters lower than the number of degrees of freedom present in the input data. For example, for N = 500, 5 patient features and 10 drug features, we have degrees of freedom, for which the complexity of the network cannot be too large. The overfitting of the NN can be checked by plotting the loss function over the training and validation sets.
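The parameter-counting argument above can be made concrete with a short sketch. It assumes, for illustration only, that the degrees of freedom are counted as one per input value in the training set (the text does not spell out the counting), and it uses hypothetical layer widths rather than the architecture of this article. Following the description above, biases are switched off in the hidden layers; keeping a bias in the output neuron is a further assumption.

```python
def count_parameters(n_inputs, hidden_widths, hidden_bias=False, output_bias=True):
    """Trainable parameters of a fully connected net with one output neuron."""
    total = 0
    previous = n_inputs
    for width in hidden_widths:
        total += previous * width      # weights between consecutive layers
        if hidden_bias:
            total += width             # one bias per hidden unit, if enabled
        previous = width
    total += previous                  # weights into the single output neuron
    if output_bias:
        total += 1
    return total


def degrees_of_freedom(n_patients, n_patient_features, n_drug_features):
    """ASSUMPTION: one degree of freedom per input value in the training set."""
    return n_patients * (n_patient_features + n_drug_features)


# Hypothetical example: 5 patient features, 10 drug features, N = 500,
# and four hidden layers of 20 units each (widths chosen for illustration).
n_params = count_parameters(5 + 10, [20, 20, 20, 20])
n_dof = degrees_of_freedom(500, 5, 10)
network_is_small_enough = n_params < n_dof
```

Under this counting convention, a candidate architecture would be rejected whenever `network_is_small_enough` is false for the available number of patients.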
In any case, we have worked in all NN with a dropout of 10% to reduce overfitting.

We show the performance of each NN architecture chosen for a given H and N in three plots in Fig. 4 and 5, the first of which is the loss function. The remaining two plots use a test set of previously unseen points to compare the predicted values of H against the true values, which we can denote as ($H(x, y)$, $H_{NN}(x, y)$). We plot this set of points, comparing it to the $H_{NN} = H$ curve, and we calculate the mean squared error and Spearman's rank correlation coefficient. These two are used as a measure of how adequate the hyperparameters are for each H and number of patients N to reproduce the function H.

Another test of how well $H_{NN}$ reproduces the behavior of H over the test data is to use H to select whether a patient heals or not and test whether $H_{NN}$ categorizes the data in the same way. This can be done by setting a threshold in H (which we call $w_t$) to label the data and then computing the Area Under the Curve (AUC) when trying to reproduce these labels with $H_{NN}$. While the mean squared error measures the fidelity point by point, this method checks whether $H_{NN}$ reproduces the true function behavior over the whole data by yielding a consistently high AUC for each $w_t$. A lower AUC for a region of $w_t$ means that the NN is not able to capture the true variation of outcomes over some data region.

We see that both cases yield a fairly good Spearman correlation coefficient while also giving a quite faithful AUC behavior, even when considering a stochastic component. If we increase the noise, the performance worsens but not to an intolerable level. From these results we assert that we can trust the above NN framework to describe the Health distributions.

References

[1] Zhang, L., Tan, J., Han, D., & Zhu, H. (2017). From machine learning to deep learning: progress in machine intelligence for rational drug discovery. Drug Discovery Today, 22(11), 1680–1685. doi:10.1016/j.drudis.2017.08.010.
[2] Lavecchia, A. (2019). Deep learning in drug discovery: opportunities, challenges and future prospects. Drug Discovery Today. doi:10.1016/j.drudis.2019.07.006.
[3] Kumari, P., Nath, A., & Chaube, R. (2015). Identification of human drug targets using machine-learning algorithms. Computers in Biology and Medicine, 56, 175–181. doi:10.1016/j.compbiomed.2014.11.008.
[4] Chen, H., Engkvist, O., Wang, Y., Olivecrona, M., & Blaschke, T. (2018). The rise of deep learning in drug discovery. Drug Discovery Today, 23(6), 1241–1250. doi:10.1016/j.drudis.2018.01.039.
[5] Vamathevan, J., Clark, D., Czodrowski, P. et al. Applications of machine learning in drug discovery and development. Nat Rev Drug Discov 18, 463–477 (2019). https://doi.org/10.1038/s41573-019-0024-5.
[6] Ghasemi, F., Mehridehnavi, A., Pérez-Garrido, A., & Pérez-Sánchez, H. (2018). Neural network and deep-learning algorithms used in QSAR studies: merits and drawbacks. Drug Discovery Today. doi:10.1016/j.drudis.2018.06.016.
[7] Romm, E. L., & Tsigelny, I. F. (2019). Artificial Intelligence in Drug Treatment. Annual Review of Pharmacology and Toxicology, 60(1). doi:10.1146/annurev-pharmtox-010919-023746.
[8] Ascent of machine learning in medicine. (2019). Nature Materials, 18(5), 407–407. doi:10.1038/s41563-019-0360-1.
[9] Harrer, S., Shah, P., Antony, B., & Hu, J. (2019). Artificial Intelligence for Clinical Trial Design. Trends in Pharmacological Sciences. doi:10.1016/j.tips.2019.05.005.
[10] Woo, M. (2019). An AI boost for clinical trials. Nature, 573(7775), S100–S102. doi:10.1038/d41586-019-02871-3.
[11] Kaptein, M. (2019). Personalization in biomedical-informatics: methodological considerations and recommendations. Journal of Biomedical Informatics. doi:10.1016/j.jbi.2018.12.002.
[12] Elsevier, “Novel Coronavirus Information Center.”
[13] Nature, “Coronavirus collection of relevant articles.”
[14] Ezequiel Alvarez, Federico Lamagna, Manuel Szewc, Open source scripts for all calculations in this work, GIT repository https://github.com/ManuelSzewc/ML4DT/, 2020.
[15] National Institutes of Health (NIH), “Clinical Trials database.” 2020.
[16] National Institutes of Health (NIH), “Clinical Trials database for Coronavirus.” 2020.
[17] J. Beigel, “Adaptive COVID-19 Treatment Trial.” 2020.
[18] F. Chollet et al., “Keras.” https://keras.io, 2015.

Case 1: 500 and 2000 patients. Panels (a) to (f).
Figure 4: Different measures of the goodness of the NN fit for function $H_1$, for both N = 500 and N = 2000 training sizes, with a stochastic noise of η = 0.2.

Case 2: 500 and 2000 patients. Panels (a) to (f).
Figure 5: Different measures of the goodness of the NN fit for function $H_2$, for both N = 500 and N = 2000 training sizes, with a stochastic noise of η