Data Science in Biomedicine
arXiv [stat.OT]
Yovaninna Alarcón-Soto ∗, Jenifer Espasandín-Domínguez, Ipek Guler, Mercedes Conde-Amboage, Francisco Gude-Sampedro, Klaus Langohr, Carmen Cadarso-Suárez, Guadalupe Gómez-Melis

Departament d’Estadística i Investigació Operativa, Universitat Politècnica de Catalunya/BARCELONATECH, C/ Jordi Girona, 1–3, 08034 Barcelona, Spain
Unit of Biostatistics, Department of Statistics, Mathematical Analysis, and Optimization, Universidade de Santiago de Compostela
Instituto Maimónides de Investigación Biomédica de Córdoba (IMIBIC)
Models of Optimization, Decision, Statistics and Applications Research Group (MODESTYA), Department of Statistics, Mathematical Analysis, and Optimization, Universidade de Santiago de Compostela
Clinical Epidemiology Unit, Complejo Hospitalario Universitario de Santiago de Compostela

∗ Corresponding author [email protected], Tel.: +34 934 016 947; fax: +34 934 015 855
Abstract: We highlight the role of Data Science in Biomedicine. Our manuscript goes from the general to the particular, presenting a global definition of Data Science and showing the trend for this discipline together with the terms cloud computing and big data. In addition, since Data Science is mostly related to areas like economy or business, we describe its importance in biomedicine. Biomedical Data Science (BDS) presents the challenge of dealing with data coming from a range of biological and medical research, focusing on methodologies to advance biomedical science discoveries in an interdisciplinary context.
Keywords: biomedicine, data science.
1. Introduction
In the last 10 years, we have observed an important increase in the number of job offers requesting data scientists. Data science was already recognized as a science more than five decades ago by John Tukey. In the article The Future of Data Analysis, he points out that more emphasis should be placed on using data to suggest hypotheses to test, and reflects on the existence of an as-yet unrecognized science whose subject of interest is learning from data (Donoho, 2017) and which lays the foundation of today's data science. ‘Data analysis’ includes

“(...) among other things: procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analyzing data.” (Tukey, 1962)

Due to the technological explosion of the last few years, massive amounts of data are generated every day in different areas. This new era requires the development of new techniques to analyze and draw reliable conclusions from these data. In this context, the figure of the data scientist emerges, proclaimed by Davenport & Patil (2012) as ‘the Sexiest Job of the 21st Century’. But what exactly is a data scientist? This question has already been addressed by many other researchers, such as Schutt & O’Neil (2013) or Donoho (2017), and it has been the topic of many columns and discussions in important media such as The Guardian or The New York Times.

To provide a definition of data science in our own terms, we start by referring to the definition of data scientist found in the Oxford Dictionary (Oxford University Press, 2008):

“A person employed to analyze and interpret complex digital data, such as the usage statistics of a website, especially in order to assist a business in its decision-making.”

We will follow the very helpful data science scheme created by Conway (2010) to explore the different attributes a data scientist should have (Figure 1). First, knowledge in Mathematics and Statistics is necessary. Mathematics gives a universal language and is essential for solving real-world problems.
From Statistics comes the understanding and experience to work with data, selecting the appropriate techniques to deal with it, to pre-process, summarize, analyze and draw conclusions. Second, computer science knowledge is also fundamental. Not only does getting computers to do what you want them to do require intensive hands-on experience, but computer scientists must also be adept at modelling and analyzing problems. They must also be able to design solutions and verify that they are correct. Problem solving requires precision, creativity, and careful reasoning. Computer science has a wide range of sub-areas. These include computer architecture, software systems, graphics, artificial intelligence, computational science, and software engineering. Drawing from a common core of computer science knowledge, each of these areas focuses on particular challenges.

The third and not least important characteristic of a data scientist is domain knowledge: a thorough understanding of the field in which the research is being developed is needed to understand the research context and, more importantly, to be able to provide realistic and responsible answers to the questions at hand.

Examining what these three areas have in common, the intersection between mathematical and statistical knowledge and domain knowledge is the most common, from which traditional research emerges, whereas Machine Learning arises from the intersection of mathematical and statistical knowledge and computer science knowledge. Machine Learning, a name coined by Samuel (1959), is a field of computer science that uses statistical techniques to give computer systems the ability to learn from data.
Nevertheless, if there is not enough statistical knowledge to choose the appropriate methods and analyses for the pertinent research objectives, mixing expertise in the field of research with computer science knowledge might lead us to a danger zone.

This overlap of skills gives people the ability to create what appears to be a legitimate analysis without any understanding of how they got there or what they have created (Conway, 2010). As Wilson (1927) stated, “(...) it is largely because of lack of knowledge of what statistics is that the person untrained in it trusts himself with a tool quite as dangerous as any he may pick out from the whole armamentarium of scientific methodology.”

Fig. 1: Data science scheme based on Conway’s Venn diagram (Conway, 2010).

We believe, however, that further soft skills are required by a data scientist. For this reason, we have added a star in the intersection of the three areas, at the core of the Data Science concept. A data scientist needs not only to be an expert in his or her area, but also a good communicator, collaborator, leader, advocate, and scholar. As a communicator, the key competencies are active listening and asking questions, explaining the advantages or shortcomings of statistical and computer methods, and interpreting results in a meaningful way in the context of the application. He or she has to be a fine collaborator, because he or she will have to work in interdisciplinary teams. In addition, being a leader is the key to successfully influencing multidisciplinary research; the data scientist will have to advocate for the use of his or her expertise; and, given that science is continuously developing, a data scientist has to be a scholar. In a recent paper by Zapf et al. (2018), these soft skills are already identified as necessary for being a successful biostatistician, and they can be generalized to any data scientist.

Therefore, a data scientist needs to master a set of skills (mathematical, statistical, computational, and communication skills) that are not easy for a single person to develop.
Given the scarcity of people with such a complete profile, there is a need to create multidisciplinary working groups formed by different specialists who add their qualities to make room for data science itself.

The article is organized as follows: in Section 2, we analyze the global impact of Data Science by updating the research of Kane (2014), in which the author analyzes the search-term usage of “Data Science” over time until 2014, adding “Cloud Computing” and “Big Data” to the search and extending it until 2018 using Google Trends. This section includes an overview of the Data Science journals. Next, in Section 3, we describe Data Science in Biomedicine, or Biomedical Data Science (BDS), present a web search restricted to the biomedical area, and include some examples of BDS studies. Finally, the main findings of this work are summarized in Section 4.
2. Data Science: Global Impact and dissemination
Cleveland (2014) proposes an action plan for statistics, in which he elevates the role of the statistician to the level of a researcher who should not limit him or herself to providing only statistical calculations and p-values, but should also be involved in their interpretation.

Data science has become very popular in recent years as a tool in many fields such as Economics (business analytics, fraud and risk detection), internet search, digital advertisements, image and speech recognition, delivery logistics, gaming, price comparison websites, airline route planning, and robotics, among others. To contextualize the impact of this new discipline all over the world, we have used Google Trends to update the research of Kane (2014). Kane analyses the search-term usage of “Data Science”, “Cloud Computing” and “Big Data” until 2014 (see Figure 1). “Cloud Computing” and “Big Data” were added because of their close relation with Data Science, their intrinsic relation with computational techniques, and to frame the evolution of the impact of Data Science. It must be taken into account that Google Trends is an online search tool that allows the user to see how often specific keywords, subjects, and phrases have been queried over a specific period of time, and that provides information about Google searches all over the world. Search trends show how the interest in a given term has evolved over time by assigning a score between 0 and 100 to search terms on a year-by-year basis.

To visualize the progress of the terms “Data Science”, “Cloud Computing”, and “Big Data”, we present the results obtained both worldwide and in some countries in Europe, the United States (and some of its states), in Asia, and in Australia over time. The results are summarized in Figures 2-5.
All the searches were performed using the R package gtrendsR (Massicotte & Eddelbuettel, 2018), which is an interface for retrieving and displaying the information returned online by Google Trends.

An up-tick in Data Science is not produced until approximately the year 2012. It is precisely in this year that the interest in the term “Big Data” starts to grow at a high rate. On the other hand, by the end of 2014 and the beginning of 2015, the trend for searches on “Big Data” begins to stagnate, and we can observe an almost exponentially increasing interest in the term “Data Science”. The term “Cloud Computing”, in turn, had its main boom around 2011, and since then its influence has been decreasing.

However, in some countries such as Spain, no real peak for the term “Data Science” is observed until the year 2015. Even though there is also an increase in searches about this concept, the growth is much less pronounced than in other European countries such as Germany, where the interest in “Data Science” is equal to that in “Big Data”, or the United Kingdom, where the trend for “Data Science” begins to unseat that of “Big Data” (Figure 3). The trend is even more pronounced in the United States, in particular in some of its states such as Massachusetts or California, where the main universities and research centers are. In these US states, the trend for “Big Data” is decreasing sharply, coinciding with a growing interest in “Data Science” (see Figure 4). In other countries such as China, India, or Japan, the pattern of interest in these terms is similar but lags behind that of other countries. It seems that the interest in “Data Science” in these countries, as well as in Spain, has not yet reached the same level as in other parts of the world (Figure 5).

Table 1: Current journals in the Data Science field up to 2018.
| Journal and website | Publisher | Scopus | Open access | Bio/health research (explicitly) |
|---|---|---|---|---|
| Journal of Data Science | - | No | No | No |
| International Journal of Data Science and Analytics, https://link.springer.com/journal/41060 | Springer | No | Hybrid | Yes |
| Data Science Journal, https://datascience.codata.org | - | Yes | Yes | Yes |
| Data Science - Methods, Infrastructure, and Applications, https://datasciencehub.net | IOS Press | No | Yes | No |
| EPJ Data Science, https://epjdatascience.springeropen.com | Springer | Yes | Yes | Yes |
| International Journal of Data Science | Inderscience | No | Hybrid | No |
| Advances in Data Science and Adaptive Analysis | World Scientific | No | Hybrid | No |
With this search, we reassert the findings presented in Kane (2014): i) the trend for the term “Data Science” is eclipsing the popularity of the infrastructure on which it is based (cloud computing, big data, computational skills, etc.), especially in the more technological countries; ii) the interest in Data Science is increasing worldwide, and it appears that this growth will continue in the coming years.
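The 0 to 100 scaling described for the Google Trends scores (see the caption of Figure 2) can be illustrated with a minimal sketch. It assumes that each term's series is rescaled independently by its own peak; the function name and the example volumes are hypothetical, and the sampling and normalisation that Google Trends actually applies are not reproduced here.

```python
def scale_to_trends_scores(relative_volumes):
    """Rescale a series of relative search volumes so that its peak maps to 100.

    Illustrative assumption: the series is scaled by its own maximum only.
    """
    peak = max(relative_volumes)
    if peak == 0:
        # No searches at all: every score is 0.
        return [0 for _ in relative_volumes]
    return [round(100 * v / peak) for v in relative_volumes]


# Two hypothetical terms whose absolute volumes differ by a factor of 20.
term_a = [250, 500, 1000]
term_b = [5000, 10000, 20000]

print(scale_to_trends_scores(term_a))  # [25, 50, 100]
print(scale_to_trends_scores(term_b))  # [25, 50, 100]
```

Both series yield the same scores although their volumes differ, which is why the scores describe the evolution of interest over time rather than absolute search counts.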
Journals of Data Science
In this new field, there are only seven scientific journals directly related to data science (up to 2018); see Table 1. Notice that we do not consider journals that are only related to Big Data Analysis or Machine Learning, for the reasons given in Section 1.

The goal of the Journal of Data Science is to enable scientists to do their research on applied science and through the effective use of data. Regarding the International Journal of Data Science and Analytics, the main topics addressed are data mining and knowledge discovery, database management, artificial intelligence (including robotics), computational biology/bioinformatics, and business information systems. The related industry sectors are electronics, telecommunications, and IT & Software. The International Journal of Data Science and Analytics brings together researchers, industry practitioners, and potential users of big data to promote collaborations, exchange ideas and practices, discuss new opportunities, and investigate analytics frameworks. The journal welcomes experimental and theoretical findings on data science and advanced analytics along with their applications to real-life situations. The scope of the Data Science Journal includes descriptions of data systems, their publication on the internet, applications and legal issues. All the sciences are covered, including the Physical Sciences, Engineering, the Geosciences, and the Biosciences, along with Agriculture and the Medical Sciences. The ultimate goal of Data Science - Methods, Infrastructure, and Applications is to unleash the power of scientific data to deepen our understanding of physical, biological, and digital systems, gain insight into human social and economic behaviour, and design new solutions for the future. Additionally, EPJ Data Science covers a broad range of research areas and applications and particularly encourages contributions from techno-socio-economic systems. Topics include, but are not limited to, human behaviour, social interaction (including animal societies), economic and financial systems, management and business networks, socio-technical infrastructure, health and environmental systems, the science of science, as well as general risk and crisis scenario forecasting up to and including policy advice. The International Journal of Data Science aims to provide a professional forum for examining the processes and results associated with obtaining data, as well as munging, scrubbing, exploring, modelling, interpreting, communicating and visualizing data. Data science takes data in cyberspace as a research object. The goal is an integrated and interconnected process designed to form a common ground from which a knowledge-based system can be built, shared, and supported by professionals from different disciplines. Finally, Advances in Data Science and Adaptive Analysis is an interdisciplinary journal dedicated to reporting original research results on data analysis methodology developments and their applications, with a special emphasis on adaptive approaches. The mission of the journal is to elevate data analysis from routine data processing by traditional tools to a new scientific level, which encourages innovative methods development for data science and its scientific research and engineering applications.

As we can see in Table 1, not all the journals listed above explicitly include health data science, and none of them is exclusively dedicated to this area. In the following section, we provide a proper description of what we consider health or biomedical data science.

Fig. 2: Google trends for the terms “Data Science” (red), “Big Data” (green), and “Cloud Computing” (blue) for global queries (panel: World; axes: Date, Interest). The scores assigned by Google Trends on the “interest” ordinate express the popularity of that term over a specified time range, based on the absolute search volume for a term, relative to the number of searches received by Google. The scores have no direct quantitative meaning. For example, two different terms that have been searched 1000 and 20000 times, respectively, could both achieve a score of 100. This is because the scores have been scaled between 0 and 100, and a score of 100 always represents the highest relative search volume. Yearly scores are calculated on the basis of the average relative daily search volume within the year.

Fig. 3: Google trends for the terms “Data Science” (red), “Big Data” (green), and “Cloud Computing” (blue) for some countries of Europe (panels: Spain, Germany, United Kingdom, Italy; axes: Date, Interest).

Fig. 4: Google trends for the terms “Data Science” (red), “Big Data” (green), and “Cloud Computing” (blue) for the United States and some of its states (panels: United States, US-Massachusetts, US-California, US-Washington; axes: Date, Interest).

Fig. 5: Google trends for the terms “Data Science” (red), “Big Data” (green), and “Cloud Computing” (blue) in some countries of Asia and in Australia (panels: China, Japan, India, Australia; axes: Date, Interest).
3. Data Science in the biomedical field
A Biomedical Data Scientist should be quantitatively trained, with a comprehensive and rigorous proficiency in statistical principles and in the computing skills needed to handle massive and complex data. He or she has to be able to manage and analyze health data to solve emerging problems in public health and biomedical sciences and to learn how to interpret their findings.

Health data refers to data that come from the biomedical sciences, public health, and any other area related to the “bio” sciences. Examples are data sets from clinical trials, observational studies, genomics and other omics studies, medical records, health care programs, or environmental programs.

Health-related data are also a good example of the legal and ethical concerns that should be taken into consideration regarding sensitive personal data (medical records, genomic profiles, etc.) or digital epidemiology in the context of public health. Thus, ensuring compliance with ethical policies, adequate informed consents, and data use agreements is essential when sharing information and collaboratively using data (Gómez-Mateu et al., 2016).
Fig. 6: Healthcare field process in which a data scientist is involved.
In The Seven Pillars of Statistical Wisdom, Stigler (2016) summarizes statistical reasoning as an integral part of modern scientific practice and sets forth the foundation of statistics around seven principles. Stigler’s second pillar, Information, challenges the importance of “big data” by noting that observations are not all equally important: the amount of information in a data set is often proportional to only the square root of the number of observations, not the absolute number.
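One common way to read the square-root statement is through the standard error of a sample mean, which shrinks as 1/sqrt(n). The following is a minimal sketch under that reading, assuming independent observations with a common standard deviation; the function names are hypothetical and the sketch is not taken from Stigler (2016).

```python
import math


def standard_error(sigma, n):
    """Standard error of the mean of n independent observations with standard deviation sigma."""
    return sigma / math.sqrt(n)


def relative_precision(n):
    """Precision (1 / standard error) of n observations relative to a single observation."""
    return standard_error(1.0, 1) / standard_error(1.0, n)


# Quadrupling the sample size only doubles the precision.
for n in (100, 400, 1600):
    print(n, relative_precision(n))
```

Under these assumptions, going from 100 to 400 observations doubles the precision rather than quadrupling it, which is the sense in which additional observations carry diminishing information.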
Similar to the search presented in Section 2, we have analyzed the number of publications associated with “Data Science”, “Big Data”, and “Cloud Computing” in several countries and over the last fourteen years, using Web of Science ( https://clarivate.com/products/web-of-science/ ). The countries considered were Australia, China, Germany, India, Italy, Japan, Spain, the United Kingdom, and the United States. Notice that “publication” refers to articles, reviews, clinical trials, case reports, and books. Moreover, only topics related to the biomedical area, such as Oncology, Respiratory System, or Pediatrics, were considered. The publication counts were obtained at the beginning of 2019 and are presented in Table 2 (Appendix A).

From the publication counts presented in Table 2 (see Appendix A), we can conclude that the number of biomedical publications has increased during the last years in the countries considered. Moreover, as might be expected, the number of publications associated with “Data Science” is much larger than the number of publications associated with the topics “Big Data” and “Cloud Computing”. In fact, the publications associated with Data Science represent more than 95% of all the publications analyzed, regardless of the country considered. Most noteworthy is the tremendous increase of record counts in China: 917 publications were registered in 2004, and this number increased to 12013 in 2017; that is, an increase of more than 10000 publications in only 13 years. Furthermore, the case of Spain is also remarkable because the presence of publications associated with “Data Science” is much lower than in other European countries like Germany, Italy, or the United Kingdom. For instance, in 2017 the number of publications in the United Kingdom and in Germany is approximately three times and twice as high as in Spain, respectively. However, the comparison is not straightforward, because the populations of the United Kingdom and Germany are considerably larger than that of Spain.

On the other hand, the presence of publications about “Cloud Computing” in the biomedical area is really low: until after 2010, very few publications were registered in any of the countries considered. Even in 2017, the number of publications was low compared with the other topics. We can, hence, state that the use of Cloud Computing techniques is not widespread among researchers in the field of Biomedicine.

Finally, the explosion of “Big Data” in recent years seems to have had an effect on biomedical research, because the number of publications on this topic has increased each year in the countries considered. For example, in Australia the number of publications about “Big Data” in 2007 was 17, as compared to 131 publications in 2017; that is, an increase greater than 670%. It is clear that Big Data techniques have been very useful for solving biomedical problems.
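The growth figures quoted above for China and Australia can be checked with a short calculation. The sketch below uses only the counts stated in the text; the helper names are hypothetical.

```python
def absolute_increase(start, end):
    """Difference between the final and the initial publication count."""
    return end - start


def percent_increase(start, end):
    """Relative growth from start to end, expressed as a percentage of start."""
    return 100.0 * (end - start) / start


# China, "Data Science": 917 publications (2004) versus 12013 (2017).
print(absolute_increase(917, 12013))        # 11096, i.e. more than 10000

# Australia, "Big Data": 17 publications (2007) versus 131 (2017).
print(round(percent_increase(17, 131), 1))  # 670.6, i.e. greater than 670%
```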
The confluence of science, technology, and medicine in our dynamic digital era has spawned new data applications to develop prescriptive analytics, to improve healthcare personalization and precision medicine, and to automate the reporting of health data for clinical decisions (Bhavnani et al., 2016). As we mentioned before, several biomedical research institutes are involved in the data science process, working on complex databases in the areas of genomic and proteomic data analysis, infectious and immunological diseases, new therapies in cancer, hormones and cancer, genetics, and cellular biology, among others. Most of the research studies need data science techniques to deal with these data sets. Such data science studies, which are usually characterized by complex structures or large numbers of variables, require a multidisciplinary environment with biomedical informaticians, bioinformaticians, biostatisticians, and clinicians. This environment brings together statistics, computer science, and computational engineering, and aims to provide a methodologically correct analysis.

Biomedical Data Science can be applied in many different areas such as personalized medicine, genomic research, gene expression analysis, or cancer drug studies, among others. In what follows, we present some examples of applications.

Personalized medicine is a medical approach in which patients are stratified in subgroups according to their individual characteristics (genomic alterations, lifestyles, diagnostic markers, clinical profile, response to treatments). With abundant and detailed patient data, medical decisions, such as diagnostic tests or treatments, may be personalized and addressed to these subgroups of patients and not to the whole population. The advantages of personalized medicine are evident: more effective use of therapies and reduction of adverse effects, and early disease diagnosis and prevention by using biomarkers, among others.
A well-known example is the treatment with trastuzumab (Herceptin, a breast cancer drug), which can only be administered if the HER2/neu receptor is overexpressed in tumor tissue because the drug interferes with this receptor. Another example of such personalized predictions is the survival probability predicted from the recorded values of a longitudinal biomarker. Joint modelling approaches to study the association between a longitudinal biomarker and survival data provide dynamic predictions of the survival probability based on the longitudinal biomarker measurements taken until time t, which can be updated when new information on the patient becomes available (Rizopoulos, 2011, 2012).

Data science helps to examine health disparities. Research examining racial and ethnic disparities in care among older adults is essential for providing better quality care and improving patient outcomes. Yet, in the current climate of limited research funding, data science provides the opportunity for gerontological nurse researchers to address these important health care issues among racially and ethnically diverse groups, which are typically under-represented and difficult to access in research (Chase & Vega, 2016).

Another example is the use of data science for clinical decision making. Clinical laboratories contribute towards the screening, diagnosis and monitoring of many types of health conditions. While it is believed that diagnostic testing may account for just 2% - 4% of all healthcare spending, it may influence 60% - 80% of medical decision-making. The work of Espasandín-Domínguez et al. (2018) is an example of BDS where a very recent extension of the distribution regression model introduced by Klein et al. (2015) is applied to a data set of blood potassium concentrations from patients across a Spanish region.
The main aim of this manuscript was to determine whether geographical differences, possibly attributed to pre-analytical factors, could be detected.

The development of automated workflows that can capture and memorialize extensive experimental protocols, aiding in reproducibility as well as taking data analysis to a new level (Ludäscher et al., 2006), is a central data science technique. Workflows help support and accelerate scientific discoveries in biomedical research by eliminating the burden of dealing with time-consuming data and software integration. This approach fundamentally frees researchers to concentrate on the scientific questions at hand instead of addressing technical issues involved in setting up, executing, and validating the computational pipeline (Amaro, 2016).

We can find applications in many other fields. For instance, while studying the consequences of analytical treatment interruption in HIV-infected patients, Alarcón-Soto et al. (2018) present a method to fit a mixed effects Cox model with interval-censored data to study the viral rebound of HIV. The proposal is based on a multiple imputation approach that uses the truncated Weibull distribution. The authors addressed the fact of having data from eight different studies based on different grounds.

Another application is to quantify spatio-temporal effects on graft failures in organ transplantation. The transplantation of solid organs is one of the most important accomplishments of modern medicine. Yet, organ shortage is a major public health issue. Using data science, researchers can investigate early graft failure time. When an organ becomes available from a deceased donor, allocation policies based on criteria such as medical urgency, expected benefit and geographical constraints (distance between donor and recipient) are applied to people on the waiting list to select a match.
Allocation policies regard the survivability of the organ outside the human body, namely, the cold ischemic time, as an important factor since it is associated with the quality degradation of the organ. Besides, the distance is an important factor in these decisions, given that the farther the distance from the donor hospital to the transplant center, the worse the quality of the organ might be (Pinheiro et al., 2016).

We can even relate data science to mental health. Mental disorders are arguably the greatest ‘hidden’ burden of ill health, with substantial long-term impacts on individuals, carers and society. People with these conditions are often socially excluded and less likely to participate in research studies or remain in follow-up. Complexities around defining diagnoses present particular challenges for mental health research. Richly annotated, longitudinal data sets matched to data science analytics offer an unprecedented opportunity for more robust diagnostics, and also for the prediction of outcome, treatment response, and patient preferences to inform interventions (McIntosh et al., 2016).

Many more examples of BDS are expected to arise in any other field related to health or bio sciences in the near future.

From the above, we could say that one of the main objectives of Data Science in Biomedicine is to generate valid knowledge through better structuring of the procedures for extracting, analyzing and processing data obtained in health and environmental research, supporting the transfer of their results to society. All these disciplines share common goals in terms of improving people's quality of life through actions in the promotion of health and in the prevention of disease.

A major challenge that exists in the healthcare domain is the “data privacy gap” between medical researchers and computer scientists. Medical researchers have natural access to healthcare data because their research is paired with a medical practice.
Acquiring data is not quite as simple for computer scientists without a proper collaboration with a medical practitioner. There are barriers in the acquisition of data. Many of these challenges can be avoided if accepted protocols, technologies, and safeguards are in place.

On the other hand, the people to whom the research efforts are addressed and those responsible for funding agencies need to ensure that research outputs are used to maximize knowledge and potential benefits. Sharing the data ensures that these are available to the research community, which accelerates the pace of discovery and enhances the efficiency of the research. Believing in these benefits, many initiatives actively encourage investigators to make their data available.

Widely available crowd-sourcing programs such as PatientsLikeMe ( ) have amassed participation from more than 400 thousand patients across 2500 disease conditions who actively share health-related data on an open and online platform that tracks and collects important patient-reported outcomes. The United Kingdom’s BioBank is a large-scale biomedical data set containing detailed phenotypic, genotypic, and multimodal imaging findings to determine the genetic and nongenetic determinants of health and disease in a contemporary cohort of more than 500000 participants. Available through open access, research collaborations have advanced our knowledge in the risk prediction of cardiovascular, psychiatric, and cerebrovascular diseases and have identified important anthropometric and genetic traits of metabolic health, including diabetes mellitus and obesity.

The objectives of these kinds of initiatives are similar to those of established data sources such as census and public health data sets, or standardized patient registries such as the National Cardiovascular Data Registry, where data are structured and aggregated.
There, the objective is to monitor population trends, develop guideline-based care, and infer changes to healthcare policy; the new citizen science and crowd-sourcing initiatives aim to leverage public and patient participation to collect health data and vital statistics through new massive, open, online data repositories (Bhavnani et al., 2016).

Since 2003, the National Institutes of Health (NIH) has required a data sharing plan for all large funding grants. Similarly, some journals also require the deposit of data and other research documentation associated with published articles (Borgman, 2012; Piwowar et al., 2007).

In May 2010, the Wellcome Trust and the Hewlett Foundation convened a workshop in Washington, DC, to explore how funders could increase the availability of data generated by their funded research, and to promote the efficient use of those data to accelerate improvements in public health (Walport & Brest, 2011). At this meeting, funders agreed to promote greater access to and use of data in ways that are equitable, ethical and efficient. Equitable refers to recognizing the researchers who generate the data and the other analysts who reuse these data, as well as the populations and communities that expect health benefits arising from research. To be ethical, data sharing should protect the privacy of individuals. Healthcare data are very sensitive because they can reveal compromising information about individuals.
Several laws in various countries explicitly forbid the release of medical information about individuals for any purpose, unless safeguards are used to preserve privacy. Finally, to be efficient, data sharing should improve the quality and value of research, and increase its contribution to improving public health.

In June 2018, the NIH released its first Strategic Plan for Data Science. In this plan, “NIH addresses storing data efficiently and securely; making data usable to as many people as possible; developing a research workforce poised to capitalize on advances in data science and information technology; and setting policies for productive, efficient, secure, and ethical data use. This plan commits to ensuring that all data-science activities and products supported by the agency adhere to the FAIR principles, meaning that data be Findable, Accessible, Interoperable, and Reusable” (Wilkinson et al., 2016).

Conclusions
Motivated by the remarkable increase in the number of publications on Data Science in the past few years, the purpose of this work has been to study the impact of Data Science in the area of biomedicine.

With this objective in mind, we have carried out a search of the term “Data Science” along with “Big Data” and “Cloud Computing” using Google Trends until November 2018. While Big Data represents the information assets characterized by such a high volume, velocity and variety as to require specific technology and analytical methods for its transformation into value (De Mauro et al., 2015), Cloud Computing enables ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction (Mell & Grance, 2009). Big Data and Cloud Computing were chosen since they are somewhat related to computing movements and they help to put the “Data Science” search traffic into perspective (Kane, 2014). According to our search results, in recent years more and more publications in the area of biomedicine make use of the term “Data Science”; however, there are large differences among the countries considered.

We have also listed the main journals devoted solely to Data Science to point out the increasing importance of Data Science. However, not all of the journals presented explicitly include Biomedical Data Science (BDS) among their main areas of research. In addition, we have gone beyond the contemporary definition of Data Science, directly related to the economics or business world, describing Data Science in the biomedical field. We understand BDS as the interdisciplinary field that encompasses the study and pursuit of the effective use of biomedical data, information, and knowledge for scientific inquiry, problem-solving, and decision-making, driven by efforts to improve human health.
It investigates and supports reasoning, modelling, simulation, experimentation, and translation across the spectrum, from molecules to individuals to populations.

We strongly believe that the importance of Biomedical Data Science will continue to increase in the near future owing to today's possibilities for recording enormous quantities of data and the technical facilities to process them. Statistical thinking and knowledge will play a key role in the correct analysis of such data.
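
The differences among countries mentioned in these conclusions can be made concrete with a small calculation. The following Python sketch is only an illustration: it copies the 2004 and 2018 “Data Science” (DS) counts from Table 2 and computes, for each country, the ratio of the 2018 count to the 2004 count. The variable names and the helper growth_factors are hypothetical and are not part of the original analysis.

```python
# "Data Science" (DS) publication counts copied from Table 2
# (first and last year of the period covered by the table).
COUNTRIES = ["USA", "UK", "Japan", "Germany", "Australia",
             "Spain", "Italy", "India", "China"]
DS_2004 = [13942, 3491, 1551, 2845, 1400, 977, 1949, 401, 917]
DS_2018 = [27397, 8361, 3014, 6299, 4678, 3206, 4927, 1942, 12013]


def growth_factors(first, last):
    """Return last/first for each pair of counts (hypothetical helper)."""
    return [end / start for start, end in zip(first, last)]


# Map each country to its 2018/2004 ratio.
factors = dict(zip(COUNTRIES, growth_factors(DS_2004, DS_2018)))

# Print the countries from the largest to the smallest ratio.
for country, factor in sorted(factors.items(),
                              key=lambda item: item[1], reverse=True):
    print(f"{country:<10}{factor:6.2f}")
```

With these figures, the counts roughly doubled in the USA and Japan, whereas the 2018 count for China is about thirteen times its 2004 count.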
Acknowledgements
This research was partially supported by the projects MTM2015-64465-C2-1-R, MTM2014-52975-C2-1-R and MTM2016-76969-P, cofinanced by the Ministry of Economy and Competitiveness (Spain), and the projects MTM2017-83513-R and MTM2017-90568-REDT, cofinanced by the Ministry of Economy and Competitiveness (Spain), all of them cofinanced by the European Regional Development Fund (FEDER). This work was also supported by grants from the Galician Government (ED341D-R2016/032 and ED431C 2016-025), by grants from the Carlos III Health Institute, Spain (PI16/01395; PI16/01404; RD16/0007/0006 and RD16/0017/0018), and by 2017 SGR 622 (GRBIO) from the Departament d'Economia i Coneixement de la Generalitat de Catalunya (Spain). The work of M. Conde-Amboage has been supported by a post-doctoral grant from the Ministry of Culture, Education and University Planning and the Ministry of Economy, Employment and Industry of the Galician Government. The authors want to thank the network BIOSTATNET for many fruitful discussions. Additionally, Yovaninna Alarcón-Soto wants to thank CONICYT for her scholarship.

Publications on Data Science, Cloud Computing and Big Data in the last fourteen years

Table 2: Number of publications associated with the topics “Data Science” (denoted by DS), “Big Data” (denoted by BD) and “Cloud Computing” (denoted by CC) in different countries from 2004 to 2018.
Year  Topic    USA    UK  Japan  Germany  Australia  Spain  Italy  India  China
2004  DS     13942  3491   1551     2845       1400    977   1949    401    917
      BD       129    28     18       43         10     18     10      2     21
      CC        28     4      0        4          0      1      4      0      1
2005  DS     16124  4272   1762     3558       1556   1254   2364    526   1232
      BD       162    37     19       45         16     21     10      3     29
      CC        31     5      2        3          0      0      3      1      2
2006  DS     16296  4356   1671     3602       1719   1313   2373    553   1405
      BD       116    52     13       35         16     26     21      4     48
      CC        24     2      3        9          0      1      2      2      2
2007  DS     16208  4421   1702     3441       1717   1305   2567    592   1656
      BD       162    43     25       46         17     22     20      9     60
      CC        27     3      3        2          0      3      4      2     13
2008  DS     17799  4798   1792     3769       1923   1542   2590    715   2085
      BD       190    48     20       57         28     29     21      5     62
      CC        35     5      2        9          1      4      1      3     12
2009  DS     19181  5329   2051     4325       2264   1879   3091    851   2694
      BD       166    59     18       50         30     29     28     11     75
      CC        37     9      3       15          5      4      3      1     11
2010  DS     21162  5920   2159     4763       2690   2133   3309   1051   3437
      BD       206    66     25       74         28     18     31     14    103
      CC        70     9      9       12          6      8      4      7     17
2011  DS     23547  6592   2404     5205       2985   2373   3668   1170   4458
      BD       242    58     23       65         43     33     38     11    131
      CC       114     7     11       27          6      4      9      7     57
2012  DS     25815  7726   2801     5778       3461   2744   4037   1394   6008
      BD       242    66     12       79         40     37     44     16    112
      CC       133    16     13       22         13     15      9     14     49
2013  DS     27258  8020   2822     6027       3940   2989   4527   1593   7159
      BD       353    97     36      100         63     33     40     22    171
      CC       146    37     17       44         18     25     13      7     82
2014  DS     27894  7891   2792     6195       4101   3156   4686   1747   8796
      BD       508   127     44      115         80     55     60     24    214
      CC       171    36     12       27         26     23     27     17    118
2015  DS     29141  8750   2943     6356       4525   3208   4883   1870  10485
      BD       691   186     48      148        103     72     67     44    302
      CC       171    41     16       44         25     38     33     31    112
2016  DS     28825  8703   2905     6598       4788   3203   4886   2038  10789
      BD       805   230     48      191        123     79    102     71    319
      CC       212    53     21       48         33     38     34     66    140
2017  DS     29051  8622   2858     6480       4617   3312   4919   1962  11727
      BD       879   266     63      201        131     95     91     69    482
      CC       221    47     16       44         45     55     50     48    209
2018  DS     27397  8361   3014     6299       4678   3206   4927   1942  12013
      BD       917   260     58      202        143    112    129     89    540
      CC       230    46     19       41         30     42     33     83    201

References
Alarcón-Soto, Y., Langohr, K., Fehér, C., García, F., & Gómez, G. (2018). Multiple imputation approach for interval-censored time to HIV RNA viral rebound within a mixed effects Cox model. Biometrical Journal.
Amaro, R. E. (2016). Drug discovery gets a boost from data science. Structure, 1225–1226.
Bhavnani, S. P., Muñoz, D., & Bagai, A. (2016). Data science in healthcare: implications for early career investigators. Circulation: Cardiovascular Quality and Outcomes, 683–687.
Borgman, C. L. (2012). The conundrum of sharing research data. Journal of the American Society for Information Science and Technology, 1059–1078.
Chase, J. A. D., & Vega, A. (2016). Examining health disparities using data science. Research in Gerontological Nursing, 106–107.
Cleveland, W. S. (2014). Data science: An action plan for expanding the technical areas of the field of statistics. Statistical Analysis and Data Mining: The ASA Data Science Journal, 414–417.
Conway, D. (2010). The Data Science Venn Diagram. Drew Conway.
Crawford, K. (2013). The hidden biases in big data. Harvard Business Review.
Davenport, T. H., & Patil, D. (2012). Data scientist. Harvard Business Review, 70–76.
De Mauro, A., Greco, M., & Grimaldi, M. (2015). What is big data? A consensual definition and a review of key research topics. In AIP Conference Proceedings (Vol. 1644, pp. 97–104).
Donoho, D. (2017). 50 Years of Data Science. Journal of Computational and Graphical Statistics, 745–766.
Espasandín-Domínguez, J., Benítez-Estévez, A. J., Cadarso-Suárez, C., Kneib, T., Barreiro-Martínez, T., Casas-Méndez, B., & Gude, F. (2018). Geographical differences in blood potassium detected using a structured additive distributional regression model. Spatial Statistics, 1–13.
Gómez-Mateu, M., Lorenzo-Arribas, A., Bofill, M., Vilor-Tejedor, N., Barrio, I., Espasandín-Domínguez, J., . . . Pérez-Álvarez, N. (2016). Big data in biomedical research. Perspectives from the Biostatnet-CRM Workshop. BEIO, 257–277.
Greenhouse, J. (2013). Statistical thinking: the bedrock of data science. The Huffington Post.
Kanaan, S. H. (2014). Doing data science. USA: CreateSpace Independent Publishing Platform.
Kane, M. J. (2014). Cleveland's action plan and the development of data science over the last 12 years. Statistical Analysis and Data Mining: The ASA Data Science Journal, 423–424.
Klein, N., Kneib, T., Lang, S., & Sohn, A. (2015). Bayesian structured additive distributional regression with an application to regional income inequality in Germany. The Annals of Applied Statistics, 1024–1052.
Ludäscher, B., Altintas, I., Berkley, C., Higgins, D., Jaeger, E., Jones, M., . . . Zhao, Y. (2006). Scientific workflow management and the Kepler system. Concurrency and Computation: Practice and Experience, 1039–1065.
Massicotte, P., & Eddelbuettel, D. (2018). gtrendsR: Perform and display Google Trends queries [Computer software manual]. Retrieved from https://CRAN.R-project.org/package=gtrendsR (R package version 1.4.2)
McIntosh, A. M., Stewart, R., John, A., Smith, D. J., Davis, K., Sudlow, C., . . . the MQ Data Science Group (2016). Data science for mental health: a UK perspective on a global challenge. The Lancet Psychiatry, 993–998.
Mell, P., & Grance, T. (2009). The NIST definition of cloud computing. National Institute of Standards and Technology, 50.
Oxford University Press (Ed.). (2008). Oxford English Dictionary (Vol. 30).
Pinheiro, D., Hamad, F., Cadeiras, M., Menezes, R., & Nezamoddini-Kachouie, N. (2016). A data science approach for quantifying spatio-temporal effects to graft failures in organ transplantation. In Engineering in Medicine and Biology Society (EMBC), 2016 IEEE 38th Annual International Conference of the IEEE (pp. 3433–3436).
Piwowar, H. A., Day, R. S., & Fridsma, D. B. (2007). Sharing detailed research data is associated with increased citation rate. PLoS ONE, e308.
Reddy, C. K., & Aggarwal, C. C. (2015). Healthcare Data Analytics. Chapman and Hall/CRC.
Rizopoulos, D. (2011). Dynamic predictions and prospective accuracy in joint models for longitudinal and time-to-event data. Biometrics, 819–829.
Rizopoulos, D. (2012). Joint models for longitudinal and time-to-event data: With applications in R. Chapman and Hall/CRC.
Samuel, A. L. (1959). Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 210–229.
Schutt, R., & O'Neil, C. (2013). Doing data science: Straight talk from the frontline. O'Reilly Media, Inc.
Stigler, S. M. (2016). The seven pillars of statistical wisdom. Harvard University Press.
Tukey, J. W. (1962). The future of data analysis.