An Overview on the Web of Clinical Data
Marco Gori
SAILAB, University of Siena, http://sailab.diism.unisi.it
MAASAI, Université Côte d'Azur, https://team.inria.fr/maasai/team-members
Abstract.
In the last few years there has been an impressive growth of connections between medicine and artificial intelligence (AI), characterized by a specific focus on single problems along with the corresponding clinical data. This paper proposes a new perspective in which the focus is on the progressive accumulation of a universal repository of hyperlinked clinical data, in the spirit that gave rise to the birth of the Web. The underlying idea is that this repository, referred to as the Web of Clinical Data (WCD), will dramatically change the AI approach to medicine and its effectiveness. It is claimed that research and AI-based applications will undergo an evolution process that will likely reinforce systematically the solutions implemented in medical apps made available in the WCD. The distinctive architectural feature of the WCD is that this universal repository will be under the control of clinical units and hospitals, which is claimed to be the natural context for dealing with the critical issues of clinical data.
Introduction
Eric J. Topol, in his popular book [7], claims that medicine will inevitably be Schumpetered in the coming years. Whether or not this excitement is justified, we are still missing a crucial catalyst for strongly accelerating the Schumpetering. At the moment, most studies are characterized by the relentless search for unified standards to share data. What if we get rid of any standard? It is worth mentioning that, while on the one side the diffusion of clinical data is carefully controlled, there are initiatives that come from patients motivated by the willingness to share their clinical data and experiences. The explosion of social networks has in fact strongly favored the birth of communities where people like to exchange ideas, make helpful connections, and feel supported when they need it most. On the other hand, there are also huge efforts in the building of health systems that can strongly benefit from the recent development of AI, where the emphasis has shifted to machine learning and the processing of unstructured data. Amongst a number of attempts, the Canadian effort is definitely worth mentioning.
This paper promotes the idea of building a universal repository of clinical data to collect the health records of people, which closely recalls the spirit of the Web. This universal collection would not only represent a true paradigm shift in the access to clinical data by students and experts in medicine, but it would also open the doors to the new grand challenge of building decision support systems that operate on a universal repository on the basis of the content of the health records, as well as with recommendations based on the discovery of similarities. The evolution of such a project seems to be intimately connected with the very critical nature of clinical data. When considering the risks, the most natural answer is to refrain from such a crazy adventure, but we claim that the time has come for an in-depth analysis of risks and opportunities. Novel solutions are emerging that could dramatically improve long-term developments in medicine, and that can also go beyond the borders of single countries. Concerning accessibility, we propose that the major difference with respect to the Web is that only the Clinical Units (CU) can access the repository, while patients are expected to benefit from new services. Depending on specific initiatives from the CUs, the act of patient data donation could be rewarded in different ways (e.g. information services related to the patient's clinical condition). The original spirit of the Web leads us to conceive the traditional world-wide search engine service, which can itself be immediately very important for clinical activities. However, the most important challenges are connected with the current developments of AI, which is expected to facilitate the development of intelligent apps. One can also expect that those apps will conquer a degree of autonomy that makes them nearly indistinguishable from a medical assistant operating in a team under the control of human experts. An overview of the WCD is given by focussing mostly on the technological side of the project. Preliminary discussions on legal issues and the business model, along with the social implications, are also given.
In the last few years the impressive growth of computer-based medical services has been mostly based on an appropriate organization of clinical data. The structured organization of huge amounts of such data has also been playing a fundamental role in many AI-based challenges to medicine. Based on an early intuition [5], this paper gives a view on a truly different way of collecting and organizing clinical data, which is claimed to open the doors to an explosive evolution of the field. We assume that medical information is stored in any multimedia document connected by hyperlinks, just like in the Web. It is up to AI agents to interpret the information hidden in the repository, which is referred to as the Web of Clinical Data (WCD). Links in the WCD are expected to establish different levels of relations. The nodes of the graph are medical documents that can properly be linked and included in the WCD. Some links are used for connecting medical documents of a single patient corresponding to a certain specific clinical event, others are used for establishing relations between documents of different patients. No matter how the repository is created, we assume that documents do not contain any identifiable reference to the identity of patients, who are only characterized by the Personal Code (PC). The repository can be supplied with anonymized data coming from clinical centers as well as from patients, who upload their data after having signed appropriate authorizations. As a result, the repository exhibits the very simple structure depicted in Fig. 1, where any type of multimedia medical document of a given patient is linked to his/her Personal Code. These documents are collected as a list for each individual, thus exhibiting a forest structure. Ideally, one would like medical information supplied according to the structured view of Fig. 1, where the anamnesis of any patient is represented by a list of clinical events, each of them properly organized into clinical data, diagnosis, and therapeutic treatments. The distinctive underlying assumption of the WCD is that there is no need to provide such a structure, since it is up to the WCD Intelligent System, which continuously processes the clinical data, to reconstruct the structure of Fig. 1.

Fig. 1. Patients can upload their own data in a truly unstructured form. Clinical data, diagnosis, and therapeutic documents are stored as a forest with separate data for each patient, who is characterized by an anonymous Personal Identification code. The separation and appropriate structured organization of this material can be created by suitable intelligent agents that contribute to the creation of the WCD. (Figure labels: personal code; unstructured patient data; structured patient data, organized into clinical data, diagnosis, and therapeutic info.)

The identification of different clinical events might arise from the detection of the date on the documents. For example, it could be the case that the therapy adopted by a patient is related to a set of clinical data and to a specific diagnosis, but there is no hyperlink amongst these pieces of information. Just as a human would, it is up to an intelligent agent to realize that something is missing and reconstruct the links. Clearly, the presence of the date is a fundamental cue for linking documents of the same clinical event. However, documents with no identifiable date can also be supplied to the WCD.
Basically, the underlying principle is that they are also precious for medical inference, though one must consider their reduced significance due to the lack of a temporal collocation in the anamnesis of a patient.

The role of the WCD Intelligent System goes well beyond that of reconstructing the structure of the data of single patients. As we can see in Fig. 2, one can connect patients of the WCD so as to enrich the forest structure into a graph-based structure where links arise because of regularities discovered in the repository. For example, an intelligent agent can discover similarities between the clinical data of two patients. These can arise from data in different formats, ranging from text to different kinds of signals. Similarities can also arise in the diagnosis and the therapeutic treatments, so that the original forest evolves towards a true WCD with precious inferential information.

Fig. 2. Automatic construction of the WCD: the forest of patients grows by appropriate links amongst similar data. Similarities arise from both text and image data and are discovered by intelligent agents. (Figure labels: text, table, and image documents under clinical data, diagnosis, and pharma therapy for each patient; inferred links between patients.)
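As a minimal sketch of the structure just described, the forest of Fig. 1 and the inferred links of Fig. 2 could be represented as follows. The class and field names (Document, WCDGraph) and the personal codes are illustrative assumptions and are not part of the WCD proposal; in particular, the link is added here by hand, whereas in the WCD it would be produced by an intelligent agent.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Document:
    """A multimedia medical document; `kind` stays None until it is inferred."""
    doc_id: str
    media: str                  # e.g. "text", "table", "image"
    kind: Optional[str] = None  # "clinical", "diagnosis", "therapy" or None


@dataclass
class WCDGraph:
    """Forest of per-patient document lists plus inferred cross-patient links."""
    forest: dict[str, list[Document]] = field(default_factory=dict)
    links: set[frozenset[str]] = field(default_factory=set)

    def upload(self, personal_code: str, doc: Document) -> None:
        # Documents are attached to the anonymous Personal Code only.
        self.forest.setdefault(personal_code, []).append(doc)

    def owner(self, doc_id: str) -> str:
        # Linear scan: enough for a sketch, not for a real repository.
        for code, docs in self.forest.items():
            if any(d.doc_id == doc_id for d in docs):
                return code
        raise KeyError(doc_id)

    def link(self, doc_a: str, doc_b: str) -> None:
        # Inferred links join documents of *different* patients (Fig. 2).
        if self.owner(doc_a) == self.owner(doc_b):
            raise ValueError("inferred links must join different patients")
        self.links.add(frozenset((doc_a, doc_b)))


wcd = WCDGraph()
wcd.upload("PC-0001", Document("ecg-1", "image"))
wcd.upload("PC-0001", Document("report-1", "text", kind="diagnosis"))
wcd.upload("PC-0002", Document("ecg-7", "image"))
wcd.link("ecg-1", "ecg-7")  # forest becomes a graph
```

Keeping the per-patient lists and the cross-patient links in separate fields mirrors the distinction between what is uploaded and what is inferred.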
The actual evolution of the WCD requires clearing up a number of crucial legal issues connected with the distribution of clinical data. Their actual distribution is supposed to be under the control of the Governing Board of the WCD, which will be composed of doctors, scientists in related disciplines, and lawyers. First of all, we need to clarify who is supposed to gain the permission for accessing the data, along with the connected purpose. The WCD is conceived for boosting research and world-wide medical treatment in clinical centers. As a consequence, the permission is granted to scientists and doctors upon approval from the Governing Board of the WCD. As will be shown in the next section, this restriction on the accessibility poses fundamental constraints on the computer architecture, since the underlying assumption is that the repository can only be made available to Clinical Units (CU), which could be hospitals or research centers, whose access comes with the duty not to distribute data elsewhere. Basically, the intention is that of using the same legal framework which regulates the interaction of medical staff with sensitive clinical data. Following related technical developments in the field of bank accounting, we only make the access available independently of the geographical position, whereas we do inherit and keep all the restrictions concerning the already established rules for the access. Clearly, this imposes restrictions neither on the place where data are stored nor on who provides the storage service, which does not necessarily correspond to granting accessibility.

The issue of anonymization has been the subject of in-depth investigation in the last few years, but only a few studies make significant efforts towards a rigorous treatment (see e.g. [3] for a very good example). Many circulating claims on the issue of anonymization are quite generic; namely, the inferential context is not well-defined, which facilitates restrictive claims on the supposed risks of making clinical data widely available. Formally, the problem of disclosing the identity from medical data seems to be generic and ill-posed. Clearly, textual documents on the anamnesis of a patient may contribute to discovering the identity, whereas inference from biomedical signals cannot rely on cues that can somehow reveal the identity. One can at most think of some geographical connections that are expected to give rise to a certain disease.
Notice that, given a collection of medical data under the conditions of the WCD, identity disclosure is only possible provided that the databases of clinical centers are violated, which is in fact an event that can happen regardless of the WCD initiative.

Basically, the WCD initiative faces the ill-posedness of the issue of sensitive data by adopting a strategy where anonymization and authorization are granted by design. The act of autonomously uploading data does exhibit the patient's strong intention to make data available. It is worth mentioning that the usual interaction with clinical centers goes the other way around, namely data are downloaded by the patients. Hence, we can think of keeping the same communication channel, where this time patients upload their own data, which can come from different sources. This information flow with clinical centers very much resembles the one that patients establish with their own banks. The level of security and the type of information exchange share in fact the same needs. As patients decide to upload their data to their own preferred clinical center (more than one is possible), we start creating a repository which is promoted to the WCD only after a further check, whose details are defined by the WCD Governing Board. The bottom line is that the WCD does subscribe to the current scenario for the access to clinical data, the only difference being that of making the data widely available to authorized clinical and research units world-wide. In this framework, we expect that most claims on sensitive information might vanish thanks to the governing philosophy of the WCD.
The access requirements pointed out in the previous sections for the WCD open an interesting computer architecture problem that is mostly connected with the storage of a growing repository that must be distributed worldwide. At the same time, each CU is also expected to access the whole WCD repository. In order to satisfy these requirements, the architectural solution proposed for the WCD is the one depicted in Fig. 3.
WCD uniform resource locator
Fig. 3. Overall architecture of the WCD, where Clinical Units act as macro-nodes. Each such unit is used for storing patients' data in a truly unstructured form (Patient Local WCD, PLWCD), while additional meta-data are created in the Semantic Local WCD (SLWCD). The WCD is also stored in an encrypted form on a cloud system. (Figure labels: Encrypted WCD; four Clinical Units, each with a PLWCD and an SLWCD.)
The repository can be given a formal description at local level thanks to the Uniform Resource Locator of any document, which can be constructed by following the same principles used in the Web. For example, the reference to the ECG diagnosis in a certain CU could be wcd.ao-siena.1492.ecg-careggi-june-2020.dia, whose structure is based on separate fields which provide information on the patient, the specific document, and where it was archived. This is sketched below

\[
\texttt{wcd} \,.\, \underbrace{\texttt{ao-siena}}_{\text{Clinical Unit}} \,.\, \underbrace{\texttt{1492}}_{\text{patient no.}} \,.\, \underbrace{\texttt{ecg-careggi-june-2020}}_{\text{filename}} \,.\, \underbrace{\texttt{dia}}_{\text{diagnosis}}
\]

where we easily induce the general structure of the URL. Basically, the ECG document named ecg-careggi-june-2020.dia is a diagnostic document (extension dia) corresponding to patient 1492 of Clinical Unit ao-siena of the wcd. This document is supposed to be directly accessible within the CU where it is stored for any type of processing. However, since this document is also supposed to be accessed outside the CU where it has been generated, one needs to make it available world-wide.

There are a number of studies on architectural solutions for related problems that inspire the solution to this general framework (see e.g. [4,1,9,6]). Interestingly, related studies [8] have also been carried out in the context of medical applications. The solution proposed in Fig. 3 naturally follows classic search engine technology, where there is a global repository. This analogy comes with the fundamental difference that data are encrypted and, consequently, not accessible within the global repository.

Search engine primitive
The basic idea is that of caching the documents on a Cloud System (CS) in an encrypted format. Basically, any document created in a CU is encrypted and backed up to the CS. This allows us to carry out a classic information retrieval search [4], where all documents produced in the CU are uploaded to the CS along with their encrypted keywords. This allows the CS to construct the inverted indexes to be used for searching. Classic encryption solutions can be used for handling the security of the interaction between the CUs and the CS. Classic search primitives based on propositional calculus are supposed to be used, whereas the major assumption is not to assign higher-level services to the CS. The underlying assumption is in fact that of moving to the CUs any solution based on AI agents. It is worth mentioning that this horizontal keyword-based search service is in fact the first one offered on top of the WCD. Interestingly, the search primitives are also of crucial importance as a building layer for any service app.
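The locator and the keyword search primitive can be illustrated with a small sketch. It assumes that the file name itself contains no dots, and it replaces the searchable-encryption schemes of [1,4,9] with a much simpler stand-in: each keyword is turned into a keyed HMAC token inside the CU, so that the CS only ever indexes opaque tokens. The names parse_wcd_url, keyword_token, CloudIndex, and the shared key are hypothetical.

```python
from __future__ import annotations

import hashlib
import hmac
from typing import NamedTuple


class WCDLocator(NamedTuple):
    clinical_unit: str
    patient_no: str
    filename: str
    extension: str


def parse_wcd_url(url: str) -> WCDLocator:
    """Split wcd.<clinical unit>.<patient no>.<filename>.<extension>."""
    parts = url.split(".")
    if len(parts) != 5 or parts[0] != "wcd":
        raise ValueError(f"not a WCD locator: {url!r}")
    return WCDLocator(*parts[1:])


def keyword_token(key: bytes, keyword: str) -> str:
    """Keyed, deterministic token computed in the CU (assumed scheme)."""
    digest = hmac.new(key, keyword.lower().encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()


class CloudIndex:
    """Inverted index held by the CS: token -> set of document locators."""

    def __init__(self) -> None:
        self.postings: dict[str, set[str]] = {}

    def add(self, url: str, tokens: list[str]) -> None:
        for tok in tokens:
            self.postings.setdefault(tok, set()).add(url)

    def search_all(self, tokens: list[str]) -> set[str]:
        """Conjunctive query: documents matching every token."""
        sets = [self.postings.get(t, set()) for t in tokens]
        return set.intersection(*sets) if sets else set()

    def search_any(self, tokens: list[str]) -> set[str]:
        """Disjunctive query: documents matching at least one token."""
        return set().union(*(self.postings.get(t, set()) for t in tokens))


key = b"hypothetical-key-shared-by-authorized-CUs"
url = "wcd.ao-siena.1492.ecg-careggi-june-2020.dia"
index = CloudIndex()
index.add(url, [keyword_token(key, k) for k in ("ECG", "diagnosis")])
hits = index.search_all([keyword_token(key, "ecg"),
                         keyword_token(key, "diagnosis")])
```

A deterministic token of this kind reveals which documents share a keyword, so the sketch is only meant to show how Boolean primitives can run over an inverted index without plaintext keywords.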
Service Apps
Any service in the WCD is supposed to be given by apps running on CU computer servers. Any computation requires selecting the documents that are pertinent to the service, so that they are temporarily downloaded to the CU server. Basically, these apps are expected to operate on a proper selection of information from the WCD that is expected to be useful for their objective. The structure of any service is sketched as

1. create the primary app cache
2. q ← QueryFromService(input)
3. RetrievedDocCollection ← WCD-retrieve(q)
4. run BodyApp(RetrievedDocCollection, input)
5. free(RetrievedDocCollection)
First, the app may need to create its own cache (primary cache) from the WCD to optimize the performance. This step may also be omitted, so that all the processing is based only on the input being processed. The second step consists of formulating the search query on the basis of the input to the app, which is done by QueryFromService. Then, as the query is fed to WCD-retrieve, pertinent documents are retrieved and stored in the secondary cache RetrievedDocCollection. This locally retrieved collection, along with the primary cache, is used for carrying out the task assigned to the app. Basically, the actual processing is carried out by BodyApp on the second argument input by exploiting the local cache.
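The five steps above can be rendered in Python as a sketch. The callables passed to run_service_app are placeholders for QueryFromService, WCD-retrieve, and BodyApp; their signatures, and the choice of passing the primary cache to the body as a third argument, are assumptions made for illustration only.

```python
from typing import Any, Callable, Iterable, Optional


def run_service_app(
    app_input: Any,
    query_from_service: Callable[[Any], Any],
    wcd_retrieve: Callable[[Any], Iterable[Any]],
    body_app: Callable[[list, Any, Any], Any],
    build_primary_cache: Optional[Callable[[], Any]] = None,
) -> Any:
    # Step 1 (optional): primary cache built from the WCD.
    primary_cache = build_primary_cache() if build_primary_cache else None
    # Step 2: compose the query from the input of the app.
    q = query_from_service(app_input)
    # Step 3: secondary cache, a local copy of the pertinent documents.
    retrieved = list(wcd_retrieve(q))
    try:
        # Step 4: the actual processing. The body must not return the
        # `retrieved` list itself, since it is emptied right afterwards.
        return body_app(retrieved, app_input, primary_cache)
    finally:
        # Step 5: release the temporarily downloaded documents.
        retrieved.clear()


# Toy stand-ins for the WCD and for an app (hypothetical data).
store = {"ecg": ["doc-b", "doc-a"], "mri": ["doc-c"]}
result = run_service_app(
    "ECG",
    query_from_service=lambda x: x.lower(),
    wcd_retrieve=lambda q: store.get(q, []),
    body_app=lambda docs, x, cache: sorted(docs),
)
```

Releasing the secondary cache in a `finally` clause reflects the requirement that documents are only temporarily downloaded to the CU server, even when the body fails.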
Example 1.
Suppose we want to discover patients whose clinical data are similar to those of a given patient. The reference for inspecting the similarity is defined by document input=ecg-careggi-june-2020.dia, which is in fact a document under diagnosis. This document comes with a number of attached metadata, like for instance the specific type of diagnosis (ECG signal). The document, along with its attached metadata, represents the input of the service, which is expected to compose the query q automatically by using QueryFromService. Once the query is obtained, it is used to collect RetrievedDocCollection and, finally, the specific part of the app, BodyApp, is used to discover which documents are similar.
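To make Example 1 concrete, the following sketch shows one possible body for such an app, under the assumption that every retrieved document has already been reduced to a numeric feature vector (how an ECG signal is turned into such a vector is outside the scope of this sketch). Cosine similarity is used only as a familiar stand-in for whatever similarity measure the app would actually adopt; rank_similar and the toy vectors are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_similar(
    retrieved: Dict[str, Sequence[float]],
    query_vec: Sequence[float],
    top_k: int = 2,
) -> List[str]:
    """Return the ids of the top_k retrieved documents closest to the input."""
    scored = [(cosine(vec, query_vec), doc_id)
              for doc_id, vec in retrieved.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:top_k]]


# Toy feature vectors standing in for the retrieved ECG documents.
query_vec = [1.0, 0.0, 1.0]
retrieved = {
    "doc-a": [1.0, 0.0, 0.9],
    "doc-b": [0.0, 1.0, 0.0],
    "doc-c": [0.5, 0.5, 0.5],
}
similar = rank_similar(retrieved, query_vec)
```

Here doc-a is almost parallel to the input and doc-b is orthogonal to it, so the ranking places doc-a first and leaves doc-b out of the top two.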
Patients and staff from the CUs are generally expected to upload clinical data in the general unstructured form of a multimedia document. Clearly, it could be the case that single patients directly provide information in a truly structured way according to Fig. 1. Whenever this does not happen, the purpose of WCD enrichment apps is that of creating the missing structured representation.

The first important enrichment task is that of segmenting patients' clinical events. The most informative cue comes from the possible presence of the date in the document. When it is available, the task of its detection is a well-posed problem for both textual and image-based documents. In both cases the extraction of the date could be quite a complex problem. However, in both cases we are facing classic problems of pattern recognition that have been the subject of massive investigation (see e.g. [2] for early studies). It is worth mentioning that whenever the date is either missing or badly recognized, the segmentation of patients' events can also rely on different cues. For example, there are cases in which a certain drug is prescribed upon a corresponding diagnosis, so that two documents can be placed in the same medical event. Clearly, this task can become very sophisticated and very well represents the type of challenges that are open with the WCD.

Other important enrichment tasks involve the separation between clinical, diagnostic, and therapeutic data. This is essentially a document classification task. While there is a huge literature for attacking this problem, the specific context of the WCD provides a number of additional cues to solve it successfully. Hence enrichment apps contribute to creating the structured view of the forest indicated in Fig. 1. Moreover, the subsequent discovery of links between different patients of the same CU depicted in Fig. 2 is another task of enrichment apps. This problem is basically one of discovering the similarities between multimedia documents, which has also been the subject of massive investigation. In addition to links between documents of the same type, we can find links between text and image documents, a task which clearly needs the additional step of extracting textual descriptions from images. Overall, enrichment apps create the WCD locally to each CU. However, their task goes beyond the discovery of local links. In order to generate global links upon discovery of relationships between documents of different CUs, enrichment apps rely on the software architecture described in Section 3. In particular, the discussion of Example 1 concerning the discovery of global similarities for a given document provides an insight on how to attack the problem. Clearly, once the similarities have been found, we need to update the index of the WCD by enriching it with the discovered links. While this is carried out in the CUs, the inverted indexes in the CS must be updated accordingly. Notice that the process of constructing the WCD by this similarity inspection needs to be carried out frequently because of the continuous update of documents on the CUs.

Fig. 4. Service and Enrichment apps in the WCD. For a service app to operate on the WCD, the connection with the cached info in the EWCD is required. (Figure labels: Clinical Unit with PLWCD and SLWCD; Enrichment Apps; Service Apps; WCD Service Level; EWCD on Cloud.)
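The date cue for segmenting clinical events can be illustrated as follows. The sketch assumes textual documents whose dates are already in ISO form (YYYY-MM-DD) and places documents carrying the same date in the same event; real documents would require the pattern recognition techniques mentioned above, and undated documents are simply set aside for the other cues. Function names and the toy documents are illustrative.

```python
from __future__ import annotations

import re
from collections import defaultdict
from datetime import date
from typing import Optional

# Assumption: only ISO dates are recognised in this sketch.
DATE_RE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")


def detect_date(text: str) -> Optional[date]:
    """Return the first valid ISO date found in the text, if any."""
    m = DATE_RE.search(text)
    if m is None:
        return None
    try:
        return date(int(m[1]), int(m[2]), int(m[3]))
    except ValueError:  # e.g. month 13: treat as badly recognised
        return None


def segment_events(docs: dict[str, str]) -> tuple[dict[date, list[str]], list[str]]:
    """Group document ids by detected date; collect undated ones apart."""
    events: dict[date, list[str]] = defaultdict(list)
    undated: list[str] = []
    for doc_id, text in docs.items():
        found = detect_date(text)
        if found is None:
            undated.append(doc_id)
        else:
            events[found].append(doc_id)
    return dict(events), undated


docs = {
    "ecg-report": "ECG recorded on 2020-06-12, irregular rhythm.",
    "diagnosis": "Diagnosis issued 2020-06-12 after the ECG.",
    "follow-up": "Follow-up visit, 2020-09-03.",
    "note": "Patient reports fatigue.",
}
events, undated = segment_events(docs)
```

Grouping by identical dates is a deliberate simplification: a clinical event may span several days, which is where cues such as a drug prescribed upon a diagnosis would come into play.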
The creation of the WCD, along with the high-level meta-data associated with the documents, opens the doors to the development of medical service apps ranging from the field of diagnosis to that of therapeutic treatment. The underlying philosophy in the development of service apps is that they do not simply operate on a specific CU, but on the WCD. Depending on the dimension of the CU, the apps might have different configurations, but they are conceived for working at global level.
Fig. 5. Onion-skin view of service apps. The lowest level only relies on the processing of data from the SLWCD. In the middle layers, apps operate by relying on the computation of other apps. Finally, in the last layer, medical assistants operate autonomously on the basis of all the info available in the lower levels. (Figure labels: PLWCD, SLWCD, Service App, Medical Assistant.)
Because of the need to make them available on the WCD, service apps are expected to run on popular operating systems. As we can see in Fig. 5, service apps can operate at different levels of abstraction. As the process conquers a certain degree of autonomy, service apps become a sort of medical assistant which can support decisions under the control of human experts. The long-term view of the WCD is that the progressive construction of service apps is expected to contribute to the medical treatment of specific patients, but also to the permanent creation of metadata that are expected to provide an important support for future medical decisions.
The described framework of the WCD suggests the development of a stable marriage between medicine and artificial intelligence. While medicine benefits from the systematic development of service apps thanks to world-wide competition, a much better benchmarking context is created for research in artificial intelligence. Basically, it looks like the WCD can act as a fundamental catalyst for both disciplines. The current focus on specific benchmarks is expected to be translated into a sort of permanent AI challenge in medicine that will constantly refer to the WCD.

The philosophy behind the WCD is that of increasing the level of medical treatment by strongly promoting the CUs, which turn out to be the nodes of the WCD. As such, they are expected to control the world-wide repository of clinical data and will likely realize that there is a crucial benefit in assuming this role. The WCD will make the best medical practices available worldwide, a service that has an enormous value and that might also be paid for. Pharma companies might be interested in advertising on the WCD, which could be the source of huge resources, especially for the best CUs. Simple metrics could in fact measure the number of accesses to documents uploaded to different CUs, so as to distribute money from advertisement. This will definitely bless the principle that in the WCD framework doctors carry out two different specific roles: first, their primary role is that of providing the appropriate medical assistance to single patients; second, their successful treatments will be useful for future worldwide medical decisions. Finally, the conception of the WCD comes with the wish that its development will be primarily useful for the medical support to poor countries.
Acknowledgments
We thank Alessandro Rossi, Stefano Melacci, Sandro Bartolini (University of Siena), Arturo Chiti (Humanitas, Milan), and Paolo Traverso (FBK, Trento) for insightful discussions.
References
1. Pedro Geraldo M. R. Alves and Diego F. Aranha. A framework for searching encrypted databases. J. Internet Serv. Appl., 9(1):1:1–1:18, 2018.
2. Heidi Bjering Stratti. Optimal operators in digital image processing. PhD thesis, 2008.
3. Khaled El Emam and Luk Arbuckle. Anonymizing Health Data: Case Studies and Methods to Get You Started. O'Reilly Media, Inc., 1st edition, 2013.
4. Z. Fu, K. Ren, J. Shu, X. Sun, and F. Huang. Enabling personalized search over encrypted outsourced data with efficiency improvement. IEEE Transactions on Parallel and Distributed Systems, 27(9):2546–2559, 2016.
5. M. Gori, G. Campiani, A. Rossi, and C. Setacci. The web of clinical data. Journal of Cardiovascular Surgery, 23:717–718, 2014.
6. Yusheng Jiang, Tamotsu Noguchi, Nobuyuki Kanno, Yoshiko Yasumura, Takuya Suzuki, Yu Ishimaki, and Hayato Yamana. A privacy-preserving query system using fully homomorphic encryption with real-world implementation for medicine-side effect search. In Proceedings of the 21st International Conference on Information Integration and Web-Based Applications and Services, iiWAS2019, pages 63–72, New York, NY, USA, 2019. Association for Computing Machinery.
7. Eric Topol. The Creative Destruction of Medicine: How the Digital Revolution Will Create Better Health Care. Basic Books, 2011.
8. Lei Xu, Chungen Xu, Joseph K. Liu, Cong Zuo, and Peng Zhang. Building a dynamic searchable encrypted medical database for multi-client. Inf. Sci., 527:394–405, 2020.
9. Steven Zittrower and Cliff Changchun Zou. Encrypted phrase searching in the cloud. In 2012 IEEE Global Communications Conference, GLOBECOM 2012, Anaheim, CA, USA, December 3-7, 2012