An Overview on the Web of Clinical Data

Abstract

In the last few years there has been an impressive growth of connections between medicine and artificial intelligence (AI) that have been characterized by a specific focus on single problems along with the corresponding clinical data. This paper proposes a new perspective in which the focus is on the progressive accumulation of a universal repository of clinical hyperlinked data, in the spirit that gave rise to the birth of the Web. The underlying idea is that this repository, which is referred to as the Web of Clinical Data (WCD), will dramatically change the AI approach to medicine and its effectiveness. It is claimed that research and AI-based applications will undergo an evolution process that will likely systematically reinforce the solutions implemented in medical apps made available in the WCD. The distinctive architectural feature of the WCD is that this universal repository will be under the control of clinical units and hospitals, which is claimed to be the natural context for dealing with the critical issues of clinical data.


Marco Gori
SAILAB, University of Siena, http://sailab.diism.unisi.it
MAASAI, Université Côte d'Azur, https://team.inria.fr/maasai/team-members


Introduction

Eric J. Topol, in his popular book [7], claims that medicine will inevitably be Schumpetered in the coming years. Whether or not this excitement is justified, we are still missing a crucial catalyst for strongly accelerating the Schumpetering. At the moment, most studies are characterized by the relentless search for unified standards to share data. What if we get rid of any standard? It is worth mentioning that while, on the one hand, the diffusion of clinical data is carefully controlled, there are initiatives that come from patients motivated by the willingness to share their clinical data and experiences. The explosion of social networks has in fact strongly favored the birth of communities where people like to exchange ideas, make helpful connections, and feel supported when they need it most. On the other hand, there are also huge efforts in the building of health systems that can strongly benefit from the recent development of AI, where the emphasis has shifted to machine learning and the processing of unstructured data. Amongst a number of attempts, the Canadian effort is definitely worth mentioning.

This paper promotes the idea of building a universal repository of clinical data to collect the health records of people, which closely recalls the spirit of the Web. This universal collection would not only represent a true paradigm shift in the access to clinical data by students and experts in medicine, but it would also open the doors to the new grand challenge of building decision support systems that operate on a universal repository on the basis of the content of the health records, as well as with recommendations based on the discovery of similarities. The evolution of such a project seems to be intimately connected with the very critical nature of clinical data. When considering the risks, the most natural answer is to refrain from such a crazy adventure, but we claim that the time has come for an in-depth analysis of risks and opportunities. Novel solutions are emerging that could dramatically improve long-term developments in medicine, and that can also go beyond the borders of single countries. Concerning accessibility, we propose that the major difference with respect to the Web is that only the Clinical Units (CU) can access the repository, while patients are expected to benefit from new services. Depending on specific initiatives from the CUs, the act of patient data donation could be rewarded in different ways (e.g. information services related to the patient's clinical condition). The original spirit of the Web leads us to conceive the traditional world-wide search engine service, which can itself be immediately very important for clinical activities. However, the most important challenges are connected with the current developments of AI, which is expected to facilitate the development of intelligent apps. One can also expect that those apps will gain a degree of autonomy that makes them nearly indistinguishable from a medical assistant operating in a team under the control of human experts. An overview of the WCD is given by focussing mostly on the technological side of the project. Preliminary discussions on legal issues and the business model, along with the social implications, are also given.

In the last few years the impressive growth of computer-based medical services has been mostly based on the appropriate organization of clinical data. The structured organization of huge amounts of such data has also been playing a fundamental role in many artificial intelligence based challenges to medicine. Based on an early intuition [5], this paper gives a view on a truly different way of collecting and organizing clinical data, which is claimed to open the doors to an explosive evolution of the field. We assume that medical information is stored in any multimedia document connected by hyperlinks, just like in the Web. It is up to AI agents to interpret the information hidden in the repository, which is referred to as the Web of Clinical Data (WCD). Links in the WCD are expected to establish different levels of relations. The nodes of the graph are medical documents that can properly be linked and included in the WCD. Some links are used for connecting medical documents of a single patient corresponding to a certain specific clinical event, others are used for establishing relations between documents of different patients. No matter how the repository is created, we assume that documents do not contain any identifiable reference to the identity of patients, who are only characterized by the Personal Code (PC). The repository can be supplied with anonymized data coming from clinical centers as well as from patients, who upload their data after having signed appropriate authorizations. As a result, the repository exhibits the very simple structure depicted in Fig. 1, where any type of multimedia medical document of a given patient is linked to his/her Personal Code. These documents are collected as a list for any individual, thus exhibiting a forest structure. Ideally, one would like medical information supplied according to the structure of Fig. 1, where the anamnesis of any patient is represented by a list of clinical events, each of them properly organized into clinical data, diagnosis and therapeutic treatments. The distinctive underlying assumption of the WCD is that there is no need to provide such a structure, since it is up to the WCD Intelligent System, which continuously processes the clinical data, to reconstruct the structure of Fig. 1.

Fig. 1. Patients can upload their own data in a truly unstructured form. Clinical data, diagnosis, and therapeutic documents are stored as a forest with separate data for each patient, who is characterized by an anonymous Personal Identification code. The separation and appropriate structured organization of this material can be created by suitable intelligent agents that contribute to the creation of the WCD.

The identification of different clinical events might arise from the detection of the date on the documents. For example, it could be the case that the therapy adopted by a patient is related to a set of clinical data and to a specific diagnosis, but there is no hyperlink amongst these pieces of information. Just like humans, it is up to an intelligent agent to realize that something is missing and reconstruct the links. Clearly, the presence of the date is a fundamental cue for linking documents of the same clinical event. However, documents with no identifiable date can also be supplied to the WCD.


Basically, the underlying principle is that they are also precious for medical inference, though one must consider their reduced significance due to the lack of a temporal collocation in the anamnesis of a patient.

The role of the WCD Intelligent System goes well beyond that of reconstructing the structure of the data of single patients. As we can see in Fig. 2, one can connect patients of the WCD so as to enrich the forest structure into a graph-based structure where links arise because of regularities discovered in the repository. For example, an intelligent agent can discover similarities between the clinical data of two patients. These can arise from data in different formats, ranging from text to different formats of signals. Similarities can also arise in the diagnosis and the therapeutic treatments, so that the original forest evolves towards a true WCD with precious inferential information.
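The forest of Fig. 1 and the date-based segmentation of clinical events can be made concrete with a small sketch. It is only an illustration under stated assumptions: the `Document` fields, the three `kind` labels, and the rule that documents dated within `max_gap_days` of each other belong to the same event are all hypothetical choices, not part of the WCD proposal. Undated documents are kept but returned separately, reflecting their reduced significance.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Document:
    personal_code: str          # anonymous Personal Code (PC) of the patient
    name: str                   # document name (hypothetical field)
    kind: str                   # "clinical", "diagnosis" or "therapy" (assumed labels)
    day: Optional[date] = None  # date detected on the document, if any


def build_forest(documents):
    """Group documents by Personal Code: one list of documents per patient."""
    forest = defaultdict(list)
    for doc in documents:
        forest[doc.personal_code].append(doc)
    return dict(forest)


def segment_events(patient_docs, max_gap_days=7):
    """Split one patient's dated documents into clinical events.

    Consecutive documents whose dates are at most max_gap_days apart are
    placed in the same event (an assumed heuristic). Undated documents are
    returned separately, since they cannot be placed in time.
    """
    dated = sorted((d for d in patient_docs if d.day is not None),
                   key=lambda d: d.day)
    undated = [d for d in patient_docs if d.day is None]
    events = []
    for doc in dated:
        if events and (doc.day - events[-1][-1].day).days <= max_gap_days:
            events[-1].append(doc)   # close in time: same clinical event
        else:
            events.append([doc])     # otherwise start a new event
    return events, undated
```

A real WCD Intelligent System would combine the date with other cues, such as the link between a prescription and the diagnosis that motivated it; the fixed time window here is only the simplest possible stand-in.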

Fig. 2. Automatic construction of the WCD: the forest of patients grows through appropriate links amongst similar data. Similarities arise from both text and image data and are discovered by intelligent agents.
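As a rough illustration of how links between patients might be inferred, the following sketch compares textual documents with a bag-of-words cosine similarity and links documents of different patients whose score exceeds a threshold. The representation, the function names, and the threshold value are assumptions made for the example; the paper does not prescribe a similarity measure, and signals or images would need their own features.

```python
import math
import re
from collections import Counter
from itertools import combinations


def bag_of_words(text):
    """Lower-cased word counts; a deliberately simple text representation."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def infer_links(documents, threshold=0.5):
    """Return links between similar documents of different patients.

    documents: list of (personal_code, doc_name, text) tuples.
    threshold: assumed cut-off; a real agent would have to tune it.
    """
    vectors = [(pc, name, bag_of_words(text)) for pc, name, text in documents]
    links = []
    for (pc1, n1, v1), (pc2, n2, v2) in combinations(vectors, 2):
        if pc1 == pc2:
            continue  # links inside one patient belong to the forest, not the graph
        score = cosine(v1, v2)
        if score >= threshold:
            links.append((n1, n2, round(score, 3)))
    return links
```

Comparing every pair of documents is quadratic in the size of the collection, so at the scale of the WCD a retrieval step that first narrows the candidates would be needed before any pairwise comparison.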

The actual evolution of the WCD does require clearing up a number of crucial legal issues connected with the distribution of clinical data. Their actual distribution is supposed to be under the control of the Governing Board of the WCD, which will be composed of doctors, scientists in related disciplines, and lawyers. First of all, we need to clarify who is supposed to gain the permission for accessing the data, along with the connected purpose. The WCD is conceived for boosting research and world-wide medical treatment in clinical centers. As a consequence, the permission is granted to scientists and doctors upon approval from the Governing Board of the WCD. As will be shown in the next section, this restriction on the accessibility poses fundamental constraints on the computer architecture, since the underlying assumption is that the repository can only be made available to Clinical Units (CU), which could be hospitals or research centers, whose access comes with the duty not to distribute data elsewhere. Basically, the intention is that of using the same legal framework which regulates the interaction of medical staff with sensitive clinical data. Following related technical developments in the field of bank accounting, we only make the access available independently of the geographical position, whereas we inherit and keep all the restrictions concerning the already established rules for the access. Clearly, this imposes restrictions neither on the place where data are stored nor on who provides the storing service, which does not necessarily correspond with granting accessibility.

The issue of anonymization has been the subject of in-depth investigation in the last few years, but only a few studies make significant efforts towards a rigorous treatment (see e.g. [3] for a very good example). Many circulating claims on the issue of anonymization are quite generic; namely, the inferential context is not well-defined, which facilitates restrictive claims on the supposed risks of making clinical data widely available. Formally, the problem of disclosing the identity from medical data seems to be generic and ill-posed. Clearly, textual documents on the anamnesis of a patient may contribute to discovering the identity, whereas the inference from biomedical signals cannot rely on cues that can somehow reveal the identity. One can at most think of some geographical connections that are expected to give rise to a certain disease. Notice that, given a collection of medical data under the conditions of the WCD, identity disclosure is only possible provided that the databases of clinical centers are violated, which is in fact an event that can happen regardless of the WCD initiative.

Basically, the WCD initiative faces the ill-posedness of the issue of sensitive data by adopting a strategy where anonymization and authorization are granted by design. The act of autonomously uploading data does exhibit the patient's stronger intention to make data available. It is worth mentioning that the interaction with clinical centers currently goes the other way around, namely data are downloaded by the patients. Hence, we can think of keeping the same communication channel, where this time patients upload their own data, which can come from different sources. This information flow with clinical centers very much resembles the one patients establish with their own banks. The level of security and the type of information exchange share in fact the same needs. As patients decide to upload their data to their own preferred clinical center (more than one is possible), we start creating a repository which is promoted to the WCD only after a further check, whose details are defined by the WCD Governing Board. The bottom line is that the WCD does subscribe to the current scenario for the access to clinical data, the only difference being that of making the data widely available to world-wide authorized clinical and research units. In this framework, we expect that most claims on sensitive information might vanish thanks to the governing philosophy of the WCD.

The access requirements pointed out in the previous sections for the WCD open an interesting computer architecture problem that is mostly connected with the storing of a growing repository that must be distributed worldwide. At the same time, each CU is also expected to access the whole WCD repository. In order to satisfy these requirements, the architectural solution proposed for the WCD is the one depicted in Fig. 3.

Fig. 3. Overall architecture of the WCD, where Clinical Units act as macro-nodes. Each such unit is used for storing patients' data in a truly unstructured form (Patient Local WCD - PLWCD), while additional meta-data are created in the Semantic-Local WCD (SLWCD). The WCD is also stored in an encrypted form (Encrypted WCD) on a cloud system.

The repository can be given a formal description at local level thanks to the Uniform Resource Locator of any document, which can be constructed by following the same principles used in the Web. For example, the reference to the ECG diagnosis in a certain CU could be wcd.ao-siena.1492.ecg-careggi-june-2020.dia, whose structure is based on separate fields which provide information on the patient, the specific document and where it was archived. This is sketched below

$$\mathtt{wcd}\,.\,\underbrace{\mathtt{ao\text{-}siena}}_{\text{Clinical Unit}}\,.\,\underbrace{\mathtt{1492}}_{\text{patient no.}}\,.\,\underbrace{\mathtt{ecg\text{-}careggi\text{-}june\text{-}2020}}_{\text{filename}}\,.\,\underbrace{\mathtt{dia}}_{\text{diagnosis}}$$

from which we easily induce the general structure of the URL. Basically, the ECG document named ecg-careggi-june-2020.dia is a diagnostic document (extension dia) corresponding to patient 1492 of Clinical Unit ao-siena of the wcd. This document is supposed to be directly accessible within the CU where it is stored for any type of processing. However, since this document is also supposed to be accessed outside the CU where it has been generated, one needs to make it available world-wide.

There are a number of studies on architectural solutions for related problems that inspire the solution to this general framework (see e.g. [4,1,9,6]). Interestingly, related studies [8] have also been carried out in the context of medical applications. The solution proposed in Fig. 3 naturally follows classic search engine technology, where there is a global repository. Interestingly, this analogy comes with the fundamental difference that data are encrypted and, consequently, not accessible within the global Web repository.

Search engine primitive

The basic idea is that of caching the documents on a Cloud System (CS) in an encrypted format. Basically, any document created in a CU is encrypted and backed up to the CS. This allows us to carry out a classic information retrieval search [4], where all documents produced in the CU are uploaded to the CS along with their encrypted keywords. This allows the CS to construct the inverted indexes to be used for searching. Classic encryption solutions can be used for handling the security of the interaction between the CUs and the CS. Classic search primitives based on propositional calculus are supposed to be used, whereas the major assumption is not to assign higher-level services to the CS. The underlying assumption is in fact that of moving any solution based on AI agents to the CUs. It is worth mentioning that this horizontal keyword-based search service is in fact the first one which is offered on top of the WCD. Interestingly, the search primitives are also of crucial importance as a building layer for any service app.
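The locator scheme and the keyword search primitive described above can be sketched together. This is a minimal illustration, not the architecture's actual protocol: the paper only states that keywords are uploaded in encrypted form, so the use of HMAC-SHA256 as a deterministic keyword token, the names `parse_locator`, `keyword_token` and `CloudIndex`, and the choice to store plain locators in the postings are all assumptions of this example.

```python
import hashlib
import hmac
from collections import defaultdict
from typing import NamedTuple


class Locator(NamedTuple):
    clinical_unit: str
    patient: str
    filename: str
    extension: str


def parse_locator(url):
    """Split 'wcd.<unit>.<patient>.<filename>.<ext>' into its fields."""
    scheme, unit, patient, rest = url.split(".", 3)
    if scheme != "wcd":
        raise ValueError(f"not a WCD locator: {url!r}")
    filename, extension = rest.rsplit(".", 1)
    return Locator(unit, patient, filename, extension)


def keyword_token(secret_key, keyword):
    """Deterministic keyed token for a keyword (assumed scheme: HMAC-SHA256)."""
    return hmac.new(secret_key, keyword.lower().encode("utf-8"),
                    hashlib.sha256).hexdigest()


class CloudIndex:
    """Inverted index held by the Cloud System; it only ever sees tokens."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, url, tokens):
        """Register a document locator under each of its keyword tokens."""
        for token in tokens:
            self._postings[token].add(url)

    def search_all(self, tokens):
        """Conjunctive (AND) query over keyword tokens."""
        sets = [self._postings.get(t, set()) for t in tokens]
        return set.intersection(*sets) if sets else set()

    def search_any(self, tokens):
        """Disjunctive (OR) query over keyword tokens."""
        return set().union(*(self._postings.get(t, set()) for t in tokens))
```

In this sketch the CUs share the secret key and compute the tokens locally, so the CS can intersect and merge postings without learning the keywords. Deterministic tokens still reveal which documents share a keyword; the searchable-encryption schemes cited above address leakage of that kind more carefully.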

Service Apps

Any service in the WCD is supposed to be given by apps running on CU computer servers. Any computation requires selecting the documents that are pertinent to the service, so that they are temporarily downloaded to the CU server. Basically, these apps are expected to operate on a proper selection of information from the WCD that is expected to be useful for their objective. The structure of any service is sketched as

1. create the primary app cache
2. q ← QueryFromService(input)
3. RetrievedDocCollection ← WCD-retrieve(q)
4. run BodyApp(RetrievedDocCollection, input)
5. free(RetrievedDocCollection)

First, the app may need to create its own cache (primary cache) from the WCD to optimize the performance. This step may also be omitted, so that all the processing is based only on the input being processed. The second step consists of formulating the search query on the basis of the input to the app, by means of QueryFromService. Then, as the query is fed to WCD-retrieve, pertinent documents are retrieved and stored in the secondary cache RetrievedDocCollection. This locally retrieved collection, along with the primary cache, is used for carrying out the task assigned to the app. Basically, the actual processing is carried out by BodyApp on the second argument input by exploiting the local cache.

Example 1. Suppose we want to discover patients whose clinical data are similar to those of a given patient. The reference for inspecting the similarity is defined by document input=ecg-careggi-june-2020.dia, which is in fact a document under diagnosis. This document comes with a number of attached metadata, like for instance the specific type of diagnosis (ECG signal). The document, along with its attached metadata, represents the input of the service, which is expected to compose the query q automatically by using QueryFromService. As the query is obtained, it is used to collect RetrievedDocCollection and, finally, the specific part of the app, BodyApp, is used to discover which documents are similar.
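The service structure and Example 1 can be sketched as a small pipeline. Everything below is a hypothetical stand-in: documents are plain dictionaries with a `keywords` metadata field, `wcd_retrieve` searches an in-memory list rather than the encrypted WCD, and `body_app` ranks documents by keyword overlap (Jaccard index) instead of comparing ECG signals. Only the order of the steps follows the text.

```python
def query_from_service(input_doc):
    """Compose the query from the input's metadata (assumed: a set of keywords)."""
    return set(input_doc["metadata"]["keywords"])


def wcd_retrieve(query, repository):
    """Stand-in for WCD-retrieve: documents sharing at least one query keyword."""
    return [doc for doc in repository
            if query & set(doc["metadata"]["keywords"])]


def body_app(retrieved, input_doc):
    """Rank retrieved documents by keyword overlap (Jaccard) with the input."""
    target = set(input_doc["metadata"]["keywords"])
    scored = []
    for doc in retrieved:
        if doc["name"] == input_doc["name"]:
            continue  # do not report the input as similar to itself
        words = set(doc["metadata"]["keywords"])
        union = target | words
        score = len(target & words) / len(union) if union else 0.0
        scored.append((score, doc["name"]))
    return [name for _, name in sorted(scored, reverse=True)]


def run_service(input_doc, repository, primary_cache=None):
    """Run the steps of a service app in the order given in the text."""
    cache = primary_cache if primary_cache is not None else []  # step 1 (optional)
    q = query_from_service(input_doc)                           # step 2
    retrieved = wcd_retrieve(q, repository)                     # step 3
    try:
        return body_app(retrieved + cache, input_doc)           # step 4
    finally:
        retrieved.clear()                                       # step 5: free the secondary cache
```

Releasing the secondary cache in a `finally` clause mirrors the requirement that documents are only temporarily downloaded to the CU server, even when the body of the app fails.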

Patients and staff from the CUs are generally expected to upload clinical data in the general unstructured form of a multimedia document. Clearly, it could be the case that single patients directly provide information in a truly structured way according to Fig. 1. Whenever this does not happen, the purpose of WCD enrichment apps is that of creating the missing structured representation.

The first important enrichment task is that of segmenting patients' clinical events. The most informative cue comes from the possible presence of the date in the document. When it is available, the task of its detection is a well-posed problem for both textual and image-based documents. In both cases the extraction of the date could be quite a complex problem. However, in both cases we are facing classic problems of pattern recognition that have been the subject of massive investigation (see e.g. [2] for early studies). It is worth mentioning that whenever the date is either missing or badly recognized, the segmentation of patients' events can also rely on different cues. For example, there are cases in which a certain drug is prescribed upon a corresponding diagnosis, so that two documents can be placed in the same medical event. Clearly, this task can become very sophisticated and very well represents the type of challenges that are open with the WCD.

Other important enrichment tasks involve the separation between clinical, diagnostic, and therapeutic data. This is essentially a document classification task. While there is a huge literature for attacking this problem, the specific context of the WCD provides a number of additional cues to solve it successfully. Hence enrichment apps contribute to create the structured view of the forest indicated in Fig. 1. Moreover, the successive discovery of links between different patients of the same CU depicted in Fig. 2 is another task of enrichment apps. This problem is basically one of discovering the similarities between multimedia documents, which has also been the subject of massive investigation. In addition to links between documents of the same type, we can find links between text and image documents, a task which clearly needs the additional step of extracting textual descriptions from images. Overall, enrichment apps create the WCD locally to each CU. However, their task goes beyond the discovery of local links. In order to generate global links upon discovery of relationships between documents of different CUs, enrichment apps rely on the software architecture described in Section 3. In particular, the discussion of Example 1 concerning the discovery of global similarities for a given document provides an insight on how to attack the problem. Clearly, once the similarities have been found, we need to update the index of the WCD by enriching it with the discovered links. While this is carried out in the CUs, the inverted indexes in the CS must be updated accordingly. Notice that the process of constructing the WCD by this similarity inspection needs to be carried out frequently because of the continuous update of documents on the CUs.

Fig. 4. Service and Enrichment apps in the WCD. For a service app to operate on the WCD, the connection with the cached info in the EWCD (Encrypted WCD on the cloud) is required.
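For textual documents, the date detection that drives event segmentation can be illustrated with a few lines of pattern matching. This is only a sketch under narrow assumptions: it recognizes two hypothetical formats (ISO dates such as 2020-06-14 and English long dates such as 3 June 2020), whereas real clinical documents, and especially scanned images, would require far more robust recognition.

```python
import re
from datetime import date

MONTHS = {name: i for i, name in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"], start=1)}

ISO_DATE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")
LONG_DATE = re.compile(r"\b(\d{1,2})\s+([A-Za-z]+)\s+(\d{4})\b")


def detect_date(text):
    """Return the first valid date found in text, or None if there is none."""
    candidates = []  # (position in text, year, month, day)
    for m in ISO_DATE.finditer(text):
        candidates.append((m.start(), int(m.group(1)),
                           int(m.group(2)), int(m.group(3))))
    for m in LONG_DATE.finditer(text):
        month = MONTHS.get(m.group(2).lower())
        if month:
            candidates.append((m.start(), int(m.group(3)),
                               month, int(m.group(1))))
    for _, year, month, day in sorted(candidates):
        try:
            return date(year, month, day)
        except ValueError:
            continue  # badly recognized date (e.g. 31 February): try the next one
    return None
```

When `detect_date` returns None, the document falls into the case discussed above, where segmentation has to rely on other cues such as the relation between a prescribed drug and a diagnosis.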

The creation of the WCD, along with the high-level metadata associated with the documents, opens the doors to the development of medical service apps ranging from the field of diagnosis to that of therapeutic treatment. The underlying philosophy in the development of service apps is that they do not simply operate on a specific CU, but on the WCD. Depending on the size of the CU, the apps might have different configurations, but they are conceived for working at global level.

Fig. 5. Onion-skin view of service apps. The lowest level only relies on the processing of data from the SLWCD. In the middle layers, apps operate by relying on the computation of other apps. Finally, in the last layer, medical assistants operate autonomously on the basis of all the info available in the lower levels.

Because of the need to make them available on the WCD, service apps are expected to run on popular operating systems. As we can see in Fig. 5, service apps can operate at different levels of abstraction. As the process gains a certain degree of autonomy, service apps become a sort of medical assistant which can support decisions under the control of human experts. The long-term view of the WCD is that the progressive construction of service apps is expected to contribute to the medical treatment of specific patients, but also to the permanent creation of metadata that are expected to provide an important support for future medical decisions.

The described framework of the WCD suggests the development of a stable marriage between medicine and artificial intelligence. While medicine benefits from the systematic development of service apps thanks to world-wide competition, a much better benchmarking context is created for research in artificial intelligence. Basically, it looks like the WCD can act as a fundamental catalyst for both disciplines. The current focus on specific benchmarks is expected to be translated into a sort of permanent AI challenge in medicine, which will constantly refer to the WCD.

The philosophy behind the WCD is that of increasing the level of medical treatment by strongly promoting the CUs, which turn out to be the nodes of the WCD. As such, they are expected to control the world-wide repository of clinical data and will likely realize that there is a crucial benefit in assuming this role. The WCD will make the best medical practices available worldwide, a service that has an enormous value and that might also be paid for. Pharma companies might be interested in advertising on the WCD, which could be the source of huge resources, especially for the best CUs. Simple metrics could in fact measure the number of accesses to documents uploaded to different CUs, so as to distribute money from advertisement. This will definitely bless the principle that in the WCD framework doctors carry out two different specific roles: first, their primary role is that of providing the appropriate medical assistance to single patients; second, their successful treatments will be useful for future worldwide medical decisions. Finally, the conception of the WCD comes with the wish that its development will be primarily useful for the medical support to poor countries.

Acknowledgments

We thank Alessandro Rossi, Stefano Melacci, Sandro Bartolini (University of Siena), Arturo Chiti (Humanitas, Milan), and Paolo Traverso (FBK, Trento) for insightful discussions.

References

1. Pedro Geraldo M. R. Alves and Diego F. Aranha. A framework for searching encrypted databases. J. Internet Serv. Appl., 9(1):1:1–1:18, 2018.
2. Heidi Bjering Stratti. Optimal operators in digital image processing. PhD thesis, 2008.
3. Khaled El Emam and Luk Arbuckle. Anonymizing Health Data: Case Studies and Methods to Get You Started. O'Reilly Media, Inc., 1st edition, 2013.
4. Z. Fu, K. Ren, J. Shu, X. Sun, and F. Huang. Enabling personalized search over encrypted outsourced data with efficiency improvement. IEEE Transactions on Parallel and Distributed Systems, 27(9):2546–2559, 2016.
5. M. Gori, G. Campiani, A. Rossi, and C. Setacci. The web of clinical data. Journal of Cardiovascular Surgery, 23:717–718, 2014.
6. Yusheng Jiang, Tamotsu Noguchi, Nobuyuki Kanno, Yoshiko Yasumura, Takuya Suzuki, Yu Ishimaki, and Hayato Yamana. A privacy-preserving query system using fully homomorphic encryption with real-world implementation for medicine-side effect search. In Proceedings of the 21st International Conference on Information Integration and Web-Based Applications and Services, iiWAS2019, pages 63–72, New York, NY, USA, 2019. Association for Computing Machinery.
7. Eric Topol. The Creative Destruction of Medicine: How the Digital Revolution Will Create Better Health Care. Basic Books, 2011.
8. Lei Xu, Chungen Xu, Joseph K. Liu, Cong Zuo, and Peng Zhang. Building a dynamic searchable encrypted medical database for multi-client. Inf. Sci., 527:394–405, 2020.
9. Steven Zittrower and Cliff Changchun Zou. Encrypted phrase searching in the cloud. In 2012 IEEE Global Communications Conference, GLOBECOM 2012, Anaheim, CA, USA, December 3-7, 2012