Bias in ontologies -- a preliminary assessment
C. Maria Keet
Department of Computer Science, University of Cape Town, South Africa [email protected]
Abstract
Logical theories in the form of ontologies and similar artefacts in computing and IT are used for structuring, annotating, and querying data, among others, and therewith influence data analytics regarding what is fed into the algorithms. Algorithmic bias is a well-known notion, but what does bias mean in the context of ontologies that provide a structuring mechanism for an algorithm’s input? What are the sources of bias there and how would they manifest themselves in ontologies? We examine and enumerate types of bias relevant for ontologies, and whether they are explicit or implicit. These eight types are illustrated with examples from extant production-level ontologies and samples from the literature. We then assessed three concurrently developed COVID-19 ontologies on bias and detected different subsets of types of bias in each one, to a greater or lesser extent. This first characterisation aims to contribute to a sensitisation of ethical aspects of ontologies primarily regarding representation of information and knowledge.
Introduction
Bias in models is a well-known topic, which has been popularised to the public with the catchy term “weapons of math destruction” (O’Neil 2016). Nearly all investigations on ‘models’ concern statistical models created from Big Data by means of knowledge discovery, machine learning, and deep learning techniques. There are many more types of models, however. The other main category of models within Artificial Intelligence (AI) are ontologies, which are a staple in the knowledge representation and reasoning side of AI. Informally, an ontology is a logical theory of a subject domain, capturing its classes, relations, and constraints that hold among them, which are used for tasks such as data integration, information retrieval, electronic health records, and e-learning (Keet 2018). For instance, one may have multiple databases that have to be merged due to a company take-over and one needs to know whether some entity type Customer or COVID19-Patient in one database has the same meaning as Custmor or COVIDPatient in the other database, respectively, and if it does, a way to declare that, or, e.g., to define precisely what COVID-19 death means in the mortality statistic. Ontologies can help with it by providing an application-independent representation of the subject domain as a common vocabulary and unambiguous specification of the intended meaning. Besides integration, one also can choose an ontology upfront and use that across applications, such as an electronic patient record system with a medical terminology for classifying or annotating patients’ symptoms, disorders and a treatment that is shared with the insurer; SNOMED and the ICD-10 are popular for that. An example of their Web-scale use is Google’s Knowledge Graph that drives search and the creation and maintenance of its infoboxes. The one who builds and controls the graph, then, is the one who has the power to control presentation and access to information and possibly also the recording of information, and, as Juel Vang (2013) argues in the case of Google’s Graph, “to some degree contests the autonomy of the user”.
We illustrate the general idea of possible issues in the next example with ontology-mediated artificial moral agents.
Copyright © 2021, the author(s).
Example 1
The Genet ontology aims to provide a framework to represent multiple ethical theories such as utilitarianism and divine command theory (Rautenbach and Keet 2020) so that one can tailor the actions of a robot to the moral preferences of its owner or enhance argumentation in multi-agent systems (Liao, Slavkovik, and van der Torre 2019). A section of its version 1 is shown in Fig. 1 in black-and-white informally on the left and a selection of the axioms in Description Logics (DL) notation (Baader et al. 2008) on the right. That the ontology admitted four distinct entities of moral value, rather than just humans, is already an ideological statement and therewith a bias.
[Figure 1 diagram: PatientKind with subclasses Human, Nature, NonHumanAnimal, and OtherSentient {disjoint, complete}; hypothetical subclass Robot {disjoint}; classes SetOfPatientKinds and Ethical Theory; associations ‘has member’ and ‘has component’ with multiplicities * and 1..*.]
Figure 1: Small section of the OWL version of the Genet model of (Rautenbach and Keet 2020) (in black-and-white), a hypothetical addition with Robot as entity of possible moral value (in blue, solid lines bottom-left), and the deduction (in green, dashed arrow). On the right, a selection of the relevant axioms in DL notation.
Now assume that you want to expand the moral circle beyond those four in the ontology, with Robot. By design, you cannot unless you have the rights and the technology to change it. Let’s assume you have those.
There are three options. First, you add Robot as a PatientKind and, since you are sure robots are neither humans, nor nature, nor non-human animals, add those disjointness axioms. The reasoner will deduce
Robot ⊑ OtherSentient
regardless of whether you wanted that or not. If not (perhaps because you are religiously convinced inanimate objects cannot be sentient), then, second, you could add that they are distinct as well:
Robot ⊓ OtherSentient ⊑ ⊥
but then the reasoner will deduce
Robot ⊑ ⊥
i.e., the class is unsatisfiable (cannot have instances). The third option is to modify the original axioms and lose compatibility with Genet; e.g., to remove some disjointness axioms or change the completeness axiom.
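The deductions in Example 1 can be replayed in a few lines of Python. This is a minimal sketch and not a DL reasoner: it hard-codes the single inference pattern at play (a covering axiom combined with disjointness assertions), the function name classify_under_cover is hypothetical, and it assumes, as in Fig. 1, that PatientKind is completely covered by its four subclasses.

```python
from typing import FrozenSet, List, Set, Tuple

# Assumption taken from Fig. 1: PatientKind is completely covered by these four classes.
PATIENT_KIND_COVER: FrozenSet[str] = frozenset(
    {"Human", "Nature", "NonHumanAnimal", "OtherSentient"}
)


def classify_under_cover(
    cover: FrozenSet[str], disjoint_with: Set[str]
) -> Tuple[str, List[str]]:
    """Apply covering + disjointness to a new subclass of the covered class.

    Every instance of the new class must fall in some covering class. Those
    declared disjoint with the new class are ruled out, so the new class is
    subsumed by the union of whatever remains.
    """
    remaining = sorted(cover - disjoint_with)
    if not remaining:
        # No covering class is left: the new class cannot have instances.
        return ("unsatisfiable", [])
    return ("subsumed_by", remaining)


# Option 1: Robot is disjoint with Human, Nature, and NonHumanAnimal.
option_1 = classify_under_cover(
    PATIENT_KIND_COVER, {"Human", "Nature", "NonHumanAnimal"}
)
# Option 2: additionally declare Robot disjoint with OtherSentient.
option_2 = classify_under_cover(PATIENT_KIND_COVER, set(PATIENT_KIND_COVER))

print(option_1)  # ('subsumed_by', ['OtherSentient'])
print(option_2)  # ('unsatisfiable', [])
```

The point of the sketch is that neither outcome is chosen by the person adding Robot; both follow from commitments already baked into the original axioms.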
Ontologies in computing and IT have been popularised since the mid 1990s, with as a major success story the Gene Ontology (Gene Ontology Consortium 2000) as ontology and the OWL language as the W3C standard (Motik, Patel-Schneider, and Parsia 2009) to represent ontologies in. The popular ontology repository for bio-ontologies BioPortal lists 831 ontologies and the repository of repositories OntoHub claims to have indexed 22460 ontologies of 139 repositories (figures from https://ontohub.org/; last checked on 13-1-2021). Regarding possible bias in ontologies, aside from “encoding bias” (Uschold and Gruninger 1996) that refers to different formalisations of the same thing, there are few articles. An early paper discusses it in the context of the “Dirty War Index” tool that claimed to aim to inform public health in armed conflict settings, which had several biases, such as including ex-army in the civilian group whereas the primary source database did not (Keet 2009). Gomes and Bragato Barros (2020) assessed the FOAF terminology through the lens of discursive semiotics as a method. This means that one has to consider “the concretization, in language, of a particular social, historical, ideological, and environmental context”, using the specific framework with the “Generative Trajectory of Meaning” of Greimas and Courtés (referenced as “Greimas, A. J. and J. Courtés. 2013. Dicionário de Semiótica. São Paulo: Contexto.”). The bias analysis, however, was limited to a few well-known ones, being first & last name vs given & family name, gender, and the meaning of document. While valid, biases and their causes are more intricate and varied than these. For instance, consider religion, which may be a specialisation of their “ideological”, as was the case of the issue of inclusion of homosexuality in the classification of mental disorders in the United States until DSM-III in 1987 (for a brief overview of its history, see https://en.wikipedia.org/wiki/Homosexuality_in_DSM). What their approach cannot capture, but is certainly an issue for declarative models, is, among others, the menopausal hormone therapy case: there were at least economic incentives that determined which attributes ended up in the model with what threshold values in order to classify who is eligible for treatment. In essence, they narrowed the range of natural variability of concentrations of key molecules to increase the number of women who would be ‘abnormal’ and therewith qualifying for medication, which unintentionally led to an increase in cancer incidence.
In this paper, we aim to contribute to systematising the sort of bias that can enter or be present in ontologies and similar artefacts, such as conceptual data models and thesauri. We will seek to provide a preliminary answer to what bias means for ontologies, what its sources or causes are, and how it manifests itself in ontologies. The identified bias types are structured along three categories: high-level philosophical ones, scope or purpose, and subject domain issues. Some of these biases are intentional biases that insiders know very well, but outsiders and newcomers may have to be notified of. The unintentional biases that can creep in will be harder to manage; we do not aim to solve that here, but first take inventory of them. Second, we assess a set of COVID-19 ontologies on these biases. These ontologies are under active development, competing, and merging, and highly relevant for data management of the pandemic. The assessment showed that none is free of bias.
The remainder of this paper is structured as follows. We first systematise and illustrate the principal sources, to continue with the COVID-19 ontologies assessment. We then discuss the outcomes and touch upon automated reasoning, and close with conclusions.
Principal sources of bias in ontologies
Of most interest, practically and ethically, is the bias with respect to the subject domain. To be able to discuss it properly, we first need to note and ‘set aside’ the straightforward ones of philosophical and engineering (encoding) bias. A summary of the resultant eight types, or sources, of bias is included in Table 1.
Type              Subtype                  [im/ex]plicit bias
Philosophical     -                        explicit
Purpose           -                        explicit
Subject domain    Science                  explicit
                  Granularity              either
                  Linguistic               either
                  Socio-cultural           either
                  Political or religious   either
                  Economics                explicit
Table 1: Summary of typical possible biases in ontologies grouped by type, with an indication whether such biases would be explicit choices or whether they may creep in unintentionally.
High-level philosophical issues
Ontologies are an engineering version of the original idea of Ontology by philosophers, and its branch of analytic philosophy in particular. Most subject domain ontology developers may not care much about the finer distinctions of core notions, but they are there. Practically, for domain ontology development, one would choose a particular foundational or top-level ontology that provides the main types of entities and relations so as to help structuring the content. There are multiple such foundational ontologies in active use, such as BFO, DOLCE, UFO, SUMO, and YAMATO, which make different commitments. Their developers are mostly clear about that on general principles and how it affects the ontology’s content, such as acknowledging the existence of abstract entities (Masolo et al. 2003) or what the core relations in the world would be (Smith et al. 2005). See Fig. 2 for an example. While it is not trivial to choose which foundational ontology suits the modeller best, it is a deliberate decision, hence, an upfront explicit bias.
[Figure 2 panels: class hierarchy of DOLCE-Lite; class hierarchy of BFO v2.0.]
Figure 2: Illustration of some philosophical differences between foundational ontologies: DOLCE-Lite.owl has Abstract, but bfo20.owl does not due to its realist stance, and while perdurant and occurrent roughly align (dashed arrow), their respective subclasses do not, admitting to different types of perdurant in existence.
There are related debates on whether what is represented is a representation of reality or merely our understanding thereof, or whether there would even be a reality. This is an old and recurring debate (see, e.g., (Merrill 2010)) that has no resolution that everyone agrees on. For ontology development, the key take-away is whether one aims to be faithful to reality (or our best understanding of it) versus ulterior motives, be it rejecting reality or not caring (‘post-truth’) or knowingly violating it for whatever reason. These different stances play out at the subject domain level where the bias can have most effect, as we shall see further below, and could result either in an explicit or implicit bias.
[Figure 3 diagram with three panels: Pattern A (Occurrent, Continuant, ClassA, ClassB, ClassC, relation participant), Pattern B (class with attribute hasB: anyType), Pattern C (ClassC with BT Continuant and RT ClassB).]
Figure 3: Three different patterns with a purpose bias: Pattern A is biased toward a scientific approach with increasing precision (and a bias toward 3-dimensionalism philosophically), Pattern B indicates a conceptual data modelling influence or purpose, and Pattern C takes a thesaurus-like approach useful for document annotation.
Scope or purpose
In theory, ontologies are supposed to be application-independent, so as to be a solution to the data integration problem; if they are tailored to the application nonetheless, they may become part of the problem. In practice, this application independence may not always hold. Developing an ontology for the sake of it may be an interesting endeavour, but someone has to fund it and it helps to have a use case scenario to motivate its development. This may affect what is represented and how, and is, or at least should have been, an explicitly stated bias motivated by pragmatics, if it can be considered a bias, as Uschold and Gruninger (1996) do, since these are engineering choices rather than bias on the knowledge itself.
Three patterns of representation for different purposes are shown in Fig. 3, summarising common encoding biases. To illustrate those, consider the following situation, represented in DL for brevity. Ventilation is an undisputed treatment for COVID-19 patients and is being used, and so let us consider three options:
• If the scope or purpose is to be as detailed and reusable as possible, and knowing that Treatment is a perdurant in philosophical terms, operating within the 3-dimensionalism bias, then
Ventilation ⊑ Treatment
is the bare minimum to declare, and availing of the core relation of participation, then also
HospitalisedPatient ⊑ ∃participatesIn.Treatment
One then could assert that our hospitalised COVID-19 patients participate in the ventilation treatment.
• A compact representation of the same state of affairs results in faster data processing; e.g.,
Patient ⊑ ∃isOnVentilator.Boolean
so as to use the ontology language to develop a conceptual data model for database development, rather than the traditional EER language for relational databases.
• Another purpose could be annotation of literature to better manage it. Then neither the boolean nor all those constraints and relations are needed, but it would focus on casting the net wide on terminology with one preferred and several alternative labels including, but not limited to, ventilator support, ventilation therapy, mechanical ventilation, and invasive ventilation, with BT Ventilation and RT Patient.
The ontologist may complain about the latter two options as woefully underspecified, which is their bias, whereas the two tool developers may complain that the first option is needlessly complicated due to their bias for simplicity.
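To make the contrast between the three purposes tangible, the following Python sketch encodes the same ventilation example once per pattern. It is an illustration under assumptions, not an implementation of any of the ontologies discussed: the names Patient.patient_id, ThesaurusEntry, and annotate are hypothetical, and since the text does not say which label is the preferred one, the sketch keeps all labels in one list.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Pattern A: axioms as (class, superclass expression) pairs, mirroring the DL axioms.
PATTERN_A: List[Tuple[str, str]] = [
    ("Ventilation", "Treatment"),
    ("HospitalisedPatient", "exists participatesIn.Treatment"),
]


# Pattern B: a conceptual-data-model style record with a Boolean attribute.
@dataclass
class Patient:
    patient_id: str  # hypothetical identifier, not part of the source axiom
    is_on_ventilator: bool


# Pattern C: a thesaurus-style entry with labels plus broader/related terms.
@dataclass
class ThesaurusEntry:
    labels: List[str]
    broader: str  # BT
    related: str  # RT


PATTERN_C = ThesaurusEntry(
    labels=[
        "ventilator support",
        "ventilation therapy",
        "mechanical ventilation",
        "invasive ventilation",
    ],
    broader="Ventilation",
    related="Patient",
)


def annotate(text: str, entry: ThesaurusEntry) -> List[str]:
    """Return the labels of the entry that occur in the text (case-insensitive)."""
    lowered = text.lower()
    return [label for label in entry.labels if label in lowered]


print(annotate("The patient received mechanical ventilation for five days.", PATTERN_C))
# ['mechanical ventilation']
```

Pattern B answers a yes/no question quickly but says nothing about what ventilation is, while Pattern C finds documents but supports no inference; which of these counts as a shortcoming depends on the purpose, which is exactly the bias described above.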
Subject domain
The list of bias sources described in this section overlaps with that of Gomes and Bragato Barros (2020), but is extended with three categories, including one from (Keet 2009). In addition, we indicate whether they concern mainly intended or unintended biases, or both, and illustrate each one in order to demonstrate relevance.
Difference of opinion on reality and science.
Even under the assumption of a commitment to the existence of reality, one still could disagree. A common example is whether a virus is an organism or not (it is not by any extant definition of what an organism is), and even bio-ontologies and medical terminologies do not agree; compare, e.g., the CIDO (He et al. 2020) versus the SIO (Dumontier et al. 2014) and the NCI Thesaurus (https://ncit.nci.nih.gov/ncitbrowser/ConceptReport.jsp?dictionary=NCI_Thesaurus&ns=ncit&code=C14283; 18-1-2021). More broadly, it concerns either the insufficient insight or competing theories that the scientists still have to investigate, of which it is assumed that eventually there will be an agreement, or there are delays in propagating discoveries into the ontologies. In other fields of research, there are inherently competing theories, such as capitalism and socialism, that would result in a different domain ontology of economy. They are all intended choices and biases.
Required or chosen level of precision/granularity.
It is a general question in ontology development how detailed it should be and how deep the taxonomy should go. Less detail therefore may be an act of omission, an indication of ‘not needed’, or a ‘ran out of time’ to be included for a next version. Inspecting an ontology in isolation, this is impossible to determine unless either of the latter two are indicated in the annotations. For instance, the Gene Ontology has three versions: a GO basic that excludes several relations between entities, the GO, and a GO plus with additional axioms (http://geneontology.org/docs/download-ontology/; last accessed 13 January 2021).
An act of omission is the aforementioned aggregation of ex-military persons with, say, non-involved persons as one group of Civilians: the source had a more detailed categorisation that was abstracted away so that it resulted in one party (of the authors’ side) shown in a more favourable light (Keet 2009). Similar issues exist for other conflict databases, which may be intentional or unintentional. For instance, a bombing target may be recorded as an instance of having targeted a Government building if that is the only category available, or more precisely if it had any subclasses, such as, say, State hospital, State medicine manufacture plant, Military base, and Homeland security torture bunker, and more, rather than one layer of subclasses as in (Veerasamy, Grobler, and Solms 2012). Similarly, one could have one aggregate group, say, Foreign National with as alt-label Alien for the USA, or also include subclasses such as Migrant and Refugee, and further subclasses such as Economic migrant, Spousal migrant, Critical skills migrant, and so on. Such differentiations, or absence thereof, may be intended or they may be unintended and even change over time when the subject changes, such as new immigration policies and different ways of conducting conflict (e.g., cyber attacks rather than bombings).
Cultural-linguistic motivations.
Anyone who has learned a second language has come across untranslatable words or at least fine semantic distinctions. The question then arises if, and if so, when, a difference ends up as a bias in the ontology or not. For instance, English has only one term for river (all rivers are just rivers), whereas French makes a distinction between a fleuve and a rivière (the former flows into the sea, the latter flows into another river) that somehow has to be represented and the ontologies aligned (McCrae et al. 2012), and likewise for observed differences in part-whole relations (Keet and Khumalo 2018). One may argue that in both examples, the reality is the same but they have varying descriptions, or take reality with a grain of salt and state there are different realities depending on language, or that there are different conceptualisations.
A borderline case between cultural-linguistic preferences and political bias are the false friends, where a term in a language has a different meaning or connotation in different countries where the language is spoken, due to historical differences across countries. For instance, ‘herd immunity’ is a common term in American and British English, but is being rebranded as ‘population immunity’ in South African English, since the former has the connotation of non-human animals that people do not want to be associated with. It is also different in other languages; e.g., in Spanish, it is inmunidad de grupo and in Dutch groepsimmuniteit, i.e., ‘group’ immunity rather than ‘herd’. Note that this is distinct from mere synonym confusion, such as a Football Ontology where it is unclear from the name whether it refers to soccer or American football, or eraser/rubber/condom mix-ups, and from orthographic differences (e.g., color vs colour), which can be accommodated in the ontology with labels and finer-grained language-coding schemes (e.g., @en-uk and @en-us etc.).
The chance that a monolingual ontology development team from one cultural identity in one country builds in such a bias is substantial, and it can be reduced by constituting a more diverse team of ontology developers who at least speak several languages among them. Any bias built in may be intentional or unintentional. For instance, McCrae’s team (McCrae et al. 2012) was very multilingual and so it was easy to observe the difference and propose a solution. In contrast, if one then develops an ontology afterward knowingly only including the non-differentiating River, then that is an explicit bias.
Socio-cultural factors.
This concerns how society is organised, with the assumptions that underlie it and the history of how it came about, and the practical effects it may have when developing the ontology. This may be organisational structures, who lives with whom, demographics, allocation of resources, or social geography that influences what is salient and what is not.
For instance, who can marry whom and how many is a well-known point of variation across the world, which can cause difficulties for multinational organisations to harmonise that in one system. For instance, it may be a company policy that one can insure the spouse of the employee, requiring a statement like
Employee ⊑ ∀marriedTo.Spouse
but should the model also include
Employee ⊑ ≤1 marriedTo.Spouse
i.e., at most one spouse? Should the gender of the spouse be recorded or marriedTo be defined as holding between humans and no more?
Any answer will have a bias baked into it. For an ontology to be as general as possible, the most permissive combination would have to be represented, and any constraints would have to go into the conceptual data model for the specific database.
A concrete example is the relatively popular GoodRelations Ontology for e-commerce (Hepp 2008). It lists several payment methods, such as invoice, cash, and PayPal, and limits the ‘on delivery’ to cash, but cash-less options on delivery, such as a pre-paid card or QR-code payment, are just as possible yet missing; these are a not uncommon mode of payment in areas where robberies are common. Also, its Business assumes that businesses are legally registered, which may well hold in Europe where the ontology was developed, but in many other countries there is a vast network of the informal economy that does trade online with their smartphones and has no specific opening hours.
Socio-cultural factors may also influence the content of medical terminologies, such as the perception of alcohol use across cultures, in-groups, and age, and what would be considered as having a drinking problem. A recent example is demonstrated by a comparison between the DSM-IV, DSM-V, and ICD-10 medical terminologies on issues with alcohol intake, where the criteria were changed. This resulted in an increase in Alcohol Use Disorder using DSM-V compared to the DSM-IV criteria (based on the same data), primarily due to lowering the threshold for the number of diagnostic criteria required for it and increasing the number of criteria through replacing one class with four new classes that were arguably features of it (Lundin et al. 2015). This change in the lightweight ontology has been blamed on a combination of socio-cultural factors and scientific disagreement (Wakefield 2015).
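The spouse cardinality question raised earlier in this subsection can be made concrete with a small data check. The sketch below is written under loudly stated assumptions: the records and the function name violations are hypothetical, and the check is a closed-world one with unique names, as a database constraint would be. An OWL reasoner behaves differently: under the at-most-one axiom it would infer that two named spouses denote the same individual unless they are declared different.

```python
from typing import Dict, List, Optional

# Hypothetical employee records: employee id -> ids of recorded spouses.
marriages: Dict[str, List[str]] = {
    "emp1": ["s1"],
    "emp2": ["s2", "s3"],
    "emp3": [],
}


def violations(records: Dict[str, List[str]], max_spouses: Optional[int]) -> List[str]:
    """List employees whose number of distinct spouses exceeds max_spouses.

    max_spouses=None models the permissive ontology without a cardinality
    constraint; an integer models a constraint added in a specific data model.
    """
    if max_spouses is None:
        return []
    return sorted(
        employee
        for employee, spouses in records.items()
        if len(set(spouses)) > max_spouses
    )


print(violations(marriages, None))  # []
print(violations(marriages, 1))     # ['emp2']
```

The same data are thus valid or invalid depending solely on which modelling choice was made, which is why the constraint is better placed in the application-specific data model than in the shared ontology.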
Political and religious motivations.
The line between societal, political, and religious bias may be difficult to draw depending on the case. The aforementioned DSM, which ought to be based on science, was not entirely so and was likely influenced by religious viewpoints at least in some instances. Since the separation between state and church may not be all that strict, it practically may not be possible to disentangle the two. A clear-cut case is where the entity type Aggrieved group, as a neutral term, enters the ontology as Terrorist organisation as preferred label; concretely, there are terrorist and terroristgroup in the terrorism ontology of (Jindal, Seeja, and Jain 2020) whereas there is an ActorEntity with various types of Insiders and Protestors in the Cyberterrorism ontology (Veerasamy, Grobler, and Solms 2012).
As with society and language matters, these issues more easily come to light if the team of ontology developers is diverse or at least has diverse knowledge to bring in. Also here, such differences, i.e., biases, may be intentional or not.
Economic motivations.
Perhaps the most well-known arena where economic motivations play a role is the recognition of something as a disorder or disease, from which follows that it deserves at least funding of a treatment if there is one as well as resources for prevention and research. The Obesity Society’s panel of experts even stated this bluntly as the main reason in favour of classifying obesity as a disease (TOS Obesity as a Disease Writing Group et al. 2008). Its recognition is good for big pharma and possibly also the patients, but costly for insurers, which results in tension. For the ontology, it means that it is in or out, where the ontology comes into play in particular in electronic health records (how an observed finding is noted in the record, which treatments are linked to it) and further down the pipeline when the electronic records with their ontology, such as SNOMED CT, are linked to the pharmacy and the insurer’s databases. There is a benefit to the data integration if the ontology used for it is grounded on evidence-based medicine and in one’s favour; if it is not, it can be an uphill battle on multiple fronts. These issues are well-known and therefore can be classified as intended biases.
Assessment of ontologies: the COVID-19 Ontologies
To assess the notion of bias in ontologies beyond the concrete selected examples in the previous section, it would help to carry out the bias source identification for a set of ontologies in roughly the same subject domain. The reasoning is that since there are several ontologies in that given domain, there must have been a reason to develop more than one rather than to stick with one effort or to combine efforts. This may be due to bias, but not necessarily so. This limits the choices for assessment. Comparisons of foundational ontologies abound (see (Partridge et al. 2020) for the most recent attempt) and would have a less clear impact on domain ontologies’ possible biases that affect applications that people use. Of the core and domain ontologies, there are a few on time and measurements, many on health and medicine (e.g., 37 are contextualised in (Haendel et al. 2018)), data mining, organisations and government, and others, which are more or less stable and more or less maintained.
We identified a set of ontologies in the same subject domain, of which the authors have sufficient knowledge about the domain to assess it, and that are under active development and maintenance, so that there is an increased chance of the assessment outcomes being taken into account. A downside of the latter selection criterion may be that any issues observed may have been resolved in the meantime between assessment and review or publication of this paper. Nonetheless, given the urgency of the theme, we chose to assess the COVID-19 ontologies on bias. The next section will contextualise each ontology briefly and the section thereafter contains the assessment.
Ontology descriptions
The Coronavirus Infectious Disease Ontology (CIDO) (He et al. 2020) is an ontology that was developed within the overarching OBO Foundry approach (Smith et al. 2007): it took a community-based development approach and reuses, among others, the Infectious Disease Ontology that in turn is linked to the top-level ontology BFO (Arp, Smith, and Spear 2015), therewith adhering to some of its principles of structuring knowledge and philosophical stance of realism. The scope of the ontology was aimed at knowledge and information about the SARS-CoV-2 virus and host taxonomy data, its phenotype, and drugs and vaccines to foster data integration. The CIDO v1.0.109 was used for the assessment, to keep with the time frame where all ontologies were released around July-August 2020, therewith reducing the chance of mutual influence; in particular, the smaller cido-base.owl file (downloaded on 20-7-2020) with the relevant imports was assessed, which contains 82 classes, 15 object properties (relations), no data properties (attributes) and one individual, and 90 logical axioms and is within the OWL Full profile due to issues with undeclared annotation properties and a few undeclared classes. Aside from that, logically, it is expressible in ALEHO in DL terminology, that is, a basic hierarchy with existentially quantified properties and an occasional nominal (instances made into a class).
The CODO (Dutta and DeBellis 2020) has as purpose to assist in representing and publishing of COVID-19 data from the disease course perspective and has as subject domain scope COVID-19 cases and patient information. That is, it aims to be a component in IT systems for healthcare, rather than take a medical or research angle. The CODO V1.2-16July2020.owl was used for the assessment, which contains 51 classes, 61 object properties, 45 data properties, 56 individuals, and 463 logical axioms. It is within the OWL 2 DL profile, and SHOIQ(D) more specifically, or: it is an expressive ontology that uses many of the OWL 2 DL constructs available in the language.
The COVoc, developed by the European Bioinformatics Institute, has as purpose to support navigating and curating the literature on COVID-19, and in particular the scientific research of it; documentation of its rationale is available as a workshop presentation (Pendlington et al. 2020). Its first, and latest, released version is slightly later than that of CIDO and CODO, although all had their drafts in June 2020, which did not affect its contents. The covoc.owl was used for the assessment (v d.d. 28-8-2020), which contains 541 classes, 179 object properties, no data properties or individuals, and 672 logical axioms. It is within the OWL Full profile due to a subset property issue with the annotation properties; without that and just the logical theory, it is expressible in ALCHI, consisting of a basic hierarchy with existentially quantified properties and a few subproperties and inverses.
The “vocabulary for COVID-19 data”, available at http://covid19.squirrel.link/ontology/, has been excluded, because its content is different from the other three, in that it is not for COVID-19 data but to label datasets of COVID-19 data, such as Dataset of the Robert Koch-Institut.
Bias                     CIDO   CODO   COVoc
Philosophical            +      -      +
Purpose                  -      +      +
Science                  -      -      +
Granularity              ±      +      +
Linguistic               +      -      -
Socio-cultural           +      +      +
Political or religious   +      +      +
Economics                -      -      -
Table 2: Presence or absence of bias in the three COVID-19 ontologies examined.
Bias assessment
The presence and absence of the different types of bias is summarised in Table 2, and will be illustrated and discussed in the remainder of this section.
CIDO
There are two socio-cultural biases in the CIDO. First, there is a COVID-19 diagnosis class with three subclasses: negative, positive, and presumptive positive. There are two aspects to this. One is the [disease]-positive/negative labelling, which has clear HIV connotations with all the stigmatisation that comes with it. This may be less prevalent in a country like the USA where the incidence is relatively very low, but in countries where it is endemic, such as South Africa, such labelling can be harmful. It easily could have been, e.g., ‘infected’, ‘detected’, or ‘present’ and ‘not infected’, ‘absent’, or ‘free’; that said, the positive/negative labelling is a pervasive issue across languages and countries. The other is the third category, ‘presumptive positive’, which elicits a negative connotation, plays into people’s fears, and would brand people that are statistically unlikely to have it, since many countries aim for at most a 5-10% positivity rate. Neutral, and more accurate, terminology would be, e.g., ‘pending result’, ‘awaiting test outcome’, or ‘under investigation’.
A similar bias in the other direction, of unwarranted optimism, is the assumption of
COVID-19 experimental drug in clinical trial ⊑ COVID-19 drug
noting that
COVID-19 drug ⊑ ∃ treatment for.COVID-19 disease process
is asserted in the ontology, and thus entails that COVID-19 experimental drug in clinical trial is a drug already and is part of regular treatment processes of COVID-19, since the property of ∃ treatment for.COVID-19 disease process is inherited down the hierarchy. This is wishful thinking. A substance under investigation that is being evaluated is not necessarily effective or safe, and for it to be a drug, it has to be that and also have been approved by the regulatory body.
A minor language note is drive-thru instead of drive-through for testing stations, but this can easily be addressed by providing alternative labels. Other US-centric indications are naming SARS-CoV-2 also the Wuhan virus, which was rarely used outside the USA since it was advocated by President Trump and his policies toward China, and FDA EUA-authorized organization as the only other organisation as sibling of drive-thru COVID-19 testing facility. The latter may also be an instance of ran-out-of-time, since the authors of the accompanying paper (He et al. 2020) have diverse affiliations.
Regarding philosophical bias, this is evident from its embedding in the OBO Foundry suite (Smith et al. 2007), through its partial reuse of ontologies within that framework, such as OBI and IAO, as well as from the organisational principles of how the ontology is structured, which follow the BFO foundational ontology design principles (He et al. 2020).
In sum, it does try to take the science angle to representing knowledge about COVID-19, but with a few biases toward USA-centrism, which reduces its off-the-shelf potential. Or: if this were to be used in Europe or any of the key Global South countries with ample research, testing, or production capacities, such as India and South Africa, then they would have to modify it first.
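The inheritance step in the drug example above can be made concrete with a minimal sketch. It assumes a toy encoding of only the two axioms quoted from CIDO; the dictionaries SUPERCLASS and ASSERTED and the function entailed_restrictions are illustrative names, not part of CIDO or of any reasoner. It only mimics how an existential restriction asserted on a superclass also holds for every subclass.

```python
# Toy encoding of the two CIDO axioms quoted above; this is not CIDO itself.
# Direct superclass of each class (single inheritance suffices here).
SUPERCLASS = {
    "COVID-19 experimental drug in clinical trial": "COVID-19 drug",
}

# Existential restrictions asserted directly on a class: (property, filler).
ASSERTED = {
    "COVID-19 drug": {("treatment for", "COVID-19 disease process")},
}


def entailed_restrictions(cls):
    """Collect restrictions asserted on cls and on all of its ancestors."""
    found = set()
    while cls is not None:
        found |= ASSERTED.get(cls, set())
        cls = SUPERCLASS.get(cls)  # None once the top of the hierarchy is reached
    return found


trial_drug = "COVID-19 experimental drug in clinical trial"
print(entailed_restrictions(trial_drug))
```

Nothing is asserted on the experimental drug class itself, yet the restriction is returned for it, which is the unwarranted entailment described in the text.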
CODO
The CODO fares slightly better on the Laboratory test finding, which can be negative, positive, or pending, although here too the positive/negative may benefit from a relabelling. Also, it does have the well-known gender issue, captured in the axiom
Gender type ≡ {Female, Male}
A clear socio-cultural axiom in the ontology is
InfectedSpouse ⊑ InfectedFamilyMember
InfectedFamilyMember ⊑ Exposure to COVID-19
One can argue about omissions or time constraints, since the only family member that can be infected is the spouse according to CODO, but there may be more family members. The cultural bias here is the concept of the nuclear family that consists of the parents and their children. Globally more applicable would be to talk of a household, however that may be composed. This is because there may be live-in grandparents, cousins, nannies, domestic workers, and so on, and spouses may not live together in one household due to being migrant workers. An early example of such complexities in the context of COVID-19 can be found in (Parker and de Kadt 2020), and if CODO were to be used elsewhere, this branch of the ontology would have to be revised.
The purpose is indicated through its heavy use of data properties; hence, it is more like a model for recording data than for representing the science of COVID-19 or SARS-CoV-2. A substantial amount of information would be usable across countries trying to record data about patients. One class is specific to the country of its developers, India: Mild and very mild COVID-19, which is one of the three categories mandated by its government rather than a result of the modellers’ granularity bias, as the authors noted in the annotations of Patient.
COVoc
COVoc clearly states that its purpose is COVID-19 scientific literature ‘triage’, and it is informally well-known that knowledge organisation systems for literature annotation are focused on facilitating that rather than being concerned with ontological precision or correctness. Its contents are not clearly structured as a result of this bias, in the sense that there are many top-level terms and a mixing of classes and instances, but some aspects, such as the use of the IAO and import of the RO, may indicate some leaning to the OBO Foundry stack as well. Its actual contents regarding bias raise several questions.
One is of granularity, and perhaps also focus or time, which are straightforward omissions, such as listing only two continents, Asia and Europe (there are 4-7, depending on how one categorises), and a mixture of omission and politics regarding the countries, since there are 10 subclasses of Country, of which two are disputed (Hong Kong and Taiwan) and one is definitely an error, since West Africa is not a country but a region on the African continent.
Scientifically, the low-hanging fruit for bias detection is
Virus ⊑ Organism
because a virus is not an organism, and that there are several disorders that are subclasses of Disease, such as
headache disorder ⊑ Disease
anxiety disorder ⊑ Disease
whereas they are distinct medically. With a benefit of the doubt, one might argue they may be layperson common-sense assumptions, but these would then be rather serious ones for an ontology for scientific literature. Further scientific perspectives are built in by recording symptoms, such as Cough and Diarrhea, as subclasses of phenotype, with phenotype defined as “The detectable outward manifestations of a specific genotype.” This is a very gene-centric view on the body.
Gender is not present, but biological sex is used instead. The only biological sex recorded in COVoc is male. Published literature on women and COVID-19 easily dates back to March 2020 (e.g., (Li et al. 2020)), however, which is well before COVoc’s development.
Since CIDO and CODO had different test statuses, this was examined in COVoc as well. It has seven options: there is a possible case (meeting clinical criteria), a probable case (meeting clinical criteria, with epidemiological link, or meeting the diagnostic criteria), and once confirmed there are five types of infection: asymptomatic, mild, moderate, severe, and critical. There are no test outcomes, only rapid testing and serology test.
Since many terms are plain science terms, like replicase polyprotein 1a (BtCoV) and cryogenic electron microscopy, there are no obvious language or linguistic issues in the sense of bias, other than an English bias that nearly all existing ontologies have. One arguably may be COVoc’s Social distance compared to physical distancing, but the latter has the former as net effect and so the line is not clear. Economic motivations or possible benefits or losses are not evident either.
Discussion: Consequences of bias in ontologies
Having established that there are indeed biases in ontologies, does it really matter beyond the hypothetical issues and the increased morbidity and mortality in the case of the hormone replacement therapy? It does, and there are several ways in which it can have an effect, with the three principal ones being due to omissions, incorrect attributions, and undesirable deductions that are logically correct but not ontologically or not according to the other bias.
Omissions and incorrect attributions have a direct effect on data analysis, since they increase the amount of noise (technically speaking) when the ontology is used for ontology-based data access and literature annotation and search. For instance, while mortality rates of men are higher for COVID-19, relatively more women get infected; if that cannot be annotated, since absent in COVoc, then the emerging literature is harder to search to find studies on possible causes for why women test positive more often than men. Similarly, the lack of the concept of household, or at least more family members, in CODO prohibits finer-grained recording of the chain of infection, which makes it more likely to lose control of the spread of the virus.
Incorrect attributions have to do with the annotator not finding the desired knowledge in the ontology and then using something else for it. For instance, if, say, Ireland were to use CIDO, then the walk-through testing facility at Dublin Airport can be approximated by CIDO’s drive-thru one in the sense of passing by or FDA authorised in the sense of being an official test location. More generally: annotators choose approximations based on different criteria, so any data analysis then will both miss instances and have false positives. Also, and aside from the fact that the different variations on test outcomes contribute to the data integration problem, a presumptive positive annotation is, on the whole, an incorrect label about 45-95% of the time and would seriously distort epidemiological investigations and overload tracking and tracing efforts on top of it. That is, as long as the ontology does not fully characterise all the properties of an entity type so as to be clear on the exact semantics, there is a heavier reliance on the term, with language alone being an easier target to be used or interpreted with bias.
An example of an undesirable deduction resulting from a bias built into an ontology would be the drugs with CIDO, which is illustrated in Fig. 4. CIDO aims to facilitate data integration (He et al. 2020), which could be done with, say, ontology-based data access (OBDA) and integration to link data to ontologies (Poggi et al. 2008), where each class and object property in the ontology is mapped to a query over the database(s). A query over the ontology then avails of those mappings to retrieve the answer, together with the knowledge represented in the ontology. Hydroxychloroquine is still used as an experimental drug in COVID-19 clinical trials and is listed as such in the clinical trials registry database (there are 24 active trials with hydroxychloroquine for COVID-19 at the time of writing, of the 47 in total; https://clinicaltrials.gov/ct2/results?term=Hydroxychloroquine&Search=Apply&recrs=d&age_v=&gndr=&type=&rslt=; last accessed on 15-1-2021), so then the query “retrieve all COVID-19 drugs” will include hydroxychloroquine in the query answer, since it recursively retrieves the instances down in the class hierarchy for all COVID-19 drug subclasses. Hydroxychloroquine is definitely not a drug to effectively treat COVID-19, however, nor has it been approved for that purpose in any country.

[Figure 4 content: the class hierarchy COVID-19 drug, COVID-19 experimental drug, COVID-19 experimental drug in clinical trial; the ClinicalTrials.gov database and the FDA database; the query “retrieve all COVID-19 drugs”; the answer; and the two mappings:
Mapping: SELECT Drug FROM fda WHERE Condition = 'COVID-19';
Mapping: SELECT Intervention FROM CTgov WHERE Condition = 'COVID-19';]
Figure 4: Ontology-based data access and integration scenario with CIDO and two database tables, from ClinicalTrials.gov and the FDA (selection shown, and mappings to the OWL classes are abbreviated). Retrieving COVID-19 drug recursively fetches from the subclasses in the hierarchy and takes the union of the query answer to each subclass, thus then returning that hydroxychloroquine is already a COVID-19 drug, which is an undesirable deduction from both a scientific and regulatory standpoint.

None of the COVID-19 ontologies have any meaningful deductions along the line of the protein phosphatase experiment that deduced a novelty for human understanding of it at the theory level (Wolstencroft, Stevens, and Haarslev 2007), nor are they aimed at achieving that at present. Issues such as the robot in Example 1 and Fig. 1, which similarly can be transposed on the gender binary bias, would likely surface during ontology development, since typically the reasoner is used to eliminate errors and then the deductions are materialised so that the reasoner is not needed in time-sensitive applications other than for query answering. Alternatively, a light-weight ontology language is used from the start so that disagreements do not surface due to lack of language expressiveness, notably because of the absence of disjointness and qualified cardinality constraints. Therefore, our expectation is that the effects of bias with respect to reasoning consequences may be more salient in data management and retrieving information rather than in reasoning over the logical theory itself.
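The OBDA scenario of Fig. 4 can be sketched in a few lines, under stated assumptions. The two SQL mappings and the class names are taken from the figure; the table rows are invented toy data, the assignment of each mapping to a particular class is our assumption (the figure abbreviates it), and retrieve is an illustrative helper rather than the API of any OBDA system.

```python
import sqlite3

# In-memory stand-ins for the two databases; all rows are hypothetical toy data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fda (Drug TEXT, Condition TEXT);
    CREATE TABLE CTgov (Intervention TEXT, Condition TEXT);
    INSERT INTO fda VALUES ('drug A (hypothetical)', 'COVID-19');
    INSERT INTO CTgov VALUES ('hydroxychloroquine', 'COVID-19');
    INSERT INTO CTgov VALUES ('substance B (hypothetical)', 'other condition');
""")

# Class hierarchy as in Fig. 4: each class lists its direct subclasses.
SUBCLASSES = {
    "COVID-19 drug": ["COVID-19 experimental drug"],
    "COVID-19 experimental drug": ["COVID-19 experimental drug in clinical trial"],
}

# Mappings from Fig. 4; which class each one is attached to is an assumption.
MAPPINGS = {
    "COVID-19 drug":
        "SELECT Drug FROM fda WHERE Condition = 'COVID-19'",
    "COVID-19 experimental drug in clinical trial":
        "SELECT Intervention FROM CTgov WHERE Condition = 'COVID-19'",
}


def retrieve(cls):
    """Union of the mapped answers for cls and, recursively, its subclasses."""
    answers = set()
    sql = MAPPINGS.get(cls)
    if sql is not None:
        answers.update(row[0] for row in conn.execute(sql))
    for sub in SUBCLASSES.get(cls, []):
        answers |= retrieve(sub)
    return answers


drugs = retrieve("COVID-19 drug")
print(sorted(drugs))
```

Under these assumptions the answer to “retrieve all COVID-19 drugs” contains hydroxychloroquine purely because of the subclass axiom, even though it only appears in the trials table.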
Conclusions
Bias in models easily creeps into an ontology for various reasons. Eight types of sources of bias for ontologies were identified and illustrated: philosophical, purpose, science, granularity, linguistic, socio-cultural, political or religious, and economic motives. Some of them are explicit, and some may be either explicit or implicit. Three COVID-19 ontologies that were developed at the same time by different groups were assessed on these types of bias, which showed that each one exhibited a subset of the types of sources of bias. This first characterisation and comparative assessment may contribute to further research into ethical aspects of ontologies, both the modelling component and how it affects their use in applications.
As future work, we plan to look into a systematic way of assessing and annotating explicit choices in the ontology, since ontologies tend to be decoupled from any possible related ontology paper that otherwise could have provided context.
References
Arp, R.; Smith, B.; and Spear, A. D. 2015. Building Ontologies with Basic Formal Ontology. USA: The MIT Press.
Baader, F.; Calvanese, D.; McGuinness, D. L.; Nardi, D.; and Patel-Schneider, P. F., eds. 2008. The Description Logics Handbook – Theory and Applications. Cambridge University Press, 2nd edition.
Dumontier, M.; Baker, C.; Baran, J.; Callahan, A.; Chepelev, L.; Cruz-Toledo, J.; Del Rio, N.; Duck, G.; Furlong, L.; Keath, N.; Klassen, D.; McCusker, J.; Queralt-Rosinach, N.; Samwald, M.; Villanueva-Rosales, N.; Wilkinson, M.; and Hoehndorf, R. 2014. The Semanticscience Integrated Ontology (SIO) for biomedical research and knowledge discovery. Journal of Biomedical Semantics
Proc. of 12th Int. Conf. on Knowledge Engineering and Ontology Development (KEOD’20). INSTICC.
Gene Ontology Consortium. 2000. Gene Ontology: tool for the unification of biology. Nature Genetics 25: 25–29.
Gomes, D. L.; and Bragato Barros, T. H. 2020. The Bias in Ontologies: An Analysis of the FOAF Ontology. In Lykke, M.; Svarre, T.; Skov, M.; and Martínez-Ávila, D., eds., Proceedings of the Sixteenth International ISKO Conference, 236–244. Ergon-Verlag. doi:10.5771/9783956507762-236.
Haendel, M. A.; McMurry, J. A.; Relevo, R.; Mungall, C. J.; Robinson, P. N.; and Chute, C. G. 2018. A Census of Disease Ontologies. Annual Review of Biomedical Data Science
Proceedings of the 11th International Conference on Biomedical Ontologies, volume 28xx. CEUR-WS.
Hepp, M. 2008. GoodRelations: An Ontology for Describing Products and Services Offers on the Web. In Proceedings of the 16th International Conference on Knowledge Engineering and Knowledge Management (EKAW’08), volume 5268 of LNCS, 332–347. Springer.
Jindal, R.; Seeja, K.; and Jain, S. 2020. Construction of domain ontology utilizing formal concept analysis and social media analytics. International Journal of Cognitive Computing in Engineering 1: 62–69. ISSN 2666-3074. doi:https://doi.org/10.1016/j.ijcce.2020.11.003.
Juel Vang, K. 2013. Ethics of Google’s Knowledge Graph: some considerations. Journal of Information, Communication and Ethics in Society
Peace & Conflict Review
Keet, C. M. 2018. An introduction to ontology engineering, volume 20 of Computing. UK: College Publications. 334p.
Keet, C. M.; and Khumalo, L. 2018. On the ontology of part-whole relations in Zulu language and culture. In Borgo, S.; and Hitzler, P., eds., volume 306 of FAIA, 225–238. IOS Press. 17-21 September, 2018, Cape Town, South Africa.
Li, N.; Han, L.; Peng, M.; Lv, Y.; Ouyang, Y.; Liu, K.; Yue, L.; Li, Q.; Sun, G.; Chen, L.; and Yang, L. 2020. Maternal and Neonatal Outcomes of Pregnant Women With Coronavirus Disease 2019 (COVID-19) Pneumonia: A Case-Control Study. Clinical Infectious Diseases
Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, AIES 2019, Honolulu, HI, USA, January 27-28, 2019, 147–153. doi:10.1145/3306618.3314257.
Lundin, A.; Hallgren, M.; Forsman, M.; and Forsell, Y. 2015. Comparison of DSM-5 Classifications of Alcohol Use Disorders With Those of DSM-IV, DSM-III-R, and ICD-10 in a General Population Sample in Sweden. J Stud Alcohol Drugs
Applied Ontology
O’Neil, C. 2016. Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy. Crown.
Parker, A.; and de Kadt, J. 2020. Household characteristics in relation to COVID-19 risks in Gauteng. URL https://gcro.ac.za/data-gallery/interactive-data-visualisations/detail/household-characteristics-relation-covid-19-risks-gauteng/.
Partridge, C.; Mitchell, A.; Cook, A.; Leal, D.; Sullivan, J.; and West, M. 2020. A Survey of Top-Level Ontologies - to inform the ontological choices for a Foundation Data Model. Technical report, The Construction Innovation Hub, Centre for Digital Built Britain. doi:https://doi.org/10.17863/CAM.58311.
Pendlington, Z. M.; Roncaglia, P.; Matentzoglu, N.; Osumi-Sutherland, D.; Caucheteur, D.; Gobeill, J.; Mottin, L.; Agosti, D.; Ruch, P.; and Parkinson, H. 2020. COVoc: a COVID-19 ontology to support literature triage. URL https://raw.githubusercontent.com/CIDO-ontology/WCO/master/day-1/Zoe COVoc.pdf. WCO-2020: Workshop on COVID-19 Ontologies.
Poggi, A.; Lembo, D.; Calvanese, D.; De Giacomo, G.; Lenzerini, M.; and Rosati, R. 2008. Linking Data to Ontologies. J. on Data Semantics X: 133–173.
Rautenbach, J.; and Keet, C. 2020. Toward equipping Artificial Moral Agents with multiple ethical theories. In RobOntics: International Workshop on Ontologies for Autonomous Robotics, volume 2708 of CEUR-WS, 7.
Smith, B.; Ashburner, M.; Rosse, C.; Bard, J.; Bug, W.; Ceusters, W.; Goldberg, L.; Eilbeck, K.; Ireland, A.; Mungall, C.; OBI Consortium, T.; Leontis, N.; Rocca-Serra, A.; Ruttenberg, A.; Sansone, S.-A.; Shah, M.; Whetzel, P.; and Lewis, S. 2007. The OBO Foundry: Coordinated Evolution of Ontologies to Support Biomedical Data Integration. Nature Biotechnology
Genome Biology 6: R46.
TOS Obesity as a Disease Writing Group; Allison, D. B.; Downey, M.; Atkinson, R. L.; Billington, C. J.; Bray, G. A.; Eckel, R. H.; Finkelstein, E. A.; Jensen, M. D.; and Tremblay, A. 2008. Obesity as a Disease: A White Paper on Evidence and Arguments Commissioned by the Council of The Obesity Society. Obesity
Knowledge Engineering Review
Proceedings of the 11th European Conference on Information Warfare and Security, 286–295. Academic Publishing International.
Wakefield, J. C. 2015. DSM-5 substance use disorder: How conceptual missteps weakened the foundations of the addictive disorders field. Acta Psychiatrica Scandinavica