BiasFinder: Metamorphic Test Generation to Uncover Bias for Sentiment Analysis Systems
Muhammad Hilmi Asyrofi, Imam Nur Bani Yusuf, Hong Jin Kang, Ferdian Thung, Zhou Yang, David Lo
Abstract—Artificial Intelligence (AI) software systems, such as Sentiment Analysis (SA) systems, typically learn from large amounts of data that may reflect human biases. Consequently, the machine learning model in such software systems may exhibit unintended demographic bias based on specific characteristics (e.g., gender, occupation, country-of-origin, etc.). Such biases manifest in an SA system when it predicts a different sentiment for similar texts that differ only in the characteristic of the individuals described. Existing studies on revealing bias in SA systems rely on the production of sentences from a small set of short, predefined templates. To address this limitation, we present BiasFinder, an approach to discover biased predictions in SA systems via metamorphic testing. A key feature of BiasFinder is the automatic curation of suitable templates based on pieces of text from a large corpus, using various Natural Language Processing (NLP) techniques to identify words that describe demographic characteristics. Next, BiasFinder instantiates new text from these templates by filling in placeholders with words associated with a class of a characteristic (e.g., gender-specific words such as female names, "she", "her"). These texts are used to tease out bias in an SA system. BiasFinder identifies a bias-uncovering test case when it detects that the SA system exhibits demographic bias for a pair of texts, i.e., it predicts a different sentiment for texts that differ only in words associated with a different class (e.g., male vs. female) of a target characteristic (e.g., gender). Our empirical evaluation showed that BiasFinder can effectively create a large number of realistic and diverse test cases that uncover various biases in an SA system, with a high true positive rate of up to 95.8%.
Index Terms—sentiment analysis, test case generation, metamorphic testing, bias, fairness bug
M.H. Asyrofi, I.N.B. Yusuf, H.J. Kang, F. Thung, Z. Yang, and D. Lo are with the School of Computing and Information Systems, Singapore Management University.

1 INTRODUCTION

Many modern software systems comprise AI components that make decisions. In AI systems, fairness is considered to be an important non-functional requirement; bias in AI systems, reflecting discriminatory behavior towards unprivileged groups, can lead to real-world harms. To address this requirement, software engineering research techniques, such as test generation, have been applied to detect bias [1]–[5]. While various techniques have been proposed for test generation of machine learning systems [1]–[4], there have been limited studies on detecting biases in text-based machine learning systems [5]. Text-based ML systems have numerous applications; for example, NLP techniques have been used for Sentiment Analysis (SA). It is, therefore, important that biases in these systems can be detected before these systems are deployed.

SA systems are used to measure the attitudes and affects in text reviews about an entity, such as a movie or a news article [6], [7]. In this work, we focus on uncovering bias in SA for two reasons.

Firstly, SA has widespread adoption in many domains [8], [9], including politics [10], [11], finance [12]–[15], business [16], education [17]–[19], and healthcare [20]–[22]. In the research community, SA continues to be widely studied [23]–[28]. In industry, many companies, such as Microsoft (https://azure.microsoft.com/en-us/services/cognitive-services/text-analytics/) and Google (https://cloud.google.com/natural-language/docs/analyzing-sentiment), have developed and provided APIs for software developers to access SA capabilities. This suggests the prevalence of SA in real-life applications.

Secondly, SA generalizes to other areas of NLP. Some NLP researchers have considered SA to be "mini-NLP" [9], as research on SA techniques builds on top of a wide range of topics and tasks in the NLP domain. Cambria et al. [29] argue that SA is a problem with a composite nature, requiring 15 more fundamental NLP problems to be addressed at the same time. Therefore, we believe that tackling bias in SA is a suitable first step that could lead to a more general approach to detect bias in textual data.

Modern SA models have outstanding performance on benchmark datasets, which demonstrates their effectiveness. However, there has been a growing understanding in both the Software Engineering [3] and Artificial Intelligence [5] research communities that it is important to study non-functional requirements, such as fairness, which have been overlooked. AI systems learn from data generated by humans. In the case of SA, the training data is typically a dataset of human-written reviews. The training data may reflect human biases. SA systems may, therefore, exhibit biases towards a demographic characteristic, such as gender [30], [31]. For example, the sentiment predicted by an SA system may differ for a piece of text after a perturbation in the text to replace words that describe a demographic characteristic, e.g., changing "I am an Asian man" into "I am a black woman" may cause a predicted sentiment to change from
positive to negative, therefore showing that the SA system reflects demographic bias.

As SA systems are used in many domains, including sensitive areas such as healthcare, and may be used for business analytics to make critical business decisions, it is important to detect biases in these systems. Early discovery of these biases will help to prevent the perpetuation of human biases, and aid to prevent real-world harms. To do so, SA systems should be tested for fairness (i.e., the absence of unintended bias), as existing studies suggest [5], [30]. Prior studies have relied on a small number of templates to generate short texts that may uncover bias. Specifically for SA systems, Kiritchenko and Mohammad [30] propose EEC, which generates test cases produced from 11 handcrafted templates. These test cases help to detect if an SA system predicts a different sentiment given two texts that differ only in a single word associated with a different gender or race.

These test cases are limited in number and may not adequately uncover biases in a system. Very recently, SA researchers [9] have noted that "the templates utilized to create the examples might be too simplistic" and identifying such biases "might be relatively easy". They suggest that "Future work should design more complex cases that cover a wider range of scenarios." In this work, our goal is to address these limitations of handcrafted templates by automatically generating test cases to uncover biases.

We propose BiasFinder, a framework that automatically generates test cases to discover biased predictions in SA systems. BiasFinder automatically identifies and curates suitable texts in a large corpus of reviews, and transforms these texts into templates. Each template can be used to produce a large number of mutant texts, by filling in placeholders with concrete values associated with a class (e.g., male vs. female) given a demographic characteristic (e.g., gender). Using these mutant texts, BiasFinder then runs the SA system under test, checking if it predicts the same sentiment for two mutants associated with a different class (e.g., male vs. female) of the given characteristic (e.g., gender). A pair of such mutants are related through a metamorphic relation where they share the same predicted sentiment from a fair SA system.

The key feature of BiasFinder is its automatic identification and transformation of suitable text in a corpus to a template. This allows BiasFinder to produce a large number of test cases that are varied and realistic compared to previous approaches. Identifying suitable texts to transform to a template is challenging. For instance, all references to an entity should be replaced in a consistent way that does not make the text (e.g., a paragraph) incoherent. An example is shown in Figure 1, in which all expressions referring to an entity ("Jake") need to be updated. The name "Jake" and its references (bolded and underlined) need to be updated together for the text to remain coherent. BiasFinder addresses this challenge through the use of Natural Language Processing (NLP) techniques, such as coreference resolution and named entity recognition, to find all words that require modification.

Our framework, BiasFinder, can be instantiated to identify different kinds of bias. In this work, we show how BiasFinder can be instantiated to uncover bias in three different demographic characteristics: gender, occupation, and country-of-origin.

Original Text
It seems that Jake with all his knowledge of the great outdoors didn't realize the danger! He enters a mine shaft that's leaking with dangerous gas!

Mutant Text

It seems that Julia with all her knowledge of the great outdoors didn't realize the danger! She enters a mine shaft that's leaking with dangerous gas!
Fig. 1. An example of how all references to the same entity have to be considered when mutating a text to be associated with a different gender.

We empirically evaluate the 3 instances of BiasFinder by running them on an SA model based on BERT [32], a state-of-the-art text analysis engine. By running BiasFinder on the IMDB Dataset of 50K Movie Reviews [33], we produce test cases of texts (paragraphs) resembling movie reviews. We show the effectiveness of BiasFinder in uncovering biases in an SA system; BiasFinder can generate many pairs of texts revealing biases exhibited by the SA system. Additionally, we evaluate whether the pairs of texts are coherent and should have the same sentiment (although predicted to be of different sentiments by the SA system) by performing a user study. We find that BiasFinder achieves a reasonable true positive rate of up to 95.8%.

The contributions of our work are:

• We propose BiasFinder, a framework that uncovers bias in SA systems through the automatic generation of a large number of realistic test cases given a target characteristic. The source code of BiasFinder is publicly available at https://github.com/soarsmu/BiasFinder.
• BiasFinder automatically identifies and curates appropriate and realistic texts (of various complexity) and transforms them into templates that can be instantiated to detect bias. Prior work only considers a small set of manually-crafted simple templates.
• We evaluate BiasFinder on the IMDB Dataset of 50K Movie Reviews [33], and generate 129,598 bias-uncovering test cases over 3 demographic characteristics: gender, occupation, and country-of-origin.

The rest of this paper is organized as follows. Section 2 introduces the necessary background related to our work. Section 3 presents BiasFinder. Section 4 elaborates GenderBiasFinder, an instantiation of BiasFinder to detect gender bias. Section 5 briefly discusses instantiations of BiasFinder for detecting occupation and country-of-origin bias. Section 6 describes the results of our experiments. Section 7 presents related work. Finally, Section 8 concludes this paper and describes some future work.
2 PRELIMINARIES
This section provides more details of the benchmark dataset that motivates our work (Section 2.1), as well as basic NLP operations that we use as building blocks of our proposed approach (Section 2.2).
2.1 Equity Evaluation Corpus

The Equity Evaluation Corpus (EEC) is a benchmark dataset proposed by Kiritchenko and Mohammad [30] to reveal bias in NLP systems. Its templates are listed in Table 1.

TABLE 1
Templates in EEC
No   Template                                                      # Sentences
1    ⟨person⟩ feels ⟨emotion⟩.                                     1,200
2    The situation makes ⟨person⟩ feel ⟨emotion⟩.                  1,200
3    I made ⟨person⟩ feel ⟨emotion⟩.                               1,200
4    ⟨person⟩ made me feel ⟨emotion⟩.                              1,200
5    ⟨person⟩ found himself/herself in a/an ⟨emotion⟩ situation.   1,200
6    ⟨person⟩ told us all about the recent ⟨emotion⟩ events.       1,200
7    The conversation with ⟨person⟩ was ⟨emotion⟩.                 1,200
     Sentences with no emotion words:
8    I saw ⟨person⟩ in the market.                                 60
9    I talked to ⟨person⟩ yesterday.                               60
10   ⟨person⟩ goes to the school in our neighborhood.              60
11   ⟨person⟩ has two children.                                    60
Templates in the EEC have two placeholders: ⟨person⟩ and ⟨emotion⟩. Mutant texts are generated by instantiating each placeholder with a predefined value. Predefined values for the placeholder ⟨person⟩ are:

• Common African American female or male first names, and common European American female or male first names, taken from Caliskan et al. [34];
• Noun phrases referring to females, such as "my daughter", and noun phrases referring to males, such as "my son".

The second placeholder, ⟨emotion⟩, corresponds to four basic emotions: anger, fear, joy, and sadness. For each emotion, EEC selects five words from Roget's Thesaurus with varying intensities.

Although the EEC has successfully revealed bias in NLP systems [30], it is limited only to gender and race bias. It does not explore bias against other demographic information (e.g., occupation) that may also lead to inappropriate behavior of Sentiment Analysis and other NLP systems. Furthermore, the templates used to create the text dataset may be too short and simplistic, as argued by Poria et al. [9]. We suggest that a system that has the capability to automatically create templates to produce more diverse and more complex sentences can aid in better uncovering bias in Sentiment Analysis systems.

2.2 Basic NLP Operations

2.2.1 Part-of-Speech Tagging

Part-of-speech tagging (PoS-tagging) is the process of identifying the part of speech (e.g., noun, verb) that each word in a text belongs to [35]. An example of PoS-tagging is shown in Figure 2. In the example text, "Maria" is tagged as a proper noun (PROPN); "has" and "loves" are tagged as verbs (VERB); and "She" and "him" are tagged as pronouns (PRON).

2.2.2 Named Entity Recognition

Named entity recognition (NER) automatically identifies named entities in a text and groups them into predefined categories. Examples of named entities are people, organizations, occupations, and geographic locations [36]. An example of NER can be found in Figure 2, where the word "Maria" is assigned to the "PERSON" category. In this work, we are mainly interested in the person (for gender and country-of-origin bias) and occupation (for occupation bias) categories.
2.2.3 Coreference Resolution

Finding all expressions that refer to the same entity in a text is known as coreference resolution [37]. Linking such expressions is useful for many NLP tasks where the correct interpretation of a piece of text has to be derived (e.g., document summarization, question answering). Coreference resolution only links expressions together, and does not identify the types of the referenced entities, which is done through NER. An example of coreference resolution can be found in Figure 2, in which the expressions "Maria" and "She" are linked. Likewise, the expressions "a friend" and "him" are linked as they refer to the same entity. Given an input text, running coreference resolution on it will produce n lists of references; each list corresponds to references to a single entity.

Input Text
Maria has a friend. She loves him.

PoS-tagging

Maria | PROPN, has | VERB, a | DET, friend | NOUN, . | PUNCT, She | PRON, loves | VERB, him | PRON, . | PUNCT

NER

Maria | PERSON

Coreference Resolution

Maria, She
a friend, him
Fig. 2. Example of PoS-tagging, NER, and coreference resolution. There are two entities identified by the coreference resolution, "Maria" and "a friend", and the expressions referring to these entities are linked.
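The three operations above can be reproduced with SpaCy and NeuralCoref, the libraries our experiments use (Section 6). The sketch below is illustrative rather than BiasFinder's actual code; the exact tags and cluster boundaries depend on the pretrained models.

```python
import spacy
import neuralcoref  # pip install neuralcoref; works with spaCy 2.x

nlp = spacy.load("en_core_web_sm")
neuralcoref.add_to_pipe(nlp)  # adds coreference resolution to the pipeline

doc = nlp("Maria has a friend. She loves him.")

# PoS-tagging: one (word, tag) pair per token
print([(t.text, t.pos_) for t in doc])   # e.g., ('Maria', 'PROPN'), ('She', 'PRON'), ...

# NER: named entities with their categories
print([(e.text, e.label_) for e in doc.ents])   # [('Maria', 'PERSON')]

# Coreference resolution: one cluster (list of mentions) per entity
for cluster in doc._.coref_clusters:
    print([m.text for m in cluster.mentions])   # ['Maria', 'She'] and ['a friend', 'him']
```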
2.2.4 Dependency Parsing

The process of assigning a grammatical structure to a piece of text and encoding dependency relationships between words is known as dependency parsing [38], [39]. Encoding such information as a parse tree, words in a text are connected such that words that modify each other are linked. For example, a dependency parse tree connects a verb to its subject and object, and a noun to its adjectives.

Figure 3 shows an example of a parse tree that is output by performing dependency parsing of an input text: "That guy from Blade Runner also cops a good billing". The directed, labeled edges between nodes indicate the relationships between the parent and child nodes. From the parse tree, the root word of a phrase can be identified. For example, the root word of the phrase "That guy from Blade Runner" represented in Figure 3 is "guy", as its node does not have any incoming edges from the nodes of other words in the phrase.

Fig. 3. Example of a dependency parse tree for the sentence "That guy from Blade Runner also cops a good billing". The root word of the phrase "That guy from Blade Runner" (bolded in the above image) is "guy".
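Reading a phrase's root off a dependency parse can likewise be illustrated with SpaCy (an illustrative sketch; exact dependency labels vary with the parsing model). SpaCy's Span.root is the token whose head lies outside the span, which for the phrase in Figure 3 is "guy":

```python
doc = nlp("That guy from Blade Runner also cops a good billing.")

# Each token points to its syntactic head via a labeled arc
for tok in doc[:5]:
    print(f"{tok.text} --{tok.dep_}--> {tok.head.text}")
# roughly: That --det--> guy, guy --nsubj--> cops, from --prep--> guy, ...

phrase = doc[0:5]         # the span "That guy from Blade Runner"
print(phrase.root.text)   # guy
```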
3 BIASFINDER
Figure 4 shows the architecture of our proposed approach, BiasFinder. It takes, as input, a collection of texts and a sentiment analysis (SA) system, and produces, as output, a set of bias-uncovering test cases. BiasFinder has three components: (A) a template generation engine, (B) a mutant generation engine, and (C) a failure detection engine. The template generation engine generates bias-targeting templates from a collection of texts. These templates are designed to target bias towards a specific characteristic (e.g., gender). The generated templates are input to the mutant generation engine. This engine generates text variants (mutants) that differ in a target bias characteristic (e.g., two paragraphs, which are otherwise identical, but describe an individual using words associated with a different gender) and should have the same sentiment. These mutants are then input to the failure detection engine. This engine makes use of the metamorphic relation between mutants (i.e., they have the same sentiment as they are generated from the same template) to infer failures (i.e., bias). It identifies mutants that uncover bias in the SA system; these mutants are output as the bias-uncovering test cases.
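As a sketch, the pipeline composes the three engines as follows. The function names are illustrative stand-ins, not BiasFinder's actual API; each engine is elaborated in the subsections below.

```python
from typing import Callable, List, Optional, Tuple

Mutant = Tuple[str, str]   # (class label, mutant text), e.g., ("male", "...")
BTC = Tuple[str, str]      # a pair of mutant texts with conflicting predictions

def biasfinder(texts: List[str],
               sa_system: Callable[[str], str],
               make_template: Callable[[str], Optional[str]],
               make_mutants: Callable[[str], List[Mutant]],
               detect_failures: Callable[[List[Mutant], Callable[[str], str]], List[BTC]]
               ) -> List[BTC]:
    """Compose the three engines of Figure 4."""
    btcs: List[BTC] = []
    for text in texts:
        template = make_template(text)        # (A) returns None for unsuitable texts
        if template is None:
            continue
        mutants = make_mutants(template)      # (B) one mutant per class/value combination
        btcs.extend(detect_failures(mutants, sa_system))  # (C) metamorphic-relation check
    return btcs                               # the bias-uncovering test cases
```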
Fig. 4. Architecture of BiasFinder

3.1 Template Generation Engine

The template generation engine follows the workflow in Figure 5. It takes a collection of texts as input and produces bias-targeting templates. Each template is a text unit (e.g., a paragraph) that contains one or more placeholders; the placeholders can be substituted with concrete values to generate different pieces of text that should have the same sentiment.

This engine generates templates for detecting bias in a target characteristic (e.g., gender, occupation, etc.). It extracts linguistic features such as named entities, coreferences, and parts-of-speech (Step 1). Using these features, it identifies entities related to the characteristic of the targeted bias (Step 2). If such entities exist in the texts, BiasFinder replaces references to these entities with placeholders. Essentially, the texts are converted to templates which will be used to generate mutant texts for uncovering the targeted bias (Step 3).

Fig. 5. Workflow of Template Generation Engine
3.2 Mutant Generation Engine

To generate mutant texts from a bias-targeting template, this engine replaces template placeholders with concrete values taken from pre-determined lists of possible values. These lists differ based on the target bias under consideration. The engine substitutes the placeholders with concrete values while ensuring that the generated mutants are valid. A mutant is valid iff the values that are assigned to the placeholders are in agreement with each other. For example, we do not want to generate the following text: "The man speaks to herself". The engine ensures this does not occur by picking only values from a single class (e.g., male-related words) to substitute related placeholders when generating a mutant. Each generated mutant is thus associated with a class, and BiasFinder's goal is to check if an SA system discriminates against one of the classes (e.g., male or female) associated with a target characteristic (e.g., gender).
3.3 Failure Detection Engine

The failure detection engine takes as input a set of mutant texts along with their class labels, and produces a set of bias-uncovering test cases. It feeds the mutants one-by-one to the SA system, which outputs a sentiment label for each mutant. Mutants of differing classes that are produced from the same template are expected to have the same sentiment. Therefore, if the SA system predicts that two mutants of different classes have different sentiments, they are evidence of a biased prediction. Such pairs of mutants are output as bias-uncovering test cases.

3.4 Instantiating BiasFinder

BiasFinder can be instantiated in various ways to uncover different kinds of biases. In this work, we investigate 3 instances of BiasFinder that can uncover gender, occupation, and country-of-origin biases of a sentiment analysis (SA) system. To instantiate BiasFinder for a particular target characteristic, we need to customize its three components: the template generation engine, the mutant generation engine, and the failure detection engine. We elaborate on how we create GenderBiasFinder, an instance of BiasFinder targeting gender bias, in Section 4, and briefly describe the two other instances of BiasFinder in Section 5.

Algorithm 1: Generating a Template for Detecting Gender Bias

    Input: s: a text, gn: gender nouns
    Output: t: a template or null
     1  t = s;
     2  corefs = getCoreferences(s);
     3  names = getPersonNamedEntities(s);
     4  coref = filter(corefs, names, gn);
     5  if coref ≠ null then
     6      for r ∈ coref do
     7          if isPersonName(r, names) then
     8              t = createPlaceholder(t, s, r, names);
     9          end
    10          else if isGenderPronoun(r) then
    11              t = createPlaceholder(t, s, r);
    12          end
    13          else if hasGenderNoun(r, gn) then
    14              t = createPlaceholder(t, s, r, gn);
    15          end
    16      end
    17  end
    18  return t == s ? null : t
4 GENDERBIASFINDER
An SA system exhibits gender bias if it behaves differently for texts that only differ in words that reflect gender. GenderBiasFinder generates mutants by changing words associated with the gender of a person, and uncovers gender bias when the SA system predicts differing sentiments for a pair of mutants of different gender classes. In this work, we focus on binary genders: male and female; however, our approach can be extended and generalized for non-binary genders. To uncover gender bias, we customize the three main engines of BiasFinder: the Template Generation Engine, the Mutant Generation Engine, and the Failure Detection Engine.
4.1 Template Generation Engine

Algorithm 1 shows the process for generating templates for uncovering gender bias. Given an input text, GenderBiasFinder extracts linguistic features in the form of parts-of-speech, named entities referring to person names, and coreferences. GenderBiasFinder uses coreference resolution (see Section 2.2.3) to find references to entities in the text (Line 2). References to a unique entity are grouped together in a list. The output of the coreference resolution is n lists, where n is the total number of entities mentioned in the text, which we refer to as corefs. We also run named entity recognition (see Section 2.2.2) to identify person named entities (e.g., person names) in the text (Line 3).

Next, we filter the coreference lists in corefs by performing two checks embedded inside the function filter (Line 4):

1) There is only one list in corefs that refers to a person. In this work, we consider any of the following as a reference to a person: (i) a person name, (ii) a gender pronoun (i.e., he, she), or (iii) a phrase containing a gender noun (e.g., "that guy from Blade Runner").
2) All references in the list identified above must be a reference to a person.

If both conditions are met, filter returns a coreference list coref satisfying the conditions; otherwise, it returns null. These checks are done to avoid the generation of unsound templates due to coreference resolution's limitations, e.g., detecting a set of references to the same entity as two disjoint lists. If a coref is returned, GenderBiasFinder iterates over all its references and creates placeholders depending on the type of each reference r (Lines 7-15). At the end of the iteration, we output a template t generated from the input text s (Line 18). For each iteration, we have three cases depending on the type of each r:
Case 1: The Reference is a Person Name (Lines 7-9)
At Line 7, GenderBiasFinder checks whether the reference r is a person's name in the list of names names extracted using named entity recognition (see Section 2.2.2). If this is the case, GenderBiasFinder generates a template by replacing the person's name with the ⟨name⟩ placeholder (Line 8). In the example shown in Figure 6, "Drew Barrymore" is a person's name and is replaced with this placeholder.

Text

'Never Been Kissed' is a real feel good film. If you haven't seen it yet, then rent it out. I am going to buy it when its released because I loved it.
Drew Barrymore is excellent again, she plays her part well. I felt I could relate to this film because of the school days I had were just as bad.
Coreferences

Drew Barrymore, she, her

Person Named Entity

Drew Barrymore

Generated Template

'Never Been Kissed' is a real feel good film. If you haven't seen it yet, then rent it out. I am going to buy it when its released because I loved it. ⟨name⟩ is excellent again, ⟨pro-spp⟩ plays ⟨pro-pp⟩ part well. I felt I could relate to this film because of the school days I had were just as bad.

Fig. 6. An Illustrative Example for Cases 1 and 2 of GenderBiasFinder
Case 2: The Reference is a Gender Pronoun (Lines 10-12)
GenderBiasFinder checks if the reference r is a gender pronoun (Line 10). If so, GenderBiasFinder converts the gender pronoun into ⟨pro-id⟩ (Line 11), where id can take several values according to the type of the gender pronoun that the placeholder replaces: (1) spp for a subjective personal pronoun (i.e., he and she), (2) opp for an objective personal pronoun (i.e., him and her), (3) pp for a possessive pronoun (i.e., his and her), and (4) rp for a reflexive pronoun (i.e., himself and herself). In the example shown in Figure 6, "she" is converted to the ⟨pro-spp⟩ placeholder, while "her" is converted to the ⟨pro-pp⟩ placeholder.
Case 3: The Reference has a Gender Noun (Lines 13-15)
GenderBiasFinder checks if the root word of the reference r is a gender noun (Line 13). GenderBiasFinder utilizes dependency parsing (see Section 2.2.4) to find the root word and performs PoS-tagging (see Section 2.2.1) to confirm that the root word is a noun. Next, it checks whether the word exists in gn, a collection of gender-related nouns, and if it does, converts the root word to the ⟨gaw⟩ placeholder (Line 14). In the example shown in Figure 7, the reference is "That guy from "Blade Runner"". By performing dependency parsing and PoS-tagging, "guy" is identified as the root word and is a noun. GenderBiasFinder checks whether "guy" exists in gn. As it does, GenderBiasFinder replaces "guy" with a ⟨gaw⟩ placeholder. Some examples of gender nouns are shown in Table 3. In total, we use 22 gender nouns.

Text
Even the manic loony who hangs out with the bad guys in "Mad Max" is there. That guy from "Blade Runner" also cops a good billing, although he only turns up at the beginning and the end of the movie.

Coreferences

That guy from "Blade Runner", he

Dependency Parsing of the Reference
PoS-tagging of the Reference

That | DET, guy | NOUN, from | ADP, " | PUNCT, Blade | PROPN, Runner | PROPN, " | PUNCT

Generated Template

Even the manic loony who hangs out with the bad guys in "Mad Max" is there. That ⟨gaw⟩ from "Blade Runner" also cops a good billing, although ⟨pro-spp⟩ only turns up at the beginning and the end of the movie.

Fig. 7. An Illustrative Example for Case 3 of GenderBiasFinder
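A condensed sketch of the three cases follows. For readability it operates on single mention strings; the actual engine rewrites the full text using the character offsets returned by the coreference resolver, and it disambiguates the pronoun "her" (objective vs. possessive) with PoS tags. The helper root_of stands in for the dependency-parse lookup of Section 2.2.4, and the noun set is a small subset of the 22 gender nouns.

```python
PRONOUN_IDS = {"he": "spp", "she": "spp", "him": "opp", "his": "pp",
               "himself": "rp", "herself": "rp"}   # "her" needs PoS tags: opp vs. pp
GENDER_NOUNS = {"guy", "boy", "girl", "brother", "sister"}  # subset of the 22 nouns

def to_placeholder(mention, person_names, root_of):
    """Map one coreference mention to its placeholder form (Cases 1-3)."""
    if mention in person_names:                 # Case 1: person name
        return "<name>"
    pid = PRONOUN_IDS.get(mention.lower())
    if pid is not None:                         # Case 2: gender pronoun
        return f"<pro-{pid}>"
    root = root_of(mention)                     # Case 3: phrase rooted in a gender noun
    if root in GENDER_NOUNS:
        return mention.replace(root, "<gaw>")
    return mention                              # otherwise leave the mention unchanged

print(to_placeholder("That guy from 'Blade Runner'", set(), lambda m: "guy"))
# That <gaw> from 'Blade Runner'
```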
4.2 Mutant Generation Engine

For each generated template, the mutant generation engine produces multiple mutants by replacing placeholders with concrete values. As our objective in GenderBiasFinder is to create test cases related to gender, each mutant is associated with a gender class (i.e., male or female), and the mutant generation engine is restricted to values associated with the given gender class when filling in all placeholders for one mutant. The engine iterates over all possible combinations of the values. Each placeholder can be substituted by a value from a set. We describe the values that each placeholder can be substituted with below:

⟨name⟩ Placeholder: Values to be substituted for this placeholder are taken from the set of names from GenderComputer (https://github.com/tue-mdse/genderComputer). GenderComputer provides a database of male and female names from several countries. Each name in GenderComputer provides information about its gender and its country-of-origin. It is possible that a name may be used by both genders in the same or different countries. Thus, we filter the names to make sure that the selected names are only used for one gender globally. We randomly take names from this filtered set: N male names and N female names. By default, N is set to 30. Examples of names from GenderComputer are shown in Table 2.

⟨pro-id⟩ Placeholder: Values to be substituted for this placeholder depend on the gender class of the mutant and the id. For a male mutant, the values are he for spp, him for opp, his for pp, and himself for rp. For a female mutant, the values are she for spp, her for opp, her for pp, and herself for rp.

⟨gaw⟩ Placeholder: Values to be substituted for this placeholder are the set of gender nouns taken from several English resources (https://7esl.com/gender-of-nouns/, https://ielts.com.au/articles/grammar-101-feminine-and-masculine-words-in-english/). Examples of these gender nouns are shown in Table 3.

TABLE 2
Example Names from GenderComputer

Name       Gender   Country-of-origin
Matrosov   Male     Russia
Kapoor     Female   India
Andrea     Male     Italy
Zeynep     Female   Turkey

TABLE 3
Examples of Gender Nouns

Male:   boy, brother, father, dad, ...
Female: girl, sister, mother, mom, ...
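The following sketch instantiates one template into its gender mutants. The pronoun maps mirror the ⟨pro-id⟩ values above, while the name and gender-noun lists are tiny stand-ins for the filtered GenderComputer names (N = 30 per gender) and the 22 gender nouns.

```python
from itertools import product

PRONOUNS = {
    "male":   {"pro-spp": "he",  "pro-opp": "him", "pro-pp": "his", "pro-rp": "himself"},
    "female": {"pro-spp": "she", "pro-opp": "her", "pro-pp": "her", "pro-rp": "herself"},
}
NAMES = {"male": ["Jake", "Matrosov"], "female": ["Julia", "Zeynep"]}   # stand-ins
GAWS  = {"male": ["boy", "brother"],   "female": ["girl", "sister"]}    # stand-ins

def gender_mutants(template):
    """All placeholders in one mutant take values from a single gender class,
    so invalid mixtures such as 'The man speaks to herself' cannot arise."""
    for cls in ("male", "female"):
        for name, gaw in product(NAMES[cls], GAWS[cls]):
            m = template.replace("<name>", name).replace("<gaw>", gaw)
            for pid, word in PRONOUNS[cls].items():
                m = m.replace(f"<{pid}>", word)
            yield cls, m

for cls, m in gender_mutants("<name> is excellent again, <pro-spp> plays <pro-pp> part well."):
    print(cls, "|", m)
# male | Jake is excellent again, he plays his part well.   (and so on)
```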
4.3 Failure Detection Engine

The Failure Detection Engine runs the SA system, using the generated mutants as inputs. It receives, from the SA system, a label for each mutant indicating the predicted sentiment of the mutant. Mutants generated from the same template are expected to have the same predicted sentiment and are grouped together. Each group of mutants is further divided into two classes, depending on the gender associated with the mutant. Mutants from these two classes that have different sentiments are paired. In other words, the engine finds pairs of mutants generated from the same template that differ in both the gender class they are associated with and the sentiment predicted by the SA system. These pairs of mutants are the bias-uncovering test cases and are the output of GenderBiasFinder.
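A sketch of this pairing step, under the simplifying assumption that the SA system is a function from a text to a sentiment label:

```python
from itertools import product

def detect_gender_failures(mutants, sa_system):
    """Pair up mutants from one template whose gender class AND predicted
    sentiment both differ; each such pair is a bias-uncovering test case."""
    male   = [(m, sa_system(m)) for cls, m in mutants if cls == "male"]
    female = [(m, sa_system(m)) for cls, m in mutants if cls == "female"]
    return [(m1, m2)
            for (m1, s1), (m2, s2) in product(male, female)
            if s1 != s2]
```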
Algorithm 2: Generating a Template for Detecting Occupation Bias

    Input: s: a text
    Output: t: a template or null
     1  t = s;
     2  occs = getOccNamedEntities(s);
     3  for occ ∈ occs do
     4      if isNoun(occ) then
     5          if hasAdjective(t, occ) then
     6              t = removeAdjective(t, occ);
     7          end
     8          t = createPlaceholders(t, occ);
     9          occCorefs = getCoreferencesOf(s, occ);
    10          for r ∈ occCorefs do
    11              if isRefContainsOcc(r, occ) then
    12                  if hasAdjective(t, r) then
    13                      t = removeAdjective(t, r);
    14                  end
    15                  t = createPlaceholders(t, r);
    16              end
    17          end
    18          return t;
    19      end
    20  end
    21  return null

5 OTHER INSTANCES OF BIASFINDER
In this section, we describe how BiasFinder can be instantiated for occupation and country-of-origin biases.
5.1 OccupationBiasFinder

Occupation bias occurs when an SA system favors an honest (i.e., non-criminal) occupation over another. It can be detected when the SA system produces differing sentiments for a pair of mutants that differ only in the occupation referred to in the text. We perform the following customizations to uncover occupation bias:
Template Generation Engine:
We generate occupation templates following Algorithm 2. We first extract the list of occupations occs mentioned in the input text s using named entity recognition (Line 2). We then iterate over each occupation occ from occs (Line 3). We confirm that occ is a noun and check whether occ has adjectives (Lines 4-5). For example, the adjective of "driver" in the "race car driver" noun phrase is "race car". If the noun phrase containing the occupation has an adjective, we remove the adjective to ensure that the generated mutant text is semantically correct (Line 6). Leaving the adjective intact may produce a text that describes a non-existent occupation such as "race car secretary". We then convert occ to the ⟨occupation⟩ placeholder (Line 8). We also convert the determiner "a" or "an" in front of occ (if it exists) to the ⟨det⟩ placeholder to ensure the produced mutant text is grammatically correct. Next, we extract occCorefs (Line 9); occCorefs is the list of references in s that refer to the same entity that occ refers to. We then iterate over each reference r from occCorefs (Line 10). We check whether r is a mention of occ (Line 11). For such r, we again create ⟨occupation⟩ and ⟨det⟩ placeholders (if necessary), after removing adjectives (if necessary) (Lines 12-15). At the end of this process, we output a template t generated from the input text s (Line 18).

In the example shown in Figure 8, we detect "doctor" and "journalist" as occupations. We only use the first occupation to form a template. As "doctor" is a noun, and it is not preceded by any adjective, we replace it directly with the ⟨occupation⟩ placeholder. We then replace its determiner with the ⟨det⟩ placeholder. In this case, there are no coreferences of "doctor", so the template generation process ends.

Text
The beautiful Jennifer Jones looks the part and gives a wonderful, Oscar nominated performance as a doctor of mixed breed during the advent of Communism in mainland China.

Occupation Named Entity

doctor

Generated Template

The beautiful Jennifer Jones looks the part and gives a wonderful, Oscar nominated performance as ⟨det⟩ ⟨occupation⟩ of mixed breed during the advent of Communism in mainland China.

Fig. 8. An Illustrative Example for OccupationBiasFinder
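The adjective-removal step (Lines 5-6 of Algorithm 2) can be approximated with a dependency parse: left-hand modifiers of the occupation noun, together with their subtrees, are dropped. A sketch assuming SpaCy, where the exact dependency labels ("amod", "compound") depend on the parsing model:

```python
def strip_modifiers(doc, occ_token):
    """Drop adjectival and compound modifiers of the occupation noun,
    e.g., 'a race car driver' -> 'a driver'."""
    drop = {d.i
            for child in occ_token.lefts
            if child.dep_ in ("amod", "compound")
            for d in child.subtree}
    return "".join(t.text_with_ws for t in doc if t.i not in drop)

doc = nlp("She works as a race car driver.")
driver = next(t for t in doc if t.text == "driver")
print(strip_modifiers(doc, driver))   # She works as a driver.
```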
Mutant Generation Engine:
To generate occupation mutants, the engine substitutes the ⟨occupation⟩ placeholder with a value from a set of 79 honest (i.e., non-criminal) and gender-neutral occupation names that are taken from [40]–[42]. The value of ⟨det⟩ is linked with the value of the ⟨occupation⟩ placeholder. For example, the values of ⟨det⟩ for the "teacher" and "engineer" occupations are "a" and "an", respectively.
Failure Detection Engine:
The engine inputs the generated mutants to the SA system. The SA system labels each mutant with a predicted sentiment. Mutants from the same template are grouped together, and mutants in the same group that have a different sentiment are paired. By doing so, the engine finds pairs of mutants that differ both in the occupation they mention and the sentiment predicted by the SA system. These pairs of mutants are the bias-uncovering test cases for occupation bias.
5.2 CountryBiasFinder

Country-of-origin bias occurs when the SA system favors a person who originates from one country over a person originating from another country. This bias is detected when the SA system produces different sentiments for texts differing only in the country-of-origin of the person referred to in the text. To uncover country-of-origin bias, we customize BiasFinder as follows:
Template Generation Engine:
For generating country-of-origin templates, we follow Algorithm 3. We first run coreference resolution to find corefs, which contains references to persons mentioned in the input text s (Line 1). We also run named entity recognition to extract the list of person names mentioned in s (Line 2), which we refer to as names.

Next, we filter the coreference lists in corefs by using the same filter function described in Algorithm 1 in Section 4. This is done to avoid the generation of unsound templates due to coreference resolution's limitations.
Algorithm 3: Generating a Template for Detecting Country-of-Origin Bias

    Input: s: a text
    Output: t: a template or null
     1  t = s; corefs = getCoreferences(s);
     2  names = getPersonNamedEntities(s);
     3  coref = filter(corefs, names);
     4  if coref ≠ null then
     5      g = inferGender(coref);
     6      if g ∈ {Male, Female} then
     7          for r ∈ coref do
     8              if isPersonName(r, names) then
     9                  t = createPlaceholder(t, s, r, g);
    10              end
    11          end
    12          return t
    13      end
    14  end
    15  return null

The filter function returns either a coreference list coref or null. We stop the template generation process if null is returned. Otherwise, if the references in coref refer to a consistent gender g (Line 6) – i.e., by checking that there is a gender pronoun in coref and all gender pronouns in it are of the same gender (e.g., he, him, his, himself for the male gender) – we iterate over each reference r in coref (Lines 7-11). If r is a person name in names, we replace r with a placeholder representing the gender that was detected (Lines 8-10). A ⟨male⟩ or ⟨female⟩ placeholder is created if a male or a female gender was detected, respectively.

In the example shown in Figure 9, "Lauren Holly" is detected as a person name and the coreferences consistently refer to the female gender. Thus, we replace "Lauren Holly" with the ⟨female⟩ placeholder.

Text
I loved this movie, it was cute and funny! Lauren Holly was wonderful, she's funny and very believable in her role.

Coreferences

Lauren Holly, she, her

Person Named Entity

Lauren Holly

Generated Template

I loved this movie, it was cute and funny! ⟨female⟩ was wonderful, she's funny and very believable in her role.

Fig. 9. An Example of Text in CountryBiasFinder
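The consistent-gender check (inferGender in Algorithm 3) reduces to inspecting the pronouns inside the coreference cluster; a minimal sketch:

```python
MALE_PRONOUNS   = {"he", "him", "his", "himself"}
FEMALE_PRONOUNS = {"she", "her", "hers", "herself"}

def infer_gender(mentions):
    """Return 'Male' or 'Female' only if the cluster contains gender
    pronouns and all of them agree; otherwise return None."""
    words = {w.lower() for mention in mentions for w in mention.split()}
    has_m = bool(words & MALE_PRONOUNS)
    has_f = bool(words & FEMALE_PRONOUNS)
    if has_m and not has_f:
        return "Male"
    if has_f and not has_m:
        return "Female"
    return None   # no gender pronoun, or a mixed cluster: skip this text

print(infer_gender(["Lauren Holly", "she", "her"]))   # Female
```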
Mutant Generation Engine:
To generate country-of-origin mutants, the engine substitutes the ⟨male⟩ and ⟨female⟩ placeholders with values from the set of people's names taken from GenderComputer. GenderComputer provides the country-of-origin and the gender of each name. Since the same name may occur with different countries-of-origin and genders, we take only names that are unique in both country-of-origin and gender. We pick only one male name and one female name from each country. In total, we have 52 names taken from 26 countries of origin. The placeholder values are then filled based on the gender associated with the name. Male and female names are used to fill the ⟨male⟩ and ⟨female⟩ placeholders, respectively.

Failure Detection Engine:
The engine accepts the generated mutants as input and feeds them to the SA system, which gives a sentiment label for each mutant. Mutants from the same template that have a different sentiment are then paired. Here, the engine finds pairs of mutants that differ both in the country-of-origin of the person they mention and the sentiment predicted by the SA system. These pairs of mutants are the bias-uncovering test cases for country-of-origin bias.
6 EXPERIMENTS
In this section, we describe our dataset, experimental settings, evaluation metric, and research questions. Next, we answer the research questions and discuss threats to validity.
We focus on a binary sentiment analysis task, i.e., the task of classifying whether a text conveys a positive or a negative sentiment. A popular dataset to evaluate a sentiment analysis system's performance on this task is the IMDB Dataset of 50K Movie Reviews [33]. It contains a set of 50,000 movie reviews; each review is labelled as having either an overall positive or negative sentiment. Some of these movie reviews contain text that is not natural language, e.g., HTML tags. We remove such text from the movie reviews. Then, we split the 50,000 movie reviews evenly into train and test sets.

We use a Transformer-based model as the sentiment analysis system in our experiments. Transformer-based models have achieved state-of-the-art performance on many NLP tasks in recent years [26], [43], [44]. We pick BERT [32] as a representative model. It is not our goal to evaluate and compare bias over all sentiment analysis systems – we leave such extensive comparative evaluation for future work.

BERT is a pre-trained language model that can be fine-tuned to solve many downstream NLP tasks. We fine-tuned BERT for binary sentiment analysis following the work of Sun et al. [27] and used the implementation they provided (https://github.com/xuyige/BERT4doc-Classification). We fine-tuned the BERT sentiment analysis (SA) model using the train set and use the test set as input for the three instances of BiasFinder to generate mutants.

We performed our experiments on a computer running Ubuntu 18.04 with an Intel(R) Core(TM) i7-9700K CPU @ 3.60GHz processor, 64GB RAM, and an NVIDIA GeForce RTX 2080. For coreference resolution, we use NeuralCoref (https://github.com/huggingface/neuralcoref). We use both SpaCy (https://spacy.io/) and Stanford CoreNLP (https://stanfordnlp.github.io/CoreNLP/) for Part-of-Speech (PoS) Tagging and Named Entity Recognition (NER).

Our objective in this study is to produce test cases that reveal bias. As defined earlier, a bias-uncovering test case (BTC) is a pair of mutants from a bias-targeting template that should have the same sentiment but are predicted by the SA system to have different sentiments.
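For readers who want a quick stand-in for the SA system under test, a minimal equivalent with the HuggingFace transformers library looks as follows. This is not the authors' exact pipeline (they used Sun et al.'s implementation); the checkpoint path is hypothetical and must first be produced by fine-tuning on the IMDB train split, and label index 1 meaning positive is an assumption of this sketch.

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "path/to/imdb-finetuned-bert",  # hypothetical checkpoint fine-tuned on the train set
    num_labels=2)
model.eval()

def predict_sentiment(text: str) -> str:
    """Predicted sentiment for one mutant text (inputs truncated at 512 tokens)."""
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return "positive" if logits.argmax(dim=-1).item() == 1 else "negative"
```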
RQ1. How many BTCs can BiasFinder generate?
BiasFinder is the first approach to automatically generate templates for uncovering multiple biases. We report the number of generated BTCs produced by each instance of BiasFinder for the BERT-based SA system.
RQ2. How many BTCs are true positives?
We count the number of BTCs that are true positives via a user study. The user study involves two participants, both of whom are native English speakers and are not authors of this paper. The participants were asked to label mutant texts from randomly sampled BTCs for the three characteristics. We computed the number of samples required for a statistically representative sample given a margin of error of 5% and 95% confidence for the three populations (of 14,916, 109,386, and 5,296 generated BTCs). We sampled 400 BTCs for each characteristic, a number greater than the representative sample sizes (375, 383, 358). The users were asked to label whether the mutant texts involved in each BTC are coherent and to provide sentiment labels. We consider a BTC a true positive if its mutants are labelled as coherent texts and are of the same sentiment.
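The sample sizes can be reproduced with Cochran's formula plus the finite-population correction; rounding to the nearest integer yields the three figures quoted above:

```python
def sample_size(population: int, z: float = 1.96, p: float = 0.5, e: float = 0.05) -> int:
    """Representative sample size at 95% confidence (z) and 5% margin of error (e)."""
    n0 = z * z * p * (1 - p) / (e * e)              # ~384.16 for an infinite population
    return round(n0 / (1 + (n0 - 1) / population))  # finite-population correction

print([sample_size(n) for n in (14916, 109386, 5296)])   # [375, 383, 358]
```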
RQ1. How many BTCs can BiasFinder generate?
TABLE 4
BiasFinder's Performance in Generating BTCs

Target Bias          # Templates   # BTCs
Gender                     3,057    14,916
Occupation                 8,867   109,386
Country-of-Origin          2,344     5,296
Total                     14,268   129,598
Table 4 shows the number of templates generated by, and BTCs found by, each instance of BiasFinder. BiasFinder discovered a total of 129,598 BTCs across the three characteristics. Among the three characteristics, it generated the highest number of BTCs for occupation bias and the fewest for country-of-origin bias. Overall, our experimental results suggest that bias in each characteristic occurs in the target SA system. BiasFinder can also find more BTCs for each bias by generating more templates from another corpus. The templates range in their complexity depending on the text in the corpus. Figures 10, 11, and 12 show examples of bias-uncovering test cases for gender, occupation, and country-of-origin, respectively.
RQ2. How many BTCs are true positives?
Mutant Text - Using a uniquely male name

What is he supposed to be? He was a kid in the past, ... and the future? This movie had a lot of problems. Is he a ghost, or just a strong kid. Man, ... what a piece of crap. I'm still confused. Also, is he supposed to be an abortion? Strange. Very strange. This movie will mess with your mind, ... and it's not very scary, ... just confusing. Why was he, ... Where did, ... What was the, ... oh, who cares, ... Benedetto isn't worth it, ... My score: 10

Mutant Text - Using a uniquely female name

What is she supposed to be? She was a kid in the past, ... and the future? This movie had a lot of problems. Is she a ghost, or just a strong kid. Man, ... what a piece of crap. I'm still confused. Also, is she supposed to be an abortion? Strange. Very strange. This movie will mess with your mind, ... and it's not very scary, ... just confusing. Why was she, ... Where did, ... What was the, ... oh, who cares, ... Elaisha isn't worth it, ... My score: 10

Fig. 10. BTC Example for Gender
Mutant Text - Housekeeper

Great underrated movie great action good actors and a wonderful story line. Wesley is verry good and the housekeeper the bad guy is wonderful The girl plays a nice role and the comedy mixed with blakness!

Mutant Text - Programmer

Great underrated movie great action good actors and a wonderful story line. Wesley is verry good and the programmer the bad guy is wonderful The girl plays a nice role and the comedy mixed with blakness!

Fig. 11. BTC Example for Occupation

TABLE 5
Results of our user study. TPR stands for True Positive Rate.
Target Bias          TPR by Annotator 1   TPR by Annotator 2
Gender                            76.3%                58.8%
Occupation                        65.0%                78.0%
Country-of-Origin                 95.8%                65.0%
Table 5 shows the results of the user study. We find that the true positive rates range from 65.0% to 95.8%. We investigated the false positives produced by BiasFinder and show an example in Figure 13. Although BiasFinder identified the word "director" as an occupation, it cannot be transformed into a placeholder. The usage of the word is specific to the context described in the text, and replacing it with a different occupation results in an incoherent sentence.

To determine the level of agreement between the two human annotators in the user study, we computed Cohen's Kappa [45] and obtained an average value of 0.485 – usually interpreted as moderate agreement [46], [47]. According to Landis and Koch [48], a Kappa value between 0.40 and 0.6 indicates moderate agreement, a value between 0.6 and 0.8 indicates substantial agreement, and a value above 0.8 represents almost perfect agreement.

Mutant Text - Male Name from Somalia

I consider this movie as one of the most interesting and funny movies of all time ( ... ) Several universities in Germany and throughout Europe have made studies on
Waabberi's way of seeing things. By the way, Waabberi is a very intelligent and sensitive person and one of the Jazz musicians in Germany.

Mutant Text - Male Name from Iran

I consider this movie as one of the most interesting and funny movies of all time ( ... ) Several universities in Germany and throughout Europe have made studies on Keyghobad's way of seeing things. By the way, Keyghobad is a very intelligent and sensitive person and one of the Jazz musicians in Germany.

Fig. 12. BTC Example for Country-of-Origin. ( ... ) is a truncated piece of the original text.
Original Text

I believe the story felt very plain because the director failed to focus on character development.

Mutant Text

I believe the story felt very plain because the banker failed to focus on character development.

Fig. 13. Example of a false positive. The usage of the word "director" is context-specific, and it cannot be replaced with another occupation to produce a coherent mutant text.
Threats to Validity

We have only experimented with an SA system based on BERT and generated templates from the IMDB dataset. The results may not generalize to other SA systems and datasets. However, BERT is among the top-performing models for text classification in recent years [26], [32], [43], [44], and the IMDB dataset is a common dataset used for studying sentiment analysis [28], [49], [50]. Thus, we believe the threats to validity are minimal, as we show that the biases occur on mutants of texts from a common dataset and the biases are exhibited by one of the top models.

While we investigated only the IMDB dataset, it is a large dataset and is commonly used to evaluate sentiment analysis techniques. Thus, we believe IMDB is a good representative dataset for investigating biases in sentiment analysis. Furthermore, the IMDB dataset contains high-polarity reviews (i.e., reviews with strong positive/negative sentiments) and, as such, there is minimal risk of mislabelling them.
7 RELATED WORK
In this section, we first describe related work on understanding and detecting bias in AI systems (Section 7.1). Next, we describe some of the related work on testing AI systems (Section 7.2).
7.1 Bias in AI Systems

The importance of studying bias in AI systems has been described by many researchers [1], [2], [30], [51], [52]. An AI system may perpetuate human biases and perform differently for some demographic groups than others [1], [31], [51], [52]. As such, many existing studies on uncovering bias [1]–[3], [5], [30] focus on finding differences in the system's behavior given a change in a demographic characteristic (aka attribute). Our approach has the same high-level objective of uncovering differences in behavior when a demographic characteristic is modified; however, our approach differs in several ways, which are described in the following paragraphs.

Themis [1], Aequitas [2], and FairTest [3] are approaches aiming to generate test cases that detect discrimination in software. Fairway [4] mitigates bias through several strategies, including identifying and removing ethical bias from a model's training data. Unlike our approach, these strategies do not target NLP systems but focus on systems that take numerical values or images as input, while BiasFinder targets Sentiment Analysis systems, which take natural language text as input.

Specific to NLP applications, CheckList [5] has been proposed for creating test cases to evaluate systems on their capabilities beyond their accuracies on test datasets. Fairness is among the capabilities that CheckList tests for, and CheckList relies on a small number of predefined templates for producing test sentences. Our work is complementary to this approach, as it can be used to produce test cases without the restriction of predefined templates.

For Sentiment Analysis systems, the EEC [30] has been proposed to uncover bias by detecting differences in predictions for texts differing in a single word associated with gender or race. However, as described earlier in Section 2, other researchers [9] have pointed out that the EEC [30] relies on predefined templates that may be too simplistic. We address this limitation as our approach dynamically generates many templates to produce sentences that are varied and realistic. Moreover, our approach uncovers bias through mutating words in text associated with characteristics other than gender and race.
7.2 Testing AI Systems

In recent years, many researchers have proposed techniques for testing AI systems. There are too many to mention here; still, we would like to highlight a few, especially those that are closest to our work. For a comprehensive treatment of the topic of AI testing, please refer to the survey by Zhang et al. [53].

Existing studies have applied metamorphic testing to AI systems [54]–[57]. Many of these studies focus on finding bugs, for example, in machine translation [54], [57] or autonomous driving systems [55], [56]. Our work is related to these studies as BiasFinder is based on metamorphic testing, but differs in that we focus on finding fairness bugs (gender, occupation, and country-of-origin bias) in Sentiment Analysis systems.

In the NLP domain, some research efforts have developed methods for generating adversarial examples [58], [59], while other researchers have proposed techniques to test robustness to typos and other forms of noise [60], or changes in the names of people mentioned in text [61]. Our work differs from these studies as it focuses on uncovering bias rather than testing the correctness of an NLP system.

8 CONCLUSION AND FUTURE WORK
There is growing use of Artificial Intelligence in software systems, and fairness is an important requirement in Artificial Intelligence systems. Testing is one way to uncover unintended biases [4], [53]. Our research contributes to the body of work on fairness testing and motivates future research to build automatic fairness testing methods for various machine learning tasks, including sentiment analysis (which we consider in this work).

We propose BiasFinder, a metamorphic testing framework for creating test cases to uncover demographic biases in Sentiment Analysis (SA) systems. BiasFinder can be instantiated for different demographic characteristics, such as gender or occupation. Given a target characteristic, BiasFinder curates suitable texts from a corpus to create bias-uncovering templates. From these templates, BiasFinder then produces mutated texts (mutants) that differ only in words associated with different classes (e.g., male vs. female) of the target characteristic (e.g., gender). These mutants are then used to tease out unintended bias in an SA system and identify bias-uncovering test cases. By analyzing a realistic and diverse corpus, BiasFinder can produce realistic and diverse bias-uncovering test cases. Existing work [30] uses only a small set of simple test cases, testing only for ethnicity bias by swapping names associated with two groups, i.e., African American names and European American names, in a small set of templates of simple sentences. Our work generates templates of test cases involving other characteristics, including gender and country-of-origin. Our mutant generation constructs test cases representing names from 30 countries. Together, the template and mutant generation produce test cases that cover a wider range of scenarios.

We empirically evaluated three instances of BiasFinder on an SA model based on BERT. BiasFinder identifies 14,916, 109,386, and 5,296 bias-uncovering test cases for gender, occupation, and country-of-origin bias, respectively. Through a user study, we find that the true positive rates of BiasFinder are reasonably high. In other words, BiasFinder produces a high percentage of bias-revealing pairs of texts that a human annotator considers to have the same sentiment and to be coherent.

In the future, we plan to instantiate BiasFinder for more biases and expand the experiments (e.g., by considering other text corpora). Moreover, we will evaluate BiasFinder to determine if it generalizes to tasks beyond SA, for example, testing general text classifiers.

REFERENCES

[1] S. Galhotra, Y. Brun, and A. Meliou, "Fairness testing: testing software for discrimination," in
REFERENCES

[1] S. Galhotra, Y. Brun, and A. Meliou, “Fairness testing: Testing software for discrimination,” in Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, 2017, pp. 498–510.
[2] S. Udeshi, P. Arora, and S. Chattopadhyay, “Automated directed fairness testing,” in Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, 2018, pp. 98–108.
[3] F. Tramer, V. Atlidakis, R. Geambasu, D. Hsu, J.-P. Hubaux, M. Humbert, A. Juels, and H. Lin, “FairTest: Discovering unwarranted associations in data-driven applications,” in 2017 IEEE European Symposium on Security and Privacy (EuroS&P). IEEE, 2017, pp. 401–416.
[4] J. Chakraborty, S. Majumder, Z. Yu, and T. Menzies, “Fairway: A way to build fair ML software,” in Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2020, pp. 654–665.
[5] M. T. Ribeiro, T. Wu, C. Guestrin, and S. Singh, “Beyond accuracy: Behavioral testing of NLP models with CheckList,” in Association for Computational Linguistics (ACL 2020), 2020.
[6] B. Pang, L. Lee, and S. Vaithyanathan, “Thumbs up? Sentiment classification using machine learning techniques,” in Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing - Volume 10, ser. EMNLP ’02. USA: Association for Computational Linguistics, 2002, pp. 79–86. [Online]. Available: https://doi.org/10.3115/1118693.1118704
[7] P. D. Turney, “Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews,” in Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ser. ACL ’02. USA: Association for Computational Linguistics, 2002, pp. 417–424. [Online]. Available: https://doi.org/10.3115/1073083.1073153
[8] W. Medhat, A. Hassan, and H. Korashy, “Sentiment analysis algorithms and applications: A survey,” Ain Shams Engineering Journal, vol. 5, no. 4, pp. 1093–1113, 2014.
[9] S. Poria, D. Hazarika, N. Majumder, and R. Mihalcea, “Beneath the tip of the iceberg: Current challenges and new directions in sentiment analysis research,” IEEE Transactions on Affective Computing, 2020, accepted (early access at: https://ieeexplore.ieee.org/document/9260964).
[10] M. Haselmayer and M. Jenny, “Sentiment analysis of political communication: Combining a dictionary approach with crowdcoding,” Quality & Quantity, vol. 51, pp. 2623–2646, 2017.
[11] J. A. Caetano, H. S. Lima, M. F. Santos, and H. T. Marques-Neto, “Using sentiment analysis to define Twitter political users’ classes and their homophily during the 2016 American presidential election,” Journal of Internet Services and Applications, vol. 9, pp. 1–15, 2018.
[12] S. Krishnamoorthy, “Sentiment analysis of financial news articles using performance indicators,” Knowledge and Information Systems, vol. 56, no. 2, pp. 373–394, Aug. 2018. [Online]. Available: https://doi.org/10.1007/s10115-017-1134-1
[13] T. Renault, “Sentiment analysis and machine learning in finance: A comparison of methods and models on one million messages,” Digital Finance, 2019.
[14] M. Day and C. Lee, “Deep learning for financial sentiment analysis on finance news providers,” in 2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM). IEEE, 2016, pp. 1127–1134.
[15] S. Sohangir, D. Wang, A. Pomeranets, and T. M. Khoshgoftaar, “Big data: Deep learning for financial sentiment analysis,” Journal of Big Data, vol. 5, pp. 1–25, 2017.
[16] M. Rambocas, “Marketing research: The role of sentiment analysis,” FEP Working Paper Series, 2013.
[17] S. Rani and P. Kumar, “A sentiment analysis system to improve teaching and learning,” Computer, vol. 50, no. 5, pp. 36–43, May 2017.
[18] N. Altrabsheh, M. Gaber, and E. Haig, “SA-E: Sentiment analysis for education,” in Frontiers in Artificial Intelligence and Applications, vol. 255, 2013.
[19] F. S. Dolianiti, D. Iakovakis, S. B. Dias, S. Hadjileontiadou, J. A. Diniz, and L. Hadjileontiadis, “Sentiment analysis techniques and applications in education: A survey,” in Technology and Innovation in Learning, Teaching and Education, M. Tsitouridou, J. A. Diniz, and T. A. Mikropoulos, Eds. Cham: Springer International Publishing, 2019, pp. 412–427.
[20] V. S. Gupta and S. Kohli, “Twitter sentiment analysis in healthcare using Hadoop and R,” 2016, pp. 3766–3772.
[21] O. Oyebode, F. Alqahtani, and R. Orji, “Using machine learning and thematic analysis methods to evaluate mental health apps based on user reviews,” IEEE Access, vol. 8, pp. 111141–111158, 2020.
[22] S. Yadav, A. Ekbal, S. Saha, and P. Bhattacharyya, “Medical sentiment analysis using social media: Towards building a patient assisted system,” in Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), 2018.
[23] arXiv preprint arXiv:1707.02377, 2017.
[24] Y. Zhang, Q. Liu, and L. Song, “Sentence-state LSTM for text representation,” in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2018, pp. 317–327.
[25] J. Gong, X. Qiu, S. Wang, and X. Huang, “Information aggregation via dynamic routing for sequence encoding,” arXiv preprint arXiv:1806.01501, 2018.
[26] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and Q. V. Le, “XLNet: Generalized autoregressive pretraining for language understanding,” in Advances in Neural Information Processing Systems, 2019, pp. 5753–5763.
[27] C. Sun, X. Qiu, Y. Xu, and X. Huang, “How to fine-tune BERT for text classification?” in China National Conference on Chinese Computational Linguistics. Springer, 2019, pp. 194–206.
[28] J. Howard and S. Ruder, “Universal language model fine-tuning for text classification,” in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2018, pp. 328–339.
[29] E. Cambria, S. Poria, A. Gelbukh, and M. Thelwall, “Sentiment analysis is a big suitcase,” IEEE Intelligent Systems, vol. 32, no. 6, pp. 74–80, 2017.
[30] S. Kiritchenko and S. Mohammad, “Examining gender and race bias in two hundred sentiment analysis systems,” in Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics, 2018, pp. 43–53.
[31] M. Díaz, I. Johnson, A. Lazar, A. M. Piper, and D. Gergle, “Addressing age-related bias in sentiment analysis,” in Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, 2018, pp. 1–14.
[32] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 4171–4186.
[33] A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts, “Learning word vectors for sentiment analysis,” in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 2011, pp. 142–150.
[34] A. Caliskan, J. J. Bryson, and A. Narayanan, “Semantics derived automatically from language corpora contain human-like biases,” Science, vol. 356, pp. 183–186, 2017.
[35] E. Brill, “Transformation-based error-driven learning and natural language processing: A case study in part-of-speech tagging,” Computational Linguistics, vol. 21, no. 4, pp. 543–565, Dec. 1995.
[36] D. Nadeau and S. Sekine, “A survey of named entity recognition and classification,” Lingvisticae Investigationes, vol. 30, pp. 3–26, 2007.
[37] W. M. Soon, H. T. Ng, and D. C. Y. Lim, “A machine learning approach to coreference resolution of noun phrases,” Computational Linguistics, vol. 27, no. 4, pp. 521–544, 2001.
[38] Synthesis Lectures on Human Language Technologies, vol. 2, 2009.
[39] D. Chen and C. Manning, “A fast and accurate dependency parser using neural networks,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 740–750.
[40] T. Bolukbasi, K.-W. Chang, J. Zou, V. Saligrama, and A. Kalai, “Man is to computer programmer as woman is to homemaker? Debiasing word embeddings,” in Proceedings of the 30th International Conference on Neural Information Processing Systems, ser. NIPS’16. Red Hook, NY, USA: Curran Associates Inc., 2016, pp. 4356–4364.
[41] A. Caliskan-Islam, J. Bryson, and A. Narayanan, “Semantics derived automatically from language corpora necessarily contain human biases,” Science, vol. 356, 2016.
[42] J. Zhao, T. Wang, M. Yatskar, V. Ordonez, and K.-W. Chang, “Gender bias in coreference resolution: Evaluation and debiasing methods,” in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), 2018.
[43] T. B. Brown et al., “Language models are few-shot learners,” arXiv preprint arXiv:2005.14165, 2020.
[44] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,” arXiv preprint arXiv:1910.10683, 2019.
[45] J. Cohen, “A coefficient of agreement for nominal scales,”
Educational and Psychological Measurement, vol. 20, no. 1, pp. 37–46, 1960.
[46] M. V. Mäntylä, B. Adams, F. Khomh, E. Engström, and K. Petersen, “On rapid releases and software testing: A case study and a semi-systematic literature review,” Empirical Software Engineering, vol. 20, no. 5, pp. 1384–1425, 2015.
[47] M. Joblin, S. Apel, C. Hunsen, and W. Mauerer, “Classifying developers into core and peripheral: An empirical study on count and network metrics,” in 2017 IEEE/ACM 39th International Conference on Software Engineering (ICSE). IEEE, 2017, pp. 164–174.
[48] J. R. Landis and G. G. Koch, “The measurement of observer agreement for categorical data,” Biometrics, vol. 33, no. 1, pp. 159–174, 1977.
[49] in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, 2019, pp. 407–414.
[50] D. S. Sachan, M. Zaheer, and R. Salakhutdinov, “Revisiting LSTM networks for semi-supervised text classification via mixed objective function,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 6940–6948.
[51] L. Dixon, J. Li, J. Sorensen, N. Thain, and L. Vasserman, “Measuring and mitigating unintended bias in text classification,” in Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, 2018, pp. 67–73.
[52] M. Hardt, E. Price, and N. Srebro, “Equality of opportunity in supervised learning,” in Advances in Neural Information Processing Systems, 2016, pp. 3315–3323.
[53] J. M. Zhang, M. Harman, L. Ma, and Y. Liu, “Machine learning testing: Survey, landscapes and horizons,” IEEE Transactions on Software Engineering, 2020.
[54] Z. Sun, J. Zhang, M. Harman, M. Papadakis, and L. Zhang, “Automatic testing and improvement of machine translation,” in International Conference on Software Engineering (ICSE), 2020.
[55] Z. Q. Zhou and L. Sun, “Metamorphic testing of driverless cars,” Communications of the ACM, vol. 62, no. 3, pp. 61–67, 2019.
[56] M. Zhang, Y. Zhang, L. Zhang, C. Liu, and S. Khurshid, “DeepRoad: GAN-based metamorphic testing and input validation framework for autonomous driving systems,” in Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering (ASE). IEEE, 2018, pp. 132–142.
[57] L. Sun and Z. Q. Zhou, “Metamorphic testing for machine translations: MT4MT,” in 2018 25th Australasian Software Engineering Conference (ASWEC). IEEE, 2018, pp. 96–100.
[58] Z. Zhao, D. Dua, and S. Singh, “Generating natural adversarial examples,” in International Conference on Learning Representations, 2018.
[59] M. Iyyer, J. Wieting, K. Gimpel, and L. Zettlemoyer, “Adversarial example generation with syntactically controlled paraphrase networks,” in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 2018, pp. 1875–1885.
[60] M. T. Ribeiro, S. Singh, and C. Guestrin, “Semantically equivalent adversarial rules for debugging NLP models,” in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2018, pp. 856–865.
[61] V. Prabhakaran, B. Hutchinson, and M. Mitchell, “Perturbation sensitivity analysis to detect unintended model biases,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 5744–5749.