Automatic Detection of Causality in Requirement Artifacts: the CiRA Approach
Jannik Fischbach, Julian Frattini, Arjen Spaans, Maximilian Kummeth, Andreas Vogelsang, Daniel Mendez, Michael Unterkalmsteiner
Qualicen GmbH, Germany, {firstname.lastname}@qualicen.de
Blekinge Institute of Technology, Sweden, {firstname.lastname}@bth.se
University of Cologne, Germany, [email protected]
fortiss GmbH, Germany, [email protected]

Abstract. [Context & motivation:] System behavior is often expressed by causal relations in requirements (e.g., If event 1, then event 2). Automatically extracting this embedded causal knowledge supports not only reasoning about requirements dependencies, but also various automated engineering tasks such as seamless derivation of test cases. However, causality extraction from natural language (NL) is still an open research challenge as existing approaches fail to extract causality with reasonable performance. [Question/problem:] We understand causality extraction from requirements as a two-step problem: First, we need to detect if requirements have causal properties or not. Second, we need to understand and extract their causal relations. At present, though, we lack knowledge about the form and complexity of causality in requirements, which is necessary to develop a suitable approach addressing these two problems. [Principal ideas/results:] We conduct an exploratory case study with 14,983 sentences from 53 requirements documents originating from 18 different domains and shed light on the form and complexity of causality in requirements. Based on our findings, we develop a tool-supported approach for causality detection (CiRA, standing for Causality in Requirement Artifacts). This constitutes a first step towards causality extraction from NL requirements. [Contribution:] We report on a case study and the resulting tool-supported approach for causality detection in requirements. Our case study corroborates, among other things, that causality is, in fact, a widely used linguistic pattern to describe system behavior, as about a third of the analyzed sentences are causal. We further demonstrate that our tool CiRA achieves a macro-F1 score of 82 % on real-world data and that it outperforms related approaches with an average gain of 11.06 % in macro-Recall and 11.43 % in macro-Precision. Finally, we disclose our open data sets as well as our tool to foster the discourse on the automatic detection of causality in the RE community.

Keywords: Causality · Case Study · Requirements Engineering · Natural Language Processing
System behavior is usually described by causal relations, e.g. "A confirmation message shall be shown if the system has successfully processed the data." Hence, causal relations are often inherently embedded in the textual descriptions of requirements. Understanding and extracting these causal relations offers great potential for Requirements Engineering (RE); for instance, by supporting the automated derivation of test cases and by facilitating reasoning about dependencies between requirements [7]. However, automated causality extraction from requirements is still challenging for two reasons. First, requirements are mostly expressed in unrestricted natural language (NL), so that the system behavior is specified in arbitrarily complex ways. Second, causality can occur in different forms [2] such as marked/unmarked or explicit/implicit, which makes it difficult to identify and extract the causes and effects. Existing approaches [1] fail to extract causality from NL with a performance that allows for use in practice. Therefore, we argue for the need of a novel method for the extraction of causality from requirements. We understand causality extraction as a two-step problem: We first need to detect whether requirements contain causal relations. Second, if they contain causal relations, we need to understand and extract them. To address both problems, we have to comprehend in which form and complexity causality occurs in requirements in practice. This enables us to develop efficient approaches for the automated identification and extraction of causal relations. However, empirical evidence on causality in requirements is presently still weak. In this paper, we report on how we addressed this research gap and make the following contributions (C):

– C1: We report on an exploratory case study where we analyze form and complexity of causality in requirements based on 14,983 sentences emerging from 53 requirement documents. These documents originate from 18 different domains. We corroborate, for example, that causality tends to occur, in fact, in explicit and marked form, and that about 28 % of the analyzed sentences contain causal knowledge about the expected system behavior. This strengthens our confidence in the relevance of our approach.
– C2: We present our tool-supported approach named CiRA (Causality detection in Requirement Artifacts), which forms a first step towards causality extraction from NL requirements. We train and empirically evaluate CiRA using the pre-analyzed data set and achieve a macro-F1 score of 82 %. Compared to baseline systems that classify causality based on the presence of certain cue phrases, or shallow ML models, CiRA leads to an average performance gain of 11.43 % in macro-Precision and 11.06 % in macro-Recall.
– C3: To strengthen transparency and facilitate replication, we disclose our tool, code, and data set used in the case study. (A demo of CiRA can be accessed at cira.diptsrv003.bth.se. Our code and annotated data sets can be found at https://github.com/fischJan/CiRA.)

Causality represents a semantic relation that has been studied by various disciplines, e.g. by psychology [27]. Before we can investigate in which form causality occurs in requirements, we must first understand what causality actually means.
Concept of Causality
Causality is a relation between two events: a causing event (the cause) and a caused event (the effect). An event is "any situation (including a process or state) that happens or occurs either instantaneously (punctual) or during a period of time (durative)" [19]. The connection between causes and effects is counterfactual [17]: If a cause c did not occur, then an effect e could not have occurred either. Consequently, a causal relation requires that the effect may occur if and only if the cause has occurred. Therefore, in the view of Boolean algebra, a causal relation can be interpreted as an equivalence between a cause and effect (c ⇐⇒ e). If the cause is true, the effect is true, and if the cause is false, the effect is also false. The relation between a cause and effect can be defined in three different ways [26]: as a cause, enable or prevent relationship.

– c causes e: If c occurs, e also occurs (c ⇐⇒ e). This can be illustrated by REQ 1: "After the user enters a wrong password, a warning window shall be shown." In this case, the wrong input is the trigger to display the window.
– c enables e: If c does not occur, e does not occur either (e is not enabled). REQ 2: "As long as you are a student, you are allowed to use the sport facilities of the university" (c ⇐⇒ e). Only the student status enables doing sports on campus.
– c prevents e: If c occurs, e does not occur (c ⇐⇒ ¬e). REQ 3: "Data redundancy is required to prevent a single failure from causing the loss of collected data." There will be no data loss due to data redundancy.
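A minimal sketch of this Boolean reading, with cause/enable and prevent encoded as predicates over truth values (the REQ examples above are only paraphrased in the comments; the sketch is illustrative and not part of the study's tooling):

```python
from itertools import product

# Boolean reading of the relation types (c: cause occurred, e: effect occurred).
def cause_or_enable(c: bool, e: bool) -> bool:
    # c causes e / c enables e: the effect occurs if and only if the cause occurs.
    return c == e

def prevent(c: bool, e: bool) -> bool:
    # c prevents e: the effect occurs if and only if the cause does not occur (c <=> not e).
    return c == (not e)

# REQ 1 (cause): wrong password entered  <=>  warning window shown
# REQ 3 (prevent): data redundancy in place  <=>  no loss of collected data
for c, e in product([False, True], repeat=2):
    print(f"c={c!s:5} e={e!s:5}  cause/enable holds: {cause_or_enable(c, e)!s:5}  prevent holds: {prevent(c, e)}")
```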
Temporal Ordering of Causes and Effects
Causes and effects can occur in three different temporal relations [19]. In the first temporal relation, the cause occurs before the effect (before relation). REQ 1 requires the user to enter a wrong password before the warning window will be displayed. In this example, the cause and effect represent two punctual events. In the second temporal relation, the occurrence of the cause and effect overlaps: "The fire is burning down the house." In this case, the occurrence of the emerging fire overlaps with the occurrence of the increasingly brittle house (overlaps relation). In the third temporal relation (during relation), cause and effect occur simultaneously. REQ 2 describes such a relation, as the effect that you are allowed to do sports on the campus is only valid as long as you have the student status. The start and end time of the cause is therefore also the start and end of the effect. Here, both events are durative.
Forms of Causality
Causality can be expressed in different forms [2]: marked and unmarked causality, explicit and implicit causality, and ambiguous and non-ambiguous cue phrases.

– Marked and unmarked: A causal relation is marked if a certain cue phrase indicates causality. The requirement "If the user presses the button, a window appears" is marked by the cue phrase "if", while "The user has no admin rights. He cannot open the folder." is unmarked.
– Explicit and implicit: An explicit causal relation provides information about both the cause and effect. The requirement "In case of an error, the system prints an error message to the console" is explicit as it contains the cause (error) and effect (error message). "A parent process kills a child process" is implicit because the effect that the child process is terminated is not explicitly stated.
– Ambiguous and non-ambiguous cue phrases: Given the difference between marked and unmarked causality, it seems feasible to deduce the presence of causality in a sentence from the occurrence of certain cue phrases. However, there are cue phrases (e.g. since) that may indicate causality, but also occur in other contexts (e.g. to denote time constraints). Such cue phrases are called ambiguous, while cue phrases (e.g. because) that mostly indicate causality are called non-ambiguous.

Complexity of Causality
Our previous explanations refer to the simplest case, where the causal relation consists of a single cause and effect. With increasing system complexity, however, the expected system behaviour is described by multiple causes and effects that are connected to each other. They are linked either by conjunctions (c1 ∧ c2 ∧ · · · ⇐⇒ e), by disjunctions (c1 ∨ c2 ∨ · · · ⇐⇒ e), or by a combination of both, which increases the complexity of the causal relation. Furthermore, causal relations can not only be contained in a single sentence, but can also span multiple sentences, which is a significant challenge for causality extraction. Additionally, the complexity increases when several causal relations are linked together, i.e. if the effect of a relation r1 represents a cause in another relation r2. We define such causal relations, where r2 is dependent on r1, as event chains (e.g. r1: c1 ⇐⇒ e1 and r2: e1 ⇐⇒ e2).

The case study was performed according to the guidelines of Runeson and Höst [23]. Based on the classification of Robson [22], our case study is exploratory as we seek new insights into causality in requirement documents. In this section, we describe our research questions, study objects, study design, study results, and threats to validity. We also give an overview of the implications of the study on the detection and extraction of causality from requirements.

We are interested in the form and complexity of causality in requirement documents. Based on the terminology introduced in Section 2, we investigate the following research questions (RQ):

– RQ1: To which degree does causality occur in requirement documents?
– RQ2: How often do the relations cause, enable and prevent occur?
– RQ3: How often do the temporal relations before, overlap and during occur?
– RQ4: In which form does causality occur in requirement documents?
  RQ4a: How often does marked and unmarked causality occur?
  RQ4b: How often does explicit and implicit causality occur?
  RQ4c: Which causal cue phrases are used? Are they mainly ambiguous or non-ambiguous?
– RQ5: At which complexity does causality occur in requirement documents?
  RQ5a: How often do multiple causes occur?
  RQ5b: How often do multiple effects occur?
  RQ5c: How often does two-sentence causality occur?
  RQ5d: How often do event chains occur?

We considered three criteria when selecting a suitable data set for our case study: 1) the data set shall contain requirements documents that are/were used in practice, 2) the data set shall not be domain-specific, rather it shall contain documents from different domains, and 3) the documents shall originate from different years. Consequently, our analysis is not restricted to a single year or domain, but rather allows for a comprehensive view on causality in requirements. Based on these criteria, we selected the data set provided by Fischbach et al. [7]. To the best of our knowledge, this data set is currently the most extensive collection of requirements available in the RE community. It contains 463 documents, from which the authors extracted and pre-processed 212k sentences. For our analysis, we randomly selected 53 documents from the data set. Our final data set consists of 14,983 sentences from 18 different domains (see Fig. 1).
Fig. 1. Descriptive statistics of our data set. The left graph shows the number of sentences per domain. The right graph depicts the year of creation per document.
Model the phenomenon
In order to answer our RQs, we need to annotate the sentences in our data set with respect to certain categories (e.g. explicit or implicit causality). According to Pustejovsky and Stubbs [21], the first step in each annotation process is to "model the phenomenon" that needs to be annotated. Specifically, it should be defined as a model M that consists of a vocabulary T, the relations R between the terms, as well as the interpretations I of the terms. RQ1 can be understood as a binary annotation problem, which can be modeled as:

– T: {sentence, causal, not causal}
– R: {sentence ::= causal | not causal}
– I: {causal = "A sentence is causal if it contains a relation between at least two events, where e1 causes the occurrence of e2", ¬causal = "A sentence is not causal if it describes a state that is independent of any events"}

Modeling an annotation problem has two advantages: It contributes to a clear definition of the research problem and can be used as a guide for the annotators to explain the meaning of the labels. We have modeled each RQ and discussed it with the annotators. In addition to the interpretation I, we have also provided an example for each label to avoid misunderstandings. After modeling all RQs, the following nine categories emerged, according to which we annotated our data set: Causality, Explicit, Marked, Single Sentence, Single Cause, Single Effect, Event Chain, Relationship and Temporality.

Annotation Environment
We developed our own annotation platform tailored to our research questions. (The platform can be accessed at clabel.diptsrv003.bth.se.) Contrary to other annotation platforms [20], which only show single sentences to the annotators, we also show the predecessor and successor of each sentence. This is required to determine whether the causality extends over one sentence or across multiple ones (see RQ5c). For the binary annotation problems (see RQ1, RQ4a, RQ4b, RQ5a–d), we provide two labels for each category. Cue phrases present in the sentence can either be selected by the annotator from a list of already labeled cue phrases, or new cue phrases can be added using a text input field (see RQ4c). Since RQ2 and RQ3 are ternary annotation problems, the platform provides three labels for these categories.
Annotation Guideline
Prior to the labeling process, we conducted a workshop with all annotators to ensure a common understanding of causality. The results of the workshop were recorded in the form of an annotation guideline. All annotators were instructed to observe the following annotation rules: First, you should not just check for cue phrases and label the sentence directly as causal, but rather read the sentence completely before making a labeling decision. Otherwise, too many False Positives will be introduced. Second, you should check whether the cause is really necessary for the effect to occur. Only if the cause is mandatory for the effect is it a causal relation.
Table 1. Inter-annotator agreement statistics per category. The two categories Relationship and Temporality were jointly labeled by the first and second author and therefore do not require a reliability assessment.
For each category (Causal, Explicit, Marked, Single Sentence, Single Cause, Single Effect, Event Chain) and label (0/1), the table reports the confusion matrix of the annotator ratings as well as the percentage of agreement, Cohen's Kappa, and Gwet's AC1, together with the averages across categories.
Annotation Validity
To verify the reliability of our annotations, we calculated the inter-annotator agreement. We assigned 3,000 sentences to each annotator, of which 2,500 are unique and 500 overlapping. Based on the overlapping sentences, we calculated Cohen's Kappa [3] to evaluate how well the annotators can make the same annotation decision for a given category. We chose Cohen's Kappa since it is widely used for assessing inter-rater reliability [25]. However, a number of statistical problems are known to exist with this measure [18]. In case of a high imbalance of ratings, Cohen's Kappa is low and indicates poor inter-rater reliability even if there is a high agreement between the raters (Kappa paradox [6]). Thus, Cohen's Kappa is not meaningful in such scenarios. Consequently, studies [28] suggest that Cohen's Kappa should always be reported together with the percentage of agreement and other paradox-resistant measures (e.g. Gwet's AC1 measure [10]) in order to make a valid statement about the inter-rater reliability. We involved six annotators in the creation of the corpus and assessed the inter-rater reliability on the basis of 3,000 overlapping sentences, which represents about 20 % of the total data set. We calculated all measures (see Tab. 1) using the cloud-based version of AgreeStat [11]. Cohen's Kappa and Gwet's AC1 can both be interpreted using the taxonomy developed by Landis and Koch [16]: values ≤ 0 as poor, 0.01–0.20 as slight, 0.21–0.40 as fair, 0.41–0.60 as moderate, 0.61–0.80 as substantial, and 0.81–1.00 as almost perfect agreement. Tab. 1 demonstrates that the inter-rater agreement of our annotation process is reliable. Across all categories, an average percentage of agreement of 86 % was achieved. Except for the categories Single Cause and Single Effect, all categories show a percentage of agreement of at least 84 %. We hypothesize that the slightly lower value of 76 % for these two categories is caused by the fact that in some cases the annotators interpret the causes and effects with different granularity (e.g., some annotators break causes and effects down into several sub-causes and sub-effects, while others do not). Hence, the annotations differ slightly. The Kappa paradox is particularly evident for the categories Marked and Event Chain. Despite a high agreement of over 90 %, Cohen's Kappa yields a very low value, which "paradoxically" suggests almost no or only fair agreement. A more meaningful assessment is provided by Gwet's AC1, as it does not fail in case of prevalence and remains close to the percentage of agreement. Across all categories, the mean value is above 0.8, which indicates a nearly perfect agreement. Therefore, we assess our labeled data set as reliable and suitable for further analysis and the implementation of our causality detection approach.
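A minimal sketch of the two agreement measures for the two-rater, binary case illustrates the Kappa paradox; the confusion-matrix counts below are hypothetical and chosen to mimic a highly imbalanced category, they are not the counts of Tab. 1:

```python
def agreement_stats(a: int, b: int, c: int, d: int):
    """Two-rater, two-category agreement from a confusion matrix
    [[a, b], [c, d]] (rows: rater 1 chose 0/1, columns: rater 2 chose 0/1)."""
    n = a + b + c + d
    p_o = (a + d) / n                      # observed (percentage of) agreement
    p1_r1 = (c + d) / n                    # proportion of label 1 for rater 1
    p1_r2 = (b + d) / n                    # proportion of label 1 for rater 2
    # Cohen's Kappa: chance agreement from the raters' individual marginals
    p_e_kappa = p1_r1 * p1_r2 + (1 - p1_r1) * (1 - p1_r2)
    kappa = (p_o - p_e_kappa) / (1 - p_e_kappa)
    # Gwet's AC1: chance agreement from the mean marginal, robust to prevalence
    pi = (p1_r1 + p1_r2) / 2
    p_e_ac1 = 2 * pi * (1 - pi)
    ac1 = (p_o - p_e_ac1) / (1 - p_e_ac1)
    return p_o, kappa, ac1

# Hypothetical, highly imbalanced category: both raters almost always choose
# label 1, and they agree on about 95 % of the sentences.
p_o, kappa, ac1 = agreement_stats(a=2, b=10, c=13, d=475)
print(f"agreement={p_o:.2f}  Cohen's Kappa={kappa:.2f}  Gwet's AC1={ac1:.2f}")
# -> agreement=0.95  Cohen's Kappa=0.12  Gwet's AC1=0.95  (the Kappa paradox)
```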
Fig. 2. Annotation results per category. The y axis of the bar plot for the category "Causality" refers to the total number of analyzed sentences. The other bar plots relate only to the causal sentences.
Fig. 2 presents the analysis results for each labeled category. When interpreting the values, it is important to note that we analyze entire requirement documents in our study. Consequently, our data set contains records with different contents, which do not necessarily represent functional requirements. For example, requirement documents also contain non-functional requirements, phrases for content structuring, purpose statements, etc. Hence, the results of our analysis do not only refer to functional requirements but in general to the content of requirement documents.
Answer to RQ1:
Fig. 2 highlights that causality occurs in requirement documents: about 28 % of the analyzed sentences are causal. It can therefore be concluded that causality is a major linguistic element of requirement documents, since almost one third of all sentences are causal.
Answer to RQ2:
The majority (56 %) of causal sentences contained in requirement documents express an enable relationship between certain events. Only about 10 % of the causal sentences indicate a prevent relationship.
Cause relationships are found in about 34 % of the annotated data.
Answer to RQ3:
Interestingly, we found that causes and effects occur almost equally often in a before and a during relation. With about 48 %, the before relation is the most frequent temporal relation in our data set, but only with a difference of about 6 % compared to the during relation. The overlap relation occurred only in a minority (8.78 %) of the sentences.
Answer to RQ4a:
Fig. 2 shows that the majority of causal sentences contain one or more cue phrases to indicate the causal relationship between certain events.
Unmarked causality occurs only in about 15 % of the analyzed sentences.
Answer to RQ4b:
Most causal sentences are explicit, i.e. they contain information about both the cause and the effect. Only about 10 % of causal sentences are implicit.

Answer to RQ4c:
Tab. 2 provides an overview of the causal cue phrases used in the requirement documents. The left side of the table shows the different cue phrases ordered by word group. On the right side, all verbs used to express causal relations are listed. We order the verbs according to whether they express a cause, enable or prevent relationship. To measure the ambiguity of the individual cue phrases, we introduce the ambiguity factor (AF). We define the AF of a cue phrase x as the conditional probability that a sentence is causal given that the cue phrase x occurs in the sentence: Pr(Causal | X is present in sentence). Hence, a high AF value indicates a non-ambiguous cue phrase, while low values indicate strongly ambiguous cue phrases. Tab. 2 demonstrates that a number of different cue phrases are used to express causality in requirement documents. Not surprisingly, cue phrases like "if", "because" and "therefore" show AF values of more than 90 %. However, there is a variety of cue phrases that indicate causality in some sentences but also occur in other, non-causal contexts. This is especially evident in the case of pronouns: relative clauses can indicate causality, but not in every case, which is reflected by the low AF values. A similar pattern emerges with regard to the used verbs. Only a few verbs (e.g., "leads to", "degrade" and "enhance") show a high AF value. Consequently, the majority of used verbs do not necessarily indicate a causal relation if they are present in a sentence.
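A minimal sketch of how the AF of a cue phrase can be estimated from annotated sentences; the mini corpus below is hypothetical, and the simple substring matching is for illustration only:

```python
from collections import Counter

def ambiguity_factor(sentences, cue):
    """AF(cue) = Pr(sentence is causal | cue occurs in the sentence),
    estimated from a list of (sentence, is_causal) pairs."""
    counts = Counter()
    for text, is_causal in sentences:
        if cue in text.lower():
            counts["causal" if is_causal else "not_causal"] += 1
    total = counts["causal"] + counts["not_causal"]
    return counts["causal"] / total if total else None

# Hypothetical mini corpus; in the study, the estimate is based on the annotated data set.
corpus = [
    ("If the user presses the button, a window appears.", True),
    ("The window shows the status since the last login.", False),
    ("Since the sensor failed, the system switches to backup mode.", True),
]
print(ambiguity_factor(corpus, "if"), ambiguity_factor(corpus, "since"))  # 1.0 0.5
```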
Answer to RQ5a:
Fig. 2 illustrates that a causal relation in requirement documents often includes only a single cause. Multiple causes occur in only 19.1 % of the analyzed causal sentences. The exact number of causes was not documented during the annotation process. However, the participating annotators reported consistently that in the case of complex causal relations, two to three causes were usually included. More than three causes were rare.
Answer to RQ5b:
Interestingly, the distribution of effects is similar to that of causes. Likewise, single effects occur significantly more often than multiple effects. According to the annotators, the number of effects in the case of complex relations is limited to two effects. Three or more effects occur rarely.
Answer to RQ5c:
Most causal relations can be found in single sentences. Relations where cause and effect are distributed over several sentences occur in only about 7 % of the analyzed data. The annotators reported that most often the cue phrase "therefore" was used to express two-sentence causality.
Answer to RQ5d:
Fig. 2 shows that event chains are rarely used in requirement documents. Most causal sentences contain isolated causal relations and only a few event chains.
Table 2. Overview of cue phrases used to indicate causality in requirement documents. AF values of at least 0.8 mark non-ambiguous phrases that mostly indicated causality (Pr(Causal | X is present in sentence) ≥ 0.8).

Cue phrases by word group:
  Type          Phrase               Causal  Not Causal  AF
  conjunctions  if                   387     41          0.90
                as                   607     1313        0.32
                because              78      7           0.92
                but                  100     204         0.33
                in order to          141     33          0.81
                so (that)            88      86          0.51
                unless               23      4           0.85
                while                71      90          0.44
                once                 48      15          0.76
                except               9       5           0.64
                as long as           12      1           0.92
  adverbs       therefore            61      6           0.91
                when                 331     64          0.84
                whenever             10      0           1.00
                hence                21      9           0.70
                where                213     150         0.59
                since                65      32          0.67
                consequently         2       6           0.25
                wherever             5       2           0.71
                rather               16      30          0.35
                to this/that end     12      0           1.00
                thus                 66      17          0.80
                for this reason      7       3           0.70
                due to               91      26          0.78
                thereby              4       2           0.67
                as a result          11      4           0.73
                for this purpose     1       2           0.33
  pronouns      which                277     608         0.31
                who                  28      52          0.35
                that                 732     1178        0.38
                whose                16      11          0.59
  adjectives    only                 127     126         0.50
                prior to             26      20          0.57
                imperative           1       3           0.25
                necessary (to)       36      19          0.65
  prepositions  for                  1209    2753        0.31
                during               327     137         0.70
                after                133     57          0.70
                by                   506     1171        0.30
                with                 680     1554        0.30
                in the course of     2       1           0.67
                through              114     204         0.36
                as part of           19      51          0.27
                in this case         18      3           0.86
                before               54      27          0.67
                until                33      11          0.75
                upon                 25      48          0.34
                in case of           30      7           0.81
                in both cases        1       0           1.00
                in the event of      15      2           0.88
                in response to       6       7           0.46
                in the absence of    8       1           0.89

Verbs by relation type:
  Type     Phrase               Causal  Not Causal  AF
  Cause    force(s/ed)          21      18          0.54
           cause(s/ed)          32      10          0.76
           lead(s) to           5       0           1.00
           reduce(s/ed)         48      28          0.63
           minimize(s/ed)       28      11          0.72
           affect(s/ed)         13      19          0.41
           maximize(s/ed)       11      5           0.69
           eliminate(s/ed)      8       11          0.42
           result(s/ed) in      50      43          0.54
           increase(s/ed)       49      34          0.59
           decrease(s/ed)       5       8           0.38
           impact(s)            37      68          0.35
           degrade(s/ed)        11      2           0.85
           introduce(s/ed)      11      12          0.48
           enforce(s/ed)        2       1           0.67
           trigger(s/ed)        11      7           0.61
  Enable   depend(s) on         28      21          0.57
           require(s/ed)        316     262         0.55
           allow(s/ed)          187     130         0.59
           need(s/ed)           98      162         0.38
           necessitate(s/ed)    7       2           0.78
           facilitate(s/ed)     29      28          0.51
           enhance(s/ed)        16      4           0.80
           ensure(s/ed)         145     66          0.69
           achieve(s/ed)        30      24          0.56
           support(s/ed)        128     301         0.30
           enable(s/ed)         75      36          0.68
           permit(s/ed)         10      13          0.43
           rely on              3       5           0.38
  Prevent  hinder(s/ed)         1       1           0.50
           prevent(s/ed)        38      17          0.69
           avoid(s/ed)          14      23          0.38
Based on the results of our case study, we draw the following conclusions: Causality matters in requirements documents, which underlines the necessity of an approach for the automatic detection and extraction of causal requirements. The complexity of causal relations ranges from low to medium, since they usually consist of a single cause and effect. However, for the approaches to be applicable in practice, they also need to comprehend more complex relations containing two to three causes and up to two effects. Hence, the approaches must be capable of understanding conjunctions, disjunctions and negations in the sentences to fully capture the relationships between causes and effects. Two-sentence causality and event chains occur only rarely. Thus, both aspects can initially be neglected in the development of the approaches, while still more than 92 % of the analyzed sentences can be covered. Since most causal relations in requirements documents are explicit, the detection and extraction of causality is simplified: the information about both causes and effects is embedded directly in the sentences, so that the approaches require little or no implicit knowledge. The analysis of the AF values reveals that most of the used cue phrases are ambiguous. Consequently, our methods require a deep understanding of language, as causality cannot be deduced from the presence of certain cue phrases alone but rather from a combination of the syntax and semantics of the sentence.
Internal Validity: A major threat to internal validity are the annotations themselves, as an annotation task is to a certain degree subjective. To minimize the bias of the annotators, we performed two mitigation actions: First, we conducted a workshop prior to the annotation process to ensure a common understanding of causality. Second, we assessed the inter-rater agreement by using multiple metrics (Gwet's AC1 etc.).

External Validity: To achieve reasonable generalizability, we selected requirements documents from different domains and years. As Fig. 1 shows, our data set covers a variety of domains, but the distribution of the sentences is imbalanced. The domains aerospace, data analytics, and smart city account for a large part of the data set (9,724 sentences), while the other 15 domains are underrepresented. Hence, our results do not allow a general conclusion about causality in requirements documents. Future studies should expand to more documents from these underrepresented as well as further domains to achieve a more global insight into causality in requirements documents.
This section presents the implementation of our causal classifier. We first describe the applied methods and then report the results of our experiments, in which we compare the performance of the individual methods.
Rule Based Approach
The baseline approach for causality detection involves the use of simple regular expressions. We iterate through all sentences in the test set and check if one of the phrases listed in Tab. 2 is included. If so, the sentence is classified as causal, otherwise as non-causal.
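A minimal sketch of such a rule-based baseline; the phrase list below is only a subset of the cue phrases in Tab. 2, and the exact matching strategy of the original implementation may differ:

```python
import re

# A few of the cue phrases from Tab. 2; the full baseline iterates over all of them.
CUE_PHRASES = ["if", "because", "therefore", "so that", "as long as", "due to",
               "in order to", "when", "whenever", "since", "unless"]
# Word-boundary matching so that e.g. "modified" does not match the cue "if".
CUE_REGEX = re.compile(r"\b(" + "|".join(re.escape(p) for p in CUE_PHRASES) + r")\b",
                       re.IGNORECASE)

def classify_rule_based(sentence: str) -> bool:
    """Classify a sentence as causal iff it contains at least one cue phrase."""
    return CUE_REGEX.search(sentence) is not None

print(classify_rule_based("An error message is shown if the process fails."))  # True
print(classify_rule_based("The system shall store all data in a database."))   # False
```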
Machine Learning Based Approach
As a second approach, we investigate the use of supervised ML models that learn to predict causality based on the labeled data set. Specifically, we employ established binary classification algorithms: Naive Bayes (NB), Support Vector Machines (SVM), Random Forest (RF), Decision Tree (DT), Logistic Regression (LR), Ada Boost (AB) and K-Nearest Neighbor (KNN). To determine the best hyperparameters for each binary classifier, we apply Grid Search, which fits the model on every possible combination of hyperparameters and selects the most performant one. We use two different methods to represent the sentences: Bag of Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF). In Tab. 3 we report the classification results of each algorithm as well as the best combination of hyperparameters.
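A minimal sketch of this setup using scikit-learn, here with an SVM, a reduced hyperparameter grid, and hypothetical toy sentences; the original implementation may differ in its exact grid and preprocessing:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical toy data; in the study, the annotated (causal / non-causal) sentences are used.
sentences = ["If the process fails, an error message is shown.",
             "The user interface shall be available in English.",
             "The system logs out the user when the session expires.",
             "All reports are stored in PDF format."]
labels = [1, 0, 1, 0]

# BoW (CountVectorizer) or TF-IDF (TfidfVectorizer) features, followed by a binary classifier.
pipeline = Pipeline([("vect", CountVectorizer()), ("clf", SVC())])
param_grid = {
    "vect": [CountVectorizer(), TfidfVectorizer()],
    "clf__C": [1, 10, 50],
    "clf__gamma": [0.001, 0.01],
    "clf__kernel": ["rbf", "linear"],
}
# Grid Search fits the pipeline on every hyperparameter combination and keeps the best one.
search = GridSearchCV(pipeline, param_grid, cv=2, scoring="f1_macro")
search.fit(sentences, labels)
print(search.best_params_, search.best_score_)
```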
Deep Learning Based Approach
With the rise of Deep Learning (DL), more and more researchers are using DL models for Natural Language Processing (NLP) tasks. In this context, the Bidirectional Encoder Representations from Transformers (BERT) model [4] is prominent and has already been used for question answering and named entity recognition. BERT is pre-trained on large corpora and can therefore easily be fine-tuned for any downstream task without the need for much training data (Transfer Learning). In our paper, we make use of the fine-tuning mechanism of BERT and investigate to which extent it can be used for causality detection in requirement sentences. First, we tokenize each sentence. BERT requires input sequences with a fixed length (maximum 512 tokens). Therefore, for sentences that are shorter than this fixed length, padding tokens (PAD) are inserted to adjust all sentences to the same length. Other tokens, such as the classification (CLS) token, are also inserted in order to provide further information about the sentence to the model. CLS is the first token in the sequence and represents the whole sentence (i.e. it is the pooled output of all tokens of a sentence). For our classification task, we mainly use this token because it stores the information of the whole sentence. We feed the pooled information into a single-layer feedforward neural network with a softmax layer, which calculates the probability that a sentence is causal or not. We tune BERT in three different ways and investigate their performance (a minimal fine-tuning sketch follows the list):

– BERT_Base: In the base variant, the sentences are tokenized as described above and put into the classifier. To choose a suitable fixed length for our input sequences, we analyzed the lengths of the sentences in our data set. Even with a fixed length of 128 tokens, we cover more than 97 % of the sentences. Sentences containing more tokens are shortened accordingly. Since this affects only a small number of sentences, only little information is lost. Thus, we chose a fixed length of 128 tokens instead of the maximum possible 512 tokens to keep BERT's computational requirements to a minimum.
– BERT_POS: Studies have shown that the performance of NLP models can be improved by providing explicit prior knowledge of syntactic information to the model [24]. Therefore, we enrich the input sequence with syntactic information and feed it into BERT. More specifically, we add the corresponding Part-of-Speech (POS) tag to each token by using the spaCy NLP library [12]. One way to encode the input sequence with the corresponding POS tags is to concatenate each token embedding with a one-hot encoded vector representing the POS tag. However, since the BERT token embeddings are high-dimensional, the impact of a single added feature (i.e. the POS tag) would be low. Instead, we hypothesize that the syntactic information has a higher impact if we annotate the input sentences directly with the POS tags and then put the annotated sentences into BERT. This way of creating linguistically enriched input sequences has already proven to be promising during the development of the NLPL word embeddings [5]. Fig. 3 shows how we incorporate the POS tags into the input sequence. By extending the input sequence, the fixed length for the BERT model has to be adapted accordingly. After a further analysis, a length of 384 tokens proved to be reasonable.
– BERT_DEP: Similar to the previous fine-tuning approach, we follow the idea of enriching the input sequence with linguistic features. Instead of the POS tags, we use the dependency (DEP) tag (see Fig. 3) of each token. Thus, we provide knowledge about the grammatical structure of the sentence to the classifier. We hypothesize that this knowledge has a positive effect on the model performance, as a causal relation is a specific grammatical structure (e.g. it often contains an adverbial clause) and the classifier can learn causality-specific patterns in the grammatical structure of the training instances. The fixed token length was also increased to 384 tokens.
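A minimal fine-tuning sketch using the Hugging Face transformers library; whether this exact library and the bert-base-cased checkpoint were used in the original implementation is an assumption, and the two example sentences are hypothetical. BertForSequenceClassification places a classification layer on top of the pooled CLS representation, and a softmax over its two logits yields the probability that a sentence is causal:

```python
import torch
from transformers import BertTokenizerFast, BertForSequenceClassification

tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")   # assumed checkpoint
model = BertForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)

sentences = ["If the process fails, an error message is shown.",
             "The user interface shall be available in English."]
labels = torch.tensor([1, 0])  # 1 = causal, 0 = not causal

# Padding/truncation to a fixed length of 128 tokens (384 for the POS/DEP variants).
batch = tokenizer(sentences, padding="max_length", truncation=True, max_length=128,
                  return_tensors="pt")

# Hyperparameters as reported in Tab. 3: batch size 16, lr 2e-5, weight decay 0.01, AdamW.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
model.train()
for _ in range(3):  # a few illustrative training steps; the study uses proper epochs
    outputs = model(**batch, labels=labels)
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Inference: softmax over the logits gives Pr(not causal) and Pr(causal) per sentence.
model.eval()
with torch.no_grad():
    probs = torch.softmax(model(**batch).logits, dim=-1)
print(probs)
```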
BERT_Base: If the process fails, an error message is shown.

BERT_POS: If SCONJ the DET process NOUN fails VERB , PUNCT an DET error NOUN message NOUN is AUX shown VERB . PUNCT

BERT_DEP: If mark the det process nsubj fails advcl , punct an det error compound message nsubjpass is auxpass shown ROOT . punct

Fig. 3. Input sequences used for our different BERT fine-tuning models. POS tags are marked orange and DEP tags are marked blue.
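A minimal sketch of how such enriched input sequences can be produced with spaCy; the exact tags depend on the spaCy model and version, so the output may differ slightly from Fig. 3:

```python
import spacy

# Requires the small English model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def enrich(sentence: str, tag: str = "pos") -> str:
    """Interleave each token with its POS or DEP tag, as in Fig. 3."""
    doc = nlp(sentence)
    if tag == "pos":
        return " ".join(f"{t.text} {t.pos_}" for t in doc)
    return " ".join(f"{t.text} {t.dep_}" for t in doc)

sentence = "If the process fails, an error message is shown."
print(enrich(sentence, "pos"))  # e.g. If SCONJ the DET process NOUN fails VERB , PUNCT ...
print(enrich(sentence, "dep"))  # e.g. If mark the det process nsubj fails advcl , punct ...
```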
Our labeled data set is imbalanced, as only 28.1 % of the sentences are positive samples. To avoid the class imbalance problem, we apply Random Under Sampling (see Fig. 4): we randomly select sentences from the majority class and exclude them from the data set until a balanced distribution is achieved. Our final data set consists of 8,430 sentences, of which 4,215 are causal and 4,215 are non-causal. We follow the idea of Cross Validation and divide the data set into a training, validation and test set. The training set is used for fitting the algorithm, while the validation set is used to tune its parameters. The test set is utilized for the evaluation of the algorithm on unseen real-world data. We opt for a 10-fold Cross Validation, as a number of studies have shown that a model trained this way demonstrates low bias and variance [13]. We use standard metrics for evaluating our approaches: Accuracy, Precision, Recall and F1 score [13]. When interpreting the metrics, it is important to consider which misclassification (False Negative or False Positive) matters most, i.e. causes the highest costs. Since causality detection is supposed to be the first step towards automatic causality extraction, we favor Recall over Precision. A high Recall corresponds to a greater degree of automation of causality extraction, because it is easier for users to discard False Positives than to manually detect False Negatives. Consequently, we seek high Recall to minimize the risk of missed causal sentences and acceptable Precision to ensure that users are not overwhelmed by False Positives.
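A minimal sketch of the undersampling and evaluation procedure; `train_and_predict` is a placeholder for any of the classifiers described above, and the separate validation split used for hyperparameter tuning is omitted for brevity:

```python
import random
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import precision_recall_fscore_support, accuracy_score

def random_under_sample(samples, seed=42):
    """Drop random majority-class samples until both classes are balanced."""
    random.seed(seed)
    causal = [s for s in samples if s[1] == 1]
    non_causal = [s for s in samples if s[1] == 0]
    majority, minority = (non_causal, causal) if len(non_causal) > len(causal) else (causal, non_causal)
    return minority + random.sample(majority, len(minority))

def evaluate(data, train_and_predict, folds=10):
    """data: list of (sentence, label) pairs; train_and_predict(train_texts, train_labels,
    test_texts) -> predicted labels, standing in for any classifier from this section."""
    balanced = random_under_sample(data)
    texts = [t for t, _ in balanced]
    labels = [l for _, l in balanced]
    skf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=42)
    for train_idx, test_idx in skf.split(texts, labels):
        y_true = [labels[i] for i in test_idx]
        y_pred = train_and_predict([texts[i] for i in train_idx],
                                   [labels[i] for i in train_idx],
                                   [texts[i] for i in test_idx])
        # Per-class Precision, Recall and F1 (label order: not causal, causal), plus Accuracy.
        p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average=None, labels=[0, 1])
        print(f"acc={accuracy_score(y_true, y_pred):.2f}  recall={r}  precision={p}  f1={f1}")
```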
Fig. 4. Implementation and Evaluation Procedure of our Binary Classifier.
Tab. 3 demonstrates the inability of the baseline approach to distinguish between causal (F1 score: 66 %) and non-causal (F1 score: 64 %) sentences. This coincides with our observation from the case study that searching for cue phrases is not suitable for causality detection. In comparison, most ML based approaches (except KNN and DT) show a better performance. The best performance in this category is achieved by RF with an Accuracy of 78 % (a gain of 13 % compared to the baseline approach). The overall best classification results are achieved by our DL based approaches. All three variants were trained with the hyperparameters recommended by Devlin et al. [4]. Even the vanilla BERT_Base model shows a great performance in both classes (F1 score ≥ 80 % for causal and non-causal). Interestingly, enriching the input sequences with syntactic information did not result in a significant performance boost. BERT_POS even has a slightly worse Accuracy value of 78 % (a difference of 2 % compared to BERT_Base). An improvement of the performance can be observed in the case of BERT_DEP, which has the best F1 score for both classes among all approaches and also achieves the highest Accuracy value of 82 %. Compared to the rule based and ML based approaches, BERT_DEP yields an average gain of 11.06 % in macro-Recall and 11.43 % in macro-Precision. Interesting is a comparison with BERT_Base: BERT_DEP shows better values across all metrics, but the difference is only marginal. This indicates that BERT_Base already has a deep language understanding due to its pre-training and can therefore be tuned well for causality detection without much further input. However, over all five runs, the use of the DEP tags shows a small but not negligible performance gain, especially regarding our main decision criterion: the Recall value (85 % for causal and 79 % for non-causal). Therefore, we choose BERT_DEP as our final approach (CiRA).
Table 3. Recall, Precision, F1 scores (per class, Causal: Support 435, Not Causal: Support 408) and Accuracy. We report the averaged scores over five repetitions.

Rule based (no hyperparameters): Causal R 0.65 / P 0.66 / F1 0.66; Not Causal R 0.65 / P 0.63 / F1 0.64; Accuracy 0.65
ML based:
  NB (alpha: 1, fit prior: True, embed: BoW): Causal R 0.71 / P 0.70 / F1 0.71; Not Causal R 0.68 / P 0.69 / F1 0.69; Accuracy 0.70
  SVM (C: 50, gamma: 0.001, kernel: rbf, embed: BoW): Causal R 0.68 / P 0.80 / F1 0.73; Not Causal R 0.82 / P 0.71 / F1 0.76; Accuracy 0.75
  RF (criterion: entropy, max features: auto, n estimators: 500, embed: BoW): Causal R 0.72; Accuracy 0.78
DL based:
  BERT_Base (batch size: 16, learning rate: 2e-05, weight decay: 0.01, optimizer: AdamW): Causal R 0.83 / P 0.80 / F1 0.82; Not Causal R 0.78 / P 0.82 / F1 0.80; Accuracy 0.81
  BERT_POS: Accuracy 0.78
  BERT_DEP (CiRA): Causal R 0.85; Not Causal R 0.79; Accuracy 0.82
As indicated in Section 2, many disciplines have already dealt with causality. To the best of our knowledge, we are the first to focus on causality from the perspective of RE. In our previous paper [7], we motivated why the RE community should engage with causality, while in this paper we provide empirical evidence for the relevance of causality in requirement documents and an insight into its form and complexity. Detecting causality in natural language has been investigated by several studies. Multiple papers [14,29] use handcrafted patterns to identify causal sentences. These approaches are highly dependent on the manually created patterns and show weak performance. Recent papers apply neural networks and exploit, similarly to us, the Transfer Learning capability of BERT [15]. However, we see a number of problems with these papers regarding the realization of our described RE use cases: First, neither the code nor a demo is published, making it difficult to reproduce the results and to test the performance on RE data. Second, they train and evaluate their approaches on strongly unbalanced data sets with causal to non-causal ratios of 1:2 and 1:3, but only report the macro-Recall and macro-Precision values and not the metrics per class. Thus, it is not clear whether the classifier has a bias towards the majority class or not.
System behavior is often specified by causal relations in requirements. Extracting this causal knowledge supports automatic test case derivation and reasoning about requirement dependencies [7]. However, existing methods fail to extract causality with reasonable performance [1]. Therefore, we argue for the need of a novel method for causality extraction. We understand causality extraction as a two-step problem: First, we need to detect if requirements have causal properties. Second, we need to comprehend and extract their causal relations. At present, however, we lack knowledge about the form and complexity of causality in requirements, which is needed to develop suitable approaches for these two problems. In this paper, we address this research gap and contribute: (1) an exploratory case study with 14,983 sentences from 53 requirements documents originating from 18 different domains. We found that causality is a widely used linguistic pattern to describe system functionalities and that it mainly occurs in explicit, marked form. (2) CiRA as an approach for the automatic detection of causality in requirements documents. This constitutes a first step towards causality extraction from NL requirements. We empirically evaluate our approach and achieve a macro-F1 score of 82 % on real-world data. (3) our code, tool, and annotated data set to facilitate replication. Two further research directions exist: First, extending the case study and analyzing the sentences from the requirements documents in a more granular way, e.g. by categorizing them into functional and non-functional requirements. This would expand our current insight into causality in requirements documents in general by an insight into causality in specific requirement categories. Second, we are enhancing our previous approaches [8,9] to address the second sub-problem: the actual extraction of causal relations.

Acknowledgements
We would like to acknowledge that this work was supported by the KKS foundation through the S.E.R.T. Research Profile project at Blekinge Institute of Technology. Further, we thank Yannick Debes for his valuable feedback.
References
1. Asghar, N.: Automatic extraction of causal relations from natural language texts: A comprehensive survey. arXiv abs/1605.07895