Automatic Detection of Causality in Requirement Artifacts: the CiRA Approach
Jannik Fischbach, Julian Frattini, Arjen Spaans, Maximilian Kummeth, Andreas Vogelsang, Daniel Mendez, Michael Unterkalmsteiner
Qualicen GmbH, Germany, {firstname.lastname}@qualicen.de
Blekinge Institute of Technology, Sweden, {firstname.lastname}@bth.se
University of Cologne, Germany, [email protected]
fortiss GmbH, Germany, [email protected]

Abstract. [Context & motivation:] System behavior is often expressed by causal relations in requirements (e.g., If event 1, then event 2). Automatically extracting this embedded causal knowledge supports not only reasoning about requirements dependencies, but also various automated engineering tasks such as seamless derivation of test cases. However, causality extraction from natural language (NL) is still an open research challenge as existing approaches fail to extract causality with reasonable performance. [Question/problem:] We understand causality extraction from requirements as a two-step problem: First, we need to detect if requirements have causal properties or not. Second, we need to understand and extract their causal relations. At present, though, we lack knowledge about the form and complexity of causality in requirements, which is necessary to develop a suitable approach addressing these two problems. [Principal ideas/results:] We conduct an exploratory case study with 14,983 sentences from 53 requirements documents originating from 18 different domains and shed light on the form and complexity of causality in requirements. Based on our findings, we develop a tool-supported approach for causality detection (CiRA, standing for Causality in Requirement Artifacts). This constitutes a first step towards causality extraction from NL requirements. [Contribution:] We report on a case study and the resulting tool-supported approach for causality detection in requirements. Our case study corroborates, among other things, that causality is, in fact, a widely used linguistic pattern to describe system behavior, as about a third of the analyzed sentences are causal. We further demonstrate that our tool CiRA achieves a macro-F1 score of 82 % on real-world data and that it outperforms related approaches with an average gain of 11.06 % in macro-Recall and 11.43 % in macro-Precision. Finally, we disclose our open data sets as well as our tool to foster the discourse on the automatic detection of causality in the RE community.

Keywords: Causality · Case Study · Requirements Engineering · Natural Language Processing
System behavior is usually described by causal relations, e.g. "A confirmation message shall be shown if the system has successfully processed the data." Hence, causal relations are often inherently embedded in the textual descriptions of requirements. Understanding and extracting these causal relations offers great potential for Requirements Engineering (RE); for instance, by supporting the automated derivation of test cases and by facilitating reasoning about dependencies between requirements [7]. However, automated causality extraction from requirements is still challenging for two reasons. First, requirements are mostly expressed in unrestricted natural language (NL), so that the system behavior is specified in arbitrarily complex ways. Second, causality can occur in different forms [2] such as marked/unmarked or explicit/implicit, which makes it difficult to identify and extract the causes and effects. Existing approaches [1] fail to extract causality from NL with a performance that allows for use in practice. Therefore, we argue for the need of a novel method for the extraction of causality from requirements. We understand causality extraction as a two-step problem: We first need to detect whether requirements contain causal relations. Second, if they contain causal relations, we need to understand and extract them. To address both problems, we have to comprehend in which form and complexity causality occurs in requirements in practice. This enables us to develop efficient approaches for the automated identification and extraction of causal relations. However, empirical evidence on causality in requirements is presently still weak. In this paper, we report on how we addressed this research gap and make the following contributions (C):

– C1: We report on an exploratory case study where we analyze form and complexity of causality in requirements based on 14,983 sentences emerging from 53 requirement documents. These documents originate from 18 different domains. We corroborate, for example, that causality tends to occur, in fact, in explicit and marked form, and that about 28 % of the analyzed sentences contain causal knowledge about the expected system behavior. This strengthens our confidence in the relevance of our approach.
– C2: We present our tool-supported approach named CiRA (Causality detection in Requirement Artifacts), which forms a first step towards causality extraction from NL requirements. We train and empirically evaluate CiRA using the pre-analyzed data set and achieve a macro-F1 score of 82 %. Compared to baseline systems that classify causality based on the presence of certain cue phrases, or shallow ML models, CiRA leads to an average performance gain of 11.43 % in macro-Precision and 11.06 % in macro-Recall.
– C3: To strengthen transparency and facilitate replication, we disclose our tool, code, and data set used in the case study. (A demo of CiRA can be accessed at cira.diptsrv003.bth.se. Our code and annotated data sets can be found at https://github.com/fischJan/CiRA.)

Causality represents a semantic relation that has been studied by various disciplines, e.g. by psychology [27]. Before we can investigate in which form causality occurs in requirements, we must first understand what causality actually means.
Concept of Causality
Causality is a relation between two events: a causing event (the cause) and a caused event (the effect). An event is "any situation (including a process or state) that happens or occurs either instantaneously (punctual) or during a period of time (durative)" [19]. The connection between causes and effects is counterfactual [17]: If a cause c did not occur, then an effect e could not have occurred either. Consequently, a causal relation requires that the effect may occur if and only if the cause has occurred. Therefore, in the view of Boolean algebra, a causal relation can be interpreted as an equivalence between a cause and effect (c ⇐⇒ e). If the cause is true, the effect is true, and if the cause is false, the effect is also false. The relation between a cause and effect can be defined in three different ways [26]: as a cause, enable or prevent relationship.

– c causes e: If c occurs, e also occurs (c ⇐⇒ e). This can be illustrated by REQ 1: "After the user enters a wrong password, a warning window shall be shown." In this case, the wrong input is the trigger to display the window.
– c enables e: If c does not occur, e does not occur either (e is not enabled). REQ 2: "As long as you are a student, you are allowed to use the sport facilities of the university" (c ⇐⇒ e). Only the student status enables doing sports on campus.
– c prevents e: If c occurs, e does not occur (c ⇐⇒ ¬e). REQ 3: "Data redundancy is required to prevent a single failure from causing the loss of collected data." There will be no data loss due to data redundancy.
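A minimal sketch of this Boolean reading, with cause/enable and prevent encoded as predicates over truth values (the REQ examples above are only paraphrased in the comments; the sketch is illustrative and not part of the study's tooling):

```python
from itertools import product

# Boolean reading of the relation types (c: cause occurred, e: effect occurred).
def cause_or_enable(c: bool, e: bool) -> bool:
    # c causes e / c enables e: the effect occurs if and only if the cause occurs.
    return c == e

def prevent(c: bool, e: bool) -> bool:
    # c prevents e: the effect occurs if and only if the cause does not occur (c <=> not e).
    return c == (not e)

# REQ 1 (cause): wrong password entered  <=>  warning window shown
# REQ 3 (prevent): data redundancy in place  <=>  no loss of collected data
for c, e in product([False, True], repeat=2):
    print(f"c={c!s:5} e={e!s:5}  cause/enable holds: {cause_or_enable(c, e)!s:5}  prevent holds: {prevent(c, e)}")
```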
Temporal Ordering of Causes and Effects
Causes and effects can occur in three different temporal relations [19]. In the first temporal relation, the cause occurs before the effect (before relation). REQ 1 requires the user to enter a wrong password before the warning window will be displayed. In this example, the cause and effect represent two punctual events. In the second temporal relation, the occurrence of the cause and effect overlaps: "The fire is burning down the house." In this case, the occurrence of the emerging fire overlaps with the occurrence of the increasingly brittle house (overlaps relation). In the third temporal relation (during relation), cause and effect occur simultaneously. REQ 2 describes such a relation, as the effect that you are allowed to do sports on the campus is only valid as long as you have the student status. The start and end time of the cause is therefore also the start and end of the effect. Here, both events are durative.
Forms of Causality
Causality can be expressed in different forms [2]: marked and unmarked causality, explicit and implicit causality, and ambiguous and non-ambiguous cue phrases.

– Marked and unmarked: A causal relation is marked if a certain cue phrase indicates causality. The requirement "If the user presses the button, a window appears" is marked by the cue phrase "if", while "The user has no admin rights. He cannot open the folder." is unmarked.
– Explicit and implicit: An explicit causal relation provides information about both the cause and effect. The requirement "In case of an error, the system prints an error message to the console" is explicit as it contains the cause (error) and effect (error message). "A parent process kills a child process" is implicit because the effect that the child process is terminated is not explicitly stated.
– Ambiguous and non-ambiguous cue phrases: Given the difference between marked and unmarked causality, it seems feasible to deduce the presence of causality in a sentence from the occurrence of certain cue phrases. However, there are cue phrases (e.g. since) that may indicate causality, but also occur in other contexts (e.g. to denote time constraints). Such cue phrases are called ambiguous, while cue phrases (e.g. because) that mostly indicate causality are called non-ambiguous.

Complexity of Causality
Our previous explanations refer to the simplest case, where the causal relation consists of a single cause and effect. With increasing system complexity, however, the expected system behaviour is described by multiple causes and effects that are connected to each other. They are linked either by conjunctions (c1 ∧ c2 ∧ · · · ⇐⇒ e), by disjunctions (c1 ∨ c2 ∨ · · · ⇐⇒ e), or by a combination of both, which increases the complexity of the causal relation. Furthermore, causal relations can not only be contained in a single sentence, but can also span multiple sentences, which is a significant challenge for causality extraction. Additionally, the complexity increases when several causal relations are linked together, i.e. if the effect of a relation r1 represents a cause in another relation r2. We define such causal relations, where r2 is dependent on r1, as event chains (e.g. r1: c1 ⇐⇒ e1 and r2: e1 ⇐⇒ e2).

The case study was performed according to the guidelines of Runeson and Höst [23]. Based on the classification of Robson [22], our case study is exploratory as we seek new insights into causality in requirement documents. In this section, we describe our research questions, study objects, study design, study results, and threats to validity. We also give an overview of the implications of the study on the detection and extraction of causality from requirements.

We are interested in the form and complexity of causality in requirement documents. Based on the terminology introduced in Section 2, we investigate the following research questions (RQ):

– RQ1: To which degree does causality occur in requirement documents?
– RQ2: How often do the relations cause, enable and prevent occur?
– RQ3: How often do the temporal relations before, overlap and during occur?
– RQ4: In which form does causality occur in requirement documents?
  RQ4a: How often does marked and unmarked causality occur?
  RQ4b: How often does explicit and implicit causality occur?
  RQ4c: Which causal cue phrases are used? Are they mainly ambiguous or non-ambiguous?
– RQ5: At which complexity does causality occur in requirement documents?
  RQ5a: How often do multiple causes occur?
  RQ5b: How often do multiple effects occur?
  RQ5c: How often does two-sentence causality occur?
  RQ5d: How often do event chains occur?

We considered three criteria when selecting a suitable data set for our case study: 1) the data set shall contain requirements documents that are/were used in practice, 2) the data set shall not be domain-specific, rather it shall contain documents from different domains, and 3) the documents shall originate from different years. Consequently, our analysis is not restricted to a single year or domain, but rather allows for a comprehensive view on causality in requirements. Based on these criteria, we selected the data set provided by Fischbach et al. [7]. To the best of our knowledge, this data set is currently the most extensive collection of requirements available in the RE community. It contains 463 documents, from which the authors extracted and pre-processed 212k sentences. For our analysis, we randomly selected 53 documents from the data set. Our final data set consists of 14,983 sentences from 18 different domains (see Fig. 1).
Fig. 1. Descriptive statistics of our data set. The left graph shows the number of sentences per domain. The right graph depicts the year of creation per document.
Model the phenomenon
In order to answer our RQs, we need to annotate the sentences in our data set with respect to certain categories (e.g. explicit or implicit causality). According to Pustejovsky and Stubbs [21], the first step in each annotation process is to "model the phenomenon" that needs to be annotated. Specifically, it should be defined as a model M that consists of a vocabulary T, the relations R between the terms, as well as the interpretations I of the terms. RQ1 can be understood as a binary annotation problem, which can be modeled as:

– T: {sentence, causal, not causal}
– R: {sentence ::= causal | not causal}
– I: {causal = "A sentence is causal if it contains a relation between at least two events, where e1 causes the occurrence of e2", ¬causal = "A sentence is not causal if it describes a state that is independent of any events"}

Modeling an annotation problem has two advantages: It contributes to a clear definition of the research problem and can be used as a guide for the annotators to explain the meaning of the labels. We have modeled each RQ and discussed it with the annotators. In addition to the interpretation I, we have also provided an example for each label to avoid misunderstandings. After modeling all RQs, the following nine categories emerged, according to which we annotated our data set: Causality, Explicit, Marked, Single Sentence, Single Cause, Single Effect, Event Chain, Relationship and Temporality.

Annotation Environment
We developed our own annotation platform tailored to our research questions. (The platform can be accessed at clabel.diptsrv003.bth.se.) Contrary to other annotation platforms [20], which only show single sentences to the annotators, we also show the predecessor and successor of each sentence. This is required to determine whether the causality extends over one sentence or across multiple ones (see RQ5c). For the binary annotation problems (see RQ1, RQ4a, RQ4b, RQ5a–d), we provide two labels for each category. Cue phrases present in the sentence can either be selected by the annotator from a list of already labeled cue phrases, or new cue phrases can be added using a text input field (see RQ4c). Since RQ2 and RQ3 are ternary annotation problems, the platform provides three labels for these categories.
Annotation Guideline
Prior to the labeling process, we conducted a workshop with all annotators to ensure a common understanding of causality. The results of the workshop were recorded in the form of an annotation guideline. All annotators were instructed to observe the following annotation rules: First, you should not just check for cue phrases and label the sentence directly as causal, but rather read the sentence completely before making a labeling decision. Otherwise, too many False Positives will be introduced. Second, you should check whether the cause is really necessary for the effect to occur. Only if the cause is mandatory for the effect is it a causal relation.
Table 1. Inter-annotator agreement statistics per category. The two categories Relationship and Temporality were jointly labeled by the first and second author and therefore do not require a reliability assessment.
For each category (Causal, Explicit, Marked, Single Sentence, Single Cause, Single Effect, Event Chain) and label (0/1), the table reports the confusion matrix of the annotator ratings as well as the percentage of agreement, Cohen's Kappa, and Gwet's AC1, together with the averages across categories.
Annotation Validity
To verify the reliability of our annotations, we calculated the inter-annotator agreement. We assigned 3,000 sentences to each annotator, of which 2,500 are unique and 500 overlapping. Based on the overlapping sentences, we calculated Cohen's Kappa [3] to evaluate how well the annotators can make the same annotation decision for a given category. We chose Cohen's Kappa since it is widely used for assessing inter-rater reliability [25]. However, a number of statistical problems are known to exist with this measure [18]. In case of a high imbalance of ratings, Cohen's Kappa is low and indicates poor inter-rater reliability even if there is a high agreement between the raters (Kappa paradox [6]). Thus, Cohen's Kappa is not meaningful in such scenarios. Consequently, studies [28] suggest that Cohen's Kappa should always be reported together with the percentage of agreement and other paradox-resistant measures (e.g. Gwet's AC1 measure [10]) in order to make a valid statement about the inter-rater reliability. We involved six annotators in the creation of the corpus and assessed the inter-rater reliability on the basis of 3,000 overlapping sentences, which represents about 20 % of the total data set. We calculated all measures (see Tab. 1) using the cloud-based version of AgreeStat [11]. Cohen's Kappa and Gwet's AC1 can both be interpreted using the taxonomy developed by Landis and Koch [16]: values ≤ 0 as poor, 0.01–0.20 as slight, 0.21–0.40 as fair, 0.41–0.60 as moderate, 0.61–0.80 as substantial, and 0.81–1.00 as almost perfect agreement. Tab. 1 demonstrates that the inter-rater agreement of our annotation process is reliable. Across all categories, an average percentage of agreement of 86 % was achieved. Except for the categories Single Cause and Single Effect, all categories show a percentage of agreement of at least 84 %. We hypothesize that the slightly lower value of 76 % for these two categories is caused by the fact that in some cases the annotators interpret the causes and effects with different granularity (e.g., some annotators break causes and effects down into several sub-causes and sub-effects, while others do not). Hence, the annotations differ slightly. The Kappa paradox is particularly evident for the categories Marked and Event Chain. Despite a high agreement of over 90 %, Cohen's Kappa yields a very low value, which "paradoxically" suggests almost no or only fair agreement. A more meaningful assessment is provided by Gwet's AC1, as it does not fail in case of prevalence and remains close to the percentage of agreement. Across all categories, the mean value is above 0.8, which indicates a nearly perfect agreement. Therefore, we assess our labeled data set as reliable and suitable for further analysis and the implementation of our causality detection approach.
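A minimal sketch of the two agreement measures for the two-rater, binary case illustrates the Kappa paradox; the confusion-matrix counts below are hypothetical and chosen to mimic a highly imbalanced category, they are not the counts of Tab. 1:

```python
def agreement_stats(a: int, b: int, c: int, d: int):
    """Two-rater, two-category agreement from a confusion matrix
    [[a, b], [c, d]] (rows: rater 1 chose 0/1, columns: rater 2 chose 0/1)."""
    n = a + b + c + d
    p_o = (a + d) / n                      # observed (percentage of) agreement
    p1_r1 = (c + d) / n                    # proportion of label 1 for rater 1
    p1_r2 = (b + d) / n                    # proportion of label 1 for rater 2
    # Cohen's Kappa: chance agreement from the raters' individual marginals
    p_e_kappa = p1_r1 * p1_r2 + (1 - p1_r1) * (1 - p1_r2)
    kappa = (p_o - p_e_kappa) / (1 - p_e_kappa)
    # Gwet's AC1: chance agreement from the mean marginal, robust to prevalence
    pi = (p1_r1 + p1_r2) / 2
    p_e_ac1 = 2 * pi * (1 - pi)
    ac1 = (p_o - p_e_ac1) / (1 - p_e_ac1)
    return p_o, kappa, ac1

# Hypothetical, highly imbalanced category: both raters almost always choose
# label 1, and they agree on about 95 % of the sentences.
p_o, kappa, ac1 = agreement_stats(a=2, b=10, c=13, d=475)
print(f"agreement={p_o:.2f}  Cohen's Kappa={kappa:.2f}  Gwet's AC1={ac1:.2f}")
# -> agreement=0.95  Cohen's Kappa=0.12  Gwet's AC1=0.95  (the Kappa paradox)
```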
Fig. 2. Annotation results per category. The y axis of the bar plot for the category "Causality" refers to the total number of analyzed sentences. The other bar plots relate only to the causal sentences.
Fig. 2 presents the analysis results for each labeled category. When interpreting the values, it is important to note that we analyze entire requirement documents in our study. Consequently, our data set contains records with different contents, which do not necessarily represent functional requirements. For example, requirement documents also contain non-functional requirements, phrases for content structuring, purpose statements, etc. Hence, the results of our analysis do not only refer to functional requirements but in general to the content of requirement documents.
Answer to RQ1:
Fig. 2 highlights that causality occurs in requirement documents: about 28 % of the analyzed sentences are causal. It can therefore be concluded that causality is a major linguistic element of requirement documents, since almost one third of all sentences are causal.
Answer to RQ2:
The majority (56 %) of causal sentences contained in requirement documents express an enable relationship between certain events. Only about 10 % of the causal sentences indicate a prevent relationship.
Cause relationships are found in about 34 % of the annotated data.
Answer to RQ3:
Interestingly, we found that causes and effects occur almost equally often in a before and a during relation. With about 48 %, the before relation is the most frequent temporal relation in our data set, but only with a difference of about 6 % compared to the during relation. The overlap relation occurred only in a minority (8.78 %) of the sentences.
Answer to RQ4a:
Fig. 2 shows that the majority of causal sentences contain one or more cue phrases to indicate the causal relationship between certain events.
Unmarked causality occurs only in about 15 % of the analyzed sentences.
Answer to RQ4b:
Most causal sentences are explicit, i.e. they contain information about both the cause and the effect. Only about 10 % of causal sentences are implicit.

Answer to RQ4c:
Tab. 2 provides an overview of the causal cue phrases used in the requirement documents. The left side of the table shows the different cue phrases ordered by word group. On the right side, all verbs used to express causal relations are listed. We order the verbs according to whether they express a cause, enable or prevent relationship. To measure the ambiguity of the individual cue phrases, we introduce the ambiguity factor (AF). We define the AF of a cue phrase x as the conditional probability that a sentence is causal given that the cue phrase x occurs in the sentence: Pr(Causal | X is present in sentence). Hence, a high AF value indicates a non-ambiguous cue phrase, while low values indicate strongly ambiguous cue phrases. Tab. 2 demonstrates that a number of different cue phrases are used to express causality in requirement documents. Not surprisingly, cue phrases like "if", "because" and "therefore" show AF values of more than 90 %. However, there is a variety of cue phrases that indicate causality in some sentences but also occur in other, non-causal contexts. This is especially evident in the case of pronouns: relative clauses can indicate causality, but not in every case, which is reflected by the low AF values. A similar pattern emerges with regard to the used verbs. Only a few verbs (e.g., "leads to", "degrade" and "enhance") show a high AF value. Consequently, the majority of used verbs do not necessarily indicate a causal relation if they are present in a sentence.
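A minimal sketch of how the AF of a cue phrase can be estimated from annotated sentences; the mini corpus below is hypothetical, and the simple substring matching is for illustration only:

```python
from collections import Counter

def ambiguity_factor(sentences, cue):
    """AF(cue) = Pr(sentence is causal | cue occurs in the sentence),
    estimated from a list of (sentence, is_causal) pairs."""
    counts = Counter()
    for text, is_causal in sentences:
        if cue in text.lower():
            counts["causal" if is_causal else "not_causal"] += 1
    total = counts["causal"] + counts["not_causal"]
    return counts["causal"] / total if total else None

# Hypothetical mini corpus; in the study, the estimate is based on the annotated data set.
corpus = [
    ("If the user presses the button, a window appears.", True),
    ("The window shows the status since the last login.", False),
    ("Since the sensor failed, the system switches to backup mode.", True),
]
print(ambiguity_factor(corpus, "if"), ambiguity_factor(corpus, "since"))  # 1.0 0.5
```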
Answer to RQ5a:
Fig. 2 illustrates that a causal relation in requirement documents often includes only a single cause. Multiple causes occur in only 19.1 % of the analyzed causal sentences. The exact number of causes was not documented during the annotation process. However, the participating annotators reported consistently that in the case of complex causal relations, two to three causes were usually included. More than three causes were rare.
Answer to RQ5b:
Interestingly, the distribution of effects is similar to that of causes. Likewise, single effects occur significantly more often than multiple effects. According to the annotators, the number of effects in the case of complex relations is limited to two effects. Three or more effects occur rarely.
Answer to RQ5c:
Most causal relations can be found in single sentences. Relations where cause and effect are distributed over several sentences occur in only about 7 % of the analyzed data. The annotators reported that most often the cue phrase "therefore" was used to express two-sentence causality.
Answer to RQ5d:
Fig. 2 shows that event chains are rarely used in requirement documents. Most causal sentences contain isolated causal relations and only a few event chains.
Table 2. Overview of cue phrases used to indicate causality in requirement documents. AF values of at least 0.8 mark non-ambiguous phrases that mostly indicated causality (Pr(Causal | X is present in sentence) ≥ 0.8).

Cue phrases by word group:
  Type          Phrase               Causal  Not Causal  AF
  conjunctions  if                   387     41          0.90
                as                   607     1313        0.32
                because              78      7           0.92
                but                  100     204         0.33
                in order to          141     33          0.81
                so (that)            88      86          0.51
                unless               23      4           0.85
                while                71      90          0.44
                once                 48      15          0.76
                except               9       5           0.64
                as long as           12      1           0.92
  adverbs       therefore            61      6           0.91
                when                 331     64          0.84
                whenever             10      0           1.00
                hence                21      9           0.70
                where                213     150         0.59
                since                65      32          0.67
                consequently         2       6           0.25
                wherever             5       2           0.71
                rather               16      30          0.35
                to this/that end     12      0           1.00
                thus                 66      17          0.80
                for this reason      7       3           0.70
                due to               91      26          0.78
                thereby              4       2           0.67
                as a result          11      4           0.73
                for this purpose     1       2           0.33
  pronouns      which                277     608         0.31
                who                  28      52          0.35
                that                 732     1178        0.38
                whose                16      11          0.59
  adjectives    only                 127     126         0.50
                prior to             26      20          0.57
                imperative           1       3           0.25
                necessary (to)       36      19          0.65
  prepositions  for                  1209    2753        0.31
                during               327     137         0.70
                after                133     57          0.70
                by                   506     1171        0.30
                with                 680     1554        0.30
                in the course of     2       1           0.67
                through              114     204         0.36
                as part of           19      51          0.27
                in this case         18      3           0.86
                before               54      27          0.67
                until                33      11          0.75
                upon                 25      48          0.34
                in case of           30      7           0.81
                in both cases        1       0           1.00
                in the event of      15      2           0.88
                in response to       6       7           0.46
                in the absence of    8       1           0.89

Verbs by relation type:
  Type     Phrase               Causal  Not Causal  AF
  Cause    force(s/ed)          21      18          0.54
           cause(s/ed)          32      10          0.76
           lead(s) to           5       0           1.00
           reduce(s/ed)         48      28          0.63
           minimize(s/ed)       28      11          0.72
           affect(s/ed)         13      19          0.41
           maximize(s/ed)       11      5           0.69
           eliminate(s/ed)      8       11          0.42
           result(s/ed) in      50      43          0.54
           increase(s/ed)       49      34          0.59
           decrease(s/ed)       5       8           0.38
           impact(s)            37      68          0.35
           degrade(s/ed)        11      2           0.85
           introduce(s/ed)      11      12          0.48
           enforce(s/ed)        2       1           0.67
           trigger(s/ed)        11      7           0.61
  Enable   depend(s) on         28      21          0.57
           require(s/ed)        316     262         0.55
           allow(s/ed)          187     130         0.59
           need(s/ed)           98      162         0.38
           necessitate(s/ed)    7       2           0.78
           facilitate(s/ed)     29      28          0.51
           enhance(s/ed)        16      4           0.80
           ensure(s/ed)         145     66          0.69
           achieve(s/ed)        30      24          0.56
           support(s/ed)        128     301         0.30
           enable(s/ed)         75      36          0.68
           permit(s/ed)         10      13          0.43
           rely on              3       5           0.38
  Prevent  hinder(s/ed)         1       1           0.50
           prevent(s/ed)        38      17          0.69
           avoid(s/ed)          14      23          0.38
Based on the results of our case study, we draw the following conclusions: Causality matters in requirements documents, which underlines the necessity of an approach for the automatic detection and extraction of causal requirements. The complexity of causal relations ranges from low to medium, since they usually consist of a single cause and effect. However, for the approaches to be applicable in practice, they also need to comprehend more complex relations containing two to three causes and up to two effects. Hence, the approaches must be capable of understanding conjunctions, disjunctions and negations in the sentences to fully capture the relationships between causes and effects. Two-sentence causality and event chains occur only rarely. Thus, both aspects can initially be neglected in the development of the approaches, while still more than 92 % of the analyzed sentences can be covered. Since most causal relations in requirements documents are explicit, the detection and extraction of causality is simplified: the information about both causes and effects is embedded directly in the sentences, so that the approaches require little or no implicit knowledge. The analysis of the AF values reveals that most of the used cue phrases are ambiguous. Consequently, our methods require a deep understanding of language, as causality cannot be deduced from the presence of certain cue phrases alone but rather from a combination of the syntax and semantics of the sentence.
Internal Validity: A major threat to internal validity are the annotations themselves, as an annotation task is to a certain degree subjective. To minimize the bias of the annotators, we performed two mitigation actions: First, we conducted a workshop prior to the annotation process to ensure a common understanding of causality. Second, we assessed the inter-rater agreement by using multiple metrics (Gwet's AC1 etc.).

External Validity: To achieve reasonable generalizability, we selected requirements documents from different domains and years. As Fig. 1 shows, our data set covers a variety of domains, but the distribution of the sentences is imbalanced. The domains aerospace, data analytics, and smart city account for a large part of the data set (9,724 sentences), while the other 15 domains are underrepresented. Hence, our results do not allow a general conclusion about causality in requirements documents. Future studies should expand to more documents from these underrepresented as well as further domains to achieve a more global insight into causality in requirements documents.
This section presents the implementation of our causal classifier. We first describe the applied methods and then report the results of our experiments, in which we compare the performance of the individual methods.
Rule Based Approach
The baseline approach for causality detection involves the use of simple regular expressions. We iterate through all sentences in the test set and check if one of the phrases listed in Tab. 2 is included. If so, the sentence is classified as causal, otherwise as non-causal.
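A minimal sketch of such a rule-based baseline; the phrase list below is only a subset of the cue phrases in Tab. 2, and the exact matching strategy of the original implementation may differ:

```python
import re

# A few of the cue phrases from Tab. 2; the full baseline iterates over all of them.
CUE_PHRASES = ["if", "because", "therefore", "so that", "as long as", "due to",
               "in order to", "when", "whenever", "since", "unless"]
# Word-boundary matching so that e.g. "modified" does not match the cue "if".
CUE_REGEX = re.compile(r"\b(" + "|".join(re.escape(p) for p in CUE_PHRASES) + r")\b",
                       re.IGNORECASE)

def classify_rule_based(sentence: str) -> bool:
    """Classify a sentence as causal iff it contains at least one cue phrase."""
    return CUE_REGEX.search(sentence) is not None

print(classify_rule_based("An error message is shown if the process fails."))  # True
print(classify_rule_based("The system shall store all data in a database."))   # False
```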
Machine Learning Based Approach
As a second approach, we investigate the use of supervised ML models that learn to predict causality based on the labeled data set. Specifically, we employ established binary classification algorithms: Naive Bayes (NB), Support Vector Machines (SVM), Random Forest (RF), Decision Tree (DT), Logistic Regression (LR), Ada Boost (AB) and K-Nearest Neighbor (KNN). To determine the best hyperparameters for each binary classifier, we apply Grid Search, which fits the model on every possible combination of hyperparameters and selects the most performant one. We use two different methods to represent the sentences: Bag of Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF). In Tab. 3 we report the classification results of each algorithm as well as the best combination of hyperparameters.
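A minimal sketch of this setup using scikit-learn, here with an SVM, a reduced hyperparameter grid, and hypothetical toy sentences; the original implementation may differ in its exact grid and preprocessing:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical toy data; in the study, the annotated (causal / non-causal) sentences are used.
sentences = ["If the process fails, an error message is shown.",
             "The user interface shall be available in English.",
             "The system logs out the user when the session expires.",
             "All reports are stored in PDF format."]
labels = [1, 0, 1, 0]

# BoW (CountVectorizer) or TF-IDF (TfidfVectorizer) features, followed by a binary classifier.
pipeline = Pipeline([("vect", CountVectorizer()), ("clf", SVC())])
param_grid = {
    "vect": [CountVectorizer(), TfidfVectorizer()],
    "clf__C": [1, 10, 50],
    "clf__gamma": [0.001, 0.01],
    "clf__kernel": ["rbf", "linear"],
}
# Grid Search fits the pipeline on every hyperparameter combination and keeps the best one.
search = GridSearchCV(pipeline, param_grid, cv=2, scoring="f1_macro")
search.fit(sentences, labels)
print(search.best_params_, search.best_score_)
```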
Deep Learning Based Approach
With the rise of Deep Learning (DL), more and more researchers are using DL models for Natural Language Processing (NLP) tasks. In this context, the Bidirectional Encoder Representations from Transformers (BERT) model [4] is prominent and has already been used for question answering and named entity recognition. BERT is pre-trained on large corpora and can therefore easily be fine-tuned for any downstream task without the need for much training data (Transfer Learning). In our paper, we make use of the fine-tuning mechanism of BERT and investigate to which extent it can be used for causality detection in requirement sentences. First, we tokenize each sentence. BERT requires input sequences with a fixed length (maximum 512 tokens). Therefore, for sentences that are shorter than this fixed length, padding tokens (PAD) are inserted to adjust all sentences to the same length. Other tokens, such as the classification (CLS) token, are also inserted in order to provide further information about the sentence to the model. CLS is the first token in the sequence and represents the whole sentence (i.e. it is the pooled output of all tokens of a sentence). For our classification task, we mainly use this token because it stores the information of the whole sentence. We feed the pooled information into a single-layer feedforward neural network with a softmax layer, which calculates the probability that a sentence is causal or not. We tune BERT in three different ways and investigate their performance (a minimal fine-tuning sketch follows the list):

– BERT_Base: In the base variant, the sentences are tokenized as described above and put into the classifier. To choose a suitable fixed length for our input sequences, we analyzed the lengths of the sentences in our data set. Even with a fixed length of 128 tokens, we cover more than 97 % of the sentences. Sentences containing more tokens are shortened accordingly. Since this affects only a small number of sentences, only little information is lost. Thus, we chose a fixed length of 128 tokens instead of the maximum possible 512 tokens to keep BERT's computational requirements to a minimum.
– BERT_POS: Studies have shown that the performance of NLP models can be improved by providing explicit prior knowledge of syntactic information to the model [24]. Therefore, we enrich the input sequence with syntactic information and feed it into BERT. More specifically, we add the corresponding Part-of-Speech (POS) tag to each token by using the spaCy NLP library [12]. One way to encode the input sequence with the corresponding POS tags is to concatenate each token embedding with a one-hot encoded vector representing the POS tag. However, since the BERT token embeddings are high-dimensional, the impact of a single added feature (i.e. the POS tag) would be low. Instead, we hypothesize that the syntactic information has a higher impact if we annotate the input sentences directly with the POS tags and then put the annotated sentences into BERT. This way of creating linguistically enriched input sequences has already proven to be promising during the development of the NLPL word embeddings [5]. Fig. 3 shows how we incorporate the POS tags into the input sequence. By extending the input sequence, the fixed length for the BERT model has to be adapted accordingly. After a further analysis, a length of 384 tokens proved to be reasonable.
– BERT_DEP: Similar to the previous fine-tuning approach, we follow the idea of enriching the input sequence with linguistic features. Instead of the POS tags, we use the dependency (DEP) tag (see Fig. 3) of each token. Thus, we provide knowledge about the grammatical structure of the sentence to the classifier. We hypothesize that this knowledge has a positive effect on the model performance, as a causal relation is a specific grammatical structure (e.g. it often contains an adverbial clause) and the classifier can learn causality-specific patterns in the grammatical structure of the training instances. The fixed token length was also increased to 384 tokens.
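A minimal fine-tuning sketch using the Hugging Face transformers library; whether this exact library and the bert-base-cased checkpoint were used in the original implementation is an assumption, and the two example sentences are hypothetical. BertForSequenceClassification places a classification layer on top of the pooled CLS representation, and a softmax over its two logits yields the probability that a sentence is causal:

```python
import torch
from transformers import BertTokenizerFast, BertForSequenceClassification

tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")   # assumed checkpoint
model = BertForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)

sentences = ["If the process fails, an error message is shown.",
             "The user interface shall be available in English."]
labels = torch.tensor([1, 0])  # 1 = causal, 0 = not causal

# Padding/truncation to a fixed length of 128 tokens (384 for the POS/DEP variants).
batch = tokenizer(sentences, padding="max_length", truncation=True, max_length=128,
                  return_tensors="pt")

# Hyperparameters as reported in Tab. 3: batch size 16, lr 2e-5, weight decay 0.01, AdamW.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
model.train()
for _ in range(3):  # a few illustrative training steps; the study uses proper epochs
    outputs = model(**batch, labels=labels)
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Inference: softmax over the logits gives Pr(not causal) and Pr(causal) per sentence.
model.eval()
with torch.no_grad():
    probs = torch.softmax(model(**batch).logits, dim=-1)
print(probs)
```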
BERT_Base: If the process fails, an error message is shown.

BERT_POS: If SCONJ the DET process NOUN fails VERB , PUNCT an DET error NOUN message NOUN is AUX shown VERB . PUNCT

BERT_DEP: If mark the det process nsubj fails advcl , punct an det error compound message nsubjpass is auxpass shown ROOT . punct

Fig. 3. Input sequences used for our different BERT fine-tuning models. POS tags are marked orange and DEP tags are marked blue.
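A minimal sketch of how such enriched input sequences can be produced with spaCy; the exact tags depend on the spaCy model and version, so the output may differ slightly from Fig. 3:

```python
import spacy

# Requires the small English model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def enrich(sentence: str, tag: str = "pos") -> str:
    """Interleave each token with its POS or DEP tag, as in Fig. 3."""
    doc = nlp(sentence)
    if tag == "pos":
        return " ".join(f"{t.text} {t.pos_}" for t in doc)
    return " ".join(f"{t.text} {t.dep_}" for t in doc)

sentence = "If the process fails, an error message is shown."
print(enrich(sentence, "pos"))  # e.g. If SCONJ the DET process NOUN fails VERB , PUNCT ...
print(enrich(sentence, "dep"))  # e.g. If mark the det process nsubj fails advcl , punct ...
```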
Our labeled data set is imbalanced, as only 28.1 % of the sentences are positive samples. To avoid the class imbalance problem, we apply Random Under Sampling (see Fig. 4): we randomly select sentences from the majority class and exclude them from the data set until a balanced distribution is achieved. Our final data set consists of 8,430 sentences, of which 4,215 are causal and 4,215 are non-causal. We follow the idea of Cross Validation and divide the data set into a training, validation and test set. The training set is used for fitting the algorithm, while the validation set is used to tune its parameters. The test set is utilized for the evaluation of the algorithm on unseen real-world data. We opt for a 10-fold Cross Validation, as a number of studies have shown that a model trained this way demonstrates low bias and variance [13]. We use standard metrics for evaluating our approaches: Accuracy, Precision, Recall and F1 score [13]. When interpreting the metrics, it is important to consider which misclassification (False Negative or False Positive) matters most, i.e. causes the highest costs. Since causality detection is supposed to be the first step towards automatic causality extraction, we favor Recall over Precision. A high Recall corresponds to a greater degree of automation of causality extraction, because it is easier for users to discard False Positives than to manually detect False Negatives. Consequently, we seek high Recall to minimize the risk of missed causal sentences and acceptable Precision to ensure that users are not overwhelmed by False Positives.
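A minimal sketch of the undersampling and evaluation procedure; `train_and_predict` is a placeholder for any of the classifiers described above, and the separate validation split used for hyperparameter tuning is omitted for brevity:

```python
import random
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import precision_recall_fscore_support, accuracy_score

def random_under_sample(samples, seed=42):
    """Drop random majority-class samples until both classes are balanced."""
    random.seed(seed)
    causal = [s for s in samples if s[1] == 1]
    non_causal = [s for s in samples if s[1] == 0]
    majority, minority = (non_causal, causal) if len(non_causal) > len(causal) else (causal, non_causal)
    return minority + random.sample(majority, len(minority))

def evaluate(data, train_and_predict, folds=10):
    """data: list of (sentence, label) pairs; train_and_predict(train_texts, train_labels,
    test_texts) -> predicted labels, standing in for any classifier from this section."""
    balanced = random_under_sample(data)
    texts = [t for t, _ in balanced]
    labels = [l for _, l in balanced]
    skf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=42)
    for train_idx, test_idx in skf.split(texts, labels):
        y_true = [labels[i] for i in test_idx]
        y_pred = train_and_predict([texts[i] for i in train_idx],
                                   [labels[i] for i in train_idx],
                                   [texts[i] for i in test_idx])
        # Per-class Precision, Recall and F1 (label order: not causal, causal), plus Accuracy.
        p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average=None, labels=[0, 1])
        print(f"acc={accuracy_score(y_true, y_pred):.2f}  recall={r}  precision={p}  f1={f1}")
```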
Fig. 4. Implementation and Evaluation Procedure of our Binary Classifier.
Tab. 3 demonstrates the inability of the baseline approach to distinguish between causal (F1 score: 66 %) and non-causal (F1 score: 64 %) sentences. This coincides with our observation from the case study that searching for cue phrases is not suitable for causality detection. In comparison, most ML based approaches (except KNN and DT) show a better performance. The best performance in this category is achieved by RF with an Accuracy of 78 % (a gain of 13 % compared to the baseline approach). The overall best classification results are achieved by our DL based approaches. All three variants were trained with the hyperparameters recommended by Devlin et al. [4]. Even the vanilla BERT_Base model shows a great performance in both classes (F1 score ≥ 80 % for causal and non-causal). Interestingly, enriching the input sequences with syntactic information did not result in a significant performance boost. BERT_POS even has a slightly worse Accuracy value of 78 % (a difference of 2 % compared to BERT_Base). An improvement of the performance can be observed in the case of BERT_DEP, which has the best F1 score for both classes among all approaches and also achieves the highest Accuracy value of 82 %. Compared to the rule based and ML based approaches, BERT_DEP yields an average gain of 11.06 % in macro-Recall and 11.43 % in macro-Precision. Interesting is a comparison with BERT_Base: BERT_DEP shows better values across all metrics, but the difference is only marginal. This indicates that BERT_Base already has a deep language understanding due to its pre-training and can therefore be tuned well for causality detection without much further input. However, over all five runs, the use of the DEP tags shows a small but not negligible performance gain, especially regarding our main decision criterion: the Recall value (85 % for causal and 79 % for non-causal). Therefore, we choose BERT_DEP as our final approach (CiRA).
Table 3. Recall, Precision, F1 scores (per class, Causal: Support 435, Not Causal: Support 408) and Accuracy. We report the averaged scores over five repetitions.

Rule based (no hyperparameters): Causal R 0.65 / P 0.66 / F1 0.66; Not Causal R 0.65 / P 0.63 / F1 0.64; Accuracy 0.65
ML based:
  NB (alpha: 1, fit prior: True, embed: BoW): Causal R 0.71 / P 0.70 / F1 0.71; Not Causal R 0.68 / P 0.69 / F1 0.69; Accuracy 0.70
  SVM (C: 50, gamma: 0.001, kernel: rbf, embed: BoW): Causal R 0.68 / P 0.80 / F1 0.73; Not Causal R 0.82 / P 0.71 / F1 0.76; Accuracy 0.75
  RF (criterion: entropy, max features: auto, n estimators: 500, embed: BoW): Causal R 0.72; Accuracy 0.78
DL based:
  BERT_Base (batch size: 16, learning rate: 2e-05, weight decay: 0.01, optimizer: AdamW): Causal R 0.83 / P 0.80 / F1 0.82; Not Causal R 0.78 / P 0.82 / F1 0.80; Accuracy 0.81
  BERT_POS: Accuracy 0.78
  BERT_DEP (CiRA): Causal R 0.85; Not Causal R 0.79; Accuracy 0.82
As indicated in Section 2, many disciplines have already dealt with causality. To the best of our knowledge, we are the first to focus on causality from the perspective of RE. In our previous paper [7], we motivated why the RE community should engage with causality, while in this paper we provide empirical evidence for the relevance of causality in requirement documents and an insight into its form and complexity. Detecting causality in natural language has been investigated by several studies. Multiple papers [14,29] use handcrafted patterns to identify causal sentences. These approaches are highly dependent on the manually created patterns and show weak performance. Recent papers apply neural networks and exploit, similarly to us, the Transfer Learning capability of BERT [15]. However, we see a number of problems with these papers regarding the realization of our described RE use cases: First, neither the code nor a demo is published, making it difficult to reproduce the results and to test the performance on RE data. Second, they train and evaluate their approaches on strongly unbalanced data sets with causal to non-causal ratios of 1:2 and 1:3, but only report the macro-Recall and macro-Precision values and not the metrics per class. Thus, it is not clear whether the classifier has a bias towards the majority class or not.
System behavior is often specified by causal relations in requirements. Extracting this causal knowledge supports automatic test case derivation and reasoning about requirement dependencies [7]. However, existing methods fail to extract causality with reasonable performance [1]. Therefore, we argue for the need of a novel method for causality extraction. We understand causality extraction as a two-step problem: First, we need to detect if requirements have causal properties. Second, we need to comprehend and extract their causal relations. At present, however, we lack knowledge about the form and complexity of causality in requirements, which is needed to develop suitable approaches for these two problems. In this paper, we address this research gap and contribute: (1) an exploratory case study with 14,983 sentences from 53 requirements documents originating from 18 different domains. We found that causality is a widely used linguistic pattern to describe system functionalities and that it mainly occurs in explicit, marked form. (2) CiRA as an approach for the automatic detection of causality in requirements documents. This constitutes a first step towards causality extraction from NL requirements. We empirically evaluate our approach and achieve a macro-F1 score of 82 % on real-world data. (3) our code, tool, and annotated data set to facilitate replication. Two further research directions exist: First, extending the case study and analyzing the sentences from the requirements documents in a more granular way, e.g. by categorizing them into functional and non-functional requirements. This would expand our current insight into causality in requirements documents in general by an insight into causality in specific requirement categories. Second, we are enhancing our previous approaches [8,9] to address the second sub-problem: the actual extraction of causal relations.

Acknowledgements
We would like to acknowledge that this work was supported by the KKS foundation through the S.E.R.T. Research Profile project at Blekinge Institute of Technology. Further, we thank Yannick Debes for his valuable feedback.
References
1. Asghar, N.: Automatic extraction of causal relations from natural language texts: A comprehensive survey. arXiv abs/1605.07895