Automatic Repair of Vulnerable Regular Expressions
NARIYOSHI CHIDA,
NTT Secure Platform Laboratories, NTT Corporation / Waseda University, Japan
TACHIO TERAUCHI,
Waseda University, Japan

A regular expression is called vulnerable if there exist input strings on which the usual backtracking-based matching algorithm runs in super-linear time. Software containing vulnerable regular expressions is prone to algorithmic-complexity denial-of-service attacks, in which a malicious user provides input strings exhibiting the bad behavior. Due to the prevalence of regular expressions in modern software, vulnerable regular expressions are a serious threat to software security. While there has been prior work on detecting vulnerable regular expressions, in this paper we present a first step toward repairing a possibly vulnerable regular expression. Importantly, our method handles real-world regular expressions containing extended features such as lookarounds, capturing groups, and backreferences. (The problem is actually trivial without such extensions, since any pure regular expression can be made invulnerable via a DFA conversion.) We build our approach on the recent work on example-based repair of regular expressions by Pan et al. [Pan et al. 2019], which synthesizes a regular expression that is syntactically close to the original one and correctly classifies the given set of positive and negative examples. The key new idea is the use of linear-time constraints, which disambiguate a regular expression and ensure linear-time matching. We generate the constraints using an extended non-deterministic finite automaton that supports the extended features in real-world regular expressions. While our method is not guaranteed to produce a semantically equivalent regular expression, we empirically show that the repaired regular expressions tend to be nearly indistinguishable from the original ones.

Additional Key Words and Phrases: Program Synthesis, Regular Expressions, ReDoS
Regular expressions have become an integral part of modern programming languages and software development, e.g., they are used as general purpose libraries [Chapman and Stolee 2016; Russ 2007], for sanitizing user inputs [Hooimeijer et al. 2011; Yu et al. 2016], and for extracting data from unstructured text [Bartoli et al. 2016; Li et al. 2008]. Despite the widespread use of regular expressions in practice, it is an unfortunate fact that developers often write vulnerable regular expressions, which are open to attacks that craft inputs causing the standard backtracking regular expression matching algorithm to take quadratic (or worse) time in the size of the input string [Davis et al. 2018]. These performance problems are known as regular expression denial-of-service (ReDoS) vulnerabilities [Adar 2017], and they are a significant threat to our society due to the widespread use of regular expressions [Davis et al. 2018; John 2016, 2019; Staicu and Pradel 2018]. While there has been much research on overcoming ReDoS vulnerabilities [Kirrage et al. 2013; Shen et al. 2018; Sugiyama and Minamide 2014; Weideman et al. 2016; Wüstholz et al. 2017], the previous works have focused mainly on the problem of detecting vulnerable regular expressions, and the problem of repairing them remains largely open. As reported by Davis et al. [Davis 2018; Davis et al. 2018], writing invulnerable regular expressions is a formidable task that developers often fail to achieve in practice.

Authors’ addresses: Nariyoshi Chida, NTT Secure Platform Laboratories, NTT Corporation / Waseda University, Japan, [email protected]; Tachio Terauchi, Waseda University, Japan, [email protected].
2475-1421/2018/1-ART1 $15.00 https://doi.org/
Proc. ACM Program. Lang., Vol. 1, No. CONF, Article 1. Publication date: January 2018.

In this paper, we consider the problem of repairing a given regular expression into an invulnerable and semantically equivalent one. In particular, we consider repairing the so-called real-world regular expressions that have grammatical extensions such as lookarounds, capturing groups, and backreferences [Friedl 2006]. The first main contribution of our paper is a formal definition of vulnerable real-world regular expressions. To this end, we define a formal model of real-world regular expression engines given by a set of natural semantics deduction rules. The model contains sufficient details of the standard backtracking matching algorithm for formalizing the vulnerability of real-world regular expressions. While previous work [Wüstholz et al. 2017] has also proposed such a definition for pure regular expressions, which do not contain the extended features, their model is based on a translation to a non-deterministic finite automaton (NFA) and hence is inapplicable to real-world regular expressions, which are known to be not (actually) regular [Câmpeanu et al. 2003]. Also, as we shall show in Section 3, we have discovered that their model can miss certain vulnerable cases even for pure regular expressions (though the bug is easily fixable, this shows the subtlety of the task of formalizing vulnerability).

We note that the repair problem is trivial for pure regular expressions: it can be solved by applying the standard powerset construction to translate the given expression to an equivalent deterministic finite automaton (DFA) and then translating the DFA back to a regular expression via the standard state-removal method [Sipser 1997]. While the resulting regular expression can be exponentially larger, it is easy to see that it is invulnerable, as the backtracking matching algorithm would only take time linear in the input string length.
Unfortunately, the DFA translation approach cannot be applied to real-world regular expressions because real-world regular expressions are not regular, as remarked above, and therefore there may be no DFA equivalent to the given expression. Another issue that makes the problem highly non-trivial is the fact that deciding the equivalence of real-world regular expressions is undecidable (in fact, with just the backreference extension) [Freydenberger 2013]. This precludes the adoption of synthesis techniques that require equivalence queries, such as the L* algorithm [Angluin 1987].

To overcome these challenges, we adopt the example-based synthesis and repair approaches that have been explored in recent works [Lee et al. 2016; Pan et al. 2019]. In these approaches, a regular expression is synthesized from a set of positive examples (strings to be accepted) and negative examples (strings to be rejected) so that the synthesized expression correctly classifies the examples. However, the previous works did not consider the extended features of real-world regular expressions. Furthermore, they were only concerned with semantic correctness and did not investigate vulnerability. Thus, in our work, we extend the example-based synthesis approaches with support for the extended features of real-world regular expressions, and with a certain tweak to ensure that the synthesis algorithm only synthesizes invulnerable expressions. The latter is accomplished by a notion that we call the linear-time property (LTP). We show that the LTP is sufficient to guarantee the invulnerability of real-world regular expressions satisfying it (cf. Theorem A.9). We extend the example-based synthesis with additional constraints to ensure that the synthesis result satisfies the LTP and is therefore invulnerable. While we show that the synthesis problem is NP-hard (cf. Theorem 4.13), our algorithm is able to conduct an efficient search for a solution by extending the state space pruning techniques of [Lee et al. 2016; Pan et al. 2019] to support the extended features of real-world expressions, and by extending the SMT-based constraint solving approach of [Pan et al. 2019] with support for the real-world extensions and the additional constraints for enforcing the linear-time property.

We have implemented a prototype of our algorithm in a tool called Remedy (Regular Expression Modifier for Ensuring Deterministic propertY). We have experimented with the tool on a set of benchmarks of real-world regular expressions taken from [Davis et al. 2018]. Our experimental results show that Remedy was able to successfully repair.

The implementation is available at placeholder

To summarize, this paper makes the following contributions:
• We give a novel formal semantics of the backtracking matching algorithm for real-world regular expressions and, with it, give a formal definition of the ReDoS vulnerability for them. Our definition extends the previous proposal for the formal definition of the ReDoS vulnerability for pure regular expressions [Wüstholz et al. 2017] with support for the real-world extensions, and we also show a subtle bug in the previous proposal (Section 3).
• We present the novel linear-time property of real-world regular expressions, and prove that the property is sufficient for guaranteeing invulnerability. Then, we define the problem of synthesizing an expression from a given set of examples that satisfies the linear-time property, which we call the linear-time-property repair problem (LTP repair problem), and prove that the problem is NP-hard (Section 4).
• We give an algorithm for solving the LTP repair problem. Our algorithm builds on the previous example-based synthesis and repair algorithms for pure regular expressions [Lee et al. 2016; Pan et al. 2019], and extends them in two important ways: (1) support for the real-world extensions and (2) the incorporation of the linear-time property to enforce invulnerability (Section 5).
• We present an implementation of the algorithm in a tool called Remedy, and present an evaluation of the tool on a set of real-world benchmarks (Section 6).
In this section, we informally explain our algorithm for repairing vulnerable regular expressions by an example. To illustrate, we use a regular expression that caused the global outage of a real-world service. Consider the regular expression ·∗·∗=·∗. The regular expression accepts any string that contains at least one equals sign. For example, it accepts aa=a, a=, and =a. Unfortunately, the regular expression is vulnerable because it takes quadratic time on strings that do not contain an equals sign. In fact, the regular expression caused the global outage of a real-world service for 27 minutes due to the quadratic running time [Graham-Cumming 2019].

Remedy helps in this situation by automatically repairing the vulnerable regular expression into an invulnerable one. The user simply provides the possibly vulnerable regular expression to be repaired.
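To make the quadratic behavior concrete, the following Python sketch counts the split points that a simplified greedy backtracking search tries for ·∗·∗=·∗. The function name `count_attempts` and its counting convention are assumptions made for illustration only. They model the search informally and are neither the formal running-time definition given later in the paper nor the internals of any particular engine.

```python
import re

PATTERN = re.compile(r".*.*=.*")  # the vulnerable expression from the text


def count_attempts(w):
    """Count the (i, j) splits a naive greedy backtracking search tries.

    The first .* consumes w[:i], the second .* consumes w[i:j], and the
    literal '=' must then match at position j. The trailing .* always
    succeeds, so it adds nothing to the search.
    """
    n = len(w)
    attempts = 0
    for i in range(n, -1, -1):          # first .* is greedy: longest first
        for j in range(n, i - 1, -1):   # second .* is greedy as well
            attempts += 1
            if j < n and w[j] == "=":
                return attempts, True
    return attempts, False


if __name__ == "__main__":
    for s in ["aa=a", "a=", "=a", "aaa"]:
        print(s, bool(PATTERN.fullmatch(s)), count_attempts(s))
    for n in (10, 20, 40):
        # Without '=', all (n + 1)(n + 2) / 2 splits are tried.
        print(n, count_attempts("a" * n)[0])
```

On a string of n characters without an equals sign, every pair of split points is tried before the search gives up, so the count grows as (n + 1)(n + 2)/2, which is quadratic in n.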
Remedy first generates a set of positive examples, i.e., strings that the regular expression has to accept, and a set of negative examples, i.e., strings that the regular expression has to reject, by applying an input generator to the given regular expression. Here, we assume that the generated positive and negative examples are:

Positive Examples    Negative Example
=                    abc
abcd====abcd
ab=c

Remedy then searches for a regular expression that is consistent with all examples and has the linear-time property (LTP). Also, Remedy biases the search toward regular expressions that are syntactically close to the given one, so that the synthesized regular expression is likely to be semantically close as well. The LTP is a property that ensures a linear running time of a regular expression engine. Informally, it makes the behavior of a regular expression engine deterministic, thus guaranteeing the linear running time. The regular expression ·∗·∗=·∗ violates the LTP because a regular expression engine has three ways to match the equals character =, that is, = can match the first ·∗, the second ·∗, and the expression =. We explain the steps of the repair process at a high level.
Generating templates. Remedy generates templates, which are regular expressions containing holes. Informally, a hole □ is a placeholder that is to be replaced with some concrete regular expression. Remedy starts with the initial template set to be the input regular expression ·∗·∗=·∗. Since the regular expression is vulnerable and does not satisfy the LTP, Remedy replaces subexpressions with holes and expands the holes by replacing them with templates such as □□, □|□, □∗, (?=□), and \i, iteratively. After some iterations, we get the template □₁∗□₂∗=·∗.

Pruning by approximations. In this step, Remedy prunes templates that cannot reach a solution. For this purpose, Remedy approximates a template by replacing the holes in the template with a regular expression that accepts any string, i.e., ·∗, or with one that rejects every string, i.e., a failure ∅. The approximations are used to efficiently detect when there is no way to instantiate the template to reach a regular expression that is consistent with the examples; such a template gets pruned. From the template □₁∗□₂∗=·∗, Remedy generates the regular expressions (·∗)∗(·∗)∗=·∗ and ∅∗∅∗=·∗ as the over- and under-approximations, respectively. In this case, (·∗)∗(·∗)∗=·∗ accepts all positive examples and ∅∗∅∗=·∗ rejects all negative examples. Thus, Remedy does not prune this template and continues to search for a solution with the template.
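The pruning check described above can be sketched in Python with the standard `re` module. This is an illustrative sketch under several assumptions, not Remedy's implementation: the hole marker `#`, the helper names `instantiate` and `may_have_solution`, and the encoding of ∅ as the empty character class `[^\s\S]` are all hypothetical. The check is only sound when enlarging a hole's language can only enlarge the template's language, which holds for the template used here but not, for instance, for holes under a negative lookahead.

```python
import re

ANY = r".*"          # over-approximation of a hole: accepts any string
NONE = r"[^\s\S]"    # under-approximation: a character class matching nothing


def instantiate(template, filler):
    """Replace every hole marker '#' in the template with the given filler."""
    return template.replace("#", f"(?:{filler})")


def may_have_solution(template, positives, negatives):
    """Return False when the template can safely be pruned."""
    over = re.compile(instantiate(template, ANY))
    under = re.compile(instantiate(template, NONE))
    if not all(over.fullmatch(p) for p in positives):
        return False   # even the most permissive instance misses a positive
    if any(under.fullmatch(n) for n in negatives):
        return False   # even the most restrictive instance accepts a negative
    return True


if __name__ == "__main__":
    positives = ["=", "abcd====abcd", "ab=c"]
    negatives = ["abc"]
    print(may_have_solution("#*#*=.*", positives, negatives))  # kept
    print(may_have_solution("a#*", positives, negatives))      # pruned
```

A template that survives this check is not guaranteed to have a solution. The check only rules out templates for which no instantiation can be consistent with the examples.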
Searching assignments. Next, Remedy checks whether the template can be instantiated to a regular expression that satisfies the required properties by replacing its holes with some sets of characters. For this purpose, Remedy generates two types of constraints, one for ensuring that the regular expression is consistent with the examples, and the other for ensuring that the regular expression satisfies the LTP. Then, Remedy solves the constraints using a Satisfiability Modulo Theories (SMT) solver. If the constraints are satisfiable, then we obtain the repaired regular expression from the solution. Otherwise, Remedy backtracks to continue exploring more templates.

The construction of the first constraint is based on the construction proposed in the recent work by Pan et al. [Pan et al. 2019]. We extend their construction with support for real-world features. For this template, Remedy generates the constraint $(\phi_p^1 \wedge \phi_p^2 \wedge \phi_p^3) \wedge \neg(\phi_n^1)$, where $\phi_p^i$ and $\phi_n^i$ denote constraints for ensuring that the regular expression is consistent with the $i$-th positive and negative example, respectively. For example, the constraint for the positive example ab=c is $(v_1^a \wedge v_1^b) \vee (v_1^a \wedge v_2^b) \vee (v_2^a \wedge v_2^b)$, where $v_i^d$ assigned to true means that □ᵢ can be replaced with a set of characters that accepts the character $d$. In the same way, $\phi_n^1$ is the constraint for the negative example abc, so that $\neg(\phi_n^1) = \mathit{true}$ asserts that a regular expression obtained from the solution does not accept abc. Additionally, Remedy generates the second constraint $\neg v_1^{=} \wedge \neg v_2^{=} \wedge \bigwedge_{a \in \Sigma} (\neg v_1^a \vee \neg v_2^a)$ for ensuring that the generated regular expression satisfies the LTP. This constraint says that the holes cannot be replaced by sets of characters that match the character = or that share a character with each other. We use the SMT solver to solve the conjunction of the two constraints. In this case, the constraint is satisfiable and Remedy replaces □₁ and □₂ with ∅ and [^=]∗, respectively. Here, [^=] is a set of characters that accepts any character except for =. Finally, Remedy returns the regular expression [^=]∗=·∗ as the repaired regular expression.

While the above example did not involve the extended features of real-world regular expressions, Remedy can take regular expressions containing the extended features as inputs and also output such expressions as the repair results. We briefly demonstrate how Remedy generates the SMT constraint for ensuring the LTP in the presence of the capturing group and backreference extensions. Let us denote a capturing group and a backreference as $(r)_i$ and $\backslash i$, respectively: $(r)_i$ captures a substring that matches $r$, and $\backslash i$ refers to the substring. Consider the template (□₁∗)₁ a \1∗ b. In this case, the constraint for ensuring the LTP generated by Remedy is $\neg v_1^a \wedge \neg v_1^b$. Informally, the constraint is constructed as such because □₁ is referred to from the backreference \1 and a regular expression engine can reach the character b from \1 without consuming any character.

r ::= [C]        Set of characters
    | ε          Empty string
    | r r        Concatenation
    | r | r      Union
    | r∗         Repetition
    | (r)_i      Capturing group
    | \i         Backreference
    | (?=r)      Positive lookahead
    | (?!r)      Negative lookahead
    | (?<=x)     Fixed-string positive lookbehind
    | (?<!x)     Fixed-string negative lookbehind

Fig. 1. The syntax of real-world regular expressions.
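Returning to the assignment search in the overview example, the following Python sketch enumerates character-set assignments for the two holes of the template □₁∗□₂∗=·∗ by brute force. It deliberately swaps the SMT encoding for exhaustive enumeration, which is feasible only because the alphabet assumed here, `abcd=`, is tiny. The helper names and the example strings (read from the overview as =, abcd====abcd and ab=c for the positives, and abc for the negative) are assumptions for illustration.

```python
import re
from itertools import combinations

SIGMA = "abcd="                       # small alphabet assumed for the sketch
POSITIVES = ["=", "abcd====abcd", "ab=c"]
NEGATIVES = ["abc"]


def subsets(chars):
    """Yield every subset of chars as a frozenset."""
    for k in range(len(chars) + 1):
        for combo in combinations(chars, k):
            yield frozenset(combo)


def char_class(chars):
    """Render a character set as a regex class; the empty set matches nothing."""
    if not chars:
        return r"[^\s\S]"
    return "[" + "".join(re.escape(c) for c in sorted(chars)) + "]"


def satisfies_ltp(c1, c2):
    """LTP constraint from the text: no '=' in either hole, no shared character."""
    return "=" not in c1 and "=" not in c2 and not (c1 & c2)


def consistent(c1, c2):
    """Instantiate hole1* hole2* = .* and check it against all examples."""
    regex = re.compile(f"{char_class(c1)}*{char_class(c2)}*=.*")
    return (all(regex.fullmatch(p) for p in POSITIVES)
            and not any(regex.fullmatch(n) for n in NEGATIVES))


def search():
    """Enumerate all assignments that satisfy both constraints."""
    return [(c1, c2)
            for c1 in subsets(SIGMA)
            for c2 in subsets(SIGMA)
            if satisfies_ltp(c1, c2) and consistent(c1, c2)]


if __name__ == "__main__":
    for c1, c2 in search():
        print(sorted(c1), sorted(c2))
```

Over this alphabet, the assignment of the empty set to the first hole and {a, b, c, d} (that is, [^=]) to the second hole is among the results, which corresponds to the repaired expression [^=]∗=·∗ reported in the text. The enumeration returns other consistent assignments as well. How Remedy chooses among candidates is not modeled here.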
In this section, we give the definition of real-world regular expressions. We also present the novelformal model of the backtracking matching algorithm for real-world regular expressions, and withit, we formally define the ReDoS vulnerability for the expressions.
Notations.
Throughout this paper, we use the following notations. We write Σ for a finite alphabet; a, b, c ∈ Σ for a character; w, x, y, z ∈ Σ∗ for a sequence of characters; ε for the empty string; r for a real-world regular expression; N for the set of natural numbers. For the string x = x[0]...x[n−1], its length is |x| = n. For 0 ≤ i ≤ j < n, the string x[i]...x[j] is called a substring of x. We write x[i..j] for the substring. In addition, we write x[i..j) for the substring x[i]...x[j−1]. For a finite set S, |S| is the size of S, i.e., the number of elements in S. We assume that x[i..j) = $, where $ ∉ Σ, if the index is out of the range of the string, i.e., i < 0 or |x| < j. Let f be a (partial) function. Then, f[α ↦ β] denotes the (partial) function that maps α to β and behaves as f for all other arguments. We use f(α) to denote the value that corresponds to α. We write f(α) = ⊥ if f is undefined at α. For sets A and B, we write A \ B to denote the set difference, i.e., A \ B = {a ∈ A | a ∉ B}. We define ite(true, A, B) = A and ite(false, A, B) = B.

The syntax of real-world regular expressions (simply regular expressions or expressions henceforth) is given by Figure 1. Here, C ⊆ Σ and i ∈ N. A set of characters [C] matches exactly one character, which must be in C. We sometimes write a for [{a}], and write · for [Σ]. The semantics of the pure regular expression constructs, that is, empty string ε, concatenation r₁r₂, union r₁|r₂ and repetition r∗, are standard. We note that many convenient notations used in practical regular expressions, such as options, one-or-more repetitions, and interval quantifiers, can be treated as syntactic sugar: r? = r|ε, r+ = rr∗, and r{i,j} = r₁...rᵢ rᵢ₊₁?...rⱼ? where rₖ = r for each k ∈ {1, ..., j}.

The remaining constructs, that is, capturing groups, backreferences, (positive and negative) lookaheads and lookbehinds, comprise the real-world extensions.
In what follows, we explain the semantics of the extended features informally in terms of the standard backtracking-based matching algorithm, which attempts to match the given regular expression with the given (sub)string and backtracks when the attempt fails. The formal definition is given later in the section.

A capturing group (r)_i attempts to match r and, if successful, stores the matched substring in the storage identified by the index i. Otherwise, the match fails and the algorithm backtracks. A backreference \i refers to the substring matched by the corresponding capturing group (r)_i, and attempts to match the same substring if the capture had succeeded. If the capture had not succeeded or the matching against the captured substring fails, then the algorithm backtracks. For example, let us consider the regular expression ([0-9])₁([A-Z])₂\1\2. Here, \1 and \2 refer to the substrings matched by [0-9] and [A-Z], respectively. The expression represents the set of strings {abab | a ∈ [0-9] ∧ b ∈ [A-Z]}. Capturing groups in the real world often do not have explicit indexes, but we write them here for readability. We assume without loss of generality that each capturing group always has a corresponding backreference and vice versa. A positive (resp. negative) lookahead (?=r) (resp. (?!r)) attempts to match r without any character consumption, and proceeds if the match succeeds (resp. fails) and backtracks otherwise. A fixed-string positive (resp. negative) lookbehind (?<=x) (resp. (?
(Empty String), (Concatenation), (Union) and (Repetition) are self-explanatory. Note that we avoid self-looping in (Repetition) by not repeating the match from the same position.

In the rule (Capturing group), we first get the matching result N from matching w against r at the current position p. Then, for each matching result (p_i, Γ_i) ∈ N (if any), we record the matched substring w[p..p_i) in the corresponding capturing group map Γ_i at the index j of the capturing group.

$$\frac{p < |w| \quad w[p] \in C}{([C], w, p, \Gamma) \leadsto \{(p+1, \Gamma)\}}\ \text{(Set of characters)}$$

$$\frac{p \geq |w| \ \vee\ w[p] \notin C}{([C], w, p, \Gamma) \leadsto \emptyset}\ \text{(Set of characters Failure)}$$

$$\frac{}{(\epsilon, w, p, \Gamma) \leadsto \{(p, \Gamma)\}}\ \text{(Empty String)}$$

$$\frac{(r_1, w, p, \Gamma) \leadsto \mathcal{N} \quad \forall (p_i, \Gamma_i) \in \mathcal{N},\ (r_2, w, p_i, \Gamma_i) \leadsto \mathcal{N}_i}{(r_1 r_2, w, p, \Gamma) \leadsto \bigcup_{0 \leq i < |\mathcal{N}|} \mathcal{N}_i}\ \text{(Concatenation)}$$

$$\frac{(r_1, w, p, \Gamma) \leadsto \mathcal{N} \quad (r_2, w, p, \Gamma) \leadsto \mathcal{N}'}{(r_1 | r_2, w, p, \Gamma) \leadsto \mathcal{N} \cup \mathcal{N}'}\ \text{(Union)}$$

$$\frac{(r, w, p, \Gamma) \leadsto \mathcal{N} \quad \forall (p_i, \Gamma_i) \in (\mathcal{N} \setminus \{(p, \Gamma)\}),\ (r^*, w, p_i, \Gamma_i) \leadsto \mathcal{N}_i}{(r^*, w, p, \Gamma) \leadsto \{(p, \Gamma)\} \cup \bigcup_{0 \leq i < |\mathcal{N} \setminus \{(p, \Gamma)\}|} \mathcal{N}_i}\ \text{(Repetition)}$$

$$\frac{(r, w, p, \Gamma) \leadsto \mathcal{N}}{((r)_j, w, p, \Gamma) \leadsto \{(p_i, \Gamma_i[j \mapsto w[p..p_i)]) \mid (p_i, \Gamma_i) \in \mathcal{N}\}}\ \text{(Capturing group)}$$

$$\frac{\Gamma(i) \neq \bot \quad (\Gamma(i), w, p, \Gamma) \leadsto \mathcal{N}}{(\backslash i, w, p, \Gamma) \leadsto \mathcal{N}}\ \text{(Backreference)}$$

$$\frac{\Gamma(i) = \bot}{(\backslash i, w, p, \Gamma) \leadsto \emptyset}\ \text{(Backreference Failure)}$$

$$\frac{(r, w, p, \Gamma) \leadsto \mathcal{N}}{((?{=}r), w, p, \Gamma) \leadsto \{(p, \Gamma') \mid (\_, \Gamma') \in \mathcal{N}\}}\ \text{(Positive lookahead)}$$

$$\frac{(r, w, p, \Gamma) \leadsto \mathcal{N} \quad \mathcal{N}' = \mathit{ite}(\mathcal{N} \neq \emptyset, \emptyset, \{(p, \Gamma)\})}{((?!r), w, p, \Gamma) \leadsto \mathcal{N}'}\ \text{(Negative lookahead)}$$

$$\frac{(x, w[p - |x|..p), 0, \Gamma) \leadsto \mathcal{N} \quad \mathcal{N}' = \mathit{ite}(\mathcal{N} \neq \emptyset, \{(p, \Gamma)\}, \emptyset)}{((?{<}{=}x), w, p, \Gamma) \leadsto \mathcal{N}'}\ \text{(Positive lookbehind)}$$

$$\frac{(x, w[p - |x|..p), 0, \Gamma) \leadsto \mathcal{N} \quad \mathcal{N}' = \mathit{ite}(\mathcal{N} \neq \emptyset, \emptyset, \{(p, \Gamma)\})}{((?{<}!x), w, p, \Gamma) \leadsto \mathcal{N}'}\ \text{(Negative lookbehind)}$$

The rule (Backreference) looks up the captured substring Γ(i) and matches it against w at the current position p; if no capture has been recorded, i.e., Γ(i) = ⊥, the match fails by the rule (Backreference Failure). In the rule (Positive lookahead), the expression r is matched against the given string w at the current position p to obtain the matching results N. Then, for every match result (if any) (p′, Γ′) ∈ N, we reset the position from p′ to p. This models the behavior of lookaheads, which do not consume the string. We note that actual backtracking matching algorithms often do not backtrack to re-do a successful lookahead when the match fails at a later position, whereas our formal model permits such a re-doing. The difference may manifest as a difference in the accepted strings when a lookahead contains a capturing group, because a capture may not be re-done in an actual algorithm. For example, with the expression (?=([0-9]∗)₁)[0-9a-z]∗\1, our model may accept a string by capturing a shorter digit prefix in the capture group, whereas an actual algorithm may reject the same string by capturing a longer prefix and not backtracking to re-do the capture when the matching fails at the backreference. However, the difference is not an issue as far as vulnerability is concerned, because our model still conservatively approximates the actual algorithm in such a case. That is, our model may only be slower than the actual algorithm due to more backtracking. The rules (Negative lookahead), (Positive lookbehind), and (Negative lookbehind) are similar to (Positive lookahead). However, note that the captures in these lookaround expressions are not used afterwards, that is, the original capturing group information Γ before the lookaround is retained. We have stipulated these rules in this way so as to model the behavior seen in actual algorithms, which also do not do capturing in these lookarounds.

Definition 3.1 (Language).
The language of an expression r, L(r), is defined as follows: L(r) = { w | (r, w, 0, ∅) ⇝ N ∧ (|w|, Γ) ∈ N }.

We now provide some examples of the matchings. For readability, we omit the capturing group information in examples that do not contain capturing groups.
Example 3.2. Consider the regular expression (a∗)∗. The matching of the regular expression on the input string ab is as follows:

$$\dfrac{\dfrac{\dfrac{0 < |ab| \quad a \in \{a\}}{(a, ab, 0) \leadsto \{1\}} \quad \dfrac{\dfrac{1 < |ab| \quad b \notin \{a\}}{(a, ab, 1) \leadsto \emptyset}}{(a^*, ab, 1) \leadsto \{1\}}}{(a^*, ab, 0) \leadsto \{0, 1\}} \quad \dfrac{\dfrac{1 < |ab| \quad b \notin \{a\}}{(a, ab, 1) \leadsto \emptyset}}{((a^*)^*, ab, 1) \leadsto \{1\}}}{((a^*)^*, ab, 0) \leadsto \{0, 1\}}$$

Thus, the expression rejects the input string since ((a∗)∗, ab, 0) ⇝ {0, 1} and |ab| = 2 ∉ {0, 1}.

Example 3.3. Consider the expression ((?=a)∗)∗. The matching of the expression on ab is:

$$\dfrac{\dfrac{\dfrac{\dfrac{0 < |ab| \quad a \in \{a\}}{(a, ab, 0) \leadsto \{1\}}}{((?{=}a), ab, 0) \leadsto \{0\}}}{((?{=}a)^*, ab, 0) \leadsto \{0\}}}{(((?{=}a)^*)^*, ab, 0) \leadsto \{0\}}$$

Thus, the expression rejects the input string since (((?=a)∗)∗, ab, 0) ⇝ {0} and |ab| = 2 ∉ {0}.

Example 3.4. Consider the regular expression (a∗)₁\1 and the input string aa. The matching is:

$$\dfrac{A \quad B_0 \quad B_1 \quad B_2}{((a^*)_1 \backslash 1, aa, 0, \emptyset) \leadsto \{(0, \Gamma_0), (2, \Gamma_1)\}}$$

where Γ₀ = {(1, ε)}, Γ₁ = {(1, a)}, Γ₂ = {(1, aa)} and the subderivation A is:

$$\dfrac{\dfrac{\dfrac{0 < |aa| \quad a \in \{a\}}{(a, aa, 0, \emptyset) \leadsto \{(1, \emptyset)\}} \quad \dfrac{\dfrac{1 < |aa| \quad a \in \{a\}}{(a, aa, 1, \emptyset) \leadsto \{(2, \emptyset)\}} \quad \dfrac{\dfrac{2 \geq |aa|}{(a, aa, 2, \emptyset) \leadsto \emptyset}}{(a^*, aa, 2, \emptyset) \leadsto \{(2, \emptyset)\}}}{(a^*, aa, 1, \emptyset) \leadsto \{(1, \emptyset), (2, \emptyset)\}}}{(a^*, aa, 0, \emptyset) \leadsto \{(0, \emptyset), (1, \emptyset), (2, \emptyset)\}}}{((a^*)_1, aa, 0, \emptyset) \leadsto \{(0, \Gamma_0), (1, \Gamma_1), (2, \Gamma_2)\}}$$

and the roots of the subderivations B₀, B₁ and B₂ are, respectively, (\1, aa, 0, Γ₀) ⇝ {(0, Γ₀)}, (\1, aa, 1, Γ₁) ⇝ {(2, Γ₁)} and (\1, aa, 2, Γ₂) ⇝ ∅, and the matchings are:

$$\dfrac{\epsilon = \Gamma_0(1) \quad \dfrac{\epsilon \in \{\epsilon\}}{(\epsilon, aa, 0, \Gamma_0) \leadsto \{(0, \Gamma_0)\}}}{(\backslash 1, aa, 0, \Gamma_0) \leadsto \{(0, \Gamma_0)\}} \qquad \dfrac{a = \Gamma_1(1) \quad \dfrac{1 < |aa| \quad a \in \{a\}}{(a, aa, 1, \Gamma_1) \leadsto \{(2, \Gamma_1)\}}}{(\backslash 1, aa, 1, \Gamma_1) \leadsto \{(2, \Gamma_1)\}} \qquad \dfrac{aa = \Gamma_2(1) \quad \dfrac{2 \geq |aa|}{(aa, aa, 2, \Gamma_2) \leadsto \emptyset}}{(\backslash 1, aa, 2, \Gamma_2) \leadsto \emptyset}$$

The expression accepts the input string since ((a∗)₁\1, aa, 0, ∅) ⇝ N where (|aa|, Γ₁) ∈ N.

For a derivation of (r, w, p, Γ) ⇝ N, we define its size to be the number of nodes in the derivation tree. Note that the size of a derivation is well defined because our rules are deterministic.

Definition 3.5 (Running time).
Given an expression r and a string w, we define the running time of the backtracking matching algorithm on r and w, Time(r, w), to be the size of the derivation of (r, w, 0, ∅) ⇝ N.
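The rules and the size-based notion of running time can be prototyped as a small interpreter. The Python sketch below is an illustration under stated assumptions rather than the paper's artifact. Expressions are encoded as nested tuples (a hypothetical encoding), the capture map Γ is kept as a sorted tuple of pairs, lookbehinds are omitted, and matching a captured string in a backreference is counted as a single derivation node.

```python
def bind(env, index, text):
    """Return env[index -> text], with env kept as a sorted tuple of pairs."""
    updated = dict(env)
    updated[index] = text
    return tuple(sorted(updated.items()))


def run(r, w, p, env=()):
    """Return (N, size): the result set and the number of derivation nodes."""
    kind = r[0]
    if kind == "set":                       # [C], encoded as ("set", chars)
        ok = p < len(w) and w[p] in r[1]
        return ({(p + 1, env)} if ok else set()), 1
    if kind == "eps":                       # empty string
        return {(p, env)}, 1
    if kind == "cat":                       # r1 r2
        first, size = run(r[1], w, p, env)
        out = set()
        for pi, gi in first:
            ni, si = run(r[2], w, pi, gi)
            out |= ni
            size += si
        return out, size + 1
    if kind == "alt":                       # r1 | r2
        n1, s1 = run(r[1], w, p, env)
        n2, s2 = run(r[2], w, p, env)
        return n1 | n2, s1 + s2 + 1
    if kind == "star":                      # r*, never repeated from (p, env)
        inner, size = run(r[1], w, p, env)
        out = {(p, env)}
        for pi, gi in inner - {(p, env)}:
            ni, si = run(r, w, pi, gi)
            out |= ni
            size += si
        return out, size + 1
    if kind == "cap":                       # (r)_j, encoded as ("cap", j, r)
        inner, size = run(r[2], w, p, env)
        return {(pi, bind(gi, r[1], w[p:pi])) for pi, gi in inner}, size + 1
    if kind == "ref":                       # backreference \i
        captured = dict(env).get(r[1])
        if captured is None:
            return set(), 1                 # nothing captured: failure
        ok = w.startswith(captured, p)
        return ({(p + len(captured), env)} if ok else set()), 2
    if kind == "la":                        # positive lookahead
        inner, size = run(r[1], w, p, env)
        return {(p, gi) for _, gi in inner}, size + 1
    if kind == "nla":                       # negative lookahead
        inner, size = run(r[1], w, p, env)
        return (set() if inner else {(p, env)}), size + 1
    raise ValueError(f"unknown construct: {kind}")


def accepts(r, w):
    """w is in L(r) iff some result ends at position |w|."""
    return any(p == len(w) for p, _ in run(r, w, 0)[0])


def derivation_size(r, w):
    """Size of the derivation of (r, w, 0, empty map), i.e. Time(r, w)."""
    return run(r, w, 0)[1]


A = ("set", "a")
NESTED_STAR = ("star", ("star", A))                   # (a*)*
LOOKAHEAD_STAR = ("star", ("star", ("la", A)))        # ((?=a)*)*
CAPTURE_REF = ("cat", ("cap", 1, ("star", A)), ("ref", 1))  # (a*)_1 \1
```

With this encoding, `accepts(CAPTURE_REF, "aa")` holds while `accepts(NESTED_STAR, "ab")` does not, in line with the worked examples. Under the counting convention assumed here, `derivation_size(NESTED_STAR, "a" * n + "b")` grows rapidly with n, whereas `derivation_size(LOOKAHEAD_STAR, ...)` stays constant on the same inputs.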
We remark that our formal model of backtracking matching algorithms may be less efficient than an actual algorithm because it computes all possible runs without any optimization. However, the definition is sufficient for our purpose since our repair algorithm synthesizes regular expressions that are invulnerable even with respect to the inefficient formal model. We are now ready to define the vulnerability of regular expressions.
Definition 3.6 (Vulnerable Regular Expressions).
We say that an expression r is vulnerable if Time(r, w) ∉ O(|w|).

Note that an expression r is vulnerable iff there exist infinitely many strings w_i (for i ∈ N) such that Time(r, w_i) grows super-linearly in |w_i|. Such strings are often called attack strings. For example, the expressions (a∗)∗ in Example 3.2 and (a∗)₁\1 in Example 3.4 are vulnerable because there exists a set of attack strings {aⁿb | n ∈ N} on which (a∗)∗ and (a∗)₁\1 respectively take Ω(n!) and Ω(n²) time. Indeed, running an actual matching algorithm such as Python's re on these expressions with these attack strings exhibits super-linear behavior. By contrast, the expression ((?=a)∗)∗ in Example 3.3 takes O(n) time on these strings and is in fact invulnerable. It is worth noting that (a∗)∗ is incorrectly classified as invulnerable in [Wüstholz et al. 2017], both according to their formal definition of vulnerability and by their Rexploiter vulnerability detection tool. Although the bug is easily fixable by incorporating ε transitions into their NFA-based formalism in a certain way, this shows the subtlety of formalizing vulnerability.

As remarked before, our repair algorithm adopts the example-based synthesis and repair approaches explored in recent works [Lee et al. 2016; Pan et al. 2019]. In these approaches, we are given a set of positive examples P ⊆ Σ∗ and negative examples N ⊆ Σ∗, and we look for r′ that correctly classifies the examples, that is, P ⊆ L(r′) and N ∩ L(r′) = ∅. In our case, we want r′ that is invulnerable and semantically equivalent to the given (possibly vulnerable) expression r (i.e., L(r) = L(r′)). To ensure the latter, albeit only in a best-effort way, we adopt the example-based synthesis approach with P and N generated from r by using an appropriate input generator (cf. Section 6.1 for the details of the input generator). We also bias r′ to be syntactically close to r using a notion of distance similar to the one in a previous work on example-based repair [Pan et al. 2019], so that r and r′ are more likely to be semantically close as well. Additionally, we enforce that r′ satisfies a novel notion called the linear-time property (LTP), which ensures that r′ is invulnerable.

We define the LTP in Section 4.1 and prove its correctness in the supplementary material. In Section 4.2, we define the notion of distance and define the LTP repair problem, which is the problem of synthesizing an expression that is syntactically close to the given expression, satisfies the LTP, and correctly classifies the given set of examples. We prove that the problem is NP-hard. Section 5 presents an algorithm for solving the LTP repair problem.
This section presents the LTP. We begin by introducing some preliminary notions.
Definition 4.1 (Bracketing of a regular expression [Koch and Scherzinger 2007]). The bracketing of an expression r, r[], is obtained by inductively mapping each subexpression s of r to [_i s ]_i, where i is a unique index. Here, [_i and ]_i are called brackets and are disjoint from the alphabet Σ of r.

Note that r[] is a regular expression over the alphabet Σ ∪ Ψ, where Ψ = {[_i, ]_i | i ∈ N}. We call Ψ the bracketing alphabet of r[]. For example, for the expression r = ((a)∗)∗b, its bracketing is r[] = [₁([₂([₃a]₃)∗]₂)∗[₄b]₄]₁, and its bracketing alphabet consists of the brackets [_i and ]_i that occur in it.

$$\mathit{eNFAtr}([C]) = (\{q_0, q_1\}, \{(q_0, a, q_1) \mid a \in C\}, q_0, q_1)$$

$$\mathit{eNFAtr}(r_1 r_2) = (Q_1 \cup Q_2,\ \delta_1 \cup \delta_2 \cup \{(q_n^1, \epsilon, q_0^2)\},\ q_0^1, q_n^2)$$
where $(Q_1, \delta_1, q_0^1, q_n^1) = \mathit{eNFAtr}(r_1)$ and $(Q_2, \delta_2, q_0^2, q_n^2) = \mathit{eNFAtr}(r_2)$

$$\mathit{eNFAtr}(r_1 | r_2) = (Q_1 \cup Q_2 \cup \{q_0, q_n\},\ \delta_1 \cup \delta_2 \cup \{(q_0, \epsilon, q_0^1), (q_0, \epsilon, q_0^2), (q_n^1, \epsilon, q_n), (q_n^2, \epsilon, q_n)\},\ q_0, q_n)$$
where $(Q_1, \delta_1, q_0^1, q_n^1) = \mathit{eNFAtr}(r_1)$ and $(Q_2, \delta_2, q_0^2, q_n^2) = \mathit{eNFAtr}(r_2)$

$$\mathit{eNFAtr}(r^*) = (Q \cup \{q_0', q_n'\},\ \delta \cup \{(q_0', \epsilon, q_0), (q_0', \epsilon, q_n'), (q_n, \epsilon, q_n')\},\ q_0', q_n')$$
where $(Q, \delta, q_0, q_n) = \mathit{eNFAtr}(r)$

$$\mathit{eNFAtr}((r)_i) = \mathit{eNFAtr}(r) \text{ and } \mathcal{I} = \mathcal{I}[i \mapsto q_0], \text{ where } \mathit{eNFAtr}(r) = (\_, \_, q_0, \_)$$

$$\mathit{eNFAtr}(\backslash i) = (\{q_0, q_1\}, \{(q_0, a, q_1) \mid a \in \mathit{First}(\mathcal{I}(i))^{\natural}\}, q_0, q_1)$$

Fig. 3. The extended NFA translation.
Definition 4.2 (Lookaround removal).
The expression $r$ with its lookarounds removed, $\mathit{rmla}(r)$, is an expression obtained by replacing each lookaround in $r$ with $\epsilon$.
A non-deterministic finite automaton (NFA) over an alphabet $\Sigma$ is a tuple $(Q, \delta, q_0, q_n)$ where $Q$ is a finite set of states, $\delta \subseteq Q \times (\Sigma \cup \{\epsilon\}) \times Q$ is the transition relation, $q_0$ is the initial state, and $q_n$ is the accepting state.
Definition 4.3 (Extended NFA translation).
Given a lookaround-free expression $r$ over $\Sigma$, its extended NFA translation, $\mathit{eNFAtr}(r^{[]})$, is an NFA over $\Sigma \cup \Psi$ defined by the rules shown in Fig. 3, where $\Psi$ is the bracketing alphabet of $r^{[]}$.
In the translation process shown in Fig. 3, we maintain a global map $\mathcal{I}$ from capturing group indexes to states. $\mathcal{I}$ is initially empty and is updated whenever a capturing group $(r)_i$ is encountered, so that $\mathcal{I}(i)$ is set to the initial state of the NFA constructed from $r$. $\mathit{First}(q)$ is defined as follows: $\rho a \in \mathit{First}(q)$ iff $\rho \in \Psi^*$, $a \in \Sigma$, and there is a $\rho a$-labeled path from the state $q$. We define $(\rho a)^{\natural} = a$ and $\mathit{First}(q)^{\natural} = \{a \mid \rho a \in \mathit{First}(q)\}$.
Fig. 4. The extended NFA translation of $(a^*)^*$.
Our extended NFA translation is similar to the standard Thompson's translation from pure regular expressions to NFA [Sipser 1997; Thompson 1968], but extended to real-world regular expressions. However, unlike Thompson's translation, it does not preserve the semantics (necessarily so, because real-world regular expressions are not regular even without lookarounds). Instead, we use the translation only for the purpose of defining the LTP. The bracketing is also added for that purpose.
For an NFA and a pair of states $q$ and $q'$ of the NFA, let us write $\mathit{paths}(q, q')$ for the set of strings over the alphabet of the NFA that take the NFA from $q$ to $q'$.
Definition 4.4 ($\mathit{Bpaths}$). Let $r$ be a regular expression over $\Sigma$, $[_i \in \Psi$ where $\Psi$ is the bracketing alphabet of $r^{[]}$, and $a \in \Sigma$. Let $(\_, \delta, \_, \_) = \mathit{eNFAtr}(r^{[]})$. We define $\mathit{Bpaths}(r, [_i, a) \subseteq \Psi^*$ as follows: $\rho \in \mathit{Bpaths}(r, [_i, a)$ iff $\rho \in \mathit{paths}(q_j, q_l) \cap \Psi^*$ for some $(q_j, [_i, \_), (q_l, a, \_) \in \delta$.
Note that $q_j$ is uniquely determined (by $[_i$), while $q_l$ is not. Roughly, $\mathit{Bpaths}(r, [_i, a)$ is the set of sequences of brackets that appear on a path from the unique edge labeled $[_i$ to an edge labeled $a$ in the extended NFA translation of $r$.
Example 4.5. Fig. 4 shows the extended NFA translation of $(a^*)^*$, where unlabeled edges denote $\epsilon$ transitions. Note that $\mathit{Bpaths}((a^*)^*, [_1, a) = \{[_1\, ([_2\, ]_2)^n\, [_2\, [_3 \mid n \in \mathbb{N}\}$.
Now we define the LTP.
Definition 4.6 (Linear time property).
Let $r$ be an expression over an alphabet $\Sigma$. We say that $r$ satisfies the linear time property (LTP) if (1) $|\mathit{Bpaths}(\mathit{rmla}(r), [_i, a)| \le 1$ for all $a \in \Sigma$ and $[_i \in \Psi$, where $\Psi$ is the bracketing alphabet of $\mathit{rmla}(r)^{[]}$, and (2) no lookaround in $r$ contains a repetition.
Roughly, condition (1) ensures the determinism of the backtracking matching algorithm. That is, the algorithm can determine which subexpression to match at a union, and when to exit the loop at a repetition, by looking at the next character in the input string. The determinism rules out the need for backtracking and therefore ensures that matching finishes in linear time. Note that, for pure regular expressions, the invulnerable expression obtained by the DFA approach mentioned in Section 1 satisfies the LTP. The LTP does not impose the determinism condition in lookarounds, because the condition would prohibit some important use patterns of them. For instance, it would preclude any meaningful use of a positive lookahead, because if the lookahead succeeds then the subexpression immediately following the lookahead must match the same string. Therefore, for lookarounds, we impose the repetition-freedom condition stipulated by (2). The condition trivially ensures that the matching within a lookaround finishes in constant time. Therefore, (1) and (2) combined guarantee that the overall matching finishes in linear time.
Example 4.7.
Recall the expressions $r_1 = (a^*)^*$, $r_2 = ((?{=}a)^*)^*$ and $r_3 = (a^*)_1 \backslash 1$ from Examples 3.2, 3.3 and 3.4. The expression $r_1$ does not satisfy the LTP because, as shown in Example 4.5, $|\mathit{Bpaths}(r_1, [_1, a)| = \aleph_0 > 1$. Similarly, $r_3$ does not satisfy the LTP because $\mathit{Bpaths}(r_3, [_1, a) = \{[_1 [_2 [_3 [_4,\ [_1 [_2 [_3 ]_3 ]_2 [_5\}$, where $r_3^{[]} = [_1\, [_2 ( [_3 ( [_4\, a\, ]_4 )^* ]_3 )_1 ]_2\ [_5\, \backslash 1\, ]_5\, ]_1$, and so $|\mathit{Bpaths}(r_3, [_1, a)| = 2 > 1$.
By contrast, $r_2$ (trivially) satisfies the LTP because $\mathit{rmla}(r_2) = (\epsilon^*)^*$, which contains no characters. Also, we can show that $r_4 = (a^*)(b^*)$ satisfies the LTP because $|\mathit{Bpaths}(r_4, [_i, a)| = |\mathit{Bpaths}(r_4, [_i, b)| = 1$ for $i \in \{1, 2, 3\}$, and $|\mathit{Bpaths}(r_4, [_i, a)| = 0$ and $|\mathit{Bpaths}(r_4, [_i, b)| = 1$ for $i \in \{4, 5\}$, where $r_4^{[]} = [_1\, [_2 ( [_3\, a\, ]_3 )^* ]_2\ [_4 ( [_5\, b\, ]_5 )^* ]_4\, ]_1$.
Example 4.8.
Next, recall the vulnerable regular expression $r_5 = ([\Sigma]^*){=}([\Sigma]^*)$ from Section 2. The bracketing of the expression is $r_5^{[]} = [_1\, [_2 ( [_3 [\Sigma] ]_3 )^* ]_2\ [_4 {=} ]_4\ [_5 ( [_6 [\Sigma] ]_6 )^* ]_5\, ]_1$. The expression $r_5$ does not satisfy the LTP because $|\mathit{Bpaths}(r_5, [_2, {=})| = |\{[_2 [_3,\ [_2 ]_2 [_4\}| = 2$, which violates the first condition of the LTP. Next, consider the regular expression $r_6 = ((?{=}[\Sigma]^*)[\Sigma])^*$. The expression $r_6$ also does not satisfy the LTP, because the positive lookahead $(?{=}[\Sigma]^*)$ contains the repetition $[\Sigma]^*$, violating the second condition of the LTP. The expression $r_6$ is also vulnerable. For instance, $\{a^n \mid n \in \mathbb{N}\}$ is a possible set of attack strings for $r_6$ for any $a \in \Sigma$.
Finally, we show the following theorem.
Theorem 4.9.
Matching against a regular expression that satisfies the LTP runs in linear time in the worst case, even if the regular expression engine employs the backtracking algorithm shown in Fig. 2.
The complete proof appears in the supplementary material. We remark that the LTP is a sufficient but not a necessary condition for invulnerability. For example, $a \mid aa$ is not vulnerable but does not satisfy the LTP.
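To make condition (1) of the LTP more concrete, the following sketch checks it for lookaround-free pure regular expressions. It is only an illustration under stated assumptions: the tuple-based AST, the names `translate`, `bpaths` and `satisfies_ltp`, and the particular Thompson-style construction (a fresh start and accept state per repetition, with a loop-back edge to the fresh start) are choices made here and may differ in detail from Fig. 3. Capturing groups and backreferences are not handled.

```python
from itertools import count


def translate(node, fresh, idx, edges):
    """Bracket `node` and translate it into NFA edges; return (start, accept).

    node is ('chr', c), ('cat', l, r), ('alt', l, r) or ('star', e).
    Edge labels: None (epsilon), ('chr', c), ('open', i) or ('close', i).
    """
    i = next(idx)
    pre, post = next(fresh), next(fresh)
    kind = node[0]
    if kind == 'chr':
        s, t = next(fresh), next(fresh)
        edges.append((s, ('chr', node[1]), t))
    elif kind == 'cat':
        s, m1 = translate(node[1], fresh, idx, edges)
        m2, t = translate(node[2], fresh, idx, edges)
        edges.append((m1, None, m2))
    elif kind == 'alt':
        s, t = next(fresh), next(fresh)
        for child in node[1:]:
            cs, ct = translate(child, fresh, idx, edges)
            edges += [(s, None, cs), (ct, None, t)]
    elif kind == 'star':
        s, t = next(fresh), next(fresh)
        cs, ct = translate(node[1], fresh, idx, edges)
        edges += [(s, None, cs), (s, None, t), (ct, None, s)]
    else:
        raise ValueError(kind)
    edges += [(pre, ('open', i), s), (t, ('close', i), post)]
    return pre, post


def bpaths(edges, i, a):
    """Return the bracket paths from '[i' to an a-edge, or None if infinite."""
    out, back = {}, {}
    for s, lab, t in edges:
        out.setdefault(s, []).append((lab, t))
        if lab is None or lab[0] != 'chr':
            back.setdefault(t, []).append(s)
    start = next(s for s, lab, _ in edges if lab == ('open', i))
    targets = {s for s, lab, _ in edges if lab == ('chr', a)}
    # States that can reach an a-edge without reading a character.
    useful, todo = set(targets), list(targets)
    while todo:
        for p in back.get(todo.pop(), []):
            if p not in useful:
                useful.add(p)
                todo.append(p)
    found = set()

    def dfs(q, path, depth):
        # depth maps each state on the current route to the path length on entry
        if q in targets:
            found.add(path)
        for lab, t in out.get(q, []):
            if (lab is not None and lab[0] == 'chr') or t not in useful:
                continue
            step = path if lab is None else path + (lab,)
            if t in depth:
                if len(step) > depth[t]:
                    return True  # a cycle carrying brackets: infinitely many paths
                continue         # epsilon-only cycle adds no new string
            depth[t] = len(step)
            infinite = dfs(t, step, depth)
            del depth[t]
            if infinite:
                return True
        return False

    return None if dfs(start, (), {start: 0}) else found


def satisfies_ltp(ast, alphabet):
    """Check |Bpaths| <= 1 for every bracket index and every character."""
    edges, idx = [], count(1)
    translate(ast, count(), idx, edges)
    last_index = next(idx) - 1
    for i in range(1, last_index + 1):
        for a in alphabet:
            paths = bpaths(edges, i, a)
            if paths is None or len(paths) > 1:
                return False
    return True
```

Under these assumptions, the check rejects $(a^*)^*$ and $a \mid aa$ and accepts the concatenation of $a^*$ and $b^*$, in line with the examples and the remark above.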
In this section, we define the LTP repair problem. First, we define the notion of distance between regular expressions, following a recent work on example-based regular expression repair [Pan et al. 2019]. Below, we identify a regular expression with its AST representation. For an AST $r$, we define its size, $|r|$, to be the number of nodes of $r$.
Definition 4.10 (Distance between regular expressions [Pan et al. 2019]). Given an AST $r$ and non-overlapping subtrees $r_1, \ldots, r_n$ of $r$, an edit $r[r'_1/r_1, \cdots, r'_n/r_n]$ replaces each $r_i$ with $r'_i$. The cost of an edit is the sum $\sum_{i \in \{1,\ldots,n\}} (|r_i| + |r'_i|)$. The distance between regular expressions $r_1$ and $r_2$, $\mathcal{D}(r_1, r_2)$, is the minimum cost of an edit that transforms $r_1$ to $r_2$.
For example, for expressions $r_1 = a \mid b \mid c$ and $r_2 = d \mid c$, $\mathcal{D}(r_1, r_2) = 4$, which is realized by the edit that replaces the subtree $a \mid b$ in $r_1$ (three nodes) by $d$ (one node). We now define the repair problem.
Definition 4.11 (LTP Repair Problem).
Given an expression $r$, a finite set of positive examples $P \subseteq \Sigma^*$, and a finite set of negative examples $N \subseteq \Sigma^*$, where $P \subseteq L(r)$ and $L(r) \cap N = \emptyset$, the linear-time property repair problem (the LTP repair problem) is the problem of synthesizing a regular expression $r'$ such that (1) $r'$ satisfies the LTP, (2) $P \subseteq L(r')$, (3) $N \cap L(r') = \emptyset$, and (4) $\mathcal{D}(r, r') \le \mathcal{D}(r, r'')$ for all regular expressions $r''$ satisfying (1)-(3).
Condition (1) guarantees that the repaired expression $r'$ is invulnerable. Conditions (2) and (3) stipulate that $r'$ correctly classifies the examples. Condition (4) says that $r'$ is syntactically close to the original expression $r$. We note that the LTP repair problem is easy without the closeness condition (4): it can be solved by constructing a DFA that accepts exactly $P$ (or $\Sigma^* \setminus N$), which can be done in time linear in $\sum_{w \in P} |w|$ (or $\sum_{w \in N} |w|$), and then translating the DFA to a (pure) regular expression by the standard state removal method [Sipser 1997]. However, such an expression is unlikely to be semantically equivalent to $r$, that is, it suffers from overfitting. Condition (4) is an important ingredient of our approach that biases the solution toward a semantically correct one.
We show that the LTP repair problem is NP-hard by a reduction from ExactCover, which is known to be NP-complete [Karp 1972]. More formally, we consider the decision problem version of the LTP repair problem, in which we are asked to find a repair $r'$ of $r$ satisfying conditions (1)-(3) and $\mathcal{D}(r, r') \le k$ for some given $k \in \mathbb{N}$. Note that the decision problem is no harder than the original repair problem, because a solution to the repair problem can be used to solve the decision problem.
Definition 4.12 (Exact Cover).
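Given a finite set $\mathcal{U}$ and a set of sets $\mathcal{S} \subset \mathcal{P}(\mathcal{U})$, ExactCover is the problem of finding $\mathcal{S}' \subseteq \mathcal{S}$ such that for every $i \in \mathcal{U}$, there is a unique $S \in \mathcal{S}'$ such that $i \in S$.
Theorem 4.13.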
The LTP repair problem is NP-hard.
Proof.
We give a reduction from ExactCover to the repair problem. Let $\mathcal{S} = \{S_1, S_2, \ldots, S_k\}$. We create the following (decision version of the) LTP repair problem:
• The alphabet $\Sigma = \mathcal{U}$;
• The set of positive examples $P = \mathcal{U}$;
• The set of negative examples $N = \emptyset$;
• The distance bound is $k$; and
• The pre-repair expression $r = r_1 r_2$, where $r_1$ and $r_2$ are as defined below:
\[
\begin{aligned}
r_1 &= \epsilon\, (?{=}[S_1])^k (\epsilon)_1 [S_1] (\epsilon)_2 \ \mid\ \epsilon\, (?{=}[S_2])^k (\epsilon)_3 [S_2] (\epsilon)_4 \ \mid\ \cdots\ \mid\ \epsilon\, (?{=}[S_k])^k (\epsilon)_{2k-1} [S_k] (\epsilon)_{2k}\\
r_2 &= ((?!\backslash 1) \mid (?{=}\backslash 1 \backslash 2))^k\ ((?!\backslash 3) \mid (?{=}\backslash 3 \backslash 4))^k\ \cdots\ ((?!\backslash (2k-1)) \mid (?{=}\backslash (2k-1) \backslash (2k)))^k
\end{aligned}
\]
Here, $r^k$ is the expression obtained by concatenating $r$ $k$ times.
It is easy to see that this is a polynomial time reduction, since the construction of $r$ can be done in time cubic in the size of the input ExactCover instance. Also, note that the above is a valid LTP repair problem instance because $P = \mathcal{U} \subseteq L(r)$ and $L(r) \cap N = \emptyset$. We show that the reduction is correct, that is, the input ExactCover instance has a solution iff there exists $r'$ satisfying conditions (1)-(3) of Definition 4.11 and $\mathcal{D}(r, r') \le k$.
First, we show the only-if direction. Let $\mathcal{S}' \subseteq \mathcal{S}$ be a solution to the ExactCover instance. The repaired expression is $r' = r'_1 r'_2$, where $r'_2 = r_2$ and $r'_1$ is $r_1$ but with the $i$-th head $\epsilon$ in the union replaced by $[\emptyset]$ iff $S_i \notin \mathcal{S}'$. Note that $\mathcal{D}(r, r') = |\mathcal{S} \setminus \mathcal{S}'| \le k$. Also, $r'$ satisfies the LTP because for every $a \in \mathcal{U}$, there exists only one $S_i \in \mathcal{S}'$ such that $a \in S_i$, i.e., on any input string starting with $a$, we deterministically move to the $i$-th choice in the union (and there are no branches after that point). Also, $r'$ correctly classifies the examples. To see this, consider an arbitrary $a \in P = \mathcal{U}$. Then, $a$ is included in some $S_i \in \mathcal{S}'$. Therefore, the matching passes the $r'_1$ part with successful captures at indexes $2i-1$ and $2i$, and passes the $r'_2$ part because the negative lookahead $(?!\backslash (2j-1))$ succeeds for all $j \ne i$ and the positive lookahead $(?{=}\backslash (2i-1) \backslash (2i))$ succeeds. Thus, $r'$ is a correct repair.
We show the if direction. First, note that any valid repair of $r$ must preserve the $k$ union choices of $r_1$, because deleting any union choice would already exceed the cost $k$. From this, it is not hard to see that the only possible change is to change the head $\epsilon$ in the union choices in $r_1$. For instance, it is useless to change $[S_i]$ to some $r''$ where $L(r'')$ contains elements not in $S_i$, because of the $k$ many $(?{=}[S_i])$ preceding it. Note that changing $(?{=}[S_i])^k$ would exceed the cost. Nor can $[S_i]$ be changed to some $r''$ where $L(r'')$ does not contain an element of $S_i$, because of the capturing groups $(\epsilon)_{2i-1}$ and $(\epsilon)_{2i}$ before and after $[S_i]$ and the check done in $r_2$. Note that changing any of the checks in $r_2$ would again exceed the cost. This also shows that the capturing groups $(\epsilon)_{2i-1}$ and $(\epsilon)_{2i}$ cannot be changed. Therefore, the only meaningful change that can be done is to change some of the head $\epsilon$ in $r_1$ to some $r''$. Note that for any $r''$ chosen here, by the LTP, $r'$ will not accept $\{a \mid aw \in L(r'')\}$, as any input $a \in \{a \mid aw \in L(r'')\}$ would direct the matching algorithm deterministically to this choice, but the match would fail when it proceeds to $[S_i]$. Therefore, the only change that can be done is to change it to some $r''$ such that $L(r'') = \emptyset$. Then, from a successful repair $r'$, we obtain the solution $\mathcal{S}'$ to the ExactCover instance where $S_i \notin \mathcal{S}'$ iff the $i$-th head $\epsilon$ in $r_1$ is changed to some $r''$ such that $L(r'') = \emptyset$. □

Algorithm 1: Repair algorithm
Input: a regular expression $r$, a set of positive examples $P$, a set of negative examples $N$
Output: a regular expression that satisfies the LTP and is consistent with $P$ and $N$
 1: function Repair($r$, $P$, $N$)
 2:   $\mathcal{Q} \leftarrow \{r\}$
 3:   while $\mathcal{Q}$ is not empty do
 4:     $t \leftarrow \mathcal{Q}$.pop()
 5:     if $P \subseteq L(t_\top)$ and $N \cap L(t_\bot) = \emptyset$ then
 6:       $\Phi \leftarrow$ getInvulnerableConstraint($t$, $P$, $N$)
 7:       if $\Phi$ is satisfiable then
 8:         return solution($t$, $\Phi$)
 9:       $\mathcal{Q}$.push(expandHoles($t$))
10:     $\mathcal{Q}$.push(addHoles($t$))
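As a concrete illustration of the ExactCover problem used in the proof of Theorem 4.13, the following is a brute-force sketch that searches for a sub-collection covering every element exactly once. The function name `exact_cover` and the exhaustive enumeration are assumptions made for illustration only; the search is exponential in the number of sets, which is consistent with the problem being NP-complete, and it plays no role in the reduction itself.

```python
from itertools import combinations


def exact_cover(universe, subsets):
    """Return a list of subsets covering each element exactly once, or None."""
    universe = set(universe)
    subsets = [frozenset(s) for s in subsets]
    for size in range(len(subsets) + 1):
        for chosen in combinations(subsets, size):
            covered = [x for s in chosen for x in s]
            # Exactly once: no element is repeated and none is missing.
            if len(covered) == len(universe) and set(covered) == universe:
                return list(chosen)
    return None
```

For example, with the universe {1, 2, 3, 4} and the sets {1, 2}, {2, 3}, {3, 4}, {4}, the sketch selects {1, 2} and {3, 4}.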
In this section, we describe the architecture of our repair algorithm and each of its components in detail. Recall from Sections 1 and 2 that our algorithm adopts the template-based search with search space pruning of [Lee et al. 2016; Pan et al. 2019] and the SMT-based constraint solving proposed in [Pan et al. 2019] to find a solution within a given candidate template. Our algorithm extends them to support the extended features of real-world regular expressions and to guarantee invulnerability by asserting the LTP. We begin by giving an overview of our repair algorithm in Section 5.1. Then, we describe the details of the search space pruning technique in Section 5.2, and those of the SMT constraint generation in Section 5.3.
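The combination of template search, pruning, and constraint solving outlined above can be pictured as a best-first loop over templates. The sketch below is only a schematic reading of that architecture: `cost`, `feasible`, `solve`, `expand_holes` and `add_holes` are hypothetical hooks supplied by the caller, and the `seen` set that avoids revisiting templates is an addition of this sketch rather than something stated in the text.

```python
import heapq
from itertools import count


def repair(r, P, N, cost, feasible, solve, expand_holes, add_holes):
    """Best-first search over templates; every callable is a hypothetical hook."""
    tie = count()  # tie-breaker so the heap never compares templates directly
    queue = [(cost(r), next(tie), r)]
    seen = {r}
    while queue:
        _, _, t = heapq.heappop(queue)
        successors = list(add_holes(t))
        if feasible(t, P, N):            # pruning by over/under-approximation
            result = solve(t, P, N)      # e.g. an SMT query; None if unsatisfiable
            if result is not None:
                return result
            successors.extend(expand_holes(t))
        for s in successors:
            if s not in seen:
                seen.add(s)
                heapq.heappush(queue, (cost(s), next(tie), s))
    return None
```

Ordering the queue by a distance-like `cost` is what would bias such a search toward templates that are syntactically close to the original expression.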
Algorithm 1 shows the high-level structure of our repair algorithm. The algorithm takes a possibly vulnerable regular expression $r$, a set of positive examples $P$, and a set of negative examples $N$ as input (line 1). Its output is a regular expression that satisfies the LTP and is consistent with $P$ and $N$. At a high level, our algorithm consists of the following four key components.
Generate the initial template. Our algorithm uses a priority queue $\mathcal{Q}$, which maintains all regular expression templates during the synthesis. A regular expression template $t$ is a regular expression that may contain a hole $\Box$ denoting a placeholder that is to be replaced by a concrete regular expression. It is formally defined by the following grammar:
\[
t ::= [C] \ \mid\ t\,t \ \mid\ t\,|\,t \ \mid\ t^* \ \mid\ (t)_i \ \mid\ \backslash i \ \mid\ (?{:}\,t) \ \mid\ (?{=}\,t) \ \mid\ (?!\,t) \ \mid\ (?{<}{=}\,t_x) \ \mid\ (?{<}!\,t_x) \ \mid\ \Box
\]
Our algorithm next retrieves and removes a template from the head of $\mathcal{Q}$ (line 4), and applies a feasibility check to the template (line 5). The feasibility check was introduced by [Lee et al. 2016], and it has been shown to significantly reduce the search space, as explored by recent works [Chen et al. 2020; Lee et al. 2016; Pan et al. 2019]. The procedure of the feasibility check is as follows: given a template $t$, a set of positive examples $P$, and a set of negative examples $N$, we generate an over-approximation $t_\top$ and an under-approximation $t_\bot$. For correctness, we require $L(r') \subseteq L(t_\top)$ and $L(t_\bot) \subseteq L(r')$ for any expression $r'$ obtainable by filling the holes of $t$. If $P \not\subseteq L(t_\top)$ or $N \cap L(t_\bot) \ne \emptyset$, then there is no way to get an expression consistent with $P$ and $N$ from the template, and thus we safely discard the template from the search.
The approximations are obtained by replacing each hole by either $[\Sigma]^*$, that is, all strings, or $[\emptyset]$, that is, no strings. However, unlike the previous works that simply replaced all holes by $[\Sigma]^*$ (respectively, $[\emptyset]$) to create the over-approximation (respectively, under-approximation), we need to consider the polarities of the contexts in which the holes appear, due to the presence of negative lookarounds. We present the details of this phase in Section 5.2.
Searching assignments by constraint solving.
After passing the feasibility check, our algorithm decides whether the template can be instantiated into a regular expression that is consistent with the examples and satisfies the LTP by filling each hole with some set of characters (i.e., some $[C]$). We do this by encoding the search problem as an SMT formula and invoking an SMT solver to find a satisfying assignment to the formula (lines 6-7). Our encoding builds on the approach proposed in a recent work [Pan et al. 2019], which looks for an instantiation consistent with the given positive examples $P$ and negative examples $N$. Next, we briefly review their approach.
For each positive or negative example, they first enumerate the possible runs that accept the example when each hole in the template is treated as a set of characters that accepts any character (i.e., as $[\Sigma]$). Then, for each run $\eta$, they generate a formula $\phi_\eta$ that asserts which character must be in the character set at each hole for the instantiation of the template to accept the run $\eta$. Then, the formula $\phi_w = \bigvee_{\eta \in \mathit{Runs}(w)} \phi_\eta$ denotes the constraint for the example $w$ to be accepted by the template instantiation, where $\mathit{Runs}(w)$ is the set of runs for the example $w$. Finally, they return the conjunction $\bigwedge_{w \in P} \phi_w \wedge \bigwedge_{w \in N} \neg \phi_w$ as the generated formula. For example, consider a template $(\Box_1 \mid \Box_2 \Box_3)(\Box_4 \mid \Box_5 \Box_6)$, a set of positive examples $P = \{abc\}$, and a set of negative examples $N = \{abd\}$. The formula for $abc \in P$ is $(v_1^a \wedge v_5^b \wedge v_6^c) \vee (v_2^a \wedge v_3^b \wedge v_4^c)$. The indexes on the variables in the formula correspond to those of the holes in the template. In addition, the character on a variable means that the corresponding hole should be a set of characters that contains the character. For example, $v_1^a$ means that $\Box_1$ should be a set of characters $[C]$ where $a \in C$. In the same way, the formula for $abd \in N$ is $(v_1^a \wedge v_5^b \wedge v_6^d) \vee (v_2^a \wedge v_3^b \wedge v_4^d)$. As a result, the formula for the template is $((v_1^a \wedge v_5^b \wedge v_6^c) \vee (v_2^a \wedge v_3^b \wedge v_4^c)) \wedge \neg((v_1^a \wedge v_5^b \wedge v_6^d) \vee (v_2^a \wedge v_3^b \wedge v_4^d))$. Finally, the technique finds a satisfying assignment for the formula by using an SMT solver. In the above example, we can get an assignment that sets $v_4^d$ and $v_6^d$ to false and the others to true. The obtained instantiation of the template is $([a] \mid [a][b])([c] \mid [b][c])$, which indeed correctly classifies $P$ and $N$.
However, their work only considered pure regular expressions and was also only concerned with consistency with respect to the given examples. Thus, we make two non-trivial extensions to their technique in order to adopt it for our purpose: (1) support for the extended features of real-world regular expressions, and (2) additional constraints to enforce the LTP. The details of this phase are presented in Section 5.3.
Expanding and adding holes to a template.
The failure of the SMT solver to find a solution implies that there exists no instantiation of the template, obtainable by filling the holes with sets of characters, that is consistent with the examples and satisfies the LTP. In such a case, our algorithm expands the holes in the template to generate unexplored templates and adds them to the priority queue $\mathcal{Q}$ (line 9). More precisely, the template expansion replaces the holes in the template with other templates following the syntax of templates. For example, the template $(\Box)_1 \backslash 1$ is expanded to $(\Box\Box)_1 \backslash 1$, $(\Box \mid \Box)_1 \backslash 1$, $(\Box^*)_1 \backslash 1$, and so on. Here, to ensure the LTP, we do not replace the holes in lookarounds with templates containing repetitions.
Finally, if the current template fails the feasibility check and there are no more templates in the queue, we generate new templates by adding holes to the current template and add the generated templates to the queue, because it would be fruitless to expand the current template any further (line 10). The addition of a new hole is done by replacing a set of characters with a hole, or by replacing an expression with a hole when every immediate subexpression of the expression is a hole.
We show the method for generating the over-approximation and the under-approximation of the given template. The basic idea here is similar to the ones in the previous works [Chen et al. 2020; Lee et al. 2016; Pan et al. 2019]. That is, we replace each hole in the given template with the regular expression that accepts any string, $[\Sigma]^*$, or the one that accepts no strings (i.e., failure), $[\emptyset]$. To cope with the presence of negative lookarounds, we compute both over- and under-approximations of each subexpression and flip their polarities at negative lookarounds. For instance, to over-approximate (respectively, under-approximate) a negative lookahead $(?!\,t)$, we compute the under-approximation (respectively, over-approximation) of $t$, say $t'$, and return $(?!\,t')$. For example, for the template $(?!\,\Box(?!\,\Box))$, its over- and under-approximations are $(?!\,[\emptyset](?!\,[\Sigma]^*))$ and $(?!\,[\Sigma]^*(?!\,[\emptyset]))$, respectively.
Fig. 5 shows the rules that define the approximations. Here, $t \rightsquigarrow (t_\top, t_\bot)$ means that the over- and under-approximations of $t$ are $t_\top$ and $t_\bot$, respectively.
\[
\begin{array}{ll}
\dfrac{}{[C] \rightsquigarrow ([C], [C])} & (\text{Set of characters})\\[2ex]
\dfrac{t_1 \rightsquigarrow (t_{1\top}, t_{1\bot}) \quad t_2 \rightsquigarrow (t_{2\top}, t_{2\bot})}{t_1 t_2 \rightsquigarrow (t_{1\top} t_{2\top},\ t_{1\bot} t_{2\bot})} & (\text{Concatenation})\\[2ex]
\dfrac{t_1 \rightsquigarrow (t_{1\top}, t_{1\bot}) \quad t_2 \rightsquigarrow (t_{2\top}, t_{2\bot})}{t_1 \mid t_2 \rightsquigarrow (t_{1\top} \mid t_{2\top},\ t_{1\bot} \mid t_{2\bot})} & (\text{Union})\\[2ex]
\dfrac{t \rightsquigarrow (t_\top, t_\bot)}{t^* \rightsquigarrow (t_\top^*,\ t_\bot^*)} & (\text{Repetition})\\[2ex]
\dfrac{t \rightsquigarrow (t_\top, t_\bot)}{(t)_i \rightsquigarrow ((t_\top)_i,\ (t_\bot)_i)} & (\text{Capturing group})\\[2ex]
\dfrac{}{\backslash i \rightsquigarrow (\backslash i,\ \backslash i)} & (\text{Backreference})\\[2ex]
\dfrac{t \rightsquigarrow (t_\top, t_\bot)}{(?{=}\,t) \rightsquigarrow ((?{=}\,t_\top),\ (?{=}\,t_\bot))} & (\text{Positive lookahead})\\[2ex]
\dfrac{t \rightsquigarrow (t_\top, t_\bot)}{(?!\,t) \rightsquigarrow ((?!\,t_\bot),\ (?!\,t_\top))} & (\text{Negative lookahead})\\[2ex]
\dfrac{t_x \rightsquigarrow (t_\top, t_\bot)}{(?{<}{=}\,t_x) \rightsquigarrow ((?{<}{=}\,t_\top),\ (?{<}{=}\,t_\bot))} & (\text{Positive lookbehind})\\[2ex]
\dfrac{t_x \rightsquigarrow (t_\top, t_\bot)}{(?{<}!\,t_x) \rightsquigarrow ((?{<}!\,t_\bot),\ (?{<}!\,t_\top))} & (\text{Negative lookbehind})\\[2ex]
\dfrac{}{\Box \rightsquigarrow ([\Sigma]^*,\ [\emptyset])} & (\text{Hole})
\end{array}
\]
Fig. 5. The rules defining the over- and under-approximations of a template.
As stipulated in the rule (Hole), we replace each hole with $[\Sigma]^*$ in the over-approximation and with $[\emptyset]$ in the under-approximation. As stipulated in the rules (Negative lookahead) and (Negative lookbehind), the over- and under-approximations are flipped when entering a negative lookaround. The rest of the rules simply apply the rewriting to the subexpressions inductively. We state the correctness of the approximations.
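The polarity flipping described above can be sketched as a single recursive function that returns both approximations at once. The tuple-based template encoding, the names `approx`, `ANY` and `NONE`, and the restriction to lookaheads (lookbehinds would be handled symmetrically) are assumptions of this illustration.

```python
ANY = ('star', ('set', 'SIGMA'))   # stands for [Σ]*, matching every string
NONE = ('set', '')                 # stands for [∅], matching no string


def approx(t):
    """Return (over, under) approximations of a template.

    t is ('hole',), ('set', chars), ('cat', a, b), ('alt', a, b),
    ('star', a), ('pla', a) for (?=a), or ('nla', a) for (?!a).
    """
    kind = t[0]
    if kind == 'hole':
        return ANY, NONE
    if kind == 'set':
        return t, t
    if kind in ('cat', 'alt'):
        (o1, u1), (o2, u2) = approx(t[1]), approx(t[2])
        return (kind, o1, o2), (kind, u1, u2)
    if kind in ('star', 'pla'):
        o, u = approx(t[1])
        return (kind, o), (kind, u)
    if kind == 'nla':
        o, u = approx(t[1])
        # Polarity flips: over-approximating (?!a) needs the under-approximation of a.
        return ('nla', u), ('nla', o)
    raise ValueError(f'unknown template kind: {kind}')
```

Applied to the encoding of the template with a hole followed by a negative lookahead on a hole, all inside a negative lookahead, the function reproduces the two approximations given in the example above.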
Definition 5.1.
Let $t$ be a template that contains $i$ holes. We write $t\langle r_1, r_2, \ldots, r_i \rangle$ for the regular expression obtained by replacing the $j$-th hole in $t$ with $r_j$ for each $j \in \{1, \ldots, i\}$. We define $R_t = \{r \mid \exists r_1, \ldots, r_i.\ r = t\langle r_1, r_2, \ldots, r_i \rangle\}$.
Lemma 5.2. Let $t$ be a template and $r \in R_t$. Then, $L(t_\bot) \subseteq L(r) \subseteq L(t_\top)$.
Corollary 5.3. Suppose that $P \not\subseteq L(t_\top)$ or $N \cap L(t_\bot) \ne \emptyset$. Then, there is no regular expression $r \in R_t$ such that $P \subseteq L(r)$ and $N \cap L(r) = \emptyset$.
We now present the construction of the SMT formula. The formula encodes the condition asserting that the given template can be instantiated, by filling its holes with sets of characters, into a regular expression that (1) is consistent with the given set of examples and (2) satisfies the LTP. From a satisfying assignment to the formula, we obtain a regular expression that satisfies conditions (1) and (2). Conversely, when the formula is found unsatisfiable, no such regular expression can be obtained from the template by filling its holes with sets of characters.
The formula consists of two main components, the constraint for consistency with the examples $\phi_c$ and the constraint for the linear time property $\phi_l$. The constraint $\phi_c$ guarantees that the regular expression obtained by replacing the holes in the template with sets of characters is consistent with the given positive and negative examples. The constraint $\phi_l$ further constrains such a regular expression to satisfy the LTP. The whole constraint is defined as $\phi_c \wedge \phi_l$. We describe the construction of each constraint in Sections 5.3.1 and 5.3.2, respectively.
At a high level, the construction of the constraint $\phi_c$ is similar to the one proposed by Pan et al. [Pan et al. 2019] that we reviewed in Section 5.1. That is, the constraint is a propositional formula on Boolean variables $v_i^a$, where the superscript $a \in \Sigma$ is a character and the subscript $i \in \mathbb{N}$ is the index of a hole in the template. We generate the constraint so that $v_i^a$ being assigned true means that the $i$-th hole in the template should be replaced with some $[C]$ satisfying $a \in C$. As remarked before, we extend the baseline construction of [Pan et al. 2019] with support for the extended features of real-world regular expressions.
Now we describe the construction of the constraint in detail.
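We define the construction via the function $\mathit{encode}$, which takes a template $t$ and a string $w$ and outputs a constraint $\phi_w$.
\[
\begin{array}{ll}
\dfrac{p < |w| \quad w[p] \in C}{([C], w, p, \Gamma, \phi) \rightsquigarrow (\{(p+1, \Gamma, \phi)\},\ \emptyset)} & (\text{Set of characters})\\[2ex]
\dfrac{(t_1, w, p, \Gamma, \phi) \rightsquigarrow (\mathcal{S}, \mathcal{F}) \quad \forall (p_i, \Gamma_i, \phi_i) \in \mathcal{S}.\ (t_2, w, p_i, \Gamma_i, \phi_i) \rightsquigarrow (\mathcal{S}_i, \mathcal{F}_i)}{(t_1 t_2, w, p, \Gamma, \phi) \rightsquigarrow (\bigcup_{0 \le i < |\mathcal{S}|} \mathcal{S}_i,\ \mathcal{F} \cup \bigcup_{0 \le i < |\mathcal{S}|} \mathcal{F}_i)} & (\text{Concatenation})\\[2ex]
\dfrac{(t_1, w, p, \Gamma, \phi) \rightsquigarrow (\mathcal{S}_1, \mathcal{F}_1) \quad (t_2, w, p, \Gamma, \phi) \rightsquigarrow (\mathcal{S}_2, \mathcal{F}_2)}{(t_1 \mid t_2, w, p, \Gamma, \phi) \rightsquigarrow (\mathcal{S}_1 \cup \mathcal{S}_2,\ \mathcal{F}_1 \cup \mathcal{F}_2)} & (\text{Union})\\[2ex]
\dfrac{(t, w, p, \Gamma, \phi) \rightsquigarrow (\mathcal{S}, \mathcal{F}) \quad \forall (p_i, \Gamma_i, \phi_i) \in (\mathcal{S} \setminus \{(p, \Gamma, \_)\}).\ (t^*, w, p_i, \Gamma_i, \phi_i) \rightsquigarrow (\mathcal{S}_i, \mathcal{F}_i)}{(t^*, w, p, \Gamma, \phi) \rightsquigarrow (\{(p, \Gamma, \phi)\} \cup \bigcup_{0 \le i < |\mathcal{S}|} \mathcal{S}_i,\ \emptyset)} & (\text{Repetition})\\[2ex]
\dfrac{(t, w, p, \Gamma, \phi) \rightsquigarrow (\mathcal{S}, \mathcal{F})}{((t)_i, w, p, \Gamma, \phi) \rightsquigarrow (\bigcup_{(p_i, \Gamma_i, \phi_{c_i}) \in \mathcal{S}} (p_i, \Gamma_i[i \mapsto w[p..p_i)], \phi_{c_i}),\ \mathcal{F})} & (\text{Capturing group})\\[2ex]
\dfrac{x = \Gamma(i) \quad x = w[p..p+|x|)}{(\backslash i, w, p, \Gamma, \phi) \rightsquigarrow (\{(p+|x|, \Gamma, \phi)\},\ \emptyset)} & (\text{Backreference})\\[2ex]
\dfrac{(t, w, p, \Gamma, \phi) \rightsquigarrow (\mathcal{S}, \mathcal{F})}{((?{=}\,t), w, p, \Gamma, \phi) \rightsquigarrow (\mathcal{S}[(p_i, \Gamma_i, \phi_{c_i}) \mapsto (p, \Gamma, \phi_{c_i})],\ \mathcal{F})} & (\text{Positive lookahead})\\[2ex]
\dfrac{(t, w, p, \Gamma, \phi) \rightsquigarrow (\mathcal{S}, \mathcal{F})}{((?!\,t), w, p, \Gamma, \phi) \rightsquigarrow (\mathcal{F}[(\bot, \bot, \phi_{c_i}) \mapsto (p, \Gamma, \phi_{c_i})],\ \mathcal{S}[(p_i, \Gamma_i, \phi_{c_i}) \mapsto (\bot, \bot, \phi_{c_i})])} & (\text{Negative lookahead})\\[2ex]
\dfrac{(x, w[p-|x|, p), 0, \Gamma, \phi) \rightsquigarrow (\mathcal{S}, \mathcal{F})}{((?{<}{=}\,x), w, p, \Gamma, \phi) \rightsquigarrow (\mathcal{S}[(p_i, \Gamma_i, \phi_{c_i}) \mapsto (p, \Gamma, \phi_{c_i})],\ \mathcal{F})} & (\text{Positive lookbehind})\\[2ex]
\dfrac{(x, w[p-|x|, p), 0, \Gamma, \phi) \rightsquigarrow (\mathcal{S}, \mathcal{F})}{((?{<}!\,x), w, p, \Gamma, \phi) \rightsquigarrow (\mathcal{F}[(\bot, \bot, \phi_{c_i}) \mapsto (p, \Gamma, \phi_{c_i})],\ \mathcal{S}[(p_i, \Gamma_i, \phi_{c_i}) \mapsto (\bot, \bot, \phi_{c_i})])} & (\text{Negative lookbehind})
\end{array}
\]
$\mathcal{M}[(p, \Gamma, \phi) \mapsto (p', \Gamma', \phi')]$, where $\mathcal{M}$ is either $\mathcal{S}$ or $\mathcal{F}$, denotes the set of matching results obtained by replacing $(p, \Gamma, \phi)$ with $(p', \Gamma', \phi')$ in $\mathcal{M}$. For space, we omit the cases where the matching fails, that is, $(r, w, p, \Gamma, \phi) \rightsquigarrow (\emptyset, \{(\bot, \bot, \phi)\})$.
The rules are similar to the corresponding rules defining the formal semantics of regular expressions shown in Fig. 2, and they mimic the behavior of the backtracking matching algorithm. The key difference is the rule (Hole) for processing holes. Given the current accumulated constraint $\phi$, the rule adds appropriate constraints asserting that the character $w[p]$ has to be included or not included in the set of characters that replaces the hole, by conjoining $v_i^{w[p]}$ to $\phi$ for the success case, and conjoining $\neg v_i^{w[p]}$ to $\phi$ for the failure case.
Finally, we construct the constraint for consistency with the examples for $t$ as follows:
\[
\phi_c = \bigwedge_{w_p \in P} \mathit{encode}(t, w_p) \ \wedge \bigwedge_{w_n \in N} \neg\, \mathit{encode}(t, w_n)
\]
Example 5.4.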
Consider the template $t = (?!\,\Box_1)\Box_2 bc$, the set of positive examples $P = \{abc, cbc\}$, and the set of negative examples $N = \{bbc\}$. For the positive examples, we have $\mathit{encode}(t, abc) = \neg v_1^a \wedge v_2^a$ and $\mathit{encode}(t, cbc) = \neg v_1^c \wedge v_2^c$. For the negative example, we have $\mathit{encode}(t, bbc) = \neg v_1^b \wedge v_2^b$. Therefore, $\phi_c = ((\neg v_1^a \wedge v_2^a) \wedge (\neg v_1^c \wedge v_2^c)) \wedge \neg(\neg v_1^b \wedge v_2^b)$.

Algorithm 2: The construction of constraints for linear time property.
Input: a template $t$
Output: a constraint $\phi_l$
 1: function generateConstraintsForLinear($t$)
 2:   $t \leftarrow \mathit{rmla}(t)$
 3:   $\mathcal{A} \leftarrow \mathit{eNFAtr}(t^{[]})$   // $\mathcal{A} = (Q, \delta, q_0, q_n)$
 4:   $\phi_l \leftarrow \mathit{true}$
 5:   for each $(q, [_i, q') \in \delta$ do
 6:     $L \leftarrow \mathit{First}(q)$
 7:     if $\rho_i a, \rho_j a \in L$, where $\rho_i \ne \rho_j$ then
 8:       return $\mathit{false}$
 9:     for each $a \in L^{\natural}$ and $\Box_i \in L^{\natural}$ do
10:       $\phi_l \leftarrow \phi_l \wedge \neg v_i^a$
11:     for each $a \in \Sigma$ and $\Box_i, \Box_j \in L^{\natural}$ where $i \ne j$ do
12:       $\phi_l \leftarrow \phi_l \wedge (\neg v_i^a \vee \neg v_j^a)$
13:   return $\phi_l$

Algorithm 2 describes the construction of the constraint for enforcing that the generated regular expression satisfies the LTP. The algorithm takes as input a template $t$ and returns a constraint for the linear time property $\phi_l$.
Recall from Definition 4.6 that the LTP requires that (1) the matching algorithm runs deterministically on the expression when the lookarounds are removed, and (2) the lookarounds do not contain repetitions. The algorithm first removes lookarounds from the template by using $\mathit{rmla}$ defined in Definition 4.2 (line 2). Here, we extend $\mathit{rmla}$ to templates by treating each hole $\Box_i$ as the set of characters $[\{\Box_i\}]$, that is, $\Box_i$ is treated like a single character. The repetition-freedom requirement is guaranteed by our template construction, which does not put repetitions in a subexpression of a lookaround (cf. Expanding and adding holes to a template in Section 5.1).
Next, the algorithm constructs an NFA for $t^{[]}$ via the extended NFA translation defined in Definition 4.3 (line 3). Then, for each open bracket $[_i$ in the NFA, the algorithm computes the set of paths $\mathit{First}(q)$, where $q$ is the unique source state of the (unique) $[_i$-labeled edge. Here, we extend $\mathit{First}$ so that a hole $\Box_i$ is treated as a character (cf. Section 4.1).
That is, $\rho\alpha \in \mathit{First}(q)$ iff $\rho \in \Psi^*$, $\alpha \in \Sigma \cup \{\Box_i \mid \Box_i \text{ appears in } t\}$, and there is a $\rho\alpha$-labeled path from $q$.
The algorithm then checks whether the template $t$ already violates the determinism requirement of the LTP, by checking if there exist multiple bracket-only routes from $[_i$ that reach the same character (lines 6-8). Note that if the check is true, then $|\mathit{Bpaths}(\mathit{rmla}(r), [_i, a)| \ge 2$ for any regular expression $r$ obtainable from the template $t$. Therefore, we may safely reject such a template, and we return the unsatisfiable formula $\mathit{false}$.
Otherwise, we proceed to add two types of constraints in lines 9-12. The constraints of the first type are added in lines 9-10. Here, we assert that, if some character $a \in \Sigma$ and a hole $\Box_i$ are both reachable from $[_i$ by bracket-only paths, then the hole must not be filled with a set of characters that contains $a$. Here, $L^{\natural} = \{\alpha \mid \rho\alpha \in L\}$ (cf. Section 4.1). The constraints of the second type are added in lines 11-12. Here, we assert that, if there are two different holes $\Box_i$ and $\Box_j$ reachable from $[_i$ by bracket-only paths, then for any character $a \in \Sigma$, at most one of the holes can be filled with a set of characters that contains $a$. It is easy to see that these constraints must be satisfied, or else the regular expression obtained from the template would violate the determinism requirement of the LTP. Conversely, satisfying these constraints for all $[_i$ ensures that the determinism requirement is satisfied. Finally, the algorithm returns the resulting constraint $\phi_l$ (line 13).
Fig. 7. A simplified version of the NFA translation of $[_1 ([_2 \Box_1 ]_2 \mid [_3 \Box_2 ]_3 \mid [_4\, a\, ]_4)_1 ]_1\ [_5\, b\, ]_5\ [_6\, \backslash 1\, ]_6$.
Example 5.5.
Let us consider running the algorithm on the template $t = (\Box_1 \mid \Box_2 \mid a)_1\, b\, \backslash 1\, (?!\,a)$. The algorithm first removes lookarounds in the template (line 2), and thus the template becomes $t \leftarrow (\Box_1 \mid \Box_2 \mid a)_1\, b\, \backslash 1$. Next, the algorithm applies the extended NFA translation to $t^{[]}$ (line 3). Here, $t^{[]}$ is $[_1 ([_2 \Box_1 ]_2 \mid [_3 \Box_2 ]_3 \mid [_4\, a\, ]_4)_1 ]_1\ [_5\, b\, ]_5\ [_6\, \backslash 1\, ]_6$. For simplicity, we omit some redundant brackets for sequences and unions. Fig. 7 shows the NFA obtained by the extended NFA translation. Then, the algorithm constructs the constraints (lines 5-12). Let us consider the case of a state $q_i \in Q$ for which $\mathit{First}(q_i) = \{[_1 [_2 \Box_1,\ [_1 [_3 \Box_2,\ [_1 [_4\, a\}$. At line 10, the algorithm adds constraints as follows: $\phi_l \leftarrow \phi_l \wedge \neg v_1^a \wedge \neg v_2^a$. At line 12, the algorithm adds constraints as follows: $\phi_l \leftarrow \phi_l \wedge \bigwedge_{a \in \Sigma} (\neg v_1^a \vee \neg v_2^a)$.
In this section, we present the results of our evaluation. We evaluate the performance of
Remedy by answering the following questions.
RQ1 How well does Remedy perform on repairing vulnerable regular expressions?
RQ2 How similar are the original regular expression and the regular expression repaired by Remedy with respect to the language?
RQ3 How does the number of examples affect the similarity of the regular expressions?

For the first question, we measure the running time to repair the vulnerable regular expressions of a real-world data set. For the second question, we measure the similarity of the original and repaired regular expressions with respect to the language, using the same metrics as described in [Bastani et al. 2017]. For the last question, we choose some regular expressions whose similarities are low in the second question, and measure the similarity again after increasing the number of examples, to check the impact of the number of examples. Additionally, we present a simple optimization to improve the similarity.
We have implemented our algorithm in a tool called Remedy, written in Java. Remedy uses the Z3 SMT solver [De Moura and Bjørner 2008] to check the satisfiability of constraints. The experiments are conducted on a MacBook Pro with a 3.5 GHz Intel i7 and 16 GB of RAM running macOS 10.13.
Benchmark. To conduct the experiments, we used a real-world data set, the Ecosystem ReDoS data set, collected by Davis et al. [Davis et al. 2018]. The data set contains real-world regular expressions in Node.js (JavaScript) and Python core libraries, and the regular expressions in the data set contain various grammatical extensions such as lookarounds, capturing groups, and backreferences. Since the data set does not separate vulnerable and invulnerable regular expressions, we contacted the author and used a version of the data set, provided by the author, that contains only vulnerable regular expressions. Finally, we randomly selected 1,758 vulnerable regular expressions from the data set.
Sampling Examples. The data set does not provide a set of positive examples $P$ and a set of negative examples $N$. To obtain the examples, we sampled them randomly from the strings generated by the regular expressions in the data set. More precisely, we generated the examples $P$ and $N$ for a regular expression $r$ by the following steps:
(1) We prepare a set of strings $\Sigma$ from which to construct input strings. To this end, we traverse the AST of $r$, and whenever we visit a node that corresponds to a set of characters $[C]$, we randomly select a character $a \in C$ and add it to $\Sigma$, i.e., $\Sigma = \Sigma \cup \{a\}$. Initially, $\Sigma = \emptyset$. Finally, we add a character $b \notin \Sigma$ to $\Sigma$.
(2) Let $n$ be the minimum length of an input string that is accepted by the regular expression $r$. Then, we enumerate the strings whose length is less than or equal to $n + \ldots$ by repeatedly combining the strings in $\Sigma$.
(3) For each string $s$ enumerated in the previous step, we check whether or not $r$ accepts $s$. If $r$ accepts $s$, then we add $s$ to a set of strings $P'$; otherwise we add it to a set of strings $N'$.
(4) We randomly sample ten strings as positive examples $P$ and negative examples $N$ from $P'$ and $N'$, respectively. If the size of a set is less than ten, then we select the whole set as the examples.

Sometimes real-world regular expressions contain long sequences of characters. For example, a regular expression that takes an HTTP header as input and checks whether or not the header contains the word Content-Security-Policy may contain that word literally. In that case, we cannot use the above steps in a straightforward way because the long sequence increases $|\Sigma|$, which may cause the sizes of $P'$ and $N'$ to explode. To avoid the explosion, we handle such a sequence of characters as a single word. For example, let the regular expression be Content-Security-Policy|[z] . Then, the set of strings is $\Sigma = \{\text{Content-Security-Policy}, z\}$, whereas if we applied the above steps in a straightforward way, the set of strings would be $\Sigma = \{C, o, n, t, e, -, S, c, u, r, i, y, P, l, z\}$.

Validations. To validate our implementation, we run each repaired regular expression on java.util.regex, the regular expression library for Java, and check whether the regular expression is consistent with all examples.
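The sampling steps (2)-(4) and the consistency check used for validation can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the Java implementation of Remedy: it uses Python's `re` module in place of java.util.regex, it takes the set of strings $\Sigma$ as an argument instead of extracting it from the AST as in step (1), and it assumes that acceptance means matching the whole input. The names `sample_examples`, `is_consistent`, `max_len`, and `k` are hypothetical, and `max_len` bounds the number of combined words, which equals the string length only when every word is a single character.

```python
import itertools
import random
import re


def sample_examples(pattern, sigma, max_len, k=10, seed=0):
    """Enumerate concatenations of at most max_len words of sigma, split
    them into accepted and rejected strings, and sample up to k of each."""
    regex = re.compile(pattern)
    accepted, rejected = set(), set()
    for length in range(max_len + 1):
        for words in itertools.product(sigma, repeat=length):
            s = "".join(words)
            # Assumption: "accepts" means the whole string is matched.
            (accepted if regex.fullmatch(s) else rejected).add(s)
    rng = random.Random(seed)

    def pick(pool):
        ordered = sorted(pool)  # sorted so that sampling is reproducible
        if len(ordered) <= k:
            return set(ordered)  # fewer than k strings: use the whole set
        return set(rng.sample(ordered, k))

    return pick(accepted), pick(rejected)


def is_consistent(pattern, positives, negatives):
    """True iff pattern accepts every positive and rejects every negative."""
    regex = re.compile(pattern)
    return (all(regex.fullmatch(s) for s in positives)
            and not any(regex.fullmatch(s) for s in negatives))
```

For instance, `sample_examples(".*.*=", ["a", "="], 3)` yields the strings over a and = of length at most 3 that end with =, together with the remaining strings as negative examples, and `is_consistent` can then be used to check a candidate repair against them.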
To answer RQ1, we ran Remedy with a timeout of 30 seconds and measured the running times of the repairs. Figure 8 summarizes the results of the repairs. In total, there were 1,080 successful repairs, which include regular expressions with the modern extensions of real-world regular expressions. That is, Remedy could repair 61.4 % of the regular expressions within 30 seconds, and 83.5 % of the repaired regular expressions were repaired within 5 seconds. However, Remedy could not repair 678 regular expressions within the timeout because these regular expressions require larger changes from the original.

For evaluation, we have implemented the DFA translation approach [van der Merwe et al. 2017] and compared the results on pure vulnerable regular expressions. From the comparison, we observed the following two things:
• By using lookbehinds, Remedy could repair some vulnerable regular expressions into simpler regular expressions than the DFA translation approach with respect to the distance. For example, consider the pure vulnerable regular expression .*.*= . The regular expression repaired by Remedy is ..*(?<=[=]) , while that of the DFA translation approach is [^=]*[=]([=]|[^=][^=]*[=])* . In addition, Remedy repairs the vulnerable regular expression ]*font-style:italic[^>]*> into the invulnerable regular expression ]*)> , while the DFA translation approach generates a regular expression whose size is 73,433,094.
• Remedy and the DFA translation approach share some corner cases that lead the repair to a timeout. For example, Remedy fails to repair the regular expression .*[-A-Za-z]+[.]schema[.]json , and the size of the regular expression generated by the DFA translation approach is 56,782,361.

Fig. 8. Results of the repairs. (Left) histogram of the running time and (right) histogram of F1-score.

To answer RQ2, we measure the similarity of the languages as follows. For a regular expression $r$, we estimate an approximate language $L'(r)$ that consists of 100 random samples from $L(r)$. Using the language $L'(r)$, we estimate the precision, recall, and F1-score from the original regular expression $r_1$ and the repaired regular expression $r_2$ as follows: we estimate the precision of $r_2$ by $\frac{|L'(r_2) \cap L(r_1)|}{|L'(r_2)|}$, the recall of $r_2$ by $\frac{|L'(r_1) \cap L(r_2)|}{|L'(r_1)|}$, and the F1-score by $\frac{2 \cdot \mathit{precision} \cdot \mathit{recall}}{\mathit{precision} + \mathit{recall}}$. The F1-score is between 0 and 1, inclusive, and it is high iff both precision and recall are high. A high F1-score implies that the languages of the two regular expressions are similar.

To estimate the semantic similarity fairly, we used a different approach to sampling the examples from that of Section 6.1. Concretely, we used the following approach to sample the examples of the regular expression $r$.
(1) Let $n$ be the minimum length of an input string that is accepted by the regular expression $r$. Then, we enumerate all strings over the ASCII alphabet whose length is less than or equal to $n + \ldots$.
(2) For each string $w$ enumerated in the previous step, we check whether or not $w \in L(r)$, and if so, add $w$ to a set of strings $S$, i.e., $S = S \cup \{w\}$.
(3) We randomly sample 100 strings from $S$ as the elements of the approximate language $L'(r)$. If the size of $S$ is less than 100, we use the whole set as the language.

Figure 8 shows the results. Remedy could repair 62.3 % of the regular expressions with an F1-score of 0.7 or more. However, for 31.9 % of the regular expressions, Remedy generated a repair with an F1-score of 0.3 or lower. To analyze the cause of the low F1-scores, we picked some regular expressions whose F1-scores are low, and tried to improve the F1-scores by increasing the number of examples and by applying an optimization relating to sets of characters. We discuss the details in the next section.
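The precision, recall, and F1 estimates described above can be sketched as follows. This is a hedged illustration rather than the evaluation code used for the experiments: it uses Python's `re` module with whole-string matching, it takes the alphabet and the length bound as parameters (enumerating the full ASCII alphabet is exponential in the length bound), and the names `approx_language` and `f1_score` are hypothetical. Returning zero scores when a sample is empty is also an assumption of this sketch.

```python
import itertools
import random
import re


def approx_language(pattern, alphabet, max_len, size=100, seed=0):
    """Sample up to `size` accepted strings of length at most max_len."""
    regex = re.compile(pattern)
    accepted = [
        "".join(chars)
        for length in range(max_len + 1)
        for chars in itertools.product(alphabet, repeat=length)
        if regex.fullmatch("".join(chars))
    ]
    if len(accepted) <= size:
        return accepted  # fewer than `size` strings: use the whole set
    return random.Random(seed).sample(accepted, size)


def f1_score(original, repaired, alphabet, max_len):
    """Estimate (precision, recall, F1) of `repaired` against `original`."""
    sample_orig = approx_language(original, alphabet, max_len)
    sample_rep = approx_language(repaired, alphabet, max_len)
    if not sample_orig or not sample_rep:
        return 0.0, 0.0, 0.0  # assumption: an empty sample scores zero
    orig_re, rep_re = re.compile(original), re.compile(repaired)
    # precision: fraction of sampled repaired strings the original accepts
    precision = sum(1 for s in sample_rep if orig_re.fullmatch(s)) / len(sample_rep)
    # recall: fraction of sampled original strings the repair accepts
    recall = sum(1 for s in sample_orig if rep_re.fullmatch(s)) / len(sample_orig)
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)
```

As a usage example, `f1_score(".*.*=", "..*(?<=[=])", "ab=", 3)` compares the vulnerable expression from RQ1 with its repair on all strings over a, b, and = of length at most 3.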
To answer RQ3 and to improve the similarity, we apply two types of modifications to Remedy. Then, we compare the similarity of the repaired regular expressions to evaluate the impact of the modifications. The modifications are: (1) increasing the number of examples, and (2) applying a simple optimization for sets of characters. For the first modification, we change the number of examples from 10 to 1,000. For the second modification, we apply the optimization described below.

The optimization is simple: for each set of characters in $r$, the optimization adds as many characters $a \in \Sigma$ as possible, as long as the modified regular expression is still consistent with all examples and satisfies the LTP. In this evaluation, we use as a baseline Remedy under the same conditions as in RQ1 and RQ2, i.e., it uses 10 examples without the optimization.

Fig. 9. Comparison results of F1-scores. (Columns: base, ex, opt, ex+opt; one row per regular expression.)

Figure 9 summarizes the results of repairing with the modifications. In the figure, base is the baseline, i.e., Remedy using 10 examples without the optimization; ex is Remedy using 1,000 examples without the optimization; opt is Remedy using 10 examples with the optimization; and ex+opt is Remedy using 1,000 examples with the optimization. From the results, we have two findings.
• Increasing the number of examples does not affect the F1-score.
• Applying the optimization significantly improves the F1-score.
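The optimization for sets of characters can be sketched as a greedy loop. The sketch below makes several assumptions that are not taken from Remedy: the regular expression is given as a list of parts, where a Python `set` stands for a (non-empty) set of characters and a `str` is a fixed fragment; the LTP check is passed in as a hypothetical predicate `satisfies_ltp`, which by default accepts everything and therefore does not implement the LTP; and the names `build`, `consistent`, and `widen_classes` are illustrative only.

```python
import re


def build(parts):
    """Render parts: a str is a fixed fragment, a set is a character class."""
    out = []
    for part in parts:
        if isinstance(part, set):
            out.append("[" + "".join(re.escape(c) for c in sorted(part)) + "]")
        else:
            out.append(part)
    return "".join(out)


def consistent(pattern, positives, negatives):
    """True iff pattern accepts every positive and rejects every negative."""
    regex = re.compile(pattern)
    return (all(regex.fullmatch(s) for s in positives)
            and not any(regex.fullmatch(s) for s in negatives))


def widen_classes(parts, alphabet, positives, negatives,
                  satisfies_ltp=lambda pattern: True):
    """Greedily add characters to each class while the expression stays
    consistent with the examples and passes the (hypothetical) LTP check."""
    parts = [set(p) if isinstance(p, set) else p for p in parts]  # copy sets
    for part in parts:
        if not isinstance(part, set):
            continue
        for ch in alphabet:
            if ch in part:
                continue
            part.add(ch)  # tentatively widen this class
            pattern = build(parts)
            if not (consistent(pattern, positives, negatives)
                    and satisfies_ltp(pattern)):
                part.remove(ch)  # undo the widening
    return build(parts)
```

Because the loop is greedy, the result can depend on the order in which the classes and the characters of the alphabet are visited; a different order may widen a different class first.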
As remarked in Section 1, most prior works on regular expression vulnerability have only considered detecting the vulnerability [Kirrage et al. 2013; Shen et al. 2018; Sugiyama and Minamide 2014; Weideman et al. 2016; Wüstholz et al. 2017], and many of them also do not support the extended features of real-world regular expressions. By contrast, we consider the problem of repairing vulnerable regular expressions, and also support the real-world extensions. To our knowledge, the only prior work to consider the repair is the work by van der Merwe et al. [van der Merwe et al. 2017], which describes the DFA-based repair approach for pure regular expressions that we have also sketched in Section 1. However, as remarked before, real-world regular expressions are not regular and hence not convertible to a DFA in general. Another issue with the DFA-based approach is that the resulting regular expression is often much larger than the original and not human-readable.

Our repair algorithm synthesizes a regular expression from examples. There has been much work on example-based regular expression synthesis [Alquezar and Sanfeliu 1994; Dupont 1996; Fernau 2009; Galassi and Giordana 2005; Zhong et al. 2018]. Next, we discuss the ones that are most closely related to ours.

Bartoli et al. [Bartoli et al. 2016] introduced a technique for synthesizing regular expressions from examples based on genetic programming for text extraction, and they implemented their technique as a system named RegexGenerator++. As pointed out in [Pan et al. 2019], RegexGenerator++ does not guarantee to generate a correct regular expression, i.e., a regular expression that is consistent with the examples. In addition, it also does not guarantee to generate an invulnerable regular expression.

Cody-Kenny et al. [Cody-Kenny et al. 2017] introduced PonyGE2, a tool that implements an automated regular expression improvement technique for running time performance based on genetic programming. Since their motivation is to improve the performance, PonyGE2 does not guarantee to generate an invulnerable regular expression.

Lee et al. [Lee et al. 2016] presented AlphaRegex, a tool for synthesizing a regular expression from examples based on top-down enumerative search with pruning techniques called over- and under-approximations. However, AlphaRegex restricts the alphabet, i.e., it considers only a binary alphabet. Additionally, it may generate a vulnerable regular expression. In fact, several outputs presented in their paper are vulnerable regular expressions, e.g., the regular expressions synthesized by AlphaRegex ∗ ∗ ∗ , ( ( | )) ∗ ∗ , and ( | ) ∗ ( | ) ∗ are vulnerable regular expressions because they have worst-case super-linear complexity.

Pan et al. [Pan et al. 2019] proposed RFixer, a tool for repairing an incorrect regular expression, i.e., one that is not consistent with the examples, into a correct one from the examples. RFixer employs an SMT solver to overcome the enormous size of the synthesis search space due to the alphabet, and our approach to finding the assignment of characters is based on their work. However, RFixer focuses on repairing an incorrect regular expression and does not attempt to repair a vulnerable one.

Chen et al. [Chen et al. 2020] introduced REGEL, a tool for synthesizing a regular expression from a combination of examples and natural language. Their motivation is to combine the techniques for synthesizing a regular expression from the programming language community and the natural language processing community, and to show that the combination achieves high accuracy and faster synthesis. Thus, synthesizing an invulnerable regular expression is out of the scope of their work.

In this paper, we introduced Remedy, which repairs a vulnerable regular expression from examples. In particular, our aim is to support modern extensions of real-world regular expressions in the repair. To this end, we proposed extensions of existing techniques for repairing incorrect regular expressions, i.e., over- and under-approximations and constraint generation techniques to replace holes with sets of characters. Additionally, we defined a linear time property for a regular expression that ensures the linear running time of the regular expression matching, and presented a new constraint generation technique that ensures invulnerability by enforcing that a regular expression synthesized by Remedy satisfies the linear time property. We evaluated the effectiveness of Remedy on a real-world data set that includes modern extensions of real-world regular expressions. The evaluation showed that Remedy can repair real-world regular expressions successfully.
REFERENCES
[…] In Proceedings of the ACL'02 Workshop on Unsupervised Lexical Acquisition. 291–300.
Dana Angluin. 1987. Learning Regular Sets from Queries and Counterexamples. Inf. Comput. 75, 2 (1987), 87–106. https://doi.org/10.1016/0890-5401(87)90052-6
A. Bartoli, A. De Lorenzo, E. Medvet, and F. Tarlao. 2016. Inference of Regular Expressions for Text Extraction from Examples. IEEE Transactions on Knowledge and Data Engineering 28, 5 (May 2016), 1217–1230. https://doi.org/10.1109/TKDE.2016.2515587
Osbert Bastani, Rahul Sharma, Alex Aiken, and Percy Liang. 2017. Synthesizing Program Input Grammars. SIGPLAN Not. 52, 6 (June 2017), 95–110. https://doi.org/10.1145/3140587.3062349
Cezar Câmpeanu, Kai Salomaa, and Sheng Yu. 2003. A Formal Study Of Practical Regular Expressions. Int. J. Found. Comput. Sci. 14, 6 (2003), 1007–1018. https://doi.org/10.1142/S012905410300214X
Carl Chapman and Kathryn T. Stolee. 2016. Exploring Regular Expression Usage and Context in Python. In Proceedings of the 25th International Symposium on Software Testing and Analysis (Saarbrücken, Germany) (ISSTA 2016). Association for Computing Machinery, New York, NY, USA, 282–293. https://doi.org/10.1145/2931037.2931073
Qiaochu Chen, Xinyu Wang, Xi Ye, Greg Durrett, and Isil Dillig. 2020. Multi-modal synthesis of regular expressions. In Proceedings of the 41st ACM SIGPLAN International Conference on Programming Language Design and Implementation, PLDI 2020, London, UK, June 15-20, 2020, Alastair F. Donaldson and Emina Torlak (Eds.). ACM, 487–502. https://doi.org/10.1145/3385412.3385988
Brendan Cody-Kenny, Michael Fenton, Adrian Ronayne, Eoghan Considine, Thomas McGuire, and Michael O'Neill. 2017. A Search for Improved Performance in Regular Expressions. In Proceedings of the Genetic and Evolutionary Computation Conference (Berlin, Germany) (GECCO '17). Association for Computing Machinery, New York, NY, USA, 1280–1287. https://doi.org/10.1145/3071178.3071196
James C. Davis. 2018. The Impact of Regular Expression Denial of Service (ReDoS) in Practice. URL https://medium.com/bugbountywriteup/introduction987fdc4c7b0.
James C. Davis, Christy A. Coghlan, Francisco Servant, and Dongyoon Lee. 2018. The Impact of Regular Expression Denial of Service (ReDoS) in Practice: An Empirical Study at the Ecosystem Scale. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (Lake Buena Vista, FL, USA) (ESEC/FSE 2018). ACM, New York, NY, USA, 246–256. https://doi.org/10.1145/3236024.3236027
Leonardo De Moura and Nikolaj Bjørner. 2008. Z3: An Efficient SMT Solver. In Proceedings of the Theory and Practice of Software, 14th International Conference on Tools and Algorithms for the Construction and Analysis of Systems (Budapest, Hungary) (TACAS'08/ETAPS'08). Springer-Verlag, Berlin, Heidelberg, 337–340.
Pierre Dupont. 1996. Incremental Regular Inference. In Proceedings of the 3rd International Colloquium on Grammatical Inference: Learning Syntax from Sentences (ICGI '96). Springer-Verlag, Berlin, Heidelberg, 222–237.
Henning Fernau. 2009. Algorithms for learning regular expressions from positive data. Information and Computation […]
[…] Theory of Computing Systems 53, 2 (2013), 159–193. https://doi.org/10.1007/s00224-012-9389-0
Jeffrey E. F. Friedl. 2006. Mastering Regular Expressions: Understand Your Data and Be More Productive (3rd ed.).
Ugo Galassi and Attilio Giordana. 2005. Learning Regular Expressions from Noisy Sequences. In Abstraction, Reformulation and Approximation, Jean-Daniel Zucker and Lorenza Saitta (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 92–106.
Jan Goyvaerts and Steven Levithan. 2012. Regular Expressions Cookbook (2nd ed.).
John Graham-Cumming. 2019. Details of the Cloudflare outage on July 2, 2019. https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/.
Pieter Hooimeijer, Benjamin Livshits, David Molnar, Prateek Saxena, and Margus Veanes. 2011. Fast and Precise Sanitizer Analysis with BEK. In Proceedings of the 20th USENIX Conference on Security (San Francisco, CA) (SEC'11). USENIX Association, USA, 1.
Graham-Cumming John. 2016. Outage Postmortem July 20, 2016. URL https://stackstatus.net/post/147710624694/outage-postmortemjuly202016.
Graham-Cumming John. 2019. Details of the Cloudflare outage on July 2, 2019. URL https://blog.cloudflare.com/detailsof-thecloudflareoutageonjuly22019/.
Richard M. Karp. 1972. Reducibility among Combinatorial Problems. Springer US, Boston, MA, 85–103. https://doi.org/10.1007/978-1-4684-2001-2_9
James Kirrage, Asiri Rathnayake, and Hayo Thielecke. 2013. Static Analysis for Regular Expression Denial-of-Service Attacks. In Network and System Security, Javier Lopez, Xinyi Huang, and Ravi Sandhu (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 135–148.
Christoph Koch and Stefanie Scherzinger. 2007. Attribute Grammars for Scalable Query Processing on XML Streams. The VLDB Journal 16, 3 (July 2007), 317–342. https://doi.org/10.1007/s00778-005-0169-1
Mina Lee, Sunbeom So, and Hakjoo Oh. 2016. Synthesizing Regular Expressions from Examples for Introductory Automata Assignments. SIGPLAN Not. 52, 3 (Oct. 2016), 70–80. https://doi.org/10.1145/3093335.2993244
Yunyao Li, Rajasekar Krishnamurthy, Sriram Raghavan, Shivakumar Vaithyanathan, and H. V. Jagadish. 2008. Regular Expression Learning for Information Extraction. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing […]
[…] Sci. Comput. Program. 93 (2014), 3–18. https://doi.org/10.1016/j.scico.2012.11.006
Rong Pan, Qinheping Hu, Gaowei Xu, and Loris D'Antoni. 2019. Automatic Repair of Regular Expressions. Proc. ACM Program. Lang. 3, OOPSLA, Article 139 (Oct. 2019), 29 pages. https://doi.org/10.1145/3360565
Cox Russ. 2007. Regular Expression Matching Can Be Simple And Fast (but is slow in Java, Perl, PHP, Python, Ruby, ...). URL https://swtch.com/ rsc/regexp/regexp1.html.
Yuju Shen, Yanyan Jiang, Chang Xu, Ping Yu, Xiaoxing Ma, and Jian Lu. 2018. ReScue: Crafting Regular Expression DoS Attacks. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering (Montpellier, France) (ASE 2018). ACM, New York, NY, USA, 225–235. https://doi.org/10.1145/3238147.3238159
Michael Sipser. 1997. Introduction to the theory of computation. PWS Publishing Company.
Cristian-Alexandru Staicu and Michael Pradel. 2018. Freezing the Web: A Study of ReDoS Vulnerabilities in JavaScript-based Web Servers. In […]
[…] Information and Media Technologies 9, 3 (2014), 222–232. https://doi.org/10.11185/imt.9.222
Ken Thompson. 1968. Programming Techniques: Regular Expression Search Algorithm. Commun. ACM 11, 6 (June 1968), 419–422. https://doi.org/10.1145/363347.363387
Brink van der Merwe, Nicolaas Weideman, and Martin Berglund. 2017. Turning Evil Regexes Harmless. In Proceedings of the South African Institute of Computer Scientists and Information Technologists (SAICSIT '17). Association for Computing Machinery, New York, NY, USA, Article 38, 10 pages. https://doi.org/10.1145/3129416.3129440
Nicolaas Weideman, Brink van der Merwe, Martin Berglund, and Bruce Watson. 2016. Analyzing Matching Time Behavior of Backtracking Regular Expression Matchers by Using Ambiguity of NFA. In Implementation and Application of Automata, Yo-Sub Han and Kai Salomaa (Eds.). Springer International Publishing, Cham, 322–334.
Valentin Wüstholz, Oswaldo Olivo, Marijn J. Heule, and Isil Dillig. 2017. Static Detection of DoS Vulnerabilities in Programs That Use Regular Expressions. In Proceedings, Part II, of the 23rd International Conference on Tools and Algorithms for the Construction and Analysis of Systems - Volume 10206. Springer-Verlag, Berlin, Heidelberg, 3–20. https://doi.org/10.1007/978-3-662-54580-5_1
Fang Yu, Ching-Yuan Shueh, Chun-Han Lin, Yu-Fang Chen, Bow-Yaw Wang, and Tevfik Bultan. 2016. Optimal Sanitization Synthesis for Web Application Vulnerability Repair. In Proceedings of the 25th International Symposium on Software Testing and Analysis (Saarbrücken, Germany) (ISSTA 2016). Association for Computing Machinery, New York, NY, USA, 189–200. https://doi.org/10.1145/2931037.2931050
Zexuan Zhong, Jiaqi Guo, Wei Yang, Jian Peng, Tao Xie, Jian-Guang Lou, Ting Liu, and Dongmei Zhang. 2018. SemRegex: A Semantics-Based Approach for Generating Regular Expressions from Natural Language Specifications. In EMNLP'18. ACL.
A CORRECTNESS OF LINEAR TIME PROPERTY
In this section, we show that the matching of regular expressions that satisfy the LTP runs in linear time. Recall from Section 3 that the running time is defined by the size of the derivation of $(r, w, 0, \emptyset) \leadsto \mathcal{N}$. We show that the size of the derivation is linear. More precisely, we show that the derivation is deterministic, i.e., at most one element in $\mathcal{N}$ can succeed in the matching, and the other elements fail the matching immediately. Let $r$ be a regular expression that satisfies the LTP. We divide the proof into four parts. (1) Every lookaround in $r$ runs in constant time. (2) For the derivation $(r, w, p, \Gamma) \leadsto \mathcal{N}$, at most one element $(p_1, \Gamma_1) \in \mathcal{N}$ can succeed in the matching (cf. Lemma A.4). (3) The element $(p_1, \Gamma_1) \in \mathcal{N}$ succeeds or fails the matching within a derivation of at most linear size (cf. Lemma A.7). (4) The other elements $(p_2, \Gamma_2) \in \mathcal{N} \setminus \{(p_1, \Gamma_1)\}$ fail the matching within a derivation of at most constant size (cf. Lemma A.8). To distinguish these elements, we call $(p_1, \Gamma_1)$ an alive state and $(p_2, \Gamma_2)$ a dead state.

Now we show the first part of the proof. By the definition of the LTP, a regular expression wrapped by a lookaround does not contain repetitions. Thus, we show that a regular expression that does not contain repetitions runs in constant time.

Theorem A.1. A regular expression that does not contain repetitions runs in constant time.

Proof. We prove that the size of $\mathcal{N}$ is constant. The proof is by induction on the structure of a regular expression.
Base case
• When $r$ is $[C]$, either $\mathcal{N} = \{(p + 1, \Gamma)\}$ or the matching fails. In both cases, the size of $\mathcal{N}$ is at most one, i.e., constant.
Inductive case
• When $r$ is $r_1 r_2$, we assume that $(r_1, w, p, \Gamma) \leadsto \mathcal{N}$. By the induction hypothesis, the size of $\mathcal{N}$ is constant. Then, for all $(p', \Gamma') \in \mathcal{N}$, $(r_2, w, p', \Gamma') \leadsto \mathcal{N}'$ and the size of $\mathcal{N}'$ is also constant. Thus, the size of $\bigcup_{0 \le i < |\mathcal{N}|} \mathcal{N}_i$ is constant.
• When $r$ is $r_1 \mid r_2$, the sizes of the sets $\mathcal{N}$ obtained for $r_1$ and for $r_2$ are both constant by the induction hypothesis.
The other cases, i.e., $(r)_i$, $\backslash i$, (?:$r$), (?=$r$), (?!$r$), (?<=$x$), and (? […] (Union) and (Repetition) are non-deterministic. Thus, we first show that their derivations are deterministic if the regular expression satisfies the LTP.
Lemma A.2. Suppose that a regular expression $r_1 \mid r_2$ satisfies the LTP. Then, for the derivation $(r_1 \mid r_2, w, p, \Gamma) \leadsto \mathcal{N}$, at most one element in $\mathcal{N}$ can succeed in the matching.

Proof. If $r_1$ or $r_2$ fails the matching, then the assertion holds. Thus, we assume that both $r_1$ and $r_2$ succeed in the matching. We proceed by case analysis on the consumption of the regular expressions $r_1$ and $r_2$.
• If both $r_1$ and $r_2$ consume at least one character, then both consume $w[p]$, so there are at least two paths from $r_1 \mid r_2$ that consume $w[p]$, i.e., $|\mathit{Bpaths}(r_1 \mid r_2, [_i, w[p])| \ge 2$, where $[_i$ is the bracket for $r_1 \mid r_2$. This violates the LTP and contradicts the assumption.
• If either $r_1$ or $r_2$ consumes at least one character, then $L(r_1) \cap L(r_2) = \emptyset$, because if $L(r_1) \cap L(r_2) \ne \emptyset$, then it violates the LTP, i.e., for a character $a \in L(r_1) \cap L(r_2)$, $|\mathit{Bpaths}(r_1 \mid r_2, [_i, a)| \ge 2$. Thus, $L(r_1) \cap L(r_2) = \emptyset$. In this case, only one of $r_1$ and $r_2$ consumes $w[p]$. Thus, this case holds.
• If neither $r_1$ nor $r_2$ consumes any character and the next derivation consumes at least one character, then it violates the LTP because there are at least two paths from $r_1$ and $r_2$ to the character. If the next derivation does not consume any character, then the size of $\mathcal{N}$ is 1 and this case holds. $\square$

Lemma A.3. Suppose that a regular expression $r^*$ satisfies the LTP. Then, for the derivation $(r^*, w, p, \Gamma) \leadsto \mathcal{N}$, at most one element in $\mathcal{N}$ can succeed in the matching.

Proof. If $r$ fails the matching, then the assertion holds because the size of $\mathcal{N}$ is one. Thus, we assume that $r$ succeeds in the matching. Let $(r, w, p, \Gamma) \leadsto \mathcal{N}'$ and let $n$ be $|\mathcal{N}' \setminus \{(p, \Gamma)\}|$. We proceed by case analysis on $n$.
• If $n = 0$, then $|\mathcal{N}| = 1$. Thus, this case holds.
• If $n \ge 1$, the element that has the largest position can succeed and the other elements fail the matching without consuming any character. We prove this by contradiction. Let the element with the largest position be $(p_1, \Gamma_1) \in \mathcal{N}'$, i.e., $\forall (p', \Gamma') \in \mathcal{N} \setminus \{(p_1, \Gamma_1)\}$, $p' < p_1$. If another element $(p', \Gamma')$ succeeds in the matching, then there exists a next matching of a regular expression $r'$ and that matching also succeeds. But this means that $r^*$ violates the LTP because $r^*$ and $r'$ consume the same character and there is a path from $r^*$ to $r'$ via $\epsilon$. This contradicts the assumption. $\square$

Lemma A.4. Suppose that $r$ is a regular expression that satisfies the LTP. For the derivation $(r, w, p, \Gamma) \leadsto \mathcal{N}$, at most one element $(p_1, \Gamma_1) \in \mathcal{N}$ is an alive state.

Proof. By induction on the structure of the regular expression $r$.
Basis.
• Case $r = [C]$. The size of $\mathcal{N}$ is at most one. Thus, this case holds.
• Case $r = \epsilon$. The size of $\mathcal{N}$ is 1. Thus, this case holds.
Induction.
• Case $r = r_1 \mid r_2$. By Lemma A.2, we know that for the derivation $(r_1 \mid r_2, w, p, \Gamma) \leadsto \mathcal{N}$, at most one element $(p_1, \Gamma_1) \in \mathcal{N}$ can succeed in the matching. Thus, this case holds.
• Case $r = r_1^*$. By Lemma A.3, we know that for the derivation $(r_1^*, w, p, \Gamma) \leadsto \mathcal{N}$, at most one element $(p_1, \Gamma_1) \in \mathcal{N}$ can succeed in the matching. Thus, this case holds.
We can prove the other cases in the same way. $\square$

Next, we show the third part of the proof. Consider the derivation $(r, w, p, \Gamma) \leadsto \mathcal{N}$ and assume that $(p_1, \Gamma_1)$ is an alive state. Then, the alive state $(p_1, \Gamma_1) \in \mathcal{N}$ succeeds or fails the matching within a derivation of at most linear size. To this end, we show that the size of the derivations from the alive state $(p_1, \Gamma_1)$ is at most linear. We first present some lemmas relating to derivations that do not consume any character. Then, we show the third part using these lemmas.

Lemma A.5. Suppose that $r$ is a regular expression that satisfies the LTP. If the derivation $(r, w, p, \Gamma)$ does not consume any character, then the size of the derivation is at most the size of the regular expression $|r|$.

Proof. If $L(r) \ne \{\epsilon\}$ and $L(r) \ne \emptyset$, then the assertion holds. If $L(r) = \mathcal{N}$ and $|\mathcal{N}| \ge \ldots$, we prove this by contradiction. If the size of the derivation is more than $|r|$, then the derivation has to call the same derivation $(r', w, p, \Gamma')$ at least two times. This violates the LTP because there are at least two paths to the subexpression $r'$. Thus, this contradicts the assumption. $\square$

Lemma A.6. Suppose that $r$ is a regular expression that satisfies the LTP. For the derivation $(r, w, p, \Gamma) \leadsto \mathcal{N}$, if $(p_1, \Gamma_1) \in \mathcal{N}$ and $(p_2, \Gamma_2) \in \mathcal{N}$ and $p_1 = p_2$, then $\Gamma_1 = \Gamma_2$.

Proof. We prove this by contradiction. If there exist $(p_1, \Gamma_1) \in \mathcal{N}$ and $(p_2, \Gamma_2) \in \mathcal{N}$ with $p_1 = p_2$ and $\Gamma_1 \ne \Gamma_2$, there exists a capture $(i, x) \in \Gamma_1 \setminus \Gamma_2$. Since $\Gamma_1$ and $\Gamma_2$ are obtained from the derivation $(r, w, p, \Gamma)$, $r$ contains two expressions $r_x$ and $(r'_x)_i$ as subexpressions, and they consume $x$. However, this means that $r$ violates the LTP because there are at least two paths from $r_x$ and $(r'_x)_i$ to $x[0]$. This contradicts the assumption. $\square$

Lemma A.7. Suppose that $r$ is a regular expression that satisfies the LTP, and $(p_1, \Gamma_1)$ is an alive state. Then, the size of the derivation $(r, w, p_1, \Gamma_1) \leadsto \mathcal{N}$ is linear.

Proof. By Lemma A.6, the element in $\mathcal{N}$ is unique with respect to the position $p$. By Lemma A.5, the derivation increases the position $p$ after at most $|r|$ steps. Thus, the size of the derivation is linear because the matching finishes within at most $|r| \times |w|$ steps. $\square$

Finally, we show the last part of the proof, i.e., a dead state $(p_2, \Gamma_2) \in \mathcal{N} \setminus \{(p_1, \Gamma_1)\}$, where $(p_1, \Gamma_1)$ is an alive state, fails the matching within a derivation of at most constant size.

Lemma A.8. Suppose that $r$ is a regular expression that satisfies the LTP, and $(p_2, \Gamma_2)$ is a dead state. Then, the size of the derivation $(r, w, p_2, \Gamma_2) \leadsto \mathcal{N}$ is constant.

Proof. Recall that the only rules that have an ambiguity are (Union) and (Repetition), and this means that dead states are obtained only by these rules. If the derivation $(r, w, p_2, \Gamma_2)$ consumes a character, then it violates the LTP, which we can prove in the same way as the proofs of Lemma A.2 and ??. Thus, the derivation $(r, w, p_2, \Gamma_2)$ does not consume any character. In that case, by Lemma A.5, the size of the derivation is constant. Hence, the assertion holds. $\square$

Consequently, we derive the following theorem.

Theorem A.9. A regular expression that satisfies the LTP runs in linear time in the worst case even if a regular expression engine employs the algorithm shown in Table 2.

Proof. Suppose that $r$ is a regular expression that satisfies the LTP. We show that the size of the derivation $(r, w, 0, \emptyset) \leadsto \mathcal{N}$ is linear, by case analysis on a tuple $(p, \Gamma)$. If $(p, \Gamma)$ is the only element that can succeed in the matching, then the size of the derivation $(r, w, p, \Gamma)$ is linear by Lemma ??. If $(p, \Gamma)$ is any other element, then the size of the derivation $(r, w, p, \Gamma)$ is constant. Hence, the size of the derivation $(r, w, 0, \emptyset) \leadsto \mathcal{N}$ is linear, and this means that the regular expression $r$ runs in linear time. $\square$

ACKNOWLEDGMENTS