Capture-Avoiding and Hygienic Program Transformations (incl. Proofs)
Sebastian Erdweg, Tijs van der Storm, and Yi Dai
TU Darmstadt, Germany; CWI, Amsterdam, The Netherlands; INRIA Lille, France; University of Marburg, Germany
Abstract.
Program transformations in terms of abstract syntax trees compromise referential integrity by introducing variable capture. Variable capture occurs when in the generated program a variable declaration accidentally shadows the intended target of a variable reference. Existing transformation systems either do not guarantee the avoidance of variable capture or impair the implementation of transformations.

We present an algorithm called name-fix that automatically eliminates variable capture from a generated program by systematically renaming variables. name-fix is guided by a graph representation of the binding structure of a program, and requires name-resolution algorithms for the source language and the target language of a transformation. name-fix is generic and works for arbitrary transformations in any transformation system that supports origin tracking for names. We verify the correctness of name-fix and identify an interesting class of transformations for which name-fix provides hygiene. We demonstrate the applicability of name-fix for implementing capture-avoiding substitution, inlining, lambda lifting, and compilers for two domain-specific languages.

Program transformations find ubiquitous application in compiler construction to realize desugarings, optimizers, and code generators. While traditionally the implementation of compilers was reserved for a select few experts, the current trend of domain-specific and extensible programming languages exposes developers to the challenges of writing program transformations. In this paper, we address one of these challenges: capture avoidance.

A program transformation translates programs from a source language to a target language. In doing so, many transformations reuse the names that occur in a source program to identify the corresponding artifacts generated in the target program. For example, consider the compilation of a state machine to a simple procedural language as illustrated in Figure 1.
The state machine has three states opened, closed, and locked. For each state the compiler generates a constant integer function with the same name. Furthermore, for each state the compiler generates a dispatch function that takes an event and, depending on the event, returns the subsequent state. For example, the dispatch function for opened tests if the given event is close and either yields the integer constant representing the following state closed or a dynamic error. Finally, the compiler generates a main dispatch function that calls the dispatch function of the current state.

(a) Door state machine.

    1  state opened
    2    close => closed
    3
    4  state closed
    5    lock => locked
    6    open => opened
    7
    8  state locked
    9    unlock => closed

(b) Program generated for the door state machine.

     1  fun opened() = 0;
     2  fun closed() = 1;
     3  fun locked() = 2;
     4  fun opened-dispatch(event) =
     5    if (event == "close") then closed() else error();
     6  fun closed-dispatch(event) =
     7    if (event == "open") then opened()
     8    else if (event == "lock") then locked() else error();
     9  fun locked-dispatch(event) =
    10    if (event == "unlock") then closed() else error();
    11  fun main-dispatch-next-event(state, event) =
    12    if (state == opened()) then opened-dispatch(event)
    13    else if (state == closed()) [...];

Fig. 1. Many transformations reuse names from the source program in generated code.

A naive implementation of such a compiler is easy to write, but it also runs the risk of introducing variable capture. For example, if we consistently rename the state locked to opened-dispatch as shown in Figure 2(a), we expect the compiler to produce code that behaves the same as the code generated for the state machine without renaming. However, a naive compiler blindly copies the state names into the generated program, which leads to the incorrect code shown in Figure 2(b): The function definition on line 4 shadows the constant function on line 3 and thus captures the variable reference opened-dispatch on line 8 (we assume there is no overloading). For the example shown, the problem is easy to fix by renaming the dispatch function on line 4 and its reference on line 12 to a fresh name opened-dispatch-0. However, a general solution is difficult to obtain. Existing approaches either rely on naming conventions and fail to guarantee capture avoidance, or they require a specific transformation engine and affect the implementation of transformations.

(a) Consistently renamed door state machine.

    1  state opened
    2    close => closed
    3
    4  state closed
    5    lock => opened-dispatch
    6    open => opened
    7
    8  state opened-dispatch
    9    unlock => closed

(b) Program generated for the renamed door state machine is incorrect: Variable capture of opened-dispatch.

     1  fun opened() = 0;
     2  fun closed() = 1;
     3  fun opened-dispatch() = 2;
     4  fun opened-dispatch(event) =
     5    if (event == "close") then closed() else error();
     6  fun closed-dispatch(event) =
     7    if (event == "open") then opened()
     8    else if (event == "lock") then opened-dispatch() else ...
     9  fun opened-dispatch-dispatch(event) =
    10    if (event == "unlock") then closed() else error();
    11  fun main-dispatch-next-event(state, event) =
    12    if (state == opened()) then opened-dispatch(event)
    13    else if (state == closed()) [...];

Fig. 2. Variable capture can occur when original and synthesized names are mixed.

We propose a generic solution called name-fix that guarantees capture avoidance and does not affect the implementation of transformations. name-fix compares the name graph of the source program with the name graph of the generated program to identify variable capture. If there is variable capture, name-fix systematically and globally renames variable names to differentiate the captured variables from the capturing variables, while preserving intended variable references among original variables and among synthesized variables, respectively. name-fix requires name analyses for the source and target languages, which often exist or are needed anyway (e.g., for editor services, error checking, or refactoring), and hence can be reused. name-fix treats transformations as a black box and is independent of the used transformation engine as long as it supports origin tracking for names [26].

name-fix enables developers of program transformations to focus on the actual translation logic and to ignore variable capture. In particular, name-fix enables developers to use simple naming schemes for synthesized variables in the transformation and to produce intermediate open terms. For example, in Figure 1, we append "-dispatch" to a state's name to derive the name of the corresponding dispatch function. This construction occurs at two independent places in the transformation: when generating a dispatch function for a state, and when generating the main dispatch function. The connection between these is only established when assembling all parts of the generated program in the final step of the transformation. Using name-fix, it is safe to apply global naming schemes with intermediate open terms to associate generated variable references and declarations. Transformations of this kind fall into the class of transformations for which name-fix guarantees hygiene, that is, α-equivalent source programs are always mapped to α-equivalent target programs.

In summary, we make the following contributions:
– We studied 9 existing DSL implementations that use transformations and found that 8 of them were prone to variable capture.
– We present name-fix, an algorithm that automatically eliminates variable capture from the result of a program transformation.
– We state and verify termination and correctness properties for name-fix and show that name-fix produces α-equivalent programs for programs that are equal up to consistent but possibly capturing renaming.
– We propose a notion of hygienic transformations and identify an interesting class of transformations for which name-fix provides hygiene.
– We present an implementation of name-fix in the metaprogramming system Rascal.
  Our implementation supports capture avoidance for transformations that generate code as syntax trees or as strings.
– We demonstrate the applicability of name-fix in a wide range of scenarios: for capture-avoiding substitution, for optimization (function inlining), for desugaring of language extensions (lambda lifting), and for code generation (compilation of DSLs for state machines and for digital forensics).

Capture avoidance is best known from capture-avoiding substitution: When substituting an expression e_2 under a binder as in λx. (e_1[y := e_2]), variable x may not occur free in e_2; otherwise the original binding of x in e_2 would be shadowed by the λ. To implement capture-avoiding substitution, we must rename x to a fresh variable α ∉ {y} ∪ FV(e_1) ∪ FV(e_2) to avoid the capture: λα. (e_1[x := α][y := e_2]). Ensuring capture avoidance is already relatively complicated for substitution in the λ-calculus. For larger languages and more complex program transformations, ensuring capture avoidance is a non-trivial and error-prone task.

To better understand the relevance of the problem of variable capture, we studied implementations of a DSL for questionnaires in 10 state-of-the-art language workbenches in the context of the Language Workbench Challenge 2013 [9]. The questionnaire DSL features named declarations of questions and named definitions of derived values. 9 of the 10 language workbenches translate a questionnaire into a graphical representation using either Java or HTML with CSS and JavaScript as target language. One workbench uses interpretation instead of transformation. In most cases, the implementation of the DSL was conducted by the developers of the workbench themselves.

The result of our study is shocking: The DSL implementations in 8 of the 9 language workbenches that use transformations fail to address capture avoidance and produce incorrect code even for minimal changes to the definition of a questionnaire.
For example, some implementations fail when a question name is changed to container, questions, or SWTUtils, because these names are implicitly reserved for synthesized variables. Other implementations of the DSL use naming schemes similar to the one we illustrated in the state-machine example. If there is already a question called Q, these implementations fail when naming another question QBlock, calculated_Q, or grp_Q. Some of the variable captures result in compile-time errors of the generated Java code, others result in misbehaved code that, for example, silently skips some of the questions when storing answers persistently. Debugging such errors typically requires investigation of the generated code and can be very time-consuming.

Of the studied DSL implementations, only the transformation implemented in Más addressed variable capture. It uses global name mappings to generate

We studied all workbenches of the previous study [9]: Ensō, Más, MetaEdit+, MPS, Onion, Rascal, Spoofax, SugarJ, the Whole Platform, and Xtext.
The goal of this work is to provide a mechanism that avoids variable capture in code that is generated by program transformations. To this end, we seek a mechanism that satisfies the following design goals:

G1: Preserve reference intent: If a reference from the source program occurs in the target program, then the original declaration must also occur in the target program and the reference is still bound by it. In other words, source-program variables may neither be captured by synthesized declarations nor by other source-program declarations.
G2: Preserve declaration extent: If a declaration from the source program occurs in the target program, then only source-program references may be bound by it. In other words, synthesized variable references may not be captured by source-program declarations.
G3: Noninvasive: Avoidance of variable capture should not impact the readability of generated code. This is important in practice, where the generated code is often manually inspected when debugging a program transformation. In particular, a generated program should be left unchanged if it does not contain variable capture.
G4: Language-parametric: It should be possible to eliminate variable capture from virtually all source and target languages that feature static name resolution.
G5: Transformation-parametric: The mechanism should work with different transformation engines and should not impose a specific style of transforming programs. Ideally, the mechanism supports existing transformations unchanged.

In the following sections, we present our solution name-fix. It fully achieves the first three goals. In addition, name-fix is language-parametric provided the name analyses of source and target language satisfy modest assumptions. Finally, name-fix works with any transformation engine that provides origin tracking [26] for variable names, so that names originating from the source program can be distinguished from names synthesized by the transformation.
The core idea of our solution is to provide a generic mechanism for the detection and elimination of variable capture based on name graphs of the source and target program. We use the term name for the string-valued entity that occurs in the abstract syntax tree of a program. Naturally, the same name may occur at multiple locations of a program. To distinguish different occurrences of the same name, we assume names are labeled with a variable ID. In source programs, such IDs are unique. However, for target programs generated by some transformation, we do not require that variable IDs are unique, because the transformation may have copied and duplicated names from the input program to the output program.

We write x^v to denote that name x is labeled with variable ID v, and we write p@v to retrieve from program p the name corresponding to variable ID v. Nodes that share the same ID must have the same name so that p@v is uniquely determined. The nodes of a name graph are the variable IDs that occur in a program and the edges connect references to the corresponding declarations.

Definition 1.
The name graph of a program p is a pair G = (V, ρ) where V is the set of variable IDs in p (references and declarations), ρ ∈ V → V is a partial function from references to declarations, and if ρ(v_r) = v_d, then reference and declaration have the same name p@v_r = p@v_d.

Fig. 3. Name graph of the state machine in Figure 1(a), with nodes 1, 2, 4, 5, 6, 8, and 9.
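As a minimal sketch of Definition 1, the name graph of the door state machine can be written down as plain Python data. The dictionary layout and the helper `is_name_graph` are illustrative assumptions, not part of the original work; the IDs are the line numbers of Figure 1(a), and the edges are read off that listing (each transition target refers to the state declaration of the same name).

```python
# p@v: the name that occurs at each variable ID (IDs are source line numbers)
name_at = {1: "opened", 2: "closed", 4: "closed", 5: "locked",
           6: "opened", 8: "locked", 9: "closed"}

# V: all variable IDs, references and declarations alike
V = set(name_at)

# rho: partial function from reference IDs to declaration IDs
rho = {2: 4, 5: 8, 6: 1, 9: 4}


def is_name_graph(V, rho, name_at):
    """Check the two conditions of Definition 1 (hypothetical helper)."""
    # every edge must connect IDs that occur in the program
    if not all(r in V and d in V for r, d in rho.items()):
        return False
    # a reference and its declaration must carry the same name
    return all(name_at[r] == name_at[d] for r, d in rho.items())


print(is_name_graph(V, rho, name_at))     # the graph of Figure 1(a)
print(is_name_graph(V, {2: 1}, name_at))  # "closed" cannot refer to "opened"
```

The second call fails the same-name condition, which is the only constraint Definition 1 places on edges beyond staying inside V.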
For example, Figure 3 displays the name graph of the state machine in Figure 1(a), where we use line numbers as variable IDs: ID 1 represents the declaration of opened, ID 2 represents the reference to closed in the transition on line 2, ID 4 represents the declaration of closed, and so on.

We require that transformations preserve variable IDs when reusing names from the source program in the generated code. For example, when compiling the state machine of Figure 1(a) to the code in Figure 1(b), the compiler reuses the names of state declarations for the declaration of constant functions and for references to these constant functions in the main dispatch. Accordingly, in the generated code, these names must have the same variable ID as in the source program. Essentially, whenever a transformation copies a name from the source program to the target program, the corresponding ID must be copied as well and thus preserved. In contrast, names that are synthesized by the transformation should have fresh variable IDs.
Fig. 4. Names of compiled state machine of Figure 1(b).
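The ID discipline described above can be sketched with a toy compiler. Everything in this block is an assumption made for illustration: the `Name` record, the tuple-based output format, and the counter-based ticked IDs (the figures use ticked target line numbers instead). The sketch only shows that reused names keep their source ID while synthesized names receive fresh IDs.

```python
from itertools import count
from typing import NamedTuple


class Name(NamedTuple):
    text: str  # the string that occurs in the program
    vid: str   # variable ID; synthesized IDs start with a tick


def compile_machine(states):
    """states: list of (state declaration, [(event, target-state reference)])."""
    counter = count(1)

    def fresh():
        return f"'{next(counter)}"  # fresh ID for a synthesized name

    funs = []
    for decl, _ in states:
        # constant function: reuses the state name, so the source ID is copied
        funs.append(("const", Name(decl.text, decl.vid)))
    for decl, transitions in states:
        # dispatch function: the name is synthesized, so it gets a fresh ID
        dispatch = Name(decl.text + "-dispatch", fresh())
        cases = [(event, Name(target.text, target.vid))
                 for event, target in transitions]
        funs.append(("dispatch", dispatch, cases))
    main_cases = [(Name(decl.text, decl.vid),
                   Name(decl.text + "-dispatch", fresh()))
                  for decl, _ in states]
    funs.append(("main", Name("main-dispatch-next-event", fresh()), main_cases))
    return funs


def names_in(node):
    """Yield every Name that occurs in the generated structure."""
    if isinstance(node, Name):
        yield node
    elif isinstance(node, (tuple, list)):
        for child in node:
            yield from names_in(child)


# The renamed door state machine of Figure 2(a); IDs are source line numbers.
door = [
    (Name("opened", "1"), [("close", Name("closed", "2"))]),
    (Name("closed", "4"), [("lock", Name("opened-dispatch", "5")),
                           ("open", Name("opened", "6"))]),
    (Name("opened-dispatch", "8"), [("unlock", Name("closed", "9"))]),
]
target = compile_machine(door)
clash = {n.vid for n in names_in(target) if n.text == "opened-dispatch"}
print(sorted(clash))  # source IDs 5 and 8 share a name with synthesized IDs
```

Because the name opened-dispatch now occurs under both source IDs and ticked IDs, a name analysis of the target program can bind a source reference to a synthesized declaration, which is the situation the IDs make detectable.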
For example, Figure 4 shows the name graph of the compiled state machine (we left out nodes of function parameters event and state for clarity). We use line numbers from the source program as variable IDs for reused variables, and ticked line numbers of the target program as variable IDs for synthesized variables. In addition, we depict nodes of synthesized variables with a darker background color. We have cycles in the name graph for source nodes 1, 4, and 8 because the transformation duplicated the names at these labels to generate constant functions and references to these constant functions.

One important property of the name graph in Figure 4 is that the source nodes are disconnected from the synthesized nodes, and all references from the original name graph in Figure 3 have been preserved. In contrast, consider the name graph in Figure 5 that displays the result of compilation after renaming state locked to opened-dispatch as in Figure 2(b). The graph illustrates that a source variable has been captured (dashed arrow) during compilation: The variable at line 5 of the source program was intended to point to the state declared at line 8, but after compilation it points to the dispatch function at line 4 of the synthesized program.

Fig. 5. Variable capture (dashed arrow) in the code of Figure 2(b).
Our solution identifies variable capture by comparing the original name graph of the whole program with the name graph of the generated code. Function find-capture in Figure 6 computes the set of edges that witness variable capture. In the state-machine example, find-capture finds only one edge (5 ↦ '4) as part of notPresrvRef1. We discuss the precise definition of variable capture in the subsequent section.

If there are witnesses of variable capture, our solution computes a variable renaming that has two properties. First, for each witness of variable capture, the renaming renames the capturing variable to eliminate the witness. Second, the renaming ensures that intentional references to the capturing variable are renamed as well. This can be difficult because the name graph of the generated code is inaccurate due to variable capture. Therefore, our solution conservatively approximates the set of potential references by including all synthesized variables of the same name. Function comp-renaming in Figure 6 computes the renaming as a function from a variable ID to the variable's fresh name, computed by gensym. For the example, we get π_src = ∅ because '4 ∉ V_s, and π_syn = {'4 ↦ "opened-dispatch-0", '12 ↦ "opened-dispatch-0"} because t@'4 = t@'12. Function rename in Figure 6 visits all nodes in a syntax tree (represented as s-expression) and applies the renaming π to variables with the corresponding IDs. For the example, the renaming yields a capture-free program with the same name graph as shown in Figure 4.

Function name-fix in Figure 6 brings it all together and is the main entry point of our solution. It takes the name graph of the source program and the generated target program as input. First, it computes the name graph of the target program using the function resolve_T that we assume to provide name resolution for the target language T. name-fix then calls find-capture to identify variable capture.
If find-capture finds no capturing edges, name-fix returns the generated program unchanged. Otherwise, name-fix calls comp-renaming and rename to compute and apply the renaming that eliminates the witnessed variable capture. Since the name graph G_t of t may be inaccurate due to variable capture, name-fix recursively calls itself to repeat the search for and potential repair of variable capture. Note that name-fix applies a closed-world assumption to infer that all unbound variables are indeed free, and thus can be renamed at will.

Syntactic conventions:
    x^v        variable x labeled with variable ID v
    p@v = x    name x that occurs in program p at variable ID v

find-capture((V_s, ρ_s), (V_t, ρ_t)) = {
    notPresrvRef1 = {(v ↦ ρ_t(v)) | v ∈ dom(ρ_t), v ∈ V_s, v ∈ dom(ρ_s), ρ_s(v) ≠ ρ_t(v)};
    notPresrvRef2 = {(v ↦ ρ_t(v)) | v ∈ dom(ρ_t), v ∈ V_s, v ∉ dom(ρ_s), v ≠ ρ_t(v)};
    notPresrvDef  = {(v ↦ ρ_t(v)) | v ∈ dom(ρ_t), v ∉ V_s, ρ_t(v) ∈ V_s};
    return notPresrvRef1 ∪ notPresrvRef2 ∪ notPresrvDef;
}

comp-renaming((V_s, ρ_s), (V_t, ρ_t), t, capture) = {
    π_src = ∅; π_syn = ∅;
    foreach v_d in codom(capture) {
        usedNames = {t@v | v ∈ V_t} ∪ codom(π_src) ∪ codom(π_syn)
        fresh = gensym(t@v_d, usedNames);
        if (v_d ∈ V_s ∧ v_d ∉ π_src)
            π_src = π_src ∪ {(v_d ↦ fresh)} ∪ {(v_r ↦ fresh) | v_r ∈ dom(ρ_s), ρ_s(v_r) = v_d};
        if (v_d ∉ V_s ∧ v_d ∉ π_syn)
            π_syn = π_syn ∪ {(v ↦ fresh) | v ∈ V_t \ V_s, t@v = t@v_d};
    }
    return (π_src, π_syn);
}

rename(t, π) = {
    return t match {
        case x^v if v ∈ dom(π)  => π(v)^v
        case x^v                => x^v
        case c                  => c
        case (t_1 ... t_n)      => (rename(t_1, π) ... rename(t_n, π));
    }
}

name-fix(G_s, t) = {
    G_t = resolve_T(t);
    capture = find-capture(G_s, G_t);
    if (capture == ∅) return t;
    (π_src, π_syn) = comp-renaming(G_s, G_t, t, capture);
    t' = rename(t, π_src ∪ π_syn);
    return name-fix(G_s, t');
}

Fig. 6. Definition of name-fix that guarantees capture-avoidance.
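The functions of Figure 6 can be transcribed into a short Python sketch. It is not the Rascal implementation: terms are nested tuples, `resolve` is a hypothetical lexically scoped name analysis for a toy λ-calculus that stands in for resolve_T, `gensym` simply appends a counter, and the variable IDs in the example term are chosen for illustration only.

```python
from typing import NamedTuple


class Var(NamedTuple):
    name: str
    vid: object  # variable ID; synthesized IDs are ticked strings here


def resolve(t, env=None, graph=None):
    """Toy name analysis: lexical scoping for ("lam", binder, body) terms."""
    env = {} if env is None else env
    V, rho = (set(), {}) if graph is None else graph
    if isinstance(t, Var):
        V.add(t.vid)
        if t.name in env:
            rho[t.vid] = env[t.name]
    elif isinstance(t, tuple) and t[:1] == ("lam",):
        V.add(t[1].vid)
        resolve(t[2], {**env, t[1].name: t[1].vid}, (V, rho))
    elif isinstance(t, tuple):
        for sub in t:
            resolve(sub, env, (V, rho))
    return V, rho


def names_of(t, acc=None):
    """Collect t@v for every variable ID v in t."""
    acc = {} if acc is None else acc
    if isinstance(t, Var):
        acc[t.vid] = t.name
    elif isinstance(t, tuple):
        for sub in t:
            names_of(sub, acc)
    return acc


def gensym(base, used):
    i = 0
    while f"{base}-{i}" in used:
        i += 1
    return f"{base}-{i}"


def find_capture(G_s, G_t):
    (V_s, rho_s), (_, rho_t) = G_s, G_t
    ref1 = {(v, d) for v, d in rho_t.items()
            if v in V_s and v in rho_s and rho_s[v] != d}
    ref2 = {(v, d) for v, d in rho_t.items()
            if v in V_s and v not in rho_s and v != d}
    dfn = {(v, d) for v, d in rho_t.items() if v not in V_s and d in V_s}
    return ref1 | ref2 | dfn


def comp_renaming(G_s, G_t, t, capture):
    (V_s, rho_s), (V_t, _) = G_s, G_t
    names = names_of(t)
    pi_src, pi_syn = {}, {}
    for v_d in sorted({d for _, d in capture}, key=str):
        used = set(names.values()) | set(pi_src.values()) | set(pi_syn.values())
        fresh = gensym(names[v_d], used)
        if v_d in V_s and v_d not in pi_src:
            # rename the source declaration together with its source references
            pi_src[v_d] = fresh
            pi_src.update({r: fresh for r, d in rho_s.items() if d == v_d})
        if v_d not in V_s and v_d not in pi_syn:
            # rename all synthesized variables that share the name
            pi_syn.update({v: fresh for v in V_t - V_s
                           if names[v] == names[v_d]})
    return pi_src, pi_syn


def rename(t, pi):
    if isinstance(t, Var):
        return Var(pi.get(t.vid, t.name), t.vid)
    if isinstance(t, tuple):
        return tuple(rename(sub, pi) for sub in t)
    return t


def name_fix(G_s, t):
    G_t = resolve(t)
    capture = find_capture(G_s, G_t)
    if not capture:
        return t
    pi_src, pi_syn = comp_renaming(G_s, G_t, t, capture)
    return name_fix(G_s, rename(t, {**pi_src, **pi_syn}))


# Two nested binders named x and one synthesized reference x with ID "5'".
t = ("lam", Var("x", 1),
     ("app", ("lam", Var("x", 2), ("app", Var("x", 3), Var("x", "5'"))),
      Var("x", 4)))
G_s = ({1, 2, 3, 4}, {3: 2, 4: 1})
fixed = name_fix(G_s, t)
print(fixed)
```

On this input the sketch needs two rounds: it first renames the inner binder and its reference, then the outer binder and its reference, and leaves the synthesized x free.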
(a) Graph of source program. (b) Graph of generated program t. (c) After renaming the inner x. (d) After renaming the outer x.

Fig. 7. Name graphs during execution of name-fix for t = λx. (λx. x x') x.

In the following, we present examples that illustrate two design choices of name-fix that may be somewhat unintuitive: Why are multiple rounds of renaming required, and why do we rename all synthesized variables of the same name?

For the former property, consider the lambda expression t = λx. (λx. x x') x, where ticks mark the occurrences of synthesized variables. The first graph in Figure 7 shows the original binding structure of the hypothetical source program that t is generated from. The second graph shows the binding structure of t. The synthesized variable x' is captured by the binding of the inner x, which is illegal due to notPresrvDef in find-capture. Accordingly, comp-renaming initiates a renaming of the inner binder x, also renaming its source reference to preserve that reference. This yields expression t' = λx. (λα. α x') x with binding structure as shown in the third graph. Indeed, the inner binder no longer captures x'. However, now the outer x captures x'. Thus, by renaming the outer x and its reference, we get t'' = λβ. (λα. α x') β with capture-free binding structure as shown in the last graph. The iterative renaming was necessary because the name graph of t did not indicate that x' is eventually captured by the outer x. We could have preemptively renamed the outer x together with the inner x, but this contradicts our goal of minimal invasiveness.

To illustrate why name-fix renames all synthesized variables of the same name, consider the expression t_1 = λx'. x (λx. x') in which the synthesized binder x' captures the source reference x and the source binder x captures the synthesized reference x'. Thus, name-fix needs to rename the synthesized binder x' and the source binder x. Because the two occurrences of x' are both synthesized and have the same name, renaming the binder x' entails renaming the reference x' even though they are unrelated in the name graph of t_1. Thus, name-fix yields the correct result t_1' = λα'. x (λβ. α'). To see why the binder x' should bind the reference x', consider what happens had the source program consistently used y in place of x: t_2 = λx'. y (λy. x'). This program has no variable capture and is returned unchanged by name-fix. Since we want the result of name-fix to be invariant under consistent renamings of the source variables, the binder x' must bind the reference x' in both t_1 and t_2. By renaming all synthesized variables of the same name, name-fix ensures that no potential variable reference is truncated.

Both of the above examples also illustrate another point: name-fix does not guarantee valid name binding with respect to the target language. The final result in both examples contains a free variable. Instead, name-fix guarantees that there is no variable capture. We state and verify the precise properties of name-fix in the next section.

Our solution name-fix iteratively eliminates variable capture in a fixed-point computation. In this section we show three important properties of name-fix: name-fix terminates, name-fix eliminates variable capture, and name-fix yields α-equivalent outputs for inputs that are equal up to consistent (but possibly capturing) variable renaming.

We represent programs as s-expressions with constant symbols c, labeled variable names x^v, and compound terms (t_1 ... t_n). We shall frequently require two programs to be equal up to unconditional renaming:

Definition 2.
Two programs are label-equivalent, p_1 ≡_L p_2, iff they are equal up to variable names:

    c_1 ≡_L c_2                          if c_1 = c_2
    x_1^{v_1} ≡_L x_2^{v_2}              if v_1 = v_2
    (t_1 ... t_n) ≡_L (t_1' ... t_n')    if t_i ≡_L t_i' for all 1 ≤ i ≤ n

To simplify our formalization, we do not consider bijective relabeling functions and assume label-equivalence instead. As a first metatheoretical result we state that name-fix terminates.

Theorem 1.
For any name graph G_s and any program t, name-fix(G_s, t) terminates in finitely many steps.

We present our framework for capture-avoiding transformations independent of any concrete source and target languages. Since our technique works on top of name graphs, we require functions resolve_L that compute the name graph of a program of some language L by name analysis. However, instead of requiring a specific form of name analysis, we specify minimal requirements on the behavior of resolve_L that suffice to show our technique is sound. The first assumption states that name analysis must produce a name graph.

Assumption 1.
Given a program p, resolve_L(p) yields the name graph G = (V, ρ) of p according to Definition 1.

The second assumption requires resolve_L to behave deterministically. First, given two programs p_1 and p_2 that are equal up to variable names, names that are references in p_1 must be references in p_2 if the declaration is available (but they can refer to another declaration). Second, given a reference with two potential declarations in p_1 and p_2, resolve_L must deterministically choose one of them.

Proofs of all theorems and additional lemmas appear in Appendix A.

Assumption 2. Let p_1 ≡_L p_2 be label-equivalent with name graphs resolve_L(p_1) = (V, ρ_1) and resolve_L(p_2) = (V, ρ_2).
(i) If ρ_1(v_r) = v_d and p_2@v_r = p_2@v_d, then v_r ∈ dom(ρ_2).
(ii) If ρ_1(v_r) = v_d, ρ_2(v_r) = v_d', p_1@v_d = p_1@v_d', and p_2@v_d = p_2@v_d', then v_d = v_d'.

In addition to these assumptions, we require that the name graph (V, ρ) of the original source program satisfies dom(ρ) ∩ codom(ρ) = ∅. We call such graphs bipartite name graphs. Note that resolve_L often does not produce bipartite name graphs for generated code due to name copying as in Figure 4. We believe our requirements are modest and readily satisfied by name analyses of most languages.

name-fix eliminates variable capture

We define the notion of capture-avoiding transformations in terms of the name graph of the source and target programs, before we show that name-fix can turn any transformation into a capture-avoiding one.

Definition 3.
A transformation f : S → T is capture-avoiding if for all s ∈ S with resolve_S(s) = (V_s, ρ_s) and t = f(s) with resolve_T(t) = (V_t, ρ_t):
1. Preservation of reference intent: For all v ∈ dom(ρ_t) with v ∈ V_s,
   (i) if v ∈ dom(ρ_s), then ρ_s(v) = ρ_t(v),
   (ii) if v ∉ dom(ρ_s), then v = ρ_t(v).
2. Preservation of declaration extent: For all v ∈ dom(ρ_t), if v ∉ V_s, then ρ_t(v) ∉ V_s.

The first condition states that a capture-avoiding transformation must preserve references of the source program. That is, if a variable v occurs in the target program and this reference was bound in the source program, then the target program must provide the same binding for v. In other words, the transformation must preserve the reference intent of the source program's author.

If the source program does not contain v as a bound variable (but maybe as a declaration), v can only refer to itself in the target program. We specifically admit such self-references to allow transformations to duplicate names of source-program declarations in order to introduce additional delegation. For example, our compiler for state machines illustrated in Figure 1 uses names of state declarations to generate constant functions and references to these functions. Note that we also admit duplication of reference names, each of which has the same variable ID and thus must refer to the original declaration.

The second condition states that a capture-avoiding transformation must keep synthesized variable references separate from variables declared in the source program. We consider all variables of the source program V_s to be original and all variables of the target program that do not come from the source program (V_t \ V_s) to be synthesized.
This condition prevents synthesized variable references from being captured by original variable declarations, that is, synthesized variables can only be bound by synthesized declarations.

Function find-capture in Figure 6 implements the test for capture avoidance and collects witnesses in case of variable capture. Since name-fix only terminates when find-capture fails to find variable capture, the correctness of name-fix follows from its termination.

Theorem 2 (Capture avoidance).
Given a transformation f : S → T, name-fix yields a capture-avoiding transformation λs. name-fix(resolve_S(s), f(s)).

α-equivalence and sub-α-equivalence

It is not enough to ensure that name-fix eliminates variable capture, because, for example, a function that returns the empty program would satisfy this property. To ensure the usefulness of name-fix, we need to show that, given two programs that are equal up to possibly capturing renaming, it produces α-equivalent programs (and not just any programs). Two programs are α-equivalent if they are equal up to non-capturing renaming, that is, if they have the same syntactic structure and binding structure.

Definition 4.
Two programs p_1 and p_2 with name graphs resolve_L(p_1) = (V_1, ρ_1) and resolve_L(p_2) = (V_2, ρ_2) are α-equivalent, p_1 ≡_α p_2, iff p_1 ≡_L p_2 and ρ_1 = ρ_2.

Note that p_1 ≡_L p_2 entails V_1 = V_2. As expected, our definition of α-equivalence is independent of the concrete names that occur in the programs. The following examples illustrate our definition of α-equivalence.

    p_1 = λx. (λy. y y) x
    p_2 = λx. (λx. x x) x
    p_3 = λx. (λy. x + y) x
    p_4 = λx. (λx. x + x) x

Our definition correctly identifies p_1 ≡_α p_2, because they are label-equivalent and have the same name graphs. Indeed, p_2 can be derived from p_1 by consistently renaming all occurrences of the bound variable y to x. In contrast, p_3 ≢_α p_4 because the binding structure differs: the first x in the body of the inner abstraction is bound by the outer λ in p_3, but by the inner λ in p_4. All other combinations of the above programs (modulo symmetry of ≡_α) are not α-equivalent because they fail the required label-equivalence. In particular, p_2 ≢_α p_4 in spite of having the same binding structure.

To relate programs that are equal up to possibly capturing renaming, we propose the following notion of sub-α-equivalence.

Definition 5.
Two programs are sub-α-equivalent, p_1 ≡^G_α p_2, under a name graph G = (V, ρ) iff p_1 ≡_L p_2 and, given V_p is the set of labels in p_1 and p_2,
(i) for all v_r, v_d ∈ V_p ∩ V with ρ(v_r) = v_d, p_1@v_r = p_1@v_d ⇔ p_2@v_r = p_2@v_d,
(ii) for all v_r, v_d ∈ V_p \ V, p_1@v_r = p_1@v_d ⇔ p_2@v_r = p_2@v_d.

Sub-α-equivalence compares two programs based on the actual names occurring in them, and not based on the binding structure. The relation is parameterized over a name graph G. The first condition states that for each binding in this graph, p_1 and p_2 need to agree on whether reference and declaration share the same name or not. Even if the reference and declaration have the same name, it does not imply that there is a corresponding binding in either p_1 or p_2, because another declaration can also have this name and capture the reference. The second condition states that for all variables not in G, p_1 and p_2 need to agree on which variable occurrences share names. To illustrate sub-α-equivalence, let us consider G = ({1, 2, 3}, {(2 ↦ 1), (3 ↦ 1)}) and the following programs, in which ticks mark synthesized occurrences:

Sub-α-equivalent to p_1 under G:
    p_1 = λx. (λy'. x + y') x
    p_2 = λz. (λy'. z + y') z
    p_3 = λx. (λz'. x + z') x
    p_4 = λz. (λz'. z + z') z

Not sub-α-equivalent to p_1 under G:
    p_5 = λz. (λy'. x + y') x
    p_6 = λx. (λy'. z + y') x
    p_7 = λx. (λz'. x + y') x
    p_8 = λx. (λy'. x + z') x

The first four programs are sub-α-equivalent to p_1 under G. We have p_1 ≡^G_α p_2 because they agree on the name sharing at variable IDs 1, 2, and 3, which is required because of the bindings in G, and on the name sharing at the two synthesized variable IDs, which is required because these IDs are not in G. Similar analysis shows p_1 ≡^G_α p_3 and p_1 ≡^G_α p_4. Programs p_5 through p_8 are examples that are not sub-α-equivalent to p_1 under G. For p_5 and p_6 the first condition of sub-α-equivalence fails because there is no agreement on the name sharing between the declaration and the reference inside the inner abstraction.
For p_7 and p_8 the second condition fails because there is no agreement on the name sharing at ' and ' .

Note that p ≡^G_α p illustrates that sub-α-equivalence is weaker than α-equivalence because p ≢_α p. In the following subsection we use sub-α-equivalence to characterize programs that name-fix can repair to α-equivalent programs.

name-fix

We now turn to one of the main results of our metatheory: Function name-fix is noninvasive, preserves sub-α-equivalence, and is invariant under consistent (but possibly capturing) renaming of original and synthesized variables, as specified by sub-α-equivalence.

For capture-free programs, name-fix yields the input program unchanged, that is, name-fix is noninvasive:

Theorem 3. For any name graph G_s = (V_s, ρ_s) and any program t with find-capture(G_s, resolve_T(t)) = ∅, name-fix(G_s, t) = t.

Given a bipartite name graph of the source program, name-fix preserves sub-α-equivalence:

Theorem 4. For any bipartite name graph G_s = (V_s, ρ_s) and any program t, name-fix(G_s, t) ≡^{G_s}_α t.

name-fix maps sub-α-equivalent programs to α-equivalent ones:

Theorem 5. For any bipartite name graph G_s = (V_s, ρ_s) and programs t_1 ≡^{G_s}_α t_2, name-fix(G_s, t_1) ≡_α name-fix(G_s, t_2).

In the previous section, we demonstrated that for any transformation f : S → T, name-fix provides a capture-avoiding transformation λs. name-fix(G_s, f(s)). However, for some transformations name-fix yields a transformation that adheres to the stronger property of hygienic transformations.

Definition 6. A transformation f : S → T is hygienic if it maps α-equivalent source programs to α-equivalent target programs: s_1 ≡_α s_2 ⟹ f(s_1) ≡_α f(s_2).

This definition of hygiene for transformations follows Herman's definition of hygiene for syntax macros [10].

Transformations can inspect the names of variables and can generate structurally different code for α-equivalent inputs. For example, a transformation may decide to produce thread-safe accessors for variables with names prefixed by sync_. Accordingly, a consistent renaming from sync_foo to foo in the source program leads to generated programs that are not structurally equivalent, let alone α-equivalent. However, there is an interesting class of transformations for which name-fix provides hygiene:

Definition 7. A transformation f : S → T is sub-hygienic if it maps α-equivalent source programs s_1 ≡_α s_2 to sub-α-equivalent target programs f(s_1) ≡^{G_s}_α f(s_2) under the name graph G_s of s_1 (or s_2).

The class of sub-hygienic transformations includes some common transformation schemes. First, it includes transformations that transform a source program solely based on the program's structure but independent of the concrete variable names occurring in it. In such transformations, synthesized variable names are constant and the same for any source program. Second, for a source language without name shadowing (such as state machines), sub-hygienic transformations include those that derive synthesized variable names using an injective function g : string → string over the corresponding source variable names. For example, in Figure 1, we derived the name of a dispatch function by appending -dispatch to the corresponding state name. In both cases name-fix eliminates all potential variable capture and yields a fully hygienic transformation:

Theorem 6.
For any sub-hygienic transformation f : S → T, the transformation λs. name-fix(G_s, f(s)) is hygienic.

    fun zero() =
    fun succ(x) = let n = in x + n;
    let n = x + in succ(succ(n + x + zero()))
(a) Program with free variable x.

    fun zero() =
    fun succ(x) = let n = in (x + n);
    let n0 = * n + in succ(succ(n0 + * n + zero()))
(b) Result of substituting * n for x.

Fig. 8. name-fix yields a capture-avoiding substitution that renames local variables.

To evaluate the applicability of capture-avoiding program transformation in practice, we have successfully applied name-fix in three different scenarios:
– Optimization: Function inlining via substitution in a procedural language.
– Desugaring of language extensions: Lambda lifting of local functions.
– Code generation: Compilation of state machines and of Derric, an existing DSL for digital forensics, to Java.

We have implemented all case studies in Rascal, a programming language and environment for source code analysis and transformation [13]. The source code of our implementation and of all case studies is available online: http://github.com/seba--/hygienic-transformations .

As described in Section 3, a transformation must preserve variable IDs of the source program when reusing these names in the target program. While it is possible for a developer of a program transformation to manually preserve variable IDs via copying, it is easier and safer if the transformation engine does it automatically. We extended Rascal to preserve variable IDs automatically via a new Rascal feature called string origins [24]. Every string value (captured by the str data type) carries information about its origin. A string can either originate from a parsed text file, from a string literal in a metaprogram, or from a string computation such as concatenation, slicing, or substitution.

String origins allow us to obtain precise offsets and lengths for known substrings (e.g., names) so that it is possible to replace substrings.
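As a rough illustration of how offsets and lengths make such replacements possible, the following Python sketch renames name occurrences inside a generated string. It is not Rascal's string-origin API: the `Origin` record and the function `rename_at_origins` are hypothetical names introduced only for this example, and the offsets are supplied by hand rather than tracked automatically.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Origin:
    """Hypothetical stand-in for origin data: where a name sits in the output text."""
    offset: int
    length: int


def rename_at_origins(text, origins, new_name):
    """Replace every span described by `origins` with `new_name`.

    Spans are processed from right to left so that earlier offsets
    stay valid while the text changes length.
    """
    for o in sorted(origins, key=lambda o: o.offset, reverse=True):
        text = text[:o.offset] + new_name + text[o.offset + o.length:]
    return text


# Rename only the declared local variable, not the constant it is initialised from.
generated = "int current = current;"
local_decl = [Origin(offset=4, length=7)]
renamed = rename_at_origins(generated, local_decl, "current0")
print(renamed)
```

Because the replacement is driven by positions rather than by searching for the text of the name, two occurrences that happen to share a name can be renamed independently.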
We use this feature to support name-fix for transformations that produce a target program as a string instead of an abstract syntax tree. Despite the higher fragility of string-based transformations, they are common in practice. In our case studies, we use string-based transformations to generate Java code.

Substitution and inlining are program transformations that may introduce variable capture. Using name-fix, the definition of capture-avoiding versions of these transformations becomes straightforward because name-fix takes over the responsibility for avoiding variable capture. Figure 8 illustrates the application of capture-avoiding substitution to a program of a simple language with global first-order functions and local let-bound variables. In the example, we use substitution to replace free occurrences of variable x by * n. To prevent capture, our capture-avoiding substitution function renames the locally bound variable n.

Substitution is a program transformation where the source and the target language coincide. Capture-avoiding substitution must retain the binding structure of the original (source) program. Since this requirement is part of our definition of capture-avoiding transformations, we can use name-fix to get a capture-avoiding substitution function from a capturing substitution function. This simplifies the definition of substitution for our procedural language to the following:

    subst(p, x, e) = name-fix(resolve(p), substP(p, x, e));
    substP(p, x, e) = prog([substF(f, x, e) | f ← p.fdefs], [substE(e2, x, e) | e2 ← p.main]);
    substF(fdef(f, params, b), x, e) = fdef(f, params, x in params ? b : substE(b, x, e));
    substE(var(y), x, e) = x == y ? e : var(y);
    substE(let(y, e1, e2), x, e) = let(y, substE(e1, x, e), x == y ? e2 : substE(e2, x, e));
    substE(e1, x, e) = for (Exp e2 ← e1) insert substE(e2, x, e);

Function substP takes a program p and substitutes e for x in all function definitions and expressions of the main routine using substF and substE, respectively. Function substF substitutes e for x in the body of a function only if x does not occur as a parameter name of the function, that is, only if x is indeed free in the function body. Function substE proceeds similarly for let-bound variables. The final case of substE uses Rascal's generic-programming features [13] to provide a default implementation: We substitute e for x in each direct subexpression of e1 and insert the corresponding result in place of the subexpression.

Function subst ensures capture avoidance, but function substP does not: When pushing expression e under a binder, the bound variable may occur free in e, in which case the bound variable should be renamed. By using name-fix, we can omit checking and potentially renaming the bound variable both for function definitions and for let expressions and still get a capture-avoiding substitution function subst that behaves as illustrated in Figure 8.

Inlining of functions is a common program-optimization technique used by compilers. We illustrate our implementation of capture-avoiding inlining in Figure 9. The left column shows a simple program using two logical functions or and and. The central column shows the program after inlining and. Note that our language uses a single namespace for functions and let-bound variables. We avoid capture of the reference to or by renaming the local variable or to or0. The right column shows the result of inlining or in the central program.
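To make concrete why the capturing substitution substE needs a repair step, here is a minimal Python sketch of the same idea on a tuple-encoded expression language. The encoding (`("var", y)`, `("let", y, e1, e2)`, `("num", n)`, and binary operators such as `("add", e1, e2)`), the helper `free_vars`, and the example program are assumptions made for this illustration; the sketch covers only the capturing step and does not implement name-fix.

```python
def subst_e(exp, x, e):
    """Capturing substitution of e for free occurrences of x (binders are never renamed)."""
    tag = exp[0]
    if tag == "var":
        return e if exp[1] == x else exp
    if tag == "num":
        return exp
    if tag == "let":
        _, y, e1, e2 = exp
        # Stop below a binder for x itself; otherwise push e under the binder unchecked.
        body = e2 if y == x else subst_e(e2, x, e)
        return ("let", y, subst_e(e1, x, e), body)
    # Default case: substitute in every direct subexpression.
    return (tag,) + tuple(subst_e(sub, x, e) for sub in exp[1:])


def free_vars(exp):
    """Set of variable names that occur free in exp."""
    tag = exp[0]
    if tag == "var":
        return {exp[1]}
    if tag == "num":
        return set()
    if tag == "let":
        _, y, e1, e2 = exp
        return free_vars(e1) | (free_vars(e2) - {y})
    return set().union(*(free_vars(sub) for sub in exp[1:]))


# let n = 1 in x + n, with x replaced by an expression that mentions a free n
program = ("let", "n", ("num", 1), ("add", ("var", "x"), ("var", "n")))
replacement = ("mul", ("num", 2), ("var", "n"))
result = subst_e(program, "x", replacement)

print(free_vars(replacement))  # n is free in the replacement
print(free_vars(result))       # n is no longer free: it was captured by the let
```

In the result the free n of the replacement ends up under the let that binds n, which is the kind of capture that a subsequent renaming of the binder is meant to undo.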
The local variable tmp in the definition of or is renamed to tmp0 since otherwise it would capture the reference to the variable tmp of the main body.

Based on our implementation of substitution, we can easily implement inlining by calling substE to substitute all arguments of a function call into the body of the function. As for substitution, it suffices to call name-fix after function inlining is complete. Intuitively, this is because name-fix only renames bound variables, which are ignored by substE anyway. A detailed investigation of when to call name-fix is part of our future work.

    fun or(x, y) = let tmp = x in if tmp == then y else tmp;
    fun and(x, y) = !or(!x, !y);
    let or = in let tmp = in and(or, tmp)
(a) Original program.

    fun or(x, y) = ...;
    fun and(x, y) = ...;
    let or0 = in let tmp = in !or(!or0, !tmp)
(b) First inline function and.

    fun or(x, y) = ...;
    fun and(x, y) = ...;
    let or0 = in let tmp = in let tmp0 = !or0 in if tmp0 == then !tmp else tmp0
(c) Then inline function or.

Fig. 9. Capture-avoiding function inlining is similar to hygienic macro expansion.

Language extensions augment a base language with additional language features. Many compilers first desugar a source program to a core language. Extensible languages like SugarJ [8] enable regular programmers to define their own extensions via custom desugaring transformations. Such desugaring transformations should preserve the binding structure of the source program. In fact, the lack of capture-avoiding and hygienic transformations in extensible languages was a major motivation of this work.

As an example, to show that name-fix supports language extensions, we implemented an extension of our procedural language for local function definitions that we desugar by lifting them into the global toplevel function scope [12]. The left column of Figure 10 shows an example usage of the extension, where we have a global function f that is shadowed by a local function f, which is used in another local function g. When lifting the two local functions, we get two toplevel functions named f, where the originally local f captures a call to the originally global f in the definition of y. Accordingly, name-fix renames the lifted function f and its calls, both in the main program and the lifted version of g.

We implement lambda lifting by recursively (i) finding local functions, (ii) adapting calls to the local function to pass along variables that occur free in the function body, and (iii) lifting the function definition to the toplevel. To identify calls of a local function, we use the name graph of the non-lifted program. A single call to name-fix after desugaring suffices to eliminate potential name shadowing between functions in the toplevel function scope.

    fun f(x) = x +
    let y = f(10) in
    let fun f(x) = f(x + y) in
    let fun g(x) = f(y + x + in
    f(1) + g(3)
(a) Example with local functions f and g.

    fun f(x) = x +
    fun f0(x, y) = f0(x + y, y);
    fun g(x, y) = f0(y + x + 1, y);
    let y = f(10) in f0(1, y) + g(3, y)
(b) Desugaring of local functions.

Fig. 10. Lambda lifting of local functions f and g requires renaming to avoid capture.

    list[FDef] compile(list[State] states) =
      map(state2const, states) + map(state2dispatch, states) + mainDispatch(states)
    FDef state2const(State s, int i) =
      fdef(s.name, [], val(nat(i)));
    FDef state2dispatch(State s) =
      fdef( "

Fig. 11. Implementation of compiler from state machines to our procedural language.
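The naming conventions used by this compiler (state names become constants, and state names suffixed with -dispatch become dispatch functions) can be sketched in Python as follows. The helpers `declared_names` and `clashing_names` and the example state lists are hypothetical and serve only to show how a purely name-based convention can produce two declarations with the same name; the sketch neither reproduces the Rascal compiler nor implements name-fix.

```python
def declared_names(states):
    """Names declared for a list of state names under the sketched conventions."""
    constants = list(states)                            # one constant per state
    dispatchers = [name + "-dispatch" for name in states]  # one dispatch function per state
    return constants + dispatchers


def clashing_names(names):
    """Names that are declared more than once."""
    seen, clashes = set(), set()
    for name in names:
        if name in seen:
            clashes.add(name)
        seen.add(name)
    return clashes


door = ["opened", "closed", "locked"]
print(clashing_names(declared_names(door)))   # no clash for ordinary state names

# A state whose name looks like a generated dispatch-function name collides with it.
odd = ["opened", "opened-dispatch"]
print(clashing_names(declared_names(odd)))
```

With a single namespace for constants and functions, such a clash means that a generated reference may resolve to the wrong declaration, which is the situation a repair by renaming is intended to handle.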
In Section 1, we introduced a language for state machines to illustrate the problem of inadvertent capture in program transformation. The name-fix algorithm can be used to repair the result of the transformation without changing the transformation itself. As a result, developers can structure transformations in almost arbitrary ways. In the case of the state-machine compiler, a simple naming convention suffices to link generated references to declarations. In our case study, the conventions are that state names become constants and state names suffixed with -dispatch become dispatch functions.

We believe the increased liberty of using naming conventions simplifies the implementation of program transformations. We illustrate the main part of the compiler of state machines to our procedural language in Figure 11. In contrast to approaches based on explicit binders such as HOAS [18] or FreshML [22], generated references do not have to literally occur below their binders in the transformation itself. For example, function compile independently generates state constants, state dispatch functions, and the main dispatch function (by mainCond), even though the main dispatch function refers to both generated constants and state dispatch functions via naming conventions.

    state current
      close => closed
    end
    state closed
      open => current
      lock => token
    end
    state token
      unlock => closed
    end
(a) Renamed door state machine.

    public class Door {
      final int current = 0, closed = 1, token =
      void run(...) {
        int current0 = current;
        String token0 = null;
        while ((token0 = input.nextLine()) != null) {
          if (current0 == current) {
            if (close(token0)) current0 = closed;
            else continue;
          }
          if (current0 == closed) {
            if (open(token0)) current0 = current;
            else if (lock(token0)) current0 = token;
            else continue;
          }
          if (current0 == token) {
            if (unlock(token0)) current0 = closed;
            else continue;
          }
        }
      }
    }
(b) Renaming of local variables current and token to preserve the references of the state machine (exemplarily highlighted).

Fig. 12. Application of name-fix for generated Java code with JDT name resolution.

Compilation to Java.
To exercise capture-avoiding transformation in a more realistic setting, we also applied name-fix to the result of compiling state machines to Java. To obtain a name graph for Java, we used Rascal's M3 source code model of Java, which provides accurate name and type information extracted from the Eclipse JDT [11]. The compiler from state machines to Java generates Java code as structural strings (cf. Section 6.1). It generates a constant for each state and a single dispatch loop in a run method.

We illustrate the application of the compiler and the use of name-fix on the generated Java code in Figure 12. The left column shows the state machine from Figure 1(a) where we consistently renamed states opened and locked to current and token, respectively. The right column shows the compiled Java program. Since the dispatch loop in run uses current to store the current state and token to save the last-read token, the compilation introduces variable capture. Note that even without using name-fix, the generated code compiles fine but is ill-behaved because current == current in the first if would always succeed. name-fix repairs the variable capture by renaming the local variables. This case study shows that name-fix and our implementation are not limited to simple languages, but are applicable for generating capture-free programs of languages like Java.

Derric

Derric is a domain-specific language for describing (binary) file formats [25]. Such descriptions are used in digital forensic investigations to recover evidence from (possibly damaged) storage devices. Derric descriptions consist of two parts. The first part describes the high-level structure of a file format by listing sequence constraints on basic building blocks (called structures) of a file. The second part describes each structure by declaring fields, their type, and inter-structure data dependencies. From these descriptions, the Derric compiler generates high-performance validators in Java that check whether a byte sequence matches the declared format.

    format Bad
    sequence S1 S2
    structures
    S1 { x: 0x0; y: S2.x; }
    S2 { x; }
(a) A Derric format.

    public class Bad {
      private long x;
      private boolean S1() {
        markStart();
        long x0 = ...; ValueSet vs2 = ...;
        vs2.addEquals(0);
        if (!vs2.equals(x0)) return noMatch();
        long y = ...; ValueSet vs5 = ...;
        vs5.addEquals(x);
        if (!vs5.equals(y)) return noMatch();
        addSubSequence("S1");
        return true;
      }
      ...
    }
(b) The local variable shadows the field and must be renamed.

Fig. 13. name-fix eliminates variable capture for existing DSL compiler of Derric.

We show a minimalist, artificial Derric format description in the left column of Figure 13. The format declares two structures (S1 and S2), which must occur in sequence. S1 contains two fields: x, which must be 0, and y, which should be equal to field x of S2, which is not further constrained. We show an excerpt of the code generated by the Derric compiler in the right column of Figure 13. The main issue is in method S1, which handles format recognition of structure S1. Field x, which Derric uses to communicate S2's field x to method S1, is shadowed by the local variable x, which corresponds to S1's field x. Without going into too much detail, it is instructive to note that the Java code compiles fine even without any renaming, but it behaves incorrectly: Instead of checking S1.y = S2.x, it checks S1.y = S1.x. Such a scenario occurs whenever two structures have a field of the same name and one structure accesses this field of the other structure in a constraint. name-fix restores correctness by consistently renaming the local variable in case of capture.

The Derric case study illustrates the flexibility and power of name-fix. Derric is a real-world DSL compiling to a mainstream programming language (Java). The compiler consists of multiple transformations for desugaring and optimization. The result of these transformations is an intermediate model of a validator, which is then pretty printed to Java. Nevertheless, we did not have to modify the Derric compiler in any significant way to be able to repair inadvertent captures, nor was the compiler designed with name-fix in mind. This shows that our approach is readily applicable in realistic settings.

We reflect on the problem statement of this work, explain how name-fix supports breaking hygiene, and point out open issues and future work.

Problem statement.
In Section 2.2, we postulated five design goals for name-fix, all of which it satisfies. In Section 4, we have verified that name-fix preserves reference intent (G1) and declaration extent (G2) of the source program. Moreover, we have established an equivalence theory for name-fix that at least supports noninvasiveness (G3). In the previous section, we have shown how name-fix can be applied in a wide range of scenarios using different languages: state machines, a simple procedural language, Derric, and Java. These results support our claim that capture elimination with name-fix is language-parametric (G4).

Although the case studies are all implemented in Rascal, any transformation engine that propagates the unique labels of names is suited for name-fix. Similar to our encoding, one could easily imagine representing names as tagged strings Name = (String,Int). A structural representation of strings or compound identifiers is not necessary. Moreover, we do not require that transformations are written in any specific style to support capture elimination. In particular, our transformations make use of sophisticated language features such as intermediate open terms or generic programming. We conclude that a mechanism like name-fix is transformation-parametric and realizable in other transformation engines (G5).

Breaking hygiene. Some transformations require that source programs refer to names synthesized by a transformation. Such breaking of hygiene often occurs with implicitly declared variables. In other words, intended capture implies that there is a source reference that is not bound by a declaration in the source program. Consider anaphoric conditionals, which are like normal if-expressions but allow reference to the result of the condition using a special variable it [1]. For instance, in the expression aif c then !it else it, the variable it implicitly refers to a local variable generated by the desugaring of aif. Applying name-fix, however, resolves the capture, which in this case is intended: let it0 = c in if it0 then !it else it. To break hygiene in such cases, the transformation must mark the source occurrences of it when they are carried over to the result: aif(c, t, e) ⇒ let("it", c, cond(var("it"), mark("it", t), mark("it", e))). In our implementation, mark(s, t) sets a synthesized=true attribute on the ID of any string s in t. Effectively this means that such names are treated as synthesized names instead of source names. As a result, name-fix does not rename the binder, and the result of desugaring the above expression will be let it = c in if it then !it else it.

Future work.
Theorem 6 shows that name-fix turns sub-hygienic transformations into hygienic transformations. However, there is currently no decision procedure for whether a transformation is sub-hygienic or not. For a Turing-complete metalanguage, a static analysis can only approximate this property. Nevertheless, a conservative analysis would be useful as it can guarantee that a transformation is sub-hygienic. For example, all transformations of our case studies except substitution are sub-hygienic, but we have not formally ensured that. We expect a type system that checks sub-hygiene to provide guidance to transformation developers similar to FreshML [22], but without reducing the flexibility.

Another open issue is when to apply name-fix. This is important when building transformations on top of other transformations or composing transformations sequentially into transformation pipelines. After every application of a transformation, there could be inadvertent variable capture that name-fix can eliminate. For our case studies we used informal reasoning to decide whether the call to name-fix can be delayed, but more principled guidance would be useful. For example, a simple class of transformations that commutes with applications of name-fix is the class of name-insensitive transformations, such as constant propagation. More generally, care has to be taken whenever a transformation compares two names for equality, because intermediate variable capture may yield inaccurate equalities. Since name-fix is the identity on capture-free programs (Theorem 3), applying name-fix more often than necessary is at most inefficient, but not unsafe.

name-fix renames not only synthesized names but also names that originate from the source program. This may break the expected interface of the generated code. Accordingly, name-fix currently is a whole-program transformation that does not support linking of generated programs against previously generated libraries, because names in these libraries cannot be changed. Therefore, name-fix is currently ill-suited for separate compilation. We have experienced this problem in the Derric compiler, where a Derric field named BIG_ENDIAN will shadow a constant with the same name that occurs in Derric's precompiled run-time system. We leave the investigation of a modular name-fix for future work.

Finally, the current implementation of name-fix requires repeated execution of the name analysis of the target language. As a result, name-fix can be expensive in terms of run-time performance. When a compiler is run continuously in an IDE, this penalty can be an impediment to usability. Fortunately, incremental name analysis is a well-studied topic (e.g., [19,27]) that is likely to yield benefits for name-fix because (i) we know the delta induced by name-fix (renamed variables) and (ii) new variable capture can only occur in references that have changed.

Various approaches to ensuring capture avoidance have been studied in previous work. Many of them represent a program not as a syntax tree, but use the syntax tree as a spanning tree for a graph-based program representation with additional links from variable references to the corresponding variable declarations. The advantage of graph-based representations is that variable references are unambiguously resolved at all times, which can guide developers of transformations. For example, nameless program representations such as de Bruijn indices [5] encode the graph structure of variable bindings via numeric values; Oliveira and Löh directly encode recursion and sharing in the abstract syntax of embedded DSLs [16] via structured graphs. The disadvantage of these techniques is that they require explicit handling of graphs (updating indices, redirecting edges) and do not support open terms well.

In higher-order abstract syntax (HOAS) [18] variable references and declarations are encoded using the binding constructs of the metalanguage. Thus, developers of transformations inherit name analysis and capture-avoiding substitution from the metalanguage and work with fully name-resolved terms. It is well-known that HOAS has a number of practical problems [21].
For instance, the use of metalevel functions to encode binders makes them opaque; it is not possible to represent open terms or to pattern match against variable binders inside constructs such as let.

FreshML [22] uses types to describe the binding structure of object-language variable binders. This enables deconstruction of a variable binder via pattern matching, which yields a fresh name and the body as an open term in which the bound variable has been renamed to the fresh one. Due to using fresh variables, accidental variable capture cannot occur but intentional variable capture is possible. FreshML is limited by using types for declaring variable scope, because this is only possible for "declare-before-use" lexical scoping and not, for example, for the scoping of methods in an object-oriented class.

In model-driven engineering it is common to describe abstract syntax using class-based metamodels [17]. Syntactic categories correspond to classes; parent-child relations and cross-references are encoded using associations. Metamodels are expressive enough to model programs with each name resolved to its declaration using direct references (pointers). As a result, a large class of model-transformation formalisms are based on graph rewriting [4]. However, we are unaware of any work in this area that addresses capture avoidance. In particular, in the case of model-to-text (M2T) transformations, names have to be output and all guarantees about capture avoidance (if any) are lost.

Seminal work on hygiene has been performed in the context of syntax macros [14,3]. Like name-fix, hygienic macro expansion automatically renames bound variables to avoid variable capture. In related work, a number of approaches to hygienic macro expansion have been proposed [2,3,7,10]. Closest to our work is the expansion algorithm proposed by Dybvig, Hieb, and Bruggeman [7] in that they also associate additional contextual information to identifiers in syntax objects, similar to our string origins. However, in their work renamings appear during macro expansion (modulo lazy evaluation), whereas we perform renamings after transformation. Moreover, since for macros the role of an identifier only becomes apparent after macro expansion, they have to track alternative interpretations for a single identifier. In contrast, we require name analysis for the source language, which enables a completely different approach to hygienic transformations.

Marco [15] is a language-agnostic macro engine that detects variable capture by parsing error messages produced by an off-the-shelf compiler of the base language. Marco checks whether any of the free names introduced by a macro is captured at a call-site of the macro. While Marco does not require name analysis, it has to rely on the quality of error messages of the base compiler, provides no safety guarantees, and can only detect but not fix variable capture.

Generation environments [23] are metalanguage values that allow the scoping of variable names generated by a program transformation. A program transformation can open a generation environment to generate code relative to the encapsulated lexical context. Since generation environments can be passed around as metalanguage values, different transformations can produce code for a shared lexical context. While generation environments simplify the implementation of transformations, they rely on the discipline of developers and do not provide static guarantees.

Another area where capture avoidance is important is rename refactorings. In particular, previous work on rename refactoring for Java [20] omits checking preconditions and instead tries to fix the result of a renaming through qualified names so that reference intent is preserved. De Jonge et al. generalize this approach to support name-binding preservation in refactorings for other languages [6]. In contrast to our work, rename refactorings are a limited class of transformations that do not introduce any synthesized names.
We presented name-fix, a generic solution for eliminating variable capture from the result of program transformations by comparing name graphs of the transformation's input and output. This work brings benefits of hygienic macros to the domain of program transformations. In particular, name-fix relieves developers of transformations from manually ensuring capture avoidance, and it enables the safe usage of simple naming conventions. We have verified that name-fix terminates, is correct, and yields α-equivalent programs for inputs that are equal up to possibly capturing renaming. As we demonstrated with case studies on program optimization, language extension, and DSL compilation, name-fix is applicable to a wide range of program transformations and languages.

Acknowledgement. We thank Mitchel Wand, Paolo Giarrusso, Justin Pombrio, Atze van der Ploeg, and the anonymous reviewers for helpful feedback.
References
1. E. Barzilay, R. Culpepper, and M. Flatt. Keeping it clean with syntax parameters. In Scheme, 2011.
2. A. Bawden and J. Rees. Syntactic closures. In LFP, pages 86–95. ACM, 1988.
3. W. Clinger and J. Rees. Macros that work. In POPL, pages 155–162. ACM, 1991.
4. K. Czarnecki and S. Helsen. Feature-based survey of model transformation approaches. IBM Systems Journal, 45(3):621–645, 2006.
5. N. G. de Bruijn. Lambda calculus notation with nameless dummies, a tool for automatic formula manipulation, with application to the Church-Rosser theorem. Indagationes Mathematicae, 75(5):381–392, 1972.
6. M. de Jonge and E. Visser. A language generic solution for name binding preservation in refactorings. In LDTA. ACM, 2012.
7. R. K. Dybvig, R. Hieb, and C. Bruggeman. Syntactic abstraction in scheme. Lisp and Symbolic Computation, 5(4):295–326, 1992.
8. S. Erdweg. Extensible Languages for Flexible and Principled Domain Abstraction. PhD thesis, Philipps-Universität Marburg, 2013.
9. S. Erdweg, T. van der Storm, M. Völter, M. Boersma, R. Bosman, W. R. Cook, A. Gerritsen, A. Hulshout, S. Kelly, A. Loh, G. Konat, P. J. Molina, M. Palatnik, R. Pohjonen, E. Schindler, K. Schindler, R. Solmi, V. Vergu, E. Visser, K. van der Vlist, G. Wachsmuth, and J. van der Woning. The state of the art in language workbenches. In SLE, volume 8225 of LNCS, pages 197–217. Springer, 2013.
10. D. Herman. A Theory of Typed Hygienic Macros. PhD thesis, Northeastern University, Boston, Massachusetts, 2012.
11. A. Izmaylova, P. Klint, A. Shahi, and J. Vinju. M3: An open model for measuring source code artifacts. arXiv:1312.1188, 2013. BENEVOL'13.
12. T. Johnsson. Lambda lifting: Transforming programs to recursive equations. In Proceedings of Functional Programming Languages and Computer Architecture (FPCA), pages 190–203. Springer, 1985.
13. P. Klint, T. van der Storm, and J. Vinju. Rascal: A domain-specific language for source code analysis and manipulation. In SCAM, pages 168–177, 2009.
14. E. Kohlbecker, D. P. Friedman, M. Felleisen, and B. Duba. Hygienic macro expansion. In LFP, pages 151–161. ACM, 1986.
15. B. Lee, R. Grimm, M. Hirzel, and K. S. McKinley. Marco: Safe, expressive macros for any language. In ECOOP, volume 7313 of LNCS, pages 589–613. Springer, 2012.
16. B. C. d. S. Oliveira and A. Löh. Abstract syntax graphs for domain specific languages. In PEPM, pages 87–96. ACM, 2013.
17. R. F. Paige, D. S. Kolovos, and F. A. C. Polack. Metamodelling for grammarware researchers. In SLE, volume 7745 of LNCS, pages 64–82. Springer, 2012.
18. F. Pfenning and C. Elliott. Higher-order abstract syntax. In PLDI, pages 199–208. ACM, 1988.
19. T. Reps, T. Teitelbaum, and A. Demers. Incremental context-dependent analysis for language-based editors. TOPLAS, 5(3):449–477, 1983.
20. M. Schäfer, T. Ekman, and O. de Moor. Sound and extensible renaming for Java. In OOPSLA, pages 227–294. ACM, 2008.
21. T. Sheard. Accomplishments and research challenges in meta-programming. In SAIG, pages 2–44. Springer, 2001.
22. M. R. Shinwell, A. M. Pitts, and M. J. Gabbay. FreshML: Programming with binders made simple. In ICFP, pages 263–274. ACM, 2003.
23. Y. Smaragdakis and D. S. Batory. Scoping constructs for software generators. In GCSE, volume 1799 of LNCS, pages 65–78. Springer, 1999.
24. P. I. Valdera, T. van der Storm, and S. Erdweg. Tracing model transformations with string origins. In ICMT. Springer, 2014. To appear.
25. J. van den Bos and T. van der Storm. Bringing domain-specific languages to digital forensics. In ICSE, pages 671–680. ACM, 2011.
26. A. van Deursen, P. Klint, and F. Tip. Origin tracking. Symbolic Computation, 15:523–545, 1993.
27. G. Wachsmuth, G. D. P. Konat, V. A. Vergu, D. M. Groenewegen, and E. Visser. A language independent task engine for incremental name and type analysis. In SLE, volume 8225 of LNCS, pages 260–280. Springer, 2013.

Appendix A Proofs of theorems from Section 4 and Section 5
Theorem 1.
For any name graph $G_s$ and any program $t$, $\text{name-fix}(G_s, t)$ terminates in finitely many steps.

Proof. The depth of the recursion of name-fix is bounded by the number of variable declarations in $t$. Each variable declaration $v_d$ can occur at most once in the result of find-capture because it is immediately renamed to a fresh name. The renamed variable declaration cannot occur in find-capture again because (i) if $v_d \in V_s$, then only references $v_r \in V_s$ with $\rho(v_r) = v_d$ share the fresh name, and (ii) if $v_d \notin V_s$, then only references $v_r \in V_t \setminus V_s$ share the fresh name, but $v_r \in \mathrm{dom}(\text{find-capture})$ entails $\text{find-capture}(v_r) \in V_s$. Hence name-fix terminates after at most all variable declarations in $t$ have been renamed once. □

Theorem 2 (Capture avoidance).
Given a transformation $f : S \to T$, name-fix yields a capture-avoiding transformation $\lambda s.\ \text{name-fix}(\text{resolve}_S(s), f(s))$.

Proof. When name-fix terminates, $\text{find-capture} = \emptyset$, and thus all reference intent and declaration extent are preserved from the name graph of $s$ to the name graph of the resulting program. □

Lemma 1.
For any graph $G$, sub-$\alpha$-equivalence under $G$ is an equivalence relation, that is, it is reflexive, symmetric, and transitive.

Proof. Follows directly from the definition of sub-$\alpha$-equivalence and the fact that $\equiv_L$ is an equivalence relation. □

Theorem 3.
For any name graph $G_s = (V_s, \rho_s)$ and any program $t$ with $\text{find-capture}(G_s, \text{resolve}_T(t)) = \emptyset$, $\text{name-fix}(G_s, t) = t$.

Proof. By definition of name-fix. □

Lemma 2.
For any renaming $\pi$ and program $t$, $\text{rename}(t, \pi) \equiv_L t$.

Proof. By induction on the structure of $t$. □

Lemma 3.
For any name graph $G_s = (V_s, \rho_s)$ and sub-$\alpha$-equivalent programs $t_1 \equiv^{G_s}_\alpha t_2$ under name graph $G_s$, if $\text{find-capture}(G_s, \text{resolve}_T(t_1)) = \emptyset$ and $\text{find-capture}(G_s, \text{resolve}_T(t_2)) = \emptyset$, then $t_1 \equiv_\alpha t_2$.

Proof. Let $(V_i, \rho_i) = \text{resolve}_T(t_i)$ and $\text{capture}_i = \text{find-capture}(G_s, \text{resolve}_T(t_i))$. By definition, $t_1 \equiv_\alpha t_2$ if $\rho_1 = \rho_2$, which holds if $\mathrm{dom}(\rho_1) = \mathrm{dom}(\rho_2)$ and $\rho_1(v) = \rho_2(v)$ for all $v \in \mathrm{dom}(\rho_1)$. Let $v \in \mathrm{dom}(\rho_1)$ (analogously for $v \in \mathrm{dom}(\rho_2)$). We distinguish 3 cases:

1. If $v \in V_s$ and $v \in \mathrm{dom}(\rho_s)$, then $\rho_1(v) = \rho_s(v)$ because otherwise $\rho_s(v) \neq \rho_1(v)$ entails $(v \mapsto \rho_1(v)) \in \text{notPresrvRef1} \subseteq \text{capture}_1$, contradicting $\text{capture}_1 = \emptyset$. By Assumption 1, $t_1@v = t_1@\rho_1(v)$, which implies $t_2@v = t_2@\rho_1(v)$ by the first condition of $t_1 \equiv^{G_s}_\alpha t_2$. Thus by Assumption 2-(i), $v \in \mathrm{dom}(\rho_2)$. Then $\rho_2(v) = \rho_s(v)$ because otherwise $(v \mapsto \rho_2(v)) \in \text{notPresrvRef1} \subseteq \text{capture}_2$, contradicting $\text{capture}_2 = \emptyset$. By Assumption 2-(ii), $\rho_1(v) = \rho_2(v)$.

2. If $v \in V_s$ and $v \notin \mathrm{dom}(\rho_s)$, then $\rho_1(v) = v$ because otherwise $(v \mapsto \rho_1(v)) \in \text{notPresrvRef2} \subseteq \text{capture}_1$, contradicting $\text{capture}_1 = \emptyset$. We trivially have $t_2@v = t_2@\rho_1(v)$ and thus by Assumption 2-(i), $v \in \mathrm{dom}(\rho_2)$. Then $\rho_2(v) = v$ because otherwise $(v \mapsto \rho_2(v)) \in \text{notPresrvRef2} \subseteq \text{capture}_2$, contradicting $\text{capture}_2 = \emptyset$. By Assumption 2-(ii), $\rho_1(v) = v = \rho_2(v)$.

3. If $v \notin V_s$, then $\rho_1(v) \notin V_s$ because otherwise $(v \mapsto \rho_1(v)) \in \text{notPresrvDef} \subseteq \text{capture}_1$, contradicting $\text{capture}_1 = \emptyset$. By Assumption 1, $t_1@v = t_1@\rho_1(v)$, which implies $t_2@v = t_2@\rho_1(v)$ by the second condition of $t_1 \equiv^{G_s}_\alpha t_2$. By Assumption 2-(i), $v \in \mathrm{dom}(\rho_2)$. We have $\rho_2(v) \notin V_s$ because otherwise $(v \mapsto \rho_2(v)) \in \text{notPresrvDef} \subseteq \text{capture}_2$, contradicting $\text{capture}_2 = \emptyset$. By Assumption 1, $t_1@v = t_1@\rho_1(v)$ and thus $\rho_1(v) = \rho_2(v)$ by Assumption 2. □

Lemma 4.
For any bipartite name graph $G_s = (V_s, \rho_s)$ and program $t$ with $\text{capture} = \text{find-capture}(G_s, \text{resolve}_T(t)) \neq \emptyset$, renaming preserves sub-$\alpha$-equivalence: $t \equiv^{G_s}_\alpha \text{rename}(t, \pi_{\text{src}} \cup \pi_{\text{syn}})$, given $\pi_{\text{src}}$ and $\pi_{\text{syn}}$ as in name-fix.

Proof. Let $(V_t, \rho_t) = \text{resolve}_T(t)$ and $t' = \text{rename}(t, \pi_{\text{src}} \cup \pi_{\text{syn}})$. First, we have $t \equiv_L t'$ by Lemma 2. By the definition of rename and since $\mathrm{dom}(\pi_{\text{src}}) \cap \mathrm{dom}(\pi_{\text{syn}}) = \emptyset$, $t'@v$ becomes $\pi_{\text{src}}(v)$ if $v \in \mathrm{dom}(\pi_{\text{src}})$, becomes $\pi_{\text{syn}}(v)$ if $v \in \mathrm{dom}(\pi_{\text{syn}})$, and remains unchanged otherwise. We separately show that both conditions of sub-$\alpha$-equivalence are satisfied:

1. For all $v_r, v_d \in V_t \cap V_s$ with $\rho_s(v_r) = v_d$, we have $v_r \notin \mathrm{dom}(\pi_{\text{syn}})$, $v_d \notin \mathrm{dom}(\pi_{\text{syn}})$, and $v_r \in \mathrm{dom}(\pi_{\text{src}}) \Leftrightarrow v_d \in \mathrm{dom}(\pi_{\text{src}})$ because $G_s$ is bipartite, and if $v_r \in \mathrm{dom}(\pi_{\text{src}})$, then $\pi_{\text{src}}(v_r) = \pi_{\text{src}}(v_d)$. Thus, $t@v_r = t@v_d \Leftrightarrow t'@v_r = t'@v_d$.

2. For all $v_r, v_d \in V_t \setminus V_s$, we have $v_r \notin \mathrm{dom}(\pi_{\text{src}})$ and $v_d \notin \mathrm{dom}(\pi_{\text{src}})$. If $t@v_r \neq t@v_d$, then $t'@v_r \neq t'@v_d$ because $\pi_{\text{syn}}$ maps distinct names to distinct fresh names. If instead $t@v_r = t@v_d$, we have $v_r \in \mathrm{dom}(\pi_{\text{syn}}) \Leftrightarrow v_d \in \mathrm{dom}(\pi_{\text{syn}})$, and if $v_r \in \mathrm{dom}(\pi_{\text{syn}})$, then $\pi_{\text{syn}}(v_r) = \pi_{\text{syn}}(v_d)$. Thus, $t@v_r = t@v_d \Leftrightarrow t'@v_r = t'@v_d$. □

Theorem 4.
For any bipartite name graph $G_s = (V_s, \rho_s)$ and any program $t$, $\text{name-fix}(G_s, t) \equiv^{G_s}_\alpha t$.

Proof. By induction on $\text{name-fix}(G_s, t)$ using Theorem 3 and Lemma 4. □

Lemma 5.
For any bipartite name graph $G_s = (V_s, \rho_s)$ and programs $t_1 \equiv^{G_s}_\alpha t_2$, if $\text{find-capture}(G_s, \text{resolve}_T(t_1)) = \emptyset$, then $t_1 \equiv_\alpha \text{name-fix}(G_s, t_2)$.

Proof. By induction on $\text{name-fix}(G_s, t_2)$. Base case: $\text{find-capture}(G_s, \text{resolve}_T(t_2)) = \emptyset$ and $\text{name-fix}(G_s, t_2) = t_2$. Then $t_1 \equiv_\alpha t_2$ by Lemma 3. Step case: $\text{find-capture}(G_s, \text{resolve}_T(t_2)) \neq \emptyset$ and $\text{name-fix}(G_s, t_2) = \text{name-fix}(G_s, t_2')$. Then $t_1 \equiv^{G_s}_\alpha t_2'$ by Lemma 4 and transitivity (Lemma 1), and $t_1 \equiv_\alpha \text{name-fix}(G_s, t_2')$ by the induction hypothesis. □

Theorem 5.
For any bipartite name graph $G_s = (V_s, \rho_s)$ and programs $t_1 \equiv^{G_s}_\alpha t_2$, $\text{name-fix}(G_s, t_1) \equiv_\alpha \text{name-fix}(G_s, t_2)$.

Proof. By induction on $\text{name-fix}(G_s, t_1)$. Base case by Lemma 5. Step case: $\text{find-capture}(G_s, \text{resolve}_T(t_1)) \neq \emptyset$ and $\text{name-fix}(G_s, t_1) = \text{name-fix}(G_s, t_1')$. Then $t_1' \equiv^{G_s}_\alpha t_1$ by Lemma 4 and $t_1' \equiv^{G_s}_\alpha t_2$ by transitivity. Thus, $\text{name-fix}(G_s, t_1') \equiv_\alpha \text{name-fix}(G_s, t_2)$ by the induction hypothesis. □

Theorem 6.
For any sub-hygienic transformation $f : S \to T$, the transformation $\lambda s.\ \text{name-fix}(G_s, f(s))$ is hygienic.

Proof. For any $s_1 \equiv_\alpha s_2$, $f(s_1) \equiv^{G_s}_\alpha f(s_2)$ by the definition of sub-hygiene. Then $\text{name-fix}(G_s, f(s_1)) \equiv_\alpha \text{name-fix}(G_s, f(s_2))$ by Theorem 5. □
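
The case analysis in the proof of Lemma 3 and the renaming arguments in the proofs of Theorem 1 and Lemma 4 can be made concrete with a small executable model. The following Python sketch is an illustration under stated assumptions, not the implementation from this paper. The program representation (a flat list of declaration and reference nodes), the toy resolver `resolve`, and the helper `gensym` are hypothetical. The three capture conditions in `find_capture` are read off the three cases of Lemma 3, and the renaming groups follow conditions (i) and (ii) in the proof of Theorem 1.

```python
from itertools import count
from typing import Dict, Iterator, List, Set, Tuple

Node = int                                      # node identity (stands in for an origin-tracked name)
Program = List[Tuple[Node, str, str]]           # (node, "decl" | "ref", name) in program order
NameGraph = Tuple[Set[Node], Dict[Node, Node]]  # (V, rho): nodes and reference -> declaration edges


def resolve(t: Program) -> NameGraph:
    """Toy resolver (assumption): a reference binds to the nearest preceding
    declaration with the same name; later declarations shadow earlier ones."""
    latest: Dict[str, Node] = {}
    rho: Dict[Node, Node] = {}
    for node, kind, name in t:
        if kind == "decl":
            latest[name] = node
        elif name in latest:
            rho[node] = latest[name]
    return {node for node, _, _ in t}, rho


def find_capture(g_s: NameGraph, g_t: NameGraph) -> Dict[Node, Node]:
    """Collect the edges of the target graph that violate the source graph."""
    (v_s, rho_s), (_, rho_t) = g_s, g_t
    capture: Dict[Node, Node] = {}
    for v, d in rho_t.items():
        if v in v_s and v in rho_s and rho_s[v] != d:   # notPresrvRef1
            capture[v] = d
        elif v in v_s and v not in rho_s and d != v:    # notPresrvRef2
            capture[v] = d
        elif v not in v_s and d in v_s:                 # notPresrvDef
            capture[v] = d
    return capture


def gensym(base: str, used: Set[str], counter: Iterator[int]) -> str:
    """Hypothetical fresh-name generator: the first unused name of the form base_k."""
    while True:
        candidate = f"{base}_{next(counter)}"
        if candidate not in used:
            return candidate


def name_fix(g_s: NameGraph, t: Program) -> Program:
    """Rename capturing declarations until no capture remains."""
    v_s, rho_s = g_s
    counter = count()
    while True:
        capture = find_capture(g_s, resolve(t))
        if not capture:                 # the situation of Theorem 3: nothing to fix
            return t
        names = {node: name for node, _, name in t}
        used = set(names.values())
        pi: Dict[Node, str] = {}
        for v_d in sorted(set(capture.values())):
            if v_d in pi:               # already renamed with an equally named node
                continue
            fresh = gensym(names[v_d], used, counter)
            used.add(fresh)
            if v_d in v_s:
                # Source declaration: only its source references share the fresh name.
                group = {v_d} | {v_r for v_r, d in rho_s.items() if d == v_d}
            else:
                # Synthesized declaration: all synthesized nodes with this name share it.
                group = {v for v in names if v not in v_s and names[v] == names[v_d]}
            pi.update({v: fresh for v in group})
        t = [(node, kind, pi.get(node, name)) for node, kind, name in t]


# Source program: declaration 1 and reference 2, both named x, with 2 bound to 1.
g_s: NameGraph = ({1, 2}, {2: 1})
# Generated program: a synthesized declaration 10 of x shadows declaration 1.
generated: Program = [(1, "decl", "x"), (10, "decl", "x"), (2, "ref", "x")]
fixed = name_fix(g_s, generated)
print(fixed)  # [(1, 'decl', 'x'), (10, 'decl', 'x_0'), (2, 'ref', 'x')]
```

In this example, reference 2 is initially captured by the synthesized declaration 10, so only the synthesized name is renamed and the source names stay intact. Applying `name_fix` to the result again returns it unchanged, in line with Theorem 3.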