Tail Modulo Cons
Frédéric Bour, Basile Clément, and Gabriel Scherer
INRIA, Tarides
Abstract
OCaml function calls consume space on the system stack. Operating systems set default limits on the stack space which are much lower than the available memory. If a program runs out of stack space, it gets the dreaded “Stack Overflow” exception – it crashes. As a result, OCaml programmers have to be careful, when they write recursive functions, to remain in the so-called tail-recursive fragment, using tail calls that do not consume stack space.

This discipline is a source of difficulties for both beginners and experts. Beginners have to be taught recursion, and then tail-recursion. Experts disagree on the “right” way to write
List.map. The direct version is beautiful but not tail-recursive, so it crashes on larger inputs. The naive tail-recursive transformation is (slightly) slower than the direct version, and experts may want to avoid that cost. Some libraries propose horrible implementations, unrolling code by hand, to compensate for this performance loss. In general, tail-recursion requires the programmer to manually perform sophisticated program transformations. In this work we propose an implementation of “Tail Modulo Cons” (TMC) for OCaml. TMC is a program transformation for a fragment of non-tail-recursive functions, that rewrites them in destination-passing style. The supported fragment is smaller than other approaches such as continuation-passing style, but the performance of the transformed code is on par with the direct, non-tail-recursive version. Many useful functions that traverse a recursive datastructure and rebuild another recursive structure are in the TMC fragment, in particular
List.map (and
List.{filter,append}, etc.). Finally those functions can be written in a way that is beautiful, correct on all inputs, and efficient.

In this work we give a novel modular, compositional definition of the TMC transformation. We discuss the design space of user-interface choices: what degree of control to give to the user, and when to warn or fail when the transformation may lead to unexpected results. We mention a few remaining design difficulties, and present (in appendices) a performance evaluation of the transformed code.

“OCaml”, we teach our students, “is a functional programming language. We can write the beautiful function
List.map as follows:”

    let rec map f = function
    | [] -> []
    | x :: xs -> f x :: map f xs

“Well, actually”, we continue, “OCaml is an effectful language, so we need to be careful about the evaluation order. We want map to process elements from the beginning to the end of the input list, and the evaluation order of f x :: map f xs is unspecified. So we write:”

    let rec map f = function
    | [] -> []
    | x :: xs ->
      let y = f x in
      y :: map f xs

“Well, actually, this version fails with a Stack_overflow exception on large input lists. If you want your map to behave correctly on all inputs, you should write a tail-recursive version. For this you can use the accumulator-passing style:”

    let map f li =
      let rec map_ acc = function
        | [] -> List.rev acc
        | x :: xs -> map_ (f x :: acc) xs
      in map_ [] li

“Well, actually, this version works fine on large lists, but it is less efficient than the original version. It is noticeably slower on small lists, which are the most common inputs for most programs. We measured it 35% slower on lists of size 10. If you want to write a robust function for a standard library, you may want to support both use-cases as well as possible. One approach is to start with a non-tail-recursive version, and switch to a tail-recursive version for large inputs; even there you can use some manual optimizations to reduce the overhead of the accumulator. For example, the nice Containers library does it as follows:”
    let tail_map f l =
      (* Unwind the list of tuples, reconstructing the full list front-to-back.
         @param tail_acc a suffix of the final list; we append tuples' content
         at the front of it *)
      let rec rebuild tail_acc = function
        | [] -> tail_acc
        | (y0, y1, y2, y3, y4, y5, y6, y7, y8) :: bs ->
          rebuild (y0 :: y1 :: y2 :: y3 :: y4 :: y5 :: y6 :: y7 :: y8 :: tail_acc) bs
      in
      (* Create a compressed reverse-list representation using tuples
         @param tuple_acc a reverse list of chunks mapped with [f] *)
      let rec dive tuple_acc = function
        | x0 :: x1 :: x2 :: x3 :: x4 :: x5 :: x6 :: x7 :: x8 :: xs ->
          let y0 = f x0 in let y1 = f x1 in let y2 = f x2 in
          let y3 = f x3 in let y4 = f x4 in let y5 = f x5 in
          let y6 = f x6 in let y7 = f x7 in let y8 = f x8 in
          dive ((y0, y1, y2, y3, y4, y5, y6, y7, y8) :: tuple_acc) xs
        | xs ->
          (* Reverse direction, finishing off with a direct map *)
          let tail = List.map f xs in
          rebuild tail tuple_acc
      in
      dive [] l

    let direct_depth_default_ = 1000

    let map f l =
      let rec direct f i l = match l with
        | [] -> []
        | [x] -> [f x]
        | [x1;x2] -> let y1 = f x1 in [y1; f x2]
        | [x1;x2;x3] ->
          let y1 = f x1 in let y2 = f x2 in [y1; y2; f x3]
        | _ when i=0 -> tail_map f l
        | x1::x2::x3::x4::l' ->
          let y1 = f x1 in
          let y2 = f x2 in
          let y3 = f x3 in
          let y4 = f x4 in
          y1 :: y2 :: y3 :: y4 :: direct f (i-1) l'
      in
      direct f direct_depth_default_ l
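The “35% slower on lists of size 10” figure can be probed with a rough micro-benchmark along the following lines. This harness is our own sketch (timings are sensitive to the machine, inlining and GC settings), not the paper's measurement setup:

```ocaml
(* Direct (non-tail-recursive) map. *)
let rec map_direct f = function
  | [] -> []
  | x :: xs -> let y = f x in y :: map_direct f xs

(* Accumulator-passing (tail-recursive) map. *)
let map_acc f li =
  let rec go acc = function
    | [] -> List.rev acc
    | x :: xs -> go (f x :: acc) xs
  in go [] li

(* Time [n] applications of [g] to [x], in CPU seconds. *)
let time n g x =
  let t0 = Sys.time () in
  for _ = 1 to n do ignore (g x) done;
  Sys.time () -. t0

let () =
  let small = [1; 2; 3; 4; 5; 6; 7; 8; 9; 10] in
  let d = time 100_000 (map_direct succ) small in
  let a = time 100_000 (map_acc succ) small in
  Printf.printf "direct: %.3fs  accumulator: %.3fs\n" d a
```

The extra List.rev traversal and the reversed intermediate list are where the accumulator version pays its cost on small inputs.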
At this point, unfortunately, some students leave the class and never come back. (These days they just have to disconnect from the remote-teaching server.)

We propose a new feature for the OCaml compiler, an explicit, opt-in “Tail Modulo Cons” transformation, to retain our students. After the first version (or maybe, if we are teaching an advanced class, after the second version), we could show them the following version:

    let[@tail_mod_cons] rec map f = function
    | [] -> []
    | x :: xs -> f x :: map f xs
This version would be as fast as the simple implementation, tail-recursive, and easy to write. The catch, of course, is to teach when this [@tail_mod_cons] annotation can be used. Maybe we would not show it at all, and pretend that the direct map version with let y is fine. This would be a much smaller lie than it currently is, a [@tail_mod_cons]-sized lie.

Finally, experts would be very happy. They know about all these versions, but they would not have to write them by hand anymore: a program would perform (some of) the program transformations that they are currently doing manually.
A function call is in tail position within a function definition if the definition has “nothing to do” after evaluating the function call – the result of the call is the result of the whole function at this point of the program. (A precise definition will be given in Section 2.) A function is tail recursive if all its recursive calls are tail calls.

In the definition of map, the recursive call is not in tail position: after computing the result of map f xs we still have to compute the final list cell, y :: □. We say that a call is tail modulo cons when the work remaining is formed of data constructors only, such as (::) here.

    let[@tail_mod_cons] rec map f = function
    | [] -> []
    | x :: xs ->
      let y = f x in
      y :: map f xs
Other datatype constructors may also be used; the following example is also tail-recursive modulo cons:

    let[@tail_mod_cons] rec tree_of_list = function
    | [] -> Empty
    | x :: xs -> Node(Empty, x, tree_of_list xs)

The TMC transformation returns an equivalent function in destination-passing style where the calls in tail modulo cons position have been turned into tail calls. In particular, for map it gives a tail-recursive function, which runs in constant stack space; many other list functions also become tail-recursive. The transformed code of map can be described as follows:

    let rec map f = function
    | [] -> []
    | x :: xs ->
      let y = f x in
      let dst = y :: Hole in
      map_dps dst 1 f xs;
      dst

    and map_dps dst i f = function
    | [] ->
      dst.i <- []
    | x :: xs ->
      let y = f x in
      let dst' = y :: Hole in
      dst.i <- dst';
      map_dps dst' 1 f xs
The transformed code has two variants of the map function. The map_dps variant is in destination-passing style: it expects additional parameters that specify a memory location, a destination, and will write its result to this destination instead of returning it. It is tail-recursive. The map variant provides the same interface as the non-transformed function, and internally calls map_dps on non-empty lists. It is not tail-recursive, but it does not call itself recursively; it jumps to the tail-recursive map_dps after one call.

The key idea of the transformation is that the expression y :: map f xs, which contained a non-tail-recursive call, is transformed into first the computation of a partial list cell, written y :: Hole, followed by a call to map_dps that is asked to write its result in the position of the Hole. The recursive call thus happens after the cell creation (instead of before), in tail-recursive position in the map_dps variant. In the direct variant, the value of the destination dst has to be returned after the call.

The transformed code is in a pseudo-OCaml; it is not a valid OCaml program: we use a magical Hole constant, and our notation dst.i <- ... to update constructor parameters in place is also invalid in source programs. The transformation is implemented on a lower-level, untyped intermediate representation of the OCaml compiler (Lambda), where those operations do exist. The OCaml type system is not expressive enough to type-check the transformed program: the list cell is only partially initialized at first, each partial cell is mutated exactly once, and in the end the whole result is returned as an immutable list. Some type systems are expressive enough to represent this transformed code, notably Mezzo (Pottier and Protzenko, 2013).
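To build an intuition for destination-passing style in valid (if less efficient) OCaml, one can simulate partial cells with an explicitly mutable list type. The type mlist and the function names below are our own illustration, not the code produced by the transformation, which mutates ordinary list cells directly at the Lambda level:

```ocaml
(* A list whose tails are mutable references, so a "hole" can be filled
   later. This simulates, in well-typed OCaml, the partial cells that the
   TMC transformation manipulates in the untyped Lambda representation. *)
type 'a mlist = Nil | Cons of 'a * 'a mlist ref

(* DPS variant: writes the mapped list into destination [dst].
   The recursive call is a genuine tail call. *)
let rec map_dps dst f = function
  | [] -> dst := Nil
  | x :: xs ->
    let y = f x in
    let hole = ref Nil in        (* partial cell: tail not yet known *)
    dst := Cons (y, hole);
    map_dps hole f xs

(* Conversion back to an immutable list, for illustration only
   (it is not tail-recursive, unlike the real transformed code,
   which returns the filled cells directly without any copy). *)
let rec to_list = function
  | Nil -> []
  | Cons (x, tl) -> x :: to_list !tl

let map f xs =
  let dst = ref Nil in
  map_dps dst f xs;
  to_list !dst
```

The real transformation avoids both the extra ref indirection and the final copy; this sketch only shows where the destination flows and why the recursive call becomes a tail call.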
Instead of a program transformation in destination-passing style, we could perform a more general program transformation that can make more functions tail-recursive, for example a generic continuation-passing style (CPS) transformation. We have three arguments for implementing the TMC transformation:

• The TMC transformation generates more efficient code, using mutation instead of function calls. On the OCaml runtime, the difference is a large constant factor.

• The CPS transformation can be expressed at the source level, and can be made reasonably nice-looking using some monadic-binding syntactic sugar. TMC can only be done by the compiler, or using safety-breaking features.

• TMC is provided as an opt-in, on-demand optimization. We can add more such optimizations; they are not competing with each other, especially if they are mostly to be used by expert programmers. Someone should try presenting CPS as an annotation-driven transformation, but we wanted to look at TMC first.
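For comparison, here is what a hand-written source-level CPS version of map could look like — our own sketch, not the output of any compiler pass. The closure allocated per element is the cost that makes this approach slower than TMC:

```ocaml
(* CPS variant of map: the continuation [k] reifies the evaluation
   context "y :: <hole>", so the recursive call is a tail call.
   Each element allocates one closure for the extended continuation. *)
let rec map_cps f xs k =
  match xs with
  | [] -> k []
  | x :: xs ->
    let y = f x in
    map_cps f xs (fun ys -> k (y :: ys))

let map f xs = map_cps f xs (fun ys -> ys)
```

Note that the recursive call and the eventual calls to the composed continuations are all tail calls, so this runs in constant stack space; the price is one closure allocation per element, plus calling through closures instead of mutating a cell.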
Using the native system stack is a choice of the OCaml implementation. Some other implementations of functional languages, such as SML/NJ, use a different stack (the OCaml bytecode interpreter also does this), or directly allocate stack frames on their GC-managed heap. This approach makes “stack overflow” go away completely, and it also makes it very simple to implement stack-capture control operators, such as continuations, or other stack operations such as continuation marks.

On the other hand, using the native stack brings compatibility benefits (coherent stack traces for mixed OCaml+C programs), and seems to noticeably improve the performance of function calls (on benchmarks that only test function calls and returns, such as Ackermann or the naive Fibonacci, OCaml can be 4x, 5x faster than SML/NJ).

Lazy (call-by-need) languages will also often avoid running into stack overflows: as soon as a lazy datastructure is returned, which is the default, functions such as map will return immediately, with recursive calls frozen in a lazy thunk, waiting to be evaluated on demand as the user traverses the result structure. Users still need to worry about tail-recursivity for their strict functions (if the implementation uses the system stack); strict functions are often preferred when writing performant code.
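The point about laziness can be reproduced in strict OCaml with the standard Seq module (our illustration; the stdlib already provides an equivalent Seq.map): the function below returns immediately, and each recursive step is only forced when the consumer demands the next element.

```ocaml
(* Lazy map over sequences: no stack is consumed up front, because the
   recursive occurrence is frozen inside the returned thunk. *)
let rec seq_map f (xs : 'a Seq.t) : 'b Seq.t = fun () ->
  match xs () with
  | Seq.Nil -> Seq.Nil
  | Seq.Cons (x, tail) -> Seq.Cons (f x, seq_map f tail)
```

Consuming the result forces one thunk at a time, so even a very long sequence is traversed without deep recursion, at the cost of one thunk allocation per element.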
Some operating systems can provide an unlimited system stack, such as ulimit -s unlimited on Linux systems – the system stack is then resized on demand. Then it is possible to run non-tail-recursive functions without fear of overflows. Frustratingly, unlimited stacks are not available on all systems, and not the default on any system in wide use. Convincing all users to set up their system in a non-standard way would be much harder than performing a program transformation or accepting the CPS overhead for some programs. (On a toy benchmark with large-sized lists, the CPS version is 100% slower and has 130% more allocations than the non-tail-recursive version.)

Tail-recursion modulo cons was well-known in the Lisp community as early as the 1970s. For example the REMREC system (Risch, 1973) would automatically transform recursive functions into loops, and supports modulo-cons tail recursion. It also supports tail-recursion modulo associative arithmetic operators, which is outside the scope of our work, but supported by the GCC compiler for example. The TMC fragment is precisely described (in prose) in Friedman and Wise (1975).

In the Prolog community it is a common pattern to implement destination-passing style through unification variables; in particular “difference lists” are a common representation of lists with a final hole. Unification variables are first-class values; in particular they can be passed as function arguments. This makes it easy to write the destination-passing-style equivalent of a context of the form
List.append li □, as the difference list (List.append li X, X). In contrast, we only support direct constructor applications. However, this expressivity comes at a performance cost, and there is no static checking that the data is fully initialized at the end of the computation.

In general, if we think of non-tail-recursive functions as having an “evaluation context” left for after the recursive call, then the techniques to turn classes of calls into tail calls correspond to different reified representations of non-tail contexts, as long as they support efficient composition and hole-plugging. TMC comes from representing data-construction contexts as the partial data itself, with hole-plugging by mutation. Associative-operator transformations represent the context 1 + (4 + □) as the number 5 directly. (Sometimes it suffices to keep around an abstraction of the context; this is a key idea in John Clements' work on stack-based security in presence of tail calls.)

Minamide (1998) gives a “functional” interface to destination-passing-style programs, by presenting a partial data-constructor composition
Foo(x, Bar(□)) as a use-once, linear-typed function linfun h -> Foo(x, Bar(h)). Those special linear functions remain implemented as partial data, but they expose a referentially-transparent interface to the programmer, restricted by a linear type discipline. This is a beautiful way to represent destination-passing style, orthogonal to our work: users of Minamide's system would still have to write the transformed version by hand, and we could implement a transformation into destination-passing style expressed in his system. Pottier and Protzenko (2013) support a more general-purpose type system based on separation logic, which can directly express uniquely-owned partially-initialized data, and its implicit transformation into immutable, duplicable results. (See the List module of the Mezzo standard library, and in particular cell, freeze and append in destination-passing style.)

This work is in progress. We claim the following contributions:

• A formal grammar of which expressions are in the “Tail Modulo Cons” fragment.

• A novel, modular definition of the transformation into destination-passing style.

• A discussion of the user-interface issues related to transformation control.

• A performance evaluation of the transformation for
List.map, in the specific context of the OCaml runtime.

A notable non-contribution is a correctness proof for the transformation. We would like to work on a correctness proof soon; the correctness argument requires reasoning on mutability and ownership of partial values, an excellent use-case for separation logic.
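The “reified context” view from the related-work discussion can be made concrete in OCaml: a difference list represents a list-building context as a function awaiting its hole, with composition as function composition and hole-plugging as application. This sketch (all names ours) is the functional cousin of Prolog's difference lists:

```ocaml
(* A difference list: a list with a hole at the end, represented as a
   function awaiting the hole's value. *)
type 'a dlist = 'a list -> 'a list

let empty : 'a dlist = fun hole -> hole

(* Append one element at the end of the context (before the hole). *)
let snoc (d : 'a dlist) (x : 'a) : 'a dlist = fun hole -> d (x :: hole)

(* Hole-plugging: fill the hole with the empty list. *)
let to_list (d : 'a dlist) : 'a list = d []

(* A map that threads the context as a value instead of using the stack;
   the recursive call is a tail call. *)
let map f xs =
  let rec go acc = function
    | [] -> to_list acc
    | x :: xs -> go (snoc acc (f x)) xs
  in go empty xs
```

Here the context is a chain of closures rather than partial data, so it still allocates one closure per element; TMC's partial-data representation avoids that cost.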
    Exprs ∋ e, d ::= x, y | n ∈ ℕ | f e | let x = e in e' | K (e_i)_i
                   | match e with (p_i → e'_i)_i | d.e ← e'
    FunctionNames ∋ f
    Patterns ∋ p ::= x | K (p_i)_i
    Stmt ∋ s ::= let rec (f_i x = e_i)_i

    Figure 1: A first-order programming language

In order to simplify the presentation of the transformation, we consider a simple untyped language, described in Figure 1. This language, which we present with a syntax similar to OCaml, embeds function application and sequencing let-binding, as well as constructor application and pattern-matching. In addition, to implement the imperative DPS transformation, we include a special operator: d.e ← e' is an imperative construct which updates d's e-th argument in place. All those constructs (and more) are present in the untyped intermediate language used in the OCaml compiler where we implemented the transformation. One notable missing construct is function abstraction. In fact, this model requires that all functions be called by name, and functions can only be defined through a toplevel let rec statement. Our implementation supports the full OCaml language, but it cannot specialize higher-order function arguments for TMC.

In the following, we will use syntactic sugar for some usual constructs; namely, we desugar the sequence e_1; e_2 into let _ = e_1 in e_2, and tuples (e_1, ..., e_n) into the constructor application Tuple(e_1, ..., e_n).

The multi-hole grammar of tail contexts, where each hole indicates a tail position, for this simple language is depicted in Figure 2. To interpret a multi-hole context T with n holes, we denote by T[e_1, ..., e_n] the term obtained by replacing each of the holes in T from left to right with the expressions e_1, ..., e_n. In a decomposition e = T[e_1, ..., e_n], each of the e_i is in tail position; in particular, a call f e is in tail position (i.e. it is a tail call) in expression e' if there is a decomposition of e' as T[e_1, ..., e_j, f e, e_{j+1}, ..., e_n].

    TailCtx ∋ T ::= □ | e; T | let x = e in T | match e with (p_j → T_j)_j
    ConstrCtx ∋ C[□] ::= □ | K((e_i)_i, C[□], (e_j)_j)

    Figure 2: Tail multicontexts, constructor contexts

One can remark that, for a language construct, the holes in the tail context are precisely the complement of the holes in the evaluation context. For instance, the construct let x = e in e' has evaluation contexts of the form let x = E in e' and tail contexts let x = e in T. In some sense, the tail contexts are “guarded” by the evaluation context: a reduction can only occur in a tail position after the construct has been reduced away, and the subterm in tail position is now at toplevel. This guarantees that when we start reducing inside the tail context, there is no remaining computation to be performed in the surrounding context. The “depth” of the surrounding context is a source-level notion that is directly related to call-stack size: tail calls do not require reserving frames on the call stack.

Our first goal is to figure out the proper grammar for defining tail calls modulo cons. The lazy way would be to allow, in tail position, a single constructor application itself containing a tail position, that is, add a case K((e_i)_i, □, (e_j)_j) to the definition of T. This would capture most of the cases presented above. However, such a lazy implementation would be brittle: for instance, the partially unrolled version of map below would not benefit from the optimization.

    let rec umap f xs =
      match xs with
      | [] -> []
      | [x] -> [f x]
      | x1 :: x2 :: xs ->
        f x1 :: f x2 :: umap f xs

This would make the performance-seeking developer unhappy, as they would have to choose between our optimization and the performance benefits of unrolling. To make them happy again, we at least need to consider repeated constructor applications.
Using the grammar C from Figure 2, we could consider all the decompositions T[C], and extract the inner hole from the nested C context.

However, this approach would still be somewhat fragile, and some very reasonable program transformations on the nested TMC call would break it. For instance, it is not possible to locally let-bind parts of the function application, or to perform a match ultimately leading to a TMC call inside a constructor application. In our new grammar U, instead of adding a case K((e_i)_i, □, (e_j)_j) that forces constructors to occur only at the “leaves” of a context, we add a case K((e_i)_i, U, (e_j)_j), allowing arbitrary interleavings of constructors and tail-preserving contexts. This gives a more natural and less surprising definition of the tail positions modulo cons. This grammar is depicted in Figure 3.

    TailModConsCtx ∋ U ::= □ | e; U | let x = e in U
                         | match e with (p_j → U_j)_j | K((e_i)_i, U, (e_j)_j)

    Figure 3: Tail modulo cons contexts

If an expression e is of the form U⟨(e_i)_i, f e, (e_j)_j⟩, we say that the plugged subterms are in tail position modulo constructor in e, and in particular f e is a tail call modulo constructor. We also define tail positions strictly modulo cons as the tail positions modulo cons which are not regular tail positions.

Notice that there is a subtlety here: the term K(f e_1, f e_2) admits two distinct context decompositions, one with U := K(□, f e_2) where f e_1 is a tail call modulo cons, the other with U := K(f e_1, □) where f e_2 is a tail call modulo cons. (This is intentional, obtained by allowing a single sub-context in the constructor rule of U.) We can transform this term such that either one of the calls becomes a tail call, but not both.
In other words, the notion of being “in tail position modulo cons” depends on the decomposition context U. Our implementation has to decide which context decomposition to perform. It does not make choices on the user's behalf: in such ambiguous situations, it will ask the user to disambiguate by adding a [@tailcall] attribute to the one call that should be made a tail call.

Remark: our grammar for U is maximal in the sense that in each possible context decomposition of a term, all tail positions modulo cons are inside a hole of the U context. It would be possible to abandon maximality by allowing arbitrary terms e (containing no hole) as a context. We avoided doing this, as it would introduce ambiguities in the grammar of contexts (the program (a; b) can be parsed using this e case directly, or using the e; U rule first), so that operations defined on contexts would depend on the parse tree of the context in the grammar.

Many functions that consume and produce lists are tail-recursive modulo cons, in the sense that they have a TMC decomposition where all recursive calls are in TMC position. Notable functions include map, as already discussed, but also for example:

    let[@tail_mod_cons] rec filter p = function
    | [] -> []
    | x :: xs -> if p x then x :: filter p xs else filter p xs

    let[@tail_mod_cons] rec merge cmp l1 l2 =
      match l1, l2 with
      | [], l | l, [] -> l
      | h1 :: t1, h2 :: t2 ->
        if cmp h1 h2 <= 0
        then h1 :: merge cmp t1 l2
        else h2 :: merge cmp l1 t2
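The candidate-enumeration at the heart of these decompositions can be sketched on a miniature expression type. This is our own self-contained illustration (the real implementation works on the compiler's Lambda representation): it lists the function names called in tail position modulo cons, following the grammar U; more than one candidate corresponds to the ambiguous situations where the implementation asks for a [@tailcall] attribute.

```ocaml
(* A miniature version of the paper's first-order language. *)
type expr =
  | Var of string
  | Call of string * expr            (* f e *)
  | Let of string * expr * expr      (* let x = e in e' *)
  | Constr of string * expr list     (* K(e1, ..., en) *)

(* Enumerate function names called in tail position modulo cons.
   Let-bodies and constructor arguments preserve TMC positions (cases
   "let x = e in U" and "K((e_i), U, (e_j))" of the grammar); other
   subterms, such as call arguments, do not. For a constructor we
   collect candidates from every argument, mirroring the fact that each
   argument yields one possible decomposition. *)
let rec tmc_candidates = function
  | Var _ -> []
  | Call (f, _) -> [f]                       (* a call at the hole *)
  | Let (_, _, body) -> tmc_candidates body
  | Constr (_, args) -> List.concat_map tmc_candidates args
```

On f x :: map f xs, encoded as Constr ("::", [Call ("f", Var "x"); Call ("map", Var "xs")]), this returns both "f" and "map": exactly the two-decomposition ambiguity discussed above.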
TMC is not useful only for lists or other “linear” data types, with at most one recursive occurrence of the datatype in each constructor.
A non-example
Consider a map function on binary trees:

    let[@tail_mod_cons] rec map f = function
    | Leaf v -> Leaf (f v)
    | Node(t1, t2) -> Node(map f t1, (map[@tailcall]) f t2)
In this function, there are two recursive calls, but only one of them can be optimized; we used the [@tailcall] attribute to direct our implementation to optimize the call to the right child. This is actually a bad example of TMC usage in most cases, given that:

• If the tree is arbitrary, there is no reason that it would be right-leaning rather than left-leaning. Making only the right-child calls tail calls does not protect us from stack overflows.

• If the tree is known to be balanced, then in practice the depth is probably very small in both directions, so the TMC transformation is not necessary to have a well-behaved function.
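Another kind of non-example, outside the fragment altogether: when the work remaining after the recursive call is an arithmetic operation rather than a constructor application, [@tail_mod_cons] does not apply (this is the tail-recursion modulo associative operators handled by REMREC and GCC, mentioned earlier). A minimal sketch of ours, with the usual accumulator fix:

```ocaml
(* Not in the TMC fragment: the remaining context "x + <hole>" is an
   addition, not a data constructor, so [@tail_mod_cons] cannot help. *)
let rec sum = function
  | [] -> 0
  | x :: xs -> x + sum xs

(* The accumulator-passing fix, exploiting the associativity of (+):
   the context "acc + <hole>" is represented by the number acc itself. *)
let sum_tail li =
  let rec go acc = function
    | [] -> acc
    | x :: xs -> go (acc + x) xs
  in go 0 li
```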
Yes-examples from our real world
There are interesting examples of TMC transformation on functions operating on tree-like data structures, when there are natural assumptions about which child is likely to contain a deep subtree. The OCaml compiler itself contains a number of them; consider for example the following function from the
Cmm module, one of its lower-level program representations:

    let[@tail_mod_cons] rec map_tail f = function
    | Clet(id, exp, body) ->
      Clet(id, exp, map_tail f body)
    | Cifthenelse(cond, ifso, ifnot) ->
      Cifthenelse(cond, map_tail f ifso, (map_tail[@tailcall]) f ifnot)
    | Csequence(e1, e2) ->
      Csequence(e1, map_tail f e2)
    | Cswitch(e, tbl, el) ->
      Cswitch(e, tbl, Array.map (map_tail f) el)
    [...]
    | Cexit _ | Cop (Craise _, _, _) as cmm ->
      cmm
    | Cconst_int _ | Cvar _ | Ctuple _ | Cop _ as c ->
      f c
This function traverses the “tail” context of an arbitrary program term – a meta-example! The
Cifthenelse node acts as our binary-node constructor; we do not know which side is likely to be larger, so TMC is not so interesting. The recursive calls for
Cswitch are not in TMC position. But on the other hand the
Clet, Csequence cases are very beneficial to have in TMC: while they have several recursive subtrees, they are in practice only deeply nested in the direction that is turned into a tail call by the transformation. The OCaml compiler does sometimes encounter machine-generated programs with an unusually long sequence of either construction, and the TMC transformation may very well avoid a stack overflow in this case. Another example would be
A good way of thinking about our TMC transformation is as follows. We want to transform a tail context modulo cons U into a regular tail context T, where tail calls modulo cons have been replaced by regular tail calls to the DPS version of the callee. More precisely, given a term e, we will build its DPS version as follows:

• First, we find a decomposition of e as U[e_1, ..., e_n] which identifies the tail positions modulo cons. We want this decomposition to capture as many of the TMC calls to DPS-enabled functions (functions marked with the [@tail_mod_cons] attribute) as possible.

• Once the decomposition is selected, we transform the context x.y ← U (where x and y are fresh variables not appearing in U) into a context T[x_1.y_1 ← □, ..., x_n.y_n ← □]. This transformation effectively moves the assignment from the root to the leaves of the context U.

• Finally, we replace the assignments x.y ← f e by calls f_dps((x, y), e) when the callee f has a DPS version f_dps, introducing calls in tail position.

However, this transformation is not enough: we also need to transform the code of the original function to call f_dps in the recursive case. There is a naive way to do it, which is suboptimal, as we explain, before showing how we do it, in Section 4.2 (The “direct” transformation). Finally, Section 4.3 (Compression of nested constructors) highlights an optimization made by our implementation which generates cleaner code for TMC calls inside nested constructor applications.

We now present in detail the transformation used to build the body of the transformed f_dps function from a function definition let rec f x = e. As a reminder, the semantics of f_dps((d, i), e') should be the same as d.i ← f e', and the body of f_dps should replace tail calls modulo cons in e with regular tail calls to the DPS versions of the callee.
We only present the transformation for unary functions: the general case follows by using tuples for n-ary functions. Our implementation handles fully applied functions of arbitrary arity, treating them similarly to the equivalent unary function with a tuple argument.

We first find a decomposition of e as U[e_1, ..., e_n]. Recall that the TMC transformation depends on this decomposition, as we have multiple choices for the decomposition of a constructor K(e_1, ..., e_n). We use the following logic:

• If none of the e_i contains calls in TMC position to a function which has a DPS version (a TMC candidate), use the decomposition e = □[e].

• If exactly one of the e_i contains such a TMC candidate, or if exactly one of the e_i contains a TMC candidate marked with the [@tailcall] annotation, name it e_j and use the decomposition e = K(e_1, ..., e_{j-1}, □, e_{j+1}, ..., e_n)[e_j].

• Otherwise, report an error to the user, indicating the ambiguity, and requesting that one (or more) [@tailcall] annotations be added.

For other constructs, we only decompose them if at least one of their components has a TMC candidate; for instance, let x = e in e' gets decomposed into □[let x = e in e'] unless e' contains TMC candidates. This avoids needlessly duplicating assignments.

Once we have obtained the decomposition as a tail context modulo constructors, we transform said context into a tail context where each expression in tail position is an assignment. We write d.n ← U ⇝dps T[d_i.n_i ← □]_i to signify that the context T[d_i.n_i ← □]_i is obtained by performing the DPS transformation on the TMC context U with destination d.n. The rules describing this transformation are shown below.
    d.n ← □  ⇝dps  □[d.n ← □]

    d.n ← U ⇝dps T[d_i.n_i ← □]_i
    ────────────────────────────────────────────────────────────
    d.n ← (let x = e in U)  ⇝dps  let x = e in T[d_i.n_i ← □]_i

    ∀j,  d.n ← U_j ⇝dps T_j[d_{i_j}.n_{i_j} ← □]_{i_j}
    ────────────────────────────────────────────────────────────
    d.n ← (match e with (p_j → U_j)_j)
        ⇝dps  match e with (p_j → T_j[d_{i_j}.n_{i_j} ← □]_{i_j})_j

    n' = |I| + 1        d'.n' ← U ⇝dps T[d'_l.n'_l ← □]_l
    ────────────────────────────────────────────────────────────
    d.n ← K((e_i)_{i∈I}, U, (e_j)_j)
        ⇝dps  let d' = K((e_i)_{i∈I}, Hole, (e_j)_j) in d.n ← d'; T[d'_l.n'_l ← □]_l

Most cases are straightforward: for constructs whose tail context modulo cons is also a regular tail context, we simply apply the transformation into said tail context. The two important cases are the one of a hole, where we introduce an assignment to the result of evaluating the hole, and the case of a constructor, where we “reify” the hole by using a placeholder value, fill the current destination, and recursively transform the TMC context with the newly created hole as a new destination d'.n'.

After this transformation, we can now build the term T[d_i.n_i ← e_i]_i where the e_i are the subterms in the initial decomposition U[e_i]_i of the body of our recursive function. Finally, for each e_i with shape f_i e'_i where f_i has a DPS version f_i_dps, we can replace d_i.n_i ← e_i with f_i_dps((d_i, n_i), e'_i), yielding the final result of the DPS transformation.

We remark again that the only calls in tail position in the transformed term are the assignments we have just transformed into calls to a DPS version of the original callee. Indeed, any other tail call in the initial decomposition has been replaced by an assignment.
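The decomposition-selection logic described above can be sketched on a miniature expression type (all names below are ours, for illustration; the real implementation works on Lambda and also tracks the [@tailcall] annotations):

```ocaml
type expr =
  | Var of string
  | Call of string * expr            (* f e *)
  | Constr of string * expr list     (* K(e1, ..., en) *)

(* Does [e] contain a call, in TMC position, to a function with a DPS
   version, i.e. a TMC candidate? [has_dps] says which functions do. *)
let rec has_candidate has_dps = function
  | Var _ -> false
  | Call (f, _) -> has_dps f
  | Constr (_, args) -> List.exists (has_candidate has_dps) args

(* Choice of decomposition for a constructor application:
   - no argument with a candidate: keep the whole term in a single hole;
   - exactly one: put the hole on that argument;
   - otherwise: ambiguous, ask the user for a [@tailcall] annotation. *)
type choice = Whole | Arg of int | Ambiguous

let choose has_dps args =
  let hits =
    List.concat
      (List.mapi (fun i e -> if has_candidate has_dps e then [i] else []) args)
  in
  match hits with
  | [] -> Whole
  | [i] -> Arg i
  | _ -> Ambiguous
```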
We “lose” a tail call when we go from the “transformed” world back to the “direct-style” world – a CPS transformation would work similarly, transforming f e into k (f e) if f has no CPS version.

As we just noted, tail calls in the original function are tail calls in the DPS version only if their callee also has a DPS version (e.g. in the common case of a recursive function). Tail calls where the callee didn't have a DPS version are no longer in tail position. As such, the simple and lazy way to call into f_dps in the body of f, namely, introducing a destination and calling f_dps, is suboptimal, as it could make previously tail-recursive paths in f no longer tail-recursive, and the programmer may rely on those being tail recursive. Instead, we will ensure that calls to f_dps only happen inside a constructor application: this way, all the pre-existing tail calls will be left untouched.

The transformation is very similar to the DPS transformation (in fact, all of the “boring” cases are identical), and we will reuse the same context decomposition of e as U[e_1, ..., e_n]. We again perform a context rewriting on U, but now the output is an arbitrary context E.

    □ ⇝direct □

    U ⇝direct E
    ────────────────────────────────────────────────────────────
    let x = e in U  ⇝direct  (let x = e in E)

    ∀j,  U_j ⇝direct E_j
    ────────────────────────────────────────────────────────────
    match e with (p_j → U_j)_j  ⇝direct  (match e with (p_j → E_j)_j)

    n = |I| + 1        d.n ← U ⇝dps T[d_l.n_l ← □]_l
    ────────────────────────────────────────────────────────────
    K((e_i)_{i∈I}, U, (e_j)_j)
        ⇝direct  let d = K((e_i)_{i∈I}, Hole, (e_j)_j) in T[d_l.n_l ← □]_l; d

This transformation leaves the regular tail positions unchanged, but switches to the DPS version for tail positions strictly modulo cons. We then again replace tail assignments of a call with a tail call to the DPS version of the callee.
Note that this time, calls to the DPS version are not in tail position (we need to return the computed value): we simply introduce a fresh destination so that we can call into the DPS version. (Footnote: it may look like d' is only used once in the right-hand side of the conclusion of this rule, so its binding could be inlined; but this d' is used in the last premise and will occur in U.)

The presentation of Minamide (1998) suggests a slightly different encoding, where we would pass a third extra argument to the DPS version: the location of the final value to be returned. This would allow tail calls into the DPS version, at the cost of an extra argument. This may look compelling, but in practice it is not, because calls from the DPS version back into the “direct” world will never be in tail position, and we only end up paying a constant factor more stack frames.

Consider a function such as the partially unrolled map shown above. It has two nested constructor applications, and the DPS transformation as described above will generate the first version of umap_dps below. This is unsatisfactory, as it introduces needless writes that the OCaml compiler does not eliminate. Instead, we would want to generate the nicer second version.

let rec umap_dps dst i f = function
  | [] ->
      dst.i <- []
  | [x] ->
      dst.i <- [f x]
  | x1 :: x2 :: xs ->
      let dst1 = f x1 :: Hole in
      dst.i <- dst1;
      let dst2 = f x2 :: Hole in
      dst1.1 <- dst2;
      umap_dps dst2 1 f xs

let rec umap_dps dst i f = function
  | [] ->
      dst.i <- []
  | [x] ->
      dst.i <- [f x]
  | x1 :: x2 :: xs ->
      let y1 = f x1 in
      let y2 = f x2 in
      let dst' = y2 :: Hole in
      dst.i <- y1 :: dst';
      umap_dps dst' 1 f xs
Notice that in the nicer code, we need to let-bind constructor arguments to preserve execution order. We implement this optimization by keeping track, in the rules for the dps transformation, of an additional “constructor context”. Before, we were conceptually preserving the semantics of d.n ← U; now, we will be preserving the semantics of d.n ← C[U] for some constructor context C. C represents delayed constructor applications, which we will perform later – typically, immediately before calling a DPS-transformed function.

The new rules, written d.n ← C[U] ⇝dps T[(d_i.n_i ← C_i)_i], are shown in Figure 4. The rules for let and match are unchanged, except that we pass the unchanged constructor context C recursively.

The constructor rule is now split in two parts: the DPS-Constr-Opt rule adds the new constructor to the delayed constructor context, and the
DPS-Reify rule generates an assignment for the constructor context. Note that the rules for ⇝dps are no longer deterministic: the
DPS-Reify rule can apply whenever the delayed constructor stack is nonempty. We perform the reification in two cases. The first is before generating a call to a DPS-transformed version, because we need a concrete destination for that. The second is when a subterm would duplicate the delayed context (e.g. due to a match), which we keep track of; in that case we apply the reification immediately after a constructor.

Notice that a similar optimization could be made in the direct transformation (in fact, our implementation does just that): the goal of switching to the dps mode in a constructor context is simply to provide a destination to the inner TMC calls. We can, without loss of generality, only switch to dps when there is a TMC call in tail position in the recursive argument (i.e. when there will be no opportunities to introduce a destination in a subterm).
DPS-Hole-Opt
    ────────────────────────────
    d.n ← C[□]  ⇝dps  □[d.n ← C]

DPS-Reify
    n' = |I| + 1        d'.n' ← □[U]  ⇝dps  T[(d_l.n_l ← C_l)_l]
    ─────────────────────────────────────────────────────────────────────────────
    d.n ← C⟨K((e_i)_{i∈I}, □, (e_j)_j)⟩[U]  ⇝dps  let d' = K((e_i)_{i∈I}, Hole, (e_j)_j) in d.n ← C[d']; T[(d_l.n_l ← C_l)_l]

DPS-Constr-Opt
    d.n ← C⟨K((v_i)_{i∈I}, □, (v_j)_j)⟩[U]  ⇝dps  T[(d_l.n_l ← C_l)_l]
    ─────────────────────────────────────────────────────────────────────────────
    d.n ← C⟨K((e_i)_{i∈I}, U, (e_j)_j)⟩  ⇝dps  let (v_i = e_i)_{i∈I} in let (v_j = e_j)_j in T[(d_l.n_l ← C_l)_l]

    d.n ← □[U]  ⇝dps  T[(d_i.n_i ← C_i)_i]
    ───────────────────────────────────────
    d.n ← U  ⇝dps  T[(d_i.n_i ← C_i)_i]

Figure 4: DPS transformation, with constructor compression

Finally, we note that in our implementation we perform all of the transformations (dps, direct, as well as the computation of auxiliary information such as whether there are TMC opportunities in the term and whether we need to provide a destination to benefit from a switch to the DPS mode) in a single pass over the terms.

Some readers may wonder whether introducing mutation to build immutable data structures could be an issue for other subsystems of the OCaml implementation that perform fine-grained mutability reasoning, notably the Flambda optimizer and the Multicore runtime. The answer is that there is no issue, move along! The OCaml value model (even under Multicore) already contains the notion that (immutable) values may start “uninitialized” and eventually be filled by “initializing” writes – this is how immutable values are constructed from the C FFI, for example. In 2016, in preparation for our TMC work, Frédéric Bour extended the Lambda intermediate representation with an explicit notion of “initializing” write.
The TMC transformation makes the assumption that partially-initialized values (with a
Hole placeholder) have a unique owner, and will be initialized into complete values by a single write. This assumption can be violated by the addition of control operators to the language, such as call/cc or delim/cc. The problem comes from the non-linear usage of continuations, where the same continuation is invoked twice.

In practice this means that we cannot combine the external (but magical) delim/cc library with our TMC transformation. Currently the only solution is to disable the TMC transformation for delimited-continuation users (at the risk of stack overflows), but in the future we could perform a more general stack-avoiding transformation, such as a continuation-passing-style transformation, for delim/cc users.

(We considered intermediate approaches, based on detecting multiple writes to a partially-initialized value, and copying the partial value on the fly. This works for lists, where the position of the hole is known, but we did not manage to define this approach in the general case of arbitrary constructors.)

If a call inside a constructor is in TMC position, our transformation ensures that it is evaluated “last”. For example, in the program K(e_1, f e_2, e_3), if f e_2 is the TMC call, we know that e_1 and e_3 will be evaluated (in some order) before f e_2 in the transformed program.

In OCaml, the evaluation order of constructor arguments is implementation-defined; the evaluation order resulting from the TMC transformation is perfectly valid for the source program. However, in this case it is also different from the one you would typically observe on the unmodified program – most implementations use either left-to-right or right-to-left.

We consider it reasonable that an explicit transformation would change the evaluation order – especially as it remains a valid order for the source program.
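The difference in evaluation order is observable with effects. A small self-contained experiment (the type t and the obs helper are hypothetical illustrations, not from the paper):

```ocaml
(* Each constructor argument logs a tag when evaluated; the constructed
   value is the same whatever order the compiler picks, only the log
   differs, which is why any order is a valid behavior of the source. *)
type t = K of int * int * int

let log = ref []
let obs tag v = log := tag :: !log; v

let v = K (obs "e1" 1, obs "e2" 2, obs "e3" 3)

let () =
  assert (v = K (1, 2, 3));
  (* the set of effects is fixed; only their order varies *)
  assert (List.sort compare !log = ["e1"; "e2"; "e3"])
```

On stock OCaml the log typically reveals right-to-left evaluation here, while after the TMC transformation the non-call arguments are evaluated before the call in TMC position.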
Reviewers have found this to be an issue, and suggested instead forbidding potentially side-effectful arguments in TMC constructor applications: in this example, if we restrict e_1 and e_3 to be values (or variables), we cannot observe a difference anymore.

In several cases this forces the user to let-bind the arguments beforehand, explicitly expressing an evaluation order. This is a sensible design, but in our experience many functions that would benefit from the TMC transformation are not written in this style, and converting them to be in this style would be a bothersome and invasive change, raising the barrier to entry of the [@tail_mod_cons] annotation – without much benefit in terms of evaluation order, as functions that need to enforce a specific evaluation order should use explicit let-bindings anyway. This is for example the case for the most interesting example of Section 3 (TRMC functions of interest), the tail_map function on a compiler intermediate representation.

This document presents a precisely-defined subset of functions that can be TMC-transformed (they must be decomposable through a TMC context U, with a TMC-specialized function call in tail position or, preferably, strictly in tail position modulo cons).

For many recursive functions in this subset, the TMC transformation returns a “tail-recursive function”, using a constant amount of stack space. However, many other functions are not “tail-recursive modulo cons” (TRMC) in this sense. Initially we wanted to restrict our transformation to reject non-TRMC functions: the success of the transformation would guarantee that the resulting function never overflows the stack. However, we realized that many functions we want to TMC-transform are not in the TRMC fragment.
The interesting tail_map function from Section 3 (TRMC functions of interest) does not provide this guarantee – for instance, its stack usage grows linearly with the nesting of Cifthenelse constructors in the then direction.
If a given function f is marked for TMC transformation, what is the scope of the code in which TMC calls to f should be transformed? If one tried to give a maximal scope, an answer could be: all calls to f in the program. This requires including information on which functions have a DPS version in module boundaries, and thus in module types (to enable separate compilation). We tried to find a smaller scope that is easy to define and does not require cross-module information.

The second idea was the following “minimal” scope: when we have a group of mutually-recursive functions marked for TMC, we should only rewrite calls to those functions in the recursive bodies themselves, not in the rest of the program. In let rec (f_i x = e_i)_i in e', the calls to the f_i in some e_j would get rewritten, but not those in e'. This restriction makes sense, given that generally the stack-consuming calls are done within the recursive bodies, with only a constant number of calls in e' that do not contribute to stack exhaustion.

However, consider the List.flatten function, which flattens a list of lists into a simple list.

let rec flatten = function
  | [] -> []
  | xs :: xss -> xs @ flatten xss
This function is not in the TRMC fragment. However, it can be rewritten to be TMC-transformable by performing a local inlining of the @ operator used to concatenate the two lists:

let[@tail_mod_cons] rec flatten = function
  | [] -> []
  | xs :: xss ->
      let[@tail_mod_cons] rec append_flatten xs xss =
        match xs with
        | [] -> flatten xss
        | x :: xs -> x :: append_flatten xs xss
      in append_flatten xs xss

This definition contains a toplevel recursive function and a local recursive function, and both are in the TMC fragment. However, to get constant stack usage it is essential that the call to append_flatten that is outside its recursive definition be TMC-specialized; otherwise it is no longer a tail call in the transformed program.

For now we have decided to extend the “minimal” scope as follows: for a recursive definition let rec (f_i x = e_i)_i in e', TMC calls to the f_i in the body e' are not rewritten, unless the whole term is itself in the body of a function marked for TMC. In other words, the scope is “minimal” at the toplevel, but “maximal” within TMC-marked functions.

Another alternative is to stick to the minimal scope, and warn on the flatten implementation above. It is still possible to write a TMC flatten in the minimal scope, by extruding the local definition into a mutual recursion:

let[@tail_mod_cons] rec flatten = function
  | [] -> []
  | xs :: xss -> append_flatten xs xss
and[@tail_mod_cons] append_flatten xs xss =
  match xs with
  | [] -> flatten xss
  | x :: xs -> x :: append_flatten xs xss

Indeed, for local definitions it is always possible to rewrite the term let rec (f_i x = e_i)_i in e' by moving e' inside the mutually-recursive part: let rec ((f_i x = e_i)_i, g () = e') in g (). The question would be whether we want to force users to perform this transformation manually, and how to tell them that we expect it.
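The extruded definition runs as ordinary OCaml even with the attributes removed (the [@tail_mod_cons] annotation changes stack behavior, not the computed result), which makes it easy to check:

```ocaml
(* Same mutually-recursive flatten as above, attributes omitted so it
   compiles on any OCaml version (without the constant-stack guarantee). *)
let rec flatten = function
  | [] -> []
  | xs :: xss -> append_flatten xs xss

and append_flatten xs xss =
  match xs with
  | [] -> flatten xss
  | x :: xs -> x :: append_flatten xs xss

let () = assert (flatten [[1; 2]; []; [3]; [4; 5]] = [1; 2; 3; 4; 5])
```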
We formalized the TMC transformation on a first-order language, and our implementation silently ignores the higher-order features of OCaml. What would it mean to DPS-transform a higher-order function such as (let app f x = f x)?

Our answer would be that the higher-order DPS transformation turns a function f into a pair (f_direct, f_dps), allowing us to define

let app_direct (f_direct, f_dps) x = f_direct x
let app_dps (d, i) (f_direct, f_dps) x = f_dps (d, i) x

The partition program below, frustratingly, cannot be TMC-transformed with our current definition or implementation. One can manually express a multi-destination
DPS version, but it is still unclear how to specify the fragment of input programs that would be transformed in this way. (We discussed this with Pierre Chambart.)

let rec partition p = function
  | [] -> ([], [])
  | x :: xs ->
      let (yes, no) = partition p xs in
      if p x
      then (x :: yes, no)
      else (yes, x :: no)

let rec partition_dps dst_yes i_yes dst_no i_no p xs =
  match xs with
  | [] ->
      dst_yes.i_yes <- [];
      dst_no.i_no <- []
  | x :: xs ->
      let dst_yes', i_yes', dst_no', i_no' =
        if p x then
          let dst' = x :: Hole in
          dst_yes.i_yes <- dst';
          (dst', 1, dst_no, i_no)
        else
          let dst' = x :: Hole in
          dst_no.i_no <- dst';
          (dst_yes, i_yes, dst', 1)
      in
      partition_dps dst_yes' i_yes' dst_no' i_no' p xs
References
Daniel P. Friedman and David S. Wise. Unwinding stylized recursions into iterations. Technical report, Indiana University, 1975.

Yasuhiko Minamide. A functional representation of data structures with a hole. In
POPL, 1998.

François Pottier and Jonathan Protzenko. Programming with permissions in Mezzo. In
ICFP, September 2013.

Tore Risch. REMREC – a program for automatic recursion removal in LISP. Technical report, Uppsala Universitet, 1973.
A Performance evaluation
In this section we present preliminary performance numbers for the TMC-transformed version of
List.map, which appear to confirm the claim that this version is “almost as fast as the naive, non-tail-recursive version” – and supports lists of all lengths.

The performance results were produced by Anton Bachin's benchmarking script faster-map, which internally uses the core-bench library, which is careful to reduce measurement errors on short-running functions and also measures memory allocation. The results are “preliminary” in that they were run on a personal laptop (with an Intel Core i5-4500U – Haswell); we have not reproduced them in a controlled environment or on other architectures. We ran the benchmarks several times, with variations, and the qualitative results were very stable.
A.1 The big picture
We graph the performance ratio of various
List.map implementations, relative to the “naive tail-recursive version” that accumulates the results in an auxiliary list which is reversed at the end – the first tail-recursive version of our Prologue. We pointed out that this implementation is “slow” compared to the non-tail-recursive version: for most input lengths it is the slowest version, with other versions taking around 60–80% of its runtime (lower is better). One can also see that it is not that slow: it is at most twice as slow.

The other
List.map versions in the graph are the following:

base
The implementation of Jane Street's Base library (version 0.14.0). It is hand-optimized to compensate for the costs of being tail-recursive.

batteries
The implementation of the community-maintained Batteries library. It is actually written in destination-passing style, using an unsafe encoding with
Obj.magic to unsafely cast a mutable record into a list cell. (The trick comes from the older Extlib library, and its implementation has a comment crediting Jacques Garrigue for the particular encoding used.)

containers
is another standard-library extension, by Simon Cruanes; it is the hand-optimized tail-recursive implementation we included in the Prologue.

trmc is “our” version, the last version of the Prologue: the result of applying our implementation of the TMC transformation to the simple, non-tail-recursive version.

stdlib is the non-tail-recursive version that is in the OCaml standard library. (All measurements used OCaml 4.10.)

stdlib unrolled 5x and trmc unrolled 4x are the result of manually unrolling the simple implementation (to go in the direction of the Base and Containers implementations); in the trmc case, the result is then TMC-transformed.

Our expectations, before running measurements, were that trmc should be about as fast as stdlib, both slower than the manually-optimized implementations (which were tuned to compete with the stdlib version). We hoped that trmc unrolled 4x would be competitive with the manually-optimized implementations. Finally, batteries should be about on par with trmc, as it uses the same implementation technique (as a manual transformation rather than a compiler-supported one).

[Figure: Time elapsed (relative) – lower is better. Y-axis: time relative to the naive tail-recursive version (%); X-axis: list size (no. of elements). Series: naive tail-rec., base, batteries, containers, trmc, trmc unrolled 4x, stdlib, stdlib unrolled 5×.]

Actual results
In the first half of the graph, up to lists of size 1000, the results are as we expected. There are three performance groups: the slow tail-recursive baseline, alone; then batteries, stdlib and trmc, close to each other (batteries does better than the other two, which is surprising – possibly a code-size effect); then the versions using manual optimizations: base, containers, and our unrolled versions.

At the far end of the graph, with the largest list sizes, the results are interesting, and very positive: trmc and batteries are the fastest versions, and containers is slower. base falls back to the slow tail-recursive version after a certain input length, so its graph progressively joins the baseline on larger lists. (Note: later versions of Base switched to an implementation closer to containers in this regime.)

We also got performance results on stdlib and stdlib unrolled on large lists, by configuring the system to remove the call-stack limit to avoid stack overflows; their performance profile is interesting and non-obvious, and we discuss it in Section A.3 (ulimit).

In the third quarter of the graph, for intermediate list sizes, we observe surprising results where the destination-passing-style versions (trmc and batteries) become, momentarily, noticeably slower than the non-tail-recursive stdlib version. We discuss this strange effect in detail in Section A.2 (Promotion effects), but the summary is that it is mostly a GC effect due to the micro-benchmark setting (this strange region disappears with a slightly different measurement), and would not show up in typical programs.

A.2 Promotion effects
What happens in this intermediate region, where destination-passing-style implementations seem to suddenly lose ground against stdlib? Promotion effects in the OCaml garbage collector.

Consider a call to List.map on an input of length N, which is in the process of building the result list. With the standard List.map, the result is built “from the end”: first the list cell for the very last element of the list is allocated, then the cell for the one-before-last element, etc., until the whole list is created. With the TMC-transformed
List.map, the result list is built “from the beginning”: a list cell is first allocated for the first element of the list (with a “hole” in tail position) and written to the destination, then a list cell for the second element is written into the first cell's hole, etc., until the whole list is created.

The OCaml garbage collector is generational, with a small minor heap, collected frequently, and a large major heap, collected infrequently. When the minor heap becomes full, it is collected, and all its objects that are still alive are promoted to the major heap.

What if the minor heap becomes full in the middle of the allocation of the result list? Let's consider the average scenario where promotion happens at cell N/
2. In both cases (stdlib or trmc), one half of the list cells are already allocated (the second half or the first half, respectively), and they get promoted to the major heap. What differs is what happens next. With the non-tail-recursive implementation, the next list cell (in position N/2 −
1) is allocated in the minor heap, pointing to the (N/2)-th cell, which now lives in the major heap; a pointer from the minor heap into the major heap needs no special bookkeeping. With the destination-passing implementation, the next cell is also allocated in the minor heap, but it is then written into the hole of a cell that was just promoted, creating a pointer from the major heap into the minor heap. During a minor collection, the GC must treat minor objects pointed to from the major heap as alive; it cannot afford to traverse the large major heap to determine this category, so it keeps track of all the pointers from the major to the minor heap – this is called the “write barrier”. At this point in our new list's life, each write of a fresh cell into the hole of the previous one goes through the write barrier, so the remaining cells are all considered alive at the next minor collection and get promoted. In the stdlib version, these remaining cells correspond to the first N/2 − 1 cells of the result; they also get promoted... if the result list is still alive at this point!

To recapitulate, the “remainder” of the result list is promoted in all cases in the destination-passing version; it is only promoted in the non-tail-recursive version if the result list is still alive.

“But this is silly”, you say, “who would call List.map on a very large list and drop the result immediately after?” Well, this code:

let make_map_test name map_function argument : Core_bench.Bench.Test.t =
  Core_bench.Bench.Test.create ~name
    (fun () ->
      map_function ((+) 1) argument
      |> ignore)

Remark 1.
There are real-world examples where large results are short-lived. For example, consider a processing pipeline that calls map, and then filter on the result, etc.: the result of map may become dead quickly, if the lists are not too large. It is less likely to hit this promotion effect with medium-sized lists, but if this is all your application code is doing you will still see the effect once every
M/L calls, where L is the length of your lists and M is the size of the minor heap, adding a small but observable promotion overhead.

If we change the code to a version that guarantees that the result is kept alive until the next minor collection, then there should be no difference in promotion behavior. We did exactly this, and the new graph looks like this:

[Figure: same benchmark with the result kept alive. Y-axis: time relative to the naive tail-recursive version (%), lower is better; X-axis: list size (no. of elements). Series: naive tail-rec., base, batteries, containers, trmc, trmc unrolled 4x, stdlib, stdlib unrolled 5×.]

This is the version that we consider representative of most applications, where data is long-lived enough that the subtle promotion effects do not come into play. Notice in particular that, in this version, trmc unrolled is robustly better than all other implementations. (stdlib unrolled is still somewhat faster in the previous “awkward region”, and we are not sure why, nor do we care very much.)
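A minimal sketch of such a keep-alive variant (hypothetical code, not the benchmark's exact implementation): storing the result in a globally reachable reference keeps it alive across the next minor collection, so both implementation styles pay the same promotion cost.

```ocaml
(* Keep the last result reachable so it survives the next minor GC. *)
let keep : int list ref = ref []

let test_keep_alive map_function argument () =
  keep := map_function ((+) 1) argument
```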
A.3 ulimit
Using ulimit -s unlimited, we removed the call-stack limit on our test machine, to test the speed of the non-tail-recursive
List.map on large lists. These behaviors tend to go unobserved by OCaml users, who just get a program crash with a
Stack_overflow exception.
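The crash is easy to reproduce with the direct, non-tail-recursive definition; a small sketch (the exact overflow threshold depends on the stack limit in effect):

```ocaml
(* The direct map overflows the call stack on large enough inputs. *)
let rec direct_map f = function
  | [] -> []
  | x :: xs -> f x :: direct_map f xs

let overflows n =
  match direct_map succ (List.init n (fun i -> i)) with
  | _ -> false
  | exception Stack_overflow -> true
```

On small inputs overflows returns false; under a default stack limit it typically returns true for inputs of a few million elements, while with ulimit -s unlimited it does not overflow at all.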
It is interesting that the stdlib version gets progressively slower on large inputs, until it matches the performance of the tail-recursive baseline. Notice that the stdlib unrolled variant is also getting slower, although it starts with a good safety margin.

Pierre Chambart suggested that this slowdown may result from stack scanning: when the GC performs a minor collection, it scans the call stack to find root pointers to minor objects. When the result list gets much larger than the minor-heap size, many minor collections will occur during the
List.map call, with progressively larger call stacks. The total overhead of the call-stack scanning is thus quadratic; scanning the stack is fast, but it eventually slows those implementations down noticeably. (In theory they would only get slower and slower as we increased the input size.)

Some implementations use the heap rather than a call stack; either the tail-recursive