Zero-cost meta-programmed stateful functors in F*
Jonathan Protzenko
Microsoft Research
Son Ho
Inria
Abstract
We present zero-cost, high-level F★ functors and their compilation to low-level, efficient C code. Thanks to a combination of partial evaluation, fine-grained control of reduction, and tactic-driven C++ template-like metaprogramming, we provide the programmer with a toolkit that dramatically reduces the proof-to-code ratio, brings out the essence of algorithmic and implementation agility, and allows substantial code reuse while remaining at a very high level of abstraction. None of our techniques require modifying the F★ compiler.

We describe a systematic process to develop functors, and illustrate it with the streaming functor, which wraps an error-prone, cryptographic block API by hiding internal buffering and state machine management to prevent C programmer mistakes. We apply this functor to 10 implementations from the HACL★ [31] cryptographic library. We then write a tactic to automate the functor encoding, allowing the programmer to author multi-argument functors with a deeply nested call graph without any syntactic overhead. We apply this general tactic on 5 algorithms from HACL★, yielding over 30 specialized functor applications. We use as an example Curve25519, a complex algorithm whose final, specialized version we express as nested functor applications.

1 Introduction

In recent years [9], projects such as FiatCrypto [20], Jasmin [4, 5], CryptoLine [22], Vale-Crypto [14, 21] or HACL★ [31, 41] have demonstrated that it is now feasible to verify cryptographic algorithms whose performance matches or exceeds state-of-the-art, unverified code from projects like OpenSSL [30] or libsodium [1].
Conference'17, July 2017, Washington, DC, USA. https://doi.org/10.1145/nnnnnnn.nnnnnnn

Building upon such efforts, projects like EverCrypt [32] offer entire verified cryptographic providers, which expose a wide range of modern algorithms, offering an agile and multiplexing API via CPU auto-detection.

There remain, however, several obstacles on the road to wider adoption and distribution of verified cryptography. Oftentimes, verified libraries are integrated into unverified software projects, written in C, C++ or any other popular language. Thus, when authoring verified code, it does not suffice to merely implement an algorithm as specified by the
RFC; the author of verified code must strive for elegant, intuitive, easy-to-use APIs that minimize the risk of programmer error. Designing such a safe API requires more layers of abstraction, which increases the verification burden.

As an example, consider hash algorithms like SHA-2 [2] or Blake2 [8]. They are block-based, which requires clients to obey a precise state machine, feeding the data to be hashed block by block, and invalidating the state when processing the final data. This is error-prone and almost certain to result in programmer mistakes. There exists a safer API with better usability, which various libraries aim to provide. The safe API maintains an internal buffer, and eliminates almost all possible mistakes when using it from C. Yet, even Blake2's own safe, reference API implementation [34] contained a buffer management bug that went undetected for seven years [29]!

The natural conclusion is that verified cryptographic libraries need to expose only safe, verified APIs. Unfortunately, none of them currently do. One reason is that the amount of effort required is tremendous: verifying a non-trivial, safe API for SHA2-256 is certainly feasible [7], but doing it over and over for a dozen similar algorithms is out of the question. Verifying safe APIs is tedious, repetitive, and until now did not lend itself well to automation.

A reason for the lack of automation is that the commonality between algorithms has not yet been formally captured. Consider HACL★, which at this time offers 17 algorithms, spread across about 40 implementations. We observe that at least ten of those algorithms obey almost identical state machines and behave in a fashion almost identical to Blake2. Yet, a formal argument was never made to assert that these algorithms really do behave in the same way.

Another reason for the lack of automation is that so far, no systematic techniques have been established to "verify in the large" [18].
If those 10 algorithms behave similarly, it should be feasible to write and verify the safe API once and for all, and get 10 copies of it "for free". In practice, this requires, in addition to capturing the commonality, devising techniques, tools and strategies to write and verify code at this very high level of abstraction. To add to the challenge, slow cryptographic code is dismissed by practitioners, which makes it impossible to use standard functional programming paradigms such as type classes with run-time dictionaries.

All of these shortcomings show up in existing work. For instance, EverCrypt offers only a single safe API for hashing, briefly mentioned but not explained in [32, XIII]. Lacking any better ways of automating those proofs, Protzenko et al. were not able to verify more than a single safe API.

This paper, using F★ as our lingua franca, presents a series of language-based techniques. We explain how to conceptualize commonality between algorithms, using block APIs as our running example. We show how the programmer can think, write and verify at the highest level of abstraction, writing functors that transform an unsafe block API into a safe API once and for all, without incurring any performance penalty at run-time. We show all the techniques that enable this style of programming without needing to modify the F★ compiler. Specifically, we show how to use a combination of partial evaluation and meta-programming to write code in a style akin to C++ templates, and specialize it for free for multiple instances. Our evaluation shows that programmer effort is reduced, at the expense of a modest increase in verification time for meta-generated code.

All throughout the paper, we use cryptography as the main driving force. There are two reasons.
First, cryptographic APIs offer the highest level of complexity, combining mutable state, state machines, abstract representations and abstract predicates. While our techniques have been successfully applied to simpler use-cases, e.g. data structures and associative lists, we wish to showcase the full power and generality of our approach. Second, we benefit from the large existing codebase of cryptographic code in F★, meaning we can perform a real-world quantitative evaluation of how our techniques make verification more scalable.

All of our ideas have been implemented, verified and demonstrated on the HACL★ and EverCrypt libraries. Our code has been integrated in the HACL★ repository and is now used by Firefox, Linux, the Tezos blockchain, and others.

Our paper is structured as follows:
• we provide some detailed background (§2) on the F★-to-C toolchain that HACL★ uses;
• we explore the commonality problem (§3), also known as "agility" in cryptographic lingo, using type classes as a key technical device;
• equipped with a precise definition of a block algorithm, we show how to manually write a functor that generates a safe API for any block algorithm (§4);
• getting such higher-order code to compile to C without runtime overhead is a challenge; we list a constellation of techniques (§5) that, put together, completely eliminate the cost of high-level programming thanks to partial evaluation;
• manually writing functors can prove quite tedious; we use the full power of Meta-F★ (§6) to automate the functor encoding, allowing programmers to write generic code without any syntactic penalty;
• we evaluate the benefits of our techniques (§7), quantifying the verification effort for the programmer, as well as the performance impact on verification times.

While demonstrated on F★, we believe both the language techniques and the design of the cryptographic functors are reusable in other verification settings, languages and toolchains. For instance, identifying and capturing
the block API is a generic contribution that can be reproduced, say, in Coq [38] just as well. Similarly, the idea of authoring the streaming functor and relying on partial evaluation to eliminate run-time overhead could be applied to, say, LMS-Scala [6].

2 F★, Low★, Meta-F★

HACL★ [31, 41] is a cryptographic library written in F★, which we use as a baseline to author, evaluate and integrate our proof techniques. HACL★ compiles to C, and offers vectorized versions of many algorithms via C compiler intrinsics, e.g. for targets that support AVX, AVX2 or ARM Neon. EverCrypt is a high-level API that multiplexes between HACL★ and Vale-Crypto [21] and supports dynamic selection of algorithms and implementations based on the target CPU's feature set. Combined with EverCrypt, HACL★ features 130k lines of F★ code for 62k lines of generated C code (all excluding comments and whitespace).

F★ is a state-of-the-art verification-oriented programming language. Hailing from the tradition of ML [28], F★ features dependent types and a user-extensible effect system, which allows reasoning about IO, concurrency, divergence, various flavors of mutability, or any combination thereof. For verification, F★ uses a weakest-precondition calculus based on Dijkstra Monads [3, 37], which synthesizes verification conditions that are then discharged to the Z3 [16] SMT solver. Proofs in F★ typically are a mixture of manual reasoning (calls to lemmas), semi-automated reasoning (via tactics) and fully automated reasoning (via SMT).

Low★ is a subset of F★ that exposes a carefully curated set of features from the C language. Using F★'s effect system, Low★ models the C stack and heap, and allocations in those regions of the memory. Low★ also models data-oriented features of C, such as arrays, pointer arithmetic, machine integers with modulo semantics, const pointers, and many others, via a set of distinguished libraries.
Programming in Low★ guarantees spatial safety (no out-of-bounds accesses), temporal safety (no double frees, no use-after-free) and a form of side-channel resistance [33, 41]. All of these guarantees are enforced statically and incur no run-time checks.

To provide a flavor of programming in Low★, we present the mk_update_blake2 function below, taken from HACL★.

val mk_update_blake2 (a: blake_alg) (v: vectorization) (s: state a v)
  (totlen: U64.t) (data: block a): Stack U64.t
  (requires 𝜆 h → live h s ∧ live h data ∧ disjoint s data)
  (ensures 𝜆 h0 totlen' h1 → modifies1 s h0 h1 ∧
    (as_seq h1 s, U64.v totlen') ==
      Blake2.Spec.update a (as_seq h0 s) (U64.v totlen) (as_seq h0 data))

Functions in Low★ are annotated with their return effect, in this case Stack, which indicates the function only performs stack allocations, and therefore is guaranteed to have no memory leaks. For all other cases, programmers may use the ST effect. The return type of the function is U64.t, the type of 64-bit unsigned machine integers with modulo semantics. Functions are specified using pre-conditions and two-state post-conditions. Low★ relies on a modifies-clause theory [24]: the pre-condition demands that arrays data and s be disjoint, while the post-condition ensures that the only memory location affected by a call to mk_update_blake2 is s. If a client holds an array a, then the combination of disjoint a s and modifies1 s h0 h1 allows them to derive automatically that a is unchanged in h1. The liveness clauses ensure accesses to s and data are valid in the body of mk_update_blake2. Importantly, the functional behavior of mk_update_blake2 is specified via Blake2.Spec.update, a pure function that forms our specification.
This style is referred to as intrinsic reasoning, or implementation refinement: we prove that the low-level behavior of mk_update_blake2 is characterised by a pure, trusted and carefully audited specification. Various functions allow reflecting Low★ objects as their pure counterparts: in this case, as_seq reflects the contents of s in h1 as a pure sequence, and U64.v reflects a machine integer as a pure unbounded mathematical number. Functions such as as_seq are in effect Ghost, meaning that they may only be used in proofs and refinements, and cannot appear at run-time.
Partial evaluation is a powerful, trusted mechanism in F★. With it, the programmer can trigger various steps of reduction using F★'s trusted normalizer without having to perform any proof. F★ exposes keywords, attributes and functions to control this mechanism. We look back at the example of mk_update_blake2: as it stands, this function is actually not valid Low★: the type state is indexed over two arguments, capturing the choice of algorithm a and implementation v. Here, lbuffer t l is an array of type t and length l.

inline_for_extraction
let blake2_state a v = lbuffer (element_t a v) (4ul *. row_len a v)

Such a type definition cannot be safely extracted to C and is not valid Low★. The special "inline for extraction" attribute, however, indicates to F★ that right before extraction, occurrences of blake2_state should be replaced with their definition. This in turn triggers more reduction steps, where matches are reduced, unreachable branches eliminated, and beta-redexes evaluated away, meaning that if we apply mk_update_blake2 to constant values for its first two arguments, we actually get Low★ code after partial evaluation:

let update_blake2s_32 = mk_update Blake2S Scalar // regular C
let update_blake2s_128 = mk_update Blake2S AVX // vectorized

The "inline for extraction" mechanism applies to functions, including stateful ones: F★ introduces an A-normal form [35], meaning a stateful call f e becomes let x = e in f x, which then allows 𝛽-reduction since x is a value. We make extensive use of this keyword all throughout this paper, as it provides a way to drastically slash code duplication. For instance, HACL★ has a single implementation of Blake2, which partially evaluates to four different implementations depending on the variant (Blake2s vs. Blake2b) and the degree of vectorization (C, AVX, AVX2).

Beyond this keyword, other mechanisms exist for partial evaluation.
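As a rough C analogue of what this style buys (illustrative code, not F★ or HACL★ output; all names are hypothetical): a generic update is indexed by a meta-level algorithm choice, and the "functor application" is a first-order wrapper applying it to a constant. A C compiler inlining the calls with a constant alg plays, in spirit, the role of F★'s normalizer reducing the match away.

```c
#include <stdint.h>

/* Meta-level algorithm choice, never inspected at run-time once
   specialization has happened. */
typedef enum { BLAKE2S, BLAKE2B } alg;

static inline uint32_t word_bytes(alg a) { return a == BLAKE2S ? 4u : 8u; }

/* Generic code: `a` is a meta-level argument. */
static inline void mk_update(alg a, uint32_t *state, const uint8_t *block,
                             uint32_t len) {
  uint32_t w = word_bytes(a);
  for (uint32_t i = 0; i < len; i++)
    state[i % w] += block[i];  /* toy block processing */
}

/* The specialized, first-order function left behind for C. */
void update_blake2s(uint32_t *state, const uint8_t *block, uint32_t len) {
  mk_update(BLAKE2S, state, block, len);
}
```

With constant propagation, update_blake2s contains no trace of the alg dispatch, which is the zero-cost property the partial-evaluation pipeline guarantees by construction.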
The [@inline_let] attribute allows reducing pure let-bindings inside function definitions, and the normalize call allows very fine-grained control of the reduction flags for a sub-term. The definition of mk_update_blake2 uses both.

Meta-F★ is a recent extension of F★ [26] that allows the programmer to script the F★ compiler using user-written F★ programs, an approach known as elaborator reflection and pioneered by Lean [17] and Idris [15]. Meta-F★ offers, by design, a safe API for term manipulation, meaning it re-checks the results of meta-program execution: if a meta-program attempts to synthesize an ill-typed term, F★ aborts. Therefore, tactics do not need to be statically proven correct and enjoy a great deal of flexibility. Tactics are written in the Tac effect, and in this paper we use the terms "tactic" and "meta-program" interchangeably.
Erasure and extraction in F★ follow Letouzey's extraction principles for Coq [25]. After type-checking and performing partial evaluation, F★ erases computationally-irrelevant code and performs extraction to an intermediary representation dubbed the "ML AST". For erasure, F★ eliminates refinements, pre- and post-conditions, and generally replaces computationally irrelevant terms with units, i.e. any subexpression of type Ghost becomes (). F★ also removes calls to unit-returning functions, which means that calls to lemmas are also eliminated.

For extraction, F★ ensures that the "ML AST" features only prenex polymorphism (i.e. type schemes), and that it is annotated with classic ML types. Naturally, not all F★ programs are type-able per the ML rules: using a bidirectional approach, extraction inserts necessary casts, and replaces types that are invalid in ML with ⊤, the uninformative type. The ML AST can then be further compiled to three different targets. Going to OCaml, ⊤ becomes Obj.t and casts become calls to Obj.magic; owing to OCaml's uniform value representation, any F★ program can be compiled to OCaml. Going to F#, there is no equivalent of the ⊤ type.

KreMLin [33] compiles the "ML AST" to readable, auditable C by using a series of small, composable passes. The KreMLin preservation theorem [33] states that the safety guarantees in Low★ carry over to the generated C code. We present a few transformations that are relevant to this work. The ML AST supports parameterized data types, such as pairs, tuples, and user-defined inductives, e.g. type ('a, 'b) pair = Pair of 'a * 'b. First, KreMLin removes unused type parameters. This is important for deeply dependent inductive indices: as long as they only appear in refinements, the resulting ⊤ type parameter in the ML AST is eliminated by KreMLin. Next, KreMLin performs a whole-program analysis and monomorphizes types, functions and polymorphic equalities based on usage at call-sites. This process is similar to MLton [40]. KreMLin also performs unit elimination: unit arguments are removed from functions and unit fields are removed from data types. This, in particular, means that a data type that features a Ghost field will incur no penalty in C, since Ghost is erased by F★ to unit. Furthermore, KreMLin features five different compilation schemes for inductive types. The default uses a tagged-union scheme, but for single-case inductives, a clean C struct is used rather than waste space for the (useless) tag. Furthermore, if the constructor itself takes a single argument, the struct is altogether eliminated.

Figure 1. State machine of an error-prone block-based API

Figure 2. State machine of a safe, streaming API

We posited earlier that many algorithms exhibit large amounts of commonality, and essentially behave the same way. We provide some basic cryptographic context, then show how we can describe what a block API is in F★.

Many cryptographic algorithms offer identical or similar functionalities. For example, SHA2 [2], SHA3 [19] and Blake2 [8] (in no-key mode) all implement the hash functionality, taking an input text to compute a resulting digest. As another example, HMAC [11], Poly1305 [13], GCM [27] and Blake2 implement the message authentication code (MAC) functionality, taking an input text and a key to compute a digest. At a high level, these functionalities are simply black boxes with one or two inputs, and a single output. Taking HACL★'s SHA2-256 implementation as an example, this results in a natural, self-explanatory C API:

void sha2_256(uint8_t *input, uint32_t input_len, uint8_t *dst);

This "one-shot" API, however, places unrealistic expectations on clients of this library. For instance, the TLS protocol, in order to authenticate messages, computes repeated intermediary hashes of the handshake data transmitted so far. Using the one-shot API would be grossly inefficient, as it would require re-hashing the entire handshake data every single time. In other situations, such as Noise protocols, just hashing the concatenation of two non-contiguous arrays with this API requires a full copy into a contiguous array. Cryptographic libraries thus need to provide a different API that allows clients to perform incremental hash computations.
A natural candidate for this is the block API: all of the algorithms we mentioned above are block-based, meaning that under the hood, they initialize their internal state, process the data block-by-block (for an algorithm-specific block size), perform some special treatment for the leftover data, and then extract the internal state onto a user-provided destination buffer, which holds the final digest. Revealing this API (Figure 1) would allow clients to feed their data into the hash incrementally.

The issue with this block API is that it is wildly unsafe to call from unverified C code. First, it requires clients to maintain a block-sized buffer that, once full, must be emptied via a call to update_block. This entails non-trivial modulo-arithmetic computations and pointer manipulations, which are error-prone [29]. Second, clients can very easily violate the state machine. For instance, when extracting an intermediary hash, clients must remember to copy the internal hash state, call the sequence update_last / finish on the copy, free that copy, and only then resume feeding more data into the original hash state. Third, algorithms exhibit subtle differences: for instance, Blake2 must not receive empty data for update_last, while SHA2 is absolutely fine. In short, the block API is error-prone, confusing, and is likely to result in programmer mistakes.

We thus wish to take all of the block-based algorithms, and devise a way to wrap their respective block APIs into a uniform, safe API that eliminates all of the pitfalls above. We dub this safe API the streaming API (Figure 2): it has a degenerate state machine with a single state; it performs buffer management under the hood; it hides the differences between algorithms; and performs necessary copies as needed when a digest needs to be extracted. Writing and verifying a copy of the streaming API for each one of the eligible algorithms would be tedious, not very much fun, and bad proof engineering.
We thus set out to write a functor that takes any block API and returns the corresponding streaming API. But for that, we first need to state what a block API is.

Before we get to the block API itself, we need to capture a more basic notion, that of an abstract piece of data that lives

type stateful = | Stateful:
  (* Low-level type *)
  s: Type0 →
  footprint: (h: mem → s: s → Ghost loc) →
  invariant: (h: mem → s: s → Type) →
  (* A pure representation of an s *)
  t: Type0 →
  v: (h: mem → s: s → Ghost t) →
  (* Adequate framing lemmas, relying on v *)
  invariant_loc_in_footprint: (h: mem → s: s → Lemma
    (requires (invariant h s))
    (ensures (loc_in (footprint h s) h))) →
  frame_invariant: (l: loc → s: s → h0: mem → h1: mem → Lemma
    (requires (invariant h0 s ∧ loc_disjoint l (footprint h0 s) ∧
      modifies l h0 h1))
    (ensures (invariant h1 s ∧ v h0 s == v h1 s ∧
      footprint h1 s == footprint h0 s))) →
  (* Stateful operations *)
  alloca: ... → Stack ... →
  malloc: ... → ST ... →
  free: ... → Stack ... →
  copy: ... → Stack ... →
  (* end of type class *)
  stateful

Figure 3.
The stateful API

in memory, composes with the Low★ memory model and modifies-clause theory, and supports basic operations such as allocation, de-allocation and copy. This is the stateful type class, presented in Figure 3. It captures many of the ideas presented in §2, except now in an abstract fashion.

The low-level type s (e.g. lbuffer U8.t ul) comes with an abstract footprint (e.g. the extent of that array), and an abstract invariant (e.g. the array is live). The low-level type can be reflected as a pure value of type t (e.g. a sequence) using a v function (e.g. as_seq, see §2). The administrative lemmas allow harmonious interaction with Low★'s modifies-clause theory; the first captures via loc_in an abstract notion of liveness which stateful s must observe, and which allows clients to automatically derive disjointness of a fresh allocation. The second lemma allows automatic framing of the invariant thanks to a suitable SMT pattern (elided here). The stateful operations allow, respectively, allocating a fresh state s on the stack and on the heap; freeing a heap-allocated state; and copying the state.

Writing instances of stateful is easy, the most complex one being the internal state of Blake2, which occupies 46 lines of code, with all proofs going through automatically.

A procedural note: stateful is, for all intents and purposes, a type class [36], but we do not use the type class syntax of F★, which did not work with Low★ code when we started this work. We directly use a single-constructor inductive, which is what the type class syntax desugars to. We plan to switch to the class keyword once all the bugs are fixed in F★.
We now capture the essence of a block algorithm by authoring a type class that encapsulates a block algorithm's types, representations, specifications, lemmas and stateful implementations in one go. We need the block type class to capture four broad traits of a block algorithm, namely: i) explain the runtime representation and spatial characteristics of the block algorithm; ii) specify as pure functions the transitions in the state machine; iii) reveal the block algorithm's central lemma, i.e. processing the input data block by block is the same as processing all of the data in one go; and iv) expose the low-level run-time functions that realize the transitions in the state machine. The result appears in Figure 4 in a simplified form (the actual definition is about 150 lines of F★).

Run-time characteristics.
A block algorithm revolves around its state, of type stateful. A block algorithm may need to keep a key at run-time (km = Runtime, e.g. Poly1305), or keep a ghost key for specification purposes (km = Erased, e.g. keyed Blake2), or may need no key at all, in which case the key field is a degenerate instance of stateful where key.s = unit.

Specification.
Using state.t, i.e. the algorithm's state reflected as a pure value, we specify each transition of the state machine (init_s, update_multi_s, update_last_s, finish_s in Figure 4). Importantly, rather than specify an "update block" function, we use an "update multi" function that can process multiple blocks at a time. We don't impose any constraints on how update_multi_s is authored: we just request that it obeys the fold law via the update_multi_is_a_fold lemma.

This style has several advantages. First, this leaves the possibility for optimized algorithms that process multiple blocks at a time to provide their own update_multi function, rather than being forced to inefficiently process a single block. For unoptimized algorithms that are authored with update_block, we provide a higher-order combinator that derives an update_multi function and its correctness lemma automatically. Second, by being very abstract about how the blocks are processed, we capture a wide range of behaviors for block algorithms. For instance, Poly1305 has immutable internal state for storing precomputations, along with an accumulator that changes with each call to update_block: we simply pick state.t to be a pair, where the fold only operates on the second component.
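The combinator just described can be sketched in C (a toy model with hypothetical names; the real combinator operates on F★ specifications and also derives the correctness lemma): update_multi is obtained by folding a per-block update over the input, so the fold law holds by construction.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* A per-block state transformer: the analogue of update_block. */
typedef void (*update_block_t)(char *st, const char *block, size_t block_len);

/* Derive an update_multi from update_block by folding over the blocks. */
static void mk_update_multi(update_block_t update_block, size_t block_len,
                            char *st, const char *blocks, size_t len) {
  assert(len % block_len == 0);  /* update_multi only sees full blocks */
  for (size_t i = 0; i < len; i += block_len)
    update_block(st, blocks + i, block_len);
}

/* Toy instance for testing: state is a string, each block is appended. */
static void concat_block(char *st, const char *block, size_t block_len) {
  strncat(st, block, block_len);
}
```

The fold law then says that processing blocks b1 followed by b2 equals processing their concatenation in one call, which is immediate from the loop structure.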
The block lemma. The spec_is_incremental lemma captures the key correctness condition and ties all of the specification functions together. It relies on a helper, split_at_last, which was carefully crafted to subsume the different behaviors between Blake2 and other block algorithms.

let split_at_last (block_len: U32.t) (b: seq U8.t) =
  let n = length b / block_len in
  let rem = length b % block_len in
  let n = if rem = 0 && n > 0 then n - 1 else n in
  let blocks, rest = split b (n * block_len) in
  blocks, rest

type block = | Block:
  km: key_management (* km = Runtime ∨ km = Erased *) →
  (* Low-level types *)
  state: stateful →
  key: stateful →
  (* Introducing a notion of blocks and final result *)
  max_input_length: (x: nat { 0 < x ∧ x < pow2 64 }) →
  output_len: (x: U32.t { U32.v x > 0 }) →
  block_len: (x: U32.t { U32.v x > 0 }) →
  (* The one-shot specification *)
  spec_s: (key.t → input: seq U8.t { length input ≤ max_input_length } →
    output: seq U8.t { length output == U32.v output_len }) →
  (* The block specification *)
  init_s: (key.t → state.t) →
  update_multi_s: (state.t → prevlen: nat →
    s: seq U8.t { length s % U32.v block_len = 0 } → state.t) →
  update_last_s: (state.t → prevlen: nat →
    s: seq U8.t { length s ≤ U32.v block_len } → state.t) →
  finish_s: (key.t → state.t → s: seq U8.t { length s = U32.v output_len }) →
  update_multi_is_a_fold: ... →
  (* Central correctness lemma of a block algorithm *)
  spec_is_incremental: (key: key.t →
    input: seq U8.t { length input ≤ max_input_length } →
    Lemma (
      let bs, l = split_at_last (U32.v block_len) input in
      let hash0 = update_multi_s (init_s key) 0 bs in
      let hash1 = finish_s key (update_last_s hash0 (length bs) l) in
      hash1 == spec_s key input)) →
  update_multi: (s: state.s → prevlen: U64.t →
    blocks: buffer U8.t { length blocks % U32.v block_len = 0 } →
    len: U32.t { U32.v len = length blocks ∧ ... (* omitted *) } →
    Stack unit
      (requires ... (* omitted *))
      (ensures 𝜆 h0 _ h1 →
        modifies (state.footprint h0 s) h0 h1 ∧
        state.footprint h0 s == state.footprint h1 s ∧
        state.invariant h1 s ∧
        state.v h1 s ==
          update_multi_s (state.v h0 s) (U64.v prevlen) (as_seq h0 blocks) ∧
        ...)) →
  ... (* rest of the block type class, e.g. init, finish... *) →
  block

Figure 4. The block API

[@CAbstractStruct]
type state_s (c: block) = | State:
  block_state: c.state.s →
  buf: B.buffer U8.t { B.len buf = c.block_len } →
  total_len: U64.t →
  seen: G.erased (S.seq U8.t) →
  p_key: optional_key c.km c.key →
  state_s c

let state (c: block) = pointer (state_s c)

Figure 5. The streaming algorithm's state
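The split_at_last helper above can be transliterated into C over lengths alone (an illustrative sketch, not generated code): it returns the split point between the full blocks to be folded by update_multi and the remainder handed to update_last. The twist is the third line: a nonempty input that is an exact multiple of the block size keeps one full block in the remainder, which accommodates Blake2's requirement that update_last never receive empty data.

```c
#include <stddef.h>

/* Returns r such that blocks = b[0..r) and rest = b[r..len). */
static size_t split_at_last(size_t block_len, size_t len) {
  size_t n = len / block_len;
  size_t rem = len % block_len;
  if (rem == 0 && n > 0)
    n--;                 /* exact multiple: keep one block for update_last */
  return n * block_len;
}
```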
Stateful implementations. We now zoom in on the update_multi low-level signature, which describes a block algorithm's run-time processing of multiple blocks in one go (Figure 4). This function is characterized by the spec-level update_multi_s; it preserves the invariant as well as the footprint; and only affects the block algorithm's state when called. The combination of spec_is_incremental along with the Low★ signatures of update_multi and others restricts the API in a way that the only valid usage is dictated by Figure 1. Designing this type class while looking at a wide range of algorithms has forced us to come up with a precise, yet general enough description of what a block algorithm is: we have been able to author instances of this type class for SHA2 (4 variants), Blake2 (2 variants), Poly1305, and the legacy algorithms MD5 and SHA1. We plan to significantly extend the set of available instances, adding vectorized variants of Poly1305 and Blake2 to the mix, along with new algorithms such as SHA3.

Equipped with an accurate and precise description of what a block algorithm is, we are now ready to write an API transformer that takes an instance of block, implementing the state machine from Figure 1, and returns the safe API from Figure 2. We call this API transformer a functor, since once applied to a block it generates type definitions for the internal state, specifications, correctness lemmas and of course the five low-level, runtime functions that implement the transitions from Figure 2. Since F★ has no native support for functors, we describe a somewhat manual encoding; §6 shows how to automate this encoding with Meta-F★.

The streaming functor's state is naturally parameterized over a block (Figure 5), and wraps the block algorithm's state with several other fields. The CAbstractStruct attribute ensures that the following C code will appear in the header. This pattern is known as "C abstract structs", i.e.
the client cannot allocate structs or inspect private state, since the definition of the type is not known; it can only hold pointers to that state, which forces them to go through the API and provides a modicum of abstraction.

struct state_s;
typedef struct state_s *state;

First, buf is a block-sized internal buffer, which relieves the client of having to perform modulo computations and buffer management. Once the buffer is full, the streaming functor calls the underlying block algorithm's update_multi function, which effectively folds the blocks into the block_state. The key is optional, and total_len keeps track of how much data has been fed so far. The most subtle point is the use of a ghost sequence of bytes, which keeps track of the past, i.e. the bytes we have fed so far into the hash. This is reflected in the functor's invariant, which states that if we split the input data into blocks, then the current block algorithm state is the result of accumulating all of the blocks into the block state; the rest of the data that doesn't form a full block is stored in buf.

let state_invariant (c: block) (h: mem) (s: state c) =
  let s = deref h s in
  let State block_state buffer total_len seen key = s in
  let blocks, rest = split_at_last (U32.v c.block_len) seen in
  (* omitted *) ... ∧
  c.state.v h block_state ==
    c.update_multi_s (c.init_s (optional_reveal h key)) 0 blocks ∧
  slice (as_seq h buffer) 0 (length rest) == rest

The mk_finish function takes a block algorithm c and returns a suitable finish function usable with a state c. Under the hood, it calls c.state.copy to avoid invalidating the block_state; then c.update_last followed by c.finish, the last two transitions of Figure 1.
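The buffering discipline just described can be sketched in C with a toy block algorithm (a running byte sum; all names are illustrative, and this is not HACL★'s generated API): input accumulates in a block-sized buffer, full blocks are folded into the block state as they complete, and a digest works on a copy of the block state so the stream remains usable afterwards.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define BLOCK_LEN 4

typedef struct {
  uint32_t block_state;      /* toy block-algorithm state: sum of bytes */
  uint8_t buf[BLOCK_LEN];    /* leftover bytes, always < BLOCK_LEN after update */
  size_t buf_len;
} stream_t;

static void toy_update_block(uint32_t *st, const uint8_t *b) {
  for (size_t i = 0; i < BLOCK_LEN; i++) *st += b[i];
}

static void stream_init(stream_t *s) { memset(s, 0, sizeof *s); }

static void stream_update(stream_t *s, const uint8_t *data, size_t len) {
  for (size_t i = 0; i < len; i++) {
    s->buf[s->buf_len++] = data[i];
    if (s->buf_len == BLOCK_LEN) {   /* buffer full: fold it into the state */
      toy_update_block(&s->block_state, s->buf);
      s->buf_len = 0;
    }
  }
}

static uint32_t stream_digest(const stream_t *s) {
  uint32_t copy = s->block_state;    /* work on a copy: the stream stays live */
  for (size_t i = 0; i < s->buf_len; i++)
    copy += s->buf[i];               /* toy update_last + finish */
  return copy;
}
```

This simplification flushes every full block eagerly and allows update_last to see empty data; the verified functor instead uses split_at_last, precisely to accommodate algorithms like Blake2.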
Thanks to the correctness lemma in c along with the invariant, mk_finish states that the digest written in dst is the result of applying the full block algorithm to the data that was fed into the streaming state so far.

  val mk_finish: c:block →
    s:state c →
    dst:B.buffer U8.t { B.len dst == c.output_len } →
    Stack unit
      (requires λ h0 → ... (* omitted *))
      (ensures λ h0 s' h1 → ... ∧ (* some omitted *)
        as_seq h1 dst == c.spec_s (get_key c h0 s) (get_seen c h0 s))

This particular usage of a ghost variable is actually the third iteration of the streaming API, and the one that we have found easiest to use and be productive with. It allows authoring a function get_seen that, in any heap, returns the bytes seen so far; previously, the user was required to materialize the previously-seen bytes as a ghost argument to mk_finish, which incurred a substantial syntactic burden.

This streaming API has two limitations. First, we cannot prove the absence of memory leaks. This is a fundamental limitation of Low★, which cannot show that a malloc followed by a free is morally equivalent to being in the Stack effect. However, this can easily be addressed with manual code review or off-the-shelf tools, such as clang's -fsanitize=memory. The second is that there is still a source of unsafety for C clients: they may exceed the maximum amount of data that can be fed into the block algorithm. This is purely a design decision: since the limit is never less than 2^61 bytes (that's two million terabytes), we chose not to penalize the vast majority of C clients, who will never exceed that limit, and leave it up to clients who may encounter such extreme cases to perform length-checking themselves.

We now focus on the usage of meta-level arguments, which act as tweaking knobs to control the shape of the streaming API.
Clients can of course choose a suitable block size and suitable types for the block state and key representation, which influences the result of the functor application. But the key management is of particular interest.

  noextract
  type key_management = | Runtime | Erased

  inline_for_extraction
  let optional_key (km: key_management) (key: stateful): Type =
    match km with
    | Runtime → key.s
    | Erased → Ghost.erased key.t

The km parameter of the block type class is purely meta, and will never be examined at run-time. It allows the block algorithm to indicate whether it needs a key. In the streaming code, every reference to key goes through a wrapper like the one above. After partial application, the optional_* wrappers reduce to either a proper key type, or to a ghost value, which then gets erased to unit; this means the key field of the state entirely disappears, thanks to KreMLin's unit-field elimination. The streaming API's init function unconditionally takes a key at run-time; but for algorithms like hashes, it suffices to pick c.key.s = unit and the superfluous argument to init gets eliminated too.

In reality, the entire type class is parameterized over an index, omitted here for conciseness. The index is used ghostly, except for the init function. This enables run-time agility, e.g. a streaming API that works for any hash algorithm; the state then becomes a state a, where a is the chosen hash algorithm, and every single definition we have seen becomes agile over the choice of a. Using this, we trivially re-implement EverCrypt's old incremental hashing module, making it a mere application of the streaming functor, where the index allows choosing a particular hash algorithm at init-time.
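The Runtime/Erased knob has a direct C++ counterpart: select the key field's type from a meta-level parameter, so that the erased case can occupy no storage, much like KreMLin's unit-field elimination. State and NoKey below are hypothetical illustrations, not the generated code:

```cpp
#include <cstdint>
#include <type_traits>

enum class KeyManagement { Runtime, Erased };

struct NoKey {};  // analog of the erased/ghost key: carries no data

// Hypothetical streaming state: the key field's type is selected by a
// meta-level parameter, mirroring the paper's optional_key wrapper.
template <KeyManagement Km, typename Key>
struct State {
  std::uint64_t total_len = 0;
  // Runtime: a real key; Erased: an empty placeholder that the compiler
  // may lay out at zero size (the unit-field-elimination analog).
  [[no_unique_address]]
  std::conditional_t<Km == KeyManagement::Runtime, Key, NoKey> key;
};

using Keyed   = State<KeyManagement::Runtime, std::uint32_t>;
using Keyless = State<KeyManagement::Erased,  std::uint32_t>;
```

Just as in the F★ encoding, the choice is resolved entirely at compile time: a Keyless state never pays for a key field at run-time.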
We now turn our attention to extraction, and explain how to carefully tweak the streaming functor so that, once applied to a given block algorithm, it yields first-order, specialized code that fits in the Low★ subset that can compile to C. In this section, we use F★ flags and attributes without resorting to Meta-F★ tactics. While we use the streaming functor as our running example, the techniques are systematic and can be applied to any hand-written functor in F★.

We now show how to use both the inline_for_extraction keyword and the [@inline_let] attribute (§2) to ensure that, after F★ has performed its extraction-specific normalization run, no traces are left of the functor argument, and the resulting code only contains first-order Low★ code. In other words, we completely eliminate accesses to the type class dictionary via partial evaluation, meaning no run-time overhead. We focus on mk_finish, the streaming API's finish function (Figure 6). The function contains numerous patterns that need to be inlined away, which makes it representative of the streaming functor as a whole.

   1  let mk_finish idx (c: block_alg idx) (i: idx) (p: state) dst =
   2    [@inline_let] let _ = c.update_multi_is_a_fold i in
   3
   4    let h0 = ST.get () in
   5
   6    let State block_state buf_ total_len seen k' = !*p in
   7    push_frame ();
   8    let h1 = ST.get () in
   9    c.state.frame_invariant i B.loc_none block_state h0 h1;
  10
  11    let r = rest c i total_len in
  12    let buf_ = B.sub buf_ 0ul (rest c i total_len) in
  13    assert ((U64.v total_len - U32.v r) % U32.v (c.block_len i) = 0);
  14
  15    let tmp_block_state = c.state.alloca i in
  16    c.state.copy (G.hide i) block_state tmp_block_state;
  17    ...

  Figure 6. The functor's finish function

The let-binding at line 2 serves only to bring the associativity lemma from the type class into the scope of the SMT context, so that its associated pattern can trigger and save the programmer from having to call the lemma manually. The usage of [@inline_let] eliminates this partial application. We use [@inline_let] in numerous other places in the streaming functor, to generate cosmetically more pleasant C code. Built-in constructs of Low★, such as lines 4 and 7, receive special treatment in the toolchain and are eliminated. Calls to lemmas and assertions, at lines 9 and 13, have type Ghost unit. The KreMLin compiler eliminates superfluous units, so these disappear too. At lines 15-16, we need to copy the block state into a temporary, in order to obtain the state machine mentioned earlier (Figure 2). The syntax hides nested calls to projectors of the block and stateful type classes, respectively.
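The effect of partially evaluating away the dictionary can be mimicked in C++ by making the "type class" a template argument; every projection then reduces at compile time. Sha256Like and rest below are illustrative stand-ins, not HACL★ definitions:

```cpp
#include <cstdint>

// A run-time "dictionary" (a struct of values and function pointers)
// would cost an indirection per access. Passing the instance as a
// template argument instead is the analog of inline_for_extraction:
// every dictionary access reduces away.
struct Sha256Like {
  static constexpr std::uint32_t block_len = 64;
  static constexpr std::uint32_t output_len = 32;
};

// Hypothetical analog of the `rest` helper: how many bytes sit in the
// internal buffer after total_len bytes have been fed.
template <typename Block>
constexpr std::uint32_t rest(std::uint64_t total_len) {
  return static_cast<std::uint32_t>(total_len % Block::block_len);
}

// Because Block is a meta-argument, rest<Sha256Like> is first-order and
// closed: Block::block_len is a constant, so the compiler can even check
// the kind of property asserted at line 13 of Figure 6 statically.
static_assert(rest<Sha256Like>(130) == 2);
static_assert((130 - rest<Sha256Like>(130)) % Sha256Like::block_len == 0);
```

After instantiation nothing of the "dictionary" remains at run-time, which is exactly the zero-cost outcome the F★ attributes aim for.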
To make sure these reduce, we mark both type class definitions as "inline for extraction", which in turn makes their projectors reduce. At extraction-time, provided mk_finish is applied to an instance, lines 15-16 reduce into direct calls to the original alloca and copy functions found in the type class.

At line 11, we call rest, a helper shared across multiple functions, which returns the amount of data currently in the internal buffer. As such, rest needs access to the type class, if only to know the block algorithm's block size. To avoid generating a run-time access to c in the call to rest, we mark the definition of rest as "inline for extraction"; provided rest undergoes the same treatment we described above, all references to c are now eliminated at extraction-time.

Playing with normalization attributes and keywords guarantees that any reference to the type class argument disappears; but this is not enough to make the code valid Low★. The issue lies with our earlier type definition (Figure 5). While the projection c.block_len i is innocuous, and disappears (refinements are erased), the type of block_state is an application of a type-level function (the projector) to the type argument i. Upon seeing such a type, F★'s extraction simply inserts ⊤. This means that a partial application of state_s to a class c will generate:

  type 'c state_s = { block_state: Obj.t; ... }
  type c_state = unit state_s

Indeed, the partial application of an inductive does not trigger partial evaluation: F★ does not generate a fresh, specialized copy of an inductive when it encounters a partial application. All seems lost, for a field block_state: ⊤ cannot be compiled to C. We can, however, regain ML-like polymorphism with this "one simple trick":

  noeq type state_s (c: block) (t: Type0 { t == c.state.s }) =
  | State:
      block_state: t →
      buf: B.buffer U8.t { B.len buf = c.block_len } →
      ... →
      state_s c t

This second version is curiously convoluted; from the point of view of F★, however, this is a perfectly valid ML type, whose definition in the ML AST becomes:

  type ('c, 't) state_s = { block_state: 't; buf: ... }

We apply a similar trick to functions, where for instance the prototype of mk_finish becomes:

  val mk_finish: c:block → t:Type0 { t == c.state.s } → s:state c t → ...

This extracts to an ML AST that is free of casts, since all types within the body of mk_finish are now of type t (a type parameter of the function) instead of c.state.s (a non-extractable function call at the Type0 level). We obtain an ML-polymorphic definition, along with monomorphic uses:

  let mk_finish (type c t) (s: (c, t) state) ... = ...
  let finish_sha2_256 = mk_finish <unit, U32.t buffer>

Finally, we rely on KreMLin to perform whole-program monomorphization (§2) in order to specialize type definitions and functions based on their usage at type parameter U32.t buffer. Combined with unit-field elimination and unused-type-parameter elimination, we obtain a fully specialized copy of finish for SHA2-256, along with a clean definition for the type, just like a C programmer would have written:

  typedef struct {
    uint32_t *block_state;
    uint8_t *buf;
    uint64_t total_len;
  } Hacl_Streaming_Functor_state_sha2_256;

The process of hand-writing a functor (§4) gives the programmer fine-grained control over reduction and type monomorphization. However, this manual encoding requires the programmer to mark the entire call-graph as "inline for extraction", in order to properly eliminate occurrences of the meta-argument c in the generated code. In many situations, this is not acceptable: a huge blob of code would not pass muster with software engineers who want to use verified libraries, and as such we need to retain the structure of the call-graph in the generated C code.

We now abstract over, and generalize, the setting of §4. We consider call-graphs of arbitrary depth, where the only restriction is the absence of recursion. This is a safe assumption: most Low★ code uses loop combinators, as recursion results in unpredictable performance, owing to the uneven support for tail-call optimizations in C compilers. We assume every function in the call-graph is parametric over a meta-parameter, which we call from here on an index. In order for the functions in our call-graph to be valid Low★, they must be applied to a concrete index, so as to trigger enough partial evaluation. In §4, the meta-parameter was the type class c, which contained type definitions followed by specifications, lemmas, low-level implementations, helper definitions, etc. This style is burdensome, as a large algorithm will typically incur a type class with dozens of fields, which makes authoring instances tedious and non-modular.
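The whole-program monomorphization that KreMLin performs (§4 above) behaves like C++ template instantiation: one specialized copy of the type per concrete parameter, with no polymorphism left at run-time. Below is a toy rendition of the state_s trick, not the generated HACL★ definitions:

```cpp
#include <cstdint>
#include <type_traits>

// The "one simple trick": the block-state type is an explicit type
// parameter, so state_s is ordinary ML/C++-style polymorphism rather
// than a type-level application of a projector.
template <typename BlockState>
struct state_s {
  BlockState block_state;
  std::uint8_t* buf;
  std::uint64_t total_len;
};

// Monomorphization: each use at a concrete type parameter yields its own
// specialized definition, like Hacl_Streaming_Functor_state_sha2_256.
using state_sha2_256 = state_s<std::uint32_t*>;

static_assert(std::is_same_v<decltype(state_sha2_256::block_state),
                             std::uint32_t*>);
```

The difference is where the work happens: C++ compilers monomorphize natively, whereas here KreMLin performs the same specialization as a whole-program pass over the extracted ML AST.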
We now present a different style, which we have found minimizes syntactic overhead, and which is easier to work with in day-to-day proof engineering. For the rest of this paper, we choose for the index a finite enumeration, accompanied by a set of helper definitions over that index. To illustrate this second style, we use HPKE [10] (Hybrid Public-Key Encryption), a cryptographic construction that combines AEAD (Authenticated Encryption with Additional Data), DH (Diffie-Hellman) and hashing. We wish to generate specialized instances of HPKE for a given triplet of implementations. Using C++ as an analogy, we wish to author the equivalent of a template HPKE<...>.

Spec.HPKE.alg captures all possible algorithm choices prescribed by the HPKE RFC. We thus write specifications, lemmas, helpers and types parametrically over the index, as standalone definitions. The key_aead type, for example, is parametric over triplets of algorithms, and defines a low-level key to be an array of bytes whose length is the key length for the chosen AEAD. The same systematic parametrization over alg can be carried to functions and their types, shown with sign as an example. An important point is that for a given sign_t, there may be multiple implementations of the given algorithm. So, the finite combinations of Spec.HPKE.alg admit an infinite number of implementations.

Figure 7 describes the call-graph of HPKE: nodes are F★ top-level functions and arrows indicate function calls. The helper node makes verification modular but would pollute the generated C code, and therefore must be evaluated away. All other circled nodes must appear in the generated C code. This call-graph is representative of a typical large-scale verification effort: in order to make verification robust in the presence of an SMT solver, to ensure modularity of the proofs, and to increase the likelihood that a future programmer can understand and maintain a given proof, we encourage a proliferation of small helpers with crisp pre- and postconditions. However, these helpers would typically amount, if they were extracted, to a mere line or two of C code.

  Figure 7. Simplified call-graph of HPKE (nodes: hpke, helper, enc, sign)

We now wish to obtain a copy of this call-graph where helper has been inlined away, and where all functions in the generated call-graph are specialized variants for a given triplet of algorithms. We proceed in two steps: first, we rewrite the call-graph to look as follows:

  inline_for_extraction
  let helper (alg: Spec.HPKE.alg) (sign: sign_t alg): helper_t alg =
    λ ... → ...

  inline_for_extraction
  let hpke (alg: Spec.HPKE.alg) (sign: sign_t alg) (enc: enc_t alg): hpke_t alg =
    λ ... → ... helper alg sign ... ... enc ...

This form is convoluted: we have recursively parameterized hpke to accept specialized versions of all the functions that we wish to retain in the call-graph after partial evaluation. Then, we let the user perform instantiations to obtain a specialized HPKE algorithm:

  // ChachaPoly.AVX2.fsti:
  val enc: enc_t Spec.AEAD.ChachaPoly

  // HPKE.ChachaPoly.fst:
  let hpke_chachapoly: hpke_t (..., Spec.AEAD.ChachaPoly, ...) =
    hpke (..., Spec.AEAD.ChachaPoly, ...) ... ChachaPoly.AVX2.enc

This results in a specialized version of hpke, which calls a specialized version of encrypt. The index is gone and there is no run-time overhead; we have in effect specialized the entire call-graph for a specific value of the index. Note that we can provide many more specializations of HPKE for the same value of the index, e.g. with the AVX512 version of ChachaPoly. Back to our earlier C++ template analogy, we have in effect written:

  HPKE<..., ChachaPolyAvx, ...> Hpke_ChachaPoly;

We now formalize the call-graph specialization logic as a set of rewriting rules (Figure 8), which are to be understood as follows. The user annotates every function f in the call-graph with an attribute a_f, where a_f is either [@Eliminate] or [@Specialize]. Un-annotated functions are understood to be outside the call-graph and are ignored. We use f → g to state that f calls into g. We define spec(f) to be all the functions annotated with [@Specialize] that are called by f through [@Eliminate] functions. This set never contains f, since we ruled out recursion. One property of interest is:

  spec(g) ⊂ spec(f) if f → g    (p)

In rule (i), each function f is rewritten to take as extra parameters specialized versions of all the functions f'_j it might (transitively) call into. When such a specialized function call is encountered in (iii), it is rewritten into a call to the specialized variant the function received as a parameter. Note that the index disappears: the parameter f' is of type t_{f'} i, i.e. it is a specialized instance of f' for the current index i. Calls to functions that are to be eliminated (ii) are rewritten differently: since they disappear from the call-graph, we rely on the "inline for extraction" attribute to inline their definitions away. They do, however, take extra arguments, for all the specialized functions they eventually call: we pass those as well; they are always bound thanks to (p).

We have implemented these rewriting rules as a recursive traversal of the call-graph. Our tactic, at 620 lines (including whitespace and comments), is the second largest Meta-F★ program written to date. We now briefly give an overview of the implementation. The tactic is written in Tac, the effect of meta-programs.
As mentioned earlier (§2), the design of Meta-F★ means any fresh term generated by a tactic must be re-checked for soundness. The tactic is written in a state-passing style, as meta-programs do not have access to mutable state, and revolves around the following internal definitions:

  noeq type mapping =
  | Eliminate: new_name: name → mapping
  | Specialize: mapping

  let state = list (name & (term & mapping & list name))

The state type is an associative list that, to each f (of type name), associates: its type t_f: index → Type (of type term, the safe view of terms exposed to meta-programs); its a_f and new name g (of type mapping); and its set spec(f) (of type list name). The core of our tactic is visit_f, which returns an updated state along with a set of fresh definitions to be inserted into the current module. In order to call our tactic, the user passes the roots of the call-graph traversal, along with the type of the index. The %splice directive inserts meta-generated definitions at the current point, and requires the user to pass the names of all the specialized nodes that they wish to call later, in order to establish a lexical scope (scope resolution happens before meta-program evaluation in F★).

  %splice [
    hpke_setupBaseI'; hpke_setupBaseR'; hpke_sealBase'; hpke_openBase' ]
  (Meta.Interface.specialize (`Spec.HPKE.alg) [
    `Impl.HPKE.setupBaseI;
    `Impl.HPKE.setupBaseR;
    `Impl.HPKE.sealBase;
    `Impl.HPKE.openBase ])

One possible specialization, out of more than a hundred possible options, is P256 for elliptic-curve DH, AVX 128-bit ChachaPoly for AEAD and SHA256 for hashing. Note how the user first generates a specialized version of setupBaseI, then a specialized version of sealBase that calls it.

  let alg = DH_P256, CHACHA20_POLY1305, SHA2_256
  let setupBaseI = hpke_setupBaseI' alg hkdf_expand256 hkdf_extract256
    sha256 secret_to_public_p256 dh_p256
  let setupBaseR = hpke_setupBaseR' alg hkdf_expand256 hkdf_extract256
    sha256 secret_to_public_p256 dh_p256
  let sealBase = hpke_sealBase' alg setupBaseI encrypt_cp128
  let openBase = hpke_openBase' alg setupBaseR decrypt_cp128

  Figure 8. Our tactic expressed as rewriting rules

  (i)    a_f
         let f : i:index → t_f i = λ(i: index). e
         ⟶  inline_for_extraction
             let g : i:index → (t_{f_j} i)_{f_j ∈ spec(f)} → t_f i =
               λ(i: index) (f'_j : t_{f_j} i)_{f_j ∈ spec(f)}. ⟦e⟧

  (ii)   ⟦f i e⟧ ⟶ g i (f'_j)_{f_j ∈ spec(f)} ⟦e⟧    where a_f = [@Eliminate]

  (iii)  ⟦f i e⟧ ⟶ f' ⟦e⟧                            where a_f = [@Specialize]

  where spec(f) = { g_n | f → g_1 → ⋯ → g_n
                        ∧ a_{g_j} = [@Eliminate] for j < n ∧ a_{g_n} = [@Specialize] }

Our tactic has a few more bells and whistles. Notably, it may optionally thread through the call-graph an extra precondition if, say, some specialized functions require the presence of a specific CPU instruction. It is also compatible with abstraction boundaries, which makes verification more modular, and allows specializing the call-graph against an abstract interface, making this a functor that takes an abstract signature. Finally, we also allow the leaves of the call-graph to omit the index if they are specialized implementations for only one value of the index.

  Figure 9. Abstract, large-scale call-graph for Curve25519 (interfaces: Curve, Field, Core64; Field51 and Field64 implement Field; HACL★ and Vale provide Core64 implementations, each proved against its interface)
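The rewriting rules can be played back in C++, where [@Specialize] nodes become non-type template parameters and [@Eliminate] helpers are inline templates that vanish after instantiation. All names below (enc_chachapoly_avx2, sign_p256, helper, hpke) are hypothetical stand-ins mirroring Figure 7's call-graph:

```cpp
#include <string>

// Leaves of the call-graph, i.e. [@Specialize] nodes. Several
// implementations may exist for the same index value.
inline std::string enc_chachapoly_avx2(const std::string& m) {
  return "cp128(" + m + ")";
}
inline std::string sign_p256(const std::string& m) {
  return "p256(" + m + ")";
}

// An [@Eliminate] helper: per rule (i), it is parameterized over the
// specialized functions it eventually calls, and marked inline so that
// no trace of it remains in the specialized call-graph.
template <auto Sign>
inline std::string helper(const std::string& m) { return Sign(m); }

// The root: takes specialized variants of everything in spec(hpke) as
// meta-arguments; the call through helper follows rule (ii).
template <auto Enc, auto Sign>
std::string hpke(const std::string& m) {
  return Enc(helper<Sign>(m));
}

// Instantiation = one functor application: a first-order, fully
// specialized call-graph, the analog of HPKE<..., ChachaPolyAvx, ...>.
constexpr auto hpke_chachapoly = hpke<enc_chachapoly_avx2, sign_p256>;
```

Instantiating hpke<enc_chachapoly_avx2, sign_p256> corresponds to one %splice application followed by the user-side let-bindings: the index is gone, and each node in the resulting call-graph is a concrete function.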
We now present the application of our automated tactic to a particularly gnarly example. Curve25519 is an elliptic curve algorithm [12]; suffice it to say that it relies on a mathematical field, which admits two efficient implementations; furthermore, one of these two implementations relies on a core set of primitives (e.g. multiplication) which themselves admit two different implementations, one in Low★ from the HACL★ library, and one in Vale assembly [21]. In Figure 9, circles denote interfaces. We define the index as follows, along with some helpers for the low-level representation of field elements:

  type field_repr = | Field51 | Field64
  let felem (s: field_repr) = ...

We then make sure the whole Curve25519 module is written against an abstract Field.fsti, where all definitions are polymorphic over the index; Field.fsti captures the signature to be provided by field implementations. This relies on the extra feature mentioned earlier, where our tactic stops at abstraction boundaries. As an example, Field.fsti contains signatures that are implemented by individual field implementations, such as Field51.fst:

  (* Field.fsti *)
  [@@ Specialize]
  val store_felem: s:field_repr → b:lbuffer U64.t 4ul → f:felem s → Stack unit ...

  (* Field51.fst *)
  let store_felem (u64s: lbuffer U64.t 4ul) (f: Field51.felem): Stack unit ... = ...
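In OCaml terms, the paper later writes module Curve51 = Curve25519(Field51); the C++ counterpart instantiates a module written once against the abstract field signature. The two toy fields below are illustrative (simple modular addition with a made-up modulus), not real Curve25519 arithmetic:

```cpp
#include <cstdint>

// Two hypothetical implementations of the same abstract field signature
// (stand-ins for Field51 and Field64): same mathematical result,
// different internal reduction strategies.
struct Field51 {
  static std::uint64_t fadd(std::uint64_t a, std::uint64_t b) {
    return (a + b) % 8191;              // toy modulus, eager reduction
  }
};
struct Field64 {
  static std::uint64_t fadd(std::uint64_t a, std::uint64_t b) {
    std::uint64_t r = a + b;            // lazy reduction: one comparison
    return r >= 8191 ? r - 8191 : r;    // (valid for inputs below 8191)
  }
};

// The "Curve25519 module", written once against the abstract interface.
template <typename Field>
struct Curve25519 {
  static std::uint64_t fdouble(std::uint64_t x) { return Field::fadd(x, x); }
};

// Functor applications: module Curve51 = Curve25519(Field51), etc.
using Curve51 = Curve25519<Field51>;
using Curve64 = Curve25519<Field64>;
```

As with the F★ encoding, both instantiations are fully specialized at compile time: there is no virtual dispatch, and each Curve variant calls its field's operations directly.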
We apply this pattern once more for Field64.fsti, which is itself written against an abstract Core64.fsti:

  (* Helpers.fsti *)
  inline_for_extraction
  let fadd_t (s: field_spec) =
    out:felem s → f1:felem s → f2:felem s → Stack unit ...

  (* Core64.fsti *)
  [@ Meta.Attribute.specialize]
  val fadd: fadd_t Field64

  (* Core64.Vale.fst *)
  let fadd: fadd_t Field64 = ...

This relies on another extra feature mentioned earlier, namely the ability to have, at the leaves of the call-graph, functions that omit the index if only one case is possible. We obtain three specialized applications. One would be, in OCaml, module Curve51 = Curve25519(Field51), while another would be module Curve64Vale = Curve25519(Field64(CoreVale)). Unlike in OCaml, though, our functors reduce via partial evaluation and incur no run-time overhead.

  Table 1. Quantifying the impact of the streaming functor

                             F★ LoC   C LoC    verif.   extract.
  EverCrypt hashing (old)       848     798
  Functor and interfaces       1505
  MD5, SHA1, SHA2 (×6)          224    1283     39.4s      13.6s
  Poly1305                      416     304     55.7s       6.4s
  Blake2s, Blake2b              752    1374    115.1s       9.2s
  Total
To evaluate the applicability of the streaming functor, we compare lines of code (LoC) for the F★ source code and the final C code, as a proxy for programmer effort. Our point of reference is a first, non-generic streaming API that previously operated atop the EverCrypt agile hash layer. Table 1 presents the evaluation. For the old, non-generic streaming API, the proof-to-code ratio was above one, meaning we had to write more than one line of code in F★ for every line of generated C code. Capturing the block API and implementing the functor uses 1505 lines of F★ code. The extra verification effort is quickly amortized across the 10 applications of the streaming functor, each of which requires a modest amount of proof to implement the exact signature of the block API. Poly1305 and Blake2 were originally authored without bringing out the functional, fold-like nature of the algorithms, which led to some glue code and proofs to meet the block API. Altogether, we obtain a final proof-to-code ratio that we interpret to coarsely mean a 34% improvement in programmer productivity. We expect this number to further decrease as more applications of the streaming functor follow.

For execution times, we present the verification time of the functor itself, and the verification time of each of the instances, including glue proofs. Compared to fully verifying Blake2 (7.5 minutes) or Poly1305 (~14 minutes), the verification cost is modest. Applying the streaming functor to a type class argument incurs no verification cost, so the extraction column measures the cost of partial evaluation and extraction to the ML AST, which is negligible.

From a qualitative point of view, the usage of the call-graph rewriting tactic significantly improved programmer experience and addressed many fundamental engineering roadblocks in one go. First, programmers would tweak F★'s include path to switch between implementations (e.g. Field51 vs Field64), effectively making it impossible to build verified applications on top of HACL★. Second, the lack of modularity and call-graph specialization in old versions of HACL★ made C and vectorized implementations appear in the same file; since we had to use -mavx -mavx2 for compiling intrinsics, the C compiler would use AVX2 instructions for our non-vectorized, regular C version, causing illegal-instruction errors later on [39]. We now put one tactic instantiation (i.e. one %splice) per file, which solves the problem definitively. Finally, previous versions of HACL★ did not distinguish between algorithmic agility and the choice of implementation for a given algorithm. This made a modular and specializable HPKE just impossible to author.

  Table 2. Cost of verifying the tactic-rewritten call graphs

  Algorithm           Verification time
  Curve25519          1379s (+127.0%)
  Chacha20             174s  (+41.0%)
  Poly1305             429s  (+17.2%)
  Chacha20Poly1305     421s  (+92.2%)

From a quantitative point of view, we measure the overhead incurred by re-checking the tactic-generated call-graph, relative to the total verification time for a given algorithm (Table 2). In most cases, the overhead remains limited, because we don't rewrite lemmas and proofs. Curve25519 is an outlier because we thread a precondition, resulting in an additional verification burden. In practice, build time matters little in the face of improved programmer productivity.

Tactics are not part of the trusted computing base (§2); unlike, say, MTac2 [23], Meta-F★ [26] does not allow the user to prove properties about tactics, trading provable correctness for ease of use and programmer productivity. The debugging experience for tactics is thus pleasant. If the tactic itself fails, F★ points to the faulty line in the meta-program. If the generated code is ill-typed, we examine it like any other F★ program. We debugged this tactic on Curve25519; once debugged, the tactic never generated ill-typed code and was used successfully by other co-authors. A technical detail relates to lexical scoping (§6.3): the user must somewhat materialize in the source code the names of all the g's that are generated by the tactic, in order to establish proper lexical scope. For that, regular users can observe the standard output, where the tactic prints a summary of the functions it generated, their types, and their names.
Equipped with the names of the generated g's, users then edit their source code to fill in the first argument to %splice. Alternatively, power users just look up the mangling scheme and directly write the arguments to %splice in one go.

Based on our experience performing very large-scale verification in F★, we have shown an array of techniques that make program proof a productive endeavor. Establishing clear, crisp abstractions that highlight commonality between related pieces of code sets a foundation for higher-level APIs. With partial evaluation and meta-programming, programmers can think at the highest levels of abstraction, while minimizing effort and increasing code reuse. Quantitative evaluation provides evidence that our techniques improve programmer experience. In our view, scaling software verification is just as important as verification tours de force.

We intend to continue our efforts on the HACL★ codebase. An immediate goal is to extend the streaming functor to take a meta-level parameter that allows storing n blocks in the internal buffer, which is essential for vectorized implementations that process multiple blocks at a time. We also intend to author a new type class to capture the commonality between Chacha20, AES, SHA3 and Blake2 in PRF mode and the matching flavors of AEAD, along with a corresponding functor to automatically generate safer APIs for those.

References

[1] The sodium crypto library (libsodium). https://github.com/jedisct1/libsodium.
[2] Federal Information Processing Standards Publication 180-4: Secure hash standard (SHS), 2012. NIST.
[3] Danel Ahman, Cătălin Hriţcu, Kenji Maillard, Guido Martínez, Gordon Plotkin, Jonathan Protzenko, Aseem Rastogi, and Nikhil Swamy. Dijkstra monads for free. In
ACM Symposium on Principles of Pro-gramming Languages (POPL) , January 2017.[4] José Bacelar Almeida, Manuel Barbosa, Gilles Barthe, Arthur Blot,Benjamin Grégoire, Vincent Laporte, Tiago Oliveira, Hugo Pacheco,Benedikt Schmidt, and Pierre-Yves Strub. Jasmin: High-assurance andhigh-speed cryptography. 2017.[5] José Bacelar Almeida, Manuel Barbosa, Gilles Barthe, BenjaminGrégoire, Adrien Koutsos, Vincent Laporte, Tiago Oliveira,and Pierre-Yves Strub. The last mile: High-assurance andhigh-speed cryptographic implementations. arXiv:1904.04606 https://arxiv.org/abs/1904.04606 .[6] Nada Amin and Tiark Rompf. Lms-verify: Abstraction without regretfor verified systems programming. In
POPL , 2017.[7] Andrew W Appel. Verification of a cryptographic primitive: SHA-256.
ACM Transactions on Programming Languages and Systems (TOPLAS) ,37(2):7, 2015.[8] Jean-Philippe Aumasson, Samuel Neves, Zooko Wilcox-O’Hearn, andChristian Winnerlein. BLAKE2: Simpler, smaller, fast as MD5. In
Applied Cryptography and Network Security , pages 119–135, 2013.[9] Manuel Barbosa, Gilles Barthe, Karthikeyan Bhargavan, BrunoBlanchet, Cas Cremers, Kevin Liao, and Bryan Parno. Sok: Computer-aided cryptography.
IACR Cryptol. ePrint Arch. , 2019:1393, 2019.[10] R. Barnes and K. Bhargavan. Hybrid public key encryption. IRTFInternet-Draft draft-irtf-cfrg-hpke-02 , 2019.[11] Lennart Beringer, Adam Petcher, Katherine Q. Ye, and Andrew W. Ap-pel. Verified correctness and security of OpenSSL HMAC. 2015.[12] D. J. Bernstein. Curve25519: New Diffie-Hellman speed records. In
Proceedings of the IACR Conference on Practice and Theory of PublicKey Cryptography (PKC) , 2006.[13] Daniel J. Bernstein. The Poly1305-AES message-authentication code.In
Proceedings of Fast Software Encryption , March 2005.[14] Barry Bond, Chris Hawblitzel, Manos Kapritsos, K. Rustan M. Leino,Jacob R. Lorch, Bryan Parno, Ashay Rane, Srinath Setty, and LaureThompson. Vale: Verifying high-performance cryptographic assem-bly code. In
Proceedings of the USENIX Security Symposium , August2017.[15] Edwin Brady. Idris, a general-purpose dependently typed program-ming language: Design and implementation.
Journal of functionalprogramming , 23(5):552–593, 2013.[16] L. de Moura and N. Bjørner. Z3: An efficient SMT solver. 2008.[17] Leonardo de Moura, Soonho Kong, Jeremy Avigad, Floris van Doorn,and Jakob von Raumer. The Lean theorem prover. In
Proc. of the Conference on Automated Deduction (CADE), 2015.
[18] Frank DeRemer and Hans Kron. Programming-in-the-large versus programming-in-the-small. In Proceedings of the International Conference on Reliable Software, pages 114–121, New York, NY, USA, 1975. Association for Computing Machinery.
[19] Morris J. Dworkin. SHA-3 standard: Permutation-based hash and extendable-output functions. Technical report, 2015.
[20] A. Erbsen, J. Philipoom, J. Gross, R. Sloan, and A. Chlipala. Simple high-level code for cryptographic arithmetic - with proofs, without compromises. 2019.
[21] Aymeric Fromherz, Nick Giannarakis, Chris Hawblitzel, Bryan Parno, Aseem Rastogi, and Nikhil Swamy. A verified, efficient embedding of a verifiable assembly language. Proc. ACM Program. Lang., 3(POPL), January 2019.
[22] Yu-Fu Fu, Jiaxiang Liu, Xiaomu Shi, Ming-Hsien Tsai, Bow-Yaw Wang, and Bo-Yin Yang. Signed cryptographic program verification with typed CryptoLine. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, CCS '19, pages 1591–1606, New York, NY, USA, 2019. Association for Computing Machinery.
[23] Jan-Oliver Kaiser, Beta Ziliani, Robbert Krebbers, Yann Régis-Gianas, and Derek Dreyer. Mtac2: Typed tactics for backward reasoning in Coq. Proceedings of the ACM on Programming Languages, 2(ICFP):1–31, 2018.
[24] K. Rustan M. Leino. Dafny: An automatic program verifier for functional correctness. In Proceedings of the Conference on Logic for Programming, Artificial Intelligence, and Reasoning (LPAR), 2010.
[25] Pierre Letouzey. A new extraction for Coq. In
International Workshop on Types for Proofs and Programs, pages 200–219. Springer, 2002.
[26] Guido Martínez, Danel Ahman, Victor Dumitrescu, Nick Giannarakis, Chris Hawblitzel, Catalin Hritcu, Monal Narasimhamurthy, Zoe Paraskevopoulou, Clément Pit-Claudel, Jonathan Protzenko, Tahina Ramananandro, Aseem Rastogi, and Nikhil Swamy. Meta-F*: Proof automation with SMT, tactics, and metaprograms. In Proceedings of the European Symposium on Programming (ESOP), pages 30–59. Springer, 2019.
[27] David A. McGrew and John Viega. The security and performance of the Galois/counter mode of operation. In
Proceedings of the International Conference on Cryptology in India (INDOCRYPT), 2004.
[28] Robin Milner. A theory of type polymorphism in programming. Journal of Computer and System Sciences, 17(3):348–375, 1978.
[29] Nicky Mouha, Mohammad S. Raunak, D. Richard Kuhn, and Raghu Kacker. Finding bugs in cryptographic hash function implementations.
IEEE Transactions on Reliability, 67(3):870–884, 2018.
[30] OpenSSL Team. OpenSSL, May 2005.
[31] Marina Polubelova, Karthikeyan Bhargavan, Jonathan Protzenko, Benjamin Beurdouche, Aymeric Fromherz, Natalia Kulatova, and Santiago Zanella-Béguelin. HACL×N: Verified generic SIMD crypto (for all your favorite platforms). Cryptology ePrint Archive, Report 2020/572, 2020. https://eprint.iacr.org/2020/572.
[32] Jonathan Protzenko, Bryan Parno, Aymeric Fromherz, Chris Hawblitzel, Marina Polubelova, Karthikeyan Bhargavan, Benjamin Beurdouche, Joonwon Choi, Antoine Delignat-Lavaud, Cédric Fournet, et al. EverCrypt: A fast, verified, cross-platform cryptographic provider. In , pages 634–653, 2019.
[33] Jonathan Protzenko, Jean-Karim Zinzindohoué, Aseem Rastogi, Tahina Ramananandro, Peng Wang, Santiago Zanella-Béguelin, Antoine Delignat-Lavaud, Catalin Hritcu, Karthikeyan Bhargavan, Cédric Fournet, and Nikhil Swamy. Verified low-level programming embedded in F*. PACMPL, 1(ICFP), September 2017.
[34] M-J. Saarinen and J-P. Aumasson. The BLAKE2 cryptographic hash and message authentication code (MAC). IETF RFC 7693, 2015.
[35] Amr Sabry and Matthias Felleisen. Reasoning about programs in continuation-passing style.
Lisp and Symbolic Computation, 6(3-4):289–360, 1993.
[36] Tim Sheard and Simon Peyton Jones. Template meta-programming for Haskell. In Proceedings of the 2002 ACM SIGPLAN Workshop on Haskell, Haskell '02, pages 1–16, New York, NY, USA, 2002. Association for Computing Machinery.
[37] Nikhil Swamy, Cătălin Hriţcu, Chantal Keller, Aseem Rastogi, Antoine Delignat-Lavaud, Simon Forest, Karthikeyan Bhargavan, Cédric Fournet, Pierre-Yves Strub, Markulf Kohlweiss, Jean-Karim Zinzindohoué, and Santiago Zanella-Béguelin. Dependent types and multi-monadic effects in F*. In