Analyzing Smart Contracts: From EVM to a sound Control-Flow Graph
Elvira Albert, Jesús Correas, Pablo Gordillo, Alejandro Hernández-Cerezo, Guillermo Román-Díez, and Albert Rubio
Instituto de Tecnología del Conocimiento, Spain; Complutense University of Madrid, Spain; Universidad Politécnica de Madrid, Spain
Abstract. The EVM language is a simple stack-based language with words of 256 bits. One significant difference between the EVM and other virtual machine languages (like Java bytecode or CLI for .NET programs) is the use of the stack for saving the jump addresses, instead of having them explicit in the code of the jumping instructions. Static analyzers need the complete control flow graph (CFG) of the EVM program in order to be able to represent all its execution paths. This report addresses the problem of obtaining a precise and complete stack-sensitive CFG by means of a static analysis, cloning the blocks that might be executed using different states of the execution stack. The soundness of the analysis presented is proved.

EVM Language

The EVM language is a simple stack-based language with words of 256 bits, with a local volatile memory that behaves as a simple word-addressed array of bytes, and a persistent storage that is part of the blockchain state. A more detailed description of the language and the complete set of operation codes can be found in [6]. In this section, we focus only on the characteristics of the EVM that are needed for describing our work.
Example 1. In order to describe our techniques, we use as running example a simplified version (without calls to the external service Oraclize and the authenticity proof verifier) of the contract [1] that implements a lottery system. During a game, players call a method joinPot to buy lottery tickets; each player's address is appended to an array addresses of current players, and the number of tickets is appended to an array slots, both having variable length. After some time has elapsed, anyone can call rewardWinner, which calls the Oraclize service to obtain a random number for the winning ticket. If all goes according to plan, the Oraclize service then responds by calling the __callback method with this random number and the authenticity proof as arguments. A new instance of the game is then started, and the winner is allowed to withdraw her balance using a withdraw method. Figure 2 shows an excerpt of the Solidity code (including the public function findWinner) and a fragment of the EVM code produced by the compiler. The Solidity source code is shown for readability, as our analysis works directly on the EVM code. To the right of Figure 2 we show a fragment of the EVM code of method findWinner. It can be seen that the EVM has instructions for operating with the stack contents, like DUPx or SWAPx; for comparisons, like LT, GT; for accessing the storage (resp. memory) of the contract, like SSTORE, SLOAD (resp. MLOAD, MSTORE); for adding/removing elements to/from the stack, like PUSHx/POP; and many others (we again refer to [6] for details). Some instructions increment the program counter by several units (e.g., PUSHx Y adds a word with the constant Y of x bytes to the stack and increments the program counter by x + 1). In what follows, we use size(b) to refer to the number of units that instruction b increments the value of the program counter. For instance, size(POP) = 1, size(PUSH1) = 2 or size(PUSH3) = 4. □
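As a concrete illustration, the size function can be sketched directly from the opcode layout (a sketch for illustration, not part of the paper's formalization; the opcode values used are the standard EVM ones, where PUSH1..PUSH32 occupy 0x60..0x7F):

```python
# Sketch of size(b): the number of units an instruction advances the
# program counter, i.e., the byte size of the instruction.

def size(opcode: int) -> int:
    """Return the byte size of the instruction with the given opcode."""
    if 0x60 <= opcode <= 0x7F:      # PUSHx carries an x-byte immediate
        x = opcode - 0x60 + 1       # PUSH1 -> 1, ..., PUSH32 -> 32
        return x + 1                # one extra byte for the opcode itself
    return 1                        # every other instruction is 1 byte

POP, PUSH1, PUSH3 = 0x50, 0x60, 0x62
assert size(POP) == 1 and size(PUSH1) == 2 and size(PUSH3) == 4
```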
One significant difference between the EVM and other virtual machine languages (like Java bytecode or CLI for .NET programs) is the use of the stack for saving the jump addresses, instead of having them explicit in the code of the jumping instructions. In EVM, instructions JUMP and JUMPI jump, unconditionally and conditionally respectively, to the program counter stored at the top of the execution stack. In order to obtain the control flow graph of the program, this feature of the EVM requires keeping track of the information stored in the stack. Let us illustrate it with an example.
Example 2. In the EVM code to the right of Figure 2 we can see two jump instructions whose jump addresses are stored by the PUSH instruction executed immediately before them. The destinations of the jumps are marked in the code by JUMPDEST instructions. □

[Fig. 1. Excerpt of Solidity code for EthereumPot contract (left), and fragment of EVM code for function findWinner (right).]
We start our analysis by defining the set J, which contains all possible jump destinations in an EVM program P ≡ b_0, ..., b_p:

J(P) = { pc | b_pc ∈ P ∧ b_pc ≡ JUMPDEST }

We use b_pc ∈ P to refer to the instruction at program counter pc in the EVM program P. In what follows, we omit P from definitions when it is clear from the context, e.g., we use J to refer to J(P).

Example 3. Given the EVM code that corresponds to function findWinner, we get the set J containing the program counters of all its JUMPDEST instructions. □

The first step in the computation of the CFG is to define the notion of block. In general [2], given a program P, a block is a maximal sequence of straight-line consecutive code in the program, with the properties that the flow of control can only enter the block through the first instruction in the block, and can only leave the block at the last instruction. Let us define the concept of block in an EVM program:
Definition 1 (blocks). Given an EVM program P ≡ b_0, ..., b_p, we define

blocks(P) = { B_i ≡ b_i, ..., b_j | (∀k. i < k < j, b_k ∉ Jump ∪ End ∪ {JUMPDEST}) ∧
                                    (i = 0 ∨ b_i ≡ JUMPDEST ∨ b_{i−1} ≡ JUMPI) ∧
                                    (j = p ∨ b_j ∈ Jump ∨ b_j ∈ End ∨ b_{j+1} ≡ JUMPDEST) }

where Jump = {JUMP, JUMPI} and End = {REVERT, STOP, INVALID}.

Example 4.
Figure 3 shows the blocks (nodes) obtained for findWinner and their corresponding jump invocations. Solid and dashed edges represent the two possible execution paths depending on the entry block: solid edges represent the path that starts from one entry block, and dashed edges the path that starts from the other.

    contract EthereumPot {
      address[] public addresses;
      address public winnerAddress;
      uint[] public slots;

      function __callback(bytes32 queryId, string result, bytes proof) {
        if (msg.sender != oraclize_cbAddress()) throw;
        random_number = uint(sha3(result));
        winnerAddress = findWinner(random_number);
        amountWon = this.balance * 98 / 100;
        winnerAnnounced(winnerAddress, amountWon);
        if (winnerAddress.send(amountWon)) {
          if (owner.send(this.balance)) {
            openPot();
          }
        }
      }

      function findWinner(uint random) constant returns (address winner) {
        for (uint i = 0; i < slots.length; i++) {
          if (random <= slots[i]) {
            return addresses[i];
          }
        }
      }
      // Other functions
    }

Fig. 2. Excerpt of Solidity code for EthereumPot contract (left), and fragment of EVM code for function findWinner (right)

Note that most of the blocks start with a JUMPDEST instruction; the rest of the blocks start with instructions that come right after a JUMPI instruction. Analogously, most blocks end in a JUMP, JUMPI or RETURN instruction, or in the instruction that precedes a JUMPDEST. □

Observing the blocks in Figure 3, we can see that most JUMP instructions use the address introduced by the PUSH instruction executed immediately before the JUMP. However, in general, in EVM code it is possible to find a JUMP whose address has been stored in a different block. This happens, for instance, when a public function is invoked privately from other methods of the same contract: the returning program counter is introduced by the invokers at different program points, and it is used in a unique JUMP instruction when the invoked method finishes, in order to return to the particular caller that invoked that function.
Example 5. In Figure 3, at block 6D1 we have a JUMP (marked with ▶) whose address is not pushed in the same block. This JUMP takes the return address of function findWinner. If findWinner is publicly invoked, it jumps to address 142 (pushed at block 123, at ⋆), and if it is invoked from __callback it jumps to 954 (pushed at block 941, at ⋆). □

EVM to a complete CFG
As we have seen in the previous section, the addresses used by the jumping instructions are stored in the execution stack. In EVM, blocks can be reached with different stack sizes and contents. As is done in other tools [4,3,5], to precisely infer the possible addresses at jumping program points, we need a context-sensitive static analysis that analyzes each block separately for each possible stack that can reach it (only considering the addresses stored in the stack). This section presents an address analysis of EVM programs which allows us to compute a complete CFG of the EVM code. To compute the addresses involved in the jumping instructions, we define a static analysis which soundly infers all possible addresses that a JUMP instruction could use.
[Fig. 3. Fragment of the CFG of findWinner: blocks 123-141, 142-183, 64B-652, 653-660, 661-66D, 66E, 66F-682, 683-68F, 690, 691-6C2, 6C3-6CF, 6D0, 6D1-6D6, 941-953 and 954-9B8, with their EVM instructions and the jump edges between them.]
In our address analysis we aim at having the stack represented by explicit variables. Given the characteristics of EVM programs, the execution stack of EVM programs produced from Solidity programs without recursion can be flattened. Besides, as the size of the stack of the Ethereum Virtual Machine is bounded to 1024 elements (see [6]), the number of stack variables is limited. We use V to represent the set of all possible stack variables that may be used in the program. The first element we define for our analysis is its abstract state.

The abstract state. Our analysis uses a partial representation of the execution stack as its basic element. To this end, we use the notion of stack state as a pair ⟨n, σ⟩, where n is the number of elements in the stack, and σ is a partial mapping that relates some stack positions with a set of jump destinations. A position in the stack is referred to as s_i with 0 ≤ i < n, and s_{n−1} is the position at the top of the stack. The abstract state of the analysis is defined on the set of all stack states

S = { ⟨n, σ⟩ | 0 ≤ n ≤ |V| ∧ σ ∈ Σ_n }

where Σ_n is the set of all mappings using n stack variables, defined recursively as follows: Σ_i = Σ_{i−1} ∪ { σ[s_{i−1} ↦ j] | σ ∈ Σ_{i−1} ∧ j ⊆ J }; Σ_0 = {σ_∅}, where σ_∅ is the empty mapping.

Definition 2 (abstract state). The abstract state is a partial mapping π of the form S ↦ P(S).

The application of σ to an element s_i, that is, σ(s_i), corresponds to the set of jump destinations that a stack variable s_i can contain. The first element of the tuple, that is, n, stores the size of the stack in the different abstract states. The abstract domain is the lattice ⟨AS, π_⊤, π_⊥, ⊔, ⊑⟩, where AS is the set of abstract states and π_⊤ is the top of the lattice, defined as the mapping such that ∀s ∈ S, π_⊤(s) = S. The bottom element of the lattice, π_⊥, is the empty mapping. Now, to define ⊔ and ⊑, we first define the function img(π, s) as π(s) if s ∈ dom(π), and ∅ otherwise. Given two abstract states π_1 and π_2, we use π = π_1 ⊔ π_2 to denote that π is the least upper bound, defined as follows: ∀s ∈ dom(π_1) ∪ dom(π_2), π(s) = img(π_1, s) ∪ img(π_2, s). At this point, π_1 ⊑ π_2 holds iff dom(π_1) ⊆ dom(π_2) and ∀s ∈ dom(π_1), π_1(s) ⊆ π_2(s).

Transfer function.
One of the ingredients of our analysis is a transfer function that models the effect of each EVM instruction on the abstract state. Given a stack state s of the form ⟨n, σ⟩, Figure 4 defines the updating function λ(b^{δ,α}_pc, s), where b corresponds to the EVM instruction to be applied, pc corresponds to the program counter of the instruction, and α and δ to the number of elements added to and removed from the EVM stack when executing b, respectively. Given a map m, we will use m[x ↦ y] to indicate the result of updating m by making m(x) = y while m stays the same for all locations different from x, and we will use m\[x] to refer to a partial mapping that stays the same for all locations different from x, and where m(x) is undefined. By means of λ, we can now define the transfer function of our analysis.

Definition 3 (transfer function). Given the set of abstract states AS and the set of EVM instructions
Given the set of abstract states AS and the set of EVM instructions
Ins , the transfer function τ is defined as a mapping of the form τ : Ins × AS (cid:55)→ AS is defined as follows: τ ( b, π ) = π (cid:48) where ∀ s ∈ dom ( π ) , π (cid:48) ( s ) = λ ( b, π ( s )) b δ,α λ ( b, (cid:104) n, σ (cid:105) )(1) PUSH x v (cid:104) n + 1 , σ [ s n (cid:55)→ { v } ] (cid:105) when v ∈ J(cid:104) n + 1 , σ (cid:105) when v (cid:54)∈ J (2) DUP x (cid:104) n + 1 , σ (cid:105) when s n − x (cid:54)∈ dom ( σ ) (cid:104) n + 1 , σ [ s n (cid:55)→ σ ( s n − x )] (cid:105) when s n − x ∈ dom ( σ )(3) SWAP x (cid:104) n, σ (cid:105) when s n − (cid:54)∈ dom ( σ ) ∧ s n − x − (cid:54)∈ dom ( σ ) (cid:104) n, σ [ s n − x − (cid:55)→ σ ( s n − ) , s n − (cid:55)→ σ ( s n − x − )] (cid:105) when s n − ∈ dom ( σ ) ∧ s n − x − ∈ dom ( σ ) (cid:104) n, σ [ s n − (cid:55)→ σ ( s n − x − )] \ σ [ s n − x − ] (cid:105) when s n − (cid:54)∈ dom ( σ ) ∧ s n − x − ∈ dom ( σ ) (cid:104) n, σ [ s n − x − (cid:55)→ σ ( s n − )] \ σ [ s n − ] (cid:105) when s n − ∈ dom ( σ ) ∧ s n − x − (cid:54)∈ dom ( σ )(4) otherwise (cid:104) n − δ + α, σ \ [ s n − , . . . , s n − δ ] (cid:105) Fig. 4.
Updating function
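The updating function of Figure 4 can be sketched as follows, representing a stack state as a pair (n, sigma) with sigma a dict from stack positions to sets of jump destinations (a simplified sketch; the dispatch on mnemonic families and the parameter names are our assumptions):

```python
# Sketch of λ from Figure 4. J is the set of valid jump destinations of
# the program; delta/alpha are the stack elements removed/added by `op`.

def update(op, x, delta, alpha, value, state, J):
    """Apply instruction `op` (PUSHx/DUPx/SWAPx/other) to a stack state."""
    n, sigma = state
    sigma = dict(sigma)                    # keep the input state immutable
    if op == "PUSH":                       # case (1): track only addresses in J
        if value in J:
            sigma[n] = {value}
        return (n + 1, sigma)
    if op == "DUP":                        # case (2): duplicate s_{n-x} on top
        if n - x in sigma:
            sigma[n] = sigma[n - x]
        return (n + 1, sigma)
    if op == "SWAP":                       # case (3): swap s_{n-1} and s_{n-x-1}
        top, deep = sigma.pop(n - 1, None), sigma.pop(n - x - 1, None)
        if top is not None:
            sigma[n - x - 1] = top
        if deep is not None:
            sigma[n - 1] = deep
        return (n, sigma)
    # case (4): drop the delta consumed positions; the alpha results are unknown
    for i in range(1, delta + 1):
        sigma.pop(n - i, None)
    return (n - delta + alpha, sigma)
```

For example, pushing a valid destination records it at the new top, a DUP copies the tracked set, and a SWAP exchanges (or removes) the tracked entries of the two affected positions.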
Example 6. Consider the block that starts at EVM instruction 941 (shown in Figure 3), whose instructions are JUMPDEST, MOD, ADD, PUSH1 0A, DUP2, SWAP1, SSTORE, POP, PUSH2 0954, PUSH1 0A, SLOAD, PUSH2 064B and JUMP. Applying the transfer function τ instruction by instruction from the initial abstract state of the block updates the stack state at each step: PUSH2 0954 introduces the return address 954 into the mapping of the stack state, PUSH2 064B introduces the address 64B on top of it, and the final JUMP consumes 64B, leaving 954 in the stack state. The same process applied to block 123 introduces the return address 142 (PUSH2 0142) and the jump address 64B (PUSH2 064B), the latter being consumed by the JUMP at the end of the block. □

Addresses equations system.
The next step consists in defining, by means of the transfer and the updating functions, a constraint equation system to represent all possible jumping addresses that could be valid for executing a jump instruction in the program.

Definition 4 (addresses equations system). Given an EVM program P of the form b_0, ..., b_p, its addresses equations system E(P) includes the following equations for every EVM bytecode instruction b_pc ∈ P:

(1) b_pc ≡ JUMP:
    X_{σ(s_{n−1})} ⊒ idmap(λ(b_pc, ⟨n, σ⟩))      ∀s ∈ dom(X_pc), ⟨n, σ⟩ ∈ X_pc(s)
(2) b_pc ≡ JUMPI:
    X_{σ(s_{n−1})} ⊒ idmap(λ(b_pc, ⟨n, σ⟩))      ∀s ∈ dom(X_pc), ⟨n, σ⟩ ∈ X_pc(s)
    X_{pc+1} ⊒ idmap(λ(b_pc, ⟨n, σ⟩))            ∀s ∈ dom(X_pc), ⟨n, σ⟩ ∈ X_pc(s)
(3) b_pc ∉ End ∧ b_{pc+size(b_pc)} ≡ JUMPDEST:
    X_{pc+size(b_pc)} ⊒ idmap(λ(b_pc, ⟨n, σ⟩))   ∀s ∈ dom(X_pc), ⟨n, σ⟩ ∈ X_pc(s)
(4) otherwise, when b_pc ∉ End:
    X_{pc+size(b_pc)} ⊒ τ(b_pc, X_pc)

where idmap(s) returns a map π such that dom(π) = {s} and π(s) = {s}, and size(b_pc) returns the number of bytes of the instruction b_pc.

Observe that the addresses equations system has equations for all program points of the program. Concretely, variables of the form X_pc store the jumping addresses saved in the stack after executing b_pc for all possible entry stacks. This information will be used for computing all possible jump destinations when executing JUMP or JUMPI instructions. For computing the system, most instructions, case (4), just apply the transfer function τ to compute the possible stack states of the subsequent instruction. Note that the expression pc + size(b_pc) just computes the position of the next instruction in the EVM program. Jumping instructions, cases (1) and (2), compute the initial state of the invoked blocks, and thus produce a map with all possible input stack states that can reach one block. JUMP and JUMPI instructions produce, for each stack state, one equation by taking the jump address from the previous stack state, X_{σ(s_{n−1})}. JUMPI, case (2), produces an extra equation on X_{pc+1} to capture the possibility of continuing to the next instruction instead of jumping to the destination address. Additionally, the instructions placed before a JUMPDEST, case (3), produce initial states for the block that starts at the JUMPDEST. When the constraint equation system is solved, the constraint variables over-approximate the jumping information for the program.
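As an illustration of case (1) of Definition 4, the following sketch enumerates the constraints contributed by a single JUMP instruction. The hashable encoding of stack states as nested tuples and all helper names are our assumptions, not the paper's:

```python
# For every stack state reaching a JUMP, produce one constraint per jump
# target d in σ(s_{n-1}), seeded with idmap of the state after the JUMP
# pops the address (case (4) of Fig. 4 with delta=1, alpha=0).

def idmap(state):
    """idmap(s): a map whose domain is {s} and whose image is {s}."""
    return {state: {state}}

def jump_constraints(x_pc):
    """x_pc maps entry stack states to the sets of stack states at the JUMP.
    A stack state is (n, sigma) with sigma a tuple of (position, dests)."""
    constraints = []   # pairs (target pc, contributed abstract state)
    for states in x_pc.values():
        for (n, sigma) in states:
            targets = dict(sigma).get(n - 1, set())          # σ(s_{n-1})
            # λ for JUMP: drop the consumed address from the stack state
            popped = (n - 1, tuple((i, d) for i, d in sigma if i != n - 1))
            for dest in targets:
                constraints.append((dest, idmap(popped)))
    return constraints
```

A stack state of size 2 whose top tracks the single destination 64B thus contributes exactly one constraint, on X_64B.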
Example 7. As can be seen in Figure 3, we can jump to block 64B from two different blocks, 941 and 123. The computation of the addresses equations system therefore produces, for the entry program point of block 64B, one equation from the JUMP at the end of block 941 and another from the JUMP at the end of block 123. Since two different stack contents reach the same program point, the equation that must hold for X_64B is produced by the application of the least upper bound operation ⊔ to these two contributions. Note that the application of the transfer function τ to all instructions of block 64B (JUMPDEST, PUSH1 00, DUP1, PUSH1 00, SWAP1, POP) applies the function λ to all elements in the abstract state and updates the stack states accordingly. □

Solving the addresses equations system of a program P can be done iteratively.
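This iterative resolution can be sketched as a generic chaotic-iteration loop: the constraint list, join (⊔), ordering (⊑) and bottom element are parameters, instantiated below with a toy set lattice (all names here are illustrative, not the paper's):

```python
# Generic solver for a constraint system of the form X_k ⊒ f_k(env):
# iterate, joining each violated constraint's value into its target,
# until every constraint holds.

def solve(variables, equations, join, leq, bottom):
    """equations: list of (target, rhs) with rhs a function of the env."""
    env = {v: bottom() for v in variables}
    changed = True
    while changed:          # terminates: no infinite ascending chains
        changed = False
        for target, rhs in equations:
            value = rhs(env)
            if not leq(value, env[target]):
                env[target] = join(env[target], value)
                changed = True
    return env

# Toy instantiation with sets: X0 ⊒ {1}, X1 ⊒ X0, X1 ⊒ {2}, X0 ⊒ X1.
eqs = [("X0", lambda e: {1}), ("X1", lambda e: set(e["X0"])),
       ("X1", lambda e: {2}), ("X0", lambda e: set(e["X1"]))]
env = solve(["X0", "X1"], eqs, lambda a, b: a | b,
            lambda a, b: a <= b, set)
assert env["X0"] == {1, 2} and env["X1"] == {1, 2}
```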
A naïve algorithm consists in first creating one constraint X_0 ⊒ π_∅[⟨0, σ_∅⟩ ↦ {⟨0, σ_∅⟩}], where π_∅ and σ_∅ are the empty mappings, and X_pc ⊒ π_⊥ for all pc ∈ P, pc ≠ 0, and then iteratively refining the values of these variables as follows:

[Fig. 5. Jumps equations system of the __callback function: the equations produced for the first and last instructions of the blocks shown in Figure 3.]
1. Substitute the current values of the constraint variables in the right-hand side of each constraint, and then evaluate the right-hand side if needed.
2. If each constraint X ⊒ E holds, where E is the value of the evaluation of the right-hand side in the previous step, then the process finishes; otherwise,
3. for each X ⊒ E which does not hold, let E′ be the current value of X. Then update the current value of X to E ⊔ E′. Once all these updates are (iteratively) applied, we repeat the process at step 1.

Termination is guaranteed since the abstract domain does not have infinitely ascending chains, as the number of jump destinations and the stack size are finite.

Example 8.
Figure 5 shows the equations produced by Definition 4 for the first and the last instruction of all blocks shown in Figure 3. Observe that the application of τ stores the jumping addresses in the corresponding abstract states after PUSH instructions. Such addresses are then used to produce the equations at the JUMP or JUMPI instructions. In the case of JUMP, as the jump is unconditional, it only produces one equation, computing the input abstract state of the destination block. JUMPI instructions produce two different equations: (1) one equation which corresponds to the jumping address stored in the stack; and (2) one equation which corresponds to the next instruction. Finally, another point to highlight occurs when two possible jumping addresses are in the stack of a block and both can be used by the JUMP at the end of that block: we produce two inputs, one for each possible jumping address, capturing the two possible branches (see Figure 3). □

Theorem 1 (Soundness).
Let P ≡ b_0, ..., b_p be a program, X_0, ..., X_p the solution of the addresses equations system of P, and pc the program counter of a jump instruction. Then, for any execution of P, there exists s ∈ dom(X_pc) such that ⟨n, σ⟩ ∈ X_pc(s) and σ(s_{n−1}) contains all jump addresses that instruction b_pc might jump to during the execution of P.

Control Flow Graph. At this point, by means of the solution of the addresses equations system, we compute the control flow graph of the program. In order to simplify the notation, given a block B_i, we define the function getId(i, ⟨n, σ⟩), which receives the block identifier i and an abstract stack ⟨n, σ⟩ and returns a unique identifier for the abstract stack ⟨n, σ⟩ ∈ dom(X_i). Similarly, getStack(i, id) returns the abstract stack ⟨n, σ⟩ that corresponds to the identifier id of block B_i. Besides, we define the function getSize(pc, id) that, given a program point pc ∈ B_i and a unique identifier id for B_i, returns the value n′ such that ⟨n, σ⟩ = getStack(i, id) and X_pc(⟨n, σ⟩) = ⟨n′, σ′⟩.

Example 9.
Given an equation X whose domain contains two different abstract input stacks, each mapped to its corresponding output stack, if we compute the functions getId and getSize we have that getId assigns the identifier 1 to the first abstract input stack and the identifier 2 to the second one. Analogously, getSize returns the sizes of the respective output stacks, 7 and 3. □
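The bookkeeping behind getId, getStack and getSize can be sketched in Python as follows. This is a minimal illustration under our own assumptions, not the tool's implementation: an abstract stack ⟨n, σ⟩ is modelled as a pair (n, sigma) with sigma a dict from stack positions to sets of possible jump addresses, and the block identifier 31 used below is hypothetical.

```python
def freeze(stack):
    """Make an abstract stack <n, sigma> hashable so it can key a dict."""
    n, sigma = stack
    return (n, frozenset((pos, frozenset(addrs)) for pos, addrs in sigma.items()))

class BlockIds:
    """Per-block table assigning consecutive unique ids to the distinct
    abstract input stacks a block is invoked with (getId / getStack)."""
    def __init__(self):
        self.ids = {}      # (block, frozen stack) -> id
        self.stacks = {}   # (block, id) -> abstract stack

    def get_id(self, block, stack):          # getId(i, <n, sigma>)
        key = (block, freeze(stack))
        if key not in self.ids:
            new_id = len([b for b, _ in self.ids if b == block]) + 1
            self.ids[key] = new_id
            self.stacks[(block, new_id)] = stack
        return self.ids[key]

    def get_stack(self, block, ident):       # getStack(i, id)
        return self.stacks[(block, ident)]

def get_size(X_pc, table, block, ident):
    """getSize(pc, id): size n' of the output stack X_pc(<n, sigma>)."""
    n_out, _ = X_pc[freeze(table.get_stack(block, ident))]
    return n_out
```

Replaying Example 9 under this encoding, two calls to `get_id` on distinct abstract stacks of the same block return 1 and 2, and `get_size` recovers the output stack sizes stored in X_pc.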
Definition 5 (control flow graph). Given an EVM program P, its blocks B_i ≡ b_i … b_j ∈ blocks(P) and its flow analysis results, provided by a set of variables of the form X_pc for all pc ∈ P, we define the control flow graph of P as a directed graph CFG = ⟨V, E⟩ with a set of vertices

V = { B_i:id | B_i ∈ blocks(P) ∧ ⟨n, σ⟩ ∈ dom(X_i) ∧ id = getId(i, ⟨n, σ⟩) }

and a set of edges E = E_jump ∪ E_next such that:

E_jump = { B_i:id → B_d:id′ | b_j ∈ Jump ∧ ⟨n, σ⟩ ∈ dom(X_j) ∧ id = getId(i, ⟨n, σ⟩) ∧ ⟨n′, σ′⟩ ∈ X_j(⟨n, σ⟩) ∧ d = σ′(s_{n′−1}) ∧ ⟨n″, σ″⟩ = λ(b_j, ⟨n′, σ′⟩) ∧ id′ = getId(d, ⟨n″, σ″⟩) }

E_next = { B_i:id → B_d:id′ | b_j ≠ JUMP ∧ b_j ∉ End ∧ ⟨n, σ⟩ ∈ dom(X_j) ∧ id = getId(i, ⟨n, σ⟩) ∧ ⟨n′, σ′⟩ ∈ X_j(⟨n, σ⟩) ∧ d = j + size(b_j) ∧ ⟨n″, σ″⟩ = λ(b_j, ⟨n′, σ′⟩) ∧ id′ = getId(d, ⟨n″, σ″⟩) }

The first relevant point of the control flow graph (CFG) we produce is that, to build the set of vertices V, we replicate each block for each different stack state with which it may be invoked (gray nodes in Figure 3 are replicated in the CFG). Analogously, the different entry stack states are also used to produce different edges depending on the corresponding replicated blocks. Note that the definition distinguishes between two kinds of edges:
(1) edges produced by JUMP or JUMPI instructions at the end of the blocks, whose destination is taken from the values stored in the stack state of the instruction before the jump, with d = σ′(s_{n′−1}); and (2) edges produced by continuations to the next instruction, whose destination is computed with d = j + size(b_j). In both kinds of edges, as blocks may be replicated, we apply the function λ and take the identifier of the resulting state to compute the identifier of the destination: ⟨n″, σ″⟩ = λ(b_j, ⟨n′, σ′⟩) ∧ id′ = getId(d, ⟨n″, σ″⟩). Example 10.
Considering the blocks shown in Figure 3 and the equations shown in Figure 5, the CFG of the program includes non-replicated nodes for those blocks that receive only one possible stack state (white nodes in Figure 3), whereas the nodes that can be reached with two different stack states (gray nodes in Figure 3) are replicated in the CFG. Analogously, our CFG replicates the edges according to the replicated nodes (solid and dashed edges in Figure 3). Note that, in Figure 3, we distinguish dashed and solid edges just to remark that there are two possible execution paths: if the call to findWinner comes from the internal invocation block, it returns to the corresponding continuation block, and if the execution comes from a public invocation, it returns to the other continuation block. □

The proof sketch follows these steps:
1. We first define an EVM operational semantics that describes how EVM programs handle jump addresses on the stack.
2. Then we define an EVM collecting semantics for the operational semantics. This collecting semantics gathers all transitions that can be produced by the execution of a program P.
3. We continue by defining the jumps-to property as a property of this collecting semantics.
4. Then we prove a lemma stating that the least solution of the set of constraints generated as described in Definition 4 is a safe approximation of the EVM collecting semantics w.r.t. the jumps-to property.
5. Finally, we rewrite Theorem 1 in terms of the operational semantics and prove it.

Figure 6 shows the semantics of some instructions involved in the computation of the values stored in the stack for handling jumps. The state of the program S is a tuple ⟨pc, ⟨n, σ⟩⟩ where pc is the value of the program counter with the index of the next instruction to be executed, and ⟨n, σ⟩ is a stack state as defined in Section 2. The interesting rules are the ones that deal with jump destination addresses on the stack: Rule (4) adds a new address to the stack, and Rules (6) and (8)-(10) copy or exchange existing addresses near the top of the stack, respectively. Rules (1) to (3) perform a jump in the program and therefore consume the address placed on top of the stack, plus an additional word in the case of JUMPI. If the instructions considered in this simplified semantics do not handle jump addresses, the corresponding rules just remove some values from the stack in the program state S (Rules (5), (7) and (11)). The remaining EVM instructions not explicitly considered in this simplified semantics are generically represented by Rule (12) with b^{δ,α}_pc, where δ is the number of items removed from the stack when b_pc is executed, and α is the number of additional items placed on the stack. Complete executions are traces of the form S_0 ⇒ S_1 ⇒ … ⇒ S_n where S_0 ≡ ⟨0, ⟨0, σ_∅⟩⟩ is the initial state, σ_∅ is the empty mapping, and S_n corresponds to the last state.
There are no infinite traces, as any transaction that executes EVM code has a finite gas limit and every executed instruction consumes some amount of gas. When the gas limit is exceeded, an out-of-gas exception occurs and the program halts immediately.
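As an illustration, the jump-handling rules of Figure 6 can be sketched as a successor function on states ⟨pc, ⟨n, σ⟩⟩. This is a minimal Python sketch under our own encoding (σ as a dict from stack positions to sets of addresses, instructions as (opcode, argument) pairs, `size` the byte size of the instruction, J the set of valid jump destinations); it is not the formal semantics itself and omits gas accounting, and Rule (12) is shown only for the δ = α = 0 case.

```python
def step(pc, n, sigma, instr, size, J):
    """One step of the simplified semantics; returns the list of successor
    states (pc', n', sigma'). JUMP/JUMPI yield one successor per address."""
    op, arg = instr
    nxt = pc + size
    if op == "JUMP":                      # Rule (1): consume the top address
        s = {k: v for k, v in sigma.items() if k != n - 1}
        return [(d, n - 1, s) for d in sorted(sigma[n - 1])]
    if op == "JUMPI":                     # Rules (2)-(3): jump and fall-through
        s = {k: v for k, v in sigma.items() if k not in (n - 1, n - 2)}
        return [(d, n - 2, s) for d in sorted(sigma[n - 1])] + [(nxt, n - 2, s)]
    if op == "PUSH":                      # Rules (4)-(5): track addresses in J only
        s = dict(sigma)
        if arg in J:
            s[n] = {arg}
        return [(nxt, n + 1, s)]
    if op == "DUP":                       # Rules (6)-(7): copy a tracked address
        s = dict(sigma)
        if n - arg in sigma:
            s[n] = sigma[n - arg]
        return [(nxt, n + 1, s)]
    if op == "SWAP":                      # Rules (8)-(11): exchange positions
        s = {k: v for k, v in sigma.items() if k not in (n - 1, n - arg - 1)}
        if n - arg - 1 in sigma:
            s[n - 1] = sigma[n - arg - 1]
        if n - 1 in sigma:
            s[n - arg - 1] = sigma[n - 1]
        return [(nxt, n, s)]
    return [(nxt, n, sigma)]              # Rule (12), sketched for delta = alpha = 0
```

For instance, a JUMPI whose top-of-stack abstraction holds one address produces exactly two successors, matching the two equations generated for conditional jumps.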
Definition 6 (EVM collecting semantics). Given an EVM program P, the EVM collecting semantics operator C_P is defined as follows:

C_P(X) = { ⟨S, S′⟩ | ⟨_, S⟩ ∈ X ∧ S ⇒ S′ }

The EVM semantics is defined as ξ_P = ⋃_{n>0} C^n_P(X_0), where X_0 ≡ {⟨0, ⟨0, σ_∅⟩⟩} is the initial configuration.
Definition 7 (jumps-to property). Let P be an EVM program, ξ_P = ⋃_{n>0} C^n_P(X_0), and b an instruction at program point pc. Then we say that ξ_P ⊨_pc T if T = { ⟨n, σ⟩ | ⟨S, S′⟩ ∈ ξ_P ∧ S′ = ⟨pc, ⟨n, σ⟩⟩ }.

The following lemma states that the least solution of the constraint equation system defined in Definition 2 is a safe approximation of ξ_P: Lemma 1 (soundness).
Let P ≡ b_0, …, b_p be a program, pc a program point and X_0, …, X_p the least solution of the constraint equation system as defined in Section 2. The following holds: if ξ_P ⊨_pc T, then for all ⟨n, σ⟩ ∈ T, there exists s ∈ dom(X_pc) such that ⟨n, σ⟩ ∈ X_pc(s).

(1) b_pc = JUMP:
⟨pc, ⟨n, σ⟩⟩ ⇒ ⟨σ(s_{n−1}), ⟨n−1, σ\[s_{n−1}]⟩⟩

(2) b_pc = JUMPI:
⟨pc, ⟨n, σ⟩⟩ ⇒ ⟨σ(s_{n−1}), ⟨n−2, σ\[s_{n−1}, s_{n−2}]⟩⟩

(3) b_pc = JUMPI:
⟨pc, ⟨n, σ⟩⟩ ⇒ ⟨pc + size(b_pc), ⟨n−2, σ\[s_{n−1}, s_{n−2}]⟩⟩

(4) b_pc = PUSHx v, v ∈ J:
⟨pc, ⟨n, σ⟩⟩ ⇒ ⟨pc + size(b_pc), ⟨n+1, σ[s_n ↦ {v}]⟩⟩

(5) b_pc = PUSHx v, v ∉ J:
⟨pc, ⟨n, σ⟩⟩ ⇒ ⟨pc + size(b_pc), ⟨n+1, σ⟩⟩

(6) b_pc = DUPx, s_{n−x} ∈ dom(σ):
⟨pc, ⟨n, σ⟩⟩ ⇒ ⟨pc + size(b_pc), ⟨n+1, σ[s_n ↦ σ(s_{n−x})]⟩⟩

(7) b_pc = DUPx, s_{n−x} ∉ dom(σ):
⟨pc, ⟨n, σ⟩⟩ ⇒ ⟨pc + size(b_pc), ⟨n+1, σ⟩⟩

(8) b_pc = SWAPx, s_{n−1} ∈ dom(σ), s_{n−x−1} ∈ dom(σ):
⟨pc, ⟨n, σ⟩⟩ ⇒ ⟨pc + size(b_pc), ⟨n, σ[s_{n−x−1} ↦ σ(s_{n−1}), s_{n−1} ↦ σ(s_{n−x−1})]⟩⟩

(9) b_pc = SWAPx, s_{n−1} ∈ dom(σ), s_{n−x−1} ∉ dom(σ):
⟨pc, ⟨n, σ⟩⟩ ⇒ ⟨pc + size(b_pc), ⟨n, σ[s_{n−x−1} ↦ σ(s_{n−1})]\[s_{n−1}]⟩⟩

(10) b_pc = SWAPx, s_{n−1} ∉ dom(σ), s_{n−x−1} ∈ dom(σ):
⟨pc, ⟨n, σ⟩⟩ ⇒ ⟨pc + size(b_pc), ⟨n, σ[s_{n−1} ↦ σ(s_{n−x−1})]\[s_{n−x−1}]⟩⟩

(11) b_pc = SWAPx, s_{n−1} ∉ dom(σ), s_{n−x−1} ∉ dom(σ):
⟨pc, ⟨n, σ⟩⟩ ⇒ ⟨pc + size(b_pc), ⟨n, σ\[s_{n−1}, s_{n−x−1}]⟩⟩

(12) b^{δ,α}_pc otherwise, b_pc ∉ End:
⟨pc, ⟨n, σ⟩⟩ ⇒ ⟨pc + size(b_pc), ⟨n − δ + α, σ\[s_{n−1}, …, s_{n−δ}]⟩⟩

Fig. 6. Simplified EVM semantics for handling jumps.
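The collecting semantics of Definition 6 can be sketched as exhaustively accumulating transition pairs ⟨S, S′⟩ from the initial state. This is a minimal Python sketch under our own encoding; the successor function `step` is a hypothetical parameter (for instance, one implementing the Figure 6 rules), and termination mirrors the gas argument: only finitely many states are reachable.

```python
def collecting_semantics(step):
    """Accumulate all transition pairs <S, S'> reachable from the initial
    state <0, <0, sigma_empty>>, i.e. a worklist rendering of U_n C_P^n(X0).
    `step(state)` must return the list of =>-successors of a state."""
    init = (0, (0, ()))            # <pc, <n, sigma>>, sigma encoded as ()
    xi, frontier = set(), {init}
    while frontier:                # terminates: the state space is finite
        nxt = set()
        for s in frontier:
            for s2 in step(s):
                if (s, s2) not in xi:
                    xi.add((s, s2))
                    nxt.add(s2)
        frontier = nxt
    return xi
```

On a toy two-step successor relation, the result contains exactly the two transitions of the single trace, as expected.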
Proof.
We use X^m_pc to refer to the value obtained for X_pc after m iterations of the algorithm for solving the equation system depicted in Section 2. We say that X_pc covers ⟨n, σ⟩ in C^m_P(X_0) at program point pc when this lemma holds for the result of computing C^m_P(X_0). In order to prove this lemma, we reason by induction on the value of m, the length of the traces S_0 ⇒^m S_m considered in C^m_P(X_0).

Base case: if m = 0, then S_0 = ⟨0, ⟨0, σ_∅⟩⟩ and the lemma trivially holds, as ⟨0, σ_∅⟩ ∈ X_0(⟨0, σ_∅⟩).

Induction hypothesis: we assume that Lemma 1 holds for all traces of length m ≥ 0.

Inductive case: let us consider traces of length m+1, which are of the form S_0 ⇒^m S_m ⇒ S_{m+1}, where S_m is a program state of the form S_m = ⟨pc, ⟨n, σ⟩⟩. We can apply the induction hypothesis to S_m: there exists some s ∈ dom(X^m_pc) such that ⟨n, σ⟩ ∈ X^m_pc(s). To extend the lemma, we reason over all possible rules of the simplified EVM semantics (Fig. 6) that may be applied from S_m to S_{m+1}:
– Rule (1): After executing a
JUMP instruction, S_{m+1} is of the form ⟨σ(s_{n−1}), ⟨n−1, σ\[s_{n−1}]⟩⟩. In iteration m+1, the following set of equations corresponding to b_pc is evaluated:

X_{σ(s_{n−1})} ⊒ idmap(λ(b_pc, ⟨n′, σ′⟩)) for all s′ ∈ dom(X_pc), ⟨n′, σ′⟩ ∈ X_pc(s′)

where idmap(λ(b_pc, ⟨n′, σ′⟩)) = π_⊥[⟨n′−1, σ′\[s_{n′−1}]⟩ ↦ {⟨n′−1, σ′\[s_{n′−1}]⟩}] (Case (4) in Fig. 4). The induction hypothesis guarantees that there exists some s″ ∈ dom(X^m_pc) such that ⟨n, σ⟩ ∈ X^m_pc(s″), where S_m = ⟨pc, ⟨n, σ⟩⟩. Therefore, at iteration m+1, the following must hold:

X^{m+1}_{σ(s_{n−1})} ⊒ π_⊥[⟨n−1, σ\[s_{n−1}]⟩ ↦ {⟨n−1, σ\[s_{n−1}]⟩}]

so ⟨n−1, σ\[s_{n−1}]⟩ ∈ X^{m+1}_{σ(s_{n−1})}(⟨n−1, σ\[s_{n−1}]⟩) and thus Lemma 1 holds.
– Rules (2) and (3): After executing a
JUMPI instruction, S_{m+1} is either ⟨σ(s_{n−1}), ⟨n−2, σ\[s_{n−1}, s_{n−2}]⟩⟩ or ⟨pc + size(b_pc), ⟨n−2, σ\[s_{n−1}, s_{n−2}]⟩⟩, respectively. In either case the following sets of equations are evaluated:

X_{σ(s_{n−1})} ⊒ idmap(λ(JUMPI, ⟨n′, σ′⟩)) for all s′ ∈ dom(X_pc), ⟨n′, σ′⟩ ∈ X_pc(s′)
X_{pc+1} ⊒ idmap(λ(JUMPI, ⟨n′, σ′⟩)) for all s′ ∈ dom(X_pc), ⟨n′, σ′⟩ ∈ X_pc(s′)

where idmap(λ(b_pc, ⟨n′, σ′⟩)) = π_⊥[⟨n′−2, σ′\[s_{n′−1}, s_{n′−2}]⟩ ↦ {⟨n′−2, σ′\[s_{n′−1}, s_{n′−2}]⟩}] (Case (4) of the definition of the update function λ in Fig. 4). As in the previous case, the induction hypothesis guarantees that at iteration m there exists s″ ∈ dom(X^m_pc) such that ⟨n, σ⟩ ∈ X^m_pc(s″). Therefore, in iteration m+1, the following must hold:

X^{m+1}_{σ(s_{n−1})} ⊒ π_⊥[⟨n−2, σ\[s_{n−1}, s_{n−2}]⟩ ↦ {⟨n−2, σ\[s_{n−1}, s_{n−2}]⟩}]
X^{m+1}_{pc+1} ⊒ π_⊥[⟨n−2, σ\[s_{n−1}, s_{n−2}]⟩ ↦ {⟨n−2, σ\[s_{n−1}, s_{n−2}]⟩}]

and thus Lemma 1 holds for these cases as well.
– Rules (4)-(12): We first consider the case in which any of these rules corresponds to an
EVM instruction followed by an instruction different from JUMPDEST. All rules are similar, as they all use the set of equations generated by Case (4) in Definition 4. We detail Rule (4). After executing a PUSHx v instruction, S_{m+1} is ⟨pc + size(b_pc), ⟨n+1, σ[s_n ↦ {v}]⟩⟩. We have to prove that there exists some s ∈ dom(X_{pc+size(b_pc)}) such that ⟨n+1, σ[s_n ↦ {v}]⟩ ∈ X_{pc+size(b_pc)}(s). The following set of equations is evaluated:

X_{pc+size(b_pc)} ⊒ τ(PUSHx, X_pc)   (1)

By Definition 3, τ(PUSHx, X_pc) = π′, where ∀ s′ ∈ dom(π), π′(s′) = λ(PUSHx, X_pc(s′)). By Case (1) of the definition of the update function λ, we have that:

∀ ⟨n″, σ″⟩ ∈ dom(X_pc), π′(⟨n″, σ″⟩) = ⟨n″+1, σ″[s_{n″} ↦ {v}]⟩   (2)

By the induction hypothesis, at iteration m there exists some s ∈ dom(X^m_pc) such that ⟨n, σ⟩ ∈ X^m_pc(s). Therefore, by (1) and (2), at iteration m+1 the following holds: s ∈ dom(X^{m+1}_{pc+size(b_pc)}) and ⟨n+1, σ[s_n ↦ {v}]⟩ ∈ X^{m+1}_{pc+size(b_pc)}(s), and thus Lemma 1 holds for Rule (4).
– Rules (4)-(12), followed by a
JUMPDEST instruction. After executing any of these instructions, S_{m+1} is ⟨pc + size(b_pc), ⟨n‴, σ‴⟩⟩, where ⟨n‴, σ‴⟩ is obtained according to the corresponding rule from Figure 6. We have to prove that there exists some s ∈ dom(X_{pc+size(b_pc)}) such that ⟨n‴, σ‴⟩ ∈ X_{pc+size(b_pc)}(s). The following set of equations is evaluated:

X_{pc+size(b_pc)} ⊒ idmap(λ(b_pc, ⟨n′, σ′⟩)) for all s′ ∈ dom(X_pc), ⟨n′, σ′⟩ ∈ X_pc(s′)   (3)

where idmap(λ(b_pc, ⟨n′, σ′⟩)) = π_⊥[⟨n″, σ″⟩ ↦ {⟨n″, σ″⟩}], and n″ and σ″ are obtained according to the cases of the update function λ detailed in Figure 4. We can see that ⟨n″, σ″⟩ matches the modification made to the state S_{m+1} by the corresponding rule of the semantics. Therefore, at iteration m+1 there exists an s = ⟨n″, σ″⟩ such that ⟨n″, σ″⟩ ∈ X^{m+1}_{pc+size(b_pc)}(s), and Lemma 1 also holds.

When the algorithm stops, Lemma 1 holds, as X^{m+1}_pc ⊒ X^m_pc for any pc at each iteration of the algorithm for solving the equation system of Section 2.

Now we rewrite Theorem 1 in terms of the operational semantics of Figure 6. This rewriting is actually stronger than Theorem 1, as it guarantees the correctness of the stack states obtained from the jumps equation system at any step of the execution. Theorem 2 (Soundness).
Let P ≡ b_0, …, b_p be a program, S_0 = ⟨0, ⟨0, σ_∅⟩⟩ the initial program state, and X_0, …, X_n the solution of the jumps equation system of P. Then, for any trace S_0 ⇝* S_m, where S_m = ⟨pc, ⟨n, σ⟩⟩, there exists s ∈ dom(X_pc) such that ⟨n, σ⟩ ∈ X_pc(s). Proof. Straightforward from Lemma 1, as the
EVM collecting semantics takes into account all possible traces of the operational semantics. □

References
1. The EthereumPot contract, 2017. https://etherscan.io/address/0x5a13caa82851342e14cd2ad0257707cddb8a31b7.
2. A. V. Aho, M. S. Lam, R. Sethi, and J. D. Ullman. Compilers: Principles, Techniques, and Tools. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2nd edition, 2006.
3. Lexi Brent, Anton Jurisevic, Michael Kong, Eric Liu, Francois Gauthier, Vincent Gramoli, Ralph Holz, and Bernhard Scholz. Vandal: A Scalable Security Analysis Framework for Smart Contracts, 2018. arXiv:1809.03981.
4. Neville Grech, Lexi Brent, Bernhard Scholz, and Yannis Smaragdakis. Gigahorse: thorough, declarative decompilation of smart contracts. In Joanne M. Atlee, Tevfik Bultan, and Jon Whittle, editors, Proceedings of the 41st International Conference on Software Engineering, ICSE 2019, Montreal, QC, Canada, May 25-31, 2019, pages 1176–1186. IEEE / ACM, 2019.
5. Neville Grech, Michael Kong, Anton Jurisevic, Lexi Brent, Bernhard Scholz, and Yannis Smaragdakis. Madmax: surviving out-of-gas conditions in ethereum smart contracts.