AI-based Blackbox Code Deobfuscation: Understand, Improve and Mitigate

Abstract

Code obfuscation aims at protecting Intellectual Property and other secrets embedded into software from being retrieved. Recent works leverage advances in artificial intelligence with the hope of getting blackbox deobfuscators completely immune to standard (whitebox) protection mechanisms. While promising, this new field of AI-based blackbox deobfuscation is still in its infancy. In this article we advance the state of AI-based blackbox deobfuscation in three key directions: understand the current state-of-the-art, improve over it and design dedicated protection mechanisms. In particular, we define a novel generic framework for AI-based blackbox deobfuscation encompassing prior work and highlighting key components; we are the first to point out that the search space underlying code deobfuscation is too unstable for simulation-based methods (e.g., Monte Carlo Tree Search used in prior work) and advocate the use of robust methods such as S-metaheuristics; we propose the new optimized AI-based blackbox deobfuscator Xyntia, which significantly outperforms prior work in terms of success rate (especially with small time budget) while being completely immune to the most recent anti-analysis code obfuscation methods; and finally we propose two novel protections against AI-based blackbox deobfuscation, which counter Xyntia's powerful attacks.


arXiv [cs.CR]

Grégoire Menguy, LIST, France

Sébastien Bardin, LIST, France

Richard Bonichon, Labs, France

Cauim de Souza Lima, LIST, France


KEYWORDS

Binary-level code analysis, deobfuscation, artificial intelligence

Context.

Software contains valuable assets, such as secret algorithms, business logic or cryptographic keys, that attackers may try to retrieve. The so-called Man-At-The-End (MATE) attack scenario considers the case where software users themselves are adversarial and try to extract such information from the code. Code obfuscation [12, 13] aims at protecting code against such attacks, by transforming a sensitive program $P$ into a functionally equivalent program $P'$ that is more "difficult" (more expensive, for example, in money or time) to understand or modify. On the flip side, code deobfuscation aims to extract information from obfuscated code.

Whitebox deobfuscation techniques, based on advanced symbolic program analysis, have proven extremely powerful against standard obfuscation schemes [3, 5, 10, 22, 29, 31, 37], especially in local attack scenarios where the attacker analyses pre-identified parts of the code (e.g., trigger conditions). But they are inherently sensitive to the syntactic complexity of the code under analysis, leading to recent and effective countermeasures [12, 26, 27, 38].

AI-based blackbox deobfuscation.

Despite being rarely sound or complete, artificial intelligence (AI) techniques are flexible and often provide good enough solutions to hard problems in reasonable time. They have therefore recently been applied to binary-level code deobfuscation. The pioneering work by Blazytko et al. [7] shows how Monte Carlo Tree Search (MCTS) [9] can be leveraged to solve local deobfuscation tasks by learning the semantics of pieces of protected code in a blackbox manner, in principle immune to the syntactic complexity of this code. Their method and prototype, Syntia, have been successfully used to reverse state-of-the-art protectors like VMProtect [35], Themida [28] and Tigress [11], drawing attention from the software security community [8].

Problem.

While promising, AI-based blackbox (code) deobfuscation techniques are still not well understood. Several key questions of practical relevance (e.g., deobfuscation correctness and quality, sensitivity to time budget) are not addressed in Blazytko et al.'s original paper, making it hard to exactly assess the strengths and weaknesses of the approach. Moreover, as Syntia comes with many hard-coded design and implementation choices, it is legitimate to ask whether other choices lead to better performance, and to get a broader view of AI-based blackbox deobfuscation methods. Finally, it is unclear how these methods compare with recent proposals for greybox deobfuscation [16] or general program synthesis [6, 30], and how to protect from such blackbox attacks.

Goal.

We focus on advancing the current state of AI-based blackbox deobfuscation methods in the following three key directions: (1) generalize the initial Syntia proposal and refine the initial experiments by Blazytko et al. in order to better understand AI-based blackbox methods, (2) improve the current state-of-the-art (Syntia) through a careful formalization and exploration of the design space, and evaluate the approach against greybox and program synthesis methods, and finally (3) study how to mitigate such AI-based attacks. In particular, we study the underlying search space, bringing new insights for efficient blackbox deobfuscation, and promote the application of S-metaheuristics [33] instead of MCTS.

Contributions.

Our main contributions are the following:

• We refine experiments by Blazytko et al. in a systematic way, highlighting both new strengths and new weaknesses of the initial Syntia proposal for AI-based blackbox deobfuscation (Section 4). In particular, Syntia (based on Monte Carlo Tree Search, MCTS) is far less efficient than expected for small time budgets (its typical usage scenario) and lacks robustness;

• We propose a missing formalization of blackbox deobfuscation (Section 4) and dig into Syntia internals to rationalize our observations (Section 4.4). It appears that the search space underlying blackbox code deobfuscation is too unstable to rely on MCTS: in particular, assigning a score to a partial node through simulation leads here to poor estimations. As a result, Syntia is here almost enumerative;

• We propose to see (Section 5) blackbox deobfuscation as an optimization problem rather than a single player game, which allows reusing S-metaheuristics [33], known to be more robust than MCTS on unstable search spaces (in particular, they do not need to score partial states). We propose Xyntia (Section 5), an AI-based blackbox deobfuscator using Iterated Local Search (ILS) [24], known among S-metaheuristics for its robustness. Thorough experiments show that Xyntia keeps the benefits of Syntia while correcting most of its flaws. In particular, Xyntia significantly outperforms Syntia, synthesizing twice as many expressions with a budget of 1s/expr than Syntia with 600s/expr. Other metaheuristics also clearly beat MCTS, even if they are less effective here than ILS;

• We evaluate Xyntia against other state-of-the-art attackers (Section 6), namely the QSynth greybox deobfuscator [16], program synthesizers (CVC4 [6] and STOKE [30]) and pattern-matching based simplifiers. Xyntia outperforms all of them: it finds 2× more expressions and is 30× faster than QSynth on heavy protections;

• We evaluate Xyntia against state-of-the-art defenses (Section 7), especially recent anti-analysis proposals [14, 26, 32, 36, 38]. As expected, Xyntia is immune to such defenses. In particular, it successfully bypasses side-channels [32], path explosion [26] and MBA [38]. We also use it to synthesize VM-handlers from state-of-the-art virtualizers [11, 35, 36];

• Finally, we propose the first two protections against AI-based blackbox deobfuscation (Section 8). We observe that all phases of blackbox techniques can be thwarted (hypothesis, sampling and learning) and propose two practical methods exploiting these limitations, and discuss them in the context of virtualization-based obfuscation: (1) semantically complex handlers; (2) merged handlers with branch-less conditions. Experiments show that both protections are highly effective against blackbox attacks.

We hope that our results will help better understand AI-based code deobfuscation, and lead to further progress in this promising field.

Availability.

Benchmarks and code will be made available online. Also, we put a fair amount of experimental data in appendices for convenience. While the core paper can be read without it, this material will still be made available online in a technical report.

Program obfuscation [12, 13] is a family of methods designed to make reverse engineering (understanding programs' internals) hard. It is employed by manufacturers to protect intellectual property and by malware authors to hinder analysis. It transforms a program $P$ into a functionally equivalent, more complex program $P'$ with an acceptable performance penalty. Obfuscation does not ensure that a program cannot be understood (this is impossible in the MATE context [4]) but aims to delay the analysis as much as possible in order to make it unprofitable. Thus, it is especially important to protect from automated deobfuscation analyses (anti-analysis obfuscation). We present here two important obfuscation methods.

Mixed Boolean-Arithmetic (MBA) encoding [38] transforms an arithmetic and/or Boolean expression into an equivalent one, combining arithmetic and Boolean operations. It can be applied iteratively to increase the syntactic complexity of the expression. Eyrolles et al. [18] show that SMT solvers struggle to answer equivalence requests on MBA expressions, preventing the automated simplification of protected expressions by symbolic methods.

Virtualization [36] translates an initial code $P$ into a bytecode $B$ together with a custom virtual machine. Execution of the obfuscated code can be divided into 3 steps (Fig. 1): (1) fetch the next bytecode instruction to execute, (2) decode the bytecode and find the corresponding handler, and finally (3) execute the handler. Virtualization hides the real control-flow graph (CFG) of $P$, and reversing the handlers is key for reversing the VM. Virtualization is notably used in malware [19, 34].

[Figure 1 diagram: bytecodes go through a Fetch, Decode, Execute loop that dispatches to the handlers $h_1(x, y), h_2(x, y), h_3(x, y), \dots, h_n(x, y)$.]

Figure 1: Virtualization based obfuscation

Deobfuscation aims at reverting an obfuscated program back to a form close enough to the original one, or at least to a more understandable version. Over the past years, symbolic deobfuscation methods based on advanced program analysis techniques have proven to be very efficient at breaking standard protections [3, 5, 10, 22, 29, 31, 37]. However, very effective countermeasures are starting to emerge, based on deep limitations of the underlying code-level reasoning mechanisms and potentially strongly limiting their usage [3, 26, 27, 32, 36]. In particular, all such methods are ultimately sensitive to the syntactic complexity of the code under analysis.

Artificial intelligence based blackbox deobfuscation has been recently proposed by Blazytko et al. [7], implemented in the Syntia tool, to learn the semantics of well-delimited code fragments, e.g. MBA expressions or VM handlers. The code under analysis is seen as a blackbox that can only be queried (i.e., executed under chosen inputs to observe results). Syntia samples input-output (I/O) relations, then uses a learning engine to find an expression mapping sampled inputs to their observed outputs. Because it relies on a limited number of samples, results are not guaranteed to be correct. However, being fully blackbox, it is in principle insensitive to syntactic complexity.

Scope.

Syntia tries to infer a simple semantics of heavily obfuscated local code fragments, e.g., trigger based conditions or VM handlers. Understanding these fragments is critical to carry out the analysis.

Workflow.

Syntia's workflow is representative of AI-based blackbox deobfuscators. First, it needs (1) a reverse window, i.e., a subset of code to work on; (2) the location of its inputs and outputs. Consider the code in Listing 1 evaluating a condition at line 4. To understand this condition, a reverser focuses on the code between lines 1 and 3. This code segment is our reverse window. The reverser then needs to locate relevant inputs and outputs. The condition at line 4 is performed on $t_3$. This is our output. The set of inputs contains any variables (register or memory location at assembly level) influencing the outputs. Here, inputs are $x$ and $y$. Armed with this information, Syntia samples inputs randomly and observes resulting outputs. In our example, it might consider three samples, each mapping $x$ and $y$ to concrete values, and observe the value each of them gives to $t_3$. From these observations it learns $t_3 \leftarrow x + y$, and the reverser concludes that the condition is $x + y = 5$, where a symbolic method will typically simply retrieve that $((x \lor 2y) \times 2 - (x \oplus 2y) - y) = 5$.

    int t1 = 2 * y;
    int t2 = x | t1;
    int t3 = t2 * 2 - (x ^ t1) - y;
    if (t3 == 5) ...

Listing 1: Obfuscated condition
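The blackbox view of Listing 1 can be illustrated with a minimal Python sketch. It is not Syntia's implementation: the 32-bit word size, the function names and the fixed candidate `x + y` are assumptions made for illustration. The reverse window is only executed on random inputs, and the candidate is checked against the observed outputs.

```python
import random

MASK = 0xFFFFFFFF  # assumption: values are 32-bit machine words


def obfuscated_t3(x: int, y: int) -> int:
    """Reverse window of Listing 1 (lines 1 to 3), evaluated on 32-bit words."""
    t1 = (2 * y) & MASK
    t2 = x | t1
    return (t2 * 2 - (x ^ t1) - y) & MASK


def candidate(x: int, y: int) -> int:
    """Candidate semantics a blackbox learner could propose: x + y."""
    return (x + y) & MASK


def sample_io(blackbox, n_samples: int, seed: int = 0):
    """Query the blackbox on random inputs and record (inputs, output) pairs."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        x, y = rng.getrandbits(32), rng.getrandbits(32)
        samples.append(((x, y), blackbox(x, y)))
    return samples


def agrees(expr, samples) -> bool:
    """True when expr reproduces every observed output."""
    return all(expr(*inputs) == output for inputs, output in samples)


if __name__ == "__main__":
    io_samples = sample_io(obfuscated_t3, n_samples=50)
    print(agrees(candidate, io_samples))
```

Agreement on the samples does not prove equivalence in general. It only shows that the candidate is consistent with the observed behavior, which is why the correctness of learned expressions has to be assessed separately.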

In the MATE scenario, the attacker is the software user himself. He has access only to the obfuscated version of the code under analysis and can read or run it at will. We consider that the attacker is highly skilled in reverse engineering but has limited resources in terms of time or money. We see reverse engineering as a human-in-the-loop process where the attacker combines manual analysis with automated state-of-the-art deobfuscation methods (slicing, symbolic execution, etc.) on critical, heavily obfuscated code fragments like VM handlers or trigger-based conditions. Thus, an effective defense strategy is to thwart automated deobfuscation methods.

We now intuitively motivate the use of blackbox deobfuscation. Consider that we reverse a piece of software protected through virtualization. We need to extract the semantics of all handlers, which usually perform basic operations like $h(x, y) = x + y$. Understanding $h$ is trivial, but it can be protected to hinder analysis. Eq. (1) shows how MBA encoding hides $h$'s semantics.

$$h(x, y) = x + y \;\xrightarrow{\,mba\,}\; (x \lor 2y) \times 2 - (x \oplus 2y) - y \tag{1}$$

Such encoding syntactically transforms the expression to make it incomprehensible while preserving its semantics. To highlight the difference between syntax and semantics, we distinguish:

(1) The syntactic complexity of expression $e$ is the size of $e$, i.e. the number of operators used in it;

(2) The semantic complexity of expression $e$ is the smallest size of expressions $e'$ (in a given language) equivalent to $e$.

For example, in the MBA language, $x + y$ is syntactically simpler than $(x \lor 2y) \times 2 - (x \oplus 2y) - y$, yet they have the same semantic complexity as they are equivalent. Conversely, $x + y$ is more semantically complex than $(x + y) \land 0$, which equals 0. We do not claim to give a definitive definition of semantic and syntactic complexity, as smaller is not always simpler, but introduce the idea that two kinds of complexity exist and are independent.

The encoding in Eq. (1) is simple, but it can be repeatedly applied to create a more syntactically complex expression, leading the reverser to either give up or try to simplify it automatically. Whitebox methods based on symbolic execution (SE) [29, 37] and formula simplifications (in the vein of compiler optimizations) can extract the semantics of an expression, yet they are sensitive to syntactic complexity and will not return simple versions of highly obfuscated expressions. Conversely, blackbox deobfuscation treats the code as a blackbox, considering only sampled I/O behaviors. Thus increasing syntactic complexity, as usual state-of-the-art protections do, has simply no impact on blackbox methods.

We now present how blackbox methods integrate in a global deobfuscation process and highlight crucial properties they must hold.

Global workflow.

Reverse engineering can be fully automated, or handmade by a reverser, leveraging tools to automate specific tasks. While the deobfuscation process operates on the whole obfuscated binary, blackbox modules can be used to analyze parts of the code like conditions or VM handlers. Upon meeting a complex code fragment, the blackbox deobfuscator is called to retrieve a simple semantic expression. After synthesis succeeds, the inferred expression is used to help continue the analysis.

Requirements.

In virtualization based obfuscation, the blackbox module is typically queried on all VM handlers [7]. As the number of handlers can be arbitrarily high, blackbox methods need to be fast. In addition, inferred expressions should ideally be as simple as the original non-obfuscated expression and semantically equivalent to the obfuscated expression (i.e. correct). Finally, robustness (i.e. the capacity to synthesize complex expressions) is needed to be usable in various situations. Thus, speed, simplicity, correctness and robustness are required for efficient blackbox deobfuscation.

We propose a general view of AI-based code deobfuscation fitting state-of-the-art solutions [7, 16]. We also extend the evaluation of Syntia by Blazytko et al. [7], highlighting both some previously unreported weaknesses and strengths. From that we derive general lessons on the (in)adequacy of MCTS for code deobfuscation, which will guide our new approach (Section 5).

AI-based deobfuscation takes an obfuscated expression and tries to infer an equivalent one with lower syntactic complexity. This problem can be stated as follows:

Deobfuscation.

Let $e$, $\mathit{obf}$ be 2 equivalent expressions such that $\mathit{obf}$ is an obfuscated version of $e$ (note that $\mathit{obf}$ is possibly much larger than $e$). Deobfuscation aims to infer an expression $e'$ equivalent to $\mathit{obf}$ (and $e$), but with size similar to $e$. This problem can be approached in three ways depending on the amount of information given to the analyzer:

Blackbox

We can only run $\mathit{obf}$. The search is thus driven by sampled I/O behaviors. Syntia [7] is a blackbox approach;

Greybox

Here $\mathit{obf}$ is executable and readable but the semantics of its operators is mostly unknown. The search is driven by previously sampled I/O behaviors which can be applied to subparts of $\mathit{obf}$. QSynth [16] is a greybox solution;

Whitebox

The analyzer has full access to $\mathit{obf}$ (run, read) and the semantics of its operators is precisely known. Thus, the search can profit from advanced pattern matching and symbolic strategies. Standard static analysis falls in this category.

Blackbox methods.

AI-based blackbox deobfuscators follow the framework given in Algorithm 1. In order to deobfuscate code, one must detail a sampling strategy (i.e., how inputs are generated), a learning strategy (i.e., how to learn an expression mapping sampled inputs to observed outputs) and a simplification postprocess. For example, Syntia samples inputs randomly, uses Monte Carlo Tree Search (MCTS) [9] as learning strategy and leverages the Z3 SMT solver [17] for simplification. The choice of the sampling and learning strategies is critical. For example, too few samples could lead to incorrect results while too many could impact the search efficiency, and an inappropriate learning algorithm could impact robustness or speed.

Let us now turn to discussing Syntia's learning strategy. We show that using MCTS leads to disappointing performance and give insight to understand why.

Algorithm 1 AI-based blackbox deobfuscation framework

Inputs:
    Code: code to analyze
    Sample: sampling strategy
    Learn: learning strategy
    Simplify: expression simplifier
Output: learned expression or Failure

procedure Deobfuscate(Code, Sample, Learn)
    Oracle ← Sample(Code)
    succ, expr ← Learn(Oracle)
    if succ = True then
        return Simplify(expr)
    else
        return Failure
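The framework of Algorithm 1 can be sketched in Python as follows. This is only an illustration under stated assumptions: the three strategies are toy stand-ins (random 32-bit sampling, a learner that tries a small hypothetical pool `CANDIDATES`, and a trivial simplifier), not the MCTS learner or the Z3-based simplifier used by Syntia.

```python
import random
from typing import Callable, Dict, List, Optional, Tuple

MASK = 0xFFFFFFFF  # assumption: 32-bit words
IOSample = Tuple[Tuple[int, int], int]  # ((x, y), observed output)

# Hypothetical candidate pool standing in for a real search space.
CANDIDATES: Dict[str, Callable[[int, int], int]] = {
    "x - y": lambda x, y: (x - y) & MASK,
    "x ^ y": lambda x, y: x ^ y,
    "x + y": lambda x, y: (x + y) & MASK,
}


def sample(code: Callable[[int, int], int], n_samples: int = 50, seed: int = 0) -> List[IOSample]:
    """Sampling strategy: run the code on random inputs to build the oracle."""
    rng = random.Random(seed)
    oracle = []
    for _ in range(n_samples):
        inputs = (rng.getrandbits(32), rng.getrandbits(32))
        oracle.append((inputs, code(*inputs)))
    return oracle


def learn(oracle: List[IOSample]) -> Tuple[bool, Optional[str]]:
    """Toy learning strategy: first candidate consistent with every sample."""
    for text, fn in CANDIDATES.items():
        if all(fn(*inputs) == out for inputs, out in oracle):
            return True, text
    return False, None


def simplify(expr: str) -> str:
    """Placeholder simplifier: a real one would rewrite the expression."""
    return expr.strip()


def deobfuscate(code, sample_fn, learn_fn, simplify_fn) -> Optional[str]:
    """Algorithm 1: sample, learn, then simplify (None stands for Failure)."""
    oracle = sample_fn(code)
    succ, expr = learn_fn(oracle)
    if succ:
        return simplify_fn(expr)
    return None


def protected_handler(x: int, y: int) -> int:
    """MBA-encoded x + y, as in Listing 1."""
    t1 = (2 * y) & MASK
    return ((x | t1) * 2 - (x ^ t1) - y) & MASK


if __name__ == "__main__":
    print(deobfuscate(protected_handler, sample, learn, simplify))
```

Passing the strategies as arguments mirrors the point made in the text: the sampling and learning components can be swapped independently, which is what makes their choice a design decision worth studying.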

We extend Syntia's evaluation and tackle the following questions left unaddressed by Blazytko et al. [7].

RQ1 Are results stable across different runs? This is desirable due to the stochastic nature of MCTS;

RQ2 Is Syntia fast, robust and does it infer simple and correct results? Syntia offers a priori no guarantee of correctness nor quality. Also, we consider a small time budget (1s), adapted to human-in-the-loop reverse scenarios but absent from the initial evaluation;

RQ3 How is synthesis impacted by the size of the operator set? Syntia learns expressions over a search space fixed by predefined grammars. Intuitively, the more operators in the grammar, the harder it will be to converge to a solution. We use 3 sets of operators to assess this impact.

We distinguish the success rate (number of expressions inferred) from the equivalence rate (number of expressions inferred and equivalent to the original one). The equivalence rate relies on the Z3 SMT solver [17] with a timeout of 10s. Since Z3 timeouts are inconclusive answers, we define a notion of equivalence range: its lower bound is the proven equivalence rate (number of expressions proven to be equivalent) while its upper bound is the optimistic equivalence rate (expressions not proven different, i.e., optimistic = proven + timeouts). We define the quality of an expression as the ratio between the number of operators in recovered and target expressions. It estimates the syntactic complexity of inferred expressions compared to the original ones. A quality of 1 indicates that the recovered expression has the same size as the target one.
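As a rough illustration of these metrics, the sketch below computes the quality ratio and the equivalence range. The verdict labels (`"equivalent"`, `"different"`, `"timeout"`) and the string-based operator count are assumptions made for this example; the paper's tooling works on expression trees and Z3 answers.

```python
from collections import Counter
from typing import Iterable, Tuple

# Assumption: expressions are flat strings using single-character operators.
OPERATORS = frozenset("+-*&|^~")


def size(expr: str) -> int:
    """Syntactic size: number of operator occurrences in the expression."""
    return sum(1 for ch in expr if ch in OPERATORS)


def quality(recovered: str, target: str) -> float:
    """Ratio of operator counts (the target must contain at least one operator)."""
    return size(recovered) / size(target)


def equivalence_range(verdicts: Iterable[str], benchmark_size: int) -> Tuple[float, float]:
    """Lower bound: proven equivalent. Upper bound: proven plus solver timeouts."""
    counts = Counter(verdicts)
    proven = counts["equivalent"]
    optimistic = proven + counts["timeout"]
    return proven / benchmark_size, optimistic / benchmark_size


if __name__ == "__main__":
    print(size("(x | 2*y)*2 - (x ^ 2*y) - y"))
    print(quality("x + y", "a + b"))
    print(equivalence_range(["equivalent"] * 3 + ["timeout", "different"], 10))
```

A quality below 1 means the recovered expression uses fewer operators than the target, which is how the later results on Syntia's output should be read.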

Benchmarks.

We consider two benchmark suites: B1 and B2. B1 comes from Blazytko et al. [7] and was used to evaluate Syntia. It comprises 500 randomly generated expressions with up to 3 arguments, and simple semantics. It aims at representing state-of-the-art VM-based obfuscators. However, we found that B1 suffers from several significant issues: (1) it is not well distributed over the number of inputs and expression types, making it unsuitable for fine-grained analysis; (2) only 216 expressions are unique modulo renaming, the other 284 expressions being $\alpha$-equivalent, like x+y and a+b. These problems threaten the validity of the evaluation.

We thus propose a new benchmark B2 consisting of 1,110 randomly generated expressions, better distributed according to number of inputs and nature of operators (see Appendix A.2 for details). We use three categories of expressions: Boolean, Arithmetic and Mixed Boolean-Arithmetic, with 2 to 6 inputs. Each expression has an Abstract Syntax Tree (AST) of maximal height 3. As a result, B2 is more challenging than B1 and enables a finer-grained evaluation.

Operator sets.

Table 1 introduces three operator sets: Full, Expr and Mba. We use these to evaluate sensitivity to the search space and answer RQ3. Expr is as expressive as Full even if Expr ⊂ Full. Mba can only express Mixed Boolean-Arithmetic expressions [38].

Configuration.

We run all our experiments on a machine with 6 Intel Xeon E-2176M CPUs and 32 GB of RAM. We evaluate Syntia in its original configuration [7]: the SA-UCT constant is 1.5, we use 50 I/O samples and a maximum playout depth of 0. This configuration also limits Syntia to 50,000 iterations per sample, corresponding to a timeout of 60 s per sample on our test machine.

Footnote: https://github.com/RUB-SysSec/syntia/tree/master/samples/mba/tigress

Table 1: Sets of operators

Full : { − (unary), ¬, +, −, ×, ≫u, ≫s, ≪, ∧, ∨, ⊕, ÷s, ÷u, %s, %u, ++ }
Expr : { − (unary), ¬, +, −, ×, ∧, ∨, ⊕, ÷s, ÷u, ++ }
Mba : { − (unary), ¬, +, −, ×, ∧, ∨, ⊕ }

Let us summarize here the outcome of our experiments (see Appendix A.1 for complete results).

RQ1.

Over 15 runs, Syntia finds between 362 and 376 expressions of B1, i.e., a difference of 14 expressions (2.8% of B1). Over B2, it finds between 349 and 383 expressions, i.e., a difference of 34 expressions (3.06% of B2). Hence, Syntia is very stable across executions.

RQ2.

Syntia cannot efficiently infer B2 (≈ 34% success rate). Moreover, Table 2 shows Syntia to be highly sensitive to time budget. More precisely, with a time budget of 1s/expr., Syntia only retrieves 16.3% of B2. Still, even with a timeout of 600 s/expr., it tops at 42% of B2. In addition, Syntia is unable to synthesize expressions with more than 3 inputs: success rates for 4, 5 and 6 inputs respectively fall to 10%, 2.2% and 1.1%. It also struggles over expressions using a mix of Boolean and arithmetic operators, synthesizing only 21%. Still, Syntia performs well regarding quality and correctness. On average, its quality is around 0.60 (for a timeout of 60s/expr.), i.e., resulting expressions are simpler than the original (non obfuscated) ones, and it rarely returns non-equivalent expressions (between 0.5% and 0.8% of B2). We thus conclude that Syntia is stable and returns correct and simple results. Yet, it is not efficient enough (it solves only few expressions on B2 and is heavily impacted by time budget) and not robust (with respect to the number of inputs and the expression's type).

Table 2: Syntia depending on the timeout per expression (B2)

              1s       10s             60s             600s
Succ. Rate    16.5%    25.6%           34.5%           42.3%
Equiv. Range  16.3%    25.1 - 25.3%    33.7 - 34.0%    41.4 - 41.6%
Mean Qual     0.35     0.49            0.59            0.67

RQ3.

Default Syntia synthesizes expressions over the Full set of operators. To evaluate its sensitivity to the search space we run it over Full, Expr and Mba. Smaller sets do exhibit higher success rates (42% on Mba) but results remain disappointing. Syntia is sensitive to the size of the operator set but is inefficient even with Mba.

Conclusion.

Syntia is stable, correct and returns simple results. Yet, it is heavily impacted by the time budget and lacks robustness. It thus fails to meet the requirements given in Section 3.3.

To ensure the conclusions given in Section 4.4 apply to MCTS and not only to Syntia, we study Syntia extensively to find better setups (Appendix A.1) for the following parameters: simulation depth, SA-UCT value (configuring the balance between exploitative and explorative behaviors), number of I/O samples and distance. Optimizing Syntia's parameters slightly improves its results, which remain disappointing (at best, ≈ 50% success rate on Mba in 60 s/expr.).

Conclusion.

By default, Syntia is well configured. Changing its parameters leads in the best scenario to marginal improvement, hence the pitfalls highlighted seem to be inherent to the MCTS approach.

Let us explore whether these issues are related to MCTS.

Monte Carlo Tree Search.

MCTS creates here a search tree where each node is an expression which can be terminal (e.g. $a + 1$, where $a$ is a variable) or partial (e.g. $U + a$, where $U$ is a non-terminal symbol). The goal of MCTS is to expand the search tree smartly, focusing on most pertinent nodes first. Evaluating the pertinence of a terminal node is done by sampling: we compute a distance between the values the node expression takes on the sampled inputs and the expected output values. For partial nodes, MCTS relies on simulation: random rules of the grammar are applied to the expression (e.g., $U + a \leadsto b + a$) until it becomes terminal and is evaluated. As an example, let $\{(a \mapsto a_1, b \mapsto b_1), (a \mapsto a_2, b \mapsto b_2)\}$ be the sampled inputs. The expression $b + a$ (simulated from $U + a$) evaluates them to $(o_1, o_2)$, with $o_i = b_i + a_i$. If the ground-truth outputs are $1$ and $-1$, the distance will equal $\delta(o_1, 1) + \delta(o_2, -1)$, where $\delta$ is a chosen distance function. We call the result the pertinence measure. The closer it is to 0, the more pertinent the node $U + a$ is considered and the more the search will focus on it.

Analysis.

This simulation-based pertinence estimation is not reliable in our code deobfuscation setting.

• We present in Fig. 2, for different non-terminal nodes, the distance values computed through simulations. We observe that from a starting node, a random simulation can return drastically different results. It shows that the search space is very unstable and that relying on a simulation is misleading (especially in our context where time budget is small);

• Moreover, our experiments show that in practice Syntia is not guided by simulations and behaves almost as if it were an enumerative (BFS) search, i.e., MCTS where simulation is non informative. As an example, Fig. 3 compares how the distance evolves over time for Syntia and a custom, fully enumerative, MCTS synthesizer: both are very similar;

• Finally, on B2 with a timeout of 60 s/expr, only 34/341 successfully synthesized expressions are the children of previously most promising nodes. It shows that Syntia successfully synthesized expressions due to its exploratory (i.e., enumerative) behavior rather than to the selection of nodes according to their pertinence.

Conclusion.

The search space from blackbox code deobfuscation is too unstable, making MCTS's simulations unreliable. MCTS in that setting is then almost enumerative and inefficient. That is why Syntia is slow and not robust, but returns simple expressions.
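The instability of simulation-based scoring can be illustrated with a small Python sketch. It is a toy model, not Syntia's MCTS: the grammar, the playout policy and the log-scaled distance are assumptions chosen for brevity. The partial node $U + a$ is scored through several independent random playouts, and the spread of the resulting distances shows how little a single simulation says about the node.

```python
import math
import random

MASK = 0xFFFFFFFF  # assumption: 32-bit words
VARS = ("a", "b")
BINOPS = {
    "+": lambda l, r: (l + r) & MASK,
    "-": lambda l, r: (l - r) & MASK,
    "*": lambda l, r: (l * r) & MASK,
    "&": lambda l, r: l & r,
    "|": lambda l, r: l | r,
    "^": lambda l, r: l ^ r,
}


def random_terminal(rng, depth=2):
    """Random playout: expand the non-terminal U into a terminal expression."""
    if depth == 0 or rng.random() < 0.5:
        return rng.choice(VARS)
    op = rng.choice(sorted(BINOPS))
    return (op, random_terminal(rng, depth - 1), random_terminal(rng, depth - 1))


def evaluate(expr, env):
    """Evaluate a terminal expression tree under a variable assignment."""
    if isinstance(expr, str):
        return env[expr]
    op, left, right = expr
    return BINOPS[op](evaluate(left, env), evaluate(right, env))


def distance(expr, samples):
    """Pertinence measure: sum over samples of an assumed log-scaled delta."""
    return sum(math.log2(1 + abs(evaluate(expr, env) - out)) for env, out in samples)


def make_samples(target, n, seed=0):
    """Sample random inputs and record the ground-truth output of target."""
    rng = random.Random(seed)
    envs = [{"a": rng.getrandbits(32), "b": rng.getrandbits(32)} for _ in range(n)]
    return [(env, target(env["a"], env["b"])) for env in envs]


def simulate_scores(samples, n_playouts, seed=1):
    """Score the partial node U + a through independent random playouts."""
    rng = random.Random(seed)
    return [distance(("+", random_terminal(rng), "a"), samples) for _ in range(n_playouts)]


if __name__ == "__main__":
    io = make_samples(lambda a, b: (a + b) & MASK, n=20)
    scores = simulate_scores(io, n_playouts=20)
    print(min(scores), max(scores))
```

Here the ground truth is $b + a$, so the node $U + a$ is in fact the right one to expand, yet most playouts replace $U$ by something unrelated and report a large distance.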

While Syntia returns simple results, it only synthesizes semantically simple expressions and is slow. These unsatisfactory results can be explained by the fact that the search space is too unstable, making the use of MCTS unsuitable. In the next section, we show that methods avoiding the manipulation of partial expressions (and thus free from simulation) are better suited to deobfuscation.

[Figure 2 plot: logarithmic distance from $(a \land b) \times (b + c)$ (vertical axis) for simulations of non-terminal expressions such as $u \times u$, $u \times (u + u)$ or $(u \oplus u) \times (u + u)$ (horizontal axis), with the mean distance marked.]

Each point represents the distance between $(a \land b) \times (b + c)$ and one simulation of a non-terminal expression (horizontal axis). A non-terminal expression can generate multiple terminal ones through simulations, leading to completely different results.

Figure 2: Dispersion of the distance for different simulations

[Figure 3 plots: logarithmic distance (vertical axis) for Syntia and for the enumerative MCTS.]

Figure 3: Syntia and enumerative MCTS: distance evolution

We define a new AI-based blackbox deobfuscator, dubbed Xyntia, leveraging S-metaheuristics [33] and Iterated Local Search (ILS) [24], and compare its design to rival deobfuscators. Unlike MCTS, S-metaheuristics only manipulate terminal expressions and do not build search trees, thus we expect them to be better suited than MCTS for code deobfuscation. Among S-metaheuristics, ILS is particularly designed for unstable search spaces, with the ability to remember the last best solution encountered and restart the search from that point. We show that these methods are well-guided by the distance function and significantly outperform MCTS in the context of blackbox code deobfuscation.

As presented in Section 4, Syntia frames deobfuscation as a single player game. We instead propose to frame it as an optimization problem using ILS as learning strategy.

Blackbox deobfuscation: an optimization problem. Blackbox deobfuscation synthesizes an expression from input-output samples and can be modeled as an optimization problem. The objective function, noted f, measures the similarity between current and ground-truth behaviors by computing the sum of the distances between found and objective outputs. The goal is to infer an expression minimizing the objective function over the I/O samples. If the underlying grammar is expressive enough, a minimum exists and matches all sampled inputs to objective outputs, zeroing f. The reliability of the found solution depends on the number of I/O samples considered: too few samples would not constrain the search enough and would lead to flawed results.

Solving through search heuristics. S-metaheuristics [33] can be advantageously used to solve such optimization problems. A wide range of heuristics exists (Hill Climbing, Random Walk, Simulated Annealing, etc.). They all iteratively improve a candidate solution by testing its "neighbors" and moving along the search space. Because solution improvement is evaluated by the objective function, the objective function is said to guide the search.
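To make the notion of an objective function over I/O samples concrete, here is a minimal Python sketch. It uses the logarithmic distance that the paper adopts for its default configuration; the base-2 logarithm, the function name `objective` and the toy oracle (a handler computing x + y) are assumptions made for illustration, not details taken from Xyntia's implementation.

```python
import math

def objective(candidate, samples):
    """Sum over all I/O samples of log2(1 + |found output - expected output|)."""
    return sum(math.log2(1 + abs(candidate(*inputs) - expected))
               for inputs, expected in samples)

# Hypothetical oracle output: the code under analysis is assumed to compute x + y.
samples = [((x, y), x + y) for x in range(-3, 4) for y in range(-3, 4)]

exact = objective(lambda x, y: x + y, samples)      # matches every sample
close = objective(lambda x, y: x + y + 1, samples)  # off by one everywhere
far = objective(lambda x, y: x * y, samples)        # different behavior
```

A candidate that matches every sample zeroes the objective, and candidates that are further from the observed behavior score higher, which is the signal a search heuristic can follow.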

Iterated Local Search. Some S-metaheuristics are prone to getting stuck in local optima, so that the result depends on the initial input chosen. Iterated Local Search (ILS) [24] tackles the problem through iteration of search and the ability to restart from previously seen best solutions. Note that ILS is parameterized by another search heuristic (for us: Hill Climbing). Once a local optimum is found by this side search, ILS perturbs it and uses the perturbed solution as the initial state for the side search. At each iteration, ILS also saves the best solution found. Unlike most other S-metaheuristics (Hill Climbing, Random Walk, Metropolis-Hastings, Simulated Annealing, etc.), if the search follows a misleading path, ILS can restore the best solution seen so far and restart from a healthy state.

Xyntia is built upon 3 components: the optimization problem we aim to solve, the oracle which extracts the sampling information from the protected code under analysis, and the search heuristics.

Oracle. The oracle is defined by the sampling strategy, which specifies how the protected program must be sampled and how many samples are considered. By default, our oracle samples 100 inputs over the range [−50; 49]. Five are not randomly generated but equal interesting constant vectors ($\vec{0}$, $\vec{1}$, $\vec{-1}$, $\vec{min_s}$, $\vec{max_s}$). These choices arise from a systematic study of the different settings to find the best design (see Appendix A.2.2).

Optimization problem. The optimization problem is defined as follows. The search space is the set of expressions expressible using the Expr set of operators (see Table 1), and considers a unique constant value 1. This grammar enables Xyntia to reach optimal results while being as expressive as rival tools like Syntia [7]. Besides, we consider the objective function:

$$f_{\vec{o}^*}(\vec{o}) = \sum_i \log(1 + |o_i - o_i^*|)$$

It computes the logarithmic distance between the synthesized expression's outputs ($\vec{o}$) and the sampled ones ($\vec{o}^*$). The choices of the grammar and of the objective function are respectively discussed in Section 5.3 and Appendix A.2.2.

Search. Xyntia leverages Iterated Local Search (ILS) to minimize our objective function and so to synthesize target expressions. We now present how ILS is adapted to our context. ILS applies two steps, starting from a random terminal (a constant or a variable):
• ILS reuses the best expression found so far and perturbs it by randomly selecting a node of the AST and replacing it by a random terminal node. The resulting AST is kept even if the distance increases, and is passed to the next step.
• Iterative Random Mutations: the side search (in our case Hill Climbing) iteratively mutates the input expression until it cannot improve anymore. We estimate that no more improvement can be done after 100 inconclusive mutations. A mutation (see Fig. 4) consists in replacing a randomly chosen node of the abstract syntax tree (AST) by a leaf or an AST of depth one (only one operator). At each mutation, it keeps the version of the AST minimizing the distance function. During mutations, the best solution so far is updated so that it can be restored in the perturbation step. If a solution nullifies the objective function, it is directly returned.

These two operations are iteratively performed until time is out (by default 60 s) or an expression mapping all I/O samples is found. Furthermore, as Syntia applies Z3's simplifier to "clean up" recovered expressions, we add a custom post-process expression simplifier, applying simple rewrite rules until a fixpoint is reached. Appendix A.2.2 compares Xyntia with and without simplification. Xyntia is implemented in OCaml [23], within the BINSEC framework for binary-level program analysis [15]. It comprises ≈ 9k lines of code.
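The two ILS steps described above (perturbation, then hill climbing by random mutations with a budget of 100 inconclusive mutations) can be sketched in a few lines of Python. This is an illustrative reconstruction, not Xyntia's OCaml code: the tuple-based AST, the operator subset, the base-2 logarithm, the `rounds` budget (used instead of the 60 s timeout) and all function names are assumptions.

```python
import math
import random

# Assumed operator subset; each AST node is (op, left, right), a variable name, or the constant 1.
OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b, "*": lambda a, b: a * b,
       "&": lambda a, b: a & b, "|": lambda a, b: a | b, "^": lambda a, b: a ^ b}

def evaluate(e, env):
    if isinstance(e, tuple):
        op, left, right = e
        return OPS[op](evaluate(left, env), evaluate(right, env))
    return env[e] if isinstance(e, str) else e

def objective(e, samples):
    return sum(math.log2(1 + abs(evaluate(e, env) - out)) for env, out in samples)

def random_terminal(rng, variables):
    return rng.choice(variables + [1])

def random_small_ast(rng, variables):
    # A leaf, or an AST of depth one (a single operator).
    if rng.random() < 0.5:
        return random_terminal(rng, variables)
    return (rng.choice(list(OPS)), random_terminal(rng, variables),
            random_terminal(rng, variables))

def paths(e, prefix=()):
    # Enumerate the position of every node in the AST.
    yield prefix
    if isinstance(e, tuple):
        yield from paths(e[1], prefix + (1,))
        yield from paths(e[2], prefix + (2,))

def replace_at(e, path, new):
    if not path:
        return new
    parts = list(e)
    parts[path[0]] = replace_at(e[path[0]], path[1:], new)
    return tuple(parts)

def mutate(rng, e, new):
    # Replace a randomly chosen node by `new`.
    return replace_at(e, rng.choice(list(paths(e))), new)

def hill_climb(rng, e, samples, variables, patience=100):
    score, stale = objective(e, samples), 0
    while stale < patience and score > 0:
        cand = mutate(rng, e, random_small_ast(rng, variables))
        cand_score = objective(cand, samples)
        if cand_score < score:
            e, score, stale = cand, cand_score, 0
        else:
            stale += 1
    return e, score

def ils(samples, variables, rounds=50, seed=0):
    rng = random.Random(seed)
    best = random_terminal(rng, variables)
    best_score = objective(best, samples)
    for _ in range(rounds):
        if best_score == 0:
            break
        # Perturbation always restarts from the best expression seen so far.
        perturbed = mutate(rng, best, random_terminal(rng, variables))
        e, score = hill_climb(rng, perturbed, samples, variables)
        if score < best_score:
            best, best_score = e, score
    return best, best_score

# Toy target: an MBA-style encoding of x + y, observed only through I/O samples.
samples = [({"x": x, "y": y}, (x ^ y) + 2 * (x & y))
           for x in range(-4, 5) for y in range(-4, 5)]
expr, score = ils(samples, ["x", "y"])
```

Because a round that ends worse than the current best is simply discarded, the best score never increases from one round to the next, which is the property the paper credits for ILS's robustness on unstable search spaces.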

[Figure 4: Random mutation example. A randomly selected node of the expression is replaced, yielding the mutated expression (−b) + (−a).]

We now evaluate Xyntia in depth and compare it to Syntia. As with Syntia, we answer the following questions:

RQ4 Are results stable across different runs?

RQ5 Is Xyntia robust, fast, and does it infer simple and correct results?

RQ6 How is synthesis impacted by the size of the set of operators?

Configuration. For all our experiments, we default to the locally optimal Xyntia (XyntiaOpt) presented in Section 5.2. It learns expressions over Expr, samples 100 inputs (95 randomly and 5 constant vectors) and uses the logarithmic distance as objective function.
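The default sampling strategy (100 input vectors, 5 of them constant, the rest drawn from [−50; 49]) can be sketched as follows. The function names, the 32-bit width, and the reading of min_s and max_s as the smallest and largest signed values of that width are assumptions made for this example.

```python
import random

def sample_inputs(n_vars, n_samples=100, lo=-50, hi=49, bits=32, seed=0):
    """Return n_samples input vectors: 5 constant vectors, then random ones in [lo, hi]."""
    rng = random.Random(seed)
    # Assumption: min_s / max_s are the extreme signed values for the chosen bit width.
    min_s, max_s = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    vectors = [tuple([c] * n_vars) for c in (0, 1, -1, min_s, max_s)]
    while len(vectors) < n_samples:
        vectors.append(tuple(rng.randint(lo, hi) for _ in range(n_vars)))
    return vectors

def query_oracle(blackbox, inputs):
    """Run the code under analysis on each input vector and record the outputs."""
    return [(v, blackbox(*v)) for v in inputs]

inputs = sample_inputs(n_vars=2)
io_samples = query_oracle(lambda x, y: x - y, inputs)  # hypothetical handler
```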

Interestingly, all results reported here also hold (to a lesser extent regarding efficiency) for other Xyntia configurations (Section 5.4); in particular, these versions consistently beat Syntia.

RQ4. Over 15 runs, Xyntia always finds all 500 expressions in B1 and between 1051 and 1061 in B2. The difference between the best and the worst case is only 10 expressions (0.9% of B2). Thus, Xyntia is very stable across executions.

RQ5. Unlike Syntia, Xyntia performs well on both B1 and B2 with a timeout of 60 s/expr. Fig. 5 reveals that it is still successful for a timeout of 1 s/expr. (78% proven equivalence rate). Moreover, for a timeout of 600 s/expr. (10 min), Syntia finds 2× fewer expressions than Xyntia with a 1 s/expr. time budget. In addition, Xyntia handles well expressions using up to 5 arguments and all expression types. Its mean quality is around 0.93, which is very good (objective is 1), and it rarely returns non-equivalent expressions: only between 1.3% and 4.9%. Thus, Xyntia reaches high success and equivalence rates. It is fast, synthesizing most expressions in ≤ 1 s, and it returns simple and correct results.

[Figure 5: Equivalence range of Syntia and Xyntia (XyntiaOpt) depending on timeout (B2). Horizontal axis: timeout (s/expression); vertical axis: equivalence rate (%); series: Xyntia Proven, Xyntia Optimistic, Syntia Proven, Syntia Optimistic.]

RQ6. Xyntia by default synthesizes expressions over Expr while Syntia infers expressions over Full. To compare their sensitivity to the search space and show that previous results were not due to search space inconsistency, we run Xyntia over Full, Expr and Mba and compare it to Syntia. Experiments show that Xyntia reaches high equivalence rates for all operator sets while Syntia's results stay low. Still, Xyntia seems more sensitive to the size of the set of operators than Syntia. Its proven equivalence rate decreases from 90% (Expr) to 71% (Full) while Syntia's decreases only from 38.7% (Expr) to 33.7% (Full). Conversely, as for Syntia, restricting to Mba benefits Xyntia. Thus, like Syntia, Xyntia is sensitive to the size of the operator set. Yet, Xyntia reaches high equivalence rates even on Full while Syntia remains inefficient even on Mba.

Conclusion. Xyntia is a lot faster and more robust than Syntia. It is also stable and returns simple expressions. Thus, Xyntia, unlike Syntia, meets the requirements given in Section 3.3.

Previous experiments consider the XyntiaOpt configuration of Xyntia. It comes from a systematic evaluation of the design space (Appendix A.2.2). To do so, we considered (1) different S-metaheuristics (Hill Climbing, Random Walk, Simulated Annealing, Metropolis-Hastings and Iterated Local Search); (2) different sampling strategies; (3) different objective functions. This evaluation confirms that XyntiaOpt is locally optimal and that ILS, being able to restore the best expression seen after a number of unsuccessful mutations, outperforms other S-metaheuristics. Moreover, all S-metaheuristics except Hill Climbing outperform Syntia. This confirms that estimating the pertinence of non-terminal expressions through simulations, as MCTS does, is not suitable for deobfuscation (Section 4.4). It is far more relevant to manipulate terminal expressions only, as S-metaheuristics do.

Conclusion. A principled and systematic evaluation of Xyntia's design space leads to the locally optimal XyntiaOpt configuration. It notably shows that ILS outperforms the other tested S-metaheuristics. Moreover, all these S-metaheuristics except Hill Climbing outperform MCTS, confirming that manipulating only terminal expressions is beneficial.

Unlike MCTS, ILS does not generate a search tree and only manipulates terminal expressions. As such, no simulation is performed and the distance function guides the search well. Indeed, as Fig. 6 shows, the distance follows a step-wise progression. The distance evolution is drastically different from that of Syntia and enumerative MCTS (Fig. 3). This confirms that, unlike them, Xyntia is guided by the distance function. This enables Xyntia to synthesize deeper expressions that would be out of reach for enumerative search. Moreover, note that Xyntia globally follows a positive trend, i.e., it does not unlearn previous work. Indeed, before each perturbation, the best expression found so far is restored. Thus, if iterative mutations follow a misleading path, the resulting solution is not kept and the best solution is reused to be perturbed. Keeping the current best solution is of prime importance as the search space is highly unstable, and it makes Xyntia more reliable and less dependent on randomness.

[Figure 6: Xyntia (XyntiaOpt): distance evolution. Vertical axis: logarithmic distance.]

Conclusion. Unlike MCTS, which is almost enumerative in code deobfuscation, ILS is well guided by the objective function and the distance evolution throughout the synthesis follows a positive trend, hence the difference in performance. Moreover, this is true as well for other S-metaheuristics, which appear to be much better suited to code deobfuscation than MCTS.

Blackbox methods rely on two main steps, sampling and learning, which both show weaknesses. Indeed, Xyntia and Syntia randomly sample inputs to approximate the semantics of an expression. They then assume that the samples depict all behaviors of the code under analysis. If this assumption is invalid, the learning phase will miss some behaviors, returning partial results. As such, blackbox deobfuscation is not appropriate to handle points-to functions.

Learning can itself be impacted by other factors. For instance, learning expressions with unexpected constant values is hard. Indeed, the grammar of Xyntia and Syntia only considers the constant value 1. Thus, finding expressions with constant values absent from the grammar requires creating them (e.g., encoding 3 as 1 + 1 + 1) […]

Because of the high instability of the search space, Iterated Local Search is much more appropriate than MCTS (and, to a lesser extent, than other S-metaheuristics) for blackbox code deobfuscation, as it manipulates terminal expressions only and is able to restore the best solution seen so far in case the search gets lost. These features enable Xyntia to keep the advantages of Syntia (stability, output quality) while clearly improving over its weaknesses: in particular, with a 1 s timeout Xyntia synthesizes twice as many expressions as Syntia with a 10 min timeout.

Other S-metaheuristics also perform significantly better than MCTS here, demonstrating that the problem itself is not well suited to partial solution exploration and simulation-guided search.

We now extend the comparison to other state-of-the-art tools: (1) a greybox deobfuscator (QSynth [16]); (2) whitebox simplifiers (GCC, Z3 simplifier and our custom simplifier); (3) program synthesizers (CVC4 [6], winner of the SyGus'19 syntax-guided synthesis competition [2], and STOKE [30], an efficient superoptimizer). Unlike blackbox approaches, greybox and whitebox methods should be evaluated on the enhancement rate. Indeed, these methods can always succeed by returning the obfuscated expression without simplification. The enhancement rate measures how often synthesized expressions are smaller than the original (quality ≤ 1).

Xyntia and QSynth learn expressions over distinct grammars: Expr and Mba respectively. Moreover, QSynth is unfortunately not available, whether in source or executable form, so we could neither adapt it nor reproduce its experiments. In the end, we could only compare with it over Mba, using the results reported by David et al. [16].

Benchmarks. We compare blackbox program synthesizers on B2 and grey/white box approaches on QSynth's datasets, which are available for extended comparison. Thus, we consider the 3 datasets from David et al. [16] of obfuscated expressions using Tigress [11]: EA (base dataset, obfuscated with the EncodeArithmetic transformation), VR-EA (EA obfuscated with the Virtualize and EncodeArithmetic protections), and EA-ED (EA obfuscated with the EncodeArithmetic and EncodeData transformations).

Greybox. We compare Xyntia to QSynth's published results [16] on EA, VR-EA and EA-ED. Fig. 7a shows that while both tools reach comparable results (enhancement rate ≈ […]) […] heavy obfuscations (EA-ED) while QSynth drops to 133/500. Indeed, Xyntia is insensitive to syntactic complexity while QSynth is not.

Footnote: https://github.com/werew/qsynth-artifacts

[Figure 7: Syntia, QSynth, Xyntia, CVC4 and STOKE on EA, VR-EA and EA-ED datasets (timeout = 60 s). (a) Enhancement rate. (b) Mean synthesis time per expression, in seconds; STOKE-opti is not shown as it always uses 60 s. Series: Xyntia-MBA, Syntia-MBA, QSynth, CVC4-MBA, STOKE-synth, STOKE-opti.]

Whitebox. We compare Xyntia over the EA, VR-EA and EA-ED datasets with 3 whitebox approaches: GCC, Z3 simplifier (v4.8.7) and our custom simplifier. As expected, they are not efficient compared to Xyntia (Appendix A.3.1). Regardless of the obfuscation, they simplify ≤ 68 expressions where Xyntia simplifies 360 of them.

Program synthesizers. We now compare Xyntia to state-of-the-art program synthesizers, namely CVC4 [6] and STOKE [30]. CVC4 takes as input a grammar and a specification and returns, through enumerative search, a consistent expression. STOKE is a super-optimizer leveraging program synthesis (based on Metropolis-Hastings) to infer optimized code snippets. It does not return an expression but optimized assembly code. STOKE addresses the optimization problem in two ways: (1) STOKE-synth starts from a pre-defined number of nops and mutates them; (2) STOKE-opti starts from the non-optimized code and mutates it to simplify it. While STOKE integrates its own sampling strategy and grammar, CVC4 does not. Thus, we consider for CVC4 the same sampling strategy as Xyntia (100 I/O samples with 5 constant vectors) as well as the Expr and Mba grammars. More precisely, CVC4-Expr is used over B2 to compare to Xyntia (XyntiaOpt) and CVC4-Mba is evaluated on EA, VR-EA and EA-ED to compare against QSynth.

Table 3: Program synthesizers on B2

                 CVC4-Expr       STOKE-synth
Success Rate     36.8%           38.0%
Equiv. Range     29.3 - 36.8%    38.0%
Mean Qual.       0.56            0.91

Table 3 shows that CVC4-Expr and STOKE-synth fail to synthesize more than 40% of B2 while Xyntia reaches a 90.6% proven equivalence rate. Indeed, enumerative search (CVC4) is less appropriate when time is limited. The results of STOKE-synth are also expected, as its search space considers all assembly mnemonics. Moreover, Fig. 7a shows that blackbox and whitebox (STOKE-opti) synthesizers do not efficiently simplify obfuscated expressions. STOKE-opti finds only 1/500 expressions over EA-ED and does not handle jump instructions, inserted by the VM, failing to analyze VR-EA.

Xyntia rivals QSynth on light and mild protections and outperforms it on heavy protections, while pure whitebox approaches are far behind, showing the benefits of being independent from syntactic complexity. Also, Xyntia outperforms state-of-the-art program synthesizers, showing that it is better suited to deobfuscation. These good results show that seeing deobfuscation as an optimization problem is fruitful.

We now show that Xyntia is insensitive to common protections (opaque predicates) as well as to recent anti-analysis protections (MBA, covert channels, path explosion), and we confirm that blackbox methods can help reverse state-of-the-art virtualization [11, 35].

Xyntia is able to bypass many protections (Table 4).

Mixed Boolean-Arithmetic [38] hides the original semantics of an expression both from humans and SMT solvers. However, the encoded expression remains equivalent to the original one. As such, the semantic complexity stays unchanged, and Xyntia should not be impacted. Launching Xyntia on B2 obfuscated with the Tigress [11] EncodeArithmetic transformation (size of expression: x800) confirms that it has no impact.
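The reason MBA encoding leaves a blackbox attacker unaffected can be illustrated with a small Python sketch. The identity used here, x + y = (x ⊕ y) + 2·(x ∧ y), is a standard MBA rewriting chosen for illustration; it is not claimed to be the one Tigress emits, and the 32-bit mask and function names are assumptions.

```python
MASK = 0xFFFFFFFF  # assumed 32-bit machine words

def original(x, y):
    return (x + y) & MASK

def mba_encoded(x, y):
    # Syntactically different, semantically identical: x + y == (x ^ y) + 2 * (x & y)
    return ((x ^ y) + 2 * (x & y)) & MASK

# An I/O oracle observes exactly the same samples for both versions.
inputs = [(x, y) for x in range(-50, 50) for y in range(-50, 50)]
same_io = all(original(x, y) == mba_encoded(x, y) for x, y in inputs)
```

Since the sampled outputs are identical, the objective function and therefore the whole search are the same for the obfuscated and the original expression.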

Opaque predicates [14] obfuscate control flow by creating artificial conditions in programs. The conditions are traditionally tautologies, and dynamic runs of the code will follow a unique path. Thus, sampling is not affected and synthesis is not impacted. We show it by launching Xyntia over B2 obfuscated with the Tigress AddOpaque transformation.
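As a minimal illustration of why tautological opaque predicates do not disturb sampling, consider the following Python sketch. The predicate x·(x+1) mod 2 = 0 is a textbook always-true condition picked for the example; it is an assumption and not necessarily what AddOpaque generates.

```python
def original(x, y):
    return x - y

def with_opaque_predicate(x, y):
    # x * (x + 1) is a product of two consecutive integers, hence always even.
    if (x * (x + 1)) % 2 == 0:
        return x - y
    return x * y + 7  # dead branch, never executed

inputs = [(x, y) for x in range(-50, 50) for y in range(-50, 50)]
unchanged = all(original(x, y) == with_opaque_predicate(x, y) for x, y in inputs)
```

Every execution follows the same path, so the I/O samples collected by the oracle are those of the unprotected code.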

Path-based obfuscation [26, 36] takes advantage of the path explosion problem to thwart symbolic execution, massively adding feasible paths through dedicated encodings. We show that it has no effect by protecting B2 with a custom encoding inspired by [26] (Appendix A.4.1 gives an example of our encoding).

Covert channels [32] hide information flow from static analyzers by rerouting data to invisible states (usually OS related) before retrieving it, for example taking advantage of the timing difference between a slow thread and a fast thread to infer the result of some computation with great accuracy. Again, as blackbox deobfuscation focuses only on input-output relationships, covert channels should not disturb it. Note that the probabilistic nature of such obfuscations (obfuscated behaviours can differ from unobfuscated ones from time to time) could be a problem in case of high fault probabilities, but for the technique to be useful, the fault probability must precisely remain low. We obfuscate B2 with the InitEntropy and InitImplicitFlow (thread kind) transformations of Tigress [11]. Table 4 indeed shows the absence of impact: the "fault" probability being so low, it does not affect sampling.

Table 4: Xyntia (XyntiaOpt) against usual protections (B2, timeout = 60 s)

                 ∅               MBA             Opaque          Path oriented   Covert channels
Succ. Rate       95.5%           95.4%           94.68%          95.4%           95.1%
Equiv. Range     90.6 - 94.2%    90.0 - 93.8%    89.9 - 93.0%    89.5 - 93.7%    89.0 - 94.0%
Mean Qual.       0.92            0.95            0.90            0.94            0.89

Conclusion. State-of-the-art protections are not effective against blackbox deobfuscation. They prevent efficient reading of the code and tracing of data, but blackbox methods directly execute it.

We now use Xyntia to reverse code obfuscated with state-of-the-art virtualization. We obfuscate a program computing MBA operations with Tigress [11] and VMProtect [35], and our goal is to reverse the VM handlers. Using such a synthetic program makes it possible to expose a wide variety of handlers.

Table 5: Xyntia and Syntia results over program obfuscated with Tigress [11] and VMProtect [35]

                 Tigress (simple)   Tigress (hard)   VMProtect
Binary size      40KB               251KB            615KB

Tigress [11] is a source-to-source obfuscator. Our obfuscated program contains 13 handlers. Since at assembly level each handler ends with an indirect jump to the next handler to execute, we were able to extract the positions of handlers using execution traces. We then used the scripts from [7] to sample each handler. Xyntia synthesizes 12/13 handlers in less than 7 s each. We can classify them in different categories: (1) arithmetic and Boolean (+, −, ×, ∧, ∨, ⊕); (2) stack (store and load); (3) control flow (goto and return); (4) calling convention (retrieve the obfuscated function's arguments). These results show that Xyntia can synthesize a wide variety of handlers. Interestingly, while these handlers contain many constant values (typically, offsets for context update), Xyntia can handle them as well. In particular, it infers the calling-convention-related handler, synthesizing constant values up to 28 (to access the 6th argument). Thus, even if Xyntia is inherently limited on constant values (see Section 5.6), it still handles them to a limited extent. Repeating the experiment by adding EncodeData and EncodeArithmetic to Virtualize yields similar results. Xyntia synthesizes all 17 exposed handlers but one, confirming that Xyntia handles combinations of protections. Finally, note that Syntia fails to synthesize handlers completely (not handling constant values). Still, it infers arithmetic and Boolean handlers (without context updates).

VMProtect [35] is an assembly-to-assembly obfuscator. We use the latest premium version (v3.5.0). As each VM handler ends with a ret or an indirect jump, we easily extracted each distinct handler from execution traces. Our traces expose 114 distinct handlers containing on average 43 instructions (Table 5). VMProtect's VM is stack based. To infer the semantics of each handler, we again used Blazytko's scripts [7] in "memory mode" (i.e., forbidding registers to be seen as inputs or outputs). Our experiments show that each arithmetic and Boolean handler (add, mul, nor, nand) is replicated 11 times to fake a large number of distinct handlers. Moreover, we are also able to extract the semantics of some stack-related handlers. In the end, we successfully infer the semantics of 44 arithmetic or Boolean handlers and 32 stack-related handlers. Synthesis took at most 0.3 s per handler. Syntia obtains the same results as Xyntia.

Conclusion. Xyntia synthesizes most of Tigress' VM handlers (including interesting constant values) and extracts the semantics of VMProtect's arithmetic and Boolean handlers. This shows that blackbox deobfuscation can be highly effective, making the need for efficient protections clear.

We now study defense mechanisms against blackbox deobfuscation.

Recall that blackbox methods require the reverser to locate a suitable reverse window delimiting the code of interest with its inputs and outputs. This can be done manually or automatically [7]; still, this step is mandatory and not trivial. The defender could target this step, reusing standard obfuscation techniques.

Still, there is a risk that the attacker finds the right windows. Hence we are looking for a more radical protection against blackbox attacks. We suppose that the reverse window, inputs and outputs are correctly identified, and we seek to protect a given piece of code.

Note that adding extra fake inputs (not influencing the result) is easily circumvented in a blackbox setting, by dynamically testing different values for each input and filtering out inputs where no difference is observed.
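The filtering of fake inputs mentioned above can be sketched as a simple dynamic test: change one input at a time and keep only the inputs for which the output is ever observed to change. The function name `relevant_inputs`, the number of trials and the sampling range are assumptions made for this illustration.

```python
import random

def relevant_inputs(blackbox, n_inputs, trials=50, lo=-50, hi=49, seed=0):
    """Return the indices of inputs whose value was observed to influence the output."""
    rng = random.Random(seed)
    relevant = set()
    for _ in range(trials):
        base = [rng.randint(lo, hi) for _ in range(n_inputs)]
        reference = blackbox(*base)
        for i in range(n_inputs):
            if i in relevant:
                continue
            changed = list(base)
            changed[i] = rng.randint(lo, hi)  # vary a single input
            if blackbox(*changed) != reference:
                relevant.add(i)
    return sorted(relevant)

# Hypothetical protected handler: the last two inputs are fake.
def handler(x, y, fake1, fake2):
    return x + 3 * y

kept = relevant_inputs(handler, 4)
```

Such a test can only under-approximate the set of relevant inputs (an input that matters only on rare values may be missed), which is one more reason the quality of sampling matters.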

Protection rationale. Even with correctly delimited windows, synthesis can still be thwarted. Recall that blackbox methods rely on 2 main steps, (1) I/O sampling and (2) learning from samples, and both can be sabotaged.
• First, if the sampling phase is not performed properly, the learner could miss important behaviors of the code, returning incomplete or even misleading information;
• Second, if the expression under analysis is too complex, the learner will fail to map inputs to their outputs.
In both cases, no information is retrieved. Hence, the key to impeding blackbox deobfuscation is to migrate from syntactic complexity to semantic complexity. We propose in Sections 8.2 and 8.3 two novel protections impeding the sampling and learning phases.

Blackbox approaches are sensitive to semantic complexity. As such, relying on a set of complex handlers is an effective strategy to thwart synthesis. These complex handlers can then be combined to recover standard operations. We propose a method to generate arbitrarily complex handlers in terms of size and number of inputs.

Complex semantic handlers. Let $S$ be a set of expressions and $h, e_1, ..., e_{n-1}$ be $n$ expressions in $S$. Suppose that $(S, \star)$ is a group. Then $h$ can be encoded as $h = \bigstar_{i=0}^{n-1} h_i$, where for all $i$ with $0 \le i \le n-1$,

$$h_i = \begin{cases} h - e_1 & \text{if } i = 0 \\ e_i - e_{i+1} & \text{if } 1 \le i < n-1 \\ e_{n-1} & \text{if } i = n-1 \end{cases}$$

Each $h_i$ is a new handler that can be combined with others to express common operations (see Table 6 for an example). Note that the choice of $(e_1, ..., e_{n-1})$ is arbitrary. One can choose very complex expressions with as many arguments as wanted.

Table 6: Examples of encoding

h_0 = (x + y) + −((a − x) − (xy))
h_1 = (a − x) − xy + (−(y − (a ∧ x)) × (y ⊗ x))
h_2 = (y − (a ∧ x)) × (y ⊗ x)
h = h_0 + h_1 + h_2 = x + y

Experimental design. To evaluate Syntia and Xyntia against our new encoding, we created 3 datasets, BP1, BP2 and BP3, listed by increasing order of complexity. Each dataset contains 15 handlers which can be combined to encode the +, −, ×, ∧ and ∨ operators. Within a dataset, all handlers have the same number of inputs. Table 7 reports details on each dataset; more details are available in Appendix A.5. The mean overhead column is an estimation of the complexity added to the code, obtained by averaging the number of operators needed to encode a single basic operator (+, −, ×, ∨, ∧). Overheads in BP1 (21x), BP2 (39x) and even BP3 (258x) are reasonable compared to some syntactical obfuscations: encoding x + y with MBA three times in Tigress yields an 800x overhead.

Evaluation. Results (Fig. 8) show that while Xyntia (with 1 h/expr.) manages low complexity handlers well (BP1: 13/15), its performance degrades quickly as complexity increases (BP2: 3/15, BP3: 1/15). Syntia, CVC4 and STOKE-synth find none with 1 h/expr., even on BP1 (Appendix A.5).
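The group-based encoding of a handler h into handlers h_0, ..., h_{n-1} can be sketched in Python over the group of 32-bit integers with addition. The choice of that group, the helper names and the auxiliary expressions e1 and e2 (loosely modeled on Table 6, with XOR standing in for the table's ⊗ operator) are assumptions made for this illustration.

```python
MASK = 0xFFFFFFFF  # work in (Z / 2^32 Z, +), an assumed choice of group

def split_handler(h, es):
    """Split h into handlers h_0, ..., h_{n-1} whose sum equals h (telescoping sum)."""
    if not es:
        return [lambda *a: h(*a) & MASK]
    handlers = [lambda *a: (h(*a) - es[0](*a)) & MASK]                      # h_0 = h - e_1
    for i in range(len(es) - 1):
        handlers.append(lambda *a, i=i: (es[i](*a) - es[i + 1](*a)) & MASK)  # e_i - e_{i+1}
    handlers.append(lambda *a: es[-1](*a) & MASK)                           # h_{n-1} = e_{n-1}
    return handlers

def recombine(handlers, *args):
    return sum(hd(*args) for hd in handlers) & MASK

# Target handler and hypothetical auxiliary expressions.
h = lambda a, x, y: x + y
e1 = lambda a, x, y: (a - x) - x * y
e2 = lambda a, x, y: (y - (a & x)) * (y ^ x)

handlers = split_handler(h, [e1, e2])
points = [(a, x, y) for a in range(-5, 6) for x in range(-5, 6) for y in range(-5, 6)]
roundtrip_ok = all(recombine(handlers, *p) == (h(*p) & MASK) for p in points)
```

Each individual handler computes a semantically complex function of three inputs, while their combination still yields the simple x + y that the protected program needs.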

Table 7: Protected datasets

Conclusion. Semantically complex handlers are efficient against blackbox deobfuscation. While high complexity handlers come with a cost similar to strong MBA encodings, medium complexity handlers offer a strong protection at a reasonable cost.

Discussion. Our protection can be bypassed if the attacker focuses on the right combinations of handlers, rather than on the handlers themselves. To prevent this, complex handlers can be duplicated (as in VMProtect, see Section 7.2) to make pattern recognition more challenging.

[Figure 8: Xyntia (XyntiaOpt) on BP1, 2, 3 with varying timeouts. Vertical axis: number of equivalent expressions.]

We now propose another protection, based on conditional expressions and the merging of existing handlers. While block merging has long been known as a protection against human reversers, we show that it is extremely efficient against blackbox attacks. Note that while we write our merged handlers with explicit if-then-else operators (ITE) for simplicity, these conditions are not necessarily implemented with conditional branching (cf. Fig. 9 for an example of branchless encoding). Hence, we consider that the attacker sees merged handlers as a unique code fragment.

Datasets. We introduce 5 datasets (see Appendix A.5.2), each composed of 20 expressions. Expressions in dataset 1 are built with 1 if-then-else (ITE) exposing 2 basic handlers (among +, −, ×, ∧, ∨, ⊕); expressions in dataset 2 are built with 2 nested ITEs exposing 3 basic handlers, etc. Conditions are equality checks against consecutive constant values (0, 1, 2, etc.). For example, dataset 2 contains the expression:

ITE(z = 0, x + y, ITE(z = 1, x − y, x × y))    (2)

Scenarios. Adding conditionals brings extra challenges: (1) the grammar must be expressive enough to handle conditions; (2) the sampling phase must be efficient enough to cover all possible behaviors. Thus, we consider different scenarios:

Utopian. The synthesizer learns expressions over the Mba set of operators, extended with an ITE(★ = 0, ★, ★) operator (Mba+ITE operator set). Moreover, the sampling is done so that all branches are traversed the same number of times. This situation, favoring the attacker, will show that merged handlers are always efficient.

Mba + ITE. This situation is more realistic: the attacker does not know at first glance how to sample. However, the grammar fits perfectly the expressions to reverse.

Mba + Shifts. Here Xyntia does not sample inputs uniformly over the different behaviors and does not consider ITE operators, but allows shifts to represent branchless conditions.

Default. This is the default version of the synthesizer.

In all these scenarios, appropriate constant values are added to the grammar. For example, to synthesize Eq. (2), 0 and 1 are added.

Footnote: Available at: will be made available.

    #include <stdint.h>

    /* Branchless equivalent of: if (c == cst) then h1(a,b,c) else h2(a,b,c);
       cst, h1 and h2 are defined elsewhere. */
    int32_t h(int32_t a, int32_t b, int32_t c) {
        int32_t res = c - cst;               /* 0 iff c == cst */
        int32_t s = res >> 31;               /* sign mask: 0 or -1 */
        res = (-((res ^ s) - s) >> 31) & 1;  /* (res ^ s) - s is |res|; res becomes 1 iff c != cst */
        return h1(a, b, c) * (1 - res) + res * h2(a, b, c);
    }

Figure 9: Example of a branch-less condition

[Figure 10: Merged handlers: Xyntia (timeout = 60 s). Horizontal axis: ITE depth; vertical axis: number of equivalent expressions; series: Xyntia Utopian, Xyntia MBA+ITE, Xyntia MBA+Shifts, Xyntia XyntiaOpt.]

Evaluation. Fig. 10 presents Xyntia's results on the 5 datasets. As expected, the Utopian scenario is where Xyntia does best; still, it cannot cope with more than 3 nested ITEs. For realistic scenarios, Xyntia suffers even more. Results for Syntia, CVC4 and STOKE-synth (see Appendix A.5.2) confirm this result (no solution found for ≥ […]).

Conclusion. Merged handlers are extremely powerful against blackbox synthesis. Even in the ideal sampling scenario, blackbox methods cannot retrieve the semantics of expressions with more than 3 nested conditionals, while runtime overhead is minimal.
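To see why merged handlers strain the sampling phase, the following Python sketch builds the merged handler of Eq. (2), a branchless variant in the spirit of Fig. 9, and counts how many uniformly drawn inputs reach each branch. The helper names, the seed and the use of 95 random samples over [−50; 49] (mirroring the default oracle) are assumptions for this illustration.

```python
import random
from collections import Counter

def merged(x, y, z):
    # Eq. (2): ITE(z = 0, x + y, ITE(z = 1, x - y, x * y))
    return x + y if z == 0 else (x - y if z == 1 else x * y)

def not_equal_bit(c, cst):
    # Same bit trick as the C code of Fig. 9: returns 0 if c == cst, 1 otherwise.
    res = c - cst
    s = res >> 31
    return (-((res ^ s) - s) >> 31) & 1

def merged_branchless(x, y, z):
    ne0, ne1 = not_equal_bit(z, 0), not_equal_bit(z, 1)
    inner = (x - y) * (1 - ne1) + ne1 * (x * y)
    return (x + y) * (1 - ne0) + ne0 * inner

def branch_coverage(n=95, lo=-50, hi=49, seed=0):
    """Count how many random samples of z fall in each branch of Eq. (2)."""
    rng = random.Random(seed)
    name = lambda z: "z == 0" if z == 0 else ("z == 1" if z == 1 else "else")
    return Counter(name(rng.randint(lo, hi)) for _ in range(n))

coverage = branch_coverage()
```

With 100 possible values for z, each equality branch is reached by a given random sample with probability 1/100, so a specific branch is missed by all 95 random samples with probability (99/100)^95, roughly 0.38; most samples exercise only the innermost handler.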

Discussion. Symbolic methods, like symbolic execution, are unhindered by this protection, for they track the succession of handlers and know which subparts of merged handlers are executed. They can then reconstruct the real semantics of the code. To handle this, our anti-AI protection can be combined with (lightweight) anti-symbolic protections (e.g. [26, 36]).

Blackbox deobfuscation. Blazytko et al.'s work [7] has already been thoroughly discussed. We complete their experimental evaluation, and generalize and improve their approach: Xyntia with 1 s/expr. finds twice as many expressions as Syntia with 600 s/expr.

White- and greybox deobfuscation. Several recent works leverage whitebox symbolic methods for deobfuscation ("symbolic deobfuscation") [5, 10, 22, 29, 31, 37]. Unfortunately, they are sensitive to code complexity as discussed in Section 7, and efficient countermeasures are now available [12, 26, 27, 38], while Xyntia is immune to them (Section 7.1). David et al. [16] recently proposed QSynth, a greybox deobfuscation method combining I/O relationship caching (blackbox) and incremental reasoning along the target expression (whitebox). Yet, QSynth is sensitive to massive syntactic obfuscations where Xyntia is not (cf. Section 6). Furthermore, QSynth works on a simple grammar. It is unclear whether its caching technique would scale to larger grammars like those of Xyntia and Syntia.

Program synthesis. Program synthesis aims at finding a function from a specification, which can be given either formally, in natural language or as I/O relations (the case we are interested in here). There exist three main families of program synthesis methods [20]: enumerative, constraint solving and stochastic. Enumerative search enumerates all programs starting from the simplest one, pruning snippets inconsistent with the specification and returning the first code meeting the specification. We compare, in this paper, to one such method, CVC4 [6], winner of the SyGus'19 syntax-guided synthesis competition [2], and show that our approach is more appropriate for deobfuscation. Constraint solving methods [21], on the other hand, encode the skeleton of the target program as a first-order satisfiability problem and use an off-the-shelf SMT solver to infer an implementation meeting the specification. However, this approach is less efficient than enumerative and stochastic methods [1]. Finally, stochastic methods [30] traverse the search space randomly in the hope of finding a program consistent with a specification. Contrary to them, we aim at solving the deobfuscation problem in a fully blackbox way (not relying on the obfuscated code, nor on an estimation of the result size).

10 CONCLUSION

AI-based blackbox deobfuscation is a promising recent research area. The field has barely been explored yet, and the pros and cons of such methods are still unclear. This article deepens the state of AI-based blackbox deobfuscation in three different directions. First, we define a novel generic framework for AI-based blackbox deobfuscation, encompassing prior works such as Syntia; we identify that the search space underlying code deobfuscation is too unstable for simulation-based methods, and advocate the use of S-metaheuristics. Second, we take advantage of our framework to carefully design Xyntia, a new AI-based blackbox deobfuscator. Xyntia significantly outperforms Syntia in terms of success rate while keeping its good properties; in particular, Xyntia is completely immune to the most recent anti-analysis code obfuscation methods. Xyntia also proves to be more efficient than greybox and whitebox deobfuscators or standard program synthesis methods. Finally, we propose the first two protections against AI-based blackbox deobfuscation, completely preventing Xyntia and Syntia's attacks at a reasonable cost. We hope that these results will help better understand AI-based deobfuscation and lead to further progress in the field.

REFERENCES

[1] Rajeev Alur, Rastislav Bodík, Garvit Juniwal, Milo M. K. Martin, Mukund Raghothaman, Sanjit A. Seshia, Rishabh Singh, Armando Solar-Lezama, Emina Torlak, and Abhishek Udupa. 2013. Syntax-guided synthesis. In Formal Methods in Computer-Aided Design, FMCAD 2013, Portland, OR, USA, October 20-23, 2013. IEEE, 1–8. http://ieeexplore.ieee.org/document/6679385/

[2] Rajeev Alur, Dana Fisman, Saswat Padhi, Rishabh Singh, and Abhishek Udupa. 2019. SyGuS-Comp 2018: Results and Analysis. CoRR abs/1904.07146 (2019). arXiv:1904.07146 http://arxiv.org/abs/1904.07146

[3] Sebastian Banescu, Christian S. Collberg, Vijay Ganesh, Zack Newsham, and Alexander Pretschner. 2016. Code obfuscation against symbolic execution attacks. In Annual Conference on Computer Security Applications, ACSAC 2016.

[4] Boaz Barak, Oded Goldreich, Russell Impagliazzo, Steven Rudich, Amit Sahai, Salil Vadhan, and Ke Yang. 2012. On the (im)possibility of obfuscating programs. Journal of the ACM (JACM) 59, 2 (2012), 1–48.

[5] Sébastien Bardin, Robin David, and Jean-Yves Marion. 2017. Backward-Bounded DSE: Targeting Infeasibility Questions on Obfuscated Codes. In . IEEE Computer Society, 633–651. https://doi.org/10.1109/SP.2017.36

[6] Clark Barrett, Christopher L. Conway, Morgan Deters, Liana Hadarean, Dejan Jovanović, Tim King, Andrew Reynolds, and Cesare Tinelli. 2011. CVC4. In Proceedings of the 23rd International Conference on Computer Aided Verification (CAV '11) (Lecture Notes in Computer Science, Vol. 6806).

[7] [...] Usenix Security (Vancouver, BC, Canada). 643–659.

[8] Tim Blazytko, Moritz Contag, Cornelius Aschermann, and Thorsten Holz. 2018. Syntia: Breaking State-of-the-Art Binary Code Obfuscation via Program Synthesis. Black Hat Asia (2018).

[9] Cameron B Browne, Edward Powley, Daniel Whitehouse, Simon M Lucas, Peter I Cowling, Philipp Rohlfshagen, Stephen Tavener, Diego Perez, Spyridon Samothrakis, and Simon Colton. 2012. A survey of monte carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in games 4, 1 (2012), 1–43.

[10] David Brumley, Cody Hartwig, Zhenkai Liang, James Newsome, Dawn Xiaodong Song, and Heng Yin. 2008. Automatically Identifying Trigger-based Behavior in Malware. In Botnet Detection: Countering the Largest Security Threat. Springer, 65–88.

[11] C. Collberg, S. Martin, J. Myers, and B. Zimmerman. [n.d.]. The Tigress C Diversifier/Obfuscator. http://tigress.cs.arizona.edu/

[12] Christian Collberg and Jasvir Nagra. 2009. Surreptitious Software: Obfuscation, Watermarking, and Tamperproofing for Software Protection (1st ed.). Addison-Wesley Professional.

[13] Christian Collberg, Clark Thomborson, and Douglas Low. 1997. A taxonomy of obfuscating transformations.

[14] Christian Collberg, Clark Thomborson, and Douglas Low. 1998. Manufacturing cheap, resilient, and stealthy opaque constructs. In Proceedings of the 25th ACM SIGPLAN-SIGACT symposium on Principles of programming languages. 184–196.

[15] Robin David, Sébastien Bardin, Thanh Dinh Ta, Laurent Mounier, Josselin Feist, Marie-Laure Potet, and Jean-Yves Marion. 2016. BINSEC/SE: A dynamic symbolic execution toolkit for binary-level analysis. In , Vol. 1. IEEE, 653–656.

[16] Robin David, Luigi Coniglio, and Mariano Ceccato. 2020. QSynth-A Program Synthesis based Approach for Binary Code Deobfuscation. In BAR 2020 Workshop.

[17] Leonardo De Moura and Nikolaj Bjørner. 2008. Z3: An efficient SMT solver. In International conference on Tools and Algorithms for the Construction and Analysis of Systems. Springer, 337–340.

[18] Ninon Eyrolles, Louis Goubin, and Marion Videau. 2016. Defeating MBA-based Obfuscation. In Proceedings of the 2016 ACM Workshop on Software PROtection, SPRO@CCS 2016, Vienna, Austria, October 24-28, 2016, Brecht Wyseur and Bjorn De Sutter (Eds.). ACM, 27–38. https://doi.org/10.1145/2995306.2995308

[19] Nicolas Falliere, Patrick Fitzgerald, and Eric Chien. 2009. Inside the jaws of trojan.clampi. Rapport technique, Symantec Corporation (2009).

[20] Sumit Gulwani, Oleksandr Polozov, Rishabh Singh, et al. 2017. Program synthesis. Foundations and Trends® in Programming Languages 4, 1-2 (2017), 1–119.

[21] Susmit Jha, Sumit Gulwani, Sanjit A Seshia, and Ashish Tiwari. 2010. Oracle-guided component-based program synthesis. In , Vol. 1. IEEE, 215–224.

[22] Johannes Kinder. 2012. Towards Static Analysis of Virtualization-Obfuscated Binaries. In .

[23] Xavier Leroy, Damien Doligez, Alain Frisch, Jacques Garrigue, Didier Rémy, and Jérôme Vouillon. 2020. The OCaml system release 4.10. https://caml.inria.fr/pub/docs/manual-ocaml/

[24] Helena Ramalhinho Lourenço, Olivier C Martin, and Thomas Stützle. 2019. Iterated local search: Framework and applications. In Handbook of metaheuristics. Springer, 129–168.

[25] National Security Agency (NSA). [n.d.]. Ghidra. https://ghidra-sre.org/

[26] Mathilde Ollivier, Sébastien Bardin, Richard Bonichon, and Jean-Yves Marion. 2019. How to kill symbolic deobfuscation for free (or: unleashing the potential of path-oriented protections). In Proceedings of the 35th Annual Computer Security Applications Conference. 177–189.

[27] Mathilde Ollivier, Sébastien Bardin, Richard Bonichon, and Jean-Yves Marion. 2019. Obfuscation: where are we in anti-DSE protections? (a first attempt). In Proceedings of the 9th Workshop on Software Security, Protection, and Reverse Engineering. 1–8.

[28] Oreans Technologies. 2020. Themida – Advanced Windows Software Protection System. http://oreans.com/themida.php.

[29] Jonathan Salwan, Sébastien Bardin, and Marie-Laure Potet. 2018. Symbolic deobfuscation: from virtualized code back to the original. In .

[30] Eric Schkufza, Rahul Sharma, and Alex Aiken. 2012. Stochastic Superoptimization. CoRR abs/1211.0557 (2012). arXiv:1211.0557 http://arxiv.org/abs/1211.0557

[31] Sebastian Schrittwieser, Stefan Katzenbeisser, Johannes Kinder, Georg Merzdovnik, and Edgar Weippl. 2016. Protecting Software Through Obfuscation: Can It Keep Pace with Progress in Code Analysis? ACM Comput. Surv. 49, 1, Article 4 (2016), 37 pages.

[32] Jon Stephens, Babak Yadegari, Christian S. Collberg, Saumya Debray, and Carlos Scheidegger. 2018. Probabilistic Obfuscation Through Covert Channels. In . 243–257. https://doi.org/10.1109/EuroSP.2018.00025

[33] El-Ghazali Talbi. 2009. Metaheuristics: From Design to Implementation. Wiley Publishing.

[34] Tora. [n.d.]. Devirtualizing FinSpy. http://linuxch.org/poc2012/Tora,DevirtualizingFinSpy.pdf

[35] VM Protect Software. 2020. VMProtect Software Protection. http://vmpsoft.com.

[36] Babak Yadegari and Saumya Debray. 2015. Symbolic Execution of Obfuscated Code. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security (Denver, Colorado, USA) (CCS '15). Association for Computing Machinery, New York, NY, USA, 732–744. https://doi.org/10.1145/2810103.2813663

[37] Babak Yadegari, Brian Johannesmeyer, Ben Whitely, and Saumya Debray. 2015. A Generic Approach to Automatic Deobfuscation of Executable Code. In Symposium on Security and Privacy, SP.

[38] Yongxin Zhou, Alec Main, Yuan X. Gu, and Harold Johnson. 2007. Information Hiding in Software with Mixed Boolean-arithmetic Transforms. In Proceedings of the 8th International Conference on Information Security Applications (Jeju Island, Korea) (WISA '07). Springer-Verlag, Berlin, Heidelberg, 61–75.

APPENDIX

We now introduce complementary results to describe details that we did not fully explain for the sake of space. We follow the same organisation as the main article:

Appendix A.1 details the evaluation of Syntia (from Section 4). It presents the datasets used and the results obtained with Syntia;

Appendix A.2 details the evaluation of Xyntia (see Section 5) and the study leading to the optimal Xyntia;

Appendix A.3 describes the comparison of Xyntia to whitebox, pattern-based simplifiers from Section 6;

Appendix A.4 details the obfuscations used to evaluate Xyntia against state-of-the-art protections in Section 7;

Appendix A.5 describes the datasets used in Section 8 and details the evaluation of Syntia, CVC4 and STOKE over the proposed protections.

A.1 Understand AI-based deobfuscation: More Details

Section 4 presents an in-depth evaluation of Syntia. We now show complementary data to detail: (1) the distribution of expressions in our custom benchmark suite B2; (2) the results of Syntia over 15 runs; (3) the results of Syntia in terms of quality and correctness and how it reacts to the number of inputs and expression types; (4) the study of Syntia's parameters to find an optimal configuration.

Table 8: Description of B2


A.1.1 Experimental design. In order to perform a fine-grained evaluation of Syntia, we use 2 benchmark suites: B1 and B2. B1 was introduced by Blazytko et al. [7] to evaluate Syntia, and contains 500 expressions. However, it presents important limitations, as discussed in Section 4.2. Thus, we introduce a custom benchmark B2 which contains 1110 expressions. It is better distributed according to the type of the expressions (Boolean, Arithmetic and Mixed Boolean-Arithmetic) and the number of inputs used (between 2 and 6). Moreover, B2 is more challenging than B1, as it considers more complex expressions. Table 8 presents the number of expressions per expression type and number of inputs.

A.1.2 Evaluation of Syntia. In Section 4.2 we evaluate Syntia to estimate its stability across executions, its robustness, speed, quality and correctness. We now present the complete experimental results and discuss them.

RQ1. Table 9 presents the results of Syntia over 15 runs and Table 10 presents summary statistics over these runs. We observe that Syntia is indeed very stable across executions.
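As a sanity check on the stability claim, the summary statistics of Table 10 can be recomputed from the per-run counts of Table 9. The sketch below is only an illustration; the assumption that σ is the population standard deviation (rather than the sample one) is ours, and it is the choice that reproduces the reported values.

```python
from statistics import mean, pstdev

# Number of successfully synthesized expressions per run, copied from Table 9.
B1_RUNS = [367, 362, 376, 365, 369, 365, 375, 370, 366, 372, 367, 364, 371, 368, 370]
B2_RUNS = [349, 376, 371, 367, 379, 383, 366, 371, 358, 367, 364, 372, 378, 350, 354]


def summarize(runs):
    """Return the minimum, maximum, mean and population standard deviation."""
    return {
        "min": min(runs),
        "max": max(runs),
        "mean": mean(runs),
        "sigma": pstdev(runs),  # assumption: sigma is computed over the 15 runs as a population
    }


b1_stats = summarize(B1_RUNS)
b2_stats = summarize(B2_RUNS)
```

With these inputs, the rounded means (368.5 and 367.0) and standard deviations (3.83 and 10.11) agree with the values reported in Table 10.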

RQ2. As presented in Table 10, Syntia is not able to synthesize B2 efficiently, only synthesizing 34.5% of it. Moreover, as presented in Table 19, Syntia cannot handle expressions using more than 3 inputs. Indeed, its success rate falls to 10.0%, 2.2% and 1.1% for 4, 5 and 6 inputs respectively. Syntia is also impacted by the type of the target expression. Handling Boolean expressions seems simpler for Syntia. On the contrary, it struggles to synthesize MBA expressions. Still, we observe that Syntia returns good quality results (a mean quality of 0.59, see Table 11) and almost never returns non-equivalent expressions.

Table 9: Success rate of Syntia across 15 runs (timeout=60s)

Test execution no.   B1            B2
1                    367 (73.4%)   349 (31.4%)
2                    362 (72.4%)   376 (33.9%)
3                    376 (75.2%)   371 (33.4%)
4                    365 (73.0%)   367 (33.1%)
5                    369 (73.8%)   379 (34.1%)
6                    365 (73.0%)   383 (34.5%)
7                    375 (75.0%)   366 (33.0%)
8                    370 (74.0%)   371 (33.4%)
9                    366 (73.2%)   358 (32.3%)
10                   372 (74.4%)   367 (33.1%)
11                   367 (73.4%)   364 (32.8%)
12                   364 (72.8%)   372 (33.5%)
13                   371 (74.2%)   378 (34.1%)
14                   368 (73.6%)   350 (31.5%)
15                   370 (74.0%)   354 (31.9%)

Table 10: 15 runs of Syntia over B1 and B2 (timeout = 60 s)

Data-set   Min.          Max.          Mean            σ
B1         362 (72.4%)   376 (75.2%)   368.5 (73.7%)   3.83 (0.76%)
B2         349 (31.4%)   383 (34.5%)   367.0 (33.1%)   10.11 (0.91%)

RQ3. Syntia defaults to synthesizing expressions over the Full operators' set. To evaluate its sensitivity to the size of the operators' set, we launch it over Full, Expr and Mba. Table 11 shows that restricting the search space benefits Syntia. However, even in the best scenario (Mba), its results are disappointing: it synthesizes only ≈ 42% of B2.

Table 11: Syntia's results on Full / Expr / Mba (B2, timeout=60s).

               Full            Expr     Mba
Succ. Rate     34.5%           38.8%    42.6%
Equiv. Range   33.7 - 34.0%    38.7%    42.3 - 42.6%
Mean Qual.     0.59            0.62     0.66

A.1.3 Optimal Syntia. To ensure that the conclusions given in Section 4.4 apply to MCTS and not only to Syntia, we studied Syntia extensively, searching for better set-ups. We study Syntia according to the following parameters: simulation depth, SA-UCT value, number of I/O samples and choice of the distance.

Table 12: Syntia depending on max playout depth (Mba, B2, timeout = 60 s).

Max play. depth   0               3               5
Succ. Rate        42.6%           31.8%           28.6%
Equiv. Range      42.3 - 42.6%    31.4 - 31.8%    28.1 - 28.6%
Mean Qual.        0.66            1.03            1.06

Simulation depth. As presented in Section 4.4, MCTS simulates each generated node. To do so, it applies rules of the grammar randomly to the non-terminal expression until it becomes terminal. An important parameter is thus the maximum simulation depth, i.e. the number of rules not leading to terminal nodes (like U → U + U). By default, Syntia considers a maximum simulation depth of 0, which means that all non-terminal symbols are directly replaced by variables or constant values. Table 12 shows that increasing this parameter is not beneficial.

Number of I/O samples. By default, Syntia considers 50 samples. Table 13 presents results for different numbers of samples. We observe a small improvement when the number of samples decreases, but the results stay in the same range.

Table 13: Syntia for different numbers of samples (B2, Mba, timeout=60s).

Objective function. By default, Syntia evaluates whether an expression is close to the target one by computing the mean of several distances. To complete our evaluation of Syntia, we launched it with Xyntia's Log-arithmetic distance. We observe that, as for Xyntia, the Log-arithmetic distance seems more appropriate to guide the search. Still, Syntia's success rate stays below 50%.

Table 14: Syntia depending on the objective function (B2, Mba, timeout=60s).

               Syntia-dist     Log-arith
Succ. Rate     42.6%           47.9%
Equiv. Range   42.3 - 42.6%    47.4 - 47.9%
Mean Qual.     0.66            0.70

Simulated annealing UCT (SA-UCT). From a high level, MCTS can be divided into 2 behaviors: exploitation (where it focuses on promising nodes) and exploration (where it checks nodes that are rarely visited or that look uninteresting at first glance). The SA-UCT constant is a parameter that configures the balance between these behaviors. The smaller the constant, the more exploitative MCTS is; the bigger the constant, the more explorative MCTS is. By default, Syntia sets the SA-UCT constant to 1.5. Table 15 presents the results of Syntia for smaller and bigger values. For smaller values, Syntia is less efficient. This is consistent with the claims from Section 4.4: as the search space is highly unstable, simulations are misleading, so focusing too much on exploitation is unsuitable. However, it also appears that bigger values can be beneficial. This is also consistent with Section 4.4, as it shows that the most important behavior is exploration. Still, even with SA-UCT values above 1.5, Syntia's success rate stays below 50% (Table 15).

Optimal Syntia. Our extensive study highlights a new optimal configuration of Syntia (Mba set of operators, simulation depth=0,

Table 15: Syntia depending on SA-UCT value (Mba, B2, timeout = 60 s).

SA-UCT         3               2               1.5             0.5     0.1
Succ. Rate     48.0%           48.2%           42.6%           34.6%   19.1%
Equiv. Range   47.7 - 48.0%    48.1 - 48.2%    42.3 - 42.6%    34.6%   19.1%
Mean Qual.     0.71            0.72            0.66            0.62    0.44
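The role of the exploration constant varied in Table 15 can be illustrated with a common form of the UCT selection rule (mean reward plus a visit-count bonus). This is a hedged sketch of plain UCT only: Syntia's SA-UCT variant may differ in its details, and the function names, rewards and visit counts below are made up for the example.

```python
import math


def uct_score(mean_reward, child_visits, parent_visits, exploration_constant):
    """Plain UCT: exploitation term plus an exploration bonus for rarely visited nodes."""
    if child_visits == 0:
        return math.inf  # unvisited children are tried first
    bonus = exploration_constant * math.sqrt(math.log(parent_visits) / child_visits)
    return mean_reward + bonus


def select_child(children, parent_visits, exploration_constant):
    """Return the name of the child with the highest UCT score.

    `children` is a list of (name, mean_reward, visits) tuples.
    """
    best = max(
        children,
        key=lambda c: uct_score(c[1], c[2], parent_visits, exploration_constant),
    )
    return best[0]


# Hypothetical node statistics: one well-explored good child, one neglected child.
children = [("promising", 0.80, 90), ("rarely_visited", 0.50, 10)]
exploitative_choice = select_child(children, 100, 0.1)
explorative_choice = select_child(children, 100, 1.5)
```

With a small constant the search keeps descending into the child that already looks good, while a larger constant redirects it toward the neglected child. This is the trade-off that the SA-UCT column of Table 15 explores.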

Table 16: Optimal Syntia (B2, timeout = 60 s).

Succ. Rate     52.7%
Equiv. Range   52.1 - 52.6%
Mean Qual.     0.76

A.2 Improve AI-based deobfuscation: More details

Section 5 presents our new AI-based blackbox deobfuscator dubbed Xyntia. We now show complementary data and results to detail: (1) the results of Xyntia over 15 runs; (2) the results of Xyntia in terms of quality, correctness, and capacity to handle a high number of inputs and different expression types; (3) the results of Xyntia over Full, Expr and Mba; (4) the study leading to the optimal Xyntia; (5) the capacity of Xyntia to integrate a high number of constant values in its grammar.

A.2.1 Evaluation of Xyntia. To evaluate Xyntia and compare it against Syntia, we replicate for Xyntia the experimental procedure followed in Section 4. We now present the complete experimental results and discuss them.

RQ4. To assess the usability of Xyntia, we need to know whether it is stable across executions. Indeed, Xyntia, like Syntia, is stochastic, and results may vary from one run to another. Table 17 shows the results of Xyntia over 15 runs on B1 and B2; statistics are given in Table 18. No significant variation is observed, meaning that Xyntia is stable across executions.

Table 17: Success rate of Xyntia (XyntiaOpt) across 15 runs (timeout = 60 s)

Test execution no.   B1           B2
1                    500 (100%)   1051 (94.7%)
2                    500 (100%)   1051 (94.7%)
3                    500 (100%)   1060 (95.5%)
4                    500 (100%)   1054 (95.0%)
5                    500 (100%)   1060 (95.5%)
6                    500 (100%)   1059 (95.4%)
7                    500 (100%)   1051 (94.7%)
8                    500 (100%)   1059 (95.4%)
9                    500 (100%)   1055 (95.0%)
10                   500 (100%)   1053 (94.7%)
11                   500 (100%)   1059 (95.4%)
12                   500 (100%)   1052 (94.8%)
13                   500 (100%)   1061 (95.6%)
14                   500 (100%)   1054 (95.0%)
15                   500 (100%)   1053 (94.9%)

RQ5. Unlike Syntia, Xyntia is efficient on B2. Moreover, as presented in Table 19, it is able to synthesize expressions using up to 5 inputs with a success rate ≥ [...] reaches a success rate > [...] > 85% of them. In addition, we observe that Xyntia returns simple and almost always correct results. Still, the results given in Table 19 seem to show that Syntia returns better quality results and fewer non-equivalent expressions than Xyntia. However, these conclusions are biased by the fact that Syntia has a lower success rate than Xyntia and finds only very simple expressions. Thus, we present results on the expressions that were successfully synthesized by both Syntia and Xyntia. Table 20 demonstrates that, under this condition, the quality of both tools is comparable. Still, Xyntia reaches such results thanks to our post-process simplifier. Thus, Syntia effectively synthesizes simpler expressions, but the gap can be bridged by adding a simple simplifier to Xyntia. On the other hand, we see that Syntia returns between 6 and 9 non-equivalent expressions while Xyntia returns between 1 and 4. Thus Xyntia seems more reliable.

Table 18: Xyntia (XyntiaOpt): 15 runs on B1/B2 (timeout=60s)

Data-set   Min.           Max.           Mean             σ
B1         500 (100%)     500 (100%)     500 (100%)       0 (0.00%)
B2         1051 (94.7%)   1061 (95.6%)   1055.5 (95.1%)   3.63 (0.33%)

RQ6. Xyntia defaults to synthesizing expressions over Expr while Syntia infers expressions over Full. To evaluate the sensitivity of Xyntia to the search space and show that the previous results were not due to search space inconsistency, we run Xyntia over Full, Expr and Mba. Table 21 indicates that Xyntia reaches high equivalence rates for all operators' sets (recall that Syntia's results stayed low). Still, Xyntia seems more sensitive to the size of the set of operators than Syntia. Its proven equivalence rate decreases from 90% (Expr) to 71% (Full), while Syntia's decreases only from 38.7% (Expr) to 33.7% (Full). On the other hand, restricting the search space to Mba benefits both Syntia and Xyntia.

A.2.2 Optimal Xyntia. The systematic evaluation of Xyntia depending on some design choices is summarized in Section 5.4. We complete our analysis here and give more details about the measured results. We focus on the following aspects: (1) the choice of the S-metaheuristic; (2) the choice of the sampling strategy; (3) the choice of the distance as objective function; (4) the effect of our custom simplifier.

Choice of the S-metaheuristic. We compare 5 S-metaheuristics, namely Hill Climbing, Random Walk, Simulated Annealing, Metropolis-Hastings and Iterated Local Search (ILS), to find out which is best suited to deobfuscation. Table 22 shows that ILS has a higher equivalence rate than the other search heuristics. Moreover, we observe that all S-metaheuristics obtain similar or better results than Syntia. The low equivalence rate of Hill Climbing compared to the other S-metaheuristics can be explained by the fact that it has no way to escape local optima. Even under these conditions, we observe that its results are not that far from Syntia's (which reaches an equivalence rate of ≈ 38% on Expr). This confirms that estimating the pertinence of non-terminal expressions through simulations, as MCTS does, is not suitable for deobfuscation (see Section 4.4). It is far more relevant to manipulate terminal expressions only, as S-metaheuristics do.
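The difference between Hill Climbing and Iterated Local Search can be sketched on a toy grammar. The code below is an illustration under our own assumptions, not Xyntia's implementation: the operator set, the 32-bit width, the mutation and perturbation strategies, the base-2 logarithm in the cost and all function names are hypothetical.

```python
import math
import random

MASK = 0xFFFFFFFF  # assumption: 32-bit machine words
OPS = {
    "+": lambda a, b: (a + b) & MASK,
    "-": lambda a, b: (a - b) & MASK,
    "&": lambda a, b: a & b,
    "|": lambda a, b: a | b,
    "^": lambda a, b: a ^ b,
}
TERMINALS = ("x", "y", 1)  # two inputs and the constant 1


def evaluate(expr, env):
    """Evaluate an expression: a variable name, an int, or (op, left, right)."""
    if isinstance(expr, str):
        return env[expr]
    if isinstance(expr, int):
        return expr & MASK
    op, left, right = expr
    return OPS[op](evaluate(left, env), evaluate(right, env))


def random_small_expr(rng):
    """Return a terminal, or one operator applied to two terminals."""
    if rng.random() < 0.5:
        return rng.choice(TERMINALS)
    return (rng.choice(sorted(OPS)), rng.choice(TERMINALS), rng.choice(TERMINALS))


def mutate(expr, rng):
    """Replace one randomly chosen subtree by a small random expression."""
    if isinstance(expr, tuple) and rng.random() < 0.7:
        op, left, right = expr
        if rng.random() < 0.5:
            return (op, mutate(left, rng), right)
        return (op, left, mutate(right, rng))
    return random_small_expr(rng)


def cost(expr, samples):
    """Log-arithmetic distance between candidate outputs and observed outputs."""
    return sum(math.log2(1 + abs(evaluate(expr, env) - out)) for env, out in samples)


def local_search(expr, samples, rng, steps):
    """Hill climbing: keep a mutation only if it strictly lowers the cost."""
    best, best_cost = expr, cost(expr, samples)
    for _ in range(steps):
        if best_cost == 0:
            break
        candidate = mutate(best, rng)
        candidate_cost = cost(candidate, samples)
        if candidate_cost < best_cost:
            best, best_cost = candidate, candidate_cost
    return best, best_cost


def iterated_local_search(samples, rng, rounds=50, steps=200):
    """Alternate hill climbing and perturbation to escape local optima."""
    best, best_cost = local_search(rng.choice(TERMINALS), samples, rng, steps)
    for _ in range(rounds):
        if best_cost == 0:
            break
        perturbed = best
        for _ in range(3):  # perturbation: a few unconditional mutations
            perturbed = mutate(perturbed, rng)
        candidate, candidate_cost = local_search(perturbed, samples, rng, steps)
        if candidate_cost <= best_cost:
            best, best_cost = candidate, candidate_cost
    return best, best_cost


def make_samples(target, rng, count):
    """Query the blackbox `target` on random inputs and record its outputs."""
    samples = []
    for _ in range(count):
        env = {"x": rng.randrange(MASK + 1), "y": rng.randrange(MASK + 1)}
        samples.append((env, target(env["x"], env["y"])))
    return samples
```

A typical use is `iterated_local_search(make_samples(blackbox, rng, 20), rng)` with `rng = random.Random(seed)`. Note that every candidate is a terminal expression that can be evaluated directly, so no simulation step is needed. `local_search` alone corresponds to Hill Climbing, which stops improving once no single mutation lowers the cost; the perturbation step is what lets ILS leave such a local optimum.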

Effect of the sampling strategy. Table 23 presents Xyntia's results for different numbers of randomly chosen I/O samples. Intuitively, the more samples are considered, the more precise the synthesis specification is. Consequently, one may think that increasing the number of samples would negatively impact the success rate and positively impact the equivalence rate. The experiments show that while it improves the equivalence rate (from 74.95% to 87.39% for 10 and 100 I/O samples respectively on Expr), it does not weaken the success rate. This result can be explained by the fact that more inputs are used by the objective function to guide the synthesis more precisely. Still, the degree to which the results are impacted depends on the set of operators. For Mba and Expr, the equivalence range seems to stagnate when adding more than 50 samples, while Full still improves with 100 samples.

In order to improve Xyntia's results over the Full set of operators, we propose to add constant vectors ($\vec{0}$, $\vec{1}$, $\vec{-1}$, $\vec{min_s}$, $\vec{max_s}$) to enforce important behaviors such as division by zero and overflows. Table 24 presents the results of Xyntia in two configurations: (1) 100 randomly generated samples and (2) 95 randomly generated samples plus 5 constant vectors. We see that adding such constant vectors slightly improves Xyntia's equivalence rate over the Full and Expr sets of operators.
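The second sampling configuration (95 random samples plus 5 constant vectors) can be sketched as follows. The 32-bit signed range, the choice of 0 and 1 as the first two constants and the helper name `make_inputs` are assumptions made for this illustration.

```python
import random

BITS = 32  # assumption: inputs are 32-bit signed integers
MIN_S = -(1 << (BITS - 1))
MAX_S = (1 << (BITS - 1)) - 1
CONSTANTS = (0, 1, -1, MIN_S, MAX_S)  # one constant vector per value


def make_inputs(num_vars, num_random, rng, with_constants=True):
    """Build the input vectors on which the blackbox will be queried.

    Random vectors cover typical behavior; constant vectors force corner
    cases (zero operands, sign changes, extreme values) into the sample set.
    """
    samples = [
        tuple(rng.randint(MIN_S, MAX_S) for _ in range(num_vars))
        for _ in range(num_random)
    ]
    if with_constants:
        samples += [(value,) * num_vars for value in CONSTANTS]
    return samples


inputs = make_inputs(num_vars=3, num_random=95, rng=random.Random(0))
```

Keeping the total at 100 vectors makes the two configurations of Table 24 comparable in terms of the number of oracle queries.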

Choice of the distance. The default design of Xyntia (Section 5) leverages the Log-arithmetic distance as objective function. We present in Table 25 an evaluation of Xyntia with the following alternative distances:

• $\mathrm{Arithmetic}_{\vec{o}^*}(\vec{o}) = \sum_i |o_i - o_i^*|$
• $\mathrm{Hamming}_{\vec{o}^*}(\vec{o}) = \sum_i \sum_j o_{i,j} \oplus o_{i,j}^*$
• $\mathrm{Xor}_{\vec{o}^*}(\vec{o}) = \sum_i o_i \oplus o_i^*$
• $\mathrm{Log\text{-}Arithmetic}_{\vec{o}^*}(\vec{o}) = \sum_i \log(1 + |o_i - o_i^*|)$

where $\vec{o}^*$ is the vector of sampled outputs, $\vec{o}$ is the vector of actual outputs of the synthesized expression, and $o_{i,j}$ denotes the $j$-th bit of $o_i$.

It appears that the Log-arithmetic distance guides synthesis the best. Over Expr, Xyntia reaches a proven equivalence rate ranging from 84.50% with the Hamming distance to 90.6% with the Log-arithmetic one. While, intuitively, the Xor and Hamming distances should guide the search better for Boolean expressions, Table 26 demonstrates that this is not the case: the Log-arithmetic distance is better for Boolean expressions.
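The four distances can be written down directly. The sketch below assumes 32-bit unsigned output values and a base-2 logarithm; both are assumptions of this example, as are the function names.

```python
import math

BITS = 32  # assumption: outputs are 32-bit unsigned values
MASK = (1 << BITS) - 1


def arithmetic(expected, actual):
    """Sum of absolute differences between outputs."""
    return sum(abs(o - e) for o, e in zip(actual, expected))


def hamming(expected, actual):
    """Total number of differing bits between outputs."""
    return sum(bin((o ^ e) & MASK).count("1") for o, e in zip(actual, expected))


def xor(expected, actual):
    """Sum of the bitwise xor of outputs, read as integers."""
    return sum((o ^ e) & MASK for o, e in zip(actual, expected))


def log_arithmetic(expected, actual):
    """Sum of log2(1 + |difference|): large errors are compressed."""
    return sum(math.log2(1 + abs(o - e)) for o, e in zip(actual, expected))


# A candidate that is off by one on a single sample (7 instead of 8).
expected, actual = [8], [7]
scores = {
    "arithmetic": arithmetic(expected, actual),
    "hamming": hamming(expected, actual),
    "xor": xor(expected, actual),
    "log_arithmetic": log_arithmetic(expected, actual),
}
```

On this example the arithmetic and Log-arithmetic distances both see a near miss, whereas the Hamming and Xor distances see four flipped bits. All four distances are zero exactly when the candidate matches every sampled output.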

Effect of the simplifier. Xyntia integrates a simple and efficient simplification engine to post-process the expressions found. The simplification rules, which are partially listed in Table 28, are iteratively applied to the expression until a fixpoint is reached. Table 27 presents the quality of synthesized expressions with and without the simplification engine. We observe that the simplifier significantly improves the quality of the expressions and enables us to reach really good quality results (≈ 1) for Expr and Mba. However, for Full, the quality stays around 1.3. As such, some more engineering might be needed to get better results for Full. Nevertheless, our simplifier efficiently rewrites expressions while adding no significant latency. Indeed, the average time spent in this post-processing step is around 2.6 ms.

Table 19: Syntia & Xyntia (XyntiaOpt): results according to expression type and number of inputs (B2, timeout = 60 s)
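The fixpoint simplification described under "Effect of the simplifier" can be sketched with a handful of rewrite rules. This is not Xyntia's simplifier: the expression encoding (nested tuples), the function names and the selection of rules are assumptions, and only standard algebraic identities are used.

```python
def rewrite_once(expr):
    """Apply the rewrite rules bottom-up, once.

    Expressions are ints, variable names, (op, e) for unary operators
    "neg" and "not", or (op, left, right) for binary operators.
    """
    if not isinstance(expr, tuple):
        return expr
    if len(expr) == 2:
        op, arg = expr
        arg = rewrite_once(arg)
        # neg(neg(E)) -> E and not(not(E)) -> E
        if isinstance(arg, tuple) and len(arg) == 2 and arg[0] == op:
            return arg[1]
        return (op, arg)
    op, left, right = expr
    left, right = rewrite_once(left), rewrite_once(right)
    if op in ("+", "-", "|", "^") and right == 0:
        return left  # E + 0, E - 0, E | 0, E ^ 0 -> E
    if op == "*" and right == 1:
        return left  # E * 1 -> E
    if op in ("*", "&") and right == 0:
        return 0  # E * 0, E & 0 -> 0
    if op in ("-", "^") and left == right:
        return 0  # E - E, E ^ E -> 0
    if op in ("&", "|") and left == right:
        return left  # E & E, E | E -> E
    return (op, left, right)


def simplify(expr):
    """Re-apply the rules until the expression no longer changes (fixpoint)."""
    while True:
        rewritten = rewrite_once(expr)
        if rewritten == expr:
            return expr
        expr = rewritten


noisy = ("+", ("^", ("not", ("not", "x")), 0), ("-", "y", "y"))
simplified = simplify(noisy)
```

Every rule returns either a subterm or a constant, so each successful rewrite shrinks the expression and the loop terminates.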


Table 20: Results for expressions that both Syntia and Xyntia (XyntiaOpt) successfully synthesized (B2, timeout = 60 s).

Table 21: Xyntia (XyntiaOpt): results on Full, Expr and Mba (B2, timeout = 60 s).

               Full            Expr            Mba
Succ. Rate     85.3%           95.5%           95.7%
Equiv. Range   71.2 - 76.1%    90.6 - 94.2%    91.4 - 95.6%
Mean Qual.     1.04            0.92            0.97

Table 22: Synthesis Equivalence Rate for different S-metaheuristics (B2, XyntiaOpt, timeout = 60 s)

S-metaheuristic         Equiv. Range
Random Walk             62.3 - 63.4%
Hill Climbing           31.9 - 33.1%
Iterated Local Search
Simulated Annealing     64.8 - 65.8%
Metropolis-Hastings     57.7 - 58.5%

Table 23: Results of Xyntia for different numbers of samples (B2, XyntiaOpt, timeout = 60 s).

Samples                Full              Expr              Mba
10     Succ. Rate      85.05%            93.69%            93.33%
       Equiv. Range    52.79 - 59.10%    74.95 - 79.55%    79.64 - 85.14%
       Mean Qual.      0.94              0.95              0.96
20     Succ. Rate      86.85%            93.96%            94.50%
       Equiv. Range    59.46 - 65.14%    82.61 - 88.65%    87.12 - 92.43%
       Mean Qual.      1.02              0.93              0.96
50     Succ. Rate      88.65%            95.50%            96.13%
       Equiv. Range    66.49 - 72.34%    87.75 - 92.70%    89.91 - 95.77%
       Mean Qual.      1.04              0.92              0.96
100    Succ. Rate      86.67%            95.32%            96.58%
       Equiv. Range    69.10 - 75.50%    87.39 - 93.51%    91.26 - 96.58%
       Mean Qual.      1.05              0.94              0.95

A.2.3 Limitations. Section 5.6 discusses inherent limitations of blackbox methods. While some are extensively studied in Section 8, we discuss here whether AI-based blackbox methods can efficiently synthesize expressions manipulating constant values. Indeed, Xyntia and Syntia only integrate the constant 1 in their grammar. Thus, if they try to synthesize an expression containing constant values (≠ 1), they will need to create them. However, this is unlikely, especially if the constant is far from 1. One solution is to add all possible constant values to the grammar. In order to verify whether this approach is conceivable, we add ranges of constant values, for several range bounds $N$, to Xyntia's grammar. The results for each configuration are presented in Fig. 11. They show that increasing the number of constant values dramatically impacts Xyntia's performance. We conclude that adding all possible constant values is not beneficial. Another solution is to add well-chosen constant values (-1, $min_s$, $max_s$), but we decided not to explore this approach in this paper. Still, in Section 7.2, we observe that the impact of this restriction is limited, as Xyntia is able to synthesize interesting constant values. Note that Syntia cannot do so.

Table 24: Xyntia with and without constant values (B2, XyntiaOpt, timeout = 60 s).

                           Full              Expr              Mba
no consts   Succ. Rate     86.67%            95.32%            96.58%
            Equiv. Range   69.10 - 75.50%    87.39 - 93.51%    91.26 - 96.58%
            Mean Qual.     1.05              0.94              0.95
5 consts    Succ. Rate     85.32%            95.50%            95.68%
            Equiv. Range   71.17 - 76.13%    90.6 - 94.2%      91.35 - 95.59%
            Mean Qual.     1.04              0.92              0.97

Table 25: Xyntia's results for different distances (B2, XyntiaOpt, timeout = 60 s).

Dist.                      Full              Expr              Mba
Arith       Succ. Rate     86.58%            94.50%            95.59%
            Equiv. Range   69.55 - 76.22%    87.57 - 93.51%    89.37 - 95.59%
            Mean Qual.     1.14              0.98              1.01
Hamm.       Succ. Rate     83.42%            91.53%            92.25%
            Equiv. Range   67.30 - 73.42%    84.50 - 89.73%    88.38 - 92.25%
            Mean Qual.     1.09              0.93              0.92
Xor         Succ. Rate     82.34%            92.34%            95.50%
            Equiv. Range   66.76 - 73.15%    86.07 - 90.09%    90.90 - 95.41%
            Mean Qual.     1.13              0.94              0.98
LogArith    Succ. Rate     85.32%            95.50%            95.68%
            Equiv. Range   71.17 - 76.13%    90.6 - 94.2%      91.35 - 95.59%
            Mean Qual.     1.04              0.92              0.97

Table 26: Xyntia's results for different distances over Boolean and arithmetic types of expressions (B2, XyntiaOpt, timeout = 60 s).

Dist.                      Boolean           Arith.
Arith       Succ. Rate     96.76%            96.49%
            Equiv. Range   95.68%            87.30 - 95.68%
            Mean Qual.     0.75              1.05
Hamm.       Succ. Rate     97.84%            90.81%
            Equiv. Range   95.41 - 95.68%    81.08 - 89.19%
            Mean Qual.     0.76              0.98
Xor         Succ. Rate     97.84%            90.54%
            Equiv. Range   96.76 - 97.03%    82.16 - 87.03%
            Mean Qual.     0.79              0.97
LogArith    Succ. Rate     98.38%            96.48%
            Equiv. Range   97.84%            88.11 - 95.14%
            Mean Qual.     0.73              0.98

Table 27: Xyntia quality with and without simplifier (B2, XyntiaOpt, timeout = 60 s).

                        Full     Expr     Mba
Mean Qual. No Simpl.    1.77     1.33     1.38
Mean Qual. Simpl.       1.19     0.93     0.97
Mean Simpl. Time (s)    0.0027   0.0026   0.0026

Figure 11: Effect of the number of constant values in Xyntia's grammar on equivalence rate (B2, XyntiaOpt, timeout=60s). Y-axis: Equivalence Rate (%); series: Xyntia Proven, Xyntia Optimistic.

A.3 Compare to other approaches: More details

We present in Section 6 a comparative study of Xyntia against greybox deobfuscators, whitebox simplifiers and state-of-the-art program synthesizers. We give here more details on how Xyntia compares to whitebox simplifiers.

A.3.1 Comparison to whitebox simplifiers. We compare Xyntia over the EA, VR-EA and EA-ED datasets with 3 whitebox approaches: GCC, the Z3 simplifier (v4.8.7) and our custom simplifier. We use GCC v8.3.0 with optimization level 3 to compile obfuscated expressions. We do not report its mean simplification time. Table 29 shows that GCC, Z3 simplify and our custom simplifier hardly simplify expressions compared to Xyntia. However, synthesis is on average slower than syntax-based simplifiers.

Table 28: Xyntia's simplification rules (partial)

Constant     $f(const_1, ..., const_N) \rightarrow result$
Arithmetic   $E + 0 \rightarrow E$        $E - 0 \rightarrow E$        $E_1 - (-E_2) \rightarrow E_1 + E_2$
             $E - E \rightarrow 0$        $-(-E) \rightarrow E$        $(-E_1) + E_2 \rightarrow E_2 - E_1$
             $E \times 0 \rightarrow 0$   $E \times 1 \rightarrow E$   $E << 0 \rightarrow E$
             $E >>_u 0 \rightarrow E$     $E >>_s 0 \rightarrow E$
Boolean      $\neg(\neg E) \rightarrow E$   $E \wedge -1 \rightarrow E$   $E \wedge 0 \rightarrow 0$
             $E \wedge E \rightarrow E$     $E \vee 0 \rightarrow E$      $E \vee -1 \rightarrow -1$
             $E \vee E \rightarrow E$       $E \oplus -1 \rightarrow \neg E$   $E \oplus 0 \rightarrow E$
             $E \oplus E \rightarrow 0$

Table 29: Results of whitebox simplifiers on the EA dataset

                             GCC -O3     Simplifier   Z3          Xyntia
EA      Enhancement rate     68 / 500    36 / 500     22 / 500    360 / 500
        Mean time (s)        -           0.005        0.0002      2.45
VR-EA   Enhancement rate     22 / 500    0 / 500      31 / 500    360 / 500
        Mean time (s)        -           -            0.0010      2.45
EA-ED   Enhancement rate     14 / 500    15 / 500     17 / 500    360 / 500
        Mean time (s)        -           0.0055       0.00042     2.45

Note: the symbolic execution engine returns expressions with a lot of concatenations.

A.4 Deobfuscation with Xyntia

We show in Section 7 that Xyntia bypasses state-of-the-art obfuscation strategies and makes it possible to reverse the VM handlers of programs obfuscated with Tigress [11] and VMProtect [35]. We now detail (1) the obfuscations used in Section 7.1; (2) the scripts used to generate the Tigress use cases from Section 7.2.

A.4.1 Effectiveness against usual protections. Section 7.1 shows that Xyntia is able to bypass usual protections. All tested obfuscations, except path-based obfuscation, were performed through Tigress [11]. Table 30 presents the Tigress commands used to generate obfuscated expressions. Conversely, the evaluation of path-based obfuscation relies on a custom encoding inspired by [26]. We present it now.

Path-based obfuscation [26, 36] takes advantage of the path explosion problem to thwart symbolic execution. While it is efficient against symbolic analysis, what about blackbox analysis? The example in Listing 2 is inspired by the For primitive from [26]. It computes the sum of x and y, adding loops to increase the number of paths to explore (one path for each value of x and y), effectively killing symbolic execution. However, blackbox deobfuscation sees only input-output behaviors and would successfully synthesize the expression. To confirm it, we encoded B2 as in Listing 2. Table 4 shows the absence of impact.

    int sum(int x, int y) {
        int x1 = 0, y1 = 0;           /* both counters start at zero */
        for (int i = 0; i < x; i++) {
            x1++;                     /* one loop iteration per unit of x */
        }
        for (int i = 0; i < y; i++) {
            y1++;                     /* one loop iteration per unit of y */
        }
        return x1 + y1;
    }

Listing 2: Sum function with path-oriented obfuscation
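The claim that a blackbox attacker cannot distinguish the obfuscated function from a plain sum can be checked on a small model. The Python transliteration of Listing 2 below, the helper names and the restriction to small non-negative inputs (so that the loops stay short) are assumptions of this sketch.

```python
import random


def sum_obfuscated(x, y):
    """Python transliteration of Listing 2: one loop iteration per input unit."""
    x1 = y1 = 0
    for _ in range(x):
        x1 += 1
    for _ in range(y):
        y1 += 1
    return x1 + y1


def sum_plain(x, y):
    """The expression a synthesizer is expected to recover."""
    return x + y


def sample_io(func, inputs):
    """Blackbox view of a function: only (inputs, output) pairs are observed."""
    return [(args, func(*args)) for args in inputs]


rng = random.Random(0)
inputs = [(rng.randrange(1000), rng.randrange(1000)) for _ in range(50)]
same_view = sample_io(sum_obfuscated, inputs) == sample_io(sum_plain, inputs)
```

Since the two sample sets coincide, any synthesizer driven solely by these samples receives exactly the same specification for both functions, whatever the number of paths in the code.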

A.4.2 Virtualization based Deobfuscation. Section 7 shows that Xyntia is able to synthesize the VM handlers of software protected with Tigress [11] and VMProtect [35]. Table 31 presents the Tigress commands used to generate the 2 Tigress use cases (note that the "heavy_computing" function contains all mixed Boolean-arithmetic expressions and is the one we want to obfuscate).

A.5 Counter AI-based deobfuscation : Moredetails

Protections against blackbox deobfuscation methods are discussed extensively in Section 8. We complement that discussion with (1) a description of the datasets used to evaluate the efficiency of the proposed methods and (2) the results of Syntia, CVC4 and STOKE against the proposed protections.

A.5.1 Semantically complex handlers.

Section 8.2 presents an encoding that translates a set of semantically simple handlers into complex ones. The proposed solution enables the creation of handlers as complex as wanted in terms of size and number of arguments. To evaluate the efficiency of the approach, we created 3 datasets, namely BP1 (Table 34), BP2 (Table 35) and BP3 (too large to be presented here). The results of Xyntia, Syntia, CVC4 and STOKE-synth are presented in Table 32 (STOKE-opti is not considered as it is not blackbox). They confirm that the protection is highly effective.

Table 30: Tigress commands for obfuscation in Section 7.1

| Obfuscation | Command |
|---|---|
| MBA | `tigress --Environment=x86_64:Linux:Gcc:4.6 --Transform=EncodeArithmetic --Functions=fun0 --Transform=EncodeArithmetic --Functions=fun0 --out=out.c fun0.c` |
| Opaque predicate | `tigress --Environment=x86_64:Linux:Gcc:4.6 --Seed=0 --Inputs="+1:int:42,-1:length:1?10" --Transform=InitEntropy --Transform=AddOpaque --Functions=fun0 --AddOpaqueKinds=question --AddOpaqueSplitKinds=inside --AddOpaqueCount=10 fun0.c --out=out.c` |
| Covert channel | `tigress --Seed=0 --Verbosity=1 --Environment=x86_64:Linux:Gcc:4.6 -pthread --Transform=InitEntropy --Functions=fun0 --Transform=InitImplicitFlow --Functions=main --InitImplicitFlowKinds=trivial_thread --InitImplicitFlowHandlerCount=1 --InitImplicitFlowJitCount=1 --InitImplicitFlowJitFunctionBody="(for (if (bb 50) (bb 50)))" --InitImplicitFlowTrace=false --InitImplicitFlowTrain=false --InitImplicitFlowTime=true --InitImplicitFlowTrainingTimesClock=500 --InitImplicitFlowTrainingTimesThread=500 fun0.c --out=out.c` |

Table 31: Tigress commands for the 2 use cases in Section 7.2

| Use case | Command |
|---|---|
| Tigress (simple) | `tigress --Environment=x86_64:Linux:Gcc:4.6 --Transform=Virtualize --Functions=heavy_computing --VirtualizeDispatch=direct --out=out.c main.c` |
| Tigress (hard) | `tigress --Environment=x86_64:Linux:Gcc:4.6 --Transform=EncodeData --LocalVariables='heavy_computing:v0,v1,v2,v3,v4,v5' --EncodeDataCodecs=poly1 --Transform=Virtualize --Functions=heavy_computing --VirtualizeDispatch=direct --Transform=EncodeArithmetic --Functions=heavy_computing --Transform=EncodeArithmetic --Functions=heavy_computing --out=out.c main.c` |

Table 32: Found expressions on BP1,2,3 (timeout = 1 h)

| Dataset | Xyntia | Syntia | CVC4-Expr | STOKE-synth |
|---|---|---|---|---|
| BP1 | 13 | 0 | 0 | 0 |
| BP2 | 3 | 0 | 0 | 0 |
| BP3 | 1 | 0 | 0 | 0 |

A.5.2 Merged handlers.

In Section 8.3, we measure the impact of conditionals on blackbox methods. We thus introduce 5 datasets containing 20 expressions each. The first one combines two basic handlers with one ITE, the second one combines 3 basic handlers through 2 nested conditionals, and so on. The first and second datasets are given in Fig. 13. Each condition compares the third input (𝑧) with a constant. The constant values are always sorted in increasing order starting from zero, i.e., the first ITE compares 𝑧 to 0, the second to 1, etc.

The results of Xyntia against merged handlers are presented in Section 8.3. We now present the evaluation of Syntia, CVC4 and STOKE-synth (STOKE-opti is not tested as it is not blackbox) on the same datasets. Because we cannot add ITEs to the grammar of Syntia or STOKE-synth, we evaluate them in their default configuration. Fig. 12 shows that in the utopian configuration CVC4 efficiently synthesizes all expressions with one conditional. However, in any configuration, it is not able to synthesize expressions with nested conditionals. On the other hand, Table 33 shows that neither Syntia nor STOKE-synth is able to handle merged handlers. These results confirm that merging handlers is an efficient protection, impeding sampling and synthesis.

[Figure 12: CVC4 against merged handlers (timeout = 60 s). Axis labels: "Equivalent" and "ITE depth". Series: CVC4 Utopian, CVC4 MBA+ITE, CVC4 MBA+Shifts, CVC4-Expr.]
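The shape of the first two merged-handler datasets can be sketched as follows. The basic handlers chosen here (`x + y`, `x ^ y`, `x & y`), the function names, and the use of equality tests for the conditions are all assumptions made for illustration; the actual expressions are those of Fig. 13. Under these assumptions, the helper `count_guarded_hits` suggests one plausible reason why sampling is impeded: a uniformly random 32-bit value of `z` rarely equals 0 or 1, so most observed outputs come from the final else branch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Depth 1: one ITE merging two hypothetical basic handlers. */
uint32_t merged_depth1(uint32_t x, uint32_t y, uint32_t z) {
    return (z == 0u) ? (x + y) : (x ^ y);
}

/* Depth 2: two nested ITEs merging three hypothetical basic handlers.
 * Constants increase from zero: the first ITE tests 0, the second 1. */
uint32_t merged_depth2(uint32_t x, uint32_t y, uint32_t z) {
    return (z == 0u) ? (x + y)
         : (z == 1u) ? (x ^ y)
         :             (x & y);
}

/* Counts how many sampled z values select a guarded (non-default)
 * branch of merged_depth2, i.e. how many samples equal 0 or 1. */
size_t count_guarded_hits(const uint32_t *zs, size_t n) {
    assert(zs != NULL);
    size_t hits = 0;
    for (size_t k = 0; k < n; k++) {
        if (zs[k] == 0u || zs[k] == 1u) {
            hits++;
        }
    }
    return hits;
}
```

A sampler that never draws `z == 0` or `z == 1` observes only the behavior of `x & y` for `merged_depth2`, so any expression synthesized from those samples would miss the two guarded handlers.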