Inductive Program Synthesis Over Noisy Data
Shivam Handa
Electrical Engineering and Computer Science, Massachusetts Institute of Technology
[email protected]
Martin C. Rinard
Electrical Engineering and Computer Science, Massachusetts Institute of Technology
[email protected]
ABSTRACT
We present a new framework and associated synthesis algorithms for program synthesis over noisy data, i.e., data that may contain incorrect/corrupted input-output examples. This framework is based on an extension of finite tree automata called state-weighted finite tree automata. We show how to apply this framework to formulate and solve a variety of program synthesis problems over noisy data. Results from our implemented system running on problems from the SyGuS 2018 benchmark suite highlight its ability to successfully synthesize programs in the face of noisy data sets, including the ability to synthesize a correct program even when every input-output example in the data set is corrupted.
CCS CONCEPTS
• Theory of computation → Formal languages and automata theory;
• Software and its engineering → Programming by example;
• Computing methodologies → Machine learning.

KEYWORDS
Program Synthesis, Noisy Data, Corrupted Data, Machine Learning
ACM Reference Format:
Shivam Handa and Martin C. Rinard. 2020. Inductive Program Synthesis over Noisy Data. In Proceedings of the 28th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE '20), November 8–13, 2020, Virtual Event, USA. ACM, New York, NY, USA, 13 pages. https://doi.org/10.1145/3368089.3409732
1 INTRODUCTION

In recent years there has been significant interest in learning programs from input-output examples. These techniques have been successfully used to synthesize programs for domains such as string and format transformations [10, 18], data wrangling [7], data completion [20], and data structure manipulation [8, 12, 21]. Even though these efforts have been largely successful, they do not aspire to work with noisy data sets that may contain corrupted input-output examples. We present a new program synthesis technique that is designed to work with noisy/corrupted data sets. Given:
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
ESEC/FSE '20, November 8–13, 2020, Virtual Event, USA
© 2020 Copyright held by the owner/author(s).
ACM ISBN 978-1-4503-7043-1/20/11.
https://doi.org/10.1145/3368089.3409732

• Programs: A collection of programs 𝑝 defined by a grammar 𝐺 and a bounded scope threshold 𝑑,
• Data Set: A data set D = {(𝜎₁, 𝑜₁), . . . , (𝜎ₙ, 𝑜ₙ)} of input-output examples,
• Loss Function: A loss function L(𝑝, D) that measures the cost of the input-output examples on which 𝑝 produces a different output than the output in the data set D,
• Complexity Measure: A complexity measure 𝐶(𝑝) that measures the complexity of a given program 𝑝,
• Objective Function: An arbitrary objective function 𝑈(𝑙, 𝑐), which maps loss 𝑙 and complexity 𝑐 to a totally ordered set, such that for all values of 𝑙, 𝑈(𝑙, 𝑐) is monotonically nondecreasing with respect to 𝑐,

our program synthesis technique produces a program 𝑝 that minimizes 𝑈(L(𝑝, D), 𝐶(𝑝)). Example problems that our program synthesis technique can solve include:

• Best Fit Program Synthesis: Given a potentially noisy data set D, find a best fit program 𝑝 for D, i.e., a 𝑝 that minimizes L(𝑝, D).
• Accuracy vs. Complexity Tradeoff: Given a data set D, find 𝑝 that minimizes the weighted sum L(𝑝, D) + 𝜆 · 𝐶(𝑝). This problem enables the programmer to define and minimize a weighted tradeoff between the complexity of the program and the loss.
• Data Cleaning and Correction: Given a data set D, find 𝑝 that minimizes the loss L(𝑝, D). Input-output examples with nonzero loss are identified as corrupted and either 1) filtered out or 2) replaced with the output from the synthesized program.
• Bayesian Program Synthesis: Given a data set D and a probability distribution 𝜋(𝑝) over programs 𝑝, find the most probable program 𝑝 given D.
• Best Program for Given Accuracy: Given a data set D and a bound 𝑏, find 𝑝 that minimizes 𝐶(𝑝) subject to L(𝑝, D) ≤ 𝑏. One use case finds the simplest program that agrees with the data set D on at least 𝑛 − 𝑏 input-output examples.
• Forced Accuracy: Given data sets D′, D, where D′ ⊆ D, find 𝑝 that minimizes the weighted sum L(𝑝, D) + 𝜆 · 𝐶(𝑝) subject to L(𝑝, D′) ≤ 𝑏. One use case finds a program 𝑝 which minimizes the loss over the data set D but is always correct for D′.
• Approximate Program Synthesis: Given a clean (noise-free) data set D, find the least complex program 𝑝 that minimizes the loss L(𝑝, D). Here the goal is not to work with a noisy data set, but instead to find the best approximate solution to a synthesis problem when an exact solution does not exist within the collection of considered programs 𝑝.
We work with noise models that assume a (hidden) clean data set combined with a noise source that delivers the noisy data set presented to the program synthesis system. Like many inductive program synthesis systems [10, 18], one target is discrete problems that involve discrete data such as strings, data structures, or tabular data. In contrast to traditional machine learning problems, which often involve continuous noise sources [4], the noise sources for discrete problems often involve discrete noise: noise that involves a discrete change that affects only part of each output, leaving the remaining parts intact and uncorrupted.
Different loss functions can be appropriate for different noise sources and use cases. The 0/1 loss function, which counts the number of input-output examples on which the data set D and the synthesized program 𝑝 do not agree, is a general loss function that can be appropriate when the focus is to maximize the number of inputs for which the synthesized program 𝑝 produces the correct output. The Damerau-Levenshtein (DL) loss function [5], which measures the edit difference under character insertions, deletions, substitutions, and/or transpositions, extracts information present in partially corrupted outputs and can be appropriate for measuring discrete noise in input-output examples involving text strings. The 0/∞ loss function, which is ∞ unless 𝑝 agrees with all of the input-output examples in the data set D, specializes our technique to the standard program synthesis scenario that requires the synthesized program to agree with all input-output examples.

Because discrete noise sources often leave parts of corrupted outputs intact, exact program synthesis (i.e., synthesizing a program that agrees with all outputs in the hidden clean data set) is often possible even when all outputs in the data set are corrupted. Our experimental results (Section 9) indicate that matching the loss function to the characteristics of the discrete noise source can enable very accurate program synthesis even when 1) there are only a handful of input-output examples in the data set and 2) all of the outputs in the data set are corrupted. We attribute this success to the ability of our synthesis technique, working in conjunction with an appropriately designed loss function, to effectively extract information from outputs corrupted by discrete noise sources.

Our technique augments finite tree automata (FTA) to associate accepting states with weights that capture the loss for the output associated with each accepting state. Given a data set D, the resulting state-weighted finite tree automata (SFTA) partition the programs 𝑝 defined by the grammar 𝐺 into equivalence classes.
Each equivalence class consists of all programs with identical input-output behavior on the inputs in D. All programs in a given equivalence class therefore have the same loss over D. The technique then uses dynamic programming to find the minimum complexity program 𝑝 in each equivalence class [9]. From this set of minimum complexity programs, the technique then finds the program 𝑝 that minimizes the objective function 𝑈(L(𝑝, D), 𝐶(𝑝)).

We have implemented our technique and applied it to various programs in the SyGuS 2018 benchmark set [1]. The results indicate that the technique is effective at solving program synthesis problems over strings with modestly sized solutions even in the presence of substantial noise. For discrete noise sources and a loss function that is a good match for the noise source, the technique is typically able to extract enough information left intact in corrupted outputs to synthesize a correct program even when all outputs are corrupted (in this paper we consider a synthesized program to be correct if it agrees with all input-output examples in the original hidden clean data set). Even with the 0/1 loss function, which does not aspire to extract any information from corrupted outputs, the technique is typically able to synthesize a correct program even with only a few correct (uncorrupted) input-output examples in the data set. Overall the results highlight the potential for effective program synthesis even in the presence of substantial noise.
This paper makes the following contributions:

• Technique: It presents an implemented technique for inductive program synthesis over noisy data. The technique uses an extension of finite tree automata, state-weighted finite tree automata, to synthesize programs that minimize an objective function involving the loss over the input data set and the complexity of the synthesized program.
• Use Cases: It presents multiple use cases including best fit program synthesis for noisy data sets, navigating accuracy vs. complexity tradeoffs, Bayesian program synthesis, identifying and correcting corrupted data, and approximate program synthesis.
• Experimental Results: It presents experimental results from our implemented system on the SyGuS 2018 benchmark set. These results characterize the scalability of the technique and highlight interactions between the DSL, the noise source, the loss function, and the overall effectiveness of the synthesis technique. In particular, they highlight the ability of the technique to, given a close match between the noise source and the loss function, synthesize a correct program 𝑝 even when 1) there are only a handful of input-output examples in the data set D and 2) all outputs are corrupted.

We next review finite tree automata (FTA) and FTA-based inductive program synthesis.
Finite tree automata are a type of state machine that accepts trees rather than strings. They generalize standard finite automata to describe a regular language over trees.
Definition 1 (FTA). A (bottom-up) finite tree automaton (FTA) over alphabet 𝐹 is a tuple A = (𝑄, 𝐹, 𝑄_𝑓, Δ) where 𝑄 is a set of states, 𝑄_𝑓 ⊆ 𝑄 is the set of accepting states, and Δ is a set of transitions of the form 𝑓(𝑞₁, . . . , 𝑞ₖ) → 𝑞, where 𝑞, 𝑞₁, . . . , 𝑞ₖ are states and 𝑓 ∈ 𝐹.

Every symbol 𝑓 in alphabet 𝐹 has an associated arity. The set 𝐹ₖ ⊆ 𝐹 is the set of all 𝑘-arity symbols in 𝐹. 0-arity terms 𝑡 in 𝐹 are viewed as single node trees (leaves of trees). A term 𝑡 is accepted by an FTA if we can rewrite 𝑡 to some state 𝑞 ∈ 𝑄_𝑓 using the rules in Δ. The language of an FTA A, denoted by L(A), corresponds to the set of all ground terms accepted by A.

Figure 1: Tree for the formula and(True, not(False))

J𝑐K𝜎 ⇒ 𝑐 (Constant)
J𝑥K𝜎 ⇒ 𝜎(𝑥) (Variable)
If J𝑛₁K𝜎 ⇒ 𝑣₁, . . . , J𝑛ₖK𝜎 ⇒ 𝑣ₖ, then J𝑓(𝑛₁, . . . , 𝑛ₖ)K𝜎 ⇒ 𝑓(𝑣₁, . . . , 𝑣ₖ) (Function)

Figure 2: Execution semantics for program 𝑝

Example 1.
Consider the tree automaton A defined by states 𝑄 = {𝑞_𝑇, 𝑞_𝐹}, 𝐹₀ = {True, False}, 𝐹₁ = {not}, 𝐹₂ = {and, or}, final states 𝑄_𝑓 = {𝑞_𝑇}, and the following transition rules Δ:

True → 𝑞_𝑇    False → 𝑞_𝐹
not(𝑞_𝑇) → 𝑞_𝐹    not(𝑞_𝐹) → 𝑞_𝑇
and(𝑞_𝑇, 𝑞_𝑇) → 𝑞_𝑇    and(𝑞_𝐹, 𝑞_𝑇) → 𝑞_𝐹    and(𝑞_𝑇, 𝑞_𝐹) → 𝑞_𝐹    and(𝑞_𝐹, 𝑞_𝐹) → 𝑞_𝐹
or(𝑞_𝑇, 𝑞_𝑇) → 𝑞_𝑇    or(𝑞_𝐹, 𝑞_𝑇) → 𝑞_𝑇    or(𝑞_𝑇, 𝑞_𝐹) → 𝑞_𝑇    or(𝑞_𝐹, 𝑞_𝐹) → 𝑞_𝐹

The above tree automaton accepts all propositional logic formulas over True and False which evaluate to True. Figure 1 presents the tree for the formula and(True, not(False)).

We next define the programs we consider, how inputs to the program are specified, and the program semantics. Without loss of generality, we assume programs 𝑝 are specified as parse trees in a domain-specific language (DSL) grammar 𝐺. Internal nodes represent function invocations; leaves are constants/0-arity symbols in the DSL. A program 𝑝 executes on an input 𝜎. J𝑝K𝜎 denotes the output of 𝑝 on input 𝜎 (J.K is defined in Figure 2). All valid programs (which can be executed) are defined by a DSL grammar 𝐺 = (𝑇, 𝑁, 𝑃, 𝑠) where:

• 𝑇 is a set of terminal symbols. These may include constants and symbols which may change value depending on the input 𝜎.
• 𝑁 is the set of nonterminals that represent subexpressions in our DSL.
• 𝑃 is the set of production rules of the form 𝑠 → 𝑓(𝑠₁, . . . , 𝑠ₙ), where 𝑓 is a built-in function in the DSL and 𝑠, 𝑠₁, . . . , 𝑠ₙ are nonterminals in the grammar.
• 𝑠 ∈ 𝑁 is the start nonterminal in the grammar.

Term: if 𝑡 ∈ 𝑇 and J𝑡K𝜎 = 𝑐, then 𝑞^𝑐_𝑡 ∈ 𝑄.
Final: if 𝑞^𝑜_𝑠 ∈ 𝑄, then 𝑞^𝑜_𝑠 ∈ 𝑄_𝑓.
Prod: if 𝑠 → 𝑓(𝑠₁, . . . , 𝑠ₖ) ∈ 𝑃, {𝑞^𝑐₁_𝑠₁, . . . , 𝑞^𝑐ₖ_𝑠ₖ} ⊆ 𝑄, and J𝑓(𝑐₁, . . . , 𝑐ₖ)K𝜎 = 𝑐, then 𝑞^𝑐_𝑠 ∈ 𝑄 and 𝑓(𝑞^𝑐₁_𝑠₁, . . . , 𝑞^𝑐ₖ_𝑠ₖ) → 𝑞^𝑐_𝑠 ∈ Δ.

Figure 3: Rules for constructing a CFTA A = (𝑄, 𝐹, 𝑄_𝑓, Δ) given input 𝜎, output 𝑜, and grammar 𝐺 = (𝑇, 𝑁, 𝑃, 𝑠)

We assume that we are given a black box implementation of each built-in function 𝑓 in the DSL. In general, all techniques explored within this paper can be generalized to any DSL which can be specified within the above framework.

Example 2.
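The acceptance condition of Definition 1 can be checked bottom-up: rewrite each subtree to a state, then test whether the root's state is accepting. A minimal sketch for the automaton of Example 1 follows; the tuple encoding of trees and the state names `qT`/`qF` are our own illustrative choices.

```python
# Bottom-up FTA evaluator for the boolean automaton of Example 1.
# Trees are nested tuples ("and", l, r), ("not", t), or leaf strings
# "True"/"False". DELTA maps (symbol, child states...) -> state.
DELTA = {
    ("True",): "qT", ("False",): "qF",
    ("not", "qT"): "qF", ("not", "qF"): "qT",
    ("and", "qT", "qT"): "qT", ("and", "qT", "qF"): "qF",
    ("and", "qF", "qT"): "qF", ("and", "qF", "qF"): "qF",
    ("or",  "qT", "qT"): "qT", ("or",  "qT", "qF"): "qT",
    ("or",  "qF", "qT"): "qT", ("or",  "qF", "qF"): "qF",
}
FINAL = {"qT"}   # Q_f = {q_T}

def run(tree):
    """Rewrite a tree bottom-up to a state using the transition rules."""
    if isinstance(tree, str):                       # 0-arity symbol (leaf)
        return DELTA[(tree,)]
    f, *children = tree
    return DELTA[(f, *(run(c) for c in children))]

def accepts(tree):
    return run(tree) in FINAL
```

As in Figure 1, the formula and(True, not(False)) rewrites to 𝑞_𝑇 and is accepted.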
The following DSL defines expressions over input x, constants 2 and 3, and addition and multiplication:

𝑛 := 𝑥 | 𝑛 + 𝑡 | 𝑛 × 𝑡 ;    𝑡 := 2 | 3

We review the approach introduced by [19, 20] to use finite tree automata to solve synthesis tasks over a broad class of DSLs. Given a DSL and a set of input-output examples, a Concrete Finite Tree Automaton (CFTA) is a tree automaton which accepts all trees representing DSL programs consistent with the input-output examples and nothing else. The states of the FTA correspond to concrete values and the transitions are obtained using the semantics of the DSL constructs.

Given an input-output example (𝜎, 𝑜) and DSL (𝐺, J.K), construct a CFTA using the rules in Figure 3. The alphabet of the CFTA contains the built-in functions within the DSL. The states in the CFTA are of the form 𝑞^𝑐_𝑠, where 𝑠 is a symbol (terminal or nonterminal) in 𝐺 and 𝑐 is a concrete value. The existence of state 𝑞^𝑐_𝑠 implies that there exists a partial program which can map 𝜎 to concrete value 𝑐. Similarly, the existence of transition 𝑓(𝑞^𝑐₁_𝑠₁, . . . , 𝑞^𝑐ₖ_𝑠ₖ) → 𝑞^𝑐_𝑠 implies 𝑓(𝑐₁, . . . , 𝑐ₖ) = 𝑐.

The Term rule states that if we have a terminal 𝑡 (either a constant in our language or input symbol 𝑥), we execute it with the input 𝜎 and construct a state 𝑞^𝑐_𝑡 (where 𝑐 = J𝑡K𝜎). The Final rule states that, given start symbol 𝑠 and expected output 𝑜, if 𝑞^𝑜_𝑠 exists, then it is an accepting state. The Prod rule states that, if we have a production rule 𝑠 → 𝑓(𝑠₁, . . . , 𝑠ₖ) ∈ 𝑃, and there exist states 𝑞^𝑐₁_𝑠₁, . . . , 𝑞^𝑐ₖ_𝑠ₖ ∈ 𝑄, then we also have state 𝑞^𝑐_𝑠 in the CFTA and a transition 𝑓(𝑞^𝑐₁_𝑠₁, . . . , 𝑞^𝑐ₖ_𝑠ₖ) → 𝑞^𝑐_𝑠.

The language of the CFTA constructed from Figure 3 is exactly the set of parse trees of DSL programs that are consistent with our input-output example (i.e., map input 𝜎 to output 𝑜). In general, the rules in Figure 3 may result in a CFTA which has infinitely many states. To control the size of the resulting CFTA, we do not add a new state to the constructed CFTA if the smallest tree it accepts is larger than a given threshold 𝑑. This results in a CFTA which accepts all programs which are consistent with the input-output example and are smaller than the given threshold (it may accept some programs which are larger than the given threshold, but it will never accept a program which is inconsistent with the input-output example). This is standard practice in the synthesis literature [13, 19].
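The bounded construction can be sketched for the Example 2 DSL. This is an assumption-laden simplification of Figure 3: the function name is ours, states are bare (symbol, value) pairs, and a fixed number of iterations of the Prod rule stands in for the bounded scope threshold 𝑑.

```python
# Sketch of bounded CFTA construction (Figure 3) for the DSL
#   n := x | n + t | n * t ;  t := 2 | 3
# States are (symbol, value) pairs; `depth` rounds of the Prod rule
# keep the automaton finite, mirroring the threshold d.
def build_cfta(sigma, depth):
    states = {("t", 2), ("t", 3), ("n", sigma)}   # Term rule: constants + input x
    trans = set()
    for _ in range(depth):
        new = set()
        for (_, vn) in [s for s in states if s[0] == "n"]:
            for (_, vt) in [s for s in states if s[0] == "t"]:
                for op, val in (("+", vn + vt), ("*", vn * vt)):
                    new.add(("n", val))           # Prod rule: new state ...
                    trans.add((op, ("n", vn), ("t", vt), ("n", val)))  # ... and transition
        states |= new
    return states, trans

states, transitions = build_cfta(sigma=1, depth=2)
# With input x = 1, the value 9 is reachable, e.g. via (x + 2) * 3, so the
# Final rule would mark the state ("n", 9) accepting for output o = 9.
```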
Given two CFTAs A₁ and A₂ built over the same grammar 𝐺 from input-output examples (𝜎₁, 𝑜₁) and (𝜎₂, 𝑜₂) respectively, the intersection of these two automata contains the programs which satisfy both input-output examples (or has the empty language). Given CFTAs A = (𝑄, 𝐹, 𝑄_𝑓, Δ) and A′ = (𝑄′, 𝐹, 𝑄′_𝑓, Δ′), the intersection A* = (𝑄*, 𝐹, 𝑄*_𝑓, Δ*) of A and A′ is given by the smallest sets 𝑄*, 𝑄*_𝑓, and Δ* such that:

• if 𝑞^{c̄₁}_𝑠 ∈ 𝑄 and 𝑞^{c̄₂}_𝑠 ∈ 𝑄′ then 𝑞^{c̄₁:c̄₂}_𝑠 ∈ 𝑄*,
• if 𝑞^{c̄₁}_𝑠 ∈ 𝑄_𝑓 and 𝑞^{c̄₂}_𝑠 ∈ 𝑄′_𝑓 then 𝑞^{c̄₁:c̄₂}_𝑠 ∈ 𝑄*_𝑓,
• if 𝑓(𝑞^{c̄₁}_𝑠₁, . . . , 𝑞^{c̄ₖ}_𝑠ₖ) → 𝑞^{c̄}_𝑠 ∈ Δ and 𝑓(𝑞^{c̄′₁}_𝑠₁, . . . , 𝑞^{c̄′ₖ}_𝑠ₖ) → 𝑞^{c̄′}_𝑠 ∈ Δ′ then 𝑓(𝑞^{c̄₁:c̄′₁}_𝑠₁, . . . , 𝑞^{c̄ₖ:c̄′ₖ}_𝑠ₖ) → 𝑞^{c̄:c̄′}_𝑠 ∈ Δ*,

where c̄ denotes a vector of values and c̄₁:c̄₂ denotes the vector constructed by appending vector c̄₂ to the end of vector c̄₁.

Given a data set D = {(𝜎₁, 𝑜₁), . . . , (𝜎ₙ, 𝑜ₙ)} and a program 𝑝, a Loss Function L(𝑝, D) calculates how incorrect the program is with respect to the given data set. We work with loss functions L(𝑝, D) that depend only on the data set and the outputs of the program for the inputs in the data set, i.e., given programs 𝑝₁, 𝑝₂ such that for all (𝜎ᵢ, 𝑜ᵢ) ∈ D, J𝑝₁K𝜎ᵢ = J𝑝₂K𝜎ᵢ, we have L(𝑝₁, D) = L(𝑝₂, D). We further assume that the loss function L(𝑝, D) can be expressed in the following form:

L(𝑝, D) = Σᵢ₌₁ⁿ 𝐿(𝑜ᵢ, J𝑝K𝜎ᵢ)

where 𝐿(𝑜ᵢ, J𝑝K𝜎ᵢ) is a per-example loss function.

Definition 2. 0/1 Loss Function:
The 0/1 loss function L₀/₁(𝑝, D) counts the number of input-output examples on which 𝑝 does not agree with the data set D:

L₀/₁(𝑝, D) = Σᵢ₌₁ⁿ (1 if 𝑜ᵢ ≠ J𝑝K𝜎ᵢ else 0)

Definition 3. 0/∞ Loss Function:
The 0/∞ loss function L₀/∞(𝑝, D) is 0 if 𝑝 matches all outputs in the data set D and ∞ otherwise:

L₀/∞(𝑝, D) = (0 if ∀(𝜎, 𝑜) ∈ D. 𝑜 = J𝑝K𝜎 else ∞)

Definition 4.
Damerau-Levenshtein (DL) Loss Function:
The DL loss function L_DL(𝑝, D) uses the Damerau-Levenshtein metric [5] to measure the distance between the output from the synthesized program and the corresponding output in the noisy data set:

L_DL(𝑝, D) = Σ_{(𝜎ᵢ, 𝑜ᵢ) ∈ D} 𝐿_{J𝑝K𝜎ᵢ, 𝑜ᵢ}(|J𝑝K𝜎ᵢ|, |𝑜ᵢ|)

where 𝐿_{𝑎,𝑏}(𝑖, 𝑗) is the Damerau-Levenshtein metric [5]. This metric counts the number of single character deletions, insertions, substitutions, or transpositions required to convert one text string into another. Because more than 80% of all human misspellings are reported to be captured by a single one of these four operations [5], the DL loss function may be appropriate for computations that work with human-provided text input-output examples.
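The per-example DL loss can be computed with the standard dynamic program for the restricted (optimal string alignment) variant of the Damerau-Levenshtein distance. The sketch below is our illustration, assuming the program's outputs have already been computed for each input:

```python
# Restricted Damerau-Levenshtein distance: insertions, deletions,
# substitutions, and adjacent transpositions each cost 1.
def dl_distance(a, b):
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[m][n]

def dl_loss(program_outputs, noisy_outputs):
    """L_DL as a sum of per-example distances over precomputed outputs."""
    return sum(dl_distance(o, t) for o, t in zip(program_outputs, noisy_outputs))
```

For instance, a single adjacent transposition ("abcd" vs. "abdc") costs 1, which is why this loss recovers most information from single-keystroke corruption of text outputs.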
Given a program 𝑝, a Complexity Measure 𝐶(𝑝) ranks programs independent of the input-output examples in the data set D. This measure is used to trade off performance on the noisy data set vs. complexity of the synthesized program. Formally, a complexity measure is a function 𝐶(𝑝) that maps each program 𝑝 expressible in the given DSL 𝐺 to a real number. The following Cost(𝑝) complexity measure computes the complexity of a given program 𝑝 represented as a parse tree recursively as follows:

Cost(𝑡) = cost(𝑡)
Cost(𝑓(𝑒₁, 𝑒₂, . . . , 𝑒ₖ)) = cost(𝑓) + Σᵢ₌₁ᵏ Cost(𝑒ᵢ)

where 𝑡 and 𝑓 are terminals and built-in functions in our DSL respectively. Setting cost(𝑡) = cost(𝑓) = 1 yields the complexity measure Size(𝑝) that computes the size of 𝑝.

Given an FTA A, we can use dynamic programming to find the minimum complexity parse tree (under the above Cost(𝑝) measure) accepted by A [9]. In general, given an FTA A, we assume we are provided with a method to find the program 𝑝 accepted by A which minimizes the complexity measure.

Given loss 𝑙 and complexity 𝑐, an Objective Function 𝑈(𝑙, 𝑐) maps 𝑙, 𝑐 to a totally ordered set such that for all 𝑙, 𝑈(𝑙, 𝑐) is monotonically nondecreasing with respect to 𝑐.

Definition 5.
Tradeoff Objective Function:
Given a tradeoff parameter 𝜆 > 0, the tradeoff objective function is 𝑈_𝜆(𝑙, 𝑐) = 𝑙 + 𝜆𝑐.

This objective function trades the loss of the synthesized program off against the complexity of the synthesized program. Similarly to how regularization can prevent a machine learning model from overfitting noisy data by biasing the training algorithm to pick a simpler model, the tradeoff objective function may prevent the algorithm from synthesizing a program which overfits the data by biasing it to pick a simpler program (based on the complexity measure).
Definition 6.
Lexicographic Objective Function:
A lexicographic objective function 𝑈_L(𝑙, 𝑐) = ⟨𝑙, 𝑐⟩ maps 𝑙 and 𝑐 into a lexicographically ordered space, i.e., ⟨𝑙₁, 𝑐₁⟩ < ⟨𝑙₂, 𝑐₂⟩ if and only if either 𝑙₁ < 𝑙₂, or 𝑙₁ = 𝑙₂ and 𝑐₁ < 𝑐₂.

This objective function first minimizes the loss, then the complexity. It may be appropriate, for example, for best fit program synthesis, data cleaning and correction, and approximate program synthesis over clean data sets.

Term: if 𝑡 ∈ 𝑇 and J𝑡K𝜎 = 𝑐, then 𝑞^𝑐_𝑡 ∈ 𝑄.
Final: if 𝑞^𝑐_𝑠 ∈ 𝑄, then 𝑞^𝑐_𝑠 ∈ 𝑄_𝑓 and 𝑤(𝑞^𝑐_𝑠) = 𝐿(𝑜, 𝑐).
Prod: if 𝑠 → 𝑓(𝑠₁, . . . , 𝑠ₖ) ∈ 𝑃, {𝑞^𝑐₁_𝑠₁, . . . , 𝑞^𝑐ₖ_𝑠ₖ} ⊆ 𝑄, and J𝑓(𝑐₁, . . . , 𝑐ₖ)K𝜎 = 𝑐, then 𝑞^𝑐_𝑠 ∈ 𝑄 and 𝑓(𝑞^𝑐₁_𝑠₁, . . . , 𝑞^𝑐ₖ_𝑠ₖ) → 𝑞^𝑐_𝑠 ∈ Δ.

Figure 4: Rules for constructing a SFTA A = (𝑄, 𝐹, 𝑄_𝑓, Δ, 𝑤) given input 𝜎, per-example loss function 𝐿, and grammar 𝐺 = (𝑇, 𝑁, 𝑃, 𝑠)

State-weighted finite tree automata (SFTA) are FTA augmented with a weight function that attaches a weight to all accepting states.
Definition 7 (SFTA). A state-weighted finite tree automaton (SFTA) over alphabet 𝐹 is a tuple A = (𝑄, 𝐹, 𝑄_𝑓, Δ, 𝑤) where 𝑄 is a set of states, 𝑄_𝑓 ⊆ 𝑄 is the set of accepting states, Δ is a set of transitions of the form 𝑓(𝑞₁, . . . , 𝑞ₖ) → 𝑞 where 𝑞, 𝑞₁, . . . , 𝑞ₖ are states and 𝑓 ∈ 𝐹, and 𝑤 : 𝑄_𝑓 → R is a function which assigns a weight 𝑤(𝑞) (from domain 𝑊) to each accepting state 𝑞 ∈ 𝑄_𝑓.

Because CFTAs are designed to handle synthesis over clean (noise-free) data sets, they have only one accepting state 𝑞^𝑜_𝑠 (the state with start symbol 𝑠 and output value 𝑜). SFTAs weaken this condition to allow multiple accepting states with attached weights. Given an input-output example (𝜎, 𝑜) and per-example loss function 𝐿(𝑜, 𝑐), Figure 4 presents rules for constructing a SFTA that, given a program 𝑝, returns the loss for 𝑝 on example (𝜎, 𝑜). The SFTA Final rule (Figure 4) marks all states 𝑞^𝑐_𝑠 with start symbol 𝑠 as accepting states regardless of the concrete value 𝑐 attached to the state. The rule also associates the loss 𝐿(𝑜, 𝑐) for concrete value 𝑐 and output 𝑜 with state 𝑞^𝑐_𝑠 as the weight 𝑤(𝑞^𝑐_𝑠) = 𝐿(𝑜, 𝑐). The CFTA Final rule (Figure 3), in contrast, marks only the state 𝑞^𝑜_𝑠 (with output value 𝑜 and start symbol 𝑠) as the accepting state.

A SFTA divides the set of programs in the DSL into subsets. Given an input 𝜎, all programs in a subset produce the same output (based on the accepting state), with the SFTA assigning 𝑤(𝑞^𝑐_𝑠) = 𝐿(𝑜, 𝑐) as the weight of this subset of programs. We denote the SFTA constructed from DSL 𝐺, example (𝜎, 𝑜), per-example loss function 𝐿, and threshold 𝑑 as A^𝑑_𝐺(𝜎, 𝑜, 𝐿). We omit the subscript grammar 𝐺 and threshold 𝑑 wherever it is obvious from context.

Example 3.
Consider the DSL presented in Example 2. Given input-output example ({𝑥 → 1}, 9) and per-example loss function 𝑙(𝑐) = (𝑐 − 9)², Figure 5 presents the SFTA which represents all programs up to the bounded height. For readability, we omit the states for terminals 2 and 3. For all accepting states, the first number (in black) represents the computed value and the second number (in red) represents the weight of the accepting state.
Figure 5: The SFTA constructed for Example 3
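The SFTA Final rule can be sketched as a weighting pass over the states of a CFTA. The encoding of states as (symbol, value) pairs and the hand-written state set below are our own illustrative simplifications; the squared per-example loss is consistent with the weights shown for Example 3 (e.g., value 1 receives weight 64 against output 9).

```python
# Sketch of the SFTA Final rule (Figure 4): every state carrying the
# start symbol becomes accepting, weighted by the per-example loss
# L(o, c) of its concrete value c against the noisy output o.
def weight_accepting(states, start_symbol, o, per_example_loss):
    """Map each accepting state (start_symbol, c) to its weight L(o, c)."""
    return {(s, c): per_example_loss(o, c)
            for (s, c) in states if s == start_symbol}

# Illustrative state set for the Example 2 DSL with input x = 1.
states = {("n", 1), ("n", 3), ("n", 9), ("t", 2), ("t", 3)}
w = weight_accepting(states, "n", o=9,
                     per_example_loss=lambda o, c: (o - c) ** 2)
```

The equivalence class of programs evaluating to 9 gets weight 0, so any exact program survives the later objective minimization, while classes such as the one computing 3 carry weight 36.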
Definition 8 (+ Intersection). Given two SFTAs A₁ = (𝑄₁, 𝐹, 𝑄_𝑓₁, Δ₁, 𝑤₁) and A₂ = (𝑄₂, 𝐹, 𝑄_𝑓₂, Δ₂, 𝑤₂), the SFTA A = (𝑄, 𝐹, 𝑄_𝑓, Δ, 𝑤) is the + intersection of A₁ and A₂ if the CFTA in A is the intersection of the CFTAs of A₁ and A₂, and the weight of each accepting state in A is the sum of the weights of the corresponding accepting states in A₁ and A₂. Formally:

• The CFTA (𝑄, 𝐹, 𝑄_𝑓, Δ) is the intersection of the CFTAs (𝑄₁, 𝐹, 𝑄_𝑓₁, Δ₁) and (𝑄₂, 𝐹, 𝑄_𝑓₂, Δ₂).
• 𝑤(𝑞^{c̄₁:c̄₂}_𝑠) = 𝑤₁(𝑞^{c̄₁}_𝑠) + 𝑤₂(𝑞^{c̄₂}_𝑠) (for 𝑞^{c̄₁:c̄₂}_𝑠 ∈ 𝑄_𝑓).

Given two SFTAs A₁ and A₂, A₁ + A₂ denotes the + intersection of A₁ and A₂.

Definition 9 (/ Intersection). Given a SFTA A = (𝑄, 𝐹, 𝑄_𝑓, Δ, 𝑤) and a CFTA A* = (𝑄*, 𝐹, 𝑄*_𝑓, Δ*), the SFTA A′ = (𝑄′, 𝐹, 𝑄′_𝑓, Δ′, 𝑤′) is the / intersection of A and A* if the FTA in A′ is the intersection of A and A*, and the weight of each accepting state in A′ is the same as the weight of the corresponding accepting state in A. Formally:

• The CFTA (𝑄′, 𝐹, 𝑄′_𝑓, Δ′) is the intersection of the FTAs (𝑄, 𝐹, 𝑄_𝑓, Δ) and (𝑄*, 𝐹, 𝑄*_𝑓, Δ*).
• 𝑤′(𝑞^{c̄₁:c̄₂}_𝑠) = 𝑤(𝑞^{c̄₁:c̄₂}_𝑠) (for 𝑞^{c̄₁:c̄₂}_𝑠 ∈ 𝑄′_𝑓).

Given a SFTA A and a CFTA A*, A/A* denotes the / intersection of A and A*. Given a single input-output example, a CFTA built on that example only accepts programs which are consistent with that example. / intersection is thus a simple method to prune a SFTA to contain only programs which are consistent with an input-output example.

Definition 10 (𝑤-pruned SFTA). A SFTA A′ = (𝑄, 𝐹, 𝑄′_𝑓, Δ, 𝑤′) is the 𝑤-pruned SFTA of A = (𝑄, 𝐹, 𝑄_𝑓, Δ, 𝑤) if we remove all accepting states with weights greater than 𝑤 from 𝑄_𝑓. Formally, 𝑄′_𝑓 = {𝑞 | 𝑞 ∈ 𝑄_𝑓 ∧ 𝑤(𝑞) ≤ 𝑤} and 𝑤′(𝑞) = 𝑤(𝑞) if 𝑞 ∈ 𝑄′_𝑓. Given a SFTA A, A ↓ 𝑤 denotes the 𝑤-pruned SFTA of A.

Definition 11 (𝑞-selection).
Given a SFTA A = (𝑄, 𝐹, 𝑄_𝑓, Δ, 𝑤) and an accepting state 𝑞 ∈ 𝑄_𝑓, the CFTA (𝑄, 𝐹, {𝑞}, Δ) is called the 𝑞-selection of SFTA A. Given a SFTA A, the notation A_𝑞 denotes the 𝑞-selection of SFTA A.

Given a data set D = {(𝜎₁, 𝑜₁), . . . , (𝜎ₙ, 𝑜ₙ)} of input-output examples and loss function L(𝑝, D) with per-example loss function 𝐿, we construct SFTAs A₁, A₂, . . . , Aₙ for the input-output examples, where Aᵢ = A(𝜎ᵢ, 𝑜ᵢ, 𝐿).

Theorem 1.
Given a SFTA A = A(𝜎, 𝑜, 𝐿) = (𝑄, 𝐹, 𝑄_𝑓, Δ, 𝑤), for all accepting states 𝑞 ∈ 𝑄_𝑓 and for all programs 𝑝 accepted by the 𝑞-selection automaton A_𝑞: 𝐿(𝑜, J𝑝K𝜎) = 𝑤(𝑞).

Proof.
Consider any state 𝑞 ∈ 𝑄_𝑓. All programs accepted by state 𝑞 compute the same concrete value 𝑐 on the given input 𝜎. Hence for all programs 𝑝 accepted by the 𝑞-selection automaton A_𝑞, J𝑝K𝜎 = 𝑐. By construction (Figure 4), 𝑤(𝑞) = 𝐿(𝑜, 𝑐) = 𝐿(𝑜, J𝑝K𝜎). □

Let the SFTA
A(D, 𝐿) be the + intersection of the SFTAs defined on the input-output examples in D. Formally:

A(D, 𝐿) = A(𝜎₁, 𝑜₁, 𝐿) + A(𝜎₂, 𝑜₂, 𝐿) + . . . + A(𝜎ₙ, 𝑜ₙ, 𝐿)

Since the size of each SFTA A(𝜎ᵢ, 𝑜ᵢ, 𝐿) is bounded, the cost of computing A(D, 𝐿) is O(|D|).

Theorem 2.
Given
A(D, 𝐿) = (𝑄, 𝐹, 𝑄_𝑓, Δ, 𝑤) as defined above, for all accepting states 𝑞 ∈ 𝑄_𝑓 and for all programs 𝑝 accepted by the 𝑞-selection automaton A(D, 𝐿)_𝑞: L(𝑝, D) = 𝑤(𝑞), i.e., the weight of the state 𝑞 measures the loss of the programs accepted by 𝑞 on data set D.

Proof.
Consider any accepting state 𝑞 ∈ 𝑄_𝑓. Since A(D, 𝐿) is an intersection of the SFTAs A₁, . . . , Aₙ (where Aᵢ = A(𝜎ᵢ, 𝑜ᵢ, 𝐿) = (𝑄ᵢ, 𝐹, (𝑄_𝑓)ᵢ, Δᵢ, 𝑤ᵢ)), there exist accepting states 𝑞₁ ∈ (𝑄_𝑓)₁, 𝑞₂ ∈ (𝑄_𝑓)₂, . . . , 𝑞ₙ ∈ (𝑄_𝑓)ₙ such that all programs 𝑝 accepted by A_𝑞 are accepted by (A₁)_𝑞₁, (A₂)_𝑞₂, . . . , (Aₙ)_𝑞ₙ. From Theorem 1, for all programs 𝑝 accepted by A_𝑞, 𝑤ᵢ(𝑞ᵢ) = 𝐿(𝑜ᵢ, J𝑝K𝜎ᵢ). From the definition of + intersection,

𝑤(𝑞) = Σᵢ₌₁ⁿ 𝑤ᵢ(𝑞ᵢ) = Σᵢ₌₁ⁿ 𝐿(𝑜ᵢ, J𝑝K𝜎ᵢ) = L(𝑝, D). □

Algorithm 1 presents the base algorithm to synthesize programs within various noisy synthesis settings.
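The shape of that base algorithm can be rendered in Python. This is a hedged sketch of ours, not the implementation: the SFTA is abstracted as a mapping from accepting states to a (weight, programs-in-that-class) pair, whereas a real implementation recovers each minimum-complexity representative by dynamic programming over the automaton rather than by listing programs.

```python
# Sketch: for each accepting state q, take the minimum-complexity
# representative p_q of its equivalence class, then pick the state
# minimizing U(w(q), C(p_q)).
def algorithm1(sfta, complexity, objective):
    """sfta: {q: (weight, [programs accepted by the q-selection])}."""
    best_q, best_p = None, None
    for q, (w, progs) in sfta.items():
        p_q = min(progs, key=complexity)          # argmin C(p) within the class
        if best_q is None or (objective(w, complexity(p_q))
                              < objective(sfta[best_q][0], complexity(best_p))):
            best_q, best_p = q, p_q
    return best_p

# Toy run: two equivalence classes, lexicographic objective (loss, size).
toy_sfta = {"q0": (0, ["x+2+0", "x+2"]), "q1": (2, ["x"])}
chosen = algorithm1(toy_sfta, complexity=len, objective=lambda l, c: (l, c))
```

With the lexicographic objective, the zero-loss class wins and its shortest member "x+2" is returned, even though the lossy class contains a simpler program.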
Theorem 3.
The program 𝑝* returned by Algorithm 1 is equal to 𝑝′, where 𝑝′ = argmin_{𝑝 ∈ 𝐺_𝑑} 𝑈(L(𝑝, D), 𝐶(𝑝)).

Algorithm 1: Synthesis Algorithm
Input: DSL 𝐺, threshold 𝑑, data set D, per-example loss function 𝐿, complexity measure 𝐶, and objective function 𝑈
Result: Synthesized program 𝑝*
1  Construct A(D, 𝐿) = (𝑄, 𝐹, 𝑄_𝑓, Δ, 𝑤) from D and per-example loss function 𝐿
2  foreach 𝑞 ∈ 𝑄_𝑓 do
3      𝑝_𝑞 ← argmin_{𝑝 ∈ A(D,𝐿)_𝑞} 𝐶(𝑝)    // within the 𝑞-selection, find the minimum complexity program 𝑝_𝑞
4  end
5  𝑞* ← argmin_{𝑞 ∈ 𝑄_𝑓} 𝑈(𝑤(𝑞), 𝐶(𝑝_𝑞))
6  𝑝* ← 𝑝_{𝑞*}
Given
A(D , 𝐿 ) = ( 𝑄, 𝐹, 𝑄 𝑓 , Δ ,𝑤 ) . 𝑝 ∗ returned by Algo-rithm 1 is equal to 𝑝 𝑞 ∗ , where 𝑞 ∗ = argmin 𝑞 ∈ 𝑄 𝑓 𝑈 ( 𝑤 ( 𝑞 ) , 𝐶 ( 𝑝 𝑞 )) ,where 𝑝 𝑞 = argmin 𝑝 ∈A (D ,𝐿 ) 𝑞 𝐶 ( 𝑝 ) . We can rewrite 𝑞 ∗ as argmin 𝑞 ∈ 𝑄 𝑓 𝑈 ( 𝑤 ( 𝑞 ) , min 𝑝 ∈A (D ,𝐿 ) 𝑞 𝐶 ( 𝑝 )) Since for any 𝑙 , 𝑈 ( 𝑙, 𝑐 ) is non-decreasing with respect to 𝑐 , we canrewrite 𝑞 ∗ as argmin 𝑞 ∈ 𝑄 𝑓 min 𝑝 ∈A (D ,𝐿 ) 𝑞 𝑈 ( 𝑤 ( 𝑞 ) , 𝐶 ( 𝑝 )) By Theorem 2, for any 𝑝 ∈ A(D , 𝐿 ) 𝑞 : 𝑤 ( 𝑞 ) = L( 𝑝, D) 𝑞 ∗ = argmin 𝑞 ∈ 𝑄 𝑓 min 𝑝 ∈A (D ,𝐿 ) 𝑞 𝑈 (L( 𝑝, D) , 𝐶 ( 𝑝 )) Because 𝑞 ∗ is the accepting state of 𝑝 ∗ and 𝑝 ∈ A(D , 𝐿 ) if and onlyif ∃ 𝑞 ∈ 𝑄 𝑓 .𝑝 ∈ A(D , 𝐿 ) 𝑞 , we can rewrite the above equation as: 𝑝 ∗ = argmin 𝑝 ∈A (D ,𝐿 ) 𝑈 (L( 𝑝, D) , 𝐶 ( 𝑝 )) The set of programs accepted by
A(D , 𝐿 ) is the same set of pro-grams in grammar 𝐺 𝑑 . Hence proved. (cid:3) We next present several modifications of the core algorithm tosolve various synthesis problems.
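One such modification, finding the simplest program whose loss stays within a bound, reduces to pruning accepting states by weight and then minimizing complexity. The sketch below assumes a toy encoding of an SFTA as a map from accepting states to (weight, programs in that class); this encoding, like the function name, is our illustration rather than the paper's data structure.

```python
# Sketch of the "best program for given accuracy" variant:
# keep accepting states with weight (loss) at most b (the "down-arrow b"
# pruning of Definition 10), then return the minimum-complexity program
# among the surviving equivalence classes.
def best_for_accuracy(sfta, b, complexity):
    pruned = {q: v for q, v in sfta.items() if v[0] <= b}          # A down b
    candidates = [min(ps, key=complexity) for (_, ps) in pruned.values()]
    return min(candidates, key=complexity) if candidates else None

toy_sfta = {"q0": (0, ["Concat(x, x)"]), "q1": (1, ["x"]), "q2": (5, ["s"])}
```

Loosening the bound 𝑏 can only enlarge the candidate set, so the returned program's complexity is nonincreasing in 𝑏: with 𝑏 = 0 only the exact class survives, while 𝑏 = 1 admits the simpler lossy program.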
Given a DSL 𝐺, a data set D, loss function L, complexity measure 𝐶, and positive weight 𝜆, we wish to find a program 𝑝* which minimizes the weighted sum of the loss function and the complexity measure. Formally:

𝑝* = argmin_{𝑝 ∈ 𝐺_𝑑} (L(𝑝, D) + 𝜆 · 𝐶(𝑝))

where 𝐺_𝑑 is the set of programs in DSL 𝐺 with size less than the threshold 𝑑. By using the objective function 𝑈(𝑙, 𝑐) = 𝑙 + 𝜆𝑐, we can use Algorithm 1 to synthesize the program 𝑝* which minimizes the objective function given above.

Given a DSL 𝐺, a data set D, loss function L, complexity measure 𝐶, and bound 𝑏, we wish to find a program 𝑝* that minimizes the complexity measure 𝐶 but has loss less than 𝑏. Formally: 𝑝* = argmin_{𝑝 ∈ 𝐺_𝑑} 𝐶(𝑝) s.t. L(𝑝, D) < 𝑏. Note that this condition can be rewritten as 𝑝* = argmin_{𝑝 ∈ A′} 𝐶(𝑝), where A′ = A(D, 𝐿) ↓ 𝑏. By the definition of ↓ 𝑏, all accepting states of A′ have weight less than 𝑏. Therefore all programs accepted by A′ have loss less than 𝑏 (i.e., L(𝑝, D) < 𝑏). Also note that if a program 𝑝 is not in A′, then either it has loss greater than 𝑏 or it is not within the threshold 𝑑.

Given a DSL 𝐺, a data set D, a subset D′ ⊆ D, loss function L, complexity measure 𝐶, and objective function 𝑈, we wish to find a program 𝑝* which minimizes the objective function with an added constraint of bounded loss over data set D′. Formally:

𝑝* = argmin_{𝑝 ∈ 𝐺_𝑑} 𝑈(L(𝑝, D), 𝐶(𝑝)) s.t. L(𝑝, D′) ≤ 𝑏

We first construct a SFTA
A(D′, 𝐿) ↓ 𝑏, which contains all programs with loss less than or equal to 𝑏 over data set D′. After constructing A(D, 𝐿) as in Algorithm 1, we modify A(D, 𝐿) by taking its / intersection with A(D′, 𝐿) ↓ 𝑏 (after dropping the weights of the latter's accepting states), i.e., A(D, 𝐿) ← A(D, 𝐿)/(A(D′, 𝐿) ↓ 𝑏), and then proceed as in Algorithm 1. By the definition of / intersection and A(D′, 𝐿) ↓ 𝑏, the loss on D′ of every program returned by the modified algorithm will be less than or equal to 𝑏.

Definition 12.
Bayesian Program Synthesis: Given a set of input-output examples D = {(𝜎𝑖, 𝑜𝑖) | 𝑖 = 1 . . . 𝑛}, DSL grammar 𝐺, and a probability distribution 𝜋, 𝑝∗ is the solution to the Bayesian program synthesis problem if 𝑝∗ is the most probable program in DSL 𝐺 given the data set D. Formally:

𝑝∗ = argmax_{𝑝 ∈ 𝐺} 𝜋(𝑝 | D)

By Bayes' rule, 𝑝∗ = argmax_{𝑝 ∈ 𝐺} 𝜋(D | 𝑝) 𝜋(𝑝), so

𝑝∗ = argmax_{𝑝 ∈ 𝐺} [ (log 𝜋(D | 𝑝)) + (log 𝜋(𝑝)) ]

Assuming independence of the observations:

𝑝∗ = argmax_{𝑝 ∈ 𝐺𝑑} [ ( Σ_{(𝜎𝑖, 𝑜𝑖) ∈ D} log 𝜋(𝑜𝑖 | ⟦𝑝⟧𝜎𝑖) ) + (log 𝜋(𝑝)) ]

where 𝜋(𝑜𝑖 | ⟦𝑝⟧𝜎𝑖) denotes the probability of observing output 𝑜𝑖 in the data set D given a program 𝑝. With complexity measure −log 𝜋(𝑝), per-example loss function −log 𝜋(𝑜𝑖 | ⟦𝑝⟧𝜎𝑖) (given example (𝜎𝑖, 𝑜𝑖)), and the following loss function:

L(𝑝, D) = − Σ_{(𝜎𝑖, 𝑜𝑖) ∈ D} log 𝜋(𝑜𝑖 | ⟦𝑝⟧𝜎𝑖)

the technique in Section 7.1 (Algorithm 1) synthesizes the most probable program 𝑝∗.

At Most 𝑘 Wrong:
Consider a setting in which, given a data set, a random procedure is allowed to corrupt at most 𝑘 of the input-output examples. Given this noisy data set D, our task is to synthesize the simplest program 𝑝∗ that is wrong on at most 𝑘 of these input-output examples. Formally, given data set D, bound 𝑘, DSL 𝐺, and a complexity measure 𝐶:

𝑝∗ = argmin_{𝑝 ∈ 𝐺} 𝐶(𝑝)  s.t.  L_{0/1}(𝑝, D) ≤ 𝑘

where L_{0/1} is the 0/1 loss function. The best program for a given accuracy framework (Section 7.2) allows us to synthesize 𝑝∗ subject to a threshold 𝑑.

String expr 𝑒 := Str(𝑓) | Concat(𝑓, 𝑒);
Substring expr 𝑓 := ConstStr(𝑠) | SubStr(𝑥, 𝑝, 𝑝);
Position 𝑝 := Pos(𝑥, 𝜏, 𝑘, 𝑑) | ConstPos(𝑘);
Direction 𝑑 := Start | End;

Figure 6: DSL for string transformations; 𝜏 is a token, 𝑘 is an integer, and 𝑠 is a string constant.

String transformations have been extensively studied within the Programming by Example community [10, 13, 18]. We implemented our technique (in 6k lines of Java code) and used it to solve benchmark program synthesis problems from the SyGuS 2018 benchmark suite [2]. This benchmark suite contains a range of string transformation problems, a class of problems that has been extensively studied in past program synthesis projects [10, 13, 18]. We use the DSL from [19] (Figure 6) with the size complexity measure Size(𝑝). The DSL supports extracting and concatenating (Concat) substrings of the input string 𝑥; each substring is extracted using the SubStr function with a start and end position. A position can either be a constant index (
ConstPos) or the start or end of the 𝑘th occurrence of the match token 𝜏 in the input string (Pos).

Instead of computing individual SFTAs for each input-output example, then combining the SFTAs via + intersections to obtain the final SFTA, our implementation computes the final SFTA directly, working over the full data set. The implementation also applies two techniques that constrain the size of the final SFTA. First, it bounds the number of recursive applications of the production 𝑒 := Concat(𝑓, 𝑒) by applying a bounded scope height threshold 𝑑. Second, during construction of the SFTA, a state with symbol 𝑒 is only added to the SFTA if the length of the state's output value is not greater than the length of the output string plus one.

We evaluate the scalability of our implementation by applying it to all problems in the SyGuS 2018 benchmark suite [1]. For each problem we use the clean (noise-free) data set provided with the benchmark suite. We use the lexicographic objective function 𝑈𝐿(𝑙, 𝑐) with the 0/∞ loss function and the 𝑐 = Size(𝑝) complexity measure. We run each benchmark with bounded scope height threshold 𝑑 =
1, 2, 3, and 4 and record the running time on that benchmark problem and the number of states in the SFTA. For these runs, a state with symbol 𝑒 is only added to the SFTA if the length of its output value is not greater than the length of the output string. Because the running time of our technique does not depend on the specific utility function (except for the time required to evaluate the utility function, which is typically negligible for most utility
Benchmark | d=1 time(s) / states | d=2 time(s) / states | d=3 time(s) / states | d=4 time(s) / states
bikes | 0.16 / 1.08 | 0.73 / 10.56 | 4.72 / 56.4 | 19.83 / 145.8
bikes-long | 0.21 / 1.02 | 1.37 / 9.42 | 6.04 / 49.9 | 26.99 / 139.35
bikes-long-repeat | 0.18 / 1.02 | 1.06 / 9.42 | 6.03 / 49.9 | 27.47 / 139.35
bikes-short | 0.15 / 1.08 | 0.79 / 10.56 | 3.98 / 56.4 | 18.62 / 145.8
dr-name | X / X | 7.54 / 107.5 | 107.18 / 1547.2 | -
dr-name-long | X / X | 17.4 / 70.28 | 300.9 / 1077.6 | -
dr-name-long-repeat | X / X | 19.15 / 70.28 | 301.3 / 1077.6 | -
dr-name-short | X / X | 10.2 / 107.5 | 101.5 / 154.8 | -
firstname | 0.28 / 1.02 | 1.46 / 4.34 | 4.024 / 4.33 | 3.97 / 4.34
firstname-long | 1.72 / 1.04 | 12.03 / 4.36 | 39.08 / 4.36 | 41.22 / 4.36
firstname-long-repeat | 1.64 / 1.04 | 13.96 / 4.36 | 42.4 / 4.36 | 43.1 / 4.36
firstname-short | 0.26 / 1.02 | 1.47 / 4.37 | 3.93 / 4.34 | 3.9 / 4.34
initials | X / X | X / X | 8.7 / 42.3 | 30.4 / 42.34
initials-long | X / X | X / X | 86.44 / 42.36 | 376.56 / 42.36
initials-long-repeat | X / X | X / X | 86.23 / 42.36 | 386.25 / 42.36
initials-short | X / X | X / X | 8.92 / 42.34 | 31.72 / 42.34
lastname | 0.43 / 2.56 | 4.78 / 28.3 | 27.29 / 208.35 | 159.41 / 741.44
lastname-long | 1.93 / 1.37 | 15.1 / 11.34 | 112.04 / 50.81 | 485.98 / 50.8
lastname-long-repeat | 1.85 / 1.37 | 18.35 / 11.34 | 113.36 / 50.81 | 486.35 / 50.8
lastname-short | 0.6 / 2.56 | 3.07 / 28.3 | 28.3 / 208.35 | 160.54 / 741.44
name-combine | X / X | 8.49 / 269.9 | 224.074 / 7485.83 | -
name-combine-long | X / X | 32.28 / 161.54 | - | -
name-combine-long-repeat | X / X | 98.46 / 299 | - | -
name-combine-short | X / X | 6.5 / 269.9 | 207.86 / 7485.83 | -
name-combine-2 | X / X | X / X | 63.490 / 855.34 | -
name-combine-2-long | X / X | X / X | 591.6 / 851.44 | -
name-combine-2-long-repeat | X / X | X / X | 592.0 / 851.44 | -
name-combine-2-short | X / X | X / X | 57.26 / 855.34 | -
name-combine-3 | X / X | X / X | 43.082 / 911.53 | 527.29 / 8104.7
name-combine-3-long | X / X | X / X | 193.42 / 649.13 | -
name-combine-3-long-repeat | X / X | X / X | 192.81 / 649.13 | -
name-combine-3-short | X / X | X / X | 42.266 / 911.53 | 526.13 / 8104.7
reverse-name | X / X | 6.9 / 269.9 | 217.19 / 7495.9 | -
reverse-name-long | X / X | 29.55 / 161.53 | - | -
reverse-name-long-repeat | X / X | 27.6 / 161.53 | - | -
reverse-name-short | X / X | 6.84 / 269.9 | 228.24 / 7485.8 | -
phone | 0.12 / 0.46 | 0.47 / 1.58 | 0.87 / 1.58 | 0.78 / 1.58
phone-long | 0.8 / 0.46 | 3.9 / 1.58 | 7.79 / 1.58 | 32.79 / 1.58
phone-long-repeat | 0.69 / 0.46 | 3.29 / 1.58 | 7.76 / 1.58 | 43.24 / 1.58
phone-short | 0.12 / 0.46 | 0.37 / 1.58 | 0.804 / 1.578 | 4.97 / 1.58
phone-1 | 0.15 / 0.46 | 0.44 / 1.58 | 0.84 / 1.58 | 3.017 / 1.58
phone-1-long | 0.99 / 0.46 | 3.8 / 1.58 | 8.23 / 1.58 | 16.58 / 1.58
phone-1-long-repeat | 0.90 / 0.46 | 4.1 / 1.58 | 8.42 / 1.58 | 17.5 / 1.58
phone-1-short | 0.14 / 0.46 | 0.45 / 1.58 | 0.8 / 1.58 | 1.5 / 1.58
phone-2 | 0.13 / 0.46 | 0.44 / 1.58 | 0.83 / 1.58 | 3.176 / 1.58
phone-2-long | 0.64 / 0.46 | 2.84 / 1.58 | 8.36 / 1.58 | 16 / 1.58
phone-2-long-repeat | 0.85 / 0.46 | 3.8 / 1.58 | 9.83 / 1.58 | 17.55 / 1.58
phone-2-short | 0.09 / 0.46 | 0.47 / 1.58 | 0.83 / 1.58 | 2.78 / 1.58
phone-5 | 0.18 / 0.23 | 0.16 / 0.23 | 0.11 / 0.23 | 0.7 / 0.23
phone-5-long | 1.24 / 0.23 | 0.94 / 0.23 | 0.75 / 0.23 | 4.2 / 0.23
phone-5-long-repeat | 1.27 / 0.23 | 1.19 / 0.23 | 0.77 / 0.23 | 2.96 / 0.23
phone-5-short | 0.17 / 0.23 | 0.17 / 0.23 | 0.11 / 0.23 | 0.9 / 0.23
phone-6 | 0.27 / 0.64 | 1.38 / 2.6 | 2.67 / 2.61 | 9.3 / 2.61
phone-6-long | 1.84 / 0.64 | 6.48 / 2.6 | 24.66 / 2.61 | 103.3 / 2.61
phone-6-long-repeat | 2.16 / 0.64 | 7.12 / 2.6 | 24.69 / 2.61 | 143.9 / 2.61
phone-6-short | 0.28 / 0.64 | 0.76 / 2.6 | 2.27 / 2.61 | 11.19 / 2.61
phone-7 | 0.24 / 0.64 | 1.04 / 2.6 | 2.87 / 2.61 | 11.141 / 2.61
phone-7-long | 2.6 / 0.64 | 7.8 / 2.6 | 26.1 / 2.61 | 108.1 / 2.61
phone-7-long-repeat | 2.58 / 0.64 | 6.68 / 2.6 | 26.15 / 2.61 | 115.42 / 2.61
phone-7-short | 0.23 / 0.64 | 1.13 / 2.6 | 3.26 / 2.61 | 10.71 / 2.61
phone-8 | 0.23 / 0.64 | 1 / 2.6 | 2.65 / 2.61 | 8.51 / 2.61
phone-8-long | 2.33 / 0.64 | 7.58 / 2.6 | 25.87 / 2.61 | 114.54 / 2.61
phone-8-long-repeat | 1.67 / 0.64 | 7.7 / 2.6 | 25.45 / 2.61 | 128.3 / 2.61
phone-8-short | 0.27 / 0.64 | 0.97 / 2.6 | 2.45 / 2.61 | 13.81 / 2.61
(SFTA sizes in thousands of states; X / X: terminated without a correct program; -: did not terminate)
Figure 7: Runtimes and SFTA sizes for selected SyGuS 2018 benchmarks

functions, and except for search space pruning techniques appropriate for specific combinations of utility functions and DSLs), we anticipate that these results will generalize to other utility functions. All experiments are run on a 3.00 GHz Intel(R) Xeon(R) CPU E5-2690 v2 machine with 512GB of memory running Linux 4.15.0. With a timeout limit of 10 minutes and a bounded scope height threshold of 4, the implementation is able to solve 64 out of the 108 SyGuS 2018 benchmark problems. For the remaining benchmark problems a correct program does not exist within the DSL at bounded scope height threshold 4.

Figure 7 presents results for all SyGuS 2018 benchmarks other than the name-combine-4-*, phone-3-*, phone-4-*, phone-9-*, phone-10-*, and univ-* benchmarks, which we omit; all runs for these benchmarks either do not synthesize a correct program or do not terminate. Our synthesis technique removes all duplicates from the data set in the case of 0/∞ loss. There is a row for each benchmark problem. The first column presents the name of the benchmark. The next four columns present results for the technique running with bounded scope height threshold 𝑑 =
1, 2, 3, and 4, respectively. Each column has two subcolumns: the first presents the running time on that benchmark problem (in seconds); the second presents the number of states in the SFTA (in thousands of states). An entry X indicates that the implementation terminated but did not synthesize a correct program that agreed with all provided input-output examples. An entry - indicates that the implementation did not terminate.

In general, both the running times and the number of states in the SFTA increase as the number of provided input-output examples and/or the bounded height threshold increases. The SFTA size sometimes stops increasing as the height threshold increases. We attribute this phenomenon to the application of a search space pruning technique that terminates the recursive application of the production 𝑒 := Concat(𝑓, 𝑒) when the generated string becomes longer than the current output string; in this case any resulting synthesized program will produce an output that does not match the output in the data set.

We compare with a previous technique that uses FTAs to solve program synthesis problems [20]. This previous technique requires clean data and only synthesizes programs that agree with all input-output examples in the data set. Our technique builds SFTAs with similar structure, with additional overhead coming from the evaluation of the objective function. We obtained the implementation of the technique presented in [20] and ran this implementation on all benchmarks in the SyGuS 2018 benchmark suite. The running times of our implementation and this previous implementation are comparable.

We next present results for our implementation running on small (few input-output examples) data sets with character deletions. We use a noise source that cyclically deletes a single character from each output in the data set in turn, starting with the first character, proceeding through the output positions, then wrapping around to the first character again.
We apply this noise source to corrupt every output in the data set. To construct a noisy data set with 𝑘 correct (uncorrupted) outputs, we do not apply the noise source to the last 𝑘 outputs in the data set.

We exclude all *-long, *-long-repeat, and *-short benchmarks and all benchmarks that do not terminate within the time limit at height bound 4. For each remaining benchmark we use our implementation and the generated corrupted data sets to determine the minimum number of correct outputs in the corrupted data set required for the implementation to produce a correct program that matches the original hidden clean data set on all input-output examples. We consider three loss functions: the 0/1 loss function, the DL loss function, and the 1-Delete loss function.

Definition 13.
The 1-Delete loss function L𝐷(𝑝, D) uses the per-example loss function 𝐿𝐷 that is 0 if the outputs from the synthesized program and the data set match exactly, 1 if a single deletion enables the output from the synthesized program to match the output from the data set, and ∞ otherwise:

L𝐷(𝑝, D) = Σ_{(𝜎𝑖, 𝑜𝑖) ∈ D} 𝐿𝐷(⟦𝑝⟧𝜎𝑖, 𝑜𝑖), where

𝐿𝐷(𝑜, 𝑜′) = 0 if 𝑜 = 𝑜′; 1 if 𝑜 = 𝑎 · 𝑥 · 𝑏, 𝑜′ = 𝑎 · 𝑏, and |𝑥| = 1; ∞ otherwise.

We use the lexicographic objective function 𝑈𝐿(𝑙, 𝑐) with 𝑐 = Size(𝑝) as the complexity measure and bounded scope height threshold 𝑑 =
4. We apply a search space pruning technique that terminates the recursive application of the production 𝑒 := Concat(𝑓, 𝑒) when the generated string becomes more than one character longer than the current output string.

Figure 9 summarizes the results. The Data Set Size column presents the total number of input-output examples in the corrupted data set. The next three columns, 1-Delete, DL, and 0/1, present the minimum number of correct (uncorrupted) input-output examples required for the technique to synthesize a correct program (that agrees with the original hidden clean data set on all input-output examples) using the 1-Delete, DL, and 0/1 loss functions, respectively.

With the 1-Delete loss function, the minimum number of required correct input-output examples is always 0: the implementation synthesizes, for every benchmark problem, a correct program that matches every input-output example in the original clean data set even when given a data set in which every output is corrupted. This result highlights how 1) discrete noise sources produce noisy outputs that leave a significant amount of information from the original uncorrupted output available in the corrupted output, and 2) a loss function that matches the noise source can enable the synthesis technique to exploit this information to produce correct programs even in the face of substantial noise.

With the DL loss function, the implementation synthesizes a correct program for 8 of the 16 benchmarks when all outputs in the data set are corrupted. For 7 of the remaining 8 benchmarks the technique requires 2 correct input-output examples to synthesize the correct program. The remaining benchmark requires 3 correct examples. The general pattern is that the technique tends to require correct examples when the output strings are short. The synthesized incorrect programs typically use less of the input string.
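The per-example 1-Delete loss of Definition 13 can be implemented directly (a sketch; `INF`, the function names, and the representation of the data set as (input, output) pairs are our own conventions, not the paper's):

```python
INF = float("inf")

def one_delete_example_loss(p_out, o):
    """Per-example 1-Delete loss L_D: 0 on an exact match, 1 if deleting a
    single character from the program's output p_out yields the (possibly
    corrupted) data-set output o, and infinity otherwise."""
    if p_out == o:
        return 0
    if len(p_out) == len(o) + 1:
        # p_out = a + x + b with |x| = 1 and o = a + b, for some split i
        for i in range(len(p_out)):
            if p_out[:i] + p_out[i + 1:] == o:
                return 1
    return INF

def one_delete_loss(p, data):
    """L_D(p, D): sum of the per-example losses over the data set."""
    return sum(one_delete_example_loss(p(x), y) for x, y in data)
```

A program that produces the correct (one character longer) output for a corrupted example therefore incurs loss 1 rather than infinity, which is why this loss can recover a correct program even when every example is corrupted.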
Benchmark | n | output size | d=1: time(s) / states / DL loss / prog. size | d=2 | d=3 | d=4
bikes | 6 | 33 | 0.18 / 14.28 / 0 / 8 | 1.04 / 191.6 / 0 / 8 | 7.0 / 1532.67 / 0 / 8 | 49.06 / 6655.88 / 0 / 8
bikes-long | 24 | 136 | 0.37 / 135.42 / 0 / 8 | 2.85 / 1695.02 / 0 / 8 | 25.45 / 13114.22 / 0 / 8 | 158.04 / 57398.12 / 0 / 8
bikes-long-repeat | 58 | 325 | 0.68 / 135.46 / 0 / 8 | 5.81 / 1695.06 / 0 / 8 | 55.19 / 13114.26 / 0 / 8 | 364.4 / 57398.16 / 0 / 8
bikes-short | 6 | 33 | 0.13 / 14.28 / 0 / 8 | 1.11 / 191.6 / 0 / 8 | 7.34 / 1532.67 / 0 / 8 | 54.77 / 6655.88 / 0 / 8
dr-name | 4 | 36 | 0.49 / 63.76 / 4 / 11 | 8.66 / 1693.15 / 0 / 14 | 194.82 / 29685.59 / 0 / 14 | -
dr-name-long | 50 | 515 | 1.24 / 356.55 / 50 / 11 | 22.5 / 9844.45 / 0 / 14 | - | -
dr-name-long-repeat | 150 | 1545 | 2.34 / 3565.15 / 150 / 11 | 59.33 / 98444.15 / 0 / 14 | - | -
dr-name-short | 4 | 36 | 0.55 / 63.76 / 4 / 11 | 8.53 / 1693.15 / 0 / 14 | 302.33 / 29685.59 / 0 / 14 | -
firstname | 4 | 20 | 0.26 / 15.95 / 0 / 8 | 1.67 / 133.16 / 0 / 8 | 26.0 / 595.77 / 0 / 8 | 49.09 / 595.77 / 0 / 8
firstname-long | 54 | 335 | 1.47 / 161.35 / 0 / 8 | 16.14 / 1333.45 / 0 / 8 | 251.45 / 5959.55 / 0 / 8 | 547.8 / 5959.55 / 0 / 8
firstname-long-repeat | 204 | 1280 | 3.64 / 1613.2 / 0 / 8 | 47.74 / 13334.2 / 0 / 8 | - | -
firstname-short | 4 | 20 | 0.23 / 15.95 / 0 / 8 | 1.61 / 133.16 / 0 / 8 | 27.41 / 595.77 / 0 / 8 | 45.07 / 595.77 / 0 / 8
initials | 4 | 16 | 0.23 / 15.97 / 8 / 6 | 1.64 / 168.58 / 4 / 12 | 31.26 / 1450.97 / 0 / 22 | 117.93 / 5913.66 / 0 / 22
initials-long | 54 | 216 | 1.23 / 143.25 / 108 / 6 | 14.83 / 1669.35 / 54 / 12 | 370.59 / 14493.25 / 0 / 22 | -
initials-long-repeat | 204 | 816 | 3.14 / 1432.2 / 408 / 6 | 48.23 / 16693.2 / 204 / 12 | - | -
initials-short | 4 | 16 | 0.22 / 15.97 / 8 / 6 | 1.61 / 168.58 / 4 / 12 | 19.0 / 1450.97 / 0 / 22 | 121.17 / 5913.66 / 0 / 22
lastname | 4 | 30 | 0.42 / 38.48 / 0 / 10 | 4.45 / 591.56 / 0 / 10 | 59.1 / 5717.2 / 0 / 10 | 432.69 / 34502.07 / 0 / 10
lastname-long | 54 | 356 | 1.48 / 186.05 / 0 / 10 | 18.5 / 2316.35 / 0 / 10 | 249.64 / 19128.05 / 0 / 10 | -
lastname-long-repeat | 204 | 1334 | 3.69 / 1860.2 / 0 / 10 | 60.44 / 23163.2 / 0 / 10 | - | -
lastname-short | 4 | 30 | 0.44 / 38.48 / 0 / 10 | 4.25 / 591.56 / 0 / 10 | 57.87 / 5717.2 / 0 / 10 | 481.84 / 34502.07 / 0 / 10
name-combine | 6 | 81 | 0.44 / 69.97 / 6 / 8 | 9.55 / 3263.99 / 0 / 21 | 381.69 / 102428.84 / 0 / 21 | -
name-combine-long | 50 | 691 | 1.53 / 546.75 / 50 / 8 | 41.16 / 20330.85 / 0 / 21 | - | -
name-combine-long-repeat | 204 | 2818 | 8.85 / 9717.2 / 204 / 8 | 464.71 / 392615.2 / 0 / 21 | - | -
name-combine-short | 6 | 81 | 0.45 / 69.97 / 6 / 8 | 10.23 / 3263.99 / 0 / 21 | 413.17 / 102428.84 / 0 / 21 | -
name-combine-2 | 4 | 32 | 0.5 / 49.25 / 4 / 11 | 6.43 / 1124.6 / 4 / 11 | 113.3 / 17414.8 / 0 / 24 | -
name-combine-2-long | 54 | 497 | 2.03 / 407.45 / 54 / 11 | 47.53 / 9703.35 / 54 / 11 | - | -
name-combine-2-long-repeat | 204 | 1892 | 5.35 / 4074.2 / 204 / 11 | 161.69 / 97033.2 / 204 / 11 | - | -
name-combine-2-short | 4 | 32 | 0.51 / 49.25 / 4 / 11 | 6.5 / 1124.6 / 4 / 11 | 111.78 / 17414.8 / 0 / 24 | -
name-combine-3 | 6 | 56 | 0.33 / 38.94 / 12 / 13 | 4.12 / 984.52 / 6 / 16 | 78.62 / 16834.93 / 0 / 22 | -
name-combine-3-long | 50 | 476 | 1.2 / 288.15 / 100 / 13 | 18.06 / 6511.15 / 50 / 16 | 325.26 / 110276.25 / 0 / 22 | -
name-combine-3-long-repeat | 200 | 1904 | 2.53 / 2881.2 / 400 / 13 | 59.13 / 65111.2 / 200 / 16 | - | -
name-combine-3-short | 6 | 56 | 0.34 / 38.94 / 12 / 13 | 3.98 / 984.52 / 6 / 16 | 74.39 / 16834.93 / 0 / 22 | -
name-combine-4 | 5 | 49 | 0.34 / 52.26 / 15 / 13 | 5.1 / 1679.88 / 10 / 16 | 139.73 / 36808.22 / 5 / 19 | -
name-combine-4-long | 50 | 526 | 1.39 / 362.65 / 150 / 13 | 22.88 / 9825.05 / 100 / 16 | 533.41 / 201894.15 / 50 / 19 | -
name-combine-4-long-repeat | 200 | 2104 | 3.15 / 3626.2 / 600 / 13 | 77.55 / 98250.2 / 400 / 16 | - | -
name-combine-4-short | 5 | 49 | 0.36 / 52.26 / 15 / 13 | 4.92 / 1679.88 / 10 / 16 | 139.69 / 36808.22 / 5 / 19 | -
phone | 6 | 18 | 0.13 / 7.35 / 0 / 6 | 0.55 / 48.79 / 0 / 6 | 2.37 / 162.2 / 0 / 6 | 6.16 / 162.2 / 0 / 6
phone-long | 100 | 300 | 0.71 / 734.1 / 0 / 6 | 4.06 / 4878.1 / 0 / 6 | 26.93 / 16219.1 / 0 / 6 | 75.58 / 16219.1 / 0 / 6
phone-long-repeat | 400 | 1200 | 1.51 / 734.4 / 0 / 6 | 13.93 / 4878.4 / 0 / 6 | 85.42 / 16219.4 / 0 / 6 | 251.59 / 16219.4 / 0 / 6
phone-short | 6 | 18 | 0.1 / 7.35 / 0 / 6 | 0.55 / 48.79 / 0 / 6 | 2.31 / 162.2 / 0 / 6 | 5.47 / 162.2 / 0 / 6
phone-1 | 6 | 18 | 0.13 / 7.35 / 0 / 6 | 0.57 / 48.79 / 0 / 6 | 2.37 / 162.2 / 0 / 6 | 5.86 / 162.2 / 0 / 6
phone-1-long | 100 | 300 | 0.76 / 734.1 / 0 / 6 | 4.22 / 4878.1 / 0 / 6 | 26.69 / 16219.1 / 0 / 6 | 76.31 / 16219.1 / 0 / 6
phone-1-long-repeat | 400 | 1200 | 1.52 / 734.4 / 0 / 6 | 14.18 / 4878.4 / 0 / 6 | 86.62 / 16219.4 / 0 / 6 | 260.76 / 16219.4 / 0 / 6
phone-1-short | 6 | 18 | 0.1 / 7.35 / 0 / 6 | 0.55 / 48.79 / 0 / 6 | 2.49 / 162.2 / 0 / 6 | 6.3 / 162.2 / 0 / 6
phone-2 | 6 | 18 | 0.12 / 7.35 / 0 / 8 | 0.57 / 48.79 / 0 / 8 | 2.16 / 162.2 / 0 / 8 | 6.37 / 162.2 / 0 / 8
phone-2-long | 100 | 300 | 0.69 / 734.1 / 0 / 8 | 4.09 / 4878.1 / 0 / 8 | 26.24 / 16219.1 / 0 / 8 | 69.53 / 16219.1 / 0 / 8
phone-2-long-repeat | 400 | 1200 | 1.55 / 734.4 / 0 / 8 | 17.56 / 4878.4 / 0 / 8 | 86.94 / 16219.4 / 0 / 8 | 242.99 / 16219.4 / 0 / 8
phone-2-short | 6 | 18 | 0.12 / 7.35 / 0 / 8 | 0.56 / 48.79 / 0 / 8 | 2.05 / 162.2 / 0 / 8 | 5.78 / 162.2 / 0 / 8
phone-3 | 7 | 91 | 0.39 / 42.06 / 14 / 11 | 5.13 / 1826.9 / 14 / 11 | 202.67 / 59912.06 / 7 / 20 | -
phone-3-long | 100 | 1300 | 1.54 / 4205.1 / 200 / 11 | 51.3 / 182689.1 / 200 / 11 | - | -
phone-3-long-repeat | 400 | 5200 | 4.28 / 4205.4 / 800 / 11 | 182.03 / 182689.4 / 800 / 11 | - | -
phone-3-short | 7 | 91 | 0.35 / 42.06 / 14 / 11 | 5.03 / 1826.9 / 14 / 11 | 201.6 / 59912.06 / 7 / 20 | -
phone-4 | 6 | 66 | 0.26 / 39.66 / 12 / 8 | 4.11 / 1498.16 / 6 / 17 | 126.55 / 42781.51 / 6 / 17 | -
phone-4-long | 100 | 1100 | 1.44 / 3965.1 / 200 / 8 | 43.97 / 149815.1 / 100 / 17 | - | -
phone-4-long-repeat | 400 | 4400 | 3.85 / 3965.4 / 800 / 8 | 152.39 / 149815.4 / 400 / 17 | - | -
phone-4-short | 6 | 66 | 0.29 / 39.66 / 12 / 8 | 4.17 / 1498.16 / 6 / 17 | 125.07 / 42781.51 / 6 / 17 | -
phone-5 | 7 | 15 | 0.15 / 5.57 / 0 / 8 | 0.64 / 5.57 / 0 / 8 | 0.67 / 5.57 / 0 / 8 | 0.65 / 5.57 / 0 / 8
phone-5-long | 100 | 240 | 1.05 / 556.1 / 0 / 8 | 4.77 / 556.1 / 0 / 8 | 5.28 / 556.1 / 0 / 8 | 4.27 / 556.1 / 0 / 8
phone-5-long-repeat | 400 | 960 | 2.58 / 556.4 / 0 / 8 | 14.38 / 556.4 / 0 / 8 | 16.45 / 556.4 / 0 / 8 | 14.09 / 556.4 / 0 / 8
phone-5-short | 7 | 15 | 0.15 / 5.57 / 0 / 8 | 0.63 / 5.57 / 0 / 8 | 0.62 / 5.57 / 0 / 8 | 0.67 / 5.57 / 0 / 8
phone-6 | 7 | 21 | 0.25 / 12.85 / 0 / 10 | 1.39 / 72.62 / 0 / 10 | 7.53 / 315.42 / 0 / 10 | 22.65 / 315.42 / 0 / 10
phone-6-long | 100 | 300 | 1.44 / 1284.1 / 0 / 10 | 14.09 / 7261.1 / 0 / 10 | 72.39 / 31541.1 / 0 / 10 | 313.96 / 31541.1 / 0 / 10
phone-6-long-repeat | 400 | 1200 | 3.98 / 1284.4 / 0 / 10 | 42.03 / 7261.4 / 0 / 10 | 260.06 / 31541.4 / 0 / 10 | -
phone-6-short | 7 | 21 | 0.3 / 12.85 / 0 / 10 | 1.39 / 72.62 / 0 / 10 | 7.63 / 315.42 / 0 / 10 | 24.87 / 315.42 / 0 / 10
phone-7 | 7 | 21 | 0.22 / 12.85 / 0 / 10 | 1.34 / 72.62 / 0 / 10 | 6.87 / 315.42 / 0 / 10 | 24.32 / 315.42 / 0 / 10
phone-7-long | 100 | 300 | 1.5 / 1284.1 / 0 / 10 | 12.99 / 7261.1 / 0 / 10 | 72.47 / 31541.1 / 0 / 10 | 304.19 / 31541.1 / 0 / 10
phone-7-long-repeat | 400 | 1200 | 4.25 / 1284.4 / 0 / 10 | 44.31 / 7261.4 / 0 / 10 | 268.87 / 31541.4 / 0 / 10 | -
phone-7-short | 7 | 21 | 0.22 / 12.85 / 0 / 10 | 1.41 / 72.62 / 0 / 10 | 7.2 / 315.42 / 0 / 10 | 22.27 / 315.42 / 0 / 10
phone-8 | 7 | 21 | 0.22 / 12.85 / 0 / 10 | 1.33 / 72.62 / 0 / 10 | 7.5 / 315.42 / 0 / 10 | 23.33 / 315.42 / 0 / 10
phone-8-long | 100 | 300 | 1.45 / 1284.1 / 0 / 10 | 14.08 / 7261.1 / 0 / 10 | 82.64 / 31541.1 / 0 / 10 | 327.2 / 31541.1 / 0 / 10
phone-8-long-repeat | 400 | 1200 | 3.96 / 1284.4 / 0 / 10 | 46.61 / 7261.4 / 0 / 10 | 266.52 / 31541.4 / 0 / 10 | -
phone-8-short | 7 | 21 | 0.22 / 12.85 / 0 / 10 | 1.4 / 72.62 / 0 / 10 | 6.83 / 315.42 / 0 / 10 | 25.22 / 315.42 / 0 / 10
phone-9 | 7 | 99 | 0.86 / 163.06 / 21 / 8 | 43.75 / 12299.47 / 14 / 21 | - | -
phone-9-long | 100 | 1440 | 5.19 / 16305.1 / 300 / 8 | 555.42 / 1229985.1 / 200 / 21 | - | -
phone-9-long-repeat | 400 | 5760 | 17.01 / 16305.4 / 1200 / 8 | - | - | -
phone-9-short | 7 | 99 | 0.9 / 163.06 / 21 / 8 | 41.61 / 12299.47 / 14 / 21 | - | -
phone-10 | 7 | 120 | 1.18 / 187.59 / 21 / 8 | 65.53 / 18600.06 / 14 / 21 | - | -
phone-10-long | 100 | 1740 | 6.82 / 18758.1 / 300 / 8 | - | - | -
phone-10-long-repeat | 400 | 6960 | 21.71 / 18758.4 / 1200 / 8 | - | - | -
phone-10-short | 7 | 120 | 0.98 / 187.59 / 21 / 8 | 65.15 / 18600.06 / 14 / 21 | - | -
reverse-name | 6 | 81 | 0.44 / 69.97 / 6 / 18 | 9.12 / 3263.99 / 0 / 21 | - | -
reverse-name-long | 50 | 691 | 1.45 / 546.75 / 50 / 18 | 43.63 / 20330.85 / 0 / 21 | - | -
reverse-name-long-repeat | 200 | 2764 | 4.07 / 5467.2 / 200 / 18 | 150.97 / 203308.2 / 0 / 21 | - | -
reverse-name-short | 6 | 81 | 0.55 / 69.97 / 6 / 18 | 10.1 / 3263.99 / 0 / 21 | 370.83 / 102428.84 / 0 / 21 | -
univ-1 | 6 | 258 | 147.5 / 11954.27 / 12 / 8 | - | - | -
univ-1-long | 20 | 699 | 98.37 / 22386.12 / 40 / 8 | - | - | -
univ-1-long-repeat | 30 | 1000 | 132.04 / 22386.13 / 60 / 8 | - | - | -
univ-1-short | 6 | 258 | 170.89 / 11954.27 / 12 / 8 | - | - | -
univ-2 | 6 | 243 | 99.43 / 4954.21 / 17 / 18 | - | - | -
univ-2-long | 20 | 744 | 115.13 / 27793.82 / 65 / 18 | - | - | -
univ-2-long-repeat | 30 | 1075 | 145.75 / 27793.83 / 98 / 18 | - | - | -
univ-2-short | 6 | 243 | 106.17 / 4954.21 / 17 / 18 | - | - | -
univ-3 | 6 | 122 | 22.53 / 1134.63 / 5 / 20 | - | - | -
univ-3-long | 20 | 378 | 30.38 / 4930.22 / 25 / 20 | - | - | -
univ-3-long-repeat | 30 | 570 | 37.74 / 4930.23 / 45 / 20 | - | - | -
univ-3-short | 6 | 122 | 25.99 / 1134.63 / 5 / 20 | - | - | -
univ-4 | 8 | 150 | 22.75 / 842.34 / 18 / 20 | - | - | -
univ-4-long | 20 | 366 | 27.1 / 4510.42 / 39 / 20 | - | - | -
univ-4-long-repeat | 30 | 552 | 34.4 / 4510.43 / 63 / 20 | - | - | -
univ-4-short | 8 | 150 | 25.79 / 842.34 / 18 / 20 | - | - | -
univ-5 | 8 | 150 | 25.14 / 1005.5 / 18 / 20 | - | - | -
univ-5-long | 20 | 366 | 32.39 / 5708.62 / 39 / 20 | - | - | -
univ-5-long-repeat | 30 | 552 | 41.96 / 5708.63 / 63 / 20 | - | - | -
univ-5-short | 8 | 150 | 31.34 / 1005.5 / 18 / 20 | - | - | -
univ-6 | 8 | 150 | 38.49 / 1171.16 / 18 / 20 | - | - | -
univ-6-long | 20 | 366 | 36.38 / 6896.62 / 39 / 20 | - | - | -
univ-6-long-repeat | 30 | 552 | 53.14 / 6896.63 / 63 / 20 | - | - | -
univ-6-short | 8 | 150 | 35.31 / 1171.16 / 18 / 20 | - | - | -
(each threshold group lists time(s) / SFTA states in thousands / DL loss / program size; -: did not terminate)
Figure 8: Runtimes, SFTA sizes, synthesized program DL loss, and synthesized program size for SyGuS 2018 benchmarks under DL loss
Benchmark | Data Set Size | 1-Delete | DL | 0/1
bikes | 6 | 0 | 0 | 2
dr-name | 4 | 0 | 0 | 1
firstname | 4 | 0 | 0 | 2
lastname | 4 | 0 | 2 | 4
initials | 4 | 0 | 2 | 2
reverse-name | 6 | 0 | 0 | 2
name-combine | 4 | 0 | 0 | 2
name-combine-2 | 4 | 0 | 0 | 2
name-combine-3 | 4 | 0 | 0 | 2
phone | 6 | 0 | 2 | 3
phone-1 | 6 | 0 | 3 | 3
phone-2 | 6 | 0 | 2 | 3
phone-5 | 7 | 0 | 2 | 3
phone-6 | 7 | 0 | 2 | 3
phone-7 | 7 | 0 | 2 | 3
phone-8 | 7 | 0 | 0 | 1

Figure 9: Minimum number of correct examples required to synthesize a correct program.
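The cyclic character-deletion noise source used to construct these corrupted data sets can be sketched as follows (a sketch based on the prose description; the `num_clean` parameter and the exact wrap-around bookkeeping are our assumptions):

```python
def corrupt_outputs(outputs, num_clean=0):
    """Delete one character from each output, starting at position 0 and
    advancing one position (mod the output's length) per corrupted output;
    the final num_clean outputs are left uncorrupted."""
    noisy, pos = [], 0
    for i, o in enumerate(outputs):
        if i >= len(outputs) - num_clean or not o:
            noisy.append(o)              # left as a correct example
        else:
            j = pos % len(o)             # cycle through character positions
            noisy.append(o[:j] + o[j + 1:])
            pos += 1
    return noisy
```

For example, `corrupt_outputs(["abc", "def", "ghi"])` deletes positions 0, 1, and 2 in turn, while `num_clean=1` leaves the last output intact.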
These results highlight how the DL loss function still extracts significant useful information from outputs corrupted by discrete noise sources. But in comparison with the 1-Delete loss function, the DL loss function is not as good a match for the character-deletion noise source. The result is that the synthesis technique, when working with the DL loss function, works better with longer outputs, sometimes encounters incorrect programs that fit the corrupted data better, and therefore sometimes requires correct input-output examples to synthesize the correct program.

With the 0/1 loss function, the technique always requires at least one and usually more correct examples to synthesize the correct program. In contrast to the 1-Delete and DL loss functions, the 0/1 loss function does not extract information from corrupted outputs. To synthesize a correct program with the 0/1 loss function in this scenario, the technique must effectively ignore the corrupted outputs and work only with information from the correct outputs.
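The DL loss function is not defined in this excerpt; assuming DL refers to the restricted Damerau-Levenshtein (optimal string alignment) edit distance between the synthesized program's output and the data-set output, a per-example sketch would be:

```python
def dl_distance(a, b):
    """Restricted Damerau-Levenshtein distance: minimum number of
    insertions, deletions, substitutions, and adjacent transpositions
    needed to turn string a into string b."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

def dl_loss(p, data):
    """Sum of per-example edit distances over the data set."""
    return sum(dl_distance(p(x), y) for x, y in data)
```

Unlike the 1-Delete loss, this distance assigns a finite cost to every output pair, so programs that are close to the corrupted data can compete with the intended program, which is consistent with the behavior discussed above.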
We next present results for our implementation running on larger data sets with character replacements. The phone-*-long-repeat benchmarks within the SyGuS 2018 benchmark suite contain transformations over phone numbers. The data sets for these benchmarks contain 400 input-output examples, including repeated input-output examples.

For each of the phone-*-long-repeat benchmark problems on which our technique terminates with bounded scope height threshold 4 (Section 9.2), we construct a noisy data set as follows. For each digit in each output string in the data set, we flip a biased coin. With probability 𝑏, we replace the digit with a uniformly chosen random digit (so that each digit in the noisy output differs from the original digit in the uncorrupted output with probability 9/10 × 𝑏).

Benchmark | Data Set Size | DL Loss | Output Size | Program Size
name-combine-4 | 5 | 10 | 49 | 16
phone-3 | 7 | 14 | 91 | 11
phone-4 | 6 | 6 | 66 | 17
phone-9 | 7 | 14 | 99 | 21
phone-10 | 7 | 14 | 120 | 21

Figure 10: Approximate program synthesis with DL loss.
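The digit-replacement noise source can be sketched as follows (a sketch of the described sampling procedure; the function name and signature are our own):

```python
import random

def corrupt_digits(output, b, rng=None):
    """With probability b, replace each digit of the output string with a
    uniformly chosen random digit. A corrupted position still matches the
    original with probability 1/10, so it differs with probability 9/10 * b."""
    rng = rng or random.Random()
    chars = []
    for c in output:
        if c.isdigit() and rng.random() < b:
            chars.append(rng.choice("0123456789"))
        else:
            chars.append(c)              # non-digits are never corrupted
    return "".join(chars)
```

Note that this noise source preserves the length and punctuation of the output, which is what makes a substitution-based loss function a good match for it.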
We then run our implementation on each benchmark problem with the noisy data set using the tradeoff objective function 𝑈𝜆(𝑙, 𝑐) = 𝑙 + 𝜆 × 𝑐 with complexity measure 𝑐 = Size(𝑝) and the following 𝑛-Substitution loss function:

Definition 14. 𝑛-Substitution Loss Function: The 𝑛-Substitution loss function L𝑛𝑆(𝑝, D) uses the per-example loss function 𝐿𝑛𝑆 that counts the number of positions where the noisy output does not agree with the output from the synthesized program. If the synthesized program produces an output that is longer or shorter than the output in the noisy data set, the loss function is ∞:

L𝑛𝑆(𝑝, D) = Σ_{(𝜎𝑖, 𝑜𝑖) ∈ D} 𝐿𝑛𝑆(⟦𝑝⟧𝜎𝑖, 𝑜𝑖), where

𝐿𝑛𝑆(𝑜, 𝑜′) = ∞ if |𝑜| ≠ |𝑜′|; otherwise Σ_{𝑖=1}^{|𝑜|} [𝑜[𝑖] ≠ 𝑜′[𝑖]]

We run the implementation for all considered combinations of the noise probability 𝑏 and the tradeoff weight 𝜆. For every combination of 𝑏 and 𝜆, and for every one of the phone-*-long-repeat benchmarks in the SyGuS 2018 benchmark set, the implementation synthesizes a correct program that produces the same outputs as in the original (hidden) clean data set. These results highlight, once again, the ability of our technique to work with loss functions that match the characteristics of discrete noise sources to synthesize correct programs even in the face of substantial noise.

For the benchmarks in Figure 10, a correct program does not exist within the DSL at bounded scope threshold 2. Figure 10 presents results from our implementation on the clean (noise-free) benchmark data sets with the DL loss function,
Size(𝑝) complexity measure, lexicographic objective function 𝑈𝐿(L𝐷𝐿(𝑝, D), Size(𝑝)), and bounded scope threshold 2. The first column presents the name of the benchmark. The next four columns present the number of input-output examples in the benchmark data set, the DL loss incurred by the synthesized program over the entire data set, the sum of the lengths of the output strings in the data set (the DL loss for an empty output would be this sum), and the size of the synthesized program.

For the phone-* benchmarks, a correct program outputs the entire input telephone number but changes the punctuation, for example by including an area code in parentheses. The synthesized approximate programs correctly preserve the telephone number but apply only some of the punctuation changes. The result is a small DL loss relative to the sum of the output lengths.

Figure 8 presents results for the SyGuS 2018 benchmarks under the DL loss. There is a row for each benchmark problem; the first three columns present the benchmark name, the number of input-output examples 𝑛, and the total output size. The next four columns present results for the technique running with bounded scope height threshold 𝑑 =
1, 2, 3, and 4, respectively. Each column has four subcolumns: the first presents the running time on that benchmark problem (in seconds); the second presents the number of states in the SFTA (in thousands of states); the third presents the DL loss of the synthesized program over the entire data set (compare this DL loss with the sum of the output lengths over the data set); the fourth presents the size of the synthesized program. An entry - indicates that the implementation did not terminate.
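The 𝑛-Substitution loss of Definition 14 is a length-guarded Hamming distance; a direct sketch (`INF` and the function names are our own conventions):

```python
INF = float("inf")

def n_substitution_example_loss(p_out, o):
    """Per-example n-Substitution loss: infinity if the lengths differ,
    otherwise the number of mismatched character positions."""
    if len(p_out) != len(o):
        return INF
    return sum(1 for a, b in zip(p_out, o) if a != b)

def n_substitution_loss(p, data):
    """L_nS(p, D): sum of the per-example losses over the data set."""
    return sum(n_substitution_example_loss(p(x), y) for x, y in data)
```

Because the digit-replacement noise source never changes the output length, the infinite penalty on length mismatches cheaply rules out structurally wrong programs while the Hamming count tolerates substituted digits.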
Practical Applicability:
The experimental results show that our technique is effective at solving string manipulation program synthesis problems with modestly sized solutions like those present in the SyGuS 2018 benchmarks. More specifically, the results highlight how the combination of structure from the DSL, a discrete noise source that preserves some information even in corrupted outputs, and a good match between the loss function and the noise source can enable very effective synthesis for data sets with only a handful of input-output examples, even in the presence of substantial noise. Even with as generic a loss function as the 0/1 loss function, the technique is effective at dealing with data sets in which a significant fraction of the outputs are corrupted. We anticipate that these results will generalize to similar classes of program synthesis problems with modestly sized solutions within a tractable and focused class of computations.

We note that our current implementation does not scale to SyGuS 2018 benchmarks with larger solutions. These benchmarks were designed to test the scalability of current and future program synthesis systems. No currently extant program synthesis system of which we are aware can solve these larger problems.

To the extent that the SyGuS 2018 benchmarks accurately represent the kinds of program synthesis problems that will be encountered in practice, our results provide encouraging evidence that our technique can help program synthesis systems work effectively with noisy data sets. Important future work in this area will more fully investigate interactions between the DSL, the noise source, the loss function, the classes of synthesis problems that occur in practice, and the scalability of the synthesis technique. A full evaluation of the immediate practical applicability of program synthesis for noisy data sets, as well as a meaningful evaluation of program synthesis more generally, awaits this future work.
Noise Sources With Different Characteristics:
Our experiments largely consider discrete noise sources that preserve some information in corrupted outputs. The results highlight how loss functions like the 1-Delete, DL, and 𝑛-Substitution loss functions can enable our technique to extract and exploit this preserved information to enhance the effectiveness of the synthesis. A natural question is how well our technique performs with noise sources that leave little or even no information intact in corrupted outputs. Here the results from the 0/1 loss function are relevant: the 0/1 loss function extracts no information from corrupted outputs, so the technique must rely entirely on the correct input-output examples that remain in the data set.
10 RELATED WORK
The problem of learning programs from a set of input-output examples has been studied extensively [10, 16, 18]. These techniques can largely be broken down into the following categories:
Synthesis Using Solvers:
These systems require the user to provide precise semantics for the operators of the DSL they are using [11]. Our technique, in contrast, only requires black-box implementations of these operators. A large class of these systems depends on solvers that do not scale as the number of examples increases. Since our techniques are based on tree automata, our cost increases linearly as the number of examples increases. These systems require all input-output examples to be correct and only synthesize programs that match all input-output examples.
Enumerative Techniques:
These techniques search the space of programs to find a single program that is consistent with the given examples [8, 12]. Specifically, they enumerate all programs in the given DSL and terminate when they find the correct program. These techniques may apply heuristics to prune the search space or speed up this process [12]. They require all input-output examples to be correct and only synthesize programs that match all input-output examples.
VSA-based/Tree Automata-based Techniques:
These techniques build complex data structures representing all possible programs compatible with the given examples [13, 18, 20]. Current work requires users to provide correct input-output examples. Our work modifies these techniques to handle noisy data and to synthesize approximate programs that minimize an objective function over the provided, potentially noisy, data set.
Neural Program Synthesis/ML Approaches:
There is extensive work that uses machine learning/deep neural networks to synthesize programs [3, 6, 15]. These techniques require a training phase and a differentiable loss function. Our technique requires no training phase and can work with arbitrary loss functions including, for example, the 0/1 loss function.

Data Set Sampling or Cleaning:
There has been recent work that aspires to clean the data set or pick representative examples from the data set for synthesis [10, 14, 15], for example by using machine learning or data cleaning to select productive subsets of the data set over which to perform exact synthesis. In contrast to these techniques, our proposed techniques 1) provide deterministic guarantees (as opposed to either probabilistic guarantees as in [15] or no guarantees at all as in [10, 14]), 2) do not require the use of oracles as in [15], 3) can operate successfully even on data sets in which most or even all of the input-output examples are corrupted, and 4) do not require the explicit selection of a subset of the data set to drive the synthesis as in [10, 15].
Active Learning:
Recent research exploits the availability of a reference implementation to use active learning for program synthesis [17]. Our technique, in contrast, works directly from given input-output examples with no reference implementation.
11 CONCLUSION
Dealing with noisy data is a pervasive problem in modern computing environments. Previous program synthesis systems target data sets in which all input-output examples are correct and synthesize programs that match all input-output examples in the data set.

We present a new program synthesis technique for working with noisy data and/or performing approximate program synthesis. Using state-weighted finite tree automata, this technique supports the formulation and solution of a variety of new program synthesis problems involving noisy data and/or approximate program synthesis. The results highlight how this technique, by exploiting information from a variety of sources, including structure from the underlying DSL and information left intact by discrete noise sources, can deliver effective program synthesis even in the presence of substantial noise.
ACKNOWLEDGMENTS
This research was supported, in part, by the Boeing Corporation and DARPA Grants N6600120C4025 and HR001120C0015.
REFERENCES

[1] 2018. SyGuS 2018 String Benchmark Suite. https://github.com/SyGuS-Org/benchmarks/tree/master/comp/2019/PBE_SLIA_Track/from_2018. Accessed: 2020-07-18.
[2] Rajeev Alur, Rastislav Bodik, Garvit Juniwal, Milo M. K. Martin, Mukund Raghothaman, Sanjit A. Seshia, Rishabh Singh, Armando Solar-Lezama, Emina Torlak, and Abhishek Udupa. 2013. Syntax-guided synthesis. In 2013 Formal Methods in Computer-Aided Design (FMCAD). IEEE, 1–8.
[3] Matej Balog, Alexander L. Gaunt, Marc Brockschmidt, Sebastian Nowozin, and Daniel Tarlow. 2016. DeepCoder: Learning to write programs. arXiv preprint arXiv:1611.01989 (2016).
[4] Christopher M. Bishop. 2006. Pattern Recognition and Machine Learning. Springer.
[5] Fred J. Damerau. 1964. A Technique for Computer Detection and Correction of Spelling Errors. Commun. ACM 7, 3 (March 1964), 171–176. https://doi.org/10.1145/363958.363994
[6] Jacob Devlin, Jonathan Uesato, Surya Bhupatiraju, Rishabh Singh, Abdel-rahman Mohamed, and Pushmeet Kohli. 2017. RobustFill: Neural program learning under noisy I/O. In Proceedings of the 34th International Conference on Machine Learning - Volume 70. JMLR.org, 990–998.
[7] Yu Feng, Ruben Martins, Jacob Van Geffen, Isil Dillig, and Swarat Chaudhuri. 2017. Component-based synthesis of table consolidation and transformation tasks from examples. In ACM SIGPLAN Notices, Vol. 52. ACM, 422–436.
[8] John K. Feser, Swarat Chaudhuri, and Isil Dillig. 2015. Synthesizing data structure transformations from input-output examples. In ACM SIGPLAN Notices, Vol. 50. ACM, 229–239.
[9] Giorgio Gallo, Giustino Longo, Stefano Pallottino, and Sang Nguyen. 1993. Directed hypergraphs and applications. Discrete Applied Mathematics 42, 2-3 (1993), 177–201.
[10] Sumit Gulwani. 2011. Automating string processing in spreadsheets using input-output examples. In ACM SIGPLAN Notices, Vol. 46. ACM, 317–330.
[11] Shachar Itzhaky, Sumit Gulwani, Neil Immerman, and Mooly Sagiv. 2010. A simple inductive synthesis methodology and its applications. In ACM SIGPLAN Notices, Vol. 45. ACM, 36–46.
[12] Peter-Michael Osera and Steve Zdancewic. 2015. Type-and-example-directed program synthesis. ACM SIGPLAN Notices 50, 6 (2015), 619–630.
[13] Oleksandr Polozov and Sumit Gulwani. 2015. FlashMeta: a framework for inductive program synthesis. In ACM SIGPLAN Notices, Vol. 50. ACM, 107–126.
[14] Yewen Pu, Zachery Miranda, Armando Solar-Lezama, and Leslie Pack Kaelbling. 2017. Selecting representative examples for program synthesis. arXiv preprint arXiv:1711.03243 (2017).
[15] Veselin Raychev, Pavol Bielik, Martin Vechev, and Andreas Krause. 2016. Learning programs from noisy data. In ACM SIGPLAN Notices, Vol. 51. ACM, 761–774.
[16] D. Shaw. 1975. Inferring LISP Programs From Examples.
[17] Jiasi Shen and Martin C. Rinard. 2019. Using active learning to synthesize models of applications that access databases. In Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation. 269–285.
[18] Rishabh Singh and Sumit Gulwani. 2016. Transforming spreadsheet data types using examples. In ACM SIGPLAN Notices, Vol. 51. ACM, 343–356.
[19] Xinyu Wang, Isil Dillig, and Rishabh Singh. 2017. Program synthesis using abstraction refinement. Proceedings of the ACM on Programming Languages.
[20] Xinyu Wang, Isil Dillig, and Rishabh Singh. 2017. Synthesis of data completion scripts using finite tree automata. Proceedings of the ACM on Programming Languages 1, OOPSLA (2017), 62.
[21] Navid Yaghmazadeh, Christian Klinger, Isil Dillig, and Swarat Chaudhuri. 2016. Synthesizing transformations on hierarchically structured data. In ACM SIGPLAN Notices, Vol. 51. ACM.