Probabilistic Software Modeling: A Data-driven Paradigm for Software Analysis

Hannes Thaller, Lukas Linsbauer, Alexander Egyed
Institute for Software Systems Engineering, Johannes Kepler University Linz, Austria
{hannes.thaller, lukas.linsbauer, alexander.egyed}@jku.at

Rudolf Ramler
Software Competence Center Hagenberg, Austria
[email protected]
Abstract—Software systems are complex, and behavioral comprehension with the increasing amount of AI components challenges traditional testing and maintenance strategies. The lack of tools and methodologies for behavioral software comprehension confines developers to testing and debugging within the boundaries of known scenarios. We present Probabilistic Software Modeling (PSM), a data-driven modeling paradigm for predictive and generative methods in software engineering. PSM analyzes a program and synthesizes a network of probabilistic models that can simulate and quantify the original program's behavior. The approach extracts the type, executable, and property structure of a program and copies its topology. Each model is then optimized towards the observed runtime, leading to a network that reflects the system's structure and behavior. The resulting network allows for the full spectrum of statistical inferential analysis with which rich predictive and generative applications can be built. Applications range from the visualization of states, inferential queries, test-case generation, and anomaly detection up to the stochastic execution of the modeled system. In this work, we present the modeling methodologies, an empirical study of the runtime behavior of software systems, and a comprehensive study of PSM-modeled systems. Results indicate that PSM is a solid foundation for structural and behavioral software comprehension applications.
Index Terms—probabilistic modeling, software modeling, static code analysis, dynamic code analysis, runtime monitoring, inference, simulation, deep learning, normalizing flows
I. INTRODUCTION
Software complexity increases with every requirement, feature, revision, module, or software 2.0 (Artificial Intelligence (AI)) component that is integrated. Complexity-related challenges in traditional software engineering are met by many tools and methodologies that mitigate and alleviate issues (e.g., requirements engineering, version control systems, unit testing). However, tight integration of AI components in programs is still in its infancy, and so are the methodologies and tools that allow combined analysis, development, testing, integration, and maintenance.

We present Probabilistic Software Modeling (PSM), a data-driven modeling paradigm for predictive and generative methods in software engineering. PSM is an analysis methodology for traditional software (e.g., Java [1]) that builds a Probabilistic Model (PM) of a program. The PM allows developers to reason about their program's semantics on the same level of abstraction as their source code (e.g., methods, fields, or classes) without changing the development process or programming language. This enables the advantages of probabilistic modeling and causal reasoning for traditional software development that are fundamental in other domains (such as medical biology, material simulation, economics, or meteorology). PSM enables applications such as test-case generation, semantic clone detection, or anomaly detection seamlessly for both traditional software and AI components with their inherent randomness. Our experiments indicate that PMs can model programs and allow for the causal reasoning and consistent data generation that these applications are built on.

PSM has four main aspects: Code (Structure), Runtime (Behavior), Modeling, and Inference. First, PSM extracts a program's structure via static code analysis (Code). The abstraction level is properties, executables, and types (e.g., fields, methods, and classes in Java) but ignores statements, allowing PSM to scale. Second, it inspects the program's behavior by observing its runtime (Runtime). This includes property accesses and executable invocations. Then, PSM combines this static structure and dynamic behavior into a probabilistic model (Modeling). This step also represents the main contribution of this work. Finally, predictive or generative applications (e.g., a test-case generator or anomaly detector) leverage the models via statistical inference (Inference). The prototype used for the evaluation is called Gradient and is openly available.

First, Section II views our contribution from the perspective of existing related domains. Section III introduces an illustrative example we use throughout this paper. In Section IV we motivate our contribution by providing an outlook on possible applications and research opportunities that PSM enables. Then we briefly discuss the nomenclature and background needed to understand PSM (Section V). Section VI presents the main contribution, containing the general usage pragmatics and construction methodologies for PSM models on a conceptual level. A comprehensive evaluation of whether software can be transformed into statistical models is given in Section VII and discussed in Section VIII. Section XI concludes the paper.

II. RELATED WORK
To position PSM it is useful to distinguish between programming paradigms and software analysis methods. A programming paradigm is a collection of programming languages that share common traits (e.g., object-oriented, logical, or functional programming). Analysis methods extract information from programs (e.g., design pattern detection, clone detection). PSM is an analysis method that analyzes a program given in an object-oriented programming language and synthesizes a probabilistic model from it.

Probabilistic programming is a programming paradigm in which probabilistic models are specified. Developers describe probabilistic programs in a domain-specific language (e.g., BUGS [2]) or via a library in a host language (e.g., Pyro [3], PyMC [4], Edward [5]). In contrast, PSM analyzes a program written in a traditional programming language and translates it into a probabilistic program. This difference also holds for modeling concepts like Bayesian Networks [6] or Object-Oriented Bayesian Networks [7], [8] that can be implemented via a probabilistic programming language.

Gradient prototype: https://github.com/jku-isse/gradient
Formal methods are a programming paradigm that leverages logic as a programming language (e.g., TLA+ [9] or Alloy [10]). Stochastic model checking [11] introduces uncertainty into the rigid formalism to model, e.g., natural phenomena. Developers specify the behavior and provide the state transition probabilities in a special-purpose language (e.g., PRISM [12], PAT [13], CADP [14]). Again, PSM analyzes a program and synthesizes a PM, allowing developers to work with the programming language of their choice.
Symbolic execution [15] is an analysis method that executes a program with symbols rather than concrete values (e.g., JPF-SE [16], KLEE [17], Pex [18]). It can be used to determine which input values cause specific branching points (if-else branches) in a program. Probabilistic symbolic execution [19] is an extension that quantifies the execution, e.g., branching points, in terms of probabilities. This is useful for applications that quantify program changes [20] or performance [21]. Probabilistic symbolic execution operates on the statement level, while PSM abstracts statements, capturing, e.g., the inputs and outputs of methods. This abstraction makes PSM computationally scalable, while symbolic execution suffers from state explosions. Furthermore, this abstraction shifts the analysis focus to the program semantics rather than the statement semantics (e.g., what happens between methods vs. what happens at an if statement).
Probabilistic debugging [22], [23] is an analysis method that supports developers in debugging sessions. The debugger assigns probabilities to each statement and updates them according to the most likely erroneous statement. Again, in contrast to PSM, these approaches operate on the statement level. Another difference lies in the methodology's life cycle. Debugging has an operational life cycle, valid only until the bug is found. PSM and the resulting models are intended to be persisted along with the matching source code revision. This allows, e.g., method-level error localization by comparing multiple revisions of the same model.
Invariant detectors [24], [25], [26], [27], [28], [29] learn assertions and add them to the source code. This helps to pinpoint erroneous regions in the source code. Invariant detectors learn rules of value boundaries of statements (i.e., pre- and post-conditions), not the actual distribution. However, this distribution allows PSM to generate new data, enabling causal reasoning across multiple code elements.

III. ILLUSTRATIVE EXAMPLE
Consider as our running example the Nutrition Advisor that takes a person's anthropometric measurements (height and weight) and returns textual advice based on the Body Mass Index (BMI). Figure 1a shows the class diagram of the Nutrition Advisor, consisting of three core classes and the Servlet class. Classes considered by PSM are annotated with Model (e.g., Person). Figure 1b depicts a sequence diagram of one program trace with concrete values. The Servlet receives properties (e.g., height, weight, or gender) with which it instantiates a Person object (not shown).
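For concreteness, the running example's core logic might look as follows (a minimal Python sketch of the Java classes in Figure 1a; the advice texts and BMI thresholds are illustrative assumptions, not taken from the paper):

```python
class Person:
    """Carries the anthropometric measurements (height in cm, weight in kg)."""
    def __init__(self, height, weight, gender):
        self.height = height
        self.weight = weight
        self.gender = gender


class BmiService:
    def bmi(self, height, weight):
        # Body Mass Index: weight (kg) divided by squared height (m).
        return weight / (height / 100.0) ** 2


class NutritionAdvisor:
    def __init__(self):
        self.bmi_service = BmiService()

    def advice(self, person):
        # Compute the BMI and map it to a textual advice
        # (threshold values are illustrative).
        bmi = self.bmi_service.bmi(person.height, person.weight)
        if bmi < 18.5:
            return "You are underweight, try a ..."
        if bmi < 25.0:
            return "You are healthy, try a ..."
        return "You are overweight, try a ..."
```

For the trace in Figure 1b, `advice(Person(168.59, 69.54, "Female"))` computes a BMI of roughly 24.47 and returns the "healthy" advice.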
NutritionAdvisor.advice(·) takes this Person object, extracts the height (168.59) and weight (69.54), and computes the person's BMI (24.466) via BmiService.bmi(·). The result is a textual advice based on the BMI ("You are healthy, try a ..."). Note that, for the sake of simplicity, Figure 1a only shows a subset of the code elements from the real Nutrition Advisor (e.g., Person.name or Person.age are omitted). Given a program such as the Nutrition Advisor, PSM can be used to build a network of probabilistic models with the same structure and behavior.

IV. MOTIVATING APPLICATIONS
PSM is a generic framework that enables a wide range of predictive and generative applications. This section lists a selection of possible applications.
A. Predictive Applications
Predictive applications seek to quantify, visualize, infer, and predict the behavior and quality of a system.
Visualization and Comprehension [30], [31], [32] applications help to understand software and its behavior. This includes the visualization of code elements and non-functional attributes such as performance. The PMs are the source of the visualization, showing the global but also the contextual behavior across code elements. For example, Figure 2b visualizes the height property, in which typical and less typical values can be seen at a glance. P(Height | Gender = Female) visualizes context-aware behavior, i.e., how gender affects height.

Semantic Clone-Detection [33], [34] applications detect syntactically different but semantically equivalent code fragments, e.g., the iterative and recursive versions of an algorithm. Traditionally, clone detection compares source code fragments, focusing on exact or slightly adapted clones. However, semantic equality is beyond purely static properties of source code. PSM can detect method-level clones by comparing their models. The comparison can be realized, for example, via statistical tests on sampled data [35], [36], [37] (simple automated decision), via visualization techniques such as Q-Q plots [38] (comprehensive manual decision), or a combination of these.
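As an illustration of the statistical-test route, a two-sample Kolmogorov-Smirnov statistic can compare outputs sampled from two method models (a self-contained sketch; the 0.1 decision threshold is an arbitrary assumption, not from the paper):

```python
def ks_statistic(xs, ys):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum distance
    between the empirical CDFs of the two samples (0 = identical)."""
    xs, ys = sorted(xs), sorted(ys)
    n, m = len(xs), len(ys)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        v = min(xs[i], ys[j])
        while i < n and xs[i] == v:  # step both CDFs past the value v
            i += 1
        while j < m and ys[j] == v:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d


def clone_candidates(samples_a, samples_b, threshold=0.1):
    """Flag two executables as semantic-clone candidates if samples
    drawn from their models are distributed (nearly) identically."""
    return ks_statistic(samples_a, samples_b) < threshold
```

Two syntactically different implementations of the same algorithm would yield near-identical sample distributions and hence a statistic close to zero.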
Anomaly Detection [24], [39], [40], [41] applications measure the divergence between a persisted PSM model and a newly collected observation. These applications can be deployed into a live system, in which components are monitored and checked against their models. A threshold checks for unlikely runtime observations x (i.e., p(Weight = weight_new) < θ for some threshold θ), triggering additional actions in case of a failure. x and its effects on other elements can then be investigated with, e.g., visualization and comprehension techniques, for further decision-making processes.

Figure 1: The Nutrition Advisor receives a person with its anthropometric measurements and computes a textual advice regarding the person's diet. For simplicity some properties and executables are omitted. (a) [Code] The static structure of the Nutrition Advisor, consisting of three core classes and a context class (e.g., a web-interface) calling the program: NutritionAdvisor (bmiService: BmiService; advice(person: Person): String), Person (height: float; weight: float; gender: String), BmiService (bmi(height: float, weight: float): float), and Servlet (handle(...)). (b) [Runtime] The dynamic behavior of the Nutrition Advisor, visualized by one execution trace. The NutritionAdvisor handles advice requests in which Person objects are received and a textual advice is returned.
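The thresholding idea can be sketched with a univariate Gaussian standing in for the fitted model of a single property (model choice, data values, and threshold are illustrative assumptions):

```python
import math


def fit_gaussian(observations):
    """Stand-in for model fitting: estimate mean and std from runtime data."""
    n = len(observations)
    mean = sum(observations) / n
    std = math.sqrt(sum((o - mean) ** 2 for o in observations) / n)
    return mean, std


def log_density(x, mean, std):
    """Log-density of the stand-in Gaussian model."""
    z = (x - mean) / std
    return -0.5 * z * z - math.log(std * math.sqrt(2.0 * math.pi))


def is_anomalous(x, model, log_threshold=-10.0):
    """Flag runtime observations that are unlikely under the persisted model."""
    mean, std = model
    return log_density(x, mean, std) < log_threshold


# Monitored Person.weight values (illustrative data).
model = fit_gaussian([62.1, 70.4, 68.9, 75.2, 71.3, 66.8])
```

Here `is_anomalous(70.0, model)` is False, while an implausible observation such as `is_anomalous(500.0, model)` is True and would trigger further investigation.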
B. Generative Applications
Generative applications leverage observations drawn from the models, e.g., executable inputs or property values.

Test-Case Generation [42], [43] applications draw observations from executable and property models to generate test data. PSM can generate scoped test data with a specific likelihood or for a specific system scenario (system state). For instance, likelihood-scoped data can be used to generate different test suites such as typical, rare, or unseen by sampling x < P(Person) = P(Height, Weight) < y, where x and y are predefined boundaries of the likelihood. This helps to strengthen test suites with meaningful, automatically generated tests based on real (un)likely behavior.

Simulation applications sample execution traces from the network of models in a structured fashion to reproduce the running system. This probabilistically executes the original program without actually running it. Simulations can bridge boundaries between hardware and software interfaces, reducing the number of hardware dependencies during development.

V. BACKGROUND
PSM combines two major domains: Software Engineering (SE) and Machine Learning (ML). Naturally, some terms can be misinterpreted depending on the reader's background. The following terminology was chosen as the best common ground and might be untypical in the respective domain.
A. Code
Types, properties, and executables are object-oriented terms (e.g., classes, fields, and methods in Java [1]; see Figure 1a). In the context of PSM, these are referred to as code elements. These code elements can be organized in an Abstract Semantics Graph (ASG), which is a high-level version of an abstract syntax tree (AST). An ASG contains no lexical nodes but has additional semantic relationships (e.g., typing information of expressions). Also, in the context of PSM, we define that each code element has a symbol. A symbol is a numerical identifier, e.g., Symbol(Person.weight) = 0.

B. Runtime
Runtime monitoring (or dynamic code analysis) [44] is the process of observing a running program. The program is executed by a trigger (parameters and environment), which is the context of the monitoring session. A running program spawns event streams, which are sequences of monitoring events (e.g., Figure 1b). These events contain information such as properties that were changed or executables that were invoked. Also, the stream shows which parts of the underlying source code are active under the given trigger. Tracing tracks every possible event at runtime, whereas sampling records events according to a specific rate.
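A tracing monitor can be sketched as a decorator that appends an invocation event to the stream on every call (an illustrative stand-in; the prototype uses AspectJ weaving, see Section VII):

```python
import functools

EVENT_STREAM = []  # sequence of monitoring events


def trace(fn):
    """Tracing records every invocation; a sampling monitor would
    instead record only a fraction of the calls (a specific rate)."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        result = fn(*args, **kwargs)
        EVENT_STREAM.append({
            "executable": fn.__qualname__,
            "inputs": args,
            "return": result,
        })
        return result
    return wrapper


@trace
def bmi(height, weight):
    return weight / (height / 100.0) ** 2
```

Calling `bmi(168.59, 69.54)` appends one event carrying the executable's name, its inputs, and its return value, mirroring the trace in Figure 1b.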
C. Modeling

A probabilistic model uses the theory of probability to model a complex system (e.g., the Nutrition Advisor). A random variable X_i ∈ X (e.g., Weight) captures an aspect of the system's event space. The value range of a random variable is given by Val(X_i) (e.g., Val(Weight) = {i | i ∈ R}). A probability distribution P is a mapping from events in the system to real values (e.g., the histogram elements in Figure 2b map to points on the Fitted Distribution line). These values lie between 0 and 1, and all values sum up to 1. The marginal distribution P(X_i) describes the probability distribution of the random variable X_i (e.g., P(Weight)). The joint distribution P(X_1, ..., X_n) represents the probability distribution that can be described with all of the variables (e.g., P(Weight, Height)). A conditional distribution P(X | Y) describes the probability distribution of X given that some additional information about the random variable Y was observed (e.g., P(Weight | Height = 193 cm)). Y is called the conditional and scopes the distribution of X. More background information is given, e.g., by Koller and Friedman [6], Murphy [45], or Bishop [46].

PSM is mostly interested in the conditional distribution of a code element given its invoking context, e.g., a property access P(Weight | C) with its context C (e.g., the advice method). A probabilistic expression such as P(Weight | C) is equivalent to pseudocode in SE. They describe a process (e.g., a sorting algorithm) that can be parameterized with a concrete implementation and technology (e.g., a functional implementation in Haskell [47] or an object-oriented implementation in Java [1]). Similarly, P(Weight | C) can be parameterized via a stochastic model representing its quantity and, in hindsight, its process.

This work presents the modeling strategies (see Section VI) in the form of probabilistic expressions that our prototype parameterizes via Real Non-Volume Preserving Transformations (NVPs) [48]. NVPs are density estimators that allow efficient and exact inference, sampling, and likelihood estimation of data points. NVPs learn an invertible bijective function f: X ↦ Z (with g = f⁻¹) that maps the original input variable x ∈ X to simpler latent variables z ∈ Z. The latent variables often follow an isotropic unit-norm Gaussian N(0, 1), which is well understood in terms of sampling and likelihood evaluation. An NVP is a combination of multiple small neural networks, called coupling layers, that are combined by simple scale and translation transformations. Conditional NVPs are an extension that estimates P(X | C).

Figure 2: The Nutrition Advisor system as a Probabilistic Model Network (left) and the model of the Person.weight node (right). (a) [Modeling] The Probabilistic Model Network of the Nutrition Advisor (simplified). Elements within the Probabilistic Modeling Universe are modeled according to their probabilistic expressions, e.g., P(Advice) = P(Height^R, Weight^R, Bmi^Inv, Advice^Ret), P(Bmi) = P(Height^Pa, Weight^Pa, Bmi^Ret), and P(Person) = P(Height, Weight, Gender). Triangles are properties, circles are executables, and rectangles are types. The superscripts represent property reads (R), executable invocations (Inv), parameters (Pa), and return values (Ret). (b) [Modeling] The distribution of the Person.weight property. The histogram shows the runtime observations that were sampled from the True Distribution (usually unknown). The Fitted Distribution is the model approximation based on the data.

D. Inference
Every PSM application in Section IV is built upon inference. Inference is the combination of sampling, conditioning, and likelihood evaluation. Each node in a PSM network is an NVP. Sampling with NVPs is done by sampling from the Gaussian latent space, z ∼ N(0, 1), and applying the NVP in inverse, x = g(z). NVPs can be conditioned statically and dynamically. Static conditioning is achieved by adding additional features to the network during training. Dynamic conditioning finds latent-space configurations that match the condition via, e.g., variational inference [49], [50]. Finally, likelihood evaluation is achieved by evaluating the likelihood under the Gaussian latent space times the NVP's Jacobian determinant:

    p_X(x) = p_Z(f(x)) |det(∂f(x) / ∂x^T)|    (1)

More details are given by Dinh et al. [48].

VI. APPROACH
PSM is a four-fold approach, illustrated in Figure 3, in which:
1) [Code] static code information is extracted and analyzed;
2) [Runtime] runtime behavior is collected and transformed;
3) [Modeling] probabilistic models are built by combining code and runtime data;
4) [Inference] applications are built by leveraging causal reasoning and data generation.
The main contributions of this work are the concepts and realizations in the Modeling aspect.
A. Code
The input is the Source Code (1) of a program (e.g., of the Nutrition Advisor). Then, Static Code Analysis extracts the Program Structure (2) in the form of an ASG. The class diagram in Figure 1a may act as an abstract substitute for the structure in this example. Elements that are to be modeled are annotated with the label Model. In that regard, PSM is selective about the code elements considered for static and dynamic code analysis. The selection depends on the application context (see Section IV) or the developer's interest. The set of all code elements PSM considers is called the Modeling Universe.

B. Runtime

Dynamic Code Analysis extracts the
Runtime Behavior (3) by executing the program with a trigger and monitoring the internal events. This results in an event stream similar to the sequence diagram in Figure 1b. Events are property accesses and executable invocations of code elements in the modeling universe. Depending on the application context (see Section IV), execution triggers can be, e.g., tests (weak) or the runtime of a deployed system (strong). For example, Visualization and Comprehension demands a trigger as close as possible to the real environment (manual understanding). In contrast, Semantic Clone-Detection makes differential comparisons between models, where synthetic data suffices (automatic comparisons).

Figure 3: Source Code (1) has a Program Structure (2) and a Runtime Behavior (3) that are extracted via Static and Dynamic Code Analysis. These result in a Probabilistic Model Network (empty) (4) and Behavior Datasets (5) that are combined by Model Parameter Optimization (6) into the final Probabilistic Model Network (fitted) (7). Applications leverage the fitted network via causal forward and backward reasoning with marginal and conditional generation.
C. Modeling
PSM extracts from the Program Structure the code element topology and builds the Probabilistic Model Network (empty) (see Figure 3, step 4). From a software engineering perspective, this process is comparable to traversing the ASG and attaching an empty (unfitted) PM to every node. An example network is shown in Figure 2a, where each node is a PM. The actual construction rules (probabilistic expressions) to build such a PSM network are given below (Section VI-C1). The Dataset Creation step tallies and pre-processes the event stream into Behavior Datasets (5) for each code element. The Model Parameter Optimization (6) fits each PM, i.e., each node in the Probabilistic Model Network, to the Behavior Dataset of the associated code element. This results in the Probabilistic Model Network (fitted) (7) with the same topology found in the Program Structure, optimized towards the observed Runtime Behavior.
1) Construction Rules: The construction rules define how each node in the Probabilistic Model Network (4), i.e., a given code element, is transformed into a probabilistic expression. This expression is a description of the model (random) variables and their approximating quantity (e.g., see Figure 2a). Hence, building the PM network equals 1) a traversal of the program's ASG; 2) an application of the construction rules creating a probabilistic expression (per node); and 3) the parameterization of the expressions with a concrete model (e.g., VAEs).

The property construction rule defines a property model by the property value itself, conditioned on the symbol of the accessing executable (conditional):

    P(Property | C) = P(R, W | C)    (2)

R and W are the read and write accesses to the property. For example, the Person.weight model is defined by P(Weight | C). The value range of the property depends on the property itself, whereas the range of the conditional is all (executable) symbols that exist in the project, Val(C) = Symbols(Project). This includes executable symbols that live outside the PSM Universe. The conditional allows PSM to differentiate between call sites, allowing each call site to have a different distribution. For example, NutritionAdvisorAdolescence and NutritionAdvisorAdult use the BmiService, leading to two slightly shifted weight distributions in the same model.

The executable construction rule defines an executable model by a joint distribution of the inputs and outputs, conditioned on the symbol of the invoking executable:

    P(Executable | C) = P(I, O | C)    (3)
                      = P(Pa, Inv, R, W, Ret | C)    (4)

where the inputs are I = (Pa, Inv, R) and the outputs are O = (W, Ret). Pa are parameters, Inv are (executable) invocations, R are property reads, W are property writes, Ret are the return values, and C are all (executable) symbols that exist in the project, Val(C) = Symbols(Project). An example is the bmi method with P(Bmi | C) = P(Height^Pa, Weight^Pa, Bmi^Ret | C), where Val(C) = {Symbol(NutritionAdvisor.advice)} (see Figure 2a).

The type construction rule defines a type model by the joint distribution of the properties the type declares, conditioned on the symbol of the accessing executable:

    P(Type | C) = P(Property^Type | C)    (5)

For example, a Person object is defined by P(Height, Weight). The type distribution is empty in the case where no properties exist, as Figure 2a shows for the bmiService property in NutritionAdvisor. Sampling from a type distribution instantiates a new object of a given type by assigning the sampled values to the properties.

2) Technical Modeling Considerations: PSM estimates the density of the values that code elements emit during runtime in the form of generative models. It searches for a model from which new samples can be drawn and that compresses the original monitoring data into a fixed set of parameters. This goal stipulates a set of requirements with which the network nodes can be parameterized. The model should be a scalable, parametric, decidable, generative, (conditional) density estimator:
• Scalable such that it can handle the enormous amounts of data running systems produce.
• Parametric such that it has a fixed set of parameters that can be stored and shared.
• Decidable such that the parameter optimization has a clear convergence criterion.
• Generative such that it allows for efficient sampling from the approximated distribution.
• A (conditional) density estimator that is capable of approximating arbitrary data.
In addition, the learning process should be as robust as possible to reduce human intervention. Each requirement is tied to functional (generative, density estimator) or non-functional (scalable, parametric, decidable) requirements of PSM. One class of models that fits many of these requirements is likelihood-based deep generative networks such as Variational Auto-Encoders (VAEs) [51], [52], [53] or flow-based methods like the Real Non-Volume Preserving Transformation [48] and its derivatives [54], [55], [56].

Another technical consideration is that Equations 2, 4, and 5 can be factorized into each other. That is, a real implementation does not need a model for each property, executable, and type but may combine them into one model. The prototype in this work uses exclusively executable models (see Section VII).
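The flow mechanics behind these models can be sketched with a single affine coupling in two dimensions (toy closed-form scale and translation functions stand in for the learned coupling networks; a real NVP stacks many such couplings):

```python
import math


def s(x1):
    """Toy scale function (in a real NVP a small neural network)."""
    return 0.5 * math.tanh(x1)


def t(x1):
    """Toy translation function (in a real NVP a small neural network)."""
    return 0.25 * x1


def forward(x1, x2):
    """f: X -> Z. x1 passes through; x2 is scaled and shifted by
    functions of x1, which keeps the transformation invertible."""
    return x1, x2 * math.exp(s(x1)) + t(x1)


def inverse(z1, z2):
    """g = f^(-1): Z -> X, exact by construction (used for sampling)."""
    return z1, (z2 - t(z1)) * math.exp(-s(z1))


def log_likelihood(x1, x2):
    """Eq. (1) in log-space: log p_X(x) = log p_Z(f(x)) + log|det df/dx|.
    For one affine coupling the log-determinant is simply s(x1)."""
    z1, z2 = forward(x1, x2)
    log_pz = sum(-0.5 * v * v - 0.5 * math.log(2.0 * math.pi) for v in (z1, z2))
    return log_pz + s(x1)
```

Sampling draws z from N(0, 1) and applies `inverse`; likelihood evaluation applies `forward` and adds the log-determinant, exactly the three inference operations PSM relies on.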
D. Inference
Inference is the fundament of all applications motivated in Section IV and illustrated in Figure 3. The three tightly connected main aspects of inference are sampling (generation), conditioning (information propagation), and likelihood evaluation (criticism). Sampling draws observations from one (local) or multiple (global) nodes (NVPs) in the PSM network. This enables the probabilistic execution of, e.g., an executable or a subsystem. Conditioning sets the models into a specific state. For example, Figure 2b illustrates the height property in its unconditioned and conditioned states. Local conditioning sets one node into a state. Global conditioning propagates a state across multiple nodes. Likelihood Evaluation quantifies samples in terms of their likelihood under a given node (i.e., a model). Figure 3 illustrates these aspects and combines them into causal forward (8) and backward (9) reasoning. Forward reasoning (8) (e.g., Person.height to BmiService.bmi) samples a conditional distribution and propagates it through the network to set downstream nodes into a conditioned state. Backward reasoning (9) starts at a conditioned downstream node and searches for the most likely cause. At every step it is possible to draw conditional or unconditional samples. The directional aspect (forward and backward) is based on the source code's dependency graph.
PSM networks, however, are undirected (a network of joint distributions).

Table I: Hyper-parameters used in the experiments.

#   Component            Parameter                 Value
3   Preprocessing        Number Standardization
4   Preprocessing        Discretization Threshold  16
5   Preprocessing        Discretization Encoding   Base 10
6   Preprocessing        Text Encoding             Base 10
7   Optimizer            Algorithm                 Adam [57]
8   Optimizer            Learning Rate
10  Optimizer            Batch Size                full dataset
11  Optimizer            Max Epoch
12  Optimizer            Early Stopping Patience   20 epochs
13  NVP [48]             Coupling Count
14  Coupling Layer [48]  Linear Layer Count
15  Coupling Layer [48]  Hidden Units Count
16  Coupling Layer [48]  Latent-Space              N(0, 1)
17  Coupling Layer [48]  Translation Activations   Gelu [59]
18  Coupling Layer [48]  Scale Activations         Gelu [59], Tanh

VII. STUDY
The core hypothesis of PSM is that programs can be transformed into a probabilistic model. This study (i.e., the prototype, research questions, analyses, and discussions) focuses on evaluating the core PSM methodologies presented in Section VI. Specifically, the study answers the following questions, providing evidence for the core hypothesis:
RQ1 [Code] Are projects exposing enough code elements that are eligible for PSM?
RQ2 [Runtime] Are code elements creating enough runtime data with which the model parameters can be optimized?
RQ3 [Modeling] Are probabilistic models capable of capturing the runtime data of eligible code elements?
RQ4 [Inference] Is the network of probabilistic models capable of solving inferential tasks?
RQ1 addresses the precondition of whether projects expose enough data (i.e., number or text) code elements that can be modeled. RQ2 addresses the precondition of whether these (data) code elements create a sufficient amount of runtime data that can be modeled. RQ3 addresses the central question of whether the behavior of a program, in the form of its runtime data, can be approximated via the concrete models. Finally, RQ4 evaluates the usefulness of the approach and whether PSM is a sound basis for the applications presented in Section IV. The four questions are scoped to structured programs that can be executed and support runtime monitoring. The empirical evidence in this work is essential for any future endeavor related to statistical modeling of software. The evaluation of concrete applications of PSM described in Section IV is beyond the scope of this study.
A. Setup
We implemented a prototype called Gradient that reflects the process and data flow presented in Figure 3. The accompanying benchmark suite is available at https://github.com/jku-isse/gradient-benchmark.

Table II: Overview of the projects used in the study. LoC are the lines of code in a project.

Project (Version) | Data/Ref/Unk/Total | Properties (Data/Ref/Unk/Total) | Parameters (Data/Ref/Unk/Total) | Executables (Data/Ref/Void/Unk/Total)
Nutrition Advisor (0.1.0) | | | |
Structurizr (1.0.0) | 115 9941 123 | 229 / 85 / 24 / 338 | 725 / 342 / 26 / 1093 | 320 / 302 / 508 / 20 / 1150
jLatexmath (1.0.7) | 156 / 21 / 369 / 191 | 490 / 121 / 81 / 692 | 1115 / 556 / 153 / 1824 | 269 / 416 / 511 / 59 / 1255
PMD (6.5.0) | 799 / 89 / 349 / 981 | 1858 / 503 / 481 / 2842 | 2933 / 2910 / 1943 / 7786 | 3222 / 719 / 3445 / 2073 / 9459
Total | 1075 / 120 / 813 / 1300 | 2588 / 712 / 587 / 3887 | 4792 / 3809 / 2122 / 10 723 | 3821 / 1437 / 4483 / 2153 / 11 894

Data = {Number, Text}, Ref = Reference, Unk = Unknown
1) The input Source Code were open-source subject systems written in Java (see Section VII-B).
2) The Program Structure was extracted using Spoon [60].
3) AspectJ 1.9.1 was used to weave monitoring aspects (tracing) into the subject systems to capture their Runtime Behavior in the modeling universe.
4) The (empty) Probabilistic Model Network was created by applying the rules from Section VI-C1 for each code element. Shape and size of the NVPs are given in Table I.
5) The Behavior Datasets were created by tallying the event stream. This includes splitting the dataset into training and evaluation partitions and preprocessing them. Preprocessing consisted of encoding text features by enumerating them (starting from 0) and encoding them in a base-10 vector space. The same procedure was applied to the conditional dimension. Number dimensions were considered discrete if at most a threshold number of distinct values were found, and they underwent the same base-10 encoding procedure. Finally, all dimensions were standardized to a mean of zero and a standard deviation of one.
6) Model parameters were optimized with their datasets, and the best parameter setting (w.r.t. evaluation performance) was retained.
7) Finally, the persisted models were used in the analysis scenarios (see Section VII-E3).

Hyper-parameters of the experiments are given in Table I. The chosen values are based on additional non-reported experiments evaluated on a synthetic dataset. All experiments were executed on a single machine (Intel i7, Nvidia GTX 970).

B. Subject Systems
The study uses four subject systems listed in Table II. Nutrition Advisor is the running example introduced in Section V. Structurizr [61] is a developer-focused software architecture visualization tool. jLatexmath [62] is a library for rendering LaTeX formulas. PMD [63] is a static code analysis tool for Java applications.

All code elements of the projects were included in the modeling universe (excluding inherited third-party elements). Nutrition Advisor received advice requests as a trigger with data based on the NHANES [64] dataset. jLatexmath and Structurizr were executed with examples provided in their documentation. PMD analyzed the Nutrition Advisor and output the results in HTML format. The subject systems and their triggers are openly available as a benchmark suite for future experiments and comparisons.

C. Controlled Variables
The study controls for one variable: Capacity.
• Capacity: The capacity describes the number (low = 32, high = 128) of units in the linear layers of the NVPs.

D. Response Variables
The response is split into a quantitative and a qualitative part. The quantitative part evaluates the Events per Code Element (ECE), Distinct Values per Code Element (DCE), and Negative Log-Likelihood (NLL). The qualitative part assesses the visual fidelity of the samples generated by the model compared to the original dataset and evaluates the usefulness of the PSM network via a scenario-based evaluation given in Section VII-E4.
• Events per Code Element (ECE): Measures the number of events emitted by code elements. This provides insight into the runtime activity of elements and how many models need to be fitted. We report ECE1 and ECE10 to distinguish between dependencies/constants and real behavior-carrying code elements. ECE1 includes all code elements with at least one event (all active code elements at runtime). ECE10 includes only code elements that emitted at least 10 events at runtime.
• Distinct Values per Code Element (DCE): Measures the number of distinct values emitted by code elements. This provides insight into the capacity models must have. We report DCE1 and DCE10, where DCE10 includes code elements with at least 10 distinct values.
• Average Negative Log-Likelihood (NLL): Measures the average Negative Log-Likelihood (Equation 1) of data points under the model in natural units of information (nats; lower is better).
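For intuition, the average NLL over N data points is -(1/N) * sum of log p(x_i); a minimal sketch (our own illustration, not the prototype's code):

```python
import math

def average_nll(densities):
    """Average negative log-likelihood in nats: -(1/N) * sum(log p(x_i)).
    `densities` are the model densities p(x_i) evaluated at the data points."""
    return -sum(math.log(p) for p in densities) / len(densities)

# Data points in high-density regions lower the average NLL (lower is better).
print(average_nll([0.5, 0.25, 0.125]))  # ≈ 1.39 nats
```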
E. Experiment Results
The study results are split into four groups: Code, Runtime, Modeling, and Inference.
1) Code:
The projects contained a total of 27 804 property, parameter, and executable code elements. PMD is the largest project, containing 76 % of the total code elements. Nutrition Advisor is the smallest project, containing .25 %. Most elements were executables (43 %) or parameters (39 %). 42 % of the elements were data elements, i.e., had either a number or text type that is eligible for PSM modeling. 22 % were references within the modeling universe, and the remaining 36 % were elements of unknown type that were not within the modeling universe. Table II shows detailed results per subject system, element type, and data type.

Table III: Events are the number of events observed at runtime. ACT10 are the number of events observed at runtime on code elements with at least 10 events. DCT10 are the number of distinct values on code elements with at least 10 distinct values.

Project | Type | Events (Mdn / Q1 / Q3 / Total) | ACT10 (Mdn / Q1 / Q3 / Total) | DCT10 (Mdn / Q1 / Q3 / Total)
Nutrition Advisor | Data | | |
 | Others | | |
Structurizr | Data | | |
 | Others | 12 / 3 / 36 / 58 489 | 34 / 17 / 104 / 57 607 | 29 / 16 / 59 / 3331
jLatexmath | Data | 130 / 15 / 526 / 6 415 336 | 274 / 61 / 1297 / 6 414 919 | 39 / 18 / 81 / 24 495
 | Others | 66 / 6 / 530 / 1 377 280 | 257 / 56 / 1064 / 1 376 553 | 107 / 30 / 408 / 42 592
PMD | Data | 35 / 5 / 154 / 15 069 591 | 117 / 37 / 267 / 15 068 209 | 39 / 18 / 91 / 24 511
 | Others | 18 / 5 / 117 / 1 882 176 | 64 / 20 / 185 / 1 879 058 | 30 / 16 / 123 / 69 569
Total | | 21 / 5 / 138 / 24 868 732 | 83 / 25 / 306 / 24 861 389 | 39 / 17 / 102 / 176 052

Mdn = Median, Q1/3 = Quartile; Data = {Number, Text}, Others = {Reference, Unknown}
2) Runtime:
Monitoring sessions lasted for a median duration of .55 s (IQR = . to .) and were concurrently executed with the modeling sessions of other projects. The median processing speed was 25 101 events per second (IQR = 24 727 to 26 283).

During the monitoring session, a total of 24 868 732 events were emitted from code elements (22 % of total code elements). 36 % of the code elements emitted data (text or number) events. 68 % of the events were generated by the PMD project, while the least events were generated by the Nutrition Advisor (.12 %). 87 % of the events were data (text or number) events, while the remaining 13 % were either reference or unknown events.

The event analysis shows that most of the events (24 861 389) occurred on a small share (14 % of total) of code elements. This excludes elements that emitted less than 10 events (ECE10). 36 % of the code elements generated data (text or number) events. Percentages for the largest and smallest projects, as for the data types, match those of the events. Differences are given in Table III in terms of the central tendencies.

The distinct value analysis shows that a total of 176 052 distinct values were generated by code elements (.29 %). This excludes elements that emitted less than 10 events (DCE10). 44 % of the code elements generated data events. Most of the distinct values come from the PMD project, which makes up 53 %. The least distinct values were generated by Structurizr with .32 %. Distinct values related to Data were encountered 34 % of the time, while others were encountered 66 % of the time.
3) Modeling:
Table IV contains the detailed results of the low capacity setting and the margins for the high capacity setting. The total wall time to optimize the parameters of all models was 195 min (111 min for high capacity). The median time one model needed to optimize in the low capacity setting was Mdn = . , IQR = . to . (Mdn = . , IQR = . to . for high capacity).

A total of 774 models were fitted. PMD accounted for 74 % of the models. In sum, 680 080 data points were used in the process, where Nutrition Advisor had the most data points available per model. A total of 3480 dimensions exist across all models, where PMD accounts for 72 % of all dimensions. However, the Nutrition Advisor models had the highest number of dimensions per model. 62 % of the dimensions were related to continuous features and the remainder to discrete features. A total of 12 787 800 parameters were used (Mdn = 15 780, IQR = 15 000 to 16 560) in the low capacity setting. The high capacity setting had a total of 165 172 056 parameters (Mdn = 210 468, IQR = 207 384 to 213 552). Finally, all projects yielded a total test NLL of −. (low capacity). On average, the models found in the PMD project had the best NLL with −. and the worst in Structurizr with −. (lower is better). No significant divergence between training and test NLL can be seen.

The qualitative inspection of the models revealed a good approximation with two caveats. First, imprecisions in the approximations are given for categorical dimensions that include high-mass levels. The high-mass levels cause an increase of mass in the surrounding levels compared to the original data. Proximity in categorical data is introduced by the 10-ary encoding and the continuous nature of NVPs. Second, imprecisions are given in continuous dimensions where disconnected high-density modes become connected. This issue occurs more frequently in the low capacity setting than in the high capacity setting, indicating underfitted models.
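The proximity artifact follows from the base-10 encoding used in preprocessing, which can be sketched as follows (our own illustration; the digit count is an assumption):

```python
def encode_base10(level_index, digits=2):
    """Encode an enumerated categorical level (0, 1, 2, ...) as a base-10
    digit vector, e.g. level 42 -> [4, 2]. Numerically adjacent levels share
    digits, which induces the proximity effect discussed above."""
    return [(level_index // 10 ** i) % 10 for i in reversed(range(digits))]

assert encode_base10(42) == [4, 2]
assert encode_base10(7) == [0, 7]
# Levels 20 and 21 differ in one digit, while 19 and 20 differ in both:
# categorical levels acquire an artificial notion of distance.
```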
4) Inference:
The qualitative assessment of the inference capabilities of PSM is split into two scenarios, presented in Figure 4 and Figure 5. These scenarios extend the running example by adding the
Servlet to the Modeling Universe.

Table IV: Model analysis results split across projects and capacity. Lower is better for NLL results.

Capacity | Project | Models | Data Points (Mdn / Q1 / Q3 / Total) | Dimensions (Mdn / Q1 / Q3 / Total) | Training NLL | Test NLL
Low | Nutrition Advisor | | | | |
Low | Structurizr | 50 | 67 / 31 / 137 / 14 715 | 3 / 3 / 4 / 179 | |
Low | jLatexmath | 146 | 393 / 82 / 1248 / 206 820 | 4 / 3 / 7 / 763 | |
Low | PMD | 574 | 133 / 56 / 337 / 454 545 | 4 / 3 / 5 / 2511 | |
Low | Total | 774 | 151 / 56 / 472 / 680 080 | 4 / 3 / 5 / 3480 | |
High | Total | | | | |

Mdn = Median, Q1/3 = Quartile

Figure 4: Shows an inference example with a condition caused by a latent variable starting at the handle-method. Gender, only accessible in the handle-method, is conditioned to females. Height and weight are propagated while bmi jointly adapts to the condition. The last column shows a roundtrip of 10 (40 propagation hops) and its effect compared to the original distribution.

The first scenario in Figure 4 shows a simulation in which the Nutrition Advisor is conditioned on requests from women. The circles at the top illustrate the original call hierarchy and parts of the PSM network from Figure 1a. Each node was fitted on the original data without any restrictions or conditions. The contour plots below show the height and weight variables in each model conditioned by gender (see Figure 5 for the unconditioned version). The density plots at the bottom present the bmi variable of the same respective model. In the background is the original unconditioned distribution (i.e., including males).
Only the handle-model has direct access to the gender property. By iteratively sampling n observations, propagating them, and conditioning the next model, the original conditional information (i.e., Person.gender = Female) flows through the network. This equals n (probabilistic) executions of the program. Finally, Figure 4 on the right shows the degree of information degradation in a forward and backward inference setting with 10 round-trips (40 information hops). Centers and shape are mostly preserved, but a slight shift of variance can be seen. The density of the bmi variable was preserved over the 40 hops without any crucial loss of information.

The second scenario in Figure 5 assumes that Servlet and NutritionAdvisor are developed by Company A, while BmiService is developed by Company B, which is specialized in AI. Company A uses the simple height/weight formula to stub the
BmiService until Company B delivers its service based on a regression model. Company A has a PSM model M_null of the system. Company A builds a second revision M_alt of its PSM model, including the new component it received from Company B (BmiService). The automated compatibility checks during continuous integration failed for the bmi code elements (in bmi(...) and advice(...)) but succeeded for all other elements. Revisiting the call graph in reverse order reveals a semantic error in the new component, illustrated in Figure 5. The inputs match (contour plots on the left), but the outputs diverge drastically (density plot on the right). The issue was that Company A uses the metric measurement system while Company B uses the imperial system.

Figure 5: Shows an example of semantic testing and criticism where the null-model and the alt-model come from different teams. The clear difference between the return values was detected automatically and works indifferently with traditional software as with software 2.0.

The scenario is based on real data. However, the regression model was substituted by the simple BMI formula given in its imperial form. Compatibility checks were done with Kolmogorov-Smirnov tests [37].

The remarkable aspect of this scenario is the ignorance of PSM regarding the true underlying implementation (code vs. AI model). Unit tests of the component and integration tests of depending components would need to ask the model for the correct assertion values given an input. Not only are these tests flawed, but every update of the model's parameters would trigger cascading changes in the tests. In contrast, PSM tests the behavior, not the code (semantic tests).
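A compatibility check of this kind can be sketched as a two-sample Kolmogorov-Smirnov test [37] (our own NumPy illustration, not the prototype's code; the BMI formulas mirror the scenario, while the sample sizes, distributions, and significance threshold are assumptions):

```python
import numpy as np

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum distance
    between the empirical CDFs of the two samples."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return np.max(np.abs(cdf_a - cdf_b))

rng = np.random.default_rng(0)
height_cm = rng.normal(170, 10, 1000)          # same inputs for both revisions
weight_kg = rng.normal(70, 12, 1000)

bmi_null = weight_kg / (height_cm / 100) ** 2  # Company A's metric stub
bmi_alt = 703 * weight_kg / height_cm ** 2     # imperial formula fed metric data

# With n = m = 1000, the approximate 1% critical value is 1.63 * sqrt(2/1000) ≈ 0.073.
assert ks_statistic(bmi_null, bmi_alt) > 0.073   # check fails: the outputs diverge
assert ks_statistic(bmi_null, bmi_null) == 0.0   # identical behavior passes
```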
VIII. DISCUSSION
The results presented in Section VII-E provide direct or indirect evidence for the research questions in Section VII.
A. Code
The results of the code analysis (see Section VII-E1) show that the total project size is secondary for PSM. Nearly half (43 %) of the code elements in a project are text or numbers and can be modeled. The remaining elements either reference eligible code elements or are external dependencies. This large proportion justifies the use of PSM for projects independent of their size (RQ1).

In conclusion, projects, independent of their size, expose enough code elements eligible for PSM.
B. Runtime
The results of the runtime analysis (see Section VII-E2) show that most events are related to actual data (87 %), providing evidence for RQ2 and support for PSM. These data events are emitted by a rather small portion of the active code elements (14 %, ACT10). Regarding RQ3, this means that few models will capture most of a program's behavior. Most of the variability is generated by few code elements (.29 %). Nearly half of the variability is related to data (44 %), while the other half are mostly object references. In terms of RQ3, this means that the average capacity (free optimizable parameters) of models can be low, simplifying model maintenance and interpretation.

In conclusion, active code elements create enough data (text or number) that can be used for PSM.
C. Modeling
The results of the modeling analysis (see Section VII-E3) show that most models have few dimensions, providing further empirical support for low capacity models. The selected capacity does not hint at overfitting to specific portions of the data, given that training and test NLL are not significantly different. However, many low-dimension, discrete-only models can be replaced by Conditional Probability Tables (CPDs) [6] for a more efficient and precise representation.

The qualitative inspections revealed high-quality models with good approximations and two caveats (mass leakage and mode connectivity). The two issues are related to the capacity of the model (too high for discrete, too low for continuous), which adaptive model type and parameter selection can solve.

In conclusion, the qualitative and quantitative assessments suggest that probabilistic models can approximate the behavior of a program.

D. Inference
The inference analysis (see Section VII-E4) evaluated the usefulness of PSM models in two illustrative scenarios. The first scenario (Figure 4) illustrated multi-dimensional information (height and weight) propagation with latent factors (gender, only visible in request) across multiple models. The second scenario (Figure 5) focused on model/data evaluation in a software development context in which software and AI components are integrated. The scenarios distill the foundations on which any PSM application (see Section IV) is built: sampling (generation), conditioning (information propagation), and likelihood evaluation (criticism).

In conclusion, results show that local (within model) and global (between models) generation is sensitive to conditions, allowing consistent causal reasoning in PSM models. (A CPD is a table encoding the probability per categorical level.)
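The three primitives named above can be sketched on a toy Gaussian stand-in for one joint model (our own illustration; the prototype uses NVPs, and all numbers here are assumptions):

```python
import numpy as np

# Toy joint over (height, weight) as a bivariate normal.
mu = np.array([170.0, 70.0])
cov = np.array([[100.0, 60.0],
                [60.0, 144.0]])

def condition(mu, cov, value):
    """Condition the joint on the first variable: p(weight | height = value)."""
    m = mu[1] + cov[1, 0] / cov[0, 0] * (value - mu[0])
    v = cov[1, 1] - cov[1, 0] * cov[0, 1] / cov[0, 0]
    return m, v

def log_likelihood(x, m, v):
    """Gaussian log density in nats, as used for criticism/compatibility checks."""
    return -0.5 * (np.log(2 * np.pi * v) + (x - m) ** 2 / v)

# Conditioning: fix height at 160 cm; the weight distribution shifts down.
m, v = condition(mu, cov, 160.0)
# Sampling/propagation: draw weights and feed them to the next model downstream.
rng = np.random.default_rng(0)
weights = rng.normal(m, np.sqrt(v), size=1000)
```

Chaining this sample-then-condition step from model to model is what propagates a condition such as Person.gender = Female through the whole network.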
IX. LIMITATIONS
There are several limitations to the approach or the current prototype. The approach needs a structured program, and it must be observable at runtime. Large methods that handle multiple tasks will reduce the usefulness of PSM.

The current prototype is focused on data. References are handles to objects that might contain data or more references. PSM naturally dereferences these handles since models only contain, e.g., properties that are accessed. This means that PSM is not useful for libraries whose only purpose is reference management, e.g., a collection library.

The current prototype explodes lists into singular value assignments, i.e., a list of two elements acts as two assignments to a non-list variable. No order relationship between list elements is preserved, as is typical for distributions. Sequential models can alleviate this limitation. However, the usefulness is subject to the actual application that is realized.

X. THREATS TO VALIDITY
An external threat to validity is given by the number of projects used in the study. Rigorous internal evaluation and projects of different size and type minimize the threat. Different sizes control for the expectation that large projects will have more elements and events, resulting in better models. Different project types (e.g., PMD as system software or jLatexmath as application software) control for the element type distribution and their runtime content (user vs. synthetic data). Finally, the evaluation models all eligible code elements and measured the variance across the projects. The NLL across projects in Table IV does not hint at a by-chance good project selection.

XI. CONCLUSION AND FUTURE WORK
In this work, we presented Probabilistic Software Modeling (PSM), a data-driven approach for predictive and generative methods in software engineering. We have discussed applications, pragmatics, construction details, and technical considerations of PSM. We evaluated the viability and usability of PSM on multiple projects and discussed scenarios that provide insight into how PSM is used. The results have shown that PSM is not only viable but naturally integrates with software 2.0 (AI components).

Our future work will focus on the realization and evaluation of applications and their comparison to the current state of the art.

In conclusion, PSM analyzes a program and synthesizes a probabilistic model that is capable of simulating and quantifying it. The resulting models are repeatable, persistable, shareable, and quantifiable representations and act as a foundation from which solutions can be derived.

ACKNOWLEDGMENTS
The research reported in this paper has been supported by the Austrian Ministry for Transport, Innovation and Technology, the Federal Ministry of Science, Research and Economy, and the Province of Upper Austria in the frame of the COMET center SCCH.
REFERENCES

[1] K. Arnold, J. Gosling, and D. Holmes,
The Java Programming Language ,3rd ed. Boston, MA, USA: Addison-Wesley Longman Publishing Co.,Inc., 2000.[2] D. Lunn, D. Spiegelhalter, A. Thomas, and N. Best, “The BUGS project:Evolution, critique and future directions,”
Statistics in Medicine , vol. 28,no. 25, pp. 3049–3067, 2009.[3] E. Bingham, J. P. Chen, M. Jankowiak, F. Obermeyer, N. Pradhan,T. Karaletsos, R. Singh, P. Szerlip, P. Horsfall, and N. D. Goodman,“Pyro: Deep Universal Probabilistic Programming,”
J. Mach. Learn.Res. , vol. 20, no. 1, pp. 973–978, Jan. 2019. [Online]. Available:http://dl.acm.org/citation.cfm?id=3322706.3322734[4] J. Salvatier, T. V. Wiecki, and C. Fonnesbeck, “Probabilistic programmingin Python using PyMC3,”
PeerJ Computer Science , vol. 2, p. e55, Apr.2016.[5] D. Tran, M. D. Hoffman, R. A. Saurous, E. Brevdo, K. Murphy, and D. M.Blei, “Deep Probabilistic Programming,”
ArXiv , vol. abs/1701.03757,2017.[6] D. Koller and N. Friedman,
Probabilistic Graphical Models: Principlesand Techniques , ser. Adaptive Computation and Machine Learning.Cambridge, MA: MIT Press, 2009.[7] D. Koller and A. Pfeffer, “Object-oriented Bayesian Networks,”in
Proceedings of the Thirteenth Conference on Uncertainty inArtificial Intelligence , ser. UAI’97. San Francisco, CA, USA: MorganKaufmann Publishers Inc., 1997, pp. 302–313. [Online]. Available:http://dl.acm.org/citation.cfm?id=2074226.2074262[8] F. Musella and P. Vicard, “Object-oriented Bayesian networks for complexquality management problems,”
Quality & Quantity , vol. 49, no. 1, pp.115–133, Jan. 2015.[9] L. Lamport, J. Matthews, M. Tuttle, and Y. Yu, “Specifying and VerifyingSystems with TLA+,” in
Proceedings of the 10th Workshop on ACMSIGOPS European Workshop , ser. EW 10. New York, NY, USA: ACM,2002, pp. 45–48.[10] D. Jackson, “Alloy: A Lightweight Object Modelling Notation,”
ACMTrans. Softw. Eng. Methodol. , vol. 11, no. 2, pp. 256–290, Apr. 2002.[11] M. Kwiatkowska, G. Norman, and D. Parker, “Stochastic ModelChecking,” in
Formal Methods for Performance Evaluation: 7thInternational School on Formal Methods for the Design of Computer,Communication, and Software Systems, SFM 2007, Bertinoro, Italy, May28-June 2, 2007, Advanced Lectures , ser. Lecture Notes in ComputerScience, M. Bernardo and J. Hillston, Eds. Berlin, Heidelberg:Springer Berlin Heidelberg, 2007, pp. 220–270. [Online]. Available:https://doi.org/10.1007/978-3-540-72522-0_6[12] ——, “PRISM 4.0: Verification of Probabilistic Real-Time Systems,” in
Computer Aided Verification , ser. Lecture Notes in Computer Science,G. Gopalakrishnan and S. Qadeer, Eds. Springer Berlin Heidelberg,2011, pp. 585–591.[13] Y. Liu, J. Sun, and J. S. Dong, “PAT 3: An Extensible Architecturefor Building Multi-domain Model Checkers,” in . Hiroshima,Japan: IEEE, Nov. 2011, pp. 190–199.[14] H. Garavel, F. Lang, R. Mateescu, and W. Serwe, “CADP 2011: A toolboxfor the construction and analysis of distributed processes,”
InternationalJournal on Software Tools for Technology Transfer , vol. 15, no. 2, pp.89–107, Apr. 2013.[15] J. C. King, “Symbolic Execution and Program Testing,”
Communicationsof the ACM , vol. 19, no. 7, pp. 385–394, Jul. 1976.[16] S. Anand, C. S. P˘as˘areanu, and W. Visser, “JPF-SE: A SymbolicExecution Extension to Java PathFinder,” in
Proceedings of the13th International Conference on Tools and Algorithms for theConstruction and Analysis of Systems , ser. TACAS’07. Berlin,Heidelberg: Springer-Verlag, 2007, pp. 134–138. [Online]. Available:http://dl.acm.org/citation.cfm?id=1763507.1763523[17] C. Cadar, D. Dunbar, and D. Engler, “KLEE: Unassisted andAutomatic Generation of High-coverage Tests for Complex SystemsPrograms,” in
Proceedings of the 8th USENIX Conference on OperatingSystems Design and Implementation , ser. OSDI’08. Berkeley, CA,USA: USENIX Association, 2008, pp. 209–224. [Online]. Available:http://dl.acm.org/citation.cfm?id=1855741.1855756[18] N. Tillmann and J. de Halleux, “Pex–White Box Test Generation for.NET,” in
Tests and Proofs , ser. Lecture Notes in Computer Science,B. Beckert and R. Hähnle, Eds. Springer Berlin Heidelberg, 2008, pp.134–153. [19] J. Geldenhuys, M. B. Dwyer, and W. Visser, “Probabilistic SymbolicExecution,” in
Proceedings of the 2012 International Symposium onSoftware Testing and Analysis , ser. ISSTA 2012. ACM, 2012, pp.166–176.[20] A. Filieri, C. S. Pasareanu, and G. Yang, “Quantification of SoftwareChanges through Probabilistic Symbolic Execution (N),” in . Lincoln, NE, USA: IEEE, Nov. 2015, pp. 703–708.[21] B. Chen, Y. Liu, and W. Le, “Generating Performance Distributionsvia Probabilistic Symbolic Execution,” in
Proceedings of the 38thInternational Conference on Software Engineering , ser. ICSE ’16. NewYork, NY, USA: ACM, 2016, pp. 49–60.[22] Z. Xu, S. Ma, X. Zhang, S. Zhu, and B. Xu, “Debugging with Intelligencevia Probabilistic Inference,” p. 11, 2018.[23] D. Andrzejewski, A. Mulhern, B. Liblit, and X. Zhu, “StatisticalDebugging Using Latent Topic Models,” in
Machine Learning: ECML2007 , J. N. Kok, J. Koronacki, R. L. de Mantaras, S. Matwin,D. Mladeniˇc, and A. Skowron, Eds. Berlin, Heidelberg: SpringerBerlin Heidelberg, 2007, vol. 4701, pp. 6–17. [Online]. Available:http://link.springer.com/10.1007/978-3-540-74958-5_5[24] S. Hangal and M. S. Lam, “Tracking down Software Bugs UsingAutomatic Anomaly Detection,”
Proceedings of the 24th InternationalConference on Software Engineering. ICSE 2002 , pp. 291–301, 2002.[25] M. D. Ernst, J. Cockrell, W. G. Griswold, and D. Notkin, “DynamicallyDiscovering Likely Program Invariants to Support Program Evolution,”
IEEE Transactions on Software Engineering , vol. 27, no. 2, pp. 99–123,Feb. 2001.[26] T.-D. B. Le and D. Lo, “Deep Specification Mining,” in
Proceedings ofthe 27th ACM SIGSOFT International Symposium on Software Testingand Analysis - ISSTA 2018 . ACM Press, 2018, pp. 106–117.[27] Z. Zuo, S.-C. Khoo, and C. Sun, “Efficient Predicated Bug SignatureMining via Hierarchical Instrumentation,” in
Proceedings of the 2014International Symposium on Software Testing and Analysis , ser. ISSTA2014. New York, NY, USA: ACM, 2014, pp. 215–224.[28] R. Gore, P. F. Reynolds, and D. Kamensky, “Statistical debugging withelastic predicates,” in , Nov. 2011, pp. 492–495.[29] D. Lo and S. Maoz, “Scenario-based and value-based specification mining:Better together,”
Automated Software Engineering , vol. 19, no. 4, pp.423–458, Dec. 2012.[30] S. Jayaraman, B. Jayaraman, and D. Lessa, “Compact Visualization ofJava Program Execution,”
Software: Practice and Experience , vol. 47,no. 2, pp. 163–191, 2017.[31] M. H. Brown and R. Sedgewick, “Techniques for Algorithm Animation,”
IEEE Software , vol. 2, no. 1, pp. 28–39, Jan. 1985.[32] S. Mukherjea and J. T. Stasko, “Toward Visual Debugging: IntegratingAlgorithm Animation Capabilities Within a Source-Level Debugger,”
ACM Trans. Comput.-Hum. Interact. , vol. 1, no. 3, pp. 215–244, Sep.1994.[33] M. Gabel, L. Jiang, and Z. Su, “Scalable Detection of SemanticClones,” in
Proceedings of the 13th International Conference on SoftwareEngineering - ICSE ’08 . ACM Press, 2008, p. 321.[34] H. Kim, Y. Jung, S. Kim, and K. Yi, “MeCC,”
Proceeding of the 33rdinternational conference on Software engineering - ICSE ’11 , p. 301,2011.[35] H. B. Mann and D. R. Whitney, “On a Test of Whether One of TwoRandom Variables Is Stochastically Larger than the Other,”
The Annalsof Mathematical Statistics , vol. 18, no. 1, pp. 50–60, Mar. 1947.[36] W. H. Kruskal and W. A. Wallis, “Use of Ranks in One-Criterion VarianceAnalysis,”
Journal of the American Statistical Association , vol. 47, no.260, pp. 583–621, Dec. 1952.[37] F. J. Massey, “The Kolmogorov-Smirnov Test for Goodness of Fit,”
Journal of the American Statistical Association , vol. 46, no. 253, pp.68–78, Mar. 1951.[38] M. B. Wilk and R. Gnanadesikan, “Probability Plotting Methods for theAnalysis of Data,”
Biometrika , vol. 55, no. 1, pp. 1–17, 1968.[39] L. Aniello, C. Ciccotelli, M. Cinque, F. Frattini, L. Querzoni,and S. Russo, “Automatic Invariant Selection for Online AnomalyDetection,” in
Computer Safety, Reliability, and Security , A. Skavhaug,J. Guiochet, and F. Bitsch, Eds. Cham: Springer InternationalPublishing, 2016, vol. 9922, pp. 172–183. [Online]. Available:http://link.springer.com/10.1007/978-3-319-45477-1_14[40] V. Kotu and B. Deshpande, “Anomaly Detection,” in
Data Science. Elsevier, 2019, pp. 447–465. [Online]. Available: https://linkinghub.elsevier.com/retrieve/pii/B9780128147610000137
[41] V. Chandola, A. Banerjee, and V. Kumar, “Anomaly Detection: A Survey,”
ACM Computing Surveys , vol. 41, no. 3, pp. 1–58, Jul. 2009.[42] L. Cseppent˝o and Z. Micskei, “Evaluating Code-Based Test InputGenerator Tools: Evaluating Code-Based Test Input Generator Tools,”
Software Testing, Verification and Reliability , vol. 27, no. 6, p. e1627,Sep. 2017.[43] G. Fraser and A. Zeller, “Mutation-Driven Generation of Unit Tests andOracles,”
IEEE Transactions on Software Engineering , vol. 38, no. 2,pp. 278–292, Mar. 2012.[44] T. Ball, “The Concept of Dynamic Analysis,” in
Proceedingsof the 7th European Software Engineering Conference HeldJointly with the 7th ACM SIGSOFT International Symposium onFoundations of Software Engineering , ser. ESEC/FSE-7. London,UK, UK: Springer-Verlag, 1999, pp. 216–234. [Online]. Available:http://dl.acm.org/citation.cfm?id=318773.318944[45] K. P. Murphy,
Machine Learning: A Probabilistic Perspective , ser.Adaptive Computation and Machine Learning Series. Cambridge, MA:MIT Press, 2012.[46] C. M. Bishop,
Pattern Recognition and Machine Learning , ser. Informa-tion Science and Statistics. New York: Springer, 2006.[47] S. L. Peyton Jones, Ed.,
Haskell 98 Language and Libraries: The RevisedReport . Cambridge, U.K. ; New York: Cambridge University Press,2003, oCLC: ocm51271691.[48] L. Dinh, J. Sohl-Dickstein, and S. Bengio, “Density estimation usingReal NVP,” arXiv:1605.08803 [cs, stat] , May 2016. [Online]. Available:http://arxiv.org/abs/1605.08803[49] D. M. Blei, A. Kucukelbir, and J. D. McAuliffe, “Variational Inference: AReview for Statisticians,”
Journal of the American Statistical Association ,vol. 112, no. 518, pp. 859–877, 2017.[50] D. J. Rezende and S. Mohamed, “Variational Inference with NormalizingFlows,” May 2015. [Online]. Available: http://arxiv.org/abs/1505.05770[51] D. P. Kingma and M. Welling, “Auto-Encoding Variational Bayes,” Dec.2013. [Online]. Available: http://arxiv.org/abs/1312.6114[52] C. Doersch, “Tutorial on Variational Autoencoders,” pp. 1–23, 2016.[Online]. Available: https://arxiv.org/abs/1606.05908[53] K. Sohn, X. Yan, and H. Lee, “Learning Structured OutputRepresentation Using Deep Conditional Generative Models,” in
Proceedings of the 28th International Conference on NeuralInformation Processing Systems - Volume 2 , ser. NIPS’15. Cambridge,MA, USA: MIT Press, 2015, pp. 3483–3491. [Online]. Available:http://dl.acm.org/citation.cfm?id=2969442.2969628[54] W. Grathwohl, R. T. Q. Chen, J. Bettencourt, I. Sutskever, andD. Duvenaud, “FFJORD: Free-form Continuous Dynamics for ScalableReversible Generative Models,” arXiv:1810.01367 [cs, stat] , Oct. 2018.[Online]. Available: http://arxiv.org/abs/1810.01367[55] M. Germain, K. Gregor, I. Murray, and H. Larochelle, “Made: Maskedautoencoder for distribution estimation,” in
International Conference onMachine Learning , 2015, pp. 881–889.[56] G. Papamakarios, D. Sterratt, and I. Murray, “Sequential NeuralLikelihood: Fast Likelihood-free Inference with Autoregressive Flows,”in
The 22nd International Conference on Artificial Intelligenceand Statistics , Apr. 2019, pp. 837–848. [Online]. Available: http://proceedings.mlr.press/v89/papamakarios19a.html[57] D. P. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,”Dec. 2014. [Online]. Available: http://arxiv.org/abs/1412.6980[58] A. Krogh and J. A. Hertz, “A Simple Weight Decay Can ImproveGeneralization,” in
Advances in Neural Information Processing Systems4 , J. E. Moody, S. J. Hanson, and R. P. Lippmann, Eds. Morgan-Kaufmann, 1992, pp. 950–957. [Online]. Available: http://papers.nips.cc/paper/563-a-simple-weight-decay-can-improve-generalization.pdf[59] D. Hendrycks and K. Gimpel, “Gaussian Error Linear Units(GELUs),” arXiv:1606.08415 [cs] , Jun. 2016. [Online]. Available:http://arxiv.org/abs/1606.08415[60] R. Pawlak, M. Monperrus, N. Petitprez, C. Noguera, and L. Seinturier,“SPOON: A Library for Implementing Analyses and Transformationsof Java Source Code,”