AutoMATES: Automated Model Assembly from Text, Equations, and Software
Adarsh Pyarelal, Marco A. Valenzuela-Escarcega, Rebecca Sharp, Paul D. Hein, Jon Stephens, Pratik Bhandari, HeuiChan Lim, Saumya Debray, Clayton T. Morrison
School of Information, University of Arizona, Tucson, AZ
Department of Computer Science, University of Arizona, Tucson, AZ
ml4ai.github.io/automates
Abstract
Models of complicated systems can be represented in different ways: in scientific papers, they are represented using natural language text as well as equations. But to be of real use, they must also be implemented as software, thus making code a third form of representing models. We introduce the AutoMATES project, which aims to build semantically-rich unified representations of models from scientific code and publications to facilitate the integration of computational models from different domains and allow for modeling large, complicated systems that span multiple domains and levels of abstraction.
1 Introduction

There exist today state-of-the-art computational models that can provide highly accurate predictions about complex phenomena such as crop growth and weather patterns. However, certain phenomena, such as food insecurity, involve a host of factors that cannot be modeled by any single one of these models, but which instead require the integration of multiple models.

To truly integrate these computational models, it is necessary to 'lift' them to a common representation that is (i) agnostic to the software implementation, (ii) semantically rich enough to represent the implicit domain knowledge in the models, and (iii) connected to the domain literature. The AutoMATES project aims to build technology to construct and curate semantically-rich representations of scientific models by integrating three different sources of information:

• natural language descriptions of models in publications and other technical documentation,
• the equations contained in these documents, and
• the software that implements these models.

An example of a model represented in these three forms (text, equations, and software) is shown in Figure 1. This model is a differential equation describing a biophysical variable, the leaf area index (LAI). The network on the right half of the figure is an aspirational representation of the model as a Bayesian network. Although this example is hand-crafted, our end goal is to be able to automatically assemble models with this level of semantic richness. In this paper, we describe our high-level approach and present our latest results. For more technical details, please visit ml4ai.github.io/automates/documentation/deliverable_reports.

Pyarelal et al., Modeling the World's Systems, 2019

Figure 1: Integration of text, equations, and code into a semantically-enriched Bayesian network.

Significance:
This work will dramatically advance the state of the art in automated model curation and integration, enabling scientists and analysts to understand complex mechanisms that span multiple domains. By exposing the implicit domain knowledge baked into computational models, this effort will enable semantically rich automated model composition and reasoning in context, at scale.
2 Architecture

The AutoMATES system is designed to extract information from several knowledge sources, link the extracted concepts into a unified model representation, compare models based on different features, and augment the models with supplementary information. Each of these components is implemented independently, but designed to interoperate with the others. A high-level view of the system's architecture is shown in Figure 2. This modular architecture provides AutoMATES with the extensibility required to support different knowledge sources in the future as it is extended to handle other domains. Briefly, the four main components of AutoMATES are:

1. Extracting model information from different aspects of source code and corresponding scientific publications and technical documents. From source code, we extract the model from the code implementation itself (Section 3) as well as any supplementary information such as descriptions expressed in comments. From scientific publications and documentation, AutoMATES reads model information from text (Section 5) and equations (Section 4).

2. Grounding the extracted information by identifying when the same concept is expressed in different knowledge sources, and linking them together to form a unified, programming-language-agnostic intermediary model representation: a Grounded Function Network (GrFN).

3. Comparing models, using the GrFN representation, by analyzing structural and functional (via sensitivity analysis) similarities and differences (Section 6).

4. Augmenting models through selection of model components appropriate for a task, composing model components, generating model descriptions in context to augment existing documentation, and model execution.

Figure 2: Architecture overview
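To make the GrFN idea concrete, the sketch below models a tiny network of grounded variables and function nodes. Everything here is our own invention for illustration: the field names, the evaluation scheme, and the toy LAI update rule are hypothetical stand-ins, not the project's actual GrFN schema.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Variable:
    """A grounded variable: links a code identifier to a text description."""
    name: str          # identifier as it appears in source code
    description: str   # description recovered from comments or papers

@dataclass
class Function:
    """A function node: computes one output variable from input variables."""
    inputs: List[str]
    output: str
    fn: Callable[..., float]

@dataclass
class GrFN:
    variables: Dict[str, Variable] = field(default_factory=dict)
    functions: List[Function] = field(default_factory=list)

    def run(self, values: Dict[str, float]) -> Dict[str, float]:
        """Evaluate function nodes in order, filling in variable values."""
        state = dict(values)
        for f in self.functions:
            state[f.output] = f.fn(*(state[v] for v in f.inputs))
        return state

# Toy network: LAI grows by a rate scaled by a stress factor (made up).
g = GrFN()
g.variables = {
    "lai":   Variable("lai", "leaf area index"),
    "rate":  Variable("rate", "daily growth rate"),
    "swfac": Variable("swfac", "soil water stress factor"),
}
g.functions = [Function(["lai", "rate", "swfac"], "lai",
                        lambda lai, rate, swfac: lai + rate * swfac)]
print(g.run({"lai": 1.0, "rate": 0.1, "swfac": 0.5})["lai"])  # 1.05
```

The point of the structure is that each variable node carries its grounding (a text description) alongside its role in the executable computation, so the same object can serve comparison, documentation, and execution.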
3 Program Analysis

Our program analysis approach to extracting model information from the source code implementation begins with for2py, a front-end translator that maps Fortran source programs to a language-independent program analysis intermediate representation (PAIR) that is then used to generate files used as input to subsequent analysis. This design decouples input processing from output generation, and is motivated by the following:

1. Performance and scalability. Modules that are referenced by multiple program components do not have to be reanalyzed separately for each referencing component. Independent source-language modules can, in principle, be analyzed concurrently.

2. Support for source-language heterogeneity. This design makes it possible, in principle, to support programs with different components written in different languages. It also allows us to reason about models implemented in different source languages.

3. Independence of back-end tasks. Different back-end analysis tasks, e.g., sensitivity analysis and comment analysis, can be carried out independently (and, if necessary, concurrently) on the PAIR.

for2py currently handles a significant subset of Fortran, including: data types such as scalars and arrays; control constructs such as conditionals, loops, functions, and subroutines; and input/output (I/O) primitives including formatted and list-directed I/O. We expect to soon complete the handling of modules and derived types.

A fundamental challenge we have to address is scalability, since software implementing sophisticated scientific models can encompass thousands of source files and hundreds of thousands of lines of code. We address this by performing analysis at the module level of granularity. Given the source code for a scientific model, we analyze its modules to identify define-use relationships between them and construct a module dependency graph that captures these dependencies. We use a topological sort of this graph to guide the subsequent analysis of the modules. The module dependency graph imposes a partial order on the modules of the analyzed system, indicating which modules are independent of each other and can therefore be analyzed in parallel. This ordering has three significant implications for scaling. First, it allows modern computer systems such as multi-core processors and cloud-based systems to be utilized effectively. Second, it provides the user a straightforward tunable tradeoff between computational resources and analysis efficiency. Finally, it means that, with sufficient parallelism, the cost of analyzing a software system is proportional to the depth of its module dependency graph rather than its total size (number of nodes), resulting in sublinear asymptotic complexity.
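The module-level scheduling described above can be sketched in a few lines: group the modules into "levels" of a topological order, where every module depends only on modules in earlier levels, so each level can be analyzed in parallel and the wall-clock cost tracks the number of levels (the graph's depth). This is a minimal illustration using Python's standard library; the module names are invented, and this is not for2py's actual scheduler.

```python
from graphlib import TopologicalSorter

# Hypothetical module dependency graph: module -> modules it depends on.
deps = {
    "io_utils": set(),
    "soil":     {"io_utils"},
    "weather":  {"io_utils"},
    "crop":     {"soil", "weather"},
}

ts = TopologicalSorter(deps)
ts.prepare()
levels = []
while ts.is_active():
    ready = list(ts.get_ready())   # modules whose dependencies are done
    levels.append(sorted(ready))   # this whole batch can run in parallel
    ts.done(*ready)

print(levels)  # [['io_utils'], ['soil', 'weather'], ['crop']]
```

Here the graph has four modules but only three levels, so with two workers the second level ('soil' and 'weather') is analyzed in one parallel step.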
4 Equation Reading

Models are often represented concisely as equations, at a level of abstraction that can supplement both the natural language description and the source code implementation. For humans to compare the equations and source code for several models, as is done in [2], is time-consuming and expensive. Accordingly, we are developing an automated approach that identifies the relevant equations in text and rendered images of documents (PDFs treated as images) associated with scientific models, parses them into an intermediate symbolic mathematical representation, and grounds the variables in the equations to text descriptions and source code variables.

Non-textual elements in PDFs have previously been identified using heuristics based on document structure [4] or statistical learning [3, 1]. Here, taking advantage of advances in deep learning [8], we identify the location of the bounding box surrounding the equations using machine vision techniques. After identifying the location of the equations in the PDF, the next step in the pipeline is to parse the rendered equation into an intermediate representation. We choose to use LaTeX because we have the LaTeX source code for each of the training examples, and also because LaTeX preserves all of the typographic information (e.g., boldface, subscript, etc.), which conveys variable semantics. We decompile the image using an encoder-decoder system that encodes the image of the equation through a series of convolutions and produces LaTeX commands that generate the image [6, 5]. We are currently evaluating this process on a held-out subset of the data from arXiv. This decoded LaTeX representation will then be parsed into Python code (by extending the coverage of an open-source rule-based system, latex2sympy (https://github.com/augustt198/latex2sympy), to equation elements frequently found in the domain) and then converted to a GrFN representation.
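To make the LaTeX-to-code step concrete, here is a deliberately tiny, hand-rolled rewrite for a single construct (\frac). It stands in for the kind of rule-based translation latex2sympy performs; it covers only this one case and is not the project's parser.

```python
import re

def frac_to_python(latex: str) -> str:
    """Rewrite \\frac{a}{b} as ((a)/(b)); innermost fractions first,
    so simple nested fractions are also handled."""
    pattern = re.compile(r"\\frac\{([^{}]+)\}\{([^{}]+)\}")
    while pattern.search(latex):
        latex = pattern.sub(r"((\1)/(\2))", latex)
    return latex

print(frac_to_python(r"\frac{dLAI}{dt}"))  # ((dLAI)/(dt))

# The resulting string can be evaluated against variable bindings:
print(eval(frac_to_python(r"\frac{rate}{span}"), {"rate": 3.0, "span": 2.0}))  # 1.5
```

A real translation layer would of course build a syntax tree rather than rewrite strings, but the example shows why LaTeX is a workable intermediate form: its structure maps mechanically onto executable expressions.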
To ground the GrFN representation extracted from the equations, we locate text that references the equation, using the equation identifier when available and the lexical content when it is not.

5 Text Reading

We are developing a framework for reading and extracting model information from the scientific papers that directly describe the computational models (e.g., DSSAT, SWAP) whose source code we analyze (Section 3). Scientific papers are typically available as PDFs, which need to be preprocessed into a format that a machine reader can use. We make use of Science Parse (https://github.com/allenai/science-parse), an open-source tool that segments the sections based on the paper layout and typography.

Our framework then implements an open-domain information extraction system based on Eidos (https://github.com/clulab/eidos), a machine reading system designed to extract causal relations. At its core, Eidos has a grammar of rules [10, 9] that model linguistic patterns commonly used by authors to express causality in text. Here, where we are interested in gathering context about the models implemented in source code, causal relations are useful, but not sufficient. We have modified Eidos to extract mentions of model variables and their descriptions. Additionally, it will be critical to read for background assumptions (e.g., model preconditions) and additional contextual information that could inform the setting of parameters (using quantities and units identified by grobid-quantities, https://github.com/kermitt2/grobid-quantities).

Figure 3: Example of variables (represented as concepts, definitions, and value assignments) extracted from scientific text as a result of the machine reading pipeline.

Figure 4: Results of comparing the Priestley-Taylor (PT) and ASCE models. Blue nodes represent variables shared between PT and ASCE. Black nodes represent variables that are not shared but lie along directed paths between shared variables. Green nodes in the ASCE model represent variables whose states directly affect shared directed paths; if controlled, this isolates the portions of ASCE that overlap with PT. Finally, orange nodes represent variables in the ASCE model that can be isolated from the overlap in the comparison.

Figure 5: Screenshot of the AutoMATES CodeExplorer (available at http://vanga.sista.arizona.edu/automates), showing the translation of the Priestley-Taylor method for calculating potential evapotranspiration (a submodule in DSSAT [7]) into a computation graph.
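A rough flavor of one variable-mention pattern can be given with a single toy regex. This is not Eidos's grammar (Eidos rules are written in the Odin rule language); it is our own stand-in, keyed on the determiner "the" and the common "description (VARIABLE)" convention, and it would miss most real phrasings.

```python
import re

# Toy stand-in for a variable-mention rule: "the <short noun phrase> (NAME)",
# e.g. "the leaf area index (LAI)". The phrase is capped at four words.
PATTERN = re.compile(r"\bthe ([a-z]+(?: [a-z]+){0,3}) \(([A-Z][A-Za-z0-9]*)\)")

text = ("We model the leaf area index (LAI) using "
        "the soil water stress factor (SWFAC).")

mentions = {var: desc for desc, var in PATTERN.findall(text)}
print(mentions)  # {'LAI': 'leaf area index', 'SWFAC': 'soil water stress factor'}
```

A grammar-based system generalizes this by matching over syntactic structure rather than raw strings, which is what makes the extracted descriptions reliable enough to use for grounding.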
The assign nodes are annotated with the automatically extracted LaTeX-typeset representation of the equation extracted from the code, which will facilitate linking with scientific publications. Additionally, the variable nodes are automatically aligned with descriptions extracted from code comments and scientific texts. The extracted variables and their mentions must then be aligned with the variables read from source code (Section 3) and equations (Section 4) to find and resolve commonalities and discrepancies in different representations of the same model. In Figure 3, we show a screenshot of results from the current text reading pipeline.
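One simple baseline for this alignment step, shown purely for illustration (it is not the project's grounding algorithm, and the identifiers are invented), scores candidate (code variable, text mention) pairs by string similarity and keeps the best match per variable:

```python
from difflib import SequenceMatcher

# Hypothetical inputs: identifiers from code, variable names from text mentions.
code_vars = ["swfac", "lai", "tmax"]
text_names = ["SWFAC", "LAI", "TMAX"]

def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Align each code variable to the text name that matches it best.
alignment = {
    v: max(text_names, key=lambda name: similarity(v, name))
    for v in code_vars
}
print(alignment)  # {'swfac': 'SWFAC', 'lai': 'LAI', 'tmax': 'TMAX'}
```

Real alignment has to handle cases string similarity cannot (the same quantity under different names, or the same name meaning different things in two models), which is why the grounded descriptions matter.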
6 Model Analysis

Model comparison and eventual augmentation are enabled by our model analysis pipeline, which identifies which portions of two or more models share the same or similar computations over similar variables, and which components differ. This analysis is enabled by the unified Grounded Function Network (GrFN) representation: we first identify shared variables and then analyze the GrFN topology to identify differences in how variable states are set. Figure 4 shows an example of comparing the PT and ASCE evapotranspiration models from the DSSAT crop modeling system [7]. Sensitivity analysis is then used to analyze the functional relationships between the variables. Because sensitivity analysis can be computationally expensive, we are developing methods that use automatic code differentiation to efficiently compute the derivatives of variables with respect to each other, and Bayesian optimization techniques to estimate sensitivity functions with as few samples as possible. The final product of this analysis (a) includes modular executable representations of grounded models (as dynamic Bayesian networks), (b) provides results of model comparison to enable model choice in tasks, and (c) based on grounded model similarities and differences, enables model composition. In Figure 6, we show some initial results from automated sensitivity analysis.

Figure 6: Initial results of automated sensitivity analysis. The pair of variables that the Priestley-Taylor model of evapotranspiration is most sensitive to has been automatically identified given bounds information for the input variables, and a surface plot has been generated that shows the effect of varying that pair of variables (maximum temperature and solar radiation) on the output variable (potential evapotranspiration).
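As a minimal illustration of the underlying idea, the sketch below ranks inputs by one-at-a-time finite-difference sensitivity. The model function, its coefficients, and the bounds are all made up for this example; this is neither DSSAT's Priestley-Taylor code nor the automatic-differentiation and Bayesian-optimization machinery described above.

```python
def model(tmax: float, srad: float, tmin: float) -> float:
    """Hypothetical stand-in for an evapotranspiration model."""
    return 0.01 * srad * (0.6 * tmax + 0.4 * tmin + 29.0)

# Input bounds, as would be recovered from documentation.
bounds = {"tmax": (20.0, 40.0), "srad": (5.0, 25.0), "tmin": (5.0, 20.0)}
center = {k: (lo + hi) / 2 for k, (lo, hi) in bounds.items()}

def sensitivity(name: str, h: float = 1e-4) -> float:
    """Central-difference partial derivative at the center of the bounds,
    scaled by the input's range so sensitivities are comparable."""
    up, down = dict(center), dict(center)
    up[name] += h
    down[name] -= h
    deriv = (model(**up) - model(**down)) / (2 * h)
    return abs(deriv) * (bounds[name][1] - bounds[name][0])

ranked = sorted(bounds, key=sensitivity, reverse=True)
print(ranked)  # ['srad', 'tmax', 'tmin']
```

Scaling each derivative by the input's documented range is what lets the pipeline compare sensitivities across variables with very different units, and hence pick the pair of inputs worth plotting, as in Figure 6.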
7 Conclusion

Systems of interest for scientific, humanitarian, and security reasons often require the integration of computational models from multiple domains; for example, modeling food security in a region requires the use of computational crop, weather, and hydrology models, to name but a few. However, this integration currently requires significant manual effort in the form of exposing and curating interfaces to the computational models. The framework we are developing will greatly speed up this curation and integration process, making it possible to effectively model large, complicated systems and reason about them at multiple levels of abstraction.
The system described here is open-source and publicly available at github.com/ml4ai/automates and github.com/ml4ai/delphi. We have also set up a public web app, CodeExplorer (see screenshot in Figure 5), which shows off a subset of the functionality of the AutoMATES system, and is live at vanga.sista.arizona.edu/automates.

This work is supported by the Defense Advanced Research Projects Agency (DARPA) as part of the Automated Scientific Knowledge Extraction (ASKE) program under agreement number HR00111990011.
References

[1] Jacob Robert Bruce. Mathematical expression detection and segmentation in document images. Master's thesis, Virginia Tech, 2014.

[2] G. G. T. Camargo and A. R. Kemanian. Six crop models differ in their simulation of water uptake. Agricultural and Forest Meteorology, 220:116–129, 2016.

[3] Wei-Ta Chu and Fan Liu. Mathematical formula detection in heterogeneous document images. pages 140–145, 2013.

[4] Christopher Clark and Santosh Divvala. PDFFigures 2.0: Mining figures from research papers. 2016.

[5] Yuntian Deng, Anssi Kanervisto, Jeffrey Ling, and Alexander M. Rush. Image-to-markup generation with coarse-to-fine attention. In Proceedings of the 34th International Conference on Machine Learning, pages 980–989, 2017.

[6] Yuntian Deng, Anssi Kanervisto, and Alexander M. Rush. What you get is what you see: A visual markup decompiler. CoRR, abs/1609.04938, 2016.

[7] J. W. Jones, G. Hoogenboom, C. H. Porter, K. J. Boote, W. D. Batchelor, L. A. Hunt, P. W. Wilkens, U. Singh, A. J. Gijsman, and J. T. Ritchie. The DSSAT cropping system model. European Journal of Agronomy, 18(3):235–265, 2003. Special issue on Modelling Cropping Systems: Science, Software and Applications.

[8] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 779–788, 2016.

[9] Marco A. Valenzuela-Escárcega, Özgün Babur, Gus Hahn-Powell, Dane Bell, Thomas Hicks, Enrique Noriega-Atala, Xia Wang, Mihai Surdeanu, Emek Demir, and Clayton T. Morrison. Large-scale automated machine reading discovers new cancer driving mechanisms. Database: The Journal of Biological Databases and Curation, 2018.

[10] Marco A. Valenzuela-Escárcega, Gus Hahn-Powell, and Mihai Surdeanu. Odin's runes: A rule language for information extraction. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016).