AutoMATES: Automated Model Assembly from Text, Equations, and Software
Adarsh Pyarelal, Marco A. Valenzuela-Escarcega, Rebecca Sharp, Paul D. Hein, Jon Stephens, Pratik Bhandari, HeuiChan Lim, Saumya Debray, Clayton T. Morrison
School of Information, University of Arizona, Tucson, AZ
Department of Computer Science, University of Arizona, Tucson, AZ
ml4ai.github.io/automates
Abstract
Models of complicated systems can be represented in different ways: in scientific papers, they are represented using natural language text as well as equations. But to be of real use, they must also be implemented as software, thus making code a third form of representing models. We introduce the AutoMATES project, which aims to build semantically-rich unified representations of models from scientific code and publications to facilitate the integration of computational models from different domains and allow for modeling large, complicated systems that span multiple domains and levels of abstraction.
1 Introduction

There exist today state-of-the-art computational models that can provide highly accurate predictions about complex phenomena such as crop growth and weather patterns. However, certain phenomena, such as food insecurity, involve a host of factors that cannot be modeled by any single one of these models, but which instead require the integration of multiple models.

To truly integrate these computational models, it is necessary to 'lift' them to a common representation that is (i) agnostic to the software implementation, (ii) semantically rich enough to represent the implicit domain knowledge in the models, and (iii) connected to the domain literature. The AutoMATES project aims to build technology to construct and curate semantically-rich representations of scientific models by integrating three different sources of information:

• natural language descriptions of models in publications and other technical documentation,
• the equations contained in these documents, and
• the software that implements these models.

An example of a model represented in these three forms (text, equations, and software) is shown in Figure 1. This model is a differential equation describing a biophysical variable, the leaf area index (LAI). The network on the right half of the figure is an aspirational representation of the model as a Bayesian network. Although this example is hand-crafted, our end goal is to be able to automatically assemble models with this level of semantic richness. In this paper, we describe our high-level approach and present our latest results. For more technical details, please visit ml4ai.github.io/automates/documentation/deliverable_reports.

Pyarelal et al., Modeling the World's Systems, 2019

Figure 1: Integration of text, equations, and code into a semantically-enriched Bayesian network.

Significance:
This work will dramatically advance the state of the art in automated model curation and integration, enabling scientists and analysts to understand complex mechanisms that span multiple domains. By exposing the implicit domain knowledge baked into computational models, this effort will enable semantically rich automated model composition and reasoning in context, at scale.
2 Architecture

The AutoMATES system is designed to extract information from several knowledge sources, link the extracted concepts into a unified model representation, compare models based on different features, and augment the models with supplementary information. Each of these components is implemented independently, but designed to interoperate with the others. A high-level view of the system's architecture is shown in Figure 2. This modular architecture provides AutoMATES with the extensibility required to support different knowledge sources in the future as it is extended to handle other domains. Briefly, the four main components of AutoMATES are:

1. Extracting model information from different aspects of source code and corresponding scientific publications and technical documents. From source code, we extract the model from the code implementation itself (Section 3) as well as any supplementary information such as descriptions expressed in comments. From scientific publications and documentation, AutoMATES reads model information from text (Section 5) and equations (Section 4).

2. Grounding the extracted information by identifying when the same concept is expressed in different knowledge sources, and linking them together to form a unified, programming-language-agnostic intermediary model representation: a Grounded Function Network (GrFN).

3. Comparing models, using the GrFN representation, by analyzing structural and functional (via sensitivity analysis) similarities and differences (Section 6).

4. Augmenting models through selection of model components appropriate for a task, composing model components, generating model descriptions in context to augment existing documentation, and model execution.

Figure 2: Architecture overview
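To make the GrFN idea concrete, the sketch below models a tiny network of grounded variables and function nodes. Everything here is our own invention for illustration: the field names, the evaluation scheme, and the toy LAI update rule are hypothetical stand-ins, not the project's actual GrFN schema.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Variable:
    """A grounded variable: links a code identifier to a text description."""
    name: str          # identifier as it appears in source code
    description: str   # description recovered from comments or papers

@dataclass
class Function:
    """A function node: computes one output variable from input variables."""
    inputs: List[str]
    output: str
    fn: Callable[..., float]

@dataclass
class GrFN:
    variables: Dict[str, Variable] = field(default_factory=dict)
    functions: List[Function] = field(default_factory=list)

    def run(self, values: Dict[str, float]) -> Dict[str, float]:
        """Evaluate function nodes in order, filling in variable values."""
        state = dict(values)
        for f in self.functions:
            state[f.output] = f.fn(*(state[v] for v in f.inputs))
        return state

# Toy network: LAI grows by a rate scaled by a stress factor (made up).
g = GrFN()
g.variables = {
    "lai":   Variable("lai", "leaf area index"),
    "rate":  Variable("rate", "daily growth rate"),
    "swfac": Variable("swfac", "soil water stress factor"),
}
g.functions = [Function(["lai", "rate", "swfac"], "lai",
                        lambda lai, rate, swfac: lai + rate * swfac)]
print(g.run({"lai": 1.0, "rate": 0.1, "swfac": 0.5})["lai"])  # 1.05
```

The point of the structure is that each variable node carries its grounding (a text description) alongside its role in the executable computation, so the same object can serve comparison, documentation, and execution.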
3 Program Analysis

Our program analysis approach to extracting model information from the source code implementation begins with for2py, a front-end translator that maps Fortran source programs to a language-independent program analysis intermediate representation (PAIR) that is then used to generate files used as input to subsequent analysis. This design decouples input processing from output generation, and is motivated by the following:

1. Performance and scalability. Modules that are referenced by multiple program components do not have to be reanalyzed separately for each referencing component. Independent source-language modules can, in principle, be analyzed concurrently.

2. Support for source-language heterogeneity. This design makes it possible, in principle, to support programs with different components written in different languages. It also allows us to reason about models implemented in different source languages.

3. Independence of back-end tasks. Different back-end analysis tasks, e.g., sensitivity analysis and comment analysis, can be carried out independently (and, if necessary, concurrently) on the PAIR.

for2py currently handles a significant subset of Fortran, including: data types such as scalars and arrays; control constructs such as conditionals, loops, functions, and subroutines; and input/output (I/O) primitives including formatted and list-directed I/O. We expect to soon complete the handling of modules and derived types.

A fundamental challenge we have to address is scalability, since software implementing sophisticated scientific models can encompass thousands of source files and hundreds of thousands of lines of code. We address this by performing analysis at the module level of granularity. Given the source code for a scientific model, we analyze its modules to identify define-use relationships between them and construct a module dependency graph that captures these dependencies. We use a topological sort of this graph to guide the subsequent analysis of the modules. The module dependency graph imposes a partial order on the modules of the analyzed system, indicating which modules are independent of each other and can therefore be analyzed in parallel. This ordering has three significant implications for scaling. First, it allows modern computer systems such as multi-core processors and cloud-based systems to be utilized effectively. Second, it provides the user a straightforward tunable tradeoff between computational resources and analysis efficiency. Finally, it means that, with sufficient parallelism, the cost of analyzing a software system is proportional to the depth of its module dependency graph rather than its total size (number of nodes), resulting in sublinear asymptotic complexity.
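The module-level scheduling described above can be sketched in a few lines: group the modules into "levels" of a topological order, where every module depends only on modules in earlier levels, so each level can be analyzed in parallel and the wall-clock cost tracks the number of levels (the graph's depth). This is a minimal illustration using Python's standard library; the module names are invented, and this is not for2py's actual scheduler.

```python
from graphlib import TopologicalSorter

# Hypothetical module dependency graph: module -> modules it depends on.
deps = {
    "io_utils": set(),
    "soil":     {"io_utils"},
    "weather":  {"io_utils"},
    "crop":     {"soil", "weather"},
}

ts = TopologicalSorter(deps)
ts.prepare()
levels = []
while ts.is_active():
    ready = list(ts.get_ready())   # modules whose dependencies are done
    levels.append(sorted(ready))   # this whole batch can run in parallel
    ts.done(*ready)

print(levels)  # [['io_utils'], ['soil', 'weather'], ['crop']]
```

Here the graph has four modules but only three levels, so with two workers the second level ('soil' and 'weather') is analyzed in one parallel step.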
4 Equation Reading

Models are often represented concisely as equations, at a level of abstraction that can supplement both the natural language description and the source code implementation. For humans to compare the equations and source code for several models, as is done in [2], is time-consuming and expensive. Accordingly, we are developing an automated approach that identifies the relevant equations in text and rendered images of documents (PDFs treated as images) associated with scientific models, parses them into an intermediate symbolic mathematical representation, and grounds the variables in the equations to text descriptions and source code variables.

Non-textual elements in PDFs have previously been identified using heuristics based on document structure [4] or statistical learning [3, 1]. Here, taking advantage of advances in deep learning [8], we identify the location of the bounding box surrounding the equations using machine vision techniques. After identifying the location of the equations in the PDF, the next step in the pipeline is to parse the rendered equation into an intermediate representation. We choose to use LaTeX because we have the LaTeX source code for each of the training examples, and also because LaTeX preserves all of the typographic information (e.g., boldface, subscript, etc.), which conveys variable semantics. We decompile the image using an encoder-decoder system that encodes the image of the equation through a series of convolutions and produces LaTeX commands that generate the image [6, 5]. We are currently evaluating this process on a held-out subset of the data from arXiv. This decoded LaTeX representation will then be parsed into Python code (by extending the coverage of an open-source rule-based system, latex2sympy (https://github.com/augustt198/latex2sympy), to equation elements frequently found in the domain) and then converted to a GrFN representation.
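To make the LaTeX-to-code step concrete, here is a deliberately tiny, hand-rolled rewrite for a single construct (\frac). It stands in for the kind of rule-based translation latex2sympy performs; it covers only this one case and is not the project's parser.

```python
import re

def frac_to_python(latex: str) -> str:
    """Rewrite \\frac{a}{b} as ((a)/(b)); innermost fractions first,
    so simple nested fractions are also handled."""
    pattern = re.compile(r"\\frac\{([^{}]+)\}\{([^{}]+)\}")
    while pattern.search(latex):
        latex = pattern.sub(r"((\1)/(\2))", latex)
    return latex

print(frac_to_python(r"\frac{dLAI}{dt}"))  # ((dLAI)/(dt))

# The resulting string can be evaluated against variable bindings:
print(eval(frac_to_python(r"\frac{rate}{span}"), {"rate": 3.0, "span": 2.0}))  # 1.5
```

A real translation layer would of course build a syntax tree rather than rewrite strings, but the example shows why LaTeX is a workable intermediate form: its structure maps mechanically onto executable expressions.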
To ground the GrFN representation extracted from the equations, we locate text that references the equation, using the equation identifier when available and the lexical content when it is not.

5 Text Reading

We are developing a framework for reading and extracting model information from the scientific papers that directly describe the computational models (e.g., DSSAT, SWAP) whose source code we analyze (Section 3). Scientific papers are typically available as PDFs, which need to be preprocessed into a format that a machine reader can use. We make use of Science Parse (https://github.com/allenai/science-parse), an open-source tool that segments the sections based on the paper layout and typography.

Our framework then implements an open-domain information extraction system based on Eidos (https://github.com/clulab/eidos), a machine reading system designed to extract causal relations. At its core, Eidos has a grammar of rules [10, 9] that model linguistic patterns commonly used by authors to express causality in text. Here, where we are interested in gathering context about the models implemented in source code, causal relations are useful, but not sufficient. We have modified Eidos to extract mentions of model variables and their descriptions. Additionally, it will be critical to read for background assumptions (e.g., model preconditions) and additional contextual information that could inform the setting of parameters (using quantities and units identified by grobid-quantities, https://github.com/kermitt2/grobid-quantities).

Figure 3: Example of variables (represented as concepts, definitions, and value assignments) extracted from scientific text as a result of the machine reading pipeline.

Figure 4: Results of comparing the Priestley-Taylor (PT) and ASCE models. Blue nodes represent variables shared between PT and ASCE. Black nodes represent variables that are not shared but lie along directed paths between shared variables. Green nodes in the ASCE model represent variables whose states directly affect shared directed paths; if controlled, this isolates the portions of ASCE that overlap with PT. Finally, orange nodes represent variables in the ASCE model that can be isolated from the overlap in the comparison.

Figure 5: Screenshot of the AutoMATES CodeExplorer (available at http://vanga.sista.arizona.edu/automates), showing the translation of the Priestley-Taylor method for calculating potential evapotranspiration (a submodule in DSSAT [7]) into a computation graph.
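A rough flavor of one variable-mention pattern can be given with a single toy regex. This is not Eidos's grammar (Eidos rules are written in the Odin rule language); it is our own stand-in, keyed on the determiner "the" and the common "description (VARIABLE)" convention, and it would miss most real phrasings.

```python
import re

# Toy stand-in for a variable-mention rule: "the <short noun phrase> (NAME)",
# e.g. "the leaf area index (LAI)". The phrase is capped at four words.
PATTERN = re.compile(r"\bthe ([a-z]+(?: [a-z]+){0,3}) \(([A-Z][A-Za-z0-9]*)\)")

text = ("We model the leaf area index (LAI) using "
        "the soil water stress factor (SWFAC).")

mentions = {var: desc for desc, var in PATTERN.findall(text)}
print(mentions)  # {'LAI': 'leaf area index', 'SWFAC': 'soil water stress factor'}
```

A grammar-based system generalizes this by matching over syntactic structure rather than raw strings, which is what makes the extracted descriptions reliable enough to use for grounding.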
The assign nodes are annotated with the automatically extracted LaTeX-typeset representation of the equation extracted from the code, which will facilitate linking with scientific publications. Additionally, the variable nodes are automatically aligned with descriptions extracted from code comments and scientific texts. The extracted variables and their mentions must then be aligned with the variables read from source code (Section 3) and equations (Section 4) to find and resolve commonalities and discrepancies in different representations of the same model. In Figure 3, we show a screenshot of results from the current text reading pipeline.
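One simple baseline for this alignment step, shown purely for illustration (it is not the project's grounding algorithm, and the identifiers are invented), scores candidate (code variable, text mention) pairs by string similarity and keeps the best match per variable:

```python
from difflib import SequenceMatcher

# Hypothetical inputs: identifiers from code, variable names from text mentions.
code_vars = ["swfac", "lai", "tmax"]
text_names = ["SWFAC", "LAI", "TMAX"]

def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Align each code variable to the text name that matches it best.
alignment = {
    v: max(text_names, key=lambda name: similarity(v, name))
    for v in code_vars
}
print(alignment)  # {'swfac': 'SWFAC', 'lai': 'LAI', 'tmax': 'TMAX'}
```

Real alignment has to handle cases string similarity cannot (the same quantity under different names, or the same name meaning different things in two models), which is why the grounded descriptions matter.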
6 Model Analysis

Model comparison and eventual augmentation are enabled by our model analysis pipeline, which identifies which portions of two or more models share the same or similar computations over similar variables, and which components differ. This analysis is enabled by the unified Grounded Function Network (GrFN) representation: we first identify shared variables and then analyze the GrFN topology to identify differences in how variable states are set. Figure 4 shows an example of comparing the PT and ASCE evapotranspiration models from the DSSAT crop modeling system [7]. Sensitivity analysis is then used to analyze the functional relationships between the variables. Because sensitivity analysis can be computationally expensive, we are developing methods that use automatic code differentiation to efficiently compute the derivatives of variables with respect to each other, and Bayesian optimization techniques to estimate sensitivity functions with as few samples as possible. The final product of this analysis (a) includes modular executable representations of grounded models (as dynamic Bayesian networks), (b) provides results of model comparison to enable model choice in tasks, and (c) based on grounded model similarities and differences, enables model composition. In Figure 6, we show some initial results from automated sensitivity analysis.

Figure 6: Initial results of automated sensitivity analysis. The pair of variables that the Priestley-Taylor model of evapotranspiration is most sensitive to has been automatically identified given bounds information for the input variables, and a surface plot has been generated that shows the effect of varying that pair of variables (maximum temperature and solar radiation) on the output variable (potential evapotranspiration).
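As a minimal illustration of the underlying idea, the sketch below ranks inputs by one-at-a-time finite-difference sensitivity. The model function, its coefficients, and the bounds are all made up for this example; this is neither DSSAT's Priestley-Taylor code nor the automatic-differentiation and Bayesian-optimization machinery described above.

```python
def model(tmax: float, srad: float, tmin: float) -> float:
    """Hypothetical stand-in for an evapotranspiration model."""
    return 0.01 * srad * (0.6 * tmax + 0.4 * tmin + 29.0)

# Input bounds, as would be recovered from documentation.
bounds = {"tmax": (20.0, 40.0), "srad": (5.0, 25.0), "tmin": (5.0, 20.0)}
center = {k: (lo + hi) / 2 for k, (lo, hi) in bounds.items()}

def sensitivity(name: str, h: float = 1e-4) -> float:
    """Central-difference partial derivative at the center of the bounds,
    scaled by the input's range so sensitivities are comparable."""
    up, down = dict(center), dict(center)
    up[name] += h
    down[name] -= h
    deriv = (model(**up) - model(**down)) / (2 * h)
    return abs(deriv) * (bounds[name][1] - bounds[name][0])

ranked = sorted(bounds, key=sensitivity, reverse=True)
print(ranked)  # ['srad', 'tmax', 'tmin']
```

Scaling each derivative by the input's documented range is what lets the pipeline compare sensitivities across variables with very different units, and hence pick the pair of inputs worth plotting, as in Figure 6.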
7 Conclusion

Systems of interest for scientific, humanitarian, and security reasons often require the integration of computational models from multiple domains; for example, modeling food security in a region requires the use of computational crop, weather, and hydrology models, to name but a few. However, this integration currently requires significant manual effort in the form of exposing and curating interfaces to the computational models. The framework we are developing will greatly speed up this curation and integration process, making it possible to effectively model large, complicated systems and reason about them at multiple levels of abstraction.
The system described here is open-source and publicly available at github.com/ml4ai/automates and github.com/ml4ai/delphi. We have also set up a public web app, CodeExplorer (see screenshot in Figure 5), which shows off a subset of the functionality of the AutoMATES system, and is live at vanga.sista.arizona.edu/automates.

This work is supported by the Defense Advanced Research Projects Agency (DARPA) as part of the Automated Scientific Knowledge Extraction (ASKE) program under agreement number HR00111990011.
References

[1] Jacob Robert Bruce. Mathematical expression detection and segmentation in document images. Master's thesis, Virginia Tech, 2014.

[2] G. G. T. Camargo and A. R. Kemanian. Six crop models differ in their simulation of water uptake. Agricultural and Forest Meteorology, 220:116–129, 2016.

[3] Wei-Ta Chu and Fan Liu. Mathematical formula detection in heterogeneous document images. pages 140–145, 2013.

[4] Christopher Clark and Santosh Divvala. PDFFigures 2.0: Mining figures from research papers. 2016.

[5] Yuntian Deng, Anssi Kanervisto, Jeffrey Ling, and Alexander M. Rush. Image-to-markup generation with coarse-to-fine attention. In Proceedings of the 34th International Conference on Machine Learning, pages 980–989, 2017.

[6] Yuntian Deng, Anssi Kanervisto, and Alexander M. Rush. What you get is what you see: A visual markup decompiler. CoRR, abs/1609.04938, 2016.

[7] J. W. Jones, G. Hoogenboom, C. H. Porter, K. J. Boote, W. D. Batchelor, L. A. Hunt, P. W. Wilkens, U. Singh, A. J. Gijsman, and J. T. Ritchie. The DSSAT cropping system model. European Journal of Agronomy, 18(3):235–265, 2003. Special issue on Modelling Cropping Systems: Science, Software and Applications.

[8] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 779–788, 2016.

[9] Marco A. Valenzuela-Escárcega, Özgün Babur, Gus Hahn-Powell, Dane Bell, Thomas Hicks, Enrique Noriega-Atala, Xia Wang, Mihai Surdeanu, Emek Demir, and Clayton T. Morrison. Large-scale automated machine reading discovers new cancer driving mechanisms. Database: The Journal of Biological Databases and Curation, 2018.

[10] Marco A. Valenzuela-Escárcega, Gus Hahn-Powell, and Mihai Surdeanu. Odin's runes: A rule language for information extraction. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016).