PyAutoFit: A Classy Probabilistic Programming Language for Model Composition and Fitting
James W. Nightingale, Richard G. Hayes, and Matthew Griffiths
Institute for Computational Cosmology, Stockton Rd, Durham, United Kingdom, DH1 3LE
ConcR Ltd, London, UK
DOI: 10.21105/joss.02550
Editor: Dan Foreman-Mackey
Reviewers: @arm61, @karllark
Submitted: 24 July 2020
Published: 05 February 2021
License: Authors of papers retain copyright and release the work under a Creative Commons Attribution 4.0 International License (CC BY 4.0).
Summary
A major trend in academia and data science is the rapid adoption of Bayesian statistics for data analysis and modeling, leading to the development of probabilistic programming languages (PPLs). A PPL provides a framework that allows users to easily specify a probabilistic model and perform inference automatically. PyAutoFit is a Python-based PPL which interfaces with all aspects of the modeling (e.g., the model, data, fitting procedure, visualization, results) and therefore provides complete management of every aspect of modeling. This includes composing high-dimensionality models from individual model components, customizing the fitting procedure and performing data augmentation before a model-fit. Advanced features include database tools for analysing large suites of modeling results and exploiting domain-specific knowledge of a problem via non-linear search chaining. Accompanying PyAutoFit is the autofit workspace, which includes example scripts and the HowToFit lecture series, which introduces non-experts to model-fitting and provides a guide on how to begin a project using PyAutoFit. Readers can try PyAutoFit right now by going to the introduction Jupyter notebook on Binder, or check out our readthedocs for a complete overview of PyAutoFit’s features.
Background of Probabilistic Programming
Probabilistic programming languages (PPLs) have enabled contemporary statistical inference techniques to be applied to a diverse range of problems across academia and industry. Packages such as PyMC3 (Salvatier et al., 2016), Pyro (Bingham et al., 2019) and STAN (Carpenter et al., 2017) offer general-purpose frameworks where users can specify a generative model and fit it to data using a variety of non-linear fitting techniques. Each package is specialized to problems of a certain nature, with many focused on problems like generalized linear modeling or determining the distribution(s) from which the data was drawn. For these problems the model is typically composed of the equations and distributions that are fitted to the data, which are easily expressed syntactically such that the PPL API offers an expressive way to define the model and extensions can be implemented in an intuitive and straightforward way.
Statement of Need
PyAutoFit is a PPL whose core design is providing a direct interface with the model, data, fitting procedure and results, allowing it to provide comprehensive management of many different aspects of model-fitting. PyAutoFit began as an Astronomy project for fitting large imaging datasets of galaxies, after the developers found that existing PPLs were not suited to
Nightingale et al., (2021). PyAutoFit: A Classy Probabilistic Programming Language for Model Composition and Fitting. Journal of Open Source Software, 6(58), 2550. https://doi.org/10.21105/joss.02550
PyAutoFit, making it suitable to a broader range of model-fitting problems.
Software Description
To compose a model with PyAutoFit, model components are written as Python classes, allowing PyAutoFit to define the model and associated parameters in an expressive way that is tied to the modeling software’s API. A model fit then requires that a PyAutoFit Analysis class is written, which combines the data, model and likelihood function and defines how the model-fit is performed using a NonLinearSearch. The NonLinearSearch procedure is defined using an external inference library such as dynesty (Speagle, 2020), emcee (Foreman-Mackey et al., 2013) or PySwarms (Miranda, 2018).
The Analysis class provides a model-specific interface between PyAutoFit and the modeling software, allowing it to handle the ‘heavy lifting’ that comes with writing model-fitting software. This includes interfacing with the non-linear search, outputting results in a structured path format and model-specific visualization during and after the non-linear search. Results are output in a database structure that allows the Aggregator tool to load results post-analysis via a Python script or Jupyter notebook. This includes methods for summarizing the results of every fit, filtering results to inspect subsets of model fits and visualizing results. Results are loaded as Python generators, ensuring the Aggregator can be used to interpret large files in a memory-efficient way. PyAutoFit is therefore suited to ‘big data’ problems where independent fits to large homogeneous data-sets using an identical model-fitting procedure are performed.
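The division of labour described in this section (a model component written as a Python class, an Analysis class holding the data and likelihood function, and a non-linear search that drives the fit) can be illustrated with a minimal, library-free sketch. It does not use PyAutoFit's API: the `Gaussian` component, this plain `Analysis` class, the method names `model_data_from` and `log_likelihood_function`, and the toy `random_search` standing in for a NonLinearSearch are all hypothetical names chosen for illustration.

```python
import math
import random


class Gaussian:
    """Model component: a 1D Gaussian whose constructor arguments are the free parameters."""

    def __init__(self, centre, normalization, sigma):
        self.centre = centre
        self.normalization = normalization
        self.sigma = sigma

    def model_data_from(self, xvalues):
        # Evaluate the Gaussian profile at every x coordinate.
        scale = self.normalization / (self.sigma * math.sqrt(2.0 * math.pi))
        return [scale * math.exp(-0.5 * ((x - self.centre) / self.sigma) ** 2) for x in xvalues]


class Analysis:
    """Hypothetical stand-in for an Analysis class: holds the data and defines the likelihood."""

    def __init__(self, xvalues, data, noise_map):
        self.xvalues = xvalues
        self.data = data
        self.noise_map = noise_map

    def log_likelihood_function(self, instance):
        # Gaussian log likelihood up to a constant: -0.5 * chi-squared.
        model_data = instance.model_data_from(self.xvalues)
        chi_squared = sum(
            ((d - m) / n) ** 2 for d, m, n in zip(self.data, model_data, self.noise_map)
        )
        return -0.5 * chi_squared


def random_search(analysis, component_cls, limits, n_samples=2000, seed=1):
    """Toy stand-in for a non-linear search: sample uniformly within limits, keep the best."""
    rng = random.Random(seed)
    best_instance, best_log_likelihood = None, -math.inf
    for _ in range(n_samples):
        params = {name: rng.uniform(lower, upper) for name, (lower, upper) in limits.items()}
        instance = component_cls(**params)
        log_likelihood = analysis.log_likelihood_function(instance)
        if log_likelihood > best_log_likelihood:
            best_instance, best_log_likelihood = instance, log_likelihood
    return best_instance, best_log_likelihood


# Simulate noise-free data from a known model so the example is reproducible.
xvalues = [float(i) for i in range(100)]
truth = Gaussian(centre=50.0, normalization=25.0, sigma=10.0)
data = truth.model_data_from(xvalues)
noise_map = [0.05] * len(xvalues)

analysis = Analysis(xvalues, data, noise_map)
limits = {"centre": (0.0, 100.0), "normalization": (0.0, 50.0), "sigma": (1.0, 30.0)}
best, best_ll = random_search(analysis, Gaussian, limits)
print(best.centre, best.normalization, best.sigma, best_ll)
```

In this sketch the search only ever calls `log_likelihood_function`, so swapping the toy sampler for a nested sampler or MCMC library would leave the model component and the analysis untouched.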
Model Abstraction and Composition
For many modeling problems the model comprises abstract model components representing objects or processes in a physical system. For example, in galaxy morphology studies in astrophysics, model components represent the light profile of stars (Häußler et al., 2013; Nightingale et al., 2019). For these problems the likelihood function is typically a sequence of numerical processes (e.g., convolutions, Fourier transforms, linear algebra) and extensions to the model often require the addition of new model components in a way that is non-trivially included in the fitting process and likelihood function. Existing PPLs have tools for these problems, for example ‘black-box’ likelihood functions in PyMC3. However, these solutions decouple model composition from the data and fitting procedure, making the model less expressive, restricting model customization and reducing flexibility in how the model-fit is performed.
By writing model components as Python classes, the model and its associated parameters are defined in an expressive way that is tied to the modeling software’s API. Model composition with PyAutoFit allows complex models to be built from these individual components, abstracting from the user the details of how they change the model-fitting procedure. Models can be fully customized, allowing adjustment of individual parameter priors, the fixing or coupling of parameters between model components and removing regions of parameter space via parameter assertions. Adding new model components to a PyAutoFit project is straightforward, whereby adding a new Python class means it works within the entire modeling framework.
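The customizations listed above (adjusting a prior, fixing a parameter, coupling parameters between components, and parameter assertions) can be sketched in plain Python. This is an illustration of the idea under stated assumptions, not PyAutoFit's implementation: the `UniformPrior`, `Model` and `Collection` classes below are hypothetical stand-ins written for this example, and the two components carry no physics.

```python
import inspect
import random


class Gaussian:
    def __init__(self, centre, normalization, sigma):
        self.centre, self.normalization, self.sigma = centre, normalization, sigma


class Exponential:
    def __init__(self, centre, normalization, rate):
        self.centre, self.normalization, self.rate = centre, normalization, rate


class UniformPrior:
    def __init__(self, lower_limit, upper_limit):
        self.lower_limit, self.upper_limit = lower_limit, upper_limit

    def sample(self, rng):
        return rng.uniform(self.lower_limit, self.upper_limit)


class Model:
    """Hypothetical wrapper: every constructor argument of `cls` becomes a free parameter."""

    def __init__(self, cls):
        self.cls = cls
        names = list(inspect.signature(cls).parameters)
        self.priors = {name: UniformPrior(0.0, 1.0) for name in names}

    def instance_from(self, rng, drawn):
        values = {}
        for name, prior in self.priors.items():
            if isinstance(prior, UniformPrior):
                # A prior object shared by several parameters is drawn once,
                # which is what couples those parameters together.
                if id(prior) not in drawn:
                    drawn[id(prior)] = prior.sample(rng)
                values[name] = drawn[id(prior)]
            else:
                values[name] = prior  # a plain number means the parameter is fixed
        return self.cls(**values)


class Collection:
    """Hypothetical container composing several components into one model."""

    def __init__(self, **models):
        self.models = models
        self.assertions = []

    def random_instance(self, rng, max_tries=1000):
        for _ in range(max_tries):
            drawn = {}
            instance = {name: m.instance_from(rng, drawn) for name, m in self.models.items()}
            # Reject samples in regions of parameter space excluded by an assertion.
            if all(check(instance) for check in self.assertions):
                return instance
        raise RuntimeError("no sample satisfied the assertions")


gaussian = Model(Gaussian)
exponential = Model(Exponential)
gaussian.priors["centre"] = UniformPrior(0.0, 100.0)       # adjust an individual prior
gaussian.priors["sigma"] = 10.0                            # fix a parameter
exponential.priors["centre"] = gaussian.priors["centre"]   # couple across components

model = Collection(gaussian=gaussian, exponential=exponential)
model.assertions.append(
    lambda inst: inst["gaussian"].normalization > inst["exponential"].normalization
)

instance = model.random_instance(random.Random(0))
print(instance["gaussian"].centre, instance["exponential"].centre)
```

Because the parameter names are read from the class constructor, adding a new component in this sketch requires only a new class, which mirrors the claim made in the text.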
PyAutoFit is therefore ideal for problems where there is a desire to compose, fit and compare many similar (but slightly different) models to a single dataset, with the Aggregator including tools to facilitate this.
For many model fitting problems, domain-specific knowledge of the model can be exploited to speed up the non-linear search and ensure it locates the global maximum likelihood solution. For example, initial fits can be performed using simplified model parameterizations, augmented datasets and faster non-linear fitting techniques. Through experience users may know that certain model components share minimal covariance, meaning that separate fits to each model component (in parameter spaces of reduced dimensionality) can be performed before fitting them simultaneously. The results of these simplified fits can then be used to initialize fits using a higher dimensionality model. Breaking down a model-fit in this way uses PyAutoFit’s non-linear search chaining, which granularizes the non-linear fitting procedure into a series of linked non-linear searches. Initial model-fits are followed by fits that gradually increase the model complexity, using the information gained throughout the pipeline to guide each NonLinearSearch and thus enable accurate fitting of models of arbitrary complexity.
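The idea of chaining searches can be sketched with a two-stage toy fit. The first search uses a simplified parameterization (one parameter fixed) and a thinned copy of the data (a simple stand-in for an augmented dataset); its result then narrows the parameter ranges of a second search that fits the full model. Everything here is an assumption made for illustration: the `random_search` function is a toy sampler, not a PyAutoFit NonLinearSearch, and the choice of fixed value, thinning factor and range widths is arbitrary.

```python
import math
import random


def gaussian(x, centre, normalization, sigma):
    scale = normalization / (sigma * math.sqrt(2.0 * math.pi))
    return scale * math.exp(-0.5 * ((x - centre) / sigma) ** 2)


def make_log_likelihood(xvalues, data, noise):
    def log_likelihood(params):
        return -0.5 * sum(
            ((d - gaussian(x, **params)) / noise) ** 2 for x, d in zip(xvalues, data)
        )
    return log_likelihood


def random_search(log_likelihood, limits, n_samples, seed):
    """Toy search: a tuple in `limits` is a uniform range, a float is a fixed value."""
    rng = random.Random(seed)
    best_params, best_ll = None, -math.inf
    for _ in range(n_samples):
        params = {
            name: rng.uniform(*lim) if isinstance(lim, tuple) else lim
            for name, lim in limits.items()
        }
        ll = log_likelihood(params)
        if ll > best_ll:
            best_params, best_ll = params, ll
    return best_params, best_ll


# Noise-free simulated data keeps the example reproducible.
xvalues = [float(i) for i in range(100)]
data = [gaussian(x, centre=50.0, normalization=25.0, sigma=10.0) for x in xvalues]
noise = 0.05

# Search 1: simplified model (sigma fixed to a guess) fitted to every 4th data point.
limits_1 = {"centre": (0.0, 100.0), "normalization": (0.0, 50.0), "sigma": 8.0}
result_1, ll_1 = random_search(
    make_log_likelihood(xvalues[::4], data[::4], noise), limits_1, n_samples=500, seed=1
)

# Search 2: full model on the full data, with ranges narrowed around the first result.
limits_2 = {
    "centre": (result_1["centre"] - 5.0, result_1["centre"] + 5.0),
    "normalization": (
        max(0.0, result_1["normalization"] - 10.0),
        result_1["normalization"] + 10.0,
    ),
    "sigma": (1.0, 30.0),
}
result_2, ll_2 = random_search(
    make_log_likelihood(xvalues, data, noise), limits_2, n_samples=500, seed=2
)
print(result_1, result_2, ll_2)
```

The second search spends all of its samples in a region the first search has already identified as promising, which is the motivation for breaking a fit into linked stages.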
History
PyAutoFit is a generalization of PyAutoLens, an Astronomy package developed to analyse images of gravitationally lensed galaxies. Modeling gravitational lenses has historically required large amounts of human time and supervision, an approach which does not scale to the incoming samples of 100,000 objects. Domain exploitation enabled full automation of the lens modeling procedure (Nightingale et al., 2018; Nightingale & Dye, 2015), with model customization and the aggregator enabling one to fit large datasets with many different models. More recently, PyAutoFit has been applied to calibrating radiation damage to charge-coupled imaging devices and a model of cancer tumour growth.
Workspace and HowToFit Tutorials
PyAutoFit is distributed with the autofit workspace, which contains example scripts for composing a model, performing a fit, using the Aggregator and PyAutoFit’s advanced statistical inference methods. Also included are the HowToFit tutorials, a series of Jupyter notebooks aimed at non-experts, introducing them to model-fitting and Bayesian inference. They teach users how to write model-components and Analysis classes in PyAutoFit, use these to fit a dataset and interpret the model-fitting results. The lectures are available on our Binder and may therefore be taken without a local PyAutoFit installation.
Software Citations
PyAutoFit is written in Python 3.6+ (Van Rossum & Drake, 2009) and uses the following software packages:
• corner.py https://github.com/dfm/corner.py (Foreman-Mackey, 2016)
• dynesty https://github.com/joshspeagle/dynesty (Speagle, 2020)
• emcee https://github.com/dfm/emcee (Foreman-Mackey et al., 2013)
• matplotlib https://github.com/matplotlib/matplotlib (Hunter, 2007)
• NumPy https://github.com/numpy/numpy (Harris et al., 2020)
• PyMultiNest https://github.com/JohannesBuchner/PyMultiNest (Feroz et al., 2009; Buchner et al., 2014)
• PySwarms https://github.com/ljvmiranda921/pyswarms (Miranda, 2018)
• SciPy https://github.com/scipy/scipy (Virtanen et al., 2020)
Related Probabilistic Programming Languages
• PyMC3 https://github.com/pymc-devs/pymc3 (Salvatier et al., 2016)
• Pyro https://github.com/pyro-ppl/pyro (Bingham et al., 2019)
• STAN https://github.com/stan-dev/stan (Carpenter et al., 2017)
• TensorFlow Probability https://github.com/tensorflow/probability (Dillon et al., 2017)
• uravu https://github.com/arm61/uravu (McCluskey & Snow, 2020)
Acknowledgements
References
Bingham, E., Chen, J. P., Jankowiak, M., Obermeyer, F., Pradhan, N., Karaletsos, T., Singh, R., Szerlip, P., Horsfall, P., & Goodman, N. D. (2019). Pyro: Deep universal probabilistic programming. Journal of Machine Learning Research, (Xxxx), 0–5. http://arxiv.org/abs/1810.09538
Buchner, J., Georgakakis, A., Nandra, K., Hsu, L., Rangel, C., Brightman, M., Merloni, A., Salvato, M., Donley, J., & Kocevski, D. (2014). X-ray spectral modelling of the AGN obscuring region in the CDFS: Bayesian model selection and catalogue. Astronomy and Astrophysics, A125. https://doi.org/10.1051/0004-6361/201322971
Carpenter, B., Gelman, A., Hoffman, M. D., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M. A., Guo, J., Li, P., & Riddell, A. (2017). Stan: A probabilistic programming language. Journal of Statistical Software, (1). https://doi.org/10.18637/jss.v076.i01
Dillon, J. V., Langmore, I., Tran, D., Brevdo, E., Vasudevan, S., Moore, D., Patton, B., Alemi, A., Hoffman, M., & Saurous, R. A. (2017). TensorFlow Distributions. arXiv e-Prints, arXiv:1711.10604. http://arxiv.org/abs/1711.10604
Feroz, F., Hobson, M. P., & Bridges, M. (2009). MultiNest: An efficient and robust Bayesian inference tool for cosmology and particle physics. Monthly Notices of the Royal Astronomical Society, (4), 1601–1614. https://doi.org/10.1111/j.1365-2966.2009.14548.x
Foreman-Mackey, D. (2016). Corner.py: Scatterplot matrices in python. The Journal of Open Source Software, (2), 24. https://doi.org/10.21105/joss.00024
Foreman-Mackey, D., Hogg, D. W., Lang, D., & Goodman, J. (2013). emcee: The MCMC Hammer. Publications of the Astronomical Society of the Pacific, (925), 306. https://doi.org/10.1086/670067
Harris et al. (2020). Nature, (7825), 357–362. https://doi.org/10.1038/s41586-020-2649-2
Häußler, B., Bamford, S. P., Vika, M., Rojas, A. L., Barden, M., Kelvin, L. S., Alpaslan, M., Robotham, A. S. G., Driver, S. P., Baldry, I. K., Brough, S., Hopkins, A. M., Liske, J., Nichol, R. C., Popescu, C. C., & Tuffs, R. J. (2013). Megamorph - multi-wavelength measurement of galaxy structure: Complete Sérsic profile information from modern surveys. Monthly Notices of the Royal Astronomical Society, (1), 330–369. https://doi.org/10.1093/mnras/sts633
Hunter, J. D. (2007). Matplotlib: A 2D graphics environment. Computing in Science & Engineering, (3), 90–95. https://doi.org/10.1109/MCSE.2007.55
McCluskey, A. R., & Snow, T. (2020). Uravu: Making bayesian modelling easy(er). Journal of Open Source Software, (50), 2214. https://doi.org/10.21105/joss.02214
Miranda, L. J. V. (2018). PySwarms, a research-toolkit for Particle Swarm Optimization in Python. Journal of Open Source Software. https://doi.org/10.21105/joss.00433
Nightingale, J. W., & Dye, S. (2015). Adaptive semi-linear inversion of strong gravitational lens imaging. Monthly Notices of the Royal Astronomical Society, (3), 2940–2959. https://doi.org/10.1093/mnras/stv1455
Nightingale, J. W., Dye, S., & Massey, R. J. (2018). AutoLens: Automated modeling of a strong lens’s light, mass, and source. Monthly Notices of the Royal Astronomical Society, (4), 4738–4784. https://doi.org/10.1093/mnras/sty1264
Nightingale, J. W., Massey, R. J., Harvey, D. R., Cooper, A. P., Etherington, A., Tam, S. I., & Hayes, R. G. (2019). Galaxy structure with strong gravitational lensing: Decomposing the internal mass distribution of massive elliptical galaxies. Monthly Notices of the Royal Astronomical Society, (2), 2049–2068. https://doi.org/10.1093/mnras/stz2220
Salvatier, J., Wiecki, T. V., & Fonnesbeck, C. (2016). Probabilistic programming in Python using PyMC3. PeerJ Computer Science, (4), 1–24. https://doi.org/10.7717/peerj-cs.55
Speagle, J. S. (2020). dynesty: a dynamic nested sampling package for estimating Bayesian posteriors and evidences. Monthly Notices of the Royal Astronomical Society, (3), 3132–3158. https://doi.org/10.1093/mnras/staa278
Van Rossum, G., & Drake, F. L. (2009). Python 3 reference manual. CreateSpace. ISBN: 1441412697
Virtanen, P., Gommers, R., Oliphant, T. E., Haberland, M., Reddy, T., Cournapeau, D., Burovski, E., Peterson, P., Weckesser, W., Bright, J., van der Walt, S. J., Brett, M., Wilson, J., Jarrod Millman, K., Mayorov, N., Nelson, A. R. J., Jones, E., Kern, R., Larson, E., … Contributors, S. 1. 0. (2020). SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods, 261–272. https://doi.org/10.1038/s41592-019-0686-2